Predictive Synthesis of Quantum Materials by Probabilistic Reinforcement Learning

Abstract

Predictive materials synthesis is the primary bottleneck in realizing new functional and quantum materials. Strategies for synthesis of promising materials are currently identified by time-consuming trial-and-error approaches and there are no known predictive schemes to design synthesis parameters for new materials. We use reinforcement learning to predict optimal synthesis schedules, i.e. a time-sequence of reaction conditions like temperatures and reactant concentrations, for the synthesis of a prototypical quantum material, semiconducting monolayer MoS₂, using chemical vapor deposition (CVD). The predictive reinforcement learning agent is coupled to a deep generative model to capture the crystallinity and phase-composition of synthesized MoS₂ during CVD synthesis as a function of time-dependent synthesis conditions. This model, trained on 10000 computational synthesis simulations, successfully learned threshold temperatures and chemical potentials for the onset of chemical reactions and predicted new synthesis schedules for producing well-sulfidized crystalline and phase-pure MoS₂, which were validated by computational synthesis simulations. The model can be extended to predict profiles for synthesis of complex structures including multi-phase heterostructures and can also predict long-time behavior of reacting systems, far beyond the domain of the molecular dynamics (MD) simulations used to train the model, making these predictions directly relevant to experimental synthesis.


Pankaj Rajak *, Aravind Krishnamoorthy *, Ankit Mishra, Rajiv Kalia, Aiichiro Nakano and Priya Vashishta

Argonne Leadership Computing Facility, Argonne National Laboratory, Argonne, Illinois 60439, United States
Collaboratory for Advanced Computing and Simulations, Department of Physics & Astronomy, Department of Computer Science, Department of Chemical Engineering & Materials Science, University of Southern California, Los Angeles, California 90089-0242, United States

* equal contribution
† Email: [email protected]

Introduction

Rapid development of technology based on new and advanced materials requires us to considerably shorten the existing ~20-year materials development timeline [1]. This long timeline results both from the empirical discovery of promising materials and from the trial-and-error approach to identifying scalable synthesis routes for these material candidates. Over the last decade, we have made considerable progress in addressing the first of these challenges through data-driven materials science to perform large-scale materials screening for new properties. The exponential growth in available computing power and the increased efficiency of ab initio and machine learning (ML) driven materials simulation software have enabled high-throughput simulations of several tens of thousands of materials from multiple material classes [2]. These high-throughput simulations and the resulting rich databases are increasingly being mined and analyzed using emerging ML techniques to identify promising material compositions and phases [3-6]. These strategies have been successfully employed to identify new ultrahard materials, ternary nitride compositions, battery materials, polymers [7], organic solar cells [8], OLEDs [9], thermoelectrics, etc. [10-12]. This identification of new materials is only one piece necessary towards the goal of reducing time to deployment of new materials [13]. An equally important component in this paradigm is the corresponding ability to synthesize these promising materials and compositions. However, techniques for experimental synthesis of materials have not kept pace with advances in computational materials screening [13, 14]. As a result, materials synthesis is largely dominated by individual groups that can identify synthesis strategies for new materials based on empirical insights and materials intuition. There are several attempted strategies to identify and optimize new synthesis routes prior to actual synthesis.
The first strategy, common in chemical and biological synthesis of small molecules, uses high-throughput experimental synthesis to screen for optimal synthesis precursors for chemical synthesis of small molecules [15-18]. The effectiveness of such strategies is limited since an exhaustive search of synthesis strategies is prohibitively expensive and inefficient in regard to time and reagents, whereas a narrow search scheme that varies only a single synthesis parameter at a time will likely miss several promising synthesis strategies. In contrast to the relatively widespread use of automated algorithms to optimize chemical reactions of molecular and organic systems [19], synthesis planning for bulk inorganic materials is still in its infancy [20, 21]. Non-solution-based synthesis of quantum materials involves more complicated time-correlations between synthesis parameters, which are not amenable to experimental high-throughput synthesis. This also requires considerably more refined models than previous efforts which only considered the combination of reactants to predict the outcome of chemical reactions [22, 23]. Therefore, there are efforts to perform text-mining on published synthesis profiles from the literature, including common solvent concentrations, heating temperatures, processing times, and precursors used to understand common rules-of-thumb and identify new synthesis schedules for new materials [24-26]. However, even these upcoming ML techniques are limited by scarcity of data in terms of existing schedules and synthesized materials and therefore their extension to new, potentially unknown materials is problematic [25]. Finally, the identification of a synthesis schedule is the optimization of a time sequence of multiple synthesis parameters, which requires the analysis of a new class of ML techniques. 
This problem is well-suited for Reinforcement Learning (RL), a branch of machine learning in which the goal of the RL agent is to design an optimal policy for problems that involve sequential decision making in an environment consisting of thousands of tunable parameters and a huge search space [27, 28]. Because RL is flexible and can handle complex tasks involving non-trivial decision making and planning under uncertainties imposed by the surrounding environment, it has been used in robotics, in self-driving cars and, in the materials science domain, for problems such as designing drug molecules with desired properties, predicting reaction pathways and constructing optimal conditions for chemical reactions [15, 29-32]. In this work, we describe a reinforcement learning model to optimize synthesis routes for a prototypical member of the family of 2D quantum materials, MoS₂, via Chemical Vapor Deposition (CVD). CVD, a popular scalable technique for the synthesis of 2D materials [33], has numerous time-dependent parameters such as temperature, flow rates, concentration of gaseous reactants, and type of reaction precursors, dopants and substrates (together referred to as the synthesis profile) that need to be optimized for the synthesis of new materials. Recent computational studies have identified several mechanistic details about the synthesis process [34, 35], but there are no comprehensive rules for designing synthesis strategies for a given material. We use RL specifically to (1) identify synthesis profiles that result in material structures that optimize a desired property (in our case, the phase fraction of the semiconducting crystalline phase of MoS₂) in the shortest possible time and (2) understand trends and time-correlations in the synthesis parameters that are most important in realizing materials with desired properties. These trends and time-correlations effectively provide information about the mechanism of the synthesis process.
Experimental synthesis by CVD is time-consuming and not amenable to high-throughput synthesis and is therefore incapable of generating the significant amount of data on synthesis using multiple profiles required for RL training. Therefore, we train our RL workflow on data from simulated CVD performed using reactive molecular dynamics simulations (RMD), which were previously shown to accurately reflect the potential energy surface of the reacting system as well as capture important mechanisms of the CVD synthesis reaction [35-38]. Below, we describe results from the molecular dynamics simulation of CVD, followed by a representation of the dynamics of this CVD-environment as a probability density function using a probabilistic deep generative model called Neural Autoregressive Density Estimator (NADE-CVD) and model-based Reinforcement Learning to identify optimal synthesis strategies. We conclude with a discussion on applicability of RL + NADE-CVD models for prediction of long-time material synthesis.

Results

A. Reactive MD for Chemical Vapor Deposition

We perform RMD simulations to simulate a multi-step reaction of a MoO₃ crystal with a sulfidizing atmosphere containing H₂S, S₂ and H₂ molecules. Each RMD simulation models a 20-ns long synthesis schedule, divided into 20 steps, each 1 ns long. At the beginning of each step, the gaseous atmosphere from the previous step is purged and replaced with a predefined number of H₂S, S₂ and H₂ molecules. These changes in RMD parameters reflect the time-dependent changes in synthesis conditions during experimental synthesis. The sulfidizing environment is then made to react with the partially sulfidized $\mathrm{MoO}_x\mathrm{S}_y$ structure from the end of the previous step at a predefined temperature for 1 ns. Each step is characterized by 4 variables, the system temperature and the number of S₂, H₂ and H₂S molecules in the reacting environment, denoted as the quartet $(T, n_{\mathrm{S_2}}, n_{\mathrm{H_2}}, n_{\mathrm{H_2S}})$. While the initial structure for each RMD simulation at $t = 0$ ns is a pristine MoO₃ slab, the final output structure (MoS₂ + MoO₃₋ₓ) is a non-trivial function of its synthesis schedule, defined by 20 such quartets as shown in Figure 1.

B. Neural Autoregressive Density Estimation for Predicting Output of Synthesis Schedules

RMD simulations can generate output structures for thousands of simulated synthesis schedules to overcome the primary problem of data scarcity common to experiments. RL-based optimization of synthesis schedules consists of successive stages of policy generation by the RL agent and policy evaluation by the environment. However, using RMD simulations directly as the policy evaluation environment is infeasibly time-consuming, since direct evaluation of a single synthesis profile by RMD takes approximately 2 days of computing. To overcome this problem, we construct a probabilistic representation of the CVD synthesis of MoS₂ as a Bayesian Network (BN), which encodes a functional relationship between the synthesis conditions and generated output structures and can therefore predict output structures for an arbitrary input condition in a fraction of the time required by RMD simulations. The BN consists of two sets of random variables, namely (a) the unobserved variables, $Z$, given by the time-dependent phase fractions of the 2H and 1T phases and defects in the $\mathrm{MoO}_x\mathrm{S}_y$ surface, and (b) the observed variables, $X$, given by the user-defined synthesis condition, namely the temperature and gas concentrations (Figures 2a and 2b) [39]. Each node in the BN represents either the synthesis condition at time $t$ as $X_t$ or the distribution of different phases on the $\mathrm{MoO}_x\mathrm{S}_y$ surface as $Z_t$. Together, the BN represents the joint distribution of $X$ and $Z$ as $P(X, Z)$.
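As a concrete picture of these variables, the sketch below stores one synthesis schedule as 20 conditions $X_t$ together with a phase state $Z_t$. It is a minimal illustration and not the authors' code: the class names, field names, the `random_schedule` helper, the sampling ranges and the all-zero initial state are hypothetical choices made for the example.

```python
import random
from dataclasses import dataclass

N_STEPS = 20  # each schedule has 20 steps of 1 ns (from the RMD setup)


@dataclass(frozen=True)
class SynthesisCondition:
    """Observed variable X_t: conditions imposed during one 1-ns step."""
    temperature: float  # K
    n_s2: int           # number of S2 molecules
    n_h2: int           # number of H2 molecules
    n_h2s: int          # number of H2S molecules


@dataclass(frozen=True)
class PhaseState:
    """Unobserved variable Z_t: 2H, 1T and defect content of the surface."""
    n_2h: float
    n_1t: float
    n_defect: float


def random_schedule(rng, n_steps=N_STEPS,
                    temp_range=(1000.0, 4000.0), max_molecules=200):
    """Draw a random schedule; the ranges are illustrative assumptions."""
    return [
        SynthesisCondition(
            temperature=rng.uniform(*temp_range),
            n_s2=rng.randint(0, max_molecules),
            n_h2=rng.randint(0, max_molecules),
            n_h2s=rng.randint(0, max_molecules),
        )
        for _ in range(n_steps)
    ]


schedule = random_schedule(random.Random(0))
# Assumed encoding of the pristine oxide slab: no 2H, 1T or defect content yet.
initial_state = PhaseState(n_2h=0.0, n_1t=0.0, n_defect=0.0)
```

A training set in this picture is simply a list of (schedule, phase trajectory) pairs, one per RMD simulation.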

Since $Z_1$ (the initial structure, pristine MoO₃) and $X$ (the synthesis conditions) are known, we can convert $P(X, Z)$ into the conditional distribution $P(Z_{2:n} \mid X, Z_1)$ using the chain rule. Further, using conditional independence between BN variables, $P(Z_{2:n} \mid X, Z_1)$ can be simplified into an autoregressive probability density function, where each $Z_{t+1}$ depends only upon the simulation history of observed and unobserved variables up to time $t$ (Figure 2b).

$$P(Z_{2:n} \mid X, Z_1) = P(Z_2 \mid Z_1, X_1) \cdots P(Z_{t+1} \mid Z_{1:t}, X_{1:t}) \cdots P(Z_n \mid Z_{1:n-1}, X_{1:n-1})$$

In the BN, each of these conditional probabilities, $P(Z_{t+1} \mid Z_{1:t}, X_{1:t})$, is modeled as a multivariate Gaussian distribution $\mathcal{N}(Z_{t+1} \mid \mu_{t+1}, \sigma_{t+1})$, whose mean $\mu_{t+1} = (\mu_{t+1}^{(1)}, \mu_{t+1}^{(2)}, \mu_{t+1}^{(3)})$ and variance $\sigma_{t+1} = (\sigma_{t+1}^{(1)}, \sigma_{t+1}^{(2)}, \sigma_{t+1}^{(3)})$, with one component per phase, are functions of the simulation history $(Z_{1:t}, X_{1:t})$. To learn the BN representation of the CVD process and capture the conditional distribution $P(Z \mid X, Z_1)$ compactly, we have developed a deep generative model architecture called a Neural Autoregressive Density Estimator (NADE-CVD; Figure 2c), which consists of an encoder, a decoder and a recurrent neural network (RNN) [40-43]. The output of the NADE-CVD function at time step $t+1$ is $\mu_{t+1}$ and $\sigma_{t+1}$ for the three phases in the $\mathrm{MoO}_x\mathrm{S}_y$ surface, which are functions of the simulation history encoded by the RNN cell as $h_t$, where $h_t$ is a function of $h_{t-1}$ and $(Z_t, X_t)$ at time $t$. Parameters of the NADE-CVD model are learned by maximum likelihood estimation using training data from 10000 RMD simulations of CVD under different synthesis conditions. The prediction error of the trained NADE-CVD model on test data (Figure 2d) shows a root-mean-square error (RMSE) of only 3.5 atoms and a maximum prediction error on any phase of ≤ 30 atoms. The architecture of the NADE-CVD model is described in the Methods section and details about model training are provided in Section 1 of the supplementary material.

Figure 1: Reactive MD for computational synthesis. (a) Snapshot of RMD simulation for MoS₂ synthesis. The sulfidizing environment containing S₂, H₂ and H₂S gases reacts with the $\mathrm{MoO}_x\mathrm{S}_y$ slab in the middle of the simulation cell (black lines). (b) Schematic of the RMD simulation of a single 20-ns long synthesis schedule. The initial MoO₃ slab at t = 0 ns reacts with a time-varying sulfidizing environment to generate a final structure composed of MoS₂ and MoO₃₋ₓ at t = 20 ns.

C. Probabilistic Model-Based Reinforcement Learning for Designing Optimal Synthesis Schedules

The NADE-CVD model accurately approximates a computationally expensive RMD simulation and provides a fast and probabilistic evaluation of the output structure from a given synthesis schedule. However, on its own, this model cannot be used to achieve the goal of predictive synthesis, which is to identify the most likely synthesis schedules that yield a material with optimal properties (such as high crystallinity, phase purity or hardness). For MoS₂ synthesis, one example of a design goal is to determine synthesis schedules that yield high quality MoS₂ (i.e. the largest phase fraction of the semiconducting 2H phase in the final product) in the shortest possible time. In other words, we wish to perform the non-trivial optimization of $X_{1:n}$ to maximize the value of $\sum Z_{1:n}$ (see supplementary material). Mathematically, it can be written as

$$\arg\max_{X_{1:n}} \sum Z_{1:n} \quad \text{where} \quad (Z_{1:n}, X_{1:n}) \sim P(Z_{1:n}, X_{1:n}) = P(Z_{1:n} \mid X_{1:n})\, P(X_{1:n}) \tag{1}$$

Figure 2: NADE model of computational synthesis of MoS₂. (a) Each 1-ns step of the RMD simulation is characterized by an input vector $X_t$ characterizing the synthesis conditions and the distribution of phases in the resulting structure, $Z_t$. (b) Bayesian Network representation of CVD synthesis of MoS₂ over 20 ns. The green and blue nodes are synthesis conditions as observed variables ($X_t$), whereas orange nodes are unobserved ($Z_t$) and represent the phase fractions of 2H, 1T and defects in the $\mathrm{MoO}_x\mathrm{S}_y$ surface as a function of time. (c) Schematic of the NADE-CVD, composed of two multi-layer perceptrons $F_{\mathrm{MLP}}$ as encoder and decoder networks and an intermediate recurrent neural network block, $F_{\mathrm{RNN}}$. (d) High test accuracy of NADE-CVD with a mean absolute error < 0.1 phase fraction.

For this purpose, we construct a reinforcement learning (RL) scheme (Figure 3a), consisting of an RL agent coupled to the NADE-CVD trained on RMD data as discussed in the previous section. The RL agent ($\pi_\theta$) is a multi-layer perceptron, where the input state ($s_t$) at time $t$ is a 128-dimension embedding vector of the entire simulation history up to $t$, $(Z_{1:t}, X_{1:t})$. At each time step $t$, the RL agent takes an action, $a_t$, which is the change in synthesis condition (i.e. reaction temperature and gas concentrations) at $t$, $a_t = \Delta X_t = \{\Delta T, \Delta S_2, \Delta H_2, \Delta H_2S\}$. The synthesis condition for the next nanosecond of the simulation is defined as $X_{t+1} = X_t + a_t$. The action ($a_t$) to take at $s_t$ is modeled using a Gaussian distribution, $a_t \sim \mathcal{N}(\mu(s_t), \sigma^2)$, whose state-dependent mean $\mu(s_t)$ is the output of the RL agent, $\mu(s_t) = \pi_\theta(s_t)$. The variance, $\sigma^2$, is assumed to be constant and is tuned as a hyperparameter of the RL scheme. Therefore, the RL scheme designs a 20 ns synthesis schedule ($\tau$) starting with an arbitrary synthesis condition, $\{T^0, S_2^0, H_2^0, H_2S^0\}$, such that the action proposed at each time step $t$ serves to convert the initial MoO₃ crystal into the 2H-MoS₂ structure as quickly as possible. During training, the RL agent learns the policy of designing the optimal synthesis condition via a policy gradient algorithm informed by the NADE-CVD model [28, 44-46]. At each time step $t$ in an episode, the RL agent receives an input state $s_t$ and proposes an action $a_t$ that determines the synthesis condition at the next time step, $X_{t+1}$. Using this, NADE-CVD predicts the distribution of various phases in the synthesized product, $Z_{t+1}$.
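The action-sampling and update rule described above, $a_t \sim \mathcal{N}(\mu(s_t), \sigma^2)$ followed by $X_{t+1} = X_t + a_t$, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: `constant_policy` is a hypothetical stand-in for the trained network $\pi_\theta$, the raw schedule history stands in for the 128-dimension state embedding, and no physical bounds (such as non-negative molecule counts) are enforced.

```python
import random


def sample_action(mean, sigma, rng):
    """Draw a_t ~ N(mu(s_t), sigma^2) independently for each component."""
    return [rng.gauss(m, sigma) for m in mean]


def rollout(initial_condition, policy_mean, sigma, n_steps, rng):
    """Build a schedule X_1..X_n by repeatedly applying X_{t+1} = X_t + a_t."""
    schedule = [list(initial_condition)]
    for _ in range(n_steps - 1):
        # The paper feeds a 128-dimension history embedding to the agent;
        # here the raw history stands in for that state (an assumption).
        state = schedule
        action = sample_action(policy_mean(state), sigma, rng)
        schedule.append([x + a for x, a in zip(schedule[-1], action)])
    return schedule


def constant_policy(state):
    """Hypothetical stand-in for pi_theta: always proposes the same change."""
    return [10.0, 1.0, 0.0, -1.0]  # [dT, dS2, dH2, dH2S]


# With sigma = 0 the rollout is deterministic, which makes it easy to check.
demo = rollout([1000.0, 5.0, 5.0, 5.0], constant_policy, 0.0, 3, random.Random(0))
```

Because each step depends only on the history so far, the same loop can be run for any number of steps, not only the 20 used in training.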
The NADE-CVD model also gives a reward ($r_t$) proportional to the concentration of the 2H phase in $Z_{t+1}$ and a new state $s_{t+1}$ to the RL agent. During training, the goal of the RL agent is to use these reward signals to adjust its policy parameters ($\pi_\theta$) so as to maximize its total reward, that is, to produce a 2H-rich MoS₂ structure in minimum time.

$$\text{Objective:} \quad \arg\max_{\theta}\; \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{n} r(s_t, a_t)\right] \quad \text{where} \quad r(s_t, a_t) = \begin{cases} 0.0 & \text{if } Z_{t+1}[n_{2H}] < 100 \\ 0.2\, Z_{t+1}[n_{2H}] & \text{if } Z_{t+1}[n_{2H}] \geq 100 \end{cases} \tag{2}$$

The details of the network architecture and the policy gradient algorithm are given in the Methods section, and RL agent training is described in Sections 2-5 in the Supplementary Material. The efficiency of the trained RL agent in identifying promising synthesis schedules is demonstrated in Figure 3b, which compares the 2H phase fraction of the resulting structures from 3200 synthesis schedules generated by the RL agent against 3200 randomly generated schedules. The RL agent is able to consistently identify schedules that result in highly crystalline and phase-pure products, while the randomly generated schedules overwhelmingly yield poorly-sulfidized and/or poorly crystalline products. In other words, the RL agent constructs a probability distribution function (pdf) of $X_{1:n}$ that places most of its probability mass on regions of $X_{1:n}$ that maximize $\sum Z_{1:n}$. Figure 3c shows the validation of one RL-predicted synthesis schedule by subsequent RMD simulation, showing that the observed time-dependent phase fraction tracks the RL-NADE prediction closely.

D. Optimal Synthesis Schedules for MoS₂ and Mechanistic Insights from the RL model

The RL agent is trained to learn policies that generate time-dependent temperatures and concentrations of H₂S, S₂ and H₂ molecules to synthesize 2H-rich MoS₂ structures in the least time. Closer inspection of these RL-designed policies provides mechanistic insight into CVD synthesis and the effect of variations in temperature and gas concentration on the quality of the synthesized product.
Figure 3: Reinforcement Learning model for synthesis schedule design. (a) Schematic of the RL-NADE model for optimizing schedules for MoS₂ synthesis. (b) Comparison of structures generated by the RL-designed schedules against randomly generated schedules demonstrates that the RL-NADE model consistently identifies CVD synthesis schedules that generate highly crystalline products. (c) Validation of a promising RL-generated schedule using RMD simulations.

Figure 4 shows that the RL agent has learned to generate a two-part temperature profile consisting of an early high-temperature (> 3000 K) phase spanning the first 7-10 ns followed by annealing to an intermediate temperature (~ 2000 K) for the remainder of the synthesis profile. This two-part synthesis profile identified by the RL policy is consistent with experiments and atomistic simulations: high temperature (> 3000 K) is necessary for both the reduction of the MoO₃ surface and its sulfidation, whereas the subsequent lower temperature (~ 2000 K) is necessary for enabling crystallization in the 2H structure, while continuing to promote residual sulfidation. The RL agent maintains this two-stage synthesis profile even if the provided initial temperature at $t = 0$ ns is low, by quickly ramping up the synthesis temperature to the high-temperature regime (> 3000 K). The RL agent is also able to predict non-trivial mechanistic details about phase evolution, including the observation that the nucleation of the 1T phase precedes the nucleation of the 2H crystal structure (Figure 4a and 4b). Similar trends were observed in previous mechanistic studies of MoS₂ synthesis [35]. Another important phenomenon identified by the RL agent is the effect of gas concentrations on the quality of the final product (Figure 4b). To analyze the effect of initial gas concentration, we compute the probability distribution of the 2H phase in MoS₂ over the last 10 ns of the simulation for the synthesis conditions proposed by the RL agent under different initial gas concentrations but with similar temperature profiles. The mean of the pdf, $\mu_{2H} = \mathbb{E}_{\tau \sim \pi_\theta}\left[\frac{1}{10} \sum_{t} Z_t[n_{2H}]\right]$, with the sum running over the last 10 ns, is the expected fraction of the 2H phase over the last 10 ns of the synthesis simulation, and a higher value of $\mu_{2H}$ provides an indication of the extent of sulfidation as well as the time required to generate 2H phases. The RL agent is found to promote synthesis profiles that have a low concentration of gas molecules (particularly non-reducing S₂ molecules) at early stages (0-3 ns) of the synthesis, when the temperature is high.
This partially evacuated synthesis atmosphere promotes the evolution of oxygen from, and self-reduction of, the MoO₃ surface. This can be clearly observed by comparing the histogram of 2H phase fractions in structures generated by synthesis profiles with a low initial (i.e. $t = 0$ ns) concentration of S₂ molecules against those with a higher concentration of S₂ molecules (Fig. 4c). Profiles with low initial S₂ concentrations enable greater self-reduction of the MoO₃ surface, resulting in a significantly higher 2H phase fraction in the synthesized product at $t$ = 10-20 ns. H₂S and H₂ molecules, which are more reducing than S₂, do not meaningfully affect the MoO₃ self-reduction rate, and the 2H phase fraction in the final $\mathrm{MoO}_x\mathrm{S}_y$ product is largely independent of the initial H₂S and H₂ concentrations (Fig. 4d-e).

E. Extensions to RL-NADE-CVD: Schedules for multi-phase heterostructures and predictions for large systems and Long-Time Synthesis

The outputs of the NADE-CVD model, $\mu_{t+1}$ and $\sigma_{t+1}$, are functions only of the simulation history up to time $t$. Similarly, each action $a_t$ taken by the RL agent is a function only of the input state $s_t$, which is an encoded representation of the simulation history up to time $t$. Hence, we can use RL + NADE-CVD to design policies for synthesis over time scales significantly longer than the 20 ns RMD simulation trajectories used for NADE-CVD training. Figure 5 shows a policy proposed by the RL + NADE-CVD model for a 30 ns simulation. This extended synthesis profile retains the design principles, such as a two-phase temperature cycle and low initial gas phase concentrations, that were learned from 20-ns trajectories. Further, the longer synthesis schedule also allows the RL agent to uncover new synthesis design rules for improving the 2H phase fraction. The RL profile in Figure 5 includes a heating-cooling cycle between 15-30 ns, which has previously been shown to improve the crystallinity and 2H phase fraction in the synthesized material [35].

Figure 4: Effect of synthesis conditions on products. (a) A generated synthesis profile starting from low temperature and low gas concentrations. The RL model quickly ramps up the temperature up to 7 ns to promote reduction and sulfidation and then lowers the temperature to intermediate values to promote crystallization. This profile generates a significant phase fraction of 2H starting from 10 ns. (b) A generated synthesis profile starting from high temperature and high S₂ concentrations. The RL-NADE model retains the high temperature at early stages of synthesis and slowly anneals the system to intermediate temperatures after 10 ns. This schedule promotes relatively late crystallization and 2H phase formation. (c) Synthesis profiles with initially low S₂ concentrations yield a significantly higher phase fraction of 2H in the final product compared to profiles containing higher S₂ concentrations at $t = 0$ ns. (d-e) Synthesis schedules are relatively insensitive to the initial concentration of the reducing species, H₂S and H₂.
The RL agent learns promising synthesis profiles by adjusting its policy parameters ($\pi_\theta$) to maximize a pre-defined reward function that corresponds to the material to be synthesized. Therefore, the RL agent can optimize synthesis schedules for other material structures, including multi-phase heterostructures, by constructing corresponding reward functions. The following reward function, $r(s_t, a_t)$, maximizes the phase fraction of the 1T crystal structure over the 20 ns simulation.

$$\text{Objective:} \quad \arg\max_{\theta}\; \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{n} r(s_t, a_t)\right] \quad \text{where} \quad r(s_t, a_t) = \begin{cases} 0.0 & \text{if } Z_{t+1}[n_{1T}] < 50 \\ 0.35\, Z_{t+1}[n_{1T}] & \text{if } Z_{t+1}[n_{1T}] \geq 50 \end{cases} \tag{3}$$

Figure 5c shows an RL-generated schedule to synthesize 1T-rich structures. The temperature profile is largely consistent with those observed for 2H-maximized synthesis schedules. The RL-generated gas-phase concentrations optimized for 1T synthesis maximize H₂ and H₂S concentrations, while minimizing S₂ concentrations. This is consistent with experimental observations, where reducing environments were observed to produce more 1T phase fractions [47]. This is in contrast to schedules optimized for 2H MoS₂, where the concentrations of all three gaseous species show correlated variations (Figure 4a-b). Figure 5d shows the $\mathrm{MoS}_x\mathrm{O}_y$ structure generated at the end of MD simulations according to the RL-generated synthesis schedule. The synthesized heterostructure consists of an island of 1T-MoS₂ embedded in the 2H-MoS₂ matrix with an atomically sharp interface between the two phases. Finally, RL-predicted synthesis schedules are also extremely robust with respect to system-size scaling. Figure 5e shows the validation of a single RL-generated profile using RMD simulations on systems of two different sizes, 51 Å × 49 Å and 100 Å × 100 Å. Figure 5f shows that the observed fractions of 2H and 1T phases in RMD simulations of both the small and large systems are consistent with each other over the entire 20-ns simulation range. Further, these phase fractions are also quantitatively consistent with the values predicted by the NADE model used in the RL optimization loop. This capability to optimize synthesis schedules independent of system size is useful to extend this approach to experimental synthesis.
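Both reward functions used in this work share one shape: zero below a threshold population of the target phase and proportional to that population above it. The sketch below captures this shape, assuming the thresholds (100 and 50) and scale factors (0.2 and 0.35) of Eqs. 2 and 3. The function names are hypothetical, and the argument is taken to be the predicted population of the target phase at the next step.

```python
def threshold_reward(population, threshold, scale):
    """Return scale * population once it reaches the threshold, else 0."""
    return scale * population if population >= threshold else 0.0


def reward_2h(n_2h):
    """Reward of the Eq. 2 form: favors the 2H phase."""
    return threshold_reward(n_2h, threshold=100.0, scale=0.2)


def reward_1t(n_1t):
    """Reward of the Eq. 3 form: favors the 1T phase."""
    return threshold_reward(n_1t, threshold=50.0, scale=0.35)


def total_reward(populations, reward_fn):
    """Sum of per-step rewards over one schedule (the quantity maximized)."""
    return sum(reward_fn(p) for p in populations)
```

Summing a per-step reward rather than scoring only the final structure favors schedules that reach the target phase early, which is how the objective encodes synthesis in the shortest possible time. Targeting a different structure then only requires swapping the reward function passed to the training loop.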

Conclusion

We have developed a machine learning scheme for the predictive design of time-dependent reaction conditions for the synthesis of new nanomaterials. The scheme integrates a reinforcement learning agent with a deep generative model of chemical reactions to predict and design optimum conditions for the rapid synthesis of two-dimensional MoS₂ monolayers using chemical vapor deposition. This model was trained on thousands of computational synthesis simulations at different reaction conditions performed using reactive molecular dynamics. The model successfully learned the dynamics of material synthesis during simulated chemical vapor deposition and was able to accurately predict new synthesis schedules to generate a variety of MoS₂ structures such as 2H-MoS₂, 1T-MoS₂ and 2H-1T in-plane heterostructures. Beyond mere synthesis design, the model was also useful for mechanistic understanding of the synthesis process and helped identify distinct temperature regimes that promote sulfidation and crystallization and the impact of a reducing environment on the phase purity of the synthesis product. We also demonstrate how the reinforcement learning scheme can be extended to predict the outcome of material synthesis over long time-scales and for system sizes larger than those used for training. This flexibility makes the reinforcement learning based design scheme suitable for optimization of experimental synthesis of a wide variety of nanomaterials.

Methods

A. Molecular Dynamics Simulation

All 10000 RMD simulations were performed using the RXMD molecular dynamics engine [48, 49] with the reactive forcefield originally developed by Hong et al. [36], which is optimized for reacting Mo-O-S-H systems. RMD computational synthesis simulations were performed on a 51 Å × 49 Å × 94 Å simulation cell containing a 1200-atom MoO₃ slab at $z$ = 47 Å surrounded by a reacting atmosphere containing H₂, S₂ and H₂S molecules. During RMD simulations, a one-dimensional harmonic potential is applied to each Mo atom along the $z$-axis (i.e., normal to the slab surface) with a spring constant of 75.0 kcal/mol to keep the atoms in a two-dimensional plane at elevated temperatures. For each nanosecond of the computational synthesis simulation, the system temperature is maintained at the value specified in the synthesis profile by scaling the velocities of the atoms. MD trajectories are integrated with a timestep of 1 femtosecond and charge-equilibration is performed every 10 timesteps [50].

Figure 5: Extensions of RL + NADE-CVD Method. (a,b) A 30-ns long synthesis profile predicted by RL + NADE-CVD retains design principles about the two-phase temperature cycle and low initial gas phase concentrations learned from 20-ns RMD trajectories. In addition, the 30-ns profile also includes a temperature annealing step between 15-30 ns (arrows) that improves the 2H phase fraction beyond 60%. (c) RL + NADE-CVD generated synthesis schedule for optimizing the 1T phase fraction. (d) Output structure from an RMD simulation of the 1T-optimized synthesis schedule reveals a heterostructure containing a 1T-rich region embedded in the 2H phase. (e,f) The robustness of RL-generated profiles against system size-scaling is validated by the identical fractions of 2H and 1T phases in laterally-small and laterally-large systems simulated using RMD with the same profile.

B. NADE-CVD

The NADE-CVD consists of an encoder, an LSTM block and a decoder (Fig. 2a). The encoder transforms $(X_t, Z_t)$ into a 72-dimensional vector, $e_t = F_{\mathrm{encoder}}(X_t, Z_t)$. The LSTM layer then constructs an embedding of the simulation history up to time $t$ as $h_t = F_{\mathrm{LSTM}}(h_{t-1}, e_t)$, where $h_t$ is a 128-dimensional vector. The decoder then uses $h_t$ to predict the mean and variance of the various phases in the MoOxSy surface as $\mu_{t+1}, \sigma_{t+1} = F_{\mathrm{decoder}}(h_t)$. The encoder and decoder are fully connected neural networks; the decoder layers have dimensions 128 × 72, 72 × 24 and 24 × 3. The parameters of the NADE-CVD ($\Theta$) are learned via maximum likelihood estimation (MLE) of the following likelihood function:

$$L(\Theta; D) = \prod_{i=1}^{m} P_\Theta\left(Z^i, X^i\right) = \prod_{i=1}^{m} \prod_{t=2}^{T} P_\Theta\left(Z^i_t \mid Z^i_{1:t-1}, X^i_{1:t-1}\right)$$

Here, $D = \{(X^1_{1:T}, Z^1_{1:T}), (X^2_{1:T}, Z^2_{1:T}), \ldots, (X^m_{1:T}, Z^m_{1:T})\}$ is the training dataset of $m$ RMD simulation trajectories. Further details, such as the log-likelihood of the training data during training and the evaluation of the NADE-CVD on test data, are given in the supplementary material.
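The autoregressive factorization above can be illustrated with a small sketch of the log-likelihood computation. This is not the authors' implementation: it assumes that each phase fraction is modelled as an independent Gaussian, and `toy_predictor` is a hypothetical stand-in for the encoder, LSTM and decoder that simply repeats the last observed phase fractions with a fixed width. All function names and the example numbers are made up for illustration.

```python
import math

def gaussian_log_pdf(z, mu, sigma):
    """Log density of a scalar Gaussian N(mu, sigma^2) evaluated at z."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2.0 * sigma ** 2)

def toy_predictor(x_hist, z_hist):
    """Hypothetical stand-in for encoder + LSTM + decoder (assumption).

    Returns (mu, sigma) for the next step: the last observed phase
    fractions as the mean and a fixed standard deviation of 0.1.
    """
    last = z_hist[-1]
    return list(last), [0.1] * len(last)

def trajectory_log_likelihood(X, Z, predictor):
    """Sum over t = 2..T of log P(Z_t | Z_{1:t-1}, X_{1:t-1})."""
    total = 0.0
    for t in range(1, len(Z)):  # 0-based index t corresponds to step t + 1
        mu, sigma = predictor(X[:t], Z[:t])  # history up to the previous step only
        for z, m, s in zip(Z[t], mu, sigma):
            total += gaussian_log_pdf(z, m, s)
    return total

def dataset_log_likelihood(D, predictor):
    """log L(Theta; D): sum of trajectory log-likelihoods over the dataset."""
    return sum(trajectory_log_likelihood(X, Z, predictor) for X, Z in D)

# Illustrative data: one trajectory, three steps, three phase fractions per step.
X = [(900.0,), (1000.0,), (1100.0,)]  # synthesis conditions (made-up values)
Z = [[0.2, 0.3, 0.5], [0.2, 0.3, 0.5], [0.25, 0.3, 0.45]]
print(dataset_log_likelihood([(X, Z)], toy_predictor))
```

Maximizing this quantity with respect to the parameters of a learned predictor corresponds to the MLE objective; in practice the product is handled in log space, as here, to avoid numerical underflow.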

C. RL agent architecture and Policy Gradient
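As orientation for this subsection, the following is a minimal sketch of a policy-gradient update with a value-function baseline for a Gaussian policy. It is not the authors' implementation: the scalar state and action, the linear mean `w * s + b` standing in for the neural network, the function names and the learning rate are assumptions made for illustration. Only the fixed action variance of 5 is taken from the text.

```python
SIGMA2 = 5.0  # variance of the Gaussian action distribution (value stated in the text)

def rewards_to_go(rewards):
    """G_t: sum of rewards from step t to the end of the trajectory."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

def policy_gradient(trajectory, w, b, value_fn):
    """Single-trajectory Monte Carlo estimate of
    sum_t grad log pi(a_t | s_t) * (G_t - V(s_t)).

    trajectory: list of (state, action, reward) tuples with scalar entries.
    The policy mean is mu(s) = w * s + b (toy linear model, an assumption).
    """
    states, actions, rewards = zip(*trajectory)
    returns = rewards_to_go(rewards)
    grad_w = grad_b = 0.0
    for s, a, g in zip(states, actions, returns):
        advantage = g - value_fn(s)  # the baseline reduces variance
        dlogp_dmu = (a - (w * s + b)) / SIGMA2  # d/dmu of log N(a; mu, sigma^2)
        grad_w += dlogp_dmu * s * advantage
        grad_b += dlogp_dmu * advantage
    return grad_w, grad_b

def ascent_step(params, grads, lr=0.01):
    """Gradient ascent: move the parameters along the gradient."""
    return tuple(p + lr * g for p, g in zip(params, grads))

# Illustrative use with made-up numbers and a zero baseline.
traj = [(1.0, 2.0, 1.0), (0.5, -1.0, 0.0), (2.0, 0.5, 3.0)]
grads = policy_gradient(traj, w=0.0, b=0.0, value_fn=lambda s: 0.0)
w, b = ascent_step((0.0, 0.0), grads)
```

In a full training loop the gradient would be averaged over many sampled trajectories, and the value function would be learned alongside the policy rather than fixed.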

The RL agent, $\pi_\theta$, is constructed using a fully connected neural network with tunable parameters $\theta$. It consists of an input layer of 128 nodes, followed by two hidden layers with 72 and 24 nodes, and then an output layer. The input $s_t$ to $\pi_\theta$ is the embedding $h_t$ of the simulation history $(X_{1:t}, Z_{1:t})$ generated by the NADE-CVD. The output of the RL agent is the mean $\mu(s_t)$ of the action $a_t$ and the value function $V(s_t)$ associated with $s_t$. The hyperparameter $\sigma^2$, the variance of the Gaussian distribution of actions $a_t$, is set to 5. During training, the RL agent learns the optimal policy that maximizes the total expected reward $\mathbb{E}$ (eq. 1) using the policy gradient algorithm, which takes the derivative of $\mathbb{E}$ with respect to the parameters $\theta$,

$$\nabla_\theta \mathbb{E} = \frac{\partial\, \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=1}^{T} r(s_t, a_t)\right]}{\partial \theta},$$

where the trajectory is $\tau = \{s_1, a_1, s_2, a_2, \ldots, s_T, a_T\}$. This derivative reduces to the following objective, which is optimized via gradient ascent:

$$\nabla_\theta \mathbb{E} = \mathbb{E}_{\tau \sim \pi_\theta}\left\{\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(s_t, a_t)\,\bigl(G_t - V(s_t)\bigr)\right\}, \quad \text{where } G_t = \sum_{t'=t}^{T} r_{t'}.$$

Here, the value function $V(s_t)$ serves as a variance-reduction baseline in the Monte Carlo estimate of $\nabla_\theta \mathbb{E}$. Details of the above derivation and of the policy gradient algorithm are given in the supplementary material.

Acknowledgements

This work was supported as part of the Computational Materials Sciences Program funded by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, under Award Number DE-SC0014607. The simulations were performed at the Argonne Leadership Computing Facility under the DOE INCITE and Aurora Early Science programs and at the Center for Advanced Research Computing of the University of Southern California.

References

1. Green, M.L., et al., Fulfilling the promise of the materials genome initiative with high-throughput experimental methodologies. Applied Physics Reviews, 2017. (1).
2. Bernstein, N., G. Csányi, and V.L. Deringer, De novo exploration and self-guided learning of potential-energy surfaces. npj Computational Materials, 2019. (1): p. 99.
3. Zunger, A., Inverse design in search of materials with target functionalities. Nature Reviews Chemistry, 2018. (4).
4. Butler, K.T., et al., Machine learning for molecular and materials science. Nature, 2018. (7715): p. 547-555.
5. Gubernatis, J.E. and T. Lookman, Machine learning in materials design and discovery: Examples from the present and suggestions for the future. Physical Review Materials, 2018. (12).
6. Dai, C. and S.C. Glotzer, Efficient Phase Diagram Sampling by Active Learning. The Journal of Physical Chemistry B, 2020. (7): p. 1275-1284.
7. Ramprasad, R., et al., Machine learning in materials informatics: recent applications and prospects. npj Computational Materials, 2017.
8. Tagade, P.M., et al., Attribute driven inverse materials design using deep learning Bayesian framework. npj Computational Materials, 2019.
9. Gomez-Bombarelli, R., et al., Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nature Materials, 2016. (10): p. 1120-+.
10. Yan, J., et al., Material descriptors for predicting thermoelectric performance. Energy & Environmental Science, 2015. (3): p. 983-994.
11. Gaultois, M.W., et al., Data-Driven Review of Thermoelectric Materials: Performance and Resource Considerations. Chemistry of Materials, 2013. (15): p. 2911-2920.
12. Bassman, L., et al., Efficient Discovery of Optimal N-Layered TMDC Hetero-Structures. MRS Advances, 2018. (6-7): p. 397-402.
13. de Pablo, J.J., et al., New frontiers for the materials genome initiative. npj Computational Materials, 2019.
14. Yang, Q., C.A. Sing-Long, and E.J. Reed, Learning reduced kinetic Monte Carlo models of complex chemistry from molecular dynamics. Chemical Science, 2017. (8): p. 5781-5796.
15. Zhou, Z.P., X.C. Li, and R.N. Zare, Optimizing Chemical Reactions with Deep Reinforcement Learning. ACS Central Science, 2017. (12): p. 1337-1344.
16. Coley, C.W., et al., A robotic platform for flow synthesis of organic compounds informed by AI planning. Science, 2019. (6453): p. 557-+.
17. McMullen, J.P. and K.F. Jensen, Integrated Microreactors for Reaction Automation: New Approaches to Reaction Development. Annual Review of Analytical Chemistry, Vol 3, 2010. p. 19-42.
18. Sanchez-Lengeling, B., et al., Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC). 2017, ChemRxiv.
19. Fabry, D.C., E. Sugiono, and M. Rueping, Self-Optimizing Reactor Systems: Algorithms, On-line Analytics, Setups, and Strategies for Accelerating Continuous Flow Process Optimization. Israel Journal of Chemistry, 2014. (4): p. 341-350.
20. Tabor, D.P., et al., Accelerating the discovery of materials for clean energy in the era of smart automation. Nature Reviews Materials, 2018. (5): p. 5-20.
21. Raccuglia, P., et al., Machine-learning-assisted materials discovery using failed experiments. Nature, 2016. (7601): p. 73-+.
22. Coley, C.W., et al., Prediction of Organic Reaction Outcomes Using Machine Learning. ACS Central Science, 2017. (5): p. 434-443.
23. Wei, J.N., D. Duvenaud, and A. Aspuru-Guzik, Neural Networks for the Prediction of Organic Chemistry Reactions. ACS Central Science, 2016. (10): p. 725-732.
24. Kononova, O., et al., Text-mined dataset of inorganic materials synthesis recipes. Scientific Data, 2019.
25. Kim, E., et al., Virtual screening of inorganic materials synthesis parameters with deep learning. npj Computational Materials, 2017.
26. Kim, E., et al., Data Descriptor: Machine-learned and codified synthesis parameters of oxide materials. Scientific Data, 2017.
27. Mnih, V., et al., Human-level control through deep reinforcement learning. Nature, 2015. (7540): p. 529-533.
28. Sutton, R.S. and A.G. Barto, Reinforcement Learning: An Introduction, 2nd edition. MIT Press, in the press.
29. Sanchez-Lengeling, B. and A. Aspuru-Guzik, Inverse molecular design using machine learning: Generative models for matter engineering. Science, 2018. (6400): p. 360.
30. Popova, M., O. Isayev, and A. Tropsha, Deep reinforcement learning for de novo drug design. Science Advances, 2018. (7): p. eaap7885.
31. Segler, M.H.S., M. Preuss, and M.P. Waller, Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 2018. (7698): p. 604-610.
32. Kearnes, S., L. Li, and P. Riley, Decoding Molecular Graph Embeddings with Reinforcement Learning. arXiv preprint arXiv:1904.08915, 2019.
33. Jin, G., et al., Atomically thin three-dimensional membranes of van der Waals semiconductors by wafer-scale growth. Science Advances, 2019. (7).
34. Hong, S., et al., Chemical Vapor Deposition Synthesis of MoS2 Layers from the Direct Sulfidation of MoO3 Surfaces Using Reactive Molecular Dynamics Simulations. The Journal of Physical Chemistry C, 2018. (13): p. 7494-7503.
35. Hong, S., et al., Defect Healing in Layered Materials: A Machine Learning-Assisted Characterization of MoS2 Crystal Phases. Journal of Physical Chemistry Letters, 2019. (11): p. 2739-2744.
36. Hong, S., et al., Computational Synthesis of MoS2 Layers by Reactive Molecular Dynamics Simulations: Initial Sulfidation of MoO3 Surfaces. Nano Letters, 2017. (8): p. 4866-4872.
37. Hong, S., et al., A Reactive Molecular Dynamics Study of Atomistic Mechanisms During Synthesis of MoS2 Layers by Chemical Vapor Deposition. MRS Advances, 2018. (6-7): p. 307-311.
38. Hong, S., et al., Chemical Vapor Deposition Synthesis of MoS2 Layers from the Direct Sulfidation of MoO3 Surfaces Using Reactive Molecular Dynamics Simulations. Journal of Physical Chemistry C, 2018. (13): p. 7494-7503.
39. Koller, D. and N. Friedman, Probabilistic graphical models: principles and techniques. Adaptive computation and machine learning. 2009, Cambridge, MA: MIT Press. xxxv, 1231 p.
40. Ou, Z., A Review of Learning with Deep Generative Models from Perspective of Graphical Modeling. arXiv preprint arXiv:1808.01630, 2018.
41. Larochelle, H. and I. Murray, The Neural Autoregressive Distribution Estimator, in Fourteenth International Conference on Artificial Intelligence and Statistics. 2011, PMLR: Ft. Lauderdale, FL, USA. p. 29-37.
42. Gregor, K., et al., Deep AutoRegressive Networks. 2014, PMLR: Beijing, China. p. 1242-1250.
43. Oord, A.V.D., N. Kalchbrenner, and K. Kavukcuoglu, Pixel recurrent neural networks, in Proceedings of the 33rd International Conference on Machine Learning - Volume 48. 2016, JMLR.org: New York, NY, USA. p. 1747-1756.
44. Schulman, J., et al., High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
45. Sutton, R.S., et al., Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems 12, 2000. p. 1057-1063.
46. Duan, Y., et al., Benchmarking deep reinforcement learning for continuous control, in Proceedings of the 33rd International Conference on Machine Learning - Volume 48. 2016, JMLR.org: New York, NY, USA. p. 1329-1338.
47. Liu, L.N., et al., Phase-selective synthesis of 1T' MoS2 monolayers and heterophase bilayers. Nature Materials, 2018. (12): p. 1108-+.
48. Nomura, K.-i., et al., RXMD: A scalable reactive molecular dynamics simulator for optimized time-to-solution. SoftwareX, 2020. p. 100389.
49. Nomura, K.-i., et al., A scalable parallel algorithm for large-scale reactive force-field molecular dynamics simulations. Computer Physics Communications, 2008. (2): p. 73-87.
50. Nomura, K., et al., An extended-Lagrangian scheme for charge equilibration in reactive molecular dynamics simulations. Computer Physics Communications, 2015. 192.