Atomic-Level Features for Kinetic Monte Carlo Models of Complex Chemistry from Molecular Dynamics Simulations
AAtomic-level features for statistical learning of the chemical kinetics from moleculardynamics simulations
Vincent Dufour-D´ecieux ∗ and Evan J. Reed † Department of Materials Science and Engineering, Stanford University, Stanford, CA 94305, USA
Rodrigo Freitas
Department of Materials Science and Engineering,Stanford University, Stanford, CA 94305, USA andDepartment of Materials Science and Engineering,Massachussets Institute of Technology, Cambridge, MA 02139, USA (Dated: January 27, 2021)The high computational cost of evaluating atomic interactions recently motivated the developmentof computationally inexpensive kinetic models, which can be parametrized from MD simulations andaccelerate the prediction of the chemical evolution by up to four order of magnitude. Such kineticmodels utilize molecular descriptions of reactions and have been constrained to only reproducemolecules previously observed in MD simulations. Therefore, these descriptions fail to predictthe reactivity of unobserved molecules, for example in the case of large molecules or solids. Herewe propose a new approach for the extraction of reaction mechanisms and reaction rates fromMD simulations, namely the use of atomic-level features. Using the complex chemical networkof hydrocarbon pyrolysis as example, it is demonstrated that kinetic models built using atomicfeatures are able to explore chemical reaction pathways never observed in the MD simulations usedto parametrize them, a critical feature to describe rare events. Atomic-level features are shownto construct reaction mechanisms and estimate reaction rates of unknown molecular species fromelementary atomic events. Through comparisons of the model ability to extrapolate to longersimulation timescales and different chemical compositions than the ones used for parameterization,it is demonstrated that kinetic models employing atomic features retain the same level of accuracyand transferability as the use of features based on molecular species, while being more compact andparametrized with less data. We also find that atomic features can better describe the formation oflarge molecules enabling the simultaneous description of small molecules and condensed phases.
I. INTRODUCTION
Understanding mechanisms of chemical reaction kinet-ics is a challenging task. Reaction networks can be enor-mous and complex, even for simple chemical systems con-taining no more than two or three species. Yet, mucheffort has been put in advancing our knowledge of chem-ical kinetics due to its important role in several fieldsof science, such as fuel combustion [1, 2], astrophysics[3, 4], polymer science [5], and organic chemistry [6, 7].Atomistic simulations have long been a valuable tool forreaction-mechanism discovery as they complement manyof the experimental efforts [1, 2] by providing unrestrictedaccess to the contributions of each atom along specificreaction pathways. This level of detail is difficult to beachieved through experimental methods alone.One of the main limitations of atomistic simulations istheir time scale. Such simulations are often restricted toevents that occur in the range of femtoseconds to hun-dreds of nanoseconds due to the intrinsic need to resolveprocesses step by step. Several fruitful approaches havebeen developed with the goal of overcoming these lim-itations, including parallel replica [8, 9], bias potential ∗ [email protected] † [email protected] [10, 11], enhanced sampling [12], GPU computing [13],and transition-path sampling [6]. From these advances itbecame clear that no single method would solve all thetimescale limitations of atomistic simulations. Instead,each approach has a niche of applications for which it isbest suited for. Futhermore, ease of use by non-expertsin commonly used codes is a generally desirable featureof an algorithm that can significantly enhance adoptionand use by a much broader spectrum of researchers, e.g.many density functional theory codes. Here, we focuson such an approach where atomic simulation methodsare used to automatically parametrize kinetic models ofchemical reactivity that naturally give access to extendedtime scales (Fig. 1). This algorithm takes as input MDsimulations of the type that are already routinely per-formed by many researchers, and squeezes additional ca-pabilities out of those simulations at negligible cost com-pared with performing the MD simulations. Our specificapplication in this work is in capturing and understand-ing the chemical reactivity of hydrocarbons.Atomic-level simulations have contributed much to theunderstanding of the mechanisms of pyrolysis and com-bustion of hydrocarbons [15–25]. Reaction mechanismsand reaction rates can be extracted [15–17] from atom-istic simulations such as MD simulations and compareddirectly to experimental data. Alternatively, this infor-mation can also be employed to parametrize a kineticMonte Carlo (KMC) [16, 26–30] model that can be used a r X i v : . [ phy s i c s . c o m p - ph ] J a n MD of C H pyrolysis MD of CH pyrolysis MD of C H pyrolysis Conventional Approach MD of C H pyrolysisProposed Approach Bond analysis
Atom or molecule featurization Reaction discovery Reaction rate estimation KMC of CH pyrolysis KMC of C H pyrolysis FIG. 1. Illustration of two different approaches to obtain the chemical reaction kinetics of three different systems with startingcompositions consisting of only one type of molecule: CH , C H , or C H . In the “Conventional Approach” one MDsimulation is performed for each of the three compositions. The “Conventional Approach” is time-consuming but accurate.In the “Proposed Approach”, first developed by Chen et al. [14], a single MD simulation is performed for one of the threecompositions. From the data of this MD simulation the observed reaction mechanisms and reaction rates are extracted. Thisinformation is in turn employed to obtain the chemical reaction kinetics of the two remaining systems through the use of acomputationally inexpensive kinetic model (namely Kinetic Monte Carlo (KMC) simulations). The “Proposed Approach” isfaster than the “Conventional Approach” and can be made just as accurate by the judicious choice of the kinetic model. Thesimulation times in the illustration are representative of the simulations performed in this work. They are presented only toprovide a sense of the computational speedup provided by the “Proposed Approach”. to reproduce chemical kinetics for longer time scales at amuch reduced computational cost, as illustrated in Fig. 1.These kinetic model extractions characterized each re-action by the molecules involved. This approach showedgood accuracy in reproducing initial MD simulations evo-lution [14, 26], but it has several disadvantages. One ofthe main disadvantages of employing mechanisms andrates of reactions obtained in MD simulations describedin terms of molecules (e.g. A+B → AB where A, B andAB are molecular species) to parametrize KMC modelsis that the resulting KMC simulations are bound to onlycreate molecules that have been previously observed inthe atomistic simulation. If there exists a molecule thattakes longer to be created than the accessible time scaleof the MD simulation, then the KMC simulation will notbe able to create that molecule either, despite being ableto simulate chemical kinetics for longer time scales thanMD simulations. Consider for example the process ofcreation and growth of soot particles. Such particles arethe result of incomplete combustion of hydrocarbons andare composed of long carbon chains. Small carbon chainsgrow in size by aggregation of other molecules. Any MD simulation can only study the growth of carbon chainsup to a certain length due to the time scale limitation.Because larger carbon molecules are not observed in theMD simulation, a KMC model would not be able to pre-dict the growth of carbon molecules beyond that specificsize observed in the MD simulation.Here, we propose an alternative strategy for the ex-traction of reaction mechanisms and reaction rates fromMD simulations, namely the employment of atomic-levelfeatures (Fig. 2). This novel strategy naturally leads tothe parametrization of KMC models that can not onlyextend the time scale of MD simulations but also predictthe mechanisms and rates of reactions never observedin MD simulations. Moreover, we demonstrate that ourapproach results in a much more compact descriptionof chemical kinetics of hydrocarbons, requiring less datafrom costly atomistic simulations in order to train modelsthat are just as effective as previous approaches.
FIG. 2. Illustration of how the same chemical reaction is numerically characterized by the two different types of chemicalrepresentations considered in this manuscript. The “molecular features” involve assigning a numerical fingerprint to eachmolecule and representing a chemical reaction by the interaction between these molecular fingerprints. Molecular features arechemically intuitive and most commonly. Here we introduce a novel representation that we refer to as “atomic features”. Thisrepresentation employs a more local description were each atom has its own set of features and chemical reactions are describedlocally by the interaction between the atomic fingerprints. Atomic features provide a smaller features space than molecularfeatures, can be parameterized with smaller sets of MD data, and they also enable the determination of the chemical reactivityof systems containing novel chemical species not observed before.
II. METHODS
In this section we present the approach employed toparametrize a KMC model using atomic features. Reac-tion rates and reaction mechanisms are extracted directlyfrom MD simulations. The general framework is summa-rized in Figs. 1 and 2.
A. Molecular Dynamics simulations.
The MD simulations were performed using LAMMPS[31, 32] and the ReaxFF potential [33–35] with param-eters as described by Mattsson et al. [36]. Independentsimulations were run starting with either 160 moleculesof methane (CH ), 125 molecules of ethane (C H ), 64molecules of isobutane (C H ), or 64 molecules of oc-tane (C H ). Temperature and pressure were increasedtogether to 3,300 K and from 1,013 hPa to 40.53 GPa us-ing a Nos´e-Hoover chain thermostat and barostat withchain length 3 and damping parameter of 2.4 fs for tem-perature and 60 fs for pressure.[37–41] The ramping pro-cess was spread over 24 ps with a timestep duration of0.12 fs. Finally, the system was kept at 3,300 K and40.53 GPa for 500 ps using the same thermostat, baro-stat (but now with damping parameter of 14.4 fs), andtimestep for 500 ps. During this 500 ps period the atomcoordinates were saved every 12 fs in order to perform ananalysis of the system’s chemical reactivity. These con-ditions of temperature and pressure are chosen because they are considered as the approximate thermodynamicconditions of gas-giant planetary interiors [4], where it isspeculated that a rich hydrocarbon chemistry might bepresent and chemical kinetic evolution of solid phases ofcarbon could impact internal planetary dynamics. B. Bond analysis.
In order to determine chemical reactivity it is necessaryto capture the formation and breaking of chemical bondsin the MD simulations. Here, this is achieved by usingthe following criteria: two atoms are considered bonded ifthey are separated by less than a bond cutoff length λ forlonger than a time period τ . Similarly, a bond betweentwo atoms is considered to have been broken if two atomsinitially bonded are separated by a distance larger than λ for a period of time longer than τ . The values for λ and τ were respectively taken from Refs. 26 and 42, where acareful analysis lead to the optimal values of λ = 1 . (cid:6) Afor C-C bonds, λ = 1 . (cid:6) A for C-H bonds, λ = 1 . (cid:6) A forH-H bonds, and τ = 0 .
096 ps.
C. Reaction representation.
Two different representations of chemical reactions areconsidered here (Fig. 2), each one leading to a differentset of numerical features characterizing a reaction. Thefirst representation is a chemically intuitive one: eachmolecule has a numerical fingerprint consisting of fea-tures that count the number of each chemical element inthe molecule as well as the number of bonds between eachpair of elements. Whenever a reaction occurs the quan-tities registered are the types of molecular fingerprintsinvolved. Because of this we refer to this representationas “molecular features”. This is a well-known represen-tation in the literature, e.g., Refs. 14 and 26.In this article we introduce a second type of representa-tion for chemical reactions in which the characterizationoccurs more locally, at the atomic level. In this repre-sentation each atom has its own numerical fingerprint(Fig. 2) consisting of features that identify the chemicalelement of the atom and the number of bonds formedwith each chemical element available. Whenever a reac-tion occurs the quantities registered are only the typesof atomic fingerprints involved. Because of this we referto this representation as “atomic features”. Notice thatwhile molecular features often lead to reactions involvingmany molecules (resulting in many bonds being simulta-neously broken or created), the atomic features alwaysinvolve only the breaking or formation of a single bond.
D. Reaction rates estimation.
Over the course of the MD simulations all reactionsobserved were recorded and their numerical fingerprintscomputed using both representations, atomic and molec-ular. Here we describe how this information was em-ployed to estimate reaction rates. Our approach followsthe work of Yang et al. [26].The state of the system at time t is represented bya vector of concentrations X ( t ). For the molecular fea-tures each component is the concentration of one of themolecular species (i.e., molecular fingerprint), while foratomic features each component is the concentration ofone atomic fingerprint. The probability of occurrence ofa reaction j in the time interval [ t, t + ∆ t ] is a j ( X ( t ))∆ t ,where a j is known as the propensity function. Noticethat a reaction j is considered to be a molecular reac-tion for the molecular features, while for atomic featuresa reaction j is a bond breaking or formation event. Thepropensity function is a j ( X ( t )) = k j h j ( X ( t )), where k j is the reaction rate coefficient and h j ( X ( t )) is the combi-natorial number of times that reaction j could have takenplace given the system state X ( t ). For atomic features wehave h j ( X ( t )) = X m ( t ) for bond breaking, h j ( X ( t )) = X m ( t ) X m (cid:48) ( t ) for bond formation between two differentatomic fingerprints, and h j ( X ( t )) = X m ( t )( X m ( t ) − h j ( X ( t )) has a similar form butmore than two reactants might be involved, in whichcase the same combinatorial argument can be applied(see Ref. 26 for more details).The calculation of k j is more intricate and requiresthe following assumptions. First, the time interval ∆ t isassumed to be short enough for the propensity function a j ( X ( t )) to be considered constant during that time in-terval. Second, the number of times n j ( t, t + ∆ t ) thatreaction j occurs in the time interval [ t, t + ∆ t ] is as-sumed to follow a Poisson distribution with parameter a j ( X ( t ))∆ t . Finally, the Poisson random variables of allreactions are assumed to be conditionally independentgiven X ( t ). With these assumptions it becomes possibleto use maximum-likelihood estimation to calculate thereaction rate coefficient k j as k j = (cid:80) t n j ( t, t + ∆ t )∆ t (cid:80) t h j ( X ( t )) = N j ∆ tH j , (1)where N j = (cid:80) t n j ( t, t + ∆ t ) is the total number of timesthat reaction j occurred and H j = (cid:80) t h j ( X ( t )) is thetotal number of times reaction j could have occurred.The 95% confidence interval of k j can be calculated usingthe Fisher information of the likelihood [43]. Few lines ofcalculations described in Ref. 43 gives a 95% confidenceinterval of: k j ± . (cid:115) k j ∆ tH j . (2)Yet, reactions have rates that can vary by orders of mag-nitude. Thus, it is often useful to normalize the size ofthe 95% confidence interval of Eq. (2) by k j when com-paring the accuracy of different reaction rates, leadingus to the normalized size of the 95% confidence interval(NSCI):NSCI( k j ) = 2 × . (cid:113) k j ∆ tH j .k j = 3 . (cid:115) N j . (3) E. Kinetic Monte Carlo.
Once the set of all possible reactions j and reactionrates k j have been obtained from the MD simulations itis possible to reproduce the system time evolution usinga Kinetic Monte Carlo (KMC) approach known as theGillespie stochastic simulation algorithm [44, 45], whichwe briefly review next. Given the state of the system attime t , X ( t ), the KMC algorithm determines the state X ( t ) at a future time t by selecting a single reaction tooccur between t and t . The time t = t + τ at whichthe next reaction occurs is randomly selected from anexponential distribution p ( τ | X ( t )) = a ( t ) exp[ − a ( t ) τ ],where a ( t ) = (cid:80) i a i ( X ( t )). The reaction taking place at t is also selected randomly, with reaction j being selectedwith probability p j ( t ) = a j ( X ( t )) /a ( t ). Applying themodifications caused by reaction j to X ( t ) results in X ( t ).Atomic and molecular features are both capable of de-scribing the same set of chemical reactions. Yet, there isa fundamental difference between them in how the stateof the system evolves in time during a KMC simulation.For molecular features the state of the system X ( t ) issimply the number of each distinct molecular fingerprintcurrently present in the system. The set of all possiblemolecular fingerprints (i.e. the length of vector X ) ispredetermined by those fingerprints observed in an MDsimulation. Thus, by using molecular features the KMCsimulation is constrained to never exhibit any reactionevent or molecular species that has not been observed inthe MD simulation. Such is not the case for the atomicfeatures, where the state of the system X ( t ) is composedof the number of distinct atomic fingerprints currently inthe system. When a reaction is chosen using the KMCalgorithm, it is necessary to randomly select the pair ofatoms participating in this reaction. Each pair of atomswith the correct atomic fingerprints (i.e. the fingerprintsinvolved in the reaction) has the same probability to bechosen. Once the two atoms are selected, a bond betweenthem is created in the case of a bond creation, or is bro-ken in the case of a bond breaking. To keep track of theconnectivity between the different atoms an adjacencymatrix is employed. The adjacency matrix is a squarematrix with a number of rows and columns equal to thenumber of atoms in the system. The elements of this ma-trix are equal to 1 when the pair of atoms is connectedand 0 otherwise. This matrix allows us to reconstruct thenetwork of connections between the atoms at each timestep, i.e., it allows us to define the molecules present inthe system from the atomic fingerprints. It is during thisreconstruction step that the atomic features can resultin molecular species that have never been observed inthe MD simulation. Note that a given adjacency ma-trix, obtained from the atomic features, reconstructs aunique set of molecules described by the molecular fea-tures, however a set of molecular features is not associ-ated to a unique adjacency matrix. Moreover, the molec-ular features do not differentiate between certain isomersof the same molecule, whereas the adjacency matrix re-constructs a specific isomer of a molecule.In a KMC simulation, when a reaction is picked, it isassumed here that all species with the correct fingerprintare equally likely to react. But this assumption may nothold in the case of the atomic features when a very longcarbon chain is present in the system because atoms inthe long chain may not all behave in the same way. Longcarbon chains tend to contract themselves into large par-ticles that are the result of incomplete hydrocarbon py-rolysis. When that happens atoms in the periphery ofthe particle may react with the remaining of the sys-tem, while atoms deep inside the particle would morelikely react with atoms of the large particle they belongto (Fig. S1). In such cases, the assumption that atomswith the correct atomic fingerprints are all equally likelyto react could be broken and this problem can result insome limitations of the atomic features. For example,in the MD simulation, atoms on the periphery of thelarge particle could react with atoms out of this particlewhich would result in the growth of the particle, whereasatoms deep inside the particle would react with atoms ofthe same particle, which would not result in the growth of the particle. Yet, in the KMC simulation atoms onthe periphery or deep inside the particle would have thesame probability to react with atoms out of or inside theparticle, which would disrupt the growth of large parti-cle. In order to avoid this, the growth of large carbonchains was tracked and the chemical reactivity analysishalted at the simulation time when the longest moleculein the system contained 10 % of all the carbons in thesystem. This constraint resulted in considering only thefirst 300 ps of the CH simulations, 100 ps of the C H simulations, 50 ps of the C H simulations, and 50 psof the C H simulations. The different time needed toreach 10 % reflects the fact that systems with larger car-bon content and larger initial molecules result in fastergrowth of carbon chains. F. Error calculation.
An error metric is needed in order to compare the sys-tem time evolution predicted by KMC to the results ofthe MD simulation. An appropriate option is to measureand compare the concentration of the most numerousmolecules: CH , C H , and H . Another good indicatorof the accuracy of KMC simulations is the number of car-bon atoms in the longest molecule, as the growth of longcarbon chains is also a function of the system kinetics.Thus, in this article we often compare the time evolu-tion of CH , C H , H , and number of carbons in thelongest molecule as predicted by KMC and MD. In orderto increase the statistical accuracy of the comparison theMD simulations results are averaged over three indepen-dent simulations while the KMC simulations results areaveraged over 20 independent simulations.Besides comparing the time evolution, it is also use-ful to have a more condensed and objective metric thatsummarizes the errors accumulated over the entire time-evolution trajectory. For that purpose we define the fol-lowing quantityError( m ) = 1 T (cid:88) n (cid:88) t (cid:12)(cid:12) µ MD n ( t ) − µ KMC n ( t ) (cid:12)(cid:12) σ MD n ( t ) , (4)where n is one of the four species of interest mentionedabove, T is the number of timesteps, µ MD/KMC n ( t ) is thenumber of species n (i.e. atomic or molecular finger-prints) at time t averaged over all independent simula-tions, σ MD n ( t ) is the standard deviation of the number ofspecies n at time t . The division by the standard devi-ation serves to account for the variability of MD resultswhen evaluating the discrepancy between MD and KMC.Consequently, time intervals in which MD results presentlarge variance influence the error calculation less strongly. III. RESULTS
Reactions and their rates, Eq. (1), were computed foratomic and molecular features employing the entire tra-jectory of a single MD simulation of a system startingwith only C H molecules. Using this set of reactionsa KMC model was parametrized and KMC simulationswere run to study the chemical kinetics of a system withthe exact same starting configuration as the MD simu-lation (i.e. same amount of C H molecules). In Fig. 3we compare the time evolution of the system accordingto both simulation methods for the atomic and molecu-lar features. It is visually clear that KMC simulationswith either type of features are able to reproduce theresults of the more computationally expensive MD sim-ulations. The metric of Eq. (4), shown in Fig. 3c, con-firms these observations: atomic and molecular featurespresent similar total error accumulated over the entiretrajectory. Despite the similarity of the results there isa large difference in the number of unique reactions ob-served: while the molecular features result in 845 uniquemolecular reactions the atomic features produce only 122unique atomic reactions. The total number of reactionsobserved is 2 ,
683 for the molecular features and 3 ,
358 forthe atomic features. These two numbers are different be-cause a reaction with the atomic features can only be onebond breaking or creation, whereas there is no such con-straint with the molecular features. Equation (3) showsthat the accurate estimation of rate k j requires reaction j to occur many times. Thus, the atomic features lead to amore accurate and compact KMC model representationof the atomistic MD results.In terms of computation costs, a single MD simulationtakes around one full day to run in parallel on 40 CPUs.Meanwhile, the feature extraction process for either typeof features takes only two minutes a single CPU and aKMC simulation running in two minutes in a single CPU.This represents a speedup on the order of 14 ,
000 in termsof CPU-hours.
A. Model transferability.
Next, we test whether the set of reactions learned froma single MD simulation starting with C H is capable ofreproducing the kinetics of systems with different ratiosof carbon to hydrogen. The MD simulations startingfrom CH molecules or C H represent two test caseswhere the C/H ratio is above and below, respectively,that for the C H starting condition. Figure 3 showsthat KMC simulations with both types of features per-form similarly well in reproducing the CH system kinet-ics, with the atomic features having a lower total errorthan the molecular features. Atomic features do seem tohave a relatively larger error for the H time evolution,while molecular features have a similarly larger error forthe C H time evolution. The scenario is different for thereproduction of the MD simulation starting with C H .Now, atomic features result in a much lower total er-ror (by a factor of 2 .
5) than molecular features. Yet, itis noticeable that the majority of the error for atomicfeatures stem from the reproduction of the size of the longest carbon chain. Further analysis of this discrep-ancy is postponed to Section IV.
B. Time extrapolation.
One desirable property on KMC simulations is the abil-ity to accurately extrapolate the results of MD simu-lations to time scales unattainable in MD simulationsdue to their prohibitive computational costs. In order tocompare the ability of atomic and molecular features toperform time extrapolation, KMC models were trainedfor both types of features using only part of the dataextracted from the 100 ps MD simulation of C H . InFig. 4 it is shown the time evolution of KMC modelstrained on the first 10 ps, 30 ps, 50 ps, 70 ps, and 100 psof the MD simulation. The performance of atomic andmolecular features is similar, except for the time evolu-tion of the size of the largest carbon chain. It is clear inthis case that atomic features present reasonable resultswhen learning from simulations as short as 30 ps, whilemolecular features only converge to the MD results whentraining on the entire 100 ps trajectory.Atomic features result in a much more compact repre-sentation of the chemical reactivity of hydrocarbon sys-tems (122 unique reactions compared to 845 unique re-actions for molecular features). Thus, it is reasonableto expect that a KMC model with atomic features canbe parametrized with much less data (i.e., shorter MDsimulations), which explains in part the capacity thatatomic features have shown in Fig. 4 to reproduce 100 psof MD simulations of the growth of the largest carbonchain from only 30 ps of data. Another important factoris that molecular features cannot predict the creation ofmolecules that have not been observed in the MD sim-ulation, limiting its capacity to extrapolate in time thekinetics of growth of large carbon chains. Meanwhile,atomic features can estimate the kinetic rates of reactionsthat have not been observed during MD simulations bybuilding it from its elementary atomic reaction events.The fact that both types of features perform similarlyfor the small molecular species (H , CH , and C H ) ismost likely because reactions resulting in the creation orconsumption of such small molecules are similarly rep-resented in both types of features, resulting in the samereaction rates. For example, the H → H + H chemicalreaction has the exact same reaction rate in the atomicrepresentation or molecular representation.
IV. DISCUSSIONA. Mechanism extraction.
Atomic and molecular features represent chemical ki-netics in different ways. The atomic features frameworkbreaks each molecule into small units composed of one ortwo atoms and information about their nearest neighbors. (a)(b)
Testing set Training set Testing set (c)
FIG. 3. Time evolution of the three most numerous molecules (H , CH , and C H ) and the number of carbons in the longestmolecule. The KMC results are obtained using (a) atomic features and (b) molecular features. The simulations initial statecontained only CH (left), C H (middle), or C H (right) molecules. The KMC simulations were parametrized using asingle MD simulation that started with only C H molecules. Therefore the middle plots in (a) and (b) show how well theKMC reproduced the simulation it was parameterized with (training set in blue), and the left and right plots in (a) and (b)show transferability of the KMC model to initial compositions it was not trained on (testing sets in grey). Results for theMD simulations represent the average of three independent simulations, while KMC results are the average of 20 independentsimulations. (c) Total trajectory error computed according to Eq. (4). Atomic and molecular features reproduce well theresults of the MD simulation for which they were trained on (i.e., starting with only C H ) and are both equally transferable toMD simulations starting with only CH , which has a different C/H ratio. Atomic features result in a more transferable KMCmodel for a system starting with C H molecules, especially for the kinetics of small molecules. Molecular features are lesstransferable (i.e., larger total error), but better reproduce the time evolution of the number of carbon in the longest molecule. (a)(b)(c) FIG. 4. Time extrapolation of MD simulations using KMC with (a) atomic features and (b) molecular features. Time evolutionof the three most numerous molecules (H , CH , and C H ) and number of carbons in the longest molecule for simulationswith initial state containing only C H molecules. The KMC models were parametrized using only the first 30 ps, 50 ps, 70 ps,and 100 ps of a single 100 ps MD simulation. Results for the MD simulations represent the average of three independentsimulations, while KMC results are the average of 20 independent simulations. (c) Total trajectory error computed accordingto Eq. (4). Both types of features learn the kinetics of small molecules at the same pace. However, the atomic features areable to reproduce the growth of large carbon chains much faster than molecular features. This is likely due to the fact thatthe molecular features cannot predict the appearance of molecules it has not observed during its parametrization (i.e. largercarbon chains). Meanwhile, atomic features can estimate the reaction rates of molecular reactions that have not been observedduring training by building such molecular reactions from its elementary atomic reaction events. These units can be common to different molecules, whichallows this framework to capture similarities in reactionsinvolving completely different molecules. Meanwhile, themolecular features framework fundamental unit is themolecule. A reaction is then described as the interactionsamong these fundamental units generating other funda-mental units, without considering the rearrangement ofatomic bonds at any step.There is also a meaningful difference in how chemicalkinetics is reproduced through KMC simulations by the two types of features. With atomic features a KMC sim-ulation is capable of creating and consuming moleculesnever observed in a MD simulation by building their re-action rates from the more elementary atomic reactionevents. Molecular features result in KMC simulationsthat are only able to create and consume molecules thathave been previously observed in MD simulations. Thisdifference allows KMC simulations with atomic featuresto explore a larger variety of chemical reaction pathwayswhen compared to molecular features. Such differencecan become important whenever the system trajectorypasses through bottlenecks in order to reach different re-gions of the chemical space. The growth of a large carbonchain can occur in many different ways that can be con-sidered bottlenecks in the chemical trajectory, becauseeach independent simulation only goes through one spe-cific pathway of all possible ones. For example, one canconceive of a trajectory where small molecules such asCH are added to a steadily growing chain. This trajec-tory is much different from one where two independentcarbon chains grow to a medium size and then merge toform a large chain. It is evident that a simulation wherea chain reaches a determined length can only go throughone of these two trajectories.In order to offer some evidence of this essential differ-ence between atomic and molecular features we have per-formed the chemical kinetics analysis of two independentMD simulations with identical initial chemistries (onlyC H molecules), but the atomic velocities were initial-ized randomly so that the simulation trajectories wouldbe different. The molecular features resulted in a to-tal of 1,426 unique reactions, with only 314 (22 %) ofthose in common among the two identical but indepen-dent simulations (Fig. 5b). These 314 reactions in com-mon account for to 75 % of the total reactions observed,while 68 % of the total 1,426 unique reactions occurredonly once during the entire MD simulation. Meanwhile,atomic features resulted in 153 unique reactions with 105(69 %) of them in common among the two identical butindependent simulations (Fig. 5a). The 105 unique reac-tions in common account for 99 % of the total reactionsobserved, with only 28 % of the total 153 unique reactionsoccurring only once during the entire simulation. Hence,KMC models parametrized using independent MD sim-ulations can be much different when molecular featuresare employed, while models created with atomic featuresare essentially identical.Figure 6 compares the rates of reactions for those reac-tions in common to the two independent MD simulations.The coefficient of determination, R , shows that atomicfeatures result in more similar reaction rates ( R = 0 . R = 0 . .
14 for molecular features and 1 .
96 for atomic features.The redundant and lengthy nature of KMC modelswith molecular features has been acknowledged in theliterature before. For example, Yang et al. [26] and Wu et al. [30] employed techniques such as L1 regulariza-tion and computational singular perturbation to reducethe number of unique reactions space by selectively dis-carding reactions that had small impact on the chemical (a)
Simulation 1(C H ) Simulation 2(C H )3017 Atomic features unique reactions99%of total reactions (b)
Simulation 1(C H ) Simulation 2(C H )581531 Molecular features unique reactions75%of total reactions
FIG. 5. Unique reactions observed from two independent MDsimulations starting from only C H molecules. a) Atomicfeatures result in a more unique and compact representationof the chemical kinetics, with 69 % of unique reactions (rep-resenting 99 % of total reactions observed) in common amongthe two MD simulations. b) Molecular features share only22 % of reactions among the two MD simulations. (a) (b) FIG. 6. Log-scale comparison of the reaction rates in com-mon to two independent MD simulations. The y = x line is shown in red. The coefficient of determination, R ,shows that (a) atomic features result in more similar reactionrates ( R = 0 .
98) when compared to (b) molecular features( R = 0 . kinetics. Employing atomic features can be seen as anapproach to achieve the same goals without discardingany data collected from MD simulations, consequentlymaking better use of the available data and avoiding anyreduction in the accuracy of the reaction rates. B. Predictions comparison.
Kinetic models with atomic and molecular featuresparametrized using a system initiated with C H showexcellent transferability to a system initiated with CH molecules, but not to C H . This happens becauseof different reasons for the different types of features.The atomic features framework overestimates the size ofthe longest polymer chain, while still performing well on0smaller molecules. As discussed in the Sec. II E, when along carbon chain grows, the atomic features frameworkshows its limitations since we suppose it does not incor-porate any information that slows down atomic reactivityin large molecules.The limitations in the transferability of the molecularfeatures to a system initiated with C H are due to thescarcity of C H in the MD simulation that was usedto parametrize the KMC model (i.e., starting with C H only). Thus, the molecular features framework has littledata on the reactivity of C H . Parametrizing a KMCmodel using molecular features can be considered a formof overfitting: the trained model performs well for sys-tems that present only molecules available in the trainingdata set, but it does not extrapolate to molecules it hasnot seen or that appear in small amounts in the train-ing data set. An extreme case of this lack of transfer-ability can be seen in Fig. 7, where the time evolutionof a MD simulation initiated with only C H is com-pared to the predictions of a KMC model parametrizedon a MD simulation started with only C H molecules.C H was chosen because this molecule never appears inthe MD simulation used for training. Thus, the molecu-lar features model has essentially no time evolution as itremains stuck in the initial configuration. Meanwhile, aKMC model employing atomic features is able estimatethe rate of reactions of C H by building it from therate of atomic level events. FIG. 7. Time evolution of C H in a simulation wherethe initial state contained only C H . KMC models wereparametrized on an MD simulation starting with C H only,where C H is never observed. As a consequence, the KMCmodel using molecular features cannot result in any time evo-lution for the chemical species. This extreme case illustratesa limitation of employing molecular features. The model em-ploying atomic features model is capable to estimate the reac-tion mechanisms and reactions rates of C H by building itfrom elementary atomic reactions. Thus, atomic features arestill capable of predict chemical kinetics even in such extremecase. The molecular features framework also presented limi-tations in time extrapolation to predict the growth of thelargest carbon cluster, Fig. 4. As discussed in Sec. III,this occurs because KMC simulations employing molec-ular features are not able to predict the formation ofmolecules that have not been observed in the simulationit was trained on. As a result, a KMC simulation withmolecular features trained on 50 ps of the MD simula-tion cannot grow a molecule with more than about tencarbons in it. Another consequence of this effect is thescarcity of pathways for the growth of longer chains. Fig-ure 8 shows that during the MD simulations used to trainthe KMC model there are two rare events occurring justafter 60 ps that cause the chain size quickly increase from11 to 20 carbons. These events involve the addition oftwo rare molecules to the longest chain, first a moleculewith four carbon atoms and then a molecule with five.This represents a bottleneck for the chemical evolutionthat can be easily missed by the KMC simulation withmolecular features if, for example, all carbon chains inthe system grow past the size of five carbons. There areno other pathways learned by the molecular features thatincrease the size of the system above 11. Therefore, foreach MD simulation can only train a KMC model withone or very few pathways of growth past any specificcarbon chain size. It is evident in Fig. 8 that training aKMC model with atomic features result in no such con-straints on the number of pathways leading to the growthof carbon chains past any size. Atomic features allow thekinetic model to build the reaction mechanism and reac-tion rates of multiply pathways for growth past the sizeof 11 by estimating them from the elementary atomicevents.
V. CONCLUSION
In this paper it is demonstrated that kinetic modelsbuilt using atomic features allow the determination ofthe reaction mechanisms of a complex chemical system,hydrocarbon pyrolysis, and to accurately predict the evo-lution different systems that rely on the same chemistryusing a KMC model. It is shown that atomic featuresresult in more compact kinetic models than molecularfeatures, while being able to predict the appearance ofmolecules not observed during the parametrization pro-cess, which molecular features are not capable of. Atomicfeatures are shown to result in better chemical transfer-ability and time extrapolation due to the ability of ki-netic models based on atomic features to explore multi-ple pathways of chemical evolution by building unknownreaction mechanisms and rates from elementary atomicevents. The framework of atomic features considers onlythe chemical species the reacting atoms and their respec-tive nearest neighbors. This fairly simple description,while powerful, can be easily extended to include otherelaborate non-local features. Although our study focusedon the specific mechanism of hydrocarbon pyrolysis, the1approach developed for the construction of kinetic mod-els can be readily applied to other chemical systems withdifferent levels of complexity.
VI. ACKNOWLEDGEMENTS
Material in this paper is based upon work supportedby the Air Force Office of Scientific Research under awardnumber FA9550-20-1-0397 and the Department of EnergyNational Nuclear Security Administration under AwardNumber DE–NA0002007. (a)(b)
FIG. 8. Time evolution of the number of carbons in thelongest molecule. All KMC simulations (shown in coloredlines) share the same kinetic model that was parametrizedusing a single MD simulation (shown in black) initiated withonly C H molecules. The KMC curves are trained on 70 psof this simulation (shown as the dashed black line) with (a)atomic features and (b) molecular features. The two quickchanges in size that happen at around 60 ps in the MD sim-ulation are rare events that represent a bottleneck for thegrowth of the largest carbon chain. KMC models with molec-ular features can only grow carbon chains past the size of 11by reproducing these two singular rare events because thatis the only known pathway. Thus, if the conditions for therare events is missed the length of the longest carbon chainin KMC simulations using molecular features will be limitedto about 11. It is clear that the atomic features do not suf-fer from this limitation since they allow the construction oflonger carbon chains through multiple pathways not observedin MD simulations. These pathways are found by building thereaction mechanisms and rates from elementary atomic reac-tions. [1] M. R. Harper, K. M. Van Geem, S. P. Pyl, G. B. Marin,and W. H. Green, Comprehensive reaction mechanismfor n-butanol pyrolysis and combustion, Combustion andFlame , 16 (2011).[2] C. K. Westbrook, W. J. Pitz, O. Herbinet, H. J. Cur-ran, and E. J. Silke, A comprehensive detailed chemicalkinetic reaction mechanism for combustion of n-alkanehydrocarbons from n-octane to n-hexadecane, Combus-tion and flame , 181 (2009).[3] M. Ross, The ice layer in uranus and neptune—diamondsin the sky?, Nature , 435 (1981).[4] D. Kraus, J. Vorberger, A. Pak, N. Hartley, L. Fletcher,S. Frydrych, E. Galtier, E. Gamboa, D. Gericke, S. Glen-zer, et al. , Formation of diamonds in laser-compressed hy-drocarbons at planetary interior conditions, Nature As-tronomy , 606 (2017).[5] F. Codari, S. Lazzari, M. Soos, G. Storti, M. Morbidelli,and D. Moscatelli, Kinetics of the hydrolytic degradationof poly (lactic acid), Polymer degradation and stability , 2460 (2012).[6] L.-P. Wang, A. Titov, R. McGibbon, F. Liu, V. S. Pande,and T. J. Mart´ınez, Discovering chemistry with an abinitio nanoreactor, Nature chemistry , 1044 (2014).[7] J. Meisner, X. Zhu, and T. J. Mart´ınez, Computationaldiscovery of the origins of life (2019).[8] K. L. Joshi, S. Raman, and A. C. van Duin, Connectivity-based parallel replica dynamics for chemically reactivesystems: from femtoseconds to microseconds, The Jour-nal of Physical Chemistry Letters , 3792 (2013).[9] A. F. Voter, Parallel replica method for dynamics of in-frequent events, Physical Review B , R13985 (1998).[10] A. F. Voter, Hyperdynamics: Accelerated molecular dy-namics of infrequent events, Physical Review Letters ,3908 (1997).[11] D. Hamelberg, J. Mongan, and J. A. McCammon, Ac-celerated molecular dynamics: a promising and effi-cient simulation method for biomolecules, The Journalof chemical physics , 11919 (2004).[12] A. C. Pan, T. M. Weinreich, S. Piana, and D. E. Shaw,Demonstrating an order-of-magnitude sampling enhance-ment in molecular dynamics simulations of complex pro-tein systems, Journal of chemical theory and computa-tion , 1360 (2016).[13] W. Liu, B. Schmidt, G. Voss, and W. M¨uller-Wittig, Ac-celerating molecular dynamics simulations using graphicsprocessing units with cuda, Computer Physics Commu-nications , 634 (2008).[14] E. Chen, Q. Yang, V. Dufour-D´ecieux, C. A. Sing-Long,R. Freitas, and E. J. Reed, Transferable kinetic montecarlo models with thousands of reactions learned frommolecular dynamics simulations, The Journal of Phys-ical Chemistry A , 1874 (2019), pMID: 30735373,https://doi.org/10.1021/acs.jpca.8b09947.[15] X.-M. Cheng, Q.-D. Wang, J.-Q. Li, J.-B. Wang, andX.-Y. Li, Reaxff molecular dynamics simulations of oxi-dation of toluene at high temperatures, The Journal ofPhysical Chemistry A , 9811 (2012).[16] M. D¨ontgen, M.-D. Przybylski-Freund, L. C. Kr¨oger,W. A. Kopp, A. E. Ismail, and K. Leonhard, Automateddiscovery of reaction pathways, rate constants, and tran-sition states using reactive molecular dynamics simula- tions, Journal of chemical theory and computation ,2517 (2015).[17] Z. He, X.-B. Li, L.-M. Liu, and W. Zhu, The intrinsicmechanism of methane oxidation under explosion con-dition: A combined reaxff and dft study, Fuel , 85(2014).[18] G. Yan, Z. Zhang, and K. Yan, Reactive molecular dy-namics simulations of the initial stage of brown coal oxi-dation at high temperatures, Molecular Physics , 147(2013).[19] N. L¨ummen, Reaxff-molecular dynamics simulations ofnon-oxidative and non-catalyzed thermal decompositionof methane at high temperatures, Physical ChemistryChemical Physics , 7883 (2010).[20] L. Liu, C. Bai, H. Sun, and W. A. Goddard III, Mech-anism and kinetics for the initial steps of pyrolysisand combustion of 1, 6-dicyclopropane-2, 4-hexyne fromreaxff reactive dynamics, The Journal of Physical Chem-istry A , 4941 (2011).[21] C. Bai, L. Liu, and H. Sun, Molecular dynamics simu-lations of methanol to olefin reactions in hzsm-5 zeoliteusing a reaxff force field, The Journal of Physical Chem-istry C , 7029 (2012).[22] J. Liu, X. Li, L. Guo, M. Zheng, J. Han, X. Yuan, F. Nie,and X. Liu, Reaction analysis and visualization of reaxffmolecular dynamics simulations, Journal of MolecularGraphics and Modelling , 13 (2014).[23] T. Cheng, A. Jaramillo-Botero, W. A. Goddard III, andH. Sun, Adaptive accelerated reaxff reactive dynamicswith validation from simulating hydrogen combustion,Journal of the American Chemical Society , 9434(2014).[24] J. Zeng, L. Cao, C.-H. Chin, H. Ren, J. Z. Zhang, andT. Zhu, Reacnetgenerator: an automatic reaction net-work generator for reactive molecular dynamics simula-tions, Physical Chemistry Chemical Physics (2020).[25] J. Liang, W. Jia, Y. Sun, and Q.-D. Wang, Skeletal chem-ical kinetic mechanism generation for methanol combus-tion and systematic analysis on the ignition characteris-tics, Asia-Pacific Journal of Chemical Engineering , e2434(2020).[26] Q. Yang, C. A. Sing-Long, and E. J. Reed, Learning re-duced kinetic monte carlo models of complex chemistryfrom molecular dynamics, Chem. Sci. , 5781 (2017).[27] Q. Yang, C. A. Sing-Long, and E. J. Reed, L1regularization-based model reduction of complex chem-istry molecular dynamics for statistical learning of kineticmonte carlo models, MRS Advances , 1767 (2016).[28] Q. Yang, C. A. Sing-Long, E. Chen, and E. J. Reed, Data-driven methods for building reduced kinetic monte carlomodels of complex chemistry from molecular dynamicssimulations, in Computational Approaches for ChemistryUnder Extreme Conditions (Springer, 2019) pp. 209–227.[29] Q. Yang, C. Sing-Long, and E. Reed, Rapid data-drivenmodel reduction of nonlinear dynamical systems includ-ing chemical reaction networks using l1-regularization,Chaos: An Interdisciplinary Journal of Nonlinear Science , 053122 (2020).[30] Y. Wu, H. Sun, L. Wu, and J. D. Deetz, Extract-ing the mechanisms and kinetic models of complexreactions from atomistic simulation data, Jour- nal of Computational Chemistry , 1586 (2019),https://onlinelibrary.wiley.com/doi/pdf/10.1002/jcc.25809.[31] S. Plimpton, Fast parallel algorithms for short-rangemolecular dynamics, Journal of Computational Physics , 1 (1995).[32] LAMMPS Molecular Dynamics Simulator, http://lammps.sandia.gov (1995).[33] A. C. Van Duin, S. Dasgupta, F. Lorant, and W. A.Goddard, Reaxff: a reactive force field for hydrocarbons,The Journal of Physical Chemistry A , 9396 (2001).[34] H. M. Aktulga, C. Knight, P. Coffman, K. A. O’Hearn,T.-R. Shan, and W. Jiang, Optimizing the performanceof reactive molecular dynamics simulations for many-corearchitectures, The International Journal of High Perfor-mance Computing Applications , 304 (2019).[35] H. Aktulga, J. Fogarty, S. Pandit, and A. Grama, Par-allel reactive molecular dynamics: Numerical methodsand algorithmic techniques, Parallel Computing , 245(2012).[36] T. R. Mattsson, J. M. D. Lane, K. R. Cochrane, M. P.Desjarlais, A. P. Thompson, F. Pierce, and G. S. Grest,First-principles and classical molecular dynamics simu-lation of shocked polymers, Phys. Rev. B , 054103(2010).[37] G. J. Martyna, D. J. Tobias, and M. L. Klein, Constantpressure molecular dynamics algorithms, The Journal ofchemical physics , 4177 (1994). [38] M. Parrinello and A. Rahman, Polymorphic transitionsin single crystals: A new molecular dynamics method,Journal of Applied physics , 7182 (1981).[39] M. E. Tuckerman, J. Alejandre, R. L´opez-Rend´on, A. L.Jochim, and G. J. Martyna, A liouville-operator derivedmeasure-preserving integrator for molecular dynamicssimulations in the isothermal–isobaric ensemble, Journalof Physics A: Mathematical and General , 5629 (2006).[40] W. Shinoda, M. Shiga, and M. Mikami, Rapid estima-tion of elastic constants by molecular dynamics simula-tion under constant stress, Physical Review B , 134103(2004).[41] A. Dullweber, B. Leimkuhler, and R. McLachlan, Sym-plectic splitting methods for rigid body molecular dy-namics, The Journal of chemical physics , 5840(1997).[42] T. Qi and E. J. Reed, Simulations of shocked methane in-cluding self-consistent semiclassical quantum nuclear ef-fects, The Journal of Physical Chemistry A , 10451(2012).[43] J. A. Rice, Mathematical statistics and data analysis, Sec-tion 8.5.3 (Nelson Education, 2006).[44] D. T. Gillespie, A general method for numerically sim-ulating the stochastic time evolution of coupled chemi-cal reactions, Journal of computational physics , 403(1976).[45] D. J. Higham, Modeling and simulating chemical reac-tions, SIAM review50