Machine Learning in QM/MM Molecular Dynamics Simulations of Condensed-Phase Systems
Lennard Böselt, Moritz Thürlemann, and Sereina Riniker∗
Laboratory of Physical Chemistry, ETH Zurich, Vladimir-Prelog-Weg 2, 8093 Zurich, Switzerland
Abstract
Quantum mechanics/molecular mechanics (QM/MM) molecular dynamics (MD) simulations have been developed to simulate molecular systems for which an explicit description of changes in the electronic structure is necessary. However, QM/MM MD simulations are computationally expensive compared to fully classical simulations, as all valence electrons are treated explicitly and a self-consistent field (SCF) procedure is required. Recently, approaches have been proposed to replace the QM description with machine-learned (ML) models. However, condensed-phase systems pose a challenge for these approaches due to long-range interactions. Here, we establish a workflow that incorporates the MM environment as an element type in a high-dimensional neural network potential (HDNNP). The fitted HDNNP describes the potential-energy surface of the QM particles with an electrostatic embedding scheme. Thus, the MM particles feel a force from the polarized QM particles. To achieve chemical accuracy, we find that even simple systems require models with a substantial number of parameters, posing the risk of overfitting. To address this issue, we extend our approach to a ∆-learning scheme, where the ML model learns the difference between a reference method (DFT) and a cheaper semi-empirical method (DFTB). We show that such a scheme reaches the accuracy of the DFT reference method while requiring significantly fewer parameters. Furthermore, the ∆-learning scheme is capable of correctly incorporating long-range interactions within a cutoff of 1.4 nm. It is validated by performing MD simulations of retinoic acid in water and of the interaction between S-adenosylmethionine and cytosine in water. The presented results indicate that ∆-learning is a promising approach for (QM)ML/MM MD simulations of condensed-phase systems.

Introduction

Classical fixed-charge force fields (FF) are readily used to perform molecular dynamics (MD) simulations of condensed-phase systems.
They consist of a relatively small number of parameters, such as bond-stretching, bond-angle bending, and torsional dihedral terms, partial charges, and Lennard-Jones parameters. They are partially fitted against experimental values such as the density, heat of vaporisation, and solvation free energy.
FF are the gold standard for simulations over long time scales of systems for which long-range interactions are essential. In classical simulations, averaged properties are computed by neglecting electron rearrangement. The parameterization of FF requires the availability of sufficient experimental data. On the other hand, quantum mechanics/molecular mechanics (QM/MM)
MD simulations can provide a valuable alternative to classical FF simulations when changes in the electronic structure are important or if reliable force-field parameters are not available.

∗ Corresponding author: email: [email protected], ORCID: 0000-0003-1893-4031

In the QM/MM scheme, the QM zone, simulated with density functional theory (DFT) or ab initio methods, is placed into a classical environment (MM zone). This approach permits the simulation of the electronic structure of small systems in more realistic surroundings. Most crucial in QM/MM simulations is the description of the interaction between the QM and MM zones. This interaction can be based either on (i) mechanical constraints (i.e. the "mechanical embedding" scheme), or (ii) electronic perturbations (i.e. "electrostatic embedding"). In mechanical embedding, the particles in the QM zone are assigned partial charges, which then interact with the MM zone on a classical level. Thus, the MM particles do not interact with the electron density of the QM solute but rather with point charges on a classical level. Mechanical embedding favours efficient implementations; however, a major drawback is that a set of classical parameters has to be determined for the QM zone, which is not trivial and sometimes not possible with sufficient accuracy. In electrostatic embedding, the MM environment is incorporated into the Hamiltonian operator of the QM zone as one-electron operators. Thus, the electron density of the QM zone is perturbed by the MM environment, and the MM zone in turn interacts with the perturbed electronic structure of the QM solute. In other words, the MM particles feel a force from the MM-polarized QM solute.
The advantages of electrostatic embedding are that no partial charges have to be assigned for the computation of the interactions between the MM and the QM zones, and that the description of the interaction is physically better motivated compared to mechanical embedding. A well-known limitation is the neglect of polarization of the MM environment (unless polarizable FF are used). Furthermore, the computation is more expensive compared to mechanical embedding, and it is not clear whether the partial charges of the MM zone are suitable for inclusion as one-electron Hamiltonians in the QM zone. Generally, an electrostatic embedding scheme is to be preferred as it has been shown to be more accurate, and it thus reflects the current standard protocol for QM/MM simulations.
In QM/MM, the description of the QM zone becomes the computational bottleneck as it requires an expensive self-consistent field (SCF) procedure and the explicit treatment of all valence electrons. To partially circumvent these issues, semi-empirical methods can be used to describe the QM zone. This extends the accessible time scales but also reduces the accuracy. An alternative is to employ machine-learned (ML) potentials.

In recent years, a lot of research effort has been invested into the development of ML models trained on the potential-energy surface (PES) of QM systems. This enables ML-MD simulations at an accuracy level close to that of the electronic-structure method chosen to generate the training set. The costs of the resulting ML-MD simulations are reduced drastically compared to a normal DFT or ab initio MD simulation, as an SCF procedure is no longer necessary and the valence electrons do not have to be treated explicitly. A large number of approaches have been reported in the literature to achieve this task. These approaches differ in:

• Use case: e.g. gas phase or periodic box, small compounds or larger compounds, system-specific or system-unspecific
• Descriptor used to encode the chemical structure: e.g. Coulomb matrices, distance matrices, low-order polynomials, or so-called symmetry functions
• Target output: e.g. energies or gradients
• Scope: e.g. global or local
• ML method: e.g. neural networks or kernel methods
• Restrictions applied to the final form of the PES or the wave function

However, these differences will not be the subject of the current study. All of the approaches listed above at least partially generate the correct parity behaviour of the system. In other words, swapping two elements of the same type and rotating or translating the system leaves the output unchanged or changes it equivariantly. The usage of the ML models heavily depends on the specific use case. In the following, selected approaches are discussed in more detail.

Permutationally invariant polynomial neural networks (PIP) are suitable for accurately describing reactions of small systems in the gas phase (less than 5 atoms). Here, low-order polynomials, which depend on the internuclear distances of the compounds, are used as a descriptor for rigorously introducing the correct parity behaviour for the system studied. The neural networks are trained on the total energy of the system and are global. Symmetric gradient-domain machine learning (sGDML) targets systems up to 30 atoms in the gas phase, which do not undergo a change in topology. In this case, the compound is encoded as an inverse distance matrix. The latter is fed into a kernel ridge regression model and trained in the gradient domain, i.e. the target output is the gradient of the system. The energy can be obtained by integrating the forces. The definition of the kernel used ensures that the integrated forces give rise to a continuously differentiable PES.
The models can be employed for spectroscopic studies.

Larger compounds and box simulations can be tackled with high-dimensional neural network potentials (HDNNP). The environment of each atom is encoded using symmetry functions. Simple examples of symmetry functions are Gaussians that depend on the interatomic distance, or trigonometric functions. They map the input to a higher dimension, which is subsequently summed up. The summation operator and the mapping to higher dimensions using symmetry functions introduce the correct parity behaviour, while still allowing the HDNNP to learn the PES accurately. Using symmetry functions, Smith et al. showed that a universal force field (ANI-1) can be constructed, which is capable of describing unseen compounds containing the same elements. The model was trained in a system-unspecific manner on gas-phase data. Such models can be employed in the drug-discovery process. For periodic-box simulations, HDNNP can be trained system-specifically on previously sampled data points.
The sampling of the data points can be done at a lower level of theory, as long as the trajectory samples all structures important for learning the PES. The data points are then re-computed at the desired level, and the HDNNP is trained to interpolate between the training data points. These models employ a cutoff for the symmetry functions and are thus local. They are trained on the energy and optionally use a gradient correction. HDNNP based on symmetry functions motivated other groups to apply a similar approach for different target outputs such as gradients, or with different ML models such as kernel methods. The uniqueness of the descriptor has been discussed in theoretical work.

Another important contribution to the field is the concept of ∆-learning. Here, the ML model is trained to reproduce the difference between two QM calculations, an expensive higher-level and a cheaper lower-level method. Different ML methods and descriptors can be used. Subsequently, the output on the higher level can be partially recovered by performing the calculation with the cheap method and applying the trained ML model. A good example of the ∆-learning scheme was proposed by Shen et al., where a modified HDNNP was trained on the energy difference between a semi-empirical and an expensive QM method. It was used to reweight the free-energy profiles of reactions obtained with a QM/MM approach. More recently, the same authors introduced a model which can be used to perform MD simulations. The approach requires as input for the ML model a reaction coordinate and the partial charges from the lower-level method. In our opinion and that of others, HDNNP have proven to be the most suitable method for performing system-specific periodic-box simulations.
However, their usage in the simulation of condensed-phase systems with biological relevance is still hampered because such systems typically involve:

1. Large number of element types: HDNNP input descriptors scale exponentially with the number of element types.

2. Many rotatable bonds and/or no symmetry, large system size: The quality of an HDNNP model depends on a densely sampled training set. The number of different possible system configurations scales exponentially with the number of rotatable bonds, which renders sufficient sampling for the training set increasingly challenging. In principle, if the ML model is able to learn the underlying physics, it could extrapolate to unseen system configurations. However, to the best of our knowledge, this has only partially been achieved yet. The models which achieve this task partially employ a small cutoff for the multi-body (0.32 nm) and two-body terms (0.52 nm).

3. Important long-range interactions: HDNNP make use of a locality ansatz, which fails to describe long-range electrostatics. Long-range interactions can partially be introduced via a charge partition scheme and a charge summation scheme (e.g. Ewald summation). In QM/MM, this would correspond to a mechanical embedding scheme. ML approaches have been proposed to specifically target this issue, but these are to the best of our knowledge not yet applicable to simulations of large condensed-phase systems.

4. Large cutoff: Classical force fields typically work with a cutoff of 1.0 to 1.4 nm for pairwise nonbonded interactions. Such large cutoffs are necessary to achieve the desired accuracy for bulk properties such as the heat of vaporization or the solvation free energy.

5. Long time scales: Although a prediction with the ML model is faster than a QM calculation, a small integration step (i.e. 0.5 fs) is still required for simulations. Furthermore, obtaining the gradients from an ML potential is usually significantly slower than calculating the gradients with a classical force field.

In addition, HDNNP are generally plagued by two issues: (i) generating training sets can imply a significant time investment, and (ii) the frequently used fully connected neural networks are prone to overfitting, especially if a high number of weights is necessary to converge to chemical accuracy.

A promising alternative for condensed-phase systems is therefore the combination of HDNNP and classical FF in a QM/MM-type approach, where the HDNNP is used to simulate the QM particles and to compute the interactions between the MM and the QM zone with electrostatic embedding. By using HDNNP for the QM zone, longer time scales become accessible than with standard QM/MM simulations. Such a hybrid approach requires the training data set to be generated with a QM/MM scheme. In other words, the MM environment is incorporated as one-electron Hamiltonians in the QM reference calculation. With such an approach, the complexity of the generation of the training set is reduced drastically as not all atoms are treated on a QM level. Furthermore, it permits the use of a larger cutoff. Including long-range interactions directly in the ML model is therefore possible.

In this work, we assess the concept of (QM)ML/MM MD simulations for condensed-phase systems. To enable electrostatic embedding, we integrate the MM environment directly as an additional element type in the HDNNP. The symmetry functions including MM particles are weighted with their respective partial charge.
Two-body and three-body symmetry functions are included for the description of the MM environment. The symmetry functions used follow the proposal of Smith et al. and are adapted for larger cutoffs. In the first part, we explore the effect of different parameters on the accuracy of (QM)ML/MM calculations of small molecules in water. The parameters include the application of a gradient correction, the inclusion of three-body terms, and the cutoff distance for the descriptors. In the second part, we extend the methodology to a ∆-learning scheme with a semi-empirical method. Semi-empirical methods are known to recover long-range interactions well, an issue ML models with a locality ansatz usually struggle with. The approach is validated using different molecular systems. Finally, MD simulations are performed for retinoic acid in water (50 atoms in the QM zone and approximately 2500 classical partial charges), and for the interaction of SAM with cytosine (63 atoms in the QM zone and approximately 3000 partial charges). The results are compared to standard QM/MM simulations.

Theory

In non-relativistic Born-Oppenheimer MD simulations, the potential energy $V$ of the system is a function of all Cartesian coordinates of the system,

$$V_{\vec{r}}(\vec{R}) \equiv V , \qquad (1)$$

where $\vec{R}$ are the Cartesian coordinates of all nuclei, and $\vec{r}$ are the Cartesian coordinates of the electrons. The subscript $\vec{r}$ indicates that the electronic configuration is treated under the Born-Oppenheimer approximation. According to Newton's second law, the gradient on the nuclei of the system can be obtained by taking the derivative with respect to the coordinates $\vec{R}$,

$$-\frac{\partial V_{\vec{r}}(\vec{R}_i)}{\partial \vec{R}_i} = \vec{F}_i = m_i \frac{\mathrm{d}\vec{v}_i}{\mathrm{d}t} . \qquad (2)$$

In Eq. (2), $\vec{F}_i$ is the force on particle $i$, $\vec{R}_i$ its coordinates, $m_i$ its mass, $\vec{v}_i$ its velocity, and $t$ is the time. The system can then be propagated in time $t$ using an integration algorithm such as the leap-frog algorithm. $V$ can be computed by solving the Schrödinger equation,

$$\hat{H}\,\psi_{\vec{r}}(\vec{R}) = V_{\vec{r}}(\vec{R})\,\psi_{\vec{r}}(\vec{R}) . \qquad (3)$$

In Eq. (3), $\hat{H}$ is the full Hamiltonian operator, and $\psi$ is the time-independent Born-Oppenheimer approximated wave function of the system. Solving for $V$ is computationally unfeasible for large systems. Hence, approximated solutions to Eq. (3) were developed, which can be grouped into four classes: (1) Ab initio methods use the correct Hamiltonian and an approximated wave function $\psi$. (2) Density functional theory (DFT) computes the electron density $\rho_{\vec{r}}(\vec{R})$, which is used to calculate the energy of the molecular system, i.e. $V_{\vec{r}}[\rho_{\vec{r}}(\vec{R})]$. Modern DFT approaches achieve an accuracy comparable to ab initio methods (or above), while still being able to simulate larger systems. However, DFT requires a treatment of all valence electrons and an SCF procedure, which makes it unfeasible to perform long MD simulations of large systems. (3) Semi-empirical methods use an approximated Hamiltonian $\hat{H}$ and correct for the error made by introducing a set of fitted parameters. The most prominent example is the PM7 method. Another attractive alternative is density functional tight binding (DFTB). In DFTB, the electron density $\rho$ is expanded in a series around a reference density $\rho_0$ and truncated. The error made is corrected with fitted parameters, similarly to other semi-empirical methods. (4) The last class includes empirical models with effective parameters such as classical force fields.
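The propagation of Eq. (2) with the leap-frog algorithm mentioned above can be sketched in a few lines. This is a minimal illustration on a one-dimensional harmonic potential as a stand-in for an actual PES, not the GROMOS implementation; all numerical values are toy choices:

```python
def leapfrog_step(r, v_half, force, m, dt):
    """One leap-frog step of Eq. (2): velocities live at half time steps."""
    v_new = v_half + force(r) / m * dt   # v(t+dt/2) = v(t-dt/2) + F(t)/m * dt
    r_new = r + v_new * dt               # r(t+dt)   = r(t) + v(t+dt/2) * dt
    return r_new, v_new

# Toy PES: 1D harmonic oscillator, F(r) = -dV/dr = -k*r (stand-in for -dV/dR)
k, m, dt = 1.0, 1.0, 0.01
force = lambda r: -k * r

r, v = 1.0, 0.0
for _ in range(1000):
    r, v = leapfrog_step(r, v, force, m, dt)

# The symplectic integrator approximately conserves the total energy
E = 0.5 * k * r**2 + 0.5 * m * v**2
assert abs(E - 0.5) < 1e-2
```

Because leap-frog is symplectic, the total energy stays bounded near its initial value of 0.5 over the whole trajectory, which is the property that makes such integrators the standard choice in MD.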
Classical Force Fields

In classical fixed-charge MD simulations, the potential energy $V_N$ of the system is calculated with a force field, which is the sum of bonded and nonbonded interaction terms,

$$V_N(\vec{R}) = V_{\text{bond}} + V_{\text{angle}} + V_{\text{dihedral}} + V_{\text{el}} + V_{\text{vdW}} , \qquad (4)$$

where $V_{\text{bond}}$ is the contribution of all covalent bonds, $V_{\text{angle}}$ that of all covalent angles, and $V_{\text{dihedral}}$ that of all covalent dihedrals. The nonbonded terms consist of electrostatic ($V_{\text{el}}$) and van der Waals ($V_{\text{vdW}}$) interactions. The parameters of these interaction terms are typically fitted to reproduce experimental and/or QM reference data. Eq. (4) is inexpensive to solve and depends only on the number of nuclei (atoms). However, fixed-charge force fields treat electronic effects only in an averaged field, and can thus not be used to simulate chemical reactions. The simplicity of Eq. (4) allows the use of relatively large cutoffs, and methods are available to include long-range interactions beyond a given cutoff.

QM/MM is a hybrid approach that combines a QM subsystem with an MM environment, thus striking a balance between cost and accuracy. The challenge is to describe the interactions between the parts appropriately. Here, the question is how to obtain the total energy $V_{\text{total}}$ of the system, which is now a combination of the QM subsystem $V^{\text{QM}}$ and the MM surrounding $V_{N,\text{MM}}$. A first approach to compute the total energy of the system is a subtractive scheme,

$$V_{\text{total}} = V^{\text{QM}}_{\vec{r}}(\vec{R}) + V_{N,\text{total}}(\vec{R}) - V_{N,\text{QM}}(\vec{R}) , \qquad (5)$$

where $V^{\text{QM}}_{\vec{r}}(\vec{R})$ is the energy of the QM subsystem, $V_{N,\text{total}}(\vec{R})$ the energy of the complete system calculated with the classical force field, and $V_{N,\text{QM}}(\vec{R})$ the energy of the QM zone calculated with the force field. The more prominent alternative is the additive scheme, which is also used in this work, i.e.

$$V_{\text{total}} = V^{\text{QM}}_{\vec{r}} + V_{N,\text{MM}} + V^{\text{QM-MM}}_{\vec{r}} . \qquad (6)$$

In this case, the force field is used to calculate the energy of the MM region, while the interaction between the QM and MM parts ($V^{\text{QM-MM}}$) is added to the energy expression. The latter term can be computed either via a mechanical embedding or an electrostatic embedding scheme. In mechanical embedding, the particles in the QM subsystem are assigned partial charges, which then interact with the MM surrounding. However, as already discussed in the Introduction, this scheme is inferior to electrostatic embedding, where the Hamiltonian operator $\hat{H}$ is extended as (in atomic units),

$$\hat{H}^{\text{QM/MM}} = \hat{H} - \sum_i^{N_{\text{el}}} \sum_j^{N_{\text{MM}}} \frac{q_j}{|\vec{r}_i - \vec{R}_j|} + \sum_i^{N_{\text{QM}}} \sum_j^{N_{\text{MM}}} \frac{q_j Q_i}{|\vec{R}_i - \vec{R}_j|} . \qquad (7)$$

Here, $N_{\text{el}}$ is the number of QM electrons in the system, $N_{\text{MM}}$ the number of partial charges, $N_{\text{QM}}$ the number of QM atoms, $q_j$ the partial charge of MM atom $j$, $\vec{r}_i$ the coordinates of electron $i$, $\vec{R}_j$ the coordinates of MM atom $j$, and $\vec{R}_i$ the coordinates of QM atom $i$. The QM subsystem is thus directly influenced by the MM partial charges, while the latter feel a force from the perturbed QM zone. The energy expression of the complete system can then be written as,

$$(\hat{H}^{\text{QM/MM}} + V_{N,\text{MM}})\,\psi_{\vec{r}} = (V^{\text{QM}}_{\vec{r}} + V^{\text{QM/MM}}_{\vec{r}} + V_{N,\text{MM}})\,\psi_{\vec{r}} = V_{\text{total}}\,\psi_{\vec{r}} . \qquad (8)$$

Typically, not all MM partial charges are included in the summation in Eq. (7), but only those within a cutoff radius $r$. This leads to a PES which is non-continuous at the cutoff. This issue can be (partially) resolved by adaptive resolution schemes. It has been found that the cutoff radius $r$ needs to be chosen relatively large (i.e. 1.4 nm) to converge to accurate results. Furthermore, several approaches have been proposed to scale, include, or smear the classical partial charges included in the Hamiltonian $\hat{H}^{\text{QM/MM}}$.
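To make the cutoff truncation of Eq. (7) concrete, the following sketch evaluates only the classical nucleus/point-charge term of Eq. (7) in atomic units, including only MM charges within the cutoff. This is an illustration with made-up toy coordinates and charges, not the embedding code used in this work:

```python
import numpy as np

def qmmm_point_charge_energy(R_qm, Q_qm, R_mm, q_mm, r_cut):
    """Nucleus/point-charge term of Eq. (7), sum_ij q_j*Q_i/|R_i - R_j|,
    restricted to MM charges within the cutoff (hence the discontinuous PES
    noted in the text)."""
    E = 0.0
    for Ri, Qi in zip(R_qm, Q_qm):
        d = np.linalg.norm(R_mm - Ri, axis=1)   # distances to all MM charges
        mask = d <= r_cut                       # hard cutoff selection
        E += Qi * np.sum(q_mm[mask] / d[mask])
    return E

# Two QM nuclei and three MM charges; the far charge is excluded by the cutoff
R_qm = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
Q_qm = np.array([1.0, 1.0])
R_mm = np.array([[0.0, 2.0, 0.0], [1.0, 2.0, 0.0], [0.0, 50.0, 0.0]])
q_mm = np.array([-0.5, -0.5, -0.5])
E = qmmm_point_charge_energy(R_qm, Q_qm, R_mm, q_mm, r_cut=26.5)  # ~1.4 nm in bohr
```

Moving an MM charge across the cutoff boundary changes `mask` discretely, which is exactly the discontinuity that the adaptive resolution schemes mentioned above try to mitigate.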
Supervised ML methods can be used to learn a function $f_P$ which describes the relation $f_P: h(\vec{R}) \rightarrow \tilde{V}_{\vec{r}}(\vec{R})$. The tilde denotes that the quantity is estimated. $P$ is the set of parameters introduced by the ML model. The idea of these ML approaches is to sample configurations $\vec{R}$ on a PES to obtain the potential energy $V_{\vec{r}}(\vec{R})$, and then fit the parameters $P$ in the function $f_P$ to obtain an accurate description $\tilde{V}_{\vec{r}}(\vec{R})$. The configurations $\vec{R}$ can be transformed using a function $h$ before being fed into the ML model. An example would be a distance transformation (distance matrix). The parameters $P$ of the ML model are fitted such that the loss $L_P(V_{\vec{r}}(\vec{R}), \tilde{V}_{\vec{r}}(\vec{R})) = g_c(V_{\vec{r}}(\vec{R}), \tilde{V}_{\vec{r}}(\vec{R}))$ becomes minimal, i.e. $P = \operatorname{argmin}_P \, g_c(V_{\vec{r}}(\vec{R}), \tilde{V}_{\vec{r}}(\vec{R}))$, where $g_c$ is an arbitrary convex function. Examples are the mean squared error and the mean absolute error. $\tilde{V}_{\vec{r}}(\vec{R})$ can subsequently be used to propagate an MD trajectory.

Different target outputs can be chosen. Instead of learning the complete potential $\tilde{V}_{\vec{r}}(\vec{R})$, it is possible to learn only the difference between two levels of theory (e.g. semi-empirical and DFT),

$$\Delta V_{\vec{r}}(\vec{R}) = V^{\text{expensive}}_{\vec{r}}(\vec{R}) - V^{\text{cheap}}_{\vec{r}}(\vec{R}) . \qquad (9)$$

The function $f_P$ thus becomes a mapping $f_P: h(\vec{R}) \rightarrow \Delta\tilde{V}_{\vec{r}}(\vec{R})$. Alternatively, it is possible to directly learn the gradient, $f_P: h(\vec{R}) \rightarrow \partial V_{\vec{r}}(\vec{R})/\partial \vec{R}$. An obstacle when learning gradients is that it must be ensured that the integrated forces recover the potential, $-\int \mathrm{d}\vec{R}\, \partial V_{\vec{r}}(\vec{R})/\partial \vec{R} = V$. This can for example be achieved with the sGDML method. As a third option, information about the gradients can be included in the training process.
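The construction of ∆-learning targets according to Eq. (9), together with an optional gradient term in the loss, can be sketched as follows. The "cheap" and "expensive" functions are toy analytic stand-ins for the semi-empirical and DFT calculations, not actual electronic-structure calls:

```python
import numpy as np

# Toy stand-ins for a cheap (DFTB-like) and an expensive (DFT-like) method
V_cheap     = lambda R: np.sum(R**2, axis=-1)               # crude PES
V_expensive = lambda R: np.sum(R**2 + 0.1 * R**4, axis=-1)  # refined PES

R = np.random.default_rng(0).normal(size=(100, 3))  # sampled configurations

# Delta-learning target, Eq. (9): the ML model only learns the correction
dV = V_expensive(R) - V_cheap(R)

def loss(V_ref, V_pred, F_ref=None, F_pred=None, w=0.1):
    """Convex loss g_c (here: MSE) with an optional gradient/force term."""
    L = np.mean((V_ref - V_pred) ** 2)
    if F_ref is not None:
        L += w * np.mean((F_ref - F_pred) ** 2)
    return L

assert np.all(dV >= 0.0)    # here the correction 0.1*R^4 is non-negative
assert loss(dV, dV) == 0.0  # a perfect model gives zero loss
```

The correction surface `dV` is typically much smoother than the full PES, which is why the ∆-model can get away with fewer parameters than a model trained on the full potential.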
In other words, an additional loss term is added to account for the gradients, $L_P(V, \tilde{V}) = g_c(V, \tilde{V}) + g_c\!\left(\partial V/\partial \vec{R}, \partial \tilde{V}/\partial \vec{R}\right)$. The concepts described above are independent of the ML method used, and the transformation $h$ is applied beforehand. It is essential that $h$ does not remove information relevant for the target output.

Feed-forward neural networks (FNN) consist of a number of sequentially chained (non-)linear functions $\sigma_l$, weight matrices $W_l$, and biases $b_l$, where $l$ is the number of layers. An FNN can be defined recursively as,

$$x_{l+1} = \sigma_l(W_l x_l + b_l) , \qquad (10)$$

where $x_0$ is the input vector. The weight matrices and biases correspond to the parameters $P$. In addition, there are user-defined hyperparameters such as the number of layers $l$, the number of neurons per layer, and the type of functions $\sigma_l$. FNN are the method of choice in cases where a memory-intensive data set is required and complicated transformations are necessary in order to learn the underlying relation between the input and the output $\tilde{V}$. Moreover, they provide more flexibility compared to kernel ridge regression and Gaussian processes. It has been observed in Kaggle contests that two hidden layers usually perform best if no special activation functions are used. In addition, FNN with a lower number of parameters tend to generalize better and require smaller training sets.

A plain FNN does not by itself exhibit the correct parity behaviour. Behler et al. identified this issue and developed an FNN architecture and an input-coordinate transformation $h$ which guarantee the correct parity behaviour, the so-called high-dimensional neural network potentials (HDNNP). Instead of using only one FNN, each element type in the system is described by its own FNN. Each FNN is called an NNP because it describes the atomic energy of the respective atom type. The total energy of the system $V$ is thus separated, i.e. $V = \sum_i^{N_{\text{H}}} V_{\text{H},i} + \sum_i^{N_{\text{C}}} V_{\text{C},i} + \dots$,
where $N_{\text{H}}$ is the number of hydrogen atoms, $N_{\text{C}}$ the number of carbon atoms, and so on. Note that this separation ansatz is an approximation; it has no quantum-mechanical foundation. It guarantees that the total energy of a system is invariant with respect to swapping the positions of two atoms of the same type. However, rotation and translation still change the total energy. To address this issue, Behler et al. used the concept of symmetry functions. The Cartesian coordinates are transformed using a set of symmetry functions, which are invariant with respect to rotation and translation,

$$S^{t,R}_i(r) = \sum_j e^{-\eta \, (r_{ij} - \mu)^2} \qquad (11)$$

and

$$S^{t,A}_i = 2^{1-\zeta} \sum_{j \neq k \neq i} \left(1 - \cos(\theta_{ijk} - \theta_S)\right)^{\zeta} \cdot e^{-\eta_A \left(\frac{r_{ij} + r_{ik}}{2} - \mu_A\right)^2} , \qquad (12)$$

where $r_{ij}$ is the distance between atoms $i$ and $j$, and $\theta_{ijk}$ is the angle spanned by atoms $i$, $j$ and $k$. The additional parameters $\eta$, $\mu$, $\mu_A$, $\zeta$, $\theta_S$ and $\eta_A$ need to be set manually. The superscript $t$ (for type) indicates that the summation only runs over a specific element type. Symmetry functions describe the chemical environment of each atom with radial (Eq. (11)) and angular features (Eq. (12)). The summation operator enforces the correct parity behaviour.

Applying symmetry functions to periodic boundary conditions requires a cutoff. For this, a cutoff function $f_C$ can be added, i.e.

$$S^{t,R}_i(r) = \sum_j e^{-\eta \, (r_{ij} - \mu)^2} \cdot f_C(r_{ij}; R^{\text{QM-QM}}_{\text{sym},r}) \qquad (13)$$

and

$$S^{t,A}_i = 2^{1-\zeta} \sum_{j \neq k \neq i}^{N} \left(1 - \cos(\theta_{ijk} - \theta_S)\right)^{\zeta} \cdot e^{-\eta_A \left(\frac{r_{ij} + r_{ik}}{2} - \mu_A\right)^2} \cdot f_C(r_{ij}; R_A) \cdot f_C(r_{ik}; R_A) , \qquad (14)$$

with the cutoff function $f_C$ defined as,

$$f_C(r_{ij}; R) = \begin{cases} 0.5 \cdot \left(1 + \cos\left(\frac{\pi \, r_{ij}}{R}\right)\right) & \text{for } r_{ij} \leq R \\ 0 & \text{else.} \end{cases} \qquad (15)$$

To mimic long-range interactions, a second ML model can be trained on QM-derived partial charges and multipole moments to be used in an Ewald summation scheme.
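A minimal sketch of the radial symmetry function of Eq. (13) with the cosine cutoff of Eq. (15) illustrates the permutation invariance introduced by the summation operator. Parameter values and distances are arbitrary toy choices:

```python
import numpy as np

def f_cut(r, R):
    """Cosine cutoff of Eq. (15): smooth decay to zero at r = R."""
    return np.where(r <= R, 0.5 * (1.0 + np.cos(np.pi * r / R)), 0.0)

def radial_sym(r_ij, eta, mu, R):
    """Radial symmetry function of Eq. (13): a Gaussian shell centred at mu,
    summed over neighbours j so the descriptor is permutation invariant."""
    return np.sum(np.exp(-eta * (r_ij - mu) ** 2) * f_cut(r_ij, R))

# Distances from one atom to its neighbours (arbitrary units)
r_ij = np.array([0.9, 1.1, 2.5])

# Swapping two neighbours of the same element leaves the descriptor unchanged
s1 = radial_sym(r_ij, eta=4.0, mu=1.0, R=3.0)
s2 = radial_sym(r_ij[[1, 0, 2]], eta=4.0, mu=1.0, R=3.0)
assert np.isclose(s1, s2)
assert np.isclose(f_cut(np.array([3.0]), 3.0)[0], 0.0)  # zero at the cutoff
```

In practice, many such functions with different `(eta, mu)` pairs are stacked into the descriptor vector of each atom, so that the radial distribution of neighbours is resolved on a grid of Gaussian shells.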
For covalently bonded or metallic systems, the cutoff $R^{\text{QM-QM}}_{\text{sym},r}$ can be chosen relatively small because the contributions of the long-range interactions are minor. However, as discussed in the Introduction, much larger cutoffs are required for condensed-phase systems.

Even though HDNNP are a powerful technique, they are limited in the following ways: First, the number of angle terms of the symmetry functions scales exponentially with the number of element types as all possible combinations must be considered. For example, a system consisting of hydrogen and carbon atoms involves five unique angle combinations. Introducing an additional element type (e.g. oxygen) adds five more unique combinations. This issue can be resolved partially by introducing weighted symmetry functions, i.e. instead of introducing a new element type, the symmetry functions are weighted with respect to their atomic number and/or charge. Other issues with HDNNP are that the full correct permutation behaviour is only partially recovered, and that the descriptors used are non-unique. However, the latter issue is in our opinion more of a theoretical nature, and should not limit the usage of HDNNPs in practical applications. Lastly, one has to keep in mind that HDNNPs do not resolve the inherent general limitations of ML approaches. For example, the requirement for large training sets, the danger of overfitting, and limited extrapolation capabilities apply also to HDNNPs.
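The HDNNP construction described above, one FNN (Eq. (10)) per element type and the total energy as a sum of atomic energies, can be sketched in numpy. This is a toy illustration with random weights and stand-in descriptor vectors, not the Tensorflow/Keras models trained in this work:

```python
import numpy as np

def fnn(x, params):
    """Minimal feed-forward network, Eq. (10): x_{l+1} = sigma_l(W_l x_l + b_l)."""
    for W, b in params[:-1]:
        x = np.tanh(W @ x + b)       # hidden layers with tanh activation
    W, b = params[-1]
    return (W @ x + b).item()        # linear output: one atomic energy

rng = np.random.default_rng(0)
def make_params(n_in, n_hidden):
    return [(rng.normal(size=(n_hidden, n_in)), rng.normal(size=n_hidden)),
            (rng.normal(size=(1, n_hidden)),    rng.normal(size=1))]

# One NNP per element type, applied to each atom's symmetry-function vector
nnp = {"H": make_params(4, 8), "C": make_params(4, 8)}
atoms = ["C", "H", "H"]
descriptors = rng.normal(size=(3, 4))  # stand-in symmetry-function vectors

V = sum(fnn(g, nnp[el]) for el, g in zip(atoms, descriptors))

# Swapping the two hydrogen atoms leaves the total energy unchanged
V_swapped = (fnn(descriptors[0], nnp["C"]) + fnn(descriptors[2], nnp["H"])
             + fnn(descriptors[1], nnp["H"]))
assert np.isclose(V, V_swapped)
```

The permutation invariance checked by the final assertion comes purely from the architecture: atoms of the same element share one network and their energies are summed.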
In order to combine HDNNP and QM/MM MD simulations with electrostatic embedding, it is necessary to incorporate the MM environment into the descriptor of the HDNNP, in order to find a representation $f_P$ for $\tilde{V}^{\text{QM}} + \tilde{V}^{\text{QM-MM}}$ in Eq. (6). $V_{N,\text{MM}}$ is obtained using the classical force field, which means that a significant amount of complexity is removed from the ML model. The MM environment can be encoded in the HDNNP by introducing the MM particles as an additional element type, which is weighted with the respective partial charge. For the QM particles, we obtain

$$S^{t,R}_i(r) = \sum_j^{N^{(t)}_{\text{QM}}} e^{-\eta \, (r_{ij} - \mu)^2} \cdot f_C(r_{ij}; R^{\text{QM-QM}}_{\text{sym},r}) \qquad (16)$$

and

$$S^{t,A}_i = 2^{1-\zeta} \sum_{j \neq k \neq i}^{N^{(t)}_{\text{QM}}} \left(1 - \cos(\theta_{ijk} - \theta_S)\right)^{\zeta} \cdot e^{-\eta_A \left(\frac{r_{ij} + r_{ik}}{2} - \mu_A\right)^2} \cdot f_C(r_{ij}; R_A) \cdot f_C(r_{ik}; R_A) . \qquad (17)$$

MM particles are introduced as a new element type by

$$S^{t,R}_i(r) = \sum_j^{N_{\text{MM}}} e^{-\eta \, (r_{ij} - \mu)^2} \cdot f_C(r_{ij}; R^{\text{QM-MM}}_{\text{sym},r}) \cdot Z(j) , \qquad (18)$$

$$S^{t,A}_i = 2^{1-\zeta} \sum_{j \neq i}^{N^{(t)}_{\text{QM}}} \sum_k^{N_{\text{MM}}} \left(1 - \cos(\theta_{ijk} - \theta_S)\right)^{\zeta} \cdot e^{-\eta_A \left(\frac{r_{ij} + r_{ik}}{2} - \mu_A\right)^2} \cdot f_C(r_{ij}; R_A) \cdot f_C(r_{ik}; R_A) \cdot Z(k) , \qquad (19)$$

and

$$S^{t,A}_i = 2^{1-\zeta} \sum_{j \neq k}^{N_{\text{MM}}} \left(1 - \cos(\theta_{ijk} - \theta_S)\right)^{\zeta} \cdot e^{-\eta_A \left(\frac{r_{ij} + r_{ik}}{2} - \mu_A\right)^2} \cdot f_C(r_{ij}; R_A) \cdot f_C(r_{ik}; R_A) \cdot Z(j) \cdot Z(k) . \qquad (20)$$

In Eqs. (18-20), the summation operator iterates over all MM particles $N_{\text{MM}}$ and QM particles $N^{(t)}_{\text{QM}}$ of type $t$. The function $Z(i)$ returns the partial charge of the MM particle. For SPC/E water, we have $Z(\text{O}) = -0.8476\,e$ and $Z(\text{H}) = +0.4238\,e$. This approach requires that the data set is generated within the same QM/MM approach. No partitioning of the QM zone or MM zone is performed. The model $f_P$ is then trained on the generated data points. The learned potential $\tilde{V}^{\text{QM}}_{\vec{r}} + \tilde{V}^{\text{QM/MM}}_{\vec{r}}$ in combination with the classical force field can subsequently be used to propagate the total system in time.
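The charge-weighted radial symmetry function of Eq. (18) can be sketched as follows, with the MM environment entering as one extra "element type" weighted by the partial charge $Z(j)$. The distances and parameter values are illustrative toy choices; only the SPC/E charges are taken from the text:

```python
import numpy as np

def f_cut(r, R):
    """Cosine cutoff of Eq. (15)."""
    return np.where(r <= R, 0.5 * (1.0 + np.cos(np.pi * r / R)), 0.0)

def mm_radial_sym(r_ij, q_mm, eta, mu, R):
    """Charge-weighted radial symmetry function of Eq. (18): each MM
    particle contributes a Gaussian shell scaled by its partial charge Z(j)."""
    return np.sum(np.exp(-eta * (r_ij - mu) ** 2) * f_cut(r_ij, R) * q_mm)

# A QM atom surrounded by two SPC/E waters (O: -0.8476 e, H: +0.4238 e);
# distances in nm are made up for illustration
r_ij = np.array([0.30, 0.25, 0.35, 0.55, 0.60, 0.62])
q_mm = np.array([-0.8476, 0.4238, 0.4238, -0.8476, 0.4238, 0.4238])

g = mm_radial_sym(r_ij, q_mm, eta=80.0, mu=0.3, R=1.4)
assert g != 0.0  # a neutral solvent shell still yields a non-zero descriptor
```

Even though each water molecule is neutral overall, the descriptor does not vanish, because the oxygen and hydrogen charges sit at different distances and are therefore damped differently by the Gaussian and the cutoff. This is what lets the HDNNP sense the orientation of the solvent.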
Methods
Two types of systems were investigated: (i) single molecules in water, including benzene, uracil, and retinoic acid, and (ii) transition states of chemical reactions in water, namely the S_N2 reaction of CH3Cl with Cl− and the reaction of S-adenosylmethionine (SAM) with cytosine. The systems were chosen to represent different system sizes, use cases, and difficulties. The systems were either treated in a static or a dynamic manner. Static refers to a retrospective evaluation of sampled configurations, whereas dynamic indicates that an actual MD simulation was performed using the gradients provided by the ML model.

All MD simulations were performed using the GROMOS software package interfaced to DFTB+/19.2 and ORCA/4.2.0. The functionals used in this study are the gradient-corrected Becke-Perdew functional (BP86), Head-Gordon's range-separated functional ωB97X-D3, and Grimme's double-hybrid functional B2-PLYP. The basis sets used in this study were Ahlrichs and Weigend's def2-SVP and def2-TZVP. All QM calculations used the resolution of identity, Weigend's auxiliary basis functions, and Grimme's dispersion correction with Becke-Johnson damping (except for ωB97X-D3). DFT calculations which require the computation of a Hartree-Fock reference wave function were further accelerated using the so-called COSX approximation. The ORCA computations used TightSCF convergence criteria and the integration Grid5. The grid was changed to Grid6 in the final iteration. For the systems retinoic acid and SAM/cytosine, the integration Grid7 was selected to avoid oscillating energies during the computation. Otherwise, standard parameters were used. DFTB computations were performed with Grimme's D3 dispersion correction with Becke-Johnson damping.

All point charges within the cutoff radius r were included in an electrostatic embedding scheme in the QM computation. Selected solvent atoms beyond the cutoff were included to avoid bond-breaking and the creation of artificial charges.
The convergence criterion was set to − eV for the systems benzene, uracil, and CH₃Cl/Cl⁻, and to − eV for the others. Structures that did not converge within the maximum number of steps were discarded. MD simulations were performed with a convergence criterion of − eV and a Broyden mixer with a mixing parameter of α = 0. to ensure numerical stability. Newton's equations of motion were integrated with a time step of 0.5 fs if not noted otherwise. The temperature was kept constant in the MD simulations with the Nosé-Hoover chain thermostat and two baths (one coupled to the internal motion and the rotation of the solute, and the other one to the solvent). Berendsen's barostat was used for constant-pressure simulations. Long-range electrostatic interactions beyond the cutoff of 1.4 nm were included using a reaction-field method. Note that the reaction field acts only on the MM particles. The water model used in this study was SPC/E. All bonds between MM particles were constrained using the SHAKE algorithm with a relative tolerance of 10−. For technical reasons, a topology file of the QM solute is required as input in GROMOS. These parameters were, however, not used during the QM/MM simulation. The topology files were obtained from the ATB server. The neural networks were implemented using Tensorflow/Keras. We exported the computational graphs trained in Python to the C++ GROMOS code.

Initial Structures
Initial structures of the individual solutes were generated using the ATB server. Initial structures of the transition states of chemical reactions were generated by manually placing the reactants close to the transition state. During sampling, a biasing potential was applied of the form

V_\text{bias} = \frac{1}{2} k \left( r_{ij} + d \cdot r_{kl} - r_0 \right)^2,    (21)

where k is the force constant, and r_ij and r_kl are the distances between the atoms i and j, and k and l, respectively. r_0 is the ideal distance, and d ∈ {−1, +1} determines whether the distances are subtracted or added. For both systems, we set d = −1 and r_0 = 0. . For the CH₃Cl/Cl⁻ system, the indexing is shown in the following,

Cl_j−C_{i,k}H₃ + Cl⁻_l → ClCH₃ + Cl⁻    (22)

The structure was minimized using a force constant for the biasing potential of k = 60'000 kJ mol⁻¹ nm⁻². For the SAM/cytosine system, atom i was the carbon atom participating in the reaction, atom j the sulfur atom, atom k was the same carbon atom as atom i, and atom l corresponds to the carbon atom in α-position to the amine group. For all systems, the solute(s) were solvated in a periodic box of SPC/E water using the GROMOS++ package of programs, with a minimum solute-wall distance of 1.4 nm and a minimum solute-solvent distance of 0.3 nm. The edge lengths of the simulation boxes were between 3.3 nm and 5.0 nm. All systems were initially relaxed at 0 K using a gradient-descent algorithm. A configuration was considered to be converged when the predicted change in energy, averaged over all particles, was less than 0.1 kJ mol⁻¹. The thermostat temperature was set to 300 K or 500 K with a coupling constant of 0.1 ps. For the three relatively small systems (solutes with low flexibility), a set of configurations was sampled, which was afterwards split into training/validation sets for the ML model. For benzene in water, a cutoff for the partial charges in the QM Hamiltonian of r = 0.60 nm was used.
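A minimal sketch of the biasing potential of Eq. (21), including the force it exerts on atom i, can be written as follows (the helper function is our own; units follow the text, kJ mol⁻¹ and nm):

```python
import numpy as np

def bias_energy_and_force_i(pos_i, pos_j, pos_k, pos_l, k, r0, d=-1.0):
    """Harmonic restraint on the combined distance r_ij + d * r_kl (Eq. 21).
    Returns the bias energy and the force acting on atom i (the negative
    gradient of V_bias with respect to pos_i)."""
    v_ij = pos_j - pos_i
    r_ij = np.linalg.norm(v_ij)
    r_kl = np.linalg.norm(pos_l - pos_k)
    delta = r_ij + d * r_kl - r0
    energy = 0.5 * k * delta ** 2
    # dV/dpos_i = -k * delta * v_ij / r_ij, so the force on i is its negative
    force_i = k * delta * v_ij / r_ij
    return energy, force_i
```

With d = −1, the restraint keeps the difference of the two distances near r_0, which holds the system close to the transition state during sampling.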
For uracil in water as well as the CH₃Cl/Cl⁻ transition state, the cutoff for the partial charges was set to r = 1.40 nm. Benzene and uracil were simulated with DFTB at a temperature T of 500 K and a pressure p of 1 bar for 10'000 steps. The obtained trajectory was sorted by the QM/MM energy. The data points were binned by energy with a bin width of 0.25 kJ mol⁻¹. The lowest representative of each bin was recomputed at a higher level of theory. Every second recomputed point was placed in the validation set and every other point in the training set, resulting in a training/validation split of 50%. The CH₃Cl/Cl⁻ system was simulated at T = 300 K and p = 1 bar with k = 20'000 kJ mol⁻¹ nm⁻² for 100'000 steps. Every tenth step was recomputed at a higher level of theory. Every second recomputed data point was placed in the validation set and every other point in the training set, again resulting in a training/validation split of 50%. Note that commonly used sampling strategies such as normal-mode sampling are difficult to perform for these systems as they do not account for the solvent environment. In future work, we will explore improved sampling approaches. To assess the performance of the ML approaches in a realistic setting, MD production runs were performed for two larger systems: (i) retinoic acid in water (50 QM atoms, 2500 partial charges contributing to the QM/MM Hamiltonian), and (ii) the SAM/cytosine transition state in water (63 QM atoms, 3000 partial charges contributing to the QM/MM Hamiltonian). The training set for retinoic acid in water was generated by running DFTB MD simulations at 300 K and a pressure of 1 bar for 100'000 steps. Every 10th step was recomputed at a higher level of theory (10'000 data points in total). The data set was randomly shuffled, and 50% of the data points were placed in the training set and the other 50% in the validation set.
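The energy-binned selection of reference configurations described above can be sketched as follows (the helper is our own, with the 0.25 kJ mol⁻¹ bin width from the text as default):

```python
import numpy as np

def energy_binned_split(energies, bin_width=0.25):
    """Pick the lowest-energy frame from each energy bin, then assign the
    picked frames alternately (in energy order) to the training and
    validation sets, giving a 50/50 split of the selected points."""
    energies = np.asarray(energies, dtype=float)
    bins = np.floor((energies - energies.min()) / bin_width).astype(int)
    picked = []
    for b in np.unique(bins):
        members = np.where(bins == b)[0]
        picked.append(int(members[np.argmin(energies[members])]))
    picked.sort(key=lambda i: energies[i])
    return picked[0::2], picked[1::2]  # training indices, validation indices
```

Only the selected frames are then recomputed at the higher level of theory, which keeps the number of expensive DFT reference calculations small.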
For the SAM/cytosine system, we performed DFTB MD simulations at 500 K and a pressure of 1 bar with an integration time step of 1.0 fs. The data set was randomly shuffled, and 90% of the data points were placed in the training set. For the large systems, (QM)ML/MM MD simulations were performed using the fitted ML models. The time step was set to 0.5 fs for both systems, the temperature was kept at T = 300 K, and the pressure was set to 1 bar. For the SAM/cytosine system, the biasing force constant was kept at k = 20'000 kJ mol⁻¹ nm⁻². We implemented the HDNNP in Tensorflow/Keras with float64 precision. Each HDNNP used at most two hidden layers and thus followed the recommendation in Ref. 59. The activation function mila was used with a coefficient β = −. , because it is continuously differentiable and is known to outperform commonly used activation functions such as tanh. The ML models were trained using the Adamax optimizer with varying learning rates. The number of neurons per hidden layer ranged from 20 to 160 for each NNP. As loss functions, we used

L = \frac{1}{N} \sum_i^{N} \left( V_i - \tilde{V}_i \right)^2    (23)

and

L' = L + \frac{\omega_\text{QM}}{N_\text{QM}} \sum_i^{N_\text{QM}} \sum_\alpha \left( \tilde{F}_{i\alpha} - F_{i\alpha} \right)^2 + \frac{\omega_\text{MM}}{N_\text{MM}} \sum_i^{N_\text{MM}} \sum_\alpha \left( \tilde{F}_{i\alpha} - F_{i\alpha} \right)^2,    (24)

where N_MM is the number of MM particles in the QM/MM Hamiltonian, N_QM the number of QM particles, and ω_QM and ω_MM are weight parameters for the gradient contributions. If not noted otherwise, ω_QM = 1 and ω_MM = 1. Note that L is a non-gradient-corrected loss function and L' is a gradient-corrected loss function. All models were trained on an NVIDIA K1400 with 3 GB VRAM using single-batch training. Static models were initially trained for 500 steps with a learning rate of . · − . The training was continued for 500 steps if the mean absolute error on the energies of the training set was above 10 kJ mol⁻¹.
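In NumPy form, the loss functions of Eqs. (23) and (24) read as follows (a sketch; the actual training used Tensorflow/Keras, and w_qm/w_mm stand for ω_QM/ω_MM):

```python
import numpy as np

def loss_energy(v_ref, v_pred):
    """Eq. (23): mean squared error on the potential energies."""
    return float(np.mean((v_ref - v_pred) ** 2))

def loss_gradient_corrected(v_ref, v_pred, fqm_ref, fqm_pred,
                            fmm_ref, fmm_pred, w_qm=1.0, w_mm=1.0):
    """Eq. (24): energy MSE plus weighted sums over the force components
    of the QM and MM particles (arrays of shape (N_QM, 3) and (N_MM, 3))."""
    l_qm = w_qm / fqm_ref.shape[0] * np.sum((fqm_pred - fqm_ref) ** 2)
    l_mm = w_mm / fmm_ref.shape[0] * np.sum((fmm_pred - fmm_ref) ** 2)
    return loss_energy(v_ref, v_pred) + float(l_qm + l_mm)
```

Increasing w_mm corresponds to the reweighting of the MM gradient contribution in the loss function discussed in the Results and Discussion section.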
Below 10 kJ mol⁻¹, training was continued until the change in the mean absolute error of the energies on the training set was lower than 0.5 kJ mol⁻¹ over 100 steps. The learning rate was then changed to . · − , and training was continued. The training was stopped when the change in the mean absolute error of the energies on the training set was lower than 0.5 kJ mol⁻¹ over 50 steps. In two cases, this training procedure led to an unreasonable error on the training set (M4 and M5 trained on 10% of the training set, see Results and Discussion). For these cases, we manually restarted the training procedure for an additional 2000 steps with a learning rate of . · − . The dynamic models (for retinoic acid and SAM/cytosine) were trained for 50 steps each with a learning rate of . · − .

Symmetry Functions

The symmetry functions were extracted from Refs. 29 and 36. In contrast to Smith et al., we did not apply any prefactors. In addition, we allowed for a longer cutoff and extended the spacing of the radial functions. Table 1 lists the symmetry functions, which encode the environment. The cutoffs for the radial symmetry functions were varied in this study and are provided in the Results and Discussion section for all models.
Parameter   Type      From         To           Step size
μ           Radial    0.09         1.40         0.026875
η           Radial    160          160          0
μ_A         Angular   0.09         0.285        0.065
R_A         Angular   0.35         0.35         0
η_A         Angular   80           80           0
ζ           Angular   32           32           0
θ_S         Angular   0.19634954   2.9452431    0.19634954

Table 1: Symmetry functions used in this work, taken from Ref. 36 and adapted to units of nm and nm⁻². The cutoffs for the radial symmetry functions for the different models are provided in the Results and Discussion section.

For the MM particles, we used either the same symmetry functions as for the QM subsystem (two-body and three-body terms), or omitted the three-body terms (two-body terms only). All symmetry functions that include MM particles were weighted with the corresponding partial charge. Note that the symmetry functions could be varied to optimize the results. However, Smith et al. have shown that their set of symmetry functions covers important local interactions and torsional space in organic compounds, while still being computationally efficient and less prone to overfitting. In the following, we will refer to three cutoffs: R_c is the cutoff chosen in the QM/MM calculation to generate the training/validation sets; all partial charges within this cutoff were included in the Hamiltonian Ĥ_QM/MM. R_sym,r^QM-QM is the cutoff used for the radial symmetry functions encoding the QM environment, whereas R_sym,r^QM-MM is the cutoff used for the radial symmetry functions encoding the MM environment.

The loss function of neural networks has multiple local minima. The training procedure usually finds a configuration of weight parameters close to a local minimum, but not necessarily close to the global minimum. To assess the impact of gradient correction, multi-body terms, and the cutoff size, we chose test systems with a conformationally rigid solute in water, for which the conformational space can be sampled completely.
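The parameter grids of Table 1 can be generated with np.arange; the small epsilon below is our own guard to keep the inclusive "To" endpoint despite floating-point rounding:

```python
import numpy as np

eps = 1e-9  # guard so that the "To" value itself is included
mu_radial = np.arange(0.09, 1.40 + eps, 0.026875)             # radial centers [nm]
mu_angular = np.arange(0.09, 0.285 + eps, 0.065)              # angular centers [nm]
theta_s = np.arange(0.19634954, 2.9452431 + eps, 0.19634954)  # pi/16 steps [rad]
```

Each center, combined with the fixed widths η = 160 nm⁻² (radial) and η_A = 80, ζ = 32 (angular), defines one symmetry function in the descriptor.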
Three solutes differing in polarity were investigated: (i) benzene (apolar), (ii) uracil (polar), and (iii) the transition state of CH₃Cl/Cl⁻ (charged). The sampled configurations were split 50/50 into training and validation sets (see Methods). The first test system consists of benzene in water. Benzene is an apolar solute; thus, long-range interactions do not contribute to the PES of the QM/MM region. This means that the training/validation set could be generated with a cutoff of R_c = 0.60 nm. This test system is suitable to study the effect of including gradient correction and multi-body terms (Eqs. (18)-(20)) in the description of the PES for the MM part. Note that the descriptor always contains multi-body information for the QM particles (Eq. (14)). Table 2 summarizes the settings of the four ML models used for the comparison. All models use the parameters proposed by Ref. 29; however, we account for the interactions of the QM subsystem with the MM particles by introducing them as a new element type (see Theory section). Model 1 (M1) is the baseline model, including only two-body information, no gradient correction, and a small cutoff R_sym,r^QM-MM = 0.52 nm. M2 includes both three-body terms for the MM point charges and gradient information, while M3 includes only gradient information. M4 additionally weights the gradients of the MM particles in the loss function and uses a larger cutoff for the QM-QM symmetry functions (R_sym,r^QM-QM = 1.0 nm).

ID   Three-body terms for MM?   ω_QM   ω_MM    R_sym,r^QM-QM [nm]   R_sym,r^QM-MM [nm]   R_c [nm]
M1   No                         0      0       0.52                 0.52                 0.60
M2   Yes                        1      1       0.52                 0.52                 0.60
M3   No                         1      1       0.52                 0.52                 0.60
M4   No                         1      1-200   1.00                 1.00                 0.60

Table 2: Settings for the ML models used to test the influence of including gradient correction and/or three-body terms for the MM point charges. ω_QM and ω_MM are weight parameters for the gradient contribution (Eq. (24)).
R_c is the cutoff radius for the MM partial charges in the QM/MM calculation used to generate the training/validation sets, R_sym,r^QM-QM is the cutoff for the QM-QM symmetry functions, and R_sym,r^QM-MM the cutoff for the QM-MM symmetry functions. The test system is benzene in water.

Figure 1: Performance of the settings M1-M3 on the validation set of benzene in water as a function of the number of neurons in each hidden layer. The reference data points were computed with B2-PLYP/def2-TZVP. The performance is reported as the mean absolute error (MAE). (Left): MAE of the predicted energies. (Middle): MAE of the predicted QM gradients. (Right): MAE of the predicted MM gradients. The dashed purple line shows the accuracy achieved with the DFTB Hamiltonian on the same data set for comparison. The black dashed line indicates chemical accuracy (i.e., 4.18 kJ mol⁻¹ for the energies, and 41.8 kJ mol⁻¹ nm⁻¹ for the gradients).

We first analyzed the model performance as a function of the number of neurons in the hidden layers for settings M1-M3. As can be seen in Figure 1, the accuracy of the predicted energies (left panel) and QM gradients (middle panel) generally improved with an increasing number of neurons per layer. Inclusion of three-body terms for the QM-MM interactions (M2 versus M1) does not improve the accuracy. These terms are expensive to compute and increase the memory requirements drastically. We will thus not include them in the subsequent models. Introducing a gradient correction (M3 versus M1), on the other hand, improved the prediction accuracy of the QM forces by a factor of three to five (middle panel of Figure 1). This is largely to be expected because information about the gradients is given to the model (the amount of data is multiplied by a factor of three per data point). Nevertheless, the accuracy of the predicted energies of the validation set is also improved, which indicates that the model trained on the energies alone (M1) likely overfitted the PES.
The accuracy of the predicted forces on the MM particles also improved slightly when including gradient correction (right panel). All trained HDNNP models outperform the DFTB Hamiltonian (purple lines) on the predicted energies and the QM gradients (note that DFTB was parametrized on a different level of theory than the one used in this study). Interestingly, the gradients on the MM particles are predicted more accurately with DFTB than with the ML models. One can ask whether this difference in accuracy (about 3 kJ mol⁻¹ nm⁻¹) is significant, because the absolute deviation of the gradients on the MM particles is low compared to the absolute deviation of the gradients on the QM particles. However, we note that the majority of the particles in the system are affected, and the higher deviation is therefore potentially problematic. This issue can be resolved relatively simply by weighting the contributions of the MM gradients in the loss function with a constant factor, as done with settings M4 (Figure 2). In this case, the cutoff of the QM-QM symmetry functions was increased to 1.0 nm to ensure that the descriptor of each atom "sees" all solvent atoms and MM atoms. The M4 variant with ω_MM = 1 thereby serves as the baseline (i.e., increased cutoff but no weighting of the MM gradient contribution). For ω_MM < 200, the prediction accuracy of the MM gradients is not improved. At ω_MM = 200, however, the HDNNP finds a different minimum, yielding a better description of the energies, QM gradients, and MM gradients simultaneously. These results indicate that a correct description by the model depends heavily on the regularization in the training procedure. With these improvements, the accuracy of the predicted MM gradients from the ML model outperforms that of the DFTB Hamiltonian.

Figure 2: Performance of the settings M4 on the validation set of benzene in water as a function of the weight ω_MM. The reference data points were computed with B2-PLYP/def2-TZVP. All models used 160 neurons per layer.
Performance is reported as the mean absolute error (MAE). (Left): MAE of the predicted energies. (Middle): MAE of the predicted QM gradients. (Right): MAE of the predicted MM gradients. The dashed purple line shows the accuracy achieved with the DFTB Hamiltonian on the same data set for comparison. The black dashed line indicates chemical accuracy (i.e., 4.18 kJ mol⁻¹ for the energies, and 41.8 kJ mol⁻¹ nm⁻¹ for the gradients).

∆-Learning

As ML models use a locality ansatz, it is difficult for them to describe long-range interactions. Semi-empirical methods, on the other hand, are known to be able to account for long-range interactions. Thus, a ∆-learning scheme based on a semi-empirical method might be a valuable approach. In such a scheme, the ML model introduces a correction to the semi-empirical method to recover the accuracy of the higher-level method. Although this naturally decreases the computational efficiency of the production run, since an additional computation is required at each time step, we will demonstrate in the following that the number of weights in the ML model can be reduced significantly. As a result, the number of training data points needed is also reduced, the training procedure is accelerated, and the extrapolation capabilities of the model are possibly increased. The ∆-learning model M5 uses the same settings as M3, i.e., only two-body terms for the QM-MM interactions, gradient correction (ω_QM = 1, ω_MM = 1), and a cutoff for the symmetry functions of 0.52 nm. In Figure 3, the performance of M5 is compared to that of the models M3 and M4 (with ω_MM = 200). Using ∆-learning improved the accuracy of all three properties, i.e., energies, QM gradients, and MM gradients, significantly already at a much smaller number of neurons per hidden layer. This indicates that only a reduced number of training data points is necessary to achieve the accuracy of the reference method (see discussion below).
It also suggests that the extrapolation capabilities of M5 might be better than those of the previous models. This makes the model particularly interesting for use cases where the ensemble of conformations/configurations of the system in the training set is not complete. The ∆-learning scheme will be used in the rest of this study.

Figure 3: Comparison of the performance of M3 and M5 on the validation set of benzene in water as a function of the number of neurons in the hidden layer. The reference data points were computed with B2-PLYP/def2-TZVP. The performance of M4 with ω_MM = 200 and 160 neurons per layer is shown as a red dashed line. The performance is reported as the mean absolute error (MAE). (Left): MAE of the predicted energies. (Middle): MAE of the predicted QM gradients. (Right): MAE of the predicted MM gradients. The dashed purple line shows the accuracy achieved with the DFTB Hamiltonian on the same data set for comparison. The black dashed line indicates chemical accuracy (i.e., 4.18 kJ mol⁻¹ for the energies, and 41.8 kJ mol⁻¹ nm⁻¹ for the gradients).

To the best of our knowledge, the lowest level of theory available for such a ∆-learning scheme is DFTB. Other semi-empirical methods do not incorporate the partial charges in an SCF manner in the Hamiltonian. Rather, they describe the interaction based on parameters. This means that the gradients on the MM particles usually do not correlate with those computed by the reference method (DFT). A major issue for the learning of complex systems is that not all relevant configurations can be enumerated. Thus, ML approaches that can be trained on a small data set are especially valuable. As shown above, ∆-learning requires fewer weight parameters to achieve chemical accuracy on the validation set of benzene in water, which indicates that fewer training points are needed compared to the ML models trained directly on the full energies and gradients.
To quantify this observation, the M4 (ω_MM = 200, 160 neurons per layer) and M5 (ω_MM = 1, 20 neurons per layer) models were trained on only 10% of the original training set. This was done by selecting every tenth data point of the training set, which corresponds to 211 data points in total. The validation set remained the same. The results are presented in Table 3. While the MAE on the energies of the validation set is 6.0 kJ mol⁻¹ for M5, the error increases to 9.5 kJ mol⁻¹ for M4. Furthermore, we observe a much larger drop in accuracy on the MM gradients for M4 than for M5 when trained with only 10% of the data. The ∆-learning model is thus more robust towards smaller training set sizes. However, we note that the description of the MM particles becomes slightly worse compared to DFTB.

            Energy [kJ mol⁻¹]           QM gradients [kJ mol⁻¹ nm⁻¹]      MM gradients [kJ mol⁻¹ nm⁻¹]
Model       MAE           RMSE          MAE              RMSE             MAE            RMSE
M4 (100%)   2.61 (2.36)   3.40 (3.04)   41.40 (38.90)    57.80 (52.70)    5.75 (5.52)    10.29 (9.84)
M4 (10%)    9.48 (8.95)   11.86 (11.72) 88.70 (79.57)    124.43 (102.99)  16.04 (20.21)  27.11 (36.22)
M5 (100%)   2.80 (2.67)   3.58 (3.41)   26.68 (26.36)    35.66 (34.69)    5.25 (5.22)    9.77 (9.83)
M5 (10%)    6.00 (5.30)   7.26 (6.27)   38.88 (38.09)    50.93 (49.72)    7.59 (8.64)    16.40 (14.44)
DFTB        13.54 (13.83) 14.75 (15.10) 138.80 (138.70)  189.40 (190.60)  7.34 (7.31)    13.70 (13.70)

Table 3: Comparison of the performance of M4 and M5 on the validation set of benzene in water when trained on 10% or 100% of the training set. Performance is reported as the mean absolute error (MAE) and root-mean-square error (RMSE). The values for the training set are given in brackets. The reference data points were computed with B2-PLYP/def2-TZVP. The performance of DFTB is given as a reference. After the standard training protocol described in the Methods section, M4 (10%) and M5 (10%) were trained for 2000 more steps with a learning rate of . · − .
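The ∆-learning evaluation itself is a simple sum of the baseline and the learned correction at every step. In the sketch below, run_dftb and ml_correction are dummy stand-ins for the actual DFTB+ baseline call and the trained HDNNP:

```python
import numpy as np

def run_dftb(coords, mm_charges):
    """Dummy baseline: in the real workflow, this calls DFTB+ with the MM
    point charges embedded in the Hamiltonian."""
    return -100.0, np.zeros_like(coords), np.zeros((len(mm_charges), 3))

def ml_correction(coords, mm_charges):
    """Dummy HDNNP correction toward the DFT reference (here a constant)."""
    return -2.5, np.zeros_like(coords), np.zeros((len(mm_charges), 3))

def delta_learning_step(coords, mm_charges):
    """One ∆-learning evaluation: baseline energy/forces plus the learned
    difference, for the QM particles as well as the MM point charges."""
    e_b, fq_b, fm_b = run_dftb(coords, mm_charges)
    e_c, fq_c, fm_c = ml_correction(coords, mm_charges)
    return e_b + e_c, fq_b + fq_c, fm_b + fm_c
```

Because both terms provide energies and forces for the QM and MM particles, the combined result can be passed directly to the MD integrator.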
Fitting an ML model for the benzene-in-water system did not require the inclusion of long-range interactions. To assess the effect of the cutoff size, and thus of an improved accounting for long-range interactions, the test system of uracil in water is used in the following. The cutoff R_c is increased up to 1.40 nm, which corresponds to 2000-3000 MM partial charges in the cutoff sphere. A cutoff of 1.4 nm is used in the GROMOS force field, of which the SPC/E water model is a part. Table 4 lists the settings M6-M8. All models use gradient correction and the ∆-learning scheme with DFTB. M6 serves as a "worst-case" model, where the MM particles are simulated solely with the semi-empirical method and the ML model introduces just a local correction to the QM subsystem.

ID   R_sym,r^QM-QM [nm]   R_sym,r^QM-MM [nm]   R_c [nm]   ω_MM
M6   0.52                 0.00                 1.40       1
M7   0.52                 0.52                 1.40       1
M8   1.40                 1.40                 1.40       1

Table 4: Settings for the ∆-learning ML models used to test the influence of the cutoff size. All models employ only two-body terms for the QM-MM interactions and gradient correction. The models were trained with 20 neurons per hidden layer. R_c is the cutoff radius for the MM partial charges in the QM/MM calculation, R_sym,r^QM-QM is the cutoff for the QM-QM symmetry functions, and R_sym,r^QM-MM the cutoff for the QM-MM symmetry functions. The test system is uracil in water.

The results for the models M6-M8 and the DFTB baseline on the validation set of uracil in water are summarized in Table 5. Model M6 with no information about the MM environment (QM-MM cutoff set to zero) performs worst, but still better than the DFTB baseline. Including the MM particles in the QM calculation (M7 versus M6) improved the accuracy on all three properties. Increasing the cutoff from 0.52 nm to 1.40 nm (M8 versus M7) further reduced the error on the energies and QM gradients; however, the error on the MM gradients increased, indicating overfitting.
This behaviour may be explained by the insensitivity of the loss function towards the comparatively small gradients of the MM particles (Eq. (24)). The ML model can thus use these input features to achieve a better performance as measured by the loss function. This is achieved by improving the accuracy of the total energies and the QM gradients. However, the trade-off is that the prediction accuracy of the MM gradients becomes worse. We already demonstrated for the benzene test system that increasing the weights of the MM gradients in the loss function can partially resolve this issue. Here, we found that using a lower cutoff for the symmetry functions (R_sym,r^QM-MM = 0.52 nm, M7) regularizes the model similarly.

            Energy [kJ mol⁻¹]          QM gradients [kJ mol⁻¹ nm⁻¹]       MM gradients [kJ mol⁻¹ nm⁻¹]
Model       MAE          RMSE          MAE              RMSE              MAE           RMSE
M6          4.03 (3.94)  5.25 (5.14)   54.20 (54.20)    74.00 (74.10)     1.35 (1.35)   5.27 (5.30)
M7          2.06 (2.03)  2.53 (2.49)   37.90 (37.50)    50.86 (50.21)     1.19 (1.20)   3.81 (3.80)
M8          1.20 (1.14)  1.45 (1.36)   28.53 (26.85)    37.48 (35.18)     1.76 (1.76)   4.27 (4.26)
DFTB        6.55 (6.56)  8.29 (8.27)   311.70 (312.80)  429.10 (429.80)   1.35 (1.35)   5.27 (5.27)

Table 5: Performance of the models M6-M8 and the DFTB baseline on the validation set of uracil in water. Performance is reported as the mean absolute error (MAE) and root-mean-square error (RMSE). The values for the training set are given in brackets. The training set was computed with BP86/def2-TZVP.

The results for the benzene and uracil test systems show that constructing an ML model which handles long-range interactions correctly is challenging, and that the ML model requires regularization to avoid overfitting. The model can be regularized by increasing the weights of the MM gradients or by decreasing the cutoff in combination with ∆-learning. However, so far the test systems were not charged.
For charged systems, the incorporation of long-range electrostatic interactions will be even more important. To assess the transferability of the conclusions from the previous test systems, we analyzed the performance of the ML models on a third test system, the transition state of the S_N2 reaction between CH₃Cl and Cl⁻. For this system, larger changes in the electronic structure within the configurational ensemble are expected. Table 6 lists the settings investigated for the CH₃Cl/Cl⁻ test system.

ID    R_sym,r^QM-QM [nm]   R_sym,r^QM-MM [nm]   R_c [nm]   ω_MM
M9    0.52                 0.00                 1.40       1
M10   0.52                 0.52                 1.40       1
M11   1.40                 1.40                 1.40       1
M12   1.40                 1.40                 1.40       200

Table 6: Settings for the ∆-learning ML models used to test the influence of the cutoff size. All models employ only two-body terms for the QM-MM interactions and gradient correction. The models were trained with 20 neurons per hidden layer. The test system is the transition state of CH₃Cl/Cl⁻ in water. R_c is the cutoff radius for the MM partial charges in the QM/MM calculation, R_sym,r^QM-QM is the cutoff for the QM-QM symmetry functions, and R_sym,r^QM-MM the cutoff for the QM-MM symmetry functions.

The results for the models M9-M12 and the DFTB baseline on the validation set of the transition state of CH₃Cl/Cl⁻ in water are summarized in Table 7. For models M9-M11, we observe in general the same behaviour as for the previous test system of uracil in water. Not including the MM partial charges in the ML model, as in model M9, again results in the worst performance, far from chemical accuracy. Including them with a cutoff of 0.52 nm (M10) reduces the error on the energies and QM gradients significantly, and increasing the cutoff to 1.4 nm (M11) improved the performance further. The effect of the increased cutoff is more pronounced for the CH₃Cl/Cl⁻ test system than for uracil in water due to the stronger long-range electrostatic interactions in the charged system.
Furthermore, DFTB describes the training and validation sets much worse than was the case for the uracil test system, i.e., the correction by the ML model is larger. However, as observed above, the prediction accuracy of the gradients on the MM particles decreased with an increasing cutoff for the QM-MM interactions. Thus, we tested model M12, where the weight of the MM gradients in the loss function was increased (ω_MM = 200). This improved the performance on the MM gradients, but the accuracy of the DFTB baseline was not reached. This indicates that a stronger regularization of the ML model during the training procedure is necessary.

            Energy [kJ mol⁻¹]            QM gradients [kJ mol⁻¹ nm⁻¹]      MM gradients [kJ mol⁻¹ nm⁻¹]
Model       MAE           RMSE           MAE              RMSE             MAE           RMSE
M9          10.06 (9.53)  12.51 (11.84)  129.01 (127.05)  196.76 (167.7)   2.10 (2.10)   16.40 (16.40)
M10         2.35 (2.20)   3.05 (2.86)    41.70 (40.50)    56.10 (54.40)    3.26 (3.30)   23.00 (22.80)
M11         0.86 (0.79)   1.14 (1.04)    24.80 (22.90)    32.90 (29.90)    4.70 (4.67)   23.40 (23.10)
M12         1.21 (1.10)   1.60 (1.44)    28.20 (26.60)    37.40 (35.20)    3.91 (3.89)   22.92 (22.65)
DFTB        17.91 (17.84) 222.10 (221.00) 202.60 (201.70) 302.10 (300.70)  2.10 (2.10)   16.40 (16.40)

Table 7: Performance of the models M9-M12 and the DFTB baseline on the validation set of the transition state of CH₃Cl/Cl⁻ in water. Performance is reported as the mean absolute error (MAE) and root-mean-square error (RMSE). The values for the training set are given in brackets. The training set was computed with Head-Gordon's range-separated functional ωB97X-D3.

From the findings discussed above, we can conclude that the choice of the cutoff size depends on the polarity of the solute (i.e., on how important long-range electrostatic interactions are). In general, a larger cutoff is recommended for polar solutes in polar solvents; however, it also increases the risk of overfitting. This can be counteracted by increasing the weights of the MM gradients in the loss function.
Application of ML Models in MD Simulations
So far, we have assessed the performance of the ML models in terms of MAE (and RMSE) on a validation set. However, these results can be misleading to some degree because rare outliers are averaged out. Such rare outliers might be less tolerable in the context of an actual MD simulation, where the result of the next step depends directly on the result of the previous step. MD simulations thus require the inclusion of relatively high-energy configurations in the training set, such that repulsion and attraction can be learned. In order to judge the usefulness of HDNNP models in practice, it is important to test their performance in actual MD simulations. For this purpose, we have chosen two test systems with larger conformational flexibility: (i) retinoic acid (a vitamin A derivative) in water, and (ii) the transition state of the reaction of SAM with cytosine in water (Figure 4). Sampling of the training set was performed in two different ways. For the SAM/cytosine system, sampling was performed at an elevated temperature (500 K) with a time step of 1 fs. For the retinoic acid system, this strategy resulted in an unstable trajectory. Thus, sampling was performed at 300 K with a time step of 0.5 fs (see Methods for more details). The ML models investigated employ the ∆-learning scheme with DFTB as baseline, and use 20 neurons per hidden layer, only two-body terms for the QM-MM interactions, gradient correction (with ω_QM = 1 and ω_MM = 200), and a cutoff of 1.4 nm.
Figure 4: Two larger test systems used for MD simulations. (Left): retinoic acid in water (50 QMatoms and about 2500 MM partial charges with a cutoff of 1.4 nm). (Right): transition state of thechemical reaction between SAM and cytosine in water (63 QM atoms and about 3000 MM partialcharges with a cutoff of 1.4 nm).
First, we assessed the performance of the ML model on a validation set, as done in the previous sections. Table 8 summarizes the results in terms of MAE and RMSE, and Figure 5 shows them graphically. Using the ∆-learning model, the performance is improved compared to the DFTB baseline for all three properties, including the MM gradients.

            Energy [kJ mol⁻¹]            QM gradients [kJ mol⁻¹ nm⁻¹]      MM gradients [kJ mol⁻¹ nm⁻¹]
Model       MAE           RMSE           MAE              RMSE             MAE           RMSE
ML          4.07 (4.10)   5.16 (5.20)    41.25 (42.22)    59.64 (106.28)   1.30 (1.30)   3.54 (3.55)
DFTB        17.91 (17.84) 222.10 (221.00) 202.60 (201.70) 302.10 (300.70)  2.10 (2.10)   16.40 (16.40)

Table 8: Performance of the ∆-learning model and the DFTB baseline on the validation set of retinoic acid in water. Performance is reported as the mean absolute error (MAE) and root-mean-square error (RMSE). The values for the training set are given in brackets. The data points were computed at the BP86/def2-TZVP level of theory.

Figure 5: Comparison between the reference DFT method and the DFTB Hamiltonian (blue) and the ML + DFTB model (green) on the validation set of retinoic acid in water. (Left): Energies. (Right): QM gradients.

Next, we tested the trained ML model by performing a (QM)ML/MM MD simulation for 5000 steps (top panel in Figure 6). In order to compare the ML-corrected energies with those of the DFT reference and the DFTB baseline simulation, single-point calculations were performed for each configuration in the ML + DFTB trajectory. As shown in Figure 6, the energies of ML + DFTB agree better with the DFT reference than those of the DFTB baseline. Although the latter performs relatively well, one should keep in mind that deviations from the reference method accumulate along an MD trajectory. Thus, even relatively small deviations might result in largely different trajectories when starting from the same coordinates.
To assess the stability of the ML model, we extended the simulation to 200’000 steps (bottom panel in Figure 6).

Figure 6: MD simulation of retinoic acid in water (integration time step 0.5 fs). (Top): Energy trajectory from 5000 consecutive steps performed with the ML + DFTB model (green). In each step, the energies were also computed with DFTB (blue) and the DFT reference (BP86/def2-TZVP, black) for comparison. The energy deviation from the reference over the complete trajectory with respect to the minimum energy was: MAE = 9.18 kJ mol⁻¹ and RMSE = 11.28 kJ mol⁻¹ for the ML + DFTB model, and MAE = 20.78 kJ mol⁻¹ and RMSE = 22.77 kJ mol⁻¹ for DFTB. (Bottom): Energy trajectory from 200’000 consecutive steps performed with the ML + DFTB model (green).

In biochemistry, S-adenosylmethionine (SAM) is a co-factor for the transfer of a methyl group by enzymes (methyltransferases). Here, we investigated the transition state of the chemical reaction between SAM and cytosine. Again, we first assessed the performance on a validation set as done above (Table 9 and Figure 7). The ∆-learning model clearly outperforms the DFTB baseline. However, the difference between the prediction accuracy of the energies on the training set and on the validation set indicates that the ML model overfitted the energies. Nevertheless, the performance on the gradients (both QM and MM) is reasonable. Note that the training set contains a much larger number of high-energy conformations than in the retinoic acid system due to the sampling strategy used (T = 500 K).

Figure 7: Comparison between the reference DFT method and the DFTB Hamiltonian (blue) and the ML + DFTB model (green) on the validation set of the transition state of SAM/cytosine in water. (Left): Energies. (Right): QM gradients.
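The trajectory deviations quoted in the figure captions are measured "with respect to the minimum energy", i.e. each energy series is first shifted by its own minimum so that the constant offset between the methods drops out before the errors are averaged. A minimal sketch of this metric (the function name and the toy values are our own, not from this work):

```python
def min_shifted_mae(traj_a, traj_b):
    """MAE between two energy trajectories after shifting each series
    by its own minimum, removing the constant offset between methods."""
    a0, b0 = min(traj_a), min(traj_b)
    diffs = [abs((a - a0) - (b - b0)) for a, b in zip(traj_a, traj_b)]
    return sum(diffs) / len(diffs)

# Toy trajectories in kJ/mol; the constant 100 kJ/mol offset is removed
e_ml_dftb = [100.0, 105.0, 103.0, 101.0]
e_dft     = [0.0,   4.0,   3.0,   1.0]
print(min_shifted_mae(e_ml_dftb, e_dft))  # 0.25
```

The shift is necessary because different electronic-structure methods place their absolute energy zero differently; only relative energies along the trajectory are physically comparable.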
         Energy [kJ mol⁻¹]              QM gradients [kJ mol⁻¹ nm⁻¹]      MM gradients [kJ mol⁻¹ nm⁻¹]
Model    MAE            RMSE            MAE              RMSE             MAE           RMSE
ML       15.19 (4.55)    19.06 (5.76)    85.13 (75.44)   116.51 (100.98)  2.03 (2.15)    5.25 (5.26)
DFTB     32.00 (26.64)   32.50 (39.17)  313.50 (314.20)  418.70 (420.20)  2.43 (2.42)    8.24 (8.24)

Table 9: Performance of the ∆-learning model and the DFTB baseline on the validation set of the transition state of SAM and cytosine in water. Performance is reported as the mean absolute error (MAE) and root-mean-square error (RMSE). The values for the training set are given in brackets. The training set was computed with BP86/def2-SVP.

Next, we performed a (QM)ML/MM MD simulation for 3000 steps using the fitted model (top panel in Figure 8). As for the test system with retinoic acid in water, the energies with ML + DFTB agreed well with the DFT reference and outperformed the DFTB baseline, as indicated by the MAE and RMSE over the trajectory. To assess the stability of the model, we extended the simulation to 50’000 steps (bottom panel in Figure 8). For both test systems, it was possible to carry out a stable (QM)ML/MM MD simulation for the selected number of steps (up to 200’000 steps). Of course, the possibility cannot be excluded at this point that a configuration which is not well represented by the ML model might be encountered in even longer simulations.

Figure 8: MD simulation of the SAM/cytosine transition state in water (integration time step 1.0 fs). (Top): Energy trajectory from 3000 consecutive steps performed with the ML + DFTB model (green). In each step, the energies were also computed with DFTB (blue) and the DFT reference (BP86/def2-SVP, black) for comparison. The energy deviation from the reference over the complete trajectory with respect to the minimum energy was: MAE = 10.53 kJ mol⁻¹ and RMSE = 12.77 kJ mol⁻¹ for the ML + DFTB model, and MAE = 19.46 kJ mol⁻¹ and RMSE = 24.09 kJ mol⁻¹ for DFTB.
(Bottom): Energy trajectory from 50’000 consecutive steps performed with the ML + DFTB model (green).

The ∆-learning scheme requires an explicit computation with the lower-level method at each time step. In this work, we used DFTB as the lower-level method. For the SAM/cytosine system with 63 QM atoms and up to 3000 MM particles in the cutoff sphere, one evaluation with DFTB required less than a second, while the corresponding DFT computation took 10 to 30 minutes, i.e. more than three orders of magnitude longer on the same hardware. Using a larger basis set, such as def2-TZVP, increased the computing time to 60 minutes. Thus, the additional time needed for the DFTB calculation at each step of the ∆-learning scheme is less expensive than, e.g., the pair-list creation in the MD engine.

An obvious limitation of the ∆-learning approach is posed by configurations for which the DFTB PES is substantially different from the reference DFT PES. In such cases, the DFTB computation will converge very slowly (or not at all), limiting its usage. However, we did not encounter such an issue in our MD simulations.

In this work, we investigated the use of HDNNPs in (QM)ML/MM MD simulations of condensed-phase systems with DFT (or ab initio) accuracy for the QM subsystem. Standard QM/MM MD simulations at this level of accuracy are very expensive and only applicable to small systems. Using semi-empirical methods to describe the QM part is much faster but also less accurate. In the HDNNP, the MM partial charges up to a certain cutoff are integrated as an additional element type.

We assessed the influence of different parameters on the prediction accuracy of energies, gradients of the QM particles, and gradients of the MM particles. First, we used a simple test system of an apolar, rigid solute in water. Three-body terms to describe the QM-MM interactions could be neglected; however, the inclusion of a gradient correction in the training procedure of the ML model was crucial to obtain sufficient accuracy on the gradients.
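The per-step cost discussed above follows from the structure of a ∆-learning evaluation: at every MD step the cheap baseline is computed explicitly and the learned correction is added on top, for both energies and gradients. A schematic sketch, in which `dftb_calc` and `ml_model` are placeholder callables standing in for the actual DFTB and HDNNP evaluations:

```python
def delta_learning_step(coords, dftb_calc, ml_model):
    """One ∆-learning evaluation: total = baseline + learned correction.
    Both callables return an (energy, gradients) pair for the coordinates."""
    e_low,  g_low  = dftb_calc(coords)  # explicit semi-empirical baseline
    e_corr, g_corr = ml_model(coords)   # ML-predicted (DFT - DFTB) difference
    energy = e_low + e_corr
    gradients = [gl + gc for gl, gc in zip(g_low, g_corr)]
    return energy, gradients

# Toy stand-ins for the two methods (illustrative numbers only)
dftb  = lambda x: (1.0,  [0.5, -0.25])
mlfit = lambda x: (0.25, [0.25, 0.0])
e, g = delta_learning_step(None, dftb, mlfit)
print(e, g)  # 1.25 [0.75, -0.25]
```

Because the correction is a smooth, comparatively small quantity, the ML model needs far fewer parameters than one fitting the full PES, which is the origin of the reduced overfitting observed above.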
In general, we found that classical HDNNPs require a large number of weight parameters to accurately fit the PES of even small condensed-phase systems (e.g. benzene in water). Furthermore, strong regularization on the MM gradients was necessary during the training procedure to predict these gradients with decent accuracy. This is possibly due to the locality ansatz, which makes it difficult for the HDNNP to describe long-range interactions well. As an alternative, we employed a ∆-learning scheme with DFTB as the lower-level method. Semi-empirical methods are known to handle long-range interactions well. Using the ∆-learning model allowed us to decrease the number of weight parameters while retaining a high accuracy. Smaller models usually require fewer training data points and display improved extrapolation capabilities. This may be a key factor for large biomolecular systems, where it is not feasible to enumerate all conformations/configurations for the training set.

For solutes with increased polarity (e.g. uracil, or the transition state of CH₃Cl/Cl⁻ in water), the long-range electrostatic interactions become more important, and with them the size of the cutoff R_c. Increasing the cutoff, which was necessary to better describe the long-range interactions, improved the prediction accuracy of the energy and the QM gradients. However, at the same time, the performance on the MM gradients was reduced. A possible explanation is that the loss function in the training procedure is not sensitive to the relatively small gradients on the MM particles (compared to the QM gradients), and thus the ML model can use these input features to achieve a better performance on the energies and the QM gradients at the expense of the MM gradients (i.e. overfitting). This issue can be resolved to some degree by increasing the weight of the MM particles in the loss function.

The final ∆-learning model was tested further by performing actual (QM)ML/MM MD simulations of relatively large systems, i.e.
retinoic acid in water and the transition state of the chemical reaction between SAM and cytosine. Stable trajectories were obtained with an accuracy close to the DFT reference. We envision that the results and findings presented in this study will enable the use of (QM)ML/MM MD simulations in practical applications.

Acknowledgements
The authors thank Patrick Bleiziffer and Annick Renevey for helpful discussions. S.R. gratefully acknowledges financial support by the Swiss National Science Foundation (Grant Number 200021-178762) and by ETH Zurich (ETH-34 17-2).