[PDF] Automated construction of quantum-classical hybrid models

Abstract

We present a protocol for the fully automated construction of quantum mechanical-(QM)-classical hybrid models by extending our previously reported approach on self-parametrizing system-focused atomistic models (SFAM) J. Chem. Theory Comput. 2020, 16, 1646]. In this QM/SFAM approach, the size and composition of the QM region is evaluated in an automated manner based on first principles so that the hybrid model describes the atomic forces in the center of the QM region accurately. This entails the automated construction and evaluation of differently sized QM regions with a bearable computational overhead that needs to be paid for automated validation procedures. Applying SFAM for the classical part of the model eliminates any dependence on pre-existing parameters due to its system-focused quantum mechanically derived parametrization. Hence, QM/SFAM is capable of delivering a high fidelity and complete automation. Furthermore, since SFAM parameters are generated for the whole system, our ansatz allows for a convenient re-definition of the QM region during a molecular exploration. For this purpose, a local re-parametrization scheme is introduced, which efficiently generates additional classical parameters on the fly when new covalent bonds are formed (or broken) and moved to the classical region.

Full PDF

AAutomated construction of quantum–classical hybrid models

Christoph Brunken and Markus Reiher ∗ ETH Z¨urich, Laboratorium f¨ur Physikalische Chemie, Vladimir-Prelog-Weg 2,8093 Z¨urich, SwitzerlandFebruary 18, 2021

Abstract

We present a protocol for the fully automated construction of quantum mechanical-(QM)–classical hybrid models by extending our previously reported approach on self-parametri-zing system-focused atomistic models (SFAM) [

J. Chem. Theory Comput. , (3),1646–1665]. In this QM/SFAM approach, the size and composition of the QM regionis evaluated in an automated manner based on ﬁrst principles so that the hybrid modeldescribes the atomic forces in the center of the QM region accurately. This entails the au-tomated construction and evaluation of diﬀerently sized QM regions with a bearable com-putational overhead that needs to be paid for automated validation procedures. ApplyingSFAM for the classical part of the model eliminates any dependence on pre-existing pa-rameters due to its system-focused quantum mechanically derived parametrization. Hence,QM/SFAM is capable of delivering a high ﬁdelity and complete automation. Furthermore,since SFAM parameters are generated for the whole system, our ansatz allows for a con-venient re-deﬁnition of the QM region during a molecular exploration. For this purpose, alocal re-parametrization scheme is introduced, which eﬃciently generates additional clas-sical parameters on the ﬂy when new covalent bonds are formed (or broken) and movedto the classical region. ∗ Corresponding author; e-mail: [email protected] a r X i v : . [ phy s i c s . c h e m - ph ] F e b Introduction

In contrast to most protocols of computational quantum chemistry that considerisolated molecules, chemical processes can take place in a vast variety of complexenvironments. Studying chemical reactions in proteins, in nanostructures, and onsurfaces, requires a theoretical approach that must provide an accurate quantum me-chanical description of the reaction center and at the same time an eﬃcient model tocope with the enormous system size of the structured, heterogeneous environment.For these requirements to coalesce, one is forced to apply a hybrid method, which di-vides the system into several regions treated at diﬀerent levels of approximation. [1,2]

Typically, the reaction center is modeled with a quantum mechanical method, whichallows one to describe the formation and cleavage of covalent bonds in a natural way.The environment may eﬃciently be treated by a force ﬁeld [3] rooted in classicalmechanics. Such a quantum-mechanical/molecular-mechanical hybrid (QM/MM)model [4,5] typically requires signiﬁcant manual work involved in the model construc-tion process. It is plagued by a lack of standardization and comparability, and rigor-ous uncertainty quantiﬁcation is not available (as for most computational chemistrymethods), which is particularly critical for such complex composite models. Typi-cally, no standardized procedures exist regarding, for instance, the parametrizationof the force ﬁeld (especially for metal-containing regions), the QM region determina-tion, the choice of boundary scheme, the initial structure generation, conformationalsampling, and the extent to which parts of the macromolecule are constrained dur-ing structure optimizations in order to conﬁne the complexity of the system as wellas to limit the error of the MM energy contribution.One of the most prominent challenges for QM/MM modeling is the constructionof an eﬃcient and at the same time accurate molecular mechanics model, which istypically only available for a pre-deﬁned subset of chemical elements with standardbonding patterns. Its degree of transferability to a new system is not obvious at all.We addressed this issue in our recent work [6] by introducing the self-parametrizingsystem-focused atomistic model (SFAM), which allows for an automated construc-tion of a molecular mechanics model for a system of arbitrary size and elementalcomposition. The reference data for the model are obtained in a fully automatedway by an autonomous fragmentation algorithm and subsequent quantum chemicalreference calculations for the molecular fragments. Furthermore, SFAM includesa model reﬁnement step based on ∆-machine learning [7] when additional referencedata become available during a molecular exploration.In this work, we extend SFAM toward quantum–classical hybrid models (QM/SFAM)with an automated set-up. The application of SFAM as the classical part of themodel eliminates any issues arising from an incomplete set of parameters. Further-more, SFAM guarantees that the reference data, from which the MM parametersare derived, can be provided by the same quantum chemical method as is appliedin the QM part of the hybrid calculation making the MM model as consistent as2ossible with the QM part.Furthermore, we present a scheme to determine the choice of the QM region whichresults in an accurate QM/SFAM model. The selection of atoms to include in thequantum mechanical part of the calculation is a highly non-trivial task and thedecision often relies on chemical intuition alone. However, there have been recenteﬀorts to select QM regions systematically on a ﬁrst-principles basis by Kulik andcoworkers. [8]

Their systematic QM region determination scheme relies on evaluatinghow speciﬁc residues of a protein aﬀect the electronic structure (charge distribu-tion, frontier orbitals) of the core residues of the protein. In this work, we providean alternative approach solely based on the fundamental quantities of a molecularsystem, i.e., its electronic energy and its derivatives with respect to the nuclear co-ordinates, which are available from computationally cheap and approximate as wellas expensive and accurate electronic structure models.The nuclear derivatives play an essential role in obtaining reliable structures (bymolecular dynamics (MD) sampling or structure optimization) and we therefore fo-cus on the accurate description of the atomic forces when determining which atomsmust be included in the quantum mechanical part of the model. Deﬁning an op-timal set of atoms as the QM region is essential, especially because it has beendemonstrated by Ochsenfeld, [9–11]

Martinez [12] and others [13–15] that for many sys-tems QM/MM models safely converge only with large QM region sizes of severalhundred atoms. Therefore, in those cases where such large QM regions are not fea-sible (e.g., vast reaction network explorations or MD simulations), it is inevitable tocarefully select the QM region systematically in order to guarantee that the resultingmodel is an accurate approximation to a full quantum mechanical model.In the following, we ﬁrst describe our QM/SFAM model and then introduce analgorithm to systematically determine the composition of the QM region in an auto-mated way. This is demonstrated with the examples of (i) a medium-sized peptidethat also allows for full-quantum reference calculations and of (ii) a larger systemto resemble a typical case of application. Although these examples are taken frombiochemistry, we emphasize that our model is agnostic with respect to the elementalcomposition due to its ﬁrst-principles core. Hence, any nanoscale atomistic systemcan be subjected to our hybrid model construction process, even one for which amolecular structure ﬁrst needs to be constructed (by virtue of the SFAM approachthat early on in the model generation provides an approximate force ﬁeld for iterativestructure reﬁnement [6] ). 3

Theory

We brieﬂy review the SFAM approach [6] as it will be the classical part of our hybridmodel. Similar to QMDFF [16] and QuickFF [17] molecular mechanics models, alsoSFAM is generated automatically for a speciﬁc molecular system from quantummechanical reference data, which yields accurate force ﬁelds without being limitedby the elemental composition of the molecular system. SFAM is distinct from thetwo aforementioned models [16,17] in two crucial aspects. First, SFAM force constantsare parametrized by a partial Hessian ﬁt algorithm [18,19] as introduced by Hirao andcoworkers in 2016, i.e., the parameters are ﬁtted solely to local information in theHessian, which allows us to generate the model for very large molecular systemsby calculating reference data for fragments cut out of the whole structure. Wealso introduced an autonomous fragmentation algorithm for this purpose. [6]

Second,SFAM includes an (optional) improvement step based on ∆-machine learning (∆-ML). [7]

While the MM base model of SFAM provides an accurate description of the potentialenergy surface (PES) close to the local energy minimum taken as a reference forparametrization, its parameters are not guaranteed to be transferable across allregions of the PES. The base model can be applied in an exploration of additionalstructures (e.g., in molecular dynamics simulations), for which additional referencedata can be calculated on the ﬂy. The MM/ML ansatz of SFAM can then graduallyincrease its accuracy across the PES as an increasing amount of reference datais collected to train the ML model. Many ML-only models have been reportedas replacements for classical force ﬁelds. [20–24]

However, our MM/ML approach forSFAM has several advantages. On the one hand, the MM base model providesphysical insight into the properties of the system in contrast to an approach solelybased on ML. On the other hand, it requires only a limited and well controllableamount of reference data, as it is parametrized on single-point data obtained forfragments (optimized structures, atomic charges and Hessians).The SFAM energy E SFAM can be written as the sum of the MM and ML contribu-tions, E SFAM = E MM + E ML . (1) E ML is zero at this level, because we choose the hybrid QM/SFAM model as the newbase model, which can then be reﬁned in a later step by ∆-ML. The SFAM energyexpression is divided into a covalent (cov) part E cov and nonbonding (nb) potentialenergy contributions E nb , E SFAM = E cov + E nb , (2)as is common in MM models. [3] The covalent energy contribution is calculated fromthe displacements of the internal degrees of freedom out of their equilibrium positionsand can therefore be divided into terms for bonds r , bond angles α , dihedral angles4 , and improper dihedral angles ϕ , E cov = E r + E α + E θ + E ϕ = (cid:88) ( A,B ) E ABr + (cid:88) ( A,B,C ) E ABCα + (cid:88) ( A,B,C,D ) E ABCDθ + (cid:88) ( X,B,C,D ) E XBCDϕ , (3)where ( A, B ) denotes a group of two bonded atoms A and B , ( A, B, C ) a bondedtriplet of atoms, (

A, B, C, D ) a bonded quadruplet of atoms, and (

X, B, C, D ) theatoms of an improper dihedral angle (with X representing the center atom). Thenonbonding interactions comprise an electrostatic part (estat), dispersive (disp) andPauli repulsion (rep) interactions, and hydrogen bonds, E nb = E estat + E disp + E rep + E hb = (cid:88) ( A,B ) (cid:0) E AB estat + E AB disp + E AB rep (cid:1) + (cid:88) ( D,H,A ) E DHA hb , (4)where ( A, B ) represents a pair of atoms, in which the atoms A and B are neitherbonded to one another nor both bonded to another atom C , and ( D, H, A ) is ahydrogen bond. For details on the potential energy expressions for each of the MMcontributions, as well as an explanation of the parametrization procedure, we referto our previous paper. [6]

We emphasize that combining the classical model in SFAM with a quantum chemicalmethod creates an opportunity to apply our ∆-ML improvement step to the hybridmodel. During a molecular exploration with the QM/SFAM method, quantumchemical reference data are collected without any additional eﬀort. Our machine-learned model corrections can be trained with these data and will be valuable (i)if the QM focus is moved to a diﬀerent section of the whole system and (ii) if thedata can be transferred to improve the description of similar molecular substructureslocated in the MM region. We also note that related eﬀorts to combine machinelearning with quantum − classical hybrid methods have been reported recently. [25,26] The energy expression of any hybrid model with two distinct regions, a quantumcore Q and an environment E , can be approximately separated into a quantum-core-only contribution E Q , an analogous contribution of the environment E E , andan interaction energy E Q−E , E hybrid = E Q + E E + E Q−E . (5)When choosing two diﬀerent methods for the two regions Q and E , there must not beany electromagnetic interactions included twice or be missing completely. As notedbefore, we apply a quantum chemical method for the core region and SFAM for the5nvironment. Furthermore, we distinguish two schemes for the interaction energy E Q−E , (a) one in which the electrostatic interaction is treated by SFAM and (b) onein which it is described quantum mechanically. The former is known as mechanicalembedding (ME) and the latter as electrostatic embedding (EE). Although it hasbeen demonstrated that EE provides more accurate results than ME, in particularfor small QM regions, [11] we implemented both embedding schemes because EEmay not always be available for the QM method of choice. In the case of ME, theQM/SFAM energy expression reads, E MEQM/SFAM = E Q + E SFAM + E Q QM − (cid:88) ( A,B ) ∈Q E ABr − (cid:88) ( A,B,C ) ∈Q E ABCα − (cid:88) ( A,B,C,D ) ∈Q E ABCDθ − (cid:88) ( X,B,C,D ) ∈Q E XBCDϕ − (cid:88) A ∈Q B ∈Q A (cid:54) = B (cid:0) E AB estat + E AB disp + E AB rep (cid:1) − (cid:88) ( D,H,A ) ∈Q E DHA hb . (6)In Eq. (6), the energy E Q + E SFAM refers to the SFAM energy of the full system (i.e., Q and E combined) and E Q QM is the electronic energy obtained in a QM calculation for Q , E Q QM = (cid:104) (cid:98) H Q el,ME (cid:105) (7)with the electronic Hamiltonian in atomic units (cid:98) H Q el,ME = − N Q el (cid:88) i = 1 m e ∆ i − N Q nuc (cid:88) α = 1 M α ∆ α − N Q el (cid:88) i = 1 N Q nuc (cid:88) α = 1 Z α | r i − R α | + N Q el (cid:88) i = 1 N Q el (cid:88) j = i +1 | r i − r j | + N Q nuc (cid:88) α = 1 N Q nuc (cid:88) β = α +1 Z α Z β | R α − R β | , (8)where N Q el is the number of electrons, N Q nuc is the number of atomic nuclei, Z α isthe nuclear charge of nucleus α , and r i and R α are the Cartesian coordinates ofelectrons and nuclei in Q , respectively.The latter part of Eq. (6) subtracts all energy contributions in the MM force ﬁeldthat are covered by E Q QM , i.e., all bond terms E ABr with the atoms A and B bothin the QM region, all angle terms with at least two, dihedral terms with at leastthree, and all improper dihedral terms with all four of their corresponding atomsin the QM region. All pairwise noncovalent interaction terms are subtracted foreach pair of atoms in Q . Hence, all noncovalent interactions between Q and E aredescribed by SFAM, which is a consistent approach if the electronic structure modeldoes not acount for dispersive interactions (as in many standard density functionals)so that they can be treated semi-classically everywhere in the system. For EE, we6pply a quantum mechanical description of the electrostatic interaction by deﬁning E EEQM/SFAM as, E EEQM/SFAM = E MEQM/SFAM + E Q QM, estat − (cid:88) A ∈Q B ∈E E AB estat , (9)with E Q QM, estat = E Q QM,EE − E Q QM,ME . (10)Eq. (9) includes an additional energy contribution obtained in the QM calculation of Q , namely E Q QM, estat . This is the interaction energy of the elementary particles (elec-trons and nuclei) in Q with the electrostatic potential generated by SFAM’s atomicpartial charges located at atomic positions in E as point charges. Naturally, theclassical equivalent of this interaction must be subtracted to avoid double counting.The electronic Hamiltonian operator (in Hartree atomic units) is therefore diﬀerentfor EE compared to ME, (cid:98) H Q el,EE = (cid:98) H Q el,ME − N Q el (cid:88) i = 1 (cid:88) A ∈E q A | r i − r A | + N Q nuc (cid:88) α = 1 (cid:88) A ∈E Z α q A | R α − r A | . (11)Here, q A is the partial charge of atom A in E , and as before, Z α is the nuclear chargeof nucleus α in Q , and r A , r i , and R α are the Cartesian coordinates of the atoms in E and the electrons and nuclei in Q , respectively. The van der Waals interactionsare treated at the SFAM level (based on semi-classical dispersion corrections [27–30] of Grimme) in both embedding schemes. Within Q , the QM method must take careof dispersive interactions. The challenge of describing a single molecular system with two diﬀerent physicaltheories becomes most apparent at the boundary of the two regions Q and E , par-ticularly if the boundary intersects a covalent chemical bond. [31] Various strategieshave been developed for modeling this QM–MM boundary. [1,2]

The by far mostcommon one is the link-atom approach, [31–37] in which the covalent bond at theborder of the QM region is valence saturated by a hydrogen atom or some otherprototypical residue (e.g., a methyl group). The most prominent alternative is togenerate localized bond orbitals from a slightly larger QM calculation and includethese doubly occupied orbitals in the QM calculation of the hybrid method. Duringthe self-consistent ﬁeld (SCF) optimization of the orbitals, these artiﬁcial orbitalsare kept frozen. This approach known as the local-SCF method [38–41] was introducedby Rivail and coworkers and later extended by Gao and coworkers. [42–44]

Further-more, advanced embedding approaches may be applied to separate the QM regionfrom an environment, such as projector-based embedding and embedded mean-ﬁeldtheory, [45–49] frozen density embedding, [50–61] or the subsystem separation by unitaryblock-diagonalization approach (SSUB). [62]

7n our QM/SFAM implementation, we focus on the link-atom approach, but empha-size that our implementation can be extended to include the more advanced bound-ary schemes mentioned above. It is crucial for the link-atom approach to carefullyselect the bonds which indicate the boundary of the QM region, because replacingheavy atoms with hydrogen link atoms introduces artiﬁcial eﬀects on neighboringmolecular entities by distorting the electronic structure.Our automated strategy for placing link atoms is described in section 2.7. Alongthe vector of a single bond i , r E i Q i = r E i − r Q i , (12)which is deﬁned by the coordinates of atom E i in the environment and Q i of thequantum region bonded to E i , a hydrogen link atom is positioned at r H i = r Q i + ( R Q i , cov + R H,cov ) r E i Q i | r E i Q i | , (13)with the covalent radii [63,64] R Q i , cov and R H, cov of Q i and hydrogen. This approachallows us to cut through single bonds only (see section 2.7), which is, however, not asevere restriction, especially considering the fact that the QM region can be enlargedto eventually meet it.To exploit the force on an artiﬁcial link atom H i , which is located between E i and Q i , its energy gradient g H i calculated in the QM calculation must be distributed to E i and Q i . The gradient contributions ˜ g E i and ˜ g Q i in direction µ ( µ = x, y, z ) are,following Ref. 65,˜ g Q i ,µ = g TH i · (cid:18)(cid:18) − | r Q i − r H i || r E i Q i | (cid:19) u µ + | r Q i − r H i | d µ | r E i Q i | r E i Q i (cid:19) , (14)˜ g E i ,µ = g TH i · (cid:18) | r Q i − r H i || r E i Q i | u µ − | r Q i − r H i | d µ | r E i Q i | r E i Q i (cid:19) . (15)where d µ is the absolute value of the diﬀerence in the µ -th component of r E i and r Q i ,and u µ is a unit vector in direction µ , e.g., u z = (0 , , T . With these equations,it is straightforward to calculate analytic gradients for the QM/SFAM energy aslong as analytic gradients are available for the QM method (including gradientson the external point charges). These gradients are essential for eﬃcient structureoptimizations and for molecular dynamics simulations with our QM/SFAM model.In electrostatic embedding, another issue arises at the Q – E boundary. Since the par-tial charge q E i of some atom E i at the boundary is included into the QM calculationas an external point charge, the link atom may suﬀer from overpolarization eﬀectscaused by the close proximity of that charge. To counteract this artiﬁcial eﬀect,many strategies have been proposed such as deleting the charge, [32,66,67] redistribut-ing it, [68–72] or smearing it out [36,73,74] by replacing the point charge by a Gaussiancharge distribution centered at E i . We apply a charge redistribution scheme that8as shown to produce accurate results compared to full QM calculations. [72] Weimplemented two variants of the charge redistribution: (i) one in which the totalcharge is conserved (redistribution of charge, denoted RC) and (ii) one in whichalso the bond dipoles of the ﬁrst shell of bonds in E are conserved (redistribution ofcharge and dipoles, denoted RCD). Both schemes are illustrated in Fig. 1. R OHCH R 𝜀 i 𝜀 i ,1 𝜀 i ,2 𝜀 i ,3 𝓠 i + q + q + q (a) R OHCH R 𝜀 i 𝜀 i ,1 𝜀 i ,2 𝜀 i ,3 𝓠 i + + + − q − q − q (b) Figure 1:

Illustration of (a) the redistribution of charge (RC) scheme and (b) the redis-tribution of charge and dipoles (RCD) scheme to prevent overpolarization of a link atom(not shown) by the charge on E i at the QM/SFAM boundary i (wiggly line). Green colorhighlights the redistribution of partial charges toward the positions to which the arrowsare pointing, whereas the color red indicates a subtraction of charge. We deﬁne q = q E i /n with n being the number of non-QM neighbors of E i . In this example, n = 3 as the neigh-boring atoms are E i, , E i, , and E i, . R represents the remainder of the QM region and R denotes the rest of the environment. In both schemes, the charge on E i vanishes; i.e., the new charge is ˜ q E i = 0. In the RCscheme, the charge is shifted equally to the n neighbors E i,k (with k ∈ { , . . . , n } )of E i , which are also in E ,˜ q E i,k = q E i n for k = 1 , . . . , n . (16)In the RCD scheme, to conserve the bond dipoles of the bonds E i − E i,k , the chargeon E i is shifted to the positions half-way in between the E i − E i,k bond vectors anddoubled in magnitude, resulting in auxiliary charges at positions r aux ,k ,˜ q r aux ,k = 2 q E i n , (17)with r aux ,k = 12 (cid:0) r E i + r E i,k (cid:1) . (18)The factor of two, i.e., the doubling of the shifted charge, is introduced to preservethe magnitude of the bond dipole as the distance between the charges is halved.Consequently, also the charges on the neighboring atoms must be adjusted so thatthe new charges on atoms E i,k become˜ q E i,k = q E i,k − q E i n . (19)9 .4 QM/SFAM structure optimization As targets of QM/SFAM are large systems with many degrees of freedom, structureoptimizations tend to require many iterations to reach convergence. Most of thedegrees of freedom to optimize can be attributed to the environment E and there-fore the necessity to perform a QM calculation in every optimization step can beavoided by a microiteration-based structure optimization. Several variants of suchan algorithm exist and have been implemented in QM/MM programs to acceler-ate structure optimizations of large systems. [75–81] The aim of these approaches is toreach the same local energy minimum structure as in a regular optimization withoutexpensive QM calculations in every step.In our variant of the algorithm, the Cartesian coordinates of all atoms in Q aswell as those of atoms in E within a distance R to any atom in Q are frozen,while all remaining MM degrees of freedom are relaxed (either until convergenceor until a maximum of N steps is reached). As all of the atoms in Q are ﬁxed(i.e., their gradients are treated as zero), no QM calculation is necessary duringthese microiterations. We note that, despite the system-focused parametrization ofSFAM that can be tailored to the QM model in QM/SFAM, our attempts to utilizethe MM gradient for the whole system in this step were fruitless. In fact, it is thisremaining mismatch of SFAM and QM forces that requires the environment atomsat the QM boundary to be kept frozen.Then, the complete system is relaxed according to the full QM/SFAM gradients. Nonuclear positions are constrained and therefore a QM calculation is needed for eachevaluation of the complete gradient. Once convergence has been reached the opti-mization terminates. If convergence cannot be reached after N steps, one macroi-teration step will be completed and the procedure will iterate again starting withthe ﬁrst MM-only step.For the parameters of this algorithm, we found values of R = 4 ˚A, N = 1000and N = 15 to perform well in all examples studied in this work, but they maybe adjusted if needed. For the individual structure optimizations, the algorithmsimplemented in the SCINE Utilities library [82] are applied. A crucial issue of molecular mechanics models is that their error in the total energyof the system is expected to scale unfavorably with system size, e.g., measured interms of the number of atoms N at . While for small systems, energies obtained withmolecular mechanics have been shown to be accurate, especially with system-focusedmodels, [6,16,17] this is not expected for large systems, which we illustrate by a simplestatistical model. [83] Consider the covalent terms in the total MM energy expressionwhich is a sum of approximately 3 N at (mostly) independent terms, each with an10ncertainty of ± ∆ ε . We can model the estimated error under the ideal assumptionof equal probabilities to either underestimate or to overestimate the energy of asingle potential term by ∆ ε . For large N at , the corresponding binomial distributioncan be approximated by the normal distribution. [84] As a result, we estimate a totalerror of at least ∆ E ≈ . (cid:112) p (1 − p ) N at (20)to occur with a probability of 50% applying these simple assumptions with p beingthe probability of overestimating an individual potential energy by ∆ ε instead ofunderestimating it. For example, for a system with 1000 atoms it is expected thatwith a probability of 50%, a total error of at least 18 . ε will be observed, whichscales with √ N at . Moreover, the MM model may exhibit a systematic over- orunderestimation of the potential energies to some (minor) extent, resulting in anexpected error that scales linearly with system size. Furthermore, we note that theMM noncovalent pair interaction terms, which outnumber the covalent terms, aremore diﬃcult to assess with respect to their error contribution [85,86] because of theirdistance and hence structure dependence.Regardless of the simpliﬁed assumptions inherent to Eq. (20) such as neglecting addi-tional uncertainties introduced by the noncovalent interactions (see also Refs. [85,86] ),it is apparent that for large systems the energies of classical models may exhibit largeuncertainties (even with system-focused approaches). By contrast, atomic forces arelocal quantities evaluated as partial ﬁrst-order derivatives at a given reference struc-ture for each atomic nucleus.In view of these considerations, common practice in QM/MM studies is to gener-ate and sample structures, either by molecular dynamics simulations or structureoptimizations. [87–93] To obtain accurate energies of local minima on the PES, it iscommon to freeze all MM atoms beyond a given distance from the active site duringstructure optimizations [90,94–97] to obtain a converged structure at smaller computa-tional cost and with larger resemblance of a reference structure such as a structuremeasured by X-ray diﬀraction, [98] and to eliminate the contribution of most of theMM region to the total energy. The advantage of this strategy, compared to neglect-ing all MM contributions to the total energy, is that eﬀects of structural changesclose to the active site are captured. However, there exist no standardized guidelinesfor the choice of this additional cutoﬀ parameter, which may have a signiﬁcant eﬀecton calculated energies.Considering all of the aforementioned factors, we introduce a reduced QM/SFAMenergy E redQM/SFAM to counteract the possibly large uncertainties induced by the clas-sical description of a large environment E , E redQM/SFAM = E Q QM + E Q−E , (21)in which any covalent SFAM contributions as well as the noncovalent interactionswithin the environment are neglected. The QM calculation is embedded into theenvironment by including the Q – E interaction E Q−E either through mechanical orelectrostatic embedding (see section 2.2).11e propose that during a molecular exploration, relevant structures should be iden-tiﬁed with the complete QM/SFAM model (e.g., by molecular dynamics simulationsor structure optimizations), while for the energy diﬀerences of intermediate struc-tures on the PES, the diﬀerence between the two strategies for computing the energy,∆ = ∆ E QM/SFAM − ∆ E redQM/SFAM , (22)should be monitored closely. For ∆ E redQM/SFAM , it is crucial that energy contributionsfrom structural changes in close proximity to the active site are picked up by theQM calculation. With SFAM as the classical part of the hybrid approach, parameters are always gen-erated for the whole system automatically before starting a molecular exploration.Therefore, one is not restricted in the selection of the QM region and may freelyre-deﬁne the QM region. It can be valuable to have this ﬂexibility in automatedreaction network explorations [99,100] as well as in reactive molecular dynamics simula-tions because reactive centers can shift during a multi-step mechanism. This featureis also a requirement for applying QM/SFAM in an interactive quantum chemistryframework, [101,102] as the ability of the operator to choose a region of interest in alarge system should not be limited by missing parameters.Naturally, QM/SFAM does not require parameters for the covalent terms in the QMregion. This means that a reaction that takes place in the QM region and modiﬁesthe local connectivity of the atoms (and hence the SFAM topology), does not resultin the model to become unapplicable. Note that parameters for van der Waalsinteractions, namely the dispersion coeﬃcients, may be required in the QM region.These can be either quickly re-evaluated or, as an approximation, kept constant evenafter the modiﬁcation of the topology, because the dispersion coeﬃcients for the sametypes of elements are expected to be similar (we note that the dispersion coeﬃcientsof the predecessor of D3, i.e., D2, [103] are ﬁxed for each pair of elements). Partialcharges are expected to be less transferable after a chemical reaction; however, theseare not needed for atoms in the QM region.Even if no bond breaking and bond formation processes are possible in the classicalregion, a re-deﬁnition of the QM region can move atoms aﬀected by such processesfrom the QM region to the environment. In this case, the connectivity of the atomsis modiﬁed in the classical region and therefore SFAM must be re-parametrized ifthe newly required SFAM parameters are not available due to the existence of thesame bonding pattern somewhere else in the system. To cope with such events, wehere extend our SFAM parametrization procedure [6] by the option to re-parametrizelocally, for which QM data from the QM-region calculation may be exploited. In12eneral, the missing parameters must be obtained in an eﬃcient way at a smallfraction of the cost of the full-system parametrization.At the start, we identify all parameters which are not covered by the existing setof SFAM parameters. To calculate the required reference data (i.e., optimized localgeometries, Hessian matrix, atomic partial charges, bond orders), we fragment thewhole system as explained in our original work on SFAM, [6] but perform calculationsonly on those fragments that were generated around the atoms involved in the bonds,angles, or dihedral angles with missing parameters. Subsequently, the parametersare optimized based on the calculated Hessians and local equilibrium geometries.Partial charges and connectivity information (obtained from Mayer [104,105] covalentbond orders) are updated for all atoms and bonds for which new information isavailable (see our original work on SFAM [6] for details). Moreover, the dispersioncoeﬃcients are re-evaluated for the whole system due to the negligible additionalcomputational eﬀort associated with it; see Fig. 2 for an overview of the wholeprocedure. chemicalreactionre-parametri-zationIdentify missingparameters Calculate reference datafor relevant fragments Optimize missingparameters onlyUpdate atomic charges,bond orders anddispersion coe ﬃ cients1234 𝓠 𝜀 movedto classicalregionno extraparametersneeded 𝓠 𝜀 𝜀 missingparameters 𝓠 Figure 2:

Local re-parametrization of the SFAM model if a region in which new covalentbonds are formed or broken is moved out of the QM region Q to become part of theenvironment E . Shown is an esteriﬁcation reaction, which is embedded in a large environ-ment (not shown for the sake of clarity). The box provides a short summary of the stepsinvolved in a local model reparametrization. Finally, we note that this strategy can be combined with a second approach towardﬂexible QM regions, i.e., with adaptive QM/MM schemes for molecular dynam-ics. [106–110]

These have been developed in recent years to allow for moving smallmolecules (e.g., solvent molecules) from the MM to the QM region (and vice versa)during an MD simulation while preserving a smooth description of the total energy.13 .7 Automated selection of the QM region

In this section, we introduce our algorithm for the selection of atoms for the QMregion. Once a location of the QM region is provided, either based on structural char-acteristics that indicate chemical reactivity or by explicit manual determination, thislocation allows us to identify an atom around which the QM region is constructed(called “the center atom”). Our aim is to deﬁne an automated, universal, anddata-driven procedure to ﬁnd an accurate QM/MM model for the description of thereactive center when compared to a full QM calculation on the system or, if this iscomputationally not feasible, to the best possible estimate of that. First, one needsto deﬁne a descriptor to measure the accuracy of a given model. As mentioned inthe Introduction, previous work by Karelina and Kulik [8] applied descriptors basedon charge distribution. However, we focus on the forces on atoms in the proximityto the center atom, because these relate directly to reasonable structures either instructure optimizations or molecular dynamics simulations.For long-time molecular dynamics simulations (possibly with large basis sets), onecannot aﬀord as large of a QM region as, for instance, in structure optimizationsthat require less than 100 single-point gradient calculations. Moreover, for smallQM regions it is also important to have a systematic approach toward a reliablesolution. We therefore emphasize the importance of automation required to carryout a large number of exploratory calculations on candidate models of diﬀerent sizewhich need to be automatically set up, carried out, and then analyzed (includingalso the construction of the models) for the reliable and autonomous QM regiondetermination to be applicable in a routine fashion.In the following, we ﬁrst explain how we construct QM regions around a given centeratom automatically. Then, the selection criteria for the QM region are discussed.Finally, we clarify how to obtain reference data for systems where a full QM calcu-lation is not feasible.The construction of a QM region with a user-deﬁned center atom represents a taskanalogous to the fragmentation step in our SFAM model generation. [6]

In the lattercase, we construct one molecular fragment around each atom of the system undersubsequent valence saturation with hydrogen atoms. This is achieved by ﬁrst deﬁn-ing a sphere with radius r around a selected atom and adding all atoms within itto the fragment. Second, all covalent bonds which were cut by the sphere’s edge areidentiﬁed and followed outwards recursively until a covalent bond is reached at whichthe system can be divided and valence saturated by a residue (currently, hydrogen-atom saturation has been implemented). Which bonds are considered cleavable ispre-determined, but can be adapted for a given system. For biochemical systems,C sp − X bonds (with X = C, N) can be considered a suitable choice because oftheir abundance in biological macromolecules. We emphasize that advanced em-bedding schemes, [45–49] frozen density embedding [47–62] may have the potential toreplace this rule-based saturation approach. Combined with an initial radius be-14ween r = 5 . r = 7 . [6] which means that the re-quired reference data can easily be calculated for these fragments with contemporarydensity functional theory.To sample several model sizes and boundaries, we introduce a stochastic elementto our automated QM/MM model construction. If a cleavable bond is reached, thesystem may be chosen to be split into QM and MM parts at that bond with someprobability p . Naturally, the resulting set of QM regions will contain duplicatesbecause each QM region is constructed independently. Hence, the set of QM regionsmust be deduplicated. This straightforward approach is chosen over a systematicgeneration of all possible QM regions, because of its simple implementation in thecurrent fragmentation framework and the otherwise exploding number of possibleQM regions of varying size, for which an exhaustive generation and selection processbecomes unfeasible. Nevertheless, we point out that for small QM regions, thestochastic approach is also capable of generating all possible QM regions exhaustivelyup to a given size due to the eﬃciency of the fragmentation algorithm. Hence, theparameters r and p allow for adjusting the QM region size and the variation ofsizes.To ﬁlter and categorize the generated QM regions, we introduce two additional de-scriptors. The ﬁrst one is the number of covalent bonds m link cut in the processof deﬁning the QM region, which is equal to the number of link atoms in the re-sulting fragment. As a second feature of the generated QM regions, we introduce asymmetry measure m sym , m sym = r LDM r MDQ , (23)where r LDM is the mean distance of the central atom to the three least distant MMatoms (LDM) and r MDQ is the mean distance of the central atom to the three mostdistant QM atoms (MDQ). This descriptor can be applied to assess the extent towhich atoms are arranged aspherically around the reactive center.To measure the reliability of some automatically produced QM/MM model m interms of how accurately the atomic forces in close proximity to the center atom aredescribed compared to a reference (’ref’), we ﬁrst select a set of N repr representativeatoms that are closer than a cutoﬀ of r repr to the center atom. The mean absoluteerror ε mk of the force components ( f x,m,k , f y,m,k and f z,m,k ) for a given atom k in thisset is given by ε mk = 13 (cid:16) | f x, ref , k − f x, m, k | + | f y, ref , k − f y, m, k | + | f z, ref , k − f z, m, k | (cid:17) . (24)The overall accuracy can then be measured by the mean of these errors, ε m mean = 1 N repr N repr (cid:88) k = 1 ε mk . (25)15he reference forces f ref , k = ( f x, ref , k , f y, ref , k , f z, ref , k ) T in Eq. (24) may be obtainedfrom a QM calculation on a signiﬁcantly larger system. As only one single-pointgradient evaluation is necessary for this purpose, this will be feasible for systemsof several hundred atoms. Alternatively, we may obtain a reliable estimate for thereference in the case of larger systems by averaging the force vectors obtained from asample of N ref QM/SFAM models with large QM regions by choosing a radius r thatis as large as possible for a single-point calculation in order to be still feasible in areasonable amount of time. It is important that this estimate is not based on a singlereference model, but on many diﬀerent ones, because it has been shown that onecannot be certain that molecular properties are converged even with QM regionsof up to several hundred atoms. [9–12] The comparison of several QM/MM modelswith diﬀerent QM/MM boundaries allows to detect whether the atomic forces areconverged with the QM region size that was chosen for the reference. If signiﬁcantdeviations exist between reference calculations, it can be detected and ﬂagged bythe algorithm automatically.Finally, we note that it has been demonstrated that one can deploy the domain-basedlocal pair natural orbital coupled cluster methods [111–113] as the QM part (allowingfor large QM regions) to obtain accurate reaction barrier heights, [87,114] which iscrucial in mechanistic explorations for the subsequent kinetic analysis. [115]

RunningQM/SFAM calculations with these QM methods is enabled through an interface tothe quantum chemistry software ORCA. [116,117]

We demonstrate our automated QM/SFAM set-up algorithm with two examples.For the ﬁrst example, chain A of the peptide hormone insulin [118] was chosen becauseits size of slightly more than 300 atoms is large enough to test the eﬀects of diﬀerentQM regions on the QM/SFAM results while at the same time full QM calculationsare still feasible so that a well-deﬁned full-QM reference is available. The initialstructure was taken from the Protein Data Bank [119] (PDB ID: 1AI0).

To study a chemical reaction in this system, we added a 1-propanol molecule inclose proximity to the carboxylic acid group of the C-terminus of the chain (as-paragine, A21), which serves as the initial structure of an esteriﬁcation reaction.The product structure therefore contains a free water molecule and the propyl estercompound. This reaction is depicted in Fig. 3. With the added alcohol as reac-tant, the system consists of 328 atoms. We fully pre-optimized the reactant andproduct structures with the PM3 semi-empirical method [120,121] applying the

ORCA4.2 quantum chemistry software. [116,117]

The subsequent DFT optimization with16I-PBE-D3(BJ)/def2-SVP [28,122–126] was limited to ten optimization steps in orderto obtain forces on all atoms that neither vanish nor acquire artiﬁcially large nu-merical values. The coordinates of these structures can be found in the SupportingInformation.For the reactant structure, a SFAM molecular mechanics model was parametrizedin a fully automated fashion. [6]

The reference data, i.e., optimized structures, Hes-sians, Mayer bond orders, and atomic partial charges, were obtained for the RI-PBE-D3(BJ)/def2-SVP electronic structure model with

ORCA 4.2 driven by oursoftware.

ORCA

Hirshfeld charges, [127] were converted to Charge Model 5 charges byour implementation of the published algorithm [128] in our SCINE software. [129] NH H O NHO NH O O OHS

A19Tyr AsnA18 A17-A2... GlyA1

OH NHO NH HO O OHS

A19Tyr AsnA18 A17-A2... GlyA1 NH Figure 3:

Reaction of 1-propanol with chain A of the peptide hormone insulin, which con-tains 21 amino acids, labeled A1-A21. The residues A20 (cysteine) and A21 (asparagine)are represented as Lewis structures. Note that this representation refrains from showingthe disulﬁde bridge connecting two cysteine residues at positions A7 and A11.

For the generation of the second example, we solvated the (dry) insulin peptidestructure with water molecules. We applied the solvation tool of the

ADF 2016.107 graphical user interface [130] with which 418 water molecules were added (16 ˚A sphere,solute factor of 2.0), resulting in the second system that now comprises 1582 atomsin total. After the water molecules were added, the system was not relaxed. Again,we note that this is done deliberately to work with non-zero forces. The coordinatesof the solvated structures can also be found in the Supporting Information (see Fig. 4for a ball-and-stick representation of both systems). The SFAM parametrization wascarried out analogously to the unsolvated insulin system.17e note that for the parametrization, the bottleneck with respect to computingtime is the reference data generation step with all other tasks being completed inless than ﬁve minutes for the dry insulin peptide structure and less than one hourfor the solvated structure on a modern computing architecture with a single core.However, the time needed for the reference data generation can vary signiﬁcantlydepending on the number of cores chosen for parallel execution. In our set-up, allreference calculations for the solvated structure were completed within a few daysin 200 parallel calculations on 4 cores each.For QM/SFAM calculations, the RI-PBE-D3(BJ)/def2-SVP [28,122–126] combinationof density functional and basis set was applied for the QM region through an interfaceto the

ORCA 4.2 software [116,117] in both examples.Figure 4:

Dry (left) and microsolvated (right; 16 ˚A water-molecule sphere) insulin withpropanol as reactants (cf. Fig. 3) Atom coloring: carbon in gray, hydrogen in white,oxygen in red, nitrogen in blue, and sulfur in yellow.

We ﬁrst apply the QM-region generation algorithm of section 2.7 to the structuralmodel of dry insulin. The carbon atom of the carboxylic acid group of residueA21 (see Fig. 3) was chosen as the center atom around which the QM region isconstructed. We study whether the accuracy of the atomic forces close to the centerof the QM region can be exploited to ensure reproducibility of the QM reference bythe QM/SFAM model. For this purpose, a large variety of diﬀerent QM regions wasgenerated, including aspherical ones with a large value of m sym (see Eq. (23)). We18pplied a small initial radius r = 5 . m sym obtained by this algorithm can be found in the Supporting Information.

50 100 150 200 250 300Number of QM atoms0.000.250.50 M e a n e rr o r o n f o r c e s / k c a l m o l b o h r with thiol groupwithout thiol group Figure 5:

Mean error ε m mean of atomic forces of all non-hydrogen atoms within 4.0 ˚A ofthe central carbon atom in the generated QM/SFAM models for the unsolvated initial structure of insulin. Each point corresponds to a diﬀerent QM/MM model m . The pointsare colored such that for the orange ones the thiol group SH . is part of the QM region,whereas for the blue points it is part of the environment. The point farthest to the rightcorresponds to the model representing the full QM calculation, which is taken as thereference for the error evaluation. Fig. 5 presents the results of evaluating the mean error ε m mean on the atomic forces ofall non-hydrogen atoms within r repr = 4 . N repr = 10),according to Eq. (25). We observe that only four models exhibit a large error ofmore than 0.75 kcal mol − bohr − . All of these models consist of less than 60 QMatoms. Despite the obvious rationale that more QM atoms should result in a higheraccuracy, the mean error on the forces does not strictly decrease with the size of theQM region. This can be attributed to our choice to generate models with a largevariance in asphericity (achieved by applying a small value for p and measured bythe symmetry score m sym ). Large QM regions with more than about 200 QM atomsmay be lacking a residue in close proximity to the center of the QM region, whichwould then result in a large error ε m mean . However, for the models with the smallest19rror for a given size of the QM region, the trend of a decreasing error is observedfor models with more than 150 QM atoms. QM regions that are very asphericalmay be regarded as unsuitable and should not be considered in an application. Thiscan be achieved by choosing a larger value for p as well as by directly rejectingaspherical models with an improper value for m sym . We demonstrate this with oursecond example, the solvated insulin in section 3.3, for which the expected trend ofdecreasing ε m mean with increasing QM region size can be observed (see below).Moreover, it can be easily seen in Fig. 5 that the data split into two groups, whichare separated by about 0.3 kcal mol − bohr − . It follows from this observation that ε m mean allows us to easily eliminate one of these groups from our consideration. Thecause for this eﬀect can be attributed to including or excluding the thiol group ofthe cysteine residue A20 from the QM atoms. The distance of the sulfur atom ofthis group to the central atom was 5.7 ˚A, which implies that it was not alwaysincluded in the QM region. However, it was close enough to aﬀect the esteriﬁcationreaction signiﬁcantly. We call this group SH . . The coloring in Fig. 5 highlights thisobservation. It demonstrates that our descriptor ε m mean is able to clearly distinguishQM/MM models in which SH . is part of the QM region from those where it isnot. 98.4% of all models m that contain the SH . in their QM region were able toreproduce the reference forces of the full QM calculation with a mean error ε m mean less than 0.3 kcal mol − bohr − , while all of the other models produced a larger errorthan 0.27 kcal mol − bohr − . With this example, in which the crucial functionalgroup can be easily identiﬁed, we understand that it is possible to automatically andreliably sort out QM/SFAM models where an unreliable choice of QM region leads tolarge errors in atomic forces. Furthermore, we identiﬁed a second functional group(a carboxylic acid moiety (COOH . ) at a distance of 9.5 ˚A to the central carbonatom), for which an eﬀect on the forces is observed. As this residue has a largerdistance to the reaction center, its inﬂuence on the latter is smaller, which resultsin the observation that the corresponding groups of data are not well separated.Due to the small size of this eﬀect, we refer to the Supporting Information for itsvisualization (diﬀerently colored version of Fig. 5).If a QM/SFAM model is able to accurately describe the forces in the reactant struc-tures, but poorly for intermediates, transition states, or products, it will not besensible to rely on this descriptor for the evaluation of the QM region selectionprotocol. Therefore, we evaluated the atomic forces on the same atoms as beforefor the ﬁnal (product) structure of the reaction shown in Fig. 3 by applying thesame 2000 automatically selected QM/MM models to assess whether the modelsthat led to a small error for the initial structure also performed well for the ﬁnalstructure. The results of this comparison are presented in Fig. 6, in which we encodethe accuracy of the models on the initial structure forces by their color. We observean almost perfect agreement of force deviations measured in terms of ε m mean for theinitial structure and those for the ﬁnal structure, indicating that our measure forthe reliability of a selected QM region is likely to be transferable across a PES, atleast for close-to-minimum energy structures.20aturally, the size of the QM region may be diﬀerent for other physical quantitiesand our assumption that the forces are most crucial for making a decision on the size,albeit reasonable from a structural point of view, needs to be scrutinized. Therefore,we now discuss whether those QM/SFAM models that most accurately reproducedforces also deliver reliable energies. For the esteriﬁcation reaction in dry insulin,the reaction energy calculated as the diﬀerence of reduced QM/SFAM energies (seesection 2.5, Eq. (21)) obtained for the product and reactant structures are presentedin Fig. 7. where the same coloring scheme used for the forces of the product structurein Fig. 6 is applied.

50 100 150 200 250 300Number of QM atoms0.00.20.40.60.81.0 M e a n e rr o r o n f o r c e s / k c a l m o l b o h r M e a n e rr o r f o r i n i t i a l s t r u c t u r e / k c a l m o l b o h r Figure 6:

Mean error ε m mean on atomic forces of all non-hydrogen atoms within 4.0 ˚A ofthe central carbon atom for the generated QM/SFAM model for the unsolvated insulin ﬁnal (product) structure. The model with the smallest QM region exhibits a signiﬁcantlylarger error than 1 kcal mol − bohr − (as in the case of the initial reactant structure, seeFig. 5) and therefore has been omitted here. The colors encode the mean error thatwas obtained for the initial reactant structure with that model (i.e., the vertical axis inFig. 5). Models with errors larger than 0.3 kcal mol − bohr − on the initial structure arenot further distinguished in color as these are not considered reasonable model candidates.This representation highlights that the accuracy found for the initial structure is matchedby the accuracy obtained at a diﬀerent location on the PES (here, the ﬁnal productstructure of the esteriﬁcation reaction). First, we observe that the average energy error is decreasing continuously withgrowing QM region size. Second, the models that exhibited a large error on theforces (larger than 0.3 kcal mol − bohr − ) are separated from the well-performingQM/SFAM models with only a small number of exceptions. This shows that models21hat were discarded after evaluating their accuracy on the forces are also expectedto generate large errors in energy, conﬁrming the reliability of our selection strategy.However, we also observe that for the well-performing models for which the reactionenergy only ﬂuctuates by less than 2 kcal mol − for a given QM region size, theaccuracy of the forces does not map perfectly to the accuracy of the reaction energies(compared to Fig. 6). Models with diﬀerences of less than 0.2 kcal mol − bohr − onthe forces are not distinguished in terms of the energies. However, we still reduce theerror (compared to the most accurate QM/SFAM model for a given QM region size)signiﬁcantly by excluding the models with large errors on the forces; for instance,it is reduced from roughly 4 kcal mol − to 2 kcal mol − considering QM region sizesbetween 100 and 200 atoms.

50 100 150 200 250 300Number of QM atoms02468101214 A b s o l u t e e rr o r o f r e a c t i o n e n e r g y / k c a l m o l M e a n e rr o r o n f o r c e s f o r i n i t i a l s t r u c t u r e / k c a l m o l b o h r Figure 7:

Absolute error of the esteriﬁcation reaction energy for each of the generatedQM/SFAM models of the unsolvated chain A of insulin. The colors correspond to themean error on the atomic forces close to the reaction center obtained for the initial struc-ture with that model (i.e., the vertical axis in Fig. 5). Models with errors larger than0.3 kcal mol − bohr − on the initial structure are not further distinguished by color. Fig. 7 shows that the model evaluation based on atomic forces can predict whichmodels exhibit large errors of reaction energy. However, we observe that it cannot beguaranteed that the models with the smallest values of ε m mean also exhibit the small-est energy errors. Considering these observations, we conclude that the descriptor ε m mean can reliably eliminate choices of QM regions, which are lacking residues thatsigniﬁcantly aﬀect key physical quantities of the reaction. However, the fact thatthis descriptor is based on a single-point property (the atomic forces), results in botha crucial advantage and a drawback of the method. On the one hand, it allows us22o eﬃciently test many model candidates in an automated fashion against referencedata that is calculated with models with very large QM regions. On the other hand,we have shown that it cannot be guaranteed that the models with the smallest ε m mean provide reaction energies that are within 1 kcal mol − of the QM reference. Fig. 7demonstrates that in our example, this accuracy can only be achieved by applyingvery large QM regions with more than 250 atoms. During an exploration of a molec-ular reaction, we therefore stress the importance of applying models of several QMregion sizes in single-point energy calculations to probe for convergence in order toclosely monitor the uncertainties by which the QM region selections are aﬀected.Respective algorithmic procedures can be included in automated workﬂows.The energy errors discussed so far are, of course, given with respect to a DFTreference and therefore aﬀected with some unknown uncertainty. We stress, however,that accurate quantum chemical methods, such as coupled cluster approaches [87,114] with suﬃciently high excitation rank and decent one-particle bases combined withbasis-set exploration or explicit correlation factors, can be applied to obtain moreaccurate energies. Finally, we point out that it may be beneﬁcial to extend ourQM region selection process by explicitly adding energies diﬀerences of two or morestructures to our descriptor. Hence, we designed our implementation in a modularfashion to allow for such extensions easily. At the example of solvated insulin, we demonstrate that it is possible to obtain reli-able reference forces even for large systems for which one cannot routinely perform afull QM calculation on the whole system. As described in section 3.1, the structuresof the previous example after solvation with water are applied for this purpose.For this solvated insulin structure, we do not want to generate a variety of diﬀerentQM regions that is as large in number as the one in section 3.2, because (i) thisincreases the total number of candidate models that needs to be tested (whichwill not be practical in a routine QM/SFAM application) due to the larger totalsystem size, and (ii) the ﬁrst example already demonstrated that very asphericalQM regions (e.g., large QM regions without the nearby SH . group) do not provideaccurate results. Hence, we set a larger probability of p = 0 . r while increasing this value in steps of 0.1 ˚A starting at 5.5 ˚A and terminating at11 ˚A, yielding 11000 models in total. After the deduplication process, 673 uniqueQM/SFAM models were created. The smallest QM region comprised 68 atoms, thelargest 410 atoms. The obtained distribution of QM region sizes is provided in theSupporting Information.As described in section 2.7, we take the mean of N ref models with very large QMregions to obtain the reference forces. In this case, we assign all models with QM23egions of more than 390 atoms to this set, resulting in N ref = 18. For the assessmentof the models, the atomic forces of all non-hydrogen atoms within 4.0 ˚A of the centralcarbon atom were considered ( N repr = 11). The analysis was performed on the initial(reactant) structure of the esteriﬁcation reaction and the results are presented inFig. 8.Choosing a larger radius r and cutting probability p to generate the models witha larger QM region (in contrast to the setting in section 3.2), resulted in the obser-vation that all QM/SFAM models with more than 100 QM atoms include the thiolgroup SH . into the QM region. We also observed that generating the QM regions ina systematic way leads to the continuous increase of accuracy with growing QM re-gion size. The models taken as the reference (red points in Fig. 8) do not show largeﬂuctuations in accuracy. A mean deviation ε m mean of 0.05 kcal mol − bohr − for thisset of models was obtained and the maximum deviation ε m max was 0.1 kcal mol − bohr − .From this we can deduce that the functional groups that were present in some ofthe QM regions of these models, but not in all of them, do not have a signiﬁcanteﬀect on the forces. Hence, we can reliably apply the reference values obtained bythis strategy. We stress that this approach is unavoidably limited by the maximumQM region size that is still computationally feasible. Therefore, sampling of sev-eral QM/SFAM models with QM regions of such size is crucial in order to obtain areliable reference. 24 M e a n e rr o r o n f o r c e s / k c a l m o l b o h r with thiol groupwithout thiol groupreference models Figure 8:

Mean error ε m mean on the atomic forces of all non-hydrogen atoms within 4.0 ˚A ofthe central carbon atom for the generated QM/SFAM models for the initial structure ofsolvated insulin. Each point corresponds to a diﬀerent QM/SFAM model. The pointsare colored such that for the orange ones the thiol group SH . is part of the QM region,while for the blue points it is part of the environment. The red points correspond to themodels for which the mean of the forces was taken as the reference (QM regions of atleast 390 atoms). With the fragmentation settings applied here, which diﬀer from thoseof unsolvated insulin, all models with more than 100 QM atoms include the thiol groupinto the QM region. To select a QM region from these data, we consider the models with the smallesterror on the forces for a given range of QM region sizes (constrained by the typeof calculation and the available computational resources). To pick an example, weselect a model with a QM region of 125 ±

10 atoms, which may be consideredcomputationally feasible and of reasonable size for a variety of applications. Withinthe data presented in Fig. 8, this requirement is satisﬁed by 38 of the generatedmodels. Note that one would typically generate only those models fulﬁlling thedesired size requirement and the number of candidates can be increased if deemednecessary and feasible (by varying r and p ).We apply a tolerance of 0.05 kcal mol − bohr − (based on the mean deviation in thereference models), which yields seven models with highest accuracy of the forces asthe remaining candidates. We stress that our approach is not able to discriminatebetween these models reliably, as the diﬀerences in performance of these models aresmaller than the mean deviation in the reference models. An option to overcome thisissue is to perform additional reference calculations on other structures on the PES25nd deploy these reference data to determine the optimal QM region. Calculatingdata for more than one point on the PES also facilitates the use of energy diﬀerencesas a selection criterion.A simpler alternative is to base the ﬁnal selection on heuristic rules. We chose toapply the following: (i) models with fewer cuts at covalent bonds (i.e., a smallernumber of link atoms m link ) are always preferred within this pre-selection and (ii)models with symmetry scores m sym (see Eq. (23)) that are 50% larger than theminimal m sym are discarded to prevent an unphysical QM region to be selected dueto error compensation. In our current implementation, we employ these rules toguide the ﬁnal selection, which represents a systematic and reproducible process.However, we plan to extend this process in future work, as outlined above, basedon additional reference data generation to obtain a ﬁnal selection based purely onﬁrst-principles data.Figure 9: Molecular structure of the QM region of the automatically selected QM/SFAMmodel. As an example, we limited the acceptable size of this QM region to 125 ± Applying this selection strategy, a QM/SFAM model with a QM region of 127 atoms(excluding 3 link atoms) was selected in our example. The molecular structure ofthis QM region is depicted in Fig. 9. The symmetry score of this QM region is m sym = 1 .

24 and the reaction energy error with this model was 1.96 kcal mol − .Finally, we emphasize that the construction of this QM/SFAM model as well as ofits selection over the other model candidates is fully automated in our implementa-tion. This includes the reference calculation management, which is automatized andparallelized in the same way as has been implemented for the SFAM parametrizationprocess. We write the information about all the necessary reference calculations intoa MongoDB database, [131] which is subsequently processed by n instances of anotherprogram carrying out the calculations and storing the results back into the database.With this set-up, n -fold parallelization of the data generation is enabled and there-26ore many QM/SFAM model candidates can be considered and tested eﬃciently.This is particularly important because most of the computing time needed by thequantum region selection algorithm can be attributed to the reference calculations.In our set-up, each of the reference calculations (QM/SFAM models with very largequantum regions) was completed in less than one hour applying 8 cores per QMcalculation. In this work, we reported a new QM/MM hybrid model for atomistic simulationswhich features, a system-focused force ﬁeld for minimized errors. We developeda fully automated set-up of this QM/SFAM model, which by construction (i.e.,by virtue of the salient features of SFAM) is not plagued with typical limitationsof standard force ﬁelds (such as missing parameters for speciﬁc metal atoms inrelevant valence states). However, if required, the methodology reported can becombined with such a standard force ﬁeld (we implemented the general AMBERforce ﬁeld (GAFF) [132] ). Our implementation will be available within the open-source SCINE platform. [129]

As a result, the cumbersome manual set-up of QM/MM models has been deci-sively alleviated, up to the point where it can be driven in a fully automated way,which opens up new avenues for QM/MM approaches; e.g., (i) in interactive ap-proaches, [101,102] where operator-deﬁned abrupt changes of focus occur, (ii) in situa-tions of quickly changing reactive sites because of highly mobile or volatile reactants,or (iii) in studies of complex chemical systems with varying environments such asenzymes generated by high-throughput directed evolution.If, during a molecular exploration, new covalent bonds are formed and then shiftedto the MM region (e.g., because the QM region is moved to a diﬀerent local regionof the full structure), molecular-mechanics parameters may be missing for this newchemical environment in the classical region. However, the SFAM ansatz allows ourimplementation to quickly re-parametrize this new local situation with only minimalcomputational eﬀort. Furthermore, our implementation is ﬂexible enough to allowfor two or more (unconnected) QM regions in the model.Our automated model construction process also allows for the generation and appli-cation of several models with diﬀerently sized QM regions in parallel. This enablesus to estimate and control the uncertainty of the model constantly, even in fullyautomated exploration set-ups. [99,100]

As was demonstrated in section 3.2, this willbe of great importance when calculating physical quantities (e.g., reaction energies)that are not directly related to the atomic forces on which we based our modelselection criterion. However, the modular nature of our implementation allows forextending the selection criteria to include additional quantities if necessary.27 cknowledgments

C. B. gratefully acknowledges support by a Kekul´e Ph.D. fellowship of the Fondsder Chemischen Industrie. The authors thank the Schweizerischer Nationalfonds forgenerous support (Projects 200021 182400 to M. R. and 200021 172950-1 (C. B.) toPD Dr. Thomas Hofstetter).

References [1] Senn, H. M.; Thiel, W. QM/MM Methods for Biological Systems.

Top. Curr.Chem. , , 173–290.[2] Groenhof, G. Biomolecular Simulations ; Methods in Molecular Biology; Hu-mana Press, Totowa, NJ, 2013; pp 43–66.[3] Riniker, S. Fixed-charge atomistic force ﬁelds for molecular dynamics simula-tions in the condensed phase: An overview.

J. Chem. Inf. Model. , ,565–578.[4] Warshel, A.; Levitt, M. Theoretical studies of enzymic reactions: Dielectric,electrostatic and steric stabilization of the carbonium ion in the reaction oflysozyme. J. Mol. Biol. , , 227–249.[5] Thiel, W.; Hummer, G. Nobel 2013 Chemistry: Methods for computationalchemistry. Nature , , 96.[6] Brunken, C.; Reiher, M. Self-Parametrizing System-Focused Atomistic Mod-els. J. Chem. Theory Comput. , , 1646–1665.[7] Ramakrishnan, R.; Dral, P. O.; Rupp, M.; von Lilienfeld, O. A. Big data meetsquantum chemistry approximations: The ∆-machine learning approach. J.Chem. Theory Comput. , , 2087–2096.[8] Karelina, M.; Kulik, H. J. Systematic quantum mechanical region determina-tion in QM/MM simulation. J. Chem. Theory Comput. , , 563–576.[9] Sumowski, C. V.; Ochsenfeld, C. A convergence study of QM/MM isomeriza-tion energies with the selected size of the QM region for peptidic systems. J.Phys. Chem. A , , 11734–11741.[10] Flaig, D.; Beer, M.; Ochsenfeld, C. Convergence of electronic structure withthe size of the QM region: example of QM/MM NMR shieldings. J. Chem.Theory Comput. , , 2260–2271.[11] Roßbach, S.; Ochsenfeld, C. Inﬂuence of Coupling and Embedding Schemes onQM Size Convergence in QM/MM Approaches for the Example of a ProtonTransfer in DNA. J. Chem. Theory Comput. , , 1102–1107.2812] Kulik, H. J.; Zhang, J.; Klinman, J. P.; Martinez, T. J. How large shouldthe QM region be in QM/MM calculations? The case of catechol O-methyltransferase. J. Phys. Chem. B , , 11381–11394.[13] Hu, L.; S¨oderhjelm, P.; Ryde, U. On the convergence of QM/MM energies. J.Chem. Theory Comput. , , 761–777.[14] Liao, R.-Z.; Thiel, W. Convergence in the QM-only and QM/MM modeling ofenzymatic reactions: A case study for acetylene hydratase. J. Comput. Chem. , , 2389–2397.[15] Retegan, M.; Neese, F.; Pantazis, D. A. Convergence of QM/MM and clustermodels for the spectroscopic properties of the oxygen-evolving complex inphotosystem II. J. Chem. Theory Comput. , , 3832–3842.[16] Grimme, S. A General Quantum Mechanically Derived Force Field (QMDFF)for Molecules and Condensed Phase Simulations. J. Chem. Theory Comput. , , 4497–4514.[17] Vanduyfhuys, L.; Vandenbrande, S.; Verstraelen, T.; Schmid, R.; Waro-quier, M.; Van Speybroeck, V. QuickFF: A program for a quick and easyderivation of force ﬁelds for metal-organic frameworks from ab initio input. J.Comput. Chem. , , 1015–1027.[18] Wang, R.; Ozhgibesov, M.; Hirao, H. Partial hessian ﬁtting for determiningforce constant parameters in molecular mechanics. J. Comput. Chem. , , 2349–2359.[19] Wang, R.; Ozhgibesov, M.; Hirao, H. Analytical hessian ﬁtting schemes foreﬃcient determination of force-constant parameters in molecular mechanics. J. Comput. Chem. , , 307–318.[20] Li, Z.; Kermode, J. R.; De Vita, A. Molecular dynamics with on-the-ﬂy ma-chine learning of quantum-mechanical forces. Phys. Rev. Lett. , ,096405.[21] Chmiela, S.; Tkatchenko, A.; Sauceda, H. E.; Poltavsky, I.; Sch¨utt, K. T.;M¨uller, K.-R. Machine learning of accurate energy-conserving molecular forceﬁelds. Sci. Adv. , , e1603015.[22] Glielmo, A.; Sollich, P.; De Vita, A. Accurate interatomic force ﬁelds viamachine learning with covariant kernels. Phys. Rev. B , , 214302.[23] Chmiela, S.; Sauceda, H. E.; M¨uller, K.-R.; Tkatchenko, A. Towards exactmolecular dynamics simulations with machine-learned force ﬁelds. Nat. Com-mun. , , 3887.[24] Amabilino, S.; Bratholm, L. A.; Bennie, S. J.; O’Connor, M. B.;Glowacki, D. R. Training atomic neural networks using fragment-based datagenerated in virtual reality. arXiv preprint arXiv:2007.02824 ,2925] Zhang, Y.-J.; Khorshidi, A.; Kastlunger, G.; Peterson, A. A. The potential formachine learning in hybrid QM/MM calculations. J. Chem. Phys. , ,241740.[26] B¨oselt, L.; Th¨urlemann, M.; Riniker, S. Machine Learning in QM/MM Molec-ular Dynamics Simulations of Condensed-Phase Systems. , arXiv preprintarXiv:2010.11610.[27] Johnson, E. R.; Mackie, I. D.; DiLabio, G. A. Dispersion interactions indensity-functional theory. J. Phys. Org. Chem. , , 1127–1135.[28] Grimme, S.; Antony, J.; Ehrlich, S.; Krieg, H. A consistent and accurate abinitio parametrization of density functional dispersion correction (DFT-D) forthe 94 elements H-Pu. J. Chem. Phys. , , 154104.[29] Grimme, S.; Ehrlich, S.; Goerigk, L. Eﬀect of the damping function in disper-sion corrected density functional theory. J. Comput. Chem. , , 1456–1465.[30] Grimme, S. Density functional theory with London dispersion corrections. WIREs Comput. Mol. Sci. , , 211–228.[31] Field, M. J.; Bash, P. A.; Karplus, M. A combined quantum mechanical andmolecular mechanical potential for molecular dynamics simulations. J. Com-put. Chem. , , 700–733.[32] Singh, U. C.; Kollman, P. A. A combined ab initio quantum mechanical andmolecular mechanical method for carrying out simulations on complex molecu-lar systems: Applications to the CH3Cl+ Cl- exchange reaction and gas phaseprotonation of polyethers. J. Comput. Chem. , , 718–730.[33] Maseras, F.; Morokuma, K. IMOMM: A new integrated ab initio+ molec-ular mechanics geometry optimization scheme of equilibrium structures andtransition states. J. Comput. Chem. , , 1170–1179.[34] Eichler, U.; K¨olmel, C. M.; Sauer, J. Combining ab initio techniques with an-alytical potential functions for structure predictions of large systems: Methodand application to crystalline silica polymorphs. J. Comput. Chem. , ,463–477.[35] Antes, I.; Thiel, W. Combined Quantum Mechanical and Molecular MechanicalMethods ; 1998; Chapter 4, pp 50–65.[36] Das, D.; Eurenius, K. P.; Billings, E. M.; Sherwood, P.; Chatﬁeld, D. C.;Hodoˇsˇcek, M.; Brooks, B. R. Optimization of quantum mechanical molecularmechanical partitioning schemes: Gaussian delocalization of molecular me-chanical charges and the double link atom method.

J. Chem. Phys. , , 10534–10547. 3037] Swart, M. AddRemove: A new link model for use in QM/MM studies. Int. J.Quantum Chem. , , 177–183.[38] Th´ery, V.; Rinaldi, D.; Rivail, J.-L.; Maigret, B.; Ferenczy, G. G. Quantummechanical computations on very large molecular systems: The local self-consistent ﬁeld method. J. Comput. Chem. , , 269–282.[39] Monard, G.; Loos, M.; Th´ery, V.; Baka, K.; Rivail, J.-L. Hybrid classicalquantum force ﬁeld for modeling very large molecules. Int. J. Quantum Chem. , , 153–159.[40] Assfeld, X.; Rivail, J.-L. Quantum chemical computations on parts of largemolecules: the ab initio local self consistent ﬁeld method. Chem. Phys. Lett. , , 100–106.[41] Ferr´e, N.; Assfeld, X.; Rivail, J.-L. Speciﬁc force ﬁeld parameters determina-tion for the hybrid ab initio QM/MM LSCF method. J. Comput. Chem. , , 610–624.[42] Gao, J.; Amara, P.; Alhambra, C.; Field, M. J. A generalized hybrid orbital(GHO) method for the treatment of boundary atoms in combined QM/MMcalculations. J. Phys. Chem. A , , 4714–4721.[43] Amara, P.; Field, M. J.; Alhambra, C.; Gao, J. The generalized hybrid orbitalmethod for combined quantum mechanical/molecular mechanical calculations:formulation and tests of the analytical derivatives. Theor. Chem. Acc. , , 336–343.[44] Garcia-Viloca, M.; Gao, J. Generalized hybrid orbital for the treatment ofboundary atoms in combined quantum mechanical and molecular mechanicalcalculations using the semiempirical parameterized model 3 method. Theor.Chem. Acc. , , 280–286.[45] Huzinaga, S.; Cantu, A. Theory of separability of many-electron systems. J.Chem. Phys. , , 5543–5549.[46] Manby, F. R.; Stella, M.; Goodpaster, J. D.; Miller, T. F. A Simple, Ex-act Density-Functional-Theory Embedding Scheme. J. Chem. Theory Comput. , , 2564–2568.[47] Fornace, M. E.; Lee, J.; Miyamoto, K.; Manby, F. R.; Miller III, T. F. Em-bedded mean-ﬁeld theory. J. Chem. Theory Comput. , , 568–580.[48] H´egely, B.; Nagy, P. R.; Ferenczy, G. G.; K´allay, M. Exact density functionaland wave function embedding schemes based on orbital localization. J. Chem.Phys. , , 064107.[49] Lee, S. J.; Welborn, M.; Manby, F. R.; Miller III, T. F. Projection-BasedWavefunction-in-DFT Embedding. Acc. Chem. Res. , , 1359–1368.3150] Wesolowski, T. A.; Warshel, A. Frozen density functional approach for abinitio calculations of solvated molecules. J. Phys. Chem. , , 8050–8053.[51] Neugebauer, J.; Louwerse, M. J.; Baerends, E. J.; Wesolowski, T. A. Themerits of the frozen-density embedding scheme to model solvatochromic shifts. J. Chem. Phys. , , 094115.[52] Neugebauer, J.; Jacob, C. R.; Wesolowski, T. A.; Baerends, E. J. An ExplicitQuantum Chemical Method for Modeling Large Solvation Shells Applied toAminocoumarin C151. J. Phys. Chem. A , , 7805–7814.[53] Neugebauer, J.; Louwerse, M. J.; Belanzoni, P.; Wesolowski, T. A.;Baerends, E. J. Modeling solvent eﬀects on electron-spin-resonance hyperﬁnecouplings by frozen-density embedding. J. Chem. Phys. , , 114101.[54] Neugebauer, J.; Baerends, E. J. Exploring the Ability of Frozen-Density Em-bedding to Model Induced Circular Dichroism. J. Phys. Chem. A , ,8786–8796.[55] Jacob, C. R.; Neugebauer, J.; Jensen, L.; Visscher, L. Comparison of frozen-density embedding and discrete reaction ﬁeld solvent models for molecularproperties. Phys. Chem. Chem. Phys. , , 2349–2359.[56] Weso(cid:32)lowski, T. A. Embedding a multideterminantal wave function in anorbital-free environment. Phys. Rev. A , , 012504.[57] Jacob, C. R.; Neugebauer, J.; Visscher, L. A ﬂexible implementation of frozen-density embedding for use in multilevel simulations. J. Comput. Chem. , , 1011–1018.[58] Pernal, K.; Wesolowski, T. A. Orbital-free eﬀective embedding potential:Density-matrix functional theory case. Int. J. Quantum Chem. , ,2520–2525.[59] Fux, S.; Jacob, C. R.; Neugebauer, J.; Visscher, L.; Reiher, M. Accurate frozen-density embedding potentials as a ﬁrst step towards a subsystem descriptionof covalent bonds. J. Chem. Phys. , , 164101.[60] Jacob, C. R.; Neugebauer, J. Subsystem density-functional theory. WIREsComput. Mol. Sci. , , 325–362.[61] Wesolowski, T. A.; Shedge, S.; Zhou, X. Frozen-density embedding strategyfor multilevel simulations of electronic structure. Chem. Rev. , , 5891–5928.[62] M¨uhlbach, A. H.; Reiher, M. Quantum system partitioning at the single-particle level. J. Chem. Phys. , , 184104.[63] Huheey, J. E.; Keiter, E. A.; Keiter, R. L.; Medhi, O. K. Inorganic chemistry:principles of structure and reactivity ; Pearson Education India, 2006.3264] Greenwood, N. N.; Earnshaw, A.

Chemistry of the Elements ; Elsevier, 2012.[65] Walker, R. C.; Crowley, M. F.; Case, D. A. The implementation of a fast andaccurate QM/MM potential method in Amber.

J. Comput. Chem. , ,1019–1031.[66] Eurenius, K. P.; Chatﬁeld, D. C.; Brooks, B. R.; Hodoscek, M. Enzyme mech-anisms with hybrid quantum and molecular mechanical potentials. I. Theoret-ical considerations. Int. J. Quantum Chem. , , 1189–1200.[67] Ryde, U. The coordination of the catalytic zinc ion in alcohol dehydrogenasestudied by combined quantum-chemical and molecular mechanics calculations. J. Comput.-Aided Mol. Des. , , 153–164.[68] Sherwood, P.; de Vries, A. H.; Collins, S. J.; Greatbanks, S. P.; Burton, N. A.;Vincent, M. A.; Hillier, I. H. Computer simulation of zeolite structure andreactivity using embedded cluster methods. Faraday Discuss. , , 79–92.[69] De Vries, A. H.; Sherwood, P.; Collins, S. J.; Rigby, A. M.; Rigutto, M.;Kramer, G. J. Zeolite structure and reactivity by combined quantum-chemical-classical calculations. J. Phys. Chem. B , , 6133–6141.[70] Sherwood, P.; de Vries, A. H.; Guest, M. F.; Schreckenbach, G.; Catlow, C. A.;French, S. A.; Sokol, A. A.; Bromley, S. T.; Thiel, W.; Turner, A. J.; Bil-leter, S.; Terstegen, F.; Thiel, S.; Kendrick, J.; Rogers, S. C.; Casci, J.; Wat-son, M.; King, F.; Karlsen, E.; Sjøvoll, M.; Fahmi, A.; Sch¨afer, A.; Lennartz, C.QUASI: A general purpose implementation of the QM/MM approach and itsapplication to problems in catalysis. J. Mol. Struct.: THEOCHEM , ,1–28.[71] K¨onig, P.; Hoﬀmann, M.; Frauenheim, T.; Cui, Q. A critical evaluation ofdiﬀerent QM/MM frontier treatments with SCC-DFTB as the QM method. J. Phys. Chem. B , , 9082–9095.[72] Lin, H.; Truhlar, D. G. Redistributed charge and dipole schemes for combinedquantum mechanical and molecular mechanical calculations. J. Phys. Chem.A , , 3991–4004.[73] Eichinger, M.; Tavan, P.; Hutter, J.; Parrinello, M. A hybrid method for solutesin complex solvents: Density functional theory combined with empirical forceﬁelds. J. Chem. Phys. , , 10452–10467.[74] Amara, P.; Field, M. J. Evaluation of an ab initio quantum mechani-cal/molecular mechanical hybrid-potential link-atom method. Theor. Chem.Acc. , , 43–52. 3375] Hall, R. J.; Hindle, S. A.; Burton, N. A.; Hillier, I. H. Aspects of hybridQM/MM calculations: the treatment of the QM/MM interface region andgeometry optimization with an application to chorismate mutase. J. Comput.Chem. , , 1433–1441.[76] Vreven, T.; Morokuma, K.; Farkas, ¨O.; Schlegel, H. B.; Frisch, M. J. Geom-etry optimization with QM/MM, ONIOM, and other combined methods. I.Microiterations and constraints. J. Comput. Chem. , , 760–769.[77] Vreven, T.; Morokuma, K. Hybrid Methods: ONIOM(QM:MM) andQM/MM. Annu. Rep. Comput. Chem. , , 35–51.[78] K¨astner, J.; Thiel, S.; Senn, H. M.; Sherwood, P.; Thiel, W. ExploitingQM/MM capabilities in geometry optimization: A microiterative approachusing electrostatic embedding. J. Chem. Theory Comput. , , 1064–1072.[79] Melaccio, F.; Olivucci, M.; Lindh, R.; Ferr´e, N. Unique QM/MM potential en-ergy surface exploration using microiterations. Int. J. Quantum Chem. , , 3339–3346.[80] Metz, S.; K¨astner, J.; Sokol, A. A.; Keal, T. W.; Sherwood, P. Chem Shell—amodular software package for QM/MM simulations. Wiley Interdiscip. Rev.:Comput. Mol. Sci. , , 101–110.[81] Caprasecca, S.; Jurinovich, S.; Viani, L.; Curutchet, C.; Mennucci, B. Geom-etry optimization in polarizable QM/MM models: the induced dipole formu-lation. J. Chem. Theory Comput. [82] Bosia, F.; Brunken, C.; Grimmel, S. A.; Haag, M. P.; Heuer, M. A.;Simm, G. N.; Sobez, J.-G.; Steiner, M.; Unsleber, J. P.; Vaucher, A. C.;Weymuth, T.; Reiher, M. SCINE Utilities: Release 2.0.0. 2020; https://doi.org/10.5281/zenodo.3828692 .[83] Bryan, J. G.; Wadsworth, G. P.

Introduction to probability and random vari-ables ; New York: McGraw-Hill, 1960.[84] Feller, W.

An introduction to probability theory and its applications ; JohnWiley & Sons, 2008; Vol. 2.[85] Weymuth, T.; Proppe, J.; Reiher, M. Statistical Analysis of SemiclassicalDispersion Corrections.

J. Chem. Theory Comput. , , 2480–2494.[86] Proppe, J.; Gugler, S.; Reiher, M. Gaussian Process-Based Reﬁnement of Dis-persion Corrections. J. Chem. Theory Comput. , , 6046–6060.[87] Claeyssens, F.; Harvey, J. N.; Manby, F. R.; Mata, R. A.; Mulholland, A. J.;Ranaghan, K. E.; Sch¨utz, M.; Thiel, S.; Thiel, W.; Werner, H.-J. High-Accuracy Computation of Reaction Barriers in Enzymes. Angew. Chem. Int.Ed. , , 6856–6859. 3488] Walker, R. C.; Mercer, I. P.; Gould, I. R.; Klug, D. R. Comparison of basisset eﬀects and the performance of ab initio and DFT methods for probingequilibrium ﬂuctuations. J. Comput. Chem. , , 478–490.[89] Mart´ı, M. A.; Capece, L.; Bikiel, D. E.; Falcone, B.; Estrin, D. A. Oxygen aﬃn-ity controlled by dynamical distal conformations: the soybean leghemoglobinand the Paramecium caudatum hemoglobin cases. Proteins: Struct., Funct.,Bioinf. , , 480–487.[90] Mata, R. A.; Werner, H.-J.; Thiel, S.; Thiel, W. Toward accurate barriers forenzymatic reactions: QM/MM case study on p-hydroxybenzoate hydroxylase. J. Chem. Phys. , , 01B610.[91] Liao, R.-Z.; Thiel, W. Comparison of QM-only and QM/MM models for themechanism of tungsten-dependent acetylene hydratase. J. Chem. Theory Com-put. , , 3793–3803.[92] Polyak, I.; Reetz, M. T.; Thiel, W. Quantum mechanical/molecular mechanicalstudy on the mechanism of the enzymatic Baeyer–Villiger reaction. J. Am.Chem. Soc. , , 2732–2741.[93] Berraud-Pache, R.; Navizet, I. QM/MM calculations on a newly synthesisedoxyluciferin substrate: new insights into the conformational eﬀect. Phys.Chem. Chem. Phys. , , 27460–27467.[94] Bathelt, C. M.; Zurek, J.; Mulholland, A. J.; Harvey, J. N. Electronic struc-ture of compound I in human isoforms of cytochrome P450 from QM/MMmodeling. J. Am. Chem. Soc. , , 12900–12908.[95] Geronimo, I.; Paneth, P. A DFT and ONIOM study of C–H hydroxylationcatalyzed by nitrobenzene 1, 2-dioxygenase. Phys. Chem. Chem. Phys. , , 13889–13899.[96] Cooper, A. M.; K¨astner, J. Averaging Techniques for Reaction Barriers inQM/MM Simulations. ChemPhysChem , , 3264–3269.[97] Finkelmann, A. R.; Senn, H. M.; Reiher, M. Hydrogen-activation mechanismof [Fe] hydrogenase revealed by multi-scale modeling. Chem. Sci. , ,4474–4482.[98] Ke, Z.; Abe, S.; Ueno, T.; Morokuma, K. Catalytic Mechanism in Artiﬁ-cial Metalloenzyme: QM/MM Study of Phenylacetylene Polymerization byRhodium Complex Encapsulated in apo-Ferritin. J. Am. Chem. Soc. , , 15418–15429.[99] Simm, G. N.; Vaucher, A. C.; Reiher, M. Exploration of Reaction Pathwaysand Chemical Transformation Networks. J. Phys. Chem. A , , 385–399. 35100] Unsleber, J. P.; Reiher, M. The Exploration of Chemical Reaction Networks. Annu. Rev. Phys. Chem. , , 121–142.[101] Haag, M. P.; Vaucher, A. C.; Bosson, M.; Redon, S.; Reiher, M. Interactivechemical reactivity exploration. ChemPhysChem , , 3301–3319.[102] Vaucher, A. C.; Haag, M. P.; Reiher, M. Real-time feedback from iterativeelectronic structure calculations. J. Comput. Chem. , , 805–812.[103] Grimme, S. Semiempirical GGA-type density functional constructed with along-range dispersion correction. J. Comput. Chem. , , 1787–1799.[104] Mayer, I. Charge, bond order and valence in the AB initio SCF theory. Chem.Phys. Lett. , , 270–274.[105] Mayer, I. On bond orders and valences in the ab initio quantum chemicaltheory. Int. J. Quantum Chem. , , 73–84.[106] Heyden, A.; Lin, H.; Truhlar, D. G. Adaptive partitioning in combined quan-tum mechanical and molecular mechanical calculations of potential energyfunctions for multiscale simulations. J. Phys. Chem. B , , 2231–2241.[107] Bulo, R. E.; Ensing, B.; Sikkema, J.; Visscher, L. Toward a practical methodfor adaptive QM/MM simulations. J. Chem. Theory Comput. , , 2212–2221.[108] Mones, L.; Jones, A.; G¨otz, A. W.; Laino, T.; Walker, R. C.; Leimkuhler, B.;Cs´anyi, G.; Bernstein, N. The adaptive buﬀered force QM/MM method in theCP2K and AMBER software packages. J. Comput. Chem. , , 633–648.[109] Zheng, M.; Waller, M. P. Adaptive quantum mechanics/molecular mechanicsmethods. WIREs Comput. Mol. Sci. , , 369–385.[110] Duster, A. W.; Wang, C.-H.; Garza, C. M.; Miller, D. E.; Lin, H. Adaptivequantum/molecular mechanics: What have we learned, where are we, andwhere do we go from here? WIREs Comput. Mol. Sci. , , e1310.[111] Neese, F.; Wennmohs, F.; Hansen, A. Eﬃcient and accurate local approxi-mations to coupled-electron pair approaches: An attempt to revive the pairnatural orbital method. J. Chem. Phys. , , 114108.[112] Neese, F.; Hansen, A.; Liakos, D. G. Eﬃcient and accurate approximations tothe local coupled cluster singles doubles method using a truncated pair naturalorbital basis. J. Chem. Phys. , , 064103.[113] Hansen, A.; Liakos, D. G.; Neese, F. Eﬃcient and accurate local single refer-ence correlation methods for high-spin open-shell molecules using pair naturalorbitals. J. Chem. Phys. , , 214102.36114] Bistoni, G.; Polyak, I.; Sparta, M.; Thiel, W.; Neese, F. Toward accurateQM/MM reaction barriers with large QM regions using domain based pairnatural orbital coupled cluster theory. J. Chem. Theory Comput. , ,3524–3531.[115] Proppe, J.; Reiher, M. Mechanism deduction from noisy chemical reactionnetworks. J. Chem. Theory Comput. , , 357–370.[116] Neese, F. The ORCA program system. WIREs Comput. Mol. Sci. , ,73–78.[117] Neese, F. Software update: the ORCA program system, version 4.0. WIREsComput. Mol. Sci. , , e1327.[118] Sonksen, P.; Sonksen, J. Insulin: understanding its action in health and dis-ease. Br. J. Anaesth. , , 69–79.[119] Bernstein, F. C.; Koetzle, T. F.; Williams, G. J.; Meyer Jr, E. F.; Brice, M. D.;Rodgers, J. R.; Kennard, O.; Shimanouchi, T.; Tasumi, M. The Protein DataBank: a computer-based archival ﬁle for macromolecular structures. J. Mol.Biol. , , 535–542.[120] Stewart, J. J. Optimization of parameters for semiempirical methods I.Method. J. Comput. Chem. , , 209–220.[121] Stewart, J. J. Optimization of parameters for semiempirical methods II. Ap-plications. J. Comput. Chem. , , 221–264.[122] Whitten, J. L. Coulombic potential energy integrals and approximations. J.Chem. Phys. , , 4496–4501.[123] Dunlap, B. I.; Connolly, J.; Sabin, J. On some approximations in applicationsof X α theory. J. Chem. Phys. , , 3396–3402.[124] Vahtras, O.; Alml¨of, J.; Feyereisen, M. Integral approximations for LCAO-SCFcalculations. Chem. Phys. Lett. , , 514–518.[125] Perdew, J. P.; Burke, K.; Ernzerhof, M. Generalized Gradient ApproximationMade Simple. Phys. Rev. Lett. , , 3865–3868.[126] Weigend, F.; Ahlrichs, R. Balanced basis sets of split valence, triple zeta va-lence and quadruple zeta valence quality for H to Rn: Design and assessmentof accuracy. Phys. Chem. Chem. Phys. , , 3297–3305.[127] Hirshfeld, F. L. Bonded-atom fragments for describing molecular charge den-sities. Theor. Chim. Acta , , 129–138.[128] Marenich, A. V.; Jerome, S. V.; Cramer, C. J.; Truhlar, D. G. Charge Model 5:An Extension of Hirshfeld Population Analysis for the Accurate Descriptionof Molecular Interactions in Gaseous and Condensed Phases. J. Chem. TheoryComput. , , 527–541. 37129] SCINE − Software for Chemical Interaction Networks. http://scine.ethz.ch/ , visited on 2020-07-07.[130] te Velde, G.; Bickelhaupt, F. M.; Baerends, E. J.; Fonseca Guerra, C.; van Gis-bergen, S. J. A.; Snijders, J. G.; Ziegler, T. Chemistry with ADF.

J. Comput.Chem. , , 931–967.[131] MongoDB Inc., MongoDB 3.2. , visited on 2020-08-08.[132] Wang, J.; Wolf, R. M.; Caldwell, J. W.; Kollman, P. A.; Case, D. A. Develop-ment and testing of a general amber force ﬁeld. J. Comput. Chem. ,25