Coarse Graining Molecular Dynamics with Graph Neural Networks
Brooke E. Husic, Nicholas E. Charron, Dominik Lemm, Jiang Wang, Adrià Pérez, Maciej Majewski, Andreas Krämer, Yaoyi Chen, Simon Olsson, Gianni de Fabritiis, Frank Noé, Cecilia Clementi
1) Department of Mathematics and Computer Science, Freie Universität Berlin, Germany
2) Department of Chemistry, Stanford University, Stanford, CA, USA
3) Department of Physics, Rice University, Houston, TX, USA
4) Center for Theoretical Biological Physics, Rice University, Houston, TX, USA
5) Computational Science Laboratory, Universitat Pompeu Fabra, PRBB, C/Dr Aiguader 88, Barcelona, Spain
6) Department of Chemistry, Rice University, Houston, TX, USA
7) IMPRS-BAC, Max Planck Institute for Molecular Genetics, Berlin, Germany
8) Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig Lluis Companys 23, Barcelona, Spain
9) Department of Physics, Freie Universität Berlin, Germany

a) Electronic mail: [email protected]; B.E.H., N.E.C., and D.L. contributed equally to this work.
b) B.E.H., N.E.C., and D.L. contributed equally to this work.
c) Electronic mail: [email protected]
d) Electronic mail: [email protected]
e) Electronic mail: [email protected]
Coarse graining enables the investigation of molecular dynamics for larger systems and at longer timescales than is possible at atomic resolution. However, a coarse graining model must be formulated such that the conclusions we draw from it are consistent with the conclusions we would draw from a model at a finer level of detail. It has been proven that a force matching scheme defines a thermodynamically consistent coarse-grained model for an atomistic system in the variational limit. Wang et al. [ACS Cent. Sci. 5, 755 (2019)] demonstrated that the existence of such a variational limit enables the use of a supervised machine learning framework to generate a coarse-grained force field, which can then be used for simulation in the coarse-grained space. Their framework, however, requires the manual input of molecular features upon which to machine learn the force field. In the present contribution, we build upon the advance of Wang et al. and introduce a hybrid architecture for the machine learning of coarse-grained force fields that learns its own features via a subnetwork that leverages continuous filter convolutions on a graph neural network architecture. We demonstrate that this framework succeeds at reproducing the thermodynamics for small biomolecular systems. Since the learned molecular representations are inherently transferable, the architecture presented here sets the stage for the development of machine-learned, coarse-grained force fields that are transferable across molecular systems.

I. INTRODUCTION
Technologies facilitating molecular dynamics (MD) simulations, such as distributed computing and bespoke hardware, have made great strides in terms of the time- and length-scales accessible in silico. However, even the longest protein simulations still fail to reach total times exceeding milliseconds, and dedicated analysis methods are required to infer dynamics at longer timescales. In the context of such limitations at full atomistic resolution, coarse graining provides a crucial methodology to more efficiently simulate and analyze biomolecular systems. In addition to the practical advantages that arise from more efficient sampling, coarse graining can also elucidate the physical components that play key roles in molecular processes.

Coarse graining is especially useful for analyzing structures and processes that reach beyond the length- and timescales accessible to all-atom MD. Important examples include protein folding, protein structure prediction, and protein interactions. Some of the most-used coarse-grained models for such studies are structure-based models, MARTINI, CABS, AWSEM, and Rosetta. These models differ with respect to their potential energy function, parameterization approaches, and resolution, which in combination determine their efficiency, accuracy, and transferability. In the past decade, coarse-grained models have become increasingly powerful due to an unprecedented wealth of experimental reference data and computational capabilities. In this context, the development of more realistic architectures and modeling approaches is of prime importance.

Here, we consider coarse graining to be the process of reducing structural degrees of freedom to facilitate more efficient simulation with specific goals in mind (e.g., reproducing system thermodynamics). Coarse graining can be implemented with a "top down" or "bottom up" approach, although other categories can be determined and strategies can be combined. In a "top down" scheme, coarse graining frameworks are explicitly designed to reproduce certain macroscale emergent properties. In a "bottom up" framework, which we consider here, implementations focus instead on reproducing specific features from a more detailed model.

The latter involves (i) a mapping from the entities in a fine-grained (e.g., atomistic) representation to a smaller set of interaction sites, often called "beads," and (ii) a physical model (i.e., Hamiltonian function) for the coarse-grained system comprising those beads. Good choices for the mapping and model will lead to more efficient simulation while preserving the biophysical properties of interest to the researcher. Modern machine learning techniques have recently been employed to learn both the mapping and the model components of bottom up coarse graining.

In the present contribution, we focus on the coarse graining model and employ a bottom up "force matching" scheme, formulated as a supervised machine learning problem, to reproduce the thermodynamics of small biomolecular systems. In particular, we modify the architecture of the recently introduced CGnet framework such that the molecular features it requires are learned instead of hand-selected as in the original formulation.
By leveraging the inherently transferable SchNet scheme to learn features, we render the entire CGnet framework transferable across molecular systems.

Our goal in this paper is to present the theory underlying CGSchNet, our new transferable coarse graining architecture, and to demonstrate its success at learning the thermodynamics of individual biomolecular systems. We find that our new protocol produces more accurate free energy surfaces than the use of hand-selected features, is more robust to hyperparameter choices, and requires less regularization. Presented alongside a machine learning software package that implements the methods introduced, the current contribution sets out a framework for the machine learning of transferable, coarse-grained molecular force fields and demonstrates its application to a small peptide system and the miniprotein chignolin. The practical application of the methods described herein to larger protein systems, particularly those characterized by meaningful tertiary structure, remains an open challenge that will be explored in future work.

II. THEORY
The force matching technique, described in detail below, has been employed in two different contexts. Force matching was pioneered in the atomistic setting, in which forces obtained from an inexpensive calculation are matched to forces computed at a more computationally expensive level of theory (i.e., quantum) via an optimization scheme. Force matching was later adapted by the coarse graining community; in that context, coarse-grained representations are sought such that the forces computed from the coarse-grained energy function for a given configuration match the average forces on corresponding atomistic representations.

However, the underlying motivations in the two contexts differ. Because coarse graining away degrees of freedom entails that multiple atomistic structures will correspond to the same coarse-grained configuration, it is impossible to obtain zero error during force matching in the coarse graining context. We will see, however, that it can be proven that the coarse graining model that matches the mean forces yields the correct thermodynamics. The corresponding force matching procedure minimizes the same objective as in the atomistic case, but that objective is now variationally bounded from below by a value that necessarily exceeds zero.

In the following sections, we overview the major advances that enable the present contribution; namely, the initial formulation of force matching in the atomistic context by Ercolessi and Adams (Sec. II A); the adaptation of force matching to the multiscale problem of coarse graining by Izvekov and Voth (Sec. II B); the formalization of the associated variational principle by Noid et al. (Sec. II C); and, finally, CGnets: the fashioning of force matching as a supervised machine learning problem by Wang et al. (Sec. II D).

The practically inclined reader may proceed directly to Sec. III, where we discuss the CGnet architecture and introduce this work's methodological contribution: namely, the incorporation of learnable molecular features into CGnet via the use of continuous filter convolutions on a graph neural network (i.e., SchNet). We will see in Sec. III that the scheme we introduce here enables, at least in principle, a fully transferable coarse graining architecture for the first time. The practical use of this architecture to learn a force field in a transferable context will be addressed in future work.

A. Force matching at atomic resolution
Consider an all-atom dataset of coordinates and corresponding forces which we have obtained using a high level calculation (e.g., ab initio). We denote each three-dimensional structure r_i ∈ ℝ^{3N}, i = 1, ..., M, and the forces F(r_i) ∈ ℝ^{3N}, where N is the number of atoms in the system. Now consider a trial energy function V̂(r_i; Θ), which takes as arguments an atomistic configuration r_i and any parameters Θ. We would like to use V̂ to predict the forces on every r_i, presumably in a more efficient way, by taking its negative derivative. We can write the "force matching" problem of comparing the two sets of atomistic forces as

$$
L(R; \Theta) = \frac{1}{3MN} \sum_{i=1}^{M} \Big\| \underbrace{F(r_i)}_{\text{``true'' forces}} + \underbrace{\nabla_{r_i} \hat{V}(r_i; \Theta)}_{\text{(negative) predicted forces}} \Big\|^2, \tag{1}
$$

where R is the set of all M sampled atomistic configurations.

The objective (1) was introduced by Ercolessi and Adams to analyze ab initio simulations of elemental aluminum. The authors focused specifically on using atomistic forces calculated from first principles to produce interatomic potentials numerically, as opposed to modeling potentials with complicated analytical expressions.

Ercolessi and Adams raise three crucial points with respect to their force matching procedure. First, they highlight the need to accommodate invariant properties of the system (in this case, they maintained adherence to glue potential equations for aluminum). Second, they discuss the transferability of a potential approximated by force matching; in particular, they note the need for a variety of geometries, physical conditions, and system identities in R if the learned potential is to be transferable across conformational, thermodynamic, or chemical space, respectively. Finally, the authors note that the "computational engine" of the force matching problem is the minimization procedure required to optimize (1), e.g., in the presence of multiple local minima.

A decade later, Izvekov et al. introduced a new force matching methodology inspired by the work of Ercolessi and Adams but applicable to more complex systems. The contribution's key advance addresses the third point above: instead of using numerical minimization to obtain an optimal but nonunique parameter set (e.g., via simulated annealing), Izvekov et al. fashion an overdetermined linear system that yields a unique parameter set Θ, increasing the framework's tractability for more complex systems. The authors showcase their method on Car-Parrinello MD simulations of liquid water and find good agreement with the original data for long-timescale processes such as diffusion. A recent review also presents an analytical method for determining the coarse-grained potential for polymeric liquids in the context of concerns of accuracy, transferability, and computational efficiency.
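To make the objective concrete, the following is a minimal PyTorch sketch of (1); the tensor shapes and the `energy_model` callable are hypothetical stand-ins for any differentiable trial potential V̂.

```python
import torch

def force_matching_loss(coords, true_forces, energy_model):
    """Evaluate the loss in Eq. (1) for a batch of M configurations.

    coords:       (M, N, 3) atomic positions
    true_forces:  (M, N, 3) reference forces F(r_i)
    energy_model: callable mapping (M, N, 3) coordinates to (M,) energies
    """
    coords.requires_grad_(True)
    energy = energy_model(coords)  # V_hat(r_i; Theta)
    # Predicted forces are the negative gradient of the energy w.r.t. coordinates.
    pred_forces = -torch.autograd.grad(energy.sum(), coords, create_graph=True)[0]
    # Averaging over all M * N * 3 entries reproduces the 1/(3MN) normalization.
    return ((true_forces - pred_forces) ** 2).mean()
```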
In two follow-up papers, Izvekov and Voth present the groundbreaking advance of using force matching in the context of coarse graining. They introduce the multiscale coarse graining (MS–CG) method and apply it to a lipid bilayer model. The MS–CG framework involves two steps: first, atoms are aggregated into "interaction sites," and second, force matching is performed between a transformation of the atomistic forces and a set of predicted coarse-grained forces. This procedure thereby creates a "multiscale" link between the all-atom and coarse-grained representations.

To understand the force matching framework for coarse graining, we require some additional notation. A linear coarse-grained mapping from N atoms to n interaction sites (henceforth "beads") takes the form

$$
x_i = \Xi r_i \in \mathbb{R}^{3n}, \tag{2}
$$

where x_i is the coarse-grained representation with n < N beads and the matrix Ξ ∈ ℝ^{n×N} effectively performs a clustering from the original atoms to the beads.

In this context, we consider a coarse-grained energy function U(x; Θ). Suppose we have a set of M coarse-grained configurations that we have obtained by applying (2) to every configuration r_i ∈ R. To calculate the forces on the beads, we then take the negative derivative of U with respect to the reduced coordinates; in other words, we evaluate

$$
-\nabla U(\Xi r_i; \Theta) = -\nabla_{x_i} U(x_i; \Theta) \in \mathbb{R}^{3n}
$$

for each configuration i. From here we have all the ingredients to write down the adaptation of (1) to the MS–CG method:

$$
L(R; \Theta) = \frac{1}{3Mn} \sum_{i=1}^{M} \Big\| \underbrace{\Xi_F F(r_i)}_{\substack{\text{atomistic forces mapped} \\ \text{to coarse-grained space}}} + \underbrace{\nabla U(\Xi r_i; \Theta)}_{\substack{\text{(negative) forces predicted} \\ \text{from coarse-grained model}}} \Big\|^2, \tag{3}
$$

where Ξ_F F is the instantaneous coarse-grained force (also called the local mean force); that is, the projection of the atomistic force onto the coarse-grained space. A general expression for the force projection is Ξ_F = (ΞΞ^⊤)^{-1} Ξ. However, other choices for the mapping Ξ_F are possible and have been used for coarse graining.

In principle, the coarse graining energy U(x) that is exactly thermodynamically consistent with the atomistic energy V(r) can be expressed analytically as

$$
U(x) = -k_B T \ln p_{\mathrm{CG}}(x) + \text{const}, \tag{4}
$$

where k_B is Boltzmann's constant and T is the absolute temperature. The function p_CG is the marginal probability density,

$$
p_{\mathrm{CG}}(x') = \frac{\int_{\mathcal{R}} \exp\!\left(-\frac{V(r)}{k_B T}\right) \delta(x' - \Xi r)\, \mathrm{d}r}{\int_{\mathcal{R}} \exp\!\left(-\frac{V(r)}{k_B T}\right) \mathrm{d}r}, \tag{5}
$$

where 𝓡 is the set of all possible atomistic configurations. Since we are concerned with all theoretically accessible structures and thus employ an integral formulation, we have dropped the subscripts i with the understanding that x' and r now refer to infinitesimally small regions of their respective configuration spaces. x' is distinguished from x to emphasize that (5) is substituted into (4) as a function, not a number.

The coarse-grained energy function (4) is called the potential of mean force (PMF) and is an analogue of the atomistic potential energy function. Via (5), it is a function of weighted averages of energies of atomistic configurations. For a given coarse-grained structure x', in (5) we evaluate whether every possible r ∈ 𝓡 maps to x'.
We expect multiple atomistic configurations r to map to x' due to the reduction in degrees of freedom that results from structural coarse graining (n.b., this means the PMF is in fact a free energy, as it contains entropic information). In these cases, the Dirac delta function in (5) returns one, and the contribution of that atomistic configuration to the marginal probability distribution is a function of its Boltzmann factor. If r does not map to x', then the evaluation of the delta function (and thus the contribution of that atomistic structure to the free energy of x') is zero. The denominator of the right-hand side of (5) is the all-atom partition function, which serves as a normalization factor.

To calculate the forces on our coarse-grained beads, we must take the gradient of (4). However, since we cannot exhaustively sample 𝓡, (5) is intractable, and we must approximate U instead. One way to approximate U is to employ force matching, that is, to minimize expression (3), as we describe in Sec. II D. Another method, which we do not discuss in this report, is relative entropy, whose objective is related to that of force matching.

C. Quantitative guarantees and the variational method
Before we describe our approximation of U, however, we must establish some mathematical implications of the force matching framework. In a report that generalizes the earlier work of Izvekov and Voth, Noid et al. set forth a theoretical architecture for the MS–CG method and expose its underlying variational principle. Crucially, they formalize the notion of thermodynamic consistency and establish the conditions under which it is guaranteed by the MS–CG approach.

Noid et al. define consistency in configuration space (i.e., for positions only) as the fulfillment of two requirements: first, that the coarse-grained coordinates are a well-defined linear combination of the coordinates of the all-atom system (cf. (2)), and second, that the equilibrium distribution of the coarse-grained configurations is equal to the one implied by the equilibrium distribution of the atomic configurations (cf. (4)). We hence refer to the satisfaction of these requirements as thermodynamic consistency, which is accordant with the scope of MS–CG as described by Izvekov and Voth (i.e., that the method only yields thermodynamic information).

Noid et al. then prove that, given a linear mapping from atoms to coarse-grained beads, thermodynamic consistency can only be achieved if each bead has at least one atom that is "specific to" it; i.e., that is exclusively a component of that and no other bead. The authors show that the coarse-grained potential that achieves thermodynamic consistency at a given temperature is unique (up to an additive constant, cf. (5)). Finally, the authors consider a set of possible coarse-grained force fields (i.e., the derivatives of a set of functions of coarse-grained configurations), including the unique one that achieves thermodynamic consistency. They go on to define an error functional that is (uniquely) minimized for the thermodynamically consistent coarse-grained force field, thereby expressing the variational principle underlying the MS–CG method. In theory, then, a variational method can be used to search for the consistent coarse-grained force field. In practice, such a search is limited by the basis of trial force fields as well as the finite simulation data used; this is further explored in a companion paper from the same authors.

This contribution from Noid et al. is crucial because it formally proves a notion that had previously only been implicitly assumed: namely, that a consistent coarse-grained model will yield the same thermodynamics as an all-atom model (at a given temperature) if it were possible to sufficiently sample the latter. Equipped with this formulation, we can proceed with the understanding that if our linear mapping Ξ obeys the correct restrictions (2) and (4), we can, in principle, variationally approach the thermodynamically consistent force field.

However, the variational framework is not in itself sufficient to address key challenges for bottom up methods. For example, the accuracy of a variationally optimized model depends upon the suitability of the basis set of force fields from which the model is chosen. In other words, the variational method is only as good as the best of the available model candidates (as assessed by finite simulation data from those models). One important consideration in this regard is how and which multibody effects should be incorporated; such a choice will restrict the basis set of possible models.

D. CGnet: coarse graining as a supervised machine learning problem
We can refer to (1) and (3) as "loss functions" because they return a scalar that assumes a minimum value on the optimal model. In a recent report from Wang et al., this fact is leveraged to formulate coarse graining via force matching as a supervised machine learning problem. The authors discuss how the aim of coarse graining is equivalent to that of a supervised machine learning task: i.e., to learn a model with minimal error on a new set of data points that were not used during training.

Wang et al. present several crucial contributions in their study. First, they decompose the error term implied by (3) into three components (bias, variance, and noise) that are physically meaningful in the context of the force field learning problem. The bias represents the expected discrepancy between the mean forces and the average force field predicted by the model, while the variance emerges from finite training data. The remaining error is attributed to "noise," which arises due to the mapping of multiple atomistic configurations to the same coarse-grained structures (the noise is related to the so-called "mapping entropy" introduced by Foley, Shell, and Noid). Crucially, the noise depends only on the change of resolution; it is not affected by changes in the parameters Θ, and it is not expected to drop to zero.

Second, the authors introduce CGnet: a neural network architecture designed to minimize the loss in (3). The input data comprise coarse-grained structures Ξr_i, while the output data "labels" are the mapped forces Ξ_F F(r_i). Once a CGnet is trained, it can be used as a force field for new data points in the coarse-grained space (see Sec. III A 7). Every transformation from the input to the output is designed to be differentiable (i.e., amenable to backpropagation), such that modern machine learning methods and software (e.g., PyTorch) can be used for learning the network parameters. Furthermore, CGnets are designed to enforce known properties of the system, such as roto-translational invariances and equivariances (recall that Ercolessi and Adams noted this requirement in their original presentation). The CGnet architecture is discussed in more detail in Sec. III.

Third, Wang et al. augment their initial CGnet framework to introduce regularized CGnets. The authors discover that the naïve training scheme described above, in which coarse-grained forces are regressed on the corresponding atomistic data, produces "catastrophically wrong" predictions and simulation results. This outcome stems from the absence of training data in regions of configuration space that are not accessed by the atomistic system. Regularized CGnets avoid this problem by introducing the calculation of prior energy terms before training. This adjustment means that, instead of learning the forces directly, the neural network learns a correction to the prior terms in order to match the atomistic forces.

Using regularized CGnets (henceforth, we assume all CGnets are regularized) on two peptide systems, the authors demonstrated effective learning of coarse-grained force fields that could not be obtained with a few-body model approach. It is from this baseline that we present further advances to the CGnet methodology; to describe our contribution, we first require a discussion of the practical implementation of force matching with CGnets.

III. METHODS
The pioneering contribution of Izvekov and Voth was the realization that force matching could be employed to optimize a coarse-grained representation of atomistic data. CGnets leverage the variational principle subsequently proven by Noid et al. to formulate force matching as a supervised machine learning problem. The implementation of CGnets is discussed below in Sec. III A.

In the quantum community, supervised machine learning has been used to predict energies of small molecules through a variety of approaches. In particular, the SchNet architecture is based on the use of continuous filter convolutions and a graph neural network. SchNet is a scalable, transferable framework that employs neural networks and representation learning to predict the properties and behavior of small organic molecules. In the vein of the original force matching procedure of Ercolessi and Adams, SchNet has also been used to predict forces on atomic data from a quantum mechanical gold standard.

In Sec. III B, we describe SchNet and introduce our adaptation of SchNet to the coarse graining problem by incorporating it into a CGnet to create a hybrid "CGSchNet" architecture. The original implementation of CGnet is not transferable across different systems due to its reliance on hand-selected structural features. We recognized that SchNet could be leveraged as a subcomponent of CGnet in order to learn the features, thereby converting CGnet (i.e., force matching via supervised machine learning) into a fully transferable framework for the first time.

A. Original CGnet architecture
Here, we describe the practical implementation of the CGnet architecture introduced in Sec. II D.
1. Obtaining training data from atomistic simulations
Our training data comprise an MD simulation that has already been performed and for which the atomistic forces have been retained or recomputed. Both the configurations and the forces are in ℝ^{3N} for N atoms. We then determine our mapping matrix Ξ and use it to prepare our input data (coarse-grained structures) and labels (atomistic forces mapped to the coarse-grained space), which will both be in ℝ^{3n} for n beads (recall (2)).

While the mapping is permitted to be more general (see also Sec. II C), in our work we restrict it to the special case where the matrix Ξ contains zeroes and ones only. With this choice of mapping, the projection of the forces in (3) becomes simply Ξ_F = Ξ. Our mapping thus "slices" the original atomic configuration such that the corresponding coarse-grained representation comprises a subset of the original atoms. For example, a useful mapping might retain only protein backbone atoms or α-carbons.
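As a concrete illustration, the snippet below builds such a slicing matrix Ξ and applies it to a trajectory; the retained atom indices and array shapes are hypothetical.

```python
import numpy as np

def slicing_map(n_atoms, kept_indices):
    """Build a slicing coarse-graining matrix Xi (Eq. (2)) of zeroes and ones only."""
    xi = np.zeros((len(kept_indices), n_atoms))
    for bead, atom in enumerate(kept_indices):
        xi[bead, atom] = 1.0
    return xi

# Hypothetical example: keep four atoms of a 22-atom system as beads.
xi = slicing_map(22, [4, 6, 8, 14])
coords = np.random.randn(100, 22, 3)              # (M, N, 3) placeholder trajectory
forces = np.random.randn(100, 22, 3)              # (M, N, 3) placeholder forces
cg_coords = np.einsum("bn,mnd->mbd", xi, coords)  # x_i = Xi r_i
cg_forces = np.einsum("bn,mnd->mbd", xi, forces)  # Xi_F F(r_i) = Xi F(r_i) for slicing maps
```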
2. Converting structures to features
We understand from physics that the coarse-grained energy will be invariant to the global translation or rotation of the input coordinates in space, and that the coarse-grained forces will be invariant to translation and equivariant under rotation.∗ Therefore, we choose to preprocess our "raw" structural data such that it is represented by features with the desired properties. A straightforward way to featurize the input data is to convert the spatial coordinates of the beads into a set of pairwise distances. The original CGnet architecture presented by Wang et al. also includes trigonometric transformations of planar (three-bead) and torsional (four-bead) angles among adjacent beads. In the present work, on the other hand, instead of using hand-selected structural features, we require only distances and bead identities, from which features are learned; this is described in Sec. III B.

∗ In certain cases we must also enforce permutational invariance; since we only study peptides in the present contribution, we do not need to consider this here.
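A minimal sketch of the distance featurization in PyTorch; the resulting features are invariant to global rotation and translation of the beads.

```python
import torch

def pairwise_distances(cg_coords):
    """Map coarse-grained coordinates (M, n, 3) to the n(n-1)/2 pairwise distances."""
    diffs = cg_coords.unsqueeze(2) - cg_coords.unsqueeze(1)  # (M, n, n, 3)
    dists = diffs.norm(dim=-1)                               # (M, n, n)
    i, j = torch.triu_indices(dists.shape[1], dists.shape[2], offset=1)
    return dists[:, i, j]                                    # (M, n(n-1)/2)
```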
3. Calculating prior terms
In a nonregularized CGnet, a neural network directly regresses the atomistic forces upon the input data by learning an energy term and then evaluating its derivative. This was observed to generate huge errors during prediction due to the (expected) absence of training data points for configurations that are not physically accessible. In a so-called "regularized CGnet," on the other hand, a prior energy term is calculated from the training data, providing a baseline energy for all of configuration space. The neural network component then learns corrections to the prior energy; from this corrected energy the forces are obtained.

Wang et al. use up to two types of prior terms in CGnets. The first is a harmonic prior on selected distances (i.e., bonds or pseudobonds) and angles. The second is a repulsion prior that can be used on nonbonded distances. Respectively, these priors are defined as follows for a given feature f_i calculated from the data (e.g., a particular distance):

$$
U_i^{\mathrm{harmonic}}(f_i) = \frac{k_B T}{\mathrm{Var}[f_i]} \left(f_i - \mathbb{E}[f_i]\right)^2, \tag{6a}
$$

$$
U_i^{\mathrm{repulsion}}(f_i) = \left(\frac{\sigma}{f_i}\right)^c. \tag{6b}
$$

The constants in (6b) can be determined through cross-validated hyperparameter optimization as in Ref. 19. The prior energy is the sum of each prior term over all relevant features f_i. In principle, any scalar function of protein coordinates can be used to construct a prior energy term.
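The priors can be computed directly from training statistics. Below is a minimal sketch consistent with (6); the feature statistics and the repulsion constants σ and c are inputs, with the latter treated as hyperparameters as described above.

```python
import torch

def harmonic_prior(f, f_mean, f_var, kbt):
    """Harmonic prior (6a), with spring constants set by the feature variances."""
    return (kbt / f_var * (f - f_mean) ** 2).sum(dim=-1)

def repulsion_prior(f, sigma, c):
    """Repulsion prior (6b): (sigma / f)^c, summed over nonbonded distances."""
    return ((sigma / f) ** c).sum(dim=-1)

# The statistics are estimated once from the training features, e.g.:
# f_mean, f_var = train_features.mean(dim=0), train_features.var(dim=0)
```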
4. Building the neural network
Once we have featurized our data and computed a prior energy, we must construct a neural network to learn corrections to the latter. For a CGnet as originally described, we use a fully connected network (e.g., the optimal parameters for one of the peptide models in Wang et al. had 5 layers and 160 neurons). Wang et al. used the hyperbolic tangent to facilitate nonlinear transformations between the layers. Lipschitz regularization was also employed.

Crucially, the last layer of the network returns a scalar output. Because of this single node bottleneck structure, the resulting coarse-grained force field will be curl-free and is therefore guaranteed to conserve energy.
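A minimal sketch of such a terminally contracting network in PyTorch; the depth and width below are placeholders rather than tuned values.

```python
import torch.nn as nn

def make_cgnet_body(n_features, width=160, depth=5):
    """Fully connected network with tanh nonlinearities and a scalar output.

    The single output node ensures the predicted energy correction is a scalar,
    so the force field obtained by differentiation is curl-free.
    """
    layers, dim = [], n_features
    for _ in range(depth):
        layers += [nn.Linear(dim, width), nn.Tanh()]
        dim = width
    layers.append(nn.Linear(dim, 1))  # scalar energy correction
    return nn.Sequential(*layers)
```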
5. Training the model
So far, we have converted our raw simulation data to a coarse-grained mapping, featurized the mapped coordinates to accommodate physical symmetries, calculated a (scalar) prior energy term, and constructed a (scalar) neural network correction to the prior. These two scalars are summed to produce an estimated energy.

Since all the steps described are differentiable, we can use an autodifferentiation framework such as PyTorch to take the derivative of the energy with respect to the original (coarse-grained) spatial coordinates via backpropagation. This derivative corresponds to the predicted forces on the coarse-grained beads in ℝ^{3n}, which can then be compared to the known forces on the training coordinates. The CGnet thereby identifies the free energy whose negative derivative most accurately approximates the forces in the training set.

The neural network weights and other parameters are obtained through a training process involving stochastic gradient descent with the Adam optimizer. The predicted forces are compared with the mapped atomistic forces using force matching (i.e., according to (3)).
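Putting the pieces together, a schematic training step might look as follows; `featurize`, `prior_energy`, and `net` are hypothetical handles for the components described above, and the learning rate is a placeholder.

```python
import torch

optimizer = torch.optim.Adam(net.parameters(), lr=3e-4)

def training_step(cg_coords, cg_forces):
    cg_coords.requires_grad_(True)
    features = featurize(cg_coords)                 # e.g., pairwise distances
    energy = net(features).squeeze(-1) + prior_energy(features)
    # Backpropagate through the featurization to obtain forces on the bead coordinates.
    pred_forces = -torch.autograd.grad(energy.sum(), cg_coords, create_graph=True)[0]
    loss = ((cg_forces - pred_forces) ** 2).mean()  # force matching, Eq. (3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```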
6. Cross-validation and hyperparameter searching
Whereas parameters such as node weights are optimized during the training of an individual model, hyperparameters such as the neural network depth and number of nodes must be chosen by comparing separately trained models according to the value of the loss function (3) when training is complete. Hyperparameters are chosen based on neural network performance on test data over multiple cross-validation "folds." The hyperparameter optimization procedure we use in this study is detailed in Appendix B.
7. Simulation with a trained model
Once hyperparameters have been selected, the trained model can be used as a force field to simulate the system in the coarse-grained space. Specifically, Langevin dynamics are employed to propagate coarse-grained coordinates x_t forward in time according to

$$
\frac{\Delta\Delta x_t}{(\Delta t)^2} = -\nabla U(x_t)\, M^{-1} - \gamma\, \frac{\Delta x_t}{\Delta t} + \sqrt{\frac{2 k_B T \gamma}{M}}\, W(t), \tag{7}
$$

where the diagonal matrix M contains the bead masses, γ is a collision rate with units ps⁻¹, and W(t) ∼ N(0, 1/∆t) is a stochastic Wiener process with units ps⁻¹ᐟ².† In this work, we use ∆t = 0. ps.

† Two conventions are adopted here. First, "division" by the M matrix indicates left-multiplication by its inverse. Second, the units of the Wiener process are chosen for ease of computation. W(t) may be converted to another Wiener process that is instead included in the square root term, with units ps⁻¹.

A subset of Langevin dynamics are the so-called "overdamped" Langevin dynamics, also referred to as Brownian motion. Overdamped Langevin dynamics lack inertia; i.e., ∆∆x_t = 0. After setting the acceleration to zero, dividing both sides by γ, and rearranging terms, (7) becomes

$$
\frac{\Delta x_t}{\Delta t} = -\frac{D}{k_B T}\, \nabla U(x_t) + \sqrt{2D}\, W(t), \tag{8}
$$

where D ≡ k_B T / (Mγ). Although D contains a notion of mass, we note that propagating coarse-grained dynamics via (8) does not actually require bead masses, since the product Mγ can be considered without separating its factors. Wang et al. use exclusively (8) to simulate dynamics from CGnets.

In both formulations, the noise term is intended to indirectly model collisions, e.g., from and among solvent particles, that are not present in the coarse-grained coordinate space. Since Langevin dynamics depend only on the coordinates (and, unless overdamped, velocities) of the previous time step, these simulations can easily be run in parallel from a set of initial coordinates. The resulting coarse-grained simulation dataset can then be used for further analysis, as we will show in Sec. IV.
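For concreteness, one overdamped step (8) can be sketched as below; `predict_forces` stands for the trained model, and all constants are placeholders.

```python
import torch

def brownian_step(x, predict_forces, dt, kbt, mass_gamma):
    """One step of overdamped Langevin (Brownian) dynamics, Eq. (8).

    x:          (n_sims, n_beads, 3) current coarse-grained coordinates
    mass_gamma: the product M * gamma, which never needs to be separated
    """
    d = kbt / mass_gamma                   # diffusion constant D = k_B T / (M gamma)
    drift = (d / kbt) * predict_forces(x)  # -(D / k_B T) grad U = (D / k_B T) F
    noise = (2.0 * d * dt) ** 0.5 * torch.randn_like(x)
    return x + drift * dt + noise
```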
B. Replacing structural features with graph neural networks

Wang et al. show that using the structural features described in Sec. III A 2 in a CGnet produces accurate machine-learned force fields for a low-dimensional potential energy surface, capped alanine, and the miniprotein chignolin. From the trained CGnets, further simulations on the same system are performed using overdamped Langevin dynamics (recall (8)) and produce free energy surfaces comparable with those obtained from the baseline all-atom simulations. The model architecture is found to be somewhat sensitive to various hyperparameters and required individual tuning for each system (see, e.g., Fig. 5 in Wang et al.). Furthermore, a new system will in general require retraining because the feature size is fixed according to the system geometry (recall Sec. III A 2).

In the present contribution, we replace the fixed structural features employed in the original CGnet formulation (i.e., distances, angles, and torsions) with learned features computed using continuous filter convolutions on a graph neural network (here, SchNet). The SchNet architecture thereby becomes a subunit of CGnet with its own, separate neural network scheme; we refer to this hybrid architecture as CGSchNet. Below, we briefly summarize SchNet and refer the reader to the original papers for more details. Then, we describe the incorporation of SchNet into CGnet to create CGSchNet.
1. SchNet overview
One key motivating factor for the original development of SchNet is that, unlike the images and videos that comprise the datasets for much of modern machine learning, molecular structures are not restricted to a regular grid. Therefore, Schütt et al. introduced continuous-filter convolutions to analyze the structures of small molecules with the goal of predicting energies and forces according to a quantum mechanical gold standard. This development builds upon previous work in predicting atomic properties directly from structural coordinates.

SchNet is a graph neural network where the nodes correspond to particles embedded in three-dimensional space and the convolutional filters depend on the interparticle distances in order to preserve the invariances expected in the system. While SchNet was originally used to predict quantum-chemical energies from atomistic representations of small molecules, we here employ it to learn a feature representation that replaces the hand-selected features in a CGnet for the purpose of predicting the coarse-grained energy on the coarse-grained bead coordinates x_i.

As in other graph neural networks, SchNet learns feature vectors on the nodes (here, coarse-grained beads). The initial node features at the input are called node or bead embeddings ζ_i, which here correspond to nuclear charges (capped alanine) or amino acid identities (chignolin). These bead embeddings are optimized during training. Crucially, this entails that SchNet learns a molecular representation, which avoids the common paradigm of fixed, heuristic feature representations.

Bead representations are updated in a series of so-called "interaction blocks." Each interaction block comprises beadwise linear updates, continuous convolutions, and a nonlinearity as follows:

$$
\zeta_i^{(B,1)} = W^{(B,1)} \zeta_i^{(B,0)} + b^{(B,1)}, \tag{9a}
$$
$$
\zeta_i^{(B,2)} = \sum_j \zeta_j^{(B,1)} \circ W^{(B,2)}(x_j - x_i), \tag{9b}
$$
$$
\zeta_i^{(B,3)} = W^{(B,3)} \zeta_i^{(B,2)} + b^{(B,3)}, \tag{9c}
$$
$$
\zeta_i^{(B,4)} = f\big(\zeta_i^{(B,3)}\big), \tag{9d}
$$
$$
\zeta_i^{(B,5)} = W^{(B,5)} \zeta_i^{(B,4)} + b^{(B,5)}, \tag{9e}
$$

where ζ_i are the bead representations, B is the block index, and ∘ is elementwise multiplication. Steps (9a), (9c), and (9e) represent simple linear transformations with weights W and biases b, which are learned during training.

In step (9b), a filter-generating neural network W^{(B,2)} that maps ℝ → ℝ^ϕ for ϕ filter dimensions is used to create a continuous filter from the interbead distances. This filter is applied to the bead representations ζ_i once per interaction block. The sum in (9b) is taken over every bead j within the neighborhood of bead i, which can be all other beads in the system or a subset thereof if a finite neighborhood is specified. In step (9d), the function f represents a nonlinear transformation of the bead representation ζ_i. While the original implementation of SchNet uses shifted softplus for f, we use the hyperbolic tangent instead.

Finally, each interaction block B performs a residual update of the previous bead representation, as in ResNets. This iterative "additive refinement" step prevents gradient annihilation in deep networks:

$$
\zeta_i^{(B+1,0)} = \zeta_i^{(B,0)} + \zeta_i^{(B,5)}, \tag{9f}
$$

where in the B = 1 case (i.e., the first block), the first term on the right-hand side is the bead embeddings ζ_i.

In the original SchNet implementation, after the last additive refinement step a series of further transformations are performed to ultimately yield a scalar energy.
In this work, we instead "cut" SchNet after step (9f) of the last interaction block and input these learned features into the fully connected neural network part of a CGnet (i.e., in place of hand-picked geometric features; see Sec. III B 2).

In (9b), W^{(B,2)} is convolved with the bead representations ζ_i, thereby integrating the information contained in pairwise distances into the representations. Stacks of blocks thus enable the modeling of complex interactions among multiple beads in the system. Particle neighborhoods can be imposed such that only distances within a specified cutoff are used at each layer, which has favorable scaling properties due to the finite number of beads that can be contained within an arbitrary radius.

The loss function for the original SchNet formulation includes a force-matching term similar to (1) as well as an energy-matching term, where the former is weighted much less than the latter in most applications. SchNet has been used to predict small molecule properties, energies of formation, and local chemical potentials. Similarly to CGnet, SchNet was also used to create a machine-learned force field by predicting molecular forces on a small molecule and propagating them forward in time (Schütt et al. used path integral MD); in this case, only the force matching loss was used.
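The following is a schematic of one interaction block (9a)-(9f); the radial basis expansion of the distances and the layer widths are simplifications of the published SchNet rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn

class InteractionBlock(nn.Module):
    """One SchNet-style interaction block, Eqs. (9a)-(9f), simplified."""

    def __init__(self, n_feat, n_rbf=50):
        super().__init__()
        self.lin1 = nn.Linear(n_feat, n_feat)  # (9a)
        self.filter_net = nn.Sequential(       # filter generator W: distances -> filters
            nn.Linear(n_rbf, n_feat), nn.Tanh(), nn.Linear(n_feat, n_feat)
        )
        self.lin2 = nn.Linear(n_feat, n_feat)  # (9c)
        self.lin3 = nn.Linear(n_feat, n_feat)  # (9e)

    def forward(self, zeta, rbf_dists):
        # zeta: (n_beads, n_feat); rbf_dists: (n_beads, n_beads, n_rbf),
        # a radial-basis expansion of the interbead distances.
        h = self.lin1(zeta)                    # (9a)
        w = self.filter_net(rbf_dists)         # continuous filters from distances
        h = (h.unsqueeze(0) * w).sum(dim=1)    # (9b): sum over neighbor beads j
        h = self.lin2(h)                       # (9c)
        h = torch.tanh(h)                      # (9d), tanh as used in this work
        h = self.lin3(h)                       # (9e)
        return zeta + h                        # (9f): residual "additive refinement"
```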
FIG. 1. CGSchNet architecture. Cartesian coordinates are converted to distances and bead embeddings, passed through a SchNet featurization (stacked interaction blocks with continuous filter convolutions), and the resulting invariant features feed a dense network; together with the prior energy terms, the network yields an energy U whose gradient ∇_x U gives the Cartesian forces.
2. CGSchNet: a transferable architecture for coarse graining
Sections II and III A lay the groundwork for the advance we present here: namely, the incorporation of SchNet into CGnet. CGnet as originally presented is incapable of learning a transferable coarse-grained force field due to its reliance upon hand-picked structural features. Since SchNet is inherently a transferable framework, learning CGnet features using SchNet enables the transferability of the entire CGnet architecture across molecular systems.

Our CGSchNet architecture is based on that of the original CGnet, with the exception that instead of predetermined structural features (i.e., distances, angles, and torsions), a SchNet is used, enabling the model to learn the representation itself (see Fig. 1). By replacing fixed-size geometric features with SchNet, we obtain a more flexible representation that both scales better with system size and is amenable to a transferable architecture. While angles and torsions may still be included in the prior energy terms, they are no longer propagated through any neural networks.

The use of SchNet requires us to provide not only structural coordinates but also a type for every bead. In the original (i.e., non-coarse graining) implementation for systems at atomic resolution, the types are atomic numbers. In the new context presented here (i.e., leveraging SchNet for coarse graining), we may specify coarse-grained bead types, effectively chemical environments, however we deem appropriate for the system under study; for example, amino acid identities may be used.

Finally, we note that we retain the coarse-grained force matching loss (3) when comparing our predicted forces to the known forces from our training set. Unlike in the original SchNet formulation, we cannot straightforwardly incorporate an additional "energy matching" term into the coarse graining framework. This is because we have no estimate of the contribution of degrees of freedom that have been coarse grained out (e.g., solvent energies) or their entropic contributions to the PMF (4).
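For example, per-bead chemical environments can be encoded as integer types fed to a learnable embedding; the type assignments below are hypothetical.

```python
import torch
import torch.nn as nn

# Hypothetical bead types: one integer per amino acid identity for a
# 10-residue protein, with a distinct type for the terminal residues.
bead_types = torch.tensor([0, 1, 2, 3, 4, 5, 3, 6, 2, 7])
embedding = nn.Embedding(num_embeddings=8, embedding_dim=128)
zeta0 = embedding(bead_types)  # initial bead representations, optimized during training
```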
IV. RESULTS

A. Capped alanine

Capped alanine, often referred to as alanine dipeptide for its two peptide bonds, is a common benchmark for MD methods development because the heavy-atom dynamics of the central alanine are completely described by the dihedral (torsional) angles φ and ψ (see Fig. 2). We performed a single 1-µs all-atom, explicit solvent MD simulation for capped alanine and saved the forces to use for CGSchNet training (see Ref. 19 and Appendix A). We can visualize the occupancies of backbone angle conformations by creating a histogram of the data on φ × ψ space and visualizing the populations of the histogram bins. This is called a Ramachandran map and is depicted in Fig. 4a for the atomistic simulation using a 60 × 60 regular spatial discretization.

As an initial benchmark of the CGSchNet method, we aim to learn a force field for a coarse-grained representation of capped alanine such that we can reproduce its heavy-atom dynamics using a trained CGSchNet instead of a more expensive explicit solvent all-atom MD simulation. For our coarse-grained mapping, we select the backbone heavy atoms C–[N–Cα–C]_Ala–N as well as the alanine Cβ for a total of six beads.‡ We use atomic numbers for the bead embeddings as in the original SchNet formulation. A CGSchNet is trained on the coordinates and forces of the all-atom simulation depicted in Fig. 4a. The learning procedure involves a hyperparameter selection routine and the training of multiple models under five-fold cross-validation for each hyperparameter set (see Appendix B).

Once a final architecture has been selected, the trained model can serve as a force field in the coarse-grained space; i.e., by predicting the forces on a set of input coarse-grained coordinates. Along with an integrator, the predicted forces can be used to propagate coarse-grained coordinates forward in time (recall Sec. III A 7). This procedure (i.e., force prediction with CGSchNet followed by propagation with an integrator) is iterated until a simulation dataset of the desired duration has been obtained. Since we employ five-fold cross-validation during the model training procedure, we have five trained CGSchNet models with a common architecture at hand. To perform our coarse-grained simulation, we simultaneously predict the forces on each set of input coordinates from all five trained networks, and the mean force vector is used to propagate Langevin dynamics according to (7).

To facilitate sampling, 100 coarse-grained simulations of length 200 ns each are performed in parallel from various starting positions in Ramachandran space (see Appendix C and Fig. A3). The time series of the φ and ψ values for two of the trajectories that feature transitions among the major basins are plotted in Fig. 3. The same trajectories are also overlaid on the two-dimensional energy surface in Fig. A4.

Free energy surfaces resulting from the coarse-grained simulation dataset are presented in Fig. 4b. We can see qualitatively that the two-dimensional free energy surface from the CGSchNet simulation captures the same basins as the surface calculated from the baseline all-atom simulation. In the one-dimensional free energy surfaces, we see that the barriers are well approximated by the CGSchNet simulation data.

‡ As in Ref. 20, we require the β-carbon in order to break the symmetry of the system (i.e., to enforce chirality). In their demonstration of CGnet, Wang et al. used only the five backbone heavy atoms as beads because chirality is enforced through dihedral features, which we do not use here.
FIG. 4. Two- and one-dimensional free energy surfaces for five capped alanine datasets. From left to right, the datasets are the baseline all-atom capped alanine simulation (a), the coarse-grained CGSchNet simulation produced for analysis (b), and datasets generated from perturbations of the original Cartesian coordinates of the baseline dataset drawn from noise distributed as N(0, σ²) for increasing noise scales σ (c-e). The φ and ψ Ramachandran angles are calculated from the spatial coordinates and discretized into 60 × 60 regularly spaced square bins. The bin counts are converted to free energies by taking the natural log of the counts and multiplying by −k_B T; the color scale is the same in all five two-dimensional surfaces, and darker color represents lower free energy (i.e., greater stability). To obtain the one-dimensional φ and ψ landscapes, free energies are calculated for regularly spaced bins along the reaction coordinate. The shaded region always represents the baseline dataset and the bold line represents the dataset indicated in the subfigure title.
FIG. 5. The Kullback-Leibler divergence (left) and mean squared error (right) calculated between the baseline dataset and the datasets obtained from perturbations of the baseline simulation over a regularly spaced set of noise scale values σ, where the former is the reference distribution and the latter is the trial distribution. This procedure is performed 50 times with different random seeds; both plots show the superposition of those 50 lines. The colored horizontal dashed line shows the value of the metric when comparing the CGSchNet simulation to the baseline, and the black vertical dashed line indicates the noise scale σ that returns the closest value for that metric. The KL divergence is computed for normalized bin counts in φ × ψ space, and the MSE is computed for the energies of those bins as described in the main text.

To calibrate our understanding of the CGSchNet simulation dataset's relationship to the baseline atomistic simulation dataset, we create a set of new systems by perturbing the Cartesian coordinates of the latter with noise distributed as N(0, σ²) for a regularly spaced set of noise scales σ (in Å). From the perturbed Cartesian coordinates, the new φ and ψ dihedrals are calculated and assigned to the same 60 × 60 regularly spaced bins in Ramachandran space. Examples of the perturbed free energy surfaces are shown in Fig. 4c, d, and e for increasing values of σ. To compare the datasets quantitatively, we use a Kullback-Leibler (KL) divergence and a mean squared error (MSE) formulation. The KL divergence is defined for discrete distributions as

$$
-\sum_{i=1}^{m} p_i \ln \frac{q_i}{p_i}, \quad \forall\, p_i > 0, \tag{10}
$$

where p and q are the "reference" and "trial" distributions, respectively, and m is the number of bins in each discrete distribution. In this case, p and q represent the normalized bin counts. The index i returns the normalized count from the i-th bin of a 60 × 60 regular discretization of φ × ψ space. The distribution obtained from the baseline atomistic simulation always serves as the reference. The mean squared error used here is

$$
\frac{1}{m'} \sum_{i}^{m'} (P_i - Q_i)^2, \quad \forall\, p_i q_i > 0, \tag{11}
$$

where p_i and q_i remain the normalized bin counts and P_i and Q_i represent, respectively, the corresponding discrete distributions of bin energies calculated as, e.g., P_i = −k_B T ln p_i for Boltzmann's constant k_B and absolute temperature T. When no count is recorded for a bin in either p_i or q_i, those bins are omitted from the mean; m' represents the number of bins in which p_i q_i > 0 (i.e., both have finite energies).§

We perform 50 different trials at different random seeds for the full set of noise scales (i.e., at each noise scale for a given trial, values are drawn from N(0, σ²) ∈ ℝ^{T×n}, where T is the length of the trajectory dataset and n is the number of coarse-grained beads). Within each trial, at each noise scale value σ, the KL divergence and MSE are calculated. The results are presented in Fig. 5.

We see in Fig. 5 that as the noise increases, both divergence metrics also increase. The dashed lines in Fig. 5 show us that the error on the CGSchNet simulation dataset is approximately comparable to the corresponding error on the perturbed dataset at the noise scale whose surface is shown in Fig. 4c.¶ Upon qualitative comparison of the free energy surfaces, however, the former has more visual fidelity to the baseline surface in Fig. 4a than to the broader spread seen (and expected) in the latter. We know that coarse graining can result in increased population in transition regions that are rarely visited in an all-atom model; this is what we observe in Fig. 4b. As a corollary, we do not expect coarse graining to result in the absence of states known to exist in the baseline system.

§ We could alternatively compute the mean squared error between discrete distributions of counts without omitting any bins.
¶ Similar results were obtained for the two-dimensional Wasserstein distance.
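A minimal sketch of the two metrics on binned, normalized Ramachandran populations; bins in which either distribution is empty are masked, consistent with the omissions described above.

```python
import numpy as np

def kl_divergence(p, q):
    """Eq. (10): -sum_i p_i ln(q_i / p_i) over bins with nonzero counts."""
    mask = (p > 0) & (q > 0)
    return -np.sum(p[mask] * np.log(q[mask] / p[mask]))

def energy_mse(p, q, kbt=1.0):
    """Eq. (11): mean squared error between bin free energies -k_B T ln p_i."""
    mask = (p > 0) & (q > 0)
    P, Q = -kbt * np.log(p[mask]), -kbt * np.log(q[mask])
    return np.mean((P - Q) ** 2)

# p and q are flattened, normalized 60 x 60 histograms of (phi, psi), e.g.:
# p, _, _ = np.histogram2d(phi_ref, psi_ref, bins=60); p = p.ravel() / p.sum()
```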
B. Chignolin

The CLN025 variant of chignolin is a 10-amino acid miniprotein featuring a β-hairpin turn in its folded state (Fig. 6). Due to its fast folding, its kinetics have been investigated in several MD studies. Our training data is obtained from an atomistic simulation of chignolin in explicit solvent for which we stored the forces (see Ref. 19 and Appendix A). To build our CGSchNet model, we retain only the ten α-carbons for our coarse-grained beads. For the SchNet embeddings, we assign each amino acid its own environment with a separate designation for the two terminal tyrosines. After determining hyperparameters for our CGSchNet model, we simulate chignolin in the coarse-grained space using Langevin dynamics as in the previous section (7). The procedures for CGSchNet training and simulation are similar to those used for capped alanine and are described in Appendices B and C.

FIG. 6. The miniprotein chignolin. The α-carbon backbone is visualized in opaque black, and these ten atoms are the only beads preserved in the coarse-grained representation. The atomistic system is also solvated, although the water molecules are not shown here.

Given our CGSchNet simulation data, we are interested not only in performing a similar analysis to the one in the previous section for alanine dipeptide (i.e., comparison to the baseline dataset with and without noise added) but also in comparing to simulation data obtained from a CGnet trained according to the protocol in Ref. 19 for the same system. We thus also construct a CGnet (i.e., using fixed geometric features as described in Sec. III A) according to the parameters selected in Ref. 19 (see also Appendix B). Then, we create a simulation dataset using the protocol described in the previous section and Appendix C. Finally, we employ a similar protocol to the previous section by perturbing the raw Cartesian coordinates of the all-atom chignolin simulation dataset with noise distributed as N(0, σ²) for a regularly spaced set of noise scales σ (in Å).

FIG. 7. Two- and one-dimensional free energy surfaces for four chignolin datasets. From left to right, the datasets are the baseline chignolin simulation (a), the coarse-grained CGSchNet simulation (b), the coarse-grained CGnet simulation (c), and the dataset generated from the perturbation of the original Cartesian coordinates of the baseline dataset drawn from noise distributed as N(0, σ²) (d). Each two-dimensional surface is a histogram on TIC 1 × TIC 2 space with weights determined from the MSM built for each system (see main text). The one-dimensional surfaces are similarly obtained from 120-bin histograms on a single TIC. For each reweighted histogram bin, the free energy is obtained by taking the natural log of the counts and multiplying by −k_B T; the color scale is the same in all four two-dimensional surfaces and darker color represents lower free energy (i.e., greater stability). The shaded region always represents the baseline dataset and the bold line represents the dataset indicated in the subfigure title.

For each type of system (i.e., baseline, CGSchNet, CGnet, and baseline with noise perturbation), we build Markov state models (MSMs). First, the data is "featurized" from Cartesian coordinates into the set of 45 distances between pairs of α-carbons. From these distances, TICA is performed to yield four slow reaction coordinates for the system. These four reaction coordinates are clustered into 150 discrete, disjoint states using the k-means algorithm. An MSM is then estimated from the cluster assignments. The MSM for the baseline simulation dataset is constructed first; then, the other simulation datasets are projected onto the space defined by the former. MSM essentials are presented in Appendix D from a theoretical standpoint, and the specific protocols used for the MSM analysis in this section are given in Appendix E.
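A sketch of this analysis pipeline using PyEMMA; the library choice and the lag times below are assumptions, not the settings of Appendix E.

```python
import pyemma

# traj_feats: list of (T, 45) arrays of pairwise alpha-carbon distances per trajectory
tica = pyemma.coordinates.tica(traj_feats, lag=10, dim=4)        # four slow coordinates
tica_output = tica.get_output()
clusters = pyemma.coordinates.cluster_kmeans(tica_output, k=150, max_iter=100)
msm = pyemma.msm.estimate_markov_model(clusters.dtrajs, lag=10)  # 150-state MSM
weights = msm.stationary_distribution                            # reweights the TIC histograms
```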
FIG. 8. The Kullback-Leibler divergence (left) and mean squared error (right) calculated between the baseline dataset and the datasets obtained from perturbations of the baseline simulation over a regularly spaced set of noise scale values σ, where the former is the reference distribution and the latter is the trial distribution. This procedure is performed 50 times with different random seeds; both plots show the superposition of those 50 lines. The colored horizontal dashed lines show the values of the metric when comparing the CGSchNet (purple) and the CGnet (blue) simulations to the baseline. The black vertical dashed line indicates the noise scale σ that returns the closest value for that metric. The KL divergence is computed for normalized bin counts in reweighted TIC 1 × TIC 2 space, and the MSE is computed for energies as described in the main text.

The stationary distribution of each MSM is then used to reweight the TICA coordinates used for its own construction. Histograms of the first two TICA coordinates are presented in the top row of Fig. 7 for the baseline, CGSchNet, and CGnet simulation datasets as well as the perturbed baseline dataset shown in Fig. 7d. The first two reweighted TICA coordinates are also individually binned into one-dimensional free energy surfaces, which are depicted in the second and third rows of Fig. 7. We see that the free energy barriers along these reaction coordinates are reasonably approximated by the CGSchNet simulation.

Figure 8 shows the same divergence metrics as those calculated in Fig. 5 in the previous section. Again, we see that both the KL divergence and the MSE increase monotonically with the magnitude of the noise. In this case, we can assess the equivalent noise value for both the CGSchNet and CGnet simulation datasets. For both divergences measured, we see that the CGSchNet simulation corresponds to a smaller value of added noise than the CGnet simulation.

We can also obtain free energy surfaces from the MSMs constructed for the systems; the surfaces for the baseline and CGSchNet simulation datasets of chignolin are presented in Fig. 9a on the left and right, respectively. We see that the three major basins observed in the atomistic data are captured by CGSchNet. These basins represent folded, unfolded, and misfolded ensembles and are indicated in Fig. 9a with blue, green, and yellow stars, respectively. Each star represents one of the 150 MSM states and was manually selected from the MSM states near the relevant basin (see Fig. A8 for a visualization of all 150 MSM states). To verify that the protein conformations are similar in each of the states, we sample ten structures from each starred state per simulation dataset.

FIG. 9. Two-dimensional free energy surfaces (a) and sample folded (b), unfolded (c), and misfolded (d) conformations from the baseline atomistic simulation of chignolin (left column) and the CGSchNet simulation (right column). The free energy surfaces are built from 150-state MSMs that were constructed from the slowest two TICA coordinates in contact distance space. The color scale is the same for both surfaces and darker color represents lower free energy (i.e., greater stability). Each set of ten sampled structures corresponds to the MSM state represented by the star on the free energy surface of the same color (one of the ten structures shown opaque for clarity).

The structures are visualized in Fig. 9b-d,
The analysis of the CGSchNet and CGnet simulation datasets so far used TICA reaction coordinates that were obtained by projecting the simulation data onto coordinates defined by a TICA model built for the baseline atomistic data (see Appendix E). This was done in order to compare simulation results using the same reaction coordinates. We can also construct TICA models from the simulation datasets without projection. For this analysis, we build two further (independent) TICA models for the CGSchNet and CGnet datasets. The first two timescales of the baseline TICA model lie above a spectral gap (see Fig. A7), and the absolute values of the Spearman rank correlation coefficients between the corresponding TICA reaction coordinates from the baseline and CGSchNet TICA models are 0.72 and 0.63 for the first and second TICA coordinates, respectively. This correlation analysis indicates that the corresponding slow processes capture the same collective motions. In contrast, the same correlations for the CGnet TICA model yield Spearman coefficients with absolute values of 0.68 and 0.24 for the first and second TICA coordinates, respectively.

Beyond noting that the slow collective motions captured by TICA are similar between the CGSchNet simulation and the atomistic simulation when no projection is used to model the former, we do not attempt a kinetic analysis in the present work because the scope of force matching is limited to thermodynamic consistency. The matching of dynamics in addition to thermodynamics is an open challenge that has been the subject of recent work. Given coarse-grained dynamics, analytical methods have been derived that enable their rescaling to the dynamics of the system's all-atom counterpart.
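The Spearman comparison above is a one-liner per coordinate; this sketch, with placeholder arrays standing in for the TICA projections of the same frames under the two models, also illustrates why absolute values are compared (the sign of a TICA coordinate is arbitrary).

    import numpy as np
    from scipy.stats import spearmanr

    # Placeholder data standing in for the first two TICA coordinates of the
    # same frames under the baseline and CGSchNet TICA models, respectively.
    rng = np.random.default_rng(1)
    tica_base = rng.normal(size=(10_000, 2))
    tica_cg = tica_base + rng.normal(scale=0.5, size=(10_000, 2))

    for i in range(2):
        rho, _ = spearmanr(tica_base[:, i], tica_cg[:, i])
        # The sign of each TICA coordinate is arbitrary, so compare |rho|
        print(f"TIC {i + 1}: |rho| = {abs(rho):.2f}")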
V. DISCUSSION

Coarse graining holds the promise of simulating larger systems at longer timescales than are currently possible at the atomistic level. However, mathematical frameworks must be developed in order to ensure that the results obtained from a coarse-grained model are faithful to those that would be obtained from an atomistic simulation or experimental measurement. Force matching is one such framework that, when certain restrictions are applied, guarantees thermodynamic consistency with atomistic data in the variational limit. Such a variational framework enables the formulation of the force matching problem as a supervised machine learning task, which is presented in Ref. 19 as CGnet.

A key limitation of the original CGnet is that it is not transferable: a new network must be trained for each individual molecular system under study because the molecular features from which it learns the force field must be chosen by hand. Here, we replace manually determined features with a learnable representation. This representation is enabled by the use of continuous filter convolutions on a graph neural network (i.e., SchNet). SchNet is an inherently transferable architecture originally designed to match energies and forces to quantum calculations for small organic molecules. By leveraging SchNet in the coarse graining context (i.e., to learn the molecular features input into a CGnet), we render the hybrid CGnet architecture (i.e., CGSchNet) fully transferable across molecular systems.

Our aim in the present contribution is threefold: to summarize the variational framework enabling a supervised learning approach to force matching, to provide an accompanying software package implementing the methods discussed herein (see Appendix F), and to demonstrate that CGSchNet produces results on individual systems that are superior to those obtained from bespoke features. The advances presented in this work prepare us to address the ultimate challenge of machine learning a coarse-grained force field that is transferable across molecular systems.

In our computational experiments performed on capped alanine and the miniprotein chignolin, we find that CGSchNet's performance exceeds that of CGnet in three ways. First, the free energy surface obtained from CGSchNet simulations of chignolin is more accurate than the free energy surface presented for the same system in Ref. 19. Second, CGSchNet is more robust to network hyperparameters than its predecessor. In fact, for the CGSchNet hyperparameters varied during model training (see Appendix B), the same selections are used for both systems presented in Sec. IV. Third, CGSchNet employs less regularization; in particular, it does not require the extra step of enforcing a Lipschitz constraint on its network's weight matrices, as was found to be necessary for CGnet.

While our current protocol has demonstrated success for a capped monopeptide and a 10-amino acid miniprotein, adapting the CGSchNet pipeline to produce accurate coarse-grained force fields for larger protein systems remains an open challenge. Addressing this challenge may require specific sampling strategies when obtaining training data, the incorporation of new priors that inform tertiary structure formation, or modifications to the CGSchNet architecture itself such as regularization.
Successfully modeling the thermodynamics of protein folding or conformational change via a transferable, machine-learned force field would signify a major success for the union of artificial intelligence and the computational molecular sciences.

Structural, bottom up coarse graining consists of two aspects: the model resolution and the force field. Here, we assume the resolution is set and focus on the force field, but the choice of an optimal model resolution is itself a significant challenge that is interconnected with the goal of force field optimization. How to choose a resolution for coarse graining, and how this choice interacts with transferable force field architectures, remains an open question. Recent work has employed machine learning and data-driven approaches to pursue an optimal resolution using various objectives.

Altogether, the methodology we introduce in the present contribution establishes a transferable architecture for the machine learning of coarse-grained force fields, and we expect our accompanying software to facilitate progress not only in that realm but also towards the outstanding challenges of learning coarse-grained dynamics and optimizing a model's resolution.
ACKNOWLEDGEMENTS

B.E.H. is immeasurably grateful to Moritz Hoffmann for his wisdom and support. The authors are grateful to Dr. Ankit Patel for discussions on model debugging and machine learning techniques, to Iryna Zaporozhets for discussions concerning model priors, and to Dr. Stefan Doerr for software help.

We acknowledge funding from the European Commission (ERC CoG 772230 "ScaleCell") to B.E.H., A.K., Y.C., and F.N.; from MATH+, the Berlin Mathematics Center, to B.E.H. (EF1-2), S.O. (AA1-6), and F.N. (EF1-2 and AA1-6); from the National Science Foundation (CHE-1738990, CHE-1900374, and PHY-1427654) to N.E.C., J.W., and C.C.; from the Welch Foundation (C-1570) to N.E.C., J.W., and C.C.; from the NLM Training Program in Biomedical Informatics and Data Science (5T15LM007093-27) to N.E.C.; from the Einstein Foundation Berlin to C.C.; and from the Deutsche Forschungsgemeinschaft (SFB1114 projects A04 and C03) to F.N. D.L. was supported by a FPI fellowship from the Spanish Ministry of Science and Innovation (MICINN, PRE2018-085169). G.D.F. acknowledges support from MINECO (Unidad de Excelencia María de Maeztu AEI [CEX2018-000782-M] and BIO2017-82628-P) and FEDER. This project has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement 823712 (CompBioMed2 Project). We thank the GPUGRID donors for their compute time.

Simulations were performed on the computer clusters of the Center for Research Computing at Rice University, supported in part by the Big-Data Private-Cloud Research Cyberinfrastructure MRI-award (NSF grant CNS-1338099), and on the clusters of the Department of Mathematics and Computer Science at Freie Universität, Berlin.

Part of this research was performed while B.E.H., N.E.C., D.L., J.W., G.d.F., C.C., and F.N. were visiting the Institute for Pure and Applied Mathematics (IPAM) at the University of California, Los Angeles for the Long Program "Machine Learning for Physics and the Physics of Learning." IPAM is supported by the National Science Foundation (Grant No. DMS-1440415).

1. M. Shirts and V. S. Pande, "Screen savers of the world unite!" Science, 1903–1904 (2000).
2. F. Allen et al., "Blue Gene: A vision for protein science using a petaflop supercomputer," IBM Syst. J., 310–327 (2001).
3. I. Buch, M. J. Harvey, T. Giorgino, D. P. Anderson, and G. De Fabritiis, "High-throughput all-atom molecular dynamics simulations using distributed computing," J. Chem. Inf. Model., 397–403 (2010).
4. D. E. Shaw et al., "Anton, a special-purpose machine for molecular dynamics simulation," Commun. ACM, 91–97 (2008).
5. T. J. Lane, D. Shukla, K. A. Beauchamp, and V. S. Pande, "To milliseconds and beyond: challenges in the simulation of protein folding," Curr. Opin. Struct. Biol., 58–65 (2013).
6. N. Plattner, S. Doerr, G. De Fabritiis, and F. Noé, "Complete protein–protein association kinetics in atomic detail revealed by molecular dynamics simulations and Markov modelling," Nat. Chem., 1005 (2017).
7. S. Kmiecik, D. Gront, M. Kolinski, L. Wieteska, A. E. Dawid, and A. Kolinski, "Coarse-grained protein models and their applications," Chem. Rev., 7898–7936 (2016).
8. C. Clementi, H. Nymeyer, and J. N. Onuchic, "Topological and energetic factors: what determines the structural details of the transition state ensemble and "en-route" intermediates for protein folding? An investigation for small globular proteins," J. Mol. Biol., 937–953 (2000).
9. S. J. Marrink, H. J. Risselada, S. Yefimov, D. P. Tieleman, and A. H. De Vries, "The MARTINI force field: coarse grained model for biomolecular simulations," J. Phys. Chem. B, 7812–7824 (2007).
10. L. Monticelli, S. K. Kandasamy, X. Periole, R. G. Larson, D. P. Tieleman, and S.-J. Marrink, "The MARTINI coarse-grained force field: extension to proteins," J. Chem. Theory Comput., 819–834 (2008).
11. A. Koliński et al., "Protein modeling and structure prediction with a reduced representation," Acta Biochimica Polonica, 349–371 (2004).
12. A. Davtyan, N. P. Schafer, W. Zheng, C. Clementi, P. G. Wolynes, and G. A. Papoian, "AWSEM-MD: protein structure prediction using coarse-grained physical potentials and bioinformatically based local structure biasing," J. Phys. Chem. B, 8494–8503 (2012).
13. R. Das and D. Baker, "Macromolecular modeling with Rosetta," Annu. Rev. Biochem., 363–382 (2008).
14. W. G. Noid, "Perspective: Coarse-grained models for biomolecular systems," J. Chem. Phys., 090901 (2013).
15. L. Boninsegna, R. Banisch, and C. Clementi, "A data-driven perspective on the hierarchical assembly of molecular structures," J. Chem. Theory Comput., 453–460 (2018).
16. W. Wang and R. Gómez-Bombarelli, "Coarse-graining auto-encoders for molecular dynamics," npj Comput. Mater., 1–9 (2019).
17. S. John and G. Csányi, "Many-body coarse-grained interactions using Gaussian approximation potentials," J. Phys. Chem. B, 10934–10949 (2017).
18. L. Zhang, J. Han, H. Wang, R. Car, and W. E, "DeePCG: Constructing coarse-grained models via deep neural networks," J. Chem. Phys., 034101 (2018).
19. J. Wang, S. Olsson, C. Wehmeyer, A. Pérez, N. E. Charron, G. De Fabritiis, F. Noé, and C. Clementi, "Machine learning of coarse-grained molecular dynamics force fields," ACS Cent. Sci., 755–767 (2019).
20. J. Wang, S. Chmiela, K.-R. Müller, F. Noé, and C. Clementi, "Ensemble learning of coarse-grained molecular dynamics force fields with a kernel approach," J. Chem. Phys., 194106 (2020).
21. K. Schütt, P.-J. Kindermans, H. E. Sauceda Felix, S. Chmiela, A. Tkatchenko, and K.-R. Müller, "SchNet: A continuous-filter convolutional neural network for modeling quantum interactions," in Advances in Neural Information Processing Systems 30, edited by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Curran Associates, Inc., 2017) pp. 991–1001.
22. K. T. Schütt, H. E. Sauceda, P.-J. Kindermans, A. Tkatchenko, and K.-R. Müller, "SchNet – a deep learning architecture for molecules and materials," J. Chem. Phys., 241722 (2018).
23. S. Honda, T. Akiba, Y. S. Kato, Y. Sawada, M. Sekijima, M. Ishimura, A. Ooishi, H. Watanabe, T. Odahara, and K. Harata, "Crystal structure of a ten-amino acid protein," J. Am. Chem. Soc., 15327–15331 (2008).
24. F. Ercolessi and J. B. Adams, "Interatomic potentials from first-principles calculations: the force-matching method," Europhys. Lett., 583 (1994).
25. S. Izvekov and G. A. Voth, "A multiscale coarse-graining method for biomolecular systems," J. Phys. Chem. B, 2469–2473 (2005).
26. W. G. Noid, J.-W. Chu, G. S. Ayton, V. Krishna, S. Izvekov, G. A. Voth, A. Das, and H. C. Andersen, "The multiscale coarse-graining method. I. A rigorous bridge between atomistic and coarse-grained models," J. Chem. Phys., 244114 (2008).
27. S. Izvekov, M. Parrinello, C. J. Burnham, and G. A. Voth, "Effective force fields for condensed phase systems from ab initio molecular dynamics simulation: A new method for force-matching," J. Chem. Phys., 10896–10913 (2004).
28. M. Guenza, M. Dinpajooh, J. McCarty, and I. Lyubimov, "Accuracy, transferability, and efficiency of coarse-grained models of molecular liquids," J. Phys. Chem. B, 10257–10278 (2018).
29. S. Izvekov and G. A. Voth, "Multiscale coarse graining of liquid-state systems," J. Chem. Phys., 134105 (2005).
30. G. Ciccotti, T. Lelievre, and E. Vanden-Eijnden, "Projection of diffusions on submanifolds: Application to mean force computation," Comm. Pure Appl. Math., 371–408 (2008).
31. M. S. Shell, "The relative entropy is fundamental to multiscale and inverse thermodynamic problems," J. Chem. Phys., 144108 (2008).
32. J. F. Rudzinski and W. G. Noid, "Coarse-graining entropy, forces, and structures," J. Chem. Phys., 214101 (2011).
33. W. G. Noid, P. Liu, Y. Wang, J.-W. Chu, G. S. Ayton, S. Izvekov, H. C. Andersen, and G. A. Voth, "The multiscale coarse-graining method. II. Numerical implementation for coarse-grained molecular models," J. Chem. Phys., 244115 (2008).
34. T. T. Foley, M. S. Shell, and W. G. Noid, "The impact of resolution upon entropy and information in coarse-grained models," J. Chem. Phys., 243104 (2015).
35. A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Curran Associates, Inc., 2019) pp. 8024–8035.
36. J. Behler and M. Parrinello, "Generalized neural-network representation of high-dimensional potential-energy surfaces," Phys. Rev. Lett., 146401 (2007).
37. A. P. Bartók, M. C. Payne, R. Kondor, and G. Csányi, "Gaussian approximation potentials: The accuracy of quantum mechanics, without the electrons," Phys. Rev. Lett., 136403 (2010).
38. M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. Von Lilienfeld, "Fast and accurate modeling of molecular atomization energies with machine learning," Phys. Rev. Lett., 058301 (2012).
39. A. P. Bartók, M. J. Gillan, F. R. Manby, and G. Csányi, "Machine-learning approach for one- and two-body corrections to density functional theory: Applications to molecular and condensed water," Phys. Rev. B, 054104 (2013).
40. J. S. Smith, O. Isayev, and A. E. Roitberg, "ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost," Chem. Sci., 3192–3203 (2017).
41. S. Chmiela, A. Tkatchenko, H. E. Sauceda, I. Poltavsky, K. T. Schütt, and K.-R. Müller, "Machine learning of accurate energy-conserving molecular force fields," Sci. Adv., e1603015 (2017).
42. A. P. Bartók, S. De, C. Poelking, N. Bernstein, J. R. Kermode, G. Csányi, and M. Ceriotti, "Machine learning unifies the modeling of materials and molecules," Sci. Adv., e1701816 (2017).
43. K. T. Schütt, F. Arbabzadah, S. Chmiela, K. R. Müller, and A. Tkatchenko, "Quantum-chemical insights from deep tensor neural networks," Nat. Commun., 1–8 (2017).
44. J. S. Smith, B. Nebgen, N. Lubbers, O. Isayev, and A. E. Roitberg, "Less is more: Sampling chemical space with active learning," J. Chem. Phys., 241733 (2018).
45. A. Grisafi, D. M. Wilkins, G. Csányi, and M. Ceriotti, "Symmetry-adapted machine learning for tensorial properties of atomistic systems," Phys. Rev. Lett., 036002 (2018).
46. G. Imbalzano, A. Anelli, D. Giofré, S. Klees, J. Behler, and M. Ceriotti, "Automatic selection of atomic fingerprints and reference configurations for machine-learning potentials," J. Chem. Phys., 241730 (2018).
47. T. T. Nguyen, E. Székely, G. Imbalzano, J. Behler, G. Csányi, M. Ceriotti, A. W. Götz, and F. Paesani, "Comparison of permutationally invariant polynomials, neural networks, and Gaussian approximation potentials in representing water interactions through many-body expansions," J. Chem. Phys., 241725 (2018).
48. L. Zhang, J. Han, H. Wang, W. Saidi, R. Car, and W. E, "End-to-end symmetry preserving inter-atomic potential energy model for finite and extended systems," in Advances in Neural Information Processing Systems 31, edited by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Curran Associates, Inc., 2018) pp. 4436–4446.
49. L. Zhang, J. Han, H. Wang, R. Car, and E. Weinan, "Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics," Phys. Rev. Lett., 143001 (2018).
50. T. Bereau, R. A. DiStasio Jr, A. Tkatchenko, and O. A. Von Lilienfeld, "Non-covalent interactions across organic and biological subsets of chemical space: Physics-based potentials parametrized from machine learning," J. Chem. Phys., 241706 (2018).
51. H. Wang and W. Yang, "Toward building protein force fields by residue-based systematic molecular fragmentation and neural network," J. Chem. Phys., 1409–1417 (2018).
52. H. Gouk, E. Frank, B. Pfahringer, and M. Cree, "Regularisation of neural networks by enforcing Lipschitz continuity," arXiv preprint arXiv:1804.04368 (2018).
53. D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980 (2014).
54. T. Schneider and E. Stoll, "Molecular-dynamics study of a three-dimensional one-component model for distortive phase transitions," Phys. Rev. B, 1302 (1978).
55. J. Fass, D. A. Sivak, G. E. Crooks, K. A. Beauchamp, B. Leimkuhler, and J. D. Chodera, "Quantifying configuration-sampling error in Langevin simulations of complex molecular systems," Entropy, 318 (2018).
56. J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in Proceedings of the 34th International Conference on Machine Learning, Volume 70 (JMLR.org, 2017) pp. 1263–1272.
57. K. Schütt, P. Kessel, M. Gastegger, K. Nicoli, A. Tkatchenko, and K.-R. Müller, "SchNetPack: A deep learning toolbox for atomistic systems," J. Chem. Theory Comput., 448–455 (2018).
58. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) pp. 770–778.
59. S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Stat., 79–86 (1951).
60. K. Lindorff-Larsen, S. Piana, R. O. Dror, and D. E. Shaw, "How fast-folding proteins fold," Science, 517–520 (2011).
61. K. A. Beauchamp, R. McGibbon, Y.-S. Lin, and V. S. Pande, "Simple few-state models reveal hidden complexity in protein folding," Proc. Natl. Acad. Sci., 17807–17813 (2012).
62. B. E. Husic, R. T. McGibbon, M. M. Sultan, and V. S. Pande, "Optimized parameter selection reveals trends in Markov state models for protein folding," J. Chem. Phys., 194103 (2016).
63. K. A. McKiernan, B. E. Husic, and V. S. Pande, "Modeling the mechanism of CLN025 beta-hairpin formation," J. Chem. Phys., 104107 (2017).
64. M. M. Sultan and V. S. Pande, "Automated design of collective variables using supervised machine learning," J. Chem. Phys., 094106 (2018).
65. M. K. Scherer, B. E. Husic, M. Hoffmann, F. Paul, H. Wu, and F. Noé, "Variational selection of features for molecular kinetics," J. Chem. Phys., 194108 (2019).
66. B. E. Husic and V. S. Pande, "Markov state models: From an art to a science," J. Am. Chem. Soc., 2386–2396 (2018).
67. G. Pérez-Hernández, F. Paul, T. Giorgino, G. De Fabritiis, and F. Noé, "Identification of slow molecular order parameters for Markov model construction," J. Chem. Phys., 015102 (2013).
68. C. R. Schwantes and V. S. Pande, "Improvements in Markov state model construction reveal many non-native interactions in the folding of NTL9," J. Chem. Theory Comput., 2000–2009 (2013).
69. C. Spearman, "The proof and measurement of association between two things," Am. J. Psychol., 72–101 (1904).
70. F. Nüske, L. Boninsegna, and C. Clementi, "Coarse-graining molecular systems by spectral matching," J. Chem. Phys., 044116 (2019).
71. I. Lyubimov and M. Guenza, "First-principle approach to rescale the dynamics of simulated coarse-grained macromolecular liquids," Phys. Rev. E, 031801 (2011).
72. K. Lindorff-Larsen, S. Piana, K. Palmo, P. Maragakis, J. L. Klepeis, R. O. Dror, and D. E. Shaw, "Improved side-chain torsion potentials for the Amber ff99SB protein force field," Proteins: Struct., Funct., Bioinf., 1950–1958 (2010).
73. W. L. Jorgensen, J. Chandrasekhar, J. D. Madura, R. W. Impey, and M. L. Klein, "Comparison of simple potential functions for simulating liquid water," J. Chem. Phys., 926–935 (1983).
74. M. J. Harvey, G. Giupponi, and G. D. Fabritiis, "ACEMD: accelerating biomolecular dynamics in the microsecond time scale," J. Chem. Theory Comput., 1632–1639 (2009).
75. S. Piana, K. Lindorff-Larsen, and D. E. Shaw, "How robust are protein folding simulations with respect to force field parameterization?" Biophys. J., L47–L49 (2011).
76. A. D. MacKerell Jr et al., "All-atom empirical potential for molecular modeling and dynamics studies of proteins," J. Phys. Chem. B, 3586–3616 (1998).
77. J. A. Hartigan, Clustering Algorithms (John Wiley & Sons, Inc., 1975).
78. J.-H. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte, and F. Noé, "Markov models of molecular kinetics: Generation and validation," J. Chem. Phys., 174105 (2011).
79. T. F. Gonzalez, "Clustering to minimize the maximum intercluster distance," Theor. Comput. Sci., 293–306 (1985).
80. R. Zwanzig, "From classical dynamics to continuous time random walks," J. Stat. Phys., 255–262 (1983).
81. C. Schütte, A. Fischer, W. Huisinga, and P. Deuflhard, "A direct approach to conformational dynamics based on hybrid Monte Carlo," J. Comput. Phys., 146–168 (1999).
82. N. Singhal, C. D. Snow, and V. S. Pande, "Using path sampling to build better Markovian state models: predicting the folding rate and mechanism of a tryptophan zipper beta hairpin," J. Chem. Phys., 415–425 (2004).
83. W. C. Swope, J. W. Pitera, and F. Suits, "Describing protein folding kinetics by molecular dynamics simulations. 1. Theory," J. Phys. Chem. B, 6571–6581 (2004).
84. N.-V. Buchete and G. Hummer, "Coarse master equations for peptide folding dynamics," J. Phys. Chem. B, 6057–6069 (2008).
85. C. D. Meyer, Matrix Analysis and Applied Linear Algebra, Vol. 71 (SIAM, 2000).
86. L. Molgedey and H. G. Schuster, "Separation of a mixture of independent signals using time delayed correlations," Phys. Rev. Lett., 3634 (1994).
87. Y. Naritomi and S. Fuchigami, "Slow dynamics in protein fluctuations revealed by time-structure based independent component analysis: the case of domain motions," J. Chem. Phys., 065101 (2011).
88. R. T. McGibbon, B. E. Husic, and V. S. Pande, "Identification of simple reaction coordinates from complex dynamics," J. Chem. Phys., 044109 (2017).
89. H. Wu, F. Nüske, F. Paul, S. Klus, P. Koltai, and F. Noé, "Variational Koopman models: slow collective variables and molecular kinetics from short off-equilibrium simulations," J. Chem. Phys., 154104 (2017).
90. D. Sculley, "Web-scale k-means clustering," in Proceedings of the 19th International Conference on World Wide Web (2010) pp. 1177–1178.
91. B. E. Husic and V. S. Pande, "Note: MSM lag time cannot be used for variational model selection," J. Chem. Phys., 176101 (2017).
92. C. Schütte and M. Sarich, Metastability and Markov State Models in Molecular Dynamics, Vol. 24 (American Mathematical Soc., 2013).
93. G. R. Bowman, V. S. Pande, and F. Noé, An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation, Vol. 797 (Springer Science & Business Media, 2013).
94. W. Wang, S. Cao, L. Zhu, and X. Huang, "Constructing Markov state models to elucidate the functional conformational changes of complex biomolecules," WIREs Comput. Mol. Sci., e1343 (2018).
95. F. Noé and F. Nüske, "A variational approach to modeling slow processes in stochastic dynamical systems," Multiscale Model. Simul., 635–655 (2013).
96. R. T. McGibbon and V. S. Pande, "Variational cross-validation of slow dynamical modes in molecular kinetics," J. Chem. Phys., 124105 (2015).
97. M. K. Scherer, B. Trendelkamp-Schroer, F. Paul, G. Pérez-Hernández, M. Hoffmann, N. Plattner, C. Wehmeyer, J.-H. Prinz, and F. Noé, "PyEMMA 2: A software package for estimation, validation, and analysis of Markov models," J. Chem. Theory Comput., 5525–5542 (2015).
98. C. Wehmeyer, M. K. Scherer, T. Hempel, B. E. Husic, S. Olsson, and F. Noé, "Introduction to Markov state modeling with the PyEMMA software [v1.0]," LiveCoMS, 1–12 (2018).
99. M. P. Harrigan, M. M. Sultan, C. X. Hernández, B. E. Husic, P. Eastman, C. R. Schwantes, K. A. Beauchamp, R. T. McGibbon, and V. S. Pande, "MSMBuilder: statistical models for biomolecular dynamics," Biophys. J., 10–15 (2017).
100. J. Porter, M. I. Zimmerman, and G. R. Bowman, "Enspara: Modeling molecular ensembles with scalable data structures and parallel computing," J. Chem. Phys., 044108 (2019).
101. S. Van Der Walt, S. C. Colbert, and G. Varoquaux, "The NumPy array: a structure for efficient numerical computation," Comput. Sci. Eng., 22 (2011).
102. E. Jones, T. Oliphant, P. Peterson, et al., "SciPy: Open source scientific tools for Python" (2001–).
103. W. McKinney et al., "Data structures for statistical computing in Python," in Proceedings of the 9th Python in Science Conference, Vol. 445 (Austin, TX, 2010) pp. 51–56.
104. R. T. McGibbon, K. A. Beauchamp, M. P. Harrigan, C. Klein, J. M. Swails, C. X. Hernández, C. R. Schwantes, L.-P. Wang, T. J. Lane, and V. S. Pande, "MDTraj: a modern open library for the analysis of molecular dynamics trajectories," Biophys. J., 1528–1532 (2015).
105. F. Pedregosa et al., "Scikit-learn: Machine learning in Python," J. Mach. Learn. Res., 2825–2830 (2011).
106. T. Kluyver et al., "Jupyter notebooks – a publishing format for reproducible computational workflows," in Positioning and Power in Academic Publishing: Players, Agents and Agendas, edited by F. Loizides and B. Schmidt (IOS Press, 2016) pp. 87–90.
107. J. D. Hunter, "Matplotlib: A 2D graphics environment," Comput. Sci. Eng., 90–95 (2007).
108. P. Eastman et al., "OpenMM 7: Rapid development of high performance algorithms for molecular dynamics," PLoS Comput. Biol., e1005659 (2017).
109. M. Waskom et al., "mwaskom/seaborn: v0.8.1 (September 2017)" (2017).
110. W. Humphrey, A. Dalke, and K. Schulten, "VMD – Visual Molecular Dynamics," J. Mol. Graph., 33–38 (1996).
111. B. Trendelkamp-Schroer, H. Wu, F. Paul, and F. Noé, "Estimation and uncertainty of reversible Markov models," J. Chem. Phys., 174101 (2015).

Appendix A: Baseline simulation datasets
A single capped alanine simulation was performed in explicit solvent at a temperature of 300 K with the AMBER ff99SB-ILDN force field and the TIP3P water model. All bonds between hydrogens and heavy atoms were constrained. The dataset represents 1 µs of simulation time with coordinates and forces saved at picosecond intervals. This simulation took a few days on a GTX 1080 GPU.

The miniprotein chignolin was simulated on GPUGRID with ACEMD using an adaptive sampling scheme in which new trajectories were respawned from the least explored states. The explicit-solvent system contained 5820 atoms and was simulated at a temperature of 350 K with the CHARMM22* force field and the mTIP3P water model. The dataset comprises hundreds of microseconds of simulation time with coordinates and forces saved at picosecond intervals. The simulation (including server-to-client sending and retrieval) took a few weeks and employed up to thousands of GPUs.
A CGSchNet model was constructed for each of the systems reported in Sec. IV. In Table AI we enumerate all of the hyperparameters for the coarse-grained mapping, the SchNet feature, the remaining CGnet architecture (recall Fig. 1), and other training specifications.
TABLE AI. Hyperparameter choices for CGSchNet models trained on capped alanine and chignolin systems. Parameters listed in brackets were optimized using cross-validation; within each bracketed list, the value marked with an asterisk is the one determined through cross-validation, and each search started from five interaction blocks, a feature size of 128, and one terminal layer (see text).
Hyperparameter | Capped alanine | Chignolin (CGSchNet) | Chignolin (CGnet)
Sequence | ACE–ALA–NME | YYDPETGTWY | YYDPETGTWY
Number of coarse-grained beads | 6 | 10 | 10
CG atom types | 2 × N, Cα, Cβ, 2 × C | 10 × Cα | 10 × Cα
Embeddings | Nuclear | Per residue | –
Harmonic priors | Bonds and angles | Bonds and angles | Bonds and angles
Repulsion priors | – | Non-bonded pairs | Non-bonded pairs
SchNet feature size | [32, 64, 128*, 256] | [32, 64, 128*, 256] | –
Number of SchNet basis functions | 50 | 300 | –
SchNet basis function bounds | 0 to 5 Å | 0 to 30 Å | –
SchNet neighbor cutoff | ∞ | ∞ | –
SchNet normalization | None | Bead number | –
Number of SchNet interaction blocks | [1, 2*, 3, 5] | [1, 2*, 3, 5] | –
SchNet activation function | Tanh | Tanh | –
Standard scaling before terminal network | No | No | Yes
Number of terminal network layers | [1*, 2, 3, 4, 5] | [1*, 2, 3, 4, 5] | 5
Terminal network layer width | SchNet feature size | SchNet feature size | 250
Terminal network activation function | Tanh | Tanh | Tanh
Lipschitz regularization strength | – | – | 4.0
Batch size | 512 | 512 | 512
Number of epochs | 100 | 100 | 20
Initial learning rate | … | … | …
Learning rate scheduler | Reduce on plateau | None | Multi-step decay
Optimizer | ADAM | ADAM | ADAM
Training/test set split | 80%/20% | 80%/20% | 80%/20%
Cross-validation split strategy | Random | Random | Random
Number of cross-validation folds | 5 | 5 | 5

The hyperparameters optimized under cross-validation were the number of interaction blocks and the feature size in the internal SchNet, and the number of layers in the "terminal" CGnet network. The hyperparameters were varied in the order listed while holding the others constant; at each stage, a "final" hyperparameter was selected before the next one was varied. For each hyperparameter set, five models were trained according to the settings in Table AI, and the lowest test set loss of any (i.e., not necessarily the last) epoch was recorded. These losses are shown in Fig. A1.

The initial model explored for each system had five interaction blocks, a SchNet feature size of 128, and one layer in the terminal neural network. The first search was performed over interaction blocks; the left column of Fig. A1 shows that the average minimum loss was achieved with two interaction blocks for each system. The number of interaction blocks was then set to two, and the SchNet feature size was varied. A SchNet feature size of 128 was chosen for both systems, both due to the value of the mean minimum losses and due to the relatively low variance of the test set loss values across the five folds (observed for both systems; see the middle column of Fig. A1). Finally, the number of layers in the terminal network was explored for two interaction blocks and a SchNet feature size of 128. The right column of Fig. A1 shows that increasing the number of terminal layers did not noticeably decrease the test set loss values for either system, and in some cases increased the variance across the five cross-validation runs. Thus, for both systems we used a single layer for the terminal network component of the CGnet.
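The staged search can be summarized in a few lines. In the sketch below, train_model(params, fold) is a hypothetical helper that trains one model on one cross-validation fold and returns the per-epoch test losses; the selection here uses only the mean minimum loss, whereas the choices described above also weighed the variance across folds.

    import numpy as np

    # train_model(params, fold) is a hypothetical stand-in for the actual
    # training loop; it returns the list of per-epoch test losses for one fold.
    search_space = {
        "n_interaction_blocks": [1, 2, 3, 5],
        "schnet_feature_size": [32, 64, 128, 256],
        "n_terminal_layers": [1, 2, 3, 4, 5],
    }
    params = {"n_interaction_blocks": 5, "schnet_feature_size": 128,
              "n_terminal_layers": 1}                # starting values (see text)

    for name, values in search_space.items():        # one hyperparameter at a time
        mean_min_loss = {}
        for value in values:
            trial = {**params, name: value}
            # Lowest test loss of any epoch, averaged over the five CV folds
            mean_min_loss[value] = np.mean(
                [min(train_model(trial, fold)) for fold in range(5)])
        # Fix the selection before varying the next hyperparameter
        params[name] = min(mean_min_loss, key=mean_min_loss.get)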
FIG. A1. Hyperparameter searches for capped alanine (top) and chignolin (bottom) over interaction blocks (left), SchNet feature sizes (center), and terminal network layers (right). The unfilled circles represent the minimum loss value obtained for that cross-validation run on the test dataset. The filled circle represents the average across the five runs. The filled circle is colored if that hyperparameter value was chosen for the "final" model. The dashed line shows the value of the average loss from the previous parameter search for the selected value of the previously explored parameter.

FIG. A2. Loss curves for the final parameter sets of CGSchNet models for capped alanine (left) and chignolin (right). Pink curves represent loss values on the training dataset and blue curves represent loss values on the test dataset. Each light-colored curve signifies one of the five runs and the dark-colored curve signifies the mean at each epoch. The training losses are so similar that the five distinct lines are not discernible. We chose which model to use for analysis and production according to the minimum of the test loss curve, which is indicated by the blue star (epoch 100 for capped alanine, epoch 46 for chignolin).

Appendix C: CGSchNet simulation details
After training CGSchNet models and selecting hyperparameters as described in Appendix B, simulations in the coarse-grained space were performed for each system using Langevin dynamics (recall Sec. III A 7). These simulations represent an averaging of the five cross-validated folds of each final model: for each set of coarse-grained coordinates to be propagated forward by the time step, each of the five models predicts a set of forces, and those forces are averaged in their respective dimensions. It is this averaged force vector that the simulation integrator uses to propagate the coarse-grained coordinates in time. The parameters are listed in Table AII. The starting coordinates for the capped alanine CGSchNet simulation are shown on the right side of Fig. A3. Two trajectories that feature transitions among the major basins are visualized in Fig. A4; the same trajectories were used in Fig. 3.
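A minimal sketch of one simulation step with the model-averaged force follows; predict_forces is a hypothetical method name for a model's coordinates-to-forces map, masses are set to unity, and the overdamped (Brownian) update below stands in for, rather than reproduces, the integrator shipped with cgnet.

    import math
    import torch

    def averaged_force(models, coords):
        """Mean force over the five cross-validated models; predict_forces is
        a hypothetical stand-in for a model's coordinates-to-forces map."""
        return torch.stack([m.predict_forces(coords) for m in models]).mean(dim=0)

    def brownian_step(coords, models, dt=0.02, friction=490.0, kT=1.0):
        """One overdamped Langevin update with the averaged force. Masses are
        unity and kT is a placeholder for the thermal energy in the force
        field's units; cgnet's own integrator may differ in these details."""
        f = averaged_force(models, coords)
        noise = torch.randn_like(coords)
        return coords + (dt / friction) * f + math.sqrt(
            2.0 * kT * dt / friction) * noise

Iterating this update and saving every tenth configuration reproduces the 0.02 ps timestep and 0.2 ps output stride of Table AII.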
TABLE AII. Parameters used for Langevin dynamics simulations of coarse-grained systems using a trained CGSchNet model.
Parameter | Capped alanine | Chignolin (CGSchNet) | Chignolin (CGnet)
Number of models averaged | 5 | 5 | 5
Number of coarse-grained beads | 6 | 10 | 10
Temperature (K) | 300 | 350 | 350
Friction (ps⁻¹) | 490 | 490 | 490
Timestep (ps) | 0.02 | 0.02 | 0.02
Number of independent simulations | 100 | 100 | 100
Starting position selection algorithm | Regular spatial | k-centers | k-centers
Space for starting position selection | φ × ψ | TIC 1 × TIC 2 | TIC 1 × TIC 2
Number of timesteps per simulation | 1 × 10⁷ | 1 × 10⁷ | 1 × 10⁷
Output simulation stride (ps) | 0.2 | 0.2 | 0.2
Total dataset time (µs) | 20 | 20 | 20
Wall clock time to complete simulations (hours) | 24 | 14 | 13
Hardware | Tesla K80 | GeForce GTX 1080 | GeForce GTX 1080
FIG. A3. Starting positions in Ramachandran space (blue stars) for the baseline (left) and CGSchNet (right) simulations of capped alanine. The free energy surfaces are reproduced from Figs. 4a and 4b, respectively.

FIG. A4. Two individual CGSchNet simulation trajectories for capped alanine. Both trajectories sample the four major basins on the landscape. The free energy surfaces are both reproduced from Fig. 4b.
FIG. A5. Starting positions in TIC 1 × TIC 2 space (blue stars) for the baseline (left) and CGSchNet (right) simulations of chignolin. The free energy surfaces are reproduced from Fig. 9a.
FIG. A6. Three individual CGSchNet simulation trajectories for chignolin (trajectories 78, 86, and 97). Each visualized simulation accesses multiple major basins. The free energy surfaces are reproduced from Fig. 9a.

Appendix D: Brief introduction to MSMs
Briefly, MSMs leverage the idea that mathematically straightforward kinetics emerge from a decomposition of simulation data into a set of states among which the transitions at a certain "lag time" τ are Markovian (i.e., dependent on only the present state of the system and independent of previous state occupations). An MSM containing a set of discrete states is represented by a transition matrix T(τ) that is row-stochastic, irreducible, and aperiodic (the latter two together are often referred to as "ergodic" in this context; this means every state can be reached from any other state without requiring cycles).

An MSM is typically constructed under the assumption that the Markov process is reversible with respect to its stationary distribution; this means it obeys the detailed balance constraint,

    π(a) Pr(a → b) = π(b) Pr(b → a),   (A1)

where, e.g., π(a) is the stationary probability of being in state a and Pr(a → b) is the probability of transitioning from state a to state b after a predetermined lag time. The equality of the left-hand and right-hand sides of (A1) means the system is microscopically reversible, i.e., at thermodynamic equilibrium.

This entails that the eigendecomposition of the MSM,

    T(τ) ψ_i = ψ_i λ_i,   (A2)

has certain properties (according to the Perron–Frobenius theorem), namely: (i) the eigenvectors ψ_i and eigenvalues λ_i are real; (ii) the dominant (Perron) eigenvalue λ_1 = 1 is unique, and its corresponding eigenvector ψ_1 has only positive elements, which correspond to the stationary distribution π_i; and (iii) the remaining eigenvalues λ_i, i ≥ 2, have magnitudes less than 1, and their corresponding eigenvectors ψ_i, i ≥ 2, have all real elements, where negative and positive elements represent the fluxes out of and into MSM states, respectively. Using the lag time τ, the eigenvalues can be related to the timescales of the kinetic processes in the (equilibrium) system according to

    t_i ≡ −τ / log|λ_i|.   (A3)
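Properties (i)–(iii) and relation (A3) can be checked directly with a few lines of linear algebra; the sketch below assumes a valid row-stochastic transition matrix and recovers the stationary distribution as the Perron eigenvector of T(τ)ᵀ.

    import numpy as np

    def msm_spectral_summary(T, tau=1.0):
        """Stationary distribution and implied timescales of a row-stochastic
        transition matrix T(tau); cf. Eqs. (A2) and (A3)."""
        evals, evecs = np.linalg.eig(T.T)     # left eigenvectors of T as columns
        order = np.argsort(-np.abs(evals))
        evals, evecs = evals[order], evecs[:, order]
        pi = np.real(evecs[:, 0])
        pi /= pi.sum()                        # Perron eigenvector -> stationary
        timescales = -tau / np.log(np.abs(evals[1:]))  # t_i for i >= 2, Eq. (A3)
        return pi, np.real(timescales)

    # Toy three-state example (rows sum to one)
    T = np.array([[0.90, 0.08, 0.02],
                  [0.10, 0.85, 0.05],
                  [0.05, 0.10, 0.85]])
    pi, ts = msm_spectral_summary(T)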
In practice, to construct an MSM we often use the following series of steps starting from the "raw" Cartesian coordinates of the simulation to be analyzed:

1. Featurization. The Cartesian coordinates output from the MD simulation are converted into a more useful representation according to properties known to be preserved in the system and/or chemical intuition. For example, the dynamics of proteins in a system without an external gradient are expected to be invariant to translation and rotation, so featurization schemes preserving these invariances are often employed. In proteins, two common featurizations are the backbone dihedral angles (cf. Sec. IV A) and distances between the amino acids (cf. Sec. IV B).

2. Dimensionality reduction. Optionally, a basis set transformation can be performed on the features to convert the data into a space in which the Markovian assumption is better satisfied. In the MSM community, time-lagged independent component analysis (TICA) is frequently used (a sketch implementing the TICA equations follows this list). TICA requires the calculation of two matrices,

    C_τ ≡ (Xᵀ Y + Yᵀ X) / (2(T − τ)),   (A4a)
    C_0 ≡ (Xᵀ X + Yᵀ Y) / (2(T − τ)),   (A4b)

where X ∈ R^((T−τ)×n) is the time series data and Y ∈ R^((T−τ)×n) is the data shifted forward in time by an interval of τ, for T total simulation time. Since (A4a) and (A4b) are both symmetric, the following generalized eigenvalue problem can be written,

    C_τ φ_i = C_0 φ_i λ̃_i,   (A5)

where the eigenvectors φ_i enable the transformation of the input data into TICA space. The TICA timescales can also be computed according to (A3). The TICA coordinates can be interpreted as reaction coordinates, and TICA may be used as a model in its own right or as a step in the construction of another kinetic model such as an MSM. Note that this harsh symmetrization may introduce significant bias.

3. State decomposition. In this step, the MSM states are assigned. A clustering algorithm such as k-means is used to assign each data point in the MD simulation dataset to a discrete state. This means the simulation dataset is represented by a list of integers indicating the state to which each conformation belongs.

4. MSM estimation. Finally, the MSM is estimated from the cluster assignments. A Markovian lag time is designated, and transitions are counted and stored in a "counts matrix." This counts matrix is converted to the MSM transition matrix T(τ) using a maximum-likelihood estimation algorithm such that it adheres to the previously described requirements of detailed balance and ergodicity.

A great deal of research has gone into MSM theory and practice, which is summarized in books and recent reviews. A crucial advance that we do not go into here is the derivation of a variational principle for MSM construction that enables the optimization of the hyperparameters associated with the choices in items 1 through 3 when describing the system kinetics. Open source software packages facilitating MSM analyses include PyEMMA, MSMBuilder, and Enspara.
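As a companion to item 2 above, the following is a minimal NumPy/SciPy sketch of Eqs. (A4)–(A5) for mean-free time series data; production implementations (e.g., in PyEMMA) add regularization, proper statistical reweighting, and dimension selection beyond this.

    import numpy as np
    from scipy.linalg import eigh

    def tica(data, lag):
        """Solve C_tau phi_i = C_0 phi_i lambda_i for mean-free data (A4-A5)."""
        X, Y = data[:-lag], data[lag:]
        n = X.shape[0]                            # n = T - tau rows
        C_tau = (X.T @ Y + Y.T @ X) / (2.0 * n)   # symmetrized lagged covariance
        C_0 = (X.T @ X + Y.T @ Y) / (2.0 * n)     # instantaneous covariance
        evals, evecs = eigh(C_tau, C_0)           # generalized eigenproblem (A5)
        order = np.argsort(evals)[::-1]           # slowest processes first
        return evals[order], evecs[:, order]

    # Usage: subtract the mean first, then project onto the leading components.
    # evals, evecs = tica(data - data.mean(axis=0), lag=20)
    # tics = (data - data.mean(axis=0)) @ evecs[:, :4]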
Appendix E: MSM construction for chignolin

To facilitate the majority of the analysis in Sec. IV B, MSMs (see Appendix D) were constructed for both the baseline atomistic dataset and the coarse-grained simulation datasets obtained from CGSchNet and CGnet.

To build an MSM for the atomistic MD dataset of chignolin, we first featurize the Cartesian coordinates into vectors of contact distances among every pair of amino acid residues. This transformation produces 45 distances, where each distance is defined as the distance between the two α-carbons. TICA is performed on the set of distances with a lag time of 10 ns. The eigenvalue spectrum of the baseline TICA model is presented in Fig. A7. We retained the first four TICA coordinates and used k-means to cluster the data in this four-dimensional space with k = 150. From the state space of 150 clusters, an MSM is estimated with a lag time of 15 ns.
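This construction maps onto a short PyEMMA workflow. The snippet below is a sketch in which the lag times are expressed in frames (the ns-to-frame conversion depends on the trajectory saving interval, so the values here are illustrative) and random data stands in for the featurized trajectories.

    import numpy as np
    import pyemma

    # Placeholder featurized data: one trajectory of 45 alpha-carbon pair
    # distances; in practice this is the list of featurized MD trajectories.
    ca_dists = [np.random.rand(50_000, 45)]
    tica_lag, msm_lag = 20, 30            # lag times in frames (illustrative)

    tica = pyemma.coordinates.tica(ca_dists, lag=tica_lag, dim=4)
    tics = tica.get_output()              # the four slow reaction coordinates
    clustering = pyemma.coordinates.cluster_kmeans(tics, k=150, max_iter=100)
    msm = pyemma.msm.estimate_markov_model(clustering.dtrajs, lag=msm_lag)
    print(msm.timescales()[:4])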
FIG. A7. Eigenvalue plot for the TICA model built upon the baseline atomistic simulation data.
For the CGSchNet simulation, we create a "projected" MSM which utilizes the TICA and clustering models built for the baseline MSM as previously described. This entails that the TICA and clustering spaces will be the same for both datasets, so the MSMs can be compared. We perform the same contact featurization on the coarse-grained simulation data. Instead of calculating a new TICA model, we project the featurized CGSchNet simulation data onto the TICs defined by the baseline TICA model. Similarly, we then assign each coarse-grained data point in TIC space to one of the 150 clusters previously defined by the k-means model on the baseline simulation data. An independent MSM is fit from these cluster assignments. The locations of the cluster centers are shown in Fig. A8.
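The projection step can be sketched as follows, reusing the fitted baseline TICA and k-means objects from the previous sketch; the nearest-center assignment is written out with NumPy for transparency rather than through a particular library call.

    import numpy as np

    def project_and_assign(cg_features, tica_model, centers):
        """Project CG contact features onto the baseline TICs and assign each
        frame to the nearest of the 150 pre-existing cluster centers."""
        tics = tica_model.transform(cg_features)[:, :4]   # baseline TICA space
        d2 = ((tics[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return d2.argmin(axis=1)                          # discrete trajectory

    # Example with the fitted objects from the previous sketch:
    # dtraj_cg = project_and_assign(cg_dists, tica, clustering.clustercenters)
    # msm_cg = pyemma.msm.estimate_markov_model([dtraj_cg], lag=msm_lag)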
FIG. A8. Locations of the 150 k-means cluster centers in TIC 1 × TIC 2 space (white circles) for the baseline (left) and CGSchNet (right) simulations of chignolin. The cluster centers are identical in both plots. The free energy surfaces and starred MSM states are reproduced from Fig. 9a.
The lag time and k parameters were chosen based on the knowledge that chignolin has relatively simple kinetics, paired with the authors' familiarity with MSM construction for this specific system and their experience that the modeled kinetics of chignolin are relatively robust to MSM hyperparameter choices. In general, it is crucial to use the variational method to determine MSM hyperparameters. To validate our MSMs, we performed implied timescale and Chapman–Kolmogorov tests, which are shown on the following pages. The MSM lag time of 30 ns was chosen based on the implied timescale plots.
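Both validation tests are available in PyEMMA; a minimal sketch, reusing the discrete trajectories from the sketches above (the lag list mirrors the lag axes of Figs. A9–A14 and is illustrative):

    import pyemma

    # Implied timescales over a range of lag times, with Bayesian error bars
    its = pyemma.msm.its(clustering.dtrajs, lags=[1, 10, 20, 30, 40, 60, 80],
                         errors="bayes")
    pyemma.plots.plot_implied_timescales(its)

    # Chapman-Kolmogorov test on three metastable sets at the chosen lag
    msm = pyemma.msm.bayesian_markov_model(clustering.dtrajs, lag=30)
    pyemma.plots.plot_cktest(msm.cktest(3))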
Appendix F: Software

The cgnet software package is available at https://github.com/coarse-graining/cgnet under the BSD-3-Clause license. cgnet requires NumPy, SciPy, and PyTorch, and optional functionalities further depend on pandas, MDTraj, and Scikit-learn. The examples are provided in Jupyter notebooks, which also require Matplotlib. The SchNet part of the code is inspired by SchNetPack, and the Langevin dynamics simulation code is adapted from OpenMM. In addition to cgnet and the packages already mentioned, visualization was aided by Seaborn and VMD. Analysis was facilitated by PyEMMA.
FIG. A9. Implied timescales test for a reversible Bayesian MSM constructed from the baseline all-atom data for chignolin. The dashed line indicates the mean timescales, the solid line indicates the timescales of the maximum likelihood estimate, and the shaded region indicates a 95% confidence interval. A lag time of 30 frames was chosen.

FIG. A10. Chapman–Kolmogorov test (3 macrostates) for a reversible Bayesian MSM constructed from the baseline all-atom data for chignolin. The dashed blue line is the prediction and black is the estimate.
FIG. A11. Implied timescales test for a reversible Bayesian MSM constructed from the CGSchNet simulation data for chignolin. The dashed line indicates the mean timescales, the solid line indicates the timescales of the maximum likelihood estimate, and the shaded region indicates a 95% confidence interval. A lag time of 30 frames was chosen.

FIG. A12. Chapman–Kolmogorov test (3 macrostates) for a reversible Bayesian MSM constructed from the CGSchNet simulation data for chignolin. The dashed blue line is the prediction and black is the estimate.
FIG. A13. Implied timescales test for a reversible Bayesian MSM constructed from the CGnet simulation data for chignolin. The dashed line indicates the mean timescales, the solid line indicates the timescales of the maximum likelihood estimate, and the shaded region indicates a 95% confidence interval. A lag time of 30 frames was chosen.

FIG. A14. Chapman–Kolmogorov test (3 macrostates) for a reversible Bayesian MSM constructed from the CGnet simulation data for chignolin. The dashed blue line is the prediction and black is the estimate.