Active Importance Sampling for Variational Objectives Dominated by Rare Events: Consequences for Optimization and Generalization
Grant M. Rotskoff
Department of Chemistry, Stanford University, Stanford, CA 94305∗

Eric Vanden-Eijnden
Courant Institute, New York University, New York, NY 10012†

Deep neural networks, when optimized with sufficient data, provide accurate representations of high-dimensional functions; in contrast, function approximation techniques that have predominated in scientific computing do not scale well with dimensionality. As a result, many high-dimensional sampling and approximation problems once thought intractable are being revisited through the lens of machine learning. While the promise of unparalleled accuracy may suggest a renaissance for applications that require parameterizing representations of complex systems, in many applications gathering sufficient data to develop such a representation remains a significant challenge. Here we introduce an approach that combines rare event sampling techniques with neural network optimization to optimize objective functions that are dominated by rare events. We show that importance sampling reduces the asymptotic variance of the solution to a learning problem, suggesting benefits for generalization. We study our algorithm in the context of learning dynamical transition pathways between two states of a system, a problem with applications in statistical physics and implications in machine learning theory. Our numerical experiments demonstrate that we can successfully learn even with the compounding difficulties of high dimension and rare data.
I. RARE EVENTS AND IMPORTANCE SAMPLING
Deep neural networks (DNNs) have become an essential tool for a diverse set of problems in data science and, increasingly, the physical sciences. Underlying the impressive breadth of applications are the uncommonly robust approximation properties of DNNs, especially in high-dimensional settings where standard tools from numerical analysis break down. It has not gone without notice that many compelling questions in statistical physics require precise knowledge of high-dimensional functions [1] which can be challenging to represent and compute, suggesting that machine learning may have a transformative role to play.

Sampling is at the core of a large class of problems in both statistical mechanics and machine learning but is often rendered intractable by the need for precise knowledge about where and how to sample. Computing the expectation of a physical observable A : R^d → R requires taking an average with respect to a Gibbs distribution,

E_µ[A] = Z^{-1} ∫_{R^d} A(x) e^{-βV(x)} dx,  (1)

where β is interpreted as the inverse temperature and V is the potential energy of the system, which we assume is known analytically and can be evaluated efficiently, a common setting for problems in classical statistical mechanics. Here, Z = ∫_{R^d} e^{-βV(x)} dx is the normalization constant, also known as the partition function; it cannot be evaluated explicitly except in the most trivial cases. The seemingly basic task of computing expectations underlies many problems, for example, computing reaction rates and identifying physically important "transition" states in molecular systems. Nevertheless, estimating (1) remains perhaps the central challenge for modeling and studying complex chemical and biophysical systems.

Here we focus on the particular problem of computing the statistics of transitions between two metastable states of a high-dimensional probability distribution.
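To make the estimator for (1) concrete, the following minimal sketch (an illustrative one-dimensional double-well example of our own choosing, not a system from this paper) estimates a canonical expectation with a Metropolis random walk; note that the acceptance rule never requires the partition function Z:

```python
import numpy as np

rng = np.random.default_rng(0)

def V(x):
    # Illustrative double-well potential (hypothetical example, not from the paper)
    return (x**2 - 1.0)**2

def metropolis_expectation(A, beta=2.0, n_steps=100_000, step=0.5):
    """Estimate E_mu[A] = Z^{-1} int A(x) exp(-beta V(x)) dx by a Metropolis random walk."""
    x, total = -1.0, 0.0
    for _ in range(n_steps):
        proposal = x + step * rng.normal()
        # Accept with probability min(1, exp(-beta * (V(proposal) - V(x))))
        if rng.random() < np.exp(-beta * (V(proposal) - V(x))):
            x = proposal
        total += A(x)
    return total / n_steps

# Fraction of time spent in the right-hand well; the exact answer is 1/2 by symmetry,
# but a finite chain started in the left well is biased toward values below 1/2.
print(metropolis_expectation(lambda x: float(x > 0.0)))
```

When the barrier between the wells grows, the chain remains trapped on one side for exponentially long stretches, which is precisely the failure mode discussed below.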
∗ rotskoff@stanford.edu; code: https://github.com/rotskoff-group/learning-committor
† [email protected]

This problem is motivated by chemical reactions or other molecular transitions but has much broader relevance. In particular, with the resurgence of stochastic sampling schemes in the context of optimization [2–4], the approach we describe here can be used to study dynamical properties of sampling schemes in the nonconvex optimization setting.

The potential energy functions and free energy surfaces that arise in applications in machine learning, chemistry, and biophysics are typically highly nonconvex and may have hundreds of thousands of degrees of freedom. While it should be emphasized that, when sampling to compute expectations, the goal is not to identify the global minimum of the free energy, but rather to evaluate (1), such systems still present enormous computational challenges. The algorithms typically employed in Markov chain Monte Carlo sampling schemes rely on local updates to the configuration x, in turn leading to slow mixing between metastable states of the distribution. This leads to difficulty in estimating (1) when there is metastability, as the transition time between two states is exponential in the height of the free energy barrier between them [5, 6].

Many techniques have been proposed to overcome the inherent complexity of this task (cf. [6, 7]). Fundamentally, most of these techniques require the introduction of a "collective variables" map which reduces the high-dimensional sampling problem to a much lower dimensional setting. When the dimensionality is significantly reduced, computing expectations becomes tractable by using importance sampling techniques that require discretizing space. However, there is no a priori guarantee that the chosen collective variables map projects the original space without loss of important information.
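The exponential scaling of the transition time with barrier height can be observed directly in a small simulation (a hypothetical one-dimensional overdamped Langevin example of our own construction, not one of the systems studied in this paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_escape_steps(barrier, beta=1.0, dt=1e-3, reps=8, max_steps=1_000_000):
    """Average number of Euler-Maruyama steps for the overdamped Langevin dynamics
    dX = -V'(X) dt + sqrt(2/beta) dW, with V(x) = barrier * (x^2 - 1)^2,
    to escape from the left well (x = -1) into the right well (x > 0.9)."""
    times = []
    for _ in range(reps):
        x = -1.0
        for step in range(max_steps):
            grad = 4.0 * barrier * x * (x**2 - 1.0)
            x += -grad * dt + np.sqrt(2.0 * dt / beta) * rng.normal()
            if x > 0.9:
                times.append(step)
                break
    return float(np.mean(times))

low = mean_escape_steps(0.5)   # barrier height 0.5 in units of 1/beta
high = mean_escape_steps(5.0)  # barrier height 5.0: escapes are exponentially rarer
print(low, high)
```

Raising the dimensionless barrier βΔV increases the mean escape time by a factor of roughly e^{βΔV} (Kramers scaling), which is why direct simulation of reactive events quickly becomes prohibitive.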
The approach we take here avoids the complex task of designing an appropriate collective variables map by using a neural network to parameterize a map from the full configuration space.

Importance sampling and other variance reduction techniques have appeared in a variety of contexts in machine learning. Csiba and Richtarik [8] described and analyzed an algorithm that performs importance sampling of the training set to adaptively select minibatches and accelerate gradient descent. Their work formalizes an approach, represented in a large body of work [9–11], that aims to reduce the variance in the gradients when optimizing using stochastic gradient descent. In a separate line of inquiry, Fan et al. [12] use importance sampling to perform approximate Bayesian inference in continuous-time Bayesian networks. Our setting differs substantially from these works, as we are principally concerned with problems in which the data set is sampled on-the-fly from a Boltzmann distribution. Furthermore, we require importance sampling for the learning to be tractable at all, whereas the aforementioned works seek to accelerate optimization in otherwise tractable learning problems. Our theoretical results suggest that these previously studied approaches benefit generalization.

Identifying reaction pathways and sampling reactive trajectories is a central problem in statistical mechanics, with decades of work. Transition path sampling methods are perhaps the most closely related to our approach [13, 14]. Our applications are heavily influenced by the perspective of potential theory [15] and the notion of the committor function (discussed in detail below) [5, 16]. Khoo et al. [17] first considered the problem of learning committor functions from the perspective of solving high-dimensional PDEs but did not address the sampling issues that can arise in computing the objective. Our work most closely follows that of Li et al.
[18], who also examined the problem of optimizing a representation of the committor using neural networks on low-dimensional landscapes. Our work extends this approach in several important ways: first, our algorithm is an active approach; the importance sampling directly uses the committor function, meaning that there is feedback between the optimization and the data collection. In high-dimensional systems in which selecting a reaction coordinate presents a challenging design problem, our approach is crucial for effective sampling because we avoid explicitly constructing a reaction coordinate.

II. LOSS FUNCTIONS FOR REACTION PATH DYNAMICS DOMINATED BY RARE EVENTS
In many problems in chemical physics, we hope to calculate the probability that a physical system makes a spontaneous excursion from a configuration in phase space Â ⊂ R^d × R^d to a distinct state in B̂ ⊂ R^d × R^d. In general, it is not possible to directly compute these transition probabilities. Two confounding factors are at play: first, the state space may be very high dimensional in nontrivial cases; secondly, the computational resources required to observe a transition from Â to B̂ in dynamical simulations are prohibitive when the transitions are infrequent. One classic approach to this problem relies on the Markovian nature of the dynamics. It can be shown that, for a Markovian dynamics, the infinitesimal generator has a "spectral gap" between a set of low-lying real eigenvalues associated with transitions between metastable states [15, 20]. However, in complex systems, there still may be hundreds or thousands of metastable states, and solving the associated eigenvalue problem for a given transition is neither tractable in practice nor very informative [21].

In the case that Â and B̂ can be specified, an alternative is to quantify the probability that a trajectory will make a dynamical transition from the metastable state Â to another B̂ by computing the "committor function" q̂ : R^d × R^d → [0, 1], the probability that a trajectory initiated at (x, v) first reaches B̂ before Â:

q̂(x, v) := P_{(x,v)}(t_B̂ < t_Â).  (2)

We consider this problem in the following setting: we take a physical system with coordinates (x, v) ∈ R^d × R^d whose evolution is governed by the Langevin equation

ẋ = v,
v̇ = −∇V(x) − γv + √(2γβ^{-1}) η.  (3)

Here V : R^d → [0, ∞) is a potential energy function; β > 0, which controls the magnitude of the fluctuations, is typically interpreted as the inverse temperature in physical systems; γ is a friction coefficient; and η is a Gaussian white noise, the time derivative of a Wiener process. This dynamics is ubiquitously used to model molecular dynamics in the condensed phase but has also been proposed as a heuristic model for stochastic optimization methods like SGD [19] and sampling-based optimization schemes [2].

We can write a partial differential equation (PDE) for the probability q̂(x, v) that a trajectory under the dynamics (3) that starts at (x, v) reaches B̂ before Â. This is the following backward Kolmogorov equation [5]:

L q̂ = 0 for (x, v) ∉ Â ∪ B̂,
q̂(x, v) = 0 for (x, v) ∈ Â,
q̂(x, v) = 1 for (x, v) ∈ B̂,  (4)

where L is the infinitesimal generator of the process defined by (3):

L f = v · ∇_x f − ∇_x V · ∇_v f − γ v · ∇_v f + γβ^{-1} Δ_v f.  (5)

The probability q̂ : R^d × R^d → [0, 1] is known as the committor function because its level sets are surfaces of constant probability to reach B̂. In the context of chemical reaction dynamics, the backward Kolmogorov equation is a PDE in very high dimension, meaning that it is not possible to obtain a solution analytically or using traditional numerical methods based, e.g., on finite elements.

A. Defining a loss function for learning the committor
Our goal is to define a parametric representation of the committor function and an objective function that enables us to optimize the parameters. Neural networks offer flexibility in the representation and relative ease of optimization, making them a natural choice for a representation of the committor. While solving the BKE is intractable, the committor satisfies a straightforward variational principle that can be employed as an objective function. Using (4), the committor function satisfies

Ĉ[q̂] = Ẑ^{-1} ∫_{R^d × R^d} |L q̂(x, v)|² e^{-βH(x,v)} dx dv = 0,  (6)

where H(x, v) = ½|v|² + V(x) is the Hamiltonian and Ẑ = ∫_{R^d × R^d} e^{-βH(x,v)} dx dv is the partition function, here included to interpret Ĉ[q̂] as the canonical expectation of |L q̂(x, v)|². Hence, to find q̂ we can minimize Ĉ[q̂] over all functions satisfying the boundary conditions in (4),

inf_{q̂} Ĉ[q̂] with q̂ = 0 in Â, q̂ = 1 in B̂.  (7)

Integrating over the momenta, we reduce Ĉ[q̂] to an objective function for q(x):

inf_q C[q] subject to q = 0 in A, q = 1 in B,  (8)

where

C[q] = ∫_{R^d} |∇q(x)|² dμ(x) with dμ(x) = Z^{-1} e^{-βV(x)} dx,  (9)

and the reactant state A ⊂ R^d and the product state B ⊂ R^d are now sets defined in configuration space.

In the optimization procedure below, it is more tractable to penalize deviations from the boundary conditions rather than impose them as constraints. Consequently, we use Lagrange multipliers to ensure that the committor has the right values on the initial and target states. The objective function we use is thus

C_λ[q] = ∫_{R^d} |∇q(x)|² dμ(x) + λ ∫_A |q(x)|² dμ(x) + λ ∫_B |1 − q(x)|² dμ(x).  (10)

Using a neural network representation of the committor with parameter set {θ_i}_{i=1}^n, the problem becomes to minimize C_λ over this set. We describe an alternative formulation of the committor (cf. [22]) in Appendix C which can be solved with distinct boundary conditions. In Appendix D we discuss how to use collective variables maps.

III. IMPORTANCE SAMPLING IMPROVES GENERALIZATION ERROR
The loss function (10) requires data that may be exponentially rarer than samples from the metastable states A and B. To estimate it efficiently, straightforward Monte Carlo sampling does not suffice. In practice, the importance sampling procedure we outline below reduces the variance of the estimator by ensuring that regions of R^d with low probability are well-sampled. Importantly, this reduction in variance benefits the generalization error. We prove two results that demonstrate the utility of importance sampling. First, consider the empirical risk minimization (ERM) problem

θ_M = argmin_θ M^{-1} Σ_{m=1}^M ℓ(x_m, θ) ≡ argmin_θ L_M(θ), x_m ∼ μ,  (11)

which we would like to compare to the minimizer of the population risk

θ* = argmin_θ ∫_{R^d} ℓ(x, θ) dμ(x) ≡ argmin_θ L(θ).  (12)

Importance sampling reduces the variance of our estimate of the population risk L. This has implications for the expected value of the minimizer; specifically, variance reduction in the empirical loss reduces the asymptotic variance of the generalization error. Specifically, we prove

Proposition III.1. Denote by θ_M the ERM obtained from canonical sampling and θ̃_M the ERM obtained from importance sampling. If

∀θ: var(∇L̃_M) ≤ var(∇L_M)  (13)

and ∇∇L_M and ∇∇L̃_M are PSD (which can be guaranteed with regularization), then

E_μ̃ |θ̃_M − θ*|² ≤ E_μ |θ_M − θ*|².  (14)

The proof is given in Appendix A.

The assumptions we make above are not guaranteed to hold in the overparameterized regime, where local strong convexity near the minimizer may be difficult to ensure. With this in mind, we also prove that importance sampling improves an upper bound on the asymptotic variance of the loss at convergence in the overparameterized regime. Our result relies on the mean-field perspective for neural networks, in which we take the limit n → ∞, where n is the number of parameters [23–26]. In this setting, we write the function representation

q(x) = σ( ∫_D φ(x, z) dγ(z) ),  (15)

where γ is a signed Radon measure on the space of parameters for the nonlinearity φ and σ is a sigmoidal function. Using this representation, in Appendix A we prove

Figure 1. Simple illustrative experiment; the potential energy function is a 2D mixture of Gaussians and the committor is represented as a single hidden layer neural network. Left: decay of the variational loss ⟨|∇q|²⟩ as a function of training step for the Müller-Brown potential (26); angle brackets denote expectation with respect to the associated Gibbs measure. Right: the potential function is shown as a contour plot, with the isocommittor lines shown from white to black.

Proposition III.2. Let L[γ] denote the convex population risk as a functional of the signed Radon measure γ, where q is represented as (15); let γ* be the minimizer. L_P denotes the corresponding convex empirical risk functional with minimizer γ_P. Denote by D_γ F the functional derivative of a functional F with respect to γ. Then,

L[γ_P] ≤ sup_{z, z', t} ‖D_γ ΔL_P(z, z', γ* + t(γ_P − γ*))‖ |γ*|_TV,  (16)

where ΔL_P = L[γ] − L_P[γ] and |γ|_TV denotes the total variation norm of the measure. This bound is tightened by importance sampling because the variance term is diminished.

This result is perhaps best illustrated in the case where q(x) does not have the sigmoid function and we take the mean-squared error as the loss. In this case,

D_γ ΔL_P(z, z', γ) = ∫_{R^d} ∇φ(x, z) · ∇φ(x, z') dμ(x) − (1/P) Σ_{p=1}^P ∇φ(x_p, z) · ∇φ(x_p, z');  (17)

note that there is no dependence on γ. This expression emphasizes that lower variance in the empirical expectation over μ will improve the bound.

IV. ALGORITHMS: EMPIRICAL LOSS AND IMPORTANCE SAMPLING
In practice we need to estimate the canonical expectations ⟨·⟩_β in (10) using some data set: that is, we need to replace the population loss (10) by an empirical loss. The simplest way to proceed is to replace the canonical expectation of any function f(x) by its empirical average over a set of points {x_m}_{m=1}^M sampled from Z^{-1} e^{-βV(x)}, i.e., use

E_μ f ≈ M^{-1} Σ_{m=1}^M f(x_m), x_m ∼ μ.  (18)

This procedure gives an empirical loss that is an unbiased estimator of the population loss and converges to it almost surely as M → ∞. However, the empirical loss constructed this way is a poor estimator of the population loss in general, in the sense that its variance is large compared to the square of its mean. The reason is that the expectation ⟨|∇q|²⟩_β is dominated by configurations that are unlikely under Z^{-1} e^{-βV(x)}. This is evident if we consider a double-well potential, in which the barrier between the two minima of the potential is sampled exponentially rarely in the height of the barrier.

To bypass this difficulty we will estimate the loss via importance sampling. Given any set of nonnegative functions W_l(x) ≥ 0, l = 1, …, L, such that

∀x ∈ R^d: Σ_{l=1}^L W_l(x) = 1,  (19)

we can write

E_μ f = Σ_{l=1}^L ∫_{R^d} f(x) W_l(x) dμ(x) ≡ Σ_{l=1}^L ⟨f⟩_l w_l,  (20)

where we defined the expectation

⟨f⟩_l = Z_l^{-1} ∫_{R^d} f(x) W_l(x) dμ(x), where Z_l = ∫_{R^d} W_l(x) dμ(x),  (21)

as well as the weights

w_l = E_μ W_l.  (22)

In addition, by using f(x) = W_{l'}(x) in this expression, we deduce that the weights satisfy the eigenvalue problem [27]

w_{l'} = Σ_{l=1}^L w_l p_{ll'}, l' = 1, …, L, subject to Σ_{l=1}^L w_l = 1,  (23)

where we defined

p_{ll'} = ⟨W_{l'}⟩_l.  (24)

In practice, we can sample Z_l^{-1} W_l(x) dμ(x), e.g., by Metropolis-Hastings Monte Carlo on the potential U(x) = V(x) − β^{-1} log W_l(x), and compute expectations in this ensemble as

⟨f⟩_l ≈ M^{-1} Σ_{m=1}^M f(x_{m,l}), x_{m,l} ∼ Z_l^{-1} W_l(x) dμ(x).  (25)

This allows us to estimate ⟨f⟩_l in (23) as well as p_{ll'} in (24): knowledge of the latter quantity enables us to solve the eigenvalue problem in (23) to find the weights w_l, and finally estimate E_μ f via (20). We discuss specific choices of the windowing function in App. B 3.

Putting everything together, in Algorithm IV.1, we describe the most straightforward implementation of our approach. Algorithm IV.1 is sequential; a version in which we evolve x_{m,l} and θ concurrently would allow for significant wallclock speed-ups.

Algorithm IV.1 (Active Importance Sampled Variational Stochastic Gradient Descent).
Input: energy function V : R^d → R, initial θ
while ‖∇_θ C[q]‖ > ε_tol do
  for l = 1, …, L do
    for m = 1, …, M do
      Sample x_{m,l} ∼ Z_l^{-1} e^{-βV(x)} W_l(x)
    Compute p_{l,l'} = M^{-1} Σ_{m=1}^M W_{l'}(x_{m,l}) for l' = 1, …, L
    Compute ∇_{θ_i} C_λ^l[q] = M^{-1} Σ_{m=1}^M ( ∇_{θ_i} ∇_x q_{m,l} · ∇_x q_{m,l} + λ q_{m,l} ∇_{θ_i} q_{m,l} 1_A(x_{m,l}) − λ (1 − q_{m,l}) ∇_{θ_i} q_{m,l} 1_B(x_{m,l}) ), where we denote q_{m,l} = q(x_{m,l})
  Solve (23) for w_l, l = 1, …, L
  Compute ∇_θ C[q] = Σ_{l=1}^L w_l ∇_θ C_λ^l[q]
  θ ← θ − Δt ∇_θ C[q]
Return θ

V. NUMERICAL EXPERIMENTS
As a proof of concept, we optimize the committor function on the well-studied Müller-Brown potential [28]. We consider the dynamics (3) for a 2D system evolving in a Gaussian mixture potential

V_MB(x) = Σ_{k=1}^4 A_k exp( −(x − μ_k)^T Σ_k^{-1} (x − μ_k) )  (26)

with

A = (−200, −100, −170, 15),
μ_1 = (1, 0), μ_2 = (0, 0.5), μ_3 = (−0.5, 1.5), μ_4 = (−1, 1),
Σ_1^{-1} = Σ_2^{-1} = (1, 0; 0, 10), Σ_3^{-1} = (6.5, −5.5; −5.5, 6.5), Σ_4^{-1} = (0.7, 0.3; 0.3, 0.7).  (27)

Our results, shown in Fig. 1, demonstrate that importance sampling is sufficient to decrease the loss of the objective, which we compute along the transition path samples gathered during the optimization.

Unlike standard approaches to computing the committor (e.g., finite elements), the algorithm outlined here also succeeds when the input space is high-dimensional. As a non-trivial test of robustness, we consider the energy functional

E[ρ] = ∫_{[0,1]²} ( (D/2) |∇ρ(z)|² + ¼ (1 − |ρ(z)|²)² ) dz  (28)

and the associated stochastic PDE

∂_t ρ = D Δρ + ρ − ρ³ + spatio-temporal white noise.  (29)

If we impose the Dirichlet boundary conditions

ρ = +1 for z_1 = 0, 1, ρ = −1 for z_2 = 0, 1,  (30)

and take both D and the noise amplitude small enough, the system displays metastability. That is, the equation

D Δρ + ρ − ρ³ = 0  (31)

has two solutions that are stable under the deterministic dynamics: these solutions are either mostly ρ = 1 in the domain, with a boundary layer of size √D near z_2 = 0, 1, or mostly ρ = −1, with a boundary layer of size √D near z_1 = 0, 1. These two solutions are depicted in Fig. 2.

Figure 2. Top left and right: the two metastable solutions of (33) with Dirichlet boundary conditions (ρ = 1 at the left and right boundaries, ρ = −1 at the top and bottom). Also shown: decay of the loss ⟨|∇q|²⟩ as a function of training step.

If we discretize the problem on a lattice with spacing h = 1/N and introduce

ρ_{i,j} = ρ(ih, jh), i, j = 0, …, N,  (32)

we arrive at the SDE

dρ_{i,j} = ( ρ_{i,j} − ρ_{i,j}³ + D (Δ_N ρ)_{i,j} ) dt + √(2 β^{-1} h^{-2}) dη_{i,j}.  (33)

Here η_{i,j} is a set of independent Wiener processes, Δ_N is the discrete Laplacian,

(Δ_N ρ)_{i,j} = h^{-2} ( ρ_{i+1,j} + ρ_{i−1,j} + ρ_{i,j+1} + ρ_{i,j−1} − 4 ρ_{i,j} ),  (34)

and the boundary conditions read

ρ_{i,j} = 1 for i = 0, N, j = 1, …, N−1,
ρ_{i,j} = −1 for j = 0, N, i = 1, …, N−1.  (35)

We can also set ρ_{0,0} = ρ_{0,N} = ρ_{N,0} = ρ_{N,N} = 0. This model is in the class of systems for which we would like to solve for the committor. Here the energy is the discrete equivalent of (28):

V(ρ) = h² Σ_{i,j=1}^N ( (D/2) |(∇_N ρ)_{i,j}|² + ¼ (1 − ρ_{i,j}²)² ),  (36)

where ∇_N is the discrete gradient, so that

|(∇_N ρ)_{i,j}|² = h^{-2} (ρ_{i+1,j} − ρ_{i,j})² + h^{-2} (ρ_{i,j+1} − ρ_{i,j})².  (37)

The discrete Laplacian has the effect of aligning neighboring lattice sites. The example we consider has a 256-dimensional state space. As shown in Fig. 2, the learned committor captures the characteristic pathway for a transition between the two metastable states, with the expected hourglass shape at the transition state [29] that can also be identified by the minimum action method in this specific example [30, 31]. It should be noted that the initial increase in the loss function arises due to an initial representation of the transition path that is not consistent with the dynamics of the model and that, once representative configurations are sampled, the estimate of the loss improves.

VI. FUTURE WORK
The approach we propose here enables optimization in contexts in which the loss function is dominated by data that is exceedingly rare with respect to the equilibrium measure. While we have both theoretical and numerical evidence that this approach is effective for high-dimensional problems and improves generalization, further evidence from physics applications would bolster our current findings. In particular, we must test our approach on more complicated systems, like those typically arising in biophysics. In such systems, there may be multiple pathways connecting two metastable states, a complication that we did not investigate thoroughly here.

While our algorithm and code can easily accept any neural network architecture, we used very simple neural networks for the examples in this paper. Finding architectures that are well-adapted to a given physical system remains an important challenge. Additionally, there are natural improvements to the implementation of our algorithm: adaptive windowing, more sophisticated reweighting schemes, and exploiting the "embarrassingly parallel" structure of the computation to obtain computational speed-ups.

We focused on the computation of committor functions, but the approach we outline is much more general. Applications of importance sampling to reinforcement learning and optimization of generative models are among the most compelling future directions.

[1] G. Carleo, I. Cirac, K. Cranmer, and L. Daudet, Machine learning and the physical sciences, Rev. Mod. Phys., 39 (2019).
[2] Y.-A. Ma, Y. Chen, C. Jin, N. Flammarion, and M. I. Jordan, Sampling can be faster than optimization, Proceedings of the National Academy of Sciences, 20881 (2019).
[3] W. Mou, N. Flammarion, M. J. Wainwright, and P. L. Bartlett, An efficient sampling algorithm for non-smooth composite potentials, arXiv:1910.00551 [cs, stat] (2019).
[4] W. Mou, Y.-A. Ma, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan, High-order Langevin diffusion yields an accelerated MCMC algorithm, arXiv:1908.10859 [cs, math, stat] (2020).
[5] W. E and E. Vanden-Eijnden, Transition-path theory and path-finding algorithms for the study of rare events, Annual Review of Physical Chemistry, 391 (2010).
[6] D. Chandler, Introduction to Modern Statistical Mechanics (Oxford University Press, USA, 1987).
[7] A. Barducci, M. Bonomi, and M. Parrinello, Metadynamics, WIREs Computational Molecular Science, 826 (2011).
[8] D. Csiba and P. Richtarik, Importance sampling for minibatches, Journal of Machine Learning Research, 21 (2018).
[9] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM Journal on Optimization, 341 (2012).
[10] N. L. Roux, M. Schmidt, and F. R. Bach, A stochastic gradient method with an exponential convergence rate for finite training sets, in Advances in Neural Information Processing Systems 25, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Curran Associates, Inc., 2012) pp. 2663–2671.
[11] R. Johnson and T. Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in Advances in Neural Information Processing Systems 26, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Curran Associates, Inc., 2013) pp. 315–323.
[12] Y. Fan, J. Xu, and C. R. Shelton, Importance sampling for continuous time Bayesian networks, Journal of Machine Learning Research, 2115 (2010).
[13] P. G. Bolhuis, D. Chandler, C. Dellago, and P. L. Geissler, Transition path sampling: Throwing ropes over rough mountain passes, in the dark, Annu. Rev. Phys. Chem., 291 (2002).
[14] L. Maragliano, A. Fischer, E. Vanden-Eijnden, and G. Ciccotti, String method in collective variables: Minimum free energy paths and isocommittor surfaces, J. Chem. Phys., 024106 (2006).
[15] A. Bovier, M. Eckhoff, V. Gayrard, and M. Klein, Metastability and low lying spectra in reversible Markov chains, Communications in Mathematical Physics, 219 (2002).
[16] W. E and E. Vanden-Eijnden, Towards a theory of transition paths, Journal of Statistical Physics, 503 (2006).
[17] Y. Khoo, J. Lu, and L. Ying, Solving for high dimensional committor functions using artificial neural networks, arXiv:1802.10275 (2018).
[18] Q. Li, B. Lin, and W. Ren, Computing committor functions for the study of rare events using deep learning, The Journal of Chemical Physics, 054112 (2019), arXiv:1906.06285.
[19] S. Yaida, Fluctuation-dissipation relations for stochastic gradient descent, arXiv:1810.00004 [cs, stat] (2018).
[20] B. Gaveau and L. S. Schulman, Theory of nonequilibrium first-order phase transitions for stochastic dynamics, Journal of Mathematical Physics, 1517 (1998).
[21] M. Cameron and E. Vanden-Eijnden, Flows in complex networks: Theory, algorithms, and application to Lennard-Jones cluster rearrangement, Journal of Statistical Physics, 427 (2014).
[22] J. Lu and E. Vanden-Eijnden, Exact dynamical coarse-graining without time-scale separation, The Journal of Chemical Physics, 044109 (2014).
[23] L. Chizat and F. Bach, On the global convergence of gradient descent for over-parameterized models using optimal transport, arXiv:1805.09545 [cs, math, stat] (2018).
[24] S. Mei, A. Montanari, and P.-M. Nguyen, A mean field view of the landscape of two-layer neural networks, Proceedings of the National Academy of Sciences, E7665 (2018).
[25] G. M. Rotskoff and E. Vanden-Eijnden, Trainability and accuracy of neural networks: An interacting particle system approach, arXiv:1805.00915 [cond-mat, stat] (2018).
[26] J. Sirignano and K. Spiliopoulos, Mean field analysis of neural networks, arXiv:1805.01053 (2018).
[27] E. H. Thiede, B. Van Koten, J. Weare, and A. R. Dinner, Eigenvector method for umbrella sampling enables error analysis, The Journal of Chemical Physics, 084115 (2016).
[28] K. Müller and L. D. Brown, Location of saddle points and minimum energy paths by a constrained simplex optimization procedure, Theoretica Chimica Acta, 75 (1979).
[29] R. V. Kohn, F. Otto, M. G. Reznikoff, and E. Vanden-Eijnden, Action minimization and sharp-interface limits for the stochastic Allen-Cahn equation, Communications on Pure and Applied Mathematics, 393 (2007).
[30] W. E, W. Ren, and E. Vanden-Eijnden, Minimum action method for the study of rare events, Communications on Pure and Applied Mathematics, 637 (2004).
[31] M. Heymann and E. Vanden-Eijnden, The geometric minimum action method: A least action principle on the space of curves, Comm. Pure Appl. Math., 1052 (2008).

Appendix A: Variance reduction improves generalization

1. Proof of Prop 3.1
Consider the empirical risk minimization (ERM) problem for data sampled iid from some measure { x m } Mm =1 ∼ µ .In this problem, we seek to minimize the empirical risk (or loss) L M ( θ ) = 1 M M (cid:88) m =1 (cid:96) ( x m , θ ) (A.1)where (cid:96) is a convex function that measures the discrepancy between the target function and the neural networkrepresentation at x . For example, (cid:96) ( x , θ ) = | f ∗ ( x ) − f ( x , θ ) | . (A.2)The generalization error is measured by the “population loss” L ( θ ) = (cid:90) Ω (cid:96) ( x , θ ) dµ ( x ) . (A.3)In the algorithm we propose for optimizing committor functions, the importance sampling reduces the variance ofour estimator of the empirical loss. We show that this reduction in variance for the estimator of L ( θ ) leads to bettergeneralization. Let us denote by ˜ µ the measure that we use to perform importance sampling. Defining˜ L M ( θ ) = 1 M M (cid:88) m =1 (cid:96) (˜ x m , θ ) ˜ x m ∼ dµd ˜ µ ˜ µ (A.4)we have E ˜ L M ( θ ) = L ( θ ) and we can assume that ∀ θ , var ˜ µ ( ˜ L M ( θ )) ≤ var µ ( L M ( θ )) . (A.5)Denoting by θ M , ˜ θ M , and θ ∗ minimizers of L M , ˜ L M , and L , respectively, a simple asymptotic argument demonstratesthat E ˜ µ | ˜ θ M − θ ∗ | ≤ E µ | θ M − θ ∗ | , (A.6)meaning that importance sampling reduces the asymptotic variance of the ERM. This argument requires severalassumptions so we state it as a proposition Proposition A.1.
Assume that $L$, $L_M$, and $\tilde L_M$ are strictly convex (which can be guaranteed with regularization) and that there exist $0 < \sigma_- \le \sigma_+ < \infty$ such that
$$ \sigma_- \operatorname{Id} \preceq \nabla\nabla L(\theta^*) \preceq \sigma_+ \operatorname{Id}. \tag{A.7} $$
If the importance sampling guarantees that
$$ \forall \theta: \quad \sigma_+^2 \operatorname{var}_{\tilde\mu}(\nabla \tilde L_M(\theta)) \le \sigma_-^2 \operatorname{var}_{\mu}(\nabla L_M(\theta)), \tag{A.8} $$
then
$$ \mathbb{E}_{\tilde\mu} |\tilde\theta_M - \theta^*|^2 \le \mathbb{E}_{\mu} |\theta_M - \theta^*|^2. \tag{A.9} $$
We establish this result by asymptotic expansion around the population loss minimizer $\theta^*$, which is unique by the assumption of strict convexity of $L$. Assuming that for $M$ large enough we can approximate
$$ L_M(\theta) = L_M(\theta^*) + \nabla L_M(\theta^*) \cdot (\theta - \theta^*) + \tfrac{1}{2} (\theta - \theta^*)^T \nabla\nabla L_M(\theta^*) (\theta - \theta^*), \tag{A.10} $$
we see that the minimizer of $L_M$ can be written
$$ \theta_M = \theta^* - (\nabla\nabla L_M(\theta^*))^{-1} \nabla L_M(\theta^*), \tag{A.11} $$
where the invertibility of the Hessian is ensured by the strict convexity assumption. Therefore
$$ \mathbb{E}_{\mu} |\theta_M - \theta^*|^2 = \mathbb{E}_{\mu} |(\nabla\nabla L_M(\theta^*))^{-1} \nabla L_M(\theta^*)|^2, \tag{A.12} $$
which, for $M \gg$
1, we can approximate as
$$ \mathbb{E}_{\mu} |\theta_M - \theta^*|^2 \approx \mathbb{E}_{\mu} |(\nabla\nabla L(\theta^*))^{-1} \nabla L_M(\theta^*)|^2 \ge \sigma_+^{-2}\, \mathbb{E}_{\mu} |\nabla L_M(\theta^*)|^2, \tag{A.13} $$
where we used (A.7). Using the same calculation for the variance $\mathbb{E}_{\tilde\mu} |\tilde\theta_M - \theta^*|^2$, we obtain
$$ \mathbb{E}_{\tilde\mu} |\tilde\theta_M - \theta^*|^2 \le \sigma_-^{-2}\, \mathbb{E}_{\tilde\mu} |\nabla \tilde L_M(\theta^*)|^2, \tag{A.14} $$
and combining the last two equations gives
$$ \mathbb{E}_{\tilde\mu} |\tilde\theta_M - \theta^*|^2 \le \mathbb{E}_{\mu} |\theta_M - \theta^*|^2, \tag{A.15} $$
since (A.8) implies
$$ \sigma_+^2\, \mathbb{E}_{\tilde\mu} |\nabla \tilde L_M(\theta^*)|^2 \le \sigma_-^2\, \mathbb{E}_{\mu} |\nabla L_M(\theta^*)|^2. \tag{A.16} $$
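The mechanism behind Prop. A.1 can be illustrated numerically. The sketch below is a toy stand-in, not the paper's setup: the base measure (a standard Gaussian), the rare-event loss (supported on $x > 3$), and the shifted proposal are all our illustrative choices. It compares the vanilla estimator (A.1) with the reweighted estimator (A.4):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: mu = N(0, 1); the loss is supported on the rare region x > 3,
# mimicking an objective dominated by rare events.
def loss(x, theta):
    return (x > 3.0) * (theta - x) ** 2

theta = 0.5
M = 2000        # samples per loss estimate
n_rep = 500     # independent replicas, to compare estimator variances

# Vanilla Monte Carlo estimator of L(theta): x_m ~ mu, as in (A.1).
est_mc = np.array([
    loss(rng.standard_normal(M), theta).mean() for _ in range(n_rep)
])

# Importance sampling: tilt the proposal toward the rare region,
# x_m ~ N(3.5, 1), and reweight by d(mu)/d(proposal), as in (A.4).
def log_ratio(x):
    # log of the standard-normal density minus the shifted-normal density
    return -0.5 * x ** 2 + 0.5 * (x - 3.5) ** 2

est_is = []
for _ in range(n_rep):
    x = rng.standard_normal(M) + 3.5
    w = np.exp(log_ratio(x))
    est_is.append(np.mean(w * loss(x, theta)))
est_is = np.array(est_is)

print(est_mc.mean(), est_is.mean())   # both estimators are unbiased
print(est_mc.var(), est_is.var())     # the reweighted estimator's variance is much smaller
```

Both replicas target the same population loss, but the tilted proposal places most samples where the integrand lives, which is precisely the variance reduction assumed in (A.5).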
2. Proof of Prop. 3.2
To state the proposition, we rely on the mean-field picture of wide neural networks, following [23, 25]. In this formulation, we express the neural network as
$$ q(x) = \int_D \varphi(x, z)\, d\gamma(z), \tag{A.17} $$
where $\varphi$ is a nonlinear function (the unit) and $\gamma$ is a signed Radon measure defined on the parameter space $D$. When using the boundary conditions to optimize the committor function described in the main text, we use a thresholding function to ensure that the range of $q$ is in $[0,1]$.
We let $L$ be the convex population risk functional with minimizer $\gamma^*$ and minimum value zero, $L[\gamma^*] = 0$. Similarly, we let $L_M$ be the convex empirical risk functional with minimizer $\gamma_M$ and minimum value zero, $L_M[\gamma_M] = 0$. Denoting by $D_\gamma$ the functional derivative, the conditions of optimality require that
$$ D_\gamma L_M[z, \gamma_M] \ge 0, \qquad D_\gamma L[z, \gamma^*] \ge 0, \tag{A.18} $$
for all $z \in \operatorname{supp} \gamma_M$ for the first relation, and for all $z \in \operatorname{supp} \gamma^*$ for the second. Note that $\gamma^*$ is also a minimizer of the empirical loss, $L_M[\gamma^*] = 0$, since $L_M \ge 0$ and $\mathbb{E}_\mu L_M[\gamma^*] = L[\gamma^*] = 0$. Therefore
$$ \forall z \in \operatorname{supp} \gamma_M: \ D_\gamma L_M[z, \gamma_M] = 0, \qquad \forall z \in \operatorname{supp} \gamma^*: \ D_\gamma L_M[z, \gamma^*] = D_\gamma L[z, \gamma^*] = 0. \tag{A.19} $$
The proof proceeds in two steps. First, by convexity,
$$ L[\gamma_M] \le \int_D D_\gamma L[z, \gamma_M]\, (d\gamma_M(z) - d\gamma^*(z)), \tag{A.20} $$
which, after taking the supremum over $z$, leads to the upper bound
$$ L[\gamma_M] \le 2 \sup_z |D_\gamma L[z, \gamma_M]|\, |\gamma^*|_{TV}, \tag{A.21} $$
where we have assumed that $|\gamma_M|_{TV} \le |\gamma^*|_{TV}$, which can be ensured with an appropriate regularization.
Next, using (A.19), we observe
$$ D_\gamma L[z, \gamma_M] = D_\gamma L[z, \gamma_M] - D_\gamma L_M[z, \gamma_M] \equiv D_\gamma \Delta L_M[z, \gamma_M]. $$
(A.22)
As a result,
$$ \begin{aligned} D_\gamma \Delta L_M[z, \gamma_M] &= D_\gamma \Delta L_M[z, \gamma^* + (\gamma_M - \gamma^*)] \\ &= D_\gamma \Delta L_M[z, \gamma^*] + \int_0^1 dt \int_D D^2_\gamma \Delta L_M[z, z', \gamma^* + t(\gamma_M - \gamma^*)]\, (d\gamma_M - d\gamma^*)(z') \\ &= \int_D D^2_\gamma \Delta L_M[z, z', \gamma^* + t_m(\gamma_M - \gamma^*)]\, (d\gamma_M - d\gamma^*)(z'), \end{aligned} \tag{A.23} $$
where we used (A.19) and we fixed $t_m \in [0,$
1] by the mean value theorem to get the third equality. This implies
$$ |D_\gamma L[z, \gamma_M]| \le 2 \sup_{z',\, t \in [0,1]} \big\| D^2_\gamma \Delta L_M[z, z', \gamma^* + t(\gamma_M - \gamma^*)] \big\|\, |\gamma^*|_{TV}. \tag{A.24} $$
Combining with (A.21), we have
$$ L[\gamma_M] \le 4 \sup_{z, z',\, t \in [0,1]} \big\| D^2_\gamma \Delta L_M[z, z', \gamma^* + t(\gamma_M - \gamma^*)] \big\|\, |\gamma^*|^2_{TV}. \tag{A.25} $$
The term involving $\|D^2_\gamma \Delta L_M\|$ is suppressed with importance sampling, which completes the argument.

Appendix B: Approximation of the committor with a neural network

1. Representation
In our experiments we use single-hidden-layer ReLU networks: the simplicity of the networks is meant to demonstrate the generic capabilities of the methodology, as we are not relying on sophisticated architectures. We define a parametric representation of the committor function and an objective function that enables us to optimize the parameters. Neural networks (NNs) offer a flexible representation and relative ease of optimization, making them a natural choice for representing the committor. To this end, we proceed in two distinct steps.
First, we work under the assumption that $\hat q(x, v)$ can be well approximated by a function of the position alone, i.e., we set
$$ \hat q(x, v) \approx q(x). \tag{B.1} $$
Second, we represent $q(x)$ by a neural network taking the input $x$ through some predefined map $\phi: \mathbb{R}^d \to \mathbb{R}^k$ with $k \le d$. For example, we can use
$$ q(x) = \sigma\left[ \frac{1}{n} \sum_{i=1}^n \varphi(\phi(x), \theta_i) \right], \tag{B.2} $$
which represents a single-hidden-layer neural network with $\varphi$ a nonlinear function (e.g., ReLU) passed through a thresholding function $\sigma$ (e.g., a sigmoid function) to ensure that $q(x) \in [0,1]$ for all $x \in \mathbb{R}^d$. In practice, the architecture of the neural network may be substantially more intricate than the single-hidden-layer network (B.2).
In the optimization procedure below, it is more tractable to penalize deviations from the boundary conditions rather than impose them as constraints. Consequently, we use strong Lagrange multipliers to ensure that the committor has the right values on the initial and target states.
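A minimal sketch of the representation (B.2): here $\phi$ is taken to be the identity map, the unit is the illustrative choice $\varphi(x, \theta_i) = c_i\, \mathrm{ReLU}(w_i \cdot x + b_i)$ with random (untrained) parameters, and the sigmoid thresholding guarantees a valid committor value.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# One-hidden-layer committor model in the spirit of (B.2):
#   q(x) = sigmoid( (1/n) * sum_i c_i * relu(w_i . x + b_i) )
# phi is the identity map; theta_i = (w_i, b_i, c_i) are placeholder
# parameters drawn at random (an untrained network, for illustration).
rng = np.random.default_rng(0)
d, n = 2, 64
W = rng.standard_normal((n, d))
b = rng.standard_normal(n)
c = rng.standard_normal(n)

def q(x):
    hidden = relu(W @ x + b)             # unit outputs, one per theta_i
    return sigmoid(np.mean(c * hidden))  # thresholding keeps q in (0, 1)

x = rng.standard_normal(d)
print(q(x))   # a value strictly between 0 and 1
```

Whatever the (finite) pre-activation value, the sigmoid maps it into $(0,1)$, which is the only structural property the representation needs before training.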
The objective function we use is thus
$$ C_\lambda[q] = Z^{-1} \int_{\mathbb{R}^d} |\nabla q(x)|^2 e^{-\beta V(x)}\, dx + \lambda Z^{-1} \int_A |q(x)|^2 e^{-\beta V(x)}\, dx + \lambda Z^{-1} \int_B \big(1 - q(x)\big)^2 e^{-\beta V(x)}\, dx \equiv \big\langle |\nabla q|^2 \big\rangle_\beta + \lambda \big\langle |q|^2 \mathbf{1}_A \big\rangle_\beta + \lambda \big\langle |1 - q|^2 \mathbf{1}_B \big\rangle_\beta, \tag{B.3} $$
where $\langle\cdot\rangle_\beta$ denotes canonical expectation with respect to $Z^{-1} e^{-\beta V(x)}$, and $\mathbf{1}_A$ and $\mathbf{1}_B$ are the indicator functions of $A$ and $B$, respectively. Using the parametric representation of the committor in (B.2), the problem becomes to find a set of parameters $\{\theta_i\}_{i=1}^n$ that minimize $C_\lambda$.
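The three averages in (B.3) can be estimated by Monte Carlo over Gibbs samples. The sketch below uses a hypothetical 1D double-well potential, a fixed sigmoid trial committor, and a crude Metropolis sampler; all of these are illustrative stand-ins chosen here, not the paper's numerical setup.

```python
import numpy as np

rng = np.random.default_rng(1)
beta, lam = 1.0, 10.0

# Hypothetical 1d double well with metastable sets A = {x < -1}, B = {x > 1}.
V = lambda x: (x ** 2 - 1.0) ** 2
q = lambda x: 1.0 / (1.0 + np.exp(-4.0 * x))   # smooth trial committor
dq = lambda x: 4.0 * q(x) * (1.0 - q(x))       # its spatial derivative

# Draw approximate Gibbs samples with a simple Metropolis random walk.
x, samples = 0.0, []
for step in range(20000):
    y = x + 0.5 * rng.standard_normal()
    if rng.random() < np.exp(-beta * (V(y) - V(x))):
        x = y
    samples.append(x)
samples = np.asarray(samples[2000:])           # discard burn-in

# Monte Carlo estimate of the penalized objective (B.3):
#   <|dq|^2>_beta + lam <q^2 1_A>_beta + lam <(1-q)^2 1_B>_beta
in_A, in_B = samples < -1.0, samples > 1.0
C = (np.mean(dq(samples) ** 2)
     + lam * np.mean(q(samples) ** 2 * in_A)
     + lam * np.mean((1.0 - q(samples)) ** 2 * in_B))
print(C)   # a finite, nonnegative loss value
```

The boundary terms are small here because the trial committor already nearly vanishes on $A$ and nearly equals one on $B$; a poor trial function would be penalized through exactly these terms.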
2. Computing the gradients
Optimization of the neural network representation of the committor (B.2) by gradient descent (GD) requires estimating the gradient of the objective function with respect to the parameters. Denote
$$ q(x, \theta) = \sigma\left[ \frac{1}{n} \sum_{i=1}^n \varphi(\phi(x), \theta_i) \right], \tag{B.4} $$
so that
$$ \nabla_{\theta_i} C_\lambda[q] = \big\langle \nabla_{\theta_i} \nabla_x q \cdot \nabla_x q \big\rangle_\beta + \lambda \big\langle q\, \nabla_{\theta_i} q\, \mathbf{1}_A \big\rangle_\beta - \lambda \big\langle (1 - q)\, \nabla_{\theta_i} q\, \mathbf{1}_B \big\rangle_\beta. \tag{B.5} $$
Noting that
$$ \nabla_x \sigma(f(x)) = \sigma(f(x)) \big(1 - \sigma(f(x))\big) \nabla_x f(x), \tag{B.6} $$
and similarly for $\nabla_\theta$, we can derive explicit expressions for the quantities that need to be averaged. In particular, we see that
$$ \nabla_x q(x, \theta) = \frac{1}{n}\, q(x, \theta) \big(1 - q(x, \theta)\big) \sum_{i=1}^n \nabla_x \phi(x)\, \nabla_\phi \varphi(\phi(x), \theta_i), \tag{B.7} $$
$$ \nabla_{\theta_i} q(x, \theta) = \frac{1}{n}\, q(x, \theta) \big(1 - q(x, \theta)\big) \nabla_{\theta_i} \varphi(\phi(x), \theta_i), \tag{B.8} $$
and
$$ \nabla_{\theta_i} \nabla_x q(x, \theta) = \frac{1}{n}\, \nabla_{\theta_i} q(x, \theta) \big(1 - 2 q(x, \theta)\big) \sum_{j=1}^n \nabla_x \phi(x)\, \nabla_\phi \varphi(\phi(x), \theta_j) + \frac{1}{n}\, q(x, \theta) \big(1 - q(x, \theta)\big) \nabla_x \phi(x)\, \nabla_\phi \nabla_{\theta_i} \varphi(\phi(x), \theta_i). \tag{B.9} $$
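The chain-rule formula (B.8) is easy to check against finite differences. The sketch below uses a deliberately simple scalar model with tanh units, $\varphi(x, \theta_i) = \tanh(\theta_i x)$, which is our stand-in choice rather than the paper's architecture; the structure $\nabla_{\theta_i} q = \tfrac{1}{n} q (1 - q) \nabla_{\theta_i} \varphi$ is the same.

```python
import numpy as np

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

# Single-hidden-layer model q(x, theta) = sigmoid((1/n) sum_i phi(x, theta_i))
# with illustrative units phi(x, theta_i) = tanh(theta_i * x), so that
# d phi / d theta_i = x / cosh(theta_i * x)^2.
n = 8
rng = np.random.default_rng(2)
theta = rng.standard_normal(n)
x = 0.7

def q(th):
    return sigmoid(np.mean(np.tanh(th * x)))

# Analytic gradient from (B.8): dq/dtheta_i = (1/n) q (1 - q) dphi/dtheta_i
qv = q(theta)
grad_analytic = (1.0 / n) * qv * (1.0 - qv) * x / np.cosh(theta * x) ** 2

# Central finite-difference check of each component
eps = 1e-6
grad_fd = np.zeros(n)
for i in range(n):
    tp, tm = theta.copy(), theta.copy()
    tp[i] += eps
    tm[i] -= eps
    grad_fd[i] = (q(tp) - q(tm)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_fd)))   # tiny: the formulas agree
```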
3. Choice of the windowing functions
It remains to specify the functions $W_l(x)$. Here we propose to do this in a way that is adaptive to $q(x, \theta)$ itself. To this end, let
$$ \sigma(u) = \frac{1}{1 + e^{-u}}, \tag{B.10} $$
and, given $u_0 < u_1 < u_2 < \cdots < u_L$ and some $k >$
0, let
$$ W_l(x) = \sigma\big(k (q(x, \theta) - u_{l-1})\big) - \sigma\big(k (q(x, \theta) - u_l)\big), \qquad l = 1, \ldots, L. \tag{B.11} $$
Since $q(x, \theta) \in [0,$
1] by construction, we then have
$$ \forall x \in \mathbb{R}^d: \quad \sum_{l=1}^L W_l(x) = \sigma\big(k (q(x, \theta) - u_0)\big) - \sigma\big(k (q(x, \theta) - u_L)\big) \ge \sigma(-k u_0) - \sigma\big(k (1 - u_L)\big). \tag{B.12} $$
That is, if we take $k$ large enough and pick $u_0 = -a$ and $u_L = 1 + a$ with $a > 0$ and $ka \gg$
1, the nonnegative functions $W_l(x)$ can be made to satisfy (19) to arbitrary precision, exponentially fast in $ak$. The functions $W_l(x)$ are also peaked around $q(x, \theta) = \tfrac{1}{2}(u_l + u_{l-1})$, which means that by taking enough values of $u_l$ between $u_0 = -a$ and $u_L = 1 + a$ we can cover the whole range of possible values of $q(x, \theta)$.

Appendix C: Alternative formulation of the committor and boundary conditions
The variational problem of determining the committor function can be reinterpreted via the solution of the following PDE [22],
$$ L \tilde q = \tau e^{\beta V(x)} \big[ \delta(x - a) - \delta(x - b) \big]. \tag{C.1} $$
Given a solution to (C.1), it is straightforward to verify that the committor between the sets $A = \{x \mid \tilde q(x) \le \tilde q_-\}$ and $B = \{x \mid \tilde q(x) \ge \tilde q_+\}$ is given by
$$ q(x) = \frac{\tilde q(x) - \tilde q_-}{\tilde q_+ - \tilde q_-} \tag{C.2} $$
for $x \in (A \cup B)^c$.
We can use the variational optimization algorithm Alg. IV.1 to compute $\tilde q$, where we penalize the cost functional to obtain the loss function
$$ C_\lambda[\tilde q] = C[\tilde q] + \lambda \big[ \tilde q(a) - \tilde q(b) \big]. \tag{C.3} $$
In practice, we mollify the objective by replacing the $\delta$-masses with Gaussians centered at $a$ and $b$, using the corresponding means in the penalized objective above.
This formulation offers several advantages compared to the formulation discussed in the main text. First, because the range of $\tilde q$ is all of $\mathbb{R}$, there is no need to use thresholding functions that could affect the magnitude of gradients and hence the rate of convergence of the optimization. Second, in order to use the penalized objective of the main text, we must draw samples from the metastable states $A$ and $B$. If those states are difficult to sample, the boundary conditions here require knowledge of only a single point within $A$ and $B$.

Appendix D: Using collective variable maps
While we do not pursue this strategy in the numerical examples considered here, collective variables are often useful in complex molecular systems, especially when physical insight into the reaction mechanism can be deployed. Identifying a low-dimensional subspace that describes the reaction can help alleviate the issue of sampling high-dimensional systems. This approach may also be useful to remove symmetries that are irrelevant for the reaction in question. To do so, we use the standard notion of a reaction coordinate, which is a map
$$ \phi: \mathbb{R}^d \to \mathbb{R}^k; \quad x \mapsto z = \phi(x), \tag{D.1} $$
with $k \ll d$. In some sense, the committor is the reaction coordinate for the system, but it remains too unwieldy to be directly useful. Instead, we hope to approximate the committor; in what follows, we will use two distinct approximations:
$$ \hat q(x, v) \approx q(x), \tag{D.2} $$
and
$$ \hat q(x, v) \approx q(z) \quad \text{with } z = \phi(x). \tag{D.3} $$
In the second case (D.3), we use the collective variables map to eliminate degrees of freedom in addition to reducing the dimensionality by neglecting the momenta. Our goal at this point is to find a computationally tractable scheme for identifying $q$.
Using (D.3), we find that the objective function in collective variables can be written
$$ C[q] = C_0 \int_{\mathbb{R}^d} |\nabla q(\phi(x))|^2 e^{-\beta V(x)}\, dx = C_0 \int_{\mathbb{R}^d} \int_{\mathbb{R}^k} |\nabla \phi(x) \cdot \nabla q(z)|^2 e^{-\beta V(x)}\, \delta(z - \phi(x))\, dz\, dx = C_0 \int_{\mathbb{R}^k} \langle \nabla q(z), M(z) \nabla q(z) \rangle\, e^{-\beta F(z)}\, dz, \tag{D.4} $$
where we have defined
$$ F(z) = -\beta^{-1} \log \int_{\mathbb{R}^d} \delta(z - \phi(x))\, e^{-\beta V(x)}\, dx, \tag{D.5} $$
which can be viewed as the potential of mean force for the collective variable $z \in \mathbb{R}^k$, and $C_0$ is a constant that will not affect the optimization. Further, the change of coordinates necessitates including the metric term
$$ M(z) = \frac{\int \nabla \phi(x) \cdot \nabla \phi(x)\, e^{-\beta V(x)}\, \delta(\phi(x) - z)\, dx}{\int \delta(\phi(x) - z)\, e^{-\beta V(x)}\, dx}. $$
(D.6)
Importantly, we can relate the solution of the optimization in collective-variable space to the solution in the original state space. To do so, we examine the minimizers of the objective written in collective variables by studying its Euler-Lagrange equation. The objective (D.4) can be expressed as
$$ \tilde C_\lambda[q] = \frac{1}{2} \int \int_{\mathbb{R}^k} \langle \nabla q(z), M(z) \nabla q(z) \rangle\, e^{-\beta F(z) + \beta G(q(z))}\, \delta(q(z) - u)\, dz\, du + \lambda \int \int_A q^2(z)\, e^{-\beta F(z) + \beta G(q(z))}\, \delta(q(z) - u)\, dz\, du + \lambda \int \int_B \big(1 - q(z)\big)^2\, e^{-\beta F(z) + \beta G(q(z))}\, \delta(q(z) - u)\, dz\, du, \tag{D.7} $$
where we have introduced
$$ G(u) = -\beta^{-1} \log \int_{\mathbb{R}^k} e^{-\beta F(z)}\, \delta(q(z) - u)\, dz. \tag{D.8} $$
This objective can be interpreted as the integral of a conditional expectation (not explicitly writing the integrals that contain the boundary conditions):
$$ \tilde C_\lambda[q] = \int \frac{\int_{\mathbb{R}^k} \langle \nabla q(z), M(z) \nabla q(z) \rangle\, e^{-\beta F(z)}\, \delta(q(z) - u)\, dz}{e^{-\beta G(u)}}\, du + \text{B.C.} \equiv \int \mathbb{E}_u \langle \nabla q(z), M(z) \nabla q(z) \rangle\, du + \text{B.C.} \tag{D.9} $$
Let us first consider the case $\phi \equiv \mathrm{id}$, that is, no collective variables. An optimum of $\tilde C_\lambda$ solves a different Euler-Lagrange equation than a minimizer of $C_\lambda$, which, as is easily seen, satisfies
$$ \Delta \hat q - \beta \nabla V \cdot \nabla \hat q = 0. \tag{D.10} $$
However, it suffices to minimize (D.9) to find a minimizer of (10) because we have an explicit relation between the two. To see this, we compute the variation of the objective function with respect to $q$. The Euler-Lagrange equation satisfied by (D.7), neglecting the boundary conditions for clarity, reads
$$ 0 = \frac{\delta \tilde C_\lambda}{\delta q}, \tag{D.11} $$
yielding, after an integration by parts,
$$ \Delta q - \beta \nabla V \cdot \nabla q + \beta G'(q) |\nabla q|^2 = 0. $$
(D.12)
Hence, we see that the solution of (D.12) is related to the solution of the original committor's Euler-Lagrange equation (D.10). In particular, suppose that $q$ were a reparameterization of $\hat q$ in the sense that there is a map $\hat q = f(q)$. Then, using (D.10), we have that
$$ \frac{f''(q)}{f'(q)} |\nabla q|^2 - \beta \nabla V \cdot \nabla q + \Delta q = 0, \tag{D.13} $$
meaning that we can recover $\hat q$ given a solution of (D.12). This amounts to determining $f$, which can be computed by observing that
$$ \frac{d}{dq} \log f' = \beta G'(q), \tag{D.14} $$
so that
$$ f(q) = \frac{\int_0^q e^{\beta G(u)}\, du}{\int_0^1 e^{\beta G(u)}\, du}, \tag{D.15} $$
where we used the fact that $f(0) = 0$ and $f(1) = 1$.
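The reparameterization (D.15) is straightforward to evaluate by quadrature once $G$ is known. In the sketch below, $G$ is a made-up placeholder profile (our choice, purely for illustration); the point is only that any smooth $G$ yields a monotone $f$ with $f(0) = 0$ and $f(1) = 1$, which is all the construction requires.

```python
import numpy as np

# Recovering the reparameterization f in (D.15) by trapezoidal quadrature:
#   f(q) = int_0^q exp(beta * G(u)) du / int_0^1 exp(beta * G(u)) du
beta = 2.0
G = lambda u: np.sin(np.pi * u)   # placeholder profile, not from the paper

us = np.linspace(0.0, 1.0, 1001)
integrand = np.exp(beta * G(us))

# Cumulative trapezoid rule gives f on the whole grid at once.
cumulative = np.concatenate(([0.0], np.cumsum(
    0.5 * (integrand[1:] + integrand[:-1]) * np.diff(us))))
f = cumulative / cumulative[-1]   # normalize so that f(1) = 1

print(f[0], f[-1])   # boundary conditions f(0) = 0 and f(1) = 1 hold
```

Because the integrand $e^{\beta G}$ is strictly positive, $f$ is strictly increasing, so it is invertible and $\hat q = f(q)$ is a genuine reparameterization.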