Manifold Learning for Accelerating Coarse-Grained Optimization
Dmitry Pozharskiy, Noah J. Wichrowski, Andrew B. Duncan, Grigorios A. Pavliotis, Ioannis G. Kevrekidis
Abstract
Algorithms proposed for solving high-dimensional optimization problems with no derivative information frequently encounter the "curse of dimensionality," becoming ineffective as the dimension of the parameter space grows. One feature of a subclass of such problems that are effectively low-dimensional is that only a few parameters (or combinations thereof) are important for the optimization and must be explored in detail. Knowing these parameters/combinations in advance would greatly simplify the problem and its solution. We propose the data-driven construction of an effective (coarse-grained, "trend") optimizer, based on data obtained from ensembles of brief simulation bursts with an "inner" optimization algorithm, that has the potential to accelerate the exploration of the parameter space. The trajectories of this "effective optimizer" quickly become attracted onto a slow manifold parameterized by the few relevant parameter combinations. We obtain the parameterization of this low-dimensional, effective optimization manifold on the fly using data mining/manifold learning techniques on the results of simulation (inner optimizer iteration) burst ensembles and exploit it locally to "jump" forward along this manifold. As a result, we can bias the exploration of the parameter space towards the few, important directions and, through this "wrapper algorithm," speed up the convergence of traditional optimization algorithms.
The design of complex engineering systems often leads to high-dimensional optimization problems with computationally expensive objective function evaluations, often given in the form of a (computational) black box. In such cases the derivative information may be unavailable or impractical to obtain in closed form, too expensive to compute numerically, or unreliable to estimate if the objective function is noisy. These difficulties may render derivative-based optimization methods impractical for such problems, and so-called derivative-free methods must be used.

The first such derivative-free algorithms appeared quite early: the direct search method [23] and the Nelder-Mead algorithm [41]. Since then a variety of algorithms have been proposed, including trust-region methods [9]; deterministic global algorithms [27, 24]; algorithms utilizing surrogate models [3, 26]; and stochastic global methods such as genetic algorithms [22], simulated annealing [31], and particle swarm optimization [14]. A broader overview of the aforementioned methods, along with additional ones used for high-dimensional problems, can be found in [10, 49, 46].

Many of these methods have been applied successfully to low-dimensional problems where derivative information is not available; once the dimension of the parameter space grows, however, they run into the "curse of dimensionality," where the required sampling of the parameter space grows exponentially or the convergence becomes too slow. There is case-dependent evidence that, for certain classes of problems, out of the vast parameter space only a few parameters or combinations of parameters suffice to describe most of the variance in the objective function values, with the rest having little effect [35, 39, 40, 47]. It is observations of this nature that we aim to exploit in our proposed method, borrowing additional ideas from the field of fast/slow (singularly perturbed) dynamical systems and data mining/manifold learning techniques.

It has been observed that complex dynamical systems, such as molecular dynamics (MD) simulations or complex reaction network dynamics, may possess a low-dimensional, attracting, "slow" manifold. The dynamics of the system after being initialized at a random state quickly approach the slow manifold and then evolve "along it" (close to it). A reduced model of the system in terms of the slow variables parameterizing this manifold would greatly simplify the understanding of the system's behavior (and its computation). However, such a model is often unavailable in closed form. In previous work [30] we have shown how short "bursts" of a microscopic simulator can evolve the system close to and then "along" an underlying slow manifold. Essentially, after the short burst is attracted to the slow manifold, we can observe the evolution on a restricted set of coarse-grained observables that parameterize the slow manifold when these coarse variables are known a priori. We can then perform a "large" time step by extrapolating the few macroscopic variable values and lifting the new state back into the full space to initialize a new set of computation bursts for the microscopic simulator. This can achieve significant acceleration of the effective complex system dynamics.
If the macroscopic variables are not known, then a reduced description of the manifold can be derived on the fly by using data-driven dimensionality reduction techniques, which uncover the few intrinsic variables that are adequate to describe a high-dimensional data set locally.

The high-dimensional optimization problem can be treated in the same vein by making two assumptions: (a) we have an "inner optimizer," analogous to a microscopic simulator, that samples the parameter space and produces a series of "pseudo"-Langevin trajectories; (b) we postulate that there exists an attracting, slow manifold which can be parameterized in terms of a few parameters or combinations thereof, and the inner optimizer is quickly attracted to it. In the following sections we will show that the trajectory produced by a specific version of simulated annealing (SA)—our "inner optimizer"—at constant temperature can be described by an effective Stochastic Differential Equation (SDE) whose drift contains information about the local gradient of the objective function. To be precise, since we do not vary the temperature as the optimization progresses, our "inner optimizer" is a random walk Metropolis-Hastings (RWMH) algorithm. Running short bursts of this constant-temperature SA, we create "pseudo" dynamics that can be thought of as the (approximate) dynamics of an actual dynamical system. After initializing at a random point in parameter space, the algorithm is quickly attracted to the low-dimensional manifold, and by applying either linear or nonlinear dimensionality reduction techniques we can obtain a useful local parameterization of this manifold. We can estimate the drift of the effective SDE using established parametric inference methods and thus estimate the local effective gradient of the objective function [48]. This can be used subsequently in an algorithm such as gradient descent in a reduced parameter space. The new point is lifted back to full space, using local Principal Component Analysis or geometric harmonics [7], and the entire procedure is repeated, leading to an acceleration of the overall optimization.

The Langevin equation was introduced as a stochastic global optimization method shortly after the first appearance of simulated annealing [19, 18]. It is a gradient descent method with the addition of a "thermal" noise that allows the trajectories to escape local minima and thus enhances their ability to explore the parameter space. However, it may become impractical for the problems we are considering since, as discussed above, the gradient information is explicitly unavailable. The equation reads

$$ dx_t = -\nabla f(x_t)\, dt + \sqrt{2T}\, dW_t, \qquad (1) $$

where $x_t \in \mathbb{R}^n$, $f$ is the objective function, $W_t$ is an $n$-dimensional standard Brownian motion, and $T$ is the temperature parameter. It can be shown that, under an appropriate temperature schedule $T(t)$, the algorithm converges weakly to the global minimum [18]. The equilibrium distribution is the Gibbs distribution, with density

$$ \pi(x; T) = \frac{\exp\left[-f(x)/T\right]}{\int_{\mathbb{R}^n} \exp\left[-f(y)/T\right]\, dy}. $$

The simulated annealing algorithm admits the same equilibrium distribution at an equal $T$ and can be viewed as an adaptation of the Metropolis-Hastings algorithm [38] with time-dependent acceptance probability due to the temperature schedule.
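For concreteness, a minimal sketch (our own illustration, not part of the original text) of an Euler-Maruyama discretization of (1) follows; the $\sqrt{2T\,\delta t}$ noise scaling matches the Gibbs density stated above, and the function name and arguments are our choices.

```python
import numpy as np

def langevin_em(grad_f, x0, T, dt, n_steps, rng=None):
    """Euler-Maruyama discretization of eq. (1):
    x_{k+1} = x_k - grad_f(x_k)*dt + sqrt(2*T*dt)*xi_k,  xi_k ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        x = x - grad_f(x) * dt + np.sqrt(2.0 * T * dt) * rng.standard_normal(x.shape)
        path.append(x.copy())
    return np.asarray(path)
```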
The acceptance probability is given by

$$ a = \min\left(1,\ \exp\left[-\frac{f(y) - f(x)}{T}\right]\right), $$

where $x$ is the current point and $y$ is the new trial point, which comes from a symmetric proposal density $g(y \mid x)$. Hence, better points are always accepted, and worse points are accepted with probability $0 < a < 1$, which is greater at higher temperatures $T$.

Consider now the simple, one-dimensional case at constant temperature $T$ (a RWMH protocol) using proposal density $g(y \mid x) = \mathcal{N}(x,\ 2T\,\delta t)$ and the acceptance probability defined above. It can be shown [13, 15] that, in the limit of small time steps $\delta t$ or large temperatures $T$, the density of accepted points after one step converges to a normal distribution $\mathcal{N}(x - f'(x)\,\delta t,\ 2T\,\delta t)$, which corresponds to the density of a new point under an Euler-Maruyama discretization of the Langevin equation [36]. Figure 1 shows the density of sample points after one step and after 100 steps using both algorithms. The two distributions visually almost coincide.

Using the above procedure we can obtain "pseudo" trajectories in the parameter space that are analogous to the trajectories produced by the Langevin equation and contain information about an effective gradient of the objective function without explicitly computing it.
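A constant-temperature RWMH burst—the "inner optimizer" just described—might look as follows. This is a sketch under the conventions above (proposal variance $2T\,\delta t$, acceptance probability $\min(1, e^{-(f(y)-f(x))/T})$); the helper name and the `n_accept` stopping rule are our own.

```python
import numpy as np

def rwmh_burst(f, x0, T, dt, n_accept=1000, rng=None):
    """Constant-temperature SA (random walk Metropolis-Hastings).
    Gaussian proposals of variance 2*T*dt; returns the accepted points."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    fx = f(x)
    accepted = []
    while len(accepted) < n_accept:
        y = x + np.sqrt(2.0 * T * dt) * rng.standard_normal(x.shape)
        fy = f(y)
        # accept with prob. min(1, exp(-(f(y)-f(x))/T)); always true if fy <= fx
        if np.log(rng.uniform()) < -(fy - fx) / T:
            x, fx = y, fy
            accepted.append(x.copy())
    return np.asarray(accepted)
```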
In our previous discussion of dynamical systems, we mentioned that the long-term dynamics of the full system can often be usefully restricted to the dynamics of a few slow variables. These variables can be a collection of macroscopic variables that are available from our intimate knowledge of the system, or they can be estimated "on the fly" using dimension reduction techniques. Such techniques can be applied to large, high-dimensional data sets to uncover the few intrinsic variables that are sufficient to describe most of the system's long-term behavior. We will use the latter approach in our method, since it can be quite challenging to identify beforehand the few parameters that are important to the optimization.

One of the most common methods for dimension reduction is Principal Component Analysis (PCA) [25]. It tries to identify a hyperplane that best fits the data by finding an orthogonal basis, where the first vector points in the direction of maximum variance in the data set and all subsequent vectors maximize variance in orthogonal directions.
The basis vectors are called principal components, and they can be found by an eigenvalue decomposition of the covariance matrix of the data set after it has been centered. If the eigenvalues are sorted and the relationship $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k \gg \lambda_{k+1} \geq \cdots \geq \lambda_n$ holds ($n$ is the dimension of the original space), then there is a gap in the eigenvalue spectrum and we can reduce the dimensionality of our data set by projecting it onto the first $k$ principal components. PCA is a well-documented technique, but its major limitation is that it can parameterize only linear manifolds.

[Figure 1: Evolution in time of the probability density of current points using either the Langevin equation or SA at constant $T$; (a) after one step, (b) after 100 steps. Starting point $x = 1$, $T = 0.5$. The two densities visually almost coincide.]

Nonlinear manifold learning techniques are required if the data lie on a curved manifold. One such method is Diffusion Maps (DMaps) [8]. For a data set of size $m$, the algorithm starts by constructing a weight matrix $W \in \mathbb{R}^{m \times m}$:

$$ W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{\varepsilon}\right), \qquad i, j = 1, \ldots, m, $$

where $x_i, x_j \in \mathbb{R}^n$, $\|\cdot\|$ is an appropriate norm, and $\varepsilon$ is a characteristic distance between data points. Next, we construct the diagonal matrix $D \in \mathbb{R}^{m \times m}$ with $D_{ii} = \sum_j W_{ij}$ and compute $\widetilde{W} = D^{-\alpha} W D^{-\alpha}$, where $0 \leq \alpha \leq 1$. We then construct the diagonal matrix $\widetilde{D} \in \mathbb{R}^{m \times m}$ with $\widetilde{D}_{ii} = \sum_j \widetilde{W}_{ij}$ and compute the row-stochastic matrix $K = \widetilde{D}^{-1} \widetilde{W}$. The matrix $K$ is the transition probability matrix of a Markov chain defined on the data set, whose states are the individual data points.

The eigenvectors $\psi_0, \psi_1, \ldots, \psi_{m-1}$ of the matrix $K$ approximate the eigenfunctions of the Laplace-Beltrami operator on the sampled manifold and thus can be used to parameterize the manifold [6]. Since $K$ is row-stochastic, the first eigenvector $\psi_0$ is trivial: all ones. The subsequent eigenvectors are called diffusion coordinates and have corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_{m-1}$. The original data points are mapped to their diffusion coordinates as

$$ x \mapsto \left(\lambda_1^{\tau} \psi_1(x),\ \ldots,\ \lambda_{m-1}^{\tau} \psi_{m-1}(x)\right), $$

where $\psi_i(x) \in \mathbb{R}$ represents the entry of eigenvector $\psi_i$ corresponding to the point $x$ from the original data. In the following sections we take $\tau = 0$. The distance between two mapped points is called the diffusion distance, and it represents the similarity between two points in the original space. If two points are nearby in the diffusion space using a Euclidean metric, it implies that there are multiple short paths to transition from one point to the other in the original space. Similarly to PCA, if there is a spectral gap we can map our original data set to $k$ "important" eigenvectors. However, attention must be paid to the fact that some eigenvectors may be higher harmonics of previously discovered ones [11]. The nonlinear manifold is parameterized by the few eigenvectors that correspond to the largest eigenvalues and that are not themselves such higher harmonics. These diffusion coordinates are the important intrinsic variables that parameterize the nonlinear manifold and indicate its dimensionality.

We illustrate DMaps by applying it to a "Swiss roll" data set. This is a three-dimensional data set, but only two variables are sufficient to describe every point: the height and the arclength along the roll. Figure 2a shows the data set colored by the first non-trivial eigenvector, which parameterizes the arclength, while Figure 2b shows the data set colored by the second non-trivial eigenvector, which parameterizes the height. Figure 2c shows the data set "unrolled" in the diffusion map space.

In the case that the original data set comes from independent stochastic processes, as in our time series simulator, but we observe it through some nonlinear transformation $y = f(x)$, we can retrieve the original manifold using Mahalanobis distances [50, 12]. It can be shown that if $x_i, x_j$ are two data points in the original space and $y_i, y_j$ are their nonlinear transformations, then

$$ \|x_i - x_j\|^2 = \frac{1}{2} (y_i - y_j)^{\top} \left[ (JJ^{\top})^{-1}(y_i) + (JJ^{\top})^{-1}(y_j) \right] (y_i - y_j) + O\left(\|y_i - y_j\|^4\right), $$

where $J$ is the Jacobian matrix of the transformation. In practice, the matrix $JJ^{\top}$ is approximated by a covariance matrix, which is estimated by running several short bursts of our simulator around each data point.
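A compact sketch of the DMaps construction just described (our own illustration; `eps`, `alpha`, and `n_coords` are assumed tuning knobs, and a dense eigendecomposition is used for simplicity):

```python
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_maps(X, eps, alpha=1.0, n_coords=2):
    """Diffusion Maps as described above. X: (m, n) data matrix; eps: kernel
    scale; alpha: density normalization exponent. Returns the leading
    nontrivial eigenvalues and diffusion coordinates (tau = 0)."""
    W = np.exp(-cdist(X, X, "sqeuclidean") / eps)
    d = W.sum(axis=1)
    W_t = W / np.outer(d**alpha, d**alpha)        # W~ = D^-alpha W D^-alpha
    K = W_t / W_t.sum(axis=1, keepdims=True)      # row-stochastic Markov matrix
    evals, evecs = np.linalg.eig(K)
    order = np.argsort(-evals.real)
    evals, evecs = evals.real[order], evecs.real[:, order]
    return evals[1:n_coords + 1], evecs[:, 1:n_coords + 1]  # skip trivial psi_0
```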
Since the manifold has lower dimensionality than the observed space, the covariance matrix will be rank deficient, and a pseudo-inverse must be used to compute the Mahalanobis distances.
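A sketch of the covariance pseudo-inverse and the resulting squared Mahalanobis distance (cf. the SVD-based pseudo-inverse of eq. (8) below); the truncation tolerance `rtol` is our own assumption:

```python
import numpy as np

def pinv_cov(bursts, rtol=1e-12):
    """Pseudo-inverse of a local covariance estimated from short bursts
    around one data point, via truncated SVD."""
    C = np.cov(np.asarray(bursts), rowvar=False)
    U, s, _ = np.linalg.svd(C)
    keep = s > rtol * s[0]                # drop rank-deficient directions
    return (U[:, keep] / s[keep]) @ U[:, keep].T

def mahalanobis_sq(yi, yj, Ci_pinv, Cj_pinv):
    """Squared Mahalanobis distance between observed points yi and yj."""
    d = np.asarray(yi) - np.asarray(yj)
    return 0.5 * d @ (Ci_pinv + Cj_pinv) @ d
```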
We mentioned above that time series from the SA algorithm can be considered as corresponding to those of an effective stochastic differential equation.

[Figure 2: Applying Diffusion Maps to the Swiss roll data set. (a) The first (nontrivial) diffusion coordinate $\psi_1$ parameterizes arclength along the roll. (b) The second diffusion coordinate $\psi_2$ parameterizes height. (c) The two-dimensional nature of the data set is uncovered in the diffusion map space.]

Hence, an essential component of our algorithm is the estimation of drift and diffusion coefficients of a stochastic process from local path data. Assume the one-dimensional stochastic process $dx(t) = h(x(t))\, dt + \sigma(x(t))\, dW$, with $W$ a standard Brownian motion. The coefficients can be estimated either from their statistical definitions [20], i.e.,

$$ h\left(x(t)\right) = \lim_{\tau \to 0} \frac{\left\langle x(t+\tau) - x(t) \right\rangle}{\tau}, \qquad \sigma^2\left(x(t)\right) = \lim_{\tau \to 0} \frac{\left\langle \left(x(t+\tau) - x(t)\right)^2 \right\rangle}{\tau}, \qquad (2) $$

or using the Generalized Method of Moments (GMM) [21, 4], where moment conditions can be easily derived from an Euler-Maruyama discretization of the stochastic process. The above methods are more fitting if the stochastic process is realized as multiple short trajectories starting from the same initial conditions. On the other hand, if we are given a single, long trajectory, then maximum likelihood methods [1, 2] are more suitable.

The maximum likelihood estimator (MLE) for the drift coefficients is known to be asymptotically unbiased [44]; i.e., as the length of the observed path increases, the MLE converges to the true values of the coefficients that appear in the drift. This is no longer true in the presence of a multiscale structure, i.e., when we want to estimate parameters in a stochastic coarse-grained model, given observations of the slow variable from the full dynamics. Indeed, it was shown rigorously in [43, 45] that in this case the MLE becomes asymptotically biased and that subsampling at an appropriate rate, between the two characteristic time scales of the dynamics, is needed in order to estimate accurately the parameters in the coarse-grained model. This is particularly relevant for us, since the optimization/estimation methodology is based on the assumption of the existence of a reduced model that accurately describes the system we are interested in. See, for example, the ODE driven by a sped-up Lorenz 63 ODE that is studied in a later section.
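The finite-$\tau$ version of the estimators (2), applied to the endpoints of an ensemble of bursts launched from a common point, can be as simple as the following sketch (our own illustration; for small $\tau$ the drift contribution to the second moment is assumed negligible):

```python
import numpy as np

def drift_diffusion_from_bursts(x0, x_tau, tau):
    """Finite-tau estimators corresponding to eq. (2), using the endpoints
    x_tau of an ensemble of short bursts all started at x0."""
    dx = np.asarray(x_tau) - x0
    drift = dx.mean() / tau            # h(x0)      ~= <x(t+tau) - x(t)> / tau
    diff_sq = (dx**2).mean() / tau     # sigma^2(x0) ~= <(dx)^2> / tau
    return drift, diff_sq
```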
Before delving into the complete algorithm, which involves estimation of effective gradients in the low-dimensional embedding, we proceed with a simpler example where the objective function is two-dimensional but the optimization process can be effectively one-dimensional. Consider an objective function of the form

$$ f(x, y) = C \exp\left[ -(x^2 + y^2)^2 + c_1 (x^2 + y^2) - c_2 \sqrt{(x - 2)^2 + y^2} \right], \qquad (3) $$

where $c_1, c_2 > 0$ are fixed constants and $C$ is a proportionality constant selected to normalize the maximum objective value to unity. Since (3) arises as the posterior density of a Bayesian parameter estimation problem, our algorithm in reality minimizes $-f$, but we present the results from the perspective of maximizing $f$.

Plotting the objective (cf. Figure 3) reveals that $f \approx 0$ away from the circle $x^2 + y^2 = 4$, where the function exhibits a sharply peaked "ridge" of greater function values. If we attempt to maximize $f$ via Simulated Annealing (or, at constant temperature, RWMH), the algorithm will quickly be attracted to the ridge and then slowly proceed along it towards the maximum. Trial points far away from the ridge are usually worse than the current accepted point and are highly likely to be rejected. Hence, the accepted points lie on an essentially one-dimensional curve in the parameter space.

We can apply Diffusion Maps to these data, and the first diffusion coordinate will parameterize the curve. Once we have obtained the one-dimensional embedding, we can extrapolate in the direction in which the objective function increases. The last step involves returning from the diffusion space back to the original parameter space. This procedure is called "lifting," and a variety of methods are available, such as Laplacian Pyramids [12], Geometric Harmonics [7, 34], and Radial Basis Functions [5]. We will use geometric harmonics in the present work. After obtaining the projected point back in parameter space, we perform another short run of SA from that point and repeat the procedure. To summarize the algorithm (a code sketch of one such coarse iteration is given below):

1. Pick an initial point, possibly near the "ridge" of the objective function.
2. Run a short burst of Simulated Annealing until a prescribed number of points has been accepted (1000 here).
3. Discard any outliers that may be far away from the ridge.
4. Apply Diffusion Maps to the remaining two-dimensional data set and obtain a nonlinear embedding. The embedding is one-dimensional, and the diffusion coordinate can be thought of as corresponding to the arclength along the ridge.
5. In the diffusion space, project to a new point in the direction that increases the objective function.
6. Lift the new point back to full space via geometric harmonics. The resulting point is expected to be close to the ridge, but even if it is not, the new SA run will quickly be attracted to it.
7. Return to Step 2 and repeat the procedure.

The results for this illustrative example are shown in Figure 3. Each short burst of SA is shown as a "cloud" of red points. The lifted points are shown in yellow, and we can see that geometric harmonics perform well in this case, as the lifted points are still close to the ridge. The maximum of the function is depicted by the purple diamond. Additionally, a single run of SA using the same total number of function evaluations was performed (in green). Figure 4 compares the running maximum objective value achieved by the two approaches. It is clear that, for this simple illustration, SA/RWMH combined with Diffusion Maps approaches the maximum substantially more quickly. Building on the body of ideas developed in [16, 17] as well as [51, 52], the combination of an "inner optimizer" with data mining of its local results, followed by taking larger "outer" optimization steps in the identified reduced space, has the potential to significantly accelerate the overall computational optimization.

[Figure 3: One complete run of the algorithm that approaches the global maximum. Six total "coarse iterations" (shown in red) were performed. In addition, a single run of SA using the same number of function evaluations is shown in green. Our algorithm visibly approaches the maximum much faster.]

[Figure 4: A comparison of the evolving maximum objective value for both methods. SA combined with DMaps needs only a fraction of the total function evaluations compared to simple SA.]
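One "coarse iteration" of the wrapper algorithm above could be organized as follows. This is a hedged sketch, not the paper's implementation: `embed` and `lift` are hypothetical user-supplied black boxes standing in for the DMaps embedding and the geometric-harmonics lifting, and `rwmh_burst` is the inner-optimizer sketch given earlier.

```python
import numpy as np

def coarse_step(f, x_start, T, dt, jump, embed, lift, n_accept=1000):
    """One coarse iteration: burst -> embed -> jump along the manifold -> lift.
    embed: point cloud -> 1D diffusion coordinate per point.
    lift:  a coordinate value -> a point in the full parameter space."""
    cloud = rwmh_burst(f, x_start, T, dt, n_accept)   # inner-optimizer burst
    psi = embed(cloud)                                # 1D coordinate of each point
    fvals = np.array([f(p) for p in cloud])
    # orient the coordinate so that increasing psi increases f
    uphill = np.sign(np.corrcoef(psi, fvals)[0, 1])
    target = psi.max() + jump if uphill > 0 else psi.min() - jump
    return lift(target)                               # new starting point
```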
In this section we will demonstrate how one can estimate the theoretically expected drift and diffusion coefficients after the trajectories of a stochastic process have been transformed using DMaps. As mentioned before, these coefficients correspond to an effective gradient and will provide us with an approximation of the "correct" ascent (resp. descent) direction along which to optimize (maximize, resp. minimize) in the low-dimensional space. We begin with a two-dimensional SDE, analogous to the Langevin equation:

$$ dx = \mu(x, y)\, dt + \sqrt{2T}\, dW_1, \qquad dy = \nu(x, y)\, dt + \sqrt{2T}\, dW_2, \qquad (4) $$

where $W = [W_1\ W_2]^{\top}$ collects independent Brownian motions. Assume now that the data in the original $(x, y)$ space are observed through the leading Diffusion Map coordinates, so that the trajectories are now written in terms of the diffusion coordinates $\psi_1(x, y)$ and $\psi_2(x, y)$. In order to rewrite our system of SDEs in terms of the new variables, we apply the multidimensional Itô lemma [42, 44]; e.g., for $\psi_1$ we have:

$$ d\psi_1 = \left( (\nabla \psi_1)^{\top} \begin{bmatrix} \mu \\ \nu \end{bmatrix} + \frac{1}{2} \operatorname{Tr}\left[ \Sigma^{\top} (H\psi_1)\, \Sigma \right] \right) dt + (\nabla \psi_1)^{\top} \Sigma\, dW $$
$$ = \left[ \left( \frac{\partial \psi_1}{\partial x}\, \mu + \frac{\partial \psi_1}{\partial y}\, \nu \right) + T \left( \frac{\partial^2 \psi_1}{\partial x^2} + \frac{\partial^2 \psi_1}{\partial y^2} \right) \right] dt + \sqrt{2T}\, \sqrt{ \left( \frac{\partial \psi_1}{\partial x} \right)^2 + \left( \frac{\partial \psi_1}{\partial y} \right)^2 }\; d\widetilde{W}_1, $$

where $\Sigma = \sqrt{2T}\, I$ is the noise matrix from (4), $H\psi_1$ is the Hessian matrix of second partial derivatives of $\psi_1$, and $\widetilde{W}_1$ is a new Brownian motion.

In order to simplify the estimation, we set up a two-dimensional grid in the $(x, y)$ space. The partial derivatives required for the theoretical computation of the coefficients are approximated numerically at the grid points using centered differences. From every grid point, $N$ trajectories are simulated via SA for a specified time $\Delta t$, which is also the time step in the estimation computed via (2). The simulation time step between successive points on a trajectory is $\delta t \ll \Delta t$.

The data set that is then "passed" to the DMaps algorithm, in order to find the new embedding, consists of the cloud of final points from each trajectory, the initial grid points, and all points where the partial derivatives are estimated. Afterwards, we have a new grid in the $(\psi_1, \psi_2)$ space, with every partial derivative estimated on this grid. We can assume that in a small neighborhood of every grid point the partial derivatives of the diffusion coordinates are approximately constant. Given that, we perform a separate estimation at each grid point of the following system of SDEs:

$$ d\psi_1 = \theta_1\, dt + \theta_2\, d\widetilde{W}_1, \qquad d\psi_2 = \theta_3\, dt + \theta_4\, d\widetilde{W}_2, $$

where $\theta_1$ and $\theta_2$ correspond to

$$ \theta_1 = \left( \frac{\partial \psi_1}{\partial x}\, \mu + \frac{\partial \psi_1}{\partial y}\, \nu \right) + T \left( \frac{\partial^2 \psi_1}{\partial x^2} + \frac{\partial^2 \psi_1}{\partial y^2} \right), \qquad \theta_2 = \sqrt{2T}\, \sqrt{ \left( \frac{\partial \psi_1}{\partial x} \right)^2 + \left( \frac{\partial \psi_1}{\partial y} \right)^2 }, \qquad (5) $$

and similarly for $\theta_3$ and $\theta_4$. Thus, we obtain values for each $\theta_i$, $i = 1, 2, 3, 4$, over the grid and can smooth them, e.g., using a quadratic fit:

$$ \theta_1(\psi_1, \psi_2) \approx p_0 + p_1 \psi_1 + p_2 \psi_2 + p_3 \psi_1^2 + p_4 \psi_1 \psi_2 + p_5 \psi_2^2. $$

If the grid is local (small) enough and the bursts are for the most part contained within the grid, we can assume that the objective function is locally approximated by a linear surface with constant gradient, i.e., $\mu(x, y) = \mu$ and $\nu(x, y) = \nu$. For our illustrative example we use the function $f(x, y) = x + 2y$, which has a constant gradient, on an orthogonal grid along the coordinate axes. We use an $8 \times 10$ grid, with $N = 150$ trajectories at every grid point, each run for $\Delta t = 0.01$.

To illustrate that the estimation is accurate even if we observe the process through a nonlinear transformation, we transform a region of the $(x, y)$ space that contains the trajectories and map it onto a spherical surface, obtaining a new $(x, y, z)$ space. We apply Diffusion Maps with Euclidean distances to the original data set and Diffusion Maps with Mahalanobis distances to the transformed data set. Figure 5a shows the original data set colored by the two diffusion coordinates and the new embedding. Of course, since the original data set is two-dimensional, no dimensionality reduction is achieved in this case. Figure 5b shows the transformed data set colored by the diffusion coordinates. In this case, Mahalanobis distances enable us to retrieve the original, orthogonal, two-dimensional embedding.

After estimation, some of the drifts and diffusivities along the grid approach zero. In order to avoid these degeneracies, we discard the points where this occurs and fit the coefficients $\theta_i$ to the rest of the grid. Figures 6 and 7 show the results for the drift coefficients $\theta_1, \theta_3$. Similar results are obtained for the diffusion coefficients $\theta_2, \theta_4$.

[Figure 5: Diffusion maps applied to (a) the original 2D data with Euclidean distances, and (b) the transformed data with Mahalanobis distances. In both cases, the inherent dimensionality is recovered in the first two eigenvectors, as shown by coloring the data by the corresponding leading Diffusion Map coordinates.]

[Figure 6: Estimation of the first drift coefficient $\theta_1$. "Theoretical" values are obtained numerically from (5), "Euclidean" via DMaps on the original data, and "Mahalanobis" from DMaps on the transformed data. The last subplot shows results from Euclidean DMaps on transformed data, which, as expected, yields incorrect estimates.]

[Figure 7: Estimation of the second drift coefficient $\theta_3$. The subplots are analogous to those in Figure 6.]

Having shown that we can estimate the correct parameters in the Diffusion Map space, whether we observe the original manifold or some invertible transformation of it, we now apply our algorithm to a three-dimensional objective function with an attracting, slow, two-dimensional manifold. In a cylindrical coordinate scheme (i.e., $x(r, \theta, z) = r \cos \theta$, $y(r, \theta, z) = r \sin \theta$, $z(r, \theta, z) = z$), we define

$$ f(r, \theta, z) = k_1 (r - R)^2 + h(\theta) + k_2 z^2, $$

where $k_1, k_2 > 0$, so that trajectories are strongly attracted to the cylinder of radius $R$ about the $z$ axis. We used parameter values $(k_1, k_2, R) = (10, \ldots, \ldots/\pi)$ and implemented $f$ programmatically so as to accept Cartesian coordinates and internally convert to cylindrical. The function $h(\theta)$, a fixed combination of cosines of $\theta$, determines an asymmetric double-well potential with a local minimum close to $\theta = -\pi/2$ and the global minimum close to $\theta = \pi/2$. Any trajectory away from the cylinder is quickly attracted to it, and then the search of the parameter space proceeds along the cylinder surface. The new algorithm builds on the previously presented one, but now also includes parameter estimation in Diffusion Map space.
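The coefficient-smoothing step used by the algorithm below—the least-squares quadratic fit of the per-grid-point estimates $\theta_i(\psi_1, \psi_2)$ described above—can be sketched as follows (our own illustration; the grid coordinates are assumed to be flattened arrays):

```python
import numpy as np

def fit_theta_quadratic(psi1, psi2, theta):
    """Least-squares quadratic surface for a coefficient theta(psi1, psi2)
    estimated at the grid points, as in the smoothing step above."""
    A = np.column_stack([np.ones_like(psi1), psi1, psi2,
                         psi1**2, psi1 * psi2, psi2**2])
    p, *_ = np.linalg.lstsq(A, theta, rcond=None)
    return p  # coefficients [p0, p1, p2, p3, p4, p5]
```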
Algorithm
1. Initialize a local grid around a starting point and simulate ensembles of short trajectories starting at every grid point.
2. Apply DMaps to the data set to obtain the low-dimensional embedding.
3. Estimate SDE coefficients on the grid points.
4. Fit a polynomial to the coefficients along the grid.
5. Integrate the system of ODEs $d\psi_1 = \theta_1(\psi_1, \psi_2)\, dt$, $d\psi_2 = \theta_3(\psi_1, \psi_2)\, dt$ forward in time. This is analogous to using a gradient descent algorithm.
6. Lift the resulting point to full space and use it as a new starting point.
7. Repeat until the estimated coefficients in Diffusion Map space approach zero.

One must of course pay attention to the length of the integration step, to avoid extrapolating too far away from the grid in 3D space, since the coefficients are known only locally there. In general, we observed that linear fits perform better and follow the negative gradient direction at larger distances from the grid. Figure 8 shows two snapshots of the algorithm run as it approaches the minimum near $\theta = \pi/2$.

[Figure 8: Coarse-grained (two-dimensional) optimization on a cylindrical surface in three dimensions. One complete run of the algorithm, illustrating the short bursts of trajectories ("Bursts on Grid"), the new points after being lifted by geometric harmonics ("Lifted Extensions"), the "Algorithm Output," and the "Global Minimizer." Observe that, although the lifting does not perfectly locate the cylinder, the burst trajectories are quickly attracted back to it. The second plot is a top-down view of the first.]

Our estimation procedure can also be applied in cases where the underlying stochasticity of the system is not due to a Wiener process but arises from deterministic chaos. Consider an ODE driven by one of the components of the Lorenz system:

$$ \frac{dx}{dt} = A(x - x^3) + \frac{\lambda}{\varepsilon}\, y_2, \quad \frac{dy_1}{dt} = \frac{10}{\varepsilon^2}(y_2 - y_1), \quad \frac{dy_2}{dt} = \frac{1}{\varepsilon^2}(28 y_1 - y_2 - y_1 y_3), \quad \frac{dy_3}{dt} = \frac{1}{\varepsilon^2}\left(y_1 y_2 - \tfrac{8}{3} y_3\right). \qquad (6) $$

It can be shown that the approximate dynamics for the slow variable are given by the following SDE [28, 37]:

$$ dx = A(x - x^3)\, dt + \sqrt{\sigma}\, dW. $$

In the following simulations we use parameter values $A = 1$, $\lambda = 2/45$, and a small value of $\varepsilon$. Given a diffusion coordinate $\psi(x)$ that is one-to-one with the slow variable $x$, we can write an SDE for it using Itô's lemma:

$$ d\psi = \left( A(x - x^3)\, \frac{d\psi}{dx} + \frac{\sigma}{2}\, \frac{d^2\psi}{dx^2} \right) dt + \sqrt{\sigma}\, \frac{d\psi}{dx}\, dW. \qquad (7) $$

The simulation is set up as follows (cf. [32, 33]): initially, the system is run long enough from fixed initial conditions that it converges onto the Lorenz attractor. The resulting state of the fast variables $(y_1, y_2, y_3)$ will be used subsequently as our starting point for the short bursts. The starting points for $x$ are taken as equally spaced points in a small interval, and for each $x$ value we perturb the starting point of each short burst. The actual initial conditions are given as

$$ x_{ic} = x + c\, x_{\text{spacing}}\, z_0, \qquad y_{ic} = y + z, $$

where $z_0$ and $z$ are standard normal variables, $x_{\text{spacing}}$ is the distance between the unperturbed starting points for $x$, and $c$ is a small constant.

At each of our 20 starting points, 500 short trajectories are simulated. The system is integrated with a small time step $\delta t$ for a duration $\Delta t = 0.03$. Figure 9 shows the trajectories of both the fast and slow variables for one such burst. We see that the duration $\Delta t = 0.03$ lies between the timescales of the fast (Lorenz) and slow (effective $x$) dynamics, so we avoid biasing our estimators for the coarse-grained model parameters.

[Figure 9: Simulation of a short trajectory initialized at perturbed initial conditions $(x_{ic}, y_{ic})$.]

As before, we assume that the drift and diffusion coefficients are approximately constant at each starting point and estimate the following SDE at each starting point via GMM:

$$ dx = \theta_1\, dt + \theta_2\, dW. $$

Afterwards, we fit a polynomial of an appropriate degree to the estimated coefficients and retrieve estimates $\hat{A}$ and $\hat{\sigma}$ close to the nominal values.

Applying regular DMaps to this data set yields a parameterization of the Lorenz attractor. In order to extract the slow variable, we apply DMaps using Mahalanobis distances. The local covariances at each data point are computed using 100 short simulations of small duration $dt_{\text{cov}}$. The pseudo-inverse of the covariance matrix is computed using a singular value decomposition (SVD):

$$ C^{\dagger} = \sum_{m=1}^{d} s_m^{-1}\, v_m v_m^{\top}, \qquad (8) $$

where $s_m$ are the singular values and $v_m$ the right singular vectors. In this case, we use $d = n = 4$, since we need the last singular value, which corresponds to the slow variable, to obtain the correct embedding. Using this setup, the first non-trivial diffusion coordinate parameterizes the slow variable $x$, as seen in Figure 10.

While in theory the derivatives of the diffusion coordinate could be computed using central differences, the parameterization of $x$ is quite noisy and the estimated derivatives can be quite inaccurate. To ameliorate this difficulty, we used smoothing splines to fit a curve to the data and estimated the derivatives from the splines. The data and fitted curve are shown in Figure 10.

[Figure 10: The first diffusion coordinate parameterizes $x$.]

Using the new data set in Diffusion Map space, we again assume constant local drift and diffusivity at each starting point and use GMM to fit the following SDE:

$$ d\psi = \xi_1\, dt + \xi_2\, dW. $$

This estimate is compared to (7), either using the estimates $\hat{A}$ and $\hat{\sigma}$:

$$ d\psi = \left( \hat{A}(x - x^3)\, \frac{d\psi}{dx} + \frac{\hat{\sigma}}{2}\, \frac{d^2\psi}{dx^2} \right) dt + \sqrt{\hat{\sigma}}\, \frac{d\psi}{dx}\, dW, \qquad (9) $$

or directly using the estimates $\theta_1$ and $\theta_2$:

$$ d\psi = \left( \theta_1\, \frac{d\psi}{dx} + \frac{\theta_2^2}{2}\, \frac{d^2\psi}{dx^2} \right) dt + \theta_2\, \frac{d\psi}{dx}\, dW. $$

The results are shown in Figure 11 and Figure 12.

[Figure 11: Estimation of the drift coefficient in Diffusion Map space.]

[Figure 12: Estimation of the diffusion coefficient in Diffusion Map space.]

To demonstrate the dimension reduction from a higher-dimensional space, we can transform the slow variable $x$ by embedding it onto a curve in the plane (see Figure 13). The entire data set is now five-dimensional, and we can again apply DMaps with Mahalanobis distances as before. We again compute the pseudo-inverse using SVD as in (8), but now we use $d = n - 1$. Once we obtain the embedding $\psi_{tr}(x)$ of the transformed data (cf. Figure 14), we can estimate again drift and diffusion coefficients as above. The results are shown in Figure 15 and Figure 16.

[Figure 15: Estimation of the drift coefficient in Diffusion Map space.]

[Figure 16: Estimation of the diffusion coefficient in Diffusion Map space.]
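The smoothing-spline derivative estimation used above for $d\psi/dx$ and $d^2\psi/dx^2$ (as they enter (7) and (9)) might be implemented as follows; this is a sketch, and the spline order and smoothing factor are our assumptions:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def spline_derivatives(x, psi, s=None):
    """Smooth the noisy parameterization psi(x) with a spline and return
    dpsi/dx and d^2 psi/dx^2 evaluated at the sample points."""
    order = np.argsort(x)                       # spline needs increasing x
    spl = UnivariateSpline(x[order], psi[order], k=4, s=s)
    return spl.derivative(1)(x), spl.derivative(2)(x)
```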
[Figure 13: Nonlinear transformation of the slow variable $x$ onto a semicircle in the plane.]

The same method can be applied in the case of multiplicative noise:

$$ \frac{dx}{dt} = A(x - x^3) + \frac{\lambda}{\varepsilon}(1 + \nu x)\, y_2, \quad \frac{dy_1}{dt} = \frac{10}{\varepsilon^2}(y_2 - y_1), \quad \frac{dy_2}{dt} = \frac{1}{\varepsilon^2}(28 y_1 - y_2 - y_1 y_3), \quad \frac{dy_3}{dt} = \frac{1}{\varepsilon^2}\left(y_1 y_2 - \tfrac{8}{3} y_3\right). $$

The approximate dynamics for the slow variable in this case are given by [32]:

$$ dx = \left(Ax + Bx^2 + Cx^3\right) dt + \sqrt{\sigma_a + \sigma_b x + \sigma_c x^2}\; dW. $$

The constants from (6) and the simulation parameters remain unchanged, with the exception of $\nu = 1$ and $\Delta t = 0.01$. Applying Itô's lemma again, we obtain

$$ d\psi = \left[ \left(Ax + Bx^2 + Cx^3\right) \frac{d\psi}{dx} + \frac{\sigma_a + \sigma_b x + \sigma_c x^2}{2}\, \frac{d^2\psi}{dx^2} \right] dt + \sqrt{\sigma_a + \sigma_b x + \sigma_c x^2}\; \frac{d\psi}{dx}\, dW. $$

If we estimate the coefficients of the polynomials in the drift and diffusion terms using the data in the original space, we can also estimate these coefficients in diffusion map space using an equation analogous to (9):

$$ d\psi = \left[ \left(\hat{A}x + \hat{B}x^2 + \hat{C}x^3\right) \frac{d\psi}{dx} + \frac{\hat{\sigma}_a + \hat{\sigma}_b x + \hat{\sigma}_c x^2}{2}\, \frac{d^2\psi}{dx^2} \right] dt + \sqrt{\hat{\sigma}_a + \hat{\sigma}_b x + \hat{\sigma}_c x^2}\; \frac{d\psi}{dx}\, dW. $$

The other two estimation methods remain the same, i.e., either using $\theta_1$ and $\theta_2$ calculated at each starting point or calculating $\xi_1$ and $\xi_2$ at each starting point. The results are shown in Figure 17 and Figure 18.

[Figure 17: Estimation of the drift coefficient in Diffusion Map space.]

[Figure 18: Estimation of the diffusion coefficient in Diffusion Map space.]
[Figure 14: Embeddings of the original and transformed data sets ("Original Embedding" vs. "Transformed Embedding").]
Conclusions

We have confirmed that, in the limit of small time steps and large temperatures, trajectories produced by the Simulated Annealing / Random Walk Metropolis-Hastings algorithm are analogous to those that result from the Langevin equation, a global, stochastic optimization algorithm. We use SA as our "inner" optimizer that produces ensembles of brief simulation bursts, which contain information about an effective gradient of the objective function. Using dimension reduction techniques such as PCA or, in the case of nonlinear manifolds, Diffusion Maps, we can obtain the parameterization of the underlying low-dimensional manifold.

As our first example, we show that a two-dimensional, Bayesian model is effectively one-dimensional and that DMaps can retrieve the important parameter (a sort of "reaction coordinate") for the optimization. Combining SA with DMaps achieves a considerably faster approach to the maximum, compared with simple SA alone.

Starting from a two-dimensional SDE that corresponds to a Langevin equation for an objective function with two parameters, we derived the corresponding SDE in terms of the diffusion coordinates and approximated numerically the theoretical drift and diffusion coefficients. We then estimated the same drift and diffusion coefficients using parameter inference on the data set that came from the application of DMaps on the original trajectories. Additionally, we can transform the original data set through a nonlinear transformation by mapping it onto a portion of the surface of a sphere and apply DMaps on the transformed data set using Mahalanobis distances. In both cases the estimated parameters are closely comparable to the theoretical ones.

For illustration purposes, we constructed a three-dimensional objective function that has a strongly attracting, two-dimensional manifold. This work constitutes a simple "proof of concept" acceleration demonstration for the classes of optimization problems we consider. It is also an illustration of the tools required to perform scientific computations (here, gradient descent) in a latent variable space, a space parameterized by on-the-fly processing of the data produced by the "inner optimizer." Fast implementations of the techniques (like Geometric Harmonics) for translating back and forth between the original space, in which the problem was given, and the latent space, where the coarse optimization steps are taken, are crucial for the usefulness of the approach. The true benefits of this approach and its potential should be explored by applying it to truly high-dimensional problems where other methods slow down considerably. This is the subject of current work.
Acknowledgments
This work was partially supported by the DARPA Lagrange program and an ARO MURI. The authors gratefully acknowledge many helpful discussions with Prof. Christian Kuehn of T.U. Munich, especially for the case where the noise term arises through the coupling with chaotic dynamical systems.

G.P.'s research is supported by the EPSRC, grants EP/P031587/1, EP/L024926/1, and EP/L020564/1, and by JPMorgan Chase & Co. Any views or opinions expressed herein are solely those of the authors listed, and may differ from the views and opinions expressed by JPMorgan Chase & Co. or its affiliates. This material is not a product of the Research Department of J.P. Morgan Securities LLC. This material does not constitute a solicitation or offer in any jurisdiction.

A.D. was supported by the Lloyd's Register Foundation Programme on Data-Centric Engineering and by the Alan Turing Institute under the EPSRC grant [EP/N510129/1].
References

[1] Y. Aït-Sahalia, Maximum likelihood estimation of discretely sampled diffusions: A closed-form approximation approach, Econometrica, (2002), 223–262.
[2] Y. Aït-Sahalia, Closed-form likelihood expansions for multivariate diffusions, The Annals of Statistics, (2008), 906–937.
[3] R. Barton, Metamodeling: a state of the art review, in Simulation Conference Proceedings, 1994. Winter, IEEE, 1994, 237–244.
[4] K. Chan, G. Karolyi, F. Longstaff and A. Sanders, An empirical comparison of alternative models of the short-term interest rate, The Journal of Finance, (1992), 1209–1227.
[5] E. Chiavazzo, C. Gear, C. Dsilva, N. Rabin and I. Kevrekidis, Reduced models in chemical kinetics via nonlinear data-mining, Processes, (2014), 112–140.
[6] R. Coifman and S. Lafon, Diffusion maps, Applied and Computational Harmonic Analysis, (2006), 5–30.
[7] R. Coifman and S. Lafon, Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions, Applied and Computational Harmonic Analysis, (2006), 31–52.
[8] R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner and S. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, Proceedings of the National Academy of Sciences of the United States of America, (2005), 7426–7431.
[9] A. Conn, N. Gould and P. Toint, Trust Region Methods, SIAM, 2000.
[10] A. Conn, K. Scheinberg and L. Vicente, Introduction to Derivative-Free Optimization, SIAM, 2009.
[11] C. Dsilva, R. Talmon, R. Coifman and I. Kevrekidis, Parsimonious representation of nonlinear dynamical systems through manifold learning: A chemotaxis case study, Applied and Computational Harmonic Analysis, (2018), 759–773.
[12] C. Dsilva, R. Talmon, N. Rabin, R. Coifman and I. Kevrekidis, Nonlinear intrinsic variables and state reconstruction in multiscale simulations, The Journal of Chemical Physics, (2013), 184109.
[13] A. Duncan, G. Pavliotis and K. Zygalakis, Nonreversible Langevin samplers: Splitting schemes, analysis and implementation, arXiv preprint arXiv:1701.04247.
[14] R. Eberhart and J. Kennedy, A new optimizer using particle swarm theory, in Micro Machine and Human Science, 1995. MHS'95., Proceedings of the Sixth International Symposium on, IEEE, 1995, 39–43.
[15] M. Fathi and G. Stoltz, Improving dynamical properties of metropolized discretizations of overdamped Langevin dynamics, Numerische Mathematik, (2017), 545–602.
[16] C. Gear, D. Givon and I. Kevrekidis, Computing on virtual slow manifolds of fast stochastic systems, JNAIAM, (2010), 61–72.
[17] C. Gear and I. Kevrekidis, Projective methods for stiff differential equations: problems with gaps in their eigenvalue spectrum, SIAM J. Sci. Comp., (2003), 1091–1106.
[18] S. Geman and C.-R. Hwang, Diffusions for global optimization, SIAM Journal on Control and Optimization, (1986), 1031–1043.
[19] B. Gidas, Global optimization via the Langevin equation, in Decision and Control, 1985 24th IEEE Conference on, vol. 24, IEEE, 1985, 774–778.
[20] J. Gradišek, S. Siegert, R. Friedrich and I. Grabec, Analysis of time series from stochastic processes, Physical Review E, (2000), 3146.
[21] L. Hansen, Large sample properties of generalized method of moments estimators, Econometrica: Journal of the Econometric Society, (1982), 1029–1054.
[22] J. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press, 1992.
[23] R. Hooke and T. Jeeves, "Direct search" solution of numerical and statistical problems, Journal of the ACM (JACM), (1961), 212–229.
[24] W. Huyer and A. Neumaier, Global optimization by multilevel coordinate search, Journal of Global Optimization, (1999), 331–355.
[25] I. Jolliffe, Principal component analysis and factor analysis, in Principal Component Analysis, Springer, 1986, 115–128.
[26] D. Jones, A taxonomy of global optimization methods based on response surfaces, Journal of Global Optimization, (2001), 345–383.
[27] D. Jones, C. Perttunen and B. Stuckman, Lipschitzian optimization without the Lipschitz constant, Journal of Optimization Theory and Applications, (1993), 157–181.
[28] S. Kalliadasis, S. Krumscheid and G. A. Pavliotis, A new framework for extracting coarse-grained models from time series with multiscale structure, J. Comput. Phys., (2015), 314–328.
[29] C. Kelley, Iterative Methods for Optimization, SIAM, 1999.
[30] I. Kevrekidis, C. Gear, J. Hyman, P. Kevrekidis, O. Runborg and C. Theodoropoulos, Equation-free, coarse-grained multiscale computation: Enabling microscopic simulators to perform system-level analysis, Communications in Mathematical Sciences, (2003), 715–762.
[31] S. Kirkpatrick, C. Gelatt and M. Vecchi, Optimization by simulated annealing, Science, (1983), 671–680.
[32] S. Krumscheid, G. Pavliotis and S. Kalliadasis, Semiparametric drift and diffusion estimation for multiscale diffusions, Multiscale Modeling & Simulation, (2013), 442–473.
[33] S. Krumscheid, M. Pradas, G. Pavliotis and S. Kalliadasis, Data-driven coarse graining in action: Modeling and prediction of complex systems, Phys. Rev. E, (2015), 042139.
[34] S. Lafon, Y. Keller and R. Coifman, Data fusion and multicue data matching by diffusion maps, IEEE Transactions on Pattern Analysis and Machine Intelligence, (2006), 1784–1797.
[35] G. Li, C. Rosenthal and H. Rabitz, High dimensional model representations, The Journal of Physical Chemistry A, (2001), 7765–7777.
[36] M. Locatelli, Simulated annealing algorithms for continuous global optimization, in Handbook of Global Optimization, Springer, 2002, 179–229.
[37] I. Melbourne and A. Stuart, A note on diffusion limits of chaotic skew-product flows, Nonlinearity, (2011), 1361.
[38] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller and E. Teller, Equation of state calculations by fast computing machines, The Journal of Chemical Physics, (1953), 1087–1092.
[39] M. Montgomery, R. Meglen and N. Damrauer, General method for the dimension reduction of adaptive control experiments, The Journal of Physical Chemistry A, (2006), 6391–6394.
[40] M. Montgomery, R. Meglen and N. Damrauer, General method for reducing adaptive laser pulse-shaping experiments to a single control variable, The Journal of Physical Chemistry A, (2007), 5126–5129.
[41] J. Nelder and R. Mead, A simplex method for function minimization, The Computer Journal, (1965), 308–313.
[42] B. Øksendal, Stochastic Differential Equations, Springer, 2003.
[43] A. Papavasiliou, Particle filters for multiscale diffusions, in Conférence Oxford sur les méthodes de Monte Carlo séquentielles, vol. 19 of ESAIM Proc., EDP Sci., Les Ulis, 2007, 108–114.
[44] G. Pavliotis, Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations, vol. 60 of Texts in Applied Mathematics, Springer, New York, 2014.
[45] G. Pavliotis and A. Stuart, Parameter estimation for multiscale diffusions, J. Stat. Phys., (2007), 741–781.
[46] L. Rios and N. Sahinidis, Derivative-free optimization: a review of algorithms and comparison of software implementations, Journal of Global Optimization, (2013), 1247–1293.
[47] J. Roslund and H. Rabitz, Dynamic dimensionality identification for quantum control, Physical Review Letters, (2014), 143001.
[48] C. Schillings and A. Stuart, Analysis of the ensemble Kalman filter for inverse problems, SIAM Journal on Numerical Analysis, (2017), 1264–1290.
[49] S. Shan and G. Wang, Survey of modeling and optimization strategies to solve high-dimensional design problems with computationally-expensive black-box functions, Structural and Multidisciplinary Optimization, (2010), 219–241.
[50] A. Singer and R. Coifman, Non-linear independent component analysis with diffusion maps, Applied and Computational Harmonic Analysis, (2008), 226–239.
[51] E. Vanden-Eijnden, Fast communications: Numerical techniques for multiscale dynamical systems with stochastic effects, Commun. in Math. Sci., (2003), 385–391.
[52] E. Weinan, B. Engquist, X. Li, W. Ren and E. Vanden-Eijnden, Heterogeneous multiscale methods: a review, Communications in Computational Physics, 2 (2007), 367–450.