Manifold Learning for Accelerating Coarse-Grained Optimization
Dmitry Pozharskiy, Noah J. Wichrowski, Andrew B. Duncan, Grigorios A. Pavliotis, Ioannis G. Kevrekidis
Abstract
Algorithms proposed for solving high-dimensional optimization problems with no derivative information frequently encounter the "curse of dimensionality," becoming ineffective as the dimension of the parameter space grows. One feature of a subclass of such problems that are effectively low-dimensional is that only a few parameters (or combinations thereof) are important for the optimization and must be explored in detail. Knowing these parameters/combinations in advance would greatly simplify the problem and its solution. We propose the data-driven construction of an effective (coarse-grained, "trend") optimizer, based on data obtained from ensembles of brief simulation bursts with an "inner" optimization algorithm, that has the potential to accelerate the exploration of the parameter space. The trajectories of this "effective optimizer" quickly become attracted onto a slow manifold parameterized by the few relevant parameter combinations. We obtain the parameterization of this low-dimensional, effective optimization manifold on the fly using data mining/manifold learning techniques on the results of simulation (inner optimizer iteration) burst ensembles and exploit it locally to "jump" forward along this manifold. As a result, we can bias the exploration of the parameter space towards the few, important directions and, through this "wrapper algorithm," speed up the convergence of traditional optimization algorithms.
The design of complex engineering systems often leads to high-dimensional optimization problems with computationally expensive objective function evaluations, often given in the form of a (computational) black box. In such cases the derivative information may be unavailable or impractical to obtain in closed form, too expensive to compute numerically, or unreliable to estimate if the objective function is noisy. These difficulties may render derivative-based optimization methods impractical for such problems, and so-called derivative-free methods must be used.

The first such derivative-free algorithms appeared quite early: the direct search method [23] and the Nelder-Mead algorithm [41]. Since then a variety of algorithms have been proposed, including trust-region methods [9]; deterministic global algorithms [27, 24]; algorithms utilizing surrogate models [3, 26]; and stochastic global methods such as genetic algorithms [22], simulated annealing [31], and particle swarm optimization [14]. A broader overview of the aforementioned methods, along with additional ones used for high-dimensional problems, can be found in [10, 49, 46].

Many of these methods have been applied successfully to low-dimensional problems where derivative information is not available; once the dimension of the parameter space grows, however, they run into the "curse of dimensionality," where the required sampling of the parameter space grows exponentially or the convergence becomes too slow. There is case-dependent evidence that, for certain classes of problems, out of the vast parameter space only a few parameters or combinations of parameters suffice to describe most of the variance in the objective function values, with the rest having little effect [35, 39, 40, 47]. It is observations of this nature that we aim to exploit in our proposed method, borrowing additional ideas from the field of fast/slow (singularly perturbed) dynamical systems and data mining/manifold learning techniques.

It has been observed that complex dynamical systems, such as molecular dynamics (MD) simulations or complex reaction network dynamics, may possess a low-dimensional, attracting, "slow" manifold. The dynamics of the system after being initialized at a random state quickly approach the slow manifold and then evolve "along it" (close to it). A reduced model of the system in terms of the slow variables parameterizing this manifold would greatly simplify the understanding of the system's behavior (and its computation). However, such a model is often unavailable in closed form. In previous work [30] we have shown how short "bursts" of a microscopic simulator can evolve the system close to and then "along" an underlying slow manifold. Essentially, after the short burst is attracted to the slow manifold, we can observe the evolution on a restricted set of coarse-grained observables that parameterize the slow manifold when these coarse variables are known a priori. We can then perform a "large" time step by extrapolating the few macroscopic variable values and lifting the new state back into the full space to initialize a new set of computation bursts for the microscopic simulator. This can achieve significant acceleration of the effective complex system dynamics.
If the macroscopic variables are not known, then a reduced description of the manifold can be derived on the fly by using data-driven dimensionality reduction techniques, which uncover the few intrinsic variables that are adequate to describe a high-dimensional data set locally.

The high-dimensional optimization problem can be treated in the same vein by making two assumptions: (a) we have an "inner optimizer," analogous to a microscopic simulator, that samples the parameter space and produces a series of "pseudo"-Langevin trajectories; (b) we postulate that there exists an attracting, slow manifold which can be parameterized in terms of a few parameters or combinations thereof, and the inner optimizer is quickly attracted to it. In the following sections we will show that the trajectory produced by a specific version of simulated annealing (SA)—our "inner optimizer"—at constant temperature can be described by an effective Stochastic Differential Equation (SDE) whose drift contains information about the local gradient of the objective function. To be precise, since we do not vary the temperature as the optimization progresses, our "inner optimizer" is a random walk Metropolis-Hastings (RWMH) algorithm. Running short bursts of this constant-temperature SA, we create "pseudo" dynamics that can be thought of as the (approximate) dynamics of an actual dynamical system. After initializing at a random point in parameter space, the algorithm is quickly attracted to the low-dimensional manifold, and by applying either linear or nonlinear dimensionality reduction techniques we can obtain a useful local parameterization of this manifold. We can estimate the drift of the effective SDE using established parametric inference methods and thus estimate the local effective gradient of the objective function [48]. This can be used subsequently in an algorithm such as gradient descent in a reduced parameter space. The new point is lifted back to full space, using local Principal Component Analysis or geometric harmonics [7], and the entire procedure is repeated, leading to an acceleration of the overall optimization.

The Langevin equation was introduced as a stochastic global optimization method shortly after the first appearance of simulated annealing [19, 18]. It is a gradient descent method with the addition of a "thermal" noise that allows the trajectories to escape local minima and thus enhances their ability to explore the parameter space. However, it may become impractical for the problems we are considering since, as discussed above, the gradient information is explicitly unavailable. The equation reads

$$ dx_t = -\nabla f(x_t)\, dt + \sqrt{2T}\, dW_t, \qquad (1) $$

where $x_t \in \mathbb{R}^n$, $f$ is the objective function, $W_t$ is an $n$-dimensional standard Brownian motion, and $T$ is the temperature parameter. It can be shown that, under an appropriate temperature schedule $T(t)$, the algorithm converges weakly to the global minimum [18]. The equilibrium distribution is the Gibbs distribution, with density

$$ \pi(x; T) = \frac{\exp\left[-f(x)/T\right]}{\int_{\mathbb{R}^n} \exp\left[-f(y)/T\right]\, dy}. $$

The simulated annealing algorithm admits the same equilibrium distribution at an equal $T$ and can be viewed as an adaptation of the Metropolis-Hastings algorithm [38] with time-dependent acceptance probability due to the temperature schedule.
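For concreteness, a minimal sketch (our own illustration, not part of the original text) of an Euler-Maruyama discretization of (1) follows; the $\sqrt{2T\,\delta t}$ noise scaling matches the Gibbs density stated above, and the function name and arguments are our choices.

```python
import numpy as np

def langevin_em(grad_f, x0, T, dt, n_steps, rng=None):
    """Euler-Maruyama discretization of eq. (1):
    x_{k+1} = x_k - grad_f(x_k)*dt + sqrt(2*T*dt)*xi_k,  xi_k ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        x = x - grad_f(x) * dt + np.sqrt(2.0 * T * dt) * rng.standard_normal(x.shape)
        path.append(x.copy())
    return np.asarray(path)
```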
The acceptance probability is given by

$$ a = \min\left(1,\ \exp\left[-\frac{f(y) - f(x)}{T}\right]\right), $$

where $x$ is the current point and $y$ is the new trial point, which comes from a symmetric proposal density $g(y \mid x)$. Hence, better points are always accepted, and worse points are accepted with probability $0 < a < 1$, which is greater at higher temperatures $T$.

Consider now the simple, one-dimensional case at constant temperature $T$ (a RWMH protocol) using proposal density $g(y \mid x) = \mathcal{N}(x,\ 2T\,\delta t)$ and the acceptance probability defined above. It can be shown [13, 15] that, in the limit of small time steps $\delta t$ or large temperatures $T$, the density of accepted points after one step converges to a normal distribution $\mathcal{N}(x - f'(x)\,\delta t,\ 2T\,\delta t)$, which corresponds to the density of a new point under an Euler-Maruyama discretization of the Langevin equation [36]. Figure 1 shows the density of sample points after one step and after 100 steps using both algorithms. The two distributions visually almost coincide.

Using the above procedure we can obtain "pseudo" trajectories in the parameter space that are analogous to the trajectories produced by the Langevin equation and contain information about an effective gradient of the objective function without explicitly computing it.
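A constant-temperature RWMH burst—the "inner optimizer" just described—might look as follows. This is a sketch under the conventions above (proposal variance $2T\,\delta t$, acceptance probability $\min(1, e^{-(f(y)-f(x))/T})$); the helper name and the `n_accept` stopping rule are our own.

```python
import numpy as np

def rwmh_burst(f, x0, T, dt, n_accept=1000, rng=None):
    """Constant-temperature SA (random walk Metropolis-Hastings).
    Gaussian proposals of variance 2*T*dt; returns the accepted points."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    fx = f(x)
    accepted = []
    while len(accepted) < n_accept:
        y = x + np.sqrt(2.0 * T * dt) * rng.standard_normal(x.shape)
        fy = f(y)
        # accept with prob. min(1, exp(-(f(y)-f(x))/T)); always true if fy <= fx
        if np.log(rng.uniform()) < -(fy - fx) / T:
            x, fx = y, fy
            accepted.append(x.copy())
    return np.asarray(accepted)
```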
In our previous discussion of dynamical systems, we mentioned that the long-term dynamics of the full system can often be usefully restricted to the dynamics of a few slow variables. These variables can be a collection of macroscopic variables that are available from our intimate knowledge of the system, or they can be estimated "on the fly" using dimension reduction techniques. Such techniques can be applied to large, high-dimensional data sets to uncover the few intrinsic variables that are sufficient to describe most of the system's long-term behavior. We will use the latter approach in our method, since it can be quite challenging to identify beforehand the few parameters that are important to the optimization.

One of the most common methods for dimension reduction is Principal Component Analysis (PCA) [25]. It tries to identify a hyperplane that best fits the data by finding an orthogonal basis, where the first vector points in the direction of maximum variance in the data set and all subsequent vectors maximize variance in orthogonal directions.
The basis vectors are called principal components, and they can be found by an eigenvalue decomposition of the covariance matrix of the data set after it has been centered. If the eigenvalues are sorted and the relationship $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k \gg \lambda_{k+1} \geq \cdots \geq \lambda_n$ holds ($n$ is the dimension of the original space), then there is a gap in the eigenvalue spectrum and we can reduce the dimensionality of our data set by projecting it onto the first $k$ principal components. PCA is a well-documented technique, but its major limitation is that it can parameterize only linear manifolds.

[Figure 1: Evolution in time of the probability density of current points using either the Langevin equation or SA at constant $T$; (a) after one step, (b) after 100 steps. Starting point $x = 1$, $T = 0.5$. The two densities visually almost coincide.]

Nonlinear manifold learning techniques are required if the data lie on a curved manifold. One such method is Diffusion Maps (DMaps) [8]. For a data set of size $m$, the algorithm starts by constructing a weight matrix $W \in \mathbb{R}^{m \times m}$:

$$ W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{\varepsilon}\right), \qquad i, j = 1, \ldots, m, $$

where $x_i, x_j \in \mathbb{R}^n$, $\|\cdot\|$ is an appropriate norm, and $\varepsilon$ is a characteristic distance between data points. Next, we construct the diagonal matrix $D \in \mathbb{R}^{m \times m}$ with $D_{ii} = \sum_j W_{ij}$ and compute $\widetilde{W} = D^{-\alpha} W D^{-\alpha}$, where $0 \leq \alpha \leq 1$. We then construct the diagonal matrix $\widetilde{D} \in \mathbb{R}^{m \times m}$ with $\widetilde{D}_{ii} = \sum_j \widetilde{W}_{ij}$ and compute the row-stochastic matrix $K = \widetilde{D}^{-1} \widetilde{W}$. The matrix $K$ is the transition probability matrix of a Markov chain defined on the data set, whose states are the individual data points.

The eigenvectors $\psi_0, \psi_1, \ldots, \psi_{m-1}$ of the matrix $K$ approximate the eigenfunctions of the Laplace-Beltrami operator on the sampled manifold and thus can be used to parameterize the manifold [6]. Since $K$ is row-stochastic, the first eigenvector $\psi_0$ is trivial: all ones. The subsequent eigenvectors are called diffusion coordinates and have corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_{m-1}$. The original data points are mapped to their diffusion coordinates as

$$ x \mapsto \left(\lambda_1^{\tau} \psi_1(x),\ \ldots,\ \lambda_{m-1}^{\tau} \psi_{m-1}(x)\right), $$

where $\psi_i(x) \in \mathbb{R}$ represents the entry of eigenvector $\psi_i$ corresponding to the point $x$ from the original data. In the following sections we take $\tau = 0$. The distance between two mapped points is called the diffusion distance, and it represents the similarity between two points in the original space. If two points are nearby in the diffusion space using a Euclidean metric, it implies that there are multiple short paths to transition from one point to the other in the original space. Similarly to PCA, if there is a spectral gap we can map our original data set to $k$ "important" eigenvectors. However, attention must be paid to the fact that some eigenvectors may be higher harmonics of previously discovered ones [11]. The nonlinear manifold is parameterized by the few eigenvectors that correspond to the largest eigenvalues and that are not themselves such higher harmonics. These diffusion coordinates are the important intrinsic variables that parameterize the nonlinear manifold and indicate its dimensionality.

We illustrate DMaps by applying it to a "Swiss roll" data set. This is a three-dimensional data set, but only two variables are sufficient to describe every point: the height and the arclength along the roll. Figure 2a shows the data set colored by the first non-trivial eigenvector, which parameterizes the arclength, while Figure 2b shows the data set colored by the second non-trivial eigenvector, which parameterizes the height. Figure 2c shows the data set "unrolled" in the diffusion map space.

In the case that the original data set comes from independent stochastic processes, as in our time series simulator, but we observe it through some nonlinear transformation $y = f(x)$, we can retrieve the original manifold using Mahalanobis distances [50, 12]. It can be shown that if $x_i, x_j$ are two data points in the original space and $y_i, y_j$ are their nonlinear transformations, then

$$ \|x_i - x_j\|^2 = \frac{1}{2} (y_i - y_j)^{\top} \left[ (JJ^{\top})^{-1}(y_i) + (JJ^{\top})^{-1}(y_j) \right] (y_i - y_j) + O\left(\|y_i - y_j\|^4\right), $$

where $J$ is the Jacobian matrix of the transformation. In practice, the matrix $JJ^{\top}$ is approximated by a covariance matrix, which is estimated by running several short bursts of our simulator around each data point.
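A compact sketch of the DMaps construction just described (our own illustration; `eps`, `alpha`, and `n_coords` are assumed tuning knobs, and a dense eigendecomposition is used for simplicity):

```python
import numpy as np
from scipy.spatial.distance import cdist

def diffusion_maps(X, eps, alpha=1.0, n_coords=2):
    """Diffusion Maps as described above. X: (m, n) data matrix; eps: kernel
    scale; alpha: density normalization exponent. Returns the leading
    nontrivial eigenvalues and diffusion coordinates (tau = 0)."""
    W = np.exp(-cdist(X, X, "sqeuclidean") / eps)
    d = W.sum(axis=1)
    W_t = W / np.outer(d**alpha, d**alpha)        # W~ = D^-alpha W D^-alpha
    K = W_t / W_t.sum(axis=1, keepdims=True)      # row-stochastic Markov matrix
    evals, evecs = np.linalg.eig(K)
    order = np.argsort(-evals.real)
    evals, evecs = evals.real[order], evecs.real[:, order]
    return evals[1:n_coords + 1], evecs[:, 1:n_coords + 1]  # skip trivial psi_0
```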
Since the manifold has lower dimensionality than the observed space, the covariance matrix will be rank deficient, and a pseudo-inverse must be used to compute the Mahalanobis distances.
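A sketch of the covariance pseudo-inverse and the resulting squared Mahalanobis distance (cf. the SVD-based pseudo-inverse of eq. (8) below); the truncation tolerance `rtol` is our own assumption:

```python
import numpy as np

def pinv_cov(bursts, rtol=1e-12):
    """Pseudo-inverse of a local covariance estimated from short bursts
    around one data point, via truncated SVD."""
    C = np.cov(np.asarray(bursts), rowvar=False)
    U, s, _ = np.linalg.svd(C)
    keep = s > rtol * s[0]                # drop rank-deficient directions
    return (U[:, keep] / s[keep]) @ U[:, keep].T

def mahalanobis_sq(yi, yj, Ci_pinv, Cj_pinv):
    """Squared Mahalanobis distance between observed points yi and yj."""
    d = np.asarray(yi) - np.asarray(yj)
    return 0.5 * d @ (Ci_pinv + Cj_pinv) @ d
```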
We mentioned above that time series from the SA algorithm can be considered as corresponding to those of an effective stochastic differential equation.

[Figure 2: Applying Diffusion Maps to the Swiss roll data set. (a) The first (nontrivial) diffusion coordinate $\psi_1$ parameterizes arclength along the roll. (b) The second diffusion coordinate $\psi_2$ parameterizes height. (c) The two-dimensional nature of the data set is uncovered in the diffusion map space.]

Hence, an essential component of our algorithm is the estimation of drift and diffusion coefficients of a stochastic process from local path data. Assume the one-dimensional stochastic process $dx(t) = h(x(t))\, dt + \sigma(x(t))\, dW$, with $W$ a standard Brownian motion. The coefficients can be estimated either from their statistical definitions [20], i.e.,

$$ h\left(x(t)\right) = \lim_{\tau \to 0} \frac{\left\langle x(t+\tau) - x(t) \right\rangle}{\tau}, \qquad \sigma^2\left(x(t)\right) = \lim_{\tau \to 0} \frac{\left\langle \left(x(t+\tau) - x(t)\right)^2 \right\rangle}{\tau}, \qquad (2) $$

or using the Generalized Method of Moments (GMM) [21, 4], where moment conditions can be easily derived from an Euler-Maruyama discretization of the stochastic process. The above methods are more fitting if the stochastic process is realized as multiple short trajectories starting from the same initial conditions. On the other hand, if we are given a single, long trajectory, then maximum likelihood methods [1, 2] are more suitable.

The maximum likelihood estimator (MLE) for the drift coefficients is known to be asymptotically unbiased [44]; i.e., as the length of the observed path increases, the MLE converges to the true values of the coefficients that appear in the drift. This is no longer true in the presence of a multiscale structure, i.e., when we want to estimate parameters in a stochastic coarse-grained model, given observations of the slow variable from the full dynamics. Indeed, it was shown rigorously in [43, 45] that in this case the MLE becomes asymptotically biased and that subsampling at an appropriate rate, between the two characteristic time scales of the dynamics, is needed in order to estimate accurately the parameters in the coarse-grained model. This is particularly relevant for us, since the optimization/estimation methodology is based on the assumption of the existence of a reduced model that accurately describes the system we are interested in. See, for example, the ODE driven by a sped-up Lorenz 63 ODE that is studied in a later section.
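The finite-$\tau$ version of the estimators (2), applied to the endpoints of an ensemble of bursts launched from a common point, can be as simple as the following sketch (our own illustration; for small $\tau$ the drift contribution to the second moment is assumed negligible):

```python
import numpy as np

def drift_diffusion_from_bursts(x0, x_tau, tau):
    """Finite-tau estimators corresponding to eq. (2), using the endpoints
    x_tau of an ensemble of short bursts all started at x0."""
    dx = np.asarray(x_tau) - x0
    drift = dx.mean() / tau            # h(x0)      ~= <x(t+tau) - x(t)> / tau
    diff_sq = (dx**2).mean() / tau     # sigma^2(x0) ~= <(dx)^2> / tau
    return drift, diff_sq
```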
Before delving into the complete algorithm, which involves estimation of effective gradients in the low-dimensional embedding, we proceed with a simpler example where the objective function is two-dimensional but the optimization process can be effectively one-dimensional. Consider an objective function of the form

$$ f(x, y) = C \exp\left[ -(x^2 + y^2)^2 + c_1 (x^2 + y^2) - c_2 \sqrt{(x - 2)^2 + y^2} \right], \qquad (3) $$

where $c_1, c_2 > 0$ are fixed constants and $C$ is a proportionality constant selected to normalize the maximum objective value to unity. Since (3) arises as the posterior density of a Bayesian parameter estimation problem, our algorithm in reality minimizes $-f$, but we present the results from the perspective of maximizing $f$.

Plotting the objective (cf. Figure 3) reveals that $f \approx 0$ away from the circle $x^2 + y^2 = 4$, where the function exhibits a sharply peaked "ridge" of greater function values. If we attempt to maximize $f$ via Simulated Annealing (or, at constant temperature, RWMH), the algorithm will quickly be attracted to the ridge and then slowly proceed along it towards the maximum. Trial points far away from the ridge are usually worse than the current accepted point and are highly likely to be rejected. Hence, the accepted points lie on an essentially one-dimensional curve in the parameter space.

We can apply Diffusion Maps to these data, and the first diffusion coordinate will parameterize the curve. Once we have obtained the one-dimensional embedding, we can extrapolate in the direction in which the objective function increases. The last step involves returning from the diffusion space back to the original parameter space. This procedure is called "lifting," and a variety of methods are available, such as Laplacian Pyramids [12], Geometric Harmonics [7, 34], and Radial Basis Functions [5]. We will use geometric harmonics in the present work. After obtaining the projected point back in parameter space, we perform another short run of SA from that point and repeat the procedure. To summarize the algorithm (a code sketch of one such coarse iteration is given below):

1. Pick an initial point, possibly near the "ridge" of the objective function.
2. Run a short burst of Simulated Annealing until a prescribed number of points has been accepted (1000 here).
3. Discard any outliers that may be far away from the ridge.
4. Apply Diffusion Maps to the remaining two-dimensional data set and obtain a nonlinear embedding. The embedding is one-dimensional, and the diffusion coordinate can be thought of as corresponding to the arclength along the ridge.
5. In the diffusion space, project to a new point in the direction that increases the objective function.
6. Lift the new point back to full space via geometric harmonics. The resulting point is expected to be close to the ridge, but even if it is not, the new SA run will quickly be attracted to it.
7. Return to Step 2 and repeat the procedure.

The results for this illustrative example are shown in Figure 3. Each short burst of SA is shown as a "cloud" of red points. The lifted points are shown in yellow, and we can see that geometric harmonics perform well in this case, as the lifted points are still close to the ridge. The maximum of the function is depicted by the purple diamond. Additionally, a single run of SA using the same total number of function evaluations was performed (in green). Figure 4 compares the running maximum objective value achieved by the two approaches. It is clear that, for this simple illustration, SA/RWMH combined with Diffusion Maps approaches the maximum substantially more quickly. Building on the body of ideas developed in [16, 17] as well as [51, 52], the combination of an "inner optimizer" with data mining of its local results, followed by taking larger "outer" optimization steps in the identified reduced space, has the potential to significantly accelerate the overall computational optimization.

[Figure 3: One complete run of the algorithm that approaches the global maximum. Six total "coarse iterations" (shown in red) were performed. In addition, a single run of SA using the same number of function evaluations is shown in green. Our algorithm visibly approaches the maximum much faster.]

[Figure 4: A comparison of the evolving maximum objective value for both methods. SA combined with DMaps needs only a fraction of the total function evaluations compared to simple SA.]
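One "coarse iteration" of the wrapper algorithm above could be organized as follows. This is a hedged sketch, not the paper's implementation: `embed` and `lift` are hypothetical user-supplied black boxes standing in for the DMaps embedding and the geometric-harmonics lifting, and `rwmh_burst` is the inner-optimizer sketch given earlier.

```python
import numpy as np

def coarse_step(f, x_start, T, dt, jump, embed, lift, n_accept=1000):
    """One coarse iteration: burst -> embed -> jump along the manifold -> lift.
    embed: point cloud -> 1D diffusion coordinate per point.
    lift:  a coordinate value -> a point in the full parameter space."""
    cloud = rwmh_burst(f, x_start, T, dt, n_accept)   # inner-optimizer burst
    psi = embed(cloud)                                # 1D coordinate of each point
    fvals = np.array([f(p) for p in cloud])
    # orient the coordinate so that increasing psi increases f
    uphill = np.sign(np.corrcoef(psi, fvals)[0, 1])
    target = psi.max() + jump if uphill > 0 else psi.min() - jump
    return lift(target)                               # new starting point
```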
In this section we will demonstrate how one can estimate the theoretically expected drift and diffusion coefficients after the trajectories of a stochastic process have been transformed using DMaps. As mentioned before, these coefficients correspond to an effective gradient and will provide us with an approximation of the "correct" ascent (resp. descent) direction along which to optimize (maximize, resp. minimize) in the low-dimensional space. We begin with a two-dimensional SDE, analogous to the Langevin equation:

$$ dx = \mu(x, y)\, dt + \sqrt{2T}\, dW_1, \qquad dy = \nu(x, y)\, dt + \sqrt{2T}\, dW_2, \qquad (4) $$

where $W = [W_1\ W_2]^{\top}$ collects independent Brownian motions. Assume now that the data in the original $(x, y)$ space are observed through the leading Diffusion Map coordinates, so that the trajectories are now written in terms of the diffusion coordinates $\psi_1(x, y)$ and $\psi_2(x, y)$. In order to rewrite our system of SDEs in terms of the new variables, we apply the multidimensional Itô lemma [42, 44]; e.g., for $\psi_1$ we have:

$$ d\psi_1 = \left( (\nabla \psi_1)^{\top} \begin{bmatrix} \mu \\ \nu \end{bmatrix} + \frac{1}{2} \operatorname{Tr}\left[ \Sigma^{\top} (H\psi_1)\, \Sigma \right] \right) dt + (\nabla \psi_1)^{\top} \Sigma\, dW $$
$$ = \left[ \left( \frac{\partial \psi_1}{\partial x}\, \mu + \frac{\partial \psi_1}{\partial y}\, \nu \right) + T \left( \frac{\partial^2 \psi_1}{\partial x^2} + \frac{\partial^2 \psi_1}{\partial y^2} \right) \right] dt + \sqrt{2T}\, \sqrt{ \left( \frac{\partial \psi_1}{\partial x} \right)^2 + \left( \frac{\partial \psi_1}{\partial y} \right)^2 }\; d\widetilde{W}_1, $$

where $\Sigma = \sqrt{2T}\, I$ is the noise matrix from (4), $H\psi_1$ is the Hessian matrix of second partial derivatives of $\psi_1$, and $\widetilde{W}_1$ is a new Brownian motion.

In order to simplify the estimation, we set up a two-dimensional grid in the $(x, y)$ space. The partial derivatives required for the theoretical computation of the coefficients are approximated numerically at the grid points using centered differences. From every grid point, $N$ trajectories are simulated via SA for a specified time $\Delta t$, which is also the time step in the estimation computed via (2). The simulation time step between successive points on a trajectory is $\delta t \ll \Delta t$.

The data set that is then "passed" to the DMaps algorithm, in order to find the new embedding, consists of the cloud of final points from each trajectory, the initial grid points, and all points where the partial derivatives are estimated. Afterwards, we have a new grid in the $(\psi_1, \psi_2)$ space, with every partial derivative estimated on this grid. We can assume that in a small neighborhood of every grid point the partial derivatives of the diffusion coordinates are approximately constant. Given that, we perform a separate estimation at each grid point of the following system of SDEs:

$$ d\psi_1 = \theta_1\, dt + \theta_2\, d\widetilde{W}_1, \qquad d\psi_2 = \theta_3\, dt + \theta_4\, d\widetilde{W}_2, $$

where $\theta_1$ and $\theta_2$ correspond to

$$ \theta_1 = \left( \frac{\partial \psi_1}{\partial x}\, \mu + \frac{\partial \psi_1}{\partial y}\, \nu \right) + T \left( \frac{\partial^2 \psi_1}{\partial x^2} + \frac{\partial^2 \psi_1}{\partial y^2} \right), \qquad \theta_2 = \sqrt{2T}\, \sqrt{ \left( \frac{\partial \psi_1}{\partial x} \right)^2 + \left( \frac{\partial \psi_1}{\partial y} \right)^2 }, \qquad (5) $$

and similarly for $\theta_3$ and $\theta_4$. Thus, we obtain values for each $\theta_i$, $i = 1, 2, 3, 4$, over the grid and can smooth them, e.g., using a quadratic fit:

$$ \theta_1(\psi_1, \psi_2) \approx p_0 + p_1 \psi_1 + p_2 \psi_2 + p_3 \psi_1^2 + p_4 \psi_1 \psi_2 + p_5 \psi_2^2. $$

If the grid is local (small) enough and the bursts are for the most part contained within the grid, we can assume that the objective function is locally approximated by a linear surface with constant gradient, i.e., $\mu(x, y) = \mu$ and $\nu(x, y) = \nu$. For our illustrative example we use the function $f(x, y) = x + 2y$, which has a constant gradient, on an orthogonal grid along the coordinate axes. We use an $8 \times 10$ grid, with $N = 150$ trajectories at every grid point, each run for $\Delta t = 0.01$.

To illustrate that the estimation is accurate even if we observe the process through a nonlinear transformation, we transform a region of the $(x, y)$ space that contains the trajectories and map it onto a spherical surface, obtaining a new $(x, y, z)$ space. We apply Diffusion Maps with Euclidean distances to the original data set and Diffusion Maps with Mahalanobis distances to the transformed data set. Figure 5a shows the original data set colored by the two diffusion coordinates and the new embedding. Of course, since the original data set is two-dimensional, no dimensionality reduction is achieved in this case. Figure 5b shows the transformed data set colored by the diffusion coordinates. In this case, Mahalanobis distances enable us to retrieve the original, orthogonal, two-dimensional embedding.

After estimation, some of the drifts and diffusivities along the grid approach zero. In order to avoid these degeneracies, we discard the points where this occurs and fit the coefficients $\theta_i$ to the rest of the grid. Figures 6 and 7 show the results for the drift coefficients $\theta_1, \theta_3$. Similar results are obtained for the diffusion coefficients $\theta_2, \theta_4$.

[Figure 5: Diffusion maps applied to (a) the original 2D data with Euclidean distances, and (b) the transformed data with Mahalanobis distances. In both cases, the inherent dimensionality is recovered in the first two eigenvectors, as shown by coloring the data by the corresponding leading Diffusion Map coordinates.]

[Figure 6: Estimation of the first drift coefficient $\theta_1$. "Theoretical" values are obtained numerically from (5), "Euclidean" via DMaps on the original data, and "Mahalanobis" from DMaps on the transformed data. The last subplot shows results from Euclidean DMaps on transformed data, which, as expected, yields incorrect estimates.]

[Figure 7: Estimation of the second drift coefficient $\theta_3$. The subplots are analogous to those in Figure 6.]

Having shown that we can estimate the correct parameters in the Diffusion Map space, whether we observe the original manifold or some invertible transformation of it, we now apply our algorithm to a three-dimensional objective function with an attracting, slow, two-dimensional manifold. In a cylindrical coordinate scheme (i.e., $x(r, \theta, z) = r \cos \theta$, $y(r, \theta, z) = r \sin \theta$, $z(r, \theta, z) = z$), we define

$$ f(r, \theta, z) = k_1 (r - R)^2 + h(\theta) + k_2 z^2, $$

where $k_1, k_2 > 0$, so that trajectories are strongly attracted to the cylinder of radius $R$ about the $z$ axis. We used parameter values $(k_1, k_2, R) = (10, \ldots, \ldots/\pi)$ and implemented $f$ programmatically so as to accept Cartesian coordinates and internally convert to cylindrical. The function $h(\theta)$, a fixed combination of cosines of $\theta$, determines an asymmetric double-well potential with a local minimum close to $\theta = -\pi/2$ and the global minimum close to $\theta = \pi/2$. Any trajectory away from the cylinder is quickly attracted to it, and then the search of the parameter space proceeds along the cylinder surface. The new algorithm builds on the previously presented one, but now also includes parameter estimation in Diffusion Map space.
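The coefficient-smoothing step used by the algorithm below—the least-squares quadratic fit of the per-grid-point estimates $\theta_i(\psi_1, \psi_2)$ described above—can be sketched as follows (our own illustration; the grid coordinates are assumed to be flattened arrays):

```python
import numpy as np

def fit_theta_quadratic(psi1, psi2, theta):
    """Least-squares quadratic surface for a coefficient theta(psi1, psi2)
    estimated at the grid points, as in the smoothing step above."""
    A = np.column_stack([np.ones_like(psi1), psi1, psi2,
                         psi1**2, psi1 * psi2, psi2**2])
    p, *_ = np.linalg.lstsq(A, theta, rcond=None)
    return p  # coefficients [p0, p1, p2, p3, p4, p5]
```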
Algorithm
1. Initialize a local grid around a starting point and simulate ensembles of short trajectories starting at every grid point.
2. Apply DMaps to the data set to obtain the low-dimensional embedding.
3. Estimate SDE coefficients on the grid points.
4. Fit a polynomial to the coefficients along the grid.
5. Integrate the system of ODEs $d\psi_1 = \theta_1(\psi_1, \psi_2)\, dt$, $d\psi_2 = \theta_3(\psi_1, \psi_2)\, dt$ forward in time. This is analogous to using a gradient descent algorithm.
6. Lift the resulting point to full space and use it as a new starting point.
7. Repeat until the estimated coefficients in Diffusion Map space approach zero.

One must of course pay attention to the length of the integration step, to avoid extrapolating too far away from the grid in 3D space, since the coefficients are known only locally there. In general, we observed that linear fits perform better and follow the negative gradient direction at larger distances from the grid. Figure 8 shows two snapshots of the algorithm run as it approaches the minimum near $\theta = \pi/2$.

[Figure 8: Coarse-grained (two-dimensional) optimization on a cylindrical surface in three dimensions. One complete run of the algorithm, illustrating the short bursts of trajectories ("Bursts on Grid"), the new points after being lifted by geometric harmonics ("Lifted Extensions"), the "Algorithm Output," and the "Global Minimizer." Observe that, although the lifting does not perfectly locate the cylinder, the burst trajectories are quickly attracted back to it. The second plot is a top-down view of the first.]

Our estimation procedure can also be applied in cases where the underlying stochasticity of the system is not due to a Wiener process but arises from deterministic chaos. Consider an ODE driven by one of the components of the Lorenz system:

$$ \frac{dx}{dt} = A(x - x^3) + \frac{\lambda}{\varepsilon}\, y_2, \quad \frac{dy_1}{dt} = \frac{10}{\varepsilon^2}(y_2 - y_1), \quad \frac{dy_2}{dt} = \frac{1}{\varepsilon^2}(28 y_1 - y_2 - y_1 y_3), \quad \frac{dy_3}{dt} = \frac{1}{\varepsilon^2}\left(y_1 y_2 - \tfrac{8}{3} y_3\right). \qquad (6) $$

It can be shown that the approximate dynamics for the slow variable are given by the following SDE [28, 37]:

$$ dx = A(x - x^3)\, dt + \sqrt{\sigma}\, dW. $$

In the following simulations we use parameter values $A = 1$, $\lambda = 2/45$, and a small value of $\varepsilon$. Given a diffusion coordinate $\psi(x)$ that is one-to-one with the slow variable $x$, we can write an SDE for it using Itô's lemma:

$$ d\psi = \left( A(x - x^3)\, \frac{d\psi}{dx} + \frac{\sigma}{2}\, \frac{d^2\psi}{dx^2} \right) dt + \sqrt{\sigma}\, \frac{d\psi}{dx}\, dW. \qquad (7) $$

The simulation is set up as follows (cf. [32, 33]): initially, the system is run long enough from fixed initial conditions that it converges onto the Lorenz attractor. The resulting state of the fast variables $(y_1, y_2, y_3)$ will be used subsequently as our starting point for the short bursts. The starting points for $x$ are taken as equally spaced points in a small interval, and for each $x$ value we perturb the starting point of each short burst. The actual initial conditions are given as

$$ x_{ic} = x + c\, x_{\text{spacing}}\, z_0, \qquad y_{ic} = y + z, $$

where $z_0$ and $z$ are standard normal variables, $x_{\text{spacing}}$ is the distance between the unperturbed starting points for $x$, and $c$ is a small constant.

At each of our 20 starting points, 500 short trajectories are simulated. The system is integrated with a small time step $\delta t$ for a duration $\Delta t = 0.03$. Figure 9 shows the trajectories of both the fast and slow variables for one such burst. We see that the duration $\Delta t = 0.03$ lies between the timescales of the fast (Lorenz) and slow (effective $x$) dynamics, so we avoid biasing our estimators for the coarse-grained model parameters.

[Figure 9: Simulation of a short trajectory initialized at perturbed initial conditions $(x_{ic}, y_{ic})$.]

As before, we assume that the drift and diffusion coefficients are approximately constant at each starting point and estimate the following SDE at each starting point via GMM:

$$ dx = \theta_1\, dt + \theta_2\, dW. $$

Afterwards, we fit a polynomial of an appropriate degree to the estimated coefficients and retrieve estimates $\hat{A}$ and $\hat{\sigma}$ close to the nominal values.

Applying regular DMaps to this data set yields a parameterization of the Lorenz attractor. In order to extract the slow variable, we apply DMaps using Mahalanobis distances. The local covariances at each data point are computed using 100 short simulations of small duration $dt_{\text{cov}}$. The pseudo-inverse of the covariance matrix is computed using a singular value decomposition (SVD):

$$ C^{\dagger} = \sum_{m=1}^{d} s_m^{-1}\, v_m v_m^{\top}, \qquad (8) $$

where $s_m$ are the singular values and $v_m$ the right singular vectors. In this case, we use $d = n = 4$, since we need the last singular value, which corresponds to the slow variable, to obtain the correct embedding. Using this setup, the first non-trivial diffusion coordinate parameterizes the slow variable $x$, as seen in Figure 10.

While in theory the derivatives of the diffusion coordinate could be computed using central differences, the parameterization of $x$ is quite noisy and the estimated derivatives can be quite inaccurate. To ameliorate this difficulty, we used smoothing splines to fit a curve to the data and estimated the derivatives from the splines. The data and fitted curve are shown in Figure 10.

[Figure 10: The first diffusion coordinate parameterizes $x$.]

Using the new data set in Diffusion Map space, we again assume constant local drift and diffusivity at each starting point and use GMM to fit the following SDE:

$$ d\psi = \xi_1\, dt + \xi_2\, dW. $$

This estimate is compared to (7), either using the estimates $\hat{A}$ and $\hat{\sigma}$:

$$ d\psi = \left( \hat{A}(x - x^3)\, \frac{d\psi}{dx} + \frac{\hat{\sigma}}{2}\, \frac{d^2\psi}{dx^2} \right) dt + \sqrt{\hat{\sigma}}\, \frac{d\psi}{dx}\, dW, \qquad (9) $$

or directly using the estimates $\theta_1$ and $\theta_2$:

$$ d\psi = \left( \theta_1\, \frac{d\psi}{dx} + \frac{\theta_2^2}{2}\, \frac{d^2\psi}{dx^2} \right) dt + \theta_2\, \frac{d\psi}{dx}\, dW. $$

The results are shown in Figure 11 and Figure 12.

[Figure 11: Estimation of the drift coefficient in Diffusion Map space.]

[Figure 12: Estimation of the diffusion coefficient in Diffusion Map space.]

To demonstrate the dimension reduction from a higher-dimensional space, we can transform the slow variable $x$ by embedding it onto a curve in the plane (see Figure 13). The entire data set is now five-dimensional, and we can again apply DMaps with Mahalanobis distances as before. We again compute the pseudo-inverse using SVD as in (8), but now we use $d = n - 1$. Once we obtain the embedding $\psi_{tr}(x)$ of the transformed data (cf. Figure 14), we can estimate again drift and diffusion coefficients as above. The results are shown in Figure 15 and Figure 16.

[Figure 15: Estimation of the drift coefficient in Diffusion Map space.]

[Figure 16: Estimation of the diffusion coefficient in Diffusion Map space.]
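The smoothing-spline derivative estimation used above for $d\psi/dx$ and $d^2\psi/dx^2$ (as they enter (7) and (9)) might be implemented as follows; this is a sketch, and the spline order and smoothing factor are our assumptions:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def spline_derivatives(x, psi, s=None):
    """Smooth the noisy parameterization psi(x) with a spline and return
    dpsi/dx and d^2 psi/dx^2 evaluated at the sample points."""
    order = np.argsort(x)                       # spline needs increasing x
    spl = UnivariateSpline(x[order], psi[order], k=4, s=s)
    return spl.derivative(1)(x), spl.derivative(2)(x)
```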
[Figure 13: Nonlinear transformation of the slow variable $x$ onto a semicircle in the plane.]

The same method can be applied in the case of multiplicative noise:

$$ \frac{dx}{dt} = A(x - x^3) + \frac{\lambda}{\varepsilon}(1 + \nu x)\, y_2, \quad \frac{dy_1}{dt} = \frac{10}{\varepsilon^2}(y_2 - y_1), \quad \frac{dy_2}{dt} = \frac{1}{\varepsilon^2}(28 y_1 - y_2 - y_1 y_3), \quad \frac{dy_3}{dt} = \frac{1}{\varepsilon^2}\left(y_1 y_2 - \tfrac{8}{3} y_3\right). $$

The approximate dynamics for the slow variable in this case are given by [32]:

$$ dx = \left(Ax + Bx^2 + Cx^3\right) dt + \sqrt{\sigma_a + \sigma_b x + \sigma_c x^2}\; dW. $$

The constants from (6) and the simulation parameters remain unchanged, with the exception of $\nu = 1$ and $\Delta t = 0.01$. Applying Itô's lemma again, we obtain

$$ d\psi = \left[ \left(Ax + Bx^2 + Cx^3\right) \frac{d\psi}{dx} + \frac{\sigma_a + \sigma_b x + \sigma_c x^2}{2}\, \frac{d^2\psi}{dx^2} \right] dt + \sqrt{\sigma_a + \sigma_b x + \sigma_c x^2}\; \frac{d\psi}{dx}\, dW. $$

If we estimate the coefficients of the polynomials in the drift and diffusion terms using the data in the original space, we can also estimate these coefficients in diffusion map space using an equation analogous to (9):

$$ d\psi = \left[ \left(\hat{A}x + \hat{B}x^2 + \hat{C}x^3\right) \frac{d\psi}{dx} + \frac{\hat{\sigma}_a + \hat{\sigma}_b x + \hat{\sigma}_c x^2}{2}\, \frac{d^2\psi}{dx^2} \right] dt + \sqrt{\hat{\sigma}_a + \hat{\sigma}_b x + \hat{\sigma}_c x^2}\; \frac{d\psi}{dx}\, dW. $$

The other two estimation methods remain the same, i.e., either using $\theta_1$ and $\theta_2$ calculated at each starting point or calculating $\xi_1$ and $\xi_2$ at each starting point. The results are shown in Figure 17 and Figure 18.

[Figure 17: Estimation of the drift coefficient in Diffusion Map space.]

[Figure 18: Estimation of the diffusion coefficient in Diffusion Map space.]
[Figure 14: Embeddings of the original and transformed data sets ("Original Embedding" vs. "Transformed Embedding").]
Conclusions

We have confirmed that, in the limit of small time steps and large temperatures, trajectories produced by the Simulated Annealing / Random Walk Metropolis-Hastings algorithm are analogous to those that result from the Langevin equation, a global, stochastic optimization algorithm. We use SA as our "inner" optimizer that produces ensembles of brief simulation bursts, which contain information about an effective gradient of the objective function. Using dimension reduction techniques such as PCA or, in the case of nonlinear manifolds, Diffusion Maps, we can obtain the parameterization of the underlying low-dimensional manifold.

As our first example, we show that a two-dimensional, Bayesian model is effectively one-dimensional and that DMaps can retrieve the important parameter (a sort of "reaction coordinate") for the optimization. Combining SA with DMaps achieves a considerably faster approach to the maximum, compared with simple SA alone.

Starting from a two-dimensional SDE that corresponds to a Langevin equation for an objective function with two parameters, we derived the corresponding SDE in terms of the diffusion coordinates and approximated numerically the theoretical drift and diffusion coefficients. We then estimated the same drift and diffusion coefficients using parameter inference on the data set that came from the application of DMaps on the original trajectories. Additionally, we can transform the original data set through a nonlinear transformation by mapping it onto a portion of the surface of a sphere and apply DMaps on the transformed data set using Mahalanobis distances. In both cases the estimated parameters are closely comparable to the theoretical ones.

For illustration purposes, we constructed a three-dimensional objective function that has a strongly attracting, two-dimensional manifold. This work constitutes a simple "proof of concept" acceleration demonstration for the classes of optimization problems we consider. It is also an illustration of the tools required to perform scientific computations (here, gradient descent) in a latent variable space, a space parameterized by on-the-fly processing of the data produced by the "inner optimizer." Fast implementations of the techniques (like Geometric Harmonics) for translating back and forth between the original space, in which the problem was given, and the latent space, where the coarse optimization steps are taken, are crucial for the usefulness of the approach. The true benefits of this approach and its potential should be explored by applying it to truly high-dimensional problems where other methods slow down considerably. This is the subject of current work.
Acknowledgments
This work was partially supported by the DARPA Lagrange program and an ARO MURI. The authors gratefully acknowledge many helpful discussions with Prof. Christian Kuehn of T.U. Munich, especially for the case where the noise term arises through the coupling with chaotic dynamical systems.

G.P.'s research is supported by the EPSRC, grants EP/P031587/1, EP/L024926/1, and EP/L020564/1, and by JPMorgan Chase & Co. Any views or opinions expressed herein are solely those of the authors listed, and may differ from the views and opinions expressed by JPMorgan Chase & Co. or its affiliates. This material is not a product of the Research Department of J.P. Morgan Securities LLC. This material does not constitute a solicitation or offer in any jurisdiction.

A.D. was supported by the Lloyd's Register Foundation Programme on Data-Centric Engineering and by the Alan Turing Institute under the EPSRC grant [EP/N510129/1].
References

[1] Y. Aït-Sahalia, Maximum likelihood estimation of discretely sampled diffusions: A closed-form approximation approach, Econometrica, (2002), 223–262.
[2] Y. Aït-Sahalia, Closed-form likelihood expansions for multivariate diffusions, The Annals of Statistics, (2008), 906–937.
[3] R. Barton, Metamodeling: a state of the art review, in Simulation Conference Proceedings, 1994. Winter, IEEE, 1994, 237–244.
[4] K. Chan, G. Karolyi, F. Longstaff and A. Sanders, An empirical comparison of alternative models of the short-term interest rate, The Journal of Finance, (1992), 1209–1227.
[5] E. Chiavazzo, C. Gear, C. Dsilva, N. Rabin and I. Kevrekidis, Reduced models in chemical kinetics via nonlinear data-mining, Processes, (2014), 112–140.
[6] R. Coifman and S. Lafon, Diffusion maps, Applied and Computational Harmonic Analysis, (2006), 5–30.
[7] R. Coifman and S. Lafon, Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions, Applied and Computational Harmonic Analysis, (2006), 31–52.
[8] R. Coifman, S. Lafon, A. Lee, M. Maggioni, B. Nadler, F. Warner and S. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, Proceedings of the National Academy of Sciences of the United States of America, (2005), 7426–7431.
[9] A. Conn, N. Gould and P. Toint, Trust Region Methods, SIAM, 2000.
[10] A. Conn, K. Scheinberg and L. Vicente, Introduction to Derivative-Free Optimization, SIAM, 2009.
[11] C. Dsilva, R. Talmon, R. Coifman and I. Kevrekidis, Parsimonious representation of nonlinear dynamical systems through manifold learning: A chemotaxis case study, Applied and Computational Harmonic Analysis, (2018), 759–773.
[12] C. Dsilva, R. Talmon, N. Rabin, R. Coifman and I. Kevrekidis, Nonlinear intrinsic variables and state reconstruction in multiscale simulations, The Journal of Chemical Physics, (2013), 184109.
[13] A. Duncan, G. Pavliotis and K. Zygalakis, Nonreversible Langevin samplers: Splitting schemes, analysis and implementation, arXiv preprint arXiv:1701.04247.
[14] R. Eberhart and J. Kennedy, A new optimizer using particle swarm theory, in Micro Machine and Human Science, 1995. MHS'95., Proceedings of the Sixth International Symposium on, IEEE, 1995, 39–43.
[15] M. Fathi and G. Stoltz, Improving dynamical properties of metropolized discretizations of overdamped Langevin dynamics, Numerische Mathematik, (2017), 545–602.
[16] C. Gear, D. Givon and I. Kevrekidis, Computing on virtual slow manifolds of fast stochastic systems, JNAIAM, (2010), 61–72.
[17] C. Gear and I. Kevrekidis, Projective methods for stiff differential equations: problems with gaps in their eigenvalue spectrum, SIAM J. Sci. Comp., (2003), 1091–1106.
[18] S. Geman and C.-R. Hwang, Diffusions for global optimization, SIAM Journal on Control and Optimization, (1986), 1031–1043.
[19] B. Gidas, Global optimization via the Langevin equation, in Decision and Control, 1985 24th IEEE Conference on, vol. 24, IEEE, 1985, 774–778.
[20] J. Gradišek, S. Siegert, R. Friedrich and I. Grabec, Analysis of time series from stochastic processes, Physical Review E, (2000), 3146.
[21] L. Hansen, Large sample properties of generalized method of moments estimators, Econometrica: Journal of the Econometric Society, (1982), 1029–1054.
[22] J. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT Press, 1992.
[23] R. Hooke and T. Jeeves, "Direct search" solution of numerical and statistical problems, Journal of the ACM (JACM), (1961), 212–229.
[24] W. Huyer and A. Neumaier, Global optimization by multilevel coordinate search, Journal of Global Optimization, (1999), 331–355.
[25] I. Jolliffe, Principal component analysis and factor analysis, in Principal Component Analysis, Springer, 1986, 115–128.
[26] D. Jones, A taxonomy of global optimization methods based on response surfaces, Journal of Global Optimization, (2001), 345–383.
[27] D. Jones, C. Perttunen and B. Stuckman, Lipschitzian optimization without the Lipschitz constant, Journal of Optimization Theory and Applications, (1993), 157–181.
[28] S. Kalliadasis, S. Krumscheid and G. A. Pavliotis, A new framework for extracting coarse-grained models from time series with multiscale structure, J. Comput. Phys., (2015), 314–328.
[29] C. Kelley, Iterative Methods for Optimization, SIAM, 1999.
[30] I. Kevrekidis, C. Gear, J. Hyman, P. Kevrekidis, O. Runborg and C. Theodoropoulos, Equation-free, coarse-grained multiscale computation: Enabling microscopic simulators to perform system-level analysis, Communications in Mathematical Sciences, (2003), 715–762.
[31] S. Kirkpatrick, C. Gelatt and M. Vecchi, Optimization by simulated annealing, Science, (1983), 671–680.
[32] S. Krumscheid, G. Pavliotis and S. Kalliadasis, Semiparametric drift and diffusion estimation for multiscale diffusions, Multiscale Modeling & Simulation, (2013), 442–473.
[33] S. Krumscheid, M. Pradas, G. Pavliotis and S. Kalliadasis, Data-driven coarse graining in action: Modeling and prediction of complex systems, Phys. Rev. E, (2015), 042139.
[34] S. Lafon, Y. Keller and R. Coifman, Data fusion and multicue data matching by diffusion maps, IEEE Transactions on Pattern Analysis and Machine Intelligence, (2006), 1784–1797.
[35] G. Li, C. Rosenthal and H. Rabitz, High dimensional model representations, The Journal of Physical Chemistry A, (2001), 7765–7777.
[36] M. Locatelli, Simulated annealing algorithms for continuous global optimization, in Handbook of Global Optimization, Springer, 2002, 179–229.
[37] I. Melbourne and A. Stuart, A note on diffusion limits of chaotic skew-product flows, Nonlinearity, (2011), 1361.
[38] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller and E. Teller, Equation of state calculations by fast computing machines, The Journal of Chemical Physics, (1953), 1087–1092.
[39] M. Montgomery, R. Meglen and N. Damrauer, General method for the dimension reduction of adaptive control experiments, The Journal of Physical Chemistry A, (2006), 6391–6394.
[40] M. Montgomery, R. Meglen and N. Damrauer, General method for reducing adaptive laser pulse-shaping experiments to a single control variable, The Journal of Physical Chemistry A, (2007), 5126–5129.
[41] J. Nelder and R. Mead, A simplex method for function minimization, The Computer Journal, (1965), 308–313.
[42] B. Øksendal, Stochastic Differential Equations, Springer, 2003.
[43] A. Papavasiliou, Particle filters for multiscale diffusions, in Conférence Oxford sur les méthodes de Monte Carlo séquentielles, vol. 19 of ESAIM Proc., EDP Sci., Les Ulis, 2007, 108–114.
[44] G. Pavliotis, Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations, vol. 60 of Texts in Applied Mathematics, Springer, New York, 2014.
[45] G. Pavliotis and A. Stuart, Parameter estimation for multiscale diffusions, J. Stat. Phys., (2007), 741–781.
[46] L. Rios and N. Sahinidis, Derivative-free optimization: a review of algorithms and comparison of software implementations, Journal of Global Optimization, (2013), 1247–1293.
[47] J. Roslund and H. Rabitz, Dynamic dimensionality identification for quantum control, Physical Review Letters, (2014), 143001.
[48] C. Schillings and A. Stuart, Analysis of the ensemble Kalman filter for inverse problems, SIAM Journal on Numerical Analysis, (2017), 1264–1290.
[49] S. Shan and G. Wang, Survey of modeling and optimization strategies to solve high-dimensional design problems with computationally-expensive black-box functions, Structural and Multidisciplinary Optimization, (2010), 219–241.
[50] A. Singer and R. Coifman, Non-linear independent component analysis with diffusion maps, Applied and Computational Harmonic Analysis, (2008), 226–239.
[51] E. Vanden-Eijnden, Fast communications: Numerical techniques for multiscale dynamical systems with stochastic effects, Commun. in Math. Sci., (2003), 385–391.
[52] E. Weinan, B. Engquist, X. Li, W. Ren and E. Vanden-Eijnden, Heterogeneous multiscale methods: a review, Communications in Computational Physics, 2 (2007), 367–450.