Nonparametric forecasting of low-dimensional dynamical systems
Tyrus Berry, Dimitrios Giannakis, and John Harlim
Department of Mathematics, the Pennsylvania State University, University Park, PA 16802-6400, USA
Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA
Department of Meteorology, the Pennsylvania State University, University Park, PA 16802-5013, USA
(Dated: January 15, 2015)

This letter presents a nonparametric modeling approach for forecasting stochastic dynamical systems on low-dimensional manifolds. The key idea is to represent the discrete shift maps on a smooth basis which can be obtained by the diffusion maps algorithm. In the limit of large data, this approach converges to a Galerkin projection of the semigroup solution to the underlying dynamics on a basis adapted to the invariant measure. This approach allows one to quantify uncertainties (in fact, evolve the probability distribution) for non-trivial dynamical systems with equation-free modeling. We verify our approach on various examples, ranging from an inhomogeneous anisotropic stochastic differential equation on a torus to the chaotic Lorenz three-dimensional model and the Niño-3.4 data set, which is used as a proxy for the El Niño Southern Oscillation.
A significant challenge in modeling is to account for physical processes which are often not well understood but for which large data sets are available. A standard approach is to perform regression fitting of the data to various parametric models. While this approach is popular and successful in many applied domains, the resulting predictive skill can be sensitive to the choice of model, the parameter fitting algorithm, and the complexity of the underlying physical processes. An alternative approach is to avoid choosing particular models and/or parameter estimation algorithms and instead apply a nonparametric modeling technique. In particular, nonparametric modeling based on local linearization of discrete shift maps on paths has been successful in predicting mean statistics of data sets generated by dynamical systems with low-dimensional attractors [1-3].

In this letter, we generalize this nonparametric approach to quantify the evolving probability distribution of the underlying dynamical system. The key idea is to project the shift map on a set of basis functions that are generated by the diffusion maps algorithm [4] with a variable bandwidth diffusion kernel [5]. This approach has connections with a recently developed family of kernels [6], which utilize small shifts of a deterministic time series to estimate the dynamical vector field. The method also generalizes a recently introduced nonparametric modeling framework for gradient systems [7] to inhomogeneous stochastic systems having non-gradient drift and anisotropic diffusion.

Consider a dynamical system,
\[ dx = a(x)\,dt + b(x)\,dW_t, \qquad (1) \]
where $W_t$ is a standard Brownian process, $a(x)$ a vector field, and $b(x)$ a diffusion tensor, all defined on a manifold $\mathcal{M} \subset \mathbb{R}^n$. Given a time series $x_i = x(t_i)$, generated by (1) at discrete times $\{t_i\}_{i=1}^{N+1}$, we are interested in constructing a forecast model so that, given an initial density $p_0(x)$, we can estimate the density $p(x,t) = e^{t\mathcal{L}^*}p_0(x)$ at time $t > 0$, without knowing the Fokker-Planck operator $\mathcal{L}^*$ of (1) and without knowing or estimating $a$ and $b$. We will assume that (1) is ergodic, so that the $\{x_i\}$ are sampled from the invariant measure $p_{eq}(x)$ of (1). Note that all probability densities are relative to a volume form $dV$ which $\mathcal{M}$ inherits from the ambient space, and the generator $\mathcal{L}$ of (1) is the adjoint of $\mathcal{L}^*$ with respect to $L^2(\mathcal{M}, dV)$.

We note that our approach differs significantly from previous approaches such as [8, 9], which estimate $a$ and $b$ explicitly in the ambient space $\mathbb{R}^n$, relying on the Kramers-Moyal expansion. In contrast, our approach directly estimates the semigroup solution $e^{\tau\mathcal{L}}$ on the manifold $\mathcal{M} \subset \mathbb{R}^n$, so that $a$ and $b$ are represented implicitly. The advantages of our approach are that the data requirements are independent of the ambient space dimension, depending only on the intrinsic dimension of $\mathcal{M}$, and that we can estimate the semigroup solution $e^{\tau\mathcal{L}}$ directly from the data for any sampling time $\tau$.

Our approach is motivated by a rigorous connection between the shift map $S$, which we define by $Sf(x_i) = f(x_{i+1})$ for a function $f \in L^2(\mathcal{M}, p_{eq})$, and the semigroup solution $e^{\tau\mathcal{L}}$ of the underlying dynamical system (1). Applying Itô's lemma to $f(x)$, one can show
\[ Sf(x_i) = e^{\tau\mathcal{L}}f(x_i) + \int_{t_i}^{t_{i+1}} \nabla f^\top b\, dW_s + \int_{t_i}^{t_{i+1}} \mathcal{B}f\, ds, \qquad (2) \]
where $\tau = t_{i+1} - t_i$, $\mathcal{B}f = \mathcal{L}f - \mathbb{E}[\mathcal{L}f]$, and the expectation is with respect to paths of (1) conditional on $x(t_i)$. The detailed derivation of (2) is in Appendix B in the supplementary material. Since $\mathbb{E}[Sf(x_i)] = e^{\tau\mathcal{L}}f(x_i)$, we can use the shift map $S$ to directly estimate the semigroup solution $e^{\tau\mathcal{L}}$. However, $S$ is a noisy estimate of $e^{\tau\mathcal{L}}$, and our key contribution is to minimize the error by representing $S$ on a basis of smooth functions.

Minimizing the error requires choosing a basis which minimizes the functional $\|\nabla f\|_{p_{eq}}$, as shown in Appendix B. Intuitively, this is because we want to bound the stochastic integral in (2), whose integrand contains $\nabla f$. The functional $\|\nabla f\|_{p_{eq}}$ is minimized by the eigenfunctions of the generator $\hat{\mathcal{L}}$ of a stochastically forced gradient flow with potential $U(x) = -\log(p_{eq}(x))$. Let $\lambda_j$ and $\varphi_j$ be the eigenvalues and eigenfunctions of $\hat{\mathcal{L}}$; the $\{\varphi_j\}$ are orthonormal on $L^2(\mathcal{M}, p_{eq})$. Since $\hat{\mathcal{L}}$ is the generator of a gradient flow system, it is easy to check that $\psi_j = p_{eq}\varphi_j$ are the eigenfunctions of the adjoint operator $\hat{\mathcal{L}}^*$, which are orthonormal on $L^2(\mathcal{M}, p_{eq}^{-1})$. Numerically, we obtain $\varphi_j(x_i)$ as eigenvectors of a stochastic matrix, constructed by evaluating a variable bandwidth kernel on all pairs of data points and then applying an appropriate normalization [5]. We summarize this procedure and the related results in Appendix A in the supplementary material. We emphasize that the operator $\hat{\mathcal{L}}$ is used only to estimate $\{\varphi_j\}$, which is the optimal basis for smoothing the shift operator $S$ approximating the semigroup solution $e^{\tau\mathcal{L}}$ of the full system (1).

We write the solution $p(x,\tau) = e^{\tau\mathcal{L}^*}p_0(x)$ as follows,
\[ p(x,\tau) = \sum_{l=1}^{\infty} \langle e^{\tau\mathcal{L}^*}p_0, \psi_l\rangle_{p_{eq}^{-1}}\,\psi_l(x) = \sum_{l=1}^{\infty} \langle p_0, e^{\tau\mathcal{L}}\varphi_l\rangle\,\psi_l(x). \]
Similarly, the initial density is $p_0(x) = \sum_j c_j(0)\,\psi_j(x)$, where $c_j(0) = \langle p_0, \psi_j\rangle_{p_{eq}^{-1}}$.
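Since $\psi_j = p_{eq}\varphi_j$ and the $\{\varphi_j\}$ are orthonormal on $L^2(\mathcal{M}, p_{eq})$, the initial coefficients can equivalently be written in terms of the sampled invariant measure, which is the form evaluated by Monte-Carlo in the examples below:
\[ c_j(0) = \langle p_0, \psi_j\rangle_{p_{eq}^{-1}} = \int_\mathcal{M} p_0\,\varphi_j\, dV = \Big\langle \frac{p_0}{p_{eq}}, \varphi_j\Big\rangle_{p_{eq}} \approx \frac{1}{N}\sum_{i=1}^{N} \frac{p_0(x_i)}{p_{eq}(x_i)}\,\varphi_j(x_i). \]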
We therefore obtain,
\[ p(x,\tau) = \sum_{l=1}^{\infty}\sum_{j=1}^{\infty} c_j(0)\, A_{lj}(\tau)\, p_{eq}(x)\,\varphi_l(x), \qquad (3) \]
where $A_{lj}(\tau) := \langle\varphi_j, e^{\tau\mathcal{L}}\varphi_l\rangle_{p_{eq}}$. Based on the discussion after (2), we will use $\langle\varphi_j, S\varphi_l\rangle_{p_{eq}}$ to estimate $A_{lj}$. Numerically, we estimate $A_{lj}$ by a Monte-Carlo integral,
\[ \hat{A}_{lj} = \frac{1}{N}\sum_{i=1}^{N}\varphi_j(x_i)\,\varphi_l(x_{i+1}), \qquad (4) \]
so that the diffusion forecast is defined as follows:
\[ p(x,\tau) \approx \sum_{l=1}^{\infty} p_{eq}(x)\,\varphi_l(x)\sum_{j=1}^{\infty}\hat{A}_{lj}(\tau)\, c_j(0). \qquad (5) \]
One can show that $\mathbb{E}[\hat{A}_{lj}] = A_{lj}$, which means that $\hat{A}_{lj}$ is an unbiased estimate of $A_{lj}$. Moreover, the error of this estimate is of order $\sqrt{|\lambda_l|\,\tau/N}$ in probability, assuming that the $x_i$ are independent samples of $p_{eq}$. This shows that we can apply a diffusion forecast for any sampling time $\tau$, given a sufficiently large data set $N$. For more details, see Appendix B in the supplementary material.
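The forecast (4)-(5) reduces to simple linear algebra once the basis is in hand. The following Python sketch (an illustrative implementation, not the authors' code) assumes arrays `phi[i, j]` $= \varphi_j(x_i)$ and the ratio $p_0/p_{eq}$ evaluated at the data points, both obtained as in Appendix A; all names are placeholders.

```python
import numpy as np

def diffusion_forecast(phi, p0_over_peq, n_steps):
    """Sketch of the diffusion forecast (4)-(5).

    phi         : (N+1, M) array, phi[i, j] = phi_j(x_i), orthonormal w.r.t. p_eq
    p0_over_peq : (N+1,) array, p_0(x_i) / p_eq(x_i) for the initial density p_0
    n_steps     : number of sampling intervals tau to forecast

    Returns the coefficient history c(n), n = 0..n_steps, where
    p(x, n*tau) ~ sum_l p_eq(x) * phi_l(x) * c_l(n).
    """
    N = phi.shape[0] - 1
    # Monte-Carlo estimate (4): A_hat[l, j] = (1/N) sum_i phi_j(x_i) phi_l(x_{i+1})
    A_hat = phi[1:, :].T @ phi[:-1, :] / N
    # Initial coefficients c_j(0) = <p_0/p_eq, phi_j>_{p_eq}, by Monte-Carlo
    c = phi[:-1, :].T @ p0_over_peq[:-1] / N
    history = [c]
    for _ in range(n_steps):
        c = A_hat @ c          # one sampling interval: c(n+1) = A_hat c(n)
        history.append(c)
    return np.array(history)
```

Because the coefficients are propagated by repeated application of $\hat{A}$, the forecast at time $\tau = n\Delta t$ is effectively $\hat{A}(\Delta t)^n c(0)$, as used in the examples below.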
Non-gradient drift, anisotropic diffusion: We first verify the above approach for a system of SDEs of the form (1) on a torus, defined in the intrinsic coordinates $(\theta,\phi) \in [0,2\pi)^2$ with drift and diffusion coefficients of the form
\[ a(\theta,\phi) = \begin{pmatrix} c_1 + \cos(\theta)\cos(2\phi) + \cos(\theta + \pi/c_2) \\ c_3\cos(\theta + \phi/2) + \cos(\theta + \pi/c_4) \end{pmatrix}, \qquad b(\theta,\phi) = \begin{pmatrix} c_5 + \sin(\theta) & \cos(\theta+\phi) \\ \cos(\theta+\phi) & c_6 + \sin(\phi)\cos(\theta) \end{pmatrix}, \]
for positive constants $c_1,\dots,c_6$. This example is chosen to exhibit non-gradient drift, anisotropic diffusion, and multiple time scales. Since it is a system of the form (1) on a smooth manifold, our theory shows that the shift operator $S$ is an unbiased estimator for the semigroup solution $e^{\tau\mathcal{L}}$, and in the limit of large data the diffusion forecast will capture all aspects of the evolution of the density $p(x,t)$.

We now verify this theory using a training data set of 20000 points, generated by numerically solving the SDE in (1) with a discrete time step $\Delta t$ and embedded in $\mathbb{R}^3$ via the standard embedding of the torus given by $(x,y,z) = ((2+\sin\theta)\cos\phi,\, (2+\sin\theta)\sin\phi,\, \cos\theta)$. We define a Gaussian initial density $p_0(\theta,\phi)$ with a randomly selected mean and a diagonal covariance matrix with variance 1/10. The initial density is projected into a basis of $M = 1000$ eigenfunctions, giving coefficients $c_j(0) = \langle p_0/p_{eq}, \varphi_j\rangle_{p_{eq}} \approx \frac{1}{N}\sum_{i=1}^{N} p_0(\theta_i,\phi_i)\,\varphi_j(\theta_i,\phi_i)/p_{eq}(\theta_i,\phi_i)$. The coefficients $c_j(0)$ are evolved forward in time in discrete steps of length $\Delta t$ with the matrix $\hat{A}(\Delta t)$ constructed by (4), so that the forecast at time $\tau = n\Delta t$ is effectively $\hat{A}(\Delta t)^n$. In Figure 1, we show the evolution of the first two moments in the ambient space for the fast and slow variables, $x$ and $z$ respectively, created by the diffusion forecast in (5). To verify the accuracy of the diffusion forecast, we also show the corresponding moments produced by an ensemble forecast of 50000 initial conditions, randomly sampled from the initial distribution $p_0(\theta,\phi)$ and evolved using the true dynamical system. Notice the long-time pathwise agreement of the moments constructed via the diffusion forecast with those constructed by the ensemble forecast. See also a video of the evolution of $p$ in the Supplementary Material.
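The ambient-space moments shown in Figure 1 can be read off directly from the coefficients, since $\mathbb{E}[g] = \int_\mathcal{M} g\,p\,dV = \sum_l c_l\,\langle g, \varphi_l\rangle_{p_{eq}}$ for any observable $g$. A minimal sketch under the same (illustrative) array conventions as the previous code block:

```python
import numpy as np

def forecast_moments(phi, obs, c_hist):
    """Sketch: first two forecast moments of an observable from the
    coefficient history returned by diffusion_forecast.

    phi    : (N+1, M) basis values phi[i, j] = phi_j(x_i)
    obs    : (N+1,) observable values g(x_i), e.g. the ambient x or z coordinate
    c_hist : (n_steps+1, M) coefficient history c(n)

    Uses E[g] = sum_l c_l <g, phi_l>_{p_eq}, with the inner products
    estimated by Monte-Carlo over the training data.
    """
    N = phi.shape[0] - 1
    g1 = phi[:-1, :].T @ obs[:-1] / N          # <g,   phi_l>_{p_eq}
    g2 = phi[:-1, :].T @ (obs[:-1] ** 2) / N   # <g^2, phi_l>_{p_eq}
    mean = c_hist @ g1
    var = c_hist @ g2 - mean ** 2
    return mean, var
```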
FIG. 1. Validation of the first two moments of the diffusion forecast for a stochastic dynamical system on a torus in $\mathbb{R}^3$: mean and variance of $x$ and $z$ versus forecast steps, for the diffusion forecast and the ensemble forecast.
Lorenz-63 model: Next, we apply the diffusion forecasting algorithm to data generated by the Lorenz-63 model [10], with error in the initial state. Unlike the previous example, this model does not technically satisfy the requirements of our theory, because the attractor is a fractal set rather than a smooth manifold. Our results indicate that the applicability of the method seems to extend beyond the current theory. We compare our approach to the classical nonparametric prediction method which uses a local linear approximation of the shift map [1-3], and to a standard ensemble forecasting method with the true model, applied with 50000 initial conditions sampled from the same initial distribution $p_0$.

We generate 10000 data points with discrete time steps $\Delta t = 0.1$ and $\Delta t = 0.5$. Given a verification point $x_t$, we define the initial state $\hat{x}_t = x_t + \xi_t$ by introducing random perturbations $\xi_t$ sampled from a mean-zero Gaussian, and we take the initial distribution $p_0 = \mathcal{N}(\hat{x}_t, 0.01)$ centered at the perturbed verification point. We chose this very small perturbation to demonstrate the diffusion forecast for an initial condition which is almost perfect; as the amount of noise increases, the advantage of the diffusion forecast over the linear methods becomes even more significant.

The diffusion forecast is performed with 4500 eigenfunctions $\varphi_j$, constructed with the diffusion maps algorithm with a variable bandwidth kernel [5] (we show examples in Figure 2). The local linear forecast uses ordinary least squares to fit an affine model to the $n$-step shift map on the 15 nearest neighbors of the initial state. The iterated local linear forecast completes this process for one step, recomputes the 15 nearest neighbors of the 1-step forecast, and then repeats the process. The variance estimate of the local linear models is given by conjugating the covariance matrix of $p_0$ with the linear part of the appropriate affine forecast model. We compute the root mean squared error (RMSE) between each mean forecast and the true state, averaged over the verification period of 5000 steps. We also show the standard deviation of the forecast density, so that a forecasting method has good uncertainty quantification (UQ) if the standard deviation agrees with the RMSE.

Of course, the ensemble forecast with the true model gives the best forecast; however, the diffusion forecast is a considerable improvement over the local linear forecasts.
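For concreteness, the following is a minimal sketch of the direct local linear benchmark described above (ordinary least squares on the $k = 15$ nearest neighbors); the function name and array layout are illustrative, not the authors' code. The iterated variant simply applies the one-step version repeatedly, recomputing neighbors at each step.

```python
import numpy as np

def local_linear_forecast(data, x0, n_steps, k=15):
    """Sketch of the direct local linear benchmark: ordinary least squares
    fit of an affine model to the n-step shift map on the k nearest
    neighbors of the initial state x0.

    data : (N, d) training trajectory sampled every Delta-t
    x0   : (d,) initial state
    Returns the n_steps-ahead forecast started from x0.
    """
    N = data.shape[0]
    # k nearest neighbors among points that have an n-step future in the data
    valid = data[: N - n_steps]
    idx = np.argsort(np.linalg.norm(valid - x0, axis=1))[:k]
    X = np.hstack([valid[idx], np.ones((k, 1))])   # affine design matrix
    Y = data[idx + n_steps]                        # n-step shifted images
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return np.hstack([x0, 1.0]) @ coef
```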
For the short sampling time $\Delta t = 0.1$, the iterated local linear forecast is comparable to the diffusion forecast, except in the long term, where the iterated local linear forecast exhibits significant bias. Moreover, the iterated local linear forecast significantly overestimates the error variance in the intermediate- to long-term forecast. This overestimation is due to the positive Lyapunov exponent, which is implicitly estimated by the product of the iterated local linearizations. In contrast, the direct local linearization is unbiased in the long term, but it converges very quickly to the invariant measure and underestimates the variance. This underestimation is because no single linearization can capture the information creation introduced by the positive Lyapunov exponent.

FIG. 2. Comparing forecast methods for Lorenz-63. Top: $\Delta t = 0.1$; middle: $\Delta t = 0.5$ (RMSE and error estimates for the ensemble, diffusion, local linear, and iterated local linear forecasts versus forecast steps, with the invariant measure for reference). Bottom: four eigenfunctions of the coarse approximation of the fractal attractor by a manifold.
For the longer sampling time $\Delta t = 0.5$, the bias in the local linear models leads them to diverge far beyond the invariant measure for even intermediate-term forecasts. The ensemble forecast provides the most consistent UQ, since it has access to the true model; however, the diffusion forecast produces reasonable estimates without knowing the true model, as shown in Figure 2. The local linear forecast error estimates vary widely and do not robustly provide a useful UQ, whereas the diffusion forecast is robust across multiple sampling times, as suggested by the theory. We include a video showing good long-term agreement between the diffusion forecast density and an ensemble in the Supplementary Material.

The diffusion forecast is able to give a reasonable estimate of the evolution of the density by building a consistent finite-dimensional Markovian approximation of the dynamics. This Markovian system incorporates global knowledge of the attractor structure via the smoothing with the adapted basis $\{\varphi_j\}$. This Markovian approximation of the Lorenz-63 model implicitly uses a small Brownian forcing to replicate the entropy generation of the positive Lyapunov exponent.
El Niño data set: We now apply our method to a real-world data set, where the validity of our theory is unverifiable, namely the Niño-3.4 index, which records the monthly anomalies of sea surface temperature (SST) in the central equatorial Pacific region (the raw data set is available from NOAA at ftp://ftp.cpc.ncep.noaa.gov/wd52dg/data/indices/). In applying this method, we implicitly assume that there is an underlying dynamics on a low-dimensional manifold that generates these SST anomalies. Since the time series is one-dimensional, we apply the time-delay embedding technique to the data set, following [11-13], which will recover this low-dimensional manifold if it exists. We use a 5-lag embedding, empirically chosen to maximize the forecast correlation skill between the true time series and the mean estimate.
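A minimal sketch of the 5-lag delay embedding used here (an illustrative implementation of the standard technique of [11-13], not the authors' code):

```python
import numpy as np

def delay_embed(series, lags=5):
    """Sketch of the time-delay embedding for the scalar Nino-3.4 series:
    each embedded point collects the current value and the previous
    lags-1 values, recovering a manifold-valued trajectory when one
    exists [11-13].

    series : (T,) scalar time series
    Returns a (T - lags + 1, lags) array of embedded points.
    """
    T = len(series)
    return np.column_stack([series[lags - 1 - j : T - j] for j in range(lags)])
```

Each row then serves as a point on the putative manifold, to which the diffusion maps construction of Appendix A is applied.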
We construct the $A_{lj}$ matrix with 80 eigenfunctions obtained by the diffusion maps algorithm with a variable bandwidth kernel, applied to the lag-embedded data set. In this experiment, we train our nonparametric model with monthly data between January 1950 and December 1999 (600 data points), following [14], and we verify the forecasting skill on the period January 2000 to September 2013. The initial distribution $p_0$ is generated with the same method as in the Lorenz-63 example. Based on the RMSE and correlation measures (see Figure 3), the forecasting skill decays to the climatological error in about 6 months, but then the skill improves, peaking at a 13-14 month lead time. In fact, our 14-month lead forecast skill, with RMSE 0.60 and correlation 0.64, is significantly better than that of the method proposed in [14] (Fig. 3 in their paper suggests RMSE 1.4 and correlation 0.4), who claimed to beat the current operational forecasting skill. The 14-month lead forecast mean estimate gives a reasonably correct pattern, and the diffusion forecast provides a reasonable error bar (showing one standard deviation), which is a useful UQ. We include a movie in the Supplementary Material showing the nontrivial evolution of the forecast distribution starting 14 months before January 2004.

FIG. 3. Forecasting the El Niño 3.4 index. RMSE and correlation of the forecast, the estimated error, and the climatological error (top); the 14-month lead forecast, the truth, and the forecast standard deviation (bottom).

Difficulty in improving the forecasts in this problem may be due to a combination of many factors, including the validity of our assumption of the existence of low-dimensional structures, a transient in the time series, a large stochastic component, and memory effects. One possibility is to combine the local linear models via the diffusion basis to form a global model which respects the invariant measure. Another issue is that both the observational and dynamical noise in the data are currently treated as part of the process, and it would be advantageous to isolate the attractor and build the basis there. Finally, the empirical success of this method suggests that it is possible to approximate a fractal attractor with a smooth manifold; however, there is limited theoretical interpretation for such an approximation.

This research is supported under the ONR MURI grant N00014-12-1-0912. D.G. also acknowledges financial support from ONR DRI grant N00014-14-1-0150. J.H. also acknowledges financial support from ONR grant N00014-13-1-0797 and the NSF grant DMS-1317919.

[1] J. Farmer and J. Sidorowich, Phys. Rev. Lett. 59, 845 (1987).
[2] T. Sauer, in Time Series Prediction, Forecasting the Future and Understanding the Past (1994), pp. 175-193.
[3] D. Kugiumtzis, O. C. Lingjærde, and N. Christophersen, Physica D: Nonlinear Phenomena 112, 344 (1998).
[4] R. Coifman and S. Lafon, Appl. Comput. Harmon. Anal. 21, 5 (2006).
[5] T. Berry and J. Harlim, Appl. Comput. Harmon. Anal. [to appear] (2014), http://arxiv.org/abs/1406.5064.
[6] D. Giannakis, SIAM J. Appl. Dyn. Sys. [to appear] (2014), http://arxiv.org/abs/1403.0361.
[7] T. Berry and J. Harlim, SIAM/ASA J. Uncer. Quant. [in review] (2014), http://arxiv.org/abs/1407.6972.
[8] R. Friedrich and J. Peinke, Phys. Rev. Lett. 78, 863 (1997).
[9] F. Böttcher, J. Peinke, D. Kleinhans, R. Friedrich, P. G. Lind, and M. Haase, Phys. Rev. Lett. 97, 090603 (2006).
[10] E. N. Lorenz, J. Atmos. Sci. 20, 130 (1963).
[11] T. Sauer, J. Yorke, and M. Casdagli, J. Stat. Phys. 65, 579 (1991).
[12] D. Giannakis and A. J. Majda, Proc. Nat. Acad. Sci. 109, 2222 (2012).
[13] T. Berry, J. R. Cressman, Z. G. Ferenček, and T. Sauer, SIAM J. Appl. Dyn. Syst. 12, 618 (2013).
[14] M. D. Chekroun, D. Kondrashov, and M. Ghil, Proc. Nat. Acad. Sci. 108, 11766 (2011).
[15] R. R. Coifman, Y. Shkolnisky, F. J. Sigworth, and A. Singer, IEEE Transactions on Image Processing 17, 1891 (2008).

APPENDIX A: ESTIMATING ELLIPTIC OPERATORS WITH DIFFUSION KERNELS
In this appendix we include the details of the numerical algorithm introduced in [5], which is used to estimate the eigenfunctions $\varphi_j$ of the elliptic operator $\hat{\mathcal{L}} = \Delta - c_1\nabla U\cdot\nabla$. We assume that the data set $\{x_i\}_{i=1}^N \subset \mathbb{R}^n$ lies on a $d$-dimensional submanifold $\mathcal{M}$ of the ambient Euclidean space $\mathbb{R}^n$. In order to describe functions on the manifold, we represent a function $f$ by its evaluation on the data set, so that $f$ is represented by the $N$-dimensional vector $\vec{f} = (f_1, \ldots, f_N)^\top$, where $f_i = f(x_i)$. In this framework, operators which map functions to functions are represented as $N \times N$ matrices, since these map $N$-dimensional vectors to $N$-dimensional vectors. We obtain the basis $\{\varphi_j\}$ as eigenvectors of a kernel matrix which provably approximates the operator $\hat{\mathcal{L}}$ in the limit of large data. In this paper we are interested in data sets generated by dynamical systems; however, the general theory of this section is valid for arbitrary data sets. In our case, we assume that the dynamics are ergodic, so that the manifold $\mathcal{M}$ is an attractor of the system and the sampling measure $q$ of the data on $\mathcal{M}$ is the same as the invariant measure $p_{eq}$ of the dynamics.

We first note that $\hat{\mathcal{L}}$ is the generator of the gradient flow system,
\[ dx = -c_1\nabla U(x)\,dt + \sqrt{2}\,dW_t, \qquad (6) \]
where $W_t$ is a Brownian motion on the manifold $\mathcal{M}$ (so that the Laplacian $\Delta$ is the infinitesimal generator of $W_t$). The potential function $U = -\log(q)$ is determined by the sampling measure $q$. We will be interested in the case $c_1 = 1$, in which the invariant measure of the system (6) is $e^{-c_1 U} = q = p_{eq}$. This means that the invariant measure of the true dynamical system which governs the evolution of the data set is the same as that of the gradient flow system generated by $\hat{\mathcal{L}}$. Since $\hat{\mathcal{L}}$ is a negative semi-definite elliptic operator, self-adjoint with respect to $p_{eq}$, the eigenfunctions $\{\varphi_j\}$ form a basis for $L^2(\mathcal{M}, p_{eq})$. Moreover, this is the smoothest basis with respect to $p_{eq}$, in the sense that each $\varphi_j$ minimizes the norm $\|\nabla f\|^2_{p_{eq}} = -\langle f, \hat{\mathcal{L}}f\rangle_{p_{eq}}$ subject to $\varphi_j$ being orthogonal to all $\{\varphi_l\}_{l<j}$.

The basis is constructed with a variable bandwidth kernel of the form
\[ K^S_\epsilon(x,y) = \exp\left(-\frac{\|x-y\|^2}{4\epsilon\, q_\epsilon(x)^\beta q_\epsilon(y)^\beta}\right), \qquad (7) \]
whose convergence to $\hat{\mathcal{L}}$ is summarized in the following result of [5].

Corollary 1 (from [5]). Let $q \in L^1(\mathcal{M}) \cap C(\mathcal{M})$ be a density that is bounded above on an embedded $d$-dimensional manifold $\mathcal{M} \subset \mathbb{R}^n$ without boundary, and let $\{x_i\}_{i=1}^N$ be sampled independently with distribution $q$. Let $K^S_\epsilon$ be a variable bandwidth kernel of the form (7) with bandwidth function $q_\epsilon^\beta$, where $q_\epsilon = q + O(\epsilon)$ is any order-$\epsilon$ estimate of $q$. For a function $f \in L^2(\mathcal{M}, q) \cap C(\mathcal{M})$ and an arbitrary point $x_i \in \mathcal{M}$, define the discrete functionals,
\[ F_i(x_j) = \frac{K^S_\epsilon(x_i,x_j)\, f(x_j)}{q^S_\epsilon(x_i)^\alpha\, q^S_\epsilon(x_j)^\alpha}, \qquad G_i(x_j) = \frac{K^S_\epsilon(x_i,x_j)}{q^S_\epsilon(x_i)^\alpha\, q^S_\epsilon(x_j)^\alpha}, \]
where $q^S_\epsilon(x_i) = \sum_l K^S_\epsilon(x_i,x_l)/q_\epsilon(x_i)^{d\beta}$. Then,
\[ L^S_{\epsilon,\alpha,\beta} f(x_i) \equiv \frac{1}{\epsilon m\, q_\epsilon(x_i)^{2\beta}}\left(\frac{\sum_j F_i(x_j)}{\sum_j G_i(x_j)} - f(x_i)\right) = \hat{\mathcal{L}}f(x_i) + O\left(\epsilon,\ \frac{q(x_i)^{(1-d\beta)/2}}{\sqrt{N}\,\epsilon^{d/4}},\ \frac{\|\nabla f(x_i)\|\, q(x_i)^{-c_2}}{\sqrt{N}\,\epsilon^{1/2+d/4}}\right), \qquad (8) \]
with high probability, where $c_1 = 2 - 2\alpha + d\beta + 2\beta$ and $c_2 = 1/2 - \alpha + 2d\alpha + d\beta/2 + \beta$, and $m$ is a constant depending on the form of the kernel, with $m = 2$ for (7).

The key result of [5] is that for the error to be bounded when $q$ is not bounded below, we require $c_2 \le 0$; to obtain $c_1 = 1$, in this paper we will use $\beta = -1/2$ and $\alpha = -d/4$.
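Assuming the Gaussian form of (7) shown above, the kernel matrix can be evaluated as in the following sketch (illustrative only; a production code would use a sparse nearest-neighbor truncation rather than dense pairwise distances):

```python
import numpy as np

def variable_bandwidth_kernel(data, q_eps, eps, beta=-0.5):
    """Sketch: evaluate the variable-bandwidth kernel (7) on all pairs of
    data points, with bandwidth function q_eps**beta, where q_eps is an
    order-epsilon estimate of the sampling density (see (10)).

    data  : (N, n) data set in the ambient space
    q_eps : (N,) density estimate at the data points
    Returns the (N, N) kernel matrix K^S_eps.
    """
    d2 = np.sum((data[:, None, :] - data[None, :, :]) ** 2, axis=-1)
    bw = q_eps ** beta
    return np.exp(-d2 / (4.0 * eps * np.outer(bw, bw)))
```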
Notice that this algorithm requires the intrinsic dimension $d$ of the manifold $\mathcal{M}$; however, we determine $d$ empirically as part of the kernel density estimation of the sampling density $q$.

To determine the sampling density $q$ (which also serves as an estimate of the invariant measure $p_{eq}$), we introduce the ad hoc bandwidth function $\rho(x_i) = \left(\frac{1}{k-1}\sum_{j=2}^{k}\|x_i - x_{I(i,j)}\|^2\right)^{1/2}$, where $I(i,j)$ is the index of the $j$-th nearest neighbor of $x_i$ in the data set. Following [5], we used $k = 8$ in all of our examples, and empirically the algorithm is not very sensitive to $k$. With this bandwidth we define the following kernel,
\[ K_\epsilon(x,y) = \exp\left(-\frac{\|x-y\|^2}{2\epsilon\,\rho(x)\rho(y)}\right), \qquad (9) \]
which will be used only for the kernel density estimate of the sampling density $q$. The kernel density estimate $q_\epsilon$ of the sampling density $q$ is given by the standard formula,
\[ q_\epsilon(x_i) \equiv \frac{1}{N\,(2\pi\epsilon\,\rho(x_i)^2)^{d/2}}\sum_{j=1}^{N}K_\epsilon(x_i,x_j) \approx \int_\mathcal{M}\frac{K_\epsilon(x_i,y)}{(2\pi\epsilon\,\rho(x_i)^2)^{d/2}}\, q(y)\, dV(y) = q(x_i) + O(\epsilon), \qquad (10) \]
where $dV$ is the volume form which $\mathcal{M}$ inherits from the ambient space and $q$ is the sampling measure relative to this volume form. Note that applying (10) requires choosing the bandwidth $\epsilon$ and knowing the dimension $d$ of the manifold.

To determine the bandwidth $\epsilon$ and the dimension $d$, we apply the automatic tuning algorithm originally developed in [15] and refined in [5]. The idea is that if $\epsilon$ is not well tuned, the kernel becomes trivial: when $\epsilon$ is too small, the kernel (9) is numerically zero for $x \ne y$, and when $\epsilon$ is too large, the kernel is numerically one. Forming the double sum $T(\epsilon) \equiv \frac{1}{N^2}\sum_{i,j=1}^{N}K_\epsilon(x_i,x_j)$, when $\epsilon$ is too small we find that $T$ approaches $1/N$, and when $\epsilon$ is too large we find that $T$ approaches 1. As shown in [5, 15], when $\epsilon$ is well tuned we have $T(\epsilon) \approx (4\pi\epsilon)^{d/2}/\mathrm{vol}(\mathcal{M})$, so that
\[ \log T(\epsilon) \approx \frac{d}{2}\log(4\pi\epsilon) - \log(\mathrm{vol}(\mathcal{M})). \qquad (11) \]
Since $T(\epsilon)$ is monotonically increasing in $\epsilon$, $\log(T(\epsilon))$ is monotonically increasing in $\log(\epsilon)$, and the derivative $\frac{d\log T(\epsilon)}{d\log\epsilon}$ has a unique maximum. Intuitively, this maximum corresponds to the $\epsilon$ that gives the maximum 'resolution' of the kernel $K_\epsilon$. The approximation (11) suggests that the value of the maximum is $\max_\epsilon \frac{d\log T(\epsilon)}{d\log\epsilon} = \frac{d}{2}$, and we will use this to determine the intrinsic dimension of the manifold $\mathcal{M}$. Since the summation $T(\epsilon)$ is not very expensive to compute, we simply evaluate $T(\epsilon)$ for $\epsilon = 2^l$ on a uniform grid of exponents $l$, approximate the derivative by
\[ \frac{d\log T}{d\log\epsilon} \approx \frac{\log(T(\epsilon+h)) - \log(T(\epsilon))}{\log(\epsilon+h) - \log(\epsilon)}, \qquad (12) \]
choose the value of $\epsilon$ which maximizes this derivative, and set $d = 2\max_\epsilon\frac{d\log T(\epsilon)}{d\log\epsilon}$.

Now that we have the empirical estimate $q_\epsilon(x_i) = q(x_i) + O(\epsilon)$, we can form the kernel (7). We reapply the same bandwidth selection to choose the global bandwidth $\epsilon$ in the kernel (7) and a new estimate of $d$, by constructing $T(\epsilon)$ as a double sum of the kernel (7) over the data set.
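A compact sketch of the density estimation and tuning steps (9)-(12); the grid of exponents $l$ below is an illustrative choice, since the grid used in the paper is not specified here:

```python
import numpy as np

def pairwise_sq_dists(data):
    """All pairwise squared Euclidean distances of an (N, n) data set."""
    return np.sum((data[:, None, :] - data[None, :, :]) ** 2, axis=-1)

def knn_bandwidth(d2, k=8):
    """Ad hoc bandwidth rho(x_i): root mean squared distance to the
    2nd..k-th nearest neighbors (the 1st neighbor is the point itself)."""
    nn = np.sort(d2, axis=1)[:, 1:k]
    return np.sqrt(nn.mean(axis=1))

def tune(d2, rho, log2_grid=np.arange(-30.0, 10.1, 0.1)):
    """Automatic tuning (11)-(12): maximize d log T / d log eps over
    eps = 2**l; the maximum slope is d/2."""
    eps = 2.0 ** log2_grid
    logT = np.array([np.log(np.exp(-d2 / (2.0 * e * np.outer(rho, rho))).mean())
                     for e in eps])
    slopes = np.diff(logT) / np.diff(np.log(eps))
    i = np.argmax(slopes)
    return eps[i], 2.0 * slopes[i]          # tuned epsilon and dimension d

def kde(d2, rho, eps, d):
    """Kernel density estimate (9)-(10) at the data points."""
    K = np.exp(-d2 / (2.0 * eps * np.outer(rho, rho)))
    N = d2.shape[0]
    return K.sum(axis=1) / (N * (2 * np.pi * eps * rho ** 2) ** (d / 2.0))
```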
Notice that these new values of $\epsilon$ and $d$ can differ from the previous values used in the kernel (10), although empirically the new dimension $d$ is typically very similar.

We can now evaluate the kernel (7) on all pairs from the data set to form the matrix $K^S_{\epsilon,i,j} = K^S_\epsilon(x_i,x_j)$, and we can compute the first normalization factor $q^S_\epsilon(x_i) = \sum_{j=1}^{N}K^S_\epsilon(x_i,x_j)/q_\epsilon(x_i)^{d\beta}$ as in Corollary 1. We define a diagonal matrix $D_{i,i} = q^S_\epsilon(x_i)$, and the first normalization forms the matrix $K^S_{\epsilon,\alpha} = D^{-\alpha}K^S_\epsilon D^{-\alpha}$. We then compute the second normalization factor $q^S_{\epsilon,\alpha}(x_i) = \sum_{j=1}^{N}K^S_{\epsilon,\alpha,i,j}$ and form a diagonal matrix $D_{\alpha,i,i} = q^S_{\epsilon,\alpha}(x_i)$. The second normalization forms the matrix $\hat{K}^S_{\epsilon,\alpha} = D_\alpha^{-1}K^S_{\epsilon,\alpha}$. We define the final normalization diagonal matrix $\hat{D}_{i,i} = 2\epsilon\, q_\epsilon(x_i)^{2\beta}$, and by Corollary 1,
\[ L^S_{\epsilon,\alpha,\beta} = \hat{D}^{-1}\left(\hat{K}^S_{\epsilon,\alpha} - \mathrm{Id}\right) = \hat{D}^{-1}D_\alpha^{-1}D^{-\alpha}K^S_\epsilon D^{-\alpha} - \hat{D}^{-1} \]
approximates the desired operator $\hat{\mathcal{L}}$ when $\beta = -1/2$ and $\alpha = -d/4$. To find the eigenvectors of $L^S_{\epsilon,\alpha,\beta}$, which approximate the eigenfunctions $\varphi_j$ of $\hat{\mathcal{L}}$, we note that setting $P = \hat{D}^{1/2}D_\alpha^{1/2}$ we can define a symmetric matrix,
\[ \hat{L} \equiv P\, L^S_{\epsilon,\alpha,\beta}\, P^{-1} = P^{-1}D^{-\alpha}K^S_\epsilon D^{-\alpha}P^{-1} - \hat{D}^{-1}. \]
Since $\hat{L}$ is a symmetric matrix conjugate to $L^S_{\epsilon,\alpha,\beta}$, we can compute the eigendecomposition $\hat{L} = \hat{U}\Lambda\hat{U}^\top$ efficiently, and the eigenvectors of $L^S_{\epsilon,\alpha,\beta}$ are then given by the column vectors of $U = P^{-1}\hat{U}$.

Note that the columns of $\hat{U}$ will be numerically orthogonal, so the columns of $U$ are orthogonal with respect to $P^2$, since $\mathrm{Id} = \hat{U}^\top\hat{U} = U^\top P^2 U$. A careful calculation based on the asymptotic expansions in [5] shows that $P^2_{ii} = q(x_i)^{c_1-1} + O(\epsilon)$, and in general $q^{c_1}$ is the invariant measure of the gradient flow (6), so that $P^2$ represents the ratio between the invariant measure $e^{-c_1 U}$ of (6) and the sampling measure $q$. However, in our case, since $c_1 = 1$, we have $P^2 = \mathrm{Id} + O(\epsilon)$. Thus, for the case $c_1 = 1$, we take the eigenvectors $\varphi_j$ to be the column vectors of $\hat{U}$, since these eigenvectors are numerically orthogonal and are equal to the column vectors of $U$ up to order $\epsilon$. Notice that the orthogonality of these vectors, $\frac{1}{N}\sum_{i=1}^{N}\varphi_l(x_i)\varphi_j(x_i) \approx \langle\varphi_l,\varphi_j\rangle_q$, corresponds to the orthogonality of the eigenfunctions $\varphi_j$ with respect to the sampling measure (and since $c_1 = 1$, this is also the invariant measure of (6)). Finally, in order to ensure that the eigenvectors are orthonormal, we renormalize each column vector so that $\frac{1}{N}\sum_{i=1}^{N}\varphi_j(x_i)^2 = 1$.
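Putting the normalizations together, the following sketch maps the kernel matrix (7) to the basis $\{\varphi_j\}$ via the symmetric eigenproblem described above (illustrative; it assumes $c_1 = 1$, $\beta = -1/2$, and $\alpha = -d/4$, so the eigenvectors of the symmetric matrix are used directly):

```python
import numpy as np

def diffusion_basis(K, q_eps, eps, d, n_eig, beta=-0.5):
    """Sketch of the Appendix A pipeline: from the variable-bandwidth
    kernel matrix K (7) to eigenvectors approximating the eigenfunctions
    phi_j of the operator L_hat.
    """
    N = K.shape[0]
    alpha = -d / 4.0
    qS = K.sum(axis=1) / q_eps ** (d * beta)          # first normalization factor
    K_alpha = K / np.outer(qS ** alpha, qS ** alpha)  # K_alpha = D^-a K D^-a
    qS_alpha = K_alpha.sum(axis=1)                    # second normalization factor
    Dhat = 2.0 * eps * q_eps ** (2 * beta)            # final diagonal scaling
    # Symmetric conjugate: L_hat = P L P^-1 with P = Dhat^1/2 D_alpha^1/2
    s = 1.0 / np.sqrt(qS_alpha * Dhat)
    L_hat = K_alpha * np.outer(s, s) - np.diag(1.0 / Dhat)
    vals, vecs = np.linalg.eigh(L_hat)
    order = np.argsort(-vals)[:n_eig]                 # eigenvalues nearest 0 first
    phi = vecs[:, order]
    phi /= np.sqrt((phi ** 2).mean(axis=0))           # (1/N) sum_i phi_j(x_i)^2 = 1
    return vals[order], phi
```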
APPENDIX B: ESTIMATING THE SEMI-GROUP SOLUTIONS FOR NON-ELLIPTIC OPERATORS

Consider a dynamical system for a state vector $x$ on a manifold $\mathcal{M} \subset \mathbb{R}^n$ given by,
\[ dx = a(x)\,dt + b(x)\,dW_t, \qquad (13) \]
where $W_t$ is a standard Brownian process (generated by the Laplacian on $\mathcal{M}$), $a$ is a vector field (which is not necessarily the gradient of a potential), and $b$ a diffusion tensor, all defined on $\mathcal{M}$. Let $x_i = x(t_i)$ be a time series realization of (13) at discrete times $\{t_i\}_{i=1}^{N+1}$ with $\tau = t_{i+1} - t_i$. As in the previous section, we represent a smooth function $f$ on the manifold by the vector $f_i = f(x_i)$. Using the time ordering of the data set, we can define the shift map, $Sf(x_i) = f(x_{i+1})$.

Applying the Itô formula to the process $y(t) = f(x(t))$, we have,
\[ dy(s) = \left(a\cdot\nabla f + \frac{1}{2}\mathrm{Tr}\big(b^\top\mathrm{H}(f)\, b\big)\right)ds + \nabla f^\top b\, dW_s \equiv \mathcal{L}f\, ds + \nabla f^\top b\, dW_s, \qquad (14) \]
where $\mathrm{H}(f)$ denotes the Hessian, and the functions and derivatives are evaluated at $x(s)$. We first show that the expected value of the discrete-time shift map is the semigroup solution $e^{\tau\mathcal{L}}$ associated to the generator $\mathcal{L}$ of the system (13).

For all smooth functions $f$ defined on the data set, the shift map yields a function $Sf$ which is defined on the first $N$ data points,
\[ Sf(x_i) = f(x(t_{i+1})) = y(t_{i+1}) = f(x_i) + \int_{t_i}^{t_{i+1}}\mathcal{L}f\, ds + \int_{t_i}^{t_{i+1}}\nabla f^\top b\, dW_s, \qquad (15) \]
and taking the expectation conditional on the state $x_i$,
\[ \mathbb{E}_{x_i}[Sf(x_i)] = f(x_i) + \int_{t_i}^{t_{i+1}}\mathbb{E}_{x_i}[\mathcal{L}f(x_s)]\, ds. \qquad (16) \]
Recall that by the Feynman-Kac connection, the conditional expectation of the functional $y(t_{i+1}) = f(x_{i+1})$ is given by the semigroup solution $\mathbb{E}_{x_i}[y(t_{i+1})] = e^{\tau\mathcal{L}}f(x_i)$, and combining this with (16) we find,
\[ e^{\tau\mathcal{L}}f(x_i) = f(x_i) + \int_{t_i}^{t_{i+1}}\mathbb{E}_{x_i}[\mathcal{L}f(x_s)]\, ds. \qquad (17) \]
Substituting (17) into (15), we find,
\[ Sf(x_i) = e^{\tau\mathcal{L}}f(x_i) + \int_{t_i}^{t_{i+1}}\nabla f^\top b\, dW_s + \int_{t_i}^{t_{i+1}}\left(\mathcal{L}f - \mathbb{E}_{x_i}[\mathcal{L}f]\right)ds. \qquad (18) \]
The formula (18) shows that the expectation of the shift map $S$ is the semigroup solution $e^{\tau\mathcal{L}}$, as claimed. This suggests that we can use the shift map to estimate the semigroup solution of the generator of (13). We next show that, by representing $S$ in an appropriate basis, we can minimize the error of this estimate.

From (18), for any smooth function $g$, we can define the Monte-Carlo integral,
\[ \langle g, Sf\rangle_{p_{eq}} = \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}g(x(t_i))\,Sf(x(t_i)) = \langle g, e^{\tau\mathcal{L}}f\rangle_{p_{eq}} + \Big\langle g, \int_{t_i}^{t_{i+1}}\nabla f^\top b\, dW_s\Big\rangle_{p_{eq}} + \Big\langle g, \int_{t_i}^{t_{i+1}}\mathcal{B}f\, ds\Big\rangle_{p_{eq}}, \qquad (19) \]
where we define $\mathcal{B}f = \mathcal{L}f - \mathbb{E}_{x_i}[\mathcal{L}f]$. The Monte-Carlo integral implies that the inner products should be taken with respect to the sampling measure of the training data set, and we assume that the evolution of $x$ is ergodic, so that the sampling measure is the invariant measure $p_{eq}$ of the system (13). Note that for smooth functions $f, g \in L^2(\mathcal{M}, p_{eq})$, the final integral in (19) will be order $\tau$, since it is deterministic and the inner product with $g$ is bounded. Therefore, our goal is to choose $f, g$ from an orthonormal basis for $L^2(\mathcal{M}, p_{eq})$ which minimizes the inner product with the stochastic integral, and thereby reduces the variance of our estimates of the coefficients.

We first expand the norm of $\Omega(f) = \int_{t_i}^{t_{i+1}}\nabla f^\top b\, dW_s$ by applying the Itô isometry,
\[ \|\Omega(f)\|^2_{p_{eq}} = \int_\mathcal{M}\mathbb{E}\left[\left(\int_{t_i}^{t_{i+1}}\nabla f^\top b\, dW_s\right)^2\right]p_{eq}(x_i)\,dV(x_i) = \int_\mathcal{M}\int_{t_i}^{t_{i+1}}(\nabla f^\top b)^2\, ds\; p_{eq}(x_i)\,dV(x_i) = \tau\|\nabla f^\top b\|^2_{p_{eq}} + O(\tau^2). \qquad (20) \]
In order to have a simple generic approach, we avoid estimating the diffusion tensor $b$ by assuming that the norm $\|b(x)\| = \sup_v \|v^\top b\|/\|v\|$ is bounded above on the manifold by a constant $b_{\max}$. We can now apply the Cauchy-Schwarz inequality to find that,
\[ |\langle g, \Omega(f)\rangle_{p_{eq}}| \le \|g\|_{p_{eq}}\,\|\Omega(f)\|_{p_{eq}} \le \sqrt{\tau}\, b_{\max}\,\|g\|_{p_{eq}}\,\|\nabla f\|_{p_{eq}} + O(\tau). \qquad (21) \]
The optimal basis will be the one that minimizes the norm $\|\nabla f\|^2_{p_{eq}} = \int_\mathcal{M}|\nabla f|^2\, p_{eq}\, dV(x)$.
As shown in Appendix A, the norm $\|\nabla f\|^2_{p_{eq}}$ is provably minimized by the eigenfunctions of the generator $\hat{\mathcal{L}}$ of the gradient flow system (6), and these eigenfunctions can be estimated by the diffusion kernel (7). Letting $\varphi_i$ be the eigenfunctions of $\hat{\mathcal{L}}$ with eigenvalues $\lambda_i$, so that $\hat{\mathcal{L}}\varphi_i = \lambda_i\varphi_i$, the set $\{\varphi_i\}$ is a basis for $L^2(\mathcal{M}, p_{eq})$, and the norm we wish to minimize is given by the eigenvalues, $\|\nabla\varphi_i\|^2_{p_{eq}} = -\lambda_i = |\lambda_i|$, since $\lambda_i \le 0$.

Replacing $f$ and $g$ in (19) with these eigenfunctions, for a finite data set we define $\hat{A}_{lj} \equiv \frac{1}{N}\sum_{i=1}^{N}\varphi_j(x_i)\,S\varphi_l(x_i)$ and $A_{lj} \equiv \langle\varphi_j, e^{\tau\mathcal{L}}\varphi_l\rangle_{p_{eq}}$. Since $\mathbb{E}[\varphi_j(x_i)\,S\varphi_l(x_i)] = A_{lj}$, we have $\mathbb{E}[\hat{A}_{lj}] = A_{lj}$, which shows that $\hat{A}_{lj}$ is an unbiased estimate of $A_{lj}$. Using the error bounds derived above, the variance is,
\[ \mathbb{E}\big[(\hat{A}_{lj} - A_{lj})^2\big] \le b_{\max}^2\,|\lambda_l|\,\tau\, N^{-1} + O(\tau^2 N^{-1}), \]
assuming that the $x_i$ are independent. Since the $x_i$ form a time series, they are not independent, and the convergence of the Monte-Carlo integral will be slower if the dependence is strong. In that case, one may need to subsample the time series, which simultaneously requires a larger data set. Assuming independence, by the Chebyshev bound,
\[ \mathbb{P}\big(|\hat{A}_{lj} - A_{lj}| \ge \varepsilon\big) \le b_{\max}^2\,|\lambda_l|\,\tau\,\varepsilon^{-2}N^{-1} + O(\tau^2\varepsilon^{-2}N^{-1}), \]
and balancing these error terms requires $|\lambda_l| < b_{\max}^{-2}\sqrt{\tau}$, with errors of order $\varepsilon = O(\tau^{3/4}N^{-1/2})$.
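As an independent numerical illustration of the unbiasedness of $\hat{A}$ (not an experiment from the paper), consider a one-dimensional Ornstein-Uhlenbeck process $dx = -x\,dt + \sqrt{2}\,dW_t$: here the dynamics coincide with the gradient flow (6), the eigenfunctions $\varphi_l$ are normalized (probabilists') Hermite polynomials with $\hat{\mathcal{L}}\varphi_l = -l\varphi_l$, and $A_{lj} = e^{-l\tau}\delta_{lj}$ exactly, so the Monte-Carlo estimate (4) can be checked directly. All parameter values below are illustrative.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermevander
from math import factorial

rng = np.random.default_rng(0)
N, dt, tau_steps = 200_000, 0.01, 10      # sampling time tau = 0.1
tau = dt * tau_steps

# Ergodic trajectory of dx = -x dt + sqrt(2) dW (Euler-Maruyama, stationary start)
x = np.empty(N + 1)
x[0] = rng.standard_normal()
for i in range(N):
    x[i + 1] = x[i] - x[i] * dt + np.sqrt(2 * dt) * rng.standard_normal()

xs = x[::tau_steps]                       # consecutive samples are tau apart

# Orthonormal eigenfunctions of the OU generator: phi_l = He_l / sqrt(l!)
M = 4
phi = hermevander(xs, M - 1) / np.array([np.sqrt(factorial(l)) for l in range(M)])

n = len(xs) - 1
A_hat = phi[1:, :].T @ phi[:-1, :] / n    # Monte-Carlo estimate (4)
A_true = np.diag(np.exp(-np.arange(M) * tau))
print(np.round(A_hat, 2))                 # should approach A_true as n grows
print(np.round(A_true, 2))
```

Note that the subsampled points are still weakly correlated, so the empirical error slightly exceeds the independent-sample bound, consistent with the discussion above.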