Bridging data science and dynamical systems theory
Tyrus Berry∗, Dimitrios Giannakis†, John Harlim‡
July 1, 2020
Modern science is undergoing what might arguably be called a "data revolution", manifested by a rapid growth of observed and simulated data from complex systems, as well as vigorous research on mathematical and computational frameworks for data analysis. In many scientific branches, these efforts have led to the creation of statistical models of complex systems that match or exceed the skill of first-principles models. Yet, despite these successes, statistical models are oftentimes treated as black boxes, providing limited guarantees about stability and convergence as the amount of training data increases. Black-box models also offer limited insights about the operating mechanisms (physics), the understanding of which is central to the advancement of science.

In this short review, we describe mathematical techniques for statistical analysis and prediction of time-evolving phenomena, ranging from simple examples such as an oscillator, to highly complex systems such as the turbulent motion of the Earth's atmosphere, the folding of proteins, and the evolution of species populations in an ecosystem. Our main thesis is that combining ideas from the theory of dynamical systems with learning theory provides an effective route to data-driven models of complex systems, with refinable predictions as the amount of training data increases, and physical interpretability through discovery of coherent patterns around which the dynamics is organized. Our article thus serves as an invitation to explore ideas at the interface of the two fields. This is a vast subject, and invariably a number of important developments in areas such as deep learning [EHJ17] and reservoir computing [PHG+18, VPH+18] are not discussed here.

∗ Tyrus Berry is an assistant professor at George Mason University. His email address is [email protected].
† Dimitrios Giannakis is an associate professor at New York University. His email address is [email protected].
‡ John Harlim is a professor at the Pennsylvania State University. His email address is [email protected].
Statistical forecasting and coherent pattern extraction
Consider a dynamical system of the form $\Phi^t : \Omega \to \Omega$, where $\Omega$ is the state space and $\Phi^t$, $t \in \mathbb{R}$, the flow map. For example, $\Omega$ could be Euclidean space $\mathbb{R}^d$, or a more general manifold, and $\Phi^t$ the solution map for a system of ODEs defined on $\Omega$. Alternatively, in a PDE setting, $\Omega$ could be an infinite-dimensional function space and $\Phi^t$ an evolution group acting on it. We consider that $\Omega$ has the structure of a metric space equipped with its Borel $\sigma$-algebra, playing the role of an event space, with measurable functions on $\Omega$ acting as random variables, called observables.

In a statistical modeling scenario, we consider that available to us are time series of various such observables, sampled along a dynamical trajectory which we will treat as being unknown. Specifically, we assume that we have access to two observables, $X : \Omega \to \mathcal{X}$ and $Y : \Omega \to \mathcal{Y}$, respectively referred to as covariate and response functions, together with corresponding time series $x_0, x_1, \ldots, x_{N-1}$ and $y_0, y_1, \ldots, y_{N-1}$, where $x_n = X(\omega_n)$, $y_n = Y(\omega_n)$, and $\omega_n = \Phi^{n\,\Delta t}(\omega_0)$. Here, $\mathcal{X}$ and $\mathcal{Y}$ are metric spaces, $\Delta t$ is a positive sampling interval, and $\omega_0$ an arbitrary point in $\Omega$ initializing the trajectory. We shall refer to the collection $\{(x_0, y_0), \ldots, (x_{N-1}, y_{N-1})\}$ as the training data. We require that $\mathcal{Y}$ is a Banach space (so that one can talk about expectations and other functionals applied to $Y$), but allow the covariate space $\mathcal{X}$ to be nonlinear.

Many problems in statistical modeling of dynamical systems can be expressed in this framework. For instance, in a low-dimensional ODE setting, $X$ and $Y$ could both be the identity map on $\Omega = \mathbb{R}^d$, and the task could be to build a model for the evolution of the full system state.
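As a concrete illustration of this setup, the sketch below generates covariate/response time series from the Lorenz 63 system that serves as a running example later in the article. The RK4 integrator, step sizes, and the choice of covariate (full state) and response (first coordinate) are our own illustrative assumptions, not the authors' code.

```python
import numpy as np

def l63(w, sigma=10.0, rho=28.0, beta=8.0/3.0):
    # Lorenz 63 vector field on Omega = R^3.
    x, y, z = w
    return np.array([sigma*(y - x), x*(rho - z) - y, x*y - beta*z])

def step(w, dt):
    # One RK4 step, approximating the flow map Phi^{dt}.
    k1 = l63(w)
    k2 = l63(w + 0.5*dt*k1)
    k3 = l63(w + 0.5*dt*k2)
    k4 = l63(w + dt*k3)
    return w + dt*(k1 + 2*k2 + 2*k3 + k4)/6.0

def training_data(N, dt=0.01, spinup=1000):
    # Covariate X = identity on the state, response Y = first coordinate;
    # a transient is discarded so samples lie near the attractor.
    w = np.array([1.0, 1.0, 1.0])
    for _ in range(spinup):
        w = step(w, dt)
    xs = np.empty((N, 3))
    for n in range(N):
        xs[n] = w
        w = step(w, dt)
    return xs, xs[:, 0].copy()

x_train, y_train = training_data(2000)
```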
Weather forecasting is a classical high-dimensional application, where $\Omega$ is the abstract state space of the climate system, and $X$ a (highly non-invertible) map representing measurements from satellites, meteorological stations, and other sensors available to a forecaster. The response $Y$ could be temperature at a specific location, $\mathcal{Y} = \mathbb{R}$, illustrating that the response space may be of considerably lower dimension than the covariate space. In other cases, e.g., forecasting the temperature field over a geographical region, $\mathcal{Y}$ may be a function space. The two primary questions that will concern us here are:

Problem 1 (Statistical forecasting). Given the training data, construct ("learn") a function $Z_t : \mathcal{X} \to \mathcal{Y}$ that predicts $Y$ at a lead time $t \geq 0$. That is, $Z_t$ should have the property that $Z_t \circ X$ is closest to $Y \circ \Phi^t$ among all functions in a suitable class.

Problem 2 (Coherent pattern extraction). Given the training data, identify a collection of observables $z_j : \Omega \to \mathcal{Y}$ which have the property of evolving coherently under the dynamics. By that, we mean that $z_j \circ \Phi^t$ should be relatable to $z_j$ in a natural way.

These problems have an extensive history of study from an interdisciplinary perspective spanning mathematics, statistics, physics, and many other fields. Here, our focus will be on nonparametric methods, which do not employ explicit parametric models for the dynamics. Instead, they use universal structural properties of dynamical systems to inform the design of data analysis techniques. From a learning standpoint, Problems 1 and 2 can be thought of as supervised and unsupervised learning, respectively. A mathematical requirement we will impose on methods addressing either problem is that they have a well-defined notion of convergence, i.e., they are refinable, as the number $N$ of training samples increases.

Analog and POD approaches
Among the earliest examples of nonparametric forecasting techniques is Lorenz's analog method [Lor69]. This simple, elegant approach makes predictions by tracking the evolution of the response along a dynamical trajectory in the training data (the analogs). Good analogs are selected according to a measure of geometrical similarity between the covariate variable observed at forecast initialization and the covariate training data. This method posits that past behavior of the system is representative of its future behavior, so looking up states in a historical record that are closest to current observations is likely to yield a skillful forecast. Subsequent methodologies have also emphasized aspects of state space geometry, e.g., using the training data to approximate the evolution map through patched local linear models, often leveraging delay coordinates for state space reconstruction.

Early approaches to coherent pattern extraction include the proper orthogonal decomposition (POD) [Kos43], which is closely related to principal component analysis (PCA; introduced in the early 20th century by Pearson), the Karhunen-Loève expansion, and empirical orthogonal function (EOF) analysis. Assuming that $\mathcal{Y}$ is a Hilbert space, POD yields an expansion $Y \approx Y_L = \sum_{j=1}^{L} z_j$, $z_j = u_j \sigma_j \psi_j$. Arranging the data into a matrix $\mathbf{Y} = (y_0, \ldots, y_{N-1})$, the $\sigma_j$ are the singular values of $\mathbf{Y}$ (in decreasing order), the $u_j$ are the corresponding left singular vectors, called EOFs, and the $\psi_j$ are given by projections of $Y$ onto the EOFs, $\psi_j(\omega) = \langle u_j, Y(\omega)\rangle_{\mathcal{Y}}/\sigma_j$. That is, the principal component $\psi_j : \Omega \to \mathbb{R}$ is a linear feature characterizing the unsupervised data $\{y_0, \ldots, y_{N-1}\}$. If the data is drawn from a probability measure $\mu$, as $N \to \infty$ the POD expansion is optimal in an $L^2(\mu)$ sense; that is, $Y_L$ has minimal $L^2(\mu)$ error $\|Y - Y_L\|_{L^2(\mu)}$ among all rank-$L$ approximations of $Y$.
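The POD construction above can be sketched with a plain SVD. The synthetic snapshot matrix below is our own illustrative assumption; the rank-$L$ optimality invoked in the text is the Eckart-Young property of the truncated SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
# Snapshot matrix Y = (y_0, ..., y_{N-1}): synthetic d-dimensional data
# lying (up to small noise) in a two-dimensional subspace.
d, N = 10, 500
t = np.linspace(0.0, 20*np.pi, N)
modes = rng.standard_normal((d, 2))
Y = modes @ np.vstack([np.cos(t), np.sin(t)]) + 0.01*rng.standard_normal((d, N))

# POD via SVD: sigma_j are the singular values (decreasing), the columns
# of U are the EOFs u_j, and the rows of Vt carry the principal components.
U, s, Vt = np.linalg.svd(Y, full_matrices=False)

L = 2
Y_L = U[:, :L] @ np.diag(s[:L]) @ Vt[:L, :]     # rank-L POD reconstruction

# By the Eckart-Young theorem, Y_L has minimal error among all rank-L
# approximations; here two modes capture nearly all the variance.
rel_err = np.linalg.norm(Y - Y_L)/np.linalg.norm(Y)
```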
Effectively, from the perspective of POD, the important components of $Y$ are those capturing maximal variance.

Despite many successes in challenging applications (e.g., turbulence), it has been recognized that POD may not reveal dynamically significant observables, offering limited predictability and physical insight. In recent years, there has been significant interest in techniques that address this shortcoming by modifying the linear map $\mathbf{Y}$ to have an explicit dependence on the dynamics [BK86], or replacing it by an evolution operator [DJ99, MB99]. Either directly or indirectly, these methods make use of operator-theoretic ergodic theory, which we now discuss.

Operator-theoretic formulation
The operator-theoretic formulation of dynamical systems theory shifts attention from the state-space perspective, and instead characterizes the dynamics through its action on linear spaces of observables. Denoting the vector space of $\mathcal{Y}$-valued functions on $\Omega$ by $\mathcal{F}$, for every time $t$ the dynamics has a natural induced action $U^t : \mathcal{F} \to \mathcal{F}$ given by composition with the flow map, $U^t f = f \circ \Phi^t$. It then follows by definition that $U^t$ is a linear operator, i.e., $U^t(\alpha f + g) = \alpha U^t f + U^t g$ for all observables $f, g \in \mathcal{F}$ and every scalar $\alpha \in \mathbb{C}$. The operator $U^t$ is known as a composition operator, or Koopman operator after classical work of Bernard Koopman in the 1930s [Koo31], which established that a general (potentially nonlinear) dynamical system can be characterized through intrinsically linear operators acting on spaces of observables. A related notion is that of the transfer operator, $P^t : \mathcal{M} \to \mathcal{M}$, which describes the action of the dynamics on a space of measures $\mathcal{M}$ via the pushforward map, $P^t m := \Phi^t_* m = m \circ \Phi^{-t}$. In a number of cases, $\mathcal{F}$ and $\mathcal{M}$ are dual spaces to one another (e.g., continuous functions and Radon measures), in which case $U^t$ and $P^t$ are dual operators.

If the space of observables under consideration is equipped with a Banach or Hilbert space structure, and the dynamics preserves that structure, the operator-theoretic formulation allows a broad range of tools from spectral theory and approximation theory for linear operators to be employed in the study of dynamical systems. For our purposes, a particularly advantageous aspect of this approach is that it is amenable to rigorous statistical approximation, which is one of our principal objectives. It should be kept in mind that the spaces of observables encountered in applications are generally infinite-dimensional, leading to behaviors with no counterparts in finite-dimensional linear algebra, such as unbounded operators and continuous spectrum.
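The defining linearity of the composition operator can be checked directly. A tiny sketch (our own illustration, with a discrete-time map standing in for the flow):

```python
import numpy as np

def Phi(x):
    # A nonlinear discrete-time map standing in for the flow Phi^t.
    return 4.0 * x * (1.0 - x)

def U(f):
    # Composition (Koopman) operator: (U f)(x) = f(Phi(x)).
    return lambda x: f(Phi(x))

f = lambda x: np.sin(x)
g = lambda x: x**2
alpha = 2.5
x = np.linspace(0.0, 1.0, 101)

lhs = U(lambda s: alpha*f(s) + g(s))(x)    # U(alpha f + g)
rhs = alpha*U(f)(x) + U(g)(x)              # alpha U f + U g
# lhs and rhs agree pointwise: U acts linearly on observables
# even though Phi itself is nonlinear.
```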
In fact, as we will see below, the presence of continuous spectrum is a hallmark of mixing (chaotic) dynamics.

In this review, we restrict attention to the operator-theoretic description of measure-preserving, ergodic dynamics. By that, we mean that there is a probability measure $\mu$ on $\Omega$ such that (i) $\mu$ is invariant under the flow, i.e., $\Phi^t_* \mu = \mu$; and (ii) every measurable, $\Phi^t$-invariant set has either zero or full $\mu$-measure. We also assume that $\mu$ is a Borel measure with compact support $A \subseteq \Omega$; this set is necessarily $\Phi^t$-invariant. An example known to rigorously satisfy these properties is the Lorenz 63 (L63) system on $\Omega = \mathbb{R}^3$, which has a compactly supported, ergodic invariant measure supported on the famous "butterfly" fractal attractor; see Figure 1. L63 exemplifies the fact that a smooth dynamical system may exhibit invariant measures with non-smooth supports. This behavior is ubiquitous in models of physical phenomena, which are formulated in terms of smooth differential equations, but whose long-term dynamics concentrate on lower-dimensional subsets of state space due to the presence of dissipation. Our methods should therefore not rely on the existence of a smooth structure for $A$.

In the setting of ergodic, measure-preserving dynamics on a metric space, two relevant structures that the dynamics may be required to preserve are continuity and $\mu$-measurability of observables. If the flow $\Phi^t$ is continuous, then the Koopman operators act on the Banach space $\mathcal{F} = C(A, \mathcal{Y})$ of continuous, $\mathcal{Y}$-valued functions on $A$, equipped with the uniform norm, by isometries, i.e., $\|U^t f\|_{\mathcal{F}} = \|f\|_{\mathcal{F}}$. If $\Phi^t$ is $\mu$-measurable, then $U^t$ lifts to an operator on equivalence classes of $\mathcal{Y}$-valued functions in $L^p(\mu, \mathcal{Y})$, $1 \leq p \leq \infty$, and acts again by isometries.
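Measure invariance and ergodicity can be illustrated numerically with a standard discrete-time example (our own choice, not from the article): the logistic map $x \mapsto 4x(1-x)$, which is ergodic for an invariant measure with density $1/(\pi\sqrt{x(1-x)})$ and mean $1/2$.

```python
import numpy as np

def Phi(x):
    # Logistic map: measure-preserving and ergodic for the arcsine
    # invariant measure on [0, 1], which has mean 1/2.
    return 4.0 * x * (1.0 - x)

N = 200_000
samples = np.empty(N)
x = 0.123
for n in range(N):
    samples[n] = x
    x = Phi(x)

# Ergodicity: time averages along a single trajectory approximate
# expectations under the invariant measure mu.
time_avg = samples.mean()            # close to E_mu[x] = 1/2

# Invariance of mu: an observable f and U f = f o Phi have
# (asymptotically) the same time average.
f_avg  = np.sin(samples).mean()
Uf_avg = np.sin(Phi(samples)).mean()
```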
If $\mathcal{Y}$ is a Hilbert space (with inner product $\langle\cdot,\cdot\rangle_{\mathcal{Y}}$), the case $p = 2$ is special, since $L^2(\mu, \mathcal{Y})$ is a Hilbert space with inner product $\langle f, g\rangle_{L^2(\mu,\mathcal{Y})} = \int_\Omega \langle f(\omega), g(\omega)\rangle_{\mathcal{Y}} \, d\mu(\omega)$, on which $U^t$ acts as a unitary map, $U^{t*} = U^{-t}$.

Clearly, the properties of approximation techniques for observables and evolution operators depend on the underlying space. For instance, $C(A, \mathcal{Y})$ has a well-defined notion of pointwise evaluation at every $\omega \in \Omega$ by a continuous linear map $\delta_\omega : C(A, \mathcal{Y}) \to \mathcal{Y}$, $\delta_\omega f = f(\omega)$, which is useful for interpolation and forecasting, but lacks an inner-product structure and associated orthogonal projections. On the other hand, $L^2(\mu)$ has inner-product structure, which is very useful theoretically as well as for numerical algorithms, but lacks the notion of pointwise evaluation.

Letting $\mathcal{F}$ stand for any of the $C(A, \mathcal{Y})$ or $L^p(\mu, \mathcal{Y})$ spaces, the set $U = \{U^t : \mathcal{F} \to \mathcal{F}\}_{t \in \mathbb{R}}$ forms a strongly continuous group under composition of operators. That is, $U^t \circ U^s = U^{t+s}$, $(U^t)^{-1} = U^{-t}$, and $U^0 = \mathrm{Id}$, so that $U$ is a group, and for every $f \in \mathcal{F}$, $U^t f$ converges to $f$ in the norm of $\mathcal{F}$ as $t \to 0$. A central notion in such evolution groups is that of the generator, defined by the $\mathcal{F}$-norm limit

$$V f = \lim_{t \to 0} \frac{U^t f - f}{t}$$

for all $f \in \mathcal{F}$ for which the limit exists. It can be shown that the domain $D(V)$ of all such $f$ is a dense subspace of $\mathcal{F}$, and $V : D(V) \to \mathcal{F}$ is a closed, unbounded operator. Intuitively, $V$ can be thought of as a directional derivative of observables along the dynamics. For example, if $\mathcal{Y} = \mathbb{C}$, $A$ is a $C^1$ manifold, and the flow $\Phi^t : A \to A$ is generated by a continuous vector field $\vec{V} : A \to TA$, the generator of the Koopman group on $C(A)$ has as its domain the space $C^1(A) \subset C(A)$ of continuously differentiable, complex-valued functions, and $V f = \vec{V} \cdot \nabla f$ for $f \in C^1(A)$. A strongly continuous evolution group is completely characterized by its generator, as any two such groups with the same generator are identical.

The generator acquires additional properties in the setting of unitary evolution groups on $H = L^2(\mu, \mathcal{Y})$, where it is skew-adjoint, $V^* = -V$. Note that the skew-adjointness of $V$ holds for more general measure-preserving dynamics than Hamiltonian systems, whose generator is skew-adjoint with respect to Lebesgue measure. By the spectral theorem for skew-adjoint operators, there exists a unique projection-valued measure $E : \mathcal{B}(\mathbb{R}) \to B(H)$, giving the generator and Koopman operator as the spectral integrals

$$V = \int_{\mathbb{R}} i\alpha \, dE(\alpha), \qquad U^t = e^{tV} = \int_{\mathbb{R}} e^{i\alpha t} \, dE(\alpha).$$

Here, $\mathcal{B}(\mathbb{R})$ is the Borel $\sigma$-algebra on the real line, and $B(H)$ the space of bounded operators on $H$. Intuitively, $E$ can be thought of as an operator analog of a complex-valued spectral measure in Fourier analysis, with $\mathbb{R}$ playing the role of frequency space. That is, given $f \in H$, the $\mathbb{C}$-valued Borel measure $E_f(S) = \langle f, E(S) f\rangle_H$ is precisely the Fourier spectral measure associated with the time-autocorrelation function $C_f(t) = \langle f, U^t f\rangle_H$.
The latter admits the Fourier representation $C_f(t) = \int_{\mathbb{R}} e^{i\alpha t} \, dE_f(\alpha)$.

The Hilbert space $H$ admits a $U^t$-invariant splitting $H = H_a \oplus H_c$ into orthogonal subspaces $H_a$ and $H_c$ associated with the point and continuous components of $E$, respectively. In particular, $E$ has a unique decomposition $E = E_a + E_c$ with $H_a = \operatorname{ran} E_a(\mathbb{R})$ and $H_c = \operatorname{ran} E_c(\mathbb{R})$, where $E_a$ is a purely atomic spectral measure, and $E_c$ is a spectral measure with no atoms. The atoms of $E_a$ (i.e., the singletons $\{\alpha_j\}$ with $E_a(\{\alpha_j\}) \neq 0$) correspond to eigenfrequencies of the generator, for which the eigenvalue equation $V z_j = i\alpha_j z_j$ has a nonzero solution $z_j \in H_a$. Under ergodic dynamics, every eigenspace of $V$ is one-dimensional, so that if $z_j$ is normalized to unit $L^2(\mu)$ norm, $E(\{\alpha_j\}) f = \langle z_j, f\rangle_{L^2(\mu)} z_j$. Every such $z_j$ is an eigenfunction of the Koopman operator $U^t$ at eigenvalue $e^{i\alpha_j t}$, and $\{z_j\}$ is an orthonormal basis of $H_a$. Thus, every $f \in H_a$ has the quasiperiodic evolution $U^t f = \sum_j e^{i\alpha_j t} \langle z_j, f\rangle_{L^2(\mu)} z_j$, and the autocorrelation $C_f(t)$ is also quasiperiodic. While $H_a$ always contains constant eigenfunctions with zero frequency, it might not have any non-constant elements. In that case, the dynamics is said to be weak-mixing. In contrast to the quasiperiodic evolution of observables in $H_a$, observables in the continuous spectrum subspace exhibit a loss of correlation characteristic of mixing (chaotic) dynamics. Specifically, for every $f \in H_c$ the time-averaged autocorrelation function $\bar{C}_f(t) = \int_0^t |C_f(s)| \, ds / t$ tends to 0 as $|t| \to \infty$, as do cross-correlation functions $\langle g, U^t f\rangle_{L^2(\mu)}$ between observables in $H_c$ and arbitrary observables in $L^2(\mu)$.

Data-driven forecasting
Based on the concepts introduced above, one can formulate statistical forecasting in Problem 1 as the task of constructing a function $Z_t : \mathcal{X} \to \mathcal{Y}$ on covariate space $\mathcal{X}$, such that $Z_t \circ X$ optimally approximates $U^t Y$ among all functions in a suitable class. We set $\mathcal{Y} = \mathbb{C}$, so the response variable is scalar-valued, and consider the Koopman operator on $L^2(\mu)$, so we have access to orthogonal projections. We also assume for now that the covariate function $X$ is injective, so $\hat{Y}_t := Z_t \circ X$ should be able to approximate $U^t Y$ to arbitrarily high precision in $L^2(\mu)$ norm. Indeed, let $\{u_0, u_1, \ldots\}$ be an orthonormal basis of $L^2(\nu)$, where $\nu = X_* \mu$ is the pushforward of the invariant measure onto $\mathcal{X}$. Then, $\{\phi_0, \phi_1, \ldots\}$ with $\phi_j = u_j \circ X$ is an orthonormal basis of $L^2(\mu)$. Given this basis, and because $U^t$ is bounded, we have $U^t Y = \lim_{L \to \infty} U^t_L Y$, where the partial sum $U^t_L Y := \sum_{j=0}^{L-1} \langle U^t Y, \phi_j\rangle_{L^2(\mu)} \phi_j$ converges in $L^2(\mu)$ norm. Here, $U^t_L$ is a finite-rank map on $L^2(\mu)$ with range $\operatorname{span}\{\phi_0, \ldots, \phi_{L-1}\}$, represented by an $L \times L$ matrix $\mathbf{U}(t)$ with elements $U_{ij}(t) = \langle \phi_i, U^t \phi_j\rangle_{L^2(\mu)}$. Defining $\vec{y} = (\hat{y}_0, \ldots, \hat{y}_{L-1})^\top$, $\hat{y}_j = \langle \phi_j, Y\rangle_{L^2(\mu)}$, and $(\hat{z}_0(t), \ldots, \hat{z}_{L-1}(t))^\top = \mathbf{U}(t) \vec{y}$, we have $U^t_L Y = \sum_{j=0}^{L-1} \hat{z}_j(t) \phi_j$. Since $\phi_j = u_j \circ X$, this leads to the estimator $\hat{Z}_{t,L} \in L^2(\nu)$, with $\hat{Z}_{t,L} = \sum_{j=0}^{L-1} \hat{z}_j(t) u_j$.

The approach outlined above tentatively provides a consistent forecasting framework.
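The finite-rank construction can be made concrete in a case where an orthonormal basis is known in closed form; the sketch below (our own illustration) uses a circle rotation, for which the Fourier functions form an orthonormal basis of $L^2(\mu)$ that diagonalizes $U^t$.

```python
import numpy as np

# Circle rotation Phi^t(theta) = theta + t, with invariant measure mu =
# normalized Lebesgue on [0, 2*pi). The Fourier functions phi_k = e^{ik theta}
# are an orthonormal basis of L^2(mu) and diagonalize U^t.
M = 512                                   # quadrature points for integrals over mu
theta = 2*np.pi*np.arange(M)/M
t = 0.7                                   # lead time
ks = np.arange(-3, 4)                     # truncated basis (L = 7 functions)
B  = np.exp(1j*np.outer(ks, theta))       # rows: phi_k sampled on the grid
Bt = np.exp(1j*np.outer(ks, theta + t))   # rows: U^t phi_k = phi_k o Phi^t

def inner(f, g):
    # <f, g>_{L^2(mu)} via the uniform grid average (exact for these modes).
    return np.mean(np.conj(f)*g)

# Matrix elements U_ij(t) = <phi_i, U^t phi_j>; for the rotation, U(t) is
# diagonal with entries e^{i k t}.
L = len(ks)
U = np.array([[inner(B[i], Bt[j]) for j in range(L)] for i in range(L)])

Y = np.cos(theta)                                      # observable Y
y_hat = np.array([inner(B[j], Y) for j in range(L)])   # y_hat_j = <phi_j, Y>
z_hat = U @ y_hat                                      # z_hat(t) = U(t) y_hat
forecast = np.real(z_hat @ B)                          # U_L^t Y = sum_j z_hat_j phi_j
# forecast coincides with the exact evolution cos(theta + t)
```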
Yet, while in principle appealing, it has three major shortcomings: (i) Apart from special cases, the invariant measure and an orthonormal basis of $L^2(\mu)$ are not known. In particular, orthogonal functions with respect to an ambient measure on $\Omega$ (e.g., Lebesgue-orthogonal polynomials) will not suffice, since there are no guarantees that such functions form a Schauder basis of $L^2(\mu)$, let alone be orthonormal. Even with a basis, we cannot evaluate $U^t$ on its elements without knowing $\Phi^t$. (ii) Pointwise evaluation on $L^2(\mu)$ is not defined, making $\hat{Z}_{t,L}$ inadequate in practice, even if the coefficients $\hat{z}_j(t)$ are known. (iii) The covariate map $X$ is oftentimes non-invertible, and thus the $\phi_j$ span a strict subspace of $L^2(\mu)$. We now describe methods to overcome these obstacles using learning theory.

Sampling measures and ergodicity
The dynamical trajectory $\{\omega_0, \ldots, \omega_{N-1}\}$ in state space underlying the training data is the support of a discrete sampling measure $\mu_N := \sum_{n=0}^{N-1} \delta_{\omega_n}/N$. A key consequence of ergodicity is that for Lebesgue-a.e. sampling interval $\Delta t$ and $\mu$-a.e. starting point $\omega_0 \in \Omega$, as $N \to \infty$, the sampling measures $\mu_N$ weak-converge to the invariant measure $\mu$; that is,

$$\lim_{N \to \infty} \int_\Omega f \, d\mu_N = \int_\Omega f \, d\mu, \quad \forall f \in C(\Omega). \qquad (1)$$

Since integrals against $\mu_N$ are time averages on dynamical trajectories, $\int_\Omega f \, d\mu_N = \sum_{n=0}^{N-1} f(\omega_n)/N$, ergodicity provides an empirical means of accessing the statistics of the invariant measure. In fact, many systems encountered in applications possess so-called physical measures, where (1) holds for $\omega_0$ in a "larger" set of positive measure with respect to an ambient measure (e.g., Lebesgue measure) from which experimental initial conditions are drawn. Hereafter, we will let $M$ be a compact subset of $\Omega$, which is forward-invariant under the dynamics (i.e., $\Phi^t(M) \subseteq M$ for all $t \geq 0$) and contains $A$. For example, in dissipative dynamical systems such as L63, $M$ can be chosen as a compact absorbing ball.

Shift operators
Ergodicity suggests that appropriate data-driven analogs of the $L^2(\mu)$ spaces are the $L^2(\mu_N)$ spaces induced by the sampling measures $\mu_N$. For a given $N$, $L^2(\mu_N)$ consists of equivalence classes of measurable functions $f : \Omega \to \mathbb{C}$ having common values at the sampled states $\omega_n$, and the inner product of two elements $f, g \in L^2(\mu_N)$ is given by an empirical time-correlation, $\langle f, g\rangle_{\mu_N} = \int_\Omega f^* g \, d\mu_N = \sum_{n=0}^{N-1} f^*(\omega_n) g(\omega_n)/N$. Moreover, if the $\omega_n$ are distinct (as we will assume for simplicity of exposition), $L^2(\mu_N)$ has dimension $N$, and is isomorphic as a Hilbert space to $\mathbb{C}^N$ equipped with a normalized dot product. Given that, we can represent every $f \in L^2(\mu_N)$ by a column vector $\vec{f} = (f(\omega_0), \ldots, f(\omega_{N-1}))^\top \in \mathbb{C}^N$, and every linear map $A : L^2(\mu_N) \to L^2(\mu_N)$ by an $N \times N$ matrix $\mathbf{A}$, so that $\vec{g} = \mathbf{A} \vec{f}$ is the column vector representing $g = A f$. The elements of $\vec{f}$ can also be understood as expansion coefficients in the standard basis $\{e_{0,N}, \ldots, e_{N-1,N}\}$ of $L^2(\mu_N)$, where $e_{j,N}(\omega_n) = N^{1/2} \delta_{jn}$; that is, $f(\omega_n) = N^{1/2}\langle e_{n,N}, f\rangle_{L^2(\mu_N)}$. Similarly, the elements of $\mathbf{A}$ correspond to the operator matrix elements $A_{ij} = \langle e_{i,N}, A e_{j,N}\rangle_{L^2(\mu_N)}$.

Next, we would like to define a Koopman operator on $L^2(\mu_N)$, but this space does not admit such an operator as a composition map induced by the dynamical flow $\Phi^t$ on $\Omega$. This is because $\Phi^t$ does not preserve null sets with respect to $\mu_N$, and thus does not lead to a well-defined composition map on equivalence classes of functions in $L^2(\mu_N)$. Nevertheless, on $L^2(\mu_N)$ there is an analogous construct to the Koopman operator on $L^2(\mu)$, namely the shift operator, $U^q_N : L^2(\mu_N) \to L^2(\mu_N)$, $q \in \mathbb{Z}$, defined as

$$U^q_N f(\omega_n) = \begin{cases} f(\omega_{n+q}), & 0 \leq n + q \leq N - 1, \\ 0, & \text{otherwise}. \end{cases}$$
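In the vector/matrix representation just described, the shift operator is simply a zero-padded coordinate shift; a minimal sketch (our own illustration):

```python
import numpy as np

def shift_matrix(N, q):
    # Matrix of U_N^q acting on value vectors: (U_N^q f)(omega_n) = f(omega_{n+q})
    # for 0 <= n + q <= N - 1, and 0 otherwise.
    U = np.zeros((N, N))
    for n in range(N - q):
        U[n, n + q] = 1.0
    return U

def inner_muN(f, g):
    # Empirical inner product <f, g>_{mu_N} = (1/N) sum_n f*(omega_n) g(omega_n).
    return np.dot(np.conj(f), g)/len(f)

N, q = 6, 2
f = np.arange(1.0, N + 1.0)       # values f(omega_0), ..., f(omega_5)
g = shift_matrix(N, q) @ f        # -> [3, 4, 5, 6, 0, 0]
```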
Even though $U^q_N$ is not a composition map, intuitively it should have a connection with the Koopman operator $U^{q\,\Delta t}$. One could consider, for instance, the matrix representation $\tilde{\mathbf{U}}_N(q) = [\langle e_{i,N}, U^q_N e_{j,N}\rangle_{L^2(\mu_N)}]$ in the standard basis, and attempt to connect it with a matrix representation of $U^{q\,\Delta t}$ in an orthonormal basis of $L^2(\mu)$. However, the issue with this approach is that the $e_{j,N}$ do not have $N \to \infty$ limits in $L^2(\mu)$, meaning that there is no suitable notion of $N \to \infty$ convergence of the matrix elements of $U^q_N$ in the standard basis. In response, we will construct a representation of the shift operator in a different orthonormal basis with a well-defined $N \to \infty$ limit. The main tools that we will use are kernel integral operators, which we now describe.

Kernel integral operators
In the present context, a kernel function will be a real-valued, continuous function $k : \Omega \times \Omega \to \mathbb{R}$ with the property that there exists a strictly positive, continuous function $d : \Omega \to \mathbb{R}$ such that

$$d(\omega)\, k(\omega, \omega') = d(\omega')\, k(\omega', \omega), \quad \forall \omega, \omega' \in \Omega. \qquad (2)$$

Notice the similarity between (2) and the detailed balance relation in reversible Markov chains. Let now $\rho$ be any Borel probability measure with compact support $S \subseteq M$ included in the forward-invariant set $M$. It follows by continuity of $k$ and compactness of $S$ that the integral operator $K_\rho : L^2(\rho) \to C(M)$,

$$K_\rho f = \int_\Omega k(\cdot, \omega) f(\omega) \, d\rho(\omega), \qquad (3)$$

is well-defined as a bounded operator mapping elements of $L^2(\rho)$ into continuous functions on $M$. Using $\iota_\rho : C(M) \to L^2(\rho)$ to denote the canonical inclusion map, we consider two additional integral operators, $G_\rho : L^2(\rho) \to L^2(\rho)$ and $\tilde{G}_\rho : C(M) \to C(M)$, with $G_\rho = \iota_\rho K_\rho$ and $\tilde{G}_\rho = K_\rho \iota_\rho$, respectively.

The operators $G_\rho$ and $\tilde{G}_\rho$ are compact operators acting with the same integral formula as $K_\rho$ in (3), but their codomains and domains, respectively, are different. Nevertheless, their nonzero eigenvalues coincide, and $\phi \in L^2(\rho)$ is an eigenfunction of $G_\rho$ corresponding to a nonzero eigenvalue $\lambda$ if and only if $\varphi \in C(M)$ with $\varphi = K_\rho \phi/\lambda$ is an eigenfunction of $\tilde{G}_\rho$ at the same eigenvalue. In effect, $\phi \mapsto \varphi$ "interpolates" the $L^2(\rho)$ element $\phi$ (defined only up to $\rho$-null sets) to the continuous, everywhere-defined function $\varphi$. It can be verified that if (2) holds, $G_\rho$ is a trace-class operator with real eigenvalues, $|\lambda_0| \geq |\lambda_1| \geq \cdots \searrow 0^+$. Moreover, there exists a Riesz basis $\{\phi_0, \phi_1, \ldots\}$ of $L^2(\rho)$ and a corresponding dual basis $\{\phi'_0, \phi'_1, \ldots\}$ with $\langle \phi'_i, \phi_j\rangle_{L^2(\rho)} = \delta_{ij}$, such that $G_\rho \phi_j = \lambda_j \phi_j$ and $G^*_\rho \phi'_j = \lambda_j \phi'_j$.
We say that the kernel $k$ is $L^2(\rho)$-universal if $G_\rho$ has no zero eigenvalues; this is equivalent to $\operatorname{ran} G_\rho$ being dense in $L^2(\rho)$. Moreover, $k$ is said to be $L^2(\rho)$-Markov if $G_\rho$ is a Markov operator, i.e., $G_\rho f \geq 0$ whenever $f \geq 0$, and $G_\rho 1 = 1$.

The operators $G_{\mu_N}$ associated with the sampling measures $\mu_N$, henceforth abbreviated by $G_N$, are represented by $N \times N$ kernel matrices $\mathbf{G}_N = [\langle e_{i,N}, G_N e_{j,N}\rangle_{L^2(\mu_N)}] = [k(\omega_i, \omega_j)]$ in the standard basis of $L^2(\mu_N)$. Further, if $k$ is a pullback kernel from covariate space, i.e., $k(\omega, \omega') = \kappa(X(\omega), X(\omega'))$ for $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, then $\mathbf{G}_N = [\kappa(x_i, x_j)]$ is empirically accessible from the training data. Popular kernels in applications include the covariance kernel $\kappa(x, x') = \langle x, x'\rangle_{\mathcal{X}}$ on an inner-product space and the radial Gaussian kernel $\kappa(x, x') = e^{-\|x - x'\|^2_{\mathcal{X}}/\epsilon}$ [Gen01]. It is also common to employ Markov kernels constructed by normalization of symmetric kernels [CL06, BH16]. We will use $k_N$ to denote kernels with data-dependent normalizations.

A widely used strategy for learning with integral operators [vLBB08] is to construct families of kernels $k_N$ converging in $C(M \times M)$ norm to $k$. This implies that for every nonzero eigenvalue $\lambda_j$ of $G \equiv G_\mu$, the sequence of eigenvalues $\lambda_{j,N}$ of $G_N$ satisfies $\lim_{N \to \infty} \lambda_{j,N} = \lambda_j$. Moreover, there exists a sequence of eigenfunctions $\phi_{j,N} \in L^2(\mu_N)$ corresponding to $\lambda_{j,N}$, whose continuous representatives, $\varphi_{j,N} = K_N \phi_{j,N}/\lambda_{j,N}$, converge in $C(M)$ to $\varphi_j = K \phi_j/\lambda_j$, where $\phi_j \in L^2(\mu)$ is any eigenfunction of $G$ at eigenvalue $\lambda_j$. In effect, we use $C(M)$ as a "bridge" to establish spectral convergence of the operators $G_N$, which act on different spaces. Note that $(\lambda_{j,N}, \varphi_{j,N})$ does not converge uniformly with respect to $j$, and for a fixed $N$, eigenvalues/eigenfunctions at larger $j$ exhibit larger deviations from their $N \to \infty$ limits. Under measure-preserving, ergodic dynamics, convergence occurs for $\mu$-a.e. starting state $\omega_0 \in M$, and $\omega_0$ in a set of positive ambient measure if $\mu$ is physical. In particular, the training states $\omega_n$ need not lie on $A$.
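A minimal sketch of the kernel-matrix construction (our own toy data and parameter choices), using a radial Gaussian kernel followed by one simple data-dependent Markov normalization in the diffusion-maps spirit of [CL06]:

```python
import numpy as np

rng = np.random.default_rng(1)
# Covariate samples x_n (toy data: points scattered around a circle in R^2).
N = 400
t = 2*np.pi*rng.random(N)
X = np.column_stack([np.cos(t), np.sin(t)]) + 0.01*rng.standard_normal((N, 2))

# Radial Gaussian kernel matrix [kappa(x_i, x_j)], empirically accessible
# from covariate data alone.
eps = 0.2
D2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-D2/eps)

# Row-stochastic (Markov) normalization of the symmetric kernel; other
# normalizations are used in practice.
P = K / K.sum(axis=1, keepdims=True)

# P is similar to a symmetric matrix, so its eigenvalues are real and lie
# in [-1, 1]; the leading eigenvectors provide a data-driven basis.
evals, evecs = np.linalg.eig(P)
order = np.argsort(-evals.real)
evals, evecs = evals.real[order], evecs.real[:, order]
```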
See Figure 1 for eigenfunctions of $G_N$ computed from data sampled near the L63 attractor. When the invariant measure $\mu$ has a smooth density with respect to local coordinates on $\Omega$, results on spectral convergence of graph Laplacians to manifold Laplacians [TS18, TGHS19] could be employed to provide a more precise characterization of the spectral properties of $G_N$ for suitable choices of kernel.

Diffusion forecasting
We now have the ingredients to build a concrete statistical forecasting scheme based on data-driven approximations of the Koopman operator. In particular, note that if $\phi'_{i,N}, \phi_{j,N}$ are biorthogonal eigenfunctions of $G^*_N$ and $G_N$, respectively, at nonzero eigenvalues, we can evaluate the matrix element $U_{N,ij}(q) := \langle \phi'_{i,N}, U^q_N \phi_{j,N}\rangle_{L^2(\mu_N)}$ of the shift operator using the continuous representatives $\varphi'_{i,N}, \varphi_{j,N}$,

$$U_{N,ij}(q) = \frac{1}{N} \sum_{n=0}^{N-1-q} \phi'_{i,N}(\omega_n)\, \phi_{j,N}(\omega_{n+q}) = \frac{N-q}{N} \int_\Omega \varphi'_{i,N}\, U^{q\,\Delta t} \varphi_{j,N} \, d\mu_{N-q},$$

where $U^{q\,\Delta t}$ is the Koopman operator on $C(M)$. Therefore, if the corresponding eigenvalues $\lambda_i, \lambda_j$ of $G$ are nonzero, by the weak convergence of the sampling measures in (1) and uniform convergence of the eigenfunctions, as $N \to \infty$, $U_{N,ij}(q)$ converges to the matrix element $U_{ij}(q\,\Delta t) = \langle \phi_i, U^{q\,\Delta t} \phi_j\rangle_{L^2(\mu)}$ of the Koopman operator on $L^2(\mu)$. This convergence is not uniform with respect to $i, j$, but if we fix a parameter $L \in \mathbb{N}$ (which can be thought of as spectral resolution) such that $\lambda_{L-1} \neq 0$, we can obtain a statistically consistent approximation of $L \times L$ Koopman operator matrices, $\mathbf{U}(q\,\Delta t) = [U_{ij}(q\,\Delta t)]$, by shift operator matrices, $\mathbf{U}_N(q) = [U_{N,ij}(q)]$, with $i, j \in \{0, \ldots, L-1\}$. Checkerboard plots of $\mathbf{U}_N(q)$ for the L63 system are displayed in Figure 1.

This method for approximating matrix elements of Koopman operators was proposed in a technique called diffusion forecasting (named after the diffusion kernels employed) [BGH15]. Assuming that the response $Y$ is continuous, by spectral convergence of $G_N$, for every $j \in \mathbb{N}$ such that $\lambda_j > 0$ the coefficients $\hat{Y}_{j,N} = \langle \phi'_{j,N}, Y\rangle_{\mu_N}$ converge, as $N \to \infty$, to $\hat{Y}_j = \langle \phi'_j, Y\rangle_{L^2(\mu)}$.
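The shift-operator matrix elements above are directly computable from data. Below is a minimal diffusion-forecast sketch (our own illustration) on an ergodic circle rotation, where the truth is known in closed form; the bandwidth, truncation level, and all other choices are ad hoc assumptions, not the authors' implementation.

```python
import numpy as np

# Ergodic circle rotation sampled at interval dt; forecast lead time q*dt.
N, dt, q = 2000, 0.3, 50
theta = (dt*np.arange(N)) % (2*np.pi)            # sampled states omega_n
X = np.column_stack([np.cos(theta), np.sin(theta)])
Y = np.cos(theta)                                # response values Y(omega_n)

# Symmetric Gaussian kernel; eigenvectors of the kernel matrix give an
# orthonormal set {phi_j} in L^2(mu_N) (phi'_j = phi_j by symmetry).
eps = 0.1
D2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
evals, V = np.linalg.eigh(np.exp(-D2/eps))
V = V[:, np.argsort(-evals)]
L = 41                                           # spectral resolution
Phi = np.sqrt(N)*V[:, :L]                        # unit L^2(mu_N)-norm columns

# Shift-operator matrix U_{N,ij}(q) = (1/N) sum_n phi_i(w_n) phi_j(w_{n+q}).
U = Phi[:-q].T @ Phi[q:] / N

# Initial probability density rho (w.r.t. mu_N): a bump centered at theta_star.
theta_star, sig = 1.0, 0.2
d = np.angle(np.exp(1j*(theta - theta_star)))    # signed circle distance
rho = np.exp(-d**2/(2*sig**2))
rho /= rho.mean()                                # now (1/N) sum_n rho_n = 1

rho_hat = Phi.T @ rho / N                        # <phi_j, rho>_{mu_N}
Y_hat = Phi.T @ Y / N                            # <phi_j, Y>_{mu_N}
forecast = rho_hat @ U @ Y_hat                   # estimates E_rho[cos(theta + q*dt)]
```

The forecast should lie close to the exact evolution $\cos(\theta_\ast + q\,\Delta t)$ of the bump's center, up to spectral truncation and finite-sample errors.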
This implies that for any $L \in \mathbb{N}$ such that $\lambda_{L-1} > 0$, $\sum_{j=0}^{L-1} \hat{Y}_{j,N} \varphi_{j,N}$ converges in $C(M)$ to the continuous representative of $\Pi_L Y$, where $\Pi_L$ is the orthogonal projection on $L^2(\mu)$ mapping into $\operatorname{span}\{\phi_0, \ldots, \phi_{L-1}\}$. Suppose now that $\varrho_N$ is a sequence of continuous functions converging uniformly to $\varrho \in C(M)$, such that the $\varrho_N$ are probability densities with respect to $\mu_N$ (i.e., $\varrho_N \geq 0$ and $\|\varrho_N\|_{L^1(\mu_N)} = 1$). By similar arguments as for $Y$, as $N \to \infty$, the continuous function $\sum_{j=0}^{L-1} \hat{\varrho}_{j,N} \varphi_{j,N}$ with $\hat{\varrho}_{j,N} = \langle \varphi'_{j,N}, \varrho_N\rangle_{L^2(\mu_N)}$ converges to $\Pi_L \varrho$ in $L^2(\mu)$. Putting these facts together, and setting $\vec{\varrho}_N = (\hat{\varrho}_{0,N}, \ldots, \hat{\varrho}_{L-1,N})^\top$ and $\vec{Y}_N = (\hat{Y}_{0,N}, \ldots, \hat{Y}_{L-1,N})^\top$, we conclude that

$$\vec{\varrho}^{\,\top}_N \mathbf{U}_N(q) \vec{Y}_N \xrightarrow{N \to \infty} \langle \Pi_L \varrho, \Pi_L U^{q\,\Delta t} Y\rangle_{L^2(\mu)}. \qquad (4)$$

Here, the left-hand side is given by matrix-vector products obtained from the data, and the right-hand side is equal to the expectation of $\Pi_L U^{q\,\Delta t} Y$ with respect to the probability measure $\rho$ with

Figure 1: Panel (a) shows eigenfunctions $\phi_{j,N}$ of $G_N$ for a dataset sampled near the L63 attractor. Panel (b) shows the action of the shift operator $U^q_N$ on the $\phi_{j,N}$ from (a) for $q = 50$ steps, approximating the Koopman operator $U^{q\,\Delta t}$. Panels (c, d) show the matrix elements $\langle \phi_{i,N}, U^q_N \phi_{j,N}\rangle_{\mu_N}$ of the shift operator for $q = 5$ and 50.
The mixing dynamics is evident in the larger far-from-diagonal components for $q = 50$ vs. $q = 5$. Panel (e) shows the matrix representation of a finite-difference approximation of the generator $V$, which is skew-symmetric. Panels (f, g) summarize the diffusion forecast (DF) and kernel analog forecast (KAF) for lead time $t = q\,\Delta t$. In each diagram, the data-driven finite-dimensional approximation (top row) converges to the true forecast (middle row). DF maps an initial state $\omega \in M \subseteq \Omega$ to the future expectation of an observable $\mathbb{E}_{\Psi(\omega)} U^t Y = \mathbb{E}_{U^{t*}\Psi(\omega)} Y$, and KAF maps a response function $Y \in C(M)$ to the conditional expectation $\mathbb{E}(U^t Y \mid X)$.

density $d\rho/d\mu = \varrho$; i.e., $\langle \Pi_L \varrho, \Pi_L U^{q\,\Delta t} Y\rangle_{L^2(\mu)} = \mathbb{E}_\rho(\Pi_L U^{q\,\Delta t} Y)$, where $\mathbb{E}_\rho(\cdot) := \int_\Omega (\cdot) \, d\rho$.

What about the dependence of the forecast on $L$? As $L$ increases, $\Pi_L$ converges strongly to the orthogonal projection $\Pi_G : L^2(\mu) \to L^2(\mu)$ onto the closure of the range of $G$. Thus, if the kernel $k$ is $L^2(\mu)$-universal (i.e., $\overline{\operatorname{ran}}\, G = L^2(\mu)$), $\Pi_G = \mathrm{Id}$, and under the iterated limit of $L \to \infty$ after $N \to \infty$ the left-hand side of (4) converges to $\mathbb{E}_\rho U^{q\,\Delta t} Y$. In summary, implemented with an $L^2(\mu)$-universal kernel, diffusion forecasting consistently approximates the expected value of the time-evolution of any continuous observable with respect to any probability measure with continuous density relative to $\mu$. An example of an $L^2(\mu)$-universal kernel is the pullback of a radial Gaussian kernel on $\mathcal{X} = \mathbb{R}^m$. In contrast, the covariance kernel is not $L^2(\mu)$-universal, as in this case the rank of $G$ is bounded by $m$. This illustrates that forecasting in the POD basis may be subject to intrinsic limitations, even with full observations.

Kernel analog forecasting
While providing a flexible framework for approximating expectation values of observables under measure-preserving, ergodic dynamics, diffusion forecasting does not directly address the problem of constructing a concrete forecast function, i.e., a function $Z_t : \mathcal{X} \to \mathbb{C}$ approximating $U^t Y$ as stated in Problem 1. One way of defining such a function is to let $\kappa_N$ be a Markov kernel on $\mathcal{X}$ with respect to $\nu_N = X_* \mu_N$, and to consider the "feature map" $\Psi_N : \mathcal{X} \to C(M)$ mapping each point $x \in \mathcal{X}$ in covariate space to the kernel section $\Psi_N(x) = \kappa_N(x, X(\cdot))$. Then, $\Psi_N(x)$ is a continuous probability density with respect to $\mu_N$, and we can use diffusion forecasting to define $Z_{q\,\Delta t}(x) = \vec\Psi_N(x)^\top U_N(q) \vec Y_N$, with notation as in (4). While this approach has a well-defined $N \to \infty$ limit, it does not provide optimality guarantees, particularly in situations where $X$ is non-injective. Indeed, the $L^2(\mu)$-optimal approximation to $U^t Y$ of the form $Z_t \circ X$ is given by the conditional expectation $E(U^t Y \mid X)$. In the present $L^2$ setting, we have $E(U^t Y \mid X) = \Pi_X U^t Y$, where $\Pi_X$ is the orthogonal projection onto $L^2_X(\mu) := \{ f \in L^2(\mu) : f = g \circ X \}$. That is, the conditional expectation minimizes the error $\lVert f - U^t Y \rVert_{L^2(\mu)}$ among all pullbacks $f \in L^2_X(\mu)$ from covariate space. Even though $E(U^t Y \mid X = x)$ can be expressed as an expectation with respect to a conditional probability measure $\mu(\cdot \mid x)$ on $\Omega$, that measure will generally not have an $L^2(\mu)$ density, and there is no map $\Psi : \mathcal{X} \to C(M)$ such that $\langle \Psi(x), U^t Y \rangle_{L^2(\mu)}$ equals $E(U^t Y \mid X = x)$.

To construct a consistent estimator of the conditional expectation, we require that $k$ is a pullback of a kernel $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ on covariate space which is (i) symmetric, $\kappa(x, x') = \kappa(x', x)$ for all $x, x' \in \mathcal{X}$ (so that (2) holds); (ii) strictly positive; and (iii) strictly positive-definite.
The latter means that for any sequence $x_0, \ldots, x_{n-1}$ of distinct points in $\mathcal{X}$, the matrix $[\kappa(x_i, x_j)]$ is strictly positive. These properties imply that there exists a Hilbert space $\mathcal{H}$ of complex-valued functions on $\Omega$, such that (i) for every $\omega \in \Omega$, the kernel sections $k_\omega = k(\omega, \cdot)$ lie in $\mathcal{H}$; (ii) the evaluation functional $\delta_\omega : \mathcal{H} \to \mathbb{C}$ is bounded and satisfies $\delta_\omega f = \langle k_\omega, f \rangle_{\mathcal{H}}$; (iii) every $f \in \mathcal{H}$ has the form $f = g \circ X$ for a continuous function $g : \mathcal{X} \to \mathbb{C}$; and (iv) $\iota_\mu \mathcal{H}$ lies dense in $L^2_X(\mu)$.

A Hilbert space of functions satisfying (i) and (ii) above is known as a reproducing kernel Hilbert space (RKHS), and the associated kernel $k$ is known as a reproducing kernel. RKHSs have many useful properties for statistical learning [CS01], not least because they combine the Hilbert space structure of $L^2$ spaces with pointwise evaluation in spaces of continuous functions. The density of $\mathcal{H}$ in $L^2_X(\mu)$ is a consequence of the strict positive-definiteness of $\kappa$. In particular, because the conditional expectation $E(U^t Y \mid X)$ lies in $L^2_X(\mu)$, it can be approximated by elements of $\mathcal{H}$ to arbitrarily high precision in $L^2(\mu)$ norm, and every such approximation will be a pullback $\hat Y_t = Z_t \circ X$ of a continuous function $Z_t$ that can be evaluated at arbitrary covariate values.

We now describe a data-driven technique for constructing such a prediction function, which we refer to as kernel analog forecasting (KAF) [AG20]. Mathematically, KAF is closely related to kernel principal component regression. To build the KAF estimator, we work again with integral operators as in (3), with the difference that now $K_\rho : L^2(\rho) \to \mathcal{H}(M)$ takes values in the restriction of $\mathcal{H}$ to the forward-invariant set $M$, denoted $\mathcal{H}(M)$. One can show that the adjoint $K^*_\rho : \mathcal{H}(M) \to L^2(\rho)$ coincides with the inclusion map $\iota_\rho$ on continuous functions, so that $K^*_\rho$ maps $f \in \mathcal{H}(M) \subset C(M)$ to its corresponding $L^2(\rho)$ equivalence class.
As a result, the integral operator $G_\rho : L^2(\rho) \to L^2(\rho)$ takes the form $G_\rho = K^*_\rho K_\rho$, becoming a self-adjoint, positive-definite, compact operator with eigenvalues $\lambda_0 \geq \lambda_1 \geq \cdots \searrow 0^+$, and a corresponding orthonormal eigenbasis $\{\phi_0, \phi_1, \ldots\}$ of $L^2(\rho)$. Moreover, $\{\psi_0, \psi_1, \ldots\}$ with $\psi_j = K_\rho \phi_j / \lambda_j^{1/2}$ is an orthonormal set in $\mathcal{H}(M)$. In fact, Mercer's theorem provides an explicit representation $k(\omega, \omega') = \sum_{j=0}^\infty \psi_j(\omega) \psi_j(\omega')$, where direct evaluation of the kernel on the left-hand side (known as the "kernel trick") avoids the complexity of inner-product computations between feature vectors $\psi_j$. Here, our perspective is to rely on the orthogonality of the eigenbasis to approximate observables of interest at fixed $L$, and establish convergence of the estimator as $L \to \infty$. A similar approach was adopted for density estimation on non-compact domains, with Mercer-type kernels based on orthogonal polynomials [ZHL19].

Now a key operation that the RKHS enables is the Nyström extension, which interpolates $L^2(\rho)$ elements of appropriate regularity to RKHS functions. The Nyström operator $\mathcal{N}_\rho : D(\mathcal{N}_\rho) \to \mathcal{H}(M)$ is defined on the domain $D(\mathcal{N}_\rho) = \{\sum_j c_j \phi_j : \sum_j |c_j|^2 / \lambda_j < \infty\}$ by linear extension of $\mathcal{N}_\rho \phi_j = \psi_j / \lambda_j^{1/2}$. Note that $\mathcal{N}_\rho \phi_j = K_\rho \phi_j / \lambda_j = \varphi_j$, so $\mathcal{N}_\rho$ maps $\phi_j$ to its continuous representative, and $K^*_\rho \mathcal{N}_\rho f = f$, meaning that $\mathcal{N}_\rho f = f$, $\rho$-a.e. While $D(\mathcal{N}_\rho)$ may be a strict subspace of $L^2(\rho)$, for any $L$ with $\lambda_{L-1} > 0$ we can define a finite-rank operator $\mathcal{N}_{L,\rho} : L^2(\rho) \to \mathcal{H}(M)$, $\mathcal{N}_{L,\rho} \sum_j c_j \phi_j = \sum_{j=0}^{L-1} c_j \psi_j / \lambda_j^{1/2}$. Then, as $L$ increases, $K^*_\rho \mathcal{N}_{L,\rho} f$ converges to $\Pi_{G_\rho} f$ in $L^2(\rho)$. To make empirical forecasts, we set $\rho = \mu_N$, compute the expansion coefficients $c_{j,N}(t)$ of $U^t Y$ in the $\{\phi_{j,N}\}$ basis of $L^2(\mu_N)$, and construct $Y_{t,L,N} = \mathcal{N}_{L,N} U^t Y \in \mathcal{H}(M)$.
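To make the preceding construction concrete, here is a minimal NumPy sketch of a KAF-style estimator on a toy circle rotation with full observations; the system and all numerical parameters ($N$, $L$, $q$, kernel bandwidth) are illustrative assumptions, not values from the text:

```python
import numpy as np

# Minimal KAF-style sketch on a toy circle rotation with full observations
# (covariate X = identity) and observable Y = cos.  All parameters below
# are illustrative choices.
N, L, q = 1000, 21, 25
a = 2 * np.pi * (np.sqrt(5) - 1) / 2
x = (a * np.arange(N)) % (2 * np.pi)
pts = np.stack([np.cos(x), np.sin(x)], axis=1)

def kernel(p, p2, eps=0.1):
    d2 = ((p[:, None, :] - p2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / eps)

lam, v = np.linalg.eigh(kernel(pts, pts) / N)    # ascending eigenvalues
lam, phi = lam[::-1][:L], np.sqrt(N) * v[:, ::-1][:, :L]

# Expansion coefficients c_j(q dt) = <phi_j, U^{q dt} Y>_{mu_N}.
Y = np.cos(x)
c = phi[:-q].T @ Y[q:] / (N - q)

# Nystrom extension: evaluate the forecast function at new covariate values.
def Z(x_new):
    p_new = np.stack([np.cos(x_new), np.sin(x_new)], axis=1)
    phi_new = kernel(p_new, pts) @ phi / (N * lam)   # out-of-sample phi_j
    return phi_new @ c

x_star = np.array([0.3, 2.0, 4.5])
pred = Z(x_star)                    # forecast at lead time q*dt
truth = np.cos(x_star + q * a)      # exact U^{q dt} Y for this toy system
```

Because $X$ is injective here, the conditional expectation reduces to $U^{q\,\Delta t} Y$ itself, so `pred` should track `truth` up to discretization and truncation error.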
Because the $\psi_{j,N}$ are pullbacks of known functions $u_{j,N} \in C(\mathcal{X})$, we have $Y_{t,L,N} = Z_{t,L,N} \circ X$, where $Z_{t,L,N} = \sum_{j=0}^{L-1} c_{j,N}(t)\, u_{j,N} / \lambda_{j,N}^{1/2}$ can be evaluated at any $x \in \mathcal{X}$. The function $Y_{t,L,N}$ is our estimator of the conditional expectation $E(U^t Y \mid X)$.

Figure 2: KAF applied to the L63 state vector component $Y(\omega) = \omega_1$ with full (blue) and partial (red) observations. In the fully observed case, the covariate $X$ is the identity map on $\Omega = \mathbb{R}^3$. In the partially observed case, $X(\omega) = \omega_1$ is the projection onto the first coordinate. Top: Forecasts $Z_{t,L,N}(x)$ initialized from fixed $x = X(\omega)$, compared with the true evolution $U^t Y(\omega)$ (black). Shaded regions show error bounds based on KAF estimates of the conditional standard deviation, $\sigma_t(x)$. Bottom: RMS forecast errors (solid lines) and $\sigma_t$ (dashed lines). The agreement between actual and estimated errors indicates that $\sigma_t$ provides useful uncertainty quantification.

By spectral convergence of kernel integral operators, as $N \to \infty$, $Y_{t,L,N}$ converges to $Y_{t,L} := \mathcal{N}_L U^t Y$ in $C(M)$ norm, where $\mathcal{N}_L \equiv \mathcal{N}_{L,\mu}$. Then, as $L \to \infty$, $K^* Y_{t,L}$ converges in $L^2(\mu)$ norm to $\Pi_G U^t Y$. Because $\kappa$ is strictly positive-definite, $G$ has dense range in $L^2_X(\mu)$, and thus $\Pi_G U^t Y = \Pi_X U^t Y = E(U^t Y \mid X)$. We therefore conclude that $Y_{t,L,N}$ converges to the conditional expectation as $L \to \infty$ after $N \to \infty$. Forecast results from the L63 system are shown in Figure 2.

Coherent pattern extraction

We now turn to the task of coherent pattern extraction in Problem 2. This is a fundamentally unsupervised learning problem, as we seek to discover observables of a dynamical system that exhibit a natural time evolution (by some suitable criterion), rather than approximate a given observable as in the context of forecasting. We have mentioned POD as a technique for identifying coherent observables through eigenfunctions of covariance operators.
Kernel PCA [SSM98] is a generalization of this approach utilizing integral operators with potentially nonlinear kernels. For data lying on Riemannian manifolds, it is popular to employ kernels approximating geometrical operators, such as heat operators and their associated Laplacians. Examples include Laplacian eigenmaps [BN03], diffusion maps [CL06], and variable-bandwidth kernels [BH16]. Meanwhile, coherent pattern extraction techniques based on evolution operators have also gained popularity in recent years. These methods include spectral analysis of transfer operators for detection of invariant sets [DJ99, DF00], harmonic averaging [Mez05] and dynamic mode decomposition (DMD) [RMB+09, Sch10, WKR15, KBBP17, KNK+18] techniques for approximating Koopman eigenfunctions, and Darboux kernels for approximating spectral projectors [KPM20]. While natural from a theoretical standpoint, evolution operators tend to have more complicated spectral properties than kernel integral operators, including non-isolated eigenvalues and continuous spectrum. The following examples illustrate distinct behaviors associated with the point ($H_a$) and continuous ($H_c$) spectrum subspaces of $L^2(\mu)$.

Example 1 (Torus rotation). A quasiperiodic rotation on the 2-torus, $\Omega = \mathbb{T}^2$, is governed by the system of ODEs $\dot\omega = \vec V(\omega)$, where $\omega = (\omega_1, \omega_2) \in [0, 2\pi)^2$, $\vec V = (\nu_1, \nu_2)$, and $\nu_1, \nu_2 \in \mathbb{R}$ are rationally independent frequency parameters. The resulting flow, $\Phi^t(\omega) = (\omega_1 + \nu_1 t, \omega_2 + \nu_2 t) \bmod 2\pi$, has a unique Borel ergodic invariant probability measure $\mu$ given by a normalized Lebesgue measure. Moreover, there exists an orthonormal basis of $L^2(\mu)$ consisting of Koopman eigenfunctions $z_{jk}(\omega) = e^{i(j\omega_1 + k\omega_2)}$, $j, k \in \mathbb{Z}$, with eigenfrequencies $\alpha_{jk} = j\nu_1 + k\nu_2$. Thus, $H_a = L^2(\mu)$, and $H_c$ is the zero subspace. Such a system is said to have a pure point spectrum.

Example 2 (Lorenz 63 system). The L63 system on $\Omega = \mathbb{R}^3$ is governed by a system of smooth ODEs with two quadratic nonlinearities. This system is known to exhibit a physical ergodic invariant probability measure $\mu$ supported on a compact set (the L63 attractor), with mixing dynamics. This means that $H_a$ is the one-dimensional subspace of $L^2(\mu)$ consisting of constant functions, and $H_c$ consists of all $L^2(\mu)$ functions orthogonal to the constants (i.e., with zero expectation value with respect to $\mu$).

Delay-coordinate approaches
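The eigenfunction relation in Example 1 is easy to verify numerically before proceeding; the frequencies, wavenumbers, state, and time in this sketch are arbitrary illustrative choices:

```python
import numpy as np

# Numerical check of Example 1: z_jk(w) = exp(i(j*w1 + k*w2)) is a Koopman
# eigenfunction of the torus rotation, with eigenfrequency
# alpha_jk = j*nu1 + k*nu2.  All parameter values are arbitrary choices.
nu1, nu2 = 1.0, np.sqrt(2)               # rationally independent frequencies
j, k, t = 2, -3, 0.7

def z(w):
    return np.exp(1j * (j * w[0] + k * w[1]))

w0 = np.array([0.4, 5.1])
w_t = (w0 + t * np.array([nu1, nu2])) % (2 * np.pi)   # Phi^t(w0)

lhs = z(w_t)                             # (U^t z)(w0) = z(Phi^t(w0))
rhs = np.exp(1j * (j * nu1 + k * nu2) * t) * z(w0)    # e^{i alpha t} z(w0)
```

The two sides agree to floating-point precision, confirming the harmonic evolution $U^t z_{jk} = e^{i\alpha_{jk} t} z_{jk}$.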
For the point spectrum subspace $H_a$, a natural class of coherent observables is provided by the Koopman eigenfunctions. Every Koopman eigenfunction $z_j \in H_a$ evolves as a harmonic oscillator at the corresponding eigenfrequency, $U^t z_j = e^{i\alpha_j t} z_j$, and the associated autocorrelation function, $C_{z_j}(t) = e^{i\alpha_j t}$, also has a harmonic evolution. Short of temporal invariance (which only occurs for constant eigenfunctions under measure-preserving ergodic dynamics), it is natural to think of a harmonic evolution as being "maximally" coherent. In particular, if $z_j$ is continuous, then for any $\omega \in A$, the real and imaginary parts of the time series $t \mapsto U^t z_j(\omega)$ are pure sinusoids, even if the flow $\Phi^t$ is aperiodic. Further attractive properties of Koopman eigenfunctions include the facts that they are intrinsic to the dynamical system generating the data, and they are closed under pointwise multiplication, $z_j z_k = z_{j+k}$, allowing one to generate every eigenfunction from a potentially finite generating set.

Yet, consistently approximating Koopman eigenfunctions from data is a non-trivial task, even for simple systems. For instance, the torus rotation in Example 1 has a dense set of eigenfrequencies by rational independence of the basic frequencies $\nu_1$ and $\nu_2$. Thus, any open interval in $\mathbb{R}$ contains infinitely many eigenfrequencies $\alpha_{jk}$, necessitating some form of regularization. Arguably, the term "pure point spectrum" is somewhat of a misnomer for such systems, since a non-empty continuous spectrum is present. Indeed, since the spectrum of an operator on a Banach space includes the closure of the set of eigenvalues, $i\mathbb{R} \setminus \{i\alpha_{jk}\}$ lies in the continuous spectrum.

As a way of addressing these challenges, observe that if $G$ is a self-adjoint, compact operator commuting with the Koopman group (i.e., $U^t G = G U^t$), then any eigenspace $W_\lambda$ of $G$ corresponding to a nonzero eigenvalue $\lambda$ is invariant under $U^t$, and thus under the generator $V$.
Moreover, by compactness of $G$, $W_\lambda$ has finite dimension. Thus, for any orthonormal basis $\{\phi_0, \ldots, \phi_{l-1}\}$ of $W_\lambda$, the generator $V$ on $W_\lambda$ is represented by a skew-symmetric, and thus unitarily diagonalizable, $l \times l$ matrix $\mathbf{V} = [\langle \phi_i, V \phi_j \rangle_{L^2(\mu)}]$. The eigenvectors $\vec u = (u_0, \ldots, u_{l-1})^\top \in \mathbb{C}^l$ of $\mathbf{V}$ then contain expansion coefficients of Koopman eigenfunctions $z = \sum_{j=0}^{l-1} u_j \phi_j$ in $W_\lambda$, and the eigenvalues corresponding to $\vec u$ are eigenvalues of $V$.

On the basis of the above, since any integral operator $G$ on $L^2(\mu)$ associated with a symmetric kernel $k \in L^2(\mu \times \mu)$ is Hilbert–Schmidt (and thus compact), and we have a wide variety of data-driven tools for approximating integral operators, we can reduce the problem of consistently approximating the point spectrum of the Koopman group on $L^2(\mu)$ to the problem of constructing a commuting integral operator. As we now argue, the success of a number of techniques, including singular spectrum analysis (SSA) [BK86, VG89], diffusion-mapped delay coordinates (DMDC) [BCGFS13], nonlinear Laplacian spectral analysis (NLSA) [GM12], and Hankel DMD [BBP+17], can be traced to the fact that, in limits of large numbers of delays, they employ integral operators commuting with the Koopman group, constructed from delay-coordinate maps [SYC91]. With our notation for the covariate function $X : \Omega \to \mathcal{X}$ and sampling interval $\Delta t$, the $Q$-step delay-coordinate map is defined as $X_Q : \Omega \to \mathcal{X}^Q$ with $X_Q(\omega) = (X(\omega_0), X(\omega_{-1}), \ldots, X(\omega_{-Q+1}))$ and $\omega_q = \Phi^{q\,\Delta t}(\omega)$. That is, $X_Q$ can be thought of as a lift of $X$, which produces "snapshots", to a map taking values in the space $\mathcal{X}^Q$ containing "videos". Intuitively, by virtue of its higher-dimensional codomain and dependence on the dynamical flow, a delay-coordinate map such as $X_Q$ should provide additional information about the underlying dynamics on $\Omega$ over the raw covariate map $X$.
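A toy computation illustrates this point for a circle rotation observed through the non-injective covariate $X(\omega) = \cos\omega$ (the rotation step below is an arbitrary choice): a single snapshot cannot separate $\omega$ from $-\omega$, but already $Q = 2$ delays can.

```python
import numpy as np

# Toy illustration of delay-coordinate maps: for a circle rotation observed
# through the non-injective covariate X(w) = cos(w), one snapshot cannot
# separate w from -w, but Q = 2 delays can.  The step a is arbitrary.
a = 0.7

def X_Q(w, Q):
    # Q-step delay-coordinate map; w_{-q} = w - q*a are the past states.
    return np.array([np.cos(w - q * a) for q in range(Q)])

w1, w2 = 1.2, -1.2                                # X(w1) == X(w2)
snap = abs(np.cos(w1) - np.cos(w2))               # 0: snapshots coincide
video = np.linalg.norm(X_Q(w1, 2) - X_Q(w2, 2))   # > 0: "videos" differ
```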
This intuition has been made precise in a number of "embedology" theorems [SYC91], which state that under mild assumptions, for any compact subset $S \subseteq \Omega$ (including, for our purposes, the invariant set $A$), the delay-coordinate map $X_Q$ is injective on $S$ for sufficiently large $Q$. As a result, delay-coordinate maps provide a powerful tool for state space reconstruction, as well as for constructing informative predictor functions in the context of forecasting.

Aside from considerations associated with topological reconstruction, however, observe that a metric $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ on covariate space pulls back to a distance-like function $\tilde d_Q : \Omega \times \Omega \to \mathbb{R}$ such that
$$\tilde d_Q(\omega, \omega') = \frac{1}{Q} \sum_{q=0}^{Q-1} d(X(\omega_{-q}), X(\omega'_{-q})). \qquad (5)$$
In particular, $\tilde d_Q$ has the structure of an ergodic average of a continuous function under the product dynamical flow $\Phi^t \times \Phi^t$ on $\Omega \times \Omega$. By the von Neumann ergodic theorem, as $Q \to \infty$, $\tilde d_Q$ converges in $L^2(\mu \times \mu)$ norm to a bounded function $\tilde d_\infty$, which is invariant under the Koopman operator $U^t \otimes U^t$ of the product dynamical system. Note that $\tilde d_\infty$ need not be $\mu \times \mu$-a.e. constant, as $\Phi^t \times \Phi^t$ need not be ergodic, and aside from special cases it will not be continuous on $A \times A$. Nevertheless, based on the $L^2(\mu \times \mu)$ convergence of $\tilde d_Q$ to $\tilde d_\infty$, it can be shown [DG19] that for any continuous function $h : \mathbb{R} \to \mathbb{R}$, the integral operator $G_\infty$ on $L^2(\mu)$ associated with the kernel $k_\infty = h \circ \tilde d_\infty$ commutes with $U^t$ for any $t \in \mathbb{R}$. Moreover, as $Q \to \infty$, the operators $G_Q$ associated with $k_Q = h \circ \tilde d_Q$ converge to $G_\infty$ in $L^2(\mu)$ operator norm, and thus in spectrum.

Many of the operators employed in SSA, DMDC, NLSA, and Hankel DMD can be modeled after the $G_Q$ described above. In particular, because $G_Q$ is induced by a continuous kernel, its spectrum can be consistently approximated by data-driven operators $G_{Q,N}$ on $L^2(\mu_N)$, as described in the context of forecasting.
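The invariance underlying the commutativity argument can be seen numerically for a circle rotation with $X(\omega) = \cos\omega$: for large $Q$, the averaged delay distance (5) depends (approximately) only on the phase difference $\omega' - \omega$, i.e., it is nearly invariant under the product flow. The rotation step and number of delays below are illustrative choices.

```python
import numpy as np

# Toy check of the averaged delay distance (5) for a circle rotation with
# covariate X(w) = cos(w).  Parameters are illustrative choices.
a = 2 * np.pi * (np.sqrt(5) - 1) / 2

def d_Q(w, w2, Q):
    q = np.arange(Q)
    return np.mean(np.abs(np.cos(w - q * a) - np.cos(w2 - q * a)))

# For large Q, d_Q depends (approximately) only on the phase difference
# w2 - w, reflecting invariance of d_infinity under the product flow.
delta = 0.9
pair1 = d_Q(0.3, 0.3 + delta, 200000)
pair2 = d_Q(4.0, 4.0 + delta, 200000)   # same difference, shifted base point
```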
The eigenfunctions of the data-driven operators $G_{Q,N}$ at nonzero eigenvalues approximate eigenfunctions of $G_Q$, which approximate in turn eigenfunctions of $G_\infty$ lying in finite unions of Koopman eigenspaces. Thus, for sufficiently large $N$ and $Q$, the eigenfunctions of $G_{Q,N}$ at nonzero eigenvalues capture distinct timescales associated with the point spectrum of the dynamical system, providing physically interpretable features. These kernel eigenfunctions can also be employed in Galerkin schemes to approximate individual Koopman eigenfunctions.

Besides the spectral considerations described above, in [BCGFS13] a geometrical characterization of the eigenspaces of $G_Q$ was given, based on Lyapunov metrics of dynamical systems. In particular, it follows by Oseledets' multiplicative ergodic theorem that for $\mu$-a.e. $\omega \in M$ there exists a decomposition $T_\omega M = F_{1,\omega} \oplus \cdots \oplus F_{r,\omega}$, where $T_\omega M$ is the tangent space at $\omega \in M$, and the $F_{j,\omega}$ are subspaces satisfying the equivariance condition $D\Phi^t F_{j,\omega} = F_{j,\Phi^t(\omega)}$. Moreover, there exist $\Lambda_1 > \cdots > \Lambda_r$ such that for every $v \in F_{j,\omega}$, $\Lambda_j = \lim_{t\to\infty} \frac{1}{t} \int_0^t \log \lVert D\Phi^s v \rVert \, ds$, where $\lVert \cdot \rVert$ is the norm on $T_\omega M$ induced by a Riemannian metric. The numbers $\Lambda_j$ are called Lyapunov exponents, and are metric-independent. Note that the dynamical vector field $\vec V(\omega)$ lies in a subspace $F_{j_0,\omega}$ with a corresponding zero Lyapunov exponent.

If $F_{j_0,\omega}$ is one-dimensional, and the norms $\lVert D\Phi^t v \rVert$ obey appropriate uniform growth/decay bounds with respect to $\omega \in M$, the dynamical flow is said to be uniformly hyperbolic. If, in addition, the support $A$ of $\mu$ is a differentiable manifold, then there exists a class of Riemannian metrics, called Lyapunov metrics, for which the $F_{j,\omega}$ are mutually orthogonal at every $\omega \in A$.
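As a simple illustration of Lyapunov exponents as ergodic averages of local growth rates (using a 1-d map of our own choosing rather than the flows discussed above), the exponent of the logistic map $f(x) = 4x(1-x)$ can be estimated from a trajectory and compared with its analytically known value $\log 2$:

```python
import numpy as np

# Illustrative estimate of a Lyapunov exponent as an ergodic average of the
# local logarithmic growth rate, for the logistic map f(x) = 4x(1-x).
# The map is our choice of example; its exponent is known to be log 2.
T = 100000
x = 0.1234
logs = np.empty(T)
for n in range(T):
    logs[n] = np.log(abs(4.0 - 8.0 * x))   # log |f'(x)| along the orbit
    x = 4.0 * x * (1.0 - x)
lyap = logs.mean()                          # ergodic average of log |f'|
```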
In [BCGFS13], it was shown that, using a modification of the delay-distance in (5) with exponentially decaying weights, as $Q \to \infty$ the top eigenfunctions $\phi_j^{(Q)}$ of $G_Q$ vary predominantly along the subspace $F_{r,\omega}$ associated with the most stable Lyapunov exponent. That is, for every $\omega \in \Omega$ and tangent vector $v \in T_\omega M$ orthogonal to $F_{r,\omega}$ with respect to a Lyapunov metric, $\lim_{Q\to\infty} v \cdot \nabla \phi_j^{(Q)} = 0$.

RKHS approaches
While delay-coordinate maps are effective for approximating the point spectrum and associated Koopman eigenfunctions, they do not address the problem of identifying coherent observables in the continuous spectrum subspace $H_c$. Indeed, one can verify that in mixed-spectrum systems the infinite-delay operator $G_\infty$, which provides access to the eigenspaces of the Koopman operator, has a non-trivial nullspace that includes $H_c$ as a subspace. More broadly, there is no obvious way of identifying coherent observables in $H_c$ as eigenfunctions of an intrinsic evolution operator. As a remedy of this problem, we relax the problem of seeking Koopman eigenfunctions, and consider instead approximate eigenfunctions. An observable $z \in L^2(\mu)$ is said to be an $\epsilon$-approximate eigenfunction of $U^t$ if there exists $\lambda_t \in \mathbb{C}$ such that
$$\lVert U^t z - \lambda_t z \rVert_{L^2(\mu)} < \epsilon \lVert z \rVert_{L^2(\mu)}. \qquad (6)$$
The number $\lambda_t$ is then said to lie in the $\epsilon$-approximate spectrum of $U^t$. A Koopman eigenfunction is an $\epsilon$-approximate eigenfunction for every $\epsilon > 0$, so we think of (6) as a relaxation of the eigenvalue equation, $U^t z - \lambda_t z = 0$. This suggests that a natural notion of coherence of observables in $L^2(\mu)$, appropriate to both the point and continuous spectrum, is that (6) holds for $\epsilon \ll 1$ and $t$ in a "large" interval.

We now outline an RKHS-based approach [DGS18], which identifies observables satisfying this condition through eigenfunctions of a regularized operator $\tilde V_\tau$ on $L^2(\mu)$ approximating $V$, with the properties of (i) being skew-adjoint and compact; and (ii) having eigenfunctions in the domain of the Nyström operator, which maps them to differentiable functions in an RKHS. Here, $\tau$ is a positive regularization parameter such that, as $\tau \to 0^+$, $\tilde V_\tau$ converges to $V$ in a suitable spectral sense. We will assume that the forward-invariant, compact manifold $M$ has $C^1$ regularity, but will not require that the support $A$ of the invariant measure be differentiable.

With these assumptions, let $k : \Omega \times \Omega \to \mathbb{R}$ be a symmetric, positive-definite kernel, whose restriction on $M \times M$ is continuously differentiable. Then, the corresponding RKHS $\mathcal{H}(M)$ embeds continuously in the Banach space $C^1(M)$ of continuously differentiable functions on $M$, equipped with the standard norm. Moreover, because $V$ is an extension of the directional derivative $\vec V \cdot \nabla$ associated with the dynamical vector field, every function in $\mathcal{H}(M)$ lies, upon inclusion, in $D(V)$. The key point here is that regularity of the kernel induces RKHSs of observables which are guaranteed to lie in the domain of the generator. In particular, the range of the integral operator $G = K^* K$ on $L^2(\mu)$ associated with $k$ lies in $D(V)$, so that $A = VG$ is well-defined. This operator is, in fact, Hilbert–Schmidt, with Hilbert–Schmidt norm bounded by the $C^1(M \times M)$ norm of the kernel $k$.
What is perhaps less obvious is that $G^{1/2} V G^{1/2}$ (which "distributes" the smoothing by $G$ to the left and right of $V$), defined on the dense subspace $\{ f \in L^2(\mu) : G^{1/2} f \in D(V) \}$, is also bounded, and thus has a unique closed extension $\tilde V : L^2(\mu) \to L^2(\mu)$, which turns out to be Hilbert–Schmidt. Unlike $A$, $\tilde V$ is skew-adjoint, and thus preserves an important structural property of the generator. By skew-adjointness and compactness of $\tilde V$, there exists an orthonormal basis $\{\tilde z_j : j \in \mathbb{Z}\}$ of $L^2(\mu)$ consisting of its eigenfunctions $\tilde z_j$, with purely imaginary eigenvalues $i\tilde\alpha_j$. Moreover, (i) all $\tilde z_j$ corresponding to nonzero $\tilde\alpha_j$ lie in the domain of the Nyström operator, and therefore have $C^1$ representatives in $\mathcal{H}(M)$; and (ii) if $k$ is $L^2(\mu)$-universal, Markov, and ergodic, $\tilde V$ has a simple eigenvalue at zero, in agreement with the analogous property of $V$.

Based on the above, we seek to construct a one-parameter family of such kernels $k_\tau$, with associated RKHSs $\mathcal{H}_\tau(M)$, such that as $\tau \to 0^+$, the regularized generators $\tilde V_\tau$ converge to $V$ in a sense suitable for spectral convergence. Here, the relevant notion of convergence is strong resolvent convergence; that is, for every element $\lambda$ of the resolvent set of $V$ and every $f \in L^2(\mu)$, $(\tilde V_\tau - \lambda)^{-1} f$ must converge to $(V - \lambda)^{-1} f$. In that case, for every element $i\alpha$ of the spectrum of $V$ (both point and continuous), there exists a sequence of eigenvalues $i\tilde\alpha_{j_\tau,\tau}$ of $\tilde V_\tau$ converging to $i\alpha$ as $\tau \to 0^+$. Moreover, for any $\epsilon > 0$ and $T > 0$, there exists $\tau_* > 0$ such that for $0 < \tau < \tau_*$ and $|t| < T$, $e^{i\tilde\alpha_{j_\tau,\tau} t}$ lies in the $\epsilon$-approximate spectrum of $U^t$ and $\tilde z_{j_\tau,\tau}$ is an $\epsilon$-approximate eigenfunction.

In [DGS18], a constructive procedure was proposed for obtaining the kernel family $k_\tau$ through a Markov semigroup on $L^2(\mu)$. This method has a data-driven implementation, with spectral convergence results for the associated integral operators $G_{\tau,N}$ on $L^2(\mu_N)$ analogous to those described in the setting of forecasting. Given these operators, we approximate $\tilde V_\tau$ by $\tilde V_{\tau,N} = G_{\tau,N}^{1/2} V_N G_{\tau,N}^{1/2}$, where $V_N$ is a skew-adjoint, finite-difference approximation of the generator. For example, $V_N = (U_N - U_N^*) / (2\,\Delta t)$ is a second-order finite-difference approximation based on the 1-step shift operator $U_N$. See Figure 1 for a graphical representation of a generator matrix for L63. As with our data-driven approximations of $U^t$, we work with a rank-$L$ operator $\hat V_\tau := \Pi_{\tau,N,L} \tilde V_{\tau,N} \Pi_{\tau,N,L}$, where $\Pi_{\tau,N,L}$ is the orthogonal projection onto the subspace spanned by the first $L$ eigenfunctions of $G_{\tau,N}$. This family of operators converges spectrally to $\tilde V_\tau$ in a limit of $N \to \infty$, followed by $\Delta t \to 0$ and $L \to \infty$, where we note that $C^1$ regularity of $k_\tau$ is important for the finite-difference approximations to converge.

At any given $\tau$, an a posteriori criterion for identifying candidate eigenfunctions $\hat z_{j,\tau}$ satisfying (6) for small $\epsilon$ is to compute a Dirichlet energy functional, $D(\hat z_{j,\tau}) = \lVert \mathcal{N}_{\tau,N} \hat z_{j,\tau} \rVert^2_{\mathcal{H}_\tau(M)} / \lVert \hat z_{j,\tau} \rVert^2_{L^2(\mu_N)}$. Intuitively, $D$ assigns a measure of roughness to every nonzero element in the domain of the Nyström operator (analogously to the Dirichlet energy in Sobolev spaces on differentiable manifolds), and the smaller $D(\hat z_{j,\tau})$ is, the more coherent $\hat z_{j,\tau}$ is expected to be. Indeed, as shown in Figure 3, the $\hat z_{j,\tau}$ corresponding to low Dirichlet energy identify observables of the L63 system with a coherent dynamical evolution, even though this system is mixing and has no nonconstant Koopman eigenfunctions. Sampled along dynamical trajectories, the approximate Koopman eigenfunctions resemble amplitude-modulated wavepackets, exhibiting a low-frequency modulating envelope while maintaining phase coherence and a precise carrier frequency. This behavior can be thought of as a "relaxation" of Koopman eigenfunctions, which generate pure sinusoids with no amplitude modulation.

Conclusions and outlook
We have presented mathematical techniques at the interface of dynamical systems theory and data science for statistical analysis and modeling of dynamical systems. One of our primary goals has been to highlight a fruitful interplay of ideas from ergodic theory, functional analysis, and differential geometry, which, coupled with learning theory, provide an effective route for data-driven prediction and pattern extraction, well-adapted to handle nonlinear dynamics and complex geometries.

Figure 3: A representative eigenfunction $\hat z_{j,\tau}$ of the compactified generator $\hat V_\tau$ for the L63 system, with low corresponding Dirichlet energy. Top: Scatterplot of $\operatorname{Re} \hat z_{j,\tau}$ on the L63 attractor. Bottom: Time series of $\operatorname{Re} \hat z_{j,\tau}$ sampled along a dynamical trajectory.

There are several open questions and future research directions stemming from these topics. First, it should be possible to combine pointwise estimators derived from methods such as diffusion forecasting and KAF with the Mori–Zwanzig formalism so as to incorporate memory effects. Another potential direction for future development is to incorporate wavelet frames, particularly when the measurements or probability densities are highly localized. Moreover, when the attractor $A$ is not a manifold, appropriate notions of regularity need to be identified so as to fully characterize the behavior of kernel algorithms such as diffusion maps. While we suspect that kernel-based constructions will still be the fundamental tool, the choice of kernel may need to be adapted to the regularity of the attractor to obtain optimal performance. Finally, a number of applications (e.g., analysis of perturbations) concern the action of dynamics on more general vector bundles besides functions, potentially with a non-commutative algebraic structure, calling for the development of suitable data-driven techniques for such spaces.

Acknowledgments
Research of the authors described in this review was supported by DARPA grant HR0011-16-C-0116; NSF grants 1842538, DMS-1317919, DMS-1521775, DMS-1619661, DMS-172317, DMS-1854383; and ONR grants N00014-12-1-0912, N00014-14-1-0150, N00014-13-1-0797, N00014-16-1-2649, N00014-16-1-2888.
References

[AG20] R. Alexander and D. Giannakis, Operator-theoretic framework for forecasting nonlinear time series with kernel analog techniques, Phys. D (2020), 132520.
[BBP+17] S. L. Brunton, B. W. Brunton, J. L. Proctor, E. Kaiser, and J. N. Kutz, Chaos as an intermittently forced linear system, Nat. Commun. (2017), no. 19.
[BCGFS13] T. Berry, R. Cressman, Z. Gregurić-Ferenček, and T. Sauer, Time-scale separation from diffusion-mapped delay coordinates, SIAM J. Appl. Dyn. Sys. (2013), 618–649.
[BGH15] T. Berry, D. Giannakis, and J. Harlim, Nonparametric forecasting of low-dimensional dynamical systems, Phys. Rev. E (2015), 032915.
[BH16] T. Berry and J. Harlim, Variable bandwidth diffusion kernels, Appl. Comput. Harmon. Anal. (2016), no. 1, 68–96.
[BK86] D. S. Broomhead and G. P. King, Extracting qualitative dynamics from experimental data, Phys. D (1986), no. 2–3, 217–236.
[BN03] M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. (2003), 1373–1396.
[CL06] R. R. Coifman and S. Lafon, Diffusion maps, Appl. Comput. Harmon. Anal. (2006), 5–30.
[CS01] F. Cucker and S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc. (2001), no. 1, 1–49.
[DF00] M. Dellnitz and G. Froyland, On the isolated spectrum of the Perron–Frobenius operator, Nonlinearity (2000), 1171–1188.
[DG19] S. Das and D. Giannakis, Delay-coordinate maps and the spectra of Koopman operators, J. Stat. Phys. (2019), no. 6, 1107–1145.
[DGS18] S. Das, D. Giannakis, and J. Slawinska, Reproducing kernel Hilbert space compactification of unitary evolution groups, 2018.
[DJ99] M. Dellnitz and O. Junge, On the approximation of complicated dynamical behavior, SIAM J. Numer. Anal. (1999), 491.
[EHJ17] W. E, J. Han, and A. Jentzen, Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations, Commun. Math. Stat. (2017), 349–380.
[Fro13] G. Froyland, An analytic framework for identifying finite-time coherent sets in time-dependent dynamical systems, Phys. D (2013), 1–19.
[Gen01] M. C. Genton, Classes of kernels for machine learning: A statistics perspective, J. Mach. Learn. Res. (2001), 299–312.
[GM12] D. Giannakis and A. J. Majda, Nonlinear Laplacian spectral analysis for time series with intermittency and low-frequency variability, Proc. Natl. Acad. Sci. (2012), no. 7, 2222–2227.
[KBBP17] J. N. Kutz, S. L. Brunton, B. W. Brunton, and J. L. Proctor, Dynamic mode decomposition: Data-driven modeling of complex systems, Society for Industrial and Applied Mathematics, Philadelphia, 2017.
[KM18] M. Korda and I. Mezić, Linear predictors for nonlinear dynamical systems: Koopman operator meets model predictive control, Automatica (2018), 149–160.
[KNK+18] S. Klus, F. Nüske, P. Koltai, H. Wu, I. Kevrekidis, C. Schütte, and F. Noé, Data-driven model reduction and transfer operator approximation, J. Nonlinear Sci. (2018), 985–1010.
[KNP+20] S. Klus, F. Nüske, S. Peitz, J.-H. Niemann, C. Clementi, and C. Schütte, Data-driven approximation of the Koopman generator: Model reduction, system identification, and control, Phys. D (2020), 132416.
[Koo31] B. O. Koopman, Hamiltonian systems and transformation in Hilbert space, Proc. Natl. Acad. Sci. (1931), no. 5, 315–318.
[Kos43] D. D. Kosambi, Statistics in function space, J. Ind. Math. Soc. (1943), 76–88.
[KPM20] M. Korda, M. Putinar, and I. Mezić, Data-driven spectral analysis of the Koopman operator, Appl. Comput. Harmon. Anal. (2020), no. 2, 599–629.
[Lor69] E. N. Lorenz, Atmospheric predictability as revealed by naturally occurring analogues, J. Atmos. Sci. (1969), 636–646.
[MB99] I. Mezić and A. Banaszuk, Comparison of systems with complex behavior: Spectral methods, Proceedings of the 39th IEEE Conference on Decision and Control, 1999, pp. 1224–1231.
[Mez05] I. Mezić, Spectral properties of dynamical systems, model reduction and decompositions, Nonlinear Dyn. (2005), 309–325.
[PHG+] J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott, Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach, Phys. Rev. Lett., 024102.
[RMB+09] C. W. Rowley, I. Mezić, S. Bagheri, P. Schlatter, and D. S. Henningson, Spectral analysis of nonlinear flows, J. Fluid Mech. (2009), 115–127.
[Sch10] P. J. Schmid, Dynamic mode decomposition of numerical and experimental data, J. Fluid Mech. (2010), 5–28.
[SSM98] B. Schölkopf, A. Smola, and K. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. (1998), 1299–1319.
[SYC91] T. Sauer, J. A. Yorke, and M. Casdagli, Embedology, J. Stat. Phys. (1991), no. 3–4, 579–616.
[TGHS19] N. G. Trillos, M. Gerlach, M. Hein, and D. Slepčev, Error estimates for spectral convergence of the graph Laplacian on random geometric graphs towards the Laplace–Beltrami operator, Found. Comput. Math. (2019). In press.
[TS18] N. G. Trillos and D. Slepčev, A variational approach to the consistency of spectral clustering, Appl. Comput. Harmon. Anal. (2018), no. 2, 239–281.
[VG89] R. Vautard and M. Ghil, Singular spectrum analysis in nonlinear dynamics, with applications to paleoclimatic time series, Phys. D (1989), 395–424.
[vLBB08] U. von Luxburg, M. Belkin, and O. Bousquet, Consistency of spectral clustering, Ann. Stat. (2008), no. 2, 555–586.
[VPH+20] P. R. Vlachas, J. Pathak, B. R. Hunt, T. P. Sapsis, M. Girvan, and E. Ott, Backpropagation algorithms and reservoir computing in recurrent neural networks for the forecasting of complex spatiotemporal dynamics, Neural Netw. (2020), 191–217.
[WKR15] M. O. Williams, I. G. Kevrekidis, and C. W. Rowley, A data-driven approximation of the Koopman operator: Extending dynamic mode decomposition, J. Nonlinear Sci. (2015), no. 6, 1307–1346.
[ZHL19] H. Zhang, J. Harlim, and X. Li, Computing linear response statistics using orthogonal polynomial based estimators: An RKHS formulation, 2019.