Spectral Learning of Dynamic Systems from Nonequilibrium Data
Hao Wu and Frank Noé
Department of Mathematics and Computer Science, Freie Universität Berlin, Arnimallee 6, 14195 Berlin
{hao.wu,frank.noe}@fu-berlin.de
Abstract
Observable operator models (OOMs) and related models are one of the most important and powerful tools for modeling and analyzing stochastic systems. They exactly describe the dynamics of finite-rank systems and can be efficiently and consistently estimated through spectral learning under the assumption of identically distributed data. In this paper, we investigate the properties of spectral learning without this assumption, motivated by the requirements of analyzing large-timescale systems, and show that the equilibrium dynamics of a system can be extracted from nonequilibrium observation data by imposing an equilibrium constraint. In addition, we propose a binless extension of spectral learning for continuous data. In comparison with other continuous-valued spectral algorithms, the binless algorithm can achieve consistent estimation of equilibrium dynamics with only linear complexity.
In the last two decades, a collection of highly related dynamic models, including observable operator models (OOMs) [1-3], predictive state representations [4-6] and reduced-rank hidden Markov models [7, 8], have become powerful and increasingly popular tools for the analysis of dynamic data. These models are largely similar, and all can be learned by spectral methods in the general framework of multiplicity automata, or equivalently sequential systems [9, 10]. In contrast with other commonly used models such as Markov state models [11, 12], Langevin models [13, 14], traditional hidden Markov models (HMMs) [15, 16], Gaussian process state-space models [17, 18] and recurrent neural networks [19], the spectral learning based models can exactly characterize the dynamics of a stochastic system without any a priori knowledge except the assumption of finite dynamic rank (i.e., the rank of the Hankel matrix) [10, 20], and the parameter estimation can be performed efficiently for discrete-valued systems without solving any intractable inverse or optimization problem. We focus in this paper only on stochastic systems without control inputs; all spectral learning based models can be expressed in the form of OOMs for such systems, so we will refer to them as OOMs below.

In most literature on spectral learning, the observation data are assumed to be identically (possibly not independently) distributed so that the expected values of observables associated with the parameter estimation can be reliably computed by empirical averaging. However, this assumption can be severely violated due to the limits of experimental technique or computational capacity in many practical situations, especially where metastable physical or chemical processes are involved. A notable example is the distributed computing project Folding@home [21], which explores protein folding processes that occur on timescales of microseconds to milliseconds based on molecular dynamics simulations on the order of nanoseconds in length.
In such a nonequilibrium case, where the distributions of observation data are time-varying and dependent on initial conditions, it is still unclear if promising estimates of OOMs can be obtained. In [22], a hybrid estimation algorithm was proposed to improve spectral learning of large-timescale processes by using both dynamic and static data, but it still requires the assumption of identically distributed data. One solution to reduce the statistical bias caused by nonequilibrium data is to discard the observation data generated before the system reaches a steady state, which is a common trick in applied statistics [23]. Obviously, this approach suffers from substantial information loss and is infeasible when observation trajectories are shorter than mixing times. Another possible way would be to learn OOMs by likelihood-based estimation instead of spectral methods, but there is no effective maximum likelihood or Bayesian estimator of OOMs until now. The maximum pseudo-likelihood estimator of OOMs proposed in [24] demands high computational cost and its consistency is yet unverified.

Another difficulty for spectral approaches is learning with continuous data, where density estimation problems are involved. The density estimation can be performed by parametric methods such as fuzzy interpolation [25] and kernel density estimation [8]. But these methods reduce the flexibility of OOMs for dynamic modeling because of their limited expressive capacity. Recently, a kernel embedding based spectral algorithm was proposed to cope with continuous data [26], which avoids explicit density estimation and learns OOMs in a nonparametric manner.
However, the kernel embedding usually yields a very large computational complexity, which greatly limits practical applications of this algorithm to real-world systems.

The purpose of this paper is to address the challenge of spectral learning of OOMs from nonequilibrium data for the analysis of both discrete- and continuous-valued systems. We first provide a modified spectral method for discrete-valued stochastic systems which allows us to consistently estimate the equilibrium dynamics from nonequilibrium data, and then extend this method to continuous observations in a binless manner. In comparison with the existing learning methods for continuous OOMs, the proposed binless spectral method does not rely on any density estimator, and can achieve consistent estimation with linear computational complexity in the data size even if the assumption of identically distributed observations does not hold. Moreover, some numerical experiments are provided to demonstrate the capability of the proposed methods.

In this paper, we use $\mathbb{P}$ to denote the probability distribution for discrete random variables and the probability density for continuous random variables. The indicator function of an event $e$ is denoted by $\mathbb{1}_e$ and the Dirac delta function centered at $x$ is denoted by $\delta_x(\cdot)$. For a given process $\{a_t\}$, we write the subsequence $(a_k, a_{k+1}, \ldots, a_{k'})$ as $a_{k:k'}$, and $\mathbb{E}_\infty[a_t] \triangleq \lim_{t\to\infty}\mathbb{E}[a_t]$ denotes the equilibrium expected value of $a_t$ if the limit exists. In addition, convergence in probability is denoted by $\xrightarrow{p}$.

An $m$-dimensional observable operator model (OOM) with observation space $\mathcal{O}$ can be represented by a tuple $\mathcal{M} = (\omega, \{\Xi(x)\}_{x\in\mathcal{O}}, \sigma)$, which consists of an initial state vector $\omega \in \mathbb{R}^{1\times m}$, an evaluation vector $\sigma \in \mathbb{R}^{m\times 1}$ and an observable operator matrix $\Xi(x) \in \mathbb{R}^{m\times m}$ associated to each element $x \in \mathcal{O}$.
$\mathcal{M}$ defines a stochastic process $\{x_t\}$ in $\mathcal{O}$ as
$$P(x_{1:t}\mid\mathcal{M}) = \omega\,\Xi(x_{1:t})\,\sigma \qquad (1)$$
under the condition that $\omega\,\Xi(x_{1:t})\,\sigma \ge 0$, $\omega\,\Xi(\mathcal{O})\,\sigma = 1$ and $\omega\,\Xi(x_{1:t})\,\sigma = \omega\,\Xi(x_{1:t})\,\Xi(\mathcal{O})\,\sigma$ hold for all $t$ and $x_{1:t}\in\mathcal{O}^t$ [10], where $\Xi(x_{1:t}) \triangleq \Xi(x_1)\cdots\Xi(x_t)$ and $\Xi(A) \triangleq \int_A \mathrm{d}x\,\Xi(x)$. Two OOMs $\mathcal{M}$ and $\mathcal{M}'$ are said to be equivalent if $P(x_{1:t}\mid\mathcal{M}) \equiv P(x_{1:t}\mid\mathcal{M}')$.

Here and hereafter, we only consider the case where the observation space $\mathcal{O}$ is a finite set. (Learning with continuous observations will be discussed in Section 4.2.)

Algorithm 1 General procedure for spectral learning of OOMs
INPUT: Observation trajectories generated by a stochastic process $\{x_t\}$ in $\mathcal{O}$
OUTPUT: $\hat{\mathcal{M}} = (\hat\omega, \{\hat\Xi(x)\}_{x\in\mathcal{O}}, \hat\sigma)$
PARAMETERS: $m$: dimension of the OOM. $D_1, D_2$: numbers of feature functions. $L$: order of feature functions.
1: Construct feature functions $\phi_1 = (\varphi_{1,1},\ldots,\varphi_{1,D_1})^\top$ and $\phi_2 = (\varphi_{2,1},\ldots,\varphi_{2,D_2})^\top$, where each $\varphi_{i,j}$ is a mapping from $\mathcal{O}^L$ to $\mathbb{R}$ and $D_1, D_2 \ge m$.
2: Approximate
$$\bar\phi_1 \triangleq \mathbb{E}[\phi_1(x_{t-L:t-1})],\qquad \bar\phi_2 \triangleq \mathbb{E}[\phi_2(x_{t:t+L-1})] \qquad (5)$$
$$C_{1,2} \triangleq \mathbb{E}\big[\phi_1(x_{t-L:t-1})\,\phi_2(x_{t:t+L-1})^\top\big] \qquad (6)$$
$$C_{1,3}(x) \triangleq \mathbb{E}\big[\mathbb{1}_{x_t=x}\cdot\phi_1(x_{t-L:t-1})\,\phi_2(x_{t+1:t+L})^\top\big],\quad \forall x\in\mathcal{O} \qquad (7)$$
by their empirical means $\hat{\bar\phi}_1$, $\hat{\bar\phi}_2$, $\hat C_{1,2}$ and $\hat C_{1,3}(x)$ over the observation data.
3: Compute $F_1 = U\Sigma^{-1} \in \mathbb{R}^{D_1\times m}$ and $F_2 = V \in \mathbb{R}^{D_2\times m}$ from the truncated singular value decomposition $\hat C_{1,2} \approx U\Sigma V^\top$, where $\Sigma \in \mathbb{R}^{m\times m}$ is a diagonal matrix containing the top $m$ singular values of $\hat C_{1,2}$, and $U$ and $V$ consist of the corresponding $m$ left and right singular vectors of $\hat C_{1,2}$.
4: Compute
$$\hat\sigma = F_1^\top \hat{\bar\phi}_1 \qquad (8)$$
$$\hat\Xi(x) = F_1^\top \hat C_{1,3}(x)\,F_2,\quad \forall x\in\mathcal{O} \qquad (9)$$
$$\hat\omega = \hat{\bar\phi}_2^\top F_2 \qquad (10)$$

A large number of largely similar spectral methods have been developed, and the generic learning procedure of these methods is summarized in Algorithm 1 by omitting details of algorithm implementation and parameter choice [27, 7, 28].
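As an illustration of the probability rule in Eq. (1), the sketch below evaluates $P(x_{1:t}\mid\mathcal{M})$ by accumulating operator products. The two-state OOM is a toy example built from a hypothetical HMM (transition matrix `A`, emission matrix `B`, operator construction `Xi(x) = diag(B[:, x]) @ A`); it is not taken from the paper.

```python
import numpy as np

def oom_sequence_probability(omega, Xi, sigma, seq):
    """Evaluate P(x_{1:t} | M) = omega Xi(x_1) ... Xi(x_t) sigma  (Eq. 1)."""
    state = omega
    for x in seq:
        state = state @ Xi[x]          # propagate the row state vector
    return float(state @ sigma)

# Toy OOM derived from an illustrative 2-state HMM:
# Xi(x) = diag(B[:, x]) @ A is a valid observable operator for this HMM.
A = np.array([[0.9, 0.1], [0.2, 0.8]])       # hidden-state transition matrix
B = np.array([[0.8, 0.2], [0.3, 0.7]])       # B[i, x] = P(obs x | state i)
omega = np.array([2 / 3, 1 / 3])             # stationary initial state vector
sigma = np.ones(2)                           # evaluation vector
Xi = {x: np.diag(B[:, x]) @ A for x in (0, 1)}
```

With this construction, the probabilities of all length-$t$ observation sequences sum to one, since $\Xi(\mathcal{O}) = \sum_x \Xi(x) = A$ is row-stochastic.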
For convenience of description and analysis, we specify in this paper the formulas for calculating $\hat{\bar\phi}_1$, $\hat{\bar\phi}_2$, $\hat C_{1,2}$ and $\hat C_{1,3}(x)$ in Line 2 of Algorithm 1 as follows:
$$\hat{\bar\phi}_1 = \frac{1}{N}\sum_{n=1}^N \phi_1(\vec s_n),\qquad \hat{\bar\phi}_2 = \frac{1}{N}\sum_{n=1}^N \phi_2(\bar s_n) \qquad (2)$$
$$\hat C_{1,2} = \frac{1}{N}\sum_{n=1}^N \phi_1(\vec s_n)\,\phi_2(\bar s_n)^\top \qquad (3)$$
$$\hat C_{1,3}(x) = \frac{1}{N}\sum_{n=1}^N \mathbb{1}_{s_n=x}\,\phi_1(\vec s_n)\,\phi_2(\overleftarrow{s}_n)^\top,\quad \forall x\in\mathcal{O} \qquad (4)$$
Here $\{(\vec s_n, s_n, \overleftarrow{s}_n)\}_{n=1}^N$ is the collection of all subsequences of length $2L+1$ appearing in the observation data ($N = T-2L$ for a single observation trajectory of length $T$). If an observation subsequence $x_{t-L:t+L}$ is denoted by $(\vec s_n, s_n, \overleftarrow{s}_n)$ for some $n$, then $\vec s_n = x_{t-L:t-1}$ and $\overleftarrow{s}_n = x_{t+1:t+L}$ represent the prefix and suffix of $x_{t-L:t+L}$ of length $L$, $s_n = x_t$ is the intermediate observation value, and $\bar s_n = x_{t:t+L-1}$ is an "intermediate part" of the subsequence of length $L$ starting from time $t$ (see Fig. 1 for a graphical illustration).

Algorithm 1 is much more efficient than the commonly used likelihood-based learning algorithms and does not suffer from local optima issues. In addition, and more importantly, this algorithm can be shown to be consistent if $(\vec s_n, s_n, \overleftarrow{s}_n)$ are (i) independently sampled from $\mathcal{M}$ or (ii) obtained from a finite number of trajectories which have fully mixed so that all observation triples are identically distributed (see, e.g., [8, 3, 10] for related works).
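The estimation procedure above can be sketched concretely. The following is a minimal illustration of Algorithm 1 for a discrete observation space, assuming $L = 1$ and one-hot indicator feature functions (so $D_1 = D_2 = |\mathcal{O}|$); function name and data layout are illustrative choices, not the authors' implementation.

```python
import numpy as np

def spectral_oom(trajs, obs_set, m):
    """Sketch of Algorithm 1 with L = 1 and one-hot indicator features.
    Symbols follow Eqs. (2)-(10) of the text."""
    idx = {x: i for i, x in enumerate(obs_set)}
    D = len(obs_set)
    phi1 = np.zeros(D)                       # empirical mean of phi_1 (Eq. 2)
    phi2 = np.zeros(D)                       # empirical mean of phi_2 (Eq. 2)
    C12 = np.zeros((D, D))                   # Eq. (3)
    C13 = {x: np.zeros((D, D)) for x in obs_set}   # Eq. (4), raw counts
    N = 0
    for traj in trajs:                       # windows (prefix, middle, suffix)
        for t in range(1, len(traj) - 1):
            i, j, k = idx[traj[t - 1]], idx[traj[t]], idx[traj[t + 1]]
            phi1[i] += 1; phi2[j] += 1
            C12[i, j] += 1
            C13[traj[t]][i, k] += 1
            N += 1
    phi1 /= N; phi2 /= N; C12 /= N
    U, s, Vt = np.linalg.svd(C12)
    U, s, V = U[:, :m], s[:m], Vt[:m].T      # truncated SVD of rank m
    F1, F2 = U / s, V                        # F1 = U Sigma^{-1}, F2 = V
    sigma = F1.T @ phi1                      # Eq. (8)
    Xi = {x: F1.T @ (C13[x] / N) @ F2 for x in obs_set}   # Eq. (9)
    omega = phi2 @ F2                        # Eq. (10)
    return omega, Xi, sigma
```

For a long stationary trajectory of a two-state Markov chain observed directly, the learned quantities $\hat\omega\hat\Xi(x)\hat\sigma$ approximate the stationary probabilities, consistent with case (ii) above.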
However, the asymptotic correctness of OOMs learned from short trajectories starting from nonequilibrium states has not been formally determined.

Figure 1: Illustration of the variables $\vec s_n$, $s_n$, $\overleftarrow{s}_n$ and $\bar s_n$ used in Eqs. (2)-(4) with $(\vec s_n, s_n, \overleftarrow{s}_n) = x_{t-L:t+L}$.

We now analyze the statistical properties of the spectral algorithm without the assumption of identically distributed observations. Before stating our main result, some assumptions on the observation data are listed as follows:
Assumption 1. The observation data consist of $I$ independent trajectories of length $T$ produced by a stochastic process $\{x_t\}$, and the data size tends to infinity with (i) $I\to\infty$ and $T = T_0$ or (ii) $T\to\infty$ and $I = I_0$.

Assumption 2. $\{x_t\}$ is driven by an $m$-dimensional OOM $\mathcal{M} = (\omega, \{\Xi(x)\}_{x\in\mathcal{O}}, \sigma)$, and
$$\frac{1}{T'}\sum_{t=1}^{T'} f(x_{t:t+l-1}) \xrightarrow{p} \mathbb{E}_\infty[f(x_{t:t+l-1})] = \mathbb{E}_\infty[f(x_{t:t+l-1})\mid x_{1:k}] \qquad (11)$$
as $T'\to\infty$ for all $k$, $l$, $x_{1:k}$ and $f:\mathcal{O}^l\mapsto\mathbb{R}$.

Assumption 3.
The rank of the limit of $\hat C_{1,2}$ is not less than $m$.

Notice that Assumption 2 only states the asymptotic stationarity of $\{x_t\}$, and marginal distributions of observation triples are possibly time-dependent if $\omega \ne \omega\,\Xi(\mathcal{O})$. Assumption 3 ensures that the limit of $\hat{\mathcal{M}}$ given by Algorithm 1 is well defined, which generally holds for minimal OOMs (see [10]). Based on the above assumptions, we have the following theorem concerning the statistical consistency of the OOM learning algorithm (see Appendix A.1 for the proof):

Theorem 1.
Under Assumptions 1-3, there exists an OOM $\mathcal{M}' = (\omega', \{\Xi'(x)\}_{x\in\mathcal{O}}, \sigma')$ which is equivalent to $\hat{\mathcal{M}}$ and satisfies
$$\sigma' \xrightarrow{p} \sigma,\qquad \Xi'(x) \xrightarrow{p} \Xi(x),\quad \forall x\in\mathcal{O} \qquad (12)$$

This theorem is central to this paper: it implies that the spectral learning algorithm can achieve consistent estimation of all parameters of OOMs except the initial state vector, even for nonequilibrium data ($\hat\omega \xrightarrow{p} \omega'$ does not hold in most cases, except when $\{x_t\}$ is stationary). It can be further generalized according to requirements in more complicated situations where, for example, observation trajectories are generated with multiple different initial conditions (see Appendix A.2).

In this section, applications of spectral learning to the problem of recovering equilibrium properties of dynamic systems from nonequilibrium data will be highlighted, which is an important problem in practice, especially for thermodynamic and kinetic analysis in computational physics and chemistry.
According to the definition of OOMs, the equilibrium dynamics of an OOM $\mathcal{M} = (\omega, \{\Xi(x)\}_{x\in\mathcal{O}}, \sigma)$ can be described by an equilibrium OOM $\mathcal{M}_{\mathrm{eq}} = (\omega_{\mathrm{eq}}, \{\Xi(x)\}_{x\in\mathcal{O}}, \sigma)$ as
$$\lim_{t\to\infty} P(x_{t+1:t+k} = z_{1:k}\mid\mathcal{M}) = P(x_{1:k} = z_{1:k}\mid\mathcal{M}_{\mathrm{eq}}) \qquad (13)$$
if the equilibrium state vector
$$\omega_{\mathrm{eq}} = \lim_{t\to\infty}\omega\,\Xi(\mathcal{O})^t \qquad (14)$$
exists. From (13) and (14), we have
$$\begin{cases} \omega_{\mathrm{eq}}\,\Xi(\mathcal{O}) = \lim_{t\to\infty}\omega\,\Xi(\mathcal{O})^{t+1} = \omega_{\mathrm{eq}} \\ \omega_{\mathrm{eq}}\,\sigma = \lim_{t\to\infty}\sum_{x\in\mathcal{O}} P(x_{t+1} = x) = 1 \end{cases} \qquad (15)$$
The above equilibrium constraint of OOMs motivates the following algorithm for learning equilibrium OOMs: perform Algorithm 1 to get $\hat\Xi(x)$ and $\hat\sigma$, and calculate $\hat\omega_{\mathrm{eq}}$ by the quadratic programming problem
$$\hat\omega_{\mathrm{eq}} = \arg\min_{w\,\in\,\{w\,\mid\,w\hat\sigma = 1\}} \big\|\,w\,\hat\Xi(\mathcal{O}) - w\,\big\|^2 \qquad (16)$$
(see Appendix A.3 for a closed-form expression of the solution to (16)). The existence and uniqueness of $\omega_{\mathrm{eq}}$ are shown in Appendix A.3, which yield the following theorem:

Theorem 2.
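The constrained quadratic program (16) has a closed-form solution (the paper gives its own expression in Appendix A.3, which is not reproduced here). As one standard route, the sketch below solves the equivalent KKT linear system of the equality-constrained least-squares problem; the function name is illustrative.

```python
import numpy as np

def equilibrium_state_vector(Xi_total, sigma):
    """Solve Eq. (16):  min_w || w (Xi(O) - I) ||^2  s.t.  w sigma = 1,
    via the KKT system of the equality-constrained least-squares problem
    (a sketch, not the paper's own closed form)."""
    m = Xi_total.shape[0]
    B = Xi_total.T - np.eye(m)          # (w (Xi(O) - I))^T = B w^T
    sig = np.asarray(sigma).reshape(m)
    # KKT system: [B^T B, sigma; sigma^T, 0] [w^T; mu] = [0; 1]
    K = np.zeros((m + 1, m + 1))
    K[:m, :m] = B.T @ B
    K[:m, m] = sig
    K[m, :m] = sig
    rhs = np.zeros(m + 1); rhs[m] = 1.0
    sol = np.linalg.lstsq(K, rhs, rcond=None)[0]
    return sol[:m]                       # the row vector w_eq
```

For instance, if $\hat\Xi(\mathcal{O})$ is a row-stochastic matrix and $\hat\sigma = \mathbb{1}$ (as in an OOM representation of a Markov chain), the solution is the chain's stationary distribution.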
Under Assumptions 1-3, the estimated equilibrium OOM $\hat{\mathcal{M}}_{\mathrm{eq}} = (\hat\omega_{\mathrm{eq}}, \{\hat\Xi(x)\}_{x\in\mathcal{O}}, \hat\sigma)$ provided by Algorithm 1 and Eq. (16) satisfies
$$P\big(x_{1:l} = z_{1:l}\mid\hat{\mathcal{M}}_{\mathrm{eq}}\big) \xrightarrow{p} \lim_{t\to\infty} P(x_{t+1:t+l} = z_{1:l}) \qquad (17)$$
for all $l$ and $z_{1:l}$.

Remark. $\hat\omega_{\mathrm{eq}}$ can also be computed as an eigenvector of $\hat\Xi(\mathcal{O})$. But the eigenvalue problem possibly yields numerical instability and complex values because of statistical noise, unless some specific feature functions $\phi_1, \phi_2$ are selected so that $\hat\omega_{\mathrm{eq}}\hat\Xi(\mathcal{O}) = \hat\omega_{\mathrm{eq}}$ can be exactly solved in the real field [29].

A straightforward way to extend spectral algorithms to handle continuous data is based on coarse-graining of the observation space. Suppose that $\{x_t\}$ is a stochastic process in a continuous observation space $\mathcal{O}\subset\mathbb{R}^d$, and $\mathcal{O}$ is partitioned into $J$ discrete bins $B_1,\ldots,B_J$. Then we can utilize the algorithm in Section 4.1 to approximate the equilibrium transition dynamics between bins as
$$\lim_{t\to\infty} P(x_{t+1}\in B_{j_1},\ldots,x_{t+l}\in B_{j_l}) \approx \hat\omega_{\mathrm{eq}}\,\hat\Xi(B_{j_1})\cdots\hat\Xi(B_{j_l})\,\hat\sigma \qquad (18)$$
and obtain a binned OOM $\hat{\mathcal{M}}_{\mathrm{eq}} = (\hat\omega_{\mathrm{eq}}, \{\hat\Xi(x)\}_{x\in\mathcal{O}}, \hat\sigma)$ for the continuous dynamics of $\{x_t\}$ with
$$\hat\Xi(x) = \frac{\hat\Xi(B(x))}{\mathrm{vol}(B(x))} \qquad (19)$$
by assuming the observable operator matrices are piecewise constant on bins, where $B(x)$ denotes the bin containing $x$ and $\mathrm{vol}(B)$ is the volume of $B$. Conventional wisdom dictates that the number of bins is a key parameter of the coarse-graining strategy and should be carefully chosen to balance statistical noise against discretization error. However, we will show in what follows that it is justifiable to increase the number of bins to infinity.

Let us consider the limit case where $J\to\infty$ and bins are infinitesimal with $\max_j \mathrm{vol}(B_j)\to 0$.
In this case,
$$\hat\Xi(x) = \lim_{\mathrm{vol}(B(x))\to 0}\frac{\hat\Xi(B(x))}{\mathrm{vol}(B(x))} = \begin{cases} \hat W_{s_n}\,\delta_{s_n}(x), & x = s_n \\ 0, & \text{otherwise} \end{cases} \qquad (20)$$
where
$$\hat W_{s_n} = \frac{1}{N}\,F_1^\top\,\phi_1(\vec s_n)\,\phi_2(\overleftarrow{s}_n)^\top\,F_2 \qquad (21)$$
according to (9) in Algorithm 1. Then $\hat{\mathcal{M}}_{\mathrm{eq}}$ becomes a binless OOM over the sample points $X = \{s_n\}_{n=1}^N$ and can be estimated from data by Algorithm 2, where the feature functions can be selected as indicator functions, radial basis functions or other commonly used activation functions for single-layer neural networks, in order to digest adequate dynamic information from the observation data.

Algorithm 2 Procedure for learning binless equilibrium OOMs
INPUT: Observation trajectories generated by a stochastic process $\{x_t\}$ in $\mathcal{O}\subset\mathbb{R}^d$
OUTPUT: Binless OOM $\hat{\mathcal{M}}_{\mathrm{eq}} = (\hat\omega_{\mathrm{eq}}, \{\hat\Xi(x)\}_{x\in\mathcal{O}}, \hat\sigma)$
1: Construct feature functions $\phi_1:\mathbb{R}^{Ld}\mapsto\mathbb{R}^{D_1}$ and $\phi_2:\mathbb{R}^{Ld}\mapsto\mathbb{R}^{D_2}$ with $D_1, D_2 \ge m$.
2: Calculate $\hat{\bar\phi}_1$, $\hat{\bar\phi}_2$, $\hat C_{1,2}$ by (2) and (3).
3: Compute $F_1 = U\Sigma^{-1}\in\mathbb{R}^{D_1\times m}$ and $F_2 = V\in\mathbb{R}^{D_2\times m}$ from the truncated singular value decomposition $\hat C_{1,2}\approx U\Sigma V^\top$.
4: Compute $\hat\sigma$, $\hat\omega_{\mathrm{eq}}$ and $\hat\Xi(x) = \sum_{z\in X}\hat W_z\,\delta_z(x)$ by (8), (16) and (21), where $\hat\Xi(\mathcal{O}) = \int_{\mathcal{O}}\mathrm{d}x\,\hat\Xi(x) = \sum_{z\in X}\hat W_z$.

The binless algorithm presented here can be efficiently implemented with linear computational complexity $O(N)$, and is applicable to more general cases where observations are strings, graphs or other structured variables. Unlike the other spectral algorithms for continuous data, it does not require that the observed dynamics coincide with some parametric model defined by the feature functions. Last but most importantly, as stated in the following theorem, this algorithm can be used to consistently extract static and kinetic properties of a dynamic system in equilibrium from nonequilibrium data (see Appendix A.3 for the proof):

Theorem 3.
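The per-sample operator weights of Eq. (21), and the equilibrium expectation they induce (Eq. (23), shown here only for the case $r = 1$), can be sketched as follows; the function names and the feature-vector data layout are hypothetical.

```python
import numpy as np

def binless_weights(phi1_prefixes, phi2_suffixes, F1, F2):
    """Per-sample weights of the binless OOM (Eq. 21):
        W_{s_n} = (1/N) F1^T phi_1(prefix_n) phi_2(suffix_n)^T F2,
    so that Xi(x) = sum_n W_{s_n} delta_{s_n}(x) and Xi(O) = sum_n W_{s_n}.
    Each list entry holds the feature vector of one subsequence."""
    N = len(phi1_prefixes)
    return [F1.T @ np.outer(p, q) @ F2 / N
            for p, q in zip(phi1_prefixes, phi2_suffixes)]

def equilibrium_expectation(g_values, W, omega_eq, sigma):
    """Equilibrium expectation for r = 1 in Eq. (23):
        E[g | M_eq] = sum_n g(s_n) * omega_eq W_{s_n} sigma."""
    return sum(g * float(omega_eq @ Wn @ sigma) for g, Wn in zip(g_values, W))
```

The cost is one rank-one contraction per sample point, which makes the advertised $O(N)$ complexity explicit.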
Provided that the observation space $\mathcal{O}$ is a closed set in $\mathbb{R}^d$, the feature functions $\phi_1, \phi_2$ are bounded on $\mathcal{O}^L$, and Assumptions 1-3 hold, the binless OOM given by Algorithm 2 satisfies
$$\mathbb{E}\big[g(x_{1:r})\mid\hat{\mathcal{M}}_{\mathrm{eq}}\big] \xrightarrow{p} \mathbb{E}_\infty[g(x_{t+1:t+r})] \qquad (22)$$
with
$$\mathbb{E}\big[g(x_{1:r})\mid\hat{\mathcal{M}}_{\mathrm{eq}}\big] = \sum_{z_{1:r}\in X^r} g(z_{1:r})\,\hat\omega_{\mathrm{eq}}\,\hat W_{z_1}\cdots\hat W_{z_r}\,\hat\sigma \qquad (23)$$
(i) for all continuous functions $g:\mathcal{O}^r\mapsto\mathbb{R}$;
(ii) for all bounded and Borel measurable functions $g:\mathcal{O}^r\mapsto\mathbb{R}$, if there exist positive constants $\bar\xi$ and $\underline\xi$ so that $\|\Xi(x)\|\le\bar\xi$ and $\lim_{t\to\infty}P(x_{t+1:t+r}=z_{1:r})\ge\underline\xi$ for all $x\in\mathcal{O}$ and $z_{1:r}\in\mathcal{O}^r$.

It is worth pointing out that the spectral learning investigated in this section is an ideal tool for the analysis of dynamic properties of stochastic processes, because the related quantities, such as stationary distributions, principal components and time-lagged correlations, can be easily computed from the parameters of discrete OOMs or binless OOMs. For many popular nonlinear dynamic models, including Gaussian process state-space models [17] and recurrent neural networks [19], the computation of such quantities is intractable or time-consuming.

The major disadvantage of spectral learning is that the estimated OOMs are usually only "approximately valid" and possibly assign "negative probabilities" to some observation sequences. It is therefore difficult to apply spectral methods to prediction, filtering and smoothing of signals, where Bayesian inference is involved.
In this section, we evaluate our algorithms on two diffusion processes and the molecular dynamics of alanine dipeptide, and compare them to several alternatives. The detailed settings of the simulations and algorithms are provided in Appendix B.
Brownian dynamics
Let us consider a one-dimensional diffusion process driven by the Brownian dynamics
$$\mathrm{d}x_t = -\nabla V(x_t)\,\mathrm{d}t + \sqrt{2\beta^{-1}}\,\mathrm{d}W_t \qquad (24)$$
with observations generated by
$$y_t = \begin{cases} 1, & x_t\in\mathrm{I} \\ 0, & x_t\in\mathrm{II} \end{cases}$$
Figure 2: Comparison of modeling methods for a one-dimensional diffusion process. (a) Potential function. (b) Estimates of the difference between the equilibrium probabilities of I and II given by the traditional OOM, HMM and the equilibrium OOM (EQ-OOM) obtained from the proposed algorithm with $\mathcal{O} = \{\mathrm{I}, \mathrm{II}\}$. (c) Estimates of the probability difference given by the empirical estimator, HMM and the proposed binless OOM with $\mathcal{O} = [0, \ldots]$. (d) Stationary histograms of $\{x_t\}$ with uniform bins estimated from trajectories of length $\ldots$. The length of each trajectory is $T = 50\sim\ldots$ and the number of trajectories is $[10^{\ldots}/T]$. Error bars are standard deviations over independent experiments.

The potential function $V(x)$ is shown in Fig. 2(a) and contains two potential wells I and II. In this example, all simulations start from a uniform distribution on $[0, \ldots]$, which implies that the simulations are highly nonequilibrium and that it is difficult to accurately estimate the equilibrium probabilities $\mathrm{Prob}_{\mathrm{I}} = \mathbb{E}_\infty[\mathbb{1}_{x_t\in\mathrm{I}}] = \mathbb{E}_\infty[y_t]$ and $\mathrm{Prob}_{\mathrm{II}} = \mathbb{E}_\infty[\mathbb{1}_{x_t\in\mathrm{II}}] = 1 - \mathbb{E}_\infty[y_t]$ of the two potential wells from the simulation data. We first utilize the traditional spectral learning without enforcing equilibrium, expectation-maximization based HMM learning, and the proposed discrete spectral algorithm to estimate $\mathrm{Prob}_{\mathrm{I}}$ and $\mathrm{Prob}_{\mathrm{II}}$ based on $\{y_t\}$; the estimation results for different simulation lengths are summarized in Fig. 2(b). It can be seen that, in contrast to the other methods, the spectral algorithm for equilibrium OOMs effectively reduces the statistical bias of the nonequilibrium data, and achieves statistically correct estimation at $T = 300$.

Figs. 2(c) and 2(d) plot estimates of the stationary distribution of $\{x_t\}$ obtained from $\{x_t\}$ directly, where the empirical estimator calculates statistics by averaging over all observations.
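Trajectories of Eq. (24) can be generated with a plain Euler-Maruyama discretization. The sketch below uses an illustrative double-well potential, step size and inverse temperature; these are assumptions for demonstration, not the paper's exact simulation settings (which are given in Appendix B).

```python
import numpy as np

def simulate_brownian_dynamics(grad_V, beta, x0, n_steps, dt, seed=0):
    """Euler-Maruyama discretization of Eq. (24):
        x_{t+1} = x_t - grad_V(x_t) * dt + sqrt(2 * dt / beta) * N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1)
    x[0] = x0
    noise = np.sqrt(2.0 * dt / beta)
    for t in range(n_steps):
        x[t + 1] = x[t] - grad_V(x[t]) * dt + noise * rng.standard_normal()
    return x

# Illustrative double-well potential V(x) = (x^2 - 1)^2 with wells at +-1.
grad_V = lambda x: 4.0 * x * (x * x - 1.0)
traj = simulate_brownian_dynamics(grad_V, beta=4.0, x0=1.0,
                                  n_steps=20000, dt=1e-3)
```

Starting many such short trajectories from a non-stationary initial distribution reproduces the nonequilibrium setting studied in this section.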
In this case, the proposed binless OOM significantly outperforms the other methods, and its estimates are very close to the true values even for extremely short trajectories.

Fig. 3 provides an example of a two-dimensional diffusion process. The dynamics of this process can also be represented in the form of (24), and the potential function is shown in Fig. 3(a). The goal of this example is to estimate the first time-structure based independent component $w_{\mathrm{TICA}}$ [30] of this process from simulation data. Here $w_{\mathrm{TICA}}$ is a kinetic quantity of the process and is the solution to the generalized eigenvalue problem $C_\tau w = \lambda C_0 w$ with the largest eigenvalue, where $C_0$ is the covariance matrix of $\{x_t\}$ in equilibrium and $C_\tau = \mathbb{E}_\infty[x_t x_{t+\tau}^\top] - \mathbb{E}_\infty[x_t]\,\mathbb{E}_\infty[x_t^\top]$ is the equilibrium time-lagged covariance matrix. The simulation data are also nonequilibrium, with all simulations starting from the uniform distribution on $[-\ldots]\times[-\ldots]$. Fig. 3(b) displays the estimation errors of $w_{\mathrm{TICA}}$ obtained from the different learning methods, which also demonstrates the superiority of the binless spectral method.
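The generalized eigenvalue problem $C_\tau w = \lambda C_0 w$ can be sketched as follows with plain empirical covariances (i.e., without the equilibrium reweighting that the binless OOM provides); whitening by $C_0^{-1/2}$ reduces it to an ordinary symmetric eigenproblem. The function name is illustrative.

```python
import numpy as np

def tica_first_component(X, tau):
    """First TICA component: solve C_tau w = lambda C_0 w for the largest
    eigenvalue, using plain empirical covariance estimates.
    X : (n_frames, d) data matrix, tau : lag time in frames."""
    X = X - X.mean(axis=0)
    A, B = X[:-tau], X[tau:]
    C0 = (X.T @ X) / len(X)                      # instantaneous covariance
    Ctau = (A.T @ B) / len(A)                    # time-lagged covariance
    Ctau = 0.5 * (Ctau + Ctau.T)                 # symmetrize the estimate
    # whiten with C0^{-1/2}, then solve an ordinary symmetric eigenproblem
    evals0, evecs0 = np.linalg.eigh(C0)
    L = evecs0 @ np.diag(evals0 ** -0.5) @ evecs0.T
    evals, evecs = np.linalg.eigh(L @ Ctau @ L)  # ascending order
    w = L @ evecs[:, -1]                         # back-transform: C_tau w = lambda C_0 w
    return evals[-1], w
```

On data whose first coordinate is a slowly decorrelating signal, the recovered $w$ aligns with that coordinate, and the eigenvalue approximates its lag-$\tau$ autocorrelation.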
Alanine dipeptide
Alanine dipeptide is a small molecule which consists of two alanine amino acid units, and its configuration can be described by two backbone dihedral angles. Fig. 4(a) shows the potential profile of alanine dipeptide with respect to the two angles, which contains five metastable
Figure 3: Comparison of modeling methods for a two-dimensional diffusion process. (a) Potential function. (b) Estimation error of $w_{\mathrm{TICA}}\in\mathbb{R}^2$ of the first TIC with lag time $\ldots$. The length of each trajectory is $T = 200\sim\ldots$ and the number of trajectories is $[10^{\ldots}/T]$. Error bars are standard deviations over independent experiments.
Figure 4: Comparison of modeling methods for the molecular dynamics of alanine dipeptide. (a) Reduced free energy. (b) Estimation error of $\pi$, where the horizontal axis denotes the total simulation time $T\times I$. The length of each trajectory is $T = 10\,\mathrm{ns}$ and the number of trajectories is $I = 150\sim\ldots$. Error bars are standard deviations over independent experiments.

states $\{\mathrm{I}, \mathrm{II}, \mathrm{III}, \mathrm{IV}, \mathrm{V}\}$. We perform multiple short molecular dynamics simulations starting from the metastable state IV, where each simulation has length $10\,\mathrm{ns}$, and utilize different methods to approximate the stationary distribution $\pi = (\mathrm{Prob}_{\mathrm{I}}, \mathrm{Prob}_{\mathrm{II}}, \ldots, \mathrm{Prob}_{\mathrm{V}})$ of the five metastable states. As shown in Fig. 4(b), the proposed binless algorithm yields lower estimation error than each of the alternatives.

In this paper, we investigated the statistical properties of the general spectral learning procedure for nonequilibrium data, and developed novel spectral methods for learning equilibrium dynamics from nonequilibrium (discrete or continuous) data. The main ideas of the presented methods are to correct the model parameters via the equilibrium constraint and to handle continuous observations in a binless manner. Interesting directions for future research include the analysis of approximation error at finite data size and applications to controlled systems.
Acknowledgments
This work was funded by Deutsche Forschungsgemeinschaft (SFB 1114/A04 to H.W.) and EuropeanResearch Council (StG pcCell to F.N.).
References [1] H. Jaeger, “Observable operator models for discrete stochastic time series,”
Neural Comput., vol. 12, no. 6, pp. 1371-1398, 2000. [2] M.-J. Zhao, H. Jaeger, and M. Thon, "A bound on modeling error in observable operator models and an associated learning algorithm,"
Neural Comput., vol. 21, no. 9, pp. 2687-2712, 2009. [3] H. Jaeger, "Discrete-time, discrete-valued observable operator models: a tutorial," tech. rep., International University Bremen, 2012. [4] M. L. Littman, R. S. Sutton, and S. Singh, "Predictive representations of state," in
Adv. Neural Inf. Process. Syst. 14 (NIPS 2001), pp. 1555-1561, 2001.
[5] S. Singh, M. James, and M. Rudary, "Predictive state representations: A new theory for modeling dynamical systems," in
Proc. 20th Conf. Uncertainty Artif. Intell. (UAI 2004) , pp. 512–519, 2004.[6] E. Wiewiora, “Learning predictive representations from a history,” in
Proc. 22nd Intl. Conf. on Mach. Learn. (ICML 2005), pp. 964-971, 2005. [7] D. Hsu, S. M. Kakade, and T. Zhang, "A spectral algorithm for learning hidden Markov models," in
Proc. 22nd Conf. Learning Theory (COLT 2009), 2009. [8] S. Siddiqi, B. Boots, and G. Gordon, "Reduced-rank hidden Markov models," in
Proc. 13th Intl. Conf. Artif. Intell. Stat. (AISTATS 2010), vol. 9, pp. 741-748, 2010. [9] A. Beimel, F. Bergadano, N. H. Bshouty, E. Kushilevitz, and S. Varricchio, "Learning functions represented as multiplicity automata,"
J. ACM, vol. 47, no. 3, pp. 506-530, 2000. [10] M. Thon and H. Jaeger, "Links between multiplicity automata, observable operator models and predictive state representations — a unified learning framework,"
J. Mach. Learn. Res. , vol. 16, pp. 103–147, 2015.[11] J.-H. Prinz, H. Wu, M. Sarich, B. Keller, M. Senne, M. Held, J. D. Chodera, C. Schütte, and F. Noé,“Markov models of molecular kinetics: Generation and validation,”
J. Chem. Phys. , vol. 134, p. 174105,2011.[12] G. R. Bowman, V. S. Pande, and F. Noé,
An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation. Springer, 2013. [13] A. Ruttor, P. Batz, and M. Opper, "Approximate Gaussian process inference for the drift function in stochastic differential equations," in
Adv. Neural Inf. Process. Syst. 26 (NIPS 2013), pp. 2040-2048, 2013. [14] N. Schaudinnus, B. Bastian, R. Hegger, and G. Stock, "Multidimensional Langevin modeling of nonoverdamped dynamics,"
Phys. Rev. Lett., vol. 115, no. 5, p. 050602, 2015. [15] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition,"
Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989. [16] F. Noé, H. Wu, J.-H. Prinz, and N. Plattner, "Projected and hidden Markov models for calculating kinetics and metastable states of complex molecules,"
J. Chem. Phys. , vol. 139, p. 184114, 2013.[17] R. D. Turner, M. P. Deisenroth, and C. E. Rasmussen, “State-space inference and learning with Gaussianprocesses,” in
Proc. 13th Intl. Conf. Artif. Intell. Stat. (AISTATS 2010), pp. 868-875, 2010. [18] A. Svensson, A. Solin, S. Särkkä, and T. B. Schön, "Computationally efficient Bayesian learning of Gaussian process state space models," in
Proc. 19th Intl. Conf. Artif. Intell. Stat. (AISTATS 2016) , pp. 213–221, 2016.[19] S. Hochreiter and J. Schmidhuber, “Long short-term memory,”
Neural Comput., vol. 9, no. 8, pp. 1735-1780, 1997. [20] H. Wu, J.-H. Prinz, and F. Noé, "Projected metastable Markov processes and their estimation with observable operator models,"
J. Chem. Phys. , vol. 143, no. 14, p. 144101, 2015.[21] M. Shirts and V. S. Pande, “Screen savers of the world unite,”
Science , vol. 290, pp. 1903–1904, 2000.[22] T.-K. Huang and J. Schneider, “Spectral learning of hidden Markov models from dynamic and static data,”in
Proc. 30th Intl. Conf. on Mach. Learn. (ICML 2013), pp. 630-638, 2013. [23] M. K. Cowles and B. P. Carlin, "Markov chain Monte Carlo convergence diagnostics: a comparative review,"
J. Am. Stat. Assoc. , vol. 91, no. 434, pp. 883–904, 1996.[24] N. Jiang, A. Kulesza, and S. Singh, “Improving predictive state representations via gradient descent,” in
Proc. 30th AAAI Conf. Artif. Intell. (AAAI 2016), 2016. [25] H. Jaeger, "Modeling and learning continuous-valued stochastic processes with OOMs," Tech. Rep. GMD-102, German National Research Center for Information Technology (GMD), 2001. [26] B. Boots, S. M. Siddiqi, G. Gordon, and A. Smola, "Hilbert space embeddings of hidden Markov models," in
Proc. 27th Intl. Conf. on Mach. Learn. (ICML 2010) , 2010.[27] M. Rosencrantz, G. Gordon, and S. Thrun, “Learning low dimensional predictive representations,” in
Proc. 21st Intl. Conf. on Mach. Learn. (ICML 2004), pp. 88-95, ACM, 2004. [28] B. Boots,
Spectral Approaches to Learning Predictive Representations . PhD thesis, Carnegie MellonUniversity, 2012.[29] H. Jaeger, M. Zhao, and A. Kolling, “Efficient estimation of OOMs,” in
Adv. Neural Inf. Process. Syst. 18 (NIPS 2005), pp. 555-562, 2005. [30] G. Perez-Hernandez, F. Paul, T. Giorgino, G. De Fabritiis, and F. Noé, "Identification of slow molecular order parameters for Markov model construction,"
J. Chem. Phys., vol. 139, no. 1, p. 015102, 2013.

Supplementary Information

A Proofs
A.1 Proof of Theorem 1
For convenience, here we define
$$\omega^*(T) = \frac{1}{T-2L}\,\omega\sum_{t=1}^{T-2L}\Xi(\mathcal{O})^{t-1},\qquad G_\sigma = \sum_{z_{1:L}}\phi_2(z_{1:L})\,\sigma^\top\,\Xi(z_{1:L})^\top \qquad (\mathrm{A.1})$$

Part (1)
We first show the theorem in the case of $T = T_0$ and $I\to\infty$. Let
$$G_\omega = \sum_{z_{1:L}}\phi_1(z_{1:L})\,\omega^*(T_0)\,\Xi(z_{1:L}) \qquad (\mathrm{A.2})$$
Since $I\to\infty$, we have
$$\hat{\bar\phi}_1 \xrightarrow{p} \mathbb{E}\big[\hat{\bar\phi}_1\big] = G_\omega\sigma,\qquad \hat{\bar\phi}_2^\top \xrightarrow{p} \mathbb{E}\big[\hat{\bar\phi}_2^\top\big] = \omega^*(T_0)\,G_\sigma^\top$$
$$\hat C_{1,2} \xrightarrow{p} \mathbb{E}\big[\hat C_{1,2}\big] = G_\omega G_\sigma^\top,\qquad \hat C_{1,3}(x) \xrightarrow{p} \mathbb{E}\big[\hat C_{1,3}(x)\big] = G_\omega\,\Xi(x)\,G_\sigma^\top$$
According to Assumption 3 and the Eckart-Young-Mirsky theorem, we can conclude that
$$\mathrm{rank}(G_\omega) = \mathrm{rank}(G_\sigma) = \mathrm{rank}\big(\hat C_{1,2}\big) = m \quad\text{and}\quad \hat C_{1,2}^{\mathrm{trun}} = U\Sigma V^\top \xrightarrow{p} G_\omega G_\sigma^\top$$
By using the SVD of $G_\omega G_\sigma^\top$, i.e. $G_\omega G_\sigma^\top = \tilde U\tilde\Sigma\tilde V^\top$ with $\mathrm{rank}(\tilde U) = \mathrm{rank}(\tilde V) = \mathrm{rank}(\tilde\Sigma) = m$, we can construct an OOM $\mathcal{M}' = (\omega', \{\Xi'(x)\}_{x\in\mathcal{O}}, \sigma')$ with
$$\omega' = \hat\omega\,\big(G_\sigma^\top V\big)^{-1} \qquad (\mathrm{A.3})$$
$$\Xi'(x) = \big(G_\sigma^\top V\big)\,\hat\Xi(x)\,\big(G_\sigma^\top V\big)^{-1} \qquad (\mathrm{A.4})$$
$$\sigma' = \big(G_\sigma^\top V\big)\,\hat\sigma \qquad (\mathrm{A.5})$$
which is obviously equivalent to $\hat{\mathcal{M}}$. We can obtain from $\mathrm{rank}(U\Sigma V^\top) = \mathrm{rank}(G_\omega G_\sigma^\top) = m$ that
$$\big(U\Sigma V^\top\big)^+ = V\Sigma^{-1}U^\top \xrightarrow{p} \big(G_\omega G_\sigma^\top\big)^+$$
where $A^+$ denotes the Moore-Penrose pseudoinverse of $A$, so
$$\omega' = \hat{\bar\phi}_2^\top V\,\big(G_\sigma^\top V\big)^{-1} \xrightarrow{p} \omega^*(T_0)$$
$$\Xi'(x) = \big(G_\sigma^\top V\big)\,\Sigma^{-1}U^\top\,\hat C_{1,3}(x)\,V\,\big(G_\sigma^\top V\big)^{-1} \xrightarrow{p} G_\sigma^\top V\Sigma^{-1}U^\top G_\omega\,\Xi(x) \xrightarrow{p} G_\sigma^\top\big(G_\omega G_\sigma^\top\big)^+ G_\omega\,\Xi(x) = G_\omega^+ G_\omega G_\sigma^\top\big(G_\omega G_\sigma^\top\big)^+ G_\omega G_\sigma^\top G_\sigma^{\top+}\,\Xi(x) = \Xi(x)$$
$$\sigma' = G_\sigma^\top V\Sigma^{-1}U^\top\,\hat{\bar\phi}_1 \xrightarrow{p} \sigma$$
Note that $\omega' \xrightarrow{p} \omega$ does not hold in general.

Part (2)
We now consider the case of $I = I_0$ and $T \to \infty$. According to Assumption 2, the limit
\[
\hat C_{1,2} \xrightarrow{p} \mathbb{E}_\infty\big[\phi_1(x_{t-L:t-1})\,\phi_2(x_{t:t+L-1})^\top\big]
= \lim_{k\to\infty}\sum_{z_L}\phi_1(z_L)\,\omega\,\Xi(\mathcal{O})^k\,\Xi(z_L)\,G_\sigma^\top
\]
exists. Then
\begin{align*}
\hat{\bar\phi}_1 &\xrightarrow{p} \mathbb{E}_\infty[\phi_1(x_{t-L:t-1})] = G_\omega\sigma\\
\hat{\bar\phi}_2^\top &\xrightarrow{p} \mathbb{E}_\infty\big[\phi_2(x_{t:t+L-1})^\top\big] = \lim_{k\to\infty}\omega\,\Xi(\mathcal{O})^k G_\sigma^\top\\
\hat C_{1,2} &\xrightarrow{p} \mathbb{E}_\infty\big[\hat C_{1,2}\big] = G_\omega G_\sigma^\top\\
\hat C_{1,3}(x) &\xrightarrow{p} \mathbb{E}_\infty\big[\hat C_{1,3}(x)\big] = G_\omega\,\Xi(x)\,G_\sigma^\top
\end{align*}
with
\[
G_\omega = \lim_{k\to\infty}\sum_{z_L}\phi_1(z_L)\,\omega\,\Xi(\mathcal{O})^k\,\Xi(z_L). \tag{A.6}
\]
The remaining part of the proof is omitted because it is the same as in Part (1).

A.2 Asymptotic correctness of nonequilibrium learning with different initial states
If the $i$-th observation trajectory is generated by the OOM $M_i = (\omega_i, \{\Xi(x)\}_{x\in\mathcal{O}}, \sigma)$ for $i = 1,\ldots,I$, and
\[
\omega^{**} = \begin{cases}
\frac{1}{I}\sum_{i=1}^{I}\omega_i, & \text{for } T\to\infty,\\[2pt]
\operatorname{plim}_{I\to\infty}\frac{1}{I}\sum_{i=1}^{I}\omega_i, & \text{for } I\to\infty,
\end{cases}
\]
the asymptotic correctness can also be shown as in Appendix A.1 by setting
\[
G_\omega = \sum_{z_L}\phi_1(z_L)\,\omega^*(T)\,\Xi(z_L)
\quad\text{with}\quad
\omega^*(T) = \frac{1}{T-L}\,\omega^{**}\sum_{t=1}^{T-L}\Xi(\mathcal{O})^{t-1}
\]
for $I\to\infty$, and
\[
G_\omega = \lim_{k\to\infty}\sum_{z_L}\phi_1(z_L)\,\omega^{**}\,\Xi(\mathcal{O})^k\,\Xi(z_L)
\]
for $T\to\infty$.

A.3 Proof of Theorem 2

Part (1)

We first show that there is an OOM $M_{\mathrm{eq}} = (\omega_{\mathrm{eq}}, \{\Xi(x)\}_{x\in\mathcal{O}}, \sigma)$ which can describe the equilibrium dynamics of $\{x_t\}$. In the case of $T = T_0$ and $I\to\infty$, we can obtain from Assumptions 2 and 3 that
\begin{align*}
\lim_{k\to\infty}G_\omega\,\Xi(\mathcal{O})^k\,G_\sigma^\top
&= \lim_{k\to\infty}\frac{1}{T-L}\sum_{t=0}^{T-L-1}\mathbb{E}\big[\phi_1(x_{t+1:t+L})\,\phi_2(x_{t+L+k+1:t+2L+k})^\top\big]\\
&= \left(\frac{1}{T-L}\sum_{t=0}^{T-L-1}\mathbb{E}[\phi_1(x_{t+1:t+L})]\right)\Big(\mathbb{E}_\infty\big[\phi_2(x_{t+1:t+L})^\top\big]\Big)\\
&= G_\omega\sigma\,\Big(\mathbb{E}_\infty\big[\phi_2(x_{t+1:t+L})^\top\big]\Big)\\
\Rightarrow\quad \lim_{k\to\infty}\Xi(\mathcal{O})^k &= \sigma\,\omega_{\mathrm{eq}} \tag{A.7}
\end{align*}
with
\[
\omega_{\mathrm{eq}} = \Big(\mathbb{E}_\infty\big[\phi_2(x_{t+1:t+L})^\top\big]\Big)\,G_\sigma^{+\top}, \tag{A.8}
\]
where $G_\omega$ and $G_\sigma$ are defined by (A.2) and (A.1).
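Relations (A.7) and (A.8) can be checked numerically on a small, exactly known OOM. In the following numpy sketch (all names illustrative), a two-state Markov chain with transition matrix $T$ is written as an OOM with $\Xi(x) = e_x e_x^\top T$ and $\sigma = (1,1)^\top$, so that $\Xi(\mathcal{O}) = T$ and $\omega_{\mathrm{eq}}$ is the stationary distribution; the last lines also evaluate, for comparison, the closed-form expression (A.11) derived below with the hat quantities replaced by exact ones:

```python
import numpy as np

# Two-state Markov chain as an OOM: Xi(x) = e_x e_x^T T, sigma = ones.
# Then Xi(O) = T, and lim_k Xi(O)^k is the rank-one matrix sigma * omega_eq.
T = np.array([[0.9, 0.1],
              [0.3, 0.7]])
E = np.eye(2)
Xi = [np.outer(E[x], E[x]) @ T for x in range(2)]
Xi_O = Xi[0] + Xi[1]                      # equals T
sigma = np.ones((2, 1))                   # column vector

# lim_k Xi(O)^k = sigma omega_eq, as in (A.7)
P_inf = np.linalg.matrix_power(Xi_O, 200)
omega_eq = P_inf[0]                       # any row of the rank-one limit

# closed-form expression (A.11), evaluated with exact quantities
sigma_p = np.linalg.pinv(sigma)           # sigma^+ (a 1 x 2 row vector)
Pproj = np.eye(2) - sigma @ sigma_p       # I - sigma sigma^+
A = Xi_O - np.eye(2)
w = (sigma_p - sigma_p @ A @ np.linalg.pinv(Pproj @ A) @ Pproj).ravel()
```

For this chain the stationary distribution is $(0.75, 0.25)$, and `w` coincides with `omega_eq`, satisfying $w\,\Xi(\mathcal{O}) = w$ and $w\sigma = 1$.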
Then
\[
\lim_{t\to\infty}\mathbb{P}(x_{t+1:t+l} = z_l)
= \omega\Big(\lim_{t\to\infty}\Xi(\mathcal{O})^t\Big)\Xi(z_l)\,\sigma
= \omega\,\sigma\,\omega_{\mathrm{eq}}\,\Xi(z_l)\,\sigma
= \omega_{\mathrm{eq}}\,\Xi(z_l)\,\sigma.
\]
In the case of $I = I_0$ and $T\to\infty$, because $\operatorname{rank}(G_\omega) = m$ for $G_\omega$ defined by (A.6), there is a sufficiently large but finite $T'$ so that $\operatorname{rank}(G'_\omega) = m$ with
\[
G'_\omega = \sum_{z_L}\phi_1(z_L)\,\omega\,\Xi(\mathcal{O})^{T'}\,\Xi(z_L).
\]
Considering
\begin{align*}
\lim_{k\to\infty}G'_\omega\,\Xi(\mathcal{O})^k\,G_\sigma^\top
&= \lim_{k\to\infty}\mathbb{E}\big[\phi_1(x_{T'+1:T'+L})\,\phi_2(x_{T'+L+k+1:T'+2L+k})^\top\big]\\
&= G'_\omega\sigma\,\Big(\mathbb{E}_\infty\big[\phi_2(x_{t+1:t+L})^\top\big]\Big)\\
\Rightarrow\quad \lim_{k\to\infty}\Xi(\mathcal{O})^k &= \sigma\,\omega_{\mathrm{eq}} \tag{A.9}
\end{align*}
with $\omega_{\mathrm{eq}}$ defined by (A.8), we can also conclude that $\lim_{t\to\infty}\mathbb{P}(x_{t+1:t+l} = z_l) = \omega_{\mathrm{eq}}\,\Xi(z_l)\,\sigma$. Note that in both cases $\omega_{\mathrm{eq}}$ satisfies
\[
\omega_{\mathrm{eq}}\lim_{k\to\infty}\Xi(\mathcal{O})^k = \omega_{\mathrm{eq}},
\qquad
\omega_{\mathrm{eq}}\,\Xi(\mathcal{O}) = \lim_{t\to\infty}\omega_{\mathrm{eq}}\,\Xi(\mathcal{O})^{t+1} = \omega_{\mathrm{eq}},
\]
\[
\omega_{\mathrm{eq}}\,\sigma = \omega_{\mathrm{eq}}\,\Xi(\mathcal{O})\,\sigma = \lim_{t\to\infty}\sum_{x\in\mathcal{O}}\mathbb{P}(x_t = x) = 1.
\]

Part (2)
In this part, we show that $w\,\Xi(\mathcal{O}) = w$, $w\sigma = 1$ has a unique solution $w = \omega_{\mathrm{eq}}$. According to Appendix A.1 and (A.7), (A.9), if $w\,\Xi(\mathcal{O}) = w$ and $w\sigma = 1$, we have
\[
w = \lim_{k\to\infty}w\,\Xi(\mathcal{O})^k = w\,\sigma\,\omega_{\mathrm{eq}} = \omega_{\mathrm{eq}}.
\]

Part (3)

We now show Theorem 2. The problem (16) is equivalent to
\begin{align*}
\min_{w'}\quad & E(w') = \big(w'\,\Xi'(\mathcal{O}) - w'\big)\big(G_\sigma^\top V\big)\big(G_\sigma^\top V\big)^\top\big(w'\,\Xi'(\mathcal{O}) - w'\big)^\top\\
\mathrm{s.t.}\quad & w'\sigma' = 1,
\end{align*}
where $\Xi'(\mathcal{O}) = \sum_{x\in\mathcal{O}}\Xi'(x)$, $\Xi'(x)$ and $\sigma'$ are given by (A.4) and (A.5), and $w'$ is related to $w$ by $w' = w\big(G_\sigma^\top V\big)^{-1}$. This problem can be further transformed into the unconstrained one
\[
\min_{w'}\quad E\Big(w'\big(I - \sigma'\sigma'^{+}\big) + \sigma'^{+}\Big) + \Big\|w'\big(I - \sigma'\sigma'^{+}\big) + \sigma'^{+} - w'\Big\|^2, \tag{A.10}
\]
where $w'\big(I - \sigma'\sigma'^{+}\big) + \sigma'^{+}$ is the projection of $w'$ onto the space $\{w' \mid w'\sigma' = 1\}$ and $I$ denotes the identity matrix of appropriate dimension. Considering that $\Xi'(x)\xrightarrow{p}\Xi(x)$, $\sigma'\xrightarrow{p}\sigma$,
\[
\big(G_\sigma^\top V\big)\big(G_\sigma^\top V\big)^\top
= G_\sigma^\top V\Sigma^{-1}U^\top\cdot U\Sigma V^\top G_\sigma
\xrightarrow{p} G_\sigma^\top\big(G_\omega G_\sigma^\top\big)^+ G_\omega G_\sigma^\top G_\sigma
= G_\sigma^\top G_\sigma,
\]
and the conclusion in Part (2), we can obtain that the optimal solution of (A.10) converges to $\omega_{\mathrm{eq}}$ in probability and $\hat\omega_{\mathrm{eq}}\xrightarrow{p}\omega_{\mathrm{eq}}\big(G_\sigma^\top V\big)$ according to Theorem 2.7 in [1], which yields the conclusion of Theorem 2.

Part (4)
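Before deriving the closed form, note that the projection map used below can be checked directly: for a nonzero column vector $\hat\sigma$ one has $\hat\sigma^{+} = \hat\sigma^\top/\|\hat\sigma\|^2$ and hence $\hat\sigma^{+}\hat\sigma = 1$, so that for any row vector $w''$

```latex
\big(w''(I-\hat\sigma\hat\sigma^{+})+\hat\sigma^{+}\big)\hat\sigma
  = w''\hat\sigma - w''\hat\sigma\,(\hat\sigma^{+}\hat\sigma) + \hat\sigma^{+}\hat\sigma
  = w''\hat\sigma - w''\hat\sigma + 1
  = 1 ,
```

i.e., the map takes every $w''$ into the constraint set $\{w \mid w\hat\sigma = 1\}$ of (16).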
We derive in this part the closed-form solution to (16). Since the projection of $w''$ onto the space $\{w'' \mid w''\hat\sigma = 1\}$ is $w''\big(I - \hat\sigma\hat\sigma^{+}\big) + \hat\sigma^{+}$, (16) can be equivalently transformed into
\[
\min_{w''}\quad\Big\|w''\big(I - \hat\sigma\hat\sigma^{+}\big)\big(\hat\Xi(\mathcal{O}) - I\big) + \hat\sigma^{+}\big(\hat\Xi(\mathcal{O}) - I\big)\Big\|^2.
\]
The solution to this problem is
\[
w^* = -\hat\sigma^{+}\big(\hat\Xi(\mathcal{O}) - I\big)\Big(\big(I - \hat\sigma\hat\sigma^{+}\big)\big(\hat\Xi(\mathcal{O}) - I\big)\Big)^+,
\]
which provides the optimal value of $\hat\omega_{\mathrm{eq}}$ as
\[
\hat\omega_{\mathrm{eq}} = w^*\big(I - \hat\sigma\hat\sigma^{+}\big) + \hat\sigma^{+}
= \hat\sigma^{+} - \hat\sigma^{+}\big(\hat\Xi(\mathcal{O}) - I\big)\Big(\big(I - \hat\sigma\hat\sigma^{+}\big)\big(\hat\Xi(\mathcal{O}) - I\big)\Big)^+\big(I - \hat\sigma\hat\sigma^{+}\big). \tag{A.11}
\]

A.4 Proof of Theorem 3
Here we only consider the consistency of the binless OOM as $I\to\infty$; the proof can easily be extended to the case of $T\to\infty$. In addition, we denote $\mathbb{E}_\infty[g(x_{t+1:t+r})]$ and $\mathbb{E}[g(x_{1:r})\mid\hat M_{\mathrm{eq}}]$ by $\mathbb{E}_\infty[g]$ and $\mathbb{E}_{\hat M_{\mathrm{eq}}}[g]$ for convenience of notation.

Part (1)
We first show that Theorem 3 holds for $g(x_{t+1:t+r}) = 1_{x_{t+1:t+r}\in\mathcal{B}_{i_1}\times\mathcal{B}_{i_2}\times\cdots\times\mathcal{B}_{i_r}}$, where $\mathcal{B}_1,\ldots,\mathcal{B}_K$ is a partition of $\mathcal{O}$ and $(i_1,\ldots,i_r)\in\{1,\ldots,K\}^r$. In this case, we can construct a discrete OOM with observation space $\{\mathcal{B}_1,\ldots,\mathcal{B}_K\}$ by the nonequilibrium learning algorithm, which provides the same estimate of $\mathbb{E}_\infty[g(x_{t+1:t+r})]$ as $\hat M_{\mathrm{eq}}$. Therefore, we can show $\mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\xrightarrow{p}\mathbb{E}_\infty[g]$ by a proof similar to that of Theorem 2.

Part (2)
We now consider the case where $g$ is a continuous function. According to the Heine--Cantor theorem, $g$ is also uniformly continuous. Then, for an arbitrary $\epsilon > 0$, we can construct a simple function
\[
\hat g(x_{t+1:t+r}) = \sum_{i_1,\ldots,i_r} c_{i_1 i_2\ldots i_r}\,1_{x_{t+1:t+r}\in\mathcal{B}_{i_1}\times\cdots\times\mathcal{B}_{i_r}}
\]
so that
\[
|g(z_r) - \hat g(z_r)| \le \epsilon, \qquad \forall z_r\in\mathcal{O}^r,
\]
where $\{\mathcal{B}_1,\ldots,\mathcal{B}_K\}$ is a partition of $\mathcal{O}$. Then we have
\[
|\mathbb{E}_\infty[g] - \mathbb{E}_\infty[\hat g]| \le \mathbb{E}_\infty[|g - \hat g|] \le \epsilon
\]
and $\big|\mathbb{E}_\infty[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g]\big| \xrightarrow{p} 0$ as $I\to\infty$ according to the conclusion of Part (1). It can be seen from the boundedness of the feature functions that there exists a constant $\xi_1$ such that
\[
1_{\max_{x\in\mathcal{X}}\|\hat W_x\| < \xi_1/|\mathcal{X}|} \xrightarrow{p} 1. \tag{A.12}
\]
Under the condition that $\max_{x\in\mathcal{X}}\|\hat W_x\| < \xi_1/|\mathcal{X}|$, we have
\begin{align*}
\big|\mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big|
&= \left|\hat\omega_{\mathrm{eq}}\left(\sum_{z_r\in\mathcal{X}^r}\big(\hat g(z_r) - g(z_r)\big)\,\hat W_{z_1}\cdots\hat W_{z_r}\right)\hat\sigma\right|\\
&\le \|\hat\omega_{\mathrm{eq}}\|\,\|\hat\sigma\|\left(\sum_{z_r\in\mathcal{X}^r}\frac{\xi_1^r\,\epsilon}{|\mathcal{X}|^r}\right)
= \|\hat\omega_{\mathrm{eq}}\|\,\|\hat\sigma\|\,\xi_1^r\,\epsilon.
\end{align*}
In addition, considering that we can show as in Appendix A.1 that
\[
\hat\omega_{\mathrm{eq}} \xrightarrow{p} \omega_{\mathrm{eq}}\,G_\sigma^\top V,
\qquad
\hat\sigma \xrightarrow{p} \big(G_\sigma^\top V\big)^{-1}\sigma,
\]
we can obtain
\[
1_{\|\hat\omega_{\mathrm{eq}}\|\,\|\hat\sigma\| \le \xi_2} \xrightarrow{p} 1 \tag{A.13}
\]
and
\[
\big|\mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big| \le \xi_2\,\xi_1^r\,\epsilon
\quad\text{with probability tending to }1,
\]
where $\xi_2$ is a constant larger than the limit of $\|\hat\omega_{\mathrm{eq}}\|\cdot\|\hat\sigma\|$. Based on the above analysis and the fact that
\begin{align*}
\big|\mathbb{E}_\infty[g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big|
&= \big|\mathbb{E}_\infty[g] - \mathbb{E}_\infty[\hat g] + \mathbb{E}_\infty[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g] + \mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big|\\
&\le |\mathbb{E}_\infty[g] - \mathbb{E}_\infty[\hat g]| + \big|\mathbb{E}_\infty[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g]\big| + \big|\mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big|,
\end{align*}
we can get
\begin{align*}
&\Pr\Big(\big|\mathbb{E}_\infty[g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big| \le \big(\xi_2\xi_1^r + 2\big)\epsilon\Big)\\
&\quad\ge \Pr\Big(|\mathbb{E}_\infty[g] - \mathbb{E}_\infty[\hat g]| \le \epsilon,\;
\big|\mathbb{E}_\infty[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g]\big| \le \epsilon,\;
\big|\mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big| \le \xi_2\xi_1^r\epsilon\Big) \to 1.
\end{align*}
Because this holds for all $\epsilon > 0$, we can conclude that $\mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\xrightarrow{p}\mathbb{E}_\infty[g]$.

Part (3)
In this part, we prove the conclusion of the theorem in the case where $g$ is a Borel measurable function, bounded with $|g(z_r)| < \xi_g$ for all $z_r\in\mathcal{O}^r$, and there exist constants $\bar\xi$ and $\xi_0$ so that $\|\Xi(x)\| \le \bar\xi$ and $\lim_{t\to\infty}\mathbb{P}(x_{t+1:t+r} = z_r) \ge \xi_0$ for all $x\in\mathcal{O}$ and $z_r\in\mathcal{O}^r$. According to Theorem 2.2 in [2], for an arbitrary $\epsilon > 0$ there is a continuous function $\hat g'$ satisfying $\mathbb{E}_\infty[1_{x_{t+1:t+r}\in K_\epsilon(\hat g')}] < \epsilon$, where $K_\epsilon(\hat g') = \{z_r \mid z_r\in\mathcal{O}^r,\ |\hat g'(z_r) - g(z_r)| > \epsilon\}$. Define
\[
\hat g(z_r) = \begin{cases}
\hat g'(z_r), & |\hat g'(z_r)| \le \xi_g,\\
-\xi_g, & \hat g'(z_r) < -\xi_g,\\
\xi_g, & \hat g'(z_r) > \xi_g.
\end{cases}
\]
It can be seen that $\hat g$ is a continuous function which also satisfies $\mathbb{E}_\infty[1_{x_{t+1:t+r}\in K_\epsilon(\hat g)}] < \epsilon$ and is bounded with $|\hat g(z_r)| \le \xi_g$. So the difference between $\mathbb{E}_\infty[g]$ and $\mathbb{E}_\infty[\hat g]$ satisfies
\begin{align*}
|\mathbb{E}_\infty[g] - \mathbb{E}_\infty[\hat g]|
&\le \mathbb{E}_\infty[|g(x_{t+1:t+r}) - \hat g(x_{t+1:t+r})|]\\
&= \mathbb{E}_\infty\big[1_{x_{t+1:t+r}\in K_\epsilon(\hat g)}\big]\,\mathbb{E}_\infty[|g - \hat g| \mid x_{t+1:t+r}\in K_\epsilon(\hat g)]\\
&\quad + \mathbb{E}_\infty\big[1_{x_{t+1:t+r}\notin K_\epsilon(\hat g)}\big]\,\mathbb{E}_\infty[|g - \hat g| \mid x_{t+1:t+r}\notin K_\epsilon(\hat g)]\\
&\le \epsilon\cdot 2\xi_g + \epsilon = (2\xi_g + 1)\,\epsilon.
\end{align*}
For the difference between $\mathbb{E}_\infty[\hat g]$ and $\mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g]$, we can obtain from the above that $\big|\mathbb{E}_\infty[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g]\big|\xrightarrow{p}0$ as $I\to\infty$ by considering that $\hat g$ is continuous, which implies that there is an $I_1$ such that
\[
\Pr\Big(\big|\mathbb{E}_\infty[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g]\big| > \epsilon\Big) < \epsilon, \qquad \forall I > I_1.
\]
Next, let us consider the value of $\big|\mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big|$. Note that
\[
\big|\mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big|
\le \|\hat\omega_{\mathrm{eq}}\|\,\|\hat\sigma\|\left\|\sum_{z_r\in\mathcal{X}^r}\big(\hat g(z_r) - g(z_r)\big)\,\hat W_{z_1}\cdots\hat W_{z_r}\right\|
< \frac{\xi_2\,\xi_1^r}{|\mathcal{X}|^r}\sum_{z_r\in\mathcal{X}^r}|\hat g(z_r) - g(z_r)|
\]
under the condition that $\max_{x\in\mathcal{X}}\|\hat W_x\| < \xi_1/|\mathcal{X}|$ and $\|\hat\omega_{\mathrm{eq}}\|\,\|\hat\sigma\| \le \xi_2$.
Therefore, there exists an $I_2$ such that
\[
\Pr\left(\big|\mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big| \ge \frac{\xi_2\,\xi_1^r}{|\mathcal{X}|^r}\sum_{z_r\in\mathcal{X}^r}|\hat g(z_r) - g(z_r)|\right) < \epsilon, \qquad \forall I > I_2 \tag{A.14}
\]
due to (A.12) and (A.13). Let $x'_r$ denote a random sample taken uniformly from $\mathcal{X}^r$. We can obtain that
\[
\mathbb{P}(x'_r) = \mathbb{P}(x'_1)\cdots\mathbb{P}(x'_r) \le \big(\|\omega\|\,\|\sigma\|\,\xi_{\mathcal{O}}\,\bar\xi\big)^r,
\]
where $\xi_{\mathcal{O}} \ge \|\Xi(\mathcal{O})^k\|$ for any $k \ge 0$. Note that $\xi_{\mathcal{O}} < \infty$ because we can show the existence of the limit of $\{\|\Xi(\mathcal{O})\|, \|\Xi(\mathcal{O})^2\|, \ldots\}$ by steps similar to those in Appendix A.3. Thus
\begin{align*}
\mathbb{E}\left[\frac{1}{|\mathcal{X}|^r}\sum_{z_r\in\mathcal{X}^r}|\hat g(z_r) - g(z_r)|\right]
&\le \mathbb{E}\big[\mathbb{E}[|\hat g(x'_r) - g(x'_r)| \mid \mathcal{X}]\big]
= \mathbb{E}[|\hat g(x'_r) - g(x'_r)|]\\
&= \mathbb{E}\big[1_{x'_r\in K_\epsilon(\hat g)}\big]\,\mathbb{E}[|\hat g(x'_r) - g(x'_r)| \mid x'_r\in K_\epsilon(\hat g)]\\
&\quad + \mathbb{E}\big[1_{x'_r\notin K_\epsilon(\hat g)}\big]\,\mathbb{E}[|\hat g(x'_r) - g(x'_r)| \mid x'_r\notin K_\epsilon(\hat g)]\\
&\le \xi_\mu\epsilon\cdot 2\xi_g + \epsilon = (2\xi_g\xi_\mu + 1)\,\epsilon,
\end{align*}
where $\xi_\mu = \big(\|\omega\|\,\|\sigma\|\,\xi_{\mathcal{O}}\,\bar\xi\big)^r/\xi_0$. By Markov's inequality, we have
\[
\Pr\left[\frac{1}{|\mathcal{X}|^r}\sum_{z_r\in\mathcal{X}^r}|\hat g(z_r) - g(z_r)| \ge \sqrt{\epsilon}\right] \le (2\xi_g\xi_\mu + 1)\sqrt{\epsilon}. \tag{A.15}
\]
Combining (A.14) and (A.15) leads to
\begin{align*}
\Pr\Big(\big|\mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big| \ge \xi_2\xi_1^r\sqrt{\epsilon}\Big)
&\le \Pr\left(\big|\mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big| \ge \frac{\xi_2\xi_1^r}{|\mathcal{X}|^r}\sum_{z_r\in\mathcal{X}^r}|\hat g(z_r) - g(z_r)|\right)\\
&\quad + \Pr\left(\frac{1}{|\mathcal{X}|^r}\sum_{z_r\in\mathcal{X}^r}|\hat g(z_r) - g(z_r)| \ge \sqrt{\epsilon}\right)\\
&\le \epsilon + (2\xi_g\xi_\mu + 1)\sqrt{\epsilon}
\end{align*}
for all $I > I_2$. From all the above, we have
\begin{align*}
\Pr\Big(\big|\mathbb{E}_\infty[g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big| \le (2\xi_g + 2)\epsilon + \xi_2\xi_1^r\sqrt{\epsilon}\Big)
&\ge \Pr\Big(\big|\mathbb{E}_\infty[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g]\big| \le \epsilon,\ \big|\mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big| \le \xi_2\xi_1^r\sqrt{\epsilon}\Big)\\
&\ge 1 - \Pr\Big(\big|\mathbb{E}_\infty[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g]\big| > \epsilon\Big) - \Pr\Big(\big|\mathbb{E}_{\hat M_{\mathrm{eq}}}[\hat g] - \mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\big| > \xi_2\xi_1^r\sqrt{\epsilon}\Big)\\
&\ge 1 - 2\epsilon - (2\xi_g\xi_\mu + 1)\sqrt{\epsilon}
\end{align*}
for all $I > \max\{I_1, I_2\}$, which yields $\mathbb{E}_{\hat M_{\mathrm{eq}}}[g]\xrightarrow{p}\mathbb{E}_\infty[g]$ due to the arbitrariness of $\epsilon$.

B Settings in applications
B.1 Models
The one-dimensional diffusion processes in Section 5 are driven by Brownian dynamics with potential
\[
V(x) = \frac{\sum_{i=1}^{5}\big(|x - c_i| + 0.1\big)^{-1}u_i}{\sum_{i=1}^{5}\big(|x - c_i| + 0.1\big)^{-1}},
\]
where $c = (c_1,\ldots,c_5)$ and $u = (u_1,\ldots,u_5)$ collect the centers and levels of the wells. For the two-dimensional process, $\beta = 2$ and
\[
V(x) = -\log\left(\sum_{i=1}^{3}p_i\,\mathcal{N}(x\mid\mu_i,\Sigma_i)\right)
\]
with mixture weights $p = (p_1, p_2, p_3)$ and Gaussian parameters $(\mu_i, \Sigma_i)$. The simulation details of alanine dipeptide are given in [3]. B.2 Algorithms
The parameters of discrete spectral learning are chosen as follows: $L = 3$, $m = 10$, and $\phi_1 = \phi_2$ are indicator functions of all $|\mathcal{O}|^L$ observation subsequences of length $L$. The parameters of binless spectral learning are almost the same as the discrete ones, except that $\phi_1 = \phi_2$ are Gaussian activation functions, with random weights, of functional link neural networks with $D_1 = D_2 = 100$. The number of hidden states of the HMMs is 10. For continuous data, we partition the state space into discrete bins by $k$-means clustering and then learn HMMs by the EM algorithm, using the HMM package in PyEMMA [4]. All observation samples within the same bin are assumed to be independent for quantitative analysis.

References

[1] W. K. Newey and D. McFadden, "Large sample estimation and hypothesis testing," Handbook of Econometrics, vol. 4, pp. 2111-2245, 1994.
[2] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Netw., vol. 2, no. 5, pp. 359-366, 1989.
[3] B. Trendelkamp-Schroer and F. Noé, "Efficient estimation of rare-event kinetics," Phys. Rev. X, vol. 6, p. 011009, 2016.
[4] M. K. Scherer, B. Trendelkamp-Schroer, F. Paul, G. Pérez-Hernández, M. Hoffmann, N. Plattner, C. Wehmeyer, J.-H. Prinz, and F. Noé, "PyEMMA 2: A software package for estimation, validation, and analysis of Markov models," J. Chem. Theory Comput., vol. 11, no. 11, pp. 5525-5542, 2015.