Approximate Bayes learning of stochastic differential equations
Philipp Batz,∗ Andreas Ruttor,† and Manfred Opper‡
TU Berlin, Fakultät IV – MAR 4-2, Marchstr. 23, 10587 Berlin, Germany
We introduce a nonparametric approach for estimating drift and diffusion functions in systems of stochastic differential equations from observations of the state vector. Gaussian processes are used as flexible models for these functions, and estimates are calculated directly from dense data sets using Gaussian process regression. We also develop an approximate expectation maximization algorithm to deal with the unobserved, latent dynamics between sparse observations. The posterior over states is approximated by a piecewise linearized process of the Ornstein-Uhlenbeck type, and the maximum a posteriori estimation of the drift is facilitated by a sparse Gaussian process approximation.
I. INTRODUCTION
Dynamical systems in the physical world evolve in continuous time, and often the (noisy) dynamics is described naturally in terms of (stochastic) differential equations [1]. However, due to missing information and/or the complexity of a system, it may be difficult to derive such a model from first principles. Instead, the goal often is to fit it to observations of the state at discrete points in time [2]. So far most inference approaches for these systems have dealt with the estimation of parameters contained in the drift function (e.g. [3] using a generalized linear model of locally linear forces or [4] using a Markov chain Monte Carlo sampler), which governs the deterministic part of the microscopic time evolution. Assumptions for the stochastic part were often simple: additive noise with the diffusion constant as the only parameter to estimate. But as both drift and diffusion can be nonlinear functions of the state vector, a nonparametric estimation would be a natural generalization when a large number of data points is available. Previous nonparametric approaches were based on solving the adjoint Fokker-Planck equation [5] and on kernel estimators [6], and are effectively restricted to one-dimensional models.

An alternative would be a Bayesian nonparametric approach, where prior knowledge on the unknown functions, such as smoothness, variability, or periodicity, can be encoded in a probability distribution. A recent result by [7, 8] presented an important step in this direction. The authors have shown that Gaussian processes (GPs) provide a natural family of prior probability measures over drift functions. If a path of the stochastic dynamics is observed densely, the posterior process over the drift is also a GP. Unfortunately, this simplicity is lost when observations are not dense, but separated by larger time intervals.
In [7] the case of sparse observations has been treated by a Monte Carlo approach, which alternates between sampling complete diffusion paths of the stochastic differential equation (SDE) and sampling from the GP for the drift given a path. A nontrivial problem is the sampling from SDE paths conditioned on observations. A second problem stems from the matrix inversions required by the GP predictions. For a densely sampled hidden path these matrices become large, which leads to a strong increase in computational complexity. It was shown in [7] for the case of univariate SDEs that this numerical problem can be circumvented if one chooses a GP prior where the inverse of the covariance operator is specified as a differential operator. In this case efficient predictions are possible in terms of solutions of ordinary differential equations. Recently [9] introduced a nonparametric method, which models the drift as a linear combination of variably many basis functions and uses reversible-jump Markov chain Monte Carlo to sample from the posterior distribution. However, both [9] and [7] are restricted to one-dimensional SDEs. For special cases, where the drift of the system can be expressed as the gradient of a potential, [10] uses the relationship between stationary density and potential in order to efficiently learn the drift based on the empirical density of the SDEs.

In this paper, we develop an alternative approximate method for Bayesian estimation of SDEs based on GPs. The method is faster than the sampling approach and can be applied to GPs with arbitrary covariance kernels and also to multivariate SDEs. Also, our method is able to handle non-equilibrium models. In case of dense observations the framework of GP regression is used to estimate both drift and diffusion in a nonparametric way.
For sparse observations, we use an approximate expectation maximization (EM) [11] algorithm, which extends our approach introduced in the conference publication [12]. The EM algorithm cycles between the computation of expectations over SDE paths, which are approximated by those of a locally fitted linear model, and the computation of the maximum posterior GP prediction of the drift. In addition, the problem of the continuum of function values occurring in expectations over the hidden path is solved by a sparse GP approximation.

The paper is organized as follows. Stochastic differential equations are introduced in section II and Gaussian processes in section III. Then section IV explains GP based inference for completely observed paths and shows results on dense data sets. As large data sets slow down standard GP inference considerably, section V reviews an efficient sparse GP method. In section VI our approximate EM algorithm is derived and its performance is demonstrated on a variety of SDEs. Section VII presents a discussion and concludes with an outline of possible extensions to the method.

II. STOCHASTIC DIFFERENTIAL EQUATIONS AND LIKELIHOODS FOR DENSE OBSERVATIONS
We consider diffusion processes given by a stochastic differential equation (SDE) written in Ito form as

dX_t = f(X_t) dt + D^{1/2}(X_t) dW_t,  (1)

where the vector function f(x) = (f_1(x), ..., f_d(x)) defines the deterministic drift depending on the current state X_t ∈ R^d, W_t denotes a Wiener process, which models white noise, and D(x) is the d × d diffusion matrix.

Suppose we observe a path X_T of the process over a time interval [0, T]. Our goal is to estimate the drift function f(x) based on the information contained in X_T. A well known statistical approach to the estimation of unknown model parameters is the method of maximum likelihood [2]. This would maximize the probability of the observed path with respect to f. To derive an expression for such a path probability, we use the Euler time discretization of the SDE [13] given by

X_{t+Δt} − X_t = f(X_t) Δt + D^{1/2}(X_t) √Δt ε_t,  (2)

where ε_t ∼ N(0, I) is a sequence of i.i.d. Gaussian noise vectors and Δt is the discretization time step. We will later take Δt → 0 when we compute explicit results for estimators. Since the short time transition probabilities of the process are Gaussian, the probability density for the discretized path can be written as the product

p(X_T | f) = p_0(X_T) L(X_T | f),  (3)

where

p_0(X_T) ∝ exp[ −(1/(2Δt)) Σ_t ||X_{t+Δt} − X_t||² ]  (4)

is the measure over paths without drift, and

L(X_T | f) = exp[ −Σ_t ( (1/2) ||f(X_t)||² Δt − (f(X_t), X_{t+Δt} − X_t) ) ]  (5)

is the relevant term for estimating the function f from the observations of the path. To avoid cluttered notation, we have introduced the inner product (u, v) := u^T D^{−1} v and the corresponding squared norm ||u||² := u^T D^{−1} u. The estimation of f using the method of maximum likelihood can be motivated by the following heuristics: Consider the case of a very large observation time T. In this limit we may write

−(1/T) ln L(X_T | f) = (1/(2T)) Σ_t ( ||f(X_t)||² Δt − 2 (f(X_t), X_{t+Δt} − X_t) )
 ≃ (1/(2T)) ∫_0^T ( E[||f(X_t)||²] − 2 E[(f(X_t), f*(X_t))] ) dt
 = (1/2) ∫ ||f(x)||² p(x) dx − ∫ (f(x), f*(x)) p(x) dx,  (6)

where we have taken the limit Δt → 0. The expectations are defined with respect to the true (but unknown) process from which the data points are generated, and p(x) denotes its stationary density. The true drift is given by the conditional expectation

f*(x) = lim_{Δt→0} (1/Δt) E[ X_{t+Δt} − X_t | X_t = x ].  (7)

Obviously, a minimization of the last term in (6) would lead to the estimator f̂(x) = f*(x), which is the true drift, indicating that asymptotically, for a long sequence of data, we get a consistent estimate. Unfortunately, for finite sample time T, an unconstrained maximization of the likelihood (5) does not lead to sensible results [7]. One has to use a regularization approach which restricts the complexity of the drift function. The simplest possibility is to work with a parametric model, e.g. representing f by a polynomial and estimating its coefficients. However, in many cases it may not be clear in advance how many parameters such a model should have.

III. BAYESIAN ESTIMATION WITH GAUSSIAN PROCESSES
Another possibility for regularization is a nonparametric Bayesian approach which uses prior probability distributions P(f) over drift functions. With different choices of the prior, different statistical ensembles of typical drift functions can be selected. We denote probabilities over the drift f by upper case symbols in order to avoid confusion with path probabilities. We will also denote expectations over functions f by the symbol E_f. Our Bayes estimator will be based on the posterior distribution

P(f | X_T) ∝ P(f) L(X_T | f),  (8)

where the neglected constant of proportionality only contains terms which do not depend on f. To construct such a prior distribution, we note that the exponent in (5) contains the drift f at most quadratically. Hence a natural (conjugate) prior for the drift in this model is given by a Gaussian measure over functions, i.e. a Gaussian process (GP) [7]. Although a more general model is possible, we will restrict ourselves to the case where the GP priors over the components f_j(x), j = 1, ..., d of the drift factorize, and we also assume that we have a diagonal diffusion matrix D(x) = diag(D_1(x), ..., D_d(x)). In this case, the GP posteriors of the f_j(x) also factorize in the components j, and we can estimate drift components independently.

Gaussian processes have become highly popular in Bayesian statistics, especially for applications within the field of machine learning [14]. Such processes are completely defined by a mean function m(x) = E_f[f(x)] (which we will set to zero throughout the paper) and a kernel function defined as

K(x, x′) = E_f[f(x) f(x′)],  (9)

which specifies the correlation of function values at two arbitrary arguments x and x′. By the choice of the kernel K we can encode prior assumptions about typical realizations of such random functions.

In this paper we will apply Gaussian processes not only to drift estimation but also to the estimation of the diffusion D(x).
The application in the latter case cannot be entirely justified from a Bayesian probabilistic perspective, but rather from the point of view that Gaussian processes are known to provide flexible tools for nonparametric regression, even when the underlying probabilistic model is not fully correctly specified. We will give a heuristic derivation of the analytical results for predictions with Gaussian processes which is applicable to both drift and diffusion estimation. A more detailed formulation can be found in [14]. In the basic regression setting, we assume that we have a set of n input-output data points (x_i, y_i) for i = 1, ..., n, where the y_i are modelled as noisy function values f(x_i), i.e.

y_i = f(x_i) + ν_i,  (10)

where the noise values ν_i are taken to be independent Gaussian random variables with zero mean and (possibly different) variances σ_i². For drift estimation we take f(x) ≡ f_j(x) as an arbitrary component of the drift vector and set D(x) ≡ D_j(x). We then identify

y_i = (X_{t_i+Δt} − X_{t_i})/Δt,  (11)
σ_i² = D(x_{t_i})/Δt.  (12)

In this case, the assumption of Gaussian noise is indeed fulfilled. Using a GP prior over functions f, we try to filter out the noise from the observations and learn to predict the unobserved function f(x) at arbitrary input values x. For the drift estimation problem, this equals the conditional expectation

f(x) = E[X_{t+Δt} − X_t | X_t = x]/Δt  (13)

in the limit Δt →
0. We will discuss the diffusion estimation problem in the next section, but mention that the noise ν_i will then no longer be Gaussian. We will still assume that GP regression is able to estimate a conditional expectation of the type (10) in this case. The probabilistic model for regression (10) corresponds to a likelihood

p(y | f) ∝ exp[ −Σ_{i=1}^n (1/(2σ_i²)) (f(x_i) − y_i)² ].  (14)

It is easy to see that this likelihood agrees with (3) for the case of drift estimation. To compute the most likely function f in the Bayesian sense, we minimize the negative log-posterior functional given by

−ln[P(f) p(y | f)] ≃ (1/2) ∫∫ f(x) K^{−1}(x, x′) f(x′) dx dx′ + Σ_{j=1}^n (1/(2σ_j²)) (f(x_j) − y_j)².  (15)

Here K^{−1} is the formal inverse of the kernel operator. Setting the functional derivative

δ ln[P(f) L(X_T | f)] / δf(x) = 0  (16)

and applying the kernel operator K to the resulting equation, we get

f(x) = Σ_{j=1}^n ((y_j − f(x_j))/σ_j²) K(x, x_j).  (17)

Evaluating this equation at the observations x = x_i, we obtain

(y_i − f(x_i))/σ_i² = ((K + Σ)^{−1} y)_i.  (18)

Here K = (K(x_i, x_j))_{i,j=1}^n denotes the kernel matrix and Σ = diag(σ_1², ..., σ_n²) is a diagonal matrix composed of the noise variances at the data points. This yields the following expression (see [14]) for the GP estimator of the function f:

f̂(x) = k(x)^T (K + Σ)^{−1} y,  (19)

where k(x) = (K(x, x_1), ..., K(x, x_n))^T. Specializing to the estimation of the j-th drift component, we identify y_j = ((X^j_{t_i+Δt} − X^j_{t_i})/Δt)_{i=1}^n and Σ = (1/Δt) D_j, where D_j is the diagonal matrix composed of the diffusions D_j(x_i) for i = 1, ..., n, to get

f̂_j(x) = k_j(x)^T (K_j + (1/Δt) D_j)^{−1} y_j,  (20)

where k_j(x) = (K_j(x, x_1), ..., K_j(x, x_n))^T.
A similar approach leads to the Bayesian uncertainty at x,

D̂_{f_j}(x) = K_j(x, x) − k_j(x)^T (K_j + (1/Δt) D_j)^{−1} k_j(x),  (21)

which can be used to quantify the uncertainty of the prediction.

FIG. 1. Sample path with n = 6000 data points generated from a double well model with time distance Δt = 0.… .

A popular covariance kernel is the radial basis function (RBF) kernel

K_RBF(x, x′) = τ² exp( −||x − x′||² / (2 l²_RBF) ),  (22)

where the hyperparameters τ² and l_RBF denote the variance and the correlation length scale of the process. The RBF kernel assumes smooth, infinitely differentiable functions f(·). In some cases, the class of functional relationships in the data set is known beforehand, so that specialized kernel functions encoding this prior information can be applied. In our experiments, we use such kernels for the estimation of polynomial and periodic functions f(·). The corresponding kernels are the polynomial kernel of degree p,

K_Pol(x, x′) = (x^T x′)^p,  (23)

and the (one-dimensional) periodic kernel

K_Per(x, x′) = τ² exp( −(2/l²_Per) sin²((x − x′)/2) ).  (24)

For the latter kernel, the hyperparameters τ² and l_Per denote the variance and the correlation length scale of the process.

In our experiments we found that the choice of the kernel variance parameter τ² did not have a noticeable impact on the estimation results. Consequently, we fixed its value to τ² = 1. In the case of the length scale hyperparameter l, the user usually has relevant prior expert knowledge about the specific problem at hand and is able to determine its value a priori. Similarly, if one knows that the underlying problem is of polynomial form, one should be able to specify its order p or at least an upper bound for p. We found that this approach usually works well in practice.
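To make the estimator concrete, the following self-contained sketch implements the direct GP drift estimate (20) with an RBF kernel and checks it on a simulated Ornstein-Uhlenbeck path with true drift f(x) = −x. The function names and all numerical settings are our own illustration, not the paper's reference implementation.

```python
import numpy as np

def rbf_kernel(a, b, length=0.5):
    """RBF kernel (22) with tau^2 = 1 for one-dimensional inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def gp_drift(x_path, dt, D, x_star, length=0.5):
    """Direct GP drift estimator (20): k(x)^T (K + (D/dt) I)^{-1} y."""
    y = (x_path[1:] - x_path[:-1]) / dt        # y_i = (X_{t_i+dt} - X_{t_i}) / dt, eq. (11)
    x = x_path[:-1]
    K = rbf_kernel(x, x, length)
    alpha = np.linalg.solve(K + (D / dt) * np.eye(len(x)), y)
    return rbf_kernel(np.atleast_1d(x_star), x, length) @ alpha

# Toy check on an Ornstein-Uhlenbeck path dX = -X dt + dW (Euler scheme (2)).
rng = np.random.default_rng(0)
dt, n, D = 0.05, 2000, 1.0
x = np.empty(n + 1)
x[0] = 0.0
for i in range(n):
    x[i + 1] = x[i] - x[i] * dt + np.sqrt(D * dt) * rng.standard_normal()

f_hat = gp_drift(x, dt, D, np.array([-1.0, 0.0, 1.0]))  # close to (1, 0, -1) up to noise
```

For multivariate SDEs with diagonal diffusion, the same routine is simply applied to each component of the state vector independently, as noted in section III.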
We note, however, that in the case of dense data the kernel hyperparameters can also be automatically determined in a principled way (see section IV).

FIG. 2. (color online) Drift estimation for the double well model based on the direct GP, with the solid black line denoting the mean and the dashed red line the true drift function.

FIG. 3. (color online) Diffusion estimation for the double well model based on the direct GP. The dashed red line denotes the square root of the diffusion D(x)^{1/2} and the solid black line the estimator.

IV. ESTIMATION FOR DENSE OBSERVATIONS
Following the dense case of section III, we consider drift and diffusion estimation in cases where the time step Δt between observed data points is small. This approach will be referred to as direct Gaussian process (GP) estimation, with mean and variance given by (20) and (21), respectively. We will treat drift and diffusion estimation in turn and start with the latter. This order is motivated by the fact that the diffusion estimation is independent of the drift. Hence, if both drift and diffusion are unknown, one should first learn the diffusion and then incorporate the estimation results into the drift learning procedure.

A. Diffusion Estimation
We distinguish between two cases, namely models with constant and with state dependent diffusion. If the diffusion matrix D is known to be constant, i.e. it does not depend on the state, we will use a Bayesian maximum likelihood approach and optimize the so-called Bayes evidence, which equals the probability of the path p(X_T) (in its Euler discretization), with respect to the diffusion constants D = (D_1, ..., D_d). Again the probability factorizes in the components j = 1, ..., d. For component j of the process, the evidence is defined as the n-dimensional Gaussian integral

p(X^j_T) = ∫ p(X^j_T | f_j) p(f_j) df_j,  (25)

where f_j denotes the vector with components f_j(X_{t_i}) for i = 1, ..., n and p(f_j) = N(f_j | 0, K_j) is the prior Gaussian density induced by the GP prior over functions. Introducing, as before, the notation y^j_i = (X^j_{t_i+Δt} − X^j_{t_i})/Δt, we easily obtain from (25) the closed form expression

p(X^j_T) = N(y_j | 0, K_j + Σ_j)  (26)

with Σ_j = (D_j/Δt) I, where I denotes the identity matrix. For the optimization, one can use a standard routine, e.g. a quasi-Newton method. The evidence can also be used in the same way to learn kernel hyperparameters by optimizing with respect to the specific variables.

In the case of state dependent diffusions D(x), the evidence optimization becomes impractical, since we would have to jointly optimize over D(x_i) for all n observations. Instead, we use the well known representation [1] for an arbitrary component of the exact diffusion

D*(x) = lim_{Δt→0} (1/Δt) Var(X_{t+Δt} − X_t | X_t = x)
 = lim_{Δt→0} (1/Δt) ( E[(X_{t+Δt} − X_t)² | X_t = x] − (E[X_{t+Δt} − X_t | X_t = x])² )  (27)
 = lim_{Δt→0} (1/Δt) ( E[(X_{t+Δt} − X_t)² | X_t = x] − (Δt f*(x))² )
 = lim_{Δt→0} (1/Δt) E[(X_{t+Δt} − X_t)² | X_t = x].  (28)

In the third line, we use the fact that the second term on the right hand side equals the squared conditional drift (7). Then, by taking Δt out of the square (Δt f*(x))², we can easily see that this term vanishes in the limit Δt → 0. Hence, the conditional variance does not depend on the drift. For its computation, we use again GP regression, but now on the data set ((x_1, ỹ_1), ..., (x_n, ỹ_n)), where ỹ_i = (X_{t_i+Δt} − X_{t_i})²/Δt are proportional to the squared observations of the drift estimation problem.

Unfortunately, the data ỹ do not follow a Gaussian distribution, so interpreting the GP posterior as a Bayesian posterior would lead to a model mismatch. Trying to work with the exact likelihood would lead to intractable non-Gaussian integrals involving Gamma densities, which would have to be approximated. Moreover, the noise in the data ỹ_i depends itself on the diffusion D(x_i), and a proper Bayesian treatment would lead to a more complicated iterative estimation problem. However, the observations ỹ_i are obviously much smoother than the y_i. Hence, we expect that the following simpler heuristics gives good results for densely sampled paths. We regard the GP framework simply as a regression tool for function estimation, which in our case happens to be estimation of the diffusion function. The regression curve is given by the GP mean (19) with y substituted by ỹ. Note that this is conceptually different from the computation of the constant diffusion case above, where we matched the likelihood variances σ̂_j² to the diffusion estimators D̂_j of the process. Under the GP-as-a-regression-toolbox lens, the likelihood variance σ² becomes a nuisance parameter without direct interest to us. Still, we have to determine suitable variance values, as well as possibly length scale parameters in the case of an RBF kernel, which might not be readily available. Finding hyperparameters by optimizing the marginal distribution presupposes a Bayesian interpretation and so is not applicable in this context. Therefore we resort to a 2-fold cross-validation scheme.
This method randomly divides the observations into two subsets of equal size and learns a GP estimator on each of the subsets. Then the goodness of fit is determined by computing the mean squared error of each estimator on the data of the remaining subset.

B. Drift Estimation
Once we have diffusion values at the observations at our disposal, the estimation of the drift function becomes straightforward. All we have to do is to evaluate for each component j the diffusion at the observations D_j(x_i), which we then use as GP variance in the drift estimation. For the constant but unknown diffusion model, we insert the estimated value D̂_j into the diagonal of the matrix D_j; in the state dependent unknown diffusion model, we use the estimated values D̂_j(x_i) from the diffusion regression function described above. Then, running the direct GPs on the observations y_j leads to a drift estimation, which can once again be interpreted as a Bayesian posterior. However, we emphasize that we again regard the GP as a regression toolbox for computing an expectation function.

C. Experiments
Here we show the results for two experiments with unknown state dependent diffusion. First we look at synthetic data and then at a real world data set used in climate research. The synthetic data sets analyzed are generated using the Euler method from the corresponding SDE with grid size Δt = 0.… .

FIG. 4. (color online) The figure shows the estimated potentials of the ice core data both from a model with state dependent diffusion D(x) (solid black line) and with constant diffusion D (dotted red). For both models we use an RBF kernel with length scale l = 0.7. The corresponding diffusion estimators are shown in figure 5.
Double well model with unknown state dependent diffusion
In order to evaluate the direct GP method, we generated a sample of size n = 5000 with step size Δt = 0.… from

dX = 4(X − X³) dt + [max(4 − …, 0)]^{1/2} dW_t.  (29)

The direct GP was run with a polynomial kernel function of order p = 4. The estimations for drift and diffusion function are given in figures 2 and 3, respectively. In both cases, we see a good fit between estimator and true function.

Ice core model
As an example of a real world data set, we used the NGRIP ice core data (provided by the Niels Bohr Institute in Copenhagen, [16]), which provides an undisturbed ice core record containing climatic information stretching back into the last glacial. Specifically, this data set, as shown in figure 6, contains 4918 observations of the oxygen isotope concentration δ¹⁸O over a time period from the present to roughly 1.… · 10⁵ years into the past. Since there are generally less isotopes in ice formed under cold conditions, the isotope concentration can be regarded as an indicator of past temperatures.

Recent research [17] suggests to model the rapid paleoclimatic changes exhibited in the data set by a simple dynamical system with a drift function of order p = 3 as canonical model, which allows for bistability. This corresponds to a metastable state at higher temperatures close to marginal stability and a stable state at low values, which is consistent with other research on this data set linking a stable state of oxygen isotopes to a baseline temperature and a region at higher values corresponding to the occurrence of rapid temperature spikes.

FIG. 5. (color online) Diffusion function estimators of the ice core model for the state dependent (solid black line) and the constant diffusion model (dotted red line). The constant value D^{1/2} = 2.81 was found by optimization of the marginal likelihood. For the GP in the state dependent model we used an RBF kernel with length scale l = 2.71 and diffusion D = 0.… .

FIG. 6. (color online) Plot of the ice core data (as solid black line) with metastable states marked by dashed green lines. These four minima of the potential function were identified by the direct GP algorithm with state dependent diffusion. Axes: t [ky] versus δ¹⁸O.

FIG. 7. (color online) Plot of the ice core drift function corresponding to the above potential function in black, together with the 95% Bayes confidence bounds shaded in blue. One can see that the two inner metastable states are statistically significant, while the outer two are not.

For this particular data set, the consecutive observations are spaced Δt = 0.… apart. The underlying dynamics of the NGRIP data set is often modelled as a constant noise process in the literature [17]. Figure 5 shows that the estimated diffusion function changes significantly over the range of the observed isotope concentration, which seems to make the constant diffusion assumption in the model of [17] inadequate. Our data-driven approach not only reveals this multiplicative nature of the noise, but also a richer structure of the learnt potential in comparison to the potential function of the constant diffusion model, as shown in figure 4. Hence, choosing a state dependent diffusion model is advisable even in cases where one is only interested in the potential, since the noise structure of the data also influences the estimation of the potential (and drift) function. Here we find in total four local minima, but only two would be expected for a polynomial drift of order p = 3. Switches occur between the two lowest states at δ¹⁸O ≈ −… and δ¹⁸O ≈ −…, while further minima lie at δ¹⁸O ≈ −… and δ¹⁸O = −….4; the latter metastable state is not statistically significant in the estimate of the drift function shown in figure 7.
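As a runnable illustration of the procedure of sections IV A and IV B, the sketch below simulates a double well SDE with a state dependent diffusion and recovers D(x) by GP regression on the scaled squared increments ỹ_i = (X_{t_i+Δt} − X_{t_i})²/Δt. The diffusion function, the fixed nuisance variance (used here in place of the paper's 2-fold cross-validation), and all other settings are our own choices for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def D_true(x):
    """Illustrative state dependent diffusion (our choice, not the paper's)."""
    return 1.0 + 0.5 * x ** 2

# Simulate dX = 4(X - X^3) dt + D(X)^{1/2} dW with the Euler scheme (2).
dt, n = 0.002, 3000
x = np.empty(n + 1)
x[0] = 1.0
for i in range(n):
    drift = 4.0 * (x[i] - x[i] ** 3)
    x[i + 1] = x[i] + drift * dt + np.sqrt(D_true(x[i]) * dt) * rng.standard_normal()

# GP regression on y~_i = (X_{t+dt} - X_t)^2 / dt, whose conditional mean is D(x).
y_tilde = (x[1:] - x[:-1]) ** 2 / dt
xs = x[:-1]

def rbf(a, b, length=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

sigma2 = 4.0   # nuisance noise variance, fixed by hand for this sketch
alpha = np.linalg.solve(rbf(xs, xs) + sigma2 * np.eye(n), y_tilde)

def D_hat(z):
    """GP mean (19) applied to the squared increments: estimate of D(z)."""
    return rbf(np.atleast_1d(z), xs) @ alpha
```

With the diffusion values D̂(x_i) in hand, the drift would then be estimated by plugging them into the diagonal of D_j in (20), as described in section IV B.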
V. LARGE NUMBER OF OBSERVATIONS: THE NEED FOR A SPARSE GP
In practice, the number of observations can be large for a fine time discretization, and a fast computation of the matrix inverses in (20) could become infeasible. A possible way out of this problem, as suggested by [7], could be a restriction to kernels for which the inverse kernel is a differential operator. We will now resort to a different approach which applies to arbitrary kernels and generalizes easily to multivariate SDEs. Our method is based on a variational approximation to the GP posterior [18, 19], where the likelihood term of the GP model (5) is replaced by another effective likelihood, which depends only on a smaller set of variables f_s.

A. The general case
We assume a collection of random variables f = {f(x)}_{x∈T}, where the index variable x ∈ T takes values in some possibly infinite index set T. We assume a prior measure denoted by P_0(f) and a posterior measure of the form

P(f) = (1/Z) P_0(f) e^{−U(f)},  (30)

where U(f) is a functional of f. The goal is to approximate P by another measure Q of the form

Q(f) = (1/Z_s) P_0(f) e^{−U_s(f_s)},  (31)

where the potential U_s depends only on a smaller sparse set f_s = {f(x)}_{x∈S} of dimension m. S is not necessarily a subset of T. While we keep the set S fixed, U_s will be optimized to minimize the variational free energy of the approximation

−ln Z ≤ −ln Z_s + E_Q[U(f) − U_s(f_s)].  (32)

We write the joint probability of f and f_s as

Q(f, f_s) = Q(f | f_s) Q(f_s) = P_0(f | f_s) Q(f_s),  (33)

where the last equality follows from the fact that, fixing the sparse set f_s, U_s(f_s) becomes non-random and the dependency on the remaining random variables f is only via P_0, and we have

Q(f_s) = (P_0(f_s)/Z_s) e^{−U_s(f_s)}.  (34)

Hence, we can integrate out all variables f except f_s using P_0(f | f_s) and rewrite the variational bound as the finite dimensional integral

−ln Z ≤ −ln Z_s + ∫ Q(f_s) { E[U(f) | f_s] − U_s(f_s) } df_s = ∫ Q(f_s) ln( Q(f_s) / (P_0(f_s) e^{−E[U(f)|f_s]}) ) df_s,  (35)

where E[U(f) | f_s] is the conditional expectation w.r.t. P_0. Since this is of the form of a relative entropy, we conclude that the bound is minimized by the choice

Q(f_s) ∝ P_0(f_s) e^{−E[U(f)|f_s]}  (36)

and thus U_s(f_s) = E[U(f) | f_s].

B. Gaussian random variables
We next specialize to a Gaussian measure P_0 with zero mean and covariance kernel K. If we assume (for notational simplicity) that the set {f} is represented as a finite but high-dimensional vector f and

U(f) = (1/2) f^T A f − b^T f  (37)

is a quadratic form, we can then further simplify the conditional expectation (36) to

E[U(f) | f_s] = (1/2) (E[f | f_s])^T A E[f | f_s] − b^T E[f | f_s] + C,  (38)

where

C = (1/2) tr(Cov[f | f_s] A)  (39)

is a constant independent of f_s. This follows from the fact that E[f | f_s] is the optimal mean square predictor of the vector f given f_s [20]; the difference f − E[f | f_s] is a random vector which is uncorrelated with the vector f_s and thus, for jointly Gaussian random variables, independent of f_s. Hence the conditional covariance Cov[f | f_s] does not depend on f_s. The explicit result for this predictor is given by

E[f | f_s] = K_{Ns} K_s^{−1} f_s,  (40)

where K_s is the kernel matrix for the sparse set and K_{Ns} is the n × m kernel matrix between the non-sparse and the sparse set. It is now easy to generalize to the infinite dimensional case of the form

U(f) = (1/2) ∫ f(x)² A(x) dx − ∫ f(x) b(x) dx,  (41)

for which we get

E[f(x) | f_s] = k_s^T(x) K_s^{−1} f_s  (42)

and thus

E[U(f) | f_s] = (1/2) f_s^T K_s^{−1} { ∫ k_s(x) A(x) k_s^T(x) dx } K_s^{−1} f_s − f_s^T K_s^{−1} ∫ k_s(x) b(x) dx.  (43)

C. Sparse GP Drift and Diffusion Estimation
Now, setting

U(f) = −ln L(X_T | f),  (44)

we can derive the drift estimator for the sparse representation analogously to (15). With the definitions π_j = K^j_{Ns} (K^j_s)^{−1} and Ω_j = Δt (π_j)^T D_j^{−1} π_j, we get for the j-th component of the drift vector

f̂_j(x) = k^j_s(x)^T (I + Ω_j K^j_s)^{−1} Δt (π_j)^T (D_j)^{−1} y_j,  (45)

where k^j_s(x) = (K_j(x, x_1), ..., K_j(x, x_m))^T contains the kernel evaluated between x and the sparse points. The corresponding expression for the variance estimator is given by

D̂_{f_j}(x) = K_j(x, x) − k^j_s(x)^T (I + Ω_j K^j_s)^{−1} Ω_j k^j_s(x).  (46)

Notice that the inverted matrix inside the drift and variance estimators is no longer of the size n × n of the observations, but of the size m × m of the sparse set. While it is possible to also optimize the approximation with respect to the set of sparse points numerically [14, 21], we use a simple heuristic, where we construct a histogram over the observations and select as our sparse set S the midpoints of all histogram hypercubes containing at least one observation. Here, the intuition is that a sparse point in a region of high empirical density is a good approximation to the data points in the respective hypercube. The number of histogram bins is determined by Sturges' formula [22], which is implicitly based on the range of the data. Note that the cardinality m of the sparse set is not set in advance but automatically determined by the spatial structure of the data.
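The histogram heuristic and the sparse mean (45) can be sketched as follows for one dimension; the regression data here are synthetic (noisy observations of sin x, with a scalar noise variance σ² playing the role of D/Δt), and all names are our own.

```python
import numpy as np

def rbf(a, b, length=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

def sparse_points_1d(x):
    """Sparse set S: midpoints of occupied histogram bins, bin count by Sturges' formula."""
    n_bins = int(np.ceil(np.log2(len(x)) + 1))
    counts, edges = np.histogram(x, bins=n_bins)
    mids = 0.5 * (edges[:-1] + edges[1:])
    return mids[counts > 0]

def sparse_gp_mean(x, y, s, sigma2, x_star, length=1.0):
    """Sparse GP mean (45) with scalar noise variance sigma2 = D/dt."""
    K_s = rbf(s, s, length) + 1e-8 * np.eye(len(s))   # jitter for numerical stability
    K_Ns = rbf(x, s, length)
    pi = np.linalg.solve(K_s, K_Ns.T).T               # pi = K_Ns K_s^{-1}
    Omega = pi.T @ pi / sigma2                        # Omega = dt pi^T D^{-1} pi
    rhs = pi.T @ y / sigma2                           # dt pi^T D^{-1} y
    coeff = np.linalg.solve(np.eye(len(s)) + Omega @ K_s, rhs)
    return rbf(np.atleast_1d(x_star), s, length) @ coeff

# Synthetic check: m << n inducing points still reproduce the regression curve.
rng = np.random.default_rng(2)
x = np.linspace(-3.0, 3.0, 400)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
s = sparse_points_1d(x)           # Sturges gives 10 bins here, all occupied -> m = 10
pred = sparse_gp_mean(x, y, s, sigma2=0.01, x_star=np.array([0.0, np.pi / 2]))
```

Only the m × m matrix I + Ω_j K^j_s is inverted; with m = 10 inducing points instead of n = 400 observations, the cost of the prediction drops accordingly.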
In practice, this heuristic typically leads to m ≪ n and therefore to substantial computational gains compared to the full GP. Using the sparse GP for the drift and diffusion function estimation is easily accomplished by first determining a sparse set S for the relevant data set and then substituting the mean (20) and variance (21) equations with their sparse GP counterparts (45) and (46), respectively.

One exception is the estimation of the constant diffusion D, where we have to replace the marginal distribution (25) with a corresponding sparse approximation. Here, we follow [18] and optimize for each component j a lower bound on the evidence with respect to the diffusion constants:

F_V(X_T) = ln N( y^j | 0, Q^j_N + (1/∆t) D^j I ) − (∆t / (2 D^j)) tr( K^j − Q^j_N ),   (47)

where Q^j_N = K^j_{Ns} (K^j_s)^{-1} (K^j_{Ns})^T and tr(·) denotes the trace of the matrix.

D. Performance comparison
In order to get a feel for the performance differences between the standard GP and its sparse counterpart, we compare both versions in terms of accuracy and performance on the double well model

dX = 4(X − X^3) dt + D^{1/2} dW_t   (48)

with constant and known variance D = 1. For the comparison, we analyzed the performance for data sets of different sizes, where we generated 10 data sets with ∆t = 0.002 for each fixed number of observations. As accuracy measure, we used the approximate mean squared error (MSE)

∫ p(z) (f̂(z) − f(z))^2 dz ≈ (1/S) Σ_{i=1}^S (f̂(z_i) − f(z_i))^2   (49)

of the corresponding estimator. Here f̂(z) denotes the estimated drift and f(z) the true drift value, each evaluated on a set of S = 100 fixed points evenly spaced over the range of the samples. We then measured the run time and MSE for each data set based on the sparse GP and the standard GP estimation, each with a polynomial kernel of order p = 4. All MSEs are computed for one fixed test set of size n = 4000, which we generated from the same model.

TABLE I. Results of mean run times and MSEs of the standard GP and sparse GP algorithms for different sample sizes, run on a machine with an Intel Core i3 processor. The size of the sparse sets varied between m = 6 and m = 19.

Sample size | full GP runtime | full GP MSE | sparse GP runtime | sparse GP MSE
300         | 0.077           | 1.507       | 0.005             | 1.507
500         | 0.104           | 1.384       | 0.008             | 1.384
1000        | 0.828           | 1.292       | 0.014             | 1.293
2500        | 4.19            | 1.157       | 0.028             | 1.157
5000        | 30.18           | 0.973       | 0.056             | 0.973
10000       | 324.5           | 0.592       | 0.162             | 0.593
50000       | -               | -           | 0.783             | 0.142

VI. ESTIMATION FOR SPARSE OBSERVATIONS
The direct GP approach outlined above leads to wrong estimates of the drift when observations are sparse in time. In the sparse setting, we assume that n observations z_k := X_{τ_k}, k = 1, ..., n are obtained at (for simplicity) regular intervals τ_k = kτ, where τ ≫ ∆t is much larger than the microscopic time scale. In this case, a straightforward discretization of (5), where the sum over microscopic times t_i is replaced by a sum over macroscopic times τ_k and ∆t by τ, would correspond to a discrete time dynamical model of the form (1), again replacing ∆t by τ. But this discretization is a bad approximation to the true SDE dynamics, because the transition kernel over macroscopic times τ is simply not Gaussian for a general f, as was assumed in (15). The failure of the direct estimator for larger time distances can be seen in figure 9, where the red line corresponds to the true drift of the double well (with constant, known diffusion) and the black line to its prediction based on sparse observations.

FIG. 8. (color online) Snippet of the double well sample path in black with observations denoted as red dots.

FIG. 9. (color online) Estimated drift function for the double well based on the direct approach, where the red dashed line denotes the true drift function and the solid black line the mean function. One can clearly see that the larger distance between the consecutive points leads to a wrong prediction.

We treat the state X_t for times t between consecutive observations, kτ < t < (k + 1)τ, as a hidden stochastic process with a conditional path probability given by

p(X_T | z, f) ∝ p(X_T | f) ∏_{k=1}^n δ(z_k − X_{kτ}),   (50)

where z is the collection of observations z_k. We will use an iterative method based on the EM algorithm [11], in which the unobserved complete paths are replaced by an appropriate expectation using the probability (50).

A. EM algorithm
The EM algorithm cycles between two steps:

1. In the E-step, we compute the expected negative logarithm of the complete data likelihood,

L(f, p) = − E_p[ln L(X_T | f)],   (51)

where p denotes the posterior p(X_T | z, f_old) for the previous estimate f_old of the drift.

2. In the M-step, we recompute the most likely drift function by the minimization

f_new = arg min_f ( L(f, p) − ln P(f) ).   (52)

One can show [11] that the EM algorithm converges to a local maximum of the log-posterior. To compute the expectation in the E-step, we use (5) and take the limit ∆t → 0. Since f(x) is a time-independent function, this yields

− E_p[ln L(X_T | f)] = lim_{∆t→0} Σ_t { (1/2) E_p[ ||f(X_t)||^2 ] ∆t − E_p[ (f(X_t), X_{t+∆t} − X_t) ] }
= (1/2) ∫_0^T E_p[ ||f(X_t)||^2 ] dt − ∫_0^T E_p[ (f(X_t), g_t(X_t)) ] dt
= (1/2) ∫ ||f(x)||^2 A(x) dx − ∫ (f(x), b(x)) dx.   (53)

We have defined the corresponding drift conditioned on the data,

g_t(x) = lim_{∆t→0} (1/∆t) E_p[X_{t+∆t} − X_t | X_t = x],   (54)

as well as the functions

A(x) = ∫_0^T q_t(x) dt   (55)

and

b(x) = ∫_0^T g_t(x) q_t(x) dt.   (56)

In contrast to (6), expectations are now over the marginal densities q_t(x) of X_t computed from the conditional path measure, not over the asymptotic stationary density. Hence, we end up again with a simple quadratic form in f to be minimized. Note that due to the smoothness of the kernel the estimate resulting from (52) can easily be differentiated analytically, a fact that will be needed later.

However, there are two main problems for a practical realization of this EM algorithm:

• We cannot compute the expectation with respect to the conditional path measure exactly and need to find approximations applicable to arbitrary prior drift functions f(x).

• Although real observations are sparse, the hidden path involves a continuum of values X_t. This will require (e.g.
after some fine discretization of time) the inversion of large matrices in (20).

We can readily deal with the latter problem by resorting to the sparse GP representation introduced in section V.

Linear drift approximation: The Ornstein-Uhlenbeck bridge
In this section we will look at the first problem of computing expectations in the E-step. For a given drift f(·) and times t ∈ I_k in the interval I_k = [kτ; (k + 1)τ] between two consecutive observations, the exact marginal p_t(x) of the conditional path distribution equals the density of X_t = x conditioned on the fact that X_{kτ} = z_k and X_{(k+1)τ} = z_{k+1}. This is a so-called diffusion bridge. Using the Markov property, this density can be expressed by the transition densities p_s(x_{t+s} | x_t) of the homogeneous Markov diffusion process with drift f(x) as

p_t(x) ∝ p_{(k+1)τ−t}(z_{k+1} | x) p_{t−kτ}(x | z_k)   for t ∈ I_k.   (57)

As functions of t and x, the second factor fulfills a forward Fokker-Planck equation and the first one a Kolmogorov backward equation [1]. Since exact computations are not feasible for general drift functions, we approximate the transition density p_s(x | z_k) in each interval I_k by that of a homogeneous Ornstein-Uhlenbeck process [1], where the drift f(x) is replaced by a local linearization. Hence, we consider the approximate process

dX_t = [f(z_k) − Γ_k (X_t − z_k)] dt + D_k^{1/2} dW   (58)

with Γ_k = −∇f(z_k) and D_k = D(z_k) for t ∈ I_k. For this process, the transition density is a multivariate Gaussian

q_s^{(k)}(x | z) = N( x | α_k + e^{−Γ_k s} (z − α_k); S_s ),   (59)

where α_k = z_k + Γ_k^{-1} f(z_k) is the stationary mean. The covariance S_s = A_s B_s^{-1} is calculated in terms of the matrix exponential

[A_s; B_s] = exp( [[−Γ_k, D_k]; [0, Γ_k^T]] s ) [0; I].   (60)
Then we obtain the Gaussian approximation q_t^{(k)}(x) = N(x | m(t); C(t)) of the marginal posterior for t ∈ I_k by multiplying the two transition densities, where

C(t) = ( e^{−Γ_k^T (t_{k+1}−t)} S^{-1}_{t_{k+1}−t} e^{−Γ_k (t_{k+1}−t)} + S^{-1}_{t−t_k} )^{-1},

m(t) = C(t) e^{−Γ_k^T (t_{k+1}−t)} S^{-1}_{t_{k+1}−t} ( z_{k+1} − α_k + e^{−Γ_k (t_{k+1}−t)} α_k ) + C(t) S^{-1}_{t−t_k} ( α_k + e^{−Γ_k (t−t_k)} (z_k − α_k) ).

By inspecting mean and variance we see that the distribution is in fact equivalent to a bridge between the points X_{t_k} = z_k and X_{t_{k+1}} = z_{k+1} and collapses to point masses at these points.

Finally, in this approximation we obtain for the conditional drift

g_t(x) = lim_{∆t→0} (1/∆t) E[X_{t+∆t} − X_t | X_t = x, X_{(k+1)τ} = z_{k+1}]
= f(z_k) − Γ_k (x − z_k) + D_k e^{−Γ_k^T (t_{k+1}−t)} S^{-1}_{t_{k+1}−t} ( z_{k+1} − α_k − e^{−Γ_k (t_{k+1}−t)} (x − α_k) ),

as shown in appendix A.

Sparse M-Step approximation
For the M-step approximation we use the sparse GP formalism of section V. The resulting sparse approximation to the likelihood (53) is given by

L_s(f, q) = (1/2) ∫ ||E[f(x) | f_s]||^2 A(x) dx − ∫ (E[f(x) | f_s], b(x)) dx,   (61)

where the conditional expectation is over the GP prior. While the exact likelihood does not contain interactions of the form f(x) f(x′) for x ≠ x′, we allow for couplings of the type f^T Λ f − a^T f in the effective log-likelihood. In order to avoid cluttered notation, it should be noted that in the following results for a component f_j, the quantities Λ_s, f_s, k_s, K_s^{-1}, z(x), D(x), similar to (20), depend on the component j, but A(x) does not. We easily get

E[f(x) | f_s] = k_s^T(x) K_s^{-1} f_s.   (62)

Hence

L_s(f, q) = (1/2) f_s^T Λ_s f_s − f_s^T y_s   (63)

with

Λ_s = K_s^{-1} { ∫ k_s(x) D(x)^{-1} A(x) k_s^T(x) dx } K_s^{-1}   (64)

and

y_s = K_s^{-1} ∫ D(x)^{-1} k_s(x) b(x) dx.   (65)

With these results, the approximate MAP estimate is

f̄_s(x) = k_s^T(x) (I + Λ_s K_s)^{-1} y_s.   (66)

The integrals over x in (64) and (65) can be computed analytically for many kernels of interest, such as polynomial and RBF ones. However, we found it more efficient to treat the time integration in (55) and (56) as well as the x-integrals by sampling, where time points t are drawn uniformly at random and x points from the multivariate Gaussian q_t(x). A related expression for the variance,

D̄_s(x) = K(x, x) − k_s^T(x) (I + Λ_s K_s)^{-1} Λ_s k_s(x),   (67)

can only be viewed as a crude estimate, because it does not include the impact of the GP fluctuations on the path probabilities.

Finally, a possible approximate evidence for our model is given by the product of the local Ornstein-Uhlenbeck transition probabilities:

p(z) ≈ p_ou(z | f̂) = p(z_1) ∏_{k=1}^{n−1} q_τ^{(k)}(z_{k+1} | z_k).   (68)
The expression is a product of Gaussian transition densities and is therefore of analytical form. Note that in addition to the Ornstein-Uhlenbeck linearization, this approximation also neglects the uncertainty of f, since the GP in the M-step only uses the expectation.

Nevertheless, in our experiments we found that the approximate evidence is a reasonable choice for the optimization of the diffusion D(x); see subsection VI D. The optimization of the kernel hyperparameters is more problematic, however, since the approximate evidence depends on the drift estimate f̂, which itself depends on the choice of the hyperparameters through the application of the GP. Since we assume that prior knowledge of suitable kernel hyperparameters is often available, we did not pursue this problem further.

B. Experiments
We created the synthetic data sets in this section by first simulating the corresponding SDE with the Euler method on a fine grid of step size ∆_dense. To obtain N observations separated by ∆t ≫ ∆_dense, we keep every k = (∆t/∆_dense)-th path value as an observation, until the desired observation number N is reached.

The EM algorithm is initialized with the sparse direct GP estimator, which works well in practice as a reasonable first approximation to the true system dynamics. Although the monotonicity property of the EM algorithm is no longer satisfied due to the approximation in the E-step, convergence is assumed once L stabilizes up to some minor fluctuations. In our experiments convergence was typically attained after a few (< 10) iterations.
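The E-step ingredients that each iteration relies on can be sketched numerically: the OU transition covariance S_s via the matrix-fraction form of (60), and the bridge marginal N(m(t), C(t)) in the scalar case, obtained by multiplying the forward and backward transition densities. This is a hedged sketch under our own function names (scipy assumed), not the authors' implementation:

```python
import numpy as np
from scipy.linalg import expm

def ou_covariance(Gamma, D, s):
    """Transition covariance S_s of the linearized OU process via the
    matrix-fraction decomposition of (60): S_s = A_s B_s^{-1}."""
    d = Gamma.shape[0]
    M = np.block([[-Gamma, D],
                  [np.zeros((d, d)), Gamma.T]])
    AB = expm(M * s) @ np.vstack([np.zeros((d, d)), np.eye(d)])
    return AB[:d] @ np.linalg.inv(AB[d:])

def ou_bridge_marginal_1d(f_z, gamma, D, z_k, z_k1, t_k, t_k1, t):
    """Scalar bridge marginal N(m(t), C(t)) between observations z_k at t_k
    and z_k1 at t_{k+1}, from the product of the forward and backward OU
    transition densities; gamma = -f'(z_k)."""
    S = lambda s: D * (1.0 - np.exp(-2.0 * gamma * s)) / (2.0 * gamma)
    alpha = z_k + f_z / gamma                 # stationary mean alpha_k
    eb = np.exp(-gamma * (t_k1 - t))          # backward decay factor
    ef = np.exp(-gamma * (t - t_k))           # forward decay factor
    Sb, Sf = S(t_k1 - t), S(t - t_k)
    C = 1.0 / (eb ** 2 / Sb + 1.0 / Sf)
    m = C * (eb * (z_k1 - alpha + eb * alpha) / Sb
             + (alpha + ef * (z_k - alpha)) / Sf)
    return m, C
```

A useful consistency check is that the scalar marginal coincides with exact Gaussian conditioning of X_t on both interval endpoints, and that m(t) collapses to z_k and z_{k+1} at the boundaries.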
Performance Comparison
First, we compare the estimation accuracy of the direct GP and the EM algorithm on the double well model with constant known diffusion,

dX = 4(X − X^3) dt + dW_t,   (69)

for different time discretizations ∆t. For each time step, we generated 20 data sets, each of size n = 4000, and computed the MSE on a test set of size n = 2000 for each data set for both algorithms with an RBF kernel. As a benchmark reference, we include the estimation results of a Monte Carlo sampler (see appendix C). The latter is shown only for one data set at a small and at a medium time interval, respectively, due to its long computation time. In order to improve comparability, we fixed the length scale of the RBF kernel to l = 0.62 for all data sets.

FIG. 10. (color online) Comparison of the MSE of the different methods (direct GP, EM, Gibbs sampler) over different time intervals.

FIG. 11. (color online) GP estimation after one iteration of the EM algorithm. Again, the solid black and red dashed lines denote estimator and true drift function, respectively.

The results are given in figure 10. The MSE of the direct GP grows quite rapidly with the observation interval until it reaches an upper bound roughly equivalent to randomly guessing the drift function. On the other hand, the MSE for the EM algorithm increases at a much slower rate, giving good results even for data sets with larger time distances. The estimation results for the Gibbs sampler are independent of the discretization rate, but take considerable time to compute: while the EM algorithm runs for a couple of minutes, the sampler takes up to two days.
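For reference, the data generation and error measure behind these comparisons can be sketched as follows: an Euler-Maruyama simulation of the double well (69) and the grid approximation of the MSE (49). Function names are ours, and the GP estimators themselves are omitted:

```python
import numpy as np

def simulate_double_well(n, dt, D=1.0, x0=0.0, seed=0):
    """Euler-Maruyama path of dX = 4(X - X^3) dt + sqrt(D) dW."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    x[0] = x0
    for i in range(n - 1):
        x[i + 1] = (x[i] + 4.0 * (x[i] - x[i] ** 3) * dt
                    + np.sqrt(D * dt) * rng.normal())
    return x

def approx_mse(f_hat, f_true, grid):
    """Grid approximation of the MSE integral in (49)."""
    return np.mean((f_hat(grid) - f_true(grid)) ** 2)
```

With any drift estimator f_hat in hand, `approx_mse(f_hat, lambda z: 4*(z - z**3), grid)` evaluates the error measure on an evenly spaced grid over the sample range.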
Double well model with known state dependent diffusion
As our next example we examine the double well model with state dependent diffusion and a larger time discretization. Here we assume that the diffusion function D(x) is known. Specifically, we sample n = 4000 observations at a larger time step and use a polynomial kernel of order p = 4. The direct GP and the EM results are given in figures 9 and 11, respectively. One can clearly see that an application of the EM algorithm leads to a significantly better estimate of the drift function compared to the direct GP method.

FIG. 12. Empirical density of the two dimensional synthetic model data.

FIG. 13. Vector fields of the true drift depicted as grey lines and the estimated drift as black arrows for the two dimensional synthetic data.
Two dimensional synthetic model
We now turn to a two dimensional process with the following dynamics:

dX = ( X (1 − X^2 − Y^2) − Y ) dt + dW_t^{(1)},   (70)
dY = ( Y (1 − X^2 − Y^2) + X ) dt + dW_t^{(2)},   (71)

where the component indices are denoted by superscripts. For this model we generated n = 10000 observations at a fixed step size ∆t. The estimate is based on a polynomial kernel of order p = 4 and shows a good fit to the true drift, especially in the regions where the observations are concentrated. Note that this is a non-equilibrium model, where the drift cannot be expressed as the gradient of a potential. Hence, the density based method of [10] cannot be applied here.

Lorenz’63 model
We next analyze a stochastic version of the three dimensional Lorenz’63 model. It consists of the following system of nonlinear coupled stochastic differential equations:

dX = σ (Y − X) dt + dW_t^{(1)},   (72)
dY = ( ρX − Y − XZ ) dt + dW_t^{(2)},   (73)
dZ = ( XY − βZ ) dt + dW_t^{(3)}.   (74)

Lorenz’63 is a chaotic system which was developed as a simplified model of thermal convection in the atmosphere [23]. The parameters θ = (σ, ρ, β) are set to the commonly used θ = (10, 28, 8/3), known to induce chaotic behavior in the system. In order to analyze the model we simulate n = 3000 data points with time discretization ∆t = 0.2. In the inference, we used a polynomial kernel of order p = 2 and assume that the constant diffusion is known.

FIG. 14. Simulated sample path of the Lorenz’63 model learned by the direct GP algorithm.

FIG. 15. Simulated sample path of the Lorenz’63 model learned by the EM algorithm.

FIG. 16. Simulated path in the X-Z plane from the Lorenz’63 model learned by the direct GP algorithm.

FIG. 17. Simulated path in the X-Z plane from the Lorenz’63 model learned by the EM algorithm.

In order to visualize the quality of the estimation results, we computed the direct GP and the EM estimates and simulated paths using the corresponding mean estimator as drift function. Here, the application of the EM leads to a vastly superior estimation result compared to the direct method. As shown in figure 16, the direct GP estimator path collapses to a small region of the state space, whereas the EM trajectory of figure 17 nicely captures the true dynamics of the Lorenz’63 model, faithfully recreating the famous butterfly pattern in the X-Z plane.
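A data set of this kind can be generated along the following lines (a sketch with our own function names; we simulate on a fine Euler grid, from which observations at the larger ∆t would then be subsampled as described at the beginning of this section, and the initial state is an arbitrary point near the attractor):

```python
import numpy as np

def simulate_lorenz63(n, dt, sigma=10.0, rho=28.0, beta=8.0 / 3.0,
                      noise=1.0, x0=(1.0, 1.0, 25.0), seed=0):
    """Euler-Maruyama simulation of the stochastic Lorenz'63 system (72)-(74)
    with additive noise of variance `noise` in each component."""
    rng = np.random.default_rng(seed)
    X = np.empty((n, 3))
    X[0] = x0
    s = np.sqrt(noise * dt)
    for i in range(n - 1):
        x, y, z = X[i]
        drift = np.array([sigma * (y - x),
                          rho * x - y - x * z,
                          x * y - beta * z])
        X[i + 1] = X[i] + drift * dt + s * rng.normal(size=3)
    return X
```

Note that the explicit Euler step requires a grid much finer than the observation spacing ∆t = 0.2 used for inference; subsampling the fine path then yields the sparse observations.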
Cart and pole model
Next, we consider an example from the class of mechanical systems. Our model describes the dynamics of a pole attached to a cart moving randomly along a one-dimensional axis. Formally, we get a system of two coupled differential equations, with X denoting the angle of the pendulum and V the angular velocity. We define the upright position of the pendulum as X = 0. This particular cart and pole model is frequently studied in the context of learning control policies [24], where the goal is to move the cart in such a way as to stabilize the pendulum in the upright position. The complete system reads

dX = V dt,   (75)
dV = ( −γV + mgl sin(X) ) / (ml^2) dt + d^{1/2} dW_t,   (76)

where γ = 0.05 is the friction coefficient, l = 1 m and m = 1 kg are the length and mass of the pendulum, respectively, and g = 9.81 m s^{-2} denotes the gravitational acceleration. For our experiment, we generated N = 4000 data points (x, v) on a regular grid with diffusion constant d = 1. Here, the full diffusion matrix

D = diag(0, d)   (77)

of the system (X, V) is rank deficient due to its noiseless first equation. However, we note that our EM algorithm is also applicable to models with deterministic components, since the E-step of the EM algorithm remains well defined. In the kernel function we incorporate our prior knowledge that the pendulum angle is periodic and the velocity acts as a linear friction term inside the system. Specifically, we define the following multiplicative kernel for the dV equation:

K((x, v), (x′, v′)) = K_Per(x, x′) K_Poly(v, v′),   (78)

where K_Per denotes the periodic kernel over the state x with length scale l = 1.21 and K_Poly the polynomial kernel of order p = 1 over the velocity V. The multiplicative kernel structure allows for interactions between its components. Since in this model the components are independent, we could also use an additive kernel, which neglects interaction terms, but we have chosen the more generally applicable variant here. For the dX equation, we use a polynomial kernel of order p = 1, which captures the linear relationship between X and V. If we adapt our choice of the kernel to the specific form of the system, we get an accurate estimate even for data points separated by a wider time spacing (see figures 18 and 19).

FIG. 18. (color online) One dimensional drift estimation plot for the second (dV) SDE of the cart and pole model. The figure shows the estimation over the pendulum position X for a fixed velocity V = 0. The solid black line is the drift estimate and the red dashed line the true function.

C. External forces
FIG. 19. (color online) One dimensional drift estimation plots for the second (dV) SDE of the cart and pole model: estimation over V for fixed pendulum positions in the horizontal orientation, X = −π/2 and X = π/2. Full lines denote the drift estimates and dashed and dotted lines the true values.

We can expect a reasonably good estimation of f(x) only in regions of x where we have enough observations. This is of clear importance when the system is multistable and the noise is too small to allow for a sufficient exploration of the state space. An alternative method for exploration is to add a known external deterministic control force u(t) to the dynamics, designed to drive the system from one locally stable region to another one. Hence, we assume an SDE

dX_t = ( f(X_t) + u(t) ) dt + D^{1/2} dW_t.   (79)

This situation is easily incorporated into our formalism. In all likelihood terms, we replace f(X_t) by f(X_t) + u(t), while keeping the zero mean GP prior over functions. The changes for the corresponding transition probabilities of the approximating time dependent Ornstein-Uhlenbeck bridge are given in appendix B.

We demonstrate the concept by applying it to the double well model:

dX = ( 4(X − X^3) + u(t) ) dt + σ dW_t.   (80)

As external force we choose a periodic control function of the form u(t) = a sin(ωt) with parameters a = 1 and ω = 3. We generated a data set of n = 2000 observations on a regular grid and set D^{1/2} = 0.5. The addition of u(t) leads to observations from both of the wells, whereas in the uncontrolled case only one part of the underlying state space is explored. Hence, the drift estimation in the latter case leads to an accurate result solely around the well at X = 1, as opposed to the controlled case, where both modes are faithfully recovered (figures 20 and 21). In both cases, we used an RBF kernel with τ = 1. The length scale was set to l = 0.74 in the controlled and l = 0.53 in the uncontrolled case.
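The controlled double well data can be generated along the same Euler-Maruyama lines; a sketch with our own function name (the grid spacing and initial state below are our own illustrative choices):

```python
import numpy as np

def simulate_controlled_dw(n, dt, a=1.0, omega=3.0, sigma=0.5, x0=1.0, seed=0):
    """Euler-Maruyama path of the controlled double well (80),
    dX = (4(X - X^3) + a sin(omega t)) dt + sigma dW."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    x[0] = x0
    for i in range(n - 1):
        t = i * dt
        drift = 4.0 * (x[i] - x[i] ** 3) + a * np.sin(omega * t)
        x[i + 1] = x[i] + drift * dt + sigma * np.sqrt(dt) * rng.normal()
    return x
```

Setting a = 0 recovers the uncontrolled case; comparing the two paths illustrates how the periodic force helps the trajectory cross the barrier between the wells.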
D. Diffusion Estimation
As in the dense data scenario, we look at constant and state dependent diffusions in turn. If D does not depend on the state, we can proceed in analogy to the dense data case and maximize the approximate evidence (68) with respect to the diffusion values. For the state dependent case D(x) we assume a parametric function D(x; θ), which is specified by its parameter vector θ. Here, we again maximize the approximate evidence with respect to the corresponding θ.

FIG. 20. (color online) EM algorithm predictions for the uncontrolled double well path, with the solid black line denoting the estimate and the dashed red line the true drift. Here, the estimation fails around the well at X = −1.

FIG. 21. (color online) EM algorithm predictions for the controlled double well path. The solid black line is the estimated drift and the dashed red line the true function.

For an illustration, we do not show the constant diffusion case and instead restrict ourselves to the more interesting case of a state dependent D(x). We sampled n = 8000 observations from a mean reverting model of the form

dX = 0.1 ( … − X ) dt + max( 2 − (X − …)^2, … )^{1/2} dW_t.   (81)

The diffusion function was modelled as D(x; θ) = θ_1 x^2 + θ_2 x + θ_3. As kernel function for the drift, we used a polynomial kernel of order p = 1. Optimizing the evidence with respect to θ leads to the results shown in figure 22. One can see that the estimation gives a reasonably good fit to the true diffusion function even with the bigger time discretization.

FIG. 22. (color online) Comparison of the diffusion estimation for data generated from (81). The dashed red line is the true square root D(x)^{1/2} of the diffusion and the solid black line the parametric estimate based on the EM algorithm. For comparison, we include the estimate based on the direct GP, denoted by the green dashed-dotted line.

We note, however, that
the diffusion estimate is of lower quality than the drift estimate, since in this case the evidence is less accurate.

VII. DISCUSSION
It would be interesting to replace the ad hoc local linear approximation of the posterior drift by a more flexible time dependent Gaussian model. This could be optimized in a variational EM approximation by minimizing a free energy in the E-step, which contains the Kullback-Leibler divergence between the linear and true processes [15, 25]. Such a method could be extended to noisy observations and to the case where some components of the state vector are not observed. Also, this method could be turned into a variational Bayesian approximation, where one optimizes posteriors over both drifts and state paths. The path probabilities are then influenced by the uncertainties in the drift estimation, which would lead to more realistic predictions of error bars.

Finally, nonparametric diffusion estimation deserves further attention. Incorporating a fully nonparametric model of the diffusion function D(x) in our scheme would be infeasible in practice, since this would involve the joint estimation of n diffusion matrices. In our experiments, we tried a (quasi-)nonparametric approach, where we represented the diffusion function by its values at a few supporting points and took these as inputs for a GP regression, which we then used as a function approximation. However, our experiments have shown that in order to achieve a reasonable estimation quality we need supporting points on a relatively dense grid. The corresponding optimization over the vector of grid points turned out to be too inefficient, which makes the approach impractical. Furthermore, the evidence over which we optimize is often too inaccurate to lead to a reasonable quality.

If run time is not at all critical, one can resort to a Markov chain Monte Carlo (MCMC) algorithm, which generates exact samples from the corresponding drift and diffusion functions.
In contrast to the EM algorithm, the sampler evaluates the diffusion function on a dense grid and does not use the assumption of constant diffusion between adjacent observations, thereby overcoming the significant estimation errors for larger time distances. We plan to report on this in a future publication.

Acknowledgments
This work was supported by the European Community’s Seventh Framework Programme (FP7, 2007-2013) under grant agreement 270327 (CompLACS).
Appendix A: Conditional drift
Here, we give the derivation of the conditional drift term g_t(x), which occurs in the E-step of the EM algorithm:

g_t(x) = lim_{∆t→0} (1/∆t) E[X_{t+∆t} − X_t | X_t = x, X_τ = y]
= lim_{∆t→0} (1/∆t) [ ∫ (x′ − x) p_{τ−t−∆t}(y | x′) p_{∆t}(x′ | x) dx′ ] / [ ∫ p_{τ−t−∆t}(y | x′) p_{∆t}(x′ | x) dx′ ]
= lim_{∆t→0} (1/∆t) ( f(x)∆t + E_u[ p_{τ−t−∆t}(y | x + f(x)∆t + u) u ] / E_u[ p_{τ−t−∆t}(y | x + f(x)∆t + u) ] )
= f(x) + D lim_{∆t→0} ∇_x E_u[ p_{τ−t−∆t}(y | x + f(x)∆t + u) ] / E_u[ p_{τ−t−∆t}(y | x + f(x)∆t + u) ]
= f(x) + D lim_{∆t→0} ∇_x ln E_u[ p_{τ−t−∆t}(y | x + f(x)∆t + u) ]
= f(x) + D ∇_x ln p_{τ−t}(y | x).

The second line follows from the definition of the conditional density, the third line from the fact that p_{∆t}(x′ | x) = N(x′ | x + f(x)∆t; D∆t), so that x′ = x + f(x)∆t + u with u ∼ N(0; D∆t). The fourth line is based on the fact that for zero mean Gaussian random vectors u with covariance S we have E[u g(u)] = S E[∇_u g(u)]. Finally, the last line is obtained by noting that the covariance of u vanishes for ∆t → 0.

Appendix B: Ornstein-Uhlenbeck bridge with external forces
If there is an additional time dependent and known drift term u(t), e.g. a control force, in the Ornstein-Uhlenbeck model, i.e.

dX_t = [ f(y_k) − Γ_k (X_t − y_k) + u(t) ] dt + D_k^{1/2} dW,

with Γ_k = −∇f(y_k) and D_k = D(y_k), the mean of the marginal posterior changes to (writing u = t − kτ for the time since the last observation)

m(t) = C(t) e^{−Γ_k^T (τ−u)} S^{-1}_{τ−u} ( x_{k+1} − α_k + e^{−Γ_k (τ−u)} α_k − ∫_u^τ e^{−Γ_k (τ−v)} u(t − u + v) dv )
+ C(t) S^{-1}_u ( α_k + e^{−Γ_k u} (x_k − α_k) + ∫_0^u e^{−Γ_k (u−v)} u(t − u + v) dv ),

but the covariance matrix stays the same. For the posterior drift, we get in this case

g_t(x) ≈ f(x_k) − Γ_k (x − x_k) + D_k e^{−Γ_k^T (τ−u)} S^{-1}_{τ−u} ( x_{k+1} − α_k − e^{−Γ_k (τ−u)} (x − α_k) − ∫_u^τ e^{−Γ_k (τ−v)} u(t − u + v) dv ).

For the periodic control u(t) = a sin(ωt) used in our experiments, the integrals can be evaluated in closed form; in scalar notation,

m(t) = C(t) e^{−γ_k (t_{k+1}−t)} S^{-1}_{t_{k+1}−t} [ x_{k+1} − α_k + e^{−γ_k (t_{k+1}−t)} α_k − a/(γ_k^2 + ω^2) ( (γ_k sin(ωt_{k+1}) − ω cos(ωt_{k+1})) − e^{−γ_k (t_{k+1}−t)} (γ_k sin(ωt) − ω cos(ωt)) ) ]
+ C(t) S^{-1}_{t−t_k} [ α_k + e^{−γ_k (t−t_k)} (x_k − α_k) + a/(γ_k^2 + ω^2) ( (γ_k sin(ωt) − ω cos(ωt)) − e^{−γ_k (t−t_k)} (γ_k sin(ωt_k) − ω cos(ωt_k)) ) ],

g_t(x) ≈ f(x_k) + a sin(ωt) − γ_k (x − x_k) + D e^{−γ_k (t_{k+1}−t)} S^{-1}_{t_{k+1}−t} [ x_{k+1} − α_k − e^{−γ_k (t_{k+1}−t)} (x − α_k) − a/(γ_k^2 + ω^2) ( (γ_k sin(ωt_{k+1}) − ω cos(ωt_{k+1})) − e^{−γ_k (t_{k+1}−t)} (γ_k sin(ωt) − ω cos(ωt)) ) ].

Appendix C: MCMC sampler
We briefly describe the Markov chain Monte Carlo (MCMC) algorithm which generates samples from the drift function of a system of SDEs with known diffusion. Similar to the EM algorithm in the main text, the drift is modeled in a nonparametric way.

As before, our data is a set of N observations Y = (y_1, ..., y_N), where y_k = X_{kτ}. Since the time distance between adjacent observations is taken to be large, we impute the process between observations in each interval I_k = [kτ; (k+1)τ] on a fine grid of step size ∆ = τ/M for some suitable integer M. The imputed path of the kth subinterval is denoted by X_k = { X_{kτ}, X_{kτ+∆}, ..., X_{kτ+M∆} }. If we write the complete imputed path of length MN as

X = ( y_1, X_∆, ..., X_{(M−1)∆}, ..., y_k, X_{kτ+∆}, ..., y_N ),

then the joint posterior distribution of the imputed path and the drift for a given set of observations is given by

p(X, f | Y, D) ∝ p(f) ∏_{l=1}^{NM} p(X_{l+1} | X_l, f, D).

Here, the transition density p(X_{l+1} | X_l, f, D) is approximately Gaussian on the fine grid (see (5)), with mean and variance given by (20) and (21), respectively. A straightforward way to sample from this posterior is given by the following Gibbs sampler:

Algorithm 1
Gibbs Sampler
  Initialize f^(0) with the direct GP solution
  for i = 1, ..., N_iter do
    Sample X^(i) ∼ p(X | Y, f^(i−1), D)
    Sample f^(i) ∼ p(f | X^(i), Y)

Here, the superscripts denote the iteration. The number of iterations N_iter for a particular model is determined by the usual MCMC convergence diagnostics, see for example [26]. Since an analytic form for the imputed path distribution p(X | Y, f, D) does not exist, we have to resort to a Metropolis-Hastings (MH) step. As proposal distribution q, we use the so-called modified diffusion bridge (MDB) of [27]. Here, for each interval I_k the density of a grid point X_k^{j+1} of X_k is normally distributed, conditioned on X_k^j and the interval endpoint y_{k+1}:

q(X_k^{j+1} | X_k^j, y_{k+1}, f_q, D_q) = N( X_k^{j+1} | X_k^j + f_q(X_k^j) ∆, D_q(X_k^j) ∆ )   (C1)

with bridge drift and diffusion

f_q(X_k^j) = (y_{k+1} − X_k^j) / (τ − j∆),
D_q(X_k^j) = ( (τ − (j+1)∆) / (τ − j∆) ) D(X_k^j).

Now, since for each subinterval I_k the bridge proposal starts in observation y_k and terminates in y_{k+1}, we can generate a sample of the complete path p(X | Y, f, D) by sampling an MDB proposal separately for each of the N subintervals. Specifically, for subinterval I_k we simulate a path X*_k on the dense grid by recursively sampling from (C1) and move from the current state X_k to X*_k with probability

α(X_k, X*_k) = min{ 1, ∏_{j=1}^{M−1} [ p(X*_k^{j+1} | X*_k^j, f, D) / p(X_k^{j+1} | X_k^j, f, D) ] × ∏_{j=1}^{M−1} [ q(X_k^{j+1} | X_k^j, y_{k+1}, f_q, D_q) / q(X*_k^{j+1} | X*_k^j, y_{k+1}, f_q, D_q) ] };

with probability 1 − α(X_k, X*_k) we retain the current path X_k.

Sampling the drift from p(f | X, Y) is easier to accomplish, since under the GP prior assumption the distribution p(f | Y, X) of the SDE drift corresponds to a GP posterior and is therefore of analytic form.
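The MDB proposal of (C1) is easy to sketch for a scalar state. The function name is ours, D is passed as a function of the state, and we assume the proposal variance carries the usual factor ∆ of the modified-bridge construction:

```python
import numpy as np

def mdb_proposal(y_k, y_k1, D, tau, M, seed=0):
    """Sample one modified-diffusion-bridge proposal path between the
    observations y_k and y_k1 on a grid of M steps, following (C1)."""
    rng = np.random.default_rng(seed)
    dt = tau / M
    path = np.empty(M + 1)
    path[0] = y_k
    for j in range(M - 1):
        t_left = tau - j * dt                       # time remaining to y_k1
        f_q = (y_k1 - path[j]) / t_left             # bridge drift
        D_q = (tau - (j + 1) * dt) / t_left * D(path[j])
        path[j + 1] = path[j] + f_q * dt + np.sqrt(D_q * dt) * rng.normal()
    path[M] = y_k1                                  # bridge ends at the observation
    return path
```

In the vanishing-noise limit the proposal degenerates to linear interpolation between the two observations, which makes the pinning at both endpoints easy to check.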
Since the number of dense path observations is usually quite substantial, we resort to the sparse version of the GP with mean and variance given by (45) and (46), respectively. In each iteration of the Gibbs sampler, we simulate a new $f$ on a fine grid over the (slightly extended) range of the path observations $X$ and then interpolate these points by nonparametric regression in order to arrive at an approximate drift function. The interpolation step, for which we again resort to a sparse GP, is motivated by computational considerations: this way, evaluating the function values along the path can be done very efficiently, while remaining accurate due to the smoothness of the underlying drift.

[1] C. W. Gardiner. Handbook of Stochastic Methods. Springer, Berlin, second edition, 1996.
[2] Stefano M. Iacus.
Simulation and Inference for Stochastic Differential Equations: With R Examples (Springer Series in Statistics). Springer, 1st edition, 2008.
[3] Hao Wu and Frank Noe. Bayesian framework for modeling diffusion processes with nonlinear drift based on nonlinear and incomplete observations. Phys. Rev. E, 83(3):036705, 2011.
[4] Andrew Golightly and Darren J. Wilkinson. Markov chain Monte Carlo algorithms for SDE parameter estimation. Learning and Inference for Computational Systems Biology, pages 253–276, 2010.
[5] Steven J. Lade. Finite sampling interval effects in Kramers–Moyal analysis. Phys. Lett. A, 373(41):3705–3709, 2009.
[6] Federico M. Bandi and Peter C. B. Phillips. Fully nonparametric estimation of scalar diffusion models. Econometrica, 71(1):241–283, 2003.
[7] Omiros Papaspiliopoulos, Yvo Pokern, Gareth O. Roberts, and Andrew M. Stuart. Nonparametric estimation of diffusions: a differential equations approach. Biometrika, 99(3):511–531, 2012.
[8] Yvo Pokern, Andrew M. Stuart, and J. H. van Zanten. Posterior consistency via precision operators for Bayesian nonparametric drift estimation in SDEs. Stochastic Processes and their Applications, 123(2):603–628, 2013.
[9] Frank van der Meulen, Moritz Schauer, and Harry van Zanten. Reversible jump MCMC for nonparametric drift estimation for diffusion processes. Computational Statistics & Data Analysis, 71:615–632, 2014.
[10] Philipp Batz, Andreas Ruttor, and Manfred Opper. Variational estimation of the drift for stochastic differential equations from the empirical density. Journal of Statistical Mechanics: Theory and Experiment, (8):083404, 2016.
[11] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.
[12] Andreas Ruttor, Philipp Batz, and Manfred Opper. Approximate Gaussian process inference for the drift function in stochastic differential equations. In Advances in Neural Information Processing Systems, pages 2040–2048, 2013.
[13] P. E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer, New York, corrected edition, June 2011.
[14] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[15] Cédric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, and John Shawe-Taylor. Variational inference for diffusion processes. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 17–24. MIT Press, Cambridge, MA, 2008.
[16] Katrine K. Andersen, N. Azuma, J.-M. Barnola, Matthias Bigler, P. Biscaye, N. Caillon, J. Chappellaz, Henrik Brink Clausen, Dorthe Dahl-Jensen, Hubertus Fischer, et al. High-resolution record of Northern Hemisphere climate extending into the last interglacial period. Nature, 431(7005):147–151, 2004.
[17] Frank Kwasniok. Analysis and modelling of glacial climate transitions using simple dynamical systems. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1991), 2013.
[18] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. JMLR W&CP, 5:567–574, 2009.
[19] Lehel Csató, Manfred Opper, and Ole Winther. TAP Gibbs free energy, belief propagation and sparsity. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, pages 657–663. MIT Press, 2002.
[20] Athanasios Papoulis. Probability, Random Variables, and Stochastic Processes. 1965.
[21] Lehel Csató and Manfred Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668, 2002.
[22] H. A. Sturges. The choice of a class interval. Journal of the American Statistical Association, 21:65–66, 1926.
[23] Edward N. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2):130–141, 1963.
[24] Marc Peter Deisenroth, Carl Edward Rasmussen, and Jan Peters. Gaussian process dynamic programming. Neurocomputing, 72(7):1508–1524, 2009.
[25] Michail D. Vrettas, Dan Cornford, and Manfred Opper. Variational mean-field algorithm for efficient inference in large systems of stochastic differential equations. Phys. Rev. E, 91:012148, 2015.
[26] Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer Science & Business Media, 2013.
[27] Garland B. Durham and A. Ronald Gallant. Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes.