Statistical learning of non-linear stochastic differential equations from non-stationary time-series using variational clustering
Vyacheslav Boyko, Sebastian Krumscheid, and Nikki Vercauteren

Abstract.
Parameter estimation for non-stationary stochastic differential equations (SDE) with an arbitrary non-linear drift and non-linear diffusion is accomplished in combination with a non-parametric clustering methodology. Such a model-based clustering approach includes a quadratic programming (QP) problem with equality and inequality constraints. We couple the QP problem to a closed-form likelihood function approach based on suitable Hermite expansions to approximate the parameter values of the SDE model. The classification problem provides a smooth indicator function, which enables us to recover the underlying temporal parameter modulation of the one-dimensional SDE. As shown by the numerical examples, the clustering approach recovers a hidden functional relationship between the SDE model parameters and an additional auxiliary process. The study builds upon this functional relationship to develop closed-form, non-stationary, data-driven stochastic models for multiscale dynamical systems in real-world applications.

1. Introduction
Clustering approaches are reliable tools for the analysis of complex time-series, with the eventual goal to extract patterns and thereby gain an understanding of the processes in natural and physical science (Zhao and Karypis, 2005; Ayenew et al., 2009; Maione et al., 2019). If one supplements the classification method with a model structure, the clustering becomes model-based (Liao, 2005). This feature enhances the modelling potential and provides a way for out-of-sample predictions. The mathematical model takes different structures depending on the assumptions. One useful requirement is a formulation which allows an analytical study after the model has been identified. Therefore, the estimation of a continuous-type model is preferable because the intermediate discretization step entails further post-processing effort. An additional advantage of the continuous model lies in the further development and embedding of the continuous-type estimates into a multiscale system. This is often an arduous specification, since the clustering method should be general enough to handle a wide variety of time-series dynamics, including non-stationarity and non-linearity.

In a series of works, Horenko (2010a); Metzner et al. (2012); Pospisil et al. (2018) introduced and developed an efficient non-parametric model-based clustering framework which proved to be a successful analysis tool in atmospheric sciences (Horenko, 2010b; O’Kane et al., 2013; Vercauteren and Klein, 2015; Franzke et al., 2015; Risbey et al., 2015; O’Kane et al., 2016; Vercauteren et al., 2019; Boyko and Vercauteren, 2020), molecular dynamics (Gerber and Horenko, 2014) and computational finance (Putzig et al., 2010). The paradigm of the approach is based on two assumptions. The non-stationary time-series is assumed to be represented by a non-autonomous model with time-dependent parameters, and the fluctuation timescales of the parameters are assumed to be much
longer than the fluctuation timescale of the time-series. In analogy, if we consider the oscillations of the model parameters and the time-series in the frequency domain, then we would assume a significant scale separation.

Essentially, the clustering method solves two inverse problems: one to determine the parameters of a mathematical model and one to classify the cluster-respective parameter sets. The number of parameter sets, i.e. the number of clusters, is thereby a hyperparameter, which resolves the parameter variability. The classification is understood in terms of the location in time, simultaneously providing the periods and their optimal, locally-stationary parameter sets. The two inverse problems are coupled through the so-called fitness function, which allows combining both problems in a single minimization procedure. Such a variational framework is based on a regularized non-convex clustering algorithm which allows efficient non-parametric modelling of non-stationary time-series.

The main difference amongst existing variational frameworks similar to the one introduced by Horenko (2010a) lies in the assumed underlying mathematical model. One approach is based on the vector autoregressive factor model with exogenous variables (Horenko, 2010b), where the model structure is described by a discrete-time linear model. Conversely, the approach of Metzner et al. (2012) uses a Markov-regression model, while a Markov chain Monte Carlo method is used by de Wiljes et al. (2013). The variational framework has also been combined with probability distribution-based modelling by utilizing the generalized extreme value distribution (Kaiser and Horenko, 2014). A more general distribution-based clustering is constructed based on the maximum entropy principle and provides a memoryless, multidimensional and non-stationary modelling (Horenko et al., 2020).

Date: February 25, 2021.
The framework is computationally scalable and enables clustering of both temporal and spatial parameter modulations (Kaiser, 2015). Such efficiency is achieved by the suitable utilization of an appropriate finite-dimensional function space (e.g. finite element spaces) to approximate the solution of the classification problem.

Similar to the above-mentioned approaches, here we adopt the variational framework to combine it with the model structure defined by stochastic differential equations (SDE) in a time-continuous formulation. The novelty of our approach is that we have an analytical, closed-form approximation of the likelihood function that we use as the model-observation misfit (or fitness) function; this is in contrast to the Euclidean norm used by Horenko (2010b). The maximum likelihood estimator (MLE) is then the result of minimizing the misfit (or maximizing the fitness). In this way, we avoid an explicit discretization step of the mathematical SDE model, and thereby include the functional modelling of the noise term in the classification algorithm as well, improving the accuracy of the classification-modelling task. In particular, it allows modelling non-linear noise processes and not just additive ones. The MLE is constructed by using the transition density of a general, one-dimensional non-linear SDE. The closed-form expression of the transition density is approximated via Hermite polynomials following Ait-Sahalia (2002). For example, this approach proved successful in modelling complex movement patterns of marine predators and climate transitions (Krumscheid et al., 2015). Here we do not deal with the problem of determining an appropriate functional form of the underlying SDE, but rather with the question: how can one arrive from a clustering approach at a self-contained predictive model?

The goal is to present a data-driven stochastic modelling strategy for the parameterization of unresolved degrees of freedom in a complex system.
We consider a multiscale system composed of a model representing the macroscale, while the fast degrees of freedom are represented by a low-dimensional model. Theoretically, the fast degrees of freedom are driven by the macroscale variables in some unknown way, which we seek to identify through an appropriate model. For this modelling task, we assume access to at least one measurement set including all scales, forming the
training data-set. It is assumed that the resolved scales control the dynamics through the modulation of the model parameters. We seek to describe the measured process with a low-dimensional model, recover its temporal parameter modulation and re-express it by some appropriate characteristic variable of the resolved dynamical system. Thereby it is assumed that there is a hidden functional relationship, which we aim to infer, between the model parameters and the resolved deterministic variable.

The parameter-scaling functions constitute the key aspect in the identification of the data-driven stochastic closures. Their existence in real applications requires exploration, and the functional form is case-specific. Here we test the concept on synthetically generated examples of different complexity and show that the scaling functions are indeed inferable. Our approach provides a way to develop data-driven stochastic parameterization schemes, and we focus on one-dimensional predictive models.

The paper is organized as follows. In section 2, we recall the theoretical background on the parameter estimation of a general, stationary SDE and the variational clustering framework. We explain the coupling of the two inverse problems and the necessary modifications to the original variational clustering algorithm. Section 3 covers new approaches to select the number of clusters and the regularisation parameter, which are required to infer non-stationary models. Section 4 provides a summary of our implementation of the clustering method. The numerical examples in section 5 demonstrate the recovery of the predefined scaling functions, and section 6 covers the discussion and concluding remarks.
2. A concept of data-driven stochastic parameterization
The concept of data-driven stochastic parameterization is introduced in three steps. The first two steps form self-contained procedures. The third step interweaves the two previous ones and aims at deriving a procedure to develop data-driven, application-oriented stochastic closure models. The generalized formulation requires researchers to embed some degree of pre-existing knowledge about the modeled process in terms of some suitable model structure. After a successful system identification, a one-dimensional stochastic equation is obtained, which is coupled to a macroscale model.

2.1. Parameter Identification for Autonomous Stochastic Differential Equations.
Consider a stochastic process of interest that is characterized by a one-dimensional, time-dependent variable $X(t)$. Let $X(t_i)$ be the (experimental) observations at observation times $0 = t_0 < t_1 < \cdots < t_N = T$. For simplicity we assume that the data is recorded at a constant frequency, so that $t_{i+1} - t_i = \Delta t > 0$, $i = 0, \ldots, N-1$. The available data-set consists of $N$ observations. We consider as underlying data-generating process a general one-dimensional Itô SDE of the form
\[
(2.1)\qquad \mathrm{d}X = f(X; \boldsymbol{\Theta})\,\mathrm{d}t + g(X; \boldsymbol{\Theta})\,\mathrm{d}W_t,
\]
over a finite time interval $[0, T]$, $T > 0$. The functional form of the drift $f(\cdot\,;\cdot)$ and diffusion $g(\cdot\,;\cdot)$ is non-linear in $X$ and linear in $\boldsymbol{\Theta}$, depending on some unknown vector-valued parameter $\boldsymbol{\Theta} = [\theta_1, \theta_2, \ldots, \theta_n] \in \Omega_{\boldsymbol{\Theta}} \subset \mathbb{R}^n$, where $\Omega_{\boldsymbol{\Theta}}$ denotes the admissible parameter space. The vector $\boldsymbol{\Theta}$ contains parameters of both the drift and diffusion terms. Furthermore, let $W_t = W(t)$ be an admissible, one-dimensional Wiener process. It is assumed that Eq. (2.1) has a unique strong solution for any $\boldsymbol{\Theta}$. The goal is to infer $\boldsymbol{\Theta}$ given the $N$ observations using the maximum likelihood approach. The discrete-time negative log-likelihood function is defined as (Fuchs, 2013)
\[
(2.2)\qquad l_N(\boldsymbol{\Theta}) := -\sum_{i=0}^{N-1} \ln\!\left[ p_X(\Delta t, X(t_{i+1}) \mid X(t_i); \boldsymbol{\Theta}) \right],
\]
where $p_X$ denotes the conditional transition density function of the stationary process $X$ (see Eq. (2.1)). From now on we refer to $l_N$ as the fitness function, to match the definition of the variational framework of Horenko (2010a). The optimum of the fitness function with respect to $\boldsymbol{\Theta}$ yields the parameter estimate
\[
(2.3)\qquad \boldsymbol{\Theta}^* \in \operatorname*{arg\,min}_{\boldsymbol{\Theta} \in \Omega_{\boldsymbol{\Theta}}} l_N(\boldsymbol{\Theta}),
\]
which is the MLE. It is known that the MLE converges to the true parameter $\boldsymbol{\Theta}$ in probability as $N \to \infty$ for a fixed $\Delta t$. Furthermore, it is consistent and asymptotically normal regardless of the discretization (Dacunha-Castelle and Florens-Zmirou, 1986),
\[
(2.4)\qquad \sqrt{N}\,(\boldsymbol{\Theta}^* - \boldsymbol{\Theta}) \to_d \mathcal{N}(0, \sigma^2_{\mathrm{MLE}}),
\]
where $\sigma^2_{\mathrm{MLE}}$ is the variance of the MLE $\boldsymbol{\Theta}^*$. This estimator converges in distribution ($\to_d$) to the unknown parameter at a rate $1/\sqrt{N}$.

In general, the conditional transition density function is not known, and to approximate it we adapt the closed-form expansion following Ait-Sahalia (1999, 2002). To obtain the closed-form approximation, the process $X$ is transformed into a process $Z$ in two steps. The transition density of $Z$, namely $p_Z$, is expanded via Hermite polynomials, and the needed density $p_X$ is obtained by reverting the transformations.
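To make the estimator (2.2)-(2.4) concrete, the following sketch fits an Ornstein-Uhlenbeck-type model to simulated data. For brevity it replaces the Hermite-expansion density $p_X$ developed below with the simpler Gaussian (Euler-Maruyama) transition density; the model, the parameter values and all names are illustrative choices of ours, not those used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
dt, n = 0.01, 20000
theta_true = (1.5, 0.4)                 # (drift rate, diffusion constant)

# simulate dX = -theta1*X dt + theta2 dW with the Euler-Maruyama scheme
x = np.empty(n)
x[0] = 0.0
noise = rng.standard_normal(n - 1)
for i in range(n - 1):
    x[i+1] = x[i] - theta_true[0]*x[i]*dt + theta_true[1]*np.sqrt(dt)*noise[i]

def neg_log_lik(theta, x, dt):
    """Discrete-time negative log-likelihood l_N, Eq. (2.2), with the
    Gaussian (Euler) transition density standing in for the
    Hermite-expansion density p_X."""
    th1, th2 = theta
    mean = x[:-1] - th1*x[:-1]*dt       # one-step conditional mean
    var = th2**2 * dt                   # one-step conditional variance
    return 0.5*np.sum((x[1:] - mean)**2/var + np.log(2*np.pi*var))

# the MLE, Eq. (2.3), via numerical minimization over the admissible space
res = minimize(neg_log_lik, x0=[0.5, 1.0], args=(x, dt),
               bounds=[(1e-6, None), (1e-6, None)])
theta_hat = res.x
```

On a sample of this size the estimate is close to the true parameter vector, and its spread shrinks at the $1/\sqrt{N}$ rate stated in Eq. (2.4).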
The first transformation step $Y := F(X)$ standardizes the process $X$, such that $Y$ has unit diffusion:
\[
(2.5)\qquad \mathrm{d}Y = \mu(Y; \boldsymbol{\Theta})\,\mathrm{d}t + \mathrm{d}W_t,
\]
where the drift $\mu$ is given as
\[
(2.6)\qquad \mu(Y; \boldsymbol{\Theta}) = \frac{f(F^{-1}(Y); \boldsymbol{\Theta})}{g(F^{-1}(Y); \boldsymbol{\Theta})} - \frac{1}{2}\,\frac{\mathrm{d}g(F^{-1}(Y); \boldsymbol{\Theta})}{\mathrm{d}X}.
\]
The Lamperti transform $F$ is
\[
(2.7)\qquad F(X) := \int^X g(v; \boldsymbol{\Theta})^{-1}\,\mathrm{d}v,
\]
such that the inverse $F^{-1}(Y)$ exists. The second transformation is introduced to normalize the increment of $Y$:
\[
(2.8)\qquad Z := (Y - y_0)/\sqrt{\Delta t},
\]
where $y_0$ is the conditioning value $Y(t_i)$ of the transition $Y(t_{i+1}) \mid Y(t_i)$. The factor $1/\sqrt{\Delta t}$ prevents the transition density from becoming a Dirac delta function for small $\Delta t$, and the shift operation sets the initial value of the process $Z$ to $z_0 = 0$ (Fuchs, 2013, ch. 6.3).

The transition density $p_Z$ is expanded around a standard normal density and truncated by keeping $J + 1$ terms:
\[
(2.9)\qquad p_Z(\Delta t, z \mid y_0; \boldsymbol{\Theta}) = \phi(z) \sum_{j=0}^{J} \eta_j(y_0, \Delta t; \boldsymbol{\Theta})\, H_j(z),
\]
where $\phi(z) = e^{-z^2/2}/\sqrt{2\pi}$ is the weight function, $\eta_j(y_0, \Delta t; \boldsymbol{\Theta})$ are the coefficients and $H_j(z)$ the Hermite polynomials (see Appendix D for the definition).

By reverting the transformation from $Z$ to $X$ we obtain the required density $p_X$:
\[
(2.10)\qquad p_X(\Delta t, X(t_{i+1}) \mid X(t_i); \boldsymbol{\Theta}) = \frac{1}{g(X(t_{i+1}); \boldsymbol{\Theta})\sqrt{\Delta t}}\, p_Z(\Delta t, z \mid y_0; \boldsymbol{\Theta}).
\]
The truncation error for the coefficients is
\[
(2.11)\qquad \eta_j^{(M)}(y_0 \mid \Delta t; \boldsymbol{\Theta}) = \eta_j(y_0, \Delta t; \boldsymbol{\Theta}) + \mathcal{O}(\Delta t^{M+1}), \quad j = 1, \ldots, J; \; M = J/2,
\]
where we choose $J = 6$ throughout this work, which is a sufficiently accurate approximation for the proposed clustering approach.

This subsection is closed by providing a summarizing abstract sketch that represents the scope of this approach, where a time-series is modeled with an autonomous SDE (2.1) (see Fig. 1). In the next subsection, we advance the modelling with a more detailed approximation.

2.2. The Non-Parametric Clustering Methodology.
The previous section covers the parameter estimation of a general autonomous SDE, assuming the values of its parameters stay constant in time. If we consider that the data come from an experiment, then this assumption is more valid for measurements that are carried out in the laboratory under controlled conditions. By contrast, experiments with uncontrollable conditions are often hard to reproduce. Consider as an example the atmospheric boundary layer. The duration of a field measurement campaign may extend over several months, providing a time-series with a wide range of dynamical regimes (Sun et al., 2012; Mahrt, 2014; Vercauteren et al., 2019). It is, therefore, necessary to cluster the data to be able to
Figure 1.
An illustration of a data-driven modelling concept using an SDE with constant parameters.

isolate and understand the associated dynamics. If one decides to model the non-stationary process with an SDE, then the circumstance of time-dependent parameters must be taken into account:
\[
(2.12)\qquad \mathrm{d}X = f(X; \boldsymbol{\Theta}(t))\,\mathrm{d}t + g(X; \boldsymbol{\Theta}(t))\,\mathrm{d}W_t,
\]
\[
(2.13)\qquad \boldsymbol{\Theta}: [0, T] \to \mathbb{R}^n, \quad \boldsymbol{\Theta}(t) = [\theta_1(t), \theta_2(t), \ldots, \theta_n(t)] \in \Omega_{\boldsymbol{\Theta}}.
\]
To estimate the time-varying parameters we use the non-parametric clustering approach based on the finite element method with smooth regularization (FEM-H1) introduced by Horenko (2010a) and further developed by Pospisil et al. (2018).

The problem of parameter estimation (see Eq. (2.3)) becomes ill-posed if we try to estimate a set of time-dependent parameters, because we then try to solve a high-dimensional problem. To overcome this difficulty, Horenko (2010a) suggests regularizing the fitness function by assuming that the process $X(t)$ varies much faster than the parameter function $\boldsymbol{\Theta}(t)$. Formally, this means expressing the fitness function as a convex combination of $K$ locally-optimal fitness functions,
\[
(2.14)\qquad l_N(\boldsymbol{\Theta}(t)) := \sum_{i=1}^{N-1} \sum_{k=1}^{K} \gamma_k(t_i)\, f(t_i; \bar{\theta}_k), \quad K \geq 2,
\]
where $f(t_i; \bar{\theta}_k)$ is the time-dependent fitness function formed by the time-averaged parameter value $\bar{\theta}_k$. The overbar denotes the averaged value, which is assumed to be constant. The value of $K$ indicates the number of clusters or sub-models. The functions $\gamma_k(t)$ are the affiliation functions with specific properties explained later. They are unknown and will be estimated in an additional step, also described later. The function $f(\cdot\,;\cdot)$ is taken to be equivalent to Eq. (2.2) and is defined as
\[
(2.15)\qquad f(t_i; \bar{\theta}_k) := -\ln\!\left[ p_X(\Delta t, X(t_{i+1}) \mid X(t_i); \bar{\theta}_k) \right], \quad i = 1, \ldots, N-1,
\]
where the index $i$ iterates over the observations.
We keep the sample size of $f(\cdot\,;\cdot)$ equal to the sample size of $X$ by setting the value at the last timestep $N$ equal to the first one, $f(t_N; \bar{\theta}_k) = f(t_1; \bar{\theta}_k)$. The affiliation functions $\gamma_k(t)$ contain information on the location of the minima of each of the $k$-th $f(\cdot\,;\cdot)$ and are grouped in the so-called affiliation vector $\boldsymbol{\Gamma}(t) = [\gamma_1(t), \gamma_2(t), \ldots, \gamma_K(t)] \in \mathbb{R}^K$. The functions $\gamma_k(t)$ satisfy the convexity constraints forming a partition of unity,
\[
(2.16)\qquad \sum_{k=1}^{K} \gamma_k(t) = 1, \quad \forall t \in [0, T],
\]
\[
(2.17)\qquad 0 \leq \gamma_k(t) \leq 1, \quad \forall t \in [0, T], \; \forall k.
\]
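The convex combination (2.14) together with the constraints (2.16)-(2.17) can be sketched with illustrative arrays (the fitness values and affiliations below are synthetic, not produced by the SDE machinery):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 1000, 3

# synthetic per-cluster fitness values f(t_i; theta_k), playing the role of
# the negative log-densities in Eq. (2.15)
f = rng.random((K, N))

# affiliation functions gamma_k(t_i): the columns of a softmax satisfy the
# convexity constraints (2.16)-(2.17) by construction
logits = rng.standard_normal((K, N))
gamma = np.exp(logits) / np.exp(logits).sum(axis=0)

assert np.allclose(gamma.sum(axis=0), 1.0)          # partition of unity (2.16)
assert (gamma >= 0).all() and (gamma <= 1).all()    # bounds (2.17)

# averaged clustering functional, Eq. (2.14)
L_N = float(np.sum(gamma * f))

# for fixed parameters and without regularization, the optimum over the
# affiliations is the hard indicator of the pointwise minimum
L_hard = float(f.min(axis=0).sum())
```

The comparison of `L_hard` and `L_N` illustrates why, in the absence of the smoothness penalty introduced below, the minimizing affiliations degenerate to 0/1 indicator functions.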
The vector $\boldsymbol{\Gamma}(t)$ represents a vector of time-dependent weights. The functions $\gamma_k(t)$ define the activity of the appropriate $k$-th SDE at a given time $t$ (Pospisil et al., 2018), or serve as a $K$-dimensional probability vector, which expresses the certainty to observe the $k$-th stationary model at time $t$. For example, assuming that we know the affiliation vector in advance, the temporal evolution of one parameter function is
\[
(2.18)\qquad \theta(t) = \sum_{k=1}^{K} \gamma_k(t)\, \bar{\theta}_k + \varepsilon(t),
\]
where $\theta(t)$ is approximated by $K$ constant values $\bar{\theta}_k$ (see Fig. 2) and $\varepsilon(t)$ is some approximation error.

Figure 2.
Sketch explaining the approximation of one of the time-dependent parameter functions with piecewise constant subdivisions.

As an example, Figure 2 illustrates the approximation of one non-stationary parameter with $K = 3$. The figure indicates a variable location and period of each subdivision, which depends on the evolution of $\theta(t)$ and the number of clusters used. This information is encoded in the $\boldsymbol{\Gamma}(t)$ function. For $K \to N$ the approximation of $\theta(t)$ gets better and causes the method to overfit, because more partitions approximate the function $\theta(t)$. An additional consequence is that the uncertainty of each estimate of $\bar{\theta}_k$ grows (see Eq. (2.4)), because the amount of data available for each cluster (i.e. each SDE model) reduces.

The averaged clustering functional $L_N$, combining the assumptions (2.14), (2.16), (2.17) and the fitness function (2.15), becomes
\[
(2.19)\qquad L_N(\boldsymbol{\Gamma}(t), \bar{\theta}_1, \ldots, \bar{\theta}_K) = -\sum_{i=1}^{N-1} \sum_{k=1}^{K} \gamma_k(t_i) \ln\!\left[ p_X(\Delta t, X(t_{i+1}) \mid X(t_i); \bar{\theta}_k) \right].
\]
Applying the regularization and the discretization with the finite element method, the functional becomes
\[
(2.20)\qquad L^{\epsilon}_{\widehat{N}}(\boldsymbol{\Gamma}(\tau), \boldsymbol{\Theta}, \epsilon) = \sum_{k=1}^{K} \left[ b(\tau, \bar{\theta}_k)^{\top} \gamma_k(\tau) + \epsilon\, \gamma_k(\tau)^{\top} H\, \gamma_k(\tau) \right] \to \min_{\boldsymbol{\Gamma}(\tau),\, \boldsymbol{\Theta}},
\]
where the vector $\boldsymbol{\Gamma}(\tau) = [\gamma_1(\tau), \gamma_2(\tau), \ldots, \gamma_K(\tau)] \in \mathbb{R}^{K \times \widehat{N}}$ is the coarsened version of $\boldsymbol{\Gamma}(t)$. Here $\tau$ denotes the $t$-grid reduced by a factor $\alpha \in (0, 1]$, and $\widehat{N}$ denotes the number of samples after the discretization of the fitness function (2.15). The first term $b(\cdot\,,\cdot)^{\top} \gamma_k(\cdot)$ in Eq. (2.20) is the discretized version of $L_N(\cdot\,,\cdot)$ (see Eq. (2.19)) and is formulated as a dot product. The symbol $\top$ indicates the transpose of the corresponding vector. The vector $b(\tau, \bar{\theta}_k)$ denotes the discretized model fitness function,
\[
(2.21)\qquad b(\tau, \bar{\theta}_k) = \left( \int_{\tau_0}^{\tau_1} f(t, \bar{\theta}_k)\, v_1(t)\,\mathrm{d}\tau, \; \ldots, \; \int_{\tau_{\widehat{N}-1}}^{\tau_{\widehat{N}}} f(t, \bar{\theta}_k)\, v_{\widehat{N}}(t)\,\mathrm{d}\tau \right),
\]
where $v_j(t)$ is the hat function and $f(\cdot\,,\cdot)$ the function in Eq. (2.15). The operation in Eq. (2.21) is denoted as the reduction; it projects a vector into the finite element space. Consequently, the function $f(\cdot\,,\cdot)$ changes its sampling rate from $\Delta t$ to a larger mesh size $\Delta\tau$. The second term $\epsilon\, \gamma_k(\tau)^{\top} H \gamma_k(\tau)$ penalizes the functional by controlling the regularity of the vector $\boldsymbol{\Gamma}(\tau)$, which is required to obtain a valid solution. We use the Lagrange P1-elements, which are also known as the hat functions. For these elements, the stiffness matrix $H_k$ assembles to
\[
H_k = \begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix} \in \mathbb{R}^{\widehat{N} \times \widehat{N}}, \quad k = 1, \ldots, K.
\]
The regularized variational problem Eq. (2.20) is solved iteratively:
(1) Fix $\boldsymbol{\Gamma}(t)$ and find $\boldsymbol{\Theta}^*$ by solving $L_N(\boldsymbol{\Gamma}(t), \boldsymbol{\Theta}) \to \min_{\boldsymbol{\Theta}}$ (2.19);
(2) Fix $\boldsymbol{\Theta}$ and find $\boldsymbol{\Gamma}^*(\tau)$ by solving $L^{\epsilon}_{\widehat{N}}(\boldsymbol{\Gamma}(\tau), \boldsymbol{\Theta}, \epsilon) \to \min_{\boldsymbol{\Gamma}(\tau)}$ (2.20).
The second step is a problem with a quadratic cost function formed by linear equality and inequality constraints, which is reformulated from Eq. (2.20) by defining column vectors from the $K$ clusters. The so-called vector of modeling errors $B(\tau)$ (Pospisil et al., 2018) is constructed using the reduced local fitness function Eq. (2.21),
\[
(2.22)\qquad B(\tau) := [b(\tau, \bar{\theta}_1), b(\tau, \bar{\theta}_2), \ldots, b(\tau, \bar{\theta}_K)] \in \mathbb{R}^{K\widehat{N}},
\]
and the block-diagonal matrix
\[
A = \begin{pmatrix} H_1 & & \\ & \ddots & \\ & & H_K \end{pmatrix} \in \mathbb{R}^{K\widehat{N} \times K\widehat{N}}.
\]
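A minimal sketch of the two discrete objects just introduced, the stiffness matrix $H_k$ and the reduction (2.21), under our own simplifications (uniform grids, a plain Riemann-sum quadrature, and hypothetical function names):

```python
import numpy as np

def stiffness_p1(n_hat):
    """Assemble the P1 (hat-function) stiffness matrix H_k: tridiagonal with
    2 on the diagonal, -1 off-diagonal, and 1 in the corners (natural
    boundary conditions)."""
    H = 2.0*np.eye(n_hat) - np.eye(n_hat, k=1) - np.eye(n_hat, k=-1)
    H[0, 0] = H[-1, -1] = 1.0
    return H

def reduce_fitness(f_t, t, tau):
    """Project the fine-grid fitness f(t_i; theta_k) onto the hat functions
    v_j centred at the coarse nodes tau_j, approximating Eq. (2.21) with a
    simple Riemann sum (our quadrature choice)."""
    dt = t[1] - t[0]
    dtau = tau[1] - tau[0]
    b = np.empty(len(tau))
    for j, tj in enumerate(tau):
        v_j = np.clip(1.0 - np.abs(t - tj)/dtau, 0.0, None)  # hat function
        b[j] = np.sum(f_t * v_j) * dt
    return b

# toy usage: reduce a smooth "fitness" signal from 1001 to 51 nodes (alpha = 1/20)
t = np.linspace(0.0, 1.0, 1001)
tau = np.linspace(0.0, 1.0, 51)
b = reduce_fitness(np.sin(2*np.pi*t), t, tau)
H = stiffness_p1(len(tau))
```

For interior nodes the reduction of a smooth signal is close to the signal value at the node times $\Delta\tau$, since each hat function integrates to $\Delta\tau$.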
From Eq. (2.20) the block-structured quadratic programming (QP) problem is
\[
(2.23)\qquad \boldsymbol{\Gamma}^*(\tau) := \operatorname*{arg\,min}_{\boldsymbol{\Gamma} \in \Omega_{\boldsymbol{\Gamma}}} L^{\epsilon}_{\widehat{N}}(\boldsymbol{\Gamma}(\tau), \boldsymbol{\Theta}, \epsilon),
\]
\[
(2.24)\qquad L^{\epsilon}_{\widehat{N}} := \frac{1}{\widehat{N}}\, B^{\top}(\tau)\, \boldsymbol{\Gamma}(\tau) + \frac{1}{\widehat{N}}\, \epsilon\, \boldsymbol{\Gamma}^{\top}(\tau)\, A\, \boldsymbol{\Gamma}(\tau),
\]
\[
(2.25)\qquad \Omega_{\boldsymbol{\Gamma}} := \Big\{ \boldsymbol{\Gamma}(\tau) \in \mathbb{R}^{K\widehat{N}} : \boldsymbol{\Gamma}(\tau) \geq 0 \;\wedge\; \sum_{k=1}^{K} \gamma_k(\tau) = 1 \;\; \forall t \in [0, T] \Big\},
\]
where $1/\widehat{N}$ is a scaling coefficient to avoid large numerical values. To solve the QP, we use the adapted spectral projected gradient method developed by Pospisil et al. (2018). Their method enjoys high granularity of parallelization and is suitable to run on GPU clusters. Furthermore, it is efficient and outperforms traditional denoising algorithms in terms of the signal-to-noise ratio.

The FEM offers a way to reduce the size of the QP problem through the reduction of the samples, $\widehat{N} < N$. However, the accuracy in determining $\boldsymbol{\Theta}^*$ depends on the number of available observations, as it scales with $1/\sqrt{N}$, and this problem needs, therefore, to be solved at the smallest available time step $\Delta t$. If we now reduce the size of the QP problem (see Eq. (2.21)), the accuracy of the estimate $\boldsymbol{\Theta}^*$ will decrease. Consequently, we cannot solve both problems at the same timestep $\Delta\tau$ and need to interpolate the solution of the QP in each iteration step. This is done as follows:
(1) Fix $\boldsymbol{\Gamma}(t)$ and find $\boldsymbol{\Theta}^*$ by solving $L_N(\boldsymbol{\Gamma}(t), \boldsymbol{\Theta}) \to \min_{\boldsymbol{\Theta}}$;
(2) reduce the fitness function to the $\tau$-grid;
(3) Fix $\boldsymbol{\Theta}$ and find $\boldsymbol{\Gamma}^*(\tau)$ by solving $L^{\epsilon}_{\widehat{N}}(\boldsymbol{\Gamma}(\tau), \boldsymbol{\Theta}, \epsilon) \to \min_{\boldsymbol{\Gamma}(\tau)}$;
(4) interpolate $\boldsymbol{\Gamma}^*(\tau)$ to the $t$-grid step size $\Delta t$.
In that way, we keep the stepsize $\Delta t$ of the Theta solver, step (1), separated from the stepsize $\Delta\tau$ of the Gamma solver, step (3), ensuring an accurate approximation of $\boldsymbol{\Theta}$ while keeping the benefit of the FEM.
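The QP (2.23)-(2.25) can be sketched with a plain projected-gradient iteration. The paper itself uses the adapted spectral projected gradient method of Pospisil et al. (2018), so the fixed-step variant and the simplex projection below are our illustrative substitutes:

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of each column of y onto the probability
    simplex, enforcing the constraints (2.16)-(2.17) node by node."""
    K, n = y.shape
    u = -np.sort(-y, axis=0)                      # sort each column descending
    css = np.cumsum(u, axis=0) - 1.0
    idx = np.arange(1, K + 1)[:, None]
    cond = u - css / idx > 0
    rho = K - 1 - np.argmax(cond[::-1], axis=0)   # last index where cond holds
    theta = css[rho, np.arange(n)] / (rho + 1.0)
    return np.maximum(y - theta, 0.0)

def gamma_step(b, H, eps, n_iter=500):
    """Projected-gradient sketch for the QP (2.23)-(2.25): minimize
    sum_k [ b_k . gamma_k + eps * gamma_k . H . gamma_k ] with each node's
    affiliations on the simplex."""
    K, n = b.shape
    lr = 1.0 / (1.0 + 2.0*eps*np.abs(H).sum(axis=1).max())  # safe step size
    gamma = np.full((K, n), 1.0/K)
    for _ in range(n_iter):
        grad = b + 2.0*eps*(gamma @ H)            # H is symmetric
        gamma = project_simplex(gamma - lr*grad)
    return gamma

# toy usage: two clusters, each fitting one half of the tau-grid better
n = 40
i = np.arange(n)
H = 2.0*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
H[0, 0] = H[-1, -1] = 1.0
b = np.vstack([(i >= n//2).astype(float), (i < n//2).astype(float)])
gamma = gamma_step(b, H, eps=0.1)
```

Away from the regime boundary the affiliations saturate at 0 and 1, while the regularization term smooths the transition, which is the qualitative behaviour discussed in Section 3.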
Algorithm 1, which solves the variational problem (2.20), is the subspace clustering algorithm of Horenko (2010a) and is reproduced with the proposed modification.

After a successful application of the clustering method, one has several directions for further research. For instance, the vector $\boldsymbol{\Gamma}^*(t)$ provides a classifier and can be used to analyze quantities of interest across the clusters. Instead, we demonstrate a way in which the vector $\boldsymbol{\Gamma}^*(t)$ can be used to construct stochastic closures to effectively model unresolved degrees of freedom as modulated by resolved variables.
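The alternating structure of the procedure (stated as Algorithm 1 below) can be condensed into a runnable toy version. To keep it short, the SDE likelihood is replaced by a Gaussian location model and the regularized QP of the Gamma-step by a pointwise argmin followed by smoothing and renormalisation; both substitutions are ours and only mimic the roles of the original steps.

```python
import numpy as np

rng = np.random.default_rng(2)

# toy data: a mean-switching Gaussian signal standing in for a regime-switching SDE
N, K = 2000, 2
true_regime = (np.arange(N) // 500) % 2           # four blocks of 500 samples
x = rng.standard_normal(N)*0.3 + np.where(true_regime == 0, -1.0, 1.0)

gamma = rng.dirichlet(np.ones(K), size=N).T       # random initial affiliations
prev = np.inf
for _ in range(50):
    # Theta-step: weighted MLE of each cluster's parameters (here: mean, variance)
    w = gamma.sum(axis=1)
    mu = (gamma * x).sum(axis=1) / w
    var = (gamma * (x - mu[:, None])**2).sum(axis=1) / w
    # local fitness f(t_i; theta_k): negative Gaussian log-density, cf. Eq. (2.15)
    f = 0.5*np.log(2*np.pi*var[:, None]) + 0.5*(x - mu[:, None])**2 / var[:, None]
    # Gamma-step stand-in: pointwise argmin, then smoothing and renormalisation
    hard = (f == f.min(axis=0)).astype(float)
    kernel = np.ones(25) / 25.0
    gamma = np.vstack([np.convolve(h, kernel, mode="same") for h in hard])
    gamma /= gamma.sum(axis=0)
    L = float((gamma * f).sum())                  # cf. the functional (2.19)
    if abs(prev - L) < 1e-8:
        break
    prev = L

labels = gamma.argmax(axis=0)                     # recovered regime sequence
```

On this well-separated toy signal the recovered labels match the true regime sequence up to a permutation of the cluster indices.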
Algorithm 1:
The adapted subspace algorithm for SDE models. The main change to the original version of the algorithm is in line 9, where the functional $L_N$ is formulated for a general SDE. Furthermore, lines 14 and 18 are added to maintain the accuracy of the estimated $\boldsymbol{\Theta}^*$. The reduced variables are indicated by the time variable $\tau$ and the non-reduced ones by $t$, respectively. See Appendix A for more details.

Input: time series $X$, number of clusters $K$, regularisation value $\epsilon$ and the reduction value $\alpha$
Output: $\boldsymbol{\Gamma}^*(t)$ and $\boldsymbol{\Theta}^*$
1   Generate random initial $\boldsymbol{\Gamma}(t)$ satisfying (2.16) and (2.17)
2   /* We need the full $\boldsymbol{\Gamma}(t)$ to find $\boldsymbol{\Theta}$ and the reduced $\boldsymbol{\Gamma}(\tau)$ to solve the QP problem */
3   Reduce the initial vector $\boldsymbol{\Gamma}(t)$ to $\boldsymbol{\Gamma}(\tau)$ on the coarser grid $\Delta\tau$
4   Set iteration counter $j = 0$ for the main optimization loop
5   /* main optimization loop */
6   while $\big| L^{\epsilon}_{\widehat{N}}(\boldsymbol{\Gamma}^{j+1}(\tau), \boldsymbol{\Theta}^{j+1}, \epsilon) - L^{\epsilon}_{\widehat{N}}(\boldsymbol{\Gamma}^{j}(\tau), \boldsymbol{\Theta}^{j}, \epsilon) \big| > \mathrm{tol}$ do
7       /* Estimate $\boldsymbol{\Theta}^{j}$ from the full $\boldsymbol{\Gamma}^{j}(t)$, applying $K$ times an unconstrained minimization (this ensures maximum accuracy for the estimate of $\boldsymbol{\Theta}^{j}$) */
8       for $k \gets 1$ to $K$ do
9           $\bar{\theta}^{j+1}_k = \operatorname{arg\,min}_{\theta} L_N(\gamma^{j}_k(t), \theta)$   // see Eq. (2.19)
10      end for
11      /* From the found $\bar{\theta}^{j+1}_k$ compute the fitness functions $f^{j+1}(t, \bar{\theta}^{j+1}_k)$, or save them from the step in line 9 */
12      Compute $f^{j+1}(t, \bar{\theta}^{j+1}_k)$, if not saved from line 9
13      /* Reduce $f^{j+1}(t, \bar{\theta}^{j+1}_k)$ because $\boldsymbol{\Gamma}(\tau)$ will be estimated on the coarser grid $\Delta\tau$ */
14      Compute $b^{j+1}(\tau, \bar{\theta}^{j+1}_k) = \mathrm{reduction}(f^{j+1}(t, \bar{\theta}^{j+1}_k), \Delta\tau, \alpha)$   // see Eq. (2.21)
15      /* Apply quadratic programming to solve the constrained minimization */
16      $\boldsymbol{\Gamma}^{j+1}(\tau) = \operatorname{arg\,min}_{\boldsymbol{\Gamma}} L^{\epsilon}_{\widehat{N}}(\boldsymbol{\Gamma}(\tau), b^{j+1}(\tau, \bar{\theta}^{j+1}_k), \epsilon)$ satisfying (2.16) and (2.17)
17      /* Interpolate $\boldsymbol{\Gamma}^{j+1}(\tau)$ onto the original grid with step size $\Delta t$ */
18      $\boldsymbol{\Gamma}^{j+1}(t) = \mathrm{interpolate}(\boldsymbol{\Gamma}^{j+1}(\tau), \Delta t)$
19      /* Advance the main loop counter $j$ */
20      $j = j + 1$
21  end while
22  return $\boldsymbol{\Gamma}^{j+1}(t)$, $\boldsymbol{\Theta}^{j+1}$

2.3. Stochastic Closure.
One difficulty in finding a stochastic closure is the missing information regarding the future realization of the vector $\boldsymbol{\Gamma}^*(t)$, because it is an estimate that is based on the available data. Although we have assumed that we observe a sufficient amount of data to capture the full range of regime dynamics, we are incapable of predicting the future state of the vector $\boldsymbol{\Gamma}(t)$ without additional assumptions and modeling work. One can associate the affiliation function with a Markov process, assuming that it is homogeneous and stationary (Metzner et al., 2012).
Alternatively, Kaiser et al. (2017) applied a neural network within the clustering framework to predict the velocity components of internal gravity waves beyond the training data-set.

We show a different approach by regressing the clustered parameter values onto some known auxiliary process $u(t)$. The limitation is then that it may not always be possible to find such a process. However, if it exists, no assumptions are required. This approach is attractive since the process $u(t)$ should exist on a larger scale than the process $X(t)$ and thus be predictable.

For simplicity, suppose we are presented with an observation of a slowly evolving, known process $u(t)$ and a fast non-stationary, stochastic process $X(t)$ described by some non-linear SDE. Furthermore, suppose that we have access to some expert knowledge, intuition or first-principle derivation suggesting that $u(t)$ is controlling the non-stationary process $X(t)$ by modulating its model parameters. The closure approach is then the following. First, one estimates the time-varying parameters of the SDE
\[
(2.26)\qquad \mathrm{d}X = f(X; \boldsymbol{\Theta}(t))\,\mathrm{d}t + g(X; \boldsymbol{\Theta}(t))\,\mathrm{d}W_t,
\]
\[
(2.27)\qquad \boldsymbol{\Theta}: [0, T] \to \mathbb{R}^n, \quad \boldsymbol{\Theta}(t) = [\theta_1(t), \theta_2(t), \ldots, \theta_n(t)] \in \Omega_{\boldsymbol{\Theta}},
\]
with the clustering approach presented in section 2.2, that is, by solving the variational problem
\[
(2.28)\qquad (\boldsymbol{\Gamma}^*, \boldsymbol{\Theta}^*) \in \operatorname*{arg\,min}_{\boldsymbol{\Theta} \in \Omega_{\boldsymbol{\Theta}},\; \boldsymbol{\Gamma} \in \Omega_{\boldsymbol{\Gamma}}} L^{\epsilon}_{\widehat{N}}(\boldsymbol{\Gamma}, \boldsymbol{\Theta}, \epsilon),
\]
where the number of clusters is $K \geq 2$. The functional $L^{\epsilon}_{\widehat{N}}$ is defined in Eq. (2.20).

Second, with the help of $\boldsymbol{\Gamma}^*(t)$, one estimates the cluster-respective values of the process $u(t)$ by computing the weighted average
\[
(2.29)\qquad \bar{u}_k = \frac{\sum_{i=1}^{N} u(t_i)\, \gamma_k(t_i)}{\sum_{i=1}^{N} \gamma_k(t_i)}, \quad k = 1, \ldots, K,
\]
where $\bar{u}_k$ denotes the averaged value of the auxiliary process $u(t)$ in cluster $k$. Recall that $\gamma_k(t_i)$ is the affiliation function for cluster $k$ (see the color-coded regions in Fig. 8a). In the last step, one performs a regression analysis to parameterize the parameter variability with some appropriate functions $S_m$, which map the values of $u(t)$ to $\theta_m(t)$:
\[
(2.30)\qquad S_m : \bar{u}_k \mapsto \bar{\theta}_{k,m}, \quad \bar{\theta}_k = [\bar{\theta}_{k,1}, \bar{\theta}_{k,2}, \ldots, \bar{\theta}_{k,n}], \quad k = 1, \ldots, K, \; m = 1, \ldots, n.
\]
There is no restriction on the differentiability of the functions $S_m$; they can be parameterized by piecewise-defined functions. The parameterization is performed for each element of the parameter vector $\boldsymbol{\Theta}$ separately. The vector of scaling functions $S = [S_1, S_2, \ldots, S_n] \in \mathbb{R}^n$ defines the modulation of the model parameters through the process $u(t)$. The complexity of the relation is application-dependent and needs further exploration. If all scaling functions can be found, we obtain a closed system to predict the process $X(t)$:
\[
(2.31)\qquad \mathrm{d}X = f(X; S(u(t)))\,\mathrm{d}t + g(X; S(u(t)))\,\mathrm{d}W_t,
\]
where $u(t)$ is known for every $t$. The approach allows modelling non-equilibrium processes, characterized by the non-linearity in $f$ and $g$, in cases where the non-stationarity of the process is difficult to understand. Figure 3 summarizes the idea of this concept. In practice, the functional relation between the process $u(t)$ and the modulated parameters of the SDE $\boldsymbol{\Theta}(t)$ is assumed to hold for the estimated parameter range, as shown in Section 5.
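Steps (2.29)-(2.30) can be sketched end-to-end on synthetic inputs; the affiliations, the auxiliary process and the hidden scaling law $\theta = 0.5u^2$ below are all our toy choices:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 3000, 4

# illustrative inputs: a slow auxiliary process u(t) and crisp affiliations gamma
t = np.linspace(0.0, 1.0, N)
u = 2.0 + np.sin(2*np.pi*t)                         # slow auxiliary process
labels = np.digitize(u, np.quantile(u, [0.25, 0.5, 0.75]))
gamma = np.eye(K)[labels].T                         # gamma_k(t_i), here 0/1-valued

# cluster-weighted averages of u, Eq. (2.29)
u_bar = (gamma * u).sum(axis=1) / gamma.sum(axis=1)

# pretend the clustering returned one drift parameter per cluster, generated
# from the hidden scaling law theta = 0.5*u^2 (our toy stand-in for Theta*)
theta_bar = 0.5*u_bar**2

# regression step (2.30): fit the scaling function S with a quadratic ansatz
S = np.poly1d(np.polyfit(u_bar, theta_bar, deg=2))

# closed model (2.31): the parameter is now available for any value of u(t)
theta_t = S(u)
```

Because the toy parameters lie exactly on a quadratic, the fitted $S$ reproduces the hidden law over the whole range of $u(t)$; with real estimates, the regression only holds over the sampled parameter range, as noted above.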
Figure 3.
The modeling approach with FEM-H1 regularization of the SDE. The temporal modulation of the parameters is parameterized through the slow-scale process $u$. The functional relation $S(u)$ is hidden and needs to be discovered a posteriori.

3. Selection of the Hyperparameters
The selection of a suitable model includes the determination of a suitable regularization parameter $\epsilon$, controlling the smoothness of the affiliation vector, the number of clusters $K$, and the functional forms $f(\cdot\,;\cdot)$ and $g(\cdot\,;\cdot)$. A variety of strategies with examples is described, for instance, in (Horenko, 2010a,b; Metzner et al., 2012), which are partially based on identifying several criteria simultaneously. Here we present new approaches to select $\epsilon$ and $K$ independently from each other, providing a robust estimation strategy. In particular, our approach for selecting the parameter $\epsilon$ accounts indirectly for the frequency content of the auxiliary process $u(t)$.

3.1. Optimal Regularization Parameter.
Horenko (2010a) analyzed the impact of the param-eter (cid:15) on the regularity of the cluster affiliation function Γ ∗ ( t ). His findings show that the optimalparameter (cid:15) is characterized by a sharp separation between each of the functions γ k ( t ), such thatthe regime transitions are as short as possible. Contrary, poor partitioning is present if two or morefunctions γ k ( t ) occupy the same period. We recall, that the functions γ k ( t ) obey the convexitycondition (see Eq. (2.16); (2.16)) and if several are active simultaneously then none of the modelparameters fits the data sufficiently well. Such results indicate that the number of clusters is settoo high or the model structure is not suitable for the data under consideration. A key aspect ofchoosing a proper value for (cid:15) is in monitoring the regularity of the affiliation function Γ ∗ ( t ). Figure4 shows a portion of Γ ∗ ( t ) with K = 3 for a toy example.To motivate the strategy in finding (cid:15) opt we consider Γ ∗ ( t ) for different values of the parameter(see Fig. 4a, b, c). Note the white regions between 0 and 1 in Fig. which are not color coded. Thetotal size of this area is changing in each of 4. the panels and is approximately the smallest at thevalue of (cid:15) opt . If we consider the function Γ ∗ ( t ) as an oscillating signal, then to minimize the whitearea we search for a Γ ∗ ( t ) that has the largest energy, where energy is understood as || γ k || L (0 ,T ) .Using the cluster averaged signal energy E γ for a given value (cid:15) (cid:15) < (cid:15) opt γ ( t ) γ ( t ) γ ( t ) a)01 (cid:15) = (cid:15) opt b)t01 (cid:15) > (cid:15) opt c) Figure 4.
An example of the different regularity of the affiliation vector Γ*(t). Each panel displays the same time window but for different values of ε. a) Weak regularization: the regime periods are irregular, and their temporal scale is close to dt. b) Optimal regularization: the functions γ_k(t) show a relatively sharp and clear separation. c) Strong regularization: γ_k(t) is blurred and suppressed.

(3.1)    E_γ(ε) = (1/K) Σ_{k=1}^{K} ∫_0^T γ_k²(t, ε) dt,

where γ_k(t, ε) is the solution of Eq. (2.20), we determine the value ε that maximizes the energy. The maximum of the curve E_γ(ε) indicates the optimal regularization value (see Fig. 5, color code value 0).

The true affiliation function, which was used to generate the training data, is generated with sharp transitions and three clusters. Furthermore, in the identification step we used the same functional form that was used in the construction process. In other words, the considered examples are well-posed in the sense that the data-generating model is perfectly identifiable. However, when working with real, unexplored data we may not know the true functional form of the SDE. This may lead to a situation where the functions γ_k(t) are uncertain, which affects the curve E_γ(ε) by making it less concave and hence losing a pronounced maximum. This is an indication of an increased sensitivity. To achieve a more robust procedure we investigate the frequency content of the vector Γ*(t).

One way to emphasize the maximum of the curve E_γ(ε) is by gradually removing the high-frequency oscillations in Γ*(t). Since the affiliation function is not periodic, we filter it with a wavelet method (Foufoula-Georgiou and Kumar, 1994). The affiliation vector is decomposed using the discrete wavelet transform to obtain C levels with C + 1 frequency bands.
The function γ_k(t) is then represented as the sum of the low-frequency component γ^A_{k,C}(t) after C levels of transformation and the sum of the high-frequency components γ^D_{k,c}(t) over the previous decomposition levels,

(3.2)    γ_k(t) = γ^A_{k,C}(t) + Σ_{c=1}^{C} γ^D_{k,c}(t).

As a basis for the filtering we choose the Haar wavelet. It represents the sharp jumps of the affiliation vector best, especially when approaching the value ε_opt. Utilizing the L²-orthogonality of the Haar wavelets, one inserts Eq. (3.2) into (3.1) to obtain

(3.3)    E_γ(ε) = (1/K) Σ_{k=1}^{K} ∫_0^T [γ^A_{k,C}(t)]² dt + (1/K) Σ_{k=1}^{K} ∫_0^T [ Σ_{c=1}^{C} γ^D_{k,c}(t) ]² dt.

The signal γ_k(t) is decomposed with the maximum number of possible levels C, which depends on the number of samples. The first term in Eq. (3.3) is the mean value of the total time series and is irrelevant since we are interested in regime transitions between different models. In the following steps we keep the same notation for E_γ(ε). The local-support property of the wavelets allows moving the summation out of the integral,

(3.4)    E_γ(ε) = (1/K) Σ_{k=1}^{K} Σ_{c=1}^{C} ∫_{−∞}^{∞} [γ^D_{k,c}(t)]² dt.

The sample size of the discrete time series γ^D_{k,c}(t) is extended to the nearest value 2^i, i ∈ ℕ⁺, with the reflect-padding method. Next, we rewrite the continuous integral as a sum,

(3.5)    E_γ(ε) = (1/K) Σ_{k=1}^{K} Σ_{c=1}^{C} Σ_{d=1}^{C_d(c)} [γ̃^D_{k,c}(d)]² Δt,

where C_d(c) denotes the number of coefficients at decomposition level c. Due to the efficient implementation of the multi-resolution decomposition, the number of coefficients at each level is consecutively reduced by a factor of 2.
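The level-wise energy bookkeeping of Eqs. (3.2)-(3.5) can be sketched in a few lines. The following is a minimal illustration: the function name and array layout are our own, and a hand-rolled Haar analysis step is used for self-containment, whereas the study itself relies on PyWavelets.

```python
import numpy as np

def cluster_energy(gamma, n_drop=0, dt=1.0):
    """Cluster-averaged Haar detail energy, cf. Eq. (3.5): the affiliation
    functions gamma_k(t) (rows of `gamma`) are decomposed with the Haar
    wavelet, the approximation (mean) part is discarded, and optionally
    the `n_drop` finest detail levels are removed (high-frequency filtering)."""
    gamma = np.asarray(gamma, dtype=float)
    total = 0.0
    for g in gamma:
        # reflect-pad to the next power of two, as described in the text
        n = 1 << int(np.ceil(np.log2(len(g))))
        if n > len(g):
            g = np.pad(g, (0, n - len(g)), mode="reflect")
        a, details = g, []
        while len(a) > 1:
            # one orthonormal Haar analysis step: pairwise averages and differences
            details.append((a[0::2] - a[1::2]) / np.sqrt(2.0))  # finest level first
            a = (a[0::2] + a[1::2]) / np.sqrt(2.0)
        # the remaining single approximation coefficient (the mean) is dropped
        total += sum(np.sum(d ** 2) for d in details[n_drop:]) * dt
    return total / gamma.shape[0]
```

Plotting this energy against ε for increasing `n_drop` reproduces the family of curves shown in Fig. 5.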
With each decomposition level the cut-off frequency is halved, and the energy of the resulting frequency bands can be removed by setting the corresponding coefficients to zero.

Figure 5.
An approximation of the optimal regularization parameter ε for a toy example with three clusters. The y-axis is the cluster-averaged signal energy of Γ*(t) (see Eq. (3.5)). By wavelet-filtering the estimated affiliation vector we emphasize the maximum. The colorbar labels the number of removed detail-coefficient levels (9 of a maximum of 16 are shown). Zero marks the unfiltered Γ*(t). In this example one can exclude four levels to emphasize the maximum without compromising the optimum.

The optimal cut-off frequency may be estimated in several ways. A reasonable choice would be to set it to the scale of a spectral gap, if one is present, between the processes X(t) and u(t). A spectral gap is formed when the highest frequency of u(t) is smaller than the lowest frequency of X(t). In this way, the scale of the process u(t) provides a limit for the smallest scale of the regime jumps in Θ(t). There is no need to define an upper bound because the curve E_γ(ε) drops naturally due to the over-smoothing of the vector γ_k(t) at higher values of ε. The over-smoothing is not detected by the sharp jumps of the Haar wavelet basis. Consequently, the wavelets stop registering the energy at the respective scales (see the roll-off in Fig. 5 at large ε). Alternatively, one plots the energy curves while consecutively removing the high frequencies and looks for a magnification of a maximum without a change in its position (see Fig. 5). The multi-resolution analysis is performed using the python library PyWavelets (Lee et al., 2019).

3.2. Optimal Number of Clusters K.
The approach to determine the optimal number of clusters K_opt borrows ideas from the elbow method used in K-means clustering (Yuan and Yang, 2019). The idea is to monitor an error measure of the clustering procedure as a function of the number of clusters and to search for a characteristic "kink" in the graph. The choice of the error measure is therefore important and specific to the method used. For instance, a measure of compactness in K-means clustering is defined as the sum of the Euclidean intra-cluster distances between points in a given cluster. We introduce a different error measure because the SDE clustering method differentiates the temporal dynamics within the clusters. Two clusters may strongly overlap but be distinct in terms of their induced stationary probability density functions. The Euclidean distance would be a poor choice in that case. Instead, we exploit the Kullback-Leibler (KL) divergence, which is closely related to the probability density of the SDE:

(3.6)    KL(P ‖ Q) := ∫ P(x) log( P(x) / Q(x) ) dx.

The KL divergence describes the information gain by changing from the distribution with density P to the distribution with density Q. Here we use P and Q as the stationary distributions of two different clusters. We seek to measure the difference between P and Q or, more generally, the diversity between K stationary distributions. Due to the asymmetry of the KL divergence, we also consider the values of the opposite divergences KL(Q ‖ P). Let p*_k denote the stationary distribution of an SDE with the estimated parameter set θ*_k and the corresponding γ_k(t) for cluster k = 1, ..., K. For each cluster we thus have a different SDE model that provides a stationary distribution.
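Numerically, the KL divergence of Eq. (3.6) between two cluster densities sampled on a common grid reduces to a masked sum; a minimal sketch (the function name and grid convention are our own):

```python
import numpy as np

def kl_divergence(p, q, dx):
    """KL(P || Q) of Eq. (3.6) for densities sampled on a uniform grid
    with spacing dx. Assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0  # 0 * log(0) is treated as 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx)
```

For two unit-variance Gaussian densities whose means differ by m, the divergence is m²/2 in either direction, which provides a quick sanity check of the implementation.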
We define the weighted diversity matrix D = (d_ij) ∈ ℝ^{K×K} computed as

(3.7)    d_ij := ν_j KL(p*_i ‖ p*_j) = ν_j ∫ p*_i log( p*_i / p*_j ) dx ≥ 0,    i, j = 1, ..., K,

where the ν_j are weights computed from the affiliation vector that quantify the relative frequency of cluster j:

(3.8)    ν_j = (1/T) ∫_0^T γ_j(t) dt,    j = 1, ..., K.

The weights are introduced for the following reason. As more clusters are added, the less probable γ_j(t) is suppressed and barely reaches values equal to 1, or it has a short lifetime relative to the full time horizon T. These suppressed, less probable, inactive regimes should therefore contribute less diversity, and the weights compensate the measure accordingly. Since we are looking for the optimal number of clusters within some reasonable range 2 < K < K_max, we define our diversity measure of a clustering with K clusters as

(3.9)    W_K = Σ_{i=1}^{K} Σ_{j=1}^{K} d_ij,    K = 2, 3, ..., K_max,

where K_max denotes the maximum number of clusters considered, assuming that 2 < K_opt < K_max. We reiterate that p*_k in Eq. (3.7) denotes the stationary distribution of the SDE model characterized by the parameter set θ*_k. Specifically, for the SDE model

(3.10)    dX = f(X; θ*_k) dt + g(X; θ*_k) dW_t,    k = 1, ..., K,

the invariant density p*_k can be written as

(3.11)    p*_k(x) = ( N_c / g²(x; θ*_k) ) exp( ∫^x 2 f(y; θ*_k) / g²(y; θ*_k) dy ),    k = 1, ..., K,

where the normalization constant N_c is given by

(3.12)    N_c⁻¹ = ∫ ( 1 / g²(x; θ*_k) ) exp( ∫^x 2 f(y; θ*_k) / g²(y; θ*_k) dy ) dx,    k = 1, ..., K,

provided all terms are well-defined (see Horsthemke (1984), ch. 6.1, for details).

In real-world applications the gain in the diversity W_K can be too steady, making the detection of K_opt difficult. We use a further extension of the elbow-method idea to make the estimation of K_opt robust. Indeed, we follow Tibshirani et al.
(2001), where it has been suggested to measure the increase of W_K with respect to some null reference distribution, namely a data-set which has no obvious clustering. In the work of Tibshirani et al. (2001), randomly spread points in two-dimensional space serve as the null reference data-set, which is suitable for clustering with the K-means method. Considering the properties of our clustering method, we choose the Wiener process with reflecting boundaries as the null reference time series. The optimal number of clusters is estimated as the point where we obtain the smallest distance between the clustering-diversity measures of the analyzed and the reference data:

(3.13)    Gap(K) = E[log(W*_K)] − log(W_K),

where W*_K denotes the diversity measure of the null reference data-set and the expected value E[log(W*_K)] is estimated from clustering B ≫ 1 independent realizations of the reference process.

4. Implementation
We have developed our version of Algorithm 1 in the programming language C++. The core components of the FEM-H1 framework, namely the spectral projected gradient method for the quadratic programming, the FEM reduction and interpolation procedures, and the procedure to generate an initial Γ(t), are reimplemented from the MATLAB code created by the original authors Pospisil et al. (2018). For the solution of the Θ problem (see Algorithm 1, line 9) we use the NLopt nonlinear-optimization package (G. Johnson, 2021). It provides an interface to many global and local optimization algorithms, which can be tested simply by changing one parameter. Specifically, we applied a combination of Controlled Random Search (CRS) with local mutation (Price, 1977, 1983; Kaelo and Ali, 2006) and the "PRAXIS" gradient-free local optimization via the principal-axis method (Richard P, 2013). Our code is GPU-accelerated using the CUDA libraries cuSPARSE and Thrust and requires CUDA version 10.2. The transition density is computed with the MATLAB symbolic toolbox and exported to C++ code. The template library Thrust made it possible to include the machine-generated code of the transition density p_X in the main program flow.

5. Numerical examples
The numerical study consists of three examples of varying complexity. The auxiliary process u(t) here is stochastic with a smooth sample path and a reasonable scale separation between u(t) and X(t). In real-world applications, however, the process u(t) represents some characteristic quantity of a dynamical system which is assumed to control the time-varying parameters of the process X(t).

The numerical examples consist of the step-by-step construction of a synthetic time series X(t) and the identification of the underlying functional relationship between the parameters of a predefined SDE and some random process u(t). Here it is assumed that the process u(t) is known at any instant of time and serves as a predictor for the time-varying parameter vector Θ(t). To create the synthetic data-set X(t), the process u(t) is used together with arbitrarily chosen functions S_n(u) to generate the evolution of the parameter vector Θ(t) = [S_1(u(t)), S_2(u(t)), ..., S_n(u(t))]. The sample path X(t) is simulated numerically with the predefined SDE, where the parameters are given by Θ(t). The time series X(t) forms the data-set of the numerical tests, from which we recover the functions S_n(u) and compare them to those used in the creation step.

5.1. The Underlying Auxiliary Process Model.
For the numerical examples we give a brief explanation of the auxiliary process model. A realization of the auxiliary process u(t) is constructed for all examples following the same principle. We utilize a 4-dimensional Ornstein-Uhlenbeck (OU) process, supplementing it with the coefficients of the normalized Butterworth polynomials,

(5.1)    dU = −T_c⁻¹ A_D U dt + B dW,    U = [u_1(t), u_2(t), u_3(t), u_4(t)],

where T_c > 0; u(t) = u_1(t) is then the first component of the vector U(t). The diffusion matrix B is almost everywhere zero and, together with the drift matrix A_D, takes the form

(5.2)    A_D = (4×4 drift matrix built from the coefficients a_1, ..., a_4),    B = (zero apart from a single diffusion entry b),

with [a_1, a_2, a_3, a_4] = [1, ·, ·, ·] and b = 1. The effect of the coefficients in A_D is the smoothing of the first component of the vector U, in analogy to a filtering approach.
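For concreteness, a realization of Eq. (5.1) can be simulated with the Euler-Maruyama scheme. The sketch below assumes a companion-form drift matrix and illustrative fourth-order Butterworth denominator coefficients, since the paper's exact matrix entries are not reproduced here; all names and numerical values are placeholders.

```python
import numpy as np

def simulate_auxiliary(T=200.0, dt=1e-2, Tc=10.0, b=1.0, seed=0):
    """Euler-Maruyama simulation of the 4-d OU system of Eq. (5.1).
    The companion-form A_D with Butterworth-like coefficients is an
    assumption; noise enters only the last component (B has one entry)."""
    rng = np.random.default_rng(seed)
    a = np.array([1.0, 2.613, 3.414, 2.613])  # illustrative Butterworth coefficients
    A_D = np.zeros((4, 4))
    A_D[0, 1] = A_D[1, 2] = A_D[2, 3] = -1.0  # integrator chain: u_i' = u_{i+1}/Tc
    A_D[3, :] = a                             # feedback row closing the filter loop
    B = np.array([0.0, 0.0, 0.0, b])          # diffusion vector, single non-zero entry
    n = int(round(T / dt))
    U = np.zeros((n + 1, 4))
    for i in range(n):
        dW = rng.normal(0.0, np.sqrt(dt), size=4)
        U[i + 1] = U[i] - (A_D @ U[i]) / Tc * dt + B * dW
    return U[:, 0]  # u(t): the smoothed first component
```

The four-fold integration of the white noise driving the last component yields a smooth u(t), mirroring the filtering analogy in the text.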
5.2. The Non-Stationary Ornstein-Uhlenbeck Process.
We start the study with a trivial case where the sample path X(t) is modeled by an OU process, which has a linear drift term and additive noise. Consider the OU process of the following form:

(5.3)    dX = (θ_1(t) − θ_2(t) X) dt + θ_3(t) dW,    X(t_0) = 0,

where Θ(t) = [θ_1(t), θ_2(t), θ_3(t)] is modulated by u(t) as defined in Eq. (5.1). The analytical expression for the transition density p_X(·,·), which is required to compute the MLE, is known in this case. However, we still use the approximation with the Hermite expansion to test the implementation of the framework. Figure 6a demonstrates one realization of the sample path of the process X(t).

Figure 6.
A summary of the functions used to generate the non-stationary training data X(t) according to Eq. (5.3). a) The sample path of the process X(t). b) The temporal evolution of the model parameters. c) The auxiliary parameter process u(t). d) The scaling functions θ_n(t) = S_n(u(t)).

The SDE (5.3) is solved simultaneously with Eq. (5.1) using different independent realizations of the Wiener process for each of the equations. We use the Euler-Maruyama method with a fine time-marching step size and then downsample the results to a coarser time step, giving in total 16384 sample points. The scaling functions are defined as (see fig. 6d)

θ_1(t) := S_1(u) = 2 u(t),    (5.4)
θ_2(t) := S_2(u) = [ (u(t) − ·)² + 0.· ]⁻¹,    (5.5)
θ_3(t) := S_3(u) = (u(t) − ·)² + 0.· .    (5.6)

The evolution of the vector Θ(t) (see fig. 6b) and its functional relationship to the process u(t) is hidden and supposed to be unknown in real applications. In this example one can observe an interrelationship between the variance of the process X(t) and the value of the process u(t) by only considering the sample paths, because the equation is linear. As shown in the next examples, this is not obvious for a non-linear system. In the following, we demonstrate how the methodology introduced here is applied to recover the scaling functions S_n, given only the discrete-time values of X(t) and u(t). For details on running the optimization algorithm for this example see Appendix B.

Figure 7.
An estimation of the number of clusters for the OU example (see Fig. 6 and Eq. (5.3)) using the adapted gap-statistics approach of Sect. 3. In panel a) we find that for K > 6 the diversity of the training data increases at the same rate as for the reference data-sets, which suggests the optimal K to be 6. This is confirmed by the minimum value in panel b). c) The estimation of the regularization parameter ε for the clustering with K = 6. The colorbar codes the consecutive suppression of high-frequency bands to emphasize the maximum. The effect is minimal.

By plotting the cluster-averaged signal energy of the affiliation vector, i.e. its L²-norm, against the parameter ε, we estimate the optimal value ε for each value of K individually (see one example for K = 6 in fig. 7c). According to figure 7c the maximum energy is found at a value ε ≈ 20. The maximum of E_γ may be further emphasized by suppressing the high frequencies in the vector Γ*(t). This is highlighted in figure 7c with different colors, where index 0 is the unfiltered result and each next integer marks the consecutive reduction of the frequency content by a factor of two (discrete wavelet transform). We find that the frequency suppression shows the desired effect, although it is unnecessary in this example because the maximum is well pronounced. Additionally, we have inspected the affiliation function for every value of ε (not shown) and found relatively small differences in its evolution over a wide range of values of ε. This indicates a robust estimation and may be attributed to the simplicity of the example.

The procedure to estimate the optimal K_opt requires more effort than is the case for ε. We introduced the diversity measure in section 3.2 to characterize the incremental information gain with every additional cluster. Figure 7a demonstrates the graph of this measure (blue) for the training data X(t) and for the reference data (green). The reference process is generated B = 20 times with the same timestep that was used in the identification step. The clustering is repeated on each of these data-sets with the value ε ≈
20 for each of the K clusters. The intention is to keep all parameters of the framework unchanged and only to replace the data-set. The resulting graph of the information gain for the clustering of the reference time series is shown in figure 7a (green). The key point to look for is the rate at which the information content increases when clustering the analyzed data in comparison to the rate when clustering the reference time series.
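The comparison just described, Eq. (3.13), is a one-liner once the diversity values are available; a sketch with hypothetical inputs (W_data from the analyzed series, rows of W_ref from the B reference clusterings):

```python
import numpy as np

def gap_curve(W_data, W_ref):
    """Gap(K) = E[log(W*_K)] - log(W_K), Eq. (3.13).
    W_data: diversity values of the analyzed data for K = 2, ..., K_max;
    W_ref:  array (B, K_max - 1) of diversities from B reference data-sets."""
    return (np.log(np.asarray(W_ref, dtype=float)).mean(axis=0)
            - np.log(np.asarray(W_data, dtype=float)))

def optimal_K(W_data, W_ref, K_values):
    """The optimal K minimizes the distance between the two diversity curves."""
    return int(np.asarray(K_values)[np.argmin(gap_curve(W_data, W_ref))])
```

Replacing the data-set while holding the rest of the framework fixed, as described above, supplies the reference diversities W_ref.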
Figure 7b shows the distance between the two clustering curves. A reasonable choice is K_opt = 6, because afterwards the curve of the analyzed data does not approach the reference one, which means no significant further increase of the model diversity with respect to the reflected Wiener process. The expected value is built based on 10 reference data-sets. This number will be increased in the following examples to obtain better estimates.

Figure 8a displays Γ*(t) obtained after clustering the time series X(t) with K = 6 (see Fig. 6a). The result is considered to be valid, since at almost every time instance we observe a high value of Γ*(t), indicating a high certainty of the parameters. In Figure 8b the affiliation function is used to label the training data. Note that although the sample path of X(t) at t ≈
500 appears similar to the sample path at other times, the classification differs according to the corresponding values of u(t). The regime occupation duration changes across clusters. One can note that it is related to the rate of change of the auxiliary process: at times when this process changes fast the occupation time is short, and when the process is almost steady the regime occupation time is long. This observation highlights that the method is able to track relatively rapid changes in the parameters if they occur frequently.
Figure 8.
The result of clustering the example time series presented in Fig. 6. a) The affiliation function found by solving the variational problem Eq. (2.20) with K = 6 and ε ≈
20. b) The classified time series X(t). c) The classified auxiliary process u(t).

Figure 9 shows the estimated parameter values for the six clusters. The corresponding six points are sufficient to estimate the hidden scaling functions. When working with unexplored data, the scaling functions would need to be parameterized from the scatter plots by applying a further regression analysis. We skip this step here because the estimated parameters are in good agreement with the true functions. The parameter in fig. 9b shows the largest discrepancy in cluster k = 3 (the rightmost point), which is explained by the fact that this cluster shows the largest variability of the process u(t); observe the classified periods of u(t) for k = 3 in figure 8c. One way to improve the results would be to consider more sample points and repeat the clustering with more clusters.

Figure 9.
The estimated parameters plotted over the cluster-averaged auxiliary process. The model identification is performed with the optimal number of clusters, equal to six. The colorbar labels the different clusters as in Figure 8. The dots represent the estimated parameter values. The dashed lines show the true scaling functions, which were used in generating the training data.

5.3. SDE with Two Independent Auxiliary Processes.
In this example, we consider an SDE with a non-linear drift and a multiplicative noise term. The goal of this example is to demonstrate that the clustering methodology can handle the modulation of two parameters which are independent and have different timescales. The following functional form of the SDE is investigated:

(5.7)    dX = (2 − θ_1(t) X − ln(X)) dt + θ_2(t) X dW,    X(t_0) = 1.

The relationship between the parameters and the auxiliary processes u(t) and v(t) is taken as

θ_1(t) := S_1(u) = (u(t) + 1)² + 0.· ,    (5.8)
θ_2(t) := S_2(v) = −v(t) + 2,    (5.9)

where u(t) and v(t) are independent solutions of Eq. (5.1) with different values of T_c: T_c,v = 4 T_c,u (see Eq. (5.1)). The task is to recover the scaling functions solely from discrete-time observations of X(t), u(t) and v(t). For details on running the optimization algorithm for this example see Appendix C.

The sample size of the time series X(t) is 131072 points (on the Δt grid), longer than in the example with the OU process (see Sec. 5.2). More points help the methodology to distinguish the parameter variability. The processes u(t) and v(t) are uncorrelated, and the re-occurrence of unique parameter pairs in the sample path X(t) is not assured. This makes the identification of the scaling functions challenging. To mitigate the problem one needs to increase the number of clusters to raise the resolution of the scaling functions. By increasing the number of clusters we reduce the number of samples available to estimate the parameter values within one cluster. Consequently, one also needs to increase the sample size of the total time series X(t) to maintain the accuracy of the parameter estimates.

Figure 10a shows the sample path of the SDE (5.7) together with the auxiliary processes u(t) and v(t) (see Fig. 10c), which were used to generate Θ(t) = [θ_1(t), θ_2(t)] (see Fig. 10b). The scaling functions are plotted in Fig.
10d.

The diversity measure of the clustering for the process X(t) and for the reference data-set is shown for comparison in Fig. 11a. The peak in diversity when clustering the data-set X(t) is reached approximately at K_opt = 10. By plotting the distance between the two curves we find that for

Figure 10.
A summary of the functions used to generate the non-stationary training data X(t) according to Eq. (5.7). a) The sample path of X(t). b) The temporal evolution of the model parameters. c) The auxiliary parameter processes u(t) and v(t). d) The scaling functions θ_1(t) = S_1(u(t)) and θ_2(t) = S_2(v(t)).

Figure 11.
Estimation of the number of clusters for the example with two auxiliary processes (see Fig. 10 and Eq. (5.7)) using the gap-statistics approach, a) (see Eq. (3.9)) and b) (see Eq. (3.13)). In panel a), after approximately K = 10 the diversity of the training data increases at the same rate as in the reference data-sets. This suggests the optimal K to be around 10. c) Estimation of the regularization parameter ε for clustering with K = 10. The colorbar codes the consecutive suppression of high-frequency bands to emphasize the maximum. The filtering step is not necessary in this example.

K >
10 the clustering does not add significantly more information relative to the clustering of the reflected Wiener process. The selection of ε for this example shows a flat maximum (see Fig. 11c).
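The diversity curves compared in Fig. 11a are built from the weighted pairwise divergences of Eqs. (3.7)-(3.9); a grid-based sketch (the names and the strictly positive density assumption are ours):

```python
import numpy as np

def diversity_measure(p_stars, nu, dx):
    """W_K of Eq. (3.9): sum of the weighted pairwise KL divergences d_ij
    of Eq. (3.7). p_stars: (K, M) stationary densities on a uniform grid
    with spacing dx (assumed strictly positive); nu: (K,) cluster weights."""
    p_stars = np.asarray(p_stars, dtype=float)
    K = p_stars.shape[0]
    W = 0.0
    for i in range(K):
        for j in range(K):
            if i == j:
                continue  # d_ii = 0
            kl = np.sum(p_stars[i] * np.log(p_stars[i] / p_stars[j])) * dx
            W += nu[j] * kl
    return W
```

Both directions KL(p_i ‖ p_j) and KL(p_j ‖ p_i) enter the double sum, reflecting the asymmetry of the divergence noted in Sect. 3.2.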
Figure 12.
The result of clustering the example time series presented in Fig. 10. a) The affiliation function found by solving the variational problem Eq. (2.20) with K = 10 and ε = 30. b) The classified auxiliary process u(t). c) The classified auxiliary process v(t). The figure shows 30% of the data-set.

The optimal classifier Γ*(t) and the partitioning of the auxiliary processes are shown in Fig. 12, where we can investigate the quality of the clustering. Figure 13 displays the estimated parameters and the true scaling functions. By visually inspecting these results we note a consistent spread of the estimates of one parameter, whereas the estimates of the other parameter are distributed rather unevenly. It is noticeable in the latter that three values have apparently been estimated twice and are positioned close to each other. The respective cluster values of the former parameter, however, are located apart, such that its scaling function is sampled at significantly different points. This means that in this example the methodology distinguishes pairs of parameters which have one common element. We therefore conclude that the methodology is capable of classifying SDE models with at least two uncorrelated auxiliary processes.

Figure 13.
Estimated parameters plotted over the cluster-averaged auxiliary processes. The model identification is performed with the optimal number of clusters, equal to ten. The colorbar labels the different clusters as in Figure 8. The dots represent the estimated parameter values. The dashed lines show the true scaling functions, which were used in generating the training data.
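For reference, the stationary densities p*_k that enter the diversity measure, and the stationary pdfs compared against the cluster histograms in the following example, follow from Eq. (3.11); a grid-based sketch with hypothetical drift and diffusion callables:

```python
import numpy as np

def stationary_density(f, g, x):
    """Invariant density of dX = f(X)dt + g(X)dW via Eq. (3.11):
    p*(x) is proportional to exp( integral of 2 f / g^2 ) / g^2,
    normalized on the uniform grid x."""
    dx = x[1] - x[0]
    integrand = 2.0 * f(x) / g(x) ** 2
    # cumulative integral of 2 f / g^2 (left Riemann sum), anchored at x[0]
    potential = np.concatenate(([0.0], np.cumsum(integrand[:-1]) * dx))
    p = np.exp(potential - potential.max()) / g(x) ** 2
    return p / (np.sum(p) * dx)
```

For the OU test case dX = −X dt + √2 dW this recovers the standard normal density, which provides a simple check against Eq. (3.12).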
5.4. SDE with Non-Linear Drift and Non-Linear Diffusion.
In this example, we investigate an SDE with a non-linear drift and a non-linear diffusion. For the sake of simplicity, we consider only one auxiliary process u(t) and two parameters: one in the drift term and one in the diffusion term. The functional form of the SDE is

(5.10)    dX = (θ_1(t) X − X³) dt + θ_2(t) √(X² + 1) dW,    X(t_0) = 0.

It is possible to incorporate more parameters into the model structure, for example in front of the quadratic or the cubic term. The parameter θ_1(t) is a time-dependent bifurcation parameter which causes a continuous variation of the equilibrium properties in time, eventually leading to metastable states in the dynamics (see fig. 14a). The clustering methodology can capture transitions in the underlying double-well potential of the SDE. The relationship between the parameters and the auxiliary process is

θ_1(t) ≡ S_1(u) = −0.· (u(t) − ·)² + 2.· ,    (5.11)
θ_2(t) ≡ S_2(u) = −u(t) + 5.    (5.12)

As in the previous examples, the task is to recover the scaling functions knowing only one discrete-time trajectory of the processes X(t) and u(t).

Figure 14.
An example of a non-stationary process associated with Eq. (5.10). a) The sample path of X(t). b) The temporal evolution of the model parameters. c) The parameter auxiliary process u(t). d) The scaling functions θ_n(t) = S_n(u(t)).

In this example our method to estimate K_opt does not provide as clear a picture as it did in the previous examples. Nevertheless, it is still beneficial to analyze the tendencies of the curves in Figure 15a, b. As indicated by the minimum in Figure 15b, we should choose K_opt = 2. However, our final goal is to recover the scaling functions, and in the case of non-linear scaling functions one requires more than two points and therefore more clusters. According to the second minimum in Figure 15b, the second-best choice is K_opt = 4. If we investigate the curve log(W_K) in

Figure 15.
An estimation of the number of clusters for the example associated with Eq. (5.10) (see Fig. 14) using the gap-statistics approach, a) (see Eq. (3.9)) and b) (see Eq. (3.13)). In panel a) we find that for K > 4 the diversity of the training data increases at the same rate as in the reference data-sets, suggesting K_opt = 4. c) The estimation of the regularization parameter ε for clustering with K = 5. The colorbar codes the consecutive suppression of high-frequency bands to emphasize the maximum.

Figure 15a, we find a minor characteristic change in the slope at K_opt = 4. To resolve the scaling functions even better we select K_opt = 5.

Knowing the affiliation function and the averaged values of the parameters, the temporal evolution of the model parameters (see Fig. 16a) is approximated by

(5.13)    θ*(t) = Σ_{k=1}^{K_opt} θ*_k γ_k(t),

where θ*_k is a constant vector of size N. Figure 16a shows that the methodology recovers the hidden evolution of the parameters rather well. The approximation becomes better the more clusters are used (not shown). The drawback is that the uncertainty in Θ* grows with the number of clusters, due to the reduction in data points per cluster.

The approximated path of the model parameters (see Fig. 16a, in red) is used to test the prediction performance (see Fig. 16b, in red). In this figure the time series of the training data-set is shifted up by the constant value of 10 for better comparison. The considered SDE exhibits a temporal modulation by the bifurcation parameter. The periods with two metastable states are expected to have different sample paths, because the local state of the system depends on the particular realization of the Wiener process. Two different Wiener processes were used to generate the training and simulation solutions. Figure 16b shows that the signal dynamics are properly captured. The corresponding cluster probability density functions are in good agreement with those of the training data and with the respective stationary distributions, which were computed from the parameter estimates. The largest discrepancy occurs for the cluster with index 1 (see Fig. 16c). This cluster has too short a lifetime to capture the regime jumping in the present realization. The trajectory-based p.d.f.
estimate is off because the entire cluster dynamics are not observed. If we consider the stationary distribution computed from the estimated parameter values, we find the correct bimodal distribution (compare in Fig. 16c).

Figure 16.
Clustered time-series generated with Eq. (5.10). In a) and b) the colored background is the estimated affiliation vector Γ∗(t) > 0.8. The colorbar to the right of the panels labels the corresponding 5 clusters. a) The true temporal evolution (black) of the diffusion parameter and its recovered version (red); the second parameter looks similar and is not shown. b) The simulation of the sample path with the estimated parameter values compared to the training data. Panels c) to g) show the cluster-respective densities: the histograms for the training (black) and simulation (red) data. The blue lines mark the stationary pdf, which is computed from the parameter estimates. The colored circles mark the corresponding clusters.

Figure 16 validates the performance of the model within the training data. To construct a self-contained, non-stationary prediction model, we show that the scaling functions between the auxiliary process u(t) and the parameters are recovered as well (see Fig. 17). The accuracy of the clustering method is sufficient to obtain the correct parameter values for a wide range of cluster numbers. It is important to note that the optimal number of clusters is of secondary importance: the number of clusters should be as low as possible yet high enough to identify the scaling function. After the relation between the auxiliary process and the model parameters is identified (see Fig. 17), the last step consists of performing a regression analysis to parameterize the scaling behavior. It is left out here, as it will be specific to each application.
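As a minimal illustration of how Eq. (5.13) blends the per-cluster estimates into a time-dependent parameter path, the following sketch applies the affiliation weights to the cluster parameter vectors. The helper name `reconstruct_parameters` and the toy data are illustrative assumptions, not the code used in this study:

```python
import numpy as np

def reconstruct_parameters(theta_k, gamma):
    """Eq. (5.13): theta*(t) = sum_k theta*_k gamma_k(t).
    theta_k: (K, N) per-cluster parameter vectors,
    gamma:   (T, K) affiliation functions, rows summing to one."""
    theta_k = np.asarray(theta_k, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    assert np.allclose(gamma.sum(axis=1), 1.0)  # valid affiliations
    return gamma @ theta_k  # (T, N) parameter path

# toy example: two clusters, one parameter, a regime switch at t = 5
theta_k = np.array([[1.0], [3.0]])
gamma = np.zeros((10, 2))
gamma[:5, 0] = 1.0  # cluster 1 active in the first half
gamma[5:, 1] = 1.0  # cluster 2 active in the second half
theta_t = reconstruct_parameters(theta_k, gamma)
```

With smooth (regularized) affiliations instead of the hard 0/1 weights above, the same matrix product yields a smooth parameter modulation of the kind shown in Fig. 16a.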
Figure 17.
The estimated parameters plotted over the cluster-averaged auxiliary process. The model identification is performed with an optimal number of clusters equal to six. The dots represent the estimated parameter values. The dashed lines show the true scaling functions, which were used in generating the training data. The optimal number of clusters is of secondary importance if one is able to parameterize the scaling functions.
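The cluster-averaged auxiliary process on the abscissa of these scatter plots can be obtained as an affiliation-weighted time average. The sketch below uses hypothetical names and assumes hard (0/1) affiliations for clarity:

```python
import numpy as np

def cluster_averaged(u, gamma):
    """Affiliation-weighted average of the auxiliary process:
    u_k = sum_t gamma_k(t) u(t) / sum_t gamma_k(t), one value per cluster."""
    u = np.asarray(u, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    return (gamma * u[:, None]).sum(axis=0) / gamma.sum(axis=0)

# toy data: the auxiliary process is low in regime 1 and high in regime 2
u = np.array([0.0, 0.2, -0.2, 1.0, 1.2, 0.8])
gamma = np.array([[1, 0], [1, 0], [1, 0],
                  [0, 1], [0, 1], [0, 1]], dtype=float)
u_k = cluster_averaged(u, gamma)  # one abscissa value per cluster
```

Pairing each u_k with the corresponding estimated θ∗_k gives one dot per cluster, as in Fig. 17.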
Discussion and Conclusion
Data-driven modeling has the ability to enhance the understanding of complex dynamical systems by highlighting patterns or hidden regularity. Especially challenging is the identification and parameterization of non-linear and non-stationary SDEs. The intention of the system identification step is to determine a suitable model structure and to estimate the corresponding model parameters. In retrospect, one may analyze the data associated with the newly estimated model to improve understanding of the process and to derive hidden relationships. These relationships may not be given explicitly by the clustering approach, so that unveiling these hidden dependencies requires additional research work. For instance, Vercauteren et al. (2016) characterized the influence of non-turbulent motions on the turbulence after a model-based data-clustering methodology was applied (Vercauteren and Klein, 2015). In a study of the North Atlantic Oscillation in the climate system, Quinn et al. (2020) used the clustering results to study dynamical processes associated with atmospheric regime transitions.

The presented approach clusters the data into segments and models each of them with an individual, locally-stationary model. Such a decomposition of a complex dynamical behavior may not be optimal, since it may not always be clear how to differentiate between a non-linear and a non-stationary time-series without knowing the underlying structure of the data-generating process. Identifying an appropriate model structure is therefore an important step. For discrete-time systems, Billings (2013) developed an estimation algorithm which can sequentially rate and collect important model terms. Building on that, Wei et al. (2004) found an effective model for a highly non-linear terrestrial magnetospheric dynamical system.
Within their framework the time-varying parameters can be estimated with a multi-wavelet (Wei et al., 2010) or a sliding-window approach (Li et al., 2016).

In section 2 we recall the variational clustering framework, which allows the simultaneous analysis and modeling of a non-stationary time series. One variation of the framework (Horenko, 2010b) is based on discretized mathematical models, also termed discrete-time systems. To integrate an identified discrete-time model into a preexisting continuous-time system, one needs to account for the discretization operation; otherwise, the solution is restricted to the time frequency given by the data. One way to overcome the problem is to convey the frequency response of the identified model by transforming it into a continuous one (Kuznetsov and Clarke, 1994). This transformation is widely studied in linear system theory and is extendable to the case of non-linear systems (Billings, 2013, p. 342). Our approach avoids the above-mentioned difficulty because we use the MLE derived from an appropriate likelihood (fitness) function for the discrete-time observations of the SDE.

In general, the main challenge in applying the MLE is the missing transition density function, which requires, for example, a solution of the Fokker-Planck equation for a considered structure of the SDE. Especially for non-linear SDEs, this function is analytically not known and demands estimation (Fuchs, 2013). In our approach, separating the available time series into K clusters reduces the number of samples available for estimation in each cluster, leading to more uncertain estimates. The closed-form approximation based on the Hermite expansion provides an accurate approximation for the transition density, with an error of O(∆t) (Ait-Sahalia, 2002).
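To make the role of the transition density concrete, the sketch below evaluates the exact discrete-observation log-likelihood of an Ornstein-Uhlenbeck process, one of the few SDEs whose transition density is known in closed form; for non-linear drifts this Gaussian density is what the Hermite expansion approximates. The model, parameter values, and helper names are illustrative assumptions, not the code of this study:

```python
import numpy as np

def ou_loglik(x, dt, theta, mu, sigma):
    """Exact log-likelihood of discrete observations of the OU process
    dX = theta*(mu - X) dt + sigma dW, assembled from its Gaussian
    transition density p(x_{i+1} | x_i)."""
    m = mu + (x[:-1] - mu) * np.exp(-theta * dt)                # cond. mean
    v = sigma**2 / (2 * theta) * (1 - np.exp(-2 * theta * dt))  # cond. variance
    return -0.5 * np.sum(np.log(2 * np.pi * v) + (x[1:] - m) ** 2 / v)

# simulate a path with the exact OU update, then score candidate thetas
rng = np.random.default_rng(0)
theta_true, mu, sigma, dt, n = 2.0, 0.0, 0.5, 0.01, 20000
a = np.exp(-theta_true * dt)
s = sigma * np.sqrt((1 - a**2) / (2 * theta_true))
x = np.empty(n + 1)
x[0] = 0.0
for i in range(n):
    x[i + 1] = mu + (x[i] - mu) * a + s * rng.standard_normal()

lls = {th: ou_loglik(x, dt, th, mu, sigma) for th in (0.5, 2.0, 8.0)}
```

The likelihood surface built from transition densities prefers parameter values near the truth; maximizing it over a cluster's data points is the estimation step that the Hermite expansion makes feasible for general non-linear drifts.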
This accuracy allows us to increase the number of clusters and to raise the resolution of the hidden scaling functions, making an accurate parameterization task feasible.

We combine the closed-form likelihood function approach based on suitable Hermite expansions with the non-parametric clustering framework. Specifically, we highlight the details of the necessary modification of the original subspace clustering algorithm to achieve accurate estimates of the model parameters independently of the discretization of the QP problem. Furthermore, we present an extensive numerical study consisting of three controlled examples of different complexity to validate both the parameterization methodology and the novel methods to estimate the hyper-parameters of the framework, which do not rely on information theory.

In particular, our approach to estimating the optimal number of clusters is based on measuring the degree of model diversity and comparing it to the clustering of the reflected reference Wiener process. The numerical experiments confirm that this approach produces adequately reliable estimates. However, the associated graphs prove to be dependent on the considered model structure, such that the selection requires a case-specific interpretation. The proposed method is still computationally expensive because it requires repeating the clustering of the reference data-set multiple times to construct an expected value.

Our method to identify the optimal regularization value of the affiliation function proves to be robust for the studied test cases. We further extended it to articulate the optimum through consecutively filtering out the small-scale variability from the affiliation vector. Unfortunately, we could not construct an appropriate example to show its potential. However, apart from this study, we tested this methodology on a real-data application and found it to be effective.
This will be reported in a follow-up study.

In summary, the data-driven variational clustering approach enables modeling of non-linear and non-stationary processes to develop models for multi-scale dynamical systems. The model structure of the SDE is thereby unknown and must be prescribed or inferred in supplementary ways. The parameterization method is accurate and provides reasonable results with small data-sets consisting of 2 × points. It is also scalable on the GPU, as realized in this work. We tested data-sets consisting of 2 × points, where the computation time was approximately 24 hours on a GeForce 1080 with 3 clusters. The numerical examples demonstrate the correct identification of the required scaling functions, which allows parameterizing a considered SDE by linking additional auxiliary processes to the model parameters.

Finally, it is noteworthy that the presented parameterization method is intended to be used to develop a data-driven stochastic parameterization of intermittent turbulence present in the nocturnal atmospheric boundary layer. In climate models, for example, the limited numerical resolution leaves unresolved degrees of freedom, commonly denoted as subgrid-scale processes. The approach presented here provides an ideal framework to derive subgrid-scale models which are modulated by resolved variables. These results will be published elsewhere. Another direction for future research is the extension of the non-stationary data-driven learning approach developed in this work to SDE models with (at least) two widely separated time scales, for which the MLE is known to become asymptotically biased. For these systems the proposed methodology has to be combined with multiscale robust inference techniques (Krumscheid et al., 2013; Kalliadasis et al., 2015; Krumscheid, 2018).
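As a small numerical complement, the Hermite polynomials underlying the closed-form likelihood expansion (listed in Appendix D) can be generated by the probabilists' three-term recurrence; the sketch below assumes the sign convention H_j(z) = (−1)^j He_j(z) used in that expansion:

```python
import numpy as np

def hermite_H(j, z):
    """H_j(z) = (-1)^j He_j(z), with He_j from the probabilists'
    recurrence He_{n+1}(z) = z*He_n(z) - n*He_{n-1}(z)."""
    z = np.asarray(z, dtype=float)
    he_prev, he = np.ones_like(z), z  # He_0 = 1, He_1 = z
    if j == 0:
        return he_prev
    for n in range(1, j):
        he_prev, he = he, z * he - n * he_prev
    return (-1) ** j * he

z = np.array([0.0, 1.0, 2.0])  # sample evaluation points
```

The recurrence avoids the explicit coefficient lists and evaluates stably for the low orders (j ≤ 6) needed by the expansion.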
Appendix A. Reusing the initial guess
One can estimate the values of the hyperparameters K and ε by running the optimization problem for each pair of parameter combinations. In the OU example, we choose for K the range of values [2, 3, . . . , 10] and for ε the range of values [10−, ], where we use 100 divisions on the logarithmic scale for the latter. In principle, for each parameter combination (K, ε) the optimization method can be started with a random initial guess Θ and Γ(t). However, in practice it is efficient to reuse the optimal values from the previous run in the following way:
(1) Select one value for K.
(2) Initialize the first optimization run with the smallest or the largest value of ε together with a random guess for Θ and Γ(t).
(3) Algorithm 1 provides a solution Θ∗ and Γ∗.
(4) Change the parameter value ε to the next closest value.
(5) Start the optimization method with the new ε and the initial guess set to the solution from the previous run, Θ∗ and Γ∗.
(6) Algorithm 1 provides the solution Θ∗ and Γ∗.
(7) Solve the variational problem for all considered ε by incrementally changing its value.
(8) Change the value of K and repeat from (1).
The approach where the initial guess is taken randomly requires, of course, more computational time in total than the case with a reused solution from previous runs. The principle of traversing the regularization parameter ε is further exploited by traversing the parameter from a low to a high value and back to a low value while consistently reusing the previous solution outcomes. This strategy proved effective in finding a better solution after each iteration, although with only minor incremental improvement.

Appendix B. Details on the settings of the framework for the OU example in section 5.3
The minimization algorithm for the QP solver has two stopping criteria, which are as follows: the maximum number of iterations is set to 100, and the minimum difference between consecutive cost function values is set to 10−. The reduction value for the QP solver speeds up the QP problem and is set to 1/3. For the Θ solver, the global optimizer is a random search algorithm supplemented with a local search algorithm. For both of them, the maximum number of function evaluations is set to 300 and the break tolerance to a relative value of 10−. The population size of the global optimizer is set to 3000. The final point to mention is the parameter bounds for the Θ solver. In this example (and for the clustering of the reference data set) they are: − < θ < 20; 0 < θ < 20; 0 < θ <

Appendix C. Details on the settings of the framework for the example in Section 5.4
Here are the details on the simulation of the sample path X(t) (see Fig. 14), which forms the training data for the clustering test. The time series X(t) has 65536 samples and a constant time step ∆t = 0.01. The sample path is obtained by solving Eq. 5.10 with the Milstein method, and the sample path of u(t) is obtained by solving Eq. 5.1 with the Euler-Maruyama method. During the simulation, the time step between each of the 65536 samples is reduced to the value 10− in order to obtain a more accurate realization of the sample path. The initial value for Eq. 5.10 is X(t) = 0 and for Eq. 5.1 it is U(t) = [1, , , ]T; c = 15; b = 0. − . The reduction value for the Γ solver is set to α = 0.1. For both solvers, the maximum number of calls is set to 500 and the termination tolerance to a relative value of 10−. The population size of the global optimizer is set to 1000. The bounds for the parameter limits of the Θ solver (and for the clustering of the reference data set) are: − < θ <
10; 0 < θ <

Appendix D. Coefficients of the Hermite Expansion

The coefficients η_j in Eq. (2.9) are approximated with a truncated Taylor series of length M = J/2 with J = 6. We state them here for completeness:

η_0^(3) = 1,

η_1^(3) = −µ h^{1/2} − (2µµ' + µ'') h^{3/2}/4 − (4µ(µ')^2 + 4µ^2µ'' + 6µ'µ'' + 4µµ^(3) + µ^(4)) h^{5/2}/24,

η_2^(3) = (µ^2 + µ') h/2 + (6µ^2µ' + 4(µ')^2 + 7µµ'' + 2µ^(3)) h^2/12 + (28µ^2(µ')^2 + 28µ^2µ^(3) + 16(µ')^3 + 16µ^3µ'' + 88µµ'µ'' + 21(µ'')^2 + 32µ'µ^(3) + 16µµ^(4) + 3µ^(5)) h^3/96,

η_3^(3) = −(µ^3 + 3µµ' + µ'') h^{3/2}/6 − (12µ^3µ' + 28µ(µ')^2 + 22µ^2µ'' + 24µ'µ'' + 14µµ^(3) + 3µ^(4)) h^{5/2}/48,

η_4^(3) = (µ^4 + 6µ^2µ' + 3(µ')^2 + 4µµ'' + µ^(3)) h^2/24 + (20µ^4µ' + 50µ^3µ'' + 100µ^2(µ')^2 + 50µ^2µ^(3) + 23µµ^(4) + 180µµ'µ'' + 40(µ')^3 + 34(µ'')^2 + 52µ'µ^(3) + 4µ^(5)) h^3/240,

η_5^(3) = −(µ^5 + 10µ^3µ' + 15µ(µ')^2 + 10µ^2µ'' + 10µ'µ'' + 5µµ^(3) + µ^(4)) h^{5/2}/120,

η_6^(3) = (µ^6 + 15µ^4µ' + 15(µ')^3 + 20µ^3µ'' + 15µ'µ^(3) + 45µ^2(µ')^2 + 10(µ'')^2 + 15µ^2µ^(3) + 60µµ'µ'' + 6µµ^(4) + µ^(5)) h^3/720,

where µ is the drift of the Y process evaluated at y (see Eq. (2.8)), ' denotes a derivative and h = ∆t. The Hermite polynomials for our expansion are: H_0(z) = 1, H_1(z) = −z, H_2(z) = z^2 − 1, H_3(z) = 3z − z^3, H_4(z) = 3 − 6z^2 + z^4, H_5(z) = −15z + 10z^3 − z^5, H_6(z) = −15 + 45z^2 − 15z^4 + z^6.

References
Ait-Sahalia, Y. (1999). Transition Densities for Interest Rate and Other Nonlinear Diffusions. The Journal of Finance, page 35.
Ait-Sahalia, Y. (2002). Maximum Likelihood Estimation of Discretely Sampled Diffusions: A Closed-form Approximation Approach. Econometrica, 70(1):223–262.
Ayenew, T., Fikre, S., Wisotzky, F., Demlie, M., and Wohnlich, S. (2009). Hierarchical cluster analysis of hydrochemical data as a tool for assessing the evolution and dynamics of groundwater across the Ethiopian rift. page 15.
Billings, S. A. (2013). Nonlinear system identification: NARMAX methods in the time, frequency, and spatio-temporal domains. John Wiley & Sons, Inc, Chichester, West Sussex, United Kingdom.
Boyko, V. and Vercauteren, N. (2020). Multiscale Shear Forcing of Turbulence in the Nocturnal Boundary Layer: A Statistical Analysis. Boundary-Layer Meteorology.
Dacunha-Castelle, D. and Florens-Zmirou, D. (1986). Estimation of the coefficients of a diffusion from discrete observations. Stochastics, 19(4):263–284.
de Wiljes, J., Majda, A., and Horenko, I. (2013). An Adaptive Markov Chain Monte Carlo Approach to Time Series Clustering of Processes with Regime Transition Behavior. Multiscale Modeling & Simulation, 11(2):415–441.
Foufoula-Georgiou, E. and Kumar, P., editors (1994). Wavelets in geophysics. Number 4 in Wavelet Analysis and Its Applications. Academic Press, San Diego.
Franzke, C. L. E., O'Kane, T. J., Monselesan, D. P., Risbey, J. S., and Horenko, I. (2015). Systematic attribution of observed Southern Hemisphere circulation trends to external forcing and internal variability. Nonlinear Processes in Geophysics, 22(5):513–525.
Fuchs, C. (2013). Inference for Diffusion Processes. Springer Berlin Heidelberg, Berlin, Heidelberg.
Johnson, S. G. (2021). The NLopt nonlinear-optimization package.
Gerber, S. and Horenko, I. (2014). On inference of causality for discrete state models in a multiscale context. Proceedings of the National Academy of Sciences, 111(41):14651–14656.
Horenko, I. (2010a). Finite Element Approach to Clustering of Multidimensional Time Series. SIAM Journal on Scientific Computing, 32(1):62–83.
Horenko, I. (2010b). On the Identification of Nonstationary Factor Models and Their Application to Atmospheric Data Analysis. Journal of the Atmospheric Sciences, 67(5):1559–1574.
Horenko, I., Ganna, M., and Patrick, G. (2020). On a computationally-scalable sparse formulation of the multidimensional and non-stationary maximum entropy principle. arXiv:2005.03253 [cs, stat].
Horsthemke, W. (1984). Noise Induced Transitions. Non-Equilibrium Dynamics in Chemical Systems, pages 150–160. Springer, Berlin, Heidelberg.
Kaelo, P. and Ali, M. M. (2006). Some Variants of the Controlled Random Search Algorithm for Global Optimization. Journal of Optimization Theory and Applications, 130(2):253–264.
Kaiser, O. (2015). Data-based analysis of extreme events: inference, numerics and applications. PhD thesis, Università della Svizzera italiana.
Kaiser, O., Hien, S., Achatz, U., and Horenko, I. (2017). Stochastic Subgrid-Scale Parametrization. page 1.
Kaiser, O. and Horenko, I. (2014). On inference of statistical regression models for extreme events based on incomplete observation data. Communications in Applied Mathematics and Computational Science, 9(1):143–174.
Kalliadasis, S., Krumscheid, S., and Pavliotis, G. A. (2015). A new framework for extracting coarse-grained models from time series with multiscale structure. J. Comput. Phys., 296:314–328.
Krumscheid, S. (2018). Perturbation-based inference for diffusion processes: Obtaining effective models from multiscale data. Math. Models Methods Appl. Sci., 28(8):1565–1597.
Krumscheid, S., Pavliotis, G. A., and Kalliadasis, S. (2013). Semiparametric Drift and Diffusion Estimation for Multiscale Diffusions. Multiscale Model. Simul., 11(2):442–473.
Krumscheid, S., Pradas, M., Pavliotis, G. A., and Kalliadasis, S. (2015). Data-driven coarse graining in action: Modeling and prediction of complex systems. Physical Review E, 92(4):042139.
Kuznetsov, A. and Clarke, D. (1994). Simple Numerical Algorithms for Continuous-To-Discrete and Discrete-To-Continuous Conversion of the Systems with Time Delay. IFAC Proceedings Volumes, 27(8):1549–1554.
Lee, G., Gommers, R., Waselewski, F., Wohlfahrt, K., and O'Leary, A. (2019). PyWavelets: A Python package for wavelet analysis. Journal of Open Source Software, 4(36):1237.
Li, Y., Wei, H.-L., Billings, S. A., and Sarrigiannis, P. (2016). Identification of nonlinear time-varying systems using an online sliding-window and common model structure selection (CMSS) approach with applications to EEG. International Journal of Systems Science, 47(11):2671–2681.
Liao, T. W. (2005). Clustering of time series data - a survey. Pattern Recognition, page 18.
Mahrt, L. (2014). Stably Stratified Atmospheric Boundary Layers. Annual Review of Fluid Mechanics, 46(1):23–45.
Maione, C., Nelson, D. R., and Barbosa, R. M. (2019). Research on social data by means of cluster analysis. Applied Computing and Informatics, 15(2):153–162.
Metzner, P., Putzig, L., and Horenko, I. (2012). Analysis of persistent nonstationary time series and applications. Communications in Applied Mathematics and Computational Science, 7(2):175–229.
O'Kane, T. J., Risbey, J. S., Franzke, C., Horenko, I., and Monselesan, D. P. (2013). Changes in the Metastability of the Midlatitude Southern Hemisphere Circulation and the Utility of Nonstationary Cluster Analysis and Split-Flow Blocking Indices as Diagnostic Tools. Journal of the Atmospheric Sciences, 70(3):824–842.
O'Kane, T. J., Risbey, J. S., Monselesan, D. P., Horenko, I., and Franzke, C. L. E. (2016). On the dynamics of persistent states and their secular trends in the waveguides of the Southern Hemisphere troposphere. Climate Dynamics, 46(11-12):3567–3597.
Pospisil, L., Gagliardini, P., Sawyer, W., and Horenko, I. (2018). On a scalable nonparametric denoising of time series signals. Communications in Applied Mathematics and Computational Science, 13(1):107–138.
Price, W. L. (1977). A controlled random search procedure for global optimisation. The Computer Journal, 20(4):367–370.
Price, W. L. (1983). Global optimization by controlled random search. Journal of Optimization Theory and Applications, 40(3):333–348.
Putzig, L., Becherer, D., and Horenko, I. (2010). Optimal Allocation of a Futures Portfolio Utilizing Numerical Market Phase Detection. SIAM Journal on Financial Mathematics, 1(1):752–779.
Quinn, C., Harries, D., and O'Kane, T. J. (2020). Dynamical analysis of a reduced model for the North Atlantic Oscillation. EarthArXiv.
Brent, R. P. (2013). Algorithms for minimization without derivatives. Courier Corporation.
Risbey, J. S., O'Kane, T. J., Monselesan, D. P., Franzke, C., and Horenko, I. (2015). Metastability of Northern Hemisphere Teleconnection Modes. Journal of the Atmospheric Sciences, 72(1):35–54.
Sun, J., Mahrt, L., Banta, R. M., and Pichugina, Y. L. (2012). Turbulence Regimes and Turbulence Intermittency in the Stable Boundary Layer during CASES-99. Journal of the Atmospheric Sciences, 69(1):338–351.
Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423.
Vercauteren, N., Boyko, V., Kaiser, A., and Belusic, D. (2019). Statistical Investigation of Flow Structures in Different Regimes of the Stable Boundary Layer.
Boundary-Layer Meteorology , ON-PARAMETRIC CLUSTERING OF SDE 35
Journal of theAtmospheric Sciences , 72(4):1504–1517.Vercauteren, N., Mahrt, L., and Klein, R. (2016). Investigation of interactions between scalesof motion in the stable boundary layer: Interactions between Scales of Motion in the StableBoundary Layer.
Quarterly Journal of the Royal Meteorological Society , 142(699):2424–2433.Wei, H. L., Billings, S. A., and Liu, J. (2004). Term and variable selection for non-linear systemidentification.
International Journal of Control , 77(1):86–110.Wei, H. L., Billings, S. A., and Liu, J. J. (2010). Time-varying parametric modelling and time-dependent spectral characterisation with applications to EEG signals using multiwavelets.
Inter-national Journal of Modelling, Identification and Control , 9(3):215.Yuan, C. and Yang, H. (2019). Research on K-Value Selection Method of K-Means ClusteringAlgorithm. J , 2(2):226–235.Zhao, Y. and Karypis, G. (2005). Data Clustering in Life Sciences. Molecular Biotechnology ,31(1):055–080. (Vyacheslav Boyko, Nikki Vercauteren) Department of Mathematics and Computer Sciences, Freie Uni-versitaet Berlin, Germany, Arnimallee 6-14195 Berlin
Email address , V. Boyko: [email protected]
Email address , N. Vercauteren: [email protected] (Sebastian Krumscheid) Faculty of Mathematics, Computer Science and Natural Sciences, RWTHAachen University, Germany, Templergraben 55, 52062 Aachen
Email address , S. Krumscheid: [email protected] (Nikki Vercauteren)3