Estimating Vector Fields from Noisy Time Series
Harish S. Bhat
Applied Mathematics, University of California, Merced
Merced, CA [email protected]
Majerle Reeves
Applied Mathematics, University of California, Merced
Merced, CA [email protected]
Ramin Raziperchikolaei
Rakuten, Inc.
San Mateo, CA [email protected]
Abstract—While there has been a surge of recent interest in learning differential equation models from time series, methods in this area typically cannot cope with highly noisy data. We break this problem into two parts: (i) approximating the unknown vector field (or right-hand side) of the differential equation, and (ii) dealing with noise. To deal with (i), we describe a neural network architecture consisting of tensor products of one-dimensional neural shape functions. For (ii), we propose an alternating minimization scheme that switches between vector field training and filtering steps, together with multiple trajectories of training data. We find that the neural shape function architecture retains the approximation properties of dense neural networks, enables effective computation of vector field error, and allows for graphical interpretability, all for data/systems in any finite dimension d. We also study the combination of either our neural shape function method or existing differential equation learning methods with alternating minimization and multiple trajectories. We find that retrofitting any learning method in this way boosts the method's robustness to noise. While in their raw form the methods struggle with 1% Gaussian noise, after retrofitting, they learn accurate vector fields from data with 10% Gaussian noise.

I. INTRODUCTION
We consider the problem of learning a dynamical system from multiple, vector-valued time series. Suppose we have N time series or trajectories, each observed at T discrete times $\{t_i\}_{i=1}^{T}$. We assume this temporal grid is sufficiently fine to capture the dynamics of the system of interest. Let $y_i^j \in \mathbb{R}^d$ denote the observation for trajectory j at time $t_i$. Here $d \geq 1$ is arbitrary. Given this data, we compute estimates of (i) the states $\widehat{x}_i^j$ and (ii) a vector field $\widehat{f} : \mathbb{R}^d \to \mathbb{R}^d$ that determines a dynamical system model for the time-evolution of the states. Suppose the true trajectory $x^j(t)$ satisfies the nonlinear system of differential equations given by

$$\dot{x}^j(t) = \frac{d x^j(t)}{dt} = f(x^j(t)). \qquad (1)$$

We model the observation $y_i^j$ as the true state plus noise:

$$y_i^j = x^j(t_i) + \epsilon^j(t_i), \qquad i = 1, \ldots, T. \qquad (2)$$

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 and was supported by the LLNL-LDRD Program under Project No. 19-ERD-009. LLNL-CONF-800042. H. S. Bhat and M. Reeves acknowledge partial support from NSF DMS-1723272. We also acknowledge use of the MERCED computational cluster, funded by NSF award ACI-1429783.

©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
We seek $\widehat{x}_i^j$ that approximates $x^j(t_i)$, and $\widehat{f}$ that approximates $f$. While there has been much recent interest in learning differential equations from data, the bulk of the literature focuses on computing $\widehat{f}$ from relatively clean data; [1] notes that when Gaussian noise of sufficient strength is present in the observed states, estimation becomes unstable and inaccurate. We demonstrate below that leading methods encounter difficulty even at relatively low noise magnitudes [2], [3]. Our method yields accurate estimates even when the data is corrupted by noise.

Our prior work [4] established that the block coordinate descent proximal filtering method yields more accurate and robust parameter estimates than either iPDA or the extended Kalman filter. A key element of our filtering approach is its proximal step [5]. In contrast, other techniques such as iPDA and soft adherence use a penalty term that anchors the filtered states (at all iterations) to the data Y [6], [7]. Our work shares goals with [8], who are concerned with recovering the functional form of sufficiently ergodic, chaotic dynamical systems from highly corrupted measurements. While our parameterizations of f differ considerably, both the present paper and [8] develop and apply alternating minimization methods. The present paper focuses on extending [4] to the setting where the functional form of $\widehat{f}$ is unknown and must be estimated. In our prior work, we assumed this vector field was known up to a finite-dimensional set of parameters. Hence our prior work contains no mention of neural networks, nor does it contain comparisons against the methods of [3] and [9].

The central finding of this paper is as follows: as the magnitude of noise increases, computing accurate $\widehat{x}_i^j$ and $\widehat{f}$ is still possible if one alternates model estimation steps with filtering steps, and if one trains using a larger number N of trajectories. Using tests on simulated data, we show that leading equation discovery methods can be retrofitted with our filtering approach, greatly enhancing the ability of these methods to cope with noisy data. These methods differ in the way they model the estimated vector field f: sparse linear combinations of prescribed functions [2], neural shape functions, or dense, feedforward neural networks [3]. We conduct these tests for both the FitzHugh–Nagumo system and a nonlinear oscillator chain, showing that we can recover both the filtered states and the underlying vector field with high accuracy.

We also apply our method to power grid data recorded by micro-phasor measurement units (µPMUs). The units record synchronized measurements of voltage and phase angle at the power distribution level [10]. Using data taken on two different days for which the system's status has been labeled, we estimate two vector fields, one for normal operation and one for anomalous operation. To estimate these vector fields, we found it essential to both filter the data and increase the number of trajectories.

II. METHODS
A. Filtering
Let $\widehat{f}(x, \theta)$ model the true vector field f. For now, we are not concerned with the actual details of this model except to say that the model parameters are given by θ. Consider an explicit Euler time discretization of (1) with time step $\Delta_i = t_{i+1} - t_i$. For trajectory j, we have

$$x_{i+1}^j - x_i^j = f(x_i^j, \theta)\,\Delta_i, \qquad i = 1, \ldots, T-1. \qquad (3)$$

We use Euler purely for simplicity here; in practice the method can accommodate higher-order, explicit time integration schemes. Let $\widehat{X}$ denote the collection $\{\widehat{x}_i^j\}_{j=1,i=1}^{N,T}$ of all filtered states over all trajectories. Then define the objective function

$$E(\widehat{X}, \widehat{\theta}) = \sum_{j=1}^{N} \sum_{i=1}^{T-1} \left\| \frac{\widehat{x}_{i+1}^j - \widehat{x}_i^j}{\Delta_i} - \widehat{f}(\widehat{x}_i^j, \widehat{\theta}) \right\|^2. \qquad (4)$$

Assume that if the true states X were known, then $\widehat{f}$ could be trained by minimizing E over θ. We thus refer to minimization over θ as model training.

Let Y denote the collection $\{y_i^j\}_{j=1,i=1}^{N,T}$ of all observations over all trajectories. We propose the following alternating procedure to learn $\widehat{X}$ and $\widehat{\theta}$. The states are initialized to be the data, $\widehat{X}^0 = Y$; superscripts denote the iteration number:

$$\text{train:}\quad \widehat{\theta}^{k+1} = \operatorname*{argmin}_{\theta}\ E(\widehat{X}^k, \theta) \qquad (5a)$$

$$\text{filter:}\quad \widehat{X}^{k+1} = \operatorname*{argmin}_{X}\ \left\{ E(X, \widehat{\theta}^{k+1}) + \lambda \| X - \widehat{X}^k \|^2 \right\}. \qquad (5b)$$

We terminate when the change in $(\widehat{\theta}, \widehat{X})$ is sufficiently small.
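To make the alternating scheme (5) concrete, the following is a minimal sketch, assuming a single trajectory, a uniform time step, and a degree-two polynomial dictionary model that is linear in θ (so the train step reduces to ordinary least squares); the filter step is solved with L-BFGS-B. The function names and hyperparameter values are illustrative and do not come from the authors' released code.

```python
# Minimal sketch of the alternating minimization scheme (5) for one trajectory.
import numpy as np
from scipy.optimize import minimize

def dictionary(x):
    """Polynomial dictionary up to degree two for a d-dimensional state x (1-D array)."""
    d = x.shape[0]
    quad = [x[a] * x[b] for a in range(d) for b in range(a, d)]
    return np.concatenate([x, np.array(quad)])

def objective_E(X, theta, dt):
    """Eq. (4): sum of squared one-step Euler residuals for a trajectory X of shape (T, d)."""
    Xi = np.array([dictionary(x) for x in X[:-1]])
    resid = (X[1:] - X[:-1]) / dt - Xi @ theta
    return np.sum(resid ** 2)

def train_step(X, dt):
    """Eq. (5a): least-squares fit of theta given the current filtered states."""
    Xi = np.array([dictionary(x) for x in X[:-1]])        # (T-1) x s
    dXdt = (X[1:] - X[:-1]) / dt                          # (T-1) x d
    theta, *_ = np.linalg.lstsq(Xi, dXdt, rcond=None)
    return theta

def filter_step(X_prev, theta, dt, lam):
    """Eq. (5b): proximal update of the filtered states."""
    shape = X_prev.shape
    def obj(z):
        X = z.reshape(shape)
        return objective_E(X, theta, dt) + lam * np.sum((X - X_prev) ** 2)
    res = minimize(obj, X_prev.ravel(), method="L-BFGS-B")
    return res.x.reshape(shape)

def alternating_minimization(Y, dt, lam=1.0, iters=20):
    X_hat = Y.copy()                                      # initialize the states to the data
    for _ in range(iters):
        theta_hat = train_step(X_hat, dt)                 # (5a)
        X_hat = filter_step(X_hat, theta_hat, dt, lam)    # (5b)
    return X_hat, theta_hat
```

In the multiple-trajectory setting, the objective and both steps simply sum over trajectories j as in (4); the single-trajectory version above keeps the sketch short.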
B. Three Parameterizations of $\widehat{f}$

1) SINDy:
We first take $\widehat{f}$ to be a linear combination of prescribed functions. Given a $1 \times d$ input x, we let $\Xi(x)$ denote a $1 \times s$ dictionary of functions. For instance, for $x = (x_1, x_2)$ we can take $\Xi(x_1, x_2) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)$, i.e., all polynomials in the components of x, up to degree two. Using this dictionary, we write

$$\widehat{f}(x) = \Xi(x)\,\theta \qquad (6)$$

where θ is an $s \times d$ matrix of coefficients. Throughout this paper, to extend $\widehat{f}(x)$ to a function $\widehat{f}(X)$, we apply $\widehat{f}$ to each $1 \times d$ slice of X. To solve for θ, we apply an iteratively thresholded least-squares regression procedure that promotes a sparse solution θ; for further details, consult [2], [11]. A minimal sketch of this thresholding procedure appears below.

Fig. 1. We compare the neural shape method with [3], [9], and [2] on FitzHugh–Nagumo with different noise levels (1%, 5%, and 10%). For each noise level, we create several sets of observations, run the methods on each set separately, and report prediction errors. Each error bar shows the error on one experiment. For all plots, we train on the time interval [0, T]. In the upper (resp., lower) set of plots, we report the prediction error on [0, T] (resp., [T, 2T]).
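As a concrete illustration of the sparse regression used to solve for θ in (6), here is a minimal sketch of sequentially thresholded least squares in the spirit of [2], [11]; the threshold value, sweep count, and function name are placeholders rather than the authors' implementation.

```python
# Minimal sketch of sequentially thresholded least squares for Eq. (6).
import numpy as np

def thresholded_least_squares(Xi, dXdt, threshold=0.1, n_sweeps=10):
    """Solve dXdt ~ Xi @ theta, repeatedly zeroing coefficients smaller than `threshold`.

    Xi:    (m, s) dictionary matrix evaluated at the states
    dXdt:  (m, d) estimated time derivatives
    """
    theta, *_ = np.linalg.lstsq(Xi, dXdt, rcond=None)          # initial dense fit
    for _ in range(n_sweeps):
        small = np.abs(theta) < threshold                       # coefficients to prune
        theta[small] = 0.0
        for col in range(dXdt.shape[1]):                        # refit each component of f
            keep = ~small[:, col]
            if np.any(keep):
                theta[keep, col], *_ = np.linalg.lstsq(Xi[:, keep], dXdt[:, col], rcond=None)
    return theta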
2) Neural Shape Functions:
Instead of prescribing Ξ and learning only $\widehat{\theta}$, here we learn both Ξ and $\widehat{\theta}$. Let B be the desired number of one-dimensional shape functions; we model these using a dense neural network with one-dimensional input, $D-1$ hidden layers, and a final layer with B outputs. Let $h(x) \in \mathbb{R}^B$ denote the output corresponding to scalar input x; then $h_j$ is the j-th one-dimensional shape function. To create multi-dimensional shape functions, we take tensor products of one-dimensional shape functions. Let a multi-index denote a collection of integers $\alpha = (\alpha_1, \ldots, \alpha_d)$ such that $0 \leq \alpha_k \leq B$. Define $h_0(x) = 1$. Then for a given multi-index α, define the multi-dimensional shape function

$$H_\alpha(x) = \prod_{k=1}^{d} h_{\alpha_k}(x_k) \in \mathbb{R}. \qquad (7)$$

Let A be a collection of multi-indices and |A| its cardinality. Suppose we compute $H_\alpha(x)$ for each $\alpha \in A$; in this way we obtain a $1 \times |A|$ vector $H(x)$. Here H(x) plays the same role as Ξ(x) in SINDy above; |A| is analogous to s, the number of shape functions. Now let $\mathbf{B}$ denote an $|A| \times d$ matrix of weights. The neural shape function model is $\widehat{f}(x) = H(x)\,\mathbf{B}$. Note that there are $(B+1)^d$ possible multi-dimensional shape functions. In practice, we choose A such that $|A| \ll (B+1)^d$, thus constraining $\widehat{f}$ to be a small linear combination of multi-dimensional shape functions. The set of parameters θ consists of $\mathbf{B}$ together with all weights and biases in the h(x) network. (A minimal sketch of this construction appears at the end of this subsection.)

Suppose we work in a compact subset $K \subset \mathbb{R}^d$. Given direct observations of a smooth vector field f in K, it is possible to achieve arbitrarily small error $\| f - \widehat{f} \|$ by choosing the neural shape function hyperparameters sufficiently large. To understand why, note that universal approximation theory guarantees that (even when D = 1) the space of our one-dimensional neural shape functions is dense in the space of continuous functions. Hence tensor products of these shape functions are dense in the space of tensor products of continuous functions.

Fig. 2. Ground truth FitzHugh–Nagumo vector field.
By the Stone–Weierstrass theorem, any smooth vector field can be approximated by linear combinations of tensor products of continuous functions (e.g., univariate polynomials). Putting the previous two facts together, we conclude that linear combinations of our multi-dimensional neural shape functions can be used to approximate smooth vector fields; the accuracy of this approximation can be controlled by the number of units in each $h_j$ network together with |A|.
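To make the construction concrete, here is a minimal sketch of the neural shape function model, assuming a single hidden layer with tanh activations, a small default multi-index set A, and PyTorch as the framework; the class name, layer sizes, and defaults are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the neural shape function vector field, Eq. (7).
import torch
import torch.nn as nn

class NeuralShapeVectorField(nn.Module):
    def __init__(self, d, B=3, hidden=16, multi_indices=None):
        super().__init__()
        self.d, self.B = d, B
        # one small dense net produces all B one-dimensional shape functions h_1..h_B
        self.h = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, B))
        # A: list of multi-indices alpha in {0,...,B}^d; default to a tiny example set
        self.A = multi_indices if multi_indices is not None else [(1,) * d, (2,) + (0,) * (d - 1)]
        self.W = nn.Parameter(0.1 * torch.randn(len(self.A), d))   # the |A| x d matrix of weights

    def forward(self, x):                     # x: (n, d) batch of states
        n = x.shape[0]
        h_vals = self.h(x.reshape(-1, 1)).reshape(n, self.d, self.B)   # h_b(x_k) for every k, b
        ones = torch.ones(n, self.d, 1)
        h_vals = torch.cat([ones, h_vals], dim=2)                      # prepend h_0 = 1
        # H_alpha(x) = prod_k h_{alpha_k}(x_k), stacked over alpha in A
        H = torch.stack(
            [torch.prod(torch.stack([h_vals[:, k, a[k]] for k in range(self.d)], dim=1), dim=1)
             for a in self.A], dim=1)                                   # (n, |A|)
        return H @ self.W                                              # (n, d) estimate of f(x)
```

Because the one-dimensional functions $h_j$ are shared across all tensor products, the learned shapes can be plotted directly, which is the interpretability advantage discussed below.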
3) Dense Neural Network (DNN):
We also consider a dense, feedforward neural network model $\widehat{f}(x)$ with d inputs and d outputs, the model used by [3].

III. RESULTS
Here we present results from simulated data experiments in which the ground truth vector field f is known.

A. FitzHugh–Nagumo
In Figure 1, we compare neural shape functions plus filtering (our method) against three published methods [2], [3], [9] on simulated data from the FitzHugh–Nagumo system:

$$\frac{dx_1}{dt} = c \left( x_1 - \frac{x_1^3}{3} + x_2 \right), \qquad \frac{dx_2}{dt} = -\frac{1}{c} \left( x_1 - a + b\,x_2 \right).$$

To generate simulated data, we set the system parameters a = 0.7, b = 0.8, c = 3 and fix an initial condition. We then numerically solve the system on the interval [0, T] with T = 20, recording data at a uniform spacing ∆t. To these clean trajectories we add mean-zero Gaussian noise at strengths of 1%, 5%, and 10%. We train using data on [0, T]; in Figure 1, we plot prediction errors for both the training interval and an extrapolatory test interval [T, 2T].

For this initial comparison, let us define the concept of prediction error. Suppose we are at iteration k and that we have trained the model for e epochs. Define the predicted states $\widetilde{X}^k_e$ to be the numerical solution of (1) that results from using the filtered initial conditions contained in $\widehat{X}^k$ (for the j-th trajectory, $\widehat{x}^j_1$) together with the current best vector field $\widehat{f}(x, \widehat{\theta}^k)$. We call the distance between $\widetilde{X}^k_e$ and $\widehat{X}^k$ the prediction error; we claim this is a good metric to detect overfitting. If we find that the current prediction error increases as we increase the number of epochs, we halt the optimization step (5a), set $\widehat{\theta}^{k+1}$ to the weights of the network with minimum current prediction error, and proceed to the filtering step (5b).

Fig. 3. We applied algorithm (5), with the neural shape function model for $\widehat{f}$, to data obtained by simulating the FitzHugh–Nagumo system and then adding noise. First panel: the clean states and predicted states over time. Here the predicted states are obtained by numerically integrating the neural vector field $\widehat{f}$ forward in time starting at the estimated/filtered initial condition. Second panel: visualization of the learned one-dimensional shape functions $h_j(x)$.

In Figure 3, we plot the predicted and true states for the neural shape function model of $\widehat{f}$ learned from data with Gaussian noise. We also plot the learned one-dimensional neural shape functions $h_j(x)$. An advantage of this method is the ability to graphically interpret these shape functions, regardless of the dimension d of the vector field being modeled.

Among the methods studied in Figure 1, only our method and that of [9] attempt to deal with noise during training. Instead of using a proximal term of the form $\| X - \widehat{X}^k \|^2$ as in (5b), [9] use a penalty term of the form $\| X - Y \|^2$; filtered states are always anchored to the data. While our method outperforms this penalty-based method, both are more robust to noise than approaches that do not incorporate filtering at all. Clearly SINDy [2] has issues with even small amounts of noise, while the dense neural network approach of [3] encounters problems starting at moderate noise levels.

We have carried out tests similar to that of Figure 1 for nonlinear, chaotic systems such as the Lorenz, Rössler, and double pendulum systems.
For such systems, if we perturb the vector field or initial conditions slightly, sensitive dependence implies that over time, trajectories will diverge exponentially. For such systems, the prediction error on [T, 2T] can seem large even if we estimate the vector field and filtered states with a reasonable degree of accuracy. These tests motivate us to measure the vector field error $\| \widehat{f} - f \|$, which provides insight into how well the estimated vector field captures the global behavior of f.

TABLE I. For each of the three ODE learning methods (filtered SINDy, neural shape functions, and filtered DNN), all combined with filtering, we record the error in the estimated vector field as a function of both noise strength and the number of trajectories. For each method and each noise strength, increasing the number of trajectories decreases the vector field error. Note that retrofitting SINDy with filtering dramatically improves its ability to estimate the vector field accurately, as compared to the results in Figure 1.

We now retrofit both [3] (the dense neural network model for $\widehat{f}$) and [2] (SINDy) with our filtering procedure. To be clear, both retrofitted methods follow (5); the only difference is in how $\widehat{f}$ is modeled.

As all the results shown in Figure 1 are for one trajectory, we now move to the setting where we have multiple trajectories. We numerically integrate the FitzHugh–Nagumo system with initial conditions taken from an equispaced grid on a square, producing a set of trajectories. We then apply each retrofitted procedure to either one trajectory, a randomly chosen subset, or all trajectories from the collection. In each case, we obtain an estimated vector field $\widehat{f}(x)$. We compute the vector field error $\| \widehat{f} - f \|$ using elementary quadrature on the same square (a minimal sketch appears at the end of this subsection). Here f(x) is the right-hand side of the FitzHugh–Nagumo system; see the ground truth vector field plotted in Figure 2.

The results, shown in Figure 9 and Table I, clearly show the benefit of combining any method with filtering and increasing the number of trajectories used for training, even if those trajectories are all corrupted with 10% noise. The improvement is particularly striking for the SINDy method; in fact, after incorporating SINDy into (5), it outperforms the other methods.
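For reference, a vector field error of this kind can be approximated by evaluating both fields on an equispaced grid over the square and averaging the pointwise error. The sketch below assumes the commonly used FitzHugh–Nagumo parameter values, a placeholder grid, and the mean absolute error as the metric.

```python
# Minimal sketch of computing ||f_hat - f|| by quadrature on an equispaced grid.
import numpy as np

def fitzhugh_nagumo(x, a=0.7, b=0.8, c=3.0):
    """Ground truth FitzHugh-Nagumo right-hand side; x has shape (n, 2)."""
    x1, x2 = x[:, 0], x[:, 1]
    return np.stack([c * (x1 - x1**3 / 3 + x2), -(1.0 / c) * (x1 - a + b * x2)], axis=1)

def vector_field_error(f_hat, f_true, lo=-2.0, hi=2.0, n=101):
    g = np.linspace(lo, hi, n)
    X1, X2 = np.meshgrid(g, g)
    pts = np.column_stack([X1.ravel(), X2.ravel()])     # grid points covering the square
    return np.mean(np.abs(f_hat(pts) - f_true(pts)))    # mean absolute error over the grid
```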
B. Nonlinear Oscillator Network

Consider a ring of M masses connected by identical springs, each with potential energy V(x) and force $F(x) = -V'(x)$. Here x denotes displacement from equilibrium. Then the equations of motion for the mass-spring system are

$$\ddot{x}_i = F(x_i - x_{i-1}) - F(x_{i+1} - x_i) \qquad (8)$$

for $i = 1, \ldots, M$, with the periodic conventions $x_0 \equiv x_M$ and $x_{M+1} \equiv x_1$. For our tests, we choose the double-well potential $V(x) = -8x^2 + (1/4)x^4$, so that $F(x) = 16x - x^3$.
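A minimal data-generation sketch for this system, using the 8th-order Dormand–Prince integrator from scipy.integrate as described in the following paragraphs, might look as follows; the trajectory count, tolerances, initial-condition box, and the way the noise percentage is converted to a standard deviation are illustrative assumptions.

```python
# Minimal sketch of simulating the ring of masses (8) and adding Gaussian noise.
import numpy as np
from scipy.integrate import solve_ivp

M = 3                                   # number of masses; first-order state dimension is 2M

def force(x):                           # F(x) = -V'(x) for the double-well potential
    return 16.0 * x - x**3

def ring_rhs(t, z):
    x, v = z[:M], z[M:]
    xm, xp = np.roll(x, 1), np.roll(x, -1)          # x_{i-1} and x_{i+1} with periodic wrap
    a = force(x - xm) - force(xp - x)               # Eq. (8)
    return np.concatenate([v, a])

def make_trajectories(n_traj=100, t_final=4.0, n_steps=400, noise_pct=0.10, seed=0):
    rng = np.random.default_rng(seed)
    t_eval = np.linspace(0.0, t_final, n_steps + 1)
    clean = []
    for _ in range(n_traj):
        z0 = rng.uniform(-0.5, 0.5, size=2 * M)     # initial condition in a small hypercube
        sol = solve_ivp(ring_rhs, (0.0, t_final), z0, method="DOP853",
                        t_eval=t_eval, rtol=1e-10, atol=1e-10)
        clean.append(sol.y.T)                       # (n_steps+1) x 2M
    clean = np.array(clean)
    noisy = clean + noise_pct * np.std(clean) * rng.standard_normal(clean.shape)
    return t_eval, clean, noisy
```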
TABLE II. We apply the retrofitted SINDy algorithm (9) to noisy observations of the mass-spring system (8); for each noise level and number of trajectories, the columns report the vector field error $\| f - \widehat{f} \|$, the filtering error $\| X - \widehat{X} \|$, and the prediction error. Retrofitting the SINDy algorithm enables it to handle noisy data. The results also show that, at each noise level, increasing the number of trajectories clearly reduces prediction errors. For explanations of the different types of errors, please see the main text.

We focus on the system with M = 3 masses; when we write the system in first-order form, the system has 2M = 6 degrees of freedom corresponding to $x_i$ and $\dot{x}_i$ for each $i = 1, \ldots, M$. To generate data for our tests, we simulate (8) using the 8th-order Dormand–Prince integrator in scipy.integrate, with stringent absolute and relative tolerances. We generate a collection of trajectories, each with an initial condition sampled uniformly from a small hypercube centered at the origin. Each trajectory is saved at equispaced time steps from t = 0 to a final time of t = 4. To these clean trajectories X, we add mean-zero Gaussian noise with strengths of 1%, 5%, and 10% as before, resulting in noisy data Y.

We focus our attention on a retrofitted SINDy method with $\widehat{f}$ modeled using a dictionary of polynomials. In this method, we use a variant of (5) theoretically analyzed in our previous work [4]. In this method, we incorporate the SINDy model (6), which has the benefit of being linear in the parameters θ. This linearity implies that the objective function (4) is convex in θ. Suppose we split the decision variables $\widehat{X}$ into two halves, the first half $\widehat{X}_+$ consisting of time steps $i = 1, \ldots, \lfloor T/2 \rfloor$, and the second half $\widehat{X}_-$ consisting of time steps $i = \lfloor T/2 \rfloor + 1, \ldots, T$. Then $\widehat{X} = (\widehat{X}_+, \widehat{X}_-)$. This leads us to the following block coordinate descent algorithm:

$$\text{train:}\quad \widehat{\theta}^{k+1} = \operatorname*{argmin}_{\theta}\ E(\widehat{X}^k_+, \widehat{X}^k_-, \theta) \qquad (9a)$$

$$\text{filter:}\quad \widehat{X}^{k+1}_- = \operatorname*{argmin}_{X_-}\ \left\{ E(\widehat{X}^k_+, X_-, \widehat{\theta}^{k+1}) + \lambda \| X_- - \widehat{X}^k_- \|^2 \right\} \qquad (9b)$$

$$\text{filter:}\quad \widehat{X}^{k+1}_+ = \operatorname*{argmin}_{X_+}\ \left\{ E(X_+, \widehat{X}^{k+1}_-, \widehat{\theta}^{k+1}) + \lambda \| X_+ - \widehat{X}^k_+ \|^2 \right\}. \qquad (9c)$$

Splitting $\widehat{X}$ and formulating the algorithm in this way gives us block convexity. That is, when we hold two of the three variables in $\{\widehat{X}^k_+, \widehat{X}^k_-, \widehat{\theta}\}$ fixed and minimize over the remaining variable, we obtain in each case a convex subproblem [4]. Note that this property would not hold if we were to instead use either of the neural network approaches to model $\widehat{f}$, as in this case (9a) would be non-convex.
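The following is a minimal sketch of iteration (9) for a single trajectory. For brevity it fits θ by plain least squares in the train step (the thresholding sweep sketched after Figure 1 can be substituted) and solves each filter subproblem with L-BFGS-B rather than the fixed number of gradient descent steps described in the text; λ, the iteration cap, and the tolerance are placeholders.

```python
# Minimal sketch of the block coordinate descent iteration (9) for one trajectory.
import numpy as np
from scipy.optimize import minimize

def euler_objective(X, theta, dict_fn, dt):
    """Objective (4) for one trajectory X of shape (T, d)."""
    Xi = np.array([dict_fn(x) for x in X[:-1]])
    return np.sum(((X[1:] - X[:-1]) / dt - Xi @ theta) ** 2)

def filter_block(X, rows, theta, dict_fn, dt, lam):
    """Proximal update of the time steps in `rows`, holding the rest of X fixed."""
    anchor = X[rows].copy()
    def obj(z):
        Xtry = X.copy()
        Xtry[rows] = z.reshape(anchor.shape)
        return euler_objective(Xtry, theta, dict_fn, dt) + lam * np.sum((Xtry[rows] - anchor) ** 2)
    Xnew = X.copy()
    Xnew[rows] = minimize(obj, anchor.ravel(), method="L-BFGS-B").x.reshape(anchor.shape)
    return Xnew

def retrofitted_sindy(Y, dict_fn, dt, lam=1e-3, iters=50, tol=1e-6):
    X = Y.copy()
    T = X.shape[0]
    first, second = np.arange(T // 2), np.arange(T // 2, T)     # the two halves of the time grid
    for _ in range(iters):
        Xi = np.array([dict_fn(x) for x in X[:-1]])
        theta, *_ = np.linalg.lstsq(Xi, (X[1:] - X[:-1]) / dt, rcond=None)   # (9a), no thresholding here
        X_old = X
        X = filter_block(X, second, theta, dict_fn, dt, lam)                 # (9b)
        X = filter_block(X, first, theta, dict_fn, dt, lam)                  # (9c)
        if np.linalg.norm(X - X_old) < tol:                                  # termination test from the text
            break
    return X, theta

# Example dictionary (illustrative only): linear plus cubic monomials of each coordinate.
example_dict = lambda x: np.concatenate([x, x**3])
```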
Fig. 4. We monitor and plot the magnitudes of the change in filtered states, $\| \widehat{X}^{k+1} - \widehat{X}^k \|$, as a function of iteration number k. We have done this for each of three training sets, containing 1%, 5%, and 10% noise. In all cases, we find that this quantity decreases rapidly and monotonically.

When we train, we restrict Y to consist of only the first portion of each trajectory, through t = 3. Starting with noisy data $\widehat{X}^0 = Y$, we run algorithm (9) with a small value of λ. Note that λ multiplies the squared Frobenius error between two large matrices; hence even a small value of λ is still consequential. We terminate if the norm difference between $\widehat{X}^{k+1}$ and $\widehat{X}^k$ falls below a small tolerance, or if we have already completed a fixed maximum number of iterations.

To solve (9b) and (9c), we use a fixed number of gradient descent steps with a small learning rate. To solve (9a), we use the iteratively thresholded least squares procedure from [2], [11] with a fixed threshold. For the dictionary Ξ defined in (6), we include all polynomials up to a fixed degree in the 6 state variables, not including an intercept or constant term, resulting in s = 63 columns. We have uploaded our code to https://github.com/hbhat4000/filtersindy/

An interesting property of algorithm (9) is the way in which $\| \widehat{X}^{k+1} - \widehat{X}^k \|$ behaves as a function of k. In Figure 4, we have plotted this quantity as a function of iteration k. We have recorded these magnitudes of the change in filtered states while running algorithm (9) on the full training set of trajectories, contaminated with either 1%, 5%, or 10% noise. In all cases, we see monotonic convergence. Note that the vertical scale on this plot is logarithmic, further indicating the rapid decrease in $\| \widehat{X}^{k+1} - \widehat{X}^k \|$ as a function of k. As a practical matter, this means that we do not need to run (9) for many iterations, and that the results are robust to the number of iterations.

After running algorithm (9) on training sets of varying size with either 1%, 5%, or 10% noise, we quantify errors in three ways; see Table II. The vector field error $\| f - \widehat{f} \|$ measures the mean absolute error between ground truth and estimated vector fields. The quantity $\| X - \widehat{X} \|$ is the mean absolute error between true and filtered states. After training on the first portion of each trajectory, we compute predicted trajectories by numerically integrating the estimated vector field forward in time, starting from the filtered initial conditions.

Fig. 5. We present the results of applying (9) to an increasing number of trajectories (top to bottom) for the nonlinear mass-spring system (8), all contaminated with Gaussian noise. In each of these training sets, we have singled out one common trajectory for the purposes of illustration. In each plot, for this common trajectory, we have plotted the final estimate of the filtered states $\widehat{X}$ in red, and ground truth trajectories X in black. A close inspection of the plots reveals errors with few trajectories; for the purposes of plotting, these errors disappear with more trajectories. For more details, see the main text.
The prediction error is the mean absolute error between these predicted trajectories and the true states X. When we numerically integrate the estimated vector field, we use a standard fourth-order explicit Runge–Kutta method.

The first thing to notice from the results in Table II is that they confirm that combining SINDy with our alternating minimization filtering approach allows SINDy to estimate the mass-spring system's vector field accurately. Even with 10% noise in the original data, the vector field error remains small when we train on the full set of trajectories. To get a sense for what this means, note that the ground truth vector field is highly sparse: across the entire ground truth coefficient matrix θ, out of the 63 · 6 possible entries, only a small number are nonzero. The retrofitted SINDy algorithm gets this sparsity pattern correct; if we round the coefficients obtained by SINDy, we obtain the ground truth vector field.

Let us now interpret the filtering errors (the second type of error) in Table II. At the 10% noise level, increasing the number of trajectories beyond 10 decreases the filtering error only slightly. To visualize this, we present Figure 5. From top to bottom, we present the results of training with an increasing number of trajectories worth of data, all contaminated with Gaussian noise. For the purposes of illustration, we have singled out one trajectory $\{x_i(t), \dot{x}_i(t)\}_{i=1:3}$ that is a member of all three training sets. Each plot contains black curves, corresponding to $\{x_i(t), \dot{x}_i(t)\}$ for i = 1, 2, 3, together with the filtered (or hatted) versions, for a total of 12 curves per plot. When there are only a few trajectories (top panel), one can discern differences between red and black curves; as we go to the full set of trajectories (bottom panel), these errors mostly disappear.

Overall, Figure 5 supports the notion that it is indeed possible to estimate filtered states even when the equations of motion (i.e., the vector field) for the underlying system are themselves unknown and must be simultaneously estimated. However, there is a difference between filtered trajectories and predicted trajectories of the dynamical system. One will notice from Figure 5 that the time axis ends at t = 3; the training data covers only [0, 3], so there is no noisy training data to filter beyond t = 3.

To obtain solutions of the estimated dynamical system beyond t = 3, we must numerically integrate, as in the predicted trajectories whose errors are quantified in Table II. When we compute the predicted trajectories in this table, we integrate all the way up to t = 4 and then compare against the ground truth (clean) states of the system X. We view predictions on t ∈ (3, 4] as a true test set, i.e., a test of the estimated vector field's ability to extrapolate beyond the training set.

We again single out the same trajectory common to our training sets. Starting with the estimated/filtered initial conditions, we numerically integrate the estimated vector field forward in time using a standard fourth-order explicit Runge–Kutta method, from t = 0 to t = 4. In Figure 6, we compare the results of these numerical integrations (in green) against the ground truth trajectories (in black). When the number of trajectories is small (top), the predicted dynamics are qualitatively wrong.
As we train on more trajectories (middle, bottom), the dynamics begin to qualitatively match the true mass-spring system's dynamics. Note that we obtain much better quantitative accuracy for t ∈ [0, 3], the training interval.

Fig. 6. We present the results of applying (9) to an increasing number of trajectories (top to bottom) worth of data from the nonlinear mass-spring system (8), all contaminated with Gaussian noise. In each of these training sets, we have singled out one common trajectory for the purposes of illustration. In each plot, for this common trajectory, we have plotted predicted trajectories (the results of numerically integrating the estimated vector field forward in time from the estimated initial conditions) in green, and ground truth trajectories X in black. The estimated vector field and filtered initial conditions are inaccurate when we use only a few trajectories, leading to qualitatively incorrect dynamics. As we increase the number of trajectories, we recover the qualitatively correct behavior of the mass-spring system. The dynamics for t ∈ [0, 3] are more accurate because that is the interval covered by the training data; for t ∈ (3, 4], we are seeing the results of propagating beyond the training interval.

For the bottom panel (trained on the full set of trajectories), by the time we reach t = 3, the predicted state has drifted noticeably away from the ground truth. We can improve upon this situation by resetting the state of the system, at t = 3, to equal our estimate $\widehat{x}(3)$ of the true state at that time and then continuing the numerical integration until t = 4. In Figure 7, we see that this procedure leads to improved quantitative agreement between predicted and ground truth trajectories on the test interval t ∈ (3, 4].

Fig. 7. We continue with results for algorithm (9) applied to noisy data from the nonlinear mass-spring system (8). We redo the numerical integration from the bottom panel of Figure 6, this time resetting the state of the system at t = 3 to equal our estimate $\widehat{x}(3)$ of the true state of the system at that time. This improves the quantitative accuracy of our predicted trajectory on the extrapolation/test interval t ∈ (3, 4].

C. MicroPMU Data
We apply the neural shape function plus filtering method to µPMU data taken from a Mountain View substation near Riverside, CA. We use one hour's worth of data from both Aug. 9 and Aug. 1, 2017. We use measurements aggregated at a spacing of ∆t = 0.01 s. In prior work, these dates have been identified as corresponding to normal (Aug. 9) and anomalous (Aug. 1) system operation. Our idea is to learn two vector fields, one for normal and one for anomalous behavior.

To begin, we prefiltered the data using a wavelet low-pass filter. This was to remove high-frequency noise prevalent in the raw data. Next, we focused our modeling efforts on two phase angle variables; this means that we did not use the remaining components of the full multi-channel signal. To implement the multiple trajectory idea, we reshaped the single hour's trajectory from its original length of T = 360000 to N = 3600 trajectories each consisting of T = 100 points (see the preprocessing sketch at the end of this subsection). To implement (5a) we use several epochs of the Adam optimizer with a small learning rate; to implement (5b), we use L-BFGS-B from scipy.optimize with default tolerances and λ = 1000. The network itself consists of B = 3 shape functions with a depth of D = 2 and U = 16 units per layer. The results are robust to increasing λ, B, D, or U. However, note that training on one prefiltered trajectory of length T = 360000, without alternating filtering, does not yield usable models.

In Figure 8, we plot the phase portraits for the learned systems. Interestingly, the main difference between the learned vector fields for the normal (left) and anomalous (right) settings has to do with stability. In short, we see that normal (respectively, anomalous) system operation corresponds to stable (respectively, unstable) oscillations. We have confirmed this by numerically finding the fixed points of the vector fields and checking the eigenvalues of the Jacobians at these fixed points.

Fig. 8. µPMU phase plots for normal (left) and anomalous (right) system behavior. Note that the fixed point is stable (complex-conjugate eigenvalues with small negative real part) on the left and unstable (complex-conjugate eigenvalues with small positive real part) on the right.
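Here is a minimal sketch of the preprocessing just described: a wavelet low-pass prefilter followed by reshaping one long record into many short trajectories. The wavelet family, decomposition level, and the synthetic placeholder array standing in for the real µPMU data are all assumptions.

```python
# Minimal sketch of micro-PMU preprocessing: wavelet low-pass filter, then reshape.
import numpy as np
import pywt

def wavelet_lowpass(signal, wavelet="db4", level=6):
    """Keep only the coarse approximation coefficients; zero out the detail bands."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    coeffs = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

def split_into_trajectories(record, traj_len=100):
    """Reshape a (T_total, d) record into (T_total // traj_len) trajectories of length traj_len."""
    n_traj = record.shape[0] // traj_len
    return record[: n_traj * traj_len].reshape(n_traj, traj_len, record.shape[1])

# Example: two phase-angle channels, one hour at 360000 points, split into 3600
# trajectories of 100 points each, matching the setup described in the text.
raw = np.random.standard_normal((360000, 2))                     # placeholder for the real data
filtered = np.column_stack([wavelet_lowpass(raw[:, k]) for k in range(raw.shape[1])])
trajectories = split_into_trajectories(filtered, traj_len=100)   # shape (3600, 100, 2)
```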
IV. DISCUSSION

We find that we can compensate for noisy training data by increasing its volume (the number of trajectories) and by applying simultaneous filtering. Filtering, in the form of the proximal step (5b), reestimates the states $\widehat{X}$ based on the best model at the current iteration, $\widehat{f}(X; \widehat{\theta}^k)$. When Y is heavily contaminated by noise, several iterations of (5b) allow $\widehat{X}$ to step away from Y as needed. While we have focused on the FitzHugh–Nagumo and nonlinear mass-spring systems in the paper, we are currently running additional tests for higher-dimensional physical systems. Preliminary results confirm the same trends we have observed for the FitzHugh–Nagumo system.

SINDy uses polynomial shape functions and the FitzHugh–Nagumo vector field consists of polynomials. Yet unless we include filtering and multiple trajectories, SINDy performs poorly on noisy data. For non-polynomial vector fields, the neural network approaches allow for greater model flexibility. We hypothesize that on non-polynomial systems, neural network-based approaches will prove superior. With the neural shape function approach, one can plot and visualize the one-dimensional shape functions $h_j : \mathbb{R} \to \mathbb{R}$, analogous to visualizing the components of Ξ in SINDy. However, due to its special form and construction, the neural shape function model is more difficult to train than a standard dense neural network.

In future work, we seek to replace these dense neural networks with sparsely connected neural networks that have optimal approximation properties [12]. We will also combine these methods with dimensionality reduction techniques to obtain reduced-order models from the full multi-channel µPMU time series. Ultimately, our goal is to uncover global phenomena, such as instabilities, attractors, and periodic orbits, that go beyond single trajectory forecasts.

Fig. 9. We compare the estimated vector fields (filtered SINDy, neural shape functions, and filtered DNN; one trajectory versus many trajectories) for data consisting of FitzHugh–Nagumo trajectories corrupted by 10% Gaussian noise. We combine the three ODE learning methods from Section II-B with the filtering procedure from Section II-A. Note the improvement in the quality of the phase portrait as we increase the number of trajectories used for training.

REFERENCES

[1] H. Schaeffer, "Learning partial differential equations via data discovery and sparse optimization," Proceedings of the Royal Society A, vol. 473, no. 2197, p. 20160446, 2017.
[2] S. L. Brunton, J. L. Proctor, and J. N. Kutz, "Discovering governing equations from data by sparse identification of nonlinear dynamical systems," Proceedings of the National Academy of Sciences of the United States of America, vol. 113, no. 15, pp. 3932–3937, 2016.
[3] M. Raissi, P. Perdikaris, and G. E. Karniadakis, "Physics informed deep learning (part I): Data-driven solutions of nonlinear partial differential equations," arXiv:1711.10561, 2017.
[4] R. Raziperchikolaei and H. S. Bhat, "A block coordinate descent proximal method for simultaneous filtering and parameter estimation," in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, Jun. 2019, pp. 5380–5388. [Online]. Available: http://proceedings.mlr.press/v97/raziperchikolaei19a.html
[5] N. Parikh and S. Boyd, "Proximal algorithms," Foundations and Trends in Optimization, 2014. [Online]. Available: http://dx.doi.org/10.1561/2400000003
[6] J. Ramsay and G. Hooker, Dynamic Data Analysis: Modeling Data with Differential Equations, ser. Springer Series in Statistics. Springer New York, 2017.
[7] S. H. Rudy, S. L. Brunton, and J. N. Kutz, "Smoothing and parameter estimation by soft-adherence to governing equations," Journal of Computational Physics, vol. 398, p. 108860, 2019.
[8] G. Tran and R. Ward, "Exact recovery of chaotic systems from highly corrupted data," Multiscale Modeling & Simulation, vol. 15, no. 3, pp. 1108–1129, 2017.
[9] S. H. Rudy, J. N. Kutz, and S. L. Brunton, "Deep learning of dynamics and signal-noise decomposition with time-stepping constraints," Journal of Computational Physics, vol. 396, pp. 483–506, 2019.
[10] J. Cadena, P. Ray, and E. Stewart, "Fingerprint discovery for transformer health prognostics from micro-phasor measurements," in ICML 2019 Time Series Workshop, Long Beach, CA, 2019.
[11] L. Zhang and H. Schaeffer, "On the convergence of the SINDy algorithm," Multiscale Modeling & Simulation, vol. 17, no. 3, pp. 948–972, 2019.
[12] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen, "Optimal approximation with sparsely connected deep neural networks," SIAM Journal on Mathematics of Data Science, vol. 1, no. 1, pp. 8–45, 2019.