Estimating Vector Fields from Noisy Time Series
Harish S. Bhat
Applied Mathematics, University of California, Merced
Merced, CA [email protected]
Majerle Reeves
Applied Mathematics, University of California, Merced
Merced, CA [email protected]
Ramin Raziperchikolaei
Rakuten, Inc.
San Mateo, CA [email protected]
Abstract—While there has been a surge of recent interest in learning differential equation models from time series, methods in this area typically cannot cope with highly noisy data. We break this problem into two parts: (i) approximating the unknown vector field (or right-hand side) of the differential equation, and (ii) dealing with noise. To deal with (i), we describe a neural network architecture consisting of tensor products of one-dimensional neural shape functions. For (ii), we propose an alternating minimization scheme that switches between vector field training and filtering steps, together with multiple trajectories of training data. We find that the neural shape function architecture retains the approximation properties of dense neural networks, enables effective computation of vector field error, and allows for graphical interpretability, all for data/systems in any finite dimension d. We also study the combination of either our neural shape function method or existing differential equation learning methods with alternating minimization and multiple trajectories. We find that retrofitting any learning method in this way boosts the method's robustness to noise. While in their raw form the methods struggle with 1% Gaussian noise, after retrofitting, they learn accurate vector fields from data with 10% Gaussian noise.

I. INTRODUCTION
We consider the problem of learning a dynamical system from multiple, vector-valued time series. Suppose we have N time series or trajectories, each observed at T discrete times $\{t_i\}_{i=1}^{T}$. We assume this temporal grid is sufficiently fine to capture the dynamics of the system of interest. Let $y_i^j \in \mathbb{R}^d$ denote the observation for trajectory j at time $t_i$. Here $d \geq 1$ is arbitrary. Given this data, we compute estimates of (i) the states $\widehat{x}_i^j$ and (ii) a vector field $\widehat{f} : \mathbb{R}^d \to \mathbb{R}^d$ that determines a dynamical system model for the time-evolution of the states. Suppose the true trajectory $x^j(t)$ satisfies the nonlinear system of differential equations given by

$$\dot{x}^j(t) = \frac{d x^j(t)}{dt} = f(x^j(t)). \qquad (1)$$

We model the observation $y_i^j$ as the true state plus noise:

$$y_i^j = x^j(t_i) + \epsilon^j(t_i), \qquad i = 1, \ldots, T. \qquad (2)$$

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 and was supported by the LLNL-LDRD Program under Project No. 19-ERD-009. LLNL-CONF-800042. H. S. Bhat and M. Reeves acknowledge partial support from NSF DMS-1723272. We also acknowledge use of the MERCED computational cluster, funded by NSF award ACI-1429783.

©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
We seek $\widehat{x}_i^j$ that approximates $x^j(t_i)$, and $\widehat{f}$ that approximates $f$. While there has been much recent interest in learning differential equations from data, the bulk of the literature focuses on computing $\widehat{f}$ from relatively clean data; [1] notes that when Gaussian noise of sufficient strength is present in the observed states, estimation becomes unstable and inaccurate. We demonstrate below that leading methods encounter difficulty even at relatively low noise magnitudes [2], [3]. Our method yields accurate estimates even when the data is corrupted by noise.

Our prior work [4] established that the block coordinate descent proximal filtering method yields more accurate and robust parameter estimates than either iPDA or the extended Kalman filter. A key element of our filtering approach is its proximal step [5]. In contrast, other techniques such as iPDA and soft adherence use a penalty term that anchors the filtered states (at all iterations) to the data Y [6], [7]. Our work shares goals with [8], who are concerned with recovering the functional form of sufficiently ergodic, chaotic dynamical systems from highly corrupted measurements. While our parameterizations of f differ considerably, both the present paper and [8] develop and apply alternating minimization methods. The present paper focuses on extending [4] to the setting where the functional form of $\widehat{f}$ is unknown and must be estimated. In our prior work, we assumed this vector field was known up to a finite-dimensional set of parameters. Hence our prior work contains no mention of neural networks, nor does it contain comparisons against the methods of [3] and [9].

The central finding of this paper is as follows: as the magnitude of noise increases, computing accurate $\widehat{x}_i^j$ and $\widehat{f}$ is still possible if one alternates model estimation steps with filtering steps, and if one trains using a larger number N of trajectories. Using tests on simulated data, we show that leading equation discovery methods can be retrofitted with our filtering approach, greatly enhancing the ability of these methods to cope with noisy data. These methods differ in the way they model the estimated vector field f: sparse linear combinations of prescribed functions [2], neural shape functions, or dense, feedforward neural networks [3]. We conduct these tests for both the FitzHugh–Nagumo system and a nonlinear oscillator chain, showing that we can recover both the filtered states and the underlying vector field with high accuracy.

We also apply our method to power grid data recorded by micro-phasor measurement units (µPMUs). The units record synchronized measurements of voltage and phase angle at the power distribution level [10]. Using data taken on two different days for which the system's status has been labeled, we estimate two vector fields, one for normal operation and one for anomalous operation. To estimate these vector fields, we found it essential to both filter the data and increase the number of trajectories.

II. METHODS
A. Filtering
Let $\widehat{f}(x, \theta)$ model the true vector field f. For now, we are not concerned with the actual details of this model except to say that the model parameters are given by θ. Consider an explicit Euler time discretization of (1) with time step $\Delta_i = t_{i+1} - t_i$. For trajectory j, we have

$$x_{i+1}^j - x_i^j = f(x_i^j, \theta)\,\Delta_i, \qquad i = 1, \ldots, T-1. \qquad (3)$$

We use Euler purely for simplicity here; in practice the method can accommodate higher-order, explicit time integration schemes. Let $\widehat{X}$ denote the collection $\{\widehat{x}_i^j\}_{j=1,i=1}^{N,T}$ of all filtered states over all trajectories. Then define the objective function

$$E(\widehat{X}, \widehat{\theta}) = \sum_{j=1}^{N} \sum_{i=1}^{T-1} \left\| \frac{\widehat{x}_{i+1}^j - \widehat{x}_i^j}{\Delta_i} - \widehat{f}(\widehat{x}_i^j, \widehat{\theta}) \right\|^2. \qquad (4)$$

Assume that if the true states X were known, then $\widehat{f}$ could be trained by minimizing E over θ. We thus refer to minimization over θ as model training.

Let Y denote the collection $\{y_i^j\}_{j=1,i=1}^{N,T}$ of all observations over all trajectories. We propose the following alternating procedure to learn $\widehat{X}$ and $\widehat{\theta}$. The states are initialized to be the data, $\widehat{X}^0 = Y$; superscripts denote the iteration number:

$$\text{train:}\quad \widehat{\theta}^{k+1} = \operatorname*{argmin}_{\theta}\ E(\widehat{X}^k, \theta) \qquad (5a)$$

$$\text{filter:}\quad \widehat{X}^{k+1} = \operatorname*{argmin}_{X}\ \left\{ E(X, \widehat{\theta}^{k+1}) + \lambda \| X - \widehat{X}^k \|^2 \right\}. \qquad (5b)$$

We terminate when the change in $(\widehat{\theta}, \widehat{X})$ is sufficiently small.
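To make the alternating scheme (5) concrete, the following is a minimal sketch, assuming a single trajectory, a uniform time step, and a degree-two polynomial dictionary model that is linear in θ (so the train step reduces to ordinary least squares); the filter step is solved with L-BFGS-B. The function names and hyperparameter values are illustrative and do not come from the authors' released code.

```python
# Minimal sketch of the alternating minimization scheme (5) for one trajectory.
import numpy as np
from scipy.optimize import minimize

def dictionary(x):
    """Polynomial dictionary up to degree two for a d-dimensional state x (1-D array)."""
    d = x.shape[0]
    quad = [x[a] * x[b] for a in range(d) for b in range(a, d)]
    return np.concatenate([x, np.array(quad)])

def objective_E(X, theta, dt):
    """Eq. (4): sum of squared one-step Euler residuals for a trajectory X of shape (T, d)."""
    Xi = np.array([dictionary(x) for x in X[:-1]])
    resid = (X[1:] - X[:-1]) / dt - Xi @ theta
    return np.sum(resid ** 2)

def train_step(X, dt):
    """Eq. (5a): least-squares fit of theta given the current filtered states."""
    Xi = np.array([dictionary(x) for x in X[:-1]])        # (T-1) x s
    dXdt = (X[1:] - X[:-1]) / dt                          # (T-1) x d
    theta, *_ = np.linalg.lstsq(Xi, dXdt, rcond=None)
    return theta

def filter_step(X_prev, theta, dt, lam):
    """Eq. (5b): proximal update of the filtered states."""
    shape = X_prev.shape
    def obj(z):
        X = z.reshape(shape)
        return objective_E(X, theta, dt) + lam * np.sum((X - X_prev) ** 2)
    res = minimize(obj, X_prev.ravel(), method="L-BFGS-B")
    return res.x.reshape(shape)

def alternating_minimization(Y, dt, lam=1.0, iters=20):
    X_hat = Y.copy()                                      # initialize the states to the data
    for _ in range(iters):
        theta_hat = train_step(X_hat, dt)                 # (5a)
        X_hat = filter_step(X_hat, theta_hat, dt, lam)    # (5b)
    return X_hat, theta_hat
```

In the multiple-trajectory setting, the objective and both steps simply sum over trajectories j as in (4); the single-trajectory version above keeps the sketch short.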
B. Three Parameterizations of $\widehat{f}$

1) SINDy:
We first take $\widehat{f}$ to be a linear combination of prescribed functions. Given a $1 \times d$ input x, we let $\Xi(x)$ denote a $1 \times s$ dictionary of functions. For instance, for $x = (x_1, x_2)$ we can take $\Xi(x_1, x_2) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)$, i.e., all polynomials in the components of x, up to degree two. Using this dictionary, we write

$$\widehat{f}(x) = \Xi(x)\,\theta \qquad (6)$$

where θ is an $s \times d$ matrix of coefficients. Throughout this paper, to extend $\widehat{f}(x)$ to a function $\widehat{f}(X)$, we apply $\widehat{f}$ to each $1 \times d$ slice of X. To solve for θ, we apply an iteratively thresholded least-squares regression procedure that promotes a sparse solution θ; for further details, consult [2], [11]. A minimal sketch of this thresholding procedure appears below.

Fig. 1. We compare the neural shape method with [3], [9], and [2] on FitzHugh–Nagumo with different noise levels (1%, 5%, and 10%). For each noise level, we create several sets of observations, run the methods on each set separately, and report prediction errors. Each error bar shows the error on one experiment. For all plots, we train on the time interval [0, T]. In the upper (resp., lower) set of plots, we report the prediction error on [0, T] (resp., [T, 2T]).
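As a concrete illustration of the sparse regression used to solve for θ in (6), here is a minimal sketch of sequentially thresholded least squares in the spirit of [2], [11]; the threshold value, sweep count, and function name are placeholders rather than the authors' implementation.

```python
# Minimal sketch of sequentially thresholded least squares for Eq. (6).
import numpy as np

def thresholded_least_squares(Xi, dXdt, threshold=0.1, n_sweeps=10):
    """Solve dXdt ~ Xi @ theta, repeatedly zeroing coefficients smaller than `threshold`.

    Xi:    (m, s) dictionary matrix evaluated at the states
    dXdt:  (m, d) estimated time derivatives
    """
    theta, *_ = np.linalg.lstsq(Xi, dXdt, rcond=None)          # initial dense fit
    for _ in range(n_sweeps):
        small = np.abs(theta) < threshold                       # coefficients to prune
        theta[small] = 0.0
        for col in range(dXdt.shape[1]):                        # refit each component of f
            keep = ~small[:, col]
            if np.any(keep):
                theta[keep, col], *_ = np.linalg.lstsq(Xi[:, keep], dXdt[:, col], rcond=None)
    return theta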
2) Neural Shape Functions:
Instead of prescribing Ξ and learning only $\widehat{\theta}$, here we learn both Ξ and $\widehat{\theta}$. Let B be the desired number of one-dimensional shape functions; we model these using a dense neural network with one-dimensional input, $D-1$ hidden layers, and a final layer with B outputs. Let $h(x) \in \mathbb{R}^B$ denote the output corresponding to scalar input x; then $h_j$ is the j-th one-dimensional shape function. To create multi-dimensional shape functions, we take tensor products of one-dimensional shape functions. Let a multi-index denote a collection of integers $\alpha = (\alpha_1, \ldots, \alpha_d)$ such that $0 \leq \alpha_k \leq B$. Define $h_0(x) = 1$. Then for a given multi-index α, define the multi-dimensional shape function

$$H_\alpha(x) = \prod_{k=1}^{d} h_{\alpha_k}(x_k) \in \mathbb{R}. \qquad (7)$$

Let A be a collection of multi-indices and |A| its cardinality. Suppose we compute $H_\alpha(x)$ for each $\alpha \in A$; in this way we obtain a $1 \times |A|$ vector $H(x)$. Here H(x) plays the same role as Ξ(x) in SINDy above; |A| is analogous to s, the number of shape functions. Now let $\mathbf{B}$ denote an $|A| \times d$ matrix of weights. The neural shape function model is $\widehat{f}(x) = H(x)\,\mathbf{B}$. Note that there are $(B+1)^d$ possible multi-dimensional shape functions. In practice, we choose A such that $|A| \ll (B+1)^d$, thus constraining $\widehat{f}$ to be a small linear combination of multi-dimensional shape functions. The set of parameters θ consists of $\mathbf{B}$ together with all weights and biases in the h(x) network. (A minimal sketch of this construction appears at the end of this subsection.)

Suppose we work in a compact subset $K \subset \mathbb{R}^d$. Given direct observations of a smooth vector field f in K, it is possible to achieve arbitrarily small error $\| f - \widehat{f} \|$ by choosing the neural shape function hyperparameters sufficiently large. To understand why, note that universal approximation theory guarantees that (even when D = 1) the space of our one-dimensional neural shape functions is dense in the space of continuous functions. Hence tensor products of these shape functions are dense in the space of tensor products of continuous functions.

Fig. 2. Ground truth FitzHugh–Nagumo vector field.
By the Stone–Weierstrass theorem, any smooth vector field can be approximated by linear combinations of tensor products of continuous functions (e.g., univariate polynomials). Putting the previous two facts together, we conclude that linear combinations of our multi-dimensional neural shape functions can be used to approximate smooth vector fields; the accuracy of this approximation can be controlled by the number of units in each $h_j$ network together with |A|.
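To make the construction concrete, here is a minimal sketch of the neural shape function model, assuming a single hidden layer with tanh activations, a small default multi-index set A, and PyTorch as the framework; the class name, layer sizes, and defaults are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the neural shape function vector field, Eq. (7).
import torch
import torch.nn as nn

class NeuralShapeVectorField(nn.Module):
    def __init__(self, d, B=3, hidden=16, multi_indices=None):
        super().__init__()
        self.d, self.B = d, B
        # one small dense net produces all B one-dimensional shape functions h_1..h_B
        self.h = nn.Sequential(nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, B))
        # A: list of multi-indices alpha in {0,...,B}^d; default to a tiny example set
        self.A = multi_indices if multi_indices is not None else [(1,) * d, (2,) + (0,) * (d - 1)]
        self.W = nn.Parameter(0.1 * torch.randn(len(self.A), d))   # the |A| x d matrix of weights

    def forward(self, x):                     # x: (n, d) batch of states
        n = x.shape[0]
        h_vals = self.h(x.reshape(-1, 1)).reshape(n, self.d, self.B)   # h_b(x_k) for every k, b
        ones = torch.ones(n, self.d, 1)
        h_vals = torch.cat([ones, h_vals], dim=2)                      # prepend h_0 = 1
        # H_alpha(x) = prod_k h_{alpha_k}(x_k), stacked over alpha in A
        H = torch.stack(
            [torch.prod(torch.stack([h_vals[:, k, a[k]] for k in range(self.d)], dim=1), dim=1)
             for a in self.A], dim=1)                                   # (n, |A|)
        return H @ self.W                                              # (n, d) estimate of f(x)
```

Because the one-dimensional functions $h_j$ are shared across all tensor products, the learned shapes can be plotted directly, which is the interpretability advantage discussed below.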
3) Dense Neural Network (DNN):
We also consider a dense, feedforward neural network model $\widehat{f}(x)$ with d inputs and d outputs, the model used by [3].

III. RESULTS
Here we present results from simulated data experiments in which the ground truth vector field f is known.

A. FitzHugh–Nagumo
In Figure 1, we compare neural shape functions plus filtering (our method) against three published methods [2], [3], [9] on simulated data from the FitzHugh–Nagumo system:

$$\frac{dx_1}{dt} = c \left( x_1 - \frac{x_1^3}{3} + x_2 \right), \qquad \frac{dx_2}{dt} = -\frac{1}{c} \left( x_1 - a + b\,x_2 \right).$$

To generate simulated data, we set the system parameters a = 0.7, b = 0.8, c = 3 and fix an initial condition. We then numerically solve the system on the interval [0, T] with T = 20, recording data at a uniform spacing ∆t. To these clean trajectories we add mean-zero Gaussian noise at strengths of 1%, 5%, and 10%. We train using data on [0, T]; in Figure 1, we plot prediction errors for both the training interval and an extrapolatory test interval [T, 2T].

For this initial comparison, let us define the concept of prediction error. Suppose we are at iteration k and that we have trained the model for e epochs. Define the predicted states $\widetilde{X}^k_e$ to be the numerical solution of (1) that results from using the filtered initial conditions contained in $\widehat{X}^k$ (for the j-th trajectory, $\widehat{x}^j_1$) together with the current best vector field $\widehat{f}(x, \widehat{\theta}^k)$. We call the distance between $\widetilde{X}^k_e$ and $\widehat{X}^k$ the prediction error; we claim this is a good metric to detect overfitting. If we find that the current prediction error increases as we increase the number of epochs, we halt the optimization step (5a), set $\widehat{\theta}^{k+1}$ to the weights of the network with minimum current prediction error, and proceed to the filtering step (5b).

Fig. 3. We applied algorithm (5), with the neural shape function model for $\widehat{f}$, to data obtained by simulating the FitzHugh–Nagumo system and then adding noise. First panel: the clean states and predicted states over time. Here the predicted states are obtained by numerically integrating the neural vector field $\widehat{f}$ forward in time starting at the estimated/filtered initial condition. Second panel: visualization of the learned one-dimensional shape functions $h_j(x)$.

In Figure 3, we plot the predicted and true states for the neural shape function model of $\widehat{f}$ learned from data with Gaussian noise. We also plot the learned one-dimensional neural shape functions $h_j(x)$. An advantage of this method is the ability to graphically interpret these shape functions, regardless of the dimension d of the vector field being modeled.

Among the methods studied in Figure 1, only our method and that of [9] attempt to deal with noise during training. Instead of using a proximal term of the form $\| X - \widehat{X}^k \|^2$ as in (5b), [9] use a penalty term of the form $\| X - Y \|^2$; filtered states are always anchored to the data. While our method outperforms this penalty-based method, both are more robust to noise than approaches that do not incorporate filtering at all. Clearly SINDy [2] has issues with even small amounts of noise, while the dense neural network approach of [3] encounters problems starting at moderate noise levels.

We have carried out tests similar to that of Figure 1 for nonlinear, chaotic systems such as the Lorenz, Rössler, and double pendulum systems.
For such systems, if we perturb the vector field or initial conditions slightly, sensitive dependence implies that over time, trajectories will diverge exponentially. For such systems, the prediction error on [T, 2T] can seem large even if we estimate the vector field and filtered states with a reasonable degree of accuracy. These tests motivate us to measure the vector field error $\| \widehat{f} - f \|$, which provides insight into how well the estimated vector field captures the global behavior of f.

TABLE I. For each of the three ODE learning methods (filtered SINDy, neural shape functions, and filtered DNN), all combined with filtering, we record the error in the estimated vector field as a function of both noise strength and the number of trajectories. For each method and each noise strength, increasing the number of trajectories decreases the vector field error. Note that retrofitting SINDy with filtering dramatically improves its ability to estimate the vector field accurately, as compared to the results in Figure 1.

We now retrofit both [3] (the dense neural network model for $\widehat{f}$) and [2] (SINDy) with our filtering procedure. To be clear, both retrofitted methods follow (5); the only difference is in how $\widehat{f}$ is modeled.

As all the results shown in Figure 1 are for one trajectory, we now move to the setting where we have multiple trajectories. We numerically integrate the FitzHugh–Nagumo system with initial conditions taken from an equispaced grid on a square, producing a set of trajectories. We then apply each retrofitted procedure to either one trajectory, a randomly chosen subset, or all trajectories from the collection. In each case, we obtain an estimated vector field $\widehat{f}(x)$. We compute the vector field error $\| \widehat{f} - f \|$ using elementary quadrature on the same square (a minimal sketch appears at the end of this subsection). Here f(x) is the right-hand side of the FitzHugh–Nagumo system; see the ground truth vector field plotted in Figure 2.

The results, shown in Figure 9 and Table I, clearly show the benefit of combining any method with filtering and increasing the number of trajectories used for training, even if those trajectories are all corrupted with 10% noise. The improvement is particularly striking for the SINDy method; in fact, after incorporating SINDy into (5), it outperforms the other methods.
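For reference, a vector field error of this kind can be approximated by evaluating both fields on an equispaced grid over the square and averaging the pointwise error. The sketch below assumes the commonly used FitzHugh–Nagumo parameter values, a placeholder grid, and the mean absolute error as the metric.

```python
# Minimal sketch of computing ||f_hat - f|| by quadrature on an equispaced grid.
import numpy as np

def fitzhugh_nagumo(x, a=0.7, b=0.8, c=3.0):
    """Ground truth FitzHugh-Nagumo right-hand side; x has shape (n, 2)."""
    x1, x2 = x[:, 0], x[:, 1]
    return np.stack([c * (x1 - x1**3 / 3 + x2), -(1.0 / c) * (x1 - a + b * x2)], axis=1)

def vector_field_error(f_hat, f_true, lo=-2.0, hi=2.0, n=101):
    g = np.linspace(lo, hi, n)
    X1, X2 = np.meshgrid(g, g)
    pts = np.column_stack([X1.ravel(), X2.ravel()])     # grid points covering the square
    return np.mean(np.abs(f_hat(pts) - f_true(pts)))    # mean absolute error over the grid
```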
B. Nonlinear Oscillator Network

Consider a ring of M masses connected by identical springs, each with potential energy V(x) and force $F(x) = -V'(x)$. Here x denotes displacement from equilibrium. Then the equations of motion for the mass-spring system are

$$\ddot{x}_i = F(x_i - x_{i-1}) - F(x_{i+1} - x_i) \qquad (8)$$

for $i = 1, \ldots, M$, with the periodic conventions $x_0 \equiv x_M$ and $x_{M+1} \equiv x_1$. For our tests, we choose the double-well potential $V(x) = -8x^2 + (1/4)x^4$, so that $F(x) = 16x - x^3$.
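A minimal data-generation sketch for this system, using the 8th-order Dormand–Prince integrator from scipy.integrate as described in the following paragraphs, might look as follows; the trajectory count, tolerances, initial-condition box, and the way the noise percentage is converted to a standard deviation are illustrative assumptions.

```python
# Minimal sketch of simulating the ring of masses (8) and adding Gaussian noise.
import numpy as np
from scipy.integrate import solve_ivp

M = 3                                   # number of masses; first-order state dimension is 2M

def force(x):                           # F(x) = -V'(x) for the double-well potential
    return 16.0 * x - x**3

def ring_rhs(t, z):
    x, v = z[:M], z[M:]
    xm, xp = np.roll(x, 1), np.roll(x, -1)          # x_{i-1} and x_{i+1} with periodic wrap
    a = force(x - xm) - force(xp - x)               # Eq. (8)
    return np.concatenate([v, a])

def make_trajectories(n_traj=100, t_final=4.0, n_steps=400, noise_pct=0.10, seed=0):
    rng = np.random.default_rng(seed)
    t_eval = np.linspace(0.0, t_final, n_steps + 1)
    clean = []
    for _ in range(n_traj):
        z0 = rng.uniform(-0.5, 0.5, size=2 * M)     # initial condition in a small hypercube
        sol = solve_ivp(ring_rhs, (0.0, t_final), z0, method="DOP853",
                        t_eval=t_eval, rtol=1e-10, atol=1e-10)
        clean.append(sol.y.T)                       # (n_steps+1) x 2M
    clean = np.array(clean)
    noisy = clean + noise_pct * np.std(clean) * rng.standard_normal(clean.shape)
    return t_eval, clean, noisy
```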
TABLE II. We apply the retrofitted SINDy algorithm (9) to noisy observations of the mass-spring system (8); for each noise level and number of trajectories, the columns report the vector field error $\| f - \widehat{f} \|$, the filtering error $\| X - \widehat{X} \|$, and the prediction error. Retrofitting the SINDy algorithm enables it to handle noisy data. The results also show that, at each noise level, increasing the number of trajectories clearly reduces prediction errors. For explanations of the different types of errors, please see the main text.

We focus on the system with M = 3 masses; when we write the system in first-order form, the system has 2M = 6 degrees of freedom corresponding to $x_i$ and $\dot{x}_i$ for each $i = 1, \ldots, M$. To generate data for our tests, we simulate (8) using the 8th-order Dormand–Prince integrator in scipy.integrate, with stringent absolute and relative tolerances. We generate a collection of trajectories, each with an initial condition sampled uniformly from a small hypercube centered at the origin. Each trajectory is saved at equispaced time steps from t = 0 to a final time of t = 4. To these clean trajectories X, we add mean-zero Gaussian noise with strengths of 1%, 5%, and 10% as before, resulting in noisy data Y.

We focus our attention on a retrofitted SINDy method with $\widehat{f}$ modeled using a dictionary of polynomials. In this method, we use a variant of (5) theoretically analyzed in our previous work [4]. In this method, we incorporate the SINDy model (6), which has the benefit of being linear in the parameters θ. This linearity implies that the objective function (4) is convex in θ. Suppose we split the decision variables $\widehat{X}$ into two halves, the first half $\widehat{X}_+$ consisting of time steps $i = 1, \ldots, \lfloor T/2 \rfloor$, and the second half $\widehat{X}_-$ consisting of time steps $i = \lfloor T/2 \rfloor + 1, \ldots, T$. Then $\widehat{X} = (\widehat{X}_+, \widehat{X}_-)$. This leads us to the following block coordinate descent algorithm:

$$\text{train:}\quad \widehat{\theta}^{k+1} = \operatorname*{argmin}_{\theta}\ E(\widehat{X}^k_+, \widehat{X}^k_-, \theta) \qquad (9a)$$

$$\text{filter:}\quad \widehat{X}^{k+1}_- = \operatorname*{argmin}_{X_-}\ \left\{ E(\widehat{X}^k_+, X_-, \widehat{\theta}^{k+1}) + \lambda \| X_- - \widehat{X}^k_- \|^2 \right\} \qquad (9b)$$

$$\text{filter:}\quad \widehat{X}^{k+1}_+ = \operatorname*{argmin}_{X_+}\ \left\{ E(X_+, \widehat{X}^{k+1}_-, \widehat{\theta}^{k+1}) + \lambda \| X_+ - \widehat{X}^k_+ \|^2 \right\}. \qquad (9c)$$

Splitting $\widehat{X}$ and formulating the algorithm in this way gives us block convexity. That is, when we hold two of the three variables in $\{\widehat{X}^k_+, \widehat{X}^k_-, \widehat{\theta}\}$ fixed and minimize over the remaining variable, we obtain in each case a convex subproblem [4]. Note that this property would not hold if we were to instead use either of the neural network approaches to model $\widehat{f}$, as in this case (9a) would be non-convex.
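The following is a minimal sketch of iteration (9) for a single trajectory. For brevity it fits θ by plain least squares in the train step (the thresholding sweep sketched after Figure 1 can be substituted) and solves each filter subproblem with L-BFGS-B rather than the fixed number of gradient descent steps described in the text; λ, the iteration cap, and the tolerance are placeholders.

```python
# Minimal sketch of the block coordinate descent iteration (9) for one trajectory.
import numpy as np
from scipy.optimize import minimize

def euler_objective(X, theta, dict_fn, dt):
    """Objective (4) for one trajectory X of shape (T, d)."""
    Xi = np.array([dict_fn(x) for x in X[:-1]])
    return np.sum(((X[1:] - X[:-1]) / dt - Xi @ theta) ** 2)

def filter_block(X, rows, theta, dict_fn, dt, lam):
    """Proximal update of the time steps in `rows`, holding the rest of X fixed."""
    anchor = X[rows].copy()
    def obj(z):
        Xtry = X.copy()
        Xtry[rows] = z.reshape(anchor.shape)
        return euler_objective(Xtry, theta, dict_fn, dt) + lam * np.sum((Xtry[rows] - anchor) ** 2)
    Xnew = X.copy()
    Xnew[rows] = minimize(obj, anchor.ravel(), method="L-BFGS-B").x.reshape(anchor.shape)
    return Xnew

def retrofitted_sindy(Y, dict_fn, dt, lam=1e-3, iters=50, tol=1e-6):
    X = Y.copy()
    T = X.shape[0]
    first, second = np.arange(T // 2), np.arange(T // 2, T)     # the two halves of the time grid
    for _ in range(iters):
        Xi = np.array([dict_fn(x) for x in X[:-1]])
        theta, *_ = np.linalg.lstsq(Xi, (X[1:] - X[:-1]) / dt, rcond=None)   # (9a), no thresholding here
        X_old = X
        X = filter_block(X, second, theta, dict_fn, dt, lam)                 # (9b)
        X = filter_block(X, first, theta, dict_fn, dt, lam)                  # (9c)
        if np.linalg.norm(X - X_old) < tol:                                  # termination test from the text
            break
    return X, theta

# Example dictionary (illustrative only): linear plus cubic monomials of each coordinate.
example_dict = lambda x: np.concatenate([x, x**3])
```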
Fig. 4. We monitor and plot the magnitudes of the change in filtered states, $\| \widehat{X}^{k+1} - \widehat{X}^k \|$, as a function of iteration number k. We have done this for each of three training sets, containing 1%, 5%, and 10% noise. In all cases, we find that this quantity decreases rapidly and monotonically.

When we train, we restrict Y to consist of only the first portion of each trajectory, through t = 3. Starting with noisy data $\widehat{X}^0 = Y$, we run algorithm (9) with a small value of λ. Note that λ multiplies the squared Frobenius error between two large matrices; hence even a small value of λ is still consequential. We terminate if the norm difference between $\widehat{X}^{k+1}$ and $\widehat{X}^k$ falls below a small tolerance, or if we have already completed a fixed maximum number of iterations.

To solve (9b) and (9c), we use a fixed number of gradient descent steps with a small learning rate. To solve (9a), we use the iteratively thresholded least squares procedure from [2], [11] with a fixed threshold. For the dictionary Ξ defined in (6), we include all polynomials up to a fixed degree in the 6 state variables, not including an intercept or constant term, resulting in s = 63 columns. We have uploaded our code to https://github.com/hbhat4000/filtersindy/

An interesting property of algorithm (9) is the way in which $\| \widehat{X}^{k+1} - \widehat{X}^k \|$ behaves as a function of k. In Figure 4, we have plotted this quantity as a function of iteration k. We have recorded these magnitudes of the change in filtered states while running algorithm (9) on the full training set of trajectories, contaminated with either 1%, 5%, or 10% noise. In all cases, we see monotonic convergence. Note that the vertical scale on this plot is logarithmic, further indicating the rapid decrease in $\| \widehat{X}^{k+1} - \widehat{X}^k \|$ as a function of k. As a practical matter, this means that we do not need to run (9) for many iterations, and that the results are robust to the number of iterations.

After running algorithm (9) on training sets of varying size with either 1%, 5%, or 10% noise, we quantify errors in three ways; see Table II. The vector field error $\| f - \widehat{f} \|$ measures the mean absolute error between ground truth and estimated vector fields. The quantity $\| X - \widehat{X} \|$ is the mean absolute error between true and filtered states. After training on the first portion of each trajectory, we compute predicted trajectories by numerically integrating the estimated vector field forward in time, starting from the filtered initial conditions.

Fig. 5. We present the results of applying (9) to an increasing number of trajectories (top to bottom) for the nonlinear mass-spring system (8), all contaminated with Gaussian noise. In each of these training sets, we have singled out one common trajectory for the purposes of illustration. In each plot, for this common trajectory, we have plotted the final estimate of the filtered states $\widehat{X}$ in red, and ground truth trajectories X in black. A close inspection of the plots reveals errors with few trajectories; for the purposes of plotting, these errors disappear with more trajectories. For more details, see the main text.
The prediction error is the mean absolute error between these predicted trajectories and the true states X. When we numerically integrate the estimated vector field, we use a standard fourth-order explicit Runge–Kutta method.

The first thing to notice from the results in Table II is that they confirm that combining SINDy with our alternating minimization filtering approach allows SINDy to estimate the mass-spring system's vector field accurately. Even with 10% noise in the original data, the vector field error remains small when we train on the full set of trajectories. To get a sense for what this means, note that the ground truth vector field is highly sparse: across the entire ground truth coefficient matrix θ, out of the 63 · 6 possible entries, only a small number are nonzero. The retrofitted SINDy algorithm gets this sparsity pattern correct; if we round the coefficients obtained by SINDy, we obtain the ground truth vector field.

Let us now interpret the filtering errors (the second type of error) in Table II. At the 10% noise level, increasing the number of trajectories beyond 10 decreases the filtering error only slightly. To visualize this, we present Figure 5. From top to bottom, we present the results of training with an increasing number of trajectories worth of data, all contaminated with Gaussian noise. For the purposes of illustration, we have singled out one trajectory $\{x_i(t), \dot{x}_i(t)\}_{i=1:3}$ that is a member of all three training sets. Each plot contains black curves, corresponding to $\{x_i(t), \dot{x}_i(t)\}$ for i = 1, 2, 3, together with the filtered (or hatted) versions, for a total of 12 curves per plot. When there are only a few trajectories (top panel), one can discern differences between red and black curves; as we go to the full set of trajectories (bottom panel), these errors mostly disappear.

Overall, Figure 5 supports the notion that it is indeed possible to estimate filtered states even when the equations of motion (i.e., the vector field) for the underlying system are themselves unknown and must be simultaneously estimated. However, there is a difference between filtered trajectories and predicted trajectories of the dynamical system. One will notice from Figure 5 that the time axis ends at t = 3; the training data covers only [0, 3], so there is no noisy training data to filter beyond t = 3.

To obtain solutions of the estimated dynamical system beyond t = 3, we must numerically integrate, as in the predicted trajectories whose errors are quantified in Table II. When we compute the predicted trajectories in this table, we integrate all the way up to t = 4 and then compare against the ground truth (clean) states of the system X. We view predictions on t ∈ (3, 4] as a true test set, i.e., a test of the estimated vector field's ability to extrapolate beyond the training set.

We again single out the same trajectory common to our training sets. Starting with the estimated/filtered initial conditions, we numerically integrate the estimated vector field forward in time using a standard fourth-order explicit Runge–Kutta method, from t = 0 to t = 4. In Figure 6, we compare the results of these numerical integrations (in green) against the ground truth trajectories (in black). When the number of trajectories is small (top), the predicted dynamics are qualitatively wrong.
As we train on more trajectories (middle, bottom), the dynamics begin to qualitatively match the true mass-spring system's dynamics. Note that we obtain much better quantitative accuracy for t ∈ [0, 3], the training interval.

Fig. 6. We present the results of applying (9) to an increasing number of trajectories (top to bottom) worth of data from the nonlinear mass-spring system (8), all contaminated with Gaussian noise. In each of these training sets, we have singled out one common trajectory for the purposes of illustration. In each plot, for this common trajectory, we have plotted predicted trajectories (the results of numerically integrating the estimated vector field forward in time from the estimated initial conditions) in green, and ground truth trajectories X in black. The estimated vector field and filtered initial conditions are inaccurate when we use only a few trajectories, leading to qualitatively incorrect dynamics. As we increase the number of trajectories, we recover the qualitatively correct behavior of the mass-spring system. The dynamics for t ∈ [0, 3] are more accurate because that is the interval covered by the training data; for t ∈ (3, 4], we are seeing the results of propagating beyond the training interval.

For the bottom panel (trained on the full set of trajectories), by the time we reach t = 3, the predicted state has drifted noticeably away from the ground truth. We can improve upon this situation by resetting the state of the system, at t = 3, to equal our estimate $\widehat{x}(3)$ of the true state at that time and then continuing the numerical integration until t = 4. In Figure 7, we see that this procedure leads to improved quantitative agreement between predicted and ground truth trajectories on the test interval t ∈ (3, 4].

Fig. 7. We continue with results for algorithm (9) applied to noisy data from the nonlinear mass-spring system (8). We redo the numerical integration from the bottom panel of Figure 6, this time resetting the state of the system at t = 3 to equal our estimate $\widehat{x}(3)$ of the true state of the system at that time. This improves the quantitative accuracy of our predicted trajectory on the extrapolation/test interval t ∈ (3, 4].

C. MicroPMU Data
We apply the neural shape function plus filtering method to µPMU data taken from a Mountain View substation near Riverside, CA. We use one hour's worth of data from both Aug. 9 and Aug. 1, 2017. We use measurements aggregated at a spacing of ∆t = 0.01 s. In prior work, these dates have been identified as corresponding to normal (Aug. 9) and anomalous (Aug. 1) system operation. Our idea is to learn two vector fields, one for normal and one for anomalous behavior.

To begin, we prefiltered the data using a wavelet low-pass filter. This was to remove high-frequency noise prevalent in the raw data. Next, we focused our modeling efforts on two phase angle variables; this means that we did not use the remaining components of the full multi-channel signal. To implement the multiple trajectory idea, we reshaped the single hour's trajectory from its original length of T = 360000 to N = 3600 trajectories each consisting of T = 100 points (see the preprocessing sketch at the end of this subsection). To implement (5a) we use several epochs of the Adam optimizer with a small learning rate; to implement (5b), we use L-BFGS-B from scipy.optimize with default tolerances and λ = 1000. The network itself consists of B = 3 shape functions with a depth of D = 2 and U = 16 units per layer. The results are robust to increasing λ, B, D, or U. However, note that training on one prefiltered trajectory of length T = 360000, without alternating filtering, does not yield usable models.

In Figure 8, we plot the phase portraits for the learned systems. Interestingly, the main difference between the learned vector fields for the normal (left) and anomalous (right) settings has to do with stability. In short, we see that normal (respectively, anomalous) system operation corresponds to stable (respectively, unstable) oscillations. We have confirmed this by numerically finding the fixed points of the vector fields and checking the eigenvalues of the Jacobians at these fixed points.

Fig. 8. µPMU phase plots for normal (left) and anomalous (right) system behavior. Note that the fixed point is stable (complex-conjugate eigenvalues with small negative real part) on the left and unstable (complex-conjugate eigenvalues with small positive real part) on the right.
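Here is a minimal sketch of the preprocessing just described: a wavelet low-pass prefilter followed by reshaping one long record into many short trajectories. The wavelet family, decomposition level, and the synthetic placeholder array standing in for the real µPMU data are all assumptions.

```python
# Minimal sketch of micro-PMU preprocessing: wavelet low-pass filter, then reshape.
import numpy as np
import pywt

def wavelet_lowpass(signal, wavelet="db4", level=6):
    """Keep only the coarse approximation coefficients; zero out the detail bands."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    coeffs = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

def split_into_trajectories(record, traj_len=100):
    """Reshape a (T_total, d) record into (T_total // traj_len) trajectories of length traj_len."""
    n_traj = record.shape[0] // traj_len
    return record[: n_traj * traj_len].reshape(n_traj, traj_len, record.shape[1])

# Example: two phase-angle channels, one hour at 360000 points, split into 3600
# trajectories of 100 points each, matching the setup described in the text.
raw = np.random.standard_normal((360000, 2))                     # placeholder for the real data
filtered = np.column_stack([wavelet_lowpass(raw[:, k]) for k in range(raw.shape[1])])
trajectories = split_into_trajectories(filtered, traj_len=100)   # shape (3600, 100, 2)
```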
IV. DISCUSSION

We find that we can compensate for noisy training data by increasing its volume (the number of trajectories) and by applying simultaneous filtering. Filtering, in the form of the proximal step (5b), reestimates the states $\widehat{X}$ based on the best model at the current iteration, $\widehat{f}(X; \widehat{\theta}^k)$. When Y is heavily contaminated by noise, several iterations of (5b) allow $\widehat{X}$ to step away from Y as needed. While we have focused on the FitzHugh–Nagumo and nonlinear mass-spring systems in the paper, we are currently running additional tests for higher-dimensional physical systems. Preliminary results confirm the same trends we have observed for the FitzHugh–Nagumo system.

SINDy uses polynomial shape functions and the FitzHugh–Nagumo vector field consists of polynomials. Yet unless we include filtering and multiple trajectories, SINDy performs poorly on noisy data. For non-polynomial vector fields, the neural network approaches allow for greater model flexibility. We hypothesize that on non-polynomial systems, neural network-based approaches will prove superior. With the neural shape function approach, one can plot and visualize the one-dimensional shape functions $h_j : \mathbb{R} \to \mathbb{R}$, analogous to visualizing the components of Ξ in SINDy. However, due to its special form and construction, the neural shape function model is more difficult to train than a standard dense neural network.

In future work, we seek to replace these dense neural networks with sparsely connected neural networks that have optimal approximation properties [12]. We will also combine these methods with dimensionality reduction techniques to obtain reduced-order models from the full multi-channel µPMU time series. Ultimately, our goal is to uncover global phenomena, such as instabilities, attractors, and periodic orbits, that go beyond single trajectory forecasts.

Fig. 9. We compare the estimated vector fields (filtered SINDy, neural shape functions, and filtered DNN; one trajectory versus many trajectories) for data consisting of FitzHugh–Nagumo trajectories corrupted by 10% Gaussian noise. We combine the three ODE learning methods from Section II-B with the filtering procedure from Section II-A. Note the improvement in the quality of the phase portrait as we increase the number of trajectories used for training.

REFERENCES

[1] H. Schaeffer, "Learning partial differential equations via data discovery and sparse optimization," Proceedings of the Royal Society A, vol. 473, no. 2197, p. 20160446, 2017.
[2] S. L. Brunton, J. L. Proctor, and J. N. Kutz, "Discovering governing equations from data by sparse identification of nonlinear dynamical systems," Proceedings of the National Academy of Sciences of the United States of America, vol. 113, no. 15, pp. 3932–3937, 2016.
[3] M. Raissi, P. Perdikaris, and G. E. Karniadakis, "Physics informed deep learning (part I): Data-driven solutions of nonlinear partial differential equations," arXiv:1711.10561, 2017.
[4] R. Raziperchikolaei and H. S. Bhat, "A block coordinate descent proximal method for simultaneous filtering and parameter estimation," in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, Jun. 2019, pp. 5380–5388. [Online]. Available: http://proceedings.mlr.press/v97/raziperchikolaei19a.html
[5] N. Parikh and S. Boyd, "Proximal algorithms," Foundations and Trends in Optimization, 2014. [Online]. Available: http://dx.doi.org/10.1561/2400000003
[6] J. Ramsay and G. Hooker, Dynamic Data Analysis: Modeling Data with Differential Equations, ser. Springer Series in Statistics. Springer New York, 2017.
[7] S. H. Rudy, S. L. Brunton, and J. N. Kutz, "Smoothing and parameter estimation by soft-adherence to governing equations," Journal of Computational Physics, vol. 398, p. 108860, 2019.
[8] G. Tran and R. Ward, "Exact recovery of chaotic systems from highly corrupted data," Multiscale Modeling & Simulation, vol. 15, no. 3, pp. 1108–1129, 2017.
[9] S. H. Rudy, J. N. Kutz, and S. L. Brunton, "Deep learning of dynamics and signal-noise decomposition with time-stepping constraints," Journal of Computational Physics, vol. 396, pp. 483–506, 2019.
[10] J. Cadena, P. Ray, and E. Stewart, "Fingerprint discovery for transformer health prognostics from micro-phasor measurements," in ICML 2019 Time Series Workshop, Long Beach, CA, 2019.
[11] L. Zhang and H. Schaeffer, "On the convergence of the SINDy algorithm," Multiscale Modeling & Simulation, vol. 17, no. 3, pp. 948–972, 2019.
[12] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen, "Optimal approximation with sparsely connected deep neural networks," SIAM Journal on Mathematics of Data Science, vol. 1, no. 1, pp. 8–45, 2019.