Robust PDE Identification from Noisy Data
Yuchen He∗, Sung Ha Kang†, Wenjing Liao‡, Hao Liu§ and Yingjie Liu¶

Abstract
We propose robust methods to identify the underlying Partial Differential Equation (PDE) from a given set of noisy time-dependent data. We assume that the governing equation is a linear combination of a few linear and nonlinear differential terms in a prescribed dictionary. Noisy data make such identification particularly challenging. Our objective is to develop methods which are robust against a high level of noise and approximate the underlying noise-free dynamics well. We first introduce a Successively Denoised Differentiation (SDD) scheme to stabilize the amplified noise in numerical differentiation. SDD effectively denoises the given data and the corresponding derivatives. Secondly, we present two algorithms for PDE identification: Subspace pursuit Time evolution (ST) and Subspace pursuit Cross-validation (SC). Our general strategy is to first find a candidate set using the Subspace Pursuit (SP) greedy algorithm, then choose the best one via time evolution or cross-validation. ST uses multi-shooting numerical time evolution and selects the PDE which yields the least evolution error. SC evaluates the cross-validation error in the least-squares fitting and picks the PDE that gives the smallest validation error. We present a unified notion of PDE identification error to compare the objectives of related approaches, and various numerical experiments to validate our methods. Both methods are efficient and robust to noise.
1 Introduction

In science and engineering, partial differential equations (PDEs) are used to model various real-world scientific phenomena. Numerical solvers for PDEs, as well as analysis of various properties of the solutions, have been widely studied in the literature. In this paper, we focus on the inverse problem: given a set of time-dependent noisy data, how to identify the governing PDE.

∗ School of Mathematics, Georgia Institute of Technology. Email: [email protected].
† School of Mathematics, Georgia Institute of Technology. Email: [email protected]. Research is supported in part by Simons Foundation grants 282311 and 584960.
‡ School of Mathematics, Georgia Institute of Technology. Email: [email protected]. Research is supported in part by NSF grant DMS-1818751.
§ School of Mathematics, Georgia Institute of Technology. Email: [email protected].
¶ School of Mathematics, Georgia Institute of Technology. Email: [email protected]. Research is supported in part by NSF grants DMS-1522585 and DMS-CDS&E-MSS-1622453.

Let the given noisy time-dependent discrete data set be D:

D := { U^n_i ∈ R | n = 0, …, N; i = (i_1, …, i_d) with i_j = 0, …, M − 1, j = 1, …, d } for N, M ∈ N, (1)

where i is a d-dimensional spatial index of a discretized domain in R^d, and n represents the time index at time t_n. The objective is to find an evolutionary PDE of the form

∂_t u = f(u, ∂_x u, ∂_x² u, …, ∂_x^k u, …), (2)

which represents the dynamics of the given data D. Here t is the time variable, x = [x_1, …, x_d] ∈ R^d denotes the space variable, and

∂_x^k u = { ∂^k u / (∂x_1^{k_1} ∂x_2^{k_2} ⋯ ∂x_d^{k_d}) | k_1, …, k_d ∈ N, Σ_{j=1}^d k_j = k }

is the set of partial derivatives of u with respect to the space variable of order k, for k = 0, 1, 2, …. We assume that f is a polynomial of its arguments, so that the right-hand side of (2) is a linear combination of linear and nonlinear differential terms.
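To make the model concrete, a PDE of the form (2) can be encoded as a sparse coefficient vector over a feature dictionary. The snippet below uses Burgers' equation with viscosity 0.1 as an illustration; this equation and the small dictionary are our own example, not taken from this section:

```python
# A PDE of the form (2) as a sparse coefficient vector over a feature
# dictionary.  Burgers' equation u_t = -u u_x + 0.1 u_xx is a
# hypothetical illustration, not an equation studied in this paper.
features = ["1", "u", "u^2", "u_x", "u*u_x", "u_xx"]
c = [0.0, 0.0, 0.0, 0.0, -1.0, 0.1]
rhs = " + ".join(f"({ci})*{name}" for ci, name in zip(c, features) if ci != 0.0)
print("u_t =", rhs)  # u_t = (-1.0)*u*u_x + (0.1)*u_xx
```

Identification amounts to recovering the few nonzero entries of c from data.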
The model in (2) includes a class of parametric PDEs where the parameters are the polynomial coefficients in f.

Parameter identification in differential equations and dynamical systems has been considered by physicists and applied scientists. Earlier works include [1-4, 26-28], among which [2, 26] considered the PDE model as in (2). Two important papers [5, 36] used symbolic regression to recover underlying physical systems from experimental data. Recently, sparse regression and L_1-minimization were introduced to promote sparsity in the identification of PDEs or dynamical systems [7, 16, 32, 33]. In [7], Brunton et al. considered the discovery of nonlinear dynamical systems with sparsity-promoting techniques. The underlying dynamical systems are assumed to be governed by a small number of active terms in a prescribed dictionary, and sparse regression is used to identify these active terms. Several extensions of this sparse-regression approach can be found in [15, 20, 25]. In [33], Schaeffer considered the problem of PDE identification using the spectral method, and focused on the benefit of using L_1-minimization for sparse coefficient recovery. The identification of dynamical systems with highly corrupted and undersampled data is considered in [35, 41]. In [32], Rudy et al. proposed to identify PDEs by solving a regularized regression followed by a post-processing thresholding step. Sparse Bayesian regression was considered in [47] for the recovery of dynamical systems. This series of works focused on the benefit of using L_1-minimization to resolve dynamical systems or PDEs with certain sparsity patterns [34]. Another related problem is to infer the interaction law in a system of agents from trajectory data; in [6, 23], nonparametric regression was used to predict the interaction function and a theoretical guarantee was established.
Another category of methods uses deep learning [17, 21, 22, 24, 29-31].

The most closely related work to this paper is [16], where Identifying Differential Equations with Numerical Time evolution (IDENT) was proposed. It is based on the convergence principle of numerical PDE schemes. LASSO is used to efficiently find a candidate set, and the correct PDE is identified by computing the numerical Time Evolution Error (TEE). Among all the PDEs from the candidate set, the one whose numerical solution best matches the dynamics of the given data is chosen as the identified PDE. When the given data are contaminated by noise, the authors proposed a Least-Squares Moving Average method to denoise the data as a pre-processing step. When the coefficients vary in the spatial domain, a Base Element Expansion (BEE) technique was proposed to recover the varying coefficients.

Despite the development of many effective methods, PDE identification from noisy data is still challenging. A small amount of noise can make the recovery unstable, especially for high-order PDEs. It was shown in [16] that the noise-to-signal ratio for LASSO depends on the order of the underlying PDE, and IDENT can only handle a small amount of noise when the PDE contains high-order derivatives. A major issue is that numerical differentiation often magnifies noise, which is illustrated by an example in Figure 1.

In this paper, we propose a class of robust methods for PDE identification which can handle a large amount of noise. Our contributions include:

1. First, we propose a new denoising procedure, called Successively Denoised Differentiation (SDD), to stabilize the numerical differentiation applied to noisy data.

2. Second, we present two recovery algorithms which are robust against noise: Subspace pursuit Time evolution (ST) and Subspace pursuit Cross-validation (SC). Both methods utilize the SP greedy algorithm to find candidate sets.

Figure 1: Sensitivity of numerical differentiation to noise.
(a) Graph of sin(x), 0 ≤ x ≤ π (black), and its noisy version (red) with Gaussian noise of mean 0 and standard deviation 0.…

Let the time-space domain be Ω = [0, T] × [0, X]^d for some T > 0, X > 0. Suppose the noisy data D are given as (1) on a regular grid in Ω, with time index n = 0, …, N, N ∈ N, and spatial index i ∈ I, where I = {(i_1, …, i_d) | i_j = 0, …, M − 1, j = 1, …, d, M ∈ N}. Denote Δt := T/N and Δx := X/(M − 1) as the time and space spacing of the given data, respectively. At the time t_n and the location x_i, each datum is given as

U^n_i = u(x_i, t_n) + ε^n_i, (3)

where t_n := nΔt ∈ [0, T], x_i := (i_1 Δx, …, i_d Δx) ∈ [0, X]^d, and ε^n_i is some random noise with mean 0.

For n = 0, 1, …, N − 1, we vectorize the data over the spatial domain at time t_n and denote it as U^n ∈ R^{M^d}. Concatenating the vectors {U^n}_{n=0}^{N−1} vertically gives rise to a long vector U ∈ R^{NM^d}. The underlying function f in (2) is assumed to be a finite-order polynomial of its arguments:

f(u, ∂_x u, ∂_x² u, …, ∂_x^k u, …) = c_1 + c_2 ∂_{x_1} u + ⋯ + c_m u ∂_{x_1} u + ⋯, (4)

where ∂_x^k denotes all k-th order partial derivatives and ∂_{x_i} denotes the partial derivative with respect to the i-th variable. We refer to each term in (4), such as 1, ∂_{x_1} u, …, u ∂_{x_1} u, …, as a feature. Since f is a finite-order polynomial, only a finite number of features are included. Denote the number of features by K. Under this model, the function f is expressed in parametric form as a linear combination of K features. Our objective is to recover the parameters, or coefficients, c = [c_1 c_2 … c_m … c_K]^T ∈ R^K, where some of the entries may be zero.

From D, we numerically approximate the time and spatial derivatives of u to obtain the approximated time-derivative vector D_t U ∈ R^{NM^d} and the approximated feature matrix F ∈ R^{NM^d × K}:

D_t U = [ (U^1 − U^0)/Δt ; (U^2 − U^1)/Δt ; … ; (U^N − U^{N−1})/Δt ],

F = [ 1_{M^d}  U^0  D_x U^0  ⋯  U^0 ∘ D_x U^0  ⋯
      1_{M^d}  U^1  D_x U^1  ⋯  U^1 ∘ D_x U^1  ⋯
      ⋮
      1_{M^d}  U^{N−1}  D_x U^{N−1}  ⋯  U^{N−1} ∘ D_x U^{N−1}  ⋯ ].

In this paper, the time derivatives in D_t U are approximated by the forward difference scheme, and the spatial derivatives in F are computed using the 5-point ENO scheme [13]; 1_{M^d} denotes the all-ones vector of size M^d, and the Hadamard product ∘ is the elementwise multiplication between vectors. Each column of F is referred to as a feature column. The PDE model in (2) suggests that an optimal coefficient vector c should satisfy the approximation

D_t U ≈ F c. (5)

The objective of this paper is to find the correct set of coefficients in (4).
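The assembly of D_t U and F can be sketched in a few lines of NumPy. This is our own minimal version for a small 1-D dictionary; it uses central differences in space, whereas the paper uses a 5-point ENO scheme and applies denoising first:

```python
import numpy as np

def build_linear_system(U, dt, dx):
    """Assemble b = D_t U and the feature matrix F from snapshots
    U[n, i] (row n = time t_n, column i = grid point x_i) for the
    small 1-D dictionary {1, u, u^2, u_x, u*u_x, u_xx}.  Spatial
    derivatives use central differences here as a simplification."""
    Ut = (U[1:, :] - U[:-1, :]) / dt    # forward difference in time
    V = U[:-1, :]                       # features at t_0, ..., t_{N-1}
    Vx = np.gradient(V, dx, axis=1)     # approximation of u_x
    Vxx = np.gradient(Vx, dx, axis=1)   # approximation of u_xx
    cols = [np.ones_like(V), V, V**2, Vx, V * Vx, Vxx]
    F = np.stack(cols, axis=-1).reshape(-1, len(cols))  # one row per (n, i)
    b = Ut.reshape(-1)
    return F, b
```

On clean, well-resolved data sampled from a transport equation u_t = −u_x, the residual ‖Fc − b‖ is small for the coefficient vector with a single −1 in the u_x column, which is exactly the sense of approximation (5).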
Due to the large size of K, the idea of sparsity becomes useful.

Throughout this paper, we denote by F̄ the true feature matrix whose elements are the exact derivatives evaluated at the same time and space locations as those in F. For a vector c, ‖c‖_p := (Σ_j |c_j|^p)^{1/p} is the L_p norm of c. In particular, ‖c‖_∞ := max_j |c_j|. When p = 0, ‖c‖_0 := #{c_j : c_j ≠ 0} represents the L_0 semi-norm of c. The support of c is denoted by supp(c) := {j : c_j ≠ 0}. The vector c is said to be k-sparse if ‖c‖_0 = k for a non-negative integer k. For any matrix A ∈ R^{m×n} and index sets L_1 ⊆ {1, 2, …, n}, L_2 ⊆ {1, 2, …, m}, we denote by [A]_{L_1} the submatrix of A consisting of the columns indexed by L_1, and by [A]^{L_2} the submatrix of A consisting of the rows indexed by L_2. A^T, A^∗ and A^† denote the transpose, conjugate transpose and Moore-Penrose pseudoinverse of A, respectively. For x ∈ R, ⌊x⌋ denotes the largest integer no larger than x.

As shown in Figure 1, when the given data are contaminated by noise, numerical differentiation amplifies the noise and introduces large errors in the time-derivative vector D_t U and the approximated feature matrix F. With random noise, the regularity of the given data is different from the regularity of the PDE solution. Thus, denoising plays an important role in PDE identification.

We introduce a smoothing operator S to process the data. Kernel methods are good options for S, such as Moving Average [37] and Moving Least Squares (MLS) [18]. In this paper, the smoothing operator S is chosen as MLS, where data are locally fit by quadratic polynomials. In MLS, a weighted least-squares problem, along the time domain or the spatial domain, is solved at each time t_n and spatial location x_i as follows:

S^(x)[U^n_i] = p^n_i(x_i), where p^n_i = argmin_{p ∈ P} Σ_{j ∈ I} (p(x_j) − U^n_j)² exp(−‖x_i − x_j‖²/h²),
S^(t)[U^n_i] = p^n_i(t_n), where p^n_i = argmin_{p ∈ P} Σ_{0 ≤ k ≤ N} (p(t_k) − U^k_i)² exp(−‖t_n − t_k‖²/h²).

Here h > 0 is the kernel width, and P denotes the set of polynomials of degree no more than 2. When d = 1, it was shown that if {U^n_j}_{j=1}^M is a third-order approximation of a smooth function u(x, t_n), with or without noise, and if h is appropriately chosen, then MLS gives a third-order approximation of u(x, t_n) [16, 44]. This result can be easily generalized to higher-order polynomials.

We propose a Successively Denoised Differentiation (SDD) procedure to stabilize the numerical differentiation. For every derivative approximation, smoothing is applied as follows:

Successively Denoised Differentiation (SDD)
u ≈ S^(x)[U]: the given data set U is denoised by MLS.
∂_t u ≈ S^(t) D_t S^(x)[U]: denoising at the numerical time differentiation.
∂_x^k u ≈ (S^(x) D_{x_1})^{k_1} ⋯ (S^(x) D_{x_d})^{k_d} S^(x)[U]: denoising at every spatial numerical differentiation step, where Σ_{i=1}^d k_i = k, for k = 1, 2, ….

All nonlinear features and mixed partial derivatives are computed by the operations above. The main idea of SDD is to smooth the data at each step (before and) after the numerical differentiation. This simple idea effectively stabilizes numerical differentiation.

Figure 2: Performance of SDD on the data in Figure 1. (a) Graph of sin(x), 0 ≤ x ≤ π (black), and the denoised data (red) using MLS. (b) First-order derivatives of the function (black) and of the denoised data using SDD (red). (c) Second-order derivatives of the function (black) and of the denoised data using SDD (red). Derivatives are computed by the five-point ENO scheme, and the smoothing operator S is MLS.

Figure 2 shows the results of SDD for the same data in Figure 1.
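The SDD pattern, smooth, differentiate, smooth again, can be sketched in a few lines. For brevity this sketch uses a plain moving average as the smoothing operator S rather than the paper's MLS fit, so it only illustrates the alternation of smoothing and differentiation:

```python
import numpy as np

def smooth(v, width=5):
    """Stand-in smoothing operator S: a centered moving average.
    The paper instead uses Moving Least Squares (local quadratic fits)."""
    kernel = np.ones(width) / width
    return np.convolve(np.pad(v, width // 2, mode="edge"), kernel, mode="valid")

def sdd_derivative(U, dx, order):
    """SDD in 1-D: denoise the data, then alternate numerical
    differentiation and smoothing for every derivative order."""
    v = smooth(U)                        # u ~ S[U]
    for _ in range(order):
        v = smooth(np.gradient(v, dx))   # one (S D_x) step per order
    return v
```

Compared with differentiating the raw noisy samples directly, interleaving a smoothing pass at every differentiation step keeps the noise from being amplified by the 1/Δx factor of the difference quotient.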
The approximations of the first and second order derivatives of u are greatly improved. In Section 4.6, we explore details of SDD when different smoothing operators are used. We find that MLS has the best performance in terms of preserving the derivative profiles, and we set S to be MLS for the numerical experiments. To simplify the notation, in the rest of this paper, we use U to denote the denoised data S^(x)[U], and D_t U as well as D_x^k U to denote the numerical derivatives with SDD applied as above.

Under the parametric model in (4), the PDE identification problem can be reduced to solving the linear system (5) for a sparse vector c with few nonzero entries. Sparse regression can be formulated as the following L_0-minimization:

min ‖c‖_0 subject to ‖Fc − D_t U‖_2 ≤ ε, (6)

for some ε > 0. However, the L_0-minimization in (6) is NP-hard. Its approximate solutions have been intensively studied in the literature. The most popular surrogate for the L_0 semi-norm is the L_1 norm, as applied in image and signal processing [8, 11]. The L_1-regularized minimization is called the Least Absolute Shrinkage and Selection Operator (LASSO) [39], which was used in [16, 32, 33] for PDE identification. The common strategy in these works is to utilize LASSO to select a candidate set, then refine the results with other techniques.

In this paper, we utilize a greedy algorithm called Subspace Pursuit (SP) [10] to select a candidate set. Different from LASSO, SP takes the sparsity level as an input. Let k be a positive integer and denote b = D_t U. For a fixed sparsity level k, SP(k; F, b) in Algorithm 1 gives rise to a k-sparse vector whose support is selected in a greedy fashion. It was proved that SP yields a solution of the L_0-minimization (6) under certain conditions on the matrix F, such as the restricted isometry property [10].

Algorithm 1: Subspace Pursuit SP(k; F, b)
Input: F ∈ R^{NM^d×K}, b ∈ R^{NM^d} and sparsity k ∈ N.
Initialization: j = 0; G ← column-normalized version of F; I^0 = {k indices corresponding to the largest-magnitude entries in the vector G^∗ b}; b^0_res = b − G_{I^0} G^†_{I^0} b.
while True do
  Step 1. Ĩ^{j+1} = I^j ∪ {k indices corresponding to the largest-magnitude entries in the vector G^∗ b^j_res};
  Step 2. Set c_p = G^†_{Ĩ^{j+1}} b;
  Step 3. I^{j+1} = {k indices corresponding to the largest elements of c_p};
  Step 4. Compute b^{j+1}_res = b − G_{I^{j+1}} G^†_{I^{j+1}} b;
  Step 5. If ‖b^{j+1}_res‖ > ‖b^j_res‖, let I^{j+1} = I^j and terminate the algorithm; otherwise set j ← j + 1 and iterate.
Output: ĉ ∈ R^K satisfying ĉ_{I^{j+1}} = F^†_{I^{j+1}} b and ĉ_{(I^{j+1})^∁} = 0.

We propose two new methods based on SP for PDE identification: Subspace pursuit Time evolution (ST) and Subspace pursuit Cross-validation (SC). ST uses multi-shooting numerical time evolution and selects the PDE which yields the least evolution error. From a different perspective, SC computes the cross-validation error in the least-squares fitting and picks the PDE that gives the smallest error.

3.1 Subspace Pursuit Time Evolution (ST)
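Algorithm 1, which both ST and SC call as a subroutine, can be sketched as a short NumPy implementation. This is our own minimal version for a generic system (F, b), not the authors' code:

```python
import numpy as np

def subspace_pursuit(F, b, k, max_iter=20):
    """Minimal sketch of SP(k; F, b): keep a size-k support, expand it with
    the k columns most correlated with the residual, shrink back to the k
    largest coefficients, and stop when the residual no longer decreases."""
    G = F / np.linalg.norm(F, axis=0)          # column-normalized version of F
    I = set(np.argsort(np.abs(G.T @ b))[-k:])  # initial support
    Is = sorted(I)
    res = b - G[:, Is] @ np.linalg.lstsq(G[:, Is], b, rcond=None)[0]
    for _ in range(max_iter):
        J = sorted(I | set(np.argsort(np.abs(G.T @ res))[-k:]))  # expand
        cp = np.linalg.lstsq(G[:, J], b, rcond=None)[0]
        I_new = set(np.array(J)[np.argsort(np.abs(cp))[-k:]])    # shrink
        Is = sorted(I_new)
        res_new = b - G[:, Is] @ np.linalg.lstsq(G[:, Is], b, rcond=None)[0]
        if np.linalg.norm(res_new) >= np.linalg.norm(res):
            break                              # residual stopped improving
        I, res = I_new, res_new
    Is = sorted(I)
    c = np.zeros(F.shape[1])
    c[Is] = np.linalg.lstsq(F[:, Is], b, rcond=None)[0]  # refit on raw F
    return c
```

On a well-conditioned system with a genuinely sparse solution, SP typically recovers the correct support in a few iterations; in ST and SC it is invoked once per candidate sparsity level k.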
We first propose a method combining SP and the idea of time evolution. In [16], the Time Evolution Error (TEE) quantifies the mismatch between the solution simulated from a candidate PDE and the denoised data. Any candidate coefficient vector ĉ = (ĉ_1, ĉ_2, …) defines a candidate PDE: u_t = ĉ_1 + ĉ_2 ∂_{x_1} u + ⋯ + ĉ_m u ∂_{x_1} u + ⋯. This PDE is numerically evolved from the initial condition U^0 with a smaller time step Δt̃ ≪ Δt. Denote by Û^1, Û^2, …, Û^N this numerical solution at the same time-space locations as U^1, U^2, …, U^N. The TEE of the candidate PDE given by ĉ is

TEE(ĉ) = (1/N) Σ_{n=1}^N ‖Û^n − U^n‖_2,

where U^n is the denoised data at time t_n. When there are several candidate PDEs, the one with the least TEE is picked [16]. This TEE idea is based on the convergence principle that a correct numerical approximation converges to the true solution as the time step Δt̃ goes to zero; the error from wrongly identified terms grows during this time-evolution process. Figure 3 (a) and (b) illustrate the idea of TEE.

In this paper, we propose a Multi-shooting Time Evolution Error (MTEE). The idea is to evolve a candidate PDE from multiple time locations with a time step Δt̃ ≪ Δt, using the forward Euler scheme, for a time length of wΔt, where w is a positive integer. Throughout this paper, we use Δt̃ = Δt/5. Let Û^{(n+w)|n} be the numerical solution of the candidate PDE at the time (n+w)Δt, evolved from the initial condition U^n at time t_n = nΔt. The MTEE is defined as

MTEE(ĉ; w) = (1/(N−w)) Σ_{n=0}^{N−1−w} ‖Û^{(n+w)|n} − U^{n+w}‖_2. (7)

Figure 3 (c) and (d) demonstrate the process of multi-shooting time evolution. While the TEE evolution starts from the initial condition U^0 and ends at T, the MTEE evolution starts from various time locations, such as t_n, n = 0, …, N−1−w, and lasts for a shorter time, e.g., wΔt in our case.

MTEE has two advantages over TEE: (1) MTEE is more robust against noise in comparison with TEE; if w ≪ N, the noise accumulates for less time, which helps to stabilize numerical solvers. (2) MTEE is more flexible and its computation is parallelizable.

The SP algorithm finds a coefficient vector with a specified sparsity, but it is difficult to know the correct sparsity from the given data. We propose Subspace pursuit Time evolution (ST), which iteratively refines the selection of features.

As an initialization, we set K_0 = K and A_0 = {1, …, K}. At the first iteration, all possible sparsity levels are considered in the SP algorithm. For each k = 1, …, K_0, we run SP(k; F, D_t U) to obtain a coefficient vector ĉ^(k) ∈ R^K such that ‖ĉ^(k)‖_0 = k, which gives rise to the PDE

u_t = f_SP(k), where f_SP(k) := ĉ^(k)_1 + ĉ^(k)_2 ∂_{x_1} u + ⋯ + ĉ^(k)_m u ∂_{x_1} u + ⋯. (8)

We then numerically evolve each PDE u_t = f_SP(k), for k = 1, …, K_0, and calculate the corresponding MTEE. Among these PDEs, the one with the smallest MTEE is selected, and we let

K_1 = argmin_{k=1,…,K_0} MTEE(ĉ^(k); w) and A_1 = supp(ĉ^(K_1)).
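The evolution error (7) can be sketched as follows; the code assumes a small illustrative 1-D dictionary {1, u, u², u_x, u u_x, u_xx} and central differences in place of the paper's ENO scheme, and the function names are our own:

```python
import numpy as np

def mtee(c, U, dt, dx, w=2, refine=5):
    """Sketch of the multi-shooting time evolution error (7) for a
    candidate 1-D PDE u_t = sum_i c[i] * feature_i over the dictionary
    {1, u, u^2, u_x, u*u_x, u_xx}.  Each snapshot U[n] is evolved by
    forward Euler on the finer step dt/refine for a time length w*dt,
    then compared with the data U[n + w]."""
    def rhs(v):
        vx = np.gradient(v, dx)          # central differences as a stand-in for ENO
        vxx = np.gradient(vx, dx)
        feats = [np.ones_like(v), v, v**2, vx, v * vx, vxx]
        return sum(ci * f for ci, f in zip(c, feats))

    N = U.shape[0] - 1                   # data available at t_0, ..., t_N
    fine_dt = dt / refine
    err = 0.0
    for n in range(N - w):               # multiple shooting points
        v = U[n].copy()
        for _ in range(w * refine):
            v = v + fine_dt * rhs(v)     # forward Euler on the fine grid
        err += np.linalg.norm(v - U[n + w])
    return err / (N - w)
```

Because each shooting window covers only w coarse steps, the windows are independent of each other, which is what makes the computation parallelizable and keeps noise from accumulating over a long horizon.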
If A_1 = A_0, the algorithm is terminated; otherwise, we continue to the second iteration. At the second iteration, we refine the selection from the index set A_1 with cardinality K_1. For k = 1, …, K_1, we run SP(k; [F]_{A_1}, D_t U) to obtain a coefficient vector ĉ^(k) ∈ R^K such that

ĉ^(k)_{A_1} = SP(k; [F]_{A_1}, D_t U) and ĉ^(k)_{A_1^∁} = 0,

and the associated PDE u_t = f_SP(k) as in (8). Among these PDEs, the one with the smallest MTEE is selected, and we denote

K_2 = argmin_{k=1,…,K_1} MTEE(ĉ^(k); w), and A_2 = supp(ĉ^(K_2)).

If A_2 = A_1, the algorithm is terminated; otherwise, we continue to the next iteration in a similar manner. The ST iteration is terminated when the index set remains the same, i.e., A_j = A_{j+1}. ST outputs a recovered coefficient vector and the corresponding PDE, denoted by ST(w). A complete description of ST is given in Algorithm 2, and Figure 3 (e) illustrates an example of the ST iteration.

Figure 3: (a) and (b) illustrate the idea of TEE. (c) and (d) explain MTEE when w = 2. The blue arrows in (a) and (c) represent time evolution using the forward Euler scheme on a fine time grid with spacing Δt̃ ≪ Δt. In (b), two different PDEs (green and red) are evolved, and the green one has a smaller TEE. In (d), the candidate PDEs are evolved from multiple time locations, and their numerical solutions are compared with the denoised data after a time length of wΔt. (e) An example of the ST iteration. Starting with a large number K, the first iteration gives rise to K candidate coefficients for k = 1, …, K. The PDE with the smallest MTEE is picked, e.g., SP(3) with cardinality K_1 = 3 and support A_1. The second iteration gives rise to the candidate coefficients only supported on A_1 using SP(k) with k = 1, 2, 3. The PDE with the smallest MTEE is found, e.g., SP(2) with cardinality K_2 = 2 and support A_2. The third iteration does not change the support, i.e., A_3 = A_2, so the final output is the coefficient vector of SP(2).

Algorithm 2: Subspace pursuit Time evolution (ST)
Input: F ∈ R^{NM^d×K}, D_t U ∈ R^{NM^d} and a positive integer w.
Initialization: j = 0, K_0 = K and A_0 = {1, 2, …, K}.
while A_{j+1} ≠ A_j do
  Step 1. For k = 1, 2, …, K_j, run SP(k; [F]_{A_j}, D_t U) to obtain a coefficient vector ĉ^(k) ∈ R^K such that ĉ^(k)_{A_j} = SP(k; [F]_{A_j}, D_t U) and ĉ^(k)_{A_j^∁} = 0, and the associated PDE u_t = f_SP(k) given in (8).
  Step 2. Among all the PDEs u_t = f_SP(k) for k = 1, …, K_j, select the one with the minimum MTEE(ĉ^(k); w) and update
    K_{j+1} = argmin_{k=1,…,K_j} MTEE(ĉ^(k); w) and A_{j+1} = supp(ĉ^(K_{j+1})).
  If A_{j+1} = A_j, terminate the algorithm; otherwise, update j = j + 1.
Output: Recovered coefficient ĉ^(K_{j+1}) and the corresponding PDE, denoted by ST(w).

3.2 Subspace Pursuit Cross-validation (SC)

Our second method utilizes the idea of cross-validation for the linear system in (5). Cross-validation is commonly used in statistics for the choice of parameters in order to avoid overfitting [14]. We consider two-fold cross-validation, where the data are partitioned into two subsets: one subset is used to estimate the coefficient vector and the other is used to validate the candidates. If a good coefficient vector is found within one subset, it should yield a small validation error for the other subset because of consistency.

For some fixed ratio parameter α ∈ (0, 1), we partition the rows of D_t U ∈ R^{NM^d} (and F ∈ R^{NM^d×K}) into two groups indexed by T_1 and T_2, such that T_1 consists of the indices of the first ⌊αNM^d⌋ rows and T_2 consists of the indices of the rest of the rows. Since we focus on PDEs with constant coefficients, the idea of cross-validation is applicable: if a correct support is identified, the coefficient vector obtained from the data in T_1 should be compatible with the data in T_2.

We introduce our Subspace pursuit Cross-validation (SC) algorithm, where cross-validation is incorporated into the SP algorithm. SC consists of the following three steps:
Step 1: For each sparsity level k = 1, 2, …, K, use SP to select a set of active features: A_k = supp(SP(k; F, D_t U)).

Step 2: Use the data in T_1 to compute the estimator of the coefficient vector, ĉ^(k) ∈ R^K, by the least-squares problem

ĉ^(k) = argmin_{c ∈ R^K : c_{A_k^∁} = 0} ‖[F]^{T_1}_{A_k} c_{A_k} − [D_t U]^{T_1}‖_2,

and then use the data in T_2 to compute a Cross-validation Estimation Error (CEE):

CEE(A_k; α, T_1, T_2) = ‖[D_t U]^{T_2} − [F]^{T_2} ĉ^(k)‖_2. (9)

Step 3: Set k_min = argmin_k CEE(A_k; α, T_1, T_2), and the estimated coefficient vector is given as

ĉ = argmin_{c ∈ R^K : c_{A_{k_min}^∁} = 0} ‖[F]^{T_1}_{A_{k_min}} c_{A_{k_min}} − [D_t U]^{T_1}‖_2.

The identified PDE by SC is denoted as SC(α).

CEE in (9) is an effective measure of consistency. If the support of the estimated coefficient vector matches that of the true one, CEE is guaranteed to be small, provided with sufficiently high resolution in time and space.

Theorem 3.1. Assume that D_t U → u_t and F → F̄ as Δt, Δx → 0. Let Ā = supp(c), where c is the coefficient vector of the true PDE. For any set of support A, we have

CEE(A; α, T_1, T_2) ≤ ‖([F̄]^{T_2}_A ([F̄]^{T_1}_A)^† − [F̄]^{T_2}_Ā ([F̄]^{T_1}_Ā)^†)[u_t]^{T_1}‖_2 + g(A; α, T_1, T_2),

where g > 0 is a positive function independent of A, such that g → 0 as Δt, Δx → 0.

Proof. See Appendix B.

In (9), the data in T_1 serve as the training set, and the data in T_2 act as the validation set. One can also use the data in T_2 for training and the data in T_1 for validation, which gives rise to the cross-validation estimation error CEE(A_k; 1−α, T_2, T_1). To improve the robustness of SC, we replace (9) with the averaged cross-validation error

CEE(A_k, α) = (1/2)(CEE(A_k; α, T_1, T_2) + CEE(A_k; 1−α, T_2, T_1)).

In general, one can randomly pick part of the data as the training set and use the rest as the validation set. For simplicity, we split the data according to the row index in this paper. The proposed SC algorithm is summarized in Algorithm 3. In comparison with ST, SC does not involve any numerical evolution of the candidate PDE, so the computation of SC is faster.

Algorithm 3: Subspace pursuit Cross-validation (SC)
Input: F ∈ R^{NM^d×K} and D_t U ∈ R^{NM^d}; 0 < α < 1.
Step 1. For k = 1, 2, …, K, run SP(k; F, D_t U) to obtain the support of the candidate coefficients A_k = supp(SP(k; F, D_t U)).
Step 2. For each k, compute the averaged cross-validation error CEE(A_k, α) = (1/2)(CEE(A_k; α, T_1, T_2) + CEE(A_k; 1−α, T_2, T_1)).
Step 3. Choose the k which gives the smallest cross-validation error, k_min = argmin_k CEE(A_k, α), and estimate the coefficients by least squares as ĉ = argmin_{c ∈ R^K : c_{A_{k_min}^∁} = 0} ‖[F]^{T_1}_{A_{k_min}} c_{A_{k_min}} − [D_t U]^{T_1}‖_2.
Output: Recovered coefficient ĉ and the identified PDE, denoted by SC(α).

3.3 Comparison with Related Methods
In this section, we discuss error representations to compare the different objectives of PDE identification approaches. We consider two ways to measure errors in PDE identification. The first is the error between the identified numerical solution Û and the exact solution u, given by e(u) := Û − u. The second error is e(u_t) := D_t Û − u_t, which measures the difference between the numerical time derivative of Û and the ground truth u_t. The errors e(u) and e(u_t) are closely related, as explained in detail in Appendix A.

Many existing methods for the identification of PDEs or dynamical systems involve a minimization of e(u) or e(u_t). Consider the following decomposition of e(u):

e(u) = (Û − U) [Data fidelity] + (U − u) [Measurement error], (10)

where U is the given data. In (10), the data fidelity Û − U represents the accuracy of the identified PDE in comparison with the given data U. In the literature, a class of dynamic-fitting approaches such as [1, 4, 26, 36] focuses on controlling the data-fidelity error in order to ensure that the numerical prediction is consistent with the evolution of the given data. The measurement error U − u comes from data acquisition, where the given data are contaminated by noise. Denoising is an important step to reduce the measurement error.

The second error e(u_t) can be expressed as

e(u_t) = (D_t Û − D_t U) [Response error] + (D_t U − Fĉ) [Regression error] + F(ĉ − c) [Coefficient error] + (F − F̄)c [System error], (11)

where ĉ is the estimated coefficient.
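Note that the decomposition (11) is an exact algebraic identity: the four terms telescope to D_t Û − F̄c = e(u_t). A quick numerical check with randomly generated stand-ins (all quantities below are hypothetical, not data from this paper):

```python
import numpy as np

# Random stand-ins for the quantities in (11) -- purely illustrative.
rng = np.random.default_rng(0)
n, K = 50, 6
F = rng.normal(size=(n, K))                      # approximated feature matrix
F_bar = F + 0.01 * rng.normal(size=(n, K))       # true feature matrix
c = rng.normal(size=K)                           # true coefficients
c_hat = c + 0.1 * rng.normal(size=K)             # estimated coefficients
DtU = F @ c + 0.05 * rng.normal(size=n)          # numerical derivative of data
DtU_hat = F @ c_hat + 0.02 * rng.normal(size=n)  # derivative of identified solution
u_t = F_bar @ c                                  # ground truth

e_ut = DtU_hat - u_t
terms = (DtU_hat - DtU) + (DtU - F @ c_hat) + F @ (c_hat - c) + (F - F_bar) @ c
assert np.allclose(e_ut, terms)                  # the four terms sum to e(u_t)
```

Because the identity holds exactly for any ĉ, c, F and F̄, a method can only reduce e(u_t) by controlling the individual terms, which is how the approaches below differ.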
The first term, D_t Û − D_t U, is called the *response error*: the difference between the numerical time derivatives of the identified PDE and of the given data. The L² norm of the *regression error*, D_t U − F ĉ, is the most frequently used objective function in regression-based PDE identification [2, 3, 19, 28, 42]. In addition, one can introduce various types of regularization, such as L¹ regularization [16, 32, 33], to induce sparsity. The *coefficient error* F(ĉ − c) compares ĉ and c. This term vanishes when ĉ − c lies in the null space of F, which can occur even when ĉ ≠ c. If the initial condition of the PDE is too simple, the null space of F is very large, which makes the PDE identification problem ill-posed; see Equations (16) and (17) for an example. In order to guarantee a successful identification, the initial condition should have sufficient variations so that F satisfies an incoherence or null space property [12]. The final term represents the *system error*, which is due to the numerical differentiation in the computation of F. Our SDD denoising technique effectively reduces the system error.

We summarize in Table 1 the objectives considered by many existing methods in the literature. These methods are categorized according to which error term(s) they aim to minimize. As for our proposed methods, ST minimizes the data fidelity, and SC focuses on the coefficient error and the regression error for the PDE identification.

In this section, we perform a systematic numerical study to demonstrate the effectiveness of ST and SC and compare them to IDENT [16]. To measure the identification error, we use the following relative coefficient error e_c and grid-dependent residual error e_r:

    e_c = ‖ĉ − c‖₂ / ‖c‖₂,
    e_r = √(Δx Δt) ‖F(ĉ − c)‖₂        for one-dimensional PDEs,
    e_r = √(Δx Δy Δt) ‖F(ĉ − c)‖₂    for two-dimensional PDEs.

Type of Problems      | Objectives in Minimization            | Methods
----------------------|---------------------------------------|------------------------
Parameter Estimation  | Data fidelity                         | [1, 4, 26, 27, 36, 40]
                      | Regression error                      | [2, 3, 19, 28, 42]
                      | Regression error + Data fidelity      | [46]
Model Identification  | Data fidelity                         | [22]
                      | Regression error                      | [32, 33]
                      | Regression error + Data fidelity      | ST (Section 3.1), [16]
                      | Regression error + Coefficient error  | SC (Section 3.2)

Table 1: Comparison of the objectives of PDE identification. For parameter estimation problems, the feature variables of the underlying PDEs are known. For model identification problems, such an active set is unknown; hence sparsity is often imposed, or a neural network is designed.

The relative coefficient error e_c measures the accuracy in the recovery of the PDE coefficients, while the residual error e_r measures the difference between the learned dynamics and the dynamics denoised by SDD. Since each feature vector in F may have a different scale, e_r can differ from e_c in some cases. In particular, when the given data contain noise, the features containing higher order derivatives have greater magnitudes than those containing lower order derivatives, so a small coefficient error in the high order terms may lead to a large e_r. We use both e_c and e_r to quantify the PDE identification error.

To generate the data, we first solve the underlying PDE by the forward Euler scheme with time step δt and space step δx, downsample the result to the grid with steps Δt and Δx, and then add Gaussian noise with standard deviation σ to the clean data. We say that the noise level is p% when σ = (p/100) √( (1/(N M^d)) Σ_n Σ_i u(x_i, t_n)² ). In the computation of D_t U and the feature matrix F, we always use SDD with MLS with bandwidth h = 0.04 as the smoother. When MLS is used to denoise the data of two-dimensional PDEs, one can either fit two-dimensional polynomials or fit one-dimensional polynomials in each dimension; in this work, we use the latter. In ST, the refined time step Δ̃t is a fixed fraction of Δt. We assume f is a polynomial of degree up to 2. We consider 10 features, 1, u, u², u_x, u_x², uu_x, u_xx, u_xx², uu_xx, and u_x u_xx, as the dictionary for one-dimensional PDEs, and 28 features for two-dimensional PDEs, which exhaust all available features up to order 2 and degree 2, e.g., u_xx u_y and u_xy u_yy.

Our first experiment is a transport equation with zero Dirichlet boundary condition:

    u_t = −u_x,    (12)

with an initial condition of u(0, x) = sin²(2πx/(1 − T)) cos(2πx/(1 − T)) for 0 ≤ x ≤ 1 − T and 0 otherwise, for 0 < t ≤ T. The given data D are generated by explicitly solving (12) with δx = Δx and δt = Δt.

Table 2: Identification of the transport equation (12) with different noise levels, reporting the identified PDEs and the errors e_c and e_r for ST(20) and SC(1/200), with and without SDD. In the noise-free case, applying SDD does not introduce strong bias. The identification results by ST and SC are stable even with 30% noise.

Applying SDD makes only a small difference in the noise-free case. With clean data, SC identifies an additional u_xx term with a small coefficient, while ST is capable of ruling out all wrong terms; the corresponding e_c and e_r are both small.
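The noise model and the two error metrics above can be sketched in a few lines; the function names are ours for illustration, not from any released code.

```python
import numpy as np

def add_noise(u, p, seed=0):
    # p% noise: sigma = (p/100) * sqrt(mean of u^2 over all grid points and times)
    sigma = (p / 100.0) * np.sqrt(np.mean(u ** 2))
    rng = np.random.default_rng(seed)
    return u + rng.normal(0.0, sigma, size=u.shape)

def coefficient_error(c_hat, c):
    # relative coefficient error e_c = ||c_hat - c||_2 / ||c||_2
    return np.linalg.norm(c_hat - c) / np.linalg.norm(c)

def residual_error(F, c_hat, c, dx, dt):
    # grid-dependent residual error e_r = sqrt(dx * dt) * ||F (c_hat - c)||_2
    # (for 1D PDEs; a 2D version would use sqrt(dx * dy * dt))
    return np.sqrt(dx * dt) * np.linalg.norm(F @ (c_hat - c))
```

Note that e_r weights the coefficient mismatch by the feature matrix F, which is why a small e_c on a high order term can still produce a large e_r.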
With 10% or 30% noise, both ST and SC identify the same PDE with small e_c and e_r values. To demonstrate the significance of SDD and the effectiveness of ST and SC, we display the noisy data with 10% and 30% noise, the denoised data, and the recovered dynamics in Figure 4. Even though the given data contain a large amount of noise, the recovered dynamics are close to the clean data.

Figure 4: Noisy and denoised data of the transport equation (12), as well as simulations of the recovered PDE. (a) The clean data, (b) data with 10% noise, (c) the denoised data S_x[U], (d) simulation of the PDE identified by ST and SC (identical). (e) Data with 30% noise, (f) the denoised data S_x[U], and (g) simulation of the PDE identified by ST and SC (identical).

Figure 5 shows how e_c and e_r change when the noise level varies from 0.1% to 100%. Each experiment is repeated 50 times and the error is averaged. We test IDENT, ST(20) and SC(1/200). Figure 5 (a) shows that the e_c of ST or SC is much smaller than that of IDENT once the noise level exceeds 20%. Figure 5 (b) shows e_r versus noise. The coefficient error e_c by ST and SC is significantly smaller than that of IDENT.
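The time-evolution criterion behind ST can be illustrated on the transport equation (12): restart from each time slice of the (denoised) data, evolve the candidate PDE with a refined time step, and accumulate the mismatch at the next slice. The following is a minimal sketch under our own naming, specialized to u_t = c u_x with periodic index handling for brevity; it is not the authors' implementation.

```python
import numpy as np

def evolve_transport(u0, c, dx, dt_fine, n_steps):
    # forward Euler with upwind differencing for u_t = c * u_x
    u = u0.copy()
    for _ in range(n_steps):
        if c <= 0:
            ux = (u - np.roll(u, 1)) / dx      # backward difference (upwind for c <= 0)
        else:
            ux = (np.roll(u, -1) - u) / dx     # forward difference (upwind for c > 0)
        u = u + dt_fine * c * ux
    return u

def time_evolution_error(U, c, dx, dt, refine=20):
    # multi-shooting: restart from the data at every time slice, evolve one
    # data step with the refined step dt/refine, and accumulate the mismatch
    err = 0.0
    for n in range(U.shape[0] - 1):
        pred = evolve_transport(U[n], c, dx, dt / refine, refine)
        err += np.linalg.norm(pred - U[n + 1]) ** 2
    return np.sqrt(err)
```

ST evaluates this error for each candidate PDE produced by Subspace Pursuit and keeps the candidate with the smallest value.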
Figure 5: The average errors e_c and e_r over 50 experiments of the transport equation (12) with respect to various noise levels. (a) The curves represent the average e_c for IDENT [16] (Green), ST (Red) and SC (Blue), and the standard deviation is represented by vertical bars. (b) The average and variation of e_r for IDENT (Green), ST (Red) and SC (Blue). The coefficient error e_c by ST and SC is significantly smaller than that of IDENT.

In Figure 6, we explore the robustness of SC with respect to the choice of α. We present e_c and e_r versus 1/α in (a) and (b) respectively, with 1%, 5%, 10%, and 20% noise. Each experiment is repeated 50 times and the error is averaged. The result shows that SC in this case is not sensitive to α, and a wide range of choices of α gives rise to a small error.
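The validation step behind SC can be sketched as follows: hold out some rows of the linear system D_t U ≈ F c, fit the coefficients on the remaining rows restricted to a candidate support, and score the candidate by the held-out residual. Treating α as the held-out fraction is our illustrative reading only; the paper defines the role of α precisely.

```python
import numpy as np

def cross_validation_error(F, b, support, alpha=0.01, seed=0):
    # Hold out a fraction alpha of the rows for validation (illustrative
    # reading of SC's parameter alpha), fit least squares on the rest
    # restricted to the candidate support, and return the held-out residual.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(b))
    k = max(1, int(alpha * len(b)))
    val, train = idx[:k], idx[k:]
    c, *_ = np.linalg.lstsq(F[train][:, support], b[train], rcond=None)
    return np.linalg.norm(F[val][:, support] @ c - b[val])
```

A wrong support cannot fit the held-out rows, so its validation error stays large even when its training residual is small; this is what makes the criterion robust over a wide range of α.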
Figure 6: Robustness of SC to the choice of α for the recovery of the transport equation (12). (a) and (b) display e_c and e_r versus 1/α respectively, with 1% (Blue), 5% (Red), 10% (Orange), 20% (Purple) noise. Each experiment is repeated 50 times and the errors are averaged. We observe that SC is not sensitive to α, and a wide range of values of α gives rise to a small error.

In the second example, we test on the Burgers' equation, a first order nonlinear PDE:

    u_t = −uu_x.    (13)

We use the initial condition u = sin(4πx) cos(πx) and zero Dirichlet boundary condition. Our data are generated by solving (13) with δx = Δx and δt = Δt.

Table 3: Identification of the Burgers' equation (13) with different noise levels, reporting the identified PDEs and the errors e_c and e_r for ST(20) and SC(1/500), with and without SDD. The identification results by ST and SC are good, with small e_c and e_r, for noise levels up to 40%.

Table 3 shows the results of ST(20) and SC(1/500) at various noise levels. With clean data, ST identifies an additional u_x u_xx term, but its coefficient is very small and the corresponding e_c and e_r are small; SC works very well on clean data. With 10% and 40% noise, both methods identify the same PDE with small e_c and e_r.

Figure 7 shows how e_c and e_r change when the noise level varies from 0.1% to 90%. Each experiment is repeated 50 times and the errors are averaged. We test IDENT, ST(20) and SC(1/500), and the results in Figure 7 show that ST and SC perform better than IDENT.
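The candidate supports evaluated by ST and SC are produced by Subspace Pursuit. The following is a generic k-sparse least squares solver after Dai and Milenkovic [10], written by us as an illustration rather than the authors' implementation.

```python
import numpy as np

def subspace_pursuit(F, b, k, max_iter=20):
    # Subspace Pursuit: maintain a support of size k, alternately expanding
    # it with the k columns most correlated with the residual and shrinking
    # it back to the k largest least squares coefficients.
    n = F.shape[1]

    def ls(sup):
        c, *_ = np.linalg.lstsq(F[:, sup], b, rcond=None)
        return c

    # initial support: k columns most correlated with b
    S = np.argsort(-np.abs(F.T @ b))[:k]
    r = b - F[:, S] @ ls(S)
    for _ in range(max_iter):
        T = np.union1d(S, np.argsort(-np.abs(F.T @ r))[:k])  # expand
        cT = ls(T)
        S_new = T[np.argsort(-np.abs(cT))[:k]]               # shrink
        r_new = b - F[:, S_new] @ ls(S_new)
        if np.linalg.norm(r_new) >= np.linalg.norm(r):
            break                                            # residual stopped improving
        S, r = S_new, r_new
    c = np.zeros(n)
    c[S] = ls(S)
    return np.sort(S), c
```

Running this for each sparsity level k yields the candidate set from which ST or SC selects the final PDE.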
Figure 7: The average errors e_c and e_r over 50 experiments of the Burgers' equation (13) with respect to various noise levels. (a) The curves represent the average e_c for IDENT [16] (Green), ST (Red) and SC (Blue), and the standard deviations are represented by vertical bars. (b) The average and variation of e_r for IDENT (Green), ST (Red) and SC (Blue). The e_c and e_r of ST and SC are much smaller than those of IDENT.

Our third example is the Burgers' equation with diffusion, a second order nonlinear PDE:

    u_t = −uu_x + ν u_xx,    (14)

with a small diffusion coefficient ν > 0. We use the initial condition u = sin(3πx) cos(πx) and zero Dirichlet boundary condition. We first solve (14) with fine steps δx and δt up to T = 0.05, and downsample the numerical solution so that Δx = 1/64.

Table 4: Identification of the Burgers' equation with diffusion (14) with different noise levels, reporting the identified PDEs and the errors e_c and e_r for ST(20) and SC(1/10). The identification results by ST and SC are good, with small e_c and e_r, for noise levels up to 5%.

Table 4 shows the results of ST(20) and SC(1/10) at various noise levels. With clean data, 1% noise, and 5% noise, both methods identify the PDE with small e_c and e_r.

Figure 8 shows how e_c and e_r change when the noise level varies from 0.1% to 10%. Each experiment is repeated 50 times and the error is averaged. We test IDENT, ST(20) and SC(1/10). Among the three methods, ST is the best; SC does not perform as well as ST and IDENT when the noise level is large. For high order PDEs, the high order derivatives are heavily contaminated by noise, even with SDD, which affects the accuracy of cross validation. Since ST and IDENT use time evolution, it is easier for them to pick the correct features. In general, ST performs better than SC for high order PDEs when the given data contain heavy noise.
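The claim that high order derivatives are heavily contaminated by noise can be checked with a small experiment: compare a raw centered second difference with the second derivative of a local least squares polynomial fit, a simplified, uniformly weighted stand-in for the MLS smoother (not the paper's exact MLS implementation).

```python
import numpy as np

def fd_second(u, dx):
    # centered second difference; amplifies noise by O(1/dx^2)
    return (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2

def local_poly_second(x, u, h, deg=4):
    # fit a degree-`deg` polynomial by least squares in a sliding window of
    # half-width h, then differentiate the fit analytically at the center
    d2 = np.empty_like(u)
    for i, xi in enumerate(x):
        m = np.abs(x - xi) <= h
        p = np.polyfit(x[m] - xi, u[m], deg)   # highest degree first
        d2[i] = 2 * p[deg - 2]                 # second derivative at the center
    return d2
```

On a noisy sine wave, the raw second difference is dominated by amplified noise, while the local polynomial fit stays close to the true second derivative; this is the effect SDD exploits before forming the feature matrix.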
Figure 8: The average errors e_c and e_r over 50 experiments of the Burgers' equation with diffusion (14) with respect to various noise levels. (a) The curves represent the average e_c for IDENT [16] (Green), ST (Red) and SC (Blue), and the standard deviations are represented by vertical bars. (b) The average and variation of e_r for IDENT (Green), ST (Red) and SC (Blue). Among the three methods, ST gives the best result.

In Figure 9, we explore the effect of α in SC on the Burgers' equation with diffusion. Figure 9 (a) and (b) show e_c and e_r versus 1/α respectively, with 0.5%, 1%, 3%, and 5% noise. Each experiment is repeated 50 times and the errors are averaged. When the noise level is low, such as 0.5% and 1%, there is a wide range of good choices of α which give rise to a small error. As the noise level increases, the range of the optimal α becomes narrow.

Figure 9: Robustness of SC to the choice of α for the recovery of the Burgers' equation with diffusion (14). (a) and (b) display e_c and e_r versus 1/α respectively, with 0.5% (Blue), 1% (Red), 3% (Orange), 5% (Purple) noise. Each experiment is repeated 50 times and the errors are averaged. When the noise level is low, such as 0.5% and 1%, there is a wide range of values for α which give a small error. As the noise level increases, the range of the optimal α becomes narrow.

We apply our methods to identify two-dimensional PDEs. The PDEs are solved with fine steps δx = δy and δt, and the data are downsampled from the numerical solution with Δx = 0.04 and a coarser time step Δt. We fix w = 10 for ST and α = 3/200 for SC.

The identification of two-dimensional PDEs is more challenging and more sensitive to noise. There are more features in two dimensions, and the directional variation of the data adds complexity to the problem. We will show that both ST and SC are still robust against noise.

We first consider the following PDE:

    u_t = ν u_xx − uu_y    for (x, y, t) ∈ [0, 1] × [0, 1] × [0, T],    (15)

with diffusion coefficient ν > 0, where the initial condition u(x, y, 0) is a product of sine functions of x and y supported on a square subdomain and 0 otherwise. This PDE has different dynamics along the x and y directions. Table 5 shows the identification results of ST(10) and SC(3/200) with noise levels 0%, 5% and 10%. Both methods identify the same features with small e_c and e_r.

Table 5: Identification of the two dimensional PDE (15) with different noise levels, reporting the identified PDEs and the errors e_c and e_r for ST(10) and SC(3/200). The identification results by ST and SC have small e_c and e_r for noise levels up to 10%.

For PDE identification, especially in high dimensions, the given data U play an important role. When the initial condition has sufficient variations in each dimension, the correct PDE can be identified. Otherwise, there may be multiple PDEs which generate the same dynamics. For example, consider the following transport equation without noise:

    u_t = −a u_x + b u_y,    (x, y) ∈ [0, 1] × [0, 1],  t ∈ [0, T],
    u(x, y, 0) = f(x, y),    (x, y) ∈ [0, 1] × [0, 1],    (16)

with constants a, b > 0, where f denotes the initial condition. We first choose f(x, y) = sin(2πx/0.9) sin(2πy/0.9) for (x, y) ∈ [0, 0.9] × [0, 0.9] and 0 otherwise. The PDE identified by SC(3/200) recovers both terms, with coefficients very close to the true ones. Next we choose f(x, y) = sin(2πx/0.9) for (x, y) ∈ [0, 0.9] × R and 0 otherwise; then SC(3/200) gives

    u_t = −a u_x,    (17)

with the recovered coefficient close to the true a. With this initial condition, the PDE in (16) has the exact solution u(x, y, t) = f(x − at), a traveling wave moving in the x direction only, which also satisfies u_t = −a u_x. The identified PDE in (17) approximates this simpler equation. Since the given data only vary along the x direction, the columns in the feature matrix related to y, e.g., u_y, u_x u_y, and u_yy, are mostly 0. This explains why our method identifies the PDE in (17) instead of (16).

In this paper, we use Moving Least Squares (MLS) as the denoiser in SDD. To numerically justify this choice against Moving Average (MA) [38], cubic spline smoothing [9], and diffusion smoothing [45], we present the SDD results with these smoothers in Figure 10. We first solve the PDE

    u_t = −c₁ uu_x − c₂ uu_y,    (x, y) ∈ [0, 1] × [0, 1],  t ∈ [0, T],    (18)

with positive constants c₁, c₂, using Δt = 0.005 and Δx = Δy = 0.01, where the initial condition is u(x, y) = sin(3πx) sin(5πy). Then 5% Gaussian noise is added to the numerical solution. Given the noisy data, we perform SDD denoising with the different smoothers to obtain various partial derivatives. In MLS, we take the bandwidth h = 0.04. For MA, the window size for averaging is fixed to be 3. For cubic splines, we use the MATLAB function csaps with p = 0.5. For diffusion denoising, we evolve the noisy surface following the heat equation u_t = u_xx + u_yy with a time step size proportional to (Δx)². Figure 10 compares u, u_x, u_yy, and uu_x at t = 0.15 when the different smoothers are used in SDD. All of them recover U (the first row), while MLS preserves the underlying dynamics, i.e., the first and second order derivatives, the best.

Figure 10: SDD results with different smoothers. The first column is the numerical solution of (18) at t = 0.15 with the initial condition u(x, y) = sin(3πx) sin(5πy), together with its various partial derivatives. The second column shows the noisy data and its numerical derivatives when 5% Gaussian noise is added to the clean data. The remaining four columns are the SDD results at t = 0.15 using MA, cubic spline, diffusion and MLS, in order. While all methods recover U (the first row), the dynamics of the derivatives, especially in the third and fourth rows, are best preserved by MLS.

This paper developed two robust methods for PDE identification from noisy data. First, we proposed a Successively Denoised Differentiation (SDD) procedure to stabilize numerical differentiation, which significantly improves the accuracy of the feature matrix computed from noisy data. We then proposed two new robust PDE identification algorithms, ST and SC. These algorithms utilize the Subspace Pursuit (SP) greedy algorithm to efficiently select a candidate set, and then refine the results by time evolution or cross validation. We presented various numerical experiments to demonstrate the effectiveness of both methods: SC is more computationally efficient, while ST performs better for PDEs with high order derivatives. We also provided an error analysis of ST and SC in the context of PDE identification, which unifies many related methods in the literature.
Appendix

A  Relation between the Errors e(u) and e(u_t)

In this section, we explain the statement in Section 3.3: if the numerical scheme for the computation of D_t Û is consistent, then ‖e(u)‖_∞ → 0 and ‖e(u_t)‖_∞ → 0 are equivalent as Δt, Δx → 0. For n = 0, 1, ..., N, we denote by e(u)ⁿ and e(u_t)ⁿ the values of e(u) and e(u_t) at time nΔt, respectively. For j = 1, 2, ..., N, we have

    (e(u)^j − e(u)^{j−1}) / Δt = (Û^j − Û^{j−1}) / Δt − u_t^{j−1} + r′
                               = e(u_t)^{j−1} + ( (Û^j − Û^{j−1}) / Δt − [D_t Û]^{j−1} ) + r′,

where ‖r′‖_∞ = O(Δt). By induction, we obtain the following connection between e(u) and e(u_t):

    e(u)ⁿ = e(u)⁰ + Σ_{j=0}^{n−1} e(u_t)^j Δt + Σ_{j=0}^{n−1} ( (Û^{j+1} − Û^j) / Δt − [D_t Û]^j ) Δt + n r,    (19)

where the remainder satisfies ‖r‖_∞ = O(Δt²). Equation (19) suggests that if the approximation D_t Û is consistent and ‖e(u)⁰‖_∞ converges to 0 as Δx → 0, then ‖e(u)‖_∞ → 0 and ‖e(u_t)‖_∞ → 0 are equivalent.

It is often practical to consider a grid-dependent L²-norm of the errors, i.e., ‖·‖_{2,Δ} = √(Δx^d Δt) ‖·‖₂, where ‖·‖₂ denotes the ordinary L² vector norm. We provide an upper bound for ‖e(u)‖_{2,Δ}.

Theorem A.1.
Suppose D_t Û is computed using the forward difference. Then

    ‖e(u)‖_{2,Δ} ≤ √(X^d T³) ‖e(u_t)‖_∞ + O( (‖e(u_t)‖_∞ + Δt) Δt ) + O(Δt),    (20)

where the constants in the O-terms depend on the initial error e(u)⁰ and the domain size X, T.

Proof. Recall that U ∈ R^{M^d N} is the vectorization of the data, so by the definition of the grid-dependent norm,

    ‖U‖_{2,Δ} = √(Δx^d Δt) ‖U‖₂ = √( X^d T / (M^d N) ) ‖U‖₂.

Since D_t Û is the forward difference, [D_t Û]^j = (Û^{j+1} − Û^j)/Δt, so the third sum in (19) vanishes and

    e(u)ⁿ = e(u)⁰ + Σ_{j=0}^{n−1} e(u_t)^j Δt + n r,    ‖r‖_∞ = O(Δt²).

Using ‖e(u_t)^j‖₂ ≤ M^{d/2} ‖e(u_t)‖_∞ and nΔt ≤ T, this gives, for every n,

    ‖e(u)ⁿ‖₂ ≤ ‖e(u)⁰‖₂ + M^{d/2} T ‖e(u_t)‖_∞ + M^{d/2} O(T Δt).

Summing the squares over n = 0, 1, ..., N and multiplying by Δx^d Δt = X^d T/(M^d N), we obtain

    ‖e(u)‖_{2,Δ} ≤ √((N+1)/N) ( √(X^d T / M^d) ‖e(u)⁰‖₂ + √(X^d T) ( T ‖e(u_t)‖_∞ + O(T Δt) ) ),

which yields (20). The constants in the O-terms depend on Δt and the domain size X, T. To derive useful information from Theorem A.1, we assume that ‖e(u_t)‖_∞ = O(Δt).
This condition holds, for example, when we use the first order forward difference and the underlying data are noiseless.

Corollary A.2. When the time-space domain is fixed, i.e., T > 0 and X > 0, if ‖e(u_t)‖_∞ = O(Δt), then

    ‖e(u)‖_{2,Δ} → 0    as Δt, Δx → 0.    (21)

This result suggests that, when the assumptions are satisfied, increasing both the time and space resolutions is a sufficient condition for controlling ‖e(u)‖_{2,Δ}. The convergence of ‖e(u)‖_{2,Δ} as Δt, Δx → 0 follows from Theorem A.1.

B  Proof of Proposition 3.1
Proof. Let F̄ denote the noise-free feature matrix, and let T₁ and T₂ denote the time indices used for fitting and validation, respectively. Then

    [D_t U]_{T₂} − [F]_{T₂,A} ([F]_{T₁,A})† [D_t U]_{T₁}
      = [D_t U]_{T₂} − [u_t]_{T₂} + [u_t]_{T₂} − [F]_{T₂,A} ([F]_{T₁,A})† [D_t U]_{T₁}
      = E₁ + [u_t]_{T₂} − [F]_{T₂,A} ([F]_{T₁,A})† [u_t]_{T₁} − [F]_{T₂,A} ([F]_{T₁,A})† ([D_t U]_{T₁} − [u_t]_{T₁}),

where E₁ := [D_t U]_{T₂} − [u_t]_{T₂} and the last term is denoted E₂. Adding and subtracting the noise-free features,

      = [u_t]_{T₂} − ([F̄]_{T₂,A} + [F]_{T₂,A} − [F̄]_{T₂,A}) ([F]_{T₁,A})† [u_t]_{T₁} + E₁ + E₂
      = [u_t]_{T₂} − [F̄]_{T₂,A} ([F]_{T₁,A})† [u_t]_{T₁} − ([F]_{T₂,A} − [F̄]_{T₂,A}) ([F]_{T₁,A})† [u_t]_{T₁} + E₁ + E₂,

where the third term is denoted E₃. Since [u_t]_{T₂} − [F̄]_{T₂,A} ([F̄]_{T₁,A})† [u_t]_{T₁} = 0,

      = ( [F̄]_{T₂,A} ([F̄]_{T₁,A})† − [F̄]_{T₂,A} ([F]_{T₁,A})† ) [u_t]_{T₁} + E₁ + E₂ + E₃
      = −[F̄]_{T₂,A} ( ([F]_{T₁,A})† − ([F̄]_{T₁,A})† ) [u_t]_{T₁} + E₁ + E₂ + E₃
      =: E₄ + E₁ + E₂ + E₃.
Then we have

    CEE(A_k; α, T₁, T₂) ≤ ‖[F̄]_{T₂,A} ( ([F]_{T₁,A})† − ([F̄]_{T₁,A})† ) [u_t]_{T₁}‖₂ + ‖[D_t U]_{T₂} − [u_t]_{T₂}‖₂
        + ‖([F]_{T₁,A})†‖₂ ( ‖[F]_{T₂,A}‖₂ ‖[D_t U]_{T₁} − [u_t]_{T₁}‖₂ + ‖[F]_{T₂,A} − [F̄]_{T₂,A}‖₂ ‖[u_t]_{T₁}‖₂ )
        + ‖[F̄]_{T₂,A}‖₂ ‖([F]_{T₁,A})†‖₂ ‖([F̄]_{T₁,A})†‖₂ ‖[F]_{T₁,A} − [F̄]_{T₁,A}‖₂ ‖[u_t]_{T₁}‖₂.

In the last term on the right-hand side of the inequality, we applied the pseudo-inverse perturbation bound in Theorem 4.1 of [43].

References

[1] E. Baake, M. Baake, H. Bock, and K. Briggs. Fitting ordinary differential equations to chaotic data.
Physical Review A, 45(8):5524, 1992. [2] M. Bär, R. Hegger, and H. Kantz. Fitting partial differential equations to space-time dynamics.
Physical Review E , 59(1):337, 1999.[3] H. G. Bock. Numerical treatment of inverse problems in chemical reaction kinetics. In
Modelling of chemical reaction systems, pages 102–125. Springer, 1981. [4] H. G. Bock. Recent advances in parameter identification techniques for ODE. In Numerical treatment of inverse problems in differential and integral equations, pages 95–121. Springer, 1983. [5] J. Bongard and H. Lipson. Automated reverse engineering of nonlinear dynamical systems.
Proceedings of the National Academy of Sciences , 104(24):9943–9948, 2007.[6] M. Bongini, M. Fornasier, M. Hansen, and M. Maggioni. Inferring interaction rules from ob-servations of evolutive systems i: The variational approach.
Mathematical Models and Methods in Applied Sciences, 27(05):909–951, 2017. [7] S. L. Brunton, J. L. Proctor, and J. N. Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016. [8] E. J. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2):489–509, 2006. [9] P. Craven and G. Wahba. Smoothing noisy data with spline functions.
Numerische mathematik ,31(4):377–403, 1978.[10] W. Dai and O. Milenkovic. Subspace pursuit for compressive sensing signal reconstruction.
IEEE Transactions on Information Theory, 55(5):2230–2249, 2009. [11] D. L. Donoho. Compressed sensing.
IEEE Transactions on Information Theory , 52(4):1289–1306, 2006.[12] D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition.
IEEE Transactions on Information Theory, 47(7):2845–2862, 2001. [13] A. Harten, B. Engquist, S. Osher, and S. R. Chakravarthy. Uniformly high order accurate essentially non-oscillatory schemes, III. In
Upwind and high-resolution schemes , pages 218–290. Springer, 1987.[14] T. Hastie, R. Tibshirani, and J. Friedman.
The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009. [15] E. Kaiser, J. N. Kutz, and S. L. Brunton. Sparse identification of nonlinear dynamics for model predictive control in the low-data limit. Proceedings of the Royal Society A, 474(2219):20180335, 2018. [16] S. H. Kang, W. Liao, and Y. Liu. IDENT: Identifying differential equations with numerical time evolution. arXiv preprint arXiv:1904.03538, 2019. [17] Y. Khoo and L. Ying. SwitchNet: a neural network model for forward and inverse scattering problems. arXiv preprint arXiv:1810.09675, 2018. [18] P. Lancaster and K. Salkauskas. Surfaces generated by moving least squares methods.
Math-ematics of computation , 37(155):141–158, 1981.[19] H. Liang and H. Wu. Parameter estimation for differential equation models using a frameworkof measurement error in regression models.
Journal of the American Statistical Association ,103(484):1570–1583, 2008.[20] J.-C. Loiseau and S. L. Brunton. Constrained sparse galerkin regression.
Journal of Fluid Mechanics, 838:42–67, 2018. [21] Z. Long, Y. Lu, and B. Dong. PDE-Net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network. Journal of Computational Physics, 399:108925, 2019. [22] Z. Long, Y. Lu, X. Ma, and B. Dong. PDE-Net: Learning PDEs from data. arXiv preprint arXiv:1710.09668, 2017. [23] F. Lu, M. Zhong, S. Tang, and M. Maggioni. Nonparametric inference of interaction laws in systems of agents from trajectory data.
Proceedings of the National Academy of Sciences ,116(29):14424–14433, 2019.[24] B. Lusch, J. N. Kutz, and S. L. Brunton. Deep learning for universal linear embeddings ofnonlinear dynamics.
Nature communications , 9(1):4950, 2018.[25] N. M. Mangan, J. N. Kutz, S. L. Brunton, and J. L. Proctor. Model selection for dynamicalsystems via sparse regression and information criteria.
Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 473(2204):20170009, 2017. [26] T. Müller and J. Timmer. Parameter identification techniques for partial differential equations. International Journal of Bifurcation and Chaos, 14(06):2053–2060, 2004. [27] T. G. Müller and J. Timmer. Fitting parameters in partial differential equations from partially observed noisy data.
Physica D: Nonlinear Phenomena , 171(1-2):1–7, 2002.[28] U. Parlitz and C. Merkwirth. Prediction of spatiotemporal time series based on reconstructedlocal states.
Physical Review Letters, 84(9):1890, 2000. [29] T. Qin, K. Wu, and D. Xiu. Data driven governing equations approximation using deep neural networks.
Journal of Computational Physics , 2019.[30] M. Raissi and G. E. Karniadakis. Hidden physics models: Machine learning of nonlinear partialdifferential equations.
Journal of Computational Physics , 357:125–141, 2018.[31] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics informed deep learning (part I): Data-driven solutions of nonlinear partial differential equations. arXiv preprint arXiv:1711.10561 ,2017.[32] S. H. Rudy, S. L. Brunton, J. L. Proctor, and J. N. Kutz. Data-driven discovery of partialdifferential equations.
Science Advances, 3(4):e1602614, 2017. [33] H. Schaeffer. Learning partial differential equations via data discovery and sparse optimization.
Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences ,473(2197):20160446, 2017.[34] H. Schaeffer, R. Caflisch, C. D. Hauck, and S. Osher. Sparse dynamics for partial differentialequations.
Proceedings of the National Academy of Sciences , 110(17):6634–6639, 2013.[35] H. Schaeffer, G. Tran, and R. Ward. Extracting sparse high-dimensional dynamics from limiteddata.
SIAM Journal on Applied Mathematics, 78(6):3279–3295, 2018. [36] M. Schmidt and H. Lipson. Distilling free-form natural laws from experimental data. Science, 324(5923):81–85, 2009. [37] S. W. Smith.
The Scientist and Engineer’s Guide to Digital Signal Processing . CaliforniaTechnical Pub. San Diego, 1997.[38] M. Tham. Dealing with measurement noise. moving average filter.
Chemical Engineering andAdvanced Materials, University of Newcastle upon Tyne , 1998.[39] R. Tibshirani. Regression shrinkage and selection via the lasso.
Journal of the Royal StatisticalSociety: Series B (Methodological) , 58(1):267–288, 1996.[40] J. Timmer, T. M¨uller, and W. Melzer. Numerical methods to determine calcium release fluxfrom calcium transients in muscle cells.
Biophysical journal , 74(4):1694–1707, 1998.[41] G. Tran and R. Ward. Exact recovery of chaotic systems from highly corrupted data.
Multiscale Modeling & Simulation, 15(3):1108–1129, 2017. [42] H. U. Voss, P. Kolodner, M. Abel, and J. Kurths. Amplitude equations from spatiotemporal binary-fluid convection data. Physical Review Letters, 83(17):3422, 1999. [43] P.-Å. Wedin. Perturbation theory for pseudo-inverses.
BIT Numerical Mathematics , 13(2):217–232, 1973.[44] H. Wendland. Local polynomial reproduction and moving least squares approximation.
IMAJournal of Numerical Analysis , 21(1):285–300, 2001.[45] A. P. Witkin. Scale-space filtering. In
Readings in Computer Vision , pages 329–332. Elsevier,1987.[46] X. Xun, J. Cao, B. Mallick, A. Maity, and R. J. Carroll. Parameter estimation of partialdifferential equation models.
Journal of the American Statistical Association, 108(503):1009–1020, 2013. [47] M. M. Zhang, H. Lam, and L. Lin. Robust and parallel Bayesian model selection.