Robust subsampling-based threshold sparse Bayesian regression to tackle high noise and outliers for data-driven discovery of differential equations

Sheng Zhang^a, Guang Lin^{a,b,c,*}

^a Department of Mathematics, Purdue University, West Lafayette, IN 47907, USA
^b School of Mechanical Engineering, Purdue University, West Lafayette, IN 47907, USA
^c Department of Statistics, Purdue University, West Lafayette, IN 47907, USA
Abstract
Data-driven discovery of differential equations has been an emerging research topic. We propose a novel algorithm, subsampling-based threshold sparse Bayesian regression (SubTSBR), to tackle high noise and outliers. The subsampling technique is used for improving the accuracy of the Bayesian learning algorithm. It has two parameters: subsampling size and the number of subsamples. When the subsampling size increases with fixed total sample size, the accuracy of our algorithm goes up and then down. When the number of subsamples increases, the accuracy of our algorithm keeps going up. We demonstrate how to use our algorithm step by step and the merits of our algorithm through four numerical examples: (1) predator-prey model with noise, (2) shallow water equations with outliers, (3) heat diffusion with random initial and boundary condition, and (4) fish-harvesting problem with bifurcations. We also compare our algorithm with other model-discovery methods as well as traditional least-squares regression and show that our algorithm produces better results.

Keywords: machine learning, data-driven discovery, Bayesian inference, subsampling, high noise, outlier
1. Introduction
The search for physical laws has been a fundamental aim of science for centuries. Physical laws are critical to the understanding of natural phenomena and the prediction of future dynamics. They are either derived from other known physical laws or generalized based on empirical observations of physical behavior. We focus on the second task, which is also called data-driven discovery of governing physical laws. It deals with the case where experimental data are given while the governing physical model is unclear. Traditional methods for discovering physical laws from data include interpolation and regression. Suppose $x: \mathbb{R} \to \mathbb{R}$ is an unknown physical law. Given data $\{t_i, x(t_i)\}_{i=1}^{N}$, traditional methods approximate the expression of $x(t)$ in terms of a class of functions of $t$. In this paper, we follow a different approach, data-driven discovery of governing differential equations, which adopts the strategy: first, discover the differential equations that $x(t)$ satisfies; second, solve the differential equations analytically or numerically.

* Corresponding author. Email address: [email protected] (Guang Lin)
This discovery pattern is applicable to a larger class of models than traditional methods and derives the governing differential equations, which provide insights into the governing physical laws behind the observations [1]. Many fundamental laws are formulated in the form of differential equations, such as the Maxwell equations in classical electromagnetism, the Einstein field equations in general relativity, the Schrodinger equation in quantum mechanics, the Navier-Stokes equations in fluid dynamics, the Boltzmann equation in thermodynamics, the predator-prey equations in biology, and the Black-Scholes equation in economics. While automated techniques for generating and collecting data from scientific measurements are more and more precise and powerful, automated processes for extracting knowledge in analytic forms from data are limited [2]. Our goal is to develop automated algorithms for extracting the governing differential equations from data.

Consider a differential equation of the form
$$\frac{dx}{dt} = f(t, x), \qquad (1)$$
with the unknown function $f(t, x)$. Given the data $\{t_i, x_i, x'_i\}_{i=1}^{N}$ collected from a space governed by this differential equation, where $x_i = x(t_i)$ and $x'_i = (dx/dt)(t_i)$, automated algorithms for deriving the expression of $f(t, x)$ are studied from various approaches (in application, if the gradients are not given, they can be approximated numerically). One of the approaches assumes that $f(t, x)$ is a linear combination of simple functions of $t$ and $x$. First, construct a moderately large set of basis-functions that may contain all the terms of $f(t, x)$; then, apply algorithms to select a subset that is exactly all the terms of $f(t, x)$ from the basis-functions and estimate the corresponding weights in the linear combination.

Suppose the basis-functions are chosen as $f_1(t, x), f_2(t, x), \ldots, f_M(t, x)$. Then we need to estimate the weights $w_1, w_2, \ldots, w_M$ in the following linear combination:
$$\frac{dx}{dt} = w_1 f_1(t, x) + w_2 f_2(t, x) + \cdots + w_M f_M(t, x). \qquad (2)$$
Given the data $\{t_i, x_i, x'_i\}_{i=1}^{N}$, where $x_i = x(t_i)$ and $x'_i = (dx/dt)(t_i)$, the above problem becomes a regression problem as follows:
$$\begin{bmatrix} x'_1 \\ x'_2 \\ \vdots \\ x'_N \end{bmatrix} =
\begin{bmatrix} f_1(t_1, x_1) & f_2(t_1, x_1) & \cdots & f_M(t_1, x_1) \\
f_1(t_2, x_2) & f_2(t_2, x_2) & \cdots & f_M(t_2, x_2) \\
\vdots & \vdots & \ddots & \vdots \\
f_1(t_N, x_N) & f_2(t_N, x_N) & \cdots & f_M(t_N, x_N) \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_M \end{bmatrix} + \epsilon, \qquad (3)$$
where $\epsilon = [\epsilon_1, \epsilon_2, \ldots, \epsilon_N]^T$ is the model error. Let
$$\eta = [x'_1, \ldots, x'_N]^T, \qquad (4)$$
$$\Phi = \begin{bmatrix} f_1(t_1, x_1) & \cdots & f_M(t_1, x_1) \\ \vdots & \ddots & \vdots \\ f_1(t_N, x_N) & \cdots & f_M(t_N, x_N) \end{bmatrix}, \qquad (5)$$
$$w = [w_1, \ldots, w_M]^T. \qquad (6)$$
Equation (3) may be written in the vector form as follows:
$$\eta = \Phi w + \epsilon. \qquad (7)$$
Now the problem is to estimate the weight-vector $w$ given a known vector $\eta$ and a known matrix $\Phi$.
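The assembly of $\eta$ and $\Phi$ from data is mechanical. The following minimal Python sketch (not the authors' code; the monomial basis and its degree are illustrative choices) shows one way to build the regression problem (7):

```python
import numpy as np

def build_library(t, x, dxdt, degree=3):
    """Assemble eta (N,) and Phi (N, M) in (7) from observations {t_i, x_i, x'_i},
    using monomials t^p * x^q with total degree <= `degree` as basis-functions."""
    t, x = np.asarray(t, float), np.asarray(x, float)
    eta = np.asarray(dxdt, float)
    columns, names = [], []
    for p in range(degree + 1):
        for q in range(degree + 1 - p):
            columns.append(t**p * x**q)   # includes the constant term (p = q = 0)
            names.append(f"t^{p} x^{q}")
    Phi = np.column_stack(columns)
    return eta, Phi, names
```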
Since many physical systems have few terms in the equations, the set of basis-functions usually has more terms than $f(t, x)$: $M > \#\{\text{terms in } f(t, x)\}$, which suggests the use of sparse methods to select the subset of basis-functions and estimate the weights. These sparse methods include linear algebraic methods, optimization methods, and Bayesian methods. Examples of linear algebraic methods are sequential threshold least squares (also called sparse identification of nonlinear dynamics, SINDy) [3], which does least-squares regression and eliminates the terms with small weights iteratively (Algorithm 1), and its variation sequential threshold ridge regression [4]. An advantage of linear algebraic methods is that they are easy to implement. Examples of optimization methods are Lasso (least absolute shrinkage and selection operator) [5, 6, 7] and its variation SR3 [8]. Lasso solves the following optimization problem:
$$\min_{w} \left\{ \frac{1}{2N} \|\eta - \Phi w\|_2^2 + \lambda \|w\|_1 \right\}, \qquad (8)$$
where the regularization parameter $\lambda$ may be fitted by cross-validation (Algorithm 2). An advantage of optimization methods is that they have an explicit objective function to optimize. An example of Bayesian methods is threshold sparse Bayesian regression [1], which calculates the posterior distribution of $w$ given the data and then filters out small weights, iteratively until convergence (Algorithm 3). An advantage of Bayesian methods is that they provide posterior distributions for further analysis. A comparison in [1] shows that threshold sparse Bayesian regression is more accurate and robust than sequential threshold least squares and Lasso.
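To make the iteration concrete, here is a minimal numpy sketch of sequential threshold least squares in the spirit of Algorithm 1; it is an illustration, not the reference implementation of [3]:

```python
import numpy as np

def stls(eta, Phi, threshold, max_iter=20):
    """Sequential threshold least squares: alternate least-squares fits with
    hard-thresholding of the small weights until the active set stops changing."""
    w = np.linalg.lstsq(Phi, eta, rcond=None)[0]
    w[np.abs(w) < threshold] = 0.0
    for _ in range(max_iter):
        active = w != 0
        if not active.any():
            break                                  # every term was eliminated
        w_new = np.zeros_like(w)
        w_new[active] = np.linalg.lstsq(Phi[:, active], eta, rcond=None)[0]
        w_new[np.abs(w_new) < threshold] = 0.0
        if np.array_equal(w_new != 0, active):
            w = w_new
            break                                  # support unchanged: converged
        w = w_new
    return w
```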
The same mechanism as above also applies to the discovery of general differential equations, including higher-order differential equations and implicit differential equations [1], besides the differential equations of the form (1). Nevertheless, the mechanism is described in the pattern (1) here for convenience and simplification, so that more attention is given to the essence of the algorithm itself. In addition, to apply the algorithm to real-world problems, dimensional analysis can be incorporated in the construction of the basis-functions [1]. Any physically meaningful equation has the same dimensions on every term, a property known as dimensional homogeneity. Therefore, when summing up terms in the equations, the addends should be of the same dimension.

Sparse regression methods for data-driven discovery of differential equations have also been developed recently in other papers with a wide range of applications, for example, inferring biological networks [9], sparse identification of a predator-prey system [10], model selection via integral terms [11], extracting high-dimensional dynamics from limited data [12], recovery of chaotic systems from highly corrupted data [13], model selection for dynamical systems via information criterion [14], model predictive control in the low-data limit [15], sparse learning of stochastic dynamical systems [16], model selection for hybrid dynamical systems [17], identification of parametric partial differential equations [18], extracting structured dynamical systems with very few samples [19], constrained Galerkin regression [20], rapid model recovery [21], and convergence of the SINDy algorithm [22]. Moreover, other methods for data-driven discovery of differential equations have been proposed as well, for instance, deep neural networks [23, 24, 25] and Gaussian processes [26]. One of the advantages of the sparse regression methods is the ability to provide explicit formulas of the differential equations, from which further analysis of the systems may be performed, while deep neural networks usually provide "black boxes", in which the mechanism of the systems is not very clearly revealed. Another advantage of the sparse regression methods is that they do not require much prior knowledge of the differential equations, while Gaussian process methods have restrictions on the form of the differential equations and are used to estimate a few parameters.

Previous developments and applications based on sparse regression methods mostly employ either sequential threshold least squares or Lasso, or their variations. One of the reasons why data-driven discovery of differential equations has not yet been applied in industry is the instability of its methods. Previous methods require data of very high quality, which is usually not the case in industry. Although threshold sparse Bayesian regression is more accurate and robust [1], its performance is still unsatisfactory if the provided data are of high noise or contain outliers. However, it provides a model-selection criterion that allows us to conduct further research and make improvements. In this paper, we develop a subsampling-based technique for improving the threshold sparse Bayesian regression algorithm, so that the new algorithm is robust to high noise and outliers. Note that subsampling methods are usually employed for estimating statistics [27] or speeding up algorithms [4] in the literature, but the subsampling method in this paper is used for improving the accuracy of the Bayesian learning algorithm. In practice, denoising techniques can be used to reduce part of the noise and outliers in the data before our algorithm is performed.

The remainder of this paper is structured as follows. In Section 2, we give a review of the threshold sparse Bayesian regression algorithm. In Section 3, we introduce our new subsampling-based algorithm. In Section 4, we detail the mechanism of our algorithm and demonstrate how to apply it to discover models. In Section 5, we discuss the merits of discovering differential equations from data. Finally, the summary is given in Section 6.
2. Review of threshold sparse Bayesian regression
This section is a review of the threshold sparse Bayesian regression algorithm (TSBR) proposed in our previous work [1].
Let $\eta$ be a known $N \times 1$ vector, $\Phi$ be a known $N \times M$ matrix, $w = [w_1, w_2, \ldots, w_M]^T$ be the weight-vector to be estimated sparsely, and $\epsilon$ be the model error:
$$\eta = \Phi w + \epsilon. \qquad (9)$$
TSBR adopts a sparse Bayesian framework based on the relevance vector machine (RVM) [28], which is motivated by automatic relevance determination (ARD) [29, 30], to estimate the weight-vector $w$. Similar frameworks have applications in compressed sensing [31, 32, 33]. The Bayesian framework assumes that the model errors are independent and identically distributed zero-mean Gaussian with variance $\sigma^2$. The variance may be specified beforehand, but here it is fitted by the data. The model gives a multivariate Gaussian likelihood on the vector $\eta$:
$$p(\eta \mid w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{\|\eta - \Phi w\|^2}{2\sigma^2} \right\}. \qquad (10)$$
Next, a Gaussian prior is introduced over the weight-vector. The prior is governed by a set of hyper-parameters, one hyper-parameter associated with each component of the weight-vector:
$$p(w \mid \alpha) = \prod_{j=1}^{M} \mathcal{N}(w_j \mid 0, \alpha_j^{-1}), \qquad (11)$$
where $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_M]^T$. The values of the hyper-parameters are estimated from the data. See Figure 1 for the graphical structure of this model.

Figure 1: Graphical structure of the sparse Bayesian model.

The posterior over all unknown parameters given the data can be decomposed as follows:
$$p(w, \alpha, \sigma^2 \mid \eta) = p(w \mid \eta, \alpha, \sigma^2)\, p(\alpha, \sigma^2 \mid \eta). \qquad (12)$$
As analytic computations cannot be performed in full, TSBR approximates $p(\alpha, \sigma^2 \mid \eta)$ using the Dirac delta function at the maximum likelihood estimation:
$$(\hat{\alpha}_{ML}, \hat{\sigma}^2_{ML}) = \arg\max_{\alpha, \sigma^2} \left\{ p(\eta \mid \alpha, \sigma^2) \right\}
= \arg\max_{\alpha, \sigma^2} \left\{ \int p(\eta, w \mid \alpha, \sigma^2)\, dw \right\}
= \arg\max_{\alpha, \sigma^2} \left\{ \int p(\eta \mid w, \sigma^2)\, p(w \mid \alpha)\, dw \right\}$$
$$= \arg\max_{\alpha, \sigma^2} \left\{ (2\pi)^{-N/2} \left| \sigma^2 I + \Phi A^{-1} \Phi^T \right|^{-1/2} \exp\left\{ -\frac{1}{2}\, \eta^T \left( \sigma^2 I + \Phi A^{-1} \Phi^T \right)^{-1} \eta \right\} \right\}, \qquad (13)$$
with $A = \operatorname{diag}(\alpha_1, \alpha_2, \ldots, \alpha_M)$. The Dirac delta function may be used as an approximation on the basis that this point-estimate is representative of the posterior in the sense that the integral calculation for the posterior using the point-estimate is roughly equal to the one obtained by sampling from the full posterior distribution [28]. This maximization is a type-II maximum likelihood and can be calculated using a fast method [34]. Next, TSBR integrates out $\alpha$ and $\sigma^2$ to get the posterior over the weight-vector:
$$p(w \mid \eta) = \iint p(w, \alpha, \sigma^2 \mid \eta)\, d\alpha\, d\sigma^2
= \iint p(w \mid \eta, \alpha, \sigma^2)\, p(\alpha, \sigma^2 \mid \eta)\, d\alpha\, d\sigma^2
\approx \iint p(w \mid \eta, \alpha, \sigma^2)\, \delta(\hat{\alpha}_{ML}, \hat{\sigma}^2_{ML})\, d\alpha\, d\sigma^2$$
$$= p(w \mid \eta, \hat{\alpha}_{ML}, \hat{\sigma}^2_{ML})
= \frac{p(\eta \mid w, \hat{\sigma}^2_{ML})\, p(w \mid \hat{\alpha}_{ML})}{p(\eta \mid \hat{\alpha}_{ML}, \hat{\sigma}^2_{ML})} \quad \text{[Bayes' rule]}
= (2\pi)^{-M/2}\, |\hat{\Sigma}|^{-1/2} \exp\left\{ -\frac{1}{2} (w - \hat{\mu})^T \hat{\Sigma}^{-1} (w - \hat{\mu}) \right\}
= \mathcal{N}(w \mid \hat{\mu}, \hat{\Sigma}), \qquad (14)$$
in which the posterior covariance and mean are
$$\hat{\Sigma} = \left[ \hat{\sigma}^{-2} \Phi^T \Phi + \operatorname{diag}(\hat{\alpha}_{ML}) \right]^{-1}, \qquad (15)$$
$$\hat{\mu} = \hat{\sigma}^{-2} \hat{\Sigma} \Phi^T \eta. \qquad (16)$$
Therefore the posterior for each weight can be deduced from (14):
$$p(w_j \mid \eta) = \mathcal{N}(w_j \mid \hat{\mu}_j, \hat{\Sigma}_{jj}), \qquad (17)$$
with mean $\hat{\mu}_j$ and standard deviation $\hat{\Sigma}_{jj}^{1/2}$. Thus the mean posterior prediction of the weight-vector $w$ and other quantities that we want to obtain are determined by the values of $\eta$ and $\Phi$. This is the beauty of the Bayesian approach: there is no need to determine a regularization parameter via expensive cross-validation, and moreover likelihood values and confidence intervals for the solution can be easily calculated [35]. Another merit of the Bayesian approach is that it can work with limited data since it incorporates prior information into the problem to supplement limited data.

In practice we can show that the optimal values of many hyper-parameters $\alpha_j$ in (13) are infinite [28], with further analysis in [36, 37], and thus from (15)-(17) the posterior distributions of many components of the weight-vector are sharply peaked at zero. From another perspective, the Bayesian framework is related to a series of re-weighted $\ell_1$ problems [38]. This leads to the sparsity of the resulting weight-vector. To further encourage the accuracy and robustness, a threshold $\delta \geq 0$ is introduced: components of the posterior mean whose absolute values are less than the threshold are set to zero. Then the weight-vector is reestimated using the remaining terms, iteratively until convergence. The entire procedure is summarized in Algorithm 3. A discussion on how to choose the threshold and its impact on the solution is detailed in [1]. As the threshold is a parameter representing the model complexity, we may use machine learning algorithms such as cross-validation to determine it. Note that in Algorithm 3, $\hat{\mu}$ is more and more sparse after each loop by design. Thus the convergence of the algorithm is guaranteed given the convergence of the calculation of maximum likelihood in (13). TSBR follows the implementation in [34] for the calculation of (13), with the global convergence analysis in [38].

Next, TSBR defines a model-selection criterion that quantifies the quality of the posterior estimated model as follows:
$$\text{Model-selection criterion} = \sum_{j=1,\ \hat{\mu}_j \neq 0}^{M} \frac{\hat{\Sigma}_{jj}}{\hat{\mu}_j^2}, \qquad (18)$$
where each estimated variance $\hat{\Sigma}_{jj}$ is divided by the square of the corresponding estimated mean $\hat{\mu}_j$ to normalize the variance on each weight. This definition adds up all the normalized variances of each weight present in the result, and penalizes the unsureness of the estimations. In other words, a smaller model-selection criterion means smaller normalized variances and higher posterior confidence, and implies higher model quality.
If given a set of candidate models for the data, the preferred model would be the one with the minimum model-selection criterion. As a comparison, other sparse regression algorithms are listed: sequential threshold least squares (Algorithm 1) and Lasso (Algorithm 2).

Algorithm 1: Sequential threshold least squares: $\eta = \Phi w + \epsilon$
Input: $\eta$, $\Phi$, threshold
Output: $\hat{\mu}$
  Solve $\hat{\mu}$ in $(\Phi^T \Phi)\hat{\mu} = \Phi^T \eta$;
  For components of $\hat{\mu}$ with absolute value less than the threshold, set them as 0;
  while $\hat{\mu} \neq 0$ do
    Delete the columns of $\Phi$ whose corresponding weight is 0, getting $\Phi'$;
    Solve $\hat{\mu}'$ in $(\Phi'^T \Phi')\hat{\mu}' = \Phi'^T \eta$;
    Update the corresponding components of $\hat{\mu}$ using $\hat{\mu}'$;
    For components of $\hat{\mu}$ with absolute value less than the threshold, set them as 0;
    if $\hat{\mu}$ is the same as the one on the last loop then break;
  end

Algorithm 2: Lasso: $\eta = \Phi w + \epsilon$
Input: $\eta$, $\Phi$
Output: $\hat{\mu}$
  $\hat{\mu} = \arg\min_w \left\{ \frac{1}{2N}\|\eta - \Phi w\|_2^2 + \lambda \|w\|_1 \right\}$, where $\lambda$ is fitted by five-fold cross-validation with minimum mean squared error (MSE) on validation sets.
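Algorithm 2 can be run with any off-the-shelf Lasso solver; the following one-function sketch uses scikit-learn as an assumed stand-in (the paper does not tie itself to a particular implementation), whose objective matches (8) up to the scaling convention:

```python
from sklearn.linear_model import LassoCV

def lasso_cv(eta, Phi):
    """Algorithm 2: Lasso with the regularization parameter chosen by 5-fold CV.
    Note: lambda in (8) is called alpha in scikit-learn."""
    return LassoCV(cv=5, fit_intercept=False).fit(Phi, eta).coef_
```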
Algorithm 3: Threshold sparse Bayesian regression: $\eta = \Phi w + \epsilon$
Input: $\eta$, $\Phi$, threshold
Output: $\hat{\mu}$, $\hat{\Sigma}$
  Calculate the posterior distribution $p(w \mid \eta)$ in $\eta = \Phi w + \epsilon$, and let the mean be $\hat{\mu}$;
  For components of $\hat{\mu}$ with absolute value less than the threshold, set them as 0;
  while $\hat{\mu} \neq 0$ do
    Delete the columns of $\Phi$ whose corresponding weight is 0, and let the result be $\Phi'$;
    Calculate the posterior distribution $p(w' \mid \eta)$ in $\eta = \Phi' w' + \epsilon$, and let the mean be $\hat{\mu}'$;
    Update the corresponding components of $\hat{\mu}$ using $\hat{\mu}'$;
    For components of $\hat{\mu}$ with absolute value less than the threshold, set them as 0;
    if $\hat{\mu}$ is the same as the one on the last loop then break;
  end
  Set the submatrix of $\hat{\Sigma}$ corresponding to non-zero components of $\hat{\mu}$ as the last estimated posterior variance in the preceding procedure, and set the other elements of $\hat{\Sigma}$ as 0.
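A minimal Python sketch of the TSBR loop and the criterion (18) follows. scikit-learn's ARDRegression is used here only as a stand-in for the fast marginal-likelihood maximization of [34], so the numbers it produces will not match the paper's implementation exactly:

```python
import numpy as np
from sklearn.linear_model import ARDRegression

def bayes_posterior(eta, Phi):
    """Posterior mean/covariance of w for eta = Phi w + eps, Eqs. (15)-(16),
    with hyper-parameters estimated by scikit-learn's ARD (a stand-in for [34])."""
    ard = ARDRegression(fit_intercept=False).fit(Phi, eta)
    A = np.diag(ard.lambda_)          # estimated prior precisions (alpha-hat)
    noise_prec = ard.alpha_           # estimated noise precision (1 / sigma^2-hat)
    Sigma = np.linalg.inv(noise_prec * Phi.T @ Phi + A)   # Eq. (15)
    mu = noise_prec * Sigma @ Phi.T @ eta                 # Eq. (16)
    return mu, Sigma

def tsbr(eta, Phi, threshold, max_iter=20):
    """Threshold sparse Bayesian regression (Algorithm 3 in spirit): compute the
    posterior, zero out small posterior means, refit on the surviving columns."""
    M = Phi.shape[1]
    keep = np.arange(M)               # indices of currently active columns
    mu_full, Sigma_full = np.zeros(M), np.zeros((M, M))
    for _ in range(max_iter):
        mu, Sigma = bayes_posterior(eta, Phi[:, keep])
        mu[np.abs(mu) < threshold] = 0.0
        mu_full[:] = 0.0
        mu_full[keep] = mu
        Sigma_full[:] = 0.0
        Sigma_full[np.ix_(keep, keep)] = Sigma
        new_keep = keep[mu != 0]
        if len(new_keep) == len(keep) or len(new_keep) == 0:
            break                     # support unchanged (or everything pruned)
        keep = new_keep
    sel = mu_full != 0                # keep covariance only for selected terms
    Sigma_out = np.zeros((M, M))
    Sigma_out[np.ix_(sel, sel)] = Sigma_full[np.ix_(sel, sel)]
    return mu_full, Sigma_out

def model_selection_criterion(mu, Sigma):
    """Eq. (18): sum of normalized posterior variances over the selected terms."""
    nz = mu != 0
    return np.sum(np.diag(Sigma)[nz] / mu[nz] ** 2)
```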
3. Subsampling-based threshold sparse Bayesian regression
This section introduces the novel subsampling-based algorithm, which builds on the threshold sparse Bayesian regression algorithm reviewed in the last section.
In the regression problem (3), when we have more data than the number of unknown weights, $N > M$, we may use a subset of the data $\{t_i, x_i, x'_i\}_{i=1}^{N}$ to estimate the weights. We do this on the basis that data sets collected from real-world cases may contain outliers or a percentage of data points of high noise. Classical methods for parameter estimation, such as least squares, fit a model to all of the presented data. These methods have no internal mechanism for detecting or discarding outliers. They are averaging methods based on the assumption (the smoothing assumption) that there will always be enough good values to smooth out any gross noise [39]. In practice the data may contain more noise than the good values can ever compensate, breaking the smoothing assumption. To deal with this situation, an effective way is to discard the "bad" data points (outliers or those of high noise), and use the remaining "good" data points to run estimations. Based on this idea, we propose an algorithm called subsampling-based threshold sparse Bayesian regression (SubTSBR). Our algorithm filters out bad data points by means of random subsampling. Other robust regression methods designed to overcome the limitations of traditional methods include iteratively reweighted least squares [40, 41], random sample consensus [39], least absolute deviations [42], the Theil-Sen estimator [43, 44], repeated median regression [45, 46], and simultaneous variable selection and outlier identification [47], but they do not fit very well into the framework of data-driven discovery of differential equations. Also, there are methods for handling outliers in the Bayesian framework through the design of the prior distribution [48, 49, 50], but they do not completely cut off the outliers.

SubTSBR approaches the problem by selecting data points randomly to estimate the weights and using the model-selection criterion to evaluate the estimations. When an estimation is "good" (small model-selection criterion), our algorithm identifies the corresponding selected data points as good data points and the estimation as the final result. To be specific, our algorithm is given a user-preset subsampling size $S$ ($< N$) and a number of subsamples $L$ ($\geq 1$). For each subsample, a subset of $S$ data points is randomly selected,
$$\{t_{k_i}, x_{k_i}, x'_{k_i}\}_{i=1}^{S} \subset \{t_i, x_i, x'_i\}_{i=1}^{N},$$
and used to estimate the weights $w_1, w_2, \ldots, w_M$ in the following regression problem:
$$\begin{bmatrix} x'_{k_1} \\ x'_{k_2} \\ \vdots \\ x'_{k_S} \end{bmatrix} =
\begin{bmatrix} f_1(t_{k_1}, x_{k_1}) & f_2(t_{k_1}, x_{k_1}) & \cdots & f_M(t_{k_1}, x_{k_1}) \\
f_1(t_{k_2}, x_{k_2}) & f_2(t_{k_2}, x_{k_2}) & \cdots & f_M(t_{k_2}, x_{k_2}) \\
\vdots & \vdots & \ddots & \vdots \\
f_1(t_{k_S}, x_{k_S}) & f_2(t_{k_S}, x_{k_S}) & \cdots & f_M(t_{k_S}, x_{k_S}) \end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_M \end{bmatrix} + \epsilon, \qquad (19)$$
where $f_1(t, x), f_2(t, x), \ldots, f_M(t, x)$ are the basis-functions and $\epsilon$ is the model error. This regression problem can be symbolized into the form
$$\eta = \Phi w + \epsilon, \qquad (20)$$
which is (9). By running TSBR (Algorithm 3), we obtain a differential equation
$$\frac{dx}{dt} = \hat{\mu}_1 f_1(t, x) + \hat{\mu}_2 f_2(t, x) + \cdots + \hat{\mu}_M f_M(t, x), \qquad (21)$$
along with the model-selection criterion calculated by (18). After repeating this procedure $L$ times with different randomly selected data points, the differential equation with the smallest model-selection criterion among all the subsamples is chosen as the final result of the whole subsampling algorithm. Our algorithm has two user-preset parameters: the subsampling size and the number of subsamples.
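A minimal sketch of the resulting subsampling loop, reusing the `tsbr` and `model_selection_criterion` helpers sketched at the end of Section 2 (again an illustration, not the authors' MATLAB code):

```python
import numpy as np

def subtsbr(eta, Phi, threshold, S, L, seed=0):
    """Run TSBR on L random subsets of S rows and keep the fit whose
    model-selection criterion (18) is smallest (Algorithm 4 in spirit)."""
    rng = np.random.default_rng(seed)
    N = Phi.shape[0]
    best_score, best_mu, best_Sigma = np.inf, None, None
    for _ in range(L):
        rows = rng.choice(N, size=S, replace=False)   # one random subsample
        mu, Sigma = tsbr(eta[rows], Phi[rows], threshold)
        if not mu.any():
            continue                                  # no surviving term: skip
        score = model_selection_criterion(mu, Sigma)
        if score < best_score:
            best_score, best_mu, best_Sigma = score, mu, Sigma
    return best_mu, best_Sigma
```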
Their impact on the accuracy of the final result is discussed in Section 4 through an example. Note that the above mechanism is described in the pattern (1) for convenience and simplification. It also applies to higher-order differential equations and implicit differential equations, as long as the differential equations can be symbolized into the form (20). The SubTSBR procedure is summarized in Algorithm 4, where the for-loop can be coded in parallel.

Algorithm 4: Subsampling-based threshold sparse Bayesian regression: $\eta = \Phi w + \epsilon$
Input: $\eta$, $\Phi$, threshold, subsampling size $S$, the number of subsamples $L$
Output: $\hat{\mu}$, $\hat{\Sigma}$
  Let $I_{N \times N}$ be the $N \times N$ identity matrix, where $N$ is the number of rows in $\Phi$;
  for $r = 1$ to $L$ do
    Let $P_r$ be an $S \times N$ submatrix of $I_{N \times N}$ with randomly chosen rows;
    Use Algorithm 3 to solve the problem $P_r \eta = P_r \Phi w + \epsilon$, getting $\hat{\mu}_r$, $\hat{\Sigma}_r$;
    Calculate $[\text{model-selection criterion}]_r$ using $\hat{\mu}_r$, $\hat{\Sigma}_r$ and (18);
  end
  Let $R = \arg\min_r \{ [\text{model-selection criterion}]_r \}$;
  Let $\hat{\mu} = \hat{\mu}_R$ and $\hat{\Sigma} = \hat{\Sigma}_R$.

The numerical results in this paper show that our subsampling algorithm can improve the overall accuracy in the discovery of differential equations, and the model-selection criterion (18) is capable of evaluating the estimations. The given data $\{t_i, x_i, x'_i\}_{i=1}^{N}$ contain a part of data points of low noise and a part of data points of high noise. When a subset consisting of only data points of low noise is selected, our algorithm would estimate the weights well and indicate that this is the case by showing a small model-selection criterion (18). As we do not know which data points are of low noise and which are of high noise before the model is discovered, we select a subset from the data randomly, repeating multiple times. When it happens that the selected data points are of low noise, we would have a good estimation of the weights and at the same time recognize this case.

The numerical results also show that when the subsampling size increases, the performance of our algorithm gets better and then worse. When the number of subsamples increases, the performance of our algorithm keeps getting better. In practice, as more subsamples mean more computational time, we can increase the number of subsamples gradually and stop the algorithm when the smallest model-selection criterion among all the subsamples drops below a certain preset value or stops decreasing.

The outliers in the data can cause serious problems in the estimations and should be excluded. The subsampling procedure in Algorithm 4 is designed to be resistant to outliers. Here we calculate how many subsamples are needed to exclude the outliers from a data set for a certain confidence level. Suppose we are given $N$ data points, a portion $p$ of which are outliers. Suppose the subsampling size is $S$. We try to determine the number of subsamples $L$ such that with confidence $q$, at least one of the $L$ randomly selected subsets of the data does not contain any outlier.

The number of outliers is $pN$ and the number of "good" data points is $(1-p)N$. For a random subset of size $S$ ($S \leq (1-p)N$) not containing any outlier, the probability is
$$\frac{\binom{(1-p)N}{S}}{\binom{N}{S}} = \frac{(1-p)N}{N} \cdot \frac{(1-p)N - 1}{N - 1} \cdots \frac{(1-p)N - S + 1}{N - S + 1}. \qquad (22)$$
For a random subset of size $S$ containing at least one outlier, the probability is
$$1 - \frac{\binom{(1-p)N}{S}}{\binom{N}{S}}. \qquad (23)$$
For $L$ random subsets of size $S$ each containing at least one outlier, the probability is
$$\left( 1 - \frac{\binom{(1-p)N}{S}}{\binom{N}{S}} \right)^{L}. \qquad (24)$$
For $L$ random subsets of size $S$ with at least one subset not containing any outlier, the probability is
$$1 - \left( 1 - \frac{\binom{(1-p)N}{S}}{\binom{N}{S}} \right)^{L}. \qquad (25)$$
Set the probability greater than or equal to $q$:
$$1 - \left( 1 - \frac{\binom{(1-p)N}{S}}{\binom{N}{S}} \right)^{L} \geq q. \qquad (26)$$
Then we have
$$L \geq \frac{\log(1-q)}{\log\left( 1 - \binom{(1-p)N}{S} \Big/ \binom{N}{S} \right)}. \qquad (27)$$
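This bound is easy to evaluate numerically. A small helper (with the outlier count rounded to an integer, which is an assumption on our part) reproduces the subsample count used for the shallow water example in Section 4:

```python
from math import ceil, log

def min_subsamples(N, n_outliers, S, q):
    """Smallest L such that, with probability >= q, at least one of L random
    size-S subsets contains no outlier; this is Eq. (27)."""
    p_clean = 1.0        # probability that one subset is outlier-free, Eq. (22)
    for i in range(S):
        p_clean *= (N - n_outliers - i) / (N - i)
    return ceil(log(1.0 - q) / log(1.0 - p_clean))

# 1296 data points, 2% outliers, subsets of size 100, confidence 0.99  ->  36
print(min_subsamples(1296, round(0.02 * 1296), 100, 0.99))
```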
4. Tackling high noise and outliers using subsampling-based threshold sparse Bayesian regression
High noise and outliers in the data can hinder model-discovering algorithms from producing correct results. Through two examples, we demonstrate how to apply our algorithm to discover models. Also, we detail the mechanism of our algorithm and investigate its robustness against noise and outliers.
4.1. Predator-prey model with noise

The predator-prey model is a system of a pair of first-order nonlinear differential equations and is frequently used to describe the interaction between two species, one as a predator and the other as prey. The population change over time is as follows:
$$\frac{dx}{dt} = \alpha x - \beta x y, \qquad (28)$$
$$\frac{dy}{dt} = \delta x y - \gamma y, \qquad (29)$$
where $x$ is the number of the prey, $y$ is the number of the predator, and $\alpha, \beta, \delta, \gamma$ are positive real parameters describing the interaction of the two species. In this example, we fix the parameters as follows:
$$\frac{dx}{dt} = x - x y, \qquad (30)$$
$$\frac{dy}{dt} = x y - y. \qquad (31)$$
We assume that we do not know the formulas for the system (30)-(31), neither the terms nor the parameters, and try to discover the model using noisy data.

We first generate 200 data points from the system (30)-(31), with fixed initial values for $x$ and $y$, during time $t = 0$ to $t = 20$. Then independent and identically distributed zero-mean Gaussian white noise is added to all the data $x$ and all the data $y$. See Figure 2 for the true model and the noisy data.

Figure 2: [Predator-prey model] The true predator-prey model and the noisy data. (a) Population vs time. (b) Population (predator) vs population (prey).

Now we need to calculate the derivatives of the data to estimate the left-hand-side terms in (30)-(31). The noise in the data would be amplified greatly if the derivatives were calculated using plain numerical differentiation. See Figure 3a for the approximated derivatives using the gradient. Therefore, we use the total-variation derivative [51, 52] instead. For a real function $f(t)$ on $[0, L]$, the total-variation derivative computes the derivative of $f$ as the minimizer of the functional
$$F(u) = \lambda \int_0^L |u'| + \int_0^L |Au + f(0) - f|^2, \qquad (32)$$
where $Au(t) = \int_0^t u$ is the operator of antidifferentiation and $\lambda$ is a regularization parameter that controls the balance between the two terms. The numerical implementation is introduced in [52]. See Figure 3b for the approximated derivatives of $x(t)$ and $y(t)$ using the total-variation derivative.

Figure 3: [Predator-prey model] (a) The true derivatives of the data and the approximated derivatives using gradient. (b) The true derivatives of the data and the approximated derivatives using total-variation derivative (tvd).

As we can see in Figure 3, the robust differentiation method is critical for getting high-quality derivatives. Besides the total-variation derivative, many other methods are available for robust differentiation. Another approach is to use denoising techniques to reduce the noise in the data before taking derivatives. For example, a neural network is used to denoise data and approximate derivatives in [53]. Those methods may have better performance in practical use, depending on the situation. Here we do not use denoising techniques for the sake of demonstrating the robustness of our algorithm against noise. In practice, the denoised data may still contain noise. Our algorithm can be used following the denoising processes and may achieve good results.
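For concreteness, here is a sketch of how such a data set and its smoothed derivatives might be produced in Python. The initial values and noise level below are illustrative only (the paper's exact values appear in its figures), and a Savitzky-Golay filter stands in for the total-variation derivative [52]:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.signal import savgol_filter

def predator_prey(t, z):
    x, y = z
    return [x - x * y, x * y - y]                 # Eqs. (30)-(31)

t = np.linspace(0.0, 20.0, 200)
sol = solve_ivp(predator_prey, (0.0, 20.0), [2.0, 1.0], t_eval=t, rtol=1e-8)

rng = np.random.default_rng(0)
x = sol.y[0] + rng.normal(0.0, 0.3, t.size)       # noisy observations of x(t)
y = sol.y[1] + rng.normal(0.0, 0.3, t.size)       # noisy observations of y(t)

dt = t[1] - t[0]                                  # smoothed derivative estimates
dxdt = savgol_filter(x, window_length=21, polyorder=3, deriv=1, delta=dt)
dydt = savgol_filter(y, window_length=21, polyorder=3, deriv=1, delta=dt)
```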
Discovery using different sparse algorithms

Now we try to discover the predator-prey model (30)-(31) using sequential threshold least squares (STLS, Algorithm 1), Lasso (Algorithm 2), threshold sparse Bayesian regression (TSBR, Algorithm 3), and subsampling-based threshold sparse Bayesian regression (SubTSBR, Algorithm 4). The basis-functions are monomials generated by $\{1, x, y\}$ up to degree three (10 terms in total). STLS, TSBR, and SubTSBR have the threshold set at 0.1. In addition, SubTSBR has the subsampling size set at 60 and the number of subsamples set at 30. The regularization parameter for Lasso is fitted by five-fold cross-validation. Each of STLS, Lasso, and TSBR uses all 200 data points at one time to discover the model, while SubTSBR does subsampling.

The numerical results are listed in Table 1 and Table 2, from which we can see that SubTSBR approximates the true model significantly better than the other sparse algorithms. Although the data contain a considerable amount of noise, SubTSBR successfully finds the exact terms in the true model and accurately estimates the parameters. See Figure 4 for the final selected data points in SubTSBR. See Figure 5 for the dynamics calculated by each model. Note that since the data are collected from $t = 0$ to $t = 20$, Figure 5a shows the approximation and Figure 5b shows the prediction. SubTSBR demonstrates better performance than the other three methods, especially in prediction.

Table 1: [Predator-prey model] The true model for $dx/dt$ (30) and the models discovered by sequential threshold least squares (STLS), Lasso, threshold sparse Bayesian regression (TSBR), and subsampling-based threshold sparse Bayesian regression (SubTSBR). Every column represents the weights for the terms in the model. Blank means the model does not have the specific term. The numbers inside the parentheses following the weights in TSBR and SubTSBR represent the standard deviation for the weight.

Table 2: [Predator-prey model] The true model for $dy/dt$ (31) and the discovered models. All other interpretations are the same as Table 1.

Figure 4: [Predator-prey model] The final selected data points in SubTSBR.

Figure 5: [Predator-prey model] (a) Approximated dynamics by STLS, Lasso, TSBR, and SubTSBR from $t = 0$ to $t = 20$. (b) Predicted dynamics beyond $t = 20$.

In Table 1 and Table 2, the weights in SubTSBR are slightly smaller in absolute value than the weights in the true model because, when we calculate the numerical derivatives, the total-variation derivative (32) smooths out some variation in the derivatives. This defect does not have much impact on the result and can be addressed by using different methods to calculate the derivatives or by collecting the derivatives along with the data directly.

In the subsampling process of SubTSBR (Algorithm 4), we have two parameters to set: one is the subsampling size and the other is the number of subsamples. First, we investigate the impact of different subsampling sizes on the basis-selection success rate. In Figure 6a, each curve is drawn by fixing the number of subsamples. Then, given each subsampling size, SubTSBR is applied to the data set collected above. This is performed 1000 times for each fixed number of subsamples and subsampling size. Then the percentage of successful identification of the exact terms in the system (30)-(31) is calculated and plotted.

Figure 6a shows that for each fixed number of subsamples, the basis-selection success rate goes up and then down when the subsampling size increases. When the subsampling size equals 200, all the data points are used and SubTSBR is equivalent to TSBR (Algorithm 3). In this case the true terms cannot be identified. In addition, for each chosen number of subsamples there is an optimal subsampling size, and the optimal subsampling size increases as the number of subsamples increases.

Next, we investigate the impact of different numbers of subsamples on the basis-selection success rate. In Figure 6b, each curve is drawn by fixing the subsampling size. Then, given each number of subsamples, SubTSBR is applied to the data set to discover the model. The discovery is done 1000 times for each fixed subsampling size and number of subsamples. Then the percentage of successful identification of the exact terms in the system (30)-(31) is calculated and plotted.

Figure 6: [Predator-prey model] (a) Basis-selection success rate vs subsampling size, with different numbers of subsamples. (b) Basis-selection success rate vs the number of subsamples, with different subsampling sizes.

Figure 6b shows that for each fixed subsampling size, the basis-selection success rate keeps going up when the number of subsamples increases. In addition, the larger the subsampling size is, the more subsamples are needed for the basis-selection success rate to reach a certain level. This is because our data set is polluted by Gaussian noise and naturally contains some data points of high noise and some of low noise. As the subsampling size gets bigger, it is less likely for each of the random subsets of data to exclude the data points of high noise. When more subsamples are used, the likelihood that one of the subsamples excludes the data points of high noise increases. As long as one of the subsamples excludes the data points of high noise, this subsample may successfully select the true basis functions and have the smallest model-selection criterion. When that happens, the final result comes from this subsample and SubTSBR selects the true basis functions successfully. This explains why the curves of smaller subsampling size go up faster at the beginning.

On the other hand, when more data points are used in a subsample, the noise inside the subsample gets smoothed out more easily in the regression (20). If all the included data points in the subsample are of low noise, then the result from a larger subsampling size may be a better one. This explains why the saturated basis-selection success rate is higher for larger subsampling size within a certain range. In conclusion, there is a tradeoff for larger subsampling size: it is more difficult to include only data points of low noise, while it is easier to smooth out the noise inside the subsample.

In real-world applications, the situation is usually more complicated and the problem of setting the best subsampling size is subtle. Since the true equations are unknown, the basis-selection success rate cannot be calculated. Therefore, drawing a curve like the ones in Figure 6 to find the best subsampling size is not available. Here we define the adjusted model-selection criterion as an indicator for the quality of the approximated model and fit the subsampling size automatically.

Figure 7: [Predator-prey model] The total number of data points is 200. (a) Model-selection criterion vs subsampling size, with different numbers of subsamples. (b) Adjusted model-selection criterion vs subsampling size, with different numbers of subsamples.

The model-selection criterion defined in (18) depends on the number of data points used and the quality of the approximated model. The $\hat{\Sigma}_{jj}$ in (18) is calculated by (15). When the number of data points increases, the number of rows in the matrix $\Phi$ increases, causing a decrease of $\hat{\Sigma}_{jj}$. As a result, when the subsampling size is close to optimal, the model-selection criterion is dominated by and negatively correlated with the subsampling size. If we want to compare the quality of the results among different subsampling sizes, we need to adjust the model-selection criterion so that it is not directly related to the subsampling size and depends solely on the quality of the model. Here we give an empirical formula: the adjusted model-selection criterion is the model-selection criterion multiplied by the subsampling size raised to a fixed empirical power (33).

Now we use the example in this section to validate the formula (33). In Figure 7, for each fixed number of subsamples and subsampling size, we run SubTSBR 1000 times and discover 1000 models with their model-selection criterion or adjusted model-selection criterion. Then the median of the 1000 model-selection criteria or adjusted model-selection criteria is plotted. Comparing Figure 7a with Figure 7b, we see that the adjusted model-selection criterion prefers models with smaller subsampling size, which is closer to the optimal subsampling size (the maximum point on each curve in Figure 6a). By choosing a model with the smallest adjusted model-selection criterion, the algorithm tends to find a model with the subsampling size close to the optimal subsampling size. In this way, the subsampling size is fitted automatically by the algorithm. We do not have to set the subsampling size at the beginning; instead, we may try different subsampling sizes to discover the model, and the best result can be selected from all the results with different subsampling sizes.
Figure 8: [Predator-prey model] (a) and (b) The noisy data with higher noise than in Figure 2. (c) and (d) The basis-selection success rate with higher noise than in Figure 6.

We use a new data set with higher Gaussian white noise to discover the predator-prey model (30)-(31); the noise here is higher than in the previous data set. All other settings remain the same. Corresponding to Figure 2, the noisy data are presented in Figure 8a and Figure 8b. Corresponding to Figure 6, the basis-selection success rate is presented in Figure 8c and Figure 8d. With higher noise, the chance of successfully picking out the true terms from the basis-functions is lower. As a result, more subsamples would be needed in this case. Figure 8d demonstrates the results with the numbers of subsamples up to 80, but many more subsamples may be used in practical problems. As for computation time, it takes about 15 seconds to discover the dynamical system in this example with subsampling size 60 and the number of subsamples 1000 on one core of an Intel i7-6700HQ CPU (coded in MATLAB 2018a).

Now we use another new data set with lower Gaussian white noise to discover the predator-prey model (30)-(31); the noise here is lower than in the first data set. All other settings remain the same. The numerical result is presented in Figure 9. In this case, we have a large portion of data points of low noise, so a 100% basis-selection success rate can be reached with just 20 subsamples. Also, the noise inside the subsamples is small, so we do not need a large subsampling size to smooth out the noise inside the subsamples. The basis-selection success rate is very high even with a subsampling size of 20.

Figure 9: [Predator-prey model] (a) and (b) The noisy data with lower noise than in Figure 2. (c) and (d) The basis-selection success rate with lower noise than in Figure 6.
4.2. Shallow water equations with outliers

Consider the following 2-D conservative form of the shallow water equations:
$$\frac{\partial h}{\partial t} + \frac{\partial (hu)}{\partial x} + \frac{\partial (hv)}{\partial y} = 0, \qquad (34)$$
$$\frac{\partial (hu)}{\partial t} + \frac{\partial \left( hu^2 + \tfrac{1}{2} g h^2 \right)}{\partial x} + \frac{\partial (huv)}{\partial y} = 0, \qquad (35)$$
$$\frac{\partial (hv)}{\partial t} + \frac{\partial (huv)}{\partial x} + \frac{\partial \left( hv^2 + \tfrac{1}{2} g h^2 \right)}{\partial y} = 0, \qquad (36)$$
on $(x, y) \in [0, 39] \times [0, 39]$ and $t \in [0, \infty)$, with reflective boundary condition and a water drop initiating gravity waves, where $h$ is the total fluid column height, $(u, v)$ is the fluid's horizontal flow velocity averaged across the vertical column, and $g = 9.8\ \mathrm{m\,s^{-2}}$ is the gravitational acceleration. The first equation can be derived from mass conservation, the last two from momentum conservation. Here, we have made the assumption that the fluid density is a constant.

Figure 10: [Shallow water equations] Surface plot displays height colored by momentum. (a) A water drop falls into the pool. (b) The gravity waves are traveling and being reflected by the boundary. (c) The water surface state when the data are collected. (d) The accessible data are corrupted by outliers.

We generate the numerical solution to the shallow water equations using the Lax-Wendroff finite difference method with $\Delta x = \Delta y$ and $\Delta t = 0.02$. See Figure 10. The data are collected at $t = 36$, and the partial derivatives $\partial h/\partial x$, $\partial u/\partial x$, $\partial v/\partial x$, $\partial h/\partial y$, $\partial u/\partial y$, and $\partial v/\partial y$ are calculated by the three-point central-difference formula. The calculation of the partial derivatives $\partial h/\partial t$, $\partial u/\partial t$, and $\partial v/\partial t$ uses the points from two adjacent time frames. Assume that only the central $36 \times 36$ part of the data is made accessible and 2% of the accessible data have the values $h$, $u$, and $v$ corrupted by independent and identically distributed uniform random noise. So we have $36 \times 36 = 1296$ data points
$$\left\{ h_i,\ u_i,\ v_i,\ \left(\frac{\partial h}{\partial t}\right)_i,\ \left(\frac{\partial u}{\partial t}\right)_i,\ \left(\frac{\partial v}{\partial t}\right)_i,\ \left(\frac{\partial h}{\partial x}\right)_i,\ \left(\frac{\partial u}{\partial x}\right)_i,\ \left(\frac{\partial v}{\partial x}\right)_i,\ \left(\frac{\partial h}{\partial y}\right)_i,\ \left(\frac{\partial u}{\partial y}\right)_i,\ \left(\frac{\partial v}{\partial y}\right)_i \right\}_{i=1}^{1296}, \qquad (37)$$
2% of which are outliers.

4.2.2. Discovery of the model

We apply the dimensional analysis method introduced in [1] to construct the basis-functions of the same dimension as $\partial h/\partial t$ ($\mathrm{m\,s^{-1}}$) to discover it:
$$\left\{ h\frac{\partial u}{\partial x},\ h\frac{\partial v}{\partial x},\ h\frac{\partial u}{\partial y},\ h\frac{\partial v}{\partial y},\ u,\ u\frac{\partial h}{\partial x},\ u\frac{\partial h}{\partial y},\ v,\ v\frac{\partial h}{\partial x},\ v\frac{\partial h}{\partial y},\ \frac{\partial h}{\partial t}\frac{\partial h}{\partial x},\ \frac{\partial h}{\partial t}\frac{\partial h}{\partial y} \right\}. \qquad (38)$$
Similarly, we can construct the basis-functions of the same dimension as $\partial u/\partial t$ and $\partial v/\partial t$ ($\mathrm{m\,s^{-2}}$) to discover them:
$$\left\{ u\frac{\partial u}{\partial x},\ u\frac{\partial v}{\partial x},\ u\frac{\partial u}{\partial y},\ u\frac{\partial v}{\partial y},\ v\frac{\partial u}{\partial x},\ v\frac{\partial v}{\partial x},\ v\frac{\partial u}{\partial y},\ v\frac{\partial v}{\partial y},\ \frac{\partial h}{\partial t}\frac{\partial u}{\partial x},\ \frac{\partial h}{\partial t}\frac{\partial v}{\partial x},\ \frac{\partial h}{\partial t}\frac{\partial u}{\partial y},\ \frac{\partial h}{\partial t}\frac{\partial v}{\partial y},\ g\frac{\partial h}{\partial x},\ g\frac{\partial h}{\partial y} \right\}. \qquad (39)$$
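As an illustration of such a dimensionally homogeneous library, here is a sketch assembling (38) for $\partial h/\partial t$ from gridded fields (assumptions: $h$, $u$, $v$, and $\partial h/\partial t$ are available as 2-D arrays on a uniform grid whose first axis is $x$ and second axis is $y$):

```python
import numpy as np

def library_dhdt(h, u, v, dh_dt, dx, dy):
    """Columns of the dimensionally homogeneous library (38) for dh/dt."""
    h_x, h_y = np.gradient(h, dx, dy)
    u_x, u_y = np.gradient(u, dx, dy)
    v_x, v_y = np.gradient(v, dx, dy)
    terms = {
        "h u_x": h * u_x, "h v_x": h * v_x, "h u_y": h * u_y, "h v_y": h * v_y,
        "u": u, "u h_x": u * h_x, "u h_y": u * h_y,
        "v": v, "v h_x": v * h_x, "v h_y": v * h_y,
        "h_t h_x": dh_dt * h_x, "h_t h_y": dh_dt * h_y,
    }
    eta = dh_dt.ravel()
    Phi = np.column_stack([c.ravel() for c in terms.values()])
    return eta, Phi, list(terms.keys())
```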
Now we try to discover the shallow water equations (34)-(36) using sequential threshold least squares (STLS, Algorithm 1), Lasso (Algorithm 2), threshold sparse Bayesian regression (TSBR, Algorithm 3), and subsampling-based threshold sparse Bayesian regression (SubTSBR, Algorithm 4). STLS, TSBR, and SubTSBR have the threshold set at 0.1. In SubTSBR, since there are 1296 data points and 2% are outliers, if the subsampling size is 100, then the number of subsamples would be 36 by (27) in order to have confidence level 0.99. The regularization parameter for Lasso is fitted by five-fold cross-validation. Each of STLS, Lasso, and TSBR uses all 1296 data points at one time to discover the model, while SubTSBR does subsampling. Note that the system of equations (34)-(36) is equivalent to
$$\frac{\partial h}{\partial t} + h\frac{\partial u}{\partial x} + h\frac{\partial v}{\partial y} + u\frac{\partial h}{\partial x} + v\frac{\partial h}{\partial y} = 0, \qquad (40)$$
$$\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} + g\frac{\partial h}{\partial x} = 0, \qquad (41)$$
$$\frac{\partial v}{\partial t} + u\frac{\partial v}{\partial x} + v\frac{\partial v}{\partial y} + g\frac{\partial h}{\partial y} = 0, \qquad (42)$$
which follows from expanding the derivatives of the products in (34)-(36) and using (34) to eliminate $\partial h/\partial t$ from the two momentum equations. The numerical results are listed in Table 3, Table 4, and Table 5, from which we can see that SubTSBR has the best performance.
Table 3: [Shallow water equations] The true model for $\partial h/\partial t$ (40) and the models discovered by sequential threshold least squares (STLS), Lasso, threshold sparse Bayesian regression (TSBR), and subsampling-based threshold sparse Bayesian regression (SubTSBR). Every column represents the weights for the terms in the model. Blank means the model does not have the specific term. The numbers inside the parentheses following the weights in TSBR and SubTSBR represent the standard deviation for the weight.

Table 4: [Shallow water equations] The true model for $\partial u/\partial t$ (41) and the discovered models. All other interpretations are the same as Table 3.

Table 5: [Shallow water equations] The true model for $\partial v/\partial t$ (42) and the discovered models. All other interpretations are the same as Table 3.
5. Applications and merits of discovering differential equations from data

In this section, we demonstrate some merits of discovering differential equations from data:

(a) Integration of the data from different experiments into meaningful and valuable information. In many cases, collecting enough data from a single experiment is difficult due to limited resources. For instance, the experiments may need to be done at multiple different times and/or different locations. When the data are from multiple experiments, although they are generated by the same model, the initial condition and/or boundary condition used to generate them may be different or even unmeasurable. This is a challenge for traditional modelling methods. A significant advantage of the method of discovering governing differential equations is that the data are allowed to be from different experiments, as long as the model governing them is the same. On top of that, if the initial condition and boundary condition can be formulated into algebraic equations and we are given data at the initial state and boundary, we may symbolize the initial condition and boundary condition into the form (7) and discover them using the subsampling-based threshold sparse Bayesian regression algorithm. The only difference in this case is that the algebraic equations do not have any derivative term. Finally, with the discovered differential equation, initial condition, and boundary condition, we may reconstruct the solutions to the model and make predictions.

(b) Prediction of the generalized dynamics in broader areas when the data are collected within a restricted domain. The solution to the dynamics may have different behavior in different areas. If the data are collected within a restricted domain, traditional linear regression may not be able to capture the behavior or make predictions outside that domain. This becomes even more challenging when the dynamics have bifurcations. The method of discovering differential equations is able to predict the generalized dynamics, because the same differential equations that govern the dynamics inside the restricted domain govern the outside as well. Experiments in labs sometimes have to be done within a certain range of conditions, such as initial conditions, but real-world applications are on different scales. Data analysis by discovering differential equations can be helpful since it is able to generalize to a broader range of conditions.

(c) Extrapolation. When the model is correctly discovered, it performs well in extrapolation, as we might expect.

These merits are shared by methods that discover differential equations, including sequential threshold least squares, Lasso, threshold sparse Bayesian regression, and subsampling-based threshold sparse Bayesian regression. We demonstrate the merits in the two following examples by comparing the results from discovering differential equations with the results from traditional linear regression without discovering differential equations.

5.1. Heat diffusion with random initial and boundary condition

Consider the following 1-D heat diffusion equation:
$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} \qquad (43)$$
on $x \in [0, 5]$ and $t \in [0, \infty)$, with a random initial condition (44), a parabola $-x(x-5)$ scaled by one of the random variables, and random boundary conditions $u(0, t)$ and $u(5, t)$ (45)-(46), given by fixed combinations of $\sin t$, $\sin(2t)$, and $\cos t$ whose coefficients involve the random variables $\xi_1, \xi_2, \xi_3$. Here $\xi_1, \xi_2, \xi_3$ are independent random variables: $\xi_1$ and $\xi_2$ are uniformly distributed on intervals starting at 0, and $\xi_3$ is normally distributed with mean 0.
With $\xi_1 = \xi_2 = \xi_3 = 0.5$, the solution to the heat diffusion equation is plotted in Figure 12a.

We generate 20 solutions by (43)-(46) with 20 sets of independent random variables $\{\xi_1, \xi_2, \xi_3\}$. See Figure 11a. Then we collect the data at the grid points in the domain $x \in [0, 5]$ and $t \in [0, 5]$ illustrated in Figure 11b. The derivatives are approximated by central-difference formulas, and they are only calculated at the interior grid points (marked as [○] in Figure 11b). Next, we discover the PDE using our algorithm SubTSBR with subsampling size 245, which is one fourth of all interior grid points, and 30 subsamples. The basis-functions are monomials generated by $\{1, x, t, u, \partial u/\partial x, \partial^2 u/\partial x^2\}$ up to degree 3. There are 56 terms. The result contains exactly the diffusion term,
$$\frac{\partial u}{\partial t} = \hat{c}\,\frac{\partial^2 u}{\partial x^2}, \qquad (47)$$
where $\hat{c}$ denotes the estimated coefficient.

After that, we discover the boundary condition using SubTSBR with subsampling size 55, which is one fourth of all lower boundary points or upper boundary points, and 30 subsamples. The basis-functions are monomials generated by $\{1, \xi_1, \xi_2, t, \sin t, \cos t\}$ up to degree 3. There are 56 terms. The results (48) and (49) express $u(0, t)$ and $u(5, t)$ as short combinations of the trigonometric basis terms with coefficients involving the $\xi$'s.

Figure 11: [Heat diffusion] (a) The data are generated by (43)-(46) with 20 sets of independent random variables $\{\xi_1, \xi_2, \xi_3\}$. (b) Grid points: [•] all data points, [○] data points used to discover the PDE, [☆] data points used to discover the boundary condition, [□] data points used to discover the initial condition.

Finally, we discover the initial condition using SubTSBR. The basis-functions are monomials generated by $\{1, \xi, x, \sin x, \cos x\}$ up to degree 3. There are 35 terms. The result (50) consists of the terms $\xi x$ and $\xi x^2$, matching the structure of the true initial condition.
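Once a PDE, initial condition, and boundary condition of this kind have been discovered, the solution can be reconstructed numerically. Below is a sketch (an assumed helper, not from the paper) of an explicit finite-difference solver for a discovered model of the form $\partial u/\partial t = c\,\partial^2 u/\partial x^2$; the conditions passed in the example call are ad hoc, and in practice the discovered (47)-(50) would be plugged in:

```python
import numpy as np

def solve_heat(c, u0, left_bc, right_bc, x_max=5.0, t_max=5.0, nx=101, nt=20001):
    """Explicit finite-difference solution of u_t = c * u_xx on [0, x_max]
    with Dirichlet boundary values left_bc(t), right_bc(t)."""
    x = np.linspace(0.0, x_max, nx)
    dx = x[1] - x[0]
    dt = t_max / (nt - 1)
    assert c * dt / dx**2 <= 0.5, "explicit scheme stability condition"
    u = u0(x).astype(float)
    for n in range(1, nt):
        t = n * dt
        u[1:-1] += c * dt / dx**2 * (u[2:] - 2.0 * u[1:-1] + u[:-2])
        u[0], u[-1] = left_bc(t), right_bc(t)
    return x, u

# Example call with ad-hoc initial/boundary conditions:
x, u_end = solve_heat(1.0, lambda x: -0.5 * x * (x - 5.0),
                      lambda t: 0.5 * np.sin(2 * t), lambda t: 0.5 * np.cos(t))
```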
Now we predict the solution when $\xi_1 = \xi_2 = \xi_3 = 0.5$. Fix the $\xi_1, \xi_2, \xi_3$ values in (48)-(50) and solve the PDE (47) on $x \in [0, 5]$ with the discovered initial and boundary conditions. This yields the solution to the heat diffusion equation predicted by discovering the PDE. See Figure 12b. The true model is solved by (43) with $\xi_1 = \xi_2 = \xi_3 = 0.5$. As a comparison, the solution predicted by least-squares regression is shown in Figure 12c, where $u(x, t, \xi_1, \xi_2, \xi_3)$ is fitted by a linear combination of monomials generated by $\{1, x, \sin x, \cos x, t, \sin t, \cos t, \xi_1, \xi_2, \xi_3\}$ up to degree 3. There are 220 terms. Here we use the full data set for discovering the PDE to do the regression. The solution is drawn with fixed $\xi_1 = \xi_2 = \xi_3 = 0.5$, i.e., $u(x, t, 0.5, 0.5, 0.5)$.

The data are collected on $t \in [0, 5]$, while the predictions extend to $t > 5$. Least-squares regression is an interpolation method; it does not work well outside the region with known data. In contrast, discovering the PDE is an extrapolation method, which works beyond the original observation range. See Figure 13 for the solution at different times $t$ and Table 6 for the mean squared error (MSE) at different times $t$.

Figure 12: [Heat diffusion] (a) The true model solved by (43) with initial and boundary condition (44)-(46). (b) The solution predicted by discovering the PDE, solved by (47) with initial and boundary condition (48)-(50). (c) The solution predicted by least-squares regression. All with $\xi_1 = \xi_2 = \xi_3 = 0.5$.

Table 6: [Heat diffusion] Mean squared error (MSE) of the predicted solutions in Figure 12 by discovering PDE and by least-squares regression at different times $t$.

Figure 13: [Heat diffusion] The true model, the solution predicted by discovering PDE, and the solution predicted by least-squares regression, at different times $t$. All settings are the same as Figure 12. In (d)(e)(f), the predictions by least-squares regression are outside of the axes limits.

Note that we did not use any data generated from $\xi_1 = \xi_2 = \xi_3 = 0.5$ to make the prediction at this $\{\xi_1, \xi_2, \xi_3\}$. Predictions at other $\xi_1, \xi_2, \xi_3$ values can be derived in the same way by fixing the $\xi_1, \xi_2, \xi_3$ values in (48)-(50) and solving (47). The prediction at a given $\{\xi_1, \xi_2, \xi_3\}$ does not need any data generated from the same $\{\xi_1, \xi_2, \xi_3\}$.

Also note that we do not need any information about $\xi_1, \xi_2, \xi_3$ to discover the PDE. If the values of $\xi_1, \xi_2, \xi_3$ are unknown, we can still discover the PDE but are unable to discover the initial condition or boundary condition in this example. If given a new initial condition and boundary condition, we can still make predictions using the discovered PDE, while we are not able to do so using regression methods.

5.2. Fish-harvesting problem with bifurcations

Consider the following fish-harvesting problem:
$$\frac{dN}{dt} = N(4 - N) - H, \qquad (51)$$
where $N(t)$ is the population of the fish at time $t$ and $H \geq 0$ is the harvesting rate. When $H = 3$, we have
$$\frac{dN}{dt} = N(4 - N) - 3. \qquad (52)$$
Setting $dN/dt = 0$, we have
$$N(4 - N) - 3 = 0, \qquad (53)$$
whose solutions are
$$N = 1 \quad \text{and} \quad N = 3. \qquad (54)$$
When the fish population $N$ is between 1 and 3, the population grows; otherwise, it goes down. See Figure 14b.

We generate data on $t \in [0, 2]$ using five random initial values $N_0 \sim U(1, 3)$ and collect them at the nodes illustrated in Figure 14a. The derivatives are calculated by a central-difference formula; note that the derivatives are not calculated at the first two and last two data points in each curve. There are 80 data points with derivatives. Next, we discover the ODE using our algorithm SubTSBR with subsampling size 40, a half of all the data points with derivatives, and 30 subsamples. The basis-functions are monomials generated by $\{1, t, N\}$ up to degree 10. There are 66 terms. The result is a quadratic polynomial in $N$ with the same sign pattern as the true model (52),
$$\frac{dN}{dt} = -\hat{c}_0 + \hat{c}_1 N - \hat{c}_2 N^2, \qquad (55)$$
where $\hat{c}_0$, $\hat{c}_1$, $\hat{c}_2$ denote the (positive) estimated coefficients.

5.2.2. Prediction

Now we predict the solutions using the discovered ODE (55) and 40 evenly spaced initial values $N_0$. See Figure 14c. The solutions calculated by the true model with the same initial values are shown in Figure 14b. As a comparison, the solutions predicted by least-squares regression are displayed in Figure 14d, where the data are all 100 nodes illustrated in Figure 14a, and $u(t, N_0)$ is fitted by a linear combination of monomials generated by $\{1, t, N_0\}$ up to degree 11 with the constraint $u(0, N_0) = N_0$. There are 66 coefficients to be estimated. Then the solutions are drawn by fixing $N_0$ at each value.

Figure 14: [Fish-harvesting] (a) The data are generated by five random initial values and collected at the nodes. (b) The solutions to the true model (52). (c) The solutions to the discovered model (55). (d) The solutions calculated by least-squares regression.

Although the data are generated by initial values $N_0$ between 1 and 3, the discovered model (55) can be applied to other initial values and almost perfectly predicts the behavior of the true model. By contrast, least-squares regression barely approximates the behavior of the true model in the area where the data are given and is not able to extrapolate.
6. Summary
In this paper, we have proposed a novel algorithm, subsampling-based threshold sparse Bayesian regression (SubTSBR), to tackle high noise and outliers in data-driven discovery of differential equations. The subsampling technique has two parameters: the subsampling size and the number of subsamples. When the subsampling size increases with the total sample size fixed, the accuracy of SubTSBR first goes up and then goes down; the optimal subsampling size can be fitted by the adjusted model-selection criterion. When the number of subsamples increases, the accuracy of SubTSBR keeps going up, and we have deduced the minimum number of subsamples needed to exclude outliers at a given confidence level. SubTSBR can be used as a substitute for other sparse regression methods, such as sequential threshold least squares, Lasso, and threshold sparse Bayesian regression, and more accurate results may be expected. Finally, we have discussed some applications and merits of discovering differential equations from data.

Acknowledgments
We acknowledge support from the National Science Foundation (DMS-1555072, DMS-1736364, and DMS-1821233). We thank Elizabeth A. Robbins for proofreading this manuscript.
References
[1] S. Zhang, G. Lin, Robust data-driven discovery of governing physical laws with error bars 474 (2217) 20180305. doi:10.1098/rspa.2018.0305. URL http://rspa.royalsocietypublishing.org/lookup/doi/10.1098/rspa.2018.0305
[2] M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data 324 (5923) 81-85. doi:10.1126/science.1165893.
[3] S. L. Brunton, J. L. Proctor, J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems 113 (15) 3932-3937. doi:10.1073/pnas.1517384113.
[4] S. H. Rudy, S. L. Brunton, J. L. Proctor, J. N. Kutz, Data-driven discovery of partial differential equations 3 (4) e1602614. doi:10.1126/sciadv.1602614. URL http://advances.sciencemag.org/lookup/doi/10.1126/sciadv.1602614
[5] R. Tibshirani, Regression shrinkage and selection via the Lasso 267-288.
[6] H. Schaeffer, Learning partial differential equations via data discovery and sparse optimization 473 (2197) 20160446. doi:10.1098/rspa.2016.0446. URL http://rspa.royalsocietypublishing.org/lookup/doi/10.1098/rspa.2016.0446
[7] S. H. Kang, W. Liao, Y. Liu, IDENT: Identifying differential equations with numerical time evolution, arXiv:1904.03538. URL http://arxiv.org/abs/1904.03538
[8] P. Zheng, T. Askham, S. L. Brunton, J. N. Kutz, A. Y. Aravkin, A unified framework for sparse relaxed regularized regression: SR3 7 1404-1423. doi:10.1109/ACCESS.2018.2886528. URL https://ieeexplore.ieee.org/document/8573778/
[9] N. M. Mangan, S. L. Brunton, J. L. Proctor, J. N. Kutz, Inferring biological networks by sparse identification of nonlinear dynamics 2 (1) 52-63. doi:10.1109/TMBMC.2016.2633265. URL http://ieeexplore.ieee.org/document/7809160/
[10] M. Dam, M. Brøns, J. Juul Rasmussen, V. Naulin, J. S. Hesthaven, Sparse identification of a predator-prey system from simulation data of a convection model 24 (2) 022310. doi:10.1063/1.4977057. URL http://aip.scitation.org/doi/10.1063/1.4977057
[11] H. Schaeffer, S. G. McCalla, Sparse model selection via integral terms 96 (2). doi:10.1103/PhysRevE.96.023302. URL http://link.aps.org/doi/10.1103/PhysRevE.96.023302
[12] H. Schaeffer, G. Tran, R. Ward, Extracting sparse high-dimensional dynamics from limited data. URL http://arxiv.org/abs/1707.08528
[13] G. Tran, R. Ward, Exact recovery of chaotic systems from highly corrupted data 15 (3) 1108-1129. doi:10.1137/16M1086637. URL http://epubs.siam.org/doi/10.1137/16M1086637
[14] N. M. Mangan, J. N. Kutz, S. L. Brunton, J. L. Proctor, Model selection for dynamical systems via sparse regression and information criteria 473 (2204) 20170009. doi:10.1098/rspa.2017.0009. URL http://rspa.royalsocietypublishing.org/lookup/doi/10.1098/rspa.2017.0009
[15] E. Kaiser, J. N. Kutz, S. L. Brunton, Sparse identification of nonlinear dynamics for model predictive control in the low-data limit. URL http://arxiv.org/abs/1711.05501
[16] L. Boninsegna, F. Nüske, C. Clementi, Sparse learning of stochastic dynamical equations 148 (24) 241723. doi:10.1063/1.5018409. URL http://aip.scitation.org/doi/10.1063/1.5018409
[17] N. M. Mangan, T. Askham, S. L. Brunton, J. N. Kutz, J. L. Proctor, Model selection for hybrid dynamical systems via sparse regression. URL http://arxiv.org/abs/1808.03251
[18] S. Rudy, A. Alla, S. L. Brunton, J. N. Kutz, Data-driven identification of parametric partial differential equations. URL http://arxiv.org/abs/1806.00732
[19] H. Schaeffer, G. Tran, R. Ward, L. Zhang, Extracting structured dynamical systems using sparse optimization with very few samples. URL http://arxiv.org/abs/1805.04158
[20] J.-C. Loiseau, S. L. Brunton, Constrained sparse Galerkin regression 838 42-67. doi:10.1017/jfm.2017.823.
[21] M. Quade, M. Abel, J. N. Kutz, S. L. Brunton, Sparse identification of nonlinear dynamics for rapid model recovery 28 (6) 063116. doi:10.1063/1.5027470. URL http://arxiv.org/abs/1803.00894
[22] L. Zhang, H. Schaeffer, On the convergence of the SINDy algorithm. URL http://arxiv.org/abs/1805.06445
[23] S. H. Rudy, J. N. Kutz, S. L. Brunton, Deep learning of dynamics and signal-noise decomposition with time-stepping constraints. URL http://arxiv.org/abs/1808.02578
[24] M. Raissi, P. Perdikaris, G. E. Karniadakis, Physics informed deep learning (part I): Data-driven solutions of nonlinear partial differential equations. URL http://arxiv.org/abs/1711.10561
[25] M. Raissi, P. Perdikaris, G. E. Karniadakis, Physics informed deep learning (part II): Data-driven discovery of nonlinear partial differential equations. URL http://arxiv.org/abs/1711.10566
[26] M. Raissi, G. E. Karniadakis, Machine learning of linear differential equations using Gaussian processes 348 683-693. doi:10.1016/j.jcp.2017.07.050. URL http://arxiv.org/abs/1701.02440
[27] B. Efron, C. Stein, The jackknife estimate of variance 9 (3) 586-596. doi:10.1214/aos/1176345462. URL http://projecteuclid.org/euclid.aos/1176345462
[28] M. E. Tipping, Sparse Bayesian learning and the relevance vector machine 1 211-244.
[29] D. J. C. MacKay, Bayesian methods for backpropagation networks, in: E. Domany, J. L. van Hemmen, K. Schulten (Eds.), Models of Neural Networks III, Springer New York, pp. 211-254. doi:10.1007/978-1-4612-0723-8_6. URL http://link.springer.com/10.1007/978-1-4612-0723-8_6
[30] R. M. Neal, Bayesian Learning for Neural Networks, Vol. 118 of Lecture Notes in Statistics, Springer New York. doi:10.1007/978-1-4612-0745-0. URL http://link.springer.com/10.1007/978-1-4612-0745-0
[31] S. Ji, Y. Xue, L. Carin, Bayesian compressive sensing 56 (6) 2346-2356. doi:10.1109/TSP.2007.914345. URL http://ieeexplore.ieee.org/document/4524050/
[32] S. Ji, D. Dunson, L. Carin, Multitask compressive sensing 57 (1) 92-106. doi:10.1109/TSP.2008.2005866. URL http://ieeexplore.ieee.org/document/4627439/
[33] S. Babacan, R. Molina, A. Katsaggelos, Bayesian compressive sensing using Laplace priors 19 (1) 53-63. doi:10.1109/TIP.2009.2032894. URL http://ieeexplore.ieee.org/document/5256324/
[34] M. E. Tipping, A. C. Faul, Fast marginal likelihood maximisation for sparse Bayesian models, in: AISTATS.
[35] A. Schmolck, R. Everson, Smooth relevance vector machine: a smoothness prior extension of the RVM 68 (2) 107-135. doi:10.1007/s10994-007-5012-z. URL http://link.springer.com/10.1007/s10994-007-5012-z
[36] A. C. Faul, M. E. Tipping, Analysis of sparse Bayesian learning, in: T. G. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14, MIT Press, pp. 383-389. URL http://papers.nips.cc/paper/2121-analysis-of-sparse-bayesian-learning.pdf
[37] J. Palmer, B. D. Rao, D. P. Wipf, Perspectives on sparse Bayesian learning, in: S. Thrun, L. K. Saul, B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16, MIT Press, pp. 249-256. URL http://papers.nips.cc/paper/2393-perspectives-on-sparse-bayesian-learning.pdf
[38] D. P. Wipf, S. S. Nagarajan, A new view of automatic relevance determination, in: J. C. Platt, D. Koller, Y. Singer, S. T. Roweis (Eds.), Advances in Neural Information Processing Systems 20, Curran Associates, Inc., pp. 1625-1632. URL http://papers.nips.cc/paper/3372-a-new-view-of-automatic-relevance-determination.pdf
[39] M. A. Fischler, R. C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography 24 (6) 381-395. doi:10.1145/358669.358692. URL http://portal.acm.org/citation.cfm?doid=358669.358692
[40] P. W. Holland, R. E. Welsch, Robust regression using iteratively reweighted least-squares 6 (9) 813-827. doi:10.1080/03610927708827533.
[41] R. Chartrand, W. Yin, Iteratively reweighted algorithms for compressive sensing, in: 2008 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, pp. 3869-3872. doi:10.1109/ICASSP.2008.4518498. URL http://ieeexplore.ieee.org/document/4518498/
[42] O. J. Karst, Linear curve fitting using least deviations 53 (281) 118. doi:10.2307/2282572.
[43] H. Theil, A rank-invariant method of linear and polynomial regression analysis, in: A. J. H. Hallet, J. Marquez, B. Raj, J. Koerts (Eds.), Henri Theil's Contributions to Economics and Econometrics, Vol. 23, Springer Netherlands, pp. 345-381. doi:10.1007/978-94-011-2546-8_20.
[44] P. K. Sen, Estimates of the regression coefficient based on Kendall's tau 63 (324) 1379. doi:10.2307/2285891.
[45] A. F. Siegel, Robust regression using repeated medians 69 (1) 242-244. doi:10.1093/biomet/69.1.242. URL https://academic.oup.com/biomet/article-lookup/doi/10.1093/biomet/69.1.242
[46] P. J. Rousseeuw, Least median of squares regression 79 (388) 871-880. doi:10.1080/01621459.1984.10477105.
[47] J. Hoeting, A. E. Raftery, D. Madigan, A method for simultaneous variable selection and outlier identification in linear regression 22 (3) 251-270. doi:10.1016/0167-9473(95)00053-4. URL https://linkinghub.elsevier.com/retrieve/pii/0167947395000534
[48] P. J. Farrell, B. Macgibbon, T. J. Tomberlin, Protection against outliers in empirical Bayes estimation 22 (3) 365-376. doi:10.2307/3315598. URL http://doi.wiley.com/10.2307/3315598
[49] G. E. P. Box, G. C. Tiao, A Bayesian approach to some outlier problems 55 (1) 119-129. doi:10.1093/biomet/55.1.119. URL https://academic.oup.com/biomet/article-lookup/doi/10.1093/biomet/55.1.119
[50] M. West, Outlier models and prior distributions in Bayesian linear regression 46 (3) 431-439. doi:10.1111/j.2517-6161.1984.tb01317.x. URL http://doi.wiley.com/10.1111/j.2517-6161.1984.tb01317.x
[51] L. I. Rudin, S. Osher, E. Fatemi, Nonlinear total variation based noise removal algorithms 60 (1) 259-268. doi:10.1016/0167-2789(92)90242-F. URL http://linkinghub.elsevier.com/retrieve/pii/016727899290242F
[52] R. Chartrand, Numerical differentiation of noisy, nonsmooth data 2011 1-11. doi:10.5402/2011/164564.
[53] J. Lagergren, J. T. Nardini, G. M. Lavigne, E. M. Rutter, K. B. Flores, Learning partial differential equations for biological transport models from noisy spatiotemporal data. URL http://arxiv.org/abs/1902.04733