Bayesian uncertainty quantification for data-driven equation learning
Simon Martina-Perez, Matthew J. Simpson and Ruth E. Baker
Mathematical Institute, University of Oxford, Oxford, United Kingdom
School of Mathematical Sciences, Queensland University of Technology, Brisbane, Australia
February 24, 2021
Abstract
Equation learning aims to infer differential equation models from data. While a number of studies have shown that differential equation models can be successfully identified when the data are sufficiently detailed and corrupted with relatively small amounts of noise [1, 2, 3, 4], the relationship between observation noise and uncertainty in the learned differential equation models remains unexplored. We demonstrate that for noisy data sets there exists great variation in both the structure of the learned differential equation models and the parameter values. We explore how to combine data sets to quantify uncertainty in the learned models and, at the same time, draw mechanistic conclusions about the target differential equations. We generate noisy data using a stochastic agent-based model and combine equation learning methods with approximate Bayesian computation (ABC) to show that the correct differential equation model can be successfully learned from data, while a quantification of uncertainty is given by a posterior distribution in parameter space. Code can be found at https://github.com/simonmape/UQ-for-pdefind.

Introduction

Many phenomena in mathematical biology arise as a result of complex interactions between individual agents at the microscale that result in emergent properties at the macroscale. Understanding the mechanistic basis for the observed macroscale behaviour, in order to gain fundamental insights into biological phenomena, is one of the key challenges in biology. Mathematical models are well placed to help provide such insights, providing a rigorous framework in which hypotheses can be generated, tested and refined. While the interactions between individual agents can be naturally described by agent-based models (ABMs) that prescribe precise rules for the interactions between agents, predicting the macroscale behaviour of ABMs can be a challenging task, since their governing equations are often intractable and stochastic simulations can be computationally expensive, often prohibitively so in the context of parameter sensitivity analysis or parameter inference [2]. This makes partial differential equation (PDE) models an indispensable tool to describe the macroscale properties of a population. The benefits of PDE models are that they are fast to solve numerically, their different terms often carry a physical interpretation, and they can be explored using a range of analytical and numerical approaches. Knowing how such a model is parametrized can therefore provide key insights into the system under consideration, and aid in making quantitative as well as qualitative predictions.

Traditional approaches to mathematical modelling use experimentally derived mechanistic hypotheses to build PDE models in which the various terms of a given model are designed to describe the hypothesised mechanisms for that scenario. Calibration of the model to data then involves finding the parameters that minimise the discrepancy between the model output and the data. The ensuing, iterative process of testing and refining the model against further experimental data allows the original hypotheses to be refined, and so new insights to be gained.

Equation learning (EQL) methods aim to infer the dynamical systems model that best describes a given time series by leveraging statistical and machine learning tools to learn the appropriate terms of a differential equation directly from the data.
In particular, the PDE-FIND algorithm [1] takes as input quantitative data, together with a large library of candidate terms for a PDE, and aims to learn which terms to include in the PDE model, as well as their coefficients. User-defined parameters can be tuned to balance the requirement of a good model fit against the desire for a simple, interpretable model.

EQL methods have rapidly gained popularity, mainly thanks to increases in computational power, and a number of other techniques to establish models from data now exist. For example, biologically informed neural networks [3], an extension of physics-informed neural networks [5], have been developed to learn different terms of a PDE model without the need to specify a library of possible terms. A major advance has come from the use of techniques such as artificial neural networks (ANNs) [6] to accurately recover models from artificially generated noisy data from PDEs. The fact that EQL could discover previously undetected mechanisms, choose between competing models, or estimate biological quantities of interest that are difficult to measure experimentally makes EQL attractive to scientists working with real-world data. However, practitioners wishing to obtain models that they can use in real-world settings require, in addition, a thorough quantification of uncertainty. This need arises because the, often significant, noise in real-world data can impact the models predicted by EQL methods, and hence the predictive capability of those models for unseen data or scenarios. For example, Nardini et al. [2] have recently shown, through the use of several case studies, that it is possible to infer differential equation models that describe noisy data generated by stochastic ABMs. However, the stochasticity in the ABM results in variability in the learned macroscale differential equation. This means that, for a particular realisation generated from a stochastic model, the learned differential equation is essentially a point estimate of the underlying PDE, and there is no quantification of uncertainty in the learned equation.

Recently, some authors [1, 7] have attempted to address this problem by analyzing the robustness of PDE-FIND with noisy or sparse data. For example, the approach in [1] amounts to choosing a PDE with ground truth parameters, simulating from this target equation, corrupting the simulated data with noise, training PDE-FIND on these data, and finally comparing the PDE-FIND parameters with the ground truth. Li et al. [7] investigate how to increase the signal-to-noise ratio prior to using EQL techniques. While both works show that model parameters can be retrieved within an impressive margin of error when data are corrupted with relatively small amounts of noise, these approaches give no statistical quantification of uncertainty in model predictions, as the methods are still designed to output a single learned equation.

Figure 1: Proposed framework for uncertainty quantification in equation learning. A noisy data ensemble is used to efficiently create an informative prior distribution, which can then be used to obtain a practical posterior distribution.

In this work we demonstrate that noise can significantly impact both the structure and the parameters of learned differential equations, rendering uncertainty quantification a crucial component of the equation learning process.
As such, our overarching aim is to develop and showcase a method for uncertainty quantification in the context of EQL, where we harness the immense computational efficiency of PDE-FIND in learning point estimates of governing equations, together with the power of computational Bayesian inference in evaluating the level of uncertainty in the learned equation. Figure 1 shows our proposed framework for uncertainty quantification. We start from the premise that it is possible to collect an ensemble of spatiotemporal data sets from a given system, and develop an approach to understand how the data can best be used to learn a governing equation while simultaneously estimating the uncertainty in that learned equation.

Our motivation is thus: on the one hand, PDE-FIND provides a computationally cheap method to obtain a point estimate of the governing equation from a single time series. However, when the data are noisy, the individual predictions are unreliable. On the other hand, the field of computational Bayesian inference provides a number of methods to estimate, for a given model and data, posterior parameter distributions, i.e. to provide estimates of model parameters and quantify the uncertainty in those estimates. In principle, a likelihood-free method such as approximate Bayesian computation (ABC) [8] could be used directly with the candidate library of the PDE-FIND method to estimate the posterior distribution of the library coefficients. However, due to the very large number of candidate terms in the PDE-FIND library, the computational cost associated with applying ABC (or other likelihood-free methods for Bayesian inference) over the entire high-dimensional parameter space is prohibitive. Instead, we propose a framework that combines the strengths of each approach: we train the PDE-FIND algorithm on individual data sets from the ensemble to obtain an informed prior distribution that makes it possible to use ABC in a computationally realistic setting. We demonstrate the potential of our approach using synthetic data generated from a widely used ABM that describes the behaviour of a motile and proliferative cell population and can be accurately coarse-grained to a mean-field PDE in certain regions of parameter space.

In Section 2, we describe the ABM and discuss in detail its relation to a governing PDE. We describe the PDE-FIND algorithm and propose how it may be used to construct a prior for Bayesian inference. In Section 3, we show that the learned PDEs are intrinsically variable in the presence of observation noise and that, regardless of the method used to deal with observation noise, there is still variability that needs to be quantified. We then demonstrate that PDE-FIND can learn unphysical solutions to the PDE and provide an explanation in terms of the objective function of the algorithm. We bring all these issues together through the use of Bayesian methods, with which we can evaluate uncertainty in a framework that optimizes the fit of the model density profile to the data.

Methods

We begin by describing the ABM and its coarse-grained mean-field PDE, and then we outline the PDE-FIND algorithm and the means by which we use it to construct a prior distribution for ABC.
ABMs allow practitioners to investigate the collective behaviour of cells and tissues based on a description of the behaviour of individual cells. Here, in order to realistically describe cell-cell interactions, we follow a volume-exclusion model presented in [9] for a population of agents that move and proliferate according to a discrete random walk model. This is a simple model that can be used to analyse a range of phenomena, including, for example, the collective migration of cells in a tissue. In this model, we assume that agents occupy sites on a square lattice at locations (i, j), where (i, j) are integer coordinates. Agent movements are constrained by exclusion: each lattice site may be populated by at most one agent at a time. Agents can only move to one of their four adjacent lattice sites. The lattice is initialized according to some initial condition, and the parameters p_m, p_p ∈ [0, 1] and ρ_x, ρ_y ∈ [−1, 1] are set.

At each time step τ, a random sequential updating procedure is carried out: agents are selected, one at a time, with replacement, and are allowed to attempt a movement and a proliferation event. When an agent is selected, a movement event is attempted with probability p_m, and two random numbers S_1, S_2 ∼ U(0, 1) are drawn independently. If S_1 ≤ p_p, then the agent proliferates and a daughter agent is placed in one of the adjacent sites, if that site is empty. If a movement is attempted, then S_2 determines the motion of the agent according to Table 1. We recall that proliferation into, or movement to, any site can only occur if that site is empty.

Move chosen         Target site   Probability     Where random number S_2 falls
vertically down     (i, j − 1)    (1 − ρ_y)/4     0 ≤ S_2 ≤ (1 − ρ_y)/4
vertically up       (i, j + 1)    (1 + ρ_y)/4     (1 − ρ_y)/4 < S_2 ≤ 1/2
horizontally left   (i − 1, j)    (1 − ρ_x)/4     1/2 < S_2 ≤ 1/2 + (1 − ρ_x)/4
horizontally right  (i + 1, j)    (1 + ρ_x)/4     1/2 + (1 − ρ_x)/4 < S_2 ≤ 1

Table 1: How an agent at site (i, j) selects a target site, as presented in [9].

From a biological perspective, the parameters control the collective behaviour of the cells. A high value of p_m reflects cells that move easily through their environment, whereas a lower value of p_m corresponds to lower motility, reflected by fewer cell movements, on average, in each time interval. The movement bias parameters ρ_x, ρ_y reflect the directional preferences of agents, which can be due to factors such as chemotaxis or other drivers of cell movement. Setting ρ_x = ρ_y = 0 corresponds to unbiased motion on the lattice. Finally, p_p determines the rate at which cells proliferate: a high value of p_p corresponds to cells that proliferate rapidly, whereas a lower value signifies lower mean proliferation rates. In this paper, we consider a uniformly spaced lattice with lattice spacing equal to 1 and time increments τ = 1. We also take p_m = 1. As noted in [9], all simulations can easily be generalized to arbitrary lattice spacings and arbitrary time increments, but we keep things simple here by choosing uniform lattice spacing and constant time steps.

As in [9], we simulate the ABM on a square lattice with integer coordinates (i, j) such that 1 ≤ i ≤ 200 and 1 ≤ j ≤ 20. That is, the horizontal dimension of the lattice is 200 and the vertical dimension is 20. This corresponds to looking at a thin strip of cells that move and proliferate, as arises, for example, in experiments involving a spatial invasion process. We take a pseudo-one-dimensional initial condition by populating all lattice sites with 90 ≤ i ≤ 110. Let C_ij denote the occupancy of site (i, j), with C_ij = 1 if (i, j) is occupied by an agent and 0 if it is empty. We can average the occupancy of the columns of the lattice by defining the mean occupancy of column i as

C_i = (1/20) Σ_{j=1}^{20} C_ij.   (1)

Performing this averaging at all later time points yields a simulation of a time-varying one-dimensional spatial density profile.

The dimensions of the lattice and the choice of a pseudo-one-dimensional initial condition suggest considering how the density profile, obtained by averaging the occupancy of each column of the lattice, varies in time. To do this, let C_i be the averaged occupancy of the i-th column of the lattice, that is, the average number of agents per site in the column. Then {C_i} is a one-dimensional density profile of the lattice. As noted by Simpson et al. in [9], the rules of the stochastic model induce a one-dimensional conservation of mass statement between C_i(t) and C_i(t + τ), the occupancy of site i at time t and at the next time step, t + τ. Taking again p_m = 1,

C_i(t + τ) = C_i(t) + (1 + ρ_x)/4 · C_{i−1}(t)(1 − C_i(t)) + (1 − ρ_x)/4 · C_{i+1}(t)(1 − C_i(t))
             − (1 + ρ_x)/4 · C_i(t)(1 − C_{i+1}(t)) − (1 − ρ_x)/4 · C_i(t)(1 − C_{i−1}(t))
             + (p_p/2) C_{i−1}(t)(1 − C_i(t)) + (p_p/2) C_{i+1}(t)(1 − C_i(t)).

Note how the different terms in the conservation of mass statement correspond to the flux of agents into and out of the adjacent lattice sites. Note also that ρ_y does not appear in the conservation of mass statement, justifying the choice ρ_y = 0 henceforth. For the general derivation of conservation of mass statements, we refer to [9]. The general one-dimensional conservation of mass statement for the average occupancy between t and t + τ corresponds to an explicit finite difference approximation of the PDE

∂c/∂t = D ∂²c/∂x² − V ∂/∂x[c(1 − c)] + P c(1 − c),   (2)

in the limiting case where τ and the mesh size Δ satisfy τ, Δ → 0 with Δ²/τ held constant. In Equation (2), D = lim_{Δ,τ→0} p_m Δ²/(4τ), V = lim_{Δ,τ→0} p_m ρ_x Δ/(2τ) and P = lim_{τ→0} p_p/τ. For the full derivation and details, we refer to [9]. The identification of this agent-based model with the resulting PDE motivates us to investigate the performance of equation learning methods trained on data generated by the agent-based model, since the PDE describes the time-evolution of the expected value of the density profile. A sketch of a simulation of this ABM is given below.
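To make the update rules concrete, the following Python sketch simulates the exclusion process described above and records the column-averaged density profile of Equation (1). It is a minimal illustration under our own naming, not the authors' implementation; the boundary conditions (blocked in x, periodic in y) and the uniformly random choice of daughter placement site are assumptions, as the text above does not fix them.

```python
import numpy as np

def simulate_abm(width=200, height=20, p_m=1.0, p_p=0.0,
                 rho_x=0.0, rho_y=0.0, n_steps=1000, seed=0):
    """Volume-exclusion random walk of [9]; returns the column-averaged
    density profile C_i(t) as an array of shape (n_steps + 1, width)."""
    rng = np.random.default_rng(seed)
    lattice = np.zeros((width, height), dtype=bool)
    lattice[90:110, :] = True                 # pseudo-1D initial condition
    profiles = [lattice.mean(axis=1)]

    moves = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # down, up, left, right
    cum = np.cumsum([(1 - rho_y) / 4, (1 + rho_y) / 4,
                     (1 - rho_x) / 4, (1 + rho_x) / 4])

    for _ in range(n_steps):
        occupied = np.argwhere(lattice)       # agent list at start of step
        for _ in range(len(occupied)):
            # random sequential update: agents sampled with replacement;
            # stale entries (agents that already moved) are skipped
            i, j = occupied[rng.integers(len(occupied))]
            if not lattice[i, j]:
                continue
            s1, s2 = rng.random(2)
            if s1 <= p_p:                     # proliferation attempt
                di, dj = moves[rng.integers(4)]
                ti, tj = i + di, (j + dj) % height
                if 0 <= ti < width and not lattice[ti, tj]:
                    lattice[ti, tj] = True    # daughter placed if site empty
            if rng.random() <= p_m:           # movement attempt
                di, dj = moves[np.searchsorted(cum, s2)]  # Table 1 partition
                ti, tj = i + di, (j + dj) % height
                if 0 <= ti < width and not lattice[ti, tj]:
                    lattice[ti, tj], lattice[i, j] = True, False
        profiles.append(lattice.mean(axis=1))
    return np.array(profiles)

# Case I of this work: unbiased motion, no proliferation
density = simulate_abm(p_p=0.0, rho_x=0.0)
```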
As test cases for learning the governing equations from data, we explore three different parameter regimes of the model, each corresponding to a biologically relevant setting. In Case I, we consider agents moving without bias and without proliferation (ρ_x = ρ_y = p_p = 0). In Case II, we consider agents moving with bias, but without proliferation (ρ_x = 0.075 and ρ_y = p_p = 0). In Case III, we consider agents moving without bias, but with proliferation (ρ_x = ρ_y = 0 and p_p = 0.01). We use the label u for the unbiased case without proliferation, b for the biased case without proliferation, and p for the unbiased case with proliferation.

Case                                ρ_x     p_p     Coarse-grained PDE
Case I: no bias, no proliferation   0       0       c_t = 0.25 c_xx
Case II: bias, no proliferation     0.075   0       c_t = 0.25 c_xx − 0.0375 [c(1 − c)]_x
Case III: proliferation, no bias    0       0.01    c_t = 0.25 c_xx + 0.01 c(1 − c)

Table 2: Coefficients of the mean-field PDEs describing the evolution of the mean population density over time for the three example cases used in this work.

We let i = 1 denote the case where we average over one realization for each density profile and i = 2 the case where we average over 50 realizations for each density profile, and denote the resulting data sets {D_i^r}, where r ∈ {u, b, p} and i ∈ {1, 2}. For example, D_1^u refers to the data set corresponding to simulations of Case I where we average over one realization per density profile.

Figure 2 shows that the obtained data traces of the ABM are in good agreement with the coarse-grained PDE. Note, however, that in the plots at t = 50 the solutions for Cases I, II and III are very similar. Even at t = 100 the plots for Cases I and III are very similar, and it is not until later times (t = 150) that a clear difference between the solutions emerges, owing to the role of proliferation. Based on this visual inspection, we anticipate that reliably estimating the PDEs from the averaged data is difficult, and that all time scales need to be taken into account to properly distinguish the data traces.

Figure 2: Density profiles at t = 0, t = 50, t = 150, t = 500 for one of the simulated realizations of the model where no averaging is done (left column), where time series consist of averages over 50 realizations of the model (middle column), and the corresponding continuum model. It can be seen that the continuum model is a very good approximation of the average density obtained from the ABM simulations.

In the following, assume that we have time series data for an unknown function u(x, t) on a grid of n points in time and m points in space. These data are stored in a matrix U ∈ R^{n×m}. We assume that the data are a noisy discretization of a function u(x, t), the solution of an unknown PDE, and the aim is to learn the PDE that best describes the governing equation of the observed data. Henceforth, and to avoid confusion, we write U(x, t) for the observed data, u(x, t) for the learned, data-driven solution of the governing PDE, and c(x, t) for the solution to the approximate mean-field PDE defined in Equation (2). We follow Rudy et al. [1] in assuming that the PDE governing u(x, t) has the form

u_t = N(u, u_x, u_xx, …, x, μ) = Σ_{i=1}^{d} N_i(u, u_x, u_xx, …) ξ_i,   (3)

where N is a nonlinear function of u(x, t) and its partial derivatives, μ denotes model parameters, and the ξ_i are constant coefficients. Furthermore, it is assumed that N is a linear combination of a finite number of distinct library terms; we point out that Equation (2) falls within this class of PDEs. By design, N may contain terms such as polynomial nonlinearities, as is common in many equations in the natural sciences. PDE-FIND creates a large library of terms that could appear in N, with the view to selecting a small subset of relevant terms from this list. The first step of the PDE-FIND pipeline is to numerically approximate both sides of Equation (3).
This is done by taking derivatives of the data with respect to space and time. The standard PDE-FIND implementation in [1] takes finite difference approximations when the data contain little noise and uses polynomial differentiation when the data are highly noisy. The data and their derivatives are combined in a matrix Θ(U), where each column of Θ contains the values of a particular candidate function across the entire n × m grid. For example, if the candidate library consists of all polynomials up to degree two and all derivatives up to second order, Θ(U) takes the form

Θ(U) = [1, U, U², U_x, U U_x, U² U_x, U_xx, U U_xx, U² U_xx].   (4)

So, if there are N terms in the candidate library, Θ(U) is an n·m × N matrix; in this example, N = 9. The left-hand side of Equation (3) is similarly approximated using time differentiation, and we obtain a linear system representing the PDE in Equation (3),

U_t = Θ(U) ξ.   (5)

Taking the same example for Θ(U) as in Equation (4), this matrix equation reads

[ U_t(x_1, t_1) ]   [ 1  U(x_1, t_1)  U²(x_1, t_1)  U_x(x_1, t_1)  …  U²U_xx(x_1, t_1) ] [ ξ_1 ]
[ U_t(x_2, t_1) ]   [ 1  U(x_2, t_1)  U²(x_2, t_1)  U_x(x_2, t_1)  …  U²U_xx(x_2, t_1) ] [ ξ_2 ]
[      ⋮        ] = [ ⋮       ⋮             ⋮             ⋮        ⋱          ⋮       ] [  ⋮  ]   (6)
[ U_t(x_n, t_m) ]   [ 1  U(x_n, t_m)  U²(x_n, t_m)  U_x(x_n, t_m)  …  U²U_xx(x_n, t_m) ] [ ξ_9 ]

Note how this representation shows that each row of the matrix equation represents the governing dynamics behind the data at one point in time and space. Note also that the values of ξ determine the form of the PDE, and so the aim is to learn the coefficients ξ optimally. Following Rudy et al. [1], we take Θ to be overspecified, meaning that the dynamics can be represented well as linear combinations of the columns of Θ. However, many PDEs in the natural sciences contain only a few terms, and we therefore wish to learn a sparse vector ξ in the solution of Equation (3). This is done in PDE-FIND by considering the optimisation criterion

ξ̂ = argmin_ξ ‖Θ(U) ξ − U_t‖₂² + λ ‖ξ‖₂²,   (7)

for the coefficients ξ, where λ ∈ R_{>0} is a free parameter that penalises large terms. This is the method of ridge regression. We note here that the term ‖ξ‖₂² can be replaced with ‖ξ‖₁, which corresponds to performing LASSO. The choice of implementation is largely problem-dependent, and various choices for the regularization method have been compared in the literature, for example by Rudy et al. [1], although no method is definitively preferred over another. The default implementation of PDE-FIND proposed by Rudy et al. [1] supplements the ridge regression problem with a sequential thresholding procedure: a solution to (7) is computed, and a hard threshold is applied to the regression coefficients by eliminating all coefficients smaller in magnitude than some pre-specified parameter d_tol. This process is repeated on the remaining coefficients, enforcing sparsity, since the solution to the ridge regression problem in Equation (7) may contain several small but nonzero values. The combined algorithm is called Sequential Thresholded Ridge regression (STRidge). For details and motivation of the method, we refer to [1].
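As an illustration of this step, the sketch below assembles U_t and the nine-column library of Equation (4) from a data matrix U. We use simple central finite differences via numpy.gradient; the polynomial differentiation used in [1] for noisy data is not reproduced here, and the function name is our own.

```python
import numpy as np

def build_library(U, dx, dt):
    """Assemble Theta(U) = [1, u, u^2, u_x, uu_x, u^2 u_x, u_xx, uu_xx,
    u^2 u_xx] and U_t from data U of shape (n_time, m_space)."""
    U_t = np.gradient(U, dt, axis=0)          # time derivative, row-wise
    U_x = np.gradient(U, dx, axis=1)          # first spatial derivative
    U_xx = np.gradient(U_x, dx, axis=1)       # second spatial derivative
    cols = [np.ones_like(U), U, U**2,
            U_x, U * U_x, U**2 * U_x,
            U_xx, U * U_xx, U**2 * U_xx]
    # each n x m field becomes one column of the regression problem (6)
    Theta = np.stack([c.ravel() for c in cols], axis=1)
    return Theta, U_t.ravel()
```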
We summarize the PDE-FIND algorithm in Algorithm 1, as reproduced from [1].

Algorithm 1: STRidge
Input: library matrix Θ(U), time derivative of the data U_t, λ, d_tol.
Output: sparse vector ξ.
1. Compute ξ̂ = argmin_ξ ‖Θ(U) ξ − U_t‖₂² + λ ‖ξ‖₂².
2. Set B = {j : |ξ̂_j| ≥ d_tol}.
3. Set ξ̂_j = 0 for all j ∉ B.
4. Compute ξ̂[B] = STRidge(Θ[:, B], U_t, d_tol) recursively.

For the simulated ABM data in Cases I, II and III, we collect the column-averaged density at each time step. We denote the resulting data sets D_i^r for r = u, b, p and i = 1, 2, where u stands for unbiased motion (Case I), b stands for biased motion (Case II) and p stands for proliferation (Case III). As before, i = 1 corresponds to data traces with no averaging and i = 2 corresponds to data traces averaged over 50 simulations of the ABM. Per data set, we then subsample n = 500 equidistant time points from the 1000 time points at which the simulation was carried out, so that Δt = 2. Thus, each sample in the data set contains information at n = 500 time points and m = 200 space points. We use the standard implementation of PDE-FIND [1] with fourth-order polynomial differentiation to compute both the time and space derivatives.

We select a library of candidate terms that includes all polynomial terms up to order two and derivatives up to the second derivative, since interactions between agents happen only between direct neighbours and so we do not expect other terms to appear in the PDE. Table 3 shows the values of the coarse-grained PDE coefficients, according to Equation (2). These are the values of the coefficients that we would expect the PDE-FIND algorithm to return for perfect spatio-temporal data. In Table 3, and in the rest of this work, we use the notation c_i for the coefficient of term i in the learned PDE. A sketch of the STRidge step is given after Table 3.

Case      c_1   c_u    c_{u²}   c_{u_x}   c_{uu_x}   c_{u²u_x}   c_{u_xx}   c_{uu_xx}   c_{u²u_xx}
Case I    0     0      0        0         0          0           0.25       0           0
Case II   0     0      0        −0.0375   0.075      0           0.25       0           0
Case III  0     0.01   −0.01    0         0          0           0.25       0           0

Table 3: Coefficients of the mean-field PDEs describing the evolution of the mean population density over time for the three example cases used in this work. The coefficients correspond to the coarse-grained PDEs described in Table 2.
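A compact version of Algorithm 1 in NumPy might look as follows. This follows the published description of STRidge [1] rather than the reference implementation, and the default values of λ and d_tol are illustrative placeholders only.

```python
import numpy as np

def stridge(Theta, U_t, lam=1e-5, d_tol=0.025, max_iter=10):
    """Sequential Thresholded Ridge regression (Algorithm 1): solve the
    ridge problem on the active set, zero out small coefficients, repeat."""
    active = np.arange(Theta.shape[1])
    xi = np.zeros(Theta.shape[1])
    for _ in range(max_iter):
        A = Theta[:, active]
        coef = np.linalg.solve(A.T @ A + lam * np.eye(len(active)),
                               A.T @ U_t)     # ridge normal equations
        keep = np.abs(coef) >= d_tol
        if keep.all():                        # nothing thresholded: done
            break
        active, coef = active[keep], coef[keep]
        if len(active) == 0:                  # every term eliminated
            return xi
    xi[active] = coef
    return xi
```

Applied to the output of build_library above, stridge(Theta, U_t) returns a coefficient vector whose nonzero entries define the learned PDE.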
ABC is a popular likelihood-free tool to obtain a data-informed posterior distribution over the space of possible parameter values. The main idea of ABC is that the user specifies a prior distribution π reflecting prior knowledge of the model parameters, which the ABC algorithm then updates into a posterior distribution by evaluating the performance of sampled parameters in reproducing the observed data. For an overview of ABC methods, we refer to Sunnåker et al. [8]; see the SI for implementation details.

Such posterior distributions provide information about the uncertainty in parameter estimates learned from the data, and also allow practitioners to understand the range of realistic parameter values that can produce the observed data. Despite its simplicity, unless an informative prior is used to constrain the space of possible parameters, a high-dimensional parameter search using ABC is generally computationally prohibitive for the kinds of models routinely used in mathematical biology. This high computational cost means that direct application of ABC methods to estimate the coefficients ξ of Equation (3) is infeasible. Here, we propose a method that uses the predictions of PDE-FIND to construct an informed prior so that ABC can then be used to estimate the library coefficients ξ.

Consider applying the sparse linear regression of Algorithm 1 to the problem defined in Equation (3). Let β̂(λ) denote the ridge regression solution, that is, the solution obtained without applying the hard thresholding part of Algorithm 1, i.e. by setting d_tol = 0. Let β̂(λ, ST) denote the STRidge coefficients obtained when using a fixed value of the threshold parameter d_tol > 0. It is well known (see [10]) that in the presence of observation noise the empirical distribution of β̂(λ) can be well approximated by a multivariate normal distribution. The exact distribution of β̂(λ, ST) in the presence of observation noise will be different, owing to the hard thresholding imposed by the STRidge algorithm, but we expect the nonzero coefficients of β̂(λ, ST) to follow a similar distribution to those of β̂(λ). This justifies approximating the empirical distribution of the coefficients of β̂(λ, ST) as multivariate normal. Here, it is important to note that the purpose of the method is not to find the exact distribution of the PDE-FIND coefficients; rather, it is to find a model that can capture the prior uncertainty in the PDE-FIND coefficients so that it can be used in a Bayesian context.

Our approximation that the nonzero STRidge coefficients approximately follow a normal distribution suggests that we must find a mixture distribution that allows coefficients to be exactly zero or to follow a normal distribution. In other words, we wish to approximate the marginal distribution of each coefficient of the PDE-FIND solution by a mixture of a Dirac δ at 0 and a normal distribution. In the following, we propose a straightforward method for using the observed data to construct such a distribution. Assume that, for each sample in the observed data set, PDE-FIND has produced an estimate of the parameters using STRidge. For each term j in the library, define the identification ratio a_j as

a_j^λ = (1/N_s) Σ_{i=1}^{N_s} 1(β̂(λ, ST)_{ij} ≠ 0).   (8)

When the data set is sufficiently large, this ratio is informative of how often the term is included in the PDE-FIND predictions. When a_j is large, the term is identified across the samples in the data set as being relevant for the dynamics; when a_j is small, the term is identified as relevant in only some samples. Dropping the dependence on λ, since this will always be evident from context, we propose to model the empirical distribution of the j-th coefficient of the PDE-FIND solutions obtained from the data set as

β̂(λ, ST)_j ∼ a_j μ_j + (1 − a_j) δ_0,   (9)

where μ_j = N(m_j, s_j), with m_j, s_j the sample mean and variance over the nonzero values of the j-th coefficient, and δ_0 the Dirac δ distribution at 0. We note that using the identification ratio yields an efficient method for constructing a prior. This is because PDE-FIND is a sparsity-promoting method, and so for most coefficients the identification ratio a_j will be either close to 0 or close to 1. If a_j ≈ 0, the empirical marginal of the j-th coefficient is approximately δ_0, which means that PDE-FIND does not identify this term as relevant across many samples in the data set, implying that the term can be confidently eliminated from the model and not considered in the prior. If a_j ≈ 1, PDE-FIND has consistently identified the term as relevant for the dynamics across the samples in the data set, and so this term should be included in the model, but its uncertainty must still be quantified. If a_j is neither close to 0 nor close to 1, PDE-FIND includes the term in the PDE for a non-trivial fraction of the samples, but fails to include it confidently. In this scenario, Bayesian methods can be used to investigate the joint posterior distribution of such parameters with the rest of the model by considering the performance of models that respectively include and exclude the term.

To perform this term elimination procedure more rigorously, let 0 < δ ≪ 1 and define A = {j : a_j > δ}, that is, the set of terms which a practitioner deems to have a significant identification ratio. The key is to perform Bayesian inference solely for the terms in A, while excluding the remaining terms from the model on the basis of their low identification ratios. We recall that PDE-FIND promotes sparsity, so the size of A will be very small compared to that of the library, making ABC computationally much more efficient. With the information available from training the PDE-FIND coefficients on the ensemble of data sets, and in the absence of further information about the problem, the most straightforward prior that can be defined is an independent prior over the coefficients in A, giving the following choice:

π = ⊗_{j∈A} (a_j μ_j + (1 − a_j) δ_0).   (10)

We summarize the Bayes-PDE-FIND algorithm in Algorithm 2.
Algorithm 2: Bayes-PDE-FIND
Input: data set of N_s time series; PDE-FIND hyperparameters: library Θ of size N, λ, d_tol; minimum identification ratio δ > 0.
Output: posterior distribution over the coefficients of the library PDE.
1. For each sample i in the data set, compute ξ̂_i using Algorithm 1.
2. For each j ∈ {1, …, N}, compute the sample mean m_j and sample variance s_j over the nonzero values of the j-th coefficient and set μ_j = N(m_j, s_j); compute the identification ratio a_j = (1/N_s) Σ_{i=1}^{N_s} 1(ξ̂_{ij} ≠ 0).
3. Compute A = {j : a_j > δ}.
4. Define the prior distribution

π = ⊗_{j∈A} (a_j μ_j + (1 − a_j) δ_0).   (11)

5. Perform approximate Bayesian computation to obtain the posterior distribution.
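In Python, the prior-construction and rejection steps of Algorithm 2 might be sketched as follows. The names are ours; solve_pde (a numerical integrator for a candidate PDE) and distance (for example, Equation (14) below) are assumed to be supplied by the user, and the normal components follow Equation (9).

```python
import numpy as np

def build_prior(xi_samples, delta=0.05):
    """Steps 2-3 of Algorithm 2. xi_samples: array (N_s, N) of STRidge
    coefficient vectors, one per sample. Returns, for each retained term
    j in A, the triple (a_j, m_j, s_j) of Equation (9)."""
    a = (xi_samples != 0).mean(axis=0)        # identification ratios, Eq. (8)
    A = np.where(a > delta)[0]
    stats = {}
    for j in A:
        nz = xi_samples[xi_samples[:, j] != 0, j]
        stats[j] = (a[j], nz.mean(), nz.std(ddof=1))  # needs >= 2 nonzeros
    return stats

def sample_prior(stats, rng):
    """Draw one coefficient vector from the mixture prior of Equation (10)."""
    return {j: (rng.normal(m, s) if rng.random() < a else 0.0)
            for j, (a, m, s) in stats.items()}

def abc_rejection(stats, observed, solve_pde, distance, eps,
                  n_accept=1000, seed=0):
    """Step 5 of Algorithm 2 by rejection sampling: keep prior draws whose
    integrated density profile lies within eps of the observed data."""
    rng = np.random.default_rng(seed)
    posterior = []
    while len(posterior) < n_accept:
        xi = sample_prior(stats, rng)
        if distance(solve_pde(xi), observed) < eps:
            posterior.append(xi)
    return posterior
```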
Results and discussion
In this section, we showcase three different, but related, directions in the uncertainty quantification of the learned differential equations. In subsection 3.1, we showcase the intrinsic variability of the learned coefficients in the presence of observation noise. We evaluate how uncertainty changes as the noise is varied and suggest that even when using state-of-the-art denoising approaches, a need for uncertainty quantification remains. Although increasing the signal-to-noise ratio helps, regardless of the method, there is still variability that needs to be quantified. This is particularly important in biology, where observations are often very noisy and practitioners rarely have access to very large amounts of data. In subsection 3.2, we investigate the impact of varying algorithm hyperparameters in PDE-FIND, with a view to asking whether these can be optimized to drive down uncertainty. We find that while this is possible, parameter estimates are still uncertain and this uncertainty needs to be quantified. In subsection 3.3, we demonstrate that a key issue with PDE-FIND is that it aims to fit the time derivative and does not evaluate the fit of the resulting density to the data, leading to unphysical predictions. We bring all these issues together in subsection 3.4 through the use of Bayesian methods, as outlined in Algorithm 2, with which we can evaluate uncertainty in a framework that optimizes the fit of the model density profile to the data.
We first demonstrate that a naive application of PDE-FIND to noisy synthetic data yields variable and unreliable parameter estimates. For this application, we do not carry out hyperparameter tuning, but simply use widely used parameter settings to learn the coefficients. For each of the two data sets associated with each of Case I, Case II and Case III, we train the PDE-FIND algorithm using Algorithm 1 (STRidge) with fixed values of the hyperparameters λ and d_tol.

For Case I, recall that the true PDE model contains only the term u_xx. Table 5 shows that the two terms identified regularly by PDE-FIND on D_1^u are u_xx and uu_xx, with identification ratios of 0.83 and 0.20, respectively. On D_2^u, only u_xx is identified and no other terms are identified.

[Table 5: identification ratios a_j for each term in the candidate library when STRidge is trained on each of the data sets; the numerical entries of this table were not recoverable, but the full set of identification ratios is reproduced in the SI.]

To investigate the variability of the learned parameters, we display histograms of the learned coefficients in Figure 3. In addition, since the identification ratios of u_xx and uu_xx on D_1^u suggest that the terms have a nontrivial joint distribution, we display the joint distribution of their coefficients as well.

Figure 3: Histograms showing the empirical distribution of relevant PDE-FIND parameters for Case I. A: histogram for c_{u_xx} trained on D_1^u (blue) and on D_2^u (red), compared to the true parameter value (black line). B: histogram for c_{uu_xx} trained on D_1^u, compared to the true value (black line). C: joint distribution of c_{u_xx} and c_{uu_xx}, compared to the true parameter (black star).

Figure 3 reveals that the learned parameters are highly variable and that, with noisy data, the terms uu_xx and u_xx have a nontrivial joint distribution. This means that in some cases PDE-FIND will identify either of the two terms, and in others it will identify a combination of the two. Since all of the empirical data are assumed to be governed by the same PDE, this points towards non-identification of the model by PDE-FIND.

For Case II, Table 5 shows that the two terms identified by PDE-FIND on D_1^b are u_x and u_xx, with identification ratios of 0.999 and 0.012, respectively. Note that the true model should contain the term uu_x as well, but PDE-FIND fails to identify this term across the data. Similarly, for D_2^b, only u_x is identified, with an identification ratio of 1.0. This suggests investigating the variability of the coefficients of u_x and u_xx learned on D_1^b, and only the coefficient of u_x learned on D_2^b.

Figure 4: Histograms showing the empirical distribution of relevant PDE-FIND parameters for Case II. A: histogram for c_{u_xx} trained on D_1^b (blue), compared to the true parameter value (black line). B: histogram of c_{u_x} trained on D_1^b (blue), compared to the true parameter value (black line). C: histogram of c_{u_x} trained on D_2^b, compared to the true parameter value (black line).

Figure 4 again reveals a large amount of variability in the learned parameters. For instance, note that when trained on D_1^b and on D_2^b, parameters are distributed over a wide interval away from the true parameter value. Note also that the variability is lessened when there is less observation noise.

For Case III, Table 5 shows that the terms identified regularly by PDE-FIND on D_1^p are u, u² and u_xx, with identification ratios equal to 0.659, 0.482 and 0.571, respectively. Note that this means that all correct terms have been identified. On D_2^p, the parameters identified are u and u_xx, both with identification ratio equal to 1.0. In addition, since the identification ratios of u and u² on D_1^p suggest that the terms have a nontrivial joint distribution, we display the joint distribution of their coefficients as well.

Figure 5: Histograms for relevant PDE-FIND parameters trained for Case III. A: histogram for c_u trained on D_1^p (blue) and D_2^p (red), compared to the true parameter value (black line). B: histogram for c_{u²} trained on D_1^p (blue), compared to the true parameter value (black line). C: histogram for c_{u_xx} trained on D_1^p (blue) and D_2^p (red), compared to the true parameter value (black line).

Figure 5 reveals that the learned parameters are much less variable when less noise is present in the observation data. However, this does not mean that the model is identified better per se when there is little observation noise. For instance, the term u² is not learned by the model on the less noisy data.

We investigate whether choosing a coarser grid helps to reduce observation noise, with the aim of improving PDE-FIND predictions. Note that a coarser spatial discretization will result in a smoother density profile, while incurring larger errors in the approximation of the spatial derivatives, as well as leaving fewer data points. For this experiment, we subsample the x-dimension by averaging the occupancy over four columns at a time (a code sketch is given below). Mathematically, from the empirical densities C_i, at each time point, we estimate the average occupancies C̃_i for 1 ≤ i ≤ 50, where

C̃_i = (1/4) Σ_{ℓ=4(i−1)+1}^{4i} C_ℓ.   (12)

Table 5 summarizes the identification ratios found. With spatial subsampling, the share of correctly identified diffusion parameters increases significantly, although there is no marked improvement in the identification of other terms. Even if the correct coefficient is identified more often, there remains a need to deal with the remaining terms that are spuriously identified.
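The column averaging of Equation (12) is a one-line reshape in NumPy; a sketch, assuming the density profiles are stored row-wise with 200 spatial points:

```python
import numpy as np

def coarsen(profiles, block=4):
    """Average the occupancy over `block` adjacent columns (Equation (12)).
    profiles: array (n_time, m) -> array (n_time, m // block)."""
    n_t, m = profiles.shape
    return profiles[:, :m - m % block].reshape(n_t, -1, block).mean(axis=2)
```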
The PDE-FIND algorithm affords practitioners a great amount of flexibility in specifying the tuning parameters of the sparse regression. Recall that in Algorithm 1 a free parameter λ controls the penalty incurred by large coefficients in the solution of Equation (3). The choice of regularization parameter is nontrivial, as it modulates the amount of sparsity that the sparse regression problem enforces on the estimated coefficients. The issue of how to choose the optimal value of this hyperparameter was addressed recently by Nardini et al. [2], where cross-validation is discussed among other options. As a test case to investigate the effect of algorithm hyperparameters on the uncertainty of the learned coefficients, we perform cross-validation on the data set D_1^u and then train PDE-FIND using the optimal value of λ found. To do this, we apply the grid search implementation of cross-validation suggested in [2], as detailed in the SI, to arrive at an optimal value of λ = 0.5. We note here that this value is problem-dependent: whenever a new data set is investigated, a different value of λ will typically be appropriate.

Figure 6: Empirical distributions of relevant PDE-FIND coefficients found on ABM data of unbiased motion without proliferation. (a) Histogram of c_{u_xx} trained with λ = 0.01, compared to the true value 0.25 (black line). (b) Histogram of c_{u_xx} trained with λ = 0.05, compared to the true value 0.25 (black line). (c) Histogram of c_{u_xx} trained with the optimal λ = 0.5, compared to the true value 0.25 (black line). (d) Empirical joint distribution of c_{u_xx} and c_{uu_xx} trained with the optimal λ = 0.5, compared to the location of the true parameter (0.25, 0) (black star).

While the findings in the supplementary information show that cross-validation improves the performance of the method dramatically, as the number of misspecified coefficients decreases sharply when the regularization parameter is optimized, cross-validation does not provide a sufficient solution to manage the uncertainty associated with noise in the predicted coefficients. Figure 6 shows that even with the optimal value of the regularization coefficient there is still much uncertainty in the parameter, as the support of the histogram is large. While the atom at 0 has nearly vanished, uncertainty quantification is still necessary, as the empirical distribution still suggests a large degree of stochasticity. Moreover, Figure 6 shows that at the optimal value of the tuning parameter the coefficients c_{u_xx} and c_{uu_xx} still have a nontrivial joint distribution, implying that even with an optimal choice of the regularization parameter, Bayesian methods are needed to analyze the joint behaviour of these two coefficients. Lastly, one can investigate the role of different regularization techniques for learning the parameters of the model. The existing literature on PDE-FIND contains objections to the use of LASSO when the different observations are correlated. For example, Rudy et al. [1] argue for using STRidge as a sparse regression method on the basis of performance. However, other methods exist and, given information about the dynamical system to be learned, one may wish to choose a different method.
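A grid search of the kind discussed in [2] can be sketched as follows, reusing the stridge function above: split the space-time rows into training and validation sets, fit for each candidate λ, and keep the value with the smallest validation residual. The split strategy and grid are illustrative choices, not the exact cross-validation protocol detailed in the SI.

```python
import numpy as np

def cross_validate_lambda(Theta, U_t, lambdas, d_tol=0.025,
                          train_frac=0.8, seed=0):
    """Return the ridge penalty with the smallest validation error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(Theta.shape[0])
    cut = int(train_frac * len(idx))
    tr, va = idx[:cut], idx[cut:]
    errors = [np.linalg.norm(Theta[va] @ stridge(Theta[tr], U_t[tr],
                                                 lam=lam, d_tol=d_tol)
                             - U_t[va])
              for lam in lambdas]
    return lambdas[int(np.argmin(errors))]

best_lam = cross_validate_lambda(Theta, U_t, lambdas=np.logspace(-5, 1, 13))
```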
Figure 7: Comparison of predictions made by mis-specified PDEs, using coefficients learned by PDE-FIND that differ from the ground truth. A: a combination of c_{u_xx} and c_{uu_xx} = 0.26 (blue dash) against a model with c_{u²u_xx} = 1.517 (red dots). B: a combination of c_{u_x} and c_{u_xx} = 0.20 (blue dash) against a model including c_{uu_xx} (red dots). C: a model with c_{u_xx} = 0.12 (blue dash) against a model including c_u (red dots).

We propose an explanation for the poor performance of PDE-FIND on the ABM data. The PDE-FIND algorithm solves a sparse regression problem to fit linear combinations of spatial derivatives to the time derivative. When data are noiseless or near noiseless, the temporal and spatial derivatives can be accurately estimated, and so the relationship between spatial and temporal derivatives can be inferred from the observed data. In this context, comparing model predictions by their performance with respect to the L²-loss in the learned temporal derivative retrieves the ground truth. However, when data are noisy, different linear combinations of spatial derivatives can have a performance comparable to that of the ground truth. In the SI, we visualize this by selecting, for each data set, two samples whose PDE-FIND coefficients contain terms different to those of the ground truth PDE. While both the ground truth and the PDE-FIND fit reproduce the observed temporal derivative qualitatively, there is no guarantee that this yields solutions that resemble the observed data when the PDE is numerically integrated. We illustrate this in Figure 7, where for each of Cases I, II and III we select two parameter combinations estimated by PDE-FIND: one that, when integrated, resembles the observed data, and one whose prediction bears no resemblance to the observed data.

We note that the poor performance of the integrated model predictions should not be surprising. PDE-FIND does not take into account modelling constraints, for example that the model be given in standard conservation form. A possible solution to this specific problem could be to write the different terms in the candidate library in flux form. More generally, though, it is not obvious how to include all possible constraints on solutions in the composition of the candidate library, and, given that PDE-FIND is intended to discover new governing PDEs, practitioners cannot be expected to encode many such constraints into the candidate library. What this striking difference in the predictive capabilities of the learned PDEs reveals is that coefficients that optimize Equation (7) do not necessarily perform well under a criterion that takes into account the spatial differences of the solution. One such metric is the L²-loss for square-integrable functions, defined as

‖f − g‖² = ∫ (f(x) − g(x))² dx.   (13)

This contrasts with the optimization criterion in Equation (7), which we will refer to as the L²-loss between the derivatives. We show that these two loss functions are indeed very different in Figure 8. As our reference data, we choose the average over all time series from the data set for unbiased motion and, for each point in a region of parameter space that only contains c_{u_xx} and c_{uu_xx}, we approximate the two loss functions: first between derivatives, as considered by PDE-FIND in Equation (3), and subsequently the L²-loss between the integrated solution and the empirical density from Equation (13), estimated by summing the squared error at five time points,

d(X^obs, X^sim) = Σ_{i=1}^{5} ‖X^obs_{50i} − X^sim_{50i}‖²,   (14)

where ‖·‖² denotes the squared L² norm for vectors. We choose this metric so that it captures both early and late time dynamics, which we noted is important when distinguishing, for example, Cases I and III.
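The distance of Equation (14) can be computed directly from the stored density arrays; a short sketch, where rows of X_obs and X_sim are assumed to be indexed by time step with spacing dt:

```python
import numpy as np

def density_distance(X_obs, X_sim, dt=2.0):
    """Equation (14): summed squared l2 error between observed and simulated
    density profiles at t = 50, 100, 150, 200, 250."""
    rows = [int(50 * i / dt) for i in range(1, 6)]
    return sum(np.sum((X_obs[r] - X_sim[r]) ** 2) for r in rows)
```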
Here, the choice of five time points could be refined in future work, as more refined choices of the time discretization may yield better results, but we note that this is a reasonable choice to capture the dynamics throughout the time series. Figure 8 demonstrates a significant difference between the loss landscapes of the two error metrics. While both loss landscapes show a minimum around the true parameter value, the derivative loss landscape in Figure 8 is unable to distinguish different regions of parameter space. For example, errors on the horizontal and vertical axes are equal, meaning that PDE-FIND is unable to distinguish between these two cases. On the other hand, considering the fit of the integrated solutions to the reference data is much more consistent: the loss landscape shows clear regions of parameter space with increasingly higher error as the model moves further from the ground truth, with the worst performance found in the model where u_xx is not included. This is a desirable property, since it helps to identify the true model from the data, which cannot be done by looking only at the observed differences in derivatives.

Figure 8: Heatmaps showing losses with respect to different training objective functions. A: loss landscape for the L² error computed between the derivatives, as in Equation (3), for each pair of c_{uu_xx} and c_{u_xx}, overlaid with actual PDE-FIND predictions. B: loss landscape for the L² error computed between the numerically integrated PDE model prediction and the observed data, as in Equation (14), for each pair of c_{uu_xx} and c_{u_xx}, overlaid with actual PDE-FIND predictions.

We now showcase how Bayesian methods can be used to significantly improve the quality of the model predictions arising from the PDEs learned from observed data. Recall that the aim is to reduce uncertainty regarding which coefficients to include in the model, to reduce uncertainty in the estimated model parameters, and to improve the posterior predictive ability of the model by finding a posterior distribution in parameter space that takes into account properties of the actual observed density profiles. For each of the noisy test cases, that is, data sets D_1^u, D_1^b and D_1^p, we apply Algorithm 2. As a proof of concept in the implementation of Algorithm 2, we choose the simplest possible ABC implementation and perform ABC rejection sampling, as detailed in the SI. To capture agreement between model predictions and the observed data, we carry out ABC inference by numerically computing, for each sampled parameter, the solution of the predicted PDE. Recall that solutions similar to the ground truth will produce time series in good agreement with the observed densities. Writing X^sim_i, X^obs_i for the simulated and observed densities at time i, we choose the distance function defined in Equation (14). We recall that this analyzes the goodness of fit at selected time points in the trajectory. Note that we do not evaluate the entire time series, but rather consider an appropriate summary statistic, choosing a wide range of time points to capture the different behaviours of the model at different time scales. We use the Pakman package [11], a tool to parallelize ABC inference. The files to carry out the Bayesian inference are found in the electronic supplementary materials.

For Case I, the identification ratios in Table 5 suggest that the two coefficients included in a large number of simulations are c_{u_xx} and c_{uu_xx}, while the other model parameters can be confidently excluded from the model.
This suggests retaining these two parameters in the candidate model and excluding the other parameters. For Case II, we note that the terms identified consistently by PDE-FIND are u_x and u_xx. For Case III, the terms are u, u² and u_xx. Note that PDE-FIND failed to identify one relevant model term in Case II, namely uu_x. Nevertheless, we proceed with the application of the rest of Algorithm 2 to find posterior distributions in parameter space. For the ABC implementation, we set ε = 0.3 for Case I and ε = 2 for Cases II and III.

Figure 9: Comparisons of the prior (grey) with the posteriors obtained by ABC rejection sampling (red). A: Case I, ε = 0.3. B: Case II, ε = 2. C: Case III, ε = 2.

Figure 9 reveals that in all three cases ABC rejection can significantly reduce the uncertainty associated with the PDE coefficients. In Cases I and II, uncertainty regarding which parameters to include is completely removed, as the posterior is confidently supported only on the diffusion parameter axis in Case I and on a region where both c_{u_x} and c_{u_xx} are nonzero in Case II. Hence, Algorithm 2 removes uncertainty in the model choice. In Case III, the obtained posterior still estimates some diffusion parameters as zero; however, the share of nonzero c_{u_xx} in the posterior is equal to 0.75, compared to 0.57 in the prior, meaning that the posterior has more mass close to the true parameter value. To assess the quality of the resulting posteriors in generating output that matches the observed data, we run a posterior predictive check. For each parameter value in the posterior distribution, we use numerical integration to obtain a prediction for the density at t = 250. We then plot the 5% and 95% quantiles and overlay these with a data trace from the data sets for Cases I, II and III in Figure 10.

Figure 10: Posterior predictive check for the three cases. All parameter values in the posterior distribution are used to compute a prediction for the density at t = 250. We plot the 5% and 95% percentiles at each spatial location, and overlay these with a randomly chosen data trace from the data set.

The posterior predictive check in Figure 10 reveals that the model solutions resemble the real-world data, even if the wide spread of parameters leads to wide uncertainty bounds. Especially in Case III, with its large degree of model misspecification, we see that the noise in the data translates directly to low confidence around the origin, reflecting that, even when ABC is applied, some model misspecification is picked up. This problem is much less apparent in Case I, where the posterior distribution has learned the correct distribution, and in Case II, where the posterior has learned the correct structure of the model. We conclude that even in the presence of model misspecification, such as in Case II, it is possible to obtain a posterior distribution in parameter space that has good predictive power. Compare this to Figure 7, where the integrated model solutions failed to resemble the empirical data. We conclude that taking a Bayesian approach to supplement PDE-FIND yields reasonable estimates, whereas the point estimates otherwise obtained using PDE-FIND would not have matched the observed data.

Conclusions

This paper proposed a framework to perform uncertainty quantification when applying EQL methods to noisy data. By simulating a large amount of synthetic data from an ABM, we show that the PDE-FIND algorithm produces predictions that have great variability both in the structure of the learned models and in the parameter values. We conclude that when data are corrupted with large amounts of noise, the PDE-FIND algorithm can learn models that are uninformative of the correct governing equation. In contrast, when we control for the noise and present the PDE-FIND algorithm with data that are sufficiently detailed and contain little noise, the structure of the learned equations is correct, but there is still variability in the learned parameters. By using a combined approach of PDE-FIND and ABC, it is possible to obtain posterior distributions in parameter space that translate directly into uncertainty quantification in model predictions. We believe that this can enable practitioners both to confidently report the range of possible models that can explain a given data set and to obtain practical error bounds for model predictions.

After applying ABC to the PDE-FIND-generated prior distribution, not all model misspecification was removed from the observations. Researchers may find it worthwhile to explore different, more efficient implementations of the ABC algorithm used here. For example, one may consider using a sequential Monte Carlo (ABC-SMC) approach [8], which carefully propagates the prior through a series of intermediate distributions to a final, posterior distribution.
There is a wide range of studies, see for example [12], that elucidate how ABC-SMC is well suited to finding complicated posterior distributions from a wide prior distribution. We expect that better-tuned approaches to ABC can help improve the quality of the posteriors found in this work. Likewise, within the given ABC implementation, one may wish to consider a more detailed set of summary statistics, or a different choice of distance function. Nevertheless, we find that even the simplest possible ABC implementation yields an effective and practical uncertainty quantification. For practitioners wishing to obtain these error bounds without needing to consider the mathematical intricacies of more sophisticated ABC implementations, we believe that our ABC implementation provides an appropriate tool.

Another, orthogonal, direction of research concerns the application of denoising methods to the data prior to training the PDE-FIND algorithm. In this paper, we controlled for the intrinsic noise in the observations to compare the performance of PDE-FIND predictions. In applications of PDE-FIND to real-life data, practitioners will typically not be able to control the amount of observation noise in their data. Hence, methods to smooth data and reduce observation noise may be an attractive extension to PDE-FIND. When using spatial subsampling as a method to denoise data, we found that while the degree of model misspecification was decreased, the structure of the model was still highly variable, with large uncertainty in the learned parameters. We note that even state-of-the-art methods to reduce noise will not retrieve the true signal behind the data, and so uncertainty quantification remains important, since ultimately the precise relation between noise in the data and uncertainty in the learned equations is unknown. We highlight two methods to denoise data. Convolution with a Gaussian kernel is ubiquitous in the information sciences and is often used to denoise data. In addition, an implementation of PCA for EQL has been proposed by Li et al. [7]. In the supplementary information, we show that these two methods still yield variable and unreliable predictions, meaning that with the present selection of denoising methods, uncertainty quantification remains necessary for applications of PDE-FIND to realistic biological data.

This article proposed, as an explanation for the poor performance of PDE-FIND on noisy data, that the loss function used by PDE-FIND does not take into account spatial information in the predictions made by the learned model. To avoid the problem of predictions that fit the temporal derivative well in the presence of noise but fail to capture important features of the observed data, we show that Bayesian methods can help incorporate this information into predictions. PDE-FIND is not the only method for learning differential equations from data. When data are heavily corrupted by noise, it may be worthwhile to use a deep neural network approach [6, 3] instead of the PDE-FIND algorithm, to simultaneously learn the spatio-temporal dynamics while correcting for the presence of noise.
References

[1] Samuel H. Rudy, Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. Data-driven discovery of partial differential equations. Science Advances, 3(4), 2017.

[2] John T. Nardini, Ruth E. Baker, Matthew J. Simpson, and Kevin B. Flores. Learning differential equation models from stochastic agent-based model simulations, 2020.

[3] John H. Lagergren, John T. Nardini, Ruth E. Baker, Matthew J. Simpson, and Kevin B. Flores. Biologically-informed neural networks guide mechanistic modeling from sparse experimental data, 2020.

[4] Kathleen Champion, Bethany Lusch, J. Nathan Kutz, and Steven L. Brunton. Data-driven discovery of coordinates and governing equations. Proceedings of the National Academy of Sciences, 116(45):22445–22451, 2019.

[5] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.

[6] John H. Lagergren, John T. Nardini, G. Michael Lavigne, Erica M. Rutter, and Kevin B. Flores. Learning partial differential equations for biological transport models from noisy spatio-temporal data. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 476(2234):20190800, 2020.

[7] Jun Li, Gan Sun, Guoshuai Zhao, and Li-wei H. Lehman. Robust low-rank discovery of data-driven partial differential equations. Proceedings of the AAAI Conference on Artificial Intelligence, 34(01):767–774, 2020.

[8] Mikael Sunnåker, Alberto Giovanni Busetto, Elina Numminen, Jukka Corander, Matthieu Foll, and Christophe Dessimoz. Approximate Bayesian computation. PLOS Computational Biology, 9(1):1–10, 2013.

[9] M. J. Simpson, B. D. Hughes, and K. A. Landman. Diffusing populations: Ghosts or folks? Australasian Journal of Engineering Education, 15(2):59–68, 2009.

[10] Wessel N. van Wieringen. Lecture notes on ridge regression, 2020.

[11] Thomas F. Pak, Ruth E. Baker, and Joe M. Pitt-Francis. Pakman: a modular, efficient and portable tool for approximate Bayesian inference. Journal of Open Source Software, 5(47):1716, 2020.

[12] Sarah Filippi, Chris Barnes, Julien Cornebise, and Michael P. H. Stumpf. On optimality of kernels for approximate Bayesian computation using sequential Monte Carlo, 2012.

Supplementary Information: Bayesian uncertainty quantification for data-driven equation learning
Variability of PDE-FIND coefficients
In this section, we report the full data on the learned PDE-FIND coefficients trained on the different data sets. This includes the identification ratios for all coefficients in all experiments, as well as plots of the joint distributions of all the learned parameters.
Full data on identification ratios
We report the identification ratios for the problems considered in the main text, together with the additional experiments performed for this supplementary information, in the table below. Each column gives the identification ratio $a_j$ of one term in the candidate library ($1$, $u$, $u^2$, $u_x$, $u \cdot u_x$, $u^2 \cdot u_x$, $u_{xx}$, $u \cdot u_{xx}$, $u^2 \cdot u_{xx}$); each row corresponds to one experiment, namely STRidge trained on the data sets $D_u$, $D_b$ and $D_p$ at several values of $d_{tol}$.

[Table: identification ratios $a_j$ for all experiments, where $j$ is given as the term in the PDE in each column.]
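For reference, a minimal sketch of how such ratios can be computed, assuming (as the term suggests) that the identification ratio of a term is the fraction of independently fitted models in which that term's coefficient is nonzero; the variable names are hypothetical.

```python
import numpy as np

def identification_ratios(coef_matrix, term_names):
    """coef_matrix: (n_fits, n_terms) array of learned PDE-FIND coefficients,
    one row per independently fitted time series. The identification ratio
    of a term is the fraction of fits in which its coefficient is nonzero."""
    ratios = np.mean(coef_matrix != 0, axis=0)
    return dict(zip(term_names, ratios))

terms = ["1", "u", "u^2", "u_x", "u*u_x", "u^2*u_x",
         "u_xx", "u*u_xx", "u^2*u_xx"]
# ratios = identification_ratios(learned_coefs, terms)
```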
Empirical distribution of all learned PDE-FIND parameters

In this section we report the pairwise distribution of all learned PDE-FIND parameters in Case I, Case II and Case III, in both the low-noise and high-noise regimes. All terms in the library are ordered as: $1, u, u^2, u_x, uu_x, u^2u_x, u_{xx}, uu_{xx}, u^2u_{xx}$.

Figure 11: Pairwise correlation plots for coefficients trained on $D_u$, $D_b$ and $D_p$ (low- and high-noise data sets).
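Pairwise plots such as those in Figure 11 can be produced directly from the ensemble of learned coefficient vectors. A minimal matplotlib sketch, with hypothetical inputs:

```python
import matplotlib.pyplot as plt

def pairwise_plots(coefs, term_names):
    """coefs: (n_fits, n_terms) array of learned PDE-FIND coefficients.
    Diagonal panels show marginal histograms; off-diagonal panels show
    pairwise scatter plots of the learned coefficients."""
    n = coefs.shape[1]
    fig, axes = plt.subplots(n, n, figsize=(2 * n, 2 * n))
    for i in range(n):
        for j in range(n):
            if i == j:
                axes[i, j].hist(coefs[:, i], bins=30)
            else:
                axes[i, j].scatter(coefs[:, j], coefs[:, i], s=2)
            if i == n - 1:
                axes[i, j].set_xlabel(term_names[j])
            if j == 0:
                axes[i, j].set_ylabel(term_names[i])
    return fig
```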
Denoising approaches

In this section, we detail how denoising approaches were used to denoise the ABM data, and show the variability of the learned coefficients under each of these approaches.

Rescaled Robust PCA

Li et al. [7] propose an approach to perform PDE-FIND on noisy data. Mathematically, denote by $U \in \mathbb{C}^{n \times m}$ the data matrix given on a discretized domain $x \in [a, b]$ and $t \in [0, T]$ for the discretely and noisily sampled measurements from the function $u(x, t)$ that satisfies the PDE. The authors observe that the underlying data $u(x, t)$ are often low-rank, and so they suppose that $U$ and its time derivative $U_t$ can be decomposed as
$$U = Z + E_1, \qquad U_t = \mathcal{D}(Z, Q)\,\xi + E_2, \qquad (15)$$
where $Z$ and $\mathcal{D}(Z, Q)$ have low rank and $E_1$, $E_2$ are sparse. Informally, $Z$ represents the clean data, $\mathcal{D}(Z, Q)$ the clean derivative matrix, and $E_1$, $E_2$ the perturbations of the clean data and derivatives, respectively. The method developed in [7] optimally finds $Z$ and $\mathcal{D}(Z, Q)$ from the data matrix by solving the optimization problem
$$\min_{\xi, Z, E_1, E_2} \; \|Z\|_* + \|\mathcal{D}(Z, Q)\,\xi\|_* + \lambda_1 R(\xi) + \lambda_2 \|E_1\|_1 + \lambda_3 \|E_2\|_1 \qquad (16)$$
such that $U = Z + E_1$ and $U_t = \mathcal{D}(Z, Q)\,\xi + E_2$. Here, $R(\xi)$ is a sparse regularization of the parameters $\xi$, and the $\lambda_i$, $i = 1, 2, 3$, are regularization parameters. Robust PCA (rPCA) is used to find $Z$, after which Low-rank STRidge (LrSTR) is used to construct $\mathcal{D}(Z, Q)$ from $Z$, and $U_t$ is denoised prior to applying PDE-FIND. The final algorithm, which uses rPCA and LrSTR in tandem, is called Double Low-rank Sparse Regression (DLrSR). While the examples given by Li et al. in [7] prove capable of retrieving the true parameters of the PDEs with remarkable accuracy, the choice of regularization parameters is non-trivial and makes implementation challenging, since the many degrees of freedom complicate choosing a robust parameter set.

As shown in Algorithm 1 of [7], one of these parameters, which we denote simply by λ in what follows, corresponds to the penalty in rPCA. One must choose λ carefully, since different sizes of this parameter can yield unfavorable predictions when used in tandem with PDE-FIND. We start by plotting the output of empirical data processed by rPCA at several values of λ.

Figure 17: Plots of denoised data $Z$ (points) compared to original data (solid lines), plotted at t = 0 (red) and t = 1000 (black). (a) λ = 0.001. (b) λ = 0.015. (c) λ = 0.06. (d) λ = 0.… Small values of λ favor interpreting the entire signal as noise and output $Z = 0$. Larger values of λ do not penalize the noise and return a solution that is identical to the original signal. In the intermediate regime, the order of magnitude of the rPCA output is distorted.

As seen in Figure 17, choosing a small value of λ in rPCA acts in favor of interpreting the entire signal as noise and outputs $Z = 0$, effectively rendering the output useless. Larger values of λ do not penalize the noise and return a solution that is identical to the original signal, as can be seen in the rightmost plot. In the intermediate regime, the order of magnitude of the rPCA output is distorted: rPCA interprets the data matrix $U$ as sparse, and so the maximum of the rPCA output for the densities has comparable order of magnitude at t = 0 and t = 500, effectively obscuring the biological mechanism at play. However, the behavior of the rPCA output in the intermediate regime is qualitatively good, so we suggest scaling the rPCA output so that it has the appropriate order of magnitude. Since the maximum of the density is also influenced by noise, it should not be the only factor considered in the scaling; rather, the mean of the data should be included too. Heuristically, we let $Z$ be the output of rPCA and $U$ the original matrix, such that $Z_i$, $U_i$ denote the $i$-th rows of $Z$, $U$ respectively (corresponding to the $i$-th time point). Let $\bar{Z}_i$, $\bar{U}_i$ be the means of the vectors $Z_i$, $U_i$ respectively, and $Z_i^{\max}$, $U_i^{\max}$ be their maxima. Then, we re-scale $Z$ to $\tilde{Z}$ such that
$$\tilde{Z}_i = \frac{1}{2} \left( \frac{\bar{U}_i}{\bar{Z}_i} + \frac{U_i^{\max}}{Z_i^{\max}} \right) Z_i. \qquad (17)$$
A minimal sketch of this rescaling is given at the end of this subsection. Focusing on intermediate values of λ, we now plot the rescaled rPCA output compared to the original solutions in Figure 18. From these plots we decide to use the rescaled output of the DLrSR algorithm in [7] on dataset $D_u$, choosing for the rPCA the value λ = 0.…

Figure 18: Plots of rescaled denoised data $\tilde{Z}$ (points) compared to original data (solid lines), plotted at t = 0 (red) and t = 1000 (black), at values of λ in the intermediate parameter regime. (a) λ = 0.… (b) λ = 0.03. (c) λ = 0.035. (d) λ = 0.04. The order of magnitude of the maximum of the rPCA output is no longer distorted.

Table 5 reports the identification ratios $a_j$ for the coefficients found with the rPCA method. Comparing these values with the values found using STRidge trained on $D_u$ reveals that the method has a much higher variability in the terms that are included in and excluded from the resulting PDE model. Moreover, Figure 19 shows that coefficients are correlated in a complicated, non-trivial manner. The correlation plot also reveals the problem with taking point estimates. In the unbiased case, for example, the pairs of predicted coefficients for the terms $u_{xx}$ and $uu_{xx}$ lie in a region of parameter space where either only one of the two terms is discovered, or both terms are picked up with values that are not small. This implies that any point estimate derived from these data is unreliable, and careful consideration needs to be given to the correlations between terms. Figure 19b also shows that the support of the PDE-FIND predictions for the diffusion coefficient obtained using DLrSR on $D_u$ is similar in size to that obtained with naive STRidge. This means that, with the intrinsic noise in the sample, rPCA yields estimates for the diffusion coefficient that are no less variable than those of the original PDE-FIND method. Table 5 summarizes the values of the identification ratios $a_j$. It can be seen that no terms in the PDE are conclusively identified, as all ratios are either small or in the intermediate range. Most saliently, rPCA identifies the diffusion coefficient in only 38% of samples in the data set. Hence, even when using a state-of-the-art method for denoising observations, it is important to statistically quantify uncertainty in the learned coefficients.
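A minimal sketch of the rescaling in Eq. (17), assuming $Z$ and $U$ are arrays with one row per time point; in practice, rows where the rPCA output is identically zero would need guarding against division by zero.

```python
import numpy as np

def rescale_rpca_output(Z, U):
    """Rescale the rPCA output Z row-by-row following Eq. (17): each row
    (one time point) is scaled by the average of the ratio of row means
    and the ratio of row maxima between the raw data U and the output Z."""
    Z_tilde = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        mean_ratio = U[i].mean() / Z[i].mean()   # ratio of row means
        max_ratio = U[i].max() / Z[i].max()      # ratio of row maxima
        Z_tilde[i] = 0.5 * (mean_ratio + max_ratio) * Z[i]
    return Z_tilde
```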
Convolution with Gaussian kernel: an alternative approach to denoising

In many applications, noisy data can be successfully smoothed using a suitable kernel. We choose to smooth our data using a Gaussian kernel with standard deviation 2, so that it is roughly equivalent to the scale used in the spatial discretization; a short sketch is given at the end of this subsection. Summary values for the identification ratios $a_j$ are found in Table 5. While the convolution is highly successful in recovering one of the relevant terms in the dynamical system (for example, the diffusion term in $D_u$, the bias term in $D_b$ and the proliferation term in $D_p$), the method regularly fails to recover the diffusion term in $D_b$ and $D_p$.

Figure 19: Joint distribution of $c_{u_{xx}}$ and $c_{uu_{xx}}$ trained on ABM data of unbiased motion without proliferation, where each sample is one time series. The location of the true parameter value is indicated by a black star. (a) Naive application of STRidge to the ABM data. (b) ABM data pre-processed with rPCA, as detailed in this SI, before training PDE-FIND. (c) ABM data pre-processed by smoothing with a Gaussian kernel with σ = 2 before training PDE-FIND.
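As a sketch, smoothing with a Gaussian kernel of standard deviation 2 grid points can be done with scipy; `u_data` and the file name are hypothetical placeholders for the user's own (time × space) array of noisy densities.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Hypothetical (n_times, n_space) array of noisy ABM densities.
u_data = np.load("abm_densities.npy")  # placeholder for the user's own data

# Smooth along the spatial axis with a Gaussian kernel of standard deviation
# sigma = 2 grid points before passing the data to PDE-FIND.
u_smoothed = gaussian_filter1d(u_data, sigma=2, axis=1)
```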
Tuning of PDE-FIND hyperparameters

Following Nardini et al. [2], we implement cross-validation for finding the optimal value of the regularization parameter in the STRidge algorithm. As training data, we randomly select a subset of 10 data points from the ensemble. For each λ in a grid of candidate values, we perform leave-one-out cross-validation, that is, we train PDE-FIND on the left-out data sample and compute the $L^2$-loss of the integrated solution at t = 250 against the samples not used for training. This gives an optimal value of λ = 0.5; a sketch of the procedure is given below.
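A minimal sketch of this leave-one-out selection, where `fit_pdefind` and `integrate` are hypothetical wrappers around a PDE-FIND/STRidge implementation and a numerical PDE solver, and the data structures are assumed rather than taken from the authors' code.

```python
import numpy as np

def cv_select_lambda(train_data, target_profiles, candidate_lambdas,
                     fit_pdefind, integrate):
    """Leave-one-out selection of the STRidge regularization parameter.
    train_data: list of time series used to fit PDE-FIND;
    target_profiles: list of observed density profiles at t = 250,
    one per series, used for evaluation."""
    losses = {}
    for lam in candidate_lambdas:
        loss = 0.0
        for k in range(len(train_data)):
            coefs = fit_pdefind(train_data[k], lam)  # fit on the held-out series
            pred = integrate(coefs)                  # predicted density at t = 250
            # L2 loss against the series not used for training
            loss += sum(np.sum((pred - target_profiles[m]) ** 2)
                        for m in range(len(train_data)) if m != k)
        losses[lam] = loss
    return min(losses, key=losses.get)               # lambda with smallest loss
```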
To illustrate the role of increasing the tuning parameter, Table 6 shows the values of the identification ratios $a_j$ at different values of λ. For these values, the main text reports the empirical distribution of the PDE-FIND coefficients. Note that λ = 0.01 corresponds to the original experiment.
[Table 6: identification ratios $a_j$ at different values of λ, where $j$ is given as the term in the PDE in each column.]

Comparison of model predictions

When data are noisy, different linear combinations of spatial derivatives can outperform the temporal derivative when measured according to the $L^2$-loss. We visualize this by selecting two time data points for each of the data sets whose PDE-FIND coefficients contain terms different from the ground truth PDE. We fit the predicted PDE to the temporal derivative at t = 20 and at t = 250, and compare it to the fitted ground truth using the estimated spatial derivatives. Figures 20 and 21 compare the observed temporal derivatives with these fitted values for the temporal derivative. In five of the six plots, both the ground truth and the PDE-FIND temporal derivative fit reproduce the observed temporal derivative qualitatively. A priori, there is no guarantee that the $L^2$-loss is optimized by the fitted ground truth. However, Figure 22 shows that some of the misspecified coefficients do not yield predictions in close agreement with the observed density data. This illustrates the finding in the main text that close agreement between the fitted temporal derivatives need not imply that the model predictions are adequate. The two loss metrics at play are sketched below.

Figure 20: Comparisons of the observed time derivative at t = 20 with the predicted time derivative using misspecified PDE-FIND coefficients (blue) compared to the fitted ground truth (red). Fitted PDEs in panels (a)–(f) match the panels in Figure 22.

Figure 21: Comparisons of the observed time derivative at t = 250 with the predicted time derivative using misspecified PDE-FIND coefficients (blue) compared to the fitted ground truth (red). Fitted PDEs in panels (a)–(f) match the panels in Figure 22.
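For concreteness, the two competing metrics can be written down directly. This sketch assumes `library` is the matrix of candidate spatial terms evaluated on the data and `u_t` the estimated temporal derivative; `integrate` is again a hypothetical wrapper around a numerical PDE solver.

```python
import numpy as np

def derivative_loss(coefs, library, u_t):
    """L2 error between the observed temporal derivative and the library fit;
    this is the quantity PDE-FIND optimizes directly."""
    return np.sum((library @ coefs - u_t) ** 2)

def prediction_loss(coefs, u_observed, integrate):
    """L2 error between the observed densities and the integrated model
    solution; small derivative loss does not guarantee small prediction loss."""
    u_predicted = integrate(coefs)
    return np.sum((u_predicted - u_observed) ** 2)
```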
Approximate Bayesian Computation

In the main text we identify a significant difference between the loss landscapes of the different error metrics for the computed derivative and the spatial solution of the PDE. Here, we show that we may also take as a reference point a single, noisy observation instead of the average of the observations in the ensemble, although the quality of the resulting posterior is decreased. For completeness, we plot the different loss landscapes in Figure 23.

The high amount of noise in the single-measurement case means that the loss landscape for the derivative does not have a minimum around the true parameter value, implying that minimizing with respect to this norm is not guaranteed to converge to the correct solution when only a single measurement is available. This is very different in the low-noise case, where the loss landscape clearly has a minimum around the true parameter value. On the other hand, comparing the fit of integrated solutions is much more consistent in the case of a single, very noisy measurement, and retrieves low measurement error in a region around the true parameter value. When the loss is computed on the averaged data, this region is smaller and the range of the error is larger, suggesting that this loss function is better able to distinguish between solutions. The poor performance of the integrated model predictions should not be surprising: PDE-FIND does not take into account modeling constraints, for example that the model be mass-conserving. A possible solution to this specific problem could be to write the different terms in the candidate library in flux form.
Figure 22: Comparisons of model predictions using misspecified PDE-FIND coefficients (blue) with the ground truth (red). (a) Unbiased motion, misspecified model with $c_{u_{xx}} = 0.…$ and $c_{uu_{xx}} = 0.26$. (b) Unbiased motion, misspecified model with $c_{u_{xx}} = 0$ and $c_{u^2 u_{xx}} = 1.52$. (c) Biased motion, misspecified model with $c_{u_{xx}} = 0.20$ and $c_{u_x} = 0.…$ (d) Misspecified model with $c_{u_{xx}} = 0$ and $c_{uu_{xx}} = 0.64$. (e) Proliferation and unbiased motion, misspecified model with $c_{u_{xx}} = 0$ and $c_u = 0.…$ (f) Misspecified model with $c_{u_{xx}} = 0.12$ and $c_u = 0.…$

More generally, though, it is not obvious how to include all possible constraints on solutions in the composition of the candidate library. Especially given that PDE-FIND is needed precisely to discover unknown governing PDEs, practitioners cannot be expected to build many such constraints into the candidate library. If a modeler were able to construct the candidate library so as to incorporate all constraints exactly, perhaps other methods of fitting the relevant equation parameters would be more appropriate, avoiding the need to use EQL methods altogether.

Approximate Bayesian Computation is a popular tool to perform likelihood-free Bayesian inference. This is useful in cases where the likelihood function of the observed data given the model parameters is intractable or difficult to compute, but simulation given model parameters is feasible. In ABC, a prior $\pi$ is specified over parameter space that reflects prior information about the model parameter $\theta$. In each step, $\theta \sim \pi$ is drawn, and a model simulation is run using $\theta$, yielding the simulated data $X^{\mathrm{Sim}}$. The simulated data are compared to the observed data $X^{\mathrm{Obs}}$ according to a distance function $d(X^{\mathrm{Sim}}, X^{\mathrm{Obs}})$ provided by the practitioner.

Figure 23: Heatmap showing the $L^2$ error for each pair of $c_{uu_{xx}}$ and $c_{u_{xx}}$, overlaid with actual PDE-FIND predictions. (a) Derivative $L^2$ loss when compared to a single data point. (b) Derivative $L^2$ loss when compared to the average. (c) Solution comparison with respect to a single data point. (d) Solution comparison with respect to the average over all of the time series in the ensemble.
Given a tolerance $\epsilon > 0$, the sampled $\theta$ is accepted whenever $d(X^{\mathrm{Sim}}, X^{\mathrm{Obs}}) < \epsilon$. The process finishes when a pre-specified number of parameter samples have been accepted. The posterior distribution is then given by the accepted parameter samples. We refer to [8] for the details; a minimal sketch of the rejection scheme is given at the end of this section.

In the main text, we choose a two-dimensional parameter space that contains only $u_{xx}$ and $uu_{xx}$ as terms in the PDE, and parametrize the normal marginals used in the main text by computing the empirical means of $c_{u_{xx}}$ and $c_{uu_{xx}}$, conditional on them being nonzero. We assign the following relative weight $m_j$ to the atom at 0 for coefficient $j \in \{u_{xx}, uu_{xx}\}$:
$$m_j = \frac{\sum_{i=1}^{N} \mathbb{1}\left(c^i_j = 0\right)}{N - \sum_{i=1}^{N} \mathbb{1}\left(c^i_{u_{xx}} = 0,\; c^i_{uu_{xx}} = 0\right)}. \qquad (18)$$
This gives the prior distribution shaded in grey in each of the panels of Figure 9 in the main text. Letting $X^{\mathrm{sim}}_i$, $X^{\mathrm{obs}}_i$ be the simulated and observed densities at time $i$, we choose the distance function
$$d(X^{\mathrm{obs}}, X^{\mathrm{sim}}) = \sum_{i} \left\| X^{\mathrm{obs}}_{50i} - X^{\mathrm{sim}}_{50i} \right\|_2, \qquad (19)$$
where $\|\cdot\|_2$ represents the $L^2$-norm. We perform two experiments. In the first, we choose for $X^{\mathrm{obs}}$ the time series used to compute the loss landscape of Figure 23 and choose $\epsilon = 0.…$ In the second, we choose for $X^{\mathrm{obs}}$ the average over all time series in the ensemble and set $\epsilon = 0.…$
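For concreteness, the rejection scheme and the distance function of Eq. (19) can be sketched as follows. This is a minimal illustration, not the authors' implementation; `prior_sampler` and `simulate` are hypothetical callables wrapping the prior defined via Eq. (18) and a numerical PDE solver, and the observed and simulated data are assumed to be lists of density profiles at the compared time points.

```python
import numpy as np

def distance(x_obs, x_sim):
    """Sum of L2 distances between observed and simulated density profiles,
    compared profile by profile as in Eq. (19)."""
    return sum(np.linalg.norm(o - s) for o, s in zip(x_obs, x_sim))

def abc_rejection(prior_sampler, simulate, x_obs, eps, n_accept):
    """Basic ABC rejection sampler: draw theta from the prior, simulate,
    and keep theta whenever the simulated data fall within distance eps
    of the observations. Returns the accepted samples (the posterior)."""
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sampler()          # draw theta ~ pi
        x_sim = simulate(theta)          # integrate the candidate PDE
        if distance(x_obs, x_sim) < eps:
            accepted.append(theta)
    return np.array(accepted)
```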