Bayesian uncertainty quantification for data-driven equation learning
Simon Martina-Perez, Matthew J. Simpson and Ruth E. Baker
Mathematical Institute, University of Oxford, Oxford, United Kingdom
School of Mathematical Sciences, Queensland University of Technology, Brisbane, Australia
February 24, 2021
Abstract
Equation learning aims to infer differential equation models from data. While a number of studies have shown that differential equation models can be successfully identified when the data are sufficiently detailed and corrupted with relatively small amounts of noise [1, 2, 3, 4], the relationship between observation noise and uncertainty in the learned differential equation models remains unexplored. We demonstrate that for noisy data sets there exists great variation in both the structure of the learned differential equation models and the parameter values. We explore how to combine data sets to quantify uncertainty in the learned models and, at the same time, draw mechanistic conclusions about the target differential equations. We generate noisy data using a stochastic agent-based model and combine equation learning methods with approximate Bayesian computation (ABC) to show that the correct differential equation model can be successfully learned from data, while a quantification of uncertainty is given by a posterior distribution in parameter space. Code can be found at https://github.com/simonmape/UQ-for-pdefind.

Introduction

Many phenomena in mathematical biology arise as a result of complex interactions between individual agents at the microscale that result in emergent properties at the macroscale. Understanding the mechanistic basis for the observed macroscale behaviour, in order to gain fundamental insights into biological phenomena, is one of the key challenges in biology. Mathematical models are well placed to help provide such insights, providing a rigorous framework in which hypotheses can be generated, tested and refined. While the interactions between individual agents can be naturally described by agent-based models (ABMs) that prescribe precise rules for the interactions between agents, predicting the macroscale behaviour of ABMs can be a challenging task, since their governing equations are often intractable and stochastic simulations can be computationally expensive, often prohibitively so in the context of parameter sensitivity analysis or parameter inference [2]. This makes partial differential equation (PDE) models an indispensable tool to describe the macroscale properties of a population. The benefits of PDE models are that they are fast to solve numerically, their different terms often carry a physical interpretation, and they can be explored using a range of analytical and numerical approaches. Knowing how such a model is parametrized can therefore provide key insights into the system under consideration, and aid in making quantitative as well as qualitative predictions.

Traditional approaches to mathematical modelling use experimentally derived mechanistic hypotheses to build PDE models in which the various terms of a given model are designed to describe the hypothesised mechanisms for that scenario. Calibration of the model to data then involves finding the parameters that minimise the discrepancy between the model output and the data. The ensuing, iterative process of testing and refining the model against further experimental data allows the original hypotheses to be refined, and so new insights to be gained.

Equation learning (EQL) methods aim to infer the dynamical systems model that best describes a given time series by leveraging statistical and machine learning tools to learn the appropriate terms of a differential equation directly from the data.
In particular, the PDE-FIND algorithm [1] takes as input quantitative data, together with a large library of candidate terms for a PDE, and aims to learn which terms to include in the PDE model, as well as their coefficients. User-defined parameters can be tuned to balance the requirement of a good model fit against the desire for a simple, interpretable model.

EQL methods have rapidly gained popularity, mainly thanks to increases in computational power, and a number of other techniques to establish models from data now exist. For example, biologically informed neural networks [3], an extension of physics-informed neural networks [5], have been developed to learn different terms of a PDE model without the need to specify a library of possible terms. A major advance has come from the use of techniques such as artificial neural networks (ANNs) [6] to accurately recover models from artificially generated noisy data from PDEs. The fact that EQL could discover previously undetected mechanisms, choose between competing models, or estimate biological quantities of interest that are difficult to measure experimentally makes EQL attractive to scientists working with real-world data. However, practitioners wishing to obtain models that they can use in real-world settings require, in addition, a thorough quantification of uncertainty. This need arises because the, often significant, noise in real-world data can impact the models predicted by EQL methods, and hence the predictive capability of those models for unseen data or scenarios. For example, Nardini et al. [2] have recently shown, through the use of several case studies, that it is possible to infer differential equation models that describe noisy data generated by stochastic ABMs. However, the stochasticity in the ABM results in variability in the learned macroscale differential equation. This means that, for a particular realisation generated from a stochastic model, the learned differential equation is essentially a point estimate of the underlying PDE, and there is no quantification of uncertainty in the learned equation.

Recently, some authors [1, 7] have attempted to address this problem by analyzing the robustness of PDE-FIND with noisy or sparse data. For example, the approach in [1] amounts to choosing a PDE with ground truth parameters, simulating from this target equation, corrupting the simulated data with noise, training PDE-FIND on these data, and finally comparing the PDE-FIND parameters with the ground truth. Li et al. [7] investigate how to increase the signal-to-noise ratio prior to using EQL techniques. While both works show that model parameters can be retrieved within an impressive margin of error when data are corrupted with relatively small amounts of noise, these approaches give no statistical quantification of uncertainty in model predictions, as the methods are still designed to output a single learned equation.

Figure 1: Proposed framework for uncertainty quantification in equation learning. A noisy data ensemble is used to efficiently create an informative prior distribution, which can then be used to obtain a practical posterior distribution.

In this work we demonstrate that noise can significantly impact both the structure and the parameters of learned differential equations, rendering uncertainty quantification a crucial component of the equation learning process.
As such, our overarching aim is to develop and showcase a method for uncertainty quantification in the context of EQL, where we harness the immense computational efficiency of PDE-FIND in learning point estimates of governing equations, together with the power of computational Bayesian inference in evaluating the level of uncertainty in the learned equation. Figure 1 shows our proposed framework for uncertainty quantification. We start from the premise that it is possible to collect an ensemble of spatiotemporal data sets from a given system, and develop an approach to understand how the data can best be used to learn a governing equation while simultaneously estimating the uncertainty in that learned equation.

Our motivation is thus: on the one hand, PDE-FIND provides a computationally cheap method to obtain a point estimate of the governing equation from a single time series. However, when the data are noisy, the individual predictions are unreliable. On the other hand, the field of computational Bayesian inference provides a number of methods to estimate, for a given model and data, posterior parameter distributions, i.e. to provide estimates of model parameters and quantify the uncertainty in those estimates. In principle, a likelihood-free method such as approximate Bayesian computation (ABC) [8] could be used directly with the candidate library of the PDE-FIND method to estimate the posterior distribution of the library coefficients. However, due to the very large number of candidate terms in the PDE-FIND library, the computational cost associated with applying ABC (or other likelihood-free methods for Bayesian inference) over the entire high-dimensional parameter space is prohibitive. Instead, we propose a framework that combines the strengths of each approach: we train the PDE-FIND algorithm on individual data sets from the ensemble to obtain an informed prior distribution that makes it possible to use ABC in a computationally realistic setting. We demonstrate the potential of our approach using synthetic data generated from a widely used ABM that describes the behaviour of a motile and proliferative cell population and can be accurately coarse-grained to a mean-field PDE in certain regions of parameter space.

In Section 2, we describe the ABM and discuss in detail its relation to a governing PDE. We describe the PDE-FIND algorithm and propose how it may be used to construct a prior for Bayesian inference. In Section 3, we show that the learned PDEs are intrinsically variable in the presence of observation noise and that, regardless of the method used to deal with observation noise, there is still variability that needs to be quantified. We then demonstrate that PDE-FIND can learn unphysical solutions to the PDE and provide an explanation in terms of the objective function of the algorithm. We bring all these issues together through the use of Bayesian methods, with which we can evaluate uncertainty in a framework that optimizes the fit of the model density profile to the data.

Methods

We begin by describing the ABM and its coarse-grained mean-field PDE, and then we outline the PDE-FIND algorithm and the means by which we use it to construct a prior distribution for ABC.
ABMs allow practitioners to investigate the collective behaviour of cells and tissues based on a description of the behaviour of individual cells. Here, in order to realistically describe cell-cell interactions, we follow a volume-exclusion model presented in [9] for a population of agents that move and proliferate according to a discrete random walk model. This is a simple model that can be used to analyse a range of phenomena, including, for example, the collective migration of cells in a tissue. In this model, we assume that agents occupy sites on a square lattice at locations (i, j), where (i, j) are integer coordinates. Agent movements are constrained by exclusion: each lattice site may be populated by at most one agent at a time. Agents can only move to one of their four adjacent lattice sites. The lattice is initialized according to some initial condition, and the parameters p_m, p_p ∈ [0, 1] and ρ_x, ρ_y ∈ [−1, 1] are set.

At each time step τ, a random sequential updating procedure is carried out: agents are selected, one at a time, with replacement, and are allowed to attempt a movement and a proliferation event. When an agent is selected, a movement event is attempted with probability p_m, and two random numbers S_1, S_2 ∼ U(0, 1) are drawn independently. If S_1 ≤ p_p, then the agent proliferates and a daughter agent is placed in one of the adjacent sites, if that site is empty. If a movement is attempted, then S_2 determines the motion of the agent according to Table 1. We recall that proliferation into, or movement to, any site can only occur if that site is empty.

Move chosen         Target site   Probability     Where random number S_2 falls
vertically down     (i, j − 1)    (1 − ρ_y)/4     0 ≤ S_2 ≤ (1 − ρ_y)/4
vertically up       (i, j + 1)    (1 + ρ_y)/4     (1 − ρ_y)/4 < S_2 ≤ 1/2
horizontally left   (i − 1, j)    (1 − ρ_x)/4     1/2 < S_2 ≤ 1/2 + (1 − ρ_x)/4
horizontally right  (i + 1, j)    (1 + ρ_x)/4     1/2 + (1 − ρ_x)/4 < S_2 ≤ 1

Table 1: How an agent at site (i, j) selects a target site, as presented in [9].

From a biological perspective, the parameters control the collective behaviour of the cells. A high value of p_m reflects cells that move easily through their environment, whereas a lower value of p_m corresponds to lower motility, reflected by fewer cell movements, on average, in each time interval. The movement bias parameters ρ_x, ρ_y reflect the directional preferences of agents, which can be due to factors such as chemotaxis or other drivers of cell movement. Setting ρ_x = ρ_y = 0 corresponds to unbiased motion on the lattice. Finally, p_p determines the rate at which cells proliferate: a high value of p_p corresponds to cells that proliferate rapidly, whereas a lower value signifies lower mean proliferation rates. In this paper, we consider a uniformly spaced lattice with lattice spacing equal to 1 and time increments τ = 1. We also take p_m = 1. As noted in [9], all simulations can easily be generalized to arbitrary lattice spacings and arbitrary time increments, but we keep things simple here by choosing uniform lattice spacing and constant time steps.

As in [9], we simulate the ABM on a square lattice with integer coordinates (i, j) such that 1 ≤ i ≤ 200 and 1 ≤ j ≤ 20. That is, the horizontal dimension of the lattice is 200 and the vertical dimension is 20. This corresponds to looking at a thin strip of cells that move and proliferate, as arises, for example, in experiments involving a spatial invasion process. We take a pseudo-one-dimensional initial condition by populating all lattice sites with 90 ≤ i ≤ 110. Let C_ij denote the occupancy of site (i, j), with C_ij = 1 if (i, j) is occupied by an agent and 0 if it is empty. We can average the occupancy of the columns of the lattice by defining the mean occupancy of column i as

C_i = (1/20) Σ_{j=1}^{20} C_ij.   (1)

Performing this averaging at all later time points yields a simulation of a time-varying one-dimensional spatial density profile.

The dimensions of the lattice and the choice of a pseudo-one-dimensional initial condition suggest considering how the density profile, obtained by averaging the occupancy of each column of the lattice, varies in time. To do this, let C_i be the averaged occupancy of the i-th column of the lattice, that is, the average number of agents per site in the column. Then {C_i} is a one-dimensional density profile of the lattice. As noted by Simpson et al. in [9], the rules of the stochastic model induce a one-dimensional conservation of mass statement between C_i(t) and C_i(t + τ), the occupancy of site i at time t and at the next time step, t + τ. Taking again p_m = 1,

C_i(t + τ) = C_i(t) + (1 + ρ_x)/4 · C_{i−1}(t)(1 − C_i(t)) + (1 − ρ_x)/4 · C_{i+1}(t)(1 − C_i(t))
             − (1 + ρ_x)/4 · C_i(t)(1 − C_{i+1}(t)) − (1 − ρ_x)/4 · C_i(t)(1 − C_{i−1}(t))
             + (p_p/2) C_{i−1}(t)(1 − C_i(t)) + (p_p/2) C_{i+1}(t)(1 − C_i(t)).

Note how the different terms in the conservation of mass statement correspond to the flux of agents into and out of the adjacent lattice sites. Note also that ρ_y does not appear in the conservation of mass statement, justifying the choice ρ_y = 0 henceforth. For the general derivation of conservation of mass statements, we refer to [9]. The general one-dimensional conservation of mass statement for the average occupancy between t and t + τ corresponds to an explicit finite difference approximation of the PDE

∂c/∂t = D ∂²c/∂x² − V ∂/∂x[c(1 − c)] + P c(1 − c),   (2)

in the limiting case where τ and the mesh size Δ satisfy τ, Δ → 0 with Δ²/τ held constant. In Equation (2), D = lim_{Δ,τ→0} p_m Δ²/(4τ), V = lim_{Δ,τ→0} p_m ρ_x Δ/(2τ) and P = lim_{τ→0} p_p/τ. For the full derivation and details, we refer to [9]. The identification of this agent-based model with the resulting PDE motivates us to investigate the performance of equation learning methods trained on data generated by the agent-based model, since the PDE describes the time-evolution of the expected value of the density profile. A sketch of a simulation of this ABM is given below.
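To make the update rules concrete, the following Python sketch simulates the exclusion process described above and records the column-averaged density profile of Equation (1). It is a minimal illustration under our own naming, not the authors' implementation; the boundary conditions (blocked in x, periodic in y) and the uniformly random choice of daughter placement site are assumptions, as the text above does not fix them.

```python
import numpy as np

def simulate_abm(width=200, height=20, p_m=1.0, p_p=0.0,
                 rho_x=0.0, rho_y=0.0, n_steps=1000, seed=0):
    """Volume-exclusion random walk of [9]; returns the column-averaged
    density profile C_i(t) as an array of shape (n_steps + 1, width)."""
    rng = np.random.default_rng(seed)
    lattice = np.zeros((width, height), dtype=bool)
    lattice[90:110, :] = True                 # pseudo-1D initial condition
    profiles = [lattice.mean(axis=1)]

    moves = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # down, up, left, right
    cum = np.cumsum([(1 - rho_y) / 4, (1 + rho_y) / 4,
                     (1 - rho_x) / 4, (1 + rho_x) / 4])

    for _ in range(n_steps):
        occupied = np.argwhere(lattice)       # agent list at start of step
        for _ in range(len(occupied)):
            # random sequential update: agents sampled with replacement;
            # stale entries (agents that already moved) are skipped
            i, j = occupied[rng.integers(len(occupied))]
            if not lattice[i, j]:
                continue
            s1, s2 = rng.random(2)
            if s1 <= p_p:                     # proliferation attempt
                di, dj = moves[rng.integers(4)]
                ti, tj = i + di, (j + dj) % height
                if 0 <= ti < width and not lattice[ti, tj]:
                    lattice[ti, tj] = True    # daughter placed if site empty
            if rng.random() <= p_m:           # movement attempt
                di, dj = moves[np.searchsorted(cum, s2)]  # Table 1 partition
                ti, tj = i + di, (j + dj) % height
                if 0 <= ti < width and not lattice[ti, tj]:
                    lattice[ti, tj], lattice[i, j] = True, False
        profiles.append(lattice.mean(axis=1))
    return np.array(profiles)

# Case I of this work: unbiased motion, no proliferation
density = simulate_abm(p_p=0.0, rho_x=0.0)
```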
As test cases for learning the governing equations from data, we explore three different parameter regimes of the model, each corresponding to a biologically relevant setting. In Case I, we consider agents moving without bias and without proliferation (ρ_x = ρ_y = p_p = 0). In Case II, we consider agents moving with bias, but without proliferation (ρ_x = 0.075 and ρ_y = p_p = 0). In Case III, we consider agents moving without bias, but with proliferation (ρ_x = ρ_y = 0 and p_p = 0.01). We use the label u for the unbiased case without proliferation, b for the biased case without proliferation, and p for the unbiased case with proliferation.

Case                                ρ_x     p_p     Coarse-grained PDE
Case I: no bias, no proliferation   0       0       c_t = 0.25 c_xx
Case II: bias, no proliferation     0.075   0       c_t = 0.25 c_xx − 0.0375 [c(1 − c)]_x
Case III: proliferation, no bias    0       0.01    c_t = 0.25 c_xx + 0.01 c(1 − c)

Table 2: Coefficients of the mean-field PDEs describing the evolution of the mean population density over time for the three example cases used in this work.

We let i = 1 denote the case where we average over one realization for each density profile and i = 2 the case where we average over 50 realizations for each density profile, and denote the resulting data sets {D_i^r}, where r ∈ {u, b, p} and i ∈ {1, 2}. For example, D_1^u refers to the data set corresponding to simulations of Case I where we average over one realization per density profile.

Figure 2 shows that the obtained data traces of the ABM are in good agreement with the coarse-grained PDE. Note, however, that in the plots at t = 50 the solutions for Cases I, II and III are very similar. Even at t = 100 the plots for Cases I and III are very similar, and it is not until later times (t = 150) that a clear difference between the solutions emerges, owing to the role of proliferation. Based on this visual inspection, we anticipate that reliably estimating the PDEs from the averaged data is difficult, and that all time scales need to be taken into account to properly distinguish the data traces.

Figure 2: Density profiles at t = 0, t = 50, t = 150, t = 500 for one of the simulated realizations of the model where no averaging is done (left column), where time series consist of averages over 50 realizations of the model (middle column), and the corresponding continuum model. It can be seen that the continuum model is a very good approximation of the average density obtained from the ABM simulations.

In the following, assume that we have time series data for an unknown function u(x, t) on a grid of n points in time and m points in space. These data are stored in a matrix U ∈ R^{n×m}. We assume that the data are a noisy discretization of a function u(x, t), the solution of an unknown PDE, and the aim is to learn the PDE that best describes the governing equation of the observed data. Henceforth, and to avoid confusion, we write U(x, t) for the observed data, u(x, t) for the learned, data-driven solution of the governing PDE, and c(x, t) for the solution to the approximate mean-field PDE defined in Equation (2). We follow Rudy et al. [1] in assuming that the PDE governing u(x, t) has the form

u_t = N(u, u_x, u_xx, …, x, μ) = Σ_{i=1}^{d} N_i(u, u_x, u_xx, …) ξ_i,   (3)

where N is a nonlinear function of u(x, t) and its partial derivatives, μ denotes model parameters, and the ξ_i are constant coefficients. Furthermore, it is assumed that N is a linear combination of a finite number of distinct library terms; we point out that Equation (2) falls within this class of PDEs. By design, N may contain terms such as polynomial nonlinearities, as is common in many equations in the natural sciences. PDE-FIND creates a large library of terms that could appear in N, with the view to selecting a small subset of relevant terms from this list. The first step of the PDE-FIND pipeline is to numerically approximate both sides of Equation (3).
This is done by taking derivatives of the data with respect to space and time. The standard PDE-FIND implementation in [1] takes finite difference approximations when the data contain little noise and uses polynomial differentiation when the data are highly noisy. The data and their derivatives are combined in a matrix Θ(U), where each column of Θ contains the values of a particular candidate function across the entire n × m grid. For example, if the candidate library consists of all polynomials up to degree two and all derivatives up to second order, Θ(U) takes the form

Θ(U) = [1, U, U², U_x, U U_x, U² U_x, U_xx, U U_xx, U² U_xx].   (4)

So, if there are N terms in the candidate library, Θ(U) is an n·m × N matrix; in this example, N = 9. The left-hand side of Equation (3) is similarly approximated using time differentiation, and we obtain a linear system representing the PDE in Equation (3),

U_t = Θ(U) ξ.   (5)

Taking the same example for Θ(U) as in Equation (4), this matrix equation reads

[ U_t(x_1, t_1) ]   [ 1  U(x_1, t_1)  U²(x_1, t_1)  U_x(x_1, t_1)  …  U²U_xx(x_1, t_1) ] [ ξ_1 ]
[ U_t(x_2, t_1) ]   [ 1  U(x_2, t_1)  U²(x_2, t_1)  U_x(x_2, t_1)  …  U²U_xx(x_2, t_1) ] [ ξ_2 ]
[      ⋮        ] = [ ⋮       ⋮             ⋮             ⋮        ⋱          ⋮       ] [  ⋮  ]   (6)
[ U_t(x_n, t_m) ]   [ 1  U(x_n, t_m)  U²(x_n, t_m)  U_x(x_n, t_m)  …  U²U_xx(x_n, t_m) ] [ ξ_9 ]

Note how this representation shows that each row of the matrix equation represents the governing dynamics behind the data at one point in time and space. Note also that the values of ξ determine the form of the PDE, and so the aim is to learn the coefficients ξ optimally. Following Rudy et al. [1], we take Θ to be overspecified, meaning that the dynamics can be represented well as linear combinations of the columns of Θ. However, many PDEs in the natural sciences contain only a few terms, and we therefore wish to learn a sparse vector ξ in the solution of Equation (3). This is done in PDE-FIND by considering the optimisation criterion

ξ̂ = argmin_ξ ‖Θ(U) ξ − U_t‖₂² + λ ‖ξ‖₂²,   (7)

for the coefficients ξ, where λ ∈ R_{>0} is a free parameter that penalises large terms. This is the method of ridge regression. We note here that the term ‖ξ‖₂² can be replaced with ‖ξ‖₁, which corresponds to performing LASSO. The choice of implementation is largely problem-dependent, and various choices for the regularization method have been compared in the literature, for example by Rudy et al. [1], although no method is definitively preferred over another. The default implementation of PDE-FIND proposed by Rudy et al. [1] supplements the ridge regression problem with a sequential thresholding procedure: a solution to (7) is computed, and a hard threshold is applied to the regression coefficients by eliminating all coefficients smaller in magnitude than some pre-specified parameter d_tol. This process is repeated on the remaining coefficients, enforcing sparsity, since the solution to the ridge regression problem in Equation (7) may contain several small but nonzero values. The combined algorithm is called Sequential Thresholded Ridge regression (STRidge). For details and motivation of the method, we refer to [1].
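As an illustration of this step, the sketch below assembles U_t and the nine-column library of Equation (4) from a data matrix U. We use simple central finite differences via numpy.gradient; the polynomial differentiation used in [1] for noisy data is not reproduced here, and the function name is our own.

```python
import numpy as np

def build_library(U, dx, dt):
    """Assemble Theta(U) = [1, u, u^2, u_x, uu_x, u^2 u_x, u_xx, uu_xx,
    u^2 u_xx] and U_t from data U of shape (n_time, m_space)."""
    U_t = np.gradient(U, dt, axis=0)          # time derivative, row-wise
    U_x = np.gradient(U, dx, axis=1)          # first spatial derivative
    U_xx = np.gradient(U_x, dx, axis=1)       # second spatial derivative
    cols = [np.ones_like(U), U, U**2,
            U_x, U * U_x, U**2 * U_x,
            U_xx, U * U_xx, U**2 * U_xx]
    # each n x m field becomes one column of the regression problem (6)
    Theta = np.stack([c.ravel() for c in cols], axis=1)
    return Theta, U_t.ravel()
```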
We summarize the PDE-FIND algorithm in Algorithm 1, as reproduced from [1].

Algorithm 1: STRidge
Input: library matrix Θ(U), time derivative of the data U_t, λ, d_tol.
Output: sparse vector ξ.
1. Compute ξ̂ = argmin_ξ ‖Θ(U) ξ − U_t‖₂² + λ ‖ξ‖₂².
2. Set B = {j : |ξ̂_j| ≥ d_tol}.
3. Set ξ̂_j = 0 for all j ∉ B.
4. Compute ξ̂[B] = STRidge(Θ[:, B], U_t, d_tol) recursively.

For the simulated ABM data in Cases I, II and III, we collect the column-averaged density at each time step. We denote the resulting data sets D_i^r for r = u, b, p and i = 1, 2, where u stands for unbiased motion (Case I), b stands for biased motion (Case II) and p stands for proliferation (Case III). As before, i = 1 corresponds to data traces with no averaging and i = 2 corresponds to data traces averaged over 50 simulations of the ABM. Per data set, we then subsample n = 500 equidistant time points from the 1000 time points at which the simulation was carried out, so that Δt = 2. Thus, each sample in the data set contains information at n = 500 time points and m = 200 space points. We use the standard implementation of PDE-FIND [1] with fourth-order polynomial differentiation to compute both the time and space derivatives.

We select a library of candidate terms that includes all polynomial terms up to order two and derivatives up to the second derivative, since interactions between agents happen only between direct neighbours and so we do not expect other terms to appear in the PDE. Table 3 shows the values of the coarse-grained PDE coefficients, according to Equation (2). These are the values of the coefficients that we would expect the PDE-FIND algorithm to return for perfect spatio-temporal data. In Table 3, and in the rest of this work, we use the notation c_i for the coefficient of term i in the learned PDE. A sketch of the STRidge step is given after Table 3.

Case      c_1   c_u    c_{u²}   c_{u_x}   c_{uu_x}   c_{u²u_x}   c_{u_xx}   c_{uu_xx}   c_{u²u_xx}
Case I    0     0      0        0         0          0           0.25       0           0
Case II   0     0      0        −0.0375   0.075      0           0.25       0           0
Case III  0     0.01   −0.01    0         0          0           0.25       0           0

Table 3: Coefficients of the mean-field PDEs describing the evolution of the mean population density over time for the three example cases used in this work. The coefficients correspond to the coarse-grained PDEs described in Table 2.
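A compact version of Algorithm 1 in NumPy might look as follows. This follows the published description of STRidge [1] rather than the reference implementation, and the default values of λ and d_tol are illustrative placeholders only.

```python
import numpy as np

def stridge(Theta, U_t, lam=1e-5, d_tol=0.025, max_iter=10):
    """Sequential Thresholded Ridge regression (Algorithm 1): solve the
    ridge problem on the active set, zero out small coefficients, repeat."""
    active = np.arange(Theta.shape[1])
    xi = np.zeros(Theta.shape[1])
    for _ in range(max_iter):
        A = Theta[:, active]
        coef = np.linalg.solve(A.T @ A + lam * np.eye(len(active)),
                               A.T @ U_t)     # ridge normal equations
        keep = np.abs(coef) >= d_tol
        if keep.all():                        # nothing thresholded: done
            break
        active, coef = active[keep], coef[keep]
        if len(active) == 0:                  # every term eliminated
            return xi
    xi[active] = coef
    return xi
```

Applied to the output of build_library above, stridge(Theta, U_t) returns a coefficient vector whose nonzero entries define the learned PDE.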
ABC is a popular likelihood-free tool to obtain a data-informed posterior distribution over the space of possible parameter values. The main idea of ABC is that the user specifies a prior distribution π reflecting prior knowledge of the model parameters, which the ABC algorithm then updates into a posterior distribution by evaluating the performance of sampled parameters in reproducing the observed data. For an overview of ABC methods, we refer to Sunnåker et al. [8]; see the SI for implementation details.

Such posterior distributions provide information about the uncertainty in parameter estimates learned from the data, and also allow practitioners to understand the range of realistic parameter values that can produce the observed data. Despite its simplicity, unless an informative prior is used to constrain the space of possible parameters, a high-dimensional parameter search using ABC is generally computationally prohibitive for the kinds of models routinely used in mathematical biology. This high computational cost means that direct application of ABC methods to estimate the coefficients ξ of Equation (3) is infeasible. Here, we propose a method that uses the predictions of PDE-FIND to construct an informed prior so that ABC can then be used to estimate the library coefficients ξ.

Consider applying the sparse linear regression of Algorithm 1 to the problem defined in Equation (3). Let β̂(λ) denote the ridge regression solution, that is, the solution obtained without applying the hard thresholding part of Algorithm 1, i.e. by setting d_tol = 0. Let β̂(λ, ST) denote the STRidge coefficients obtained when using a fixed value of the threshold parameter d_tol > 0. It is well known (see [10]) that in the presence of observation noise the empirical distribution of β̂(λ) can be well approximated by a multivariate normal distribution. The exact distribution of β̂(λ, ST) in the presence of observation noise will be different, owing to the hard thresholding imposed by the STRidge algorithm, but we expect the nonzero coefficients of β̂(λ, ST) to follow a similar distribution to those of β̂(λ). This justifies approximating the empirical distribution of the coefficients of β̂(λ, ST) as multivariate normal. Here, it is important to note that the purpose of the method is not to find the exact distribution of the PDE-FIND coefficients; rather, it is to find a model that can capture the prior uncertainty in the PDE-FIND coefficients so that it can be used in a Bayesian context.

Our approximation that the nonzero STRidge coefficients approximately follow a normal distribution suggests that we must find a mixture distribution that allows coefficients to be exactly zero or to follow a normal distribution. In other words, we wish to approximate the marginal distribution of each coefficient of the PDE-FIND solution by a mixture of a Dirac δ at 0 and a normal distribution. In the following, we propose a straightforward method for using the observed data to construct such a distribution. Assume that, for each sample in the observed data set, PDE-FIND has produced an estimate of the parameters using STRidge. For each term j in the library, define the identification ratio a_j as

a_j^λ = (1/N_s) Σ_{i=1}^{N_s} 1(β̂(λ, ST)_{ij} ≠ 0).   (8)

When the data set is sufficiently large, this ratio is informative of how often the term is included in the PDE-FIND predictions. When a_j is large, the term is identified across the samples in the data set as being relevant for the dynamics; when a_j is small, the term is identified as relevant in only some samples. Dropping the dependence on λ, since this will always be evident from context, we propose to model the empirical distribution of the j-th coefficient of the PDE-FIND solutions obtained from the data set as

β̂(λ, ST)_j ∼ a_j μ_j + (1 − a_j) δ_0,   (9)

where μ_j = N(m_j, s_j), with m_j, s_j the sample mean and variance over the nonzero values of the j-th coefficient, and δ_0 the Dirac δ distribution at 0. We note that using the identification ratio yields an efficient method for constructing a prior. This is because PDE-FIND is a sparsity-promoting method, and so for most coefficients the identification ratio a_j will be either close to 0 or close to 1. If a_j ≈ 0, the empirical marginal of the j-th coefficient is approximately δ_0, which means that PDE-FIND does not identify this term as relevant across many samples in the data set, implying that the term can be confidently eliminated from the model and not considered in the prior. If a_j ≈ 1, PDE-FIND has consistently identified the term as relevant for the dynamics across the samples in the data set, and so this term should be included in the model, but its uncertainty must still be quantified. If a_j is neither close to 0 nor close to 1, PDE-FIND includes the term in the PDE for a non-trivial fraction of the samples, but fails to include it confidently. In this scenario, Bayesian methods can be used to investigate the joint posterior distribution of such parameters with the rest of the model by considering the performance of models that respectively include and exclude the term.

To perform this term elimination procedure more rigorously, let 0 < δ ≪ 1 and define A = {j : a_j > δ}, that is, the set of terms which a practitioner deems to have a significant identification ratio. The key is to perform Bayesian inference solely for the terms in A, while excluding the remaining terms from the model on the basis of their low identification ratios. We recall that PDE-FIND promotes sparsity, so the size of A will be very small compared to that of the library, making ABC computationally much more efficient. With the information available from training the PDE-FIND coefficients on the ensemble of data sets, and in the absence of further information about the problem, the most straightforward prior that can be defined is an independent prior over the coefficients in A, giving the following choice:

π = ⊗_{j∈A} (a_j μ_j + (1 − a_j) δ_0).   (10)

We summarize the Bayes-PDE-FIND algorithm in Algorithm 2.
Algorithm 2: Bayes-PDE-FIND
Input: data set of N_s time series; PDE-FIND hyperparameters: library Θ of size N, λ, d_tol; minimum identification ratio δ > 0.
Output: posterior distribution over the coefficients of the library PDE.
1. For each sample i in the data set, compute ξ̂_i using Algorithm 1.
2. For each j ∈ {1, …, N}, compute the sample mean m_j and sample variance s_j over the nonzero values of the j-th coefficient and set μ_j = N(m_j, s_j); compute the identification ratio a_j = (1/N_s) Σ_{i=1}^{N_s} 1(ξ̂_{ij} ≠ 0).
3. Compute A = {j : a_j > δ}.
4. Define the prior distribution

π = ⊗_{j∈A} (a_j μ_j + (1 − a_j) δ_0).   (11)

5. Perform approximate Bayesian computation to obtain the posterior distribution.
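In Python, the prior-construction and rejection steps of Algorithm 2 might be sketched as follows. The names are ours; solve_pde (a numerical integrator for a candidate PDE) and distance (for example, Equation (14) below) are assumed to be supplied by the user, and the normal components follow Equation (9).

```python
import numpy as np

def build_prior(xi_samples, delta=0.05):
    """Steps 2-3 of Algorithm 2. xi_samples: array (N_s, N) of STRidge
    coefficient vectors, one per sample. Returns, for each retained term
    j in A, the triple (a_j, m_j, s_j) of Equation (9)."""
    a = (xi_samples != 0).mean(axis=0)        # identification ratios, Eq. (8)
    A = np.where(a > delta)[0]
    stats = {}
    for j in A:
        nz = xi_samples[xi_samples[:, j] != 0, j]
        stats[j] = (a[j], nz.mean(), nz.std(ddof=1))  # needs >= 2 nonzeros
    return stats

def sample_prior(stats, rng):
    """Draw one coefficient vector from the mixture prior of Equation (10)."""
    return {j: (rng.normal(m, s) if rng.random() < a else 0.0)
            for j, (a, m, s) in stats.items()}

def abc_rejection(stats, observed, solve_pde, distance, eps,
                  n_accept=1000, seed=0):
    """Step 5 of Algorithm 2 by rejection sampling: keep prior draws whose
    integrated density profile lies within eps of the observed data."""
    rng = np.random.default_rng(seed)
    posterior = []
    while len(posterior) < n_accept:
        xi = sample_prior(stats, rng)
        if distance(solve_pde(xi), observed) < eps:
            posterior.append(xi)
    return posterior
```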
Results and discussion
In this section, we showcase three different, but related, directions in the uncertainty quantification of the learned differential equations. In subsection 3.1, we showcase the intrinsic variability of the learned coefficients in the presence of observation noise. We evaluate how uncertainty changes as the noise is varied and suggest that even when using state-of-the-art denoising approaches, a need for uncertainty quantification remains. Although increasing the signal-to-noise ratio helps, regardless of the method, there is still variability that needs to be quantified. This is particularly important in biology, where observations are often very noisy and practitioners rarely have access to very large amounts of data. In subsection 3.2, we investigate the impact of varying algorithm hyperparameters in PDE-FIND, with a view to asking whether these can be optimized to drive down uncertainty. We find that while this is possible, parameter estimates are still uncertain and this uncertainty needs to be quantified. In subsection 3.3, we demonstrate that a key issue with PDE-FIND is that it aims to fit the time derivative and does not evaluate the fit of the resulting density to the data, leading to unphysical predictions. We bring all these issues together in subsection 3.4 through the use of Bayesian methods, as outlined in Algorithm 2, with which we can evaluate uncertainty in a framework that optimizes the fit of the model density profile to the data.
We first demonstrate that a naive application of PDE-FIND to noisy synthetic data yields variable and unreliable parameter estimates. For this application, we do not carry out hyperparameter tuning, but simply use widely used parameter settings to learn the coefficients. For each of the two data sets associated with each of Case I, Case II and Case III, we train the PDE-FIND algorithm using Algorithm 1 (STRidge) with fixed values of the hyperparameters λ and d_tol.

For Case I, recall that the true PDE model contains only the term u_xx. Table 5 shows that the two terms identified regularly by PDE-FIND on D_1^u are u_xx and uu_xx, with identification ratios of 0.83 and 0.20, respectively. On D_2^u, only u_xx is identified and no other terms are identified.

[Table 5: identification ratios a_j for each term in the candidate library when STRidge is trained on each of the data sets; the numerical entries of this table were not recoverable, but the full set of identification ratios is reproduced in the SI.]

To investigate the variability of the learned parameters, we display histograms of the learned coefficients in Figure 3. In addition, since the identification ratios of u_xx and uu_xx on D_1^u suggest that the terms have a nontrivial joint distribution, we display the joint distribution of their coefficients as well.

Figure 3: Histograms showing the empirical distribution of relevant PDE-FIND parameters for Case I. A: histogram for c_{u_xx} trained on D_1^u (blue) and on D_2^u (red), compared to the true parameter value (black line). B: histogram for c_{uu_xx} trained on D_1^u, compared to the true value (black line). C: joint distribution of c_{u_xx} and c_{uu_xx}, compared to the true parameter (black star).

Figure 3 reveals that the learned parameters are highly variable and that, with noisy data, the terms uu_xx and u_xx have a nontrivial joint distribution. This means that in some cases PDE-FIND will identify either of the two terms, and in others it will identify a combination of the two. Since all of the empirical data are assumed to be governed by the same PDE, this points towards non-identification of the model by PDE-FIND.

For Case II, Table 5 shows that the two terms identified by PDE-FIND on D_1^b are u_x and u_xx, with identification ratios of 0.999 and 0.012, respectively. Note that the true model should contain the term uu_x as well, but PDE-FIND fails to identify this term across the data. Similarly, for D_2^b, only u_x is identified, with an identification ratio of 1.0. This suggests investigating the variability of the coefficients of u_x and u_xx learned on D_1^b, and only the coefficient of u_x learned on D_2^b.

Figure 4: Histograms showing the empirical distribution of relevant PDE-FIND parameters for Case II. A: histogram for c_{u_xx} trained on D_1^b (blue), compared to the true parameter value (black line). B: histogram of c_{u_x} trained on D_1^b (blue), compared to the true parameter value (black line). C: histogram of c_{u_x} trained on D_2^b, compared to the true parameter value (black line).

Figure 4 again reveals a large amount of variability in the learned parameters. For instance, note that when trained on D_1^b and on D_2^b, parameters are distributed over a wide interval away from the true parameter value. Note also that the variability is lessened when there is less observation noise.

For Case III, Table 5 shows that the terms identified regularly by PDE-FIND on D_1^p are u, u² and u_xx, with identification ratios equal to 0.659, 0.482 and 0.571, respectively. Note that this means that all correct terms have been identified. On D_2^p, the parameters identified are u and u_xx, both with identification ratio equal to 1.0. In addition, since the identification ratios of u and u² on D_1^p suggest that the terms have a nontrivial joint distribution, we display the joint distribution of their coefficients as well.

Figure 5: Histograms for relevant PDE-FIND parameters trained for Case III. A: histogram for c_u trained on D_1^p (blue) and D_2^p (red), compared to the true parameter value (black line). B: histogram for c_{u²} trained on D_1^p (blue), compared to the true parameter value (black line). C: histogram for c_{u_xx} trained on D_1^p (blue) and D_2^p (red), compared to the true parameter value (black line).

Figure 5 reveals that the learned parameters are much less variable when less noise is present in the observation data. However, this does not mean that the model is identified better per se when there is little observation noise. For instance, the term u² is not learned by the model on the less noisy data.

We investigate whether choosing a coarser grid helps to reduce observation noise, with the aim of improving PDE-FIND predictions. Note that a coarser spatial discretization will result in a smoother density profile, while incurring larger errors in the approximation of the spatial derivatives, as well as leaving fewer data points. For this experiment, we subsample the x-dimension by averaging the occupancy over four columns at a time (a code sketch is given below). Mathematically, from the empirical densities C_i, at each time point, we estimate the average occupancies C̃_i for 1 ≤ i ≤ 50, where

C̃_i = (1/4) Σ_{ℓ=4(i−1)+1}^{4i} C_ℓ.   (12)

Table 5 summarizes the identification ratios found. With spatial subsampling, the share of correctly identified diffusion parameters increases significantly, although there is no marked improvement in the identification of other terms. Even if the correct coefficient is identified more often, there remains a need to deal with the remaining terms that are spuriously identified.
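The column averaging of Equation (12) is a one-line reshape in NumPy; a sketch, assuming the density profiles are stored row-wise with 200 spatial points:

```python
import numpy as np

def coarsen(profiles, block=4):
    """Average the occupancy over `block` adjacent columns (Equation (12)).
    profiles: array (n_time, m) -> array (n_time, m // block)."""
    n_t, m = profiles.shape
    return profiles[:, :m - m % block].reshape(n_t, -1, block).mean(axis=2)
```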
The PDE-FIND algorithm affords practitioners a great amount of flexibility in specifying the tuning parameters of the sparse regression. Recall that in Algorithm 1 a free parameter λ controls the penalty incurred by large coefficients in the solution of Equation (3). The choice of regularization parameter is nontrivial, as it modulates the amount of sparsity that the sparse regression problem enforces on the estimated coefficients. The issue of how to choose the optimal value of this hyperparameter was addressed recently by Nardini et al. [2], where cross-validation is discussed among other options. As a test case to investigate the effect of algorithm hyperparameters on the uncertainty of the learned coefficients, we perform cross-validation on the data set D_1^u and then train PDE-FIND using the optimal value of λ found. To do this, we apply the grid search implementation of cross-validation suggested in [2], as detailed in the SI, to arrive at an optimal value of λ = 0.5. We note here that this value is problem-dependent: whenever a new data set is investigated, a different value of λ will typically be appropriate.

Figure 6: Empirical distributions of relevant PDE-FIND coefficients found on ABM data of unbiased motion without proliferation. (a) Histogram of c_{u_xx} trained with λ = 0.01, compared to the true value 0.25 (black line). (b) Histogram of c_{u_xx} trained with λ = 0.05, compared to the true value 0.25 (black line). (c) Histogram of c_{u_xx} trained with the optimal λ = 0.5, compared to the true value 0.25 (black line). (d) Empirical joint distribution of c_{u_xx} and c_{uu_xx} trained with the optimal λ = 0.5, compared to the location of the true parameter (0.25, 0) (black star).

While the findings in the supplementary information show that cross-validation improves the performance of the method dramatically, as the number of misspecified coefficients decreases sharply when the regularization parameter is optimized, cross-validation does not provide a sufficient solution to manage the uncertainty associated with noise in the predicted coefficients. Figure 6 shows that even with the optimal value of the regularization coefficient there is still much uncertainty in the parameter, as the support of the histogram is large. While the atom at 0 has nearly vanished, uncertainty quantification is still necessary, as the empirical distribution still suggests a large degree of stochasticity. Moreover, Figure 6 shows that at the optimal value of the tuning parameter the coefficients c_{u_xx} and c_{uu_xx} still have a nontrivial joint distribution, implying that even with an optimal choice of the regularization parameter, Bayesian methods are needed to analyze the joint behaviour of these two coefficients. Lastly, one can investigate the role of different regularization techniques for learning the parameters of the model. The existing literature on PDE-FIND contains objections to the use of LASSO when the different observations are correlated. For example, Rudy et al. [1] argue for using STRidge as a sparse regression method on the basis of performance. However, other methods exist and, given information about the dynamical system to be learned, one may wish to choose a different method.
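A grid search of the kind discussed in [2] can be sketched as follows, reusing the stridge function above: split the space-time rows into training and validation sets, fit for each candidate λ, and keep the value with the smallest validation residual. The split strategy and grid are illustrative choices, not the exact cross-validation protocol detailed in the SI.

```python
import numpy as np

def cross_validate_lambda(Theta, U_t, lambdas, d_tol=0.025,
                          train_frac=0.8, seed=0):
    """Return the ridge penalty with the smallest validation error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(Theta.shape[0])
    cut = int(train_frac * len(idx))
    tr, va = idx[:cut], idx[cut:]
    errors = [np.linalg.norm(Theta[va] @ stridge(Theta[tr], U_t[tr],
                                                 lam=lam, d_tol=d_tol)
                             - U_t[va])
              for lam in lambdas]
    return lambdas[int(np.argmin(errors))]

best_lam = cross_validate_lambda(Theta, U_t, lambdas=np.logspace(-5, 1, 13))
```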
Figure 7: Comparison of predictions made by mis-specified PDEs, using coefficients learned by PDE-FIND that differ from the ground truth. A: a combination of c_{u_xx} and c_{uu_xx} = 0.26 (blue dash) against a model with c_{u²u_xx} = 1.517 (red dots). B: a combination of c_{u_x} and c_{u_xx} = 0.20 (blue dash) against a model including c_{uu_xx} (red dots). C: a model with c_{u_xx} = 0.12 (blue dash) against a model including c_u (red dots).

We propose an explanation for the poor performance of PDE-FIND on the ABM data. The PDE-FIND algorithm solves a sparse regression problem to fit linear combinations of spatial derivatives to the time derivative. When data are noiseless or near noiseless, the temporal and spatial derivatives can be accurately estimated, and so the relationship between spatial and temporal derivatives can be inferred from the observed data. In this context, comparing model predictions by their performance with respect to the L²-loss in the learned temporal derivative retrieves the ground truth. However, when data are noisy, different linear combinations of spatial derivatives can have a performance comparable to that of the ground truth. In the SI, we visualize this by selecting, for each data set, two samples whose PDE-FIND coefficients contain terms different to those of the ground truth PDE. While both the ground truth and the PDE-FIND fit reproduce the observed temporal derivative qualitatively, there is no guarantee that this yields solutions that resemble the observed data when the PDE is numerically integrated. We illustrate this in Figure 7, where for each of Cases I, II and III we select two parameter combinations estimated by PDE-FIND: one that, when integrated, resembles the observed data, and one whose prediction bears no resemblance to the observed data.

We note that the poor performance of the integrated model predictions should not be surprising. PDE-FIND does not take into account modelling constraints, for example that the model be given in standard conservation form. A possible solution to this specific problem could be to write the different terms in the candidate library in flux form. More generally, though, it is not obvious how to include all possible constraints on solutions in the composition of the candidate library, and, given that PDE-FIND is intended to discover new governing PDEs, practitioners cannot be expected to encode many such constraints into the candidate library. What this striking difference in the predictive capabilities of the learned PDEs reveals is that coefficients that optimize Equation (7) do not necessarily perform well under a criterion that takes into account the spatial differences of the solution. One such metric is the L²-loss for square-integrable functions, defined as

‖f − g‖² = ∫ (f(x) − g(x))² dx.   (13)

This contrasts with the optimization criterion in Equation (7), which we will refer to as the L²-loss between the derivatives. We show that these two loss functions are indeed very different in Figure 8. As our reference data, we choose the average over all time series from the data set for unbiased motion and, for each point in a region of parameter space that only contains c_{u_xx} and c_{uu_xx}, we approximate the two loss functions: first between derivatives, as considered by PDE-FIND in Equation (3), and subsequently the L²-loss between the integrated solution and the empirical density from Equation (13), estimated by summing the squared error at five time points,

d(X^obs, X^sim) = Σ_{i=1}^{5} ‖X^obs_{50i} − X^sim_{50i}‖²,   (14)

where ‖·‖² denotes the squared L² norm for vectors. We choose this metric so that it captures both early and late time dynamics, which we noted is important when distinguishing, for example, Cases I and III.
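The distance of Equation (14) can be computed directly from the stored density arrays; a short sketch, where rows of X_obs and X_sim are assumed to be indexed by time step with spacing dt:

```python
import numpy as np

def density_distance(X_obs, X_sim, dt=2.0):
    """Equation (14): summed squared l2 error between observed and simulated
    density profiles at t = 50, 100, 150, 200, 250."""
    rows = [int(50 * i / dt) for i in range(1, 6)]
    return sum(np.sum((X_obs[r] - X_sim[r]) ** 2) for r in rows)
```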
Here, the choice of five time points could be refined in future work, as more refined choices of the time discretization may yield better results, but we note that this is a reasonable choice to capture the dynamics throughout the time series. Figure 8 demonstrates a significant difference between the loss landscapes of the two error metrics. While both loss landscapes show a minimum around the true parameter value, the derivative loss landscape in Figure 8 is unable to distinguish different regions of parameter space. For example, errors on the horizontal and vertical axes are equal, meaning that PDE-FIND is unable to distinguish between these two cases. On the other hand, considering the fit of the integrated solutions to the reference data is much more consistent: the loss landscape shows clear regions of parameter space with increasingly higher error as the model moves further from the ground truth, with the worst performance found in the model where u_xx is not included. This is a desirable property, since it helps to identify the true model from the data, which cannot be done by looking only at the observed differences in derivatives.

Figure 8: Heatmaps showing losses with respect to different training objective functions. A: loss landscape for the L² error computed between the derivatives, as in Equation (3), for each pair of c_{uu_xx} and c_{u_xx}, overlaid with actual PDE-FIND predictions. B: loss landscape for the L² error computed between the numerically integrated PDE model prediction and the observed data, as in Equation (14), for each pair of c_{uu_xx} and c_{u_xx}, overlaid with actual PDE-FIND predictions.

We now showcase how Bayesian methods can be used to significantly improve the quality of the model predictions arising from the PDEs learned from observed data. Recall that the aim is to reduce uncertainty regarding which coefficients to include in the model, to reduce uncertainty in the estimated model parameters, and to improve the posterior predictive ability of the model by finding a posterior distribution in parameter space that takes into account properties of the actual observed density profiles. For each of the noisy test cases, that is, data sets D_1^u, D_1^b and D_1^p, we apply Algorithm 2. As a proof of concept in the implementation of Algorithm 2, we choose the simplest possible ABC implementation and perform ABC rejection sampling, as detailed in the SI. To capture agreement between model predictions and the observed data, we carry out ABC inference by numerically computing, for each sampled parameter, the solution of the predicted PDE. Recall that solutions similar to the ground truth will produce time series in good agreement with the observed densities. Writing X^sim_i, X^obs_i for the simulated and observed densities at time i, we choose the distance function defined in Equation (14). We recall that this analyzes the goodness of fit at selected time points in the trajectory. Note that we do not evaluate the entire time series, but rather consider an appropriate summary statistic, choosing a wide range of time points to capture the different behaviours of the model at different time scales. We use the Pakman package [11], a tool to parallelize ABC inference. The files to carry out the Bayesian inference are found in the electronic supplementary materials.

For Case I, the identification ratios in Table 5 suggest that the two coefficients included in a large number of simulations are c_{u_xx} and c_{uu_xx}, while the other model parameters can be confidently excluded from the model.
This suggests retaining these two parameters in the candidate model and excluding the other parameters. For Case II, we note that the terms identified consistently by PDE-FIND are u_x and u_xx. For Case III, the terms are u, u² and u_xx. Note that PDE-FIND failed to identify one relevant model term in Case II, namely uu_x. Nevertheless, we proceed with the application of the rest of Algorithm 2 to find posterior distributions in parameter space. For the ABC implementation, we set ε = 0.3 for Case I and ε = 2 for Cases II and III.

Figure 9: Comparisons of the prior (grey) with the posteriors obtained by ABC rejection sampling (red). A: Case I, ε = 0.3. B: Case II, ε = 2. C: Case III, ε = 2.

Figure 9 reveals that in all three cases ABC rejection can significantly reduce the uncertainty associated with the PDE coefficients. In Cases I and II, uncertainty regarding which parameters to include is completely removed, as the posterior is confidently supported only on the diffusion parameter axis in Case I and on a region where both c_{u_x} and c_{u_xx} are nonzero in Case II. Hence, Algorithm 2 removes uncertainty in the model choice. In Case III, the obtained posterior still estimates some diffusion parameters as zero; however, the share of nonzero c_{u_xx} in the posterior is equal to 0.75, compared to 0.57 in the prior, meaning that the posterior has more mass close to the true parameter value. To assess the quality of the resulting posteriors in generating output that matches the observed data, we run a posterior predictive check. For each parameter value in the posterior distribution, we use numerical integration to obtain a prediction for the density at t = 250. We then plot the 5% and 95% quantiles and overlay these with a data trace from the data sets for Cases I, II and III in Figure 10.

Figure 10: Posterior predictive check for the three cases. All parameter values in the posterior distribution are used to compute a prediction for the density at t = 250. We plot the 5% and 95% percentiles at each spatial location, and overlay these with a randomly chosen data trace from the data set.

The posterior predictive check in Figure 10 reveals that the model solutions resemble the real-world data, even if the wide spread of parameters leads to wide uncertainty bounds. Especially in Case III, with its large degree of model misspecification, we see that the noise in the data translates directly to low confidence around the origin, reflecting that, even when ABC is applied, some model misspecification is picked up. This problem is much less apparent in Case I, where the posterior distribution has learned the correct distribution, and in Case II, where the posterior has learned the correct structure of the model. We conclude that even in the presence of model misspecification, such as in Case II, it is possible to obtain a posterior distribution in parameter space that has good predictive power. Compare this to Figure 7, where the integrated model solutions failed to resemble the empirical data. We conclude that taking a Bayesian approach to supplement PDE-FIND yields reasonable estimates, whereas the point estimates otherwise obtained using PDE-FIND would not have matched the observed data.

Conclusions

This paper proposed a framework to perform uncertainty quantification when applying EQL methods to noisy data. By simulating a large amount of synthetic data from an ABM, we show that the PDE-FIND algorithm produces predictions that have great variability both in the structure of the learned models and in the parameter values. We conclude that when data are corrupted with large amounts of noise, the PDE-FIND algorithm can learn models that are uninformative of the correct governing equation. In contrast, when we control for the noise and present the PDE-FIND algorithm with data that are sufficiently detailed and contain little noise, the structure of the learned equations is correct, but there is still variability in the learned parameters. By using a combined approach of PDE-FIND and ABC, it is possible to obtain posterior distributions in parameter space that translate directly into uncertainty quantification in model predictions. We believe that this can enable practitioners both to confidently report the range of possible models that can explain a given data set and to obtain practical error bounds for model predictions.

After applying ABC to the PDE-FIND-generated prior distribution, not all model misspecification was removed from the observations. Researchers may find it worthwhile to explore different, more efficient implementations of the ABC algorithm used here. For example, one may consider using a sequential Monte Carlo (ABC-SMC) approach [8], which carefully propagates the prior through a series of intermediate distributions to a final, posterior distribution.
There is a wide range of studies, see for example [12], that elucidate how ABC-SMC is well suited to finding complicated posterior distributions from a wide prior distribution. We expect that better-tuned approaches to ABC can help improve the quality of the posteriors found in this work. Likewise, within the given ABC implementation, one may wish to consider a more detailed set of summary statistics, or a different choice of distance function. Nevertheless, we find that even the simplest possible ABC implementation yields an effective and practical uncertainty quantification. For practitioners wishing to obtain these error bounds without needing to consider the mathematical intricacies of more sophisticated ABC implementations, we believe that our ABC implementation provides an appropriate tool.

Another, orthogonal, direction of research concerns the application of denoising methods to the data prior to training the PDE-FIND algorithm. In this paper, we controlled for the intrinsic noise in the observations to compare the performance of PDE-FIND predictions. In applications of PDE-FIND to real-life data, practitioners will typically not be able to control the amount of observation noise in their data. Hence, methods to smooth data and reduce observation noise may be an attractive extension to PDE-FIND. When using spatial subsampling as a method to denoise data, we found that while the degree of model misspecification was decreased, the structure of the model was still highly variable, with large uncertainty in the learned parameters. We note that even state-of-the-art methods to reduce noise will not retrieve the true signal behind the data, and so uncertainty quantification remains important, since ultimately the precise relation between noise in the data and uncertainty in the learned equations is unknown. We highlight two methods to denoise data. Convolution with a Gaussian kernel is ubiquitous in the information sciences and is often used to denoise data. In addition, an implementation of PCA for EQL has been proposed by Li et al. [7]. In the supplementary information, we show that these two methods still yield variable and unreliable predictions, meaning that with the present selection of denoising methods, uncertainty quantification remains necessary for applications of PDE-FIND to realistic biological data.

This article proposed, as an explanation for the poor performance of PDE-FIND on noisy data, that the loss function used by PDE-FIND does not take into account spatial information in the predictions made by the learned model. To avoid the problem of predictions that fit the temporal derivative well in the presence of noise but fail to capture important features of the observed data, we show that Bayesian methods can help incorporate this information into predictions. PDE-FIND is not the only method for learning differential equations from data. When data are heavily corrupted by noise, it may be worthwhile to use a deep neural network approach [6, 3] instead of the PDE-FIND algorithm, to simultaneously learn the spatio-temporal dynamics while correcting for the presence of noise.
References

[1] Samuel H. Rudy, Steven L. Brunton, Joshua L. Proctor, and J. Nathan Kutz. Data-driven discovery of partial differential equations. Science Advances, 3(4), 2017.

[2] John T. Nardini, Ruth E. Baker, Matthew J. Simpson, and Kevin B. Flores. Learning differential equation models from stochastic agent-based model simulations, 2020.

[3] John H. Lagergren, John T. Nardini, Ruth E. Baker, Matthew J. Simpson, and Kevin B. Flores. Biologically-informed neural networks guide mechanistic modeling from sparse experimental data, 2020.

[4] Kathleen Champion, Bethany Lusch, J. Nathan Kutz, and Steven L. Brunton. Data-driven discovery of coordinates and governing equations. Proceedings of the National Academy of Sciences, 116(45):22445–22451, 2019.

[5] M. Raissi, P. Perdikaris, and G. E. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.

[6] John H. Lagergren, John T. Nardini, G. Michael Lavigne, Erica M. Rutter, and Kevin B. Flores. Learning partial differential equations for biological transport models from noisy spatio-temporal data. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 476(2234):20190800, 2020.

[7] Jun Li, Gan Sun, Guoshuai Zhao, and Li-wei H. Lehman. Robust low-rank discovery of data-driven partial differential equations. Proceedings of the AAAI Conference on Artificial Intelligence, 34(01):767–774, 2020.

[8] Mikael Sunnåker, Alberto Giovanni Busetto, Elina Numminen, Jukka Corander, Matthieu Foll, and Christophe Dessimoz. Approximate Bayesian computation. PLOS Computational Biology, 9(1):1–10, 2013.

[9] M. J. Simpson, B. D. Hughes, and K. A. Landman. Diffusing populations: Ghosts or folks? Australasian Journal of Engineering Education, 15(2):59–68, 2009.

[10] Wessel N. van Wieringen. Lecture notes on ridge regression, 2020.

[11] Thomas F. Pak, Ruth E. Baker, and Joe M. Pitt-Francis. Pakman: a modular, efficient and portable tool for approximate Bayesian inference. Journal of Open Source Software, 5(47):1716, 2020.

[12] Sarah Filippi, Chris Barnes, Julien Cornebise, and Michael P. H. Stumpf. On optimality of kernels for approximate Bayesian computation using sequential Monte Carlo, 2012.

Supplementary Information: Bayesian uncertainty quantification for data-driven equation learning
Variability of PDE-FIND coefficients
In this section, we report the full data on the learned PDE-FIND coefficients trained on the different data sets. This includes the identification ratios for all coefficients in all experiments, as well as plots of the joint distributions of all the learned parameters.
Full data on identification ratios
We report the identification ratios for the problems considered in the main text, together with the additional experiments performed for this supplementary information, in the table below. Each column gives the identification ratio $a_j$ of one term in the candidate library ($1$, $u$, $u^2$, $u_x$, $u \cdot u_x$, $u^2 \cdot u_x$, $u_{xx}$, $u \cdot u_{xx}$, $u^2 \cdot u_{xx}$); each row corresponds to one experiment, namely STRidge trained on the data sets $D_u$, $D_b$ and $D_p$ at several values of $d_{tol}$.

[Table: identification ratios $a_j$ for all experiments, where $j$ is given as the term in the PDE in each column.]
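For reference, a minimal sketch of how such ratios can be computed, assuming (as the term suggests) that the identification ratio of a term is the fraction of independently fitted models in which that term's coefficient is nonzero; the variable names are hypothetical.

```python
import numpy as np

def identification_ratios(coef_matrix, term_names):
    """coef_matrix: (n_fits, n_terms) array of learned PDE-FIND coefficients,
    one row per independently fitted time series. The identification ratio
    of a term is the fraction of fits in which its coefficient is nonzero."""
    ratios = np.mean(coef_matrix != 0, axis=0)
    return dict(zip(term_names, ratios))

terms = ["1", "u", "u^2", "u_x", "u*u_x", "u^2*u_x",
         "u_xx", "u*u_xx", "u^2*u_xx"]
# ratios = identification_ratios(learned_coefs, terms)
```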
Empirical distribution of all learned PDE-FIND parameters

In this section we report the pairwise distribution of all learned PDE-FIND parameters in Case I, Case II and Case III, in both the low-noise and high-noise regimes. All terms in the library are ordered as: $1, u, u^2, u_x, uu_x, u^2u_x, u_{xx}, uu_{xx}, u^2u_{xx}$.

Figure 11: Pairwise correlation plots for coefficients trained on $D_u$, $D_b$ and $D_p$ (low- and high-noise data sets).
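Pairwise plots such as those in Figure 11 can be produced directly from the ensemble of learned coefficient vectors. A minimal matplotlib sketch, with hypothetical inputs:

```python
import matplotlib.pyplot as plt

def pairwise_plots(coefs, term_names):
    """coefs: (n_fits, n_terms) array of learned PDE-FIND coefficients.
    Diagonal panels show marginal histograms; off-diagonal panels show
    pairwise scatter plots of the learned coefficients."""
    n = coefs.shape[1]
    fig, axes = plt.subplots(n, n, figsize=(2 * n, 2 * n))
    for i in range(n):
        for j in range(n):
            if i == j:
                axes[i, j].hist(coefs[:, i], bins=30)
            else:
                axes[i, j].scatter(coefs[:, j], coefs[:, i], s=2)
            if i == n - 1:
                axes[i, j].set_xlabel(term_names[j])
            if j == 0:
                axes[i, j].set_ylabel(term_names[i])
    return fig
```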
Denoising approaches

In this section, we detail how denoising approaches were used to denoise the ABM data, and show the variability of the learned coefficients under each of these approaches.

Rescaled Robust PCA

Li et al. [7] propose an approach to perform PDE-FIND on noisy data. Mathematically, denote by $U \in \mathbb{C}^{n \times m}$ the data matrix given on a discretized domain $x \in [a, b]$ and $t \in [0, T]$ for the discretely and noisily sampled measurements from the function $u(x, t)$ that satisfies the PDE. The authors observe that the underlying data $u(x, t)$ are often low-rank, and so they suppose that $U$ and its time derivative $U_t$ can be decomposed as
$$U = Z + E_1, \qquad U_t = \mathcal{D}(Z, Q)\,\xi + E_2, \qquad (15)$$
where $Z$ and $\mathcal{D}(Z, Q)$ have low rank and $E_1$, $E_2$ are sparse. Informally, $Z$ represents the clean data, $\mathcal{D}(Z, Q)$ the clean derivative matrix, and $E_1$, $E_2$ the perturbations of the clean data and derivatives, respectively. The method developed in [7] optimally finds $Z$ and $\mathcal{D}(Z, Q)$ from the data matrix by solving the optimization problem
$$\min_{\xi, Z, E_1, E_2} \; \|Z\|_* + \|\mathcal{D}(Z, Q)\,\xi\|_* + \lambda_1 R(\xi) + \lambda_2 \|E_1\|_1 + \lambda_3 \|E_2\|_1 \qquad (16)$$
such that $U = Z + E_1$ and $U_t = \mathcal{D}(Z, Q)\,\xi + E_2$. Here, $R(\xi)$ is a sparse regularization of the parameters $\xi$, and the $\lambda_i$, $i = 1, 2, 3$, are regularization parameters. Robust PCA (rPCA) is used to find $Z$, after which Low-rank STRidge (LrSTR) is used to construct $\mathcal{D}(Z, Q)$ from $Z$, and $U_t$ is denoised prior to applying PDE-FIND. The final algorithm, which uses rPCA and LrSTR in tandem, is called Double Low-rank Sparse Regression (DLrSR). While the examples given by Li et al. in [7] prove capable of retrieving the true parameters of the PDEs with remarkable accuracy, the choice of regularization parameters is non-trivial and makes implementation challenging, since the many degrees of freedom complicate choosing a robust parameter set.

As shown in Algorithm 1 of [7], one of these parameters, which we denote simply by λ in what follows, corresponds to the penalty in rPCA. One must choose λ carefully, since different sizes of this parameter can yield unfavorable predictions when used in tandem with PDE-FIND. We start by plotting the output of empirical data processed by rPCA at several values of λ.

Figure 17: Plots of denoised data $Z$ (points) compared to original data (solid lines), plotted at t = 0 (red) and t = 1000 (black). (a) λ = 0.001. (b) λ = 0.015. (c) λ = 0.06. (d) λ = 0.… Small values of λ favor interpreting the entire signal as noise and output $Z = 0$. Larger values of λ do not penalize the noise and return a solution that is identical to the original signal. In the intermediate regime, the order of magnitude of the rPCA output is distorted.

As seen in Figure 17, choosing a small value of λ in rPCA acts in favor of interpreting the entire signal as noise and outputs $Z = 0$, effectively rendering the output useless. Larger values of λ do not penalize the noise and return a solution that is identical to the original signal, as can be seen in the rightmost plot. In the intermediate regime, the order of magnitude of the rPCA output is distorted: rPCA interprets the data matrix $U$ as sparse, and so the maximum of the rPCA output for the densities has comparable order of magnitude at t = 0 and t = 500, effectively obscuring the biological mechanism at play. However, the behavior of the rPCA output in the intermediate regime is qualitatively good, so we suggest scaling the rPCA output so that it has the appropriate order of magnitude. Since the maximum of the density is also influenced by noise, it should not be the only factor considered in the scaling; rather, the mean of the data should be included too. Heuristically, we let $Z$ be the output of rPCA and $U$ the original matrix, such that $Z_i$, $U_i$ denote the $i$-th rows of $Z$, $U$ respectively (corresponding to the $i$-th time point). Let $\bar{Z}_i$, $\bar{U}_i$ be the means of the vectors $Z_i$, $U_i$ respectively, and $Z_i^{\max}$, $U_i^{\max}$ be their maxima. Then, we re-scale $Z$ to $\tilde{Z}$ such that
$$\tilde{Z}_i = \frac{1}{2} \left( \frac{\bar{U}_i}{\bar{Z}_i} + \frac{U_i^{\max}}{Z_i^{\max}} \right) Z_i. \qquad (17)$$
A minimal sketch of this rescaling is given at the end of this subsection. Focusing on intermediate values of λ, we now plot the rescaled rPCA output compared to the original solutions in Figure 18. From these plots we decide to use the rescaled output of the DLrSR algorithm in [7] on dataset $D_u$, choosing for the rPCA the value λ = 0.…

Figure 18: Plots of rescaled denoised data $\tilde{Z}$ (points) compared to original data (solid lines), plotted at t = 0 (red) and t = 1000 (black), at values of λ in the intermediate parameter regime. (a) λ = 0.… (b) λ = 0.03. (c) λ = 0.035. (d) λ = 0.04. The order of magnitude of the maximum of the rPCA output is no longer distorted.

Table 5 reports the identification ratios $a_j$ for the coefficients found with the rPCA method. Comparing these values with the values found using STRidge trained on $D_u$ reveals that the method has a much higher variability in the terms that are included in and excluded from the resulting PDE model. Moreover, Figure 19 shows that coefficients are correlated in a complicated, non-trivial manner. The correlation plot also reveals the problem with taking point estimates. In the unbiased case, for example, the pairs of predicted coefficients for the terms $u_{xx}$ and $uu_{xx}$ lie in a region of parameter space where either only one of the two terms is discovered, or both terms are picked up with values that are not small. This implies that any point estimate derived from these data is unreliable, and careful consideration needs to be given to the correlations between terms. Figure 19b also shows that the support of the PDE-FIND predictions for the diffusion coefficient obtained using DLrSR on $D_u$ is similar in size to that obtained with naive STRidge. This means that, with the intrinsic noise in the sample, rPCA yields estimates for the diffusion coefficient that are no less variable than those of the original PDE-FIND method. Table 5 summarizes the values of the identification ratios $a_j$. It can be seen that no terms in the PDE are conclusively identified, as all ratios are either small or in the intermediate range. Most saliently, rPCA identifies the diffusion coefficient in only 38% of samples in the data set. Hence, even when using a state-of-the-art method for denoising observations, it is important to statistically quantify uncertainty in the learned coefficients.
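A minimal sketch of the rescaling in Eq. (17), assuming $Z$ and $U$ are arrays with one row per time point; in practice, rows where the rPCA output is identically zero would need guarding against division by zero.

```python
import numpy as np

def rescale_rpca_output(Z, U):
    """Rescale the rPCA output Z row-by-row following Eq. (17): each row
    (one time point) is scaled by the average of the ratio of row means
    and the ratio of row maxima between the raw data U and the output Z."""
    Z_tilde = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        mean_ratio = U[i].mean() / Z[i].mean()   # ratio of row means
        max_ratio = U[i].max() / Z[i].max()      # ratio of row maxima
        Z_tilde[i] = 0.5 * (mean_ratio + max_ratio) * Z[i]
    return Z_tilde
```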
Convolution with Gaussian kernel: an alternative approach to denoising

In many applications, noisy data can be successfully smoothed using a suitable kernel. We choose to smooth our data using a Gaussian kernel with standard deviation 2, so that it is roughly equivalent to the scale used in the spatial discretization; a short sketch is given at the end of this subsection. Summary values for the identification ratios $a_j$ are found in Table 5. While the convolution is highly successful in recovering one of the relevant terms in the dynamical system (for example, the diffusion term in $D_u$, the bias term in $D_b$ and the proliferation term in $D_p$), the method regularly fails to recover the diffusion term in $D_b$ and $D_p$.

Figure 19: Joint distribution of $c_{u_{xx}}$ and $c_{uu_{xx}}$ trained on ABM data of unbiased motion without proliferation, where each sample is one time series. The location of the true parameter value is indicated by a black star. (a) Naive application of STRidge to the ABM data. (b) ABM data pre-processed with rPCA, as detailed in this SI, before training PDE-FIND. (c) ABM data pre-processed by smoothing with a Gaussian kernel with σ = 2 before training PDE-FIND.
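As a sketch, smoothing with a Gaussian kernel of standard deviation 2 grid points can be done with scipy; `u_data` and the file name are hypothetical placeholders for the user's own (time × space) array of noisy densities.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Hypothetical (n_times, n_space) array of noisy ABM densities.
u_data = np.load("abm_densities.npy")  # placeholder for the user's own data

# Smooth along the spatial axis with a Gaussian kernel of standard deviation
# sigma = 2 grid points before passing the data to PDE-FIND.
u_smoothed = gaussian_filter1d(u_data, sigma=2, axis=1)
```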
Tuning of PDE-FIND hyperparameters

Following Nardini et al. [2], we implement cross-validation for finding the optimal value of the regularization parameter in the STRidge algorithm. As training data, we randomly select a subset of 10 data points from the ensemble. For each λ in a grid of candidate values, we perform leave-one-out cross-validation, that is, we train PDE-FIND on the left-out data sample and compute the $L^2$-loss of the integrated solution at t = 250 against the samples not used for training. This gives an optimal value of λ = 0.5; a sketch of the procedure is given below.
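A minimal sketch of this leave-one-out selection, where `fit_pdefind` and `integrate` are hypothetical wrappers around a PDE-FIND/STRidge implementation and a numerical PDE solver, and the data structures are assumed rather than taken from the authors' code.

```python
import numpy as np

def cv_select_lambda(train_data, target_profiles, candidate_lambdas,
                     fit_pdefind, integrate):
    """Leave-one-out selection of the STRidge regularization parameter.
    train_data: list of time series used to fit PDE-FIND;
    target_profiles: list of observed density profiles at t = 250,
    one per series, used for evaluation."""
    losses = {}
    for lam in candidate_lambdas:
        loss = 0.0
        for k in range(len(train_data)):
            coefs = fit_pdefind(train_data[k], lam)  # fit on the held-out series
            pred = integrate(coefs)                  # predicted density at t = 250
            # L2 loss against the series not used for training
            loss += sum(np.sum((pred - target_profiles[m]) ** 2)
                        for m in range(len(train_data)) if m != k)
        losses[lam] = loss
    return min(losses, key=losses.get)               # lambda with smallest loss
```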
To illustrate the role of increasing the tuning parameter, Table 6 shows the values of the identification ratios $a_j$ at different values of λ. For these values, the main text reports the empirical distribution of the PDE-FIND coefficients. Note that λ = 0.01 corresponds to the original experiment.
[Table 6: identification ratios $a_j$ at different values of λ, where $j$ is given as the term in the PDE in each column.]

Comparison of model predictions

When data are noisy, different linear combinations of spatial derivatives can outperform the temporal derivative when measured according to the $L^2$-loss. We visualize this by selecting two time data points for each of the data sets whose PDE-FIND coefficients contain terms different from the ground truth PDE. We fit the predicted PDE to the temporal derivative at t = 20 and at t = 250, and compare it to the fitted ground truth using the estimated spatial derivatives. Figures 20 and 21 compare the observed temporal derivatives with these fitted values for the temporal derivative. In five of the six plots, both the ground truth and the PDE-FIND temporal derivative fit reproduce the observed temporal derivative qualitatively. A priori, there is no guarantee that the $L^2$-loss is optimized by the fitted ground truth. However, Figure 22 shows that some of the misspecified coefficients do not yield predictions in close agreement with the observed density data. This illustrates the finding in the main text that close agreement between the fitted temporal derivatives need not imply that the model predictions are adequate. The two loss metrics at play are sketched below.

Figure 20: Comparisons of the observed time derivative at t = 20 with the predicted time derivative using misspecified PDE-FIND coefficients (blue) compared to the fitted ground truth (red). Fitted PDEs in panels (a)–(f) match the panels in Figure 22.

Figure 21: Comparisons of the observed time derivative at t = 250 with the predicted time derivative using misspecified PDE-FIND coefficients (blue) compared to the fitted ground truth (red). Fitted PDEs in panels (a)–(f) match the panels in Figure 22.
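For concreteness, the two competing metrics can be written down directly. This sketch assumes `library` is the matrix of candidate spatial terms evaluated on the data and `u_t` the estimated temporal derivative; `integrate` is again a hypothetical wrapper around a numerical PDE solver.

```python
import numpy as np

def derivative_loss(coefs, library, u_t):
    """L2 error between the observed temporal derivative and the library fit;
    this is the quantity PDE-FIND optimizes directly."""
    return np.sum((library @ coefs - u_t) ** 2)

def prediction_loss(coefs, u_observed, integrate):
    """L2 error between the observed densities and the integrated model
    solution; small derivative loss does not guarantee small prediction loss."""
    u_predicted = integrate(coefs)
    return np.sum((u_predicted - u_observed) ** 2)
```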
Approximate Bayesian Computation

In the main text we identify a significant difference between the loss landscapes of the different error metrics for the computed derivative and the spatial solution of the PDE. Here, we show that we may also take as a reference point a single, noisy observation instead of the average of the observations in the ensemble, although the quality of the resulting posterior is decreased. For completeness, we plot the different loss landscapes in Figure 23.

The high amount of noise in the single-measurement case means that the loss landscape for the derivative does not have a minimum around the true parameter value, implying that minimizing with respect to this norm is not guaranteed to converge to the correct solution when only a single measurement is available. This is very different in the low-noise case, where the loss landscape clearly has a minimum around the true parameter value. On the other hand, comparing the fit of integrated solutions is much more consistent in the case of a single, very noisy measurement, and retrieves low measurement error in a region around the true parameter value. When the loss is computed on the averaged data, this region is smaller and the range of the error is larger, suggesting that this loss function is better able to distinguish between solutions. The poor performance of the integrated model predictions should not be surprising: PDE-FIND does not take into account modeling constraints, for example that the model be mass-conserving. A possible solution to this specific problem could be to write the different terms in the candidate library in flux form.
Figure 22: Comparisons of model predictions using misspecified PDE-FIND coefficients (blue) with the ground truth (red). (a) Unbiased motion, misspecified model with $c_{u_{xx}} = 0.…$ and $c_{uu_{xx}} = 0.26$. (b) Unbiased motion, misspecified model with $c_{u_{xx}} = 0$ and $c_{u^2 u_{xx}} = 1.52$. (c) Biased motion, misspecified model with $c_{u_{xx}} = 0.20$ and $c_{u_x} = 0.…$ (d) Misspecified model with $c_{u_{xx}} = 0$ and $c_{uu_{xx}} = 0.64$. (e) Proliferation and unbiased motion, misspecified model with $c_{u_{xx}} = 0$ and $c_u = 0.…$ (f) Misspecified model with $c_{u_{xx}} = 0.12$ and $c_u = 0.…$

More generally, though, it is not obvious how to include all possible constraints on solutions in the composition of the candidate library. Especially given that PDE-FIND is needed precisely to discover unknown governing PDEs, practitioners cannot be expected to build many such constraints into the candidate library. If a modeler were able to construct the candidate library so as to incorporate all constraints exactly, perhaps other methods of fitting the relevant equation parameters would be more appropriate, avoiding the need to use EQL methods altogether.

Approximate Bayesian Computation is a popular tool to perform likelihood-free Bayesian inference. This is useful in cases where the likelihood function of the observed data given the model parameters is intractable or difficult to compute, but simulation given model parameters is feasible. In ABC, a prior $\pi$ is specified over parameter space that reflects prior information about the model parameter $\theta$. In each step, $\theta \sim \pi$ is drawn, and a model simulation is run using $\theta$, yielding the simulated data $X^{\mathrm{Sim}}$. The simulated data are compared to the observed data $X^{\mathrm{Obs}}$ according to a distance function $d(X^{\mathrm{Sim}}, X^{\mathrm{Obs}})$ provided by the practitioner.

Figure 23: Heatmap showing the $L^2$ error for each pair of $c_{uu_{xx}}$ and $c_{u_{xx}}$, overlaid with actual PDE-FIND predictions. (a) Derivative $L^2$ loss when compared to a single data point. (b) Derivative $L^2$ loss when compared to the average. (c) Solution comparison with respect to a single data point. (d) Solution comparison with respect to the average over all of the time series in the ensemble.
Given a tolerance $\epsilon > 0$, the sampled $\theta$ is accepted whenever $d(X^{\mathrm{Sim}}, X^{\mathrm{Obs}}) < \epsilon$. The process finishes when a pre-specified number of parameter samples have been accepted. The posterior distribution is then given by the accepted parameter samples. We refer to [8] for the details; a minimal sketch of the rejection scheme is given at the end of this section.

In the main text, we choose a two-dimensional parameter space that contains only $u_{xx}$ and $uu_{xx}$ as terms in the PDE, and parametrize the normal marginals used in the main text by computing the empirical means of $c_{u_{xx}}$ and $c_{uu_{xx}}$, conditional on them being nonzero. We assign the following relative weight $m_j$ to the atom at 0 for coefficient $j \in \{u_{xx}, uu_{xx}\}$:
$$m_j = \frac{\sum_{i=1}^{N} \mathbb{1}\left(c^i_j = 0\right)}{N - \sum_{i=1}^{N} \mathbb{1}\left(c^i_{u_{xx}} = 0,\; c^i_{uu_{xx}} = 0\right)}. \qquad (18)$$
This gives the prior distribution shaded in grey in each of the panels of Figure 9 in the main text. Letting $X^{\mathrm{sim}}_i$, $X^{\mathrm{obs}}_i$ be the simulated and observed densities at time $i$, we choose the distance function
$$d(X^{\mathrm{obs}}, X^{\mathrm{sim}}) = \sum_{i} \left\| X^{\mathrm{obs}}_{50i} - X^{\mathrm{sim}}_{50i} \right\|_2, \qquad (19)$$
where $\|\cdot\|_2$ represents the $L^2$-norm. We perform two experiments. In the first, we choose for $X^{\mathrm{obs}}$ the time series used to compute the loss landscape of Figure 23 and choose $\epsilon = 0.…$ In the second, we choose for $X^{\mathrm{obs}}$ the average over all time series in the ensemble and set $\epsilon = 0.…$
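For concreteness, the rejection scheme and the distance function of Eq. (19) can be sketched as follows. This is a minimal illustration, not the authors' implementation; `prior_sampler` and `simulate` are hypothetical callables wrapping the prior defined via Eq. (18) and a numerical PDE solver, and the observed and simulated data are assumed to be lists of density profiles at the compared time points.

```python
import numpy as np

def distance(x_obs, x_sim):
    """Sum of L2 distances between observed and simulated density profiles,
    compared profile by profile as in Eq. (19)."""
    return sum(np.linalg.norm(o - s) for o, s in zip(x_obs, x_sim))

def abc_rejection(prior_sampler, simulate, x_obs, eps, n_accept):
    """Basic ABC rejection sampler: draw theta from the prior, simulate,
    and keep theta whenever the simulated data fall within distance eps
    of the observations. Returns the accepted samples (the posterior)."""
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sampler()          # draw theta ~ pi
        x_sim = simulate(theta)          # integrate the candidate PDE
        if distance(x_obs, x_sim) < eps:
            accepted.append(theta)
    return np.array(accepted)
```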