Stability selection enables robust learning of partial differential equations from limited noisy data
Suryanarayana Maddu, Bevan L. Cheeseman, Ivo F. Sbalzarini, Christian L. Müller
Chair of Scientific Computing for Systems Biology, Faculty of Computer Science, TU Dresden, 01069 Dresden, Germany
Center for Systems Biology Dresden, Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
Cluster of Excellence Physics of Life, TU Dresden, 01062 Dresden, Germany
Center for Computational Mathematics, Flatiron Institute, New York, USA
*Correspondence to: cmueller@flatironinstitute.org
Abstract

We present a statistical learning framework for robust identification of partial differential equations (PDEs) from noisy spatiotemporal data. Extending previous sparse regression approaches for inferring PDE models from simulated data, we address key issues that have thus far limited the application of these methods to noisy experimental data, namely their robustness against noise and the need for manual parameter tuning. We address both points by proposing a stability-based model selection scheme to determine the level of regularization required for reproducible recovery of the underlying PDE. This avoids manual parameter tuning and provides a principled way to improve the method's robustness against noise in the data. Our stability selection approach, termed PDE-STRIDE, can be combined with any sparsity-promoting penalized regression model and provides an interpretable criterion for model component importance. We show that in particular the combination of stability selection with the iterative hard-thresholding algorithm from compressed sensing provides a fast, parameter-free, and robust computational framework for PDE inference that outperforms previous algorithmic approaches with respect to recovery accuracy, amount of data required, and robustness to noise. We illustrate the performance of our approach on a wide range of noise-corrupted simulated benchmark problems, including 1D Burgers, 2D vorticity-transport, and 3D reaction-diffusion problems. We demonstrate the practical applicability of our method on real-world data by considering a purely data-driven re-evaluation of the advective triggering hypothesis for an embryonic polarization system in C. elegans. Using fluorescence microscopy images of
C. elegans zygotes as input data, our framework is able to recover the PDE model for the regulatory reaction-diffusion-flow network of the associated proteins.

Predictive mathematical models, derived from first principles and symmetry arguments and validated in experiments, are of key importance for the scientific understanding of natural phenomena. While this approach has been particularly successful in describing spatiotemporal dynamical systems in physics and engineering, it has not seen the same degree of success in other scientific fields, such as neuroscience, biology, finance, and ecology. This is because the underlying first principles in these areas remain largely elusive. Nevertheless, modeling in those areas has seen increasing use and relevance to help formulate simplified mathematical equations where sufficient observational data are available for validation [1, 2, 3, 4, 5]. In biology, modern high-throughput technologies have enabled collection of large-scale data sets, ranging from genomics, proteomics, and metabolomics data, to microscopy images and videos of cells and tissues. These data sets are routinely used to infer parameters in hypothesized models, or to perform model selection among a finite number of alternative hypotheses [6, 7, 8]. The amount and quality of biological data, as well as the performance of computing hardware and computational methods, have now reached a level that promises direct inference of mathematical models of biological processes from the available experimental data. Such data-driven approaches seem particularly valuable in cell and developmental biology, where first principles are hard to come by, but large-scale imaging data are available, along with an accepted consensus of which phenomena a model could possibly entail. In such scenarios, data-driven modeling approaches have the potential to uncover the unknown first principles underlying the observed biological dynamics.

Biological dynamics can be formalized at different scales, from discrete molecular processes to the continuum mechanics of tissues. Here, we consider the macroscopic, continuum scale where spatiotemporal dynamics are modeled by Partial Differential Equations (PDEs) over coarse-grained state variables [9, 3]. PDE models have been used to successfully address a range of biological problems, from embryo patterning [10] to modeling gene-expression networks [11, 12] to predictive models of cell and tissue mechanics during growth and development [13]. They have shown their potential to recapitulate experimental observables in cases where the underlying physical phenomena are known or have been postulated [14]. In many biological systems, however, the governing PDEs are not (yet) known, which limits progress in discovering the underlying physical principles. Thus, it is desirable to verify existing models, or even discover new models, by extracting governing laws directly from measured spatiotemporal data.

For given observable spatiotemporal dynamics, with no governing PDE known, several proposals have been put forward to learn mathematically and physically interpretable PDE models. The earliest work in this direction [15] frames the problem of "PDE learning" as a multivariate nonlinear regression problem where each component in the design matrix consists of space and time differential operators and low-order non-linearities computed directly from data. Then, the alternating conditional expectation (ACE) algorithm [16] is used to compute both optimal element-wise non-linear transformations of each component and their associated coefficients. In [17], the problem is formulated as a linear regression problem with a fixed pre-defined set of space and time differential operators and polynomial transformations that are computed directly from data.
Then, backward elimination is used to identify a compact set of PDE components by minimizing the least-squares error of the full model and then removing terms that reduce the fit the least. In the statistical literature [18, 19], the PDE learning problem is formulated as a Bayesian non-parametric estimation problem where the observed dynamics are learned via non-parametric approximation, and a PDE representation serves as a prior distribution to compute the posterior estimates of the PDE coefficients. Recent influential work revived the idea of jointly learning the structure and the coefficients of PDE models from data in discrete space and time using sparse regression [20, 21, 22]. Approaches such as SINDy [20] and PDE-FIND [21] compute a large pre-assembled dictionary of possible PDE terms from data and identify the most promising components via penalized linear regression formulations. For instance, PDE-FIND is able to learn different types of PDEs from simulated spatiotemporal data, including Burgers, Kuramoto-Sivashinsky, reaction-diffusion, and Navier-Stokes equations. PDE-FIND's performance was evaluated on noise-free simulated data, as well as on data with up to 1% additive noise, and showed a critical dependency on the proper setting of the tuning parameters, which are typically unknown in practice. Recent approaches attempt to alleviate this dependence by using Bayesian sparse regression schemes for model uncertainty quantification [23] or information criteria for tuning-parameter selection [24]. There is also a growing body of work that considers deep neural networks for PDE learning [25, 26, 27]. For instance, a deep feed-forward network formulation [25], PDE-NET, directly learns a computable discretized form of the underlying governing PDEs for forecasting [25, 28].
PDE-NET exploits the connection between differential operators and orders-of-sum rules of convolution filters [29] in order to constrain network layers to learn valid discretized differential operators. The forecasting capability of this approach was numerically demonstrated for predefined linear differential operator templates. A compact and interpretable symbolic identification of the PDE structure is, however, not available with this approach.

Here, we ask whether, and how, it is possible to extend the class of sparse regression inference methods to work on real, limited, and noisy experimental data. As a first step, we present a statistical learning framework, PDE-STRIDE (STability-based Robust IDEntification of PDEs), to robustly infer PDE models from noisy spatiotemporal data without requiring manual tuning of learning parameters, such as regularization constants. PDE-STRIDE is based on the statistical principle of stability selection [30, 31], which provides an interpretable criterion for any component's inclusion in the learned PDE in a data-driven manner. Stability selection can be combined with any sparsity-promoting regression method, including LASSO [32, 30], iterative hard thresholding (IHT) [33], Hard Thresholding Pursuit (HTP) [34], or Sequential Thresholding Ridge Regression (STRidge) [21]. PDE-STRIDE therefore provides a drop-in solution to render existing inference tools more robust, while reducing the need for parameter tuning. In the benchmarks presented herein, the combination of stability selection with de-biased iterative hard thresholding (IHT-d) empirically showed the best performance and the highest consistency with respect to perturbations of the dictionary matrix and sampling of the data.

This paper is organized as follows: Section 2 provides the mathematical formulation of the sparse regression problem and discusses how the design matrix is assembled.
We also review the concepts of regularization paths and stability selection and discuss how they are combined in the proposed method. The numerical results in Section 3 highlight the performance and robustness of PDE-STRIDE in recovering different PDEs from noise-corrupted simulated data. We also perform an achievability analysis of the PDE-STRIDE + IHT-d inference scheme, assessing consistency and convergence of the recovery probability with increasing sample size. Section 4 demonstrates that the robustness of the proposed method is sufficient for real-world applications: we consider learning a PDE model from noisy biological microscopy images of membrane protein dynamics in a C. elegans zygote. Section 5 provides a summary of our results and highlights future challenges for data-driven PDE learning.
Figure 1: Enabling data-driven mathematical model discovery through stability selection:
We outline the necessary steps in our method for learning PDE models from spatiotemporal data. A: Extract spatiotemporal profiles from microscopy videos of the chosen state variables. Data courtesy of Grill Lab, MPI-CBG / TU Dresden [35]. B: Compile the design matrix Θ and the measurement vector U_t from the data. C: Construct multiple linear systems of reduced size through random sub-sampling of the rows of the design matrix Θ and U_t. D: Solve and record the sparse/penalized regression solutions independently for each sub-sample along the λ-paths. E: Compute the importance measure Π for each component. The histogram shows the importance measure Π for all components at a particular value of λ. F: Construct the stability plot by aggregating the importance measures along the λ-path, leading to separation of the noise variables (dashed black) from the stable components (colored). Identify the most stable components by thresholding Π > π_th. G: Build the PDE model from the identified components.

We outline the problem formulation underlying the data-driven PDE inference considered here. We review important sparse regression techniques and introduce the concept of stability selection used in PDE-STRIDE.
We propose a framework for stable estimation of the structure and parameters of the governing equations of continuous dynamical systems from discrete spatiotemporal measurements or observations. Specifically, we consider PDEs for the multidimensional state variable u ∈ R^d of the form shown in Eq. (2.1.1), composed of polynomial non-linearities (e.g., u², u³), spatial derivatives (e.g., u_x, u_xx), and the parametric dependence modeled through Ξ ∈ R^p:

∂u/∂t = F( [u, u², u³, u_xx, u u_x, ...], x, t, Ξ ).   (2.1.1)

Here, F(·) is the function map that models the spatiotemporal non-linear dynamics of the system. For simplicity, we limit ourselves to forms of the function map F(·) that can be written as linear combinations of polynomial non-linearities, spatial derivatives, and combinations of both. For instance, for a one-dimensional (d =
1) state variable u, the function map can take the form:

∂u/∂t = ξ₀ + ξ₁ u + ξ₂ ∂u/∂x + ξ₃ u ∂u/∂x + ξ₄ u² + ... + ξ_k u² ∂u/∂x + ...,   (2.1.2)

where the entire right-hand side constitutes F(·), and the ξ_k are the coefficients of the components of the PDE for k ≥
0. The continuous PDE of the form described in Eq. (2.1.2), with appropriate coefficients ξ_k, holds true for all continuous space and time points (x, t) in the domain of the model. Numerical solutions of the PDE try to satisfy the equality relation in Eq. (2.1.2) for reconstituting the non-linear dynamics of a dynamical system at discrete space and time points (x_m, t_n). We assume that we have access to N noisy observational data ũ_{m,n} of the state variable u over the discrete space and time points. The measurement errors are independent and identically distributed and are assumed to follow a normal distribution with mean zero and variance σ².

We follow the formulation put forward in [17, 21, 22] and construct a large dictionary of PDE components using discrete approximations of the dynamics from data. For instance, for the one-dimensional example in Eq. (2.1.2), the discrete approximation with p PDE terms can be written in vectorized form as a linear regression problem:

U_t = Θ ξ,  with  Θ = [ u | u u_x | ... | u² | u_xx | ... ],   (2.1.3)

where each vertical bar delimits one column of the dictionary. Here, the left-hand side U_t ∈ R^N is a discrete approximation to the time derivatives of u at the discretization points and represents the response or outcome vector in the linear regression framework.
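As an illustration, U_t and a small dictionary Θ can be assembled from gridded 1D data using finite differences. The following is a minimal NumPy sketch; the six-term dictionary, the function name, and the variable names are ours for illustration (in practice, the data ũ would first be denoised, and the dictionary would contain many more terms):

```python
import numpy as np

def build_linear_system(u, dx, dt):
    """Assemble U_t and an illustrative dictionary Theta = [1, u, u_x, u*u_x, u^2, u_xx]
    from a space-time field u[space, time].

    Time derivative: first-order forward differences.
    Spatial derivatives: second-order central differences (interior points only).
    """
    # interior spatial points, and all but the last time point, are usable
    ui = u[1:-1, :-1]
    u_t = (u[1:-1, 1:] - u[1:-1, :-1]) / dt             # forward difference in time
    u_x = (u[2:, :-1] - u[:-2, :-1]) / (2.0 * dx)       # central difference in space
    u_xx = (u[2:, :-1] - 2.0 * ui + u[:-2, :-1]) / dx**2

    U_t = u_t.ravel()
    columns = [np.ones_like(ui), ui, u_x, ui * u_x, ui**2, u_xx]
    Theta = np.stack([c.ravel() for c in columns], axis=1)   # N x p
    return U_t, Theta
```

Each row of Θ then corresponds to one space-time sample (x_m, t_n), and each column to one candidate PDE term evaluated at all samples.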
Each column of the dictionary or design matrix Θ ∈ R^{N×p} represents the discrete approximation of one PDE component, i.e., one of the terms in Eq. (2.1.2), at N discrete points in space and time (x, t)_n, n = 1, ..., N. Each column is interpreted as a potential predictor of the response vector U_t. The vector ξ = [ξ₀, ξ₁, ..., ξ_{p−1}]ᵀ ∈ R^p is the vector of unknown PDE coefficients, i.e., the pre-factors of the terms in Eq. (2.1.2).

Both U_t and Θ need to be constructed from numerical approximations of the temporal and spatial derivatives of the observed state variables. There is a rich literature in numerical analysis on this topic (see, e.g., [36, 37]). Here, we approximate the time derivatives by first-order forward finite differences from ũ after initial denoising of the data. Similarly, the spatial derivatives are computed by applying second-order central finite differences. Details about the denoising are given in Section 3 and the Supplemental Material.

Given the general linear regression ansatz in Eq. (2.1.3), we formulate the data-driven PDE inference problem as a regularized optimization problem of the form:

ξ̂^λ = argmin_ξ ( h(ξ) + λ g(ξ) ),   (2.1.4)

where ξ̂^λ is the minimizer of the objective function, h(·) a smooth convex data-fitting function, g(·) a regularization or penalty function, and λ ≥ 0 the regularization parameter. The penalty g(·) is not necessarily convex or differentiable. We follow previous work [17, 21, 22] and consider the standard least-squares objective

h(ξ) = ‖U_t − Θξ‖₂².   (2.1.5)

The choice of the penalty function g(·) influences the properties of the coefficient estimates ξ̂^λ. Following [20, 21, 22], we seek to identify a small set of PDE components among the p possible components that accurately predict the time evolution of the state variables. This implies that we want to identify a sparse coefficient vector ξ̂^λ, thus resulting in an interpretable PDE model.
This can be achieved through sparsity-promoting penalty functions g(·). We next consider different choices for g(·) that enforce sparsity in the coefficient vector and review algorithms that solve the associated optimization problems.

2.2 Sparse optimization for PDE learning

The least-squares loss in Eq. (2.1.5) can be combined with different sparsity-inducing penalty functions g(·). The prototypical example is the ℓ₁-norm, g(·) = ‖·‖₁, leading to the LASSO formulation of sparse linear regression [32]:

ξ̂^λ = argmin_ξ ( ‖U_t − Θξ‖₂² + λ ‖ξ‖₁ ),   (2.2.1)

where the first term is the data-fitting function h(·) and the second the penalty g(·). The LASSO objective comprises a convex smooth loss and a convex non-smooth regularizer. For this class of problems, efficient optimization algorithms exist that can exploit the properties of the functions and come with convergence guarantees. Important examples include coordinate-descent algorithms [38, 39] and proximal algorithms, including the Douglas-Rachford algorithm [40] and the projected (or proximal) gradient method, also known as the Forward-Backward algorithm [41]. In signal processing, the latter schemes are termed iterative shrinkage-thresholding (ISTA) algorithms (see [42] and references therein), which can be extended to non-convex penalties. Although the LASSO has been previously used for PDE learning in [22], the statistical performance of the LASSO estimates is known to deteriorate if certain conditions on the design matrix are not met.
For example, the studies in [43, 44] provide sufficient and necessary conditions, called the irrepresentable conditions, for consistent variable selection using the LASSO, essentially excluding too-strong correlations of the predictors in the design matrix. These conditions are, however, difficult to check in practice, as they require knowledge of the true components of the model. One way to relax these conditions is via randomization. The Randomized LASSO [30] considers the following objective:

ξ̂^λ = argmin_ξ ( ‖U_t − Θξ‖₂² + λ Σ_{k=1}^{p} |ξ_k| / W_k ),   (2.2.2)

where each W_k is an i.i.d. random variable uniformly distributed over [α, 1] with α ∈ (0, 1]. For α = 1, the Randomized LASSO reduces to the standard LASSO. The Randomized LASSO has been shown to successfully overcome limitations of the LASSO in handling correlated components in the dictionary [30] while simultaneously preserving the overall convexity of the objective function. As part of our PDE-STRIDE framework, we will evaluate the performance of the Randomized LASSO in the context of PDE learning using cyclical coordinate descent [39].

The sparsity-promoting property of the (weighted) ℓ₁-norm comes at the expense of considerable bias in the estimation of the non-zero coefficients [44], thus leading to reduced variable selection performance in practice. This drawback can be alleviated by using non-convex penalty functions [45, 46], allowing near-optimal variable selection performance at the cost of needing to solve a non-convex optimization problem. For instance, using the ℓ₀-"norm" (which counts the number of non-zero elements of a vector) as regularizer, g(·) = ‖·‖₀, leads to the NP-hard problem:

ξ̂^λ = argmin_ξ ( ‖U_t − Θξ‖₂² + λ ‖ξ‖₀ ).   (2.2.3)

This formulation has found widespread applications in compressive sensing and signal processing. Algorithms that deliver approximate solutions to the objective function in Eq. (2.2.3) include greedy optimization strategies [47], Compressive Sampling Matching Pursuit (CoSaMP) [48], and subspace pursuit [49]. We here consider the Iterative Hard Thresholding (IHT) algorithm [33, 50], which belongs to the class of ISTA algorithms.
Given the design matrix Θ and the measurement vector U_t, IHT computes sparse solutions ξ̂ by applying a non-linear shrinkage (thresholding) operator to gradient-descent steps in an iterative manner. One step of the iterative scheme reads:

ξ^{n+1} = H_λ( ξ^n + Θᵀ(U_t − Θξ^n) ) = T ξ^n,  with  H_λ(x) = 0 for |x| ≤ √λ, and H_λ(x) = x otherwise.   (2.2.4)

The operator H_λ(·) is the non-linear hard-thresholding operator, applied element-wise. Convergence of the above IHT algorithm is guaranteed iff ‖U_t − Θξ^{n+1}‖₂² < (1 − c) ‖U_t − Θξ^n‖₂² holds in each iteration for some constant 0 < c < 1. Specifically, under the condition that ‖Θ‖₂ < 1, the IHT algorithm is guaranteed to not increase the cost function in Eq. (2.2.3) (Lemma 1 in [33]). The above IHT algorithm can also be viewed as a thresholded version of the classical Landweber iteration [51]. The fixed points ξ* of ξ* = T ξ* for the non-linear operator T in Eq. (2.2.4) are local minima of the cost function in Eq. (2.2.3) (Lemma 3 in [33]). Under the same condition on the design matrix, i.e., ‖Θ‖₂ < 1, the optimal solution of the cost function in Eq. (2.2.3) thus belongs to the set of fixed points of the IHT algorithm (Theorem 2 in [33] and Theorem 12 in [52]). Although the IHT algorithm comes with theoretical convergence guarantees, the resulting fixed points are not necessarily sparse [33].

Here, we propose a modification of the IHT algorithm that will prove particularly suited for solving PDE learning problems with PDE-STRIDE. Following a proposal in [34] for the Hard Thresholding Pursuit (HTP) algorithm, we equip the standard IHT algorithm with an additional debiasing step. At each iteration, we solve a least-squares problem restricted to the support S^{n+1} = {k : ξ_k^{n+1} ≠ 0} obtained from the n-th IHT iteration. We refer to this form of IHT as Iterative Hard Thresholding with debiasing (IHT-d). In this two-stage algorithm, the standard IHT step serves to extract the explanatory variables, while the debiasing step approximately debiases (or refits) the coefficients restricted to the currently active support [53]. Rather than solving the least-squares problem to optimality, we use gradient-descent steps until a loose upper bound on the least-squares refit is satisfied. This prevents over-fitting by attributing low confidence to large supports and reduces computational overhead. The complete IHT-d procedure is detailed in Algorithm 1 in the Supplementary Material. In PDE-STRIDE, we will compare IHT-d with a heuristic iterative algorithm, Sequential Thresholding of Ridge regression (STRidge), that also uses ℓ₀ penalization and is available in PDE-FIND [21].

2.3 Stability selection

The practical performance of sparse optimization techniques for PDE learning critically hinges on the proper selection of the regularization parameter λ, which balances model fit against the complexity of the coefficient vector. In model-discovery tasks on real experimental data, a wrong choice of the regularization parameter can result in incorrect PDE model selection even if true model discovery would have been, in principle, achievable.
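To make the two-stage IHT-d scheme of the previous section concrete, the following is a minimal sketch of the iteration in Eq. (2.2.4) augmented with a debiasing refit. This is our own simplified variant, not the authors' Algorithm 1: the debiasing here uses a fixed number of gradient steps on the active support rather than the loose upper-bound stopping criterion, and all names and defaults are illustrative. It assumes Θ is scaled so that ‖Θ‖₂ < 1, as required by the convergence guarantee of plain IHT:

```python
import numpy as np

def iht_debiased(Theta, U_t, lam, n_iter=100, debias_steps=10, step=1.0):
    """Sketch of IHT with a debiasing step (IHT-d).

    Assumes ||Theta||_2 < 1. Hard threshold is sqrt(lam), matching the
    l0-penalized least-squares objective in Eq. (2.2.3).
    """
    p = Theta.shape[1]
    xi = np.zeros(p)
    thr = np.sqrt(lam)
    for _ in range(n_iter):
        # gradient step on the least-squares loss, then hard thresholding
        xi = xi + step * Theta.T @ (U_t - Theta @ xi)
        xi[np.abs(xi) <= thr] = 0.0
        # debiasing: a few gradient steps restricted to the active support
        S = np.flatnonzero(xi)
        if S.size:
            for _ in range(debias_steps):
                r = U_t - Theta[:, S] @ xi[S]
                xi[S] = xi[S] + step * Theta[:, S].T @ r
    return xi
```

The IHT step selects the active components; the restricted gradient steps then reduce the shrinkage bias on those components without refitting to optimality.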
In statistics, a large number of tuning-parameter selection criteria are available, ranging from cross-validation approaches [54] to information criteria [55], or formulations that allow joint learning of model coefficients and tuning parameters [56, 57]. Here, we advocate stability-based model selection [30] for robust PDE learning. The statistical principle of stability [58] has been put forward as one of the pillars of modern data science and statistics and provides an intuitive approach to model selection [30, 31, 59].

In the context of sparse regression, stability selection [30] proceeds as follows (see also Figure 1 for an illustration). Given a design matrix Θ and the measurement vector U_t, we generate random subsample indices I*_i ⊂ {1, ..., N}, i = 1, ..., B, of equal size |I*_i| = N/2. We form the subsampled problems Θ[I*_i] ∈ R^{N/2 × p} and U_t[I*_i] ∈ R^{N/2} by choosing rows according to the index set I*_i. For each of the resulting B subproblems, we apply a sparse regression technique and systematically record the recovered support Ŝ^λ[I*_i], i = 1, ..., B, as a function of λ over a regularization path Λ = [λ_max, λ_min]. The value of λ_max is data-dependent and easily computable for generalized linear models with convex penalties [39]. Similarly, the critical parameter λ_max for the non-convex problem in Eq. (2.2.3) can be evaluated from optimality conditions (Theorem 12 in [52] and Theorem 1 in [33]). The lower bound λ_min of the regularization path is set to λ_min = ε λ_max with default value ε = 0.1.

The λ-dependent stability (or importance) measure for each coefficient ξ_k is then computed as:

Π̂^λ_k = P( k ∈ Ŝ^λ ) ≈ (1/B) Σ_{i=1}^{B} 1( k ∈ Ŝ^λ[I*_i] ),   (2.3.1)

where I*_1, ..., I*_B are the independent random sub-samples and 1(·) denotes the indicator function. The stability Π̂^λ_k of each coefficient can be plotted across the λ-path, resulting in a component stability profile (see Figure 1F for an illustration). This visualization provides an intuitive overview of the importance of the different model components. Different from the original stability selection proposal [30], we define the stable components of the model as follows:

Ŝ^stable = { k : Π̂^{λ_min}_k ≥ π_th }.   (2.3.2)

Here, π_th denotes the critical stability threshold parameter and can be set to π_th ∈ [0.6, 0.9] [30]. The default setting is π_th = 0.8. In an exploratory data analysis mode, the threshold π_th can also be set through visual inspection of the stability plots, thereby allowing the principled exploration of several alternative PDE models. The stability measures Π̂^λ_k also provide an interpretable criterion for a model component's importance, thereby guiding the user to build the right model with high probability. As we will show in the numerical experiments, stability selection ensures robustness against varying dictionary size, different types of data sampling, noise in the data, and variability of the sub-optimal solutions when non-convex penalties are used. All of these properties are critical for consistent and reproducible model learning in real-world applications. Under certain conditions, stability selection also provides an upper bound on the expected number of false positives; such guarantees are not generally available for any sparsity-promoting regression method in isolation [31]. For instance, stability selection combined with the Randomized LASSO (Eq. (2.2.2) with α < 0.5) is consistent for variable selection even when the irrepresentable condition is violated [30].
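The subsampling loop described above can be sketched as follows. This is a minimal illustration with our own names: `solver` stands for any sparsity-promoting regression routine returning a coefficient vector, and the per-subsample standardization (zero-mean, unit-variance columns; centered response) follows the scheme used in PDE-STRIDE:

```python
import numpy as np

def stability_selection(Theta, U_t, solver, lambdas, B=100, seed=0):
    """Estimate importance measures Pi[l, k] ~ P(k in S_hat^lambda_l).

    For each of B random half-subsamples of rows, run `solver` along the
    lambda path and record which components have non-zero coefficients.
    """
    rng = np.random.default_rng(seed)
    N, p = Theta.shape
    Pi = np.zeros((len(lambdas), p))
    for _ in range(B):
        idx = rng.choice(N, size=N // 2, replace=False)  # random subsample I*_i
        Ts, Us = Theta[idx], U_t[idx]
        # standardize columns to zero mean / unit variance; center the response
        Ts = (Ts - Ts.mean(axis=0)) / Ts.std(axis=0)
        Us = Us - Us.mean()
        for l, lam in enumerate(lambdas):
            xi = solver(Ts, Us, lam)
            Pi[l] += (xi != 0)
    return Pi / B

def stable_support(Pi, pi_th=0.8):
    """Stable set evaluated at the end of the path (lambda_min), Eq. (2.3.2)."""
    return np.flatnonzero(Pi[-1] >= pi_th)
```

Note that a constant (intercept) column must be excluded from, or handled separately in, the standardization, since its variance is zero.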
We present numerical experiments to benchmark the performance and robustness of PDE-STRIDE combined with different ℓ₀/ℓ₁ sparsity-promoting regression methods for inferring PDEs from spatiotemporal data. In order to provide comparisons and benchmarks, we first use simulated data obtained by numerically solving known ground-truth PDEs, before applying our method to a real-world data set from biology. The benchmark experiments on simulated data are presented in four sub-sections that demonstrate different aspects of the inference framework: Sub-section 3.1 demonstrates the use of different sparsity-promoting regression methods in our framework on a simple 1D Burgers problem. Sub-section 3.2 then compares their performance in order to choose the best regression method, IHT-d. In sub-section 3.3, stability selection is combined with IHT-d to recover 2D vorticity-transport and 3D reaction-diffusion PDEs from limited, noisy simulation data. Sub-section 3.4 reports achievability results to quantify the robustness of stability selection to perturbations in dictionary size, sample size, and noise levels.

STability-based Robust IDEntification of PDEs (PDE-STRIDE)
Given the noise-corrupted data ũ and a choice of regression method, e.g., (randomized) LASSO, IHT, HTP, IHT-d, STRidge:

1. Apply any required de-noising method to the noisy data and compute the spatial derivatives and non-linearities to construct the design matrix Θ ∈ R^{N×p} and the time-derivatives vector U_t ∈ R^{N×1} for suitable sample size N and dictionary size p.

2. Build the sub-samples Θ[I*_i] ∈ R^{N/2 × p} and U_t[I*_i], for i = 1, 2, ..., B, by uniformly random sub-sampling of rows from the design matrix Θ and the corresponding rows from U_t. For every sub-sample I*_i, standardize the columns of the sub-design matrix Θ[I*_i] to zero mean and unit variance, i.e., Σ_j θ_jk = 0 and (1/N) Σ_j θ_jk² = 1 for k = 1, 2, ..., p, where θ_jk is the element in row j and column k of the matrix Θ[I*_i]. The corresponding measurement vector U_t[I*_i] is centered to zero mean.

3. Apply the sparsity-promoting regression method independently to each sub-sample (Θ[I*_i], U_t[I*_i]) to construct the λ-paths for M values of λ, as discussed in Section 2.3.

4. Compute the importance measures Π̂^λ_k of all dictionary components ξ_k along the discretized λ-path, as discussed in Section 2.3. Select the stable support set Ŝ^stable by applying the threshold π_th to all Π̂_k. Solve a linear least-squares problem restricted to the support Ŝ^stable to identify the coefficients of the learned model.

Adding noise to the simulation data
Let u ∈ R^N be the vector of clean simulation data sampled in both space and time. This vector is corrupted with additive Gaussian noise as ũ = u + ε, where ε = σ · N(0, std(u)) is additive Gaussian noise scaled by the empirical standard deviation of the entries of u, and σ is the noise level.

Computing the data vector
The data vector U_t ∈ R^N contains numerical approximations to the time derivatives of the state variable u at different points in space and time. We compute these by first-order forward finite differences (i.e., the explicit Euler scheme) from ũ after initial de-noising of the data. Similarly, the spatial derivatives are computed by applying second-order central finite differences directly to the de-noised data. For de-noising we use truncated singular value decomposition (SVD) with a cut-off at the elbow of the singular-value curve, as shown in Supplementary Figures S2 and S3.

Fixing the parameters for stability selection

We propose that PDE-STRIDE combined with IHT-d provides a parameter-free PDE learning method. Therefore, all stability selection parameters described in Section 2.3 are fixed throughout our numerical experiments. The choice of these statistical parameters is well discussed in the literature [30, 60, 39]. We thus fix the repetition number B, the λ-path size M = 20, and the importance threshold π_th = 0.8. Using these standard choices, the method works robustly across all tests presented hereafter, and is parameter-free in that sense. In both stability and regularization plots, we show the normalized value of the regularization constant λ* = λ/λ_max. Although the stable component set Ŝ is evaluated at λ_min as in Eq. (2.3.2), the entire stability profile of each component from λ_max to λ_min is shown in all our stability plots. This way, we gain additional graphical insight into how each component evolves along the λ-path.

Different sparsity-promoters

We show that stability selection can be combined with any sparsity-promoting penalized regression to learn PDE components from noisy and limited spatiotemporal data. We use simulated data of the 1D Burgers equation

∂u/∂t + u ∂u/∂x = 0.1 ∂²u/∂x²   (3.1.1)

with identical boundary and initial conditions as used in [21] to provide a fair comparison between methods: periodic boundaries in space and the Gaussian initial condition u(x, 0) = e^{−(x+2)²}. The domain x ∈ [−8, 8] is divided uniformly into 256 Cartesian grid points in space and 1000 time points. The numerical solution, visualized in space-time in Figure 4, was obtained using a parabolic finite-difference method with explicit Euler time stepping.

Figure 2: Model selection with PDE-STRIDE for the 1D Burgers equation:
The top row shows regularization paths (see Section 2.3) for three sparsity-promoting regression techniques: randomized LASSO, STRidge, and IHT-d, all for the same design (N, p). The ridge parameter λ_R for STRidge is fixed as in [21]. The value of α for the randomized LASSO is set to 0.2. In all three cases, the standard threshold π_th = 0.8 is used; ε is set to 0.001 for the randomized LASSO case for demonstrating stability selection.
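As an illustration of the noise model and the truncated-SVD de-noising described earlier, consider the following minimal NumPy sketch (the function names are ours; the explicit `rank` argument stands in for the elbow criterion of Supplementary Figures S2 and S3):

```python
import numpy as np

def corrupt(u, sigma, rng):
    """Additive Gaussian noise scaled by the empirical standard deviation:
    u_noisy = u + sigma * std(u) * N(0, 1)."""
    return u + sigma * np.std(u) * rng.standard_normal(u.shape)

def denoise_tsvd(U, rank):
    """Truncated-SVD de-noising of a space-by-time data matrix U, keeping
    only the leading `rank` singular modes."""
    W, s, Vt = np.linalg.svd(U, full_matrices=False)
    return (W[:, :rank] * s[:rank]) @ Vt[:rank]
```

Time and space derivatives for U_t and Θ are then computed from the de-noised field, e.g., by the finite-difference stencils described above.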
We test the combinations of stability selection with the three sparsity-promoting regression techniques described in Section 2.2: randomized LASSO, STRidge, and IHT-d. The top row of Figure 2 shows the regularization paths for randomized LASSO, STRidge, and IHT-d. The bottom row of Figure 2 shows the corresponding stability profiles for each component in the dictionary. The colored solid lines correspond to the advection and diffusion terms of Eq. (3.1.1). Thresholding the importance measure at Π̂ > π_th = 0.8 recovers exactly these true components.

Comparison of sparsity promoters

Although stability selection can be used in conjunction with any ℓ0 or ℓ1 sparsity-promoting regression method, the question arises whether a particular choice of regression algorithm is especially well suited for PDE learning. We therefore perform a systematic comparison between LASSO, STRidge, and IHT-d for recovering the 1D Burgers equation under perturbations to the sample size N, the dictionary size p, and the noise level σ. An experiment for a particular triple (N, p, σ) is considered a success if there exists a λ ∈ Λ (see Section 2.3) for which the true PDE components are recovered. In Figure 3, the success frequencies over 30 independent repetitions with uniformly random data sub-samples are shown.

Figure 3: Comparison between different sparsity promoters for the 1D Burgers equation: Each colored square corresponds to a design (N, p, σ) with sample size N, dictionary size p, and noise level σ. Color indicates the success frequency over 30 independent repetitions with uniformly random data sub-samples. "Success" is defined as the existence of a λ for which the correct PDE is recovered from the data. The columns compare the three sparsity-promoting techniques: LASSO, STRidge, and IHT-d (left to right), as labelled at the top.

A first observation from Figure 3 is that the ℓ0 solutions (here with STRidge and IHT-d) are better than the relaxed ℓ1 solutions (here with LASSO).
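For reference, plain iterative hard thresholding with a final de-biasing least-squares re-fit can be sketched as below. This is a minimal illustration only; the exact IHT-d variant used in the paper is the one described in Section 2.2.

```python
import numpy as np

def iht_debiased(Theta, ut, s, n_iter=200, step=None):
    """Iterative hard thresholding with de-biasing (minimal sketch):
    gradient step on ||ut - Theta w||^2, keep the s largest-magnitude
    entries, and finally re-fit an unpenalized least-squares problem
    restricted to the recovered support."""
    N, p = Theta.shape
    if step is None:
        # conservative step size from the spectral norm of Theta
        step = 1.0 / np.linalg.norm(Theta, 2) ** 2
    w = np.zeros(p)
    for _ in range(n_iter):
        w = w + step * Theta.T @ (ut - Theta @ w)   # gradient step
        keep = np.argsort(np.abs(w))[-s:]           # hard threshold: top-s
        mask = np.zeros(p, dtype=bool)
        mask[keep] = True
        w[~mask] = 0.0
    supp = np.flatnonzero(w)
    # de-biasing: least-squares re-fit on the recovered support
    w[supp] = np.linalg.lstsq(Theta[:, supp], ut, rcond=None)[0]
    return w
```

On a well-conditioned design, the de-biasing step removes the shrinkage bias of the thresholded gradient iterations.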
We also observe that IHT-d performs better than LASSO and STRidge for large dictionary sizes p, high noise, and small sample sizes. Large dictionaries with higher-order derivatives computed from discrete data cause grouping (correlations between variables), for which LASSO tends to select one variable from each group, ignoring the others [61]. Thus, LASSO fails to identify the true support consistently. STRidge shows good recovery for large dictionary sizes p with clean data, but it breaks down in the presence of noise on the data. Of all three methods, IHT-d shows the best robustness to both noise and changes in the design. We note a decrease in inference power with increasing sample size N, especially for large p and high noise levels. This again can be attributed to correlations and groupings in the dictionary, which become more prominent with increasing sample size N. Based on these results, we use IHT-d as the sparsity-promoting regression method in conjunction with PDE-STRIDE for model selection in the remaining sections.

We present benchmark results of PDE-STRIDE for PDE recovery with IHT-d as the sparse regression method. This combination of methods is used to recover PDEs from limited noisy data obtained by numerical solution of the 1D Burgers, 2D vorticity-transport, and 3D Gray-Scott equations. Once the support Ŝ of the PDE model is learned by PDE-STRIDE with IHT-d, the actual coefficient values of the non-zero components are determined by solving the linear least-squares problem restricted to the recovered support Ŝ. More sophisticated methods could be used to estimate the parameters of a PDE of known structure from limited noisy data, as in [19, 18], but this is beyond the scope of this paper, given that in all cases considered the sample size N significantly exceeds the cardinality of the recovered support (N ≫ |Ŝ|), for which the least-squares fit provides good estimates of the PDE coefficients.
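The PDE-STRIDE recipe (sub-sampling, standardization, per-subsample sparse regression, importance measures, thresholding, and a restricted least-squares re-fit) can be condensed into a short sketch; for brevity it calls a user-supplied solver once per sub-sample instead of sweeping the full λ-path of Section 2.3:

```python
import numpy as np

def stability_selection(Theta, ut, sparse_solver, B=100, pi_th=0.8, rng=None):
    """Stability selection in the spirit of PDE-STRIDE (sketch): run a
    sparsity-promoting solver on B random half-subsamples, record how often
    each dictionary component is selected (importance measure), and keep
    components with importance >= pi_th. `sparse_solver(Theta_sub, ut_sub)`
    must return a coefficient vector. Assumes non-constant columns."""
    rng = rng or np.random.default_rng()
    N, p = Theta.shape
    counts = np.zeros(p)
    for _ in range(B):
        idx = rng.choice(N, size=N // 2, replace=False)   # row sub-sample
        Ts, us = Theta[idx], ut[idx]
        Ts = (Ts - Ts.mean(axis=0)) / Ts.std(axis=0)      # standardize columns
        us = us - us.mean()                               # center response
        counts[np.flatnonzero(sparse_solver(Ts, us))] += 1
    importance = counts / B                               # per-component importance
    S_stable = np.flatnonzero(importance >= pi_th)
    # restricted least-squares re-fit on the stable support
    coef = np.linalg.lstsq(Theta[:, S_stable], ut, rcond=None)[0]
    return S_stable, coef
```

Any of the solvers compared above (LASSO, STRidge, IHT-d) can be plugged in as `sparse_solver`.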
We again consider the 1D Burgers equation from Eq. (3.1.1), using the same simulated data as in Section 3.1, to quantify the performance and robustness against noise of the PDE-STRIDE + IHT-d method. The results are shown in Figure 4 for a design with N = 250 and p = 19. Even on this small data set, with a sample size comparable to the dictionary size, our method recovers the correct model ({u_xx, uu_x}) with up to 5% noise on the data, although the least-squares fits of the coefficient values gradually deviate from their exact values (see Table 1).

Figure 4: Model selection with PDE-STRIDE + IHT-d for 1D Burgers equation recovery:
The top left image shows the numerical solution of the 1D Burgers equation on a 256 × 100 space-time grid. The stability plots for the design N = 250, p = 19 show the separation of the true PDE components (in solid color) from the noisy components (dotted black). The inference power of the PDE-STRIDE method is tested for additive Gaussian noise levels σ up to 5% (not shown). In all cases, perfect recovery was possible with the fixed threshold of π_th = 0.8 on Π̂ (shown by the horizontal red solid line). The inset at the bottom shows the color correspondence with the dictionary components.

For comparison, the corresponding stability plots for PDE-STRIDE + STRidge are shown in Supplementary Figure S4. When using STRidge regression, the algorithm produces many false positives, even at mild noise levels.

Table 1:
Coefficient values of the recovered 1D Burgers equation for different noise levels. The stable components of the PDE inferred from the plots in Figure 4 are Ŝ_stable = {u_xx, uu_x}.

This section discusses results for the recovery of the 2D vorticity transport equation using PDE-STRIDE. The vorticity transport equation can be obtained by taking the curl of the Navier-Stokes equations and imposing the divergence-free constraint ∇ · u = 0 to enforce incompressibility. This form of the Navier-Stokes equations has found extensive applications in oceanography and climate modeling [62]. For the numerical solution of the transport equation, we impose no-slip boundary conditions at the left, right, and bottom walls of the unit square, and a moving lid at the top wall, on a 128 × 128 space grid. A Poisson problem was solved at every time step to correct the velocities u, v and ensure divergence-free fields. The viscosity of the simulated fluid was fixed to a constant µ. Figure 5 visualizes the u, v velocities and the vorticity field ω inside the square domain [0, 1] × [0, 1] of the lid-driven cavity experiment.

ω_t + u ω_x + v ω_y = µ (ω_xx + ω_yy)   (3.3.1)

In Figure 6, the PDE-STRIDE results for 2D vorticity transport equation recovery are shown. The results demonstrate consistent recovery of the true support of the PDE for different noise levels σ. The recovered stable components Ŝ_stable = {ω_xx, ω_yy, uω_x, vω_y} correspond to the true PDE components of the 2D vorticity equation, Eq. (3.3.1). In Table 2, we show the refitted coefficients for the recovered PDE components. It should also be noted that the separation between the true components (colored solid lines) and the noisy ones (black dotted lines) becomes less distinct with increasing noise levels. In Supplementary Figure S5, we also report the STRidge-based stability selection results for the same design and stability selection parameters. It can be seen that STRidge struggles to recover the true support even at small noise levels.

Figure 5: Numerical solution of the 2D vorticity transport equation: The u, v velocity components and the vorticity ω are illustrated in the square domain. We choose to sample inside a rectangular box in the upper part of the domain to capture the rich dynamics resulting from the shear boundary condition imposed at the top surface. The black dots mark the sample locations used for inference.

Table 2:
Coefficients of the recovered 2D vorticity transport equation for different noise levels. The stable components of the PDE from Figure 6 are Ŝ_stable = {ω_xx, ω_yy, uω_x, vω_y}.

Figure 6: Model selection with PDE-STRIDE + IHT-d for 2D vorticity transport equation recovery:
The stability plots for the design with p = 48 show the separation of the true PDE components (in solid color) from the noisy components (dotted black). The inference power of the PDE-STRIDE method is tested for additive Gaussian noise levels σ up to 6% (not shown). In all cases, perfect recovery was possible with the fixed threshold of π_th = 0.8 on Π̂ (shown by the horizontal red solid line). The inset at the bottom shows the color correspondence with the dictionary components.

In this section, we report the recovery capabilities of PDE-STRIDE for the 3D Gray-Scott reaction-diffusion equations, Eqs. (3.3.2). Reaction and diffusion of chemical species can produce a variety of patterns, reminiscent of those often observed in nature. Such processes form an essential basis for morphogenesis in biology [64] and may even be used to describe skin patterning and pigmentation [65]. We choose this example to illustrate systems with coupled variables; it is also very similar in dynamics to our real-world example discussed in Section 4. The reaction-diffusion dynamics described by Eqs. (3.3.2) are commonly used to model the non-linear interactions between two chemical species (u, v):

u_t = D_u (∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z²) − uv² + f(1 − u),   (3.3.2a)
v_t = D_v (∂²v/∂x² + ∂²v/∂y² + ∂²v/∂z²) + uv² − (f + k)v.   (3.3.2b)

We simulate the above equations using the OpenFPM framework; the model is parameterized by k, f, D_u, and D_v. A snapshot of the v concentration in the cubic 3D simulation box is shown in Fig. 7. A finite-difference scheme with uniform grid spacing dx = dy = dz was used to discretize the dynamics described in Eqs. (3.3.2). Given the large number of degrees of freedom in the 3D problem, we choose to sample data for our learning problem only from a small cube of dimension 0.5 × 0.5 × 0.5 in the middle of the domain. In Figure 7, we show that the PDE-STRIDE method correctly identifies the true PDE components for the dynamics of the u species, given by Eq. (3.3.2a), for different noise levels. One can appreciate the clear separation between the true and the noisy PDE components in the stability plots. We show results for noise levels between 0 and 6% with as few as N = 400 samples and for a dictionary size of p = 69. Similar plots for the inference of the v-species dynamics are shown in Fig. S1. Although perfect recovery was not possible there, owing to the small diffusivity D_v of the v species, consistent and stable recovery of the reaction (interaction) terms can be seen. The refitted coefficients of the recovered PDEs for both the u and v species are reported in Table 3 and Table S1, respectively. Comparison plots of PDE-STRIDE with STRidge for 3D Gray-Scott recovery are shown in Supplementary Figures S6 and S7. STRidge is able to recover the complete form of Eqs. (3.3.2) in the noise-free case for both species, but fails to recover the u, v PDEs in the noisy case. This comparison clearly demonstrates that PDE-STRIDE + IHT-d outperforms PDE-STRIDE + STRidge for inference from noisy data sets.

Table 3: Coefficients of the recovered u-component Gray-Scott reaction-diffusion equation for different noise levels (columns u_xx, u_yy, u_zz, u, uv²; noise levels 0% to 6%). The stable components of the PDE from Figure 7 are Ŝ_stable = {u_xx, u_yy, u_zz, u, uv²}.

Figure 7: Model selection with PDE-STRIDE + IHT-d for 3D Gray-Scott u-component equation recovery: The top left figure shows a visualization of the 3D simulation domain with the v-species concentration. The color gradient corresponds to the varying concentration over space. The stability plots for the design N = 400, p = 69 show the separation of the true PDE components (solid color) from the noisy components (dotted black). The inference power of the PDE-STRIDE method is tested for additive Gaussian noise levels σ up to 6% (not shown). In all cases, perfect recovery was possible with the fixed threshold of π_th = 0.8 on Π̂ (shown by the horizontal red solid line). The inset at the bottom shows the color correspondence with the dictionary components.

In all the cases presented above, the learned stable components (Ŝ_stable) coincide with the true components of the underlying PDE model, and they are learned from only a few hundred data points. This shows that the PDE-STRIDE + IHT-d framework is able to robustly learn partial differential equations from limited noisy data. In the next section, we discuss the consistency and robustness of PDE-STRIDE + IHT-d under perturbations of design parameters such as sample size N, dictionary size p, and noise level σ.

Achievability results are a compact way to check the robustness and consistency of a model selection method under varying design parameters. They also provide an approximate means to reveal the sample complexity of any ℓ0 or ℓ1 sparsity-promoting technique, i.e., the number of data points N required to recover the model with full probability. Specifically, given a sparsity-promoting method, dictionary size p, sparsity k, and noise level σ, we are interested in how the sample size N scales with p, k, σ for the recovery probability to converge to one. The study [66] reports a sharp phase transition from failure to success for Gaussian random designs with increasing sample size N for LASSO-based sparse solutions. The study also provides sufficient lower bounds on the sample size N as a function of p, k for full recovery probability.
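Such achievability curves can be emulated on a synthetic Gaussian design; the sketch below uses a simple top-k least-squares solver as a stand-in for the λ-path sweep with IHT-d:

```python
import numpy as np

def topk_lstsq(Theta, ut, k):
    """Toy sparse solver: least-squares fit, keep the k largest coefficients."""
    w = np.linalg.lstsq(Theta, ut, rcond=None)[0]
    out = np.zeros_like(w)
    keep = np.argsort(np.abs(w))[-k:]
    out[keep] = w[keep]
    return out

def success_frequency(N, p, sigma, true_supp, solver, n_rep=20, rng=None):
    """Empirical recovery probability for a design (N, p, sigma): repeat the
    experiment n_rep times on random Gaussian designs and count how often the
    solver returns exactly the true support."""
    rng = rng or np.random.default_rng()
    hits = 0
    for _ in range(n_rep):
        Theta = rng.standard_normal((N, p))
        w = np.zeros(p)
        w[list(true_supp)] = 1.0
        ut = Theta @ w + sigma * rng.standard_normal(N)
        recovered = set(np.flatnonzero(solver(Theta, ut, len(true_supp))))
        hits += recovered == set(true_supp)
    return hits / n_rep
```

Sweeping N for fixed (p, σ) and plotting `success_frequency` reproduces the qualitative shape of the phase-transition curves in Figures 8 and 9.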
In this section, we ask whether sparse model selection with PDE-STRIDE (paired with IHT-d) exhibits similar sharp phase-transition behaviour. Given that the dictionary components in our case are compiled from derivatives and non-linearities computed from noisy data, it is interesting to observe whether full recovery probability is achieved and maintained with increasing sample size N. In the particular context of PDE learning, increasing the dictionary size by including higher-order non-linearities and higher-order derivatives has the potential to include strongly correlated components, which can negatively impact the inference power. This was made evident by the results and discussion of Section 3.2.

In Figures 8 and 9, the achievability results for the 1D Burgers system and the 3D u-component Gray-Scott reaction-diffusion system are shown. Each point corresponds to 20 repetitions of an experiment with some design (N, p, σ) under random data sub-sampling. An experiment with a design (N, p, σ) is considered a success if there exists a λ ∈ Λ for which the true PDE support S* is recovered by PDE-STRIDE + IHT-d with thresholding at π_th = 0.8. In Figures 8 and 9, we see strong consistency and robustness to the design parameters for both the Burgers and the Gray-Scott system, and we observe a sharp phase transition from failure to success, with the recovery probability converging to one with increasing sample size N. This evidence suggests that PDE-STRIDE not only enhances the inference power of the IHT-d method but also ensures consistency. In addition, the sharp phase-transition behaviour points towards a strict lower bound on the sample complexity N below which full recoverability is not attainable, as studied in [66]. Accordingly, achievability plots can be used to read off approximate estimates of the sample complexity of the learned dynamical systems. In the Burgers case, 90% success probability is achieved with as few as ≈70 data points in the noise-free case and ≈200 points in the noisy cases for different designs p. For the 3D Gray-Scott system, 90% success probability is achieved with as few as ≈200 data points in the noise-free case and ≈400 points in the noisy cases for different designs p.

Figure 8: Achievability results for model selection with PDE-STRIDE + IHT-d for 1D Burgers equation recovery:
The achievability results for 1D Burgers equation recovery with PDE-STRIDE are shown for varying designs (N, p, σ). Every point on the plot corresponds to 20 repetitions of the PDE-STRIDE method for some design (N, p, σ) under random data sub-sampling. Each line with markers corresponds to a dictionary size p, as shown in the inset below the plots. The colored area around each line shows the associated variance of the Bernoulli trials.

C. elegans zygote patterning
We showcase the applicability of PDE-STRIDE with IHT-d on real experimental data. We use microscopy images to infer a PDE model that explains early C. elegans embryo patterning, and we use it to confirm a previous hypothesis about the physics driving this biological process. Earlier studies of this process proposed a mechano-chemical mechanism for PAR protein polarization on the cell membrane [67, 68, 35]. They systematically showed that cortical flows provide sufficient perturbation to trigger polarization [68]. The experiments conducted in [68] measured the concentration of the anterior PAR complex (aPAR), the concentration of the posterior PAR complex (pPAR), and the cortical flow field v as functions of time, as shown in Figure 10 A, B, D. The concentration and velocity fields were acquired on a grid with a resolution of 60 × 55 in space and time. These experimental data sets were used to validate the mechano-chemical model developed in [67, 68]. Here, we challenge the PDE-STRIDE + IHT-d framework to learn a PDE model for the regulatory network (Eq. 4.0.1) of the interacting membrane PAR proteins in a purely data-driven manner from the experimental data. Given the noisy nature of the data, our analysis is limited to the first SVD mode, as shown in Figure 10 C.

Figure 9: Achievability results for model selection with PDE-STRIDE + IHT-d for 3D Gray-Scott u-component equation recovery: The achievability results for 3D Gray-Scott equation recovery with the PDE-STRIDE method are shown for varying designs (N, p, σ). Every point on the plot corresponds to 20 repetitions of PDE-STRIDE for some design (N, p, σ) under random data sub-sampling. Each line with markers corresponds to a dictionary size p, as shown in the inset below the plots. The colored area around each line shows the associated variance of the Bernoulli trials.

We also focus our attention on the temporal regime after the advective trigger, when the early domains of the PAR proteins are already formed. PDE-STRIDE is then directed to learn an interpretable model from the data that evolves the early protein domains into the fully developed patterns shown in Figure 10 A. The reaction kinetics of the PAR proteins can be formulated as

v⁻_a A + v⁻_p P  --k-->  v⁺_a A + v⁺_p P.   (4.0.1)

Here, v⁻_{a/p} and v⁺_{a/p} are the reactant and product stoichiometries, respectively. The variables A and P correspond to the aPAR and pPAR protein species, and k is the reaction rate. In designing the dictionary Θ, the maximum allowed stoichiometry for reactants and products is restricted to 2, i.e., v⁻_{a/p}, v⁺_{a/p} ∈ {0, 1, 2}. The PDE-STRIDE + IHT-d results for the regulatory reaction network learned from data are shown in Figure 10 E, F.
The stable components of the model for the aPAR protein are Ŝ^A_stable = {A, A²P} and for the pPAR protein Ŝ^P_stable = {P, P²A}, for a design with p = 20. In Figure 10G, achievability tests are conducted to show the consistency and robustness of the learned models across different sample sizes N. The learned model achieves full recovery probability for sufficiently large sample sizes.

Table 4: Coefficient values of the inferred PAR model. The stable components of the PDE inferred from the stability results in Figures 10E and 10F are Ŝ^P_stable = {P, P²A} and Ŝ^A_stable = {A, A²P}.

In Figures 10H and 10I, we overlay the numerical solution of the learned model with the de-noised experimental data at several time snapshots for both the aPAR and the pPAR model for quantitative comparison. This very simple PDE model is able to describe the temporal evolution of the PAR protein domains on the C. elegans zygote. Although there is a good match in the spatial scales (PAR domain sizes) for the two proteins, there is a non-negligible discrepancy between simulation and experiment in the time scales of the pPAR evolution. This difference can be attributed to the advection processes dictated by the cortical flows, which are not included in our simple ODE-type model, as shown in Figures 10E and 10I. We believe that including higher modes of the SVD decomposition and using structured sparsity to enforce symmetry arguments through grouping [69] can further mature our data-driven models to include the mechanical aspects of the PAR system. In Supplementary Figure S8 (left), we show that for a particular design with p = 20, the advection and diffusion components of the aPAR protein model carry significant importance measure to be included in the stable set Ŝ_stable, but this is not the case for the pPAR components, as shown in Figure S8 (right). The preferential advective displacement of the aPARs to the anterior side, modeled by the advective term v_x A, is also in line with the observations of the experimental studies [68]. However, such models with advection and diffusion components exhibit inconsistency under varying sample size N, in contrast to our simple ODE-type model, as illustrated in Figure 10G.

Figure 10: Data-driven model inference of the regulatory network of membrane PAR proteins from spatiotemporal data acquired from the C. elegans zygote: A: Spatiotemporal PAR concentration data sets provided by the Grill lab [67, 68] at MPI-CBG. Observational boundaries are defined manually (white ellipses), and key variables of interest, such as protein concentration and cell-cortex velocity, are identified in the microscopy data. Blue and red correspond to the intensities of the posterior PAR (pPAR) and anterior PAR (aPAR) proteins, respectively. B: The noisy aPAR (red) and pPAR (blue) concentration profiles extracted from the fluorescence microscopy experiments. C: De-noised spatiotemporal concentration profiles obtained by extracting the principal mode of the singular value decomposition (SVD) of the noisy data. Different symbol lines correspond to the different time instances shown in the bottom inset of panel A. D: De-noised spatiotemporal cortical flow profiles obtained by extracting the principal SVD mode of the noisy flow data. E: Stability plots using PDE-STRIDE + IHT-d to identify the stable PDE components (colored) and learn the model for the aPAR protein interactions. F: Stability plots using PDE-STRIDE + IHT-d to identify the stable PDE components (colored) and learn the model for the pPAR protein interactions. G: Achievability results testing the robustness and consistency of the inferred models for both aPAR, S^A_stable = {A, A²P}, and pPAR, S^P_stable = {P, P²A}, with increasing sample size N. H: The simulation results (lines) of the learned model for the aPAR protein overlaid with the experimental data (symbols) at different times. I: The simulation results (lines) of the learned model for the pPAR protein overlaid with the experimental data (symbols) at different times.

We have addressed two key issues that have thus far limited the application of sparse regression methods for automated PDE inference from noisy and limited data: the need for manual parameter tuning and the high sensitivity to noise in the data. We have shown that stability selection combined with any sparsity-promoting regression technique provides an appropriate level of regularization for consistent and robust recovery of the correct PDE model. Our numerical benchmarks suggest that iterative hard thresholding with de-biasing (IHT-d) is an ideal combination with stability selection, together forming a robust and parameter-free framework (PDE-STRIDE) for PDE learning. This combination of methods outperformed all other tested algorithmic approaches with respect to identification performance, amount of data required, and robustness to noise. The resulting stability-based PDE-STRIDE method was tested for robust recovery of the 1D Burgers equation, the 2D vorticity transport equation, and the 3D Gray-Scott reaction-diffusion equations from simulation data corrupted with up to 6% additive Gaussian noise. The achievability studies demonstrated the consistency and robustness of the PDE-STRIDE method, with full recovery probability of the model with increasing sample size N and for varying dictionary sizes p and noise levels σ. In addition, we note that achievability plots provide a natural estimate of the sample complexity of the underlying non-linear dynamical system.
However, this empirical estimate of the sample complexity depends on the choice of model selection algorithm and on how the data is sampled.

We demonstrated the capabilities of the PDE-STRIDE + IHT-d framework by applying it to learn a PDE model of embryo polarization directly from fluorescence microscopy images of C. elegans zygotes. The model recovered the regulatory reaction network of the involved proteins and their spatial transport dynamics in a purely data-driven manner, with no knowledge of the underlying physics or symmetries. The learned, data-derived PDE model was able to correctly predict the spatiotemporal dynamics of the embryonic polarity system from the early spatial domains to the fully developed patterns observed in the polarized C. elegans zygote. The model we inferred from image data using our method confirms both the structure and the mechanisms of popular physics-derived cell polarity models. Interestingly, the mutually inhibitory interactions between the involved protein species, which had previously been discovered through extensive biochemical experimentation, were automatically extracted from the data.

Besides rendering sparse inference methods more robust to noise and parameter-free, stability selection has the important conceptual benefit of also providing interpretable probabilistic importance measures for all model components. This enables modelers to construct their models with high fidelity and to gain intuition about correlations and sensitivities. Graphical inspection of the stability paths provides additional freedom for user intervention in semi-automated model discovery from data.

We expect that statistical learning methods have the potential to enable robust, consistent, and reproducible discovery of predictive and interpretable models directly from observational data. Our parameter-free PDE-STRIDE framework provides a first step toward this, but many open issues remain. First, numerically approximating time and space derivatives in the noisy data is a challenge for noise levels higher than a few percent. This limits the noise robustness of the overall method, regardless of how robust the subsequent regression method is. The impact of noise becomes even more severe when exploring models with higher-order derivatives or stronger non-linearities. Future work should focus on formulations that are robust to the choice of discretization method, while providing the necessary freedom to impose structure on the coefficients. Second, a principled way to constrain the learning process by physical priors, such as conservation laws and symmetries, is lacking. Exploiting such structural knowledge about the dynamical system is expected to greatly improve learning performance.
It should therefore be explored whether, e.g., structured sparsity or grouping constraints [70, 69] can be adopted for this purpose. Especially in coupled systems, like the Gray-Scott reaction-diffusion system and the PAR-polarity model, known symmetries could be enforced through structured grouping constraints.

In summary, we believe that data-driven model discovery has tremendous potential to provide novel insights into complex processes, in particular in biology. It provides an effective and complementary alternative to hypothesis-driven approaches. We hope that the stability-based model selection method PDE-STRIDE presented here will contribute to the further development and adoption of these approaches in the sciences.

Code and data availability:
The GitHub repository with the code and data can be found at https://github.com/SuryanarayanaMK/PDE-STRIDE.git.

Acknowledgements
This work was in parts supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Founda-tion) under Germany’s Excellence Strategy EXC-2068 390729961. We are grateful to the Grill lab at MPI-CBG / TU Dresden for providing the spatiotemporal PAR concentration and flow field data and allowing us to usethem in our showcase. We thank Nathan Kutz (University of Washington) and his group for making their codeand data public. 17 eferences [1] Alex Mogilner, Roy Wollman, and Wallace F Marshall. Quantitative modeling in cell biology: what is itgood for?
Developmental Cell, 11(3):279–287, 2006.
[2] Ivo F Sbalzarini. Modeling and simulation of biological systems from image data. Bioessays, 35(5):482–490, 2013.
[3] Claire J Tomlin and Jeffrey D Axelrod. Biology by numbers: mathematical modelling in developmental biology. Nature Reviews Genetics, 8(5):331, 2007.
[4] Daniel J Duffy. Finite Difference Methods in Financial Engineering: A Partial Differential Equation Approach. John Wiley & Sons, 2013.
[5] George Adomian. Solving the mathematical models of neurosciences and medicine. Mathematics and Computers in Simulation, 40(1-2):107–114, 1995.
[6] David Donoho. 50 years of data science. Journal of Computational and Graphical Statistics, 26(4):745–766, 2017.
[7] Chris P Barnes, Daniel Silk, Xia Sheng, and Michael P H Stumpf. Bayesian design of synthetic biological systems. Proceedings of the National Academy of Sciences of the United States of America, 108(37):15190–15195, 2011. doi: 10.1073/pnas.1017972108.
[8] Josefine Asmus, Christian L Müller, and Ivo F Sbalzarini. Lp-Adaptation: Simultaneous design centering and robustness estimation of electronic and biological systems. Scientific Reports, 7(1):6660, 2017. doi: 10.1038/s41598-017-03556-5.
[9] Hiroaki Kitano. Computational systems biology. Nature, 420(6912):206, 2002.
[10] Thomas Gregor, William Bialek, Rob R de Ruyter van Steveninck, David W Tank, and Eric F Wieschaus. Diffusion and scaling during early embryonic pattern formation. Proceedings of the National Academy of Sciences, 102(51):18403–18407, 2005.
[11] Ting Chen, Hongyu L He, and George M Church. Modeling gene expression with differential equations. In Biocomputing '99, pages 29–40. World Scientific, 1999.
[12] Ahmet Ay and David N Arnosti. Mathematical modeling of gene expression: a guide for the perplexed biologist. Critical Reviews in Biochemistry and Molecular Biology, 46(2):137–151, 2011.
[13] Jacques Prost, Frank Jülicher, and Jean-François Joanny. Active gel physics. Nature Physics, 11(2):111, 2015.
[14] Stefan Münster, Akanksha Jain, Alexander Mietke, Anastasios Pavlopoulos, Stephan W Grill, and Pavel Tomancak. Attachment of the blastoderm to the vitelline envelope affects gastrulation of insects. Nature, page 1, 2019.
[15] H Voss, M J Bünner, and Markus Abel. Identification of continuous, spatiotemporal systems. Physical Review E, 57(3):2820, 1998.
[16] Leo Breiman and Jerome H Friedman. Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association, 80(391):580–598, 1985.
[17] Markus Bär, Rainer Hegger, and Holger Kantz. Fitting differential equations to space-time dynamics. Physical Review E, 59(1):337, 1999.
[18] Xiaolei Xun, Jiguo Cao, Bani Mallick, Arnab Maity, and Raymond J Carroll. Parameter estimation of partial differential equation models. Journal of the American Statistical Association, 108(503):1009–1020, 2013.
[19] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Machine learning of linear differential equations using Gaussian processes. Journal of Computational Physics, 348:683–693, 2017.
[20] Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, page 201517384, 2016.
[21] Samuel H Rudy, Steven L Brunton, Joshua L Proctor, and J Nathan Kutz. Data-driven discovery of partial differential equations. Science Advances, 3(4):e1602614, 2017.
[22] Hayden Schaeffer. Learning partial differential equations via data discovery and sparse optimization. Proc. R. Soc. A, 473(2197):20160446, 2017.
[23] Sheng Zhang and Guang Lin. Robust data-driven discovery of governing physical laws with error bars. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2217):20180305, 2018. doi: 10.1098/rspa.2018.0305.
[24] N M Mangan, T Askham, S L Brunton, J N Kutz, and J L Proctor. Model selection for hybrid dynamical systems via sparse regression. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 475(2223):1–22, 2019. doi: 10.1098/rspa.2018.0534.
[25] Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. PDE-Net: Learning PDEs from data. arXiv preprint arXiv:1710.09668, 2017.
[26] Maziar Raissi and George Em Karniadakis. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, 357:125–141, 2018.
[27] M Raissi, P Perdikaris, and G E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019. doi: 10.1016/j.jcp.2018.10.045.
[28] Zichao Long, Yiping Lu, and Bin Dong. PDE-Net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network. arXiv preprint arXiv:1812.04426, 2018.
[29] Bin Dong, Qingtang Jiang, and Zuowei Shen. Image restoration: Wavelet frame shrinkage, nonlinear evolution PDEs, and beyond. Multiscale Modeling & Simulation, 15(1):606–660, 2017.
[30] Nicolai Meinshausen and Peter Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
[31] Rajen D Shah and Richard J Samworth. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1):55–80, 2013.
[32] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
[33] Thomas Blumensath and Mike E Davies. Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications, 14(5-6):629–654, 2008.
[34] Simon Foucart. Hard thresholding pursuit: an algorithm for compressive sensing. SIAM Journal on Numerical Analysis, 49(6):2543–2563, 2011.
[35] Peter Gross, K Vijay Kumar, Nathan W Goehring, Justin S Bois, Carsten Hoege, Frank Jülicher, and Stephan W Grill. Guiding self-organized pattern formation in cell polarity establishment. Nature Physics, 15(3):293, 2019.
[36] Rick Chartrand. Numerical differentiation of noisy, nonsmooth data. ISRN Applied Mathematics, 2011.
[37] Jonathan J Stickel. Data smoothing and numerical differentiation by a regularization method. Computers & Chemical Engineering, 34(4):467–475, 2010.
[38] Tong Tong Wu and Kenneth Lange. Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2(1):224–244, 2008.
[39] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
[40] Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.
[41] Patrick L Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011.
[42] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[43] Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
[44] Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(Nov):2541–2563, 2006.
[45] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
[46] Cun-Hui Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.
[47] Joel A Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.
[48] Deanna Needell and Joel A Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.
[49] Wei Dai and Olgica Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 55(5):2230–2249, 2009.
[50] Thomas Blumensath and Mike E Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
[51] Kyle K Herrity, Anna C Gilbert, and Joel A Tropp. Sparse approximation via iterative thresholding. In …, volume 3, pages III–III. IEEE, 2006.
[52] Joel A Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, 2006.
[53] Mário A T Figueiredo, Robert D Nowak, and Stephen J Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586–597, 2007.
[54] Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence, 1995.
[55] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
[56] … In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI 2015), pages 2729–2735. AAAI Press, 2015.
[57] Jacob Bien, Irina Gaynanova, Johannes Lederer, and Christian L Müller. Non-convex global minimization and false discovery rate control for the TREX. Journal of Computational and Graphical Statistics, 27(1):23–33, 2018. URL http://arxiv.org/abs/1604.06815.
[58] Bin Yu. Stability. Bernoulli, 19(4):1484–1500, 2013. URL http://projecteuclid.org/euclid.bj/1377612862.
[59] Han Liu, Kathryn Roeder, and Larry Wasserman. Stability approach to regularization selection (StARS) for high dimensional graphical models. Advances in Neural Information Processing Systems, 24(2):1432–1440, 2010.
[60] Peter Bühlmann, Markus Kalisch, and Lukas Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1:255–278, 2014.
[61] Sijian Wang, Bin Nan, Saharon Rosset, and Ji Zhu. Random lasso. The Annals of Applied Statistics, 5(1):468, 2011.
[62] Roger Temam. Navier–Stokes Equations: Theory and Numerical Analysis, volume 343. American Mathematical Soc., 2001.
[63] Pietro Incardona, Antonio Leo, Yaroslav Zaluzhnyi, Rajesh Ramaswamy, and Ivo F Sbalzarini. OpenFPM: A scalable open framework for particle and particle-mesh codes on parallel computers. Computer Physics Communications, 2019.
[64] Hans Meinhardt. Models of Biological Pattern Formation. New York, 1982.
[65] Liana Manukyan, Sophie A Montandon, Anamarija Fofonjka, Stanislav Smirnov, and Michel C Milinkovitch. A living mesoscopic cellular automaton made of skin scales. Nature, 544(7649):173, 2017.
[66] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
[67] Bijan Etemad-Moghadam, Su Guo, and Kenneth J Kemphues. Asymmetrically distributed PAR-3 protein contributes to cell polarity and spindle alignment in early C. elegans embryos. Cell, 83(5):743–752, 1995.
[68] Nathan W Goehring, Philipp Khuc Trong, Justin S Bois, Debanjan Chowdhury, Ernesto M Nicola, Anthony A Hyman, and Stephan W Grill. Polarization of PAR proteins by advective triggering of a pattern-forming system. Science, 334(6059):1137–1141, 2011.
[69] Joe H Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301):236–244, 1963.
[70] J Nathan Kutz, Samuel H Rudy, Alessandro Alla, and Steven L Brunton. Data-driven discovery of governing physical laws and their parametric dependencies in engineering, physics and biology. In …, pages 1–5. IEEE, 2017.
[71] David L Donoho and Matan Gavish. The optimal hard threshold for singular values is 4/√3. 2013.
[72] Per Christian Hansen. Truncated singular value decomposition solutions to discrete ill-posed problems with ill-determined numerical rank. SIAM Journal on Scientific and Statistical Computing, 11(3):503–518, 1990.
Supplementary Material
Algorithm 1

Problem (penalized form):  ξ̂ = argmin_ξ ‖U_t − Θ ξ‖_2^2 + λ ‖ξ‖_0

IHT-d(Θ, U_t, λ, maxit, subit):
    Initialize: ξ^(0) = 0
    for n = 0 to maxit do
        ∇g = −Θ^T (U_t − Θ ξ^(n))
        u^(n+1,0) = H_λ( ξ^(n) − μ1 · ∇g )
        S^(n+1) = supp( u^(n+1,0) )
        for l = 0 to subit do            (de-biasing restricted to the support)
            ∇g_s = ( −Θ^T (U_t − Θ u^(n+1,l)) )_{S^(n+1)}
            u^(n+1,l+1) = ( u^(n+1,l) − μ2 · ∇g_s )_{S^(n+1)}
            if ‖U_t − Θ u^(n+1,l+1)‖_2^2 ≤ λ |S^(n+1)| then
                return ξ̂ = u^(n+1,l+1)
            end if
        end for
        ξ^(n+1) = u^(n+1,l+1)
    end for
    return ξ̂ = u^(n+1,l+1)

IHT(Θ, U_t, λ, maxit):
    Initialize: ξ^(0) = 0
    for n = 0 to maxit do
        ∇g = −Θ^T (U_t − Θ ξ^(n))
        ξ^(n+1) = H_λ( ξ^(n) − μ1 · ∇g )
    end for
    return ξ̂ = ξ^(n+1)

Problem (constrained form):  ξ̂ = argmin_ξ ‖U_t − Θ ξ‖_2^2,  s.t.  ‖ξ‖_0 ≤ K

HTP(Θ, U_t, K, maxit):
    Initialize: ξ^(0) = 0
    for n = 0 to maxit do
        ∇g = −Θ^T (U_t − Θ ξ^(n))
        S^(n+1) = indices of the K largest-magnitude entries of ξ^(n) − μ1 · ∇g
        ξ^(n+1) = argmin_z { ‖U_t − Θ z‖_2^2 : supp(z) ⊆ S^(n+1) }
    end for
    return ξ̂

In the above algorithms, μ1 and μ2 are the step sizes (learning rates) of the IHT and de-biasing steps, respectively. The learning rate μ1 is computed as the inverse of the Lipschitz constant of the gradient of the square-loss function h(·) in Eq. (2.1.5), i.e., μ1 = 1.0/L. In a similar fashion, the learning rate in the de-biasing step is computed as μ2 = 1.0/L*, where L* is the Lipschitz constant of (h(·))_{S^(n+1)}, the square loss restricted to the support set S^(n+1).

Figure S1: Model selection with PDE-STRIDE + STRidge for 3D Gray-Scott v-component equation recovery: The stability plots for the design N = …, p =
69 show the separation of the true PDE components (in solid color) from the noisy components. The inference power of the PDE-STRIDE method is tested for additive Gaussian noise levels σ up to 6% (not shown). PDE-STRIDE fails to identify the true support of the v-component PDE, given the small diffusion coefficients (1E−…) associated with the unidentified diffusive components. However, the plots show consistency for the reaction terms up to 6% additive Gaussian noise levels. The inset at the bottom shows the correspondence of the colors with the PDE components.

Table S1:
Coefficients of the recovered v-component Gray-Scott reaction-diffusion equation for different noise levels. The stable components of the PDE from Figure S1 are Ŝ_stable = {v, uv}.

S.2 Denoising technique

For all data sets considered, both from experiments and simulation, we use singular value decomposition (SVD) for de-noising. De-noising is achieved by identifying the "elbow" in the singular-value plot and applying a hard threshold at the elbow to filter out the noise [71, 72].
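A minimal sketch of this truncated-SVD de-noising step on synthetic data (the function name `svd_denoise` is a hypothetical helper, not the paper's code, and the truncation rank r stands in for the manually identified elbow):

```python
import numpy as np

def svd_denoise(U_noisy, r):
    """Project the data matrix onto its r leading singular components."""
    U, s, Vt = np.linalg.svd(U_noisy, full_matrices=False)
    # Keep only the r largest singular values (hard threshold at the elbow)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

# Illustrative use on a synthetic rank-2 space-time field with additive noise;
# r = 2 plays the role of the elbow identified in the singular-value plot.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)[:, None]
t = np.linspace(0, 1, 100)[None, :]
clean = np.sin(2 * np.pi * x) @ np.cos(2 * np.pi * t) \
      + 0.5 * np.cos(4 * np.pi * x) @ np.sin(4 * np.pi * t)
noisy = clean + 0.05 * rng.standard_normal(clean.shape)
denoised = svd_denoise(noisy, r=2)
```

In practice the rank is chosen by visual inspection of the singular-value spectrum; [71] derives a principled hard threshold for this truncation.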
Figure S2: Truncated singular value decomposition (SVD) for denoising of the 1D Burgers data set.

Figure S3: Truncated singular value decomposition (SVD) for denoising of the 1D Burgers data set.

Figure S4: Model selection with PDE-STRIDE + STRidge for 1D Burgers equation inference:
The plots show the identification of the true PDE components (in solid color) from the noisy components. The stability selection parameters are N = …, p = … with B =
250 repetitions. The statistical power of the algorithm is tested for additive Gaussian noise levels up to 4%. In all cases, perfect recovery was possible with the fixed threshold π_th = ….

Figure S5: Model selection with PDE-STRIDE + STRidge for 2D vorticity-transport equation inference:
The plots show the identification of the true PDE components (in solid color) from the noisy components. The stability selection parameters are N = …, p =
48 with B =
250 repetitions. The statistical power of the algorithm is tested for additive Gaussian noise levels up to 5%. In all cases, perfect recovery was possible with the fixed threshold π_th = ….

Figure S6: Model selection with PDE-STRIDE + STRidge for inference of the u-component of the Gray-Scott reaction-diffusion equation: The plots show the identification of the true PDE components (in solid color) from the noisy components. The stability selection parameters are N = …, p =
69 with B =
250 repetitions. The statistical power of the algorithm is tested for additive Gaussian noise levels up to 6%. In all cases, perfect recovery was possible with the fixed threshold π_th = ….

Figure S7: Model selection with PDE-STRIDE + STRidge for inference of the v-component of the Gray-Scott reaction-diffusion equation: The plots show the identification of the true PDE components (in solid color) from the noisy components. The stability selection parameters are N = …, p =
69 with B =
250 repetitions. The statistical power of the algorithm is tested for additive Gaussian noise levels up to 6%. In all cases, perfect recovery was possible with the fixed threshold π_th = ….

Figure S8: Model selection with PDE-STRIDE + IHT-d for PAR model inference:
The stability results for the design N = …, p =
20 for both aPAR (left) and pPAR (right) are shown. In the stability set for aPAR, Ŝ_stable = {A, AP, v_x A}, we see the advection-dominant term v_x A also appearing. The diffusion term A_xx is also seen very close to the threshold π_th = 0.8.
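As an illustration of the constrained (K-sparse) variant of the IHT update in Algorithm 1, the following is a minimal sketch on synthetic data (not the paper's implementation; `iht` is a hypothetical helper, and the step size follows the μ = 1/L rule stated above):

```python
import numpy as np

def iht(Theta, U_t, K, maxit=1000):
    """K-sparse iterative hard thresholding:
    minimize ||U_t - Theta @ xi||_2^2  subject to  ||xi||_0 <= K."""
    p = Theta.shape[1]
    # Step size = inverse Lipschitz constant of the square-loss gradient,
    # mu = 1/L with L = ||Theta||_2^2 (spectral norm squared), as in Algorithm 1.
    mu = 1.0 / np.linalg.norm(Theta, 2) ** 2
    xi = np.zeros(p)
    for _ in range(maxit):
        grad = -Theta.T @ (U_t - Theta @ xi)  # gradient of the square loss
        z = xi - mu * grad                    # gradient step
        keep = np.argsort(np.abs(z))[-K:]     # hard threshold: keep K largest
        xi_new = np.zeros(p)
        xi_new[keep] = z[keep]
        if np.allclose(xi_new, xi, atol=1e-12):  # converged to a fixed point
            break
        xi = xi_new
    return xi
```

In IHT-d, each such thresholding step is additionally followed by gradient de-biasing steps restricted to the current support set, as stated in Algorithm 1.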