Robust learning from noisy, incomplete, high-dimensional experimental data via physically constrained symbolic regression

Abstract

Machine learning offers an intriguing alternative to first-principles analysis for discovering new physics from experimental data. However, to date, purely data-driven methods have only proven successful in uncovering physical laws describing simple, low-dimensional systems with low levels of noise. Here we demonstrate that combining a data-driven methodology with some general physical principles enables discovery of a quantitatively accurate model of a non-equilibrium spatially-extended system from high-dimensional data that is both noisy and incomplete. We illustrate this using an experimental weakly turbulent fluid flow where only the velocity field is accessible. We also show that this hybrid approach allows reconstruction of the inaccessible variables -- the pressure and forcing field driving the flow.


Patrick A.K. Reinbold, Logan M. Kageorge, Michael F. Schatz, and Roman O. Grigoriev

School of Physics, Georgia Institute of Technology, Atlanta, Georgia 30332, USA
Corresponding author, email: [email protected]
(Dated: February 25, 2021)

Revolutionary advances in our ability to collect, store, and process vast amounts of information have unleashed machine learning as a dramatically different approach to scientific discovery [1–3]. Initial efforts have focused on purely data-driven methods to synthesize knowledge in the form of equations. For instance, symbolic regression has been applied successfully to extract both evolution laws expressed as ordinary differential equations [4] and conservation laws in the form of algebraic equations [5] from low-dimensional data with low levels of noise. Unfortunately, to date, purely data-driven approaches have been unable to handle high-dimensional data sets representing complex or spatially-extended non-equilibrium phenomena such as cancer, fusion plasmas, earthquakes, weather, or climate change. A key difficulty is that, without appropriate constraints, the high dimensionality of the data makes the model search space far too large for any purely data-driven approach to be tractable.

In principle, machine learning can be used to construct suitable models (e.g., nonlinear partial differential equations (PDEs)) of spatially extended systems [6, 7]; however, numerous difficulties arise when using data from the real world. First and foremost, all the variables (or fields) that are necessary to describe the phenomena of interest should be identified; no existing purely data-driven approach can help with this. Second, some of the required variables may not be accessible in a real world problem; to date, no known machine learning method has been successful in model discovery based on incomplete data. Third, data from real world problems often involve significant uncertainty due to both random and systematic errors, which, as a consequence, makes accurate evaluation of particular, crucially important model terms infeasible. Finally, unlike the test cases using synthetic data generated by a reference model [6, 7], assessing the quality of a model learned from real world data is not straightforward. The fusion of domain knowledge with data science [8] is essential for addressing these challenges.

Here we present such a hybrid approach, which uses appropriate physical constraints (e.g., locality, smoothness, symmetries) to dramatically constrain the search space containing various candidate models. Our approach incorporates three key ingredients: (1) general physical principles used to identify the variables and candidate models, (2) weak formulation of differential equations to reduce noise sensitivity and eliminate dependence on inaccessible variables, and (3) ensemble symbolic regression to identify a parsimonious model that balances accuracy and simplicity. To illustrate, we examine an experimental fluid flow in a thin layer that exhibits complex spatio-temporal behavior when driven by time-independent forcing [9] (see Fig. 1 and the Methods section). We show that a quantitative 2D model of this flow can be discovered using experimental measurements of the horizontal components of the velocity field $\mathbf{u}(\mathbf{x}, t)$. Furthermore, using this model, all latent fields (here pressure and forcing) can also be reconstructed.

A HYBRID APPROACH TO MODEL DISCOVERY

We start by describing the three key components of the hybrid approach to model discovery. Additional details are provided in the Methods section.

Constructing the model library

The first two steps of model discovery are to identify a set of variables (fields) required to describe the data and to construct a sufficiently broad library of candidate models that will later be narrowed down to obtain a parsimonious description. In practice, these two steps may be hard, or even impossible, to separate and, for systems of high dimensionality, require additional considerations based on domain knowledge. For the system considered here, the general physical assumptions of causality, locality, and smoothness can be used to write the model in the form of a Volterra series [10]. Each term $\mathbf{F}_n$ of the series involves a product of the velocity field $\mathbf{u}$, latent fields, and/or their partial derivatives. Since we are dealing with a fluid flow, we can rely on more specific domain knowledge, recognizing that the fluid flow is driven by external and internal stresses. Hence, the evolution of the velocity field should depend on body forces $\mathbf{f}$ and pressure $p$, which are the latent fields here:

$$\partial_t \mathbf{u} = \sum_n c_n \mathbf{F}_n[\mathbf{u}, p, \mathbf{f}, \nabla\mathbf{u}, \nabla p, \nabla\mathbf{f}, \ldots]. \tag{1}$$

FIG. 1. Schematic top (a) and side (b) views are shown for laboratory studies of weak turbulence in a thin electrolyte layer inside a rectangular container. Flow is driven by Lorentz forcing $\mathbf{f}$, which arises by applying a current density $\mathbf{J}$ in the presence of a magnetic field $\mathbf{B}$ from a permanent magnet array (dashed lines). The snapshots illustrate measured velocity fields in the $x$- (c) and $y$- (d) directions at Reynolds number $Re = 22.17$, when the flow is weakly turbulent.

The library of candidate models can be further constrained by using another general physical concept, Euclidean symmetry, which reflects the uniformity and isotropy of the fluid layer. Truncating the sum at a sufficiently low order in the fields and derivatives yields [11]

$$\partial_t \mathbf{u} = c_1 (\mathbf{u}\cdot\nabla)\mathbf{u} + c_2 \nabla^2\mathbf{u} + c_3 \mathbf{u} + c_4 u^2\mathbf{u} + c_5 \omega^2\mathbf{u} + c_6 (\nabla\cdot\mathbf{u})\mathbf{u} + c_7 (\nabla\cdot\mathbf{u})^2\mathbf{u} - \rho^{-1}\nabla p + \rho^{-1}\mathbf{f}, \tag{2}$$

where $\omega = \hat{\mathbf{z}}\cdot(\nabla\times\mathbf{u})$ is the vorticity and $u^2 = \mathbf{u}\cdot\mathbf{u}$. Isotropy constrains the functional form of the library terms, each of which transforms as a vector, while uniformity implies that the unknown coefficients are constants, i.e., independent of position and time. Note that, without loss of generality, the coefficients of the last two terms can be set to $\pm\rho^{-1}$, where $\rho$ is an arbitrary constant with the units of mass density; this simply amounts to fixing the units (and sign) of the pressure and forcing fields. While the forcing in this particular experiment is time-independent, the pressure varies in time and so requires its own model. A corresponding library of candidate models can be constructed in a similar way which, after truncation to lowest-order terms, yields

$$\partial_t p = c_8 \nabla\cdot\mathbf{u} + c_9 \nabla\cdot\mathbf{f} + c_{10}\, p. \tag{3}$$

Here each term transforms as a scalar, and $c_8$, $c_9$, and $c_{10}$ are additional unknown constants.
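To make the library in equation (2) concrete, the sketch below evaluates its velocity-dependent candidate terms pointwise on a uniform grid, i.e., in the strong form. It is an illustration only: the function name `library_terms`, the dictionary keys, the use of second-order finite differences via `np.gradient`, and the synthetic divergence-free test field are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def library_terms(ux, uy, dx, dy):
    """Pointwise values of the u-dependent candidate terms of Eq. (2).

    ux, uy are 2D arrays indexed [ix, iy] on a uniform grid. Each entry of
    the returned dict is an (x-component, y-component) pair.
    """
    dux_dx, dux_dy = np.gradient(ux, dx, dy)
    duy_dx, duy_dy = np.gradient(uy, dx, dy)

    def laplacian(f):
        # Laplacian built from two passes of first derivatives
        fx, fy = np.gradient(f, dx, dy)
        return np.gradient(fx, dx, axis=0) + np.gradient(fy, dy, axis=1)

    div = dux_dx + duy_dy      # divergence of u
    omega = duy_dx - dux_dy    # vorticity, z . curl(u)
    u2 = ux**2 + uy**2         # u . u

    terms = {
        "(u.grad)u": (ux * dux_dx + uy * dux_dy, ux * duy_dx + uy * duy_dy),
        "lap u": (laplacian(ux), laplacian(uy)),
        "u": (ux, uy),
        "u^2 u": (u2 * ux, u2 * uy),
        "omega^2 u": (omega**2 * ux, omega**2 * uy),
        "(div u) u": (div * ux, div * uy),
        "(div u)^2 u": (div**2 * ux, div**2 * uy),
    }
    return terms, div, omega

# Synthetic divergence-free test field (not experimental data)
x = np.linspace(0.0, 2.0 * np.pi, 129)
y = np.linspace(0.0, 2.0 * np.pi, 129)
X, Y = np.meshgrid(x, y, indexing="ij")
ux = np.sin(X) * np.cos(Y)
uy = -np.cos(X) * np.sin(Y)

terms, div, omega = library_terms(ux, uy, x[1] - x[0], y[1] - y[0])
interior = (slice(2, -2), slice(2, -2))  # one-sided edge stencils are less accurate
print("max |div u| in the interior:", np.abs(div[interior]).max())
print("max |omega| in the interior:", np.abs(omega[interior]).max())
```

For this test field the two terms proportional to the divergence vanish identically, which is the situation the divergence-free constraint on the library addresses. Evaluating derivatives of noisy measurements in this way is exactly what the weak formulation is designed to avoid.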
We can further constrain both libraries using the experimental observation that, to high accuracy, the velocity field is divergence-free, which corresponds to setting $c_6 = c_7 = 0$ in equation (2) and $c_8 \to \infty$ in equation (3).

The need for including in the model the dependence on the pressure and forcing fields could be discovered from data directly, without relying on knowledge of fluid dynamics. We can rewrite equation (2) in the form

$$\rho\,\mathbf{s} = -\nabla p + \mathbf{f}, \tag{4}$$

where $\mathbf{s}$ represents the sum of all the terms that depend only on $\mathbf{u}$ and its partial derivatives. In general we would find $\mathbf{s} \neq 0$ for any choice of the coefficients. Helmholtz decomposition requires $\mathbf{s} = \nabla\phi + \nabla\times\mathbf{A}$, where $\phi$ and $\mathbf{A}$ are the scalar and vector potentials. Hence two additional fields, one scalar and one vector, are required to satisfy equation (4): $p = -\rho\phi$ and $\mathbf{f} = \rho\nabla\times\mathbf{A}$.

Weak formulation of the model

Although symbolic regression could be performed using the strong form of the model, e.g., by directly evaluating each term in equation (2) at different spatiotemporal locations, this presents two problems. The most obvious one is that we cannot evaluate the terms involving latent fields. Pressure could, in principle, be computed by taking the divergence of equation (2) and solving the resulting pressure-Poisson equation, if the forcing $\mathbf{f}$ were known or at least divergence-free. In our case, this is not an option, since $\mathbf{f}$ satisfies neither condition. Furthermore, taking a derivative greatly amplifies the noise present in the data, whether this is done using finite differences [6, 12], polynomial interpolation [11], or spectral methods [13, 14]. Instead we use a weak form of the model to address both noise sensitivity and the dependence on latent variables. This approach was originally introduced in the context of ordinary differential equations [15, 16]. In the context of PDE models, it was shown to be as general as prior approaches based on the strong form [6, 7] and superior in terms of both its flexibility and robustness [17, 18].

Let us choose a set of spatiotemporal domains $\Omega_i$ and weight functions $\mathbf{w}_j$ (see the Methods section and Fig. 2) and define

$$\langle \mathbf{w}_j, \mathbf{F}_n \rangle_i = \int_{\Omega_i} \mathbf{w}_j \cdot \mathbf{F}_n \, d\Omega, \tag{5}$$

where $d\Omega = dx\,dy\,dt$ and $n = 0$ corresponds to the term $\partial_t \mathbf{u}$. Evaluating the integrals in equation (5) for different $i$ and $j$ and stacking the results to form vectors $\mathbf{q}_n$, we arrive at a linear system of equations for the unknown coefficients

$$Q\mathbf{c} = \mathbf{q}_0, \tag{6}$$

where $\mathbf{c} = [c_1, \cdots, c_N]^T$ and $Q = [\mathbf{q}_1 \cdots \mathbf{q}_N]$.

Ensemble symbolic regression

A parsimonious model describing the data can be found by solving the over-determined system (6) using any standard algorithm such as LASSO [19], ridge regression [20], sequentially thresholded least-squares [21], or various information-theoretic criteria [22]. Here we adopt the computationally efficient iterative procedure introduced in Ref. [18], which is an adaptation of the latter algorithm. At each iteration, equation (6) is solved to find the parameters $c_1$ through $c_N$. Then, the magnitude of each term is computed. If it is below some threshold, say $\|c_n \mathbf{q}_n\| < \varepsilon \|\mathbf{q}_0\|$ for a given choice of $\varepsilon$, the corresponding term is removed from the library by setting $c_n = 0$, and the column $\mathbf{q}_n$ is removed from the matrix $Q$. The process is then repeated until all remaining terms have a magnitude that is above the threshold.

How well a model describes a particular data set can be quantified in terms of the relative residual

$$\eta = \frac{\|Q\mathbf{c} - \mathbf{q}_0\|}{\max_n \|c_n \mathbf{q}_n\|}, \tag{7}$$

where we expect $\eta \ll 1$. The residual $\eta$, however, tells us little about the functional form of the model or the magnitude of the respective coefficients. For instance, including a term such as $c_6 (\nabla\cdot\mathbf{u})\mathbf{u}$ with an arbitrary coefficient $c_6$ in equation (2) does not change $\eta$ for a flow that is incompressible, but does change the model [11]. The robustness of the functional form of the model and the accuracy with which the coefficients $c_n$ are determined can both be quantified by performing symbolic regression for an ensemble of different samplings of the data (or even different data sets) [18]. Here, each ensemble includes different distributions of integration domains in the temporal direction. The variation in the functional form of the identified model across the ensemble can be used to detect missing or spurious terms, while the standard deviation of the coefficients $c_n$ can be used to quantify their accuracy.

RESULTS

To test our approach for model discovery, we measured the velocity field components in the plane of the fluid layer and performed symbolic regression for an ensemble of 30 different random distributions of spatiotemporal domains $\Omega_i$. We found that choosing $0.\ldots \lesssim \varepsilon \lesssim \ldots$ consistently identifies the same model (Fig. 3f). For higher $\varepsilon$, the model does not fit the data accurately, as measured by $\eta$. For lower $\varepsilon$, the functional form of the model acquires a sensitive dependence on the choice of spatiotemporal domains $\Omega_i$, which is a sign of overfitting. Over the range of Reynolds numbers $17.\ldots \lesssim Re \lesssim \ldots$, symbolic regression identifies the model

$$\partial_t \mathbf{u} = c_1 (\mathbf{u}\cdot\nabla)\mathbf{u} + c_2 \nabla^2\mathbf{u} + c_3 \mathbf{u} - \rho^{-1}\nabla p + \rho^{-1}\mathbf{f}, \tag{8}$$

with $\eta$ as low as 0.02 (see Fig. 3d). This model allows easy interpretation, since its form is similar to the Navier-Stokes equation, which represents momentum balance. The first term on the right-hand side describes advection of momentum. The second and third terms describe momentum flux due to viscosity in the horizontal and vertical directions [9, 23], respectively. The fourth and fifth terms also appear in the Navier-Stokes equation and describe (isotropic) internal stresses and external stresses, respectively.

It is worth emphasizing that the form of the 2D model identified by symbolic regression is identical to that derived from first principles [9, 24] under a number of assumptions, including the divergence-free condition on the horizontal components of the velocity. Dropping this assumption produces a more general model [25], which is a special case of the system (2)-(3) with $c_6 \neq 0$, $c_7 = 0$, $c_8 \neq \infty$, and $c_4 = c_5 = 0$. In both cases, the coefficients $c_1$, $c_2$, and $c_3$ are nonzero and given by explicit expressions in terms of the material parameters and the geometry of the fluid layer [9]. The theoretical values of the parameters are compared with the respective values identified by symbolic regression in Fig. 3(a-c).
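The thresholding iteration and the ensemble statistics used above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the matrix `Q_pool` is filled with random numbers standing in for the weak-form integrals, the coefficients in `c_true`, the noise level, the subset size of 200 rows, and the threshold `eps=0.1` are all hypothetical, and drawing random row subsets is used as a stand-in for drawing different random distributions of integration domains.

```python
import numpy as np

def sparse_regression(Q, q0, eps):
    """Solve Q c = q0, repeatedly dropping terms with ||c_n q_n|| < eps ||q0||."""
    active = np.ones(Q.shape[1], dtype=bool)
    c = np.zeros(Q.shape[1])
    while active.any():
        c[:] = 0.0
        c[active] = np.linalg.lstsq(Q[:, active], q0, rcond=None)[0]
        magnitude = np.linalg.norm(c * Q, axis=0)  # ||c_n q_n|| for each column
        small = active & (magnitude < eps * np.linalg.norm(q0))
        if not small.any():
            break
        active &= ~small   # remove the corresponding columns from the library
        c[small] = 0.0
    magnitude = np.linalg.norm(c * Q, axis=0)
    # Relative residual of Eq. (7); the tiny floor only guards against an empty model
    eta = np.linalg.norm(Q @ c - q0) / max(magnitude.max(), np.finfo(float).tiny)
    return c, eta

rng = np.random.default_rng(0)
c_true = np.array([1.0, 0.5, -0.3, 0.0, 0.0])      # hypothetical coefficients
Q_pool = rng.standard_normal((2000, c_true.size))  # stand-in for weak-form integrals
q_pool = Q_pool @ c_true + 0.01 * rng.standard_normal(2000)

coeffs, etas = [], []
for _ in range(30):  # ensemble of 30 random samplings
    rows = rng.choice(2000, size=200, replace=False)
    c, eta = sparse_regression(Q_pool[rows], q_pool[rows], eps=0.1)
    coeffs.append(c)
    etas.append(eta)
coeffs = np.array(coeffs)

retained = (coeffs != 0).mean(axis=0)  # fraction of the ensemble keeping each term
print("retention probability:", retained)
print("mean coefficients    :", coeffs.mean(axis=0))
print("std of coefficients  :", coeffs.std(axis=0))
print("largest residual eta :", max(etas))
```

The retention fraction plays the role of the probability plotted against $\varepsilon$ in Fig. 3(f), and the standard deviation over the ensemble corresponds to the vertical error bars in Fig. 3(a)-(d).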
FIG. 2. Three key elements, sampling, weighting and regression, for the weak formulation of symbolic regression are schematically illustrated. (a) Integration domains, shown as red boxes, are randomly sampled throughout the 2D space-1D time data set. (b) For each integration domain $\Omega_i$, the data $\mathbf{u} = u_x \hat{\mathbf{x}} + u_y \hat{\mathbf{y}}$ and the weights $\mathbf{w}_j = \nabla \times [\phi_j \hat{\mathbf{z}}]$ are used to evaluate the scalar product $\langle \mathbf{w}_j \cdot \mathbf{F}_n \rangle$, as discussed in the Methods section. The result determines the matrix element $Q_{kn}$ (for $n \neq 0$) or the $k$-th element of $\mathbf{q}_0$ (for $n = 0$), where the composite index $k$ runs over all integration domains $i$ and weights $j$. The columns are labeled using the corresponding terms in the model instead of the index $n$ to make the relation with the linear system (6) more transparent. (c) A sparse solution to the system is then found via sequential thresholding, where one (or more) columns are removed from the matrix $Q$ (and the model) at each iteration, until a parsimonious model balancing accuracy with simplicity is identified (bottom of (c)).
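Why weights of the form $\mathbf{w}_j = \nabla \times [\phi_j \hat{\mathbf{z}}]$ remove the pressure from the weak form can be checked numerically. The sketch below is an illustration under assumptions that are not taken from the paper: it considers a purely spatial domain $[-1, 1]^2$, a hypothetical potential $\phi = (1 - x^2)^2 (1 - y^2)^2$ that vanishes on the domain boundary, an arbitrary smooth "pressure", and Gauss-Legendre quadrature. The actual weight functions are specified in the Methods section and are not reproduced here.

```python
import numpy as np
from numpy.polynomial.legendre import leggauss

# Tensor-product Gauss-Legendre quadrature on the square [-1, 1] x [-1, 1]
nodes, weights = leggauss(40)
X, Y = np.meshgrid(nodes, nodes, indexing="ij")
W = np.outer(weights, weights)

def integrate(wx, wy, Fx, Fy):
    """Spatial analogue of Eq. (5): integral of w . F over the domain."""
    return float(np.sum(W * (wx * Fx + wy * Fy)))

# Hypothetical potential phi = (1 - x^2)^2 (1 - y^2)^2, zero on the boundary
dphi_dx = -4.0 * X * (1.0 - X**2) * (1.0 - Y**2) ** 2
dphi_dy = -4.0 * Y * (1.0 - Y**2) * (1.0 - X**2) ** 2
wx, wy = dphi_dy, -dphi_dx  # w = curl(phi z_hat) = (d phi/dy, -d phi/dx)

# Illustrative pressure p = sin(2x + y) + x y^2 and its analytic gradient
dp_dx = 2.0 * np.cos(2.0 * X + Y) + Y**2
dp_dy = np.cos(2.0 * X + Y) + 2.0 * X * Y

# Illustrative rotational field (-y, x), which has constant curl 2
rot_x, rot_y = -Y, X

I_grad = integrate(wx, wy, dp_dx, dp_dy)
I_rot = integrate(wx, wy, rot_x, rot_y)
print("integral of w . grad(p):", I_grad)
print("integral of w . rot    :", I_rot)
```

Integration by parts moves the curl from the weight onto the field, so any gradient contributes nothing (up to round-off), while the rotational field contributes $\int \phi\,(\nabla \times \mathbf{F})_z\, dx\, dy = 2 (16/15)^2$ for this choice of $\phi$. Only the elimination of the pressure term is illustrated; the treatment of the forcing relies on a separate constraint on the weights discussed in the Methods section.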
Note that all three parameters identified using experimental data are close, but not identical, to the theoretical values (Fig. 3a-3c). This helps explain the discrepancy in the critical $Re$ of the primary instability in this system in experiment and numerics [24]. The original study estimated that a 22% increase in the value of $c_3$ would be required to match the observed value with the model predictions, assuming the other two parameters do not change. The identified values of $c_3$ are about 25% higher than the theoretical value (Fig. 3c), which is consistent with that estimate.

The accuracy with which the parameters of the model are estimated via symbolic regression can be judged based on both their standard deviation for each ensemble and the variation of the mean between different data sets at roughly the same $Re$. The former is much smaller than the latter, and so may underestimate the true uncertainty. Different data sets represent separate experiments, so, conversely, the variation in the mean could also reflect the (small) variation in the conditions of the experiment (e.g., the thickness of the fluid layers). While the difference in the mean values of $c$ for the two data sets at $Re \approx 36$, where the flow is weakly turbulent, is probably attributable to just such a variation in the conditions, the much larger variation in the mean of $c_2$ and $c_3$ for the three data sets at $Re \approx 18$ (Figs. 3b and 3c) is most likely due to a qualitative change in the dynamics. For $17.\ldots \lesssim Re \lesssim 19$ the flow becomes time-periodic [24]. The amplitude of the temporal oscillation decreases substantially as $Re$ approaches $Re \approx \ldots.8$, leading to a corresponding decrease in the magnitude of all the terms (Fig. 3e) and an increase in $\eta$ (Fig. 3d). Indeed, the constraint (12) on the weight functions implies that $\langle \mathbf{F}_n, \mathbf{w}_j \rangle = 0$ for all $n$ for a stationary flow. Hence our particular choice of the weight functions is only suitable for flows that are time-dependent. This is the fundamental reason why the accuracy of the reconstructed model decreases at the low end of the $Re$ range explored here, where the magnitude of the time-dependent component of the velocity field becomes comparable to the measurement error of the particle image velocimetry (PIV). The breakdown of our approach for steady flows is not an inherent problem of symbolic regression but is rather due to the presence of latent variables, mainly the steady forcing which the constraint (12) was aimed to eliminate. One way to get around this limitation is to analyze transient flows relaxing towards the steady state.

FIG. 3. Model parameters, shown in panels (a)-(c), are consistently well-estimated from experimental data for a range of Reynolds numbers $Re$, particularly when the amplitude of flow time-dependence is sufficiently large, as illustrated in panels (d) and (e). For the results shown, flows in experiments are time-periodic for $Re \lesssim 19$, and weakly turbulent otherwise. In panels (a)-(c), parameters obtained using ensemble averaging (black dots) are compared with the corresponding value obtained using first-principles analysis (dashed line) performed for time-independent flows at low $Re$. In panel (d), low values of the residual $\eta$ (equation (7)) indicate good parameter fits; the relative quality of fit deteriorates in a regime ($Re \lesssim 19$) where flow time-dependence is weak and, therefore, the maximum magnitude of terms in equation (6) is small (e). The terms retained in a parsimonious model depend on a choice of threshold $\varepsilon$; the probability of retaining the term $\mathbf{F}_n$ as a function of $\varepsilon$ shown in (f) indicates the model given by equation (8) is consistently identified by choosing $0.\ldots \lesssim \varepsilon \lesssim \ldots.3$. The vertical error bars in panels (a)-(d) represent the standard deviation over the ensemble (in most instances they are smaller than the symbol size) and the horizontal error bars represent the variation in $Re$ over the data set.

Once the parsimonious model has been identified, the latent fields can be determined as well. Using the Helmholtz decomposition in equation (4), the pressure $p$ and forcing $\mathbf{f}$ can be computed at each time $t$ represented in the data set, as discussed in the Methods section. The movie showing the time evolution of the reconstructed pressure field is included as supplementary material.

The electrical current is uniform in the electrolyte layer, hence the forcing field $\mathbf{f} = f(x, y)\hat{\mathbf{x}}$ that appears in the 2D model of the fluid flow should correspond to the depth average of the Lorentz force across the electrolyte layer:

$$f(x, y) \propto \int J B_z(x, y, z)\, dz. \tag{9}$$

The forcing profile reconstructed from the measured flow field is compared with the Lorentz force computed from direct experimental measurement of the magnetic field according to equation (18) in Fig. 4, which shows that the two profiles are almost indistinguishable.

DISCUSSION

As we have demonstrated here, a data-driven approach based on symbolic regression can successfully discover a quantitatively accurate model of a fairly complicated and high-dimensional non-equilibrium system with highly nontrivial dynamics using noisy, incomplete experimental measurements. Unlike artificial neural network models [26, 27] that trade off interpretability for generality, our model has the form of a PDE which is both straightforward to interpret and allows the latent fields to be easily reconstructed. The discovered model can also be directly compared with other models of the same system constructed from first principles. This comparison suggests that the first-principles models do capture all the relevant physical mechanisms qualitatively, but fail to describe them quantitatively with sufficient accuracy, indicating that the assumptions used in their derivation require refinement.

FIG. 4. The x-component of the forcing field driving the flow. Depth-averaged Lorentz force $\mathbf{J} \times \mathbf{B}$ computed using experimental measurement of the magnetic field (a) is virtually indistinguishable from the forcing field f reconstructed using equation (8) for Re = 22.17 (b). In (c), the reconstructed (blue line) and measured (black circles) forcing profiles, both normalized by their maximum magnitude, are compared along the line x = 0 (dashed lines in (a) and (b)); this normalization also removes the dependence on an arbitrary choice of ρ in equation (8).

Although our results validate the practical utility of data-driven model discovery, they also highlight the need for a hybrid approach. Such an approach combines a number of general physical constraints (most notably locality, causality, and spatial symmetries), which are used to generate a library of candidate models, with symbolic regression, which down-selects from this library the parsimonious model that best describes the data. Although purely data-driven approaches such as manifold learning [28] can be used to help with library construction, it is unlikely that this approach remains tractable for high-dimensional systems such as the one considered here. We have also relied on fairly specific domain knowledge to identify the latent fields that are not a part of the data. While in our case their presence is suggested by the structure of the model, no general approach to identifying latent variables from data has been developed so far.

Domain knowledge also plays an essential role in choosing the weight functions. We used both the functional form of the terms involving the latent variables (e.g., $\nabla p$) and the known properties of the latent fields (e.g., the forcing f being time-independent) to eliminate the dependence on both p and f from the regression problem. This would not have been possible without using some domain knowledge, illustrating the limitations of the purely data-driven approach.
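The elimination step can be written out explicitly. The following is only a sketch under stated assumptions: the inner product $\langle \mathbf{a}, \mathbf{b} \rangle_i$ is taken to be the integral of $\mathbf{a} \cdot \mathbf{b}$ over a space-time domain $\Omega_i$ of temporal extent $|t - t_i| \le H_t$, the weight function $\mathbf{w}_j$ is assumed to vanish on the boundary of $\Omega_i$, and $A_i$ (a symbol introduced here purely for illustration) denotes the spatial cross-section of $\Omega_i$.

```latex
\begin{align}
\langle \mathbf{w}_j, \nabla p \rangle_i
  &= \underbrace{\int_{\Omega_i} \nabla \cdot (p\, \mathbf{w}_j)\, d\Omega}_{=\,0
     \text{ since } \mathbf{w}_j = 0 \text{ on } \partial\Omega_i}
   - \int_{\Omega_i} p\, \nabla \cdot \mathbf{w}_j\, d\Omega
   = 0 \quad \text{if } \nabla \cdot \mathbf{w}_j = 0, \\
\langle \mathbf{w}_j, \mathbf{f} \rangle_i
  &= \iint_{A_i} \mathbf{f}(x, y) \cdot
     \left[ \int_{t_i - H_t}^{t_i + H_t} \mathbf{w}_j\, dt \right] dx\, dy
   = 0 \quad \text{if } \mathbf{w}_j \text{ is odd in } t - t_i .
\end{align}
```

The first line uses only the functional form of the pressure term (a gradient), while the second uses only the fact that the forcing does not depend on time, so it can be pulled out of the temporal integral.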

It should also be mentioned that the dependence on latent fields cannot always be eliminated in a way that still allows the governing equations to be identified. For instance, our approach would not succeed without measurement of the velocity field, even if the pressure were known.

The success of any data-driven approach is also heavily dependent on the data used [29]. In particular, for PDE discovery, the data should exhibit variation in all independent coordinates. In the present problem, we find that symbolic regression identifies a sparse model with high accuracy for higher Re where the flow is weakly turbulent and the velocity field varies in time and both spatial coordinates. The very same approach experiences difficulties at lower Re where the flow becomes (nearly) stationary. Indeed, once the time-dependence is lost, we have $q_n = 0$ for all n, so that equation (6) becomes an identity which cannot be solved for c.

Finally, it should be pointed out that the approach presented in this paper is not limited to models in the form of a single parabolic PDE, such as equation (2). It can be applied without significant modification to systems of any number of parabolic, hyperbolic, or elliptic second-order PDEs, as well as higher-order PDEs and ordinary differential equations. In particular, there is no need to separate out the terms such as $\partial_t \mathbf{u}$, which are only present in equations governing temporal evolution. In their absence, the linear system that appears in symbolic regression can be solved using alternative approaches such as singular value decomposition [17].

METHODS

Experimental system and data collection

Our experimental setup is the same as the one used in Ref. [24]. The flow is produced in a shallow electrolyte-dielectric bi-layer in a rectangular container, the top view of which is shown in Fig. 1a. The two fluids are immiscible, and both layers have a thickness of 0.3 cm and horizontal extent of $L_x = 17.$… $\times$ $L_y = 22.$… […] °C, corresponding to a 0.… […] $w = 1.$… […] $\mathbf{J} = J\hat{\mathbf{y}}$ passes through the electrolyte layer. Its interaction with the magnetic field produces a Lorentz force $\mathbf{J} \times \mathbf{B}$ that drives the flow. The z-component of the magnetic field has been measured at a resolution of 10 points per magnet width in each of 7 equally spaced horizontal planes throughout the electrolyte layer. These measurements were only used as a reference to validate the results of our reconstruction procedure.

The electrolyte-dielectric interface is seeded with fluorescent microspheres in order to measure 2D velocity fields quantifying the horizontal flow via particle image velocimetry (PIV) [30]. A typical snapshot of the velocity field is shown overlaid on its corresponding vorticity in Fig. 1. The strength of the flow is characterized by the Reynolds number $\mathrm{Re} = \bar{u} w / \bar{\nu}$, where $\bar{u}$ is the RMS velocity within the central $8w \times w$ region of the domain, and $\bar{\nu} = 3.$… $\times 10^{-\ldots}$ m²/s is the characteristic depth-averaged viscosity chosen to allow direct comparison with the results of previous studies of this experimental system [9, 24, 31-33]. For Re ≲ 50, the vertical (z) component of the flow is negligibly small, so that the horizontal flow can be considered divergence-free [9].

Each data set represents the x and y components of the velocity field sampled on a uniform grid ($\Delta x = \Delta y$) within the flow domain and covers a temporal interval of at least 600 s with temporal resolution $\Delta t = 1$ s. The characteristic time scale τ of the flow varies with Re. At low Re, the flow is periodic, with a period of around 120 s. At higher Re, the flow is aperiodic, with an autocorrelation time which decreases with Re [31]. The spatial resolution of the data is between 6 and 10 grid points per magnet width w, which is the characteristic length scale of the flow. The temporal extent $L_t$ and the spatial resolution of each data set, labeled by the mean Reynolds number, are given in Table I.

The z-component of the magnetic field is measured at a resolution of 10 points per magnet width in each of 7 equally spaced horizontal planes throughout the electrolyte layer. The average of these planes is shown in Fig. 4a in comparison with the reconstructed forcing in Fig. 4b.

Integration domains and weight functions

For simplicity, we take the integration domains to be rectangular and centered at different grid points $(x_i, y_i, t_i)$,

$$\Omega_i = \left\{ (x, y, t) \;\big|\; |x - x_i| \le H_x,\ |y - y_i| \le H_y,\ |t - t_i| \le H_t \right\}, \tag{10}$$

where $H_l$ is the half-width of the integration domain in the direction $l = \{x, y, t\}$. All the domains $\Omega_i$ have the same size; they are centered spatially and distributed temporally throughout the data set, as shown in Fig. 2. Since integration leads to a reduction of noise due to averaging [17], the domains are chosen to be large in both spatial directions. Their spatial width $2H_x \times 2H_y$ was chosen to be slightly smaller than the size $L_x \times L_y$ of the flow domain to avoid the regions near the side walls where PIV is noisier than in the bulk. The temporal width $2H_t$ was chosen to be smaller than the temporal extent $L_t$ of the data set to limit overlap between different integration domains, so that rows of equation (6) could remain linearly independent. Specific values of $H_x$, $H_y$, and $H_t$ for each data set are given in Table I.

TABLE I. Description of the data sets used for the symbolic regression analysis. Re denotes the mean Reynolds number. Times τ marked with an asterisk (*) represent temporal period, whereas those without represent autocorrelation time. Columns: Re, τ (s), $L_t/\tau$, $H_x/L_x$, $H_y/L_y$, $H_t/L_t$, $\Delta x/w$, $\Delta t/\tau$.

As mentioned previously, each partial derivative of the velocity field increases the noise that is inevitably present in the PIV data. Hence, the derivatives are transferred onto the smooth, noiseless weight functions $\mathbf{w}_j$ whenever possible. Consider for illustration the term $\mathbf{F} = \partial_t \mathbf{u}$. Using integration by parts we obtain

$$\langle \mathbf{w}_j, \partial_t \mathbf{u} \rangle_i = -\langle \partial_t \mathbf{w}_j, \mathbf{u} \rangle_i, \tag{11}$$

if the boundary terms are eliminated by requiring $\mathbf{w}_j = 0$ at $t = t_i \pm H_t$. The complete set of boundary conditions [18] requires that $\mathbf{w}_j$ and its spatial derivatives up to second order vanish at the boundary of the integration domain.
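The identity (11) can be checked numerically in one dimension. The sketch below is an illustration rather than the authors' implementation: the weight $w(t') = t'(1 - t'^2)^6$, the half-width of 30 s, the grid of 601 points, and the test signal with a 120 s period are all assumed values chosen for this example, and the function names are hypothetical.

```python
import math

ALPHA = 6  # envelope exponent; an assumed, illustrative value


def weight(tp, alpha=ALPHA):
    """w(t') = t' (1 - t'^2)^alpha: vanishes at t' = -1 and t' = +1."""
    return tp * (1.0 - tp * tp) ** alpha


def weight_dt(tp, half_width, alpha=ALPHA):
    """Analytic dw/dt, using the chain rule dt'/dt = 1 / half_width."""
    e = 1.0 - tp * tp
    return (e ** alpha - 2.0 * alpha * tp * tp * e ** (alpha - 1)) / half_width


def trapezoid(values, h):
    """Composite trapezoidal rule on a uniform grid with spacing h."""
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))


def both_sides(u, du_dt, t_i=0.0, half_width=30.0, n=601):
    """Return (<w, du/dt>, -<dw/dt, u>) over |t - t_i| <= half_width."""
    h = 2.0 * half_width / (n - 1)
    ts = [t_i - half_width + k * h for k in range(n)]
    tps = [(t - t_i) / half_width for t in ts]
    lhs = trapezoid([weight(tp) * du_dt(t) for tp, t in zip(tps, ts)], h)
    rhs = -trapezoid([weight_dt(tp, half_width) * u(t) for tp, t in zip(tps, ts)], h)
    return lhs, rhs


OMEGA = 2.0 * math.pi / 120.0  # hypothetical oscillation with a 120 s period


def u(t):
    """Smooth stand-in for one velocity component at a fixed point."""
    return math.sin(OMEGA * t + 0.3)


def du_dt(t):
    """Exact time derivative of the stand-in signal."""
    return OMEGA * math.cos(OMEGA * t + 0.3)


lhs, rhs = both_sides(u, du_dt)
print(f"<w, du/dt>  = {lhs:.10f}")
print(f"-<dw/dt, u> = {rhs:.10f}")
```

The right-hand side never differentiates the data: only the analytically known weight function is differentiated, which is the property the weak formulation relies on when u is replaced by noisy measurements.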

Some nonlinear terms in equation (2), such as $\omega \mathbf{u}$, do not allow all derivatives to be transferred onto $\mathbf{w}_j$ via integration by parts. In such cases, the remaining derivatives on $\mathbf{u}$ are computed in Fourier space utilizing both a Tukey-like windowing function and a low-pass filter.

Furthermore, the weight functions should be chosen such that the integrals involving the latent fields disappear. To remove the dependence on the time-independent forcing term, we require that $\mathbf{w}_j$ be an odd function in time, such that

$$\int_{-H_t}^{H_t} \mathbf{w}_j\, dt = 0. \tag{12}$$

We also constrain our weight function to the form

$$\mathbf{w}_j = \nabla \times \left[ \hat{\mathbf{z}}\, \phi_j(x, y, t) \right], \tag{13}$$

so that

$$\langle \mathbf{w}_j, \nabla p \rangle_i = -\langle \nabla \cdot \mathbf{w}_j, p \rangle_i = 0, \tag{14}$$

eliminating the dependence on pressure.

All of the above constraints can be satisfied by choosing the scalar fields $\phi_j$ in the form

$$\phi_j(x, y, t) = P_\lambda(x') P_\mu(y') P_\nu(t') E_\alpha(x') E_\beta(y') E_\gamma(t'), \tag{15}$$

where $P_m(\cdot)$ is a Legendre polynomial,

$$E_\alpha(w) = (1 - w^2)^\alpha \tag{16}$$

is an envelope function, and the prime denotes coordinates scaled by the integration domain size: $x' = (x - x_i)/H_x$, $y' = (y - y_i)/H_y$, $t' = (t - t_i)/H_t$. Each integral over $\Omega_i$ is evaluated numerically using the trapezoidal rule, with the accuracy of the numerical quadrature controlled by the integers α, β, and γ [17]. Here we set α = β = γ = 6 to allow the use of PIV data that is relatively sparse. For reference, regression based on direct evaluation of derivatives via a polynomial method [11] requires about 20 grid points per magnet width (i.e., 2-3 times higher than in our data sets).

Unlike Ref. [11], which considered symbolic regression for synthetic data, multiple weight functions labeled by integer indices $j = \{\lambda, \mu, \nu\}$ were used here to sample the data more thoroughly, while keeping the large integration domains from overlapping too much for the shorter data sets.
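A small numerical check of these constraints may help make them concrete. The sketch below is not the authors' MATLAB code: it builds the weight function of equations (13), (15), and (16) in scaled coordinates with $H_x = H_y = 1$, and the function names, grid sizes, and test pressure field are all assumptions introduced for illustration. It verifies that the temporal factor integrates to zero for odd ν, and that the projection of an arbitrary smooth pressure gradient onto $\mathbf{w}_j$ vanishes to quadrature accuracy.

```python
import math


def legendre(m, s):
    """Return (P_m(s), P_m'(s)) from the standard recurrences."""
    p_prev, p = 1.0, s
    d_prev, d = 0.0, 1.0
    if m == 0:
        return p_prev, d_prev
    for n in range(1, m):
        p_next = ((2 * n + 1) * s * p - n * p_prev) / (n + 1)
        d_next = d_prev + (2 * n + 1) * p
        p_prev, p = p, p_next
        d_prev, d = d, d_next
    return p, d


def factor(m, a, s):
    """g(s) = P_m(s) (1 - s^2)^a and its derivative dg/ds."""
    p, dp = legendre(m, s)
    e = 1.0 - s * s
    return p * e ** a, dp * e ** a - 2.0 * a * s * p * e ** (a - 1)


def weight(idx, xp, yp, tp, a=6):
    """w_j = curl(z_hat * phi_j) = (d phi/dy, -d phi/dx), with H_x = H_y = 1."""
    lam, mu, nu = idx
    gx, dgx = factor(lam, a, xp)
    gy, dgy = factor(mu, a, yp)
    gt, _ = factor(nu, a, tp)
    return gx * dgy * gt, -dgx * gy * gt


def nodes(n):
    """Uniform grid on [-1, 1]."""
    return [-1.0 + 2.0 * k / (n - 1) for k in range(n)]


def trap_weights(n):
    """Trapezoidal quadrature weights on the uniform grid."""
    h = 2.0 / (n - 1)
    return [h * (0.5 if k in (0, n - 1) else 1.0) for k in range(n)]


def time_integral(nu, n=201, a=6):
    """Integral of P_nu(t') E(t') over [-1, 1], cf. equation (12)."""
    return sum(q * factor(nu, a, s)[0] for s, q in zip(nodes(n), trap_weights(n)))


def pressure_projection(idx, grad_p, n=101, tp=0.3):
    """Spatial integral of w_j . grad(p) at fixed t', cf. equation (14)."""
    s, q = nodes(n), trap_weights(n)
    total = 0.0
    for xi, qx in zip(s, q):
        for yi, qy in zip(s, q):
            wx, wy = weight(idx, xi, yi, tp)
            px, py = grad_p(xi, yi)
            total += qx * qy * (wx * px + wy * py)
    return total


def grad_p(x, y):
    """Gradient of a hypothetical pressure p = sin(2x + 0.5) cos(1.5y - 0.2) + 0.3xy."""
    px = 2.0 * math.cos(2.0 * x + 0.5) * math.cos(1.5 * y - 0.2) + 0.3 * y
    py = -1.5 * math.sin(2.0 * x + 0.5) * math.sin(1.5 * y - 0.2) + 0.3 * x
    return px, py


print("time integral, nu = 1:", time_integral(1))
print("time integral, nu = 0:", time_integral(0))
print("<w, grad p>, (1, 0, 1):", pressure_projection((1, 0, 1), grad_p))
```

An even temporal index (ν = 0 in the second printed line) gives a nonzero time integral, so a time-independent forcing would not be eliminated by such a weight.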

The constraint (12) requires ν to be an odd integer. Here we used all combinations of λ and μ set to either 0 or 1 and ν = 1, i.e., a total of four weight functions for each integration domain (this number could be increased further to improve the model reconstruction accuracy). The total number of equations in the system defined by equation (6) is therefore K = 4I, where I is the total number of integration domains. The system has to be over-determined, K > N; we chose I = 50, which satisfies this condition. A higher value would further increase the accuracy and robustness of the method.

Reconstructing the pressure and forcing field

Once the parsimonious model describing a particular data set has been found, the horizontal forcing profile $\mathbf{f}(\mathbf{x})$ and pressure $p(\mathbf{x}, t)$ can be computed using the Helmholtz decomposition of the vector field $\mathbf{s}(\mathbf{x}, t)$ in equation (4). Specifically,

$$p(\mathbf{x}, t) = -\rho \iint \frac{i\, \mathbf{k} \cdot \hat{\mathbf{s}}(\mathbf{k}, t)}{\mathbf{k} \cdot \mathbf{k}}\, e^{-i \mathbf{k} \cdot \mathbf{x}}\, d\mathbf{k} \tag{17}$$

and

$$\mathbf{f}(\mathbf{x}, t) = -\rho \iint \frac{\mathbf{k} \times [\mathbf{k} \times \hat{\mathbf{s}}(\mathbf{k}, t)]}{\mathbf{k} \cdot \mathbf{k}}\, e^{-i \mathbf{k} \cdot \mathbf{x}}\, d\mathbf{k}, \tag{18}$$

where

$$\hat{\mathbf{s}}(\mathbf{k}, t) = \hat{\mathbf{F}}(\mathbf{k}, t) - \sum_{n=1} c_n \hat{\mathbf{F}}_n(\mathbf{k}, t) \tag{19}$$

and

$$\hat{\mathbf{F}}_n(\mathbf{k}, t) = \frac{1}{(2\pi)^2} \iint \mathbf{F}_n(\mathbf{x}, t)\, e^{i \mathbf{k} \cdot \mathbf{x}}\, d\mathbf{x}. \tag{20}$$

The latent fields are reconstructed without the benefit of the weak formulation, which plays a crucial role in increasing the robustness of symbolic regression in the presence of noise. Since some of the terms $\mathbf{F}_n(\mathbf{x}, t)$ involve derivatives which amplify noise, the respective Fourier transforms $\hat{\mathbf{F}}_n(\mathbf{k}, t)$ are low-pass-filtered by eliminating frequencies $|k_x| > k$ and $|k_y| > k$, where $k = \pi/w$ is the wavenumber corresponding to the wavelength 2w of the magnet array. This cut-off frequency is chosen empirically to balance the inclusion of relevant modes and the exclusion of modes corrupted by noise. The spatial derivatives were computed spectrally and the temporal derivative term was computed using a second-order central difference.

Note that $\mathbf{f} = \rho \nabla \times \mathbf{A}$ involves an extra derivative compared with $p = \rho \phi$, which decreases its accuracy for noisy data. Since f is stationary in our experiment, its accuracy can be improved substantially by temporally averaging equation (18).

Data availability

The source data used to construct Figure 3 are included as supplementary material. Data sets containing velocity fields and their gradients are available from the corresponding author upon request.

Code availability

MATLAB codes used to identify the governing equations can be found in the GitHub repository https://github.com/pakreinbold/PDE Discovery WeakFormulation.

Acknowledgements

This material is based upon work supported by NSF under Grants No. CMMI-1725587 and CMMI-2028454. The experimental data used in this work was produced by Jeff Tithof.

Author contributions

P.A.K.R. was responsible for conducting data analysis and interpretation of the results. L.M.K. was responsible for performing fluid flow experiments, data acquisition, and PIV analysis. M.F.S. was responsible for experimental design. R.O.G. was responsible for concept and research design. All authors were involved in the preparation of the manuscript, and read and approved the final version.

[1] A. Gaudinier and S. M. Brady, Mapping transcriptional networks in plants: data-driven discovery of novel biological mechanisms, Annual Review of Plant Biology, 575 (2016).
[2] S. Pan and K. Duraisamy, Data-driven discovery of closure models, SIAM Journal on Applied Dynamical Systems, 2381 (2018).
[3] K. J. Bergen, P. A. Johnson, V. Maarten, and G. C. Beroza, Machine learning for data-driven discovery in solid earth geoscience, Science, eaau0323 (2019).
[4] J. Bongard and H. Lipson, Automated reverse engineering of nonlinear dynamical systems, Proceedings of the National Academy of Sciences, 9943 (2007).
[5] M. Schmidt and H. Lipson, Distilling free-form natural laws from experimental data, Science, 81 (2009).
[6] S. H. Rudy, S. L. Brunton, J. L. Proctor, and J. N. Kutz, Data-driven discovery of partial differential equations, Science Advances, e1602614 (2017).
[7] H. Schaeffer, Learning partial differential equations via data discovery and sparse optimization, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 20160446 (2017).
[8] A. Karpatne, G. Atluri, J. H. Faghmous, M. Steinbach, A. Banerjee, A. Ganguly, S. Shekhar, N. Samatova, and V. Kumar, Theory-guided data science: A new paradigm for scientific discovery from data, IEEE Transactions on Knowledge and Data Engineering, 2318 (2017).
[9] B. Suri, J. Tithof, R. Mitchell, R. O. Grigoriev, and M. F. Schatz, Velocity profile in a two-layer Kolmogorov-like flow, Phys. Fluids, 053601 (2014).
[10] S. Boyd, L. O. Chua, and C. A. Desoer, Analytical foundations of Volterra series, IMA Journal of Mathematical Control and Information, 243 (1984).
[11] P. A. Reinbold and R. O. Grigoriev, Data-driven discovery of partial differential equation models with latent variables, Physical Review E, 022219 (2019).
[12] X. Li, L. Li, Z. Yue, X. Tang, H. U. Voss, J. Kurths, and Y. Yuan, Sparse learning of partial differential equations with structured dictionary matrix, Chaos, 043130 (2019).
[13] D. Xu and O. Khanmohamadi, Spatiotemporal system reconstruction using Fourier spectral operators and structure selection techniques, Chaos, 043122 (2008).
[14] O. Khanmohamadi and D. Xu, Spatiotemporal system identification on nonperiodic domains using Chebyshev spectral operators and system reduction algorithms, Chaos, 033117 (2009).
[15] M. Shinbrot, On the analysis of linear and nonlinear dynamical systems from transient-response data, National Advisory Committee for Aeronautics, Technical Note 3288 (1954).
[16] H. Preisig and D. Rippin, Theory and application of the modulating function method - I. Review and theory of the method and theory of the spline-type modulating functions, Computers & Chemical Engineering, 1 (1993).
[17] D. R. Gurevich, P. A. Reinbold, and R. O. Grigoriev, Robust and optimal sparse regression for nonlinear PDE models, Chaos, 103113 (2019).
[18] P. A. Reinbold, D. R. Gurevich, and R. O. Grigoriev, Using noisy or incomplete data to discover models of spatiotemporal dynamics, Physical Review E, 010203 (2020).
[19] R. Tibshirani, Regression shrinkage and selection via the LASSO, Journal of the Royal Statistical Society: Series B (Methodological), 267 (1996).
[20] D. W. Marquardt and R. D. Snee, Ridge regression in practice, The American Statistician, 3 (1975).
[21] S. L. Brunton, J. L. Proctor, and J. N. Kutz, Discovering governing equations from data by sparse identification of nonlinear dynamical systems, Proceedings of the National Academy of Sciences, 3932 (2016).
[22] N. M. Mangan, J. N. Kutz, S. L. Brunton, and J. L. Proctor, Model selection for dynamical systems via sparse regression and information criteria, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 20170009 (2017).
[23] F. V. Dolzhanskii, V. A. Krymov, and D. Y. Manin, Stability and vortex structures of quasi-two-dimensional shear flows, Sov. Phys. Usp., 495 (1990).
[24] J. Tithof, B. Suri, R. K. Pallantla, R. O. Grigoriev, and M. F. Schatz, Bifurcations in a quasi-two-dimensional Kolmogorov-like flow, J. Fluid Mech., 837 (2017).
[25] R. Pallantla, Exact Coherent Structures and Dynamical Connections in a Quasi 2D Kolmogorov Like Flow, Ph.D. thesis, Georgia Institute of Technology (2018).
[26] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics, 686 (2019).
[27] R. Iten, T. Metger, H. Wilming, L. Del Rio, and R. Renner, Discovering physical concepts with neural networks, Physical Review Letters, 010508 (2020).
[28] L. Cayton, Algorithms for manifold learning, Univ. of California at San Diego Tech. Rep., 1 (2005).
[29] H. Schaeffer, G. Tran, and R. Ward, Extracting sparse high-dimensional dynamics from limited data, SIAM Journal on Applied Mathematics, 3279 (2018).
[30] B. Drew, J. Charonko, and P. P. Vlachos, QI - Quantitative Imaging (PIV and more) (2013), available at https://sourceforge.net/projects/qi-tools/.
[31] B. Suri, J. Tithof, R. O. Grigoriev, and M. F. Schatz, Forecasting fluid flows using the geometry of turbulence, Phys. Rev. Lett., 114501 (2017).
[32] B. Suri, J. Tithof, R. O. Grigoriev, and M. F. Schatz, Unstable equilibria and invariant manifolds in quasi-two-dimensional Kolmogorov-like flow, Phys. Rev. E, 023105 (2018).
[33] B. Suri, R. K. Pallantla, M. F. Schatz, and R. O. Grigoriev, Heteroclinic and homoclinic connections in a Kolmogorov-like flow, Phys. Rev. E 100