Robust learning from noisy, incomplete, high-dimensional experimental data via physically constrained symbolic regression
Patrick A.K. Reinbold, Logan M. Kageorge, Michael F. Schatz, Roman O. Grigoriev
RRobust learning from noisy, incomplete, high-dimensional experimental data viaphysically constrained symbolic regression
Patrick A.K. Reinbold, Logan M. Kageorge, Michael F. Schatz, and Roman O. Grigoriev
1, 2 School of Physics, Georgia Institute of Technology, Atlanta, Georgia 30332, USA Corresponding author, email: [email protected] (Dated: February 25, 2021)Machine learning offers an intriguing alternative to first-principles analysis for discovering newphysics from experimental data. However, to date, purely data-driven methods have only provensuccessful in uncovering physical laws describing simple, low-dimensional systems with low levels ofnoise. Here we demonstrate that combining a data-driven methodology with some general physi-cal principles enables discovery of a quantitatively accurate model of a non-equilibrium spatially-extended system from high-dimensional data that is both noisy and incomplete. We illustrate thisusing an experimental weakly turbulent fluid flow where only the velocity field is accessible. We alsoshow that this hybrid approach allows reconstruction of the inaccessible variables – the pressure andforcing field driving the flow.
Revolutionary advances in our ability to collect, store,and process vast amounts of information has unleashedmachine learning as a dramatically different approach toscientific discovery [1–3]. Initial efforts have focused onpurely data-driven methods to synthesize knowledge inthe form of equations. For instance, symbolic regressionhas been applied successfully to extract both evolutionlaws expressed as ordinary differential equations [4] andconservation laws in the form of algebraic equations [5]from low-dimensional data with low levels of noise. Un-fortunately, to date, purely data-driven approaches havebeen unable to handle high-dimensional data sets rep-resenting complex or spatially-extended non-equilibriumphenomena such as cancer, fusion plasmas, earthquakes,weather, or climate change. A key difficulty is that, with-out appropriate constraints, the high dimensionality ofthe data makes the model search space far too large forany purely data-driven approach to be tractable.In principle, machine learning can be used to constructsuitable models (e.g., nonlinear partial differential equa-tions (PDEs)) of spatially extended systems [6, 7]; how-ever, numerous difficulties arise when using data from thereal world. First and foremost, all the variables (or fields)that are necessary to describe the phenomena of interestshould be identified; no existing purely data-driven ap-proach can help with this. Second, some of the requiredvariables may not be accessible in a real world problem;to date, no known machine learning method has beensuccessful in model discovery based on incomplete data.Third, data from real world problems often involve signif-icant uncertainty due to both random and systematic er-rors, which, as a consequence, makes accurate evaluationof particular, crucially important model terms infeasible.Finally, unlike the test cases using synthetic data gener-ated by a reference model [6, 7], assessing the quality ofa model learned from real world data is not straightfor-ward. The fusion of domain knowledge with data science[8] is essential for addressing these challenges.Here we present such a hybrid approach which uses appropriate physical constraints (e.g., locality, smooth-ness, symmetries) to dramatically constrain the searchspace containing various candidate models. Our ap-proach incorporates three key ingredients: (1) generalphysical principles used to identify the variables and can-didate models, (2) weak formulation of differential equa-tions to reduce noise sensitivity and eliminate depen-dence on inaccessible variables, and (3) ensemble sym-bolic regression to identify a parsimonious model thatbalances accuracy and simplicity. To illustrate, we ex-amine an experimental fluid flow in a thin layer that ex-hibits complex spatio-temporal behavior when driven bytime-independent forcing [9] (see Fig. 1 and the Meth-ods section). We show that a quantitative 2D model ofthis flow can be discovered using experimental measure-ments of the horizontal components of the velocity field u ( x , t ). Furthermore, using this model, all latent fields(here pressure and forcing) can also be reconstructed. A HYBRID APPROACH TO MODELDISCOVERY
We start by describing the three key components of thehybrid approach to model discovery. Additional detailsare provided in the Methods section.
Constructing the model library
The first two steps of model discovery are to identifya set of variables (fields) required to describe the dataand construct a sufficiently broad library of candidatemodels that will later be narrowed down to obtain a par-simonious description. In practice, these two steps maybe hard, or even impossible, to separate and, for systemsof high dimensionality, require additional considerationsbased on domain knowledge. For the system consideredhere, the general physical assumptions of causality, local- a r X i v : . [ phy s i c s . f l u - dyn ] F e b Magnet Array x y –+ J B f z (a) (b) (c) (d) FIG. 1. Schematic top (a) and side (b) views are shown for laboratory studies of weak turbulence in a thin electrolyte layerinside a rectangular container. Flow is driven by Lorentz forcing f , which arises by applying a current density J in the presenceof a magnetic field B from a permanent magnet array (dashed lines). The snapshots illustrate measured velocity fields in the x - (c) and y - (d) directions at Reynolds number Re = 22 .
17, when the flow is weakly turbulent. ity, and smoothness can be used to write the model in theform of Volterra series [10]. Each term F n of the seriesinvolves a product of the velocity field u , latent fields,and/or their partial derivatives. Since we are dealingwith a fluid flow, we can rely on the more specific do-main knowledge recognizing the fluid flow is driven byexternal and internal stresses. Hence, the evolution ofthe velocity field should depend on body forces f andpressure p , which are the latent fields here: ∂ t u = (cid:88) n c n F n [ u , p, f , ∇ u , ∇ p, ∇ f , . . . ] . (1)The library of candidate models can be further con-strained by using another general physical concept ofEuclidean symmetry which reflects the uniformity andisotropy of the fluid layer. Truncating the sum at a suffi-ciently low order in the fields and derivatives yields [11] ∂ t u = c ( u · ∇ ) u + c ∇ u + c u + c u u + c ω u + c ( ∇ · u ) u + c ( ∇ · u ) u − ρ − ∇ p + ρ − f , (2)where ω = ˆ z · ( ∇ × u ) is the vorticity and u = u · u .Isotropy constrains the functional form of the libraryterms, each of which transforms as a vector, while uni-formity implies that the unknown coefficients are con-stants, i.e., independent of position and time. Note that,without loss of generality, the coefficients of the last twoterms can be set to ± ρ − , where ρ is an arbitrary con-stant with the units of mass density; this simply amountsto fixing the units (and sign) of the pressure and forcingfields. While the forcing in this particular experimentis time-independent, the pressure varies in time and sorequires its own model. A corresponding library of can-didate models can constructed in a similar way which,after truncation to lowest-order terms, yields ∂ t p = c ∇ · u + c ∇ · f + c p. (3)Here each term transforms as a scalar, and c , c , and c are additional unknown constants. We can further con-strain both libraries using the experimental observationthat, to high accuracy, the velocity field is divergence-free, which corresponds to setting c = c = 0 in equation(2) and c → ∞ in equation (3).The need for including in the model the dependence onthe pressure and forcing fields could be discovered fromdata directly without relying on the knowledge of fluiddynamics. We can rewrite equation (2) in the form ρ s = −∇ p + f , (4)where s represents the sum of all the terms that dependonly on u and its partial derivatives. In general we wouldfind s (cid:54) = 0 for any choice of the coefficients. Helmholtzdecomposition requires s = ∇ φ + ∇ × A , where φ and A are the scalar and vector potentials. Hence two addi-tional fields, one scalar and one vector, are required tosatisfy equation (4): p = − ρφ and f = ρ ∇ × A . Weak formulation of the model
Although symbolic regression could be performed us-ing the strong form of the model, e.g., by directly evaluat-ing each term in equation (2) at different spatiotemporallocations, this presents two problems. The most obvi-ous one is that we cannot evaluate the terms involvinglatent fields. Pressure could, in principle, be computedby taking the divergence of equation (2) and solving theresulting pressure-Poisson equation, if the forcing f wereknown or at least divergence-free. In our case, this isnot an option, since f satisfies neither condition. Fur-thermore, taking a derivative greatly amplifies the noisepresent in the data, whether this is done using finite dif-ferences [6, 12], polynomial interpolation [11], or spectralmethods [13, 14]. Instead we use a weak form of themodel to address both noise sensitivity and the depen-dence on latent variables. This approach was originallyintroduced in the context of ordinary differential equa-tions [15, 16]. In the context of PDE models, it wasshown to be as general as prior approaches based on thestrong form [6, 7] and superior in terms of both its flexi-bility and robustness [17, 18].Let us choose a set of spatiotemporal domains Ω i andweight functions w j (see the Methods section and Fig. 2)and define (cid:104) w j , F n (cid:105) i = (cid:90) Ω i w j · F n d Ω , (5)where d Ω = dx dy dt and n = 0 corresponds to the term ∂ t u . Evaluating the integrals in equation (5) for different i and j and stacking the results to form vectors q n , wearrive at a linear system of equations for the unknowncoefficients Q c = q , (6)where c = [ c , · · · , c N ] T and Q = [ q · · · q N ]. Ensemble symbolic regression
A parsimonious model describing the data can befound by solving an over-determined system (6) using anystandard algorithm such as LASSO [19], ridge regression[20], sequentially thresholded least-squares [21], or var-ious information-theoretic criteria [22]. Here we adoptthe computationally efficient iterative procedure intro-duced in Ref. [18], which is an adaptation of the latteralgorithm. At each iteration, equation (6) is solved tofind parameters c through c N . Then, the magnitude ofeach term is computed. If it is below some threshold, say (cid:107) c n q n (cid:107) < ε (cid:107) q (cid:107) for a given choice of ε , the correspond-ing term is removed from the library by setting c n = 0and the column q n is removed from the matrix Q . Theprocess is then repeated until all remaining terms have amagnitude that is above the threshold.How well a model describes a particular data set canbe quantified in terms of the relative residual η = (cid:107) Q c − q (cid:107) max n (cid:107) c n q n (cid:107) , (7)where we expect η (cid:28) η howevertells us little about the functional form of the model orthe magnitude of the respective coefficients. For instance, including a term such as c ( ∇· u ) u with an arbitrary coef-ficient c in equation (2) does not change η for a flow thatis incompressible, but does change the model [11]. Therobustness of the functional form of the model and the ac-curacy with which the coefficients c n are determined canboth be quantified by performing symbolic regression foran ensemble of different samplings of the data (or evendifferent data sets) [18]. Here, each ensemble includesdifferent distributions of integration domains in the tem-poral direction. The variation in the functional form ofthe identified model across the ensemble can be used todetect missing or spurious terms, while the standard de-viation of the coefficients c n can be used to quantify theiraccuracy. RESULTS
To test our approach for model discovery, we measuredthe velocity field components in the plane of the fluidlayer and performed symbolic regression for an ensembleof 30 different random distribution of spatiotemporal do-mains Ω i . We found that choosing 0 . (cid:46) ε (cid:46) . ε , the model does not fit the data accurately,as measured by η . For lower ε , the functional form of themodel acquires a sensitive dependence on the choice ofspatiotemporal domains Ω i , which is a sign of overfitting.Over the range of Reynolds numbers 17 . (cid:46) Re (cid:46) ∂ t u = c ( u · ∇ ) u + c ∇ u + c u − ρ − ∇ p + ρ − f , (8)with η as low as 0.02 (see Fig. 3d). This model al-lows easy interpretation, since its form is similar to theNavier-Stokes equation which represents momentum bal-ance. The first term on the right-hand-side describesadvection of momentum. The second and third term de-scribe momentum flux due to viscosity in the horizontaland vertical direction [9, 23], respectively. The fourthand fifth term also appear in the Navier-Stokes equa-tion and describe (isotropic) internal stresses and exter-nal stresses, respectively.It is worth emphasizing that the form of the 2D modelidentified by symbolic regression is identical to that de-rived from first principles [9, 24] under a number of as-sumptions, including the divergence-free condition on thehorizontal components of the velocity. Dropping this as-sumption produces a more general model [25] which is aspecial case of the system (2)-(3) with c (cid:54) = 0, c = 0, c (cid:54) = ∞ , and c = c = 0. In both cases, the coeffi-cients c , c , and c are nonzero and given by explicitexpressions in terms of the material parameters and thegeometry of the fluid layer [9]. The theoretical valuesof parameters are compared with the respective valuesidentified by symbolic regression in Fig. 3(a-c). … ( u · r ) u
36, where the flow is weakly turbulent, isprobably attributed to just such a variation in the con-ditions, the much larger variation in the mean of c and c for the three data sets at Re ≈
18 (Figs. 3b and 3c) ismost likely due to a qualitative change in the dynamics.For 17 . (cid:46) Re (cid:46)
19 the flow becomes time-periodic[24]. The amplitude of the temporal oscillation decreasessubstantially as Re approaches Re ≈ .
8, leading toa corresponding decrease in the magnitude of all theterms (Fig. 3e) and an increase in η (Fig. 3d). Indeed,the constraint (12) on the weight functions implies that (cid:104) F n , w j (cid:105) = 0 for all n for a stationary flow. Hence ourparticular choice of the weight functions is only suitablefor flows that are time-dependent. This is the fundamen-tal reason why the accuracy of the reconstructed modeldecreases at the low end of the Re range explored here,where the magnitude of the time-dependent componentof the velocity field becomes comparable to the measure-ment error of the PIV. The breakdown of our approachfor steady flows is not an inherent problem of symbolicregression but is rather due to the presence of latent vari-ables, mainly the steady forcing which the constraint (12) (a) (b) (c)(d) (e) (f) FIG. 3. Model parameters, shown in panels (a)-(c) are consistently well-estimated from experimental data for a range ofReynolds numbers Re , particularly when the amplitude of flow time-dependence is sufficiently large, as illustrated in panels(d) and (e). For the results shown, flows in experiments are time-periodic for Re (cid:46)
19, and weakly turbulent otherwise. Inpanels (a)-(c), parameters obtained using ensemble averaging (black dots) are compared with the corresponding value obtainedusing first-principles analysis (dashed line) performed for time-independent flows at low Re . In panel (d), low values of theresidual η (equation (7)) indicate good parameter fits; the relative quality of fit deteriorates in a regime ( Re (cid:46)
19) where flowtime-dependence is weak and, therefore, the maximum magnitude of terms in equation (6) is small (e). The terms retained in aparsimonious model depend on a choice of threshold ε ; the probability of retaining the term F n as a function of ε shown in (f)indicates the model given by equation (8) is consistently identified by choosing 0 . (cid:46) ε (cid:46) .
3. The vertical error bars in panels(a)-(d) represent the standard deviation over the ensemble (in most instances they are smaller than the symbol size) and thehorizontal error bars represent the variation in Re over the data set. was aimed to eliminate. One way to get around this lim-itation is to analyze transient flows relaxing towards thesteady state.Once the parsimonious model has been identified, thelatent fields can be determined as well. Using theHelmholtz decomposition in equation (4), the pressure p and forcing f can be computed at each time t representedin the data set, as discussed in the Methods section. Themovie showing the time evolution of the reconstructedpressure field is included as supplementary material.The electrical current is uniform in the electrolytelayer, hence the forcing field f = f ( x, y )ˆ x that appears inthe 2D model of the fluid flow should correspond to thedepth average of the Lorentz force across the electrolytelayer: f ( x, y ) ∝ (cid:90) JB z ( x, y, z ) dz. (9)The forcing profile reconstructed from the measuredflow field is compared with the Lorentz force computed from direct experimental measurement of the magneticfield according to equation (18) in Fig. 4, which showsthat the two profiles are almost indistinguishable. DISCUSSION
As we have demonstrated here, a data-driven ap-proach based on symbolic regression can successfully dis-cover a quantitatively accurate model of a fairly compli-cated and high-dimensional non-equilibrium system withhighly nontrivial dynamics using noisy, incomplete exper-imental measurements. Unlike artificial neural networkmodels [26, 27] that trade off interpretability for gener-ality, our model has the form of a PDE which is bothstraightforward to interpret and allows the latent fieldsto be easily reconstructed. The discovered model can alsobe directly compared with other models of the same sys-tem constructed using first-principles. This comparisonsuggests that the first-principles models do capture all (a) (b) -9 -6 -3 0 3 6 9-101 (c)
FIG. 4. The x -component of the forcing field driving theflow. Depth-averaged Lorentz force J × B computed using ex-perimental measurement of the magnetic field (a) is virtuallyindistinguishable from the forcing field f reconstructed usingequation (8) for Re = 22 .
17 (b). In (c), the reconstructed(blue line) and measured (black circles) forcing profiles, bothnormalized by their maximum magnitude, are compared alongthe line x = 0 (dashed lines in (a) and (b)); this normaliza-tion also removes the dependence on an arbitrary choice of ρ in equation (8). the relevant physical mechanisms qualitatively, but failto describe them quantitatively with sufficient accuracy,indicating that the assumptions used in their derivationrequire refinement.Although our results validate the practical utility ofdata-driven model discovery, they also highlight the needfor a hybrid approach which combines a number of gen-eral physical constraints – most notably, locality, causal-ity, and spatial symmetries – to generate a library ofcandidate models with symbolic regression which down-selects from this library the parsimonious model thatbest describes the data. Although purely data-drivenapproaches such as manifold learning [28] can be usedto help with library construction, it is unlikely that thisapproach remains tractable for high-dimensional systemssuch as the one considered here. We have also relied onfairly specific domain knowledge to identify the latentfields that are not a part of the data. While in our case,their presence is suggested by the structure of the model,no general approach to identifying latent variables from data has been developed so far.Domain knowledge also plays an essential role in choos-ing the weight functions. We used both the functionalform of the terms involving the latent variables (e.g., ∇ p )and the known properties of the latent fields (e.g., theforcing f being time-independent) to eliminate the depen-dence on both p and f from the regression problem. Thiswould not have been possible without using some do-main knowledge, illustrating the limitations of the purelydata-driven approach. It should also be mentioned thatthe dependence on latent fields may not always be elimi-nated, while still allowing the governing equations to beidentified. For instance, our approach would not succeedwithout measurement of the velocity field, even if thepressure were known.The success of any data-driven approach is also heavilydependent on the data used [29]. In particular, for PDEdiscovery, the data should exhibit variation in all inde-pendent coordinates. In the present problem, we findthat symbolic regression identifies a sparse model withhigh accuracy for higher Re where the flow is weakly tur-bulent and the velocity field varies in time and both spa-tial coordinates. The same exact approach experiencesdifficulties at lower Re where the flow becomes (nearly)stationary. Indeed, once the time-dependence is lost, wehave q n = 0 for all n , so that equation (6) becomes anidentity which cannot be solved for c .Finally, it should be pointed out that the approachpresented in this paper is not limited to models in theform of a single parabolic PDE, such as equation (2). Itcan be applied without significant modification to sys-tems of any number of elliptic, hyperbolic, or ellipticsecond-order PDEs, as well as higher-order PDEs andordinary differential equations. In particular, there isno need to separate out the terms such at ∂ t u , whichare only present in equations governing temporal evolu-tion. In their absence, the linear system that appears insymbolic regression can be solved using alternative ap-proaches such as singular value decomposition [17]. METHODSExperimental system and data collection
Our experimental setup is the same one as used inRef. [24]. The flow is produced in a shallow electrolyte-dielectric bi-layer in a rectangular container, the top viewof which is shown in Fig. 1a. The two fluids are immis-cible, and both layers have a thickness of 0.3 cm andhorizontal extent of L x = 17 . × L y = 22 . . ◦ C, corresponding to a 0 . w = 1 . J = J ˆ y passes throughthe electrolyte layer. Its interaction with the magneticfield produces a Lorentz force J × B that drives the flow.The z -component of the magnetic field has been mea-sured at a resolution of 10 points per magnet width ineach of 7 equally spaced horizontal planes throughout theelectrolyte layer. These measurements were only used asa reference to validate the results of our reconstructionprocedure.The electrolyte-dielectric interface is seeded with flu-orescent microspheres in order to measure 2D velocityfields quantifying the horizontal flow via particle imagevelocimetry (PIV) [30]. A typical snapshot of the veloc-ity field is shown overlaid on its corresponding vorticityin Fig. 1 The strength of the flow is characterized by theReynolds number Re = ¯ uw/ ¯ ν , where ¯ u is the RMS ve-locity within the central 8 w × w region of the domain,and ¯ ν = 3 . × − m /s is the characteristic depth-averaged viscosity chosen to allow direct comparison withthe results of previous studies of this experimental system[9, 24, 31–33]. For Re (cid:46)
50, the vertical ( z ) componentof the flow is negligibly small, so that the horizontal flowcan be considered divergence-free [9].Each data set represents the x and y components ofthe velocity field sampled on a uniform grid (∆ x = ∆ y )within the flow domain and covers a temporal interval ofat least 600 s with temporal resolution ∆ t = 1 s. Thecharacteristic time scale τ of the flow varies with Re . Atlow Re , the flow is periodic, with period of around 120 s.At higher Re , the flow is aperiodic, with autocorrelationtime which decreases with Re [31]. The spatial resolutionof the data is between 6 and 10 grid points per magnetwidth w , which is the characteristic length scale of theflow. The temporal extent L t and the spatial resolutionof each data set, labeled by the mean Reynolds number,are given in Table I.The z -component of the magnetic field is measured ata resolution of 10 points per magnet width in each of7 equally spaced horizontal planes throughout the elec-trolyte layer. The average of these planes is shown inFig. 4a in comparison with the reconstructed forcing inFig. 4b. Integration domains and weight functions
For simplicity, we take the integration domains tobe rectangular and centered at different grid points
TABLE I. Description of the data sets used for the symbolicregression analysis. Re denotes the mean Reynolds number.Times τ marked with an asterisk (*) represent temporal pe-riod, whereas those without represent autocorrelation time. Re τ (s) L t τ H x L x H y L y H t L t ∆ xw ∆ tτ ( x i , y i , t i ),Ω i = (cid:8) ( x, y, t ) (cid:12)(cid:12) | x − x i | ≤ H x , | y − y i | ≤ H y , | t − t i | ≤ H t (cid:9) , (10)where H l is the half-width of the integration domainin the direction l = { x, y, t } . All the domains Ω i havethe same size, centered spatially and distributed tempo-rally throughout the data set, as shown in Fig. 2. Sinceintegration leads to a reduction of noise due to averaging[17], the domains are chosen to be large in both spatialdirections. Their spatial width 2 H x × H y was chosento be slightly smaller than the size L x × L y of the flowdomain to avoid the regions near the side walls wherePIV is noisier than in the bulk. The temporal width 2 H t was chosen to be smaller than the temporal extent L t ofthe data set to limit overlap between different integra-tion domains, so that rows of equation (6) could remainlinearly independent. Specific values of H x , H y , and H t for each data set are given in Table I.As mentioned previously, each partial derivative of thevelocity field increases the noise that is inevitably presentin the PIV data. Hence, the derivatives are transferredonto the smooth, noiseless weight functions w j wheneverpossible. Consider for illustration the term F = ∂ t u .Using integration by parts we obtain (cid:104) w j , ∂ t u (cid:105) i = −(cid:104) ∂ t w j , u (cid:105) i , (11)if the boundary terms are eliminated by requiring w j = 0at t = t i ± H t . The complete set of boundary condi-tions [18] require that w j and its spatial derivatives upto second-order vanish at the boundary of the integrationdomain. Some nonlinear terms in equation (2), such as ω u , do not allow all derivatives to be transferred onto w j via integration by parts. In such cases, the remainingderivatives on u are computed in Fourier space utiliz-ing both a Tukey-like windowing function and a low-passfilter.Furthermore, the weight functions should be cho-sen such that the integrals involving the latent fieldsdisappear. To remove the dependence on the time-independent forcing term, we require that w j be an oddfunction in time, such that (cid:90) H t − H t w j dt = 0 , (12)We also constrain our weight function to the form w j = ∇ × [ˆ zφ j ( x, y, t )] , (13)so that (cid:104) w j , ∇ p (cid:105) i = −(cid:104)∇ · w j , p (cid:105) i = 0 , (14)eliminating the dependence on pressure.All of the above constraints can be satisfied by choosingthe scalar fields φ j in the form φ j ( x, y, t ) = P λ ( x (cid:48) ) P µ ( y (cid:48) ) P ν ( t (cid:48) ) E α ( x (cid:48) ) E β ( y (cid:48) ) E γ ( t (cid:48) ) , (15)where P m ( · ) is a Legendre polynomial, E α ( w ) = (1 − w ) α , (16)is an envelope function, and the prime denotes coor-dinates scaled by the integration domain size: x (cid:48) =( x − x i ) /H x , y (cid:48) = ( y − y i ) /H y , t (cid:48) = ( t − t (cid:48) ) /H t . Eachintegral over Ω i is evaluated numerically using the trape-zoidal rule, with the accuracy of the numerical quadra-ture controlled by the integers α , β , and γ [17]. Here weset α = β = γ = 6 to allow the use of PIV data thatis relatively sparse. For reference, regression based ondirect evaluation of derivatives via a polynomial method[11] requires about 20 grid points per magnet width (e.g.,2-3 times higher than in our data sets).Unlike Ref. [11] which considered symbolic regressionfor synthetic data, multiple weight functions labeled byinteger indices j = { λ, µ, ν } were used here to sample thedata more thoroughly, while keeping the large integrationdomains from overlapping too much for the shorter datasets. The constraint (12) requires ν to be an odd integer.Here we used all combinations of λ and µ set to either 0or 1 and ν = 1, i.e., a total of four weight functions foreach integration domain (this number could be increasedfurther to improve the model reconstruction accuracy).The total number of equations in the system defined byequation (6) is therefore K = 4 I , where I is the totalnumber of integration domains. The system has to beover-determined, K > N ; we chose I = 50 which satisfiesthis condition. A higher value would further increase theaccuracy and robustness of the method. Reconstructing the pressure and forcing field
Once the parsimonious model describing a particulardata set has been found, the horizontal forcing profile f ( x ) and pressure p ( x , t ) can be computed using theHelmholtz decomposition of the vector field s ( x , t ) inequation (4). Specifically, p ( x , t ) = − ρ (cid:90) (cid:90) i k · ˆ s ( k , t ) k · k e − i k · x d k (17)and f ( x , t ) = − ρ (cid:90) (cid:90) k × [ k × ˆ s ( k , t )] k · k e − i k · x d k , (18)where ˆ s ( k , t ) = ˆ F ( k , t ) − (cid:88) n =1 c n ˆ F n ( k , t ) . (19)and ˆ F n ( k , t ) = 1(2 π ) (cid:90) (cid:90) F n ( x , t ) e i k · x d x . (20)The latent fields are reconstructed without the benefitof the weak formulation, which plays a crucial role in in-creasing the robustness of symbolic regression in the pres-ence of noise. Since some of the terms F n ( x , t ) involvederivatives which amplify noise, the respective Fouriertransforms ˆ F n ( k , t ) are low-pass-filtered by eliminatingfrequencies | k x | > k and | k y | > k where k = π/w isthe wavenumber corresponding to the wavelength 2 w ofthe magnet array. This cut-off frequency is chosen em-pirically to balance the inclusion of relevant modes andthe exclusion of modes corrupted by noise. The spatialderivatives were computed spectrally and the temporalderivative term was computed using a second-order cen-tral difference.Note that f = ρ ∇ × A involves an extra derivativecompared with p = ρφ , which decreases its accuracy fornoisy data. Since f is stationary in our experiment, itsaccuracy can be improved substantially by temporallyaveraging equation (18). Data availability
The source data used to construct Figure 3 are in-cluded as supplementary material. Data sets containingvelocity fields and their gradients are available from thecorresponding author upon request.
Code availability
MATLAB codes used to identify the governingequations can be found in the GitHub repositoryhttps://github.com/pakreinbold/PDE Discovery WeakFormulation.
Acknowledgements
This material is based upon work supported by NSFunder Grants No. CMMI-1725587 and CMMI-2028454.The experimental data used in this work was producedby Jeff Tithof.
Author contributions
P.A.K.R. was responsible for conducting data analysisand interpretation of the results. L.M.K. was responsiblefor performing fluid flow experiments, data acquisition,and PIV analysis. M.F.S. was responsible for experi-mental design. R.O.G. was responsible for concept andresearch design. All authors were involved in the prepa-ration of the manuscript, read and approved the finalversion. [1] A. Gaudinier and S. M. Brady, Mapping transcriptionalnetworks in plants: data-driven discovery of novel bio-logical mechanisms, Annual review of plant biology ,575 (2016).[2] S. Pan and K. Duraisamy, Data-driven discovery of clo-sure models, SIAM Journal on Applied Dynamical Sys-tems , 2381 (2018).[3] K. J. Bergen, P. A. Johnson, V. Maarten, and G. C.Beroza, Machine learning for data-driven discovery insolid earth geoscience, Science , eaau0323 (2019).[4] J. Bongard and H. Lipson, Automated reverse engineer-ing of nonlinear dynamical systems, Proceedings of theNational Academy of Sciences , 9943 (2007).[5] M. Schmidt and H. Lipson, Distilling free-form naturallaws from experimental data, science , 81 (2009).[6] S. H. Rudy, S. L. Brunton, J. L. Proctor, and J. N. Kutz,Data-driven discovery of partial differential equations,Science Advances , e1602614 (2017).[7] H. Schaeffer, Learning partial differential equations viadata discovery and sparse optimization, Proceedings ofthe Royal Society A: Mathematical, Physical and Engi-neering Sciences , 20160446 (2017).[8] A. Karpatne, G. Atluri, J. H. Faghmous, M. Steinbach,A. Banerjee, A. Ganguly, S. Shekhar, N. Samatova, andV. Kumar, Theory-guided data science: A new paradigmfor scientific discovery from data, IEEE Transactions onknowledge and data engineering , 2318 (2017).[9] B. Suri, J. Tithof, R. Mitchell, R. O. Grigoriev, and M. F.Schatz, Velocity profile in a two-layer Kolmogorov-likeflow, Phys. Fluids , 053601 (2014). [10] S. Boyd, L. O. Chua, and C. A. Desoer, Analytical foun-dations of volterra series, IMA Journal of MathematicalControl and Information , 243 (1984).[11] P. A. Reinbold and R. O. Grigoriev, Data-driven dis-covery of partial differential equation models with latentvariables, Physical Review E , 022219 (2019).[12] X. Li, L. Li, Z. Yue, X. Tang, H. U. Voss, J. Kurths, andY. Yuan, Sparse learning of partial differential equationswith structured dictionary matrix, Chaos , 043130(2019).[13] D. Xu and O. Khanmohamadi, Spatiotemporal systemreconstruction using fourier spectral operators and struc-ture selection techniques, Chaos , 043122 (2008).[14] O. Khanmohamadi and D. Xu, Spatiotemporal sys-tem identification on nonperiodic domains using cheby-shev spectral operators and system reduction algorithms,Chaos , 033117 (2009).[15] M. Shinbrot, On the analysis of linear and nonlinear dy-namical systems from transient-response data, NationalAdvisory Committee for Aeronautics, Technical Note3288 (1954).[16] H. Preisig and D. Rippin, Theory and application of themodulating function method—i. review and theory of themethod and theory of the spline-type modulating func-tions, Computers & chemical engineering , 1 (1993).[17] D. R. Gurevich, P. A. Reinbold, and R. O. Grigoriev,Robust and optimal sparse regression for nonlinear pdemodels, Chaos , 103113 (2019).[18] P. A. Reinbold, D. R. Gurevich, and R. O. Grigoriev, Us-ing noisy or incomplete data to discover models of spa-tiotemporal dynamics, Physical Review E , 010203(2020).[19] R. Tibshirani, Regression shrinkage and selection via theLASSO, Journal of the Royal Statistical Society: SeriesB (Methodological) , 267 (1996).[20] D. W. Marquardt and R. D. Snee, Ridge regression inpractice, The American Statistician , 3 (1975).[21] S. L. Brunton, J. L. Proctor, and J. N. Kutz, Discoveringgoverning equations from data by sparse identification ofnonlinear dynamical systems, Proceedings of the nationalacademy of sciences , 3932 (2016).[22] N. M. Mangan, J. N. Kutz, S. L. Brunton, and J. L. Proc-tor, Model selection for dynamical systems via sparseregression and information criteria, Proceedings of theRoyal Society A: Mathematical, Physical and Engineer-ing Sciences , 20170009 (2017).[23] F. V. Dolzhanskii, V. A. Krymov, and D. Y. Manin,Stability and vortex structures of quasi-two-dimensionalshear flows, Sov. Phys. Usp. , 495 (1990).[24] J. Tithof, B. Suri, R. K. Pallantla, R. O. Grigoriev, andM. F. Schatz, Bifurcations in a quasi-two-dimensionalKolmogorov-like flow, J. Fluid Mech. , 837 (2017).[25] R. Pallantla, Exact Coherent Structures and DynamicalConnections in a Quasi 2D Kolmogorov Like Flow , Ph.D.thesis, Georgia Institute of Technology (2018).[26] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics-informed neural networks: A deep learning framework forsolving forward and inverse problems involving nonlinearpartial differential equations, Journal of ComputationalPhysics , 686 (2019).[27] R. Iten, T. Metger, H. Wilming, L. Del Rio, and R. Ren-ner, Discovering physical concepts with neural networks,Physical Review Letters , 010508 (2020). [28] L. Cayton, Algorithms for manifold learning, Univ. ofCalifornia at San Diego Tech. Rep , 1 (2005).[29] H. Schaeffer, G. Tran, and R. Ward, Extracting sparsehigh-dimensional dynamics from limited data, SIAMJournal on Applied Mathematics , 3279 (2018).[30] B. Drew, J. Charonko, and P. P. Vlachos, QI – Quan-titative Imaging (PIV and more) (2013), available athttps://sourceforge.net/projects/qi-tools/.[31] B. Suri, J. Tithof, R. O. Grigoriev, and M. F. Schatz,Forecasting fluid flows using the geometry of turbulence, Phys. Rev. Lett. , 114501 (2017).[32] B. Suri, J. Tithof, R. O. Grigoriev, and M. F. Schatz,Unstable equilibria and invariant manifolds in quasi-two-dimensional kolmogorov-like flow, Phys. Rev. E ,023105 (2018).[33] B. Suri, R. K. Pallantla, M. F. Schatz, and R. O.Grigoriev, Heteroclinic and homoclinic connections in akolmogorov-like flow, Phys. Rev. E100