Solving Irregular and Data-enriched Differential Equations using Deep Neural Networks
Craig Michoski, Miloš Milosavljević, Todd Oliver, David Hatch
Abstract.
Recent work has introduced a simple numerical method for solving partial differential equations (PDEs) with deep neural networks (DNNs). This paper reviews and extends the method while applying it to analyze one of the most fundamental features in numerical PDEs and nonlinear analysis: irregular solutions. First, the Sod shock tube solution to the compressible Euler equations is discussed, analyzed, and then compared to conventional finite element and finite volume methods. These methods are extended to consider performance improvements and simultaneous parameter space exploration. Next, a shock solution to compressible magnetohydrodynamics (MHD) is solved for, and used in a scenario where experimental data is utilized to enhance a PDE system that is a priori insufficient to validate against the observed/experimental data. This is accomplished by enriching the model PDE system with source terms and using supervised training on synthetic experimental data. The resulting DNN framework for PDEs seems to demonstrate an almost fantastical ease of system prototyping, natural integration of large data sets (be they synthetic or experimental), all while simultaneously enabling single-pass exploration of the entire parameter space.

1. Introduction
Solving differential equations with numerical optimization, specifically with minimization of the equation residual, is not a new idea. Optimization-based methods have been popular for some time in the form of least-squares finite element methods (LSFEM) [1, 2, 3] and element-free Galerkin methods [4, 5], where even for mesh-free formulations, theoretical convergence criteria have been examined [6]. Nevertheless, recent work by Han et al. [7], Berg and Nyström [8], Sirignano and Spiliopoulos [9], Raissi et al. [10], and Xu and Darve [11] has shown that remarkably simple implementations of deep neural networks (DNNs) can also solve relatively diverse differential equations by utilizing very similar residual-minimizing formulations, but with the DNNs replacing the usual finite element discretization strategies. From a method development point of view, further demonstrating the capability of DNNs to solve differential equations is of intrinsic interest.

To explore how efficient, reliable, robust, and generalizable DNN-based solutions are will undoubtedly require careful and prolonged study. Here, we offer a practical starting point by asking how well, and in what ways, DNNs can manage one of the cleanest, simplest, and most ubiquitous irregularities observed in differential systems, namely shock fronts. Shock fronts provide a quintessential test bed for a numerical method in that they contain an isolated, analytically non-differentiable feature that propagates through the differential system. The feature reduces the local regularity of the solution both in space and time and leads to numerical representations that must accommodate local first-order analytic discontinuities while still maintaining some concept of numerical stability and robustness.
Indeed, the way a numerical method responds to a local shock front tends to reveal the fundamental signature response that the method has to local irregularities in the solution, which might be thought of as providing an indicator by which to gauge the method's applicability.

A tremendous amount of work spanning pure mathematical analysis [12, 13], numerical analysis [14, 15], physics [16, 17], etc., has been done on differential systems that demonstrate shock-like behavior. Shock-like numerical behavior is deeply rooted in discrete function representation and approximation theory, and further in functional and discrete functional analysis, specifically in what it means to locally and globally approximate a member of a function space [18, 19, 20, 21].

Along these lines, the first half of this paper is dedicated to exploring the behavior of the conventional Sod shock tube solution to the compressible Euler equations of gas dynamics [22]. We use the shock tube solution as a setting in which to explore what is meant by a DNN solution to a system of PDEs, what is meant by a regular (and irregular) solution to a system of PDEs both analytically and numerically, how these two frames of reference relate to each other, and where they differ. Then we discuss in some detail how DNN solutions compare and relate to more traditional and conventional ways of solving systems of PDEs, and how the validation of DNN-based methods differs slightly from, say, typical numerical convergence analysis.

The second half of the paper is focused more directly on how to improve the performance of DNN solutions in the context of irregular PDE solvers, and how to capitalize on the operational details of DNN solvers to perform complete and simultaneous parameter space exploration. We also consider how to extend the framework of physics-based modeling to incorporate experimental data into the analytic workflow.

(Craig Michoski, Todd Oliver) Oden Institute for Computational Engineering and Sciences, University of Texas at Austin, Austin, TX 78712
(Miloš Milosavljević) Department of Astronomy, The University of Texas at Austin, Austin, TX 78712
(David Hatch) Institute for Fusion Studies, University of Texas at Austin, Austin, TX 78712
E-mail address: (Craig Michoski) [email protected]
Key words and phrases: Deep neural networks, differential equations, partial differential equations, nonlinear, shocks, data analytics, optimization.

These questions have growing importance in science and engineering applications as data becomes increasingly available. Many of the more traditional approaches, including data-informed parameter estimation methods [23] and real-time filtering [24], are frequently incorporated into more standard forward PDE solvers. We argue that DNN-based PDE solvers provide a natural interface between physics-based PDE solvers and advanced data analytics. The general observation is that because DNNs are able to combine a PDE solver framework with a simple and flexible interface for optimization, it becomes natural to ask how sufficient a modeling system is at representing experimental observations, and to explore the nature of the mismatch between a simulated theoretical principle and an experimentally measured phenomenon. In this context, the DNN framework almost automatically inherits the ability to couple many practical data-driven concepts, from signal analysis, to optimal control, to data-driven PDE enrichment and discovery.

To provide some insight into the utility of the DNN framework for solving PDEs, it is instructive to discuss some of the apparent strengths and drawbacks. Three strengths of the framework are:

(I) Phenomenal ease of prototyping PDEs,
(II) Natural incorporation of large data,
(III) Simultaneous solution over the entire parameter space.

Regarding (I), complicated systems of highly parameterized and multidimensional PDEs can be prototyped in TensorFlow or PyTorch in hundreds of lines of code, in a day or two. This might be compared to the decade-long development cycle of many legacy PDE solvers.
For (II), incorporating and then utilizing experimental data in the PDE workflow as a straightforward supervised machine learning task is remarkably rewarding and provides a painless integration with the empirical scientific method. It is easy to see how this feature could be naturally leveraged for practical uncertainty quantification, risk assessment, data-driven exploration, optimal control, and even discovery and identification of the theoretical underpinnings of physical systems (as discussed in some detail in section 8 below). In (III), another powerful and practical advantage is identified, where n-dimensional parameter space exploration simply requires augmenting the solution domain (space-time) with additional parameter axes, (x, t, p_1, ..., p_n), and then optimizing the DNN to solve the PDEs as a function of the parameters as inputs. The input space augmentation adds little algorithmic or computational complexity over solving on the space-time domain (x, t) at a single parameter point (p_1, ..., p_n), and is drastically simpler than exploring parameter space points sequentially.

Some of the more immediate drawbacks of the DNN-based approach to solving PDEs include:

(i) Absence of theoretical convergence guarantees for the non-convex PDE residual minimization,
(ii) Slower overall run time per forward solve,
(iii) Weaker theoretical grounding of the method in PDE analysis.

We touch on these weaknesses throughout the paper, but briefly, (i) indicates the challenge posed by understanding convergence criteria in non-convex optimization, where solutions may get trapped in local minima. The concern of (ii) is substantially subtler than it appears at first glance, and strongly depends on many aspects of the chosen neural network architecture, how optimized the hyperparameters are, what the ultimate goal of the simulation is, etc.
Finally, (iii) merely highlights that DNNs have only recently been seriously considered for PDE solvers and thus remain largely theoretically untouched.

The paper is organized as follows: Section 2 shows how a simple DNN can be used to solve a nonlinear system of PDEs. Section 3 proceeds with a discussion of the relationship between different notions of solutions to PDEs and attempts to provide some context in which to understand where DNN solutions can be placed relative to solutions provided by more conventional numerical methods. Section 4 is a demonstration of how the DNN solution behaves given unbounded gradients. Section 5 explores solutions to the Sod shock tube problem and compares aspects of the DNN method to more standard finite element and finite volume methods. Section 6 discusses a method of dissipative annealing to improve (in a specific sense) the rate of convergence relative to the number of iterations required in the descent algorithm, and then how to use the DNN framework to recover simultaneous parameter scans. In section 7, the modified network with residual connections and multiplicative gates from [9], which resembles the "highway network" and "long short-term memory" (LSTM) architectures, is compared to the more standard DNN model in the specific context of the Sod shock tube. Finally, in section 8, the concept of data-enriched PDEs is presented through the lens of a hypothetical experimental scenario, where the experimenter finds herself in the situation in which the anticipated physical model of the system does not adequately validate against the experimental data. The accompanying code for this manuscript will be made available on GitHub upon publication.

2. Solving PDEs using Deep Neural Networks
In this paper we are interested in a broad system of n coupled, potentially nonlinear PDEs that can be written in terms of the initial-boundary value problem, for i = 1, ..., n:

\[
\begin{aligned}
\partial_t u_i + \mathcal{N}_i(u; p) &= 0, && \text{for } (x, t, p) \in \Omega \times [0, T_s] \times \Omega_p, \\
u_i|_{t=0} &= u_{i,0}(x, p), && \text{for } (x, p) \in \Omega \times \Omega_p, \\
\mathcal{B}_i(u, p) &= 0, && \text{for } (x, t, p) \in \partial\Omega \times [0, T_s] \times \Omega_p,
\end{aligned}
\tag{2.1}
\]

over the domain Ω ⊂ R^d with boundary ∂Ω, and the m-dimensional parameter domain p ∈ Ω_p given m parameters p = (p_1, ..., p_m), where u = u(x, t, p) is the n-dimensional solution vector with i-th scalar-valued component u_i = u_i(x, t, p), T_s is the duration in time, N_i is generally a nonlinear differential operator, and arbitrary boundary conditions are expressed through the component-wise operator B_i(u, p).

Here we are interested in approximating the solutions to (2.1) with DNNs. A feed-forward network can be described in terms of the input y ∈ R^{d_in} for y = (x, t, p), the output z^L ∈ R^{d_out}, and an input-to-output mapping y ↦ z^L, where d_in and d_out are the input and output dimensions. The components of the pre- and post-activations of hidden layer ℓ are denoted y^ℓ_k and z^ℓ_j, respectively, where the activation function is a sufficiently differentiable function φ : R → R. The j-th component of the activations in the ℓ-th hidden layer of the network is given by

\[
z^\ell_j(y) = b^\ell_j + \sum_{k=1}^{N_\ell} W^\ell_{jk}\, y^\ell_k(y), \qquad y^\ell_k = \phi(z^{\ell-1}_k(y)),
\tag{2.2}
\]

where N_ℓ is the number of neurons in hidden layer ℓ, and W^ℓ_{jk} and b^ℓ_j are the weight and bias parameters of layer ℓ. These parameters make up the tunable, or "trainable", parameters of the network, which can be thought of as the degrees of freedom that parametrize the representation space of the DNN.
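As a concrete illustration, the layer recursion (2.2) can be sketched in a few lines of NumPy. The layer widths, tanh activation, and Xavier-style uniform initialization below are illustrative choices, not the paper's exact configuration; note the input already carries the augmented form (x, t, p) so that parameters enter the network as extra input axes.

```python
import numpy as np

def init_params(layer_sizes, seed=0):
    """Xavier-style uniform initialization of (W, b) for each layer."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        bound = np.sqrt(6.0 / (n_in + n_out))
        W = rng.uniform(-bound, bound, size=(n_out, n_in))
        b = np.zeros(n_out)  # biases start at zero
        params.append((W, b))
    return params

def forward(params, y):
    """Feed-forward map of (2.2): z^l = b^l + W^l y^l with y^l = tanh(z^{l-1}).

    The final layer is left linear so the output can take any real value."""
    z = np.asarray(y, dtype=float)
    for i, (W, b) in enumerate(params):
        z = W @ z + b
        if i < len(params) - 1:
            z = np.tanh(z)  # activation phi = tanh, as used in this paper
    return z
```

For the Sod problem with one scalar parameter, the input would be y = (x, t, p), i.e. `layer_sizes = [3, ..., 3]` with the three outputs approximating (ρ, ρu, E).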
Note that in section 7 we will consider a slightly more exotic network architecture, but for now, we concern ourselves with the standard "multilayer perceptron" described in (2.2).

In this context, the output of the neural network z^L, where the number of units N_ℓ is chosen and fixed for each layer, realizes a family of approximate solutions to (2.1), such that for each component i of (2.1),

\[
z^L_i(x, t; p, \vartheta) \approx u_i(x, t, p),
\tag{2.3}
\]

where we have set the parameters ϑ = (W, b). Similarly, the activation functions φ must be chosen such that the differential operators

\[
\frac{\partial z^L_i}{\partial t}(x, t; p, \vartheta) \quad \text{and} \quad \mathcal{N}_i(z^L; p)
\]

can be readily and robustly evaluated using reverse-mode automatic differentiation [25].

We proceed by defining an objective function slightly more general than that presented in [10, 9],

\[
J_i(\vartheta) = J^{\mathrm{PDE}}_i(\vartheta) + J^{\mathrm{BC}}_i(\vartheta) + J^{\mathrm{IC}}_i(\vartheta),
\tag{2.4}
\]

where

\[
\begin{aligned}
J^{\mathrm{PDE}}_i(\vartheta) &= \left\lVert \frac{\partial z^L_i}{\partial t} + \mathcal{N}_i(z^L; p) \right\rVert_{L_i(\Omega \times [0, T_s] \times \Omega_p,\, \wp)}, \\
J^{\mathrm{BC}}_i(\vartheta) &= \left\lVert \mathcal{B}_i(z^L, p) \right\rVert_{L_i(\partial\Omega \times [0, T_s] \times \Omega_p,\, \wp)}, \\
J^{\mathrm{IC}}_i(\vartheta) &= \left\lVert z^L_i(x, 0; p, \vartheta) - u_{i,0}(x, p) \right\rVert_{L_i(\Omega \times \Omega_p,\, \wp)},
\end{aligned}
\tag{2.5}
\]

and L_i indicates the form of the i-th loss and ℘ denotes the probability density with respect to which the expected values of the loss are calculated. In particular, the form of the loss function can correspond to any standard norm (e.g., the discrete kernel of an L^p-norm), the mean square error, or specialized hybrids such as the Huber loss [26]. For a chosen form L, the corresponding loss is given by the following expected value:

\[
\lVert f(X) \rVert_{L(\mathcal{Y}, \wp)} = \mathbb{E}_\wp[L(f)] = \int_{\mathcal{Y}} L(f(X))\, \wp(X)\, dX.
\]

Letting z^L = {z^L_1, ..., z^L_n}, we define any particular approximate solution ẑ^L to (2.1) as one obtained by finding a ϑ that minimizes the loss relative to the parameter set. That is:

\[
\hat{\vartheta} = \arg\min_\vartheta \sum_{i=1}^n J_i(\vartheta).
\tag{2.6}
\]
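To make the residual-minimization idea concrete, the following sketch minimizes a Monte Carlo estimate of J^PDE for the linear advection equation u_t + c u_x = 0 over the one-parameter trial family u_θ(x, t) = sin(x − θt), whose exact solution corresponds to θ = c. The scalar equation, trial family, and finite-difference derivatives are our own illustrative stand-ins: the paper's method uses a DNN for u and reverse-mode automatic differentiation for both the PDE derivatives and the parameter gradient.

```python
import numpy as np

def pde_residual_loss(theta, xs, ts, c=1.0, h=1e-4):
    """Monte Carlo estimate of J^PDE = E[(u_t + c*u_x)^2] at collocation
    points, for the trial family u_theta(x, t) = sin(x - theta*t).
    Derivatives are taken by central differences here for simplicity."""
    def u(x, t):
        return np.sin(x - theta * t)
    u_t = (u(xs, ts + h) - u(xs, ts - h)) / (2 * h)
    u_x = (u(xs + h, ts) - u(xs - h, ts)) / (2 * h)
    return np.mean((u_t + c * u_x) ** 2)

def train(theta0=0.0, alpha=0.2, iters=300, seed=0):
    """Descent on the residual loss, drawing a fresh batch of collocation
    points from the sampling density at every iteration, as in SGD."""
    rng = np.random.default_rng(seed)
    theta, dth = theta0, 1e-5
    for _ in range(iters):
        xs = rng.uniform(-np.pi, np.pi, 128)
        ts = rng.uniform(0.0, 1.0, 128)
        # finite-difference gradient of the batch loss w.r.t. theta
        g = (pde_residual_loss(theta + dth, xs, ts)
             - pde_residual_loss(theta - dth, xs, ts)) / (2 * dth)
        theta -= alpha * g
    return theta
```

Running `train()` drives θ toward c = 1, i.e. the residual of the PDE alone selects the solution, which is the essential mechanism behind (2.4)-(2.6).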
To solve this optimization problem, one must approximate the loss function and couple this approximation with an optimization algorithm. For instance, using stochastic gradient descent (SGD), one uses a Monte Carlo approximation of the loss coupled with gradient descent. Thus, the loss is computed by drawing a set of samples X_batch from the joint distribution ℘ and forming the following approximation:

\[
G_i(\vartheta; X_{\mathrm{batch}}) = G^{\mathrm{PDE}}_i(\vartheta; X_{\mathrm{batch}}) + G^{\mathrm{BC}}_i(\vartheta; X_{\mathrm{batch}}) + G^{\mathrm{IC}}_i(\vartheta; X_{\mathrm{batch}}),
\]

where

\[
\begin{aligned}
G^{\mathrm{PDE}}_i(\vartheta; X_{\mathrm{batch}}) &= \frac{1}{N_{\mathrm{PDE}}} \sum_{n=1}^{N_{\mathrm{PDE}}} L(R_i(x_n, t_n, p_n)), \\
G^{\mathrm{BC}}_i(\vartheta; X_{\mathrm{batch}}) &= \frac{1}{N_{\mathrm{BC}}} \sum_{n=1}^{N_{\mathrm{BC}}} L(\mathcal{B}_i(x_n, p_n)), \\
G^{\mathrm{IC}}_i(\vartheta; X_{\mathrm{batch}}) &= \frac{1}{N_{\mathrm{IC}}} \sum_{n=1}^{N_{\mathrm{IC}}} L(z^L_i(x_n, 0, p_n; \vartheta) - u_{i,0}(x_n, p_n)),
\end{aligned}
\]

and R_i(x_n, t_n, p_n) denotes the i-th PDE residual evaluated at the batch point (x_n, t_n, p_n), which is drawn according to ℘, and similarly for the BC and IC terms. Then, the SGD update at iteration k + 1 can be written as

\[
\vartheta^{k+1} \leftarrow \vartheta^k - \alpha_k \nabla_\vartheta \left( \sum_{i=1}^n G_i(\vartheta^k; X^k_{\mathrm{batch}}) \right),
\tag{2.7}
\]

where X^k_batch denotes the batch set drawn at iteration k.

In practice we find the Adam optimizer [27] to be the most efficient SGD-family algorithm for the cases run in this paper. For network initialization we use the Xavier uniform initializer [28]. It is also worth noting that the loss form L and the layer widths N_ℓ, as well as the sampling distribution ℘, are additional hyperparameters that could be optimized over. For simplicity, however, we do not perform hyperparameter optimization in this paper.

3. Irregular Solutions to PDEs
In this section we briefly review what a solution to a PDE represents. Although most PDEs have been derived from and are studied for their ability to represent natural phenomena, at their core they remain mathematical abstractions that are only approximations to the phenomena that inspire them. These abstractions are often derived from simple variational perturbation theories, and are subsequently celebrated for their ability to capture and predict the physical behavior of complex natural systems. This "capturing" of physical systems is conventionally accomplished at the level of experimental validation, which is often many steps removed from the abstraction that the PDE itself represents.

In the study and understanding of PDEs and mathematical analysis, it is difficult to over-emphasize the importance of stability and regularity. One of the most challenging problems in modern theoretical science and engineering is developing a strategy for solving a PDE and ultimately interpreting the solution. In this context, irregular solutions to PDEs are easy to come by. To account for this, convention introduces the concept of mathematically well-posed solutions (in the sense of Hadamard [29]). Well-posedness circumscribes what is meant, in some sense, by a "good" or admissible solution to a PDE. It is within this framework that the concepts of classical and regular solutions, strong solutions, weak solutions, and so forth have evolved.

A "regular solution" can be thought of as one where the actual PDE is satisfied in an intuitively sensible and classical way. Namely, in a regular solution, enough derivatives exist at every point in the solution so that operations within the PDE make sense at every point in the entire domain of relevance. A well-posed regular solution is formally defined to provably exist, be unique, and depend continuously on the data given in the problem [30].
In contrast to regular solutions, the concept of a weak solution was developed to substantially weaken the notion of what is meant by a solution to a PDE. In weak solutions, derivatives only exist in a "weak" distributional sense, and the solution is subsequently recast in terms of distributions and integral norms. For example, in the case of a "strong solution" to a PDE, which is a weak solution that is bounded and stable in the L²-norm and thus preserves important geometric properties, a solution may admit, e.g., pointwise discontinuities and remain an admissible solution. Such solutions are the strongest, and/or most well-behaved, of the weak solutions; still, interpreting the space of strong solutions from a physical point of view can be challenging. Moreover, weak solutions in the sense of Hadamard are only proper solutions to a PDE if they are adjoined with a concept of stability (such as L^p-stability) that implies a bound in a given normed space. This is a rather practical condition, in that a solution in a distributional sense that is not bounded in norm is too permissive to pointwise "variation" to be physically interpretable.

A substantial fraction of solutions arising in applied science are weak solutions that demonstrate stability in notably irregular spaces. Even the simplest of equations can admit solutions of highly irregular character (e.g., the Poisson equation [31]). In fact, in many applied areas the concept of turbulence becomes important, which from a mathematical point of view deals with PDEs in regimes where neither regularity nor stability is observed, a fact with broad conceptual ramifications [32]. This irregular and unexpected behavior in PDE analysis leads to a rather basic and inevitable question that we address in the next section: how well do our specific discrete numerical approximate solutions to certain PDEs capture their rigorously established mathematical properties?

3.1. Numerical solutions.
Discrete numerical systems cannot, in general, exactly replicate infinite-dimensional spaces (except in an abstract limit), and thus approximate numerical solutions to PDEs remain just that: approximations to mathematical PDEs. As a matter of practice, weak formulations are often implemented in numerical methods, but while these methods often preserve the spirit of their mathematical counterparts, they subsequently lose much of their content. For example, solutions in discontinuous finite element methods are frequently couched in terms of L² residuals and are presented along with numerical stability results, when in effect the discrete L² spaces they are purported to represent are simply piecewise continuous polynomial spaces that are capable of representing only a tiny fraction of admissible solutions in L².

It is then natural to ask: what has become of all the admissible solutions that even the most rigorous numerical methods discard by construction? And how exactly has it been determined which solutions are to be preserved? A pragmatist might respond with the argument that numerical methods develop, primarily, to deliver practical utility, adjoining as a central thematic pillar to their evolution the notion of "physical relevance," where physical relevance is notionally defined in terms of the utility that the numerical solution provides in the elucidation of a practical observation.
While this may certainly be the case, and while we adopt the pragmatic perspective in this paper, we would like to raise a consideration to the reader: if numerical methods, particularly those driving the evaluation of predictive simulation models, are, with widespread application, utilizing numerical schemes that to some extent arbitrarily select solutions of a particular kind from the space of mathematically admissible solutions, and then use that subset of solutions as the justification upon which interpretive, predictive physical understanding is derived, then how can one evaluate the predictive capability of a model without directly comparing it to observed data?

3.2. Shocks.

Perhaps the simplest type of irregularity that confronts a numerical method is a numerical shock. A numerical shock need not be a mathematical discontinuity, and as such it is a fundamentally different concept than both a mathematical shock and a physical shock. One definition of a numerical shock is: the numerical response (in the form of numerical instability) of a method when the representation space that the method is constructed in does not support the function space it is interrogating. Examples of this behavior can be found from the Gibbs phenomenon in spectral methods to the Runge phenomenon in polynomial-based methods, and elsewhere. More broadly, one can simply posit that the radius of convergence of the solution must be contained within the support of its discrete representation, or else instabilities are likely to arise [33, 34]. As a consequence, numerical approaches to the shock problem display unique characteristics that often expose deep numerical insights into the nature of the method. The unique signature that a numerical method displays when dispersing its instabilities throughout a solution foretells the predictive characteristics of the simulation itself.

Many successful numerical methods can be viewed as variations of general strategies with important but subtle differences in the underlying detail. Consider, for example, finite volume methods (FVM); spectral, pseudospectral, and Fourier Galerkin methods; continuous and discontinuous Galerkin finite element methods; etc.
In each of these cases, the system of PDEs is cast into a weak formulation, where the spatial derivatives are effectively transferred to analytically robust basis functions, and fluxes are recovered from integration by parts. For example, if we choose some adequately smooth test function ϕ, multiply it by some transport equation ∂_t U + F(U)_x = 0 in a state space where U is a state vector, and integrate by parts, we have an equation that can be written in the form:

\[
\frac{d}{dt} \int_\Omega \varphi \odot U \, dx + \int_{\partial\Omega} \varphi \odot F \, dS - \int_\Omega F \odot \varphi_x \, dx = 0,
\tag{3.8}
\]

where ⊙ denotes componentwise multiplication.

To see the relation between distinct numerical methods, recall that when ϕ is a high-order polynomial and the discrete domain is, for example, a single element, the recovered method is a type of spectral element method (SEM). Whereas, when the discretization is turned into a finite element mesh Ω_h comprised of i subintervals, Ω_1, ..., Ω_i, and the basis functions are chosen to be continuous or piecewise discontinuous polynomials, we have the continuous Galerkin (CG) or discontinuous Galerkin (DG) finite element methods, respectively. Similarly, when the elements Ω_i are viewed as finite volumes and the basis functions ϕ are chosen to be piecewise constant, a finite volume method is easily recovered.

As discussed above, the weak formulation (3.8) of a method should not really be viewed, in and of itself, as an advantage. Indeed, the DNN loss functions (2.4) could equally be cast into a weak formulation (as in [2], for example), paying a price in the simplicity of the method. Similarly, by using different activation functions (e.g., φ taken as ReLU), the universal approximation theorem can, in the limit sense, recover all Lebesgue integrable function spaces [35]. We sidestep these issues, viewing them as unnecessary complications that muddy the simplicity and elegance of the DNN solution.
Instead, we choose the hyperbolic tangent tanh(z) activation function and interpret solutions to shock problems through the lens that smooth solutions are dense in L^p.

3.3. Euler's equations and the Sod shock tube.
For simplicity, we start by considering the dimensionless form of the compressible Euler equations in 1D, solved over only the space-time domain Ω × [0, T_s], with Ω = [−1, 1] and boundary ∂Ω = {−1, 1}. We are interested in solutions to the initial-boundary value problem:

\[
\partial_t U + \partial_x F(U) = 0, \qquad U|_{t=0} = U_0, \qquad U|_{x \in \partial\Omega} = U_{\partial\Omega},
\tag{3.9}
\]

for a state vector U = (ρ, ρu, E)^⊤ and corresponding flux F(U) = (ρu, ρu² + p, u(p + E))^⊤. The state vector is comprised of the density ρ, the velocity u, and the total energy of the gas E = p/(γ − 1) + ρu²/2, while the pressure-temperature relation is the single-component ideal gas law p = ρT. The constant γ denotes the adiabatic index and, unless otherwise noted, is taken throughout the remainder of the paper to be γ = 5/3, with the sound speed given by c_s = √(γT).

To analyze the behavior of the network on a simple and classic 1D shock problem, we solve the Sod shock tube. The initial condition is chosen to be piecewise constant:

\[
U_0 = \begin{cases} U_l, & x < 0, \\ U_r, & x \ge 0, \end{cases}
\]

with the left state given by U_l = (1, 0, 1.5)^⊤ and the right by U_r = (0.125, 0, 0.15)^⊤. This also defines the left and right Dirichlet boundaries U_∂Ω = {U_l, U_r}. The exact solution to this system is readily obtained by a standard textbook procedure [36]. Except for in section 8, the simulation time horizon is set to T_s = 0.2.
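For reference, the piecewise-constant initial state can be assembled as below. The value γ = 5/3 and the standard Sod primitive states (ρ, u, p) = (1, 0, 1) and (0.125, 0, 0.1) are assumptions consistent with our reading of the (partially garbled) source, not values we can attribute to the authors with certainty.

```python
import numpy as np

GAMMA = 5.0 / 3.0  # adiabatic index (assumed from the text)

def sod_initial_state(x):
    """Sod initial data in conserved variables U = (rho, rho*u, E),
    built from the standard left/right primitive states
    (rho, u, p) = (1, 0, 1) for x < 0 and (0.125, 0, 0.1) for x >= 0."""
    x = np.asarray(x, dtype=float)
    left = x < 0.0
    rho = np.where(left, 1.0, 0.125)
    u = np.zeros_like(x)
    p = np.where(left, 1.0, 0.1)
    E = p / (GAMMA - 1.0) + 0.5 * rho * u ** 2  # total energy density
    return np.stack([rho, rho * u, E])
```

With γ = 5/3 this yields the conserved left and right states (1, 0, 1.5)^⊤ and (0.125, 0, 0.15)^⊤ quoted above.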
Figure 1. The unlimited DNN solutions to the Sod shock tube. The solution is ill-posed in the classical formulation, leading to undefined behavior (i.e., unbounded derivatives) along the shock fronts. Nevertheless, the solution remains stable (in contrast with the unlimited versions of other algorithms), and appears to robustly capture the major features of the solution.

4. Numerical Experiments
In this section we show the basic behavior of DNN solutions to the Sod shock tube. In section 4.1 we show how the DNN solution behaves without any regularization and discuss how that compares to other numerical methods. In section 4.2 we discuss how to add analytic diffusion to regularize the solution.

4.1. An unlimited solution using DNNs.
The standard solution to the Sod shock tube problem for (3.9) poses real challenges to most numerical methods. In the standard DG method, for example, when the polynomial degree is nonzero, n > 0, the solution becomes numerically unstable without the use of some form of slope limiting [37, 38]. Similarly, when FVM is evaluated at higher than first-order accuracy, the absence of flux limiting leads to unstable solutions [39, 40]. This behavior is also observed in spectral methods [41, 42], continuous Galerkin methods [43], and so on.

One of the more robust regimes for solving shock problems can be found in constant DG methods with degree n = 0 basis functions (or equivalently, FVM methods with first-order accuracy), wherein both a stable and relatively robust solution can be readily calculated, but at the cost of large numerical diffusion. Indeed, the ability to solve these types of irregular problems helps explain the preference practitioners often demonstrate in choosing low-order methods to compute solutions to irregular systems, where to resolve solution features one relies solely on mesh/grid refinement.

In contrast to lower-order approaches, the DNN solution to the Sod shock tube problem is able to exploit the relative flexibility of DNN functional approximators, as determined by the composition of affine transformations and element-wise activation nonlinearities. For example, with hyperbolic tangent activations, the DNN solution to the shock tube problem becomes an approximation to a regular solution of the Sod problem at high-order accuracy, in the sense that a linear combination of transcendental functions is C^∞.

In the case of the DNN solution, even though the discontinuous shock fronts display unbounded derivatives in the classical sense, the DNN is able to find a relatively well-behaved solution, as shown in Fig. 1. The DNN formulation with the hyperbolic tangent activation, if the optimization is to converge to a parameter point ϑ̂, is predicated on a regular solution. Failure to converge could arise if the derivatives at the locus of discontinuity, and the components of ϑ, enter unstable, unbounded growth in the iterations of the optimization protocol. Indeed, the DNN approximates the discontinuous mathematical shock precisely in the limit in which certain components of ϑ tend to infinity. We observe that in practice, however, this does not happen. The inherent spectral bias [44] of DNNs seems to be able to contain the potential instability of the solution until the optimization has settled in a "reasonable" local minimum. This leads to an interesting feature of the DNN solutions, namely that they seem to demonstrate robustness while broadly maintaining many of the small-scale features of the exact solution, even in the presence of irregularity in the formulation.

Table 1. Mean square error of the unlimited DNN solution in Fig. 1, and of the solutions as a function of the grid density in Fig. 2, against the exact Sod shock tube solution.

Solution Type     Density   Pressure  Velocity
Unlimited DNN     0.00199   0.00074   0.01272
FVM, 1st order    0.00117   0.00139   0.00804
FVM, 2nd order    0.00030   0.00019   0.00164
DG, p = 2         0.00163   0.00259   0.01413
DG, p = 5         0.00314   0.00811   0.02945
DNN, τ = 0.005    0.00020   0.00025   0.00136

4.1.1. DNNs as gridless adaptive inverse PDE solvers.
To explain why an approximate regular DNN solution to a shock problem can perform as well as it does, as shown in Fig. 1, we note that a DNN solver can really be viewed as a type of gridless adaptive inverse solver. First, the absence of a fixed grid in the evaluation of the loss functions (2.4) and determination of the optimization direction (2.7) means that the probability density ℘ can be finitely sampled in space as densely as the method can computationally afford, at every iteration and for as many iterations as can be afforded. Given the computational resources, this can lead to a relatively dense, adaptively sampled interrogation of the space-time domain. Moreover, adaptive mesh methods for shock problems have long been known to be highly effective [45, 46], particularly when shock fronts can be precisely tracked. However, AMR and hp-adaptive methods are constrained by their meshing/gridding geometries, even as isoparametric and isogeometric methods have been developed to, at least in part, reduce these dependencies [47].

The second pertinent observation about the DNN solution is that because it utilizes an optimization method to arrive at the solution, it can be viewed as a type of inverse problem. Unlike a forward numerical method, which accumulates temporal and spatial numerical error as it runs, the DNN solutions become more accurate with iterations, being a global-in-space-time inverse-like solver. This means that it becomes entirely reasonable to consider running such systems in single or even half precision floating point formats, leading to potential performance benefits in comparison to more standard PDE solvers.

4.2. Diffusing along unbounded gradients.
As discussed above, solving a shock problem without adding some form of diffusion (often in the form of limiters or low-order accurate methods) tends to lead to unstable solutions in classical numerical methods. Similarly here, even in the case of the DNN solution, when searching for an approximation to the regular solution shown in Fig. 1, the resulting solution, though reasonably well-behaved, is of course spurious along the undefined shock fronts.

As in all numerical methods, to ameliorate this problem, all that needs to be done is to add some small amount of diffusion to constrain the derivatives along the shock fronts. In this case, we slightly alter (3.9) into the non-conductive compressible Navier-Stokes type system,

∂_t U + ∂_x F(U) − ∂_x G(U) = 0,  U|_{t=0} = U_0,  U|_{x ∈ ∂Ω} = V_{∂Ω},  (4.11)

given a dissipative flux G(U) = (0, ρτ u_x, ρτ u u_x)^⊤. Again here we start by merely solving over the space-time domain, Ω × [0, T_s], given a fixed and positive constant τ ∈ ℝ₊. As we show in more detail below, adding a modest amount of viscosity τ = 0.005 leads to a DNN solution that is competitive with some of the most effective numerical methods available for solving the Sod shock tube problem.

5. Numerical Comparison Tests
In this section we show numerical comparison tests between DNNs, the DG finite element methods (FEMs), and FVMs for the Sod shock tube problem. We show solutions in the eyeball norm for each, relative to, first, grid density, and then degrees of freedom in the method. It should be noted that accuracy in the DNN solution is somewhat arbitrary, as it can be tuned via metaheuristics to search for local minima that improve upon basic accuracy measures, such as the L_p error. This is in contrast to, for example, DG and FVM methods, which have established convergence criteria, so that expected improvements in smooth solutions can be anticipated as a function of spatial order and mesh width.

The DNN solution method is effectively adaptive and gridless and relies on stochastic optimization to establish minimizing solutions. In the standard DNN formulation of (3.9), a list of tunable parameters would include the choice of activation function φ, the number of optimization iterations l, the number and distribution of sampled points from ℘, the optimization method stopping criteria, the form and weighting of the loss functions in J_i, the learning rate of the optimizer η, etc. The result is that comparing the accuracy of a DNN solution to a more standard forward solver in some specified norm is not particularly insightful (though we still show the MSE of the solutions in Tables 1 and 2). Instead, we try to establish criteria that do not draw us into the quicksand of hyperparameter optimization, by which to compare DNN solutions to FVM and DGFEM solutions below.

Figure 2. Comparison of slope-limited discontinuous Galerkin (DG) finite element method solvers, flux-limited finite volume method (FVM) solvers, and the dissipative DNN solver, holding fixed the relative resolution in the spatial grids of each solution, √N_batch ∼ N_vol ∼ (n + 1)N_el.

5.1. Solutions as a function of spatial grid density.
One way to evaluate the relationship between the DNN method and FVM or DGFEM is by comparing the relative density of effective residual evaluation points in space. Broadly, in the DG solution, for degree n polynomial solutions, this corresponds to (n + 1)N_el, for N_el the number of elements. In the FVM method this corresponds to simply N_vol, the number of finite volumes. In the DNN, the residual is evaluated at batch points determined by the sampling distribution x ∼ ℘, which we denote with N_batch. One nuance in comparing the methods through residual evaluation points is that in standard FVM and DGFEM methods, these points are spatially fixed. In the DNN approach, however, the batch points are resampled after every evaluation of the objective function gradients. As such, it might seem more appropriate to look at how DNNs compare to hp-adaptive DGFEM and FVM that use adaptive mesh refinement (AMR). Note, however, that adaptive forward problems using DGFEM and FVM are adaptive in a fundamentally different way relative to the residual evaluation as compared to the DNN method. In forward problems, the residual evaluation shifts in space relative to the current solution in time. In the DNN solution, however, though the batch points are resampled, the samples can be chosen independent of the solution behavior, and are done so over the whole space-time domain Ω_h × [0, T_s].

This dependence on space and time of the sampled batch points N_batch distinguishes the DNN solution from even the adaptive FVM and FEM solutions. Thus, in order to compare the spatial resolution of the two different types of solutions, we compare solutions using the rule of thumb that √N_batch ∼ N_vol ∼ (n + 1)N_el.

Results of this comparison are shown in Fig. 2 and Table 1, where the timestepping is chosen sufficiently small. Here the DGFEM solution is run at n = 2, N_el = 24 and at n = 5, N_el = 12, using a Rusanov Riemann solver with adaptive hierarchical slope limiting [37].
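To make the per-iteration resampling concrete, the following toy sketch (ours, not the paper's solver; the ODE u' + u = 0 and the one-parameter ansatz u(x) = exp(θx) are illustrative assumptions) minimizes an equation residual over a freshly drawn Monte Carlo batch at every iteration, so no fixed grid is ever constructed:

```python
import math
import random

def residual_loss(theta, xs):
    """Mean squared residual of u' + u = 0 under the ansatz u(x) = exp(theta * x).

    For this ansatz u'(x) = theta * exp(theta * x), so the residual is
    (theta + 1) * exp(theta * x), which vanishes identically at theta = -1.
    """
    return sum(((theta + 1.0) * math.exp(theta * x)) ** 2 for x in xs) / len(xs)

def solve(iterations=2000, batch=32, lr=0.05, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    h = 1e-5
    for _ in range(iterations):
        # Fresh Monte Carlo batch every iteration, in analogy with x ~ p.
        xs = [rng.uniform(0.0, 1.0) for _ in range(batch)]
        # Central finite-difference gradient of the empirical loss.
        grad = (residual_loss(theta + h, xs) - residual_loss(theta - h, xs)) / (2 * h)
        theta -= lr * grad
    return theta

theta = solve()
```

Because the residual factor (θ + 1) vanishes at the exact solution regardless of where each batch lands, the iteration settles at θ ≈ −1 even though every batch is different.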
Both first and second order FVM solutions are run at N_vol = 72, where the first order solution is unlimited, and the second order uses a standard per-face slope limiter.
Table 2. Mean square errors of the solutions as a function of fixed degrees of freedom in both space and time, as shown in Fig. 3.

Solution Type    Density   Pressure  Velocity
FVM 1st Order    0.00027   0.00017   0.00136
FVM 2nd Order    0.00011   0.00005   0.00084
DG, p = 2        0.00021   0.00013   0.00182
DG, p = 5        0.00042   0.00029   0.00257
DNN, τ = .005    0.00020   0.00025   0.00136
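The bookkeeping behind both comparisons can be checked with a short script (a sketch; the function names are ours, and the DNN count evaluates (5.12) assuming equal hidden widths N_ℓ = N):

```python
def dg_dofs(S, n, N_el, T_grid):
    """DG space-time DoFs: S * (n + 1) * N_el * T_grid, with S Runge-Kutta stages."""
    return S * (n + 1) * N_el * T_grid

def fvm_dofs(S, N_vol, T_grid):
    """FVM space-time DoFs: S * N_vol * T_grid."""
    return S * N_vol * T_grid

def dnn_dofs(L, N, d_in, d_out):
    """Trainable parameters of a width-N, L-hidden-layer network, per (5.12)."""
    return N * (d_in + 1) + d_out * (N + 1) + (L - 1) * N * (N + 1)

# Section 5.1 rule of thumb: sqrt(N_batch) ~ N_vol ~ (n + 1) * N_el.
assert (2 + 1) * 24 == (5 + 1) * 12 == 72   # both DG runs give 72 points
assert 72 ** 2 == 5184                      # implied N_batch at sqrt(N_batch) = 72

# Configurations quoted for the fixed-DoF comparison in section 5.2.
assert dg_dofs(S=4, n=5, N_el=104, T_grid=100) == 249600   # 6th-order DGFEM
assert dg_dofs(S=4, n=2, N_el=208, T_grid=100) == 249600   # 3rd-order DGFEM
assert fvm_dofs(S=1, N_vol=500, T_grid=500) == 250000      # 1st-order FVM
assert fvm_dofs(S=2, N_vol=350, T_grid=355) == 248500      # 2nd-order FVM
# Evaluating (5.12) at L = 16, N = 128, d_in = 2, d_out = 3 gives the
# count quoted for the same configuration in section 7.
assert dnn_dofs(L=16, N=128, d_in=2, d_out=3) == 248451
```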
Figure 3.
Comparison of slope-limited discontinuous Galerkin (DG) finite element method solvers, flux-limited finite volume method (FVM) solvers, and the dissipative DNN solver, holding fixed the total degrees of freedom (in both space and time) of the discrete representation.

The DNN solution is run at √N_batch = 72, using τ = 0.005.

5.2. Solutions as a function of degrees of freedom.
Another way of comparing the DNN solution to the FVM and DGFEM methods is to look at solutions as a function of their respective degrees of freedom (DoFs). Heuristically this can be thought of as comparing the number of "tunable parameters" within the representation space of the solution.

Since the solution in the DNN is a global in space and time solution, we consider the DoFs spanning the whole space, Ω_h × [0, T_s]. In the DG solution this corresponds to S(n + 1)N_el T_grid, where T_grid is the number of temporal grid points and S is the number of stages in the Runge-Kutta discretization. Likewise, in the FVM setting we have S N_vol T_grid. In the case of the DNN, on the other hand, the degrees of freedom are recovered by counting the scalar components of the vector ϑ parametrizing the network architecture (2.2), and this yields

DNN_DoFs = N_1(d_in + 1) + d_out(N_L + 1) + Σ_{ℓ=1}^{L−1} N_{ℓ+1}(N_ℓ + 1),  (5.12)

where L is the number of hidden layers, N_ℓ is the width of the ℓ-th layer, and d_in and d_out are the input and output dimensions, respectively.

Figure 4.
The Sod shock tube solution using a DNN solver along with dissipative annealing, out to only 60K iterations.

The results are presented in Fig. 3 and Table 2. In the DNN solution, the degrees of freedom are DNN_DoFs = 248065 after setting L = 16 and N_ℓ = 128. The DGFEM solutions are run the same as in section 5.1, except that at sixth order, n = 5, S = 4, N_el = 104 and T_grid = 100, so that DGFEM(6)_DoFs = 249600. Similarly, the third order accurate DGFEM solutions use n = 2, S = 4, N_el = 208 and T_grid = 100, corresponding to DGFEM(3)_DoFs = 249600. The first order accurate FVM solution is run the same as in section 5.1, except S = 1 (a forward Euler solver is used), N_vol = 500, and T_grid = 500, so that FVM(1)_DoFs = 250000. At second order the FVM method uses a predictor-corrector method, which we set as S = 2, N_vol = 350, and T_grid = 355, corresponding to FVM(2)_DoFs = 248500. Again, the results show that the DNN solution compares favorably as a function of degrees of freedom.

6. Natural Extensions
In this section we discuss a pair of natural extensions to the DNN framework applied to the Sod shock tube problem. The first of these is an application of a type of "simulated annealing" that can be used along the viscous and/or dissipative parameter directions to improve the effective time to convergence in the method. Next we show how, even more powerfully, solving the DNN framework along a full parameter axis, in this case along the viscous axis τ, can lead to a simultaneous and fully parameterized solution in the entire parameter space at nearly no additional cost.

6.1. Dissipative annealing.
One of the potential limitations of the DNN approach, in comparison to other numerical methods, is the compute time-to-convergence. Discussion of convergence in the DNN/PDE setting is complicated by the fact that "convergence" refers to different concepts in numerical PDE analysis and in numerical optimization. In numerical PDE analysis, convergence usually refers to the rate at which an approximate solution tends to the exact solution as a function of the representation parameters, e.g., mesh width h and/or polynomial order p. In iterative numerical optimization, however, convergence refers to the rate at which the parameters ϑ converge to a (possibly local) minimum as a function of the number of iterations. The global minimum may exist only in the limit in which some components of ϑ tend to infinity. Then, heuristics mandate that we seek convergence to an acceptable local minimum. It is thus difficult to arrive at a formal definition for what is meant by time-to-convergence in the DNN setting. In this section, we resort to a hybrid heuristic
Figure 5.
The Sod shock tube solution using a DNN solver without dissipative annealing, to 100K iterations.

between the two concepts of convergence discussed above, and refer to "time-to-convergence" as the number of computational cycles required to reach a comparable level of accuracy to that of a more traditional PDE solution method. It is in this sense that DNN solutions can appear noticeably more expensive, though in this section we discuss some of the nuances that underlie this observation, and why the unique capacity and flexibility of DNN solutions makes the time-to-convergence less restrictive than it seems.

The premise we assume here about slowly converging solutions (in the sense of optimization) is that large-scale features of a smooth solution are relatively easy to converge to, but when accompanied with finer scale features, the optimization noise from attempting to fit the finer scales can obscure the larger scale features from the optimizer. To mitigate this effect, inspired by the simulated annealing method [48], we recast the dissipative flux in (4.11) as G(U) = (0, ρτ(l)u_x, ρτ(l)uu_x)^⊤, where τ = τ(l) becomes an iteration-l numerical viscosity. In this case τ(l) = g(l)τ_smooth for a fixed τ_smooth ∈ ℝ. The function g(l) is chosen as a fractional stepwise function bounded from above by unity.

The idea of using τ(l) is to converge early and fast in the iteration cycle to an overtly smooth solution of (4.11) with large viscosity. Since such a solution has dramatically fewer fine-scale features, it is conjectured that the optimizer can more easily find a stable minimum of the objective function for such a smooth solution. As a consequence, fewer total iterations are required to converge to the minimum.

We have tested this idea on the Sod problem, and have found that with minimal effort it seems to reduce the number of iterations needed to arrive at similar results. For example, in Figs. 4–6, we test a DNN using a decreasing learning rate schedule.
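A minimal sketch of such a stepwise annealing schedule follows; τ_smooth = 0.05 and the 25000-iteration stages are taken from the base case, while the specific fractional values of g are hypothetical placeholders:

```python
# tau_smooth = 0.05 and the five 25000-iteration stages follow the base case;
# the particular fractional values in G_STEPS are hypothetical placeholders.
TAU_SMOOTH = 0.05
G_STEPS = [1.0, 0.8, 0.6, 0.4, 0.2]  # stepwise, bounded above by unity
STAGE = 25000

def tau(iteration):
    """Iteration-dependent viscosity tau(l) = g(l) * tau_smooth."""
    stage = min(iteration // STAGE, len(G_STEPS) - 1)
    return G_STEPS[stage] * TAU_SMOOTH
```

The solver then simply evaluates tau(iteration) when assembling the dissipative flux, so early iterations see a heavily smoothed problem and late iterations see the target viscosity.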
This base case is run with a minibatch size of 5000 and an initial learning rate that is reduced every 25000 iterations by a constant factor. The stepwise values g(l) are taken as five decreasing fractions bounded by unity, such that τ(l) = g(l)τ_smooth for τ_smooth = 0.05 and l = 1, ..., 5. This means the dissipatively annealed solution takes only 60000 iterations to arrive at the final viscosity τ(5), compared to 100000 iterations for the standard algorithm. It is not clear that this is an entirely fair comparison, given the number of hyperparameters that can be tuned in each case.

Figure 6. The final timestep of the Sod shock tube solution, comparing the dissipatively annealed solution at 60K iterations to the standard solution at 100K iterations and to the exact solution.

6.2. Simultaneous parameter scans.
Dense parameter space exploration is the most reliable way to develop predictive input-output mappings for scenario development, optimal design and control, risk assessment, uncertainty quantification, etc., in many real-world applications. Without understanding the response to the parameters of the system, be they variable boundary conditions, interior shaping constraints, or constitutive coefficients, it is all but impossible to examine the utility an engineered model might have in some particular design process. PDEs clearly help in this regard, since they are fully parametrized models of anticipated physical responses. That being said, PDEs are also often fairly expensive to run forward simulations on, frequently requiring large HPC machines to simulate even modest systems [49, 50]. Consequently, parameter space exploration of a PDE system is usually both very important and very computationally expensive.

To mitigate the expense of running forward models, surrogate modeling [51] and multifidelity hierarchies [52] are often conceived in order to develop cost-effective ways of examining the response of a PDE relative to its parametric inputs. One of the ways in which these methods function is by developing reduced models to emulate the parametric dependencies from a relatively sparse sampling of the space. In this way it becomes possible (for a sufficiently smooth response surface) to tractably parametrize the input-output mapping, and thus interrogate the response at a significantly reduced cost, lending itself to statistical inference, uncertainty quantification, and real-world validation studies.

However, it is commonly the case that the response surface is not sufficiently smooth, or that its correlation scale-length cannot be determined a priori, and these reduced order mappings can become highly inaccurate. In such circumstances, having a way to emulate the exact response surface would clearly be of high practical value.
As a potential solution to this, one of the most powerful immediate features of the DNN framework seems to be the ability to perform simultaneous and dense parameter space scans with very little effort. The framework is thus a natural, and in some sense exact, emulator for complicated system response models. In the case of the Sod shock problem this can be accomplished by simply recasting (4.11) relative to its parameter γ, where instead of solving over the space-time domain (x, t) ∈ Ω × [0, T_s], the solution is solved over the parameter-augmented space (x, t, γ) ∈ Ω × [0, T_s] × [γ_low, γ_high].
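The only change needed to scan γ continuously is to sample a third input coordinate alongside (x, t). A minimal sketch follows (the domain bounds and γ-range below are illustrative placeholders, and uniform sampling stands in for ℘):

```python
import random

def sample_batch(n, t_final, gamma_range, rng):
    """Draw collocation points in the parameter-augmented input space.

    Each point is (x, t, gamma): space, time, and the physical parameter
    gamma treated as just another network input. The spatial bounds and
    gamma_range here are placeholder values, not taken from the text.
    """
    x_lo, x_hi = -0.5, 0.5
    g_lo, g_hi = gamma_range
    return [(rng.uniform(x_lo, x_hi),
             rng.uniform(0.0, t_final),
             rng.uniform(g_lo, g_hi)) for _ in range(n)]

rng = random.Random(1)
batch = sample_batch(5000, t_final=0.2, gamma_range=(1.1, 2.0), rng=rng)
```

The network input dimension grows from 2 to 3, but the optimization loop is otherwise unchanged, which is why the additional cost observed is so small.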
Figure 7.
100 3D contours of the Sod shock tube solution solved over the continuous parameter scan, (x, t, γ) ∈ Ω × [0, T_s] × [γ_low, γ_high], at 100K iterations.

Intuitively one might expect the increase of dimensionality to be prohibitive, as the parameter space grows from a discrete 2D system to a discrete 3D system. However, in analogy with the observation of section 6.1, where the parameter τ is dissipatively annealed, this is not what is observed. Instead, using the same DNN parameter sets, the same minibatch size, and the same learning rate schedule, optimization over the parameter-augmented input space (x, t, γ) reaches similar loss values with almost no computational overhead. Contour and time slice plots of the solutions are shown in Fig. 7 and Fig. 8 for the density, velocity, pressure, and temperature, each as a function of (x, t, γ) ∈ Ω × [0, T_s] × [γ_low, γ_high]. In other words, it is essentially as cheap to solve over the entire range of γ as it is to solve at a single fixed value of γ. It remains unclear how robust this behavior is over the physical parameters of the model, in this case (4.11). But however robust this behavior ends up being, even if only across isolated and specific parameters, this demonstrates a remarkably advantageous aspect of the DNN formulation.

7. DNNs With Residual Connections and Multiplicative Gates
As pointed out in [9], the DNN architecture can influence its behavior. In [9] it is claimed that an LSTM-inspired feed-forward network architecture with residual connections and multiplicative gates can help improve performance for some parabolic PDE problems. These models are effectively an alternative to the more standard deep neural network architecture presented in (2.2).

We implement these LSTM-like architectures for our 1D mixed hyperbolic-parabolic PDE system (3.9), to see how they behave relative to standard DNNs in the context of more irregularity in the solution space. Setting
Figure 8.
Variation in γ and x of the Sod shock tube solution, at the final time T_s.

ζ = (x, t), we test the Euler system (3.9) on the following architecture comprised of L + 1 hidden layers:

S_0 = φ(W ζ + b),
Z_ℓ = σ(U^{z,ℓ} ζ + W^{z,ℓ} S_ℓ + b^{z,ℓ}),  ℓ = 0, ..., L − 1,
G_ℓ = σ(U^{g,ℓ} ζ + W^{g,ℓ} S_ℓ + b^{g,ℓ}),  ℓ = 0, ..., L − 1,
R_ℓ = φ(U^{r,ℓ} ζ + W^{r,ℓ} S_ℓ + b^{r,ℓ}),  ℓ = 0, ..., L − 1,
H_ℓ = φ(U^{h,ℓ} ζ + W^{h,ℓ}(S_ℓ ⊙ R_ℓ) + b^{h,ℓ}),  ℓ = 0, ..., L − 1,
S_{ℓ+1} = (1 − G_ℓ) ⊙ H_ℓ + Z_ℓ ⊙ S_ℓ,  ℓ = 0, ..., L − 1,
z_L(ζ, θ) = W^L S_L + b^L,  (7.13)

where σ(x) = 1/(1 + e^{−x}) is the sigmoid function, ⊙ denotes the elementwise (Hadamard) product, and the network parameters are given by:

θ = { W, b, (U^{z,ℓ}, W^{z,ℓ}, b^{z,ℓ})_{ℓ=0}^{L−1}, (U^{g,ℓ}, W^{g,ℓ}, b^{g,ℓ})_{ℓ=0}^{L−1}, (U^{r,ℓ}, W^{r,ℓ}, b^{r,ℓ})_{ℓ=0}^{L−1}, (U^{h,ℓ}, W^{h,ℓ}, b^{h,ℓ})_{ℓ=0}^{L−1}, W^L, b^L },  (7.14)

and the number of units per layer is N_ℓ. For more details on the network architecture we refer the reader to [9].

To understand whether the standard DNN architecture is more effective, in some sense, than (7.13), it is essential to recognize that the degrees of freedom scale differently for the LSTM-like system (7.13) than they do in (5.12), where the LSTM-like system is substantially more expensive per network layer ℓ. More explicitly, assuming that each layer has the same width, N_ℓ, the degrees of freedom in the LSTM-like system can be calculated as:

LSTM_DoFs = 4L N_ℓ² + (4L(d_in + 1) + d_in + d_out + 1)N_ℓ + d_out.
(7.15)

Comparing (5.12) to (7.15), for a fixed number of layers L = 2, and setting for (3.9) d_in = 2 and d_out = 3, the resulting relationship between the number of units in the LSTM, N_ℓ ⟹ N_LSTM, and in the DNN, N_ℓ ⟹ N_DNN, can be computed by equating the two DoF counts, solving for N_DNN, and rounding up with a ceiling function. For our numerical test, we set L = 16 with the reference solution width set to N_DNN = 128. This leads to DNN_DoFs = 248451. Solving the relationship above then yields N_LSTM = 63. The results are presented in Fig. 9, and show fairly unambiguously that for this particular test case the LSTM-like architecture does not perform as well as the standard DNN. As above, this result must be taken in context. It may well be, for example, that the parameters tuned for standard DNNs do not tune the LSTM-like networks well, and that the result presented is not a proper comparison. Again, to fully reveal the relationship between these architectures on even just this one simple hyperbolic test case would require a full study of its own.

Figure 9. The Sod shock tube solution at the final timestep, comparing the LSTM-like architecture versus the standard DNN at fixed DoFs.

That said, from a purely practical point of view, the implementation of the DNN versus the implementation of the LSTM-like networks leads to network architectures with a large disparity in the number of graph edges. The DNN network has effectively one layer per ℓ ∈ L, while the LSTM-like network has, in some sense, five layer-like networks per ℓ ∈ L, as seen in (7.13). The result of this is that in our implementation, even at fixed degrees of freedom, the LSTM-like network takes almost five times longer in compute time to finish than the standard DNN, at a fixed number of iterations. Again, it is not clear if this slowdown can be reduced by using a more clever implementation of the LSTM-like networks, or if this is simply a result of increasing the network complexity in the graph.

8. Data-enriched PDEs
The simplicity and apparent robustness of DNNs for solving systems of nonlinear and irregular differential equations raises the question: how should experimental data be incorporated into such a framework? Work combining DNN-based approaches to solving PDEs with conventional supervised machine learning has already started to emerge; we presume that this trend will only accelerate. Raissi et al. [10] introduced "data discovered PDEs," where a relatively traditional parameter estimation is performed over differential operators that are discovered through optimization. For example, parameter estimation can be performed for λ ∈ ℝ that factors through a differential operator λuu_x. It is easy to see this can be expanded to data-driven discovery of PDEs, where parameter hierarchies can be used to generate libraries of operators that are selected based on system-specific physical criteria and constraints, in order to effectively "construct" and/or discover PDE systems from large data sets [53]. These types of approaches are emerging with increasing interest [54, 55, 56, 57, 58], and it is easy to see how they naturally lend themselves to the DNN frameworks.

Figure 10. The reactive subsystem from (8.18). To illustrate the autocatalytic oscillation, the initial state is set to ρ_1 = 2ρ_l/3, ρ_2 = ρ_l/3, though the autocatalysis is largely independent of the initial conditions.

These types of empirical PDE discovery techniques offer clear advantages in cases where the systems are assumed to be too complex to easily construct first-principles models. However, researchers also find themselves in a different type of situation, where the physics model that describes the system response is confidently prescribed, so that any mismatch between the model and data raises questions about the experimental setup just as it raises them about the model.

To explore this common circumstance, we address a situation relating to the experimental validation of a first-principles model system. First-principles models are predicated on the idea that the physical dependencies within an experimental system are, or can be, fully understood. Validation studies, however, frequently discover that this is in fact not the case [59, 60].
Frequently, systematic behaviors, engineering details, and aleatoric and/or epistemic uncertainties can cascade and lead to systems with behaviors that are uncertain and/or different from those of the model systems anticipated to describe them.

In this circumstance, a researcher might be in a situation where a prescribed model system does not plausibly validate against the large quantities of high resolution experimentally measured data that have been collected. Some of the scientific questions one might want to raise in such circumstances are:

(1) Are my experimental results reliable?
(2) Is my model system sufficient to describe the experimental data?
(3) How far from my model system is the measured data?
(4) What is the form of the mismatch between the data and the model?
(5) How might I enrich my model system to more accurately capture the observed behavior?
(6) Are there characteristics in the mismatch that reveal missing physical subsystems?

Below we present a simple example of how to approach such a circumstance, and provide an outline of how such a series of questions can be systematically retired.

8.1. Hypothetical experimental scenario.
We consider the following hypothetical scenario: an experimental test facility is set up to run experiments interrogating the behavior of transonic gas dynamics. Although the experimental setup is mildly exotic, the expected responses of the system are anticipated to obey classical gas dynamics (3.9). Experiments are run, and a process is initially set up to validate relative to the following dimensionless ideal model system:

∂_t ρ + (ρu)_x = 0,
∂_t(ρu) + (ρu² + p − τρu_x)_x = 0,
∂_t E + (Eu + pu − τρuu_x − κT_x)_x = 0,  (8.16)

where the total energy density is given by E = p/(γ − 1) + ½ρu², and the initial-boundary data is set to the same as in (3.9). Here the heat conductivity and viscosity are set to equal constants, κ = τ.
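The closure relating E, p, ρ, and u in (8.16) can be written as a pair of helper functions (a sketch; γ = 1.4 is an assumed illustrative value, not fixed by the text):

```python
def total_energy(p, rho, u, gamma=1.4):
    """Total energy density E = p/(gamma - 1) + rho*u**2/2 for the ideal model.

    gamma = 1.4 is the usual diatomic-gas default and an assumption here.
    """
    return p / (gamma - 1.0) + 0.5 * rho * u * u

def pressure(E, rho, u, gamma=1.4):
    """Invert the closure to recover p from the conserved variables."""
    return (gamma - 1.0) * (E - 0.5 * rho * u * u)
```

The two functions are exact inverses of one another, which is the property a conservative solver relies on when converting between primitive and conserved variables.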
Figure 11.
The top left shows the result from the total measured density in the experiment. The top right and middle left are the experimentally measured values of the MHD subspecies ρ_1 + ρ_2 = ρ, corresponding to the reactive subsystem in Fig. 10. The middle right is the DNN-enriched simulated value of the density after adding the f_i's. The bottom panels show the function f_1 and its first derivative, where its dominant signature effect can be found near the initial state of the system, t = 0.

A traditional approach to this problem might begin with a vigorous debate between the simulation experts and the laboratory experts, both claiming systematic errors on behalf of the other. Both camps might proceed by testing their subsystems to the best of their ability and coming to the conclusion that nothing is wrong, per se, but that there is some other physics occurring in the test facility that they have not properly anticipated. As a consequence, the next step might be to slowly start adding back physics terms to (8.16). Subsequent efforts may include parameter estimation on remaining uncertain parameters in the model until a better fit can be recovered. In this case, however, such a process would be extremely slow and tedious, as illustrated below.

Unbeknownst to both the modellers and the experimentalists, the actual dynamics going on inside the experiment is a substantially more complicated system. In fact, it turns out that a feature of the experimental setup is inadvertently inducing ionization of the gas in the chamber. As a result, the gas actually behaves like a reactive
The top left is the experimentally measured velocity, and the top right the DNN-enriched simulation including the contribution from the mismatch function f_2. The mismatch function f_2 is shown on the bottom left, with df_2/dx on the bottom right.

photoactivated autocatalytic plasma in the center of the reactor, characterized exactly by the following equations:

∂_t ρ_1 + (ρ_1 u)_x = (k_1 A_1 + k_2 ρ_1² ρ_2 − k_3 ρ_1 A_2 − k_4 ρ_1)ζ,
∂_t ρ_2 + (ρ_2 u)_x = (k_3 ρ_1 A_2 − k_2 ρ_1² ρ_2)ζ,
∂_t(ρu) + (ρu² + p + B²/8π − B_x²/4π − τρu_x)_x = 0,
∂_t(ρv) + (ρuv − B_x B_y/4π − τρv_x)_x = 0,
∂_t(ρw) + (ρuw − B_x B_z/4π − τρw_x)_x = 0,
∂_t E + (Eu + pu + (B²/8π)u − (B_x/4π)(v · B) − τρuu_x − κT_x)_x = 0,
∂_t B_x = 0,  ∇ · B = 0,
∂_t B_y + (uB_y − vB_x)_x = 0,
∂_t B_z + (uB_z − wB_x)_x = 0,  (8.17)

where E = p/(γ − 1) + ½ρ|v|² + B²/8π, and Σ_i ρ_i = ρ. Here B = (B_x, B_y, B_z) is a magnetic field, thought inconsequential in the erroneously presumed absence of ions, and the velocity field in the gas is defined as v = (u, v, w). In this hypothetical scenario, just like for the model system (8.16), the initial conditions for all unknowns are initialized with a jump discontinuity at x = 0, over the domain Ω and for t ∈ [0, T_s].
Table 3.
The initial-boundary data for the model (8.16) and the experiment (8.17).

B.C.              ρ       u        v       w         T       B_x   B_y     B_z
Left Mod. B.C.    1.08    1.2      –       –         0.8796  –     –       –
Right Mod. B.C.   0.9891  -0.0131  –       –         0.9823  –     –       –
Left Exp. B.C.    1.08    1.2      0.01    0.5       0.8796  2.0   3.6     2.0
Right Exp. B.C.   0.9891  -0.0131  0.0269  0.010037  0.9823  2.0   4.0244  2.0026
Figure 13.
The experimental temperature profile is given on the top left, and the simulation on the top right with f_3. The mismatch function f_3 is shown on the bottom left, with its x-derivative on the bottom right.

The autocatalytic reactive subsystem is induced by virtue of an unexpected ionization pulse in the reactor, leading to an oscillating chaotic attractor characterized by the reactions:

A_1 →^{k_1} ρ_1,
2ρ_1 + ρ_2 →^{k_2} 3ρ_1,
ρ_1 + A_2 →^{k_3} ρ_2 + A_3,
ρ_1 →^{k_4} A_4,  (8.18)

where ρ_1 is the first chemical species with density ρ_1, and ρ_2 the second chemical species with density ρ_2. Here the A_i are excess bulk species, and the k_i are dimensional rate constants. The condition for instability is that A_2 > A_1² + 1; thus we set A_2 = 2 and A_1 = 0.9, where for simplicity we set k_i = 150 for all i. Photoactivation only occurs in the center of the reactor, and thus ζ is a normalized Gaussian pulse centered at x = 0 with width σ, ζ = (1/(2σ√π)) e^{−x²/σ²}. The solution to this subsystem is shown in Fig. 10.
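The instability threshold and the well-mixed (u = 0, constant-ζ) limit of the reactive subsystem can be sketched as follows; using a single common rate constant k for all k_i follows the text, while the forward-Euler step size and perturbed initial state are our illustrative choices:

```python
def unstable(A1, A2):
    """Instability condition for the autocatalytic subsystem: A2 > A1**2 + 1."""
    return A2 > A1 ** 2 + 1.0

def step(r1, r2, A1, A2, k=150.0, zeta=1.0, dt=1e-4):
    """One forward-Euler step of the well-mixed (u = 0) species subsystem,
    using the source terms of the first two equations of (8.17)."""
    d1 = k * (A1 + r1 * r1 * r2 - A2 * r1 - r1) * zeta
    d2 = k * (A2 * r1 - r1 * r1 * r2) * zeta
    return r1 + dt * d1, r2 + dt * d2

# Integrate from a small perturbation of the fixed point (A1, A2/A1).
# For A1 = 0.9, A2 = 2 the condition A2 > A1**2 + 1 holds, so the state
# spirals out to a bounded oscillation instead of returning to rest.
r1, r2 = 0.91, 2.0 / 0.9
history = []
for _ in range(20000):
    r1, r2 = step(r1, r2, A1=0.9, A2=2.0)
    history.append(r1)
```

With A_2 below the threshold, the same integration instead decays back toward the fixed point, which is the qualitative distinction the condition A_2 > A_1² + 1 encodes.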
Figure 14.
The experimental magnetic fields, where the variation in B_x is measured to be negligibly small.

Given the power of the DNN framework, to reveal the mismatch of the underlying system that is actually observed in the experiment, a DNN is trained to model the following enriched PDE system instead of (8.16):

∂_t ρ + (ρu)_x = f_1,
∂_t(ρu) + (ρu² + p − τρu_x)_x = f_2,
∂_t E + (Eu + pu − τρuu_x − κT_x)_x = f_3,  (8.19)

where E = p/(γ − 1) + ½ρu², with again the same initial-boundary data from Table 3.

The solution from the experiment (8.17) is now used as training data to supervise (8.19). Formally, the objective function from (2.4) is enriched with the training data, U_exp = (ρ_exp, u_exp, E_exp) from (8.17), by appending the new loss functions

‖ρ_exp − ρ‖_{L_s(Ω×[0,T_s], E)} + ‖ρ_exp u_exp − ρu‖_{L_s(Ω×[0,T_s], E)} + ‖E_exp − E‖_{L_s(Ω×[0,T_s], E)}

to (2.4), where E are the experimentally measured data support points, and L_s is the mean square error. For simplicity, the support points E are chosen as a regular 1000 × 1000 grid in (x, t). In addition, the neural network outputs associated to the f_i = f_i(x, t) are L¹-regularized by appending

Σ_i w_i ‖f_i‖_{L¹(Ω×[0,T_s], ℘_i)},

where the distributions ℘_i are chosen to be the same as for (8.19), and the weights w_i are set small in order to minimize the penalization for accumulating nonzero f_i.

The resulting system matches up reasonably well to the experimental data. As can be seen in Fig. 11, the simulated density and experimental density match to within the eyeball norm, where the initial mismatch is corrected by the f_1 function. Though the total density nearly completely obfuscates the oscillations from the reactive subsystem in Fig. 10, the total density is off primarily near the initial state, where the reactive subsystem indicates a dramatic influx of mass due to the species being tracked by the sensor data in the experiment.
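The composite objective just described (PDE residual, supervised data misfit, and a small L¹ penalty on the enrichment sources f_i) can be sketched as follows; the function and argument names are ours:

```python
def enriched_loss(res_sq, data_pairs, f_fields, weights):
    """Composite objective sketch for a data-enriched system like (8.19).

    res_sq     : list of squared PDE residuals at collocation points
    data_pairs : list of (predicted, measured) values at support points E
    f_fields   : list of per-point samples of each enrichment source f_i
    weights    : small w_i multiplying the L1 penalty on each f_i
    """
    pde = sum(res_sq) / len(res_sq)
    data = sum((p - m) ** 2 for p, m in data_pairs) / len(data_pairs)
    reg = sum(w * sum(abs(v) for v in fi) / len(fi)
              for w, fi in zip(weights, f_fields))
    return pde + data + reg
```

Because the penalty on the f_i is small, the optimizer only accumulates nonzero source terms where the data genuinely cannot be explained by the base equations, which is what makes the recovered f_i diagnostically useful.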
This type of mismatch in the mass conservation clearly indicates chemical reactions relative to the tracked species densities, unless, of course, the system is somehow not closed. In Fig. 12 the velocity mismatch is similarly convincing, where the mismatch f_2 shows complicated behavior with nontrivial variation that cannot be readily explained by the mass influx in f_1. Similarly, the mismatch f_3 in the temperature in Fig. 13 also shows an unusual signature not present in f_1.

Returning to our list of potential questions about how the model system relates to the experimental system (8.17), we can now make some fairly strong conclusions. First, clearly there is a mass mismatch in the measured species density. The mismatch is substantial and cannot be captured by simple parameter estimation. Therefore, there is physics in the mass equation that is not being accounted for by the ascribed model system (8.16). Moreover, it is also straightforward to conclude that the giant influx of mass makes a chemical reaction a very likely candidate to explain the experimental behavior, assuming the system is closed.

In contrast to the mass equation, the momentum and energy equations in (8.16) show perturbations away from the model Sod shock solution that indicate that complicated system dynamics is occurring in the continuum. While this behavior might be more challenging to diagnose, it is clear in these cases too that the base equations (8.16), even with the addition of the f_i, cannot account for the system response, and there is missing physics. As it so happens, knowing simply the magnetic field variation of the experiment, as shown in Fig. 14, is really enough to diagnose the mismatch in both the velocity f_2 and the temperature f_3 as related to the magnetic field, since they exhibit signature features that track the magnetic field variation.

9. Conclusion
Gridless representations provided by deep neural networks, in combination with numerical optimization, can be used to solve complicated systems of differential equations that display irregular solutions. Doing so requires fairly straightforward techniques and, for irregular solutions, benefits from the usual trick of adding numerical diffusion to smooth numerically unstable function representations. The DNN method compares favorably in accuracy and stability to more conventional numerical methods, and yet enables one-shot exploration of the entire parameter space. The incorporation of efficient optimization algorithms into the DNN-based PDE solver lends itself to an ease and simplicity of integrating advanced data-analytic techniques into physics-enriched model systems. Early results indicate that DNNs enable a simple and powerful framework for exploring and advancing predictive capabilities in science and engineering.
Acknowledgements
This work was supported by U.S. DOE Contract No. DE-FG02-04ER54742 and the U.S. DOE Office of Fusion Energy Sciences Scientific Discovery through Advanced Computing (SciDAC) program under Award Number DE-SC0018429. This work was also supported by NSF grant AST-1413501. We acknowledge the Texas Advanced Computing Center at The University of Texas at Austin for providing HPC resources. Computations were performed on the Maverick2 GPU cluster.
References

[1] B.-N. Jiang and G. F. Carey, "Least-squares finite element methods for compressible Euler equations," Internat. J. Numer. Methods Fluids, vol. 10, no. 5, pp. 557–568, 1990.
[2] P. B. Bochev and M. D. Gunzburger, Least-Squares Finite Element Methods, vol. 166 of Applied Mathematical Sciences. Springer, New York, 2009.
[3] G. Strang and G. J. Fix, An Analysis of the Finite Element Method. Prentice-Hall, Englewood Cliffs, N.J., 1973.
[4] T. Belytschko, Y. Y. Lu, and L. Gu, "Element-free Galerkin methods," Internat. J. Numer. Methods Engrg., vol. 37, no. 2, pp. 229–256, 1994.
[5] X. Li, "Error estimates for the moving least-square approximation and the element-free Galerkin method in n-dimensional spaces," Appl. Numer. Math., vol. 99, pp. 77–97, 2016.
[6] I. Babuška, U. Banerjee, and J. E. Osborn, "Survey of meshless and generalized finite element methods: a unified approach," Acta Numer., vol. 12, pp. 1–125, 2003.
[7] J. Han, A. Jentzen, and W. E, "Solving high-dimensional partial differential equations using deep learning," Proceedings of the National Academy of Sciences, vol. 115, no. 34, pp. 8505–8510, 2018.
[8] J. Berg and K. Nyström, "A unified deep artificial neural network approach to partial differential equations in complex geometries," Neurocomputing, vol. 317, pp. 28–41, 2018.
[9] J. Sirignano and K. Spiliopoulos, "DGM: A deep learning algorithm for solving partial differential equations," Journal of Computational Physics, vol. 375, pp. 1339–1364, 2018.
[10] M. Raissi, P. Perdikaris, and G. Karniadakis, "Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations," Journal of Computational Physics, vol. 378, pp. 686–707, 2019.
[11] K. Xu and E. Darve, "The neural network approach to inverse problems in differential equations," 2019.
[12] P. D. Lax, Hyperbolic Systems of Conservation Laws and the Mathematical Theory of Shock Waves. CBMS Regional Conference Series in Applied Mathematics, No. 11, SIAM, Philadelphia, 1973.
[13] J. Smoller, Shock Waves and Reaction-Diffusion Equations, vol. 258 of Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, New York–Berlin, 1983.
[14] R. Abgrall and R. Saurel, "Discrete equations for physical and numerical compressible multiphase mixtures," Journal of Computational Physics, vol. 186, pp. 361–396, 2003.
[15] T. Ohwada and S. Kobayashi, "Management of discontinuous reconstruction in kinetic schemes," Journal of Computational Physics, vol. 197, pp. 116–138, 2004.
[16] R. Chevalier, J. Blondin, and R. Emmering, "Hydrodynamic instabilities in supernova remnants: self-similar driven waves," Astrophysical Journal, vol. 392, pp. 118–130, 1992.
[17] Z. Zhang and G. Gogos, "Theory of shock wave propagation during laser ablation," Physical Review B, vol. 69, 2004.
[18] J. S. Hesthaven and T. Warburton, Nodal Discontinuous Galerkin Methods: Algorithms, Analysis, and Applications, vol. 54 of Texts in Applied Mathematics. Springer, New York, 2008.
[19] T. Gerstner and M. Griebel, "Numerical integration using sparse grids," Numer. Algorithms, vol. 18, no. 3–4, pp. 209–232, 1998.
[20] J. B. Garnett, Bounded Analytic Functions, vol. 96 of Pure and Applied Mathematics. Academic Press, New York–London, 1981.
[21] R. A. DeVore and G. G. Lorentz, Constructive Approximation, vol. 303 of Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, Berlin, 1993.
[22] G. A. Sod, "A survey of several finite difference methods for systems of nonlinear hyperbolic conservation laws," Journal of Computational Physics, vol. 27, no. 1, pp. 1–31, 1978.
[23] R. Aster, B. Borchers, and C. Thurber, Parameter Estimation and Inverse Problems, vol. 90 of International Geophysics Series, pp. 1–303, 2005.
[24] N. Gordon, D. Salmond, and A. Smith, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proceedings F: Radar and Signal Processing, vol. 140, pp. 107–113, 1993.
[25] D. A. Fournier, H. J. Skaug, J. Ancheta, J. Ianelli, A. Magnusson, M. N. Maunder, A. Nielsen, and J. Sibert, "AD Model Builder: using automatic differentiation for statistical inference of highly parameterized complex nonlinear models," Optimization Methods & Software, vol. 27, no. 2, pp. 233–249, 2012.
[26] P. J. Huber, "Robust estimation of a location parameter," Ann. Math. Statist., vol. 35, pp. 73–101, 1964.
[27] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," 2014.
[28] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, vol. 9 of Proceedings of Machine Learning Research, pp. 249–256, PMLR, 2010.
[29] J. Hadamard, "Sur les problèmes aux dérivées partielles et leur signification physique," Princeton University Bulletin, vol. 13, pp. 49–52, 1902.
[30] L. C. Evans, Partial Differential Equations. American Mathematical Society, Providence, R.I., 2010.
[31] L. Wang, F. Yao, S. Zhou, and H. Jia, "Optimal regularity for the Poisson equation," Proceedings of the American Mathematical Society, vol. 137, no. 6, pp. 2037–2047, 2009.
[32] P. Isett, "A proof of Onsager's conjecture," Annals of Mathematics, vol. 188, pp. 871–963, 2018.
[33] C. Domb, M. F. Sykes, and J. T. Randall, "On the susceptibility of a ferromagnetic above the Curie point," Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, vol. 240, no. 1221, pp. 214–228, 1957.
[34] G. Mercer and A. Roberts, "A centre manifold description of contaminant dispersion in channels with varying flow properties," SIAM Journal on Applied Mathematics, vol. 50, no. 6, pp. 1547–1565, 1990.
[35] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang, "The expressive power of neural networks: A view from the width," in Advances in Neural Information Processing Systems 30, pp. 6231–6239, Curran Associates, Inc., 2017.
[36] R. J. LeVeque, Finite-Volume Methods for Hyperbolic Problems. Cambridge University Press, 2002.
[37] C. Michoski, C. Mirabito, C. Dawson, D. Wirasaet, E. J. Kubatko, and J. J. Westerink, "Adaptive hierarchic transformations for dynamically p-enriched slope-limiting over discontinuous Galerkin systems of generalized equations," Journal of Computational Physics, vol. 230, pp. 8028–8056, 2011.
[38] C. Michoski, C. Dawson, E. J. Kubatko, D. Wirasaet, S. Brus, and J. J. Westerink, "A comparison of artificial viscosity, limiters, and filters, for high order discontinuous Galerkin solutions in nonlinear settings," Journal of Scientific Computing, vol. 66, pp. 406–434, 2016.
[39] Z. Xu, "Parametrized maximum principle preserving flux limiters for high order schemes solving hyperbolic conservation laws: one-dimensional scalar problem," Mathematics of Computation, vol. 83, pp. 2213–2238, 2014.
[40] M. Dumbser, M. Kaeser, V. A. Titarev, and E. F. Toro, "Quadrature-free non-oscillatory finite volume schemes on unstructured meshes for nonlinear hyperbolic systems," Journal of Computational Physics, vol. 226, pp. 204–243, 2007.
[41] Z. Xu, Y. Liu, and C.-W. Shu, "Hierarchical reconstruction for spectral volume method on unstructured grids," Journal of Computational Physics, vol. 228, pp. 5787–5802, 2009.
[42] J. Giannakouros and G. Karniadakis, "Spectral element-FCT method for scalar hyperbolic conservation laws," International Journal for Numerical Methods in Fluids, vol. 14, pp. 707–727, 1992.
[43] S. Mabuza, J. N. Shadid, and D. Kuzmin, "Local bounds preserving stabilization for continuous Galerkin discretization of hyperbolic systems," Journal of Computational Physics, vol. 361, pp. 82–110, 2018.
[44] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. A. Hamprecht, Y. Bengio, and A. Courville, "On the spectral bias of neural networks," 2018.
[45] M. Berger and P. Colella, "Local adaptive mesh refinement for shock hydrodynamics," Journal of Computational Physics, vol. 82, pp. 64–84, 1989.
[46] R. LeVeque, "Wave propagation algorithms for multidimensional hyperbolic systems," Journal of Computational Physics, vol. 131, pp. 327–353, 1997.
[47] C. Michoski, J. Chan, L. Engvall, and J. Evans, "Foundations of the blended isogeometric discontinuous Galerkin (BIDG) method," Computer Methods in Applied Mechanics and Engineering, vol. 305, pp. 658–681, 2016.
[48] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671–680, 1983.
[49] M. Bremer, K. Kazhyken, H. Kaiser, C. Michoski, and C. Dawson, "Performance comparison of HPX versus traditional parallelization strategies for the discontinuous Galerkin method," Journal of Scientific Computing, in press, 2019.
[50] N. Malaya, D. McDougall, C. Michoski, M. Lee, and C. S. Simmons, "Experiences porting scientific applications to the Intel (KNL) Xeon Phi platform," in Proceedings of the Practice and Experience in Advanced Research Computing 2017 on Sustainability, Success and Impact (PEARC17), pp. 40:1–40:8, ACM, New York, 2017.
[51] A. Forrester, A. Sóbester, and A. Keane, Engineering Design via Surrogate Modelling: A Practical Guide. Wiley, Chichester, 2008.
[52] B. Peherstorfer, K. Willcox, and M. Gunzburger, "Survey of multifidelity methods in uncertainty propagation, inference, and optimization," SIAM Review, vol. 60, no. 3, pp. 550–591, 2018.
[53] Z. Long, Y. Lu, X. Ma, and B. Dong, "PDE-Net: Learning PDEs from data," 2018.
[54] S. H. Rudy, S. L. Brunton, J. L. Proctor, and J. N. Kutz, "Data-driven discovery of partial differential equations," Science Advances, vol. 3, no. 4, 2017.
[55] H. Schaeffer, "Learning partial differential equations via data discovery and sparse optimization," Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 473, no. 2197, p. 20160446, 2017.
[56] J. Berg and K. Nyström, "Data-driven discovery of PDEs in complex datasets," Journal of Computational Physics, vol. 384, pp. 239–252, 2019.
[57] K. Champion, B. Lusch, J. N. Kutz, and S. L. Brunton, "Data-driven discovery of coordinates and governing equations," 2019.
[58] M. Raissi, "Deep hidden physics models: Deep learning of nonlinear partial differential equations," J. Mach. Learn. Res., vol. 19, pp. 932–955, 2018.
[59] R. Ghanem, D. Higdon, and H. Owhadi, Handbook of Uncertainty Quantification. 2017.
[60] T. A. Oliver, G. Terejanu, C. S. Simmons, and R. D. Moser, "Validating predictions of unobserved quantities," Computer Methods in Applied Mechanics and Engineering, vol. 283, pp. 1310–1335, 2015.
[61] M. Dumbser, D. Balsara, M. Tavelli, and F. Fambri, "A divergence-free semi-implicit finite volume scheme for ideal, viscous, and resistive magnetohydrodynamics,"