Understanding and mitigating gradient pathologies in physics-informed neural networks
Sifan Wang
Graduate Group in Applied Mathematics and Computational Science
University of Pennsylvania
Philadelphia, PA 19104
[email protected]

Yujun Teng
Department of Mechanical Engineering and Applied Mechanics
University of Pennsylvania
Philadelphia, PA 19104
[email protected]

Paris Perdikaris
Department of Mechanical Engineering and Applied Mechanics
University of Pennsylvania
Philadelphia, PA 19104
[email protected]
January 15, 2020

Abstract
The widespread use of neural networks across different scientific domains often involves constraining them to satisfy certain symmetries, conservation laws, or other domain knowledge. Such constraints are often imposed as soft penalties during model training and effectively act as domain-specific regularizers of the empirical risk loss. Physics-informed neural networks are an example of this philosophy in which the outputs of deep neural networks are constrained to approximately satisfy a given set of partial differential equations. In this work we review recent advances in scientific machine learning with a specific focus on the effectiveness of physics-informed neural networks in predicting outcomes of physical systems and discovering hidden physics from noisy data. We also identify and analyze a fundamental mode of failure of such approaches that is related to numerical stiffness leading to unbalanced back-propagated gradients during model training. To address this limitation we present a learning rate annealing algorithm that utilizes gradient statistics during model training to balance the interplay between different terms in composite loss functions. We also propose a novel neural network architecture that is more resilient to such gradient pathologies. Taken together, our developments provide new insights into the training of constrained neural networks and consistently improve the predictive accuracy of physics-informed neural networks by a factor of 50-100x across a range of problems in computational physics. All code and data accompanying this manuscript are publicly available at https://github.com/PredictiveIntelligenceLab/GradientPathologiesPINNs.

Keywords: Deep learning · Differential equations · Optimization · Stiff dynamics · Computational physics
Thanks to breakthrough results across a diverse range of scientific disciplines [1, 2, 3, 4, 5], deep learning is currently influencing the way we process data, recognize patterns, and build predictive models of complex systems. Many of these predictive tasks are currently being tackled using over-parameterized, black-box discriminative models such as deep neural networks, in which interpretability and robustness are often sacrificed in favor of flexibility in representation
and scalability in computation. Such models have yielded remarkable results in data-rich domains [1, 6, 7], yet their effectiveness in the small data regime still remains questionable, motivating the question of how one can endow these powerful black-box function approximators with prior knowledge and appropriate inductive biases.

Attempts to address this question are currently defining two distinct schools of thought. The first pertains to efforts focused on designing specialized neural network architectures that implicitly embed any prior knowledge and inductive biases associated with a given predictive task. Without a doubt, the most celebrated example in this category is convolutional neural networks [8, 9], which have revolutionized the field of computer vision by craftily respecting the invariance along the groups of symmetries and the distributed pattern representations found in natural images [10]. Another example includes covariant neural networks [11], which are tailored to conform with the rotation and translation invariance present in many-body molecular systems. Despite their remarkable effectiveness, such approaches are currently limited to tasks that are characterized by relatively simple and well-defined symmetry groups, and often require craftsmanship and elaborate implementations. Moreover, their extension to more complex tasks is challenging, as the underlying invariances that characterize many physical systems (e.g., fluid flows) are often poorly understood or hard to implicitly encode in a neural architecture.

The second school of thought approaches the problem of endowing a neural network with prior knowledge from a different angle. Instead of designing specialized architectures that implicitly bake in this knowledge, current efforts aim to impose such constraints in a soft manner by appropriately penalizing the loss function of conventional neural network approximations, denoted by f_θ(·) and typically parametrized by a set of weights and biases θ. These penalty constraints lead to loss functions taking the general form

\mathcal{L}(\theta) := \frac{1}{N_u}\sum_{i=1}^{N_u}\left[u^i - f_\theta(x^i)\right]^2 + \lambda \mathcal{R}[f_\theta(x)],    (1)

where the empirical risk loss over a given set of input-output pairs {x^i, u^i}, i = 1, ..., N_u is penalized by some appropriate functional R[f_θ(x)] that is designed to constrain the outputs of the neural network to satisfy a set of specified conditions, as controlled by the regularization parameter λ. A representative example of this approach is physics-informed neural networks [12, 13, 14, 15], in which the outputs of a neural network are constrained to approximately satisfy a system of partial differential equations (PDE) by using a regularization functional R[f_θ(x)] that typically corresponds to the residual or the variational energy of the PDE system under the neural network representation. This framework has enabled the solution of forward and inverse problems in computational physics by reviving the original ideas in [16, 17] using modern software developments in reverse-mode differentiation [18] in order to automatically compute any derivatives present in the PDE operator. Although such approaches appear seemingly straightforward and have yielded remarkable results across a range of problems in computational science and engineering [19, 20, 21, 22, 23, 24, 25, 26], the effects of the regularization mechanism in equation 1 remain poorly understood, and in several cases can even lead to unstable and erroneous predictions (e.g.,
see remarks in [20, 26]).

In this work, we use physics-informed neural networks as a test-bed for analyzing the performance of constrained neural networks trained using regularized loss functions of the form of equation 1. Our specific contributions can be summarized in the following points:

• Our analysis reveals a fundamental mode of failure in physics-informed neural networks related to stiffness in the gradient flow dynamics.
• This stiffness leads to an unstable imbalance in the magnitude of the back-propagated gradients during model training using gradient descent.
• We propose a simple solution based on an adaptive learning rate annealing algorithm that aims to balance the interplay between data-fit and regularization.
• We also propose a novel neural network architecture that is less stiff than the conventional fully-connected neural network.
• We systematically test the proposed ideas and demonstrate consistent improvements in the predictive accuracy of physics-informed neural networks by a factor of 50-100x across a range of problems in computational physics.

Taken together, our developments provide new insight into the training of constrained neural networks that can help us endow deep learning tools with prior knowledge and reduce the energy barrier for adapting them to new scientific domains.

The paper is structured as follows. In section 2, we first present a brief overview of physics-informed neural networks (PINNs) following the original formulation of Raissi et al. [12]. Next, we introduce a simple benchmark
problem that can guide our analysis and highlight the difficulties introduced by stiffness in the gradient flow dynamics of physics-informed neural networks, see sections 2.2 - 2.4. To address these difficulties we propose an adaptive learning rate algorithm along with a novel fully-connected neural architecture, as described in sections 2.5 and 2.6. In section 3 we present a detailed evaluation of our proposed algorithm and neural network architecture across a range of representative benchmark examples. Finally, in section 4, we summarize our findings and provide a discussion on potential pitfalls and promising future directions. All code and data accompanying this manuscript are publicly available at https://github.com/PredictiveIntelligenceLab/GradientPathologiesPINNs.

Physics-informed neural networks (PINNs) [12] aim at inferring a continuous latent function u(x, t) that arises as the solution to a system of nonlinear partial differential equations (PDE) of the general form

u_t + \mathcal{N}_x[u] = 0, \quad x \in \Omega, \ t \in [0, T],
u(x, 0) = h(x), \quad x \in \Omega,
u(x, t) = g(x, t), \quad t \in [0, T], \ x \in \partial\Omega,    (2)

where x and t denote space and time coordinates, subscripts denote partial differentiation, N_x[·] is a nonlinear differential operator, Ω is a subset of R^D, and ∂Ω is the boundary of Ω. Following the original work of [12], we then proceed by approximating u(x, t) by a deep neural network f_θ(x, t), and define the residual of equation 2 as

r_\theta(x, t) := \frac{\partial}{\partial t} f_\theta(x, t) + \mathcal{N}_x[f_\theta(x, t)],    (3)

where the partial derivatives of the neural network representation with respect to the space and time coordinates can be readily computed to machine precision using reverse mode differentiation [18]. Notice how the neural network parameters θ (i.e. the weights and biases of the neural network) are shared between the representation of the latent solution u(x, t) and the PDE residual r_θ(x, t). A good set of candidate parameters can be identified via gradient descent using a composite loss function of the general form

\mathcal{L}(\theta) := \mathcal{L}_r(\theta) + \sum_{i=1}^{M} \lambda_i \mathcal{L}_i(\theta),    (4)

where L_r(θ) is a loss term that penalizes the PDE residual, and the L_i(θ), i = 1, ..., M correspond to data-fit terms (e.g., measurements, initial or boundary conditions, etc.). For a typical initial and boundary value problem, these loss functions take the specific form

\mathcal{L}_r = \frac{1}{N_r}\sum_{i=1}^{N_r}\left[r_\theta(x_r^i, t_r^i)\right]^2,    (5)
\mathcal{L}_{u_b} = \frac{1}{N_b}\sum_{i=1}^{N_b}\left[u(x_b^i, t_b^i) - g_b^i\right]^2,    (6)
\mathcal{L}_{u_0} = \frac{1}{N_0}\sum_{i=1}^{N_0}\left[u(x_0^i, 0) - h^i\right]^2,    (7)

where {x_0^i, h^i}_{i=1}^{N_0} denotes the initial data, {(x_b^i, t_b^i), g_b^i}_{i=1}^{N_b} denotes the boundary data, and {(x_r^i, t_r^i)}_{i=1}^{N_r} is a set of collocation points that are randomly placed inside the domain Ω in order to minimize the PDE residual. Consequently, L_r penalizes the equation not being satisfied at the collocation points, while L_{u_b} and L_{u_0} enforce the boundary conditions and the initial conditions, respectively. As L(θ) is typically minimized using stochastic gradient descent, a very large number of training points can be used, since the locations of x_0^i, (x_b^i, t_b^i), (x_r^i, t_r^i) can be randomized within each gradient descent iteration. The ultimate goal of this procedure is to construct a neural network representation f_θ(x, t) for which L(θ) is as close to zero as possible.
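To make the construction above concrete, the following minimal sketch shows one way the residual of equation 3 and the composite loss of equations 4-7 could be assembled with automatic differentiation. This is an illustrative Python/JAX sketch of our own (the released code accompanying this paper is written in TensorFlow); the specific operator N_x[u] = -u_xx used below, as well as the network and batch variable names, are placeholder assumptions.

```python
# Hypothetical sketch (not the authors' implementation): PINN residual and
# composite loss in the spirit of equations (2)-(7), using JAX autodiff.
import jax
import jax.numpy as jnp

def init_mlp(key, layers):
    # Glorot-style initialization of a fully-connected network, e.g. layers=[2, 50, 50, 1].
    params = []
    for n_in, n_out in zip(layers[:-1], layers[1:]):
        key, sub = jax.random.split(key)
        W = jax.random.normal(sub, (n_in, n_out)) * jnp.sqrt(2.0 / (n_in + n_out))
        params.append((W, jnp.zeros(n_out)))
    return params

def f_theta(params, x, t):
    # Scalar network output u ~ f_theta(x, t) for scalar inputs x, t.
    h = jnp.stack([x, t])
    for W, b in params[:-1]:
        h = jnp.tanh(h @ W + b)
    W, b = params[-1]
    return (h @ W + b)[0]

def residual(params, x, t):
    # r_theta = u_t + N_x[u]; here N_x[u] = -u_xx (heat equation) is a stand-in
    # for the problem-specific differential operator.
    u_t = jax.grad(f_theta, argnums=2)(params, x, t)
    u_xx = jax.grad(jax.grad(f_theta, argnums=1), argnums=1)(params, x, t)
    return u_t - u_xx

def loss(params, batch):
    # Composite loss of equation (4): PDE residual + boundary + initial data-fit terms.
    x_r, t_r, x_b, t_b, g_b, x_0, h_0 = batch
    r = jax.vmap(residual, in_axes=(None, 0, 0))(params, x_r, t_r)
    u_b = jax.vmap(f_theta, in_axes=(None, 0, 0))(params, x_b, t_b)
    u_0 = jax.vmap(f_theta, in_axes=(None, 0, 0))(params, x_0, jnp.zeros_like(x_0))
    return jnp.mean(r**2) + jnp.mean((u_b - g_b)**2) + jnp.mean((u_0 - h_0)**2)
```

In practice, the residual and data-fit terms are evaluated on freshly sampled mini-batches of collocation, boundary, and initial points at each gradient descent iteration, as described above.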
Figure 1: Helmholtz equation: Exact solution versus the prediction of a conventional physics-informed neural network model with 4 hidden layers and 50 neurons per layer after 40,000 iterations of training with gradient descent (relative L² error: 1.81e-01). Panels: exact u(x), predicted u(x), and absolute error.

Despite a series of promising results [19, 20, 23, 24, 25, 26, 13], the original formulation of Raissi et al. [12] often has difficulties in constructing an accurate approximation to the exact latent solution u(x, t) for reasons that yet remain poorly understood. These pathologies can arise even in the simplest possible setting, corresponding to solving classical linear elliptic equations. As an example, let us consider the Helmholtz equation in two space dimensions,

\Delta u(x, y) + k^2 u(x, y) = q(x, y), \quad (x, y) \in \Omega := (-1, 1)^2,    (8)
u(x, y) = h(x, y), \quad (x, y) \in \partial\Omega,    (9)

where Δ is the Laplace operator. One can easily fabricate an exact solution to this problem taking the form u(x, y) = sin(a_1 πx) sin(a_2 πy), corresponding to a source term of the form

q(x, y) = -(a_1\pi)^2 \sin(a_1\pi x)\sin(a_2\pi y) - (a_2\pi)^2 \sin(a_1\pi x)\sin(a_2\pi y) + k^2 \sin(a_1\pi x)\sin(a_2\pi y),    (10)

where we take a_1 = 1 and a_2 = 4. A PINN approximation to solving equation 8 can be constructed by parametrizing its solution with a deep neural network f_θ(x, y), whose parameters θ can be identified by minimizing a composite loss function that aims to fit the known boundary conditions, while also penalizing the Helmholtz equation residual inside the domain Ω, i.e.,

\mathcal{L}(\theta) = \mathcal{L}_r(\theta) + \mathcal{L}_{u_b}(\theta).    (11)

Without loss of generality, let us consider an example prediction scenario in which f_θ(x, y) is a 4-layer deep fully-connected neural network with 50 neurons per layer and a hyperbolic tangent activation function. We train this network for 40,000 stochastic gradient descent steps by minimizing the loss of equation 11 using the Adam optimizer [27] with a decreasing learning rate annealing schedule. In figure 1 we compare the predictions of this trained model against the exact solution for this problem, and report the point-wise absolute discrepancy between the two. It is evident that the PINN approximation does a poor job at fitting the boundary conditions, leading to a prediction error of 1.81e-01 measured in the relative L² norm.

To explore the reason why this model fails to return accurate predictions, we draw motivation from the seminal work of Glorot and Bengio [28] and monitor the distribution of the back-propagated gradients of our loss with respect to the neural network parameters during training. Rather than tracking the gradients of the aggregate loss L(θ), we track the gradients of each individual term, L_{u_b}(θ) and L_r(θ), with respect to the weights in each hidden layer of the neural network. As illustrated in figure 2, the gradients corresponding to the boundary loss term L_{u_b}(θ) in each layer are sharply concentrated around zero and overall attain significantly smaller values than the gradients corresponding to the PDE residual loss L_r(θ). As is well known, in the absence of proper restrictions such as boundary or initial conditions, a PDE system may have infinitely many solutions [29]. Therefore, if the gradients ∇_θ L_{u_b}(θ) are very small during training, our PINN model will experience difficulties in fitting the boundary conditions. On the other hand, since the gradients ∇_θ L_r(θ) are large, the neural network can easily learn any of the solutions that satisfy the equation. As a result, our trained model is heavily biased towards returning a solution with a small PDE residual that does not, however, respect the given boundary conditions, and it is therefore prone to return erroneous predictions.
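The layer-wise gradient histograms of figure 2 can be produced with a short diagnostic of the following kind. This is our own illustrative Python/JAX sketch (the paper's released code is in TensorFlow); it assumes the network parameters are stored as a list of per-layer (weights, biases) tuples and that each loss term is a callable of the parameters.

```python
# Hypothetical sketch of the diagnostic behind Figure 2: back-propagated gradients of
# each loss term, collected layer by layer.
import jax
import jax.numpy as jnp

def per_layer_gradients(loss_fn, params):
    # Flattened weight-gradient of `loss_fn` for each layer (biases omitted for brevity).
    grads = jax.grad(loss_fn)(params)
    return [jnp.ravel(gW) for (gW, gb) in grads]

def gradient_summary(loss_r, loss_ub, params):
    # Compare the two gradient distributions layer by layer via their mean magnitudes;
    # the raw arrays can also be passed to a histogram routine for plots like Figure 2.
    for k, (g_r, g_b) in enumerate(zip(per_layer_gradients(loss_r, params),
                                       per_layer_gradients(loss_ub, params))):
        print(f"layer {k}: mean|grad L_r| = {float(jnp.mean(jnp.abs(g_r))):.3e}, "
              f"mean|grad L_ub| = {float(jnp.mean(jnp.abs(g_b))):.3e}")
```

Comparing the resulting distributions (or simply their mean magnitudes) layer by layer is what reveals the imbalance described above.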
It is now natural to ask what mechanism gives rise to this gradient imbalance between the two loss terms, L_{u_b}(θ) and L_r(θ). For starters, one may be quick to notice that the chain rule for computing the gradients
of L_r(θ) is deeper than the chain rule for computing the gradients of L_{u_b}(θ), and thus the gradients of L_r should be more prone to vanishing gradient pathologies. Paradoxically, here we observe the opposite behavior, as it is the ∇_θ L_{u_b}(θ) gradients that appear to vanish during model training (see figure 2).

Figure 2: Helmholtz equation: Histograms of the back-propagated gradients ∇_θ L_r(θ) and ∇_θ L_{u_b}(θ) at each layer at the 40,000-th iteration of training a standard PINN model for solving the Helmholtz equation.

To understand what gives rise to this behavior, let us take a step back and consider an even simpler benchmark, the one-dimensional Poisson equation

\Delta u(x) = g(x), \quad x \in [0, 1],
u(x) = h(x), \quad x = 0 \ \text{and} \ x = 1.    (12)

To this end, let us consider exact solutions of the form u(x) = sin(Cx), corresponding to a source term given by g(x) = -C² sin(Cx) = -C² u(x). Under the PINNs framework, we will use a fully-connected deep neural network f_θ(x) parametrized by θ to approximate the latent solution u(x). Then the corresponding loss function over a collection of boundary and residual data-points is given by

\mathcal{L}(\theta) = \mathcal{L}_r(\theta) + \mathcal{L}_{u_b}(\theta) = \frac{1}{N_b}\sum_{i=1}^{N_b}\left[f_\theta(x_b^i) - h(x_b^i)\right]^2 + \frac{1}{N_r}\sum_{i=1}^{N_r}\left[\frac{\partial^2}{\partial x^2} f_\theta(x_r^i) - g(x_r^i)\right]^2.    (13)

To provide some analysis, let us assume that there exists a trained neural network f_θ(x) that can provide a good approximation to the latent solution u(x) (i.e. the target solution is in the Hilbert space spanned by the neural network degrees of freedom). Then we may express our approximation as f_θ(x) = u(x) ε_θ(x), where ε_θ(x) is a smooth function defined in [0, 1] for which |ε_θ(x) - 1| ≤ ε for some ε > 0, and ‖∂^k ε_θ(x)/∂x^k‖_{L∞} < ε for all non-negative integers k. This construction allows us to derive the following bounds for the gradients of the PDE boundary loss and residual loss (see proof in Appendix 5.1):

\|\nabla_\theta \mathcal{L}_{u_b}(\theta)\|_{L^\infty} \leq \epsilon \cdot \|\nabla_\theta \epsilon_\theta(x)\|_{L^\infty},    (14)
\|\nabla_\theta \mathcal{L}_{r}(\theta)\|_{L^\infty} \leq \mathcal{O}(C^4) \cdot \epsilon \cdot \|\nabla_\theta \epsilon_\theta(x)\|_{L^\infty}.    (15)

Based on this simple analysis, we can conclude that if the constant C is large, then the norm of the gradients of L_r(θ) may be much greater than that of the gradients of L_{u_b}(θ), thus biasing the neural network training towards neglecting the contribution of the boundary data-fit term. To confirm this result we have performed a series of simulations in which standard PINN models are trained to approximate the solution of equation 12 for different values of the constant C. Our results are summarized in figure 3, indicating that larger values of the constant C lead to a pronounced imbalance between the back-propagated gradients ∇_θ L_r(θ) and ∇_θ L_{u_b}(θ), which ultimately leads to an inaccurate reconstruction of the PDE solution.

To provide further insight into the pathology of imbalanced gradients, let us consider the continuous limit of the learning dynamics for calibrating the neural network parameters θ, as governed by the gradient flow system

\frac{d\theta}{dt} = -\nabla_\theta \mathcal{L}_r(\theta) - \sum_{i=1}^{M} \nabla_\theta \mathcal{L}_i(\theta).    (16)
Figure 3: Poisson equation: Histograms of the back-propagated gradients ∇_θ L_r(θ) and ∇_θ L_{u_b}(θ) at each layer at the 40,000-th iteration of training a standard PINN model for solving the one-dimensional Poisson equation for different values of the constant C (panels: (a) C = 1, (b) C = 2, (c) C = 4); see equation 12.
Starting from an initial guess, one can integrate the gradient flow to obtain a local minimum of the total loss L(θ). It is straightforward to see that the gradient descent updates of equation 39 correspond to a forward Euler discretization [30] of equation 16. It is also well understood that the stability of this explicit, first-order discretization strategy is severely limited by the choice of the step-size/learning rate, especially for cases in which the governing gradient flow dynamics are stiff. We believe that such cases arise routinely in the context of physics-informed neural networks, in which the different terms in the loss have inherently different nature and often correspond to competing objectives (e.g., fitting a set of noisy data versus enforcing a zero PDE residual).

To obtain a quantitative assessment of stiffness in the gradient flow dynamics of a neural network, one can compute and monitor the largest eigenvalue σ_max(∇²_θ L(θ)) of the Hessian matrix ∇²_θ L(θ) during model training [31]. This immediately follows from performing a stability analysis for the linearized system

\frac{d}{dt}\tilde{\theta}(t) = -\nabla^2_\theta \mathcal{L}(\tilde{\theta}(t)) \cdot \tilde{\theta}(t).    (17)

It is well known that the largest eigenvalue of the Hessian dictates the fastest time-scale of the system and directly imposes a restriction on the learning rate that one needs to employ to ensure stability in the forward Euler discretization, as reflected by the gradient descent update rule of equation 39. In fact, a classical result in numerical analysis on the conditional stability of the forward Euler method requires bounding the learning rate as η < 2/σ_max(∇²_θ L(θ)) [30].

To illustrate the presence of stiffness in the gradient flow dynamics of physics-informed neural networks, let us revisit the simple simulation study presented in section 2.2 for the two-dimensional Helmholtz benchmark. Specifically, we consider two cases by modulating the control parameters a_1 and a_2 in equation 8. In the first case we set a_1 = 1, a_2 = 1 to fabricate a relatively simple solution u(x, y) that has an isotropic length-scale across both spatial dimensions. In the second case we choose a_1 = 1, a_2 = 4 to fabricate a solution u(x, y) that exhibits directional anisotropy and has more oscillations along the y-coordinate. For each case we then employ the same simulation setup presented in section 2.2 to train a 4-layer deep physics-informed neural network for approximating the latent solution u(x, y). During training we probe the stiffness of the gradient flow dynamics by computing and monitoring the largest eigenvalue of the Hessian matrix, σ_max(∇²_θ L(θ)). In figure 5 we present the recorded values of σ_max(∇²_θ L(θ)) at different iterations of the gradient descent algorithm. Our results reveal that, as the complexity of the target solution is increased, the training dynamics of the physics-informed neural network become increasingly stiff, ultimately leading to a severe restriction on the learning rate required for stable gradient descent updates.
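The largest Hessian eigenvalue reported in figure 5 can be estimated without ever forming the Hessian, for example by power iteration on Hessian-vector products. The sketch below is our own illustrative Python/JAX version of such a probe (the function and parameter names are ours, not the authors'); it returns the dominant (largest-magnitude) eigenvalue of ∇²_θ L(θ).

```python
# Hypothetical sketch: estimating sigma_max of the loss Hessian via power iteration
# on Hessian-vector products, as a stiffness proxy for the gradient flow.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def largest_hessian_eigenvalue(loss_fn, params, num_iters=50, seed=0):
    flat, unravel = ravel_pytree(params)
    flat_loss = lambda w: loss_fn(unravel(w))

    def hvp(v):
        # Hessian-vector product via forward-over-reverse differentiation.
        return jax.jvp(jax.grad(flat_loss), (flat,), (v,))[1]

    v = jax.random.normal(jax.random.PRNGKey(seed), flat.shape)
    v = v / jnp.linalg.norm(v)
    eig = 0.0
    for _ in range(num_iters):
        Hv = hvp(v)
        eig = jnp.vdot(v, Hv)                       # Rayleigh quotient estimate
        v = Hv / (jnp.linalg.norm(Hv) + 1e-12)
    return eig
```

Because each Hessian-vector product costs roughly one extra forward/backward pass, such a probe can be run every few hundred iterations without noticeably slowing down training.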
The following simple analysis may reveal the connection between stiffness in the gradient flow dynamics and the evident difficulty of training physics-informed neural networks with gradient descent. Suppose that at the n-th step of gradient descent during training we have

\theta_{n+1} = \theta_n - \eta \nabla_\theta \mathcal{L}(\theta_n) = \theta_n - \eta\left[\nabla_\theta \mathcal{L}_r(\theta_n) + \nabla_\theta \mathcal{L}_{u_b}(\theta_n)\right],    (18)

where η is the learning rate. Then applying a second-order Taylor expansion to the loss function L(θ) at θ_n gives

\mathcal{L}(\theta_{n+1}) = \mathcal{L}(\theta_n) + (\theta_{n+1} - \theta_n) \cdot \nabla_\theta \mathcal{L}(\theta_n) + \frac{1}{2}(\theta_{n+1} - \theta_n)^T \nabla^2_\theta \mathcal{L}(\xi)(\theta_{n+1} - \theta_n),    (19)

where ξ = tθ_n + (1 - t)θ_{n+1} for some t ∈ [0, 1], and ∇²_θ L(ξ) is the Hessian matrix of the loss function L(θ) evaluated at ξ. Now substituting equation 18 into equation 19, we obtain

\mathcal{L}(\theta_{n+1}) - \mathcal{L}(\theta_n) = -\eta \nabla_\theta \mathcal{L}(\theta_n) \cdot \nabla_\theta \mathcal{L}(\theta_n) + \frac{1}{2}\,\eta\nabla_\theta \mathcal{L}(\theta_n)^T \nabla^2_\theta \mathcal{L}(\xi)\, \eta\nabla_\theta \mathcal{L}(\theta_n)    (20)
= -\eta \|\nabla_\theta \mathcal{L}(\theta_n)\|^2 + \frac{1}{2}\eta^2 \nabla_\theta \mathcal{L}(\theta_n)^T \nabla^2_\theta \mathcal{L}(\xi) \nabla_\theta \mathcal{L}(\theta_n)    (21)
= -\eta \|\nabla_\theta \mathcal{L}(\theta_n)\|^2 + \frac{1}{2}\eta^2 \nabla_\theta \mathcal{L}(\theta_n)^T \left(\nabla^2_\theta \mathcal{L}_r(\xi) + \nabla^2_\theta \mathcal{L}_{u_b}(\xi)\right) \nabla_\theta \mathcal{L}(\theta_n)    (22)
= -\eta \|\nabla_\theta \mathcal{L}(\theta_n)\|^2 + \frac{1}{2}\eta^2 \nabla_\theta \mathcal{L}(\theta_n)^T \nabla^2_\theta \mathcal{L}_r(\xi) \nabla_\theta \mathcal{L}(\theta_n) + \frac{1}{2}\eta^2 \nabla_\theta \mathcal{L}(\theta_n)^T \nabla^2_\theta \mathcal{L}_{u_b}(\xi) \nabla_\theta \mathcal{L}(\theta_n).    (23)

Here, note that

\nabla_\theta \mathcal{L}(\theta_n)^T \nabla^2_\theta \mathcal{L}(\xi) \nabla_\theta \mathcal{L}(\theta_n) = \|\nabla_\theta \mathcal{L}(\theta_n)\|^2 \, \frac{\nabla_\theta \mathcal{L}(\theta_n)^T}{\|\nabla_\theta \mathcal{L}(\theta_n)\|} \nabla^2_\theta \mathcal{L}(\xi) \frac{\nabla_\theta \mathcal{L}(\theta_n)}{\|\nabla_\theta \mathcal{L}(\theta_n)\|}    (24)
= \|\nabla_\theta \mathcal{L}(\theta_n)\|^2 \, x^T Q^T \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_N) Q x    (25)
= \|\nabla_\theta \mathcal{L}(\theta_n)\|^2 \, y^T \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_N) y    (26)
= \|\nabla_\theta \mathcal{L}(\theta_n)\|^2 \sum_{i=1}^{N} \lambda_i y_i^2,    (27)
where x = ∇_θ L(θ_n)/‖∇_θ L(θ_n)‖, Q is an orthogonal matrix diagonalizing ∇²_θ L(ξ), y = Qx, and λ_1 ≤ λ_2 ≤ ··· ≤ λ_N are the eigenvalues of ∇²_θ L(ξ). Similarly, we have

\nabla_\theta \mathcal{L}(\theta_n)^T \nabla^2_\theta \mathcal{L}_r(\xi) \nabla_\theta \mathcal{L}(\theta_n) = \|\nabla_\theta \mathcal{L}(\theta_n)\|^2 \sum_{i=1}^{N} \lambda_i^r y_i^2,    (28)
\nabla_\theta \mathcal{L}(\theta_n)^T \nabla^2_\theta \mathcal{L}_{u_b}(\xi) \nabla_\theta \mathcal{L}(\theta_n) = \|\nabla_\theta \mathcal{L}(\theta_n)\|^2 \sum_{i=1}^{N} \lambda_i^{u_b} y_i^2,    (29)

where λ_1^r ≤ λ_2^r ≤ ··· ≤ λ_N^r and λ_1^{u_b} ≤ λ_2^{u_b} ≤ ··· ≤ λ_N^{u_b} are the eigenvalues of ∇²_θ L_r and ∇²_θ L_{u_b}, respectively. Thus, combining these together we get

\mathcal{L}(\theta_{n+1}) - \mathcal{L}(\theta_n) = \eta\|\nabla_\theta \mathcal{L}(\theta_n)\|^2 \Big(-1 + \frac{\eta}{2}\sum_{i=1}^{N} \lambda_i y_i^2\Big),    (30)
\mathcal{L}_r(\theta_{n+1}) - \mathcal{L}_r(\theta_n) = \eta\|\nabla_\theta \mathcal{L}(\theta_n)\|^2 \Big(-1 + \frac{\eta}{2}\sum_{i=1}^{N} \lambda_i^r y_i^2\Big),    (31)
\mathcal{L}_{u_b}(\theta_{n+1}) - \mathcal{L}_{u_b}(\theta_n) = \eta\|\nabla_\theta \mathcal{L}(\theta_n)\|^2 \Big(-1 + \frac{\eta}{2}\sum_{i=1}^{N} \lambda_i^{u_b} y_i^2\Big).    (32)

Thus, if the gradient flow is stiff, then many eigenvalues of ∇²_θ L(ξ) may be large. As a result, it is quite possible that L(θ_{n+1}) - L(θ_n) > 0, which implies that gradient descent fails to decrease the loss even though -∇_θ L(θ_n) is the right descent direction. This result comes as no surprise, as it is well understood that gradient descent may get stuck in limit cycles or even diverge in the presence of multiple competing objectives [32, 33].

To verify our hypothesis, we again consider the Helmholtz example of section 2. Specifically, we choose a five-layer neural network with 50 units in each hidden layer. To reduce stochastic effects introduced by mini-batch gradient descent, we use full-batch gradient descent with an exponentially decaying learning rate schedule (decay-rate 0.9 every 1,000 iterations). All collocation points are uniformly sampled from both the boundary and the interior of the domain. Figure 4a shows the eigenvalues of ∇²_θ L_{u_b}(θ) and ∇²_θ L_r(θ) in ascending order. Clearly, many eigenvalues of ∇²_θ L_r(θ) are extremely large, and the stiffness of the gradient flow is dominated by L_r(θ). As a consequence, in figure 4b we observe that the loss L_r(θ) oscillates wildly even when full-batch gradient descent is employed, while the loss L_{u_b}(θ) exhibits a decreasing tendency, since the eigenvalues of ∇²_θ L_{u_b}(θ) remain relatively small. Note also that the oscillations of L_r(θ) become smaller as the iterations progress, since the learning rate decays. The above analysis offers another angle on how the stiff nature of the gradient flow system may cause severe difficulties in the training of physics-informed neural networks with gradient descent. It also hints at the importance of choosing the learning rate such that a stable discretization of the gradient flow is achieved.

Moreover, applying the classical Sobolev inequality [34] to ∇_θ L_r and ∇_θ L_{u_b} gives

\|\nabla_\theta \mathcal{L}_r(\theta)\|_{L^\infty} \leq C \|\nabla_\theta \mathcal{L}_r(\theta)\|_{H^1},    (33)
\|\nabla_\theta \mathcal{L}_{u_b}(\theta)\|_{L^\infty} \leq C \|\nabla_\theta \mathcal{L}_{u_b}(\theta)\|_{H^1},    (34)

where C is a constant that does not depend on ∇_θ L_r(θ) or ∇_θ L_{u_b}(θ). Since the right-hand sides involve the corresponding Hessians, by diagonalizing ∇²_θ L_r(θ) and ∇²_θ L_{u_b}(θ) with orthonormal matrices we see that ‖∇_θ L_r(θ)‖_{L∞} and ‖∇_θ L_{u_b}(θ)‖_{L∞} are controlled by the largest eigenvalues of ∇²_θ L_r(θ) and ∇²_θ L_{u_b}(θ), respectively. Then, again from figure 4a, we can conclude that ‖∇_θ L_r(θ)‖_{L∞} is much greater than ‖∇_θ L_{u_b}(θ)‖_{L∞}.
Therefore, equations 33 and 34 provide another (loose) upper bound for ∇_θ L_r(θ) and ∇_θ L_{u_b}(θ), and reveal why stiffness of the gradient flow may result in unbalanced gradient pathologies.

Based on these findings we may conclude that the de facto use of gradient descent and its modern variants (e.g., Adam [27]) may be a poor, or even unstable, choice for training physics-informed neural networks. From a numerical analysis standpoint, it is then natural to investigate whether more stable discretizations, such as implicit or IMEX operator splitting schemes [30, 35], can yield more effective optimization algorithms for constrained neural networks. From an optimization standpoint, this setting has motivated the development of proximal gradient algorithms [36] that have been successfully applied to ill-posed inverse problems involving stiff regularization terms [37], as well as effective algorithms for finding Nash equilibria in multi-player games [33, 38]. Nevertheless, these intriguing directions go beyond the scope of the present work and define an area for future investigation. As discussed in the next section, here our goal is to retain the simplicity of gradient descent while providing a non-intrusive and adaptive heuristic for
enhancing its performance. Similar approaches can be found in the numerical analysis literature for integrating stiff and multi-scale dynamical systems, see for example the multi-rate schemes of Gear et al. [39] and the flow-averaging integration techniques put forth by Tao et al. [40].

Figure 4: Helmholtz equation: (a) All eigenvalues of ∇²_θ L_r(θ) and ∇²_θ L_{u_b}(θ), respectively, arranged in increasing order. (b) Loss curves of L_r(θ) and L_{u_b}(θ), respectively, over 40,000 iterations of gradient descent using the Adam optimizer in full-batch mode.

Figure 5: Stiffness in the gradient flow dynamics: Largest eigenvalue of the Hessian ∇²_θ L(θ) during the training of a physics-informed neural network model for approximating the solution to the two-dimensional Helmholtz problem (see equation 8) and for different values of the control parameters a_1 and a_2 (stiff case: a_1 = 1, a_2 = 4; non-stiff case: a_1 = 1, a_2 = 1).

Having identified a common mode of failure for physics-informed neural networks related to unbalanced gradients during back-propagation, we can investigate potential remedies for overcoming this pathology. To do so, let us re-examine the general form of a PINNs loss function

\mathcal{L}(\theta) := \mathcal{L}_r(\theta) + \sum_{i=1}^{M} \mathcal{L}_i(\theta),    (35)

the minimization of which is typically performed according to the following gradient descent update:

\theta_{n+1} = \theta_n - \eta \nabla_\theta \mathcal{L}(\theta_n) = \theta_n - \eta\Big[\nabla_\theta \mathcal{L}_r(\theta_n) + \sum_{i=1}^{M} \nabla_\theta \mathcal{L}_i(\theta_n)\Big],    (36)
where η is a learning rate parameter. A straightforward way to balance the interplay between the different terms in this loss is to multiply each L_i(θ) term by a constant λ_i. More specifically, we consider minimizing the following loss, in which the weights λ_i resemble the role of penalty coefficients in constrained optimization [41]:

\mathcal{L}(\theta) := \mathcal{L}_r(\theta) + \sum_{i=1}^{M} \lambda_i \mathcal{L}_i(\theta).    (37)

Consequently, the corresponding gradient descent updates now take the form

\theta_{n+1} = \theta_n - \eta \nabla_\theta \mathcal{L}(\theta_n)    (38)
            = \theta_n - \eta \nabla_\theta \mathcal{L}_r(\theta_n) - \eta \sum_{i=1}^{M} \lambda_i \nabla_\theta \mathcal{L}_i(\theta_n),    (39)

where we see how the constants λ_i effectively introduce a re-scaling of the learning rate corresponding to each loss term. Obviously, the next question that needs to be answered is how the weights λ_i should be chosen. It is straightforward to see that choosing the λ_i arbitrarily following a trial-and-error procedure is extremely tedious and may not produce satisfying results. Moreover, the optimal constants may vary greatly for different problems, which means that we cannot find a fixed empirical recipe that is transferable across different PDEs. Most importantly, the loss function typically consists of several parts that each impose different restrictions on the solution, and it is impractical to assign weights to all of them manually.

Here we draw motivation from Adam [27], one of the most widely used adaptive learning rate optimizers in the deep learning literature, to derive an adaptive rule for choosing the λ_i weights online during model training. The basic idea behind Adam is to keep track of the first- and second-order moments of the back-propagated gradients during training, and to utilize this information to adaptively scale the learning rate associated with each parameter in the θ vector. In a similar spirit, our proposed learning rate annealing procedure, summarized in algorithm 1, is designed to automatically tune the λ_i weights by utilizing the back-propagated gradient statistics during model training, such that the interplay between all terms in equation 37 is appropriately balanced.
Algorithm 1: Learning rate annealing for physics-informed neural networks

Consider a physics-informed neural network f_θ(x) with parameters θ and a loss function

\mathcal{L}(\theta) := \mathcal{L}_r(\theta) + \sum_{i=1}^{M} \lambda_i \mathcal{L}_i(\theta),

where L_r(θ) denotes the PDE residual loss, the L_i(θ) correspond to data-fit terms (e.g., measurements, initial or boundary conditions, etc.), and λ_i = 1, i = 1, ..., M are free parameters used to balance the interplay between the different loss terms. Then use S steps of a gradient descent algorithm to update the parameters θ as follows:

for n = 1, ..., S do
  (a) Compute λ̂_i by

      \hat{\lambda}_i = \frac{\max_\theta\{|\nabla_\theta \mathcal{L}_r(\theta_n)|\}}{\overline{|\nabla_\theta \mathcal{L}_i(\theta_n)|}}, \quad i = 1, \dots, M,    (40)

      where \overline{|\nabla_\theta \mathcal{L}_i(\theta_n)|} denotes the mean of |∇_θ L_i(θ_n)| over the parameters θ.

  (b) Update the weights λ_i using a moving average of the form

      \lambda_i = (1 - \alpha)\lambda_i + \alpha \hat{\lambda}_i, \quad i = 1, \dots, M.    (41)

  (c) Update the parameters θ via gradient descent

      \theta_{n+1} = \theta_n - \eta \nabla_\theta \mathcal{L}_r(\theta_n) - \eta \sum_{i=1}^{M} \lambda_i \nabla_\theta \mathcal{L}_i(\theta_n).    (42)
end

The recommended hyper-parameter values are η = 10^{-3} and α = 0.9.
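A minimal sketch of how Algorithm 1 could be realized in code is given below. This is our own illustrative Python/JAX version, not the authors' released TensorFlow implementation; the helper names and the default moving-average rate are assumptions of this sketch.

```python
# Hypothetical sketch of Algorithm 1: adaptive weights lambda_i computed from the
# gradient-statistics ratio in equation (40) and the moving average in equation (41).
import jax
from jax.flatten_util import ravel_pytree
import jax.numpy as jnp

def grad_stats(loss_fn, params):
    # Max and mean magnitude of the back-propagated gradient of one loss term.
    g, _ = ravel_pytree(jax.grad(loss_fn)(params))
    return jnp.max(jnp.abs(g)), jnp.mean(jnp.abs(g))

def update_lambdas(loss_r, data_losses, params, lambdas, alpha=0.9):
    # data_losses: list of callables L_i(params); lambdas: current weights lambda_i.
    max_grad_r, _ = grad_stats(loss_r, params)
    new_lambdas = []
    for loss_i, lam in zip(data_losses, lambdas):
        _, mean_grad_i = grad_stats(loss_i, params)
        lam_hat = max_grad_r / (mean_grad_i + 1e-12)                # equation (40)
        new_lambdas.append((1.0 - alpha) * lam + alpha * lam_hat)   # equation (41)
    return new_lambdas

def total_loss(loss_r, data_losses, lambdas, params):
    # Weighted composite loss of equation (37), used in the descent step (42).
    return loss_r(params) + sum(lam * L(params) for lam, L in zip(lambdas, data_losses))
```

In practice the updates of steps (a)-(b) need not run at every iteration; as discussed below, refreshing the weights every few gradient descent steps keeps the extra overhead small.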
The proposed algorithm is characterized by two steps. First, we compute instantaneous values for the constants λ̂_i as the ratio between the maximum gradient magnitude attained by ∇_θ L_r(θ) and the mean of the gradient magnitudes computed for each of the L_i(θ) loss terms, namely \overline{|\nabla_\theta \mathcal{L}_i(\theta)|}; see equation 40. As these instantaneous values are expected to exhibit high variance due to the stochastic nature of the gradient descent updates, the actual weights λ_i are computed as a running average of their previous values, as shown in equation 41. Notice that the updates in equations 40 and 41 can either take place at every iteration of the gradient descent loop, or at a frequency specified by the user (e.g., every 10 gradient descent steps). Finally, a gradient descent step is performed to update the neural network parameters θ using the current weight values stored in λ_i, see equation 42.

The key benefit of the proposed procedure is that it can be easily generalized to loss functions consisting of multiple terms (e.g., multi-variate problems with multiple boundary conditions on different variables), while the extra computational overhead associated with computing the gradient statistics in equation 40 is small, especially in the case of infrequent updates. Moreover, our computational studies confirm very low sensitivity to the additional hyper-parameter α, as the accuracy of our results does not exhibit any significant variance when this parameter takes values within a reasonable range.

A first illustration of the effectiveness of the proposed learning-rate annealing algorithm is provided through the lens of the Helmholtz benchmark presented in equation 8. Figures 6 and 7 summarize the predictions of the same 4-layer deep physics-informed neural network model used in section 2.2 after training it using algorithm 1 for 40,000 gradient descent iterations. Evidently, the proposed training scheme is able to properly balance the interplay between the boundary and the residual loss, and improves the relative prediction error by more than one order of magnitude. In particular, by comparing figures 1 and 6, notice how the absolute point-wise error at the domain boundaries has been significantly reduced. Finally, figure 8 summarizes the convergent evolution of the constant λ_{u_b} used to scale the boundary condition loss L_{u_b}(θ) in equation 11 during model training using algorithm 1.

Figure 6: Helmholtz equation: Exact solution versus the prediction of a physics-informed neural network model with 4 hidden layers and 50 neurons per layer after 40,000 iterations of training using the proposed learning rate annealing algorithm (relative L² error: 1.27e-02).

Figure 7: Helmholtz equation: Histograms of the back-propagated gradients ∇_θ L_r and ∇_θ L_{u_b} at each layer during training of a PINN with the proposed algorithm 1 for solving the Helmholtz equation.
Figure 8: Helmholtz equation: Evolution of the constant λ_{u_b} used to scale the boundary condition loss L_{u_b}(θ) in equation 11 during model training using algorithm 1.

Besides carefully formulating a loss function (and effective algorithms to minimize it), another ingredient that is key to the success of deep neural networks is the design of their architecture. The most successful neural architectures in the literature are in principle designed to exploit prior knowledge associated with a given task. For instance, convolutional neural networks [1, 9] are widely used for object recognition and image classification tasks. This is because the convolution operation naturally adheres to the symmetries induced by the translation group, which is a crucial feature in image recognition. Similarly, recurrent neural networks [42, 9] are well suited to modeling sequence data due to their ability to respect temporal invariance and capture long-term dependencies. Although respecting the invariances that characterize a given task is crucial in designing an effective neural architecture, in many scenarios related to physical systems the underlying symmetry groups are often non-trivial or even unknown. This is the main reason why the majority of recent works focused on applying deep learning tools to modeling and simulating physical systems governed by PDEs still relies on generic fully-connected architectures [12, 24, 23, 43, 26]. Examples include recent studies on physics-informed neural networks [12, 44], which primarily use fully-connected architectures with weak inductive biases and simply rely on the fact that the representational capacity of these over-parametrized networks can be sufficient to correctly capture the target solution of a given PDE. However, in the absence of a universal approximation theorem for physics-informed neural networks (i.e., fully-connected networks trained subject to PDE constraints), it is currently unclear whether standard fully-connected architectures offer sufficiently flexible representations for inferring the solution of complex PDEs. Although this question remains hard to answer, in this section we show how a simple extension to standard fully-connected networks can lead to a novel architecture that appears to perform uniformly and significantly better across all the benchmark studies considered in this work.

Inspired by neural attention mechanisms recently employed for computer vision and natural language processing tasks [45], here we present a novel neural network architecture with the following characteristics: (i) it explicitly accounts for multiplicative interactions between different input dimensions, and (ii) it enhances the hidden states with residual connections. As shown in figure 9, the key extension to conventional fully-connected architectures is the introduction of two transformer networks that project the input variables to a high-dimensional feature space, and then use a point-wise multiplication operation to update the hidden layers according to the following forward propagation rule:

U = \phi(X W^1 + b^1), \quad V = \phi(X W^2 + b^2),    (43)
H^{(1)} = \phi(X W^{z,1} + b^{z,1}),    (44)
Z^{(k)} = \phi(H^{(k)} W^{z,k} + b^{z,k}), \quad k = 1, \dots, L,    (45)
H^{(k+1)} = (1 - Z^{(k)}) \odot U + Z^{(k)} \odot V, \quad k = 1, \dots, L,    (46)
f_\theta(x) = H^{(L+1)} W + b,    (47)

where X denotes the (n × d) design matrix of the input data-points and ⊙ denotes element-wise multiplication.
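The forward pass of equations 43-47 can be written compactly as follows. This is our own illustrative Python/JAX sketch with hypothetical parameter names, using one dimensionally consistent reading of the recursion (the first transformer layer maps the inputs into the M-dimensional feature space, and the subsequent layers act on the hidden state).

```python
# Hypothetical sketch of the improved fully-connected architecture of equations (43)-(47).
import jax.numpy as jnp

def improved_forward(params, X):
    # params: dict with encoder layers (W1, b1), (W2, b2), a list of hidden layers
    # [(Wz_k, bz_k)], and an output layer (W, b). X has shape (n, d).
    phi = jnp.tanh
    U = phi(X @ params["W1"] + params["b1"])           # equation (43)
    V = phi(X @ params["W2"] + params["b2"])
    Wz0, bz0 = params["hidden"][0]
    H = phi(X @ Wz0 + bz0)                             # equation (44)
    for Wz, bz in params["hidden"][1:]:
        Z = phi(H @ Wz + bz)                           # equation (45)
        H = (1.0 - Z) * U + Z * V                      # equation (46), element-wise
    return H @ params["W"] + params["b"]               # equation (47)
```

Apart from the two encoder layers producing U and V, the parameter count is essentially that of a standard fully-connected network of the same depth and width.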
The parameters of this model are essentially the same as in a standard fully-connected architecture, with the addition of the weights and biases used by the two transformer networks, i.e.,

\theta = \{W^1, b^1, W^2, b^2, (W^{z,l}, b^{z,l})_{l=1}^{L}, W, b\}.    (48)

Finally, the number of units in each layer is M and φ : R^M → R^M is a nonlinear activation function. Here we should also note that the proposed architecture, and its associated forward pass, induce a relatively small computational and memory overhead while leading to significant improvements in predictive accuracy.

Figure 9: An improved fully-connected architecture for physics-informed neural networks: Introducing residual connections and accounting for multiplicative interactions between the inputs can lead to improved predictive performance.

Figure 10: Helmholtz equation: Exact solution versus the prediction of a physics-informed neural network model with the proposed improved architecture (4 hidden layers, 50 neurons each) after 40,000 iterations of training using the proposed learning rate annealing algorithm (relative L² error: 3.69e-03).

A first indication of the improved performance brought by the proposed neural architecture can be seen by revisiting the Helmholtz benchmark discussed in section 2.2. Figure 10 summarizes the prediction of the proposed fully-connected architecture with a depth of 4 layers for a physics-informed neural network model trained using algorithm 1 for 40,000 gradient descent iterations. Evidently, the proposed training scheme is able to properly balance the interplay between the boundary and the residual loss, and improves the relative prediction error by almost two orders of magnitude compared to the original formulation of Raissi et al. [12]. In particular, by comparing figures 1 and 10, notice how the absolute point-wise error at the domain boundaries has been effectively diminished.

Perhaps more interesting is the fact that the choice of the neural architecture also has an influence on the stiffness of the gradient flow dynamics. To illustrate this point, we report the largest eigenvalue of the Hessian matrix σ_max(∇²_θ L(θ)) for the Helmholtz benchmark discussed in section 2.2 using a stiff parameter setting corresponding to a_1 = 1, a_2 = 4. Specifically, in figure 11 we compare the evolution of σ_max(∇²_θ L(θ)) during model training for the conventional dense, 4-layer deep architecture employed in section 2.2 (denoted as model M1), versus the proposed neural architecture, also with a depth of 4 layers (denoted as model M3). Evidently, the proposed architecture yields a roughly 3x decrease in the magnitude of the leading Hessian eigenvalue, suggesting that tweaks in the neural network architecture could potentially stabilize and accelerate the training of physics-informed neural networks; a direction that will be explored in future work. Moreover, as reported in section 3, the combination of this architecture and the learning rate annealing scheme put forth in section 2.5 consistently improves the accuracy of physics-informed neural networks by a factor of 50-100x across a range of benchmark problems in computational physics.

In this section we provide a collection of comprehensive numerical studies that aim to assess the performance of the proposed methodologies against the current state-of-the-art in using fully-connected deep neural network models for
inferring the solution of PDEs. Throughout all benchmarks we employ and compare the performance of four different approaches, as summarized in table 1. In all cases we employ hyperbolic tangent activation functions, a mini-batch size of 128 data-points, and train the networks using single precision arithmetic via stochastic gradient descent using the Adam optimizer with default settings [27]. Moreover, all networks are initialized using the Glorot scheme [28], and no additional regularization techniques are employed (e.g., dropout, L¹/L² penalties, etc.). Models M2 and M4 use the proposed learning rate annealing algorithm and update their loss weights λ_i every 10 iterations of stochastic gradient descent (see equations 40 and 41). All algorithms are implemented in Tensorflow [46] and the reported run-times are obtained on a Lenovo X1 Carbon ThinkPad laptop with an Intel Core i5 2.3GHz processor and 8Gb of RAM memory. The code and data accompanying this manuscript are publicly available at https://github.com/PredictiveIntelligenceLab/GradientPathologiesPINNs.

Figure 11: An improved fully-connected architecture for physics-informed neural networks: Largest eigenvalue of the Hessian ∇²_θ L(θ) during the training of two different physics-informed neural network models for approximating the solution to the two-dimensional Helmholtz problem (see equation 8) with a_1 = 1, a_2 = 4. Model M1 is a conventional dense, 4-layer deep architecture, while model M3 corresponds to the proposed improved architecture, also with a depth of 4 layers.

Model | Description
M1 | Original formulation of physics-informed neural networks as put forth in Raissi et al. [12]
M2 | Physics-informed neural networks with the proposed learning rate annealing algorithm (see section 2.5)
M3 | Physics-informed neural networks with the proposed improved fully-connected architecture (see section 2.6)
M4 | Physics-informed neural networks combining the proposed learning rate annealing algorithm and fully-connected architecture

Table 1: Models considered in the numerical studies presented in section 3.

First, let us revisit the Helmholtz equation benchmark described in section 2.2 (see equation 8). This equation is closely related to many problems in natural and engineering sciences, such as wave propagation in acoustic, elastic and electromagnetic media [47]. It also routinely arises in conventional numerical discretization methods for PDEs such as finite element and spectral methods [48, 49]. Our goal here is to use this canonical benchmark problem to systematically analyze the performance of the different models listed in table 1, and quantify their predictive accuracy for different choices of the underlying neural network architecture. Specifically, we have varied the number of hidden layers and the number of neurons per layer that define each model's architecture and report the resulting relative L² error for the predicted solution after 40,000 Adam iterations.

Table 2 summarizes our results, averaged over 10 independent trials with randomized weight initializations using the Glorot scheme [28]. Evidently, the original formulation of Raissi et al. [12] (model M1) is very sensitive to the choice
of neural architecture and fails to attain a stable prediction accuracy, yielding relative L² errors ranging between 8.14e-02 and 3.10e-01. In contrast, the proposed models (models M2-M4) appear to be very robust with respect to the network architecture and show a consistent trend of improving prediction accuracy as the number of hidden layers and neural units is increased. We also observe that, by introducing a small number of additional weights and biases, the improved fully-connected architecture corresponding to model M3 consistently yields better prediction accuracy than the conventional fully-connected architectures used in the original work of Raissi et al. [12] (model M1). This indicates that the proposed architecture seems to have a better ability to represent complicated functions than conventional fully-connected neural networks, and it may also lead to more stable gradient descent dynamics, as discussed in section 2.6. Last but not least, the combination of the proposed improved architecture with the adaptive learning rate algorithm of section 2.5 (model M4) uniformly leads to the most accurate results we have obtained for this problem (relative L² errors ranging between 1.49e-03 and 1.07e-02).

Architecture | M1 | M2 | M3 | M4
30 units / 3 hidden layers | 2.44e-01 | 5.39e-02 | 5.31e-02 | 1.07e-02
50 units / 3 hidden layers | 1.06e-01 | 1.12e-02 | 2.46e-02 | 3.87e-03
100 units / 3 hidden layers | 9.07e-02 | 4.84e-03 | 1.17e-02 | 2.71e-03
30 units / 5 hidden layers | 2.47e-01 | 2.74e-02 | 4.12e-02 | 3.49e-03
50 units / 5 hidden layers | 1.40e-01 | 7.43e-03 | 1.97e-02 | 2.54e-03
100 units / 5 hidden layers | 1.15e-01 | 5.37e-03 | 1.08e-02 | 1.63e-03
30 units / 7 hidden layers | 3.10e-01 | 2.67e-02 | 3.17e-02 | 3.59e-03
50 units / 7 hidden layers | 1.98e-01 | 7.33e-03 | 2.37e-02 | 2.04e-03
100 units / 7 hidden layers | 8.14e-02 | 4.52e-03 | 9.36e-03 | 1.49e-03

Table 2: Helmholtz equation: Relative L² error between the predicted and the exact solution u(x, y) for the different methods summarized in table 1, and for different neural architectures obtained by varying the number of hidden layers and the number of neurons per layer.

To emphasize the ability of the proposed methods to handle nonlinear and time-dependent problems, leading to composite loss functions with multiple competing objectives, let us consider the time-dependent Klein-Gordon equation. The Klein-Gordon equation is a non-linear equation that plays a significant role in many scientific applications such as solid-state physics, nonlinear optics, and quantum physics [50]. The initial-boundary value problem in one spatial dimension takes the form

u_{tt} + \alpha u_{xx} + \beta u + \gamma u^k = f(x, t), \quad (x, t) \in \Omega \times [0, T],    (49)
u(x, 0) = g_1(x), \quad x \in \Omega,    (50)
u_t(x, 0) = g_2(x), \quad x \in \Omega,    (51)
u(x, t) = h(x, t), \quad (x, t) \in \partial\Omega \times [0, T],    (52)

where α, β, γ and k are known constants (k = 2 corresponds to a quadratic non-linearity and k = 3 to a cubic non-linearity). The functions f(x, t), g_1(x), g_2(x) and h(x, t) are considered to be known, and the function u(x, t) represents the latent solution to this problem. Here we choose Ω = [0, 1], T = 1, α = -1, β = 0, γ = 1, k = 3, and initial conditions g_1(x) = x and g_2(x) = 0 for all x ∈ Ω, as dictated by the fabricated solution below. To assess the accuracy of our models we use a fabricated solution

u(x, t) = x\cos(5\pi t) + (xt)^3,    (53)

and correspondingly derive a consistent forcing term f(x, t) using equation 49, while the Dirichlet boundary condition term h(x, t) imposed on the spatial domain boundary ∂Ω is directly extracted from the fabricated solution of equation 53.

A physics-informed neural network model f_θ(x, t) can now be trained to approximate the latent solution u(x, t) by formulating the composite loss

\mathcal{L}(\theta) = \mathcal{L}_r(\theta) + \lambda_{u_b} \mathcal{L}_{u_b}(\theta) + \lambda_{u_0} \mathcal{L}_{u_0}(\theta),    (54)

where L_r(θ), L_{u_b}(θ), L_{u_0}(θ) are defined in equations 5, 6, and 7. Notice that if λ_{u_b} = λ_{u_0} = 1, the loss function above gives rise to the original physics-informed neural network formulation of Raissi et al. [12] (model M1).
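For completeness, a minimal sketch of the Klein-Gordon residual entering L_r(θ) is given below, with the constants used in this example (α = -1, β = 0, γ = 1, k = 3). It is our own illustrative Python/JAX sketch; f_theta can be any scalar network u = f_theta(params, x, t), such as the MLP sketched earlier, and the forcing term is assumed to be supplied as a callable.

```python
# Hypothetical sketch of the Klein-Gordon residual for equation (49):
# r = u_tt + alpha*u_xx + beta*u + gamma*u^k - f(x, t).
import jax

def kg_residual(params, x, t, f_theta, forcing,
                alpha=-1.0, beta=0.0, gamma=1.0, k=3):
    u = f_theta(params, x, t)
    u_tt = jax.grad(jax.grad(f_theta, argnums=2), argnums=2)(params, x, t)
    u_xx = jax.grad(jax.grad(f_theta, argnums=1), argnums=1)(params, x, t)
    return u_tt + alpha * u_xx + beta * u + gamma * u**k - forcing(x, t)
```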
Given this problem setup, let us now investigate the distribution of back-propagated gradients during the training of a 5-layer deep fully-connected network with 50 neurons per layer using the original formulation of Raissi et al. [12] (model M1). The results summarized in figure 12 indicate that ∇_θ L_{u_b} and ∇_θ L_{u_0} accumulate at the origin, around which they form two sharp peaks, while ∇_θ L_r remains flat. This implies that the gradients of L_{u_b} and L_{u_0} nearly vanish in comparison to the gradients of L_r. This is a clear manifestation of the imbalanced gradient pathology described in section 2.2, which we believe is the leading mode of failure for conventional physics-informed neural network models. Consequently, model M1 fails to accurately fit the initial and boundary condition data and therefore, as expected, yields a poor prediction for the target solution u(x, t), with a relative L² error of 1.79e-01. From our experience, this behavior is extremely common when using conventional physics-informed neural network models [12] for solving PDE systems with sufficiently complex dynamics leading to solutions with non-trivial behavior (e.g., directional anisotropy, multi-scale features, etc.).

Figure 12: Klein-Gordon equation: Histograms of back-propagated gradients for model M1 after 40,000 iterations of stochastic gradient descent using the Adam optimizer.

It is now natural to ask whether the proposed methods can effectively mitigate this gradient imbalance pathology and lead to accurate and robust predictions. To this end, in figure 13 we present the distribution of the back-propagated gradients ∇_θ L_r, ∇_θ L_{u_b}, ∇_θ L_{u_0} at each hidden layer of the same neural architecture used above, albeit using the proposed learning rate annealing algorithm (model M2) for adaptively choosing the constants λ_{u_b} and λ_{u_0} during model training. We observe that none of the gradient distributions is sharply peaked around zero, indicating that a healthy gradient signal is back-propagated through the network from all the contributing terms in the loss function.

A comparative study of the performance of all models (M1-M4) is summarized in table 3. Notice that our goal here is not to perform an exhaustive hyper-parameter search to find the best performing model; rather, it is to present indicative results for the performance of each method (models M1-M4). To do so, we consider a fixed neural architecture with 5 hidden layers and 50 neurons per layer and report the relative L² prediction error of each method after 40,000 iterations of training, along with the total wall-clock time required to complete each simulation. It is clear that the proposed methodologies consistently outperform the original implementation of Raissi et al. [12], while the combined use of the proposed learning rate annealing algorithm with the improved fully-connected architecture (model M4) leads to the best results, albeit at a higher computational cost.

Figures 14 and 15 provide a more detailed visual assessment of the predictive accuracy of models M1 and M4. As expected, in figure 14 model M1 fails to obtain accurate predictive solutions, with the absolute error growing as time t increases. In figure 15, we present the solution obtained by model M4, which combines the proposed learning rate annealing algorithm (see section 2.5) and the improved neural architecture put forth in section 2.6. It can be observed that the absolute error remains small over the entire domain, and the relative L² prediction error is 2.22e-03, which is about two orders of magnitude lower than the one obtained using model M1, and one order of magnitude lower than the predictive errors of models M2 and M3. Finally, figure 16 depicts the evolution of the constants λ_{u_b} and λ_{u_0} used to scale the boundary and initial conditions loss, L_{u_b}(θ) and L_{u_0}(θ), respectively (see equation 54), during the training of model M2 using algorithm 1.
Figure 13: Klein-Gordon equation: Histograms of back-propagated gradients for model M2 after 40,000 iterations of stochastic gradient descent using the Adam optimizer.

Method | M1 | M2 | M3 | M4
Relative L² error | 1.79e-01 | 1.06e-02 | 1.97e-02 | 2.81e-03
Training time (sec) | 200 | 200 | 760 | 1040

Table 3: Klein-Gordon equation: Relative L² error between the predicted and the exact solution u(t, x) for the different methods summarized in table 1. Here the network architecture is fixed to 5 fully-connected layers with 50 neurons per layer. The reported CPU time corresponds to the total wall-clock time required to train each network for 40,000 iterations of stochastic gradient descent on a Lenovo X1 Carbon ThinkPad laptop with an Intel Core i5 2.3GHz processor and 8Gb of RAM memory.

In our last example, we present a study that highlights a few important remarks on how a problem's formulation can be as crucial as the algorithm used to solve it. To this end, let us consider a classical benchmark problem in computational fluid dynamics, the steady-state flow in a two-dimensional lid-driven cavity. The system is governed by the incompressible Navier-Stokes equations, which can be written in non-dimensional form as

u \cdot \nabla u + \nabla p - \frac{1}{Re}\Delta u = 0 \quad \text{in} \ \Omega,    (55)
\nabla \cdot u = 0 \quad \text{in} \ \Omega,    (56)
u(x) = (1, 0) \quad \text{on} \ \Gamma_1,    (57)
u(x) = (0, 0) \quad \text{on} \ \Gamma_0,    (58)

where u(x) = (u(x), v(x)) is the velocity vector field, p is a scalar pressure field, and x = (x, y) ∈ Ω = (0, 1) × (0, 1), with Ω being a two-dimensional square cavity. Moreover, Γ_1 is the top boundary of the cavity, Γ_0 denotes the other three sides, and Re is the Reynolds number of the flow. In this example, we have chosen a relatively simple case corresponding to a Reynolds number of Re = 100, for which it is well understood that the flow quickly converges to an incompressible steady-state solution [51]. Our goal here is to train a physics-informed neural network to predict the latent velocity and pressure fields.
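Looking ahead to the velocity-pressure formulation detailed in the next subsection (equations 59-67), a minimal sketch of how the momentum and continuity residuals of equation 55 could be evaluated with automatic differentiation is shown below. This is our own illustrative Python/JAX sketch; net(params, x, y) is an assumed network returning the three outputs (u, v, p) at a single point.

```python
# Hypothetical sketch of the momentum and continuity residuals for equation (55),
# evaluated at a single collocation point (x, y).
import jax

def ns_residuals(params, x, y, net, Re=100.0):
    u_fn = lambda x, y: net(params, x, y)[0]
    v_fn = lambda x, y: net(params, x, y)[1]
    p_fn = lambda x, y: net(params, x, y)[2]

    u, v = u_fn(x, y), v_fn(x, y)
    u_x, u_y = jax.grad(u_fn, 0)(x, y), jax.grad(u_fn, 1)(x, y)
    v_x, v_y = jax.grad(v_fn, 0)(x, y), jax.grad(v_fn, 1)(x, y)
    p_x, p_y = jax.grad(p_fn, 0)(x, y), jax.grad(p_fn, 1)(x, y)
    u_xx = jax.grad(jax.grad(u_fn, 0), 0)(x, y)
    u_yy = jax.grad(jax.grad(u_fn, 1), 1)(x, y)
    v_xx = jax.grad(jax.grad(v_fn, 0), 0)(x, y)
    v_yy = jax.grad(jax.grad(v_fn, 1), 1)(x, y)

    r_u = u * u_x + v * u_y + p_x - (u_xx + u_yy) / Re   # x-momentum residual
    r_v = u * v_x + v * v_y + p_y - (v_xx + v_yy) / Re   # y-momentum residual
    r_c = u_x + v_y                                      # continuity residual
    return r_u, r_v, r_c
```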
Figure 14:
Klein-Gordon equation:
Predicted solution of model M1, a conventional physics-informed neural network [12] with 5 hidden layers and 50 neurons in each layer. The relative $L^2$ error of the predicted solution is 1.79e-01 after 40,000 iterations of stochastic gradient descent using the Adam optimizer.
Figure 15:
Klein-Gordon equation:
Predicted solution of model M4, a physics-informed neural network [12] with 5 hidden layers and 50 neurons in each layer, trained using the proposed learning rate annealing algorithm (see section 2.5) and the improved architecture described in section 2.6. The relative $L^2$ error of the predicted solution is 2.22e-03 after 40,000 iterations of stochastic gradient descent using the Adam optimizer.
Figure 16:
Klein-Gordon equation:
Evolution of the constants $\lambda_{u_b}$ and $\lambda_{u_0}$ used to scale the boundary and initial condition losses, $\mathcal{L}_{u_b}(\theta)$ and $\mathcal{L}_{u_0}(\theta)$, respectively (see equation 54), during the training of model M2 using algorithm 1.
A straightforward application of physics-informed neural networks for approximating the solution of equation 55 would introduce a neural network representation with parameters $\theta$ that aims to learn how to map spatial coordinates $(x, y)$ to the latent solution functions $u(x,y)$, $v(x,y)$ and $p(x,y)$, i.e.,
$$[x, y] \stackrel{f_\theta}{\longmapsto} [u(x,y),\, v(x,y),\, p(x,y)].$$
Using this representation, one can then introduce the residuals of the momentum and continuity equations as
$$r^u_\theta(x,y) := u \frac{\partial u}{\partial x} + v \frac{\partial u}{\partial y} + \frac{\partial p}{\partial x} - \frac{1}{Re}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right), \qquad (59)$$
$$r^v_\theta(x,y) := u \frac{\partial v}{\partial x} + v \frac{\partial v}{\partial y} + \frac{\partial p}{\partial y} - \frac{1}{Re}\left(\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}\right), \qquad (60)$$
$$r^c_\theta(x,y) := \frac{\partial u}{\partial x} + \frac{\partial v}{\partial y}, \qquad (61)$$
where $(u, v, p) = f_\theta(x, y)$ are the outputs of the chosen neural network representation, and all required gradients can be readily computed using automatic differentiation [18]. Given these residuals, along with a set of appropriate boundary conditions for equation 55, we can now formulate a loss function for training a physics-informed neural network as
$$\mathcal{L}(\theta) = \mathcal{L}_{r^u}(\theta) + \mathcal{L}_{r^v}(\theta) + \mathcal{L}_{r^c}(\theta) + \mathcal{L}_{u_b}(\theta) + \mathcal{L}_{v_b}(\theta), \qquad (62)$$
with
$$\mathcal{L}_{r^u}(\theta) = \frac{1}{N_r}\sum_{i=1}^{N_r}\left[r^u_\theta(x_r^i, y_r^i)\right]^2, \qquad (63)$$
$$\mathcal{L}_{r^v}(\theta) = \frac{1}{N_r}\sum_{i=1}^{N_r}\left[r^v_\theta(x_r^i, y_r^i)\right]^2, \qquad (64)$$
$$\mathcal{L}_{r^c}(\theta) = \frac{1}{N_r}\sum_{i=1}^{N_r}\left[r^c_\theta(x_r^i, y_r^i)\right]^2, \qquad (65)$$
$$\mathcal{L}_{u_b}(\theta) = \frac{1}{N_b}\sum_{i=1}^{N_b}\left[u(x_b^i, y_b^i) - u_b^i\right]^2, \qquad (66)$$
$$\mathcal{L}_{v_b}(\theta) = \frac{1}{N_b}\sum_{i=1}^{N_b}\left[v(x_b^i, y_b^i) - v_b^i\right]^2, \qquad (67)$$
where $\{(x_r^i, y_r^i)\}_{i=1}^{N_r}$ is a set of collocation points at which we aim to minimize the PDE residual, while $\{(x_b^i, y_b^i), u_b^i\}_{i=1}^{N_b}$ and $\{(x_b^i, y_b^i), v_b^i\}_{i=1}^{N_b}$ denote the boundary data for the two velocity components on the domain boundaries $\Gamma_1$ and $\Gamma_0$. Taken together, these terms aim to constrain the neural network approximation to satisfy the Navier-Stokes system and the boundary conditions prescribed in equation 55. In practice, this composite loss is minimized using stochastic gradient descent, where a different set of collocation and boundary points is randomly sampled at each gradient descent iteration.

Here, our goal is to learn the two velocity components $u(x,y)$ and $v(x,y)$, as well as the latent pressure field $p(x,y)$, by training a 5-layer deep neural network with the mean squared error loss of equation 62. Each hidden layer contains 50 neurons and a hyperbolic tangent activation function. Figure 17 summarizes the results of this experiment, showing the magnitude of the velocity $|\mathbf{u}(\mathbf{x})| = \sqrt{u^2(\mathbf{x}) + v^2(\mathbf{x})}$ predicted by each of the models M1-M4 after 40,000 stochastic gradient descent updates using the Adam optimizer [27]. It is evident that none of the models is capable of producing a reasonable approximation. We postulate that this failure is related to the residual loss used to train the models. Specifically, it is inevitable that solutions inferred by minimizing equation 62 will not exactly satisfy the
prescribed boundary conditions, nor the incompressibility constraint, potentially giving rise to issues of non-uniqueness. This fundamental limitation highlights the importance of adopting a consistent problem formulation, leading to effective objectives for training physics-informed neural networks. As we will see in the next section, representing the latent velocity field in terms of the derivatives of a scalar potential can guarantee incompressible solutions and lead to stable training dynamics, as well as accurate predictions.
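Before moving on, the snippet below illustrates how the residuals of equations 59-61 and the residual part of the loss in equation 62 could be assembled in practice with automatic differentiation; the network construction, the nested gradient tapes, and all variable names are assumptions of this sketch rather than the exact code used to train models M1-M4.

```python
import tensorflow as tf

Re = 100.0  # Reynolds number used in the experiment above

def navier_stokes_residuals(f_theta, x, y):
    """Momentum and continuity residuals (equations 59-61) for a network (x, y) -> (u, v, p)."""
    with tf.GradientTape(persistent=True) as t2:
        t2.watch([x, y])
        with tf.GradientTape(persistent=True) as t1:
            t1.watch([x, y])
            u, v, p = tf.unstack(f_theta(tf.stack([x, y], axis=1)), axis=1)
        u_x, u_y = t1.gradient(u, x), t1.gradient(u, y)
        v_x, v_y = t1.gradient(v, x), t1.gradient(v, y)
        p_x, p_y = t1.gradient(p, x), t1.gradient(p, y)
    u_xx, u_yy = t2.gradient(u_x, x), t2.gradient(u_y, y)
    v_xx, v_yy = t2.gradient(v_x, x), t2.gradient(v_y, y)
    r_u = u * u_x + v * u_y + p_x - (u_xx + u_yy) / Re   # x-momentum residual (59)
    r_v = u * v_x + v * v_y + p_y - (v_xx + v_yy) / Re   # y-momentum residual (60)
    r_c = u_x + v_y                                      # continuity residual (61)
    return r_u, r_v, r_c

# A small tanh network and a batch of collocation points, for illustration only.
f_theta = tf.keras.Sequential(
    [tf.keras.layers.Dense(50, activation="tanh") for _ in range(5)]
    + [tf.keras.layers.Dense(3)]
)
x_r = tf.random.uniform([128])
y_r = tf.random.uniform([128])
r_u, r_v, r_c = navier_stokes_residuals(f_theta, x_r, y_r)
loss_r = tf.reduce_mean(r_u**2) + tf.reduce_mean(r_v**2) + tf.reduce_mean(r_c**2)
```

In a full training loop, the boundary terms of equations 66 and 67 would be added to this residual loss and the total minimized with the Adam optimizer, as described above.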
Instead of directly representing the two velocity components $u(x,y)$ and $v(x,y)$, one can use a neural network to represent a scalar potential (stream) function $\psi(x,y)$, such that an incompressible velocity field can be directly computed as $(u, v) = (\partial\psi/\partial y, -\partial\psi/\partial x)$. Hence, we can introduce a neural network representation with parameters $\theta$ that aims to learn how to map spatial coordinates $(x, y)$ to the latent scalar functions $\psi(x,y)$ and $p(x,y)$, i.e.,
$$[x, y] \stackrel{f_\theta}{\longmapsto} [\psi(x,y),\, p(x,y)].$$
Then we only need to consider the residuals of the momentum equations, as the incompressibility constraint is now exactly satisfied, i.e.,
$$r^u_\theta(x,y) := u \frac{\partial u}{\partial x} + v \frac{\partial u}{\partial y} + \frac{\partial p}{\partial x} - \frac{1}{Re}\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right), \qquad (68)$$
$$r^v_\theta(x,y) := u \frac{\partial v}{\partial x} + v \frac{\partial v}{\partial y} + \frac{\partial p}{\partial y} - \frac{1}{Re}\left(\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}\right), \qquad (69)$$
where $(u, v) = (\partial\psi/\partial y, -\partial\psi/\partial x)$ and $(\psi, p) = f_\theta(x, y)$ are the outputs of the chosen neural network representation $f_\theta$, and all required gradients are computed using automatic differentiation [18]. Although we are now able to exactly satisfy the incompressibility constraint, notice that this approach incurs a larger computational cost, since third-order derivatives of the neural network need to be computed. Despite this additional cost, our experience indicates that this modification is crucial for obtaining accurate and unique solutions to equation 55 by minimizing a physics-informed loss of the form
$$\mathcal{L}(\theta) = \mathcal{L}_{r^u}(\theta) + \mathcal{L}_{r^v}(\theta) + \mathcal{L}_{u_b}(\theta) + \mathcal{L}_{v_b}(\theta), \qquad (70)$$
where
$$\mathcal{L}_{r^u}(\theta) = \frac{1}{N_r}\sum_{i=1}^{N_r}\left[r^u_\theta(x_r^i, y_r^i)\right]^2, \qquad (71)$$
$$\mathcal{L}_{r^v}(\theta) = \frac{1}{N_r}\sum_{i=1}^{N_r}\left[r^v_\theta(x_r^i, y_r^i)\right]^2, \qquad (72)$$
$$\mathcal{L}_{u_b}(\theta) = \frac{1}{N_b}\sum_{i=1}^{N_b}\left[u(x_b^i, y_b^i) - u_b^i\right]^2, \qquad (73)$$
$$\mathcal{L}_{v_b}(\theta) = \frac{1}{N_b}\sum_{i=1}^{N_b}\left[v(x_b^i, y_b^i) - v_b^i\right]^2, \qquad (74)$$
where $\{(x_r^i, y_r^i)\}_{i=1}^{N_r}$ is a set of collocation points at which we aim to minimize the residuals of equations 68 and 69, while $\{(x_b^i, y_b^i), u_b^i\}_{i=1}^{N_b}$ and $\{(x_b^i, y_b^i), v_b^i\}_{i=1}^{N_b}$ denote the boundary data for the two velocity components on the domain boundaries $\Gamma_1$ and $\Gamma_0$.

Similarly to the previous case, we still aim to learn the velocity components $u(x,y)$ and $v(x,y)$, as well as the latent pressure field $p(x,y)$. Now the velocity components are obtained by differentiating the stream function $\psi(x,y)$ with respect to the $x$ and $y$ coordinates using automatic differentiation [18]. We chose to jointly represent the latent functions $[\psi(x,y), p(x,y)]$ using a 5-layer deep neural network with 50 neurons per hidden layer and a hyperbolic tangent activation function, trained using the mean squared error loss of equation 70. The results of this experiment are summarized in figure 18. In particular, we present a comparison between the reference and the predicted solutions obtained using the different models M1-M4. It can be observed that the velocity fields obtained using models M2-M4 are in good agreement with the reference solution, while model M1 is not able to yield a satisfactory prediction. As expected, model M4 yields the best predicted solution, with a relative $L^2$ error of 3.42e-02.
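The change required for the streamfunction-pressure representation is small: the velocities are obtained by differentiating the predicted scalar potential, so the continuity residual vanishes identically by the equality of mixed partial derivatives. A minimal sketch, under the same assumptions as the previous snippet:

```python
import tensorflow as tf

def velocity_from_streamfunction(f_theta, x, y):
    """(u, v) = (dpsi/dy, -dpsi/dx), so du/dx + dv/dy = 0 holds by construction."""
    with tf.GradientTape(persistent=True) as tape:
        tape.watch([x, y])
        psi, p = tf.unstack(f_theta(tf.stack([x, y], axis=1)), axis=1)
    u = tape.gradient(psi, y)
    v = -tape.gradient(psi, x)
    return u, v, p
```

The momentum residuals of equations 68 and 69 are then formed exactly as in the previous sketch, at the price of the third-order derivatives of $\psi$ mentioned above.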
(a) Model M1 (relative $L^2$ prediction error: 7.84e-01). (b) Model M2 (relative $L^2$ prediction error: 7.48e-01). (c) Model M3 (relative $L^2$ prediction error: 7.48e-01). (d) Model M4 (relative $L^2$ prediction error: 7.53e-01).
Figure 17:
Flow in a lid-driven cavity, velocity-pressure representation:
Reference solution using a conventional finite difference solver, prediction of a 5-layer deep physics-informed neural network (models M1-M4, top to bottom), and absolute point-wise error.
(a) Model M1 (relative $L^2$ prediction error: 2.71e-01). (b) Model M2 (relative $L^2$ prediction error: 2.49e-01). (c) Model M3 (relative $L^2$ prediction error: 1.21e-01). (d) Model M4 (relative $L^2$ prediction error: 3.42e-02).
Figure 18:
Flow in a lid-driven cavity, streamfunction-pressure representation:
Reference solution using a conventional finite difference solver, prediction of a 5-layer deep physics-informed neural network (models M1-M4, top to bottom), and absolute point-wise error.
Despite recent success across a range of applications [12, 43, 25, 24, 26, 19], physics-informed neural networks (PINNs) often struggle to accurately approximate the solution of partial differential equations. In this work we identify and analyze a fundamental mode of failure of physics-informed neural networks related to stiff gradient flow dynamics that lead to unbalanced gradients during model training via back-propagation. To provide further insight, we quantify and analyze the stiffness of the gradient flow dynamics and elucidate the difficulties of training PINNs via gradient descent. This stiffness analysis not only plays a key role in understanding the gradient flow dynamics, but also motivates future research directions that consider more stable discretizations of the gradient flow system. We mitigate this shortcoming by proposing a learning rate annealing algorithm that utilizes gradient statistics during model training to adaptively assign appropriate weights to the different terms in a PINNs loss function. The new algorithm addresses the unbalanced gradient pathology and effectively leads to noticeable improvements in predictive accuracy. This development is not limited to PINNs, and can be straightforwardly generalized to other tasks involving the interplay of multiple objective functions that may suffer from unbalanced gradients, e.g., multi-task learning [52]. Finally, we propose a novel neural network architecture that can enhance the accuracy and robustness of PINNs by decreasing the stiffness of the gradient flow, indicating that specialized architectures can play a prominent role in the future success of PINN models. Taken together, our developments provide new insights into the training of physics-informed neural networks and consistently improve their predictive accuracy by a factor of 50-100x across a range of problems in computational physics.

Despite recent progress, we have to admit that we are still at the very early stages of rigorously understanding the capabilities and limitations of physics-informed neural networks. To bridge this gap we need to answer a series of open questions: What is the relation between stiffness in a given PDE and stiffness in the gradient flow dynamics of the associated PINN model? Can stiffness in the gradient flow dynamics be reduced (e.g., using domain decomposition techniques, different choices of loss functions, more effective neural architectures, etc.)? If stiffness turns out to be an inherent property of PINNs, what else can we do to enhance the robustness of their training and the accuracy of their predictions? Can we devise more stable and effective optimization algorithms to train PINN models with stiff gradient flow dynamics? How does stiffness affect the approximation and generalization errors of PINNs? Addressing these challenges calls for a fruitful synergy between deep learning, optimization, numerical analysis and dynamical systems theory that has the potential to crystallize new exciting developments in computational science and engineering.
Acknowledgements
This work received support from the US Department of Energy under the Advanced Scientific Computing Research program (grant DE-SC0019116) and the Defense Advanced Research Projects Agency under the Physics of Artificial Intelligence program (grant HR00111890034).
References [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neuralnetworks. In
Advances in neural information processing systems , pages 1097–1105, 2012.[2] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk,and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 , 2014.[3] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificitiesof dna-and rna-binding proteins by deep learning.
Nature biotechnology , 33(8):831, 2015.[4] Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics withdeep learning.
Nature communications , 5:4308, 2014.[5] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, ThomasHubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature , 550(7676):354–359, 2017.[6] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, KateSaenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description.In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
[7] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[8] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series.
The handbook ofbrain theory and neural networks , 3361(10):1995, 1995.[9] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Deep learning . MIT press, 2016.[10] Stéphane Mallat. Understanding deep convolutional networks.
Philosophical Transactions of the Royal Society A:Mathematical, Physical and Engineering Sciences , 374(2065):20150203, 2016.[11] Risi Kondor, Hy Truong Son, Horace Pan, Brandon Anderson, and Shubhendu Trivedi. Covariant compositionalnetworks for learning graphs. arXiv preprint arXiv:1801.02144 , 2018.[12] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learningframework for solving forward and inverse problems involving nonlinear partial differential equations.
Journal ofComputational Physics , 378:686–707, 2019.[13] Yibo Yang and Paris Perdikaris. Adversarial uncertainty quantification in physics-informed neural networks.
Journal of Computational Physics , 394:136–152, 2019.[14] Ehsan Kharazmi, Zhongqiang Zhang, and GE Karniadakis. Variational physics-informed neural networks forsolving partial differential equations. arXiv preprint arXiv:1912.00873 , 2019.[15] Alexandre M Tartakovsky, Carlos Ortiz Marrero, D Tartakovsky, and David Barajas-Solano. Learning parametersand constitutive relationships with physics informed deep neural networks. arXiv preprint arXiv:1808.03398 ,2018.[16] Dimitris C Psichogios and Lyle H Ungar. A hybrid neural network-first principles approach to process modeling.
AIChE Journal , 38(10):1499–1511, 1992.[17] Isaac E Lagaris, Aristidis Likas, and Dimitrios I Fotiadis. Artificial neural networks for solving ordinary andpartial differential equations.
IEEE transactions on neural networks , 9(5):987–1000, 1998.[18] Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automaticdifferentiation in machine learning: a survey.
Journal of machine learning research , 18(153), 2018.[19] Maziar Raissi, Alireza Yazdani, and George Em Karniadakis. Hidden fluid mechanics: A navier-stokes informeddeep learning framework for assimilating flow visualization data. arXiv preprint arXiv:1808.04327 , 2018.[20] Maziar Raissi. Deep hidden physics models: Deep learning of nonlinear partial differential equations.
The Journalof Machine Learning Research , 19(1):932–955, 2018.[21] Yinhao Zhu, Nicholas Zabaras, Phaedon-Stelios Koutsourelakis, and Paris Perdikaris. Physics-constrained deeplearning for high-dimensional surrogate modeling and uncertainty quantification without labeled data.
Journal ofComputational Physics , 394:56–81, 2019.[22] Nicholas Geneva and Nicholas Zabaras. Modeling the dynamics of pde systems with physics-constrained deepauto-regressive networks.
Journal of Computational Physics , 403:109056, 2020.[23] Justin Sirignano and Konstantinos Spiliopoulos. Dgm: A deep learning algorithm for solving partial differentialequations.
Journal of Computational Physics , 375:1339–1364, 2018.[24] Rohit K Tripathy and Ilias Bilionis. Deep uq: Learning deep neural network surrogate models for high dimensionaluncertainty quantification.
Journal of Computational Physics , 375:565–588, 2018.[25] Georgios Kissas, Yibo Yang, Eileen Hwuang, Walter R Witschey, John A Detre, and Paris Perdikaris. Machinelearning in cardiovascular flows modeling: Predicting arterial blood pressure from non-invasive 4d flow mri datausing physics-informed neural networks.
Computer Methods in Applied Mechanics and Engineering , 358:112623,2020.[26] Luning Sun, Han Gao, Shaowu Pan, and Jian-Xun Wang. Surrogate modeling for fluid flows based on physics-constrained deep learning without simulation data. arXiv preprint arXiv:1906.02382 , 2019.[27] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 ,2014.[28] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks.In
Proceedings of the thirteenth international conference on artificial intelligence and statistics , pages 249–256,2010.[29] Lawrence C Evans. Partial differential equations.
Providence, RI , 1998.[30] Arieh Iserles.
A first course in the numerical analysis of differential equations. Number 44. Cambridge University Press, 2009.
[31] Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016.
[32] Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In
Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms , pages2703–2717. SIAM, 2018.[33] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. Themechanics of n-player differentiable games. arXiv preprint arXiv:1802.05642 , 2018.[34] Robert A Adams and John JF Fournier.
Sobolev spaces , volume 140. Elsevier, 2003.[35] Gilbert Strang. On the construction and comparison of difference schemes.
SIAM journal on numerical analysis ,5(3):506–517, 1968.[36] Neal Parikh, Stephen Boyd, et al. Proximal algorithms.
Foundations and Trends R (cid:13) in Optimization , 1(3):127–239,2014.[37] Morteza Mardani, Qingyun Sun, David Donoho, Vardan Papyan, Hatef Monajemi, Shreyas Vasanawala, and JohnPauly. Neural proximal gradient descent for compressive imaging. In Advances in Neural Information ProcessingSystems , pages 9573–9583, 2018.[38] Florian Schäfer and Anima Anandkumar. Competitive gradient descent. arXiv preprint arXiv:1905.12103 , 2019.[39] Charles William Gear and DR Wells. Multirate linear multistep methods.
BIT Numerical Mathematics , 24(4):484–502, 1984.[40] Molei Tao, Houman Owhadi, and Jerrold E Marsden. Nonintrusive and structure preserving multiscale integrationof stiff odes, sdes, and hamiltonian systems with hidden slow dynamics via flow averaging.
Multiscale Modeling& Simulation , 8(4):1269–1324, 2010.[41] Dimitri P Bertsekas.
Constrained optimization and Lagrange multiplier methods . Academic press, 2014.[42] Zachary C Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neural networks for sequencelearning. arXiv preprint arXiv:1506.00019 , 2015.[43] AM Tartakovsky, CO Marrero, P Perdikaris, GD Tartakovsky, and D Barajas-Solano. Learning parameters andconstitutive relationships with physics informed deep neural networks, arxiv preprint (2018). arXiv preprintarXiv:1808.03398 .[44] Maziar Raissi, Paris Perdikaris, and George Em Karniadakis. Multistep neural networks for data-driven discoveryof nonlinear dynamical systems. arXiv preprint arXiv:1801.01236 , 2018.[45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser,and Illia Polosukhin. Attention is all you need. In
Advances in neural information processing systems , pages5998–6008, 2017.[46] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, SanjayGhemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In { USENIX } Symposium on Operating Systems Design and Implementation ( { OSDI } , pages 265–283, 2016.[47] Jean-Claude Nédélec. Acoustic and electromagnetic equations: integral representations for harmonic problems .Springer Science & Business Media, 2001.[48] Olgierd Cecil Zienkiewicz, Robert L Taylor, Perumal Nithiarasu, and JZ Zhu.
The finite element method , volume 3.McGraw-hill London, 1977.[49] Claudio Canuto, M Youssuff Hussaini, Alfio Quarteroni, and Thomas A Zang.
Spectral methods . Springer, 2006.[50] James D Bjorken and Sidney D Drell.
Relativistic quantum mechanics . McGraw-Hill, 1965.[51] Charles-Henri Bruneau and Mazen Saad. The 2d lid-driven cavity problem revisited.
Computers & Fluids ,35(3):326–348, 2006.[52] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization foradaptive loss balancing in deep multitask networks. arXiv preprint arXiv:1711.02257 , 2017.[53] Ukng Ghia, Kirti N Ghia, and CT Shin. High-re solutions for incompressible flow using the navier-stokes equationsand a multigrid method.
Journal of computational physics , 48(3):387–411, 1982.[54] Lc Ce Woods. A note on the numerical solution of fourth order differential equations.
The Aeronautical Quarterly, 5(4):176–184, 1954.
Proof.
Recall that the loss function is given by
$$\mathcal{L}(\theta) = \mathcal{L}_r(\theta) + \mathcal{L}_{u_b}(\theta) = \frac{1}{N_b}\sum_{i=1}^{N_b}\left[f_\theta(x_b^i) - h(x_b^i)\right]^2 + \frac{1}{N_r}\sum_{i=1}^{N_r}\left[\frac{\partial^2}{\partial x^2} f_\theta(x_r^i) - g(x_r^i)\right]^2. \qquad (75)$$
Here we fix $\theta \in \Theta$, where $\Theta$ denotes the set of all weights of the neural network. Then, by our assumptions, $\partial \mathcal{L}_{u_b}(\theta)/\partial\theta$ can be bounded as
$$\left|\frac{\partial \mathcal{L}_{u_b}(\theta)}{\partial \theta}\right| = \left|\frac{\partial}{\partial\theta}\left(\frac{1}{N_b}\sum_{i=1}^{N_b}\left(u_\theta(x_b^i) - h(x_b^i)\right)^2\right)\right| = \left|\frac{2}{N_b}\sum_{i=1}^{N_b}\left(u_\theta(x_b^i) - h(x_b^i)\right)\frac{\partial u_\theta(x_b^i)}{\partial\theta}\right|$$
$$= \left|\frac{2}{N_b}\sum_{i=1}^{N_b}\left(u(x_b^i)\,\epsilon_\theta(x_b^i) - u(x_b^i)\right)u(x_b^i)\frac{\partial \epsilon_\theta(x_b^i)}{\partial\theta}\right| = \left|\frac{2}{N_b}\sum_{i=1}^{N_b}u^2(x_b^i)\left(\epsilon_\theta(x_b^i) - 1\right)\frac{\partial \epsilon_\theta(x_b^i)}{\partial\theta}\right| \leq \left\|\frac{\partial \epsilon_\theta(x)}{\partial\theta}\right\|_{L^\infty}\cdot \epsilon.$$
Next, we may rewrite $\mathcal{L}_r$ as
$$\mathcal{L}_r = \frac{1}{N_r}\sum_{i=1}^{N_r}\left|\frac{\partial^2 u_\theta}{\partial x^2}(x_r^i) - \frac{\partial^2 u}{\partial x^2}(x_r^i)\right|^2 \approx \int \left(\frac{\partial^2 u_\theta(x)}{\partial x^2} - \frac{\partial^2 u(x)}{\partial x^2}\right)^2 dx.$$
Then, integrating by parts twice, we have
$$\frac{\partial \mathcal{L}_r}{\partial\theta} = \frac{\partial}{\partial\theta}\int \left(\frac{\partial^2 u_\theta(x)}{\partial x^2} - \frac{\partial^2 u(x)}{\partial x^2}\right)^2 dx = \int \frac{\partial}{\partial\theta}\left(\frac{\partial^2 u_\theta(x)}{\partial x^2} - \frac{\partial^2 u(x)}{\partial x^2}\right)^2 dx$$
$$= 2\int \left(\frac{\partial^2 u_\theta(x)}{\partial x^2} - \frac{\partial^2 u(x)}{\partial x^2}\right)\frac{\partial}{\partial x}\left(\frac{\partial^2 u_\theta(x)}{\partial x\,\partial\theta}\right) dx$$
$$= 2\left[\frac{\partial^2 u_\theta(x)}{\partial x\,\partial\theta}\left(\frac{\partial^2 u_\theta(x)}{\partial x^2} - \frac{\partial^2 u(x)}{\partial x^2}\right)\Big| - \int \frac{\partial^2 u_\theta(x)}{\partial x\,\partial\theta}\left(\frac{\partial^3 u_\theta(x)}{\partial x^3} - \frac{\partial^3 u(x)}{\partial x^3}\right) dx\right]$$
$$= 2\left[\frac{\partial^2 u_\theta(x)}{\partial x\,\partial\theta}\left(\frac{\partial^2 u_\theta(x)}{\partial x^2} - \frac{\partial^2 u(x)}{\partial x^2}\right)\Big| - \frac{\partial u_\theta(x)}{\partial\theta}\left(\frac{\partial^3 u_\theta(x)}{\partial x^3} - \frac{\partial^3 u(x)}{\partial x^3}\right)\Big| + \int \frac{\partial u_\theta(x)}{\partial\theta}\left(\frac{\partial^4 u_\theta(x)}{\partial x^4} - \frac{\partial^4 u(x)}{\partial x^4}\right) dx\right].$$
Note that
$$\left|\frac{\partial^2 u_\theta(x)}{\partial x\,\partial\theta}\right| = \left|\frac{\partial^2 \left(u(x)\epsilon_\theta(x)\right)}{\partial x\,\partial\theta}\right| = \left|\frac{\partial}{\partial\theta}\left(u'(x)\epsilon_\theta(x) + u(x)\epsilon'_\theta(x)\right)\right| = \left|u'(x)\frac{\partial \epsilon_\theta(x)}{\partial\theta} + u(x)\frac{\partial \epsilon'_\theta(x)}{\partial\theta}\right| \leq C\left\|\frac{\partial \epsilon_\theta(x)}{\partial\theta}\right\|_{L^\infty} + \left\|\frac{\partial \epsilon'_\theta(x)}{\partial\theta}\right\|_{L^\infty}.$$
And
$$\left|\frac{\partial^2 u_\theta(x)}{\partial x^2} - \frac{\partial^2 u(x)}{\partial x^2}\right| = \left|\frac{\partial^2 \left(u(x)\epsilon_\theta(x)\right)}{\partial x^2} - \frac{\partial^2 u(x)}{\partial x^2}\right| = \left|u''(x)\epsilon_\theta(x) + 2u(x)\epsilon''_\theta(x) - u''(x)\right| = \left|u''(x)\left(\epsilon_\theta(x) - 1\right) + 2u(x)\epsilon''_\theta(x)\right| \leq C\epsilon + 2\epsilon.$$
So we have
$$\left|\frac{\partial^2 u_\theta(x)}{\partial x\,\partial\theta}\left(\frac{\partial^2 u_\theta(x)}{\partial x^2} - \frac{\partial^2 u(x)}{\partial x^2}\right)\Big|\right| \leq \left(C\left\|\frac{\partial \epsilon_\theta(x)}{\partial\theta}\right\|_{L^\infty} + \left\|\frac{\partial \epsilon'_\theta(x)}{\partial\theta}\right\|_{L^\infty}\right)\left(C\epsilon + 2\epsilon\right) \qquad (76)$$
$$= \mathcal{O}(C)\cdot \epsilon \cdot \left\|\frac{\partial \epsilon_\theta(x)}{\partial\theta}\right\|_{L^\infty}. \qquad (77)$$
Similarly,
$$\left|\frac{\partial u_\theta(x)}{\partial\theta}\left(\frac{\partial^3 u_\theta(x)}{\partial x^3} - \frac{\partial^3 u(x)}{\partial x^3}\right)\Big|\right| = \left|\frac{\partial u_\theta(x)}{\partial\theta}\left(\frac{\partial^3 \left(u(x)\epsilon_\theta(x)\right)}{\partial x^3} - \frac{\partial^3 u(x)}{\partial x^3}\right)\Big|\right| \qquad (78)$$
$$\leq \mathcal{O}(C)\cdot \epsilon \cdot \left\|\frac{\partial \epsilon_\theta(x)}{\partial\theta}\right\|_{L^\infty}. \qquad (79)$$
Finally,
$$\left|\int \frac{\partial u_\theta(x)}{\partial\theta}\left(\frac{\partial^4 u_\theta(x)}{\partial x^4} - \frac{\partial^4 u(x)}{\partial x^4}\right) dx\right| \leq \mathcal{O}(C)\cdot \epsilon \cdot \left\|\frac{\partial \epsilon_\theta(x)}{\partial\theta}\right\|_{L^\infty}. \qquad (80)$$
Therefore, putting all these estimates together, we obtain
$$\left|\frac{\partial \mathcal{L}_r}{\partial\theta}\right| \leq \mathcal{O}(C)\cdot \epsilon \cdot \left\|\frac{\partial \epsilon_\theta(x)}{\partial\theta}\right\|_{L^\infty}. \qquad (81)$$

If we introduce the stream function $\psi$ and the vorticity $\omega$, the Navier-Stokes equations can be written in the following form [53]:
$$\frac{\partial\omega}{\partial t} + \frac{\partial\psi}{\partial y}\frac{\partial\omega}{\partial x} - \frac{\partial\psi}{\partial x}\frac{\partial\omega}{\partial y} = \nu\,\Delta\omega, \qquad \Delta\psi = -\omega, \qquad (82)$$
where
$$u = \frac{\partial\psi}{\partial y}, \qquad v = -\frac{\partial\psi}{\partial x}, \qquad \omega = \frac{\partial v}{\partial x} - \frac{\partial u}{\partial y}. \qquad (83)$$
The setup of the simulation is as follows. A set of points $(x_i, y_j)$ is uniformly distributed over the domain $[0,1]\times[0,1]$, with $x_i = i/N$, $y_j = j/N$, $i, j = 0, 1, \cdots, N$. The grid resolution $h$ equals $1/N$. We denote by $A_{i,j}$ the value of a physical variable $A$ (velocity, pressure, etc.) at the point $(x_i, y_j)$. All spatial derivatives are treated with the second-order discretization scheme shown in equation 84:
$$\frac{\partial A}{\partial x}\Big|_{i,j} \approx \frac{A_{i+1,j} - A_{i-1,j}}{2h}, \qquad \frac{\partial A}{\partial y}\Big|_{i,j} \approx \frac{A_{i,j+1} - A_{i,j-1}}{2h}, \qquad \Delta A\Big|_{i,j} \approx \frac{A_{i+1,j} + A_{i-1,j} + A_{i,j+1} + A_{i,j-1} - 4A_{i,j}}{h^2}. \qquad (84)$$
According to the boundary condition on the velocity,
$$(u, v) = \begin{cases} (1, 0), & y = 1 \\ (0, 0), & \text{otherwise,} \end{cases} \qquad (85)$$
we can derive the boundary condition for the stream function,
$$\psi = \begin{cases} h/2, & y = 1 \\ 0, & \text{otherwise,} \end{cases} \qquad (86)$$
under the assumption that the $x$-component of the velocity $u$ grows linearly from 0 to 1 between $(0, 1 - 1/N)$ and $(0, 1)$, and between $(1, 1 - 1/N)$ and $(1, 1)$.

The boundary condition on the vorticity is derived from Woods' formula [54],
$$\omega_0 = -\frac{1}{2}\omega_1 - \frac{3}{h^2}\left(\psi_1 - \psi_0\right) - \frac{3}{h}\, v_\tau - \frac{\partial v_n}{\partial \tau} + \frac{h}{2}\frac{\partial^2 v_\tau}{\partial \tau^2}, \qquad (87)$$
where $(\psi_0, \omega_0)$ are the local stream function and vorticity at a boundary point, $(\psi_1, \omega_1)$ are the stream function and vorticity at the adjacent interior point along the normal direction, $(v_n, v_\tau)$ are the normal and tangential components of the velocity, and $\tau$ is the tangential direction.

The algorithm is composed of the following steps:

Step 1. Set the stream function $\psi$ and the vorticity $\omega$ at the inner points to zero, and calculate $\psi$ and $\omega$ on the boundary using equations 86 and 87. Set the time $t = 0$.

Step 2. Calculate the vorticity $\omega$ at the inner points at time $t + \Delta t$ with the first equation in 82, substituting $\partial\omega/\partial t$ with $(\omega(t + \Delta t) - \omega(t))/\Delta t$.

Step 3. Calculate the stream function $\psi$ at the inner points at time $t + \Delta t$ with the second equation in 82 and the boundary condition of equation 86.

Step 4. Calculate the vorticity $\omega$ on the boundary with equation 87.

Step 5. Update the velocity $(u, v)$ at time $t + \Delta t$ with equation 83, and calculate the change in the velocity between times $t$ and $t + \Delta t$,
$$\mathrm{error}_u = \max_{i,j}\left\{u_{i,j}(t + \Delta t) - u_{i,j}(t)\right\}/\Delta t, \qquad \mathrm{error}_v = \max_{i,j}\left\{v_{i,j}(t + \Delta t) - v_{i,j}(t)\right\}/\Delta t. \qquad (88)$$
If $\max\{\mathrm{error}_u, \mathrm{error}_v\} < \varepsilon$, the flow has reached its steady state and the computation terminates. Otherwise, set $t \leftarrow t + \Delta t$ and return to Step 2.

In our simulation, we set the grid size to $N = 128$, use a fixed time step $\Delta t$, and set the convergence criterion to $\varepsilon = 1\times 10^{-4}$.
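For completeness, the sketch below implements the core of the iteration described in Steps 1-5 with the second-order stencils of equation 84; the explicit time update, the assumed time step, the fixed number of Jacobi sweeps for the Poisson equation in 82, and the simpler Thom-type wall-vorticity update (used here in place of the Woods formula of equation 87) are illustrative simplifications, not the exact solver used to generate the validation data.

```python
import numpy as np

def lid_driven_cavity(N=128, Re=100.0, dt=1e-4, tol=1e-4, max_steps=200000):
    """Stream function-vorticity iteration (Steps 1-5 above), illustrative sketch."""
    h, nu = 1.0 / N, 1.0 / Re
    psi = np.zeros((N + 1, N + 1))          # psi[i, j] approximates psi(x_i, y_j)
    om = np.zeros((N + 1, N + 1))
    psi[:, N] = h / 2.0                     # lid value of the stream function (equation 86)
    u_prev = np.zeros((N + 1, N + 1))
    for step in range(max_steps):
        # Step 2: explicit update of the interior vorticity (first equation in 82).
        px = (psi[2:, 1:-1] - psi[:-2, 1:-1]) / (2 * h)
        py = (psi[1:-1, 2:] - psi[1:-1, :-2]) / (2 * h)
        ox = (om[2:, 1:-1] - om[:-2, 1:-1]) / (2 * h)
        oy = (om[1:-1, 2:] - om[1:-1, :-2]) / (2 * h)
        lap = (om[2:, 1:-1] + om[:-2, 1:-1] + om[1:-1, 2:] + om[1:-1, :-2]
               - 4 * om[1:-1, 1:-1]) / h**2
        om[1:-1, 1:-1] += dt * (nu * lap - py * ox + px * oy)
        # Step 3: a few Jacobi sweeps for the Poisson equation  lap(psi) = -omega.
        for _ in range(50):
            psi[1:-1, 1:-1] = 0.25 * (psi[2:, 1:-1] + psi[:-2, 1:-1] + psi[1:-1, 2:]
                                      + psi[1:-1, :-2] + h**2 * om[1:-1, 1:-1])
        # Step 4: wall vorticity from a Thom-type update (the lid moves with u = 1 at y = 1).
        om[:, 0] = 2 * (psi[:, 0] - psi[:, 1]) / h**2
        om[:, N] = 2 * (psi[:, N] - psi[:, N - 1]) / h**2 - 2.0 / h
        om[0, :] = 2 * (psi[0, :] - psi[1, :]) / h**2
        om[N, :] = 2 * (psi[N, :] - psi[N - 1, :]) / h**2
        # Step 5: recover u = dpsi/dy and check the steady-state criterion of equation 88.
        u = np.zeros_like(psi)
        u[:, 1:-1] = (psi[:, 2:] - psi[:, :-2]) / (2 * h)
        u[:, N] = 1.0
        if step > 0 and np.max(np.abs(u - u_prev)) / dt < tol:
            break
        u_prev = u.copy()
    return psi, om

# A coarse, quick run for illustration: psi, om = lid_driven_cavity(N=32, max_steps=2000)
```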