Universal Approximation Properties of Neural Networks for Energy-Based Physical Systems
Yuhan Chen
Graduate School of System and Information Science, Kobe University, Hyogo, Japan
Takashi Matsubara
Graduate School of Engineering Science, Osaka University, Osaka, Japan. [email protected]
Takaharu Yaguchi
Graduate School of System and Information Science, Kobe University, Hyogo, Japan. [email protected]
February 25, 2021

Abstract
In Hamiltonian mechanics and the Landau theory, many physical phenomena are modeled using energy. In this paper, we prove the universal approximation property of neural network models for such physical phenomena. We also discuss behaviors of the models for integrable Hamiltonian systems when the loss function does not vanish completely by applying the KAM theory.
1 Introduction

Many physical phenomena are described by energy-based theories such as Hamiltonian mechanics and the Landau theory. In such theories, equations of the following form are often used as governing equations:
$$\frac{\mathrm{d}u}{\mathrm{d}t} = G\frac{\partial H}{\partial u}, \qquad (1)$$
where $u : t \in \mathbb{R} \mapsto u(t) \in \mathbb{R}^N$, $H : u \in \mathbb{R}^N \mapsto H(u) \in \mathbb{R}$, and $G$ is a skew-symmetric or negative semidefinite matrix (Furihata & Matsuo, 2010). $H$ represents the energy function of the system. For the systems described by (1), the time derivative of the energy is given as
$$\frac{\mathrm{d}H}{\mathrm{d}t} = \frac{\partial H}{\partial u}^\top \frac{\mathrm{d}u}{\mathrm{d}t} = \frac{\partial H}{\partial u}^\top G \frac{\partial H}{\partial u}. \qquad (2)$$
Hence, when the matrix $G$ is skew-symmetric, the energy conservation law $\frac{\mathrm{d}H}{\mathrm{d}t} = 0$ holds because $v^\top G v = 0$ for every vector $v$ when $G$ is skew-symmetric. In particular, if $G$ is in addition nondegenerate, (1) is called the Hamilton equation, which is the equation of motion in Hamiltonian mechanics, and the energy function $H$ is called the Hamiltonian (e.g., Marsden & Ratiu (2013)). When the matrix $G$ is negative semidefinite, the energy function is monotonically non-increasing:
$$\frac{\mathrm{d}H}{\mathrm{d}t} = \frac{\partial H}{\partial u}^\top G \frac{\partial H}{\partial u} \leq 0. \qquad (3)$$
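As a quick numerical illustration of (2) and (3), the sign of $v^\top G v$ determines the energy behavior; the following is a minimal sketch, with matrices chosen purely for illustration:

```python
# Illustrative check of (2)-(3): v^T G v vanishes for a skew-symmetric G
# (energy conservation) and is nonpositive for a negative semidefinite G
# (energy dissipation). The matrices are arbitrary examples.
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(2)

G_skew = np.array([[0.0, 1.0], [-1.0, 0.0]])  # G^T = -G
G_diss = np.array([[-1.0, 0.0], [0.0, 0.0]])  # negative semidefinite

print(v @ G_skew @ v)  # ~0.0, so dH/dt = 0
print(v @ G_diss @ v)  # <= 0, so dH/dt <= 0
```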
The equations of this form are used in many fields of mathematical modeling, for example, in the Landau theory and the phase-field method. Phenomena such as phase separation, crystal growth, and crack propagation are modeled by using these theories (e.g., Caginalp (1986); Wheeler et al. (1993); Steinbach (2009); Miehe et al. (2010)).

In recent years, there has been a lot of research on predicting the corresponding physical phenomena by learning the energy function $H$ in such equations with a neural network $H_{NN}$ (e.g., Greydanus et al. (2019); Cranmer et al. (2020); Chen et al. (2020); Zhong et al. (2020); Matsubara et al. (2020)); however, to the best of our knowledge, the approximation properties of such models have not been theoretically investigated sufficiently, except for the SympNet (Jin et al., 2020b) for the Hamilton equation, where universal approximation theorems for discrete-time neural network models are provided.

In this paper, we present some theoretical analyses of the approximation properties of such models, which are outlined in Figure 1. Firstly, we show that the continuous-time energy-based deep physical model
$$\frac{\mathrm{d}u}{\mathrm{d}t} = G\frac{\partial H_{NN}}{\partial u} \qquad (4)$$
has the universal approximation property. This is obtained by applying the theory of Hornik et al. (1990), which shows the universal approximation property of multilayer feedforward networks for function values along with derivatives. Secondly, we show the universal approximation property for a coordinate transformation of the above model:
$$\frac{\mathrm{d}x}{\mathrm{d}t} = \left(\frac{\partial u}{\partial x}\right)^{-1} G \left(\frac{\partial u}{\partial x}\right)^{-\top} \frac{\partial H}{\partial x}. \qquad (5)$$
Models of this form appear when the state variables in the original equation, especially the generalized momenta for Hamilton equations, are unknown, and hence they must be learned as latent variables by using, e.g., neural networks.

Figure 1: The outline of this paper. We show the universal approximation theorem for the general deep energy-based models with and without the latent variables. The analysis of the models for the integrable Hamiltonian systems is also provided; the neural network models admit the quasi-periodic behaviors of the original systems even when the loss function does not vanish but is small enough.

Thirdly, we consider the behavior of one of the most important energy-based models, Hamiltonian neural networks (Greydanus et al., 2019), especially when the loss function does not completely vanish, under the assumption that the learning target is an integrable system in the sense of Liouville. In this case, the trained model can be regarded as a perturbed Hamiltonian system due to the modeling error in the energy function, and hence in principle can be analyzed by applying a famous theory of mathematical physics, the Kolmogorov-Arnold-Moser (KAM) theory. When the original system is integrable, it exhibits quasi-periodic behaviors, which are, roughly, superpositions of periodic motions with mutually unrelated frequencies. The KAM theory shows that these quasi-periodic behaviors persist under sufficiently small perturbations; thus it can be shown that the trained model has the quasi-periodic behaviors even if the loss function does not vanish completely, provided that the loss function is small enough. We performed a rigorous analysis on this to find that it is important to minimize the $p$-norm of the errors with a sufficiently large $p$ rather than the mean squared errors.
This intuitively makes sense because the KAM theory requires that the perturbation be small in the sup norm, and the $p$-norm approximates this norm when $p$ is large enough.

The main contributions include:
• Universal approximation theorem for general energy-based neural network models with and without hidden variables.
While applications of neural networks to the modeling of physical phenomena are developing very rapidly, as seen in Section 2, theoretical studies on these models are limited. This study aims to provide a theoretical basis for these approaches.
• Application of the KAM theory to the analysis of the behaviors of Hamiltonian neural networks when the loss function does not vanish.
Even for a model with the universal approximation property, the loss function cannot be made to vanish completely when the model is actually trained. In this paper, for integrable systems in the sense of Liouville, by combining the KAM theory and the generalization error analysis using the Rademacher complexities, we show that, under a certain condition, quasi-periodic motions of such systems are preserved with a high probability even when the loss function does not vanish. In particular, the analysis shown below implies that the $p$-norm of the errors should be minimized rather than their mean squared errors.

• Introduction of general energy-based neural network models with hidden variables.
While the main purpose of this study is the theoretical analysis of deep physical modeling, the latent variable model considered in Section 4 is a new energy-based neural network model that does not depend on a specific coordinate system. For example, the Hamilton equation used in Hamiltonian neural networks requires the state of the system to be described in a special coordinate system with generalized coordinates and generalized momenta. However, the generalized momenta depend on the Lagrangian of the system, which is defined by the unknown energy function modeled by the neural networks, and hence are often unknown without detailed prior knowledge about the system. For the Hamilton equation, a model that employs latent variables has been proposed (Toth et al., 2019); however, the main objective of Toth et al. (2019) is the extraction of a low-dimensional phase space from images or movies, while ours is finding a diffeomorphism.
2 Related Work

Many neural network models for phenomena that can be modeled by the energy-based equation (1) have been proposed. Among those studies, the most basic ones would be neural ordinary differential equations (Chen et al., 2018) and Hamiltonian neural networks (Greydanus et al., 2019). In particular, extensions of Hamiltonian neural networks have been intensively developed.

Due to the limitation of space, it is difficult to describe all of them, but some examples are as follows. Toth et al. (2019) extended Hamiltonian neural networks to models with latent variables. Other studies such as Zhong et al. (2020), DiPietro et al. (2020), and Xiong et al. (2021) focus on the symplectic structure of the Hamilton equation. Focusing on Noether's theorem, which is a fundamental theorem of classical mechanics, several studies such as Bondesan & Lamacraft (2019), Finzi et al. (2020a), and Bharadwaj et al. (2020) have developed methods related to symmetry and conservation laws. In addition, Matsubara et al. (2020) constructed a discrete-time model that preserves the energy behaviors by using the same model as (4) in this paper. Besides, in Galioto & Gorodetsky (2020), Hamiltonian neural networks are combined with the Bayesian approach.

Methods applied to frameworks of classical mechanics other than Hamiltonian mechanics include Cranmer et al. (2020), Desai & Roberts (2020), and Sæmundsson et al. (2019), which are methods for the Lagrangian formalism. In Jin et al. (2020c), reinforcement learning is applied to the variational principle. Finzi et al. (2020b) proposed a simplified model by introducing constraints. In Jin et al. (2020a), Hamiltonian neural networks are extended to Poisson systems, which form a wider class of mechanical equations. There are also a number of proposals that integrate with more advanced deep learning techniques, e.g., graph networks (Sanchez-Gonzalez et al., 2019), recurrent neural networks (Chen et al., 2020), and normalizing flows (Li et al., 2020). As an application-oriented approach, Feng et al. (2020) designed a model for structural analysis from the microscopic level.

On the other hand, as far as the authors know, there is no theoretical research other than the universal approximation theorems for Hamiltonian mechanics in SympNet (Jin et al., 2020b), in which a certain kind of neural networks is shown to have universal approximation properties for symplectic maps. The difference between their results and ours is that the universal approximation theorems in Jin et al. (2020b) are for discrete-time models while ours are for continuous-time models.
3 Universal Approximation Theorem for the Energy-Based Model

In this section, we show that under a certain condition the model (4) has the universal approximation property.
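For concreteness, a minimal sketch of model (4) in PyTorch is given below; the layer sizes and the canonical choice of $G$ are illustrative assumptions, not a specification from the paper:

```python
# A sketch of model (4): du/dt = G * dH_NN/du, with H_NN a small MLP whose
# gradient is taken by automatic differentiation. Sizes are placeholders.
import torch

class EnergyBasedModel(torch.nn.Module):
    def __init__(self, dim, hidden=50):
        super().__init__()
        self.H_NN = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.Tanh(),
            torch.nn.Linear(hidden, 1),
        )
        M = dim // 2                       # canonical skew-symmetric G as an example
        G = torch.zeros(dim, dim)
        G[:M, M:], G[M:, :M] = torch.eye(M), -torch.eye(M)
        self.register_buffer("G", G)

    def forward(self, u):                  # u: (batch, dim)
        u = u.detach().requires_grad_(True)
        H = self.H_NN(u).sum()
        grad_H, = torch.autograd.grad(H, u, create_graph=True)
        return grad_H @ self.G.T           # row-wise G * dH/du
```

Training such a model then regresses this output onto observed time derivatives $\mathrm{d}u/\mathrm{d}t$.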
We first prepare some notation to describe the theorem. We denote $C^m(X)$ with the topology of the Sobolev space $W^{p,m}(X)$ by $S^m_p(X)$. $W^{p,m}(X)$ is the space of functions that admit weak derivatives up to the $m$th order whose $L^p$ norms are bounded; hence $S^m_p(X)$ is the space of functions in $W^{p,m}(X)$ with (usual) derivatives. For details on the Sobolev theory, see Adams & Fournier (2003). We denote the $L^p$ norm for functions by $\|\cdot\|_{L^p}$ and that for vectors by $\|\cdot\|_p$. The space of the neural networks with the activation function $\sigma$ is denoted by $\Sigma(\sigma)$:
$$\Sigma(\sigma) = \left\{ g : \mathbb{R}^r \to \mathbb{R} \,\middle|\, g(x) = \sum_{i=1}^{q} \beta_i\, \sigma(\gamma_i^\top x + \alpha_i),\ \alpha_i \in \mathbb{R},\ \beta_i \in \mathbb{R},\ \gamma_i \in \mathbb{R}^r \right\}.$$
The following is the first result of this paper, which shows the universal approximation property of the model (4).
Theorem 1.
Let $H : \mathbb{R}^N \to \mathbb{R}$ be an energy function with the equation
$$\frac{\partial u}{\partial t} = G\frac{\partial H}{\partial u},$$
where $u : t \in \mathbb{R} \mapsto u(t) \in \mathbb{R}^N$ and $G$ is an $N \times N$ matrix. Suppose that the phase space $K$ of this system is compact and the right-hand side $G\,\partial H/\partial u$ is Lipschitz continuous. If the activation function $\sigma \neq 0$ belongs to $S^1_2(\mathbb{R})$, then for any $\varepsilon > 0$ there exists a neural network $H_{NN}$ for which
$$\left\| G\frac{\partial H}{\partial u} - G\frac{\partial H_{NN}}{\partial u} \right\|_{L^2} < \varepsilon$$
holds.

This theorem is a direct application of the universal approximation theorem of Hornik et al. (1990). In Hornik et al. (1990), several universal approximation theorems are provided, for example, the following.
Theorem 2 (Hornik et al., 1990). Let the activation function $\sigma \neq 0$ belong to $S^m_p(\mathbb{R}, \lambda)$ for an integer $m \geq 0$. Then, $\Sigma(\sigma)$ is $m$-uniformly dense in $C^\infty(K)$, where $K$ is any compact subset of $\mathbb{R}^N$.

Lemma 1.
Under the same assumption, $\Sigma(\sigma)$ is also dense in $S^m_p(\mathbb{R}, \lambda)$.

Hence if the activation function $\sigma$ of the hidden layer is in $S^m_p(\mathbb{R}, \lambda)$ and does not vanish everywhere, then for any sufficiently smooth function there exists a neural network that approximates the function and its derivatives up to the order $m$ arbitrarily well on compact sets. Besides, this theorem is also extended to functions with multiple outputs; see Hornik et al. (1990).
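In code, an element of $\Sigma(\sigma)$ and its gradient look as follows; the $m$-uniform density statement says both outputs can simultaneously be made close to a target function and its derivatives on compact sets. The parameters below are random placeholders:

```python
# An element g of Sigma(sigma) with sigma = tanh, together with its
# analytic input gradient. alpha, beta, gamma are placeholder parameters.
import numpy as np

rng = np.random.default_rng(1)
q, r = 20, 3
alpha = rng.standard_normal(q)
beta = rng.standard_normal(q)
gamma = rng.standard_normal((q, r))

def g(x):              # g(x) = sum_i beta_i * tanh(gamma_i . x + alpha_i)
    return beta @ np.tanh(gamma @ x + alpha)

def grad_g(x):         # dg/dx_k = sum_i beta_i * tanh'(.) * gamma_ik
    return (beta * (1.0 - np.tanh(gamma @ x + alpha) ** 2)) @ gamma

x = rng.standard_normal(r)
print(g(x), grad_g(x))
```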
Proof of Theorem 1. Since the target equation is determined only by the gradient of $H$, any function obtained by shifting $H$ by a constant gives the same equation. Hence we choose and fix an energy function $H$ that yields the target equation. Because $G\,\partial H/\partial u$ is Lipschitz continuous and hence continuous on the phase space $K$, this function is bounded and square-integrable. Thus $G\,\partial H/\partial u \in L^2(K)$, which means $H$ is in $S^1_2(K)$. Therefore, from Lemma 1 and the assumption that the activation function is in $S^1_2(\mathbb{R})$, for each $\varepsilon$ there exists a neural network that approximates $H$ in $S^1_2(K)$:
$$\|H - H_{NN}\|_{L^2} + \left\|\frac{\partial H}{\partial u} - \frac{\partial H_{NN}}{\partial u}\right\|_{L^2} < \frac{\varepsilon}{\|G\|},$$
which gives
$$\left\| G\frac{\partial H}{\partial u} - G\frac{\partial H_{NN}}{\partial u} \right\|_{L^2} \leq \|G\| \left\|\frac{\partial H}{\partial u} - \frac{\partial H_{NN}}{\partial u}\right\|_{L^2} < \varepsilon.$$

Remark 1.
In the above theorem, we supposed that the energy function is Lipschitz continuous; however, as can be seen from the proof, if the energy function is $C^\infty$, the function can be approximated by a $C^\infty$ neural network provided that the activation function is smooth enough. We use this statement later in Section 5.
4 Energy-Based Models with Latent Variables

The practical use of Hamiltonian neural networks is hampered by the fact that the state variables must be represented in a specific coordinate system, e.g., by the generalized momenta. The generalized momenta are in principle defined by using the Lagrangian of the system; however, the Lagrangian is typically defined by the difference between the kinetic energy and the potential energy. Thus, the derivation of the generalized momenta requires the energy function, which is unknown and hence is the target function of Hamiltonian neural networks.

Therefore, we propose a model for physical phenomena that cannot be modeled in the normal form (4) due to the impossibility of calculating the required specific coordinate system but are considered to follow the energy-based physics theories. More specifically, we consider the equation
$$\frac{\mathrm{d}x}{\mathrm{d}t} = f(x) \qquad (6)$$
that can be transformed into the model equation (4) with respect to the hidden state variable $u(t) = u(x(t))$ by a certain diffeomorphism.

By substituting $u = u(x)$ into the model equation (4), we have
$$\frac{\partial u}{\partial x}\frac{\mathrm{d}x}{\mathrm{d}t} = G\left(\frac{\partial x}{\partial u}\right)^\top \frac{\partial H}{\partial x}.$$
Since $\partial x/\partial u = (\partial u/\partial x)^{-1}$, this becomes
$$\frac{\mathrm{d}x}{\mathrm{d}t} = \left(\frac{\partial u}{\partial x}\right)^{-1} G \left(\frac{\partial u}{\partial x}\right)^{-\top} \frac{\partial H}{\partial x}. \qquad (7)$$
In fact, the transformed model (7) has the same energetic properties as the original equation.

Theorem 3.
The model (7) admits the energy conservation law $\frac{\mathrm{d}H}{\mathrm{d}t} = 0$ if $G$ is skew-symmetric, and admits the energy dissipation law $\frac{\mathrm{d}H}{\mathrm{d}t} \leq 0$ if $G$ is negative semidefinite.

Proof. This theorem can be obtained from
$$\frac{\mathrm{d}H}{\mathrm{d}t} = \frac{\partial H}{\partial x}^\top \frac{\mathrm{d}x}{\mathrm{d}t} = \frac{\partial H}{\partial x}^\top \left(\frac{\partial u}{\partial x}\right)^{-1} G \left(\frac{\partial u}{\partial x}\right)^{-\top} \frac{\partial H}{\partial x},$$
which is $0$ if $G$ is skew-symmetric and $\leq 0$ if $G$ is negative semidefinite.

The second result in this study is the universal approximation property of the above transformed model (7).

Theorem 4.
Let $H : \mathbb{R}^N \to \mathbb{R}$ be an energy function for the equation
$$\frac{\mathrm{d}x}{\mathrm{d}t} = \left(\frac{\partial u}{\partial x}\right)^{-1} G \left(\frac{\partial u}{\partial x}\right)^{-\top} \frac{\partial H}{\partial x},$$
where $x : t \in \mathbb{R} \mapsto x(t) \in \mathbb{R}^N$, $u : x \in \mathbb{R}^N \mapsto u(x) \in \mathbb{R}^N$, and $G$ is an $N \times N$ matrix. Suppose that the phase space $K$ of this system is compact and the right-hand side is Lipschitz continuous. Suppose also that $u$ is a $C^1$-diffeomorphism. If the functions $\sigma \neq 0$ and $\rho \neq 0$ belong to $S^1_2(\mathbb{R})$, then for any $\varepsilon > 0$ there exist neural networks $H_{NN}$ with the activation function $\sigma$ and $u_{NN}$ with $\rho$ for which
$$\left\| \left(\frac{\partial u}{\partial x}\right)^{-1} G \left(\frac{\partial u}{\partial x}\right)^{-\top} \frac{\partial H}{\partial x} - \left(\frac{\partial u_{NN}}{\partial x}\right)^{-1} G \left(\frac{\partial u_{NN}}{\partial x}\right)^{-\top} \frac{\partial H_{NN}}{\partial x} \right\|_{L^2} < \varepsilon$$
holds.

Proof. We need to prove the approximation property for $(\partial u/\partial x)^{-1}$. From the assumption that $\rho \neq 0$ is in $S^1_2(\mathbb{R})$, there exists a network $u_{NN}$ whose Jacobian approximates $\partial u/\partial x$. Since the determinant function of matrices is continuous, it is deduced that $\det \partial u_{NN}/\partial x \neq 0$ and hence $(\partial u_{NN}/\partial x)^{-1}$ exists. Because the matrix inverse is also continuous, $(\partial u/\partial x)^{-1}$ is also approximated by $(\partial u_{NN}/\partial x)^{-1}$.

Remark 2.
Although the main objective of this study is the theoretical analysis, to the best of our knowledge, no deep learning model using this equation has been proposed so far.
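To illustrate, a minimal sketch of how the transformed model (7) could be realized is given below. The networks, sizes, and the use of torch.autograd.functional.jacobian are illustrative assumptions, and invertibility of $\partial u_{NN}/\partial x$ is assumed rather than enforced (cf. Remark 3 in the appendix):

```python
# A sketch of the latent-variable model (7):
# dx/dt = (du/dx)^{-1} G (du/dx)^{-T} dH_NN/dx.
import torch
from torch.autograd.functional import jacobian

dim = 4
u_NN = torch.nn.Sequential(torch.nn.Linear(dim, 50), torch.nn.Tanh(),
                           torch.nn.Linear(50, dim))   # latent coordinates u(x)
H_NN = torch.nn.Sequential(torch.nn.Linear(dim, 50), torch.nn.Tanh(),
                           torch.nn.Linear(50, 1))     # energy as a function of x

M = dim // 2
G = torch.zeros(dim, dim)
G[:M, M:], G[M:, :M] = torch.eye(M), -torch.eye(M)

def vector_field(x):                   # x: a single state of shape (dim,)
    J = jacobian(u_NN, x)              # du/dx, shape (dim, dim)
    x = x.detach().requires_grad_(True)
    grad_H, = torch.autograd.grad(H_NN(x).sum(), x)
    # J^{-1} G J^{-T} grad_H, computed with linear solves instead of inverses
    return torch.linalg.solve(J, G @ torch.linalg.solve(J.T, grad_H))

print(vector_field(torch.randn(dim)))
```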
Figure 2: The outline of the analysis in Section 5. Integrable Hamiltonian systems exhibit quasi-periodic behaviors. Hamiltonian neural networks can maintain this property with a high probability even when the loss function does not completely vanish, provided that the generalization error is small enough.
5 Analysis of Hamiltonian Neural Networks for Integrable Systems

In this section, we investigate the behaviors of the trained model for integrable Hamiltonian systems, particularly in the case where the loss function does not vanish completely. The Hamilton equation is the equation of the form (1) with a nondegenerate skew-symmetric matrix $S$:
$$\frac{\partial u}{\partial t} = S\frac{\partial H}{\partial u}, \qquad S^\top = -S, \quad \det S \neq 0.$$
For the matrix $S$ to be nondegenerate, the dimension of the phase space must be an even number. Therefore we suppose that $u : t \in \mathbb{R} \mapsto u(t) \in \mathbb{R}^N$ with $N = 2M$. As explained below, integrable systems exhibit quasi-periodic motions. The model (4) for Hamilton equations,
$$\frac{\partial u}{\partial t} = S\frac{\partial H_{NN}}{\partial u},$$
is called a Hamiltonian neural network. In Section 3, we have shown that the above model has the universal approximation property and hence the loss function can be made arbitrarily small; however, in reality, the minimization cannot be perfect, and a certain amount of error is inevitably introduced. In this section, we show that if the activation function of the neural network is sufficiently smooth and the loss function, and hence the generalization error, is not zero but sufficiently small, the trained model maintains with a high probability the above quasi-periodic behaviors of integrable systems. This result is obtained by the KAM theory and the generalization error analysis using the Rademacher complexities.

First, we briefly introduce some properties of Hamiltonian systems. Firstly, it is known as the Darboux theorem that by an appropriate coordinate transformation the matrix $S$ is transformed into the normal form
$$\begin{pmatrix} O & I \\ -I & O \end{pmatrix}.$$
Hence it can be assumed that the matrix $S$ is given in this form. Secondly, the function $\omega : (v, w) \in \mathbb{R}^N \times \mathbb{R}^N \mapsto \omega(v, w) \in \mathbb{R}$ defined by
$$\omega(v, w) = v^\top S^{-1} w$$
is called the symplectic form. By using the symplectic form, a vector field $X_F$ is associated to each function $F : \mathbb{R}^N \to \mathbb{R}$ by requiring
$$\omega(X_F, v) = \frac{\partial F}{\partial u} \cdot v \quad \text{for all } v.$$
For two functions $F, G$, the operation
$$\{F, G\} = \omega(X_F, X_G) \qquad (8)$$
is called the Poisson bracket.

Definition 1.
A Hamiltonian system whose state variable is $N = 2M$ dimensional is integrable in the sense of Liouville if the system has $M$ first integrals (i.e., conserved quantities) $F_1, F_2, \ldots, F_M$ such that $\nabla F_1(u), \nabla F_2(u), \ldots, \nabla F_M(u)$ are independent at each $u$ and $\{F_i, F_j\} = 0$ for all $i, j$.

For integrable systems, the following theorem is known.
Theorem 5 (Liouville-Arnold). Suppose that for an integrable Hamiltonian system there exist constants $c_1, \ldots, c_M$ such that $K = \cap_{i=1}^M F_i^{-1}(c_i)$ is connected and compact. Then there exist a neighborhood $\mathcal{N}$ of $K$ and a coordinate transform
$$\phi : (\theta, I) \in \mathbb{T}^M \times U \to \phi(\theta, I) \in \mathcal{N} \qquad (9)$$
such that the transformed system is the Hamilton equation whose Hamiltonian $H \circ \phi$ depends only on $I$.

The variables $I$ and $\theta$ are called the action-angle variables. This theorem together with the Darboux theorem roughly means that integrable Hamiltonian systems can be written in the following form:
$$\frac{\mathrm{d}}{\mathrm{d}t}\begin{pmatrix} \theta \\ I \end{pmatrix} = \begin{pmatrix} O & I \\ -I & O \end{pmatrix} \begin{pmatrix} \partial \tilde H/\partial \theta \\ \partial \tilde H/\partial I \end{pmatrix}.$$
Further, since $\tilde H = H \circ \phi$ depends on $I$ only, it holds that
$$\frac{\mathrm{d}}{\mathrm{d}t}\begin{pmatrix} \theta \\ I \end{pmatrix} = \begin{pmatrix} O & I \\ -I & O \end{pmatrix} \begin{pmatrix} 0 \\ \partial \tilde H/\partial I \end{pmatrix} = \begin{pmatrix} \partial \tilde H/\partial I \\ 0 \end{pmatrix}.$$
This shows that $I$ is constant and hence $\theta$ moves on the torus at a constant velocity. Because the frequency components are typically not rationally related to each other, the dynamics is "quasi-periodic." See, e.g., Scott Dumas (2014) for more details.

As seen above, integrable Hamiltonian systems are quasi-periodic. Therefore, the same property is preferably maintained when modeling such a system by neural networks. When the modeling error is sufficiently small, this can be regarded as a perturbation problem. The perturbation theory of Hamiltonian systems has been investigated from various aspects. For example, perturbed integrable Hamiltonian systems are in general no longer integrable; hence approximation of integrable Hamiltonian systems by integrable neural network models should be difficult. Fortunately, however, the KAM theory shows that even though the perturbed system is not integrable, it maintains the above quasi-periodic behaviors under certain conditions.

The KAM theorem has many variants under various conditions. The following is a typical one (Scott Dumas, 2014).

Theorem 6 (KAM Theorem). Let $\theta, I$ be the action-angle variables for a $C^\infty$ integrable Hamiltonian $H_0 : \mathbb{R}^M \to \mathbb{R}$ with $M \geq 2$. If $H_0$ is non-degenerate, that is,
$$\det \frac{\partial^2 H_0}{\partial I^2} \neq 0, \qquad (10)$$
then for the perturbed system
$$H(\theta, I) = H_0(I) + \varepsilon F(\theta, I, \varepsilon)$$
with $F \in C^\infty$, there exists $\varepsilon_0$ such that if $|\varepsilon F| < \varepsilon_0$, there exists a set of $M$-dimensional tori invariant under the perturbed flow. On each invariant torus, the flow of the perturbed system $H$ is quasi-periodic. In addition, the set of invariant tori is large in the sense that its measure becomes full as $\varepsilon \to 0$.

Suppose that an integrable Hamiltonian system with a compact phase space is modeled by using Hamiltonian neural networks. The universal approximation property shown in Section 3 guarantees that the value of the MSE can be arbitrarily small; however, in actual training, a finite error remains. In the rest of this section, we apply the KAM theory to investigate the trained model in such cases.
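The quasi-periodic picture above can be checked numerically: in action-angle form the angles advance linearly, and with rationally independent frequencies the orbit fills the torus. The following toy sketch uses arbitrary illustrative frequencies:

```python
# Quasi-periodic flow on the torus T^2: theta(t) = omega * t (mod 2*pi).
# With omega_2/omega_1 irrational, the orbit never closes and visits
# essentially every region of the torus.
import numpy as np

omega = np.array([1.0, np.sqrt(2.0)])        # rationally independent frequencies
t = np.linspace(0.0, 500.0, 50001)
theta = (omega[:, None] * t) % (2.0 * np.pi)

visited = np.histogram2d(theta[0], theta[1], bins=20)[0] > 0
print(visited.mean())                        # fraction of visited cells tends to 1
```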
To this end, we assume the conditions for the Liouville-Arnold theorem, that is, for the target integrable system, there exist $c_1, \ldots, c_M$ such that $K = \cap_{i=1}^M F_i^{-1}(c_i)$ is connected and compact. We also assume that the Hamiltonian $H$ is nondegenerate, which is the condition for the KAM theorem. Besides, since for the application of the KAM theory the Hamiltonian and the perturbation must be smooth enough, we assume the smoothness of the activation functions of the neural networks.

Essentially, what we need to show is that the Hamiltonian $H$ of the target system and $H_{NN}$ of the trained model are sufficiently close to each other. Because in the learning process of Hamiltonian neural networks the error on the energy gradient is typically minimized, we need to develop a generalization error bound on the function values from the training errors of the derivatives.

The gap between the errors on the function values and those on the derivatives can be compensated by applying the Poincaré inequality (e.g., Chen & Hou (2020)). The Poincaré inequality essentially states that in order for a function to be large, its derivative must be large.

Theorem 7 (the Poincaré inequality). Suppose that $1 \leq p \leq \infty$ and $\Omega \subset \mathbb{R}^M$ is bounded. Then there exists a constant $c_p$ such that for any $H \in S^1_p(\Omega)$
$$\int_\Omega |H(u) - \bar H|^p \,\mathrm{d}u \leq c_p \left\|\frac{\partial H}{\partial u}\right\|_{L^p}^p, \quad \text{where} \quad \bar H = \frac{1}{\int_\Omega \mathrm{d}u} \int_\Omega H(u)\,\mathrm{d}u.$$
The constant $c_p$ is called the Poincaré constant.

We use this inequality along with the invariance of the Hamilton equation under constant shifts of the energy function. More precisely, among the energy functions that yield the target Hamilton equation, we choose the one for which
$$\int H(u)\,\mathrm{d}u = \int H_{NN}(u)\,\mathrm{d}u \qquad (11)$$
holds so that the error function has zero mean:
$$e(u) := H(u) - H_{NN}(u), \qquad \bar e = 0.$$
Then from the Poincaré inequality we get
$$\int_\Omega |e(u)|^p \,\mathrm{d}u \leq c_p \left\|\frac{\partial e}{\partial u}\right\|_{L^p}^p.$$
In addition, to apply the KAM theory, the error function $e(u)$ must be small not only at the data points but also at any point near the quasi-periodic solutions to the unperturbed system. Therefore, we need a generalization error bound for the function derivatives. In the following, we derive such an estimate.

Generalization error bounds are typically obtained by using the Rademacher complexities. See, e.g., Bousquet et al. (2004); Steinwart & Christmann (2008); Giné & Nickl (2016) for details.

Definition 2.
For a set of functions $\mathcal{G} \subset \{f \mid f : Z \to \mathbb{R}\}$, the empirical Rademacher complexity of $\mathcal{G}$ with respect to $z \in Z^n$ is defined by
$$\hat R(\mathcal{G}) := \mathbb{E}_\sigma \left[ \sup_{g \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \sigma_i g(z_i) \right],$$
where $\mathbb{E}_\sigma$ is the expectation with respect to the uniform distribution of $\sigma \in \{-1, 1\}^n$. In addition, if the data $z = (z_1, z_2, \ldots, z_n)$ are distributed according to a probability measure $P$, then the Rademacher complexity of $\mathcal{G}$ with respect to $P$ is defined by
$$R_n(\mathcal{G}) := \mathbb{E}_{P^n}\left[\hat R(\mathcal{G})\right].$$
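Definition 2 can be estimated directly by Monte Carlo for a small function class; the class below is a toy stand-in, not one used in this paper:

```python
# Monte Carlo estimate of the empirical Rademacher complexity of a toy
# finite class G on data z_1, ..., z_n, following Definition 2.
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=20)
G = [np.sin, np.cos, np.tanh, lambda s: s ** 2]   # toy hypothesis class

def empirical_rademacher(G, z, n_draws=2000):
    n = len(z)
    vals = np.stack([g(z) for g in G])            # |G| x n table of g(z_i)
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
        total += np.max(vals @ sigma) / n         # sup_g (1/n) sum_i sigma_i g(z_i)
    return total / n_draws

print(empirical_rademacher(G, z))
```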
The following is a typical generalization bound using the Rademacher complexity.

Theorem 8.
Let $X$, $Y$ be arbitrary spaces, $\mathcal{F} \subset \{f : X \to Y\}$ be a hypothesis class, and $L : Y \times Y \to [0, c]$ be a loss function. Let $\mathcal{G}$ be defined by $\{(x, y) \in X \times Y \mapsto L(y, h(x)) \in \mathbb{R} \mid h \in \mathcal{F}\}$. Then for any $\delta > 0$ and any probability measure $P$, we have with probability at least $1 - \delta$ with respect to the repeated sampling of $P^n$-distributed training data
$$\mathbb{E}[L(Y, h(X))] - \frac{1}{n}\sum_{i=1}^n L(Y_i, h(X_i)) \leq 2R_n(\mathcal{G}) + c\sqrt{\frac{\ln(1/\delta)}{2n}}$$
for all $h \in \mathcal{F}$.

Although we only need the generalization error bound for the first-order derivatives, in what follows, we show a more general bound. Suppose that the model is trained by a certain kind of Sobolev training (Czarnecki et al., 2017), where the error measured by
$$\sum_{j=1}^l \left\| \frac{\partial^j H(u)}{\partial u^j} - \frac{\partial^j H_{NN}(u)}{\partial u^j} \right\|_p^p \qquad (12)$$
is minimized.
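A sketch of such a Sobolev-type loss for the case $l = 1$ is given below; the function and argument names are placeholders. For large $p$, this loss approximates the sup norm of the gradient error, which is the quantity the KAM-based analysis below needs to control:

```python
# Sobolev training loss in the spirit of (12) with l = 1: the p-th power
# of the p-norm of the gradient error, averaged over data points.
import torch

def sobolev_p_loss(H_NN, u_data, dHdu_data, p=8):
    u = u_data.detach().requires_grad_(True)
    H = H_NN(u).sum()
    grad_H, = torch.autograd.grad(H, u, create_graph=True)
    return ((grad_H - dHdu_data).abs() ** p).sum(dim=1).mean()
```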
We denote the number of the components of the derivatives that appear in the above summation by $N'$. By applying Theorem 8 to the approximation of $N'$-dimensional vector-valued functions, we obtain the following.

Lemma 2. Let $X$ be the phase space $K$ and let
$$Y = \left\{ y = (y_1, y_2, \ldots, y_{N'}) \in \mathbb{R}^{N'} \,\middle|\, \exists H,\ y_1 = \frac{\partial H}{\partial u_1},\ y_2 = \frac{\partial H}{\partial u_2},\ \ldots \right\}.$$
Let $\mathcal{F} \subset \{f : X \to Y\}$ be a hypothesis class and $L : Y \times Y \to [0, c]$ be the loss function
$$L(Y, Y') = \sum_{i=1}^{N'} \|Y_i - Y'_i\|_p^p.$$
Let $\mathcal{G}$ be defined by $\{(x, y) \in X \times Y \mapsto L(y, h(x)) \in \mathbb{R} \mid h \in \mathcal{F}\}$. Then for any $\delta > 0$ and any probability measure $P$, we have with probability at least $1 - \delta$ with respect to the repeated sampling of $P^n$-distributed training data
$$\mathbb{E}[L(Y, h(X))] - \frac{1}{n}\sum_{i=1}^n L(Y_i, h(X_i)) \leq 2R_n(\mathcal{G}) + c\sqrt{\frac{\ln(1/\delta)}{2n}}$$
for all $h \in \mathcal{F}$.

Note that the terms in the above estimate correspond to
$$\mathbb{E}[L(Y, h(X))] = \int \sum_{j=1}^l \left\| \frac{\partial^j H(u)}{\partial u^j} - \frac{\partial^j H_{NN}(u)}{\partial u^j} \right\|_p^p \mathrm{d}P,$$
$$\frac{1}{n}\sum_{i=1}^n L(Y_i, h(X_i)) = \frac{1}{N_{\mathrm{data}}} \sum_{k=1}^{N_{\mathrm{data}}} \sum_{j=1}^l \left\| \frac{\partial^j H}{\partial u^j}(u_{\mathrm{d},k}) - \frac{\partial^j H_{NN}}{\partial u^j}(u_{\mathrm{d},k}) \right\|_p^p,$$
respectively, where $N_{\mathrm{data}}$ is the number of the data points and $u_{\mathrm{d},k}$ is the $k$th data point. Then, if we relax the space $Y$ to $\tilde Y = \mathbb{R}^{N'}$, the corresponding function space
$$\tilde{\mathcal{G}} = \{(x, y) \in X \times \tilde Y \mapsto L(y, h(x)) \in \mathbb{R} \mid h \in \mathcal{F}\}$$
contains $\mathcal{G}$. From the property of the Rademacher complexities, $\mathcal{G} \subset \tilde{\mathcal{G}}$ implies $R_n(\mathcal{G}) \leq R_n(\tilde{\mathcal{G}})$. Thus we have the following.

Theorem 9.
Under the same condition as Lemma 2, for any $\delta > 0$ and any probability measure $P$, we have with probability at least $1 - \delta$ with respect to the repeated sampling of $P^n$-distributed training data
$$\mathbb{E}[L(Y, h(X))] - \frac{1}{n}\sum_{i=1}^n L(Y_i, h(X_i)) \leq 2R_n(\tilde{\mathcal{G}}) + c\sqrt{\frac{\ln(1/\delta)}{2n}}$$
for all $h \in \mathcal{F}$.
Fortunately, $R_n(\tilde{\mathcal{G}})$ can be bounded. In fact, the Rademacher complexities of neural networks are well studied, and there exist theoretical bounds depending on the architecture of the networks (Neyshabur et al., 2015; Golowich et al., 2018). In addition, because the phase space is assumed to be compact, the loss function is Lipschitz. Because the Rademacher complexity of the composition of a function with a Lipschitz function is estimated by the Rademacher complexity of the former multiplied by the Lipschitz constant of the latter, $R_n(\tilde{\mathcal{G}})$ is bounded by a constant proportional to the Rademacher complexity of the employed neural networks.

From the above estimate we have
$$\int \sum_{j=1}^l \left\| \frac{\partial^j H(u)}{\partial u^j} - \frac{\partial^j H_{NN}(u)}{\partial u^j} \right\|_p^p \mathrm{d}P \leq \frac{1}{n}\sum_{i=1}^n L(Y_i, h(X_i)) + 2R_n(\tilde{\mathcal{G}}) + c\sqrt{\frac{\ln(1/\delta)}{2n}}.$$
In addition, if we assume that there exists a density $f_P$ for the measure $P$ with $\inf f_P > 0$, the above inequality gives a bound on
$$\int \sum_{j=1}^l \left\| \frac{\partial^j H(u)}{\partial u^j} - \frac{\partial^j H_{NN}(u)}{\partial u^j} \right\|_p^p \mathrm{d}u,$$
which is essentially the Sobolev norm in $W^{p,l}$ (Adams & Fournier, 2003). Because in general the Rademacher complexity is a decreasing function of $n$, the terms on the right-hand side other than the training loss decrease as $n$ becomes large. Hence if the training loss is small enough and the number of data is large enough, the norm of the error function $e$ in $W^{p,l}$ can be made small with probability at least $1 - \delta$. Lastly, we use the Sobolev inequality (Adams & Fournier, 2003; Benyi & Oh, 2013): there exist constants $c_1, c_2$ such that if $lp > d$,
$$\|e\|_{L^\infty(\mathbb{R}^d)} \leq c_1 \|e\|_{W^{p,l}(\mathbb{R}^d)}, \qquad \|e\|_{L^\infty(\mathbb{T}^d)} \leq c_2 \|e\|_{W^{p,l}(\mathbb{T}^d)},$$
which ensures that $H$ and $H_{NN}$ are close in the function values. In particular, we consider the case $l = 1$ because typically the loss function (12) with $l = 1$ is used in Hamiltonian neural networks. Thus, to apply the above Sobolev inequality we need $p > d$.

By combining the above results, we have the following theorem.

Theorem 10.
Suppose that the target system is an integrable Hamiltonian system with a $C^\infty$ and non-degenerate Hamiltonian $H : \mathbb{R}^M \to \mathbb{R}$ with $M \geq 2$. Suppose also that a Hamiltonian neural network with $C^\infty$ activation functions is used to model the target system. For a fixed $\delta$, suppose that the loss function
$$\frac{1}{N_{\mathrm{data}}} \sum_{k=1}^{N_{\mathrm{data}}} \left\| \frac{\partial H}{\partial u}(u_{\mathrm{d},k}) - \frac{\partial H_{NN}}{\partial u}(u_{\mathrm{d},k}) \right\|_p^p$$
with $p > M$ is small enough and the number of the data $N_{\mathrm{data}}$ is large enough so that the generalization error, and hence the sup norm $\|e\|_{L^\infty}$, is less than a constant $\varepsilon_0$. Then there exists a set of invariant tori for the trained model $H_{NN}$ with probability at least $1 - \delta$. The constant $\varepsilon_0$ depends on $\delta$, the Poincaré constant $c_p$, $\inf f_P$, and the threshold of the KAM theorem.

6 Conclusion

In this paper, we have investigated the approximation properties of deep energy-based models, including Hamiltonian neural networks. In particular, we have provided the universal approximation theorem for the general deep energy-based models with and without latent variables. In addition, the general energy-based model with latent variables is newly introduced in this paper. For integrable Hamiltonian systems, we have applied the KAM theory to prove the persistence of the quasi-periodic behaviors with a high probability even when the loss function is not perfectly zero. Future work includes a more sophisticated analysis of the generalization errors and similar analyses for other types of neural networks for physics, such as the variational integrator networks.
References
Adams, R. A. and Fournier, J. J. F. Sobolev Spaces. Elsevier, June 2003.
Benyi, A. and Oh, T. The Sobolev inequality on the torus revisited. Publ. Math. Debrecen, 83(3):359–374, October 2013.
Bharadwaj, P., Li, M., and Demanet, L. SymAE: An autoencoder with embedded physical symmetries for passive time-lapse monitoring. In SEG Technical Program Expanded Abstracts 2020. Society of Exploration Geophysicists, 2020.
Bondesan, R. and Lamacraft, A. Learning Symmetries of Classical Integrable Systems. In ICML 2019 Workshop on Theoretical Physics for Deep Learning, 2019.
Bousquet, O., Boucheron, S., and Lugosi, G. Introduction to statistical learning theory. In Bousquet, O., von Luxburg, U., and Rätsch, G. (eds.), Advanced Lectures on Machine Learning, pp. 169–207. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
Caginalp, G. An analysis of a phase field model of a free boundary. Archive for Rational Mechanics and Analysis, 92(3):205–245, 1986.
Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural Ordinary Differential Equations. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Chen, Y. and Hou, T. Y. Function Approximation via the Subsampled Poincaré Inequality. Discrete & Continuous Dynamical Systems - A, 41(1):169–199, August 2020.
Chen, Z., Zhang, J., Arjovsky, M., and Bottou, L. Symplectic Recurrent Neural Networks. In International Conference on Learning Representations (ICLR), 2020.
Cranmer, M., Greydanus, S., Hoyer, S., Battaglia, P., Spergel, D., and Ho, S. Lagrangian Neural Networks. ICLR 2020 Deep Differential Equations Workshop, 2020.
Czarnecki, W. M., Osindero, S., Jaderberg, M., Swirszcz, G., and Pascanu, R. Sobolev training for neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 4281–4290, 2017.
Desai, S. and Roberts, S. VIGN: Variational Integrator Graph Networks. arXiv, 2020.
DiPietro, D. M., Xiong, S., and Zhu, B. Sparse Symplectically Integrated Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Feng, Y., Wang, H., Yang, H., and Wang, F. Time-Continuous Energy-Conservation Neural Network for Structural Dynamics Analysis. arXiv, 2020.
Finzi, M., Stanton, S., Izmailov, P., and Wilson, A. G. Generalizing Convolutional Neural Networks for Equivariance to Lie Groups on Arbitrary Continuous Data. In International Conference on Machine Learning (ICML), pp. 3165–3176, 2020a.
Finzi, M., Wang, K. A., and Wilson, A. G. Simplifying Hamiltonian and Lagrangian Neural Networks via Explicit Constraints. In Advances in Neural Information Processing Systems (NeurIPS), 2020b.
Furihata, D. and Matsuo, T. Discrete Variational Derivative Method: A Structure-Preserving Numerical Method for Partial Differential Equations. Chapman and Hall/CRC, 2010.
Galioto, N. and Gorodetsky, A. A. Bayesian Identification of Hamiltonian Dynamics from Symplectic Data. In IEEE Conference on Decision and Control (CDC), pp. 1190–1195, 2020.
Giné, E. and Nickl, R. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press, 2016.
Golowich, N., Rakhlin, A., and Shamir, O. Size-independent sample complexity of neural networks. In Conference On Learning Theory (COLT), pp. 297–299, 2018.
Greydanus, S., Dzamba, M., and Yosinski, J. Hamiltonian Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Hornik, K., Stinchcombe, M., and White, H. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3(5):551–560, 1990.
Jin, P., Zhang, Z., Kevrekidis, I. G., and Karniadakis, G. E. Learning Poisson systems and trajectories of autonomous systems via Poisson neural networks. arXiv, 2020a.
Jin, P., Zhang, Z., Zhu, A., Tang, Y., and Karniadakis, G. E. SympNets: Intrinsic structure-preserving symplectic networks for identifying Hamiltonian systems. Neural Networks, 132:166–179, 2020b.
Jin, Z., Lin, J. Y.-Y., and Li, S.-F. Learning principle of least action with reinforcement learning. arXiv, 2020c.
Li, S.-H., Dong, C.-X., Zhang, L., and Wang, L. Neural Canonical Transformation with Symplectic Flows. Physical Review X, 10(2):021020, 2020.
Marsden, J. E. and Ratiu, T. S. Introduction to Mechanics and Symmetry: A Basic Exposition of Classical Mechanical Systems. Springer Science & Business Media, March 2013.
Matsubara, T., Ishikawa, A., and Yaguchi, T. Deep Energy-Based Modeling of Discrete-Time Physics. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Miehe, C., Hofacker, M., and Welschinger, F. A phase field model for rate-independent crack propagation: Robust algorithmic implementation based on operator splits. Computer Methods in Applied Mechanics and Engineering, 199(45):2765–2778, 2010.
Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. In Conference on Learning Theory (COLT), pp. 1376–1401, 2015.
Sæmundsson, S., Hofmann, K., Terenin, A., and Deisenroth, M. P. Variational integrator networks for physically meaningful embeddings. In Artificial Intelligence and Statistics (AISTATS), volume 108, pp. 3078–3087, 2019.
Sanchez-Gonzalez, A., Bapst, V., Cranmer, K., and Battaglia, P. Hamiltonian Graph Networks with ODE Integrators. arXiv, 2019.
Scott Dumas, H. The KAM Story: A Friendly Introduction To The Content, History, And Significance Of Classical Kolmogorov-Arnold-Moser Theory. World Scientific Publishing Company, 2014.
Steinbach, I. Phase-field models in materials science. Modelling and Simulation in Materials Science and Engineering, 17(7):073001, 2009.
Steinwart, I. and Christmann, A. Support Vector Machines. Springer Science & Business Media, September 2008.
Teshima, T., Ishikawa, I., Tojo, K., Oono, K., Ikeda, M., and Sugiyama, M. Coupling-based invertible neural networks are universal diffeomorphism approximators. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Toth, P., Rezende, D. J., Jaegle, A., Racanière, S., Botev, A., and Higgins, I. Hamiltonian Generative Networks. In International Conference on Learning Representations (ICLR), 2019.
Wheeler, A. A., Murray, B. T., and Schaefer, R. J. Computation of dendrites using a phase field model. Physica D, 66(1):243–262, 1993.
Xiong, S., Tong, Y., He, X., Yang, C., Yang, S., and Zhu, B. Nonseparable Symplectic Neural Networks. In International Conference on Learning Representations (ICLR), 2021.
Zhong, Y. D., Dey, B., and Chakraborty, A. Symplectic ODE-Net: Learning Hamiltonian Dynamics with Control. In International Conference on Learning Representations (ICLR), 2020.
A Numerical results of the energy-based physical model with latent variables
Although the main contribution of this paper is the theoretical analyses of the deep energy-based physical models, to the best of our knowledge, the model with latent variables
$$\frac{\mathrm{d}x}{\mathrm{d}t} = \left(\frac{\partial u}{\partial x}\right)^{-1} G \left(\frac{\partial u}{\partial x}\right)^{-\top} \frac{\partial H}{\partial x}$$
shown in Section 4 of the main paper is new. For a better understanding of the significance of the theoretical results presented in this paper, we performed numerical tests using this model. However, we would like to emphasize that the proposal of this model is not the main purpose of this study.

Firstly, we consider the double pendulum shown in Figure 3, of which the equations of motion are
$$\frac{\mathrm{d}\theta_1}{\mathrm{d}t} = \varphi_1, \qquad \frac{\mathrm{d}\theta_2}{\mathrm{d}t} = \varphi_2,$$
$$\frac{\mathrm{d}\varphi_1}{\mathrm{d}t} = \frac{g\left(\sin\theta_2\cos(\theta_1-\theta_2) - \frac{m_1+m_2}{m_2}\sin\theta_1\right) - \left(l_1\varphi_1^2\cos(\theta_1-\theta_2) + l_2\varphi_2^2\right)\sin(\theta_1-\theta_2)}{l_1\left(\frac{m_1+m_2}{m_2} - \cos^2(\theta_1-\theta_2)\right)},$$
$$\frac{\mathrm{d}\varphi_2}{\mathrm{d}t} = \frac{g\,\frac{m_1+m_2}{m_2}\left(\sin\theta_1\cos(\theta_1-\theta_2) - \sin\theta_2\right) - \left(l_1\frac{m_1+m_2}{m_2}\varphi_1^2 + l_2\varphi_2^2\cos(\theta_1-\theta_2)\right)\sin(\theta_1-\theta_2)}{l_2\left(\frac{m_1+m_2}{m_2} - \cos^2(\theta_1-\theta_2)\right)}. \qquad (13)$$
The energy function of this system is
$$H = \frac{1}{2}(m_1+m_2)l_1^2\varphi_1^2 + \frac{1}{2}m_2 l_2^2\varphi_2^2 + m_2 l_1 l_2 \varphi_1\varphi_2\cos(\theta_1-\theta_2) + g(m_1+m_2)l_1\cos\theta_1 + g m_2 l_2 \cos\theta_2$$
and the Lagrangian is
$$\mathcal{L} = \frac{1}{2}(m_1+m_2)l_1^2\varphi_1^2 + \frac{1}{2}m_2 l_2^2\varphi_2^2 + m_2 l_1 l_2 \varphi_1\varphi_2\cos(\theta_1-\theta_2) - g(m_1+m_2)l_1\cos\theta_1 - g m_2 l_2 \cos\theta_2.$$
Note that the generalized momenta of this system are not obvious:
$$p_1 = \frac{\partial \mathcal{L}}{\partial \varphi_1} = (m_1+m_2)l_1^2\varphi_1 + m_2 l_1 l_2 \varphi_2\cos(\theta_1-\theta_2),$$
$$p_2 = \frac{\partial \mathcal{L}}{\partial \varphi_2} = m_2 l_2^2\varphi_2 + m_2 l_1 l_2 \varphi_1\cos(\theta_1-\theta_2).$$
Because these are difficult to observe, we suppose that the values of the state variables $\theta_1, \theta_2$ and their derivatives $\varphi_1, \varphi_2$ are given as data. In such a situation, Hamiltonian neural networks (Greydanus et al., 2019),
$$\frac{\mathrm{d}}{\mathrm{d}t}\begin{pmatrix} q_1 \\ q_2 \\ p_1 \\ p_2 \end{pmatrix} = \begin{pmatrix} O & I \\ -I & O \end{pmatrix} \nabla H_{NN}(q_1, q_2, p_1, p_2),$$
are not appropriate because this model can be used only when $p_1$ and $p_2$ are the generalized momenta; in the case considered here, they are supposed to be unknown.

To confirm it, we tested a naive model
$$\frac{\mathrm{d}}{\mathrm{d}t}\begin{pmatrix} q_1 \\ q_2 \\ v_1 \\ v_2 \end{pmatrix} = \begin{pmatrix} O & I \\ -I & O \end{pmatrix} \nabla H_{NN}(q_1, q_2, v_1, v_2), \qquad (14)$$
where $v_1, v_2$ are not the generalized momenta but the velocities $v_1 = \dot q_1$, $v_2 = \dot q_2$, and the proposed model
$$\frac{\mathrm{d}x}{\mathrm{d}t} = \left(\frac{\partial u}{\partial x}\right)^{-1} G \left(\frac{\partial u}{\partial x}\right)^{-\top} \frac{\partial H_{NN}}{\partial x}.$$
In the experiments, the programs are implemented using Python 3.8.5 with the packages PyTorch 1.7.1, numpy 1.19.5, scipy 1.6.0, autograd 1.3, and torchdiffeq 0.2.1. As the data, we used numerical solutions to (13) with the parameters $l_1 = l_2 = 1.0$, $m_1 = 1$, $m_2 = 2$, and $g = 9.8$, solved by the scipy odeint with 100 initial conditions that are randomly generated from the standard normal distribution. We numerically integrated each orbit on a fixed time interval, in which the computed values are evaluated at 100 points with the identical sampling rate. Then the target data
$$\frac{\mathrm{d}}{\mathrm{d}t}(\theta_1, \theta_2, \varphi_1, \varphi_2)^\top$$
are obtained by substituting the computed state variables into the right-hand side of (13). The experiments were performed on a GeForce 2080 Ti. We used the Adam algorithm; the number of the learning steps was 10000 and the batch size was set to 200. In both cases, the energy function is modeled by using a multilayer perceptron with tanh as the activation function. The network has only one hidden layer, of which the size is set to 50.

Figure 3: The double pendulum used as the target in the first experiment.

Figure 4: An example of the predicted orbits by the Hamiltonian neural network and the proposed model. (a) Ground truth. (b) Naive model. (c) Proposed model. Each component of $x(t) = (q_1(t), v_1(t), q_2(t), v_2(t))$ is represented by the red ($q_1$), green ($v_1$), blue ($q_2$), and black ($v_2$) curves.
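For reference, a sketch of the data generation described above, using (13) as reconstructed here; the time grid and the random seed are placeholders since those details are not fully specified in the source:

```python
# Data generation for the double pendulum: integrate (13) with scipy's
# odeint from 100 random initial states; targets are rhs evaluations.
import numpy as np
from scipy.integrate import odeint

l1 = l2 = 1.0
m1, m2 = 1.0, 2.0
g = 9.8
mu = (m1 + m2) / m2

def rhs(s, t):
    th1, th2, ph1, ph2 = s
    c, sn = np.cos(th1 - th2), np.sin(th1 - th2)
    den = mu - c ** 2
    dph1 = (g * (np.sin(th2) * c - mu * np.sin(th1))
            - (l1 * ph1 ** 2 * c + l2 * ph2 ** 2) * sn) / (l1 * den)
    dph2 = (g * mu * (np.sin(th1) * c - np.sin(th2))
            - (l1 * mu * ph1 ** 2 + l2 * ph2 ** 2 * c) * sn) / (l2 * den)
    return [ph1, ph2, dph1, dph2]

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100)                       # placeholder time grid
orbits = [odeint(rhs, rng.standard_normal(4), t) for _ in range(100)]
targets = [np.array([rhs(s, 0.0) for s in orbit]) for orbit in orbits]
```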
Remark 3. For the coordinate transformations in the proposed model, invertible neural networks should be used; however, we used the simple multilayer perceptron here because this is the model assumed in the theorem in the main part of the paper. In the theorem, we needed neural networks that can approximate given functions and their derivatives. Although the universal approximation property of invertible neural networks has been proved recently (Teshima et al., 2020), that theorem shows the universal approximation property of the function values only, not of the derivatives, which are needed for the computation of the Jacobian matrix in our model.
Examples of the predicted orbits are shown in Figure 4. The training losses of these models were 13.6 for the Hamiltonian neural network and 0.280 for the proposed model. As shown in the figure, the naive model failed to capture the dynamics correctly. This is because the dynamics of $\theta_1, \theta_2, \varphi_1, \varphi_2$ cannot be described by (14). This result illustrates that, in order to model physical phenomena by a model of the form
$$\frac{\mathrm{d}x}{\mathrm{d}t} = G\frac{\partial H}{\partial x},$$
the choice of the coordinate system is important.
Mass-spring system.
Due to the well-known chaotic behaviors of the double pendulum, the results of the previous experiments, in particular the values of the losses, are to a certain extent unstable, except for the fact that the Hamiltonian neural networks always failed. Therefore, secondly, we investigated the models in more detail using the simple mass-spring system depicted in Figure 5. The two mass points $m_1$ and $m_2$ are connected by springs, which respectively have the spring constants $k_1$ and $k_2$ and the natural lengths $l_1$ and $l_2$. This system is a Hamiltonian system with the energy function
$$H(q_1, q_2, p_1, p_2) = \frac{p_1^2}{2m_1} + \frac{p_2^2}{2m_2} + \frac{k_1}{2}(q_1 - l_1)^2 + \frac{k_2}{2}(q_2 - q_1 - l_2)^2, \qquad (15)$$
where $q_1, q_2$ are the positions of the mass points and $p_1, p_2$ are the momenta, which are defined by $p_1 = m_1 v_1$, $p_2 = m_2 v_2$, $v_1 = \mathrm{d}q_1/\mathrm{d}t$, $v_2 = \mathrm{d}q_2/\mathrm{d}t$. Suppose that we do not know the exact values of $m_1$ and $m_2$ and that only the positions $q_1$ and $q_2$ and their derivatives can be observed. Although $m_1$ and $m_2$ may be estimated from the data, for the evaluation of the models, we tried to model the dynamics only using $q_1$, $q_2$, and their derivatives.

We examined the naive model, the proposed model, and the neural ordinary differential equation (Chen et al., 2018)
$$\frac{\mathrm{d}}{\mathrm{d}t}\begin{pmatrix} q_1 \\ q_2 \\ v_1 \\ v_2 \end{pmatrix} = f_{NN}(q_1, q_2, v_1, v_2).$$
The condition of the experiment is almost the same as the previous one, except for the batch size, which we set to 100 in this experiment. As the data, we used numerical solutions to
$$\frac{\mathrm{d}}{\mathrm{d}t}\begin{pmatrix} q_1 \\ q_2 \\ v_1 \\ v_2 \end{pmatrix} = \begin{pmatrix} v_1 \\ v_2 \\ -\frac{k_1}{m_1}(q_1 - l_1) + \frac{k_2}{m_1}(q_2 - q_1 - l_2) \\ -\frac{k_2}{m_2}(q_2 - q_1 - l_2) \end{pmatrix}, \qquad (16)$$
which is equivalent to (15).

Examples of the predicted orbits are shown in Figure 6. While the proposed method gave almost exactly the same orbit as the ground truth, the naive model failed to predict the states. In fact, it should be impossible to rewrite the equation (16) in the form (14) with a certain energy function. For example, when the system has just one mass point and the equation of motion is given by
$$\frac{\mathrm{d}}{\mathrm{d}t}\begin{pmatrix} q \\ v \end{pmatrix} = \begin{pmatrix} v \\ -\frac{k}{m}(q - l) \end{pmatrix},$$
this can be transformed into a Hamiltonian system
$$\frac{\mathrm{d}}{\mathrm{d}t}\begin{pmatrix} q \\ v \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix} \nabla \tilde H, \qquad \tilde H = \frac{m' v^2}{2} + \frac{k'}{2}(q - l)^2$$
with $k' = k/m$ and the mass $m' = 1$. Hence, for this system, the Hamiltonian neural networks are applicable without the knowledge of $m$. However, for the system (16) such a transformation cannot be applied. Meanwhile, in the proposed model, a coordinate transformation that makes the equation Hamiltonian is explored.

The result by the neural ODE is better than that of the naive model; however, the prediction error increases as time becomes large. This is due to the non-existence of the energy conservation law for the neural ODE. In fact, the value of the energy function of the neural ODE model was increasing, as shown in Figure 7.

The losses for the above models are shown in Table 1. We performed twelve experiments for each model and the losses for these experiments are listed. The values of the loss functions of the neural ODE are very small and those of the

Figure 5: The target mass-spring system used in the second experiment.
Figure 6: An example of the predicted orbits by the neural ODEs (NODE), Hamiltonian neural networks, and the proposed model. (a) Ground truth. (b) NODE. (c) Naive model. (d) Proposed model. Each component of $x(t) = (q_1(t), v_1(t), q_2(t), v_2(t))$ is represented by the red ($q_1$), green ($v_1$), blue ($q_2$), and black ($v_2$) curves.

Figure 7: The values of the energy function predicted by the neural ODE model and by the proposed model. (a) NODE. (b) Proposed model.