A Priori Generalization Analysis of the Deep Ritz Method for Solving High Dimensional Elliptic Equations
JIANFENG LU, YULONG LU, AND MIN WANG
Abstract.
This paper concerns the a priori generalization analysis of the Deep Ritz Method (DRM) [W. E and B. Yu, 2017], a popular neural-network-based method for solving high-dimensional partial differential equations. We derive the generalization error bounds of two-layer neural networks in the framework of the DRM for solving two prototype elliptic PDEs: the Poisson equation and the static Schrödinger equation on the $d$-dimensional unit hypercube. Specifically, we prove that the convergence rates of the generalization errors are independent of the dimension $d$, under the a priori assumption that the exact solutions of the PDEs lie in a suitable low-complexity space called the spectral Barron space. Moreover, we give sufficient conditions on the forcing term and the potential function which guarantee that the solutions are spectral Barron functions. We achieve this by developing a new solution theory for the PDEs on the spectral Barron space, which can be viewed as an analog of the classical Sobolev regularity theory for PDEs.

1. Introduction
Numerical solution of high dimensional partial differential equations (PDEs) has been a long-standing challenge in scientific computing. The impressive advance of deep learning has offered exciting possibilities for algorithmic innovation. In particular, it is a natural idea to represent solutions of PDEs by (deep) neural networks to exploit the rich expressiveness of neural network representations. The parameters of the neural networks are then trained by optimizing a loss function associated with the PDE. Natural loss functions can be designed using the variational structure, similar to the Ritz-Galerkin method in classical numerical analysis of PDEs. Such a method is known as the Deep Ritz Method (DRM) [13, 22]. Methods in a similar spirit have also been developed in the computational physics literature [4] for solving eigenvalue problems arising from many-body quantum mechanics, under the framework of the variational Monte Carlo method [28].

Despite the wide popularity and many successful applications of the DRM and other approaches that use neural networks to solve high-dimensional PDEs, the analysis of such methods is scarce and still not well understood. This paper aims to provide an a priori generalization error analysis of the DRM with dimension-explicit estimates.

Generally speaking, the error of using neural networks to solve high dimensional PDEs can be decomposed into the following parts:
• Approximation error: this is the error of approximating the solution of a PDE using neural networks;
• Generalization error: this refers to the error of the neural-network-based approximate solution on predicting unseen data. The variational problem involves integrals in high dimension, which can be expensive to compute. In practice Monte Carlo methods are usually used to approximate those high dimensional integrals, and thus the minimizer of the surrogate model (known as empirical risk minimization) would differ from the minimizer of the original variational problem;
Date: January 6, 2021. J.L. and M.W. are supported in part by the National Science Foundation via grants DMS-2012286 and CCF-1934964. Y.L. is supported by the start-up fund of the Department of Mathematics and Statistics at UMass Amherst.

• Training (or optimization) error: this is the error incurred by the optimization algorithm used in training the neural networks for PDEs. Since the parameters of the neural networks are obtained through an optimization process, the process might not find the best approximation to the unknown solution within the function class.

Note that from a numerical analysis point of view, these errors already appear for conventional Galerkin methods. Indeed, taking finite element methods as an example, the approximation error is the error of approximating the true solution in the finite element space; the generalization error can be seen as the discretization error caused by numerical quadrature of the variational formulation; the optimization error corresponds to the computational error in conventional numerical PDEs due to the inaccurate resolution of the linear or nonlinear finite dimensional discrete system.

Although classical numerical analysis for PDEs in low dimensions has formed a relatively complete theory over the last several decades, the error analysis of neural network methods is much more challenging for high dimensional PDEs and requires new ideas and tools. In fact, the three components of the error analysis highlighted above all face new difficulties.

For approximation, as is well known, high dimensional problems suffer from the curse of dimensionality if we proceed with standard regularity-based function spaces such as Sobolev spaces or Hölder spaces as in conventional numerical analysis. In fact, even using deep neural networks, the approximation rate for functions in such spaces deteriorates as the dimension becomes higher; see [43, 44].
Therefore, to obtain better approximation rates that scale mildly with large dimensionality, it is natural to assume that the function of interest lies in a suitably smaller function space which has low complexity compared to Sobolev or Hölder spaces, so that the function can be efficiently approximated by neural networks in high dimensions. The first function class of this kind is the so-called
Barron space defined in the seminal work of Barron [2]; see also [11, 23, 37, 38] for more variants of Barron spaces and their neural-network approximation properties. In the present paper we will introduce a discrete version of Barron's definition of such a space using the idea of spectral decomposition, and because of this we adopt the terminology of spectral Barron space, following [10, 38], to distinguish it from the other versions. As the Barron spaces are very different from the usual Sobolev spaces, for PDE problems one has to develop novel a priori estimates and the corresponding approximation error analysis. In particular, a new solution theory for high dimensional PDEs in those low-complexity function spaces needs to be developed. This paper makes an initial attempt at establishing a solution theory in the spectral Barron space for a class of elliptic PDEs.

The analysis of the generalization error is also intimately related to the function class (e.g. neural networks) we use, in particular its complexity. This makes the generalization analysis quite different from the analysis of numerical quadrature error in a usual finite element method. We face a trade-off between approximation and generalization: to reduce the approximation error, one would like to use an approximation ansatz with a large number of degrees of freedom; however, such a choice will incur a large generalization error.

The training of neural networks also remains a very challenging problem since the associated optimization problem is highly non-convex. In fact, even in a standard supervised learning setting, we still largely lack understanding of the optimization error, except in simplified settings where the optimization dynamics is essentially linear (see e.g., [6, 15, 21]).
The analysis for PDE problems would face similar, if not severer, difficulties, and it is beyond the scope of our current work.

In this work, we provide a rigorous analysis of the approximation and generalization errors of the DRM for high dimensional elliptic PDEs. We will focus on relatively simple PDEs (the Poisson equation and the static Schrödinger equation) to better convey the idea and illustrate the framework, without bogging the reader down with technical details. Our analysis, which, as already suggested by the discussion above, is based on identifying a correct functional analytic setup and developing the corresponding a priori analysis and complexity estimates, provides dimension-independent generalization error estimates.

1.1. Related Works.
Several previous works on the analysis of neural-network-based methods for high-dimensional PDEs focus on the aspect of representation, i.e., whether a solution to the PDE can be approximated by a neural network with quantitative error control; see e.g., [17, 20]. Fixing an approximation space, the generalization error can be controlled by analyzing complexity measures such as covering numbers; see e.g., [3] for a specific PDE problem. More recently, several papers [26, 30, 35, 36] considered the generalization error analysis of the physics-informed neural network (PINN) approach based on residual minimization for solving PDEs [24, 33]. In particular, the work [35] established the consistency of the loss function, in the sense that the approximation converges to the true solution as the training sample grows, under the assumption of vanishing training error. For the generalization error, Mishra and Molinaro [30] carried out an a-posteriori-type generalization error analysis for PINNs, and proved that the generalization error is bounded by the training error and the quadrature error under some stability assumptions on the PDEs. To avoid the curse of dimensionality in the quadrature error, the authors also considered the cumulative generalization error, which involves a validation set. The paper [36] proved both a priori and a posteriori estimates for residual minimization methods in Sobolev spaces. The paper [26] obtained a priori generalization estimates for a class of second order linear PDEs by assuming (but without verifying) that the exact solutions of the PDEs belong to a Barron-type space introduced in [11].

Different from the previous generalization error analyses, we derive a priori and dimension-explicit generalization error estimates under the assumption that the solutions of the PDEs lie in the spectral Barron space, which is more aligned with [2]. Moreover, we justify this assumption by developing a novel solution theory in the spectral Barron space for the PDEs under consideration.
This regularity theory is the main difference between our work and the above mentioned ones.

It is worth mentioning that in a very recent preprint [12], E and Wojtowytsch considered the regularity theory of high dimensional PDEs on the whole space (including the screened Poisson equation, the heat equation, and a viscous Hamilton-Jacobi equation) in the Barron space introduced by [11]. Their result is similar in spirit to our analysis of PDE regularity theory in the spectral Barron space (Theorem 2.5 for the Poisson equation and Theorem 2.6 for the static Schrödinger equation), while we focus on PDEs on a finite domain and, as a result, have to develop Barron function spaces different from those used for the whole space. The authors of [12] also provided some counterexamples to regularity theory for PDE problems defined on non-convex domains, whereas we only focus on a simple domain (in fact, hypercubes) in this work.

While we focus on the variational-principle-based approach for solving high dimensional PDEs using neural networks, we note that many other approaches have been developed, such as the deep BSDE method based on the control formulation of parabolic PDEs [18], the deep Galerkin method based on the weak formulation [39], methods based on the strong formulation (residual minimization) such as the PINNs [24, 33], and the diffusion Monte Carlo type approach for high-dimensional eigenvalue problems [19], just to name a few. It would be an interesting future direction to extend our analysis to these methods.

1.2. Our Contributions.
We analyze the generalization error of two-layer neural networks for solving two simple elliptic PDEs in the framework of the DRM. Specifically we make the following contributions:
• We define a spectral Barron space $\mathcal{B}^s(\Omega)$ on the $d$-dimensional unit hypercube $\Omega = [0,1]^d$ that extends Barron's original function space [2] from the whole space to a bounded domain; see the definition (2.10). In the generalization theory we develop, we assume that the solutions lie in the spectral Barron space.
• We show that the spectral Barron functions in $\mathcal{B}^2(\Omega)$ can be well approximated in the $H^1$-norm by two-layer neural networks with either ReLU or Softplus activation functions without the curse of dimensionality. Moreover, the parameters (weights and biases) of the two-layer neural networks are controlled explicitly in terms of the spectral Barron norm. The bounds on the neural-network parameters play an essential role in controlling the generalization error of the neural nets. See Theorem 2.1 and Theorem 2.2 for the approximation results.
• We derive generalization error bounds for the neural-network solutions of the Poisson equation and the static Schrödinger equation under the assumption that the solutions belong to the Barron space $\mathcal{B}^2(\Omega)$. We emphasize that the convergence rates in our generalization error bounds are dimension-independent and that the prefactors in the error estimates depend at most polynomially on the dimension and the Barron norms of the solutions, indicating that the DRM overcomes the curse of dimensionality when the solutions of the PDEs are spectral Barron functions. See Theorem 2.3 and Theorem 2.4 for the generalization results.
• Last but not least, we develop a new well-posedness theory for the solutions of the Poisson and static Schrödinger equations in the spectral Barron space, providing sufficient conditions that verify the earlier assumption on the solutions made in the generalization analysis.
The new solution theory can be viewed as an analog of the classical PDE theory in Sobolev or Hölder spaces. See Theorem 2.5 and Theorem 2.6 for the new solution theory in the spectral Barron space.

2. Set-Up and Main Results

2.1. Set-Up of PDEs.
Let
$\Omega = [0,1]^d$ be the unit hypercube in $\mathbb{R}^d$ and let $\partial\Omega$ be the boundary of $\Omega$. We consider the following two prototype elliptic PDEs on $\Omega$ equipped with the Neumann boundary condition: the Poisson equation

(2.1)  $-\Delta u = f$ on $\Omega$, $\quad \partial u/\partial\nu = 0$ on $\partial\Omega$,

and the static Schrödinger equation

(2.2)  $-\Delta u + Vu = f$ on $\Omega$, $\quad \partial u/\partial\nu = 0$ on $\partial\Omega$.

Throughout the paper, we make the minimal assumption that $f \in L^2(\Omega)$ and $V \in L^\infty(\Omega)$ with $V(x) \ge V_{\min} > 0$, although later we will impose stronger regularity assumptions on $f$ and $V$. In particular, in our high dimensional setting, we would certainly need to restrict the class of $f$ and $V$; otherwise just prescribing such general functions numerically would already incur the curse of dimensionality. The well-posedness of the solutions to the Poisson equation and the static Schrödinger equation in the Sobolev space $H^1(\Omega)$, as well as the variational characterizations of the solutions, are well-known and are summarized in the proposition below, whose proof can be found in Appendix A.

Proposition 2.1. (i) Assume that $f \in L^2(\Omega)$ with $\int_\Omega f\,dx = 0$. Then there exists a unique weak solution $u^*_P \in H^1_\diamond(\Omega) := \{u \in H^1(\Omega) \mid \int_\Omega u\,dx = 0\}$ to the Poisson equation (2.1). Moreover, we have that

(2.3)  $u^*_P = \arg\min_{u \in H^1(\Omega)} \mathcal{E}_P(u) := \arg\min_{u \in H^1(\Omega)} \Big\{ \frac{1}{2}\int_\Omega |\nabla u|^2\,dx + \frac{1}{2}\Big(\int_\Omega u\,dx\Big)^2 - \int_\Omega fu\,dx \Big\}$,

and that for any $u \in H^1(\Omega)$,

(2.4)  $\mathcal{E}_P(u) - \mathcal{E}_P(u^*_P) \le \|u - u^*_P\|^2_{H^1(\Omega)} \le 2\max\{2C_P^2 + 1,\, 2\}\,\big(\mathcal{E}_P(u) - \mathcal{E}_P(u^*_P)\big)$,

where $C_P$ is the Poincaré constant on the domain $\Omega$, i.e., for any $v \in H^1(\Omega)$,

$\Big\| v - \int_\Omega v\,dx \Big\|_{L^2(\Omega)} \le C_P \|\nabla v\|_{L^2(\Omega)}$.

(ii) Assume that $f, V \in L^\infty(\Omega)$ and that $0 < V_{\min} \le V(x) \le V_{\max} < \infty$ for some constants $V_{\min}$ and $V_{\max}$. Then there exists a unique weak solution $u^*_S \in H^1(\Omega)$ to the static Schrödinger equation (2.2).
Moreover, we have that

(2.5)  $u^*_S = \arg\min_{u \in H^1(\Omega)} \mathcal{E}_S(u) := \arg\min_{u \in H^1(\Omega)} \Big\{ \frac{1}{2}\int_\Omega \big(|\nabla u|^2 + V|u|^2\big)\,dx - \int_\Omega fu\,dx \Big\}$,

and that for any $u \in H^1(\Omega)$,

(2.6)  $\frac{2}{\max\{1, V_{\max}\}}\big(\mathcal{E}_S(u) - \mathcal{E}_S(u^*_S)\big) \le \|u - u^*_S\|^2_{H^1(\Omega)} \le \frac{2}{\min\{1, V_{\min}\}}\big(\mathcal{E}_S(u) - \mathcal{E}_S(u^*_S)\big)$.

The variational formulations (2.3) and (2.5) are the basis of the DRM [13] for solving these PDEs. The main idea is to train neural networks to minimize the (population) loss defined by the Ritz energy functional $\mathcal{E}$. More specifically, let $\mathcal{F} \subset H^1(\Omega)$ be a hypothesis function class parameterized by neural networks. The DRM seeks the optimal solution to the population loss $\mathcal{E}$ within the hypothesis space $\mathcal{F}$. However, the population loss requires the evaluation of $d$-dimensional integrals, which can be prohibitively expensive when $d \gg 1$ if traditional quadrature methods were used. To circumvent the curse of dimensionality, it is natural to employ the Monte Carlo method for computing the high dimensional integrals, which leads to the so-called empirical loss (or risk) minimization.

2.2. Empirical Loss Minimization.
Let us denote by $\mathcal{P}_\Omega$ the uniform probability distribution on the domain $\Omega$. Then the loss functionals $\mathcal{E}_P$ and $\mathcal{E}_S$ can be rewritten in terms of expectations under $\mathcal{P}_\Omega$ as

$\mathcal{E}_P(u) = |\Omega| \cdot \mathbb{E}_{X \sim \mathcal{P}_\Omega}\Big[\tfrac{1}{2}|\nabla u(X)|^2 - f(X)u(X)\Big] + \tfrac{1}{2}\Big(|\Omega| \cdot \mathbb{E}_{X \sim \mathcal{P}_\Omega} u(X)\Big)^2$,

$\mathcal{E}_S(u) = |\Omega| \cdot \mathbb{E}_{X \sim \mathcal{P}_\Omega}\Big[\tfrac{1}{2}|\nabla u(X)|^2 + \tfrac{1}{2}V(X)|u(X)|^2 - f(X)u(X)\Big]$.

To define the empirical loss, let $\{X_j\}_{j=1}^n$ be an i.i.d. sequence of random variables distributed according to $\mathcal{P}_\Omega$. Define the empirical losses $\mathcal{E}_{n,P}$ and $\mathcal{E}_{n,S}$ by setting

(2.7)  $\mathcal{E}_{n,P}(u) = \frac{1}{n}\sum_{j=1}^n \Big[|\Omega| \cdot \Big(\tfrac{1}{2}|\nabla u(X_j)|^2 - f(X_j)u(X_j)\Big)\Big] + \tfrac{1}{2}\Big(\frac{|\Omega|}{n}\sum_{j=1}^n u(X_j)\Big)^2$,

$\mathcal{E}_{n,S}(u) = \frac{1}{n}\sum_{j=1}^n \Big[|\Omega| \cdot \Big(\tfrac{1}{2}|\nabla u(X_j)|^2 + \tfrac{1}{2}V(X_j)|u(X_j)|^2 - f(X_j)u(X_j)\Big)\Big]$.

Given an empirical loss $\mathcal{E}_n$, the empirical loss minimization algorithm seeks $u_n$ which minimizes $\mathcal{E}_n$, i.e.

(2.8)  $u_n = \arg\min_{u \in \mathcal{F}} \mathcal{E}_n(u)$.

Here we have suppressed the dependence of $u_n$ on $\mathcal{F}$. We denote by $u_{n,P}$ and $u_{n,S}$ the minimizers of the empirical losses $\mathcal{E}_{n,P}$ and $\mathcal{E}_{n,S}$, respectively.
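The empirical losses above are plain Monte Carlo averages and can be sketched in a few lines of numpy. The following is our own illustration, not code from the paper; the function names are ours and we take $|\Omega| = 1$, writing the empirical Schrödinger loss as in (2.7):

```python
import numpy as np

def empirical_schrodinger_loss(u, grad_u, f, V, X):
    """Monte Carlo estimate of E_S(u) = int_Omega (|grad u|^2 + V u^2)/2 - f*u dx
    from samples X ~ Uniform([0,1]^d), as in (2.7) with |Omega| = 1."""
    uX = u(X)                                    # shape (n,)
    gX = grad_u(X)                               # shape (n, d)
    integrand = 0.5 * (np.sum(gX**2, axis=1) + V(X) * uX**2) - f(X) * uX
    return integrand.mean()

# sanity check in d = 1: u(x) = cos(pi x) solves -u'' + u = (pi^2 + 1) cos(pi x)
# with Neumann boundary conditions, and its exact Ritz energy is -(pi^2 + 1)/4
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200_000, 1))
u = lambda X: np.cos(np.pi * X[:, 0])
grad_u = lambda X: -np.pi * np.sin(np.pi * X[:, 0])[:, None]
V = lambda X: np.ones(X.shape[0])
f = lambda X: (np.pi**2 + 1.0) * np.cos(np.pi * X[:, 0])
print(empirical_schrodinger_loss(u, grad_u, f, V, X))  # close to -(pi^2+1)/4
```

The Monte Carlo average converges at the dimension-independent rate $O(n^{-1/2})$, which is the reason the empirical loss replaces quadrature in high dimension.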
2.3. Main Results.
The goal of the present paper is to obtain quantitative estimates for the generalization error between the minimizers $u_{n,S}$ and $u_{n,P}$ computed from the finite data points $\{X_j\}_{j=1}^n$ and the exact solutions when the spatial dimension $d$ is large. Our primary interest is to derive such estimates that scale mildly with the increasing dimension $d$. To this end, it is necessary to assume that the true solutions lie in a smaller space which has a lower complexity than Sobolev spaces. Specifically we will consider the spectral Barron space defined below via the cosine expansion.

Let $\mathcal{C}$ be the set of cosine functions defined by

(2.9)  $\mathcal{C} := \{\Phi_k\}_{k \in \mathbb{N}^d} := \Big\{ \prod_{i=1}^d \cos(\pi k_i x_i) \mid k_i \in \mathbb{N} \Big\}$.

Given $u \in L^1(\Omega)$, let $\{\hat{u}(k)\}_{k \in \mathbb{N}^d}$ be the expansion coefficients of $u$ under the basis $\{\Phi_k\}_{k \in \mathbb{N}^d}$. Let us define for $s \ge 0$ the spectral Barron space $\mathcal{B}^s(\Omega)$ on $\Omega$ by

(2.10)  $\mathcal{B}^s(\Omega) := \Big\{ u \in L^1(\Omega) : \sum_{k \in \mathbb{N}^d} (1 + \pi^s|k|_1^s)\,|\hat{u}(k)| < \infty \Big\}$.

The spectral Barron norm of a function $u \in \mathcal{B}^s(\Omega)$ is given by

$\|u\|_{\mathcal{B}^s(\Omega)} = \sum_{k \in \mathbb{N}^d} (1 + \pi^s|k|_1^s)\,|\hat{u}(k)|$.

Observe that a function $u \in \mathcal{B}^s(\Omega)$ if and only if $\{\hat{u}(k)\}_{k \in \mathbb{N}^d}$ belongs to the weighted $\ell^1$-space $\ell^1_{W_s}(\mathbb{N}^d)$ on the lattice $\mathbb{N}^d$ with the weights $W_s(k) = 1 + \pi^s|k|_1^s$. When $s = 2$, we adopt the short notation $\mathcal{B}(\Omega)$ for $\mathcal{B}^2(\Omega)$. Our definition of the spectral Barron space is strongly motivated by the seminal work of Barron [2] and other recent works [11, 23, 37]. The original Barron function $f$ in [2] is defined on the whole space $\mathbb{R}^d$ with a Fourier transform $\hat{f}(\omega)$ satisfying $\int |\hat{f}(\omega)||\omega|\,d\omega < \infty$.
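For a function with only finitely many cosine modes, the norm in (2.10) is a finite sum and can be evaluated directly. A minimal sketch of ours (assuming the weight $W_s(k) = 1 + \pi^s|k|_1^s$ as above; the function name is ours):

```python
import numpy as np

def spectral_barron_norm(u_hat, s=2):
    """Spectral Barron norm  sum_k (1 + pi^s |k|_1^s) |u_hat(k)|  of a function
    with finitely many cosine modes; u_hat maps multi-indices k in N^d
    (tuples) to the cosine expansion coefficients, and |k|_1 = sum_i k_i."""
    return sum((1.0 + np.pi**s * sum(k)**s) * abs(c) for k, c in u_hat.items())

# u(x) = 3 + 0.5 cos(pi x1) cos(2 pi x2) on the unit square [0,1]^2
u_hat = {(0, 0): 3.0, (1, 2): 0.5}
print(spectral_barron_norm(u_hat, s=2))   # 3 + 0.5*(1 + 9*pi^2), about 47.91
```

With $s = 2$ this is the norm $\|\cdot\|_{\mathcal{B}(\Omega)}$ appearing in the approximation and generalization theorems below.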
Our spectral Barron space $\mathcal{B}^s(\Omega)$ with $s = 1$ can be viewed as a discrete analog of the original Barron space from [2].

The most important property of Barron functions is that they can be well approximated by two-layer neural networks without the curse of dimensionality. To make this precise, let us define the class of two-layer neural networks to be used as our hypothesis space for solving PDEs. Given an activation function $\phi$, a constant $B > 0$ and the number of hidden neurons $m$, we define

(2.11)  $\mathcal{F}_{\phi,m}(B) := \Big\{ c + \sum_{i=1}^m \gamma_i\,\phi(\omega_i \cdot x - t_i) : |c| \le B,\ \|\omega_i\|_1 = 1,\ |t_i| \le 1,\ \sum_{i=1}^m |\gamma_i| \le B \Big\}$.

Our first result concerns the approximation of spectral Barron functions in $\mathcal{B}^2(\Omega)$ by two-layer neural networks with the ReLU activation function.
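A member of this class is just a one-hidden-layer network with $\ell^1$-normalized inner weights and bounded bias and outer weights. The following sketch (our own illustration, with the constraint values as reconstructed in (2.11)) evaluates such a network and checks the parameter constraints:

```python
import numpy as np

def two_layer_net(x, c, gamma, omega, t, phi):
    """Evaluate  c + sum_i gamma_i * phi(omega_i . x - t_i)  as in (2.11).
    x: (n, d) inputs, omega: (m, d) inner weights, t: (m,) biases, gamma: (m,)."""
    return c + phi(x @ omega.T - t) @ gamma

def in_class(c, gamma, omega, t, B):
    """Check the parameter constraints defining F_{phi,m}(B)."""
    return (abs(c) <= B
            and np.allclose(np.abs(omega).sum(axis=1), 1.0)  # ||omega_i||_1 = 1
            and np.all(np.abs(t) <= 1.0)                     # |t_i| <= 1
            and np.abs(gamma).sum() <= B)                    # sum_i |gamma_i| <= B

rng = np.random.default_rng(1)
d, m, B = 5, 8, 2.0
omega = rng.normal(size=(m, d))
omega /= np.abs(omega).sum(axis=1, keepdims=True)   # l1-normalize each row
t = rng.uniform(-1.0, 1.0, size=m)
gamma = rng.uniform(-1.0, 1.0, size=m)
gamma *= B / (2.0 * np.abs(gamma).sum())            # enforce sum |gamma_i| = B/2
relu = lambda z: np.maximum(z, 0.0)

x = rng.uniform(0.0, 1.0, size=(10, d))
print(in_class(0.5, gamma, omega, t, B), two_layer_net(x, 0.5, gamma, omega, t, relu).shape)
```

The point of the explicit bounds on $c$, $t_i$, $\gamma_i$ is that they later control the Rademacher complexity of the hypothesis class.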
Theorem 2.1.
Consider the class of two-layer ReLU neural networks

(2.12)  $\mathcal{F}_{\mathrm{ReLU},m}(B) := \Big\{ c + \sum_{i=1}^m \gamma_i\,\mathrm{ReLU}(\omega_i \cdot x - t_i) : |c| \le B,\ \|\omega_i\|_1 = 1,\ |t_i| \le 1,\ \sum_{i=1}^m |\gamma_i| \le B \Big\}$.

Then for any $u \in \mathcal{B}^2(\Omega)$, there exists $u_m \in \mathcal{F}_{\mathrm{ReLU},m}(\|u\|_{\mathcal{B}^2(\Omega)})$ such that

$\|u - u_m\|_{H^1(\Omega)} \le \frac{C\,\|u\|_{\mathcal{B}^2(\Omega)}}{\sqrt{m}}$

for an absolute constant $C > 0$.

A similar approximation result was first proved in the seminal paper of Barron [2], where the same approximation rate $O(m^{-1/2})$ was obtained when approximating a Barron function defined on the whole space by two-layer neural nets with the sigmoid activation function in the $L^\infty$-norm. Results of this kind were also obtained in the recent works [11,
23, 37]. In particular, the same convergence rate was proved for approximating functions $f$ with $\|f\|_{\mathcal{B}^s} = \int_{\mathbb{R}^d} |\hat{f}(\omega)|(1 + |\omega|)^s\,d\omega < \infty$ in Sobolev norms by two-layer networks with a general class of activation functions satisfying a polynomial decay condition. The convergence rate $O(m^{-1/2})$ was recently improved to $O(m^{-(1/2 + \delta(d))})$ with $\delta(d) > 0$ depending on $d$ in [38] when $\mathrm{ReLU}^k$ or cosine is used as the activation function. Moreover, the rate has been proved to be sharp in Sobolev norms when the index $s$ of the Barron space and that of the Sobolev norm lie in an appropriate regime.

Although the function class $\mathcal{F}_{\mathrm{ReLU},m}(B)$ can be used to approximate functions in $\mathcal{B}^2(\Omega)$ without the curse of dimensionality, it brings several issues to both theory and computation if used as the hypothesis class for solving PDEs. On the one hand, the set $\mathcal{F}_{\mathrm{ReLU},m} \subset H^1(\Omega)$ consists only of piecewise affine functions, which may be undesirable in some PDE problems if the function of interest is expected to be more regular or smooth. On the other hand, the fact that $\mathcal{F}_{\mathrm{ReLU},m}$ only admits first order weak derivatives makes it extremely difficult to bound the complexity of function classes involving derivatives of functions from $\mathcal{F}_{\mathrm{ReLU},m}$, whereas the latter is a crucial ingredient for obtaining a generalization bound for the DRM.

To resolve these issues, in what follows we will instead consider a class of two-layer neural networks with the Softplus [9, 16] activation function. Recall the Softplus function $\mathrm{SP}(z) = \ln(1 + e^z)$ and its rescaled version $\mathrm{SP}_\tau(z)$, defined for $\tau > 0$ by

$\mathrm{SP}_\tau(z) = \frac{1}{\tau}\mathrm{SP}(\tau z) = \frac{1}{\tau}\ln(1 + e^{\tau z})$.

Observe that the rescaled Softplus $\mathrm{SP}_\tau(z)$ can be viewed as a smooth approximation of the ReLU function, since $\mathrm{SP}_\tau(z) \to \mathrm{ReLU}(z)$ as $\tau \to \infty$ for any $z \in \mathbb{R}$ (see Lemma 4.6 for a quantitative statement).
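The convergence of $\mathrm{SP}_\tau$ to ReLU is easy to check numerically: since $\mathrm{SP}(z) - \mathrm{ReLU}(z) = \ln(1 + e^{-|z|})$, the uniform gap is exactly $\ln 2/\tau$, attained at $z = 0$. A small illustration of ours (not the paper's Lemma 4.6 itself):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softplus_tau(z, tau):
    """Rescaled Softplus SP_tau(z) = ln(1 + exp(tau*z)) / tau, computed in the
    numerically stable form ln(1 + e^y) = max(y, 0) + log1p(e^{-|y|})."""
    y = tau * z
    return (np.maximum(y, 0.0) + np.log1p(np.exp(-np.abs(y)))) / tau

z = np.linspace(-3.0, 3.0, 2001)         # grid containing z = 0
for tau in (1.0, 10.0, 100.0):
    gap = np.max(np.abs(softplus_tau(z, tau) - relu(z)))
    print(tau, gap)                       # gap equals ln(2)/tau
```

In particular, with $\tau = \sqrt{m}$ as in Theorem 2.2 below, replacing ReLU by Softplus changes the network values by at most $O(m^{-1/2})$ uniformly.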
Moreover, two-layer neural networks with the activation function $\mathrm{SP}_\tau$ satisfy an approximation result similar to Theorem 2.1 when approximating spectral Barron functions in $\mathcal{B}^2(\Omega)$, as shown in the next theorem.

Theorem 2.2.
Consider the class of two-layer Softplus neural networks

(2.13)  $\mathcal{F}_{\mathrm{SP}_\tau,m}(B) := \Big\{ c + \sum_{i=1}^m \gamma_i\,\mathrm{SP}_\tau(\omega_i \cdot x - t_i) : |c| \le B,\ \|\omega_i\|_1 = 1,\ |t_i| \le 1,\ \sum_{i=1}^m |\gamma_i| \le B \Big\}$.

Then for any $u \in \mathcal{B}^2(\Omega)$, there exists a two-layer neural network $u_m \in \mathcal{F}_{\mathrm{SP}_\tau,m}(\|u\|_{\mathcal{B}^2(\Omega)})$ with activation function $\mathrm{SP}_\tau$, $\tau = \sqrt{m}$, such that

$\|u - u_m\|_{H^1(\Omega)} \le \frac{\|u\|_{\mathcal{B}^2(\Omega)}\,(6\log m + 30)}{\sqrt{m}}$.

The proofs of Theorem 2.1 and Theorem 2.2 can be found in Section 4. Now we are ready to state the main generalization results for two-layer neural networks solving the Poisson and static Schrödinger equations. We start with the generalization error bound for the neural-network solution in the Poisson case.
Theorem 2.3.
Assume that the solution $u^*_P$ of the Neumann problem for the Poisson equation (2.1) satisfies $\|u^*_P\|_{\mathcal{B}^2(\Omega)} < \infty$. Let $u^m_{n,P}$ be the minimizer of the empirical loss $\mathcal{E}_{n,P}$ in the set $\mathcal{F} = \mathcal{F}_{\mathrm{SP}_\tau,m}(\|u^*_P\|_{\mathcal{B}^2(\Omega)})$ with $\tau = \sqrt{m}$. Then it holds that

(2.14)  $\mathbb{E}\big[\mathcal{E}_P(u^m_{n,P}) - \mathcal{E}_P(u^*_P)\big] \le \frac{C_1\sqrt{m}\,(\sqrt{\log m} + 1)}{\sqrt{n}} + \frac{C_2(\log m + 1)^2}{m}$.

Here $C_1 > 0$ depends polynomially on $\|u^*_P\|_{\mathcal{B}^2(\Omega)}$, $d$, $\|f\|_{L^\infty(\Omega)}$, and $C_2 > 0$ depends quadratically on $\|u^*_P\|_{\mathcal{B}^2(\Omega)}$. In particular, setting $m = n^{1/3}$ in (2.14), which balances the two error terms, leads to

$\mathbb{E}\big[\mathcal{E}_P(u^m_{n,P}) - \mathcal{E}_P(u^*_P)\big] \le \frac{C(\log n)^2}{n^{1/3}}$

for some $C > 0$ depending only polynomially on $\|u^*_P\|_{\mathcal{B}^2(\Omega)}$, $d$, $\|f\|_{L^\infty(\Omega)}$.
Next we state the generalization error bound for the neural-network solution in the case of the static Schrödinger equation.
Theorem 2.4.
Assume that the solution $u^*_S$ of the Neumann problem for the static Schrödinger equation (2.2) satisfies $\|u^*_S\|_{\mathcal{B}^2(\Omega)} < \infty$. Let $u^m_{n,S}$ be the minimizer of the empirical loss $\mathcal{E}_{n,S}$ in the set $\mathcal{F} = \mathcal{F}_{\mathrm{SP}_\tau,m}(\|u^*_S\|_{\mathcal{B}^2(\Omega)})$ with $\tau = \sqrt{m}$. Then it holds that

(2.15)  $\mathbb{E}\big[\mathcal{E}_S(u^m_{n,S}) - \mathcal{E}_S(u^*_S)\big] \le \frac{C_1\sqrt{m}\,(\sqrt{\log m} + 1)}{\sqrt{n}} + \frac{C_2(\log m + 1)^2}{m}$.

Here $C_1 > 0$ depends polynomially on $\|u^*_S\|_{\mathcal{B}^2(\Omega)}$, $d$, $\|f\|_{L^\infty(\Omega)}$, $\|V\|_{L^\infty(\Omega)}$, and $C_2 > 0$ depends quadratically on $\|u^*_S\|_{\mathcal{B}^2(\Omega)}$. In particular, setting $m = n^{1/3}$ in (2.15) leads to

$\mathbb{E}\big[\mathcal{E}_S(u^m_{n,S}) - \mathcal{E}_S(u^*_S)\big] \le \frac{C(\log n)^2}{n^{1/3}}$

for some $C > 0$ depending only polynomially on $\|u^*_S\|_{\mathcal{B}^2(\Omega)}$, $d$, $\|f\|_{L^\infty(\Omega)}$, $\|V\|_{L^\infty(\Omega)}$.
Thanks to the estimates (2.4) and (2.6), the generalization bounds above on the energy excess translate directly into generalization bounds on the square of the $H^1$-error between the neural-network solution and the exact solution of the PDE. Specifically, when $m = n^{1/3}$, it holds that for some constant $C > 0$,

$\mathbb{E}\,\|u^m_n - u^*\|^2_{H^1(\Omega)} \le \frac{C(\log n)^2}{n^{1/3}}$.

Theorem 2.3 and Theorem 2.4 show that the generalization error of the neural-network solutions of the Poisson and static Schrödinger equations does not suffer from the curse of dimensionality under the key assumption that the exact solutions belong to the spectral Barron space $\mathcal{B}^2(\Omega)$. The proofs of Theorem 2.3 and Theorem 2.4 can be found in Section 6.

Finally, we verify the key low-complexity assumption by proving a new well-posedness theory for the Poisson and static Schrödinger equations in spectral Barron spaces. We start with the new solution theory for the Poisson equation, whose proof can be found in Section 7.1.

Theorem 2.5.
Assume that $f \in \mathcal{B}^s(\Omega)$ with $s \ge 0$ and that $\hat{f}(0) = \int_\Omega f(x)\,dx = 0$. Then the unique solution $u^*$ to the Neumann problem for the Poisson equation satisfies $u^* \in \mathcal{B}^{s+2}(\Omega)$ and

$\|u^*\|_{\mathcal{B}^{s+2}(\Omega)} \le C\,\|f\|_{\mathcal{B}^s(\Omega)}$

for an absolute constant $C > 0$. In particular, when $s = 0$ we have $\|u^*\|_{\mathcal{B}^2(\Omega)} \le C\,\|f\|_{\mathcal{B}^0(\Omega)}$.

The next theorem establishes the solution theory for the static Schrödinger equation in spectral Barron spaces.
Theorem 2.6.
Assume that $f \in \mathcal{B}^s(\Omega)$ with $s \ge 0$ and that $V \in \mathcal{B}^s(\Omega)$ with $V(x) \ge V_{\min} > 0$ for every $x \in \Omega$. Then the static Schrödinger problem (2.2) has a unique solution $u \in \mathcal{B}^{s+2}(\Omega)$. Moreover, there exists a constant $C > 0$ depending on $V$ and $d$ such that

(2.16)  $\|u\|_{\mathcal{B}^{s+2}(\Omega)} \le C\,\|f\|_{\mathcal{B}^s(\Omega)}$.

In particular, when $s = 0$ we have $\|u\|_{\mathcal{B}^2(\Omega)} \le C\,\|f\|_{\mathcal{B}^0(\Omega)}$.

The stability estimates above can be viewed as an analog of the standard Sobolev regularity estimate $\|u\|_{H^{s+2}(\Omega)} \le C\,\|f\|_{H^s(\Omega)}$. However, the proof of the estimate (2.16) is quite different from that of the Sobolev estimate. In particular, due to the lack of a Hilbert structure in the Barron space $\mathcal{B}^s(\Omega)$, the standard Lax-Milgram theorem and the bootstrap arguments used to prove Sobolev regularity estimates cannot be applied here. Instead, we study the equivalent operator equation satisfied by the cosine coefficients of the solution of the static Schrödinger equation. By exploiting the fact that the Barron space is a weighted $\ell^1$-space on the cosine coefficients, we prove the well-posedness of the operator equation and the stability estimate (2.16) via an application of the Fredholm theory to the operator equation. The complete proof of Theorem 2.6 can be found in Section 7.2.

2.4.
Discussions and Future Directions.
We established dimension-independent rates of convergence for the generalization error of the DRM for solving two simple linear elliptic PDEs. We would like to discuss some restrictions of the main results and point out some interesting future directions.

First, some numerical results suggest that the convergence rates in our generalization error estimates may not be sharp. In fact, Siegel and Xu [38] obtained sharp convergence rates of $O(m^{-(1/2+\delta(d))})$ with some $\delta(d) > 0$ for approximating a similar class of spectral Barron functions using two-layer neural nets with cosine and $\mathrm{ReLU}^k$ activation functions. However, the parameters (weights and biases) of the neural networks constructed in their approximation results were not well controlled (and may be unbounded), which could potentially lead to large generalization errors. One interesting open question is to sharpen the approximation rate for our spectral Barron functions using controllable two-layer neural networks with possibly different activation functions. On the other hand, the statistical error bound $O(\sqrt{m}(\sqrt{\log m}+1)/\sqrt{n})$ may also be improved with sharper and more delicate Rademacher complexity estimates for the neural networks.

We restricted our attention to two simple elliptic problems defined on a hypercube with the Neumann boundary condition to better convey the main ideas. It is natural to consider carrying out similar programs for more general PDE problems defined on general bounded or unbounded domains with other boundary conditions.
One major difficulty arises in the definition of Barron functions on a general bounded domain: our spectral Barron functions, built on cosine expansions, cannot be adapted to general domains. Other Barron functions, such as the one defined in [11] via an integral representation, are defined on bounded domains and may be considered as alternatives, but building a solution theory for PDEs in those spaces seems highly nontrivial; see [12] for some results and discussions along this direction. Another major issue comes from solving PDEs with essential boundary conditions such as Dirichlet or periodic boundary conditions, where one needs to construct neural networks that satisfy those boundary conditions; we refer to [8, 31] for some initial attempts in this direction.

Finally, the analysis of the training error of neural network methods for solving PDEs is a highly important and challenging question. The difficulty is largely due to the non-convexity of the loss function in the parameters. Nevertheless, recent breakthroughs in the theoretical analysis of two-layer neural network training show that the training dynamics can be largely simplified in the infinite-width limit, such as in the mean field regime [5, 29, 34, 40] or the neural tangent kernel (NTK) regime [6, 15, 21], where global convergence of the limiting dynamics can be proved under suitable assumptions. It is an exciting direction to establish similar convergence results for overparameterized two-layer networks in the context of solving PDEs.

3. Abstract generalization error bounds
In this section, we derive some abstract generalization bounds for the empirical loss minimization discussed in the previous section. To simplify the notation, we suppress the problem-dependent subscript $P$ or $S$ and denote by $u_n$ the minimizer of the empirical loss $\mathcal{E}_n$ over the hypothesis space $\mathcal{F}$. Recall that $u^*$ is the exact solution of the PDE. We aim to bound the energy excess $\Delta\mathcal{E}_n := \mathcal{E}(u_n) - \mathcal{E}(u^*)$. By definition we have $\Delta\mathcal{E}_n \ge 0$. To bound $\Delta\mathcal{E}_n$ from above, we first decompose it as
\[(3.1)\quad \Delta\mathcal{E}_n = \mathcal{E}(u_n) - \mathcal{E}_n(u_n) + \mathcal{E}_n(u_n) - \mathcal{E}_n(u_{\mathcal{F}}) + \mathcal{E}_n(u_{\mathcal{F}}) - \mathcal{E}(u_{\mathcal{F}}) + \mathcal{E}(u_{\mathcal{F}}) - \mathcal{E}(u^*).\]
Here $u_{\mathcal{F}} = \arg\min_{u \in \mathcal{F}} \mathcal{E}(u)$. Since $u_n$ is the minimizer of $\mathcal{E}_n$, we have $\mathcal{E}_n(u_n) - \mathcal{E}_n(u_{\mathcal{F}}) \le 0$. Therefore taking expectation on both sides of (3.1) gives
\[(3.2)\quad \mathbb{E}\,\Delta\mathcal{E}_n \le \underbrace{\mathbb{E}\big[\mathcal{E}(u_n) - \mathcal{E}_n(u_n)\big]}_{\Delta\mathcal{E}_{gen}} + \underbrace{\mathbb{E}\big[\mathcal{E}_n(u_{\mathcal{F}})\big] - \mathcal{E}(u_{\mathcal{F}})}_{\Delta\mathcal{E}_{bias}} + \underbrace{\mathcal{E}(u_{\mathcal{F}}) - \mathcal{E}(u^*)}_{\Delta\mathcal{E}_{approx}}.\]
Observe that $\Delta\mathcal{E}_{gen}$ and $\Delta\mathcal{E}_{bias}$ are the statistical errors: the first term $\Delta\mathcal{E}_{gen}$ describes the generalization error of the empirical loss minimization over the hypothesis space $\mathcal{F}$, and the second term $\Delta\mathcal{E}_{bias}$ is the bias coming from the Monte Carlo approximation of the integrals, whereas the third term $\Delta\mathcal{E}_{approx}$ is the approximation error incurred by restricting the minimization of $\mathcal{E}$ from the set $H^1(\Omega)$ to $\mathcal{F}$. Moreover, thanks to Proposition 2.1, the third term $\Delta\mathcal{E}_{approx}$ is equivalent (up to a constant) to $\inf_{u \in \mathcal{F}} \|u - u^*\|^2_{H^1(\Omega)}$.

To control the statistical errors, it is essential to prove a so-called uniform law of large numbers for certain function classes, where the notion of Rademacher complexity plays an important role; we recall it below.
Definition 3.1.
We define, for a set of random variables $\{Z_j\}_{j=1}^n$ independently distributed according to $P_\Omega$ and a function class $\mathcal{S}$, the random variable
\[\hat{R}_n(\mathcal{S}) := \mathbb{E}_\sigma\Big[\sup_{g \in \mathcal{S}}\Big|\frac{1}{n}\sum_{j=1}^n \sigma_j g(Z_j)\Big| \,\Big|\, Z_1, \cdots, Z_n\Big],\]
where the expectation $\mathbb{E}_\sigma$ is taken with respect to the independent uniform Bernoulli sequence $\{\sigma_j\}_{j=1}^n$ with $\sigma_j \in \{\pm 1\}$. Then the Rademacher complexity of $\mathcal{S}$ is defined by $R_n(\mathcal{S}) = \mathbb{E}_{P_\Omega}[\hat{R}_n(\mathcal{S})]$.

The following important symmetrization lemma makes the connection between the uniform law of large numbers and the Rademacher complexity.
Lemma 3.1. [41, Proposition 4.11] Let $\mathcal{F}$ be a set of functions. Then
\[\mathbb{E}\sup_{u \in \mathcal{F}}\Big|\frac{1}{n}\sum_{j=1}^n u(X_j) - \mathbb{E}_{X \sim P_\Omega} u(X)\Big| \le 2 R_n(\mathcal{F}).\]

Poisson Equation.
In this subsection we derive the abstract generalization bound in the setting of the Poisson equation. Recall the Ritz loss and the empirical loss associated to the Poisson equation:
\[\mathcal{E}_P(u) = |\Omega| \cdot \mathbb{E}_{X \sim P_\Omega}\Big[\frac{1}{2}|\nabla u(X)|^2 - f(X)u(X)\Big] + \frac{1}{2}\Big(|\Omega| \cdot \mathbb{E}_{X \sim P_\Omega} u(X)\Big)^2 =: \mathcal{E}_1(u) + \mathcal{E}_2(u),\]
\[\mathcal{E}_{n,P}(u) = \frac{1}{n}\sum_{j=1}^n\Big[|\Omega| \cdot \Big(\frac{1}{2}|\nabla u(X_j)|^2 - f(X_j)u(X_j)\Big)\Big] + \frac{1}{2}\Big(\frac{|\Omega|}{n}\sum_{j=1}^n u(X_j)\Big)^2 =: \mathcal{E}_{n,1}(u) + \mathcal{E}_{n,2}(u).\]
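To make the Monte Carlo structure of $\mathcal{E}_{n,P}$ concrete, here is a minimal numerical sketch on $\Omega = [0,1]^d$. Everything in it (the two-layer softplus network, the forcing term `f`, and all parameter values) is a hypothetical illustration, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m_width, n = 5, 16, 4000          # dimension, network width, sample size

# hypothetical two-layer softplus network u(x) = c + sum_i gamma_i * sp(w_i . x + t_i)
c = 0.1
gamma = rng.normal(size=m_width) / m_width
Wm = rng.normal(size=(m_width, d))
Wm /= np.linalg.norm(Wm, axis=1, keepdims=True)
t = rng.uniform(-1.0, 1.0, size=m_width)

sp = lambda z: np.logaddexp(0.0, z)            # softplus activation
sp1 = lambda z: 1.0 / (1.0 + np.exp(-z))       # its derivative

def u(X):                                      # X has shape (n, d)
    return c + sp(X @ Wm.T + t) @ gamma

def grad_u(X):                                 # gradient of u, shape (n, d)
    return (sp1(X @ Wm.T + t) * gamma) @ Wm

f = lambda X: np.cos(np.pi * X[:, 0])          # hypothetical forcing term

def empirical_ritz_loss(X):
    """E_{n,P}(u) = (1/n) sum_j [ |grad u|^2 / 2 - f u ] + ((1/n) sum_j u)^2 / 2  (|Omega| = 1)."""
    g, uv = grad_u(X), u(X)
    return np.mean(0.5 * np.sum(g**2, axis=1) - f(X) * uv) + 0.5 * np.mean(uv)**2

X = rng.uniform(0.0, 1.0, size=(n, d))
loss = empirical_ritz_loss(X)
```

Evaluating the loss on two independent sample sets gives two nearby values; their gap is precisely the statistical error that the Rademacher-complexity bounds of this section control.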
By definition, the bias term $\Delta\mathcal{E}_{bias}$ satisfies
\[\begin{aligned}
\Delta\mathcal{E}_{bias} &= \mathbb{E}[\mathcal{E}_{n,1}(u_{\mathcal{F}})] - \mathcal{E}_1(u_{\mathcal{F}}) + \mathbb{E}[\mathcal{E}_{n,2}(u_{\mathcal{F}})] - \mathcal{E}_2(u_{\mathcal{F}})\\
&= \frac{1}{2}\,\mathbb{E}\Big(\frac{|\Omega|}{n}\sum_{j=1}^n u_{\mathcal{F}}(X_j)\Big)^2 - \frac{1}{2}\Big(|\Omega| \cdot \mathbb{E}_{X \sim P_\Omega} u_{\mathcal{F}}(X)\Big)^2\\
&= \frac{1}{2}\,\mathbb{E}\Big[\Big(\frac{1}{n}\sum_{j=1}^n u_{\mathcal{F}}(X_j) - \mathbb{E}_{X \sim P_\Omega} u_{\mathcal{F}}(X)\Big)\Big(\frac{1}{n}\sum_{j=1}^n u_{\mathcal{F}}(X_j) + \mathbb{E}_{X \sim P_\Omega} u_{\mathcal{F}}(X)\Big)\Big]\\
&\le \|u_{\mathcal{F}}\|_{L^\infty(\Omega)} \cdot \mathbb{E}\sup_{u \in \mathcal{F}}\Big|\frac{1}{n}\sum_{j=1}^n u(X_j) - \mathbb{E}_{X \sim P_\Omega} u(X)\Big|\\
&\le 2\sup_{u \in \mathcal{F}}\|u\|_{L^\infty(\Omega)} \cdot R_n(\mathcal{F}),
\end{aligned}\]
where we have used $|\Omega| = 1$ and the last inequality follows from Lemma 3.1.

Next we bound the first term $\Delta\mathcal{E}_{gen}$. Let us first define the set of functions $\mathcal{G}_P$ for the term appearing in $\mathcal{E}_1$ by
\[\mathcal{G}_P := \Big\{g : \Omega \to \mathbb{R} \,\Big|\, g = \frac{1}{2}|\nabla u|^2 - fu \text{ where } u \in \mathcal{F}\Big\}.\]
Then it follows from Lemma 3.1 that
\[\begin{aligned}
\Delta\mathcal{E}_{gen} &\le \mathbb{E}\sup_{v \in \mathcal{F}}\big|\mathcal{E}(v) - \mathcal{E}_n(v)\big| \le \mathbb{E}\sup_{v \in \mathcal{F}}\big|\mathcal{E}_1(v) - \mathcal{E}_{n,1}(v)\big| + \mathbb{E}\sup_{v \in \mathcal{F}}\big|\mathcal{E}_2(v) - \mathcal{E}_{n,2}(v)\big|\\
&\le \mathbb{E}\sup_{g \in \mathcal{G}_P}\Big|\frac{1}{n}\sum_{j=1}^n g(X_j) - \mathbb{E}_{P_\Omega}[g]\Big| + \frac{1}{2}\,\mathbb{E}\sup_{u \in \mathcal{F}}\Big|\Big(\mathbb{E}_{X \sim P_\Omega} u(X)\Big)^2 - \Big(\frac{1}{n}\sum_{j=1}^n u(X_j)\Big)^2\Big|\\
&\le 2R_n(\mathcal{G}_P) + \sup_{u \in \mathcal{F}}\|u\|_{L^\infty(\Omega)} \cdot \mathbb{E}\sup_{u \in \mathcal{F}}\Big|\frac{1}{n}\sum_{j=1}^n u(X_j) - \mathbb{E}_{X \sim P_\Omega} u(X)\Big|\\
&\le 2R_n(\mathcal{G}_P) + 2\sup_{u \in \mathcal{F}}\|u\|_{L^\infty(\Omega)}\,R_n(\mathcal{F}).
\end{aligned}\]
Finally, owing to the estimate (2.4) in Proposition 2.1, the approximation error satisfies
\[\Delta\mathcal{E}_{approx} \le \frac{1}{2}\inf_{u \in \mathcal{F}}\|u - u^*\|^2_{H^1(\Omega)}.\]
To summarize, we have established the following abstract generalization error bound for the energy excess $\Delta\mathcal{E}_n$ in the case of the Poisson equation.

Theorem 3.1.
Let $u_{n,P}$ be the minimizer of the empirical risk $\mathcal{E}_{n,P}$ within the hypothesis class $\mathcal{F}$ satisfying $\sup_{u \in \mathcal{F}}\|u\|_{L^\infty(\Omega)} < \infty$. Let $\Delta\mathcal{E}_{n,P} = \mathcal{E}_P(u_{n,P}) - \mathcal{E}_P(u^*_P)$. Then
\[(3.3)\quad \mathbb{E}\,\Delta\mathcal{E}_{n,P} \le 2R_n(\mathcal{G}_P) + 4\sup_{u \in \mathcal{F}}\|u\|_{L^\infty(\Omega)} \cdot R_n(\mathcal{F}) + \frac{1}{2}\inf_{u \in \mathcal{F}}\|u - u^*_P\|^2_{H^1(\Omega)}.\]

Static Schrödinger Equation.
In this subsection we proceed to prove an abstract generalization bound for the static Schrödinger equation. First recall the corresponding Ritz loss and empirical loss:
\[\mathcal{E}_S(u) = |\Omega| \cdot \mathbb{E}_{X \sim P_\Omega}\Big[\frac{1}{2}|\nabla u(X)|^2 + \frac{1}{2}V(X)|u(X)|^2 - f(X)u(X)\Big],\]
\[\mathcal{E}_{n,S}(u) = \frac{1}{n}\sum_{j=1}^n\Big[|\Omega| \cdot \Big(\frac{1}{2}|\nabla u(X_j)|^2 + \frac{1}{2}V(X_j)|u(X_j)|^2 - f(X_j)u(X_j)\Big)\Big].\]
Similar to the previous subsection, we introduce the function class $\mathcal{G}_S$ by setting
\[\mathcal{G}_S := \Big\{g : \Omega \to \mathbb{R} \,\Big|\, g = \frac{1}{2}|\nabla u|^2 + \frac{1}{2}V|u|^2 - fu \text{ where } u \in \mathcal{F}\Big\}.\]
In the Schrödinger case, since the Ritz loss $\mathcal{E}_S$ is linear with respect to the probability measure $P_\Omega$, the statistical errors are simpler than those in the Poisson case. In particular, a similar calculation shows that $\Delta\mathcal{E}_{bias} = 0$ and $\Delta\mathcal{E}_{gen} \le 2R_n(\mathcal{G}_S)$. Therefore, as a result of (3.2), we obtain the following theorem.

Theorem 3.2.
Let $u_{n,S}$ be the minimizer of the empirical risk $\mathcal{E}_{n,S}$ within the hypothesis class $\mathcal{F}$ satisfying $\sup_{u \in \mathcal{F}}\|u\|_{L^\infty(\Omega)} < \infty$. Let $\Delta\mathcal{E}_{n,S} = \mathcal{E}_S(u_{n,S}) - \mathcal{E}_S(u^*_S)$. Then
\[(3.4)\quad \mathbb{E}\,\Delta\mathcal{E}_{n,S} \le 2R_n(\mathcal{G}_S) + \frac{1}{2}\inf_{u \in \mathcal{F}}\|u - u^*_S\|^2_{H^1(\Omega)}.\]

4. Spectral Barron functions on the hypercube and their $H^1$-approximation

In this section, we discuss the properties of spectral Barron functions on the $d$-dimensional hypercube defined by (2.10), as well as their neural network approximations. Since our spectral Barron functions are defined via the expansion under the following set of cosine functions
\[\mathcal{C} = \big\{\Phi_k\big\}_{k \in \mathbb{N}^d} := \Big\{\prod_{i=1}^d \cos(\pi k_i x_i) \,\Big|\, k_i \in \mathbb{N}\Big\},\]
we start by stating some preliminaries on $\mathcal{C}$ and the product of cosines to be used in the subsequent proofs.

4.1. Preliminary Lemmas.

Lemma 4.1.
The set $\mathcal{C}$ forms an orthogonal basis of $L^2(\Omega)$ and $H^1(\Omega)$.

Proof. That $\mathcal{C}$ forms an orthogonal basis of $L^2(\Omega)$ follows directly from Parseval's theorem applied to the Fourier expansion of the even extension of a function $u \in L^2(\Omega)$. To see that $\mathcal{C}$ is an orthogonal basis of $H^1(\Omega)$, since $\mathcal{C}$ is an orthogonal set in $H^1(\Omega)$, it suffices to show that if $u \in H^1(\Omega)$ satisfies $(u, \Phi_k)_{H^1(\Omega)} = 0$ for all $k \in \mathbb{N}^d$, then $u = 0$. In fact, the last display yields that
\[\int_\Omega u \cdot \Phi_k\,dx + \int_\Omega \nabla u \cdot \nabla\Phi_k\,dx = \int_\Omega u \cdot (\Phi_k - \Delta\Phi_k)\,dx = (1 + \pi^2|k|^2)\int_\Omega u \cdot \Phi_k\,dx,\]
where for the second identity we have used Green's formula and the fact that the normal derivative of $\Phi_k$ vanishes on the boundary of $\Omega$. Therefore we have obtained $(u, \Phi_k)_{L^2} = 0$ for every $k \in \mathbb{N}^d$, which implies that $u = 0$ since $\mathcal{C}$ is an orthogonal basis of $L^2(\Omega)$. □

Given $u \in L^2(\Omega)$, let $\{\hat{u}(k)\}_{k \in \mathbb{N}^d}$ be the expansion coefficients of $u$ under the basis $\{\Phi_k\}_{k \in \mathbb{N}^d}$, so that
\[u(x) = \sum_{k \in \mathbb{N}^d} \hat{u}(k)\Phi_k(x).\]
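The orthogonality in Lemma 4.1, and the normalization constants $\alpha_k = \langle\Phi_k, \Phi_k\rangle_{L^2(\Omega)}$ used below, are easy to confirm numerically in low dimension. A small sketch with $d = 2$ and a midpoint-rule quadrature (the grid size is chosen arbitrarily):

```python
import numpy as np

N = 512
xs = (np.arange(N) + 0.5) / N                    # midpoint grid on [0, 1]
X, Y = np.meshgrid(xs, xs, indexing="ij")

def phi(k):
    # tensor-product cosine basis function Phi_k on [0,1]^2
    return np.cos(np.pi * k[0] * X) * np.cos(np.pi * k[1] * Y)

def inner(f, g):
    # midpoint-rule approximation of the L^2(Omega) inner product
    return float(np.mean(f * g))

off_diag = inner(phi((1, 2)), phi((2, 1)))       # distinct indices: should vanish
alpha_12 = inner(phi((1, 2)), phi((1, 2)))       # both k_i nonzero: 2^{-2} = 0.25
alpha_02 = inner(phi((0, 2)), phi((0, 2)))       # one k_i zero:     2^{-1} = 0.5
```

The midpoint rule integrates these trigonometric products essentially exactly, so the computed values match the closed-form constants to machine precision.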
Moreover, it follows from a straightforward calculation that for $u \in H^1(\Omega)$,
\[\|u\|^2_{H^1(\Omega)} = \sum_{k \in \mathbb{N}^d} \alpha_k(1 + \pi^2|k|^2)|\hat{u}(k)|^2,\]
where $\alpha_k = \langle\Phi_k, \Phi_k\rangle_{L^2(\Omega)} = 2^{-\sum_{i=1}^d \mathbb{1}_{k_i \neq 0}} \le 1$. This implies the following characterization of an $H^1(\Omega)$ function in terms of its expansion coefficients under $\mathcal{C}$.

Corollary 4.1.
The space $H^1(\Omega)$ can be characterized as
\[H^1(\Omega) = \Big\{u \in L^2(\Omega) \,\Big|\, \sum_{k \in \mathbb{N}^d} |\hat{u}(k)|^2(1 + \pi^2|k|^2) < \infty\Big\}.\]

The following elementary product formula for cosine functions will also be useful.
Lemma 4.2.
For any $\{\theta_i\}_{i=1}^d \subset \mathbb{R}$,
\[\prod_{i=1}^d \cos(\theta_i) = \frac{1}{2^d}\sum_{\xi \in \Xi}\cos(\xi \cdot \theta),\]
where $\theta = (\theta_1, \cdots, \theta_d)^T$ and $\Xi = \{1, -1\}^d$.

Proof. The lemma follows directly by iterating the simple identity
\[\cos(\theta_1)\cos(\theta_2) = \frac{1}{2}\big(\cos(\theta_1 + \theta_2) + \cos(\theta_1 - \theta_2)\big) = \frac{1}{4}\big(\cos(\theta_1 + \theta_2) + \cos(\theta_1 - \theta_2) + \cos(-\theta_1 - \theta_2) + \cos(-\theta_1 + \theta_2)\big). \qquad \Box\]

4.2. Spectral Barron Space and Neural-Network Approximation.
Recall for any $s \in \mathbb{N}$ the spectral Barron space $\mathcal{B}^s(\Omega)$ given by
\[\mathcal{B}^s(\Omega) := \Big\{u \in L^1(\Omega) : \sum_{k \in \mathbb{N}^d}(1 + \pi^s|k|_1^s)|\hat{u}(k)| < \infty\Big\}\]
with the associated norm $\|u\|_{\mathcal{B}^s(\Omega)} := \sum_{k \in \mathbb{N}^d}(1 + \pi^s|k|_1^s)|\hat{u}(k)|$. Recall also the shorthand notation $\mathcal{B}(\Omega)$ for $\mathcal{B}^2(\Omega)$.

Lemma 4.3.
The following embedding results hold:
(i) $\mathcal{B}(\Omega) \hookrightarrow H^1(\Omega)$;
(ii) $\mathcal{B}(\Omega) \hookrightarrow L^\infty(\Omega)$.

Proof. (i) If $u \in \mathcal{B}(\Omega)$, then $\|u\|_{\mathcal{B}(\Omega)} = \sum_{k \in \mathbb{N}^d}(1 + \pi^2|k|_1^2)|\hat{u}(k)| < \infty$. This in particular implies $|\hat{u}(k)| \le \|u\|_{\mathcal{B}(\Omega)}$ for each $k \in \mathbb{N}^d$. Since $\alpha_k \le 1$ and $|k| \le |k|_1$, we have
\[\|u\|^2_{H^1(\Omega)} = \sum_{k \in \mathbb{N}^d}\alpha_k(1 + \pi^2|k|^2)|\hat{u}(k)|^2 \le \|u\|_{\mathcal{B}(\Omega)}\sum_{k \in \mathbb{N}^d}(1 + \pi^2|k|_1^2)|\hat{u}(k)| = \|u\|^2_{\mathcal{B}(\Omega)}.\]
(ii) For $u \in \mathcal{B}(\Omega)$, using the fact that $\|\Phi_k\|_{L^\infty(\Omega)} \le 1$ we have
\[\|u\|_{L^\infty(\Omega)} = \Big\|\sum_{k \in \mathbb{N}^d}\hat{u}(k)\Phi_k\Big\|_{L^\infty(\Omega)} \le \sum_{k \in \mathbb{N}^d}|\hat{u}(k)| \le \|u\|_{\mathcal{B}(\Omega)}. \qquad \Box\]

Thanks to Lemma 4.1 and Lemma 4.2, any function $u \in H^1(\Omega)$ admits the expansion
\[(4.1)\quad u(x) = \sum_{k \in \mathbb{N}^d}\hat{u}(k) \cdot \frac{1}{2^d}\sum_{\xi \in \Xi}\cos(\pi k_\xi \cdot x),\]
where $\hat{u}(k)$ is the expansion coefficient of $u$ under the basis $\mathcal{C}$ and $k_\xi = (k_1\xi_1, \cdots, k_d\xi_d) \in \mathbb{Z}^d$. Given $u \in \mathcal{B}(\Omega) \subset H^1(\Omega)$, letting $(-1)^{\theta(k)} = \mathrm{sign}(\hat{u}(k))$ with $\theta(k) \in \{0, 1\}$, we have from (4.1) that
\[\begin{aligned}
u(x) &= \hat{u}(0) + \sum_{k \in \mathbb{N}^d \setminus \{0\}}\hat{u}(k) \cdot \frac{1}{2^d}\sum_{\xi \in \Xi}\cos(\pi k_\xi \cdot x)\\
&= \hat{u}(0) + \sum_{k \in \mathbb{N}^d \setminus \{0\}}|\hat{u}(k)|\,\mathrm{sign}(\hat{u}(k)) \cdot \frac{1}{2^d}\sum_{\xi \in \Xi}\cos(\pi k_\xi \cdot x)\\
&= \hat{u}(0) + \sum_{k \in \mathbb{N}^d \setminus \{0\}}|\hat{u}(k)| \cdot \frac{1}{2^d}\sum_{\xi \in \Xi}\cos(\pi(k_\xi \cdot x + \theta(k)))\\
&= \hat{u}(0) + \sum_{k \in \mathbb{N}^d \setminus \{0\}}\frac{|\hat{u}(k)|(1 + \pi^2|k|_1^2)}{Z_u} \cdot \frac{Z_u}{1 + \pi^2|k|_1^2} \cdot \frac{1}{2^d}\sum_{\xi \in \Xi}\cos(\pi(k_\xi \cdot x + \theta(k)))\\
&=: \hat{u}(0) + \int g(x, k)\,\mu(dk),
\end{aligned}\]
where $\mu(dk)$ is the probability measure on $\mathbb{N}^d \setminus \{0\}$ defined by
\[\mu(dk) = \sum_{k \in \mathbb{N}^d \setminus \{0\}}\frac{|\hat{u}(k)|(1 + \pi^2|k|_1^2)}{Z_u}\,\delta_k(dk)\]
with normalizing constant $Z_u = \sum_{k \in \mathbb{N}^d \setminus \{0\}}|\hat{u}(k)|(1 + \pi^2|k|_1^2) \le \|u\|_{\mathcal{B}(\Omega)}$ and
\[g(x, k) = \frac{Z_u}{1 + \pi^2|k|_1^2} \cdot \frac{1}{2^d}\sum_{\xi \in \Xi}\cos(\pi(k_\xi \cdot x + \theta(k))).\]
Observe that the function $g(x, k) \in C^\infty(\Omega)$ for every $k \in \mathbb{N}^d \setminus \{0\}$. Moreover, it is straightforward to show that the following bounds hold:
\[\|g(\cdot, k)\|_{H^1(\Omega)} = \frac{Z_u\sqrt{\alpha_k(1 + \pi^2|k|^2)}}{1 + \pi^2|k|_1^2} \le \|u\|_{\mathcal{B}(\Omega)}, \qquad \|D^s g(\cdot, k)\|_{L^\infty(\Omega)} \le Z_u \le \|u\|_{\mathcal{B}(\Omega)} \text{ for } s = 0, 1, 2.\]
Let us define for a constant
$B > 0$ the function class
\[\mathcal{F}_{\cos}(B) := \Big\{\frac{\gamma}{1 + \pi^2|k|_1^2}\cos(\pi(k \cdot x + b)) \,\Big|\, k \in \mathbb{Z}^d \setminus \{0\},\ |\gamma| \le B,\ b \in \{0, 1\}\Big\}.\]
It follows from the calculations above that if $u \in \mathcal{B}(\Omega)$, then $\bar{u} := u - \hat{u}(0)$ lies in the $H^1$-closure of the convex hull of $\mathcal{F}_{\cos}(B)$ with $B = \|u\|_{\mathcal{B}(\Omega)}$. Indeed, if $\{k_i\}_{i=1}^m$ is an i.i.d. sequence of random samples from the probability measure $\mu$, then it follows from Fubini's theorem that
\[\begin{aligned}
\mathbb{E}\Big\|\bar{u}(x) - \frac{1}{m}\sum_{i=1}^m g(x, k_i)\Big\|^2_{H^1(\Omega)} &= \mathbb{E}\int_\Omega\Big|\bar{u}(x) - \frac{1}{m}\sum_{i=1}^m g(x, k_i)\Big|^2 dx + \mathbb{E}\int_\Omega\Big|\nabla\bar{u}(x) - \frac{1}{m}\sum_{i=1}^m \nabla g(x, k_i)\Big|^2 dx\\
&= \frac{1}{m}\int_\Omega \mathrm{Var}[g(x, k)]\,dx + \frac{1}{m}\int_\Omega \mathrm{Tr}\big(\mathrm{Cov}[\nabla g(x, k)]\big)\,dx\\
&\le \frac{\mathbb{E}\|g(\cdot, k)\|^2_{H^1(\Omega)}}{m} \le \frac{\|u\|^2_{\mathcal{B}(\Omega)}}{m}.
\end{aligned}\]
Therefore the expected $H^1$-norm of an average of $m$ elements of $\mathcal{F}_{\cos}(B)$ converges to zero as $m \to \infty$. This in particular implies that there exists a sequence of convex combinations of points in $\mathcal{F}_{\cos}(B)$ converging to $\bar{u}$ in the $H^1$-norm. Since the $H^1$-norm of any function in $\mathcal{F}_{\cos}(B)$ is bounded by $B$, an application of Maurey's empirical method (see Lemma 4.4) yields the following theorem.

Theorem 4.1.
Let $u \in \mathcal{B}(\Omega)$. Then there exists $u_m$, a convex combination of $m$ functions in $\mathcal{F}_{\cos}(B)$ with $B = \|u\|_{\mathcal{B}(\Omega)}$, such that
\[\|u - \hat{u}(0) - u_m\|_{H^1(\Omega)} \le \frac{\|u\|_{\mathcal{B}(\Omega)}}{\sqrt{m}}.\]

Lemma 4.4. [2, 32] Let $u$ belong to the closure of the convex hull of a set $\mathcal{G}$ in a Hilbert space, and let the Hilbert norm of each element of $\mathcal{G}$ be bounded by $B > 0$. Then for every $m \in \mathbb{N}$, there exist $\{g_i\}_{i=1}^m \subset \mathcal{G}$ and $\{c_i\}_{i=1}^m \subset [0, 1]$ with $\sum_{i=1}^m c_i = 1$ such that
\[\Big\|u - \sum_{i=1}^m c_i g_i\Big\| \le \frac{B}{\sqrt{m}}.\]

4.3. Reduction to ReLU and Softplus Activation Functions.
Notice that every function in $\mathcal{F}_{\cos}(B)$ is the composition of the one-dimensional function $g$ defined on $[-1, 1]$ by
\[(4.2)\quad g(z) = \frac{\gamma}{1 + \pi^2|k|_1^2}\cos(\pi(|k|_1 z + b)) \quad \text{with } k \in \mathbb{Z}^d \setminus \{0\},\ |\gamma| \le B,\ b \in \{0, 1\},\]
and a linear function $z = w \cdot x$ with $w = k/|k|_1$. It is clear that $g \in C^\infty([-1, 1])$ and $g$ satisfies
\[(4.3)\quad \|g^{(s)}\|_{L^\infty([-1,1])} \le |\gamma| \le B \quad \text{for } s = 0, 1, 2.\]
Since $b \in \{0, 1\}$, it also holds that $g'(0) = 0$.

Lemma 4.5.
Let $g \in C^2([-1, 1])$ with $\|g^{(s)}\|_{L^\infty([-1,1])} \le B$ for $s = 0, 1, 2$. Assume that $g'(0) = 0$. Let $\{z_j\}_{j=0}^{2m}$ be a partition of $[-1, 1]$ with $z_0 = -1$, $z_m = 0$, $z_{2m} = 1$ and $z_{j+1} - z_j = h = 1/m$ for each $j = 0, \cdots, 2m - 1$. Then there exists a two-layer ReLU network $g_m$ of the form
\[(4.4)\quad g_m(z) = c + \sum_{i=1}^{2m} a_i\,\mathrm{ReLU}(\epsilon_i z - b_i), \quad z \in [-1, 1],\]
with $c = g(0)$, $b_i \in [-1, 1]$ and $\epsilon_i \in \{\pm 1\}$, $i = 1, \cdots, 2m$, such that
\[(4.5)\quad \|g - g_m\|_{W^{1,\infty}([-1,1])} \le \frac{2B}{m}.\]
Moreover, we have that $|a_i| \le B/m$ and $|c| \le B$.

Proof.
Let $g_m$ be the piecewise linear interpolation of $g$ with respect to the grid $\{z_j\}_{j=0}^{2m}$, i.e.,
\[g_m(z) = g(z_{j+1})\frac{z - z_j}{h} + g(z_j)\frac{z_{j+1} - z}{h} \quad \text{if } z \in [z_j, z_{j+1}].\]
According to [1, Chapter 11],
\[\|g - g_m\|_{L^\infty([-1,1])} \le \frac{h^2}{8}\|g''\|_{L^\infty([-1,1])}.\]
Moreover,
\[\|g' - g_m'\|_{L^\infty([-1,1])} \le h\|g''\|_{L^\infty([-1,1])}.\]
In fact, consider $z \in [z_j, z_{j+1}]$ for some $j \in \{0, \cdots, 2m - 1\}$. By the mean value theorem, there exist $\xi, \eta \in (z_j, z_{j+1})$ such that $(g(z_{j+1}) - g(z_j))/h = g'(\xi)$, and hence
\[\Big|g'(z) - \frac{g(z_{j+1}) - g(z_j)}{h}\Big| = |g'(z) - g'(\xi)| = |g''(\eta)||z - \xi| \le h\|g''\|_{L^\infty([-1,1])}.\]
This proves the error bound (4.5).

Next, we show that $g_m$ can be represented by a two-layer ReLU neural network. Indeed, it is easy to verify that $g_m$ can be rewritten as
\[(4.6)\quad g_m(z) = c + \sum_{i=1}^m a_i\,\mathrm{ReLU}(z_i - z) + \sum_{i=m+1}^{2m} a_i\,\mathrm{ReLU}(z - z_{i-1}), \quad z \in [-1, 1],\]
where $c = g(z_m) = g(0)$ and the parameters $a_i$ are defined by
\[a_i = \begin{cases} \dfrac{g(z_{m+1}) - g(z_m)}{h}, & i = m + 1,\\[1mm] \dfrac{g(z_{m-1}) - g(z_m)}{h}, & i = m,\\[1mm] \dfrac{g(z_i) - 2g(z_{i-1}) + g(z_{i-2})}{h}, & i > m + 1,\\[1mm] \dfrac{g(z_{i-1}) - 2g(z_i) + g(z_{i+1})}{h}, & i < m. \end{cases}\]
Furthermore, by the mean value theorem again, there exist $\xi_1 \in (z_m, z_{m+1})$ and $\xi_2$ between $0$ and $\xi_1$ such that
\[|a_{m+1}| = |g'(\xi_1)| = |g'(\xi_1) - g'(0)| = |g''(\xi_2)\,\xi_1| \le Bh.\]
In a similar manner one can obtain that $|a_m| \le Bh$ and $|a_i| \le Bh$ for $i \notin \{m, m + 1\}$; hence $|a_i| \le B/m$ for all $i$. Finally, by setting $\epsilon_i = -1$, $b_i = -z_i$ for $i = 1, \cdots, m$ and $\epsilon_i = 1$, $b_i = z_{i-1}$ for $i = m + 1, \cdots, 2m$, one obtains the desired form (4.4) of $g_m$. This completes the proof of the lemma. □

The following proposition is a direct consequence of Lemma 4.5.
Proposition 4.1.
Define the function class
\[\mathcal{F}_{\mathrm{ReLU}}(B) := \big\{c + \gamma\,\mathrm{ReLU}(w \cdot x - t) \,\big|\, |c| \le 2B,\ |w|_1 = 1,\ |t| \le 1,\ |\gamma| \le 4B\big\}.\]
Then for any constant $\tilde{c}$ with $|\tilde{c}| \le B$, the set $\tilde{c} + \mathcal{F}_{\cos}(B)$ is contained in the $H^1$-closure of the convex hull of $\mathcal{F}_{\mathrm{ReLU}}(B)$.

Proof. Lemma 4.5 states that each $C^2$ function $g$ with $g'(0) = 0$ and with derivatives up to second order bounded by $B$ can be well approximated in the $H^1$-norm by a linear combination of a constant function and the ReLU functions $\mathrm{ReLU}(\epsilon z - b)$, with the sum of the absolute values of the combination coefficients bounded by $2B$. As a result, the function $g$ defined in (4.2) lies in the closure of the convex hull of functions $c + \gamma\,\mathrm{ReLU}(\epsilon z - b)$ with $|c| \le B$, $|\gamma| \le 2B$, $|b| \le 1$. The proposition then follows by absorbing the additive constant $\tilde{c}$ into the constant $c$ in the definition of $\mathcal{F}_{\mathrm{ReLU}}(B)$. □
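The interpolation argument behind Lemma 4.5 and Proposition 4.1 can be sanity-checked numerically. The sketch below uses a hypothetical target $g$ with $g'(0) = 0$ and $|g|, |g'|, |g''| \le 1$, and writes its piecewise linear interpolant as a one-sided ReLU expansion (cosmetically different from, but equivalent to, the form (4.4)):

```python
import numpy as np

m = 50
z_grid = np.linspace(-1.0, 1.0, 2 * m + 1)           # grid of Lemma 4.5: h = 1/m
g = lambda z: np.cos(np.pi * z) / (1.0 + np.pi**2)   # hypothetical g: g'(0)=0, |g^{(s)}| <= 1

# piecewise linear interpolant written as a two-layer ReLU network:
# g_m(z) = g(-1) + sum_j a_j ReLU(z - z_j), one kink coefficient per grid node
slopes = np.diff(g(z_grid)) / np.diff(z_grid)
a = np.concatenate(([slopes[0]], np.diff(slopes)))

def g_relu(z):
    z = np.atleast_1d(z)
    return g(-1.0) + np.maximum(z[:, None] - z_grid[:-1], 0.0) @ a

z_fine = np.linspace(-1.0, 1.0, 20001)
sup_err = float(np.max(np.abs(g(z_fine) - g_relu(z_fine))))  # O(1/m^2) in sup norm
coeff_sum = float(np.sum(np.abs(a)))                         # stays O(B), uniformly in m
```

The uniform error decays like $h^2\|g''\|/8$ while the total kink mass stays bounded, which is exactly the coefficient control that Proposition 4.1 needs.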
PRIORI GENERALIZATION ANALYSIS OF THE DEEP RITZ METHOD 17
Proof of Theorem 2.1.
Observe that if $u \in \mathcal{F}_{\mathrm{ReLU}}(B)$, then
\[\|u\|^2_{H^1(\Omega)} \le (|c| + 2|\gamma|)^2 + |\gamma|^2 \le (10^2 + 4^2)B^2 = 116B^2.\]
Therefore Theorem 2.1 follows directly from Lemma 4.4, Proposition 4.1 with $\tilde{c} = \hat{u}(0)$, and the fact that $|\hat{u}(0)| \le \|u\|_{\mathcal{B}(\Omega)}$. □

Next we proceed to prove Theorem 2.2, which concerns approximating spectral Barron functions using two-layer networks with the Softplus activation. To this end, let us first state a lemma which shows that
ReLU can be well approximated by $\mathrm{SP}_\tau$ for $\tau \gg 1$.

Lemma 4.6.
The following inequalities hold:
(i) $|\mathrm{ReLU}(z) - \mathrm{SP}_\tau(z)| \le \frac{1}{\tau}e^{-\tau|z|}$ for all $z \in [-2, 2]$;
(ii) $|\mathrm{ReLU}'(z) - \mathrm{SP}_\tau'(z)| \le e^{-\tau|z|}$ for all $z \in [-2, 0) \cup (0, 2]$;
(iii) $\|\mathrm{SP}_\tau\|_{W^{1,\infty}([-2,2])} \le 3 + \frac{\ln 2}{\tau}$.

Proof.
Notice that $\mathrm{ReLU}(z) - \mathrm{SP}_\tau(z) = -\frac{1}{\tau}\ln(1 + e^{-\tau|z|})$. Hence inequality (i) follows from
\[|\mathrm{ReLU}(z) - \mathrm{SP}_\tau(z)| \le \frac{1}{\tau}\ln(1 + e^{-\tau|z|}) \le \frac{e^{-\tau|z|}}{\tau},\]
where the second inequality follows from the simple inequality $\ln(1 + x) \le x$ for $x > -1$. In addition, inequality (ii) holds since
\[|\mathrm{ReLU}'(z) - \mathrm{SP}_\tau'(z)| = \Big|\frac{1}{1 + e^{\tau|z|}}\Big| \le e^{-\tau|z|} \quad \text{if } z \neq 0.\]
Finally, inequality (iii) follows from $\|\mathrm{SP}_\tau\|_{L^\infty([-2,2])} = \mathrm{SP}_\tau(2) \le 2 + \frac{\ln 2}{\tau}$ and $|\mathrm{SP}_\tau'(z)| = \big|\frac{1}{1 + e^{-\tau z}}\big| \le 1$. □

Lemma 4.7.
Let $g \in C^2([-1, 1])$ with $\|g^{(s)}\|_{L^\infty([-1,1])} \le B$ for $s = 0, 1, 2$. Assume that $g'(0) = 0$. Let $\{z_j\}_{j=0}^{2m}$ be a partition of $[-1, 1]$ with $m \ge 1$, $z_0 = -1$, $z_m = 0$, $z_{2m} = 1$ and $z_{j+1} - z_j = h = 1/m$ for each $j = 0, \cdots, 2m - 1$. Then there exists a two-layer neural network $g_{\tau,m}$ of the form
\[(4.7)\quad g_{\tau,m}(z) = c + \sum_{i=1}^{2m} a_i\,\mathrm{SP}_\tau(\epsilon_i z - b_i), \quad z \in [-1, 1],\]
with $|c| = |g(0)| \le B$, $b_i \in [-1, 1]$, $|a_i| \le B/m$ and $\epsilon_i \in \{\pm 1\}$, $i = 1, \cdots, 2m$, such that
\[(4.8)\quad \|g - g_{\tau,m}\|_{W^{1,\infty}([-1,1])} \le 6B\delta_\tau,\]
where
\[(4.9)\quad \delta_\tau := \frac{1}{\tau}\Big(1 + \frac{1}{\tau}\Big)\big(\log(\tau) + 1\big).\]

Proof.
Thanks to Lemma 4.5, there exists $g_m$ of the form
\[(4.10)\quad g_m(z) = c + \sum_{i=1}^m a_i\,\mathrm{ReLU}(z_i - z) + \sum_{i=m+1}^{2m} a_i\,\mathrm{ReLU}(z - z_{i-1}), \quad z \in [-1, 1],\]
such that $\|g - g_m\|_{W^{1,\infty}([-1,1])} \le 2B/m$. More importantly, the coefficients satisfy $|a_i| \le B/m$, so that $\sum_{i=1}^{2m}|a_i| \le 2B$. Now let $g_{\tau,m}$ be the function obtained by replacing the activation ReLU in $g_m$ by $\mathrm{SP}_\tau$, i.e.,
\[(4.11)\quad g_{\tau,m}(z) = c + \sum_{i=1}^m a_i\,\mathrm{SP}_\tau(z_i - z) + \sum_{i=m+1}^{2m} a_i\,\mathrm{SP}_\tau(z - z_{i-1}), \quad z \in [-1, 1].\]
Suppose that $z \in (z_j, z_{j+1})$ for some fixed $j < m - 1$. Then thanks to Lemma 4.6(i), the bound $|a_i| \le B/m$ and the fact that $|z_i - z| \ge 1/m$ for $i \neq j, j + 1$ while $z \in (z_j, z_{j+1})$, we have
\[\begin{aligned}
|g_m(z) - g_{\tau,m}(z)| &\le \sum_{i \in \{j, j+1\}}|a_i|\,|\mathrm{ReLU}(z_i - z) - \mathrm{SP}_\tau(z_i - z)| + \sum_{i \notin \{j, j+1\}}|a_i|\,|\mathrm{ReLU}(\cdot) - \mathrm{SP}_\tau(\cdot)|\\
&\le \frac{2B}{m\tau} + \frac{2B}{\tau}\sup_{|x| \ge 1/m}e^{-\tau|x|}.
\end{aligned}\]
Similar bounds hold when $z \in (z_j, z_{j+1})$ for $j > m$, and likewise near the center node $z_m = 0$, where both the $m$-th and the $(m+1)$-th terms in (4.10) and (4.11) are close to their kinks. Therefore we have obtained
\[\|g_m - g_{\tau,m}\|_{L^\infty([-1,1])} \le \frac{2B}{m\tau} + \frac{2B}{\tau}e^{-\tau/m}.\]
Thanks to Lemma 4.6(ii), the same argument carries over to the estimate for the difference of the derivatives and leads to
\[\|g_m' - g_{\tau,m}'\|_{L^\infty([-1,1])} \le \frac{2B}{m} + 2Be^{-\tau/m}.\]
Combining the estimates above with $\|g - g_m\|_{W^{1,\infty}([-1,1])} \le 2B/m$ yields
\[\|g - g_{\tau,m}\|_{W^{1,\infty}([-1,1])} \le \|g - g_m\|_{W^{1,\infty}([-1,1])} + \|g_m - g_{\tau,m}\|_{W^{1,\infty}([-1,1])} \le \frac{4B}{m} + \frac{2B}{m\tau} + 2B\Big(1 + \frac{1}{\tau}\Big)e^{-\tau/m} \le 6B\delta_\tau,\]
where we have used the fact that $\max_{|x| \ge 1/m}e^{-\tau|x|} = e^{-\tau/m}$. □

Proof of Theorem 2.2. First, according to Theorem 4.1, $u - \hat{u}(0)$ lies in the $H^1$-closure of the convex hull of $\mathcal{F}_{\cos}(B)$ with $B = \|u\|_{\mathcal{B}(\Omega)}$. Note that each function in $\mathcal{F}_{\cos}(B)$ is a composition of the multivariate linear function $z = w \cdot x$ with $|w|_1 = 1$ and the univariate function $g(z)$ defined in (4.2), which satisfies $g'(0) = 0$ and $\|g^{(s)}\|_{L^\infty([-1,1])} \le B$ for $s = 0, 1, 2$. By Lemma 4.7, such a $g$ can be approximated by $g_{\tau,m}$, which lies in the convex hull of the set of functions
\[\big\{c + \gamma\,\mathrm{SP}_\tau(\epsilon z - b) \,\big|\, |c| \le B,\ \epsilon \in \{\pm 1\},\ |b| \le 1,\ |\gamma| \le 2B\big\},\]
with $\|g - g_{\tau,m}\|_{W^{1,\infty}([-1,1])} \le 6B\delta_\tau$. As a result, we have
\[\|g(w \cdot x) - g_{\tau,m}(w \cdot x)\|_{H^1(\Omega)} \le \|g - g_{\tau,m}\|_{W^{1,\infty}([-1,1])} \le 6B\delta_\tau.\]
Combined with $|\hat{u}(0)| \le B$, this yields a function $u_\tau$ in the closure of the convex hull of $\mathcal{F}_{\mathrm{SP}_\tau}(B)$ such that
\[(4.12)\quad \|u - u_\tau\|_{H^1(\Omega)} \le 6B\delta_\tau.\]
Thanks to Lemma 4.4 and the bound (4.12), and since the $H^1(\Omega)$-norm of every function in $\mathcal{F}_{\mathrm{SP}_\tau}(B)$ is bounded by $B(14 + 4\ln 2/\tau)$ owing to Lemma 4.6(iii), there exists $u_m \in \mathcal{F}_{\mathrm{SP}_\tau,m}(B)$, a convex combination of $m$ functions in $\mathcal{F}_{\mathrm{SP}_\tau}(B)$, such that
\[\|u_\tau - u_m\|_{H^1(\Omega)} \le \frac{B(14 + 4\ln 2/\tau)}{\sqrt{m}} \le \frac{18B}{\sqrt{m}} \quad \text{for } \tau \ge 1.\]
Combining the last two inequalities leads to
\[\|u - u_m\|_{H^1(\Omega)} \le 6B\delta_\tau + \frac{18B}{\sqrt{m}}.\]
Setting $\tau = \sqrt{m} \ge 1$ and using (4.9), we obtain
\[\|u - u_m\|_{H^1(\Omega)} \le \frac{6B}{\sqrt{m}}\Big(1 + \frac{1}{\sqrt{m}}\Big)\Big(\frac{1}{2}\log m + 1\Big) + \frac{18B}{\sqrt{m}} \le \frac{B(6\log m + 30)}{\sqrt{m}}.\]
This proves the desired estimate. □
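The softplus-ReLU comparison of Lemma 4.6(i), which drives the proof above, is easy to confirm numerically; a quick sketch (the grid and the value of $\tau$ are chosen arbitrarily):

```python
import numpy as np

tau = 25.0
sp = lambda z: np.logaddexp(0.0, tau * z) / tau   # SP_tau(z) = ln(1 + e^{tau z}) / tau
relu = lambda z: np.maximum(z, 0.0)

z = np.linspace(-2.0, 2.0, 400001)                # fine grid containing z = 0
gap = np.abs(relu(z) - sp(z))
bound = np.exp(-tau * np.abs(z)) / tau            # right-hand side of (i)

all_below = bool(np.all(gap <= bound + 1e-12))
max_gap = float(gap.max())                        # attained at z = 0, equals ln(2)/tau
```

The maximal gap $\ln(2)/\tau$ at $z = 0$ shows that the $1/\tau$ scaling in Lemma 4.6(i) is sharp up to the constant.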
5. Rademacher complexities of two-layer neural networks

The goal of this section is to derive Rademacher complexity bounds for some two-layer neural-network function classes that are relevant to the Ritz losses of the Poisson and the static Schrödinger equations. These bounds will be essential for obtaining the generalization bounds in Theorem 2.3 and Theorem 2.4.

First let us consider, for fixed positive constants $C, \Gamma, W$ and $T$, the set of two-layer neural networks
\[(5.1)\quad \mathcal{F}_m = \Big\{u_\theta(x) = c + \sum_{i=1}^m \gamma_i\,\phi(w_i \cdot x + t_i),\ x \in \Omega,\ \theta \in \Theta \,\Big|\, |c| \le C,\ \sum_{i=1}^m|\gamma_i| \le \Gamma,\ \|w_i\|_2 \le W,\ |t_i| \le T\Big\}.\]
Here $\phi$ is the activation function, $\theta = (c, \{\gamma_i\}_{i=1}^m, \{w_i\}_{i=1}^m, \{t_i\}_{i=1}^m)$ denotes collectively the parameters of the two-layer neural network, and
\[\Theta = \Theta_c \times \Theta_\gamma \times \Theta_w \times \Theta_t = [-C, C] \times B_1^m(\Gamma) \times \big(B_2^d(W)\big)^m \times [-T, T]^m\]
represents the parameter space. We shall consider the set $\Theta$ endowed with the metric $\rho_\Theta$ defined for $\theta = (c, \gamma, w, t)$, $\theta' = (c', \gamma', w', t') \in \Theta$ by
\[(5.2)\quad \rho_\Theta(\theta, \theta') = \max\big\{|c - c'|,\ \|\gamma - \gamma'\|_1,\ \max_i\|w_i - w_i'\|_2,\ \|t - t'\|_\infty\big\}.\]
Throughout this section we assume that $\phi$ satisfies the following assumption, which in particular holds for the Softplus activation function.

Assumption 5.1. $\phi \in C^1(\mathbb{R})$, and $\phi$ (resp. $\phi'$, the derivative of $\phi$) is $L$-Lipschitz (resp. $L'$-Lipschitz) for some $L, L' > 0$. Moreover, there exist positive constants $\phi_{\max}$ and $\phi'_{\max}$ such that
\[\sup_{w \in \Theta_w,\, t \in \Theta_t,\, x \in \Omega}|\phi(w \cdot x + t)| \le \phi_{\max} \quad \text{and} \quad \sup_{w \in \Theta_w,\, t \in \Theta_t,\, x \in \Omega}|\phi'(w \cdot x + t)| \le \phi'_{\max}.\]
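For the Softplus activation $\mathrm{SP}_\tau$, the Lipschitz constants in Assumption 5.1 can be estimated numerically: $\mathrm{SP}_\tau$ is $1$-Lipschitz and its derivative (a sigmoid) is $(\tau/4)$-Lipschitz. A minimal check (the sample grid and $\tau$ are chosen arbitrarily):

```python
import numpy as np

tau = 4.0
sp  = lambda z: np.logaddexp(0.0, tau * z) / tau   # SP_tau
sp1 = lambda z: 1.0 / (1.0 + np.exp(-tau * z))     # SP_tau' (a sigmoid)

z = np.linspace(-6.0, 6.0, 100001)
dz = z[1] - z[0]
lip_sp  = float(np.max(np.abs(np.diff(sp(z)))) / dz)   # empirical L,  should be <= 1
lip_sp1 = float(np.max(np.abs(np.diff(sp1(z)))) / dz)  # empirical L', should be <= tau/4
```

The difference quotients never exceed the analytic bounds $L = 1$ and $L' = \tau/4$ and approach them on the grid, consistent with the claim that Softplus satisfies Assumption 5.1.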
Recall that the Rademacher complexity of a function class $\mathcal{G}$ is defined by
\[R_n(\mathcal{G}) = \mathbb{E}_Z\mathbb{E}_\sigma\Big[\sup_{g \in \mathcal{G}}\Big|\frac{1}{n}\sum_{j=1}^n\sigma_j g(Z_j)\Big| \,\Big|\, Z_1, \cdots, Z_n\Big].\]
In the subsequent proofs it will be useful to work with the following modified Rademacher complexity $\tilde{R}_n(\mathcal{G})$, without the absolute value sign:
\[\tilde{R}_n(\mathcal{G}) = \mathbb{E}_Z\mathbb{E}_\sigma\Big[\sup_{g \in \mathcal{G}}\frac{1}{n}\sum_{j=1}^n\sigma_j g(Z_j) \,\Big|\, Z_1, \cdots, Z_n\Big].\]
The lemma below bounds the Rademacher complexity of $\mathcal{F}_m$.

Lemma 5.1. Assume that the activation function $\phi$ is $L$-Lipschitz. Then
\[R_n(\mathcal{F}_m) \le \frac{4L\Gamma(W\sqrt{d} + T) + 2\Gamma|\phi(0)|}{\sqrt{n}}.\]

Proof. Let $\bar{\phi}(x) = \phi(x) - \phi(0)$. First observe that
\[\begin{aligned}
\mathbb{E}_\sigma\Big[\sup_{f \in \mathcal{F}_m}\frac{1}{n}\sum_{j=1}^n\sigma_j f(Z_j) \,\Big|\, Z_1, \cdots, Z_n\Big] &= \mathbb{E}_\sigma\Big[\sup_\Theta\frac{1}{n}\sum_{j=1}^n\sigma_j\Big(c + \sum_{i=1}^m\gamma_i\,\phi(w_i \cdot Z_j + t_i)\Big) \,\Big|\, Z_1, \cdots, Z_n\Big]\\
&\le \frac{1}{n}\mathbb{E}_\sigma\Big[\sup_\Theta\sum_{i=1}^m\gamma_i\sum_{j=1}^n\sigma_j\bar{\phi}(w_i \cdot Z_j + t_i) \,\Big|\, Z_1, \cdots, Z_n\Big] + \frac{1}{n}\mathbb{E}_\sigma\Big[\sup_\Theta\sum_{i=1}^m\gamma_i\sum_{j=1}^n\sigma_j\,\phi(0)\Big]\\
&=: J_1 + J_2.
\end{aligned}\]
Using the fact that $\bar{\phi}(\cdot) = \phi(\cdot) - \phi(0)$ is $L$-Lipschitz with $\bar{\phi}(0) = 0$, one has
\[\begin{aligned}
J_1 &\le \frac{\Gamma}{n}\,\mathbb{E}_\sigma\Big[\sup_{\|w\|_2 \le W,\,|t| \le T}\Big|\sum_{j=1}^n\sigma_j\bar{\phi}(w \cdot Z_j + t)\Big| \,\Big|\, Z_1, \cdots, Z_n\Big]\\
&\le \frac{2L\Gamma}{n}\Big(\mathbb{E}_\sigma\Big[\sup_{\|w\|_2 \le W}\Big|\sum_{j=1}^n\sigma_j\,w \cdot Z_j\Big| \,\Big|\, Z_1, \cdots, Z_n\Big] + \mathbb{E}_\sigma\Big[\sup_{|t| \le T}\Big|\sum_{j=1}^n\sigma_j t\Big|\Big]\Big)\\
&\le \frac{2L\Gamma}{n}\Big(W \cdot \mathbb{E}_\sigma\Big\|\sum_{j=1}^n\sigma_j Z_j\Big\|_2 + T\,\mathbb{E}_\sigma\Big|\sum_{j=1}^n\sigma_j\Big|\Big)\\
&\le \frac{2L\Gamma}{n}\Big(W\sqrt{\sum_{j=1}^n\|Z_j\|_2^2} + T\sqrt{\mathbb{E}_\sigma\Big(\sum_{j=1}^n\sigma_j\Big)^2}\Big) \le \frac{2L\Gamma(W\sqrt{d} + T)}{\sqrt{n}}.
\end{aligned}\]
Note that in the second inequality we have used Talagrand's contraction principle (Lemma 5.2 below). Moreover, since $\sum_{i=1}^m|\gamma_i| \le \Gamma$, it is easy to see that
\[J_2 \le \frac{\Gamma|\phi(0)|}{n}\,\mathbb{E}_\sigma\Big|\sum_{j=1}^n\sigma_j\Big| \le \frac{\Gamma|\phi(0)|}{n}\sqrt{\mathbb{E}_\sigma\Big(\sum_{j=1}^n\sigma_j\Big)^2} = \frac{\Gamma|\phi(0)|}{\sqrt{n}}.\]
Combining the estimates above and taking the expectation with respect to $\{Z_j\}$ yields $\tilde{R}_n(\mathcal{F}_m) \le \frac{2L\Gamma(W\sqrt{d} + T) + \Gamma|\phi(0)|}{\sqrt{n}}$. This combined with Lemma 5.3 below leads to the desired estimate. □

Lemma 5.2 (Ledoux-Talagrand contraction [25, Theorem 4.12]). Assume that $\phi : \mathbb{R} \to \mathbb{R}$ is $L$-Lipschitz with $\phi(0) = 0$. Let $\{\sigma_i\}_{i=1}^n$ be independent Rademacher random variables.
Then for any $\mathcal{T} \subset \mathbb{R}^n$,
\[\mathbb{E}_\sigma\sup_{(t_1, \cdots, t_n) \in \mathcal{T}}\Big|\sum_{i=1}^n\sigma_i\,\phi(t_i)\Big| \le 2L \cdot \mathbb{E}_\sigma\sup_{(t_1, \cdots, t_n) \in \mathcal{T}}\Big|\sum_{i=1}^n\sigma_i t_i\Big|.\]

Lemma 5.3. [27, Lemma 1] Assume that the set of functions $\mathcal{G}$ contains the zero function. Then $R_n(\mathcal{G}) \le 2\tilde{R}_n(\mathcal{G})$.

Recall the sets of two-layer neural networks $\mathcal{F}_{\mathrm{ReLU},m}(B)$ and $\mathcal{F}_{\mathrm{SP}_\tau,m}(B)$ defined by (2.12) and (2.13), respectively. Since both ReLU and $\mathrm{SP}_\tau$ are $1$-Lipschitz, and $\mathrm{ReLU}(0) = 0$, $\mathrm{SP}_\tau(0) = \frac{\ln 2}{\tau}$, the following corollary is a direct consequence of Lemma 5.1 (applied with $L = 1$, $W = T = 1$ and $\Gamma = 4B$).

Corollary 5.1.
\[R_n(\mathcal{F}_{\mathrm{ReLU},m}(B)) \le \frac{16(\sqrt{d} + 1)B}{\sqrt{n}} \quad \text{and} \quad R_n(\mathcal{F}_{\mathrm{SP}_\tau,m}(B)) \le \frac{\big(16(\sqrt{d} + 1) + \frac{8\ln 2}{\tau}\big)B}{\sqrt{n}}.\]

Given the source function $f \in L^\infty(\Omega)$ and the potential $V \in L^\infty(\Omega)$, we recall the function classes associated to the Ritz losses of the Poisson equation and the static Schrödinger equation:
\[(5.3)\quad \mathcal{G}_{m,P} := \Big\{g : \Omega \to \mathbb{R} \,\Big|\, g = \frac{1}{2}|\nabla u|^2 - fu \text{ where } u \in \mathcal{F}_m\Big\}, \qquad \mathcal{G}_{m,S} := \Big\{g : \Omega \to \mathbb{R} \,\Big|\, g = \frac{1}{2}|\nabla u|^2 + \frac{1}{2}V|u|^2 - fu \text{ where } u \in \mathcal{F}_m\Big\}.\]
In the sequel we aim to bound the Rademacher complexities of $\mathcal{G}_{m,P}$ and $\mathcal{G}_{m,S}$ defined above. This will be achieved by bounding the Rademacher complexities of the following function classes:
\[\mathcal{G}_m^1 := \Big\{g = \frac{1}{2}|\nabla u|^2 \,\Big|\, u \in \mathcal{F}_m\Big\}, \qquad \mathcal{G}_m^2 := \big\{g = fu \,\big|\, u \in \mathcal{F}_m\big\}, \qquad \mathcal{G}_m^3 := \Big\{g = \frac{1}{2}V|u|^2 \,\Big|\, u \in \mathcal{F}_m\Big\}.\]
The celebrated Dudley theorem will be used to bound the Rademacher complexity in terms of the metric entropy. To this end, let us first recall the metric entropy and Dudley's theorem. Let $(E, \rho)$ be a metric space with metric $\rho$. A $\delta$-cover of a set $A \subset E$ with respect to $\rho$ is a collection of points $\{x_1, \cdots, x_n\} \subset A$ such that for every $x \in A$ there exists $i \in \{1, \cdots, n\}$ with $\rho(x, x_i) \le \delta$.
The $\delta$-covering number $N(\delta, A, \rho)$ is the cardinality of the smallest $\delta$-cover of the set $A$ with respect to the metric $\rho$. Equivalently, $N(\delta, A, \rho)$ is the minimal number of balls $B_\rho(x, \delta)$ of radius $\delta$ needed to cover the set $A$.

Theorem 5.1 (Dudley's theorem). Let $\mathcal{F}$ be a function class such that $\sup_{f \in \mathcal{F}}\|f\|_\infty \le M$. Then the Rademacher complexity $R_n(\mathcal{F})$ satisfies
\[R_n(\mathcal{F}) \le \inf_{0 \le \delta \le M}\Big\{4\delta + \frac{12}{\sqrt{n}}\int_\delta^M\sqrt{\log N(\varepsilon, \mathcal{F}, \|\cdot\|_\infty)}\,d\varepsilon\Big\}.\]

Note that our statement of Dudley's theorem is slightly different from the standard one, where the covering number is based on the empirical $\ell^2$-metric instead of the $L^\infty$-metric above. However, since the $L^\infty$-metric is stronger than the empirical $\ell^2$-metric, and since the covering number is monotonically increasing with respect to the metric, Theorem 5.1 follows directly from the classical Dudley theorem (see e.g. [42, Theorem 1.19]).

Let us now state an elementary lemma on the covering number of product spaces.

Lemma 5.4. Let $(E_i, \rho_i)$ be metric spaces and let $A_i \subset E_i$, $i = 1, \cdots, n$. Consider the product space $E = \times_{i=1}^n E_i$ equipped with the metric $\rho = \max_i \rho_i$, and the set $A = \times_{i=1}^n A_i$. Then for any $\delta > 0$,
\[(5.4)\quad N(\delta, A, \rho) \le \prod_{i=1}^n N(\delta, A_i, \rho_i).\]

Proof. It suffices to prove the lemma in the case $n = 2$, i.e.,
\[(5.5)\quad N(\delta, A_1 \times A_2, \rho) \le N(\delta, A_1, \rho_1) \cdot N(\delta, A_2, \rho_2).\]
Indeed, suppose that $\mathcal{C}_1$ and $\mathcal{C}_2$ are $\delta$-covers of $A_1$ and $A_2$, respectively. Then it is straightforward to check that the product set $\mathcal{C}_1 \times \mathcal{C}_2$ is a $\delta$-cover of $A_1 \times A_2$ in the space $(E_1 \times E_2, \rho)$ with $\rho = \max(\rho_1, \rho_2)$. Hence $N(\delta, A_1 \times A_2, \rho) \le \mathrm{card}(\mathcal{C}_1) \cdot \mathrm{card}(\mathcal{C}_2)$. Applying this inequality to covers $\mathcal{C}_i$ with $\mathrm{card}(\mathcal{C}_i) = N(\delta, A_i, \rho_i)$, $i = 1, 2$, we obtain (5.5). The general inequality (5.4) follows by iterating (5.5). □
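Lemma 5.4's product construction is concrete enough to execute directly: the sketch below builds explicit $\delta$-covers of two intervals and takes their Cartesian product, which covers the product set in the max-metric (all choices of sets and $\delta$ are illustrative):

```python
import math

def cover_interval(C, delta):
    """Centers of a delta-cover of [-C, C] using ceil(C/delta) intervals of radius delta."""
    n = max(1, math.ceil(C / delta))
    return [-C + (2 * i + 1) * C / n for i in range(n)]

delta = 0.1
c1 = cover_interval(1.0, delta)                   # N(delta, [-1,1]) <= 10
c2 = cover_interval(2.0, delta)                   # N(delta, [-2,2]) <= 20
product_cover = [(x, y) for x in c1 for y in c2]  # covers [-1,1] x [-2,2] in the max-metric

# spot-check the covering property on a coarse grid of test points
covered = all(
    min(max(abs(x - px), abs(y - py)) for (px, py) in product_cover) <= delta + 1e-12
    for x in [i / 5.0 - 1.0 for i in range(11)]
    for y in [j / 5.0 * 2.0 - 2.0 for j in range(11)]
)
```

The cardinality of the product cover is exactly the product of the two cardinalities, which is the content of (5.5).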
□

As a consequence of Lemma 5.4, the following proposition gives an upper bound for the covering number $N(\delta,\Theta,\rho_\Theta)$.

Proposition 5.1. Consider the metric space $(\Theta,\rho_\Theta)$ with $\rho_\Theta$ defined in (5.2). Then for any $\delta>0$, the covering number $N(\delta,\Theta,\rho_\Theta)$ satisfies
$$N(\delta,\Theta,\rho_\Theta) \le \frac{2C}{\delta}\cdot\Big(\frac{3\Gamma}{\delta}\Big)^m\cdot\Big(\frac{3W}{\delta}\Big)^{dm}\cdot\Big(\frac{3T}{\delta}\Big)^m.$$

Proof. Thanks to Lemma 5.4,
$$N(\delta,\Theta,\rho_\Theta) \le N(\delta,\Theta_c,|\cdot|)\cdot N(\delta,\Theta_\gamma,\|\cdot\|_1)\cdot\big(N(\delta,B^d_1(W),\|\cdot\|_1)\big)^m\cdot N(\delta,\Theta_t,\|\cdot\|_\infty) \le \frac{2C}{\delta}\cdot\Big(\frac{3\Gamma}{\delta}\Big)^m\cdot\Big(\frac{3W}{\delta}\Big)^{dm}\cdot\Big(\frac{3T}{\delta}\Big)^m,$$
where in the last inequality we have used the fact that the covering number of a $d$-dimensional $\ell^p$-ball of radius $r$ satisfies $N(\delta,B^d_p(r),\|\cdot\|_p) \le \big(\frac{3r}{\delta}\big)^d$ for $0<\delta\le r$. □

Bounding $\mathcal{R}_n(\mathcal{G}^1_m)$. We would like to bound $\mathcal{R}_n(\mathcal{G}^1_m)$ from above using metric entropy. To this end, let us first bound the covering number $N(\delta,\mathcal{G}^1_m,\|\cdot\|_\infty)$. Recall the parameters $C,\Gamma,W$ and $T$ in (5.1). With those parameters fixed, to simplify expressions, we introduce the following functions to be used in the sequel:
$$(5.6)\qquad \mathcal{M}(\delta,\Lambda,m,d) := \frac{2C\Lambda}{\delta}\cdot\Big(\frac{3\Gamma\Lambda}{\delta}\Big)^m\cdot\Big(\frac{3W\Lambda}{\delta}\Big)^{dm}\cdot\Big(\frac{3T\Lambda}{\delta}\Big)^m,$$
$$(5.7)\qquad Z(M,\Lambda,d) := 12\Big[ M\Big(\sqrt{(\log(2C\Lambda))_+} + \sqrt{(\log(3\Gamma\Lambda)+d\log(3W\Lambda)+\log(3T\Lambda))_+}\Big) + (\sqrt{d}+3)\int_0^M \sqrt{(\log(1/\varepsilon))_+}\, d\varepsilon \Big].$$

Lemma 5.5. Let the activation function $\phi$ satisfy Assumption 5.1. Then we have
$$(5.8)\qquad N(\delta,\mathcal{G}^1_m,\|\cdot\|_\infty) \le \mathcal{M}(\delta,\Lambda_1,m,d),$$
where the constant $\Lambda_1$ is defined by
$$(5.9)\qquad \Lambda_1 = \Big((W+\Gamma)\phi'_{\max} + \Gamma W L'(\sqrt{d}+1)\Big)\,\Gamma W \phi'_{\max}.$$

Proof. Thanks to Assumption 5.1, $\sup_{\theta\in\Theta}|\phi'(w\cdot x+t)| \le \phi'_{\max}$.
This implies that
$$\sup_{x\in\Omega}|\nabla u_\theta(x)| \le \sum_{i=1}^m |\gamma_i|\,\|w_i\|\,|\phi'(w_i\cdot x+t_i)| \le \Gamma W \phi'_{\max}.$$
Furthermore, for $\theta,\theta'\in\Theta$, by adding and subtracting terms, we have that
$$|\nabla u_\theta(x)-\nabla u_{\theta'}(x)| \le \sum_{i=1}^m |\gamma_i-\gamma'_i|\,\|w_i\|\,|\phi'(w_i\cdot x+t_i)| + \sum_{i=1}^m |\gamma'_i|\,\|w_i-w'_i\|\,|\phi'(w_i\cdot x+t_i)| + \sum_{i=1}^m |\gamma'_i|\,\|w'_i\|\,|\phi'(w_i\cdot x+t_i)-\phi'(w'_i\cdot x+t'_i)|$$
$$\le W\phi'_{\max}\|\gamma-\gamma'\|_1 + \Gamma\phi'_{\max}\max_i\|w_i-w'_i\| + \Gamma W L'\big(\sqrt{d}\max_i\|w_i-w'_i\| + \|t-t'\|_\infty\big) \le \Big((W+\Gamma)\phi'_{\max} + \Gamma W L'(\sqrt{d}+1)\Big)\rho_\Theta(\theta,\theta').$$
Combining the last two estimates yields that
$$\tfrac12\big| |\nabla u_\theta(x)|^2 - |\nabla u_{\theta'}(x)|^2 \big| \le \tfrac12\big|\nabla u_\theta(x)+\nabla u_{\theta'}(x)\big|\,\big|\nabla u_\theta(x)-\nabla u_{\theta'}(x)\big| \le \Lambda_1\,\rho_\Theta(\theta,\theta').$$
This particularly implies that $N(\delta,\mathcal{G}^1_m,\|\cdot\|_\infty) \le N(\delta/\Lambda_1,\Theta,\rho_\Theta)$. Then the estimate (5.8) follows from Proposition 5.1 with $\delta$ replaced by $\delta/\Lambda_1$. □

Proposition 5.2. Assume that the activation function $\phi$ satisfies Assumption 5.1. Then
$$\mathcal{R}_n(\mathcal{G}^1_m) \le Z(M_1,\Lambda_1,d)\cdot\sqrt{\frac{m}{n}},$$
where $M_1 = \tfrac12\Gamma^2W^2(\phi'_{\max})^2$ and $\Lambda_1$ is defined in (5.9).

Proof. Thanks to Assumption 5.1,
$$\sup_{g\in\mathcal{G}^1_m}\|g\|_{L^\infty(\Omega)} \le \tfrac12\sup_{u\in\mathcal{F}_m}\|\nabla u\|^2_{L^\infty(\Omega)} \le \tfrac12\Gamma^2W^2(\phi'_{\max})^2.$$
Then the proposition follows from Lemma 5.5, Theorem 5.1 with $\delta=0$ and $M=M_1=\tfrac12\Gamma^2W^2(\phi'_{\max})^2$, and the simple fact that $\sqrt{a+b}\le\sqrt{a}+\sqrt{b}$ for $a,b\ge 0$. □

Bounding $\mathcal{R}_n(\mathcal{G}^2_m)$.
The next lemma provides an upper bound for $N(\delta,\mathcal{G}^2_m,\|\cdot\|_\infty)$.

Lemma 5.6. Assume that $\|f\|_{L^\infty(\Omega)} \le F$ for some $F>0$. Assume that the activation function $\phi$ satisfies Assumption 5.1. Then the covering number $N(\delta,\mathcal{G}^2_m,\|\cdot\|_\infty)$ satisfies
$$N(\delta,\mathcal{G}^2_m,\|\cdot\|_\infty) \le \mathcal{M}(\delta,\Lambda_2,m,d).$$
Here the constant $\Lambda_2$ is defined by
$$(5.10)\qquad \Lambda_2 = F\big(1+\phi_{\max} + L\Gamma(1+\sqrt{d})\big).$$

Proof. Note that a function $g_\theta\in\mathcal{G}^2_m$ has the form $g_\theta = f u_\theta$. Given $\theta=(c,\gamma,w,t)$, $\theta'=(c',\gamma',w',t')\in\Theta$, we have
$$(5.11)\qquad |u_\theta(x)-u_{\theta'}(x)| \le |c-c'| + \Big|\sum_{i=1}^m \gamma_i\phi(w_i\cdot x+t_i) - \sum_{i=1}^m \gamma'_i\phi(w'_i\cdot x+t'_i)\Big| \le |c-c'| + \sum_{i=1}^m |\gamma_i-\gamma'_i|\,|\phi(w_i\cdot x+t_i)| + \sum_{i=1}^m |\gamma'_i|\,|\phi(w_i\cdot x+t_i)-\phi(w'_i\cdot x+t'_i)|.$$
Since $\phi$ satisfies Assumption 5.1, we have that $|\phi(w_i\cdot x+t_i)| \le \phi_{\max}$ and that
$$|\phi(w_i\cdot x+t_i)-\phi(w'_i\cdot x+t'_i)| \le L\big(\sqrt{d}\,\|w_i-w'_i\| + |t_i-t'_i|\big).$$
Therefore, it follows from (5.11) that
$$(5.12)\qquad |u_\theta(x)-u_{\theta'}(x)| \le |c-c'| + \phi_{\max}\|\gamma-\gamma'\|_1 + L\Gamma\big(\sqrt{d}\max_i\|w_i-w'_i\| + \|t-t'\|_\infty\big) \le \big(1+\phi_{\max}+L\Gamma(1+\sqrt{d})\big)\rho_\Theta(\theta,\theta').$$
This implies that $\|g_\theta-g_{\theta'}\|_\infty \le F\big(1+\phi_{\max}+L\Gamma(1+\sqrt{d})\big)\rho_\Theta(\theta,\theta') = \Lambda_2\,\rho_\Theta(\theta,\theta')$. As a consequence, $N(\delta,\mathcal{G}^2_m,\|\cdot\|_\infty) \le N(\delta/\Lambda_2,\Theta,\rho_\Theta)$. Then the lemma follows from Proposition 5.1 with $\delta$ replaced by $\delta/\Lambda_2$. □

Proposition 5.3. Assume that $\|f\|_{L^\infty(\Omega)} \le F$ for some $F>0$. Assume that the activation function $\phi$ satisfies Assumption 5.1.
Then
$$\mathcal{R}_n(\mathcal{G}^2_m) \le Z(M_2,\Lambda_2,d)\cdot\sqrt{\frac{m}{n}},$$
where $M_2 = F(C+\Gamma\phi_{\max})$ and $\Lambda_2$ is defined in (5.10).

Proof. From the definition of $\mathcal{G}^2_m$ and the assumption that $\|f\|_{L^\infty(\Omega)}\le F$, one has that $\sup_{g\in\mathcal{G}^2_m}\|g\|_{L^\infty(\Omega)} \le M_2 = F(C+\Gamma\phi_{\max})$. Then the proposition is proved by an application of Theorem 5.1 with $\delta=0$, $M=M_2$ and Lemma 5.6. □

Bounding $\mathcal{R}_n(\mathcal{G}^3_m)$. The lemma below gives an upper bound for $N(\delta,\mathcal{G}^3_m,\|\cdot\|_\infty)$.

Lemma 5.7. Assume that $\|V\|_{L^\infty(\Omega)} \le V_{\max}$ for some $V_{\max}<\infty$. Assume that the activation function $\phi$ satisfies Assumption 5.1. Then the covering number $N(\delta,\mathcal{G}^3_m,\|\cdot\|_\infty)$ satisfies
$$(5.13)\qquad N(\delta,\mathcal{G}^3_m,\|\cdot\|_\infty) \le \mathcal{M}(\delta,\Lambda_3,m,d),$$
where the constant $\Lambda_3$ is defined by
$$(5.14)\qquad \Lambda_3 = V_{\max}(C+\Gamma\phi_{\max})\big(1+\phi_{\max}+L\Gamma(1+\sqrt{d})\big).$$

Proof. By the definition of $\mathcal{F}_m$ and Assumption 5.1 on $\phi$, $\sup_{u\in\mathcal{F}_m}\|u\|_{L^\infty(\Omega)} \le C+\Gamma\phi_{\max}$. Moreover, recall from (5.12) that for $\theta,\theta'\in\Theta$,
$$|u_\theta(x)-u_{\theta'}(x)| \le \big(1+\phi_{\max}+L\Gamma(1+\sqrt{d})\big)\rho_\Theta(\theta,\theta').$$
Consequently,
$$\tfrac12\big|V(x)u_\theta(x)^2 - V(x)u_{\theta'}(x)^2\big| \le \tfrac12|V(x)|\,|u_\theta(x)+u_{\theta'}(x)|\,|u_\theta(x)-u_{\theta'}(x)| \le \Lambda_3\,\rho_\Theta(\theta,\theta').$$
The estimate (5.13) follows from the same line of arguments used in the proof of Lemma 5.6. □

Proposition 5.4. Under the same assumptions as Lemma 5.7, $\mathcal{G}^3_m$ satisfies
$$\mathcal{R}_n(\mathcal{G}^3_m) \le Z(M_3,\Lambda_3,d)\cdot\sqrt{\frac{m}{n}},$$
where $M_3 = \tfrac12 V_{\max}(C+\Gamma\phi_{\max})^2$ and $\Lambda_3$ is defined in (5.14).

Proof. Note that $\sup_{g\in\mathcal{G}^3_m}\|g\|_{L^\infty(\Omega)} \le M_3 = \tfrac12 V_{\max}(C+\Gamma\phi_{\max})^2$. Then the proposition follows from Theorem 5.1 with $\delta=0$, $M=M_3$ and Lemma 5.7. □

The following corollary is a direct consequence of Propositions 5.2–5.4.
Corollary 5.2. The two sets of functions $\mathcal{G}_{m,P}$ and $\mathcal{G}_{m,S}$ defined in (5.3) satisfy
$$\mathcal{R}_n(\mathcal{G}_{m,P}) \le \big(Z(M_1,\Lambda_1,d) + Z(M_2,\Lambda_2,d)\big)\cdot\sqrt{\frac{m}{n}} \quad\text{and}\quad \mathcal{R}_n(\mathcal{G}_{m,S}) \le \sum_{i=1}^3 Z(M_i,\Lambda_i,d)\cdot\sqrt{\frac{m}{n}}.$$

Considering the set of two-layer neural networks $\mathcal{F}_{\mathrm{SP}_\tau,m}(B)$ defined in (2.13) with $\tau=\sqrt{m}$, we define the following associated sets of functions:
$$\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B) := \{ g:\Omega\to\mathbb{R} \mid g=\tfrac12|\nabla u|^2 - fu \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B) \},$$
$$\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B) := \{ g:\Omega\to\mathbb{R} \mid g=\tfrac12|\nabla u|^2 + \tfrac12 V|u|^2 - fu \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B) \},$$
$$\mathcal{G}^1_{\tau,m}(B) := \{ g \mid g=\tfrac12|\nabla u|^2 \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B) \},\quad \mathcal{G}^2_{\tau,m}(B) := \{ g \mid g=fu \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B) \},\quad \mathcal{G}^3_{\tau,m}(B) := \{ g \mid g=\tfrac12 V|u|^2 \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B) \}.$$

Corollary 5.2 allows us to bound the Rademacher complexities of $\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B)$ and $\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B)$. Indeed, from the definition of the activation function $\mathrm{SP}_\tau$, we know that $\|\mathrm{SP}'_\tau\|_{L^\infty(\mathbb{R})} \le 1$ and $\|\mathrm{SP}''_\tau\|_{L^\infty(\mathbb{R})} \le \tau = \sqrt{m}$, so $\mathrm{SP}_\tau$ satisfies Assumption 5.1 with
$$L = \phi'_{\max} = 1,\qquad L' = \tau = \sqrt{m},\qquad \phi_{\max} \le 2+\frac{\ln 2}{\sqrt{m}} \le 3.$$
Note also that $\mathcal{F}_{\mathrm{SP}_\tau,m}(B)$ coincides with the set $\mathcal{F}_m$ defined in (5.1) with the following parameters:
$$(5.15)\qquad C=2B,\quad \Gamma=4B,\quad W=1,\quad T=1.$$
With the parameters above, one has that
$$M_1 = 8B^2,\qquad \Lambda_1 \le 16B^2(\sqrt{d}+1)\sqrt{m} + 4B(1+4B),$$
$$M_2 \le 14FB,\qquad \Lambda_2 \le F\big(5+4B(1+\sqrt{d})\big),$$
$$M_3 \le 98V_{\max}B^2,\qquad \Lambda_3 \le 14V_{\max}B\big(5+4B(1+\sqrt{d})\big).$$
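As a numerical aside (not part of the proof), the entropy integral appearing in the definition (5.7) of $Z$ is finite: for $M=1$ it evaluates in closed form to $\Gamma(3/2)=\sqrt{\pi}/2$, which a midpoint-rule quadrature confirms.

```python
import math

def entropy_integral(M, n=200_000):
    """Midpoint-rule quadrature for the entropy integral in (5.7):
    the integral of sqrt((log(1/eps))_+) over eps in (0, M]."""
    h = M / n
    total = 0.0
    for i in range(n):
        eps = (i + 0.5) * h          # midpoints; the integrand is integrable at 0
        val = math.log(1.0 / eps)
        total += math.sqrt(val) if val > 0 else 0.0
    return total * h

I1 = entropy_integral(1.0)
# For M = 1 the closed form is Gamma(3/2) = sqrt(pi)/2 ~ 0.886.
assert abs(I1 - math.sqrt(math.pi) / 2) < 1e-3
print(I1)
```

The closed form follows from the substitution $\varepsilon=e^{-s}$, which turns the integral into $\int_0^\infty \sqrt{s}\,e^{-s}\,ds=\Gamma(3/2)$.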
Inserting $M_i$ and $\Lambda_i$, $i=1,2,3$, into (5.7), one can obtain by a straightforward calculation that there exist positive constants $C_1(B,d)$, $C_2(B,d,F)$ and $C_3(B,d,V_{\max})$, depending on the parameters $B,d,F,V_{\max}$ polynomially, such that
$$Z(M_1,\Lambda_1,d) \le C_1(B,d)\sqrt{\log m},\qquad Z(M_2,\Lambda_2,d) \le C_2(B,d,F),\qquad Z(M_3,\Lambda_3,d) \le C_3(B,d,V_{\max}).$$
Combining the estimates above with Corollary 5.2 gives directly the Rademacher complexity bounds for $\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B)$ and $\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B)$, as summarized in the following theorem.

Theorem 5.2. Assume that $\|f\|_{L^\infty(\Omega)} \le F$ and $\|V\|_{L^\infty(\Omega)} \le V_{\max}$. Consider the sets $\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B)$ and $\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B)$ with $\tau=\sqrt{m}$. Then there exist positive constants $C_P(B,d,F)$ and $C_S(B,d,F,V_{\max})$, depending polynomially on $B,d,F,V_{\max}$, such that
$$\mathcal{R}_n(\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B)) \le \frac{C_P(B,d,F)\sqrt{m}(\sqrt{\log m}+1)}{\sqrt{n}},\qquad \mathcal{R}_n(\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B)) \le \frac{C_S(B,d,F,V_{\max})\sqrt{m}(\sqrt{\log m}+1)}{\sqrt{n}}.$$

6. Proofs of Theorem 2.3 and Theorem 2.4

With the approximation estimates for spectral Barron functions and the complexity estimates of the two-layer neural networks proved in previous sections, we are ready to prove Theorem 2.3 and Theorem 2.4, which establish the a priori generalization error bounds of the DRM.

Proof of Theorem 2.3. Recall that $u^m_{n,P}$ is the minimizer of the empirical loss $\mathcal{E}_{n,P}$ in the set $\mathcal{F}=\mathcal{F}_{\mathrm{SP}_\tau,m}(B)$ with $\tau=\sqrt{m}$, where $B=\|u^*_P\|_{\mathcal{B}^2(\Omega)}$. From the definition of $\mathcal{F}_{\mathrm{SP}_\tau,m}(B)$, one can obtain that $\sup_{u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B)}\|u\|_{L^\infty(\Omega)} \le 14B$.
Then it follows from Theorem 3.1, Theorem 5.2, Theorem 2.2 and Corollary 5.1 that
$$\mathbb{E}\big[\mathcal{E}_P(u^m_{n,P}) - \mathcal{E}_P(u^*_P)\big] \le 4\mathcal{R}_n(\mathcal{G}_{\mathrm{SP}_\tau,m,P}) + 4\sup_{u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B)}\|u\|_{L^\infty(\Omega)}\cdot\mathcal{R}_n(\mathcal{F}_{\mathrm{SP}_\tau,m}) + \frac12\inf_{u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B)}\|u-u^*_P\|^2_{H^1(\Omega)}$$
$$\le \frac{C_P(B,d,F)\sqrt{m}(\sqrt{\log m}+1)}{\sqrt{n}} + \frac{4\cdot 14B\cdot CB\big(1+\frac{\ln 2}{\sqrt{m}}\big)}{\sqrt{n}} + \frac{B^2(6\log m+30)}{m} \le \frac{C_1\sqrt{m}(\sqrt{\log m}+1)}{\sqrt{n}} + \frac{C_2(\log m+1)}{m},$$
where the constant $C_1$ depends polynomially on $B$, $d$ and $F$, and $C_2$ depends only quadratically on $B$. □

Proof of Theorem 2.4. The proof is almost identical to that of Theorem 2.3 and follows directly from Theorem 3.2, Theorem 5.2, Theorem 2.2 and Corollary 5.1. Hence we omit the details. □

7. Solution theory of Poisson and static Schrödinger equations in spectral Barron spaces

In Theorems 2.3 and 2.4, we have established the generalization error bounds of the DRM for the Poisson equation and the static Schrödinger equation under the assumption that the exact solutions lie in the spectral Barron space $\mathcal{B}^2(\Omega)$. This section aims to justify this assumption by proving complexity estimates of solutions in the spectral Barron space, as shown in Theorem 2.5 and Theorem 2.6. This can be viewed as a regularity analysis of high dimensional PDEs in the spectral Barron space.

7.1. Proof of Theorem 2.5. Suppose that $f=\sum_{k\in\mathbb{N}^d}\hat f_k\Phi_k$ and that $f$ has vanishing mean value on $\Omega$, so that $\hat f_0=0$. Let $\hat u_k$ be the cosine coefficients of the solution $u^*_P$ of the Neumann problem for the Poisson equation. Testing both sides of the Poisson equation against $\Phi_k$ and taking account of the Neumann boundary condition, one obtains that
$$\hat u_0 = 0,\qquad \hat u_k = \frac{\hat f_k}{\pi^2|k|^2}.$$
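The diagonal relation above can be exercised numerically. The sketch below (hypothetical 1d cosine data, not taken from the paper) builds $u$ from $f$ via $\hat u_k = \hat f_k/(\pi^2 k^2)$ and checks both the equation and the Neumann boundary condition.

```python
import numpy as np

# 1d Neumann-Poisson on [0,1]: choose mean-zero data via cosine coefficients f_hat,
# form u_hat[k] = f_hat[k] / (pi^2 k^2), and check -u'' = f on a grid.
f_hat = {1: 0.7, 3: -0.2, 8: 1.5}          # hypothetical coefficients, f_hat_0 = 0
u_hat = {k: c / (np.pi**2 * k**2) for k, c in f_hat.items()}

x = np.linspace(0.0, 1.0, 1001)
f = sum(c * np.cos(np.pi * k * x) for k, c in f_hat.items())
u = sum(c * np.cos(np.pi * k * x) for k, c in u_hat.items())
# -u'' reproduces f exactly, since -(cos(pi k x))'' = pi^2 k^2 cos(pi k x)
minus_u_xx = sum(np.pi**2 * k**2 * c * np.cos(np.pi * k * x) for k, c in u_hat.items())
assert np.allclose(minus_u_xx, f, atol=1e-12)
# Neumann condition: u'(0) = u'(1) = 0 because sin(pi k x) vanishes at x = 0, 1
u_x = sum(-np.pi * k * c * np.sin(np.pi * k * x) for k, c in u_hat.items())
assert abs(u_x[0]) < 1e-12 and abs(u_x[-1]) < 1e-12
print(np.max(np.abs(u)))
```

The factor $1/(\pi^2 k^2)$ is also visible here as the smoothing that underlies the gain of two orders of Barron regularity in the estimate that follows.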
As a result,
$$\|u^*_P\|_{\mathcal{B}^{s+2}(\Omega)} = \sum_{k\in\mathbb{N}^d\setminus\{0\}}(1+\pi^{s+2}|k|^{s+2})|\hat u_k| = \sum_{k\in\mathbb{N}^d\setminus\{0\}}\frac{1+\pi^{s+2}|k|^{s+2}}{\pi^2|k|^2}|\hat f_k| \le 2\sum_{k\in\mathbb{N}^d\setminus\{0\}}(1+\pi^s|k|^s)|\hat f_k| = 2\|f\|_{\mathcal{B}^s(\Omega)}.$$
This finishes the proof.

7.2. Proof of Theorem 2.6. First, under the assumption of Theorem 2.6, there exists a unique solution $u\in H^1(\Omega)$ to (2.2). Moreover,
$$(7.1)\qquad \|\nabla u\|^2_{L^2(\Omega)} + V_{\min}\|u\|^2_{L^2(\Omega)} \le \|f\|_{L^2(\Omega)}\|u\|_{L^2(\Omega)}.$$
Our goal is to show that $u\in\mathcal{B}^{s+2}(\Omega)$. To this end, let us first derive an operator equation that is equivalent to the original Schrödinger problem (2.2). Multiplying both sides of the static Schrödinger equation by $\Phi_k$ and then integrating yields the following equivalent linear system for $\hat u$:
$$(7.2)\qquad \pi^2|k|^2\hat u_k + \widehat{(Vu)}_k = \hat f_k,\qquad k\in\mathbb{N}^d.$$
Let us first consider (7.2) with $k=0$. Thanks to Corollary B.1,
$$\widehat{(Vu)}_0 = \frac{1}{\beta_0}\Big(\sum_{m\in\mathbb{Z}^d}\beta_m^2\hat u_{|m|}\hat V_{|m|}\Big) = \hat u_0\hat V_0 + \sum_{m\in\mathbb{Z}^d\setminus\{0\}}\beta_m^2\hat u_{|m|}\hat V_{|m|},$$
where we have also used the fact that $\beta_0=1$. Consequently, equation (7.2) with $k=0$ becomes
$$\hat u_0\hat V_0 + \sum_{m\in\mathbb{Z}^d\setminus\{0\}}\beta_m^2\hat u_{|m|}\hat V_{|m|} = \hat f_0.$$
For $k\neq 0$, using again Corollary B.1, equation (7.2) can be written as
$$\pi^2|k|^2\hat u_k + \frac{1}{\beta_k}\Big(\sum_{m\in\mathbb{Z}^d}\beta_m\hat u_{|m|}\beta_{m-k}\hat V_{|m-k|}\Big) = \hat f_k,\qquad k\in\mathbb{N}^d\setminus\{0\}.$$
Recall that $u\in\mathcal{B}^s(\Omega)$ is equivalent to $\hat u$ belonging to the weighted $\ell^1$ space $\ell^1_{W_s}(\mathbb{N}^d)$ with the weight $W_s(k)=1+\pi^s|k|^s$. We would like to rewrite the equations above as an operator equation on the space $\ell^1_{W_s}(\mathbb{N}^d)$. To do so, let us define some useful operators. Define the operator $M:\hat u\mapsto M\hat u$ by
$$(M\hat u)_k = \begin{cases} \hat V_0\hat u_0 & \text{if } k=0,\\ \pi^2|k|^2\hat u_k & \text{otherwise}. \end{cases}$$
Define the operator $\mathcal{V}:\hat u\mapsto\mathcal{V}\hat u$ by
$$(\mathcal{V}\hat u)_k = \begin{cases} \displaystyle\sum_{m\in\mathbb{Z}^d\setminus\{0\}}\beta_m^2\hat u_{|m|}\hat V_{|m|} & \text{if } k=0,\\[2mm] \displaystyle\frac{1}{\beta_k}\sum_{m\in\mathbb{Z}^d}\beta_m\hat u_{|m|}\beta_{m-k}\hat V_{|m-k|} & \text{otherwise}. \end{cases}$$
With these operators, the system (7.2) can be reformulated as the operator equation
$$(7.3)\qquad (M+\mathcal{V})\hat u = \hat f.$$
Since $V(x)\ge V_{\min}>0$ for every $x$, we have $\hat V_0>0$. As a direct consequence, the diagonal operator $M$ is invertible. Therefore the operator equation (7.3) is equivalent to
$$(7.4)\qquad (I+M^{-1}\mathcal{V})\hat u = M^{-1}\hat f.$$
In order to show that $u\in\mathcal{B}^{s+2}(\Omega)$, it suffices to show that the equation (7.3), or equivalently (7.4), has a unique solution $\hat u\in\ell^1_{W_s}(\mathbb{N}^d)$. Indeed, if $\hat u\in\ell^1_{W_s}(\mathbb{N}^d)$, then it follows from (7.3) and the boundedness of $\mathcal{V}$ on $\ell^1_{W_s}(\mathbb{N}^d)$ (see (7.8) in the proof of Lemma 7.1 below) that
$$(7.5)\qquad \|M\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} \le \|\mathcal{V}\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} + \|\hat f\|_{\ell^1_{W_s}(\mathbb{N}^d)} \le C(d,V)\|\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} + \|\hat f\|_{\ell^1_{W_s}(\mathbb{N}^d)}.$$
Moreover, this combined with the positivity of $\hat V_0$ implies that
$$(7.6)\qquad \|u\|_{\mathcal{B}^{s+2}(\Omega)} = \sum_{k\in\mathbb{N}^d}(1+\pi^{s+2}|k|^{s+2})|\hat u_k| = \frac{1}{\hat V_0}\cdot\hat V_0|\hat u_0| + \sum_{k\in\mathbb{N}^d\setminus\{0\}}\frac{1+\pi^{s+2}|k|^{s+2}}{\pi^2|k|^2}\cdot\pi^2|k|^2|\hat u_k| \le \max\Big\{\frac{1}{\hat V_0},\,\frac{1}{\pi^2}+1\Big\}\,\|M\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} \le C(d,V)\big(\|\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} + \|\hat f\|_{\ell^1_{W_s}(\mathbb{N}^d)}\big)$$
for some $C(d,V)>0$.

Next, we claim that equation (7.4) has a unique solution $\hat u\in\ell^1_{W_s}(\mathbb{N}^d)$ and that there exists a constant $C>0$ such that
$$(7.7)\qquad \|\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} \le C\|\hat f\|_{\ell^1_{W_s}(\mathbb{N}^d)}.$$
To see this, observe that owing to the compactness of $M^{-1}\mathcal{V}$ shown in Lemma 7.1, the operator $I+M^{-1}\mathcal{V}$ is a Fredholm operator on $\ell^1_{W_s}(\mathbb{N}^d)$.
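As an aside, the operator equation (7.3) can be made concrete by truncating to finitely many cosine modes in $d=1$ and solving the resulting linear system (a sketch with hypothetical potential and data, not the paper's method; the mode-coupling matrix plays the role of $\mathcal{V}$).

```python
import numpy as np

# Truncated 1d version of the operator equation (7.3): work with cosine modes
# Phi_k(x) = cos(pi k x), k = 0..K. Potential and data below are hypothetical.
K = 40
V_hat = np.zeros(K + 1); V_hat[0] = 2.0; V_hat[1] = 1.0   # V(x) = 2 + cos(pi x) >= 1
f_hat = np.zeros(K + 1); f_hat[1] = 1.0; f_hat[3] = -0.3

def mult_matrix(c):
    """Matrix of u_hat -> (c u)_hat via cos a cos b = (cos(a+b) + cos(a-b))/2,
    truncated to modes 0..K (a 1d analogue of the convolution in Corollary B.1)."""
    A = np.zeros((K + 1, K + 1))
    for j in range(K + 1):           # mode of the multiplier c
        for k in range(K + 1):       # mode of u
            for m in (j + k, abs(j - k)):
                if m <= K:
                    A[m, k] += 0.5 * c[j]
    return A

M = np.diag([np.pi**2 * k**2 for k in range(K + 1)])   # the diagonal part
A = M + mult_matrix(V_hat)                             # truncated (M + V) of (7.3)
u_hat = np.linalg.solve(A, f_hat)

# Check that -u'' + V u = f holds on a grid (up to truncation/rounding error).
x = np.linspace(0.0, 1.0, 2001)
basis = np.cos(np.pi * np.outer(np.arange(K + 1), x))
u, V, f = u_hat @ basis, V_hat @ basis, f_hat @ basis
minus_u_xx = (u_hat * (np.pi * np.arange(K + 1))**2) @ basis
assert np.max(np.abs(minus_u_xx + V * u - f)) < 1e-8
print(np.max(np.abs(u_hat)))
```

The rapid decay of `u_hat` mirrors the gain of regularity established in Theorem 2.6: the diagonal part damps high modes like $1/(\pi^2 k^2)$, while the potential only couples nearby modes.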
By the celebrated Fredholm alternative theorem (see e.g. [14] and [7, VII 10.7]), the operator $I+M^{-1}\mathcal{V}$ has a bounded inverse $(I+M^{-1}\mathcal{V})^{-1}$ if and only if $(I+M^{-1}\mathcal{V})\hat u=0$ has only the trivial solution. Therefore, to obtain the bound (7.7) it suffices to show that $(I+M^{-1}\mathcal{V})\hat u=0$ implies $\hat u=0$. By the equivalence between the Schrödinger problem (2.2) and (7.4), we only need to show that the only solution of (2.2) with $f=0$ is zero. The latter is a direct consequence of (7.1), and this finishes the proof that the Schrödinger problem (2.2) has a unique solution in $\mathcal{B}^{s+2}(\Omega)$. Finally, the stability estimate (2.16) follows by combining (7.6) and (7.7).

Lemma 7.1. Assume that $V\in\mathcal{B}^s(\Omega)$ with $V(x)\ge V_{\min}>0$ for every $x\in\Omega$. Then the operator $M^{-1}\mathcal{V}$ is compact on $\ell^1_{W_s}(\mathbb{N}^d)$.

Proof. Since $M^{-1}$ is a multiplication operator on $\ell^1_{W_s}(\mathbb{N}^d)$ with diagonal entries converging to zero, it follows from Lemma 7.2 that $M^{-1}$ is compact on $\ell^1_{W_s}(\mathbb{N}^d)$. Therefore, to show the compactness of $M^{-1}\mathcal{V}$ it is sufficient to show that the operator $\mathcal{V}$ is bounded on $\ell^1_{W_s}(\mathbb{N}^d)$. To see this, note that by definition $\beta_k = 2^{\mathbb{1}_{k\neq 0} - \sum_{i=1}^d \mathbb{1}_{k_i\neq 0}} \in [2^{-d},1]$.
In addition, since $V\in\mathcal{B}^s(\Omega)$, using Corollary B.1 one has that
$$(7.8)\qquad \|\mathcal{V}\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} = \Big|\sum_{m\in\mathbb{Z}^d\setminus\{0\}}\beta_m^2\hat u_{|m|}\hat V_{|m|}\Big| + \sum_{k\in\mathbb{N}^d\setminus\{0\}}\frac{1}{\beta_k}\Big|\sum_{m\in\mathbb{Z}^d}\beta_m\hat u_{|m|}\beta_{m-k}\hat V_{|m-k|}\Big|(1+\pi^s\|k\|^s)$$
$$\le \sum_{m\in\mathbb{Z}^d\setminus\{0\}}|\hat u_{|m|}|\sum_{m\in\mathbb{Z}^d\setminus\{0\}}|\hat V_{|m|}| + 2^{d+1}\sum_{m\in\mathbb{Z}^d}\sum_{k\in\mathbb{N}^d}|\hat u_{|m|}||\hat V_{|m-k|}|\big(1+\pi^sC_s(\|m-k\|^s+\|m\|^s)\big)$$
$$\le 2^{d+2}\|\hat u\|_{\ell^1(\mathbb{N}^d)}\|\hat V\|_{\ell^1(\mathbb{N}^d)} + 2^{d+1}\max(1,C_s)\big(\|\hat u\|_{\ell^1(\mathbb{N}^d)}\|\hat V\|_{\ell^1_{W_s}(\mathbb{N}^d)} + \|\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)}\|\hat V\|_{\ell^1(\mathbb{N}^d)}\big)$$
$$\le 2^{d+3}\max(1,C_s)\,\|\hat V\|_{\ell^1_{W_s}(\mathbb{N}^d)}\|\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} = 2^{d+3}\max(1,C_s)\,\|V\|_{\mathcal{B}^s(\Omega)}\|\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)},$$
where in the first inequality above we used the elementary inequality $|a+b|^s \le C_s(|a|^s+|b|^s)$ for some constant $C_s>0$, and in the second inequality we used the fact that $\sum_{m\in\mathbb{Z}^d}|\hat u_{|m|}| \le 2^d\|\hat u\|_{\ell^1(\mathbb{N}^d)} \le 2^d\|\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)}$. □

Lemma 7.2. Suppose that $T$ is a multiplication operator on $\ell^1_{W_s}(\mathbb{N}^d)$, defined for $u=(u_k)_{k\in\mathbb{N}^d}$ by $(Tu)_k = \lambda_k u_k$ with $\lambda_k\to 0$ as $\|k\|\to\infty$. Then $T:\ell^1_{W_s}(\mathbb{N}^d)\to\ell^1_{W_s}(\mathbb{N}^d)$ is compact.

Proof. It suffices to show that the image of the unit ball in $\ell^1_{W_s}(\mathbb{N}^d)$ under the map $T$ is totally bounded.
To this end, given any fixed $\varepsilon>0$, let $K\in\mathbb{N}$ be such that $|\lambda_k|\le\varepsilon$ if $\|k\|>K$. Denote $\mathcal{I} := \{k\in\mathbb{N}^d : \|k\|\le K\}$ and let $d_0$ be the cardinality of the index set $\mathcal{I}$. Note that the ball in $\mathbb{R}^{d_0}$ of radius $\max\{|\lambda_k| : k\in\mathcal{I}\}$ with respect to the weighted $\ell^1$-norm $\|v\|_{\ell^1_{W_s}} = \sum_{k\in\mathcal{I}}|v_k|W_s(k)$ is precompact, so it can be covered by the union of $n_\varepsilon$ $\varepsilon$-balls with centers $\{v^1,\dots,v^{n_\varepsilon}\}$, where $v^i\in\mathbb{R}^{d_0}$. We now claim that the image of the unit ball in $\ell^1_{W_s}(\mathbb{N}^d)$ under $T$ is covered by $n_\varepsilon$ $2\varepsilon$-balls with centers $\{(v^1,0),\dots,(v^{n_\varepsilon},0)\}$. In fact, for $u\in\ell^1_{W_s}(\mathbb{N}^d)$ with $\sum_{k\in\mathbb{N}^d}|u_k|W_s(k)\le 1$, one has
$$Tu = \big((\lambda_k u_k)_{k\in\mathcal{I}},\,0\big) + \big(0,\,(\lambda_k u_k)_{k\notin\mathcal{I}}\big).$$
Suppose that $v^{i^*}$ is the closest center of $\{v^1,\dots,v^{n_\varepsilon}\}$ to the vector $\big((\lambda_k u_k)_{k\in\mathcal{I}}\big)$. Then
$$\|Tu-(v^{i^*},0)\|_{\ell^1_{W_s}(\mathbb{N}^d)} = \sum_{k\in\mathcal{I}}|(v^{i^*})_k - \lambda_k u_k|W_s(k) + \big\|\big(0,(\lambda_k u_k)_{k\notin\mathcal{I}}\big)\big\|_{\ell^1_{W_s}(\mathbb{N}^d)} \le \varepsilon + \varepsilon\big\|\big(0,(u_k)_{k\notin\mathcal{I}}\big)\big\|_{\ell^1_{W_s}(\mathbb{N}^d)} \le 2\varepsilon.$$
This finishes the proof. □

Appendix A. Proof of Proposition 2.1

A.1. Proof of Proposition 2.1-(i). First, it is well known that the problem (2.1) has a unique weak solution $u^*_P\in H^1_\diamond(\Omega) = \{u\in H^1(\Omega) : \int_\Omega u\,dx = 0\}$, i.e.
$$(A.1)\qquad a(u,v) := \int_\Omega \nabla u\cdot\nabla v\,dx = F(v) := \int_\Omega fv\,dx \quad\text{for every } v\in H^1_\diamond(\Omega).$$
Moreover, the solution $u^*_P$ satisfies
$$u^*_P = \arg\min_{u\in H^1_\diamond(\Omega)}\Big\{ \frac12\int_\Omega |\nabla u|^2\,dx - \int_\Omega fu\,dx \Big\}.$$
Due to the mean-zero constraint of the space $H^1_\diamond(\Omega)$, this variational formulation is inconvenient to adopt as the loss function for training a neural network solution.
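As an aside, the penalized functional $\mathcal{E}_P$ that replaces this constrained formulation is exactly what the DRM approximates by Monte Carlo. Below is a minimal sketch of that empirical risk for an explicit exact pair (hypothetical test data, not the paper's experiments).

```python
import numpy as np

# Monte Carlo discretization of the penalized Ritz loss
#   E_P(u) = int_0^1 ( (1/2)|u'|^2 - f u ) dx + (1/2) ( int_0^1 u dx )^2
# for the explicit (hypothetical) pair u(x) = cos(pi x), f(x) = pi^2 cos(pi x),
# i.e. u is the exact Neumann solution of -u'' = f, with exact loss -pi^2/4.
rng = np.random.default_rng(0)
n = 1_000_000
X = rng.uniform(0.0, 1.0, n)            # uniform sample points on Omega = [0, 1]

u   = np.cos(np.pi * X)
u_x = -np.pi * np.sin(np.pi * X)
f   = np.pi**2 * np.cos(np.pi * X)

# empirical risk: sample means replace the integrals
E_n = np.mean(0.5 * u_x**2 - f * u) + 0.5 * np.mean(u)**2
assert abs(E_n - (-np.pi**2 / 4)) < 0.05   # agrees with the exact loss up to MC error
print(E_n)
```

Replacing the two integrals by sample means is what makes the empirical minimizer differ from the true minimizer, which is precisely the generalization gap analyzed in the main text.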
To tackle this issue, we consider instead the following modified Poisson problem:
$$(A.2)\qquad -\Delta u + \lambda\int_\Omega u\,dx = f \ \text{ on } \Omega,\qquad \frac{\partial}{\partial\nu}u = 0 \ \text{ on } \partial\Omega.$$
Here $\lambda>0$ is a fixed constant. By the Lax–Milgram theorem the problem (A.2) has a unique weak solution $u^*_{\lambda,P}$, which solves
$$(A.3)\qquad a_\lambda(u^*_{\lambda,P},v) := \int_\Omega \nabla u^*_{\lambda,P}\cdot\nabla v\,dx + \lambda\int_\Omega u^*_{\lambda,P}\,dx\int_\Omega v\,dx = F(v) \quad\text{for every } v\in H^1(\Omega).$$
It is clear that $u^*_{\lambda,P}$ is the solution of the variational problem
$$(A.4)\qquad \arg\min_{u\in H^1(\Omega)}\Big\{ \frac12\int_\Omega|\nabla u|^2\,dx + \frac{\lambda}{2}\Big(\int_\Omega u\,dx\Big)^2 - \int_\Omega fu\,dx \Big\}.$$
Furthermore, the lemma below shows that the weak solutions of (A.2) are independent of $\lambda$ and all coincide with $u^*_P$.

Lemma A.1. Assume that $\lambda>0$. Let $u^*_P$ and $u^*_{\lambda,P}$ be the weak solutions of (2.1) and (A.2) respectively, with $f\in L^2(\Omega)$ satisfying $\int_\Omega f\,dx = 0$. Then we have that $u^*_{\lambda,P} = u^*_P$.

Proof. We only need to show that $u^*_{\lambda,P}$ satisfies the weak formulation (A.1). In fact, since $u^*_{\lambda,P}$ satisfies (A.3), by setting $v=1$ we obtain that
$$\lambda\int_\Omega u^*_{\lambda,P}\,dx = \int_\Omega f\,dx = 0.$$
This immediately implies that $a_\lambda(u^*_{\lambda,P},v) = a(u^*_{\lambda,P},v)$, and hence $u^*_{\lambda,P}$ satisfies (A.1). □

Since the solution to (A.2) is invariant for all $\lambda>0$, for simplicity we set $\lambda=1$ in (A.4), and this proves (2.3), i.e.
$$(A.5)\qquad u^*_P = \arg\min_{u\in H^1(\Omega)}\mathcal{E}_P(u) = \arg\min_{u\in H^1(\Omega)}\Big\{ \frac12\int_\Omega|\nabla u|^2\,dx - \int_\Omega fu\,dx + \frac12\Big(\int_\Omega u\,dx\Big)^2 \Big\}.$$
Finally, we prove that $u^*_P$ satisfies the estimate (2.4). To see this, we first state a useful lemma which computes the energy excess $\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P)$ for any $u\in H^1(\Omega)$.

Lemma A.2. Let $u^*_P$ be the minimizer of $\mathcal{E}_P$, or equivalently the weak solution of the Poisson problem (A.2). Then for any $u\in H^1(\Omega)$, it holds that
$$\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P) = \frac12\int_\Omega|\nabla u-\nabla u^*_P|^2\,dx + \frac12\Big(\int_\Omega (u-u^*_P)\,dx\Big)^2.$$

Proof.
It follows from Green's formula and the fact that $u^*_P\in H^1_\diamond(\Omega)$ that
$$\mathcal{E}_P(u^*_P) = \int_\Omega \frac12|\nabla u^*_P|^2 - fu^*_P\,dx + \underbrace{\frac12\Big(\int_\Omega u^*_P\,dx\Big)^2}_{=0} = \int_\Omega \frac12|\nabla u^*_P|^2 + \Delta u^*_P\,u^*_P\,dx = -\frac12\int_\Omega|\nabla u^*_P|^2\,dx.$$
Then for any $u\in H^1(\Omega)$, applying Green's formula again yields
$$\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P) = \frac12\int_\Omega|\nabla u|^2\,dx - \int_\Omega fu\,dx + \frac12\Big(\int_\Omega u\,dx\Big)^2 + \frac12\int_\Omega|\nabla u^*_P|^2\,dx$$
$$= \frac12\int_\Omega|\nabla u|^2\,dx + \int_\Omega \Delta u^*_P\,u\,dx + \frac12\Big(\int_\Omega u\,dx\Big)^2 + \frac12\int_\Omega|\nabla u^*_P|^2\,dx = \frac12\int_\Omega|\nabla u-\nabla u^*_P|^2\,dx + \frac12\Big(\int_\Omega(u-u^*_P)\,dx\Big)^2. \qquad\Box$$

Now recall that $C_P>0$ is the Poincaré constant such that for any $v\in H^1(\Omega)$,
$$\Big\|v-\int_\Omega v\,dx\Big\|_{L^2(\Omega)} \le C_P\|\nabla v\|_{L^2(\Omega)}.$$
As a result,
$$\|v\|^2_{H^1(\Omega)} = \|\nabla v\|^2_{L^2(\Omega)} + \|v\|^2_{L^2(\Omega)} \le \|\nabla v\|^2_{L^2(\Omega)} + 2\Big\|v-\int_\Omega v\,dx\Big\|^2_{L^2(\Omega)} + 2\Big|\int_\Omega v\,dx\Big|^2 \le (2C_P^2+1)\|\nabla v\|^2_{L^2(\Omega)} + 2\Big|\int_\Omega v\,dx\Big|^2.$$
Therefore, an application of the last inequality with $v=u-u^*_P$ together with Lemma A.2 yields
$$\|u-u^*_P\|^2_{H^1(\Omega)} \le 2\max\{2C_P^2+1,\,2\}\big(\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P)\big).$$
On the other hand, it follows from Lemma A.2 that
$$\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P) \le \frac12\|u-u^*_P\|^2_{H^1(\Omega)}.$$
Combining the last two estimates leads to (2.4), and hence finishes the proof of Proposition 2.1-(i).

A.2. Proof of Proposition 2.1-(ii). First, the standard Lax–Milgram theorem implies that the static Schrödinger equation has a unique weak solution $u^*_S$. Moreover, it is not hard to verify that $u^*_S$ solves the equivalent variational problem (2.5), i.e.
$$u^*_S = \arg\min_{u\in H^1(\Omega)}\mathcal{E}_S(u) = \arg\min_{u\in H^1(\Omega)}\Big\{ \frac12\int_\Omega|\nabla u|^2 + V|u|^2\,dx - \int_\Omega fu\,dx \Big\}.$$
Finally, we prove that $u^*_S$ satisfies the estimate (2.6). For this, we first claim that for any $u\in H^1(\Omega)$,
$$(A.6)\qquad \mathcal{E}_S(u)-\mathcal{E}_S(u^*_S) = \frac12\int_\Omega|\nabla u-\nabla u^*_S|^2\,dx + \frac12\int_\Omega V(u-u^*_S)^2\,dx.$$
In fact, using Green's formula, one has that
$$\mathcal{E}_S(u^*_S) = \int_\Omega \frac12|\nabla u^*_S|^2 + \frac12 V|u^*_S|^2 - fu^*_S\,dx = \int_\Omega \frac12|\nabla u^*_S|^2 + \frac12 V|u^*_S|^2 + (\Delta u^*_S - Vu^*_S)u^*_S\,dx = -\frac12\int_\Omega|\nabla u^*_S|^2 + V|u^*_S|^2\,dx.$$
Then for any $u\in H^1(\Omega)$, applying Green's formula again yields
$$\mathcal{E}_S(u)-\mathcal{E}_S(u^*_S) = \frac12\int_\Omega|\nabla u|^2+V|u|^2\,dx - \int_\Omega fu\,dx + \frac12\int_\Omega|\nabla u^*_S|^2+V|u^*_S|^2\,dx$$
$$= \frac12\int_\Omega|\nabla u|^2+V|u|^2\,dx + \int_\Omega(\Delta u^*_S - Vu^*_S)u\,dx + \frac12\int_\Omega|\nabla u^*_S|^2+V|u^*_S|^2\,dx = \frac12\int_\Omega|\nabla u-\nabla u^*_S|^2\,dx + \frac12\int_\Omega V\big(u-u^*_S\big)^2\,dx.$$
The estimate (2.6) follows directly from the identity (A.6) and the assumption that $V(x)\ge V_{\min}>0$.

Appendix B. Auxiliary lemmas on cosine expansions

Assume that $u\in L^2(\Omega)$ admits the cosine series expansion
$$u(x) = \sum_{k\in\mathbb{N}^d}\hat u_k\Phi_k(x),$$
where $\{\hat u_k\}_{k\in\mathbb{N}^d}$ are the cosine expansion coefficients, i.e.
$$(B.1)\qquad \hat u_k = \frac{\int_\Omega u(x)\Phi_k(x)\,dx}{\int_\Omega \Phi_k(x)^2\,dx} = 2^{\sum_{i=1}^d \mathbb{1}_{k_i\neq 0}}\int_\Omega u(x)\Phi_k(x)\,dx.$$
Let $\Omega_e := [-1,1]^d$ and define the even extension $u_e$ of a function $u$ by
$$u_e(x) = u_e(x_1,\dots,x_d) = u(|x_1|,\dots,|x_d|),\qquad x\in\Omega_e.$$
Let $\tilde u_k$ be the Fourier coefficients of $u_e$. Since $u_e$ is real and even, one has that
$$u_e = \sum_{k\in\mathbb{Z}^d}\tilde u_k\cos(\pi k\cdot x),$$
where
$$(B.2)\qquad \tilde u_k = \frac{\int_{\Omega_e}u_e(x)\cos(\pi k\cdot x)\,dx}{\int_{\Omega_e}\cos^2(\pi k\cdot x)\,dx} = \frac{1}{2^{d-\mathbb{1}_{k\neq 0}}}\int_{\Omega_e}u_e(x)\cos(\pi k\cdot x)\,dx.$$
By abuse of notation, we use $|k|$ to stand for the vector $(|k_1|,|k_2|,\dots,|k_d|)$.

Lemma B.1.
For every $k\in\mathbb{Z}^d$, it holds that $\tilde u_k = \beta_k\hat u_{|k|}$, where $\beta_k = 2^{\mathbb{1}_{k\neq 0} - \sum_{i=1}^d \mathbb{1}_{k_i\neq 0}}$.

Proof. First, thanks to Lemma 4.2 and the evenness of cosine,
$$\int_{\Omega_e}u_e(x)\cos(\pi k\cdot x)\,dx = \int_{\Omega_e}u_e(x)\cos\Big(\pi\sum_{i=1}^{d-1}k_ix_i\Big)\cos(\pi k_dx_d)\,dx - \underbrace{\int_{\Omega_e}u_e(x)\sin\Big(\pi\sum_{i=1}^{d-1}k_ix_i\Big)\sin(\pi k_dx_d)\,dx}_{=0}$$
$$= \int_{\Omega_e}u_e(x)\cos\Big(\pi\sum_{i=1}^{d-2}k_ix_i\Big)\cos(\pi k_{d-1}x_{d-1})\cos(\pi k_dx_d)\,dx - \underbrace{\int_{\Omega_e}u_e(x)\sin\Big(\pi\sum_{i=1}^{d-2}k_ix_i\Big)\sin(\pi k_{d-1}x_{d-1})\cos(\pi k_dx_d)\,dx}_{=0}$$
$$= \cdots = \int_{\Omega_e}u_e(x)\prod_{i=1}^d\cos(\pi k_ix_i)\,dx = 2^d\int_\Omega u(x)\Phi_{|k|}(x)\,dx.$$
In addition, since $\Phi_k = \Phi_{|k|}$ for any $k\in\mathbb{Z}^d$, the lemma follows from the equation above, (B.1) and (B.2). □

The next lemma shows that the Fourier coefficients of the product of two functions $u$ and $v$ are given by the discrete convolution of their Fourier coefficients. Recall that $\{\tilde u_k\}_{k\in\mathbb{Z}^d}$ denote the Fourier coefficients of the even extension $u_e$.

Lemma B.2. Let $w_e = u_ev_e$. Then $\tilde w_k = \sum_{m\in\mathbb{Z}^d}\tilde u_m\tilde v_{k-m}$.

Proof.
By definition, $u_e(x) = \sum_{m\in\mathbb{Z}^d}\tilde u_m\cos(\pi m\cdot x)$ and $v_e(x) = \sum_{n\in\mathbb{Z}^d}\tilde v_n\cos(\pi n\cdot x)$. Thanks to the fact that
$$\int_{\Omega_e}\cos(\pi\ell\cdot x)\cos(\pi k\cdot x)\,dx = 2^{d-\mathbb{1}_{k\neq 0}}\delta_\ell(k),$$
one obtains that
$$\tilde w_k = \frac{1}{2^{d-\mathbb{1}_{k\neq 0}}}\int_{\Omega_e}u_e(x)v_e(x)\cos(\pi k\cdot x)\,dx = \frac{1}{2^{d-\mathbb{1}_{k\neq 0}}}\sum_{m\in\mathbb{Z}^d}\sum_{n\in\mathbb{Z}^d}\tilde u_m\tilde v_n\int_{\Omega_e}\cos(\pi m\cdot x)\cos(\pi n\cdot x)\cos(\pi k\cdot x)\,dx$$
$$= \frac{1}{2^{d-\mathbb{1}_{k\neq 0}}}\sum_{m\in\mathbb{Z}^d}\sum_{n\in\mathbb{Z}^d}\tilde u_m\tilde v_n\int_{\Omega_e}\frac12\big[\cos(\pi(m+n)\cdot x) + \cos(\pi(m-n)\cdot x)\big]\cos(\pi k\cdot x)\,dx = \frac12\sum_{m\in\mathbb{Z}^d}\tilde u_m(\tilde v_{k-m}+\tilde v_{m-k}) = \sum_{m\in\mathbb{Z}^d}\tilde u_m\tilde v_{k-m},$$
where we have also used that $\tilde v_k = \tilde v_{-k}$ for any $k$. □

Corollary B.1. For any $k\in\mathbb{N}^d$,
$$\widehat{(uv)}_k = \frac{1}{\beta_k}\sum_{m\in\mathbb{Z}^d}\beta_m\hat u_{|m|}\beta_{m-k}\hat v_{|m-k|}.$$

Proof. Thanks to Lemma B.1 and Lemma B.2,
$$\widehat{(uv)}_k = \frac{1}{\beta_k}\widetilde{(uv)}_k = \frac{1}{\beta_k}(\tilde u * \tilde v)_k = \frac{1}{\beta_k}\sum_{m\in\mathbb{Z}^d}\beta_m\hat u_{|m|}\beta_{m-k}\hat v_{|m-k|}. \qquad\Box$$

References

[1] Uri M. Ascher and Chen Greif. A first course on numerical methods. SIAM, 2011.
[2] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[3] Julius Berner, Philipp Grohs, and Arnulf Jentzen. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations. SIAM Journal on Mathematics of Data Science, 2(3):631–657, 2020.
[4] Giuseppe Carleo and Matthias Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355(6325):602–606, 2017.
[5] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport.
Advances in Neural Information Processing Systems, 31:3036–3046, 2018.
[6] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pages 2933–2943, 2019.
[7] John B. Conway. A course in functional analysis, volume 96 of Graduate Texts in Mathematics. Springer-Verlag, New York, second edition, 1990.
[8] Suchuan Dong and Naxian Ni. A method for representing periodic functions and enforcing exactly periodic boundary conditions with deep neural networks. arXiv preprint arXiv:2007.07442, 2020.
[9] Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, and René Garcia. Incorporating second-order functional knowledge for better option pricing. In Advances in Neural Information Processing Systems, pages 472–478, 2001.
[10] Weinan E, Chao Ma, Stephan Wojtowytsch, and Lei Wu. Towards a mathematical understanding of neural network-based machine learning: what we know and what we don't. arXiv preprint arXiv:2009.10713, 2020.
[11] Weinan E, Chao Ma, and Lei Wu. Barron spaces and the compositional function spaces for neural network models. arXiv preprint arXiv:1906.08039, 2019.
[12] Weinan E and Stephan Wojtowytsch. Some observations on partial differential equations in Barron and multi-layer spaces. arXiv preprint arXiv:2012.01484, 2020.
[13] Weinan E and Bing Yu. The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.
[14] Ivar Fredholm. On a class of functional equations. Acta Mathematica, 27(1):365–390, 1903.
[15] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Limitations of lazy training of two-layers neural network. In Advances in Neural Information Processing Systems, pages 9108–9118, 2019.
[16] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks.
In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
[17] Philipp Grohs, Fabian Hornung, Arnulf Jentzen, and Philippe Von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv preprint arXiv:1809.02362, 2018.
[18] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
[19] Jiequn Han, Jianfeng Lu, and Mo Zhou. Solving high-dimensional eigenvalue problems using deep neural networks: A diffusion Monte Carlo like approach. Journal of Computational Physics, 423:109792, 2020.
[20] Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, and Tuan Anh Nguyen. A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations. SN Partial Differential Equations and Applications, 1:1–34, 2020.
[21] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.
[22] Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving for high-dimensional committor functions using artificial neural networks. Research in the Mathematical Sciences, 6(1):1, 2019.
[23] Jason M Klusowski and Andrew R Barron. Approximation by combinations of ReLU and squared ReLU ridge functions with $\ell^1$ and $\ell^0$ controls. IEEE Transactions on Information Theory, 64(12):7649–7656, 2018.
[24] Isaac E Lagaris, Aristidis Likas, and Dimitrios I Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks, 9(5):987–1000, 1998.
[25] Michel Ledoux and Michel Talagrand.
Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer Science & Business Media, 1991.
[26] Tao Luo and Haizhao Yang. Two-layer neural networks for partial differential equations: Optimization and generalization theory. arXiv preprint arXiv:2006.15733, 2020.
[27] Tengyu Ma. CS229T/STATS231: Statistical Learning Theory, 2018. URL: https://web.stanford.edu/class/cs229t/scribe_notes/10_08_final.pdf. Last visited on 2020/09/16.
[28] William Lauchlin McMillan. Ground state of liquid He4. Physical Review, 138(2A):A442, 1965.
[29] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
[30] Siddhartha Mishra and Roberto Molinaro. Estimates on the generalization error of physics informed neural networks (PINNs) for approximating PDEs. arXiv preprint arXiv:2006.16144, 2020.
[31] Ali Girayhan Özbay, Sylvain Laizet, Panagiotis Tzirakis, Georgios Rizos, and Björn Schuller. Poisson CNN: Convolutional neural networks for the solution of the Poisson equation with varying meshes and Dirichlet boundary conditions. arXiv preprint arXiv:1910.08613, 2019.
[32] Gilles Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonctionnelle (dit "Maurey-Schwartz"), pages 1–12.
[33] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
[34] Grant Rotskoff and Eric Vanden-Eijnden. Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. In Advances in Neural Information Processing Systems, pages 7146–7155, 2018.
[35] Yeonjong Shin, Jerome Darbon, and George Em Karniadakis.
On the convergence of physics informed neural networks for linear second-order elliptic and parabolic type PDEs. arXiv preprint arXiv:2004.01806, 2020.
[36] Yeonjong Shin, Zhongqiang Zhang, and George Em Karniadakis. Error estimates of residual minimization using neural networks for linear PDEs. arXiv preprint arXiv:2010.08019, 2020.
[37] Jonathan W Siegel and Jinchao Xu. Approximation rates for neural networks with general activation functions. Neural Networks, 2020.
[38] Jonathan W Siegel and Jinchao Xu. High-order approximation rates for neural networks with ReLU$^k$ activation functions. arXiv preprint arXiv:2012.07205, 2020.
[39] Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.
[40] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80(2):725–752, 2020.
[41] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.
[42] Michael M. Wolf. Mathematical Foundations of Supervised Learning, 2020. URL: . Last visited on 2020/12/5.
[43] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
[44] Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. arXiv preprint arXiv:1802.03620, 2018.

(JL) Departments of Mathematics, Physics, and Chemistry, Duke University, Box 90320, Durham, NC 27708.
Email address: [email protected]

(YL) Department of Mathematics and Statistics, Lederle Graduate Research Tower, University of Massachusetts, 710 N. Pleasant Street, Amherst, MA 01003.
Email address: [email protected]

(MW) Mathematics Department, Duke University, Box 90320, Durham, NC 27708.
Email address:
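The convolution identity of Lemma B.2 admits a quick numerical sanity check in dimension $d = 1$. The sketch below is illustrative only and not part of the paper: the grid size, tolerance, and all variable names are our own choices. It builds two even, 2-periodic cosine series with symmetric two-sided coefficients, extracts the coefficients of their product by quadrature over one period, and compares them with the discrete convolution of the factors' coefficients.

```python
import numpy as np

# Sanity check of Lemma B.2 in d = 1: for u(x) = sum_m u~_m cos(pi m x) and
# v(x) = sum_n v~_n cos(pi n x) with symmetric two-sided coefficients
# (u~_m = u~_{-m}), the product w = u v satisfies w~_k = sum_m u~_m v~_{k-m}.
rng = np.random.default_rng(0)
M = 3                                   # coefficient support: |m| <= M
u_half = rng.standard_normal(M + 1)
v_half = rng.standard_normal(M + 1)
u_t = {m: u_half[abs(m)] for m in range(-M, M + 1)}
v_t = {n: v_half[abs(n)] for n in range(-M, M + 1)}

# Uniform grid over one full period [-1, 1); the rectangle rule on a full
# period integrates trigonometric polynomials of frequency < N exactly.
N = 4096
x = np.linspace(-1.0, 1.0, N, endpoint=False)
h = 2.0 / N
u = sum(c * np.cos(np.pi * m * x) for m, c in u_t.items())
v = sum(c * np.cos(np.pi * n * x) for n, c in v_t.items())
w = u * v

for k in range(0, 2 * M + 1):
    # Extract the two-sided coefficient w~_k: in the integral, the +k and -k
    # modes both contribute for k != 0, and int cos^2(0) dx = 2 for k = 0,
    # so the normalizing factor is 2 in either case (d = 1).
    wk_quad = h * np.sum(w * np.cos(np.pi * k * x)) / 2.0
    wk_conv = sum(u_t.get(m, 0.0) * v_t.get(k - m, 0.0)
                  for m in range(-2 * M, 2 * M + 1))
    assert abs(wk_quad - wk_conv) < 1e-9
print("convolution identity verified for k = 0,...,6")
```

Since the grid covers one full period and all modes of `w * cos(pi k x)` have frequency far below `N`, the quadrature is exact up to rounding, so the comparison isolates the identity itself rather than discretization error.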