A Priori Generalization Analysis of the Deep Ritz Method for Solving High Dimensional Elliptic Equations
JIANFENG LU, YULONG LU, AND MIN WANG
Abstract.
This paper concerns the a priori generalization analysis of the Deep Ritz Method (DRM) [W. E and B. Yu, 2017], a popular neural-network-based method for solving high-dimensional partial differential equations. We derive the generalization error bounds of two-layer neural networks in the framework of the DRM for solving two prototype elliptic PDEs: the Poisson equation and the static Schrödinger equation on the $d$-dimensional unit hypercube. Specifically, we prove that the convergence rates of the generalization errors are independent of the dimension $d$, under the a priori assumption that the exact solutions of the PDEs lie in a suitable low-complexity space called the spectral Barron space. Moreover, we give sufficient conditions on the forcing term and the potential function which guarantee that the solutions are spectral Barron functions. We achieve this by developing a new solution theory for the PDEs on the spectral Barron space, which can be viewed as an analog of the classical Sobolev regularity theory for PDEs.

1. Introduction
Numerical solution of high dimensional partial differential equations (PDEs) has been a long-standing challenge in scientific computing. The impressive advance of deep learning has offered exciting possibilities for algorithmic innovation. In particular, it is a natural idea to represent solutions of PDEs by (deep) neural networks to exploit the rich expressiveness of neural network representations. The parameters of the neural networks are then trained by optimizing a loss function associated with the PDE. Natural loss functions can be designed using the variational structure, similar to the Ritz-Galerkin method in classical numerical analysis of PDEs. Such a method is known as the Deep Ritz Method (DRM) [13, 22]. Methods in a similar spirit have also been developed in the computational physics literature [4] for solving eigenvalue problems arising from many-body quantum mechanics, under the framework of the variational Monte Carlo method [28].

Despite the wide popularity and many successful applications of the DRM and other approaches that use neural networks to solve high-dimensional PDEs, the analysis of such methods is scarce and still not well understood. This paper aims to provide an a priori generalization error analysis of the DRM with dimension-explicit estimates.

Generally speaking, the error of using neural networks to solve high dimensional PDEs can be decomposed into the following parts:
• Approximation error: this is the error of approximating the solution of a PDE using neural networks;
• Generalization error: this refers to the error of the neural-network-based approximate solution on predicting unseen data. The variational problem involves integrals in high dimension, which can be expensive to compute. In practice Monte Carlo methods are usually used to approximate those high dimensional integrals, and thus the minimizer of the surrogate model (known as empirical risk minimization) would differ from the minimizer of the original variational problem;
Date: January 6, 2021. J.L. and M.W. are supported in part by the National Science Foundation via grants DMS-2012286 and CCF-1934964. Y.L. is supported by the start-up fund of the Department of Mathematics and Statistics at UMass Amherst.

• Training (or optimization) error: this is the error incurred by the optimization algorithm used in training the neural networks for PDEs. Since the parameters of the neural networks are obtained through an optimization process, the process might not find the best approximation to the unknown solution within the function class.

Note that from a numerical analysis point of view, these errors already appear for conventional Galerkin methods. Indeed, taking finite element methods as an example, the approximation error is the error of approximating the true solution in the finite element space; the generalization error can be seen as the discretization error caused by numerical quadrature of the variational formulation; the optimization error corresponds to the computational error in conventional numerical PDEs due to the inaccurate resolution of the linear or nonlinear finite dimensional discrete system.

Although classical numerical analysis for PDEs in low dimensions has formed a relatively complete theory over the last several decades, the error analysis of neural network methods is much more challenging for high dimensional PDEs and requires new ideas and tools. In fact, the three components of the error analysis highlighted above all face new difficulties.

For approximation, as is well known, high dimensional problems suffer from the curse of dimensionality if we proceed with standard regularity-based function spaces such as Sobolev spaces or Hölder spaces as in conventional numerical analysis. In fact, even using deep neural networks, the approximation rate for functions in such spaces deteriorates as the dimension becomes higher; see [43, 44].
Therefore, to obtain better approximation rates that scale mildly with large dimensionality, it is natural to assume that the function of interest lies in a suitably smaller function space which has low complexity compared to Sobolev or Hölder spaces, so that the function can be efficiently approximated by neural networks in high dimensions. The first function class of this kind is the so-called
Barron space defined in the seminal work of Barron [2]; see also [11, 23, 37, 38] for more variants of Barron spaces and their neural-network approximation properties. In the present paper we will introduce a discrete version of Barron's definition of such a space using the idea of spectral decomposition, and because of this we adopt the terminology of spectral Barron space, following [10, 38], to distinguish it from the other versions. As the Barron spaces are very different from the usual Sobolev spaces, for PDE problems one has to develop novel a priori estimates and the corresponding approximation error analysis. In particular, a new solution theory for high dimensional PDEs in those low-complexity function spaces needs to be developed. This paper makes an initial attempt at establishing a solution theory in the spectral Barron space for a class of elliptic PDEs.

The analysis of the generalization error is also intimately related to the function class (e.g. neural networks) we use, in particular its complexity. This makes the generalization analysis quite different from the analysis of numerical quadrature error in a usual finite element method. We face a trade-off between approximation and generalization: to reduce the approximation error, one would like to use an approximation ansatz with a large number of degrees of freedom; however, such a choice will incur a large generalization error.

The training of neural networks also remains a very challenging problem since the associated optimization problem is highly non-convex. In fact, even in a standard supervised learning setting, we still largely lack understanding of the optimization error, except in simplified settings where the optimization dynamics is essentially linear (see e.g., [6, 15, 21]).
The analysis for PDE problems would face similar, if not severer, difficulties, and it is beyond the scope of our current work.

In this work, we provide a rigorous analysis of the approximation and generalization errors of the DRM for high dimensional elliptic PDEs. We will focus on relatively simple PDEs (the Poisson equation and the static Schrödinger equation) to better convey the idea and illustrate the framework, without bogging the reader down with technical details. Our analysis, which, as already suggested by the discussion above, is based on identifying a correct functional analytic setup and developing the corresponding a priori analysis and complexity estimates, provides dimension-independent generalization error estimates.

1.1. Related Works.
Several previous works on the analysis of neural-network-based methods for high-dimensional PDEs focus on the aspect of representation, i.e., whether a solution to the PDE can be approximated by a neural network with quantitative error control; see e.g., [17, 20]. Fixing an approximation space, the generalization error can be controlled by analyzing complexity measures such as covering numbers; see e.g., [3] for a specific PDE problem. More recently, several papers [26, 30, 35, 36] considered the generalization error analysis of the physics-informed neural network (PINN) approach based on residual minimization for solving PDEs [24, 33]. In particular, the work [35] established the consistency of the loss function, in the sense that the approximation converges to the true solution as the training sample grows, under the assumption of vanishing training error. For the generalization error, Mishra and Molinaro [30] carried out an a-posteriori-type generalization error analysis for PINNs, and proved that the generalization error is bounded by the training error and the quadrature error under some stability assumptions on the PDEs. To avoid the curse of dimensionality in the quadrature error, the authors also considered the cumulative generalization error, which involves a validation set. The paper [36] proved both a priori and a posteriori estimates for residual minimization methods in Sobolev spaces. The paper [26] obtained a priori generalization estimates for a class of second order linear PDEs by assuming (but without verifying) that the exact solutions of the PDEs belong to a Barron-type space introduced in [11].

Different from the previous generalization error analyses, we derive a priori and dimension-explicit generalization error estimates under the assumption that the solutions of the PDEs lie in the spectral Barron space, which is more aligned with [2]. Moreover, we justify this assumption by developing a novel solution theory in the spectral Barron space for the PDEs under consideration.
This regularity theory is the main difference between our work and the above mentioned ones.

It is worth mentioning that in a very recent preprint [12], E and Wojtowytsch considered the regularity theory of high dimensional PDEs on the whole space (including the screened Poisson equation, the heat equation, and a viscous Hamilton-Jacobi equation) in the Barron space introduced by [11]. Their result is similar in spirit to our analysis of PDE regularity theory in the spectral Barron space (Theorem 2.5 for the Poisson equation and Theorem 2.6 for the static Schrödinger equation), while we focus on PDEs on a finite domain and, as a result, have to develop Barron function spaces different from those used for the whole space. The authors of [12] also provided some counterexamples to regularity theory for PDE problems defined on non-convex domains, whereas we only focus on a simple domain (in fact, hypercubes) in this work.

While we focus on the variational-principle-based approach for solving high dimensional PDEs using neural networks, we note that many other approaches have been developed, such as the deep BSDE method based on the control formulation of parabolic PDEs [18], the deep Galerkin method based on the weak formulation [39], methods based on the strong formulation (residual minimization) such as the PINNs [24, 33], and the diffusion Monte Carlo type approach for high-dimensional eigenvalue problems [19], just to name a few. It would be an interesting future direction to extend our analysis to these methods.

1.2. Our Contributions.
We analyze the generalization error of two-layer neural networks for solving two simple elliptic PDEs in the framework of the DRM. Specifically we make the following contributions:
• We define a spectral Barron space $\mathcal{B}^s(\Omega)$ on the $d$-dimensional unit hypercube $\Omega = [0,1]^d$ that extends Barron's original function space [2] from the whole space to a bounded domain; see the definition (2.10). In the generalization theory we develop, we assume that the solutions lie in the spectral Barron space.
• We show that the spectral Barron functions in $\mathcal{B}^2(\Omega)$ can be well approximated in the $H^1$-norm by two-layer neural networks with either ReLU or Softplus activation functions without the curse of dimensionality. Moreover, the parameters (weights and biases) of the two-layer neural networks are controlled explicitly in terms of the spectral Barron norm. The bounds on the neural-network parameters play an essential role in controlling the generalization error of the neural nets. See Theorem 2.1 and Theorem 2.2 for the approximation results.
• We derive generalization error bounds for the neural-network solutions of the Poisson equation and the static Schrödinger equation under the assumption that the solutions belong to the Barron space $\mathcal{B}^2(\Omega)$. We emphasize that the convergence rates in our generalization error bounds are dimension-independent and that the prefactors in the error estimates depend at most polynomially on the dimension and the Barron norms of the solutions, indicating that the DRM overcomes the curse of dimensionality when the solutions of the PDEs are spectral Barron functions. See Theorem 2.3 and Theorem 2.4 for the generalization results.
• Last but not least, we develop a new well-posedness theory for the solutions of the Poisson and static Schrödinger equations in the spectral Barron space, providing sufficient conditions that verify the earlier assumption on the solutions made in the generalization analysis.
The new solution theory can be viewed as an analog of the classical PDE theory in Sobolev or Hölder spaces. See Theorem 2.5 and Theorem 2.6 for the new solution theory in the spectral Barron space.

2. Set-Up and Main Results

2.1. Set-Up of PDEs.
Let
$\Omega = [0,1]^d$ be the unit hypercube in $\mathbb{R}^d$ and let $\partial\Omega$ be the boundary of $\Omega$. We consider the following two prototype elliptic PDEs on $\Omega$ equipped with the Neumann boundary condition: the Poisson equation

(2.1)  $-\Delta u = f$ on $\Omega$, $\quad \partial u/\partial\nu = 0$ on $\partial\Omega$,

and the static Schrödinger equation

(2.2)  $-\Delta u + Vu = f$ on $\Omega$, $\quad \partial u/\partial\nu = 0$ on $\partial\Omega$.

Throughout the paper, we make the minimal assumption that $f \in L^2(\Omega)$ and $V \in L^\infty(\Omega)$ with $V(x) \ge V_{\min} > 0$, although later we will impose stronger regularity assumptions on $f$ and $V$. In particular, in our high dimensional setting, we would certainly need to restrict the class of $f$ and $V$; otherwise just prescribing such general functions numerically would already incur the curse of dimensionality. The well-posedness of the solutions to the Poisson equation and the static Schrödinger equation in the Sobolev space $H^1(\Omega)$, as well as the variational characterizations of the solutions, are well-known and are summarized in the proposition below, whose proof can be found in Appendix A.

Proposition 2.1. (i) Assume that $f \in L^2(\Omega)$ with $\int_\Omega f\,dx = 0$. Then there exists a unique weak solution $u^*_P \in H^1_\diamond(\Omega) := \{u \in H^1(\Omega) \mid \int_\Omega u\,dx = 0\}$ to the Poisson equation (2.1). Moreover, we have that

(2.3)  $u^*_P = \arg\min_{u \in H^1(\Omega)} \mathcal{E}_P(u) := \arg\min_{u \in H^1(\Omega)} \Big\{ \frac{1}{2}\int_\Omega |\nabla u|^2\,dx + \frac{1}{2}\Big(\int_\Omega u\,dx\Big)^2 - \int_\Omega fu\,dx \Big\}$,

and that for any $u \in H^1(\Omega)$,

(2.4)  $\mathcal{E}_P(u) - \mathcal{E}_P(u^*_P) \le \|u - u^*_P\|^2_{H^1(\Omega)} \le 2\max\{2C_P^2 + 1,\, 2\}\,\big(\mathcal{E}_P(u) - \mathcal{E}_P(u^*_P)\big)$,

where $C_P$ is the Poincaré constant on the domain $\Omega$, i.e., for any $v \in H^1(\Omega)$,

$\Big\| v - \int_\Omega v\,dx \Big\|_{L^2(\Omega)} \le C_P \|\nabla v\|_{L^2(\Omega)}$.

(ii) Assume that $f, V \in L^\infty(\Omega)$ and that $0 < V_{\min} \le V(x) \le V_{\max} < \infty$ for some constants $V_{\min}$ and $V_{\max}$. Then there exists a unique weak solution $u^*_S \in H^1(\Omega)$ to the static Schrödinger equation (2.2).
Moreover, we have that

(2.5)  $u^*_S = \arg\min_{u \in H^1(\Omega)} \mathcal{E}_S(u) := \arg\min_{u \in H^1(\Omega)} \Big\{ \frac{1}{2}\int_\Omega \big(|\nabla u|^2 + V|u|^2\big)\,dx - \int_\Omega fu\,dx \Big\}$,

and that for any $u \in H^1(\Omega)$,

(2.6)  $\frac{2}{\max\{1, V_{\max}\}}\big(\mathcal{E}_S(u) - \mathcal{E}_S(u^*_S)\big) \le \|u - u^*_S\|^2_{H^1(\Omega)} \le \frac{2}{\min\{1, V_{\min}\}}\big(\mathcal{E}_S(u) - \mathcal{E}_S(u^*_S)\big)$.

The variational formulations (2.3) and (2.5) are the basis of the DRM [13] for solving these PDEs. The main idea is to train neural networks to minimize the (population) loss defined by the Ritz energy functional $\mathcal{E}$. More specifically, let $\mathcal{F} \subset H^1(\Omega)$ be a hypothesis function class parameterized by neural networks. The DRM seeks the optimal solution to the population loss $\mathcal{E}$ within the hypothesis space $\mathcal{F}$. However, the population loss requires the evaluation of $d$-dimensional integrals, which can be prohibitively expensive when $d \gg 1$ if traditional quadrature methods were used. To circumvent the curse of dimensionality, it is natural to employ the Monte Carlo method for computing the high dimensional integrals, which leads to the so-called empirical loss (or risk) minimization.

2.2. Empirical Loss Minimization.
Let us denote by $\mathcal{P}_\Omega$ the uniform probability distribution on the domain $\Omega$. Then the loss functionals $\mathcal{E}_P$ and $\mathcal{E}_S$ can be rewritten in terms of expectations under $\mathcal{P}_\Omega$ as

$\mathcal{E}_P(u) = |\Omega| \cdot \mathbb{E}_{X \sim \mathcal{P}_\Omega}\Big[\tfrac{1}{2}|\nabla u(X)|^2 - f(X)u(X)\Big] + \tfrac{1}{2}\Big(|\Omega| \cdot \mathbb{E}_{X \sim \mathcal{P}_\Omega} u(X)\Big)^2$,

$\mathcal{E}_S(u) = |\Omega| \cdot \mathbb{E}_{X \sim \mathcal{P}_\Omega}\Big[\tfrac{1}{2}|\nabla u(X)|^2 + \tfrac{1}{2}V(X)|u(X)|^2 - f(X)u(X)\Big]$.

To define the empirical loss, let $\{X_j\}_{j=1}^n$ be an i.i.d. sequence of random variables distributed according to $\mathcal{P}_\Omega$. Define the empirical losses $\mathcal{E}_{n,P}$ and $\mathcal{E}_{n,S}$ by setting

(2.7)  $\mathcal{E}_{n,P}(u) = \frac{1}{n}\sum_{j=1}^n \Big[|\Omega| \cdot \Big(\tfrac{1}{2}|\nabla u(X_j)|^2 - f(X_j)u(X_j)\Big)\Big] + \tfrac{1}{2}\Big(\frac{|\Omega|}{n}\sum_{j=1}^n u(X_j)\Big)^2$,

$\mathcal{E}_{n,S}(u) = \frac{1}{n}\sum_{j=1}^n \Big[|\Omega| \cdot \Big(\tfrac{1}{2}|\nabla u(X_j)|^2 + \tfrac{1}{2}V(X_j)|u(X_j)|^2 - f(X_j)u(X_j)\Big)\Big]$.

Given an empirical loss $\mathcal{E}_n$, the empirical loss minimization algorithm seeks $u_n$ which minimizes $\mathcal{E}_n$, i.e.

(2.8)  $u_n = \arg\min_{u \in \mathcal{F}} \mathcal{E}_n(u)$.

Here we have suppressed the dependence of $u_n$ on $\mathcal{F}$. We denote by $u_{n,P}$ and $u_{n,S}$ the minimizers of the empirical losses $\mathcal{E}_{n,P}$ and $\mathcal{E}_{n,S}$, respectively.
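The empirical losses above are plain Monte Carlo averages and can be sketched in a few lines of numpy. The following is our own illustration, not code from the paper; the function names are ours and we take $|\Omega| = 1$, writing the empirical Schrödinger loss as in (2.7):

```python
import numpy as np

def empirical_schrodinger_loss(u, grad_u, f, V, X):
    """Monte Carlo estimate of E_S(u) = int_Omega (|grad u|^2 + V u^2)/2 - f*u dx
    from samples X ~ Uniform([0,1]^d), as in (2.7) with |Omega| = 1."""
    uX = u(X)                                    # shape (n,)
    gX = grad_u(X)                               # shape (n, d)
    integrand = 0.5 * (np.sum(gX**2, axis=1) + V(X) * uX**2) - f(X) * uX
    return integrand.mean()

# sanity check in d = 1: u(x) = cos(pi x) solves -u'' + u = (pi^2 + 1) cos(pi x)
# with Neumann boundary conditions, and its exact Ritz energy is -(pi^2 + 1)/4
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200_000, 1))
u = lambda X: np.cos(np.pi * X[:, 0])
grad_u = lambda X: -np.pi * np.sin(np.pi * X[:, 0])[:, None]
V = lambda X: np.ones(X.shape[0])
f = lambda X: (np.pi**2 + 1.0) * np.cos(np.pi * X[:, 0])
print(empirical_schrodinger_loss(u, grad_u, f, V, X))  # close to -(pi^2+1)/4
```

The Monte Carlo average converges at the dimension-independent rate $O(n^{-1/2})$, which is the reason the empirical loss replaces quadrature in high dimension.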
2.3. Main Results.
The goal of the present paper is to obtain quantitative estimates for the generalization error between the minimizers $u_{n,S}$ and $u_{n,P}$ computed from the finite data points $\{X_j\}_{j=1}^n$ and the exact solutions when the spatial dimension $d$ is large. Our primary interest is to derive such estimates that scale mildly with the increasing dimension $d$. To this end, it is necessary to assume that the true solutions lie in a smaller space which has a lower complexity than Sobolev spaces. Specifically we will consider the spectral Barron space defined below via the cosine expansion.

Let $\mathcal{C}$ be the set of cosine functions defined by

(2.9)  $\mathcal{C} := \{\Phi_k\}_{k \in \mathbb{N}^d} := \Big\{ \prod_{i=1}^d \cos(\pi k_i x_i) \mid k_i \in \mathbb{N} \Big\}$.

Given $u \in L^1(\Omega)$, let $\{\hat{u}(k)\}_{k \in \mathbb{N}^d}$ be the expansion coefficients of $u$ under the basis $\{\Phi_k\}_{k \in \mathbb{N}^d}$. Let us define for $s \ge 0$ the spectral Barron space $\mathcal{B}^s(\Omega)$ on $\Omega$ by

(2.10)  $\mathcal{B}^s(\Omega) := \Big\{ u \in L^1(\Omega) : \sum_{k \in \mathbb{N}^d} (1 + \pi^s|k|_1^s)\,|\hat{u}(k)| < \infty \Big\}$.

The spectral Barron norm of a function $u \in \mathcal{B}^s(\Omega)$ is given by

$\|u\|_{\mathcal{B}^s(\Omega)} = \sum_{k \in \mathbb{N}^d} (1 + \pi^s|k|_1^s)\,|\hat{u}(k)|$.

Observe that a function $u \in \mathcal{B}^s(\Omega)$ if and only if $\{\hat{u}(k)\}_{k \in \mathbb{N}^d}$ belongs to the weighted $\ell^1$-space $\ell^1_{W_s}(\mathbb{N}^d)$ on the lattice $\mathbb{N}^d$ with the weights $W_s(k) = 1 + \pi^s|k|_1^s$. When $s = 2$, we adopt the short notation $\mathcal{B}(\Omega)$ for $\mathcal{B}^2(\Omega)$. Our definition of the spectral Barron space is strongly motivated by the seminal work of Barron [2] and other recent works [11, 23, 37]. The original Barron function $f$ in [2] is defined on the whole space $\mathbb{R}^d$ with a Fourier transform $\hat{f}(\omega)$ satisfying $\int |\hat{f}(\omega)||\omega|\,d\omega < \infty$.
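For a function with only finitely many cosine modes, the norm in (2.10) is a finite sum and can be evaluated directly. A minimal sketch of ours (assuming the weight $W_s(k) = 1 + \pi^s|k|_1^s$ as above; the function name is ours):

```python
import numpy as np

def spectral_barron_norm(u_hat, s=2):
    """Spectral Barron norm  sum_k (1 + pi^s |k|_1^s) |u_hat(k)|  of a function
    with finitely many cosine modes; u_hat maps multi-indices k in N^d
    (tuples) to the cosine expansion coefficients, and |k|_1 = sum_i k_i."""
    return sum((1.0 + np.pi**s * sum(k)**s) * abs(c) for k, c in u_hat.items())

# u(x) = 3 + 0.5 cos(pi x1) cos(2 pi x2) on the unit square [0,1]^2
u_hat = {(0, 0): 3.0, (1, 2): 0.5}
print(spectral_barron_norm(u_hat, s=2))   # 3 + 0.5*(1 + 9*pi^2), about 47.91
```

With $s = 2$ this is the norm $\|\cdot\|_{\mathcal{B}(\Omega)}$ appearing in the approximation and generalization theorems below.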
Our spectral Barron space $\mathcal{B}^s(\Omega)$ with $s = 1$ can be viewed as a discrete analog of the original Barron space from [2].

The most important property of Barron functions is that they can be well approximated by two-layer neural networks without the curse of dimensionality. To make this precise, let us define the class of two-layer neural networks to be used as our hypothesis space for solving PDEs. Given an activation function $\phi$, a constant $B > 0$ and the number of hidden neurons $m$, we define

(2.11)  $\mathcal{F}_{\phi,m}(B) := \Big\{ c + \sum_{i=1}^m \gamma_i\,\phi(\omega_i \cdot x - t_i) : |c| \le B,\ \|\omega_i\|_1 = 1,\ |t_i| \le 1,\ \sum_{i=1}^m |\gamma_i| \le B \Big\}$.

Our first result concerns the approximation of spectral Barron functions in $\mathcal{B}^2(\Omega)$ by two-layer neural networks with the ReLU activation function.
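A member of this class is just a one-hidden-layer network with $\ell^1$-normalized inner weights and bounded bias and outer weights. The following sketch (our own illustration, with the constraint values as reconstructed in (2.11)) evaluates such a network and checks the parameter constraints:

```python
import numpy as np

def two_layer_net(x, c, gamma, omega, t, phi):
    """Evaluate  c + sum_i gamma_i * phi(omega_i . x - t_i)  as in (2.11).
    x: (n, d) inputs, omega: (m, d) inner weights, t: (m,) biases, gamma: (m,)."""
    return c + phi(x @ omega.T - t) @ gamma

def in_class(c, gamma, omega, t, B):
    """Check the parameter constraints defining F_{phi,m}(B)."""
    return (abs(c) <= B
            and np.allclose(np.abs(omega).sum(axis=1), 1.0)  # ||omega_i||_1 = 1
            and np.all(np.abs(t) <= 1.0)                     # |t_i| <= 1
            and np.abs(gamma).sum() <= B)                    # sum_i |gamma_i| <= B

rng = np.random.default_rng(1)
d, m, B = 5, 8, 2.0
omega = rng.normal(size=(m, d))
omega /= np.abs(omega).sum(axis=1, keepdims=True)   # l1-normalize each row
t = rng.uniform(-1.0, 1.0, size=m)
gamma = rng.uniform(-1.0, 1.0, size=m)
gamma *= B / (2.0 * np.abs(gamma).sum())            # enforce sum |gamma_i| = B/2
relu = lambda z: np.maximum(z, 0.0)

x = rng.uniform(0.0, 1.0, size=(10, d))
print(in_class(0.5, gamma, omega, t, B), two_layer_net(x, 0.5, gamma, omega, t, relu).shape)
```

The point of the explicit bounds on $c$, $t_i$, $\gamma_i$ is that they later control the Rademacher complexity of the hypothesis class.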
Theorem 2.1.
Consider the class of two-layer ReLU neural networks

(2.12)  $\mathcal{F}_{\mathrm{ReLU},m}(B) := \Big\{ c + \sum_{i=1}^m \gamma_i\,\mathrm{ReLU}(\omega_i \cdot x - t_i) : |c| \le B,\ \|\omega_i\|_1 = 1,\ |t_i| \le 1,\ \sum_{i=1}^m |\gamma_i| \le B \Big\}$.

Then for any $u \in \mathcal{B}^2(\Omega)$, there exists $u_m \in \mathcal{F}_{\mathrm{ReLU},m}(\|u\|_{\mathcal{B}^2(\Omega)})$ such that

$\|u - u_m\|_{H^1(\Omega)} \le \frac{C\,\|u\|_{\mathcal{B}^2(\Omega)}}{\sqrt{m}}$

for an absolute constant $C > 0$.

A similar approximation result was first proved in the seminal paper of Barron [2], where the same approximation rate $O(m^{-1/2})$ was obtained when approximating a Barron function defined on the whole space by two-layer neural nets with the sigmoid activation function in the $L^\infty$-norm. Results of this kind were also obtained in the recent works [11,
23, 37]. In particular, the same convergence rate was proved for approximating functions $f$ with $\|f\|_{\mathcal{B}^s} = \int_{\mathbb{R}^d} |\hat{f}(\omega)|(1 + |\omega|)^s\,d\omega < \infty$ in Sobolev norms by two-layer networks with a general class of activation functions satisfying a polynomial decay condition. The convergence rate $O(m^{-1/2})$ was recently improved to $O(m^{-(1/2 + \delta(d))})$ with $\delta(d) > 0$ depending on $d$ in [38] when $\mathrm{ReLU}^k$ or cosine is used as the activation function. Moreover, the rate has been proved to be sharp in Sobolev norms when the index $s$ of the Barron space and that of the Sobolev norm lie in an appropriate regime.

Although the function class $\mathcal{F}_{\mathrm{ReLU},m}(B)$ can be used to approximate functions in $\mathcal{B}^2(\Omega)$ without the curse of dimensionality, it brings several issues to both theory and computation if used as the hypothesis class for solving PDEs. On the one hand, the set $\mathcal{F}_{\mathrm{ReLU},m} \subset H^1(\Omega)$ consists only of piecewise affine functions, which may be undesirable in some PDE problems if the function of interest is expected to be more regular or smooth. On the other hand, the fact that $\mathcal{F}_{\mathrm{ReLU},m}$ only admits first order weak derivatives makes it extremely difficult to bound the complexity of function classes involving derivatives of functions from $\mathcal{F}_{\mathrm{ReLU},m}$, whereas the latter is a crucial ingredient for obtaining a generalization bound for the DRM.

To resolve these issues, in what follows we will instead consider a class of two-layer neural networks with the Softplus [9, 16] activation function. Recall the Softplus function $\mathrm{SP}(z) = \ln(1 + e^z)$ and its rescaled version $\mathrm{SP}_\tau(z)$, defined for $\tau > 0$ by

$\mathrm{SP}_\tau(z) = \frac{1}{\tau}\mathrm{SP}(\tau z) = \frac{1}{\tau}\ln(1 + e^{\tau z})$.

Observe that the rescaled Softplus $\mathrm{SP}_\tau(z)$ can be viewed as a smooth approximation of the ReLU function, since $\mathrm{SP}_\tau(z) \to \mathrm{ReLU}(z)$ as $\tau \to \infty$ for any $z \in \mathbb{R}$ (see Lemma 4.6 for a quantitative statement).
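The convergence of $\mathrm{SP}_\tau$ to ReLU is easy to check numerically: since $\mathrm{SP}(z) - \mathrm{ReLU}(z) = \ln(1 + e^{-|z|})$, the uniform gap is exactly $\ln 2/\tau$, attained at $z = 0$. A small illustration of ours (not the paper's Lemma 4.6 itself):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softplus_tau(z, tau):
    """Rescaled Softplus SP_tau(z) = ln(1 + exp(tau*z)) / tau, computed in the
    numerically stable form ln(1 + e^y) = max(y, 0) + log1p(e^{-|y|})."""
    y = tau * z
    return (np.maximum(y, 0.0) + np.log1p(np.exp(-np.abs(y)))) / tau

z = np.linspace(-3.0, 3.0, 2001)         # grid containing z = 0
for tau in (1.0, 10.0, 100.0):
    gap = np.max(np.abs(softplus_tau(z, tau) - relu(z)))
    print(tau, gap)                       # gap equals ln(2)/tau
```

In particular, with $\tau = \sqrt{m}$ as in Theorem 2.2 below, replacing ReLU by Softplus changes the network values by at most $O(m^{-1/2})$ uniformly.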
Moreover, two-layer neural networks with the activation function $\mathrm{SP}_\tau$ satisfy an approximation result similar to Theorem 2.1 when approximating spectral Barron functions in $\mathcal{B}^2(\Omega)$, as shown in the next theorem.

Theorem 2.2.
Consider the class of two-layer Softplus neural networks

(2.13)  $\mathcal{F}_{\mathrm{SP}_\tau,m}(B) := \Big\{ c + \sum_{i=1}^m \gamma_i\,\mathrm{SP}_\tau(\omega_i \cdot x - t_i) : |c| \le B,\ \|\omega_i\|_1 = 1,\ |t_i| \le 1,\ \sum_{i=1}^m |\gamma_i| \le B \Big\}$.

Then for any $u \in \mathcal{B}^2(\Omega)$, there exists a two-layer neural network $u_m \in \mathcal{F}_{\mathrm{SP}_\tau,m}(\|u\|_{\mathcal{B}^2(\Omega)})$ with activation function $\mathrm{SP}_\tau$, $\tau = \sqrt{m}$, such that

$\|u - u_m\|_{H^1(\Omega)} \le \frac{\|u\|_{\mathcal{B}^2(\Omega)}\,(6\log m + 30)}{\sqrt{m}}$.

The proofs of Theorem 2.1 and Theorem 2.2 can be found in Section 4. Now we are ready to state the main generalization results for two-layer neural networks solving the Poisson and static Schrödinger equations. We start with the generalization error bound for the neural-network solution in the Poisson case.
Theorem 2.3.
Assume that the solution $u^*_P$ of the Neumann problem for the Poisson equation (2.1) satisfies $\|u^*_P\|_{\mathcal{B}^2(\Omega)} < \infty$. Let $u^m_{n,P}$ be the minimizer of the empirical loss $\mathcal{E}_{n,P}$ in the set $\mathcal{F} = \mathcal{F}_{\mathrm{SP}_\tau,m}(\|u^*_P\|_{\mathcal{B}^2(\Omega)})$ with $\tau = \sqrt{m}$. Then it holds that

(2.14)  $\mathbb{E}\big[\mathcal{E}_P(u^m_{n,P}) - \mathcal{E}_P(u^*_P)\big] \le \frac{C_1\sqrt{m}\,(\sqrt{\log m} + 1)}{\sqrt{n}} + \frac{C_2(\log m + 1)^2}{m}$.

Here $C_1 > 0$ depends polynomially on $\|u^*_P\|_{\mathcal{B}^2(\Omega)}$, $d$, $\|f\|_{L^\infty(\Omega)}$, and $C_2 > 0$ depends quadratically on $\|u^*_P\|_{\mathcal{B}^2(\Omega)}$. In particular, setting $m = n^{1/3}$ in (2.14), which balances the two error terms, leads to

$\mathbb{E}\big[\mathcal{E}_P(u^m_{n,P}) - \mathcal{E}_P(u^*_P)\big] \le \frac{C(\log n)^2}{n^{1/3}}$

for some $C > 0$ depending only polynomially on $\|u^*_P\|_{\mathcal{B}^2(\Omega)}$, $d$, $\|f\|_{L^\infty(\Omega)}$.
Next we state the generalization error bound for the neural-network solution in the case of the static Schrödinger equation.
Theorem 2.4.
Assume that the solution $u^*_S$ of the Neumann problem for the static Schrödinger equation (2.2) satisfies $\|u^*_S\|_{\mathcal{B}^2(\Omega)} < \infty$. Let $u^m_{n,S}$ be the minimizer of the empirical loss $\mathcal{E}_{n,S}$ in the set $\mathcal{F} = \mathcal{F}_{\mathrm{SP}_\tau,m}(\|u^*_S\|_{\mathcal{B}^2(\Omega)})$ with $\tau = \sqrt{m}$. Then it holds that

(2.15)  $\mathbb{E}\big[\mathcal{E}_S(u^m_{n,S}) - \mathcal{E}_S(u^*_S)\big] \le \frac{C_1\sqrt{m}\,(\sqrt{\log m} + 1)}{\sqrt{n}} + \frac{C_2(\log m + 1)^2}{m}$.

Here $C_1 > 0$ depends polynomially on $\|u^*_S\|_{\mathcal{B}^2(\Omega)}$, $d$, $\|f\|_{L^\infty(\Omega)}$, $\|V\|_{L^\infty(\Omega)}$, and $C_2 > 0$ depends quadratically on $\|u^*_S\|_{\mathcal{B}^2(\Omega)}$. In particular, setting $m = n^{1/3}$ in (2.15) leads to

$\mathbb{E}\big[\mathcal{E}_S(u^m_{n,S}) - \mathcal{E}_S(u^*_S)\big] \le \frac{C(\log n)^2}{n^{1/3}}$

for some $C > 0$ depending only polynomially on $\|u^*_S\|_{\mathcal{B}^2(\Omega)}$, $d$, $\|f\|_{L^\infty(\Omega)}$, $\|V\|_{L^\infty(\Omega)}$.
Thanks to the estimates (2.4) and (2.6), the generalization bounds above on the energy excess translate directly into generalization bounds on the square of the $H^1$-error between the neural-network solution and the exact solution of the PDE. Specifically, when $m = n^{1/3}$, it holds that for some constant $C > 0$,

$\mathbb{E}\,\|u^m_n - u^*\|^2_{H^1(\Omega)} \le \frac{C(\log n)^2}{n^{1/3}}$.

Theorem 2.3 and Theorem 2.4 show that the generalization error of the neural-network solutions of the Poisson and static Schrödinger equations does not suffer from the curse of dimensionality under the key assumption that the exact solutions belong to the spectral Barron space $\mathcal{B}^2(\Omega)$. The proofs of Theorem 2.3 and Theorem 2.4 can be found in Section 6.

Finally, we verify the key low-complexity assumption by proving a new well-posedness theory for the Poisson and static Schrödinger equations in spectral Barron spaces. We start with the new solution theory for the Poisson equation, whose proof can be found in Section 7.1.

Theorem 2.5.
Assume that $f \in \mathcal{B}^s(\Omega)$ with $s \ge 0$ and that $\hat{f}(0) = \int_\Omega f(x)\,dx = 0$. Then the unique solution $u^*$ to the Neumann problem for the Poisson equation satisfies $u^* \in \mathcal{B}^{s+2}(\Omega)$ and

$\|u^*\|_{\mathcal{B}^{s+2}(\Omega)} \le C\,\|f\|_{\mathcal{B}^s(\Omega)}$

for an absolute constant $C > 0$. In particular, when $s = 0$ we have $\|u^*\|_{\mathcal{B}^2(\Omega)} \le C\,\|f\|_{\mathcal{B}^0(\Omega)}$.

The next theorem establishes the solution theory for the static Schrödinger equation in spectral Barron spaces.
Theorem 2.6.
Assume that $f \in \mathcal{B}^s(\Omega)$ with $s \ge 0$ and that $V \in \mathcal{B}^s(\Omega)$ with $V(x) \ge V_{\min} > 0$ for every $x \in \Omega$. Then the static Schrödinger problem (2.2) has a unique solution $u \in \mathcal{B}^{s+2}(\Omega)$. Moreover, there exists a constant $C > 0$ depending on $V$ and $d$ such that

(2.16)  $\|u\|_{\mathcal{B}^{s+2}(\Omega)} \le C\,\|f\|_{\mathcal{B}^s(\Omega)}$.

In particular, when $s = 0$ we have $\|u\|_{\mathcal{B}^2(\Omega)} \le C\,\|f\|_{\mathcal{B}^0(\Omega)}$.

The stability estimates above can be viewed as an analog of the standard Sobolev regularity estimate $\|u\|_{H^{s+2}(\Omega)} \le C\,\|f\|_{H^s(\Omega)}$. However, the proof of the estimate (2.16) is quite different from that of the Sobolev estimate. In particular, due to the lack of a Hilbert structure in the Barron space $\mathcal{B}^s(\Omega)$, the standard Lax-Milgram theorem and the bootstrap arguments used to prove Sobolev regularity estimates cannot be applied here. Instead, we study the equivalent operator equation satisfied by the cosine coefficients of the solution of the static Schrödinger equation. By exploiting the fact that the Barron space is a weighted $\ell^1$-space on the cosine coefficients, we prove the well-posedness of the operator equation and the stability estimate (2.16) via an application of the Fredholm theory to the operator equation. The complete proof of Theorem 2.6 can be found in Section 7.2.

2.4.
Discussions and Future Directions.
We established dimension-independent rates of convergence for the generalization error of the DRM for solving two simple linear elliptic PDEs. We would like to discuss some restrictions of the main results and point out some interesting future directions.

First, some numerical results suggest that the convergence rates in our generalization error estimates may not be sharp. In fact, Siegel and Xu [38] obtained sharp convergence rates of $O(m^{-(1/2+\delta(d))})$ with some $\delta(d) > 0$ for approximating a similar class of spectral Barron functions using two-layer neural nets with cosine and $\mathrm{ReLU}^k$ activation functions. However, the parameters (weights and biases) of the neural networks constructed in their approximation results were not well controlled (and may be unbounded), which could potentially lead to large generalization errors. One interesting open question is to sharpen the approximation rate for our spectral Barron functions using controllable two-layer neural networks with possibly different activation functions. On the other hand, the statistical error bound $O(\sqrt{m}(\sqrt{\log m}+1)/\sqrt{n})$ may also be improved with sharper and more delicate Rademacher complexity estimates for the neural networks.

We restricted our attention to two simple elliptic problems defined on a hypercube with the Neumann boundary condition to better convey the main ideas. It is natural to consider carrying out similar programs for more general PDE problems defined on general bounded or unbounded domains with other boundary conditions.
One major difficulty arises in the definition of Barron functions on a general bounded domain: our spectral Barron functions, built on cosine expansions, cannot be adapted to general domains. Other Barron functions, such as the one defined in [11] via an integral representation, are defined on bounded domains and may be considered as alternatives, but building a solution theory for PDEs in those spaces seems highly nontrivial; see [12] for some results and discussions along this direction. Another major issue comes from solving PDEs with essential boundary conditions such as Dirichlet or periodic boundary conditions, where one needs to construct neural networks that satisfy those boundary conditions; we refer to [8, 31] for some initial attempts in this direction.

Finally, the analysis of the training error of neural network methods for solving PDEs is a highly important and challenging question. The difficulty is largely due to the non-convexity of the loss function in the parameters. Nevertheless, recent breakthroughs in the theoretical analysis of two-layer neural network training show that the training dynamics can be largely simplified in the infinite-width limit, such as in the mean field regime [5, 29, 34, 40] or the neural tangent kernel (NTK) regime [6, 15, 21], where global convergence of the limiting dynamics can be proved under suitable assumptions. It is an exciting direction to establish similar convergence results for overparameterized two-layer networks in the context of solving PDEs.

3. Abstract generalization error bounds
In this section, we derive some abstract generalization bounds for the empirical loss minimization discussed in the previous section. To simplify the notation, we suppress the problem-dependent subscript $P$ or $S$ and denote by $u_n$ the minimizer of the empirical loss $\mathcal{E}_n$ over the hypothesis space $\mathcal{F}$. Recall that $u^*$ is the exact solution of the PDE. We aim to bound the energy excess $\Delta\mathcal{E}_n := \mathcal{E}(u_n) - \mathcal{E}(u^*)$. By definition we have $\Delta\mathcal{E}_n \ge 0$. To bound $\Delta\mathcal{E}_n$ from above, we first decompose it as
\[(3.1)\quad \Delta\mathcal{E}_n = \mathcal{E}(u_n) - \mathcal{E}_n(u_n) + \mathcal{E}_n(u_n) - \mathcal{E}_n(u_{\mathcal{F}}) + \mathcal{E}_n(u_{\mathcal{F}}) - \mathcal{E}(u_{\mathcal{F}}) + \mathcal{E}(u_{\mathcal{F}}) - \mathcal{E}(u^*).\]
Here $u_{\mathcal{F}} = \arg\min_{u \in \mathcal{F}} \mathcal{E}(u)$. Since $u_n$ is the minimizer of $\mathcal{E}_n$, we have $\mathcal{E}_n(u_n) - \mathcal{E}_n(u_{\mathcal{F}}) \le 0$. Therefore taking expectation on both sides of (3.1) gives
\[(3.2)\quad \mathbb{E}\,\Delta\mathcal{E}_n \le \underbrace{\mathbb{E}\big[\mathcal{E}(u_n) - \mathcal{E}_n(u_n)\big]}_{\Delta\mathcal{E}_{gen}} + \underbrace{\mathbb{E}\big[\mathcal{E}_n(u_{\mathcal{F}})\big] - \mathcal{E}(u_{\mathcal{F}})}_{\Delta\mathcal{E}_{bias}} + \underbrace{\mathcal{E}(u_{\mathcal{F}}) - \mathcal{E}(u^*)}_{\Delta\mathcal{E}_{approx}}.\]
Observe that $\Delta\mathcal{E}_{gen}$ and $\Delta\mathcal{E}_{bias}$ are the statistical errors: the first term $\Delta\mathcal{E}_{gen}$ describes the generalization error of the empirical loss minimization over the hypothesis space $\mathcal{F}$, and the second term $\Delta\mathcal{E}_{bias}$ is the bias coming from the Monte Carlo approximation of the integrals, whereas the third term $\Delta\mathcal{E}_{approx}$ is the approximation error incurred by restricting the minimization of $\mathcal{E}$ from the set $H^1(\Omega)$ to $\mathcal{F}$. Moreover, thanks to Proposition 2.1, the third term $\Delta\mathcal{E}_{approx}$ is equivalent (up to a constant) to $\inf_{u \in \mathcal{F}} \|u - u^*\|^2_{H^1(\Omega)}$.

To control the statistical errors, it is essential to prove a so-called uniform law of large numbers for certain function classes, where the notion of Rademacher complexity plays an important role; we recall it below.
Definition 3.1.
We define, for a set of random variables $\{Z_j\}_{j=1}^n$ independently distributed according to $P_\Omega$ and a function class $\mathcal{S}$, the random variable
\[\hat{R}_n(\mathcal{S}) := \mathbb{E}_\sigma\Big[\sup_{g \in \mathcal{S}}\Big|\frac{1}{n}\sum_{j=1}^n \sigma_j g(Z_j)\Big| \,\Big|\, Z_1, \cdots, Z_n\Big],\]
where the expectation $\mathbb{E}_\sigma$ is taken with respect to the independent uniform Bernoulli sequence $\{\sigma_j\}_{j=1}^n$ with $\sigma_j \in \{\pm 1\}$. Then the Rademacher complexity of $\mathcal{S}$ is defined by $R_n(\mathcal{S}) = \mathbb{E}_{P_\Omega}[\hat{R}_n(\mathcal{S})]$.

The following important symmetrization lemma makes the connection between the uniform law of large numbers and the Rademacher complexity.
Lemma 3.1. [41, Proposition 4.11] Let $\mathcal{F}$ be a set of functions. Then
\[\mathbb{E}\sup_{u \in \mathcal{F}}\Big|\frac{1}{n}\sum_{j=1}^n u(X_j) - \mathbb{E}_{X \sim P_\Omega} u(X)\Big| \le 2 R_n(\mathcal{F}).\]

Poisson Equation.
In this subsection we derive the abstract generalization bound in the setting of the Poisson equation. Recall the Ritz loss and the empirical loss associated to the Poisson equation:
\[\mathcal{E}_P(u) = |\Omega| \cdot \mathbb{E}_{X \sim P_\Omega}\Big[\frac{1}{2}|\nabla u(X)|^2 - f(X)u(X)\Big] + \frac{1}{2}\Big(|\Omega| \cdot \mathbb{E}_{X \sim P_\Omega} u(X)\Big)^2 =: \mathcal{E}_1(u) + \mathcal{E}_2(u),\]
\[\mathcal{E}_{n,P}(u) = \frac{1}{n}\sum_{j=1}^n\Big[|\Omega| \cdot \Big(\frac{1}{2}|\nabla u(X_j)|^2 - f(X_j)u(X_j)\Big)\Big] + \frac{1}{2}\Big(\frac{|\Omega|}{n}\sum_{j=1}^n u(X_j)\Big)^2 =: \mathcal{E}_{n,1}(u) + \mathcal{E}_{n,2}(u).\]
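To make the Monte Carlo structure of $\mathcal{E}_{n,P}$ concrete, here is a minimal numerical sketch on $\Omega = [0,1]^d$. Everything in it (the two-layer softplus network, the forcing term `f`, and all parameter values) is a hypothetical illustration, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m_width, n = 5, 16, 4000          # dimension, network width, sample size

# hypothetical two-layer softplus network u(x) = c + sum_i gamma_i * sp(w_i . x + t_i)
c = 0.1
gamma = rng.normal(size=m_width) / m_width
Wm = rng.normal(size=(m_width, d))
Wm /= np.linalg.norm(Wm, axis=1, keepdims=True)
t = rng.uniform(-1.0, 1.0, size=m_width)

sp = lambda z: np.logaddexp(0.0, z)            # softplus activation
sp1 = lambda z: 1.0 / (1.0 + np.exp(-z))       # its derivative

def u(X):                                      # X has shape (n, d)
    return c + sp(X @ Wm.T + t) @ gamma

def grad_u(X):                                 # gradient of u, shape (n, d)
    return (sp1(X @ Wm.T + t) * gamma) @ Wm

f = lambda X: np.cos(np.pi * X[:, 0])          # hypothetical forcing term

def empirical_ritz_loss(X):
    """E_{n,P}(u) = (1/n) sum_j [ |grad u|^2 / 2 - f u ] + ((1/n) sum_j u)^2 / 2  (|Omega| = 1)."""
    g, uv = grad_u(X), u(X)
    return np.mean(0.5 * np.sum(g**2, axis=1) - f(X) * uv) + 0.5 * np.mean(uv)**2

X = rng.uniform(0.0, 1.0, size=(n, d))
loss = empirical_ritz_loss(X)
```

Evaluating the loss on two independent sample sets gives two nearby values; their gap is precisely the statistical error that the Rademacher-complexity bounds of this section control.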
By definition, the bias term $\Delta\mathcal{E}_{bias}$ satisfies
\[\begin{aligned}
\Delta\mathcal{E}_{bias} &= \mathbb{E}[\mathcal{E}_{n,1}(u_{\mathcal{F}})] - \mathcal{E}_1(u_{\mathcal{F}}) + \mathbb{E}[\mathcal{E}_{n,2}(u_{\mathcal{F}})] - \mathcal{E}_2(u_{\mathcal{F}})\\
&= \frac{1}{2}\,\mathbb{E}\Big(\frac{|\Omega|}{n}\sum_{j=1}^n u_{\mathcal{F}}(X_j)\Big)^2 - \frac{1}{2}\Big(|\Omega| \cdot \mathbb{E}_{X \sim P_\Omega} u_{\mathcal{F}}(X)\Big)^2\\
&= \frac{1}{2}\,\mathbb{E}\Big[\Big(\frac{1}{n}\sum_{j=1}^n u_{\mathcal{F}}(X_j) - \mathbb{E}_{X \sim P_\Omega} u_{\mathcal{F}}(X)\Big)\Big(\frac{1}{n}\sum_{j=1}^n u_{\mathcal{F}}(X_j) + \mathbb{E}_{X \sim P_\Omega} u_{\mathcal{F}}(X)\Big)\Big]\\
&\le \|u_{\mathcal{F}}\|_{L^\infty(\Omega)} \cdot \mathbb{E}\sup_{u \in \mathcal{F}}\Big|\frac{1}{n}\sum_{j=1}^n u(X_j) - \mathbb{E}_{X \sim P_\Omega} u(X)\Big|\\
&\le 2\sup_{u \in \mathcal{F}}\|u\|_{L^\infty(\Omega)} \cdot R_n(\mathcal{F}),
\end{aligned}\]
where we have used $|\Omega| = 1$ and the last inequality follows from Lemma 3.1.

Next we bound the first term $\Delta\mathcal{E}_{gen}$. Let us first define the set of functions $\mathcal{G}_P$ for the term appearing in $\mathcal{E}_1$ by
\[\mathcal{G}_P := \Big\{g : \Omega \to \mathbb{R} \,\Big|\, g = \frac{1}{2}|\nabla u|^2 - fu \text{ where } u \in \mathcal{F}\Big\}.\]
Then it follows from Lemma 3.1 that
\[\begin{aligned}
\Delta\mathcal{E}_{gen} &\le \mathbb{E}\sup_{v \in \mathcal{F}}\big|\mathcal{E}(v) - \mathcal{E}_n(v)\big| \le \mathbb{E}\sup_{v \in \mathcal{F}}\big|\mathcal{E}_1(v) - \mathcal{E}_{n,1}(v)\big| + \mathbb{E}\sup_{v \in \mathcal{F}}\big|\mathcal{E}_2(v) - \mathcal{E}_{n,2}(v)\big|\\
&\le \mathbb{E}\sup_{g \in \mathcal{G}_P}\Big|\frac{1}{n}\sum_{j=1}^n g(X_j) - \mathbb{E}_{P_\Omega}[g]\Big| + \frac{1}{2}\,\mathbb{E}\sup_{u \in \mathcal{F}}\Big|\Big(\mathbb{E}_{X \sim P_\Omega} u(X)\Big)^2 - \Big(\frac{1}{n}\sum_{j=1}^n u(X_j)\Big)^2\Big|\\
&\le 2R_n(\mathcal{G}_P) + \sup_{u \in \mathcal{F}}\|u\|_{L^\infty(\Omega)} \cdot \mathbb{E}\sup_{u \in \mathcal{F}}\Big|\frac{1}{n}\sum_{j=1}^n u(X_j) - \mathbb{E}_{X \sim P_\Omega} u(X)\Big|\\
&\le 2R_n(\mathcal{G}_P) + 2\sup_{u \in \mathcal{F}}\|u\|_{L^\infty(\Omega)}\,R_n(\mathcal{F}).
\end{aligned}\]
Finally, owing to the estimate (2.4) in Proposition 2.1, the approximation error satisfies
\[\Delta\mathcal{E}_{approx} \le \frac{1}{2}\inf_{u \in \mathcal{F}}\|u - u^*\|^2_{H^1(\Omega)}.\]
To summarize, we have established the following abstract generalization error bound for the energy excess $\Delta\mathcal{E}_n$ in the case of the Poisson equation.

Theorem 3.1.
Let $u_{n,P}$ be the minimizer of the empirical risk $\mathcal{E}_{n,P}$ within the hypothesis class $\mathcal{F}$ satisfying $\sup_{u \in \mathcal{F}}\|u\|_{L^\infty(\Omega)} < \infty$. Let $\Delta\mathcal{E}_{n,P} = \mathcal{E}_P(u_{n,P}) - \mathcal{E}_P(u^*_P)$. Then
\[(3.3)\quad \mathbb{E}\,\Delta\mathcal{E}_{n,P} \le 2R_n(\mathcal{G}_P) + 4\sup_{u \in \mathcal{F}}\|u\|_{L^\infty(\Omega)} \cdot R_n(\mathcal{F}) + \frac{1}{2}\inf_{u \in \mathcal{F}}\|u - u^*_P\|^2_{H^1(\Omega)}.\]

Static Schrödinger Equation.
In this subsection we proceed to prove an abstract generalization bound for the static Schrödinger equation. First recall the corresponding Ritz loss and empirical loss:
\[\mathcal{E}_S(u) = |\Omega| \cdot \mathbb{E}_{X \sim P_\Omega}\Big[\frac{1}{2}|\nabla u(X)|^2 + \frac{1}{2}V(X)|u(X)|^2 - f(X)u(X)\Big],\]
\[\mathcal{E}_{n,S}(u) = \frac{1}{n}\sum_{j=1}^n\Big[|\Omega| \cdot \Big(\frac{1}{2}|\nabla u(X_j)|^2 + \frac{1}{2}V(X_j)|u(X_j)|^2 - f(X_j)u(X_j)\Big)\Big].\]
Similar to the previous subsection, we introduce the function class $\mathcal{G}_S$ by setting
\[\mathcal{G}_S := \Big\{g : \Omega \to \mathbb{R} \,\Big|\, g = \frac{1}{2}|\nabla u|^2 + \frac{1}{2}V|u|^2 - fu \text{ where } u \in \mathcal{F}\Big\}.\]
In the Schrödinger case, since the Ritz loss $\mathcal{E}_S$ is linear with respect to the probability measure $P_\Omega$, the statistical errors are simpler than those in the Poisson case. In particular, a similar calculation shows that $\Delta\mathcal{E}_{bias} = 0$ and $\Delta\mathcal{E}_{gen} \le 2R_n(\mathcal{G}_S)$. Therefore, as a result of (3.2), we obtain the following theorem.

Theorem 3.2.
Let $u_{n,S}$ be the minimizer of the empirical risk $\mathcal{E}_{n,S}$ within the hypothesis class $\mathcal{F}$ satisfying $\sup_{u \in \mathcal{F}}\|u\|_{L^\infty(\Omega)} < \infty$. Let $\Delta\mathcal{E}_{n,S} = \mathcal{E}_S(u_{n,S}) - \mathcal{E}_S(u^*_S)$. Then
\[(3.4)\quad \mathbb{E}\,\Delta\mathcal{E}_{n,S} \le 2R_n(\mathcal{G}_S) + \frac{1}{2}\inf_{u \in \mathcal{F}}\|u - u^*_S\|^2_{H^1(\Omega)}.\]

4. Spectral Barron functions on the hypercube and their $H^1$-approximation

In this section, we discuss the properties of spectral Barron functions on the $d$-dimensional hypercube defined by (2.10), as well as their neural network approximations. Since our spectral Barron functions are defined via the expansion under the following set of cosine functions
\[\mathcal{C} = \big\{\Phi_k\big\}_{k \in \mathbb{N}^d} := \Big\{\prod_{i=1}^d \cos(\pi k_i x_i) \,\Big|\, k_i \in \mathbb{N}\Big\},\]
we start by stating some preliminaries on $\mathcal{C}$ and the product of cosines to be used in the subsequent proofs.

4.1. Preliminary Lemmas.

Lemma 4.1.
The set $\mathcal{C}$ forms an orthogonal basis of $L^2(\Omega)$ and $H^1(\Omega)$.

Proof. That $\mathcal{C}$ forms an orthogonal basis of $L^2(\Omega)$ follows directly from Parseval's theorem applied to the Fourier expansion of the even extension of a function $u \in L^2(\Omega)$. To see that $\mathcal{C}$ is an orthogonal basis of $H^1(\Omega)$, since $\mathcal{C}$ is an orthogonal set in $H^1(\Omega)$, it suffices to show that if $u \in H^1(\Omega)$ satisfies $(u, \Phi_k)_{H^1(\Omega)} = 0$ for all $k \in \mathbb{N}^d$, then $u = 0$. In fact, the last display yields that
\[\int_\Omega u \cdot \Phi_k\,dx + \int_\Omega \nabla u \cdot \nabla\Phi_k\,dx = \int_\Omega u \cdot (\Phi_k - \Delta\Phi_k)\,dx = (1 + \pi^2|k|^2)\int_\Omega u \cdot \Phi_k\,dx,\]
where for the second identity we have used Green's formula and the fact that the normal derivative of $\Phi_k$ vanishes on the boundary of $\Omega$. Therefore we have obtained $(u, \Phi_k)_{L^2} = 0$ for every $k \in \mathbb{N}^d$, which implies that $u = 0$ since $\mathcal{C}$ is an orthogonal basis of $L^2(\Omega)$. □

Given $u \in L^2(\Omega)$, let $\{\hat{u}(k)\}_{k \in \mathbb{N}^d}$ be the expansion coefficients of $u$ under the basis $\{\Phi_k\}_{k \in \mathbb{N}^d}$, so that
\[u(x) = \sum_{k \in \mathbb{N}^d} \hat{u}(k)\Phi_k(x).\]
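The orthogonality in Lemma 4.1, and the normalization constants $\alpha_k = \langle\Phi_k, \Phi_k\rangle_{L^2(\Omega)}$ used below, are easy to confirm numerically in low dimension. A small sketch with $d = 2$ and a midpoint-rule quadrature (the grid size is chosen arbitrarily):

```python
import numpy as np

N = 512
xs = (np.arange(N) + 0.5) / N                    # midpoint grid on [0, 1]
X, Y = np.meshgrid(xs, xs, indexing="ij")

def phi(k):
    # tensor-product cosine basis function Phi_k on [0,1]^2
    return np.cos(np.pi * k[0] * X) * np.cos(np.pi * k[1] * Y)

def inner(f, g):
    # midpoint-rule approximation of the L^2(Omega) inner product
    return float(np.mean(f * g))

off_diag = inner(phi((1, 2)), phi((2, 1)))       # distinct indices: should vanish
alpha_12 = inner(phi((1, 2)), phi((1, 2)))       # both k_i nonzero: 2^{-2} = 0.25
alpha_02 = inner(phi((0, 2)), phi((0, 2)))       # one k_i zero:     2^{-1} = 0.5
```

The midpoint rule integrates these trigonometric products essentially exactly, so the computed values match the closed-form constants to machine precision.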
Moreover, it follows from a straightforward calculation that for $u \in H^1(\Omega)$,
\[\|u\|^2_{H^1(\Omega)} = \sum_{k \in \mathbb{N}^d} \alpha_k(1 + \pi^2|k|^2)|\hat{u}(k)|^2,\]
where $\alpha_k = \langle\Phi_k, \Phi_k\rangle_{L^2(\Omega)} = 2^{-\sum_{i=1}^d \mathbb{1}_{k_i \neq 0}} \le 1$. This implies the following characterization of an $H^1(\Omega)$ function in terms of its expansion coefficients under $\mathcal{C}$.

Corollary 4.1.
The space $H^1(\Omega)$ can be characterized as
\[H^1(\Omega) = \Big\{u \in L^2(\Omega) \,\Big|\, \sum_{k \in \mathbb{N}^d} |\hat{u}(k)|^2(1 + \pi^2|k|^2) < \infty\Big\}.\]

The following elementary product formula for cosine functions will also be useful.
Lemma 4.2.
For any $\{\theta_i\}_{i=1}^d \subset \mathbb{R}$,
\[\prod_{i=1}^d \cos(\theta_i) = \frac{1}{2^d}\sum_{\xi \in \Xi}\cos(\xi \cdot \theta),\]
where $\theta = (\theta_1, \cdots, \theta_d)^T$ and $\Xi = \{1, -1\}^d$.

Proof. The lemma follows directly by iterating the simple identity
\[\cos(\theta_1)\cos(\theta_2) = \frac{1}{2}\big(\cos(\theta_1 + \theta_2) + \cos(\theta_1 - \theta_2)\big) = \frac{1}{4}\big(\cos(\theta_1 + \theta_2) + \cos(\theta_1 - \theta_2) + \cos(-\theta_1 - \theta_2) + \cos(-\theta_1 + \theta_2)\big). \qquad \Box\]

4.2. Spectral Barron Space and Neural-Network Approximation.
Recall for any $s \in \mathbb{N}$ the spectral Barron space $\mathcal{B}^s(\Omega)$ given by
\[\mathcal{B}^s(\Omega) := \Big\{u \in L^1(\Omega) : \sum_{k \in \mathbb{N}^d}(1 + \pi^s|k|_1^s)|\hat{u}(k)| < \infty\Big\}\]
with the associated norm $\|u\|_{\mathcal{B}^s(\Omega)} := \sum_{k \in \mathbb{N}^d}(1 + \pi^s|k|_1^s)|\hat{u}(k)|$. Recall also the shorthand notation $\mathcal{B}(\Omega)$ for $\mathcal{B}^2(\Omega)$.

Lemma 4.3.
The following embedding results hold:
(i) $\mathcal{B}(\Omega) \hookrightarrow H^1(\Omega)$;
(ii) $\mathcal{B}(\Omega) \hookrightarrow L^\infty(\Omega)$.

Proof. (i) If $u \in \mathcal{B}(\Omega)$, then $\|u\|_{\mathcal{B}(\Omega)} = \sum_{k \in \mathbb{N}^d}(1 + \pi^2|k|_1^2)|\hat{u}(k)| < \infty$. This in particular implies $|\hat{u}(k)| \le \|u\|_{\mathcal{B}(\Omega)}$ for each $k \in \mathbb{N}^d$. Since $\alpha_k \le 1$ and $|k| \le |k|_1$, we have
\[\|u\|^2_{H^1(\Omega)} = \sum_{k \in \mathbb{N}^d}\alpha_k(1 + \pi^2|k|^2)|\hat{u}(k)|^2 \le \|u\|_{\mathcal{B}(\Omega)}\sum_{k \in \mathbb{N}^d}(1 + \pi^2|k|_1^2)|\hat{u}(k)| = \|u\|^2_{\mathcal{B}(\Omega)}.\]
(ii) For $u \in \mathcal{B}(\Omega)$, using the fact that $\|\Phi_k\|_{L^\infty(\Omega)} \le 1$ we have
\[\|u\|_{L^\infty(\Omega)} = \Big\|\sum_{k \in \mathbb{N}^d}\hat{u}(k)\Phi_k\Big\|_{L^\infty(\Omega)} \le \sum_{k \in \mathbb{N}^d}|\hat{u}(k)| \le \|u\|_{\mathcal{B}(\Omega)}. \qquad \Box\]

Thanks to Lemma 4.1 and Lemma 4.2, any function $u \in H^1(\Omega)$ admits the expansion
\[(4.1)\quad u(x) = \sum_{k \in \mathbb{N}^d}\hat{u}(k) \cdot \frac{1}{2^d}\sum_{\xi \in \Xi}\cos(\pi k_\xi \cdot x),\]
where $\hat{u}(k)$ is the expansion coefficient of $u$ under the basis $\mathcal{C}$ and $k_\xi = (k_1\xi_1, \cdots, k_d\xi_d) \in \mathbb{Z}^d$. Given $u \in \mathcal{B}(\Omega) \subset H^1(\Omega)$, letting $(-1)^{\theta(k)} = \mathrm{sign}(\hat{u}(k))$ with $\theta(k) \in \{0, 1\}$, we have from (4.1) that
\[\begin{aligned}
u(x) &= \hat{u}(0) + \sum_{k \in \mathbb{N}^d \setminus \{0\}}\hat{u}(k) \cdot \frac{1}{2^d}\sum_{\xi \in \Xi}\cos(\pi k_\xi \cdot x)\\
&= \hat{u}(0) + \sum_{k \in \mathbb{N}^d \setminus \{0\}}|\hat{u}(k)|\,\mathrm{sign}(\hat{u}(k)) \cdot \frac{1}{2^d}\sum_{\xi \in \Xi}\cos(\pi k_\xi \cdot x)\\
&= \hat{u}(0) + \sum_{k \in \mathbb{N}^d \setminus \{0\}}|\hat{u}(k)| \cdot \frac{1}{2^d}\sum_{\xi \in \Xi}\cos(\pi(k_\xi \cdot x + \theta(k)))\\
&= \hat{u}(0) + \sum_{k \in \mathbb{N}^d \setminus \{0\}}\frac{|\hat{u}(k)|(1 + \pi^2|k|_1^2)}{Z_u} \cdot \frac{Z_u}{1 + \pi^2|k|_1^2} \cdot \frac{1}{2^d}\sum_{\xi \in \Xi}\cos(\pi(k_\xi \cdot x + \theta(k)))\\
&=: \hat{u}(0) + \int g(x, k)\,\mu(dk),
\end{aligned}\]
where $\mu(dk)$ is the probability measure on $\mathbb{N}^d \setminus \{0\}$ defined by
\[\mu(dk) = \sum_{k \in \mathbb{N}^d \setminus \{0\}}\frac{|\hat{u}(k)|(1 + \pi^2|k|_1^2)}{Z_u}\,\delta_k(dk)\]
with normalizing constant $Z_u = \sum_{k \in \mathbb{N}^d \setminus \{0\}}|\hat{u}(k)|(1 + \pi^2|k|_1^2) \le \|u\|_{\mathcal{B}(\Omega)}$ and
\[g(x, k) = \frac{Z_u}{1 + \pi^2|k|_1^2} \cdot \frac{1}{2^d}\sum_{\xi \in \Xi}\cos(\pi(k_\xi \cdot x + \theta(k))).\]
Observe that the function $g(x, k) \in C^\infty(\Omega)$ for every $k \in \mathbb{N}^d \setminus \{0\}$. Moreover, it is straightforward to show that the following bounds hold:
\[\|g(\cdot, k)\|_{H^1(\Omega)} = \frac{Z_u\sqrt{\alpha_k(1 + \pi^2|k|^2)}}{1 + \pi^2|k|_1^2} \le \|u\|_{\mathcal{B}(\Omega)}, \qquad \|D^s g(\cdot, k)\|_{L^\infty(\Omega)} \le Z_u \le \|u\|_{\mathcal{B}(\Omega)} \text{ for } s = 0, 1, 2.\]
Let us define for a constant
$B > 0$ the function class
\[\mathcal{F}_{\cos}(B) := \Big\{\frac{\gamma}{1 + \pi^2|k|_1^2}\cos(\pi(k \cdot x + b)) \,\Big|\, k \in \mathbb{Z}^d \setminus \{0\},\ |\gamma| \le B,\ b \in \{0, 1\}\Big\}.\]
It follows from the calculations above that if $u \in \mathcal{B}(\Omega)$, then $\bar{u} := u - \hat{u}(0)$ lies in the $H^1$-closure of the convex hull of $\mathcal{F}_{\cos}(B)$ with $B = \|u\|_{\mathcal{B}(\Omega)}$. Indeed, if $\{k_i\}_{i=1}^m$ is an i.i.d. sequence of random samples from the probability measure $\mu$, then it follows from Fubini's theorem that
\[\begin{aligned}
\mathbb{E}\Big\|\bar{u}(x) - \frac{1}{m}\sum_{i=1}^m g(x, k_i)\Big\|^2_{H^1(\Omega)} &= \mathbb{E}\int_\Omega\Big|\bar{u}(x) - \frac{1}{m}\sum_{i=1}^m g(x, k_i)\Big|^2 dx + \mathbb{E}\int_\Omega\Big|\nabla\bar{u}(x) - \frac{1}{m}\sum_{i=1}^m \nabla g(x, k_i)\Big|^2 dx\\
&= \frac{1}{m}\int_\Omega \mathrm{Var}[g(x, k)]\,dx + \frac{1}{m}\int_\Omega \mathrm{Tr}\big(\mathrm{Cov}[\nabla g(x, k)]\big)\,dx\\
&\le \frac{\mathbb{E}\|g(\cdot, k)\|^2_{H^1(\Omega)}}{m} \le \frac{\|u\|^2_{\mathcal{B}(\Omega)}}{m}.
\end{aligned}\]
Therefore the expected $H^1$-norm of an average of $m$ elements of $\mathcal{F}_{\cos}(B)$ converges to zero as $m \to \infty$. This in particular implies that there exists a sequence of convex combinations of points in $\mathcal{F}_{\cos}(B)$ converging to $\bar{u}$ in the $H^1$-norm. Since the $H^1$-norm of any function in $\mathcal{F}_{\cos}(B)$ is bounded by $B$, an application of Maurey's empirical method (see Lemma 4.4) yields the following theorem.

Theorem 4.1.
Let $u \in \mathcal{B}(\Omega)$. Then there exists $u_m$, a convex combination of $m$ functions in $\mathcal{F}_{\cos}(B)$ with $B = \|u\|_{\mathcal{B}(\Omega)}$, such that
\[\|u - \hat{u}(0) - u_m\|_{H^1(\Omega)} \le \frac{\|u\|_{\mathcal{B}(\Omega)}}{\sqrt{m}}.\]

Lemma 4.4. [2, 32] Let $u$ belong to the closure of the convex hull of a set $\mathcal{G}$ in a Hilbert space, and let the Hilbert norm of each element of $\mathcal{G}$ be bounded by $B > 0$. Then for every $m \in \mathbb{N}$, there exist $\{g_i\}_{i=1}^m \subset \mathcal{G}$ and $\{c_i\}_{i=1}^m \subset [0, 1]$ with $\sum_{i=1}^m c_i = 1$ such that
\[\Big\|u - \sum_{i=1}^m c_i g_i\Big\| \le \frac{B}{\sqrt{m}}.\]

4.3. Reduction to ReLU and Softplus Activation Functions.
Notice that every function in $\mathcal{F}_{\cos}(B)$ is the composition of the one-dimensional function $g$ defined on $[-1, 1]$ by
\[(4.2)\quad g(z) = \frac{\gamma}{1 + \pi^2|k|_1^2}\cos(\pi(|k|_1 z + b)) \quad \text{with } k \in \mathbb{Z}^d \setminus \{0\},\ |\gamma| \le B,\ b \in \{0, 1\},\]
and a linear function $z = w \cdot x$ with $w = k/|k|_1$. It is clear that $g \in C^\infty([-1, 1])$ and $g$ satisfies
\[(4.3)\quad \|g^{(s)}\|_{L^\infty([-1,1])} \le |\gamma| \le B \quad \text{for } s = 0, 1, 2.\]
Since $b \in \{0, 1\}$, it also holds that $g'(0) = 0$.

Lemma 4.5.
Let $g \in C^2([-1, 1])$ with $\|g^{(s)}\|_{L^\infty([-1,1])} \le B$ for $s = 0, 1, 2$. Assume that $g'(0) = 0$. Let $\{z_j\}_{j=0}^{2m}$ be a partition of $[-1, 1]$ with $z_0 = -1$, $z_m = 0$, $z_{2m} = 1$ and $z_{j+1} - z_j = h = 1/m$ for each $j = 0, \cdots, 2m - 1$. Then there exists a two-layer ReLU network $g_m$ of the form
\[(4.4)\quad g_m(z) = c + \sum_{i=1}^{2m} a_i\,\mathrm{ReLU}(\epsilon_i z - b_i), \quad z \in [-1, 1],\]
with $c = g(0)$, $b_i \in [-1, 1]$ and $\epsilon_i \in \{\pm 1\}$, $i = 1, \cdots, 2m$, such that
\[(4.5)\quad \|g - g_m\|_{W^{1,\infty}([-1,1])} \le \frac{2B}{m}.\]
Moreover, we have that $|a_i| \le B/m$ and $|c| \le B$.

Proof.
Let $g_m$ be the piecewise linear interpolation of $g$ with respect to the grid $\{z_j\}_{j=0}^{2m}$, i.e.,
\[g_m(z) = g(z_{j+1})\frac{z - z_j}{h} + g(z_j)\frac{z_{j+1} - z}{h} \quad \text{if } z \in [z_j, z_{j+1}].\]
According to [1, Chapter 11],
\[\|g - g_m\|_{L^\infty([-1,1])} \le \frac{h^2}{8}\|g''\|_{L^\infty([-1,1])}.\]
Moreover,
\[\|g' - g_m'\|_{L^\infty([-1,1])} \le h\|g''\|_{L^\infty([-1,1])}.\]
In fact, consider $z \in [z_j, z_{j+1}]$ for some $j \in \{0, \cdots, 2m - 1\}$. By the mean value theorem, there exist $\xi, \eta \in (z_j, z_{j+1})$ such that $(g(z_{j+1}) - g(z_j))/h = g'(\xi)$, and hence
\[\Big|g'(z) - \frac{g(z_{j+1}) - g(z_j)}{h}\Big| = |g'(z) - g'(\xi)| = |g''(\eta)||z - \xi| \le h\|g''\|_{L^\infty([-1,1])}.\]
This proves the error bound (4.5).

Next, we show that $g_m$ can be represented by a two-layer ReLU neural network. Indeed, it is easy to verify that $g_m$ can be rewritten as
\[(4.6)\quad g_m(z) = c + \sum_{i=1}^m a_i\,\mathrm{ReLU}(z_i - z) + \sum_{i=m+1}^{2m} a_i\,\mathrm{ReLU}(z - z_{i-1}), \quad z \in [-1, 1],\]
where $c = g(z_m) = g(0)$ and the parameters $a_i$ are defined by
\[a_i = \begin{cases} \dfrac{g(z_{m+1}) - g(z_m)}{h}, & i = m + 1,\\[1mm] \dfrac{g(z_{m-1}) - g(z_m)}{h}, & i = m,\\[1mm] \dfrac{g(z_i) - 2g(z_{i-1}) + g(z_{i-2})}{h}, & i > m + 1,\\[1mm] \dfrac{g(z_{i-1}) - 2g(z_i) + g(z_{i+1})}{h}, & i < m. \end{cases}\]
Furthermore, by the mean value theorem again, there exist $\xi_1 \in (z_m, z_{m+1})$ and $\xi_2$ between $0$ and $\xi_1$ such that
\[|a_{m+1}| = |g'(\xi_1)| = |g'(\xi_1) - g'(0)| = |g''(\xi_2)\,\xi_1| \le Bh.\]
In a similar manner one can obtain that $|a_m| \le Bh$ and $|a_i| \le Bh$ for $i \notin \{m, m + 1\}$; hence $|a_i| \le B/m$ for all $i$. Finally, by setting $\epsilon_i = -1$, $b_i = -z_i$ for $i = 1, \cdots, m$ and $\epsilon_i = 1$, $b_i = z_{i-1}$ for $i = m + 1, \cdots, 2m$, one obtains the desired form (4.4) of $g_m$. This completes the proof of the lemma. □

The following proposition is a direct consequence of Lemma 4.5.
Proposition 4.1.
Define the function class
\[\mathcal{F}_{\mathrm{ReLU}}(B) := \big\{c + \gamma\,\mathrm{ReLU}(w \cdot x - t) \,\big|\, |c| \le 2B,\ |w|_1 = 1,\ |t| \le 1,\ |\gamma| \le 4B\big\}.\]
Then for any constant $\tilde{c}$ with $|\tilde{c}| \le B$, the set $\tilde{c} + \mathcal{F}_{\cos}(B)$ is contained in the $H^1$-closure of the convex hull of $\mathcal{F}_{\mathrm{ReLU}}(B)$.

Proof. Lemma 4.5 states that each $C^2$ function $g$ with $g'(0) = 0$ and with derivatives up to second order bounded by $B$ can be well approximated in the $H^1$-norm by a linear combination of a constant function and the ReLU functions $\mathrm{ReLU}(\epsilon z - b)$, with the sum of the absolute values of the combination coefficients bounded by $2B$. As a result, the function $g$ defined in (4.2) lies in the closure of the convex hull of functions $c + \gamma\,\mathrm{ReLU}(\epsilon z - b)$ with $|c| \le B$, $|\gamma| \le 2B$, $|b| \le 1$. The proposition then follows by absorbing the additive constant $\tilde{c}$ into the constant $c$ in the definition of $\mathcal{F}_{\mathrm{ReLU}}(B)$. □
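The interpolation argument behind Lemma 4.5 and Proposition 4.1 can be sanity-checked numerically. The sketch below uses a hypothetical target $g$ with $g'(0) = 0$ and $|g|, |g'|, |g''| \le 1$, and writes its piecewise linear interpolant as a one-sided ReLU expansion (cosmetically different from, but equivalent to, the form (4.4)):

```python
import numpy as np

m = 50
z_grid = np.linspace(-1.0, 1.0, 2 * m + 1)           # grid of Lemma 4.5: h = 1/m
g = lambda z: np.cos(np.pi * z) / (1.0 + np.pi**2)   # hypothetical g: g'(0)=0, |g^{(s)}| <= 1

# piecewise linear interpolant written as a two-layer ReLU network:
# g_m(z) = g(-1) + sum_j a_j ReLU(z - z_j), one kink coefficient per grid node
slopes = np.diff(g(z_grid)) / np.diff(z_grid)
a = np.concatenate(([slopes[0]], np.diff(slopes)))

def g_relu(z):
    z = np.atleast_1d(z)
    return g(-1.0) + np.maximum(z[:, None] - z_grid[:-1], 0.0) @ a

z_fine = np.linspace(-1.0, 1.0, 20001)
sup_err = float(np.max(np.abs(g(z_fine) - g_relu(z_fine))))  # O(1/m^2) in sup norm
coeff_sum = float(np.sum(np.abs(a)))                         # stays O(B), uniformly in m
```

The uniform error decays like $h^2\|g''\|/8$ while the total kink mass stays bounded, which is exactly the coefficient control that Proposition 4.1 needs.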
PRIORI GENERALIZATION ANALYSIS OF THE DEEP RITZ METHOD 17
Proof of Theorem 2.1.
Observe that if $u \in \mathcal{F}_{\mathrm{ReLU}}(B)$, then
\[\|u\|^2_{H^1(\Omega)} \le (|c| + 2|\gamma|)^2 + |\gamma|^2 \le (10^2 + 4^2)B^2 = 116B^2.\]
Therefore Theorem 2.1 follows directly from Lemma 4.4, Proposition 4.1 with $\tilde{c} = \hat{u}(0)$, and the fact that $|\hat{u}(0)| \le \|u\|_{\mathcal{B}(\Omega)}$. □

Next we proceed to prove Theorem 2.2, which concerns approximating spectral Barron functions using two-layer networks with the Softplus activation. To this end, let us first state a lemma which shows that
ReLU can be well approximated by $\mathrm{SP}_\tau$ for $\tau \gg 1$.

Lemma 4.6.
The following inequalities hold:
(i) $|\mathrm{ReLU}(z) - \mathrm{SP}_\tau(z)| \le \frac{1}{\tau}e^{-\tau|z|}$ for all $z \in [-2, 2]$;
(ii) $|\mathrm{ReLU}'(z) - \mathrm{SP}_\tau'(z)| \le e^{-\tau|z|}$ for all $z \in [-2, 0) \cup (0, 2]$;
(iii) $\|\mathrm{SP}_\tau\|_{W^{1,\infty}([-2,2])} \le 3 + \frac{\ln 2}{\tau}$.

Proof.
Notice that $\mathrm{ReLU}(z) - \mathrm{SP}_\tau(z) = -\frac{1}{\tau}\ln(1 + e^{-\tau|z|})$. Hence inequality (i) follows from
\[|\mathrm{ReLU}(z) - \mathrm{SP}_\tau(z)| \le \frac{1}{\tau}\ln(1 + e^{-\tau|z|}) \le \frac{e^{-\tau|z|}}{\tau},\]
where the second inequality follows from the simple inequality $\ln(1 + x) \le x$ for $x > -1$. In addition, inequality (ii) holds since
\[|\mathrm{ReLU}'(z) - \mathrm{SP}_\tau'(z)| = \Big|\frac{1}{1 + e^{\tau|z|}}\Big| \le e^{-\tau|z|} \quad \text{if } z \neq 0.\]
Finally, inequality (iii) follows from $\|\mathrm{SP}_\tau\|_{L^\infty([-2,2])} = \mathrm{SP}_\tau(2) \le 2 + \frac{\ln 2}{\tau}$ and $|\mathrm{SP}_\tau'(z)| = \big|\frac{1}{1 + e^{-\tau z}}\big| \le 1$. □

Lemma 4.7.
Let $g \in C^2([-1, 1])$ with $\|g^{(s)}\|_{L^\infty([-1,1])} \le B$ for $s = 0, 1, 2$. Assume that $g'(0) = 0$. Let $\{z_j\}_{j=0}^{2m}$ be a partition of $[-1, 1]$ with $m \ge 1$, $z_0 = -1$, $z_m = 0$, $z_{2m} = 1$ and $z_{j+1} - z_j = h = 1/m$ for each $j = 0, \cdots, 2m - 1$. Then there exists a two-layer neural network $g_{\tau,m}$ of the form
\[(4.7)\quad g_{\tau,m}(z) = c + \sum_{i=1}^{2m} a_i\,\mathrm{SP}_\tau(\epsilon_i z - b_i), \quad z \in [-1, 1],\]
with $|c| = |g(0)| \le B$, $b_i \in [-1, 1]$, $|a_i| \le B/m$ and $\epsilon_i \in \{\pm 1\}$, $i = 1, \cdots, 2m$, such that
\[(4.8)\quad \|g - g_{\tau,m}\|_{W^{1,\infty}([-1,1])} \le 6B\delta_\tau,\]
where
\[(4.9)\quad \delta_\tau := \frac{1}{\tau}\Big(1 + \frac{1}{\tau}\Big)\big(\log(\tau) + 1\big).\]

Proof.
Thanks to Lemma 4.5, there exists $g_m$ of the form
\[(4.10)\quad g_m(z) = c + \sum_{i=1}^m a_i\,\mathrm{ReLU}(z_i - z) + \sum_{i=m+1}^{2m} a_i\,\mathrm{ReLU}(z - z_{i-1}), \quad z \in [-1, 1],\]
such that $\|g - g_m\|_{W^{1,\infty}([-1,1])} \le 2B/m$. More importantly, the coefficients satisfy $|a_i| \le B/m$, so that $\sum_{i=1}^{2m}|a_i| \le 2B$. Now let $g_{\tau,m}$ be the function obtained by replacing the activation ReLU in $g_m$ by $\mathrm{SP}_\tau$, i.e.,
\[(4.11)\quad g_{\tau,m}(z) = c + \sum_{i=1}^m a_i\,\mathrm{SP}_\tau(z_i - z) + \sum_{i=m+1}^{2m} a_i\,\mathrm{SP}_\tau(z - z_{i-1}), \quad z \in [-1, 1].\]
Suppose that $z \in (z_j, z_{j+1})$ for some fixed $j < m - 1$. Then thanks to Lemma 4.6(i), the bound $|a_i| \le B/m$ and the fact that $|z_i - z| \ge 1/m$ for $i \neq j, j + 1$ while $z \in (z_j, z_{j+1})$, we have
\[\begin{aligned}
|g_m(z) - g_{\tau,m}(z)| &\le \sum_{i \in \{j, j+1\}}|a_i|\,|\mathrm{ReLU}(z_i - z) - \mathrm{SP}_\tau(z_i - z)| + \sum_{i \notin \{j, j+1\}}|a_i|\,|\mathrm{ReLU}(\cdot) - \mathrm{SP}_\tau(\cdot)|\\
&\le \frac{2B}{m\tau} + \frac{2B}{\tau}\sup_{|x| \ge 1/m}e^{-\tau|x|}.
\end{aligned}\]
Similar bounds hold when $z \in (z_j, z_{j+1})$ for $j > m$, and likewise near the center node $z_m = 0$, where both the $m$-th and the $(m+1)$-th terms in (4.10) and (4.11) are close to their kinks. Therefore we have obtained
\[\|g_m - g_{\tau,m}\|_{L^\infty([-1,1])} \le \frac{2B}{m\tau} + \frac{2B}{\tau}e^{-\tau/m}.\]
Thanks to Lemma 4.6(ii), the same argument carries over to the estimate for the difference of the derivatives and leads to
\[\|g_m' - g_{\tau,m}'\|_{L^\infty([-1,1])} \le \frac{2B}{m} + 2Be^{-\tau/m}.\]
Combining the estimates above with $\|g - g_m\|_{W^{1,\infty}([-1,1])} \le 2B/m$ yields
\[\|g - g_{\tau,m}\|_{W^{1,\infty}([-1,1])} \le \|g - g_m\|_{W^{1,\infty}([-1,1])} + \|g_m - g_{\tau,m}\|_{W^{1,\infty}([-1,1])} \le \frac{4B}{m} + \frac{2B}{m\tau} + 2B\Big(1 + \frac{1}{\tau}\Big)e^{-\tau/m} \le 6B\delta_\tau,\]
where we have used the fact that $\max_{|x| \ge 1/m}e^{-\tau|x|} = e^{-\tau/m}$. □

Proof of Theorem 2.2. First, according to Theorem 4.1, $u - \hat{u}(0)$ lies in the $H^1$-closure of the convex hull of $\mathcal{F}_{\cos}(B)$ with $B = \|u\|_{\mathcal{B}(\Omega)}$. Note that each function in $\mathcal{F}_{\cos}(B)$ is a composition of the multivariate linear function $z = w \cdot x$ with $|w|_1 = 1$ and the univariate function $g(z)$ defined in (4.2), which satisfies $g'(0) = 0$ and $\|g^{(s)}\|_{L^\infty([-1,1])} \le B$ for $s = 0, 1, 2$. By Lemma 4.7, such a $g$ can be approximated by $g_{\tau,m}$, which lies in the convex hull of the set of functions
\[\big\{c + \gamma\,\mathrm{SP}_\tau(\epsilon z - b) \,\big|\, |c| \le B,\ \epsilon \in \{\pm 1\},\ |b| \le 1,\ |\gamma| \le 2B\big\},\]
with $\|g - g_{\tau,m}\|_{W^{1,\infty}([-1,1])} \le 6B\delta_\tau$. As a result, we have
\[\|g(w \cdot x) - g_{\tau,m}(w \cdot x)\|_{H^1(\Omega)} \le \|g - g_{\tau,m}\|_{W^{1,\infty}([-1,1])} \le 6B\delta_\tau.\]
Combined with $|\hat{u}(0)| \le B$, this yields a function $u_\tau$ in the closure of the convex hull of $\mathcal{F}_{\mathrm{SP}_\tau}(B)$ such that
\[(4.12)\quad \|u - u_\tau\|_{H^1(\Omega)} \le 6B\delta_\tau.\]
Thanks to Lemma 4.4 and the bound (4.12), and since the $H^1(\Omega)$-norm of every function in $\mathcal{F}_{\mathrm{SP}_\tau}(B)$ is bounded by $B(14 + 4\ln 2/\tau)$ owing to Lemma 4.6(iii), there exists $u_m \in \mathcal{F}_{\mathrm{SP}_\tau,m}(B)$, a convex combination of $m$ functions in $\mathcal{F}_{\mathrm{SP}_\tau}(B)$, such that
\[\|u_\tau - u_m\|_{H^1(\Omega)} \le \frac{B(14 + 4\ln 2/\tau)}{\sqrt{m}} \le \frac{18B}{\sqrt{m}} \quad \text{for } \tau \ge 1.\]
Combining the last two inequalities leads to
\[\|u - u_m\|_{H^1(\Omega)} \le 6B\delta_\tau + \frac{18B}{\sqrt{m}}.\]
Setting $\tau = \sqrt{m} \ge 1$ and using (4.9), we obtain
\[\|u - u_m\|_{H^1(\Omega)} \le \frac{6B}{\sqrt{m}}\Big(1 + \frac{1}{\sqrt{m}}\Big)\Big(\frac{1}{2}\log m + 1\Big) + \frac{18B}{\sqrt{m}} \le \frac{B(6\log m + 30)}{\sqrt{m}}.\]
This proves the desired estimate. □
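The softplus-ReLU comparison of Lemma 4.6(i), which drives the proof above, is easy to confirm numerically; a quick sketch (the grid and the value of $\tau$ are chosen arbitrarily):

```python
import numpy as np

tau = 25.0
sp = lambda z: np.logaddexp(0.0, tau * z) / tau   # SP_tau(z) = ln(1 + e^{tau z}) / tau
relu = lambda z: np.maximum(z, 0.0)

z = np.linspace(-2.0, 2.0, 400001)                # fine grid containing z = 0
gap = np.abs(relu(z) - sp(z))
bound = np.exp(-tau * np.abs(z)) / tau            # right-hand side of (i)

all_below = bool(np.all(gap <= bound + 1e-12))
max_gap = float(gap.max())                        # attained at z = 0, equals ln(2)/tau
```

The maximal gap $\ln(2)/\tau$ at $z = 0$ shows that the $1/\tau$ scaling in Lemma 4.6(i) is sharp up to the constant.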
5. Rademacher complexities of two-layer neural networks

The goal of this section is to derive Rademacher complexity bounds for some two-layer neural-network function classes that are relevant to the Ritz losses of the Poisson and the static Schrödinger equations. These bounds will be essential for obtaining the generalization bounds in Theorem 2.3 and Theorem 2.4.

First let us consider, for fixed positive constants $C, \Gamma, W$ and $T$, the set of two-layer neural networks
\[(5.1)\quad \mathcal{F}_m = \Big\{u_\theta(x) = c + \sum_{i=1}^m \gamma_i\,\phi(w_i \cdot x + t_i),\ x \in \Omega,\ \theta \in \Theta \,\Big|\, |c| \le C,\ \sum_{i=1}^m|\gamma_i| \le \Gamma,\ \|w_i\|_2 \le W,\ |t_i| \le T\Big\}.\]
Here $\phi$ is the activation function, $\theta = (c, \{\gamma_i\}_{i=1}^m, \{w_i\}_{i=1}^m, \{t_i\}_{i=1}^m)$ denotes collectively the parameters of the two-layer neural network, and
\[\Theta = \Theta_c \times \Theta_\gamma \times \Theta_w \times \Theta_t = [-C, C] \times B_1^m(\Gamma) \times \big(B_2^d(W)\big)^m \times [-T, T]^m\]
represents the parameter space. We shall consider the set $\Theta$ endowed with the metric $\rho_\Theta$ defined for $\theta = (c, \gamma, w, t)$, $\theta' = (c', \gamma', w', t') \in \Theta$ by
\[(5.2)\quad \rho_\Theta(\theta, \theta') = \max\big\{|c - c'|,\ \|\gamma - \gamma'\|_1,\ \max_i\|w_i - w_i'\|_2,\ \|t - t'\|_\infty\big\}.\]
Throughout this section we assume that $\phi$ satisfies the following assumption, which in particular holds for the Softplus activation function.

Assumption 5.1. $\phi \in C^1(\mathbb{R})$, and $\phi$ (resp. $\phi'$, the derivative of $\phi$) is $L$-Lipschitz (resp. $L'$-Lipschitz) for some $L, L' > 0$. Moreover, there exist positive constants $\phi_{\max}$ and $\phi'_{\max}$ such that
\[\sup_{w \in \Theta_w,\, t \in \Theta_t,\, x \in \Omega}|\phi(w \cdot x + t)| \le \phi_{\max} \quad \text{and} \quad \sup_{w \in \Theta_w,\, t \in \Theta_t,\, x \in \Omega}|\phi'(w \cdot x + t)| \le \phi'_{\max}.\]
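For the Softplus activation $\mathrm{SP}_\tau$, the Lipschitz constants in Assumption 5.1 can be estimated numerically: $\mathrm{SP}_\tau$ is $1$-Lipschitz and its derivative (a sigmoid) is $(\tau/4)$-Lipschitz. A minimal check (the sample grid and $\tau$ are chosen arbitrarily):

```python
import numpy as np

tau = 4.0
sp  = lambda z: np.logaddexp(0.0, tau * z) / tau   # SP_tau
sp1 = lambda z: 1.0 / (1.0 + np.exp(-tau * z))     # SP_tau' (a sigmoid)

z = np.linspace(-6.0, 6.0, 100001)
dz = z[1] - z[0]
lip_sp  = float(np.max(np.abs(np.diff(sp(z)))) / dz)   # empirical L,  should be <= 1
lip_sp1 = float(np.max(np.abs(np.diff(sp1(z)))) / dz)  # empirical L', should be <= tau/4
```

The difference quotients never exceed the analytic bounds $L = 1$ and $L' = \tau/4$ and approach them on the grid, consistent with the claim that Softplus satisfies Assumption 5.1.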
Recall that the Rademacher complexity of a function class $\mathcal{G}$ is defined by
\[R_n(\mathcal{G}) = \mathbb{E}_Z\mathbb{E}_\sigma\Big[\sup_{g \in \mathcal{G}}\Big|\frac{1}{n}\sum_{j=1}^n\sigma_j g(Z_j)\Big| \,\Big|\, Z_1, \cdots, Z_n\Big].\]
In the subsequent proofs it will be useful to work with the following modified Rademacher complexity $\tilde{R}_n(\mathcal{G})$, without the absolute value sign:
\[\tilde{R}_n(\mathcal{G}) = \mathbb{E}_Z\mathbb{E}_\sigma\Big[\sup_{g \in \mathcal{G}}\frac{1}{n}\sum_{j=1}^n\sigma_j g(Z_j) \,\Big|\, Z_1, \cdots, Z_n\Big].\]
The lemma below bounds the Rademacher complexity of $\mathcal{F}_m$.

Lemma 5.1. Assume that the activation function $\phi$ is $L$-Lipschitz. Then
\[R_n(\mathcal{F}_m) \le \frac{4L\Gamma(W\sqrt{d} + T) + 2\Gamma|\phi(0)|}{\sqrt{n}}.\]

Proof. Let $\bar{\phi}(x) = \phi(x) - \phi(0)$. First observe that
\[\begin{aligned}
\mathbb{E}_\sigma\Big[\sup_{f \in \mathcal{F}_m}\frac{1}{n}\sum_{j=1}^n\sigma_j f(Z_j) \,\Big|\, Z_1, \cdots, Z_n\Big] &= \mathbb{E}_\sigma\Big[\sup_\Theta\frac{1}{n}\sum_{j=1}^n\sigma_j\Big(c + \sum_{i=1}^m\gamma_i\,\phi(w_i \cdot Z_j + t_i)\Big) \,\Big|\, Z_1, \cdots, Z_n\Big]\\
&\le \frac{1}{n}\mathbb{E}_\sigma\Big[\sup_\Theta\sum_{i=1}^m\gamma_i\sum_{j=1}^n\sigma_j\bar{\phi}(w_i \cdot Z_j + t_i) \,\Big|\, Z_1, \cdots, Z_n\Big] + \frac{1}{n}\mathbb{E}_\sigma\Big[\sup_\Theta\sum_{i=1}^m\gamma_i\sum_{j=1}^n\sigma_j\,\phi(0)\Big]\\
&=: J_1 + J_2.
\end{aligned}\]
Using the fact that $\bar{\phi}(\cdot) = \phi(\cdot) - \phi(0)$ is $L$-Lipschitz with $\bar{\phi}(0) = 0$, one has
\[\begin{aligned}
J_1 &\le \frac{\Gamma}{n}\,\mathbb{E}_\sigma\Big[\sup_{\|w\|_2 \le W,\,|t| \le T}\Big|\sum_{j=1}^n\sigma_j\bar{\phi}(w \cdot Z_j + t)\Big| \,\Big|\, Z_1, \cdots, Z_n\Big]\\
&\le \frac{2L\Gamma}{n}\Big(\mathbb{E}_\sigma\Big[\sup_{\|w\|_2 \le W}\Big|\sum_{j=1}^n\sigma_j\,w \cdot Z_j\Big| \,\Big|\, Z_1, \cdots, Z_n\Big] + \mathbb{E}_\sigma\Big[\sup_{|t| \le T}\Big|\sum_{j=1}^n\sigma_j t\Big|\Big]\Big)\\
&\le \frac{2L\Gamma}{n}\Big(W \cdot \mathbb{E}_\sigma\Big\|\sum_{j=1}^n\sigma_j Z_j\Big\|_2 + T\,\mathbb{E}_\sigma\Big|\sum_{j=1}^n\sigma_j\Big|\Big)\\
&\le \frac{2L\Gamma}{n}\Big(W\sqrt{\sum_{j=1}^n\|Z_j\|_2^2} + T\sqrt{\mathbb{E}_\sigma\Big(\sum_{j=1}^n\sigma_j\Big)^2}\Big) \le \frac{2L\Gamma(W\sqrt{d} + T)}{\sqrt{n}}.
\end{aligned}\]
Note that in the second inequality we have used Talagrand's contraction principle (Lemma 5.2 below). Moreover, since $\sum_{i=1}^m|\gamma_i| \le \Gamma$, it is easy to see that
\[J_2 \le \frac{\Gamma|\phi(0)|}{n}\,\mathbb{E}_\sigma\Big|\sum_{j=1}^n\sigma_j\Big| \le \frac{\Gamma|\phi(0)|}{n}\sqrt{\mathbb{E}_\sigma\Big(\sum_{j=1}^n\sigma_j\Big)^2} = \frac{\Gamma|\phi(0)|}{\sqrt{n}}.\]
Combining the estimates above and taking the expectation with respect to $\{Z_j\}$ yields $\tilde{R}_n(\mathcal{F}_m) \le \frac{2L\Gamma(W\sqrt{d} + T) + \Gamma|\phi(0)|}{\sqrt{n}}$. This combined with Lemma 5.3 below leads to the desired estimate. □

Lemma 5.2 (Ledoux-Talagrand contraction [25, Theorem 4.12]). Assume that $\phi : \mathbb{R} \to \mathbb{R}$ is $L$-Lipschitz with $\phi(0) = 0$. Let $\{\sigma_i\}_{i=1}^n$ be independent Rademacher random variables.
Then for any $\mathcal{T} \subset \mathbb{R}^n$,
\[\mathbb{E}_\sigma\sup_{(t_1, \cdots, t_n) \in \mathcal{T}}\Big|\sum_{i=1}^n\sigma_i\,\phi(t_i)\Big| \le 2L \cdot \mathbb{E}_\sigma\sup_{(t_1, \cdots, t_n) \in \mathcal{T}}\Big|\sum_{i=1}^n\sigma_i t_i\Big|.\]

Lemma 5.3. [27, Lemma 1] Assume that the set of functions $\mathcal{G}$ contains the zero function. Then $R_n(\mathcal{G}) \le 2\tilde{R}_n(\mathcal{G})$.

Recall the sets of two-layer neural networks $\mathcal{F}_{\mathrm{ReLU},m}(B)$ and $\mathcal{F}_{\mathrm{SP}_\tau,m}(B)$ defined by (2.12) and (2.13), respectively. Since both ReLU and $\mathrm{SP}_\tau$ are $1$-Lipschitz, and $\mathrm{ReLU}(0) = 0$, $\mathrm{SP}_\tau(0) = \frac{\ln 2}{\tau}$, the following corollary is a direct consequence of Lemma 5.1 (applied with $L = 1$, $W = T = 1$ and $\Gamma = 4B$).

Corollary 5.1.
\[R_n(\mathcal{F}_{\mathrm{ReLU},m}(B)) \le \frac{16(\sqrt{d} + 1)B}{\sqrt{n}} \quad \text{and} \quad R_n(\mathcal{F}_{\mathrm{SP}_\tau,m}(B)) \le \frac{\big(16(\sqrt{d} + 1) + \frac{8\ln 2}{\tau}\big)B}{\sqrt{n}}.\]

Given the source function $f \in L^\infty(\Omega)$ and the potential $V \in L^\infty(\Omega)$, we recall the function classes associated to the Ritz losses of the Poisson equation and the static Schrödinger equation:
\[(5.3)\quad \mathcal{G}_{m,P} := \Big\{g : \Omega \to \mathbb{R} \,\Big|\, g = \frac{1}{2}|\nabla u|^2 - fu \text{ where } u \in \mathcal{F}_m\Big\}, \qquad \mathcal{G}_{m,S} := \Big\{g : \Omega \to \mathbb{R} \,\Big|\, g = \frac{1}{2}|\nabla u|^2 + \frac{1}{2}V|u|^2 - fu \text{ where } u \in \mathcal{F}_m\Big\}.\]
In the sequel we aim to bound the Rademacher complexities of $\mathcal{G}_{m,P}$ and $\mathcal{G}_{m,S}$ defined above. This will be achieved by bounding the Rademacher complexities of the following function classes:
\[\mathcal{G}_m^1 := \Big\{g = \frac{1}{2}|\nabla u|^2 \,\Big|\, u \in \mathcal{F}_m\Big\}, \qquad \mathcal{G}_m^2 := \big\{g = fu \,\big|\, u \in \mathcal{F}_m\big\}, \qquad \mathcal{G}_m^3 := \Big\{g = \frac{1}{2}V|u|^2 \,\Big|\, u \in \mathcal{F}_m\Big\}.\]
The celebrated Dudley theorem will be used to bound the Rademacher complexity in terms of the metric entropy. To this end, let us first recall the metric entropy and Dudley's theorem. Let $(E, \rho)$ be a metric space with metric $\rho$. A $\delta$-cover of a set $A \subset E$ with respect to $\rho$ is a collection of points $\{x_1, \cdots, x_n\} \subset A$ such that for every $x \in A$ there exists $i \in \{1, \cdots, n\}$ with $\rho(x, x_i) \le \delta$.
The $\delta$-covering number $N(\delta, A, \rho)$ is the cardinality of the smallest $\delta$-cover of the set $A$ with respect to the metric $\rho$. Equivalently, $N(\delta, A, \rho)$ is the minimal number of balls $B_\rho(x, \delta)$ of radius $\delta$ needed to cover the set $A$.

Theorem 5.1 (Dudley's theorem). Let $\mathcal{F}$ be a function class such that $\sup_{f \in \mathcal{F}}\|f\|_\infty \le M$. Then the Rademacher complexity $R_n(\mathcal{F})$ satisfies
\[R_n(\mathcal{F}) \le \inf_{0 \le \delta \le M}\Big\{4\delta + \frac{12}{\sqrt{n}}\int_\delta^M\sqrt{\log N(\varepsilon, \mathcal{F}, \|\cdot\|_\infty)}\,d\varepsilon\Big\}.\]

Note that our statement of Dudley's theorem is slightly different from the standard one, where the covering number is based on the empirical $\ell^2$-metric instead of the $L^\infty$-metric above. However, since the $L^\infty$-metric is stronger than the empirical $\ell^2$-metric, and since the covering number is monotonically increasing with respect to the metric, Theorem 5.1 follows directly from the classical Dudley theorem (see e.g. [42, Theorem 1.19]).

Let us now state an elementary lemma on the covering number of product spaces.

Lemma 5.4. Let $(E_i, \rho_i)$ be metric spaces and let $A_i \subset E_i$, $i = 1, \cdots, n$. Consider the product space $E = \times_{i=1}^n E_i$ equipped with the metric $\rho = \max_i \rho_i$, and the set $A = \times_{i=1}^n A_i$. Then for any $\delta > 0$,
\[(5.4)\quad N(\delta, A, \rho) \le \prod_{i=1}^n N(\delta, A_i, \rho_i).\]

Proof. It suffices to prove the lemma in the case $n = 2$, i.e.,
\[(5.5)\quad N(\delta, A_1 \times A_2, \rho) \le N(\delta, A_1, \rho_1) \cdot N(\delta, A_2, \rho_2).\]
Indeed, suppose that $\mathcal{C}_1$ and $\mathcal{C}_2$ are $\delta$-covers of $A_1$ and $A_2$, respectively. Then it is straightforward to check that the product set $\mathcal{C}_1 \times \mathcal{C}_2$ is a $\delta$-cover of $A_1 \times A_2$ in the space $(E_1 \times E_2, \rho)$ with $\rho = \max(\rho_1, \rho_2)$. Hence $N(\delta, A_1 \times A_2, \rho) \le \mathrm{card}(\mathcal{C}_1) \cdot \mathrm{card}(\mathcal{C}_2)$. Applying this inequality to covers $\mathcal{C}_i$ with $\mathrm{card}(\mathcal{C}_i) = N(\delta, A_i, \rho_i)$, $i = 1, 2$, we obtain (5.5). The general inequality (5.4) follows by iterating (5.5). □
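Lemma 5.4's product construction is concrete enough to execute directly: the sketch below builds explicit $\delta$-covers of two intervals and takes their Cartesian product, which covers the product set in the max-metric (all choices of sets and $\delta$ are illustrative):

```python
import math

def cover_interval(C, delta):
    """Centers of a delta-cover of [-C, C] using ceil(C/delta) intervals of radius delta."""
    n = max(1, math.ceil(C / delta))
    return [-C + (2 * i + 1) * C / n for i in range(n)]

delta = 0.1
c1 = cover_interval(1.0, delta)                   # N(delta, [-1,1]) <= 10
c2 = cover_interval(2.0, delta)                   # N(delta, [-2,2]) <= 20
product_cover = [(x, y) for x in c1 for y in c2]  # covers [-1,1] x [-2,2] in the max-metric

# spot-check the covering property on a coarse grid of test points
covered = all(
    min(max(abs(x - px), abs(y - py)) for (px, py) in product_cover) <= delta + 1e-12
    for x in [i / 5.0 - 1.0 for i in range(11)]
    for y in [j / 5.0 * 2.0 - 2.0 for j in range(11)]
)
```

The cardinality of the product cover is exactly the product of the two cardinalities, which is the content of (5.5).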
□

As a consequence of Lemma 5.4, the following proposition gives an upper bound for the covering number $N(\delta,\Theta,\rho_\Theta)$.

Proposition 5.1. Consider the metric space $(\Theta,\rho_\Theta)$ with $\rho_\Theta$ defined in (5.2). Then for any $\delta>0$, the covering number $N(\delta,\Theta,\rho_\Theta)$ satisfies
$$N(\delta,\Theta,\rho_\Theta) \le \frac{2C}{\delta}\cdot\Big(\frac{3\Gamma}{\delta}\Big)^m\cdot\Big(\frac{3W}{\delta}\Big)^{dm}\cdot\Big(\frac{3T}{\delta}\Big)^m.$$

Proof. Thanks to Lemma 5.4,
$$N(\delta,\Theta,\rho_\Theta) \le N(\delta,\Theta_c,|\cdot|)\cdot N(\delta,\Theta_\gamma,\|\cdot\|_1)\cdot\big(N(\delta,B^d_1(W),\|\cdot\|_1)\big)^m\cdot N(\delta,\Theta_t,\|\cdot\|_\infty) \le \frac{2C}{\delta}\cdot\Big(\frac{3\Gamma}{\delta}\Big)^m\cdot\Big(\frac{3W}{\delta}\Big)^{dm}\cdot\Big(\frac{3T}{\delta}\Big)^m,$$
where in the last inequality we have used the fact that the covering number of a $d$-dimensional $\ell^p$-ball of radius $r$ satisfies $N(\delta,B^d_p(r),\|\cdot\|_p) \le \big(\frac{3r}{\delta}\big)^d$ for $0<\delta\le r$. □

Bounding $\mathcal{R}_n(\mathcal{G}^1_m)$. We would like to bound $\mathcal{R}_n(\mathcal{G}^1_m)$ from above using metric entropy. To this end, let us first bound the covering number $N(\delta,\mathcal{G}^1_m,\|\cdot\|_\infty)$. Recall the parameters $C,\Gamma,W$ and $T$ in (5.1). With those parameters fixed, to simplify expressions, we introduce the following functions to be used in the sequel:
$$(5.6)\qquad \mathcal{M}(\delta,\Lambda,m,d) := \frac{2C\Lambda}{\delta}\cdot\Big(\frac{3\Gamma\Lambda}{\delta}\Big)^m\cdot\Big(\frac{3W\Lambda}{\delta}\Big)^{dm}\cdot\Big(\frac{3T\Lambda}{\delta}\Big)^m,$$
$$(5.7)\qquad Z(M,\Lambda,d) := 12\Big[ M\Big(\sqrt{(\log(2C\Lambda))_+} + \sqrt{(\log(3\Gamma\Lambda)+d\log(3W\Lambda)+\log(3T\Lambda))_+}\Big) + (\sqrt{d}+3)\int_0^M \sqrt{(\log(1/\varepsilon))_+}\, d\varepsilon \Big].$$

Lemma 5.5. Let the activation function $\phi$ satisfy Assumption 5.1. Then we have
$$(5.8)\qquad N(\delta,\mathcal{G}^1_m,\|\cdot\|_\infty) \le \mathcal{M}(\delta,\Lambda_1,m,d),$$
where the constant $\Lambda_1$ is defined by
$$(5.9)\qquad \Lambda_1 = \Big((W+\Gamma)\phi'_{\max} + \Gamma W L'(\sqrt{d}+1)\Big)\,\Gamma W \phi'_{\max}.$$

Proof. Thanks to Assumption 5.1, $\sup_{\theta\in\Theta}|\phi'(w\cdot x+t)| \le \phi'_{\max}$.
This implies that
$$\sup_{x\in\Omega}|\nabla u_\theta(x)| \le \sum_{i=1}^m |\gamma_i|\,\|w_i\|\,|\phi'(w_i\cdot x+t_i)| \le \Gamma W \phi'_{\max}.$$
Furthermore, for $\theta,\theta'\in\Theta$, by adding and subtracting terms, we have that
$$|\nabla u_\theta(x)-\nabla u_{\theta'}(x)| \le \sum_{i=1}^m |\gamma_i-\gamma'_i|\,\|w_i\|\,|\phi'(w_i\cdot x+t_i)| + \sum_{i=1}^m |\gamma'_i|\,\|w_i-w'_i\|\,|\phi'(w_i\cdot x+t_i)| + \sum_{i=1}^m |\gamma'_i|\,\|w'_i\|\,|\phi'(w_i\cdot x+t_i)-\phi'(w'_i\cdot x+t'_i)|$$
$$\le W\phi'_{\max}\|\gamma-\gamma'\|_1 + \Gamma\phi'_{\max}\max_i\|w_i-w'_i\| + \Gamma W L'\big(\sqrt{d}\max_i\|w_i-w'_i\| + \|t-t'\|_\infty\big) \le \Big((W+\Gamma)\phi'_{\max} + \Gamma W L'(\sqrt{d}+1)\Big)\rho_\Theta(\theta,\theta').$$
Combining the last two estimates yields that
$$\tfrac12\big| |\nabla u_\theta(x)|^2 - |\nabla u_{\theta'}(x)|^2 \big| \le \tfrac12\big|\nabla u_\theta(x)+\nabla u_{\theta'}(x)\big|\,\big|\nabla u_\theta(x)-\nabla u_{\theta'}(x)\big| \le \Lambda_1\,\rho_\Theta(\theta,\theta').$$
This particularly implies that $N(\delta,\mathcal{G}^1_m,\|\cdot\|_\infty) \le N(\delta/\Lambda_1,\Theta,\rho_\Theta)$. Then the estimate (5.8) follows from Proposition 5.1 with $\delta$ replaced by $\delta/\Lambda_1$. □

Proposition 5.2. Assume that the activation function $\phi$ satisfies Assumption 5.1. Then
$$\mathcal{R}_n(\mathcal{G}^1_m) \le Z(M_1,\Lambda_1,d)\cdot\sqrt{\frac{m}{n}},$$
where $M_1 = \tfrac12\Gamma^2W^2(\phi'_{\max})^2$ and $\Lambda_1$ is defined in (5.9).

Proof. Thanks to Assumption 5.1,
$$\sup_{g\in\mathcal{G}^1_m}\|g\|_{L^\infty(\Omega)} \le \tfrac12\sup_{u\in\mathcal{F}_m}\|\nabla u\|^2_{L^\infty(\Omega)} \le \tfrac12\Gamma^2W^2(\phi'_{\max})^2.$$
Then the proposition follows from Lemma 5.5, Theorem 5.1 with $\delta=0$ and $M=M_1=\tfrac12\Gamma^2W^2(\phi'_{\max})^2$, and the simple fact that $\sqrt{a+b}\le\sqrt{a}+\sqrt{b}$ for $a,b\ge 0$. □

Bounding $\mathcal{R}_n(\mathcal{G}^2_m)$.
The next lemma provides an upper bound for $N(\delta,\mathcal{G}^2_m,\|\cdot\|_\infty)$.

Lemma 5.6. Assume that $\|f\|_{L^\infty(\Omega)} \le F$ for some $F>0$. Assume that the activation function $\phi$ satisfies Assumption 5.1. Then the covering number $N(\delta,\mathcal{G}^2_m,\|\cdot\|_\infty)$ satisfies
$$N(\delta,\mathcal{G}^2_m,\|\cdot\|_\infty) \le \mathcal{M}(\delta,\Lambda_2,m,d).$$
Here the constant $\Lambda_2$ is defined by
$$(5.10)\qquad \Lambda_2 = F\big(1+\phi_{\max} + L\Gamma(1+\sqrt{d})\big).$$

Proof. Note that a function $g_\theta\in\mathcal{G}^2_m$ has the form $g_\theta = f u_\theta$. Given $\theta=(c,\gamma,w,t)$, $\theta'=(c',\gamma',w',t')\in\Theta$, we have
$$(5.11)\qquad |u_\theta(x)-u_{\theta'}(x)| \le |c-c'| + \Big|\sum_{i=1}^m \gamma_i\phi(w_i\cdot x+t_i) - \sum_{i=1}^m \gamma'_i\phi(w'_i\cdot x+t'_i)\Big| \le |c-c'| + \sum_{i=1}^m |\gamma_i-\gamma'_i|\,|\phi(w_i\cdot x+t_i)| + \sum_{i=1}^m |\gamma'_i|\,|\phi(w_i\cdot x+t_i)-\phi(w'_i\cdot x+t'_i)|.$$
Since $\phi$ satisfies Assumption 5.1, we have that $|\phi(w_i\cdot x+t_i)| \le \phi_{\max}$ and that
$$|\phi(w_i\cdot x+t_i)-\phi(w'_i\cdot x+t'_i)| \le L\big(\sqrt{d}\,\|w_i-w'_i\| + |t_i-t'_i|\big).$$
Therefore, it follows from (5.11) that
$$(5.12)\qquad |u_\theta(x)-u_{\theta'}(x)| \le |c-c'| + \phi_{\max}\|\gamma-\gamma'\|_1 + L\Gamma\big(\sqrt{d}\max_i\|w_i-w'_i\| + \|t-t'\|_\infty\big) \le \big(1+\phi_{\max}+L\Gamma(1+\sqrt{d})\big)\rho_\Theta(\theta,\theta').$$
This implies that $\|g_\theta-g_{\theta'}\|_\infty \le F\big(1+\phi_{\max}+L\Gamma(1+\sqrt{d})\big)\rho_\Theta(\theta,\theta') = \Lambda_2\,\rho_\Theta(\theta,\theta')$. As a consequence, $N(\delta,\mathcal{G}^2_m,\|\cdot\|_\infty) \le N(\delta/\Lambda_2,\Theta,\rho_\Theta)$. Then the lemma follows from Proposition 5.1 with $\delta$ replaced by $\delta/\Lambda_2$. □

Proposition 5.3. Assume that $\|f\|_{L^\infty(\Omega)} \le F$ for some $F>0$. Assume that the activation function $\phi$ satisfies Assumption 5.1.
Then
$$\mathcal{R}_n(\mathcal{G}^2_m) \le Z(M_2,\Lambda_2,d)\cdot\sqrt{\frac{m}{n}},$$
where $M_2 = F(C+\Gamma\phi_{\max})$ and $\Lambda_2$ is defined in (5.10).

Proof. From the definition of $\mathcal{G}^2_m$ and the assumption that $\|f\|_{L^\infty(\Omega)}\le F$, one has that $\sup_{g\in\mathcal{G}^2_m}\|g\|_{L^\infty(\Omega)} \le M_2 = F(C+\Gamma\phi_{\max})$. Then the proposition is proved by an application of Theorem 5.1 with $\delta=0$, $M=M_2$ and Lemma 5.6. □

Bounding $\mathcal{R}_n(\mathcal{G}^3_m)$. The lemma below gives an upper bound for $N(\delta,\mathcal{G}^3_m,\|\cdot\|_\infty)$.

Lemma 5.7. Assume that $\|V\|_{L^\infty(\Omega)} \le V_{\max}$ for some $V_{\max}<\infty$. Assume that the activation function $\phi$ satisfies Assumption 5.1. Then the covering number $N(\delta,\mathcal{G}^3_m,\|\cdot\|_\infty)$ satisfies
$$(5.13)\qquad N(\delta,\mathcal{G}^3_m,\|\cdot\|_\infty) \le \mathcal{M}(\delta,\Lambda_3,m,d),$$
where the constant $\Lambda_3$ is defined by
$$(5.14)\qquad \Lambda_3 = V_{\max}(C+\Gamma\phi_{\max})\big(1+\phi_{\max}+L\Gamma(1+\sqrt{d})\big).$$

Proof. By the definition of $\mathcal{F}_m$ and Assumption 5.1 on $\phi$, $\sup_{u\in\mathcal{F}_m}\|u\|_{L^\infty(\Omega)} \le C+\Gamma\phi_{\max}$. Moreover, recall from (5.12) that for $\theta,\theta'\in\Theta$,
$$|u_\theta(x)-u_{\theta'}(x)| \le \big(1+\phi_{\max}+L\Gamma(1+\sqrt{d})\big)\rho_\Theta(\theta,\theta').$$
Consequently,
$$\tfrac12\big|V(x)u_\theta(x)^2 - V(x)u_{\theta'}(x)^2\big| \le \tfrac12|V(x)|\,|u_\theta(x)+u_{\theta'}(x)|\,|u_\theta(x)-u_{\theta'}(x)| \le \Lambda_3\,\rho_\Theta(\theta,\theta').$$
The estimate (5.13) follows from the same line of arguments used in the proof of Lemma 5.6. □

Proposition 5.4. Under the same assumptions as Lemma 5.7, $\mathcal{G}^3_m$ satisfies
$$\mathcal{R}_n(\mathcal{G}^3_m) \le Z(M_3,\Lambda_3,d)\cdot\sqrt{\frac{m}{n}},$$
where $M_3 = \tfrac12 V_{\max}(C+\Gamma\phi_{\max})^2$ and $\Lambda_3$ is defined in (5.14).

Proof. Note that $\sup_{g\in\mathcal{G}^3_m}\|g\|_{L^\infty(\Omega)} \le M_3 = \tfrac12 V_{\max}(C+\Gamma\phi_{\max})^2$. Then the proposition follows from Theorem 5.1 with $\delta=0$, $M=M_3$ and Lemma 5.7. □

The following corollary is a direct consequence of Propositions 5.2–5.4.
Corollary 5.2. The two sets of functions $\mathcal{G}_{m,P}$ and $\mathcal{G}_{m,S}$ defined in (5.3) satisfy
$$\mathcal{R}_n(\mathcal{G}_{m,P}) \le \big(Z(M_1,\Lambda_1,d) + Z(M_2,\Lambda_2,d)\big)\cdot\sqrt{\frac{m}{n}} \quad\text{and}\quad \mathcal{R}_n(\mathcal{G}_{m,S}) \le \sum_{i=1}^3 Z(M_i,\Lambda_i,d)\cdot\sqrt{\frac{m}{n}}.$$

Considering the set of two-layer neural networks $\mathcal{F}_{\mathrm{SP}_\tau,m}(B)$ defined in (2.13) with $\tau=\sqrt{m}$, we define the following associated sets of functions:
$$\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B) := \{ g:\Omega\to\mathbb{R} \mid g=\tfrac12|\nabla u|^2 - fu \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B) \},$$
$$\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B) := \{ g:\Omega\to\mathbb{R} \mid g=\tfrac12|\nabla u|^2 + \tfrac12 V|u|^2 - fu \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B) \},$$
$$\mathcal{G}^1_{\tau,m}(B) := \{ g \mid g=\tfrac12|\nabla u|^2 \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B) \},\quad \mathcal{G}^2_{\tau,m}(B) := \{ g \mid g=fu \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B) \},\quad \mathcal{G}^3_{\tau,m}(B) := \{ g \mid g=\tfrac12 V|u|^2 \text{ where } u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B) \}.$$

Corollary 5.2 allows us to bound the Rademacher complexities of $\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B)$ and $\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B)$. Indeed, from the definition of the activation function $\mathrm{SP}_\tau$, we know that $\|\mathrm{SP}'_\tau\|_{L^\infty(\mathbb{R})} \le 1$ and $\|\mathrm{SP}''_\tau\|_{L^\infty(\mathbb{R})} \le \tau = \sqrt{m}$, so $\mathrm{SP}_\tau$ satisfies Assumption 5.1 with
$$L = \phi'_{\max} = 1,\qquad L' = \tau = \sqrt{m},\qquad \phi_{\max} \le 2+\frac{\ln 2}{\sqrt{m}} \le 3.$$
Note also that $\mathcal{F}_{\mathrm{SP}_\tau,m}(B)$ coincides with the set $\mathcal{F}_m$ defined in (5.1) with the following parameters:
$$(5.15)\qquad C=2B,\quad \Gamma=4B,\quad W=1,\quad T=1.$$
With the parameters above, one has that
$$M_1 = 8B^2,\qquad \Lambda_1 \le 16B^2(\sqrt{d}+1)\sqrt{m} + 4B(1+4B),$$
$$M_2 \le 14FB,\qquad \Lambda_2 \le F\big(5+4B(1+\sqrt{d})\big),$$
$$M_3 \le 98V_{\max}B^2,\qquad \Lambda_3 \le 14V_{\max}B\big(5+4B(1+\sqrt{d})\big).$$
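As a numerical aside (not part of the proof), the entropy integral appearing in the definition (5.7) of $Z$ is finite: for $M=1$ it evaluates in closed form to $\Gamma(3/2)=\sqrt{\pi}/2$, which a midpoint-rule quadrature confirms.

```python
import math

def entropy_integral(M, n=200_000):
    """Midpoint-rule quadrature for the entropy integral in (5.7):
    the integral of sqrt((log(1/eps))_+) over eps in (0, M]."""
    h = M / n
    total = 0.0
    for i in range(n):
        eps = (i + 0.5) * h          # midpoints; the integrand is integrable at 0
        val = math.log(1.0 / eps)
        total += math.sqrt(val) if val > 0 else 0.0
    return total * h

I1 = entropy_integral(1.0)
# For M = 1 the closed form is Gamma(3/2) = sqrt(pi)/2 ~ 0.886.
assert abs(I1 - math.sqrt(math.pi) / 2) < 1e-3
print(I1)
```

The closed form follows from the substitution $\varepsilon=e^{-s}$, which turns the integral into $\int_0^\infty \sqrt{s}\,e^{-s}\,ds=\Gamma(3/2)$.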
Inserting $M_i$ and $\Lambda_i$, $i=1,2,3$, into (5.7), one can obtain by a straightforward calculation that there exist positive constants $C_1(B,d)$, $C_2(B,d,F)$ and $C_3(B,d,V_{\max})$, depending on the parameters $B,d,F,V_{\max}$ polynomially, such that
$$Z(M_1,\Lambda_1,d) \le C_1(B,d)\sqrt{\log m},\qquad Z(M_2,\Lambda_2,d) \le C_2(B,d,F),\qquad Z(M_3,\Lambda_3,d) \le C_3(B,d,V_{\max}).$$
Combining the estimates above with Corollary 5.2 gives directly the Rademacher complexity bounds for $\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B)$ and $\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B)$, as summarized in the following theorem.

Theorem 5.2. Assume that $\|f\|_{L^\infty(\Omega)} \le F$ and $\|V\|_{L^\infty(\Omega)} \le V_{\max}$. Consider the sets $\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B)$ and $\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B)$ with $\tau=\sqrt{m}$. Then there exist positive constants $C_P(B,d,F)$ and $C_S(B,d,F,V_{\max})$, depending polynomially on $B,d,F,V_{\max}$, such that
$$\mathcal{R}_n(\mathcal{G}_{\mathrm{SP}_\tau,m,P}(B)) \le \frac{C_P(B,d,F)\sqrt{m}(\sqrt{\log m}+1)}{\sqrt{n}},\qquad \mathcal{R}_n(\mathcal{G}_{\mathrm{SP}_\tau,m,S}(B)) \le \frac{C_S(B,d,F,V_{\max})\sqrt{m}(\sqrt{\log m}+1)}{\sqrt{n}}.$$

6. Proofs of Theorem 2.3 and Theorem 2.4

With the approximation estimates for spectral Barron functions and the complexity estimates of the two-layer neural networks proved in previous sections, we are ready to prove Theorem 2.3 and Theorem 2.4, which establish the a priori generalization error bounds of the DRM.

Proof of Theorem 2.3. Recall that $u^m_{n,P}$ is the minimizer of the empirical loss $\mathcal{E}_{n,P}$ in the set $\mathcal{F}=\mathcal{F}_{\mathrm{SP}_\tau,m}(B)$ with $\tau=\sqrt{m}$, where $B=\|u^*_P\|_{\mathcal{B}^2(\Omega)}$. From the definition of $\mathcal{F}_{\mathrm{SP}_\tau,m}(B)$, one can obtain that $\sup_{u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B)}\|u\|_{L^\infty(\Omega)} \le 14B$.
Then it follows from Theorem 3.1, Theorem 5.2, Theorem 2.2 and Corollary 5.1 that
$$\mathbb{E}\big[\mathcal{E}_P(u^m_{n,P}) - \mathcal{E}_P(u^*_P)\big] \le 4\mathcal{R}_n(\mathcal{G}_{\mathrm{SP}_\tau,m,P}) + 4\sup_{u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B)}\|u\|_{L^\infty(\Omega)}\cdot\mathcal{R}_n(\mathcal{F}_{\mathrm{SP}_\tau,m}) + \frac12\inf_{u\in\mathcal{F}_{\mathrm{SP}_\tau,m}(B)}\|u-u^*_P\|^2_{H^1(\Omega)}$$
$$\le \frac{C_P(B,d,F)\sqrt{m}(\sqrt{\log m}+1)}{\sqrt{n}} + \frac{4\cdot 14B\cdot CB\big(1+\frac{\ln 2}{\sqrt{m}}\big)}{\sqrt{n}} + \frac{B^2(6\log m+30)}{m} \le \frac{C_1\sqrt{m}(\sqrt{\log m}+1)}{\sqrt{n}} + \frac{C_2(\log m+1)}{m},$$
where the constant $C_1$ depends polynomially on $B$, $d$ and $F$, and $C_2$ depends only quadratically on $B$. □

Proof of Theorem 2.4. The proof is almost identical to that of Theorem 2.3 and follows directly from Theorem 3.2, Theorem 5.2, Theorem 2.2 and Corollary 5.1. Hence we omit the details. □

7. Solution theory of Poisson and static Schrödinger equations in spectral Barron spaces

In Theorems 2.3 and 2.4, we have established the generalization error bounds of the DRM for the Poisson equation and the static Schrödinger equation under the assumption that the exact solutions lie in the spectral Barron space $\mathcal{B}^2(\Omega)$. This section aims to justify this assumption by proving complexity estimates of solutions in the spectral Barron space, as shown in Theorem 2.5 and Theorem 2.6. This can be viewed as a regularity analysis of high dimensional PDEs in the spectral Barron space.

7.1. Proof of Theorem 2.5. Suppose that $f=\sum_{k\in\mathbb{N}^d}\hat f_k\Phi_k$ and that $f$ has vanishing mean value on $\Omega$, so that $\hat f_0=0$. Let $\hat u_k$ be the cosine coefficients of the solution $u^*_P$ of the Neumann problem for the Poisson equation. Testing both sides of the Poisson equation against $\Phi_k$ and taking account of the Neumann boundary condition, one obtains that
$$\hat u_0 = 0,\qquad \hat u_k = \frac{\hat f_k}{\pi^2|k|^2}.$$
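The diagonal relation above can be exercised numerically. The sketch below (hypothetical 1d cosine data, not taken from the paper) builds $u$ from $f$ via $\hat u_k = \hat f_k/(\pi^2 k^2)$ and checks both the equation and the Neumann boundary condition.

```python
import numpy as np

# 1d Neumann-Poisson on [0,1]: choose mean-zero data via cosine coefficients f_hat,
# form u_hat[k] = f_hat[k] / (pi^2 k^2), and check -u'' = f on a grid.
f_hat = {1: 0.7, 3: -0.2, 8: 1.5}          # hypothetical coefficients, f_hat_0 = 0
u_hat = {k: c / (np.pi**2 * k**2) for k, c in f_hat.items()}

x = np.linspace(0.0, 1.0, 1001)
f = sum(c * np.cos(np.pi * k * x) for k, c in f_hat.items())
u = sum(c * np.cos(np.pi * k * x) for k, c in u_hat.items())
# -u'' reproduces f exactly, since -(cos(pi k x))'' = pi^2 k^2 cos(pi k x)
minus_u_xx = sum(np.pi**2 * k**2 * c * np.cos(np.pi * k * x) for k, c in u_hat.items())
assert np.allclose(minus_u_xx, f, atol=1e-12)
# Neumann condition: u'(0) = u'(1) = 0 because sin(pi k x) vanishes at x = 0, 1
u_x = sum(-np.pi * k * c * np.sin(np.pi * k * x) for k, c in u_hat.items())
assert abs(u_x[0]) < 1e-12 and abs(u_x[-1]) < 1e-12
print(np.max(np.abs(u)))
```

The factor $1/(\pi^2 k^2)$ is also visible here as the smoothing that underlies the gain of two orders of Barron regularity in the estimate that follows.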
As a result,
$$\|u^*_P\|_{\mathcal{B}^{s+2}(\Omega)} = \sum_{k\in\mathbb{N}^d\setminus\{0\}}(1+\pi^{s+2}|k|^{s+2})|\hat u_k| = \sum_{k\in\mathbb{N}^d\setminus\{0\}}\frac{1+\pi^{s+2}|k|^{s+2}}{\pi^2|k|^2}|\hat f_k| \le 2\sum_{k\in\mathbb{N}^d\setminus\{0\}}(1+\pi^s|k|^s)|\hat f_k| = 2\|f\|_{\mathcal{B}^s(\Omega)}.$$
This finishes the proof.

7.2. Proof of Theorem 2.6. First, under the assumption of Theorem 2.6, there exists a unique solution $u\in H^1(\Omega)$ to (2.2). Moreover,
$$(7.1)\qquad \|\nabla u\|^2_{L^2(\Omega)} + V_{\min}\|u\|^2_{L^2(\Omega)} \le \|f\|_{L^2(\Omega)}\|u\|_{L^2(\Omega)}.$$
Our goal is to show that $u\in\mathcal{B}^{s+2}(\Omega)$. To this end, let us first derive an operator equation that is equivalent to the original Schrödinger problem (2.2). Multiplying both sides of the static Schrödinger equation by $\Phi_k$ and then integrating yields the following equivalent linear system for $\hat u$:
$$(7.2)\qquad \pi^2|k|^2\hat u_k + \widehat{(Vu)}_k = \hat f_k,\qquad k\in\mathbb{N}^d.$$
Let us first consider (7.2) with $k=0$. Thanks to Corollary B.1,
$$\widehat{(Vu)}_0 = \frac{1}{\beta_0}\Big(\sum_{m\in\mathbb{Z}^d}\beta_m^2\hat u_{|m|}\hat V_{|m|}\Big) = \hat u_0\hat V_0 + \sum_{m\in\mathbb{Z}^d\setminus\{0\}}\beta_m^2\hat u_{|m|}\hat V_{|m|},$$
where we have also used the fact that $\beta_0=1$. Consequently, equation (7.2) with $k=0$ becomes
$$\hat u_0\hat V_0 + \sum_{m\in\mathbb{Z}^d\setminus\{0\}}\beta_m^2\hat u_{|m|}\hat V_{|m|} = \hat f_0.$$
For $k\neq 0$, using again Corollary B.1, equation (7.2) can be written as
$$\pi^2|k|^2\hat u_k + \frac{1}{\beta_k}\Big(\sum_{m\in\mathbb{Z}^d}\beta_m\hat u_{|m|}\beta_{m-k}\hat V_{|m-k|}\Big) = \hat f_k,\qquad k\in\mathbb{N}^d\setminus\{0\}.$$
Recall that $u\in\mathcal{B}^s(\Omega)$ is equivalent to $\hat u$ belonging to the weighted $\ell^1$ space $\ell^1_{W_s}(\mathbb{N}^d)$ with the weight $W_s(k)=1+\pi^s|k|^s$. We would like to rewrite the equations above as an operator equation on the space $\ell^1_{W_s}(\mathbb{N}^d)$. To do so, let us define some useful operators. Define the operator $M:\hat u\mapsto M\hat u$ by
$$(M\hat u)_k = \begin{cases} \hat V_0\hat u_0 & \text{if } k=0,\\ \pi^2|k|^2\hat u_k & \text{otherwise}. \end{cases}$$
Define the operator $\mathcal{V}:\hat u\mapsto\mathcal{V}\hat u$ by
$$(\mathcal{V}\hat u)_k = \begin{cases} \displaystyle\sum_{m\in\mathbb{Z}^d\setminus\{0\}}\beta_m^2\hat u_{|m|}\hat V_{|m|} & \text{if } k=0,\\[2mm] \displaystyle\frac{1}{\beta_k}\sum_{m\in\mathbb{Z}^d}\beta_m\hat u_{|m|}\beta_{m-k}\hat V_{|m-k|} & \text{otherwise}. \end{cases}$$
With these operators, the system (7.2) can be reformulated as the operator equation
$$(7.3)\qquad (M+\mathcal{V})\hat u = \hat f.$$
Since $V(x)\ge V_{\min}>0$ for every $x$, we have $\hat V_0>0$. As a direct consequence, the diagonal operator $M$ is invertible. Therefore the operator equation (7.3) is equivalent to
$$(7.4)\qquad (I+M^{-1}\mathcal{V})\hat u = M^{-1}\hat f.$$
In order to show that $u\in\mathcal{B}^{s+2}(\Omega)$, it suffices to show that the equation (7.3), or equivalently (7.4), has a unique solution $\hat u\in\ell^1_{W_s}(\mathbb{N}^d)$. Indeed, if $\hat u\in\ell^1_{W_s}(\mathbb{N}^d)$, then it follows from (7.3) and the boundedness of $\mathcal{V}$ on $\ell^1_{W_s}(\mathbb{N}^d)$ (see (7.8) in the proof of Lemma 7.1 below) that
$$(7.5)\qquad \|M\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} \le \|\mathcal{V}\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} + \|\hat f\|_{\ell^1_{W_s}(\mathbb{N}^d)} \le C(d,V)\|\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} + \|\hat f\|_{\ell^1_{W_s}(\mathbb{N}^d)}.$$
Moreover, this combined with the positivity of $\hat V_0$ implies that
$$(7.6)\qquad \|u\|_{\mathcal{B}^{s+2}(\Omega)} = \sum_{k\in\mathbb{N}^d}(1+\pi^{s+2}|k|^{s+2})|\hat u_k| = \frac{1}{\hat V_0}\cdot\hat V_0|\hat u_0| + \sum_{k\in\mathbb{N}^d\setminus\{0\}}\frac{1+\pi^{s+2}|k|^{s+2}}{\pi^2|k|^2}\cdot\pi^2|k|^2|\hat u_k| \le \max\Big\{\frac{1}{\hat V_0},\,\frac{1}{\pi^2}+1\Big\}\,\|M\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} \le C(d,V)\big(\|\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} + \|\hat f\|_{\ell^1_{W_s}(\mathbb{N}^d)}\big)$$
for some $C(d,V)>0$.

Next, we claim that equation (7.4) has a unique solution $\hat u\in\ell^1_{W_s}(\mathbb{N}^d)$ and that there exists a constant $C>0$ such that
$$(7.7)\qquad \|\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} \le C\|\hat f\|_{\ell^1_{W_s}(\mathbb{N}^d)}.$$
To see this, observe that owing to the compactness of $M^{-1}\mathcal{V}$ shown in Lemma 7.1, the operator $I+M^{-1}\mathcal{V}$ is a Fredholm operator on $\ell^1_{W_s}(\mathbb{N}^d)$.
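As an aside, the operator equation (7.3) can be made concrete by truncating to finitely many cosine modes in $d=1$ and solving the resulting linear system (a sketch with hypothetical potential and data, not the paper's method; the mode-coupling matrix plays the role of $\mathcal{V}$).

```python
import numpy as np

# Truncated 1d version of the operator equation (7.3): work with cosine modes
# Phi_k(x) = cos(pi k x), k = 0..K. Potential and data below are hypothetical.
K = 40
V_hat = np.zeros(K + 1); V_hat[0] = 2.0; V_hat[1] = 1.0   # V(x) = 2 + cos(pi x) >= 1
f_hat = np.zeros(K + 1); f_hat[1] = 1.0; f_hat[3] = -0.3

def mult_matrix(c):
    """Matrix of u_hat -> (c u)_hat via cos a cos b = (cos(a+b) + cos(a-b))/2,
    truncated to modes 0..K (a 1d analogue of the convolution in Corollary B.1)."""
    A = np.zeros((K + 1, K + 1))
    for j in range(K + 1):           # mode of the multiplier c
        for k in range(K + 1):       # mode of u
            for m in (j + k, abs(j - k)):
                if m <= K:
                    A[m, k] += 0.5 * c[j]
    return A

M = np.diag([np.pi**2 * k**2 for k in range(K + 1)])   # the diagonal part
A = M + mult_matrix(V_hat)                             # truncated (M + V) of (7.3)
u_hat = np.linalg.solve(A, f_hat)

# Check that -u'' + V u = f holds on a grid (up to truncation/rounding error).
x = np.linspace(0.0, 1.0, 2001)
basis = np.cos(np.pi * np.outer(np.arange(K + 1), x))
u, V, f = u_hat @ basis, V_hat @ basis, f_hat @ basis
minus_u_xx = (u_hat * (np.pi * np.arange(K + 1))**2) @ basis
assert np.max(np.abs(minus_u_xx + V * u - f)) < 1e-8
print(np.max(np.abs(u_hat)))
```

The rapid decay of `u_hat` mirrors the gain of regularity established in Theorem 2.6: the diagonal part damps high modes like $1/(\pi^2 k^2)$, while the potential only couples nearby modes.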
By the celebrated Fredholm alternative theorem (see e.g. [14] and [7, VII 10.7]), the operator $I+M^{-1}\mathcal{V}$ has a bounded inverse $(I+M^{-1}\mathcal{V})^{-1}$ if and only if $(I+M^{-1}\mathcal{V})\hat u=0$ has only the trivial solution. Therefore, to obtain the bound (7.7) it suffices to show that $(I+M^{-1}\mathcal{V})\hat u=0$ implies $\hat u=0$. By the equivalence between the Schrödinger problem (2.2) and (7.4), we only need to show that the only solution of (2.2) with $f=0$ is zero. The latter is a direct consequence of (7.1), and this finishes the proof that the Schrödinger problem (2.2) has a unique solution in $\mathcal{B}^{s+2}(\Omega)$. Finally, the stability estimate (2.16) follows by combining (7.6) and (7.7).

Lemma 7.1. Assume that $V\in\mathcal{B}^s(\Omega)$ with $V(x)\ge V_{\min}>0$ for every $x\in\Omega$. Then the operator $M^{-1}\mathcal{V}$ is compact on $\ell^1_{W_s}(\mathbb{N}^d)$.

Proof. Since $M^{-1}$ is a multiplication operator on $\ell^1_{W_s}(\mathbb{N}^d)$ with diagonal entries converging to zero, it follows from Lemma 7.2 that $M^{-1}$ is compact on $\ell^1_{W_s}(\mathbb{N}^d)$. Therefore, to show the compactness of $M^{-1}\mathcal{V}$ it is sufficient to show that the operator $\mathcal{V}$ is bounded on $\ell^1_{W_s}(\mathbb{N}^d)$. To see this, note that by definition $\beta_k = 2^{\mathbb{1}_{k\neq 0} - \sum_{i=1}^d \mathbb{1}_{k_i\neq 0}} \in [2^{-d},1]$.
In addition, since $V\in\mathcal{B}^s(\Omega)$, using Corollary B.1 one has that
$$(7.8)\qquad \|\mathcal{V}\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} = \Big|\sum_{m\in\mathbb{Z}^d\setminus\{0\}}\beta_m^2\hat u_{|m|}\hat V_{|m|}\Big| + \sum_{k\in\mathbb{N}^d\setminus\{0\}}\frac{1}{\beta_k}\Big|\sum_{m\in\mathbb{Z}^d}\beta_m\hat u_{|m|}\beta_{m-k}\hat V_{|m-k|}\Big|(1+\pi^s\|k\|^s)$$
$$\le \sum_{m\in\mathbb{Z}^d\setminus\{0\}}|\hat u_{|m|}|\sum_{m\in\mathbb{Z}^d\setminus\{0\}}|\hat V_{|m|}| + 2^{d+1}\sum_{m\in\mathbb{Z}^d}\sum_{k\in\mathbb{N}^d}|\hat u_{|m|}||\hat V_{|m-k|}|\big(1+\pi^sC_s(\|m-k\|^s+\|m\|^s)\big)$$
$$\le 2^{d+2}\|\hat u\|_{\ell^1(\mathbb{N}^d)}\|\hat V\|_{\ell^1(\mathbb{N}^d)} + 2^{d+1}\max(1,C_s)\big(\|\hat u\|_{\ell^1(\mathbb{N}^d)}\|\hat V\|_{\ell^1_{W_s}(\mathbb{N}^d)} + \|\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)}\|\hat V\|_{\ell^1(\mathbb{N}^d)}\big)$$
$$\le 2^{d+3}\max(1,C_s)\,\|\hat V\|_{\ell^1_{W_s}(\mathbb{N}^d)}\|\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)} = 2^{d+3}\max(1,C_s)\,\|V\|_{\mathcal{B}^s(\Omega)}\|\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)},$$
where in the first inequality above we used the elementary inequality $|a+b|^s \le C_s(|a|^s+|b|^s)$ for some constant $C_s>0$, and in the second inequality we used the fact that $\sum_{m\in\mathbb{Z}^d}|\hat u_{|m|}| \le 2^d\|\hat u\|_{\ell^1(\mathbb{N}^d)} \le 2^d\|\hat u\|_{\ell^1_{W_s}(\mathbb{N}^d)}$. □

Lemma 7.2. Suppose that $T$ is a multiplication operator on $\ell^1_{W_s}(\mathbb{N}^d)$, defined for $u=(u_k)_{k\in\mathbb{N}^d}$ by $(Tu)_k = \lambda_k u_k$ with $\lambda_k\to 0$ as $\|k\|\to\infty$. Then $T:\ell^1_{W_s}(\mathbb{N}^d)\to\ell^1_{W_s}(\mathbb{N}^d)$ is compact.

Proof. It suffices to show that the image of the unit ball in $\ell^1_{W_s}(\mathbb{N}^d)$ under the map $T$ is totally bounded.
To this end, given any fixed $\varepsilon>0$, let $K\in\mathbb{N}$ be such that $|\lambda_k|\le\varepsilon$ if $\|k\|>K$. Denote $\mathcal{I} := \{k\in\mathbb{N}^d : \|k\|\le K\}$ and let $d_0$ be the cardinality of the index set $\mathcal{I}$. Note that the ball in $\mathbb{R}^{d_0}$ of radius $\max\{|\lambda_k| : k\in\mathcal{I}\}$ with respect to the weighted $\ell^1$-norm $\|v\|_{\ell^1_{W_s}} = \sum_{k\in\mathcal{I}}|v_k|W_s(k)$ is precompact, so it can be covered by the union of $n_\varepsilon$ $\varepsilon$-balls with centers $\{v^1,\dots,v^{n_\varepsilon}\}$, where $v^i\in\mathbb{R}^{d_0}$. We now claim that the image of the unit ball in $\ell^1_{W_s}(\mathbb{N}^d)$ under $T$ is covered by $n_\varepsilon$ $2\varepsilon$-balls with centers $\{(v^1,0),\dots,(v^{n_\varepsilon},0)\}$. In fact, for $u\in\ell^1_{W_s}(\mathbb{N}^d)$ with $\sum_{k\in\mathbb{N}^d}|u_k|W_s(k)\le 1$, one has
$$Tu = \big((\lambda_k u_k)_{k\in\mathcal{I}},\,0\big) + \big(0,\,(\lambda_k u_k)_{k\notin\mathcal{I}}\big).$$
Suppose that $v^{i^*}$ is the closest center of $\{v^1,\dots,v^{n_\varepsilon}\}$ to the vector $\big((\lambda_k u_k)_{k\in\mathcal{I}}\big)$. Then
$$\|Tu-(v^{i^*},0)\|_{\ell^1_{W_s}(\mathbb{N}^d)} = \sum_{k\in\mathcal{I}}|(v^{i^*})_k - \lambda_k u_k|W_s(k) + \big\|\big(0,(\lambda_k u_k)_{k\notin\mathcal{I}}\big)\big\|_{\ell^1_{W_s}(\mathbb{N}^d)} \le \varepsilon + \varepsilon\big\|\big(0,(u_k)_{k\notin\mathcal{I}}\big)\big\|_{\ell^1_{W_s}(\mathbb{N}^d)} \le 2\varepsilon.$$
This finishes the proof. □

Appendix A. Proof of Proposition 2.1

A.1. Proof of Proposition 2.1-(i). First, it is well known that the problem (2.1) has a unique weak solution $u^*_P\in H^1_\diamond(\Omega) = \{u\in H^1(\Omega) : \int_\Omega u\,dx = 0\}$, i.e.
$$(A.1)\qquad a(u,v) := \int_\Omega \nabla u\cdot\nabla v\,dx = F(v) := \int_\Omega fv\,dx \quad\text{for every } v\in H^1_\diamond(\Omega).$$
Moreover, the solution $u^*_P$ satisfies
$$u^*_P = \arg\min_{u\in H^1_\diamond(\Omega)}\Big\{ \frac12\int_\Omega |\nabla u|^2\,dx - \int_\Omega fu\,dx \Big\}.$$
Due to the mean-zero constraint of the space $H^1_\diamond(\Omega)$, this variational formulation is inconvenient to adopt as the loss function for training a neural network solution.
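As an aside, the penalized functional $\mathcal{E}_P$ that replaces this constrained formulation is exactly what the DRM approximates by Monte Carlo. Below is a minimal sketch of that empirical risk for an explicit exact pair (hypothetical test data, not the paper's experiments).

```python
import numpy as np

# Monte Carlo discretization of the penalized Ritz loss
#   E_P(u) = int_0^1 ( (1/2)|u'|^2 - f u ) dx + (1/2) ( int_0^1 u dx )^2
# for the explicit (hypothetical) pair u(x) = cos(pi x), f(x) = pi^2 cos(pi x),
# i.e. u is the exact Neumann solution of -u'' = f, with exact loss -pi^2/4.
rng = np.random.default_rng(0)
n = 1_000_000
X = rng.uniform(0.0, 1.0, n)            # uniform sample points on Omega = [0, 1]

u   = np.cos(np.pi * X)
u_x = -np.pi * np.sin(np.pi * X)
f   = np.pi**2 * np.cos(np.pi * X)

# empirical risk: sample means replace the integrals
E_n = np.mean(0.5 * u_x**2 - f * u) + 0.5 * np.mean(u)**2
assert abs(E_n - (-np.pi**2 / 4)) < 0.05   # agrees with the exact loss up to MC error
print(E_n)
```

Replacing the two integrals by sample means is what makes the empirical minimizer differ from the true minimizer, which is precisely the generalization gap analyzed in the main text.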
To tackle this issue, we consider instead the following modified Poisson problem:
$$(A.2)\qquad -\Delta u + \lambda\int_\Omega u\,dx = f \ \text{ on } \Omega,\qquad \frac{\partial}{\partial\nu}u = 0 \ \text{ on } \partial\Omega.$$
Here $\lambda>0$ is a fixed constant. By the Lax–Milgram theorem the problem (A.2) has a unique weak solution $u^*_{\lambda,P}$, which solves
$$(A.3)\qquad a_\lambda(u^*_{\lambda,P},v) := \int_\Omega \nabla u^*_{\lambda,P}\cdot\nabla v\,dx + \lambda\int_\Omega u^*_{\lambda,P}\,dx\int_\Omega v\,dx = F(v) \quad\text{for every } v\in H^1(\Omega).$$
It is clear that $u^*_{\lambda,P}$ is the solution of the variational problem
$$(A.4)\qquad \arg\min_{u\in H^1(\Omega)}\Big\{ \frac12\int_\Omega|\nabla u|^2\,dx + \frac{\lambda}{2}\Big(\int_\Omega u\,dx\Big)^2 - \int_\Omega fu\,dx \Big\}.$$
Furthermore, the lemma below shows that the weak solutions of (A.2) are independent of $\lambda$ and all coincide with $u^*_P$.

Lemma A.1. Assume that $\lambda>0$. Let $u^*_P$ and $u^*_{\lambda,P}$ be the weak solutions of (2.1) and (A.2) respectively, with $f\in L^2(\Omega)$ satisfying $\int_\Omega f\,dx = 0$. Then we have that $u^*_{\lambda,P} = u^*_P$.

Proof. We only need to show that $u^*_{\lambda,P}$ satisfies the weak formulation (A.1). In fact, since $u^*_{\lambda,P}$ satisfies (A.3), by setting $v=1$ we obtain that
$$\lambda\int_\Omega u^*_{\lambda,P}\,dx = \int_\Omega f\,dx = 0.$$
This immediately implies that $a_\lambda(u^*_{\lambda,P},v) = a(u^*_{\lambda,P},v)$, and hence $u^*_{\lambda,P}$ satisfies (A.1). □

Since the solution to (A.2) is invariant for all $\lambda>0$, for simplicity we set $\lambda=1$ in (A.4), and this proves (2.3), i.e.
$$(A.5)\qquad u^*_P = \arg\min_{u\in H^1(\Omega)}\mathcal{E}_P(u) = \arg\min_{u\in H^1(\Omega)}\Big\{ \frac12\int_\Omega|\nabla u|^2\,dx - \int_\Omega fu\,dx + \frac12\Big(\int_\Omega u\,dx\Big)^2 \Big\}.$$
Finally, we prove that $u^*_P$ satisfies the estimate (2.4). To see this, we first state a useful lemma which computes the energy excess $\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P)$ for any $u\in H^1(\Omega)$.

Lemma A.2. Let $u^*_P$ be the minimizer of $\mathcal{E}_P$, or equivalently the weak solution of the Poisson problem (A.2). Then for any $u\in H^1(\Omega)$, it holds that
$$\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P) = \frac12\int_\Omega|\nabla u-\nabla u^*_P|^2\,dx + \frac12\Big(\int_\Omega (u-u^*_P)\,dx\Big)^2.$$

Proof.
It follows from Green's formula and the fact that $u^*_P\in H^1_\diamond(\Omega)$ that
$$\mathcal{E}_P(u^*_P) = \int_\Omega \frac12|\nabla u^*_P|^2 - fu^*_P\,dx + \underbrace{\frac12\Big(\int_\Omega u^*_P\,dx\Big)^2}_{=0} = \int_\Omega \frac12|\nabla u^*_P|^2 + \Delta u^*_P\,u^*_P\,dx = -\frac12\int_\Omega|\nabla u^*_P|^2\,dx.$$
Then for any $u\in H^1(\Omega)$, applying Green's formula again yields
$$\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P) = \frac12\int_\Omega|\nabla u|^2\,dx - \int_\Omega fu\,dx + \frac12\Big(\int_\Omega u\,dx\Big)^2 + \frac12\int_\Omega|\nabla u^*_P|^2\,dx$$
$$= \frac12\int_\Omega|\nabla u|^2\,dx + \int_\Omega \Delta u^*_P\,u\,dx + \frac12\Big(\int_\Omega u\,dx\Big)^2 + \frac12\int_\Omega|\nabla u^*_P|^2\,dx = \frac12\int_\Omega|\nabla u-\nabla u^*_P|^2\,dx + \frac12\Big(\int_\Omega(u-u^*_P)\,dx\Big)^2. \qquad\Box$$

Now recall that $C_P>0$ is the Poincaré constant such that for any $v\in H^1(\Omega)$,
$$\Big\|v-\int_\Omega v\,dx\Big\|_{L^2(\Omega)} \le C_P\|\nabla v\|_{L^2(\Omega)}.$$
As a result,
$$\|v\|^2_{H^1(\Omega)} = \|\nabla v\|^2_{L^2(\Omega)} + \|v\|^2_{L^2(\Omega)} \le \|\nabla v\|^2_{L^2(\Omega)} + 2\Big\|v-\int_\Omega v\,dx\Big\|^2_{L^2(\Omega)} + 2\Big|\int_\Omega v\,dx\Big|^2 \le (2C_P^2+1)\|\nabla v\|^2_{L^2(\Omega)} + 2\Big|\int_\Omega v\,dx\Big|^2.$$
Therefore, an application of the last inequality with $v=u-u^*_P$ together with Lemma A.2 yields
$$\|u-u^*_P\|^2_{H^1(\Omega)} \le 2\max\{2C_P^2+1,\,2\}\big(\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P)\big).$$
On the other hand, it follows from Lemma A.2 that
$$\mathcal{E}_P(u)-\mathcal{E}_P(u^*_P) \le \frac12\|u-u^*_P\|^2_{H^1(\Omega)}.$$
Combining the last two estimates leads to (2.4), and hence finishes the proof of Proposition 2.1-(i).

A.2. Proof of Proposition 2.1-(ii). First, the standard Lax–Milgram theorem implies that the static Schrödinger equation has a unique weak solution $u^*_S$. Moreover, it is not hard to verify that $u^*_S$ solves the equivalent variational problem (2.5), i.e.
$$u^*_S = \arg\min_{u\in H^1(\Omega)}\mathcal{E}_S(u) = \arg\min_{u\in H^1(\Omega)}\Big\{ \frac12\int_\Omega|\nabla u|^2 + V|u|^2\,dx - \int_\Omega fu\,dx \Big\}.$$
Finally, we prove that $u^*_S$ satisfies the estimate (2.6). For this, we first claim that for any $u\in H^1(\Omega)$,
$$(A.6)\qquad \mathcal{E}_S(u)-\mathcal{E}_S(u^*_S) = \frac12\int_\Omega|\nabla u-\nabla u^*_S|^2\,dx + \frac12\int_\Omega V(u-u^*_S)^2\,dx.$$
In fact, using Green's formula, one has that
$$\mathcal{E}_S(u^*_S) = \int_\Omega \frac12|\nabla u^*_S|^2 + \frac12 V|u^*_S|^2 - fu^*_S\,dx = \int_\Omega \frac12|\nabla u^*_S|^2 + \frac12 V|u^*_S|^2 + (\Delta u^*_S - Vu^*_S)u^*_S\,dx = -\frac12\int_\Omega|\nabla u^*_S|^2 + V|u^*_S|^2\,dx.$$
Then for any $u\in H^1(\Omega)$, applying Green's formula again yields
$$\mathcal{E}_S(u)-\mathcal{E}_S(u^*_S) = \frac12\int_\Omega|\nabla u|^2+V|u|^2\,dx - \int_\Omega fu\,dx + \frac12\int_\Omega|\nabla u^*_S|^2+V|u^*_S|^2\,dx$$
$$= \frac12\int_\Omega|\nabla u|^2+V|u|^2\,dx + \int_\Omega(\Delta u^*_S - Vu^*_S)u\,dx + \frac12\int_\Omega|\nabla u^*_S|^2+V|u^*_S|^2\,dx = \frac12\int_\Omega|\nabla u-\nabla u^*_S|^2\,dx + \frac12\int_\Omega V\big(u-u^*_S\big)^2\,dx.$$
The estimate (2.6) follows directly from the identity (A.6) and the assumption that $V(x)\ge V_{\min}>0$.

Appendix B. Auxiliary lemmas on cosine expansions

Assume that $u\in L^2(\Omega)$ admits the cosine series expansion
$$u(x) = \sum_{k\in\mathbb{N}^d}\hat u_k\Phi_k(x),$$
where $\{\hat u_k\}_{k\in\mathbb{N}^d}$ are the cosine expansion coefficients, i.e.
$$(B.1)\qquad \hat u_k = \frac{\int_\Omega u(x)\Phi_k(x)\,dx}{\int_\Omega \Phi_k(x)^2\,dx} = 2^{\sum_{i=1}^d \mathbb{1}_{k_i\neq 0}}\int_\Omega u(x)\Phi_k(x)\,dx.$$
Let $\Omega_e := [-1,1]^d$ and define the even extension $u_e$ of a function $u$ by
$$u_e(x) = u_e(x_1,\dots,x_d) = u(|x_1|,\dots,|x_d|),\qquad x\in\Omega_e.$$
Let $\tilde u_k$ be the Fourier coefficients of $u_e$. Since $u_e$ is real and even, one has that
$$u_e = \sum_{k\in\mathbb{Z}^d}\tilde u_k\cos(\pi k\cdot x),$$
where
$$(B.2)\qquad \tilde u_k = \frac{\int_{\Omega_e}u_e(x)\cos(\pi k\cdot x)\,dx}{\int_{\Omega_e}\cos^2(\pi k\cdot x)\,dx} = \frac{1}{2^{d-\mathbb{1}_{k\neq 0}}}\int_{\Omega_e}u_e(x)\cos(\pi k\cdot x)\,dx.$$
By abuse of notation, we use $|k|$ to stand for the vector $(|k_1|,|k_2|,\dots,|k_d|)$.

Lemma B.1.
For every $k\in\mathbb{Z}^d$, it holds that $\tilde u_k = \beta_k\hat u_{|k|}$, where $\beta_k = 2^{\mathbb{1}_{k\neq 0} - \sum_{i=1}^d \mathbb{1}_{k_i\neq 0}}$.

Proof. First, thanks to Lemma 4.2 and the evenness of cosine,
$$\int_{\Omega_e}u_e(x)\cos(\pi k\cdot x)\,dx = \int_{\Omega_e}u_e(x)\cos\Big(\pi\sum_{i=1}^{d-1}k_ix_i\Big)\cos(\pi k_dx_d)\,dx - \underbrace{\int_{\Omega_e}u_e(x)\sin\Big(\pi\sum_{i=1}^{d-1}k_ix_i\Big)\sin(\pi k_dx_d)\,dx}_{=0}$$
$$= \int_{\Omega_e}u_e(x)\cos\Big(\pi\sum_{i=1}^{d-2}k_ix_i\Big)\cos(\pi k_{d-1}x_{d-1})\cos(\pi k_dx_d)\,dx - \underbrace{\int_{\Omega_e}u_e(x)\sin\Big(\pi\sum_{i=1}^{d-2}k_ix_i\Big)\sin(\pi k_{d-1}x_{d-1})\cos(\pi k_dx_d)\,dx}_{=0}$$
$$= \cdots = \int_{\Omega_e}u_e(x)\prod_{i=1}^d\cos(\pi k_ix_i)\,dx = 2^d\int_\Omega u(x)\Phi_{|k|}(x)\,dx.$$
In addition, since $\Phi_k = \Phi_{|k|}$ for any $k\in\mathbb{Z}^d$, the lemma follows from the equation above, (B.1) and (B.2). □

The next lemma shows that the Fourier coefficients of the product of two functions $u$ and $v$ are given by the discrete convolution of their Fourier coefficients. Recall that $\{\tilde u_k\}_{k\in\mathbb{Z}^d}$ denote the Fourier coefficients of the even extension $u_e$.

Lemma B.2. Let $w_e = u_ev_e$. Then $\tilde w_k = \sum_{m\in\mathbb{Z}^d}\tilde u_m\tilde v_{k-m}$.

Proof.
By definition, $u_e(x) = \sum_{m\in\mathbb{Z}^d}\tilde u_m\cos(\pi m\cdot x)$ and $v_e(x) = \sum_{n\in\mathbb{Z}^d}\tilde v_n\cos(\pi n\cdot x)$. Thanks to the fact that
$$\int_{\Omega_e}\cos(\pi\ell\cdot x)\cos(\pi k\cdot x)\,dx = 2^{d-\mathbb{1}_{k\neq 0}}\delta_\ell(k),$$
one obtains that
$$\tilde w_k = \frac{1}{2^{d-\mathbb{1}_{k\neq 0}}}\int_{\Omega_e}u_e(x)v_e(x)\cos(\pi k\cdot x)\,dx = \frac{1}{2^{d-\mathbb{1}_{k\neq 0}}}\sum_{m\in\mathbb{Z}^d}\sum_{n\in\mathbb{Z}^d}\tilde u_m\tilde v_n\int_{\Omega_e}\cos(\pi m\cdot x)\cos(\pi n\cdot x)\cos(\pi k\cdot x)\,dx$$
$$= \frac{1}{2^{d-\mathbb{1}_{k\neq 0}}}\sum_{m\in\mathbb{Z}^d}\sum_{n\in\mathbb{Z}^d}\tilde u_m\tilde v_n\int_{\Omega_e}\frac12\big[\cos(\pi(m+n)\cdot x) + \cos(\pi(m-n)\cdot x)\big]\cos(\pi k\cdot x)\,dx = \frac12\sum_{m\in\mathbb{Z}^d}\tilde u_m(\tilde v_{k-m}+\tilde v_{m-k}) = \sum_{m\in\mathbb{Z}^d}\tilde u_m\tilde v_{k-m},$$
where we have also used that $\tilde v_k = \tilde v_{-k}$ for any $k$. □

Corollary B.1. For any $k\in\mathbb{N}^d$,
$$\widehat{(uv)}_k = \frac{1}{\beta_k}\sum_{m\in\mathbb{Z}^d}\beta_m\hat u_{|m|}\beta_{m-k}\hat v_{|m-k|}.$$

Proof. Thanks to Lemma B.1 and Lemma B.2,
$$\widehat{(uv)}_k = \frac{1}{\beta_k}\widetilde{(uv)}_k = \frac{1}{\beta_k}(\tilde u * \tilde v)_k = \frac{1}{\beta_k}\sum_{m\in\mathbb{Z}^d}\beta_m\hat u_{|m|}\beta_{m-k}\hat v_{|m-k|}. \qquad\Box$$

References

[1] Uri M. Ascher and Chen Greif. A first course on numerical methods. SIAM, 2011.
[2] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[3] Julius Berner, Philipp Grohs, and Arnulf Jentzen. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations. SIAM Journal on Mathematics of Data Science, 2(3):631–657, 2020.
[4] Giuseppe Carleo and Matthias Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355(6325):602–606, 2017.
[5] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport.
Advances in Neural Information Processing Systems, 31:3036–3046, 2018.
[6] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pages 2933–2943, 2019.
[7] John B. Conway. A course in functional analysis, volume 96 of Graduate Texts in Mathematics. Springer-Verlag, New York, second edition, 1990.
[8] Suchuan Dong and Naxian Ni. A method for representing periodic functions and enforcing exactly periodic boundary conditions with deep neural networks. arXiv preprint arXiv:2007.07442, 2020.
[9] Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, and René Garcia. Incorporating second-order functional knowledge for better option pricing. In Advances in Neural Information Processing Systems, pages 472–478, 2001.
[10] Weinan E, Chao Ma, Stephan Wojtowytsch, and Lei Wu. Towards a mathematical understanding of neural network-based machine learning: what we know and what we don't. arXiv preprint arXiv:2009.10713, 2020.
[11] Weinan E, Chao Ma, and Lei Wu. Barron spaces and the compositional function spaces for neural network models. arXiv preprint arXiv:1906.08039, 2019.
[12] Weinan E and Stephan Wojtowytsch. Some observations on partial differential equations in Barron and multi-layer spaces. arXiv preprint arXiv:2012.01484, 2020.
[13] Weinan E and Bing Yu. The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.
[14] Ivar Fredholm. On a class of functional equations. Acta Mathematica, 27(1):365–390, 1903.
[15] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Limitations of lazy training of two-layers neural network. In Advances in Neural Information Processing Systems, pages 9108–9118, 2019.
[16] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks.
In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323, 2011.
[17] Philipp Grohs, Fabian Hornung, Arnulf Jentzen, and Philippe Von Wurstemberger. A proof that artificial neural networks overcome the curse of dimensionality in the numerical approximation of Black-Scholes partial differential equations. arXiv preprint arXiv:1809.02362, 2018.
[18] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018.
[19] Jiequn Han, Jianfeng Lu, and Mo Zhou. Solving high-dimensional eigenvalue problems using deep neural networks: A diffusion Monte Carlo like approach. Journal of Computational Physics, 423:109792, 2020.
[20] Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, and Tuan Anh Nguyen. A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations. SN Partial Differential Equations and Applications, 1:1–34, 2020.
[21] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571–8580, 2018.
[22] Yuehaw Khoo, Jianfeng Lu, and Lexing Ying. Solving for high-dimensional committor functions using artificial neural networks. Research in the Mathematical Sciences, 6(1):1, 2019.
[23] Jason M Klusowski and Andrew R Barron. Approximation by combinations of ReLU and squared ReLU ridge functions with $\ell^1$ and $\ell^0$ controls. IEEE Transactions on Information Theory, 64(12):7649–7656, 2018.
[24] Isaac E Lagaris, Aristidis Likas, and Dimitrios I Fotiadis. Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks, 9(5):987–1000, 1998.
[25] Michel Ledoux and Michel Talagrand.
Probability in Banach Spaces: Isoperimetry and Processes, volume 23. Springer Science & Business Media, 1991.
[26] Tao Luo and Haizhao Yang. Two-layer neural networks for partial differential equations: Optimization and generalization theory. arXiv preprint arXiv:2006.15733, 2020.
[27] Tengyu Ma. CS229T/STATS231: Statistical Learning Theory, 2018. URL: https://web.stanford.edu/class/cs229t/scribe_notes/10_08_final.pdf. Last visited on 2020/09/16.
[28] William Lauchlin McMillan. Ground state of liquid He4. Physical Review, 138(2A):A442, 1965.
[29] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
[30] Siddhartha Mishra and Roberto Molinaro. Estimates on the generalization error of physics informed neural networks (PINNs) for approximating PDEs. arXiv preprint arXiv:2006.16144, 2020.
[31] Ali Girayhan Özbay, Sylvain Laizet, Panagiotis Tzirakis, Georgios Rizos, and Björn Schuller. Poisson CNN: Convolutional neural networks for the solution of the Poisson equation with varying meshes and Dirichlet boundary conditions. arXiv preprint arXiv:1910.08613, 2019.
[32] Gilles Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonctionnelle (dit "Maurey-Schwartz"), pages 1–12.
[33] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.
[34] Grant Rotskoff and Eric Vanden-Eijnden. Parameters as interacting particles: long time convergence and asymptotic error scaling of neural networks. In Advances in Neural Information Processing Systems, pages 7146–7155, 2018.
[35] Yeonjong Shin, Jerome Darbon, and George Em Karniadakis.
On the convergence of physics informed neural networks for linear second-order elliptic and parabolic type PDEs. arXiv preprint arXiv:2004.01806, 2020.
[36] Yeonjong Shin, Zhongqiang Zhang, and George Em Karniadakis. Error estimates of residual minimization using neural networks for linear PDEs. arXiv preprint arXiv:2010.08019, 2020.
[37] Jonathan W Siegel and Jinchao Xu. Approximation rates for neural networks with general activation functions. Neural Networks, 2020.
[38] Jonathan W Siegel and Jinchao Xu. High-order approximation rates for neural networks with ReLU$^k$ activation functions. arXiv preprint arXiv:2012.07205, 2020.
[39] Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.
[40] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80(2):725–752, 2020.
[41] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge University Press, 2019.
[42] Michael M. Wolf. Mathematical Foundations of Supervised Learning, 2020. URL: . Last visited on 2020/12/5.
[43] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
[44] Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. arXiv preprint arXiv:1802.03620, 2018.

(JL) Departments of Mathematics, Physics, and Chemistry, Duke University, Box 90320, Durham, NC 27708.
Email address: [email protected]

(YL) Department of Mathematics and Statistics, Lederle Graduate Research Tower, University of Massachusetts, 710 N. Pleasant Street, Amherst, MA 01003.
Email address: [email protected]

(MW) Mathematics Department, Duke University, Box 90320, Durham, NC 27708.
Email address:
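The convolution identity of Lemma B.2 admits a quick numerical sanity check in dimension $d = 1$. The sketch below is illustrative only and not part of the paper: the grid size, tolerance, and all variable names are our own choices. It builds two even, 2-periodic cosine series with symmetric two-sided coefficients, extracts the coefficients of their product by quadrature over one period, and compares them with the discrete convolution of the factors' coefficients.

```python
import numpy as np

# Sanity check of Lemma B.2 in d = 1: for u(x) = sum_m u~_m cos(pi m x) and
# v(x) = sum_n v~_n cos(pi n x) with symmetric two-sided coefficients
# (u~_m = u~_{-m}), the product w = u v satisfies w~_k = sum_m u~_m v~_{k-m}.
rng = np.random.default_rng(0)
M = 3                                   # coefficient support: |m| <= M
u_half = rng.standard_normal(M + 1)
v_half = rng.standard_normal(M + 1)
u_t = {m: u_half[abs(m)] for m in range(-M, M + 1)}
v_t = {n: v_half[abs(n)] for n in range(-M, M + 1)}

# Uniform grid over one full period [-1, 1); the rectangle rule on a full
# period integrates trigonometric polynomials of frequency < N exactly.
N = 4096
x = np.linspace(-1.0, 1.0, N, endpoint=False)
h = 2.0 / N
u = sum(c * np.cos(np.pi * m * x) for m, c in u_t.items())
v = sum(c * np.cos(np.pi * n * x) for n, c in v_t.items())
w = u * v

for k in range(0, 2 * M + 1):
    # Extract the two-sided coefficient w~_k: in the integral, the +k and -k
    # modes both contribute for k != 0, and int cos^2(0) dx = 2 for k = 0,
    # so the normalizing factor is 2 in either case (d = 1).
    wk_quad = h * np.sum(w * np.cos(np.pi * k * x)) / 2.0
    wk_conv = sum(u_t.get(m, 0.0) * v_t.get(k - m, 0.0)
                  for m in range(-2 * M, 2 * M + 1))
    assert abs(wk_quad - wk_conv) < 1e-9
print("convolution identity verified for k = 0,...,6")
```

Since the grid covers one full period and all modes of `w * cos(pi k x)` have frequency far below `N`, the quadrature is exact up to rounding, so the comparison isolates the identity itself rather than discretization error.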