Implicit bias with Ritz-Galerkin method in understanding deep learning for solving PDEs
Jihong Wang¹, Zhi-Qin John Xu², Jiwei Zhang³ and Yaoyu Zhang²

¹ Beijing Computational Science Research Center, Beijing 100193, P.R. China.
² School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC and Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai 200240, P.R. China.
³ School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China.
Abstract.
This paper aims at studying the difference between the Ritz-Galerkin (R-G) method and the deep neural network (DNN) method in solving partial differential equations (PDEs), in order to better understand deep learning. To this end, we consider solving a particular Poisson problem, where the information of the right-hand side of the equation f is only available at n sample points, that is, f is known only at finitely many points. Through both theoretical and numerical studies, we show that the solution of the R-G method converges to a piecewise linear function for the one-dimensional (1D) problem, or to functions of lower regularity for high-dimensional problems. With the same setting, DNNs however learn a relatively smooth solution regardless of the dimension, that is, DNNs are implicitly biased towards functions with more low-frequency components among all functions that can fit the equation at the available data points. This bias is explained by the recent study of the frequency principle (Xu et al. (2019) [17] and Zhang et al. (2019) [11, 19]). In addition to the similarity between traditional numerical methods and DNNs from the approximation perspective, our work shows that the implicit bias in the learning process, which is absent in traditional numerical methods, could help better understand the characteristics of DNNs.

AMS subject classifications: 35Q68, 65N30, 65N35
Key words: Deep learning, Ritz-Galerkin method, Partial differential equations, F-Principle
1 Introduction

Deep neural networks (DNNs) have become increasingly important in scientific computing [3-9, 15, 16]. A major potential advantage over traditional numerical methods is that DNNs could overcome the curse of dimensionality in high-dimensional problems.
Email addresses: [email protected] (J. Wang), [email protected] (Z. Xu), [email protected] (J. Zhang), [email protected] (Y. Zhang)

Through comparison with traditional numerical methods, several studies have made progress on understanding the algorithmic characteristics of DNNs. For example, by exploring the ReLU DNN representation of continuous piecewise linear functions in FEM, the work [8] theoretically establishes that a ReLU DNN can accurately represent any linear finite element function. Regarding convergence behavior, the works [17, 18] show a Frequency Principle (F-Principle): DNNs often learn low-frequency components first, while most conventional iterative methods (e.g., the Jacobi method) exhibit the opposite convergence behavior, that is, higher-frequency components are learned faster. These understandings could lead to a better use of DNNs in practice; for example, DNN-based algorithms have been proposed based on the F-Principle to quickly eliminate high-frequency error [2, 10].

The aim of this paper is to investigate the different behaviors of DNNs and a traditional numerical method, the Ritz-Galerkin (R-G) method. To this end, we use an example that shows their stark difference, namely, solving PDEs given only a few sample points. We denote by n the number of samples, and by m the number of basis functions in the R-G method or the number of neurons in DNNs. In traditional PDE models, one considers the situation where the source function in the equation is completely known, i.e., the sample number n can go to infinity. In practical applications, however, such as signal processing, statistical mechanics, and chemical and biophysical dynamic systems, we often encounter problems where only a few sample values can be obtained. It is interesting to ask how R-G methods behave on this particular problem, and what solution the DNN method would obtain.

In this paper, we show that the R-G method treats the discrete sampling points as a linear combination of Dirac delta functions, while the DNN method always uses a relatively smooth function to interpolate the discrete sampling points. We incorporate the F-Principle to show how the DNN differs from the R-G method: among all functions that can fit the training data, DNNs are implicitly biased towards functions with more low-frequency components. In addition to the similarity between traditional numerical methods and DNNs from the approximation perspective [8], our work shows that the implicit bias in the learning process, which is absent in traditional numerical methods, could help better understand the characteristics of DNNs.

The rest of the paper is organized as follows. In section 2, we briefly introduce the R-G method and the DNN method. In sections 3 and 4, we present the difference between these two methods in solving PDEs, theoretically and numerically. We end the paper with a conclusion in section 5.

2 R-G method and DNN method

In this section we take the toy model of Poisson's equation as an example to investigate the difference in solution behavior between the R-G method and the DNN method.
We consider the d-dimensional Poisson problem posed on a bounded domain Ω ⊂ R^d with Dirichlet boundary condition,

$$ -\Delta u(x) = f(x), \quad x \in \Omega, \qquad u(x) = 0, \quad x \in \partial\Omega, \tag{2.1} $$

where Δ represents the Laplace operator and x = (x_1, x_2, ···, x_d) is a d-dimensional vector. It is known that problem (2.1) admits a unique solution for f ∈ L^2(Ω), and its regularity can be raised to C_b^{s+2}(Ω) if f ∈ C_b^s(Ω) for some s ≥ 0.
In the literature, there are a number of effective numerical methods to solve problem (2.1) in the general case. We here consider a special situation: we only have the information of f(x) at the n sample points x_i (i = 1, ···, n). In practical applications, we may imagine that we only have finite experimental data, i.e., the values f(x_i) (i = 1, ···, n), and no further information about f(x) at other points. Through solving such a particular Poisson problem (2.1) with the R-G method and the deep learning method, we aim to find the bias of these two methods in solving PDEs.

2.1 R-G method

In this subsection, we briefly introduce the R-G method [1]. For problem (2.1), we construct the functional

$$ J(u) = \frac{1}{2} a(u,u) - (f,u), \tag{2.2} $$

where

$$ a(u,v) = \int_{\Omega} \nabla u(x) \cdot \nabla v(x)\, dx, \qquad (f,v) = \int_{\Omega} f(x) v(x)\, dx. $$

The variational form of problem (2.1) is the following: find u ∈ H_0^1(Ω) such that

$$ J(u) = \min_{v \in H_0^1(\Omega)} J(v). \tag{2.3} $$

The weak form of (2.3) is to find u ∈ H_0^1(Ω) such that

$$ a(u,v) = (f,v), \quad \forall v \in H_0^1(\Omega). \tag{2.4} $$

Problem (2.1) is the strong form if the solution u ∈ H^2(Ω). To solve (2.4) numerically, we introduce a finite-dimensional space U_h to approximate the infinite-dimensional space H_0^1(Ω). Let U_h ⊂ H_0^1(Ω) be a subspace with a sequence of basis functions {φ_1, φ_2, ···, φ_m}. The numerical solution u_h ∈ U_h is represented as

$$ u_h = \sum_{k=1}^{m} c_k \varphi_k, \tag{2.5} $$

where the coefficients {c_k} are the unknown values to be solved for. Replacing H_0^1(Ω) by U_h, both problems (2.3) and (2.4) can be transformed into the following system:

$$ \sum_{k=1}^{m} c_k\, a(\varphi_k, \varphi_j) = (f, \varphi_j), \quad j = 1, \cdots, m. \tag{2.6} $$

From (2.6) we can compute the coefficients c_k and then obtain the numerical solution u_h. Equation (2.6) is usually called the R-G equation.

For different types of basis functions, the R-G method specializes to the finite element method (FEM), the spectral method (SM), and so on. If the basis functions {φ_k(x)} are local, namely compactly supported, the method is usually regarded as the FEM. Assume that Ω is a polygon, divided into a finite element grid T_h by simplices, with h = max_{τ∈T_h} diam(τ). A typical finite element basis is the linear hat basis function, satisfying

$$ \varphi_k(x_j) = \delta_{kj}, \quad x_j \in \mathcal{N}_h, \tag{2.7} $$

where N_h stands for the set of nodes of the grid T_h. Schematic diagrams of the basis functions in 1D and 2D are shown in Fig. 1. On the other hand, if we choose global basis functions such as the Fourier basis or the Legendre basis [14], the R-G method is called a spectral method.

Figure 1: The finite element basis function in 1d and 2d.
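To make the discretization concrete, the following is a minimal 1D sketch (not code from the paper) of assembling and solving the R-G equation (2.6) with linear hat basis functions on a uniform grid; the grid size and the source term are illustrative assumptions.

```python
import numpy as np

# Solve -u'' = f on (0, 1), u(0) = u(1) = 0, via the R-G equation (2.6)
# with linear hat basis functions phi_j on a uniform interior grid.
m = 50                                    # number of basis functions
h = 1.0 / (m + 1)
nodes = np.linspace(h, 1.0 - h, m)        # interior nodes x_j

# Stiffness matrix a(phi_k, phi_j) = int phi_k' phi_j' dx; tridiagonal for hats.
A = (2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)) / h

f = lambda x: np.pi**2 * np.sin(np.pi * x)   # example source; exact u = sin(pi*x)
rhs = h * f(nodes)                           # (f, phi_j) by nodal quadrature

c = np.linalg.solve(A, rhs)                  # coefficients c_k in (2.5)
print(np.abs(c - np.sin(np.pi * nodes)).max())   # small: u_h matches u at the nodes
```

When f is known only at scattered sample points, the right-hand side (f, φ_j) can no longer be integrated exactly; the Monte Carlo form (2.8) below replaces it.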
The error estimate theory of the R-G method is well established. Under suitable assumptions on the regularity of the solution, the linear finite element solution u_h satisfies the error estimate

$$ \| u - u_h \|_1 \le C h \, |u|_2, $$

where the constant C is independent of the grid size h. The spectral method has the error estimate

$$ \| u - u_h \| \le C m^{-s}, $$

where C is a constant and the exponent s depends only on the regularity (smoothness) of the solution u. If u is smooth enough and satisfies certain boundary conditions, the spectral method achieves spectral accuracy.

In this paper, we use the R-G method to solve the Poisson problem (2.1) in a special setting, i.e., we only have the information of f(x) at the n sample points x_i (i = 1, ···, n). In this situation, the integral on the right-hand side (r.h.s.) of the R-G equation (2.6) cannot be computed exactly, so we compute it with a suitable numerical method. For the higher-dimensional case, it is known that Monte Carlo (MC) integration [13] may be the only viable approach. Replacing the integral on the r.h.s. of (2.6) with the MC integration formula, we obtain

$$ \sum_{k=1}^{m} c_k\, a(\varphi_k, \varphi_j) = \frac{1}{n} \sum_{i=1}^{n} f(x_i) \varphi_j(x_i), \quad j = 1, \cdots, m. \tag{2.8} $$

In fact, if we use a Gaussian quadrature rule to compute the integral on the r.h.s. of (2.6), we still obtain a form similar to (2.8), except that 1/n is replaced by the corresponding Gaussian quadrature weights w_i. Because of the inexactness of the right-hand-side integral, equation (2.8) actually differs from the traditional R-G method. However, it lets us see clearly the bias of the R-G method under this special setting in the numerical experiments below.

2.2 DNN method

We now introduce the DNN method. An L-layer neural network is denoted by

$$ u_\theta(x) = W^{[L-1]} \sigma \circ \big( W^{[L-2]} \sigma \circ ( \cdots ( W^{[0]} x + b^{[0]} ) \cdots ) + b^{[L-2]} \big) + b^{[L-1]}, \tag{2.9} $$

where W^{[l]} ∈ R^{m_{l+1} × m_l}, b^{[l]} ∈ R^{m_{l+1}}, m_0 = d, m_L = 1, σ is a scalar activation function, and "∘" means entry-wise operation. We denote the set of parameters by

$$ \theta = \big( W^{[0]}, W^{[1]}, \ldots, W^{[L-1]}, b^{[0]}, b^{[1]}, \ldots, b^{[L-1]} \big), $$

and an entry of W^{[l]} by W^{[l]}_{ij}.

In particular, a one-hidden-layer DNN with activation function σ is given as

$$ u_\theta(x) = \sum_{k=1}^{m} c_k \sigma( w_k \cdot x + b_k ), \tag{2.10} $$

where w_k ∈ R^d and c_k, b_k ∈ R are parameters. If we denote σ(w_k · x + b_k) = φ_k(x), then we obtain a form similar to the R-G solution (2.5), i.e.,

$$ u_\theta(x) = \sum_{k=1}^{m} c_k \varphi_k(x). \tag{2.11} $$
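As a concrete reading of (2.10)-(2.11), the following sketch (with illustrative shapes and values, not the paper's code) evaluates a one-hidden-layer network, where each hidden neuron plays the role of an adaptive basis function φ_k.

```python
import numpy as np

# One-hidden-layer network (2.10): u_theta(x) = sum_k c_k * sigma(w_k . x + b_k).
# Shapes: x is (N, d), W stacks the w_k as columns (d, m), b is (m,), c is (m,).
rng = np.random.default_rng(0)
d, m = 2, 100
W = rng.normal(size=(d, m))
b = rng.normal(size=m)
c = rng.normal(size=m) / m

def u_theta(x, sigma=np.tanh):
    # sigma(x @ W + b) evaluates the adaptive "bases" phi_k(x) of (2.11).
    return sigma(x @ W + b) @ c

x = rng.uniform(size=(5, d))
print(u_theta(x))          # network output at five input points
```

Unlike (2.5), the "bases" here change during training, since W and b are learned together with c.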
The difference between the solution expressions of these two methods is that the basis functions of the R-G solution are known in advance, while the bases of the DNN solution are unknown and must be obtained, together with the coefficients, through a gradient descent algorithm on a loss function. The loss function corresponding to problem (2.1) is given by

$$ L(u_\theta, f) = \frac{1}{n} \sum_{i=1}^{n} \big( \Delta u_\theta(x_i) + f(x_i) \big)^2 + \beta \int_{\partial\Omega} u_\theta(x)^2\, ds, \tag{2.12} $$

or the variational form [4]

$$ L(u_\theta, f) = \frac{1}{n} \sum_{i=1}^{n} \Big( \frac{1}{2} |\nabla_x u_\theta(x_i)|^2 - f(x_i) u_\theta(x_i) \Big) + \beta \int_{\partial\Omega} u_\theta(x)^2\, ds, \tag{2.13} $$

where the last term enforces the boundary condition and β is a hyper-parameter.

2.3 F-Principle

In this subsection, we illustrate and introduce a rigorous definition of the F-Principle. We start by considering a two-layer neural network, following [11, 19],

$$ h(x, \theta) = \sum_{j=1}^{m} a_j \sigma\big( w_j^\top x - |w_j| c_j \big), \tag{2.14} $$

where w_j, x ∈ R^d, θ = (a^⊤, w_1^⊤, ..., w_m^⊤, c^⊤)^⊤ with a, c ∈ R^m and W = (w_1, ..., w_m)^⊤ ∈ R^{m×d}, and σ(z) = max(z, 0) (z ∈ R) is the ReLU activation function. Note that this two-layer model is written slightly differently from the model in (2.11), to ease the calculations in [11, 19]. The target function is denoted by f(x). The network is trained with the mean-squared error (MSE) loss

$$ L = \int_{\mathbb{R}^d} |h(x, \theta) - f(x)|^2 \rho(x)\, dx, \tag{2.15} $$

where ρ(x) is a probability density. For finite samples, we have

$$ \rho(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i). \tag{2.16} $$

For any function g defined on R^d, we use the following convention for the Fourier transform and its inverse:

$$ \mathcal{F}[g](\xi) = \int_{\mathbb{R}^d} g(x) e^{-2\pi i \xi^\top x}\, dx, \qquad g(x) = \int_{\mathbb{R}^d} \mathcal{F}[g](\xi) e^{2\pi i x^\top \xi}\, d\xi, $$

where ξ ∈ R^d denotes the frequency.

The study in [11, 19] shows that, when the number of neurons m is sufficiently large, training the network (2.14) with gradient descent is described by the differential equation

$$ \partial_t \mathcal{F}[h](\xi) = -\Gamma(\xi)\, \mathcal{F}[(h - f)\rho](\xi) \tag{2.17} $$

with initial value F[h_ini](ξ). The long-time solution of (2.17) is equivalent to solving the following optimization problem:

$$ \min_{h - h_{\mathrm{ini}} \in F_\gamma} \int_{\mathbb{R}^d} \Gamma^{-1}(\xi)\, |\mathcal{F}[h - h_{\mathrm{ini}}](\xi)|^2\, d\xi, \tag{2.18} $$

$$ \text{s.t.} \quad h(x_i) = f(x_i) \quad \text{for } i = 1, \cdots, n, \tag{2.19} $$

where

$$ \Gamma(\xi) = \frac{1}{m} \sum_{j=1}^{m} \big( \|w_j(0)\|^2 + a_j(0)^2 \big)\, \|\xi\|^{-(d+3)} + \frac{1}{4\pi^2 m} \sum_{j=1}^{m} \|w_j(0)\|^2 a_j(0)^2\, \|\xi\|^{-(d+1)}. \tag{2.20} $$

Here ‖·‖ denotes the L² norm, w_j(0) and a_j(0) are the initial parameters before training, and

$$ F_\gamma = \Big\{ h \,\Big|\, \int_{\mathbb{R}^d} \Gamma^{-1}(\xi) |\mathcal{F}[h](\xi)|^2\, d\xi < \infty \Big\}. \tag{2.21} $$

Since Γ(ξ) decreases monotonically in ‖ξ‖, the gradient flow (2.17) rigorously expresses the F-Principle, i.e., lower frequencies converge faster. The minimization problem (2.18) clearly shows that the DNN has an implicit bias in addition to the sample constraints (2.19). As Γ(ξ)^{-1} increases monotonically in ‖ξ‖, the optimization prefers a function with fewer high-frequency components, which explicates the implicit bias of the F-Principle: DNNs prefer low frequencies [17, 18].
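The monotone decay of Γ(ξ) is easy to check numerically. The sketch below evaluates (2.20), as reconstructed above, at a random initialization; the sizes and the frequency range are illustrative assumptions.

```python
import numpy as np

# Evaluate Gamma(xi) of (2.20) at a random initialization and verify that it
# decreases monotonically in ||xi||: lower frequencies carry larger weight.
rng = np.random.default_rng(0)
d, m = 1, 1000
w0 = rng.normal(size=(m, d))               # w_j(0)
a0 = rng.normal(size=m)                    # a_j(0)

def gamma(xi_norm):
    t1 = np.mean(np.sum(w0**2, axis=1) + a0**2) * xi_norm ** (-(d + 3))
    t2 = np.mean(np.sum(w0**2, axis=1) * a0**2) / (4 * np.pi**2) * xi_norm ** (-(d + 1))
    return t1 + t2

xi = np.linspace(0.5, 50.0, 200)
print(np.all(np.diff(gamma(xi)) < 0))      # True: Gamma is decreasing in ||xi||
```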
3 Difference between the R-G method and the DNN method in solving PDEs

3.1 Theoretical study

In the classical case, f(x) is a given function, so the integral on the right-hand side of the R-G equation (2.6) can be computed exactly. As the number of basis functions m approaches infinity, the numerical solution obtained by the R-G method (2.6) approximates the exact solution of problem (2.1). It is interesting to ask: if we only have the information of f at the finite n points, what happens to the numerical solution obtained by (2.8) when m → ∞?

Fixing the number of sample points n, we study the property of the solution of the numerical method (2.8). We have the following theorem.

Theorem 1.
When m → ∞, the numerical method (2.8) solves the problem

$$ -\Delta u(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i) f(x_i), \quad x \in \Omega, \qquad u(x) = 0, \quad x \in \partial\Omega, \tag{3.1} $$

where δ(x) represents the Dirac delta function.

Proof.
By the filtering (sifting) property of the delta function, the quadrature formula for the integral on the right-hand side (r.h.s.) of equation (2.8) is the exact integral of the r.h.s. of equation (3.1) against the basis function φ_j(x), i.e.,

$$ \frac{1}{n} \sum_{i=1}^{n} f(x_i) \varphi_j(x_i) = \int_{\Omega} \frac{1}{n} \sum_{i=1}^{n} f(x_i) \delta(x - x_i)\, \varphi_j(x)\, dx, \quad j = 1, \cdots, m. $$

Note that δ is a continuous linear functional. According to the error estimation theory [1, 14], the finite-dimensional system (2.8) therefore approximates problem (3.1) as m → ∞.

Remark:
For the 1D case, the analytic solution of problem (3.1) on [a, b] is the piecewise linear function

$$ u(x) = \frac{1}{n} \sum_{i=1}^{n} f(x_i)(b - x_i) \frac{x - a}{b - a} - \frac{1}{n} \sum_{i=1}^{n} f(x_i)(x - x_i) H(x - x_i), \tag{3.2} $$

where H(x) is the Heaviside step function,

$$ H(x) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0. \end{cases} $$

For the 2D case, the solution of problem (3.1) on [0, a] × [0, b] can be expressed through the Green's function [12] as

$$ u(x, y) = \frac{4}{nab} \sum_{i=1}^{n} f_i \sum_{k=1}^{\infty} \sum_{l=1}^{\infty} \frac{\sin(p_k x) \sin(q_l y) \sin(p_k x_i) \sin(q_l y_i)}{p_k^2 + q_l^2}, \tag{3.3} $$

where f_i = f(x_i, y_i), p_k = πk/a and q_l = πl/b. We can prove that this series diverges at the sampling points (x_i, y_i) (i = 1, ···, n) and converges at all other points. Therefore, the 2D exact solution u(x, y) is highly singular.
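The piecewise linear limit (3.2) is straightforward to evaluate directly. A minimal sketch, with hypothetical sample data on [a, b] = [−1, 1], is:

```python
import numpy as np

# Evaluate the exact solution (3.2) of problem (3.1) in 1D: a piecewise linear
# function whose kinks sit exactly at the sample points x_i.
def u_exact(x, x_s, f_s, a=-1.0, b=1.0):
    x = np.asarray(x)[:, None]                     # (N, 1) against (n,) samples
    lin = (f_s * (b - x_s)).mean() * (x[:, 0] - a) / (b - a)
    kinks = ((x - x_s) * (x >= x_s) * f_s).mean(axis=1)   # Heaviside terms
    return lin - kinks

x_s = np.array([-0.7, -0.2, 0.1, 0.5, 0.8])        # illustrative sample locations
f_s = np.array([0.3, -0.1, 0.2, -0.4, 0.1])        # illustrative sampled values
print(u_exact(np.linspace(-1, 1, 9), x_s, f_s))    # vanishes at x = -1 and x = 1
```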
3.2 Numerical study

Although the R-G method and the DNN method can be equivalent to each other in the sense of approximation, in this subsection we present three examples to investigate the difference between the solution obtained by the R-G method and the one obtained by the DNN with gradient descent optimization. For simplicity, we consider the following 1D Poisson problem:

$$ -u''(x) = f(x), \quad x \in (-1, 1), \qquad u(-1) = u(1) = 0, \tag{3.4} $$

where we only know the values of f(x) at n points, i.e., f(x_i) (i = 1, ···, n). In the experiments, the f(x_i) are sampled from the function f(x) = −(x³ − x) exp(−x²).
Figure 2: (Example 1): Numerical solutions of the SM with 5 sampling points; the four panels correspond to m = 5, 10, 50, 500, each showing the numerical solution, the exact solution, and the locations of the samples.
Example 1:
Fixing the number of sampling points at n = 5, we use the R-G method and the DNN method to solve problem (3.4), respectively. We choose a small number of sample points here so that the properties of the solutions can be investigated clearly. The results are shown as follows.
R-G method.
First, we use the R-G method to solve problem (3.4), specifically the spectral method with the Fourier basis functions

$$ \varphi_k(x) = \sin(k\pi x), \quad k = 1, \cdots, m. $$

We take the number of basis functions m = 5, 10, 50, 500. As Fig. 2 shows, the R-G solution approaches the piecewise linear function (3.2) as m → ∞. This result is consistent with the solution property analyzed in Theorem 1. A minimal numerical sketch of this spectral computation is given below.
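For this Fourier basis, the stiffness matrix is diagonal on (−1, 1), a(φ_k, φ_j) = (kπ)² δ_kj, so the sampled system (2.8) can be solved in closed form. In the sketch below, the sample data and the helper name rg_spectral are illustrative assumptions.

```python
import numpy as np

# Spectral R-G solution of -u'' = f on (-1, 1), u(+-1) = 0, from n samples of f,
# using phi_k(x) = sin(k*pi*x), so that (2.8) decouples: c_k = rhs_k / (k*pi)^2.
def rg_spectral(x_s, f_s, m):
    k = np.arange(1, m + 1)
    rhs = np.sin(np.pi * np.outer(k, x_s)) @ f_s / len(x_s)   # (f, phi_k), MC rule
    return rhs / (k * np.pi) ** 2

rng = np.random.default_rng(0)
x_s = rng.uniform(-1, 1, 5)                    # n = 5 sample locations
f_s = rng.normal(size=5)                       # sampled values of f

x = np.linspace(-1, 1, 1001)
for m in (5, 10, 50, 500):
    c = rg_spectral(x_s, f_s, m)
    u = np.sin(np.pi * np.outer(np.arange(1, m + 1), x)).T @ c
    print(m, float(np.abs(u).max()))
```

As m grows, the reconstructed u develops kinks at the sample locations, approaching the piecewise linear function (3.2).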
DNN method.

For a better comparison with the R-G method, we choose sin(x) as the activation function of a one-hidden-layer DNN. The number of neurons is taken as m = 5, 10, 50, 500, and the penalty parameter β = 10. We reduce the loss to the order of 1e-4 with a learning rate of 1e-4, and 1000 test points are used to plot the figures. The DNN solutions are shown in Fig. 3, in which we observe that the DNN solutions stay smooth even when m is very large; a training sketch in the spirit of this setup is given after the caption.

Figure 3: (Example 1): Numerical solutions of the DNN method with 5 sampling points.
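A minimal training sketch using the residual loss (2.12); PyTorch is used here for the automatic differentiation of u'', and the sample locations, sampled values, optimizer, and stopping rule are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

# One-hidden-layer DNN with sin activation for -u'' = f on (-1, 1), trained on
# loss (2.12): mean squared residual at n samples + beta * boundary penalty.
torch.manual_seed(0)
n, m, beta = 5, 500, 10.0
x_s = torch.linspace(-0.8, 0.8, n).reshape(-1, 1)   # assumed sample locations
f_s = torch.randn(n, 1)                             # assumed sampled values of f

W = torch.randn(1, m, requires_grad=True)
b = torch.randn(m, requires_grad=True)
c = (0.1 * torch.randn(m, 1)).requires_grad_()
u = lambda x: torch.sin(x @ W + b) @ c

x_bdry = torch.tensor([[-1.0], [1.0]])
opt = torch.optim.Adam([W, b, c], lr=1e-4)
for step in range(200000):
    x = x_s.clone().requires_grad_()
    du = torch.autograd.grad(u(x).sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    loss = ((d2u + f_s) ** 2).mean() + beta * (u(x_bdry) ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if loss.item() < 1e-4:      # train to the loss level reported in the text
        break
```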
In this example, we use the ReLU function as the basis function in the R-G method and as the activation function in the DNN method, and repeat the experiments of Example 1. Here we randomly choose another set of 10 sampling points.

Since in one dimension the linear finite element function φ_j(x) can be expressed by ReLU functions, namely,

$$ \varphi_j(x) = \frac{1}{h_{j-1}} \mathrm{ReLU}(x - x_{j-1}) - \Big( \frac{1}{h_{j-1}} + \frac{1}{h_j} \Big) \mathrm{ReLU}(x - x_j) + \frac{1}{h_j} \mathrm{ReLU}(x - x_{j+1}), $$

where h_j = x_{j+1} − x_j, we can use ReLU as the basis functions of the R-G method. For convenience, we simply use the linear finite element functions φ_j(x) instead of the ReLU functions as the basis. Fig. 4 shows that, beyond doubt, the FEM solution approaches the piecewise linear solution (3.2); the ReLU identity above can also be checked numerically (see the sketch after Fig. 4).
Figure 4: (Example 2): Numerical solutions of the FEM with 10 sampling points; the four panels correspond to m = 5, 10, 50, 500, each showing the numerical solution, the exact solution, and the locations of the samples.
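The ReLU representation of the hat function used above is easy to verify numerically; a quick sketch with illustrative grid nodes:

```python
import numpy as np

# Check: phi_j(x) = ReLU(x - x_{j-1})/h_{j-1}
#                 - (1/h_{j-1} + 1/h_j) * ReLU(x - x_j)
#                 + ReLU(x - x_{j+1})/h_j
# equals the linear hat function on the nodes x_{j-1} < x_j < x_{j+1}.
relu = lambda t: np.maximum(t, 0.0)

xjm1, xj, xjp1 = -0.3, 0.1, 0.6            # three consecutive grid nodes
hjm1, hj = xj - xjm1, xjp1 - xj
x = np.linspace(-1.0, 1.0, 2001)

via_relu = (relu(x - xjm1) / hjm1
            - (1.0 / hjm1 + 1.0 / hj) * relu(x - xj)
            + relu(x - xjp1) / hj)
hat = np.clip(np.minimum((x - xjm1) / hjm1, (xjp1 - x) / hj), 0.0, None)

print(np.abs(via_relu - hat).max())        # ~1e-16: the two forms coincide
```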
In the DNN method, we choose the number of neurons m = 5, 10, 50, 500 as before; the corresponding solutions, shown in Fig. 5, again stay smooth as m grows.

Figure 5: (Example 2): Numerical solutions of the DNN method with ReLU activation function and 10 sampling points.

Example 3:
We consider the 2D case

$$ -\Delta u(x) = f(x), \quad x \in (0,1)^2, \qquad u(x) = 0, \quad x \in \partial(0,1)^2, $$

where x = (x, y), and we know the values of f at n points sampled from the function f(x) := f(x, y) = 2π² sin(πx) sin(πy). We fix the number of sample points to n = 5 in each direction; the sampling points in the x direction and the y direction are both located at the same set of nodes x_h. We test the solution with the number of basis functions m = 5, 50, 100, 200. The R-G solutions are shown in Fig. 6, and Fig. 7 plots the profiles of the R-G solutions at y = 0.5 for different m, in which we can see that the values of the numerical solutions at the sampling points get larger and larger with the increase of m. However, Fig. 8 shows that the DNN solutions are stable, without singularity, for large m.

Figure 6: (Example 3): R-G solutions with different m. The basis functions for the first and the second row are the Legendre basis functions and the piecewise linear basis functions, respectively.

Figure 7: (Example 3): Profiles u(x, y = 0.5) of the R-G solutions with different m (left: SM with n = 5; right: FEM with n = 5; each panel shows m = 5, 50, 100, 200).

4 Understanding the smoothness of the DNN solution

DNNs are widely used in solving PDEs, especially for high-dimensional problems. Optimizing the loss functions in Eqs. (2.12) and (2.13) is equivalent to solving (2.6), except that the bases in (2.12) and (2.13) are adaptive. In addition, the DNN problem is optimized by (stochastic) gradient descent.
Figure 8: (Example 3): DNN solutions with different m. The activation functions for the first and the second row are ReLU(x) and sin(x), respectively.

The experiments in the previous section have shown that, as the number of bases goes to infinity, the DNN method solves (2.1) with a relatively smooth and stable function, compared with the singular one given by Theorem 1. We now utilize the F-Principle to understand what leads to this smoothness.
For two-layer wide DNNs with d = 1, the two terms of Γ(ξ) in the minimization problem (2.18) yield different fitting results. Note that taking Γ(ξ) = ‖ξ‖^{-2}, i.e., minimizing ∫_R ‖ξ‖² |F[h](ξ)|² dξ, leads to a piecewise linear function, while taking Γ(ξ) = ‖ξ‖^{-4}, i.e., minimizing ∫_R ‖ξ‖⁴ |F[h](ξ)|² dξ, leads to a cubic spline; by Parseval's theorem, these two weighted integrals are proportional to ∫_R |h'(x)|² dx and ∫_R |h''(x)|² dx, whose constrained minimizers are the piecewise linear interpolant and the natural cubic spline, respectively. Since the DNN combines both terms, it yields a function much smoother than a piecewise linear one. For a general DNN, the coefficient Γ(ξ) in (2.17) cannot be obtained exactly; however, based on the F-Principle, Γ(ξ) can be postulated to decrease monotonically with respect to ξ.

5 Conclusion

This paper compares the different behaviors of the Ritz-Galerkin method and the DNN method in solving PDEs, in order to better understand the working principles of DNNs. We consider a particular Poisson problem (2.1) in which the right-hand-side term f is known only at discrete sample points, and we analyze in theory why the two numerical methods behave differently: the R-G method treats the discrete f as a linear combination of Dirac delta functions, while the DNN method, due to the F-Principle, implicitly biases towards functions with more low-frequency components when interpolating the discrete sampling points. Furthermore, the numerical experiments show that, as the number of bases increases, the solutions obtained by the R-G method approach a piecewise linear function in the 1D case and a singular function in the 2D case, regardless of the choice of basis functions, while the solutions obtained by the DNN method are smooth in the 1D case and stable in the 2D case. In conclusion, based on the theoretical and numerical study in comparison with traditional methods for solving PDEs, the DNN method possesses a special implicit bias towards low frequencies, which leads to a well-behaved solution even in a heavily over-parameterized setting.

Acknowledgments
Zhiqin Xu is supported by the National Key R&D Program of China (2019YFA0709503), the Shanghai Sailing Program, the Natural Science Foundation of Shanghai (20ZR1429000), and partially supported by the HPC of the School of Mathematical Sciences at Shanghai Jiao Tong University. Jihong Wang and Jiwei Zhang are partially supported by NSFC under No. 11771035 and NSAF U1930402, the Natural Science Foundation of Hubei Province No. 2019CFA007, and Xiangtan University 2018ICIP01.
References

[1] S. C. Brenner and L. R. Scott, The mathematical theory of finite element methods, Springer, New York, third edition, 2008.
[2] W. Cai, X. G. Li and L. Z. Liu, A phase shift deep neural network for high frequency approximation and wave problems, SIAM J. Sci. Comput., 42(2020), A3285-A3312.
[3] Z. Q. Cai, J. S. Chen, M. Liu and X. Y. Liu, Deep least-squares methods: an unsupervised learning-based numerical method for solving elliptic PDEs, J. Comput. Phys., 420(2020), 109707.
[4] W. N. E, J. Q. Han and A. Jentzen, Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations, Commun. Math. Stat., 5(2017), 349-380.
[5] W. N. E and B. Yu, The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems, Commun. Math. Stat., 6(2018), 1-12.
[6] A. Hamilton, T. Tran, M. Mckay, B. Quiring and P. Vassilevski, DNN approximation of nonlinear finite element equations, Tech. rep., Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States), 2019.
[7] J. Q. Han, A. Jentzen and E. Weinan, Solving high-dimensional partial differential equations using deep learning, P. Natl. Acad. Sci., 115(2018), 8505-8510.
[8] J. C. He, L. Li, J. C. Xu and C. Y. Zheng, ReLU deep neural networks and linear finite elements, J. Comput. Math., 38(2020), 502-527.
[9] Y. L. Liao and P. B. Ming, Deep Nitsche method: Deep Ritz method with essential boundary conditions, arXiv preprint arXiv:1912.01309, 2019.
[10] Z. Q. Liu, W. Cai and Z. Q. J. Xu, Multi-scale deep neural network (MscaleDNN) for solving Poisson-Boltzmann equation in complex domains, Commun. Comput. Phys., to appear.
[11] T. Luo, Z. Ma, Y. Y. Zhang and Z. Q. J. Xu, On the exact computation of linear frequency principle dynamics and its generalization, arXiv preprint arXiv:2010.08153, 2020.
[12] A. D. Polyanin, Handbook of Linear Partial Differential Equations for Engineers and Scientists, Chapman & Hall/CRC, 2002.
[13] C. P. Robert and G. Casella, Monte Carlo Statistical Methods, Springer, New York, 2004.
[14] J. Shen, T. Tang and L. L. Wang, Spectral methods: Algorithms, analysis and applications, Springer, 2011.