Implicit bias with Ritz-Galerkin method in understanding deep learning for solving PDEs
Jihong Wang¹, Zhi-Qin John Xu², Jiwei Zhang³ and Yaoyu Zhang²

¹ Beijing Computational Science Research Center, Beijing 100193, P.R. China.
² School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC and Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai 200240, P.R. China.
³ School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China.
Abstract.
This paper aims at studying the difference between the Ritz-Galerkin (R-G) method and the deep neural network (DNN) method in solving partial differential equations (PDEs), in order to better understand deep learning. To this end, we consider solving a particular Poisson problem, where the information of the right-hand side of the equation f is only available at n sample points, that is, f is known only at finitely many points. Through both theoretical and numerical studies, we show that the solution of the R-G method converges to a piecewise linear function for the one-dimensional (1D) problem, or to functions of lower regularity for high-dimensional problems. With the same setting, DNNs however learn a relatively smooth solution regardless of the dimension, that is, DNNs are implicitly biased towards functions with more low-frequency components among all functions that can fit the equation at the available data points. This bias is explained by the recent study of the frequency principle (Xu et al. (2019) [17] and Zhang et al. (2019) [11, 19]). In addition to the similarity between traditional numerical methods and DNNs from the approximation perspective, our work shows that the implicit bias in the learning process, which is absent in traditional numerical methods, could help better understand the characteristics of DNNs.

AMS subject classifications: 35Q68, 65N30, 65N35
Key words: Deep learning, Ritz-Galerkin method, Partial differential equations, F-Principle
1 Introduction

Deep neural networks (DNNs) have become increasingly important in scientific computing [3-9, 15, 16]. A major potential advantage over traditional numerical methods is that DNNs could overcome the curse of dimensionality in high-dimensional problems.
Email addresses: [email protected] (J. Wang), [email protected] (Z. Xu), [email protected] (J. Zhang), [email protected] (Y. Zhang)

Through comparison with traditional numerical methods, several studies have made progress on understanding the algorithmic characteristics of DNNs. For example, by exploring the ReLU DNN representation of continuous piecewise linear functions in FEM, the work [8] theoretically establishes that a ReLU DNN can accurately represent any linear finite element function. Regarding convergence behavior, the works [17, 18] show a Frequency Principle (F-Principle): DNNs often learn low-frequency components first, while most conventional iterative methods (e.g., the Jacobi method) exhibit the opposite convergence behavior, that is, higher-frequency components are learned faster. These understandings could lead to a better use of DNNs in practice; for example, DNN-based algorithms have been proposed based on the F-Principle to quickly eliminate high-frequency error [2, 10].

The aim of this paper is to investigate the different behaviors of DNNs and a traditional numerical method, the Ritz-Galerkin (R-G) method. To this end, we use an example that shows their stark difference, namely, solving PDEs given only a few sample points. We denote by n the number of samples, and by m the number of basis functions in the R-G method or the number of neurons in DNNs. In traditional PDE models, one considers the situation where the source function in the equation is completely known, i.e., the sample number n can go to infinity. In practical applications, however, such as signal processing, statistical mechanics, and chemical and biophysical dynamic systems, we often encounter problems where only a few sample values can be obtained. It is interesting to ask how R-G methods behave on this particular problem, and what solution the DNN method would obtain.

In this paper, we show that the R-G method treats the discrete sampling points as a linear combination of Dirac delta functions, while the DNN method always uses a relatively smooth function to interpolate the discrete sampling points. We incorporate the F-Principle to show how the DNN differs from the R-G method: among all functions that can fit the training data, DNNs are implicitly biased towards functions with more low-frequency components. In addition to the similarity between traditional numerical methods and DNNs from the approximation perspective [8], our work shows that the implicit bias in the learning process, which is absent in traditional numerical methods, could help better understand the characteristics of DNNs.

The rest of the paper is organized as follows. In section 2, we briefly introduce the R-G method and the DNN method. In sections 3 and 4, we present the difference between these two methods in solving PDEs, theoretically and numerically. We end the paper with a conclusion in section 5.

2 R-G method and DNN method

In this section we take the toy model of Poisson's equation as an example to investigate the difference in solution behavior between the R-G method and the DNN method.
We consider the d-dimensional Poisson problem posed on a bounded domain Ω ⊂ R^d with Dirichlet boundary condition,

$$ -\Delta u(x) = f(x), \quad x \in \Omega, \qquad u(x) = 0, \quad x \in \partial\Omega, \tag{2.1} $$

where Δ represents the Laplace operator and x = (x_1, x_2, ···, x_d) is a d-dimensional vector. It is known that problem (2.1) admits a unique solution for f ∈ L^2(Ω), and its regularity can be raised to C_b^{s+2}(Ω) if f ∈ C_b^s(Ω) for some s ≥ 0.
In the literature, there are a number of effective numerical methods to solve problem (2.1) in the general case. We here consider a special situation: we only have the information of f(x) at the n sample points x_i (i = 1, ···, n). In practical applications, we may imagine that we only have finite experimental data, i.e., the values f(x_i) (i = 1, ···, n), and no further information about f(x) at other points. Through solving such a particular Poisson problem (2.1) with the R-G method and the deep learning method, we aim to find the bias of these two methods in solving PDEs.

2.1 R-G method

In this subsection, we briefly introduce the R-G method [1]. For problem (2.1), we construct the functional

$$ J(u) = \frac{1}{2} a(u,u) - (f,u), \tag{2.2} $$

where

$$ a(u,v) = \int_{\Omega} \nabla u(x) \cdot \nabla v(x)\, dx, \qquad (f,v) = \int_{\Omega} f(x) v(x)\, dx. $$

The variational form of problem (2.1) is the following: find u ∈ H_0^1(Ω) such that

$$ J(u) = \min_{v \in H_0^1(\Omega)} J(v). \tag{2.3} $$

The weak form of (2.3) is to find u ∈ H_0^1(Ω) such that

$$ a(u,v) = (f,v), \quad \forall v \in H_0^1(\Omega). \tag{2.4} $$

Problem (2.1) is the strong form if the solution u ∈ H^2(Ω). To solve (2.4) numerically, we introduce a finite-dimensional space U_h to approximate the infinite-dimensional space H_0^1(Ω). Let U_h ⊂ H_0^1(Ω) be a subspace with a sequence of basis functions {φ_1, φ_2, ···, φ_m}. The numerical solution u_h ∈ U_h is represented as

$$ u_h = \sum_{k=1}^{m} c_k \varphi_k, \tag{2.5} $$

where the coefficients {c_k} are the unknown values to be solved for. Replacing H_0^1(Ω) by U_h, both problems (2.3) and (2.4) can be transformed into the following system:

$$ \sum_{k=1}^{m} c_k\, a(\varphi_k, \varphi_j) = (f, \varphi_j), \quad j = 1, \cdots, m. \tag{2.6} $$

From (2.6) we can compute the coefficients c_k and then obtain the numerical solution u_h. Equation (2.6) is usually called the R-G equation.

For different types of basis functions, the R-G method specializes to the finite element method (FEM), the spectral method (SM), and so on. If the basis functions {φ_k(x)} are local, namely compactly supported, the method is usually regarded as the FEM. Assume that Ω is a polygon, divided into a finite element grid T_h by simplices, with h = max_{τ∈T_h} diam(τ). A typical finite element basis is the linear hat basis function, satisfying

$$ \varphi_k(x_j) = \delta_{kj}, \quad x_j \in \mathcal{N}_h, \tag{2.7} $$

where N_h stands for the set of nodes of the grid T_h. Schematic diagrams of the basis functions in 1D and 2D are shown in Fig. 1. On the other hand, if we choose global basis functions such as the Fourier basis or the Legendre basis [14], the R-G method is called a spectral method.

Figure 1: The finite element basis function in 1d and 2d.
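To make the discretization concrete, the following is a minimal 1D sketch (not code from the paper) of assembling and solving the R-G equation (2.6) with linear hat basis functions on a uniform grid; the grid size and the source term are illustrative assumptions.

```python
import numpy as np

# Solve -u'' = f on (0, 1), u(0) = u(1) = 0, via the R-G equation (2.6)
# with linear hat basis functions phi_j on a uniform interior grid.
m = 50                                    # number of basis functions
h = 1.0 / (m + 1)
nodes = np.linspace(h, 1.0 - h, m)        # interior nodes x_j

# Stiffness matrix a(phi_k, phi_j) = int phi_k' phi_j' dx; tridiagonal for hats.
A = (2.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)) / h

f = lambda x: np.pi**2 * np.sin(np.pi * x)   # example source; exact u = sin(pi*x)
rhs = h * f(nodes)                           # (f, phi_j) by nodal quadrature

c = np.linalg.solve(A, rhs)                  # coefficients c_k in (2.5)
print(np.abs(c - np.sin(np.pi * nodes)).max())   # small: u_h matches u at the nodes
```

When f is known only at scattered sample points, the right-hand side (f, φ_j) can no longer be integrated exactly; the Monte Carlo form (2.8) below replaces it.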
The error estimate theory of the R-G method is well established. Under suitable assumptions on the regularity of the solution, the linear finite element solution u_h satisfies the error estimate

$$ \| u - u_h \|_1 \le C h \, |u|_2, $$

where the constant C is independent of the grid size h. The spectral method has the error estimate

$$ \| u - u_h \| \le C m^{-s}, $$

where C is a constant and the exponent s depends only on the regularity (smoothness) of the solution u. If u is smooth enough and satisfies certain boundary conditions, the spectral method achieves spectral accuracy.

In this paper, we use the R-G method to solve the Poisson problem (2.1) in a special setting, i.e., we only have the information of f(x) at the n sample points x_i (i = 1, ···, n). In this situation, the integral on the right-hand side (r.h.s.) of the R-G equation (2.6) cannot be computed exactly, so we compute it with a suitable numerical method. For the higher-dimensional case, it is known that Monte Carlo (MC) integration [13] may be the only viable approach. Replacing the integral on the r.h.s. of (2.6) with the MC integration formula, we obtain

$$ \sum_{k=1}^{m} c_k\, a(\varphi_k, \varphi_j) = \frac{1}{n} \sum_{i=1}^{n} f(x_i) \varphi_j(x_i), \quad j = 1, \cdots, m. \tag{2.8} $$

In fact, if we use a Gaussian quadrature rule to compute the integral on the r.h.s. of (2.6), we still obtain a form similar to (2.8), except that 1/n is replaced by the corresponding Gaussian quadrature weights w_i. Because of the inexactness of the right-hand-side integral, equation (2.8) actually differs from the traditional R-G method. However, it lets us see clearly the bias of the R-G method under this special setting in the numerical experiments below.

2.2 DNN method

We now introduce the DNN method. An L-layer neural network is denoted by

$$ u_\theta(x) = W^{[L-1]} \sigma \circ \big( W^{[L-2]} \sigma \circ ( \cdots ( W^{[0]} x + b^{[0]} ) \cdots ) + b^{[L-2]} \big) + b^{[L-1]}, \tag{2.9} $$

where W^{[l]} ∈ R^{m_{l+1} × m_l}, b^{[l]} ∈ R^{m_{l+1}}, m_0 = d, m_L = 1, σ is a scalar activation function, and "∘" means entry-wise operation. We denote the set of parameters by

$$ \theta = \big( W^{[0]}, W^{[1]}, \ldots, W^{[L-1]}, b^{[0]}, b^{[1]}, \ldots, b^{[L-1]} \big), $$

and an entry of W^{[l]} by W^{[l]}_{ij}.

In particular, a one-hidden-layer DNN with activation function σ is given as

$$ u_\theta(x) = \sum_{k=1}^{m} c_k \sigma( w_k \cdot x + b_k ), \tag{2.10} $$

where w_k ∈ R^d and c_k, b_k ∈ R are parameters. If we denote σ(w_k · x + b_k) = φ_k(x), then we obtain a form similar to the R-G solution (2.5), i.e.,

$$ u_\theta(x) = \sum_{k=1}^{m} c_k \varphi_k(x). \tag{2.11} $$
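As a concrete reading of (2.10)-(2.11), the following sketch (with illustrative shapes and values, not the paper's code) evaluates a one-hidden-layer network, where each hidden neuron plays the role of an adaptive basis function φ_k.

```python
import numpy as np

# One-hidden-layer network (2.10): u_theta(x) = sum_k c_k * sigma(w_k . x + b_k).
# Shapes: x is (N, d), W stacks the w_k as columns (d, m), b is (m,), c is (m,).
rng = np.random.default_rng(0)
d, m = 2, 100
W = rng.normal(size=(d, m))
b = rng.normal(size=m)
c = rng.normal(size=m) / m

def u_theta(x, sigma=np.tanh):
    # sigma(x @ W + b) evaluates the adaptive "bases" phi_k(x) of (2.11).
    return sigma(x @ W + b) @ c

x = rng.uniform(size=(5, d))
print(u_theta(x))          # network output at five input points
```

Unlike (2.5), the "bases" here change during training, since W and b are learned together with c.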
The difference between the solution expressions of these two methods is that the basis functions of the R-G solution are known in advance, while the bases of the DNN solution are unknown and must be obtained, together with the coefficients, through a gradient descent algorithm on a loss function. The loss function corresponding to problem (2.1) is given by

$$ L(u_\theta, f) = \frac{1}{n} \sum_{i=1}^{n} \big( \Delta u_\theta(x_i) + f(x_i) \big)^2 + \beta \int_{\partial\Omega} u_\theta(x)^2\, ds, \tag{2.12} $$

or the variational form [4]

$$ L(u_\theta, f) = \frac{1}{n} \sum_{i=1}^{n} \Big( \frac{1}{2} |\nabla_x u_\theta(x_i)|^2 - f(x_i) u_\theta(x_i) \Big) + \beta \int_{\partial\Omega} u_\theta(x)^2\, ds, \tag{2.13} $$

where the last term enforces the boundary condition and β is a hyper-parameter.

2.3 F-Principle

In this subsection, we illustrate and introduce a rigorous definition of the F-Principle. We start by considering a two-layer neural network, following [11, 19],

$$ h(x, \theta) = \sum_{j=1}^{m} a_j \sigma\big( w_j^\top x - |w_j| c_j \big), \tag{2.14} $$

where w_j, x ∈ R^d, θ = (a^⊤, w_1^⊤, ..., w_m^⊤, c^⊤)^⊤ with a, c ∈ R^m and W = (w_1, ..., w_m)^⊤ ∈ R^{m×d}, and σ(z) = max(z, 0) (z ∈ R) is the ReLU activation function. Note that this two-layer model is written slightly differently from the model in (2.11), to ease the calculations in [11, 19]. The target function is denoted by f(x). The network is trained with the mean-squared error (MSE) loss

$$ L = \int_{\mathbb{R}^d} |h(x, \theta) - f(x)|^2 \rho(x)\, dx, \tag{2.15} $$

where ρ(x) is a probability density. For finite samples, we have

$$ \rho(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i). \tag{2.16} $$

For any function g defined on R^d, we use the following convention for the Fourier transform and its inverse:

$$ \mathcal{F}[g](\xi) = \int_{\mathbb{R}^d} g(x) e^{-2\pi i \xi^\top x}\, dx, \qquad g(x) = \int_{\mathbb{R}^d} \mathcal{F}[g](\xi) e^{2\pi i x^\top \xi}\, d\xi, $$

where ξ ∈ R^d denotes the frequency.

The study in [11, 19] shows that, when the number of neurons m is sufficiently large, training the network (2.14) with gradient descent is described by the differential equation

$$ \partial_t \mathcal{F}[h](\xi) = -\Gamma(\xi)\, \mathcal{F}[(h - f)\rho](\xi) \tag{2.17} $$

with initial value F[h_ini](ξ). The long-time solution of (2.17) is equivalent to solving the following optimization problem:

$$ \min_{h - h_{\mathrm{ini}} \in F_\gamma} \int_{\mathbb{R}^d} \Gamma^{-1}(\xi)\, |\mathcal{F}[h - h_{\mathrm{ini}}](\xi)|^2\, d\xi, \tag{2.18} $$

$$ \text{s.t.} \quad h(x_i) = f(x_i) \quad \text{for } i = 1, \cdots, n, \tag{2.19} $$

where

$$ \Gamma(\xi) = \frac{1}{m} \sum_{j=1}^{m} \big( \|w_j(0)\|^2 + a_j(0)^2 \big)\, \|\xi\|^{-(d+3)} + \frac{1}{4\pi^2 m} \sum_{j=1}^{m} \|w_j(0)\|^2 a_j(0)^2\, \|\xi\|^{-(d+1)}. \tag{2.20} $$

Here ‖·‖ denotes the L² norm, w_j(0) and a_j(0) are the initial parameters before training, and

$$ F_\gamma = \Big\{ h \,\Big|\, \int_{\mathbb{R}^d} \Gamma^{-1}(\xi) |\mathcal{F}[h](\xi)|^2\, d\xi < \infty \Big\}. \tag{2.21} $$

Since Γ(ξ) decreases monotonically in ‖ξ‖, the gradient flow (2.17) rigorously expresses the F-Principle, i.e., lower frequencies converge faster. The minimization problem (2.18) clearly shows that the DNN has an implicit bias in addition to the sample constraints (2.19). As Γ(ξ)^{-1} increases monotonically in ‖ξ‖, the optimization prefers a function with fewer high-frequency components, which explicates the implicit bias of the F-Principle: DNNs prefer low frequencies [17, 18].
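The monotone decay of Γ(ξ) is easy to check numerically. The sketch below evaluates (2.20), as reconstructed above, at a random initialization; the sizes and the frequency range are illustrative assumptions.

```python
import numpy as np

# Evaluate Gamma(xi) of (2.20) at a random initialization and verify that it
# decreases monotonically in ||xi||: lower frequencies carry larger weight.
rng = np.random.default_rng(0)
d, m = 1, 1000
w0 = rng.normal(size=(m, d))               # w_j(0)
a0 = rng.normal(size=m)                    # a_j(0)

def gamma(xi_norm):
    t1 = np.mean(np.sum(w0**2, axis=1) + a0**2) * xi_norm ** (-(d + 3))
    t2 = np.mean(np.sum(w0**2, axis=1) * a0**2) / (4 * np.pi**2) * xi_norm ** (-(d + 1))
    return t1 + t2

xi = np.linspace(0.5, 50.0, 200)
print(np.all(np.diff(gamma(xi)) < 0))      # True: Gamma is decreasing in ||xi||
```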
3 Difference between the R-G method and the DNN method in solving PDEs

3.1 Theoretical study

In the classical case, f(x) is a given function, so the integral on the right-hand side of the R-G equation (2.6) can be computed exactly. As the number of basis functions m approaches infinity, the numerical solution obtained by the R-G method (2.6) approximates the exact solution of problem (2.1). It is interesting to ask: if we only have the information of f at the finite n points, what happens to the numerical solution obtained by (2.8) when m → ∞?

Fixing the number of sample points n, we study the property of the solution of the numerical method (2.8). We have the following theorem.

Theorem 1.
When m → ∞, the numerical method (2.8) solves the problem

$$ -\Delta u(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i) f(x_i), \quad x \in \Omega, \qquad u(x) = 0, \quad x \in \partial\Omega, \tag{3.1} $$

where δ(x) represents the Dirac delta function.

Proof.
By the filtering (sifting) property of the delta function, the quadrature formula for the integral on the right-hand side (r.h.s.) of equation (2.8) is the exact integral of the r.h.s. of equation (3.1) against the basis function φ_j(x), i.e.,

$$ \frac{1}{n} \sum_{i=1}^{n} f(x_i) \varphi_j(x_i) = \int_{\Omega} \frac{1}{n} \sum_{i=1}^{n} f(x_i) \delta(x - x_i)\, \varphi_j(x)\, dx, \quad j = 1, \cdots, m. $$

Note that δ is a continuous linear functional. According to the error estimation theory [1, 14], the finite-dimensional system (2.8) therefore approximates problem (3.1) as m → ∞.

Remark:
For the 1D case, the analytic solution of problem (3.1) on [a, b] is the piecewise linear function

$$ u(x) = \frac{1}{n} \sum_{i=1}^{n} f(x_i)(b - x_i) \frac{x - a}{b - a} - \frac{1}{n} \sum_{i=1}^{n} f(x_i)(x - x_i) H(x - x_i), \tag{3.2} $$

where H(x) is the Heaviside step function,

$$ H(x) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0. \end{cases} $$

For the 2D case, the solution of problem (3.1) on [0, a] × [0, b] can be expressed through the Green's function [12] as

$$ u(x, y) = \frac{4}{nab} \sum_{i=1}^{n} f_i \sum_{k=1}^{\infty} \sum_{l=1}^{\infty} \frac{\sin(p_k x) \sin(q_l y) \sin(p_k x_i) \sin(q_l y_i)}{p_k^2 + q_l^2}, \tag{3.3} $$

where f_i = f(x_i, y_i), p_k = πk/a and q_l = πl/b. We can prove that this series diverges at the sampling points (x_i, y_i) (i = 1, ···, n) and converges at all other points. Therefore, the 2D exact solution u(x, y) is highly singular.
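The piecewise linear limit (3.2) is straightforward to evaluate directly. A minimal sketch, with hypothetical sample data on [a, b] = [−1, 1], is:

```python
import numpy as np

# Evaluate the exact solution (3.2) of problem (3.1) in 1D: a piecewise linear
# function whose kinks sit exactly at the sample points x_i.
def u_exact(x, x_s, f_s, a=-1.0, b=1.0):
    x = np.asarray(x)[:, None]                     # (N, 1) against (n,) samples
    lin = (f_s * (b - x_s)).mean() * (x[:, 0] - a) / (b - a)
    kinks = ((x - x_s) * (x >= x_s) * f_s).mean(axis=1)   # Heaviside terms
    return lin - kinks

x_s = np.array([-0.7, -0.2, 0.1, 0.5, 0.8])        # illustrative sample locations
f_s = np.array([0.3, -0.1, 0.2, -0.4, 0.1])        # illustrative sampled values
print(u_exact(np.linspace(-1, 1, 9), x_s, f_s))    # vanishes at x = -1 and x = 1
```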
3.2 Numerical study

Although the R-G method and the DNN method can be equivalent to each other in the sense of approximation, in this subsection we present three examples to investigate the difference between the solution obtained by the R-G method and the one obtained by the DNN with gradient descent optimization. For simplicity, we consider the following 1D Poisson problem:

$$ -u''(x) = f(x), \quad x \in (-1, 1), \qquad u(-1) = u(1) = 0, \tag{3.4} $$

where we only know the values of f(x) at n points, i.e., f(x_i) (i = 1, ···, n). In the experiments, the f(x_i) are sampled from the function f(x) = −(x³ − x) exp(−x²).
Figure 2: (Example 1): Numerical solutions of the SM with 5 sampling points; the four panels correspond to m = 5, 10, 50, 500, each showing the numerical solution, the exact solution, and the locations of the samples.
Example 1:
Fixing the number of sampling points at n = 5, we use the R-G method and the DNN method to solve problem (3.4), respectively. We choose a small number of sample points here so that the properties of the solutions can be investigated clearly. The results are shown as follows.
R-G method.
First, we use the R-G method to solve problem (3.4), specifically the spectral method with the Fourier basis functions

$$ \varphi_k(x) = \sin(k\pi x), \quad k = 1, \cdots, m. $$

We take the number of basis functions m = 5, 10, 50, 500. As Fig. 2 shows, the R-G solution approaches the piecewise linear function (3.2) as m → ∞. This result is consistent with the solution property analyzed in Theorem 1. A minimal numerical sketch of this spectral computation is given below.
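For this Fourier basis, the stiffness matrix is diagonal on (−1, 1), a(φ_k, φ_j) = (kπ)² δ_kj, so the sampled system (2.8) can be solved in closed form. In the sketch below, the sample data and the helper name rg_spectral are illustrative assumptions.

```python
import numpy as np

# Spectral R-G solution of -u'' = f on (-1, 1), u(+-1) = 0, from n samples of f,
# using phi_k(x) = sin(k*pi*x), so that (2.8) decouples: c_k = rhs_k / (k*pi)^2.
def rg_spectral(x_s, f_s, m):
    k = np.arange(1, m + 1)
    rhs = np.sin(np.pi * np.outer(k, x_s)) @ f_s / len(x_s)   # (f, phi_k), MC rule
    return rhs / (k * np.pi) ** 2

rng = np.random.default_rng(0)
x_s = rng.uniform(-1, 1, 5)                    # n = 5 sample locations
f_s = rng.normal(size=5)                       # sampled values of f

x = np.linspace(-1, 1, 1001)
for m in (5, 10, 50, 500):
    c = rg_spectral(x_s, f_s, m)
    u = np.sin(np.pi * np.outer(np.arange(1, m + 1), x)).T @ c
    print(m, float(np.abs(u).max()))
```

As m grows, the reconstructed u develops kinks at the sample locations, approaching the piecewise linear function (3.2).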
DNN method.

For a better comparison with the R-G method, we choose sin(x) as the activation function of a one-hidden-layer DNN. The number of neurons is taken as m = 5, 10, 50, 500, and the penalty parameter β = 10. We reduce the loss to the order of 1e-4 with a learning rate of 1e-4, and 1000 test points are used to plot the figures. The DNN solutions are shown in Fig. 3, in which we observe that the DNN solutions stay smooth even when m is very large; a training sketch in the spirit of this setup is given after the caption.

Figure 3: (Example 1): Numerical solutions of the DNN method with 5 sampling points.
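A minimal training sketch using the residual loss (2.12); PyTorch is used here for the automatic differentiation of u'', and the sample locations, sampled values, optimizer, and stopping rule are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

# One-hidden-layer DNN with sin activation for -u'' = f on (-1, 1), trained on
# loss (2.12): mean squared residual at n samples + beta * boundary penalty.
torch.manual_seed(0)
n, m, beta = 5, 500, 10.0
x_s = torch.linspace(-0.8, 0.8, n).reshape(-1, 1)   # assumed sample locations
f_s = torch.randn(n, 1)                             # assumed sampled values of f

W = torch.randn(1, m, requires_grad=True)
b = torch.randn(m, requires_grad=True)
c = (0.1 * torch.randn(m, 1)).requires_grad_()
u = lambda x: torch.sin(x @ W + b) @ c

x_bdry = torch.tensor([[-1.0], [1.0]])
opt = torch.optim.Adam([W, b, c], lr=1e-4)
for step in range(200000):
    x = x_s.clone().requires_grad_()
    du = torch.autograd.grad(u(x).sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    loss = ((d2u + f_s) ** 2).mean() + beta * (u(x_bdry) ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if loss.item() < 1e-4:      # train to the loss level reported in the text
        break
```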
In this example, we use the ReLU function as the basis function in the R-G method and as the activation function in the DNN method, and repeat the experiments of Example 1. Here we randomly choose another set of 10 sampling points.

Since in one dimension the linear finite element function φ_j(x) can be expressed by ReLU functions, namely,

$$ \varphi_j(x) = \frac{1}{h_{j-1}} \mathrm{ReLU}(x - x_{j-1}) - \Big( \frac{1}{h_{j-1}} + \frac{1}{h_j} \Big) \mathrm{ReLU}(x - x_j) + \frac{1}{h_j} \mathrm{ReLU}(x - x_{j+1}), $$

where h_j = x_{j+1} − x_j, we can use ReLU as the basis functions of the R-G method. For convenience, we simply use the linear finite element functions φ_j(x) instead of the ReLU functions as the basis. Fig. 4 shows that, beyond doubt, the FEM solution approaches the piecewise linear solution (3.2); the ReLU identity above can also be checked numerically (see the sketch after Fig. 4).
Figure 4: (Example 2): Numerical solutions of the FEM with 10 sampling points; the four panels correspond to m = 5, 10, 50, 500, each showing the numerical solution, the exact solution, and the locations of the samples.
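The ReLU representation of the hat function used above is easy to verify numerically; a quick sketch with illustrative grid nodes:

```python
import numpy as np

# Check: phi_j(x) = ReLU(x - x_{j-1})/h_{j-1}
#                 - (1/h_{j-1} + 1/h_j) * ReLU(x - x_j)
#                 + ReLU(x - x_{j+1})/h_j
# equals the linear hat function on the nodes x_{j-1} < x_j < x_{j+1}.
relu = lambda t: np.maximum(t, 0.0)

xjm1, xj, xjp1 = -0.3, 0.1, 0.6            # three consecutive grid nodes
hjm1, hj = xj - xjm1, xjp1 - xj
x = np.linspace(-1.0, 1.0, 2001)

via_relu = (relu(x - xjm1) / hjm1
            - (1.0 / hjm1 + 1.0 / hj) * relu(x - xj)
            + relu(x - xjp1) / hj)
hat = np.clip(np.minimum((x - xjm1) / hjm1, (xjp1 - x) / hj), 0.0, None)

print(np.abs(via_relu - hat).max())        # ~1e-16: the two forms coincide
```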
In the DNN method, we choose the number of neurons m = 5, 10, 50, 500 as before; the corresponding solutions, shown in Fig. 5, again stay smooth as m grows.

Figure 5: (Example 2): Numerical solutions of the DNN method with ReLU activation function and 10 sampling points.

Example 3:
We consider the 2D case

$$ -\Delta u(x) = f(x), \quad x \in (0,1)^2, \qquad u(x) = 0, \quad x \in \partial(0,1)^2, $$

where x = (x, y), and we know the values of f at n points sampled from the function f(x) := f(x, y) = 2π² sin(πx) sin(πy). We fix the number of sample points to n = 5 in each direction; the sampling points in the x direction and the y direction are both located at the same set of nodes x_h. We test the solution with the number of basis functions m = 5, 50, 100, 200. The R-G solutions are shown in Fig. 6, and Fig. 7 plots the profiles of the R-G solutions at y = 0.5 for different m, in which we can see that the values of the numerical solutions at the sampling points get larger and larger with the increase of m. However, Fig. 8 shows that the DNN solutions are stable, without singularity, for large m.

Figure 6: (Example 3): R-G solutions with different m. The basis functions for the first and the second row are the Legendre basis functions and the piecewise linear basis functions, respectively.

Figure 7: (Example 3): Profiles u(x, y = 0.5) of the R-G solutions with different m (left: SM with n = 5; right: FEM with n = 5; each panel shows m = 5, 50, 100, 200).

4 Understanding the smoothness of the DNN solution

DNNs are widely used in solving PDEs, especially for high-dimensional problems. Optimizing the loss functions in Eqs. (2.12) and (2.13) is equivalent to solving (2.6), except that the bases in (2.12) and (2.13) are adaptive. In addition, the DNN problem is optimized by (stochastic) gradient descent.
Figure 8: (Example 3): DNN solutions with different m. The activation functions for the first and the second row are ReLU(x) and sin(x), respectively.

The experiments in the previous section have shown that, as the number of bases goes to infinity, the DNN method solves (2.1) with a relatively smooth and stable function, compared with the singular one given by Theorem 1. We now utilize the F-Principle to understand what leads to this smoothness.
For two-layer wide DNNs with d = 1, the two terms of Γ(ξ) in the minimization problem (2.18) yield different fitting results. Note that taking Γ(ξ) = ‖ξ‖^{-2}, i.e., minimizing ∫_R ‖ξ‖² |F[h](ξ)|² dξ, leads to a piecewise linear function, while taking Γ(ξ) = ‖ξ‖^{-4}, i.e., minimizing ∫_R ‖ξ‖⁴ |F[h](ξ)|² dξ, leads to a cubic spline; by Parseval's theorem, these two weighted integrals are proportional to ∫_R |h'(x)|² dx and ∫_R |h''(x)|² dx, whose constrained minimizers are the piecewise linear interpolant and the natural cubic spline, respectively. Since the DNN combines both terms, it yields a function much smoother than a piecewise linear one. For a general DNN, the coefficient Γ(ξ) in (2.17) cannot be obtained exactly; however, based on the F-Principle, Γ(ξ) can be postulated to decrease monotonically with respect to ξ.

5 Conclusion

This paper compares the different behaviors of the Ritz-Galerkin method and the DNN method in solving PDEs, in order to better understand the working principles of DNNs. We consider a particular Poisson problem (2.1) in which the right-hand-side term f is known only at discrete sample points, and we analyze in theory why the two numerical methods behave differently: the R-G method treats the discrete f as a linear combination of Dirac delta functions, while the DNN method, due to the F-Principle, implicitly biases towards functions with more low-frequency components when interpolating the discrete sampling points. Furthermore, the numerical experiments show that, as the number of bases increases, the solutions obtained by the R-G method approach a piecewise linear function in the 1D case and a singular function in the 2D case, regardless of the choice of basis functions, while the solutions obtained by the DNN method are smooth in the 1D case and stable in the 2D case. In conclusion, based on the theoretical and numerical study in comparison with traditional methods for solving PDEs, the DNN method possesses a special implicit bias towards low frequencies, which leads to a well-behaved solution even in a heavily over-parameterized setting.

Acknowledgments
Zhiqin Xu is supported by the National Key R&D Program of China (2019YFA0709503), the Shanghai Sailing Program, the Natural Science Foundation of Shanghai (20ZR1429000), and partially supported by the HPC of the School of Mathematical Sciences at Shanghai Jiao Tong University. Jihong Wang and Jiwei Zhang are partially supported by NSFC under No. 11771035 and NSAF U1930402, the Natural Science Foundation of Hubei Province No. 2019CFA007, and Xiangtan University 2018ICIP01.
References

[1] S. C. Brenner and L. R. Scott, The mathematical theory of finite element methods, Springer, New York, third edition, 2008.
[2] W. Cai, X. G. Li and L. Z. Liu, A phase shift deep neural network for high frequency approximation and wave problems, SIAM J. Sci. Comput., 42(2020), A3285-A3312.
[3] Z. Q. Cai, J. S. Chen, M. Liu and X. Y. Liu, Deep least-squares methods: an unsupervised learning-based numerical method for solving elliptic PDEs, J. Comput. Phys., 420(2020), 109707.
[4] W. N. E, J. Q. Han and A. Jentzen, Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations, Commun. Math. Stat., 5(2017), 349-380.
[5] W. N. E and B. Yu, The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems, Commun. Math. Stat., 6(2018), 1-12.
[6] A. Hamilton, T. Tran, M. Mckay, B. Quiring and P. Vassilevski, DNN approximation of nonlinear finite element equations, Tech. rep., Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States), 2019.
[7] J. Q. Han, A. Jentzen and E. Weinan, Solving high-dimensional partial differential equations using deep learning, P. Natl. Acad. Sci., 115(2018), 8505-8510.
[8] J. C. He, L. Li, J. C. Xu and C. Y. Zheng, ReLU deep neural networks and linear finite elements, J. Comput. Math., 38(2020), 502-527.
[9] Y. L. Liao and P. B. Ming, Deep Nitsche method: Deep Ritz method with essential boundary conditions, arXiv preprint arXiv:1912.01309, 2019.
[10] Z. Q. Liu, W. Cai and Z. Q. J. Xu, Multi-scale deep neural network (MscaleDNN) for solving Poisson-Boltzmann equation in complex domains, Commun. Comput. Phys., to appear.
[11] T. Luo, Z. Ma, Y. Y. Zhang and Z. Q. J. Xu, On the exact computation of linear frequency principle dynamics and its generalization, arXiv preprint arXiv:2010.08153, 2020.
[12] A. D. Polyanin, Handbook of Linear Partial Differential Equations for Engineers and Scientists, Chapman & Hall/CRC, 2002.
[13] C. P. Robert and G. Casella, Monte Carlo Statistical Methods, Springer, New York, 2004.
[14] J. Shen, T. Tang and L. L. Wang, Spectral methods: Algorithms, analysis and applications, Springer, 2011.