A multi-scale DNN algorithm for nonlinear elliptic equations with multiple scales
Xi-An Li, Zhi-Qin John Xu, and Lei Zhang
School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
MOE-LSC, Shanghai Jiao Tong University, Shanghai 200240, China
Abstract.
Algorithms based on deep neural networks (DNNs) have attracted increasing attention from the scientific computing community. DNN-based algorithms are easy to implement, natural for nonlinear problems, and have shown great potential to overcome the curse of dimensionality. In this work, we utilize the multi-scale DNN-based algorithm (MscaleDNN) proposed by Liu, Cai and Xu (2020) to solve multi-scale elliptic problems with possible nonlinearity, for example, the p-Laplacian problem. We improve the MscaleDNN algorithm by a smooth and localized activation function. Several numerical examples of multi-scale elliptic problems with separable or non-separable scales in low-dimensional and high-dimensional Euclidean spaces are used to demonstrate the effectiveness and accuracy of the MscaleDNN numerical scheme.
AMS subject classifications : 52B10, 65D18, 68U05, 68U07
Key words: multi-scale elliptic problem, p-Laplacian equation, deep neural network (DNN), variational formulation, activation function
In this paper, we will introduce a DNN-based algorithm for the following elliptic equation with multiple scales and possible nonlinearity,

−div( a(x, ∇u(x)) ) = f(x)  in Ω,    u(x) = g(x)  on ∂Ω,   (1.1)

where Ω ⊂ R^d, d ≥ 2, is a polygonal (polyhedral) domain (open, bounded and connected), a(x, ∇u(x)): Ω × R^d → R^d is the flux function, and f: Ω → R is the source term.

∗ Corresponding author. Email addresses: [email protected], [email protected] (X.A. Li), [email protected] (Z.Q.J. Xu), [email protected]

Deep neural networks (DNNs) have not only achieved great success in computer vision, natural language processing and other machine learning tasks [19, 29], but have also attracted great attention in the scientific computing community due to their universal approximation power, especially in high-dimensional spaces [47]. They have found applications in the numerical solution of ordinary/partial differential equations, integral-differential equations and dynamical systems [16, 20, 26, 37, 42, 48].

Recent theoretical studies on DNNs have shed some light on the design of DNN-based algorithms for scientific computing tasks, in particular for multi-scale problems. For example, the frequency principle (F-Principle) [15, 38, 45, 46] shows that DNNs often fit target functions from low-frequency components to high-frequency ones, as opposed to the behavior of many conventional iterative numerical schemes (e.g., the Gauss-Seidel method), which exhibit faster convergence for higher frequencies. To improve the convergence for high-frequency or multi-scale problems, a series of algorithms have been developed to accelerate the learning of high-frequency components based on the F-Principle [5, 6, 27, 31]. In particular, a multi-scale DNN algorithm (MscaleDNN) has achieved a favourable performance boost for high-frequency problems [31]. The idea of the MscaleDNN is to convert high-frequency contents into low-frequency ones as follows. The Fourier space is partitioned with respect to the radial direction. Since scaling the input can shift the frequency distribution along the radial direction, a scaling-down operation is used to convert the high-frequency components into low-frequency ones. Such radial scaling is independent of dimensionality; hence MscaleDNN is applicable to high-dimensional problems. Also, borrowing the multi-resolution concept of wavelet approximation theory with compact scaling and wavelet functions, a localized activation function (i.e., sReLU), which is a product of two ReLU functions, was designed in previous work [31]. By setting multiple scalings in a MscaleDNN, numerical results in that study [31] show that MscaleDNN is effective for linear elliptic partial differential equations with high frequencies.

We focus our exposition on the numerical method, and therefore restrict the flux function in (1.1) to the following Leray-Lions form [13], since it admits a natural variational form. Namely, for (x, ξ) ∈ Ω × R^d,

a(x, ξ) = κ(x) φ′(|ξ|) ξ/|ξ|,

where φ ∈ C^1 is the so-called N-function (an extension of a convex function with φ′(0) = 0; see [13] for the precise definition). For the p-Laplacian problem, φ(t) = t^p/p, and when p = 2, a(x, ξ) = κ(x)ξ becomes linear. The coefficient κ(x) ∈ L^∞(Ω) is symmetric, uniformly elliptic on Ω, and may contain (non-separable) multiple scales. More general nonlinear fluxes will be treated in future work. With the above setup, the elliptic problem (1.1) is monotone and coercive, and therefore admits a unique solution. Such models have applications in many areas such as heterogeneous (nonlinear) materials [18], non-Newtonian fluids, surgical simulation, image processing, machine learning [41], etc.

In the last decades, much effort has been devoted to the numerical solution of (1.1). In particular, for the p-Laplacian equation with κ(x) = 1, some degree of effectiveness can be achieved by mesh-based methods such as the finite element method (FEM) [2, 3, 25], the finite difference method (FDM) [32], the discontinuous Galerkin method [12], and meshless methods [8, 30], etc. In addition to those discretization methods, iterative methods such as preconditioned steepest descent, quasi-Newton or Newton methods are employed to deal with the nonlinearity. Since the fully nonlinear problem (1.1) may become singular and/or degenerate, some regularization of the nonlinear operator needs to be added to ensure the convergence of the nonlinear iteration [25].

However, conventional methods cannot deal with the multiple scales, which are of great interest in applications such as composite materials, geophysics, machine learning, etc. [17]. The homogenization method [11, 43] relies on the assumption of scale separation and periodicity. In addition, for nonlinear problems, one needs to resort to a series of cell problems with the cell size going to infinity, which limits the practical utility of the method. In comparison, numerical homogenization methods can solve linear multi-scale problems [14, 24, 35] and nonlinear multi-scale problems [1, 9] on the coarse scale, without resolving all the fine scales. Nonetheless, the aforementioned numerical methods, while easy to implement in low-dimensional spaces, encounter great difficulty in high dimensions.

In this paper, based on the Deep Ritz method [16], we propose an improved version of the MscaleDNN algorithm to solve elliptic problems (1.1) with multiple scales and/or nonlinearity. We improve the MscaleDNN by designing a new activation function based on the following intuition. The original activation function, i.e., sReLU, is localized only in the spatial domain due to its first-order discontinuity. However, the MscaleDNN requires localization in the Fourier domain, which is equivalent to smoothness in the spatial domain. Therefore, we design a smooth and localized activation function, which is a product of sReLU and the sine function, i.e., s2ReLU. In addition, our DNN structures also employ the residual connection technique, which was first proposed in [23] for image analysis and has become very popular due to its effectiveness. We employ this improved MscaleDNN to solve multi-scale elliptic problems, such as the multi-scale p-Laplacian equation, in various dimensions. Numerical experiments demonstrate that the algorithm effectively captures the oscillatory solutions of multi-scale problems with or without nonlinearity, even in relatively high dimensions, and that the performance of the s2ReLU activation function is much better than that of sReLU in the MscaleDNN framework.

The paper is structured as follows. In Section 2, we briefly introduce the framework of deep neural network approximation. Section 3 provides a variational formulation to solve the nonlinear multi-scale problem by MscaleDNN. In Section 4, numerical experiments are carried out to demonstrate the effectiveness of our method. Concluding remarks are made in Section 5.

In recent years, the DNN has achieved great success in a wide range of applications, such as classification in complex systems and construction of response surfaces for high-dimensional models. Mathematically, a DNN is a nested composition of sequential linear functions and nonlinear activation functions.
A standard single-layer network, e.g., the neural unit of a DNN with a d-dimensional vector x ∈ R^d as its input and an m-dimensional vector as its output, has the form

y = σ(Wx + b),   (2.1)

where W = (w_ij) ∈ R^{m×d} and b ∈ R^m are called the weights and biases, respectively, and σ(·) is an element-wise nonlinear function, commonly known as the activation function. Various activation functions have been proposed in the machine learning literature, such as sigmoid, tanh, ReLU, Leaky-ReLU, etc. [21]. In a DNN, the single layer (2.1) is also called a hidden layer. Its output can be transformed through new weights, biases, and activation functions in the next layer.

Given an input datum x ∈ R^d, the output of a DNN, denoted by y(x; θ), can be written as

y(x; θ) = W^[L] σ( W^[L−1] σ( ··· σ( W^[2] σ( W^[1] x + b^[1] ) + b^[2] ) ··· ) + b^[L−1] ) + b^[L],   (2.2)

where W^[l] ∈ R^{n_{l+1}×n_l} and b^[l] ∈ R^{n_{l+1}} are the weights and biases of the l-th hidden layer, respectively, n_1 = d, and σ acts element-wise. Here θ represents the set of parameters W^[1], ···, W^[L], b^[1], ···, b^[L].

Many experiments have shown that the approximation capability of a DNN becomes better and more robust with increasing depth. However, the problem of gradient explosion or vanishing might occur when the depth of a DNN increases, which has a negative effect on its performance. ResNet (Residual Neural Network) [23] skillfully alleviates the vanishing (exploding) gradient problem in backpropagation by introducing skip connections between the input layer and the output layer or between hidden layers. It makes the network easier to train and also improves the performance of DNNs; for example, it outperforms the VGG models and obtains excellent results on the ImageNet classification data set using extremely deep residual nets. Mathematically, a ResNet unit with N layers produces a filtered version y^[N] of the input y^[0] as follows:

y^[ℓ+1] = y^[ℓ] + σ( W^[ℓ+1] y^[ℓ] + b^[ℓ+1] ),   for ℓ = 0, 1, ···, N − 1.
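To make the notation concrete, the following minimal NumPy sketch evaluates a single layer (2.1), a plain feedforward network (2.2), and a ResNet unit of the form above. The layer sizes and the choice of tanh as a stand-in activation are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def layer(W, b, x, act=np.tanh):
    """Single hidden layer y = act(W x + b), cf. (2.1)."""
    return act(W @ x + b)

def dnn_forward(weights, biases, x, act=np.tanh):
    """Plain feedforward DNN, cf. (2.2); the last layer is linear."""
    y = x
    for W, b in zip(weights[:-1], biases[:-1]):
        y = act(W @ y + b)
    return weights[-1] @ y + biases[-1]

def resnet_unit(weights, biases, y0, act=np.tanh):
    """ResNet unit: y^{l+1} = y^l + act(W^{l+1} y^l + b^{l+1}).
    All layers share the same width so the skip connection is well defined."""
    y = y0
    for W, b in zip(weights, biases):
        y = y + act(W @ y + b)
    return y

# toy usage: a 2-20-20-1 network evaluated at one point
rng = np.random.default_rng(0)
sizes = [2, 20, 20, 1]
Ws = [rng.standard_normal((m, n)) / np.sqrt(n) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
print(dnn_forward(Ws, bs, np.array([0.3, 0.7])))
```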
The multi-scale elliptic problem (1.1) with N-function φ admits a natural variational form [10]. Define the energy functional as

J(v) := ∫_Ω κ(x) φ(|∇v|) dx − ∫_Ω f v dx,   (3.1)

where v is a trial function in the admissible Orlicz-Sobolev space V := W^{1,φ}_g(Ω) [13], where the subscript g means that the trace on ∂Ω is g. The solution of (1.1) can be obtained by minimizing J(v), i.e.,

u = argmin_{v ∈ V} J(v).   (3.2)

Therefore, we can employ the Deep Ritz method to solve (3.2), which is an efficient approach for variational problems that stem from partial differential equations [16].

We consider an ansatz y(·; θ) represented by a DNN as the trial function in the variational problem (3.2), where θ ∈ Θ denotes the parameters of the underlying DNN. Substituting y(x; θ) into (3.1) and (3.2), we obtain

u = argmin_{y ∈ V} ∫_Ω κ(x) φ(|∇y(x; θ)|) dx − ∫_Ω f(x) y(x; θ) dx.   (3.3)

Similar to the general strategy of searching for a solution satisfying the boundary conditions of (1.1) in the admissible space V [10, 16], we further approximate the integrals by the Monte Carlo method [39] and convert the minimization problem with respect to y ∈ V into an equivalent one with respect to the parameters θ,

θ* = argmin_{θ ∈ Θ} (1/n_it) Σ_{i=1}^{n_it} [ κ(x_I^i) φ(|∇y(x_I^i; θ)|) − f(x_I^i) y(x_I^i; θ) ]   for x_I^i ∈ Ω,   (3.4)

subject to y(x; θ) = g(x) for x ∈ ∂Ω.

Boundary conditions are indispensable constraints for the numerical solution of PDEs. Analogously, imposing boundary conditions is also an important issue in the DNN representation. We approximate the squared L² norm of the boundary discrepancy ∫_{∂Ω} (y(x; θ) − g(x))² by a Monte Carlo approximation,

(1/n_bd) Σ_{j=1}^{n_bd} [ y(x_B^j; θ) − g(x_B^j) ]²   for x_B^j ∈ ∂Ω.   (3.5)

We define the following loss function, where the boundary condition is treated as a penalty term with parameter β,

L(θ; X_I, X_B) = (1/n_it) Σ_{i=1}^{n_it} [ κ(x_I^i) φ(|∇y(x_I^i; θ)|) − f(x_I^i) y(x_I^i; θ) ] + (β/n_bd) Σ_{j=1}^{n_bd} [ y(x_B^j; θ) − g(x_B^j) ]²,   (3.6)

where the first term is denoted loss_it and the second loss_bd, and X_I = {x_I^i}_{i=1}^{n_it} and X_B = {x_B^j}_{j=1}^{n_bd} represent the training data in the interior of Ω and on the boundary ∂Ω, respectively. loss_it and loss_bd stand for the losses computed with the interior points and the boundary points, respectively.

Figure 1: Schematic of a DNN for solving the nonlinear multi-scale problem.

The penalty parameter β controls the relative contribution of loss_bd in the loss function. It increases gradually during the training process in order to better enforce the boundary condition. Our goal is to find a set of parameters θ such that the approximate function y(x; θ) minimizes the loss function L(θ). If L(θ) is small enough, then y(x; θ) will be a "good" approximate solution of (1.1), i.e.,

θ* = argmin L(θ)  ⟺  u(x) ≈ y(x; θ*).

In order to obtain θ*, one can update the parameters θ by the gradient descent method over all training examples at each iteration. The objective function decreases along a descent direction w_k after an iteration, i.e., L(θ_{k+1}) < L(θ_k), where θ_{k+1} = θ_k + η w_k with some properly chosen learning rate (step size) η. Since DNNs are highly non-convex, the optimization may only converge to a local minimum. Stochastic gradient descent (SGD), the common optimization technique of deep learning, has proven to be very effective in practice (it can help escape poor local minima) and is a fundamental building block of nearly all deep learning models. In the implementation, the SGD method chooses a "mini-batch" of data X_m^k (a subset of interior points and boundary points in our case) at each step. In this context, the SGD update is given by

θ_{k+1} = θ_k − η_k ∇_θ L(θ_k; X_m^k),

where the learning rate η_k decreases as k increases.
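As an illustration of the discrete loss (3.6), the sketch below assembles the penalized Ritz loss for a one-dimensional problem with a fixed toy trial function. The central-difference derivative and the specific choices of φ, κ, f, g, and β are illustrative assumptions; in the actual method, ∇y is obtained by automatic differentiation of the network.

```python
import numpy as np

def deep_ritz_loss(y_fn, kappa, phi, f, g, x_in, x_bd, beta, h=1e-4):
    """Monte Carlo estimate of the penalized Ritz loss (3.6) in 1d."""
    dy = (y_fn(x_in + h) - y_fn(x_in - h)) / (2.0 * h)        # ~ y'(x) at interior points
    loss_it = np.mean(kappa(x_in) * phi(np.abs(dy)) - f(x_in) * y_fn(x_in))
    loss_bd = np.mean((y_fn(x_bd) - g(x_bd)) ** 2)            # boundary penalty (3.5)
    return loss_it + beta * loss_bd

# toy usage with p = 2 (phi(t) = t^2 / 2), kappa = 1, f = 1, zero boundary data
rng = np.random.default_rng(1)
x_interior = rng.uniform(0.0, 1.0, size=3000)
x_boundary = np.array([0.0, 1.0])
trial = lambda x: x * (1.0 - x) / 2.0                          # exact minimizer of this toy problem
loss = deep_ritz_loss(trial, kappa=lambda x: np.ones_like(x),
                      phi=lambda t: 0.5 * t ** 2, f=lambda x: np.ones_like(x),
                      g=lambda x: np.zeros_like(x), x_in=x_interior,
                      x_bd=x_boundary, beta=1000.0)
print(loss)
```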
Remark 3.1. The parameters θ, consisting of the weights and biases of our model, are initialized using the normal distribution N(0, 2/(n_in + n_out)), where n_in and n_out are the input and output dimensions of the corresponding layer, respectively.

A conventional DNN model can achieve a satisfactory solution of PDE problems when the coefficient κ(x) is homogeneous (e.g., smooth or possessing few scales) [4, 22, 40, 44]. However, it is difficult to solve PDEs (1.1) with a multi-scale κ(x) due to the complex interaction of nonlinearity and multiple scales. The MscaleDNN architecture has been proposed to approximate solutions with high frequencies and multiple scales by converting the original data to a low-frequency space [31], as described in the following (a small code sketch of the first two steps is given after this list):

• Divide the neurons in the first hidden layer into N groups, and generate the scale vector

K = (a_1, ···, a_1, a_2, ···, a_2, ···, a_N, ···, a_N)^T,   a_i ≥ 1.   (3.7)
Note that the scale parameters a_i are hyper-parameters and can be set by several methods. Numerical results in previous work [7, 31] show that the effectiveness of the MscaleDNN is not sensitive to the selection of the scale parameters as long as they cover the scales of the target function.

• Convert the input data x to x̃ = K ⊙ x, then feed x̃ to the first hidden layer of the DNN, where ⊙ is the Hadamard product. From the viewpoint of Fourier transformation and decomposition, the Fourier transform f̂(k) of a given band-limited function f(x) has a compact support B(K_max) = {k ∈ R^d, |k| ≤ K_max}, which can be partitioned as the union of M concentric annuli with uniform or nonuniform width, e.g.,

P_i = {k ∈ R^d, (i − 1)K_0 ≤ |k| ≤ iK_0},   K_0 = K_max/M,   1 ≤ i ≤ M.

Then f̂(k) can be expressed as

f̂(k) = Σ_{i=1}^{M} f̂_i(k),   with supp f̂_i(k) ⊂ P_i.

By the down-scaling operation, which shifts the high-frequency region P_i into a low-frequency one, the scale transform reads

f̂_i^(scale)(k) = f̂_i(a_i k),   a_i > 1,

or equivalently f_i^(scale)(x) = f_i(x/a_i), i.e., f_i(x) = f_i^(scale)(a_i x). Then, instead of seeking a function f̂_i(k) supported in P_i, the transformed function f̂_i^(scale)(k) is sought in

supp f̂_i^(scale)(k) ⊂ {k ∈ R^d, (i − 1)K_0/a_i ≤ |k| ≤ iK_0/a_i}.

When iK_0/a_i is small, a DNN ℓ_i(θ) can be used to learn f_i^(scale)(x) quickly, which further means that ℓ_i(θ) can approximate f_i(x) readily, i.e., f_i(x) ≈ ℓ_i(a_i x; θ). Finally, f(x) can be expressed as

f(x) ≈ Σ_{i=1}^{M} ℓ_i(a_i x; θ).   (3.8)

In general, (3.8) suggests an ansatz to approximate the solution more quickly with DNN representations; hence, converting the original data x into x̃ is a useful trick when dealing with multi-scale problems.

• Output the result of the DNN.
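The following sketch illustrates the first two steps above: building a scale vector K as in (3.7) and one common way to realize the Hadamard scaling before the first hidden layer, namely feeding the i-th neuron group the scaled input a_i·x. The particular scale values and group sizes are illustrative assumptions.

```python
import numpy as np

def make_scale_vector(scales, group_size):
    """Scale vector K of (3.7): each scale a_i is repeated over one group of neurons."""
    return np.repeat(np.asarray(scales, dtype=float), group_size)

def mscale_first_layer(weights, biases, scales, x, act=np.tanh):
    """First hidden layer of a MscaleDNN: the neurons are split into groups and
    the i-th group receives the scaled input a_i * x (the Hadamard scaling of the text)."""
    outs = [act(W @ (a * x) + b) for W, b, a in zip(weights, biases, scales)]
    return np.concatenate(outs)

# toy usage: 3 groups with scales 1, 2, 4, each group has 10 neurons, input in R^2
rng = np.random.default_rng(2)
scales = [1.0, 2.0, 4.0]
Ws = [rng.standard_normal((10, 2)) for _ in scales]
bs = [np.zeros(10) for _ in scales]
h1 = mscale_first_layer(Ws, bs, scales, np.array([0.1, 0.9]))
print(h1.shape)   # (30,)
```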
The role of activation functions is to decide whether a particular neuron should fire or not. When the activation function is absent, the neural network is simply a linear transformation involving weights and biases, which in turn becomes a linear regression model. Although a linear model is simple to solve, its expressive power for complex problems is limited. A nonlinear activation function performs a nonlinear transformation of the input data, making the network capable of learning and performing more complex tasks. Thus, choosing the right activation function is essential for the efficiency and accuracy of the DNN. The significance of activation functions for different models has been investigated in, e.g., [28, 36].

In the previous work [31], the activation function sReLU(x) = ReLU(x) × ReLU(1 − x), which is smoother than ReLU(x) = max{x, 0}, is used in the MscaleDNN algorithm to solve PDEs. It is localized in the spatial domain, but due to its first-order discontinuity it lacks adequate smoothness. To improve the efficiency of scale separation, we propose a smoother activation function which is the product of sReLU and the sine function, referred to as sin-sReLU,

sin-sReLU(x) = sin(2πx) · ReLU(x) · ReLU(1 − x).   (3.9)

For convenience, we abbreviate it by s2ReLU here and thereafter.

Figure 2: sReLU function (left) and s2ReLU function (right) in the spatial (upper) and frequency (lower; the peak of each line is indicated by a star) domains.

As shown in Figs. 2(a) and 2(b), sReLU and s2ReLU are localized in the spatial domain. For sReLU, since its first-order derivative is discontinuous, its Fourier transform decays slowly. As shown in Fig. 2(c), sReLU functions with different scales overlap with each other in the frequency domain. However, since s2ReLU is a smooth function, its Fourier transform decays faster and has a better localization property in the frequency domain, compared with sReLU, as shown in Fig. 2(d). The localization in the frequency domain leads to the fact that the peak-amplitude frequencies of differently scaled s2ReLU functions (stars in Fig. 2(d)) are separated and increase as the scales increase. Therefore, s2ReLU could potentially be more efficient for representing multi-scale functions. In the numerical experiments, we will show that s2ReLU has superior performance compared with sReLU in MscaleDNN or ReLU in a conventional DNN.
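A minimal NumPy sketch of the two activation functions, following the definitions above (with s2ReLU as reconstructed in (3.9)):

```python
import numpy as np

def srelu(x):
    """sReLU(x) = ReLU(x) * ReLU(1 - x): supported on [0, 1], continuous but not C^1."""
    return np.maximum(x, 0.0) * np.maximum(1.0 - x, 0.0)

def s2relu(x):
    """s2ReLU(x) = sin(2*pi*x) * ReLU(x) * ReLU(1 - x): smooth and localized,
    hence better localized in the frequency domain than sReLU."""
    return np.sin(2.0 * np.pi * x) * np.maximum(x, 0.0) * np.maximum(1.0 - x, 0.0)

# scaled versions s2relu(a * x) shift the peak frequency with the scale a
xs = np.linspace(-0.5, 1.5, 9)
print(srelu(xs))
print(s2relu(3.0 * xs))
```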
In this section, several test problems are presented to illustrate the effectiveness of our method for multi-scale nonlinear problems. In our numerical experiments, all training and testing data are generated from a uniform distribution over the corresponding domains, and all networks are trained by the Adam optimizer. The initial learning rate is set as 2 × 10^{-4} with a decay rate of 5 × 10^{-5} for each training step. For the first hidden layer in the MscaleDNN, we divide the neurons into N = 100 groups to generate the scale vector K = (1, 2, ···, 99, 100)^T as in (3.7). Here, we provide two criteria to evaluate our model:

MSE = (1/n_s) Σ_{i=1}^{n_s} |ũ(x^i) − u*(x^i)|²   and   REL = Σ_{i=1}^{n_s} |ũ(x^i) − u*(x^i)|² / Σ_{i=1}^{n_s} |u*(x^i)|²,

where ũ and u* are the approximate solution of the deep neural network and the exact solution (or the reference solution computed on a very fine mesh), respectively, evaluated at the n_s testing sample points. To evaluate the effectiveness, we test our model every 1000 epochs during the training process. In our work, the penalty parameter β is increased in stages,

β = β_0 for the initial epochs, and β is enlarged step by step as the epoch counter i passes fixed fractions of M_max,   (4.1)

where M_max represents the total number of epochs.

We perform neural network training and testing in TensorFlow (version 1.4.0) on a workstation with 256GB RAM and a single NVIDIA GeForce GTX 2080Ti 12GB.

For the numerical examples, we use the following p-Laplacian problem as a prototypical example of the more general form (1.1),

−div( κ(x) |∇u(x)|^{p−2} ∇u(x) ) = f(x),   x ∈ Ω,
u(x) = g(x),   x ∈ ∂Ω,   (4.2)

and the energy in (3.1) can be rewritten as

J(v) := (1/p) ∫_Ω κ(x) |∇v|^p dx − ∫_Ω f v dx,   v ∈ V,

with V := W^{1,p}_g(Ω), namely, the Sobolev space W^{1,p}(Ω) with trace g on ∂Ω.

We consider the following one-dimensional highly oscillatory elliptic problem in Ω = [0, 1]:

−d/dx( κ(x) |du_ε(x)/dx|^{p−2} du_ε(x)/dx ) = f(x),   u_ε(0) = u_ε(1) = 0.   (4.3)
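A small sketch of the two evaluation criteria MSE and REL defined above; the test values used here are purely illustrative.

```python
import numpy as np

def mse_rel(u_approx, u_exact):
    """Mean squared error and squared relative error on the testing samples."""
    diff2 = np.sum((u_approx - u_exact) ** 2)
    mse = diff2 / u_exact.size
    rel = diff2 / np.sum(u_exact ** 2)
    return mse, rel

# illustrative usage
xs = np.random.default_rng(3).uniform(0.0, 1.0, 1000)
u_star = np.sin(np.pi * xs)             # stand-in for the exact solution
u_tilde = u_star + 1e-3 * np.cos(xs)    # stand-in for the DNN prediction
print(mse_rel(u_tilde, u_star))
```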
For equation (4.3), we uniformly sample n_it interior points in Ω and n_bd = 500 boundary points on ∂Ω as the training dataset, and uniformly sample n_s points in Ω as the testing dataset.

Example 4.1.
We consider the case of p = 2, f ≡ 1, and

κ(x) = ( 2 + cos(2πx/ε) )^{-1},   (4.4)

with a small parameter ε > 0 such that ε^{-1} ∈ N^+. In this one-dimensional setting, the corresponding unique solution is given by

u_ε(x) = x − x² + ε ( (1/(4π)) sin(2πx/ε) − (1/(2π)) x sin(2πx/ε) − (ε/(4π²)) cos(2πx/ε) + ε/(4π²) ).   (4.5)

Since the oscillation amplitude is small, to exhibit the high oscillation we display the first-order derivative of the target function for the two values of ε considered.

Figure 3: Graphs of the original function and of the derivative of u_ε.

Although the p-Laplacian equation is reduced to a linear one for p = 2, the problem is still difficult for a DNN due to the highly oscillatory coefficient with small ε [46]. Since the solution is a smooth O(1) function with an oscillating perturbation of O(ε) for our one-dimensional problems, in the following we only illustrate the O(ε) parts of the solutions by subtracting the smooth part x(1 − x) from u(x). For the two values of ε considered (Figs. 4 and 5), the task becomes more demanding as ε decreases, but the s2ReLU solution still outperforms that of sReLU. The error curves in Fig. 5(b) reinforce this conclusion. Figs. 4 and 5 clearly reveal that the performance of the MscaleDNN model with s2ReLU or sReLU is superior to that of a general DNN model with ReLU.

Figure 4: Testing results for ε = … and p = 2. The network size is (300, 200, 150, 150, 100, 50, 50). (a): solution; (b): MSE and REL.
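For reference, a small NumPy sketch of the coefficient (4.4) and the exact solution (4.5) as reconstructed above; these formulas can be used to generate test values of u_ε (the value ε = 0.1 below is illustrative).

```python
import numpy as np

def kappa(x, eps):
    """Oscillatory coefficient (4.4): kappa(x) = 1 / (2 + cos(2*pi*x/eps))."""
    return 1.0 / (2.0 + np.cos(2.0 * np.pi * x / eps))

def u_exact(x, eps):
    """Exact solution (4.5) of (4.3) with p = 2, f = 1 and kappa as above."""
    s = np.sin(2.0 * np.pi * x / eps)
    c = np.cos(2.0 * np.pi * x / eps)
    return (x - x ** 2
            + eps * (s / (4.0 * np.pi) - x * s / (2.0 * np.pi)
                     - eps * c / (4.0 * np.pi ** 2) + eps / (4.0 * np.pi ** 2)))

# quick sanity check of the boundary conditions for eps = 0.1
xs = np.array([0.0, 1.0])
print(u_exact(xs, 0.1))   # both values should be (numerically) zero
```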
Figure 5: Testing results for ε = … and p = 2. The network size is (500, 400, 300, 300, 200, 100, 100).

When p increases, the nonlinearity of the p-Laplacian problem (1.1) becomes more significant and interacts in a complex way with the highly oscillatory coefficient; hence the solution becomes increasingly difficult to obtain. In the following example, we further consider the 1d problem (4.3) with p = 5.
Example 4.2.
For p = 5, this p-Laplacian equation is a highly oscillatory diffusion problem. The exact solution u_ε(x) and the coefficient κ(x) are the same as those of Example 4.1, and the source term f is the one obtained by inserting them into (4.3) with p = 5, i.e.,

f(x) = −d/dx( κ(x) |u_ε′(x)|³ u_ε′(x) ),   (4.6)

where ε > 0 with ε^{-1} ∈ N^+. We show the testing results for the two values of ε in Figs. 6 and 7.

Figure 6: Testing results for ε = … and p = 5. The network size is (300, 200, 150, 150, 100, 50, 50). (a): solution; (b): MSE and REL.
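Since the closed form of (4.6) is lengthy, one simple way to evaluate f in practice is to differentiate the flux numerically. The sketch below does this with nested central differences; kappa and u_exact are restated from the snippet of Example 4.1 so that the block is self-contained, and the step size h is an illustrative choice.

```python
import numpy as np

def kappa(x, eps):
    return 1.0 / (2.0 + np.cos(2.0 * np.pi * x / eps))

def u_exact(x, eps):
    s, c = np.sin(2 * np.pi * x / eps), np.cos(2 * np.pi * x / eps)
    return x - x**2 + eps * (s/(4*np.pi) - x*s/(2*np.pi)
                             - eps*c/(4*np.pi**2) + eps/(4*np.pi**2))

def forcing_p5(x, eps, h=1e-6):
    """f(x) = -d/dx( kappa |u'|^3 u' ): the p = 5 flux of u_eps, differentiated
    by nested central differences instead of expanding (4.6) symbolically."""
    du = lambda z: (u_exact(z + h, eps) - u_exact(z - h, eps)) / (2 * h)
    flux = lambda z: kappa(z, eps) * np.abs(du(z)) ** 3 * du(z)
    return -(flux(x + h) - flux(x - h)) / (2 * h)

print(forcing_p5(np.array([0.25, 0.5, 0.75]), eps=0.1))
```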
Figure 7: Testing results for ε = … and p = 5. The network size is (500, 400, 300, 300, 200, 100, 100).

From the above results, we conclude that the MscaleDNN model with the s2ReLU activation function solves the p-Laplacian problem much better than the models with sReLU or ReLU, even in the nonlinear case.

We consider the following p-Laplacian problem in the domain Ω = [−1, 1] × [−1, 1]:

−div( κ(x1, x2) |∇u(x1, x2)|^{p−2} ∇u(x1, x2) ) = f(x1, x2),
u(−1, x2) = u(1, x2) = u(x1, −1) = u(x1, 1) = 0.   (4.7)

In the following tests, we obtain the solution of (4.7) by employing two types of MscaleDNN with size (1000, 500, 400, 300, 300, 200, 100, 100) and activation functions sReLU and s2ReLU, respectively. Based on the conclusions of MscaleDNN for one-dimensional p-Laplacian problems and the previous results of MscaleDNN for solving PDEs [31], a MscaleDNN with s2ReLU or sReLU outperforms a DNN with ReLU; therefore, we will not show the results of DNN with ReLU in the following experiments.
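A minimal sketch of how uniformly distributed training points can be drawn for (4.7): interior points in (−1, 1)² and boundary points on the four edges (the sample sizes are illustrative).

```python
import numpy as np

def sample_square(n_interior, n_boundary, lo=-1.0, hi=1.0, seed=0):
    """Uniform interior samples of [lo, hi]^2 and uniform samples of its boundary."""
    rng = np.random.default_rng(seed)
    x_in = rng.uniform(lo, hi, size=(n_interior, 2))
    # boundary: pick an edge at random, fix one coordinate, sample the other
    edge = rng.integers(0, 4, size=n_boundary)
    t = rng.uniform(lo, hi, size=n_boundary)
    x_bd = np.empty((n_boundary, 2))
    x_bd[:, 0] = np.where(edge == 0, lo, np.where(edge == 1, hi, t))
    x_bd[:, 1] = np.where(edge == 2, lo, np.where(edge == 3, hi, t))
    return x_in, x_bd

x_in, x_bd = sample_square(3000, 500)
print(x_in.shape, x_bd.shape)
```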
Example 4.3.
In this example, the forcing term f(x1, x2) ≡ 1, p = 2, and the coefficient κ(x1, x2) is given by

κ(x1, x2) = (1/6) ( (1.1 + sin(2πx1/ε1)) / (1.1 + sin(2πx2/ε1))
 + (1.1 + sin(2πx2/ε2)) / (1.1 + cos(2πx1/ε2))
 + (1.1 + cos(2πx1/ε3)) / (1.1 + sin(2πx2/ε3))
 + (1.1 + sin(2πx2/ε4)) / (1.1 + cos(2πx1/ε4))
 + (1.1 + cos(2πx1/ε5)) / (1.1 + sin(2πx2/ε5))
 + sin(4 x1² x2²) + 1 ),

where ε1 = 1/5, ε2 = 1/13, ε3 = 1/17, ε4 = 1/31, ε5 = 1/65. For this example, the corresponding exact solution cannot be expressed explicitly. Alternatively, a reference solution u(x1, x2) is taken as the finite element solution computed by a numerical homogenization method [33–35] on a fine square grid over [−1, 1] × [−1, 1]. The training dataset consists of n_it interior points and n_bd boundary points, and the testing dataset is taken on a uniform grid over [−1, 1] × [−1, 1].

Figure 8: Testing results for Example 4.3. 8(a): Cut lines of the solutions along x = …; (b): MSE and REL; (c), (d): point-wise errors.

Despite the complicated multi-scale coefficient κ(x1, x2) in this example, the performance of our model with s2ReLU and sReLU remains favourable for solving (4.7), and s2ReLU performs better than sReLU throughout the training process. Figs. 8(c) and 8(d) not only show that the point-wise errors at most points are close to zero, but also reveal that the point-wise error of s2ReLU is smaller than that of sReLU. In short, our model with the s2ReLU activation function obtains a satisfactory solution of the p-Laplacian problem and outperforms the one with sReLU.
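A NumPy sketch of the multi-scale coefficient of this example as written above; note that the constants (the 1/6 prefactor and the 1.1 offsets) follow the standard benchmark form and should be treated as a reconstruction rather than the authors' exact expression.

```python
import numpy as np

EPS = [1/5, 1/13, 1/17, 1/31, 1/65]

def kappa_2d(x1, x2):
    """Multi-scale trigonometric coefficient of Example 4.3 (reconstructed form)."""
    e1, e2, e3, e4, e5 = EPS
    s = ((1.1 + np.sin(2*np.pi*x1/e1)) / (1.1 + np.sin(2*np.pi*x2/e1))
         + (1.1 + np.sin(2*np.pi*x2/e2)) / (1.1 + np.cos(2*np.pi*x1/e2))
         + (1.1 + np.cos(2*np.pi*x1/e3)) / (1.1 + np.sin(2*np.pi*x2/e3))
         + (1.1 + np.sin(2*np.pi*x2/e4)) / (1.1 + np.cos(2*np.pi*x1/e4))
         + (1.1 + np.cos(2*np.pi*x1/e5)) / (1.1 + np.sin(2*np.pi*x2/e5))
         + np.sin(4.0 * x1**2 * x2**2) + 1.0)
    return s / 6.0

print(kappa_2d(np.array([0.2, -0.7]), np.array([0.4, 0.9])))
```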
Example 4.4.

In this example, we test the performance of the MscaleDNN on the p-Laplacian problem with p = 3. The forcing term f(x1, x2) and the coefficient κ(x1, x2) are similar to those in Example 4.3. Analogously, we take the reference solution u as the finite element solution on a sufficiently fine mesh over the square domain [−1, 1] × [−1, 1]. In addition, the training and testing datasets in this example are constructed in the same way as in Example 4.3.

Figure 9: Testing results for Example 4.4. 9(a): Cut lines of the solutions along x = …; 9(b): MSE and REL; 9(c), 9(d): point-wise errors.

Example 4.5.
In this example, we take the forcing term f = 1, p = 2, and

κ(x1, x2) = Π_{k=1}^{q} ( 1 + (1/2) cos(2^k π (x1 + x2)) ) ( 1 + (1/2) sin(2^k π (x1 − x2)) ),

where q is a positive integer. The coefficient κ(x1, x2) has non-separable scales. Similarly to Example 4.3, we take the reference solution u as the finite element solution on a fine mesh over the square domain [−1, 1] × [−1, 1].

Figure 10: Testing results for Example 4.5. 10(a): Cut lines of the solutions along x = …; 10(b): MSE and REL; 10(c), 10(d): point-wise errors.

The above results show that the MscaleDNN model with the s2ReLU activation function can solve multiscale elliptic problems with oscillating coefficients and possible nonlinearity, and its performance is superior to that of sReLU.
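A short sketch of a coefficient of this product-of-scales type, following the reconstructed form above (the 1/2 factors and the particular trigonometric choices are assumptions):

```python
import numpy as np

def kappa_nonseparable(x1, x2, q=6):
    """Product-type coefficient with q dyadic, non-separable scales (reconstructed form)."""
    k = np.arange(1, q + 1).reshape(-1, 1)          # scales 2^1, ..., 2^q
    factors = ((1.0 + 0.5 * np.cos(2.0 ** k * np.pi * (x1 + x2)))
               * (1.0 + 0.5 * np.sin(2.0 ** k * np.pi * (x1 - x2))))
    return np.prod(factors, axis=0)

print(kappa_nonseparable(np.array([0.1, -0.3]), np.array([0.5, 0.2])))
```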
It is also important to examine the capability of MscaleDNN for high-dimensional (multi-scale) elliptic problems, which is shown in the following.

Example 4.6.

We consider the following p-Laplacian problem in the domain Ω = [0, 1]^5:

−div( κ(x1, ···, x5) |∇u(x1, ···, x5)|^{p−2} ∇u(x1, ···, x5) ) = f(x1, ···, x5),
u(0, x2, ···, x5) = u(1, x2, ···, x5) = ··· = u(x1, ···, x4, 0) = u(x1, ···, x4, 1) = 0.   (4.8)

In this example, the coefficient is

κ(x1, x2, ···, x5) = 1 + cos(πx1) cos(πx2) cos(πx3) cos(πx4) cos(πx5),

and we choose the forcing term f such that the exact solution is

u(x1, x2, ···, x5) = sin(πx1) sin(πx2) sin(πx3) sin(πx4) sin(πx5).

For this five-dimensional elliptic problem, we use two types of MscaleDNNs with size (1000, 800, 500, 500, 400, 200, 200, 100) and activation functions s2ReLU and sReLU, respectively. The training data set includes 7500 interior points and 1000 boundary points randomly sampled from Ω. The testing dataset includes 1600 random samples in Ω. We plot the testing results in Fig. 11. To visually illustrate these results, we map the point-wise errors of the sReLU and s2ReLU solutions, evaluated at the 1600 sample points in Ω, onto a 40 × 40 2d array, respectively. We note that this mapping is only for the purpose of visualization and is independent of the actual coordinates of those points.

Figure 11: Testing results for Example 4.6. 11(a): Mean square error and relative error for s2ReLU and sReLU, respectively. 11(b): Point-wise square error for s2ReLU. 11(c): Point-wise square error for sReLU.

The numerical results in Fig. 11(a) indicate that the MscaleDNN models with s2ReLU and sReLU can still approximate the exact solution of the elliptic equation well in five-dimensional space. In particular, Figs. 11(b) and 11(c) show that the point-wise error of s2ReLU is much smaller than that of sReLU.
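The mapping used for Figs. 11(b)-(c) can be realized, for instance, by simply reshaping the 1600 point-wise errors into a 40 × 40 array; a minimal sketch with synthetic error values:

```python
import numpy as np

def errors_as_image(pointwise_sq_err, shape=(40, 40)):
    """Arrange the point-wise squared errors at the test samples into a 2d array
    purely for visualization; the layout carries no spatial meaning."""
    return np.asarray(pointwise_sq_err).reshape(shape)

err = np.abs(np.random.default_rng(4).normal(0.0, 1e-3, 1600)) ** 2   # synthetic errors
img = errors_as_image(err)
print(img.shape, img.max())
```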
In this paper, we propose an improved version of MscaleDNN by designing an activation function localized in both the spatial and the Fourier domains, and use it to solve multi-scale elliptic problems. Numerical results show that this method is effective for the resolution of elliptic problems with multiple scales and possible nonlinearity, in low to moderately high dimensions. As a meshless method, the DNN-based method is more flexible for partial differential equations than traditional mesh-based and meshfree methods on regular or irregular regions. In the future, we will optimize the MscaleDNN architecture and design DNN-based algorithms for multi-scale nonlinear problems with more general nonlinearities.
Acknowledgements
X.L. and L.Z. are partially supported by the National Natural Science Foundation of China (NSFC 11871339, 11861131004). Z.X. is supported by the National Key R&D Program of China (2019YFA0709503) and the Shanghai Sailing Program. This work is also partially supported by the HPC of the School of Mathematical Sciences at Shanghai Jiao Tong University.
References

[1] Assyr Abdulle and Gilles Vilmart. Analysis of the finite element heterogeneous multiscale method for quasilinear elliptic homogenization problems. Mathematics of Computation, 83(286):513–536, 2013.
[2] John W. Barrett and W. B. Liu. Finite element approximation of the parabolic p-Laplacian. SIAM Journal on Numerical Analysis, 31(2):413–428, 1994.
[3] Liudmila Belenki, Lars Diening, and Christian Kreuzer. Optimality of an adaptive finite element method for the p-Laplacian equation. IMA Journal of Numerical Analysis, 32(2):484–510, 2012.
[4] Jens Berg and Kaj Nyström. A unified deep artificial neural network approach to partial differential equations in complex geometries. Neurocomputing, 317:28–41, 2018.
[5] Simon Biland, Vinicius C. Azevedo, Byungsoo Kim, and Barbara Solenthaler. Frequency-aware reconstruction of fluid simulations with generative networks. arXiv preprint arXiv:1912.08776, 2019.
[6] Wei Cai, Xiaoguang Li, and Lizuo Liu. A phase shift deep neural network for high frequency approximation and wave problems. Accepted by SISC, arXiv:1909.11759, 2019.
[7] Wei Cai and Zhi-Qin John Xu. Multi-scale deep neural networks for solving high dimensional PDEs. arXiv preprint arXiv:1910.11710, 2019.
[8] Sudhakar Chaudhary, Vimal Srivastava, V. V. K. Srinivas Kumar, and Balaji Srinivasan. Web-spline-based mesh-free finite element approximation for p-Laplacian. International Journal of Computer Mathematics, 93(6):1022–1043, 2016.
[9] Eric T. Chung, Yalchin Efendiev, Ke Shi, and Shuai Ye. A multiscale model reduction method for nonlinear monotone elliptic equations in heterogeneous media. Networks and Heterogeneous Media, 12(4):619–642, 2017.
[10] Philippe G. Ciarlet and J. T. Oden. The Finite Element Method for Elliptic Problems. 1978.
[11] Doina Cioranescu and Patrizia Donato. An Introduction to Homogenization. 2000.
[12] Bernardo Cockburn and Jiguang Shen. A hybridizable discontinuous Galerkin method for the p-Laplacian. SIAM Journal on Scientific Computing, 38(1), 2016.
[13] Lars Diening and Frank Ettwein. Fractional estimates for non-differentiable elliptic systems with general growth. Forum Mathematicum, 20(3):523–556, 2008.
[14] Weinan E, Bjorn Engquist, Xiantao Li, Weiqing Ren, and Eric Vanden-Eijnden. Heterogeneous multiscale methods: A review. Communications in Computational Physics, 2(3):367–450, 2007.
[15] Weinan E, Chao Ma, and Lei Wu. Machine learning from a continuous viewpoint. arXiv preprint arXiv:1912.12777, 2019.
[16] Weinan E and Bing Yu. The deep Ritz method: A deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1–12, 2018.
[17] Frédéric Feyel. Multiscale FE2 elastoviscoplastic analysis of composite structures. Computational Materials Science, 16(1):344–354, 1999.
[18] Marc G. D. Geers, Varvara G. Kouznetsova, Karel Matouš, and Julien Yvonnet. Homogenization methods and multiscale modeling: Nonlinear problems. Encyclopedia of Computational Mechanics, Second Edition, pages 1–34, 2017.
[19] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, 2016.
[20] Jiequn Han, Chao Ma, Zheng Ma, and Weinan E. Uniformly accurate machine learning-based hydrodynamic models for kinetic equations. Proceedings of the National Academy of Sciences, 116(44):21983–21991, 2019.
[21] Simon O. Haykin. Neural Networks: A Comprehensive Foundation. 1998.
[22] Cuiyu He, Xiaozhe Hu, and Lin Mu. A mesh-free method using piecewise deep neural network for elliptic interface problems. arXiv preprint arXiv:2005.04847, 2020.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[24] Thomas Hou and Yalchin Efendiev. Multiscale Finite Element Methods: Theory and Applications. 2009.
[25] Y. Q. Huang, Ruo Li, and Wenbin Liu. Preconditioned descent algorithms for p-Laplacian. Journal of Scientific Computing, 32(2):343–371, 2007.
[26] Martin Hutzenthaler, Arnulf Jentzen, Thomas Kruse, Tuan Anh Nguyen, and Philippe von Wurstemberger. Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential equations. arXiv preprint arXiv:1807.01212, 2018.
[27] Ameya D. Jagtap, Kenji Kawaguchi, and George Em Karniadakis. Adaptive activation functions accelerate convergence in deep and physics-informed neural networks. Journal of Computational Physics, 404:109136, 2020.
[28] Ameya D. Jagtap, Kenji Kawaguchi, and George Em Karniadakis. Adaptive activation functions accelerate convergence in deep and physics-informed neural networks. Journal of Computational Physics, 404:109136, 2020.
[29] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[30] Xiaolin Li and Haiyun Dong. The element-free Galerkin method for the nonlinear p-Laplacian equation. Computers and Mathematics with Applications, 75(7):2549–2560, 2018.
[31] Ziqi Liu, Wei Cai, and Zhi-Qin John Xu. Multi-scale deep neural network (MscaleDNN) for solving Poisson-Boltzmann equation in complex domains. Accepted by Communications in Computational Physics, arXiv:2007.11207, 2020.
[32] Adam M. Oberman. Finite difference methods for the infinity Laplace and p-Laplace equations. Journal of Computational and Applied Mathematics, 254(1):65–80, 2013.
[33] Houman Owhadi and Lei Zhang. Homogenization of parabolic equations with a continuum of space and time scales. SIAM Journal on Numerical Analysis, 46(1):1–36, 2007.
[34] Houman Owhadi and Lei Zhang. Numerical homogenization of the acoustic wave equations with a continuum of scales. Computer Methods in Applied Mechanics and Engineering, 198:397–406, 2008.
[35] Houman Owhadi, Lei Zhang, and Leonid Berlyand. Polyharmonic homogenization, rough polyharmonic splines and sparse super-localization. Mathematical Modelling and Numerical Analysis, 48(2):517–552, 2014.
[36] Sheng Qian, Hua Liu, Cheng Liu, Si Wu, and Hau San Wong. Adaptive activation functions in convolutional neural networks. Neurocomputing, 272:204–212, 2018.
[37] Tong Qin, Kailiang Wu, and Dongbin Xiu. Data driven governing equations approximation using deep neural networks. Journal of Computational Physics, 395:620–635, 2019.
[38] Nasim Rahaman, Devansh Arpit, Aristide Baratin, Felix Draxler, Min Lin, Fred A. Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of deep neural networks. International Conference on Machine Learning, 2019.
[39] Christian P. Robert and George Casella. Monte Carlo Statistical Methods. 1999.
[40] Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.
[41] Dejan Slepčev and Matthew Thorpe. Analysis of p-