Power series expansion neural network
Qipin Chen, Wenrui Hao, and Juncai He
Pennsylvania State University, University Park, PA 16802
The University of Texas at Austin, Austin, TX 78712
March 1, 2021
Keywords —
Neural Network, Power series expansion, Approximation analysis
Abstract
In this paper, we develop a new neural network family based on power series expansion, which is proved to achieve a better approximation accuracy than existing neural networks. This new set of neural networks can improve the expressive power while preserving comparable computational cost by increasing the degree of the network instead of increasing the depth or width. Numerical results demonstrate the advantage of this new neural network.
1 Introduction

Machine learning has been experiencing an extraordinary resurgence in many important artificial intelligence applications since the late 2000s. In particular, it has been able to produce state-of-the-art accuracy in computer vision, video analysis, natural language processing, and speech recognition. Recently, interest in machine learning-based approaches in the applied mathematics community has increased rapidly [14, 28, 31]. This growing enthusiasm for machine learning stems from the massive amounts of data available from scientific computations and other sources [15, 16]. Mathematically speaking, the main challenge of machine learning is the training process, as both complexity and memory grow rapidly [3] for deep or wide neural networks. Although there are many good approximation results for both deep and wide neural networks, this significant increase in computational cost may not be justified by the performance gain that it brings. The power series expansion (PSE) has been widely used in function approximation, where it reduces the approximation problem to solving a linear system, for instance in spectral methods [24]. However, the curse of dimensionality is the main obstacle in the numerical treatment of most high-dimensional problems based on the PSE approximation. In this paper, we combine the ideas of neural networks and the PSE to develop a new network, the so-called PSENet. This new network can achieve a higher accuracy even for shallow or narrow networks.
2 Power series expansion neural network (PSENet)

Neural networks, consisting of a series of fully connected layers, can be written as a function from the input $x \in \mathbb{R}^d$ to the output $y \in \mathbb{R}^{\kappa}$. Mathematically, a neural network with $L$ hidden layers can be written as follows:
$$ y(x;\theta) = W_L h_{L-1} + b_L, \qquad h_i = \sigma(W_i h_{i-1} + b_i), \ i \in \{1,\cdots,L-1\}, \qquad h_0 = x, \qquad (1) $$
where $W_i \in \mathbb{R}^{n_i \times n_{i-1}}$ is the weight, $b_i \in \mathbb{R}^{n_i}$ is the bias, $n_i$ is the width of the $i$-th hidden layer, and $\sigma$ is the activation function (for example, the rectified linear unit (ReLU) or the sigmoid activation function). Motivated by the power series expansion of a smooth function $f(x)$, i.e., $f(x) \approx \sum_{j=0}^{n} \alpha_j x^j$, we use a power series expansion on each layer, namely,
$$ h_i = \sum_{j=0}^{n} \alpha_{i,j}\, \sigma^j(W_i h_{i-1} + b_i), \qquad (2) $$
where $\alpha_{i,j} \in \mathbb{R}^{n_i}$ are unknown coefficients, $\alpha_{i,j}\, \sigma^j(W_i h_{i-1} + b_i)$ denotes an element-wise multiplication, and $\sigma^j$ stands for the $j$-th power of the activation function. Specifically, if $j = 0$, we set $\sigma^0 = \mathrm{id}$. The PSENet reduces to the ResNet [13] by setting $n = 1$, namely, $h_i = \sigma(W_i h_{i-1} + b_i) + h_{i-1}$. The ResNet adds a shortcut (skip connection) that allows information to flow more easily across layers, i.e., it passes data directly from one layer to a later layer, bypassing the normal layer-by-layer flow.

For the sake of brevity in discussing the approximation properties in Section 3, we define the following general architecture of a hidden layer in PSENet:
$$ h_i = \sum_{j=0}^{n} \alpha_{i,j}\, \sigma^j(W_{i,j} h_{i-1} + b_{i,j}), \qquad (3) $$
where $W_{i,j} \in \mathbb{R}^{n_i \times n_{i-1}}$. It is easy to see that the original definition in (2) is covered by the above formula if we take the weights $W_{i,j} = W_i$ and $b_{i,j} = b_i$. Moreover, we have the following theorem showing the equivalence between the two formulas.

Theorem 2.1. Let $f(x): \mathbb{R}^d \to \mathbb{R}^{\kappa}$ be a PSENet model defined by (3) with maximal power $n$ and widths $n_i$. Then there exists a PSENet model $\tilde f(x)$ defined by (2) with maximal power $n$ and widths $\tilde n_i = (n+1)\, n_i$ such that $\tilde f(x) = f(x)$.

Proof. Following the notation of (1), the PSENet defined by (3) can be written as $f(x) = W_{L+1} h_L(x)$, where $h_i(x) = \sum_{j=0}^{n} \alpha_{i,j}\,\sigma^j(W_{i,j} h_{i-1}(x) + b_{i,j})$, $i = 1, 2, \cdots, L$, $h_0(x) = x$, and $W_{L+1}: \mathbb{R}^{n_L} \to \mathbb{R}^{\kappa}$. For simplicity, we regard $\alpha_{i,j}$ as a diagonal matrix on $\mathbb{R}^{n_i}$. Now, we define $\tilde f(x) = \tilde W_{L+1} \tilde h_L(x)$, where $\tilde h_i(x) = \sum_{j=0}^{n} \tilde\alpha_{i,j}\,\sigma^j(\tilde W_i \tilde h_{i-1}(x) + \tilde b_i)$, $i = 1, \cdots, L$, $\tilde h_0(x) = x$, and $\tilde W_{L+1}: \mathbb{R}^{(n+1) n_L} \to \mathbb{R}^{\kappa}$. Here, we construct $\tilde f(x)$ by taking $\tilde\alpha_{i,j}$ to be the block-diagonal matrix whose $j$-th $n_i \times n_i$ block is the identity and whose other blocks are zero,
$$ \tilde W_i = \begin{pmatrix} W_{i,0}\alpha_{i-1,0} & W_{i,0}\alpha_{i-1,1} & \cdots & W_{i,0}\alpha_{i-1,n} \\ W_{i,1}\alpha_{i-1,0} & W_{i,1}\alpha_{i-1,1} & \cdots & W_{i,1}\alpha_{i-1,n} \\ \vdots & \vdots & \ddots & \vdots \\ W_{i,n}\alpha_{i-1,0} & W_{i,n}\alpha_{i-1,1} & \cdots & W_{i,n}\alpha_{i-1,n} \end{pmatrix}, \qquad \tilde b_i = \begin{pmatrix} b_{i,0} \\ b_{i,1} \\ \vdots \\ b_{i,n} \end{pmatrix} $$
for $i = 2, \cdots, L$. In addition, we take $\tilde W_{L+1} = (W_{L+1}\alpha_{L,0}, W_{L+1}\alpha_{L,1}, \cdots, W_{L+1}\alpha_{L,n})$, $\tilde W_1 = (W_{1,0}, W_{1,1}, \cdots, W_{1,n})^T$, and $\tilde b_1 = (b_{1,0}, b_{1,1}, \cdots, b_{1,n})^T$. Then, we can finish the proof by claiming that
$$ \tilde h_i(x) = \big([\tilde h_i(x)]_0, [\tilde h_i(x)]_1, \cdots, [\tilde h_i(x)]_n\big)^T = \big(\sigma^0(W_{i,0} h_{i-1}(x) + b_{i,0}),\, \sigma(W_{i,1} h_{i-1}(x) + b_{i,1}),\, \cdots,\, \sigma^n(W_{i,n} h_{i-1}(x) + b_{i,n})\big)^T $$
for $i = 1, \cdots, L$. In fact, for $i = 1$, we have
$$ \tilde h_1(x) = \sum_{j=0}^{n} \tilde\alpha_{1,j}\,\sigma^j(\tilde W_1 x + \tilde b_1) = \big(\sigma^0(W_{1,0} x + b_{1,0}),\, \sigma(W_{1,1} x + b_{1,1}),\, \cdots,\, \sigma^n(W_{1,n} x + b_{1,n})\big)^T. $$
Then, by induction we have
$$ \tilde h_i(x) = \sum_{j=0}^{n} \tilde\alpha_{i,j}\,\sigma^j(\tilde W_i \tilde h_{i-1}(x) + \tilde b_i) = \begin{pmatrix} \sigma^0\big(W_{i,0}\sum_{j=0}^{n}\alpha_{i-1,j}[\tilde h_{i-1}(x)]_j + b_{i,0}\big) \\ \sigma\big(W_{i,1}\sum_{j=0}^{n}\alpha_{i-1,j}[\tilde h_{i-1}(x)]_j + b_{i,1}\big) \\ \vdots \\ \sigma^n\big(W_{i,n}\sum_{j=0}^{n}\alpha_{i-1,j}[\tilde h_{i-1}(x)]_j + b_{i,n}\big) \end{pmatrix} = \begin{pmatrix} \sigma^0(W_{i,0} h_{i-1}(x) + b_{i,0}) \\ \sigma(W_{i,1} h_{i-1}(x) + b_{i,1}) \\ \vdots \\ \sigma^n(W_{i,n} h_{i-1}(x) + b_{i,n}) \end{pmatrix}. $$
Therefore, we have $\tilde f(x) = \tilde W_{L+1}\tilde h_L(x) = W_{L+1}\sum_{j=0}^{n}\alpha_{L,j}\,\sigma^j(W_{L,j} h_{L-1}(x) + b_{L,j}) = W_{L+1} h_L(x) = f(x)$, which finishes the proof.
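To make the construction concrete, the following is a minimal PyTorch sketch of one PSENet hidden layer in the form (2); the use of ReLU, the random initialization, and the names (PSELayer, degree, etc.) are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class PSELayer(nn.Module):
    """One PSENet hidden layer, Eq. (2):
    h_i = sum_{j=0}^{n} alpha_{i,j} * sigma^j(W_i h_{i-1} + b_i),  sigma^0 = id."""
    def __init__(self, in_dim, out_dim, degree, activation=torch.relu):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)      # W_i and b_i
        self.degree = degree                          # maximal power n
        self.activation = activation
        # trainable coefficient vectors alpha_{i,j}, one per power j = 0, ..., n
        self.alpha = nn.Parameter(torch.randn(degree + 1, out_dim) / (degree + 1))

    def forward(self, h):
        z = self.linear(h)                            # W_i h_{i-1} + b_i
        out = self.alpha[0] * z                       # j = 0 term (sigma^0 = id)
        s = self.activation(z)
        for j in range(1, self.degree + 1):
            out = out + self.alpha[j] * s ** j        # alpha_{i,j} * sigma(z)^j
        return out
```

Stacking such layers and composing with a final linear map gives the full PSENet; increasing the degree raises the expressive power of each layer without increasing the depth or the width.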
3 Expressive power and approximation properties of PSENet
In this section, we discuss the expressive power and approximation properties of the PSENet defined in (3) by comparing with classical DNNs under the ReLU activation function.
We first denote the one-hidden-layer PSENet function space as
$$ V^{n}_{\mathbf m} = \Big\{ f(x) : f(x) = \sum_{j=0}^{n} \alpha_j\, \sigma^j(W_j x + b_j) \Big\}, $$
where $\mathbf m = (m_0, m_1, \cdots, m_n) \in \mathbb{N}^{n+1}$, $W_j \in \mathbb{R}^{m_j \times d}$, $b_j \in \mathbb{R}^{m_j}$, and $\alpha_j \in \mathbb{R}^{1 \times m_j}$. We then show the expressive power and approximation properties in terms of the largest power $n$ and the number of neurons $|\mathbf m| = \sum_{j=0}^{n} m_j$. Since the activation function $\mathrm{ReLU}^k$ is related to cardinal B-splines, the approximation properties of the PSENet space $V^{n}_{\mathbf m}$ can be analyzed through B-spline function spaces. According to [5], a cardinal B-spline of degree $n \ge 1$, $b^n(x)$, can be written as
$$ b^n(x) = (n+1) \sum_{i=0}^{n+1} w_i\, \sigma^n(i - x), \qquad w_i = \prod_{j=0,\, j \neq i}^{n+1} \frac{1}{i - j}, $$
for $x \in [0, n+1]$ and $n \ge 1$. Moreover, the cardinal B-spline series of degree $n$ on the uniform grid with mesh size $h = \frac{1}{k+1}$ is defined as
$$ B^n_k = \Big\{ v(x) = \sum_{j=-n}^{k} c_j\, b^n_{j,h}(x) \Big\}, \qquad \text{where } b^n_{j,h}(x) = b^n\Big(\frac{x}{h} - j\Big). $$
Then we have the following lemma for the expressive power:
Lemma 3.1.
By choosing $k_i \le m_i - i - 1$, we have $\bigcup_{i=1}^{n} B^i_{k_i} \subset V^n_{\mathbf m}$.

Proof.
We consider the so-called finite neuron method [29] with $\mathrm{ReLU}^k$ as the activation function and define the one-hidden-layer neural network space described in [29] as
$$ V^n_m := \Big\{ f(x) : f(x) = \sum_{j=1}^{m} a_j\, \sigma^n(\omega_j \cdot x + b_j) \Big\}. \qquad (4) $$
Obviously, we have $V^n_{\mathbf m} \supseteq \bigcup_{i=0}^{n} V^i_{m_i}$, or $V^n_{\mathbf m} \supseteq V^1_{m_0+m_1} \cup \big(\bigcup_{i=2}^{n} V^i_{m_i}\big)$, where the second form is derived from $x = \mathrm{ReLU}(x) - \mathrm{ReLU}(-x)$ and linearity. By Lemma 3.2 in [29], we complete the proof.

By using the above expressive power, we have the following approximation result for $V^n_{\mathbf m}$:

Theorem 3.1 (1D case). For any bounded domain $\Omega \subset \mathbb{R}$ and $m_i > i + 1$ large enough, we have
$$ \inf_{v \in V^n_{\mathbf m}} \| u - v \|_{s,\Omega} \lesssim \min_{i=1,2,\cdots,n} \Big\{ m_i^{\,s-(i+1)} \| u \|_{i+1,\Omega} \Big\}. \qquad (5) $$

Proof.
According to the error estimate for $B^i_{N}$ in [29], we have $\inf_{v \in B^i_{m_i - i - 1}} \| u - v \|_{s,\Omega} \lesssim m_i^{\,s-(i+1)} \| u \|_{i+1,\Omega}$. In addition, we have $\bigcup_{i=1}^{n} B^i_{m_i - i - 1} \subset V^n_{\mathbf m}$ if $m_i > i + 1$ by Lemma 3.1. This indicates that
$$ \inf_{v \in V^n_{\mathbf m}} \| u - v \|_{s,\Omega} \le \inf_{v \in \bigcup_{i=1}^{n} B^i_{m_i - i - 1}} \| u - v \|_{s,\Omega} \lesssim \min_{i=1,2,\cdots,n} \Big\{ m_i^{\,s-(i+1)} \| u \|_{i+1,\Omega} \Big\}. $$
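To make the $\mathrm{ReLU}^n$ (truncated-power) representation of the cardinal B-spline behind Lemma 3.1 concrete, the following small NumPy script evaluates $b^n(x) = (n+1)\sum_i w_i\,\sigma^n(i-x)$ and checks the partition-of-unity property $\sum_{j} b^n(x-j) = 1$; this is only an illustrative sanity check of the formula, not part of the analysis.

```python
import numpy as np

def cardinal_bspline(x, n):
    """Cardinal B-spline of degree n >= 1 on [0, n+1] via the ReLU^n formula:
    b^n(x) = (n+1) * sum_i w_i * relu(i - x)**n, with w_i = prod_{j != i} 1/(i - j)."""
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    for i in range(n + 2):
        w_i = np.prod([1.0 / (i - j) for j in range(n + 2) if j != i])
        out += w_i * np.maximum(i - x, 0.0) ** n
    return (n + 1) * out

n = 3
x = np.linspace(2.0, 3.0, 7)
# partition of unity: the integer shifts of b^n sum to one at every point
total = sum(cardinal_bspline(x - j, n) for j in range(-n - 1, n + 2))
print(np.allclose(total, 1.0))   # expected: True
```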
Remark 3.1. By comparing with the $\mathrm{ReLU}^n$-DNN [29], the PSENet has the following advantages:

1. If we know nothing about the regularity of the target function $u(x)$ a priori, then the PSENet $V^n_{\mathbf m}$ gives an adaptive and uniform scheme for approximating any $u \in H^{i}(\Omega)$ for all $i \ge 1$. However, the $\mathrm{ReLU}^n$-DNN can only work for $u \in H^{i}(\Omega)$ with $i \ge n$.

2. By choosing $m_i = 0$ for $i \le n-1$, the PSENet $V^n_{\mathbf m}$ recovers the $\mathrm{ReLU}^n$-DNN exactly. Thus, if $u(x) \in H^{n+1}(\Omega)$, then the PSENet provides almost the same asymptotic convergence rate in terms of the number of hidden neurons $|\mathbf m|$ as the $\mathrm{ReLU}^n$-DNN [29].

3. If $u(x)$ is a smooth function, the PSENet $V^n_{\mathbf m}$ provides a better approximation than the $\mathrm{ReLU}^n$-DNN when the number of neurons $|\mathbf m|$ is not large, since $\| u \|_{n+1,\Omega}$ might be very large.

Following the observation in [29], we have the following theorem about the expressive power of PSENet for polynomials in the multi-dimensional case.
Theorem 3.2 (Multi-dimensional case). For any polynomial $p(x) = \sum_{|\alpha| \le k} a_\alpha x^\alpha$ on $\mathbb{R}^d$, there exists a PSENet function $\hat p(x) = \sum_{j=0}^{k} c_j\, \sigma^j(W_j x + b_j)$ with $m_i \le 2\binom{i+d-1}{i}$ such that $\hat p(x) = p(x)$ on $\mathbb{R}^d$.

Proof. We first denote by $X_i = (x^{\alpha_1}, x^{\alpha_2}, \cdots, x^{\alpha_{n_i}})^T$ the natural basis of the space of homogeneous polynomials of degree $i$ on $\mathbb{R}^d$. Here $n_i = \binom{i+d-1}{i}$ is the dimension of the space. Thus we have $\big((w_1\cdot x)^i, (w_2\cdot x)^i, \cdots, (w_{n_i}\cdot x)^i\big)^T = \mathcal{W} X_i$, where $\mathcal{W} \in \mathbb{R}^{n_i \times n_i}$ is a matrix formed by $w_1, w_2, \cdots, w_{n_i}$. Based on the generalized Vandermonde determinant identity [30], $\mathcal{W}$ is an invertible matrix if we choose the $w_s$ appropriately. Thus, $(w_1\cdot x)^i, (w_2\cdot x)^i, \cdots, (w_{n_i}\cdot x)^i$ form a basis of the space of homogeneous polynomials of degree $i$ on $\mathbb{R}^d$; more details can be found in [10]. Therefore, by choosing suitable $w_s \in \mathbb{R}^d$ for $s = 1, \cdots, n_i$, the functions $(w_s\cdot x)^i = \mathrm{ReLU}^i(w_s\cdot x) + (-1)^i\,\mathrm{ReLU}^i(-w_s\cdot x)$, which belong to the PSENet space, form a basis of the homogeneous polynomials of degree $i$ on $\mathbb{R}^d$. Hence $V^i_{m_i}$ can reproduce any homogeneous polynomial of degree $i$ if $m_i = 2\binom{i+d-1}{i}$.
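The one-dimensional identity used in this proof, $(w_s\cdot x)^i = \mathrm{ReLU}^i(w_s\cdot x) + (-1)^i\,\mathrm{ReLU}^i(-w_s\cdot x)$, is easy to verify numerically; the short NumPy check below does so for a random direction $w$ and random sample points (an illustration only, with arbitrary dimension and power).

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda t: np.maximum(t, 0.0)

d, i = 4, 3                          # dimension and power (arbitrary choices)
w = rng.standard_normal(d)
x = rng.standard_normal((100, d))    # a batch of sample points
z = x @ w                            # w . x for every sample

lhs = z ** i
rhs = relu(z) ** i + (-1) ** i * relu(-z) ** i
print(np.allclose(lhs, rhs))         # expected: True
```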
Remark 3.2.

1. The total number of neurons is $|\mathbf m| = \sum_{i=0}^{k} m_i = \sum_{i=0}^{k} 2\binom{i+d-1}{i} = 2\binom{k+d}{k}$, which equals the number of neurons needed by a $\mathrm{ReLU}^k$-DNN to recover polynomials of degree $k$, as shown in [29]. Considering the spectral accuracy of polynomial approximation of smooth functions in terms of the degree $k$, the above representation theorem shows that the PSENet can achieve an exponential approximation rate for smooth functions with respect to $n$ in $V^n_{\mathbf m}$; some similar results can be found in [6, 9, 20, 21, 26].

2. The PSENet can take a large degree $n$ to reproduce high-order polynomials instead of using a deep network, whereas other networks need many layers to improve the performance, for example in terms of expressive power [1, 11, 12, 19, 25, 27], approximation properties [6, 7, 17, 18, 20, 21, 22], benefits for training [2], etc.

We now apply the PSENet to the approximation of singular functions, which has been widely studied for hp-FEM [4], and consider non-smooth functions in the Gevrey class [4, 23, 20] on $I = (0,1)$. For $\beta > 0$, we define the weight function $\varphi_\beta(x) = x^\beta$ on $[0,1]$, the seminorm
$$ | u |_{H^{k,\ell}_\beta(I)} := \| \varphi_{\beta+k-\ell}\, D^k u \|_{L^2(I)}, $$
and the $H^{k,\ell}_\beta$ norm as
$$ \| u \|_{H^{k,\ell}_\beta(I)} := \begin{cases} \displaystyle\sum_{k'=0}^{k} | u |_{H^{k',0}_\beta(I)}, & \text{if } \ell = 0, \\[2mm] \displaystyle\sum_{k'=\ell}^{k} | u |_{H^{k',\ell}_\beta(I)} + \| u \|_{H^{\ell-1}(I)}, & \text{if } \ell \ge 1, \end{cases} \qquad (6) $$
where $\ell, k = 0, 1, 2, \cdots$. For any $\delta \ge 1$, the Gevrey class $G^{\ell,\delta}_\beta(I)$ is defined as the class of functions $u \in \bigcap_{k \ge \ell} H^{k,\ell}_\beta(I)$ for which there exist $M, m > 0$ such that
$$ | u |_{H^{k,\ell}_\beta(I)} \le M\, m^{k-\ell} \big((k-\ell)!\big)^{\delta} \qquad \forall\, k \ge \ell. $$
When $d = 1$, these functions have a singular point at $x = 0$, and the hp finite element method converges exponentially for this function class [23, 8]. We consider the piecewise polynomial space on the mesh $\mathcal{T}_n: 0 = x_0 < x_1 < \cdots < x_n = 1$,
$$ P_{\mathbf p}(\mathcal{T}_n) = \big\{ p_h \in C^0(I) \ \big|\ p_h \text{ is a polynomial of degree } p^{(i)} \text{ on each element } [x_{i-1}, x_i] \big\}, $$
and have the following estimate:

Lemma 3.2 ([20]). Let $\sigma, \beta \in (0,1)$, $\delta \ge 1$, $u \in G^{2,\delta}_\beta(I)$ and $n \in \mathbb{N}$ be given. For $\mu_0 = \mu_0(\sigma, \delta, m) := \max\{1, m(2e)^{-\delta}\}$ and for any $\mu > \mu_0$, let $\mathbf p = (p^{(i)})_{i=1}^{n} \subset \mathbb{N}$ be defined as $p^{(1)} := 1$ and $p^{(i)} := \lfloor \mu i^{\delta} \rfloor$ for $i \in \{2, \ldots, n\}$. Then there exists $v(x) \in P_{\mathbf p}(\mathcal{T}_n)$ on the geometric mesh $x_0 = 0$, $x_i = \sigma^{n-i}$ for $i \in \{1, \ldots, n\}$, with $v(x_i) = u(x_i)$, such that for constants $C(\sigma, \beta, \delta, \mu, M, m),\, c(\beta, \delta) > 0$ it holds that
$$ \| u - v \|_{H^1(0,1)} \le C e^{-cn}. $$
Note that the degree vector $\mathbf p$ in Lemma 3.2 is non-decreasing, i.e., $v \in P_{\mathbf p}(\mathcal{T}_n)$ with $p^{(i)} \le p^{(i+1)}$.
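As a concrete illustration of the hp-type construction in Lemma 3.2, the short script below builds a geometrically refined mesh and the degree vector $p^{(1)} = 1$, $p^{(i)} = \lfloor \mu i^{\delta} \rfloor$, and confirms that the degrees are non-decreasing, which is the property used in Lemma 3.3 below. The parameter values (and the specific grading $x_i = \sigma^{n-i}$) are illustrative assumptions.

```python
import numpy as np

sigma, delta, mu, n = 0.5, 1.5, 2.0, 8          # illustrative parameters only

# geometrically graded mesh on [0, 1]: x_0 = 0, x_i = sigma**(n - i) for i = 1, ..., n
x = np.concatenate(([0.0], sigma ** (n - np.arange(1, n + 1))))
# degree distribution: p(1) = 1 and p(i) = floor(mu * i**delta) for i >= 2
p = np.array([1] + [int(np.floor(mu * i ** delta)) for i in range(2, n + 1)])

print(x)                          # strongly refined toward the singular point x = 0
print(p)                          # element-wise polynomial degrees
print(np.all(np.diff(p) >= 0))    # non-decreasing degrees: expected True
```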
Lemma 3.3. For any function $p_h \in P_{\mathbf p}(\mathcal{T}_n)$ with $p^{(i)} \le p^{(i+1)}$, $p_h(x)$ can be reproduced by a PSENet, namely,
$$ p_h(x) = \sum_{j=0}^{p^{(n)}} \alpha_j\, \sigma^j(W_j x + b_j), \quad \forall x \in [0,1], \quad \text{where } m_j \le n. \qquad (7) $$

Proof.
First, we write $p_h(x)$ as
$$ p_h(x) - p_h(0) = \sum_{i=1}^{n} \chi_{I_i}(x)\, p_{h,i}(x), $$
where $\chi_{I_i}(x)$ is the indicator function of $I_i = [x_{i-1}, x_i)$ for $i = 1, \cdots, n-1$, and $I_n = [x_{n-1}, x_n]$. Here $p_{h,i}(x)$ is the polynomial of degree $p^{(i)}$ that $p_h(x)$ coincides with on $I_i$. Thanks to the property $p^{(i)} \le p^{(i+1)}$, we can rewrite $p_h(x)$ as
$$ p_h(x) - p_h(0) = \sum_{i=1}^{n} \chi_{\tilde I_i}(x)\, \tilde p_{h,i}(x), $$
where $\tilde I_i = [x_{i-1}, 1]$ and $\tilde p_{h,i}(x)$ is a polynomial of degree $p^{(i)}$ defined as $\tilde p_{h,i}(x) = p_{h,i}(x) - p_{h,i-1}(x)$ for $i = 2, 3, \cdots, n$, with $\tilde p_{h,1}(x) = p_{h,1}(x)$. In addition, we have
$$ \tilde p_{h,i}(x) = \sum_{j=1}^{p^{(i)}} \tilde a^{(i)}_j (x - x_{i-1})^j $$
due to the continuity of $p_h(x)$ on $[0,1]$. Combining this with the definition of $\chi_{\tilde I_i}(x)$, we have
$$ \chi_{\tilde I_i}(x)\, \tilde p_{h,i}(x) = \sum_{j=1}^{p^{(i)}} \tilde a^{(i)}_j\, \sigma^j(x - x_{i-1}) \quad \text{on } [0,1]. $$
Summing over $i = 1, \cdots, n$ and collecting, for each power $j$, the at most $n$ terms $\sigma^j(x - x_{i-1})$ gives the representation (7) with $m_j \le n$.
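The construction in this proof can be checked numerically on a small example: the snippet below takes a continuous piecewise polynomial on the mesh $0 < 0.5 < 1$ with degrees $p^{(1)} = 1$ and $p^{(2)} = 2$, forms the corrections $\tilde p_{h,i}$, and verifies that the truncated powers $\sigma^j(x - x_{i-1})$ reproduce it exactly on $[0,1]$ (the coefficients are arbitrary illustrative choices).

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)
x = np.linspace(0.0, 1.0, 201)

# continuous piecewise polynomial: p_h = 2x on [0, 0.5] and p_h = 2x + 3(x - 0.5)^2 on [0.5, 1]
direct = np.where(x < 0.5, 2.0 * x, 2.0 * x + 3.0 * (x - 0.5) ** 2)

# truncated-power (PSENet) representation from the proof:
#   tilde p_{h,1} = 2x            ->  2 * sigma^1(x - 0)
#   tilde p_{h,2} = 3(x - 0.5)^2  ->  3 * sigma^2(x - 0.5)
psenet = 2.0 * relu(x - 0.0) + 3.0 * relu(x - 0.5) ** 2

print(np.allclose(direct, psenet))   # expected: True
```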
Theorem 3.3. For all $\delta \ge 1$, $\beta \in (0,1)$, and $u \in G^{2,\delta}_\beta(I)$, there exists a PSENet function $\hat u(x)$ with one hidden layer and $|\mathbf m| = M$ such that
$$ \| u - \hat u \|_{H^1(0,1)} \le C_1 e^{-C_2 M^{\frac{1}{\delta+1}}}, $$
for some constants $C_1$ and $C_2$ which depend only on $u(x)$.

Proof. For any function $u(x) \in G^{2,\delta}_\beta(I)$, Lemma 3.2 shows that there exists $u_h(x) \in P_{\mathbf p}(\mathcal{T}_n)$ such that $\| u - u_h \|_{H^1(0,1)} \le C e^{-cn}$, with $p^{(1)} := 1$ and $p^{(i)} := \lfloor \mu i^{\delta} \rfloor$ for $i \in \{2, \ldots, n\}$. According to Lemma 3.3, there exists a PSENet function $\hat u(x)$ with one hidden layer and $m_j \le n$ such that $\hat u(x) = u_h(x)$ on $[0,1]$. Thus $\| u - \hat u \|_{H^1(0,1)} = \| u - u_h \|_{H^1(0,1)} \le C e^{-cn}$. Then it is easy to obtain the final approximation rate, since $|\mathbf m| = \sum_{j=0}^{p^{(n)}} m_j \lesssim \mu n^{\delta+1}$.

This approximation result achieves a better convergence rate than the results in [9] and [20].

4 Numerical results

In this section, we compare the PSENet with ResNet on both fully connected and convolutional neural networks, using the ReLU activation function.
We first compare the PSENet with fully connected neural networks and ResNet in approximating $y = \sin(n\pi x)$ on $[0,1]$ and $y = \sin(n\pi(x_1 + x_2))$ on $[0,1] \times [0,1]$, with both one and two hidden layers. As shown in Table 1, the PSENet with a suitably chosen degree achieves a better approximation accuracy than the other two networks. Secondly, we consider the 1D function $f(x) = x^{\alpha}$ with $\alpha \in (0,1)$ on $x \in [0,1]$, for which $x = 0$ is a singular point. By the theoretical analysis in Section 3.2, the PSENet can achieve the optimal approximation rate compared with the $\mathrm{ReLU}^k$-DNN, which is confirmed in Table 2.

Table 1: Comparison between PSENet, fully connected neural networks, and ResNet on the approximation accuracy of $f(x) = \sin(n\pi x)$ and $f(x) = \sin(n\pi(x_1 + x_2))$ for $n = 3, 4, 5$, with one and two hidden layers. The number of neurons on each layer is 10. The best approximation accuracy of the PSENet over degrees 1 to 5 is highlighted.

Table 2: Accuracy comparison of $\int_0^1 \big[(N(x) - f(x))^2 + (N'(x) - f'(x))^2\big]\, dx$ with $f(x) = x^{\alpha}$, $\alpha \in (0,1)$, on $x \in [0,1]$, for ResNet, ReLU networks of degree 1 to 5, and PSENet of degree 1 to 5.
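For reference, the following is a minimal, self-contained sketch of this kind of 1D fitting experiment: a one-hidden-layer PSENet of degree 3 and width 10 is trained on $f(x) = \sin(3\pi x)$ with a mean-squared loss. The optimizer, learning rate, iteration count, and sampling are arbitrary choices for illustration and are not the settings used to produce Table 1.

```python
import math
import torch

torch.manual_seed(0)
n_deg, width = 3, 10                      # maximal power and hidden width (illustrative)
x = torch.rand(2000, 1)
y = torch.sin(3 * math.pi * x)

# parameters of a one-hidden-layer PSENet: h = sum_j alpha_j * relu(W x + b)^j, output = h c^T + d
W = torch.randn(width, 1, requires_grad=True)
b = torch.randn(width, requires_grad=True)
alpha = torch.randn(n_deg + 1, width, requires_grad=True)
c = torch.randn(1, width, requires_grad=True)
d = torch.zeros(1, requires_grad=True)

def psenet(x):
    z = x @ W.T + b                       # (batch, width)
    h = alpha[0] * z                      # j = 0 term (sigma^0 = id)
    for j in range(1, n_deg + 1):
        h = h + alpha[j] * torch.relu(z) ** j
    return h @ c.T + d

opt = torch.optim.Adam([W, b, alpha, c, d], lr=1e-2)
for step in range(5000):
    loss = torch.mean((psenet(x) - y) ** 2)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())                        # final training mean-squared error
```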
We then compare the PSENet with different ResNets on both CIFAR-10 and CIFAR-100. The results in Table 3 show that the PSENet achieves better error rates than ResNet with the same number of layers. In particular, the PSENet achieves better error rates than ResNet for shallow networks and keeps comparable error rates for deep networks.

Table 3: Comparison between PSENet and ResNet on the CIFAR-10 and CIFAR-100 datasets with different numbers of layers.
5 Conclusion

We develop a novel neural network by combining the ideas of the power series expansion and neural network approximation. Theoretically, we prove better approximation results for the PSENet compared with the $\mathrm{ReLU}^k$-DNN, as well as an optimal approximation rate for singular functions. Moreover, the PSENet can achieve a better approximation accuracy with a shallow network structure than other neural networks, and several numerical results have been used to demonstrate its advantages. This new approach shows that increasing the degree of the PSENet, rather than going deeper, can also lead to further performance improvements.

References

[1] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018.

[2] S. Arora, N. Cohen, and E. Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In International Conference on Machine Learning, 2018.

[3] Q. Chen and W. Hao. A homotopy training algorithm for fully connected neural networks.
Proceedings of the Royal Society A, 475(2231):20190662, 2019.

[4] A. Chernov, T. von Petersdorff, and C. Schwab. Exponential convergence of hp quadrature for integral operators with Gevrey kernels. ESAIM: Mathematical Modelling and Numerical Analysis, 45(3):387–422, 2011.

[5] C. de Boor. Subroutine package for calculating with B-splines. Los Alamos Scientific Laboratory Report LA-4728-MS, 1971.

[6] W. E and Q. Wang. Exponential convergence of the deep neural network approximation for analytic functions. Science China Mathematics, 61(10):1733–1740, 2018.

[7] I. Gühring, G. Kutyniok, and P. Petersen. Error bounds for approximations with deep ReLU neural networks in $W^{s,p}$ norms. Analysis and Applications, 18(05):803–859, 2020.

[8] W. Gui and I. Babuška. The h, p and h-p versions of the finite element method in 1 dimension. Numerische Mathematik, 49(6):613–657, 1986.

[9] J. He, L. Li, and J. Xu. Approximation properties of ReLU deep neural networks for smooth and non-smooth functions. In preparation, 2021.

[10] J. He, L. Li, and J. Xu. DNN with Heaviside, ReLU and ReQU activation functions. In preparation, 2021.

[11] J. He, L. Li, and J. Xu. ReLU deep neural networks from the perspective of hierarchical basis. In preparation, 2021.

[12] J. He, L. Li, J. Xu, and C. Zheng. ReLU deep neural networks and linear finite elements.
Journal of Computational Mathematics, 38(3):502–527, 2020.

[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[14] S. Kang, W. Liao, and Y. Liu. IDENT: Identifying differential equations with numerical time evolution. Journal of Scientific Computing, 87(1):1–27, 2021.

[15] H. Lei, L. Wu, and W. E. Machine-learning-based non-Newtonian fluid model with molecular fidelity. Physical Review E, 102(4):043309, 2020.

[16] L. Lu, X. Meng, Z. Mao, and G. Karniadakis. DeepXDE: A deep learning library for solving differential equations. SIAM Review, 63(1):208–228, 2021.

[17] Z. Lu, H. Pu, F. Wang, Z. Hu, and L. Wang. The expressive power of neural networks: A view from the width. In Advances in Neural Information Processing Systems, pages 6231–6239, 2017.

[18] H. Montanelli, H. Yang, and Q. Du. Deep ReLU networks overcome the curse of dimensionality for bandlimited functions. arXiv preprint arXiv:1903.00735, 2019.

[19] G. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.

[20] J. Opschoor, P. Petersen, and C. Schwab. Deep ReLU networks and high-order finite element methods.
Analysis and Applications, pages 1–56, 2020.

[21] J. Opschoor, C. Schwab, and J. Zech. Exponential ReLU DNN expression of holomorphic maps in high dimension. SAM Research Report 2019, 2019.

[22] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. International Journal of Automation and Computing, 14(5):503–519, 2017.

[23] C. Schwab. p- and hp- Finite Element Methods: Theory and Applications to Solid and Fluid Mechanics. The Clarendon Press, Oxford University Press, New York, 1998.

[24] J. Shen, T. Tang, and L.-L. Wang. Spectral Methods: Algorithms, Analysis and Applications, volume 41. Springer Science & Business Media, 2011.

[25] Z. Shen, H. Yang, and S. Zhang. Nonlinear approximation via compositions. Neural Networks, 119:74–84, 2019.

[26] S. Tang, B. Li, and H. Yu. ChebNet: Efficient and stable constructions of deep neural networks with rectified power units using Chebyshev approximations. arXiv preprint arXiv:1911.05467, 2019.

[27] M. Telgarsky. Benefits of depth in neural networks. Journal of Machine Learning Research, 49(June):1517–1539, 2016.

[28] B. Wang, D. Zou, Q. Gu, and S. Osher. Laplacian smoothing stochastic gradient Markov chain Monte Carlo. SIAM Journal on Scientific Computing, 43(1):A26–A53, 2021.

[29] J. Xu. The finite neuron method and convergence analysis. arXiv preprint arXiv:2010.01458, 2020.

[30] I. Yaacov. A multivariate version of the Vandermonde determinant identity. arXiv preprint arXiv:1405.0993, 2014.

[31] W. Zhu, Q. Qiu, B. Wang, J. Lu, G. Sapiro, and I. Daubechies. Stop memorizing: A data-dependent regularization framework for intrinsic pattern learning.