Approximation of smooth and sparse functions by deep neural networks without saturation
Xia Liu ∗
1. School of Sciences, Xi’an University of Technology, Xi’an 710048, China
Abstract
Constructing neural networks for function approximation is a classical and longstanding topic in approximation theory. In this paper, we aim at constructing deep neural networks (deep nets for short) with three hidden layers to approximate smooth and sparse functions. In particular, we prove that the constructed deep nets can reach the optimal approximation rate in approximating both smooth and sparse functions with controllable magnitude of free parameters. Since saturation, which describes the bottleneck of approximation, is an insurmountable problem for constructive neural networks, we also prove that deepening the network with only one more hidden layer can avoid saturation. The obtained results underlie the advantages of deep nets and provide theoretical explanations for deep learning.
Keywords:
Approximation theory, deep learning, deep neural networks, localized approximation, sparse approximation.
1. Introduction
Machine learning [5] is a key sub-field of artificial intelligence (AI) which abounds in science, engineering, medicine, computational finance and so on. Neural networks [29, 26, 16, 21] are an eternal topic of machine learning: they make machine learning no longer merely execute commands like a machine, but give it the ability to draw inferences about other cases from one instance.
The research was supported by the National Natural Science Foundation of China (Grant Nos. 61806162, 11501496 and 11701443). ∗Corresponding author: [email protected]
Deep learning [12, 15] is a new, active area of machine learning research based on deep structured learning models with appropriate algorithms, and is acclaimed as a magical approach to deal with massive data. Indeed, neural networks with more than one hidden layer are among the most typical deep structured models in deep learning [12]. In the current literature [21, 30], it was shown that deep nets outperform shallow neural networks (shallow nets for short) in the sense that deep nets break through some lower bounds for shallow nets. Furthermore, several studies [11, 18, 22, 27, 32, 33] have demonstrated the superiority of deep nets by showing that deep nets can approximate various functions that shallow nets with a similar number of neurons fail to approximate.

Constructing neural networks to approximate continuous functions is a classical and prevalent topic in approximation theory. In 1996, Mhaskar [25] proved that neural networks with a single hidden layer are capable of providing an optimal order of approximation for smooth functions. The problem, however, is that the weights and biases of the constructed shallow nets are huge, which usually leads to extremely large capacity [13]. Besides this partly positive approximation result, it was shown in [7, 22] that there is a bottleneck for shallow nets in approximating smooth functions, in the sense that there is a lower bound for the approximation error. Moreover, Chui et al. [6] showed that shallow nets with an ideal sigmoidal activation function cannot provide localized approximation in Euclidean space. Furthermore, it was proved in [9] that shallow nets cannot capture the rotation-invariance property, by showing the same approximation rates in approximating rotation-invariant functions and general smooth functions. All these results present limitations of shallow nets from the approximation theory viewpoint.

To overcome these limitations of shallow nets, Chui et al. [6] demonstrated that deep nets with two hidden layers can provide localized approximation. Further, Chui et al. [9] showed that deep nets with two hidden layers and controllable norms of weights can approximate univariate smooth functions without saturation, and that adding depth can realize rotation-invariance. Here, saturation [21] means that the approximation rate cannot be improved once the smoothness of the target functions achieves a certain level; overcoming it was proposed as an open question by Chen [2]. The general results by Lin [24] indicated that deep nets with two hidden layers and controllable weights possess both localized and sparse approximation properties in the spatial domain. They also proved that learning strategies based on deep nets can learn more functions with almost optimal learning rates than those based on shallow nets. The problem in [24] is that the saturation cannot be overcome. The above theoretical verifications demonstrate that deep nets with two hidden layers can indeed overcome some deficiencies of shallow nets, but only partially.

Recent literature on deep nets [34, 17] proved that deep nets with the ReLU activation function (deep ReLU nets for short) are more efficient in approximating smooth functions and possess better generalization performance for numerous learning tasks than shallow nets. Nevertheless, the constructed deep ReLU nets are too deep, which results in several difficulties in training, including the gradient vanishing phenomenon and divergence issues [12].
Furthermore, how to select the depth is still an open problem, and it is a common phenomenon that deep nets with a huge number of hidden layers become inoperable in practice [15]. Under these circumstances, we hope to construct a deep net with good approximation capability, controllable parameters, no saturation, and moderate depth. To this end, we construct in this paper a deep net with three hidden layers that possesses the following properties: localized approximation, optimal approximation rates, controllable parameters, non-saturation and spatial sparsity. Our main tools for the analysis are localized approximation [7, 24], the "product gate" strategy [9, 31, 34] and localized Taylor polynomials [17, 31].
2. Main results
Let $I = [0,1]$, $d \in \mathbb{N}$, $x \in X := I^d$, and let $C(\mathbb{R}^d)$ be the space of continuous functions with the norm $\|f\|_\infty := \|f\|_{C(\mathbb{R}^d)} := \max_{x \in \mathbb{R}^d} |f(x)|$. For $x \in X$, the set of shallow nets can be mathematically expressed as

$$\mathcal{F}_{\sigma,n} = \left\{ \sum_{i=1}^{n} c_i\, \sigma(w_i \cdot x + b_i) : w_i \in \mathbb{R}^d,\ b_i, c_i \in \mathbb{R} \right\}, \qquad (2.1)$$

where $\sigma : \mathbb{R} \to \mathbb{R}$ is an activation function, $n$ is the number of hidden neurons (nodes), $c_i \in \mathbb{R}$ is the outer weight, $w_i := (w_{ji})_{j=1}^{d} \in \mathbb{R}^d$ is the inner weight, and $b_i$ is the bias (threshold) of the $i$-th hidden node.

Let $l \in \mathbb{N}$, $d_0 = d, d_1, \dots, d_l \in \mathbb{N}$, and let $\sigma_k : \mathbb{R} \to \mathbb{R}$ ($k = 1, 2, \dots, l$) be univariate nonlinear functions. For $\vec{h} = (h^{(1)}, \dots, h^{(d_k)})^T \in \mathbb{R}^{d_k}$, define $\vec{\sigma}_k(\vec{h}) = (\sigma_k(h^{(1)}), \dots, \sigma_k(h^{(d_k)}))^T$. Denote by $\mathcal{H}_{\{\sigma_j\}, l, \tilde{n}}$ the set of deep nets with $l$ hidden layers and $\tilde{n}$ free parameters that can be mathematically represented by

$$h_{\{\sigma_j\}, l, \tilde{n}}(x) = \vec{a} \cdot \vec{h}_l(x), \qquad (2.2)$$

where $\vec{h}_k(x) = \vec{\sigma}_k(W_k \cdot \vec{h}_{k-1}(x) + \vec{b}_k)$ for $k = 1, 2, \dots, l$, $\vec{h}_0(x) = x$, $\vec{a} \in \mathbb{R}^{d_l}$, $\vec{b}_k \in \mathbb{R}^{d_k}$, $W_k := (W^k_{i,j})_{d_k \times d_{k-1}}$ is a $d_k \times d_{k-1}$ matrix, and $\tilde{n}$ denotes the number of free parameters, i.e., $\tilde{n} = \sum_{k=1}^{l} (d_k \cdot d_{k-1} + d_k) + d_l$. The structure of deep nets, depicted in Figure 1, depends mainly on the structures of the weight matrices $W_k$ and the parameter vectors $\vec{b}_k$ and $\vec{a}$, $k = 1, 2, \dots, l$. It is easy to see that when $l = 1$, the function defined by (2.2) is a shallow net.
Figure 1: Structure for deep neural networks
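As an illustration of the architecture in (2.2) and of Figure 1, the following minimal sketch implements the forward pass $\vec{h}_k = \vec{\sigma}_k(W_k \vec{h}_{k-1} + \vec{b}_k)$, $h = \vec{a} \cdot \vec{h}_l$, in Python with NumPy. The layer widths, the logistic activation and the random parameters are illustrative assumptions and not part of the construction analyzed below.

```python
import numpy as np

def sigmoid(t):
    # a sigmoidal activation satisfying (2.5)
    return 1.0 / (1.0 + np.exp(-t))

def deep_net(x, weights, biases, a, activations):
    """Forward pass of (2.2): h_k = sigma_k(W_k h_{k-1} + b_k), output a . h_l."""
    h = x
    for W, b, act in zip(weights, biases, activations):
        h = act(W @ h + b)
    return a @ h

# illustrative example: d = 2 inputs, l = 3 hidden layers of widths 4, 4, 3
rng = np.random.default_rng(0)
dims = [2, 4, 4, 3]
weights = [rng.standard_normal((dims[k + 1], dims[k])) for k in range(3)]
biases = [rng.standard_normal(dims[k + 1]) for k in range(3)]
a = rng.standard_normal(dims[-1])

x = np.array([0.3, 0.7])          # a point in X = [0, 1]^2
print(deep_net(x, weights, biases, a, [sigmoid] * 3))
```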
In this part, we focus on approximating smooth functions by deep nets. The smoothness property is a widely used a priori assumption in approximation and learning theory [8, 10, 14, 23, 34]. Let $c_0$ be a positive constant and $r = k + v$ with $k \in \mathbb{N}_0 := \{0\} \cup \mathbb{N}$ and $0 < v \le 1$. A function $f : X \to \mathbb{R}$ is said to be $(r, c_0)$-smooth if $f$ is $k$-times differentiable and, for any $\alpha_j \in \mathbb{N}_0$, $j = 1, \dots, d$ with $\alpha_1 + \dots + \alpha_d = k$, the partial derivatives $\partial^k f / \partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}$ exist and satisfy, for any $x, z \in X$,

$$\left| \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(x) - \frac{\partial^k f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}(z) \right| \le c_0 \|x - z\|^v. \qquad (2.3)$$

Throughout this paper, $\|x\|$ denotes the Euclidean norm of $x$. In particular, if $0 < r \le 1$, then

$$|f(x) - f(z)| \le c_0 \|x - z\|^r, \qquad \forall x, z \in X. \qquad (2.4)$$

Denote by $Lip(r, c_0)$ the family of $(r, c_0)$-smooth functions defined by (2.3), which for $0 < r \le 1$ reduces to the Lipschitz condition (2.4). In fact, the Lipschitz property quantifies the smoothness of $f$ and has been adopted in a large literature [30, 6, 9, 24, 20] to quantify the approximation ability of neural networks.

As is well known, different activation functions used in neural networks lead to different results [30]. Among all activation functions, the sigmoidal function and the Heaviside function are two commonly used ones. Similarly to [24], we use these two activation functions to construct deep nets. The main reason is that the usage of the Heaviside function can enhance the localized approximation performance [24], while the adoption of a sigmoidal function can improve the capability of approximating algebraic polynomials [9]. Let $\sigma_0$ be the Heaviside function, i.e., $\sigma_0(t) = 1$ if $t \ge 0$ and $\sigma_0(t) = 0$ if $t < 0$, and let $\sigma : \mathbb{R} \to \mathbb{R}$ be a sigmoidal function, i.e.,

$$\lim_{t \to +\infty} \sigma(t) = 1, \qquad \lim_{t \to -\infty} \sigma(t) = 0. \qquad (2.5)$$

Due to (2.5), for any $\varepsilon > 0$ there exists a $K_\varepsilon = K(\varepsilon, \sigma) > 0$ such that

$$|\sigma(t) - 1| < \varepsilon \ \text{ if } t \ge K_\varepsilon, \qquad |\sigma(t)| < \varepsilon \ \text{ if } t \le -K_\varepsilon. \qquad (2.6)$$
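As a concrete illustration of (2.6) (not part of the assumptions of the paper), for the logistic function one may take $K_\varepsilon = \ln(1/\varepsilon)$, since $1 - \sigma(t) \le e^{-t}$ and $\sigma(-t) \le e^{-t}$ for $t \ge 0$; the short check below verifies this choice numerically.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

eps = 1e-3
K_eps = np.log(1.0 / eps)              # candidate K_eps for the logistic function
t = np.linspace(K_eps, K_eps + 50, 1000)
assert np.all(np.abs(sigmoid(t) - 1.0) < eps)   # |sigma(t) - 1| < eps for t >= K_eps
assert np.all(np.abs(sigmoid(-t)) < eps)        # |sigma(t)| < eps for t <= -K_eps
print("K_eps =", K_eps)
```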
Before presenting the main results, we introduce some assumptions. Assumption 1 is the $r$-smoothness condition on the target function, which is a standard condition in approximation and learning theory.

Assumption 1. We assume $g \in Lip(r, c_0)$ with $r = k + v$, $k \in \mathbb{N}_0$, $0 < v \le 1$ and $c_0 > 0$.

Assumption 2 imposes restrictions on the sigmoidal activation function $\sigma$, which have already been adopted in [19].

Assumption 2.
For $r = k + v$, $k \in \mathbb{N}_0$, $0 < v \le 1$, let $\sigma$ be a non-decreasing sigmoidal function with $\|\sigma'\|_{L_\infty(\mathbb{R})} \le 1$ and $\|\sigma\|_{L_\infty(\mathbb{R})} \le 1$, and suppose there exists $b \in \mathbb{R}$ such that $\sigma^{(j)}(b) \neq 0$ for all $j = 0, 1, 2, \dots, k_0$, where $k_0 \ge \max\{k, 2\} + 1$.

There are many functions satisfying the above restrictions, such as the logistic function $\sigma(t) = \frac{1}{1 + e^{-t}}$, the hyperbolic tangent function $\sigma(t) = \frac{1}{2}(\tanh(t) + 1)$, and the Gompertz function $\sigma(t) = e^{-a e^{-bt}}$ with $a, b > 0$.

Our first main result is the following theorem, in which we construct a deep net with three hidden layers to approximate smooth functions. Denote by $\mathcal{H}_{3, \tilde{n}} := \mathcal{H}_{\{\sigma_0, \sigma, \sigma\}, 3, \tilde{n}}$ the set of deep nets with three hidden layers and $\tilde{n}$ free parameters, where $\sigma_0$, $\sigma$ and $\sigma$ are the activation functions in the first, second and third hidden layers, respectively.
Theorem 2.1. Let $0 < \varepsilon \le 1$. Under Assumptions 1 and 2, there exists a deep net $H(x) \in \mathcal{H}_{3, \tilde{n}}$ such that

$$|g(x) - H(x)| \le C\left( \tilde{n}^{-r/d} + \tilde{n}\varepsilon \right), \qquad (2.7)$$

where all parameters of this deep net are bounded by $\mathrm{poly}(\tilde{n}, \varepsilon^{-1})$, which denotes some polynomial function of $\tilde{n}$ and $\varepsilon^{-1}$, and $C$ is a constant independent of $\tilde{n}$ and $\varepsilon$.

The proof of Theorem 2.1 is postponed to Section 4; a direct consequence of Theorem 2.1 is as follows.

Corollary 2.2.
Under Assumptions 1 and 2, if $\varepsilon = \tilde{n}^{-(r+d)/d}$, then there holds

$$|g(x) - H(x)| \le \bar{C}\, \tilde{n}^{-r/d}, \qquad (2.8)$$

where $\bar{C}$ is a constant independent of $\tilde{n}$, and all the parameters of the deep net are bounded by $\mathrm{poly}(\tilde{n})$.

The approximation rates of shallow nets and of deep nets with two hidden layers are $\mathcal{O}(\tilde{n}^{-r/d})$ [29, 9], which is the same rate as in Corollary 2.2. However, as far as the norm of the weights is concerned, all the weights in Corollary 2.2 are controllable and much smaller than those of shallow nets. Specifically, for shallow nets the norm of the weights is at least exponential with respect to $\tilde{n}$ [25], while for the deep nets in Corollary 2.2 it is only polynomial with respect to $\tilde{n}$. Such a difference is essential according to the capacity estimate in [13], where a rigorous proof was presented that the covering number of deep nets with controllable norms of free parameters can be tightly bounded. Furthermore, compared with similar results for deep nets with two hidden layers [24], our constructed deep net avoids saturation. To sum up, the constructed deep net with three hidden layers performs better than shallow nets and deep nets with two hidden layers in overcoming their shortcomings.

Sparseness in the spatial domain is a prevalent data feature that abounds in numerous applications such as magnetic resonance imaging (MRI) analysis [1], handwritten digit recognition [4] and so on. Spatial sparseness means that the response (or function) of some action occurs only on several small regions instead of the whole input space; in other words, the response vanishes on most of the input space. Mathematically, spatially sparse functions are defined as follows [24].

Let $s, N \in \mathbb{N}$ with $s \le N^d$, and let $\mathbb{N}_N^d = \{1, 2, \dots, N\}^d$. Denote by $\{B_{N,k}\}_{k \in \mathbb{N}_N^d}$ a cubic partition of $I^d$ with centers $\{\zeta_k\}_{k \in \mathbb{N}_N^d}$ and side length $\frac{1}{N}$. Define

$$\Lambda_s := \{ k_\ell : k_\ell \in \mathbb{N}_N^d,\ 1 \le \ell \le s \} \quad \text{and} \quad S := \bigcup_{k \in \Lambda_s} B_{N,k}.$$

For any function $f$ defined on $I^d$, if the support of $f$ is $S$, then we say that $f$ is $s$-sparse in $N^d$ partitions. We use $Lip(N, s, r, c_0)$ to quantify both smoothness and sparseness, i.e.,

$$Lip(N, s, r, c_0) = \left\{ f : f \in Lip(r, c_0) \text{ and } f \text{ is } s\text{-sparse in } N^d \text{ partitions} \right\}.$$

For $n \in \mathbb{N}$ with $n \ge \hat{c} N$ for some $\hat{c} > 0$, let $\{A_{n,j}\}_{j \in \mathbb{N}_n^d}$ be another cubic partition of $I^d$ with centers $\{\xi_j\}_{j \in \mathbb{N}_n^d}$ and side length $\frac{1}{n}$. For each $k \in \mathbb{N}_N^d$, define

$$\bar{\Lambda}_k := \{ j \in \mathbb{N}_n^d : A_{n,j} \cap B_{N,k} \neq \emptyset \};$$

it is easy to see that $\{A_{n,j} : j \in \bigcup_{k \in \Lambda_s} \bar{\Lambda}_k\}$ is the family of cubes $A_{n,j}$ on which $f$ does not vanish. With these preparations, we present a sparseness assumption on $f$ as follows.
Assumption 3. We assume $f \in Lip(N, s, r, c_0)$ with $r = k + v$, $k \in \mathbb{N}_0$, $0 < v \le 1$, $c_0 > 0$ and $N, s \in \mathbb{N}$.

In [9], Chui et al. only discussed the performance of deep nets with two hidden layers in approximating smooth functions. Lin [24] extended the results in [9] to the approximation of spatially sparse functions. Specifically, Lin [24] proved that deep nets with two hidden layers can approximate spatially sparse functions much better than shallow nets. However, those results suffer from saturation. In this subsection, we aim at conquering this deficiency by constructing a deep net with three hidden layers. Theorem 2.3 below is the second main result of this paper; its proof is also given in Section 4.
Theorem 2.3. Let $0 < \varepsilon \le 1$ and $\tilde{n} \ge \tilde{c} N^d$ for some $\tilde{c} > 0$. Under Assumptions 2 and 3, there exists a deep net $H(x) \in \mathcal{H}_{3, \tilde{n}}$ such that

$$|f(x) - H(x)| \le c_1 \tilde{n}^{-r/d} + \tilde{C} \tilde{n} \varepsilon, \qquad \forall x \in X.$$

If $x \in I^d \setminus \bigcup_{k \in \Lambda_s} \bigcup_{j \in \bar{\Lambda}_k} A_{n,j}$, then

$$|H(x)| \le \tilde{C} \tilde{n} \varepsilon,$$

where all the parameters of the deep net are bounded by $\mathrm{poly}(\tilde{n}, \varepsilon^{-1})$, and $c_1, \tilde{C}$ are constants independent of $\tilde{n}$ and $\varepsilon$.

Corollary 2.4.
Let $T$ be an arbitrary positive number satisfying $T \ge \frac{r+d}{d}$ and let $\varepsilon = \tilde{n}^{-T}$. Under Assumptions 2 and 3, if $\tilde{n} \ge \tilde{c} N^d$ for some $\tilde{c} > 0$, then there holds

$$|f(x) - H(x)| \le \hat{C} \tilde{n}^{-r/d}, \qquad \forall x \in X. \qquad (2.9)$$

If $x \in X \setminus \bigcup_{k \in \Lambda_s} \bigcup_{j \in \bar{\Lambda}_k} A_{n,j}$, then

$$|H(x)| \le \tilde{n}^{-T}, \qquad (2.10)$$

where $\hat{C}$ is a constant independent of $\tilde{n}$.

To be more precise, (2.9) shows that the approximation rate of the constructed deep nets is as fast as $\mathcal{O}(\tilde{n}^{-r/d})$, and (2.10) describes their performance in realizing spatial sparseness when $T$ is large. However, a too large $T$ may lead to extremely large weights, which implies a huge capacity measured by the covering number of $\mathcal{H}_{3, \tilde{n}}$ according to [13]. A preferable choice of $T$ is $T = \mathcal{O}\left(\frac{r+d}{d}\right)$.

Previous studies [3, 6] indicated that shallow nets cannot provide localized approximation, which is a special case of sparse approximation with $s = 1$. Lemma 4.1 (in Section 4) shows that deep nets with two hidden layers have the localized approximation property, which is the building block for constructing deep nets possessing the sparse approximation property. To the best of our knowledge, [24] is the first work to construct deep nets realizing sparse features. Compared with [24], our main novelty is to deepen the network to conquer the saturation.
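To make the partition notation above concrete, the following sketch (a toy illustration with an invented sparse target; none of its numbers come from the paper) builds the coarse partition $\{B_{N,k}\}$, a sparse index set $\Lambda_s$, and the fine-partition index set $\bigcup_{k \in \Lambda_s} \bar{\Lambda}_k$ in dimension $d = 2$.

```python
import itertools
import numpy as np

d, N, n = 2, 4, 12          # coarse partition N^d, finer partition n^d (n >= c_hat * N)

def cube_index(x, m):
    """Index (in {1,...,m}^d) of the side-length-1/m cube containing x in [0,1]^d."""
    return tuple(min(int(np.floor(xi * m)) + 1, m) for xi in x)

# an invented s-sparse support: s = 2 coarse cubes; S is the union of B_{N,k}, k in Lambda_s
Lambda_s = {(1, 1), (3, 4)}

def f(x):
    # toy target: zero outside S, some bump inside (illustration only, not Lip-smooth at the boundary)
    return np.prod(np.sin(np.pi * np.asarray(x))) if cube_index(x, N) in Lambda_s else 0.0

# fine cubes A_{n,j} meeting the support, i.e. the union over k in Lambda_s of bar_Lambda_k;
# since n is a multiple of N, the center of a fine cube determines the coarse cube containing it
# (intersections consisting only of shared boundaries are ignored in this toy).
bar_Lambda = {
    j for j in itertools.product(range(1, n + 1), repeat=d)
    if cube_index(tuple((ji - 0.5) / n for ji in j), N) in Lambda_s
}
print(len(bar_Lambda), "of", n ** d, "fine cubes meet the support S")
```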
3. Related work
Constructing neural networks to approximate functions is a classic problem [29, 25, 27, 31, 10, 14, 24] in approximation theory. The traditional method for this problem can be divided into three steps. Step 1: construct a neural network to approximate polynomials. Step 2: utilize polynomials to approximate the target functions. Step 3: combine the above two steps to reach the final approximation result between neural networks and target functions. The Taylor formula is usually used in Step 1 to obtain the approximation result, which usually leads to extremely large weights, i.e., $|w_i| \sim e^m$, where $m$ is the degree of the polynomial. However, larger weights lead to larger capacity and consequently bad generalization and unstable algorithms. Typical examples include [25] and [28]. In order to overcome this drawback, we introduce a new function, the product of a localized Taylor polynomial and a deep net with two hidden layers, in place of the polynomial in Step 1, which reduces the weights of the neural networks from $e^m$ to $\mathrm{poly}(m)$.

For deep nets, [34] and [31] stated that deep ReLU networks are more efficient in approximating smooth functions than shallow nets. But their results are slightly worse than Theorem 2.1 in this paper, in the sense that there is either an additional logarithmic term or the estimate holds only in a weaker norm. Recently, Han et al. [17] indicated that deep ReLU nets can achieve optimal generalization performance for numerous learning tasks, but the depth in [31] is much larger than ours. Zhou [35, 36] also verified that deep convolutional neural networks (DCNNs) are universal, i.e., a DCNN can approximate any continuous function to arbitrary accuracy when the depth of the network is large enough.

All the above literature [34, 31, 17, 35, 36] demonstrated that deep nets with the ReLU activation function and DCNNs have good properties in both approximation and generalization. However, they are too deep to be conveniently used in real tasks. Compared with these results, we construct a deep net with only three hidden layers to approximate smooth and sparse functions, respectively. We prove in Theorem 2.1 and Theorem 2.3 that the constructed deep net with three hidden layers and with controllable weights can realize smoothness and spatial sparseness simultaneously, without saturation.
4. Proofs
Let $\mathcal{P}_m = \mathcal{P}_m(\mathbb{R}^d)$ be the set of multivariate algebraic polynomials on $\mathbb{R}^d$ of degree at most $m$, i.e.,

$$\mathcal{P}_m = \mathrm{span}\{ x^{\mathbf{k}} \equiv x_1^{k_1} \cdots x_d^{k_d} : |\mathbf{k}| = k_1 + \dots + k_d \le m \},$$

and let $\mathcal{P}_m^h$ be the set of homogeneous polynomials of degree $m$, i.e.,

$$\mathcal{P}_m^h = \mathrm{span}\{ x^{\mathbf{k}} = x_1^{k_1} \cdots x_d^{k_d} : |\mathbf{k}| = k_1 + \dots + k_d = m \}.$$

Let $\mathbb{N}_n^d = \{1, 2, \dots, n\}^d$, $n \in \mathbb{N}$, and let $\{A_{n,j}\}_{j \in \mathbb{N}_n^d}$ be the cubic partition of $X$ with centers $\{\xi_j\}_{j \in \mathbb{N}_n^d}$ and side length $\frac{1}{n}$. If $x$ lies on the boundary of some $A_{n,j}$, then $j_x$ is set to be the smallest index $j$ satisfying $x \in A_{n,j}$, i.e., the smallest element of $\{ j \mid j \in \mathbb{N}_n^d,\ x \in A_{n,j} \}$. Then, for $K > 0$ and any $j \in \mathbb{N}_n^d$, $x \in X$, we construct a deep net $N^*_{n,j,K}(x)$ with two hidden layers (activation $\sigma_0$ in the first layer and $\sigma$ in the second) as

$$N^*_{n,j,K}(x) = \sigma\left\{ K\left[ \sum_{l=1}^{d} \sigma_0\left( \frac{1}{2n} + x^{(l)} - \xi_j^{(l)} \right) + \sum_{l=1}^{d} \sigma_0\left( \frac{1}{2n} - x^{(l)} + \xi_j^{(l)} \right) - 2d + \frac{1}{2} \right] \right\}. \qquad (4.1)$$

Localized approximation of neural networks [6] implies that, if the target function is modified only on a small subset of the Euclidean space, then only a few neurons, rather than the entire network, need to be retrained. Lemma 4.1 below, which was proved in [24], states the localized approximation property of deep nets, a property totally different from that of shallow nets. We refer to [9] (Section 3.3) for details on the localized approximation of neural networks.

Lemma 4.1.
For any $\varepsilon > 0$, if $N^*_{n,j,K_\varepsilon}$ is defined by (4.1) with $K_\varepsilon$ satisfying (2.6) and $\sigma$ being a non-decreasing sigmoidal function, then

(i) for any $x \notin A_{n,j}$, there holds $|N^*_{n,j,K_\varepsilon}(x)| < \varepsilon$;

(ii) for any $x \in A_{n,j}$, there holds $|1 - N^*_{n,j,K_\varepsilon}(x)| \le \varepsilon$.
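As an illustration of the construction (4.1) and of Lemma 4.1 (a numerical sketch only; the grid size, the logistic choice of $\sigma$ and the value used for $K$ are assumptions made for the demonstration), the following code evaluates $N^*_{n,j,K}$ and checks that it is close to 1 inside $A_{n,j}$ and close to 0 outside.

```python
import numpy as np

def heaviside(t):                      # sigma_0
    return (t >= 0).astype(float)

def sigmoid(t):                        # sigma, a non-decreasing sigmoidal function
    return 1.0 / (1.0 + np.exp(-t))

def N_star(x, xi_j, n, K):
    """Two-hidden-layer localized net (4.1) for the cube A_{n,j} centered at xi_j."""
    x, xi_j = np.asarray(x, float), np.asarray(xi_j, float)
    d = x.size
    s = (heaviside(1.0 / (2 * n) + x - xi_j).sum()
         + heaviside(1.0 / (2 * n) - x + xi_j).sum())
    return sigmoid(K * (s - 2 * d + 0.5))

eps = 1e-3
K = 2 * np.log(1.0 / eps)              # demo choice: the inner argument then exceeds ln(1/eps) inside A_{n,j}
n, xi_j = 4, np.array([0.375, 0.625])  # center of one cube of the 4 x 4 partition of [0,1]^2

inside  = N_star([0.40, 0.60], xi_j, n, K)   # point in A_{n,j}
outside = N_star([0.90, 0.10], xi_j, n, K)   # point outside A_{n,j}
print(f"inside: {inside:.6f}  outside: {outside:.6f}")   # close to 1 and 0, as in Lemma 4.1
```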
It is easy to see that, as $\varepsilon \to 0$, $N^*_{n,j,K_\varepsilon}$ tends to the indicator function of $A_{n,j}$. Moreover, when $n \to \infty$, $N^*_{n,j,K_\varepsilon}$ can recognize the location of $x$ in an arbitrarily small region and vanishes on the remaining partitions of the input space.

In order to overcome the deficiency of the traditional method in neural network approximation, we define a new function $\Phi_g(x)$, a sum of products of local Taylor polynomials and the two-hidden-layer network functions, in place of a global polynomial:

$$\Phi_g(x) = \sum_{j \in \mathbb{N}_n^d} P_{k, \eta_j, g}(x)\, N^*_{n,j,K_\varepsilon}(x), \qquad x \in X, \qquad (4.2)$$

where $N^*_{n,j,K_\varepsilon}$ and $K_\varepsilon$ are defined by (4.1) and (2.6), and $P_{k, \eta_j, g}(x)$ is the Taylor polynomial of $g$ of degree $k$ around $\eta_j$, with $\eta_j \in A_{n,j}$ and $j \in \mathbb{N}_n^d$.

Based on the localized approximation result and the localized Taylor polynomials in (4.2), we construct a deep net with three hidden layers to approximate both smooth and sparse functions. The following proposition indicates that a single neuron can replace a monomial $(w_i \cdot x)^m$, up to a polynomial of lower degree.

Proposition 4.2.
Let $c > 0$, $L = \mathcal{O}(m^{d-1})$, and let $\sigma \in Lip(r, c)$ be a sigmoidal function with $(m+1)$-times bounded derivatives, $\|\sigma'\|_{L_\infty(\mathbb{R})} \le 1$, and $\sigma^{(j)}(0) \neq 0$ for all $j = 0, 1, \dots, m$. For an arbitrary $P_m \in \mathcal{P}_m$ and any $\varepsilon \in (0, 1)$, we have

$$\left| P_m(x) - \sum_{i=1}^{L} C(i, m) \frac{m!}{\delta_m^m \sigma^{(m)}(0)}\, \sigma(\delta_m (w_i \cdot x)) - P^*_{m-1}(x) \right| < \varepsilon, \qquad (4.3)$$

where

$$P^*_{m-1}(x) = \sum_{j=0}^{m-1} \sum_{i=1}^{L} D(i, j)(w_i \cdot x)^j, \qquad D(i, j) = C(i, j) - \frac{C(i, m)\, m!\, \sigma^{(j)}(0)}{\delta_m^{m-j}\, \sigma^{(m)}(0)\, j!},$$

$$\delta_m = \min\left\{ \frac{\varepsilon}{M_m},\ \frac{\varepsilon}{\sum_{i=1}^{L} |C(i, m)| M_m} \right\}, \qquad M_m = \max_{-1 \le \xi \le 1} \left| \frac{\sigma^{(m+1)}(\xi)}{\sigma^{(m)}(0)} \right|,$$

and the $C(i, j)$ are absolute constants.

We use the Taylor formula [9] to prove the above Proposition 4.2.

Lemma 4.3.
Let $k \ge 1$ and let $\varphi$ be $k$-times differentiable on $\mathbb{R}$. Then for any $u, u_0 \in \mathbb{R}$, there holds

$$\varphi(u) = \varphi(u_0) + \frac{\varphi'(u_0)}{1!}(u - u_0) + \dots + \frac{\varphi^{(k)}(u_0)}{k!}(u - u_0)^k + R_k(u), \qquad (4.4)$$

where

$$R_k(u) = \frac{1}{(k-1)!} \int_{u_0}^{u} \left[ \varphi^{(k)}(t) - \varphi^{(k)}(u_0) \right](u - t)^{k-1}\, dt. \qquad (4.5)$$

In addition, under the conditions of Lemma 4.3, for any $a \in \mathbb{R}$ there holds

$$\varphi(au) = \varphi(u_0) + \frac{\varphi'(u_0)}{1!}(au - u_0) + \dots + \frac{\varphi^{(k)}(u_0)}{k!}(au - u_0)^k + R_k(u), \qquad (4.6)$$

where

$$R_k(u) = \frac{a^k}{(k-1)!} \int_{u_0}^{u} \left[ \varphi^{(k)}(at) - \varphi^{(k)}(u_0) \right](u - t)^{k-1}\, dt. \qquad (4.7)$$

Lemma 4.4, which was proved in [28], plays an important role in proving Proposition 4.2.
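To see numerically how the Taylor expansion of Lemma 4.3 recovers a monomial from a single scaled sigmoid (the mechanism behind Proposition 4.2), consider the following sketch. The logistic activation, the degree $2$, and the values $b = 1$ and $\delta = 10^{-2}$ are illustrative assumptions of our own; since the logistic function has $\sigma''(0) = 0$, the sketch expands around $b = 1$ instead of $0$, as allowed later by Corollary 4.6.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_derivs(t):
    """Values sigma(t), sigma'(t), sigma''(t) of the logistic function."""
    s = sigmoid(t)
    return s, s * (1 - s), s * (1 - s) * (1 - 2 * s)

b, delta = 1.0, 1e-2                 # expansion point with sigma''(b) != 0, small scaling delta
s0, s1, s2 = sigmoid_derivs(b)

t = np.linspace(-1.0, 1.0, 2001)
# Taylor trick: t^2 ~= 2/(delta^2 sigma''(b)) * [sigma(b + delta t) - sigma(b) - sigma'(b) delta t],
# i.e. one neuron plus a polynomial correction of lower degree
approx = 2.0 / (delta ** 2 * s2) * (sigmoid(b + delta * t) - s0 - s1 * delta * t)
print("max error:", np.max(np.abs(approx - t ** 2)))   # decreases proportionally to delta
```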
Lemma 4.4. Let $m \in \mathbb{N}$ and $L = \mathcal{O}(m^{d-1})$. For any $P_m \in \mathcal{P}_m$, there exists a set of points $\{w_1, \dots, w_L\} \subset I^d$ such that

$$P_m(x) \in \mathrm{span}\{ (w_i \cdot x)^j : x, w_i \in I^d,\ 0 \le j \le m,\ 1 \le i \le L \}. \qquad (4.8)$$
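For intuition about Lemma 4.4, here is a toy check in $d = 2$, $m = 2$ with $L = 3$ directions of our own choosing: the mixed monomial $x_1 x_2$ lies in the span of squares of the ridge functions $w_i \cdot x$.

```python
import numpy as np

# Lemma 4.4 intuition: in d = 2, x1*x2 is a combination of ridge powers (w_i . x)^2
w = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]   # L = 3 directions

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=(5, 2))                    # a few test points in I^2
ridge_sq = np.stack([(x @ wi) ** 2 for wi in w], axis=1)
coeffs = np.array([-0.5, -0.5, 0.5])                  # x1*x2 = ((x1+x2)^2 - x1^2 - x2^2) / 2
print(np.allclose(ridge_sq @ coeffs, x[:, 0] * x[:, 1]))   # True
```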
Proof of Proposition 4.2. For any $t \in [0, 1]$, $\delta_k \in (0, 1)$ and $k \in \{1, \dots, m\}$, it follows from Lemma 4.3 that

$$\sigma(\delta_k t) = \sigma(0) + \frac{\sigma'(0)}{1!}\delta_k t + \dots + \frac{\sigma^{(k)}(0)}{k!}(\delta_k t)^k + \tilde{R}_k(t), \qquad (4.9)$$

where

$$\tilde{R}_k(t) = \frac{\delta_k^k}{(k-1)!} \int_0^t \left[ \sigma^{(k)}(\delta_k u) - \sigma^{(k)}(0) \right](t - u)^{k-1}\, du. \qquad (4.10)$$

Denote $Q_{k-1}(t) = \sum_{j=0}^{k-1} \frac{\sigma^{(j)}(0)}{j!}(\delta_k t)^j$. Then (4.9) yields

$$t^k = \frac{k!}{\delta_k^k \sigma^{(k)}(0)} \sigma(\delta_k t) - \frac{k!}{\delta_k^k \sigma^{(k)}(0)} Q_{k-1}(t) - \frac{k!}{\delta_k^k \sigma^{(k)}(0)} \tilde{R}_k(t),$$

which implies

$$t^k = \frac{k!}{\delta_k^k \sigma^{(k)}(0)} \sigma(\delta_k t) + q_{k-1}(t) + r_k(t), \qquad (4.11)$$

where

$$q_{k-1}(t) = -\frac{k!}{\delta_k^k \sigma^{(k)}(0)} Q_{k-1}(t) \quad \text{and} \quad r_k(t) = -\frac{k!}{\delta_k^k \sigma^{(k)}(0)} \tilde{R}_k(t). \qquad (4.12)$$

Since $|\sigma^{(k)}(\delta_k u) - \sigma^{(k)}(0)| \le \max_{0 < \xi < 1} |\sigma^{(k+1)}(\xi)|\, |\delta_k|\, |u|$, by (4.10) and (4.12) there holds

$$|r_k(t)| = \left| \frac{k!}{\delta_k^k \sigma^{(k)}(0)} \tilde{R}_k(t) \right| = \frac{k}{|\sigma^{(k)}(0)|} \left| \int_0^t \left[ \sigma^{(k)}(\delta_k u) - \sigma^{(k)}(0) \right](t - u)^{k-1}\, du \right| \le \frac{k\, |\sigma^{(k+1)}(\xi)|}{|\sigma^{(k)}(0)|} \left| \int_0^t \delta_k u (t - u)^{k-1}\, du \right| \le \frac{\delta_k\, k\, |\sigma^{(k+1)}(\xi)|}{|\sigma^{(k)}(0)|} \int_0^1 u (1 - u)^{k-1}\, du \le k \delta_k M_k \left( \frac{1}{k} - \frac{1}{1 + k} \right) \le \delta_k M_k. \qquad (4.13)$$

From Lemma 4.4, for any $P_m \in \mathcal{P}_m$ it follows that

$$P_m(x) = \sum_{j=0}^{m} \sum_{i=1}^{L} C(i, j)(w_i \cdot x)^j = \sum_{i=1}^{L} C(i, m)(w_i \cdot x)^m + \sum_{i=1}^{L} C(i, m-1)(w_i \cdot x)^{m-1} + \dots + \sum_{i=1}^{L} C(i, 1)(w_i \cdot x) + \sum_{i=1}^{L} C(i, 0). \qquad (4.14)$$

Since $x, w_i \in I^d$, we have $|w_i \cdot x| \le 1$. Then, for an arbitrary $\varepsilon > 0$, there exists a $\delta_m \in (0, 1)$ such that

$$\sum_{i=1}^{L} |C(i, m)| M_m \delta_m \le \varepsilon. \qquad (4.15)$$

Due to (4.11) and $\delta_m |w_i \cdot x| \in [0, 1)$,

$$(w_i \cdot x)^m = \frac{m!}{\delta_m^m \sigma^{(m)}(0)} \sigma(\delta_m (w_i \cdot x)) + q_{m-1}(w_i \cdot x) + r_m(w_i \cdot x). \qquad (4.16)$$

Inserting (4.16) into (4.14), we obtain

$$P_m(x) = \sum_{i=1}^{L} C(i, m) \left( \frac{m!}{\delta_m^m \sigma^{(m)}(0)} \sigma(\delta_m (w_i \cdot x)) + q_{m-1}(w_i \cdot x) + r_m(w_i \cdot x) \right) + \sum_{i=1}^{L} C(i, m-1)(w_i \cdot x)^{m-1} + \dots + \sum_{i=1}^{L} C(i, 1)(w_i \cdot x) + \sum_{i=1}^{L} C(i, 0) = \sum_{i=1}^{L} C(i, m) \frac{m!}{\delta_m^m \sigma^{(m)}(0)} \sigma(\delta_m (w_i \cdot x)) + P^*_{m-1}(x) + R_m(x), \qquad (4.17)$$

where

$$R_m(x) = \sum_{i=1}^{L} C(i, m)\, r_m(w_i \cdot x)$$

and

$$P^*_{m-1}(x) = \sum_{i=1}^{L} C(i, m)\, q_{m-1}(w_i \cdot x) + \sum_{i=1}^{L} C(i, m-1)(w_i \cdot x)^{m-1} + \dots + \sum_{i=1}^{L} C(i, 1)(w_i \cdot x) + \sum_{i=1}^{L} C(i, 0) = \sum_{j=0}^{m-1} \sum_{i=1}^{L} D(i, j)(w_i \cdot x)^j, \qquad (4.18)$$

with $D(i, j) = C(i, j) - \frac{C(i, m)\, m!\, \sigma^{(j)}(0)}{\delta_m^{m-j}\, \sigma^{(m)}(0)\, j!}$. It then follows from (4.13) and (4.15) that

$$|R_m(x)| \le \varepsilon. \qquad (4.19)$$

Combining (4.17)-(4.19), we have

$$\left| P_m(x) - \sum_{i=1}^{L} C(i, m) \frac{m!}{\delta_m^m \sigma^{(m)}(0)} \sigma(\delta_m (w_i \cdot x)) - P^*_{m-1}(x) \right| < \varepsilon.$$

This completes the proof of Proposition 4.2.

Next, we show the performance of shallow nets in approximating polynomials.

Proposition 4.5.
Let $\sigma$ be a non-decreasing sigmoidal function with $\|\sigma'\|_{L_\infty(\mathbb{R})} \le 1$, $\|\sigma\|_{L_\infty(\mathbb{R})} \le 1$ and $\sigma^{(j)}(0) \neq 0$ for all $j = 0, 1, \dots, m+1$. For any $P_m \in \mathcal{P}_m$ and $\varepsilon \in (0, 1)$, there exists a shallow net

$$h_{m+1}(x) = \sum_{j=0}^{m} \sum_{i=1}^{L} a(i, j)\, \sigma(\delta_j (w_i \cdot x))$$

with $a(i, j) = \frac{C(i, j)\, j!}{\delta_j^j \sigma^{(j)}(0)}$ and $\delta_j$ depending polynomially on $\varepsilon$, such that

$$|P_m(x) - h_{m+1}(x)| \le \varepsilon.$$

Proof. From Proposition 4.2, it holds that

$$\left| P_m(x) - \sum_{i=1}^{L} a(i, m)\, \sigma(\delta_m (w_i \cdot x)) - P^*_{m-1}(x) \right| < \frac{\varepsilon}{m+1}, \qquad (4.20)$$

where $a(i, m) = \frac{C(i, m)\, m!}{\delta_m^m \sigma^{(m)}(0)}$ and $\delta_m \sim \mathrm{poly}(\varepsilon)$. Applying the same argument to $P^*_{m-1}, P^*_{m-2}, \dots$, we obtain

$$\left| P^*_{m-1}(x) - \sum_{i=1}^{L} a(i, m-1)\, \sigma(\delta_{m-1} (w_i \cdot x)) - P^*_{m-2}(x) \right| < \frac{\varepsilon}{m+1}, \qquad (4.21)$$

$$\cdots$$

$$\left| P^*_1(x) - \sum_{i=1}^{L} a(i, 1)\, \sigma(\delta_1 (w_i \cdot x)) - P^*_0(x) \right| < \frac{\varepsilon}{m+1}, \qquad (4.22)$$

and

$$\left| P^*_0(x) - \frac{P^*_0(x)}{\sigma(0)}\, \sigma(0 \cdot x) \right| = 0 < \frac{\varepsilon}{m+1}. \qquad (4.23)$$

Then it follows from (4.20)-(4.23) that

$$\left| P_m(x) - \sum_{j=0}^{m} \sum_{i=1}^{L} a(i, j)\, \sigma(\delta_j (w_i \cdot x)) \right| < \varepsilon.$$

This completes the proof of Proposition 4.5.
Corollary 4.6. Let $m \in \mathbb{N}$, $L = \mathcal{O}(m^{d-1})$, and let $\sigma$ be a non-decreasing sigmoidal function with $\|\sigma'\|_{L_\infty(\mathbb{R})} \le 1$ and $\|\sigma\|_{L_\infty(\mathbb{R})} \le 1$. If $\sigma^{(j)}(b) \neq 0$ for some $b \in \mathbb{R}$ and all $j = 0, 1, \dots, m+1$, then for any $P_m \in \mathcal{P}_m$ and $\varepsilon \in (0, 1)$ there exists a shallow net

$$h_{m+1}(x) = \sum_{j=0}^{m} \sum_{i=1}^{L} a(i, j)\, \sigma(\delta_j (w_i \cdot x) + b)$$

such that

$$|P_m(x) - h_{m+1}(x)| \le \varepsilon.$$

Based on the above Proposition 4.5, we are able to derive a "product-gate" property of deep nets in the following Proposition 4.7, whose proof can be found in [9].

Proposition 4.7.
Let $m \in \mathbb{N}$ and $L = \mathcal{O}(m^{d-1})$. If $\sigma$ is a non-decreasing sigmoidal function with $\|\sigma'\|_{L_\infty(\mathbb{R})} \le 1$, $\|\sigma\|_{L_\infty(\mathbb{R})} \le 1$ and $\sigma^{(j)}(0) \neq 0$ for all $j = 0, 1, \dots, m+1$, then for any $\varepsilon > 0$ there exists a shallow net

$$h(x) = \sum_{j=1}^{3} a_j\, \sigma(w_j \cdot x)$$

such that for any $u_1, u_2 \in [-1, 1]$,

$$\left| u_1 u_2 - \left( 2h\left( \frac{u_1 + u_2}{2} \right) - \frac{h(u_1)}{2} - \frac{h(u_2)}{2} \right) \right| < \varepsilon.$$
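The elementary identity behind this product gate is $u_1 u_2 = 2\left(\frac{u_1+u_2}{2}\right)^2 - \frac{u_1^2}{2} - \frac{u_2^2}{2}$; a shallow-net approximation of the square function (as provided by Proposition 4.5) then plays the role of $h$. The check below uses the exact square $h(t) = t^2$, an assumption made only to isolate the identity itself.

```python
import numpy as np

def product_gate(u1, u2, h):
    # u1*u2 ~= 2 h((u1+u2)/2) - h(u1)/2 - h(u2)/2
    return 2 * h((u1 + u2) / 2) - h(u1) / 2 - h(u2) / 2

h_square = lambda t: t ** 2            # ideal "square gate"; Proposition 4.5 replaces it by a shallow net
rng = np.random.default_rng(1)
u1, u2 = rng.uniform(-1, 1, 1000), rng.uniform(-1, 1, 1000)
print("max identity error:", np.max(np.abs(product_gate(u1, u2, h_square) - u1 * u2)))  # ~0 (exact identity)
```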
Corollary 4.8. Let $m \in \mathbb{N}$, $L = \mathcal{O}(m^{d-1})$, and let $\sigma$ be a non-decreasing sigmoidal function with $\|\sigma'\|_{L_\infty(\mathbb{R})} \le 1$ and $\|\sigma\|_{L_\infty(\mathbb{R})} \le 1$. If there exists a point $b \in \mathbb{R}$ satisfying $\sigma^{(j)}(b) \neq 0$ for all $j = 1, 2, 3$, then for any $\varepsilon > 0$ there exists a shallow net

$$h(x) = \sum_{j=1}^{3} a_j\, \sigma(w_j \cdot x + b)$$

such that for any $u_1, u_2 \in [-1, 1]$,

$$\left| u_1 u_2 - \left( 2h\left( \frac{u_1 + u_2}{2} \right) - \frac{h(u_1)}{2} - \frac{h(u_2)}{2} \right) \right| < \varepsilon.$$

In our proof we also need the following Lemma 4.9, which can be found in [17].

Lemma 4.9.
Let $x_0 \in I^d$ and $r = k + v$ with $k \in \mathbb{N}_0$ and $0 < v \le 1$. If $f \in Lip(r, c_0)$ and $P_{k, x_0, f}(x)$ is the Taylor polynomial of $f$ of degree $k$ around $x_0$, then

$$|f(x) - P_{k, x_0, f}(x)| \le \tilde{c}_1 \|x - x_0\|^r, \qquad x \in X, \qquad (4.24)$$

where $\tilde{c}_1$ is a constant depending only on $k$, $c_0$ and $d$.

Lemma 4.10. If $g \in Lip(r, c_0)$ with $r = k + v$, $k \in \mathbb{N}_0$, $0 < v \le 1$, $c_0 > 0$, $\sigma$ is a non-decreasing sigmoidal function and $\Phi_g(x)$ is defined by (4.2), then

$$|g(x) - \Phi_g(x)| \le \tilde{c}_1 n^{-r} + n^d B \varepsilon, \qquad x \in X,$$

where $B := \|g\|_{L_\infty(X)} + \tilde{c}_1$.
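The following one-dimensional sketch illustrates (4.2) and Lemma 4.10. It is a toy experiment: the target function, the degree $k = 1$, the grid sizes and the value of $K$ are invented for the illustration, and the local Taylor polynomials are used directly rather than being re-expressed as sub-networks.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bump(x, center, n, K):
    """1-D version of N*_{n,j,K} in (4.1): ~1 on the cube around `center`, ~0 elsewhere."""
    s = (0.5 / n + x - center >= 0).astype(float) + (0.5 / n - x + center >= 0).astype(float)
    return sigmoid(K * (s - 2 + 0.5))

g = lambda x: np.sin(2 * np.pi * x)          # toy smooth target on [0, 1]
dg = lambda x: 2 * np.pi * np.cos(2 * np.pi * x)

n, eps = 20, 1e-4
K = 2 * np.log(1.0 / eps)                    # demo choice, cf. the sketch after Lemma 4.1
centers = (np.arange(n) + 0.5) / n           # centers eta_j = xi_j of the cubes A_{n,j}

x = np.linspace(0.0, 1.0, 1999)              # evaluation grid avoiding the cube boundaries
# Phi_g(x) = sum_j P_{1, eta_j, g}(x) * N*_{n,j,K}(x), with degree-1 Taylor polynomials
Phi = sum((g(c) + dg(c) * (x - c)) * bump(x, c, n, K) for c in centers)
print("max |g - Phi_g|:", np.max(np.abs(g(x) - Phi)))   # small: of order n^{-2} plus the eps term
```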
Proof. From Lemma 4.9, we observe that

$$|P_{k, \eta_{j_x}, g}(x)| \le \|g\|_{L_\infty(X)} + \tilde{c}_1 \|x - \eta_{j_x}\|^r \le \|g\|_{L_\infty(X)} + \tilde{c}_1 =: B. \qquad (4.25)$$

Since $I^d = \bigcup_{j \in \mathbb{N}_n^d} A_{n,j}$, for each $x \in X$ there exists a $j_x$ such that $x \in A_{n, j_x}$. Therefore, it follows from Lemma 4.1 that

$$|g(x) - \Phi_g(x)| = \left| g(x) - P_{k, \eta_{j_x}, g}(x) - \sum_{j \neq j_x} P_{k, \eta_j, g}(x) N^*_{n,j,K_\varepsilon}(x) + P_{k, \eta_{j_x}, g}(x)\big(1 - N^*_{n,j_x,K_\varepsilon}(x)\big) \right| \le |g(x) - P_{k, \eta_{j_x}, g}(x)| + \left| \sum_{j \neq j_x} P_{k, \eta_j, g}(x) N^*_{n,j,K_\varepsilon}(x) \right| + \left| P_{k, \eta_{j_x}, g}(x)\big(1 - N^*_{n,j_x,K_\varepsilon}(x)\big) \right| \le \tilde{c}_1 \|x - \eta_{j_x}\|^r + (n^d - 1) B \varepsilon + B \varepsilon \le \tilde{c}_1 n^{-r} + n^d B \varepsilon. \qquad (4.26)$$

This completes the proof of Lemma 4.10.

Proof of Theorem 2.1. The proof can be divided into three steps: the first one estimates the product of the localized net and the Taylor polynomial via the product gate; the second one considers the approximation of the Taylor polynomial by a shallow net; the third one derives the approximation error by combining the first two steps.

Step 1: By the definition of $N^*_{n,j,K_\varepsilon}(x)$ in (4.1), we observe that $|N^*_{n,j,K_\varepsilon}(x)| \le 1$. Furthermore, it follows from Lemma 4.9 that

$$|P_{k, \eta_j, g}(x)| \le \|g\|_{L_\infty(X)} + \tilde{c}_1 \|x - \eta_j\|^r \le \|g\|_{L_\infty(X)} + \tilde{c}_1. \qquad (4.27)$$

Denote $B_1 := 4\left( \|g\|_{L_\infty(X)} + \tilde{c}_1 + 1 \right)$. Hence, for an arbitrary $x \in X$, we have $\frac{N^*_{n,j,K_\varepsilon}(x)}{B_1} \in [-1, 1]$ and $\frac{P_{k, \eta_j, g}(x)}{B_1} \in [-1, 1]$. It then follows from Corollary 4.8 with $u_1 = \frac{N^*_{n,j,K_\varepsilon}(x)}{B_1}$ and $u_2 = \frac{P_{k, \eta_j, g}(x)}{B_1}$ that there exists a shallow net $h(x) = \sum_{j=1}^{3} a_j \sigma(w_j \cdot x + b_j)$ such that

$$\left| P_{k, \eta_j, g}(x) N^*_{n,j,K_\varepsilon}(x) - B_1^2 \left( 2h\left( \frac{N^*_{n,j,K_\varepsilon}(x) + P_{k, \eta_j, g}(x)}{2 B_1} \right) - \frac{1}{2} h\left( \frac{N^*_{n,j,K_\varepsilon}(x)}{B_1} \right) - \frac{1}{2} h\left( \frac{P_{k, \eta_j, g}(x)}{B_1} \right) \right) \right| \le B_1^2 \varepsilon. \qquad (4.28)$$

For the sake of convenience, denote

$$\Delta = B_1^2 \left( 2h\left( \frac{N^*_{n,j,K_\varepsilon}(x) + P_{k, \eta_j, g}(x)}{2 B_1} \right) - \frac{1}{2} h\left( \frac{N^*_{n,j,K_\varepsilon}(x)}{B_1} \right) - \frac{1}{2} h\left( \frac{P_{k, \eta_j, g}(x)}{B_1} \right) \right).$$

Noting $\|\sigma'\|_{L_\infty(\mathbb{R})} \le 1$, for any $x_1, x_2$ there holds

$$|h(x_1) - h(x_2)| \le \sum_{j=1}^{3} |a_j|\, |\sigma(w_j \cdot x_1 + b_j) - \sigma(w_j \cdot x_2 + b_j)| \le \sum_{j=1}^{3} |a_j|\, \|w_j\|\, \|x_1 - x_2\| \le \tilde{c}_2 \|x_1 - x_2\|, \qquad (4.29)$$

where $\tilde{c}_2 = 3 \max_{j \in \{1,2,3\}} |a_j| \max_{j \in \{1,2,3\}} \|w_j\|$ and $w_j = (w_{j1}, \dots, w_{jd})^T$.

Step 2: It follows from Corollary 4.6, applied to the degree-$k$ polynomial $\frac{P_{k, \eta_j, g}(x)}{B_1}$, that there exists a shallow net

$$h_{k+1, L}(x) = \sum_{j'=0}^{k} \sum_{i=1}^{L} a(i, j')\, \sigma\big(\delta_{j'}\, w_i \cdot (x - \eta_j) + b\big) \qquad (4.30)$$

such that

$$\left| \frac{P_{k, \eta_j, g}(x)}{B_1} - h_{k+1, L}(x) \right| \le \varepsilon_1.$$

Step 3: Define

$$H(x) := \sum_{j=1}^{n^d} H_j(x), \qquad (4.31)$$

where

$$H_j(x) = B_1^2 \left( 2h\left( \frac{h_{k+1, L}(x)}{2} + \frac{N^*_{n,j,K_\varepsilon}(x)}{2 B_1} \right) - \frac{h(h_{k+1, L}(x))}{2} - \frac{1}{2} h\left( \frac{N^*_{n,j,K_\varepsilon}(x)}{B_1} \right) \right). \qquad (4.32)$$

By (4.2), (4.28) and (4.29), we get

$$|H(x) - \Phi_g(x)| \le \sum_{j \in \mathbb{N}_n^d} |H_j(x) - \Delta + \Delta - P_{k, \eta_j, g}(x) N^*_{n,j,K_\varepsilon}(x)| \le \sum_{j \in \mathbb{N}_n^d} |H_j(x) - \Delta| + n^d B_1^2 \varepsilon \le n^d B_1^2 \left( \frac{9}{2} \tilde{c}_2 \varepsilon_1 + \varepsilon \right) \le \bar{C}_1 n^d \varepsilon, \qquad (4.33)$$

where we set $\varepsilon_1 = 2\varepsilon$ and $\bar{C}_1$ is a constant depending only on $B_1$ and $\tilde{c}_2$. Noting (4.33) and Lemma 4.10, we obtain

$$|g(x) - H(x)| \le |g(x) - \Phi_g(x)| + |\Phi_g(x) - H(x)| \le \tilde{c}_1 n^{-r} + B n^d \varepsilon + \bar{C}_1 n^d \varepsilon = C_1 \left( n^{-r} + n^d \varepsilon \right),$$

where $C_1$ is a constant depending only on $k$, $c_0$, $d$ and $B_1$. Due to (4.1), (4.30), (4.31) and (4.32), $H(x)$ is a deep net in $\mathcal{H}_{3, \tilde{n}}$ with $\tilde{n} = 6 n^d ((L+1)(k+1) + 2d + 1)$ free parameters, and it satisfies

$$|g(x) - H(x)| \le C\left( \tilde{n}^{-r/d} + \tilde{n} \varepsilon \right).$$

Furthermore, it is easy to check (see [9] for a detailed proof) that all the parameters in $H(x)$ can be bounded by $\mathrm{poly}(\tilde{n}, \varepsilon^{-1})$. This completes the proof of Theorem 2.1.

Proof of Corollary 2.2. The result can be directly deduced from Theorem 2.1 with $\varepsilon = \tilde{n}^{-(r+d)/d}$. This completes the proof of Corollary 2.2.

Since the spatial sparseness depends heavily on the localized approximation property, we first show that $\Phi_f(x)$ succeeds in realizing the sparseness of the target function $f$, which breaks through the bottleneck of shallow nets [6]. For the two partitions $\{A_{n,j}\}_{j \in \mathbb{N}_n^d}$ and $\{B_{N,k}\}_{k \in \mathbb{N}_N^d}$, we assume $n \ge \hat{c} N$ for some $\hat{c} > 0$.

Lemma 4.11.
Let $0 < \varepsilon < 1$. Under Assumptions 2 and 3, if $\Phi_f(x)$ is defined by (4.2) and $K_\varepsilon$ satisfies (2.6), then

$$|f(x) - \Phi_f(x)| \le \tilde{c}_1 n^{-r} + B n^d \varepsilon, \qquad (4.34)$$

where $B = \|f\|_{L_\infty(X)} + \tilde{c}_1$. Furthermore, if $n \ge \hat{c} N$, there holds

$$|\Phi_f(x)| \le B n^d \varepsilon, \qquad \forall x \in I^d \setminus \bigcup_{k \in \Lambda_s} \bigcup_{j \in \bar{\Lambda}_k} A_{n,j}. \qquad (4.35)$$
Proof. Since $I^d = \bigcup_{j \in \mathbb{N}_n^d} A_{n,j}$, for each $x \in X$ there exists a $j_x \in \mathbb{N}_n^d$ such that $x \in A_{n, j_x}$. By Lemma 4.1, we know that $|1 - N^*_{n,j,K_\varepsilon}(x)| \le \varepsilon$ for any $x \in A_{n,j}$, and $|N^*_{n,j,K_\varepsilon}(x)| \le \varepsilon$ for any $x \notin A_{n,j}$. From (4.27) we also get

$$|P_{k, \eta_j, f}(x)| \le B, \qquad (4.36)$$

where $B = \|f\|_{L_\infty(X)} + \tilde{c}_1$. Then

$$|f(x) - \Phi_f(x)| = \left| f(x) - P_{k, \eta_{j_x}, f}(x) - \sum_{j \neq j_x} P_{k, \eta_j, f}(x) N^*_{n,j,K_\varepsilon}(x) + P_{k, \eta_{j_x}, f}(x)\big(1 - N^*_{n,j_x,K_\varepsilon}(x)\big) \right| \le |f(x) - P_{k, \eta_{j_x}, f}(x)| + \left| \sum_{j \neq j_x} P_{k, \eta_j, f}(x) N^*_{n,j,K_\varepsilon}(x) \right| + \left| P_{k, \eta_{j_x}, f}(x)\big(1 - N^*_{n,j_x,K_\varepsilon}(x)\big) \right| \le \tilde{c}_1 \|x - \eta_{j_x}\|^r + (n^d - 1) B \varepsilon + B \varepsilon \le \tilde{c}_1 n^{-r} + B n^d \varepsilon.$$

Since $n \ge \hat{c} N$, $x \in I^d \setminus \bigcup_{k \in \Lambda_s} \bigcup_{j \in \bar{\Lambda}_k} A_{n,j}$ implies $A_{n, j_x} \cap S = \emptyset$. This, together with $f \in Lip(N, s, r, c_0)$, yields $f(x) = P_{k, \eta_{j_x}, f}(x) = 0$. From Lemma 4.1 and (4.36), we have

$$|\Phi_f(x)| \le \left| \sum_{j \neq j_x} P_{k, \eta_j, f}(x) N^*_{n,j,K_\varepsilon}(x) \right| + \left| P_{k, \eta_{j_x}, f}(x) N^*_{n,j_x,K_\varepsilon}(x) \right| \le \max_j |P_{k, \eta_j, f}(x)| \sum_{j \neq j_x} |N^*_{n,j,K_\varepsilon}(x)| \le (n^d - 1) B \varepsilon \le B n^d \varepsilon.$$

This completes the proof of Lemma 4.11.

Proof of Theorem 2.3. The proof of this theorem is similar to the proof of Theorem 2.1. As in Steps 1 and 2 there, we obtain

$$\left| P_{k, \eta_j, f}(x) N^*_{n,j,K_\varepsilon}(x) - B_2^2 \left( 2h\left( \frac{N^*_{n,j,K_\varepsilon}(x) + P_{k, \eta_j, f}(x)}{2 B_2} \right) - \frac{1}{2} h\left( \frac{N^*_{n,j,K_\varepsilon}(x)}{B_2} \right) - \frac{1}{2} h\left( \frac{P_{k, \eta_j, f}(x)}{B_2} \right) \right) \right| \le B_2^2 \varepsilon, \qquad (4.37)$$

where $B_2 := 2\left( \|f\|_{L_\infty(X)} + \tilde{c}_1 + 1 \right)$. Denote

$$\Delta = B_2^2 \left( 2h\left( \frac{N^*_{n,j,K_\varepsilon}(x) + P_{k, \eta_j, f}(x)}{2 B_2} \right) - \frac{1}{2} h\left( \frac{N^*_{n,j,K_\varepsilon}(x)}{B_2} \right) - \frac{1}{2} h\left( \frac{P_{k, \eta_j, f}(x)}{B_2} \right) \right),$$

and define

$$H(x) := \sum_{j=1}^{n^d} H_j(x), \quad \text{where} \quad H_j(x) = B_2^2 \left( 2h\left( \frac{h_{k+1, L}(x)}{2} + \frac{N^*_{n,j,K_\varepsilon}(x)}{2 B_2} \right) - \frac{h(h_{k+1, L}(x))}{2} - \frac{1}{2} h\left( \frac{N^*_{n,j,K_\varepsilon}(x)}{B_2} \right) \right).$$

Proposition 4.5 (via Corollary 4.6) implies that there exists a shallow net

$$h_{k+1, L}(x) = \sum_{j'=0}^{k} \sum_{i=1}^{L} a(i, j')\, \sigma\big(\delta_{j'}\, w_i \cdot (x - \eta_j) + b\big)$$

such that

$$\left| \frac{P_{k, \eta_j, f}(x)}{B_2} - h_{k+1, L}(x) \right| \le \varepsilon_1. \qquad (4.38)$$

Since $f \in Lip(N, s, r, c_0)$, combining (4.37), (4.38) and the Lipschitz property (4.29) of $h$, we obtain

$$|H(x) - \Phi_f(x)| \le \sum_{j \in \mathbb{N}_n^d} |H_j(x) - \Delta + \Delta - P_{k, \eta_j, f}(x) N^*_{n,j,K_\varepsilon}(x)| \le \frac{9}{2} \tilde{c}_2 B_2^2 n^d \varepsilon_1 + B_2^2 n^d \varepsilon = B_2^2 n^d \varepsilon (9 \tilde{c}_2 + 1) = \tilde{c}_3 n^d \varepsilon, \qquad (4.39)$$

where $\varepsilon_1 = 2\varepsilon$ and $\tilde{c}_3$ is a constant depending only on $B_2$ and $\tilde{c}_2$. Due to (4.39) and Lemma 4.11, for any $x \in X$ we get

$$|f(x) - H(x)| \le |f(x) - \Phi_f(x)| + |\Phi_f(x) - H(x)| \le \tilde{c}_1 n^{-r} + B n^d \varepsilon + \tilde{c}_3 n^d \varepsilon = \tilde{c}_1 n^{-r} + \tilde{C} n^d \varepsilon,$$

where $\tilde{C}$ is a constant depending only on $\tilde{c}_3$ and $B$. Moreover, if $x \in I^d \setminus \bigcup_{k \in \Lambda_s} \bigcup_{j \in \bar{\Lambda}_k} A_{n,j}$, we have $f(x) = P_{k, \eta_j, f}(x) = 0$ and $|\Phi_f(x)| \le B n^d \varepsilon$, so it is easy to obtain that

$$|H(x)| \le |\Phi_f(x)| + |\Phi_f(x) - H(x)| \le B n^d \varepsilon + \tilde{c}_3 n^d \varepsilon = \tilde{C} n^d \varepsilon,$$

where $\tilde{C}$ is a constant depending only on $k$, $c_0$, $d$ and $B$. It is easy to see that there are in total $\tilde{n} = 6 n^d ((L+1)(k+1) + 2d + 1)$ free parameters in $H(x)$. In this case, we obtain

$$|f(x) - H(x)| \le c_1 \tilde{n}^{-r/d} + \tilde{C} \tilde{n} \varepsilon.$$

Furthermore, if $x \in X \setminus \bigcup_{k \in \Lambda_s} \bigcup_{j \in \bar{\Lambda}_k} A_{n,j}$ and $\tilde{n} \ge \tilde{c} N^d$, then

$$|H(x)| \le \tilde{C} \tilde{n} \varepsilon.$$

It is noticeable that all the parameters of the deep net are controllable, being bounded by $\mathrm{poly}(\tilde{n}, \varepsilon^{-1})$. This completes the proof of Theorem 2.3.

Proof of Corollary 2.4. The result (2.9) can be deduced directly from Theorem 2.3 with $\varepsilon = \tilde{n}^{-T}$ for $T \ge \frac{r+d}{d}$. This completes the proof of Corollary 2.4.
5. References

[1] Z. Akkus, A. Galimzianova, A. Hoogi, D. L. Rubin and B. J. Erickson. Deep learning for brain MRI segmentation: state of the art and future directions. Journal of Digital Imaging, 30(4): 449-459, 2017.

[2] D. B. Chen. Degree of approximation by superpositions of a sigmoidal function. Approximation Theory and its Applications, 9: 17-28, 1993.

[3] E. Blum and L. Li. Approximation theory and neural networks. Neural Networks, 4(4): 511-515, 1991.

[4] D. C. Ciresan, U. Meier, L. M. Gambardella and J. Schmidhuber. Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12): 3207-3220, 2010.

[5] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[6] C. K. Chui, X. Li and H. N. Mhaskar. Neural networks for localized approximation. Mathematics of Computation, 63(208): 607-623, 1994.

[7] C. K. Chui, X. Li and H. N. Mhaskar. Limitations of the approximation capabilities of neural networks with one hidden layer. Advances in Computational Mathematics, 5(1): 233-243, 1996.

[8] C. K. Chui, S. B. Lin and D. X. Zhou. Construction of neural networks for realization of localized deep learning. Frontiers in Applied Mathematics and Statistics, 4: 14, 2018.

[9] C. K. Chui, S. B. Lin and D. X. Zhou. Deep neural networks for rotation-invariance approximation and learning. Analysis and Applications, 17(05): 737-772, 2019.

[10] F. Cucker and D. X. Zhou. Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge, 2007.

[11] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. Conference on Learning Theory, 907-940, 2016.

[12] I. Goodfellow, Y. Bengio and A. Courville. Deep Learning. MIT Press, 2016.

[13] Z. C. Guo, L. Shi and S. B. Lin. Realizing data features by deep nets. IEEE Transactions on Neural Networks and Learning Systems, 2019, arXiv:1901.00130.

[14] L. Györfi, M. Kohler, A. Krzyżak