Approximation of Smoothness Classes by Deep ReLU Networks
MAZEN ALI AND ANTHONY NOUY
Centrale Nantes, LMJL UMR CNRS 6629, France
E-mail address: {mazen.ali, anthony.nouy}@ec-nantes.fr
Date: July 31, 2020.
Abstract.
We consider approximation rates of sparsely connected deep rectified linear unit (ReLU) and rectified power unit (RePU) neural networks for functions in Besov spaces $B^\alpha_q(L^p)$ in arbitrary dimension $d$, on bounded or unbounded domains. We show that RePU networks with a fixed activation function attain optimal approximation rates for functions in the Besov space $B^\alpha_\tau(L^\tau)$ on the critical embedding line $1/\tau = \alpha/d + 1/p$ for arbitrary smoothness order $\alpha > 0$. Moreover, we show that ReLU networks attain near to optimal rates for any Besov space strictly above the critical line. Using interpolation theory, this implies that the entire range of smoothness classes at or above the critical line is (near to) optimally approximated by deep ReLU/RePU networks.

Introduction
Artificial neural networks (NNs) have become a popular tool in various fields of computational and data science. Due to their popularity and good performance, NNs have motivated a lot of research in mathematics – especially in recent years – in an attempt to explain the properties of NNs responsible for their success. Although many aspects of NNs still lack a satisfactory mathematical explanation, the expressivity or approximation theoretic properties of NNs are by now quite well understood. By expressivity we mean the theoretical capacity of NNs to approximate functions from different classes. We do not intend to give a literature overview on this topic and instead refer to the recent survey in [17].
Contribution.
In this work, we contribute to the existing body of knowledge on the expressivity of NNs by showing that the very popular and yet quite simple feed-forward rectified linear unit (ReLU) NNs can approximate a very wide range of smoothness classes with near to optimal complexity. To make the distinction to existing results clear, we briefly review what is known by now about the approximation of some more standard smoothness classes closely related to our work. In all instances "complexity" is measured by the number of connections, i.e., non-zero weights.
In [23] it was shown that analytic functions on a compact product domain in any dimension can be approximated in the Sobolev norm $W^{k,\infty}$ by ReLU and RePU networks with close to exponential convergence. In [25] it was shown that ReLU networks can approximate any Hölder continuous function with optimal complexity. In [16] it was shown that functions in the Besov space $B^\alpha_p(L^p(\Omega))$ on bounded Lipschitz domains $\Omega \subset \mathbb{R}^d$ in any dimension can be approximated in the $L^p$-norm by RePU networks with activation function of degree $r \gtrsim \alpha$ with optimal complexity. The spaces $B^\alpha_p(L^p(\Omega))$ correspond to the vertical line in Figure 1 and, for $p \geq 1$, are closely related to the Sobolev spaces $W^{k,p}(\Omega)$.
In [22] it was shown that functions in the Besov space $B^\alpha_\tau(L^\tau(I))$, for $\alpha > 1/\tau - 1/p$ on bounded intervals $I \subset \mathbb{R}$, can be approximated in the $L^p$-norm (or in the $W^{1,p}(I)$-norm and consequently, by interpolation, in the fractional Sobolev norm) with near to optimal complexity. The space $B^\alpha_\tau(L^\tau(I))$ is above the critical embedding line of functions that barely have enough regularity to be members of $L^p$, see the diagonal in Figure 1. Spaces above this critical line are embedded in $L^p$, spaces on this line may or may not be embedded in $L^p$, and spaces below this line are never embedded in $L^p$.
2010 Mathematics Subject Classification.
Key words and phrases.
ReLU Neural Networks, Approximation Spaces, Besov Spaces, Direct Embeddings, Direct (Jackson) Inequalities.
Acknowledgments: The authors acknowledge AIRBUS Group for the financial support with the project AtRandom.
Figure 1. DeVore diagram of smoothness spaces [8]. The Sobolev embedding line is the diagonal through the points $(1/\tau, \alpha)$ and $(1/\mu, r)$.

It was also shown in [22] that piece-wise Gevrey functions can be approximated with close to exponential convergence. Similar results for classical smoothness spaces of univariate functions are contained in [7].
In this work, we show that functions in $B^\alpha_\tau(L^\tau(\Omega))$, for $1/\tau = \alpha/d + 1/p$ and $\Omega = \mathbb{R}^d$ or $\Omega \subset \mathbb{R}^d$ a Lipschitz domain in any dimension $d \in \mathbb{N}$, can be approximated by RePU networks with activation function of degree $r = 2$ with optimal complexity for any $\alpha > 0$. We show the same for ReLU networks with near to optimal complexity (i.e., for any approximation rate arbitrarily close to the optimal one), assuming additionally $\alpha/d > 1/\tau - 1/p$, i.e., for any Besov class strictly above the critical embedding line. This completes the picture for ReLU/RePU expressivity rates for classical smoothness spaces in the sense that, with regard to $L^p$ approximation, functions from any Besov space that embeds into $L^p$ can be approximated by ReLU/RePU networks with (near to) optimal complexity. Note that this feature can be attributed solely to depth, as it was observed in [1, 2] that tensor networks (or sum-product neural networks) – specifically, tensor train networks with a simple nonlinearity – exhibit the same expressivity with regard to classical smoothness spaces. Upon completion of this work we became aware of a similar result in [26]. We comment on this in detail in Section 5.

Outline.
We begin in Section 1.1 and Section 1.2 by reviewing the theoretical framework of our work. We then state the main result in Section 1.3. To keep the presentation self-contained, we review previous results on ReLU approximation in Section 2 and smoothness classes in Section 3 that we require for our work. Finally, in Section 4 we derive the main result of this work, stated again in Theorem 4.5. We conclude with a brief discussion on extensions and alternative proofs in Section 5. The reader familiar with results on ReLU/RePU approximation and wavelet characterizations of Besov spaces can skip directly to Section 4.

1.1. Neural Networks.
We briefly introduce the mathematical description and notation we use for NNs throughout this work. Specifically, we will only consider feed-forward NNs. In Figure 2, we sketch a pictorial representation of a feed-forward NN.

Figure 2. Example of a feed-forward neural network. On the left we have the input nodes marked in red that represent input data to the network. The yellow nodes are the neurons that perform some simple operations on the input. The edges between the nodes represent connections that transfer (after possibly applying an affine transformation) the output of one node into the input of another. The final green nodes are the output nodes. In this particular example the number of layers $L$ is three, with two hidden layers.

Input values are passed on to the first layer of neurons after possibly undergoing an affine transformation. In the neurons, an activation function is applied to the transformed input values. The result again undergoes an affine transformation and is passed to the next layer and so on, until the output layer is reached.
The number of inputs and outputs is typically determined by the intended application. Specifying the architecture of such an NN amounts to choosing the number of layers, the number of neurons in each hidden layer, the activation functions and the connections or, equivalently, the position of the non-zero weights in the affine transformations. The process of training then consists of determining said weights.
We formalize our description of the considered mathematical objects. Let $L \in \mathbb{N}$ be the number of layers, $N_0$ the number of inputs, $N_L$ the number of outputs and $N_1, \ldots, N_{L-1}$ the number of neurons in each hidden layer. A neural network $\Phi$ can be described by the tuple
\[ \Phi := ((T_1, \sigma_1), \ldots, (T_L, \sigma_L)), \]
where for each $1 \leq l \leq L$, $T_l$ is an affine transformation
\[ T_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}, \quad x \mapsto A_l x + b_l, \qquad A_l \in \mathbb{R}^{N_l \times N_{l-1}},\ b_l \in \mathbb{R}^{N_l}, \tag{1.1} \]
and $\sigma_l : \mathbb{R}^{N_l} \to \mathbb{R}^{N_l}$ is a (nonlinear) function, usually applied component-wise as
\[ x \mapsto (\sigma_l^1(x_1), \ldots, \sigma_l^{N_l}(x_{N_l})). \]
In this work we will use RePU activation functions, i.e.,
\[ \sigma_l^i \in \{ I_{\mathbb{R}}, \rho_r \}, \quad \rho_r(t) := \max\{0, t\}^r, \quad 1 \leq l \leq L-1, \quad r \in \mathbb{N}, \qquad \sigma_L^i := I_{\mathbb{R}}, \tag{1.2} \]
where $I_{\mathbb{R}} : \mathbb{R} \to \mathbb{R}$ is the identity map and, for $r = 1$, $\rho_1$ is referred to as the rectified linear unit (ReLU). We allow for the possibility of a non-strict network, i.e., an activation function is either $I_{\mathbb{R}}$ or $\rho_r$. Another possibility is a strict network where each activation function is necessarily $\rho_r$ (with the exclusion of the output nodes). But, as was shown in [16], the approximation theoretic properties of both are the same and thus, for our work, the distinction is irrelevant.
Let $\mathrm{Aff}(N_{l-1}, N_l)$ denote the set of affine maps as in (1.1) and $\mathrm{NL}(N_l, r)$ denote the set of activation functions as in (1.2). For fixed $N_0, N_L$, define
\[ \mathrm{RePU}^{r, N_0, N_L} := \bigcup_{L \in \mathbb{N}} \bigcup_{(N_1, \ldots, N_{L-1}) \in \mathbb{N}^{L-1}} \mathrm{Aff}(N_0, N_1) \times \mathrm{NL}(N_1, r) \times \cdots \times \mathrm{Aff}(N_{L-1}, N_L) \times \mathrm{NL}(N_L, r), \]
\[ \mathrm{ReLU}^{N_0, N_L} := \mathrm{RePU}^{1, N_0, N_L}, \]
and the realization map $\mathrm{R} : \mathrm{RePU}^{r, N_0, N_L} \to (\mathbb{R}^{N_L})^{\mathbb{R}^{N_0}}$ by
\[ \mathrm{R}(\Phi) := \sigma_L \circ T_L \circ \cdots \circ \sigma_1 \circ T_1. \]
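For concreteness, the following is a minimal Python sketch of the tuple description and the realization map above; the names `Network`, `realize` and `rho` are our own illustrative choices, not notation from the text.

```python
import numpy as np

def rho(t, r=1):
    """RePU activation rho_r(t) = max(0, t)^r; r = 1 gives the ReLU."""
    return np.maximum(0.0, t) ** r

class Network:
    """A feed-forward network Phi = ((T_1, s_1), ..., (T_L, s_L)),
    stored as a list of affine maps T_l(x) = A_l @ x + b_l."""
    def __init__(self, layers, r=1):
        self.layers = layers  # list of (A_l, b_l) pairs
        self.r = r

    def realize(self, x):
        """Realization R(Phi): apply T_l, then the activation; the
        output layer uses the identity activation, as in (1.2)."""
        for l, (A, b) in enumerate(self.layers):
            x = A @ x + b
            if l < len(self.layers) - 1:  # hidden layers only
                x = rho(x, self.r)
        return x

    def n_weights(self):
        """Number of non-zero matrix weights (the complexity measure
        W(Phi) used below)."""
        return sum(np.count_nonzero(A) for A, _ in self.layers)

# A tiny ReLU network with one input, one hidden layer, one output:
# it realizes the hat function x -> rho(x) - 2*rho(x - 1) + rho(x - 2).
A1, b1 = np.array([[1.0], [1.0], [1.0]]), np.array([0.0, -1.0, -2.0])
A2, b2 = np.array([[1.0, -2.0, 1.0]]), np.array([0.0])
phi = Network([(A1, b1), (A2, b2)])
print(phi.realize(np.array([1.0])), phi.n_weights())  # -> [1.] 6
```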
1.2. Approximation Classes.

In this work we will state our results in the approximation theoretic framework introduced in [16]. Before we do so, let us first recall the definition of approximation spaces. Let $X$ be a quasi-normed linear space, $\Sigma_n \subset X$ subsets of $X$ for $n \in \mathbb{N}$ and $\Sigma := (\Sigma_n)_{n \in \mathbb{N}}$ an approximation tool. Define the best approximation error
\[ E(f, \Sigma_n)_X := \inf_{\varphi \in \Sigma_n} \| f - \varphi \|_X. \]
With this we define approximation classes as

Definition 1.1 (Approximation Classes). For any $f \in X$ and $\alpha > 0$, define the quantity
\[ \|f\|_{A^\alpha_q} := \begin{cases} \left( \sum_{n=1}^\infty [n^\alpha E(f, \Sigma_n)_X]^q \frac{1}{n} \right)^{1/q}, & 0 < q < \infty, \\ \sup_{n \geq 1} [n^\alpha E(f, \Sigma_n)_X], & q = \infty. \end{cases} \]
The approximation classes $A^\alpha_q$ of $\Sigma = (\Sigma_n)_{n \in \mathbb{N}}$ are defined by
\[ A^\alpha_q(X, \Sigma) := \left\{ f \in X : \|f\|_{A^\alpha_q} < \infty \right\}. \]
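As a quick numerical illustration of the definition (our own, with synthetic error sequences): membership in $A^\alpha_\infty$ simply means the best approximation errors decay at least like $n^{-\alpha}$.

```python
import numpy as np

def a_norm_inf(errors, alpha):
    """sup_{n>=1} n^alpha * E(f, Sigma_n): bounded in n exactly
    when the errors decay at least like n^(-alpha)."""
    n = np.arange(1, len(errors) + 1)
    return np.max(n ** alpha * errors)

n = np.arange(1, 10_001)
fast = 1.0 / n        # decays like n^(-1)
slow = n ** -0.5      # decays like n^(-1/2)
# For alpha = 1 the first sequence gives a bounded quantity (1.0),
# while the second grows like n^(1/2) as more terms are included.
print(a_norm_inf(fast, 1.0), a_norm_inf(slow, 1.0))  # 1.0 vs. 100.0
```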
The utility of these classes comes to light only if the sets $\Sigma_n$ satisfy certain properties. This was discussed in detail in [16] and the relevant properties were shown to hold for RePU networks.
We perform approximation in $X = L^p(\Omega)$ for $0 < p \leq \infty$ and $\Omega = \mathbb{R}^d$ or $\Omega \subset \mathbb{R}^d$ a bounded Lipschitz domain. We thus abbreviate
\[ E(f, \Sigma_n)_p := E(f, \Sigma_n)_{L^p(\Omega)}. \]
As a measure of complexity we will use the number of non-zero weights. I.e., for a given $\Phi \in \mathrm{RePU}^{r, d_0, d_1}$ for some $d_0, d_1 \in \mathbb{N}$, the number of non-zero weights is
\[ W(\Phi) := \sum_{l=1}^L \| T_l \|_{\ell^0}, \qquad \| T_l \|_{\ell^0} := \| A_l \|_{\ell^0}, \]
with $\| A_l \|_{\ell^0}$ being the number of non-zero weights of the matrix $A_l$. With this we define for any $n \in \mathbb{N}$
\[ \mathrm{RePU}^{r, d_0, d_1}_n := \left\{ \Phi \in \mathrm{RePU}^{r, d_0, d_1} : \mathrm{R}(\Phi) \in X,\ W(\Phi) \leq n \right\}, \qquad \mathrm{ReLU}^{d_0, d_1}_n := \mathrm{RePU}^{1, d_0, d_1}_n. \]
The main result of this work then concerns the approximation classes $A^\alpha_q(L^p(\Omega), \mathrm{RePU}^{r,d,1})$.

1.3. Main Result.
For the statement of our main result we will use real $K$-interpolation spaces $(X, Y)_{\theta, q}$, see Section 3.1 for a refresher.

Main Result 1.2 (Direct Embeddings). Let $\Omega = \mathbb{R}^d$ or $\Omega \subset \mathbb{R}^d$ a Lipschitz domain.
(i) For $r \geq 2$, $1 < p < \infty$ and any Besov space $B^\alpha_q(L^\tau(\Omega))$ with $\alpha, \tau, q > 0$ such that
\[ \alpha/d \geq 1/\tau - 1/p, \qquad q \leq (\alpha/d + 1/p)^{-1}, \]
the following embeddings hold
\[ B^\alpha_q(L^\tau(\Omega)) \hookrightarrow A^{\alpha/d}_\infty(L^p(\Omega), \mathrm{RePU}^{r,d,1}), \]
\[ (L^p(\Omega), B^\alpha_q(L^\tau(\Omega)))_{\theta/\alpha, \bar{q}} \hookrightarrow A^{\theta/d}_{\bar{q}}(L^p(\Omega), \mathrm{RePU}^{r,d,1}), \]
for $0 < \theta < \alpha$, $0 < \bar{q} \leq \infty$.
(ii) For $r = 1$, $1 < p < \infty$ and any Besov space $B^\alpha_q(L^\tau(\Omega))$ with $\alpha, \tau, q > 0$ such that
\[ \alpha/d > 1/\tau - 1/p, \qquad q \leq (\alpha/d + 1/p)^{-1}, \]
the following embeddings hold
\[ B^\alpha_q(L^\tau(\Omega)) \hookrightarrow A^{\bar{\alpha}/d}_\infty(L^p(\Omega), \mathrm{ReLU}^{d,1}), \]
\[ (L^p(\Omega), B^\alpha_q(L^\tau(\Omega)))_{\theta/\bar{\alpha}, \bar{q}} \hookrightarrow A^{\theta/d}_{\bar{q}}(L^p(\Omega), \mathrm{ReLU}^{d,1}), \]
for $0 < \theta < \bar{\alpha}$, $0 < \bar{q} \leq \infty$ and any $0 < \bar{\alpha} < \alpha$.
For quantities
A, B ∈ R , we will use the notation A (cid:46) B if there exists a constant C thatdoes not depend on A or B such that A ≤ CB . Similarly for (cid:38) and ∼ if both inequalities hold. Forany m ∈ Z , we define N ≥ m := { m, m + 1 , m + 2 , . . . } . We use supp( f ) to denote the support of a function f ∈ R d → R supp( f ) := { x ∈ R d : f ( x ) (cid:54) = 0 } , and | supp( f ) | to denote the Lebesgue measure of this set. Finally, we use Preliminaries on ReLU Approximation
In this section, we review recent results on deep RePU approximation relevant for this work. We use the notation defined in Section 1.1. The next theorem states that RePU networks can efficiently reproduce or approximate multiplication.
Theorem 2.1 (Multiplication [16, 23, 27]). Let $M_d : \mathbb{R}^d \to \mathbb{R}$ be the multiplication function $x \mapsto \prod_{i=1}^d x_i$. Then, there exists a constant $C$ such that
(i) for $r \geq 2$ and $n := Cd$, there exists a RePU network $\Phi_M \in \mathrm{RePU}^{r,d,1}_n$ such that
\[ M_d = \mathrm{R}(\Phi_M), \]
(ii) for $r = 1$, any $K > 0$ and any $\varepsilon > 0$, and $n := Cd \log(d K^d / \varepsilon)$, there exists a ReLU network $\Phi^\varepsilon_M \in \mathrm{ReLU}^{d,1}_n$ with
\[ \| M_d - \mathrm{R}(\Phi^\varepsilon_M) \|_{L^\infty([-K,K]^d)} \leq \varepsilon. \]
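The two mechanisms behind Theorem 2.1 can be sketched in a few lines of Python (our own illustration, following the standard constructions of [27]): for $r = 2$ the polarization identity $xy = ((x+y)^2 - x^2 - y^2)/2$ together with $t^2 = \rho_2(t) + \rho_2(-t)$ reproduces multiplication exactly, while for $r = 1$ one approximates $t^2$ on $[0,1]$ by $t - \sum_{s=1}^m g_s(t)/2^{2s}$, where $g_s$ is the $s$-fold composition of a ReLU-representable hat function, with error at most $2^{-2m-2}$.

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)
repu2 = lambda t: relu(t) ** 2  # rho_2

def mult_repu2(x, y):
    """Exact multiplication from rho_2 units via polarization:
    t^2 = rho_2(t) + rho_2(-t), xy = ((x+y)^2 - x^2 - y^2)/2."""
    sq = lambda t: repu2(t) + repu2(-t)
    return 0.5 * (sq(x + y) - sq(x) - sq(y))

def g(t):
    """Hat function on [0, 1] as a ReLU combination."""
    return 2 * relu(t) - 4 * relu(t - 0.5) + 2 * relu(t - 1.0)

def sq_relu(t, m):
    """ReLU approximation of t^2 on [0, 1] with m sawtooth levels;
    the error is at most 2^(-2m-2)."""
    out, gs = t.copy(), t.copy()
    for s in range(1, m + 1):
        gs = g(gs)            # s-fold composition of the hat
        out -= gs / 4.0 ** s
    return out

t = np.linspace(0.0, 1.0, 10_001)
print(np.max(np.abs(mult_repu2(t, 1.0 - t) - t * (1.0 - t))))  # ~1e-16
for m in (2, 4, 6):
    err = np.max(np.abs(sq_relu(t, m) - t ** 2))
    print(m, err, 2.0 ** (-2 * m - 2))  # observed error vs. the bound
```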
This in turn implies RePU networks can efficiently reproduce or approximate piece-wise polynomials.

Theorem 2.2 (Piece-wise Polynomials [22]). Let $v : \mathbb{R} \to \mathbb{R}$ be a piece-wise polynomial with $N_v$ pieces, of maximum degree $t \in \mathbb{N}_{\geq 1}$ and with compact support of measure $S := |\mathrm{supp}(v)| < \infty$. Then, there exists a constant $C > 0$ depending on $N_v$, $t$ and $r$ such that
(i) for $r \geq 2$, there exists a RePU network $\Phi \in \mathrm{RePU}^{r,1,1}_C$ with $v = \mathrm{R}(\Phi)$,
(ii) for $r = 1$, the constant $C$ additionally depends on $S$ and $\|v\|_{L^\infty(\mathbb{R})}$, and for any $\varepsilon > 0$ there exists a ReLU network $\Phi \in \mathrm{ReLU}^{1,1}_n$ with $n := C \log(\varepsilon^{-1})$ and the same support as $v$, such that
\[ \| v - \mathrm{R}(\Phi) \|_{L^\infty(\mathbb{R})} \leq \varepsilon. \]
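In the piece-wise linear case ($t = 1$), the mechanism behind Theorem 2.2 is transparent and the ReLU representation is even exact: each kink contributes one ReLU whose coefficient is the jump in slope. A minimal Python sketch of this (our own illustration; the helper name `cpwl_to_relu` is hypothetical):

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)

def cpwl_to_relu(knots, values):
    """Represent a continuous piece-wise linear function with compact
    support (nodal values at the knots, zero outside) exactly as
    v(x) = sum_i c_i * relu(x - knots[i]), where c_i is the jump in
    slope at knot i."""
    x, y = np.asarray(knots, float), np.asarray(values, float)
    slopes = np.diff(y) / np.diff(x)
    slopes = np.concatenate(([0.0], slopes, [0.0]))  # zero slope outside
    c = np.diff(slopes)                              # slope jumps
    return lambda t: sum(ci * relu(t - xi) for ci, xi in zip(c, x))

# A "tent with a notch": 4 pieces, support [0, 4].
knots, values = [0, 1, 2, 3, 4], [0, 2, 0.5, 2, 0]
v = cpwl_to_relu(knots, values)
t = np.linspace(-1, 5, 1201)
exact = np.interp(t, knots, values, left=0.0, right=0.0)
print(np.max(np.abs(v(t) - exact)))  # ~1e-15: exact up to round-off
```

For higher degrees $t \geq 2$, a ReLU network can no longer be exact, which is where the logarithmic cost in (ii) enters.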
The previous result states that RePU networks with $r \geq 2$ can reproduce piece-wise polynomials exactly, while ReLU networks can only approximate them. This suggests the following saturation property.

Theorem 2.3 (Saturation Property [16]). For any $r \geq 2$, any $0 < p, q \leq \infty$ and $\alpha > 0$, and any $d_0, d_1 \in \mathbb{N}$, the approximation spaces defined in Section 1.2 coincide
\[ A^\alpha_q(L^p, \mathrm{RePU}^{2, d_0, d_1}) = A^\alpha_q(L^p, \mathrm{RePU}^{r, d_0, d_1}). \]
Note that the constants will, however, be affected by the degree.
The saturation property will also be clearly visible in the main result of this work in Theorem 4.5. We conclude by pointing out that RePU networks can efficiently reproduce affine systems, i.e., linear combinations of functions that are generated by dilating and shifting a single mother function or, in some cases, a finite number of mother functions. A prominent example of affine systems are wavelets, which will play an important role for the main result in Theorem 4.5. The reproduction of affine systems by NNs was studied in greater detail in [3]. In the following we only mention the properties relevant for this work.

Theorem 2.4 (NN Calculus [16]). For any $r \geq 1$, the following properties hold.
(i) For any $c \in \mathbb{R}$, $n \in \mathbb{N}$, $d_0, d_1 \in \mathbb{N}$ and any $\Phi_1 \in \mathrm{RePU}^{r, d_0, d_1}_n$, there exists $\Phi_2 \in \mathrm{RePU}^{r, d_0, d_1}_n$ with
\[ c \mathrm{R}(\Phi_1) = \mathrm{R}(\Phi_2). \]
(ii) For any $n_1, \ldots, n_N \in \mathbb{N}$, $d_0, d_1 \in \mathbb{N}$ and any $\Phi_1 \in \mathrm{RePU}^{r, d_0, d_1}_{n_1}, \ldots, \Phi_N \in \mathrm{RePU}^{r, d_0, d_1}_{n_N}$, set
\[ C := \min\{d_0, d_1\} \left( \max_i \mathrm{depth}(\Phi_i) - \min_i \mathrm{depth}(\Phi_i) \right). \]
Then, for $n := C + \sum_{i=1}^N n_i$, there exists $\Phi_\Sigma \in \mathrm{RePU}^{r, d_0, d_1}_n$ with
\[ \sum_{i=1}^N \mathrm{R}(\Phi_i) = \mathrm{R}(\Phi_\Sigma). \]
(iii) For any $n_1, \ldots, n_N \in \mathbb{N}$, $d, d_1, \ldots, d_N \in \mathbb{N}$ and any $\Phi_1 \in \mathrm{RePU}^{r, d, d_1}_{n_1}, \ldots, \Phi_N \in \mathrm{RePU}^{r, d, d_N}_{n_N}$, set $K := \sum_{i=1}^N d_i$ and
\[ C := \min\{d, K - 1\} \left( \max_i \mathrm{depth}(\Phi_i) - \min_i \mathrm{depth}(\Phi_i) \right). \]
Then, for $n := C + \sum_{i=1}^N n_i$, there exists $\Phi_\times \in \mathrm{RePU}^{r, d, K}_n$ with
\[ (\mathrm{R}(\Phi_1), \ldots, \mathrm{R}(\Phi_N)) = \mathrm{R}(\Phi_\times). \]
(iv) For any $n_1, n_2 \in \mathbb{N}$, $d_0, d_1, d_2 \in \mathbb{N}$, any $\Phi_1 \in \mathrm{RePU}^{r, d_0, d_1}_{n_1}$ and any $\Phi_2 \in \mathrm{RePU}^{r, d_1, d_2}_{n_2}$, there exists $\Phi \in \mathrm{RePU}^{r, d_0, d_2}_{n_1 + n_2}$ such that
\[ \mathrm{R}(\Phi_2) \circ \mathrm{R}(\Phi_1) = \mathrm{R}(\Phi). \]
(v) Let $D_{ab} : \mathbb{R}^d \to \mathbb{R}^d$ denote the affine transformation $x \mapsto ax - b$ for $a \in \mathbb{R}$, $b \in \mathbb{R}^d$. Then, for any $n, d \in \mathbb{N}$, any $\Phi_1 \in \mathrm{RePU}^{r, d, 1}_n$ and any $a \in \mathbb{R}$, $b \in \mathbb{R}^d$, there exists $\Phi_2 \in \mathrm{RePU}^{r, d, 1}_n$ with
\[ \mathrm{R}(\Phi_1) \circ D_{ab} = \mathrm{R}(\Phi_2). \]
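For equal-depth networks, properties (ii) and (iii) amount to simple block-matrix operations on the weights. The following self-contained Python sketch (our own illustration, ignoring the depth-padding term $C$) makes this explicit:

```python
import numpy as np

def realize(layers, x, r=1):
    """Realization of a ReLU/RePU network given as a list of (A_l, b_l)."""
    for l, (A, b) in enumerate(layers):
        x = A @ x + b
        if l < len(layers) - 1:
            x = np.maximum(0.0, x) ** r
    return x

def parallelize(lay1, lay2):
    """Property (iii) for two equal-depth networks sharing the input:
    block-diagonal weights realize x -> (R(Phi_1)(x), R(Phi_2)(x))."""
    assert len(lay1) == len(lay2)
    out = []
    for l, ((A1, b1), (A2, b2)) in enumerate(zip(lay1, lay2)):
        if l == 0:                       # shared input layer
            A = np.vstack([A1, A2])
        else:                            # block-diagonal otherwise
            A = np.block([[A1, np.zeros((A1.shape[0], A2.shape[1]))],
                          [np.zeros((A2.shape[0], A1.shape[1])), A2]])
        out.append((A, np.concatenate([b1, b2])))
    return out

def add(lay1, lay2):
    """Property (ii): parallelize, then merge the last affine layer so
    the outputs are summed; non-zero weight counts are additive."""
    out = parallelize(lay1, lay2)
    A, b = out[-1]
    d = lay1[-1][0].shape[0]             # common output dimension
    out[-1] = (A[:d] + A[d:], b[:d] + b[d:])
    return out

# Two single-hidden-layer ReLU nets realizing relu(x) and relu(x - 1):
f1 = [(np.array([[1.0]]), np.array([0.0])), (np.array([[1.0]]), np.array([0.0]))]
f2 = [(np.array([[1.0]]), np.array([-1.0])), (np.array([[1.0]]), np.array([0.0]))]
print(realize(add(f1, f2), np.array([2.0])))  # relu(2) + relu(1) = [3.]
```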
3. Besov Spaces and Wavelet Systems
In this section, we recall some classical results on (isotropic) Besov spaces and their characterization with wavelets. As in Section 2, we focus mostly on results relevant to our work. For more details we refer to, e.g., [4].

3.1. Besov Spaces.
Let $\Omega \subset \mathbb{R}^d$ be an open subset and $f \in L^p(\Omega)$ for $0 < p \leq \infty$. For $h \in \mathbb{R}^d$, let $\tau_h$ denote the translation operator $(\tau_h f)(x) := f(x + h)$, $I : \mathbb{R}^d \to \mathbb{R}^d$ the identity operator, and define the $m$-th difference
\[ \Delta^m_h := (\tau_h - I)^m := \underbrace{(\tau_h - I) \circ \cdots \circ (\tau_h - I)}_{m \text{ times}}, \quad m \in \mathbb{N}. \]
We use the notation
\[ \Delta^m_h(f, x, \Omega) := \begin{cases} (\Delta^m_h f)(x), & \text{if } x, x + h, \ldots, x + mh \in \Omega, \\ 0, & \text{otherwise}. \end{cases} \]
The modulus of smoothness of order $m$ is defined for any $t > 0$ by
\[ \omega_m(f, t, \Omega)_p := \sup_{|h| \leq t} \| \Delta^m_h(f, \cdot, \Omega) \|_{L^p(\Omega)}, \]
where $|h|$ denotes the standard Euclidean 2-norm. Finally, the Besov semi-norm is defined for any $0 < p, q \leq \infty$, any $\alpha > 0$ and $m := \lfloor \alpha \rfloor + 1$ by
\[ |f|_{B^\alpha_q(L^p(\Omega))} := \begin{cases} \left( \int_0^\infty [t^{-\alpha} \omega_m(f, t, \Omega)_p]^q \, \mathrm{d}t/t \right)^{1/q}, & 0 < q < \infty, \\ \sup_{t > 0} t^{-\alpha} \omega_m(f, t, \Omega)_p, & q = \infty. \end{cases} \]
Then, the (isotropic) Besov space is defined as
\[ B^\alpha_q(L^p(\Omega)) := \left\{ f \in L^p(\Omega) : |f|_{B^\alpha_q(L^p(\Omega))} < \infty \right\}, \]
and it is a (quasi-)Banach space equipped with the norm
\[ \| f \|_{B^\alpha_q(L^p(\Omega))} := \| f \|_{L^p(\Omega)} + |f|_{B^\alpha_q(L^p(\Omega))}. \]
The parameter $\alpha > 0$ indicates the order of smoothness, while $p$ reflects the measure of said smoothness. The secondary parameter $q$ is less important and merely provides a finer gradation of smoothness. A few relationships are rather straight-forward:
\[ B^{\alpha_1}_q(L^p(\Omega)) \hookrightarrow B^{\alpha_2}_q(L^p(\Omega)), \quad \alpha_1 \geq \alpha_2, \]
\[ B^\alpha_q(L^{p_1}(\Omega)) \hookrightarrow B^\alpha_q(L^{p_2}(\Omega)), \quad p_1 \geq p_2, \]
\[ B^\alpha_{q_1}(L^p(\Omega)) \hookrightarrow B^\alpha_{q_2}(L^p(\Omega)), \quad q_1 \leq q_2, \]
where $\hookrightarrow$ denotes a continuous embedding. For non-integer $\alpha > 0$ and $1 \leq p \leq \infty$, $B^\alpha_p(L^p(\Omega))$ is the fractional Sobolev space $W^{\alpha,p}(\Omega)$. For integer $\alpha > 0$, the Besov space $B^\alpha_\infty(L^p(\Omega))$ is slightly larger than $W^{\alpha,p}(\Omega)$. For $p = q = 2$, the Besov space $B^\alpha_2(L^2(\Omega))$ is the same as the Sobolev space $W^{\alpha,2}(\Omega)$.
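The modulus of smoothness is easy to probe numerically. The following self-contained Python sketch (our own illustration) estimates $\omega_2(f, t)_\infty$ for $f(x) = \sqrt{x}$ on $[0, 1]$ and reads off the rate $t^{1/2}$, consistent with $\sqrt{\cdot} \in B^{1/2}_\infty(L^\infty([0,1]))$:

```python
import numpy as np

def omega2_sup(f, t, a=0.0, b=1.0, n=20_000):
    """Second-order modulus of smoothness in the sup-norm on [a, b]:
    sup over |h| <= t (sampled) and admissible x of |Delta_h^2 f(x)|."""
    best = 0.0
    for h in np.linspace(t / 8, t, 8):           # a few step sizes up to t
        x = np.linspace(a, b - 2 * h, n)         # x, x+h, x+2h must lie in [a, b]
        d2 = f(x) - 2 * f(x + h) + f(x + 2 * h)  # second difference
        best = max(best, np.max(np.abs(d2)))
    return best

ts = np.array([2.0 ** -k for k in range(4, 12)])
w = np.array([omega2_sup(np.sqrt, t) for t in ts])
slope = np.polyfit(np.log(ts), np.log(w), 1)[0]
print(slope)  # ~0.5, i.e. omega_2(f, t) ~ t^(1/2)
```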
A less obvious property is given by the following embedding results.

Theorem 3.1 (Besov Embeddings [4]). Let $\Omega = \mathbb{R}^d$ or $\Omega \subset \mathbb{R}^d$ be a Lipschitz domain.
(i) For $0 < p < \infty$, $0 < q \leq p$, $\alpha > 0$ and $0 < \tau < p$ such that $1/\tau = \alpha/d + 1/p$, the following embedding holds
\[ B^\alpha_q(L^\tau(\Omega)) \hookrightarrow L^p(\Omega). \]
(ii) For $0 < \alpha_2 < \alpha_1$, $0 < p_1 < p_2 \leq \infty$ and $q_1 \leq q_2$, it holds
\[ B^{\alpha_1}_{q_1}(L^{p_1}(\Omega)) \hookrightarrow B^{\alpha_2}_{q_2}(L^{p_2}(\Omega)), \quad \text{if } \alpha_1 - \alpha_2 \geq d(1/p_1 - 1/p_2). \]
The Besov spaces in Theorem 3.1 (i) are on the critical embedding line (see Figure 1). Spaces above this line are embedded in $L^p$, spaces on this line may or may not be embedded in $L^p$, and spaces below this line are never embedded in $L^p$. In this sense, such Besov spaces are quite large, as the functions on this line barely have enough regularity to be members of $L^p$. It is well-known that optimal approximation of functions from such spaces with a continuous parameter selection can only be achieved by non-linear methods, see [9]. It is the main result of this work that RePU networks achieve optimal approximation for these spaces, while ReLU networks achieve near to optimal approximation.
To transfer results from $\mathbb{R}^d$ to bounded Lipschitz domains, we will use the common technique of extension operators.

Theorem 3.2 (Extension Operator [13, 19]). Let $\Omega \subset \mathbb{R}^d$ be a Lipschitz domain (the result is actually valid for more general domains, cf. $(\varepsilon, \delta)$-domains, see [13]). Then, for any $\alpha > 0$ and any $0 < p, q \leq \infty$, there exists a linear operator $\mathcal{E} : B^\alpha_q(L^p(\Omega)) \to B^\alpha_q(L^p(\mathbb{R}^d))$ such that
\[ \| f \|_{B^\alpha_q(L^p(\Omega))} \leq \| \mathcal{E} f \|_{B^\alpha_q(L^p(\mathbb{R}^d))} \leq C \| f \|_{B^\alpha_q(L^p(\Omega))}, \]
where $C$ depends only on $d$, $\alpha$, $p$ and the domain $\Omega$.
We conclude by noting that Besov spaces combine well with interpolation. To be precise, we briefly define interpolation spaces via the $K$-functional. Let $X$ be a quasi-normed space and $Y$ be a quasi-semi-normed space with $Y \hookrightarrow X$. The $K$-functional is defined for any $f \in X$ by
\[ K(f, t, X, Y) := \inf_{f = f_0 + f_1} \{ \| f_0 \|_X + t |f_1|_Y \}, \quad t > 0. \]
For $0 < \theta < 1$ and $0 < q \leq \infty$, define the quantity
\[ |f|_{(X,Y)_{\theta,q}} := \begin{cases} \left( \int_0^\infty [t^{-\theta} K(f, t, X, Y)]^q \, \mathrm{d}t/t \right)^{1/q}, & 0 < q < \infty, \\ \sup_{t > 0} t^{-\theta} K(f, t, X, Y), & q = \infty. \end{cases} \]
Then, the spaces
\[ (X, Y)_{\theta,q} := \left\{ f \in X : |f|_{(X,Y)_{\theta,q}} < \infty \right\}, \]
equipped with the (quasi-)norm
\[ \| f \|_{(X,Y)_{\theta,q}} := \| f \|_X + |f|_{(X,Y)_{\theta,q}}, \]
are interpolation spaces. Besov spaces provide a relatively complete description of interpolation spaces in the following sense: for $0 < \theta < 1$,
\[ (L^p(\Omega), W^\alpha(L^p(\Omega)))_{\theta,q} = B^{\theta\alpha}_q(L^p(\Omega)), \quad 1 \leq p \leq \infty,\ 0 < q \leq \infty, \]
\[ (B^{\alpha_1}_{q_1}(L^p(\Omega)), B^{\alpha_2}_{q_2}(L^p(\Omega)))_{\theta,q} = B^\alpha_q(L^p(\Omega)), \quad 0 < \alpha_1 < \alpha_2,\ \alpha := (1-\theta)\alpha_1 + \theta\alpha_2,\ 0 < p, q, q_1, q_2 \leq \infty, \]
\[ (L^p(\Omega), B^\alpha_{q_1}(L^p(\Omega)))_{\theta,q} = B^{\theta\alpha}_q(L^p(\Omega)), \quad 0 < p, q, q_1 \leq \infty. \]
For Besov spaces on the critical line with $1/\tau = \alpha/d + 1/p$, we obtain
\[ (L^p(\Omega), B^\alpha_\tau(L^\tau(\Omega)))_{\theta,q} = B^{\theta\alpha}_q(L^q(\Omega)), \quad \text{if } 1/q = \theta\alpha/d + 1/p. \]
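As a small worked example of the last identity (our own, for concreteness): take $d = 1$, $p = 2$ and $\alpha = 1$, so the critical line gives $1/\tau = 1 + 1/2$, i.e., $\tau = 2/3$. Then for $\theta = 1/2$,
\[
\frac{1}{q} = \frac{\theta\alpha}{d} + \frac{1}{p} = \frac{1}{2} + \frac{1}{2} = 1,
\qquad\text{so}\qquad
\big(L^2(\Omega),\, B^1_{2/3}(L^{2/3}(\Omega))\big)_{1/2,\,1} = B^{1/2}_1(L^1(\Omega)),
\]
and the resulting space $B^{1/2}_1(L^1(\Omega))$ again lies on the critical line for $L^2$-approximation, since $1/1 = (1/2)/1 + 1/2$. Interpolating between $L^p$ and a point on the critical line thus moves along that line, which is what allows the main result to cover the entire range at or above it.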
3.2. Wavelets.

There are many possible wavelet constructions satisfying different properties depending on the intended application. Said constructions can be rather technical, with the payoff being various favorable analytical and numerical features. We do not intend to cover this topic in-depth and once again only pick out the aspects required for this work. We proceed by briefly reviewing one-dimensional wavelet constructions, after which we turn to wavelets on $\mathbb{R}^d$. Our presentation is somewhat abstract and therefore flexible, but we will also be more specific with some aspects of the construction that we require in Section 4. For more details on the subject we refer to [4].
The starting point of a wavelet construction is typically a multi-resolution analysis (MRA), i.e., a sequence of closed subspaces $V_j \subset V_{j+1}$ of $L^2(\mathbb{R})$ that are nested, dilation- and shift-invariant, dense in $L^2(\mathbb{R})$ and are all generated by a single scaling function $\varphi \in V_0$ (multiple scaling functions are possible as well, in which case such functions are referred to as multi-wavelets, see [14]). To be more precise, we assume the system $\{ \varphi(\cdot - k) : k \in \mathbb{Z} \}$ is a Riesz basis of $V_0$ and therefore $\{ \varphi(2^j \cdot - k) : k \in \mathbb{Z} \}$ is a Riesz basis of $V_j$. We use the shorthand notation
\[ \varphi_{j,k} := 2^{j/2} \varphi(2^j \cdot - k), \]
where the pre-factor $2^{j/2}$ normalizes $\varphi_{j,k}$ in $L^2$. Later we will redefine this to $2^{j/p}$ for normalization in $L^p$ for any $0 < p \leq \infty$, with the convention $2^{j/\infty} = 1$.
Defining a projection $P_j : L^2(\mathbb{R}) \to V_j$ is rather simple if $\{ \varphi(\cdot - k) : k \in \mathbb{Z} \}$ forms an orthogonal basis of $V_0$. Indeed, this property implies that $\{ \varphi_{j,k} : k \in \mathbb{Z} \}$ forms an orthogonal basis of $V_j$, and $P_j$ can be chosen to be the orthogonal projection. However, for numerical reasons, it is sometimes unpractical to construct scaling functions $\varphi$ such that $\{ \varphi(\cdot - k) : k \in \mathbb{Z} \}$ forms an orthogonal basis of $V_0$, and without this property a constructive definition of $P_j$ is not straight-forward.
A way out are so-called bi-orthogonal constructions. A function $\tilde{\varphi} \in L^2(\mathbb{R})$ is dual to $\varphi$ if it satisfies
\[ \langle \varphi(\cdot - k), \tilde{\varphi}(\cdot - l) \rangle_{L^2} = \delta_{k,l}, \quad k, l \in \mathbb{Z}, \]
where $\delta_{k,l}$ is the Kronecker delta. We then define the oblique projection $P_j$ by
\[ P_j f := \sum_{k \in \mathbb{Z}} \langle f, \tilde{\varphi}_{j,k} \rangle_{L^2} \varphi_{j,k}. \tag{3.1} \]
A representation of a function in $V_j$ is typically referred to as a single-scale representation. To switch to a multi-scale representation, we need to characterize the so-called detail spaces defined through the projections $Q_j := P_{j+1} - P_j$, with the detail spaces defined as $W_j := Q_j(L^2(\mathbb{R}))$. This is achieved by constructing a wavelet $\psi \in V_1$,
\[ \psi := \sum_{k \in \mathbb{Z}} g_k \varphi(2 \cdot - k), \]
for some coefficients $g_k \in \mathbb{R}$ such that
\[ N_\psi := \#\{ k : g_k \neq 0 \} < \infty. \tag{3.2} \]
Any function $f \in L^2(\mathbb{R})$ can then be decomposed into a sequence of single-scale coefficients on the coarsest level and detail coefficients on all higher levels
\[ f = \sum_{k \in \mathbb{Z}} c_{0,k} \varphi_{0,k} + \sum_{j \geq 0} \sum_{k \in \mathbb{Z}} c_{j,k} \psi_{j,k}. \tag{3.3} \]
To simplify notation, one typically sets $\psi_{-1,k} := \varphi_{0,k}$ and introduces the index set
\[ \nabla := \{ (j, k) : j \in \mathbb{N}_{\geq -1},\ k \in \mathbb{Z} \}. \]
Decomposition (3.3) then simplifies to
\[ f = \sum_{\lambda \in \nabla} c_\lambda \psi_\lambda. \]
In order for the wavelets $\psi_\lambda$ to characterize Besov spaces, they have to satisfy certain assumptions.

Assumption 3.3 (Characterization). We assume the scaling function $\varphi$ and its dual $\tilde{\varphi}$ satisfy the following properties.
(W1) (Integrability) For some $p', p'' \in [1, \infty]$ such that $1/p' + 1/p'' = 1$, we assume $\varphi \in L^{p'}(\mathbb{R})$ and $\tilde{\varphi} \in L^{p''}(\mathbb{R})$.
(W2) (Polynomial Reproduction) We assume $\varphi$ satisfies Strang-Fix conditions of order $L \in \mathbb{N}$ or, equivalently, for any polynomial $P \in \mathcal{P}_{L-1}$ of degree $L - 1$, we have $P \in V_0$.
(W3) (Regularity) For some $s > 0$ and $0 < p, q \leq \infty$, we assume $\varphi \in B^s_q(L^p(\mathbb{R}))$.
These conditions are sufficient to ensure Besov spaces can be characterized by the decay of the wavelet coefficients. For our work we will require two additional conditions that are, however, easy to satisfy for a variety of wavelet families.
Assumption 3.4 (Piece-wise Polynomial). We additionally assume the scaling function $\varphi$ satisfies the following properties.
(A1) We assume $\varphi$ has compact support.
(A2) We assume $\varphi$ is piece-wise polynomial.
An example of a wavelet family satisfying all of the assumptions (W1)–(W3) and (A1)–(A2) are the CDF bi-orthogonal B-spline wavelets from [6]. These constructions allow to choose an arbitrary polynomial reproduction degree $L - 1$ and regularity order $s$, and the resulting scaling function $\varphi$ (and consequently $\psi$ as well) is a compactly supported spline of degree $L - 1$.
We now turn to wavelet constructions on $\mathbb{R}^d$. There are several possible approaches for this, but we describe a specific tensor product construction suitable for isotropic Besov spaces. We comment on anisotropic Besov spaces in Section 5.
For $x \in \mathbb{R}^d$, we define the tensor product scaling function as
\[ \phi(x) := \varphi(x_1) \cdots \varphi(x_d), \tag{3.4} \]
and in the same manner as before, but for a general $0 < p \leq \infty$,
\[ \phi_{j,k,p}(x) := 2^{dj/p} \phi(2^j x - k), \quad j \in \mathbb{Z},\ k \in \mathbb{Z}^d, \tag{3.5} \]
with the convention $2^{dj/\infty} = 1$. Next, for $e \in \{0,1\}^d \setminus \{0\}$, we define
\[ \psi^e(x) := \psi^{e_1}(x_1) \cdots \psi^{e_d}(x_d), \tag{3.6} \]
with the convention $\psi^1(x_i) := \psi(x_i)$ and $\psi^0(x_i) := \varphi(x_i)$, and $\psi^e_{j,k,p}$ is defined as in (3.5). Simplifying as before with
\[ \nabla := \{ (e, j, k) : e \in \{0,1\}^d \setminus \{0\},\ j \in \mathbb{N}_{\geq 0},\ k \in \mathbb{Z}^d \} \cup \{ (0, 0, k) : k \in \mathbb{Z}^d \}, \]
we obtain the $d$-dimensional wavelet system
\[ \Psi := \{ \psi_{\lambda,p} : \lambda \in \nabla \}, \tag{3.7} \]
where we also use the shorthand notation $|\lambda| := |(e, j, k)| := j$ for $e \neq 0$, and $|\lambda| := -1$ otherwise, together with the level sets
\[ \nabla_j := \{ \lambda = (e, j, k) \in \nabla : e \neq 0,\ |\lambda| = j \}, \quad j \geq 0, \qquad \nabla_{-1} := \{ \lambda = (0, 0, k) \in \nabla : k \in \mathbb{Z}^d \}. \]

Theorem 3.5 (Characterization [4]). Let $\varphi$ satisfy (W1) for some integrability parameters $p', p''$, (W2) for order $L$ and (W3) with smoothness order $s$ for primary parameter $0 < p \leq p'$ and any secondary parameter $0 < q \leq \infty$. Then, if $f = \sum_{\lambda \in \nabla} c_{\lambda,p} \psi_{\lambda,p}$ is the wavelet decomposition of $f$, for
\[ d(1/p - 1/p') < \alpha < \min\{s, L\}, \tag{3.8} \]
we have the norm equivalence
\[ |f|_{B^\alpha_q(L^p(\mathbb{R}^d))} \sim \begin{cases} \left( \sum_{j \geq -1} 2^{j\alpha q} \left( \sum_{\lambda \in \nabla_j} |c_{\lambda,p}|^p \right)^{q/p} \right)^{1/q}, & 0 < q < \infty, \\ \sup_{j \geq -1} 2^{j\alpha} \left( \sum_{\lambda \in \nabla_j} |c_{\lambda,p}|^p \right)^{1/p}, & q = \infty. \end{cases} \]
The above characterization implies optimal approximation rates for best $N$-term wavelet approximations.

Theorem 3.6 ($N$-term Approximation [4]). Let $1 < p < \infty$ and let $\Psi$ be a wavelet system satisfying the assumptions of Theorem 3.5. Define the set of $N$-term wavelet expansions as
\[ W_N := \left\{ \sum_{\lambda \in \Lambda} c_{\lambda,p} \psi_{\lambda,p} : \Lambda \subset \nabla,\ \#\Lambda \leq N \right\}. \]
Then, for $\alpha > 0$, $1/\tau = \alpha/d + 1/p$ and any $f \in B^\alpha_\tau(L^\tau(\mathbb{R}^d))$, it holds
\[ E(f, W_N)_p \lesssim N^{-\alpha/d} |f|_{B^\alpha_\tau(L^\tau(\mathbb{R}^d))}. \]
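The CDF family of Assumption 3.4 and the $N$-term mechanism of Theorem 3.6 can be tried out with PyWavelets, which ships the CDF bi-orthogonal B-spline wavelets as `'biorX.Y'`. The following sketch is our own heuristic illustration: it uses the discrete transform in the $L^2$ setting rather than the $L^p$-normalized continuous system above, and keeps the $N$ largest coefficients of a function with a jump, where nonlinear approximation is markedly better than keeping coarse levels only.

```python
import numpy as np
import pywt  # pip install PyWavelets

# A piece-wise smooth function with a jump: it has low classical
# smoothness but lies in Besov spaces with small tau (large alpha).
x = np.linspace(0.0, 1.0, 2**12, endpoint=False)
f = np.sin(2 * np.pi * x) + (x > 0.37)

coeffs = pywt.wavedec(f, 'bior2.2', level=8)   # CDF 2,2 wavelet
flat, slices = pywt.coeffs_to_array(coeffs)    # flatten for thresholding

for N in (16, 64, 256):
    keep = np.zeros_like(flat)
    idx = np.argsort(np.abs(flat))[-N:]        # N largest coefficients
    keep[idx] = flat[idx]
    cN = pywt.array_to_coeffs(keep, slices, output_format='wavedec')
    fN = pywt.waverec(cN, 'bior2.2')
    err = np.sqrt(np.mean((f - fN[:len(f)]) ** 2))  # discrete L2 error
    print(N, err)  # errors drop quickly with N despite the jump
```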
Remark 3.7. For $p = \tau = \infty$ the corresponding Besov space is the space of Hölder continuous functions and we refer to [11, 25]. The restriction $p > 1$ stems from (3.8), which in turn is based on the fact that the oblique projector defined in (3.1) is, in general, not $L^p$-stable for $p < 1$. More on this in Section 5.

4. Optimal ReLU Approximation of Smoothness Classes
With the results from Section 2 and Section 3 we have all the tools necessary to derive approximation rates for arbitrary Besov functions. As was reviewed in Section 3, Besov spaces can be characterized by the decay of the wavelet coefficients, and $N$-term approximations achieve optimal approximation rates for Besov functions.
In this section, we show that a RePU network (of bounded depth, depending on the smoothness order and the polynomial degree of the activation function) can reproduce an $N$-term wavelet expansion with $\mathcal{O}(N)$ complexity. More importantly, we also show that a ReLU network (of depth depending logarithmically on $\varepsilon^{-1}$) can approximate an $N$-term wavelet expansion with $\mathcal{O}(N \log(\varepsilon^{-1}))$ complexity, where $\varepsilon > 0$ is the target accuracy.

Lemma 4.1 (Wavelet Complexity). Let $\varphi : \mathbb{R} \to \mathbb{R}$ be a scaling function satisfying (A1)–(A2), i.e., a compactly supported piece-wise polynomial scaling function with polynomial reproduction order $L \in \mathbb{N}$. Then,
(i) for $r \geq 2$, there exists a constant $C > 0$, depending only on the number of polynomial pieces (and thus on $N_\psi$ from (3.2)), order $L$ and $r$, such that there exist RePU networks $\Phi_\varphi, \Phi_\psi \in \mathrm{RePU}^{r,1,1}_C$ with
\[ \varphi = \mathrm{R}(\Phi_\varphi), \qquad \psi = \mathrm{R}(\Phi_\psi), \]
(ii) for $r = 1$ and $0 < p \leq \infty$, there exists a constant $C > 0$ as above that additionally depends on $|\mathrm{supp}(\varphi)|$, $|\mathrm{supp}(\psi)|$, $\|\varphi\|_{L^\infty(\mathbb{R})}$ and $\|\psi\|_{L^\infty(\mathbb{R})}$, such that for any $\varepsilon > 0$ there exist ReLU networks $\Phi^\varepsilon_\varphi, \Phi^\varepsilon_\psi \in \mathrm{ReLU}^{1,1}_{C(1 + \log(\varepsilon^{-1}))}$ with the same supports as $\varphi$ and $\psi$, respectively, and
\[ \left\| \varphi - \mathrm{R}(\Phi^\varepsilon_\varphi) \right\|_{L^p(\mathbb{R})} \leq \varepsilon, \qquad \left\| \psi - \mathrm{R}(\Phi^\varepsilon_\psi) \right\|_{L^p(\mathbb{R})} \leq \varepsilon. \]
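For the lowest-order spline family the scaling function is the piece-wise linear hat, so the networks of Lemma 4.1 (i) can even be written down exactly with $r = 1$. The following Python sketch (our own illustration) builds $\varphi$ and a wavelet of the form (3.2) directly from ReLUs; the coefficient choice $g = (-1/2, 1, -1/2)$ is an illustrative one with a single vanishing moment, not the CDF coefficients used in the paper.

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)

def hat(x):
    """Piece-wise linear scaling function (hat on [0, 2]) written
    exactly as a ReLU network: one ReLU per kink."""
    return relu(x) - 2 * relu(x - 1.0) + relu(x - 2.0)

def wavelet(x, g=(-0.5, 1.0, -0.5)):
    """psi(x) = sum_k g_k * phi(2x - k) as in (3.2); by Theorem 2.4
    this sum of dilated/shifted ReLU networks is again a ReLU network.
    The coefficients g are a hypothetical illustrative choice."""
    return sum(gk * hat(2.0 * x - k) for k, gk in enumerate(g))

x = np.linspace(-1.0, 3.0, 2001)
psi = wavelet(x)
print(np.trapz(psi, x))       # ~0: one vanishing moment
print(psi.min(), psi.max())   # a compactly supported wiggle on [0, 2]
```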
Next, using Theorem 2.1 and Lemma 4.1, the above can be extended to the multi-dimensional setting.

Lemma 4.2 (Tensor Product Wavelet Complexity). Let $\varphi : \mathbb{R} \to \mathbb{R}$ be as in Lemma 4.1, satisfying (A1)–(A2), and let $\phi$, $\psi^e$ be the tensor product wavelets defined in (3.5) and (3.6). Then,
(i) for $r \geq 2$, there exists a constant $C > 0$ as in Lemma 4.1 (i), depending additionally on $d$, such that there exist networks $\Phi_\phi, \Phi_{\psi^e} \in \mathrm{RePU}^{r,d,1}_C$ with
\[ \phi = \mathrm{R}(\Phi_\phi), \qquad \psi^e = \mathrm{R}(\Phi_{\psi^e}), \]
for any $e \in \{0,1\}^d \setminus \{0\}$.
(ii) For $r = 1$ and $0 < p \leq \infty$, there exists a constant $C > 0$ as in Lemma 4.1 (ii), depending additionally on $d$, such that for any $0 < \varepsilon < 1$ there exist networks $\Phi^\varepsilon_\phi, \Phi^\varepsilon_{\psi^e} \in \mathrm{ReLU}^{d,1}_{C(1 + \log(\varepsilon^{-1}))}$ with
\[ \left\| \phi - \mathrm{R}(\Phi^\varepsilon_\phi) \right\|_{L^p(\mathbb{R}^d)} \leq \varepsilon, \qquad \left\| \psi^e - \mathrm{R}(\Phi^\varepsilon_{\psi^e}) \right\|_{L^p(\mathbb{R}^d)} \leq \varepsilon, \]
for any $e \in \{0,1\}^d \setminus \{0\}$.

Proof.
Part (i) follows from Theorem 2.1 (i), Theorem 2.4 (iii)–(iv) and Lemma 4.1 (i). Part (ii) follows from Theorem 2.1 (ii), Theorem 2.4 (iii)–(iv) and Lemma 4.1 (ii), which we briefly demonstrate for $\phi$. The proof for $\psi^e$ is analogous. In the following we will frequently use the triangle inequality, i.e., assuming $p \geq 1$. For $0 < p < 1$, $\|\cdot\|_{L^p(\mathbb{R}^d)}$ is only a quasi-norm, i.e., the right-hand side of the triangle inequality is to be multiplied by a constant, and the corresponding complexities are to be adjusted accordingly.
Let $\phi = \varphi \otimes \cdots \otimes \varphi$. We approximate each $\varphi$ as in Lemma 4.1 (ii) by networks $\Phi^\delta_\varphi \in \mathrm{ReLU}^{1,1}_{C(1 + \log(\delta^{-1}))}$ with some accuracy $\delta > 0$, and combine them into $\Phi^\delta_\times \in \mathrm{ReLU}^{d,d}_{n_\delta}$ as in Theorem 2.4 (iii) with $n_\delta := dC(1 + \log(\delta^{-1}))$. Finally, for some different accuracy $\eta > 0$, we construct an approximate multiplication network $\Phi^\eta_M \in \mathrm{ReLU}^{d,1}_{n_\eta}$ with $n_\eta := Cd \log(d K^d \eta^{-1})$ as in Theorem 2.1 (ii), where $K := \|\varphi\|_{L^\infty(\mathbb{R})} + \delta$. This choice of $K$ is justified by
\[ \left\| \mathrm{R}(\Phi^\delta_\varphi) \right\|_{L^\infty(\mathbb{R}^d)} \leq \delta + \|\varphi\|_{L^\infty(\mathbb{R})} = K. \tag{4.1} \]
Our final approximation $\Phi^\varepsilon_\phi \in \mathrm{ReLU}^{d,1}_{n_{\delta,\eta}}$ is defined by
\[ \mathrm{R}(\Phi^\varepsilon_\phi) := \mathrm{R}(\Phi^\eta_M) \circ \mathrm{R}(\Phi^\delta_\times), \]
where, according to Theorem 2.4 (iv), $n_{\delta,\eta} = dC(1 + \log(\delta^{-1})) + Cd \log(d K^d \eta^{-1})$.
We now estimate the resulting error, from which it will be clear how to choose $\delta, \eta > 0$ and thus $n_{\delta,\eta}$. We introduce the auxiliary approximation $\mathrm{R}(\tilde{\Phi}) := M_d \circ \mathrm{R}(\Phi^\delta_\times)$ and the notation $\varphi^\delta := \mathrm{R}(\Phi^\delta_\varphi)$. Then,
\[ \left\| \phi - \mathrm{R}(\Phi^\varepsilon_\phi) \right\|_{L^p(\mathbb{R}^d)} \leq \left\| \phi - \mathrm{R}(\tilde{\Phi}) \right\|_{L^p(\mathbb{R}^d)} + \left\| \mathrm{R}(\tilde{\Phi}) - \mathrm{R}(\Phi^\varepsilon_\phi) \right\|_{L^p(\mathbb{R}^d)}. \]
With $S := |\mathrm{supp}(\varphi)|$, for the second term we apply Theorem 2.1 (ii) and obtain
\[ \left\| \mathrm{R}(\tilde{\Phi}) - \mathrm{R}(\Phi^\varepsilon_\phi) \right\|_{L^p(\mathbb{R}^d)} \leq \eta S^{d/p}. \tag{4.2} \]
For the first term, we can write
\[ \phi - \mathrm{R}(\tilde{\Phi}) = (\varphi - \varphi^\delta) \otimes \varphi \otimes \cdots \otimes \varphi + \varphi^\delta \otimes (\varphi - \varphi^\delta) \otimes \varphi \otimes \cdots \otimes \varphi + \ldots + \varphi^\delta \otimes \cdots \otimes \varphi^\delta \otimes (\varphi - \varphi^\delta). \]
Thus, by a triangle inequality,
\[ \left\| \phi - \mathrm{R}(\tilde{\Phi}) \right\|_{L^p(\mathbb{R}^d)} \leq d K^{d-1} \delta, \tag{4.3} \]
where $K$ is defined in (4.1). From (4.2) and (4.3), we set $\eta := S^{-d/p} \varepsilon / 2$ and $\delta := K^{1-d} \varepsilon / (2d)$ and the statement follows. □

Since affine transformations can be efficiently implemented with NNs (see Theorem 2.4 (v)), the results of the previous lemma carry over to any wavelet $\psi_\lambda$.

Lemma 4.3 (Wavelet System Complexity). Let $\Psi$ be a wavelet system as defined in (3.7), with one-dimensional scaling function $\varphi : \mathbb{R} \to \mathbb{R}$ as in Lemma 4.1 satisfying (A1)–(A2). Then,
(i) for $r \geq 2$, there exists a constant $C > 0$, with dependencies as in Lemma 4.2 (i), such that for any $\psi_\lambda \in \Psi$ there exists a RePU network $\Phi_\lambda \in \mathrm{RePU}^{r,d,1}_C$ with
\[ \psi_\lambda = \mathrm{R}(\Phi_\lambda). \]
(ii) For $r = 1$ and $0 < p \leq \infty$, there exists a constant $C > 0$, with dependencies as in Lemma 4.2 (ii), such that for any $\psi_\lambda$ and any $0 < \varepsilon < 1$, there exists a ReLU network $\Phi^\varepsilon_\lambda \in \mathrm{ReLU}^{d,1}_{C(1 + \log(\varepsilon^{-1}))}$ with
\[ \| \psi_\lambda - \mathrm{R}(\Phi^\varepsilon_\lambda) \|_{L^p(\mathbb{R}^d)} \leq \varepsilon. \]

Proof.
From Lemma 4.2, we have $\Phi_{\psi^e} \in \mathrm{RePU}^{r,d,1}_C$ and we define $\Phi_\lambda \in \mathrm{RePU}^{r,d,1}_C$ using Theorem 2.4 (i) and (v) such that, for $\lambda = (e, j, k)$,
\[ \mathrm{R}(\Phi_\lambda) = 2^{dj/p} \mathrm{R}(\Phi_{\psi^e}) \circ D_{jk}. \]
Similarly we define $\Phi^\varepsilon_\lambda \in \mathrm{ReLU}^{d,1}_{C(1 + \log(\varepsilon^{-1}))}$. Note additionally that the approximation error bound for $\Phi^\varepsilon_\lambda$ remains unchanged, since the $L^p$ scaling of the dilation in $D_{jk}$ exactly cancels the normalization constant $2^{dj/p}$. □

With this we finally turn to direct estimates for networks.
Lemma 4.4 (Direct Estimates RePU/ReLU). Let $\Omega = \mathbb{R}^d$ or $\Omega \subset \mathbb{R}^d$ a Lipschitz domain (or, more generally, an $(\varepsilon, \delta)$-domain).
(i) For $r \geq 2$, $1 < p < \infty$ and any $f \in B^\alpha_q(L^\tau(\Omega))$ with $\alpha, \tau, q > 0$ and
\[ \alpha/d \geq 1/\tau - 1/p, \qquad q \leq (\alpha/d + 1/p)^{-1}, \]
it holds
\[ E(f, \mathrm{RePU}^{r,d,1}_n)_p \lesssim n^{-\alpha/d} |f|_{B^\alpha_q(L^\tau(\Omega))}. \]
(ii) For $r = 1$, $1 < p < \infty$ and any $f \in B^\alpha_q(L^\tau(\Omega))$ with $\alpha, \tau, q > 0$ and
\[ \alpha/d > 1/\tau - 1/p, \qquad q \leq (\alpha/d + 1/p)^{-1}, \]
it holds
\[ E(f, \mathrm{ReLU}^{d,1}_n)_p \lesssim n^{-\bar{\alpha}/d} |f|_{B^\alpha_q(L^\tau(\Omega))}, \]
for any $0 < \bar{\alpha} < \alpha$.

Proof. Consider a wavelet system $\Psi$ that satisfies (A1)–(A2) and (W1)–(W3) for parameters $\bar{p}$ to be defined in (4.4), some $p'$, $s$ and $L$ such that the assumptions of Theorem 3.5 are satisfied for some $\bar{\alpha} > \alpha_0 := d(1/\bar{p} - 1/p)$. Such a wavelet system can be constructed for arbitrary smoothness order $\bar{\alpha}$ using, e.g., the bi-orthogonal wavelets from [6].
Part (i) follows from Lemma 4.3, Theorem 3.6 and Theorem 2.4 (ii). For part (ii), first let $\Omega = \mathbb{R}^d$ and let $f_N := \sum_{\lambda \in \Lambda_N} c_{\lambda,p}(f) \psi_{\lambda,p}$ with $\#\Lambda_N \leq N$ be an $N$-term wavelet approximation of $f$ using the wavelet system $\Psi$ as mentioned above. For some $\varepsilon > 0$, let $\Phi^\varepsilon_\lambda \in \mathrm{ReLU}^{d,1}_{C(1 + \log(\varepsilon^{-1}))}$ be the ReLU $\varepsilon$-approximation of $\psi_\lambda$ from Lemma 4.3, and let $\Phi_N \in \mathrm{ReLU}^{d,1}_{CN(1 + \log(\varepsilon^{-1}))}$ be the sum network from Theorem 2.4 (ii). Then,
\[ \| f_N - \mathrm{R}(\Phi_N) \|_{L^p(\mathbb{R}^d)} \leq \varepsilon \sum_{\lambda \in \Lambda_N} |c_{\lambda,p}(f)|. \]
The sum of the coefficients is bounded by a Besov semi-norm of $f$, as we show next. Define
\[ \bar{p} := (\alpha/d - 1/\tau + 2/p)^{-1}. \tag{4.4} \]
Since we assumed $\alpha/d > 1/\tau - 1/p$, we have $0 < \bar{p} < p$. Next, define
\[ q_0 := \begin{cases} (1 - 1/\bar{p})^{-1}, & \text{if } \bar{p} > 1, \\ \infty, & \text{otherwise}. \end{cases} \]
Then, we can estimate the sum as
\[ \sum_{\lambda \in \Lambda_N} |c_{\lambda,p}(f)| \leq \left( \sum_{\lambda \in \Lambda_N} |c_{\lambda,p}(f)|^{\bar{p}} \right)^{1/\bar{p}} N^{1/q_0}, \]
with the convention $N^{1/\infty} = 1$. For $\bar{\alpha} > \alpha_0 := d(1/\bar{p} - 1/p)$ and by the characterization from Theorem 3.5, we obtain
\[ \left( \sum_{\lambda \in \Lambda_N} |c_{\lambda,p}(f)|^{\bar{p}} \right)^{1/\bar{p}} \lesssim |f_N|_{B^{\bar{\alpha}}_{\bar{p}}(L^{\bar{p}}(\mathbb{R}^d))} \lesssim |f|_{B^{\bar{\alpha}}_{\bar{p}}(L^{\bar{p}}(\mathbb{R}^d))}. \]
By the Besov embeddings from Theorem 3.1 (ii) and the choice of $\bar{p}$ from (4.4), we have $B^\alpha_q(L^\tau(\mathbb{R}^d)) \hookrightarrow B^{\bar{\alpha}}_{\bar{p}}(L^{\bar{p}}(\mathbb{R}^d))$.
Thus, we set $\varepsilon := N^{-\alpha/d - 1/q_0}$ and obtain
\[ \| f - \mathrm{R}(\Phi_N) \|_{L^p(\mathbb{R}^d)} \lesssim N^{-\alpha/d} |f|_{B^\alpha_q(L^\tau(\mathbb{R}^d))}. \]
The complexity of this network can be bounded by
\[ n := C(1 + \alpha/d + 1/q_0) N \log(N) \lesssim N^{1+\delta}, \]
for any $\delta > 0$, or, equivalently,
\[ \| f - \mathrm{R}(\Phi_N) \|_{L^p(\mathbb{R}^d)} \lesssim n^{-\bar{\alpha}/d} |f|_{B^\alpha_q(L^\tau(\mathbb{R}^d))}, \]
for any $0 < \bar{\alpha} < \alpha$. This shows the statement for $\Omega = \mathbb{R}^d$.
For $\Omega \subset \mathbb{R}^d$ a Lipschitz domain, we use the extension operator from Theorem 3.2 to obtain for any $f \in B^\alpha_q(L^\tau(\Omega))$
\[ E(f, \mathrm{ReLU}^{d,1}_n)_{L^p(\Omega)} \leq E(\mathcal{E} f, \mathrm{ReLU}^{d,1}_n)_{L^p(\mathbb{R}^d)} \lesssim n^{-\bar{\alpha}/d} |\mathcal{E} f|_{B^\alpha_q(L^\tau(\mathbb{R}^d))} \lesssim n^{-\bar{\alpha}/d} \| f \|_{B^\alpha_q(L^\tau(\Omega))}. \] □
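The only loss in part (ii) is the logarithmic factor in the complexity $n \sim N \log N$. A short Python computation (our own illustration, with arbitrary constants) shows how this turns the optimal rate $N^{-\alpha/d}$ into $n^{-\bar{\alpha}/d}$ with $\bar{\alpha}$ arbitrarily close to $\alpha$:

```python
import numpy as np

alpha_over_d = 1.5                 # target rate alpha/d (illustrative)
N = np.logspace(2, 7, 30)          # number of wavelet terms
error = N ** -alpha_over_d         # optimal N-term error, Theorem 3.6
n = N * np.log(N)                  # network complexity, n ~ N log N

# Fitted slope of log(error) vs. log(n): the effective rate in n.
slope = -np.polyfit(np.log(n), np.log(error), 1)[0]
print(slope)  # slightly below 1.5: the log factor only shaves an
              # arbitrarily small delta off the exponent as N grows
```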
Finally, the direct estimates above immediately imply the main result of this work.

Theorem 4.5 (Direct Embeddings). Let $\Omega = \mathbb{R}^d$ or $\Omega \subset \mathbb{R}^d$ a Lipschitz domain.
(i) For $r \geq 2$, $1 < p < \infty$ and any Besov space $B^\alpha_q(L^\tau(\Omega))$ with $\alpha, \tau, q > 0$ such that
\[ \alpha/d \geq 1/\tau - 1/p, \qquad q \leq (\alpha/d + 1/p)^{-1}, \]
the following embeddings hold
\[ B^\alpha_q(L^\tau(\Omega)) \hookrightarrow A^{\alpha/d}_\infty(L^p(\Omega), \mathrm{RePU}^{r,d,1}), \]
\[ (L^p(\Omega), B^\alpha_q(L^\tau(\Omega)))_{\theta/\alpha, \bar{q}} \hookrightarrow A^{\theta/d}_{\bar{q}}(L^p(\Omega), \mathrm{RePU}^{r,d,1}), \]
for $0 < \theta < \alpha$, $0 < \bar{q} \leq \infty$.
(ii) For $r = 1$, $1 < p < \infty$ and any Besov space $B^\alpha_q(L^\tau(\Omega))$ with $\alpha, \tau, q > 0$ such that
\[ \alpha/d > 1/\tau - 1/p, \qquad q \leq (\alpha/d + 1/p)^{-1}, \]
the following embeddings hold
\[ B^\alpha_q(L^\tau(\Omega)) \hookrightarrow A^{\bar{\alpha}/d}_\infty(L^p(\Omega), \mathrm{ReLU}^{d,1}), \]
\[ (L^p(\Omega), B^\alpha_q(L^\tau(\Omega)))_{\theta/\bar{\alpha}, \bar{q}} \hookrightarrow A^{\theta/d}_{\bar{q}}(L^p(\Omega), \mathrm{ReLU}^{d,1}), \]
for $0 < \theta < \bar{\alpha}$, $0 < \bar{q} \leq \infty$ and any $0 < \bar{\alpha} < \alpha$.
5. Concluding Remarks

(i) We emphasize that the main result Theorem 4.5 covers the entire range above the critical embedding line (see Figure 1). In particular, it states that deep ReLU networks can approximate any Besov class strictly above the embedding line with near to optimal complexity.
(ii) The results can be extended to Sobolev error measures, i.e., measuring the error in the derivatives as well. See [4, Remark 4.3.3] and [22].
(iii) The above results concern isotropic Besov spaces, i.e., when measuring smoothness, we do not make a distinction between coordinate directions. For anisotropic Besov spaces and corresponding wavelet characterizations we refer to [18].
(iv) For the case $p = \infty$ and $B^\alpha_\infty(L^\infty(\Omega)) = C^\alpha(\Omega)$, we refer to [11, 25]. The limitation concerning $p > 1$ stems from the fact that the oblique projector defined in (3.1) is, in general, not $L^p$-stable for $p < 1$.
One way to circumvent this is to replace the $L^p$ space with the Hardy space $H^p$, see [20]. Another way is described next.
(v) Upon completion of this work we became aware of a similar result in [26]. There the author states that any function in $B^\alpha_\tau(L^\tau([0,1]^d))$ for $\alpha/d > 1/\tau - 1/p$ can be approximated with a ReLU network in near to optimal complexity. Unlike in the proof of Lemma 4.4, where we used a wavelet Riesz basis construction, in [26] the author uses highly redundant B-spline frames combined with the characterization from [12] and an adaptive sampling strategy from [15]. In fact, best $N$-term approximation rates for Besov spaces $B^\alpha_\tau(L^\tau(\mathbb{R}^d))$ were obtained much earlier in [10], where the authors once again use highly redundant B-spline frames and best $N$-term approximations therein.
A representation of a function $f \in B^\alpha_\tau(L^\tau([0,1]^d))$ in such a frame is not unique and, in general, not stable. But there exists a particular stable representation based on quasi-interpolants of piece-wise polynomial near best approximations as constructed in [12] and used in [26]. Combined with the adaptive sampling strategy from [15], such a construction possesses somewhat similar features to the tree approximations from [5]. Hence, the condition $\alpha/d > 1/\tau - 1/p$ appears in [5, 15, 26]. In contrast, the results of [10] hold for $\alpha/d = 1/\tau - 1/p$.
The advantage of using such highly redundant frames as in [10, 12, 26] is that the result of Lemma 4.4 (ii) is valid for the range $0 < p \leq \infty$, since the quasi-interpolants are $L^p$-stable for any $0 < p \leq \infty$.
The result of [26] is stated only for $\Omega = [0,1]^d$. However, we believe this can be extended to $\Omega = \mathbb{R}^d$. We also note that in [26] the author shows that functions in Besov spaces of mixed smoothness can be approximated with a "milder" curse of dimensionality: the exponential dependence on $d$ enters the complexity through log-factors such as $(\log(\varepsilon^{-1}))^d$; the same was shown in [21] for Korobov spaces of mixed smoothness and deep ReLU networks.
We also note that in [24] the author uses B-splines to characterize Besov spaces and estimate best $N$-term approximation rates as well. However, that construction differs from [10, 12] and, in particular, the characterization and subsequent $N$-term approximation is performed for modified Besov spaces (essentially Besov spaces of anisotropic smoothness). These coincide with standard Besov spaces in case, once again, (3.8) is satisfied.
Overall, using either wavelets as in this work or B-spline quasi-interpolants as in [26] provides two alternatives for proving Lemma 4.4 (ii) – neither is inherently more difficult than the other, with the caveat that for $p \leq 1$ only the frame-based approach applies.

References
[1]
Ali, M., and Nouy, A.
Approximation with Tensor Networks. Part I: Approximation Spaces. arXiv e-prints (June 2020), arXiv:2007.00118.
[2]
Ali, M., and Nouy, A.
Approximation with Tensor Networks. Part II: Approximation Rates for Smoothness Classes. arXiv e-prints (June 2020), arXiv:2007.00128.[3]
Bölcskei, H., Grohs, P., Kutyniok, G., and Petersen, P.
Optimal Approximation with Sparsely Connected Deep Neural Networks.
SIAM Journal on Mathematics of Data Science 1 , 1 (2019), 8–45.[4]
Cohen, A.
Numerical Analysis of Wavelet Methods . Elsevier, Amsterdam Boston, 2003.[5]
Cohen, A., Dahmen, W., Daubechies, I., and DeVore, R.
Tree Approximation and Optimal Encoding.
Applied and Computational Harmonic Analysis 11, 2 (2001), 192–226.
[6]
Cohen, A., Daubechies, I., and Feauveau, J.-C.
Biorthogonal Bases of Compactly Supported Wavelets.
Communications on Pure and Applied Mathematics 45, 5 (1992), 485–560.
[7]
Daubechies, I., DeVore, R., Foucart, S., Hanin, B., and Petrova, G.
Nonlinear Approximation and (Deep) ReLU Networks. arXiv e-prints (May 2019), arXiv:1905.02199.
[8]
DeVore, R. A.
Nonlinear Approximation.
Acta Numerica 7 (1998), 51–150.[9]
DeVore, R. A., Howard, R., and Micchelli, C.
Optimal Nonlinear Approximation.
Manuscripta mathematica63 , 4 (1989), 469–478.[10]
DeVore, R. A., Jawerth, B., and Popov, V.
Compression of Wavelet Decompositions.
American Journal of Mathematics 114, 4 (1992), 737–785.
[11]
DeVore, R. A., Petrushev, P., and Yu, X. M.
Nonlinear Wavelet Approximation in the Space $C(\mathbb{R}^d)$. In Progress in Approximation Theory (New York, NY, 1992), A. A. Gonchar and E. B. Saff, Eds., Springer New York, pp. 261–283.
[12]
DeVore, R. A., and Popov, V. A.
Interpolation of Besov Spaces.
Transactions of the American Mathematical Society 305, 1 (1988), 397–414.
[13]
DeVore, R. A., and Sharpley, R. C.
Besov Spaces on Domains in $\mathbb{R}^d$. Transactions of the American Mathematical Society 335, 2 (1993), 843–864.
[14]
Donovan, G. C., Geronimo, J. S., and Hardin, D. P.
Orthogonal Polynomials and the Construction of Piecewise Polynomial Smooth Wavelets.
SIAM Journal on Mathematical Analysis 30 , 5 (1999), 1029–1056.[15]
Dũng, D.
Optimal Adaptive Sampling Recovery.
Advances in Computational Mathematics 34 , 1 (2011), 1–41.[16]
Gribonval, R., Kutyniok, G., Nielsen, M., and Voigtlaender, F.
Approximation Spaces of Deep Neural Networks. arXiv e-prints (May 2019), arXiv:1905.01208.
[17]
Gühring, I., Raslan, M., and Kutyniok, G.
Expressivity of Deep Neural Networks. arXiv e-prints (July 2020), arXiv:2007.04759.
[18]
Hochmuth, R.
Wavelet Characterizations for Anisotropic Besov Spaces.
Applied and Computational Harmonic Analysis 12, 2 (2002), 179–208.
[19]
Johnen, H., and Scherer, K.
On the Equivalence of the K-functional and Moduli of Continuity and Some Applications. In
Constructive Theory of Functions of Several Variables (Berlin, Heidelberg, 1977), W. Schempp and K. Zeller, Eds., Springer Berlin Heidelberg, pp. 119–140.
[20]
Kyriazis, G. C.
Wavelet Coefficients Measuring Smoothness in $H^p(\mathbb{R}^d)$. Applied and Computational Harmonic Analysis 3 (1996), 100–119.
[21]
Montanelli, H., and Du, Q.
New Error Bounds for Deep ReLU Networks Using Sparse Grids.
SIAM Journal onMathematics of Data Science 1 , 1 (2019), 78–92.[22]
Opschoor, J. A. A., Petersen, P. C., and Schwab, C.
Deep ReLU Networks and High-order Finite Element Methods.
Analysis and Applications (2020).[23]
Opschoor, J. A. A., Schwab, C., and Zech, J.
Exponential ReLU DNN Expression of Holomorphic Maps in High Dimension. Tech. Rep. 2019-35, Seminar for Applied Mathematics, ETH Zürich, Switzerland, 2019.
[24]
Oswald, P.
On the Degree of Nonlinear Spline Approximation in Besov-Sobolev Spaces.
Journal of Approximation Theory 61, 2 (1990), 131–157.
[25]
Petersen, P., and Voigtlaender, F.
Optimal Approximation of Piecewise Smooth Functions Using Deep ReLU Neural Networks.
Neural Networks 108 (2018), 296 – 330.[26]
Suzuki, T.
Adaptivity of Deep ReLU Network for Learning in Besov and Mixed Smooth Besov Spaces: Optimal Rate and Curse of Dimensionality. In
International Conference on Learning Representations (2019).[27]
Yarotsky, D.
Error Bounds for Approximations with Deep ReLU Networks.
Neural Networks 94 (2017), 103 – 114.(2017), 103 – 114.