Approximation of Smoothness Classes by Deep ReLU Networks
MAZEN ALI AND ANTHONY NOUY
Centrale Nantes, LMJL UMR CNRS 6629, France
E-mail address: {mazen.ali, anthony.nouy}@ec-nantes.fr
Date: July 31, 2020.
Abstract.
We consider approximation rates of sparsely connected deep rectified linear unit (ReLU) and rectified power unit (RePU) neural networks for functions in Besov spaces $B^\alpha_q(L^p)$ in arbitrary dimension $d$, on bounded or unbounded domains. We show that RePU networks with a fixed activation function attain optimal approximation rates for functions in the Besov space $B^\alpha_\tau(L^\tau)$ on the critical embedding line $1/\tau = \alpha/d + 1/p$ for arbitrary smoothness order $\alpha > 0$. Moreover, we show that ReLU networks attain near to optimal rates for any Besov space strictly above the critical line. Using interpolation theory, this implies that the entire range of smoothness classes at or above the critical line is (near to) optimally approximated by deep ReLU/RePU networks.

Introduction
Artificial neural networks (NNs) have become a popular tool in various fields of computational and data science. Due to their popularity and good performance, NNs have motivated a lot of research in mathematics – especially in recent years – in an attempt to explain the properties of NNs responsible for their success. Although many aspects of NNs still lack a satisfactory mathematical explanation, the expressivity or approximation theoretic properties of NNs are by now quite well understood. By expressivity we mean the theoretical capacity of NNs to approximate functions from different classes. We do not intend to give a literature overview on this topic and instead refer to the recent survey in [17].
Contribution.
In this work, we contribute to the existing body of knowledge on the expressivity of NNs by showing that the very popular and yet quite simple feed-forward rectified linear unit (ReLU) NNs can approximate a very wide range of smoothness classes with near to optimal complexity. To make the distinction to existing results clear, we briefly review what is known by now about the approximation of some more standard smoothness classes closely related to our work. In all instances "complexity" is measured by the number of connections, i.e., non-zero weights.
In [23] it was shown that analytic functions on a compact product domain in any dimension can be approximated in the Sobolev norm $W^{k,\infty}$ by ReLU and RePU networks with close to exponential convergence. In [25] it was shown that ReLU networks can approximate any Hölder continuous function with optimal complexity. In [16] it was shown that functions in the Besov space $B^\alpha_p(L^p(\Omega))$ on bounded Lipschitz domains $\Omega \subset \mathbb{R}^d$ in any dimension can be approximated in the $L^p$-norm by RePU networks with activation function of degree $r \gtrsim \alpha$ with optimal complexity. The spaces $B^\alpha_p(L^p(\Omega))$ correspond to the vertical line in Figure 1 and, for $p \geq 1$, are closely related to the Sobolev spaces $W^{k,p}(\Omega)$.
In [22] it was shown that functions in the Besov space $B^\alpha_\tau(L^\tau(I))$, for $\alpha > 1/\tau - 1/p$ on bounded intervals $I \subset \mathbb{R}$, can be approximated in the $L^p$-norm (or in the $W^{1,p}(I)$-norm and consequently, by interpolation, in the fractional Sobolev norm) with near to optimal complexity. The space $B^\alpha_\tau(L^\tau(I))$ is above the critical embedding line of functions that barely have enough regularity to be members of $L^p$, see the diagonal in Figure 1. Spaces above this critical line are embedded in $L^p$, spaces on this line may or may not be embedded in $L^p$, and spaces below this line are never embedded in $L^p$.
2010 Mathematics Subject Classification.
Key words and phrases.
ReLU Neural Networks, Approximation Spaces, Besov Spaces, Direct Embeddings, Direct (Jackson) Inequalities.
Acknowledgments: The authors acknowledge AIRBUS Group for the financial support with the project AtRandom.
Figure 1. DeVore diagram of smoothness spaces [8]. The Sobolev embedding line is the diagonal through the points $(1/\tau, \alpha)$ and $(1/\mu, r)$.

It was also shown in [22] that piece-wise Gevrey functions can be approximated with close to exponential convergence. Similar results for classical smoothness spaces of univariate functions are contained in [7].
In this work, we show that functions in $B^\alpha_\tau(L^\tau(\Omega))$, for $1/\tau = \alpha/d + 1/p$ and $\Omega = \mathbb{R}^d$ or $\Omega \subset \mathbb{R}^d$ a Lipschitz domain in any dimension $d \in \mathbb{N}$, can be approximated by RePU networks with activation function of degree $r = 2$ with optimal complexity for any $\alpha > 0$. We show the same for ReLU networks with near to optimal complexity (i.e., for any approximation rate arbitrarily close to the optimal one), assuming additionally $\alpha/d > 1/\tau - 1/p$, i.e., for any Besov class strictly above the critical embedding line. This completes the picture for ReLU/RePU expressivity rates for classical smoothness spaces in the sense that, with regard to $L^p$ approximation, functions from any Besov space that embeds into $L^p$ can be approximated by ReLU/RePU networks with (near to) optimal complexity. Note that this feature can be attributed solely to depth, as it was observed in [1, 2] that tensor networks (or sum-product neural networks) – specifically, tensor train networks with a simple nonlinearity – exhibit the same expressivity with regard to classical smoothness spaces. Upon completion of this work we became aware of a similar result in [26]. We comment on this in detail in Section 5.

Outline.
We begin in Section 1.1 and Section 1.2 by reviewing the theoretical framework of our work. We then state the main result in Section 1.3. To keep the presentation self-contained, we review previous results on ReLU approximation in Section 2 and smoothness classes in Section 3 that we require for our work. Finally, in Section 4 we derive the main result of this work, stated again in Theorem 4.5. We conclude with a brief discussion on extensions and alternative proofs in Section 5. The reader familiar with results on ReLU/RePU approximation and wavelet characterizations of Besov spaces can skip directly to Section 4.

1.1. Neural Networks.
We briefly introduce the mathematical description and notation we use for NNs throughout this work. Specifically, we will only consider feed-forward NNs. In Figure 2, we sketch a pictorial representation of a feed-forward NN.

Figure 2. Example of a feed-forward neural network. On the left we have the input nodes marked in red that represent input data to the network. The yellow nodes are the neurons that perform some simple operations on the input. The edges between the nodes represent connections that transfer (after possibly applying an affine transformation) the output of one node into the input of another. The final green nodes are the output nodes. In this particular example the number of layers $L$ is three, with two hidden layers.

Input values are passed on to the first layer of neurons after possibly undergoing an affine transformation. In the neurons, an activation function is applied to the transformed input values. The result again undergoes an affine transformation and is passed to the next layer and so on, until the output layer is reached.
The number of inputs and outputs is typically determined by the intended application. Specifying the architecture of such an NN amounts to choosing the number of layers, the number of neurons in each hidden layer, the activation functions and the connections or, equivalently, the position of the non-zero weights in the affine transformations. The process of training then consists of determining said weights.
We formalize our description of the considered mathematical objects. Let $L \in \mathbb{N}$ be the number of layers, $N_0$ the number of inputs, $N_L$ the number of outputs and $N_1, \ldots, N_{L-1}$ the number of neurons in each hidden layer. A neural network $\Phi$ can be described by the tuple
\[ \Phi := ((T_1, \sigma_1), \ldots, (T_L, \sigma_L)), \]
where for each $1 \leq l \leq L$, $T_l$ is an affine transformation
\[ T_l : \mathbb{R}^{N_{l-1}} \to \mathbb{R}^{N_l}, \quad x \mapsto A_l x + b_l, \qquad A_l \in \mathbb{R}^{N_l \times N_{l-1}},\ b_l \in \mathbb{R}^{N_l}, \tag{1.1} \]
and $\sigma_l : \mathbb{R}^{N_l} \to \mathbb{R}^{N_l}$ is a (nonlinear) function, usually applied component-wise as
\[ x \mapsto (\sigma_l^1(x_1), \ldots, \sigma_l^{N_l}(x_{N_l})). \]
In this work we will use RePU activation functions, i.e.,
\[ \sigma_l^i \in \{ I_{\mathbb{R}}, \rho_r \}, \quad \rho_r(t) := \max\{0, t\}^r, \quad 1 \leq l \leq L-1, \quad r \in \mathbb{N}, \qquad \sigma_L^i := I_{\mathbb{R}}, \tag{1.2} \]
where $I_{\mathbb{R}} : \mathbb{R} \to \mathbb{R}$ is the identity map and, for $r = 1$, $\rho_1$ is referred to as the rectified linear unit (ReLU). We allow for the possibility of a non-strict network, i.e., an activation function is either $I_{\mathbb{R}}$ or $\rho_r$. Another possibility is a strict network where each activation function is necessarily $\rho_r$ (with the exclusion of the output nodes). But, as was shown in [16], the approximation theoretic properties of both are the same and thus, for our work, the distinction is irrelevant.
Let $\mathrm{Aff}(N_{l-1}, N_l)$ denote the set of affine maps as in (1.1) and $\mathrm{NL}(N_l, r)$ denote the set of activation functions as in (1.2). For fixed $N_0, N_L$, define
\[ \mathrm{RePU}^{r, N_0, N_L} := \bigcup_{L \in \mathbb{N}} \bigcup_{(N_1, \ldots, N_{L-1}) \in \mathbb{N}^{L-1}} \mathrm{Aff}(N_0, N_1) \times \mathrm{NL}(N_1, r) \times \cdots \times \mathrm{Aff}(N_{L-1}, N_L) \times \mathrm{NL}(N_L, r), \]
\[ \mathrm{ReLU}^{N_0, N_L} := \mathrm{RePU}^{1, N_0, N_L}, \]
and the realization map $\mathrm{R} : \mathrm{RePU}^{r, N_0, N_L} \to (\mathbb{R}^{N_L})^{\mathbb{R}^{N_0}}$ by
\[ \mathrm{R}(\Phi) := \sigma_L \circ T_L \circ \cdots \circ \sigma_1 \circ T_1. \]
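For concreteness, the following is a minimal Python sketch of the tuple description and the realization map above; the names `Network`, `realize` and `rho` are our own illustrative choices, not notation from the text.

```python
import numpy as np

def rho(t, r=1):
    """RePU activation rho_r(t) = max(0, t)^r; r = 1 gives the ReLU."""
    return np.maximum(0.0, t) ** r

class Network:
    """A feed-forward network Phi = ((T_1, s_1), ..., (T_L, s_L)),
    stored as a list of affine maps T_l(x) = A_l @ x + b_l."""
    def __init__(self, layers, r=1):
        self.layers = layers  # list of (A_l, b_l) pairs
        self.r = r

    def realize(self, x):
        """Realization R(Phi): apply T_l, then the activation; the
        output layer uses the identity activation, as in (1.2)."""
        for l, (A, b) in enumerate(self.layers):
            x = A @ x + b
            if l < len(self.layers) - 1:  # hidden layers only
                x = rho(x, self.r)
        return x

    def n_weights(self):
        """Number of non-zero matrix weights (the complexity measure
        W(Phi) used below)."""
        return sum(np.count_nonzero(A) for A, _ in self.layers)

# A tiny ReLU network with one input, one hidden layer, one output:
# it realizes the hat function x -> rho(x) - 2*rho(x - 1) + rho(x - 2).
A1, b1 = np.array([[1.0], [1.0], [1.0]]), np.array([0.0, -1.0, -2.0])
A2, b2 = np.array([[1.0, -2.0, 1.0]]), np.array([0.0])
phi = Network([(A1, b1), (A2, b2)])
print(phi.realize(np.array([1.0])), phi.n_weights())  # -> [1.] 6
```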
1.2. Approximation Classes.

In this work we will state our results in the approximation theoretic framework introduced in [16]. Before we do so, let us first recall the definition of approximation spaces. Let $X$ be a quasi-normed linear space, $\Sigma_n \subset X$ subsets of $X$ for $n \in \mathbb{N}$ and $\Sigma := (\Sigma_n)_{n \in \mathbb{N}}$ an approximation tool. Define the best approximation error
\[ E(f, \Sigma_n)_X := \inf_{\varphi \in \Sigma_n} \| f - \varphi \|_X. \]
With this we define approximation classes as

Definition 1.1 (Approximation Classes). For any $f \in X$ and $\alpha > 0$, define the quantity
\[ \|f\|_{A^\alpha_q} := \begin{cases} \left( \sum_{n=1}^\infty [n^\alpha E(f, \Sigma_n)_X]^q \frac{1}{n} \right)^{1/q}, & 0 < q < \infty, \\ \sup_{n \geq 1} [n^\alpha E(f, \Sigma_n)_X], & q = \infty. \end{cases} \]
The approximation classes $A^\alpha_q$ of $\Sigma = (\Sigma_n)_{n \in \mathbb{N}}$ are defined by
\[ A^\alpha_q(X, \Sigma) := \left\{ f \in X : \|f\|_{A^\alpha_q} < \infty \right\}. \]
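As a quick numerical illustration of the definition (our own, with synthetic error sequences): membership in $A^\alpha_\infty$ simply means the best approximation errors decay at least like $n^{-\alpha}$.

```python
import numpy as np

def a_norm_inf(errors, alpha):
    """sup_{n>=1} n^alpha * E(f, Sigma_n): bounded in n exactly
    when the errors decay at least like n^(-alpha)."""
    n = np.arange(1, len(errors) + 1)
    return np.max(n ** alpha * errors)

n = np.arange(1, 10_001)
fast = 1.0 / n        # decays like n^(-1)
slow = n ** -0.5      # decays like n^(-1/2)
# For alpha = 1 the first sequence gives a bounded quantity (1.0),
# while the second grows like n^(1/2) as more terms are included.
print(a_norm_inf(fast, 1.0), a_norm_inf(slow, 1.0))  # 1.0 vs. 100.0
```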
The utility of these classes comes to light only if the sets $\Sigma_n$ satisfy certain properties. This was discussed in detail in [16] and the relevant properties were shown to hold for RePU networks.
We perform approximation in $X = L^p(\Omega)$ for $0 < p \leq \infty$ and $\Omega = \mathbb{R}^d$ or $\Omega \subset \mathbb{R}^d$ a bounded Lipschitz domain. We thus abbreviate
\[ E(f, \Sigma_n)_p := E(f, \Sigma_n)_{L^p(\Omega)}. \]
As a measure of complexity we will use the number of non-zero weights. I.e., for a given $\Phi \in \mathrm{RePU}^{r, d_0, d_1}$ for some $d_0, d_1 \in \mathbb{N}$, the number of non-zero weights is
\[ W(\Phi) := \sum_{l=1}^L \| T_l \|_{\ell^0}, \qquad \| T_l \|_{\ell^0} := \| A_l \|_{\ell^0}, \]
with $\| A_l \|_{\ell^0}$ being the number of non-zero weights of the matrix $A_l$. With this we define for any $n \in \mathbb{N}$
\[ \mathrm{RePU}^{r, d_0, d_1}_n := \left\{ \Phi \in \mathrm{RePU}^{r, d_0, d_1} : \mathrm{R}(\Phi) \in X,\ W(\Phi) \leq n \right\}, \qquad \mathrm{ReLU}^{d_0, d_1}_n := \mathrm{RePU}^{1, d_0, d_1}_n. \]
The main result of this work then concerns the approximation classes $A^\alpha_q(L^p(\Omega), \mathrm{RePU}^{r,d,1})$.

1.3. Main Result.
For the statement of our main result we will use real $K$-interpolation spaces $(X, Y)_{\theta, q}$, see Section 3.1 for a refresher.

Main Result 1.2 (Direct Embeddings). Let $\Omega = \mathbb{R}^d$ or $\Omega \subset \mathbb{R}^d$ a Lipschitz domain.
(i) For $r \geq 2$, $1 < p < \infty$ and any Besov space $B^\alpha_q(L^\tau(\Omega))$ with $\alpha, \tau, q > 0$ such that
\[ \alpha/d \geq 1/\tau - 1/p, \qquad q \leq (\alpha/d + 1/p)^{-1}, \]
the following embeddings hold
\[ B^\alpha_q(L^\tau(\Omega)) \hookrightarrow A^{\alpha/d}_\infty(L^p(\Omega), \mathrm{RePU}^{r,d,1}), \]
\[ (L^p(\Omega), B^\alpha_q(L^\tau(\Omega)))_{\theta/\alpha, \bar{q}} \hookrightarrow A^{\theta/d}_{\bar{q}}(L^p(\Omega), \mathrm{RePU}^{r,d,1}), \]
for $0 < \theta < \alpha$, $0 < \bar{q} \leq \infty$.
(ii) For $r = 1$, $1 < p < \infty$ and any Besov space $B^\alpha_q(L^\tau(\Omega))$ with $\alpha, \tau, q > 0$ such that
\[ \alpha/d > 1/\tau - 1/p, \qquad q \leq (\alpha/d + 1/p)^{-1}, \]
the following embeddings hold
\[ B^\alpha_q(L^\tau(\Omega)) \hookrightarrow A^{\bar{\alpha}/d}_\infty(L^p(\Omega), \mathrm{ReLU}^{d,1}), \]
\[ (L^p(\Omega), B^\alpha_q(L^\tau(\Omega)))_{\theta/\bar{\alpha}, \bar{q}} \hookrightarrow A^{\theta/d}_{\bar{q}}(L^p(\Omega), \mathrm{ReLU}^{d,1}), \]
for $0 < \theta < \bar{\alpha}$, $0 < \bar{q} \leq \infty$ and any $0 < \bar{\alpha} < \alpha$.
For quantities
A, B ∈ R , we will use the notation A (cid:46) B if there exists a constant C thatdoes not depend on A or B such that A ≤ CB . Similarly for (cid:38) and ∼ if both inequalities hold. Forany m ∈ Z , we define N ≥ m := { m, m + 1 , m + 2 , . . . } . We use supp( f ) to denote the support of a function f ∈ R d → R supp( f ) := { x ∈ R d : f ( x ) (cid:54) = 0 } , and | supp( f ) | to denote the Lebesgue measure of this set. Finally, we use Preliminaries on ReLU Approximation
In this section, we review recent results on deep RePU approximation relevant for this work. We use the notation defined in Section 1.1. The next theorem states that RePU networks can efficiently reproduce or approximate multiplication.
Theorem 2.1 (Multiplication [16, 23, 27]). Let $M_d : \mathbb{R}^d \to \mathbb{R}$ be the multiplication function $x \mapsto \prod_{i=1}^d x_i$. Then, there exists a constant $C$ such that
(i) for $r \geq 2$ and $n := Cd$, there exists a RePU network $\Phi_M \in \mathrm{RePU}^{r,d,1}_n$ such that
\[ M_d = \mathrm{R}(\Phi_M), \]
(ii) for $r = 1$, any $K > 0$ and any $\varepsilon > 0$, and $n := Cd \log(d K^d / \varepsilon)$, there exists a ReLU network $\Phi^\varepsilon_M \in \mathrm{ReLU}^{d,1}_n$ with
\[ \| M_d - \mathrm{R}(\Phi^\varepsilon_M) \|_{L^\infty([-K,K]^d)} \leq \varepsilon. \]
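The two mechanisms behind Theorem 2.1 can be sketched in a few lines of Python (our own illustration, following the standard constructions of [27]): for $r = 2$ the polarization identity $xy = ((x+y)^2 - x^2 - y^2)/2$ together with $t^2 = \rho_2(t) + \rho_2(-t)$ reproduces multiplication exactly, while for $r = 1$ one approximates $t^2$ on $[0,1]$ by $t - \sum_{s=1}^m g_s(t)/2^{2s}$, where $g_s$ is the $s$-fold composition of a ReLU-representable hat function, with error at most $2^{-2m-2}$.

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)
repu2 = lambda t: relu(t) ** 2  # rho_2

def mult_repu2(x, y):
    """Exact multiplication from rho_2 units via polarization:
    t^2 = rho_2(t) + rho_2(-t), xy = ((x+y)^2 - x^2 - y^2)/2."""
    sq = lambda t: repu2(t) + repu2(-t)
    return 0.5 * (sq(x + y) - sq(x) - sq(y))

def g(t):
    """Hat function on [0, 1] as a ReLU combination."""
    return 2 * relu(t) - 4 * relu(t - 0.5) + 2 * relu(t - 1.0)

def sq_relu(t, m):
    """ReLU approximation of t^2 on [0, 1] with m sawtooth levels;
    the error is at most 2^(-2m-2)."""
    out, gs = t.copy(), t.copy()
    for s in range(1, m + 1):
        gs = g(gs)            # s-fold composition of the hat
        out -= gs / 4.0 ** s
    return out

t = np.linspace(0.0, 1.0, 10_001)
print(np.max(np.abs(mult_repu2(t, 1.0 - t) - t * (1.0 - t))))  # ~1e-16
for m in (2, 4, 6):
    err = np.max(np.abs(sq_relu(t, m) - t ** 2))
    print(m, err, 2.0 ** (-2 * m - 2))  # observed error vs. the bound
```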
This in turn implies RePU networks can efficiently reproduce or approximate piece-wise polynomials.

Theorem 2.2 (Piece-wise Polynomials [22]). Let $v : \mathbb{R} \to \mathbb{R}$ be a piece-wise polynomial with $N_v$ pieces, of maximum degree $t \in \mathbb{N}_{\geq 1}$ and with compact support of measure $S := |\mathrm{supp}(v)| < \infty$. Then, there exists a constant $C > 0$ depending on $N_v$, $t$ and $r$ such that
(i) for $r \geq 2$, there exists a RePU network $\Phi \in \mathrm{RePU}^{r,1,1}_C$ with $v = \mathrm{R}(\Phi)$,
(ii) for $r = 1$, the constant $C$ additionally depends on $S$ and $\|v\|_{L^\infty(\mathbb{R})}$, and for any $\varepsilon > 0$ there exists a ReLU network $\Phi \in \mathrm{ReLU}^{1,1}_n$ with $n := C \log(\varepsilon^{-1})$ and the same support as $v$, such that
\[ \| v - \mathrm{R}(\Phi) \|_{L^\infty(\mathbb{R})} \leq \varepsilon. \]
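In the piece-wise linear case ($t = 1$), the mechanism behind Theorem 2.2 is transparent and the ReLU representation is even exact: each kink contributes one ReLU whose coefficient is the jump in slope. A minimal Python sketch of this (our own illustration; the helper name `cpwl_to_relu` is hypothetical):

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)

def cpwl_to_relu(knots, values):
    """Represent a continuous piece-wise linear function with compact
    support (nodal values at the knots, zero outside) exactly as
    v(x) = sum_i c_i * relu(x - knots[i]), where c_i is the jump in
    slope at knot i."""
    x, y = np.asarray(knots, float), np.asarray(values, float)
    slopes = np.diff(y) / np.diff(x)
    slopes = np.concatenate(([0.0], slopes, [0.0]))  # zero slope outside
    c = np.diff(slopes)                              # slope jumps
    return lambda t: sum(ci * relu(t - xi) for ci, xi in zip(c, x))

# A "tent with a notch": 4 pieces, support [0, 4].
knots, values = [0, 1, 2, 3, 4], [0, 2, 0.5, 2, 0]
v = cpwl_to_relu(knots, values)
t = np.linspace(-1, 5, 1201)
exact = np.interp(t, knots, values, left=0.0, right=0.0)
print(np.max(np.abs(v(t) - exact)))  # ~1e-15: exact up to round-off
```

For higher degrees $t \geq 2$, a ReLU network can no longer be exact, which is where the logarithmic cost in (ii) enters.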
The previous result states that RePU networks with $r \geq 2$ can reproduce piece-wise polynomials exactly, while ReLU networks can only approximate them. This suggests the following saturation property.

Theorem 2.3 (Saturation Property [16]). For any $r \geq 2$, any $0 < p, q \leq \infty$ and $\alpha > 0$, and any $d_0, d_1 \in \mathbb{N}$, the approximation spaces defined in Section 1.2 coincide
\[ A^\alpha_q(L^p, \mathrm{RePU}^{2, d_0, d_1}) = A^\alpha_q(L^p, \mathrm{RePU}^{r, d_0, d_1}). \]
Note that the constants will, however, be affected by the degree.
The saturation property will also be clearly visible in the main result of this work in Theorem 4.5. We conclude by pointing out that RePU networks can efficiently reproduce affine systems, i.e., linear combinations of functions that are generated by dilating and shifting a single mother function or, in some cases, a finite number of mother functions. A prominent example of affine systems are wavelets, which will play an important role for the main result in Theorem 4.5. The reproduction of affine systems by NNs was studied in greater detail in [3]. In the following we only mention the properties relevant for this work.

Theorem 2.4 (NN Calculus [16]). For any $r \geq 1$, the following properties hold.
(i) For any $c \in \mathbb{R}$, $n \in \mathbb{N}$, $d_0, d_1 \in \mathbb{N}$ and any $\Phi_1 \in \mathrm{RePU}^{r, d_0, d_1}_n$, there exists $\Phi_2 \in \mathrm{RePU}^{r, d_0, d_1}_n$ with
\[ c \mathrm{R}(\Phi_1) = \mathrm{R}(\Phi_2). \]
(ii) For any $n_1, \ldots, n_N \in \mathbb{N}$, $d_0, d_1 \in \mathbb{N}$ and any $\Phi_1 \in \mathrm{RePU}^{r, d_0, d_1}_{n_1}, \ldots, \Phi_N \in \mathrm{RePU}^{r, d_0, d_1}_{n_N}$, set
\[ C := \min\{d_0, d_1\} \left( \max_i \mathrm{depth}(\Phi_i) - \min_i \mathrm{depth}(\Phi_i) \right). \]
Then, for $n := C + \sum_{i=1}^N n_i$, there exists $\Phi_\Sigma \in \mathrm{RePU}^{r, d_0, d_1}_n$ with
\[ \sum_{i=1}^N \mathrm{R}(\Phi_i) = \mathrm{R}(\Phi_\Sigma). \]
(iii) For any $n_1, \ldots, n_N \in \mathbb{N}$, $d, d_1, \ldots, d_N \in \mathbb{N}$ and any $\Phi_1 \in \mathrm{RePU}^{r, d, d_1}_{n_1}, \ldots, \Phi_N \in \mathrm{RePU}^{r, d, d_N}_{n_N}$, set $K := \sum_{i=1}^N d_i$ and
\[ C := \min\{d, K - 1\} \left( \max_i \mathrm{depth}(\Phi_i) - \min_i \mathrm{depth}(\Phi_i) \right). \]
Then, for $n := C + \sum_{i=1}^N n_i$, there exists $\Phi_\times \in \mathrm{RePU}^{r, d, K}_n$ with
\[ (\mathrm{R}(\Phi_1), \ldots, \mathrm{R}(\Phi_N)) = \mathrm{R}(\Phi_\times). \]
(iv) For any $n_1, n_2 \in \mathbb{N}$, $d_0, d_1, d_2 \in \mathbb{N}$, any $\Phi_1 \in \mathrm{RePU}^{r, d_0, d_1}_{n_1}$ and any $\Phi_2 \in \mathrm{RePU}^{r, d_1, d_2}_{n_2}$, there exists $\Phi \in \mathrm{RePU}^{r, d_0, d_2}_{n_1 + n_2}$ such that
\[ \mathrm{R}(\Phi_2) \circ \mathrm{R}(\Phi_1) = \mathrm{R}(\Phi). \]
(v) Let $D_{ab} : \mathbb{R}^d \to \mathbb{R}^d$ denote the affine transformation $x \mapsto ax - b$ for $a \in \mathbb{R}$, $b \in \mathbb{R}^d$. Then, for any $n, d \in \mathbb{N}$, any $\Phi_1 \in \mathrm{RePU}^{r, d, 1}_n$ and any $a \in \mathbb{R}$, $b \in \mathbb{R}^d$, there exists $\Phi_2 \in \mathrm{RePU}^{r, d, 1}_n$ with
\[ \mathrm{R}(\Phi_1) \circ D_{ab} = \mathrm{R}(\Phi_2). \]
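For equal-depth networks, properties (ii) and (iii) amount to simple block-matrix operations on the weights. The following self-contained Python sketch (our own illustration, ignoring the depth-padding term $C$) makes this explicit:

```python
import numpy as np

def realize(layers, x, r=1):
    """Realization of a ReLU/RePU network given as a list of (A_l, b_l)."""
    for l, (A, b) in enumerate(layers):
        x = A @ x + b
        if l < len(layers) - 1:
            x = np.maximum(0.0, x) ** r
    return x

def parallelize(lay1, lay2):
    """Property (iii) for two equal-depth networks sharing the input:
    block-diagonal weights realize x -> (R(Phi_1)(x), R(Phi_2)(x))."""
    assert len(lay1) == len(lay2)
    out = []
    for l, ((A1, b1), (A2, b2)) in enumerate(zip(lay1, lay2)):
        if l == 0:                       # shared input layer
            A = np.vstack([A1, A2])
        else:                            # block-diagonal otherwise
            A = np.block([[A1, np.zeros((A1.shape[0], A2.shape[1]))],
                          [np.zeros((A2.shape[0], A1.shape[1])), A2]])
        out.append((A, np.concatenate([b1, b2])))
    return out

def add(lay1, lay2):
    """Property (ii): parallelize, then merge the last affine layer so
    the outputs are summed; non-zero weight counts are additive."""
    out = parallelize(lay1, lay2)
    A, b = out[-1]
    d = lay1[-1][0].shape[0]             # common output dimension
    out[-1] = (A[:d] + A[d:], b[:d] + b[d:])
    return out

# Two single-hidden-layer ReLU nets realizing relu(x) and relu(x - 1):
f1 = [(np.array([[1.0]]), np.array([0.0])), (np.array([[1.0]]), np.array([0.0]))]
f2 = [(np.array([[1.0]]), np.array([-1.0])), (np.array([[1.0]]), np.array([0.0]))]
print(realize(add(f1, f2), np.array([2.0])))  # relu(2) + relu(1) = [3.]
```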
3. Besov Spaces and Wavelet Systems
In this section, we recall some classical results on (isotropic) Besov spaces and their characterization with wavelets. As in Section 2, we focus mostly on results relevant to our work. For more details we refer to, e.g., [4].

3.1. Besov Spaces.
Let $\Omega \subset \mathbb{R}^d$ be an open subset and $f \in L^p(\Omega)$ for $0 < p \leq \infty$. For $h \in \mathbb{R}^d$, let $\tau_h$ denote the translation operator $(\tau_h f)(x) := f(x + h)$, $I : \mathbb{R}^d \to \mathbb{R}^d$ the identity operator, and define the $m$-th difference
\[ \Delta^m_h := (\tau_h - I)^m := \underbrace{(\tau_h - I) \circ \cdots \circ (\tau_h - I)}_{m \text{ times}}, \quad m \in \mathbb{N}. \]
We use the notation
\[ \Delta^m_h(f, x, \Omega) := \begin{cases} (\Delta^m_h f)(x), & \text{if } x, x + h, \ldots, x + mh \in \Omega, \\ 0, & \text{otherwise}. \end{cases} \]
The modulus of smoothness of order $m$ is defined for any $t > 0$ by
\[ \omega_m(f, t, \Omega)_p := \sup_{|h| \leq t} \| \Delta^m_h(f, \cdot, \Omega) \|_{L^p(\Omega)}, \]
where $|h|$ denotes the standard Euclidean 2-norm. Finally, the Besov semi-norm is defined for any $0 < p, q \leq \infty$, any $\alpha > 0$ and $m := \lfloor \alpha \rfloor + 1$ by
\[ |f|_{B^\alpha_q(L^p(\Omega))} := \begin{cases} \left( \int_0^\infty [t^{-\alpha} \omega_m(f, t, \Omega)_p]^q \, \mathrm{d}t/t \right)^{1/q}, & 0 < q < \infty, \\ \sup_{t > 0} t^{-\alpha} \omega_m(f, t, \Omega)_p, & q = \infty. \end{cases} \]
Then, the (isotropic) Besov space is defined as
\[ B^\alpha_q(L^p(\Omega)) := \left\{ f \in L^p(\Omega) : |f|_{B^\alpha_q(L^p(\Omega))} < \infty \right\}, \]
and it is a (quasi-)Banach space equipped with the norm
\[ \| f \|_{B^\alpha_q(L^p(\Omega))} := \| f \|_{L^p(\Omega)} + |f|_{B^\alpha_q(L^p(\Omega))}. \]
The parameter $\alpha > 0$ indicates the order of smoothness, while $p$ reflects the measure of said smoothness. The secondary parameter $q$ is less important and merely provides a finer gradation of smoothness. A few relationships are rather straight-forward:
\[ B^{\alpha_1}_q(L^p(\Omega)) \hookrightarrow B^{\alpha_2}_q(L^p(\Omega)), \quad \alpha_1 \geq \alpha_2, \]
\[ B^\alpha_q(L^{p_1}(\Omega)) \hookrightarrow B^\alpha_q(L^{p_2}(\Omega)), \quad p_1 \geq p_2, \]
\[ B^\alpha_{q_1}(L^p(\Omega)) \hookrightarrow B^\alpha_{q_2}(L^p(\Omega)), \quad q_1 \leq q_2, \]
where $\hookrightarrow$ denotes a continuous embedding. For non-integer $\alpha > 0$ and $1 \leq p \leq \infty$, $B^\alpha_p(L^p(\Omega))$ is the fractional Sobolev space $W^{\alpha,p}(\Omega)$. For integer $\alpha > 0$, the Besov space $B^\alpha_\infty(L^p(\Omega))$ is slightly larger than $W^{\alpha,p}(\Omega)$. For $p = q = 2$, the Besov space $B^\alpha_2(L^2(\Omega))$ is the same as the Sobolev space $W^{\alpha,2}(\Omega)$.
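The modulus of smoothness is easy to probe numerically. The following self-contained Python sketch (our own illustration) estimates $\omega_2(f, t)_\infty$ for $f(x) = \sqrt{x}$ on $[0, 1]$ and reads off the rate $t^{1/2}$, consistent with $\sqrt{\cdot} \in B^{1/2}_\infty(L^\infty([0,1]))$:

```python
import numpy as np

def omega2_sup(f, t, a=0.0, b=1.0, n=20_000):
    """Second-order modulus of smoothness in the sup-norm on [a, b]:
    sup over |h| <= t (sampled) and admissible x of |Delta_h^2 f(x)|."""
    best = 0.0
    for h in np.linspace(t / 8, t, 8):           # a few step sizes up to t
        x = np.linspace(a, b - 2 * h, n)         # x, x+h, x+2h must lie in [a, b]
        d2 = f(x) - 2 * f(x + h) + f(x + 2 * h)  # second difference
        best = max(best, np.max(np.abs(d2)))
    return best

ts = np.array([2.0 ** -k for k in range(4, 12)])
w = np.array([omega2_sup(np.sqrt, t) for t in ts])
slope = np.polyfit(np.log(ts), np.log(w), 1)[0]
print(slope)  # ~0.5, i.e. omega_2(f, t) ~ t^(1/2)
```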
A less obvious property is given by the following embedding results.

Theorem 3.1 (Besov Embeddings [4]). Let $\Omega = \mathbb{R}^d$ or $\Omega \subset \mathbb{R}^d$ be a Lipschitz domain.
(i) For $0 < p < \infty$, $0 < q \leq p$, $\alpha > 0$ and $0 < \tau < p$ such that $1/\tau = \alpha/d + 1/p$, the following embedding holds
\[ B^\alpha_q(L^\tau(\Omega)) \hookrightarrow L^p(\Omega). \]
(ii) For $0 < \alpha_2 < \alpha_1$, $0 < p_1 < p_2 \leq \infty$ and $q_1 \leq q_2$, it holds
\[ B^{\alpha_1}_{q_1}(L^{p_1}(\Omega)) \hookrightarrow B^{\alpha_2}_{q_2}(L^{p_2}(\Omega)), \quad \text{if } \alpha_1 - \alpha_2 \geq d(1/p_1 - 1/p_2). \]
The Besov spaces in Theorem 3.1 (i) are on the critical embedding line (see Figure 1). Spaces above this line are embedded in $L^p$, spaces on this line may or may not be embedded in $L^p$, and spaces below this line are never embedded in $L^p$. In this sense, such Besov spaces are quite large, as the functions on this line barely have enough regularity to be members of $L^p$. It is well-known that optimal approximation of functions from such spaces with a continuous parameter selection can only be achieved by non-linear methods, see [9]. It is the main result of this work that RePU networks achieve optimal approximation for these spaces, while ReLU networks achieve near to optimal approximation.
To transfer results from $\mathbb{R}^d$ to bounded Lipschitz domains, we will use the common technique of extension operators.

Theorem 3.2 (Extension Operator [13, 19]). Let $\Omega \subset \mathbb{R}^d$ be a Lipschitz domain (the result is actually valid for more general domains, cf. $(\varepsilon, \delta)$-domains, see [13]). Then, for any $\alpha > 0$ and any $0 < p, q \leq \infty$, there exists a linear operator $\mathcal{E} : B^\alpha_q(L^p(\Omega)) \to B^\alpha_q(L^p(\mathbb{R}^d))$ such that
\[ \| f \|_{B^\alpha_q(L^p(\Omega))} \leq \| \mathcal{E} f \|_{B^\alpha_q(L^p(\mathbb{R}^d))} \leq C \| f \|_{B^\alpha_q(L^p(\Omega))}, \]
where $C$ depends only on $d$, $\alpha$, $p$ and the domain $\Omega$.
We conclude by noting that Besov spaces combine well with interpolation. To be precise, we briefly define interpolation spaces via the $K$-functional. Let $X$ be a quasi-normed space and $Y$ be a quasi-semi-normed space with $Y \hookrightarrow X$. The $K$-functional is defined for any $f \in X$ by
\[ K(f, t, X, Y) := \inf_{f = f_0 + f_1} \{ \| f_0 \|_X + t |f_1|_Y \}, \quad t > 0. \]
For $0 < \theta < 1$ and $0 < q \leq \infty$, define the quantity
\[ |f|_{(X,Y)_{\theta,q}} := \begin{cases} \left( \int_0^\infty [t^{-\theta} K(f, t, X, Y)]^q \, \mathrm{d}t/t \right)^{1/q}, & 0 < q < \infty, \\ \sup_{t > 0} t^{-\theta} K(f, t, X, Y), & q = \infty. \end{cases} \]
Then, the spaces
\[ (X, Y)_{\theta,q} := \left\{ f \in X : |f|_{(X,Y)_{\theta,q}} < \infty \right\}, \]
equipped with the (quasi-)norm
\[ \| f \|_{(X,Y)_{\theta,q}} := \| f \|_X + |f|_{(X,Y)_{\theta,q}}, \]
are interpolation spaces. Besov spaces provide a relatively complete description of interpolation spaces in the following sense: for $0 < \theta < 1$,
\[ (L^p(\Omega), W^\alpha(L^p(\Omega)))_{\theta,q} = B^{\theta\alpha}_q(L^p(\Omega)), \quad 1 \leq p \leq \infty,\ 0 < q \leq \infty, \]
\[ (B^{\alpha_1}_{q_1}(L^p(\Omega)), B^{\alpha_2}_{q_2}(L^p(\Omega)))_{\theta,q} = B^\alpha_q(L^p(\Omega)), \quad 0 < \alpha_1 < \alpha_2,\ \alpha := (1-\theta)\alpha_1 + \theta\alpha_2,\ 0 < p, q, q_1, q_2 \leq \infty, \]
\[ (L^p(\Omega), B^\alpha_{q_1}(L^p(\Omega)))_{\theta,q} = B^{\theta\alpha}_q(L^p(\Omega)), \quad 0 < p, q, q_1 \leq \infty. \]
For Besov spaces on the critical line with $1/\tau = \alpha/d + 1/p$, we obtain
\[ (L^p(\Omega), B^\alpha_\tau(L^\tau(\Omega)))_{\theta,q} = B^{\theta\alpha}_q(L^q(\Omega)), \quad \text{if } 1/q = \theta\alpha/d + 1/p. \]
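As a small worked example of the last identity (our own, for concreteness): take $d = 1$, $p = 2$ and $\alpha = 1$, so the critical line gives $1/\tau = 1 + 1/2$, i.e., $\tau = 2/3$. Then for $\theta = 1/2$,
\[
\frac{1}{q} = \frac{\theta\alpha}{d} + \frac{1}{p} = \frac{1}{2} + \frac{1}{2} = 1,
\qquad\text{so}\qquad
\big(L^2(\Omega),\, B^1_{2/3}(L^{2/3}(\Omega))\big)_{1/2,\,1} = B^{1/2}_1(L^1(\Omega)),
\]
and the resulting space $B^{1/2}_1(L^1(\Omega))$ again lies on the critical line for $L^2$-approximation, since $1/1 = (1/2)/1 + 1/2$. Interpolating between $L^p$ and a point on the critical line thus moves along that line, which is what allows the main result to cover the entire range at or above it.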
3.2. Wavelets.

There are many possible wavelet constructions satisfying different properties depending on the intended application. Said constructions can be rather technical, with the payoff being various favorable analytical and numerical features. We do not intend to cover this topic in-depth and once again only pick out the aspects required for this work. We proceed by briefly reviewing one-dimensional wavelet constructions, after which we turn to wavelets on $\mathbb{R}^d$. Our presentation is somewhat abstract and therefore flexible, but we will also be more specific with some aspects of the construction that we require in Section 4. For more details on the subject we refer to [4].
The starting point of a wavelet construction is typically a multi-resolution analysis (MRA), i.e., a sequence of closed subspaces $V_j \subset V_{j+1}$ of $L^2(\mathbb{R})$ that are nested, dilation- and shift-invariant, dense in $L^2(\mathbb{R})$ and are all generated by a single scaling function $\varphi \in V_0$ (multiple scaling functions are possible as well, in which case such functions are referred to as multi-wavelets, see [14]). To be more precise, we assume the system $\{ \varphi(\cdot - k) : k \in \mathbb{Z} \}$ is a Riesz basis of $V_0$ and therefore $\{ \varphi(2^j \cdot - k) : k \in \mathbb{Z} \}$ is a Riesz basis of $V_j$. We use the shorthand notation
\[ \varphi_{j,k} := 2^{j/2} \varphi(2^j \cdot - k), \]
where the pre-factor $2^{j/2}$ normalizes $\varphi_{j,k}$ in $L^2$. Later we will redefine this to $2^{j/p}$ for normalization in $L^p$ for any $0 < p \leq \infty$, with the convention $2^{j/\infty} = 1$.
Defining a projection $P_j : L^2(\mathbb{R}) \to V_j$ is rather simple if $\{ \varphi(\cdot - k) : k \in \mathbb{Z} \}$ forms an orthogonal basis of $V_0$. Indeed, this property implies that $\{ \varphi_{j,k} : k \in \mathbb{Z} \}$ forms an orthogonal basis of $V_j$, and $P_j$ can be chosen to be the orthogonal projection. However, for numerical reasons, it is sometimes unpractical to construct scaling functions $\varphi$ such that $\{ \varphi(\cdot - k) : k \in \mathbb{Z} \}$ forms an orthogonal basis of $V_0$, and without this property a constructive definition of $P_j$ is not straight-forward.
A way out are so-called bi-orthogonal constructions. A function $\tilde{\varphi} \in L^2(\mathbb{R})$ is dual to $\varphi$ if it satisfies
\[ \langle \varphi(\cdot - k), \tilde{\varphi}(\cdot - l) \rangle_{L^2} = \delta_{k,l}, \quad k, l \in \mathbb{Z}, \]
where $\delta_{k,l}$ is the Kronecker delta. We then define the oblique projection $P_j$ by
\[ P_j f := \sum_{k \in \mathbb{Z}} \langle f, \tilde{\varphi}_{j,k} \rangle_{L^2} \varphi_{j,k}. \tag{3.1} \]
A representation of a function in $V_j$ is typically referred to as a single-scale representation. To switch to a multi-scale representation, we need to characterize the so-called detail spaces defined through the projections $Q_j := P_{j+1} - P_j$, with the detail spaces defined as $W_j := Q_j(L^2(\mathbb{R}))$. This is achieved by constructing a wavelet $\psi \in V_1$,
\[ \psi := \sum_{k \in \mathbb{Z}} g_k \varphi(2 \cdot - k), \]
for some coefficients $g_k \in \mathbb{R}$ such that
\[ N_\psi := \#\{ k : g_k \neq 0 \} < \infty. \tag{3.2} \]
Any function $f \in L^2(\mathbb{R})$ can then be decomposed into a sequence of single-scale coefficients on the coarsest level and detail coefficients on all higher levels
\[ f = \sum_{k \in \mathbb{Z}} c_{0,k} \varphi_{0,k} + \sum_{j \geq 0} \sum_{k \in \mathbb{Z}} c_{j,k} \psi_{j,k}. \tag{3.3} \]
To simplify notation, one typically sets $\psi_{-1,k} := \varphi_{0,k}$ and introduces the index set
\[ \nabla := \{ (j, k) : j \in \mathbb{N}_{\geq -1},\ k \in \mathbb{Z} \}. \]
Decomposition (3.3) then simplifies to
\[ f = \sum_{\lambda \in \nabla} c_\lambda \psi_\lambda. \]
In order for the wavelets $\psi_\lambda$ to characterize Besov spaces, they have to satisfy certain assumptions.

Assumption 3.3 (Characterization). We assume the scaling function $\varphi$ and its dual $\tilde{\varphi}$ satisfy the following properties.
(W1) (Integrability) For some $p', p'' \in [1, \infty]$ such that $1/p' + 1/p'' = 1$, we assume $\varphi \in L^{p'}(\mathbb{R})$ and $\tilde{\varphi} \in L^{p''}(\mathbb{R})$.
(W2) (Polynomial Reproduction) We assume $\varphi$ satisfies Strang-Fix conditions of order $L \in \mathbb{N}$ or, equivalently, for any polynomial $P \in \mathcal{P}_{L-1}$ of degree $L - 1$, we have $P \in V_0$.
(W3) (Regularity) For some $s > 0$ and $0 < p, q \leq \infty$, we assume $\varphi \in B^s_q(L^p(\mathbb{R}))$.
These conditions are sufficient to ensure Besov spaces can be characterized by the decay of the wavelet coefficients. For our work we will require two additional conditions that are, however, easy to satisfy for a variety of wavelet families.
Assumption 3.4 (Piece-wise Polynomial). We additionally assume the scaling function $\varphi$ satisfies the following properties.
(A1) We assume $\varphi$ has compact support.
(A2) We assume $\varphi$ is piece-wise polynomial.
An example of a wavelet family satisfying all of the assumptions (W1)–(W3) and (A1)–(A2) are the CDF bi-orthogonal B-spline wavelets from [6]. These constructions allow to choose an arbitrary polynomial reproduction degree $L - 1$ and regularity order $s$, and the resulting scaling function $\varphi$ (and consequently $\psi$ as well) is a compactly supported spline of degree $L - 1$.
We now turn to wavelet constructions on $\mathbb{R}^d$. There are several possible approaches for this, but we describe a specific tensor product construction suitable for isotropic Besov spaces. We comment on anisotropic Besov spaces in Section 5.
For $x \in \mathbb{R}^d$, we define the tensor product scaling function as
\[ \phi(x) := \varphi(x_1) \cdots \varphi(x_d), \tag{3.4} \]
and in the same manner as before, but for a general $0 < p \leq \infty$,
\[ \phi_{j,k,p}(x) := 2^{dj/p} \phi(2^j x - k), \quad j \in \mathbb{Z},\ k \in \mathbb{Z}^d, \tag{3.5} \]
with the convention $2^{dj/\infty} = 1$. Next, for $e \in \{0,1\}^d \setminus \{0\}$, we define
\[ \psi^e(x) := \psi^{e_1}(x_1) \cdots \psi^{e_d}(x_d), \tag{3.6} \]
with the convention $\psi^1(x_i) := \psi(x_i)$ and $\psi^0(x_i) := \varphi(x_i)$, and $\psi^e_{j,k,p}$ is defined as in (3.5). Simplifying as before with
\[ \nabla := \{ (e, j, k) : e \in \{0,1\}^d \setminus \{0\},\ j \in \mathbb{N}_{\geq 0},\ k \in \mathbb{Z}^d \} \cup \{ (0, 0, k) : k \in \mathbb{Z}^d \}, \]
we obtain the $d$-dimensional wavelet system
\[ \Psi := \{ \psi_{\lambda,p} : \lambda \in \nabla \}, \tag{3.7} \]
where we also use the shorthand notation $|\lambda| := |(e, j, k)| := j$ for $e \neq 0$, and $|\lambda| := -1$ otherwise, together with the level sets
\[ \nabla_j := \{ \lambda = (e, j, k) \in \nabla : e \neq 0,\ |\lambda| = j \}, \quad j \geq 0, \qquad \nabla_{-1} := \{ \lambda = (0, 0, k) \in \nabla : k \in \mathbb{Z}^d \}. \]

Theorem 3.5 (Characterization [4]). Let $\varphi$ satisfy (W1) for some integrability parameters $p', p''$, (W2) for order $L$ and (W3) with smoothness order $s$ for primary parameter $0 < p \leq p'$ and any secondary parameter $0 < q \leq \infty$. Then, if $f = \sum_{\lambda \in \nabla} c_{\lambda,p} \psi_{\lambda,p}$ is the wavelet decomposition of $f$, for
\[ d(1/p - 1/p') < \alpha < \min\{s, L\}, \tag{3.8} \]
we have the norm equivalence
\[ |f|_{B^\alpha_q(L^p(\mathbb{R}^d))} \sim \begin{cases} \left( \sum_{j \geq -1} 2^{j\alpha q} \left( \sum_{\lambda \in \nabla_j} |c_{\lambda,p}|^p \right)^{q/p} \right)^{1/q}, & 0 < q < \infty, \\ \sup_{j \geq -1} 2^{j\alpha} \left( \sum_{\lambda \in \nabla_j} |c_{\lambda,p}|^p \right)^{1/p}, & q = \infty. \end{cases} \]
The above characterization implies optimal approximation rates for best $N$-term wavelet approximations.

Theorem 3.6 ($N$-term Approximation [4]). Let $1 < p < \infty$ and let $\Psi$ be a wavelet system satisfying the assumptions of Theorem 3.5. Define the set of $N$-term wavelet expansions as
\[ W_N := \left\{ \sum_{\lambda \in \Lambda} c_{\lambda,p} \psi_{\lambda,p} : \Lambda \subset \nabla,\ \#\Lambda \leq N \right\}. \]
Then, for $\alpha > 0$, $1/\tau = \alpha/d + 1/p$ and any $f \in B^\alpha_\tau(L^\tau(\mathbb{R}^d))$, it holds
\[ E(f, W_N)_p \lesssim N^{-\alpha/d} |f|_{B^\alpha_\tau(L^\tau(\mathbb{R}^d))}. \]
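The CDF family of Assumption 3.4 and the $N$-term mechanism of Theorem 3.6 can be tried out with PyWavelets, which ships the CDF bi-orthogonal B-spline wavelets as `'biorX.Y'`. The following sketch is our own heuristic illustration: it uses the discrete transform in the $L^2$ setting rather than the $L^p$-normalized continuous system above, and keeps the $N$ largest coefficients of a function with a jump, where nonlinear approximation is markedly better than keeping coarse levels only.

```python
import numpy as np
import pywt  # pip install PyWavelets

# A piece-wise smooth function with a jump: it has low classical
# smoothness but lies in Besov spaces with small tau (large alpha).
x = np.linspace(0.0, 1.0, 2**12, endpoint=False)
f = np.sin(2 * np.pi * x) + (x > 0.37)

coeffs = pywt.wavedec(f, 'bior2.2', level=8)   # CDF 2,2 wavelet
flat, slices = pywt.coeffs_to_array(coeffs)    # flatten for thresholding

for N in (16, 64, 256):
    keep = np.zeros_like(flat)
    idx = np.argsort(np.abs(flat))[-N:]        # N largest coefficients
    keep[idx] = flat[idx]
    cN = pywt.array_to_coeffs(keep, slices, output_format='wavedec')
    fN = pywt.waverec(cN, 'bior2.2')
    err = np.sqrt(np.mean((f - fN[:len(f)]) ** 2))  # discrete L2 error
    print(N, err)  # errors drop quickly with N despite the jump
```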
Remark 3.7. For $p = \tau = \infty$ the corresponding Besov space is the space of Hölder continuous functions and we refer to [11, 25]. The restriction $p > 1$ stems from (3.8), which in turn is based on the fact that the oblique projector defined in (3.1) is, in general, not $L^p$-stable for $p < 1$. More on this in Section 5.

4. Optimal ReLU Approximation of Smoothness Classes
With the results from Section 2 and Section 3 we have all the tools necessary to derive approximation rates for arbitrary Besov functions. As was reviewed in Section 3, Besov spaces can be characterized by the decay of the wavelet coefficients, and $N$-term approximations achieve optimal approximation rates for Besov functions.
In this section, we show that a RePU network (of bounded depth, depending on the smoothness order and the polynomial degree of the activation function) can reproduce an $N$-term wavelet expansion with $\mathcal{O}(N)$ complexity. More importantly, we also show that a ReLU network (of depth depending logarithmically on $\varepsilon^{-1}$) can approximate an $N$-term wavelet expansion with $\mathcal{O}(N \log(\varepsilon^{-1}))$ complexity, where $\varepsilon > 0$ is the target accuracy.

Lemma 4.1 (Wavelet Complexity). Let $\varphi : \mathbb{R} \to \mathbb{R}$ be a scaling function satisfying (A1)–(A2), i.e., a compactly supported piece-wise polynomial scaling function with polynomial reproduction order $L \in \mathbb{N}$. Then,
(i) for $r \geq 2$, there exists a constant $C > 0$, depending only on the number of polynomial pieces (and thus on $N_\psi$ from (3.2)), order $L$ and $r$, such that there exist RePU networks $\Phi_\varphi, \Phi_\psi \in \mathrm{RePU}^{r,1,1}_C$ with
\[ \varphi = \mathrm{R}(\Phi_\varphi), \qquad \psi = \mathrm{R}(\Phi_\psi), \]
(ii) for $r = 1$ and $0 < p \leq \infty$, there exists a constant $C > 0$ as above that additionally depends on $|\mathrm{supp}(\varphi)|$, $|\mathrm{supp}(\psi)|$, $\|\varphi\|_{L^\infty(\mathbb{R})}$ and $\|\psi\|_{L^\infty(\mathbb{R})}$, such that for any $\varepsilon > 0$ there exist ReLU networks $\Phi^\varepsilon_\varphi, \Phi^\varepsilon_\psi \in \mathrm{ReLU}^{1,1}_{C(1 + \log(\varepsilon^{-1}))}$ with the same supports as $\varphi$ and $\psi$, respectively, and
\[ \left\| \varphi - \mathrm{R}(\Phi^\varepsilon_\varphi) \right\|_{L^p(\mathbb{R})} \leq \varepsilon, \qquad \left\| \psi - \mathrm{R}(\Phi^\varepsilon_\psi) \right\|_{L^p(\mathbb{R})} \leq \varepsilon. \]
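For the lowest-order spline family the scaling function is the piece-wise linear hat, so the networks of Lemma 4.1 (i) can even be written down exactly with $r = 1$. The following Python sketch (our own illustration) builds $\varphi$ and a wavelet of the form (3.2) directly from ReLUs; the coefficient choice $g = (-1/2, 1, -1/2)$ is an illustrative one with a single vanishing moment, not the CDF coefficients used in the paper.

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)

def hat(x):
    """Piece-wise linear scaling function (hat on [0, 2]) written
    exactly as a ReLU network: one ReLU per kink."""
    return relu(x) - 2 * relu(x - 1.0) + relu(x - 2.0)

def wavelet(x, g=(-0.5, 1.0, -0.5)):
    """psi(x) = sum_k g_k * phi(2x - k) as in (3.2); by Theorem 2.4
    this sum of dilated/shifted ReLU networks is again a ReLU network.
    The coefficients g are a hypothetical illustrative choice."""
    return sum(gk * hat(2.0 * x - k) for k, gk in enumerate(g))

x = np.linspace(-1.0, 3.0, 2001)
psi = wavelet(x)
print(np.trapz(psi, x))       # ~0: one vanishing moment
print(psi.min(), psi.max())   # a compactly supported wiggle on [0, 2]
```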
Next, using Theorem 2.1 and Lemma 4.1, the above can be extended to the multi-dimensional setting.

Lemma 4.2 (Tensor Product Wavelet Complexity). Let $\varphi : \mathbb{R} \to \mathbb{R}$ be as in Lemma 4.1, satisfying (A1)–(A2), and let $\phi$, $\psi^e$ be the tensor product wavelets defined in (3.5) and (3.6). Then,
(i) for $r \geq 2$, there exists a constant $C > 0$ as in Lemma 4.1 (i), depending additionally on $d$, such that there exist networks $\Phi_\phi, \Phi_{\psi^e} \in \mathrm{RePU}^{r,d,1}_C$ with
\[ \phi = \mathrm{R}(\Phi_\phi), \qquad \psi^e = \mathrm{R}(\Phi_{\psi^e}), \]
for any $e \in \{0,1\}^d \setminus \{0\}$.
(ii) For $r = 1$ and $0 < p \leq \infty$, there exists a constant $C > 0$ as in Lemma 4.1 (ii), depending additionally on $d$, such that for any $0 < \varepsilon < 1$ there exist networks $\Phi^\varepsilon_\phi, \Phi^\varepsilon_{\psi^e} \in \mathrm{ReLU}^{d,1}_{C(1 + \log(\varepsilon^{-1}))}$ with
\[ \left\| \phi - \mathrm{R}(\Phi^\varepsilon_\phi) \right\|_{L^p(\mathbb{R}^d)} \leq \varepsilon, \qquad \left\| \psi^e - \mathrm{R}(\Phi^\varepsilon_{\psi^e}) \right\|_{L^p(\mathbb{R}^d)} \leq \varepsilon, \]
for any $e \in \{0,1\}^d \setminus \{0\}$.

Proof.
Part (i) follows from Theorem 2.1 (i), Theorem 2.4 (iii)–(iv) and Lemma 4.1 (i). Part (ii) follows from Theorem 2.1 (ii), Theorem 2.4 (iii)–(iv) and Lemma 4.1 (ii), which we briefly demonstrate for $\phi$. The proof for $\psi^e$ is analogous. In the following we will frequently use the triangle inequality, i.e., assuming $p \geq 1$. For $0 < p < 1$, $\|\cdot\|_{L^p(\mathbb{R}^d)}$ is only a quasi-norm, i.e., the right-hand side of the triangle inequality is to be multiplied by a constant, and the corresponding complexities are to be adjusted accordingly.
Let $\phi = \varphi \otimes \cdots \otimes \varphi$. We approximate each $\varphi$ as in Lemma 4.1 (ii) by networks $\Phi^\delta_\varphi \in \mathrm{ReLU}^{1,1}_{C(1 + \log(\delta^{-1}))}$ with some accuracy $\delta > 0$, and combine them into $\Phi^\delta_\times \in \mathrm{ReLU}^{d,d}_{n_\delta}$ as in Theorem 2.4 (iii) with $n_\delta := dC(1 + \log(\delta^{-1}))$. Finally, for some different accuracy $\eta > 0$, we construct an approximate multiplication network $\Phi^\eta_M \in \mathrm{ReLU}^{d,1}_{n_\eta}$ with $n_\eta := Cd \log(d K^d \eta^{-1})$ as in Theorem 2.1 (ii), where $K := \|\varphi\|_{L^\infty(\mathbb{R})} + \delta$. This choice of $K$ is justified by
\[ \left\| \mathrm{R}(\Phi^\delta_\varphi) \right\|_{L^\infty(\mathbb{R}^d)} \leq \delta + \|\varphi\|_{L^\infty(\mathbb{R})} = K. \tag{4.1} \]
Our final approximation $\Phi^\varepsilon_\phi \in \mathrm{ReLU}^{d,1}_{n_{\delta,\eta}}$ is defined by
\[ \mathrm{R}(\Phi^\varepsilon_\phi) := \mathrm{R}(\Phi^\eta_M) \circ \mathrm{R}(\Phi^\delta_\times), \]
where, according to Theorem 2.4 (iv), $n_{\delta,\eta} = dC(1 + \log(\delta^{-1})) + Cd \log(d K^d \eta^{-1})$.
We now estimate the resulting error, from which it will be clear how to choose $\delta, \eta > 0$ and thus $n_{\delta,\eta}$. We introduce the auxiliary approximation $\mathrm{R}(\tilde{\Phi}) := M_d \circ \mathrm{R}(\Phi^\delta_\times)$ and the notation $\varphi^\delta := \mathrm{R}(\Phi^\delta_\varphi)$. Then,
\[ \left\| \phi - \mathrm{R}(\Phi^\varepsilon_\phi) \right\|_{L^p(\mathbb{R}^d)} \leq \left\| \phi - \mathrm{R}(\tilde{\Phi}) \right\|_{L^p(\mathbb{R}^d)} + \left\| \mathrm{R}(\tilde{\Phi}) - \mathrm{R}(\Phi^\varepsilon_\phi) \right\|_{L^p(\mathbb{R}^d)}. \]
With $S := |\mathrm{supp}(\varphi)|$, for the second term we apply Theorem 2.1 (ii) and obtain
\[ \left\| \mathrm{R}(\tilde{\Phi}) - \mathrm{R}(\Phi^\varepsilon_\phi) \right\|_{L^p(\mathbb{R}^d)} \leq \eta S^{d/p}. \tag{4.2} \]
For the first term, we can write
\[ \phi - \mathrm{R}(\tilde{\Phi}) = (\varphi - \varphi^\delta) \otimes \varphi \otimes \cdots \otimes \varphi + \varphi^\delta \otimes (\varphi - \varphi^\delta) \otimes \varphi \otimes \cdots \otimes \varphi + \ldots + \varphi^\delta \otimes \cdots \otimes \varphi^\delta \otimes (\varphi - \varphi^\delta). \]
Thus, by a triangle inequality,
\[ \left\| \phi - \mathrm{R}(\tilde{\Phi}) \right\|_{L^p(\mathbb{R}^d)} \leq d K^{d-1} \delta, \tag{4.3} \]
where $K$ is defined in (4.1). From (4.2) and (4.3), we set $\eta := S^{-d/p} \varepsilon / 2$ and $\delta := K^{1-d} \varepsilon / (2d)$ and the statement follows. □

Since affine transformations can be efficiently implemented with NNs (see Theorem 2.4 (v)), the results of the previous lemma carry over to any wavelet $\psi_\lambda$.

Lemma 4.3 (Wavelet System Complexity). Let $\Psi$ be a wavelet system as defined in (3.7), with one-dimensional scaling function $\varphi : \mathbb{R} \to \mathbb{R}$ as in Lemma 4.1 satisfying (A1)–(A2). Then,
(i) for $r \geq 2$, there exists a constant $C > 0$, with dependencies as in Lemma 4.2 (i), such that for any $\psi_\lambda \in \Psi$ there exists a RePU network $\Phi_\lambda \in \mathrm{RePU}^{r,d,1}_C$ with
\[ \psi_\lambda = \mathrm{R}(\Phi_\lambda). \]
(ii) For $r = 1$ and $0 < p \leq \infty$, there exists a constant $C > 0$, with dependencies as in Lemma 4.2 (ii), such that for any $\psi_\lambda$ and any $0 < \varepsilon < 1$, there exists a ReLU network $\Phi^\varepsilon_\lambda \in \mathrm{ReLU}^{d,1}_{C(1 + \log(\varepsilon^{-1}))}$ with
\[ \| \psi_\lambda - \mathrm{R}(\Phi^\varepsilon_\lambda) \|_{L^p(\mathbb{R}^d)} \leq \varepsilon. \]

Proof.
From Lemma 4.2, we have $\Phi_{\psi^e} \in \mathrm{RePU}^{r,d,1}_C$ and we define $\Phi_\lambda \in \mathrm{RePU}^{r,d,1}_C$ using Theorem 2.4 (i) and (v) such that, for $\lambda = (e, j, k)$,
\[ \mathrm{R}(\Phi_\lambda) = 2^{dj/p} \mathrm{R}(\Phi_{\psi^e}) \circ D_{jk}. \]
Similarly we define $\Phi^\varepsilon_\lambda \in \mathrm{ReLU}^{d,1}_{C(1 + \log(\varepsilon^{-1}))}$. Note additionally that the approximation error bound for $\Phi^\varepsilon_\lambda$ remains unchanged, since the $L^p$ scaling of the dilation in $D_{jk}$ exactly cancels the normalization constant $2^{dj/p}$. □

With this we finally turn to direct estimates for networks.
Lemma 4.4 (Direct Estimates RePU/ReLU). Let $\Omega = \mathbb{R}^d$ or $\Omega \subset \mathbb{R}^d$ a Lipschitz domain (or, more generally, an $(\varepsilon, \delta)$-domain).
(i) For $r \geq 2$, $1 < p < \infty$ and any $f \in B^\alpha_q(L^\tau(\Omega))$ with $\alpha, \tau, q > 0$ and
\[ \alpha/d \geq 1/\tau - 1/p, \qquad q \leq (\alpha/d + 1/p)^{-1}, \]
it holds
\[ E(f, \mathrm{RePU}^{r,d,1}_n)_p \lesssim n^{-\alpha/d} |f|_{B^\alpha_q(L^\tau(\Omega))}. \]
(ii) For $r = 1$, $1 < p < \infty$ and any $f \in B^\alpha_q(L^\tau(\Omega))$ with $\alpha, \tau, q > 0$ and
\[ \alpha/d > 1/\tau - 1/p, \qquad q \leq (\alpha/d + 1/p)^{-1}, \]
it holds
\[ E(f, \mathrm{ReLU}^{d,1}_n)_p \lesssim n^{-\bar{\alpha}/d} |f|_{B^\alpha_q(L^\tau(\Omega))}, \]
for any $0 < \bar{\alpha} < \alpha$.

Proof. Consider a wavelet system $\Psi$ that satisfies (A1)–(A2) and (W1)–(W3) for parameters $\bar{p}$ to be defined in (4.4), some $p'$, $s$ and $L$ such that the assumptions of Theorem 3.5 are satisfied for some $\bar{\alpha} > \alpha_0 := d(1/\bar{p} - 1/p)$. Such a wavelet system can be constructed for arbitrary smoothness order $\bar{\alpha}$ using, e.g., the bi-orthogonal wavelets from [6].
Part (i) follows from Lemma 4.3, Theorem 3.6 and Theorem 2.4 (ii). For part (ii), first let $\Omega = \mathbb{R}^d$ and let $f_N := \sum_{\lambda \in \Lambda_N} c_{\lambda,p}(f) \psi_{\lambda,p}$ with $\#\Lambda_N \leq N$ be an $N$-term wavelet approximation of $f$ using the wavelet system $\Psi$ as mentioned above. For some $\varepsilon > 0$, let $\Phi^\varepsilon_\lambda \in \mathrm{ReLU}^{d,1}_{C(1 + \log(\varepsilon^{-1}))}$ be the ReLU $\varepsilon$-approximation of $\psi_\lambda$ from Lemma 4.3, and let $\Phi_N \in \mathrm{ReLU}^{d,1}_{CN(1 + \log(\varepsilon^{-1}))}$ be the sum network from Theorem 2.4 (ii). Then,
\[ \| f_N - \mathrm{R}(\Phi_N) \|_{L^p(\mathbb{R}^d)} \leq \varepsilon \sum_{\lambda \in \Lambda_N} |c_{\lambda,p}(f)|. \]
The sum of the coefficients is bounded by a Besov semi-norm of $f$, as we show next. Define
\[ \bar{p} := (\alpha/d - 1/\tau + 2/p)^{-1}. \tag{4.4} \]
Since we assumed $\alpha/d > 1/\tau - 1/p$, we have $0 < \bar{p} < p$. Next, define
\[ q_0 := \begin{cases} (1 - 1/\bar{p})^{-1}, & \text{if } \bar{p} > 1, \\ \infty, & \text{otherwise}. \end{cases} \]
Then, we can estimate the sum as
\[ \sum_{\lambda \in \Lambda_N} |c_{\lambda,p}(f)| \leq \left( \sum_{\lambda \in \Lambda_N} |c_{\lambda,p}(f)|^{\bar{p}} \right)^{1/\bar{p}} N^{1/q_0}, \]
with the convention $N^{1/\infty} = 1$. For $\bar{\alpha} > \alpha_0 := d(1/\bar{p} - 1/p)$ and by the characterization from Theorem 3.5, we obtain
\[ \left( \sum_{\lambda \in \Lambda_N} |c_{\lambda,p}(f)|^{\bar{p}} \right)^{1/\bar{p}} \lesssim |f_N|_{B^{\bar{\alpha}}_{\bar{p}}(L^{\bar{p}}(\mathbb{R}^d))} \lesssim |f|_{B^{\bar{\alpha}}_{\bar{p}}(L^{\bar{p}}(\mathbb{R}^d))}. \]
By the Besov embeddings from Theorem 3.1 (ii) and the choice of $\bar{p}$ from (4.4), we have $B^\alpha_q(L^\tau(\mathbb{R}^d)) \hookrightarrow B^{\bar{\alpha}}_{\bar{p}}(L^{\bar{p}}(\mathbb{R}^d))$.
Thus, we set $\varepsilon := N^{-\alpha/d - 1/q_0}$ and obtain
\[ \| f - \mathrm{R}(\Phi_N) \|_{L^p(\mathbb{R}^d)} \lesssim N^{-\alpha/d} |f|_{B^\alpha_q(L^\tau(\mathbb{R}^d))}. \]
The complexity of this network can be bounded by
\[ n := C(1 + \alpha/d + 1/q_0) N \log(N) \lesssim N^{1+\delta}, \]
for any $\delta > 0$, or, equivalently,
\[ \| f - \mathrm{R}(\Phi_N) \|_{L^p(\mathbb{R}^d)} \lesssim n^{-\bar{\alpha}/d} |f|_{B^\alpha_q(L^\tau(\mathbb{R}^d))}, \]
for any $0 < \bar{\alpha} < \alpha$. This shows the statement for $\Omega = \mathbb{R}^d$.
For $\Omega \subset \mathbb{R}^d$ a Lipschitz domain, we use the extension operator from Theorem 3.2 to obtain for any $f \in B^\alpha_q(L^\tau(\Omega))$
\[ E(f, \mathrm{ReLU}^{d,1}_n)_{L^p(\Omega)} \leq E(\mathcal{E} f, \mathrm{ReLU}^{d,1}_n)_{L^p(\mathbb{R}^d)} \lesssim n^{-\bar{\alpha}/d} |\mathcal{E} f|_{B^\alpha_q(L^\tau(\mathbb{R}^d))} \lesssim n^{-\bar{\alpha}/d} \| f \|_{B^\alpha_q(L^\tau(\Omega))}. \] □
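The only loss in part (ii) is the logarithmic factor in the complexity $n \sim N \log N$. A short Python computation (our own illustration, with arbitrary constants) shows how this turns the optimal rate $N^{-\alpha/d}$ into $n^{-\bar{\alpha}/d}$ with $\bar{\alpha}$ arbitrarily close to $\alpha$:

```python
import numpy as np

alpha_over_d = 1.5                 # target rate alpha/d (illustrative)
N = np.logspace(2, 7, 30)          # number of wavelet terms
error = N ** -alpha_over_d         # optimal N-term error, Theorem 3.6
n = N * np.log(N)                  # network complexity, n ~ N log N

# Fitted slope of log(error) vs. log(n): the effective rate in n.
slope = -np.polyfit(np.log(n), np.log(error), 1)[0]
print(slope)  # slightly below 1.5: the log factor only shaves an
              # arbitrarily small delta off the exponent as N grows
```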
Finally, the direct estimates above immediately imply the main result of this work.

Theorem 4.5 (Direct Embeddings). Let $\Omega = \mathbb{R}^d$ or $\Omega \subset \mathbb{R}^d$ a Lipschitz domain.
(i) For $r \geq 2$, $1 < p < \infty$ and any Besov space $B^\alpha_q(L^\tau(\Omega))$ with $\alpha, \tau, q > 0$ such that
\[ \alpha/d \geq 1/\tau - 1/p, \qquad q \leq (\alpha/d + 1/p)^{-1}, \]
the following embeddings hold
\[ B^\alpha_q(L^\tau(\Omega)) \hookrightarrow A^{\alpha/d}_\infty(L^p(\Omega), \mathrm{RePU}^{r,d,1}), \]
\[ (L^p(\Omega), B^\alpha_q(L^\tau(\Omega)))_{\theta/\alpha, \bar{q}} \hookrightarrow A^{\theta/d}_{\bar{q}}(L^p(\Omega), \mathrm{RePU}^{r,d,1}), \]
for $0 < \theta < \alpha$, $0 < \bar{q} \leq \infty$.
(ii) For $r = 1$, $1 < p < \infty$ and any Besov space $B^\alpha_q(L^\tau(\Omega))$ with $\alpha, \tau, q > 0$ such that
\[ \alpha/d > 1/\tau - 1/p, \qquad q \leq (\alpha/d + 1/p)^{-1}, \]
the following embeddings hold
\[ B^\alpha_q(L^\tau(\Omega)) \hookrightarrow A^{\bar{\alpha}/d}_\infty(L^p(\Omega), \mathrm{ReLU}^{d,1}), \]
\[ (L^p(\Omega), B^\alpha_q(L^\tau(\Omega)))_{\theta/\bar{\alpha}, \bar{q}} \hookrightarrow A^{\theta/d}_{\bar{q}}(L^p(\Omega), \mathrm{ReLU}^{d,1}), \]
for $0 < \theta < \bar{\alpha}$, $0 < \bar{q} \leq \infty$ and any $0 < \bar{\alpha} < \alpha$.
5. Concluding Remarks

(i) We emphasize that the main result Theorem 4.5 covers the entire range above the critical embedding line (see Figure 1). In particular, it states that deep ReLU networks can approximate any Besov class strictly above the embedding line with near to optimal complexity.
(ii) The results can be extended to Sobolev error measures, i.e., measuring the error in the derivatives as well. See [4, Remark 4.3.3] and [22].
(iii) The above results concern isotropic Besov spaces, i.e., when measuring smoothness, we do not make a distinction between coordinate directions. For anisotropic Besov spaces and corresponding wavelet characterizations we refer to [18].
(iv) For the case $p = \infty$ and $B^\alpha_\infty(L^\infty(\Omega)) = C^\alpha(\Omega)$, we refer to [11, 25]. The limitation concerning $p > 1$ stems from the fact that the oblique projector defined in (3.1) is, in general, not $L^p$-stable for $p < 1$.
One way to circumvent this is to replace the $L^p$ space with the Hardy space $H^p$, see [20]. Another way is described next.
(v) Upon completion of this work we became aware of a similar result in [26]. There the author states that any function in $B^\alpha_\tau(L^\tau([0,1]^d))$ for $\alpha/d > 1/\tau - 1/p$ can be approximated with a ReLU network in near to optimal complexity. Unlike in the proof of Lemma 4.4, where we used a wavelet Riesz basis construction, in [26] the author uses highly redundant B-spline frames combined with the characterization from [12] and an adaptive sampling strategy from [15]. In fact, best $N$-term approximation rates for Besov spaces $B^\alpha_\tau(L^\tau(\mathbb{R}^d))$ were obtained much earlier in [10], where the authors once again use highly redundant B-spline frames and best $N$-term approximations therein.
A representation of a function $f \in B^\alpha_\tau(L^\tau([0,1]^d))$ in such a frame is not unique and, in general, not stable. But there exists a particular stable representation based on quasi-interpolants of piece-wise polynomial near best approximations as constructed in [12] and used in [26]. Combined with the adaptive sampling strategy from [15], such a construction possesses somewhat similar features to the tree approximations from [5]. Hence, the condition $\alpha/d > 1/\tau - 1/p$ appears in [5, 15, 26]. In contrast, the results of [10] hold for $\alpha/d = 1/\tau - 1/p$.
The advantage of using such highly redundant frames as in [10, 12, 26] is that the result of Lemma 4.4 (ii) is valid for the range $0 < p \leq \infty$, since the quasi-interpolants are $L^p$-stable for any $0 < p \leq \infty$.
The result of [26] is stated only for $\Omega = [0,1]^d$. However, we believe this can be extended to $\Omega = \mathbb{R}^d$. We also note that in [26] the author shows that functions in Besov spaces of mixed smoothness can be approximated with a "milder" curse of dimensionality: the exponential dependence on $d$ enters the complexity through log-factors such as $(\log(\varepsilon^{-1}))^d$; the same was shown in [21] for Korobov spaces of mixed smoothness and deep ReLU networks.
We also note that in [24] the author uses B-splines to characterize Besov spaces and estimate best $N$-term approximation rates as well. However, that construction differs from [10, 12] and, in particular, the characterization and subsequent $N$-term approximation is performed for modified Besov spaces (essentially Besov spaces of anisotropic smoothness). These coincide with standard Besov spaces in case, once again, (3.8) is satisfied.
Overall, using either wavelets as in this work or B-spline quasi-interpolants as in [26] provides two alternatives for proving Lemma 4.4 (ii) – neither is inherently more difficult than the other, with the caveat that for $p \leq 1$ only the frame-based approach applies.

References
[1]
Ali, M., and Nouy, A.
Approximation with Tensor Networks. Part I: Approximation Spaces. arXiv e-prints (June 2020), arXiv:2007.00118.
[2]
Ali, M., and Nouy, A.
Approximation with Tensor Networks. Part II: Approximation Rates for Smoothness Classes. arXiv e-prints (June 2020), arXiv:2007.00128.[3]
Bölcskei, H., Grohs, P., Kutyniok, G., and Petersen, P.
Optimal Approximation with Sparsely Connected Deep Neural Networks.
SIAM Journal on Mathematics of Data Science 1 , 1 (2019), 8–45.[4]
Cohen, A.
Numerical Analysis of Wavelet Methods . Elsevier, Amsterdam Boston, 2003.[5]
Cohen, A., Dahmen, W., Daubechies, I., and DeVore, R.
Tree Approximation and Optimal Encoding.
Applied and Computational Harmonic Analysis 11, 2 (2001), 192–226.
[6]
Cohen, A., Daubechies, I., and Feauveau, J.-C.
Biorthogonal Bases of Compactly Supported Wavelets.
Communications on Pure and Applied Mathematics 45, 5 (1992), 485–560.
[7]
Daubechies, I., DeVore, R., Foucart, S., Hanin, B., and Petrova, G.
Nonlinear Approximation and (Deep) ReLU Networks. arXiv e-prints (May 2019), arXiv:1905.02199.
[8]
DeVore, R. A.
Nonlinear Approximation.
Acta Numerica 7 (1998), 51–150.[9]
DeVore, R. A., Howard, R., and Micchelli, C.
Optimal Nonlinear Approximation.
Manuscripta mathematica63 , 4 (1989), 469–478.[10]
DeVore, R. A., Jawerth, B., and Popov, V.
Compression of Wavelet Decompositions.
American Journal of Mathematics 114, 4 (1992), 737–785.
[11]
DeVore, R. A., Petrushev, P., and Yu, X. M.
Nonlinear Wavelet Approximation in the Space $C(\mathbb{R}^d)$. In Progress in Approximation Theory (New York, NY, 1992), A. A. Gonchar and E. B. Saff, Eds., Springer New York, pp. 261–283.
[12]
DeVore, R. A., and Popov, V. A.
Interpolation of Besov Spaces.
Transactions of the American Mathematical Society 305, 1 (1988), 397–414.
[13]
DeVore, R. A., and Sharpley, R. C.
Besov Spaces on Domains in $\mathbb{R}^d$. Transactions of the American Mathematical Society 335, 2 (1993), 843–864.
[14]
Donovan, G. C., Geronimo, J. S., and Hardin, D. P.
Orthogonal Polynomials and the Construction of Piecewise Polynomial Smooth Wavelets.
SIAM Journal on Mathematical Analysis 30 , 5 (1999), 1029–1056.[15]
Dũng, D.
Optimal Adaptive Sampling Recovery.
Advances in Computational Mathematics 34 , 1 (2011), 1–41.[16]
Gribonval, R., Kutyniok, G., Nielsen, M., and Voigtlaender, F.
Approximation Spaces of Deep Neural Networks. arXiv e-prints (May 2019), arXiv:1905.01208.
[17]
Gühring, I., Raslan, M., and Kutyniok, G.
Expressivity of Deep Neural Networks. arXiv e-prints (July 2020), arXiv:2007.04759.
[18]
Hochmuth, R.
Wavelet Characterizations for Anisotropic Besov Spaces.
Applied and Computational Harmonic Analysis 12, 2 (2002), 179–208.
[19]
Johnen, H., and Scherer, K.
On the Equivalence of the K-functional and Moduli of Continuity and Some Applications. In
Constructive Theory of Functions of Several Variables (Berlin, Heidelberg, 1977), W. Schempp and K. Zeller, Eds., Springer Berlin Heidelberg, pp. 119–140.
[20]
Kyriazis, G. C.
Wavelet Coefficients Measuring Smoothness in $H^p(\mathbb{R}^d)$. Applied and Computational Harmonic Analysis 3 (1996), 100–119.
[21]
Montanelli, H., and Du, Q.
New Error Bounds for Deep ReLU Networks Using Sparse Grids.
SIAM Journal onMathematics of Data Science 1 , 1 (2019), 78–92.[22]
Opschoor, J. A. A., Petersen, P. C., and Schwab, C.
Deep ReLU Networks and High-order Finite Element Methods.
Analysis and Applications (2020).[23]
Opschoor, J. A. A., Schwab, C., and Zech, J.
Exponential ReLU DNN Expression of Holomorphic Maps in High Dimension. Tech. Rep. 2019-35, Seminar for Applied Mathematics, ETH Zürich, Switzerland, 2019.
[24]
Oswald, P.
On the Degree of Nonlinear Spline Approximation in Besov-Sobolev Spaces.
Journal of Approximation Theory 61, 2 (1990), 131–157.
[25]
Petersen, P., and Voigtlaender, F.
Optimal Approximation of Piecewise Smooth Functions Using Deep ReLU Neural Networks.
Neural Networks 108 (2018), 296 – 330.[26]
Suzuki, T.
Adaptivity of Deep ReLU Network for Learning in Besov and Mixed Smooth Besov Spaces: Optimal Rate and Curse of Dimensionality. In
International Conference on Learning Representations (2019).[27]
Yarotsky, D.
Error Bounds for Approximations with Deep ReLU Networks.
Neural Networks 94 (2017), 103 – 114.(2017), 103 – 114.