Approximation with Neural Networks in Variable Lebesgue Spaces
arXiv preprint [math.FA]

Ángela Capel and Jesús Ocáriz
Abstract.
This paper concerns the universal approximation property of neural networks in variable Lebesgue spaces. We show that, whenever the exponent function of the space is bounded, every function can be approximated by shallow neural networks to any desired accuracy. This result subsequently yields a characterization of the universality of the approximation in terms of the boundedness of the exponent function. Furthermore, whenever the exponent is unbounded, we obtain some characterization results for the subspace of functions that can be approximated.
Contents
1. Introduction
2. Main results
3. Preliminaries
3.1. Neural networks
3.2. Variable Lebesgue spaces
4. Universal approximation in function spaces
4.1. Space of continuous functions
4.2. Negative results
4.3. Locally integrable functions
5. Approximation in variable Lebesgue spaces
5.1. Case I: Bounded exponent function
5.2. Case II: Unbounded exponent function, discrete case
5.3. Case III: Unbounded exponent function, general case
References

1. Introduction
Date: July 9, 2020.

Artificial neural networks are a model created with the purpose of imitating the behavior of biological neural networks using digital computing. Their origins trace back to [MP43] and [Ros58], and since then numerous applications have been found in a wide range of fields, from machine learning to computer vision, speech recognition or mathematical finance, among many others. A major problem in the theory of neural networks is that of approximating, within a desired accuracy, a generic class of functions using neural networks, initially motivated by the behavior of neural networks observed in the Representation Theorem due to Arnold [Arn57] and Kolmogorov [Kol57] and the aim of providing a theoretical justification for it.

The starting point of approximation theory for neural networks was the Universal Approximation Theorem of [Cyb89] and [HSW89], which shows that every continuous function on a compact set can be uniformly approximated by shallow neural networks with a continuous, non-polynomial activation function. Subsequent extensions of this result addressed the analogous problem for Lebesgue spaces with a finite exponent [Hor91] and locally integrable spaces [PS91], while some others also considered the derivatives of the neural networks, showing that shallow neural networks with a sufficiently smooth activation function and unrestricted width are dense in the space of sufficiently differentiable functions [Pin99]. Soon after this wave of results for shallow neural networks, various surveys on the topic appeared in the literature, such as [Pin97], [ST98], [TKG+03] and [San08].

In the last years, many directions have been explored in the approximation theory for both shallow and deep neural networks. For neural networks with ReLU activation functions, there are a number of recent papers concerning various topics: approximation for Besov spaces [Suz19], regression [SH17] and optimization [GS09] problems, restriction to encodable weights [PV18], negative results of approximation [ALdTRLV18], estimates for the errors obtained in the approximation [Pet99], [Yar17], or, in general, deep neural networks [KL19], [SCC18], among many others. A simple and inspiring new proof of the universality theory for deep neural networks can be found in [HHH20]. Moreover, in [HH20] the authors develop a new technique named "un-rectifying", which transforms piece-wise continuous non-linear activation functions into piece-wise continuous linear functions, and then use it to show that ReLU networks and MaxLU networks are indeed deep trees. Furthermore, for more regular activation functions there are also numerous recent articles, namely [Bar94], [BGKP19], [Lin19], [LTY20], [Mha96], [OK19], [TLY19], most of which focus on deep neural networks. Some other directions are currently being studied for the problem of approximation too, such as the topological approach presented in [Kra19], the application of these results to finding solutions of partial differential equations [GR20], or the comparison to approximation with tensor networks [AN20].

In this paper, we take a step forward in the theory of approximation with neural networks and address the problem of approximating, with arbitrary precision, any function in a variable Lebesgue space using shallow neural networks with various activation functions. Variable Lebesgue spaces are in particular locally integrable spaces.
Even though there exist some previous results of universal approximation in locally integrable spaces, they depend on a metric defined in such a way that the behavior of a function away from a compact set is always negligible. Moreover, since that distance is not constructed from a norm, the approximations are not stable under dilations. In this manuscript, we overcome this situation and prove our approximation results employing the distance determined by the usual norm of the space.

Variable Lebesgue spaces are a generalization of Lebesgue spaces that might contain functions which do not belong to any Lebesgue space [CUF13]. In particular, a variable Lebesgue space might include all the bounded functions even if the domain is not compact. Therefore, a natural problem that arises in this setting is that of approximating functions on non-compact domains; for example, continuous functions defined on an unbounded interval with a suitable asymptotic limit. This type of function may appear as the representation of a quantity that follows a diffusion process in time (like the temperature at a certain point in a closed system).

More specifically, in this paper we show approximation results with neural networks for variable Lebesgue spaces depending on the boundedness of their exponent function. If the exponent of the space is essentially bounded, we show that a result of universal approximation holds, yielding thus an analogous behavior to that of usual Lebesgue spaces. On the other hand, if the exponent is unbounded, the situation is much more subtle, but we can characterize in some cases the subspace of functions which can be approximated with neural networks, relying on some results of [ACACUO19]. We first address the simpler case of variable sequence spaces and subsequently lift our results to a more general domain. Our results hold for most of the activation functions present in the literature, namely any sigmoidal function (logistic sigmoid, hyperbolic tangent, Heaviside function, etc.) or the rectifier function.

The outline of the manuscript is the following: In Section 2, we present an informal exposition of the main results of the present article. Important notions and results on neural networks and variable Lebesgue spaces are reviewed in Section 3. In Section 4, we collect some of the previous results on universal approximation in certain function spaces and provide several improvements or generalizations for them. Finally, in Section 5, we present our results on approximation with neural networks in variable Lebesgue spaces.
2. Main results

A shallow neural network is described by a function $g : \mathbb{R}^d \to \mathbb{R}$ given by
\[ g(x) = \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j), \]
where $x \in \mathbb{R}^d$ represents the input to the neural network, $g(x) \in \mathbb{R}$ the output, $w_j \in \mathbb{R}^d$ and $\alpha_j \in \mathbb{R}$ are the weights between the first and second layer, and the second and third layer, respectively, $b_j \in \mathbb{R}$ are the biases, $\sigma : \mathbb{R} \to \mathbb{R}$ is the activation function and $M$ the height. The subspace generated by such functions will be denoted by $\mathcal{H}_\sigma$.

Given an activation function $\sigma$ and a normed function space $(X, \|\cdot\|)$, the Universal Approximation (UA) property for shallow neural networks can be formally stated as follows: for every $f \in X$ and every $\varepsilon > 0$, there is a function $g \in \mathcal{H}_\sigma$ such that $\|f - g\| < \varepsilon$.

The main results of this article concern the UA property for variable Lebesgue spaces, i.e. spaces of the form
\[ L^{p(\cdot)}(\Omega) := \{ f : \Omega \to \mathbb{R} \ : \ f \text{ measurable and } \|f\|_{p(\cdot)} < +\infty \}, \]
for an open $\Omega \subseteq \mathbb{R}^d$, where $p : \Omega \to [1, +\infty)$ is an exponent function and the norm is given by
\[ \|f\|_{p(\cdot)} := \inf \left\{ \lambda > 0 \ : \ \int_\Omega \left( \frac{|f(x)|}{\lambda} \right)^{p(x)} \mathrm{d}x \leq 1 \right\}. \]
The first of these results (which appears in the main text as Theorem 5.3) shows that universal approximation holds whenever the exponent function of the space is bounded.
Theorem. Let $\Omega \subseteq \mathbb{R}^d$, consider $p : \Omega \to [1, +\infty)$ a bounded exponent function and $\sigma : \mathbb{R} \to \mathbb{R}$ discriminatory for $L^{p(\cdot)}(K)$ for every compact $K \subset \Omega$. Then, truncated finite sums of the form
\[ g(x) = \begin{cases} \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j), & x \in K, \\ 0, & x \in \Omega \setminus K, \end{cases} \]
with $K$ compact, are dense in $L^{p(\cdot)}(\Omega)$.

The condition imposed on $\sigma$ essentially means that if a functional over $L^{p(\cdot)}(K)$ vanishes on $\mathcal{H}_\sigma$, then the functional is identically null. Most of the activation functions considered in the literature verify this property, such as, for example, any continuous function different from an algebraic polynomial.

Whenever the exponent function of a variable Lebesgue space is unbounded, the situation is much more complex. Indeed, as a consequence of the separability of the space and the previous theorem, we provide a characterization of universal approximation in terms of the boundedness of the exponent function, which appears in the main text as Corollary 5.5.

Corollary.
Let $\Omega \subseteq \mathbb{R}^d$, $p : \Omega \to [1, +\infty)$ an exponent function and $\sigma : \mathbb{R} \to \mathbb{R}$ continuous and with finite limits at $+\infty$ and $-\infty$. Then, UA holds for $L^{p(\cdot)}(\Omega)$ if, and only if, $p$ is bounded.

Subsequently, since the UA property fails for unbounded exponent functions, we study conditions to describe the subspace of functions which can be approximated with neural networks. In the following result, the characterizing condition is a generalization of the concept of having an asymptotic limit.
Theorem.
Let $\Omega \subseteq \mathbb{R}$ be an unbounded interval and $p : \Omega \to [1, +\infty)$ an unbounded exponent function such that $L^\infty(\Omega) \subset L^{p(\cdot)}(\Omega)$ and $p$ is bounded on every compact subset of $\Omega$. Let $\sigma \in L^\infty(\mathbb{R})$ be a non-constant, sigmoidal activation function. Then, the following conditions are equivalent for $f \in L^{p(\cdot)}(\Omega)$:

(1) For every $\varepsilon > 0$, there is a $g_\varepsilon \in \mathcal{H}_\sigma$ such that $\|f - g_\varepsilon\|_{L^{p(\cdot)}(\Omega)} < \varepsilon$.

(2) There is a scalar $\beta \in \mathbb{R}$ such that $\|[f - \beta \chi_\Omega]\|_Q = 0$, where $\|\cdot\|_Q$ is the quotient norm given in Definition 3.13.

This theorem appears as Theorem 5.10 in the main text. In a nutshell, the quotient norm encodes the behavior of the function at $\infty$. Therefore, the functions that can be approximated are those which converge, in some sense, at $\infty$.

Finally, prior to these results, we recall and adapt some classical results of universal approximation for certain function spaces. More specifically, we address the space of continuous functions over a compact set and locally integrable spaces. For the former, we prove an extension of the original result of universal approximation from continuous functions on a compact set to the space of functions which vanish at infinity, in the unidimensional case, as well as a negative result of universal approximation, whereas for the latter we extend a result for radial activation functions to non-radial ones. Moreover, even though variable Lebesgue spaces are locally integrable spaces, our contribution for those spaces is a major improvement, because our concept of distance comes from a norm and, therefore, approximations in this context are more meaningful, since for example we can control the error of dilation.

3. Preliminaries
3.1. Neural networks.
In this subsection, we introduce the concepts and properties associated to neural networks that we will need for the rest of the paper. Some references for the mathematical formulation of neural networks in this context are [Hay98], [Bis06], [KvdS93] or [Roj96], among many others.

In this paper, we focus on feedforward artificial neural networks (denoted ANN hereafter), the simplest model for neural networks. A feedforward artificial neural network contains several nodes, arranged in layers serving differently depending on their position, namely input, hidden and output nodes. Moreover, this type of neural network has only a single input layer, which provides information from the environment to the network, and a single output layer, which transmits information from the network to the environment.

In the past, different classes of ANN have been considered to show approximation in generic classes of functions. It is known that ANN with no hidden layer are not capable of approximating generic, non-linear, continuous functions [Wid90]. On the opposite side, ANN with two or more hidden layers, known as deep neural networks, have given rise to a broad field of research in the past years. However, for simplicity we focus specifically on the case of one hidden layer, commonly known as shallow neural networks, for which typical results on universal approximation have been studied in the past.
Definition 3.1. A feedforward shallow artificial neural network (ANN) can be described by a finite linear combination of the form
\[ g(x) = \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j), \]
where $x \in \mathbb{R}^d$ represents the input to the neural network, $g(x) \in \mathbb{R}$ the output, $w_j \in \mathbb{R}^d$ and $\alpha_j \in \mathbb{R}$ are the weights between the first and second layer, and the second and third layer, respectively, $b_j \in \mathbb{R}$ are the biases, $\sigma : \mathbb{R} \to \mathbb{R}$ is the activation function and $M$ the height.

The subspace of all functions that can be obtained using an ANN with activation function $\sigma$ will be denoted hereafter by
\[ \mathcal{H}_\sigma := \left\{ g(x) = \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j) \ : \ w_j \in \mathbb{R}^d, \ \alpha_j, b_j \in \mathbb{R} \text{ and } M \in \mathbb{N} \right\}. \]
Throughout the text, we will impose different conditions on $\sigma$ to obtain diverse results, since the role of the activation function is essential in the results of approximation of generic function spaces. Here we summarize some of the most typical examples of activation functions.

Example 3.2.
Examples of activation functions:

• The Heaviside step function, which is piece-wise defined over $\mathbb{R}$ in the following way:
\[ \sigma(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{if } x \leq 0. \end{cases} \]

• The rectifier function (ReLU), defined as:
\[ \sigma(x) = \begin{cases} x & \text{if } x > 0, \\ 0 & \text{if } x \leq 0. \end{cases} \]

• The logistic sigmoid, given by:
\[ \sigma(x) = \frac{1}{1 + e^{-x}}. \]

• The hyperbolic tangent, which is an affinely transformed logistic sigmoid:
\[ \sigma(x) = \tanh(x). \]

For further examples of activation functions, see [ST98] for instance.
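Although not part of the original exposition, the activations above and the shallow networks of Definition 3.1 are straightforward to realize in code. The following Python sketch, with arbitrary illustrative weights of our choosing, evaluates $g(x) = \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j)$ for any of the listed activation functions.

```python
import math

# The four activation functions of Example 3.2.
def heaviside(x): return 1.0 if x > 0 else 0.0
def relu(x):      return x if x > 0 else 0.0
def logistic(x):  return 1.0 / (1.0 + math.exp(-x))
def tanh_act(x):  return math.tanh(x)

def shallow_net(x, params, sigma):
    """Evaluate g(x) = sum_j alpha_j * sigma(w_j . x + b_j) for x in R^d.

    `params` is a list of (alpha_j, w_j, b_j) triples, with w_j a list of
    length d; the particular values used below are arbitrary illustrations.
    """
    return sum(a * sigma(sum(wi * xi for wi, xi in zip(w, x)) + b)
               for a, w, b in params)

# A height-2 network on R^2 with the logistic sigmoid.
params = [(1.0, [2.0, -1.0], 0.5), (-0.5, [0.0, 1.0], 0.0)]
value = shallow_net([1.0, 1.0], params, logistic)
```

Any of the four activations can be passed as `sigma`; with `heaviside` the resulting $g$ is a step function, while with `logistic` or `tanh_act` it is smooth.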
We now introduce a property concerning the activation function which will appear often throughout the rest of the text.

Definition 3.3. Given an activation function $\sigma : \mathbb{R} \to \mathbb{R}$, we say that $\sigma$ is sigmoidal if
\[ \sigma(t) \to \begin{cases} c_{+\infty} & \text{as } t \to +\infty, \\ c_{-\infty} & \text{as } t \to -\infty, \end{cases} \]
where the constants $c_{+\infty}$ and $c_{-\infty}$ are finite. Usually, $c_{+\infty}$ is taken to be $1$, and $c_{-\infty}$ is taken to be $-1$ or $0$.
Note that all the activation functions presented in Example 3.2 satisfy this condition. Indeed, even though this goes against the intuition in the case of the ReLU, we can obtain a continuous and bounded activation function by properly combining two ReLUs: for instance, $x \mapsto \mathrm{ReLU}(x) - \mathrm{ReLU}(x-1)$ is sigmoidal.
3.2. Variable Lebesgue spaces.
In this subsection, we introduce all the necessary notions and previous results on variable Lebesgue spaces needed to understand the results on approximation with neural networks in those spaces that will be presented in Section 5. A good reference for these spaces and their applications to harmonic analysis is [CUF13].

For simplicity, the measurable space we consider hereafter is $(\Omega, \mathcal{B}, |\cdot|)$, where $\Omega \subseteq \mathbb{R}^d$, $\mathcal{B}$ is the $\sigma$-algebra of measurable sets and $|\cdot|$ is the Lebesgue measure. Other measurable spaces could be considered, but here we restrict to this special case, due to its importance for applications.

The motivation for the study of variable Lebesgue spaces is the following: consider the real function $g(x) := 1/\sqrt{|x|}$, which presents a singularity at $x = 0$. This natural function does not belong to any $L^p(\mathbb{R})$ for $1 \leq p \leq +\infty$, because it either grows too fast at the origin or decays too slowly at infinity. An idea to solve this matter and find a Lebesgue space to which it could belong would be to split the domain. With this in mind, one can check, for instance, that $g \in L^1([-1,1])$ and $g \in L^3(\mathbb{R} \setminus [-1,1])$. This idea of letting the integrability exponent vary with the point of the domain leads to the notion of variable Lebesgue spaces.

Definition 3.5. An exponent function is a measurable function $p : \Omega \to [1, +\infty)$.

Note that the exponent function always takes a finite value at each point of the domain. It is possible to generalize the concept to allow the value $+\infty$, but this is just a technical issue in which we are not particularly interested here.
Definition 3.6. Given a measurable function $f$ and an exponent function $p$, the norm is defined by
\[ \|f\|_{p(\cdot)} := \inf \left\{ \lambda > 0 \ : \ \int_\Omega \left( \frac{|f(x)|}{\lambda} \right)^{p(x)} \mathrm{d}x \leq 1 \right\}. \]

On some occasions, we might denote the previous norm by $\|f\|_{L^{p(\cdot)}(\Omega)}$ whenever it is necessary to emphasize the domain of the functions. There exist in the literature different ways to define the norm for these spaces. The previous expression is obtained through the Luxemburg norm, but it is also possible to use the Amemiya norm, or even to employ a modular different from the one used above, i.e. $\rho(f) = \int_\Omega |f(x)|^{p(x)} \, \mathrm{d}x$.
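As a quick illustration of ours (not from the original text), the Luxemburg norm can be approximated numerically: the modular $\lambda \mapsto \int_\Omega (|f(x)|/\lambda)^{p(x)} \, \mathrm{d}x$ is nonincreasing in $\lambda$, so the infimum can be located by bisection. The quadrature, search interval and test exponent below are ad hoc choices.

```python
def modular(f, p, lam, a=0.0, b=1.0, n=2000):
    """Midpoint-rule approximation of the modular: integral over (a, b)
    of (|f(x)| / lam) ** p(x) dx."""
    h = (b - a) / n
    xs = (a + (i + 0.5) * h for i in range(n))
    return sum((abs(f(x)) / lam) ** p(x) * h for x in xs)

def luxemburg_norm(f, p, lo=1e-9, hi=1e9, iters=100):
    """Geometric bisection for inf{lam > 0 : modular(f, p, lam) <= 1}."""
    for _ in range(iters):
        mid = (lo * hi) ** 0.5
        if modular(f, p, mid) <= 1.0:
            hi = mid   # mid is feasible: the infimum is at most mid
        else:
            lo = mid   # mid is infeasible: the infimum exceeds mid
    return hi

# Sanity check on Ω = (0, 1): a constant exponent p ≡ 2 must recover the
# classical L^2 norm, e.g. ||2||_{L^2(0,1)} = 2.
norm = luxemburg_norm(lambda x: 2.0, lambda x: 2.0)
```

The same routine accepts a genuinely variable exponent, e.g. `p = lambda x: 1.0 + x`, at the cost of quadrature error only.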
We can now introduce the formal definition of variable Lebesgue spaces as follows.

Definition 3.7. Let $p : \Omega \to [1, +\infty)$ be an exponent function and consider the space of functions given by
\[ L^{p(\cdot)}(\Omega) := \{ f : \Omega \to \mathbb{R} \ : \ f \text{ measurable and } \|f\|_{p(\cdot)} < +\infty \}. \]
Then $(L^{p(\cdot)}(\Omega), \|\cdot\|_{p(\cdot)})$ is a Banach space, which we call a variable Lebesgue space.

In particular, if the exponent function is constantly equal to a value $p$, we recover the classical Banach space $(L^p(\Omega), \|\cdot\|_p)$ with $1 \leq p < \infty$, showing thus that the previous definition is a generalized version of the so-called Lebesgue spaces. When the domain is not relevant, we might denote the previous spaces by $L^{p(\cdot)}$.

Next, we present an essential definition which allows us to split the behavior of variable Lebesgue spaces into two different cases.

Definition 3.8.
We say that an exponent function $p : \Omega \to [1, +\infty)$ is bounded when the essential supremum of $p$ verifies
\[ p_+ := \operatorname*{ess\,sup}_{x \in \Omega} p(x) < +\infty, \]
i.e. $p$ is uniformly bounded except possibly on a set of zero Lebesgue measure. If $p$ is not bounded, we say that it is unbounded.
The importance of the boundedness of the exponent function resides in the fact that variable Lebesgue spaces with a bounded exponent function behave similarly to classical Lebesgue spaces. Moreover, they verify certain important properties, like the ones presented below, whose proofs can be found in [CUF13].
Proposition 3.9.
The variable Lebesgue space $(L^{p(\cdot)}, \|\cdot\|_{p(\cdot)})$ is separable if, and only if, the exponent function is bounded.

Therefore, whenever the exponent $p$ is bounded, the space $L^{p(\cdot)}$ is separable. For the next property, we need to consider the subspaces of $L^{p(\cdot)}$ of compactly supported functions and of smooth compactly supported functions, denoted by $L^{p(\cdot)}_c$ and $C^\infty_c$, respectively. One can prove the following characterization of bounded exponents for variable Lebesgue spaces, concerning the density of these subspaces in $L^{p(\cdot)}$.

Proposition 3.10. $\left( L^{p(\cdot)}_c, \|\cdot\|_{p(\cdot)} \right)$ is dense in $L^{p(\cdot)}$ if, and only if, the exponent function is bounded. Moreover, $\left( C^\infty_c, \|\cdot\|_{p(\cdot)} \right)$ is dense in $L^{p(\cdot)}$ if, and only if, the exponent function is bounded.

Next, we present an essential result concerning the duality of this class of spaces. Before stating it, we recall that for any exponent function $p : \Omega \to [1, +\infty)$, its Hölder conjugate is the exponent function $q$ defined point-wise by
\[ \frac{1}{p(x)} + \frac{1}{q(x)} = 1, \]
for almost every $x \in \Omega$. Note that if $p(x) = 1$, then $q(x) = +\infty$. Although this situation does not completely fit the definition of exponent function presented in these lines, the function $q$ can also be interpreted as an exponent function in a generalized sense.
The dual space of $(L^{p(\cdot)}, \|\cdot\|_{p(\cdot)})$ is $(L^{q(\cdot)}, \|\cdot\|_{q(\cdot)})$, where $q$ is the point-wise Hölder conjugate of $p$, if, and only if, the exponent function $p$ is bounded. Furthermore, if $p$ is bounded, there is an equivalence between functionals $T$ and functions $g \in L^{q(\cdot)}$ through the identity
\[ T(f) = \int_\Omega f(x) \, g(x) \, \mathrm{d}x, \]
for every $f \in L^{p(\cdot)}$.

Moreover, still in the case in which the exponent function is bounded, there are also some inclusions between variable Lebesgue spaces whenever the domain is compact.
Proposition 3.12.
Let $K$ be compact and let $p_1$ and $p_2$ be two bounded exponent functions verifying $p_1(x) \leq p_2(x)$ for every $x \in K$. Then $L^{p_2(\cdot)}(K) \subseteq L^{p_1(\cdot)}(K)$.

On the other hand, if the exponent function is unbounded, the situation is much more subtle. Variable Lebesgue spaces with unbounded exponent functions are much more complex in general: they are non-separable, their elements cannot be approximated by compactly supported functions, and the dual is more complicated (see previous work of one of the authors in [ACACUO19]). In the following, we present some notions needed to deal with unbounded exponent functions.
Definition 3.13.
Let $p : \Omega \to [1, +\infty)$ be an unbounded exponent function and consider the quotient of $L^{p(\cdot)}$ by the closure of its subspace of compactly supported functions, i.e.
\[ L^{p(\cdot)}_Q := L^{p(\cdot)} / \overline{L^{p(\cdot)}_c}. \]
This quotient is equipped with the quotient norm $\|\cdot\|_Q$, which is defined by
\[ \|[f]\|_Q := \inf_{g \in L^{p(\cdot)}_c} \|f - g\|_{p(\cdot)}. \]

Note that, when $p$ is unbounded, this quotient is nontrivial (Proposition 3.10). This way of computing the quotient norm is impractical. To overcome this issue, the following useful characterization was proven in [ACACUO19].
Proposition 3.14.
Let $p : \Omega \to [1, +\infty)$ be an unbounded exponent function. Then,
\[ \|[f]\|_Q = \inf \left\{ \lambda > 0 \ : \ \int_\Omega \left( \frac{|f(x)|}{\lambda} \right)^{p(x)} \mathrm{d}x < +\infty \right\}. \]

Note that this expression has the same flavor as the norm of $L^{p(\cdot)}$. Furthermore, it is easy to see that $\|[f]\|_Q \leq \|f\|_{p(\cdot)}$, because of the inclusion between the corresponding sets of values of $\lambda$.

To conclude this subsection on variable Lebesgue spaces, we include a result which relates $L^{p(\cdot)}$ spaces, when $p$ is unbounded, to the space of bounded functions. For that, given $A \subseteq \Omega$, we define the map $w(A) := \|[\chi_A]\|_Q$.
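To make the characterization above concrete, the following numerical sketch (our illustration, with an ad hoc quadrature) probes the set $\{\lambda > 0 : \int_\Omega (|f(x)|/\lambda)^{p(x)} \, \mathrm{d}x < +\infty\}$ for $f = \chi_\Omega$, $\Omega = [1, +\infty)$ and $p(x) = x$: the partial integrals $\int_1^T \lambda^{-x} \, \mathrm{d}x$ stabilize as $T$ grows when $\lambda > 1$ and blow up when $\lambda < 1$, consistent with $\|[\chi_\Omega]\|_Q = 1$, hence $w(\Omega) < +\infty$.

```python
import math

def partial_modular(lam, T, n=200000):
    """Midpoint Riemann sum for the integral of (1/lam)**x over [1, T],
    i.e. the modular of f = χ_Ω with exponent p(x) = x, truncated at T."""
    h = (T - 1.0) / n
    return sum((1.0 / lam) ** (1.0 + (i + 0.5) * h) * h for i in range(n))

# lam = 2 > 1: the integral converges (exact value 2**(-1)/log 2), so the
# partial integrals barely change as T grows.
at_T100 = partial_modular(2.0, 100.0)
at_T1000 = partial_modular(2.0, 1000.0)

# lam = 1/2 < 1: the integrand 2**x explodes, so the partial integrals
# blow up with T; only lam > 1 is feasible, giving ||[χ_Ω]||_Q = 1.
blowup = partial_modular(0.5, 50.0)
```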
Proposition 3.15. Let $p$ be an unbounded exponent function. Then, $L^\infty(\Omega) \subset L^{p(\cdot)}(\Omega)$ if, and only if, $w(\Omega) < +\infty$.

By virtue of the characterization of the quotient norm, this result essentially states that, when $\Omega$ has infinite Lebesgue measure, if $p$ diverges fast enough, then $w(\Omega) < +\infty$ and $L^\infty(\Omega) \subset L^{p(\cdot)}(\Omega)$. Some important examples of exponent functions, when we take $\Omega = [1, +\infty)$, are $p(x) = x$ and $p(x) = [x]$, where $[x]$ denotes the integer part of $x$. These exponent functions verify $w(\Omega) < +\infty$. Nevertheless, not every exponent function verifies this condition, and this can be identified from its speed of divergence. Indeed, for example, $p(x) = 1 + \log x$, or any other exponent function which diverges faster, verifies $w(\Omega) < +\infty$. However, there are other unbounded exponent functions which do not verify the previous condition, such as $p(x) = 1 + \log(1 + \log x)$, or any other exponent function which diverges more slowly than this.

4. Universal approximation in function spaces
The aim of this section is to present some of the known results about the property of Universal Approximation (UA) for classical function spaces, as well as some improvements to some of them. A nice survey of these results in soft computing techniques is [TKG+03]. The first such result concerned the universal approximation of continuous functions on compact subsets of $\mathbb{R}^d$; this result was later extended to several other spaces (see [PS91], [Hor91], [HSW89], [Cyb89]).

In this section, we focus on results of universal approximation in various function spaces, namely those of continuous functions and spaces of locally integrable functions. For each of these classes, we review some of the previous literature and present some new results, with improvements concerning either the domain of the functions, the domain of the activation functions, or the techniques involved to prove them.

4.1. Space of continuous functions.
In 1989, in both [Cyb89] and [HSW89], the authors proved simultaneously, for many different activation functions, that for any continuous function on a compact set $K$ of $\mathbb{R}^d$ and any $\varepsilon > 0$, there exists a feedforward neural network with one hidden layer which uniformly approximates the function with precision $\varepsilon$. Before formally stating and proving the result, we introduce a concept concerning activation functions that is necessary to understand the statement of the theorem.
Definition 4.1.
Given an activation function $\sigma : \mathbb{R} \to \mathbb{R}$ and a compact $K \subset \mathbb{R}^d$, we say that $\sigma$ is discriminatory for $C(K)$ if, for $\mu$ a finite, signed, regular Borel measure on $K$, the fact that
\[ \int_K \sigma(w \cdot x + b) \, \mathrm{d}\mu(x) = 0 \quad \text{for all } w \in \mathbb{R}^d \text{ and } b \in \mathbb{R} \]
implies that $\mu \equiv 0$.

Now, we can state the first result on universal approximation using ANN, namely the universal approximation theorem for continuous functions on a compact set, whose proof can be found in [Cyb89, Theorem 1] or [HSW89, Theorem 2.1].
Theorem 4.2.
Let $\sigma$ be any continuous activation function which is discriminatory for $C(K)$. Then, finite sums of the form
\[ \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j) \]
are dense in $C(K)$, the space of continuous functions over a compact set $K$.

A natural question arising from this result is whether a given activation function is discriminatory for $C(K)$. The following sufficient condition, whose proof appears in [Cyb89, Lemma 1] and [Hor91, Theorem 5], allows us to check when a continuous function $\sigma$ is discriminatory for $C(K)$.

Proposition 4.3.
Any bounded, measurable, non-constant function $\sigma$ is discriminatory for $C(K)$ for every compact $K \subset \mathbb{R}^d$. In particular, any continuous sigmoidal function is discriminatory.

By virtue of this result, we can check that all the examples of activation functions provided in Example 3.2 are discriminatory for $C(K)$. However, to get a continuous approximation the activation function needs to be continuous; thus, to apply Theorem 4.2 we can only use the examples that are continuous, namely the rectifier, the logistic sigmoid and the hyperbolic tangent.

As a generalization of the previous result, the following characterization of discriminatory functions was provided shortly afterwards in [LLPS93].
Let $\sigma$ be locally bounded and continuous almost everywhere. Then, $\sigma$ is discriminatory for $C(K)$ for every compact $K \subset \mathbb{R}^d$ if, and only if, it is different from an algebraic polynomial almost everywhere.

Building on these results, we now present some improvements to some of them. The following is a slight improvement of Theorem 4.2, in the sense that it extends that result from a compact set to the space of functions which vanish at infinity. As a counterpart, it only works in the unidimensional case.
Proposition 4.5.
Let $\sigma \in C(\mathbb{R})$ be sigmoidal and non-constant, and let $f \in C(\mathbb{R})$ have compact support. Then, for every $\varepsilon > 0$ there is $g_\varepsilon \in \mathcal{H}_\sigma$ such that
\[ \sup_{x \in \mathbb{R}} |f(x) - g_\varepsilon(x)| < \varepsilon. \]

Proof.
First, we know that $\sigma$ is discriminatory for $C(K)$ with $K$ compact, because it is bounded and non-constant (Proposition 4.3).

Next, we recall that $C_0(\mathbb{R})$ is the subspace of continuous functions that vanish at infinity. This space coincides with the closure in the supremum norm of the union of the spaces $C([-\ell, \ell])$ with $\ell \in \mathbb{N}$, viewed as subspaces of $L^\infty(\mathbb{R})$. Thus, $\sigma$ is also discriminatory for $C_0(\mathbb{R})$, since there are continuous functions vanishing at infinity in the subspace $\mathcal{H}_\sigma$, due to the fact that $\sigma$ is sigmoidal. Therefore, $\mathcal{H}_\sigma$ is dense, and for every $\varepsilon > 0$ we can find $g_\varepsilon \in \mathcal{H}_\sigma$ such that
\[ \sup_{x \in \mathbb{R}} |f(x) - g_\varepsilon(x)| < \varepsilon. \qquad \square \]

The following result takes a further step in the approximation of continuous functions, since the domain is now the whole $\mathbb{R}^d$ instead of just a compact $K$. It has the same spirit as [PS91, Theorem 2], although the latter result concerns radial activation functions, whereas we have considered a different structure, slicing with hyperplanes. Here we check that this result also holds in our setting.

Theorem 4.6.
Let $\sigma$ be continuous and discriminatory for $C(K)$ for every compact $K \subset \mathbb{R}^d$. Given $\varepsilon > 0$ and $f \in C(\mathbb{R}^d)$, there is a function $g \in \mathcal{H}_\sigma$ such that $d(f, g) < \varepsilon$, where the distance is defined in the following way:
\[ d(f, g) := \sum_{k=1}^{\infty} 2^{-k} \, \frac{\| (f - g) \chi_{[-k,k]^d} \|_\infty}{1 + \| (f - g) \chi_{[-k,k]^d} \|_\infty}, \]
where $[-k,k]^d$ is the centered $d$-dimensional cube of side $2k$.

We omit the proof of this result, as it is completely analogous to that of Theorem 4.10 for locally integrable functions. However, we remark a couple of facts concerning this result.
Remark 4.7.
The previous theorem also holds for $\sigma$ the rectifier. Moreover, as can be seen in the proof of Theorem 4.10, we only use the assumption of continuity of $f$ to justify that it is bounded over compact sets. Hence, we could have stated the theorem above for $f \in L^\infty_{\mathrm{loc}}(\mathbb{R}^d)$.

4.2. Negative results.
Next, we explore the opposite direction, i.e. negative results of universal approximation. First, we show that the following is a necessary condition for a space to satisfy the UA property.
Lemma 4.8.
If a metric space $(X, d)$ of functions can be universally approximated by finite sums of the form
\[ \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j), \]
where $\sigma$ is continuous and sigmoidal, then $(X, d)$ is separable.

Proof. The proof easily follows from the fact that $\mathbb{Q}$ is dense in $\mathbb{R}$. Indeed, due to the uniform continuity of $\sigma$, the countable subspace consisting of the finite sums
\[ \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j), \]
where the coefficients $\alpha_j$, $b_j$ and the components of the vectors $w_j$ are taken to be rational numbers, is a dense subspace. $\square$

Since every sigmoidal continuous activation function is uniformly continuous, Lemma 4.8 shows that the size of the function space is an important fact to consider in the problem of UA. Furthermore, it can be used to show negative results of UA, such as the following example:
Example 4.9. $(L^\infty(\mathbb{R}^d), \|\cdot\|_\infty)$ is a non-separable space. Therefore, UA fails for continuous sigmoidal activation functions. Moreover, it even fails for arbitrary continuous activation functions.
Indeed, if $\sigma$ is continuous, every finite sum of the form
\[ g(x) = \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j) \]
is continuous. The Heaviside step function $h$ belongs to $L^\infty(\mathbb{R})$ and has a jump discontinuity at $x = 0$ of size $1$. Therefore, at $x = 0$ the approximating continuous function $g$ would have to be simultaneously close to the values $0$ and $1$. Thus, using the continuity of $g$, we can easily see that
\[ \| h - g \|_\infty := \sup_{x \in \mathbb{R}} \left| h(x) - \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j) \right| \geq \frac{1}{2}, \]
concluding the proof that UA fails for $(L^\infty(\mathbb{R}^d), \|\cdot\|_\infty)$ with continuous activation functions.
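The obstruction above can also be observed numerically. The sketch below (our illustration; the family $\sigma(kx)$ with a logistic $\sigma$ is just one convenient choice of continuous approximants) estimates $\sup_x |h(x) - \sigma(kx)|$ on a fine grid around the jump: however steep the approximant, continuity pins its value near $1/2$ at $x = 0$, so the sup-norm error never drops below $1/2$.

```python
import math

def heaviside(x):
    return 1.0 if x > 0 else 0.0

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def sup_error(k, eps=1e-8, n=10001):
    """Grid estimate of sup_x |h(x) - logistic(k*x)| on [-eps, eps].

    The grid is concentrated around the jump at 0, where the sup-norm
    error of any continuous approximant is forced to be at least 1/2."""
    grid = [-eps + 2.0 * eps * i / (n - 1) for i in range(n)]
    return max(abs(heaviside(x) - logistic(k * x)) for x in grid)

# Steeper and steeper continuous approximants: the error stays at 1/2.
errors = [sup_error(k) for k in (10.0, 1e3, 1e6)]
```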
4.3. Locally integrable functions.
The space of locally integrable functions, $L^1_{\mathrm{loc}}(\mathbb{R}^d)$, is the largest space that we deal with in this manuscript; it includes as special cases the previous function spaces. However, this space lacks an associated norm. One can solve this problem by constructing a distance in a similar way to what we have already done for $C(\mathbb{R}^d)$ in Theorem 4.6.

Theorem 4.10.
Given $\sigma$ discriminatory for $C(K)$ for every compact $K \subset \mathbb{R}^d$, $\varepsilon > 0$ and $f \in L^1_{\mathrm{loc}}(\mathbb{R}^d)$, there is a function $g$ of the form
\[
g(x) = \sum_{j=1}^{M} \alpha_j\,\sigma(w_j \cdot x + b_j) \quad \text{for every } x \in \mathbb{R}^d,
\]
such that $d(f, g) < \varepsilon$, where the distance is defined in the following way:
\[
d(f, g) := \sum_{k=1}^{\infty} 2^{-k}\, \frac{\|(f - g)\,\chi_{[-k,k]^d}\|_1}{1 + \|(f - g)\,\chi_{[-k,k]^d}\|_1},
\]
where $[-k, k]^d$ is the centered $d$-dimensional cube of side $2k$.

Proof. The key idea of the proof relies on the fact that, due to the construction of the distance, the behaviour away from compact sets of the form $[-k, k]^d$ has very little impact on the computation of the distance, because of the multiplicative factor $2^{-k}$. Therefore, given $\varepsilon > 0$, there is a large enough $k$ so that we can restrict to approximating the function $f$ on $[-k, k]^d$.

Indeed, fix $f : \mathbb{R}^d \to \mathbb{R}$ continuous (by density of continuous functions in $L^1$ of a cube, it suffices to consider this case) and $\varepsilon > 0$. Take $m \in \mathbb{N}$ such that $2^{-m} < \varepsilon/2$. Hence, this clearly implies that
\[
\sum_{k=m+1}^{\infty} 2^{-k} < \frac{\varepsilon}{2}.
\]
Now, by virtue of Theorem 4.2, there exists a function $g$ of the form $g(x) = \sum_{j=1}^{\ell} \alpha_j\,\sigma(w_j \cdot x + b_j)$ for every $x \in \mathbb{R}^d$ such that
\[
\big\|(f - g)\,\chi_{[-m,m]^d}\big\|_1 \leq \big\|(f - g)\,\chi_{[-m,m]^d}\big\|_\infty \,(2m)^d < \frac{\varepsilon}{2}.
\]
Therefore, it is clear that
\[
d(f, g) = \sum_{k=1}^{m} 2^{-k}\, \frac{\|(f - g)\,\chi_{[-k,k]^d}\|_1}{1 + \|(f - g)\,\chi_{[-k,k]^d}\|_1} + \sum_{k=m+1}^{\infty} 2^{-k}\, \frac{\|(f - g)\,\chi_{[-k,k]^d}\|_1}{1 + \|(f - g)\,\chi_{[-k,k]^d}\|_1} \leq \big\|(f - g)\,\chi_{[-m,m]^d}\big\|_1 + \sum_{k=m+1}^{\infty} 2^{-k} < \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon,
\]
where we are using in the second line the fact that $f - g$ is locally integrable, so that $\|(f - g)\,\chi_{[-k,k]^d}\|_1$ is finite for every $k \in \mathbb{N}$ and thus
\[
\frac{\|(f - g)\,\chi_{[-k,k]^d}\|_1}{1 + \|(f - g)\,\chi_{[-k,k]^d}\|_1} \leq 1 \quad \text{for every } k \in \mathbb{N}. \qquad \square
\]
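As a sanity check (our own numerical sketch, not taken from the paper), the distance $d(f,g)$ can be approximated by truncating the series; the factor $2^{-k}$ guarantees that stopping at level $m$ introduces an error of at most $2^{-m}$, which is the mechanism exploited in the proof above.

```python
import numpy as np

def dist_Lloc(f, g, m=20, pts=4001):
    """Truncated version of d(f, g) = sum_k 2^{-k} L1_k / (1 + L1_k)
    in dimension d = 1, where L1_k = ||(f - g) chi_{[-k,k]}||_1.
    The discarded tail beyond level m contributes at most 2^{-m}."""
    total = 0.0
    for k in range(1, m + 1):
        x = np.linspace(-k, k, pts)
        # Riemann approximation of the L^1 norm on [-k, k]
        l1 = np.abs(f(x) - g(x)).mean() * (2 * k)
        total += 2.0 ** (-k) * l1 / (1.0 + l1)
    return total

f = np.sin
g = lambda x: np.sin(x) + 0.01  # small uniform perturbation of f
print(dist_Lloc(f, f))          # 0.0: the distance vanishes on the diagonal
print(dist_Lloc(f, g) < dist_Lloc(f, lambda x: np.sin(x) + 1.0))  # True
```

The grid resolution and truncation level are arbitrary choices; any refinement only changes the value within the $2^{-m}$ tail bound.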
Remark 4.11. As in the case of Theorem 4.6, a result in a similar spirit appeared in [PS91], although the latter result concerned radial activation functions.

5. Approximation in variable Lebesgue spaces
In the previous section, we have collected some results concerning the universal approximation property with neural networks in certain function spaces. However, to the best of our knowledge, there is no result about universal approximation for variable Lebesgue spaces.

In this section, we present the main results of this paper. We have split them into three different cases, depending on the exponent function of the space and its domain. First, we show some results of universal approximation whenever the exponent function is bounded, subsequently proving that, under certain conditions on the activation function, this is the only case for which universal approximation is possible. Later, we shift towards the unbounded exponent function setting, starting with the case in which the domain is discrete and subsequently lifting these results to a general domain case.

5.1. Case I: Bounded exponent function.
In this section, we discuss the results of approximation for variable Lebesgue spaces obtained when the exponent function is bounded, i.e. whenever $p : \Omega \to [1, +\infty)$ for $\Omega \subseteq \mathbb{R}^d$ verifies
\[
\operatorname*{ess\,sup}_{x \in \Omega} p(x) < +\infty.
\]
In the case that the exponent function is bounded, the variable Lebesgue space $L^{p(\cdot)}$ is separable. Furthermore, smooth functions with compact support are dense. Thus, we can present a first version of a universal approximation result for bounded variable Lebesgue spaces relying on the previous fact and Proposition 4.4.

Theorem 5.1.
Let $\sigma : \mathbb{R} \to \mathbb{R}$ be locally bounded, continuous almost everywhere and different from an algebraic polynomial. Then, truncated finite sums of the form
\[
g(x) = \begin{cases} \displaystyle\sum_{j=1}^{M} \alpha_j\,\sigma(w_j \cdot x + b_j) & x \in K, \\[2mm] 0 & x \in \mathbb{R}^d \setminus K, \end{cases}
\]
with $K$ compact are dense in $L^{p(\cdot)}(\mathbb{R}^d)$ with bounded exponent $p$.

Proof. Fix $\varepsilon > 0$. Since $C^\infty_c(\mathbb{R}^d)$ is dense in $L^{p(\cdot)}(\mathbb{R}^d)$ by Proposition 3.10, we can find $f_1 \in C(\mathbb{R}^d)$ with compact support $K$ such that $\|f - f_1\|_{L^{p(\cdot)}(\mathbb{R}^d)} < \varepsilon/2$. By Proposition 4.4, there is $g_\varepsilon \in H_\sigma$, truncated to be zero outside $K$, such that
\[
\sup_{x \in K} |f_1(x) - g_\varepsilon(x)| < \frac{\varepsilon}{2\,\|\chi_K\|_{L^{p(\cdot)}(\mathbb{R}^d)}}.
\]
Therefore,
\[
\|f - g_\varepsilon\|_{L^{p(\cdot)}(\mathbb{R}^d)} \leq \|f - f_1\|_{L^{p(\cdot)}(\mathbb{R}^d)} + \|f_1 - g_\varepsilon\|_{L^{p(\cdot)}(\mathbb{R}^d)} < \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon. \qquad \square
\]

Furthermore, when the exponent function is bounded, the dual of variable Lebesgue spaces is fully characterized (see Proposition 3.11) and it coincides with $L^{q(\cdot)}$, where $q$ is pointwise the Hölder conjugate of $p$, i.e. $1/p(x) + 1/q(x) = 1$ for every $x \in \Omega$. By virtue of this result, and analogously to the previous notions of discriminatory activation functions, we can give the following definition:
Definition 5.2.
Let $\sigma : \mathbb{R} \to \mathbb{R}$ be an activation function and $K \subset \mathbb{R}^d$ compact. For a bounded exponent function $p : \Omega \to [1, +\infty)$, for $\Omega \subseteq \mathbb{R}^d$, we say that $\sigma$ is discriminatory for $L^{p(\cdot)}(K)$ if for any $h \in L^{q(\cdot)}(K)$, where $q$ is pointwise the Hölder conjugate of $p$ (i.e. $1/p(x) + 1/q(x) = 1$ for every $x \in \Omega$), the condition
\[
\int_K \sigma(w \cdot x + b)\, h(x)\, dx = 0 \quad \text{for all } w \in \mathbb{R}^d \text{ and } b \in \mathbb{R}
\]
implies that $h = 0$ almost everywhere.

As mentioned above, this is indeed a reasonable definition due to the fact that $L^{q(\cdot)}$ can be identified with the dual of $L^{p(\cdot)}$. Now, considering this definition, we can state and prove the following result in this setting.

Theorem 5.3.
Let $\Omega \subseteq \mathbb{R}^d$, $p : \Omega \to [1, +\infty)$ a bounded exponent function and $\sigma : \mathbb{R} \to \mathbb{R}$ discriminatory for $L^{p(\cdot)}(K)$ for every compact $K \subset \Omega$. Then, truncated finite sums of the form
\[
g(x) = \begin{cases} \displaystyle\sum_{j=1}^{M} \alpha_j\,\sigma(w_j \cdot x + b_j) & x \in K, \\[2mm] 0 & x \in \Omega \setminus K, \end{cases}
\]
with $K$ compact are dense in $L^{p(\cdot)}(\Omega)$.

Proof. Given $f \in L^{p(\cdot)}(\Omega)$ and $\varepsilon > 0$, we apply Proposition 3.10 to find a compact $K$ where $\|f - f\chi_K\|_{L^{p(\cdot)}(\Omega)} < \varepsilon/2$. Then, we can show that there is a $g$ of the form $\sum_{j=1}^{M} \alpha_j\,\sigma(w_j \cdot x + b_j)$ such that
\[
\Big\| f\chi_K - \sum_{j=1}^{M} \alpha_j\,\sigma(w_j \cdot x + b_j) \Big\|_{L^{p(\cdot)}(K)} < \frac{\varepsilon}{2},
\]
finishing thus the proof. Indeed, if the above were false, we would get a contradiction as a consequence of the Hahn–Banach theorem. By virtue of its version as a hyperplane separation theorem (see [Rud91, Theorem 3.5]), we can find a nontrivial functional $T$ which vanishes over all the finite sums of the form $\sum_{j=1}^{M} \alpha_j\,\sigma(w_j \cdot x + b_j)$. From Proposition 3.11, we know that the dual of $L^{p(\cdot)}(K)$ is $L^{q(\cdot)}(K)$, where $q$ is pointwise the Hölder conjugate of $p$, so that
\[
T(f) = \int_K f(x)\, h(x)\, dx
\]
for some $h \in L^{q(\cdot)}(K)$. Therefore, we have identified the functional $T$ with an element $h$ of $L^{q(\cdot)}(K)$. Since $T$ vanishes on every function $\sigma(w \cdot x + b)$, we have $\int_K \sigma(w \cdot x + b)\,h(x)\,dx = 0$ for all $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$; as $\sigma$ is discriminatory, $h = 0$ almost everywhere, and hence $T = 0$, getting a contradiction with the nontriviality of $T$. $\square$

A natural question arising from the previous result is when the activation function is discriminatory for variable Lebesgue spaces. We can hence prove the following relation between the properties of being discriminatory, which in particular yields the fact that Theorem 5.3 is more general than Theorem 5.1.
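The Luxemburg-type norm that defines $L^{p(\cdot)}(K)$, namely $\|f\|_{L^{p(\cdot)}(K)} = \inf\{\lambda > 0 : \int_K |f(x)/\lambda|^{p(x)}\,dx \le 1\}$, can be evaluated numerically. The following sketch is our own illustration (the discretization and the bisection tolerance are arbitrary choices, not anything prescribed by the paper).

```python
import numpy as np

def variable_lebesgue_norm(f, p, a, b, pts=20001, tol=1e-9):
    """Luxemburg norm on K = [a, b]: inf{lam > 0 : modular(f/lam) <= 1},
    where modular(g) = integral over K of |g(x)|^{p(x)} dx (Riemann sum)."""
    x = np.linspace(a, b, pts)
    def modular(lam):
        return np.mean(np.abs(f(x) / lam) ** p(x)) * (b - a)
    # the modular is decreasing in lam, so bisection applies
    lo, hi = tol, 1.0
    while modular(hi) > 1.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if modular(mid) > 1.0 else (lo, mid)
    return hi

# On K = [0, 1] (measure 1) a constant function c has norm exactly c, for
# any exponent function: the modular of c/lam equals 1 precisely at lam = c.
p = lambda x: 2.0 + x  # a bounded exponent function on [0, 1]
print(variable_lebesgue_norm(lambda x: 2.0 + 0 * x, p, 0.0, 1.0))  # 2.0
```

The constant-function check gives an exact reference value, which makes it a convenient unit test for any such implementation.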
Lemma 5.4.
Given $K \subset \mathbb{R}^d$ compact and two bounded exponent functions $p_1$ and $p_2$ satisfying $p_1(x) \leq p_2(x)$ for all $x \in K$,
\[
\{\sigma \text{ disc. for } C(K)\} \subseteq \{\sigma \text{ disc. for } L^{p_2(\cdot)}(K)\} \subseteq \{\sigma \text{ disc. for } L^{p_1(\cdot)}(K)\}.
\]
In particular, non-constant, bounded activation functions $\sigma$ and the rectifier are discriminatory for $L^{p(\cdot)}(K)$ (with $p$ bounded).

Proof. We have the following inclusion of variable Lebesgue spaces when the domain is compact: whenever $p_1 \leq p_2$, it holds that $L^{p_2(\cdot)}(K) \subseteq L^{p_1(\cdot)}(K)$ (Proposition 3.12). We recall that the dual of a bounded variable Lebesgue space $L^{p(\cdot)}$ is the variable Lebesgue space $L^{q(\cdot)}$, where $q$ is pointwise the Hölder conjugate of $p$ (Proposition 3.11). Therefore,
\[
L^{q_1(\cdot)}(K) \subseteq L^{q_2(\cdot)}(K),
\]
where $q_1$ and $q_2$ are the Hölder conjugates of $p_1$ and $p_2$, respectively. This proves that a function $\sigma$ that is discriminatory for $L^{p_2(\cdot)}(K)$ is also discriminatory for $L^{p_1(\cdot)}(K)$. Moreover, since for every $h \in L^1(K)$, $h(x)\,dx$ is a Radon measure, we have that if $\sigma$ is discriminatory for $C(K)$, it is also discriminatory for $L^{p(\cdot)}(K)$ with $p$ bounded. Finally, using Proposition 4.3 and Remark 3.4, non-constant, bounded activation functions $\sigma$ and the rectifier are discriminatory for $L^{p(\cdot)}(K)$ (with $p$ bounded). $\square$

From Lemma 5.4 and Proposition 4.3, it easily follows that all the examples shown in Subsection 3.1 are discriminatory for $L^{p(\cdot)}(K)$. Therefore, we can use Theorem 5.3 with any of the activation functions from Example 3.2. We conclude this subsection by showing the connection between the boundedness of the exponent function and the universal approximation property for $L^{p(\cdot)}$ spaces.

Corollary 5.5.
Let $\Omega \subseteq \mathbb{R}^d$, $p : \Omega \to [1, +\infty)$ an exponent function and $\sigma : \mathbb{R} \to \mathbb{R}$ continuous and sigmoidal. Then, UA holds for $L^{p(\cdot)}(\Omega)$ if, and only if, $p$ is bounded.

Proof. On the one hand, from Theorem 5.3, Proposition 4.3 and Lemma 5.4 we deduce that, when $p$ is bounded, we have UA for $L^{p(\cdot)}(\Omega)$. On the other hand, when $p$ is unbounded the space $L^{p(\cdot)}(\Omega)$ lacks UA, due to Proposition 3.9 and Lemma 4.8. $\square$
5.2. Case II: Unbounded exponent function, discrete case.
After having dealt with the bounded case in the last section, we now shift towards the unbounded case. For that, it is better to show first the results that can be obtained in the discrete case, i.e. in variable sequence spaces. Even though variable sequence spaces are easier to describe than variable Lebesgue spaces, they are complex enough to show that it is impossible to achieve a universal approximation property.

More precisely, we consider the measure space $(\mathbb{N}, 2^{\mathbb{N}}, \mu)$, where $2^{\mathbb{N}}$ denotes all the subsets of the natural numbers and $\mu$ is the counting measure (it associates to every subset of the natural numbers its cardinality). As in the continuum case, other measures could be considered and the results would follow with minor modifications.

Let us recall the definition of a variable sequence space: given an exponent function $p : \mathbb{N} \to [1, +\infty)$, consider the modular
\[
\rho_{p(\cdot)}(\{x(k)\}) := \sum_{j=1}^{\infty} |x(j)|^{p(j)}.
\]
Then, the norm is given by
\[
\|\{x(k)\}\|_{p(\cdot)} = \inf \big\{ \lambda > 0 : \rho_{p(\cdot)}(\{x(k)\}/\lambda) \leq 1 \big\}.
\]
Let us recall that we are denoting the subspace of all sequences that can be obtained using an ANN with 1 hidden layer and activation function $\sigma$ by
\[
H_\sigma := \Big\{ \{y(k)\} : y(k) = \sum_{j=1}^{M} \alpha_j\,\sigma(w_j \cdot k + b_j), \text{ where } \alpha_j, w_j, b_j \in \mathbb{R} \text{ and } M \in \mathbb{N} \Big\}.
\]
Hereafter, we will assume that the activation function $\sigma \in L^\infty(\mathbb{R})$ is non-constant and sigmoidal (existence of the limits at $+\infty$ and $-\infty$). Therefore, $H_\sigma \subseteq \ell^\infty$. More specifically, since $\sigma$ is sigmoidal, $H_\sigma$ is a subspace of the space of convergent sequences, i.e. $H_\sigma \subseteq c$.

Before proceeding to the main results of this section, here we collect some results concerning variable sequence spaces and their relation with $\ell^\infty$, addressed in [ACACUO19].

Proposition 5.6.
Let $p : \mathbb{N} \to [1, +\infty)$ be an exponent function. Then,
• $\ell^{p(\cdot)} \subseteq \ell^\infty$.
• $\ell^{p(\cdot)} = \ell^\infty$ as vector spaces if, and only if, $\|\chi_{\mathbb{N}}\|_{p(\cdot)} < +\infty$.

The condition $\|\chi_{\mathbb{N}}\|_{p(\cdot)} < +\infty$ is related to the divergence of the exponent function $p$; that is, we need that $p(k) \to +\infty$ (as $k \to +\infty$) fast enough so that the series
\[
\sum_{j=1}^{\infty} s^{p(j)}
\]
converges for some $s > 0$. For example, $p(k) = 1 + \log k$ or any other exponent function which diverges faster verifies the previous condition. However, there are other unbounded exponent functions which do not verify the previous condition, such as $p(k) = 1 + \log(1 + \log k)$ or any other exponent function which diverges slower than this.
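The dichotomy between these two exponent functions is visible in the partial sums of $\sum_j s^{p(j)}$. The sketch below is our own numerical illustration (the value $s = 0.2$ is an arbitrary choice): for $p(k) = 1 + \log k$ the terms are $s\,j^{\log s}$ with $\log 0.2 < -1$, so the series converges, while for $p(k) = 1 + \log(1 + \log k)$ the terms decay slower than any power of $1/j$ and the partial sums keep growing.

```python
import math

def partial_sum(p, s, n):
    # partial sum of sum_j s^{p(j)} up to j = n
    return sum(s ** p(j) for j in range(1, n + 1))

fast = lambda j: 1.0 + math.log(j)                  # p(k) = 1 + log k
slow = lambda j: 1.0 + math.log(1.0 + math.log(j))  # p(k) = 1 + log(1 + log k)

s = 0.2
print(partial_sum(fast, s, 10**4), partial_sum(fast, s, 10**5))  # nearly equal
print(partial_sum(slow, s, 10**4), partial_sum(slow, s, 10**5))  # keeps growing
```

In the first case the two partial sums agree to within roughly $10^{-3}$; in the second, the sum up to $10^5$ is several times larger than the sum up to $10^4$.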
Proposition 5.7. Let $p : \mathbb{N} \to [1, +\infty)$ be an exponent function such that $\|\chi_{\mathbb{N}}\|_{p(\cdot)} < +\infty$ and $\sigma \in L^\infty(\mathbb{R})$ sigmoidal and non-constant. Then, $\overline{H_\sigma} = c$, where the closure is taken in the $\|\cdot\|_{p(\cdot)}$ norm.

Proof. Note that, using the hypothesis on the exponent function, $(\ell^{p(\cdot)}, \|\cdot\|_{p(\cdot)})$ and $(\ell^\infty, \|\cdot\|_\infty)$ are the same space with two equivalent norms. On the one hand, since $c$ is closed in $(\ell^\infty, \|\cdot\|_\infty)$, from the obvious inclusion $H_\sigma \subseteq c$ we get that $\overline{H_\sigma} \subseteq c$. On the other hand, to prove the reverse inclusion it is enough to show that the space $c_{00}$ of sequences with a finite number of nonvanishing terms is contained in $\overline{H_\sigma}$, since the closure in $(\ell^\infty, \|\cdot\|_\infty)$ of $c_{00}$ is $c_0$ (the space of sequences which converge to $0$) and we have good approximations of constants in $H_\sigma$ because $\sigma$ is sigmoidal. Finally, $c_{00}$ can be approximated by an analogous discrete argument as in Proposition 4.5. $\square$
The statement of Proposition 5.7 holds for $\sigma$ the rectifier. More precisely, we can prove that $\overline{H_\sigma \cap \ell^{p(\cdot)}} = c$, since using two ReLUs we can obtain a non-constant, sigmoidal, bounded activation function and appeal then to Proposition 5.7.

5.3. Case III: Unbounded exponent function, general case.
Now we analyze the general situation in which the exponent function is unbounded. In this case, the variable Lebesgue space is nonseparable (see Proposition 3.9). Using Lemma 4.8, we deduce that the separability of the function space is necessary for a universal approximation result to hold for sigmoidal, continuous activation functions. Therefore, in the current context, the difficulty of obtaining approximation results is much higher. For example, note that we cannot even approximate a function by its restrictions to compact domains (Proposition 3.10).

Moreover, there is an additional problem associated to finding a precise characterization of the dual, since many subtleties appear in this setting. For more information about the dual in this case, we refer the interested reader to some previous work of one of the authors [ACACUO19].

For the reasons aforementioned, here we just focus on the unidimensional case, having in mind the toy model $\Omega = [1, +\infty)$ and an exponent function $p : \Omega \to \Omega$ given by $p(x) = x$ or $p(x) = [x]$, where $[x]$ denotes the integer part of $x$. In this case, we show a characterization of the subspace of functions which can actually be approximated using an artificial neural network.

First, since we are focusing on the unidimensional case, we start by showing that we can approximate bounded functions with limit at $\infty$.

Proposition 5.9.
Let $\Omega \subseteq \mathbb{R}$ be an unbounded interval and $p : \Omega \to [1, +\infty)$ an unbounded exponent function such that $L^\infty(\Omega) \subset L^{p(\cdot)}(\Omega)$ and which is bounded on every compact subset of $\Omega$. Let $\sigma \in L^\infty(\mathbb{R})$ be a non-constant, sigmoidal activation function. Then, given $f \in L^\infty(\mathbb{R})$ with limit at $\infty$ and $\varepsilon > 0$, there is a function of the form
\[
g_\varepsilon(x) := \sum_{j=1}^{n} \alpha_j\,\sigma(w_j \cdot x + b_j)
\]
such that $\|f - g_\varepsilon\|_{L^{p(\cdot)}(\Omega)} < \varepsilon$.

Proof. Without loss of generality we can assume that $\Omega = [1, +\infty)$. Let $\beta := \lim_{x \to +\infty} f(x)$. For simplicity, we take $\beta = 0$. Indeed, since $\sigma$ is sigmoidal, there is a function $h(x) = \alpha_0\,\sigma(w_0 \cdot x + b_0)$ such that $f - h$ is a bounded function with limit $0$ at $+\infty$, for suitable $\alpha_0$ and $w_0$.

Fix $\varepsilon > 0$. Given $\delta > 0$, there is $L > 0$ such that $|f(x)| < \delta$ if $x > L$. Since $L^\infty(\Omega) \subset L^{p(\cdot)}(\Omega)$, we know that $\|\chi_\Omega\|_{L^{p(\cdot)}(\Omega)} < +\infty$. Hence, we can take
\[
\delta = \frac{\varepsilon}{4\,\|\chi_\Omega\|_{L^{p(\cdot)}(\Omega)}}.
\]
Now, since $p$ is bounded on $[1, M]$ for some $M > L$, we can use Theorem 5.3 and Proposition 4.5 to find $g_\varepsilon$ of the form
\[
g_\varepsilon(x) := \sum_{j=1}^{n} \alpha_j\,\sigma(w_j \cdot x + b_j),
\]
such that
\[
\|f - g_\varepsilon\|_{L^{p(\cdot)}([1,M])} < \frac{\varepsilon}{2},
\]
and $|g_\varepsilon(x)| < \delta$ for $x > M$. Then,
\[
\|f - g_\varepsilon\|_{L^{p(\cdot)}(\Omega)} \leq \|f - g_\varepsilon\|_{L^{p(\cdot)}([1,M])} + \|f - g_\varepsilon\|_{L^{p(\cdot)}([M,+\infty))} < \frac{\varepsilon}{2} + 2\delta\,\|\chi_\Omega\|_{L^{p(\cdot)}(\Omega)} = \varepsilon. \qquad \square
\]

Next, we can proceed to the main result of this paper. In the theorem below, we provide a characterization of the set of functions of a variable Lebesgue space with an unbounded exponent that can be approximated using neural networks.
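Proposition 5.9 can be illustrated numerically: a bounded function with a limit at $+\infty$ is well fitted on a compact window by a small sigmoidal sum. The sketch below is our own illustration, using a plain least-squares fit with arbitrarily placed logistic units (not the constructive argument of the proof); the unit with $w = 0$ is a constant, which is itself a legitimate network unit.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

# target: bounded on [1, +infinity) with limit 1/2 at infinity
f = lambda x: 0.5 + 1.0 / (1.0 + x)

# sigmoidal units sigma(w x + b): logistic steps centred at a grid of knots,
# plus one unit with w = 0 (a constant)
knots = np.linspace(1.0, 20.0, 30)

def design(x):
    return np.hstack([logistic(2.0 * (x[:, None] - knots[None, :])),
                      np.ones((x.size, 1))])

x = np.linspace(1.0, 20.0, 400)
coef, *_ = np.linalg.lstsq(design(x), f(x), rcond=None)

err = np.max(np.abs(f(x) - design(x) @ coef))
print(err)  # small uniform error on the window [1, 20]
```

The number of units, the window and the steepness of the logistic features are all arbitrary choices; the point is only that a modest sigmoidal sum already fits such targets uniformly well on a compact window.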
Theorem 5.10.
Let $\Omega \subseteq \mathbb{R}$ be an unbounded interval and $p : \Omega \to [1, +\infty)$ be an unbounded exponent function such that $L^\infty(\Omega) \subset L^{p(\cdot)}(\Omega)$ and which is bounded on every compact subset of $\Omega$. Let $\sigma \in L^\infty(\mathbb{R})$ be a non-constant, sigmoidal activation function. Then, the following conditions are equivalent for $f \in L^{p(\cdot)}(\Omega)$:

(1) For every $\varepsilon > 0$, there is a function of the form
\[
g_\varepsilon(x) := \sum_{j=1}^{n} \alpha_j\,\sigma(w_j \cdot x + b_j),
\]
such that $\|f - g_\varepsilon\|_{L^{p(\cdot)}(\Omega)} < \varepsilon$.

(2) There is a scalar $\beta \in \mathbb{R}$ such that $\|[f - \beta\chi_\Omega]\|_Q = 0$, where $\|\cdot\|_Q$ is the quotient norm given in Definition 3.13.

Remark 5.11.
The rectifier in Example 3.2 is also a valid activation function for Theorem 5.10, since we can obtain a continuous, bounded activation function as a combination of two ReLUs.

Proof of Theorem 5.10.
Without loss of generality we can assume that $\Omega = [1, +\infty)$. First we show that $(1) \Rightarrow (2)$. We recall that, since $\sigma$ is sigmoidal (see Definition 3.3),
\[
\sigma(t) \to \begin{cases} c_{+\infty} & \text{as } t \to +\infty, \\ c_{-\infty} & \text{as } t \to -\infty. \end{cases}
\]
Therefore, every function of the form $g(x) = \sum_{j=1}^{n} \alpha_j\,\sigma(w_j \cdot x + b_j)$ converges to
\[
\sum_{\{j \,:\, w_j > 0\}} \alpha_j\, c_{+\infty} + \sum_{\{j \,:\, w_j < 0\}} \alpha_j\, c_{-\infty} + \sum_{\{j \,:\, w_j = 0\}} \alpha_j\, \sigma(b_j)
\]
when $x$ tends to $+\infty$.

From hypothesis (1), we know that for every $\varepsilon > 0$ there is a function $g_\varepsilon$ of the previous form such that $\|f - g_\varepsilon\|_{L^{p(\cdot)}(\Omega)} < \varepsilon$. Let us denote by $\beta_\varepsilon$ the limit at $+\infty$ of $g_\varepsilon$. First, we can show that $g_\varepsilon$ and $\beta_\varepsilon \chi_\Omega$ belong to the same class in the quotient space. Indeed, since $\beta_\varepsilon$ is the limit of $g_\varepsilon(x)$ at $+\infty$, for every $\delta > 0$ there is $M > 0$ such that, for $x > M$, it holds that $|g_\varepsilon(x) - \beta_\varepsilon| < \delta$. Then,
\[
\|[g_\varepsilon - \beta_\varepsilon \chi_\Omega]\|_Q = \|[(g_\varepsilon - \beta_\varepsilon)\,\chi_{(M,+\infty)}]\|_Q \leq \|[\delta\,\chi_{(M,+\infty)}]\|_Q = \delta\, w(\Omega),
\]
where the first equality follows from the fact that the quotient space is taken over the closure of the compactly supported functions in $L^{p(\cdot)}$. Note that $w(\Omega)$ is finite. Then, as $\delta > 0$ is arbitrary, $\|[g_\varepsilon - \beta_\varepsilon \chi_\Omega]\|_Q = 0$. Since these two functions are equivalent in the quotient space, we can deduce the following:
\[
\|[f - \beta_\varepsilon \chi_\Omega]\|_Q = \|[f - g_\varepsilon]\|_Q \leq \|f - g_\varepsilon\|_{L^{p(\cdot)}(\Omega)} < \varepsilon.
\]
Next, we claim that the family $\{\beta_\varepsilon\}$ is bounded. Indeed, this holds due to the fact that
\[
|\beta_{\varepsilon_1} - \beta_{\varepsilon_2}|\, w(\Omega) = \|[(\beta_{\varepsilon_1} - \beta_{\varepsilon_2})\,\chi_\Omega]\|_Q \leq \|[\beta_{\varepsilon_1}\chi_\Omega - f]\|_Q + \|[f - \beta_{\varepsilon_2}\chi_\Omega]\|_Q < \varepsilon_1 + \varepsilon_2,
\]
and thus
\[
|\beta_{\varepsilon_1} - \beta_{\varepsilon_2}| \leq \frac{\varepsilon_1 + \varepsilon_2}{w(\Omega)}.
\]
Since $\{\beta_\varepsilon\}$ is bounded, there is a subsequence $\{\beta_{\varepsilon_n}\}$, with $\varepsilon_n$ tending to $0$ and $\beta_{\varepsilon_n}$ converging to some scalar $\beta$ when $n$ tends to $+\infty$. To conclude, we now have to show that
\[
\|[f - \beta\chi_\Omega]\|_Q = 0,
\]
finishing thus the proof of (2). Indeed,
\[
\|[f - \beta\chi_\Omega]\|_Q \leq \|[f - \beta_{\varepsilon_n}\chi_\Omega]\|_Q + \|[(\beta_{\varepsilon_n} - \beta)\,\chi_\Omega]\|_Q \leq \varepsilon_n + |\beta_{\varepsilon_n} - \beta|\, w(\Omega),
\]
which clearly vanishes when $n$ tends to $+\infty$.
Now we prove $(2) \Rightarrow (1)$. Fix $\varepsilon > 0$. From hypothesis (2), we can find $f_c \in L^{p(\cdot)}_c$ such that
\[
\|(f - \beta\chi_\Omega) - f_c\|_{L^{p(\cdot)}(\Omega)} < \frac{\varepsilon}{3}.
\]
The support of $f_c$ is contained in a compact of the form $[1, L]$ for some $L$. Since $p$ is bounded on $[1, L]$ by hypothesis, using Theorem 5.3 we can find a function $g_1$ of the form $g_1 = \sum_{j=1}^{n} \alpha_j\,\sigma(w_j \cdot x + b_j)$ such that
\[
\|f_c - g_1\|_{L^{p(\cdot)}([1,L])} < \frac{\varepsilon}{3}.
\]
Since $\beta\chi_\Omega - g_1\chi_{(L,+\infty)}$ is clearly a bounded function whose limit exists (and is finite) at $+\infty$, we can appeal to Proposition 5.9 to find a function $g_2$ of the form $g_2 = \sum_{j=n+1}^{n+m} \alpha_j\,\sigma(w_j \cdot x + b_j)$ such that
\[
\|(\beta\chi_\Omega - g_1\chi_{(L,+\infty)}) - g_2\|_{L^{p(\cdot)}(\Omega)} < \frac{\varepsilon}{3}.
\]
Then, we can take $g_\varepsilon$ in the statement of the theorem to be
\[
g_\varepsilon(x) := g_1(x) + g_2(x) = \sum_{j=1}^{n+m} \alpha_j\,\sigma(w_j \cdot x + b_j),
\]
since
\[
\|f - g_\varepsilon\|_{L^{p(\cdot)}(\Omega)} \leq \|(f - \beta\chi_\Omega) - f_c\|_{L^{p(\cdot)}(\Omega)} + \|f_c - g_1\|_{L^{p(\cdot)}([1,L])} + \|(\beta\chi_\Omega - g_1\chi_{(L,+\infty)}) - g_2\|_{L^{p(\cdot)}(\Omega)} < \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \varepsilon. \qquad \square
\]
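Condition (2) says that, modulo compactly supported functions, $f$ is the constant $\beta$. As our own numerical illustration (with the toy exponent $p(x) = x$ on $\Omega = [1, +\infty)$ mentioned above, and all discretization parameters chosen arbitrarily), the sketch below shows that for $f(x) = \beta + e^{-x}$ the variable-exponent norm of $(f - \beta)\chi_{[L,+\infty)}$ tends to $0$ as $L$ grows, which is the mechanism behind $\|[f - \beta\chi_\Omega]\|_Q = 0$.

```python
import numpy as np

def tail_norm(L, span=60.0, pts=60001, tol=1e-10):
    """Luxemburg norm of e^{-x} restricted to [L, L + span] with p(x) = x;
    the integrand decays so fast that the truncation at L + span is harmless."""
    x = np.linspace(L, L + span, pts)
    def modular(lam):
        # integral of (e^{-x}/lam)^x dx, computed in log-space; the clip
        # only prevents overflow for absurdly small trial values of lam
        expo = x * (-x - np.log(lam))
        return np.mean(np.exp(np.clip(expo, None, 700.0))) * span
    lo, hi = tol, 1.0
    while modular(hi) > 1.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if modular(mid) > 1.0 else (lo, mid)
    return hi

norms = [tail_norm(L) for L in (1.0, 2.0, 4.0, 8.0)]
print(norms)  # strictly decreasing towards 0
```

Nothing here computes the quotient norm itself (that would require an infimum over all compactly supported representatives); it only exhibits the vanishing tail that makes the quotient class of $f - \beta\chi_\Omega$ trivial.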
The authors want to thank Wen-Liang Hwang for his valuable comments on an earlier draft, and JO would like to thank both him and the Institute of Information Science in Taipei for the hospitality during his research stay in the summer of 2018, when the idea for this project was born. AC acknowledges funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy EXC-2111 390814868. JO is partially supported by the grant MTM2017-83496-P from the Spanish Ministry of Economy and Competitiveness and
through the “Severo Ochoa Programme for Centres of Excellence in R&D” (SEV-2015-0554). This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 777822.
References

[ACACUO19] A. Amenta, J. M. Conde-Alonso, D. Cruz-Uribe, and J. Ocáriz. On the dual of variable Lebesgue spaces with unbounded exponent. preprint, arXiv:1909.05987, 2019.
[ALdTRLV18] J. Almira, P. López-de Teruel, D. Romero-López, and F. Voigtlaender. Some negative results for single layer and multilayer feedforward neural networks. preprint, arXiv:1810.10032, 2018.
[AN20] M. Ali and A. Nouy. Approximation with tensor networks. Part I: Approximation spaces. preprint, arXiv:2007.00118, 2020.
[Arn57] V. I. Arnold. On functions of three variables. Doklady Akademii Nauk USSR, 114:679–681, 1957.
[Bar94] A. Barron. Approximation and estimation bounds for artificial neural networks. Mach. Learn., 14(1):115–133, 1994.
[BGKP19] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci., 1(1):8–45, 2019.
[Bis06] C. Bishop. Pattern Recognition and Machine Learning. Springer New York, 2006.
[CUF13] D. V. Cruz-Uribe and A. Fiorenza. Variable Lebesgue Spaces: Foundations and Harmonic Analysis. Springer Science & Business Media, 2013.
[Cyb89] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[GR20] I. Gühring and M. Raslan. Approximation rates for neural networks with encodable weights in smoothness spaces. preprint, arXiv:2006.16822, 2020.
[GS09] S. Giulini and M. Sanguineti. Approximation schemes for functional optimization problems. J. Optim. Theory Appl., 140:33–54, 2009.
[Hay98] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 1998.
[HH20] W. Hwang and A. Heinecke. Un-rectifying non-linear networks for signal representation. IEEE Transactions on Signal Processing, 68:196–210, 2020.
[HHH20] A. Heinecke, J. Ho, and W. Hwang. Refinement and universal approximation via sparsely connected ReLU convolution nets. IEEE Signal Processing Letters, 2020.
[Hor91] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
[HSW89] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[KL19] P. Kidger and T. Lyons. Universal approximation with deep narrow networks. preprint, arXiv:1905.08539, 2019.
[Kol57] A. N. Kolmogorov. On the representation of continuous functions of many variables by superpositions of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR, 114:953–956, 1957.
[Kra19] A. Kratsios. The universal approximation property: Characterizations, existence, and a canonical topology for deep-learning. preprint, arXiv:1910.03344, 2019.
[KvdS93] B. Kröse and P. van der Smagt. An introduction to neural networks. J. Comput. Sci., 48, 01 1993.
[Lin19] S.-B. Lin. Generalization and expressivity for deep nets. IEEE T. Neur. Net. Lear., 30(5):1392–1406, 2019.
[LLPS93] M. Leshno, V. Ya. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.
[LTY20] B. Li, S. Tang, and H. Yu. Better approximations of high dimensional smooth functions by deep neural networks with rectified power units. Commun. in Comp. Phys., 27:379–411, 2020.
[Mha96] H. N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural Comput., 8(1):164–177, 1996.
[MP43] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
[OK19] I. Ohn and Y. Kim. Smooth function approximation by deep neural networks with general activation functions. Entropy, 21(7):627, 2019.
[Pet99] P. P. Petrushev. Approximation by ridge functions and neural networks. SIAM Journal on Mathematical Analysis, 30:155–189, 1999.
[Pin97] A. Pinkus. Approximation by ridge functions. In A. Le Méhauté, C. Rabut, and L. L. Schumaker, editors, Surface Fitting and Multiresolution Methods, pages 1–14. Vanderbilt University Press, Nashville, TN, 1997.
[Pin99] A. Pinkus. Approximation theory of the MLP model in neural networks. Acta Numer., 8:143–195, 1999.
[PS91] J. Park and I. W. Sandberg. Universal approximation using radial-basis-function networks. Neural Computation, 3(2):246–257, 1991.
[PV18] P. Petersen and F. Voigtländer. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Netw., 108:296–330, 2018.
[Roj96] R. Rojas. Neural Networks - A Systematic Introduction. Springer-Verlag New York, 1996.
[Ros58] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
[Rud91] W. Rudin. Functional Analysis. McGraw-Hill, 1991.
[San08] M. Sanguineti. Universal approximation by ridge computational models and neural networks: A survey. The Open Applied Mathematics Journal, 2(1):31–58, 2008.
[SCC18] U. Shaham, A. Cloninger, and R. Coifman. Provable approximation properties for deep neural networks. Appl. Comput. Harmon. Anal., 44(3):537–557, 2018.
[SH17] J. Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. preprint, arXiv:1708.06633, 2017.
[ST98] F. Scarselli and A. C. Tsoi. Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural Networks, 11(1):15–37, 1998.
[Suz19] T. Suzuki. Adaptivity of deep ReLU networks for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality, 2019.
[TKG+03] D. Tikk, L. T. Kóczy, T. D. Gedeon, et al. A survey on universal approximation and its limits in soft computing techniques. International Journal of Approximate Reasoning, 33(2):185–202, 2003.
[TLY19] S. Tang, B. Li, and H. Yu. Efficient and stable constructions of deep neural networks with rectified power units using Chebyshev approximations. preprint, arXiv:1911.05467, 2019.
[Wid90] B. Widrow. 30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation. Proceedings of the IEEE, 78(9):1415–1442, 1990.
[Yar17] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw., 94:103–114, 2017.
E-mail address: [email protected]

Department of Mathematics, Technische Universität München, 85748 Garching, Germany and Munich Center for Quantum Science and Technology (MCQST), München, Germany
E-mail address: [email protected]