Approximation with Neural Networks in Variable Lebesgue Spaces
arXiv preprint [math.FA]

Ángela Capel and Jesús Ocáriz
Abstract.
This paper concerns the universal approximation property of neural networks in variable Lebesgue spaces. We show that, whenever the exponent function of the space is bounded, every function can be approximated by shallow neural networks to any desired accuracy. This result subsequently yields a characterization of the universality of the approximation in terms of the boundedness of the exponent function. Furthermore, whenever the exponent is unbounded, we obtain some characterization results for the subspace of functions that can be approximated.
Contents
1. Introduction
2. Main results
3. Preliminaries
3.1. Neural networks
3.2. Variable Lebesgue spaces
4. Universal approximation in function spaces
4.1. Space of continuous functions
4.2. Negative results
4.3. Locally integrable functions
5. Approximation in variable Lebesgue spaces
5.1. Case I: Bounded exponent function
5.2. Case II: Unbounded exponent function, discrete case
5.3. Case III: Unbounded exponent function, general case
References

1. Introduction
Date: July 9, 2020.

Artificial neural networks are a model created with the purpose of imitating the behavior of biological neural networks using digital computing. Their origins trace back to [MP43] and [Ros58], and since then numerous applications have been found in a wide range of fields, from machine learning to computer vision, speech recognition or mathematical finance, among many others. A major problem in the theory of neural networks is that of approximating, within a desired accuracy, a generic class of functions using neural networks, initially motivated by the behavior of neural networks observed in the Representation Theorem due to Arnold [Arn57] and Kolmogorov [Kol57] and the aim of providing a theoretical justification for it.

The starting point of approximation theory for neural networks was the Universal Approximation Theorem of [Cyb89] and [HSW89], which shows that every continuous function on a compact set can be uniformly approximated by shallow neural networks with a continuous, non-polynomial activation function. Subsequent extensions of this result addressed the analogous problem for Lebesgue spaces with a finite exponent [Hor91] and locally integrable spaces [PS91], while some others also considered the derivatives of the neural networks, showing that shallow neural networks with a sufficiently smooth activation function and unrestricted width are dense in the space of sufficiently differentiable functions [Pin99]. Soon after this wave of results for shallow neural networks, various surveys on the topic appeared in the literature, such as [Pin97], [ST98], [TKG+03] and [San08].

In the last years, many directions have been explored in the approximation theory for both shallow and deep neural networks. For neural networks with ReLU activation functions, there are a number of recent papers concerning various topics: approximation for Besov spaces [Suz19], regression [SH17] and optimization [GS09] problems, restriction to encodable weights [PV18], negative results of approximation [ALdTRLV18], estimates for the errors obtained in the approximation [Pet99], [Yar17], or, in general, deep neural networks [KL19], [SCC18], among many others. A simple and inspiring new proof of the universality theory for deep neural networks can be found in [HHH20]. Moreover, in [HH20] the authors develop a new technique named "un-rectifying", which transforms piece-wise continuous non-linear activation functions into piece-wise continuous linear functions, and then use it to show that ReLU networks and MaxLU networks are indeed deep trees. Furthermore, for more regular activation functions there are also numerous recent articles, namely [Bar94], [BGKP19], [Lin19], [LTY20], [Mha96], [OK19], [TLY19], most of which focus on deep neural networks. Some other directions are currently being studied for the problem of approximation too, such as the topological approach presented in [Kra19], the application of these results to finding solutions of partial differential equations [GR20], or the comparison to approximation with tensor networks [AN20].

In this paper, we take a step forward in the theory of approximation with neural networks and address the problem of approximating, with arbitrary precision, any function in a variable Lebesgue space using shallow neural networks with various activation functions. Variable Lebesgue spaces are in particular locally integrable spaces.
Even though there exist some previous results of universal approximation in locally integrable spaces, they depend on a metric defined in such a way that the behavior of a function away from a compact set is always negligible. Moreover, since that distance is not constructed from a norm, the approximations are not stable under dilations. In this manuscript, we overcome this situation and prove our approximation results employing the distance determined by the usual norm of the space.

Variable Lebesgue spaces are a generalization of Lebesgue spaces that might contain functions which do not belong to any Lebesgue space [CUF13]. In particular, a variable Lebesgue space might include all the bounded functions even if the domain is not compact. Therefore, a natural problem that arises in this setting is that of approximating functions on non-compact domains; for example, continuous functions defined on an unbounded interval with a suitable asymptotic limit. This type of function may appear as the representation of a quantity that follows a diffusion process in time (like the temperature at a certain point in a closed system).

More specifically, in this paper we show approximation results with neural networks for variable Lebesgue spaces depending on the boundedness of their exponent function. If the exponent of the space is essentially bounded, we show that a result of universal approximation holds, yielding thus an analogous behavior to that of usual Lebesgue spaces. On the other hand, if the exponent is unbounded, the situation is much more subtle, but we can characterize in some cases the subspace of functions which can be approximated with neural networks, relying on some results of [ACACUO19]. We first address the simpler case of variable sequence spaces and subsequently lift our results to a more general domain. Our results hold for most of the activation functions present in the literature, namely any sigmoidal function (logistic sigmoid, hyperbolic tangent, Heaviside function, etc.) or the rectifier function.

The outline of the manuscript is the following: In Section 2, we present an informal exposition of the main results of the present article. Important notions and results on neural networks and variable Lebesgue spaces are reviewed in Section 3. In Section 4, we collect some of the previous results on universal approximation in certain function spaces and provide several improvements or generalizations for them. Finally, in Section 5, we present our results on approximation with neural networks in variable Lebesgue spaces.
2. Main results

A shallow neural network is described by a function $g : \mathbb{R}^d \to \mathbb{R}$ given by
\[ g(x) = \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j), \]
where $x \in \mathbb{R}^d$ represents the input to the neural network, $g(x) \in \mathbb{R}$ the output, $w_j \in \mathbb{R}^d$ and $\alpha_j \in \mathbb{R}$ are the weights between the first and second layer, and the second and third layer, respectively, $b_j \in \mathbb{R}$ are the biases, $\sigma : \mathbb{R} \to \mathbb{R}$ is the activation function and $M$ the height. The subspace generated by such functions will be denoted by $\mathcal{H}_\sigma$.

Given an activation function $\sigma$ and a normed function space $(X, \|\cdot\|)$, the Universal Approximation (UA) property for shallow neural networks can be formally stated as follows: for every $f \in X$ and every $\varepsilon > 0$, there is a function $g \in \mathcal{H}_\sigma$ such that $\|f - g\| < \varepsilon$.

The main results of this article concern the UA property for variable Lebesgue spaces, i.e. spaces of the form
\[ L^{p(\cdot)}(\Omega) := \{ f : \Omega \to \mathbb{R} \ : \ f \text{ measurable and } \|f\|_{p(\cdot)} < +\infty \}, \]
for an open $\Omega \subseteq \mathbb{R}^d$, where $p : \Omega \to [1, +\infty)$ is an exponent function and the norm is given by
\[ \|f\|_{p(\cdot)} := \inf \left\{ \lambda > 0 \ : \ \int_\Omega \left( \frac{|f(x)|}{\lambda} \right)^{p(x)} \mathrm{d}x \leq 1 \right\}. \]
The first of these results (which appears in the main text as Theorem 5.3) shows that universal approximation holds whenever the exponent function of the space is bounded.
Theorem. Let $\Omega \subseteq \mathbb{R}^d$, consider $p : \Omega \to [1, +\infty)$ a bounded exponent function and $\sigma : \mathbb{R} \to \mathbb{R}$ discriminatory for $L^{p(\cdot)}(K)$ for every compact $K \subset \Omega$. Then, truncated finite sums of the form
\[ g(x) = \begin{cases} \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j), & x \in K, \\ 0, & x \in \Omega \setminus K, \end{cases} \]
with $K$ compact, are dense in $L^{p(\cdot)}(\Omega)$.

The condition imposed on $\sigma$ essentially means that if a functional over $L^{p(\cdot)}(K)$ vanishes on $\mathcal{H}_\sigma$, then the functional is identically null. Most of the activation functions considered in the literature verify this property, such as, for example, any continuous function different from an algebraic polynomial.

Whenever the exponent function of a variable Lebesgue space is unbounded, the situation is much more complex. Indeed, as a consequence of the separability of the space and the previous theorem, we provide a characterization of universal approximation in terms of the boundedness of the exponent function, which appears in the main text as Corollary 5.5.

Corollary.
Let $\Omega \subseteq \mathbb{R}^d$, $p : \Omega \to [1, +\infty)$ an exponent function and $\sigma : \mathbb{R} \to \mathbb{R}$ continuous and with finite limits at $+\infty$ and $-\infty$. Then, UA holds for $L^{p(\cdot)}(\Omega)$ if, and only if, $p$ is bounded.

Subsequently, since the UA property fails for unbounded exponent functions, we study conditions to describe the subspace of functions which can be approximated with neural networks. In the following result, the characterizing condition is a generalization of the concept of having an asymptotic limit.
Theorem.
Let $\Omega \subseteq \mathbb{R}$ be an unbounded interval and $p : \Omega \to [1, +\infty)$ an unbounded exponent function such that $L^\infty(\Omega) \subset L^{p(\cdot)}(\Omega)$ and $p$ is bounded on every compact subset of $\Omega$. Let $\sigma \in L^\infty(\mathbb{R})$ be a non-constant, sigmoidal activation function. Then, the following conditions are equivalent for $f \in L^{p(\cdot)}(\Omega)$:

(1) For every $\varepsilon > 0$, there is a $g_\varepsilon \in \mathcal{H}_\sigma$ such that $\|f - g_\varepsilon\|_{L^{p(\cdot)}(\Omega)} < \varepsilon$.

(2) There is a scalar $\beta \in \mathbb{R}$ such that $\|[f - \beta \chi_\Omega]\|_Q = 0$, where $\|\cdot\|_Q$ is the quotient norm given in Definition 3.13.

This theorem appears as Theorem 5.10 in the main text. In a nutshell, the quotient norm encodes the behavior of the function at $\infty$. Therefore, the functions that can be approximated are those which converge, in some sense, at $\infty$.

Finally, prior to these results, we recall and adapt some classical results of universal approximation for certain function spaces. More specifically, we address the space of continuous functions over a compact set and locally integrable spaces. For the former, we prove an extension of the original result of universal approximation from continuous functions on a compact set to the space of functions which vanish at infinity, in the unidimensional case, as well as a negative result of universal approximation, whereas for the latter we extend a result for radial activation functions to non-radial ones. Moreover, even though variable Lebesgue spaces are locally integrable spaces, our contribution for those spaces is a major improvement, because our concept of distance comes from a norm and, therefore, approximations in this context are more meaningful, since for example we can control the error of dilation.

3. Preliminaries
3.1. Neural networks.
In this subsection, we introduce the concepts and properties associated to neural networks that we will need for the rest of the paper. Some references for the mathematical formulation of neural networks in this context are [Hay98], [Bis06], [KvdS93] or [Roj96], among many others.

In this paper, we focus on feedforward artificial neural networks (denoted ANN hereafter), the simplest model for neural networks. A feedforward artificial neural network contains several nodes, arranged in layers serving differently depending on their position, namely input, hidden and output nodes. Moreover, this type of neural network has only a single input layer, which provides information from the environment to the network, and a single output layer, which transmits information from the network to the environment.

In the past, different classes of ANN have been considered to show approximation in generic classes of functions. It is known that ANN with no hidden layer are not capable of approximating generic, non-linear, continuous functions [Wid90]. On the opposite side, ANN with two or more hidden layers, known as deep neural networks, have given rise to a broad field of research in the past years. However, for simplicity we focus specifically on the case of one hidden layer, commonly known as shallow neural networks, for which typical results on universal approximation have been studied in the past.
Definition 3.1. A feedforward shallow artificial neural network (ANN) can be described by a finite linear combination of the form
\[ g(x) = \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j), \]
where $x \in \mathbb{R}^d$ represents the input to the neural network, $g(x) \in \mathbb{R}$ the output, $w_j \in \mathbb{R}^d$ and $\alpha_j \in \mathbb{R}$ are the weights between the first and second layer, and the second and third layer, respectively, $b_j \in \mathbb{R}$ are the biases, $\sigma : \mathbb{R} \to \mathbb{R}$ is the activation function and $M$ the height.

The subspace of all functions that can be obtained using an ANN with activation function $\sigma$ will be denoted hereafter by
\[ \mathcal{H}_\sigma := \left\{ g(x) = \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j) \ : \ w_j \in \mathbb{R}^d, \ \alpha_j, b_j \in \mathbb{R} \text{ and } M \in \mathbb{N} \right\}. \]
Throughout the text, we will impose different conditions on $\sigma$ to obtain diverse results, since the role of the activation function is essential in the results of approximation of generic function spaces. Here we summarize some of the most typical examples of activation functions.

Example 3.2.
Examples of activation functions:

• The Heaviside step function, which is piece-wise defined over $\mathbb{R}$ in the following way:
\[ \sigma(x) = \begin{cases} 1 & \text{if } x > 0, \\ 0 & \text{if } x \leq 0. \end{cases} \]

• The rectifier function (ReLU), defined as:
\[ \sigma(x) = \begin{cases} x & \text{if } x > 0, \\ 0 & \text{if } x \leq 0. \end{cases} \]

• The logistic sigmoid, given by:
\[ \sigma(x) = \frac{1}{1 + e^{-x}}. \]

• The hyperbolic tangent, which is an affinely transformed logistic sigmoid:
\[ \sigma(x) = \tanh(x). \]

For further examples of activation functions, see [ST98] for instance.
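Although not part of the original exposition, the activations above and the shallow networks of Definition 3.1 are straightforward to realize in code. The following Python sketch, with arbitrary illustrative weights of our choosing, evaluates $g(x) = \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j)$ for any of the listed activation functions.

```python
import math

# The four activation functions of Example 3.2.
def heaviside(x): return 1.0 if x > 0 else 0.0
def relu(x):      return x if x > 0 else 0.0
def logistic(x):  return 1.0 / (1.0 + math.exp(-x))
def tanh_act(x):  return math.tanh(x)

def shallow_net(x, params, sigma):
    """Evaluate g(x) = sum_j alpha_j * sigma(w_j . x + b_j) for x in R^d.

    `params` is a list of (alpha_j, w_j, b_j) triples, with w_j a list of
    length d; the particular values used below are arbitrary illustrations.
    """
    return sum(a * sigma(sum(wi * xi for wi, xi in zip(w, x)) + b)
               for a, w, b in params)

# A height-2 network on R^2 with the logistic sigmoid.
params = [(1.0, [2.0, -1.0], 0.5), (-0.5, [0.0, 1.0], 0.0)]
value = shallow_net([1.0, 1.0], params, logistic)
```

Any of the four activations can be passed as `sigma`; with `heaviside` the resulting $g$ is a step function, while with `logistic` or `tanh_act` it is smooth.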
We now introduce a property concerning the activation function which will appear often throughout the rest of the text.

Definition 3.3. Given an activation function $\sigma : \mathbb{R} \to \mathbb{R}$, we say that $\sigma$ is sigmoidal if
\[ \sigma(t) \to \begin{cases} c_{+\infty} & \text{as } t \to +\infty, \\ c_{-\infty} & \text{as } t \to -\infty, \end{cases} \]
where the constants $c_{+\infty}$ and $c_{-\infty}$ are finite. Usually, $c_{+\infty}$ is taken to be $1$, and $c_{-\infty}$ is taken to be $-1$ or $0$.
Note that all the activation functions presented in Example 3.2 satisfy this condition. Indeed, even though this goes against the intuition in the case of the ReLU, we can obtain a continuous and bounded activation function by properly combining two ReLUs: for instance, $x \mapsto \mathrm{ReLU}(x) - \mathrm{ReLU}(x-1)$ is sigmoidal.
3.2. Variable Lebesgue spaces.
In this subsection, we introduce all the necessary notions and previous results on variable Lebesgue spaces needed to understand the results on approximation with neural networks in those spaces that will be presented in Section 5. A good reference for these spaces and their applications to harmonic analysis is [CUF13].

For simplicity, the measurable space we consider hereafter is $(\Omega, \mathcal{B}, |\cdot|)$, where $\Omega \subseteq \mathbb{R}^d$, $\mathcal{B}$ is the $\sigma$-algebra of measurable sets and $|\cdot|$ is the Lebesgue measure. Other measurable spaces could be considered, but here we restrict to this special case, due to its importance for applications.

The motivation for the study of variable Lebesgue spaces is the following: consider the real function $g(x) := 1/\sqrt{|x|}$, which presents a singularity at $x = 0$. This natural function does not belong to any $L^p(\mathbb{R})$ for $1 \leq p \leq +\infty$, because it either grows too fast at the origin or decays too slowly at infinity. An idea to solve this matter and find a Lebesgue space to which it could belong would be to split the domain. With this in mind, one can check, for instance, that $g \in L^1([-1,1])$ and $g \in L^3(\mathbb{R} \setminus [-1,1])$. This idea of letting the integrability exponent vary with the point of the domain leads to the notion of variable Lebesgue spaces.

Definition 3.5. An exponent function is a measurable function $p : \Omega \to [1, +\infty)$.

Note that the exponent function always takes a finite value at each point of the domain. It is possible to generalize the concept to allow the value $+\infty$, but this is just a technical issue in which we are not particularly interested here.
Definition 3.6. Given a measurable function $f$ and an exponent function $p$, the norm is defined by
\[ \|f\|_{p(\cdot)} := \inf \left\{ \lambda > 0 \ : \ \int_\Omega \left( \frac{|f(x)|}{\lambda} \right)^{p(x)} \mathrm{d}x \leq 1 \right\}. \]

On some occasions, we might denote the previous norm by $\|f\|_{L^{p(\cdot)}(\Omega)}$ whenever it is necessary to emphasize the domain of the functions. There exist in the literature different ways to define the norm for these spaces. The previous expression is obtained through the Luxemburg norm, but it is also possible to use the Amemiya norm, or even to employ a modular different from the one used above, i.e. $\rho(f) = \int_\Omega |f(x)|^{p(x)} \, \mathrm{d}x$.
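As a quick illustration of ours (not from the original text), the Luxemburg norm can be approximated numerically: the modular $\lambda \mapsto \int_\Omega (|f(x)|/\lambda)^{p(x)} \, \mathrm{d}x$ is nonincreasing in $\lambda$, so the infimum can be located by bisection. The quadrature, search interval and test exponent below are ad hoc choices.

```python
def modular(f, p, lam, a=0.0, b=1.0, n=2000):
    """Midpoint-rule approximation of the modular: integral over (a, b)
    of (|f(x)| / lam) ** p(x) dx."""
    h = (b - a) / n
    xs = (a + (i + 0.5) * h for i in range(n))
    return sum((abs(f(x)) / lam) ** p(x) * h for x in xs)

def luxemburg_norm(f, p, lo=1e-9, hi=1e9, iters=100):
    """Geometric bisection for inf{lam > 0 : modular(f, p, lam) <= 1}."""
    for _ in range(iters):
        mid = (lo * hi) ** 0.5
        if modular(f, p, mid) <= 1.0:
            hi = mid   # mid is feasible: the infimum is at most mid
        else:
            lo = mid   # mid is infeasible: the infimum exceeds mid
    return hi

# Sanity check on Ω = (0, 1): a constant exponent p ≡ 2 must recover the
# classical L^2 norm, e.g. ||2||_{L^2(0,1)} = 2.
norm = luxemburg_norm(lambda x: 2.0, lambda x: 2.0)
```

The same routine accepts a genuinely variable exponent, e.g. `p = lambda x: 1.0 + x`, at the cost of quadrature error only.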
We can now introduce the formal definition of variable Lebesgue spaces as follows.

Definition 3.7. Let $p : \Omega \to [1, +\infty)$ be an exponent function and consider the space of functions given by
\[ L^{p(\cdot)}(\Omega) := \{ f : \Omega \to \mathbb{R} \ : \ f \text{ measurable and } \|f\|_{p(\cdot)} < +\infty \}. \]
Then $(L^{p(\cdot)}(\Omega), \|\cdot\|_{p(\cdot)})$ is a Banach space, which we call a variable Lebesgue space.

In particular, if the exponent function is constantly equal to a value $p$, we recover the classical Banach space $(L^p(\Omega), \|\cdot\|_p)$ with $1 \leq p < \infty$, showing thus that the previous definition is a generalized version of the so-called Lebesgue spaces. When the domain is not relevant, we might denote the previous spaces by $L^{p(\cdot)}$.

Next, we present an essential definition which allows us to split the behavior of variable Lebesgue spaces into two different cases.

Definition 3.8.
We say that an exponent function $p : \Omega \to [1, +\infty)$ is bounded when the essential supremum of $p$ verifies
\[ p_+ := \operatorname*{ess\,sup}_{x \in \Omega} p(x) < +\infty, \]
i.e. $p$ is uniformly bounded except possibly on a set of zero Lebesgue measure. If $p$ is not bounded, we say that it is unbounded.
The importance of the boundedness of the exponent function resides in the fact that variable Lebesgue spaces with a bounded exponent function behave similarly to classical Lebesgue spaces. Moreover, they verify certain important properties, like the ones presented below, whose proofs can be found in [CUF13].
Proposition 3.9.
The variable Lebesgue space $(L^{p(\cdot)}, \|\cdot\|_{p(\cdot)})$ is separable if, and only if, the exponent function is bounded.

Therefore, whenever the exponent $p$ is bounded, the space $L^{p(\cdot)}$ is separable. For the next property, we need to consider the subspaces of $L^{p(\cdot)}$ of compactly supported functions and of smooth compactly supported functions, denoted by $L^{p(\cdot)}_c$ and $C^\infty_c$, respectively. One can prove the following characterization of bounded exponents for variable Lebesgue spaces, concerning the density of these subspaces in $L^{p(\cdot)}$.

Proposition 3.10. $\left( L^{p(\cdot)}_c, \|\cdot\|_{p(\cdot)} \right)$ is dense in $L^{p(\cdot)}$ if, and only if, the exponent function is bounded. Moreover, $\left( C^\infty_c, \|\cdot\|_{p(\cdot)} \right)$ is dense in $L^{p(\cdot)}$ if, and only if, the exponent function is bounded.

Next, we present an essential result concerning the duality of this class of spaces. Before stating it, we recall that for any exponent function $p : \Omega \to [1, +\infty)$, its Hölder conjugate is the exponent function $q$ defined point-wise by
\[ \frac{1}{p(x)} + \frac{1}{q(x)} = 1, \]
for almost every $x \in \Omega$. Note that if $p(x) = 1$, then $q(x) = +\infty$. Although this situation does not completely fit the definition of exponent function presented in these lines, the function $q$ can also be interpreted as an exponent function in a generalized sense.
The dual space of $(L^{p(\cdot)}, \|\cdot\|_{p(\cdot)})$ is $(L^{q(\cdot)}, \|\cdot\|_{q(\cdot)})$, where $q$ is the point-wise Hölder conjugate of $p$, if, and only if, the exponent function $p$ is bounded. Furthermore, if $p$ is bounded, there is an equivalence between functionals $T$ and functions $g \in L^{q(\cdot)}$ through the identity
\[ T(f) = \int_\Omega f(x) \, g(x) \, \mathrm{d}x, \]
for every $f \in L^{p(\cdot)}$.

Moreover, still in the case in which the exponent function is bounded, there are also some inclusions between variable Lebesgue spaces whenever the domain is compact.
Proposition 3.12.
Let $K$ be compact and let $p_1$ and $p_2$ be two bounded exponent functions verifying $p_1(x) \leq p_2(x)$ for every $x \in K$. Then $L^{p_2(\cdot)}(K) \subseteq L^{p_1(\cdot)}(K)$.

On the other hand, if the exponent function is unbounded, the situation is much more subtle. Variable Lebesgue spaces with unbounded exponent functions are much more complex in general: they are non-separable, their elements cannot be approximated by compactly supported functions, and the dual is more complicated (see previous work of one of the authors in [ACACUO19]). In the following, we present some notions needed to deal with unbounded exponent functions.
Definition 3.13.
Let $p : \Omega \to [1, +\infty)$ be an unbounded exponent function and consider the quotient of $L^{p(\cdot)}$ by the closure of its subspace of compactly supported functions, i.e.
\[ L^{p(\cdot)}_Q := L^{p(\cdot)} / \overline{L^{p(\cdot)}_c}. \]
This quotient is equipped with the quotient norm $\|\cdot\|_Q$, which is defined by
\[ \|[f]\|_Q := \inf_{g \in L^{p(\cdot)}_c} \|f - g\|_{p(\cdot)}. \]

Note that, when $p$ is unbounded, this quotient is nontrivial (Proposition 3.10). This way of computing the quotient norm is impractical. To overcome this issue, the following useful characterization was proven in [ACACUO19].
Proposition 3.14.
Let $p : \Omega \to [1, +\infty)$ be an unbounded exponent function. Then,
\[ \|[f]\|_Q = \inf \left\{ \lambda > 0 \ : \ \int_\Omega \left( \frac{|f(x)|}{\lambda} \right)^{p(x)} \mathrm{d}x < +\infty \right\}. \]

Note that this expression has the same flavor as the norm of $L^{p(\cdot)}$. Furthermore, it is easy to see that $\|[f]\|_Q \leq \|f\|_{p(\cdot)}$, because of the inclusion between the corresponding sets of values of $\lambda$.

To conclude this subsection on variable Lebesgue spaces, we include a result which relates $L^{p(\cdot)}$ spaces, when $p$ is unbounded, to the space of bounded functions. For that, given $A \subseteq \Omega$, we define the map $w(A) := \|[\chi_A]\|_Q$.
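To make the characterization above concrete, the following numerical sketch (our illustration, with an ad hoc quadrature) probes the set $\{\lambda > 0 : \int_\Omega (|f(x)|/\lambda)^{p(x)} \, \mathrm{d}x < +\infty\}$ for $f = \chi_\Omega$, $\Omega = [1, +\infty)$ and $p(x) = x$: the partial integrals $\int_1^T \lambda^{-x} \, \mathrm{d}x$ stabilize as $T$ grows when $\lambda > 1$ and blow up when $\lambda < 1$, consistent with $\|[\chi_\Omega]\|_Q = 1$, hence $w(\Omega) < +\infty$.

```python
import math

def partial_modular(lam, T, n=200000):
    """Midpoint Riemann sum for the integral of (1/lam)**x over [1, T],
    i.e. the modular of f = χ_Ω with exponent p(x) = x, truncated at T."""
    h = (T - 1.0) / n
    return sum((1.0 / lam) ** (1.0 + (i + 0.5) * h) * h for i in range(n))

# lam = 2 > 1: the integral converges (exact value 2**(-1)/log 2), so the
# partial integrals barely change as T grows.
at_T100 = partial_modular(2.0, 100.0)
at_T1000 = partial_modular(2.0, 1000.0)

# lam = 1/2 < 1: the integrand 2**x explodes, so the partial integrals
# blow up with T; only lam > 1 is feasible, giving ||[χ_Ω]||_Q = 1.
blowup = partial_modular(0.5, 50.0)
```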
Proposition 3.15. Let $p$ be an unbounded exponent function. Then, $L^\infty(\Omega) \subset L^{p(\cdot)}(\Omega)$ if, and only if, $w(\Omega) < +\infty$.

By virtue of the characterization of the quotient norm, this result essentially states that, when $\Omega$ has infinite Lebesgue measure, if $p$ diverges fast enough, then $w(\Omega) < +\infty$ and $L^\infty(\Omega) \subset L^{p(\cdot)}(\Omega)$. Some important examples of exponent functions, when we take $\Omega = [1, +\infty)$, are $p(x) = x$ and $p(x) = [x]$, where $[x]$ denotes the integer part of $x$. These exponent functions verify $w(\Omega) < +\infty$. Nevertheless, not every exponent function verifies this condition, and this can be identified from its speed of divergence. Indeed, for example, $p(x) = 1 + \log x$, or any other exponent function which diverges faster, verifies $w(\Omega) < +\infty$. However, there are other unbounded exponent functions which do not verify the previous condition, such as $p(x) = 1 + \log(1 + \log x)$, or any other exponent function which diverges more slowly than this.

4. Universal approximation in function spaces
The aim of this section is to present some of the known results about the property of Universal Approximation (UA) for classical function spaces, as well as some improvements to some of them. A nice survey of these results in soft computing techniques is [TKG+03]. The first such result concerned the universal approximation of continuous functions on compact subsets of $\mathbb{R}^d$; this result was later extended to several other spaces (see [PS91], [Hor91], [HSW89], [Cyb89]).

In this section, we focus on results of universal approximation in various function spaces, namely those of continuous functions and spaces of locally integrable functions. For each of these classes, we review some of the previous literature and present some new results, with improvements concerning either the domain of the functions, the domain of the activation functions, or the techniques involved to prove them.

4.1. Space of continuous functions.
In 1989, in both [Cyb89] and [HSW89], the authors proved simultaneously, for many different activation functions, that for any continuous function on a compact set $K$ of $\mathbb{R}^d$ and any $\varepsilon > 0$, there exists a feedforward neural network with one hidden layer which uniformly approximates the function with precision $\varepsilon$. Before formally stating and proving the result, we introduce a concept concerning activation functions that is necessary to understand the statement of the theorem.
Definition 4.1.
Given an activation function $\sigma : \mathbb{R} \to \mathbb{R}$ and a compact $K \subset \mathbb{R}^d$, we say that $\sigma$ is discriminatory for $C(K)$ if, for $\mu$ a finite, signed, regular Borel measure on $K$, the fact that
\[ \int_K \sigma(w \cdot x + b) \, \mathrm{d}\mu(x) = 0 \quad \text{for all } w \in \mathbb{R}^d \text{ and } b \in \mathbb{R} \]
implies that $\mu \equiv 0$.

Now, we can state the first result on universal approximation using ANN, namely the universal approximation theorem for continuous functions on a compact set, whose proof can be found in [Cyb89, Theorem 1] or [HSW89, Theorem 2.1].
Theorem 4.2.
Let $\sigma$ be any continuous activation function which is discriminatory for $C(K)$. Then, finite sums of the form
\[ \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j) \]
are dense in $C(K)$, the space of continuous functions over a compact set $K$.

A natural question arising from this result is whether a given activation function is discriminatory for $C(K)$. The following sufficient condition, whose proof appears in [Cyb89, Lemma 1] and [Hor91, Theorem 5], allows us to check when a continuous function $\sigma$ is discriminatory for $C(K)$.

Proposition 4.3.
Any bounded, measurable, non-constant function $\sigma$ is discriminatory for $C(K)$ for every compact $K \subset \mathbb{R}^d$. In particular, any continuous sigmoidal function is discriminatory.

By virtue of this result, we can check that all the examples of activation functions provided in Example 3.2 are discriminatory for $C(K)$. However, to get a continuous approximation the activation function needs to be continuous; thus, to apply Theorem 4.2 we can only use the examples that are continuous, namely the rectifier, the logistic sigmoid and the hyperbolic tangent.

As a generalization of the previous result, the following characterization of discriminatory functions was provided shortly afterwards in [LLPS93].
Let $\sigma$ be locally bounded and continuous almost everywhere. Then, $\sigma$ is discriminatory for $C(K)$ for every compact $K \subset \mathbb{R}^d$ if, and only if, it is different from an algebraic polynomial almost everywhere.

Building on these results, we now present some improvements to some of them. The following is a slight improvement of Theorem 4.2, in the sense that it extends that result from a compact set to the space of functions which vanish at infinity. As a counterpart, it only works in the unidimensional case.
Proposition 4.5.
Let $\sigma \in C(\mathbb{R})$ be sigmoidal and non-constant, and let $f \in C(\mathbb{R})$ have compact support. Then, for every $\varepsilon > 0$ there is $g_\varepsilon \in \mathcal{H}_\sigma$ such that
\[ \sup_{x \in \mathbb{R}} |f(x) - g_\varepsilon(x)| < \varepsilon. \]

Proof.
First, we know that $\sigma$ is discriminatory for $C(K)$ with $K$ compact, because it is bounded and non-constant (Proposition 4.3).

Next, we recall that $C_0(\mathbb{R})$ is the subspace of continuous functions that vanish at infinity. This space coincides with the closure in the supremum norm of the union of the spaces $C([-\ell, \ell])$ with $\ell \in \mathbb{N}$, viewed as subspaces of $L^\infty(\mathbb{R})$. Thus, $\sigma$ is also discriminatory for $C_0(\mathbb{R})$, since there are continuous functions vanishing at infinity in the subspace $\mathcal{H}_\sigma$, due to the fact that $\sigma$ is sigmoidal. Therefore, $\mathcal{H}_\sigma$ is dense, and for every $\varepsilon > 0$ we can find $g_\varepsilon \in \mathcal{H}_\sigma$ such that
\[ \sup_{x \in \mathbb{R}} |f(x) - g_\varepsilon(x)| < \varepsilon. \qquad \square \]

The following result takes a further step in the approximation of continuous functions, since the domain is now the whole $\mathbb{R}^d$ instead of just a compact $K$. It has the same spirit as [PS91, Theorem 2], although the latter result concerns radial activation functions, whereas we have considered a different structure, slicing with hyperplanes. Here we check that this result also holds in our setting.

Theorem 4.6.
Let $\sigma$ be continuous and discriminatory for $C(K)$ for every compact $K \subset \mathbb{R}^d$. Given $\varepsilon > 0$ and $f \in C(\mathbb{R}^d)$, there is a function $g \in \mathcal{H}_\sigma$ such that $d(f, g) < \varepsilon$, where the distance is defined in the following way:
\[ d(f, g) := \sum_{k=1}^{\infty} 2^{-k} \, \frac{\| (f - g) \chi_{[-k,k]^d} \|_\infty}{1 + \| (f - g) \chi_{[-k,k]^d} \|_\infty}, \]
where $[-k,k]^d$ is the centered $d$-dimensional cube of side $2k$.

We omit the proof of this result, as it is completely analogous to that of Theorem 4.10 for locally integrable functions. However, we remark a couple of facts concerning this result.
Remark 4.7.
The previous theorem also holds for $\sigma$ the rectifier. Moreover, as can be seen in the proof of Theorem 4.10, we only use the assumption of continuity of $f$ to justify that it is bounded over compact sets. Hence, we could have stated the theorem above for $f \in L^\infty_{\mathrm{loc}}(\mathbb{R}^d)$.

4.2. Negative results.
Next, we explore the opposite direction, i.e. negative results of universal approximation. First, we show that the following is a necessary condition for a space to satisfy the UA property.
Lemma 4.8.
If a metric space $(X, d)$ of functions can be universally approximated by finite sums of the form
\[ \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j), \]
where $\sigma$ is continuous and sigmoidal, then $(X, d)$ is separable.

Proof. The proof easily follows from the fact that $\mathbb{Q}$ is dense in $\mathbb{R}$. Indeed, due to the uniform continuity of $\sigma$, the countable subspace consisting of the finite sums
\[ \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j), \]
where the coefficients $\alpha_j$, $b_j$ and the components of the vectors $w_j$ are taken to be rational numbers, is a dense subspace. $\square$

Since every sigmoidal continuous activation function is uniformly continuous, Lemma 4.8 shows that the size of the function space is an important fact to consider in the problem of UA. Furthermore, it can be used to show negative results of UA, such as the following example:
Example 4.9. $(L^\infty(\mathbb{R}^d), \|\cdot\|_\infty)$ is a non-separable space. Therefore, UA fails for continuous sigmoidal activation functions. Moreover, it even fails for arbitrary continuous activation functions.
Indeed, if $\sigma$ is continuous, every finite sum of the form
\[ g(x) = \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j) \]
is continuous. The Heaviside step function $h$ belongs to $L^\infty(\mathbb{R})$ and has a jump discontinuity at $x = 0$ of size $1$. Therefore, at $x = 0$ the approximating continuous function $g$ would have to be simultaneously close to the values $0$ and $1$. Thus, using the continuity of $g$, we can easily see that
\[ \| h - g \|_\infty := \sup_{x \in \mathbb{R}} \left| h(x) - \sum_{j=1}^{M} \alpha_j \, \sigma(w_j \cdot x + b_j) \right| \geq \frac{1}{2}, \]
concluding the proof that UA fails for $(L^\infty(\mathbb{R}^d), \|\cdot\|_\infty)$ with continuous activation functions.
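The obstruction above can also be observed numerically. The sketch below (our illustration; the family $\sigma(kx)$ with a logistic $\sigma$ is just one convenient choice of continuous approximants) estimates $\sup_x |h(x) - \sigma(kx)|$ on a fine grid around the jump: however steep the approximant, continuity pins its value near $1/2$ at $x = 0$, so the sup-norm error never drops below $1/2$.

```python
import math

def heaviside(x):
    return 1.0 if x > 0 else 0.0

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def sup_error(k, eps=1e-8, n=10001):
    """Grid estimate of sup_x |h(x) - logistic(k*x)| on [-eps, eps].

    The grid is concentrated around the jump at 0, where the sup-norm
    error of any continuous approximant is forced to be at least 1/2."""
    grid = [-eps + 2.0 * eps * i / (n - 1) for i in range(n)]
    return max(abs(heaviside(x) - logistic(k * x)) for x in grid)

# Steeper and steeper continuous approximants: the error stays at 1/2.
errors = [sup_error(k) for k in (10.0, 1e3, 1e6)]
```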
4.3. Locally integrable functions.
The space of locally integrable functions, $L^1_{\mathrm{loc}}(\mathbb{R}^d)$, is the largest space that we deal with in this manuscript; it includes as special cases the previous function spaces. However, this space lacks an associated norm. One can solve this problem by constructing a distance in a similar way to what we have already done for $C(\mathbb{R}^d)$ in Theorem 4.6.

Theorem 4.10.
Given $\sigma$ discriminatory for $C(K)$ for every compact $K \subset \mathbb{R}^d$, $\varepsilon > 0$ and $f \in L^1_{\mathrm{loc}}(\mathbb{R}^d)$, there is a function $g$ of the form
\[
g(x) = \sum_{j=1}^{M} \alpha_j\,\sigma(w_j \cdot x + b_j) \quad \text{for every } x \in \mathbb{R}^d,
\]
such that $d(f, g) < \varepsilon$, where the distance is defined in the following way:
\[
d(f, g) := \sum_{k=1}^{\infty} 2^{-k}\, \frac{\|(f - g)\,\chi_{[-k,k]^d}\|_1}{1 + \|(f - g)\,\chi_{[-k,k]^d}\|_1},
\]
where $[-k, k]^d$ is the centered $d$-dimensional cube of side $2k$.

Proof. The key idea of the proof relies on the fact that, due to the construction of the distance, the behaviour away from compact sets of the form $[-k, k]^d$ has very little impact on the computation of the distance, because of the multiplicative factor $2^{-k}$. Therefore, given $\varepsilon > 0$, there is a large enough $k$ so that we can restrict to approximating the function $f$ on $[-k, k]^d$.

Indeed, fix $f : \mathbb{R}^d \to \mathbb{R}$ continuous (by density of continuous functions in $L^1$ of a cube, it suffices to consider this case) and $\varepsilon > 0$. Take $m \in \mathbb{N}$ such that $2^{-m} < \varepsilon/2$. Hence, this clearly implies that
\[
\sum_{k=m+1}^{\infty} 2^{-k} < \frac{\varepsilon}{2}.
\]
Now, by virtue of Theorem 4.2, there exists a function $g$ of the form $g(x) = \sum_{j=1}^{\ell} \alpha_j\,\sigma(w_j \cdot x + b_j)$ for every $x \in \mathbb{R}^d$ such that
\[
\big\|(f - g)\,\chi_{[-m,m]^d}\big\|_1 \leq \big\|(f - g)\,\chi_{[-m,m]^d}\big\|_\infty \,(2m)^d < \frac{\varepsilon}{2}.
\]
Therefore, it is clear that
\[
d(f, g) = \sum_{k=1}^{m} 2^{-k}\, \frac{\|(f - g)\,\chi_{[-k,k]^d}\|_1}{1 + \|(f - g)\,\chi_{[-k,k]^d}\|_1} + \sum_{k=m+1}^{\infty} 2^{-k}\, \frac{\|(f - g)\,\chi_{[-k,k]^d}\|_1}{1 + \|(f - g)\,\chi_{[-k,k]^d}\|_1} \leq \big\|(f - g)\,\chi_{[-m,m]^d}\big\|_1 + \sum_{k=m+1}^{\infty} 2^{-k} < \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon,
\]
where we are using in the second line the fact that $f - g$ is locally integrable, so that $\|(f - g)\,\chi_{[-k,k]^d}\|_1$ is finite for every $k \in \mathbb{N}$ and thus
\[
\frac{\|(f - g)\,\chi_{[-k,k]^d}\|_1}{1 + \|(f - g)\,\chi_{[-k,k]^d}\|_1} \leq 1 \quad \text{for every } k \in \mathbb{N}. \qquad \square
\]
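As a sanity check (our own numerical sketch, not taken from the paper), the distance $d(f,g)$ can be approximated by truncating the series; the factor $2^{-k}$ guarantees that stopping at level $m$ introduces an error of at most $2^{-m}$, which is the mechanism exploited in the proof above.

```python
import numpy as np

def dist_Lloc(f, g, m=20, pts=4001):
    """Truncated version of d(f, g) = sum_k 2^{-k} L1_k / (1 + L1_k)
    in dimension d = 1, where L1_k = ||(f - g) chi_{[-k,k]}||_1.
    The discarded tail beyond level m contributes at most 2^{-m}."""
    total = 0.0
    for k in range(1, m + 1):
        x = np.linspace(-k, k, pts)
        # Riemann approximation of the L^1 norm on [-k, k]
        l1 = np.abs(f(x) - g(x)).mean() * (2 * k)
        total += 2.0 ** (-k) * l1 / (1.0 + l1)
    return total

f = np.sin
g = lambda x: np.sin(x) + 0.01  # small uniform perturbation of f
print(dist_Lloc(f, f))          # 0.0: the distance vanishes on the diagonal
print(dist_Lloc(f, g) < dist_Lloc(f, lambda x: np.sin(x) + 1.0))  # True
```

The grid resolution and truncation level are arbitrary choices; any refinement only changes the value within the $2^{-m}$ tail bound.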
Remark 4.11. As in the case of Theorem 4.6, a result in a similar spirit appeared in [PS91], although the latter result concerned radial activation functions.

5. Approximation in variable Lebesgue spaces
In the previous section, we have collected some results concerning the universal approximation property with neural networks in certain function spaces. However, to the best of our knowledge, there is no result about universal approximation for variable Lebesgue spaces.

In this section, we present the main results of this paper. We have split them into three different cases, depending on the exponent function of the space and its domain. First, we show some results of universal approximation whenever the exponent function is bounded, subsequently proving that, under certain conditions on the activation function, this is the only case for which universal approximation is possible. Later, we shift towards the unbounded exponent function setting, starting with the case in which the domain is discrete and subsequently lifting these results to a general domain case.

5.1. Case I: Bounded exponent function.
In this section, we discuss the results of approximation for variable Lebesgue spaces obtained when the exponent function is bounded, i.e. whenever $p : \Omega \to [1, +\infty)$ for $\Omega \subseteq \mathbb{R}^d$ verifies
\[
\operatorname*{ess\,sup}_{x \in \Omega} p(x) < +\infty.
\]
In the case that the exponent function is bounded, the variable Lebesgue space $L^{p(\cdot)}$ is separable. Furthermore, smooth functions with compact support are dense. Thus, we can present a first version of a universal approximation result for bounded variable Lebesgue spaces relying on the previous fact and Proposition 4.4.

Theorem 5.1.
Let $\sigma : \mathbb{R} \to \mathbb{R}$ be locally bounded, continuous almost everywhere and different from an algebraic polynomial. Then, truncated finite sums of the form
\[
g(x) = \begin{cases} \displaystyle\sum_{j=1}^{M} \alpha_j\,\sigma(w_j \cdot x + b_j) & x \in K, \\[2mm] 0 & x \in \mathbb{R}^d \setminus K, \end{cases}
\]
with $K$ compact are dense in $L^{p(\cdot)}(\mathbb{R}^d)$ with bounded exponent $p$.

Proof. Fix $\varepsilon > 0$. Since $C^\infty_c(\mathbb{R}^d)$ is dense in $L^{p(\cdot)}(\mathbb{R}^d)$ by Proposition 3.10, we can find $f_1 \in C(\mathbb{R}^d)$ with compact support $K$ such that $\|f - f_1\|_{L^{p(\cdot)}(\mathbb{R}^d)} < \varepsilon/2$. By Proposition 4.4, there is $g_\varepsilon \in H_\sigma$, truncated to be zero outside $K$, such that
\[
\sup_{x \in K} |f_1(x) - g_\varepsilon(x)| < \frac{\varepsilon}{2\,\|\chi_K\|_{L^{p(\cdot)}(\mathbb{R}^d)}}.
\]
Therefore,
\[
\|f - g_\varepsilon\|_{L^{p(\cdot)}(\mathbb{R}^d)} \leq \|f - f_1\|_{L^{p(\cdot)}(\mathbb{R}^d)} + \|f_1 - g_\varepsilon\|_{L^{p(\cdot)}(\mathbb{R}^d)} < \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon. \qquad \square
\]

Furthermore, when the exponent function is bounded, the dual of variable Lebesgue spaces is fully characterized (see Proposition 3.11) and it coincides with $L^{q(\cdot)}$, where $q$ is pointwise the Hölder conjugate of $p$, i.e. $1/p(x) + 1/q(x) = 1$ for every $x \in \Omega$. By virtue of this result, and analogously to the previous notions of discriminatory activation functions, we can give the following definition:
Definition 5.2.
Let $\sigma : \mathbb{R} \to \mathbb{R}$ be an activation function and $K \subset \mathbb{R}^d$ compact. For a bounded exponent function $p : \Omega \to [1, +\infty)$, for $\Omega \subseteq \mathbb{R}^d$, we say that $\sigma$ is discriminatory for $L^{p(\cdot)}(K)$ if for any $h \in L^{q(\cdot)}(K)$, where $q$ is pointwise the Hölder conjugate of $p$ (i.e. $1/p(x) + 1/q(x) = 1$ for every $x \in \Omega$), the condition
\[
\int_K \sigma(w \cdot x + b)\, h(x)\, dx = 0 \quad \text{for all } w \in \mathbb{R}^d \text{ and } b \in \mathbb{R}
\]
implies that $h = 0$ almost everywhere.

As mentioned above, this is indeed a reasonable definition due to the fact that $L^{q(\cdot)}$ can be identified with the dual of $L^{p(\cdot)}$. Now, considering this definition, we can state and prove the following result in this setting.

Theorem 5.3.
Let $\Omega \subseteq \mathbb{R}^d$, $p : \Omega \to [1, +\infty)$ a bounded exponent function and $\sigma : \mathbb{R} \to \mathbb{R}$ discriminatory for $L^{p(\cdot)}(K)$ for every compact $K \subset \Omega$. Then, truncated finite sums of the form
\[
g(x) = \begin{cases} \displaystyle\sum_{j=1}^{M} \alpha_j\,\sigma(w_j \cdot x + b_j) & x \in K, \\[2mm] 0 & x \in \Omega \setminus K, \end{cases}
\]
with $K$ compact are dense in $L^{p(\cdot)}(\Omega)$.

Proof. Given $f \in L^{p(\cdot)}(\Omega)$ and $\varepsilon > 0$, we apply Proposition 3.10 to find a compact $K$ where $\|f - f\chi_K\|_{L^{p(\cdot)}(\Omega)} < \varepsilon/2$. Then, we can show that there is a $g$ of the form $\sum_{j=1}^{M} \alpha_j\,\sigma(w_j \cdot x + b_j)$ such that
\[
\Big\| f\chi_K - \sum_{j=1}^{M} \alpha_j\,\sigma(w_j \cdot x + b_j) \Big\|_{L^{p(\cdot)}(K)} < \frac{\varepsilon}{2},
\]
finishing thus the proof. Indeed, if the above were false, we would get a contradiction as a consequence of the Hahn–Banach theorem. By virtue of its version as a hyperplane separation theorem (see [Rud91, Theorem 3.5]), we can find a nontrivial functional $T$ which vanishes over all the finite sums of the form $\sum_{j=1}^{M} \alpha_j\,\sigma(w_j \cdot x + b_j)$. From Proposition 3.11, we know that the dual of $L^{p(\cdot)}(K)$ is $L^{q(\cdot)}(K)$, where $q$ is pointwise the Hölder conjugate of $p$, so that
\[
T(f) = \int_K f(x)\, h(x)\, dx
\]
for some $h \in L^{q(\cdot)}(K)$. Therefore, we have identified the functional $T$ with an element $h$ of $L^{q(\cdot)}(K)$. Since $T$ vanishes on every function $\sigma(w \cdot x + b)$, we have $\int_K \sigma(w \cdot x + b)\,h(x)\,dx = 0$ for all $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$; as $\sigma$ is discriminatory, $h = 0$ almost everywhere, and hence $T = 0$, getting a contradiction with the nontriviality of $T$. $\square$

A natural question arising from the previous result is when the activation function is discriminatory for variable Lebesgue spaces. We can hence prove the following relation between the properties of being discriminatory, which in particular yields the fact that Theorem 5.3 is more general than Theorem 5.1.
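The Luxemburg-type norm that defines $L^{p(\cdot)}(K)$, namely $\|f\|_{L^{p(\cdot)}(K)} = \inf\{\lambda > 0 : \int_K |f(x)/\lambda|^{p(x)}\,dx \le 1\}$, can be evaluated numerically. The following sketch is our own illustration (the discretization and the bisection tolerance are arbitrary choices, not anything prescribed by the paper).

```python
import numpy as np

def variable_lebesgue_norm(f, p, a, b, pts=20001, tol=1e-9):
    """Luxemburg norm on K = [a, b]: inf{lam > 0 : modular(f/lam) <= 1},
    where modular(g) = integral over K of |g(x)|^{p(x)} dx (Riemann sum)."""
    x = np.linspace(a, b, pts)
    def modular(lam):
        return np.mean(np.abs(f(x) / lam) ** p(x)) * (b - a)
    # the modular is decreasing in lam, so bisection applies
    lo, hi = tol, 1.0
    while modular(hi) > 1.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if modular(mid) > 1.0 else (lo, mid)
    return hi

# On K = [0, 1] (measure 1) a constant function c has norm exactly c, for
# any exponent function: the modular of c/lam equals 1 precisely at lam = c.
p = lambda x: 2.0 + x  # a bounded exponent function on [0, 1]
print(variable_lebesgue_norm(lambda x: 2.0 + 0 * x, p, 0.0, 1.0))  # 2.0
```

The constant-function check gives an exact reference value, which makes it a convenient unit test for any such implementation.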
Lemma 5.4.
Given $K \subset \mathbb{R}^d$ compact and two bounded exponent functions $p_1$ and $p_2$ satisfying $p_1(x) \leq p_2(x)$ for all $x \in K$,
\[
\{\sigma \text{ disc. for } C(K)\} \subseteq \{\sigma \text{ disc. for } L^{p_2(\cdot)}(K)\} \subseteq \{\sigma \text{ disc. for } L^{p_1(\cdot)}(K)\}.
\]
In particular, non-constant, bounded activation functions $\sigma$ and the rectifier are discriminatory for $L^{p(\cdot)}(K)$ (with $p$ bounded).

Proof. We have the following inclusion of variable Lebesgue spaces when the domain is compact: whenever $p_1 \leq p_2$, it holds that $L^{p_2(\cdot)}(K) \subseteq L^{p_1(\cdot)}(K)$ (Proposition 3.12). We recall that the dual of a bounded variable Lebesgue space $L^{p(\cdot)}$ is the variable Lebesgue space $L^{q(\cdot)}$, where $q$ is pointwise the Hölder conjugate of $p$ (Proposition 3.11). Therefore,
\[
L^{q_1(\cdot)}(K) \subseteq L^{q_2(\cdot)}(K),
\]
where $q_1$ and $q_2$ are the Hölder conjugates of $p_1$ and $p_2$, respectively. This proves that a function $\sigma$ that is discriminatory for $L^{p_2(\cdot)}(K)$ is also discriminatory for $L^{p_1(\cdot)}(K)$. Moreover, since for every $h \in L^1(K)$, $h(x)\,dx$ is a Radon measure, we have that if $\sigma$ is discriminatory for $C(K)$, it is also discriminatory for $L^{p(\cdot)}(K)$ with $p$ bounded. Finally, using Proposition 4.3 and Remark 3.4, non-constant, bounded activation functions $\sigma$ and the rectifier are discriminatory for $L^{p(\cdot)}(K)$ (with $p$ bounded). $\square$

From Lemma 5.4 and Proposition 4.3, it easily follows that all the examples shown in Subsection 3.1 are discriminatory for $L^{p(\cdot)}(K)$. Therefore, we can use Theorem 5.3 with any of the activation functions from Example 3.2. We conclude this subsection by showing the connection between the boundedness of the exponent function and the universal approximation property for $L^{p(\cdot)}$ spaces.

Corollary 5.5.
Let $\Omega \subseteq \mathbb{R}^d$, $p : \Omega \to [1, +\infty)$ an exponent function and $\sigma : \mathbb{R} \to \mathbb{R}$ continuous and sigmoidal. Then, UA holds for $L^{p(\cdot)}(\Omega)$ if, and only if, $p$ is bounded.

Proof. On the one hand, from Theorem 5.3, Proposition 4.3 and Lemma 5.4 we deduce that, when $p$ is bounded, we have UA for $L^{p(\cdot)}(\Omega)$. On the other hand, when $p$ is unbounded the space $L^{p(\cdot)}(\Omega)$ lacks UA, due to Proposition 3.9 and Lemma 4.8. $\square$
5.2. Case II: Unbounded exponent function, discrete case.
After having dealt with the bounded case in the last section, we now shift towards the unbounded case. For that, it is better to show first the results that can be obtained in the discrete case, i.e. in variable sequence spaces. Even though variable sequence spaces are easier to describe than variable Lebesgue spaces, they are complex enough to show that it is impossible to achieve a universal approximation property.

More precisely, we consider the measure space $(\mathbb{N}, 2^{\mathbb{N}}, \mu)$, where $2^{\mathbb{N}}$ denotes all the subsets of the natural numbers and $\mu$ is the counting measure (it associates to every subset of the natural numbers its cardinality). As in the continuum case, other measures could be considered and the results would follow with minor modifications.

Let us recall the definition of a variable sequence space: given an exponent function $p : \mathbb{N} \to [1, +\infty)$, consider the modular
\[
\rho_{p(\cdot)}(\{x(k)\}) := \sum_{j=1}^{\infty} |x(j)|^{p(j)}.
\]
Then, the norm is given by
\[
\|\{x(k)\}\|_{p(\cdot)} = \inf \big\{ \lambda > 0 : \rho_{p(\cdot)}(\{x(k)\}/\lambda) \leq 1 \big\}.
\]
Let us recall that we are denoting the subspace of all sequences that can be obtained using an ANN with 1 hidden layer and activation function $\sigma$ by
\[
H_\sigma := \Big\{ \{y(k)\} : y(k) = \sum_{j=1}^{M} \alpha_j\,\sigma(w_j \cdot k + b_j), \text{ where } \alpha_j, w_j, b_j \in \mathbb{R} \text{ and } M \in \mathbb{N} \Big\}.
\]
Hereafter, we will assume that the activation function $\sigma \in L^\infty(\mathbb{R})$ is non-constant and sigmoidal (existence of the limits at $+\infty$ and $-\infty$). Therefore, $H_\sigma \subseteq \ell^\infty$. More specifically, since $\sigma$ is sigmoidal, $H_\sigma$ is a subspace of the space of convergent sequences, i.e. $H_\sigma \subseteq c$.

Before proceeding to the main results of this section, here we collect some results concerning variable sequence spaces and their relation with $\ell^\infty$, addressed in [ACACUO19].

Proposition 5.6.
Let $p : \mathbb{N} \to [1, +\infty)$ be an exponent function. Then,
• $\ell^{p(\cdot)} \subseteq \ell^\infty$.
• $\ell^{p(\cdot)} = \ell^\infty$ as vector spaces if, and only if, $\|\chi_{\mathbb{N}}\|_{p(\cdot)} < +\infty$.

The condition $\|\chi_{\mathbb{N}}\|_{p(\cdot)} < +\infty$ is related to the divergence of the exponent function $p$; that is, we need that $p(k) \to +\infty$ (as $k \to +\infty$) fast enough so that the series
\[
\sum_{j=1}^{\infty} s^{p(j)}
\]
converges for some $s > 0$. For example, $p(k) = 1 + \log k$ or any other exponent function which diverges faster verifies the previous condition. However, there are other unbounded exponent functions which do not verify the previous condition, such as $p(k) = 1 + \log(1 + \log k)$ or any other exponent function which diverges slower than this.
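The dichotomy between these two exponent functions is visible in the partial sums of $\sum_j s^{p(j)}$. The sketch below is our own numerical illustration (the value $s = 0.2$ is an arbitrary choice): for $p(k) = 1 + \log k$ the terms are $s\,j^{\log s}$ with $\log 0.2 < -1$, so the series converges, while for $p(k) = 1 + \log(1 + \log k)$ the terms decay slower than any power of $1/j$ and the partial sums keep growing.

```python
import math

def partial_sum(p, s, n):
    # partial sum of sum_j s^{p(j)} up to j = n
    return sum(s ** p(j) for j in range(1, n + 1))

fast = lambda j: 1.0 + math.log(j)                  # p(k) = 1 + log k
slow = lambda j: 1.0 + math.log(1.0 + math.log(j))  # p(k) = 1 + log(1 + log k)

s = 0.2
print(partial_sum(fast, s, 10**4), partial_sum(fast, s, 10**5))  # nearly equal
print(partial_sum(slow, s, 10**4), partial_sum(slow, s, 10**5))  # keeps growing
```

In the first case the two partial sums agree to within roughly $10^{-3}$; in the second, the sum up to $10^5$ is several times larger than the sum up to $10^4$.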
Proposition 5.7. Let $p : \mathbb{N} \to [1, +\infty)$ be an exponent function such that $\|\chi_{\mathbb{N}}\|_{p(\cdot)} < +\infty$ and $\sigma \in L^\infty(\mathbb{R})$ sigmoidal and non-constant. Then, $\overline{H_\sigma} = c$, where the closure is taken in the $\|\cdot\|_{p(\cdot)}$ norm.

Proof. Note that, using the hypothesis on the exponent function, $(\ell^{p(\cdot)}, \|\cdot\|_{p(\cdot)})$ and $(\ell^\infty, \|\cdot\|_\infty)$ are the same space with two equivalent norms. On the one hand, since $c$ is closed in $(\ell^\infty, \|\cdot\|_\infty)$, from the obvious inclusion $H_\sigma \subseteq c$ we get that $\overline{H_\sigma} \subseteq c$. On the other hand, to prove the reverse inclusion it is enough to show that the space $c_{00}$ of sequences with a finite number of nonvanishing terms is contained in $\overline{H_\sigma}$, since the closure in $(\ell^\infty, \|\cdot\|_\infty)$ of $c_{00}$ is $c_0$ (the space of sequences which converge to $0$) and we have good approximations of constants in $H_\sigma$ because $\sigma$ is sigmoidal. Finally, $c_{00}$ can be approximated by an analogous discrete argument as in Proposition 4.5. $\square$
The statement of Proposition 5.7 holds for $\sigma$ the rectifier. More precisely, we can prove that $\overline{H_\sigma \cap \ell^{p(\cdot)}} = c$, since using two ReLUs we can obtain a non-constant, sigmoidal, bounded activation function and appeal then to Proposition 5.7.

5.3. Case III: Unbounded exponent function, general case.
Now we analyze the general situation in which the exponent function is unbounded. In this case, the variable Lebesgue space is nonseparable (see Proposition 3.9). Using Lemma 4.8, we deduce that the separability of the function space is necessary for a universal approximation result to hold for sigmoidal, continuous activation functions. Therefore, in the current context, the difficulty of obtaining approximation results is much higher. For example, note that we cannot even approximate a function by its restrictions to compact domains (Proposition 3.10).

Moreover, there is an additional problem associated to finding a precise characterization of the dual, since many subtleties appear in this setting. For more information about the dual in this case, we refer the interested reader to some previous work of one of the authors [ACACUO19].

For the reasons aforementioned, here we just focus on the unidimensional case, having in mind the toy model $\Omega = [1, +\infty)$ and an exponent function $p : \Omega \to \Omega$ given by $p(x) = x$ or $p(x) = [x]$, where $[x]$ denotes the integer part of $x$. In this case, we show a characterization of the subspace of functions which can actually be approximated using an artificial neural network.

First, since we are focusing on the unidimensional case, we start by showing that we can approximate bounded functions with limit at $\infty$.

Proposition 5.9.
Let $\Omega \subseteq \mathbb{R}$ be an unbounded interval and $p : \Omega \to [1, +\infty)$ an unbounded exponent function such that $L^\infty(\Omega) \subset L^{p(\cdot)}(\Omega)$ and which is bounded on every compact subset of $\Omega$. Let $\sigma \in L^\infty(\mathbb{R})$ be a non-constant, sigmoidal activation function. Then, given $f \in L^\infty(\mathbb{R})$ with limit at $\infty$ and $\varepsilon > 0$, there is a function of the form
\[
g_\varepsilon(x) := \sum_{j=1}^{n} \alpha_j\,\sigma(w_j \cdot x + b_j)
\]
such that $\|f - g_\varepsilon\|_{L^{p(\cdot)}(\Omega)} < \varepsilon$.

Proof. Without loss of generality we can assume that $\Omega = [1, +\infty)$. Let $\beta := \lim_{x \to +\infty} f(x)$. For simplicity, we take $\beta = 0$. Indeed, since $\sigma$ is sigmoidal, there is a function $h(x) = \alpha_0\,\sigma(w_0 \cdot x + b_0)$ such that $f - h$ is a bounded function with limit $0$ at $+\infty$, for suitable $\alpha_0$ and $w_0$.

Fix $\varepsilon > 0$. Given $\delta > 0$, there is $L > 0$ such that $|f(x)| < \delta$ if $x > L$. Since $L^\infty(\Omega) \subset L^{p(\cdot)}(\Omega)$, we know that $\|\chi_\Omega\|_{L^{p(\cdot)}(\Omega)} < +\infty$. Hence, we can take
\[
\delta = \frac{\varepsilon}{4\,\|\chi_\Omega\|_{L^{p(\cdot)}(\Omega)}}.
\]
Now, since $p$ is bounded on $[1, M]$ for some $M > L$, we can use Theorem 5.3 and Proposition 4.5 to find $g_\varepsilon$ of the form
\[
g_\varepsilon(x) := \sum_{j=1}^{n} \alpha_j\,\sigma(w_j \cdot x + b_j),
\]
such that
\[
\|f - g_\varepsilon\|_{L^{p(\cdot)}([1,M])} < \frac{\varepsilon}{2},
\]
and $|g_\varepsilon(x)| < \delta$ for $x > M$. Then,
\[
\|f - g_\varepsilon\|_{L^{p(\cdot)}(\Omega)} \leq \|f - g_\varepsilon\|_{L^{p(\cdot)}([1,M])} + \|f - g_\varepsilon\|_{L^{p(\cdot)}([M,+\infty))} < \frac{\varepsilon}{2} + 2\delta\,\|\chi_\Omega\|_{L^{p(\cdot)}(\Omega)} = \varepsilon. \qquad \square
\]

Next, we can proceed to the main result of this paper. In the theorem below, we provide a characterization of the set of functions of a variable Lebesgue space with an unbounded exponent that can be approximated using neural networks.
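Proposition 5.9 can be illustrated numerically: a bounded function with a limit at $+\infty$ is well fitted on a compact window by a small sigmoidal sum. The sketch below is our own illustration, using a plain least-squares fit with arbitrarily placed logistic units (not the constructive argument of the proof); the unit with $w = 0$ is a constant, which is itself a legitimate network unit.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

# target: bounded on [1, +infinity) with limit 1/2 at infinity
f = lambda x: 0.5 + 1.0 / (1.0 + x)

# sigmoidal units sigma(w x + b): logistic steps centred at a grid of knots,
# plus one unit with w = 0 (a constant)
knots = np.linspace(1.0, 20.0, 30)

def design(x):
    return np.hstack([logistic(2.0 * (x[:, None] - knots[None, :])),
                      np.ones((x.size, 1))])

x = np.linspace(1.0, 20.0, 400)
coef, *_ = np.linalg.lstsq(design(x), f(x), rcond=None)

err = np.max(np.abs(f(x) - design(x) @ coef))
print(err)  # small uniform error on the window [1, 20]
```

The number of units, the window and the steepness of the logistic features are all arbitrary choices; the point is only that a modest sigmoidal sum already fits such targets uniformly well on a compact window.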
Theorem 5.10.
Let $\Omega \subseteq \mathbb{R}$ be an unbounded interval and $p : \Omega \to [1, +\infty)$ be an unbounded exponent function such that $L^\infty(\Omega) \subset L^{p(\cdot)}(\Omega)$ and which is bounded on every compact subset of $\Omega$. Let $\sigma \in L^\infty(\mathbb{R})$ be a non-constant, sigmoidal activation function. Then, the following conditions are equivalent for $f \in L^{p(\cdot)}(\Omega)$:

(1) For every $\varepsilon > 0$, there is a function of the form
\[
g_\varepsilon(x) := \sum_{j=1}^{n} \alpha_j\,\sigma(w_j \cdot x + b_j),
\]
such that $\|f - g_\varepsilon\|_{L^{p(\cdot)}(\Omega)} < \varepsilon$.

(2) There is a scalar $\beta \in \mathbb{R}$ such that $\|[f - \beta\chi_\Omega]\|_Q = 0$, where $\|\cdot\|_Q$ is the quotient norm given in Definition 3.13.

Remark 5.11.
The rectifier in Example 3.2 is also a valid activation function for Theorem 5.10, since we can obtain a continuous, bounded activation function as a combination of two ReLUs.

Proof of Theorem 5.10.
Without loss of generality we can assume that $\Omega = [1, +\infty)$. First we show that $(1) \Rightarrow (2)$. We recall that, since $\sigma$ is sigmoidal (see Definition 3.3),
\[
\sigma(t) \to \begin{cases} c_{+\infty} & \text{as } t \to +\infty, \\ c_{-\infty} & \text{as } t \to -\infty. \end{cases}
\]
Therefore, every function of the form $g(x) = \sum_{j=1}^{n} \alpha_j\,\sigma(w_j \cdot x + b_j)$ converges to
\[
\sum_{\{j \,:\, w_j > 0\}} \alpha_j\, c_{+\infty} + \sum_{\{j \,:\, w_j < 0\}} \alpha_j\, c_{-\infty} + \sum_{\{j \,:\, w_j = 0\}} \alpha_j\, \sigma(b_j)
\]
when $x$ tends to $+\infty$.

From hypothesis (1), we know that for every $\varepsilon > 0$ there is a function $g_\varepsilon$ of the previous form such that $\|f - g_\varepsilon\|_{L^{p(\cdot)}(\Omega)} < \varepsilon$. Let us denote by $\beta_\varepsilon$ the limit at $+\infty$ of $g_\varepsilon$. First, we can show that $g_\varepsilon$ and $\beta_\varepsilon \chi_\Omega$ belong to the same class in the quotient space. Indeed, since $\beta_\varepsilon$ is the limit of $g_\varepsilon(x)$ at $+\infty$, for every $\delta > 0$ there is $M > 0$ such that, for $x > M$, it holds that $|g_\varepsilon(x) - \beta_\varepsilon| < \delta$. Then,
\[
\|[g_\varepsilon - \beta_\varepsilon \chi_\Omega]\|_Q = \|[(g_\varepsilon - \beta_\varepsilon)\,\chi_{(M,+\infty)}]\|_Q \leq \|[\delta\,\chi_{(M,+\infty)}]\|_Q = \delta\, w(\Omega),
\]
where the first equality follows from the fact that the quotient space is taken over the closure of the compactly supported functions in $L^{p(\cdot)}$. Note that $w(\Omega)$ is finite. Then, as $\delta > 0$ is arbitrary, $\|[g_\varepsilon - \beta_\varepsilon \chi_\Omega]\|_Q = 0$. Since these two functions are equivalent in the quotient space, we can deduce the following:
\[
\|[f - \beta_\varepsilon \chi_\Omega]\|_Q = \|[f - g_\varepsilon]\|_Q \leq \|f - g_\varepsilon\|_{L^{p(\cdot)}(\Omega)} < \varepsilon.
\]
Next, we claim that the family $\{\beta_\varepsilon\}$ is bounded. Indeed, this holds due to the fact that
\[
|\beta_{\varepsilon_1} - \beta_{\varepsilon_2}|\, w(\Omega) = \|[(\beta_{\varepsilon_1} - \beta_{\varepsilon_2})\,\chi_\Omega]\|_Q \leq \|[\beta_{\varepsilon_1}\chi_\Omega - f]\|_Q + \|[f - \beta_{\varepsilon_2}\chi_\Omega]\|_Q < \varepsilon_1 + \varepsilon_2,
\]
and thus
\[
|\beta_{\varepsilon_1} - \beta_{\varepsilon_2}| \leq \frac{\varepsilon_1 + \varepsilon_2}{w(\Omega)}.
\]
Since $\{\beta_\varepsilon\}$ is bounded, there is a subsequence $\{\beta_{\varepsilon_n}\}$, with $\varepsilon_n$ tending to $0$ and $\beta_{\varepsilon_n}$ converging to some scalar $\beta$ when $n$ tends to $+\infty$. To conclude, we now have to show that
\[
\|[f - \beta\chi_\Omega]\|_Q = 0,
\]
finishing thus the proof of (2). Indeed,
\[
\|[f - \beta\chi_\Omega]\|_Q \leq \|[f - \beta_{\varepsilon_n}\chi_\Omega]\|_Q + \|[(\beta_{\varepsilon_n} - \beta)\,\chi_\Omega]\|_Q \leq \varepsilon_n + |\beta_{\varepsilon_n} - \beta|\, w(\Omega),
\]
which clearly vanishes when $n$ tends to $+\infty$.
Now we prove $(2) \Rightarrow (1)$. Fix $\varepsilon > 0$. From hypothesis (2), we can find $f_c \in L^{p(\cdot)}_c$ such that
\[
\|(f - \beta\chi_\Omega) - f_c\|_{L^{p(\cdot)}(\Omega)} < \frac{\varepsilon}{3}.
\]
The support of $f_c$ is contained in a compact of the form $[1, L]$ for some $L$. Since $p$ is bounded on $[1, L]$ by hypothesis, using Theorem 5.3 we can find a function $g_1$ of the form $g_1 = \sum_{j=1}^{n} \alpha_j\,\sigma(w_j \cdot x + b_j)$ such that
\[
\|f_c - g_1\|_{L^{p(\cdot)}([1,L])} < \frac{\varepsilon}{3}.
\]
Since $\beta\chi_\Omega - g_1\chi_{(L,+\infty)}$ is clearly a bounded function whose limit exists (and is finite) at $+\infty$, we can appeal to Proposition 5.9 to find a function $g_2$ of the form $g_2 = \sum_{j=n+1}^{n+m} \alpha_j\,\sigma(w_j \cdot x + b_j)$ such that
\[
\|(\beta\chi_\Omega - g_1\chi_{(L,+\infty)}) - g_2\|_{L^{p(\cdot)}(\Omega)} < \frac{\varepsilon}{3}.
\]
Then, we can take $g_\varepsilon$ in the statement of the theorem to be
\[
g_\varepsilon(x) := g_1(x) + g_2(x) = \sum_{j=1}^{n+m} \alpha_j\,\sigma(w_j \cdot x + b_j),
\]
since
\[
\|f - g_\varepsilon\|_{L^{p(\cdot)}(\Omega)} \leq \|(f - \beta\chi_\Omega) - f_c\|_{L^{p(\cdot)}(\Omega)} + \|f_c - g_1\|_{L^{p(\cdot)}([1,L])} + \|(\beta\chi_\Omega - g_1\chi_{(L,+\infty)}) - g_2\|_{L^{p(\cdot)}(\Omega)} < \frac{\varepsilon}{3} + \frac{\varepsilon}{3} + \frac{\varepsilon}{3} = \varepsilon. \qquad \square
\]
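Condition (2) says that, modulo compactly supported functions, $f$ is the constant $\beta$. As our own numerical illustration (with the toy exponent $p(x) = x$ on $\Omega = [1, +\infty)$ mentioned above, and all discretization parameters chosen arbitrarily), the sketch below shows that for $f(x) = \beta + e^{-x}$ the variable-exponent norm of $(f - \beta)\chi_{[L,+\infty)}$ tends to $0$ as $L$ grows, which is the mechanism behind $\|[f - \beta\chi_\Omega]\|_Q = 0$.

```python
import numpy as np

def tail_norm(L, span=60.0, pts=60001, tol=1e-10):
    """Luxemburg norm of e^{-x} restricted to [L, L + span] with p(x) = x;
    the integrand decays so fast that the truncation at L + span is harmless."""
    x = np.linspace(L, L + span, pts)
    def modular(lam):
        # integral of (e^{-x}/lam)^x dx, computed in log-space; the clip
        # only prevents overflow for absurdly small trial values of lam
        expo = x * (-x - np.log(lam))
        return np.mean(np.exp(np.clip(expo, None, 700.0))) * span
    lo, hi = tol, 1.0
    while modular(hi) > 1.0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if modular(mid) > 1.0 else (lo, mid)
    return hi

norms = [tail_norm(L) for L in (1.0, 2.0, 4.0, 8.0)]
print(norms)  # strictly decreasing towards 0
```

Nothing here computes the quotient norm itself (that would require an infimum over all compactly supported representatives); it only exhibits the vanishing tail that makes the quotient class of $f - \beta\chi_\Omega$ trivial.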
The authors want to thank Wen-Liang Hwang for his valuable comments on an earlier draft, and JO would like to thank both him and the Institute of Information Science in Taipei for the hospitality during his research stay in the summer of 2018, when the idea for this project was born. AC acknowledges funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy EXC-2111 390814868. JO is partially supported by the grant MTM2017-83496-P from the Spanish Ministry of Economy and Competitiveness and
through the “Severo Ochoa Programme for Centres of Excellence in R&D” (SEV-2015-0554). This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 777822.
References

[ACACUO19] A. Amenta, J. M. Conde-Alonso, D. Cruz-Uribe, and J. Ocáriz. On the dual of variable Lebesgue spaces with unbounded exponent. preprint, arXiv:1909.05987, 2019.
[ALdTRLV18] J. Almira, P. López-de Teruel, D. Romero-López, and F. Voigtlaender. Some negative results for single layer and multilayer feedforward neural networks. preprint, arXiv:1810.10032, 2018.
[AN20] M. Ali and A. Nouy. Approximation with tensor networks. Part I: Approximation spaces. preprint, arXiv:2007.00118, 2020.
[Arn57] V. I. Arnold. On functions of three variables. Doklady Akademii Nauk USSR, 114:679–681, 1957.
[Bar94] A. Barron. Approximation and estimation bounds for artificial neural networks. Mach. Learn., 14(1):115–133, 1994.
[BGKP19] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci., 1(1):8–45, 2019.
[Bis06] C. Bishop. Pattern Recognition and Machine Learning. Springer New York, 2006.
[CUF13] D. V. Cruz-Uribe and A. Fiorenza. Variable Lebesgue Spaces: Foundations and Harmonic Analysis. Springer Science & Business Media, 2013.
[Cyb89] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[GR20] I. Gühring and M. Raslan. Approximation rates for neural networks with encodable weights in smoothness spaces. preprint, arXiv:2006.16822, 2020.
[GS09] S. Giulini and M. Sanguineti. Approximation schemes for functional optimization problems. J. Optim. Theory Appl., 140:33–54, 2009.
[Hay98] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 1998.
[HH20] W. Hwang and A. Heinecke. Un-rectifying non-linear networks for signal representation. IEEE Transactions on Signal Processing, 68:196–210, 2020.
[HHH20] A. Heinecke, J. Ho, and W. Hwang. Refinement and universal approximation via sparsely connected ReLU convolution nets. IEEE Signal Processing Letters, 2020.
[Hor91] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.
[HSW89] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[KL19] P. Kidger and T. Lyons. Universal approximation with deep narrow networks. preprint, arXiv:1905.08539, 2019.
[Kol57] A. N. Kolmogorov. On the representation of continuous functions of many variables by superpositions of continuous functions of one variable and addition. Dokl. Akad. Nauk SSSR, 114:953–956, 1957.
[Kra19] A. Kratsios. The universal approximation property: Characterizations, existence, and a canonical topology for deep-learning. preprint, arXiv:1910.03344, 2019.
[KvdS93] B. Kröse and P. van der Smagt. An introduction to neural networks. J. Comput. Sci., 48, 01 1993.
[Lin19] S.-B. Lin. Generalization and expressivity for deep nets. IEEE T. Neur. Net. Lear., 30(5):1392–1406, 2019.
[LLPS93] M. Leshno, V. Ya. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993.
[LTY20] B. Li, S. Tang, and H. Yu. Better approximations of high dimensional smooth functions by deep neural networks with rectified power units. Commun. in Comp. Phys., 27:379–411, 2020.
[Mha96] H. N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural Comput., 8(1):164–177, 1996.
[MP43] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4):115–133, 1943.
[OK19] I. Ohn and Y. Kim. Smooth function approximation by deep neural networks with general activation functions. Entropy, 21(7):627, 2019.
[Pet99] P. P. Petrushev. Approximation by ridge functions and neural networks. SIAM Journal on Mathematical Analysis, 30:155–189, 1999.
[Pin97] A. Pinkus. Approximation by ridge functions. In A. Le Méhauté, C. Rabut, and L. L. Schumaker, editors, Surface Fitting and Multiresolution Methods, pages 1–14. Vanderbilt University Press, Nashville, TN, 1997.
[Pin99] A. Pinkus. Approximation theory of the MLP model in neural networks. Acta Numer., 8:143–195, 1999.
[PS91] J. Park and I. W. Sandberg. Universal approximation using radial-basis-function networks. Neural Computation, 3(2):246–257, 1991.
[PV18] P. Petersen and F. Voigtländer. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Netw., 108:296–330, 2018.
[Roj96] R. Rojas. Neural Networks - A Systematic Introduction. Springer-Verlag New York, 1996.
[Ros58] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
[Rud91] W. Rudin. Functional Analysis. McGraw-Hill, 1991.
[San08] M. Sanguineti. Universal approximation by ridge computational models and neural networks: A survey. The Open Applied Mathematics Journal, 2(1):31–58, 2008.
[SCC18] U. Shaham, A. Cloninger, and R. Coifman. Provable approximation properties for deep neural networks. Appl. Comput. Harmon. Anal., 44(3):537–557, 2018.
[SH17] J. Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activation function. preprint, arXiv:1708.06633, 2017.
[ST98] F. Scarselli and A. C. Tsoi. Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural Networks, 11(1):15–37, 1998.
[Suz19] T. Suzuki. Adaptivity of deep ReLU networks for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality, 2019.
[TKG+03] D. Tikk, L. T. Kóczy, T. D. Gedeon, et al. A survey on universal approximation and its limits in soft computing techniques. International Journal of Approximate Reasoning, 33(2):185–202, 2003.
[TLY19] S. Tang, B. Li, and H. Yu. Efficient and stable constructions of deep neural networks with rectified power units using Chebyshev approximations. preprint, arXiv:1911.05467, 2019.
[Wid90] B. Widrow. 30 years of adaptive neural networks: Perceptron, Madaline, and backpropagation. Proceedings of the IEEE, 78(9):1415–1442, 1990.
[Yar17] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Netw., 94:103–114, 2017.
E-mail address: [email protected]

Department of Mathematics, Technische Universität München, 85748 Garching, Germany and Munich Center for Quantum Science and Technology (MCQST), München, Germany
E-mail address: [email protected]