Elementary superexpressive activations
Dmitry Yarotsky
Skolkovo Institute of Science and Technology, Moscow, Russia. E-mail: [email protected]

Abstract
We call a finite family of activation functions superexpressive if any multivariate continuous function can be approximated by a neural network that uses these activations and has a fixed architecture only depending on the number of input variables (i.e., to achieve any accuracy we only need to adjust the weights, without increasing the number of neurons). Previously, it was known that superexpressive activations exist, but their form was quite complex. We give examples of very simple superexpressive families: for example, we prove that the family $\{\sin, \arcsin\}$ is superexpressive. We also show that most practical activations (not involving periodic functions) are not superexpressive.
1. Introduction
In the study of approximations by neural networks, an interesting fact is the existence of activation functions that allow to approximate any continuous function on a given compact domain with arbitrary accuracy by using a network with a finite, fixed architecture independent of the function and the accuracy (i.e., merely by adjusting the network weights, without increasing the number of neurons). We will refer to this property as "superexpressiveness". The existence of superexpressive activations can be seen as a consequence of a result of (Maiorov & Pinkus, 1999).
Theorem 1 (Maiorov & Pinkus 1999). There exists an activation function $\sigma$ which is real analytic, strictly increasing, sigmoidal (i.e., $\lim_{x\to-\infty}\sigma(x)=0$ and $\lim_{x\to+\infty}\sigma(x)=1$), and such that any $f \in C([0,1]^d)$ can be uniformly approximated with any accuracy by expressions $\sum_{i=1}^{6d+3} d_i\,\sigma\big(\sum_{j=1}^{3d} c_{ij}\,\sigma\big(\sum_{k=1}^{d} w_{ijk} x_k + \theta_{ij}\big) + \gamma_i\big)$ with some parameters $d_i, c_{ij}, w_{ijk}, \theta_{ij}, \gamma_i$.

The proof of this theorem includes two essential steps. In the first step, the result is proved for univariate functions (i.e., for $d = 1$). The key idea here is to use the separability of the space $C([0,1])$, and construct a (quite complicated) activation by joining all the functions from some dense countable subset. In the second step, one reduces the multivariate case to the univariate one by using the Kolmogorov Superposition Theorem (KST).

Though the superexpressive activations constructed in the proof of Theorem 1 have the nice properties of analyticity, monotonicity and boundedness, they are nevertheless quite complex and non-elementary – at least, not known to be representable in terms of finitely many elementary functions. See the papers (Ismailov, 2014; Guliyev & Ismailov, 2016; 2018a;b) for refinements and algorithmic aspects of such and similar activations, as well as the papers (Kůrková, 1991; 1992; Igelnik & Parikh, 2003; Montanelli & Yang, 2020; Schmidt-Hieber, 2020) for further connections between KST and neural networks.

There is, however, another line of research in which some weaker forms of superexpressiveness have been recently established for elementary (or otherwise simple) activations. The weaker form means that the network must grow to achieve higher accuracy, but this growth is much slower than the power laws expected from the abstract approximation theory under standard regularity assumptions (DeVore et al., 1989). In particular, results of (Yarotsky & Zhevnerchuk, 2019) imply that a deep network having both $\sin$ and ReLU activations can approximate Lipschitz functions with error $O(e^{-cW^{1/2}})$, where $c > 0$ is a constant and $W$ is the number of weights. Results of (Shen et al., 2020b) (see also (Shen et al., 2020a)) imply that a three-layer network using the floor $\lfloor\cdot\rfloor$, the exponential $2^x$ and the step function $\mathbb{1}_{x\ge 0}$ as activations can approximate Lipschitz functions with an exponentially small error $O(e^{-cW})$.

In the present paper, we show that there are activations superexpressive in the initially mentioned strong sense and yet constructed using simple elementary functions; see Section 2. For example, we prove that there are fixed-size networks with the activations $\sin$ and $\arcsin$ that can approximate any continuous function with any accuracy. On the other hand, we show in Section 3 that most practically used activations (not involving periodic functions) are not superexpressive.

Figure 1. An example of network architecture with 3 input neurons, 1 output neuron and 7 hidden neurons.
2. Elementary superexpressive families
Throughout the paper, we consider standard feedforward neural networks. The architecture of the network is defined by a directed acyclic graph connecting the neurons (see Fig. 1). A network implementing a scalar $d$-variable function has $d$ input neurons, one output neuron and a number of hidden neurons. A hidden neuron computes the value $\sigma(\sum_{i=1}^{n} w_i z_i + h)$, where $w_i$ and $h$ are the weights associated with this neuron, $z_i$ are the incoming connections from other hidden or input neurons, and $\sigma$ is an activation function. We will generally allow different hidden neurons to have different activation functions. The output neuron computes the value $\sum_{i=1}^{n} w_i z_i + h$ without an activation function.

Some of our activations (in particular, $\arcsin$) are naturally defined only on a subset of $\mathbb{R}$. In this case we ensure that the inputs of these activations always belong to this subset. Throughout the paper, we consider approximations of functions $f \in C([0,1]^d)$ in the uniform norm $\|\cdot\|_\infty$. We generally denote vectors by boldface letters; the components of a vector $\mathbf{x}$ are denoted $x_1, x_2, \ldots$
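As a concrete illustration of this network model, here is a minimal sketch of such a forward pass (not code from the paper; the dictionary-of-weights layout and the use of NumPy are assumptions made for the illustration). Each hidden neuron may read the input coordinates and all previously computed hidden values, matching the DAG description above.

import numpy as np

def hidden_neuron(z, w, h, sigma):
    # A hidden neuron: activation applied to an affine combination of incoming values.
    return sigma(np.dot(w, z) + h)

def fixed_architecture_net(x, params, activations):
    # The architecture (number of neurons and their wiring) is fixed once and for all;
    # only the weights in 'params' are adjusted to approximate different target functions.
    z = np.asarray(x, dtype=float)
    for (w, h), sigma in zip(params["hidden"], activations):
        z = np.append(z, hidden_neuron(z, w, h, sigma))  # each w matches the current length of z
    w_out, h_out = params["out"]
    return np.dot(w_out, z) + h_out  # output neuron: affine, no activation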
We give now the key definition of the paper.

Definition 1. We call a finite family $\mathcal{A}$ of univariate activation functions superexpressive for dimension $d$ if there exists a fixed $d$-input network architecture with each hidden neuron equipped with some fixed activation from the family $\mathcal{A}$, so that any function $f \in C([0,1]^d)$ can be approximated on $[0,1]^d$ with any accuracy in the uniform norm $\|\cdot\|_\infty$ by such a network, by adjusting the network weights. We call a family $\mathcal{A}$ simply superexpressive if it is superexpressive for all $d = 1, 2, \ldots$ We refer to the respective architectures as superexpressive for $\mathcal{A}$.

Recall that the Kolmogorov Superposition Theorem (KST) (Kolmogorov, 1957) proves that any multivariate continuous function can be expressed via additions and univariate continuous functions. The following version of this theorem is taken from (Maiorov & Pinkus, 1999).
Theorem 2 (KST). There exist $d$ constants $\lambda_j > 0$, $j = 1, \ldots, d$, $\sum_{j=1}^{d} \lambda_j \le 1$, and $2d+1$ continuous strictly increasing functions $\chi_i$, $i = 1, \ldots, 2d+1$, which map $[0,1]$ to itself, such that every $f \in C([0,1]^d)$ can be represented in the form
$$f(x_1, \ldots, x_d) = \sum_{i=1}^{2d+1} g\Big(\sum_{j=1}^{d} \lambda_j \chi_i(x_j)\Big)$$
for some $g \in C([0,1])$ depending on $f$.

An immediate corollary of this theorem is a reduction of multivariate superexpressiveness to the univariate one.
Corollary 1.
If a family $\mathcal{A}$ is superexpressive for dimension $d = 1$, then it is superexpressive for all $d$. Moreover, the number of neurons and connections in the respective superexpressive architectures scales as $O(d^2)$.

The proof follows simply by approximating the functions $\chi_i$ and $g$ in the KST by univariate superexpressive networks.
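To make this reduction concrete, here is a minimal sketch (an illustration only, under the assumption that univariate superexpressive subnetworks are available as callables approx_chi[i] and approx_g; these names are not from the paper). It assembles the multivariate approximation in the KST form of Theorem 2, using $O(d^2)$ univariate subnetwork evaluations.

def kst_network(x, approx_chi, approx_g, lambdas):
    # x: list of d inputs; approx_chi: list of 2d+1 univariate subnetworks;
    # approx_g: univariate subnetwork approximating g; lambdas: the d constants.
    d = len(x)
    total = 0.0
    for i in range(2 * d + 1):
        inner = sum(lambdas[j] * approx_chi[i](x[j]) for j in range(d))
        total += approx_g(inner)
    return total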
Our main result establishes the existence of simple superexpressive families constructed from finitely many elementary functions. The full list of properties of the activations that we use is relatively cumbersome, so we find it more convenient to just prove the result for a few particular examples rather than attempt to state it in a general form.

Theorem 3. Each of the following families of activation functions is superexpressive:
$$\mathcal{A}_1 = \{\sigma_1, \lfloor\cdot\rfloor\}, \qquad \mathcal{A}_2 = \{\sin, \arcsin\}, \qquad \mathcal{A}_3 = \{\sigma_2\},$$
where $\sigma_1$ is any function that is real analytic and non-polynomial in some interval $(\alpha, \beta) \subset \mathbb{R}$, and
$$\sigma_2(x) = \begin{cases} -2 - \frac{2}{x}, & x < -1,\\[2pt] \frac{2}{\pi}\big(x \arcsin x + \sqrt{1 - x^2}\big) + x, & x \in [-1, 1],\\[2pt] 4 - \frac{2}{x} + \frac{\sin x}{\pi x^2}, & x > 1. \end{cases}$$
The function $\sigma_2$ is $C(\mathbb{R})$, bounded, and strictly monotone increasing.

The family $\mathcal{A}_1$ is a generalization of the family for which (Shen et al., 2020b) proved a weaker superexpressiveness property.

The function $\sigma_2$ is given as an example of an explicit superexpressive activation that is smooth and sigmoidal (see Fig. 2).

Figure 2. The function $\sigma_2$ from the statement of Theorem 3.

Proof of Theorem 3.
We consider the families $\mathcal{A}_1$, $\mathcal{A}_2$, $\mathcal{A}_3$ one by one.

Proof for $\mathcal{A}_1$. Given a function $f \in C([0,1]^d)$, we will construct the approximation $\widetilde{f}$ as a function piecewise constant on a partition of the cube $[0,1]^d$ into a grid of smaller cubes. Following the paper (Shen et al., 2020b), we specify these cubes by mapping them to integers with the help of the function $\lfloor\cdot\rfloor$. Specifically, take some $M \in \mathbb{N}$ and let
$$g_M(x_1, \ldots, x_d) = 1 + \sum_{k=1}^{d} (M+1)^{k-1} \lfloor M x_k \rfloor. \qquad (1)$$
The function $g_M$ is integer-valued and constant on the cubes $I_{M,\mathbf{m}} = [\tfrac{m_1}{M}, \tfrac{m_1+1}{M}) \times \ldots \times [\tfrac{m_d}{M}, \tfrac{m_d+1}{M})$ indexed by integer multi-indices $\mathbf{m} = (m_1, \ldots, m_d) \in \mathbb{Z}^d$. The cube $[0,1]^d$ overlaps with $(M+1)^d$ such cubes $I_{M,\mathbf{m}}$, namely those with $0 \le m_k \le M$. Each of these cubes is mapped by $g_M$ to a unique integer in the range $[1, (M+1)^d]$.

Consider the periodic function
$$\varphi(x) = x - \lfloor x \rfloor, \qquad \varphi: \mathbb{R} \to [0,1). \qquad (2)$$
We will now seek our approximation in the form
$$\widetilde{f}(\mathbf{x}) = u(g_M(\mathbf{x})), \qquad u(y) = (B - A)\,\varphi\big(s\,\sigma_1\big(\tfrac{\alpha+\beta}{2} + w y\big)\big) + A, \qquad (3)$$
where
$$A = \min_{\mathbf{x} \in [0,1]^d} f(\mathbf{x}), \qquad B = \max_{\mathbf{x} \in [0,1]^d} f(\mathbf{x}), \qquad (4)$$
$\tfrac{\alpha+\beta}{2}$ is the center of the interval $(\alpha, \beta)$ where $\sigma_1$ is analytic and non-polynomial, and $s$ and $w$ are some weights to be chosen shortly. Clearly, the computation defined by Eqs. (1),(2),(3) is representable by a neural network of a fixed size only depending on $d$ (as $O(d)$) and using activations from $\mathcal{A}_1$.

Let $N = (M+1)^d$. Using the uniform continuity of $f$ and choosing $M$ large so that the size of each cube $I_{M,\mathbf{m}}$ is arbitrarily small, we see that the superexpressiveness will be established if we show that for any $N$, any $\epsilon > 0$, and any $\mathbf{y} \in [A,B]^N$ there exist some weights $s$ and $w$ such that
$$|u(n) - y_n| < \epsilon \quad \text{for all } n = 1, \ldots, N. \qquad (5)$$
Recall that a set of numbers $a_1, \ldots, a_N$ is called rationally independent if they are linearly independent over the field $\mathbb{Q}$ (i.e., no equality $\sum_{n=1}^{N} \lambda_n a_n = 0$ with rational coefficients $\lambda_n$ can hold unless all $\lambda_n = 0$). Our strategy will be:

1. to choose the weight $w$ so as to make the values $a_n = \sigma_1(\tfrac{\alpha+\beta}{2} + wn)$ with $n = 1, \ldots, N$ rationally independent;
2. to use the density of the irrational winding on the torus to find $s$ ensuring condition (5).

For step 1, we state the following lemma.
Lemma 1. Let $\sigma$ be a real analytic function in an interval $(\alpha, \beta)$ with $\beta > \alpha$. Suppose that there is $N$ such that for all $w$ with sufficiently small absolute value, the values $\big(\sigma(\tfrac{\alpha+\beta}{2} + wn)\big)_{n=1}^{N}$ are not rationally independent. Then $\sigma$ is a polynomial.

Proof. For fixed coefficients $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_N)$, the function $\sigma_{\boldsymbol{\lambda}}(w) = \sum_{n=1}^{N} \lambda_n \sigma(\tfrac{\alpha+\beta}{2} + nw)$ is real analytic for $w \in U_N = \big(-\tfrac{\beta-\alpha}{2N}, \tfrac{\beta-\alpha}{2N}\big)$. Since there are only countably many $\boldsymbol{\lambda} \in \mathbb{Q}^N$, we see that under the hypothesis of the lemma there is some $\boldsymbol{\lambda}$ such that $\sigma_{\boldsymbol{\lambda}}$ vanishes on an uncountable subset of $U_N$. Then, by analyticity, $\sigma_{\boldsymbol{\lambda}} \equiv 0$ on $U_N$. Expanding this $\sigma_{\boldsymbol{\lambda}}$ into the Taylor series at $w = 0$, we get the identity $\sum_{n=1}^{N} \lambda_n n^m = 0$ for each $m$ such that $\frac{d^m \sigma}{dx^m}\big(\tfrac{\alpha+\beta}{2}\big) \ne 0$. If there are infinitely many such $m$, then all $\lambda_n = 0$ (by letting $m \to \infty$). It follows that if $\boldsymbol{\lambda}$ is nonzero, then there are only finitely many $m$'s such that $\frac{d^m \sigma}{dx^m}\big(\tfrac{\alpha+\beta}{2}\big) \ne 0$, i.e. $\sigma$ is a polynomial.

Applying Lemma 1 to $\sigma = \sigma_1$, we see that for any $N$ there is $w$ such that the values $a_n = \sigma_1(\tfrac{\alpha+\beta}{2} + wn)$ with $n = 1, \ldots, N$ are rationally independent.

For step 2, we use the well-known fact that an irrational winding on the torus is dense:
Lemma 2. Let $a_1, \ldots, a_N$ be rationally independent real numbers. Then the set $Q_N = \{(\varphi(s a_1), \ldots, \varphi(s a_N)) : s \in \mathbb{R}\}$ (where $\varphi$ is defined in Eq. (2)) is dense in $[0,1)^N$.

For completeness, we provide a proof in Appendix A.

Lemma 2 implies that for any $\mathbf{y} \in [A,B]^N$, the point $\frac{\mathbf{y} - A}{B - A} \in [0,1)^N$ can be approximated by vectors $(\varphi(s a_n))_{n=1}^{N}$. This implies condition (5), thus finishing the proof for $\mathcal{A}_1$.
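As a purely illustrative rendering of this construction (not the paper's code), the sketch below evaluates $\widetilde f = u \circ g_M$ from Eqs. (1)–(3) for given weights $s, w$; the choice $\sigma_1 = \exp$ with center $0$ is an assumption made for the illustration (any function analytic and non-polynomial near the center would do), and finding good $s$ and $w$ is precisely the nonconstructive part of the argument.

import numpy as np

def g_M(x, M):
    # Eq. (1): map the grid cube containing x to a unique integer in [1, (M+1)**d].
    x = np.asarray(x, dtype=float)
    return 1 + sum((M + 1) ** k * int(np.floor(M * x[k])) for k in range(x.size))

def f_tilde(x, M, s, w, A, B, sigma1=np.exp, center=0.0):
    # Eq. (3): piecewise-constant approximation u(g_M(x)).
    phi = lambda t: t - np.floor(t)          # Eq. (2): the sawtooth x - floor(x)
    return (B - A) * phi(s * sigma1(center + w * g_M(x, M))) + A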
Figure 3. The functions $\theta, \nu, \psi$ from the proof of Theorem 3.

Proof for $\mathcal{A}_2$. We will only give a proof for $d = 1$; the claim then follows for all larger $d$ by Corollary 1.

Consider the piecewise linear periodic function $\theta(x) = \frac{1}{\pi}\arcsin(\sin \pi x)$ and the related functions $\nu(x) = x + \theta(x)$ and $\psi$, a continuous piecewise linear bump function built from $\theta$ and $\nu$ whose translates form a partition of unity (see Fig. 3).

We would like to extend the previous proof for $\mathcal{A}_1$ to the present case of $\mathcal{A}_2$ using the function $\nu$ as a substitute for $\lfloor\cdot\rfloor$, since $\nu$ is constant on the intervals $[k - \tfrac{1}{2}, k + \tfrac{1}{2}]$ with odd integer $k$. However, in contrast to the function $\lfloor\cdot\rfloor$, the function $\nu$ is continuous and cannot map the whole segment $[0,1]$ to a finite set of values, which was crucial in the proof for $\mathcal{A}_1$. For this reason, we use a partition of unity and represent the approximated function $f \in C([0,1])$ as a sum of four functions supported on a grid of disjoint small segments. Specifically, let again $M$ be a large integer determining the scale of our partition of unity. We define this partition by
$$1 \equiv \sum_{q=-1}^{2} \psi_q(x), \qquad \psi_q(x) = \psi(2Mx - c_q), \qquad x \in \mathbb{R}, \qquad (6)$$
and the respective decomposition of the function $f$ by
$$f = \sum_{q=-1}^{2} f_q, \qquad f_q = f \psi_q. \qquad (7)$$
For a fixed $q$, the function $\psi_q$, and hence also $f_q$, vanishes outside of the union of $N = M + O(1)$ disjoint segments $J_{q,p}$, $p = 1, \ldots, N$, of length of order $\tfrac{1}{M}$, overlapping with the segment $[0,1]$. Denote this union by $J_q$.

We approximate each function $f_q$ by a function $\widetilde{f}_q$ using an analog of the representation (1),(2),(3):
$$G_q(x) = \nu(2Mx - c_q), \qquad (8)$$
$$v_q(x) = \Big(2 \max_{x \in [0,1]} |f(x)|\Big)\, \theta\big(s \sin(w\, G_q(x))\big), \qquad (9)$$
$$\widetilde{f}_q(x) = v_q(x)\, \psi_q(x), \qquad (10)$$
$$\widetilde{f} = \sum_{q=-1}^{2} \widetilde{f}_q. \qquad (11)$$
Here the shifts $c_q$, $q \in \{-1, 0, 1, 2\}$, are chosen so that the function $G_q$ in Eq. (8) is constant and equal to the odd integer $2p - 1$ on each segment $J_{q,p}$. In particular, different segments $J_{q,p}$ overlapping with the segment $[0,1]$ are mapped by $G_q$ to different integers in the interval $[1, 2M+1]$.

The function $v_q$ in Eq. (9) is the analog of the expression for $\widetilde{f}$ given in Eq. (3). Like $G_q$, the function $v_q$ is constant on each segment $J_{q,p}$. By Lemma 1, the values $(\sin(wm))_{m=1}^{2M+1}$ are rationally independent for a suitable $w$. We can then use again the density of the irrational winding on the torus (Lemma 2) to find $s$ such that for each $p$ the value $v_q|_{J_{q,p}}$ is arbitrarily close to the value of $f$ at the center $x_{q,p}$ of the segment $J_{q,p}$. Indeed, $\theta$ is a continuous periodic (period-2) function with $\max_x \theta(x) = -\min_x \theta(x) = \tfrac{1}{2}$. For each $p = 1, \ldots, N$, we can first find $z_{q,p} \in \mathbb{R}/(2\mathbb{Z})$ such that $\big(2 \max_{x \in [0,1]} |f(x)|\big)\,\theta(z_{q,p}) = f(x_{q,p})$, and then, by Lemma 2, find $s$ such that $s \sin(w(2p-1))$ is arbitrarily close to $z_{q,p}$ on the circle $\mathbb{R}/(2\mathbb{Z})$ for each $p = 1, \ldots, N$.

As a result, we see that the function $v_q$ can approximate the function $f$ on the whole set $J_q$. As before, to achieve an arbitrarily small error, we need to first choose $M$ large enough and then choose suitable $w$ and $s$. (By the uniform continuity of $f$, one can use here the same $w$ and $s$ for all $q \in \{-1, 0, 1, 2\}$.)

At the same time, it makes no difference how the function $v_q$ behaves on the complementary set $[0,1] \setminus J_q$, since $\psi_q$ vanishes on this set. It follows that $\widetilde{f}_q$ defined by Eq. (10) can approximate $f_q$ defined by Eq. (7) with arbitrarily small error on the whole segment $[0,1]$. Then, the function $\widetilde{f}$ given by Eq. (11) can approximate $f$ uniformly on $[0,1]$ with any accuracy.

The computation (8)–(11) is directly representable by a fixed-size neural network with activations $\{\sin, \arcsin\}$, except for the multiplication step (10).
Multiplication, however, can be implemented with any accuracy by a fixed-size subnetwork:

Lemma 3 (Approximate multiplier). Suppose that an activation function $\sigma$ has a point $x_0$ where the second derivative $\frac{d^2\sigma}{dx^2}(x_0)$ exists and is nonzero. Then there is a fixed two-input network architecture with this activation that allows to implement the approximate multiplication of the inputs, $(x, y) \mapsto xy$, with any accuracy uniformly on any bounded set of inputs $x, y$, by suitably adjusting the weights.

Proof. First note that we can implement the approximate squaring $x \mapsto x^2$ with any accuracy using just a network with three neurons. Indeed, by the assumption on $\frac{d^2\sigma}{dx^2}$, for any $C, \epsilon > 0$ we can choose $\delta$ such that
$$\Big|\Big(\tfrac{d^2\sigma}{dx^2}(x_0)\Big)^{-1} \delta^{-2}\big(\sigma(x_0 + x\delta) + \sigma(x_0 - x\delta) - 2\sigma(x_0)\big) - x^2\Big| < \epsilon \quad \text{for all } |x| < C.$$
Then, using the polarization identity $xy = \tfrac{1}{4}\big((x+y)^2 - (x-y)^2\big)$, we see that the desired approximate multiplier can be implemented using a fixed 6-neuron architecture.

We can apply this lemma with $\sigma = \sin$ and any $x_0 = \tfrac{\pi}{2} + \pi k$, $k \in \mathbb{Z}$, thus completing the proof for $\mathcal{A}_2$.

Proof for $\mathcal{A}_3$. We reduce this case to the previous one, $\mathcal{A}_2$. First observe that we can approximate the function $\arcsin$ by a fixed-size $\sigma_2$-network.
Lemma 4. A superexpressive family of continuous activations remains superexpressive if some activations are replaced by their antiderivatives.

Proof.
The claim follows since any continuous activation $\sigma$ can be approximated uniformly on compact sets by the expressions $\delta^{-1}\big(\sigma^{(-1)}(x + \delta) - \sigma^{(-1)}(x)\big)$, where $\sigma^{(-1)} = \int \sigma$.

Our activation $\sigma_2$ is the antiderivative of $\frac{2}{\pi}\arcsin x + 1$ on the interval $[-1, 1]$.

Observe next that on the interval $[1, \infty)$, we can express the function $\sin x$ by multiplying $\sigma_2(x)$ by suitable polynomials in $x$ and subtracting polynomials. By Lemma 3, these operations can be implemented with any accuracy by a fixed-size $\sigma_2$-network. By the periodicity of $\sin$, we can then approximate it on any bounded interval.

We conclude that we can approximate any $\mathcal{A}_2$-network with any accuracy by a $\sigma_2$-network that has the same size up to a constant factor.

It is an elementary computation that $\sigma_2$ is $C(\mathbb{R})$, bounded and monotone increasing. This completes the proof of the theorem.
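The approximate squaring and multiplication of Lemma 3 are easy to try numerically. The sketch below (an illustration, not code from the paper) uses $\sigma = \sin$ with $x_0 = \pi/2$, where $\sin''(x_0) = -1$, together with the polarization identity.

import numpy as np

def approx_square(x, delta=1e-3, x0=np.pi / 2):
    # Second-difference approximation of x**2: sin''(x0) = -1, hence the division by -delta**2.
    second_diff = np.sin(x0 + x * delta) + np.sin(x0 - x * delta) - 2 * np.sin(x0)
    return second_diff / (-delta ** 2)

def approx_multiply(x, y, delta=1e-3):
    # Polarization identity: xy = ((x + y)**2 - (x - y)**2) / 4.
    return (approx_square(x + y, delta) - approx_square(x - y, delta)) / 4

# Example: approx_multiply(1.7, -2.3) returns approximately -3.91.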
3. Absence of superexpressiveness for standard activations
In this section we show that most practically used activation functions (those not involving $\sin x$ or $\cos x$) are not superexpressive. This is an easy consequence of Khovanskii's bounds on the number of zeros of elementary functions (Khovanskii, 1991). We remark that these bounds have been used previously to bound the expressiveness of neural networks in terms of VC dimension (Karpinski & Macintyre, 1997) or Betti numbers of level sets (Bianchini & Scarselli, 2014).

First recall the standard definition of Pfaffian functions (see e.g. (Khovanskii, 1991; Zell, 1999; Gabrielov & Vorobjov, 2004)). A Pfaffian chain is a sequence $f_1, \ldots, f_l$ of real analytic functions defined on a common connected domain $U \subset \mathbb{R}^d$ and such that the equations
$$\frac{\partial f_i}{\partial x_j}(\mathbf{x}) = P_{ij}\big(\mathbf{x}, f_1(\mathbf{x}), \ldots, f_i(\mathbf{x})\big), \qquad 1 \le i \le l, \quad 1 \le j \le d,$$
hold in $U$ for some polynomials $P_{ij}$. A Pfaffian function in the chain $(f_1, \ldots, f_l)$ is a function on $U$ that can be expressed as a polynomial $P$ in the variables $(\mathbf{x}, f_1(\mathbf{x}), \ldots, f_l(\mathbf{x}))$. The complexity of the Pfaffian function $f$ is the triplet $(l, \alpha, \beta)$ consisting of the length $l$ of the chain, the maximum degree $\alpha$ of the polynomials $P_{ij}$, and the degree $\beta$ of the polynomial $P$.

The importance of Pfaffian functions stems from the fact that they include all elementary functions when considered on suitable domains. This is shown by first checking that the simplest elementary functions are Pfaffian, and then by checking that arithmetic operations and compositions of Pfaffian functions produce again Pfaffian functions. We refer again to (Khovanskii, 1991; Zell, 1999; Gabrielov & Vorobjov, 2004) for details.
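For instance (a standard example, spelled out here only as an illustration), the logistic sigmoid forms a Pfaffian chain of length one by itself, since its derivative is a polynomial in the function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma}{dx}(x) = \sigma(x)\big(1 - \sigma(x)\big) = P_{11}\big(x, \sigma(x)\big), \qquad P_{11}(x, y) = y(1-y),$$
so $\sigma$ is Pfaffian on $\mathbb{R}$ with chain length $l = 1$, degree $\alpha = 2$, and (being the polynomial $P(x,y) = y$ in this chain) degree $\beta = 1$.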
Proposition 1.
1. (Elementary examples) The following functions are Pfaffian: polynomials on $U = \mathbb{R}^d$, $e^x$ on $\mathbb{R}$, $\ln x$ on $\mathbb{R}_+$, $\arcsin x$ on $(-1, 1)$. The function $\sin x$ is Pfaffian on any bounded interval $(A, B)$, with complexity depending on $B - A$, but $\sin x$ is not Pfaffian on $\mathbb{R}$.

2. (Operations with Pfaffian functions) Sums and products of Pfaffian functions $f, g$ with a common domain $U$ are Pfaffian. If the domain of a Pfaffian function $f$ includes the range of a Pfaffian function $g$, then the composition $f \circ g$ is Pfaffian on the domain of $g$. The complexity of the resulting functions $f + g$, $fg$, $f \circ g$ is determined by the complexities of the functions $f, g$.

We state now the fundamental result on Pfaffian functions. We call a solution $\mathbf{x} \in \mathbb{R}^d$ of a system $f_1(\mathbf{x}) = \ldots = f_d(\mathbf{x}) = 0$ nondegenerate if the respective Jacobi matrix $\big(\frac{\partial f_i}{\partial x_j}(\mathbf{x})\big)_{i,j}$ is nondegenerate.

Theorem 4 (Khovanskii 1991). Let $f_1, \ldots, f_d$ be Pfaffian $d$-variable functions with a common Pfaffian chain on a connected domain $U$. Then the number of nondegenerate solutions of the system $f_1(\mathbf{x}) = \ldots = f_d(\mathbf{x}) = 0$ is bounded by a finite number only depending on the complexities of the functions $f_1, \ldots, f_d$.

The idea of the proof is to use a generalized Rolle's lemma and bound the number of common zeros of the functions $f_k$ by the number of common zeros of suitable polynomials (in a larger number of variables). The latter number can then be upper bounded using the classical Bézout theorem. It is possible to write the bound in Theorem 4 explicitly, but we will not need that for our purposes.

We will only use the univariate version of Theorem 4. In this case, it will also be easy to remove the inconvenient nondegeneracy condition in this theorem. (Note that this condition is essential in general – for example, if $f_1(x_1, x_2) \equiv f_2(x_1, x_2) = x_1$, then the system $f_1 = f_2 = 0$ has infinitely many degenerate solutions.)
Proposition 2. Let $f$ be a univariate Pfaffian function on an open interval $I \subset \mathbb{R}$. Then either $f \equiv 0$ on $I$, or the number of zeros of $f$ is bounded by a finite number only depending on the complexity of $f$.

Proof. Suppose $f \not\equiv 0$. Then, by real analyticity of $f$, any zero $x_0$ of $f$ in $I$ is isolated, and we can write $f(x) = c(x - x_0)^k(1 + o(1))$ as $x \to x_0$, with some $c \ne 0$ and $k \in \mathbb{N}$. By Sard's theorem, there is a sequence $\epsilon_n \searrow 0$ such that the values $\pm\epsilon_n$ are not critical values of $f$. The functions $f \pm \epsilon_n$ are Pfaffian with the same complexity as $f$, and do not have degenerate zeros. For any zero $x_0$ of $f$, the two functions $f \pm \epsilon_n$ will have in total two nondegenerate zeros in a vicinity of $x_0$, for any $\epsilon_n$ small enough. It follows that the total number of all nondegenerate zeros of the two functions $f \pm \epsilon_n$, for $\epsilon_n$ small enough, will be at least twice as large as the number of zeros of the function $f$ (or can be made arbitrarily large if $f$ has infinitely many zeros). Applying Theorem 4 to the functions $f \pm \epsilon_n$, we obtain the desired conclusion on the zeros of $f$.

Now we apply these results to standard activation functions.
Definition 2. We say that an activation function $\sigma$ is piecewise Pfaffian if its domain of definition can be represented as a union of finitely many open intervals $U_n$ and points $x_k$ in $\mathbb{R}$ so that $\sigma$ is Pfaffian on each $U_n$.

By the discussion above, this definition covers most practically used activations, such as $\tanh x$, the standard sigmoid $\sigma(x) = (1 + e^{-x})^{-1}$, ReLU $\sigma(x) = \max(0, x)$, leaky ReLU $\sigma(x) = \max(ax, x)$, the binary step function $\sigma(x) = \mathbb{1}_{x \ge 0}$, the Gaussian $\sigma(x) = e^{-x^2}$, softplus $\sigma(x) = \ln(1 + e^x)$ (Glorot et al., 2011), the ELU $\sigma(x) = a(e^x - 1)$ for $x < 0$ and $\sigma(x) = x$ for $x \ge 0$ (Clevert et al., 2015), etc. Our main result in this section states that any finite collection of such activations is not superexpressive.
Theorem 5. Let $\mathcal{A}$ be a family of finitely many piecewise Pfaffian activation functions. Then $\mathcal{A}$ is not superexpressive.

Proof. Suppose that $\mathcal{A}$ is superexpressive, and there is a fixed one-input network architecture allowing us to approximate any univariate function $f \in C([0,1])$. Then for any $N$ we can choose the network weights so that the function $\widetilde{f}$ implemented by the network has at least $N$ sign changes, in the sense that there are points $0 \le a_0 < \ldots < a_N \le 1$ such that $(-1)^n \widetilde{f}(a_n) > 0$ for all $n$. Indeed, this follows simply by approximating the function $f(x) = \sin((N+1)\pi x)$ with an error less than 1. We will show, however, that this $N$ cannot be arbitrarily large if the activations are from a finite piecewise Pfaffian family.
Lemma 5. If the activations belong to a finite piecewise Pfaffian family $\mathcal{A}$, then any function $\widetilde{f}$ implemented by the network is piecewise Pfaffian. Moreover, the number of respective intervals $U_n$ as well as the complexity of each restriction $\widetilde{f}|_{U_n}$ do not exceed some finite values only depending on the family $\mathcal{A}$ and the network architecture.

Proof. This can be proved by induction on the number of hidden neurons in the network. The base of induction corresponds to networks without hidden neurons; in this case the statement is trivial. Now we make the induction step. Given a network, choose some hidden neuron whose output is not used by other hidden neurons (i.e., choose a neuron in the "last hidden layer"). With respect to this neuron, we can decompose the network output as
$$\widetilde{f}(x) = c\,\sigma_k\Big(\sum_{s=1}^{K} c_s \widetilde{f}_s(x) + h\Big) + \sum_{s=1}^{K} c'_s \widetilde{f}_s(x) + h'. \qquad (12)$$
Here, $\sigma_k$ is the activation function residing at the chosen neuron, $\widetilde{f}_s$ are the signals going out of the other hidden and input neurons, and $c, c_s, c'_s, h, h'$ are various weights. By the inductive hypothesis, all functions $\widetilde{f}_s$ here are piecewise Pfaffian. Moreover, by taking intersections, the segment $[0,1]$ can be divided into finitely many open intervals $I_j$ separated by finitely many points $x_l$ so that each of the functions $\widetilde{f}_s$ is Pfaffian on each interval $I_j$. The number of these intervals $I_j$ and the complexities of $\widetilde{f}_s|_{I_j}$ are bounded by some finite values depending only on the family $\mathcal{A}$ and the network architecture. We see also that the linear combination $F(x) = \sum_{s=1}^{K} c_s \widetilde{f}_s(x) + h$ appearing in Eq. (12) is Pfaffian on each interval $I_j$.

Observe next that the composition $\sigma_k \circ F$ is piecewise Pfaffian on each interval $I_j$. Indeed, let $U_r^{(k)}$ and $x_r^{(k)}$ be the finitely many open intervals and points associated with the activation $\sigma_k$ as a piecewise Pfaffian function. By Proposition 2, for each $r$, the pre-image $(F|_{I_j})^{-1}(x_r^{(k)})$ is either the whole interval $I_j$ or its finite subset. In the first case, $\sigma_k \circ F$ is constant and thus trivially Pfaffian on $I_j$. In the second case, the interval $I_j$ can be subdivided into sub-intervals $I_{j,m}$ such that each image $F(I_{j,m})$ belongs to one of the intervals $U_r^{(k)}$, so that $\sigma_k \circ F$ is Pfaffian on $I_{j,m}$. The number of these sub-intervals and the complexities of the restrictions are bounded by some finite numbers depending on the activation $\sigma_k$ and the complexity of $F|_{I_j}$.

Returning to representation (12), we see that $\widetilde{f}$ is Pfaffian on each interval $I_{j,m}$; moreover, the total number of these intervals as well as the complexities of the restrictions $\widetilde{f}|_{I_{j,m}}$ are bounded by finite numbers determined by the family $\mathcal{A}$ and the architecture, thus proving the claim.

The lemma implies that some interval $U_n$ in which $\widetilde{f}$ is Pfaffian and has a bounded complexity must contain an arbitrarily large number of sign changes of $\widetilde{f}$. This gives a contradiction with Proposition 2.
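To illustrate the counting argument numerically (an illustration only; the actual bound comes from Theorem 4 and Lemma 5), the sketch below checks whether a candidate network output alternates in sign at the test points $a_n = (n + 1/2)/(N+1)$ used in the proof; f_tilde is assumed to be any callable implementing the network.

import numpy as np

def alternates(f_tilde, N):
    # Test points at which sin((N+1)*pi*x) equals (-1)**n.
    a = (np.arange(N + 1) + 0.5) / (N + 1)
    values = np.array([f_tilde(t) for t in a])
    signs = (-1.0) ** np.arange(N + 1)
    # True iff f_tilde alternates sign at all N+1 test points, which any
    # approximation of sin((N+1)*pi*x) with error < 1 must do.
    return bool(np.all(signs * values > 0))

# A fixed-size network with piecewise Pfaffian activations (e.g. tanh) can pass
# this check only for N below a bound depending on the architecture.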
4. Discussion
We have given examples of simple explicit activation functions that allow to approximate arbitrary functions using fixed-size networks (Theorem 3), and we have also shown that this cannot be achieved with the common practical activations (Theorem 5). We mention two interesting questions left open by our results.

First, our existence result (Theorem 3) is of course purely theoretical: though the network is small, a huge approximation complexity is hidden in the very special choice of the network weights. Nevertheless, assuming that we can perform computations with any precision, one can ask if it is possible to algorithmically find network weights providing a good approximation. The main difficulty here is to find a value $s$ such that $(\varphi(s a_n))_{n=1}^{N}$ is close to the given $N$-dimensional point. Such a value exists by Lemma 2 on the density of irrational winding, and the proof of the lemma is essentially constructive, so theoretically one can perform the necessary computation and find the desired $s$. However, the proof is based on the pigeonhole principle and is very prone to the curse of dimensionality (with dimensionality here corresponding to the number $N$ of fitted data points), making this computation practically unfeasible even for relatively small $N$.

Another open question is whether the function $\sin$ alone is superexpressive. This cannot be ruled out by the methods of Section 3, since $\sin$ has an infinite Pfaffian complexity on $\mathbb{R}$. More generally, one can ask if there are individual superexpressive activations that are elementary and real analytic on the whole $\mathbb{R}$. A repeated computation of antiderivatives using Lemma 4 allows us to construct a piecewise elementary superexpressive function of any finite smoothness, but not analytic on $\mathbb{R}$.
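As a naive illustration of this difficulty (not an algorithm from the paper; the plain grid search over $s$ is an assumption made for the sketch), one can scan a range of $s$ values and keep the best match. The required range and resolution grow extremely fast with $N$, which is the curse of dimensionality mentioned above.

import numpy as np

def search_s(a, target, s_max=1e4, num=100_000):
    # Brute-force scan over s in [0, s_max]: return the s whose fractional parts
    # (phi(s*a_n))_n come closest to the target point in [0,1)^N.
    a = np.asarray(a, dtype=float)
    target = np.asarray(target, dtype=float)
    best_s, best_err = 0.0, np.inf
    for s in np.linspace(0.0, s_max, num):
        frac = (s * a) % 1.0                      # phi(s * a_n)
        diff = np.abs(frac - target)
        err = np.max(np.minimum(diff, 1.0 - diff))  # sup-distance on the torus
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err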
A. Proof of Lemma 2

It is convenient to endow the cube $[0,1)^N$ with the topology of the torus $T^N = \mathbb{R}^N/\mathbb{Z}^N$ by gluing the endpoints of the interval $[0,1)$. Though the lemma is stated in terms of the original topology on $[0,1)^N$, it is clear that a subset is dense in the original topology if and only if it is dense in the topology of the torus. Accordingly, when considering the distance between two points $\mathbf{b}_1, \mathbf{b}_2 \in [0,1)^N$, it will be convenient to use the distance between the corresponding cosets, i.e.
$$\rho(\mathbf{b}_1, \mathbf{b}_2) = \min_{\mathbf{z}_1, \mathbf{z}_2 \in \mathbb{Z}^N} |\mathbf{b}_1 + \mathbf{z}_1 - (\mathbf{b}_2 + \mathbf{z}_2)|,$$
where $|\cdot|$ is the usual Euclidean norm. Note that this $\rho$ is a shift-invariant metric on the torus.

The proof of the lemma is by induction on $N$. The base $N = 1$ is obvious (a single number $a_1$ is rationally independent iff $a_1 \ne 0$). Let us make the induction step from $N-1$ to $N$, with $N \ge 2$.

Given the rationally independent numbers $a_1, \ldots, a_N$, first observe that none of them equals 0. Let $s_0 = 1/a_N$. Let $\varphi(x) = x - \lfloor x \rfloor$ as in Eq. (2). If $s = m s_0$ with some integer $m$, then $\varphi(m s_0 a_N) = 0$, so that the points $\mathbf{b}_m = (\varphi(m s_0 a_1), \ldots, \varphi(m s_0 a_N))$ lie in the $(N-1)$-dimensional face $[0,1)^{N-1}$ of the full set $[0,1)^N$.

Observe that the points $\mathbf{b}_m$ are different for different integer $m$'s. Indeed, if $\mathbf{b}_{m_1} = \mathbf{b}_{m_2}$ for some integers $m_1 \ne m_2$, then there are some integers $p_1, \ldots, p_{N-1}$ such that $(m_1 - m_2) s_0 a_n = p_n$ for all $n = 1, \ldots, N-1$. But then the numbers $a_1, \ldots, a_N$ are not rationally independent, since, e.g., $(m_1 - m_2) a_1 = p_1 s_0^{-1} = p_1 a_N$.

Since the points $\mathbf{b}_m$ are distinct, they form an infinite set in $[0,1)^{N-1}$. Then for any $\epsilon$ we can find a pair of different points $\mathbf{b}_{m_1}$ and $\mathbf{b}_{m_2}$ separated by a distance $\rho(\mathbf{b}_{m_1}, \mathbf{b}_{m_2}) < \epsilon$. Note that the distance $\rho(\mathbf{b}_{m_1}, \mathbf{b}_{m_2})$ only depends on the difference $m_1 - m_2$, so we can assume that $m_2 = 0$:
$$\rho(\mathbf{b}_0, \mathbf{b}_{m_1}) = \rho(\mathbf{0}, \mathbf{b}_{m_1}) < \epsilon.$$
By the definition of $\rho$, we can then find $\mathbf{z} \in \mathbb{Z}^N$ such that for $\mathbf{b}'_{m_1} = \mathbf{b}_{m_1} - \mathbf{z}$ we have
$$|\mathbf{b}'_{m_1}| = \rho(\mathbf{0}, \mathbf{b}_{m_1}) < \epsilon. \qquad (13)$$
We can write $\mathbf{b}'_{m_1}$ in the form $\mathbf{b}'_{m_1} = (m_1 s_0 a_1 - p_1, \ldots, m_1 s_0 a_{N-1} - p_{N-1}, 0)$, with some integers $p_1, \ldots, p_{N-1}$.

Observe that the first $N-1$ components $b'_{m_1,n}$ of $\mathbf{b}'_{m_1}$ are rationally independent. Indeed, if $\sum_{n=1}^{N-1} \lambda_n b'_{m_1,n} = 0$ with some rational $\lambda_n$, then, by expressing this identity in terms of the original values $a_n$, we get
$$\sum_{n=1}^{N-1} \lambda_n a_n - \frac{1}{m_1}\sum_{n=1}^{N-1} \lambda_n p_n a_N = 0,$$
so $\lambda_n \equiv 0$ by the rational independence of $a_1, \ldots, a_N$.

Consider now the set $Q'_{N-1} = \{(\varphi(t\, b'_{m_1,1}), \ldots, \varphi(t\, b'_{m_1,N-1})) : t \in \mathbb{R}\}$. On the one hand, by the induction hypothesis, the set $Q'_{N-1}$ is dense in $[0,1)^{N-1}$, because the numbers $b'_{m_1,n}$ are rationally independent. On the other hand, observe that the points in $Q'_{N-1}$ corresponding to integer $t$ also belong to the set $Q_N = \{(\varphi(s a_1), \ldots, \varphi(s a_N)) : s \in \mathbb{R}\}$: specifically, with the respective $s = m_1 t s_0$. It follows that for any $\mathbf{b} \in [0,1)^{N-1}$ we can find a point $\widetilde{\mathbf{b}}$ of the set $Q_N$ at a distance at most $2\epsilon$ from $\mathbf{b}$: first find a point $\widehat{\mathbf{b}} \in Q'_{N-1}$ such that $|\widehat{\mathbf{b}} - \mathbf{b}| < \epsilon$, and then, if $\widehat{\mathbf{b}}$ corresponds to some $t = t_0$ in $Q'_{N-1}$, take $\widetilde{\mathbf{b}}$ corresponding to $t = \lfloor t_0 \rfloor$. The distance $\rho(\widehat{\mathbf{b}}, \widetilde{\mathbf{b}}) < \epsilon$ by Eq. (13) and because $|t_0 - \lfloor t_0 \rfloor| < 1$:
$$\rho(\widehat{\mathbf{b}}, \widetilde{\mathbf{b}}) \le |t_0 \mathbf{b}'_{m_1} - \lfloor t_0 \rfloor \mathbf{b}'_{m_1}| < |\mathbf{b}'_{m_1}| < \epsilon.$$
The above argument shows that the face $[0,1)^{N-1} = \{\mathbf{b} \in [0,1)^N : b_N = 0\}$ can be approximated by points of $Q_N$ with $s$ belonging to the set $S = \{m_1 t s_0\}_{t \in \mathbb{Z}}$.
Any other $(N-1)$-dimensional cross-section $\{\mathbf{b} \in [0,1)^N : b_N = c\}$ is then approximated by the points of $Q_N$ with $s \in S + c s_0$: indeed, $s = c s_0$ gives us one point in this cross-section, and additional shifts by $\Delta s \in S$ allow us to approximate any other point with the same $b_N$.
Acknowledgment

I thank Maksim Velikanov for useful feedback on the preliminary version of the paper.
References
Bianchini, M. and Scarselli, F. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553–1565, 2014.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

DeVore, R. A., Howard, R., and Micchelli, C. Optimal nonlinear approximation. Manuscripta Mathematica, 63(4):469–478, 1989.

Gabrielov, A. and Vorobjov, N. Complexity of computations with Pfaffian and Noetherian functions. Normal Forms, Bifurcations and Finiteness Problems in Differential Equations, 137:211–250, 2004.

Glorot, X., Bordes, A., and Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323, 2011.

Guliyev, N. J. and Ismailov, V. E. A single hidden layer feedforward network with only one neuron in the hidden layer can approximate any univariate function. Neural Computation, 28(7):1289–1304, 2016.

Guliyev, N. J. and Ismailov, V. E. Approximation capability of two hidden layer feedforward neural networks with fixed weights. Neurocomputing, 316:262–269, 2018a.

Guliyev, N. J. and Ismailov, V. E. On the approximation by single hidden layer feedforward neural networks with fixed weights. Neural Networks, 98:296–304, 2018b.

Igelnik, B. and Parikh, N. Kolmogorov's spline network. IEEE Transactions on Neural Networks, 14(4):725–733, 2003.

Ismailov, V. E. On the approximation by neural networks with bounded number of neurons in hidden layers. Journal of Mathematical Analysis and Applications, 417(2):963–969, 2014.

Karpinski, M. and Macintyre, A. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences, 54(1):169–176, 1997.

Khovanskii, A. G. Fewnomials. Vol. 88 of Translations of Mathematical Monographs. American Mathematical Society, 1991.

Kolmogorov, A. N. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. In Doklady Akademii Nauk, volume 114, pp. 953–956. Russian Academy of Sciences, 1957.

Kůrková, V. Kolmogorov's theorem is relevant. Neural Computation, 3(4):617–622, 1991.

Kůrková, V. Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5(3):501–506, 1992.

Maiorov, V. and Pinkus, A. Lower bounds for approximation by MLP neural networks. Neurocomputing, 25(1-3):81–91, 1999.

Montanelli, H. and Yang, H. Error bounds for deep ReLU networks using the Kolmogorov–Arnold superposition theorem. Neural Networks, 129:1–6, 2020.

Schmidt-Hieber, J. The Kolmogorov–Arnold representation theorem revisited. arXiv preprint arXiv:2007.15884, 2020.

Shen, Z., Yang, H., and Zhang, S. Deep network approximation with discrepancy being reciprocal of width to power of depth. arXiv preprint arXiv:2006.12231, 2020a.

Shen, Z., Yang, H., and Zhang, S. Neural network approximation: Three hidden layers are enough. arXiv preprint arXiv:2010.14075, 2020b.

Yarotsky, D. and Zhevnerchuk, A. The phase diagram of approximation rates for deep neural networks. arXiv preprint arXiv:1906.09477, 2019.

Zell, T. Betti numbers of semi-Pfaffian sets, 1999.