Elementary superexpressive activations
Dmitry Yarotsky
Skolkovo Institute of Science and Technology, Moscow, Russia. E-mail: [email protected]

Abstract
We call a finite family of activation functions superexpressive if any multivariate continuous function can be approximated by a neural network that uses these activations and has a fixed architecture only depending on the number of input variables (i.e., to achieve any accuracy we only need to adjust the weights, without increasing the number of neurons). Previously, it was known that superexpressive activations exist, but their form was quite complex. We give examples of very simple superexpressive families: for example, we prove that the family $\{\sin, \arcsin\}$ is superexpressive. We also show that most practical activations (not involving periodic functions) are not superexpressive.
1. Introduction
In the study of approximations by neural networks, an interesting fact is the existence of activation functions that allow to approximate any continuous function on a given compact domain with arbitrary accuracy by using a network with a finite, fixed architecture independent of the function and the accuracy (i.e., merely by adjusting the network weights, without increasing the number of neurons). We will refer to this property as "superexpressiveness". The existence of superexpressive activations can be seen as a consequence of a result of (Maiorov & Pinkus, 1999).
Theorem 1 (Maiorov & Pinkus 1999). There exists an activation function $\sigma$ which is real analytic, strictly increasing, sigmoidal (i.e., $\lim_{x\to-\infty}\sigma(x)=0$ and $\lim_{x\to+\infty}\sigma(x)=1$), and such that any $f \in C([0,1]^d)$ can be uniformly approximated with any accuracy by expressions $\sum_{i=1}^{6d+3} d_i\,\sigma\big(\sum_{j=1}^{3d} c_{ij}\,\sigma\big(\sum_{k=1}^{d} w_{ijk} x_k + \theta_{ij}\big) + \gamma_i\big)$ with some parameters $d_i, c_{ij}, w_{ijk}, \theta_{ij}, \gamma_i$.

The proof of this theorem includes two essential steps. In the first step, the result is proved for univariate functions (i.e., for $d = 1$). The key idea here is to use the separability of the space $C([0,1])$, and construct a (quite complicated) activation by joining all the functions from some dense countable subset. In the second step, one reduces the multivariate case to the univariate one by using the Kolmogorov Superposition Theorem (KST).

Though the superexpressive activations constructed in the proof of Theorem 1 have the nice properties of analyticity, monotonicity and boundedness, they are nevertheless quite complex and non-elementary – at least, not known to be representable in terms of finitely many elementary functions. See the papers (Ismailov, 2014; Guliyev & Ismailov, 2016; 2018a;b) for refinements and algorithmic aspects of such and similar activations, as well as the papers (Kůrková, 1991; 1992; Igelnik & Parikh, 2003; Montanelli & Yang, 2020; Schmidt-Hieber, 2020) for further connections between KST and neural networks.

There is, however, another line of research in which some weaker forms of superexpressiveness have been recently established for elementary (or otherwise simple) activations. The weaker form means that the network must grow to achieve higher accuracy, but this growth is much slower than the power laws expected from the abstract approximation theory under standard regularity assumptions (DeVore et al., 1989). In particular, results of (Yarotsky & Zhevnerchuk, 2019) imply that a deep network having both $\sin$ and ReLU activations can approximate Lipschitz functions with error $O(e^{-cW^{1/2}})$, where $c > 0$ is a constant and $W$ is the number of weights. Results of (Shen et al., 2020b) (see also (Shen et al., 2020a)) imply that a three-layer network using the floor $\lfloor\cdot\rfloor$, the exponential $2^x$ and the step function $\mathbb{1}_{x\ge 0}$ as activations can approximate Lipschitz functions with an exponentially small error $O(e^{-cW})$.

In the present paper, we show that there are activations superexpressive in the initially mentioned strong sense and yet constructed using simple elementary functions; see Section 2. For example, we prove that there are fixed-size networks with the activations $\sin$ and $\arcsin$ that can approximate any continuous function with any accuracy. On the other hand, we show in Section 3 that most practically used activations (not involving periodic functions) are not superexpressive.

Figure 1. An example of network architecture with 3 input neurons, 1 output neuron and 7 hidden neurons.
2. Elementary superexpressive families
Throughout the paper, we consider standard feedforward neural networks. The architecture of the network is defined by a directed acyclic graph connecting the neurons (see Fig. 1). A network implementing a scalar $d$-variable function has $d$ input neurons, one output neuron and a number of hidden neurons. A hidden neuron computes the value $\sigma(\sum_{i=1}^{n} w_i z_i + h)$, where $w_i$ and $h$ are the weights associated with this neuron, $z_i$ are the incoming connections from other hidden or input neurons, and $\sigma$ is an activation function. We will generally allow different hidden neurons to have different activation functions. The output neuron computes the value $\sum_{i=1}^{n} w_i z_i + h$ without an activation function.

Some of our activations (in particular, $\arcsin$) are naturally defined only on a subset of $\mathbb{R}$. In this case we ensure that the inputs of these activations always belong to this subset. Throughout the paper, we consider approximations of functions $f \in C([0,1]^d)$ in the uniform norm $\|\cdot\|_\infty$. We generally denote vectors by boldface letters; the components of a vector $\mathbf{x}$ are denoted $x_1, x_2, \ldots$
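As a concrete illustration of this network model, here is a minimal sketch of such a forward pass (not code from the paper; the dictionary-of-weights layout and the use of NumPy are assumptions made for the illustration). Each hidden neuron may read the input coordinates and all previously computed hidden values, matching the DAG description above.

import numpy as np

def hidden_neuron(z, w, h, sigma):
    # A hidden neuron: activation applied to an affine combination of incoming values.
    return sigma(np.dot(w, z) + h)

def fixed_architecture_net(x, params, activations):
    # The architecture (number of neurons and their wiring) is fixed once and for all;
    # only the weights in 'params' are adjusted to approximate different target functions.
    z = np.asarray(x, dtype=float)
    for (w, h), sigma in zip(params["hidden"], activations):
        z = np.append(z, hidden_neuron(z, w, h, sigma))  # each w matches the current length of z
    w_out, h_out = params["out"]
    return np.dot(w_out, z) + h_out  # output neuron: affine, no activation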
We give now the key definition of the paper.

Definition 1. We call a finite family $\mathcal{A}$ of univariate activation functions superexpressive for dimension $d$ if there exists a fixed $d$-input network architecture with each hidden neuron equipped with some fixed activation from the family $\mathcal{A}$, so that any function $f \in C([0,1]^d)$ can be approximated on $[0,1]^d$ with any accuracy in the uniform norm $\|\cdot\|_\infty$ by such a network, by adjusting the network weights. We call a family $\mathcal{A}$ simply superexpressive if it is superexpressive for all $d = 1, 2, \ldots$ We refer to the respective architectures as superexpressive for $\mathcal{A}$.

Recall that the Kolmogorov Superposition Theorem (KST) (Kolmogorov, 1957) proves that any multivariate continuous function can be expressed via additions and univariate continuous functions. The following version of this theorem is taken from (Maiorov & Pinkus, 1999).
Theorem 2 (KST). There exist $d$ constants $\lambda_j > 0$, $j = 1, \ldots, d$, $\sum_{j=1}^{d} \lambda_j \le 1$, and $2d+1$ continuous strictly increasing functions $\chi_i$, $i = 1, \ldots, 2d+1$, which map $[0,1]$ to itself, such that every $f \in C([0,1]^d)$ can be represented in the form
$$f(x_1, \ldots, x_d) = \sum_{i=1}^{2d+1} g\Big(\sum_{j=1}^{d} \lambda_j \chi_i(x_j)\Big)$$
for some $g \in C([0,1])$ depending on $f$.

An immediate corollary of this theorem is a reduction of multivariate superexpressiveness to the univariate one.
Corollary 1.
If a family $\mathcal{A}$ is superexpressive for dimension $d = 1$, then it is superexpressive for all $d$. Moreover, the number of neurons and connections in the respective superexpressive architectures scales as $O(d^2)$.

The proof follows simply by approximating the functions $\chi_i$ and $g$ in the KST by univariate superexpressive networks.
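To make this reduction concrete, here is a minimal sketch (an illustration only, under the assumption that univariate superexpressive subnetworks are available as callables approx_chi[i] and approx_g; these names are not from the paper). It assembles the multivariate approximation in the KST form of Theorem 2, using $O(d^2)$ univariate subnetwork evaluations.

def kst_network(x, approx_chi, approx_g, lambdas):
    # x: list of d inputs; approx_chi: list of 2d+1 univariate subnetworks;
    # approx_g: univariate subnetwork approximating g; lambdas: the d constants.
    d = len(x)
    total = 0.0
    for i in range(2 * d + 1):
        inner = sum(lambdas[j] * approx_chi[i](x[j]) for j in range(d))
        total += approx_g(inner)
    return total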
Our main result establishes the existence of simple superexpressive families constructed from finitely many elementary functions. The full list of properties of the activations that we use is relatively cumbersome, so we find it more convenient to just prove the result for a few particular examples rather than attempt to state it in a general form.

Theorem 3. Each of the following families of activation functions is superexpressive:
$$\mathcal{A}_1 = \{\sigma_1, \lfloor\cdot\rfloor\}, \qquad \mathcal{A}_2 = \{\sin, \arcsin\}, \qquad \mathcal{A}_3 = \{\sigma_2\},$$
where $\sigma_1$ is any function that is real analytic and non-polynomial in some interval $(\alpha, \beta) \subset \mathbb{R}$, and
$$\sigma_2(x) = \begin{cases} -2 - \frac{2}{x}, & x < -1,\\[2pt] \frac{2}{\pi}\big(x \arcsin x + \sqrt{1 - x^2}\big) + x, & x \in [-1, 1],\\[2pt] 4 - \frac{2}{x} + \frac{\sin x}{\pi x^2}, & x > 1. \end{cases}$$
The function $\sigma_2$ is $C(\mathbb{R})$, bounded, and strictly monotone increasing.

The family $\mathcal{A}_1$ is a generalization of the family for which (Shen et al., 2020b) proved a weaker superexpressiveness property.

The function $\sigma_2$ is given as an example of an explicit superexpressive activation that is smooth and sigmoidal (see Fig. 2).

Figure 2. The function $\sigma_2$ from the statement of Theorem 3.

Proof of Theorem 3.
We consider the families $\mathcal{A}_1$, $\mathcal{A}_2$, $\mathcal{A}_3$ one by one.

Proof for $\mathcal{A}_1$. Given a function $f \in C([0,1]^d)$, we will construct the approximation $\widetilde{f}$ as a function piecewise constant on a partition of the cube $[0,1]^d$ into a grid of smaller cubes. Following the paper (Shen et al., 2020b), we specify these cubes by mapping them to integers with the help of the function $\lfloor\cdot\rfloor$. Specifically, take some $M \in \mathbb{N}$ and let
$$g_M(x_1, \ldots, x_d) = 1 + \sum_{k=1}^{d} (M+1)^{k-1} \lfloor M x_k \rfloor. \qquad (1)$$
The function $g_M$ is integer-valued and constant on the cubes $I_{M,\mathbf{m}} = [\tfrac{m_1}{M}, \tfrac{m_1+1}{M}) \times \ldots \times [\tfrac{m_d}{M}, \tfrac{m_d+1}{M})$ indexed by integer multi-indices $\mathbf{m} = (m_1, \ldots, m_d) \in \mathbb{Z}^d$. The cube $[0,1]^d$ overlaps with $(M+1)^d$ such cubes $I_{M,\mathbf{m}}$, namely those with $0 \le m_k \le M$. Each of these cubes is mapped by $g_M$ to a unique integer in the range $[1, (M+1)^d]$.

Consider the periodic function
$$\varphi(x) = x - \lfloor x \rfloor, \qquad \varphi: \mathbb{R} \to [0,1). \qquad (2)$$
We will now seek our approximation in the form
$$\widetilde{f}(\mathbf{x}) = u(g_M(\mathbf{x})), \qquad u(y) = (B - A)\,\varphi\big(s\,\sigma_1\big(\tfrac{\alpha+\beta}{2} + w y\big)\big) + A, \qquad (3)$$
where
$$A = \min_{\mathbf{x} \in [0,1]^d} f(\mathbf{x}), \qquad B = \max_{\mathbf{x} \in [0,1]^d} f(\mathbf{x}), \qquad (4)$$
$\tfrac{\alpha+\beta}{2}$ is the center of the interval $(\alpha, \beta)$ where $\sigma_1$ is analytic and non-polynomial, and $s$ and $w$ are some weights to be chosen shortly. Clearly, the computation defined by Eqs. (1),(2),(3) is representable by a neural network of a fixed size only depending on $d$ (as $O(d)$) and using activations from $\mathcal{A}_1$.

Let $N = (M+1)^d$. Using the uniform continuity of $f$ and choosing $M$ large so that the size of each cube $I_{M,\mathbf{m}}$ is arbitrarily small, we see that the superexpressiveness will be established if we show that for any $N$, any $\epsilon > 0$, and any $\mathbf{y} \in [A,B]^N$ there exist some weights $s$ and $w$ such that
$$|u(n) - y_n| < \epsilon \quad \text{for all } n = 1, \ldots, N. \qquad (5)$$
Recall that a set of numbers $a_1, \ldots, a_N$ is called rationally independent if they are linearly independent over the field $\mathbb{Q}$ (i.e., no equality $\sum_{n=1}^{N} \lambda_n a_n = 0$ with rational coefficients $\lambda_n$ can hold unless all $\lambda_n = 0$). Our strategy will be:

1. to choose the weight $w$ so as to make the values $a_n = \sigma_1(\tfrac{\alpha+\beta}{2} + wn)$ with $n = 1, \ldots, N$ rationally independent;
2. to use the density of the irrational winding on the torus to find $s$ ensuring condition (5).

For step 1, we state the following lemma.
Lemma 1. Let $\sigma$ be a real analytic function in an interval $(\alpha, \beta)$ with $\beta > \alpha$. Suppose that there is $N$ such that for all $w$ with sufficiently small absolute value, the values $\big(\sigma(\tfrac{\alpha+\beta}{2} + wn)\big)_{n=1}^{N}$ are not rationally independent. Then $\sigma$ is a polynomial.

Proof. For fixed coefficients $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_N)$, the function $\sigma_{\boldsymbol{\lambda}}(w) = \sum_{n=1}^{N} \lambda_n \sigma(\tfrac{\alpha+\beta}{2} + nw)$ is real analytic for $w \in U_N = \big(-\tfrac{\beta-\alpha}{2N}, \tfrac{\beta-\alpha}{2N}\big)$. Since there are only countably many $\boldsymbol{\lambda} \in \mathbb{Q}^N$, we see that under the hypothesis of the lemma there is some $\boldsymbol{\lambda}$ such that $\sigma_{\boldsymbol{\lambda}}$ vanishes on an uncountable subset of $U_N$. Then, by analyticity, $\sigma_{\boldsymbol{\lambda}} \equiv 0$ on $U_N$. Expanding this $\sigma_{\boldsymbol{\lambda}}$ into the Taylor series at $w = 0$, we get the identity $\sum_{n=1}^{N} \lambda_n n^m = 0$ for each $m$ such that $\frac{d^m \sigma}{dx^m}\big(\tfrac{\alpha+\beta}{2}\big) \ne 0$. If there are infinitely many such $m$, then all $\lambda_n = 0$ (by letting $m \to \infty$). It follows that if $\boldsymbol{\lambda}$ is nonzero, then there are only finitely many $m$'s such that $\frac{d^m \sigma}{dx^m}\big(\tfrac{\alpha+\beta}{2}\big) \ne 0$, i.e. $\sigma$ is a polynomial.

Applying Lemma 1 to $\sigma = \sigma_1$, we see that for any $N$ there is $w$ such that the values $a_n = \sigma_1(\tfrac{\alpha+\beta}{2} + wn)$ with $n = 1, \ldots, N$ are rationally independent.

For step 2, we use the well-known fact that an irrational winding on the torus is dense:
Lemma 2. Let $a_1, \ldots, a_N$ be rationally independent real numbers. Then the set $Q_N = \{(\varphi(s a_1), \ldots, \varphi(s a_N)) : s \in \mathbb{R}\}$ (where $\varphi$ is defined in Eq. (2)) is dense in $[0,1)^N$.

For completeness, we provide a proof in Appendix A.

Lemma 2 implies that for any $\mathbf{y} \in [A,B]^N$, the point $\frac{\mathbf{y} - A}{B - A} \in [0,1)^N$ can be approximated by vectors $(\varphi(s a_n))_{n=1}^{N}$. This implies condition (5), thus finishing the proof for $\mathcal{A}_1$.
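As a purely illustrative rendering of this construction (not the paper's code), the sketch below evaluates $\widetilde f = u \circ g_M$ from Eqs. (1)–(3) for given weights $s, w$; the choice $\sigma_1 = \exp$ with center $0$ is an assumption made for the illustration (any function analytic and non-polynomial near the center would do), and finding good $s$ and $w$ is precisely the nonconstructive part of the argument.

import numpy as np

def g_M(x, M):
    # Eq. (1): map the grid cube containing x to a unique integer in [1, (M+1)**d].
    x = np.asarray(x, dtype=float)
    return 1 + sum((M + 1) ** k * int(np.floor(M * x[k])) for k in range(x.size))

def f_tilde(x, M, s, w, A, B, sigma1=np.exp, center=0.0):
    # Eq. (3): piecewise-constant approximation u(g_M(x)).
    phi = lambda t: t - np.floor(t)          # Eq. (2): the sawtooth x - floor(x)
    return (B - A) * phi(s * sigma1(center + w * g_M(x, M))) + A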
Figure 3. The functions $\theta, \nu, \psi$ from the proof of Theorem 3.

Proof for $\mathcal{A}_2$. We will only give a proof for $d = 1$; the claim then follows for all larger $d$ by Corollary 1.

Consider the piecewise linear periodic function $\theta(x) = \frac{1}{\pi}\arcsin(\sin \pi x)$ and the related functions $\nu(x) = x + \theta(x)$ and $\psi$, a continuous piecewise linear bump function built from $\theta$ and $\nu$ whose translates form a partition of unity (see Fig. 3).

We would like to extend the previous proof for $\mathcal{A}_1$ to the present case of $\mathcal{A}_2$ using the function $\nu$ as a substitute for $\lfloor\cdot\rfloor$, since $\nu$ is constant on the intervals $[k - \tfrac{1}{2}, k + \tfrac{1}{2}]$ with odd integer $k$. However, in contrast to the function $\lfloor\cdot\rfloor$, the function $\nu$ is continuous and cannot map the whole segment $[0,1]$ to a finite set of values, which was crucial in the proof for $\mathcal{A}_1$. For this reason, we use a partition of unity and represent the approximated function $f \in C([0,1])$ as a sum of four functions supported on a grid of disjoint small segments. Specifically, let again $M$ be a large integer determining the scale of our partition of unity. We define this partition by
$$1 \equiv \sum_{q=-1}^{2} \psi_q(x), \qquad \psi_q(x) = \psi(2Mx - c_q), \qquad x \in \mathbb{R}, \qquad (6)$$
and the respective decomposition of the function $f$ by
$$f = \sum_{q=-1}^{2} f_q, \qquad f_q = f \psi_q. \qquad (7)$$
For a fixed $q$, the function $\psi_q$, and hence also $f_q$, vanishes outside of the union of $N = M + O(1)$ disjoint segments $J_{q,p}$, $p = 1, \ldots, N$, of length of order $\tfrac{1}{M}$, overlapping with the segment $[0,1]$. Denote this union by $J_q$.

We approximate each function $f_q$ by a function $\widetilde{f}_q$ using an analog of the representation (1),(2),(3):
$$G_q(x) = \nu(2Mx - c_q), \qquad (8)$$
$$v_q(x) = \Big(2 \max_{x \in [0,1]} |f(x)|\Big)\, \theta\big(s \sin(w\, G_q(x))\big), \qquad (9)$$
$$\widetilde{f}_q(x) = v_q(x)\, \psi_q(x), \qquad (10)$$
$$\widetilde{f} = \sum_{q=-1}^{2} \widetilde{f}_q. \qquad (11)$$
Here the shifts $c_q$, $q \in \{-1, 0, 1, 2\}$, are chosen so that the function $G_q$ in Eq. (8) is constant and equal to the odd integer $2p - 1$ on each segment $J_{q,p}$. In particular, different segments $J_{q,p}$ overlapping with the segment $[0,1]$ are mapped by $G_q$ to different integers in the interval $[1, 2M+1]$.

The function $v_q$ in Eq. (9) is the analog of the expression for $\widetilde{f}$ given in Eq. (3). Like $G_q$, the function $v_q$ is constant on each segment $J_{q,p}$. By Lemma 1, the values $(\sin(wm))_{m=1}^{2M+1}$ are rationally independent for a suitable $w$. We can then use again the density of the irrational winding on the torus (Lemma 2) to find $s$ such that for each $p$ the value $v_q|_{J_{q,p}}$ is arbitrarily close to the value of $f$ at the center $x_{q,p}$ of the segment $J_{q,p}$. Indeed, $\theta$ is a continuous periodic (period-2) function with $\max_x \theta(x) = -\min_x \theta(x) = \tfrac{1}{2}$. For each $p = 1, \ldots, N$, we can first find $z_{q,p} \in \mathbb{R}/(2\mathbb{Z})$ such that $\big(2 \max_{x \in [0,1]} |f(x)|\big)\,\theta(z_{q,p}) = f(x_{q,p})$, and then, by Lemma 2, find $s$ such that $s \sin(w(2p-1))$ is arbitrarily close to $z_{q,p}$ on the circle $\mathbb{R}/(2\mathbb{Z})$ for each $p = 1, \ldots, N$.

As a result, we see that the function $v_q$ can approximate the function $f$ on the whole set $J_q$. As before, to achieve an arbitrarily small error, we need to first choose $M$ large enough and then choose suitable $w$ and $s$. (By the uniform continuity of $f$, one can use here the same $w$ and $s$ for all $q \in \{-1, 0, 1, 2\}$.)

At the same time, it makes no difference how the function $v_q$ behaves on the complementary set $[0,1] \setminus J_q$, since $\psi_q$ vanishes on this set. It follows that $\widetilde{f}_q$ defined by Eq. (10) can approximate $f_q$ defined by Eq. (7) with arbitrarily small error on the whole segment $[0,1]$. Then, the function $\widetilde{f}$ given by Eq. (11) can approximate $f$ uniformly on $[0,1]$ with any accuracy.

The computation (8)–(11) is directly representable by a fixed-size neural network with activations $\{\sin, \arcsin\}$, except for the multiplication step (10).
Multiplication, however, can be implemented with any accuracy by a fixed-size subnetwork:

Lemma 3 (Approximate multiplier). Suppose that an activation function $\sigma$ has a point $x_0$ where the second derivative $\frac{d^2\sigma}{dx^2}(x_0)$ exists and is nonzero. Then there is a fixed two-input network architecture with this activation that allows to implement the approximate multiplication of the inputs, $(x, y) \mapsto xy$, with any accuracy uniformly on any bounded set of inputs $x, y$, by suitably adjusting the weights.

Proof. First note that we can implement the approximate squaring $x \mapsto x^2$ with any accuracy using just a network with three neurons. Indeed, by the assumption on $\frac{d^2\sigma}{dx^2}$, for any $C, \epsilon > 0$ we can choose $\delta$ such that
$$\Big|\Big(\tfrac{d^2\sigma}{dx^2}(x_0)\Big)^{-1} \delta^{-2}\big(\sigma(x_0 + x\delta) + \sigma(x_0 - x\delta) - 2\sigma(x_0)\big) - x^2\Big| < \epsilon \quad \text{for all } |x| < C.$$
Then, using the polarization identity $xy = \tfrac{1}{4}\big((x+y)^2 - (x-y)^2\big)$, we see that the desired approximate multiplier can be implemented using a fixed 6-neuron architecture.

We can apply this lemma with $\sigma = \sin$ and any $x_0 = \tfrac{\pi}{2} + \pi k$, $k \in \mathbb{Z}$, thus completing the proof for $\mathcal{A}_2$.

Proof for $\mathcal{A}_3$. We reduce this case to the previous one, $\mathcal{A}_2$. First observe that we can approximate the function $\arcsin$ by a fixed-size $\sigma_2$-network.
Lemma 4. A superexpressive family of continuous activations remains superexpressive if some activations are replaced by their antiderivatives.

Proof.
The claim follows since any continuous activation $\sigma$ can be approximated uniformly on compact sets by the expressions $\delta^{-1}\big(\sigma^{(-1)}(x + \delta) - \sigma^{(-1)}(x)\big)$, where $\sigma^{(-1)} = \int \sigma$.

Our activation $\sigma_2$ is the antiderivative of $\frac{2}{\pi}\arcsin x + 1$ on the interval $[-1, 1]$.

Observe next that on the interval $[1, \infty)$, we can express the function $\sin x$ by multiplying $\sigma_2(x)$ by suitable polynomials in $x$ and subtracting polynomials. By Lemma 3, these operations can be implemented with any accuracy by a fixed-size $\sigma_2$-network. By the periodicity of $\sin$, we can then approximate it on any bounded interval.

We conclude that we can approximate any $\mathcal{A}_2$-network with any accuracy by a $\sigma_2$-network that has the same size up to a constant factor.

It is an elementary computation that $\sigma_2$ is $C(\mathbb{R})$, bounded and monotone increasing. This completes the proof of the theorem.
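The approximate squaring and multiplication of Lemma 3 are easy to try numerically. The sketch below (an illustration, not code from the paper) uses $\sigma = \sin$ with $x_0 = \pi/2$, where $\sin''(x_0) = -1$, together with the polarization identity.

import numpy as np

def approx_square(x, delta=1e-3, x0=np.pi / 2):
    # Second-difference approximation of x**2: sin''(x0) = -1, hence the division by -delta**2.
    second_diff = np.sin(x0 + x * delta) + np.sin(x0 - x * delta) - 2 * np.sin(x0)
    return second_diff / (-delta ** 2)

def approx_multiply(x, y, delta=1e-3):
    # Polarization identity: xy = ((x + y)**2 - (x - y)**2) / 4.
    return (approx_square(x + y, delta) - approx_square(x - y, delta)) / 4

# Example: approx_multiply(1.7, -2.3) returns approximately -3.91.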
3. Absence of superexpressiveness for standard activations
In this section we show that most practically used activation functions (those not involving $\sin x$ or $\cos x$) are not superexpressive. This is an easy consequence of Khovanskii's bounds on the number of zeros of elementary functions (Khovanskii, 1991). We remark that these bounds have been used previously to bound the expressiveness of neural networks in terms of VC dimension (Karpinski & Macintyre, 1997) or Betti numbers of level sets (Bianchini & Scarselli, 2014).

First recall the standard definition of Pfaffian functions (see e.g. (Khovanskii, 1991; Zell, 1999; Gabrielov & Vorobjov, 2004)). A Pfaffian chain is a sequence $f_1, \ldots, f_l$ of real analytic functions defined on a common connected domain $U \subset \mathbb{R}^d$ and such that the equations
$$\frac{\partial f_i}{\partial x_j}(\mathbf{x}) = P_{ij}\big(\mathbf{x}, f_1(\mathbf{x}), \ldots, f_i(\mathbf{x})\big), \qquad 1 \le i \le l, \quad 1 \le j \le d,$$
hold in $U$ for some polynomials $P_{ij}$. A Pfaffian function in the chain $(f_1, \ldots, f_l)$ is a function on $U$ that can be expressed as a polynomial $P$ in the variables $(\mathbf{x}, f_1(\mathbf{x}), \ldots, f_l(\mathbf{x}))$. The complexity of the Pfaffian function $f$ is the triplet $(l, \alpha, \beta)$ consisting of the length $l$ of the chain, the maximum degree $\alpha$ of the polynomials $P_{ij}$, and the degree $\beta$ of the polynomial $P$.

The importance of Pfaffian functions stems from the fact that they include all elementary functions when considered on suitable domains. This is shown by first checking that the simplest elementary functions are Pfaffian, and then by checking that arithmetic operations and compositions of Pfaffian functions produce again Pfaffian functions. We refer again to (Khovanskii, 1991; Zell, 1999; Gabrielov & Vorobjov, 2004) for details.
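For instance (a standard example, spelled out here only as an illustration), the logistic sigmoid forms a Pfaffian chain of length one by itself, since its derivative is a polynomial in the function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma}{dx}(x) = \sigma(x)\big(1 - \sigma(x)\big) = P_{11}\big(x, \sigma(x)\big), \qquad P_{11}(x, y) = y(1-y),$$
so $\sigma$ is Pfaffian on $\mathbb{R}$ with chain length $l = 1$, degree $\alpha = 2$, and (being the polynomial $P(x,y) = y$ in this chain) degree $\beta = 1$.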
Proposition 1.
1. (Elementary examples) The following functions are Pfaffian: polynomials on $U = \mathbb{R}^d$, $e^x$ on $\mathbb{R}$, $\ln x$ on $\mathbb{R}_+$, $\arcsin x$ on $(-1, 1)$. The function $\sin x$ is Pfaffian on any bounded interval $(A, B)$, with complexity depending on $B - A$, but $\sin x$ is not Pfaffian on $\mathbb{R}$.

2. (Operations with Pfaffian functions) Sums and products of Pfaffian functions $f, g$ with a common domain $U$ are Pfaffian. If the domain of a Pfaffian function $f$ includes the range of a Pfaffian function $g$, then the composition $f \circ g$ is Pfaffian on the domain of $g$. The complexity of the resulting functions $f + g$, $fg$, $f \circ g$ is determined by the complexities of the functions $f, g$.

We state now the fundamental result on Pfaffian functions. We call a solution $\mathbf{x} \in \mathbb{R}^d$ of a system $f_1(\mathbf{x}) = \ldots = f_d(\mathbf{x}) = 0$ nondegenerate if the respective Jacobi matrix $\big(\frac{\partial f_i}{\partial x_j}(\mathbf{x})\big)_{i,j}$ is nondegenerate.

Theorem 4 (Khovanskii 1991). Let $f_1, \ldots, f_d$ be Pfaffian $d$-variable functions with a common Pfaffian chain on a connected domain $U$. Then the number of nondegenerate solutions of the system $f_1(\mathbf{x}) = \ldots = f_d(\mathbf{x}) = 0$ is bounded by a finite number only depending on the complexities of the functions $f_1, \ldots, f_d$.

The idea of the proof is to use a generalized Rolle's lemma and bound the number of common zeros of the functions $f_k$ by the number of common zeros of suitable polynomials (in a larger number of variables). The latter number can then be upper bounded using the classical Bézout theorem. It is possible to write the bound in Theorem 4 explicitly, but we will not need that for our purposes.

We will only use the univariate version of Theorem 4. In this case, it will also be easy to remove the inconvenient nondegeneracy condition in this theorem. (Note that this condition is essential in general – for example, if $f_1(x_1, x_2) \equiv f_2(x_1, x_2) = x_1$, then the system $f_1 = f_2 = 0$ has infinitely many degenerate solutions.)
Proposition 2. Let $f$ be a univariate Pfaffian function on an open interval $I \subset \mathbb{R}$. Then either $f \equiv 0$ on $I$, or the number of zeros of $f$ is bounded by a finite number only depending on the complexity of $f$.

Proof. Suppose $f \not\equiv 0$. Then, by real analyticity of $f$, any zero $x_0$ of $f$ in $I$ is isolated, and we can write $f(x) = c(x - x_0)^k(1 + o(1))$ as $x \to x_0$, with some $c \ne 0$ and $k \in \mathbb{N}$. By Sard's theorem, there is a sequence $\epsilon_n \searrow 0$ such that the values $\pm\epsilon_n$ are not critical values of $f$. The functions $f \pm \epsilon_n$ are Pfaffian with the same complexity as $f$, and do not have degenerate zeros. For any zero $x_0$ of $f$, the two functions $f \pm \epsilon_n$ will have in total two nondegenerate zeros in a vicinity of $x_0$, for any $\epsilon_n$ small enough. It follows that the total number of all nondegenerate zeros of the two functions $f \pm \epsilon_n$, for $\epsilon_n$ small enough, will be at least twice as large as the number of zeros of the function $f$ (or can be made arbitrarily large if $f$ has infinitely many zeros). Applying Theorem 4 to the functions $f \pm \epsilon_n$, we obtain the desired conclusion on the zeros of $f$.

Now we apply these results to standard activation functions.
Definition 2. We say that an activation function $\sigma$ is piecewise Pfaffian if its domain of definition can be represented as a union of finitely many open intervals $U_n$ and points $x_k$ in $\mathbb{R}$ so that $\sigma$ is Pfaffian on each $U_n$.

By the discussion above, this definition covers most practically used activations, such as $\tanh x$, the standard sigmoid $\sigma(x) = (1 + e^{-x})^{-1}$, ReLU $\sigma(x) = \max(0, x)$, leaky ReLU $\sigma(x) = \max(ax, x)$, the binary step function $\sigma(x) = \mathbb{1}_{x \ge 0}$, the Gaussian $\sigma(x) = e^{-x^2}$, softplus $\sigma(x) = \ln(1 + e^x)$ (Glorot et al., 2011), the ELU $\sigma(x) = a(e^x - 1)$ for $x < 0$ and $\sigma(x) = x$ for $x \ge 0$ (Clevert et al., 2015), etc. Our main result in this section states that any finite collection of such activations is not superexpressive.
Theorem 5. Let $\mathcal{A}$ be a family of finitely many piecewise Pfaffian activation functions. Then $\mathcal{A}$ is not superexpressive.

Proof. Suppose that $\mathcal{A}$ is superexpressive, and there is a fixed one-input network architecture allowing us to approximate any univariate function $f \in C([0,1])$. Then for any $N$ we can choose the network weights so that the function $\widetilde{f}$ implemented by the network has at least $N$ sign changes, in the sense that there are points $0 \le a_0 < \ldots < a_N \le 1$ such that $(-1)^n \widetilde{f}(a_n) > 0$ for all $n$. Indeed, this follows simply by approximating the function $f(x) = \sin((N+1)\pi x)$ with an error less than 1. We will show, however, that this $N$ cannot be arbitrarily large if the activations are from a finite piecewise Pfaffian family.
Lemma 5. If the activations belong to a finite piecewise Pfaffian family $\mathcal{A}$, then any function $\widetilde{f}$ implemented by the network is piecewise Pfaffian. Moreover, the number of respective intervals $U_n$ as well as the complexity of each restriction $\widetilde{f}|_{U_n}$ do not exceed some finite values only depending on the family $\mathcal{A}$ and the network architecture.

Proof. This can be proved by induction on the number of hidden neurons in the network. The base of induction corresponds to networks without hidden neurons; in this case the statement is trivial. Now we make the induction step. Given a network, choose some hidden neuron whose output is not used by other hidden neurons (i.e., choose a neuron in the "last hidden layer"). With respect to this neuron, we can decompose the network output as
$$\widetilde{f}(x) = c\,\sigma_k\Big(\sum_{s=1}^{K} c_s \widetilde{f}_s(x) + h\Big) + \sum_{s=1}^{K} c'_s \widetilde{f}_s(x) + h'. \qquad (12)$$
Here, $\sigma_k$ is the activation function residing at the chosen neuron, $\widetilde{f}_s$ are the signals going out of the other hidden and input neurons, and $c, c_s, c'_s, h, h'$ are various weights. By the inductive hypothesis, all functions $\widetilde{f}_s$ here are piecewise Pfaffian. Moreover, by taking intersections, the segment $[0,1]$ can be divided into finitely many open intervals $I_j$ separated by finitely many points $x_l$ so that each of the functions $\widetilde{f}_s$ is Pfaffian on each interval $I_j$. The number of these intervals $I_j$ and the complexities of $\widetilde{f}_s|_{I_j}$ are bounded by some finite values depending only on the family $\mathcal{A}$ and the network architecture. We see also that the linear combination $F(x) = \sum_{s=1}^{K} c_s \widetilde{f}_s(x) + h$ appearing in Eq. (12) is Pfaffian on each interval $I_j$.

Observe next that the composition $\sigma_k \circ F$ is piecewise Pfaffian on each interval $I_j$. Indeed, let $U_r^{(k)}$ and $x_r^{(k)}$ be the finitely many open intervals and points associated with the activation $\sigma_k$ as a piecewise Pfaffian function. By Proposition 2, for each $r$, the pre-image $(F|_{I_j})^{-1}(x_r^{(k)})$ is either the whole interval $I_j$ or its finite subset. In the first case, $\sigma_k \circ F$ is constant and thus trivially Pfaffian on $I_j$. In the second case, the interval $I_j$ can be subdivided into sub-intervals $I_{j,m}$ such that each image $F(I_{j,m})$ belongs to one of the intervals $U_r^{(k)}$, so that $\sigma_k \circ F$ is Pfaffian on $I_{j,m}$. The number of these sub-intervals and the complexities of the restrictions are bounded by some finite numbers depending on the activation $\sigma_k$ and the complexity of $F|_{I_j}$.

Returning to representation (12), we see that $\widetilde{f}$ is Pfaffian on each interval $I_{j,m}$; moreover, the total number of these intervals as well as the complexities of the restrictions $\widetilde{f}|_{I_{j,m}}$ are bounded by finite numbers determined by the family $\mathcal{A}$ and the architecture, thus proving the claim.

The lemma implies that some interval $U_n$ in which $\widetilde{f}$ is Pfaffian and has a bounded complexity must contain an arbitrarily large number of sign changes of $\widetilde{f}$. This gives a contradiction with Proposition 2.
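To illustrate the counting argument numerically (an illustration only; the actual bound comes from Theorem 4 and Lemma 5), the sketch below checks whether a candidate network output alternates in sign at the test points $a_n = (n + 1/2)/(N+1)$ used in the proof; f_tilde is assumed to be any callable implementing the network.

import numpy as np

def alternates(f_tilde, N):
    # Test points at which sin((N+1)*pi*x) equals (-1)**n.
    a = (np.arange(N + 1) + 0.5) / (N + 1)
    values = np.array([f_tilde(t) for t in a])
    signs = (-1.0) ** np.arange(N + 1)
    # True iff f_tilde alternates sign at all N+1 test points, which any
    # approximation of sin((N+1)*pi*x) with error < 1 must do.
    return bool(np.all(signs * values > 0))

# A fixed-size network with piecewise Pfaffian activations (e.g. tanh) can pass
# this check only for N below a bound depending on the architecture.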
4. Discussion
We have given examples of simple explicit activation functions that allow to approximate arbitrary functions using fixed-size networks (Theorem 3), and we have also shown that this cannot be achieved with the common practical activations (Theorem 5). We mention two interesting questions left open by our results.

First, our existence result (Theorem 3) is of course purely theoretical: though the network is small, a huge approximation complexity is hidden in the very special choice of the network weights. Nevertheless, assuming that we can perform computations with any precision, one can ask if it is possible to algorithmically find network weights providing a good approximation. The main difficulty here is to find a value $s$ such that $(\varphi(s a_n))_{n=1}^{N}$ is close to the given $N$-dimensional point. Such a value exists by Lemma 2 on the density of irrational winding, and the proof of the lemma is essentially constructive, so theoretically one can perform the necessary computation and find the desired $s$. However, the proof is based on the pigeonhole principle and is very prone to the curse of dimensionality (with dimensionality here corresponding to the number $N$ of fitted data points), making this computation practically unfeasible even for relatively small $N$.

Another open question is whether the function $\sin$ alone is superexpressive. This cannot be ruled out by the methods of Section 3, since $\sin$ has an infinite Pfaffian complexity on $\mathbb{R}$. More generally, one can ask if there are individual superexpressive activations that are elementary and real analytic on the whole $\mathbb{R}$. A repeated computation of antiderivatives using Lemma 4 allows us to construct a piecewise elementary superexpressive function of any finite smoothness, but not analytic on $\mathbb{R}$.
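As a naive illustration of this difficulty (not an algorithm from the paper; the plain grid search over $s$ is an assumption made for the sketch), one can scan a range of $s$ values and keep the best match. The required range and resolution grow extremely fast with $N$, which is the curse of dimensionality mentioned above.

import numpy as np

def search_s(a, target, s_max=1e4, num=100_000):
    # Brute-force scan over s in [0, s_max]: return the s whose fractional parts
    # (phi(s*a_n))_n come closest to the target point in [0,1)^N.
    a = np.asarray(a, dtype=float)
    target = np.asarray(target, dtype=float)
    best_s, best_err = 0.0, np.inf
    for s in np.linspace(0.0, s_max, num):
        frac = (s * a) % 1.0                      # phi(s * a_n)
        diff = np.abs(frac - target)
        err = np.max(np.minimum(diff, 1.0 - diff))  # sup-distance on the torus
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err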
A. Proof of Lemma 2

It is convenient to endow the cube $[0,1)^N$ with the topology of the torus $T^N = \mathbb{R}^N/\mathbb{Z}^N$ by gluing the endpoints of the interval $[0,1)$. Though the lemma is stated in terms of the original topology on $[0,1)^N$, it is clear that a subset is dense in the original topology if and only if it is dense in the topology of the torus. Accordingly, when considering the distance between two points $\mathbf{b}_1, \mathbf{b}_2 \in [0,1)^N$, it will be convenient to use the distance between the corresponding cosets, i.e.
$$\rho(\mathbf{b}_1, \mathbf{b}_2) = \min_{\mathbf{z}_1, \mathbf{z}_2 \in \mathbb{Z}^N} |\mathbf{b}_1 + \mathbf{z}_1 - (\mathbf{b}_2 + \mathbf{z}_2)|,$$
where $|\cdot|$ is the usual Euclidean norm. Note that this $\rho$ is a shift-invariant metric on the torus.

The proof of the lemma is by induction on $N$. The base $N = 1$ is obvious (a single number $a_1$ is rationally independent iff $a_1 \ne 0$). Let us make the induction step from $N-1$ to $N$, with $N \ge 2$.

Given the rationally independent numbers $a_1, \ldots, a_N$, first observe that none of them equals 0. Let $s_0 = 1/a_N$. Let $\varphi(x) = x - \lfloor x \rfloor$ as in Eq. (2). If $s = m s_0$ with some integer $m$, then $\varphi(m s_0 a_N) = 0$, so that the points $\mathbf{b}_m = (\varphi(m s_0 a_1), \ldots, \varphi(m s_0 a_N))$ lie in the $(N-1)$-dimensional face $[0,1)^{N-1}$ of the full set $[0,1)^N$.

Observe that the points $\mathbf{b}_m$ are different for different integer $m$'s. Indeed, if $\mathbf{b}_{m_1} = \mathbf{b}_{m_2}$ for some integers $m_1 \ne m_2$, then there are some integers $p_1, \ldots, p_{N-1}$ such that $(m_1 - m_2) s_0 a_n = p_n$ for all $n = 1, \ldots, N-1$. But then the numbers $a_1, \ldots, a_N$ are not rationally independent, since, e.g., $(m_1 - m_2) a_1 = p_1 s_0^{-1} = p_1 a_N$.

Since the points $\mathbf{b}_m$ are distinct, they form an infinite set in $[0,1)^{N-1}$. Then for any $\epsilon$ we can find a pair of different points $\mathbf{b}_{m_1}$ and $\mathbf{b}_{m_2}$ separated by a distance $\rho(\mathbf{b}_{m_1}, \mathbf{b}_{m_2}) < \epsilon$. Note that the distance $\rho(\mathbf{b}_{m_1}, \mathbf{b}_{m_2})$ only depends on the difference $m_1 - m_2$, so we can assume that $m_2 = 0$:
$$\rho(\mathbf{b}_0, \mathbf{b}_{m_1}) = \rho(\mathbf{0}, \mathbf{b}_{m_1}) < \epsilon.$$
By the definition of $\rho$, we can then find $\mathbf{z} \in \mathbb{Z}^N$ such that for $\mathbf{b}'_{m_1} = \mathbf{b}_{m_1} - \mathbf{z}$ we have
$$|\mathbf{b}'_{m_1}| = \rho(\mathbf{0}, \mathbf{b}_{m_1}) < \epsilon. \qquad (13)$$
We can write $\mathbf{b}'_{m_1}$ in the form $\mathbf{b}'_{m_1} = (m_1 s_0 a_1 - p_1, \ldots, m_1 s_0 a_{N-1} - p_{N-1}, 0)$, with some integers $p_1, \ldots, p_{N-1}$.

Observe that the first $N-1$ components $b'_{m_1,n}$ of $\mathbf{b}'_{m_1}$ are rationally independent. Indeed, if $\sum_{n=1}^{N-1} \lambda_n b'_{m_1,n} = 0$ with some rational $\lambda_n$, then, by expressing this identity in terms of the original values $a_n$, we get
$$\sum_{n=1}^{N-1} \lambda_n a_n - \frac{1}{m_1}\sum_{n=1}^{N-1} \lambda_n p_n a_N = 0,$$
so $\lambda_n \equiv 0$ by the rational independence of $a_1, \ldots, a_N$.

Consider now the set $Q'_{N-1} = \{(\varphi(t\, b'_{m_1,1}), \ldots, \varphi(t\, b'_{m_1,N-1})) : t \in \mathbb{R}\}$. On the one hand, by the induction hypothesis, the set $Q'_{N-1}$ is dense in $[0,1)^{N-1}$, because the numbers $b'_{m_1,n}$ are rationally independent. On the other hand, observe that the points in $Q'_{N-1}$ corresponding to integer $t$ also belong to the set $Q_N = \{(\varphi(s a_1), \ldots, \varphi(s a_N)) : s \in \mathbb{R}\}$: specifically, with the respective $s = m_1 t s_0$. It follows that for any $\mathbf{b} \in [0,1)^{N-1}$ we can find a point $\widetilde{\mathbf{b}}$ of the set $Q_N$ at a distance at most $2\epsilon$ from $\mathbf{b}$: first find a point $\widehat{\mathbf{b}} \in Q'_{N-1}$ such that $|\widehat{\mathbf{b}} - \mathbf{b}| < \epsilon$, and then, if $\widehat{\mathbf{b}}$ corresponds to some $t = t_0$ in $Q'_{N-1}$, take $\widetilde{\mathbf{b}}$ corresponding to $t = \lfloor t_0 \rfloor$. The distance $\rho(\widehat{\mathbf{b}}, \widetilde{\mathbf{b}}) < \epsilon$ by Eq. (13) and because $|t_0 - \lfloor t_0 \rfloor| < 1$:
$$\rho(\widehat{\mathbf{b}}, \widetilde{\mathbf{b}}) \le |t_0 \mathbf{b}'_{m_1} - \lfloor t_0 \rfloor \mathbf{b}'_{m_1}| < |\mathbf{b}'_{m_1}| < \epsilon.$$
The above argument shows that the face $[0,1)^{N-1} = \{\mathbf{b} \in [0,1)^N : b_N = 0\}$ can be approximated by points of $Q_N$ with $s$ belonging to the set $S = \{m_1 t s_0\}_{t \in \mathbb{Z}}$.
Any other $(N-1)$-dimensional cross-section $\{\mathbf{b} \in [0,1)^N : b_N = c\}$ is then approximated by the points of $Q_N$ with $s \in S + c s_0$: indeed, $s = c s_0$ gives us one point in this cross-section, and additional shifts by $\Delta s \in S$ allow us to approximate any other point with the same $b_N$.
Acknowledgment

I thank Maksim Velikanov for useful feedback on the preliminary version of the paper.
References
Bianchini, M. and Scarselli, F. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553–1565, 2014.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

DeVore, R. A., Howard, R., and Micchelli, C. Optimal nonlinear approximation. Manuscripta Mathematica, 63(4):469–478, 1989.

Gabrielov, A. and Vorobjov, N. Complexity of computations with Pfaffian and Noetherian functions. Normal Forms, Bifurcations and Finiteness Problems in Differential Equations, 137:211–250, 2004.

Glorot, X., Bordes, A., and Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 315–323, 2011.

Guliyev, N. J. and Ismailov, V. E. A single hidden layer feedforward network with only one neuron in the hidden layer can approximate any univariate function. Neural Computation, 28(7):1289–1304, 2016.

Guliyev, N. J. and Ismailov, V. E. Approximation capability of two hidden layer feedforward neural networks with fixed weights. Neurocomputing, 316:262–269, 2018a.

Guliyev, N. J. and Ismailov, V. E. On the approximation by single hidden layer feedforward neural networks with fixed weights. Neural Networks, 98:296–304, 2018b.

Igelnik, B. and Parikh, N. Kolmogorov's spline network. IEEE Transactions on Neural Networks, 14(4):725–733, 2003.

Ismailov, V. E. On the approximation by neural networks with bounded number of neurons in hidden layers. Journal of Mathematical Analysis and Applications, 417(2):963–969, 2014.

Karpinski, M. and Macintyre, A. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences, 54(1):169–176, 1997.

Khovanskii, A. G. Fewnomials. Vol. 88 of Translations of Mathematical Monographs. American Mathematical Society, 1991.

Kolmogorov, A. N. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. In Doklady Akademii Nauk, volume 114, pp. 953–956. Russian Academy of Sciences, 1957.

Kůrková, V. Kolmogorov's theorem is relevant. Neural Computation, 3(4):617–622, 1991.

Kůrková, V. Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5(3):501–506, 1992.

Maiorov, V. and Pinkus, A. Lower bounds for approximation by MLP neural networks. Neurocomputing, 25(1-3):81–91, 1999.

Montanelli, H. and Yang, H. Error bounds for deep ReLU networks using the Kolmogorov–Arnold superposition theorem. Neural Networks, 129:1–6, 2020.

Schmidt-Hieber, J. The Kolmogorov–Arnold representation theorem revisited. arXiv preprint arXiv:2007.15884, 2020.

Shen, Z., Yang, H., and Zhang, S. Deep network approximation with discrepancy being reciprocal of width to power of depth. arXiv preprint arXiv:2006.12231, 2020a.

Shen, Z., Yang, H., and Zhang, S. Neural network approximation: Three hidden layers are enough. arXiv preprint arXiv:2010.14075, 2020b.

Yarotsky, D. and Zhevnerchuk, A. The phase diagram of approximation rates for deep neural networks. arXiv preprint arXiv:1906.09477, 2019.

Zell, T. Betti numbers of semi-Pfaffian sets, 1999.