Approximation capability of two hidden layer feedforward neural networks with fixed weights
NAMIG J. GULIYEV AND VUGAR E. ISMAILOV
Abstract.
We algorithmically construct a two hidden layer feedforward neural network (TLFN) model with the weights fixed as the unit coordinate vectors of the d-dimensional Euclidean space and with 3d + 2 hidden neurons in total, which can approximate any continuous d-variable function with arbitrary precision. This result, in particular, shows an advantage of the TLFN model over the single hidden layer feedforward neural network (SLFN) model, since SLFNs with fixed weights do not have the capability of approximating multivariate functions.

Key words and phrases. Multilayer feedforward neural network, hidden layer, sigmoidal function, activation function, weight, the Kolmogorov superposition theorem.

1. Introduction
The topic of artificial neural networks is an important and vibrant area of research in modern science. This is due to a large number of application areas. Nowadays, neural networks are being successfully applied in areas as diverse as computer science, finance, medicine, geology, engineering, physics, etc. Perhaps the greatest advantage of neural networks is their ability to be used as an arbitrary function approximation mechanism. In this paper, we are interested in questions of density (or approximation with arbitrary accuracy) of the multilayer feedforward neural network (MLFN) model. Approximation capabilities of this model have been well studied for the past 30 years. Choosing various activation functions σ, it was shown in a great number of papers that MLFNs can approximate any continuous function with arbitrary precision. The simplest MLFN model is the single hidden layer feedforward neural network (SLFN) model. This model evaluates a multivariate function of the form

    Σ_{i=1}^{k} c_i σ(w^i · x − θ_i)    (1.1)

of the variable x = (x_1, ..., x_d), d ≥ 1. Here the weights w^i are vectors in R^d, the thresholds θ_i and the coefficients c_i are real numbers, and the activation function σ is a univariate function. A multiple hidden layer network is defined by iterations of the SLFN model. For example, the output of the two hidden layer feedforward neural network (TLFN) model with k units in the first layer, m units in the second layer and the input x = (x_1, ..., x_d) is

    Σ_{i=1}^{m} e_i σ( Σ_{j=1}^{k} c_{ij} σ(w^{ij} · x − θ_{ij}) − ζ_i ).

Here e_i, c_{ij}, θ_{ij} and ζ_i are real numbers, w^{ij} are vectors of R^d, and σ is a fixed univariate function.

In many applications, it is convenient to take the activation function σ to be a sigmoidal function, which is defined by the conditions

    lim_{t→−∞} σ(t) = 0 and lim_{t→+∞} σ(t) = 1.

The literature on neural networks abounds with the use of such functions and their superpositions.
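For concreteness, here is a minimal sketch (ours, not from the paper; all names are illustrative, and the logistic function is used only as one example of a sigmoidal σ) of how the SLFN output (1.1) and the TLFN output above are evaluated:

import math

def slfn(x, weights, thetas, coeffs, sigma):
    # single hidden layer output: sum_i c_i * sigma(w_i . x - theta_i), as in (1.1)
    return sum(c * sigma(sum(wj * xj for wj, xj in zip(w, x)) - t)
               for w, t, c in zip(weights, thetas, coeffs))

def tlfn(x, W, Theta, C, zetas, e, sigma):
    # two hidden layer output: sum_i e_i * sigma(sum_j c_ij * sigma(w_ij . x - theta_ij) - zeta_i)
    return sum(ei * sigma(slfn(x, W[i], Theta[i], C[i], sigma) - zetas[i])
               for i, ei in enumerate(e))

logistic = lambda t: 1.0 / (1.0 + math.exp(-t))
# a tiny example with d = 2 inputs, k = 2 first-layer units and m = 2 second-layer units
print(tlfn([0.2, -0.1], [[[1.0, 0.0], [0.0, 1.0]]] * 2, [[0.1, -0.2]] * 2,
           [[0.5, 0.5]] * 2, [0.0, 0.3], [1.0, -1.0], logistic))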
The possibility of approximating a continuous function on a compact subset of R^d, d ≥ 1, by SLFNs with a sigmoidal activation function has been studied extensively in many papers. To the best of our knowledge, Gallant and White [11] were the first to prove the universal approximation property for the SLFN model with a sigmoidal activation function. Their activation function, called the cosine squasher, has the ability to generate any trigonometric series. As such, this function has the density property. Carroll and Dickinson [3] implemented the inverse Radon transformation to approximate L^2 functions, using any continuous sigmoidal function as an activation function. Cybenko [8] proved that SLFNs with a continuous sigmoidal activation function can approximate any continuous function with arbitrary accuracy on compact subsets of R^d. Funahashi [10], independently of Cybenko, proved the density property for a continuous monotone sigmoidal function. Hornik, Stinchcombe and White [17] proved density of SLFNs with a discontinuous bounded sigmoidal function. Kůrková [28] showed that staircase-like functions of any sigmoidal type have the capability of approximating continuous univariate functions on any compact subset of R within arbitrarily small tolerance. This result was substantially used in Kůrková's further results, which showed that a continuous multivariate function can be approximated arbitrarily well by TLFNs with a sigmoidal activation function (see [27, 28]). Chen, Chen and Liu [4] generalized the result of Cybenko by proving that any continuous function on a compact subset of R^d can be approximated by SLFNs with a bounded (not necessarily continuous) sigmoidal activation function. Almost the same result was independently obtained by Jones [25]. Costarelli and Spigler [6] constructed special sums of the form (1.1), using a given function f ∈ C[a, b]. They then proved that these sums approximate f within any degree of accuracy. In their result, similar to [4], σ is any bounded sigmoidal function. Chui and Li [5] proved that SLFNs with a continuous sigmoidal activation function having integer weights and thresholds can approximate continuous univariate functions on any compact subset of the real line.

In a number of subsequent papers, which considered the density problem for the SLFN model, nonsigmoidal activation functions were allowed. Here we cite a few of them. The papers by Stinchcombe and White [40], Cotter [7], Hornik [16], Mhaskar and Micchelli [36] are among many others. It should be remarked that the most general result in this direction belongs to Leshno, Lin, Pinkus and Schocken [29]. They proved that the necessary and sufficient condition for a continuous activation function to have the density property is that it not be a polynomial. For a more detailed discussion of the density problem, see the review paper by Pinkus [37].
The above results show that SLFNs with various activation functions enjoy the universal approximation property. In recent years, the theory of neural networks has been developed further in this direction. For example, from the point of view of practical applications, SLFNs with a restricted set of weights have gained special interest (see, e.g., [9, 18, 20, 21, 24, 30]). It was proved that SLFNs with certain restricted sets of weights still possess the universal approximation property. For example, Stinchcombe and White [40] showed that SLFNs with a polygonal, polynomial spline or analytic activation function and a bounded set of weights have the universal approximation property. Ito [22, 23] investigated this property of networks using monotone sigmoidal functions, with only weights located on the unit sphere. In [18, 20, 21], the second coauthor considered SLFNs with weights varying on a restricted set of directions, and gave several necessary and sufficient conditions for good approximation by such networks. For a set of weights consisting of two directions, he showed that there is a geometrically explicit solution to the problem. Hahm and Hong [15] went further in this direction, and showed that SLFNs with fixed weights can approximate arbitrarily well any continuous univariate function. Since fixed weights reduce the computational expense and training time, this result is of particular interest. In a mathematical formulation, the result says that for a bounded measurable sigmoidal function σ, networks of the form Σ_{i=1}^{k} c_i σ(αx − θ_i) are dense in C[a, b]. Cao and Xie [2] strengthened this result by specifying the number of hidden neurons needed to realize an ε-approximation to any continuous function. By implementing the modulus of continuity, they established Jackson-type upper bound estimates for the approximation error.

Approximation capabilities of SLFNs with fixed weights were also analyzed in Lin, Guo, Cao and Xu [32]. Taking the activation function σ as a continuous, even and 2π-periodic function, the authors of [32] showed that neural networks of the form Σ_{i=1}^{r} c_i σ(x − x_i) can approximate any continuous function on [−π, π] with arbitrary precision ε. Note that all the weights are fixed equal to 1, and consequently do not depend on ε. To prove this, they first gave an integral representation for trigonometric polynomials, and constructed explicitly a network with weight 1 that approximates this integral representation. Finally, the obtained result for trigonometric polynomials was used to prove a Jackson-type upper bound for the approximation error.

Note that SLFNs with a fixed number of weights cannot approximate d-variable functions if d > 1. That is, if in (1.1) we have n different weights w^i (n is fixed), then there exist a compact set Q ⊂ R^d and a function f ∈ C(Q) which cannot be approximated arbitrarily well by networks of the form (1.1). This follows from a result of Lin and Pinkus on sums of n ridge functions (see [33, Theorem 5.1]). For details, see our recent paper [14]. Thus the above results of Hahm and Hong [15], Cao and Xie [2], and Lin, Guo, Cao and Xu [32] cannot be generalized to the d-dimensional case if one allows only the SLFN model of neural networks.

It should be remarked that in all of the above-mentioned works the number of neurons k in the hidden layer is not fixed. As such, to achieve a desired precision one may have to take an excessive number of hidden neurons. Unfortunately, practicality decreases as the number of neurons in the hidden layer increases. In other words, SLFNs are not always effective if the number of neurons in the hidden layer is prescribed. More precisely, they are effective if and only if we consider univariate functions. In [13], we considered constructive approximation on any finite interval of R by SLFNs with a fixed number of hidden neurons. We constructed algorithmically a smooth, sigmoidal, almost monotone activation function σ providing approximation to an arbitrary univariate continuous function within any degree of accuracy. Note that the result of [13] is not applicable to multivariate functions.

The first crucial step in investigating approximation capabilities of MLFNs with a prescribed number of hidden neurons was made by Maiorov and Pinkus [35]. Their remarkable result revealed that TLFNs with 3d units in the first layer and 6d + 3 units in the second layer can approximate an arbitrary continuous d-variable function. Using a different activation function than in [35], the second coauthor [19] showed that the numbers of neurons in the hidden layers can be reduced to d and 2d + 2, respectively. Note that the results of both papers carry a theoretical character, as they indicate only the existence of the corresponding TLFNs and their activation functions.

We see that in each result above at least one of the following general properties is violated:
(1) the number of hidden neurons is fixed;
(2) the weights are fixed;
(3) the activation function is computable;
(4) the network has the capability of approximating d-variable functions in the case d > 1.

2. The main result
In the sequel, we deal with an activation function which is monotonic in a weak sense. By weak monotonicity we understand the behavior of a function whose difference in absolute value from a monotonic function is a sufficiently small number. In this regard, we say that a real function f defined on a set X ⊆ R is λ-increasing (respectively, λ-decreasing) if there exists an increasing (respectively, decreasing) function u: X → R such that |f(x) − u(x)| ≤ λ for all x ∈ X. Clearly, 0-monotonicity coincides with the usual concept of monotonicity, and a λ_1-increasing function is λ_2-increasing whenever λ_1 ≤ λ_2.
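As a quick illustration (ours, not from the paper): the function f below is not increasing, yet it is 0.2-increasing, with the identity map as the increasing witness u. A numerical check on a grid:

import math

lam = 0.2
f = lambda x: x + lam * math.sin(20 * x)   # oscillates, hence not increasing (since 20 * lam > 1)
u = lambda x: x                            # an increasing witness
xs = [i / 200 for i in range(1001)]        # grid on [0, 5]
assert max(abs(f(x) - u(x)) for x in xs) <= lam   # |f - u| <= lam, so f is lam-increasing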
Our main result is the following theorem.

Theorem 2.1. Assume that a closed interval [a, b] ⊂ R is given, s = b − a, and λ is any sufficiently small positive real number. Then one can algorithmically construct a computable, infinitely differentiable, sigmoidal activation function σ: R → R which is strictly increasing on (−∞, s), λ-strictly increasing on [s, +∞) and satisfies the following property: for any continuous function f on the d-dimensional box [a, b]^d and any ε > 0, there exist constants e_p, c_{pq}, θ_{pq} and ζ_p such that the inequality

    | f(x) − Σ_{p=1}^{2d+2} e_p σ( Σ_{q=1}^{d} c_{pq} σ(w^q · x − θ_{pq}) − ζ_p ) | < ε

holds for all x = (x_1, ..., x_d) ∈ [a, b]^d. Here the weights w^q, q = 1, ..., d, are fixed as the coordinate vectors

    w^1 = (1, 0, ..., 0), w^2 = (0, 1, 0, ..., 0), ..., w^d = (0, 0, ..., 0, 1).
In addition, all the coefficients e_p, except one, are equal.

Proof. We start with the algorithmic construction of σ mentioned in the theorem. The algorithm consists of the following steps.

1. Consider the function

    h(x) := 1 − min{1/2, λ}^{x − s + 1}.

Obviously, this function is strictly increasing on the real line and satisfies the following properties:
(1) 0 < h(x) < 1 for x ∈ [s, +∞);
(2) 1 − h(s) ≤ λ;
(3) h(x) → 1 as x → +∞.

Our purpose is to construct σ satisfying the two-sided inequality

    h(x) < σ(x) < 1, x ∈ [s, +∞). (2.1)

Then our σ will approach 1 as x approaches +∞ and obey the inequality |σ(x) − h(x)| ≤ λ; that is, it will be a λ-increasing function.

2. In this step, we enumerate the monic polynomials with rational coefficients. Let (q_n)_{n=1}^{∞} be the Calkin–Wilf sequence (see [1]). We can enumerate all the rational numbers by setting

    r_0 := 0, r_{2n} := q_n, r_{2n−1} := −q_n, n = 1, 2, ....

Note that each monic polynomial with rational coefficients can uniquely be written as r_{k_0} + r_{k_1} x + ... + r_{k_{l−1}} x^{l−1} + x^l, and each positive rational number determines a unique finite continued fraction

    [m_0; m_1, ..., m_l] := m_0 + 1/(m_1 + 1/(m_2 + ... + 1/m_l))

with m_0 ≥ 0, m_1, ..., m_{l−1} ≥ 1 and m_l ≥ 2. We now construct a one-to-one mapping between the set of all monic polynomials with rational coefficients and the set of all positive rational numbers as follows. To the only zeroth-degree monic polynomial 1 we associate the rational number 1; to each first-degree monic polynomial of the form r_k + x we associate the rational number k + 2; to each second-degree monic polynomial of the form r_{k_0} + r_{k_1} x + x^2 we associate the rational number [k_0; k_1 + 2] = k_0 + 1/(k_1 + 2); and to each monic polynomial r_{k_0} + r_{k_1} x + ... + r_{k_{l−2}} x^{l−2} + r_{k_{l−1}} x^{l−1} + x^l of degree l ≥ 3 we associate the rational number [k_0; k_1 + 1, ..., k_{l−2} + 1, k_{l−1} + 2]. In other words, we define u_1(x) := 1, u_n(x) := r_{q_n − 2} + x if q_n ∈ Z, u_n(x) := r_{m_0} + r_{m_1 − 2} x + x^2 if q_n = [m_0; m_1], and u_n(x) := r_{m_0} + r_{m_1 − 1} x + ... + r_{m_{l−2} − 1} x^{l−2} + r_{m_{l−1} − 2} x^{l−1} + x^l if q_n = [m_0; m_1, ..., m_{l−2}, m_{l−1}] with l ≥ 3. Hence the first few elements of this sequence are

    1, x^2, x, x^2 − x, x^2 − 1, x^3, x − 1, x^2 + x, ....

This sequence of monic polynomials will be used in the sequel.
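The enumeration above is fully constructive. The following sketch (ours, not the authors' published code) generates the polynomials u_n; it uses Newman's recurrence q_{n+1} = 1/(2⌊q_n⌋ − q_n + 1) for the Calkin–Wilf sequence:

from fractions import Fraction
from math import floor

def calkin_wilf(n):
    # q_1 = 1; q_{k+1} = 1 / (2*floor(q_k) - q_k + 1)
    q = Fraction(1)
    for _ in range(n - 1):
        q = 1 / (2 * floor(q) - q + 1)
    return q

def r(k):
    # r_0 = 0, r_{2n} = q_n, r_{2n-1} = -q_n: an enumeration of all rationals
    if k == 0:
        return Fraction(0)
    n, sign = (k + 1) // 2, 1 if k % 2 == 0 else -1
    return sign * calkin_wilf(n)

def continued_fraction(q):
    # [m_0; m_1, ..., m_l] with the last entry >= 2 for non-integers
    cf = []
    while True:
        a = floor(q)
        cf.append(a)
        if q == a:
            return cf
        q = 1 / (q - a)

def monic_poly(n):
    """Coefficients [rho_0, ..., rho_{l-1}, 1] of the n-th monic polynomial u_n."""
    if n == 1:
        return [Fraction(1)]
    q = calkin_wilf(n)
    if q.denominator == 1:                    # q_n integer: u_n = r_{q_n - 2} + x
        return [r(q.numerator - 2), Fraction(1)]
    m = continued_fraction(q)                 # degree = len(m) >= 2
    rho = [r(m[0])] + [r(mj - 1) for mj in m[1:-1]] + [r(m[-1] - 2)]
    return rho + [Fraction(1)]

# expected: 1, x^2, x, x^2 - x, x^2 - 1, x^3, x - 1, x^2 + x
print([monic_poly(n) for n in range(1, 9)])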
3. First we construct σ on the intervals [(2n − 1)s, 2ns], n = 1, 2, .... For each monic polynomial u_n(x) = ρ_0 + ρ_1 x + ... + ρ_{l−1} x^{l−1} + x^l with rational coefficients, set

    B_1 := ρ_0 + (ρ_1 − |ρ_1|)/2 + ... + (ρ_{l−1} − |ρ_{l−1}|)/2,
    B_2 := ρ_0 + (ρ_1 + |ρ_1|)/2 + ... + (ρ_{l−1} + |ρ_{l−1}|)/2 + 1,

so that B_1 ≤ u_n(x) ≤ B_2 for all x ∈ [0, 1]. Note that the numbers B_1 and B_2 depend on n, but for simplicity we omit this in the notation. Consider the sequence

    M_n := h((2n + 1)s), n = 1, 2, ....

Obviously, this sequence is strictly increasing and converges to 1. Now we define σ as the function

    σ(x) := a_n + b_n u_n(x/s − 2n + 1), x ∈ [(2n − 1)s, 2ns]. (2.2)

Here

    a_1 := 1/2, b_1 := h(3s)/2, (2.3)

and

    a_n := ((1 + 2M_n)B_2 − (2 + M_n)B_1) / (3(B_2 − B_1)), b_n := (1 − M_n) / (3(B_2 − B_1)), n = 2, 3, .... (2.4)

It is not difficult to see that for n > 1 the numbers a_n, b_n are the coefficients of the linear function y = a_n + b_n x mapping the closed interval [B_1, B_2] onto the closed interval [(1 + 2M_n)/3, (2 + M_n)/3]; for n = 1, i.e. on the interval [s, 2s], σ(x) = (1 + M_1)/2. Thus, we obtain that

    h(x) < M_n < (1 + 2M_n)/3 ≤ σ(x) ≤ (2 + M_n)/3 < 1 (2.5)

for all x ∈ [(2n − 1)s, 2ns], n = 1, 2, ....

4. In this step, we construct σ on the intervals [2ns, (2n + 1)s], n = 1, 2, .... To this end we use the smooth transition function

    β_{a,b}(x) := β̂(b − x) / (β̂(b − x) + β̂(x − a)), where β̂(x) := e^{−1/x} for x > 0 and β̂(x) := 0 for x ≤ 0.
Clearly, β_{a,b}(x) = 1 for x ≤ a, β_{a,b}(x) = 0 for x ≥ b, and 0 < β_{a,b}(x) < 1 for a < x < b. Consider the sequence

    K_n := (σ(2ns) + σ((2n + 1)s)) / 2, n = 1, 2, ....

Recall that the numbers σ(2ns) and σ((2n + 1)s) have already been defined in the previous step. Since both of these numbers belong to the interval (M_n, 1), so does K_n. We now extend σ smoothly to the interval [2ns, 2ns + s/2]. Set ε := (1 − M_n)/6 and choose δ ≤ s/2 such that

    |a_n + b_n u_n(x/s − 2n + 1) − (a_n + b_n u_n(1))| ≤ ε, x ∈ [2ns, 2ns + δ]. (2.6)

One can select this δ as

    δ := min{εs/(b_n C), s/2},

where C > 0 is any number satisfying |u'_n(x)| ≤ C for x ∈ [1, 3/2]. If n = 1, then δ can be selected as s/2. Now define σ on the left-hand half of the interval [2ns, (2n + 1)s] as the function

    σ(x) := K_n − β_{2ns, 2ns+δ}(x) (K_n − a_n − b_n u_n(x/s − 2n + 1)), x ∈ [2ns, 2ns + s/2]. (2.7)

Let us prove that σ(x) satisfies the condition (2.1). Indeed, if 2ns + δ ≤ x ≤ 2ns + s/2, then there is nothing to prove, since σ(x) = K_n ∈ (M_n, 1). If 2ns ≤ x < 2ns + δ, then 0 < β_{2ns, 2ns+δ}(x) ≤ 1; hence σ(x) lies between the numbers K_n and A_n(x) := a_n + b_n u_n(x/s − 2n + 1). On the other hand, from (2.6) it follows that

    a_n + b_n u_n(1) − ε ≤ A_n(x) ≤ a_n + b_n u_n(1) + ε.

The last inequality together with (2.2) and the inequalities (2.5) yields that

    A_n(x) ∈ [(1 + 2M_n)/3 − ε, (2 + M_n)/3 + ε]

for x ∈ [2ns, 2ns + δ). Since ε = (1 − M_n)/6, the inclusion A_n(x) ∈ (M_n, 1) is valid. Now since both K_n and A_n(x) lie in the interval (M_n, 1), we obtain that

    h(x) < M_n < σ(x) < 1 for x ∈ [2ns, 2ns + s/2].

We define σ on the right-hand half of the interval in a similar way:

    σ(x) := K_n − (1 − β_{(2n+1)s−δ, (2n+1)s}(x)) (K_n − a_{n+1} − b_{n+1} u_{n+1}(x/s − 2n − 1)), x ∈ [2ns + s/2, (2n + 1)s],

where

    δ := min{εs/(b_{n+1} C), s/2}, ε := (1 − M_{n+1})/6, C ≥ sup_{[−1/2, 0]} |u'_{n+1}(x)|.

It is not difficult to verify, as above, that the constructed σ(x) satisfies the condition (2.1) on [2ns + s/2, (2n + 1)s] and

    σ(2ns + s/2) = K_n, σ^{(i)}(2ns + s/2) = 0, i = 1, 2, ....

Steps 3 and 4 together construct σ on the interval [s, +∞).
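For concreteness, here is a rough Python sketch of steps 3 and 4 (ours, not the authors' published code). It assumes the monic_poly helper from the enumeration sketch above, example values for s and λ, the reconstruction of h given in step 1, and a crude polynomial derivative bound in place of the constants C:

import math
# assumes monic_poly(n) from the enumeration sketch above; s and lam are example parameters
s, lam = 3.0, 0.25

def h(x):                                   # the auxiliary function of step 1
    return 1.0 - min(0.5, lam) ** (x - s + 1.0)

def M(n):
    return h((2 * n + 1) * s)

def poly_val(coeffs, t):
    return sum(float(c) * t ** j for j, c in enumerate(coeffs))

def deriv_bound(coeffs):
    # any upper bound for |u_n'| on [-3/2, 3/2] is acceptable when choosing delta below
    return sum(j * abs(float(c)) * 1.5 ** (j - 1) for j, c in enumerate(coeffs) if j > 0)

def ab(n):                                  # the coefficients a_n, b_n of (2.3)-(2.4)
    if n == 1:
        return 0.5, h(3 * s) / 2.0
    rho = [float(c) for c in monic_poly(n)[:-1]]
    B1 = rho[0] + sum(min(c, 0.0) for c in rho[1:])
    B2 = rho[0] + sum(max(c, 0.0) for c in rho[1:]) + 1.0
    Mn = M(n)
    a = ((1 + 2 * Mn) * B2 - (2 + Mn) * B1) / (3 * (B2 - B1))
    return a, (1 - Mn) / (3 * (B2 - B1))

def beta_hat(x):
    return math.exp(-1.0 / x) if x > 0 else 0.0

def beta(a, b, x):                          # smooth transition: 1 for x <= a, 0 for x >= b
    return beta_hat(b - x) / (beta_hat(b - x) + beta_hat(x - a))

def sigma_right(x):
    """sigma on [s, +infinity): polynomial pieces (step 3) glued by smooth transitions (step 4)."""
    k = int(x // s)
    if k % 2 == 1:                          # x in [(2n-1)s, 2ns]: formula (2.2)
        n = (k + 1) // 2
        a, b = ab(n)
        return a + b * poly_val(monic_poly(n), x / s - 2 * n + 1)
    n = k // 2                              # x in [2ns, (2n+1)s]: glued piece
    (a1, b1), (a2, b2) = ab(n), ab(n + 1)
    u1, u2 = monic_poly(n), monic_poly(n + 1)
    K = (a1 + b1 * poly_val(u1, 1.0) + a2 + b2 * poly_val(u2, 0.0)) / 2.0
    if x <= 2 * n * s + s / 2:              # left half, formula (2.7)
        eps, C = (1 - M(n)) / 6.0, deriv_bound(u1)
        delta = min(eps * s / (b1 * C), s / 2) if eps * C > 0 else s / 2
        A = a1 + b1 * poly_val(u1, x / s - 2 * n + 1)
        return K - beta(2 * n * s, 2 * n * s + delta, x) * (K - A)
    eps, C = (1 - M(n + 1)) / 6.0, deriv_bound(u2)   # right half (mirror of (2.7))
    delta = min(eps * s / (b2 * C), s / 2) if eps * C > 0 else s / 2
    A = a2 + b2 * poly_val(u2, x / s - 2 * n - 1)
    return K - (1 - beta((2 * n + 1) * s - delta, (2 * n + 1) * s, x)) * (K - A)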
Figure 2.1. The graph of σ on [0, 50] (s = 3).

5. On (−∞, s), we define σ as

    σ(x) := (1 − β̂(s − x)) (1 + M_1)/2, x ∈ (−∞, s).

Clearly, σ is a strictly increasing, smooth function on (−∞, s). In addition, σ(x) → σ(s) = (1 + M_1)/2 as x tends to s from the left, and σ^{(i)}(s) = 0 for i = 1, 2, ....
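Continuing the sketch above (again only as an illustration, reusing beta_hat, h, M and sigma_right), the remaining piece and a dispatcher for the whole real line are:

def sigma(x):
    # tail piece on (-inf, s), then the [s, +inf) construction from the previous sketch
    if x < s:
        return (1.0 - beta_hat(s - x)) * (1.0 + M(1)) / 2.0
    return sigma_right(x)

# quick sanity checks; kept to moderate arguments, since in double precision
# 1 - M(n) underflows for large n and the evaluation degenerates numerically
assert sigma(-1e7) < 1e-6
assert all(h(t) < sigma(t) < 1 for t in [s, 1.6 * s, 2.7 * s, 4.4 * s])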
This final step completes the construction of σ on the whole real line. Note that the constructed σ is sigmoidal, infinitely differentiable on R, strictly increasing on (−∞, s) and λ-strictly increasing on [s, +∞).

It should be noted that the above algorithm allows one to compute σ at any point of the real axis instantly. The code of this algorithm is available at https://sites.google.com/site/njguliyev/papers/tlfn. As a practical example, we give here the graph of σ (see Figure 2.1) and a numerical table (see Table 2.1) containing several computed values of this function. Figure 2.2 shows how the λ-increasing function σ changes as λ decreases. Figure 2.3 displays variations in the graph of σ with respect to the length s of the closed interval [a, b].

Now we show that, in addition to its nice properties such as computability, smoothness and weak monotonicity, our σ enjoys an important property of approximating every continuous d-variable function when used as an activation function for TLFNs with a fixed number of hidden neurons. It follows from (2.2) that

    σ(sx + (2n − 1)s) = a_n + b_n u_n(x), x ∈ [0, 1], (2.8)

for n = 1, 2, .... Here a_n and b_n are computed by (2.3) for n = 1 and by (2.4) for n > 1. Hence, the functions u_n, n = 1, 2, ..., can be represented in the form

    u_n(x) = σ(sx + (2n − 1)s)/b_n − a_n/b_n. (2.9)
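Assuming the helpers from the sketches above (s, ab, monic_poly, poly_val and sigma_right), the identity (2.9) can be checked numerically; the loose tolerance accounts for cancellation, since b_n is tiny:

import random

n = 2
a_n, b_n = ab(n)
for _ in range(5):
    x = random.random()                                                 # a point of [0, 1)
    u_from_sigma = (sigma_right(s * x + (2 * n - 1) * s) - a_n) / b_n   # right side of (2.9)
    assert abs(u_from_sigma - poly_val(monic_poly(n), x)) < 1e-6        # equals u_n(x) up to rounding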
Table 2.1. Some computed values of σ (s = 3).

Figure 2.2. Changes in the graph of σ with respect to λ (s = 1).
Let now f be any continuous function on the box [a, b]^d. By the Kolmogorov superposition theorem [26], in the form given by Lorentz [34] and Sprecher [38], there exist constants λ_q > 0, q = 1, ..., d, with Σ_{q=1}^{d} λ_q = 1 and nondecreasing continuous functions φ_p: [a, b] → [0, 1], p = 1, ..., 2d + 1, such that every continuous function f: [a, b]^d → R admits the representation

    f(x_1, ..., x_d) = Σ_{p=1}^{2d+1} g( Σ_{q=1}^{d} λ_q φ_p(x_q) ) (2.10)

for some g ∈ C[0, 1] depending on f.
Figure 2.3. Changes in the graph of σ with respect to s.

By the density of polynomials with rational coefficients in the space of continuous functions on any compact subset of R, for the exterior continuous univariate function g in (2.10) and any ε > 0 there exists a polynomial p(x) with rational coefficients such that

    |g(x) − p(x)| < ε / (2(2d + 1))

for all x ∈ [0, 1]. Denote by p_1 the leading coefficient of p. If p_1 ≠ 0 (i.e., p ≢ 0), we define u_n as u_n(x) := p(x)/p_1; otherwise we just set u_n(x) := 1. In both cases

    |g(x) − p_1 u_n(x)| < ε / (2(2d + 1)), x ∈ [0, 1].

This together with (2.9) means that

    |g(x) − (α σ(sx − β) − γ)| < ε / (2(2d + 1)) (2.11)

for some α, β, γ ∈ R and all x ∈ [0, 1]. Namely,

    α = p_1 / b_n, β = s − 2ns, γ = p_1 a_n / b_n. (2.12)

Substituting (2.11) in (2.10), we obtain that

    | f(x_1, ..., x_d) − Σ_{p=1}^{2d+1} ( α σ( s Σ_{q=1}^{d} λ_q φ_p(x_q) − β ) − γ ) | < ε/2 (2.13)

for all (x_1, ..., x_d) ∈ [a, b]^d.
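In code (our illustrative helper, assuming s and the ab function from the sketches above), the parameters of (2.12) are immediate once the index n of the monic polynomial u_n = p/p_1 in the enumeration is known; the helper takes that index and the leading coefficient p_1 as inputs:

def g_term_params(n, p1):
    # alpha, beta, gamma of (2.12): g(x) is approximated by alpha * sigma(s*x - beta) - gamma on [0, 1]
    a_n, b_n = ab(n)
    return p1 / b_n, s - 2 * n * s, p1 * a_n / b_n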
For each p = 1, ..., 2d + 1, the function φ_p in (2.10) is defined on [a, b]. For this function, using the linear transformation x = (t − a)/s from [a, b] to [0, 1] and the same procedure as for the function g above, we can obtain the inequality

    |φ_p(t) − (α_p σ(t − β_p) − γ_p)| < δ (2.14)

for all t ∈ [a, b]. Here δ is any positive real number, and the parameters α_p, β_p and γ_p depend on δ. Note that these parameters can be computed similarly as in (2.12). Since λ_q > 0, q = 1, ..., d, and Σ_{q=1}^{d} λ_q = 1, it follows from (2.14) that

    | Σ_{q=1}^{d} λ_q φ_p(x_q) − ( Σ_{q=1}^{d} λ_q α_p σ(x_q − β_p) − γ_p ) | < δ (2.15)

for all p = 1, ..., 2d + 1 and (x_1, ..., x_d) ∈ [a, b]^d.
Now since the function α σ(sx − β) is uniformly continuous on every closed interval of the real line, we can choose δ as small as necessary and obtain from (2.15) that

    | Σ_{p=1}^{2d+1} α σ( s Σ_{q=1}^{d} λ_q φ_p(x_q) − β ) − Σ_{p=1}^{2d+1} α σ( s ( Σ_{q=1}^{d} λ_q α_p σ(x_q − β_p) − γ_p ) − β ) | < ε/2.

This inequality may be rewritten in the form

    | Σ_{p=1}^{2d+1} α σ( s Σ_{q=1}^{d} λ_q φ_p(x_q) − β ) − Σ_{p=1}^{2d+1} α σ( Σ_{q=1}^{d} c_{pq} σ(w^q · x − β_p) − ζ_p ) | < ε/2, (2.16)

where c_{pq} = sλ_q α_p, ζ_p = sγ_p + β, and w^q is the q-th coordinate vector. From (2.13) and (2.16) it follows that

    | f(x) − ( Σ_{p=1}^{2d+1} α σ( Σ_{q=1}^{d} c_{pq} σ(w^q · x − β_p) − ζ_p ) − (2d + 1)γ ) | < ε. (2.17)

Clearly, the constant (2d + 1)γ can be written in the form

    (2d + 1)γ = α̃ σ( Σ_{q=1}^{d} c_q σ(w^q · x − θ_q) − ζ̃ ) (2.18)

for c_q = 0, q = 1, ..., d, and suitable coefficients α̃ and ζ̃. Considering (2.18) in (2.17), we finally obtain that

    | f(x) − Σ_{p=1}^{2d+2} e_p σ( Σ_{q=1}^{d} c_{pq} σ(w^q · x − θ_{pq}) − ζ_p ) | < ε,

where e_1 = e_2 = ... = e_{2d+1}. The last inequality completes the proof of the theorem. □

Remark 2.1. Obviously, a compact subset Q of the space R^d can be embedded into a box [−l, l]^d, and by the Tietze extension theorem (see [41, Theorem 15.8]), any continuous function f on Q can be extended to [−l, l]^d. Hence Theorem 2.1 is valid not only for boxes of the form [a, b]^d but for any compact set Q ⊂ R^d, with the proviso that s = 2l.

Remark 2.2. Theorem 2.1, in particular, shows that TLFNs are more powerful than SLFNs, since SLFNs with a fixed number of hidden neurons and/or fixed weights do not have the capability of approximating multivariate functions (see the Introduction). We refer the reader to [31] for interesting results and discussions concerning the comparison of performance between MLFNs and SLFNs.
Remark 2.3. In [12], Gripenberg showed that the general approximation property of feedforward multilayer perceptron networks can be achieved in networks where the number of neurons in each layer is bounded, but the number of layers grows to infinity. This is the case provided the activation function is twice continuously differentiable and not linear. Taking an exceedingly large number of layers is an indispensable part of Gripenberg's method. Can one develop a different method which enables one to use only a preliminarily prescribed number of layers for all approximated functions? To answer this question, we started with SLFNs. It turned out that in this case the answer is "yes" if the approximated functions are univariate (see [13]). Moreover, one can fix the weights of the constructed SLFNs. But SLFNs with fixed weights or a bounded number of neurons are provably incapable of approximating multivariate functions (see [14]). Then how many hidden layers with a bounded number of neurons are needed to approximate multivariate functions with arbitrary precision? First of all, one may want to know if any such constrained approximation is possible in practice. Theorem 2.1 shows that even two hidden layers and a specifically constructed activation function are sufficient to solve this problem affirmatively.
References

[1] N. Calkin and H. S. Wilf, Recounting the rationals, Amer. Math. Monthly (2000), 360–367.
[2] F. Cao and T. Xie, The construction and approximation for feedforword neural networks with fixed weights, Proceedings of the ninth international conference on machine learning and cybernetics, Qingdao, 2010, pp. 3164–3168.
[3] S. M. Carroll and B. W. Dickinson, Construction of neural nets using the Radon transform, Proceedings of the 1989 IEEE international joint conference on neural networks, vol. 1, IEEE, New York, 1989, pp. 607–611.
[4] T. Chen, H. Chen and R. Liu, A constructive proof of Cybenko's approximation theorem and its extensions, Computing science and statistics, Springer, 1992, pp. 163–168.
[5] C. K. Chui and X. Li, Approximation by ridge functions and neural networks with one hidden layer, J. Approx. Theory (1992), 131–141.
[6] D. Costarelli and R. Spigler, Constructive approximation by superposition of sigmoidal functions, Anal. Theory Appl. (2013), 169–196.
[7] N. E. Cotter, The Stone–Weierstrass theorem and its application to neural networks, IEEE Trans. Neural Networks (1990), 290–295.
[8] G. Cybenko, Approximation by superpositions of a sigmoidal function, Math. Control Signals Systems (1989), 303–314.
[9] S. Draghici, On the capabilities of neural networks using limited precision weights, Neural Networks (2002), 395–414.
[10] K. Funahashi, On the approximate realization of continuous mapping by neural networks, Neural Networks (1989), 183–192.
[11] A. R. Gallant and H. White, There exists a neural network that does not make avoidable mistakes, Proceedings of the IEEE 1988 international conference on neural networks, vol. 1, IEEE Press, New York, 1988, pp. 657–664.
[12] G. Gripenberg, Approximation by neural networks with a bounded number of nodes at each level, J. Approx. Theory (2003), no. 2, 260–266.
[13] N. J. Guliyev and V. E. Ismailov, A single hidden layer feedforward network with only one neuron in the hidden layer can approximate any univariate function, Neural Computation (2016), no. 7, 1289–1304. arXiv:1601.00013
[14] N. J. Guliyev and V. E. Ismailov, On the approximation by single hidden layer feedforward neural networks with fixed weights, Neural Networks (2018), 296–304. arXiv:1708.06219
[15] N. Hahm and B. I. Hong, An approximation by neural networks with a fixed weight, Comput. Math. Appl. (2004), no. 12, 1897–1903.
[16] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks (1991), 251–257.
[17] K. Hornik, M. Stinchcombe and H. White, Multilayer feedforward networks are universal approximators, Neural Networks (1989), 359–366.
[18] V. E. Ismailov, Approximation by neural networks with weights varying on a finite set of directions, J. Math. Anal. Appl. (2012), no. 1, 72–83.
[19] V. E. Ismailov, On the approximation by neural networks with bounded number of neurons in hidden layers, J. Math. Anal. Appl. (2014), no. 2, 963–969.
[20] V. E. Ismailov, Approximation by ridge functions and neural networks with a bounded number of neurons, Appl. Anal. (2015), no. 11, 2245–2260.
[21] V. E. Ismailov and E. Savas, Measure theoretic results for approximation by neural networks with limited weights, Numer. Funct. Anal. Optim. (2017), no. 7, 819–830.
[22] Y. Ito, Representation of functions by superpositions of a step or sigmoid function and their applications to neural network theory, Neural Networks (1991), 385–394.
[23] Y. Ito, Approximation of continuous functions on R^d by linear combinations of shifted rotations of a sigmoid function with and without scaling, Neural Networks (1992), 105–115.
[24] B. Jian, C. Yu and Y. Jinshou, Neural networks with limited precision weights and its application in embedded systems, Proceedings of the second international workshop on education technology and computer science, Wuhan, 2010, pp. 86–91.
[25] L. K. Jones, Constructive approximations for neural networks by sigmoidal functions, Proc. IEEE (1990), no. 10, 1586–1589; Correction and addition, Proc. IEEE (1991), no. 2, 243.
[26] A. N. Kolmogorov, On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition (Russian), Dokl. Akad. Nauk SSSR (1957), 953–956; English transl. in Amer. Math. Soc. Transl. (2) (1963), 55–59.
[27] V. Kůrková, Kolmogorov's theorem is relevant, Neural Comput. (1991), 617–622.
[28] V. Kůrková, Kolmogorov's theorem and multilayer neural networks, Neural Networks (1992), 501–506.
[29] M. Leshno, V. Ya. Lin, A. Pinkus and S. Schocken, Multilayer feedforward networks with a non-polynomial activation function can approximate any function, Neural Networks (1993), 861–867.
[30] Y. Liao, S.-C. Fang and H. L. W. Nuttle, A neural network model with bounded-weights for pattern classification, Comput. Oper. Res. (2004), 1411–1426.
[31] S. Lin, Limitations of shallow nets approximation, Neural Networks (2017), 96–102.
[32] S. Lin, X. Guo, F. Cao and Z. Xu, Approximation by neural networks with scattered data, Appl. Math. Comput. (2013), 29–35.
[33] V. Ya. Lin and A. Pinkus, Fundamentality of ridge functions, J. Approx. Theory (1993), 295–311.
[34] G. G. Lorentz, Metric entropy, widths, and superpositions of functions, Amer. Math. Monthly (1962), 469–485.
[35] V. Maiorov and A. Pinkus, Lower bounds for approximation by MLP neural networks, Neurocomputing (1999), 81–91.
[36] H. N. Mhaskar and C. A. Micchelli, Approximation by superposition of a sigmoidal function and radial basis functions, Adv. Appl. Math. (1992), 350–373.
[37] A. Pinkus, Approximation theory of the MLP model in neural networks, Acta numerica, 1999, Cambridge Univ. Press, Cambridge, 1999, pp. 143–195.
[38] D. A. Sprecher, On the structure of continuous functions of several variables, Trans. Amer. Math. Soc. (1965), 340–355.
[39] W. A. Stein et al., Sage Mathematics Software (Version 7.6), The Sage Developers, 2017.
[40] M. Stinchcombe and H. White, Approximating and learning unknown mappings using multilayer feedforward networks with bounded weights, Proceedings of the 1990 IEEE international joint conference on neural networks, vol. 3, IEEE, New York, 1990, pp. 7–16.
[41] S. Willard, General topology, Addison-Wesley Publishing Co., Reading, Mass.-London-Don Mills, Ont., 1970.
Institute of Mathematics and Mechanics, Azerbaijan National Academy of Sciences, 9 B. Vahabzadeh str., AZ1141, Baku, Azerbaijan.
Email address: [email protected]

Institute of Mathematics and Mechanics, Azerbaijan National Academy of Sciences, 9 B. Vahabzadeh str., AZ1141, Baku, Azerbaijan.
Email address: