Quantitative approximation results for complex-valued neural networks
A. Caragea, D.G. Lee, J. Maly, G. Pfander, F. Voigtlaender
Abstract:
We show that complex-valued neural networks with the modReLU activation function $\sigma(z) = \mathrm{ReLU}(|z|-1)\cdot z/|z|$ can uniformly approximate complex-valued functions of regularity $C^n$ on compact subsets of $\mathbb{C}^d$, giving explicit bounds on the approximation rate.

Key words and phrases: Deep neural networks, complex-valued neural networks, function approximation, modReLU activation function.

1. Introduction

Motivated by the remarkable practical success of machine learning algorithms based on deep neural networks (collectively called deep learning [14]) in applications like image recognition [13] and machine translation [21], the expressive power of such neural networks has recently been closely studied [16, 18, 26, 27]. These recent results mostly focus on networks using the rectified linear unit (ReLU) activation function $\varrho(x) = \max\{0, x\}$, which has been observed in practice to massively reduce the training time [9, 14] while apparently maintaining the same expressivity as networks with smooth activation functions, which have been studied already in the 90s [17].

Due to the missing support for complex arithmetic in the leading deep learning software libraries [22], practical applications of deep neural networks have mostly relied on real-valued neural networks. Alas, recently there has been an increased interest in complex-valued neural networks (CVNNs) for problems in which the input is naturally complex-valued and in which a faithful treatment of phase information is important [22, 23]. For instance, for the problem of MRI fingerprinting, CVNNs significantly outperform their real-valued counterparts [23]. Moreover, CVNNs have been shown to greatly improve the stability and convergence properties of recurrent neural networks [4, 25].

Motivated by this interest in complex-valued neural networks, in this paper we initiate the analysis of their expressive power, mathematically quantified by their ability to approximate functions of a given regularity. Specifically, we analyze how well CVNNs with the modReLU activation function (defined in Section 1.1) can approximate functions of Sobolev regularity $W^{n,\infty}$ on compact subsets of $\mathbb{C}^d$ (see Section 1.2); the explicit result is given in Section 1.3 below.

1.1. Complex-valued neural networks

In a complex-valued neural network (CVNN), each neuron computes a function of the form $z \mapsto \sigma(w^T z + b)$ with $z, w \in \mathbb{C}^N$ and $b \in \mathbb{C}$, and where $\sigma : \mathbb{C} \to \mathbb{C}$ is a complex activation function.

Formally, a complex-valued neural network (CVNN) is a tuple $\Phi = \big((A_1, b_1), \ldots, (A_L, b_L)\big)$, where $L =: L(\Phi) \in \mathbb{N}$ denotes the depth of the network and where $A_\ell \in \mathbb{C}^{N_\ell \times N_{\ell-1}}$ and $b_\ell \in \mathbb{C}^{N_\ell}$ for $\ell \in \{1, \ldots, L\}$. Then $d_{\mathrm{in}}(\Phi) := N_0$ and $d_{\mathrm{out}}(\Phi) := N_L$ denote the input- and output-dimension of $\Phi$. Given any function $\sigma : \mathbb{C} \to \mathbb{C}$, the network function associated to the network $\Phi$ (also called the realization of $\Phi$) is the function
$$R_\sigma \Phi := T_L \circ (\sigma \circ T_{L-1}) \circ \cdots \circ (\sigma \circ T_1) : \mathbb{C}^{d_{\mathrm{in}}(\Phi)} \to \mathbb{C}^{d_{\mathrm{out}}(\Phi)},$$
where $T_\ell z = A_\ell z + b_\ell$, and where $\sigma$ acts componentwise on vectors, meaning $\sigma\big((z_1, \ldots, z_k)\big) = \big(\sigma(z_1), \ldots, \sigma(z_k)\big)$.

The number of neurons of $\Phi$ is $N(\Phi) := \sum_{\ell=0}^{L} N_\ell$, and the number of weights of $\Phi$ is defined as $W(\Phi) := \sum_{j=1}^{L} \big(\|A_j\|_{\ell^0} + \|b_j\|_{\ell^0}\big)$, where $\|A\|_{\ell^0}$ denotes the number of non-zero entries of a matrix or vector $A$. Finally, we say that the weights of $\Phi$ are bounded by $C \geq 0$ if $|(A_\ell)_{j,k}| \leq C$ and $|(b_\ell)_j| \leq C$ for all applicable indices $\ell, j, k$.

Finally, we will also use the notion of a network architecture. Formally, this is a tuple $\mathcal{A} = \big((N_0, \ldots, N_L), (I_1, \ldots, I_L), (J_1, \ldots, J_L)\big)$, where $(N_0, \ldots, N_L)$ determines the depth $L$ of the network and the number of neurons $N_\ell$ in each layer. Furthermore, the sets $J_\ell \subset \{1, \ldots, N_\ell\}$ and $I_\ell \subset \{1, \ldots, N_\ell\} \times \{1, \ldots, N_{\ell-1}\}$ determine which weights of the network are allowed to be non-zero. Thus, a network $\Phi$ is of architecture $\mathcal{A}$ as above if $\Phi = \big((A_1, b_1), \ldots, (A_L, b_L)\big)$ with $A_\ell \in \mathbb{C}^{N_\ell \times N_{\ell-1}}$ and $b_\ell \in \mathbb{C}^{N_\ell}$, and if furthermore $(A_\ell)_{j,k} \neq 0$ only if $(j,k) \in I_\ell$ and $(b_\ell)_j \neq 0$ only if $j \in J_\ell$. The numbers of neurons and weights of an architecture $\mathcal{A}$ are defined as $N(\mathcal{A}) := \sum_{\ell=0}^{L} N_\ell$ and $W(\mathcal{A}) := \sum_{\ell=1}^{L} \big(|J_\ell| + |I_\ell|\big)$, respectively.

In the present paper, we will only consider neural networks using the modReLU activation function
$$\sigma : \mathbb{C} \to \mathbb{C}, \quad z \mapsto \varrho(|z| - 1)\cdot\frac{z}{|z|} = \begin{cases} 0, & \text{if } |z| \leq 1, \\ z - \frac{z}{|z|}, & \text{if } |z| \geq 1, \end{cases} \tag{1.1}$$
proposed in [4] as a generalization of the ReLU activation function $\varrho : \mathbb{R} \to \mathbb{R},\ x \mapsto \max\{0, x\}$, to the complex domain. We briefly discuss other activation functions in Section 1.4 below.

1.2. The class of functions to be approximated

We are interested in uniformly approximating functions $f : \mathbb{C}^d \to \mathbb{C}$ that belong to the Sobolev space $W^{n,\infty}$, with differentiability understood in the sense of real variables. Specifically, let
$$Q_{\mathbb{C}^d} := \big\{ \mathbf{z} = (z_1, \ldots, z_d) \in \mathbb{C}^d : \operatorname{Re} z_k, \operatorname{Im} z_k \in [0,1] \text{ for all } 1 \leq k \leq d \big\}$$
be the unit cube in $\mathbb{C}^d$. As in the definition of $Q_{\mathbb{C}^d}$, we will use throughout the paper boldface characters to denote real and complex vectors.

Identifying $\mathbf{z} = (z_1, \ldots, z_d) \in \mathbb{C}^d$ with $\mathbf{x} = \big(\operatorname{Re}(z_1), \ldots, \operatorname{Re}(z_d), \operatorname{Im}(z_1), \ldots, \operatorname{Im}(z_d)\big) \in \mathbb{R}^{2d}$, we will consider $\mathbb{C}^d \cong \mathbb{R}^{2d}$ as usual.
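The following is a minimal NumPy sketch of these definitions (the names `modrelu` and `realize` are ours, purely for illustration): it implements the modReLU activation from Equation (1.1) and the realization $R_\sigma\Phi$ of a small CVNN, under the convention that $\sigma$ is applied after every layer except the last.

```python
import numpy as np

def modrelu(z):
    """modReLU from Eq. (1.1): 0 for |z| <= 1, else z - z/|z|."""
    a = np.abs(z)
    safe = np.where(a == 0, 1.0, a)  # guard division; that branch is never selected
    return np.where(a <= 1, 0.0, z - z / safe)

def realize(Phi, z):
    """Realization R_sigma(Phi) of Phi = ((A_1, b_1), ..., (A_L, b_L)):
    affine maps T_l(z) = A_l z + b_l, with modReLU between layers."""
    *hidden, last = Phi
    for A, b in hidden:
        z = modrelu(A @ z + b)
    A, b = last
    return A @ z + b

# Tiny example: a random 2-layer CVNN on C^2.
rng = np.random.default_rng(0)
Phi = [(rng.normal(size=(3, 2)) + 1j * rng.normal(size=(3, 2)), rng.normal(size=3) + 0j),
       (rng.normal(size=(1, 3)) + 1j * rng.normal(size=(1, 3)), rng.normal(size=1) + 0j)]
print(realize(Phi, np.array([0.3 + 0.4j, 0.1 - 0.2j])))
```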
With this in mind, a complex function $g : \mathbb{C}^d \to \mathbb{C}$ can be identified with a pair of functions $g_{\operatorname{Re}}, g_{\operatorname{Im}} : \mathbb{R}^{2d} \to \mathbb{R}$ given by $g_{\operatorname{Re}} = \operatorname{Re}(g)$ and $g_{\operatorname{Im}} = \operatorname{Im}(g)$.

Given a real function $f : [0,1]^{2d} \to \mathbb{R}$ and $n \in \mathbb{N}$, we write $f \in W^{n,\infty}([0,1]^{2d}; \mathbb{R})$ if $f$ is $(n-1)$-times continuously differentiable with all derivatives of order $n-1$ being Lipschitz continuous. We then define
$$\|f\|_{W^{n,\infty}} := \max\Big\{ \max_{|\alpha| \leq n-1} \|\partial^\alpha f\|_{L^\infty},\ \max_{|\alpha| = n-1} \operatorname{Lip}(\partial^\alpha f) \Big\}.$$
Using this norm, we define the unit ball in the Sobolev space $W^{n,\infty}$ as
$$F_{n,d} := \big\{ f \in W^{n,\infty}([0,1]^{2d}; \mathbb{R}) : \|f\|_{W^{n,\infty}} \leq 1 \big\}$$
and define the set of functions that we seek to approximate by
$$\mathcal{F}_{n,d} := \big\{ g : Q_{\mathbb{C}^d} \to \mathbb{C} : g_{\operatorname{Re}}, g_{\operatorname{Im}} \in F_{n,d} \big\}.$$

1.3. Main result

The main result of this paper provides explicit error bounds for approximating functions $g \in \mathcal{F}_{n,d}$ using modReLU networks. This result can be seen as a generalization of the approximation bounds for ReLU networks developed in [26] to the complex domain.

Theorem 1.
For any $d, n \in \mathbb{N}$, there exists a constant $C = C(d,n) > 0$ with the following property: Given any $\varepsilon \in (0, 1/2)$, there exists a modReLU network architecture with no more than $C \cdot \ln(1/\varepsilon)$ layers and no more than $C \cdot \varepsilon^{-2d/n} \ln^2(1/\varepsilon)$ weights such that for any $g : Q_{\mathbb{C}^d} \to \mathbb{C}$ with $\|\operatorname{Re}(g)\|_{W^{n,\infty}}, \|\operatorname{Im}(g)\|_{W^{n,\infty}} \leq 1$, there exists a choice of network weights (all bounded by $C \cdot \varepsilon^{-2d}$) such that the associated network function $\widetilde{g}$ satisfies $|g(\mathbf{z}) - \widetilde{g}(\mathbf{z})| \leq \varepsilon$ for all $\mathbf{z} \in Q_{\mathbb{C}^d}$.

The exponent $-2d/n$ in place of $-d/n$ is explained by the identification $\mathbb{C}^d \cong \mathbb{R}^{2d}$.

Remark.
Note that the architecture, and therefore the size of the networks, is independent of the function $g$ to approximate, once we fix an approximation accuracy $\varepsilon$ and the parameters $n$ and $d$. Only the specific choice of the weights depends on $g$.

1.4. Related work and discussion

While the approximation properties of real-valued neural networks are comparatively well understood by now, the corresponding questions for complex-valued networks are still mostly open. In fact, even the property of universality (well studied for real-valued networks [7, 11, 12, 15]) was only settled for very specific activation functions [1–3, 10], until the recent paper [24] resolved the question. This universal approximation theorem for CVNNs highlights that the properties of complex-valued networks are significantly more subtle than those of their real counterparts: Real-valued networks (either shallow or deep) are universal if and only if the activation function is not a polynomial [15]. In contrast, shallow complex networks are universal if and only if the real part or the imaginary part of the activation function $\sigma$ is not polyharmonic. Finally, deep complex networks (with more than one hidden layer) are universal if and only if $\sigma$ is neither holomorphic, nor antiholomorphic, nor a polynomial (in $z$ and $\overline{z}$).

Except for these purely qualitative universality results, no quantitative approximation bounds for complex-valued networks are known whatsoever. The present paper is thus the first to provide such bounds.

Role of the activation function.
As empirically observed in [4, 23], the main advantage of complex-valued networks over their real-valued counterparts stems from the fact that the set of possible complex activation functions is much richer than in the real-valued case. In fact, one can lift each real-valued activation function $\rho : \mathbb{R} \to \mathbb{R}$ to the complex function $\sigma(z) := \rho(\operatorname{Re} z)$; then, writing $w = \alpha + i\beta$ and $z = x + iy$, we see $\sigma(w^T z + b) = \rho(\alpha^T x - \beta^T y + \operatorname{Re} b)$. Thus, identifying $\mathbb{C}^d \cong \mathbb{R}^{2d}$, every real-valued network can be written as a complex-valued one. Therefore, one can in principle transfer every approximation result for real-valued networks to a corresponding complex-valued result. Similar arguments apply to activation functions of the form $\sigma(z) = \rho(\operatorname{Re} z) + i\rho(\operatorname{Im} z)$.

Alas, using such "intrinsically real-valued" activation functions forfeits the main benefits of using complex-valued networks, namely increased expressivity and a faithful handling of phase and magnitude information. Therefore, the two most prominent complex-valued activation functions appearing in the literature (see [22, 23]) are the modReLU (see Equation (1.1)) and the complex cardioid, given by $\sigma(z) = \frac{z}{2}\big(1 + \frac{\operatorname{Re} z}{|z|}\big)$; neither of these is of the form $\rho(\operatorname{Re}(z))$ for a real activation function $\rho$.

In the present work, we focus on the modReLU, since it satisfies the natural phase-homogeneity property $\sigma(e^{i\theta} z) = e^{i\theta}\sigma(z)$. Investigating the complex cardioid, as well as other complex-valued activation functions, is an interesting topic for future work.

Role of the network depth.
Deep networks greatly outperform their shallow counterparts in applications [14]; therefore, much research has been devoted to rigorously quantifying the influence of the network depth on the expressivity of (real-valued) neural networks. The precise findings depend on the activation function: While for smooth activation functions already shallow networks with $O(\varepsilon^{-d/n})$ weights and neurons can uniformly approximate functions $f \in C^n([0,1]^d)$ up to error $\varepsilon$ (see [17]), this is not true for ReLU networks. To achieve the same approximation rate, ReLU networks need at least $O(1 + n/(2d))$ layers [18–20].

The proofs of these bounds crucially use that the ReLU is piecewise linear. Since this is not true of the modReLU, these arguments do not apply here. In fact, the result from [18–20] that no non-linear $C^2$ function can be approximated with error smaller than $O(W^{-2L})$ by ReLU networks of depth $L$ with $W$ weights does not hold for the modReLU, since the modReLU itself is a smooth, non-linear function when restricted to any set $\Omega$ compactly contained in $\mathbb{C} \setminus \{0\}$.

Regarding sufficiency, the best known approximation result for ReLU networks [26] shows, similar to our main theorem, that ReLU networks with depth $O(\ln(1/\varepsilon))$ and $O(\varepsilon^{-d/n}\ln(1/\varepsilon))$ weights can approximate $f \in C^n([0,1]^d)$ uniformly up to error $\varepsilon$. For networks with bounded depth, similar results are only known for approximation in $L^p$ [18], or for approximation in terms of the network width instead of the number of nonzero weights [16]. It is an interesting question whether similar results hold for modReLU networks.

Finally, we mention the intriguing result in [27], which shows that extremely deep ReLU networks (for which the number of layers is proportional to the number of weights) with extremely large weights can approximate functions $f \in C^n([0,1]^d)$ up to error $\varepsilon$ using only $O(\varepsilon^{-d/(2n)})$ weights. Due to the impractical size of the network weights, this bound has limited practical significance, but it is an extremely surprising and insightful mathematical result. We expect that the arguments in [27] can be extended to modReLU networks, but leave this as future work.

Optimality.
For modReLU networks with polynomial growth of the individual weights and logarithmic growth of the depth (as in Theorem 1), the approximation rate of Theorem 1 is optimal up to logarithmic factors; this can be seen by combining the quantization results in [6, Lemma 3.7] or [8, Lemma VI.8] with arguments from rate-distortion theory as in [6, 18]. For ReLU networks, a similar optimality result holds for networks with logarithmic growth of the depth even without assumptions on the growth of the network weights [26]. The proof relies on sharp bounds for the VC dimension of ReLU networks [5]. For modReLU networks, a similar question is more subtle, since no analogous VC dimension bounds are available. We thus leave it as future work to study optimality without assumptions on the size of the network weights.
Proof strategy. Inspired by [26], our proof proceeds by locally approximating $g$ using Taylor polynomials, and then showing that these Taylor polynomials and a suitable partition of unity can be well approximated by modReLU networks. To prove this, we first show in Section 2 that modReLU networks of constant size can approximate the functions $z \mapsto \operatorname{Re} z$ and $z \mapsto \operatorname{Im} z$ arbitrarily well; only the size of the individual weights of the network grows as the approximation accuracy improves. Then, based on proof techniques in [26], we show in Section 3 that modReLU networks with $O(\ln^2(1/\varepsilon))$ weights and $O(\ln(1/\varepsilon))$ layers can approximate the function $z \mapsto (\operatorname{Re} z)^2$ up to error $\varepsilon$. By a polarization argument, this also allows us to approximate the product function $(z,w) \mapsto zw$; see Section 4. After describing in Section 5 how modReLU networks can implement a partition of unity, we combine all the ingredients in Section 6 to prove Theorem 1.

2. Approximating the real and imaginary parts

In this section, we show that modReLU networks of constant size can approximate the functions $z \mapsto \operatorname{Re} z$ and $z \mapsto \operatorname{Im} z$ arbitrarily well, as stated by the following proposition.

Proposition 2.
For any $R \geq 1$ and $\varepsilon \in (0,1)$, there exist shallow $\sigma$-networks $\operatorname{Re}_{R,\varepsilon}$, $\operatorname{Im}_{R,\varepsilon}$ with three neurons and a fixed number of weights, all bounded in absolute value by $C R^3/\varepsilon^3$ with an absolute constant $C > 0$, satisfying
$$\big|\operatorname{Re}_{R,\varepsilon}(z) - \operatorname{Re}(z)\big| \leq \varepsilon \quad\text{and}\quad \big|\operatorname{Im}_{R,\varepsilon}(z) - \operatorname{Im}(z)\big| \leq \varepsilon \quad\text{for all } z \in \mathbb{C} \text{ with } |z| \leq R.$$

To prove Proposition 2, we need two ingredients: First, modReLU networks can implement the identity function exactly on bounded subsets of $\mathbb{C}$. Precisely, for arbitrary $R > 0$ it holds that $\operatorname{Id}_R(z) = z$ for $z \in \mathbb{C}$ with $|z| \leq R$, where
$$\operatorname{Id}_R(z) := \sigma(2z + 2R + 2) - \sigma(z + R + 1) - (R + 1). \tag{2.1}$$
Indeed, for $w \in \mathbb{C}$ with $|w| \geq 1$, we have $\sigma(2w) - \sigma(w) = 2w - \frac{w}{|w|} - \big(w - \frac{w}{|w|}\big) = w$. For $z \in \mathbb{C}$ with $|z| \leq R$, setting $w = z + R + 1$, so that $|w| \geq 1$, gives $\operatorname{Id}_R(z) = z$.

As the second ingredient, we use the following functions, parameterized by $h > 0$:
$$\operatorname{Im}_h(z) := \frac{-i}{h^2}\Big(\operatorname{sgn}\big(hz + h^{-1}\big) - 1\Big) \quad\text{and}\quad \operatorname{Re}_h(z) := \frac{1}{h^2}\Big(\operatorname{sgn}\big(hz - ih^{-1}\big) + i\Big), \quad z \in \mathbb{C}, \tag{2.2}$$
where $\operatorname{sgn}(w) := w/|w|$. The next lemma shows that these functions approximate the functions $\operatorname{Re}$ and $\operatorname{Im}$ well. The proof of Proposition 2 will then consist in showing that $\operatorname{Im}_h$ and $\operatorname{Re}_h$ can be implemented by modReLU networks.
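Before turning to Lemma 3, here is a quick numerical check (our own sketch, not part of the paper) of the identity network from Equation (2.1): it confirms that $\sigma(2w) - \sigma(w) = w$ for $|w| \geq 1$, and hence that $\operatorname{Id}_R$ reproduces the identity on $\{|z| \leq R\}$ exactly.

```python
import numpy as np

def sigma(z):
    """modReLU: 0 for |z| <= 1, else z - z/|z|."""
    a = np.abs(z)
    return np.where(a <= 1, 0.0, z - z / np.where(a == 0, 1.0, a))

def Id(z, R):
    """Id_R from Eq. (2.1); exact identity on {|z| <= R}, built from two modReLU neurons."""
    return sigma(2 * z + 2 * R + 2) - sigma(z + R + 1) - (R + 1)

R = 5.0
rng = np.random.default_rng(1)
z = (rng.uniform(-1, 1, 1000) + 1j * rng.uniform(-1, 1, 1000)) * R / np.sqrt(2)  # |z| <= R
assert np.max(np.abs(Id(z, R) - z)) < 1e-12  # sigma(2w) - sigma(w) = w for |w| >= 1
```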
Lemma 3. For $z \in \mathbb{C}$ and $0 < h \leq \min\big\{1, \frac{1}{4|z|}\big\}$, we have
$$\big|\operatorname{Im}(z) - \operatorname{Im}_h(z)\big| \leq 3h^2|z|^2 \quad\text{and}\quad \big|\operatorname{Re}(z) - \operatorname{Re}_h(z)\big| \leq 6h^2|z|^2. \tag{2.3}$$

Proof. Define $w := hz + h^{-1}$. If $\operatorname{Im}(z) = 0$, then $w \in (0,\infty)$, and hence $\operatorname{sgn}(w) - 1 = 0$, so that $\operatorname{Im}_h(z) = 0 = \operatorname{Im}(z)$ and the first part of Equation (2.3) is true. Hence, we can assume in what follows that $\operatorname{Im}(z) \neq 0$. Now, note by choice of $h$ that $h^2|z| \leq \frac14$, which shows that $|1 + h^2 z| \geq 1 - h^2|z| \geq \frac34$ and therefore also $|w| = \frac1h\,|1 + h^2 z| \geq \frac34\cdot\frac1h$.

As a consequence, we see
$$\Big|\frac{1}{h^2|w|} - \frac{1}{h}\Big| = \frac{1}{h}\cdot\Big|\frac{1}{h\,|h^{-1} + hz|} - 1\Big| = \frac{1}{h}\cdot\Big|\frac{1}{|1 + h^2 z|} - 1\Big| = \frac{1}{h}\cdot\frac{\big|1 - |1 + h^2 z|\big|}{|1 + h^2 z|} \leq \frac{1}{h}\cdot\frac{h^2|z|}{3/4} = \frac{4}{3}\,h|z|.$$
Because of $\operatorname{Im}(w) = h\operatorname{Im}(z) \neq 0$, this implies
$$\Big|\frac{1}{h^2}\operatorname{Im}(w/|w|) - \operatorname{Im}(z)\Big| = h\,|\operatorname{Im}(z)|\cdot\Big|\frac{1}{h^2}\cdot\frac{\operatorname{Im}(w/|w|)}{\operatorname{Im}(w)} - \frac{1}{h}\Big| = h\,|\operatorname{Im}(z)|\cdot\Big|\frac{1}{h^2|w|} - \frac{1}{h}\Big| \leq \frac{4}{3}\,h^2|z|^2.$$

Next, note
$$\operatorname{Re}(w) = \frac{1}{h}\big(1 + h^2\operatorname{Re}(z)\big) \geq \frac{1}{h}\big(1 - h^2|z|\big) \geq \frac{3}{4}\cdot\frac{1}{h} > 0.$$
Hence, $\operatorname{Re}(w) + |w| \geq 2\operatorname{Re}(w) \geq \frac{3}{2}\cdot\frac{1}{h}$. Since also $|\operatorname{Im}(w)| = h|\operatorname{Im}(z)| \leq h|z|$, we thus see
$$\big|\operatorname{Re}(w) - |w|\big| = \frac{\big|(\operatorname{Re}(w))^2 - |w|^2\big|}{\operatorname{Re}(w) + |w|} = \frac{(\operatorname{Im}(w))^2}{\operatorname{Re}(w) + |w|} \leq \frac{h^2|z|^2}{\frac{3}{2}\cdot\frac{1}{h}} = \frac{2}{3}\,h^3|z|^2.$$
Combining this with the estimate $|w| \geq \frac34\cdot\frac1h$ from the beginning of the proof, we see
$$\frac{1}{h^2}\,\big|\operatorname{Re}(w/|w|) - 1\big| = \frac{1}{h^2}\cdot\frac{\big|\operatorname{Re}(w) - |w|\big|}{|w|} \leq \frac{1}{h^2}\cdot\frac{\frac{2}{3}h^3|z|^2}{\frac{3}{4}\cdot\frac{1}{h}} = \frac{4}{3}\cdot\frac{2}{3}\,h^2|z|^2 \leq h^2|z|^2.$$

Combining everything, and using that $-i(u - 1) = \operatorname{Im}(u) - i\,(\operatorname{Re}(u) - 1)$ for $u = w/|w|$, we see
$$\Big|\operatorname{Im}(z) - \frac{-i}{h^2}\big(\operatorname{sgn}(h^{-1} + hz) - 1\big)\Big| \leq \Big|\frac{1}{h^2}\operatorname{Im}(w/|w|) - \operatorname{Im}(z)\Big| + \frac{1}{h^2}\big|\operatorname{Re}(w/|w|) - 1\big| \leq \frac{4}{3}h^2|z|^2 + h^2|z|^2 \leq 3h^2|z|^2,$$
proving the first estimate in Equation (2.3). To prove the second estimate, simply note that $\operatorname{Re}(z) = \operatorname{Im}(iz)$ and $\operatorname{sgn}(iw) = i\operatorname{sgn}(w)$. Since $h(iz) + h^{-1} = i\,(hz - ih^{-1})$, we get
$$\operatorname{Im}_h(iz) = \frac{-i}{h^2}\big(i\operatorname{sgn}(hz - ih^{-1}) - 1\big) = \frac{1}{h^2}\big(\operatorname{sgn}(hz - ih^{-1}) + i\big) = \operatorname{Re}_h(z),$$
whence
$$\big|\operatorname{Re}(z) - \operatorname{Re}_h(z)\big| = \big|\operatorname{Im}(iz) - \operatorname{Im}_h(iz)\big| \leq 3h^2|iz|^2 \leq 6h^2|z|^2,$$
as claimed. $\square$
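A numerical sanity check of Lemma 3 (a sketch under the constants as reconstructed above; the helper names are ours):

```python
import numpy as np

def sgn(z):
    return z / np.abs(z)

def im_h(z, h):
    """Im_h from Eq. (2.2)."""
    return (-1j / h**2) * (sgn(h * z + 1 / h) - 1)

def re_h(z, h):
    """Re_h from Eq. (2.2)."""
    return (1 / h**2) * (sgn(h * z - 1j / h) + 1j)

rng = np.random.default_rng(2)
z = rng.uniform(-3, 3, 2000) + 1j * rng.uniform(-3, 3, 2000)
for h in (0.05, 0.01):
    ok = h < 1 / (4 * np.abs(z))  # hypothesis of Lemma 3 (as reconstructed)
    assert np.all(np.abs(im_h(z[ok], h) - z[ok].imag) <= 3 * h**2 * np.abs(z[ok])**2)
    assert np.all(np.abs(re_h(z[ok], h) - z[ok].real) <= 6 * h**2 * np.abs(z[ok])**2)
```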
Proof of Proposition 2. Set $h := \frac{\varepsilon}{4R}$, noting that indeed $0 < h \leq \min\big\{1, \frac{1}{4|z|}\big\}$ and $6h^2|z|^2 \leq \varepsilon$ whenever $|z| \leq R$. Note that $w := hz - ih^{-1}$ satisfies $|w| \leq h^{-1} + h|z| \leq 2h^{-1} =: R'$ and $|w| = |hz - ih^{-1}| \geq h^{-1} - h|z| \geq 4 - \tfrac14 \geq 1$, so that $\operatorname{sgn}(w) = w - \sigma(w) = \operatorname{Id}_{R'}(w) - \sigma(w)$, with $\operatorname{Id}_{R'}$ as in Equation (2.1). Then it is not hard to see that
$$\operatorname{Re}_h(z) = h^{-2}\cdot\big(\operatorname{Id}_{R'}(w) - \sigma(w) + i\big) = h^{-2}\cdot\Big(\sigma\big(2hz + 4h^{-1} + 2 - 2ih^{-1}\big) - \sigma\big(hz + 2h^{-1} + 1 - ih^{-1}\big) - \sigma\big(hz - ih^{-1}\big) + i - 2h^{-1} - 1\Big) =: \operatorname{Re}_{R,\varepsilon}(z),$$
where the right-hand side is a shallow $\sigma$-network with three neurons and all weights bounded in absolute value by $4h^{-3} \leq C\cdot R^3/\varepsilon^3$, for an absolute constant $C > 0$. Finally, Lemma 3 easily shows $|\operatorname{Re}(z) - \operatorname{Re}_{R,\varepsilon}(z)| \leq \varepsilon$ for all $z \in \mathbb{C}$ with $|z| \leq R$. The claim concerning the approximation of $\operatorname{Im}(z)$ is shown similarly. $\square$

3. Approximating the function $z \mapsto (\operatorname{Re} z)^2$

The main result of this section is Proposition 6 below, showing that modReLU networks with $O(\ln^2(1/\varepsilon))$ weights and $O(\ln(1/\varepsilon))$ layers can approximate the function $z \mapsto (\operatorname{Re}(z))^2$ up to error $\varepsilon$.

As the first part of the proof, we show that modReLU networks can approximate functions of the form $z \mapsto \varrho(\operatorname{Re}(z) + c)$ with the usual ReLU $\varrho$; this will then allow us to use the approximation of the square function by ReLU networks as derived in [26].

Proposition 4.
For any $R \geq 1$, $c \in \mathbb{R}$, and $\varepsilon \in (0,1)$, there exist two-layer $\sigma$-networks $\varrho^{\operatorname{Re},c}_{R,\varepsilon}, \varrho^{\operatorname{Im},c}_{R,\varepsilon}$ with a fixed number of neurons and weights, all bounded in absolute value by $C R^3/\varepsilon^3 + 2|c|$ (with an absolute constant $C$), satisfying
$$\big|\varrho^{\operatorname{Re},c}_{R,\varepsilon}(z) - \varrho(\operatorname{Re}(z) + c)\big| \leq \varepsilon \quad\text{and}\quad \big|\varrho^{\operatorname{Im},c}_{R,\varepsilon}(z) - \varrho(\operatorname{Im}(z) + c)\big| \leq \varepsilon \quad\text{for all } z \in \mathbb{C} \text{ with } |z| \leq R.$$

Proof. Let us first prove the statement for $\varrho^{\operatorname{Re},c}_{R,\varepsilon}$. Note that $\sigma$ is $1$-Lipschitz, since

- $|\sigma(z) - \sigma(w)| = 0$ if $|z|, |w| \leq 1$;
- $|\sigma(z) - \sigma(w)| = |\sigma(z)| = |z| - 1 \leq |z| - |w| \leq |z - w|$ if $|w| \leq 1 < |z|$;
- $|\sigma(z) - \sigma(w)| = |\sigma(w)| = |w| - 1 \leq |w| - |z| \leq |w - z|$ if $|z| \leq 1 < |w|$;
- $\big|(z - \frac{z}{|z|}) - (w - \frac{w}{|w|})\big| \leq |z - w|$ otherwise,

where we used the law of cosines for the last case: indeed, if we denote the angle between $z$ and $w$ by $\alpha$, then this is also the angle between $z - \frac{z}{|z|}$ and $w - \frac{w}{|w|}$, so that we see for $|z|, |w| > 1$ that
$$\Big|\big(z - \tfrac{z}{|z|}\big) - \big(w - \tfrac{w}{|w|}\big)\Big|^2 = (|z|-1)^2 + (|w|-1)^2 - 2(|z|-1)(|w|-1)\cos(\alpha) = |z|^2 + |w|^2 - 2|z||w|\cos(\alpha) - \big(2|z| + 2|w| - 2\big)\big(1 - \cos(\alpha)\big) \leq |z|^2 + |w|^2 - 2|z||w|\cos(\alpha) = |z - w|^2.$$

Figure 1: A plot of the modReLU function $\sigma$ on the real axis, showing that $\sigma(x + 1) = \varrho(x)$ for $x \in [-1, \infty)$.

Now, set $h := 1/(2(R + |c|))$ and define
$$\varrho^{\operatorname{Re},c}_{R,\varepsilon}(z) := \frac{1}{h}\,\sigma\big(1 + h\cdot(\operatorname{Re}_{R,\varepsilon}(z) + c)\big),$$
where $\operatorname{Re}_{R,\varepsilon}$ is the network from Proposition 2. A direct computation (see also Figure 1) shows that $\sigma(x + 1) = \varrho(x)$ for $x \in [-1, \infty)$. Because of $|h\cdot(\operatorname{Re}(z) + c)| \leq h\cdot(|z| + |c|) \leq \frac12$, this implies
$$\frac1h\,\sigma\big(1 + h(\operatorname{Re}(z) + c)\big) = \frac1h\,\varrho\big(h(\operatorname{Re}(z) + c)\big) = \varrho\big(\operatorname{Re}(z) + c\big).$$
Combined with the $1$-Lipschitz continuity of $\sigma$, this implies
$$\big|\varrho(\operatorname{Re}(z) + c) - \varrho^{\operatorname{Re},c}_{R,\varepsilon}(z)\big| = \Big|\frac1h\,\sigma\big(1 + h(\operatorname{Re}(z) + c)\big) - \frac1h\,\sigma\big(1 + h(\operatorname{Re}_{R,\varepsilon}(z) + c)\big)\Big| \leq \big|\operatorname{Re}(z) - \operatorname{Re}_{R,\varepsilon}(z)\big| \leq \varepsilon.$$
Based on the properties of $\operatorname{Re}_{R,\varepsilon}$ from Proposition 2, it is easy to see that $\varrho^{\operatorname{Re},c}_{R,\varepsilon}$ is implemented by a two-layer $\sigma$-network with a fixed number of neurons and weights, all bounded in absolute value by $2|c| + C R^3/\varepsilon^3$. The construction of $\varrho^{\operatorname{Im},c}_{R,\varepsilon}$ is similar, replacing $\operatorname{Re}_{R,\varepsilon}(z)$ with $\operatorname{Im}_{R,\varepsilon}(z)$. $\square$

Our next goal is to construct $\sigma$-networks approximating the function $z \mapsto (\operatorname{Re} z)^2$. This will be based on combining the preceding result with the approximation of the real function $x \mapsto x^2$ by ReLU networks, as presented in [26]. The construction in [26] is based on the following auxiliary functions, some of them shown in Figure 2:
$$g : \mathbb{R} \to \mathbb{R}, \quad g(x) := 2\varrho(x) - 4\varrho\big(x - \tfrac12\big) + 2\varrho(x - 1),$$
$$g_s : \mathbb{R} \to \mathbb{R}, \quad g_s(x) := \underbrace{g \circ \cdots \circ g}_{s \text{ times}}(x) \quad\text{for } s \in \mathbb{N},$$
$$f_m : \mathbb{R} \to \mathbb{R}, \quad f_m(x) := x - \sum_{s=1}^{m} \frac{g_s(x)}{2^{2s}} \quad\text{for } m \in \mathbb{N} \cup \{0\}.$$

Figure 2: A plot of the function $g$, its compositions $g \circ g$ and $g \circ g \circ g$, and the square approximations $f_m$ for $m = 0, 1, 2$.

Evidently, $g$ is $2$-Lipschitz and $f_m(x) = x$ for $x \notin [0,1]$. It is easily seen (cf. [26, Proposition 2]) that
$$\big|x^2 - f_m(x)\big| \leq 2^{-2m-2} \quad\text{for } 0 \leq x \leq 1. \tag{3.1}$$

Further, using the functions $\varrho^{\operatorname{Re},c}_{R,\varepsilon}$ from Proposition 4, we define for $z \in \mathbb{C}$:
$$g^{\operatorname{Re}}(z) := g(\operatorname{Re}(z)) = 2\varrho(\operatorname{Re}(z)) - 4\varrho\big(\operatorname{Re}(z) - \tfrac12\big) + 2\varrho(\operatorname{Re}(z) - 1),$$
$$g^{\operatorname{Re},s}(z) := \underbrace{g^{\operatorname{Re}} \circ \cdots \circ g^{\operatorname{Re}}}_{s \text{ times}}(z) = g_s(\operatorname{Re}(z)) \quad (\text{since } g : \mathbb{R} \to \mathbb{R}),$$
$$g^{\operatorname{Re}}_{R,\varepsilon}(z) := 2\varrho^{\operatorname{Re},0}_{R,\varepsilon}(z) - 4\varrho^{\operatorname{Re},-1/2}_{R,\varepsilon}(z) + 2\varrho^{\operatorname{Re},-1}_{R,\varepsilon}(z),$$
$$g^{\operatorname{Re},s}_{R,\varepsilon}(z) := \underbrace{g^{\operatorname{Re}}_{R,\varepsilon} \circ \cdots \circ g^{\operatorname{Re}}_{R,\varepsilon}}_{s \text{ times}}(z).$$

As is clear from Figure 2, the function $g : \mathbb{R} \to \mathbb{R}$ is $2$-Lipschitz, which in turn implies that $g^{\operatorname{Re}} : \mathbb{C} \to \mathbb{C}$ is $2$-Lipschitz; indeed, $|g(\operatorname{Re}(z)) - g(\operatorname{Re}(z'))| \leq 2\,|\operatorname{Re}(z) - \operatorname{Re}(z')| \leq 2\,|z - z'|$ for $z, z' \in \mathbb{C}$. As the last preparation for the proof of Proposition 6, we need the following technical bound concerning the size of $g^{\operatorname{Re},s}_{R,\varepsilon}(z)$.
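The functions $g$, $g_s$, and $f_m$ are classical [26], and the bound (3.1) is easy to confirm numerically; the following sketch (function names ours) does so with plain ReLU arithmetic:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def g(x):
    """Yarotsky's tent map: 2*rho(x) - 4*rho(x - 1/2) + 2*rho(x - 1)."""
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1)

def f_m(x, m):
    """f_m(x) = x - sum_{s=1}^m g_s(x) / 2^(2s), with g_s the s-fold composition of g."""
    out = np.asarray(x, dtype=float).copy()
    gs = np.asarray(x, dtype=float).copy()
    for s in range(1, m + 1):
        gs = g(gs)
        out -= gs / 4.0**s
    return out

x = np.linspace(0, 1, 100001)
for m in range(0, 7):
    err = np.max(np.abs(x**2 - f_m(x, m)))
    assert err <= 2.0**(-2 * m - 2) + 1e-12  # matches the bound in Eq. (3.1)
```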
Lemma 5. For any $R \geq 1$, $0 < \varepsilon < R/8$, and $s, m \in \mathbb{N}$, we have
$$\big|g^{\operatorname{Re},s}_{R+m,\varepsilon}(z)\big| \leq R + 1 \quad\text{for all } z \in \mathbb{C} \text{ with } |z| \leq R.$$

Proof. It follows from Proposition 4 that if $0 < \varepsilon < R/8$ and $|z| \leq R + 1 \leq R + m$, then
$$\big|g^{\operatorname{Re}}_{R+m,\varepsilon}(z) - g^{\operatorname{Re}}(z)\big| \leq (2 + 4 + 2)\,\varepsilon = 8\varepsilon \leq R.$$
Since $|g^{\operatorname{Re}}(z)| = |g(\operatorname{Re}(z))| \leq 1$ for all $z \in \mathbb{C}$ (see Figure 2), we have
$$\big|g^{\operatorname{Re}}_{R+m,\varepsilon}(z)\big| \leq \big|g^{\operatorname{Re}}_{R+m,\varepsilon}(z) - g^{\operatorname{Re}}(z)\big| + \big|g^{\operatorname{Re}}(z)\big| \leq R + 1 \quad\text{for all } z \in \mathbb{C} \text{ with } |z| \leq R + 1.$$
This shows that $g^{\operatorname{Re}}_{R+m,\varepsilon}$ maps $\{z \in \mathbb{C} : |z| \leq R + 1\}$ into itself. It then follows by induction that $g^{\operatorname{Re},s}_{R+m,\varepsilon}$ maps $\{z \in \mathbb{C} : |z| \leq R + 1\}$ into itself, and this easily implies the claim. $\square$

Proposition 6.
Let $R \geq 3$. For $z \in \mathbb{C}$ with $|z| \leq R$ and $|\operatorname{Re}(z)| \leq 1$, the function $(\operatorname{Re}(z))^2$ can be approximated with any error $0 < \varepsilon < \min\{1, R/8\}$ by a $\sigma$-network whose depth is $O(\log(1/\varepsilon))$, whose numbers of weights and computation units are $O(\log^2(1/\varepsilon))$, and whose weights are bounded by $O\big(\varepsilon^{-3}(R^3 + \log^3(1/\varepsilon))\big)$.

Proof. Let $0 < \varepsilon < \min\{1, R/8\}$ and let $m \in \mathbb{N}$ be an integer that is to be determined later. First note that by Proposition 4, we have for any $|z| \leq R + m$ that
$$\big|g^{\operatorname{Re}}(z) - g^{\operatorname{Re}}_{R+m,\varepsilon}(z)\big| \leq (2 + 4 + 2)\,\varepsilon = 8\varepsilon. \tag{3.2}$$
Since $g^{\operatorname{Re}}$ is $2$-Lipschitz, a telescoping argument thus gives for $|z| \leq R$ that
$$\big|g^{\operatorname{Re},s}(z) - g^{\operatorname{Re},s}_{R+m,\varepsilon}(z)\big| \leq 2^{s-1}\,\big|g^{\operatorname{Re}}(z) - g^{\operatorname{Re}}_{R+m,\varepsilon}(z)\big| + 2^{s-2}\,\big|g^{\operatorname{Re}}\big(g^{\operatorname{Re}}_{R+m,\varepsilon}(z)\big) - g^{\operatorname{Re}}_{R+m,\varepsilon}\big(g^{\operatorname{Re}}_{R+m,\varepsilon}(z)\big)\big| + \cdots + \big|g^{\operatorname{Re}}\big(g^{\operatorname{Re},s-1}_{R+m,\varepsilon}(z)\big) - g^{\operatorname{Re}}_{R+m,\varepsilon}\big(g^{\operatorname{Re},s-1}_{R+m,\varepsilon}(z)\big)\big| \leq 8\varepsilon\,\big(2^{s-1} + 2^{s-2} + \cdots + 1\big) \leq 8\varepsilon\cdot 2^s,$$
where we used (3.2) and Lemma 5 in the second-to-last step.

Now, using the function $\operatorname{Id}_R$ from Equation (2.1), for $z \in \mathbb{C}$ define
$$f_{m,R,\varepsilon}(z) := \operatorname{Re}_{R,\varepsilon} \circ \operatorname{Id}_R \circ \cdots \circ \operatorname{Id}_R(z) - \sum_{s=1}^{m} \frac{g^{\operatorname{Re},s}_{R+m,\varepsilon} \circ \operatorname{Id}_R \circ \cdots \circ \operatorname{Id}_R(z)}{4^s},$$
where the number of the "factors" $\operatorname{Id}_R$ is chosen such that all summands have the same depth. It then follows for $m \in \mathbb{N}$ and $|z| \leq R$ that
$$\big|f_m(\operatorname{Re}(z)) - f_{m,R,\varepsilon}(z)\big| \leq \big|\operatorname{Re}(z) - \operatorname{Re}_{R,\varepsilon}(z)\big| + \sum_{s=1}^m \frac{\big|g^{\operatorname{Re},s}(z) - g^{\operatorname{Re},s}_{R+m,\varepsilon}(z)\big|}{4^s} \leq \varepsilon + 8\varepsilon\sum_{s=1}^m 2^{-s} \leq 9\varepsilon. \tag{3.3}$$
Setting $m = \big\lceil \log(1/\varepsilon)/(2\log 2)\big\rceil$ (so that $2^{-2m-2} \leq \varepsilon$) and combining (3.1) and (3.3), we deduce that for $|z| \leq R$ with $0 \leq \operatorname{Re}(z) \leq 1$,
$$\big|(\operatorname{Re}(z))^2 - f_{m,R,\varepsilon}(z)\big| \leq \big|(\operatorname{Re}(z))^2 - f_m(\operatorname{Re}(z))\big| + \big|f_m(\operatorname{Re}(z)) - f_{m,R,\varepsilon}(z)\big| \leq 10\varepsilon. \tag{3.4}$$

We will now extend this result to $z \in \mathbb{C}$ with $|z| \leq R$ and $|\operatorname{Re}(z)| \leq 1$. Define
$$\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z) := \varrho^{\operatorname{Re},0}_{R,\varepsilon}(z) + \varrho^{\operatorname{Re},0}_{R,\varepsilon}(-z) \quad\text{for } z \in \mathbb{C}.$$
Since $|x| = \varrho(x) + \varrho(-x)$ for $x \in \mathbb{R}$, it follows from Proposition 4 that for $|z| \leq R$,
$$\Big|\,|\operatorname{Re}(z)| - \operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z)\Big| \leq 2\varepsilon, \tag{3.5}$$
which implies $\big|\,|\operatorname{Re}(z)| - \operatorname{Re}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z))\big| \leq 2\varepsilon$ and $|\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z)| \leq |\operatorname{Re}(z)| + 2\varepsilon \leq 1 + 2\varepsilon \leq 3 \leq R$ if $|\operatorname{Re}(z)| \leq 1$. Therefore, if $|z| \leq R$, $|\operatorname{Re}(z)| \leq 1$, and $\operatorname{Re}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z)) \in [0,1]$, then
$$\big|(\operatorname{Re}(z))^2 - f_{m,R,\varepsilon}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z))\big| \leq \big(|\operatorname{Re}(z)| + |\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z)|\big)\cdot\big|\,|\operatorname{Re}(z)| - \operatorname{Re}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z))\big| + \big|\big(\operatorname{Re}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z))\big)^2 - f_{m,R,\varepsilon}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z))\big| \leq (1 + (1 + 2\varepsilon))\cdot 2\varepsilon + 10\varepsilon \quad\text{by (3.4)} \quad \leq 16\varepsilon.$$

Now assume that $|z| \leq R$, $|\operatorname{Re}(z)| \leq 1$, and $\operatorname{Re}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z)) \notin [0,1]$. Then (3.5) implies $\big|\,|\operatorname{Re}(z)| - \operatorname{Re}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z))\big| \leq 2\varepsilon$, and since $|\operatorname{Re}(z)| \in [0,1]$ and $\operatorname{Re}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z)) \notin [0,1]$, we have either $|\operatorname{Re}(z)| \in [0, 2\varepsilon]$ or $|\operatorname{Re}(z)| \in [1 - 2\varepsilon, 1]$. In the former case we have $|\operatorname{Re}(z)| \leq 2\varepsilon$ and $0 \leq 1 - |\operatorname{Re}(z)| \leq 1$, while in the latter we have $|\operatorname{Re}(z)| \leq 1$ and $0 \leq 1 - |\operatorname{Re}(z)| \leq 2\varepsilon$. In both cases, we get
$$|\operatorname{Re}(z)|\cdot\big(1 - |\operatorname{Re}(z)|\big) \leq 2\varepsilon.$$
Also, since $f_m(x) = x$ for $m \in \mathbb{N} \cup \{0\}$ and $x \notin [0,1]$, and since $|\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z)| \leq 1 + 2\varepsilon \leq 3 \leq R$, it follows from (3.3) that
$$\big|\operatorname{Re}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z)) - f_{m,R,\varepsilon}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z))\big| = \big|f_m\big(\operatorname{Re}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z))\big) - f_{m,R,\varepsilon}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z))\big| \leq 9\varepsilon.$$
Combining the above, we have
$$\big|(\operatorname{Re}(z))^2 - f_{m,R,\varepsilon}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z))\big| \leq \big|(\operatorname{Re}(z))^2 - |\operatorname{Re}(z)|\big| + \big|\,|\operatorname{Re}(z)| - \operatorname{Re}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z))\big| + \big|\operatorname{Re}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z)) - f_{m,R,\varepsilon}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z))\big| \leq |\operatorname{Re}(z)|\cdot\big(1 - |\operatorname{Re}(z)|\big) + 2\varepsilon + 9\varepsilon \leq 16\varepsilon.$$
Therefore, we conclude that if $z \in \mathbb{C}$ with $|z| \leq R$ and $|\operatorname{Re}(z)| \leq 1$, then $\big|(\operatorname{Re}(z))^2 - f_{m,R,\varepsilon}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z))\big| \leq 16\varepsilon$.

Finally, let us count the layers and weights of $f_{m,R,\varepsilon}(\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}(z))$ and determine a bound for the size of the weights. By Proposition 4, $\operatorname{Abs}^{\operatorname{Re}}_{R,\varepsilon}$ is a two-layer $\sigma$-network with a fixed number of neurons and weights, which can be bounded in absolute value by $C R^3/\varepsilon^3$. Similarly, the function $f_{m,R,\varepsilon}$ is an $(m+1)$-layer $\sigma$-network with $O(m^2)$ neurons and weights that are bounded by $C(R + m)^3/\varepsilon^3$. Collecting these numbers with $m = O(\log(1/\varepsilon))$ yields the claim. $\square$

Following the same argument as in [26, Proposition 3] (note that $f_{m,R,\varepsilon}(z) = 0 = (\operatorname{Re}(z))^2$ if $\operatorname{Re}(z) = 0$, by the properties of $\operatorname{Re}_h$), we hence deduce the following. The corresponding result for imaginary parts is obtained by replacing $\operatorname{Re}$ with $\operatorname{Im}$.

Proposition 7.
Given $R \geq 3$, $M \geq 1$, and $\varepsilon \in (0,1)$, there is a $\sigma$-network with two input units that implements a function $\widetilde{\times}^{\operatorname{Re}} : \mathbb{C}^2 \to \mathbb{C}$ such that

1. for any inputs $z, w \in \mathbb{C}$ with $|z|, |w| \leq R$ and $|\operatorname{Re}(z)|, |\operatorname{Re}(w)| < M$, we have $\big|\widetilde{\times}^{\operatorname{Re}}(z,w) - \operatorname{Re}(z)\operatorname{Re}(w)\big| \leq \varepsilon$;
2. the depth is $O\big(\log(M^2/\varepsilon)\big)$, the numbers of weights and computation units are $O\big(\log^2(M^2/\varepsilon)\big)$, and all weights are bounded in absolute value by $O\big(M^6\,\varepsilon^{-3}\,(R^3 + \log^3(M^2/\varepsilon))\big)$.

Proof. Let $\Phi_{R,\varepsilon}$ be the network from the statement of Proposition 6. Given $M \geq 1$ and $\varepsilon \in (0,1)$, we set
$$\widetilde{\times}^{\operatorname{Re}}(z,w) := 2M^2\Big(\Phi_{R,\varepsilon}\big(\tfrac{z+w}{2M}\big) - \Phi_{R,\varepsilon}\big(\tfrac{z}{2M}\big) - \Phi_{R,\varepsilon}\big(\tfrac{w}{2M}\big)\Big), \quad z, w \in \mathbb{C}.$$

For any $z, w \in \mathbb{C}$, we have
$$\operatorname{Re}(z)\operatorname{Re}(w) = 2M^2\Big(\big(\operatorname{Re}\big(\tfrac{z+w}{2M}\big)\big)^2 - \big(\operatorname{Re}\big(\tfrac{z}{2M}\big)\big)^2 - \big(\operatorname{Re}\big(\tfrac{w}{2M}\big)\big)^2\Big),$$
so that if $|z|, |w| \leq R$, $|\operatorname{Re}(z)| \leq M$, and $|\operatorname{Re}(w)| \leq M$, we have
$$\big|\operatorname{Re}(z)\operatorname{Re}(w) - \widetilde{\times}^{\operatorname{Re}}(z,w)\big| \leq 2M^2\Big(\big|\big(\operatorname{Re}(\tfrac{z+w}{2M})\big)^2 - \Phi_{R,\varepsilon}\big(\tfrac{z+w}{2M}\big)\big| + \big|\big(\operatorname{Re}(\tfrac{z}{2M})\big)^2 - \Phi_{R,\varepsilon}\big(\tfrac{z}{2M}\big)\big| + \big|\big(\operatorname{Re}(\tfrac{w}{2M})\big)^2 - \Phi_{R,\varepsilon}\big(\tfrac{w}{2M}\big)\big|\Big) \leq 6M^2\varepsilon.$$
Replacing $\varepsilon$ with $\varepsilon/(6M^2)$ in Proposition 6 gives the claimed error bound, together with the corresponding order of depth, number of weights, and computation units. $\square$

4. Approximating complex products

It is straightforward to derive from Proposition 7 an approximation for complex products.
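The algebraic identity behind this step, namely that four real-part products recover the full complex product (Equation (4.1) below), can be checked in isolation; in the following sketch, the exact product $\operatorname{Re}(z)\operatorname{Re}(w)$ stands in for the $\varepsilon$-accurate network $\widetilde{\times}^{\operatorname{Re}}$:

```python
import numpy as np

def re_prod(z, w):
    """Stand-in for tilde-x^Re of Proposition 7: here the exact product Re(z)*Re(w);
    the network would compute this only up to error epsilon."""
    return z.real * w.real

def complex_prod(z, w):
    """The combination from Eq. (4.1): four real-part products give z*w,
    using Im(z) = Re(-i*z)."""
    return (re_prod(z, w) - re_prod(-1j * z, -1j * w)
            + 1j * (re_prod(z, -1j * w) + re_prod(-1j * z, w)))

rng = np.random.default_rng(3)
z = rng.normal(size=500) + 1j * rng.normal(size=500)
w = rng.normal(size=500) + 1j * rng.normal(size=500)
assert np.max(np.abs(complex_prod(z, w) - z * w)) < 1e-12
```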
Corollary 8. Given $R \geq 3$, $M \geq 1$, and $\varepsilon \in (0,1)$, there is a $\sigma$-network with two input units that implements a function $\widetilde{\times} : \mathbb{C}^2 \to \mathbb{C}$ such that

1. for any inputs $z, w \in \mathbb{C}$ with $|z|, |w| \leq R$ and $|\operatorname{Re}(z)|, |\operatorname{Im}(z)|, |\operatorname{Re}(w)|, |\operatorname{Im}(w)| < M$, we have $\big|\widetilde{\times}(z,w) - zw\big| \leq \varepsilon$;
2. the depth is $O\big(\log(M^2/\varepsilon)\big)$, the numbers of weights and computation units are $O\big(\log^2(M^2/\varepsilon)\big)$, and all weights are bounded in absolute value by $O\big(M^6\,\varepsilon^{-3}\,(R^3 + \log^3(M^2/\varepsilon))\big)$.

Proof. Note that
$$zw = \operatorname{Re}(z)\operatorname{Re}(w) - \operatorname{Im}(z)\operatorname{Im}(w) + i\big(\operatorname{Re}(z)\operatorname{Im}(w) + \operatorname{Im}(z)\operatorname{Re}(w)\big) = \operatorname{Re}(z)\operatorname{Re}(w) - \operatorname{Re}(-iz)\operatorname{Re}(-iw) + i\big(\operatorname{Re}(z)\operatorname{Re}(-iw) + \operatorname{Re}(-iz)\operatorname{Re}(w)\big).$$
Then, by Proposition 7 (applied with accuracy $\varepsilon/4$), the function
$$\widetilde{\times}(z,w) := \widetilde{\times}^{\operatorname{Re}}(z,w) - \widetilde{\times}^{\operatorname{Re}}(-iz,-iw) + i\big(\widetilde{\times}^{\operatorname{Re}}(z,-iw) + \widetilde{\times}^{\operatorname{Re}}(-iz,w)\big) \tag{4.1}$$
satisfies $\big|\widetilde{\times}(z,w) - zw\big| \leq \varepsilon$ for every $z, w \in \mathbb{C}$ with $|z|, |w| \leq R$ and $|\operatorname{Re}(z)|, |\operatorname{Im}(z)|, |\operatorname{Re}(w)|, |\operatorname{Im}(w)| < M$. $\square$

5. A partition of unity

Define the functions $\psi^{\operatorname{Re}}, \psi^{\operatorname{Im}} : \mathbb{C} \to \mathbb{C}$ by
$$\psi^{\operatorname{Re}}(z) := 1 - \sigma\big(z + \tfrac12\big) + \sigma\big(z - \tfrac12\big) \quad\text{and}\quad \psi^{\operatorname{Im}}(z) := 1 + i\,\sigma\big(z + \tfrac{i}{2}\big) - i\,\sigma\big(z - \tfrac{i}{2}\big).$$
Note that for $x \in \mathbb{R}$ we have
$$\psi^{\operatorname{Re}}(x) = \begin{cases} 1 & \text{if } |x| \leq \tfrac12, \\ \tfrac32 - |x| & \text{if } \tfrac12 \leq |x| \leq \tfrac32, \\ 0 & \text{if } |x| \geq \tfrac32, \end{cases} \tag{5.1}$$
and similarly for $x \in i\mathbb{R}$ for $\psi^{\operatorname{Im}}$ (as $\psi^{\operatorname{Im}}$ is just a rotation by $\pi/2$ of $\psi^{\operatorname{Re}}$, i.e., $\psi^{\operatorname{Im}}(z) = \psi^{\operatorname{Re}}(-iz)$).

Let $N \geq 1$ be a natural number. For $m \in \{0, 1, \ldots, 2N\}$, define the functions $\varphi^{\operatorname{Re}}_m : \mathbb{C} \to \mathbb{C}$ by $\varphi^{\operatorname{Re}}_m(z) := \psi^{\operatorname{Re}}\big(4N\big(z - \frac{m}{2N}\big)\big)$. It is not difficult to see that on the unit interval $[0,1] \subset \mathbb{R} \subset \mathbb{C}$, the $\varphi^{\operatorname{Re}}_m$ form a partition of unity; see Figure 3. Similarly, defining $\varphi^{\operatorname{Im}}_m(z) := \psi^{\operatorname{Im}}\big(4N\big(z - \frac{m}{2N}\big)\big)$ for $m \in \{0, i, \ldots, 2Ni\}$, we see that the $\varphi^{\operatorname{Im}}_m$ form a partition of unity on the imaginary unit interval $i\cdot[0,1]$.
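A numerical check of the partition-of-unity property (a sketch assuming the reconstruction of $\psi^{\operatorname{Re}}$ given in Equation (5.1) above; names ours):

```python
import numpy as np

def sigma(z):
    a = np.abs(z)
    return np.where(a <= 1, 0.0, z - z / np.where(a == 0, 1.0, a))

def psi_re(z):
    """psi^Re(z) = 1 - sigma(z + 1/2) + sigma(z - 1/2); on the real axis this is the hat
    of Eq. (5.1): 1 on [-1/2, 1/2], linear ramps on 1/2 <= |x| <= 3/2, zero beyond."""
    return 1 - sigma(z + 0.5) + sigma(z - 0.5)

N = 4
x = np.linspace(0, 1, 10001)
phi = [psi_re(4 * N * (x - m / (2 * N))) for m in range(2 * N + 1)]
assert np.max(np.abs(sum(phi) - 1)) < 1e-12  # partition of unity on [0, 1]
```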
Figure 3: A plot of the function $\psi^{\operatorname{Re}}(4\,\cdot)$ and its shifts, showing that they form a partition of unity.

6. Proof of the main result

In this section, we prove our main result, which we state here once more for the convenience of the reader.
Theorem 9.
For any $d, n \geq 1$ and $\varepsilon \in (0, 1/2)$, there exists a $\sigma$-network architecture that can express any function $g \in \mathcal{F}_{n,d}$ with error $\varepsilon$. In other words, one can find a particular realization $\widetilde{g}$ of the architecture (a particular weight choice depending on $g$) satisfying $\|g - \widetilde{g}\|_\infty < \varepsilon$. Moreover, $\widetilde{g}$ has at most $c\,\ln(1/\varepsilon)$ layers and at most $c\,\varepsilon^{-2d/n}\ln^2(1/\varepsilon)$ weights, each no larger than $c\cdot\varepsilon^{-2d}$, for some constant $c = c(d,n) > 0$.

Before proving our main result, we will need a technical lemma.
Lemma 10.
Let $\Omega \neq \emptyset$, $M \in \mathbb{N}$, $\varepsilon \in \big(0, \frac{1}{M+1}\big)$, and $0 < \delta \leq \varepsilon^2$. Suppose that

• $\widetilde{\times} : \mathbb{C}^2 \to \mathbb{C}$ satisfies $|\widetilde{\times}(z,w) - zw| \leq \varepsilon$ for all $|z|, |w| \leq 4$;
• $\alpha_1, \ldots, \alpha_M : \Omega \to \mathbb{C}$ satisfy $|\alpha_j(z)| \leq 1$ for all $z \in \Omega$;
• $\beta_1, \ldots, \beta_M : \Omega \to \mathbb{C}$ with $|\alpha_j(z) - \beta_j(z)| \leq \delta$ for all $z \in \Omega$.

Define inductively $\gamma_1(z) := \beta_1(z)$ and $\gamma_{j+1}(z) := \widetilde{\times}\big(\beta_{j+1}(z), \gamma_j(z)\big)$ for $z \in \Omega$. Then
$$\Big|\gamma_M(z) - \prod_{\ell=1}^{M} \alpha_\ell(z)\Big| \leq 3M\varepsilon \quad\text{for all } z \in \Omega.$$

Proof.
Define $\theta_j(z) := \prod_{\ell=1}^{j} \alpha_\ell(z)$ and $\kappa_j := \varepsilon\sum_{\ell=1}^{j} (1+\varepsilon)^\ell$. We will show inductively that $|\gamma_j(z) - \theta_j(z)| \leq \kappa_j$. This will imply the claim by taking $j = M$, since we have $(1+\varepsilon)^\ell \leq (1+\varepsilon)^{M+1} \leq \big(1 + \frac{1}{M+1}\big)^{M+1} \leq e \leq 3$ and hence $\kappa_j \leq \kappa_M \leq 3M\varepsilon$.
The case $j = 1$ is trivial, since $|\gamma_1(z) - \theta_1(z)| = |\beta_1(z) - \alpha_1(z)| \leq \delta \leq \varepsilon \leq \kappa_1$. For the induction step, first note that
$$\kappa_j \leq \kappa_M = \varepsilon\,(1+\varepsilon)\,\frac{(1+\varepsilon)^M - 1}{(1+\varepsilon) - 1} \leq (1+\varepsilon)^{M+1} \leq \Big(1 + \frac{1}{M+1}\Big)^{M+1} \leq e \leq 3,$$
and hence $|\gamma_j(z)| \leq |\theta_j(z)| + \kappa_j \leq 4$, since $|\alpha_\ell(z)| \leq 1$ for all $\ell$, and thus $|\theta_j(z)| \leq 1$. Since also $|\beta_{j+1}(z)| \leq \delta + |\alpha_{j+1}(z)| \leq 1 + \delta \leq 2$, we see by choice of $\widetilde{\times}$ for any $z \in \Omega$ that
$$\big|\gamma_{j+1}(z) - \theta_{j+1}(z)\big| \leq \Big|\widetilde{\times}\big(\beta_{j+1}(z), \gamma_j(z)\big) - \beta_{j+1}(z)\gamma_j(z)\Big| + \big|\beta_{j+1}(z)\gamma_j(z) - \beta_{j+1}(z)\theta_j(z)\big| + \big|\big(\beta_{j+1}(z) - \alpha_{j+1}(z)\big)\theta_j(z)\big| \leq \varepsilon + |\beta_{j+1}(z)|\cdot\kappa_j + \delta\cdot|\theta_j(z)| \leq \varepsilon + \delta + (1+\delta)\,\kappa_j \leq \varepsilon(1+\varepsilon) + (1+\varepsilon)\,\kappa_j,$$
where the last step used that $\delta \leq \varepsilon^2 \leq \varepsilon$. Finally, note by the choice of $\kappa_j$ that $\varepsilon(1+\varepsilon) + (1+\varepsilon)\kappa_j = \varepsilon(1+\varepsilon) + \varepsilon\sum_{\ell=2}^{j+1}(1+\varepsilon)^\ell = \varepsilon\sum_{\ell=1}^{j+1}(1+\varepsilon)^\ell = \kappa_{j+1}$. This completes the induction and thus the proof. $\square$
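The stability expressed by Lemma 10 is easy to observe numerically. In the following sketch (ours, not from the paper), a synthetic $\varepsilon$-accurate multiplier plays the role of the network $\widetilde{\times}$:

```python
import numpy as np

def make_noisy_mult(eps, rng):
    """A multiplier that is only eps-accurate, standing in for tilde-x of Corollary 8."""
    def mult(z, w):
        noise = eps * (rng.uniform(-0.5, 0.5) + 1j * rng.uniform(-0.5, 0.5))
        return z * w + noise
    return mult

rng = np.random.default_rng(4)
M, eps = 10, 1e-3            # note eps < 1/(M+1), as required by Lemma 10
delta = eps**2
alpha = rng.uniform(-1, 1, M) + 1j * rng.uniform(-1, 1, M)
alpha /= np.maximum(1.0, np.abs(alpha))        # enforce |alpha_j| <= 1
beta = alpha + delta * rng.uniform(-1, 1, M)   # |alpha_j - beta_j| <= delta
mult = make_noisy_mult(eps, rng)

gamma = beta[0]
for j in range(1, M):
    gamma = mult(beta[j], gamma)
assert abs(gamma - np.prod(alpha)) <= 3 * M * eps
```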
Proof of Theorem 9.
We identify the function $g : Q_{\mathbb{C}^d} \to \mathbb{C}$ with the pair of functions $g_{\operatorname{Re}}, g_{\operatorname{Im}} : [0,1]^{2d} \to \mathbb{R}$, and we will only explicitly show the approximation of $f := g_{\operatorname{Re}}$, since $g_{\operatorname{Im}}$ can be approximated in exactly the same way.

We roughly follow the structure of the proof of Theorem 1 in [26]: In the first step, we approximate $f$ by $f^*$, a sum of Taylor polynomials subordinate to a partition of unity, constructed with our activation function $\sigma$ in mind; see Section 5. In the second step, we approximate $f^*$ by a realization $\widetilde{f}$ of an appropriate architecture. An additional complication over the real setting discussed in [26] is that we cannot access the real and imaginary parts of the inputs of $f$ exactly with a $\sigma$-network, but only approximately; see Proposition 2.

Step 1.
Let $N \in \mathbb{N}$ be a natural number, to be specified precisely in Equation (6.1) below. Employing notation similar to [26], we denote by bold-faced characters ordered tuples (vectors) of coordinates, and we write
$$\mathcal{N} := \{0, 1, \ldots, 2N\}^d \times \{0, i, \ldots, 2Ni\}^d.$$
For $\mathbf{m} := (m_1, m_2, \ldots, m_{2d}) \in \mathcal{N}$, we define on $[0,1]^{2d}$ the function
$$\varphi_{\mathbf{m}}(\mathbf{x}) = \prod_{k=1}^{d} \psi^{\operatorname{Re}}\Big(4N\cdot\Big(x_k - \frac{m_k}{2N}\Big)\Big)\cdot\prod_{\ell=d+1}^{2d} \psi^{\operatorname{Im}}\Big(4N\cdot\Big(x_\ell\, i - \frac{m_\ell}{2N}\Big)\Big),$$
where $\mathbf{x} = (x_1, \ldots, x_d, x_{d+1}, \ldots, x_{2d})$ and $\psi^{\operatorname{Re}}, \psi^{\operatorname{Im}}$ are given by Equation (5.1). As noted before, the $\varphi_{\mathbf{m}}$ form a partition of unity, and
$$\operatorname{supp}(\varphi_{\mathbf{m}}) \subset S_{\mathbf{m}} := \Big\{\mathbf{x} \in \mathbb{R}^{2d} : \Big|x_k - \frac{m_k}{2N}\Big| < \frac{1}{2N} \text{ and } \Big|x_\ell\,i - \frac{m_\ell}{2N}\Big| < \frac{1}{2N} \text{ for all } 1 \leq k \leq d < \ell \leq 2d\Big\}.$$
Now, for any $\mathbf{m} \in \mathcal{N}$, consider the Taylor polynomial of $f$ of degree $n-1$ at the grid point $\mathbf{x}_{\mathbf{m}} \in [0,1]^{2d}$ associated with $\mathbf{m}$ (with $k$-th coordinate $m_k/(2N)$ for $k \leq d$ and $m_\ell/(2Ni)$ for $\ell > d$), given by
$$P_{\mathbf{m}}(\mathbf{x}) = \sum_{\mathbf{n} : |\mathbf{n}| \leq n-1} \frac{\partial^{\mathbf{n}} f(\mathbf{x}_{\mathbf{m}})}{\mathbf{n}!}\,(\mathbf{x} - \mathbf{x}_{\mathbf{m}})^{\mathbf{n}},$$
and define
$$f^* = \sum_{\mathbf{m} \in \mathcal{N}} \varphi_{\mathbf{m}}\, P_{\mathbf{m}}.$$
We can bound the error by
$$|f(\mathbf{x}) - f^*(\mathbf{x})| = \Big|\sum_{\mathbf{m}} \varphi_{\mathbf{m}}(\mathbf{x})\big(f(\mathbf{x}) - P_{\mathbf{m}}(\mathbf{x})\big)\Big| \leq \sum_{\mathbf{m} : \mathbf{x} \in S_{\mathbf{m}}} \big|f(\mathbf{x}) - P_{\mathbf{m}}(\mathbf{x})\big| \leq 2^{2d}\max_{\mathbf{m} : \mathbf{x} \in S_{\mathbf{m}}} \big|f(\mathbf{x}) - P_{\mathbf{m}}(\mathbf{x})\big| \leq 2^{2d}\,\frac{(2d)^n}{n!}\Big(\frac{1}{2N}\Big)^n \|f\|_{W^{n,\infty}} \leq 2^{2d}\,\frac{(2d)^n}{n!}\Big(\frac{1}{2N}\Big)^n,$$
where, with the same argumentation as on page 11 of [26], we used successively the fact that the $\varphi_{\mathbf{m}}$ form a partition of unity and are supported on $S_{\mathbf{m}}$, that each $\mathbf{x}$ can lie in at most $2^{2d}$ supports $S_{\mathbf{m}}$, a standard bound for the Taylor polynomial error (see for instance the proof of [18, Lemma A.8]), and finally that $f$ is in the unit ball of the Sobolev space. Therefore, by choosing
$$N = \Big\lceil \frac{1}{2}\Big(\frac{n!\,\varepsilon}{2^{2d+1}\,(2d)^n}\Big)^{-1/n}\Big\rceil \tag{6.1}$$
(where $\lceil x \rceil$ is the smallest integer greater than or equal to $x$), we obtain that $\|f - f^*\|_\infty \leq \varepsilon/2$.

Step 2. We approximate $f^*$ up to error $\varepsilon/2$ by a $\sigma$-network. To this end, note that we can rewrite $f^*$ as
$$f^*(\mathbf{x}) = \sum_{\mathbf{m} \in \mathcal{N}}\ \sum_{\mathbf{n} : |\mathbf{n}| \leq n-1} c_{\mathbf{m},\mathbf{n}}\;\varphi_{\mathbf{m}}(\mathbf{x})\,(\mathbf{x} - \mathbf{x}_{\mathbf{m}})^{\mathbf{n}},$$
with coefficients satisfying $|c_{\mathbf{m},\mathbf{n}}| \leq 1$ (since $\|f\|_{W^{n,\infty}} \leq 1$). Thus, $f^*$ is a linear combination of at most $(2d)^n\,|\mathcal{N}|$ functions of the form $f_{\mathbf{m},\mathbf{n}}(\mathbf{z}) := \varphi_{\mathbf{m}}(\mathbf{x})\,(\mathbf{x} - \mathbf{x}_{\mathbf{m}})^{\mathbf{n}}$, each of which is a product of $M := 2d + |\mathbf{n}|$ factors. We approximate each of these products using Lemma 10 with $\Omega := Q_{\mathbb{C}^d}$, with $\varepsilon$ replaced by $\widetilde{\varepsilon}$, and with $\delta := \widetilde{\varepsilon}^{\,2}$, where
$$\widetilde{\varepsilon} := \frac{\varepsilon}{6M\,(2d)^n\,|\mathcal{N}|}, \quad\text{so that } 3M\widetilde{\varepsilon} \leq \frac{\varepsilon}{2\,(2d)^n\,|\mathcal{N}|}.$$
For $\widetilde{\times}$ we use the network from Corollary 8 with accuracy $\widetilde{\varepsilon}$ (and suitable absolute constants $R, M$); by Corollary 8, this is a $\sigma$-network with $L \leq C_1\ln(1/\widetilde{\varepsilon}) \leq C_2\ln(1/\varepsilon)$ layers and at most $C_1\ln^2(1/\widetilde{\varepsilon}) \leq C_2\ln^2(1/\varepsilon)$ weights, each bounded in absolute value by $C\cdot\widetilde{\varepsilon}^{\,-3}\ln^3(1/\widetilde{\varepsilon})$. Since, up to constants depending only on $d$ and $n$, we have $\widetilde{\varepsilon}^{\,-3}\ln^3(1/\widetilde{\varepsilon}) \leq \varepsilon^{-2d}$, it follows that the weights are bounded by $C\cdot\varepsilon^{-2d}$. Here, the constants $C_1, C_2$ only depend on $d, n$.

Next, note that $N/\widetilde{\varepsilon} \leq C(d,n)\cdot\varepsilon^{-2d}$. Therefore, Proposition 2 shows that there exist functions $\operatorname{Re}^{\flat}, \operatorname{Im}^{\flat} : \mathbb{C} \to \mathbb{C}$ with
$$\big|\operatorname{Re}^{\flat}(z) - \operatorname{Re}(z)\big| \leq \frac{\widetilde{\varepsilon}^{\,2}}{4N} \quad\text{and}\quad \big|\operatorname{Im}^{\flat}(z) - \operatorname{Im}(z)\big| \leq \frac{\widetilde{\varepsilon}^{\,2}}{4N} \quad\text{for all } z \in \mathbb{C} \text{ with } |z| \leq 2,$$
and such that $\operatorname{Re}^{\flat}$ and $\operatorname{Im}^{\flat}$ are implemented by shallow $\sigma$-networks with weights of size $\leq C\,\big(\widetilde{\varepsilon}^{\,2}/(4N)\big)^{-3} \leq C\cdot\varepsilon^{-2d}$.

Finally, to apply Lemma 10, writing $\mathbf{n} = (n_1, \ldots, n_{2d})$, we define $\alpha_k, \beta_k : \mathbb{C}^d \to \mathbb{C}$ for $1 \leq k \leq 2d + |\mathbf{n}| = M$ as follows:

• For $1 \leq k \leq d$, set $\alpha_k(\mathbf{z}) = \psi^{\operatorname{Re}}\big(4N\operatorname{Re}(z_k) - 2m_k\big)$ and $\beta_k(\mathbf{z}) = \psi^{\operatorname{Re}}\big(4N\operatorname{Re}^{\flat}(z_k) - 2m_k\big)$;
• For $d + 1 \leq k \leq 2d$, set $\alpha_k(\mathbf{z}) = \psi^{\operatorname{Im}}\big(4N\,i\operatorname{Im}(z_{k-d}) - 2m_k\big)$ and $\beta_k(\mathbf{z}) = \psi^{\operatorname{Im}}\big(4N\,i\operatorname{Im}^{\flat}(z_{k-d}) - 2m_k\big)$;
• For $2d + n_1 + \cdots + n_{\ell-1} < k \leq 2d + n_1 + \cdots + n_\ell$ with $1 \leq \ell \leq d$, set $\alpha_k(\mathbf{z}) = \operatorname{Re}(z_\ell) - \frac{m_\ell}{2N}$ and $\beta_k(\mathbf{z}) = \operatorname{Re}^{\flat}(z_\ell) - \frac{m_\ell}{2N}$;
• For $2d + n_1 + \cdots + n_{\ell-1} < k \leq 2d + n_1 + \cdots + n_\ell$ with $d < \ell \leq 2d$, set $\alpha_k(\mathbf{z}) = i\operatorname{Im}(z_{\ell-d}) - \frac{m_\ell}{2N}$ and $\beta_k(\mathbf{z}) = i\operatorname{Im}^{\flat}(z_{\ell-d}) - \frac{m_\ell}{2N}$.

Since $\psi^{\operatorname{Re}}, \psi^{\operatorname{Im}}$ are $1$-Lipschitz, it is then easy to see that $|\alpha_k(\mathbf{z}) - \beta_k(\mathbf{z})| \leq \widetilde{\varepsilon}^{\,2} = \delta$ for all $\mathbf{z} \in \Omega$ and $1 \leq k \leq M$. Furthermore, note that indeed $|\alpha_k(\mathbf{z})| \leq 1$ for all $\mathbf{z} \in \Omega$ and $1 \leq k \leq M$. Overall, we can thus apply Lemma 10, which shows for
$$\widetilde{f}_{\mathbf{m},\mathbf{n}} := \widetilde{\times}\Big(\beta_1,\ \widetilde{\times}\big(\beta_2,\ \ldots,\ \widetilde{\times}(\beta_{2d+|\mathbf{n}|-1},\ \beta_{2d+|\mathbf{n}|})\big)\Big)$$
that
$$\big|f_{\mathbf{m},\mathbf{n}}(\mathbf{z}) - \widetilde{f}_{\mathbf{m},\mathbf{n}}(\mathbf{z})\big| \leq 3M\widetilde{\varepsilon} \leq \frac{\varepsilon}{2\,(2d)^n\,|\mathcal{N}|} \quad\text{for all } \mathbf{z} \in \Omega.$$
Thus, setting $\widetilde{f} := \sum_{\mathbf{m} \in \mathcal{N}}\sum_{\mathbf{n} \in \mathbb{N}_0^{2d},\,|\mathbf{n}| \leq n-1} c_{\mathbf{m},\mathbf{n}}\,\widetilde{f}_{\mathbf{m},\mathbf{n}}$, we obtain $\|f^* - \widetilde{f}\|_\infty \leq \varepsilon/2$ and hence $\|f - \widetilde{f}\|_\infty \leq \varepsilon$.

To count the size of the resulting network (see Figure 4 for a schematic), note:

• The network $\operatorname{Id}_R$ has $O(1)$ weights, layers, and weight bounds;
• The networks $\operatorname{Re}^{\flat}, \operatorname{Im}^{\flat}$ have $O(1)$ weights and layers and $O(\varepsilon^{-2d})$ weight bounds;
• The networks implementing $\psi^{\operatorname{Re}}, \psi^{\operatorname{Im}}$ have $O(1)$ weights, layers, and weight bounds;
• The network $\widetilde{\times}$ has $O(\ln(1/\widetilde{\varepsilon}))$ layers and $O(\ln^2(1/\widetilde{\varepsilon}))$ weights, bounded in absolute value by $O(\widetilde{\varepsilon}^{\,-3}\ln^3(1/\widetilde{\varepsilon}))$. Since, up to constants depending only on $d$ and $n$, $\widetilde{\varepsilon}^{\,-3}\ln^3(1/\widetilde{\varepsilon}) \leq \varepsilon^{-2d}$, it follows that $\widetilde{\times}$ has $O(\ln(1/\varepsilon))$ layers and $O(\ln^2(1/\varepsilon))$ weights, all bounded up to a constant by $\varepsilon^{-2d}$.

We can therefore conclude that for some $C = C(d,n) > 0$, the network $\widetilde{f}_{\mathbf{m},\mathbf{n}}$ has at most $C\ln(1/\varepsilon)$ layers and $C\ln^2(1/\varepsilon)$ weights, and that its weights are bounded by $C\cdot\varepsilon^{-2d}$. Finally, since $\widetilde{f}$ is a linear combination of no more than $(2d)^n(2N+1)^{2d}$ networks $\widetilde{f}_{\mathbf{m},\mathbf{n}}$ with coefficients no larger in absolute value than $1$, and recalling that $N$ scales like $\varepsilon^{-1/n}$, it follows that for some $C = C(d,n) > 0$ independent of $\varepsilon$ and $f$, the network $\widetilde{f}$ has no more than $C\ln(1/\varepsilon)$ layers and no more than $C\cdot\varepsilon^{-2d/n}\ln^2(1/\varepsilon)$ weights, each no larger in absolute value than $C\cdot\varepsilon^{-2d}$. $\square$
Figure 4: Schematic of the architecture of the network implementing $\widetilde{f}_{\mathbf{m},\mathbf{n}}$. The inputs are first passed through $\operatorname{Re}^{\flat}$ and $\operatorname{Im}^{\flat}$ and then through the networks computing the factors $\beta_k$; the factors are multiplied together one at a time by copies of $\widetilde{\times}$, while all remaining values are passed on by copies of $\operatorname{Id}_R$, so that all branches of the network have the same depth.

References

[1] P. Arena, L. Fortuna, G. Muscato, and M. G. Xibilia. Neural networks in multidimensional domains: fundamentals and new trends in modelling and control, volume 234. Springer, 1998.
[2] P. Arena, L. Fortuna, R. Re, and M. G. Xibilia. On the capability of neural networks with complex neurons in complex valued functions approximation. In 1993 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 1993. doi:10.1109/ISCAS.1993.394188.
[3] P. Arena, L. Fortuna, R. Re, and M. G. Xibilia. Multilayer perceptrons to approximate complex valued functions. International Journal of Neural Systems, 6(04), 1995. doi:10.1142/s0129065795000299.
[4] M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.
[5] P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res., 20(63):1–17, 2019.
[6] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci., 1(1):8–45, 2019. doi:10.1137/18M118709X.
[7] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989. doi:10.1007/BF02551274.
[8] D. Elbrächter, D. Perekrestenko, P. Grohs, and H. Bölcskei. Deep neural network approximation theory. arXiv preprint arXiv:1901.02220, 2019.
[9] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
[10] A. Hirose. Complex-valued neural networks: theories and applications, volume 5. World Scientific, 2003. doi:10.1142/5345.
[11] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991. doi:10.1016/0893-6080(91)90009-T.
[12] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989. doi:10.1016/0893-6080(89)90020-8.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 2017. doi:10.1145/3065386.
[14] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553), 2015.
[15] M. Leshno, V. Lin, A. Pinkus, and S. Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6):861–867, 1993. doi:10.1016/S0893-6080(05)80131-5.
[16] J. Lu, Z. Shen, H. Yang, and S. Zhang. Deep network approximation for smooth functions. arXiv preprint arXiv:2001.03040, 2020.
[17] H. N. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8(1), 1996. doi:10.1162/neco.1996.8.1.164.
[18] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Netw., 108, 2018. doi:10.1016/j.neunet.2018.08.019.
[19] I. Safran and O. Shamir. Depth-width tradeoffs in approximating natural functions with neural networks. arXiv preprint arXiv:1610.09887, 2016.
[20] I. Safran and O. Shamir. Depth-width tradeoffs in approximating natural functions with neural networks. In International Conference on Machine Learning, pages 2979–2987. PMLR, 2017.
[21] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 2014.
[22] C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. J. Pal. Deep complex networks. In ICLR, 2018. URL: https://openreview.net/forum?id=H1T2hmZAb.
[23] P. Virtue, S. X. Yu, and M. Lustig. Better than real: Complex-valued neural nets for MRI fingerprinting. In 2017 IEEE International Conference on Image Processing (ICIP), 2017. doi:10.1109/ICIP.2017.8297024.
[24] F. Voigtlaender. The universal approximation theorem for complex-valued neural networks. arXiv preprint arXiv:2012.03351, 2020.
[25] M. Wolter and A. Yao. Complex gated recurrent neural networks. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
[26] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
[27] D. Yarotsky and A. Zhevnerchuk. The phase diagram of approximation rates for deep neural networks. Advances in Neural Information Processing Systems, 33, 2020.