Overcoming the Curse of Dimensionality in Neural Networks
KAREN YERESSIAN
Abstract.
Let $A$ be a set and $V$ a real Hilbert space. Let $H$ be a real Hilbert space of functions $f : A \to V$ and assume $H$ is continuously embedded in the Banach space of bounded functions. For $i = 1, \cdots, n$, let $(x_i, y_i) \in A \times V$ comprise our dataset. Let $0 < q < 1$ and let $f^* \in H$ be the unique global minimizer of the functional
$$u(f) = \frac{q}{2}\|f\|_H^2 + \frac{1-q}{2n}\sum_{i=1}^n \|f(x_i) - y_i\|_V^2.$$
In this paper we show that for each $k \in \mathbb{N}$ there exists a two layer network, where the first layer has $k$ functions, each the Riesz representation in the Hilbert space $H$ of a point evaluation functional, and the second layer is a weighted sum of the first layer, such that the functions $f_k$ realized by these networks satisfy
$$E\big[\|f_k - f^*\|_H^2\big] \le \Big(o(1) + \frac{C}{q^2}\,E\big[\|Du_I(f^*)\|_{H^*}^2\big]\Big)\frac{1}{k}.$$
By choosing the Hilbert space $H$ appropriately, the computational complexity of evaluating the Riesz representations of point evaluations may be small, and thus the network has low computational complexity.

1. Introduction
Let us denote by $H^m(\mathbb{R}^d)$ the Sobolev space of functions defined on $\mathbb{R}^d$ whose partial derivatives of order up to $m$ are in $L^2(\mathbb{R}^d)$. For $m$ large enough we know that $H^m(\mathbb{R}^d)$ is continuously embedded in $C_0(\mathbb{R}^d)$, the space of continuous functions on $\mathbb{R}^d$ which converge to $0$ at infinity. For $i = 1, \cdots, n$, let $(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R}$ and define $u : H^m(\mathbb{R}^d) \to \mathbb{R}$ by
$$u(f) = \|f\|_{H^m(\mathbb{R}^d)}^2 + \frac{1}{n}\sum_{i=1}^n |f(x_i) - y_i|^2.$$
Thus $u$ is a strictly convex functional defined on $H^m(\mathbb{R}^d)$. It is known that in this case $u$ has a unique global minimizer $f^* \in H^m(\mathbb{R}^d)$. It is also known that
(1.1) $f^* = \sum_{i=1}^n \lambda_i\,\Phi(\cdot - x_i)$
where $\Phi \in H^m(\mathbb{R}^d)$ is the fundamental solution of the corresponding elliptic operator, or equivalently the Riesz representation of the point evaluation functional at $0$. One may view $f^*$ as computed by a two layer neural network, where the first layer computes $\Phi(\cdot - x_i)$ for each $i$ and the second layer computes the sum with weights $\lambda_i$. Networks obtained in this manner are called regularization networks and were introduced in [8].

Date: July 23, 2019.
2010 Mathematics Subject Classification. Primary 68Q32; Secondary 68T05, 41A25, 41A63, 41A65.
Key words and phrases. Neural network, Regularization network, Curse of Dimensionality, Approximation.
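The representer formula (1.1) and the two layer view of it can be sketched numerically. This is a toy illustration and not the paper's construction: a Gaussian kernel $e^{-|x-x'|^2}$ is assumed as a stand-in for the Riesz representation $\Phi$ (the true $\Phi$ is the fundamental solution attached to $H^m$), and for this quadratic functional the weights solve the linear system $(K + nI)\lambda = y$.

```python
import numpy as np

# Toy sketch of the regularization network (1.1).  Assumption: a Gaussian
# kernel stands in for the Riesz representation Phi; the weights lambda_i
# come from the optimality condition (K + n I) lambda = y.
rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))                   # data points x_i
y = np.sin(X[:, 0])                           # targets y_i

def gram(A, B):
    # pairwise Gaussian kernel matrix between rows of A and rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)

K = gram(X, X)
lam = np.linalg.solve(K + n * np.eye(n), y)   # second-layer weights

def f_star(x):
    # first layer: kernel units centered at the x_i; second layer: weighted sum
    return gram(np.atleast_2d(x), X) @ lam

print(f_star(X[0]))
```

Evaluating $f^*$ at a new point costs one kernel evaluation per data point, which is the computational-complexity consideration raised in the abstract.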
For very large $n$ this approach is not feasible and one is interested in approximate minimizers.

A well known method to obtain approximations of the unique minimizer $f^*$ is the so-called conforming finite element method. In this method, and in many similar methods, one considers a finite dimensional subspace $H_k$ of $H^m(\mathbb{R}^d)$ which has good approximation properties. One then solves the problem in $H_k$ and applies Céa's lemma (cf. [1]) to obtain error estimates.

This approach has a couple of drawbacks. By deterministically choosing $H_k$ one usually needs to choose $k$, the dimension of $H_k$, growing exponentially with $d$; this is the curse of dimensionality for these problems. As cited in [7] and [9], a function $f \in C^r([0,1]^d;\mathbb{R})$ can be approximated in $C([0,1]^d;\mathbb{R})$ deterministically, with various forms of such networks whose first layer has $k$ components, achieving an error estimate of $O(k^{-\frac{r}{d}})$. Thus in higher dimensions one needs exponentially more neurons to achieve the same error, or one should have very smooth functions to approximate. If instead one proceeds with a randomized choice of the subspace $H_k$, then one overcomes the curse of dimensionality, but each basis function has very high computational complexity. As originally proved in [6] for Euclidean spaces and then extended in [5] to Hilbert spaces, if $H$ is a separable real Hilbert space of functions embedded compactly in $C([0,1]^d;\mathbb{R})$ then we have
$$E\big[\|f - F_k\|_{C([0,1]^d;\mathbb{R})}\big] = O\Big(\frac{1}{\sqrt{k}}\Big), \qquad F_k = \sum_{i=1}^k L_i(f)\,G_i,$$
where the $G_i \in C([0,1]^d;\mathbb{R})$ for $i = 1, \cdots, k$ are random functions with structure similar to that of Brownian motion, and the $L_i : H \to \mathbb{R}$ are random, almost surely discontinuous, linear functionals.

To overcome both the curse of dimensionality and the complexity of the involved functions, in this paper we consider, in a general setting, both randomization and adaptation to the functional $u$.
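The deterministic rate $O(k^{-\frac{r}{d}})$ quoted above translates into the unit-count arithmetic $k \approx \epsilon^{-d/r}$ for a target error $\epsilon$. A two-line computation with toy numbers (not from the paper) makes the exponential growth in $d$ concrete:

```python
import math

# Units k needed so that the deterministic rate k^(-r/d) reaches error eps:
# k ~ eps^(-d/r).  Toy arithmetic illustrating the curse of dimensionality.
def units_needed(eps, r, d):
    return math.ceil(eps ** (-d / r))

for d in (2, 4, 8, 16):
    print(d, units_needed(0.1, 2, d))
```

For fixed smoothness $r = 2$ and error $0.1$, doubling the dimension squares the required number of first-layer units.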
We form the solution through a random process, using parts of the functional $u$ at each step. Our error estimate is dimension independent and the same as the randomized approximation result above, i.e. $O(\frac{1}{\sqrt{k}})$, but the approximating functions have a simple structure similar to the sum in (1.1). The novelty of our result lies in applying stochastic gradient descent in function spaces: each step of the process adds a function to the first layer and adapts the weights of the second layer.

1.1. Structure of this paper.
In Section 2 we list some of the notation used in this paper. In Section 3 we present the main results of this paper. In Section 4 we consider appropriate Banach spaces for our loss functionals. In Section 5 we recall well known results about projection and constrained convex minimization in Hilbert spaces. In Section 6 we recall some facts about Banach space valued random variables, in particular conditional expectations involving the Bochner integral. In Section 7 we prove our main Theorem 1, which concerns stochastic gradient descent in function spaces. In Section 8 we apply our main Theorem 1 to a minimization problem arising in supervised learning and prove Theorem 2.
2. Notation

$(\Omega, \mathcal{F}, P)$ : a probability space;
$\mathcal{G}$, $\mathcal{F}$ : $\sigma$-algebras;
$\mathcal{B}(N)$ : $\sigma$-algebra of Borel subsets of the space $N$;
$H$ : real Hilbert space of functions;
$H^*$ : dual Hilbert space of the space $H$;
$V$ : real Hilbert space;
$N$ : Banach space;
$L^p(\Omega, \mathcal{F}, P)$ : Lebesgue spaces;
$L^p(\Omega, \mathcal{F}, P; N)$ : Bochner spaces (cf. Definition 3);
$C$, $C_1$, $C_2$ : generic constants;
$\chi_A$ : characteristic function of the set $A$;
$\bar{A}$ : the closure of $A$;
$|\cdot|$ : absolute value, length of a vector, norm of a matrix, Lebesgue measure or surface measure;
$\|\cdot\|$ : norm;
$[\cdot]$ : seminorm;
$B_{r,H}$ : $\{f \in H \mid \|f\|_H < r\}$;
$Du(f)$ : differential of the function $u$ at $f$;
$C(H)$ : continuous functions defined on $H$ (with values in $\mathbb{R}$);
$C(H, H^*)$ : continuous functions defined on $H$ with values in $H^*$;
$C^1(H)$ : differentiable functions defined on $H$, with continuous differentials;
$E[\cdot]$ : expectation;
$E[\cdot \mid \mathcal{G}]$ : conditional expectation with respect to the $\sigma$-algebra $\mathcal{G}$;
$L^*$ : for a linear operator $L : H \to H$, the dual operator $L^* : H \to H$;
$Proj_K$ : projection operator onto the closed and convex set $K$;
$o(1)$ : a positive sequence converging to $0$;
$R_H$ : Riesz representation operator mapping $H^*$ to $H$;
$(\,,\,)_H$ : inner product of the space $H$;
$\langle\,,\,\rangle_{H^*,H}$ : duality pairing between $H^*$ and $H$;
$\rho(A)$ : spectral radius of the bounded, symmetric and positive definite operator $A : H \to H$, i.e. $\rho(A) = \sup_{f \in H,\, \|f\|_H \le 1} (Af, f)_H$.

3. Main Results
For $v \in C(H, H^*)$ let us define
(3.1) $\|v\|_{C_b(H,H^*;\max(1,\|\cdot\|_H))} = \sup_{f \in H} \dfrac{\|v(f)\|_{H^*}}{\max(1, \|f\|_H)}$
and $C_b(H, H^*; \max(1, \|\cdot\|_H))$ the space of those $v \in C(H, H^*)$ such that $\|v\|_{C_b(H,H^*;\max(1,\|\cdot\|_H))} < +\infty$.

For $u \in C^1(H)$ let us define
$$\|u\|_{C_b(H;\max(1,\|\cdot\|_H))} = |u(0)| + \|Du\|_{C_b(H,H^*;\max(1,\|\cdot\|_H))}$$
and $C_b(H; \max(1, \|\cdot\|_H))$ the space of those $u \in C^1(H)$ such that $Du \in C_b(H, H^*; \max(1, \|\cdot\|_H))$.

In Lemma 3 we show that $C_b(H, H^*; \max(1, \|\cdot\|_H))$ and $C_b(H; \max(1, \|\cdot\|_H))$ are Banach spaces.

Definition 1 (Simple Function). $X : \Omega \to N$ is called a simple function if it takes a finite number of values and is $\mathcal{F}$ to $\mathcal{B}(N)$ measurable.
Definition 2 ($P$-Strongly Measurable). $X : \Omega \to N$ is called $P$-strongly measurable if there exists a sequence of simple functions $X_n$ such that $X_n \to X$ a.s. with respect to the probability measure $P$. We identify two random variables if they are almost surely equal.
Definition 3 (Bochner Spaces). Let $1 \le p < +\infty$. We denote by $L^p(\Omega, \mathcal{F}, P; N)$ the Banach space of $P$-strongly measurable functions $X : \Omega \to N$ such that $E\big[\|X\|_N^p\big] < +\infty$. For the theory of Bochner integrals and spaces one may refer to [3].
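A Banach space valued expectation of this kind can be sampled numerically. The sketch below is only a finite-dimensional illustration, with $N = \mathbb{R}^3$ and $p = 2$: the sample mean approximates the Bochner integral componentwise, and the norm inequality $\|E[X]\|_N \le E[\|X\|_N]$ holds for the empirical means.

```python
import numpy as np

# Monte Carlo illustration of a Banach-space-valued expectation with N = R^3:
# the sample mean plays the role of the Bochner integral, and the triangle
# inequality for integrals ||E[X]|| <= E[||X||] is visible in the estimates.
rng = np.random.default_rng(6)
samples = rng.normal(loc=[1.0, -2.0, 0.5], size=(50000, 3))
mean_vec = samples.mean(axis=0)                      # approximates E[X]
mean_norm = np.linalg.norm(samples, axis=1).mean()   # approximates E[||X||_N]
second_moment = (np.linalg.norm(samples, axis=1) ** 2).mean()  # E[||X||^2]
print(mean_vec, mean_norm, second_moment)
```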
Theorem 1.
Let $K$ be a closed and convex subset of $H$. Let
$$G, U_1, L_1, U_2, L_2, U_3, L_3, \cdots$$
be a sequence of random variables such that all are independent and $P$-strongly measurable, $U_1, U_2, \cdots$ are identically distributed, $L_1, L_2, \cdots$ are identically distributed, and we have
$$G \in L^2(\Omega, \mathcal{F}, P; H), \qquad L_1 \in L^2(\Omega, \mathcal{F}, P; \mathcal{L}(H)) \text{ with } E[L_1] = I,$$
$$U_1 \in L^1\big(\Omega, \mathcal{F}, P; C_b(H; \max(1, \|\cdot\|_H))\big),$$
(3.2) $DU_1(f) \in L^2(\Omega, \mathcal{F}, P; H^*)$ for all $f \in H$,
there exists $\Lambda > 0$ such that
(3.3) $E\big[\|DU_1(f) - DU_1(g)\|_{H^*}^2\big] \le \Lambda^2 \|f - g\|_H^2$ for all $f, g \in H$,
let $u = E[U_1]$, and there exists $\lambda > 0$ such that
(3.4) $\langle Du(f) - Du(g), f - g\rangle_{H^*,H} \ge \lambda \|f - g\|_H^2$ for all $f, g \in H$.
Let $f^*$ be the unique minimizer of $u$ in $K$ (see Lemma 6). Let $F_1 = G$ and consider the stochastic gradient descent sequence
$$F_{k+1} = Proj_K\Big(F_k - \eta_k L_k\big(R_H(DU_k(F_k))\big)\Big) \quad \text{for } k = 1, 2, \cdots.$$
Then $F_k \in L^2(\Omega, \mathcal{F}, P; H)$ for $k \ge 1$, and $F_k(\omega) \in K$ for $k \ge 2$ and $\omega \in \Omega$. There exists a harmonically decreasing sequence $\eta_k$ such that asymptotically for large $k$ we have
(3.5) $E\big[\|F_k - f^*\|_H^2\big] \le \Big(o(1) + \dfrac{C}{\lambda^2} E\big[\|R_H DU_1(f^*)\|_{H,E[L^*L]}^2\big]\Big)\dfrac{1}{k},$
here $\|f\|_{H,E[L^*L]}^2 = \big(E[L^*L]f, f\big)_H$ and $C > 0$ is a constant independent of the specific problem data.

Let $A$ be a set, $V$ be a Hilbert space, and denote by $B(A, V)$ the space of uniformly bounded functions $f : A \to V$. Let us define
$$\|f\|_{B(A,V)} = \sup_{x \in A} \|f(x)\|_V.$$
Let $H$ be a Hilbert space of functions $f : A \to V$, continuously embedded in $B(A, V)$, i.e. there exists
$M > 0$ such that
$$\|f\|_{B(A,V)} \le M \|f\|_H \quad \text{for all } f \in H.$$
For $x \in A$ and $y \in V$ let $\Phi(x, y) \in H$ be such that
(3.7) $(\Phi(x, y), \varphi)_H = (y, \varphi(x))_V$ for all $\varphi \in H$.
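The defining identity (3.7) can be checked in a finite-dimensional model. This is an assumed toy setting, not the paper's construction: $H = \{f_c(x) = c \cdot \psi(x)\}$ with inner product $(f_c, f_{c'})_H = c \cdot c'$ for an assumed feature map $\psi$, and $V = \mathbb{R}$; the Riesz representation of $\varphi \mapsto (y, \varphi(x))_V$ then has coefficient vector $y\,\psi(x)$.

```python
import numpy as np

# Finite-dimensional sketch of (3.7).  psi is a toy feature map (assumption);
# H consists of functions f_c(x) = c . psi(x) with (f_c, f_c')_H = c . c'.
def psi(x):
    return np.array([1.0, x, x ** 2, np.sin(x)])

def eval_f(c, x):
    return c @ psi(x)

x0, y0 = 0.7, 2.0
phi_coeff = y0 * psi(x0)              # coefficient vector of Phi(x0, y0)

c = np.array([0.3, -1.0, 0.5, 2.0])   # an arbitrary element of H
lhs = phi_coeff @ c                   # (Phi(x0, y0), f_c)_H
rhs = y0 * eval_f(c, x0)              # (y0, f_c(x0))_V
print(lhs, rhs)
```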
Let us note that $\Phi(x, y)$ is a function in $H$ for each $x \in A$, $y \in V$, and that $\Phi(x, y)$ is linear in $y \in V$.

Let our data be the finite set $(x_i, y_i) \in A \times V$ for $i = 1, \cdots, n$. Let $0 < q < 1$ and let $I : \Omega \to \{0, 1, \cdots, n\}$ be distributed as
$$P(\{I = 0\}) = q \quad \text{and} \quad P(\{I = i\}) = \frac{1-q}{n} \ \text{ for } i = 1, \cdots, n.$$
Let $U = u_I$ where
(3.8) $u_0(f) = \frac{1}{2}\|f\|_H^2$ and $u_i(f) = \frac{1}{2}\|f(x_i) - y_i\|_V^2$ for $i = 1, \cdots, n$.
We compute
$$u(f) = E\big[U(f)\big] = \frac{q}{2}\|f\|_H^2 + \frac{1-q}{2n}\sum_{i=1}^n \|f(x_i) - y_i\|_V^2.$$

Theorem 2.
Let $r \in (0, +\infty]$ and let $f^* \in \overline{B_{r,H}}$ be the unique minimizer of $u$ in $\overline{B_{r,H}}$. Let $I_1, I_2, \cdots$ be independent, each distributed as $I$, taking values in $\{0, 1, \cdots, n\}$. Let $F_1 = 0$ and consider the (stochastic gradient descent) sequence
$$F_{k+1} = \begin{cases} (1 - \eta_k) F_k, & I_k = 0,\\ S_k \tilde F_k, & I_k \in \{1, \cdots, n\},\end{cases}$$
where
(3.9) $\tilde F_k = F_k + \eta_k\,\Phi\big(x_{I_k},\, y_{I_k} - F_k(x_{I_k})\big)$ and $S_k = \min\Big(1, \dfrac{r}{\|\tilde F_k\|_H}\Big).$
For $k \ge 1$, $F_{k+1}$ is a linear combination of $\Phi(x_{I_h}, y_{I_h} - F_h(x_{I_h}))$ for $h = 1, \cdots, k$, and there exists a harmonically decreasing sequence $\eta_k$ such that asymptotically for large $k$ we have
$$E\big[\|F_k - f^*\|_H^2\big] \le \Big(o(1) + \frac{C}{q^2}\,E\big[\|Du_I(f^*)\|_{H^*}^2\big]\Big)\frac{1}{k},$$
here $C > 0$ is a constant independent of the specific problem data and
$$E\big[\|Du_I(f^*)\|_{H^*}^2\big] = q\|f^*\|_H^2 + \frac{1-q}{n}\sum_{i=1}^n \|\Phi(x_i, f^*(x_i) - y_i)\|_H^2.$$

4. Banach Spaces of Differentiable Functions Defined on Hilbert Spaces
For $r > 0$ and $u \in C(B_{r,H})$ let us define
$$\|u\|_{C_b(B_{r,H})} = \sup_{f \in B_{r,H}} |u(f)|$$
and $C_b(B_{r,H})$ the space of those $u \in C(B_{r,H})$ such that $\|u\|_{C_b(B_{r,H})} < +\infty$. One may check that $C_b(B_{r,H})$ endowed with $\|\cdot\|_{C_b(B_{r,H})}$ as norm is a Banach space. Similarly one may define the Banach space $C_b(B_{r,H}, H^*)$ of continuous and bounded functions defined on $B_{r,H}$ with values in $H^*$.

Lemma 1.
For $u \in C_b(H; \max(1, \|\cdot\|_H))$ we have
(4.1) $|u(f)| \le \|u\|_{C_b(H;\max(1,\|\cdot\|_H))} \max(1, \|f\|_H^2)$ for all $f \in H$.
Proof.
For $f \in H$ we compute
$$\begin{aligned}
|u(f)| &\le |u(0)| + |u(f) - u(0)| = |u(0)| + \Big|\int_0^1 \frac{d}{dt} u(tf)\,dt\Big|\\
&= |u(0)| + \Big|\int_0^1 \langle Du(tf), f\rangle_{H^*,H}\,dt\Big| \le |u(0)| + \int_0^1 \|Du(tf)\|_{H^*} \|f\|_H\,dt\\
&\le |u(0)| + \int_0^1 \|Du\|_{C_b(H,H^*;\max(1,\|\cdot\|_H))} \max(1, \|tf\|_H) \|f\|_H\,dt\\
&\le |u(0)| + \|Du\|_{C_b(H,H^*;\max(1,\|\cdot\|_H))} \max(1, \|f\|_H) \|f\|_H\\
&\le \big(|u(0)| + \|Du\|_{C_b(H,H^*;\max(1,\|\cdot\|_H))}\big) \max\big(1, \max(1, \|f\|_H)\|f\|_H\big)\\
&= \|u\|_{C_b(H;\max(1,\|\cdot\|_H))} \max(1, \|f\|_H^2),
\end{aligned}$$
which proves (4.1) and completes the proof of the lemma. $\square$

Corollary 1.
For $r > 0$ and $u \in C_b(H; \max(1, \|\cdot\|_H))$ we have
(4.2) $\|u\|_{C_b(B_{r,H})} \le C_r \|u\|_{C_b(H;\max(1,\|\cdot\|_H))}$
(here $C_r = \max(1, r^2)$).

Lemma 2. $C_b(H, H^*; \max(1, \|\cdot\|_H))$ together with $\|\cdot\|_{C_b(H,H^*;\max(1,\|\cdot\|_H))}$ as norm is a Banach space.

Proof. It is clear that $C_b(H, H^*; \max(1, \|\cdot\|_H))$ is a linear space and that $\|\cdot\|_{C_b(H,H^*;\max(1,\|\cdot\|_H))}$ is a norm. Let us prove that $C_b(H, H^*; \max(1, \|\cdot\|_H))$ together with this norm is complete. Let $v_n \in C_b(H, H^*; \max(1, \|\cdot\|_H))$ be a Cauchy sequence, i.e.
$$\|v_n - v_m\|_{C_b(H,H^*;\max(1,\|\cdot\|_H))} \to 0 \quad \text{as } n, m \to \infty.$$
We should show that there exists $v \in C_b(H, H^*; \max(1, \|\cdot\|_H))$ such that $v_n \to v$ as $n \to \infty$ with respect to the norm $\|\cdot\|_{C_b(H,H^*;\max(1,\|\cdot\|_H))}$.

For each $r > 0$ we have
$$\|v_n - v_m\|_{C_b(B_{r,H},H^*)} \le \max(1, r)\,\|v_n - v_m\|_{C_b(H,H^*;\max(1,\|\cdot\|_H))},$$
thus the restriction of $v_n$ to $B_{r,H}$ is a Cauchy sequence in $C_b(B_{r,H}, H^*)$. It follows that there exists $v \in C(H, H^*)$ such that $v_n \to v$ in $C_b(B_{r,H}, H^*)$ for all $r > 0$.

For $\epsilon > 0$ there exists $N_\epsilon \in \mathbb{N}$ such that if $n, m \ge N_\epsilon$ then
$$\|v_n - v_m\|_{C_b(H,H^*;\max(1,\|\cdot\|_H))} < \epsilon.$$
It follows that for $f \in H$ and $n, m \ge N_\epsilon$ we have
$$\|v_n(f) - v_m(f)\|_{H^*} < \max(1, \|f\|_H)\,\epsilon.$$
Passing to the limit $m \to \infty$ we obtain that for $n \ge N_\epsilon$ and $f \in H$ we have
$$\|v_n(f) - v(f)\|_{H^*} \le \max(1, \|f\|_H)\,\epsilon.$$
This proves that $v \in C_b(H, H^*; \max(1, \|\cdot\|_H))$ and $v_n \to v$ in $C_b(H, H^*; \max(1, \|\cdot\|_H))$. This completes the proof of the Lemma. $\square$

Lemma 3. $C_b(H; \max(1, \|\cdot\|_H))$ together with $\|\cdot\|_{C_b(H;\max(1,\|\cdot\|_H))}$ as norm is a Banach space.

Proof. It is clear that $C_b(H; \max(1, \|\cdot\|_H))$ is a linear space and that $\|\cdot\|_{C_b(H;\max(1,\|\cdot\|_H))}$ is a norm. Let us prove that $C_b(H; \max(1, \|\cdot\|_H))$ together with this norm is complete. Let $u_n \in C_b(H; \max(1, \|\cdot\|_H))$ be a Cauchy sequence, i.e.
$$\lim_{n,m\to\infty} \|u_n - u_m\|_{C_b(H;\max(1,\|\cdot\|_H))} = 0.$$
From Corollary 1 it follows that for all $r > 0$, $u_n$ is a Cauchy sequence in $C_b(B_{r,H})$; hence there exists $u \in C(H)$ such that $u_n \to u$ in $C_b(B_{r,H})$ for each $r > 0$ as $n \to \infty$.

We should show that $u \in C_b(H; \max(1, \|\cdot\|_H))$ and $u_n \to u$ in $C_b(H; \max(1, \|\cdot\|_H))$. From Lemma 2 it follows that there exists $v \in C_b(H, H^*; \max(1, \|\cdot\|_H))$ such that $Du_n \to v$ in $C_b(H, H^*; \max(1, \|\cdot\|_H))$. It remains to show that $u$ is differentiable and $Du = v$.

Step 1.
For $f, \varphi \in H$, $u(f + t\varphi)$ is differentiable in $t \in \mathbb{R}$ at $t = 0$ with differential value $\langle v(f), \varphi\rangle_{H^*,H}$.

For $f, \varphi \in H$ and $t \in \mathbb{R}$ we compute
$$u_n(f + t\varphi) - u_n(f) = \int_0^t \frac{d}{ds} u_n(f + s\varphi)\,ds = \int_0^t \langle Du_n(f + s\varphi), \varphi\rangle_{H^*,H}\,ds.$$
Because $u_n$ is Cauchy in $C_b(H; \max(1, \|\cdot\|_H))$ it is bounded there. We estimate
$$\begin{aligned}
|\langle Du_n(f + s\varphi), \varphi\rangle_{H^*,H}| &\le \|Du_n(f + s\varphi)\|_{H^*} \|\varphi\|_H\\
&\le \|u_n\|_{C_b(H;\max(1,\|\cdot\|_H))} \max(1, \|f + s\varphi\|_H) \|\varphi\|_H\\
&\le C \max(1, \|f + s\varphi\|_H) \|\varphi\|_H.
\end{aligned}$$
Thus by the Lebesgue dominated convergence theorem we have
$$u(f + t\varphi) - u(f) = \int_0^t \langle v(f + s\varphi), \varphi\rangle_{H^*,H}\,ds,$$
and by the continuity of $v$ we obtain
$$t^{-1}\big(u(f + t\varphi) - u(f)\big) \to \langle v(f), \varphi\rangle_{H^*,H} \quad \text{as } t \to 0.$$

Step 2. $u$ is differentiable with $Du = v$. Let $f, \varphi \in H$. Define $\gamma(t) = u(f + t\varphi)$; then by the previous step we have $\gamma'(t) = \langle v(f + t\varphi), \varphi\rangle_{H^*,H}$. We compute
$$\begin{aligned}
u(f + \varphi) - u(f) &= \gamma(1) - \gamma(0) = \int_0^1 \frac{d}{dt}\gamma(t)\,dt = \int_0^1 \langle v(f + t\varphi), \varphi\rangle_{H^*,H}\,dt\\
&= \langle v(f), \varphi\rangle_{H^*,H} + \int_0^1 \langle v(f + t\varphi) - v(f), \varphi\rangle_{H^*,H}\,dt
\end{aligned}$$
and
$$\begin{aligned}
|u(f + \varphi) - u(f) - \langle v(f), \varphi\rangle_{H^*,H}| &\le \int_0^1 |\langle v(f + t\varphi) - v(f), \varphi\rangle_{H^*,H}|\,dt\\
&\le \|\varphi\|_H \int_0^1 \|v(f + t\varphi) - v(f)\|_{H^*}\,dt,
\end{aligned}$$
therefore from the continuity of $v$ the differentiability of $u$ follows with $Du(f) = v(f)$. This completes the proof of the Lemma. $\square$

Lemma 4. $u(f)$ is jointly continuous as a function from $C_b(H; \max(1, \|\cdot\|_H)) \times H$ to $\mathbb{R}$. Similarly, $Du(f)$ is jointly continuous as a function from $C_b(H; \max(1, \|\cdot\|_H)) \times H$ to $H^*$.

Proof. Let $u_n, u \in C_b(H; \max(1, \|\cdot\|_H))$ and $f_n, f \in H$ for $n \in \mathbb{N}$ be such that $u_n \to u$ in $C_b(H; \max(1, \|\cdot\|_H))$ and $f_n \to f$ in $H$. We may assume that $\|f_n\|_H \le \|f\|_H + 1 = r$ for all $n \ge 1$. We have
$$|u_n(f_n) - u(f)| \le |(u_n - u)(f_n)| + |u(f_n) - u(f)|.$$
By the continuity of $u$ we have that $|u(f_n) - u(f)| \to 0$.
Using (4.1) we estimate
$$|(u_n - u)(f_n)| \le \|u_n - u\|_{C_b(H;\max(1,\|\cdot\|_H))} \max(1, \|f_n\|_H^2) \le \|u_n - u\|_{C_b(H;\max(1,\|\cdot\|_H))} \max(1, r^2),$$
which converges to $0$ as $n \to \infty$, and this proves the continuity of $u(f)$. Similarly we prove the continuity of $Du(f)$. $\square$

5. Well Known Results About Projection and Constrained Convex Minimization in Hilbert Spaces
Lemma 5 (Projection on a Closed and Convex Subset in a Hilbert Space). Let $K$ be a closed and convex subset of $H$ and let $f \in H$. Then there exists a unique $g \in K$ such that $\|f - g\|_H \le \|f - h\|_H$ for all $h \in K$. Let us denote $g = Proj_K(f)$. We also have the following properties:
(5.1) $(f - Proj_K(f), h - Proj_K(f))_H \le 0$ for all $h \in K$
and
(5.2) $\|Proj_K(f_1) - Proj_K(f_2)\|_H \le \|f_1 - f_2\|_H$.

Lemma 6 (Convex Constrained Minimization of a Strictly Convex and Regular Functional). Let $u \in C_b(H; \max(1, \|\cdot\|_H))$ and let $K$ be a closed and convex subset of $H$. Assume there exist $0 < \lambda \le \Lambda$ such that
$$\|Du(f) - Du(g)\|_{H^*} \le \Lambda \|f - g\|_H$$
and
$$\langle Du(f) - Du(g), f - g\rangle_{H^*,H} \ge \lambda \|f - g\|_H^2$$
for all $f, g \in H$. Then there exists a unique $f^* \in K$ such that $u(f^*) \le u(g)$ for all $g \in K$. Also we have
(5.3) $\langle Du(f^*), g - f^*\rangle_{H^*,H} \ge 0$ for all $g \in K$.

6. Some Facts About Banach Space Valued Random Variables
Definition 4 (Conditional Expectation). Let $X \in L^1(\Omega, \mathcal{F}, P; N)$ and let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. We say $Y \in L^1(\Omega, \mathcal{G}, P; N)$ is the conditional expectation of $X$ with respect to the $\sigma$-algebra $\mathcal{G}$ if
$$E\big[X \chi_A\big] = E\big[Y \chi_A\big] \quad \text{for all } A \in \mathcal{G}.$$
Here $\chi_A$ is the characteristic function of the set $A$. The conditional expectation exists and is unique.

All ordinary results regarding expectations and conditional expectations which are compatible with the structure of Banach spaces for the values of the random variables hold. For example, if $\mathcal{G}_1$ and $\mathcal{G}_2$ are two sub-$\sigma$-algebras of $\mathcal{F}$, independent with respect to the probability measure $P$, and $X \in L^1(\Omega, \mathcal{G}_1, P; N)$, then $E[X \mid \mathcal{G}_2] = E[X]$.

Lemma 7 (Independence Lemma). Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $X$ be a ($P$-strongly measurable) random variable with values in the Banach space $N_1$ and $Y$ a ($P$-strongly measurable) random variable with values in the Banach space $N_2$. Let $\mathcal{G}$ be a sub-$\sigma$-algebra of $\mathcal{F}$. Assume $X$ is $\mathcal{G}$ measurable and $Y$ is independent of $\mathcal{G}$. Let $N_3$ be a Banach space. Let $f \in C(N_1 \times N_2, N_3)$ be such that
$$E\big[\|f(x, Y)\|_{N_3}\big] < +\infty \ \text{ for all } x \in N_1 \quad \text{and} \quad E\big[\|f(X, Y)\|_{N_3}\big] < +\infty.$$
For each $x \in N_1$ let us define $g(x) = E\big[f(x, Y)\big]$. Then we have
$$g(X) = E\big[f(X, Y) \mid \mathcal{G}\big] \quad a.s.$$

Lemma 8 (Expectation of Operator Action on Random Variable). Let $\ell : N_1 \to N_2$ be a bounded linear operator between Banach spaces. Let $X \in L^1(\Omega, \mathcal{F}, P; N_1)$. Then we have
$$E\big[\ell X \mid \mathcal{G}\big] = \ell E[X \mid \mathcal{G}].$$

Proof.
First let us prove the desired equation in the non-conditional and simple $X$ case. We compute
$$E[\ell X] = E\Big[\ell \sum_{i=1}^m x_i \chi_{A_i}\Big] = \sum_{i=1}^m E\big[\ell x_i \chi_{A_i}\big] = \sum_{i=1}^m \ell x_i P(A_i) = \ell \sum_{i=1}^m x_i P(A_i) = \ell E\Big[\sum_{i=1}^m x_i \chi_{A_i}\Big] = \ell E[X].$$
Now we consider the non-conditional and general $X$ case. Let $X_n$ be simple with $X_n \to X$ $P$-almost surely and $\|X_n\|_{N_1} \le 2\|X\|_{N_1}$. Then, using the boundedness of $\ell$, we may pass to the limit on both sides of $E[\ell X_n] = \ell E[X_n]$ and obtain the desired equation for $X$.

Now let us consider the conditional and general $X$ case. For $A \in \mathcal{G}$ we compute
$$E\big[E[\ell X \mid \mathcal{G}] \chi_A\big] = E\big[\ell X \chi_A\big] = \ell E\big[X \chi_A\big] = \ell E\big[E[X \mid \mathcal{G}] \chi_A\big] = E\big[\ell E[X \mid \mathcal{G}] \chi_A\big],$$
which by the arbitrariness of $A \in \mathcal{G}$ proves the desired equation. $\square$

Lemma 9.
For $U \in L^1\big(\Omega, \mathcal{F}, P; C_b(H; \max(1, \|\cdot\|_H))\big)$ let us define $u = E[U]$. We have $u \in C_b(H; \max(1, \|\cdot\|_H))$ and for all $f \in H$ we have
(6.1) $u(f) = E\big[U(f)\big]$ and $Du(f) = E\big[DU(f)\big]$.

Proof.
The equations in (6.1) follow from Lemma 8. $\square$

7. Stochastic Gradient Descent in Function Spaces (Proof of Theorem 1)
Lemma 10.
Let $L \in L^2(\Omega, \mathcal{F}, P; \mathcal{L}(H))$ and assume $E[L]$ is injective with a bounded inverse defined on its image. For $f, g \in H$ let us define
$$(f, g)_{H,E[L^*L]} = \big(E[L^*L] f, g\big)_H.$$
Then $(\cdot,\cdot)_{H,E[L^*L]}$ is a real inner product on $H$ and the associated norm is equivalent to the original norm on $H$. We have
(7.1) $\rho\big(E[L^*L]\big) = \sup_{f \in H,\,\|f\|_H \le 1} \big(E[L^*L] f, f\big)_H \le \|L\|^2_{L^2(\Omega,\mathcal{F},P;\mathcal{L}(H))} < +\infty.$
If $E[L] = I$ then
(7.2) $\big(E[L^*L] f, f\big)_H \ge \|f\|_H^2$
and in particular
(7.3) $\rho\big(E[L^*L]\big) \ge 1.$

Proof.
It is clear that $(\cdot,\cdot)_{H,E[L^*L]}$ defines a bilinear form on $H$. For $f, g \in H$ let us compute
(7.4) $(f, g)_{H,E[L^*L]} = \big(E[L^*L] f, g\big)_H = E\big[(L^*L f, g)_H\big] = E\big[(Lf, Lg)_H\big].$
It follows that the bilinear form $(\cdot,\cdot)_{H,E[L^*L]}$ is symmetric. By (7.4), for $f \in H$ we have
(7.5) $(f, f)_{H,E[L^*L]} = E\big[(Lf, Lf)_H\big] = E\big[\|Lf\|_H^2\big].$
By (7.5), for all $f \in H$ we have $(f, f)_{H,E[L^*L]} \ge 0$. If for some $f \in H$ we have $(f, f)_{H,E[L^*L]} = 0$, then from (7.5) it follows that $Lf = 0$ almost surely. Taking the expectation we compute
$$0 = E[Lf] = E[L] f,$$
and because $E[L]$ is injective we obtain $f = 0$, which proves that $(\cdot,\cdot)_{H,E[L^*L]}$ is an inner product on $H$.

Now let us show that the associated norm is equivalent to the norm in $H$. For $f \in H$ we compute
(7.6) $\|f\|^2_{H,E[L^*L]} = \big(E[L^*L] f, f\big)_H = E\big[\|Lf\|_H^2\big] \le E\big[\|L\|^2_{\mathcal{L}(H)}\big] \|f\|_H^2.$
Let $A \subset H$ be the image of $E[L]$ and let $T : A \to H$ be the bounded inverse of $E[L]$. Thus there exists $C_1 > 0$ such that $\|Tg\|_H \le C_1 \|g\|_H$ for all $g \in A$. Taking $f \in H$ and $g = E[L] f$ we obtain $\|f\|_H \le C_1 \|E[L] f\|_H$ for all $f \in H$.

We compute
$$E[L^*L] = E\big[(L - E[L])^* (L - E[L])\big] + E[L^*] E[L],$$
thus for $f \in H$ we have
(7.7) $\|f\|^2_{H,E[L^*L]} = E\big[\|(L - E[L]) f\|_H^2\big] + \|E[L] f\|_H^2 \ge \|E[L] f\|_H^2 \ge C_1^{-2} \|f\|_H^2.$
From (7.6) and (7.7) it follows that $\|\cdot\|_{H,E[L^*L]}$ is equivalent to $\|\cdot\|_H$. From (7.6) the inequality (7.1) follows. From $E[L] = I$ and (7.7) the inequality (7.2) follows, and this completes the proof of the Lemma. $\square$

Proof of Theorem 1.
By considering $\tilde u(f) = u(f + f^*)$ we may assume that $f^* = 0 \in K$. In particular $0 = Proj_K(0)$ and for $g \in H$
(7.8) $\|Proj_K(g)\|_H = \|Proj_K(g) - Proj_K(0)\|_H \le \|g - 0\|_H = \|g\|_H.$
By definition $F_1 = G$ is $\sigma(G)$-measurable and $P$-strongly measurable, and for $k \ge 2$, $F_k$ is $\mathcal{F}_{k-1} = \sigma\big(G, (U_1, L_1), \cdots, (U_{k-1}, L_{k-1})\big)$-measurable and $P$-strongly measurable.

Let us consider the decomposition
$$R_H(DU_k(F_k)) = A_k + B_k$$
where
$$A_k = R_H(DU_k(F_k)) - R_H(DU_k(0)) \quad \text{and} \quad B_k = R_H(DU_k(0)).$$
Using (7.8) and Young's inequality we estimate
(7.9)
$$\begin{aligned}
\|F_{k+1}\|_H^2 &= \big\|Proj_K\big(F_k - \eta_k L_k(R_H(DU_k(F_k)))\big)\big\|_H^2 \le \big\|F_k - \eta_k L_k(A_k + B_k)\big\|_H^2\\
&= \|F_k\|_H^2 + \eta_k^2 \|L_k A_k\|_H^2 + \eta_k^2 \|L_k B_k\|_H^2 - 2\eta_k (F_k, L_k A_k)_H - 2\eta_k (F_k, L_k B_k)_H + 2\eta_k^2 (L_k A_k, L_k B_k)_H\\
&\le \|F_k\|_H^2 + 2\eta_k^2 \|L_k A_k\|_H^2 + 2\eta_k^2 \|L_k B_k\|_H^2 - 2\eta_k (F_k, L_k A_k)_H - 2\eta_k (F_k, L_k B_k)_H.
\end{aligned}$$
By induction we may assume that $F_k \in L^2(\Omega, \mathcal{F}, P; H)$ and we should show that $F_{k+1} \in L^2(\Omega, \mathcal{F}, P; H)$.

For $M > 0$ let us define $\phi_M : [0, +\infty) \to [0, 1]$ by
$$\phi_M(r) = \big(1 - (r - M)^+\big)^+ = \begin{cases} 1, & 0 \le r \le M,\\ 1 - (r - M), & M < r \le M + 1,\\ 0, & r > M + 1.\end{cases}$$
Let $M > 0$; using the independence Lemma 7 and the inequality (3.3) we compute
$$\begin{aligned}
E\big[\|A_k\|_H^2\,\phi_M(\|A_k\|_H)\big] &= E\big[\|DU_k(F_k) - DU_k(0)\|_{H^*}^2\,\phi_M\big(\|DU_k(F_k) - DU_k(0)\|_{H^*}\big)\big]\\
&= E\Big[E\big[\|DU_k(f_k) - DU_k(0)\|_{H^*}^2\,\phi_M\big(\|DU_k(f_k) - DU_k(0)\|_{H^*}\big)\big]\Big|_{f_k = F_k}\Big]\\
&\le \Lambda^2 E\big[\|F_k\|_H^2\big] < +\infty.
\end{aligned}$$
Now, passing $M \to \infty$ by the monotone convergence theorem, we obtain
(7.10) $E\big[\|A_k\|_H^2\big] \le \Lambda^2 E\big[\|F_k\|_H^2\big] < +\infty.$

For $M > 0$ and $f \in H$ we compute
(7.11) $\big(E\big[L^*L\,\phi_M(\|L\|_{\mathcal{L}(H)})\big] f, f\big)_H = E\big[\|Lf\|_H^2\,\phi_M(\|L\|_{\mathcal{L}(H)})\big] \le E\big[\|Lf\|_H^2\big] = \big(E[L^*L] f, f\big)_H.$
Let $\mathcal{G}_k = \sigma(\mathcal{F}_{k-1}, \sigma(U_k))$ and $M > 0$; using the independence Lemma 7 and the inequalities (7.10) and (7.11) we compute
$$\begin{aligned}
E\big[\|L_k A_k\|_H^2\,\phi_M(\|L_k\|_{\mathcal{L}(H)})\,\phi_M(\|A_k\|_H)\big] &= E\Big[\big(E\big[L^*L\,\phi_M(\|L\|_{\mathcal{L}(H)})\big] a_k, a_k\big)_H\,\phi_M(\|a_k\|_H)\Big|_{a_k = A_k}\Big]\\
&\le E\big[\big(E[L^*L] A_k, A_k\big)_H\big] = E\big[\|A_k\|^2_{H,E[L^*L]}\big]\\
&\le \rho\big(E[L^*L]\big) E\big[\|A_k\|_H^2\big] \le \rho\big(E[L^*L]\big) \Lambda^2 E\big[\|F_k\|_H^2\big] < +\infty.
\end{aligned}$$
Now, passing $M \to \infty$ by the monotone convergence theorem, we obtain
(7.12) $E\big[\|L_k A_k\|_H^2\big] \le \rho\big(E[L^*L]\big) \Lambda^2 E\big[\|F_k\|_H^2\big] < +\infty.$

Let $M > 0$; using the independence Lemma 7 and (3.2), conditioning on $\sigma(U_k)$ in the same manner gives
$$E\big[\|L_k B_k\|_H^2\,\phi_M(\|L_k\|_{\mathcal{L}(H)})\,\phi_M(\|B_k\|_H)\big] \le E\big[\big(E[L^*L] R_H(DU_k(0)), R_H(DU_k(0))\big)_H\big] = E\big[\|R_H DU_1(0)\|^2_{H,E[L^*L]}\big] \le \rho\big(E[L^*L]\big) E\big[\|DU_1(0)\|_{H^*}^2\big] < +\infty.$$
Now, passing $M \to \infty$ by the monotone convergence theorem, we obtain
(7.13) $E\big[\|L_k B_k\|_H^2\big] \le \rho\big(E[L^*L]\big) E\big[\|DU_1(0)\|_{H^*}^2\big] < +\infty.$
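The truncation $\phi_M$ used throughout these estimates is elementary; a direct transcription (a convenience for the reader, not part of the proof):

```python
# The cutoff phi_M from the proof: equal to 1 on [0, M], decaying linearly
# to 0 on (M, M+1], and vanishing beyond M+1.
def phi_M(M, r):
    return max(0.0, 1.0 - max(0.0, r - M))

print([phi_M(2.0, r) for r in (1.0, 2.0, 2.5, 3.0, 4.0)])
```

Multiplying an integrand by $\phi_M$ of its norm truncates it to a bounded quantity, and letting $M \to \infty$ recovers the full expectation by monotone convergence.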
For $M > 0$, using the independence Lemma 7 we compute
(7.14)
$$\begin{aligned}
E\big[|(F_k, L_k A_k)_H|\,&\phi_M(\|F_k\|_H)\,\phi_M(\|L_k\|_{\mathcal{L}(H)})\,\phi_M(\|A_k\|_H)\big]\\
&\le E\big[\|F_k\|_H \|L_k\|_{\mathcal{L}(H)} \|A_k\|_H\,\phi_M(\|F_k\|_H)\,\phi_M(\|L_k\|_{\mathcal{L}(H)})\,\phi_M(\|A_k\|_H)\big]\\
&= E\big[\|L_k\|_{\mathcal{L}(H)}\,\phi_M(\|L_k\|_{\mathcal{L}(H)})\big]\,E\big[\|A_k\|_H \|F_k\|_H\,\phi_M(\|F_k\|_H)\,\phi_M(\|A_k\|_H)\big]\\
&\le E\big[\|L_1\|_{\mathcal{L}(H)}\big]\,E\big[\|A_k\|_H \|F_k\|_H\big]\\
&\le \big(E\big[\|L_1\|^2_{\mathcal{L}(H)}\big]\big)^{\frac12}\big(E\big[\|A_k\|_H^2\big]\big)^{\frac12}\big(E\big[\|F_k\|_H^2\big]\big)^{\frac12} < +\infty.
\end{aligned}$$
Now, passing $M \to \infty$ by the monotone convergence theorem, we obtain
(7.15) $E\big[|(F_k, L_k A_k)_H|\big] < +\infty.$
Using the independence Lemma 7 and $E[L]=I$ we compute
\begin{align*}
E\big[(F_k,L_kA_k)_H\big]&=E\big[E\big[(L_kA_k,F_k)_H\,\big|\,G_k\big]\big]\tag{7.16}\\
&=E\big[E\big[(L_ka_k,f_k)_H\big]\,\big|\,a_k=A_k,\,f_k=F_k\big]\\
&=E\big[\big(E[L_k]a_k,f_k\big)_H\,\big|\,a_k=A_k,\,f_k=F_k\big]\\
&=E\big[\big(E[L]A_k,F_k\big)_H\big]=E\big[(A_k,F_k)_H\big]\\
&=E\big[\big(R_H(DU_k(F_k))-R_H(DU_k(0)),F_k\big)_H\big]\\
&=E\big[\big\langle DU_k(F_k)-DU_k(0),F_k\big\rangle_{H^*,H}\big]\\
&=E\big[E\big[\big\langle DU_k(F_k)-DU_k(0),F_k\big\rangle_{H^*,H}\,\big|\,\sigma(F_k)\big]\big]\\
&=E\big[E\big[\big\langle DU_k(f_k)-DU_k(0),f_k\big\rangle_{H^*,H}\big]\,\big|\,f_k=F_k\big]\\
&=E\big[\big\langle Du(f_k)-Du(0),f_k\big\rangle_{H^*,H}\,\big|\,f_k=F_k\big]\\
&=E\big[\big\langle Du(F_k)-Du(0),F_k\big\rangle_{H^*,H}\big]\\
&\ge E\big[\lambda\|F_k\|_H^2\big]=\lambda E\big[\|F_k\|_H^2\big].
\end{align*}
Using Lemma 8 we compute
\begin{equation}
E[B_k]=E[R_H(DU_k(0))]=E[R_H(DU(0))]=R_HE[DU(0)]=R_H(Du(0)).\tag{7.17}
\end{equation}
Using a similar computation as in (7.14) we obtain
\begin{equation}
E\big[\big|(F_k,L_kB_k)_H\big|\big]<+\infty.\tag{7.18}
\end{equation}
Using the independence Lemma 7, $E[L]=I$, (7.17) and (5.3) we compute
\begin{align*}
E\big[(F_k,L_kB_k)_H\big]&=E\big[E\big[(F_k,L_kB_k)_H\,\big|\,\sigma(F_k)\big]\big]\tag{7.19}\\
&=E\big[E\big[(f_k,L_kB_k)_H\big]\,\big|\,f_k=F_k\big]\\
&=E\big[\big(f_k,E[L_kB_k]\big)_H\,\big|\,f_k=F_k\big]=E\big[\big(F_k,E[L_kB_k]\big)_H\big]\\
&=E\big[\big(F_k,E\big[E[L_kB_k\,|\,\sigma(U_k)]\big]\big)_H\big]\\
&=E\big[\big(F_k,E\big[E[L_kb_k]\,\big|\,b_k=B_k\big]\big)_H\big]\\
&=E\big[\big(F_k,E\big[E[L_k]b_k\,\big|\,b_k=B_k\big]\big)_H\big]\\
&=E\big[\big(F_k,E\big[E[L]b_k\,\big|\,b_k=B_k\big]\big)_H\big]\\
&=E\big[\big(F_k,E\big[b_k\,\big|\,b_k=B_k\big]\big)_H\big]=E\big[\big(F_k,E[B_k]\big)_H\big]\\
&=\big(E[F_k],E[B_k]\big)_H=\big(R_H(Du(0)),E[F_k]\big)_H\\
&=\big\langle Du(0),E[F_k]\big\rangle_{H^*,H}=\big\langle Du(0),E[F_k]-0\big\rangle_{H^*,H}\\
&=E\big[\big\langle Du(0),F_k-0\big\rangle_{H^*,H}\big]\ge 0.
\end{align*}
Taking the expectation in (7.9) and using (7.12), (7.13), (7.16) and (7.19) we obtain
\begin{equation}
E\big[\|F_{k+1}\|_H^2\big]\le\big(1-2\lambda\eta_k+2\Lambda^2\rho\big(E[L^*L]\big)\eta_k^2\big)E\big[\|F_k\|_H^2\big]+2E\big[\|R_HDU(0)\|_{H,E[L^*L]}^2\big]\eta_k^2.\tag{7.20}
\end{equation}
Let $s>1$, $b=2\rho\big(E[L^*L]\big)\big(\tfrac{\Lambda}{\lambda}\big)^2s^2$ and $\eta_k=\tfrac{s}{\lambda(b+k)}$ for $k\in\mathbb{N}$. By our choice of $\eta_k$ we have $\eta_k\le\lambda\big(2\Lambda^2\rho\big(E[L^*L]\big)\big)^{-1}$ and thus we have
\begin{equation}
1-2\lambda\eta_k+2\Lambda^2\rho\big(E[L^*L]\big)\eta_k^2\le 1-\lambda\eta_k\le e^{-\lambda\eta_k}.\tag{7.21}
\end{equation}
From (7.20) and (7.21) we obtain
\[
E\big[\|F_{k+1}\|_H^2\big]\le E\big[\|F_k\|_H^2\big]e^{-\lambda\eta_k}+2E\big[\|R_HDU(0)\|_{H,E[L^*L]}^2\big]\eta_k^2
\]
and by iteration we have
\begin{equation}
E\big[\|F_n\|_H^2\big]\le E\big[\|F_1\|_H^2\big]e^{-\lambda\sum_{i=1}^{n-1}\eta_i}+2E\big[\|R_HDU(0)\|_{H,E[L^*L]}^2\big]\sum_{k=1}^{n-1}\eta_k^2e^{-\lambda\sum_{i=k+1}^{n-1}\eta_i}.\tag{7.22}
\end{equation}
By our choice of $\eta_k$ one may see that we have
\begin{equation}
e^{-\lambda\sum_{i=1}^{n-1}\eta_i}\le\Big(\frac{b+1}{b+n}\Big)^s\tag{7.23}
\end{equation}
and
\begin{equation}
\sum_{k=1}^{n-1}\eta_k^2e^{-\lambda\sum_{i=k+1}^{n-1}\eta_i}\le\Big(\frac{s}{\lambda}\Big)^2\Big(1+\frac{2}{b}\Big)^s\frac{s}{s-1}\frac{1}{n+b}.\tag{7.24}
\end{equation}
Using (7.3) we have the estimate
\[
b=2\rho\big(E[L^*L]\big)\Big(\frac{\Lambda}{\lambda}\Big)^2s^2\ge 2s^2.
\]
We obtain the estimate
\[
\Big(1+\frac{2}{b}\Big)^s\le\Big(1+\frac{2}{2s^2}\Big)^s=\Big(1+\frac{1}{s^2}\Big)^s<e.
\]
From (7.22), (7.23) and (7.24) the result of the theorem follows. $\square$
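The two elementary estimates above can be spot-checked numerically. The following sketch (not part of the proof) assumes the step-size schedule $\lambda\eta_i=s/(b+i)$ as chosen above; the particular values of $s$, $b$ and $n$ are illustrative.

```python
import math

# Check (7.23): with lambda*eta_i = s/(b+i), the decay factor exp(-lambda*sum eta_i)
# is dominated by ((b+1)/(b+n))^s, since sum_{i=1}^{n-1} 1/(b+i) >= log((b+n)/(b+1)).
for s, b, n in [(2.0, 10.0, 50), (1.5, 4.0, 200)]:
    lhs = math.exp(-s * sum(1.0 / (b + i) for i in range(1, n)))
    rhs = ((b + 1) / (b + n)) ** s
    assert lhs <= rhs

# Check the elementary estimate (1 + 1/s^2)^s < e used to bound (1 + 2/b)^s.
for s in [1.1, 2.0, 5.0, 50.0]:
    assert (1 + 1 / s**2) ** s < math.e
```

The first comparison is the integral test for the harmonic tail; the second follows from $(1+1/s^2)^s\le e^{1/s}<e$.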
8. Application in Neural Networks (Proof of Theorem 2)
Lemma 11. $B(A,V)$ equipped with the norm $\|\cdot\|_{B(A,V)}$ is a Banach space.

Proof. Let $f_n$ be a Cauchy sequence. There exists $k\in\mathbb{N}$ such that if $n,m\ge k$ then $\|f_n-f_m\|_{B(A,V)}\le 1$. Hence for $n\ge k$ we have $\|f_n\|_{B(A,V)}\le\|f_n-f_k\|_{B(A,V)}+\|f_k\|_{B(A,V)}\le 1+\|f_k\|_{B(A,V)}$, thus the $f_n$ are uniformly bounded.
Let $x\in A$. We have $\|f_n(x)-f_m(x)\|_V\le\|f_n-f_m\|_{B(A,V)}\to 0$, so $f_n(x)$ is a Cauchy sequence in $V$. Therefore there exists $y\in V$ such that $f_n(x)\to y$. As $y$ depends on $x$, let us define $f:A\to V$ by $f(x)=y$.
We have $\|f(x)\|_V=\lim_{n\to\infty}\|f_n(x)\|_V\le\sup_n\|f_n\|_{B(A,V)}<+\infty$, thus $f\in B(A,V)$.
For each $\epsilon>0$ there exists $N_\epsilon$ such that if $n,m\ge N_\epsilon$ then for all $x\in A$
\[
\|f_n(x)-f_m(x)\|_V\le\|f_n-f_m\|_{B(A,V)}<\epsilon.
\]
Passing to the limit as $m\to\infty$ in $\|f_n(x)-f_m(x)\|_V$ we obtain that $\|f_n(x)-f(x)\|_V\le\epsilon$ for $n\ge N_\epsilon$, and by the arbitrariness of $x$ it follows that $f_n\to f$ in $B(A,V)$. And this proves the lemma. $\square$
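As a numerical illustration of the completeness argument (not part of the proof), the following sketch takes $A=[0,2\pi]$ and $V=\mathbb{R}$ and checks, on a sample grid, that a sup-norm Cauchy sequence of partial sums stays uniformly within the tail bound of a proxy for its limit. The series and the grid are illustrative choices.

```python
import math

# Partial sums f_N(x) = sum_{j<=N} cos(j x)/j^2 form a Cauchy sequence in the
# sup norm of B([0, 2*pi], R): ||f_n - f_m||_B <= sum_{j=n+1}^m 1/j^2 for m > n.
def f(N, x):
    return sum(math.cos(j * x) / j**2 for j in range(1, N + 1))

grid = [k * 0.01 for k in range(629)]   # sample points of A = [0, 2*pi]
n, N = 10, 2000                         # f_N serves as a proxy for the limit f
tail = sum(1.0 / j**2 for j in range(n + 1, N + 1))
sup_err = max(abs(f(n, x) - f(N, x)) for x in grid)
assert sup_err <= tail + 1e-12          # uniform (sup-norm) closeness, as in the lemma
```

The assertion holds on any grid because the bound $|f_n(x)-f_N(x)|\le\sum_{j=n+1}^{N}1/j^2$ is pointwise in $x$.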
For $x\in A$ and $y\in V$ we define $\ell(x,y)\in H^*$ by
\[
\langle\ell(x,y),\varphi\rangle_{H^*,H}=(y,\varphi(x))_V\quad\text{for all }\varphi\in H.
\]
Let us note that $\ell(x,y)$ is linear in $y\in V$.
Let us denote $R_H(\ell(x,y))=\Phi(x,y)\in H$.
Let us note that $\Phi(x,y)$ is a function in $H$ for each $x\in A$ and $y\in V$. Let us also note that $\Phi(x,y)$ is linear in $y\in V$.
By the definition of $\ell(x,y)$ and $\Phi(x,y)$ the equation (3.7) holds.
We compute
\begin{equation}
\|\ell(x,y)\|_{H^*}=\sup_{\|\varphi\|_H\le 1}\langle\ell(x,y),\varphi\rangle_{H^*,H}=\sup_{\|\varphi\|_H\le 1}(y,\varphi(x))_V\le\|y\|_V\sup_{\|\varphi\|_H\le 1}\|\varphi(x)\|_V\le M\|y\|_V.\tag{8.1}
\end{equation}
Lemma 12.
We have $u_0\in C_b^1(H;\max(1,\|\cdot\|_H))$ with
\begin{equation}
Du_0(f)=R_H^{-1}f\quad\text{for all }f\in H.\tag{8.2}
\end{equation}
Proof.
It is clear that $u_0\in C^1(H)$ with
\[
\langle Du_0(f),\varphi\rangle_{H^*,H}=(f,\varphi)_H\quad\text{for all }f,\varphi\in H.
\]
Thus $Du_0(f)=R_H^{-1}f$ and $\|Du_0(f)\|_{H^*}=\|R_H^{-1}f\|_{H^*}=\|f\|_H$, from which it follows that $Du_0\in C_b(H,H^*;\max(1,\|\cdot\|_H))$ and thus $u_0\in C_b^1(H;\max(1,\|\cdot\|_H))$. $\square$
Lemma 13.
For $i=1,\cdots,n$ we have $u_i\in C_b^1(H;\max(1,\|\cdot\|_H))$ with
\begin{equation}
Du_i(f)=\ell(x_i,f(x_i)-y_i)\quad\text{for all }f\in H.\tag{8.3}
\end{equation}
Proof.
From the embedding inequality (3.7) it follows that $u_i\in C(H)$.
For $f,\varphi\in H$ we have
\[
u_i(f+\varphi)-u_i(f)-\big(f(x_i)-y_i,\varphi(x_i)\big)_V=\frac12\|\varphi(x_i)\|_V^2.
\]
From this and the embedding inequality (3.7) it follows that $u_i$ is differentiable with differential as in (8.3).
From (8.3) and the embedding inequality (3.7) we obtain that $u_i\in C^1(H,H^*)$. Using (8.1) we estimate
\[
\|Du_i(f)\|_{H^*}=\|\ell(x_i,f(x_i)-y_i)\|_{H^*}\le M\|f(x_i)-y_i\|_V\le M\big(\|f(x_i)\|_V+\|y_i\|_V\big)\le M\big(M\|f\|_H+\|y_i\|_V\big),
\]
from which it follows that $Du_i\in C_b(H,H^*;\max(1,\|\cdot\|_H))$ and therefore $u_i\in C_b^1(H;\max(1,\|\cdot\|_H))$. $\square$
Lemma 14.
For all $f,g\in H$ we have
\[
E\big[\|Du_I(f)-Du_I(g)\|_{H^*}\big]\le\big(q+(1-q)M^2\big)\|f-g\|_H.
\]
Proof.
Using (8.1) we compute
\begin{align*}
E\big[\|Du_I(f)-Du_I(g)\|_{H^*}\big]&=q\|Du_0(f)-Du_0(g)\|_{H^*}+(1-q)\frac1n\sum_{i=1}^n\|Du_i(f)-Du_i(g)\|_{H^*}\\
&=q\|R_H^{-1}f-R_H^{-1}g\|_{H^*}+(1-q)\frac1n\sum_{i=1}^n\|\ell(x_i,f(x_i)-y_i)-\ell(x_i,g(x_i)-y_i)\|_{H^*}\\
&=q\|f-g\|_H+(1-q)\frac1n\sum_{i=1}^n\|\ell(x_i,f(x_i)-g(x_i))\|_{H^*}\\
&\le q\|f-g\|_H+(1-q)\frac1n\sum_{i=1}^nM\|f(x_i)-g(x_i)\|_V\\
&\le q\|f-g\|_H+(1-q)\frac1n\sum_{i=1}^nM^2\|f-g\|_H=\big(q+(1-q)M^2\big)\|f-g\|_H,
\end{align*}
which proves the Lemma. $\square$
Lemma 15.
We have
\begin{equation}
\langle Du(f)-Du(g),f-g\rangle_{H^*,H}\ge q\|f-g\|_H^2\quad\text{for all }f,g\in H.\tag{8.4}
\end{equation}
Proof.
For $f\in H$ we compute
\[
Du(f)=E\big[Du_I(f)\big]=qDu_0(f)+(1-q)\frac1n\sum_{i=1}^nDu_i(f)=qR_H^{-1}f+(1-q)\frac1n\sum_{i=1}^n\ell(x_i,f(x_i)-y_i).
\]
Using this, for $f,g\in H$ we compute
\begin{align*}
Du(f)-Du(g)&=qR_H^{-1}f+(1-q)\frac1n\sum_{i=1}^n\ell(x_i,f(x_i)-y_i)-\Big(qR_H^{-1}g+(1-q)\frac1n\sum_{i=1}^n\ell(x_i,g(x_i)-y_i)\Big)\\
&=qR_H^{-1}f+(1-q)\frac1n\sum_{i=1}^n\ell(x_i,f(x_i))-\Big(qR_H^{-1}g+(1-q)\frac1n\sum_{i=1}^n\ell(x_i,g(x_i))\Big)\\
&=qR_H^{-1}(f-g)+(1-q)\frac1n\sum_{i=1}^n\ell(x_i,f(x_i)-g(x_i)).
\end{align*}
Using this, for $f,g\in H$ we obtain
\begin{align*}
\langle Du(f)-Du(g),f-g\rangle_{H^*,H}&=\Big\langle qR_H^{-1}(f-g)+(1-q)\frac1n\sum_{i=1}^n\ell(x_i,f(x_i)-g(x_i)),f-g\Big\rangle_{H^*,H}\\
&=q\big\langle R_H^{-1}(f-g),f-g\big\rangle_{H^*,H}+(1-q)\frac1n\sum_{i=1}^n\big\langle\ell(x_i,f(x_i)-g(x_i)),f-g\big\rangle_{H^*,H}\\
&=q\|f-g\|_H^2+(1-q)\frac1n\sum_{i=1}^n\|f(x_i)-g(x_i)\|_V^2\ge q\|f-g\|_H^2,
\end{align*}
which proves the Lemma. $\square$

Lemma 16. $B_{r,H}$ is a closed and convex subset of $H$ and for $f\in H$ we have
\[
\operatorname{Proj}_{B_{r,H}}(f)=\frac{f}{\max\big(1,\frac{\|f\|_H}{r}\big)}=\begin{cases}f,&\|f\|_H\le r,\\ r\|f\|_H^{-1}f,&\|f\|_H>r.\end{cases}
\]
Proof. $B_{r,H}$, being a closed ball in $H$, is closed and convex. If $f\in H$ and $\|f\|_H\le r$ then clearly $\operatorname{Proj}_{B_{r,H}}(f)=f$.
Now assume $f\in H$ and $\|f\|_H>r$. We have $h=\operatorname{Proj}_{B_{r,H}}(f)$ if and only if $(f-h,g-h)_H\le 0$ for all $g\in B_{r,H}$. We should show that
\[
\operatorname{Proj}_{B_{r,H}}(f)=r\|f\|_H^{-1}f,\quad\text{i.e.}\quad\big(f-r\|f\|_H^{-1}f,\,g-r\|f\|_H^{-1}f\big)_H\le 0\quad\text{for all }g\in H\text{ with }\|g\|_H\le r.
\]
After multiplying by $(\|f\|_H-r)^{-1}\|f\|_H$ and some modifications this is equivalent to
\[
(f,g)_H\le r\|f\|_H\quad\text{for all }g\in H\text{ with }\|g\|_H\le r,
\]
which holds because of the Cauchy–Schwarz inequality. $\square$

Proof of Theorem 2.
Let us apply Theorem 1 by choosing $K=B_{r,H}$, $G=0$, $L_k=I$, $U_k=u_{I_k}$.
If $I_k=0$ then
\begin{align*}
F_{k+1}&=\operatorname{Proj}_{B_{r,H}}\Big(F_k-\eta_kR_H\big(Du_{I_k}(F_k)\big)\Big)=\operatorname{Proj}_{B_{r,H}}\Big(F_k-\eta_kR_H\big(R_H^{-1}F_k\big)\Big)\\
&=\operatorname{Proj}_{B_{r,H}}\big((1-\eta_k)F_k\big)=(1-\eta_k)F_k.
\end{align*}
If $I_k\in\{1,\cdots,n\}$ then
\begin{align*}
F_{k+1}&=\operatorname{Proj}_{B_{r,H}}\Big(F_k-\eta_kR_H\big(Du_{I_k}(F_k)\big)\Big)=\operatorname{Proj}_{B_{r,H}}\Big(F_k-\eta_kR_H\big(\ell(x_{I_k},F_k(x_{I_k})-y_{I_k})\big)\Big)\\
&=\operatorname{Proj}_{B_{r,H}}\big(F_k-\eta_k\Phi(x_{I_k},F_k(x_{I_k})-y_{I_k})\big)=\operatorname{Proj}_{B_{r,H}}\Big(F_k+\eta_k\Phi\big(x_{I_k},y_{I_k}-F_k(x_{I_k})\big)\Big)\\
&=\frac{1}{S_k}\big(F_k+\eta_k\Phi(x_{I_k},y_{I_k}-F_k(x_{I_k}))\big),
\end{align*}
where $S_k$ is as in equation (3.9).
Computing the expectation we get
\begin{align*}
E\big[\|R_HDu_I(f^*)\|_{H,E[L^*L]}^2\big]&=E\big[\|R_HDu_I(f^*)\|_H^2\big]=E\big[\|Du_I(f^*)\|_{H^*}^2\big]\\
&=q\|Du_0(f^*)\|_{H^*}^2+(1-q)\frac1n\sum_{i=1}^n\|Du_i(f^*)\|_{H^*}^2\\
&=q\|R_H^{-1}f^*\|_{H^*}^2+(1-q)\frac1n\sum_{i=1}^n\|\ell(x_i,f^*(x_i)-y_i)\|_{H^*}^2\\
&=q\|f^*\|_H^2+(1-q)\frac1n\sum_{i=1}^n\|\Phi(x_i,f^*(x_i)-y_i)\|_H^2.
\end{align*}
This together with (8.4) completes the proof of the Theorem. $\square$
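As an illustration (not taken from the paper), the two-branch update above can be sketched in a finite-dimensional kernel setting: take $V=\mathbb{R}$ and let a Gaussian kernel stand in for the representer $\Phi(x,\cdot)$, so that iterates started at $0$ stay in the span of the kernel sections at the data points. The kernel choice, the data, and the step-size constants are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n points in R^2 with scalar targets (V = R).
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)
n = len(y)

def kernel(a, b, gamma=0.5):
    # Gaussian kernel standing in for the representer Phi(x, .) (an assumption).
    return np.exp(-gamma * np.sum((a - b) ** 2))

K = np.array([[kernel(xi, xj) for xj in X] for xi in X])

q, r = 0.1, 10.0      # regularization weight and radius of the ball B_{r,H}
c = np.zeros(n)       # F = sum_j c_j Phi(x_j, .), hence F(x_i) = (K c)_i

def objective(c):
    # u(F) = (q/2)||F||_H^2 + ((1-q)/n) sum_i (1/2)|F(x_i) - y_i|^2
    f_at_X = K @ c
    return 0.5 * q * c @ K @ c + (1 - q) / n * 0.5 * np.sum((f_at_X - y) ** 2)

obj_start = objective(c)

for k in range(1, 3001):
    eta = 1.0 / (q * (50 + k))   # decaying steps, in the spirit of eta_k ~ s/(lambda(b+k))
    if rng.random() < q:
        c = (1 - eta) * c        # I_k = 0: step along R_H Du_0(F) = F
    else:
        i = rng.integers(n)      # I_k = i: step along Phi(x_i, F(x_i) - y_i)
        c[i] += eta * (y[i] - K[i] @ c)
    norm = np.sqrt(max(c @ K @ c, 0.0))
    if norm > r:                 # projection onto B_{r,H}, as in Lemma 16
        c *= r / norm

assert np.sqrt(c @ K @ c) <= r + 1e-9
assert objective(c) < obj_start  # the stochastic scheme reduced the objective
```

The expected step equals $-\eta_kR_H(Du(F_k))$, matching the sampled functional $u_{I_k}$ with $P(I_k=0)=q$ and $P(I_k=i)=(1-q)/n$; the per-step cost is one kernel row product, independent of the ambient dimension.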
References

[1] Cea, J., Approximation variationnelle des problèmes aux limites (French), Ann. Inst. Fourier (Grenoble).
[5] Breaking the curse for uniform approximation in Hilbert spaces via Monte Carlo methods (English summary), J. Complexity (2018), 15–35.
[6] Mathé, P., Random approximation of Sobolev embeddings, J. Complexity (1991), no. 3, 261–281.
[7] Mhaskar, H. N.; Poggio, T., Deep vs. shallow networks: an approximation theory perspective, Anal. Appl. (Singap.) (2016), no. 6, 829–848.
[8] Poggio, T.; Girosi, F., Regularization algorithms for learning that are equivalent to multilayer networks, Science (1990), no. 4945, 978–982.
[9] Poggio, T.; Mhaskar, H.; Rosasco, L. et al.,
Why and when can deep- but not shallow-networks avoid the curse of dimensionality: A review, Int. J. Autom. Comput. (2017), no. 5, 503–519. https://doi.org/10.1007/s11633-017-1054-2

(Karen Yeressian) Department of Mathematics, KTH Royal Institute of Technology, 100 44 Stockholm, Sweden
E-mail address, Karen Yeressian: