Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods
aa r X i v : . [ s t a t . M L ] D ec Benefit of deep learning with non-convex noisy gradient descent B ENEFIT OF DEEP LEAR NING WITH NON - C ONVEXNOISY GR ADIENT DESC ENT : P
ROVABLE EXC ESS R ISKB OUND AND SUPER IORITY TO KER NEL METHODS
Taiji Suzuki
Graduate School of Information Science and Technology, The University of Tokyo, JapanCenter for Advanced Intelligence Project, RIKEN, Japan [email protected]
Shunta Akiyama
Graduate School of Information Science and Technology, The University of Tokyo, Japan [email protected] A BSTRACT
Establishing a theoretical analysis that explains why deep learning can outperformshallow learning such as kernel methods is one of the biggest issues in the deeplearning literature. Towards answering this question, we evaluate excess risk ofa deep learning estimator trained by a noisy gradient descent with ridge regular-ization on a mildly overparameterized neural network, and discuss its superiorityto a class of linear estimators that includes neural tangent kernel approach, ran-dom feature model, other kernel methods, k -NN estimator and so on. We considera teacher-student regression model, and eventually show that any linear estimatorcan be outperformed by deep learning in a sense of the minimax optimal rate espe-cially for a high dimension setting. The obtained excess bounds are so-called fastlearning rate which is faster than O (1 / √ n ) that is obtained by usual Rademachercomplexity analysis. This discrepancy is induced by the non-convex geometry ofthe model and the noisy gradient descent used for neural network training provablyreaches a near global optimal solution even though the loss landscape is highlynon-convex. Although the noisy gradient descent does not employ any explicitor implicit sparsity inducing regularization, it shows a preferable generalizationperformance that dominates linear estimators. NTRODUCTION
In the deep learning theory literature, clarifying the mechanism by which deep learning can outper-form shallow approaches has been gathering most attention for a long time. In particular, it is quiteimportant to show that a tractable algorithm for deep learning can provably achieve a better gener-alization performance than shallow methods. Towards that goal, we study the rate of convergenceof excess risk of both deep and shallow methods in a setting of a nonparametric regression problem.One of the difficulties to show generalization ability of deep learning with certain optimization meth-ods is that the solution is likely to be stacked in a bad local minimum, which prevents us to show itspreferable performances. Recent studies tackled this problem by considering optimization on over-parameterized networks as in neural tangent kernel (NTK) (Jacot et al., 2018; Du et al., 2019a) andmean field analysis (Nitanda & Suzuki, 2017; Chizat & Bach, 2018; Rotskoff & Vanden-Eijnden,2018; 2019; Mei et al., 2018; 2019), or analyzing the noisy gradient descent such as stochastic gradi-ent Langevin dynamics (SGLD) (Welling & Teh, 2011; Raginsky et al., 2017; Erdogdu et al., 2018).The NTK analysis deals with a relatively large scale initialization so that the model is well ap-proximated by the tangent space at the initial solution, and eventually, all analyses can be reducedto those of kernel methods (Jacot et al., 2018; Du et al., 2019b; Allen-Zhu et al., 2019; Du et al.,2019a; Arora et al., 2019; Cao & Gu, 2019; Zou et al., 2020). Although this regime is useful to showits global convergence, the obtained estimator looses large advantage of deep learning approaches1enefit of deep learning with non-convex noisy gradient descentbecause the estimation ability is reduced to the corresponding kernel methods. To overcome thisissue, there are several “beyond-kernel” type analyses. For example, Allen-Zhu & Li (2019; 2020)showed benefit of depth by analyzing ResNet type networks. Li et al. (2020) showed global opti-mality of gradient descent by reducing the optimization problem to a tensor decomposition problemfor a specific regression problem, and showed the “ideal” estimator on a linear model has worsedependency on the input dimensionality. Bai & Lee (2020) considered a second order Taylor ex-pansion and showed that the sample complexity of deep approaches has better dependency on theinput dimensionality than kernel methods. Chen et al. (2020) also derived a similar conclusion byconsidering a hierarchical representation. The analyses mentioned above actually show some supe-riority of deep learning, but all of these bounds are essentially
Ω(1 / √ n ) where n is the sample size,which is not optimal for regression problems with squared loss (Caponnetto & de Vito, 2007). Thereason why only such a sub-optimal rate is considered is that the target of their analyses is mostlythe Rademacher complexity of the set in which estimators exist for bounding the generalization gap.However, to derive a tight excess risk bound instead of the generalization gap, we need to eval-uate so called local Rademacher complexity (Mendelson, 2002; Bartlett et al., 2005; Koltchinskii,2006) (see Eq. (2) for the definition of excess risk). Moreover, some of the existing analyses shouldchange the target function class as the sample size n increases, for example, the input dimensionalityis increased against the sample size, which makes it difficult to see how the rate of convergence isaffected by the choice of estimators.Another promising approach is the mean field analysis. There are also some work that showed su-periority of deep learning against kernel methods. Ghorbani et al. (2019) showed that, when thedimensionality d of input is polynomially increasing with respect to n , the kernel methods is out-performed by neural network approaches. Although the situation of increasing d explains well themodern high dimensional situations, this setting blurs the rate of convergence. Actually, we canshow the superiority of deep learning even in a fixed dimension setting.There are several studies about approximation abilities of deep and shallow models. Ghorbani et al.(2020) showed adaptivity of kernel methods to the intrinsic dimensionality in terms of approxi-mation error and discuss difference between deep and kernel methods. Yehudai & Shamir (2019)showed that the random feature method requires exponentially large number of nodes against theinput dimension to obtain a good approximation for a single neuron target function. These are onlyfor approximation errors and estimation errors are not compared.Recently, the superiority of deep learning against kernel methods has been discussed also in thenonparametric statistics literature where the minimax optimality of deep learning in terms of excessrisk is shown. Especially, it is shown that deep learning achieves better rate of convergence than linear estimators in several settings (Schmidt-Hieber, 2020; Suzuki, 2019; Imaizumi & Fukumizu,2019; Suzuki & Nitanda, 2019; Hayakawa & Suzuki, 2020). Here, the linear estimators are a generalclass of estimators that includes kernel ridge regression, k -NN regression and Nadaraya-Watsonestimator. Although these analyses give clear statistical characterization on estimation ability ofdeep learning, they are not compatible with tractable optimization algorithms.In this paper, we give a theoretical analysis that unifies these analyses and shows the superiority of adeep learning method trained by a tractable noisy gradient descent algorithm. We evaluate the excessrisks of the deep learning approach and linear estimators in a nonparametric regression setting, andshow that the minimax optimal convergence rate of the linear estimators can be dominated by thenoisy gradient descent on neural networks. In our analysis, the model is fixed and no explicit sparseregularization is employed. Our contributions can be summarized as follows:• A refined analysis of excess risks for a fixed model with a fixed input dimension is given to com-pare deep and shallow estimators. 
Although several studies pointed out the curse of dimensionalityis a key factor that separates shallow and deep approaches, we point out that such a separation ap-pears in a rather low dimensional setting, and more importantly, the non-convexity of the modelessentially makes the two regimes different.• A lower bound of the excess risk which is valid for any linear estimator is derived. The analysis isconsiderably general because the class of linear estimators includes kernel ridge regression withany kernel and thus it also includes estimators in the NTK regime.• All derived convergence rate is a fast learning rate that is faster than O (1 / √ n ) . We show thatsimple noisy gradient descent on a sufficiently wide two-layer neural network achieves a fastlearning rate by using a fact that the solution converges to a Bayes estimator with a Gaussian2enefit of deep learning with non-convex noisy gradient descentprocess prior, and the derived convergence rate can be faster than that of linear estimators. Thisis much different from such existing work that compared only coefficients with the same rate ofconvergence with respect to the sample size n . Other related work
Bach (2017) analyzed the model capacity of neural networks and its corre-sponding reproducing kernel Hilbert space (RKHS), and showed that the RKHS is much larger thanthe neural network model. However, separation of the estimation abilities between shallow and deepis not proven. Moreover, the analyzed algorithm is basically the Frank-Wolfe type method whichis not typically used in practical deep learning. The same technique is also employed by Barron(1993). The Frank-Wolfe algorithm is a kind of sparsity inducing algorithm that is effective forestimating a function in a model with an L -norm constraint. It has been shown that explicit or im-plicit sparse regularization such as L -regularization is beneficial to obtain better performances ofdeep learning under certain situations (Chizat & Bach, 2020; Chizat, 2019; Gunasekar et al., 2018;Woodworth et al., 2020; Klusowski & Barron, 2016). For example, E et al. (2019b;a) showed thatthe approximation error of a linear model suffers from the curse of dimensionality in a setting wherethe target function is in the Barron class (Barron, 1993), and showed an L -type regularizationavoids the curse of dimensionality. However, our analysis goes in a different direction where asparse regularization is not required. ROBLEM SETTING AND MODEL
In this section, we give the problem setting and notations that will be used in the theoretical analy-sis. We consider the standard nonparametric regression problem where data are generated from thefollowing model for an unknown true function f o : R d → R : y i = f o ( x i ) + ǫ i ( i = 1 , . . . , n ) , (1)where x i is independently identically distributed from P X whose support is included in Ω = [0 , d ,and ǫ i is an observation noise that is independent of x i and satisfies E[ ǫ i ] = 0 and ǫ i ∈ [ − U, U ] almost surely. The n i.i.d. observations are denoted by D n = ( x i , y i ) ni =1 . We want to estimate thetrue function f o through the training data D n . To achieve this purpose, we employ the squared loss ℓ ( y, f ( x )) = ( y − f ( x )) and accordingly we define the expected and empirical risks as L ( f ) :=E Y,X [ ℓ ( Y, f ( X ))] and b L ( f ) := n P ni =1 ℓ ( y i , f ( x i )) respectively. Throughout this paper, we areinterested in the excess (expected) risk of an estimator b f defined by(Excess risk) L ( b f ) − inf f : measurable L ( f ) . (2)Since the loss function ℓ is the squared loss, the infimum of inf f : measurable L ( f ) is achieved by f o : inf f : measurable L ( f ) = L ( f o ) . The population L ( P X ) -norm is denoted by k f k L ( P X ) := p E X ∼ P X [ f ( X ) ] and the sup-norm on the support of P X is denoted by k f k ∞ :=sup x ∈ supp( P X ) | f ( x ) | . We can easily check that for an estimator b f , the L -distance k b f − f o k L ( P X ) between the estimator b f and the true function f o is identical to the excess risk: L ( b f ) − L ( f o ) = k b f − f o k L ( P X ) . Note that the excess risk is different from the generalization gap L ( b f ) − b L ( b f ) .Indeed, the generalization gap typically converges with the rate of O (1 / √ n ) which is optimal in atypical setting (Mohri et al., 2012). On the other hand, the excess risk can be faster than O (1 / √ n ) ,which is known as a fast learning rate (Mendelson, 2002; Bartlett et al., 2005; Koltchinskii, 2006;Gin´e & Koltchinskii, 2006).2.1 M ODEL OF TRUE FUNCTIONS
To analyze the excess risk, we need to specify a function class (in other words, model) in whichthe true function f o is included. In this paper, we only consider a two layer neural network model,whereas the techniques adapted in this paper can be directly extended to deeper neural networkmodels. We consider a teacher-student setting, that is, the true function f o can be represented bya neural network defined as follows. For w ∈ R , let ¯ w be a “clipping” of w defined as ¯ w := R × tanh( w/R ) where R ≥ is a fixed constant, and let [ x ; 1] := [ x ⊤ , ⊤ for x ∈ R d . Then, theteacher network is given by f W ( x ) = P ∞ m =1 a m ¯ w ,m σ m ( w ⊤ ,m [ x ; 1]) , w ,m ∈ R d +1 and w ,m ∈ R ( m ∈ N ) are the trainable parameters (where W =( w ,m , w ,m ) ∞ m =1 ), a m ∈ R ( m ∈ N ) is a fixed scaling parameter, and σ m : R → R is an activationfunction for the m -th node. The reason why we applied the clipping operation to the parameter ofthe second layer is just for a technical reason to ensure convergence of Langevin dynamics. The dy-namics is bounded in high probability in practical situations and the boundedness condition wouldbe removed if further theoretical development of infinite dimensional Langevin dynamics would beachieved.Let H be a set of parameters W such that its squared norm is bounded: H := { W =( w ,m , w ,m ) ∞ m =1 | P ∞ m =1 ( k w ,m k + w ,m ) < ∞} . Define k W k H := [ P ∞ m =1 ( k w ,m k + w ,m )] / for W ∈ H . Let ( µ m ) ∞ m =1 be a regularization parameter such that µ m ց . Accordinglywe define H γ := { W ∈ H | k W k H γ < ∞} where k W k H γ := [ P ∞ m =1 µ − γm ( k w ,m k + w ,m )] / for a given < γ . Throughout this paper, we analyze an estimation problem in which the truefunction is included in the following model: F γ = { f W | W ∈ H γ , k W k H γ ≤ } . This is basically two layer neural network with infinite width. As assumed later, a m is assumedto decrease as m → ∞ . Its decreasing rate controls the capacity of the model. If the first layerparameters ( w ,m ) m are fixed, this model can be regarded as a variant of the unit ball of some repro-ducing kernel Hilbert space (RKHS) with basis functions a m σ m ( w ⊤ ,m [ x ; 1]) . However, since thefirst layer ( w ,m ) is also trainable, there appears significant difference between deep and kernel ap-proaches. The Barron class (Barron, 1993; E et al., 2019b) is relevant to this function class. Indeed,it is defined as the convex hull of w σ ( w ⊤ [ x ; 1]) with norm constraints on ( w , w ) where σ is anactivation function. On the other hand, we will put an explicit decay rate on a m and the parameter W has an L -norm constraint, which makes the model F γ smaller than the Barron class. STIMATORS
We consider two classes of estimators and discuss their differences: linear estimators and deeplearning estimator with noisy gradient descent (NGD).
Linear estimator
A class of linear estimators, which we consider as a representative of “shallow”learning approach, consists of all estimators that have the following form: b f ( x ) = P ni =1 y i ϕ i ( x , . . . , x n , x ) . Here, ( ϕ i ) ni =1 can be any measurable function (and L ( P X ) -integrable so that the excess risk canbe defined). Thus, they could be selected as the “optimal” one so that the corresponding linear esti-mator minimizes the worst case excess risk. Even if we chose such an optimal one, the worst caseexcess risk should be lower bounded by our lower bound given in Theorem 1. It should be noted thatthe linear estimator does not necessarily imply “linear model.” The most relevant linear estimatorin the machine learning literature is the kernel ridge regression: b f ( x ) = Y ⊤ ( K X + λ I) − k ( x ) where K X = ( k ( x i , x j )) ni,j =1 ∈ R n × n , k ( x ) = [ k ( x, x ) , . . . , k ( x, x n )] ⊤ ∈ R n and Y =[ y , . . . , y n ] ⊤ ∈ R n for a kernel function k : R d × R d → R . Therefore, the ridge regression estima-tor in the NTK regime or the random feature model is also included in the class of linear estimators.The solution obtained in the early stopping criteria instead of regularization in the NTK regimeunder the squared loss is also included in the linear estimators. Other examples include the k -NNestimator and the Nadaraya-Watson estimator. All of them do not train the basis function in a non-linear way, which makes difference from the deep learning approach. In the nonparametric statisticsliterature, linear estimators have been studied for estimating a wavelet series model. Donoho et al.(1990; 1996) have shown that a wavelet shrinkage estimator can outperform any linear estimatorby showing suboptimality of linear estimators. Suzuki (2019) utilized such an argument to showsuperiority of deep learning but did not present any tractable optimization algorithm. Noisy Gradient Descent with regularization
As for the neural network approach, we consider anoisy gradient descent algorithm. Basically, we minimize the following regularized empirical risk: b L ( f W ) + λ k W k H . H -norm as the regularizer. We note that the constant γ controls therelative complexity of the true function f o compared to the typical solution obtained un-der the regularization. Here, we define a linear operator A as λ k W k H = W ⊤ AW , thatis, AW = ( λµ − m w ,m , λµ − m w ,m ) ∞ m =1 . The regularized empirical risk can be minimizedby noisy gradient descent as W k +1 = W k − η ∇ ( b L ( f W k ) + λ k W k k H ) + q ηβ ξ k , where η > is a step size and ξ k = ( ξ k, (1 ,m ) , ξ k, (2 ,m ) ) ∞ m =1 is an infinite-dimensional Gaus-sian noise, i.e., ξ k, (1 ,m ) and ξ k, (2 ,m ) are independently identically distributed from the stan-dard normal distribution (Da Prato & Zabczyk, 1996). Here, ∇ b L ( f W ) = n P ni =1 f W ( x i ) − y i )( ¯ w ,m a m [ x i ; 1] σ ′ m ( w ⊤ ,m [ x i ; 1]) , a m tanh ′ ( w ,m /R ) σ m ( w ⊤ ,m [ x i ; 1])) ∞ m =1 . However, since ∇k W k − k H is unbounded which makes it difficult to show convergence, we employ the semi-implicit Euler scheme defined by W k +1 = W k − η ∇ b L ( f W k ) − ηAW k +1 + q ηβ ξ k ⇔ W k +1 = S η (cid:16) W k − η ∇ b L ( f W k )+ q ηβ ξ k (cid:17) , (3)where S η := (I+ ηA ) − . It is easy to check that this is equivalent to the following update rule: W k = W k − − η (cid:16) ∇ b L ( f W k − ) + S η AW k − + q ηβ ξ k − (cid:17) . Therefore, the implicit Euler scheme can beseen as a naive noisy gradient descent for minimizing the empirical risk with a slightly modifiedridge regularization. This can be interpreted as a discrete time approximation of the following infinite dimensional Langevin dynamics : d W t = −∇ ( b L ( f W t ) + λ k W t k H )d t + p /β d ξ t , (4)where ( ξ t ) t ≥ is the so-called cylindrical Brownian motion (see Da Prato & Zabczyk (1996) for thedetails). Its application and analysis for machine learning problems with non-convex objectives havebeen recently studied by, for example, Muzellec et al. (2020); Suzuki (2020).The above mentioned algorithm is executed on an infinite dimensional parameter space. In practice,we should deal with a finite width network. To do so, we approximate the solution by a finitedimensional one: W ( M ) = ( w ,m , w ,m ) Mm =1 where M corresponds to the width of the network.We identify W ( M ) to the “zero-padded” infinite dimensional one, W = ( w ,m , w ,m ) ∞ m =1 with w ,m = 0 and w ,m = 0 for all m > M . Accordingly, we use the same notation f W ( M ) to indicate f W with zero padded vector W . Then, the finite dimensional version of the update rule is given by W ( M ) k +1 = S ( M ) η (cid:16) W ( M ) k − η ∇ b L ( f W ( M ) k ) + q ηβ ξ ( M ) k (cid:17) , where ξ ( M ) k is the Gaussian noise vectorobtained by projecting ξ k to the first M components and S ( M ) η is also obtained in a similar way. ONVERGENCE RATE OF ESTIMATORS
In this section, we present the excess risk bounds for linear estimators and the deep learning estima-tor. As for the linear estimators, we give its lower bound while we give an upper bound for the deeplearning approach. To obtain the result, we setup some assumptions on the model.
Assumption 1. (i)
There exists a constant c µ such that µ m ≤ c µ m − ( m ∈ N ) . (ii) There exists α > / such that a m ≤ µ α m ( m ∈ N ) . (iii) The activation functions ( σ m ) m is bounded as k σ m k ∞ ≤ . Moreover, they are three timesdifferentiable and their derivatives upto third order differentiation are uniformly bounded: ∃ C σ such that k σ m k , := max {k σ ′ m k ∞ , k σ ′′ m k ∞ , k σ ′′′ m k ∞ } ≤ C σ ( ∀ m ∈ N ) . The first assumption (i) controls the strength of the regularization, and combined with the secondassumption (ii) and definition of the model F γ , complexity of the model is controlled. If α and γ are large, the model is less complicated. Indeed, the convergence rate of the excess risk becomesfaster if these parameters are large as seen later. The decay rate µ m ≤ c µ m − can be generalized as m − p with p > but we employ this setting just for a technical simplicity for ensuring convergenceof the Langevin dynamics. The third assumption (iii) can be satisfied by several activation functionssuch as the sigmoid function and the hyperbolic tangent. The assumption k σ m k ∞ ≤ could bereplaced by another one like k σ m k ∞ ≤ C , but we fix this scaling for simple presentation.5enefit of deep learning with non-convex noisy gradient descent4.1 M INIMAX LOWER BOUND FOR LINEAR ESTIMATORS
Here, we analyze a lower bound of excess risk of linear estimators, and eventually we show that any linear estimator suffers from curse of dimensionality. To rigorously show that, we consider thefollowing minimax excess risk over the class of linear estimators: R lin ( F γ ) := inf b f : linear sup f o ∈F γ E D n [ k b f − f o k L ( P X ) ] , where inf is taken over all linear estimators and E D n [ · ] is taken with respect to the training data D n .This expresses the best achievable worst case error over the class of linear estimators to estimate afunction in F γ . To evaluate it, we additionally assume the following condition. Assumption 2.
We assume that µ m = m − and a m = µ α m ( m ∈ N ) (and hence c µ = 1 ). Thereexists a monotonically decreasing sequence ( b m ) ∞ m =1 and s ≥ such that b m = µ α m ( ∀ m ) with α > γ/ and σ m ( u ) = b sm σ ( b − m u ) ( u ∈ R ) where σ is the sigmoid function: σ ( u ) = 1 / (1 + e − u ) . Intuitively, the parameter s controls the “resolution” of each basis function σ m , and the relationbetween parameter α and α controls the magnitude of coefficient for each basis σ m . Note that thecondition s ≥ ensures k σ m k , is uniformly bounded and < b m ≤ ensures k σ m k ∞ ≤ . Ourmain strategy to obtain the lower bound is to make use of the so-called convex-hull argument . Thatis, it is known that, for a function class F , the minimax risk R ( F ) over a class of linear estimatorsis identical to that for the convex hull of F (Hayakawa & Suzuki, 2020; Donoho et al., 1990): R lin ( F ) = R lin (conv( F )) , where conv( F ) = { P Ni =1 λ i f i | f i ∈ F , P Ni =1 λ i = 1 , λ i ≥ , N ∈ N } and conv( · ) is theclosure of conv( · ) with respect to L ( P X ) -norm. Intuitively, since the linear estimator is linear tothe observations ( y i ) ni =1 of outputs, a simple application of Jensen’s inequality yields that its worstcase error on the convex hull of the function class F does not increase compared with that on theoriginal one F (see Hayakawa & Suzuki (2020) for the details). This indicates that the linear es-timators cannot distinguish the original hypothesis class F and its convex hull. Therefore, if theclass F is highly non-convex, then the linear estimators suffer from much slower convergence ratebecause its convex hull conv( F ) becomes much “fatter” than the original one F . To make use ofthis argument, for each sample size n , we pick up appropriate m n and consider a subset generatedby the basis function σ m n , i.e., F ( n ) γ := { a m n ¯ w ,m n σ m ( w ⊤ ,m n [ x ; 1]) ∈ F γ } . By applying the con-vex hull argument to this set, we obtain the relation R lin ( F γ ) ≥ R lin ( F ( n ) γ ) = R lin (conv( F ( n ) γ )) .Since F ( n ) γ is highly non-convex, its convex hull conv( F ( n ) γ ) is much larger than the original set F ( n ) γ and thus the minimax risk over the linear estimators would be much larger than that over allestimators including deep learning. More intuitively, linear estimators do not adaptively select thebasis functions and thus they should prepare redundantly large class of basis functions to approx-imate functions in the target function class. The following theorem gives the lower bound of theminimax optimal excess risk over the class of linear estimators. Theorem 1.
Suppose that
Var( ǫ ) > , P X is the uniform distribution on [0 , d , and Assumption 2is satisfied. Let ˜ β = α +( s +1) α α − γ/ . Then for arbitrary small κ ′ > , we have that R lin ( F γ ) & n − β + d β +2 d n − κ ′ . (5)The proof is in Appendix A. We utilized the Irie-Miyake integral representation (Irie & Miyake,1988; Hornik et al., 1990) to show there exists a “complicated” function in the convex hull, andthen we adopted the technique of Zhang et al. (2002) to show the lower bound. The lower bound ischaracterized by the decaying rate ( α ) of a m relative to that ( α ) of the scaling factor b m . Indeed,the faster a m decays with increasing m , the faster the rate of the minimax lower bound becomes.We can see that the minimax rate of linear estimators is quite sensitive to the dimension d . Actually,for relatively high dimensional settings, this lower bound becomes close to a slow rate Ω(1 / √ n ) ,which corresponds to the curse of dimensionality.It has been pointed out that the sample complexity of kernel methods suffers from the curse ofdimensionality while deep learning can avoid that by a tractable algorithms (e.g., Ghorbani et al.(2019); Bach (2017)). Among them, Ghorbani et al. (2019) showed that if the dimensionality d is6enefit of deep learning with non-convex noisy gradient descentpolynomial against n , then the excess risk of kernel methods is bounded away from 0 for all n . Onthe other hand, our analysis can be applied to any linear estimator including kernel methods, andit shows that even if the dimensionality d is fixed, the convergence rate of their excess risk suffersfrom the curse of dimensionality. This can be accomplished thanks to a careful analysis of the rateof convergence. Bach (2017) derived an upper bound of the Rademacher complexity of the unitball of the RKHS corresponding to a neural network model. However, it is just an upper boundand there is still a large gap from excess risk estimates. Allen-Zhu & Li (2019; 2020); Bai & Lee(2020); Chen et al. (2020) also analyzed a lower bound of sample complexity of kernel methods.However, their lower bound is not for the excess risk of the squared loss. Eventually, the samplecomplexities of all methods including deep learning take a form of O ( C/ √ n ) and dependency ofcoefficient C to the dimensionality or other factors such as magnitude of residual components iscompared. On the other hand, our lower bound properly involves the properties of squared loss suchas strong convexity and smoothness and the bound shows the curse of dimensionality occurs even inthe rate of convergence instead of just the coefficient. Finally, we would like to point out that severalexisting work (e.g., Ghorbani et al. (2019); Allen-Zhu & Li (2019)) considered a situation wherethe target function class changes as the sample size n increases. However, our analysis reveals thatseparation between deep and shallow occurs even if the target function class F γ is fixed.4.2 U PPER BOUND FOR DEEP LEARNING
Here, we analyze the excess risk of deep learning trained by NGD and its algorithmic convergencerate. Our analysis heavily relies on the weak convergence of the discrete time gradient Langevin dy-namics to the stationary distribution of the continuous time one (Eq. (4)). Under some assumptions,the continuous time dynamics has a stationary distribution (Da Prato & Zabczyk, 1992; Maslowski,1989; Sowers, 1992; Jacquot & Royer, 1995; Shardlow, 1999; Hairer, 2002). If we denote the prob-ability measure on H corresponding to the stationary distribution by π ∞ , then it is given by d π ∞ d ν β ( W ) ∝ exp( − β b L ( f W )) , where ν β is the Gaussian measure in H with mean 0 and covariance ( βA ) − (seeDa Prato & Zabczyk (1996) for the rigorous definition of the Gaussian measure on a Hilbert space).Remarkably, this can be seen as the Bayes posterior for a prior distribution ν β and a “log-likelihood”function exp( − β b L ( W )) . Through this view point, we can obtain an excess risk bound of the solu-tion W k . The proofs of all theorems in this section are in Appendix B.Under Assumption 1, the distribution of W k derived by the discrete time gradient Langevin synamicssatisfies the following weak convergence property to the stationary distribution π ∞ . This conver-gence rate analysis depends on the techniques by Br´ehier & Kopec (2016); Muzellec et al. (2020). Proposition 1.
Assume Assumption 1 holds and β > η . Then, there exist spectral gaps Λ ∗ η and Λ ∗ (defined in Eq. (10) of Appendix B.1) and a constant C such that, for any < a < / , thefollowing convergence bound holds for almost sure observation D n : | E W k [ L ( f W k ) | D n ] − E W ∼ π ∞ [ L ( f W ) | D n ] | ≤ C exp( − Λ ∗ η ηk ) + C √ β Λ ∗ η / − a =: Ξ k , (6) where C is a constant depending only on c µ , R, α , C σ , U, a (independent of η, k, β, λ, n ). This proposition indicates that the expected risk of W k can be almost identical to that of the “Bayesposterior solution” obeying π ∞ after sufficiently large iterations k with sufficiently small step size η even though b L ( f W ) is not convex. The definition of Λ ∗ η can be found in Eq. (10). We shouldnote that its dependency on β is exponential. Thus, if we take β = Ω( n ) , then the computationalcost until a sufficiently small error could be exponential with respect to the sample size n . The sameconvergence holds also for finite dimensional one W ( M ) k with a modified stationary distribution. Theconstants appearing in the bound are independent of the model size M (see the proof of Proposition1 in Appendix B). In particular, the convergence can be guaranteed even if W is infinite dimensional.This is quite different from usual finite dimensional analyses (Raginsky et al., 2017; Erdogdu et al.,2018; Xu et al., 2018) which requires exponential dependency on the dimension, but thanks to theregularization term, we can obtain the model size independent convergence rate. Xu et al. (2018)also analyzed a finite dimensional gradient Langevin dynamics and obtained a similar bound where7enefit of deep learning with non-convex noisy gradient descent O ( η ) appears in place of the second term η / − a which corresponds to time discretization error. Inour setting the regularization term is k W k H = P m ( k w ,m k + w ,m ) /µ m with µ m . m − , butif we employ k W k H p/ = P m ( k w ,m k + w ,m ) /µ p/ m for p > , then the time discretizationerror term would be modified to η ( p − /p − a (Andersson et al., 2016). We can interpret the finitedimensional setting as the limit of p → ∞ which leads to η ( p − /p → η that recovers the finitedimensional result ( O ( η ) ) as shown by Xu et al. (2018).In addition to the above algorithmic convergence, we also have the following convergence rate forthe excess risk bound of the finite dimensional solution W ( M ) k . Theorem 2.
Assume Assumption 1 holds, assume η < β ≤ min { n/ (2 U ) , n } , and < γ < / α . Then, if the width satisfies M ≥ min (cid:8) λ / γ ( α +1) β / γ , λ − / α +1) , n / γ (cid:9) , theexpected excess risk of W k is bounded as E D n h E W ( M ) k [ k f W ( M ) k − f o k L ( P X ) | D n ] i ≤ C max (cid:8) ( λβ ) /γ / γ n − / γ , λ − α β − , λ γ α (cid:9) +Ξ k , where C is a constant independent of n, β, λ, η, k . In particular, if we set β = min { n/ (2 U ) , n } and λ = β − , then for M ≥ n / α +1) , we obtain E D n h E W ( M ) k [ k f W ( M ) k − f o k L ( P X ) | D n ] i . n − γα + Ξ k . In addition to this theorem, if we further assume Assumption 2, we obtain a refined bound as follows.
Corollary 1.
Assume Assumptions 1 and 2 hold and η < β , and let β = min { n/ (2 U ) , n } and λ = β − . Suppose that there exists ≤ q ≤ s − such that < γ < / α + qα . Then, theexcess risk bound of W ( M ) k for M ≥ n / α + qα +1) can be refined as E D n h E W ( M ) k [ k f W ( M ) k − f o k L ( P X ) | D n ] i . n − γα qα + Ξ k . (7)These theorem and corollary shows that the tractable NGD algorithm achieves a fast convergencerate of the excess risk bound. Indeed, if q is chosen so that γ > ( α + qα + 1) / , then the excessrisk bound converges faster than O (1 / √ n ) . Remarkably, the convergence rate is not affected bythe input dimension d , which makes discrepancy from linear estimators. The bound of Theorem2 is tightest when γ is close to / α ( γ ≈ / α + 3 α for Corollary 1), and a smaller γ yields looser bound. This relation between γ and α reflects misspecification of the “prior” dis-tribution. When γ is small, the regularization λ k W k H is not strong enough so that the varianceof the posterior distribution becomes unnecessarily large for estimating the true function f o ∈ F γ .Therefore, the best achievable bound can be obtained when the regularization is correctly specified.The analysis of fast rate is in contrast to some existing work (Allen-Zhu & Li, 2019; 2020; Li et al.,2020; Bai & Lee, 2020) that basically evaluated the Rademacher complexity. This is because weessentially evaluated a local Rademacher complexity instead.4.3 C OMPARISON BETWEEN LINEAR ESTIMATORS AND DEEP LEARNING
Here, we compare the convergence rate of excess risks between the linear estimators and the neuralnetwork method trained by NGD using the bounds obtained in Theorem 1 and Corollary 1 respec-tively. We write the lower bound (5) of the minimax excess risk of linear estimators as R ∗ lin and theexcess risk of the neural network approach (7) as R ∗ NN . To make the discussion concise, we considera specific situation where s = 3 , α = γ = α . In this case, ˜ β = 17 / ≈ . , which gives R ∗ lin & n − (cid:18) d β + d (cid:19) − n − κ ′ ≈ n − (cid:18) d . d (cid:19) − n − κ ′ . On the other hand, by setting q = 0 , we have R ∗ NN . n − α α = n − (cid:16)
1+ 1 α (cid:17) − . Thus, as long as α > . /d + 1 ≈ β/d + 1 , we have that R ∗ lin & R ∗ NN , and lim n →∞ R ∗ NN R ∗ lin = 0 . d gets larger, R ∗ lin approaches to Ω( n − / ) while R ∗ NN is not affected by d and itgets close to O ( n − ) as α gets larger. Moreover, the inequality α > . /d + 1 can be satis-fied by a relatively low dimensional setting; for example, d = 10 is sufficient when α = 3 . As α becomes large, the model becomes “simpler” because ( a m ) ∞ m =1 decays faster. However, thelinear estimators cannot take advantage of this information whereas deep learning can. From theconvex hull argument, this discrepancy stems from the non-convexity of the model. We also notethat the superiority of deep learning is shown without sparse regularization while several existingwork showed favorable estimation property of deep learning though sparsity inducing regularization(Bach, 2017; Chizat, 2019; Hayakawa & Suzuki, 2020). However, our analysis indicates that sparseregularization is not necessarily as long as the model has non-convex geometry, i.e., sparsity is justone sufficient condition for non-convexity but not a necessarily condition. The parameter settingabove is just a sufficient condition and the lower bound R ∗ lin would not be tight. The superiority ofdeep learning would hold in much wider situations. ONCLUSION
In this paper, we studied excess risks of linear estimators, as a representative of shallow methods,and a neural network estimator trained by a noisy gradient descent where the model is fixed and nosparsity inducing regularization is imposed. Our analysis revealed that deep learning can outperformany linear estimator even for a relatively low dimensional setting. Essentially, non-convexity of themodel induces this difference and the curse of dimensionality for linear estimators is a consequenceof a fact that the geometry of the model becomes more “non-convex” as the dimension of inputgets higher. All derived bounds are fast rate because the analyses are about the excess risk with thesquared loss, which made it possible to compare the rate of convergence. The fast learning rate ofthe deep learning approach is derived through the fact that the noisy gradient descent behaves like aBayes estimator with model size independent convergence rate.A
CKNOWLEDGMENTS
TS was partially supported by JSPS Kakenhi (26280009, 15H05707 and 18H03201), Japan DigitalDesign and JST-CREST. R EFERENCES
Z. Allen-Zhu and Y. Li. What can ResNet learn efficiently, going beyond kernels? In
Advances inNeural Information Processing Systems 32 , pp. 9017–9028. Curran Associates, Inc., 2019.Z. Allen-Zhu and Y. Li. Backward feature correction: How deep learning performs deep learning. arXiv preprint arXiv:2001.04413 , 2020.Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization.In
Proceedings of International Conference on Machine Learning , pp. 242–252, 2019.A. Andersson, R. Kruse, and S. Larsson. Duality in refined Sobolev–Malliavin spaces and weak ap-proximation of SPDE.
Stochastics and Partial Differential Equations Analysis and Computations ,4(1):113–149, 2016.S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and gen-eralization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584 ,2019.F. Bach. Breaking the curse of dimensionality with convex neural networks.
Journal of MachineLearning Research , 18(19):1–53, 2017.Y. Bai and J. D. Lee. Beyond linearization: On quadratic and higher-order approximation ofwide neural networks. In
International Conference on Learning Representations , 2020. URL https://openreview.net/forum?id=rkllGyBFPH .A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function.
IEEETransactions on Information theory , 39(3):930–945, 1993.9enefit of deep learning with non-convex noisy gradient descentP. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities.
The Annals of Statis-tics , 33:1487–1537, 2005.C.-E. Br´ehier and M. Kopec. Approximation of the invariant law of SPDEs: error analysis usinga Poisson equation for a full-discretization scheme.
IMA Journal of Numerical Analysis , 37(3):1375–1410, 07 2016.Y. Cao and Q. Gu. A generalization theory of gradient descent for learning over-parameterized deepReLU networks. arXiv preprint arXiv:1902.01384 , 2019.A. Caponnetto and E. de Vito. Optimal rates for regularized least-squares algorithm.
Foundationsof Computational Mathematics , 7(3):331–368, 2007.M. Chen, Y. Bai, J. D. Lee, T. Zhao, H. Wang, C. Xiong, and R. Socher. Towards understanding hier-archical learning: Benefits of neural representations.
Advances in Neural Information ProcessingSystems , 33, 2020.L. Chizat. Sparse optimization on measures with over-parameterized gradient descent. arXivpreprint arXiv:1907.10300 , 2019.L. Chizat and F. Bach. A note on lazy training in supervised differentiable programming. arXivpreprint arXiv:1812.07956 , 2018.L. Chizat and F. Bach. Implicit bias of gradient descent for wide two-layer neural networks trainedwith the logistic loss. arXiv preprint arXiv:2002.04486 , 2020.G. Da Prato and J. Zabczyk. Non-explosion, boundedness and ergodicity for stochastic semilinearequations.
Journal of Differential Equations , 98:181–195, 1992.G. Da Prato and J. Zabczyk.
Ergodicity for Infinite Dimensional Systems . London MathematicalSociety Lecture Note Series. Cambridge University Press, 1996.D. L. Donoho, R. C. Liu, and B. MacGibbon. Minimax risk over hyperrectangles, and implications.
The Annal of Statistics , 18(3):1416–1437, 09 1990. doi: 10.1214/aos/1176347758.D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard. Density estimation by waveletthresholding.
The Annals of Statistics , 24(2):508–539, 1996.S. Du, J. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neuralnetworks. In
International Conference on Machine Learning , pp. 1675–1685, 2019a.S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterizedneural networks.
International Conference on Learning Representations 7 , 2019b.W. E, C. Ma, and L. Wu. A priori estimates of the population risk for two-layer neural networks.
Communications in Mathematical Sciences , 17(5):1407–1425, 2019a.W. E, C. Ma, and L. Wu. A comparative analysis of optimization and generalization properties oftwo-layer neural network and random feature models under gradient descent dynamics.
ScienceChina Mathematics , pp. 1–24, 2019b.M. A. Erdogdu, L. Mackey, and O. Shamir. Global non-convex optimization with discretized diffu-sions. In
Advances in Neural Information Processing Systems 31 , pp. 9671–9680. 2018.B. Ghorbani, S. Mei, T. Misiakiewicz, and A. Montanari. Linearized two-layers neural networks inhigh dimension. arXiv preprint arXiv:1904.12191 , 2019.B. Ghorbani, S. Mei, T. Misiakiewicz, and A. Montanari. When do neural networks outperformkernel methods? arXiv preprint arXiv:2006.13409 , 2020.E. Gin´e and V. Koltchinskii. Concentration inequalities and asymptotic results for ratio type empir-ical processes.
The Annals of Probability , 34(3):1143–1216, 2006.10enefit of deep learning with non-convex noisy gradient descentS. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro. Implicit bias of gradient descent on linearconvolutional networks. In
Advances in Neural Information Processing Systems , pp. 9482–9491,2018.M. Hairer. Exponential mixing properties of stochastic PDEs through asymptotic coupling.
Probab.Theory Related Fields , 124(3):345–380, 2002.S. Hayakawa and T. Suzuki. On the minimax optimality and superiority of deep neural networklearning over sparse parameter spaces.
Neural Networks , 123:343–361, 2020. ISSN 0893-6080.K. Hornik, M. Stinchcombe, and H. White. Universal approximation of an unknown mapping andits derivatives using multilayer feedforward networks.
Neural Networks , 3(5):551–560, 1990.M. Imaizumi and K. Fukumizu. Deep neural networks learn non-smooth functions effectively. InK. Chaudhuri and M. Sugiyama (eds.),
Proceedings of Machine Learning Research , volume 89of
Proceedings of Machine Learning Research , pp. 869–878. PMLR, 16–18 Apr 2019.B. Irie and S. Miyake. Capabilities of three-layered perceptrons. In
IEEE 1988 International Con-ference on Neural Networks , pp. 641–648, 1988.A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization inneural networks. In
Advances in Neural Information Processing Systems 31 , pp. 8580–8589,2018.S. Jacquot and G. Royer. Ergodicit´e d’une classe d’´equations aux d´eriv´ees partielles stochastiques.
Comptes Rendus de l’Acad´emie des Sciences. S´erie I. Math´ematique , 320(2):231–236, 1995.J. M. Klusowski and A. R. Barron. Risk bounds for high-dimensional ridge function combinationsincluding neural networks. arXiv preprint arXiv:1607.01434 , 2016.V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization.
TheAnnals of Statistics , 34:2593–2656, 2006.Y. Li, T. Ma, and H. R. Zhang. Learning over-parametrized two-layer neural networks beyond ntk.volume 125 of
Proceedings of Machine Learning Research , pp. 2613–2682. PMLR, 09–12 Jul2020.B. Maslowski. Strong Feller property for semilinear stochastic evolution equations and applications.In
Stochastic systems and optimization (Warsaw, 1988) , volume 136 of
Lect. Notes Control Inf.Sci. , pp. 210–224. Springer, Berlin, 1989.S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neuralnetworks.
Proceedings of the National Academy of Sciences , 115(33):E7665–E7671, 2018. doi:10.1073/pnas.1806579115.S. Mei, T. Misiakiewicz, and A. Montanari. Mean-field theory of two-layers neural networks:dimension-free bounds and kernel limit. In A. Beygelzimer and D. Hsu (eds.),
Proceedings ofthe Thirty-Second Conference on Learning Theory , volume 99 of
Proceedings of Machine Learn-ing Research , pp. 2388–2464, Phoenix, USA, 25–28 Jun 2019. PMLR.S. Mendelson. Improving the sample complexity using global data.
IEEE Transactions on Informa-tion Theory , 48:1977–1991, 2002.M. Mohri, A. Rostamizadeh, and A. Talwalkar.
Foundations of Machine Learning . The MIT Press,2012.B. Muzellec, K. Sato, M. Massias, and T. Suzuki. Dimension-free convergence rates for gradientLangevin dynamics in RKHS. arXiv preprint 2003.00306 , 2020.A. Nitanda and T. Suzuki. Stochastic particle gradient descent for infinite ensembles. arXiv preprintarXiv:1712.05438 , 2017.M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via Stochastic Gradient LangevinDynamics: a nonasymptotic analysis. arXiv e-prints , pp. arXiv:1702.03849, 2017.11enefit of deep learning with non-convex noisy gradient descentG. Rotskoff and E. Vanden-Eijnden. Parameters as interacting particles: long time convergenceand asymptotic error scaling of neural networks. In
Advances in Neural Information ProcessingSystems 31 , pp. 7146–7155. Curran Associates, Inc., 2018.G. M. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of neural networks: An interactingparticle system approach. arXiv preprint arXiv:1805.00915 , 2019.W. Rudin.
Real and Complex Analysis (third edition) . Mathematics series. McGraw-Hill, 1987.J. Schmidt-Hieber. Nonparametric regression using deep neural networks with ReLU activationfunction.
The Annals of Statistics , 48(4), 2020.T. Shardlow. Geometric ergodicity for stochastic PDEs.
Stochastic Analysis and Applications , 17(5):857–869, 1999.R. Sowers. Large deviations for the invariant measure of a reaction-diffusion equation with non-Gaussian perturbations.
Probab. Theory Related Fields , 92(3):393–421, 1992. ISSN 0178-8051.T. Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces:optimal rate and curse of dimensionality. In
International Conference on Learning Representa-tions , 2019. URL https://openreview.net/forum?id=H1ebTsActm .T. Suzuki. Generalization bound of globally optimal non-convex neural network training: Trans-portation map estimation by infinite dimensional langevin dynamics. In
Advances in NeuralInformation Processing Systems 33 , pp. to appear. Curran Associates, Inc., 2020.T. Suzuki and A. Nitanda. Deep learning is adaptive to intrinsic dimensionality of model smoothnessin anisotropic Besov space. arXiv preprint arXiv:1910.12799 , 2019.M. Welling and Y.-W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In
ICML ,pp. 681–688, 2011.B. Woodworth, S. Gunasekar, J. D. Lee, E. Moroshko, P. Savarese, I. Golan, D. Soudry, and N. Sre-bro. Kernel and rich regimes in overparametrized models. volume 125 of
Proceedings of MachineLearning Research , pp. 3635–3673. PMLR, 09–12 Jul 2020.P. Xu, J. Chen, D. Zou, and Q. Gu. Global convergence of langevin dynamics based algorithms fornonconvex optimization. In
Advances in Neural Information Processing Systems , volume 31, pp.3122–3133. Curran Associates, Inc., 2018.G. Yehudai and O. Shamir. On the power and limitations of random features for understandingneural networks. In
Advances in Neural Information Processing Systems 32 , pp. 6598–6608.Curran Associates, Inc., 2019.S. Zhang, M.-Y. Wong, and Z. Zheng. Wavelet threshold estimation of a regression function withrandom design.
Journal of Multivariate Analysis , 80(2):256–284, 2002.D. Zou, Y. Cao, D. Zhou, and Q. Gu. Gradient descent optimizes over-parameterized deep ReLUnetworks.
Machine Learning , 109(3):467–492, 2020.
A P
ROOF OF T HEOREM We basically combine the “convex hull argument” and the minimax optimal rate analysis for linearestimators developed by Zhang et al. (2002).Zhang et al. (2002) essentially showed the following statement in their Theorem 1.
Proposition 2 (Theorem 1 of Zhang et al. (2002)) . Let µ be the Lebesgue measure. Suppose thatthe space Ω has even partition A such that |A| = 2 K for an integer K ∈ N , each A has equivalentmeasure µ ( A ) = 2 − K for all A ∈ A , and A is indeed a partition of Ω , i.e., ∪ A ∈A = Ω , A ∩ A ′ = ∅ for A, A ′ ∈ Ω and A = A ′ . Then, if K is chosen as n − γ ≤ − K ≤ n − γ for constants γ , γ > that are independent of n , then there exists an event E such that, for a constant C ′ > , P ( E ) ≥ o (1) and |{ x i | x i ∈ A ( i ∈ { , . . . , n } ) }| ≤ C ′ n/ K ( ∀ A ∈ A ) . Moreover, suppose that, for a class F ◦ of functions on Ω , there exists ∆ > that satisfies thefollowing conditions:1. There exists F > such that, for any A ∈ A , there exists g ∈ F ◦ that satisfies g ( x ) ≥ ∆ F for all x ∈ A ,2. There exists K ′ and C ′′ > such that n P ni =1 g ( x i ) ≤ C ′′ ∆ − K ′ for any g ∈ F ◦ onthe event E .Then, there exists a constant F such that at least one of the following inequalities holds: F F C ′′ K ′ n ≤ R lin ( F ◦ ) , (8a) F
32 ∆ − K ≤ R lin ( F ◦ ) , (8b) for sufficiently large n . Before we show the main assertion, we prepare some additional lemmas. For a sigmoid function σ ,let ˜ F ( σ ) C,τ := { x ∈ R d aσ ( τ ( w ⊤ x + b ))) | | a | ≤ C, k w k ≤ , | b | ≤ a, b ∈ R , w ∈ R d ) } for C > , τ > . Lemma 1.
Let ψ ( x ) = ( σ ( x + 1) − σ ( x − and ˆ ψ be its Fourier transform: ˆ ψ ( ω ) :=(2 π ) − R e − i ωx ψ ( x )d x . Let h > and D w > . Then, by setting τ = h − (2 √ d + 1) D w and C = (2 √ d +1) D w πh | ˆ ψ (1) | , the Gaussian RBF kernel can be approximated by inf ˇ g ∈ conv( ˜ F ( σ ) C,τ ) sup x ∈ [0 , d (cid:12)(cid:12)(cid:12)(cid:12) ˇ g ( x ) − exp (cid:18) − k x − c k h (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ | π ˆ ψ (1) | h C d D d − w exp( − D w /
2) + exp( − D w ) i for any c ∈ [0 , d , where C d is a constant depending only on d . In particular, the right hand side is O (exp( − n κ )) if D w = n κ .Proof of Lemma 1. Let ψ h ( x ) = ψ ( h − x ) . Suppose that, for f ∈ L ( R d ) , its Fourier transform ˆ f ( ω ) = (2 π ) − d R e − i ω ⊤ x f ( x )d x ( ω ∈ R d ) gives Z R d exp(i w ⊤ x ) ˆ f ( w )d w = f ( x ) , for every x ∈ R d . Then the Irie-Miyake itegral representation (Irie & Miyake (1988); see also theproof of Theorem 3.1 in Hornik et al. (1990)) gives f ( x ) = Z a ∈ R d Z b ∈ R ψ ( a ⊤ x + b )d ν ( a, b ) ( a.e. ) , where d ν ( a, b ) is given by d ν ( a, b ) = Re | ω | d e − i wb π ˆ ψ ( ω ) ! ˆ f ( wa )d a d b for any ω = 0 . Since the characteristic function of the multivariate normal distribution gives that Z R d exp(i w ⊤ ( x − c )) s h d (2 π ) d exp (cid:18) − h k w k (cid:19)| {z } = ˆ f ( w ) d w = exp (cid:18) − k x − c k h (cid:19) =: f ( x ) ( ∀ x ∈ R d ) , If ˆ f is integrable, this inversion formula holds for almost every x ∈ R d (Rudin, 1987). However, weassume a stronger condition that it holds for every x ∈ R d . exp (cid:18) − k x − c k h (cid:19) = Z a ∈ R d Z b ∈ R ψ h ( a ⊤ ( x − c ) + b )Re e − i wb π ˆ ψ h ( ω ) ! s | ωh | d (2 π ) d exp (cid:18) − ( ωh ) k a k (cid:19) d a d b, for all x ∈ R d . Since ψ h ( · ) = ψ ( h − · ) and ˆ ψ h ( · ) = h ˆ ψ ( h · ) by its definition, the right hand side isequivalent to Z a ∈ R d Z b ∈ R ψ ( h − [ a ⊤ ( x − c ) + b ])Re e − i wb πh ˆ ψ ( hω ) ! s | ωh | d (2 π ) d exp (cid:18) − ( ωh ) k a k (cid:19) d a d b. Here, we set ω = h − . Let N σ be the probability measure corresponding to the multivariatenormal with mean and covariance σ I , and let A D := { w ∈ R d | k w k ≤ D } . Let D a > and D b = D a ( √ d + 1) , and define f D a ( x ) := 12 D b N ( A D a ) Z k a k≤ D a , | b |≤ D b ψ ( h − [ a ⊤ ( x − c ) + b ])Re e − i b/h πh ˆ ψ (1) ! × s π ) d exp (cid:18) − k a k (cid:19) d a d b. Then, we can see that, for any x ∈ [0 , d , it holds that (cid:12)(cid:12)(cid:12)(cid:12) D b N ( A D a ) f ( x ) − f D a ( x ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ D b N ( A D a ) | πh ˆ ψ (1) | " N ( A cD a ) Z − h − | x | )d x + Z | b | >D b − [ h − ( | b | − √ dD a )])d b ≤ D b N ( A D a ) | πh ˆ ψ (1) | (cid:2) hN ( A cD a ) + 4 h exp( − D a ) (cid:3) = 4 h D b N ( A D a ) | πh ˆ ψ (1) | h C d D d − a exp( − D a /
2) + exp( − D a ) i , where C d > is a constant depending on only d , and we used | a ⊤ ( x − c ) + b | ≥ | b | − | a ⊤ ( x − c ) | ≥ | b | − √ dD a and ψ ( x ) ≤ −| x | ) . Note that if D a = n κ , then the right hand side is O ( h exp( − n κ )) . Therefore, since N ( A D a ) ≤ , by setting τ = h − D b , C = D b πh | ˆ ψ (1) | , we havethat inf ˇ g ∈ conv( ˜ F ( σ ) C,τ ) sup x ∈ [0 , d (cid:12)(cid:12)(cid:12)(cid:12) ˇ g ( x ) − exp (cid:18) − k x − c k h (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ≤ | π ˆ ψ (1) | h C d D d − a exp( − D a /
2) + exp( − D a ) i . Hence, by rewriting D w ← D a , we obtain the assertion. As noted above, the right hand is O (exp( − n κ )) if D a = n κ . Proof of Theorem 1.
For a sample size n , we fix m n which will be determined later and useProposition 2 with F ◦ = F ( n ) γ . If w ,m n = b p µ γm n / with | b | ≤ and w ,m = µ γ/ m n [ u ; − u ⊤ c ] / ( p d + 1)) for u ∈ R d such that k u k ≤ and c ∈ [0 , d , then k ( w ,m n , w ,m n ) k ≤ µ γm n (1 / | u ⊤ c | ) / d + 1)) ≤ µ γm n . Therefore, ˜ ϕ u,c ( x ) = a m n ¯ w ,m n σ m n ( w ⊤ ,m n [ x ; 1]) = µ α m n ( bµ γ/ m n / √ µ sα m n σ (cid:16) µ − α + γ/ m n u ⊤ ( x − c ) / p d + 1) (cid:17) ∈F ( n ) γ ⊂ F γ for all b ∈ R with | b | ≤ , u ∈ R d with k u k ≤ , and c ∈ [0 , d . In other words, µ α + γ/ sα m n (2 C ) − F ( σ ) C,τ ⊂ F ( n ) γ for any C > and τ = √ d +1) µ − α + γ/ m n .14enefit of deep learning with non-convex noisy gradient descentTherefore, by setting C = ( √ d + 1) D w / ( πh | ˆ ψ (1) | ) for D w > , Lemma 1 yields that for any c ∈ [0 , d and given h > , there exists g ∈ conv( F ( n ) γ ) such that (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) µ α + γ/ sα m n √ d + 1) D w πh | ˆ ψ (1) | ! − exp (cid:18) − k · − c k h (cid:19) − g (cid:13)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ µ α + γ/ sα m n √ d + 1) D w πh | ˆ ψ (1) | ! − | π ˆ ψ (1) | h C d D d − w exp( − D w /
2) + exp( − D w ) i = µ α + γ/ sα m n h ( √ d + 1) D w h C d D d − a exp( − D w /
2) + exp( − D w ) i . We let D w = n κ for any κ > and choose µ m n as τ ≃ µ − α + γ/ m n = D w h − = h − n κ . We write ∆ := µ α + γ/ sα m n (2 C ) − ≃ h α sα γ/ α − γ/ +1 n − κ ( α sα γ/ α − γ/ +1) . Then, it holds that (cid:13)(cid:13)(cid:13)(cid:13) ∆ exp (cid:18) − k · − c k h (cid:19) − g (cid:13)(cid:13)(cid:13)(cid:13) ∞ . ∆ exp( − n κ ) . (9)Here, we set h as h = 2 − k with a positive integer k . Accordingly, we define a partition A of Ω sothat any A ∈ A can be represented as A = [2 − k j , − k ( j + 1)] × · · · × [2 − k j d , − k ( j d + 1)] bynon-negative integers ≤ j i ≤ k − ( i = 1 , . . . , d ). Note that |A| = 2 dk = h − d .For each A ∈ A , we define c A as c A = (2 − k ( j + 1 / , . . . , − k ( j d + 1 / ⊤ where ( j , . . . , j d ) is a set of indexes that satisfies A = [2 − k j , − k ( j + 1)] × · · · × [2 − k j d , − k ( j d + 1)] . For each A ∈ A , we define g A ∈ conv( F ( n ) γ ) as a function that satisfies Eq. (9) for c = c A .Now, we apply Proposition 2 with F ◦ = conv( F ( n ) γ ) and K = K ′ = dk . Let R ∗ := R lin (conv( F ( n ) γ )) . First, we can see that there exits a constant F > such that g A ( x ) ≥ F ∆ ( ∀ x ∈ A ) , where we used exp( − n κ ) ≪ .Second, in the event E introduced in the statement of Proposition 2, there exists C such that |{ i ∈{ , . . . , n } | x i ∈ A ′ }| ≤ Cn/ − dk for all A ′ ∈ A . In this case, we can check that n n X i =1 (cid:20) ∆ exp (cid:18) − k x i − c A k h (cid:19)(cid:21) . ∆ h d = ∆ − kd , by the uniform continuity of the Gaussian RBF. Therefore, we also have n n X i =1 g A ( x i ) ≤ n n X i =1 (cid:20) ∆ exp (cid:18) − k x i − c A k h (cid:19)(cid:21) + c ∆ exp( − n κ ) . ∆ ( h d + exp( − n κ )) , where c > is a constant. Thus, as long as h is polynomial to n like h = Θ( n − a ) , the right handside is O (∆ h d ) .Now, if we write ˜ β = α + sα + γ/ α − γ/ α + ( s + 1) α α − γ/ , then we have ∆ ≃ h ˜ β n − κ ˜ β by its definition.Here, we choose k as a maximum integer that satisfies F ∆ − dk > R ∗ . In this situation, it holdsthat h β + d n − κ ˜ β ≃ R ∗ . Since Eq. (8b) is not satisfied, Eq. (8a) must hold, and hence we have n − h − d . R ∗ ≃ h β + d n − κ ˜ β ⇒ h ≃ n − − κ ˜ β β +2 d . Therefore, we obtain that R ∗ & n − β + d β +2 d n − κd ˜ β β +2 d ≥ n − β + d β +2 d n − κ ′ , by setting κ ′ = κ d ˜ β β +2 d . This gives the assertion. B P
B PROOFS OF PROPOSITION 1, THEOREM 2 AND COROLLARY 1

Proposition 1, Theorem 2 and Corollary 1 can be shown by using Propositions 3 and 4 given in Appendix B.1 below.

Let $T^{\alpha}W = (\mu_m^{\alpha}w_{1,m}, \mu_m^{\alpha}w_{2,m})_{m=1}^\infty$ for $W = (w_{1,m}, w_{2,m})_{m=1}^\infty$ and $\alpha > 0$, and let us consider the model $h_W := f_{T^{-\alpha/2}W}$. Then, the training error can be rewritten as $\hat{L}(f_W) = \hat{L}(h_{T^{\alpha/2}W})$. For notational simplicity, we let $\hat{L}(W) := \hat{L}(f_W)$.

Let $\mathcal{H}^{(M)} := \{W^{(M)} = (w_{1,m}, w_{2,m})_{m=1}^M \mid w_{1,m} \in \mathbb{R}^{d+1},\; w_{2,m} \in \mathbb{R},\; 1 \le m \le M\}$, and let $\iota: \mathcal{H}^{(M)} \to \mathcal{H}$ be the zero padding of $W^{(M)}$, that is, $\iota(W^{(M)}) = (w'_{1,m}, w'_{2,m})_{m=1}^\infty \in \mathcal{H}$ satisfies $w'_{1,m} = w_{1,m}$, $w'_{2,m} = w_{2,m}$ ($m \le M$) and $w'_{1,m} = 0$, $w'_{2,m} = 0$ ($m > M$). Moreover, we define $\iota^*: \mathcal{H} \to \mathcal{H}^{(M)}$ as the map that extracts the first $M$ components. By abuse of notation, we write $f_{W^{(M)}}$ for $W^{(M)} \in \mathcal{H}^{(M)}$ to indicate $f_{\iota(W^{(M)})}$. Finally, let $A^{(M)}: \mathcal{H}^{(M)} \to \mathcal{H}^{(M)}$ be the linear operator such that $A^{(M)}W^{(M)} = \iota^*(A\,\iota(W^{(M)}))$, which is just a truncation of $A$. Similarly, let $T^a_M W^{(M)}$ for $W^{(M)} \in \mathcal{H}^{(M)}$ be the operator corresponding to $T^a W$ for $W \in \mathcal{H}$, i.e., $T^a_M W^{(M)} = \iota^*(T^a\iota(W^{(M)}))$.
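To fix ideas, the following minimal Python sketch (our own illustration, not the paper's code; the finite truncation level and $c_\mu = 1$ are arbitrary assumptions) represents $\iota$, $\iota^*$ and $T^a$ on finitely many coordinates and checks the identity $T^a_M W^{(M)} = \iota^*(T^a\iota(W^{(M)}))$ used above.

```python
# Minimal sketch (our own illustration, not the paper's code): the zero padding
# iota, the truncation iota*, and the diagonal scaling T^a W = (mu_m^a w_{1,m},
# mu_m^a w_{2,m})_m on finitely many coordinates. mu_m = m^{-2} matches the
# decay allowed by Assumption 3-(i) with c_mu = 1 (an arbitrary choice).
import numpy as np

d, M, N = 3, 5, 10                                   # input dim, truncation, ambient size
mu = np.array([m ** -2.0 for m in range(1, N + 1)])  # mu_m = m^{-2}

def iota(W_M: list) -> list:
    """Zero padding H^(M) -> H: keep the first M pairs, set the rest to zero."""
    return list(W_M) + [(np.zeros(d + 1), 0.0)] * (N - len(W_M))

def iota_star(W: list) -> list:
    """Extract the first M components (H -> H^(M))."""
    return list(W[:M])

def T(a: float, W: list) -> list:
    """T^a W: a coordinate-wise (diagonal) rescaling by mu_m^a."""
    return [(mu[m] ** a * w1, mu[m] ** a * w2) for m, (w1, w2) in enumerate(W)]

rng = np.random.default_rng(1)
W_M = [(rng.normal(size=d + 1), rng.normal()) for _ in range(M)]
# T^a_M W^(M) = iota*(T^a iota(W^(M))): truncation commutes with the diagonal map
lhs = iota_star(T(0.5, iota(W_M)))
rhs = T(0.5, W_M)
print(all(np.allclose(u[0], v[0]) and np.isclose(u[1], v[1])
          for u, v in zip(lhs, rhs)))                # True
```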
B.1 AUXILIARY LEMMAS

First, we present some key propositions used to prove the main results. To do so, we utilize the results of Muzellec et al. (2020) and Suzuki (2020).
Assumption 3.
(i) There exists a constant $c_\mu$ such that $\mu_m \le c_\mu m^{-2}$.
(ii) There exist $B, L > 0$ such that the following two inequalities hold for some $a \in (1/2, 1]$, almost surely:
$$\|\nabla\hat{L}(W)\|_{\mathcal{H}} \le B \;\;(\forall W \in \mathcal{H}), \qquad \|\nabla\hat{L}(W) - \nabla\hat{L}(W')\|_{\mathcal{H}} \le L\|W - W'\|_{\mathcal{H}_{-a}} \;\;(\forall W, W' \in \mathcal{H}).$$
(iii) For any data $D_n$, $\hat{L}$ is three times differentiable. Let $\nabla^3\hat{L}(W)$ be the third-order derivative of $\hat{L}(W)$. This can be identified with a third-order linear form, and $\nabla^3\hat{L}(W)\cdot(h,k)$ denotes the Riesz representor of $l \in \mathcal{H} \mapsto \nabla^3\hat{L}(W)\cdot(h,k,l)$. There exist $\alpha' \in [0,1)$ and $C_{\alpha'} \in (0,\infty)$ such that, for all $W, h, k \in \mathcal{H}$,
$$\|\nabla^3\hat{L}(W)\cdot(h,k)\|_{\mathcal{H}_{-\alpha'}} \le C_{\alpha'}\|h\|_{\mathcal{H}}\|k\|_{\mathcal{H}}, \qquad \|\nabla^3\hat{L}(W)\cdot(h,k)\|_{\mathcal{H}} \le C_{\alpha'}\|h\|_{\mathcal{H}_{\alpha'}}\|k\|_{\mathcal{H}} \quad \text{(a.s.)}.$$
Remark 1. In the analysis of Bréhier & Kopec (2016), Muzellec et al. (2020) and Suzuki (2020), Assumption 3-(iii) is imposed on every finite-dimensional projection $\hat{L}(W^{(M)})$, viewed as a function on $\mathcal{H}^{(M)}$, for all $M \ge 1$, instead of on $\hat{L}(W)$ as a function on $\mathcal{H}$. However, the condition on $\hat{L}(W)$ gives a sufficient condition for any finite-dimensional projection in our setting. Thus, we employ the current version.
Assumption 4. For the loss function $\ell(y, f(x)) = (y - f(x))^2$, the following conditions hold:
(i) There exists $C > 0$ such that for any $f_W$ ($W \in \mathcal{H}$), it holds that
$$\mathbb{E}_{X,Y}\bigl[(\ell(Y, f_W(X)) - \ell(Y, f^*(X)))^2\bigr] \le C\,(L(f_W) - L(f^*)).$$
(ii) $\beta > 0$ is chosen so that, for any $h: \mathbb{R}^d \to \mathbb{R}$ and $x \in \mathrm{supp}(P_X)$, it holds that
$$\mathbb{E}_{Y|X=x}\Bigl[\exp\Bigl(-\frac{2\beta}{n}\bigl(\ell(Y, h(x)) - \ell(Y, f^*(x))\bigr)\Bigr)\Bigr] \le 1.$$
(iii) There exists $L_h > 0$ such that $\|\nabla_W\ell(Y, h_W(X)) - \nabla_W\ell(Y, h_{W'}(X))\|_{\mathcal{H}} \le L_h\|W - W'\|_{\mathcal{H}}$ ($\forall W, W' \in \mathcal{H}$) almost surely.
(iv) There exists $C_h$ such that $\|h_W - h_{W'}\|_\infty \le C_h\|W - W'\|_{\mathcal{H}}$ ($\forall W, W' \in \mathcal{H}$).
Proposition 3. Assume that Assumption 3 holds and $\beta > \eta$. Suppose that there exists $\bar{R} > 0$ such that $0 \le \ell(Y, f_W(X)) \le \bar{R}$ for any $W \in \mathcal{H}$ (a.s.). Let $\rho = \lambda\eta/\mu_1$ and $b = \frac{\mu_1}{\lambda}B^2 + \frac{c_\mu}{\beta\lambda}$. Accordingly, let $\bar{b} = \max\{b, 1\}$, $\kappa = \bar{b} + 1$ and $\bar{V} = 4\bar{b}\big/\bigl(\sqrt{(1+\rho/\eta)/2} - \rho/\eta\bigr)$. Then, the spectral gap of the dynamics is given by
$$\Lambda^*_\eta = \min\Bigl(\frac{\lambda}{\mu_1},\, 1\Bigr)\,\frac{\delta}{2\kappa(\bar{V}+1)/(1-\delta)}, \qquad (10)$$
where $0 < \delta < 1$ is a real number satisfying $\delta = \Omega(\exp(-\Theta(\mathrm{poly}(\lambda^{-1})\beta)))$. We define $\Lambda^* = \lim_{\eta\to 0}\Lambda^*_\eta$ (i.e., $\bar{V}$ is replaced by $4\bar{b}\big/\bigl(\sqrt{(1+\exp(-\lambda\mu_1))/2} - \exp(-\lambda\mu_1)\bigr)$). We also define $C_W = \kappa[\bar{V}+1] + \sqrt{2(\bar{R}+b)}/\sqrt{\delta}$. Then, for any $0 < a < 1/2$, the following convergence bound holds for almost sure observation $D_n$, where $L$ stands for either the population loss $L$ or the empirical loss $\hat{L}$:
$$\bigl|\mathbb{E}_{W_k}[L(W_k) \mid D_n] - \mathbb{E}_{W\sim\pi_\infty}[L(W) \mid D_n]\bigr| \qquad (11)$$
$$\le C\Bigl[C_W\exp(-\Lambda^*_\eta\eta k) + \frac{\sqrt{\beta}\,\eta^{1/2-a}}{\Lambda^*_\eta}\Bigr] =: \Xi'_k, \qquad (12)$$
where $C$ is a constant depending only on $c_\mu, B, L, C_{\alpha'}, a, \bar{R}$ (independent of $\eta, k, \beta, \lambda$).
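To see how the bound $\Xi'_k$ behaves in terms of $k$ and $\eta$, the following toy evaluation uses entirely made-up constants ($C = C_W = 1$, $\Lambda^*_\eta = 10^{-3}$, $a = 1/4$, $\beta = 10^2$); it only illustrates the trade-off that the first term decays with the number of iterations $k$ while the second is an $O(\eta^{1/2-a})$ discretization bias, so $\eta$ must be taken small and $k \gtrsim 1/(\Lambda^*_\eta\eta)$ large.

```python
# Toy evaluation of Xi'_k = C [ C_W exp(-Lambda*_eta eta k)
#                               + sqrt(beta) eta^{1/2 - a} / Lambda*_eta ].
# All constants are made up (C = C_W = 1, Lambda = 1e-3, a = 1/4, beta = 1e2);
# this only illustrates the k-vs-eta trade-off, not the paper's actual values.
import math

def xi(k: int, eta: float, beta: float, Lam: float = 1e-3,
       C: float = 1.0, C_W: float = 1.0, a: float = 0.25) -> float:
    return C * (C_W * math.exp(-Lam * eta * k)
                + math.sqrt(beta) * eta ** (0.5 - a) / Lam)

beta = 1e2
for eta in (1e-4, 1e-6, 1e-8):
    k = int(10 / (1e-3 * eta))   # run until the exponential term is negligible
    print(eta, k, xi(k, eta, beta))
```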
Proposition 4. Assume that Assumptions 3 and 4 hold. Let $\tilde{\alpha} := 1/\{2(\alpha+1)\}$ for a given $\alpha > 0$, and let $\theta$ be an arbitrary real number satisfying $0 < \theta < 1 - \tilde{\alpha}$. Assume that the true function $f^{\mathrm{o}}$ can be represented as $h_{W^*} = f^{\mathrm{o}}$ for some $W^* \in \mathcal{H}_{\theta(\alpha+1)}$. Then, if
$$M \ge \min\bigl\{\lambda^{\tilde{\alpha}/[\theta(\alpha+1)]}\beta^{1/[\theta(\alpha+1)]},\; \lambda^{-1/(\alpha+1)},\; n^{1/[\theta(\alpha+1)]}\bigr\},$$
the expected excess risk is bounded by
$$\mathbb{E}_{D_n}\Bigl[\mathbb{E}_{W^{(M)}_k}\bigl[L(h_{T^{\alpha/2}_M W^{(M)}_k}) \mid D_n\bigr] - L(f^{\mathrm{o}})\Bigr] \le C\max\bigl\{(\lambda\beta)^{\alpha/\theta}n^{-\alpha/\theta},\; \lambda^{-\tilde{\alpha}}\beta^{-1},\; \lambda^{\theta},\; 1/n\bigr\} + \Xi'_k, \qquad (13)$$
where $C$ is a constant independent of $n, \beta, \lambda, \eta, k$.

Proof. Repeating the same argument as in Proposition 1 and using the same notation, Proposition 3 gives
$$\bigl|\mathbb{E}_{W^{(M)}_k}[L(W^{(M)}_k) \mid D_n] - \mathbb{E}_{W\sim\pi^{(M)}_\infty}[L(W) \mid D_n]\bigr| \le \Xi'_k,$$
for any $1 \le M \le \infty$. Therefore, we just need to bound the following quantity:
$$\Bigl|\mathbb{E}_{D_n}\Bigl[\mathbb{E}_{W^{(M)}\sim\pi^{(M)}_\infty}\bigl[L(h_{T^{\alpha/2}_M W^{(M)}}) \mid D_n\bigr]\Bigr] - L(f^{\mathrm{o}})\Bigr|.$$
We define $\|W^{(M)}\|_{\mathcal{H}^{(M)}} := \|\iota(W^{(M)})\|_{\mathcal{H}}$ for $W^{(M)} \in \mathcal{H}^{(M)}$. For $a > 0$, we define $\mathcal{H}^{(M)}_a$ to be the projection of $\mathcal{H}_a$ onto the first $M$ components, $\mathcal{H}^{(M)}_a = \{\iota^*(W) \mid W \in \mathcal{H}_a\}$, and we define $\|W^{(M)}\|_{\mathcal{H}^{(M)}_a} := \|\iota(W^{(M)})\|_{\mathcal{H}_a}$ (note that since $\mathcal{H}^{(M)}_a$ is a finite-dimensional linear space, it is the same as $\mathcal{H}^{(M)}$ as a set). Let $\nu^{(M)}_\beta$ be the Gaussian measure on $\mathcal{H}^{(M)}$ with mean $0$ and covariance $(\beta A^{(M)})^{-1}$, and let $\tilde{\nu}^{(M)}_\beta$ be the Gaussian measure corresponding to the random variable $T^{\alpha/2}_M W^{(M)}$ with $W^{(M)} \sim \nu^{(M)}_\beta$. Let the concentration function be
$$\phi^{(M)}_{\beta,\lambda}(\epsilon) := \inf_{W \in \mathcal{H}^{(M)}_{\alpha+1}:\, L(h_W) - L(f^{\mathrm{o}}) \le \epsilon^2} \beta\lambda\|W\|^2_{\mathcal{H}^{(M)}_{\alpha+1}} - \log\tilde{\nu}^{(M)}_\beta\bigl(\{W \in \mathcal{H}^{(M)} : \|W\|_{\mathcal{H}^{(M)}} \le \epsilon\}\bigr) + \log(2).$$
(If there is no $W \in \mathcal{H}^{(M)}_{\alpha+1}$ that satisfies the condition in the infimum, we define $\phi^{(M)}_{\beta,\lambda}(\epsilon) = \infty$.) Let $\epsilon^* > 0$ be
$$\epsilon^* := \max\bigl\{\inf\{\epsilon > 0 \mid \phi^{(M)}_{\beta,\lambda}(\epsilon) \le \beta\epsilon^2\},\; 1/n\bigr\}.$$
Then, Suzuki (2020) showed the following bound:
$$\Bigl|\mathbb{E}_{D_n}\Bigl[\mathbb{E}_{W^{(M)}\sim\pi^{(M)}_\infty}\bigl[L(h_{T^{\alpha/2}_M W^{(M)}}) \mid D_n\bigr] - L(f^{\mathrm{o}})\Bigr]\Bigr| \le C\max\Bigl\{\epsilon^{*2},\; \frac{\beta}{n}\epsilon^{*2} + (\lambda\beta)^{\alpha/\theta}n^{-\alpha/\theta},\; \frac{1}{n}\Bigr\}. \qquad (14)$$
They also showed that, for $M = \infty$, it holds that
$$\epsilon^{*2} \lesssim \max\bigl\{(\lambda\beta)^{-\tilde{\alpha}}\beta^{-(1-\tilde{\alpha})},\; \lambda^{\theta},\; n^{-1}\bigr\} = \max\bigl\{\lambda^{-\tilde{\alpha}}\beta^{-1},\; \lambda^{\theta},\; n^{-1}\bigr\}.$$
Substituting this bound on $\epsilon^*$ into Eq. (14), we obtain Eq. (13) for $M = \infty$. Moreover, in their proof, if $M \ge (\epsilon^*)^{-2/[\theta(\alpha+1)]}$, then
$$\inf_{W \in \mathcal{H}^{(M)}_{\alpha+1}:\, L(h_W) - L(f^{\mathrm{o}}) \le \epsilon^{*2}} \beta\lambda\|W\|^2_{\mathcal{H}^{(M)}_{\alpha+1}} \lesssim \beta\epsilon^{*2}.$$
Finally, since $\tilde{\nu}^{(M)}_\beta$ is a marginal distribution of $\tilde{\nu}^{(\infty)}_\beta$, it holds that
$$-\log\tilde{\nu}^{(M)}_\beta\bigl(\{W \in \mathcal{H}^{(M)} : \|W\|_{\mathcal{H}^{(M)}} \le \epsilon\}\bigr) \le -\log\tilde{\nu}^{(\infty)}_\beta\bigl(\{W \in \mathcal{H} : \|W\|_{\mathcal{H}} \le \epsilon\}\bigr).$$
Therefore, as long as $M \ge (\epsilon^*)^{-2/[\theta(\alpha+1)]}$, the rate of $\epsilon^*$ is not deteriorated from that of $M = \infty$. In other words, if $M \ge \min\{\lambda^{\tilde{\alpha}/[\theta(\alpha+1)]}\beta^{1/[\theta(\alpha+1)]},\, \lambda^{-\theta/[\theta(\alpha+1)]},\, n^{1/[\theta(\alpha+1)]}\}$, the bound (13) holds.
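The critical radius $\epsilon^*$ above is a fixed-point quantity, and it can be located numerically by bisection. As a toy illustration (the concentration function $\phi(\epsilon) = A\epsilon^{-p}$ below is a made-up placeholder with the typical decreasing shape, and $A$, $p$, $\beta$, $n$ are arbitrary values, not quantities from the paper):

```python
# Toy illustration of eps* = max{ inf{eps > 0 : phi(eps) <= beta * eps^2 }, 1/n }.
# phi(eps) = A * eps^{-p} is a placeholder concentration function; A, p, beta, n
# are made up. Since phi decreases and beta*eps^2 increases, bisection applies.
def critical_radius(phi, beta: float, n: int, lo: float = 1e-12, hi: float = 1e6) -> float:
    for _ in range(200):
        mid = (lo * hi) ** 0.5          # geometric bisection across many scales
        if phi(mid) <= beta * mid ** 2:
            hi = mid
        else:
            lo = mid
    return max(hi, 1.0 / n)

A, p, beta, n = 1.0, 1.5, 1e3, 10_000
print(critical_radius(lambda e: A * e ** -p, beta, n))
# The crossing solves A eps^{-p} = beta eps^2, i.e., eps* = (A/beta)^{1/(p+2)};
# here (1e-3)^{1/3.5} is about 0.139, matching the bisection output.
print((A / beta) ** (1 / (p + 2)))
```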
Remark 2. Suzuki (2020) showed Proposition 4 under the condition $\alpha > 1/2$. However, this condition is used only to ensure Assumption 3. In our setting, we can show that Assumption 3 holds directly, and thus we may omit the condition $\alpha > 1/2$.
B.2 PROOFS OF PROPOSITION 1, THEOREM 2 AND COROLLARY 1
Proof of Proposition 1 and Theorem 2.
Let $\bar{R} = (2\sum_{m=1}^\infty a_m R + U)^2$. Then, we can easily check that $(y_i - f_W(x_i))^2 \le \bar{R}$. As stated above, we use Propositions 3 and 4 to show the statements.

First, we show Proposition 1 for the dynamics of $W^{(M)}_k$ for any $1 \le M \le \infty$. However, it suffices to show the statement only for $M = \infty$ because the finite-dimensional version can be seen as a specific case of the infinite-dimensional one. Indeed, the dynamics of $W^{(M)}_k$ is the same as that of $\iota^*(\tilde{W}_k)$, where $\tilde{W}_k \in \mathcal{H}$ obeys the following dynamics:
$$\tilde{W}_{k+1} = S_\eta\Bigl(\tilde{W}_k - \eta\nabla\hat{L}(f_{\iota(\tilde{W}_k)}) + \sqrt{\tfrac{2\eta}{\beta}}\,\xi_k\Bigr).$$
This is because $f_{\iota(\tilde{W}_k)}$ is determined only by the first $M$ components of $\tilde{W}_k$, $\iota(\nabla\hat{L}(f_{\iota(\tilde{W}_k)})) = \nabla_{W^{(M)}}\hat{L}(f_{W^{(M)}})\big|_{W^{(M)} = \iota^*(\tilde{W}_k)}$, and $S_\eta$ is a diagonal operator. Since the components of $\tilde{W}_k$ with indexes higher than $M$ do not affect the objective, smoothness of the objective is not lost. The stationary distribution $\pi^{(M)}_\infty$ of the continuous dynamics corresponding to $W^{(M)}$ is a probability measure on $\mathcal{H}^{(M)}$ that satisfies
$$\frac{\mathrm{d}\pi^{(M)}_\infty}{\mathrm{d}\nu^{(M)}_\beta}(W^{(M)}) \propto \exp(-\beta\hat{L}(f_{W^{(M)}})),$$
where $\nu^{(M)}_\beta$ is the Gaussian measure on $\mathbb{R}^{M\times(d+2)}$ with mean $0$ and covariance $(\beta A^{(M)})^{-1}$. We can see that this is the marginal distribution of the stationary distribution of the continuous-time counterpart of $\tilde{W}_k$: $\mathrm{d}\tilde{\pi}_\infty(\tilde{W}) \propto \exp(-\beta\hat{L}(f_{\iota(\tilde{W})}))\,\mathrm{d}\nu_\beta$. Therefore, we just need to consider the infinite-dimensional one. For this reason, we show the convergence for the original infinite-dimensional dynamics $(W_k)_{k=1}^\infty$; the convergence of the finite-dimensional one $(W^{(M)}_k)_{k=1}^\infty$ can be shown in the same manner using the argument above.

To show Proposition 1, we use Proposition 3, and thus we need to check the validity of Assumption 3. Assumption 3-(i) is ensured by Assumption 1. Next, we check Assumption 3-(ii). The boundedness of the gradient can be shown as follows:
$$\|\nabla\hat{L}(f_W)\|_{\mathcal{H}} \le \sum_{m=1}^\infty\Bigl(\Bigl\|\frac{1}{n}\sum_{i=1}^n 2(f_W(x_i) - y_i)\,\bar{w}_{2,m}\,a_m\,[x_i;1]\,\sigma'_m(w_{1,m}^\top[x_i;1])\Bigr\| + \Bigl|\frac{1}{n}\sum_{i=1}^n 2(f_W(x_i) - y_i)\,a_m\tanh'(w_{2,m}/R)\,\sigma_m(w_{1,m}^\top[x_i;1])\Bigr|\Bigr)$$
$$\le \sum_{m=1}^\infty\bigl(2\sqrt{\bar{R}}\,R\,a_m(d+1)C_\sigma + 4\sqrt{\bar{R}}\,a_m\bigr) \qquad (\because |f_W(x_i) - y_i| \le \sqrt{\bar{R}},\; \|\sigma'_m\|_\infty \le C_\sigma,\; \|\tanh'\|_\infty \le 1)$$
$$\le 4\sqrt{\bar{R}}\,\bigl[R\,C_\sigma(d+1) + 1\bigr]\sum_{m=1}^\infty a_m < \infty.$$
Similarly, we can show the Lipschitz continuity of the gradient as
$$\|\nabla\hat{L}(f_W) - \nabla\hat{L}(f_{W'})\|_{\mathcal{H}} \le \sum_{m=1}^\infty \mu_m^{-\alpha}\mu_m^{\alpha}\Bigl\{2\sqrt{\bar{R}}\,a_m(d+1)C_\sigma\bigl[|w_{2,m} - w'_{2,m}| + R\|w_{1,m} - w'_{1,m}\|\bigr] + 4\sqrt{\bar{R}}\,a_m\bigl[|w_{2,m} - w'_{2,m}|/R + C_\sigma(d+1)\|w_{1,m} - w'_{1,m}\|\bigr]\Bigr\} \qquad (\because \|\tanh''\|_\infty \le 1)$$
$$\le 4\sqrt{\bar{R}}\,\bigl[(d+1)C_\sigma(1+R) + 1/R + C_\sigma(d+1)\bigr]\max_{m\in\mathbb{N}}\{\mu_m^{-\alpha}a_m\} \times \sum_{m=1}^\infty \mu_m^{\alpha}\bigl[|w_{2,m} - w'_{2,m}| + \|w_{1,m} - w'_{1,m}\|\bigr] \lesssim \|W - W'\|_{\mathcal{H}_{-\alpha}}.$$
We can also verify Assumption 3-(iii) in a similar way. Thus, Assumption 3 is verified; we may therefore apply Proposition 3, which yields Proposition 1.

Next, we show Theorem 2 by using Proposition 4. For that purpose, we need to verify Assumption 4.
The first condition can be verified as
$$\mathbb{E}_{X,Y}\bigl[((Y - f_W(X))^2 - (Y - f^{\mathrm{o}}(X))^2)^2\bigr] = \mathbb{E}_{X,\epsilon}\bigl[((f^{\mathrm{o}}(X) + \epsilon - f_W(X))^2 - \epsilon^2)^2\bigr] = \mathbb{E}_{X,\epsilon}\bigl[((f^{\mathrm{o}}(X) - f_W(X))^2 + 2\epsilon(f^{\mathrm{o}}(X) - f_W(X)))^2\bigr]$$
$$= \mathbb{E}_{X}\bigl[(f^{\mathrm{o}}(X) - f_W(X))^4\bigr] + 4\,\mathbb{E}[\epsilon^2]\,\mathbb{E}_X\bigl[(f^{\mathrm{o}}(X) - f_W(X))^2\bigr] \le \bigl(\|f^{\mathrm{o}} - f_W\|_\infty^2 + 4U^2\bigr)\,\mathbb{E}_X\bigl[(f^{\mathrm{o}}(X) - f_W(X))^2\bigr]$$
$$\le \bar{R}\,\mathbb{E}_X\bigl[(f^{\mathrm{o}}(X) - f_W(X))^2\bigr] = \bar{R}\,\bigl(L(f_W) - L(f^{\mathrm{o}})\bigr),$$
where we used that the noise $\epsilon$ is mean zero and bounded by $U$.

The second condition can be checked as follows. Note that
$$\mathbb{E}_{Y|X=x}\Bigl(\exp\Bigl\{-\frac{2\beta}{n}\bigl[(Y - f_W(x))^2 - (Y - f^{\mathrm{o}}(x))^2\bigr]\Bigr\}\Bigr) = \mathbb{E}_\epsilon\Bigl(\exp\Bigl[-\frac{2\beta}{n}(f^{\mathrm{o}}(x) - f_W(x))^2 + \frac{4\beta}{n}\epsilon(f_W(x) - f^{\mathrm{o}}(x))\Bigr]\Bigr)$$
$$= \exp\Bigl[-\frac{2\beta}{n}(f^{\mathrm{o}}(x) - f_W(x))^2\Bigr]\,\mathbb{E}_\epsilon\Bigl\{\exp\Bigl[\frac{4\beta}{n}\epsilon(f_W(x) - f^{\mathrm{o}}(x))\Bigr]\Bigr\} \le \exp\Bigl[-\frac{2\beta}{n}(f^{\mathrm{o}}(x) - f_W(x))^2\Bigr]\exp\Bigl[\frac{1}{8}\Bigl(\frac{4\beta}{n}\Bigr)^2 U^2(f_W(x) - f^{\mathrm{o}}(x))^2\Bigr],$$
where the last inequality follows from Hoeffding's lemma applied to the bounded noise $\epsilon$. Thus, under the condition $\beta \le n/(2U^2)$, the right-hand side can be upper bounded by
$$\exp\Bigl[-\frac{2\beta}{n}\Bigl(1 - \frac{U^2\beta}{n}\Bigr)(f_W(x) - f^{\mathrm{o}}(x))^2\Bigr] \le 1.$$

Next, we check the third and fourth conditions. Noting that
$$\nabla_W h_W(X) = \Bigl(a_m\,\overline{(\mu_m^{-\alpha/2}w_{2,m})}\,\mu_m^{-\alpha/2}[X;1]\,\sigma'_m(\mu_m^{-\alpha/2}w_{1,m}^\top[X;1]),\;\; a_m\,\mu_m^{-\alpha/2}\tanh'(\mu_m^{-\alpha/2}w_{2,m}/R)\,\sigma_m(\mu_m^{-\alpha/2}w_{1,m}^\top[X;1])\Bigr)_{m=1}^\infty,$$
we have that
$$\|\nabla_W h_W(X)\|_{\mathcal{H}} \le \sum_{m=1}^\infty a_m\mu_m^{-\alpha}\bigl[(d+1)R\,C_\sigma + 1\bigr] \le \bigl[(d+1)R\,C_\sigma + 1\bigr]\sum_{m=1}^\infty \mu_m^{-\alpha+2\alpha} \le \bigl[(d+1)R\,C_\sigma + 1\bigr]\,c_\mu^{-\alpha+2\alpha}\sum_{m=1}^\infty m^{-2(-\alpha+2\alpha)} =: C_1 < \infty \qquad (\because -\alpha + 2\alpha = \alpha > 1/2),$$
and
$$\|\nabla_W h_W(X) - \nabla_W h_{W'}(X)\|_{\mathcal{H}} \le \sum_{m=1}^\infty a_m\mu_m^{-\alpha}(d+1)\bigl[\mu_m^{-\alpha/2}|w_{2,m} - w'_{2,m}| + R\,\mu_m^{-\alpha/2}\|w_{1,m} - w'_{1,m}\|\bigr] + a_m\mu_m^{-\alpha}\bigl[\mu_m^{-\alpha/2}|w_{2,m} - w'_{2,m}|/R + C_\sigma(d+1)\mu_m^{-\alpha/2}\|w_{1,m} - w'_{1,m}\|\bigr]$$
$$\le \sum_{m=1}^\infty a_m\mu_m^{-3\alpha/2}\bigl[(d+1)(1+R) + 1/R + C_\sigma(d+1)\bigr]\bigl[\|w_{1,m} - w'_{1,m}\| + |w_{2,m} - w'_{2,m}|\bigr]$$
$$\le c_\mu^{\alpha/2}\max_m\{\mu_m^{2\alpha-3\alpha/2}\}\bigl[(d+1)(1+R) + 1/R + C_\sigma(d+1)\bigr]\,\|W - W'\|_{\mathcal{H}} =: C_2\|W - W'\|_{\mathcal{H}},$$
for a constant $0 < C_2 < \infty$. Therefore, it holds that $|h_W(X) - h_{W'}(X)| \le C_1\|W - W'\|_{\mathcal{H}}$, which yields the fourth condition, and we also have
$$\|\nabla_W\ell(Y, h_W(X)) - \nabla_W\ell(Y, h_{W'}(X))\|_{\mathcal{H}} = \bigl\|2(h_W(X) - Y)\nabla_W h_W(X) - 2(h_{W'}(X) - Y)\nabla_W h_{W'}(X)\bigr\|_{\mathcal{H}}$$
$$\le 2\bigl\|(h_W(X) - Y)(\nabla_W h_W(X) - \nabla_W h_{W'}(X))\bigr\|_{\mathcal{H}} + 2\bigl\|(h_W(X) - h_{W'}(X))\nabla_W h_{W'}(X)\bigr\|_{\mathcal{H}} \le 2\sqrt{\bar{R}}\,C_2\|W - W'\|_{\mathcal{H}} + 2C_1^2\|W - W'\|_{\mathcal{H}} \lesssim \|W - W'\|_{\mathcal{H}},$$
which yields the third condition.

Since $f^{\mathrm{o}} \in \mathcal{F}_\gamma$, there exists $W^* \in \mathcal{H}_\gamma$ such that $f^{\mathrm{o}} = f_{W^*}$. Therefore, applying Proposition 4 with this $\alpha$ (so that $\tilde{\alpha} = 1/[2(\alpha+1)]$) and $\theta = \gamma/(1+\alpha)$ (since $\gamma < 1/\alpha$, the condition $\theta < 1 - \tilde{\alpha}$ is satisfied), we see that if $M \ge \min\{\lambda^{1/[2\gamma(\alpha+1)]}\beta^{1/\gamma},\, \lambda^{-1/(\alpha+1)},\, n^{1/\gamma}\}$, the following excess risk bound holds:
$$\mathbb{E}_{D_n}\bigl[\mathbb{E}_{W^{(M)}_k}[L(W^{(M)}_k) \mid D_n] - L(f^*)\bigr] \lesssim \max\bigl\{(\lambda\beta)^{\alpha/\theta}n^{-\alpha/\theta},\; \lambda^{-\tilde{\alpha}}\beta^{-1},\; \lambda^{\theta},\; 1/n\bigr\} + \Xi_k.$$
Finally, by noting $L(W^{(M)}_k) - L(f^*) = \|f_{W^{(M)}_k} - f^*\|^2_{L^2(P_X)}$, we obtain the assertion.
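The reduction used in the proof above (the first $M$ coordinates of the noisy update evolve autonomously because $S_\eta$ is diagonal and the gradient vanishes on the tail) can be checked numerically. The following is a minimal sketch, not the paper's code: the loss is a toy quadratic, and $S_\eta$ is modeled as the coordinate-wise shrinkage $(1 + \eta\lambda/\mu_m)^{-1}$, a hypothetical concrete choice of a diagonal operator; only the diagonality matters for the argument.

```python
# Minimal sketch (all concrete choices are our own assumptions) of the noisy
# update W_{k+1} = S_eta(W_k - eta * grad + sqrt(2*eta/beta) * xi_k). The toy
# loss depends only on the first M coordinates, and S_eta is modeled here as
# the diagonal shrinkage (1 + eta*lam/mu_m)^{-1}; the proof only uses that
# S_eta is diagonal, so the first M coordinates evolve autonomously.
import numpy as np

M, N, eta, beta, lam = 4, 12, 1e-2, 1e3, 1e-1
mu = np.array([m ** -2.0 for m in range(1, N + 1)])   # Assumption 3-(i) shape
target = np.linspace(1.0, 2.0, M)                      # toy regression target

def grad(W: np.ndarray) -> np.ndarray:
    """Toy empirical-loss gradient: depends only on the first M coordinates."""
    g = np.zeros_like(W)
    g[:M] = W[:M] - target
    return g

def step(W: np.ndarray, xi: np.ndarray) -> np.ndarray:
    drift = W - eta * grad(W) + np.sqrt(2 * eta / beta) * xi
    return drift / (1.0 + eta * lam / mu[: len(W)])    # diagonal S_eta

rng = np.random.default_rng(2)
xis = rng.normal(size=(500, N))
W_full = np.zeros(N)       # ambient chain (infinite-dimensional, truncated here)
W_trunc = np.zeros(M)      # finite-dimensional chain W^(M)
for xi in xis:
    W_full = step(W_full, xi)
    W_trunc = step(W_trunc, xi[:M])  # same noise on the first M coordinates
print(np.allclose(W_full[:M], W_trunc))  # True: truncation commutes with the dynamics
```

Padding with tail coordinates leaves the first $M$ coordinates unchanged, which is exactly why the finite-dimensional chain $W^{(M)}_k$ can be analyzed as a projection of the infinite-dimensional one.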
Finally, we give the proof of Corollary 1.

Proof of Corollary 1. Note that
$$f_W(x) = \sum_{m=1}^\infty a_m\bar{w}_{2,m}\,\sigma_m(w_{1,m}^\top[x;1]) = \sum_{m=1}^\infty \mu_m^{2\alpha}\bar{w}_{2,m}\,\mu_m^{q\alpha}\mu_m^{-q\alpha}\mu_m^{s\alpha}\,\sigma(\mu_m^{-\alpha}w_{1,m}^\top[x;1]) \qquad (\because a_m = \mu_m^{2\alpha},\; b_m = \mu_m^{\alpha})$$
$$= \sum_{m=1}^\infty \mu_m^{2\alpha+q\alpha}\bar{w}_{2,m}\,\mu_m^{(s-q)\alpha}\,\sigma(\mu_m^{-\alpha}w_{1,m}^\top[x;1]).$$
Therefore, we may redefine $\alpha' \leftarrow \alpha + q\alpha/2$ and $s' \leftarrow s - q$ so that we obtain another representation of the model $\mathcal{F}_\gamma$:
$$\mathcal{F}_\gamma = \Bigl\{f_W(x) = \sum_{m=1}^\infty \mu_m^{2\alpha'}\bar{w}_{2,m}\,\check{\sigma}_m(w_{1,m}^\top[x;1]) \;\Big|\; W \in \mathcal{H}_\gamma,\, \|W\|_{\mathcal{H}_\gamma} \le 1\Bigr\},$$
where $\check{\sigma}_m(\cdot) = \mu_m^{s'\alpha}\sigma(\mu_m^{-\alpha}\cdot)$. Note that the condition $0 \le q \le s - 1$ gives $s - q \ge 1$. Therefore, Assumptions 3 and 4 remain valid for the redefined parameters $\alpha'$, $s'$ and $\check{\sigma}_m$ instead of $\alpha$, $s$ and $\sigma_m$. Therefore, we can apply Theorem 2 by simply replacing $\alpha$ by $\alpha' = \alpha + q\alpha/2$.