On the generalization of Bayesian deep nets for multi-class classification
Yossi Adi, Yaniv Nemcovsky, Alex Schwing, Tamir Hazan

Abstract
Generalization bounds which assess the difference between the true risk and the empirical risk have been studied extensively. However, to obtain bounds, current techniques use strict assumptions such as a uniformly bounded or a Lipschitz loss function. To avoid these assumptions, in this paper we propose a new generalization bound for Bayesian deep nets by exploiting the contractivity of the log-Sobolev inequalities. Using these inequalities adds a loss-gradient norm term to the generalization bound, which is intuitively a surrogate of the model complexity. Empirically, we analyze the effect of this loss-gradient norm term using different deep nets.
1. Introduction
Deep neural networks are ubiquitous across disciplines and often achieve state-of-the-art results. Although deep nets are able to encode highly complex input-output relations, in practice they do not tend to overfit (Zhang et al., 2016). This tendency not to overfit has been investigated in numerous works on generalization bounds. Indeed, many generalization bounds apply to composite functions specified by deep nets. However, most of these bounds assume that the loss function is bounded or Lipschitz. Unfortunately, this assumption excludes many deep nets and Bayesian deep nets that rely on the popular negative log-likelihood (NLL) loss.

In this work we introduce a new PAC-Bayesian generalization bound for unbounded loss functions with unbounded gradient-norm, i.e., non-Lipschitz functions. This setting is closer to present-day deep net training, which uses the unbounded NLL loss and requires avoiding large gradient values during training so as to prevent exploding gradients. To prove the bound we utilize the contractivity of the log-Sobolev inequality (Ledoux, 1999). It enables us to bound the moment-generating function of the model risk. Our PAC-Bayesian bound adds a novel complexity term to existing PAC-Bayesian bounds: the expected norm of the loss-function gradients computed with respect to the input.

Bar-Ilan University, Technion, University of Illinois. Correspondence to: Yossi Adi <[email protected]>, Tamir Hazan <[email protected]>.
Figure 1.
The proposed bound as a function of λ for both ResNet (top) and a linear model (bottom). Notice that this suggests that the random variables L_D(w) − L_S(w) for both the ResNet and the linear model are sub-gamma (see Definition 1 and Theorem 1). We obtain the results for ResNet using the CIFAR-10 dataset and for the linear model using the MNIST dataset. The sub-gamma fit uses v = 1.0 for both models, with a separate small value of c for ResNet and for the linear model.

Intuitively, this gradient-norm term measures the complexity of the loss function, i.e., the model. In our work we prove that this complexity term is sub-gamma when considering linear models with the NLL loss, or more generally, for any linear model with a Lipschitz loss function. We also derive a bound for any Bayesian deep net, which permits verifying empirically that this complexity term is sub-gamma; see Figure 1. This new term, which measures the complexity of the model, augments existing PAC-Bayesian bounds for bounded or Lipschitz loss functions, which typically consist of two terms: (1) the empirical risk, which measures the fit of the posterior over the parameters to the training data, and (2) the KL-divergence between the prior and the posterior distributions over the parameters, which measures the complexity of learning the posterior when starting from the prior over the parameters.
2. Related work
Generalization bounds for deep nets were explored in various settings. VC-theory provides both upper bounds and lower bounds on the network's VC-dimension, which are linear in the number of network parameters (Bartlett et al., 2017b; 2019). While VC-theory asserts that such a model should overfit, as it can learn any random labeling (e.g., Zhang et al. (2016)), surprisingly, deep nets generally do not overfit.

Rademacher complexity allows applying data-dependent bounds to deep nets (Bartlett & Mendelson, 2002; Neyshabur et al., 2015; Bartlett et al., 2017a; Golowich et al., 2017; Neyshabur et al., 2018). These bounds rely on the loss and the Lipschitz constant of the network and consequently depend on a product of norms of weight matrices, which scales exponentially in the network depth. Wei & Ma (2019) developed a bound that considers the gradient-norm over training examples. In contrast, our bound depends on average quantities of the gradient-norm, and thus we answer an open question of Bartlett et al. (2017a) about the existence of bounds that depend on the average loss and the average gradient-norm, albeit in a PAC-Bayesian setting. PAC-Bayesian bounds that use Rademacher complexity have been studied by Kakade et al. (2009); Yang et al. (2019).

Stability bounds may be applied to unbounded loss functions and in particular to the negative log-likelihood (NLL) loss (Bousquet & Elisseeff, 2002; Rakhlin et al., 2005; Shalev-Shwartz et al., 2009; Hardt et al., 2015; Zhang et al., 2016). However, stability bounds for convex loss functions, e.g., for logistic regression, do not apply to deep nets, since they require the NLL loss to be a convex function of the parameters. Alternatively, Hardt et al. (2015) and Kuzborskij & Lampert (2017) estimate the stability of stochastic gradient descent dynamics, which strongly relies on early stopping. This approach results in weaker bounds for the non-convex setting. Stability PAC-Bayesian bounds for bounded and Lipschitz loss functions were developed by London (2017).

PAC-Bayesian bounds were recently applied to deep nets (McAllester, 2013; Dziugaite & Roy, 2017; Neyshabur et al., 2017). In contrast to our work, those related works all consider bounded loss functions. An excellent survey on PAC-Bayesian bounds was provided by Germain et al. (2016). Alquier et al. (2016) introduced PAC-Bayesian bounds for linear classifiers with the hinge-loss by explicitly bounding its moment generating function. Alquier et al. (2012) provide an analysis of PAC-Bayesian bounds for Lipschitz functions. Our work differs as we derive PAC-Bayesian bounds for non-Lipschitz functions. Work by Germain et al. (2016) is closer to our setting and considers PAC-Bayesian bounds for the quadratic loss function. In contrast, our work considers the multi-class setting and non-linear models. PAC-Bayesian bounds for the NLL loss in the online setting were put forward by Takimoto & Warmuth (2000); Banerjee (2006); Bartlett et al. (2013); Grünwald & Mehta (2017). The online setting does not consider the whole sample space and is therefore simpler to analyze in the Bayesian setting.

PAC-Bayesian bounds for the NLL loss function are intimately related to learning Bayesian inference (Germain et al., 2016). Recently many works applied various posteriors in Bayesian deep nets. Gal & Ghahramani (2015); Gal (2016) introduce a Bayesian inference approximation using Monte Carlo (MC) dropout, which approximates a Gaussian posterior using Bernoulli dropout. Srivastava et al. (2014); Kingma et al. (2015) introduced Gaussian dropout, which effectively creates a Gaussian posterior that couples the mean and the variance of the learned parameters, and explored the relevant log-uniform priors. Blundell et al. (2015); Louizos & Welling (2016) suggest taking a full Bayesian perspective and learning separately the mean and the variance of each parameter.
3. Background
Generalization bounds provide statistical guarantees on learning algorithms. They assess how the learned parameters w of a model perform on test data given the model's result on the training data S = {(x_1, y_1), ..., (x_m, y_m)}, where x_i is a data instance and y_i is the corresponding label. The performance of the parametric model is measured by a loss function ℓ(w, x, y). The risk of the model is its average loss, when the data instance and its label are sampled from the true but unknown distribution D. We denote the risk by L_D(w) = E_{(x,y)∼D} ℓ(w, x, y). The empirical risk is the average training-set loss L_S(w) = (1/m) Σ_{i=1}^m ℓ(w, x_i, y_i).

PAC-Bayesian theory bounds the expected risk E_{w∼q} L_D(w) of a model when its parameters are averaged over the learned posterior distribution q. The parameters of the posterior distribution are learned from the training data S. In our work we start from the following PAC-Bayesian bound:

Theorem 1 (Alquier et al. (2016)). Let KL(q‖p) = ∫ q(w) log(q(w)/p(w)) dw be the KL-divergence between two probability density functions p, q. For any λ > 0, for any δ ∈ (0, 1) and for any prior distribution p, with probability at least 1 − δ over the draw of the training set S, the following holds simultaneously for any posterior distribution q:

E_{w∼q}[L_D(w)] ≤ E_{w∼q}[L_S(w)] + (C(λ, p) + KL(q‖p) + log(1/δ)) / λ,   (1)

where

C(λ, p) = log E_{w∼p, S∼D^m}[e^{λ(L_D(w) − L_S(w))}].   (2)

Unfortunately, the complexity term C(λ, p) of this bound is impossible to compute for large values of λ, as we show in our experimental evaluation. To deal with this complexity term, Alquier et al. (2016); Germain et al. (2016); Boucheron et al. (2013) consider the sub-Gaussian assumption, which amounts to the bound C(λ, p) ≤ λ²v/2 for any λ ∈ R and some variance factor v. This assumption is also referred to as the Hoeffding assumption, as it relates to Hoeffding's lemma, which is usually applied in PAC-Bayesian bounds to loss functions that are uniformly bounded by a constant, i.e., bounded for all w, x, y simultaneously.

Unfortunately, many loss functions that are used in practice are unbounded. In particular, the NLL loss function is unbounded even in the multi-class setting, where y is discrete. For instance, consider a fully connected deep net, where the input vector of the k-th layer is a function of the parameters of all previous layers, i.e., x^k(W^1, ..., W^{k−1}). The entries of x^k are computed from the response of the preceding layer, i.e., W^{k−1}x^{k−1}, followed by a transfer function σ(·), i.e., x^k = σ(W^{k−1}x^{k−1}). Since the NLL is defined as −log p(y|x, w) = −(W^k x^k)_y + log(Σ_ŷ e^{(W^k x^k)_ŷ}), the NLL loss increases with r and is unbounded as r → ∞ when the rows of W^k consist of the vector r x^k. In our experimental validation in Section 5 we show that the unboundedness of the NLL loss results in a complexity term C(λ, p) that is not sub-Gaussian.

The complexity term C(λ, p) influences the value of λ, which controls the convergence rate of the bound, as the bound weighs the terms C(λ, p) and KL(q‖p) by 1/λ. Therefore, a tight bound requires λ to be as large as possible. However, since λ influences C(λ, p) exponentially, one needs to make sure that C(λ, p) < ∞.
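To make the structure of Eq. (1) concrete, the short sketch below evaluates its right-hand side from pre-computed estimates of the empirical risk, the KL term, and the complexity term. All numeric values are illustrative placeholders, not quantities from the paper.

```python
import numpy as np

def pac_bayes_bound(emp_risk, C_lambda_p, kl, delta, lam):
    """Right-hand side of Eq. (1): empirical risk plus the
    lambda-weighted complexity, KL, and confidence terms."""
    return emp_risk + (C_lambda_p + kl + np.log(1.0 / delta)) / lam

# Illustrative numbers only (not taken from the paper).
m = 60000                      # training-set size
lam = np.sqrt(m)               # a common choice for the rate
emp_risk = 0.08                # E_{w~q}[L_S(w)], e.g. estimated over w ~ q
C_lambda_p = 5.0               # a bound on the complexity term C(lambda, p)
kl = 8500.0                    # KL(q || p) between posterior and prior
delta = 0.05

print(pac_bayes_bound(emp_risk, C_lambda_p, kl, delta, lam))
```

The sketch makes explicit that the larger λ is, the smaller the weight of the KL and confidence terms, which is exactly why one wants C(λ, p) to stay finite for large λ.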
In such cases one may use sub-gamma random variables (Alquier et al., 2016; Germain et al., 2016; Boucheron et al., 2013):

Definition 1.
The random variable L_D(w) − L_S(w) is called sub-gamma if the complexity term C(λ, p) in Theorem 1 satisfies C(λ, p) ≤ λ²v / (2(1 − λc)) for every λ such that 0 < λ < 1/c.

In Corollary 1 we prove that the complexity term C(λ, p) is sub-gamma when considering linear models with the NLL loss, or more generally, for any linear model with a Lipschitz loss function. In Corollary 2 we derive a bound on C(λ, p) for any Bayesian deep net, which permits verifying empirically that C(λ, p) is sub-gamma.
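A direct way to check Definition 1 empirically, in the spirit of Figure 1, is to verify that the envelope λ²v / (2(1 − cλ)) dominates estimates of C(λ, p) on a grid of λ values. The sketch below assumes such estimates are already available; the values of C_hat, v, and c are placeholders.

```python
import numpy as np

def sub_gamma_envelope(lam, v, c):
    """Sub-gamma bound of Definition 1, valid for 0 < lam < 1/c."""
    assert np.all((lam > 0) & (lam < 1.0 / c))
    return lam ** 2 * v / (2.0 * (1.0 - c * lam))

# grid of lambdas and (placeholder) empirical estimates of C(lambda, p)
lams = np.linspace(220, 900, 50)
C_hat = 1e-4 * lams ** 1.5          # stand-in for measured values

# the random variable is consistent with being sub-gamma with (v, c)
# if the envelope dominates the estimates on the whole grid
v, c = 1.0, 1e-4
print(np.all(sub_gamma_envelope(lams, v, c) >= C_hat))
```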
4. PAC-Bayesian bounds for smooth loss functions
Our main theorem below shows that for smooth loss functions, the complexity term C(λ, p) is bounded by the expected gradient-norm of the loss function with respect to the data x. This property is appealing since the gradient-norm contracts the network's output, as evident in its extreme case by the vanishing-gradients property. In our experimental evaluation we show how this contractivity depends on the depth of the network and the variance of the prior, and its effect on the generalization of Bayesian deep nets.

Theorem 2.
Consider the setting of Theorem 1 and assume (x, y) ∼ D, where x given y follows a Gaussian distribution, and ℓ(w, x, y) is a smooth loss function (e.g., the negative log-likelihood loss). Let M(α) ≜ E_{(x,y)∼D} e^{−αℓ(w,x,y)}. Then

C(λ, p) ≤ log E_{w∼p} exp( (λ/2) E_{(x,y)∼D}[ ‖∇_x ℓ(w,x,y)‖² ∫_0^{λ/m} e^{−αℓ(w,x,y)} / M(α) dα ] ).   (3)

Our proof technique for the main theorem uses the log-Sobolev inequality for Gaussian distributions (Ledoux, 1999), as we illustrate next.

Proof.
The proof consists of three steps. First we use the statistical independence of the training samples to decompose the moment generating function:

E_{S∼D^m}[e^{λ(L_D(w) − L_S(w))}] = e^{λL_D(w)} M(λ/m)^m.   (4)

Next we consider the moment generating function M(λ/m) ≜ E_{(x̂,ŷ)∼D}[e^{−(λ/m)ℓ(w,x̂,ŷ)}] in log-space, i.e., through the cumulant generating function K(α) ≜ (1/α) log M(α), and obtain the following equality:

M(λ/m) = exp( −(λ/m) L_D(w) + (λ/m) ∫_0^{λ/m} (αM'(α) − M(α) log M(α)) / (α² M(α)) dα ).   (5)

Finally we use the log-Sobolev inequality for Gaussian distributions,

αM'(α) − M(α) log M(α) ≤ (1/2) E_{(x,y)∼D}[ e^{−αℓ(w,x,y)} α² ‖∇_x ℓ(w,x,y)‖² ],   (6)

and complete the proof through some algebraic manipulations.

The first step of the proof results in Eq. (4). To derive it we use the independence of the training samples:

E_{S∼D^m}[e^{λ(L_D(w) − L_S(w))}] = e^{λL_D(w)} E_{S∼D^m}[e^{(λ/m) Σ_{i=1}^m (−ℓ(w,x_i,y_i))}]   (7)
 = e^{λL_D(w)} ∏_{i=1}^m E_{(x_i,y_i)∼D}[e^{−(λ/m) ℓ(w,x_i,y_i)}]   (8)
 = e^{λL_D(w)} ( E_{(x,y)∼D}[e^{−(λ/m) ℓ(w,x,y)}] )^m.   (9)

The first equality holds since e^{λL_D(w)} is a constant that is independent of the expectation over S ∼ D^m. The last equality holds since the (x_i, y_i) ∼ D are identically distributed.

The second step of the proof results in Eq. (5). It relies on the relation between the moment generating function and the cumulant generating function, M(λ/m) = e^{(λ/m) K(λ/m)}. The fundamental theorem of calculus asserts K(λ/m) − K(0) = ∫_0^{λ/m} K'(α) dα, where K'(α) denotes the derivative at α. We then compute K'(α) and K(0):

K'(α) = (αM'(α) − M(α) log M(α)) / (α² M(α)),   (10)
K(0) = lim_{α→0⁺} log M(α) / α = −L_D(w).   (11)

The second equality follows from l'Hopital's rule: lim_{α→0⁺} log M(α)/α = M'(0)/M(0), recalling that M(0) = 1 and M'(0) = −L_D(w).

The third and final step of the proof begins with applying the log-Sobolev inequality for Gaussian distributions, Eq. (6). Combining it with Eq. (5) leads to the inequality

M(λ/m) ≤ exp( −(λ/m) L_D(w) + (λ/m) ∫_0^{λ/m} (1/2) E_{(x,y)∼D}[e^{−αℓ(w,x,y)} α² ‖∇_x ℓ(w,x,y)‖²] / (α² M(α)) dα ).   (12)

It is insightful to see that

M(λ/m)^m ≤ e^{−λL_D(w)} exp( (λ/2) ∫_0^{λ/m} E_{(x,y)∼D}[e^{−αℓ(w,x,y)} ‖∇_x ℓ(w,x,y)‖²] / M(α) dα ).   (13)

This is compelling because the e^{−λL_D(w)} term in the above inequality cancels with the term e^{λL_D(w)} in Eq. (4). This shows theoretically why the complexity term mostly depends on the gradient-norm. This observation also concludes the proof after rearranging terms.

The above theorem can be extended to settings for which x is sampled from any log-concave distribution, e.g., the Laplace distribution, cf. Gentil (2005). For readability we do not discuss this generalization here.

Eq. (4) hints at the infeasibility of computing the complexity term C(λ, p) directly for large values of λ: since the loss function is non-negative, the value of e^{λL_D(w)} grows exponentially with λ, while e^{−(λ/m)ℓ(w,x̂,ŷ)} diminishes to zero exponentially fast. These opposing quantities make evaluation of C(λ, p) numerically infeasible.
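The first proof step, Eq. (4), only uses the fact that the m training samples are drawn i.i.d. from D. The following sketch verifies the identity numerically for a small discrete distribution, where both sides can be computed exactly; the losses and probabilities are made up for illustration.

```python
import numpy as np
from itertools import product

# a toy discrete data distribution: three (x, y) atoms with losses ell_j
probs = np.array([0.5, 0.3, 0.2])
losses = np.array([0.1, 1.2, 2.5])        # ell(w, x_j, y_j) for a fixed w
m, lam = 4, 3.0

L_D = probs @ losses                       # true risk L_D(w)
M = lambda a: probs @ np.exp(-a * losses)  # M(alpha) = E[exp(-alpha * ell)]

# left-hand side: E_{S ~ D^m} exp(lam * (L_D - L_S)), summed exactly
lhs = 0.0
for idx in product(range(3), repeat=m):
    p_S = np.prod(probs[list(idx)])
    L_S = losses[list(idx)].mean()
    lhs += p_S * np.exp(lam * (L_D - L_S))

# right-hand side of Eq. (4)
rhs = np.exp(lam * L_D) * M(lam / m) ** m
print(lhs, rhs)   # the two values agree up to floating-point error
```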
In contrast, our bound makes all computations in log-space, hence its computation is feasible for larger values of λ, up to the sub-gamma interval 0 < λ < 1/c; see Table 1.

Notably, the general bound in Theorem 2 is more theoretical than practical: to estimate it in practice one needs to avoid the integration over α. However, it is an important intermediate step for deriving a practical bound for linear models with a Lipschitz loss function, and a general bound for any smooth loss function, as we discuss in the next two sections respectively.

In the following we consider smooth loss functions over linear models in the multi-class setting, where x ∈ R^d is the data instance, y ∈ {1, ..., k} are the possible labels, and the loss function takes the form ℓ(w, x, y) ≜ ℓ̂(Wx, y). We also assume that ℓ̂(t, y) is a Lipschitz function, i.e., ‖∇_t ℓ̂(t, y)‖ ≤ L. Included in these assumptions are the popular NLL loss −log p(y|x, w) = −(Wx)_y + log(Σ_ŷ e^{(Wx)_ŷ}) that is used in logistic regression, and the multi-class hinge loss max_ŷ {(Wx)_ŷ − (Wx)_y + 1[y ≠ ŷ]} that is used in support vector machines (SVMs).

Corollary 1.
Consider smooth loss functions over linear models ℓ(w, x, y) ≜ ℓ̂(Wx, y), with Lipschitz constant L, i.e., ‖∇_t ℓ̂(t, y)‖ ≤ L. Under the conditions of Theorem 2, with a Gaussian prior distribution p = N(0, σ_p²) with variance σ_p² for which λ ≤ √(m/2) / (Lσ_p), we obtain C(λ, p) ≤ kd log(2).

Proof. This bound is derived by applying Theorem 2. We begin by evaluating the gradient of ℓ̂(Wx, y) with respect to x. Using the chain rule, ∇_x ℓ̂(Wx, y) = W^⊤ ∇_{Wx} ℓ̂(Wx, y). Hence, we obtain for the gradient norm ‖∇_x ℓ̂(Wx, y)‖² ≤ ‖∇_{Wx} ℓ̂(Wx, y)‖² · Σ_{y=1}^k Σ_{j=1}^d w_{y,j}² ≤ L² Σ_{y=1}^k Σ_{j=1}^d w_{y,j}².

Plugging this result into Eq. (3) we obtain the following bound for its exponent:

E_{(x,y)∼D}[ ‖∇_x ℓ(w,x,y)‖² ∫_0^{λ/m} e^{−αℓ(w,x,y)} / M(α) dα ]   (14)
 ≤ L² Σ_{y=1}^k Σ_{j=1}^d w_{y,j}² · E_{(x,y)∼D}[ ∫_0^{λ/m} e^{−αℓ(w,x,y)} / M(α) dα ]   (15)
 = L² Σ_{y=1}^k Σ_{j=1}^d w_{y,j}² ∫_0^{λ/m} E_{(x,y)∼D}[e^{−αℓ(w,x,y)}] / M(α) dα.   (16)

Since M(α) ≜ E_{(x,y)∼D}[e^{−αℓ(w,x,y)}], the ratio in the integral equals one and the integral becomes ∫_0^{λ/m} dα = λ/m. Combining these results we obtain:

C(λ, p) ≤ log( E_{w∼p} e^{(λ²L²/2m) Σ_{y,j} w_{y,j}²} ).   (17)

Finally, whenever λLσ_p ≤ √(m/2), we follow the Gaussian integral and derive the bound

E_{w∼p} e^{(λ²L²/2m) Σ_{y,j} w_{y,j}²} ≤ ( m / (m − λ²L²σ_p²) )^{kd/2}.   (18)
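Taking the admissible range of λ and the Gaussian-integral bound of Eq. (18) as reconstructed above, the bound of Corollary 1 has a simple closed form that can be evaluated directly. The problem sizes below (an MNIST-like linear model with a unit Lipschitz constant and prior variance 0.1) are illustrative assumptions, not values prescribed by the corollary.

```python
import numpy as np

def corollary1_C_bound(lam, m, L, sigma_p, k, d):
    """Eqs. (17)-(18): bound on C(lambda, p) for a linear model with a
    Gaussian prior N(0, sigma_p^2); finite while lam * L * sigma_p < sqrt(m)."""
    ratio = m / (m - (lam * L * sigma_p) ** 2)
    return 0.5 * k * d * np.log(ratio)

# illustrative sizes: MNIST-like linear model, unit Lipschitz constant
m, k, d, L, sigma_p = 60000, 10, 784, 1.0, 0.1
lam_max = np.sqrt(m / 2.0) / (L * sigma_p)   # admissible range of Corollary 1
print(lam_max)                               # ~1732, i.e. well above sqrt(m)
print(corollary1_C_bound(np.sqrt(m), m, L, sigma_p, k, d))
print(corollary1_C_bound(lam_max, m, L, sigma_p, k, d), k * d * np.log(2))
```

At the edge of the admissible range the closed form matches the kd log(2) statement of the corollary (up to the constant), while for λ = √m the bound is much smaller.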
Figure 2.
Estimating the complexity term C(λ, p) of Eq. (2) over MNIST, for a linear model and MLPs of depth d = 2, ..., 5. We are able to compute the complexity term C(λ, p) only for small values of λ, and for even smaller values of λ in the case of the linear model. Standard bounds for MNIST require λ to be at least the square root of the training sample size. In all settings we use a variance of 0.1 for the prior distribution over the model parameters.

The above corollary provides a PAC-Bayesian bound for classification using the NLL loss, and shows that C(λ, p) is sub-gamma in the interval 0 < λ < √(m/2)/(Lσ_p). Interestingly, it can achieve a rate of λ = m, albeit with a prior variance of order 1/m. In our experimental evaluation we show that it is better to settle for a lower rate, i.e., λ = √m, while using a prior with a fixed variance. The above bound also extends the result of Alquier et al. (2016) for the binary hinge-loss to the multi-class hinge loss (cf. Alquier et al. (2016), Section 6). Unfortunately, the above bound cannot be applied to non-linear loss functions, since their gradient-norm is not bounded, as evident by the exploding-gradients property in deep nets.

In the following we derive a generalization bound for on-average bounded loss functions and on-average bounded gradient-norms.
Corollary 2.
Consider smooth loss functions that are on-average bounded, i.e., E_{(x,y)∼D} ℓ(w, x, y) ≤ b. Under the conditions of Theorem 2, for any 0 < λ ≤ m we obtain

C(λ, p) ≤ log E_{w∼p} exp( (λ²e^b / 2m) E_{(x,y)∼D}[ ‖∇_x ℓ(w,x,y)‖² ] ).   (19)

Proof.
This bound is derived by applying Theorem 2 and bounding ∫_0^{λ/m} e^{−αℓ(w,x,y)} / M(α) dα ≤ (λ/m) e^b. We derive this bound in three steps:

• From ℓ(w, x, y) ≥ 0 we obtain e^{−αℓ(w,x,y)} ≤ 1.

• We lower bound M(α) ≥ M(1) for any 0 ≤ α ≤ λ/m: First we note that 0 < λ ≤ m, therefore we consider 0 ≤ α ≤ 1. Also, since ℓ(w, x, y) ≥ 0, the function e^{−αℓ(w,x,y)} is monotone in α within the unit interval, i.e., for 0 ≤ α₁ ≤ α₂ ≤ 1 there holds e^{−α₁ℓ(w,x,y)} ≥ e^{−α₂ℓ(w,x,y)}, and consequently M(α) ≥ M(1) for any α ≤ 1.

• The assumption E_{(x,y)∼D}[−ℓ(w, x, y)] ≥ −b and the monotonicity of the exponential function result in the lower bound e^{E_{(x,y)∼D}[−ℓ(w,x,y)]} ≥ e^{−b}. From convexity of the exponential function, M(1) = E_{(x,y)∼D} e^{−ℓ(w,x,y)} ≥ e^{E_{(x,y)∼D}[−ℓ(w,x,y)]}, and the lower bound M(1) ≥ e^{−b} follows.

Combining these bounds we derive the upper bound ∫_0^{λ/m} e^{−αℓ(w,x,y)} / M(α) dα ≤ ∫_0^{λ/m} (1 / e^{−b}) dα = (λ/m) e^b, and the result follows.

The above derivation upper bounds the complexity term C(λ, p) by the expected gradient-norm of the loss function, i.e., the flow of its gradients through the architecture of the model. It provides the means to empirically show that C(λ, p) is sub-gamma (see Section 5.3). In particular, we show empirically that the rate λ of the bound can be as high as m, depending on the gradient-norm. This is a favorable property, since the convergence of the bound scales as 1/λ. Therefore, one would like to avoid exploding gradient-norms, as these effectively harm the true-risk bound. While one may achieve a fast-rate bound by forcing the gradient-norm to vanish rapidly, practical experience shows that vanishing gradients prevent the deep net from fitting the model to the training data when minimizing the empirical risk. In our experimental evaluation we demonstrate the influence of the expected gradient-norm on the bound of the true risk.
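The key quantity in Corollary 2, the expected input-gradient norm E_{(x,y)∼D}[‖∇_x ℓ(w, x, y)‖²], can be estimated with a single backward pass with respect to the input for weights drawn from the prior. The PyTorch sketch below is a minimal illustration; the MLP architecture, the prior variance, and the random stand-in data are assumptions, not the exact setup of the experiments.

```python
import torch
import torch.nn.functional as F

def grad_norm_and_loss(model, x, y):
    """Per-batch estimates of E[||grad_x loss||^2] and E[loss]."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y, reduction="mean")  # NLL loss
    (grad_x,) = torch.autograd.grad(loss, x)
    # loss is a mean over the batch, so rescale to per-example gradients
    sq_norms = (grad_x * x.size(0)).flatten(1).pow(2).sum(dim=1)
    return sq_norms.mean().item(), loss.item()

# placeholder prior, model, and data
sigma_p = 0.1
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(784, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 10))
with torch.no_grad():                        # draw w ~ p = N(0, sigma_p^2)
    for p in model.parameters():
        p.normal_(0.0, sigma_p)

x = torch.randn(128, 1, 28, 28)              # stand-in for MNIST images
y = torch.randint(0, 10, (128,))
print(grad_norm_and_loss(model, x, y))
```

Averaging this estimate over several draws w ∼ p and over real data batches gives the kind of curves reported in Figure 4, and plugging it into Eq. (19) yields an estimate of the bound on C(λ, p).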
5. Experiments
In this section we perform an experimental evaluation of our PAC-Bayesian bounds, both for linear models and for non-linear models. We begin by verifying our assumptions: (i) the PAC-Bayesian bound in Theorem 1 cannot be computed for large values of λ; (ii) although the NLL loss is unbounded, it is on-average bounded. Next, we study the behavior of the complexity term C(λ, p) for different architectures, both for linear models and for deep nets. We show that the random variable L_D(w) − L_S(w) is sub-gamma, namely that C(λ, p) ≤ λ²v / (2(1 − λc)) for every λ such that 0 < λ < 1/c. Importantly, we show that 1/c, which relates to the rate of convergence of the bound, is determined by the architecture of the deep net. Lastly, we demonstrate the importance of C(λ, p) in the learning process, balancing the three terms of (i) the empirical risk, (ii) the KL divergence, and (iii) the complexity term.

Implementation details.
We use multilayer perceptrons (MLPs) for the MNIST and Fashion-MNIST datasets (Xiao et al., 2017) and convolutional neural networks (CNNs) for the CIFAR-10 dataset. In all models we use the ReLU activation function. We optimize the NLL loss function using SGD with a learning rate of 0.01 and a momentum value of 0.9 in all settings for 50 epochs. We use mini-batches of size 128 and do not use any learning-rate scheduling. For the ResNet experiments we optimize an 18-layer ResNet model on the CIFAR-10 dataset using the Adam optimizer with a learning rate of 0.001 for 150 epochs, where we halve the learning rate every 50 epochs, with a batch size of 128.
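For concreteness, the following sketch spells out the MLP construction and the SGD configuration described above; the hidden width and the data loader are placeholders (in the experiments the widths are chosen so that models of different depth have roughly the same number of parameters).

```python
import torch
import torch.nn.functional as F

def make_mlp(depth, width=512, in_dim=784, n_classes=10):
    """MLP with ReLU activations; depth=1 reduces to a linear model."""
    layers, d = [torch.nn.Flatten()], in_dim
    for _ in range(depth - 1):
        layers += [torch.nn.Linear(d, width), torch.nn.ReLU()]
        d = width
    layers.append(torch.nn.Linear(d, n_classes))
    return torch.nn.Sequential(*layers)

def train(model, loader, epochs=50, lr=0.01, momentum=0.9):
    """SGD on the NLL loss, following the MLP setup described above."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
    return model
```

A call such as train(make_mlp(3), loader) then corresponds to the three-layer MLP configuration, given any iterable of (image, label) mini-batches.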
Figure 3.
Verifying the assumption in Corollary 2 that the loss function is on-average bounded, i.e., E_{(x,y)∼D} ℓ(w, x, y) ≤ b. We tested this assumption on MNIST (left) and Fashion-MNIST (middle) using MLPs, and on CIFAR-10 (right) using CNNs. The loss is on-average bounded, with the bound depending on the variance of the prior; for small prior variances the on-average bound remains small.
Figure 4.
Estimating E_{(x,y)∼D}[‖∇_x ℓ(w, x, y)‖²] as a function of different variance levels for the prior distribution p. Results are reported for MNIST (left) and Fashion-MNIST (middle) using MLPs, and for CIFAR-10 (right) using CNNs. The linear model has the largest expected gradient-norm, since the Lipschitz condition considers the worst-case gradient-norm; see Corollary 1. The gradient-norm gets smaller as a function of the depth, due to the vanishing-gradient property. As a result, deeper nets can have a faster convergence rate, i.e., larger values of λ can be used in the generalization bound.
Figure 5.
The proposed bound as a function of λ for MLP models with two, three, four, and five layers, depicted from left to right. Notice that this suggests that the random variables L_D(w) − L_S(w) for all presented MLP models are sub-gamma. We obtain the results for all models using the MNIST dataset. The sub-gamma fit uses the same small value of c for all models, with different scaling factors.

We start by empirically demonstrating the numerical instability of computing the complexity term C(λ, p) in Eq. (2). This numerical instability occurs due to the exponentiation of the random variables, namely e^{λ(L_D(w) − L_S(w))}, which quickly goes to infinity as λ grows.
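The overflow is easy to reproduce: for λ in the hundreds or thousands, the exponentiated risk terms leave the range of double-precision floats even though their combination is moderate. The sketch below uses synthetic losses; only their scale matters for the effect.

```python
import numpy as np

m, lam = 60000, 5000.0
rng = np.random.default_rng(0)

# placeholder per-example losses for a fixed w; L_D is approximated
# by a large held-out sample, L_S by a size-m training sample
losses = rng.gamma(shape=1.0, scale=0.5, size=10 * m)
L_D, L_S = losses.mean(), losses[:m].mean()

print(np.exp(lam * L_D))            # overflows to inf for moderate lam
print(np.exp(-lam * L_S))           # underflows to 0.0
print(np.exp(lam * (L_D - L_S)))    # only the combined term stays finite
```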
Figure 6.
Our bound on C(λ, p) as a function of the variance levels of the prior distribution over the model parameters, for λ = √m (top) and λ = m (bottom). One can see that the bound gets larger as the variance of the prior increases. This plot conforms with the practice of Bayesian deep nets that keep the variance of the prior small. Notice that for the five-layer model with λ = m, the bound takes a finite value only at a single intermediate variance level; below that value the bound is zero, and above that value the bound explodes.

We estimated Eq. (2) over MNIST, for MLPs of depth d = 1, ..., 5, where an MLP of depth 1 is a linear model. For a fair comparison we changed the layers' width to reach roughly the same number of parameters in each model (except for the linear case). For these architectures we evaluated C(λ, p) for different values of λ, using a variance of 0.1 for the prior distribution. The results are depicted in Figure 2. One can see that we are able to compute the complexity term C(λ, p) only for small values of λ, and for even smaller values of λ for the linear model. Standard bounds for MNIST require λ to be in the interval √m < λ < m, where m is the training sample size (m = 60,000 for MNIST). We observe that while computing C(λ, p) the term e^{λL_D(w)} goes to infinity while e^{−λL_S(w)} goes to zero, and they are not able to balance each other. In our derivation this is solved by looking at the gradient to capture their change; see Eq. (13).

In Corollary 2 we assume that, although the loss is unbounded, it is on-average bounded, meaning E_{(x,y)∼D} ℓ(w, x, y) ≤ b. We tested this assumption on MNIST and Fashion-MNIST using MLPs of depth d = 1, ..., 5, and on CIFAR-10 using CNNs of depth d = 2, ..., 5, where in the CNN models d is the number of convolutional layers and we include a max-pooling layer after each convolutional layer. In all CNN models we include an additional fully connected output layer. To span the possible weights we sampled them from a normal prior distribution with different variances. The results appear in Figure 3. We observe that the loss is on-average bounded, with the bound depending on the variance of the prior. Moreover, for small prior variances the on-average loss bound is small and its effect on the complexity term C(λ, p) is minimal. Notice that although the expected loss increases for high variance levels, such levels are not used to initialize deep nets under common initialization techniques.

Next we turn to estimate our bounds on C(λ, p), both for the linear models and for the non-linear models, corresponding to Corollary 1 and Corollary 2. We use the same architectures as mentioned above. The bound on C(λ, p) is controlled by the expected gradient-norm E_{(x,y)∼D}[‖∇_x ℓ(w, x, y)‖²]. Figure 4 presents the expected gradient-norm as a function of different variance levels for the prior distribution p over the model parameters. For the linear model we used the bound in Corollary 1. One can see that the linear model has the largest expected gradient-norm, since the Lipschitz condition considers the worst-case gradient-norm. One can also see that the deeper the network, the smaller its gradient-norm. This is attributed to the gradient-vanishing property. As a result, deeper nets can have a faster convergence rate, i.e., use larger values of λ in the generalization bound, since the vanishing gradients create a contractivity property that stabilizes the loss function, i.e., reduces its variability.
However, this comes at the expense of the expressivity of the deep net, since with vanishing gradients the model cannot fit the training data in the learning phase. This is demonstrated in the next experiment.

Figure 6 presents the bound on C(λ, p) as a function of the variance levels of the prior distribution over the model parameters. One can see that the bound gets larger when the variance of the prior increases. Another thing to note is that the bound often explodes when the variance is large. This conforms with Corollary 1, in which C(λ, p) is unbounded when the variance is too large. This plot also conforms with the practice of Bayesian deep nets that keep the variance of the prior small.

In Section 4.1 we proved that the proposed bound is sub-gamma for the linear case. Unfortunately, this proof cannot be directly applied to the non-linear case. Hence, we empirically demonstrate that the proposed bound on C(λ, p) is indeed sub-gamma for various model architectures.

Table 1.
We optimize MLP models of different depth levels, where one corresponds to the linear model, two corresponds to two layers, and so on. We report the average test loss, average train loss, the bounds on C(m, p) and C(√m, p), and the KL value for the MNIST dataset, across prior variances 0.0004, 0.01, 0.05, 0.1, 0.3, 0.5, and 0.7.

Prior variance:   0.0004   0.01    0.05    0.1    0.3      0.5      0.7
Three  KL:        nan      10776   9480    8215   95984    175134   226237
Four   KL:        nan      nan     10849   8239   113943   nan      nan
Five   KL:        nan      nan     11800   9090   140305   nan      nan

For that we used the same model architectures as before, using the MNIST dataset. Results are depicted in Figure 5. Notice that, similar to the ResNet model, the proposed bound is sub-gamma in all explored settings, using a small value of c with different scaling factors.

Lastly, in order to better understand the balance between all the components composing the proposed generalization bound, we optimize all five MLP models presented above using the MNIST dataset, and compute the average training loss, average test loss, KL divergence, and the bound on C(λ, p), using λ = m and λ = √m. We repeat this optimization process for various variance levels of the prior distribution over the model parameters. Results for the MNIST dataset are summarized in Table 1; more experimental results can be found in Section A.1 in the Appendix. The results suggest that variance levels of [0.05, 0.1] produce the overall best performance across all depth levels. This finding is consistent with our previous results, which suggest that below this range the bound goes to zero, hence yielding good generalization at the expense of model performance, whereas larger variance levels may cause the bound to explode and as a result make the optimization problem harder.

Lastly, we analyzed the commonly used ResNet model (He et al., 2016). For that, we trained four different versions of the ResNet18 model: (i) the standard model (ResNet); (ii) a model with no skip connections (ResNetNoSkip); (iii) a model with no batch normalization (ResNetNoBN); and (iv) a model without both skip connections and batch normalization layers (ResNetNoSkipNoBN). We optimize all models using the CIFAR-10 dataset. Figure 7 visualizes the results. Consistent with previous findings, a variance level of 0.1 gives the best performance overall, both in terms of model test loss and generalization.

Notice that ResNet and ResNetNoSkip achieve comparable performance in all measures. Additionally, when considering a variance level of 0.1 for the prior distribution, removing the batch normalization layers while keeping the skip connections also achieves performance comparable to ResNet and ResNetNoSkip. Similarly to Zhang et al. (2019), these findings suggest that, given a proper initialization, models can converge even without batch normalization layers. On the other hand, when removing both batch normalization and skip connections, models either explode immediately or suffer greatly from vanishing gradients. These results are consistent with previous findings in which batch normalization greatly improves optimization (Santurkar et al., 2018).
Figure 7.
ResNet variation results for the CIFAR-10 dataset. In the left subplot we report the average training loss and average test loss (dashed lines). In the middle subplot we present the KL values, and in the right subplot we report the bound on C(λ, p). All results are reported using different variance levels of the prior distribution over the model parameters.
6. Discussion and Future Work
We present a new PAC-Bayesian generalization bound for Bayesian deep neural networks with unbounded loss functions and unbounded gradient-norm. The proof relies on bounding the log-partition function using the expected squared norm of the gradients with respect to the input. We prove that the proposed bound is sub-gamma for any linear model with a Lipschitz loss function, and we verify this empirically for the non-linear case. Experimental validation shows that the resulting bound provides insights for better model optimization, prior-distribution search, and model initialization.
References
Alquier, P., Wintenberger, O., et al. Model selection for weakly dependent time series forecasting. Bernoulli, 18(3):883–913, 2012.

Alquier, P., Ridgway, J., and Chopin, N. On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(239):1–41, 2016.

Banerjee, A. On Bayesian bounds. In Proceedings of the 23rd International Conference on Machine Learning, pp. 81–88. ACM, 2006.

Bartlett, P., Grunwald, P., Harremoës, P., Hedayati, F., and Kotlowski, W. Horizon-independent optimal prediction with log-loss in exponential families. arXiv preprint arXiv:1305.4324, 2013.

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6241–6250, 2017a.

Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv preprint arXiv:1703.02930, 2017b.

Bartlett, P. L., Harvey, N., Liaw, C., and Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019. URL http://jmlr.org/papers/v20/17-612.html.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Bousquet, O. and Elisseeff, A. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.

Dziugaite, G. K. and Roy, D. M. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.

Gal, Y. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2015.

Gentil, I. Logarithmic Sobolev inequality for log-concave measure from Prekopa-Leindler inequality. arXiv preprint math/0503476, 2005.

Germain, P., Bach, F., Lacoste, A., and Lacoste-Julien, S. PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884–1892, 2016.

Golowich, N., Rakhlin, A., and Shamir, O. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.

Grünwald, P. D. and Mehta, N. A. A tight excess risk bound via a unified PAC-Bayesian-Rademacher-Shtarkov-MDL complexity. arXiv preprint arXiv:1710.07732, 2017.

Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Kakade, S. M., Sridharan, K., and Tewari, A. On the complexity of linear prediction: Risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pp. 793–800, 2009.

Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583, 2015.

Kuzborskij, I. and Lampert, C. H. Data-dependent stability of stochastic gradient descent. arXiv preprint arXiv:1703.01678, 2017.

Ledoux, M. Concentration of measure and logarithmic Sobolev inequalities. In Séminaire de Probabilités XXXIII, pp. 120–216. Springer, 1999.

London, B. A PAC-Bayesian analysis of randomized learning with application to stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 2931–2940, 2017.

Louizos, C. and Welling, M. Structured and efficient variational deep learning with matrix Gaussian posteriors. In International Conference on Machine Learning, pp. 1708–1716, 2016.

McAllester, D. A PAC-Bayesian tutorial with a dropout bound. arXiv preprint arXiv:1307.2118, 2013.

Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. In Conference on Learning Theory, pp. 1376–1401, 2015.

Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.

Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.

Rakhlin, A., Mukherjee, S., and Poggio, T. Stability results in learning theory. Analysis and Applications, 3(04):397–417, 2005.

Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pp. 2483–2493, 2018.

Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Stochastic convex optimization. In COLT, 2009.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Takimoto, E. and Warmuth, M. K. The last-step minimax algorithm. In International Conference on Algorithmic Learning Theory, pp. 279–290. Springer, 2000.

Wei, C. and Ma, T. Data-dependent sample complexity of deep neural networks via Lipschitz augmentation. In Advances in Neural Information Processing Systems, pp. 9722–9733, 2019.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.

Yang, J., Sun, S., and Roy, D. M. Fast-rate PAC-Bayes generalization bounds via shifted Rademacher processes. In Advances in Neural Information Processing Systems, pp. 10802–10812, 2019.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Zhang, H., Dauphin, Y. N., and Ma, T. Fixup initialization: Residual learning without normalization. arXiv preprint arXiv:1901.09321, 2019.
A. Appendix
A.1. Extended results
We provide additional results for the optimization experiments. We optimized MLP models at different depth levels using the Fashion-MNIST dataset.
Table 2.
We optimize models of different depth levels, where one corresponds to the linear model, two corresponds to two layers, and so on. We report the average test loss, average train loss, the MGF bounds on C(λ, p) for λ = m and λ = √m, and the KL value, for prior variances 0.0004, 0.01, 0.05, 0.1, 0.3, 0.5, and 0.7, for models of depth one through five.