Convergence Rates of Variational Inference in Sparse Deep Learning
Badr-Eddine Chérief-Abdellatif [email protected]
CREST, ENSAE, Institut Polytechnique de Paris
Abstract
Variational inference is becoming more and more popular for approximating intractable posterior distributions in Bayesian statistics and machine learning. Meanwhile, a few recent works have provided theoretical justification and new insights on deep neural networks for estimating smooth functions in usual settings such as nonparametric regression. In this paper, we show that variational inference for sparse deep learning retains the same generalization properties as exact Bayesian inference. In particular, we highlight the connection between estimation and approximation theories via the classical bias-variance trade-off and show that it leads to near-minimax rates of convergence for Hölder smooth functions. Additionally, we show that the model selection framework over the neural network architecture via ELBO maximization does not overfit and adaptively achieves the optimal rate of convergence.
Keywords:
Variational Inference, Neural Networks, Deep Learning, Generalization
1. Introduction
Deep learning (DL) is a field of machine learning that aims to model data using complex architectures combining several nonlinear transformations with hundreds of parameters, called Deep Neural Networks (DNNs) (LeCun et al., 2015; Goodfellow et al., 2016). Although the generalization theory that explains why DL generalizes so well is still an open problem, it is widely acknowledged that it mainly takes advantage of large datasets containing millions of samples and of a huge computing power coming from clusters of graphics processing units. Very popular architectures for deep neural networks such as the multilayer perceptron, the convolutional neural network (Lecun et al., 1998), the recurrent neural network (Rumelhart et al., 1986) or the generative adversarial network (Goodfellow et al., 2014) have shown impressive results and have enabled machines to perform better than humans in various important areas of artificial intelligence such as image recognition, game playing, machine translation, computer vision or natural language processing, to name a few prominent examples. An outstanding example is AlphaGo (Silver et al., 2017), an artificial intelligence developed by Google that learned to play the game of Go using deep learning techniques and even defeated the world champion in 2016.

The Bayesian approach, leading to popular methods such as Hidden Markov Models (Baum and Petrie, 1966) and Particle Filtering (Doucet and Johansen, 2009), provides a natural way to model uncertainty. Some prior distribution is put over the space of parameters and represents the prior belief as to which parameters are likely to have generated the data before any datapoint is observed. Then this prior distribution is updated using the Bayes rule when new data arrive, in order to capture the more likely parameters given the observations. Unfortunately, exact Bayesian inference is computationally challenging for complex models as the normalizing constant of the posterior distribution is often intractable. In such cases, approximate inference methods such as variational inference (VI) (Jordan et al., 1999) and expectation propagation (Minka, 2001) are popular to overcome intractability in Bayesian modeling. The idea of VI is to minimize the Kullback-Leibler (KL) divergence to the posterior over a set of tractable distributions, which is equivalent to maximizing a numerical criterion called the Evidence Lower Bound (ELBO). Recent advances in VI have shown great performance in practice and have been applied to many machine learning problems (Hoffman et al., 2013; Kingma and Welling, 2013).

The Bayesian approach to learning in neural networks has a long history. Bayesian Neural Networks (BNNs) were first proposed in the 90s and have been widely studied since then (MacKay, 1992a; Neal, 1995). They offer a probabilistic interpretation and a measure of uncertainty for DL models. They are more robust to overfitting than classical neural networks and still achieve great performance even on small datasets. A prior distribution is put on the parameters of the network, namely the weight matrices and the bias vectors, for instance a Gaussian or a uniform distribution, and Bayesian inference is done through the likelihood specification. Nevertheless, state-of-the-art neural networks may contain millions of parameters and the form of a neural network is not adapted to exact integration, which makes the posterior distribution intractable in practice.
Modern approximate inference mainly relies on VI, sometimes with a flavor of sampling techniques. A lot of recent papers have investigated variational inference for DNNs (Hinton and van Camp, 1993; Graves, 2011; Blundell et al., 2015) to fit an approximate posterior that maximizes the evidence lower bound. For instance, Blundell et al. (2015) introduced Bayes by Backprop, one of the most famous techniques of VI applied to neural networks, which derives a fully factorized Gaussian approximation to the posterior: using the reparameterization trick (Opper and Archambeau, 2008), the gradients of the ELBO with respect to the parameters of the Gaussian approximation can be computed by backpropagation and then used for updates. Another point of interest in DNNs is the choice of the prior. Blundell et al. (2015) introduced a mixture of Gaussians prior on the weights, with one mixture component tightly concentrated around zero, imitating the sparsity-inducing spike-and-slab prior. This offers a Bayesian alternative to the dropout regularization procedure (Srivastava et al., 2014), which injects sparsity in the network by randomly switching off some of its weights. This idea goes back to David MacKay, who discussed in his thesis the possibility of choosing a spike-and-slab prior over the weights of the neural network (MacKay, 1992b). More recently, Rockova and Polson (2018) introduced Spike-and-Slab Deep Learning (SS-DL), a fully Bayesian alternative to dropout for improving the generalizability of deep ReLU networks.

Although deep learning is extremely popular, the study of the generalization properties of DNNs is still an open problem. Some works have been conducted in order to investigate the theoretical properties of neural networks from different points of view. The literature developed in the past decades can be split into three parts. First, approximation theory asks how well a function can be approximated by neural networks. The first studies were mostly conducted to obtain approximation guarantees for shallow neural nets with a single hidden layer (Cybenko, 1989; Barron, 1993). Since then, modern research has focused on the expressive power of depth and extended the previous results to deep neural networks with a larger number of layers (Bengio and Delalleau, 2011; Yarotsky, 2016; Petersen and Voigtländer, 2017; Grohs et al., 2019). Indeed, even though the universal approximation theorem (Cybenko, 1989) states that a shallow neural network containing a finite number of neurons can approximate any continuous function on compact sets under mild assumptions on the activation function, recent advances showed that a shallow network requires exponentially many neurons in terms of the dimension to represent a monomial function, whereas linearly many neurons are sufficient for a deep network (Rolnick and Tegmark, 2018). Second, as the objective function in deep learning is known to be nonconvex, the optimization community has discussed the landscape of the objective as well as the dynamics of some learning algorithms such as Stochastic Gradient Descent (SGD) (Baldi and Hornik, 1989; Stanford et al., 2000; Soudry and Carmon, 2016; Kawaguchi, 2016; Kawaguchi et al., 2019; Nguyen et al., 2019; Allen-Zhu et al., 2019; Du et al., 2019). Finally, the statistical learning community has investigated generalization properties of DNNs, see Barron (1994); Zhang et al. (2017); Schmidt-Hieber (2017); Suzuki (2018); Imaizumi and Fukumizu (2019); Suzuki (2019).
In particular, Schmidt-Hieber (2017) and Suzuki (2019) showed that estimators in nonparametric regression based on sparsely connected DNNs with ReLU activation function and wisely chosen architecture achieve the minimax estimation rates (up to logarithmic factors) under classical smoothness assumptions on the regression function. At the same time, Bartlett et al. (2017) and Neyshabur et al. (2018) respectively used Rademacher complexity and covering numbers, and PAC-Bayes theory, to get spectrally-normalized margin bounds for deep ReLU networks. More recently, Imaizumi and Fukumizu (2019) and Hayakawa and Suzuki (2019) showed the superiority of DNNs over linear operators in some situations where DNNs achieve the minimax rate of convergence while alternative methods fail. From a Bayesian point of view, Rockova and Polson (2018) and Suzuki (2018) studied the concentration of the posterior distribution while Vladimirova et al. (2019) investigated the regularization effect of prior distributions at the level of the units.

As for the generalization properties of DNNs, only little attention has been paid in the literature to the theoretical properties of VI until recently. Alquier et al. (2016) studied generalization properties of variational approximations of Gibbs distributions in machine learning for bounded loss functions. Alquier and Ridgway (2017); Zhang and Gao (2017); Sheth and Khardon (2017); Bhattacharya et al. (2018); Chérief-Abdellatif and Alquier (2018); Cherief-Abdellatif (2019); Jaiswal et al. (2019b) extended the previous guarantees to more general statistical models and studied the concentration of variational approximations of the posterior distribution, while Wang and Blei (2018) provided Bernstein-von Mises theorems for variational approximations in parametric models. Huggins et al. (2018); Campbell and Li (2019); Jaiswal et al. (2019a) discussed theoretical properties of variational inference algorithms based on various divergences (respectively Wasserstein and Hellinger distances, and Rényi divergence). More recently, Chérief-Abdellatif et al. (2019) presented generalization bounds for online variational inference. All these works show that under mild conditions, the variational approximation is consistent and achieves the same rate of convergence as the Bayesian posterior distribution it approximates. Note that Alquier and Ridgway (2017); Bhattacharya et al. (2018); Chérief-Abdellatif and Alquier (2018); Cherief-Abdellatif (2019) restricted their studies to tempered versions of the posterior distribution where the likelihood is raised to an $\alpha$-power ($\alpha < 1$), as this is known to require less stringent assumptions to obtain consistency and to be robust to misspecification, see respectively Bhattacharya et al. (2016) and Grünwald and Van Ommen (2017). Nevertheless, some questions remain unanswered, such as the theoretical study of generalization of variational inference for deep neural networks.

This paper aims at filling the gap between theory and practice when using variational approximations for tempered Bayesian Deep Neural Networks. To the best of our knowledge, this is the first paper to present theoretical generalization error bounds of variational inference for Bayesian deep learning.
Inspired by the related literature, our work is motivated by the following questions:

• Does the consistency of Bayesian DNNs still hold when an approximation is used instead of the exact posterior distribution, and can we obtain the same rates of convergence as those obtained for the regular posterior distribution and frequentist estimators?

• Is it possible to obtain a nonasymptotic generalization error bound that holds for (almost) any generating distribution and that gives a general formula?

• What about the consistency of the numerical algorithms used to compute these variational approximations?

• Can we obtain new insights on the structure of the networks?

The main contribution of this paper, a nonasymptotic generalization error bound for variational inference in sparse DL in the nonparametric regression framework, answers the first two questions. This generalization result is similar to theoretical inequalities in the seminal works of Suzuki (2018); Imaizumi and Fukumizu (2019); Rockova and Polson (2018) on generalization properties of deep neural networks, and is inspired by the general literature on the consistency of variational approximations (Alquier and Ridgway, 2017; Bhattacharya et al., 2018). In particular, it states that under the same conditions, sparse variational approximations of posterior distributions of deep neural networks are consistent at the same rate of convergence as the exact posterior.

It also raises the question of finding a relevant general definition of consistency that can be used to provide theoretical properties for exact Bayesian DNN distributions and their variational approximations. Indeed, a classical criterion used to assess frequentist guarantees for Bayesian estimators is the concentration of the posterior, which is defined as the asymptotic concentration of the Bayesian estimator to the true distribution (Ghosal et al., 2000). Nevertheless, posterior concentration to the true distribution only applies when the model is well specified, or at least when the model contains distributions in the neighborhood of the true distribution, which is problematic for misspecified models, e.g. when the neural network does not approximate the generating distribution well enough. And although the posterior distribution may concentrate to the best approximation of the true distribution in KL divergence in such misspecified models, there exist pathological cases where the regular Bayesian posterior is not consistent at all, see Grünwald and Van Ommen (2017). This is the reason why we focus here on tempered posteriors, which are robust to such misspecification. Therefore, we introduce in Section 2 a notion of consistency of a Bayesian estimator which is closely related to the notion of concentration - even stronger - and which enables a more robust formulation of generalization error bounds for variational approximations. See Appendix A for more details on the connection between the notions of consistency and concentration.

Then we focus on optimization aspects. We no longer assume an ideal optimization, as done for instance in Schmidt-Hieber (2017); Imaizumi and Fukumizu (2019). We address in this paper the question of the consistency of the numerical algorithms used to compute our ideal approximations. We consider an optimization error given by any algorithm and independent of the statistical error, and we show how it affects our generalization result.
Our upper bound highlights the connection between the consistency of the variational approximation and the convergence of the ELBO.

We also provide insights on the structure of the network that leads to optimal rates of convergence, i.e. its depth, its width and its sparsity. Indeed, in our first generalization error bound, the structure of the network is ideally tuned for some choice of the generating function, and we show how to choose such a structure. Nevertheless, the characteristics of the regression function may be unknown: e.g. we may know that the regression function is Hölder continuous but ignore its level of smoothness. We propose here an automated method for choosing the architecture of the network. We introduce a classical model selection framework based on the ELBO criterion (Cherief-Abdellatif, 2019), and we show that the variational approximation associated with the selected structure does not overfit and adaptively achieves the optimal rate of convergence even without any oracle information.

The rest of this paper is organized as follows. Section 2 introduces the notations and the framework considered in the paper, and presents sparse spike-and-slab variational inference for deep neural networks. Section 3 provides theoretical generalization error bounds for variational approximations of DNNs and shows the optimality of the method for estimating Hölder smooth functions. Finally, insights on the choice of the architecture of the network are given in Section 4 via the ELBO maximization framework. All the proofs are deferred to the appendix.
2. Sparse deep variational inference
Let us introduce the notations and the statistical framework we adopt in this paper. For any vector $x = (x_1, ..., x_d) \in [-1,1]^d$ and any real-valued function $f$ defined on $[-1,1]^d$, $d \geq 1$, we denote:
$$ \|x\|_\infty = \max_{1 \leq i \leq d} |x_i|, \quad \|f\|_2 = \left( \int f^2 \right)^{1/2} \quad \text{and} \quad \|f\|_\infty = \sup_{y \in [-1,1]^d} |f(y)|. $$
For any $k \in \{0, 1, 2, ...\}^d$, we define $|k| = \sum_{i=1}^d k_i$ and the mixed partial derivative, when all partial derivatives up to order $|k|$ exist:
$$ D^k f(x) = \frac{\partial^{|k|} f}{\partial x_1^{k_1} \cdots \partial x_d^{k_d}}(x). $$
We also introduce the notion of $\beta$-Hölder continuity for $\beta > 0$. We denote $\lfloor \beta \rfloor$ the largest integer strictly smaller than $\beta$. Then $f$ is said to be $\beta$-Hölder continuous (Tsybakov, 2008) if all partial derivatives up to order $\lfloor \beta \rfloor$ exist and are bounded, and if:
$$ \|f\|_{\mathcal{C}^\beta} := \max_{|k| \leq \lfloor \beta \rfloor} \|D^k f\|_\infty + \max_{|k| = \lfloor \beta \rfloor} \sup_{x,y \in [-1,1]^d,\, x \neq y} \frac{|D^k f(x) - D^k f(y)|}{\|x - y\|_\infty^{\beta - \lfloor \beta \rfloor}} < +\infty. $$
$\|f\|_{\mathcal{C}^\beta}$ is the norm of the Hölder space $\mathcal{C}^\beta = \{ f \,/\, \|f\|_{\mathcal{C}^\beta} < +\infty \}$.

We consider the nonparametric regression framework. We have a collection of random variables $(X_i, Y_i) \in [-1,1]^d \times \mathbb{R}$ for $i = 1, ..., n$, independent and identically distributed (i.i.d.) according to the generating process:
$$ \begin{cases} X_i \sim \mathcal{U}([-1,1]^d), \\ Y_i = f(X_i) + \zeta_i, \end{cases} $$
where $\mathcal{U}([-1,1]^d)$ is the uniform distribution on $[-1,1]^d$, $\zeta_1, ..., \zeta_n$ are i.i.d. Gaussian random variables with mean $0$ and known variance $\sigma^2$, and $f : [-1,1]^d \to \mathbb{R}$ is the true unknown regression function. For instance, the true regression function $f$ may belong to the set $\mathcal{C}^\beta$ of Hölder functions with level of smoothness $\beta$.

We call deep neural network any map $f_\theta : \mathbb{R}^d \to \mathbb{R}$ defined recursively as follows:
$$ \begin{aligned} x^{(0)} &:= x, \\ x^{(\ell)} &:= \rho(A_\ell x^{(\ell-1)} + b_\ell) \quad \text{for } \ell = 1, ..., L-1, \\ f_\theta(x) &:= A_L x^{(L-1)} + b_L, \end{aligned} $$
where $L \geq 1$ and $\rho$ is an activation function acting componentwise. For instance, we can choose the ReLU activation function $\rho(u) = \max(u, 0)$. Each $A_\ell \in \mathbb{R}^{D_\ell \times D_{\ell-1}}$ is a weight matrix such that its $(i,j)$ coefficient, called edge weight, connects the $j$-th neuron of the $(\ell-1)$-th layer to the $i$-th neuron of the $\ell$-th layer, and each $b_\ell \in \mathbb{R}^{D_\ell}$ is a shift vector such that its $i$-th coefficient, called node weight, represents the weight associated with the $i$-th node of layer $\ell$. We set $D_0 = d$ the number of units in the input layer, $D_L = 1$ the number of units in the output layer and $D_\ell = D$ the number of units in the hidden layers. The architecture of the network is characterized by its number of edges $S$, i.e. the total number of nonzero entries in the matrices $A_\ell$ and vectors $b_\ell$, its number of layers $L \geq 1$ (excluding the input layer), and its width $D \geq 1$. We have $S \leq T$ where $T = \sum_{\ell=1}^L D_\ell (D_{\ell-1} + 1)$ is the total number of coefficients in a fully connected network. For now, we consider that $S$, $L$ and $D$ are fixed, and $d = O(1)$ as $n \to +\infty$. In particular, we assume that $d \leq D$, which implies that $T \leq L D (D+1)$. We also suppose that the absolute values of all coefficients are upper bounded by some positive constant $B \geq 1$. This boundedness assumption will be relaxed in the appendix, see Appendix G. Then, the parameter of a DNN is $\theta = \{(A_1, b_1), ..., (A_L, b_L)\}$, and we denote $\Theta_{S,L,D}$ the set of all possible parameters. We will also alternatively consider the stacked coefficient parameter $\theta = (\theta_1, ..., \theta_T)$.
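To make the recursion and the sampling model concrete, here is a minimal NumPy sketch (not taken from the paper; the network sizes, the function f_true standing in for the unknown regression function $f$, and all numerical values are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(u):
    return np.maximum(u, 0.0)

def dnn_forward(theta, x):
    """Evaluate f_theta(x) for theta = [(A_1, b_1), ..., (A_L, b_L)]:
    x^(0) = x, x^(l) = relu(A_l x^(l-1) + b_l) for l < L, f_theta(x) = A_L x^(L-1) + b_L."""
    h = x
    for l, (A, b) in enumerate(theta):
        h = A @ h + b
        if l < len(theta) - 1:          # the last layer is affine, without activation
            h = relu(h)
    return h.item()                     # D_L = 1, so the output is a scalar

# Generating process: X_i ~ U([-1, 1]^d), Y_i = f(X_i) + zeta_i with zeta_i ~ N(0, sigma^2).
d, n, sigma = 3, 200, 0.1
f_true = lambda x: np.sin(np.pi * x[0]) * x[1]      # placeholder for the unknown Hoelder-smooth f
X = rng.uniform(-1.0, 1.0, size=(n, d))
Y = np.array([f_true(x) for x in X]) + sigma * rng.normal(size=n)

# A fully connected network with L = 3 layers (two hidden layers of width D = 8).
D, L = 8, 3
dims = [d] + [D] * (L - 1) + [1]
theta = [(rng.normal(size=(dims[l + 1], dims[l])), rng.normal(size=dims[l + 1]))
         for l in range(L)]
print(dnn_forward(theta, X[0]))
```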
We adopt a Bayesian approach, and we place a spike-and-slab prior $\pi$ (Castillo et al., 2015) over the parameter space $\Theta_{S,L,D}$ (equipped with some suited sigma-algebra), defined hierarchically. The spike-and-slab prior is known to be a relevant alternative to dropout for Bayesian deep learning, see Rockova and Polson (2018). First, we sample a vector of binary indicators $\gamma = (\gamma_1, ..., \gamma_T) \in \{0,1\}^T$ uniformly over the set $\mathcal{S}_S^T$ of $T$-dimensional binary vectors with exactly $S$ nonzero entries, and then, given $\gamma_t$ for each $t = 1, ..., T$, we put a spike-and-slab prior on $\theta_t$ that returns $0$ if $\gamma_t = 0$ and a random sample from a uniform distribution on $[-B, B]$ otherwise:
$$ \begin{cases} \gamma \sim \mathcal{U}(\mathcal{S}_S^T), \\ \theta_t \,|\, \gamma_t \sim \gamma_t \, \mathcal{U}([-B,B]) + (1 - \gamma_t) \, \delta_{\{0\}}, \quad t = 1, ..., T, \end{cases} $$
where $\delta_{\{0\}}$ is a point mass at $0$ and $\mathcal{U}([-B,B])$ is the uniform distribution on $[-B,B]$. We recall that the sparsity level $S$ is fixed here and that this assumption will be relaxed in Section 4.

Remark 2.1.
We consider uniform distributions for simplicity, as in similar works (Rockova and Polson, 2018; Suzuki, 2018), but Gaussian distributions can be used as well when working on an unbounded parameter set $\Theta_{S,L,D}$, see Theorem 7 in Appendix G.
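As an illustration, a draw from this prior can be simulated as follows (a sketch with arbitrary values of $T$, $S$ and $B$; the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_spike_and_slab_prior(T, S, B):
    """One draw (gamma, theta) from the prior: gamma is uniform over binary vectors
    with exactly S ones, then theta_t ~ U([-B, B]) if gamma_t = 1 and theta_t = 0 otherwise."""
    active = rng.choice(T, size=S, replace=False)   # uniformly chosen support of size S
    gamma = np.zeros(T, dtype=int)
    gamma[active] = 1
    theta = np.zeros(T)
    theta[active] = rng.uniform(-B, B, size=S)
    return gamma, theta

gamma, theta = sample_spike_and_slab_prior(T=50, S=5, B=10.0)
print(gamma.sum(), np.count_nonzero(theta))         # S active indicators, at most S nonzero weights
```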
Then we define the tempered posterior distribution $\pi_{n,\alpha}$ on the parameter $\theta \in \Theta_{S,L,D}$ using the prior $\pi$, for any $\alpha \in (0,1)$:
$$ \pi_{n,\alpha}(d\theta) \propto \exp\left( -\frac{\alpha}{2\sigma^2} \sum_{i=1}^n (Y_i - f_\theta(X_i))^2 \right) \pi(d\theta), $$
which is a slight variant of the definition of the regular Bayesian posterior (for which $\alpha = 1$). This distribution is known to be easier to sample from, to require less stringent assumptions to obtain concentration, and to be robust to misspecification, see respectively Behrens et al. (2012), Bhattacharya et al. (2016) and Grünwald and Van Ommen (2017).

The variational Bayes approximation $\tilde{\pi}_{n,\alpha}$ of the tempered posterior is defined as the projection (with respect to the Kullback-Leibler divergence) of the tempered posterior onto some set $\mathcal{F}_{S,L,D}$:
$$ \tilde{\pi}_{n,\alpha} = \arg\min_{q \in \mathcal{F}_{S,L,D}} \mathrm{KL}(q \,\|\, \pi_{n,\alpha}), $$
which is equivalent to:
$$ \tilde{\pi}_{n,\alpha} = \arg\min_{q \in \mathcal{F}_{S,L,D}} \left\{ \frac{\alpha}{2\sigma^2} \sum_{i=1}^n \int (Y_i - f_\theta(X_i))^2 \, q(d\theta) + \mathrm{KL}(q \,\|\, \pi) \right\} \quad (1) $$
where the function inside the argmin operator in (1) is the opposite of the evidence lower bound $\mathcal{L}_n(q)$.

We choose a sparse spike-and-slab variational set $\mathcal{F}_{S,L,D}$ - see for instance Tonolini et al. (2019) - which can be seen as an extension of the popular mean-field variational set with a dependence assumption specifying the number of active neurons. The mean-field approximation is based on a decomposition of the space of parameters $\Theta_{S,L,D}$ as a product $\theta = (\theta_1, ..., \theta_T)$ and consists of compatible product distributions on each parameter $\theta_t$, $t = 1, ..., T$. Here, we fit a distribution in a family that matches the prior: we first choose a distribution $\pi_\gamma$ on the set $\mathcal{S}_S^T$ that selects a $T$-dimensional binary vector $\gamma$ with $S$ nonzero entries, and then we place a spike-and-slab variational approximation on each $\theta_t$ given $\gamma_t$:
$$ \begin{cases} \gamma \sim \pi_\gamma, \\ \theta_t \,|\, \gamma_t \sim \gamma_t \, \mathcal{U}([l_t, u_t]) + (1 - \gamma_t) \, \delta_{\{0\}} \quad \text{for each } t = 1, ..., T, \end{cases} $$
where $-B \leq l_t \leq u_t \leq B$, with the distribution $\pi_\gamma$ and the intervals $[l_t, u_t]$, $t = 1, ..., T$, as the hyperparameters of the variational set $\mathcal{F}_{S,L,D}$. In particular, if we choose a deterministic $\pi_\gamma = \delta_{\{\gamma'\}}$ with $\gamma' \in \mathcal{S}_S^T$, then we obtain a parametric mean-field approximation. See Section 6.6 of the PhD thesis of Gal (2016) for a more detailed discussion on the connection between Gaussian mean-field and sparse spike-and-slab posterior approximations.

The generalization error of the tempered posterior $\pi_{n,\alpha}$ and of its variational approximation $\tilde{\pi}_{n,\alpha}$ is the expected average of the squared $L_2$-distance to the true generating function over the Bayesian estimator:
$$ \mathbb{E}\left[ \int \|f_\theta - f\|_2^2 \, \pi_{n,\alpha}(d\theta) \right] \quad \text{and} \quad \mathbb{E}\left[ \int \|f_\theta - f\|_2^2 \, \tilde{\pi}_{n,\alpha}(d\theta) \right]. $$
We say that a Bayesian estimator is consistent at rate $r_n \to 0$ if its generalization error is upper bounded by $r_n$. Notice that consistency of the Bayesian estimator implies concentration to $f$. Again, see Appendix A for the connection between these two notions.
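To illustrate the objective in (1), the sketch below evaluates a Monte Carlo estimate of the negative ELBO for the parametric sub-family with a deterministic $\pi_\gamma = \delta_{\{\gamma'\}}$, in which case $\mathrm{KL}(q\,\|\,\pi)$ reduces to $\log\binom{T}{S}$ plus one term $\log(2B/(u_t - l_t))$ per active coordinate. This is only a toy sketch, not the paper's implementation: a linear map stands in for the network $f_\theta$, the scaling $\alpha/(2\sigma^2)$ matches the Gaussian likelihood of the tempered posterior above, and all names and numerical values are ours.

```python
import numpy as np
from math import comb, log

rng = np.random.default_rng(2)

# Toy data; a linear map stands in for the DNN f_theta (so T = d parameters here).
n, d, sigma, alpha, B = 200, 10, 0.5, 0.5, 10.0
X = rng.uniform(-1.0, 1.0, size=(n, d))
Y = X[:, 0] - 2.0 * X[:, 1] + sigma * rng.normal(size=n)
predict = lambda theta, X: X @ theta

def neg_elbo(gamma, lower, upper, n_mc=200):
    """Monte Carlo estimate of the objective in (1),
    (alpha / (2 sigma^2)) * sum_i E_q[(Y_i - f_theta(X_i))^2] + KL(q || pi),
    for q with a deterministic gamma and theta_t | gamma_t = 1 ~ U([l_t, u_t])."""
    S, T = int(gamma.sum()), gamma.size
    active = np.flatnonzero(gamma)
    # Expected squared error, estimated by sampling theta from q.
    sq_err = 0.0
    for _ in range(n_mc):
        theta = np.zeros(T)
        theta[active] = rng.uniform(lower[active], upper[active])
        sq_err += np.sum((Y - predict(theta, X)) ** 2) / n_mc
    # KL(q || pi): choice of the support among binom(T, S), plus uniform slabs inside [-B, B].
    kl = log(comb(T, S)) + np.sum(np.log(2.0 * B / (upper[active] - lower[active])))
    return alpha / (2.0 * sigma ** 2) * sq_err + kl

gamma = np.zeros(d, dtype=int); gamma[:2] = 1            # keep S = 2 active coordinates
lower, upper = np.full(d, -3.0), np.full(d, 3.0)
print(neg_elbo(gamma, lower, upper))
```

Minimizing this quantity over $\gamma$ and the intervals $(l_t, u_t)$ recovers the variational approximation restricted to that sub-family.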
3. Generalization of variational inference for neural networks
The first result of this section is an extension of the result of Rockova and Polson (2018) on the Bayesian posterior distribution for Hölder regression functions. Indeed, we provide a concentration result on the posterior distribution for the expected $L_2$-distance instead of the empirical $L_2$-distance, which enables generalization instead of reconstruction on the training datapoints. This result is then extended again to the variational approximation for our definition of consistency: we show that we can still achieve near-optimality using an approximation of the posterior without any additional assumption. Finally, we explain how we can incorporate the optimization error in our generalization results.

Rockova and Polson (2018) give the first posterior concentration result for deep ReLU networks when estimating Hölder smooth functions in nonparametric regression with the empirical $L_2$-distance. The authors highlight the flexibility of DNNs over other methods for estimating $\beta$-Hölder smooth functions, as there is a large range of values of the level of smoothness $\beta$ for which one can obtain concentration, e.g. $0 < \beta < d$ for a DNN against a much more restricted range for a Bayesian tree.

The following theorem provides the concentration of the tempered posterior distribution $\pi_{n,\alpha}$ for deep ReLU neural networks when using the expected $L_2$-distance, for some suitable architecture of the network:

Theorem 1.
Let us assume that $\alpha \in (0,1)$, that $f$ is $\beta$-Hölder smooth with $0 < \beta < d$ and that the activation function is ReLU. We consider the architecture of Rockova and Polson (2018) for some positive constant $C_D$ independent of $n$:
$$ \begin{aligned} L &= 8 + (\lfloor \log n \rfloor + 5)(1 + \lceil \log d \rceil), \\ D &= C_D \, \big\lfloor n^{\frac{d}{2\beta + d}} / \log n \big\rfloor, \\ S &\leq d\,(\beta + 1)^d \, D \, (L + \lceil \log d \rceil). \end{aligned} $$
Then the tempered posterior distribution $\pi_{n,\alpha}$ concentrates at the minimax rate $r_n = n^{-\frac{\beta}{2\beta+d}}$ up to a (squared) logarithmic factor for the expected $L_2$-distance, in the sense that:
$$ \pi_{n,\alpha}\left( \theta \in \Theta_{S,L,D} \,/\, \|f_\theta - f\|_2 > M_n \cdot n^{-\frac{\beta}{2\beta+d}} \cdot \log^2 n \right) \longrightarrow 0 $$
in probability as $n \to +\infty$, for any $M_n \to +\infty$.

In order to prove Theorem 1, we actually have to check that the so-called prior mass condition is satisfied:
$$ \pi\left( \theta \in \Theta_{S,L,D} \,/\, \|f_\theta - f\|_2 \leq r_n \right) \geq e^{-n r_n^2}. \quad (2) $$
This assumption, introduced in Ghosal et al. (2000) in order to obtain the concentration of the regular posterior distribution, states that the prior must give enough mass to some neighborhood of the true parameter. As shown in Bhattacharya et al. (2016), this condition is even sufficient for tempered posteriors. Actually, this inequality was first stated using the KL divergence instead of the expected $L_2$-distance (see Condition 2.4 in Theorem 2.1 in Ghosal et al. (2000)), but the KL metric is equivalent to the squared $L_2$-metric in regression problems with Gaussian noise. This prior mass condition gives us the rate of convergence of the tempered posterior $r_n = n^{-\frac{\beta}{2\beta+d}}$ (up to a squared logarithmic factor), which is known to be optimal when estimating $\beta$-Hölder smooth functions (Tsybakov, 2008). Note that the $\log n$ term is common in the theoretical deep learning literature (Imaizumi and Fukumizu, 2019; Suzuki, 2019; Schmidt-Hieber, 2017).

Remark 3.1.
The number of parameters, of order $n^{\frac{d}{2\beta+d}}/\log n \in [n^{1/3}/\log(n),\, n/\log(n)]$, is high compared to standard machine learning methods, which may lead to overfitting and hence prevent the procedure from achieving the minimax rate of convergence. The sparsity parameter $S$, which gives a network with a small number of nonzero parameters, along with the spike-and-slab prior, helps us tackle this issue and obtain optimal rates of convergence (up to logarithmic factors).

The result we state in this subsection applies to a wide range of activation functions, including the popular ReLU activation and the identity map:
Assumption 3.1.
In the following, we assume that the activation function $\rho$ is 1-Lipschitz continuous (with respect to the absolute value) and is such that for any $x \in \mathbb{R}$, $|\rho(x)| \leq |x|$.

We no longer assume that the regression function is $\beta$-Hölder, and we consider any structure $(S, L, D)$. The following theorem gives a generalization error bound when using variational approximations instead of exact tempered posteriors for DNNs. The proof is given in Appendix B and is based on PAC-Bayes theory (Catoni, 2007; Guedj, 2019):

Theorem 2.
For any $\alpha \in (0,1)$,
$$ \mathbb{E}\left[ \int \|f_\theta - f\|_2^2 \, \tilde{\pi}_{n,\alpha}(d\theta) \right] \leq \frac{2}{1-\alpha} \inf_{\theta^* \in \Theta_{S,L,D}} \|f_{\theta^*} - f\|_2^2 + \frac{2}{1-\alpha}\left(\frac{\sigma^2}{\alpha}\right) r_n^{S,L,D}, \quad (3) $$
with
$$ r_n^{S,L,D} = \frac{LS}{n}\log(BD) + \frac{2S}{n}\log(BLD) + \frac{S}{n}\log\left( dL \max\left(\frac{n}{S}, 1\right) \right). $$

The oracle inequality (3) ensures consistency of variational Bayes for estimating neural networks and provides the associated rate of convergence given the structure $(S, L, D)$. Indeed, if $f$ is a neural network with structure $(S, L, D)$, then the infimum term on the right-hand side of the inequality vanishes and we obtain a rate of convergence of order
$$ r_n^{S,L,D} \sim \max\left( \frac{S \log(nL/S)}{n}, \frac{LS \log D}{n} \right), $$
which underlines a linear dependence on the number of layers and on the sparsity. In fact, this rate of convergence is determined by the extended prior mass condition (Alquier and Ridgway, 2017; Chérief-Abdellatif and Alquier, 2018; Cherief-Abdellatif, 2019), which requires that, in addition to the previous prior mass condition of Ghosal et al. (2000) and Bhattacharya et al. (2016), the variational set $\mathcal{F}_{S,L,D}$ must contain probability distributions $q$ that are concentrated enough around the true generating function $f$. One of the main findings of Theorem 2 is that our choice of the sparse spike-and-slab variational set $\mathcal{F}_{S,L,D}$ is rich enough, and that both conditions are actually similar and lead to the same rate of convergence. Hence, the rate of convergence is the one that satisfies the prior mass condition (2). In particular, as the prior distribution is uniform over the parameter space, the negative logarithm of the prior mass of the neighborhood of the true regression function in Equation (2) is a local covering entropy, that is, the logarithm of the number of $r_n^{S,L,D}$-balls needed to cover a neighborhood of the true regression function. Especially, it has been shown in previous studies that this local covering entropy fully characterizes the rate of convergence of the empirical risk minimizer for DNNs (Schmidt-Hieber, 2017; Suzuki, 2019). The rate $r_n^{S,L,D}$ we obtain in this work is exactly of the same order as the upper bound on the covering entropy given in Lemma 5 in Schmidt-Hieber (2017) and in Lemma 3 in Suzuki (2019), which derive rates of convergence for the empirical risk minimizer using different proof techniques. Note that replacing the uniform by a Gaussian in the prior and variational distributions leads to the same rate of convergence, see Appendix G.

Nevertheless, deep neural networks are mainly used for their computational efficiency and their ability to approach complex functions, which makes the task of estimating a neural network not so popular in machine learning. As said earlier, Imaizumi and Fukumizu (2019) used neural networks for estimating non-smooth functions. In such a context, where the neural network model is misspecified, our generalization error bound is robust and still holds, and satisfies the best possible balance between bias and variance.

Indeed, the upper bound on the generalization error on the right-hand side of (3) is mainly divided into two parts: the approximation error of $f$ by a DNN $f_{\theta^*}$ in $\Theta_{S,L,D}$ (i.e. the bias) and the estimation error $r_n^{S,L,D}$ of a neural network $f_{\theta^*}$ in $\Theta_{S,L,D}$ (i.e. the variance).
Forinstance, even if the generalization power is decreasing linearly with respect to the number oflayers compared to the logarithmic dependence on the width due to the variance term, thiseffect is compensated by the benefits of depth in the approximation theory of deep learning.Then, as there exists relationships between the bias/the variance and the architecture of aneural network (respectively due to the approximation theory/the form of r S,L,Dn ), Theorem 2gives both a general formula for deriving rates of convergence for variational approximationsand insight on the way to choose the architecture. We choose the architecture that minimizesthe right-hand-side of (3), which can lead to minimax estimators for smooth functions. Italso connects the approximation and estimation theories following previous studies. This wasdone for instance by Schmidt-Hieber (2017); Suzuki (2019); Imaizumi and Fukumizu (2019)who exploited the effectiveness of ReLU activation function in terms of approximation ability(Yarotsky, 2016; Petersen and Voigtländer, 2017) for Hölder/Besov smooth and piecewisesmooth generating functions.Now we illustrate Theorem 2 on Hölder smooth functions. The following result showsthat the variational approximation achieves the same rate of convergence than the posteriordistribution it approximates, and even the minimax rate of convergence if the architectureis well chosen. We present both consistency and concentration results.
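Before stating the corollary, here is the heuristic bias-variance balance behind it (a sketch relying on the sparse-ReLU approximation results cited above, with constants and exact powers of $\log n$ omitted). With the architecture of Theorem 1, $S \asymp n^{\frac{d}{2\beta+d}}$ and $L \asymp \log n$ up to logarithmic terms, so that
$$ \inf_{\theta^* \in \Theta_{S,L,D}} \|f_{\theta^*} - f\|_2^2 \;\lesssim\; S^{-2\beta/d} \;\asymp\; n^{-\frac{2\beta}{2\beta+d}}, \qquad r_n^{S,L,D} \;\asymp\; \frac{S L \log(BD) + S \log(nL/S)}{n} \;\lesssim\; n^{-\frac{2\beta}{2\beta+d}}, $$
and both terms of (3) are of the order of the minimax squared rate up to logarithmic factors.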
Corollary 3.
Let us fix $\alpha \in (0,1)$. We consider the ReLU activation function. Assume that $f$ is $\beta$-Hölder smooth with $0 < \beta < d$. Then with $L$, $D$ and $S$ defined as in Theorem 1, the variational approximation of the tempered posterior distribution $\tilde{\pi}_{n,\alpha}$ is consistent and hence concentrates at the minimax rate $r_n = n^{-\frac{\beta}{2\beta+d}}$ (up to a squared logarithmic factor):
$$ \tilde{\pi}_{n,\alpha}\left( \theta \in \Theta_{S,L,D} \,/\, \|f_\theta - f\|_2 > M_n \cdot n^{-\frac{\beta}{2\beta+d}} \cdot \log^2 n \right) \longrightarrow 0 $$
in probability as $n \to +\infty$, for any $M_n \to +\infty$.

In this subsection, we discuss the effect of an optimization error that is independent of the previous statistical error. Indeed, in the variational Bayes community, approximate algorithms are used in practice to solve the optimization problem (1) when the model is non-conjugate, i.e. when the VB solution is not available in closed form. This is the case here when considering a sparse spike-and-slab variational approximation in $\mathcal{F}_{S,L,D}$ with hyperparameters $\phi = (\pi_\gamma, (\phi_t)_{1 \leq t \leq T})$ and an algorithm that gives a sequence of hyperparameters $(\phi^k)_{k \geq 1}$ and associated variational approximations $(\tilde{\pi}^k_{n,\alpha})_{k \geq 1}$. The following theorem gives a statistical guarantee for any approximation $\tilde{\pi}^k_{n,\alpha}$, $k \geq 1$:

Theorem 4.
For any $\alpha \in (0,1)$,
$$ \mathbb{E}\left[ \int \|f_\theta - f\|_2^2 \, \tilde{\pi}^k_{n,\alpha}(d\theta) \right] \leq \frac{2}{1-\alpha} \inf_{\theta^*} \|f_{\theta^*} - f\|_2^2 + \frac{2}{1-\alpha}\left(\frac{\sigma^2}{\alpha}\right) r_n^{S,L,D} + \frac{2\sigma^2}{\alpha(1-\alpha)} \cdot \frac{\mathbb{E}[\mathcal{L}^*_n - \mathcal{L}^k_n]}{n}, $$
where $\mathcal{L}^*_n$ is the maximum of the evidence lower bound, i.e. the ELBO evaluated at $\tilde{\pi}_{n,\alpha}$, while $\mathcal{L}^k_n$ is the ELBO evaluated at $\tilde{\pi}^k_{n,\alpha}$.

We establish a clear connection between the convergence (in mean) of the ELBO $\mathcal{L}^k_n$ to $\mathcal{L}^*_n$ and the consistency of our algorithm $\tilde{\pi}^k_{n,\alpha}$. Indeed, as soon as the ELBO $\mathcal{L}^k_n$ converges at rate $c_{k,n}$, our variational approximation $\tilde{\pi}^k_{n,\alpha}$ is consistent at rate:
$$ \max\left( \frac{c_{k,n}}{n}, \frac{S \log(nL/S)}{n}, \frac{SL \log D}{n} \right). $$
In particular, as soon as $k$ is such that $c_{k,n} \leq \max(S \log n, S \log D)$, we obtain consistency of $\tilde{\pi}^k_{n,\alpha}$ at rate $r_n^{S,L,D}$, i.e. $\tilde{\pi}^k_{n,\alpha}$ and $\tilde{\pi}_{n,\alpha}$ have the same rate of convergence.

However, deriving the convergence of the ELBO is a hard task. For instance, when considering a simple Gaussian mean-field approximation without sparsity, the variational objective $\mathcal{L}_n$ can be maximized using either stochastic (Graves, 2011; Blundell et al., 2015) or natural gradient methods (Khan et al., 2018) on the parameters of the Gaussian approximation. The convergence of the ELBO is often met in practice (Buchholz et al., 2018; Mishkin et al., 2018) and the recent work of Osawa et al. (2019) even showed that Bayesian deep learning enables practical deep learning and matches the performance of standard methods while preserving the benefits of Bayesian principles. Nevertheless, the objective is nonconvex and hence it is difficult to prove convergence to a global maximum in theory. Some recent papers studied global convergence properties of gradient descent algorithms for frequentist classification and regression losses (Du et al., 2019; Allen-Zhu et al., 2019), which we may extend to gradient descent algorithms for the ELBO objective such as Variational Online Gauss-Newton or Vadam (Khan et al., 2018; Osawa et al., 2019).

Another point is to develop and study more complex algorithms than simple gradient descent that deal with spike-and-slab sparsity-inducing variational inference, as for instance Titsias and Lázaro-Gredilla (2011) did for multi-task and multiple kernel learning. Also, Louizos et al. (2018) connected sparse spike-and-slab variational inference with $L_0$-norm regularization for neural networks and proposed a solution to the intractability of the $L_0$-penalty term through the use of non-negative stochastic gates, while Bellec et al. (2018) proposed an algorithm preserving sparsity during training. Nevertheless, these optimization concerns fall beyond the scope of this paper and are left for further research.
4. Architecture design via ELBO maximization
We saw in Section 3 that the choice of the architecture of the neural network is crucial and can lead to faster convergence and better approximation. In this section, we formulate the architecture design of DNNs as a model selection problem and we investigate the ELBO maximization strategy, which is very popular in the variational Bayes community. This approach is different from Rockova and Polson (2018), which is fully Bayesian and treats the parameters of the network architecture, namely the depth, the width and the sparsity, as random variables. We show that the ELBO criterion does not overfit and is adaptive: it provides a variational approximation with the optimal rate of convergence, and it does not require the knowledge of the unknown aspects of the regression function $f$ (e.g. the level of smoothness for smooth functions) to select the optimal variational approximation.

We denote $\mathcal{M}_{S,L,D}$ the statistical model associated with the parameter set $\Theta_{S,L,D}$. We consider a countable number of models, and we introduce prior beliefs $\pi_{S,L,D}$ over the sparsity, the depth and the width of the network, that can be defined hierarchically and that are known beforehand. For instance, the prior beliefs can be chosen such that $\pi_L = 2^{-L}$, $\pi_{D|L}$ follows a uniform distribution over $\{d, ..., \max(e^L, d)\}$ given $L$, and $\pi_{S|L,D}$ a uniform distribution over $\{1, ..., T\}$ given $L$ and $D$ (we recall that $T$ is the number of coefficients in a fully connected network). This particular choice is sensible as it allows to consider any number of hidden layers and (at most) an exponentially large width with respect to the depth of the network. We still consider spike-and-slab priors on $\theta_{S,L,D} \in \Theta_{S,L,D}$ given model $\mathcal{M}_{S,L,D}$.

Each tempered posterior associated with model $\mathcal{M}_{S,L,D}$ is denoted $\pi^{S,L,D}_{n,\alpha}$. We recall that the variational approximation $\tilde{\pi}^{S,L,D}_{n,\alpha}$ associated with model $\mathcal{M}_{S,L,D}$ is defined as the distribution in the variational set $\mathcal{F}_{S,L,D}$ that maximizes the Evidence Lower Bound:
$$ \tilde{\pi}^{S,L,D}_{n,\alpha} = \arg\max_{q^{S,L,D} \in \mathcal{F}_{S,L,D}} \mathcal{L}_n(q^{S,L,D}). $$
We will simply denote in the following $\mathcal{L}^*_n(S,L,D)$ the closest approximation to the log-evidence, i.e. the value of the ELBO evaluated at its maximum:
$$ \mathcal{L}^*_n(S,L,D) = \mathcal{L}_n(\tilde{\pi}^{S,L,D}_{n,\alpha}). $$
The model selection criterion we use here to select the architecture of the network is a penalized variant of the classical ELBO criterion (Blei et al., 2017) with strong theoretical guarantees (Cherief-Abdellatif, 2019):
$$ (\hat{S}, \hat{L}, \hat{D}) = \arg\max_{S,L,D} \left\{ \mathcal{L}^*_n(S,L,D) - \log\left(\frac{1}{\pi_{S,L,D}}\right) \right\}. $$
In practice, for any choice of the prior beliefs $\pi_{S,L,D}$, one computes the ELBO for each model $\mathcal{M}_{S,L,D}$ using an algorithm that converges to $\mathcal{L}^*_n(S,L,D)$ and chooses the architecture that maximizes the penalized ELBO criterion. It is possible to restrict to a finite number of layers in practice (for instance, a factor of $n$ or $\log n$).

The following theorem shows that this ELBO criterion leads to a variational approximation with the optimal rate of convergence:

Theorem 5.
For any $\alpha \in (0,1)$,
$$ \mathbb{E}\left[ \int \|f_\theta - f\|_2^2 \, \tilde{\pi}^{\hat{S},\hat{L},\hat{D}}_{n,\alpha}(d\theta) \right] \leq \inf_{S,L,D} \left\{ \frac{2}{1-\alpha} \inf_{\theta^* \in \Theta_{S,L,D}} \|f_{\theta^*} - f\|_2^2 + \frac{2}{1-\alpha}\left(\frac{\sigma^2}{\alpha}\right) r_n^{S,L,D} + \frac{2\sigma^2}{\alpha(1-\alpha)} \frac{\log(1/\pi_{S,L,D})}{n} \right\}. $$

This inequality shows that as soon as the complexity term $\log(1/\pi_{S,L,D})/n$ that reflects the prior beliefs is lower than the effective rate of convergence that balances the accuracy and the estimation error $r_n^{S,L,D}$, the selected variational approximation adaptively achieves the best possible rate. For instance, it leads to (near-)minimax rates for Hölder smooth functions and selects the optimal architecture even without the knowledge of $\beta$, which was required in the previous section. Note that for the previous choice of prior beliefs $\pi_L = 2^{-L}$, $\pi_{D|L} = 1/(\max(e^L, d) - d + 1)$, $\pi_{S|L,D} = 1/T$, we get:
$$ \frac{\log(1/\pi_{S,L,D})}{n} \leq \frac{2\log(D+1) + \log L + \max(L, \log d) + L \log 2}{n}, $$
which is lower than $r_n^{S,L,D}$ (up to a factor), and hence the ELBO criterion does not overfit.
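In practice, the selection step itself is straightforward once the ELBOs have been computed. The sketch below (our own illustration, not code from the paper) selects an architecture on a small grid using the penalized criterion with the example prior beliefs above; elbo_star must be supplied by an ELBO-maximization routine for each model, and the placeholder used here only makes the snippet executable.

```python
import numpy as np

def log_prior_belief(S, L, D, d):
    """log pi_{S,L,D} for the example prior beliefs: pi_L = 2^{-L},
    pi_{D|L} uniform on {d, ..., max(e^L, d)}, pi_{S|L,D} uniform on {1, ..., T},
    with T bounded here by the number of coefficients L * D * (D + 1)."""
    T = L * D * (D + 1)
    n_widths = max(int(np.exp(L)), d) - d + 1
    return -L * np.log(2.0) - np.log(n_widths) - np.log(T)

def select_architecture(candidates, d, elbo_star):
    """Penalized ELBO criterion: maximize L*_n(S, L, D) - log(1 / pi_{S,L,D})."""
    return max(candidates,
               key=lambda arch: elbo_star(*arch) + log_prior_belief(*arch, d))

# Placeholder standing in for the fitted ELBO of each model M_{S,L,D}.
dummy_elbo_star = lambda S, L, D: -float(S * L) * np.log(D + 1.0)

grid = [(S, L, D) for S in (50, 200, 800) for L in (2, 4, 8) for D in (16, 64)]
print(select_architecture(grid, d=3, elbo_star=dummy_elbo_star))
```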
5. Discussion
In this paper, we provided theoretical justifications for neural networks from a Bayesian point of view using sparse variational inference. We derived new generalization error bounds and we showed that sparse variational approximations of DNNs achieve (near-)minimax optimality when the regression function is Hölder smooth. All our results directly imply concentration of the approximation of the posterior distribution. We also proposed an automated method for selecting an architecture of the network with optimal consistency guarantees via the ELBO maximization framework.

We think that one of the main challenges here is the design of new computational algorithms for spike-and-slab deep learning in the wake of the work of Titsias and Lázaro-Gredilla (2011) for multi-task and multiple kernel learning, or those of Louizos et al. (2018) and Bellec et al. (2018). In the latter paper, the authors designed an algorithm for training deep networks while simultaneously learning their sparse connectivity, allowing for fast and computationally efficient learning, whereas most approaches have focused on compressing already trained neural networks.

At the same time, a future point of interest is the study of the global convergence of these approximate algorithms in nonconvex settings, i.e. the study of the theoretical convergence of the ELBO. This work was conducted for frequentist gradient descent algorithms (Allen-Zhu et al., 2019; Du et al., 2019). Such studies should be investigated for Bayesian gradient descents, as well as for algorithms that preserve the sparsity of the network during training.
Acknowledgments
We would like to warmly thank Pierre Alquier for his helpful suggestions on early versions of this work.

References
Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning viaover-parameterization. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,
Pro-ceedings of the 36th International Conference on Machine Learning , volume 97 of
Proceed-ings of Machine Learning Research , pages 242–252, Long Beach, California, USA, 09–15Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/allen-zhu19a.html .P. Alquier and J. Ridgway. Concentration of tempered posteriors and of their variationalapproximations. arXiv preprint arXiv:1706.09293 , 2017.P. Alquier, J. Ridgway, and N. Chopin. On the properties of variational approximations ofGibbs posteriors.
JMLR , 17(239):1–41, 2016.Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learningfrom examples without local minima”, ne.
Neural Networks , 2:53–58, 12 1989. doi: 10.1016/0893-6080(89)90014-2.Andrew Barron. Barron, a.e.: Universal approximation bounds for superpositions of asigmoidal function. ieee trans. on information theory 39, 930-945.
Information Theory,IEEE Transactions on , 39:930 – 945, 06 1993. doi: 10.1109/18.256500.Andrew R Barron. Approximation and estimation bounds for artificial neural networks.
Machine Learning , 14(1):115–133, 1994.Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized marginbounds for neural networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,R. Fergus, S. Vishwanathan, and R. Garnett, editors,
Advances in Neural Infor-mation Processing Systems 30 , pages 6240–6249. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7204-spectrally-normalized-margin-bounds-for-neural-networks.pdf .Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finitestate markov chains.
Ann. Math. Statist. , 37(6):1554–1563, 12 1966. doi: 10.1214/aoms/1177699147. URL https://doi.org/10.1214/aoms/1177699147 .G. Behrens, N. Friel, and M. Hurn. Tuning tempered transitions.
Statistics and computing ,22(1):65–78, 2012.Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring:Training very sparse deep networks. In
International Conference on Learning Represen-tations , 2018. URL https://openreview.net/forum?id=BJ_wN01C- .Yoshua Bengio and Olivier Delalleau. On the expressive power of deep architectures. In
Proceedings of the 22nd International Conference on Algorithmic Learning Theory, ALT'11, pages 18–36, Berlin, Heidelberg, 2011. Springer-Verlag. ISBN 978-3-642-24411-7. URL http://dl.acm.org/citation.cfm?id=2050345.2050349.

A. Bhattacharya, D. Pati, and Y. Yang. Bayesian fractional posteriors. arXiv preprint arXiv:1611.01125, to appear in the Annals of Statistics, 2016.

A. Bhattacharya, D. Pati, and Y. Yang. On statistical optimality of variational Bayes.
Proceedings of Machine Learning Research , 84 - AISTAT, 2018.David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review forstatisticians.
Journal of the American Statistical Association , 112(518):859–877, 2017.Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight un-certainty in neural networks. In
Proceedings of the 32Nd International Conference onInternational Conference on Machine Learning - Volume 37 , ICML’15, pages 1613–1622.JMLR.org, 2015. URL http://dl.acm.org/citation.cfm?id=3045118.3045290 .Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities us-ing the entropy method.
Ann. Probab. , 31(3):1583–1614, 07 2003. doi: 10.1214/aop/1055425791. URL https://doi.org/10.1214/aop/1055425791 .Alexander Buchholz, Florian Wenzel, and Stephan Mandt. Quasi-Monte Carlo variationalinference. In Jennifer Dy and Andreas Krause, editors,
Proceedings of the 35th Interna-tional Conference on Machine Learning , volume 80 of
Proceedings of Machine Learning Research, pages 668–677, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/buchholz18a.html.

Trevor Campbell and Xinglong Li. Universal boosting variational inference. arXiv preprint arXiv:1903.05220, 2019.

Ismaël Castillo, Johannes Schmidt-Hieber, and Aad van der Vaart. Bayesian linear regression with sparse priors.
Ann. Statist. , 43(5):1986–2018, 10 2015. doi: 10.1214/15-AOS1334.URL https://doi.org/10.1214/15-AOS1334 .O. Catoni.
PAC-Bayesian supervised classification: the thermodynamics of statistical learn-ing . Institute of Mathematical Statistics Lecture Notes—Monograph Series, 56. Instituteof Mathematical Statistics, Beachwood, OH, 2007.B. Chérief-Abdellatif and P. Alquier. Consistency of variational bayes inference for estima-tion and model selection in mixtures.
Electronic Journal of Statistics , 12(2):2995–3035,2018. ISSN 1935-7524. doi: 10.1214/18-EJS1475.B.-E. Chérief-Abdellatif, P. Alquier, and M.E. Khan. A generalization bound for onlinevariational inference. Preprint arXiv:1904.03920v1, 2019.Badr-Eddine Cherief-Abdellatif. Consistency of elbo maximization for model selection.In Francisco Ruiz, Cheng Zhang, Dawen Liang, and Thang Bui, editors,
Proceedingsof The 1st Symposium on Advances in Approximate Bayesian Inference , volume 96 of
Proceedings of Machine Learning Research , pages 11–31. PMLR, 02 Dec 2019. URL http://proceedings.mlr.press/v96/cherief-abdellatif19a.html .G. Cybenko. Approximation by superpositions of a sigmoidal function.
Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, December 1989. ISSN 0932-4194. doi: 10.1007/BF02551274. URL http://dx.doi.org/10.1007/BF02551274.

A. Doucet and A. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later.
Handbook of Nonlinear Filtering , 12, 01 2009.Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent findsglobal minima of deep neural networks. In Kamalika Chaudhuri and Ruslan Salakhutdinov,editors,
Proceedings of the 36th International Conference on Machine Learning , volume 97of
Proceedings of Machine Learning Research , pages 1675–1685, Long Beach, California,USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/du19c.html .Yarin Gal.
Uncertainty in Deep Learning . PhD thesis, University of Cambridge, 2016.Subhashis Ghosal, Jayanta K. Ghosh, and Aad W. van der Vaart. Convergence rates of pos-terior distributions.
Ann. Statist. , 28(2):500–531, 04 2000. doi: 10.1214/aos/1016218228.URL https://doi.org/10.1214/aos/1016218228 .Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, SherjilOzair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahra-mani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors,
Advancesin Neural Information Processing Systems 27 , pages 2672–2680. Curran Associates, Inc.,2014. URL http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf .Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Deep Learning. MIT Press, 2016.

Alex Graves. Practical variational inference for neural networks. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors,
Advances in NeuralInformation Processing Systems 24 , pages 2348–2356. Curran Associates, Inc., 2011. URL http://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks.pdf .Philipp Grohs, Dmytro Perekrestenko, Dennis Elbrächter, and Helmut Bölcskei. Deep neuralnetwork approximation theory, 01 2019.P. D. Grünwald and T. Van Ommen. Inconsistency of Bayesian inference for misspecifiedlinear models, and a proposal for repairing it.
Bayesian Analysis , 12(4):1069–1103, 2017.B. Guedj. A primer on pac-bayesian learning. arXiv preprint arXiv:1901.05353 , 2019.Satoshi Hayakawa and Taiji Suzuki. On the minimax optimality and superiority of deepneural network learning over sparse parameter spaces. arXiv preprint arXiv:1905.09195 ,2019.Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by min-imizing the description length of the weights. In
Proceedings of the Sixth AnnualConference on Computational Learning Theory , COLT ’93, pages 5–13, New York,NY, USA, 1993. ACM. ISBN 0-89791-611-5. doi: 10.1145/168304.168306. URL http://doi.acm.org/10.1145/168304.168306 .M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference.
The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

Jonathan H. Huggins, Trevor Campbell, Mikolaj Kasprzak, and Tamara Broderick. Practical bounds on the error of Bayesian posterior approximations: A nonasymptotic approach.
ArXiv , abs/1809.09505, 2018.Masaaki Imaizumi and Kenji Fukumizu. Deep neural networks learn non-smooth functions ef-fectively. In Kamalika Chaudhuri and Masashi Sugiyama, editors,
Proceedings of MachineLearning Research , volume 89 of
Proceedings of Machine Learning Research, pages 869–878. PMLR, 16–18 Apr 2019. URL http://proceedings.mlr.press/v89/imaizumi19a.html.

P. Jaiswal, V. A. Rao, and H. Honnappa. Asymptotic consistency of α-Rényi-approximate posteriors. Preprint arXiv:1902.01902, 2019a.

Prateek Jaiswal, Harsha Honnappa, and Vinayak A. Rao. Risk-sensitive variational Bayes: Formulations and bounds. arXiv preprint arXiv:1906.01235, 2019b.

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183–233, 1999.

Kenji Kawaguchi. Deep learning without poor local minima. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors,
Advances in Neural Informa-tion Processing Systems 29 , pages 586–594. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6112-deep-learning-without-poor-local-minima.pdf .Kenji Kawaguchi, Jiaoyang Huang, and Leslie Pack Kaelbling. Effect of depth and widthon local minima in deep learning.
Neural Computation , 31(6):1462–1498, 2019.Mohammad Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Sri-vastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. InJennifer Dy and Andreas Krause, editors,
Proceedings of the 35th International Con-ference on Machine Learning , volume 80 of
Proceedings of Machine Learning Research ,pages 2611–2620, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/khan18a.html .Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprintarXiv:1312.6114 , 2013.Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learningapplied to document recognition. In
Proceedings of the IEEE , pages 2278–2324, 1998.Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.
Nature, 521(7553):436–444, 2015. ISSN 0028-0836. doi: 10.1038/nature14539.

Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1Y8hhg0b.

David J. C. MacKay. A practical Bayesian framework for backpropagation networks.
Neural Computation, 4(3):448–472, 1992a. doi: 10.1162/neco.1992.4.3.448. URL https://doi.org/10.1162/neco.1992.4.3.448.

David J. C. MacKay.
Bayesian methods for adaptive models . PhD thesis, California Instituteof Technology, 1992b.T. P. Minka. Expectation propagation for approximate bayesian inference. In
Proceedingsof the 17th Conference in Uncertainty in Artificial Intelligence , UAI ’01, pages 362–369,San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-800-1.URL http://dl.acm.org/citation.cfm?id=647235.720257 .Aaron Mishkin, Frederik Kunstner, Didrik Nielsen, Mark Schmidt, and Mohammad EmtiyazKhan. Slang: Fast structured covariance approximations for bayesian deep learning withnatural gradient. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,and R. Garnett, editors,
Advances in Neural Information Processing Systems 31 , pages6245–6255. Curran Associates, Inc., 2018.Radford. M. Neal.
Bayesian learning for neural networks . PhD thesis, University of Toronto,1995.Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-bayesian approach tospectrally-normalized margin bounds for neural networks. In
International Conference onLearning Representations , 2018. URL https://openreview.net/forum?id=Skz_WfbCZ .Quynh Nguyen, Mahesh Chandra Mukkamala, and Matthias Hein. On the loss landscape ofa class of deep neural networks with no bad local valleys. In
International Conference onLearning Representations , 2019. URL https://openreview.net/forum?id=HJgXsjA5tQ .Manfred Opper and Cedric Archambeau. The variational gaussian approximation revisited.
Neural Computation, 21:786–792, 2008. doi: 10.1162/neco.2008.08-07-592.

Kazuki Osawa, Siddharth Swaroop, Anirudh Jain, Runa Eschenhagen, Richard E. Turner, Rio Yokota, and Mohammad Emtiyaz Khan. Practical deep learning with Bayesian principles, 2019. URL http://arxiv.org/abs/1906.02506.

Philipp Petersen and Felix Voigtländer. Optimal approximation of piecewise smooth functions using deep ReLU neural networks.
Neural Networks , 09 2017. doi: 10.1016/j.neunet.2018.08.019.Veronika Rockova and nicholas Polson. Posterior concentration for sparse deep learning. InS. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett,editors,
Advances in Neural Information Processing Systems 31, pages 930–941. Curran Associates, Inc., 2018.

David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SyProzZAW.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors.
Nature , 323(6088):533–536, 1986. doi: 10.1038/323533a0.URL . hérief-Abdellatif Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with reluactivation function.
ArXiv , arxiv:1708.06633, 2017.Rishit Sheth and Roni Khardon. Excess risk bounds for the bayes risk using variationalinference in latent gaussian models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,R. Fergus, S. Vishwanathan, and R. Garnett, editors,
Advances in Neural InformationProcessing Systems 30 , pages 5151–5161. Curran Associates, Inc., 2017.David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, ArthurGuez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, TimothyLillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and DemisHassabis. Mastering the game of go without human knowledge.
Nature , 550:354–, October2017. URL http://dx.doi.org/10.1038/nature24270 .Daniel Soudry and Yair Carmon. No bad local minima: Data independent training errorguarantees for multilayer neural networks. 05 2016.Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and RuslanSalakhutdinov. Dropout: A simple way to prevent neural networks from over-fitting.
Journal of Machine Learning Research , 15:1929–1958, 2014. URL http://jmlr.org/papers/v15/srivastava14a.html .J.A. Stanford, K Giardina, G.A. Gerhardt, Kenji Fukumizu, and Shun-ichi Amari. Localminima and plateaus in hierarchical structures of multilayer perceptrons.
Neural Networks ,13, 05 2000. doi: 10.1016/S0893-6080(00)00009-5.Taiji Suzuki. Fast generalization error bound of deep learning from a kernel perspective. InAmos Storkey and Fernando Perez-Cruz, editors,
Proceedings of the Twenty-First Inter-national Conference on Artificial Intelligence and Statistics , volume 84 of
Proceedings ofMachine Learning Research , pages 1397–1406, Playa Blanca, Lanzarote, Canary Islands,09–11 Apr 2018. PMLR. URL http://proceedings.mlr.press/v84/suzuki18a.html .Taiji Suzuki. Adaptivity of deep reLU network for learning in besov and mixed smoothbesov spaces: optimal rate and curse of dimensionality. In
International Conference onLearning Representations , 2019. URL https://openreview.net/forum?id=H1ebTsActm .Michalis K. Titsias and Miguel Lázaro-Gredilla. Spike and slab variational inference formulti-task and multiple kernel learning. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett,F. Pereira, and K. Q. Weinberger, editors,
Advances in Neural Information ProcessingSystems 24 , pages 2339–2347. Curran Associates, Inc., 2011.Francesco Tonolini, Bjorn Sand Jensen, and Roderick Murray-Smith. Variational sparsecoding, 2019. URL https://openreview.net/forum?id=SkeJ6iR9Km .Alexandre B. Tsybakov.
Introduction to Nonparametric Estimation . Springer PublishingCompany, Incorporated, 1st edition, 2008. ISBN 0387790519, 9780387790510. onvergence Rates of Variational Inference in Sparse Deep Learning Mariia Vladimirova, Jakob Verbeek, Pablo Mesejo, and Julyan Arbel. Understand-ing priors in Bayesian neural networks at the unit level. In Kamalika Chaud-huri and Ruslan Salakhutdinov, editors,
Proceedings of the 36th International Con-ference on Machine Learning , volume 97 of
Proceedings of Machine Learning Re-search , pages 6458–6467, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/vladimirova19a.html .Y. Wang and D. M. Blei. Frequentist consistency of variational Bayes. Journal of theAmerican Statistical Association (to appear), 2018.Dmitry Yarotsky. Error bounds for approximations with deep relu networks.
Neural Net-works , 94, 10 2016. doi: 10.1016/j.neunet.2017.07.002.Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals.Understanding deep learning requires rethinking generalization. 2017. URL https://arxiv.org/abs/1611.03530 .F. Zhang and C. Gao. Convergence rates of variational posterior distributions. arXiv preprintarXiv:1712.02519v1 , 2017. hérief-Abdellatif Appendix A. Connection between concentration and consistency
In this appendix, we show the connection between the notions of consistency and concentration.

The Bayesian estimator $\rho$ (e.g. the tempered posterior $\pi_{n,\alpha}$ or its variational approximation $\tilde\pi_{n,\alpha}$) is said to be consistent if its generalization error goes to zero as $n \to +\infty$:
$$\mathbb{E}\left[\int \|f_\theta - f_0\|_2^2\, \rho(d\theta)\right] \xrightarrow[n\to+\infty]{} 0.$$
We say that the Bayesian estimator $\rho$ concentrates at rate $r_n$ (Ghosal et al., 2000) if, in probability with respect to the data-generating process, the estimator concentrates asymptotically around the true distribution, i.e.
$$\rho\Big(\theta\in\Theta_{S,L,D} :\ \|f_\theta - f_0\|_2^2 > M_n r_n\Big) \xrightarrow[n\to+\infty]{} 0$$
in probability as $n\to+\infty$ for any $M_n\to+\infty$.

Consistency of the Bayesian distribution $\rho$ at rate $r_n$ implies its concentration at rate $r_n$. Indeed, assume that $\rho$ is consistent at rate $r_n$, i.e.
$$\mathbb{E}\left[\int \|f_\theta - f_0\|_2^2\, \rho(d\theta)\right] \le r_n.$$
Then, using Markov's inequality, for any $M_n\to+\infty$ as $n\to+\infty$,
$$\mathbb{E}\left[\rho\Big(\theta\in\Theta_{S,L,D} :\ \|f_\theta - f_0\|_2^2 > M_n r_n\Big)\right] \le \frac{\mathbb{E}\left[\int \|f_\theta - f_0\|_2^2\, \rho(d\theta)\right]}{M_n r_n} \le \frac{r_n}{M_n r_n} = \frac{1}{M_n} \to 0.$$
Hence $\rho\big(\theta\in\Theta_{S,L,D} : \|f_\theta - f_0\|_2^2 > M_n r_n\big)$ converges to $0$ in mean, and therefore in probability, i.e. $\rho$ concentrates around $f_0$ at rate $r_n$.
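As a quick numerical sanity check of this Markov argument, the following Python sketch replaces the distribution of $\|f_\theta - f_0\|_2^2$ under $\rho$ by an arbitrary nonnegative toy distribution with mean $r_n$ (all numbers are illustrative, not taken from the paper) and verifies that the mass above $M_n r_n$ is indeed controlled by $1/M_n$.

import numpy as np

rng = np.random.default_rng(0)
r_n = 1e-2            # assumed rate (toy value)
M_n = 50.0            # one value of a sequence M_n -> infinity suffices here

# Toy stand-in for the law of ||f_theta - f_0||^2 under rho, with mean r_n
# (this is exactly the consistency assumption).
losses = rng.exponential(scale=r_n, size=1_000_000)

mass_above = np.mean(losses > M_n * r_n)       # rho(||f_theta - f_0||^2 > M_n r_n)
markov_bound = losses.mean() / (M_n * r_n)     # Markov's inequality, roughly 1/M_n

print(f"empirical mass above M_n r_n : {mass_above:.2e}")
print(f"Markov bound (~1/M_n)        : {markov_bound:.2e}")
assert mass_above <= markov_bound + 1e-12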
Appendix B. Proof of Theorem 2

The proof of Theorem 2 is composed of three main steps. The first one consists in obtaining the general shape of the inequality using PAC-Bayes bounds, and the two others in finding a rate that satisfies the extended prior mass condition.

First step: we obtain the general inequality.

We start from Inequality 2.6 in Alquier and Ridgway (2017), which provides an upper bound on the generalization error in $\alpha$-Rényi divergence. We denote by $P_0$ the generating distribution of any pair $(X_i, Y_i)$ and by $P_\theta$ the distribution characterizing the model. Then, for any $\alpha\in(0,1)$,
$$\mathbb{E}\left[\int D_\alpha(P_\theta, P_0)\,\tilde\pi_{n,\alpha}(d\theta)\right] \le \inf_{q\in\mathcal F_{S,L,D}}\left\{\frac{\alpha}{1-\alpha}\int \mathrm{KL}(P_0\|P_\theta)\, q(d\theta) + \frac{\mathrm{KL}(q\|\pi)}{n(1-\alpha)}\right\}.$$
Moreover, the $\alpha$-Rényi divergence equals $D_\alpha(P_\theta, P_0) = \frac{\alpha}{2\sigma^2}\|f_\theta - f_0\|_2^2$, the KL divergence equals $\mathrm{KL}(P_0\|P_\theta) = \frac{1}{2\sigma^2}\|f_\theta - f_0\|_2^2$, and for any $\theta^*$, $\|f_\theta - f_0\|_2^2 \le 2\|f_\theta - f_{\theta^*}\|_2^2 + 2\|f_{\theta^*} - f_0\|_2^2$. Hence, for any $\theta^*\in\Theta_{S,L,D}$,
$$\mathbb{E}\left[\int \frac{\alpha}{2\sigma^2}\|f_\theta - f_0\|_2^2\,\tilde\pi_{n,\alpha}(d\theta)\right] \le \frac{\alpha}{(1-\alpha)\sigma^2}\|f_{\theta^*} - f_0\|_2^2 + \inf_{q\in\mathcal F_{S,L,D}}\left\{\frac{\alpha}{(1-\alpha)\sigma^2}\int \|f_\theta - f_{\theta^*}\|_2^2\, q(d\theta) + \frac{\mathrm{KL}(q\|\pi)}{n(1-\alpha)}\right\},$$
i.e. for any $\theta^*\in\Theta_{S,L,D}$,
$$\mathbb{E}\left[\int \|f_\theta - f_0\|_2^2\,\tilde\pi_{n,\alpha}(d\theta)\right] \le \frac{2}{1-\alpha}\|f_{\theta^*} - f_0\|_2^2 + \inf_{q\in\mathcal F_{S,L,D}}\left\{\frac{2}{1-\alpha}\int \|f_\theta - f_{\theta^*}\|_2^2\, q(d\theta) + \frac{2\sigma^2}{\alpha}\,\frac{\mathrm{KL}(q\|\pi)}{n(1-\alpha)}\right\}.$$
From now on, the rest of the proof consists in finding a distribution $q_n^*\in\mathcal F_{S,L,D}$ that satisfies, for $\theta^* = \arg\min_{\theta\in\Theta_{S,L,D}}\|f_\theta - f_0\|_2^2$, the extended prior mass condition, i.e. that satisfies both
$$\int \|f_\theta - f_{\theta^*}\|_2^2\, q_n^*(d\theta) \le r_n \qquad (4)$$
and
$$\mathrm{KL}(q_n^*\|\pi) \le n r_n \qquad (5)$$
with
$$r_n = \frac{SL}{n}\log(BD) + \frac{S}{n}\log\big(BL(D+1)^2\big) + \frac{S}{2n}\log\left(\frac{2n}{S}\big\{(d+2)L\big\}^2\right),$$
which is smaller than $r_n^{S,L,D}$ (using that $(x+2)L \le x^L$ for $x\ge 2$ and $L$ large enough). This will lead to
$$\mathbb{E}\left[\int \|f_\theta - f_0\|_2^2\,\tilde\pi_{n,\alpha}(d\theta)\right] \le \frac{2}{1-\alpha}\inf_{\theta^*\in\Theta_{S,L,D}}\|f_{\theta^*} - f_0\|_2^2 + \frac{2}{1-\alpha}\left(1+\frac{\sigma^2}{\alpha}\right) r_n^{S,L,D}.$$

Second step: we prove Inequality (4).

To begin with, we define the loss of the $\ell$-th layer of the neural network $f_\theta$:
$$r_\ell(\theta) = \sup_{x\in[-1,1]^d}\ \sup_{1\le i\le D} \big|f_\theta^\ell(x)_i - f_{\theta^*}^\ell(x)_i\big|,$$
where the partial networks $f_\theta^\ell$ are defined by
$$f_\theta^0(x) := x, \qquad f_\theta^\ell(x) := \rho\big(A_\ell f_\theta^{\ell-1}(x) + b_\ell\big) \quad \text{for } \ell = 1,\dots,L.$$
We also define the loss of the output layer:
$$r_L(\theta) = \sup_{x\in[-1,1]^d} \big|f_\theta^L(x) - f_{\theta^*}^L(x)\big| = \sup_{x\in[-1,1]^d} \big|f_\theta(x) - f_{\theta^*}(x)\big|.$$
We will prove by induction that for any $\ell = 1,\dots,L$,
$$r_\ell(\theta) \le (BD)^{\ell-1}\left(d+1+\frac{1}{BD-1}\right)\sum_{u=1}^{\ell}\tilde A_u + \sum_{u=1}^{\ell}(BD)^{\ell-u}\,\tilde b_u,$$
where $\tilde A_u = \sup_{i,j}|A_{u,i,j} - A^*_{u,i,j}|$ and $\tilde b_u = \sup_j |b_{u,j} - b^*_{u,j}|$. To do so, we will also prove by induction that
$$c_\ell \le B^\ell D^{\ell-1}\left(d+1+\frac{1}{BD-1}\right),$$
where $c_\ell = \sup_{x\in[-1,1]^d}\sup_{1\le i\le D}|f^\ell_{\theta^*}(x)_i|$ for $\ell = 1,\dots,L-1$ and $c_L = \sup_{x\in[-1,1]^d}|f_{\theta^*}(x)|$, using the formula
$$x_n \le u_n x_{n-1} + v_n \;\Longrightarrow\; x_n \le \sum_{i=2}^{n}\left(\prod_{j=i+1}^{n}u_j\right)v_i + \left(\prod_{j=2}^{n}u_j\right)x_1 \qquad (6)$$
for any $n\ge 2$, with the convention $\prod_{j=n+1}^{n}u_j = 1$.

Indeed, we have according to Assumption 3.1:

• Initialization:
$$c_1 = \sup_{x\in[-1,1]^d}\sup_{1\le i\le D}|f^1_{\theta^*}(x)_i| \le \sup_{x}\sup_{i}\left|\sum_{j=1}^{d}A^*_{1,i,j}x_j + b^*_{1,i}\right| \le \sup_{x}\sup_{i}\left\{\sum_{j=1}^{d}|A^*_{1,i,j}|\,|x_j| + |b^*_{1,i}|\right\} \le dB + B = (d+1)B.$$

• For any layer $\ell$:
$$c_\ell \le \sup_{x}\sup_{i}\left|\sum_{j=1}^{D}A^*_{\ell,i,j}f^{\ell-1}_{\theta^*}(x)_j + b^*_{\ell,i}\right| \le \sup_{x}\sup_{i}\left\{\sum_{j=1}^{D}|A^*_{\ell,i,j}|\,|f^{\ell-1}_{\theta^*}(x)_j| + |b^*_{\ell,i}|\right\} \le DB\,c_{\ell-1} + B.$$
• Hence, using Formula (6), we get:
$$c_\ell \le \sum_{u=2}^{\ell}\left(\prod_{v=u+1}^{\ell}DB\right)B + \left(\prod_{v=2}^{\ell}BD\right)c_1 \le B\sum_{u=0}^{\ell-2}(BD)^u + (BD)^{\ell-1}(d+1)B = B\,\frac{(BD)^{\ell-1}-1}{BD-1} + (d+1)D^{\ell-1}B^{\ell} \le B^{\ell}D^{\ell-1}\left(d+1+\frac{1}{BD-1}\right).$$

Let us now come back to finding an upper bound on the losses of the partial networks $f^\ell_\theta$. As previously, we have:

• Initialization:
$$r_1(\theta) = \sup_{x\in[-1,1]^d}\sup_{1\le i\le D}\big|f^1_{\theta^*}(x)_i - f^1_\theta(x)_i\big| \le \sup_{x}\sup_{i}\left\{\sum_{j=1}^{d}|A_{1,i,j}-A^*_{1,i,j}|\,|x_j| + |b_{1,i}-b^*_{1,i}|\right\} \le d\,\tilde A_1 + \tilde b_1.$$

• For any layer $\ell$:
$$r_\ell(\theta) \le \sup_{x}\sup_{i}\left\{\sum_{j=1}^{D}\big|A_{\ell,i,j}f^{\ell-1}_\theta(x)_j - A^*_{\ell,i,j}f^{\ell-1}_{\theta^*}(x)_j\big| + |b_{\ell,i}-b^*_{\ell,i}|\right\}$$
$$\le \sup_{x}\sup_{i}\left\{\sum_{j=1}^{D}\Big[|A_{\ell,i,j}-A^*_{\ell,i,j}|\,|f^{\ell-1}_{\theta^*}(x)_j| + |A_{\ell,i,j}|\,|f^{\ell-1}_{\theta^*}(x)_j - f^{\ell-1}_{\theta}(x)_j|\Big] + |b_{\ell,i}-b^*_{\ell,i}|\right\}$$
$$\le D\,c_{\ell-1}\,\tilde A_\ell + BD\, r_{\ell-1}(\theta) + \tilde b_\ell \le BD\, r_{\ell-1}(\theta) + \tilde A_\ell\, B^{\ell-1}D^{\ell-1}\left(d+1+\frac{1}{BD-1}\right) + \tilde b_\ell.$$

• Finally, using Formula (6):
$$r_\ell(\theta) \le \sum_{u=2}^{\ell}\left(\prod_{v=u+1}^{\ell}BD\right)\left(\tilde A_u (BD)^{u-1}\Big\{d+1+\frac{1}{BD-1}\Big\} + \tilde b_u\right) + \left(\prod_{v=2}^{\ell}BD\right)r_1(\theta)$$
$$= \left(d+1+\frac{1}{BD-1}\right)\sum_{u=2}^{\ell}(BD)^{\ell-1}\tilde A_u + \sum_{u=2}^{\ell}(BD)^{\ell-u}\tilde b_u + (BD)^{\ell-1}\big(d\,\tilde A_1 + \tilde b_1\big)$$
$$\le (BD)^{\ell-1}\left(d+1+\frac{1}{BD-1}\right)\sum_{u=1}^{\ell}\tilde A_u + \sum_{u=1}^{\ell}(BD)^{\ell-u}\tilde b_u.$$

Then, for any distribution $q$, using $(a+b)^2 \le 2a^2 + 2b^2$ and expanding the squares of the sums,
$$\int\|f_\theta - f_{\theta^*}\|_2^2\, q(d\theta) \le \int\|f_\theta - f_{\theta^*}\|_\infty^2\, q(d\theta) = \int r_L(\theta)^2\, q(d\theta)$$
$$\le 2(BD)^{2(L-1)}\left(d+1+\frac{1}{BD-1}\right)^2\left(\sum_{\ell=1}^{L}\int\tilde A_\ell^2\, q(d\theta) + 2\sum_{\ell=1}^{L}\sum_{k=1}^{\ell-1}\int\tilde A_\ell\tilde A_k\, q(d\theta)\right)$$
$$\quad + 2\left(\sum_{\ell=1}^{L}(BD)^{2(L-\ell)}\int\tilde b_\ell^2\, q(d\theta) + 2\sum_{\ell=1}^{L}\sum_{k=1}^{\ell-1}(BD)^{L-\ell}(BD)^{L-k}\int\tilde b_\ell\tilde b_k\, q(d\theta)\right).$$

Here, we define $q_n^*(\theta)$ as follows:
$$\gamma^*_t = \mathbb{1}(\theta^*_t \neq 0), \qquad \theta_t \sim \gamma^*_t\, \mathcal U\big([\theta^*_t - s_n,\, \theta^*_t + s_n]\big) + (1-\gamma^*_t)\,\delta_{\{0\}}, \quad \text{for each } t = 1,\dots,T,$$
with
$$s_n = \sqrt{\frac{S}{2n}}\,(BD)^{-L}\left\{\left(d+1+\frac{1}{BD-1}\right)^2\frac{L^2}{(BD)^2} + \frac{1}{(BD)^2-1} + \frac{2}{(BD-1)^2}\right\}^{-1/2}.$$
Hence,
$$\int \tilde A_\ell^2\, q_n^*(d\theta) = \int \sup_{i,j}(A_{\ell,i,j} - A^*_{\ell,i,j})^2\, q_n^*(d\theta) \le s_n^2,$$
and
$$\int \tilde A_\ell\tilde A_k\, q_n^*(d\theta) = \left(\int\sup_{i,j}|A_{\ell,i,j}-A^*_{\ell,i,j}|\,q_n^*(d\theta)\right)\left(\int\sup_{i,j}|A_{k,i,j}-A^*_{k,i,j}|\,q_n^*(d\theta)\right) \le s_n\cdot s_n = s_n^2,$$
and similarly $\int\tilde b_\ell^2\, q_n^*(d\theta) \le s_n^2$ and $\int\tilde b_\ell\tilde b_k\, q_n^*(d\theta) \le s_n^2$.
Then, plugging these bounds into the previous inequality and summing the geometric series in $BD$,
$$\int\|f_\theta - f_{\theta^*}\|_2^2\, q_n^*(d\theta) \le 2 s_n^2\,(BD)^{2(L-1)}\left(d+1+\frac{1}{BD-1}\right)^2 L^2 + 2 s_n^2\left(\frac{(BD)^{2L}}{(BD)^2-1} + \frac{2(BD)^{2L}}{(BD-1)^2}\right)$$
$$\le 2 s_n^2\,(BD)^{2L}\left\{\left(d+1+\frac{1}{BD-1}\right)^2\frac{L^2}{(BD)^2} + \frac{1}{(BD)^2-1} + \frac{2}{(BD-1)^2}\right\} = \frac{S}{n} \le r_n,$$
which proves Equation (4).

Third step: we prove Inequality (5).

We will use the fact that for any $K$, any $p, p' \in [0,1]^K$ such that $\sum_{k=1}^K p_k = \sum_{k=1}^K p'_k = 1$, and any distributions $Q_k, Q'_k$, $k=1,\dots,K$,
$$\mathrm{KL}\left(\sum_{k=1}^K p_k Q_k \,\Big\|\, \sum_{k=1}^K p'_k Q'_k\right) \le \mathrm{KL}(p\|p') + \sum_{k=1}^K p_k\, \mathrm{KL}(Q_k\|Q'_k). \qquad (7)$$
Please refer to Lemma 6.1 in Chérief-Abdellatif and Alquier (2018) for a proof. Then we write $q_n^*$ and $\pi$ as mixtures of independent products of two-component mixtures:
$$q_n^* = \sum_{\gamma\in\mathcal S_{S,T}} \mathbb{1}(\gamma = \gamma^*) \bigotimes_{t=1}^{T}\Big\{\gamma_t\, \mathcal U([l_t, u_t]) + (1-\gamma_t)\,\delta_{\{0\}}\Big\}
\quad\text{and}\quad
\pi = \sum_{\gamma\in\mathcal S_{S,T}} \binom{T}{S}^{-1} \bigotimes_{t=1}^{T}\Big\{\gamma_t\, \mathcal U([-B, B]) + (1-\gamma_t)\,\delta_{\{0\}}\Big\},$$
where $[l_t, u_t] = [\theta^*_t - s_n,\, \theta^*_t + s_n]$ and $\mathcal S_{S,T}$ is the set of binary vectors of length $T$ with exactly $S$ nonzero components. Hence, using Inequality (7) twice and the additivity of the KL divergence for independent distributions,
$$\mathrm{KL}(q_n^*\|\pi) \le \mathrm{KL}\left(\{\mathbb{1}(\gamma=\gamma^*)\}_{\gamma\in\mathcal S_{S,T}} \,\Big\|\, \Big\{\binom{T}{S}^{-1}\Big\}_{\gamma\in\mathcal S_{S,T}}\right) + \sum_{\gamma\in\mathcal S_{S,T}}\mathbb{1}(\gamma=\gamma^*)\,\mathrm{KL}\left(\bigotimes_{t=1}^T\big\{\gamma_t\mathcal U([l_t,u_t]) + (1-\gamma_t)\delta_{\{0\}}\big\}\,\Big\|\,\bigotimes_{t=1}^T\big\{\gamma_t\mathcal U([-B,B]) + (1-\gamma_t)\delta_{\{0\}}\big\}\right)$$
$$= \log\binom{T}{S} + \sum_{t=1}^{T}\mathrm{KL}\Big(\gamma^*_t\,\mathcal U([l_t,u_t]) + (1-\gamma^*_t)\,\delta_{\{0\}} \,\Big\|\, \gamma^*_t\,\mathcal U([-B,B]) + (1-\gamma^*_t)\,\delta_{\{0\}}\Big)$$
$$\le \log\binom{T}{S} + \sum_{t=1}^{T}\gamma^*_t\,\mathrm{KL}\big(\mathcal U([l_t,u_t])\,\big\|\,\mathcal U([-B,B])\big) + \sum_{t=1}^{T}(1-\gamma^*_t)\,\mathrm{KL}\big(\delta_{\{0\}}\|\delta_{\{0\}}\big)$$
$$\le S\log(T) + \sum_{t=1}^{T}\gamma^*_t\log\left(\frac{2B}{u_t - l_t}\right) = S\log(T) + S\log\left(\frac{B}{s_n}\right) = S\log(T) + S\log(B) + S\log\left(\frac{1}{s_n}\right).$$
Hence, using the definition of $s_n$, $T\le L(D+1)^2$ and $BD\ge 2$,
$$\mathrm{KL}(q_n^*\|\pi) \le S\log(T) + S\log(B) + SL\log(BD) + \frac{S}{2}\log\left(\frac{2n}{S}\left\{\left(d+1+\frac{1}{BD-1}\right)^2\frac{L^2}{(BD)^2} + \frac{1}{(BD)^2-1} + \frac{2}{(BD-1)^2}\right\}\right)$$
$$\le S\log\big(L(D+1)^2\big) + S\log(B) + SL\log(BD) + \frac{S}{2}\log\left(\frac{2n}{S}\big\{(d+2)L\big\}^2\right) \le n r_n,$$
which ends the proof.
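The single-coordinate KL term above has the closed form $\log\big(2B/(u_t-l_t)\big)$ because both densities are constant and the support of the first uniform is included in that of the second. The following Python sketch (toy values of $B$, $\theta^*_t$ and $s_n$, chosen only for illustration) checks this closed form against direct numerical integration.

import numpy as np
from scipy.integrate import quad

B, theta_star, s_n = 2.0, 0.7, 0.05    # toy values, not the ones from the proof

p = lambda x: 1.0 / (2 * s_n)           # density of U([theta* - s_n, theta* + s_n])
q = lambda x: 1.0 / (2 * B)             # density of U([-B, B]) on the smaller support

kl_numeric, _ = quad(lambda x: p(x) * np.log(p(x) / q(x)),
                     theta_star - s_n, theta_star + s_n)
kl_closed_form = np.log(2 * B / (2 * s_n))    # = log(B / s_n)

print(kl_numeric, kl_closed_form)             # both ~ 3.689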
Appendix C. Proof of Corollary 3

Corollary 3 is a direct consequence of Theorem 2: we just need to find an upper bound on $\inf_{\theta^*\in\Theta_{S,L,D}}\|f_{\theta^*} - f_0\|_\infty$ and on $r_n^{S,L,D}$. Indeed, according to Theorem 2,
$$\mathbb{E}\left[\int\|f_\theta - f_0\|_2^2\,\tilde\pi_{n,\alpha}(d\theta)\right] \le \frac{2}{1-\alpha}\inf_{\theta^*\in\Theta_{S,L,D}}\|f_{\theta^*}-f_0\|_\infty^2 + \frac{2}{1-\alpha}\left(1+\frac{\sigma^2}{\alpha}\right) r_n. \qquad (8)$$
We directly use the rate $r_n$ from the proof of Theorem 2 rather than $r_n^{S,L,D}$.

Let us assume that $f_0$ is $\beta$-Hölder smooth with $0<\beta<d$. Then, according to Lemma 5.1 in Rockova and Polson (2018), we have for some positive constant $C_D$ independent of $n$ (see Theorem 6.1 in Rockova and Polson (2018)) a neural network with architecture
$$L = 8 + (\lfloor\log n\rfloor + 5)(1 + \lceil\log d\rceil), \qquad D = C_D\big\lfloor n^{\frac{d}{2\beta+d}}/\log n\big\rfloor, \qquad S \le d(\beta+1)^d\, D\,(L + \lceil\log d\rceil),$$
whose approximation error $\|f_{\theta^*} - f_0\|_\infty$ is at most a constant multiple of
$$\frac{D}{n} + D^{-\beta/d} \le C_D\, \frac{n^{-\frac{2\beta}{2\beta+d}}}{\log n} + C_D^{-\beta/d}\, n^{-\frac{\beta}{2\beta+d}}\log^{\beta/d} n \le \Big(C_D/\log n + C_D^{-\beta/d}\log n\Big)\, n^{-\frac{\beta}{2\beta+d}},$$
which gives an upper bound on the first term of the right-hand side of Inequality (8) of order $n^{-\frac{2\beta}{2\beta+d}}\log^2 n$.

At the same time, we have for some constants $C, C'$ that do not depend on $n$,
$$r_n \le \frac{SL}{n}\log(BD) + \frac{S}{n}\log\big(2BL(D+1)^2\big) + \frac{S}{2n}\log\left(\frac{2n}{S}\{(d+2)L\}^2\right) \le C\left(\frac{DL^2}{n}\log D + \frac{DL}{n}\log(LD) + \frac{DL}{n}\log n\right) \le C'\,\frac{n^{\frac{d}{2\beta+d}}}{n}\log^2 n = C'\, n^{-\frac{2\beta}{2\beta+d}}\log^2 n.$$
Then the variational approximation $\tilde\pi_{n,\alpha}$ of the tempered posterior concentrates at the minimax rate $n^{-\frac{2\beta}{2\beta+d}}$ up to a squared logarithmic factor for the expected squared $L_2$-distance, in the sense that
$$\tilde\pi_{n,\alpha}\Big(\theta\in\Theta_{S,L,D} :\ \|f_\theta - f_0\|_2^2 > M_n\, n^{-\frac{2\beta}{2\beta+d}}\log^2 n\Big) \xrightarrow[n\to+\infty]{} 0$$
in probability as $n\to+\infty$ for any $M_n\to+\infty$.
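The following Python sketch simply evaluates these prescriptions for a few sample sizes. It is purely illustrative: the constant $C_D$ is set to $1$, the sparsity formula is taken verbatim from the statement above, and only the orders of magnitude of $L$, $D$, $S$ and of the rate $n^{-2\beta/(2\beta+d)}\log^2 n$ are meaningful.

import math

def sparse_dnn_architecture(n: int, beta: float, d: int, C_D: float = 1.0):
    """Architecture prescribed in the proof of Corollary 3 (C_D set to 1 for illustration)."""
    L = 8 + (math.floor(math.log(n)) + 5) * (1 + math.ceil(math.log(d)))
    D = max(1, math.floor(C_D * n ** (d / (2 * beta + d)) / math.log(n)))
    S = math.ceil(d * (beta + 1) ** d * D * (L + math.ceil(math.log(d))))
    rate = n ** (-2 * beta / (2 * beta + d)) * math.log(n) ** 2   # minimax rate up to log^2 n
    return L, D, S, rate

for n in (10**3, 10**5, 10**7):
    L, D, S, rate = sparse_dnn_architecture(n, beta=2.0, d=5)
    print(f"n={n:>9}: depth L={L}, width D={D}, sparsity S={S}, rate ~ {rate:.3e}")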
Appendix D. Proof of Theorem 1

We could prove Theorem 1 using the prior mass condition (2), but we use instead the same argument as for Theorem 2. Indeed, we can easily show that for any $\theta^*\in\Theta_{S,L,D}$,
$$\mathbb{E}\left[\int\|f_\theta - f_0\|_2^2\, \pi_{n,\alpha}(d\theta)\right] \le \frac{2}{1-\alpha}\|f_{\theta^*}-f_0\|_2^2 + \inf_{q}\left\{\frac{2}{1-\alpha}\int\|f_\theta - f_{\theta^*}\|_2^2\,q(d\theta) + \frac{2\sigma^2}{\alpha}\,\frac{\mathrm{KL}(q\|\pi)}{n(1-\alpha)}\right\},$$
where the infimum is taken over all probability distributions on $\Theta_{S,L,D}$. We have
$$\inf_{q}\left\{\frac{2}{1-\alpha}\int\|f_\theta - f_{\theta^*}\|_2^2\,q(d\theta) + \frac{2\sigma^2}{\alpha}\,\frac{\mathrm{KL}(q\|\pi)}{n(1-\alpha)}\right\} \le \inf_{q\in\mathcal F_{S,L,D}}\left\{\frac{2}{1-\alpha}\int\|f_\theta - f_{\theta^*}\|_2^2\,q(d\theta) + \frac{2\sigma^2}{\alpha}\,\frac{\mathrm{KL}(q\|\pi)}{n(1-\alpha)}\right\} \le \frac{2}{1-\alpha}\left(1+\frac{\sigma^2}{\alpha}\right) r_n^{S,L,D},$$
which implies
$$\mathbb{E}\left[\int\|f_\theta - f_0\|_2^2\, \pi_{n,\alpha}(d\theta)\right] \le \frac{2}{1-\alpha}\inf_{\theta^*\in\Theta_{S,L,D}}\|f_{\theta^*}-f_0\|_2^2 + \frac{2}{1-\alpha}\left(1+\frac{\sigma^2}{\alpha}\right) r_n^{S,L,D} \le \frac{2}{1-\alpha}\inf_{\theta^*\in\Theta_{S,L,D}}\|f_{\theta^*}-f_0\|_\infty^2 + \frac{2}{1-\alpha}\left(1+\frac{\sigma^2}{\alpha}\right) r_n^{S,L,D}.$$
The rest of the proof follows the same lines as that of Corollary 3.
Appendix E. Proof of Theorem 4
First, we need Donsker and Varadhan's variational formula; we refer to Lemma 1.1.3 in Catoni (2007) for a proof.
Theorem 6.
For any probability measure $\lambda$ on a measurable space $(E,\mathcal E)$ and any measurable function $h: E\to\mathbb R$ such that $\int e^h\, d\lambda < \infty$,
$$\log\int e^h\, d\lambda = \sup_{q}\left\{\int h\, dq - \mathrm{KL}(q\|\lambda)\right\},$$
where the supremum is taken over all probability distributions over $E$, with the convention $\infty - \infty = -\infty$. Moreover, if $h$ is upper-bounded on the support of $\lambda$, then the supremum is reached by the Gibbs distribution
$$\lambda_h(d\beta) = \frac{e^{h(\beta)}}{\int e^h\, d\lambda}\,\lambda(d\beta).$$

Let us come back to the proof of Theorem 4. Here, we cannot directly use Inequality 2.6 in Alquier and Ridgway (2017), so we begin from scratch. For any $\alpha\in(0,1)$ and $\theta\in\Theta_{S,L,D}$, using the definition of the Rényi divergence and the identity $D_\alpha(P^{\otimes n}, R^{\otimes n}) = n D_\alpha(P, R)$ for i.i.d. data,
$$\mathbb{E}\left[\exp\Big(-\alpha\, r_n(P_\theta, P_0) + (1-\alpha)\, n D_\alpha(P_\theta, P_0)\Big)\right] = 1,$$
where
$$r_n(P_\theta, P_0) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Big\{(Y_i - f_\theta(X_i))^2 - (Y_i - f_0(X_i))^2\Big\}$$
is the negative log-likelihood ratio. Then we integrate with respect to the prior and use Fubini's theorem:
$$\mathbb{E}\left[\int\exp\Big(-\alpha\, r_n(P_\theta, P_0) + (1-\alpha)\, n D_\alpha(P_\theta, P_0)\Big)\,\pi(d\theta)\right] = 1.$$
According to Theorem 6,
$$\mathbb{E}\left[\exp\left(\sup_{q}\left\{\int\Big(-\alpha\, r_n(P_\theta, P_0) + (1-\alpha)\, n D_\alpha(P_\theta, P_0)\Big)\, q(d\theta) - \mathrm{KL}(q\|\pi)\right\}\right)\right] = 1,$$
where the supremum is taken over all probability distributions over $\Theta_{S,L,D}$. Then, using Jensen's inequality,
$$\mathbb{E}\left[\sup_{q}\left\{\int\Big(-\alpha\, r_n(P_\theta, P_0) + (1-\alpha)\, n D_\alpha(P_\theta, P_0)\Big)\, q(d\theta) - \mathrm{KL}(q\|\pi)\right\}\right] \le 0,$$
and then, taking $q = \tilde\pi^k_{n,\alpha}$,
$$\mathbb{E}\left[\int\Big(-\alpha\, r_n(P_\theta, P_0) + (1-\alpha)\, n D_\alpha(P_\theta, P_0)\Big)\,\tilde\pi^k_{n,\alpha}(d\theta) - \mathrm{KL}(\tilde\pi^k_{n,\alpha}\|\pi)\right] \le 0.$$
We rearrange terms:
$$\mathbb{E}\left[\int D_\alpha(P_\theta, P_0)\,\tilde\pi^k_{n,\alpha}(d\theta)\right] \le \mathbb{E}\left[\frac{\alpha}{1-\alpha}\int\frac{r_n(P_\theta, P_0)}{n}\,\tilde\pi^k_{n,\alpha}(d\theta) + \frac{\mathrm{KL}(\tilde\pi^k_{n,\alpha}\|\pi)}{n(1-\alpha)}\right],$$
which we can write as
$$\mathbb{E}\left[\int D_\alpha(P_\theta, P_0)\,\tilde\pi^k_{n,\alpha}(d\theta)\right] \le \mathbb{E}\left[\frac{\alpha}{1-\alpha}\int\frac{r_n(P_\theta, P_0)}{n}\,\tilde\pi_{n,\alpha}(d\theta) + \frac{\mathrm{KL}(\tilde\pi_{n,\alpha}\|\pi)}{n(1-\alpha)}\right]$$
$$\quad + \mathbb{E}\left[\frac{\alpha}{1-\alpha}\int\frac{r_n(P_\theta, P_0)}{n}\,\tilde\pi^k_{n,\alpha}(d\theta) + \frac{\mathrm{KL}(\tilde\pi^k_{n,\alpha}\|\pi)}{n(1-\alpha)}\right] - \mathbb{E}\left[\frac{\alpha}{1-\alpha}\int\frac{r_n(P_\theta, P_0)}{n}\,\tilde\pi_{n,\alpha}(d\theta) + \frac{\mathrm{KL}(\tilde\pi_{n,\alpha}\|\pi)}{n(1-\alpha)}\right].$$
Note that $\mathbb{E}\big[\tfrac{r_n(P_\theta,P_0)}{n}\big] = \mathrm{KL}(P_0\|P_\theta) = \frac{\|f_0 - f_\theta\|_2^2}{2\sigma^2}$, and that, up to a constant,
$$\mathcal L_n(q) = -\frac{\alpha}{2\sigma^2}\sum_{i=1}^{n}\int (Y_i - f_\theta(X_i))^2\, q(d\theta) - \mathrm{KL}(q\|\pi).$$
Then
$$\mathbb{E}\left[\int D_\alpha(P_\theta, P_0)\,\tilde\pi^k_{n,\alpha}(d\theta)\right] \le \mathbb{E}\left[\frac{\alpha}{1-\alpha}\int\frac{r_n(P_\theta, P_0)}{n}\,\tilde\pi_{n,\alpha}(d\theta) + \frac{\mathrm{KL}(\tilde\pi_{n,\alpha}\|\pi)}{n(1-\alpha)}\right] + \frac{\mathbb{E}[\mathcal L^*_n - \mathcal L^k_n]}{n(1-\alpha)}.$$
We conclude by interchanging the infimum and the expectation and by the same inequalities as in the proof of Theorem 2:
$$\mathbb{E}\left[\frac{\alpha}{1-\alpha}\int\frac{r_n(P_\theta,P_0)}{n}\,\tilde\pi_{n,\alpha}(d\theta) + \frac{\mathrm{KL}(\tilde\pi_{n,\alpha}\|\pi)}{n(1-\alpha)}\right] = \mathbb{E}\left[\inf_{q\in\mathcal F_{S,L,D}}\left\{\frac{\alpha}{1-\alpha}\int\frac{r_n(P_\theta,P_0)}{n}\,q(d\theta) + \frac{\mathrm{KL}(q\|\pi)}{n(1-\alpha)}\right\}\right]$$
$$\le \inf_{q\in\mathcal F_{S,L,D}}\left\{\mathbb{E}\left[\frac{\alpha}{1-\alpha}\int\frac{r_n(P_\theta,P_0)}{n}\,q(d\theta) + \frac{\mathrm{KL}(q\|\pi)}{n(1-\alpha)}\right]\right\} \le \frac{\alpha}{(1-\alpha)\sigma^2}\inf_{\theta^*\in\Theta_{S,L,D}}\|f_{\theta^*}-f_0\|_2^2 + \frac{\alpha}{(1-\alpha)\sigma^2}\left(1+\frac{\sigma^2}{\alpha}\right) r_n^{S,L,D}.$$
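As a sanity check of Theorem 6, the following Python sketch works on a toy finite space (the measure $\lambda$ and the function $h$ are arbitrary random choices, not objects from the paper) and verifies that the Gibbs distribution attains the supremum, whose value equals $\log\int e^h d\lambda$, while any other distribution gives a smaller value.

import numpy as np

rng = np.random.default_rng(1)
K = 6
lam = rng.dirichlet(np.ones(K))          # toy probability lambda on a 6-point space
h = rng.normal(size=K)                   # toy bounded function h

log_int = np.log(np.sum(np.exp(h) * lam))          # log of the integral of e^h d(lambda)

# Gibbs distribution lambda_h: should attain the supremum in Theorem 6
gibbs = np.exp(h) * lam
gibbs /= gibbs.sum()
value_at_gibbs = np.sum(h * gibbs) - np.sum(gibbs * np.log(gibbs / lam))

# The value at any other q must be no larger
q = rng.dirichlet(np.ones(K))
value_at_q = np.sum(h * q) - np.sum(q * np.log(q / lam))

print(log_int, value_at_gibbs)            # equal up to floating-point error
assert value_at_q <= log_int + 1e-12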
Appendix F. Proof of Theorem 5

We start from the last inequality obtained in the proof of Theorem 3 in Chérief-Abdellatif (2019), which provides an upper bound in $\alpha$-Rényi divergence for the ELBO model selection framework. We still denote by $P_0$ the generating distribution and by $P_\theta$ the distribution characterizing the model. Then, for any $\alpha\in(0,1)$,
$$\mathbb{E}\left[\int D_\alpha(P_\theta, P_0)\,\tilde\pi^{\hat S,\hat L,\hat D}_{n,\alpha}(d\theta)\right] \le \inf_{S,L,D}\left\{\inf_{q\in\mathcal F_{S,L,D}}\left\{\frac{\alpha}{1-\alpha}\int \mathrm{KL}(P_0\|P_{\theta_{S,L,D}})\, q(d\theta_{S,L,D}) + \frac{\mathrm{KL}(q\|\Pi_{S,L,D})}{n(1-\alpha)}\right\} + \frac{\log(1/\pi_{S,L,D})}{n(1-\alpha)}\right\},$$
where $\Pi_{S,L,D}$ denotes the prior over the parameter set $\Theta_{S,L,D}$ and $\pi_{S,L,D}$ the prior belief over model $(S,L,D)$.

As in the proof of Theorem 2, for any $(S,L,D)$ and any $\theta^*\in\Theta_{S,L,D}$,
$$\mathbb{E}\left[\int \frac{\alpha}{2\sigma^2}\|f_\theta - f_0\|_2^2\,\tilde\pi^{\hat S,\hat L,\hat D}_{n,\alpha}(d\theta)\right] \le \frac{\alpha}{(1-\alpha)\sigma^2}\|f_{\theta^*}-f_0\|_2^2 + \inf_{q\in\mathcal F_{S,L,D}}\left\{\frac{\alpha}{(1-\alpha)\sigma^2}\int\|f_\theta - f_{\theta^*}\|_2^2\,q(d\theta) + \frac{\mathrm{KL}(q\|\Pi_{S,L,D})}{n(1-\alpha)}\right\} + \frac{\log(1/\pi_{S,L,D})}{n(1-\alpha)},$$
and then, for any $(S,L,D)$ and any $\theta^*\in\Theta_{S,L,D}$,
$$\mathbb{E}\left[\int\|f_\theta - f_0\|_2^2\,\tilde\pi^{\hat S,\hat L,\hat D}_{n,\alpha}(d\theta)\right] \le \frac{2}{1-\alpha}\|f_{\theta^*}-f_0\|_2^2 + \frac{2}{1-\alpha}\left(1+\frac{\sigma^2}{\alpha}\right) r_n^{S,L,D} + \frac{2\sigma^2}{\alpha(1-\alpha)}\,\frac{\log(1/\pi_{S,L,D})}{n},$$
which finally leads to Theorem 5.
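The selection rule analyzed here picks the architecture maximizing the ELBO, here augmented with the log prior weight of the model. The following Python sketch only illustrates that final selection step: the per-model ELBO values are placeholders supplied by the caller (in practice they come from fitting the spike-and-slab variational family on each candidate architecture), and the geometric form of the model prior $\pi_{S,L,D}$ is an assumption of this sketch, not the paper's choice.

from math import log

def select_architecture(elbo_by_model, log_prior_weight):
    """Pick the architecture (S, L, D) maximizing ELBO + log prior weight of the model."""
    return max(elbo_by_model, key=lambda m: elbo_by_model[m] + log_prior_weight(m))

# Placeholder ELBO values for three candidate architectures (not real fits).
elbos = {(50, 3, 20): -1210.4, (120, 4, 40): -1175.9, (400, 6, 80): -1190.2}
# Assumed model prior pi_{S,L,D} proportional to 2^{-(S+L+D)} (illustrative choice only).
log_prior = lambda m: -sum(m) * log(2)

print(select_architecture(elbos, log_prior))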
Appendix G. Result for sparse Gaussian approximations

In this appendix, we consider unbounded parameter sets $\Theta_{S,L,D}$ and Gaussians instead of uniform distributions. The spike-and-slab prior on $\theta\in\Theta_{S,L,D}$ becomes
$$\gamma \sim \mathcal U(\mathcal S_{S,T}), \qquad \theta_t\,|\,\gamma_t \sim \gamma_t\,\mathcal N(0,1) + (1-\gamma_t)\,\delta_{\{0\}}, \quad t = 1,\dots,T,$$
and the Gaussian-based sparse spike-and-slab approximations are of the form
$$\gamma \sim \pi_\gamma, \qquad \theta_t\,|\,\gamma_t \sim \gamma_t\,\mathcal N(m_t, s_n) + (1-\gamma_t)\,\delta_{\{0\}}, \quad \text{for each } t = 1,\dots,T,$$
where $s_n$ denotes the slab variance. The following theorem states that using Gaussians instead of uniform distributions still leads to consistency with the same rate of convergence. Note that the infimum in the right-hand side of the inequality is taken over a bounded neural network model.
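For concreteness, here is a minimal Python sketch of sampling from such a Gaussian spike-and-slab variational distribution. The sparsity pattern $\gamma$, the means $m_t$ and the variance $s_n$ are placeholder values; in the actual procedure they are the quantities optimized through the ELBO.

import numpy as np

def sample_spike_and_slab(gamma, m, s_n, rng):
    """Draw theta with theta_t ~ gamma_t * N(m_t, s_n) + (1 - gamma_t) * delta_0.
    Here s_n is the slab variance and gamma a 0/1 vector encoding the sparsity pattern."""
    slab = rng.normal(loc=m, scale=np.sqrt(s_n), size=m.shape)
    return np.where(gamma == 1, slab, 0.0)

rng = np.random.default_rng(0)
T, S = 20, 5                                   # toy number of parameters / sparsity
gamma = np.zeros(T); gamma[rng.choice(T, S, replace=False)] = 1
m = rng.normal(size=T)                         # placeholder variational means
theta = sample_spike_and_slab(gamma, m, s_n=1e-3, rng=rng)
print(theta)                                    # exactly T - S entries are zero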
Theorem 7.
Let us introduce the sets $\Theta^B_{S,L,D}$ that contain the neural network parameters upper bounded by $B$ (in $L_\infty$-norm). Then for any $\alpha\in(0,1)$ and any $B\ge 2$,
$$\mathbb{E}\left[\int\|f_\theta - f_0\|_2^2\,\tilde\pi_{n,\alpha}(d\theta)\right] \le \frac{2}{1-\alpha}\inf_{\theta^*\in\Theta^B_{S,L,D}}\|f_{\theta^*}-f_0\|_2^2 + \frac{2}{1-\alpha}\left(1+\frac{\sigma^2}{\alpha}\right) r_n^{S,L,D}$$
with
$$r_n^{S,L,D} = \frac{SL}{n}\log(2BD) + \frac{S}{n}\left(\frac{1}{2}\log(LD) + B^2\right) + \frac{S}{n}\log\left(d\,\max\Big(\frac{n}{S},\, D\Big)\right).$$

Proof.

The proof follows the same structure as for Theorem 2. We fix $B\ge 2$.

First step: we obtain the general inequality. We can directly write, for any $\theta^*\in\Theta_{S,L,D}$,
$$\mathbb{E}\left[\int\|f_\theta - f_0\|_2^2\,\tilde\pi_{n,\alpha}(d\theta)\right] \le \frac{2}{1-\alpha}\|f_{\theta^*}-f_0\|_2^2 + \inf_{q\in\mathcal F_{S,L,D}}\left\{\frac{2}{1-\alpha}\int\|f_\theta - f_{\theta^*}\|_2^2\,q(d\theta) + \frac{2\sigma^2}{\alpha}\,\frac{\mathrm{KL}(q\|\pi)}{n(1-\alpha)}\right\}.$$
We define $\theta^* = \arg\min_{\theta\in\Theta^B_{S,L,D}}\|f_\theta - f_0\|_2^2$. Again, the rest of the proof consists in finding a distribution $q_n^*\in\mathcal F_{S,L,D}$ that satisfies the extended prior mass condition:
$$\int\|f_\theta - f_{\theta^*}\|_2^2\, q_n^*(d\theta) \le r_n \qquad (9)$$
and
$$\mathrm{KL}(q_n^*\|\pi) \le n r_n \qquad (10)$$
with
$$r_n = \frac{SL}{n}\log(2BD) + \frac{S}{n}\log\big(L(D+1)^2\big) + \frac{S\log\log(3D^2)}{2n} + \frac{SB^2}{2n} + \frac{S}{2n}\log\left(\frac{8n}{S}\left\{\Big(d+1+\frac{1}{BD-1}\Big)^2 + \frac{1}{(2BD)^2-1} + \frac{1}{2BD-1}\right\}\right),$$
which is smaller than $r_n^{S,L,D}$ (using that $(x+2)^2\le x^4$ for $x\ge 2$).

Second step: we prove Inequality (9). All coefficients of the parameter $\theta^*$ are upper bounded by $B$. Hence, we still have
$$c_\ell \le B^\ell D^{\ell-1}\left(d+1+\frac{1}{BD-1}\right).$$
However, the upper bound on $r_\ell(\theta)$ is not the same, as $|A_{\ell,i,j}|$ cannot be upper bounded by $B$ directly and must be upper bounded by $|A^*_{\ell,i,j}| + \tilde A_\ell \le B + \tilde A_\ell$:
$$r_\ell(\theta) \le \sup_{x}\sup_{i}\left\{\sum_{j=1}^{D}\Big[|A_{\ell,i,j}-A^*_{\ell,i,j}|\,|f^{\ell-1}_{\theta^*}(x)_j| + (B+\tilde A_\ell)\,|f^{\ell-1}_{\theta^*}(x)_j - f^{\ell-1}_{\theta}(x)_j|\Big] + |b_{\ell,i}-b^*_{\ell,i}|\right\}$$
$$\le D\,c_{\ell-1}\,\tilde A_\ell + (B+\tilde A_\ell)D\, r_{\ell-1}(\theta) + \tilde b_\ell \le (B+\tilde A_\ell)D\, r_{\ell-1}(\theta) + \tilde A_\ell\, B^{\ell-1}D^{\ell-1}\left(d+1+\frac{1}{BD-1}\right) + \tilde b_\ell.$$
Then, using Formula (6) and the inequality $r_1(\theta)\le d\,\tilde A_1 + \tilde b_1$,
$$r_\ell(\theta) \le D^{\ell-1}\left(d+1+\frac{1}{BD-1}\right)\sum_{u=1}^{\ell}B^{u-1}\prod_{v=u+1}^{\ell}(B+\tilde A_v)\,\tilde A_u + \sum_{u=1}^{\ell}D^{\ell-u}\prod_{v=u+1}^{\ell}(B+\tilde A_v)\,\tilde b_u.$$
Then we have, for any mean-field distribution $q(\theta) = q_1(\theta_1)\times\cdots\times q_T(\theta_T)$, expanding the square of $r_L(\theta)$ as in the proof of Theorem 2 and factoring the expectations of independent coordinates,
$$\int\|f_\theta - f_{\theta^*}\|_2^2\, q(d\theta) \le \int r_L(\theta)^2\, q(d\theta),$$
where the right-hand side only involves products of terms of the form $\int(B+\tilde A_v)\,q(d\theta)$, $\int(B+\tilde A_v)^2\,q(d\theta)$, $\int\tilde A_\ell\, q(d\theta)$, $\int\tilde A_\ell^2\, q(d\theta)$ and the analogous quantities for the biases.

Here, we define $q_n^*(\theta)$ as follows:
$$\gamma^*_t = \mathbb{1}(\theta^*_t\neq 0), \qquad \theta_t \sim \gamma^*_t\,\mathcal N(\theta^*_t, s_n) + (1-\gamma^*_t)\,\delta_{\{0\}}, \quad \text{for each } t = 1,\dots,T,$$
with
$$s_n = \frac{S}{8n}\,\big(\log(3D^2)\big)^{-1}\,(2BD)^{-2L}\left\{\Big(d+1+\frac{1}{BD-1}\Big)^2 + \frac{1}{(2BD)^2-1} + \frac{1}{2BD-1}\right\}^{-1}.$$
We upper bound the expectation of the supremum of the absolute values of at most $D^2$ Gaussian variables of variance $s_n$,
$$\int\tilde A_\ell\, q_n^*(d\theta) \le \sqrt{2 s_n\log(2D^2)} \le \sqrt{2 s_n\log(3D^2)},$$
and use Example 2.7 in Boucheron et al. (2003) to get
$$\int\tilde A_\ell^2\, q_n^*(d\theta) \le s_n\big(1 + 2\sqrt{2\log(2D^2)} + 2\log(2D^2)\big) \le 4 s_n\log(3D^2),$$
which also gives
$$\int(B+\tilde A_\ell)\, q_n^*(d\theta) \le B + \sqrt{2 s_n\log(3D^2)} \le 2B \quad\text{and}\quad \int(B+\tilde A_\ell)^2\, q_n^*(d\theta) \le B^2 + 2B\sqrt{2 s_n\log(3D^2)} + 4 s_n\log(3D^2) \le 4B^2,$$
using that $\sqrt{s_n\log(3D^2)}\le B/2$ for our choice of $s_n$. Similarly, $\int\tilde b_\ell\, q_n^*(d\theta)\le\sqrt{2 s_n\log(3D^2)}$ and $\int\tilde b_\ell^2\, q_n^*(d\theta)\le 4 s_n\log(3D^2)$.

Combining these bounds with the expansion above and summing the geometric series exactly as in the proof of Theorem 2 (with $BD$ replaced by $2BD$), we obtain
$$\int\|f_\theta - f_{\theta^*}\|_2^2\, q_n^*(d\theta) \le 8 s_n\log(3D^2)\,(2BD)^{2L}\left\{\Big(d+1+\frac{1}{BD-1}\Big)^2 + \frac{1}{(2BD)^2-1} + \frac{1}{2BD-1}\right\} = \frac{S}{n} \le r_n,$$
which ends Step 2.

Third step: we prove Inequality (10). We end the proof with the same mixture decomposition as in the proof of Theorem 2:
$$\mathrm{KL}(q_n^*\|\pi) \le \log\binom{T}{S} + \sum_{t=1}^{T}\gamma^*_t\,\mathrm{KL}\big(\mathcal N(\theta^*_t, s_n)\,\big\|\,\mathcal N(0,1)\big)$$
$$\le S\log(T) + \sum_{t=1}^{T}\gamma^*_t\left\{\frac{1}{2}\log\Big(\frac{1}{s_n}\Big) + \frac{s_n + (\theta^*_t)^2 - 1}{2}\right\} \le S\log(T) + \sum_{t=1}^{T}\gamma^*_t\left\{\frac{1}{2}\log\Big(\frac{1}{s_n}\Big) + \frac{s_n + B^2 - 1}{2}\right\}$$
$$= S\log(T) + \frac{S(s_n-1)}{2} + \frac{SB^2}{2} + \frac{S}{2}\log\Big(\frac{1}{s_n}\Big)$$
$$\le S\log\big(L(D+1)^2\big) + \frac{SB^2}{2} + SL\log(2BD) + \frac{S}{2}\log\left(\frac{8n}{S}\,\log(3D^2)\left\{\Big(d+1+\frac{1}{BD-1}\Big)^2 + \frac{1}{(2BD)^2-1} + \frac{1}{2BD-1}\right\}\right) \le n r_n,$$
which ends the proof.