Adaptive Network Sparsification with Dependent Variational Beta-Bernoulli Dropout
Juho Lee, Saehoon Kim, Jaehong Yoon, Hae Beom Lee, Eunho Yang, Sung Ju Hwang
Abstract
While variational dropout approaches have been shown to be effective for network sparsification, they are still suboptimal in the sense that they set the dropout rate for each neuron without consideration of the input data. With such input-independent dropout, each neuron is evolved to be generic across inputs, which makes it difficult to sparsify networks without accuracy loss. To overcome this limitation, we propose adaptive variational dropout whose probabilities are drawn from a sparsity-inducing beta-Bernoulli prior. It allows each neuron to be evolved either to be generic or specific for certain inputs, or dropped altogether. Such input-adaptive sparsity-inducing dropout allows the resulting network to tolerate a larger degree of sparsity without losing its expressive power, by removing redundancies among features. We validate our dependent variational beta-Bernoulli dropout on multiple public datasets, on which it obtains significantly more compact networks than baseline methods, with consistent accuracy improvements over the base networks.
1. Introduction
One of the main obstacles in applying deep learning to large-scale problems and low-power computing systems is the large number of network parameters, as it can lead to excessive memory and computational overheads. To tackle this problem, researchers have explored network sparsification methods to remove unnecessary connections in a network, which is implementable either by weight pruning (Han et al., 2016) or sparsity-inducing regularizations (Wen et al., 2016).

Recently, variational Bayesian approaches have been shown to be useful for network sparsification, outperforming non-Bayesian counterparts. They take a completely different approach from the conventional methods that either use thresholding or sparsity-inducing norms on parameters, and use the well-known dropout regularization instead. Specifically, these approaches use variational dropout (Kingma et al., 2015), which adds multiplicative stochastic noise to each neuron, as a means of obtaining sparse neural networks. Removal of unnecessary neurons can be done either by setting the dropout rate individually for each neuron with an unbounded dropout rate (Molchanov et al., 2017) or by pruning based on the signal-to-noise ratio (Neklyudov et al., 2017).

While these variational dropout approaches do yield compact networks, they are suboptimal in that the dropout rate for each neuron is learned completely independently of the given input data and labels. With input-independent dropout regularization, each neuron has no choice but to encode generic information for all possible inputs, since it does not know what inputs and tasks it will be given at evaluation time, as each neuron will be retained with a fixed rate regardless of the input. Obtaining a high degree of sparsity in such a setting is difficult, as dropping any of the neurons will result in information loss. For maximal utilization of the network capacity, and thus to obtain a more compact model, each neuron should be either irreplaceably generic and used by all tasks, or highly specialized for a task such that there exists minimal redundancy among the learned representations. This goal can be achieved by adaptively setting the dropout probability for each input, such that some of the neurons are retained with high probability only for certain types of inputs and tasks.

To this end, we propose a novel input-dependent variational dropout regularization for network sparsification. We first propose beta-Bernoulli dropout, which learns to set the dropout rate for each individual neuron by generating the dropout mask from a beta-Bernoulli prior, and show how to train it using variational inference. This dropout regularization is a proper way of obtaining a Bayesian neural network and also sparsifies the network, since the beta-Bernoulli distribution is a sparsity-inducing prior. Then, we propose dependent beta-Bernoulli dropout, which is an input-dependent version
of our variational dropout regularization.

Such adaptive regularization has been utilized for general network regularization by a non-Bayesian and non-sparsity-inducing model (Ba & Frey, 2013); yet, the increased memory and computational overheads that come from learning additional weights for dropout mask generation made it less appealing for generic network regularization. In our case of network sparsification, however, the overheads at training time are more than rewarded by the reduced memory and computational requirements at evaluation time, thanks to the high degree of sparsification obtained in the final output model.

We validate our dependent beta-Bernoulli variational dropout regularizer on multiple public datasets for network sparsification performance and prediction error, on which it obtains more compact networks with substantially reduced prediction errors when compared with both the base network and existing network sparsification methods. Further analysis of the learned dropout probability for each unit reveals that our input-adaptive variational dropout approach generates a clearly distinguishable dropout mask for each task, and thus enables each task to utilize different sets of neurons for specialization.

Our contribution in this paper is threefold:

• We propose beta-Bernoulli dropout, a novel dropout regularizer which learns to generate a Bernoulli dropout mask for each neuron with a sparsity-inducing prior, and obtains a high degree of sparsity without accuracy loss.

• We further propose dependent beta-Bernoulli dropout, which yields a significantly more compact network than input-independent beta-Bernoulli dropout, and further performs run-time pruning for even lower computational cost.

• Our beta-Bernoulli dropout regularizations provide novel ways to implement a sparse Bayesian neural network, and we provide a variational inference framework for learning it.
2. Related Work
Deep neural networks are known to be prone to overfitting due to their large number of parameters. Dropout (Srivastava et al., 2014) is an effective regularization that helps prevent overfitting by reducing coadaptations of the units in the network. During dropout training, the hidden units of the network are randomly dropped with a fixed probability p, which is equivalent to multiplying the units by Bernoulli noise z ~ Ber(1 − p). It was later found that multiplying Gaussian noise with the same mean and variance, z ~ N(1, p/(1 − p)), works just as well or even better (Srivastava et al., 2014).

Dropout regularizations generally treat the dropout rate p as a hyperparameter to be tuned, but there have been several studies that aim to automatically determine a proper dropout rate. Kingma et al. (2015) propose to determine the variance of the Gaussian dropout by stochastic gradient variational Bayes. Generalized dropout (Srinivas & Babu, 2016) places a beta prior on the dropout rate and learns the posterior of the dropout rate through variational Bayes. They showed that by adjusting the hyperparameters of the beta prior, one can obtain several regularization algorithms with different characteristics. Our beta-Bernoulli dropout is similar to one of its special cases, but while they obtain the dropout estimates via point estimates and compute the gradients of the binary random variables with biased heuristics, we approximate the posterior distribution of the dropout rate with variational distributions and compute asymptotically unbiased gradients for the binary random variables.

Ba & Frey (2013) proposed adaptive dropout (standout), where the dropout rate for each individual neuron is determined as a function of the inputs. This idea is similar in spirit to our dependent beta-Bernoulli dropout, but they use heuristics to model this function, while we use a proper variational Bayesian approach to obtain the dropout rates. One drawback of their model is the increased memory and computational cost from the additional parameters introduced for dropout mask generation, which is not negligible when the network is large. Our model also requires additional parameters, but the increased cost at training time is rewarded at evaluation time, as it yields a significantly sparser network than the baseline model as an effect of the sparsity-inducing prior.

Recently, there has been growing interest in structure learning or sparsification of deep neural networks. Han et al. (2016) proposed a strategy to iteratively prune weak network weights for efficient computation, and Wen et al. (2016) proposed a group sparsity learning algorithm to drop neurons, filters, or even residual blocks in deep neural networks. In Bayesian learning, various sparsity-inducing priors have been demonstrated to efficiently prune network weights with little drop in accuracy (Molchanov et al., 2017; Louizos et al., 2017; Neklyudov et al., 2017; Louizos et al., 2018; Dai et al., 2018). From the nonparametric Bayesian perspective, Feng & Darrell (2015) proposed an IBP-based algorithm that learns a proper number of channels in convolutional neural networks using the asymptotic small-variance limit approximation of the IBP. While our dropout regularizer is motivated by the IBP as with this work, our work is differentiated from it by the input-adaptive adjustment of dropout rates, which allows each neuron to specialize into features specific to some subsets of tasks.
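For concreteness, the two multiplicative-noise formulations of dropout recalled at the beginning of this section can be written as below. This is a small illustrative sketch in PyTorch, not code from any of the cited papers.

```python
import torch

def bernoulli_dropout(h, p):
    """Multiply activations by z ~ Ber(1 - p), rescaled so that E[z * h] = h."""
    z = torch.bernoulli(torch.full_like(h, 1.0 - p))
    return h * z / (1.0 - p)

def gaussian_dropout(h, p):
    """Multiply by z ~ N(1, p / (1 - p)), which matches the first two moments
    of the rescaled Bernoulli mask (Srivastava et al., 2014)."""
    std = (p / (1.0 - p)) ** 0.5
    return h * (1.0 + std * torch.randn_like(h))
```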
3. Background
3.1. Bayesian Neural Networks and SGVB

Suppose that we are given a neural network f(x; W) parametrized by W, a training set D = {(x_n, y_n)}_{n=1}^N, and a likelihood p(y | f(x; W)) chosen according to the problem of interest (e.g., the categorical distribution for a classification task). In Bayesian neural networks, the parameter W is treated as a random variable drawn from a pre-specified prior distribution p(W), and the goal of learning is to compute the posterior distribution p(W | D):

p(W | D) ∝ p(W) ∏_{n=1}^N p(y_n | f(x_n; W)).   (1)

When a novel input x_* is given, the prediction y_* is obtained as a distribution, by mixing W over p(W | D) as follows:

p(y_* | x_*, D) = ∫ p(y_* | f(x_*; W)) p(W | D) dW.   (2)

Unfortunately, p(W | D) is in general computationally intractable due to the computation of p(D), and thus we resort to approximate inference schemes. Specifically, we use variational Bayes (VB), where we posit a variational distribution q(W; φ) of known parametric form and minimize the KL divergence between it and the true posterior p(W | D). It turns out that minimizing D_KL[q(W; φ) || p(W | D)] is equivalent to maximizing the evidence lower bound (ELBO),

L(φ) = ∑_{n=1}^N E_q[log p(y_n | f(x_n; W))] − D_KL[q(W; φ) || p(W)],   (3)

where the first term measures the expected log-likelihood of the dataset w.r.t. q(W; φ), and the second term regularizes q(W; φ) so that it does not deviate too much from the prior. The parameter φ is learned by gradient descent, but this involves two challenges. First, the expected log-likelihood is intractable in many cases, and so is its gradient. To resolve this, we assume that q(W; φ) is reparametrizable, so that we can obtain i.i.d. samples from q(W; φ) by computing a differentiable transformation of i.i.d. noise (Kingma & Welling, 2014; Rezende et al., 2014), ε^(s) ~ r(ε), W^(s) = T(ε^(s); φ). Then we can obtain a low-variance unbiased estimator of the gradient, namely

∇_φ E_q[log p(y_n | f(x_n; W))] ≈ (1/S) ∑_{s=1}^S ∇_φ log p(y_n | f(x_n; T(ε^(s); φ))).   (4)

The second challenge is that the number of training instances N may be too large, which makes it impossible to compute the summation of all expected log-likelihood terms. To address this, we employ stochastic gradient descent, where we approximate the full sum with a summation over a uniformly sampled mini-batch B,

∑_{n=1}^N ∇_φ E_q[log p(y_n | f(x_n; W))] ≈ (N / |B|) ∑_{n ∈ B} ∇_φ E_q[log p(y_n | f(x_n; W))].   (5)

Combining the reparametrization and the mini-batch sampling, we obtain an unbiased estimator of ∇_φ L(φ) to update φ. This procedure, often referred to as stochastic gradient variational Bayes (SGVB) (Kingma & Welling, 2014), is guaranteed to converge to a local optimum under proper learning-rate scheduling.
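The SGVB estimator above is straightforward to implement. The following is a minimal sketch written in PyTorch, with a fully factorized Gaussian q(W) chosen purely for illustration (the model in Section 4 instead keeps a point estimate of W); the function and argument names are ours.

```python
import torch
import torch.nn.functional as F

def sgvb_elbo_estimate(mu, log_sigma, x_batch, y_batch, n_train, forward_fn):
    """One-sample SGVB estimate of the ELBO, combining Eqs. (3)-(5).

    mu, log_sigma : variational parameters of a factorized Gaussian q(W).
    forward_fn    : forward_fn(x, w) runs the network with sampled weights w.
    n_train       : total number of training instances N.
    """
    # Reparametrization: W = mu + sigma * eps with eps ~ N(0, I)  (Eq. 4).
    eps = torch.randn_like(mu)
    w = mu + log_sigma.exp() * eps

    # Expected log-likelihood, rescaled from the mini-batch to the full
    # dataset (Eq. 5); a classification likelihood is assumed here.
    logits = forward_fn(x_batch, w)
    log_lik = -F.cross_entropy(logits, y_batch, reduction='mean') * n_train

    # KL[q(W) || p(W)] for q = N(mu, sigma^2) and p = N(0, I), in closed form.
    kl = 0.5 * (mu.pow(2) + (2.0 * log_sigma).exp() - 2.0 * log_sigma - 1.0).sum()

    # Maximize this estimate (or minimize its negative) over (mu, log_sigma).
    return log_lik - kl
```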
3.2. Latent Feature Models and the Indian Buffet Process

In latent feature models, data are assumed to be generated as combinations of latent features:

d_n = f(W z_n) = f( ∑_{k=1}^K z_{n,k} w_k ),   (6)

where z_{n,k} = 1 means that d_n possesses the k-th feature w_k, and f is an arbitrary function.

The Indian buffet process (IBP) (Griffiths & Ghahramani, 2005) is a generative process for binary matrices with an infinite number of columns. Given N data points, the IBP generates a binary matrix Z ∈ {0, 1}^{N×K} whose n-th row encodes the feature indicator z_n^T. The IBP is suitable as a prior process in latent feature models, since it generates a possibly infinite number of columns and adaptively adjusts the number of features to a given dataset. Hence, with an IBP prior we need not specify the number of features in advance. One interesting observation is that, while it is a marginal of the beta-Bernoulli process (Thibaux & Jordan, 2007), the IBP may also be understood as a limit of the finite-dimensional beta-Bernoulli process. More specifically, the IBP with parameter α > 0 can be obtained as

π_k ~ beta(α/K, 1),   z_{n,k} ~ Ber(π_k),   K → ∞.   (7)

This beta-Bernoulli process naturally induces sparsity in the latent feature allocation matrix Z. As K → ∞, the expected number of nonzero entries in Z converges to Nα (Griffiths & Ghahramani, 2005), where α is a hyperparameter controlling the overall sparsity level of Z.

In this paper, we relate the latent feature model (6) to neural networks with dropout masks. Specifically, the binary random variables z_{n,k} correspond to dropout indicators, and the features w correspond to the inputs or intermediate units in neural networks. From this connection, we can think of a hierarchical Bayesian model where we place the IBP, or finite-dimensional beta-Bernoulli priors, on the binary dropout indicators. We expect that, due to the property of the IBP favoring sparse models, the resulting neural network would also be sparse.

One important assumption in the IBP is that features are exchangeable: the distribution is invariant to permutations of feature assignments. This assumption makes posterior inference convenient, but restricts flexibility when we want to model the dependency of feature assignments on input covariates x, such as times or spatial locations. To this end, Williamson et al. (2010) proposed the dependent Indian buffet process (dIBP), which triggered a line of follow-up work (Zhou et al., 2011; Ren et al., 2011). These models can be summarized by the following generative process:

π ~ p(π),   z | π, x ~ Ber(g(π, x)),   (8)

where g(·, ·) is an arbitrary function mapping π and x to a probability. In our latent feature interpretation of neural network layers above, the input covariates x correspond to the input or the activations of the previous layer. In other words, we build a data-dependent dropout model where the dropout rates depend on the inputs. In the main contribution section, we explain in detail how we construct these data-dependent dropout layers.
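As a quick illustration of the sparsity induced by the finite beta-Bernoulli prior in Eq. (7), the sketch below (NumPy; the names are ours) samples Z for increasing K and checks that the expected number of active entries approaches Nα, so that most of the K columns remain unused.

```python
import numpy as np

def sample_beta_bernoulli(N, K, alpha, rng):
    """Sample a binary matrix Z from the finite beta-Bernoulli prior (Eq. 7)."""
    pi = rng.beta(alpha / K, 1.0, size=K)    # pi_k ~ Beta(alpha/K, 1)
    return rng.binomial(1, pi, size=(N, K))  # z_{n,k} ~ Ber(pi_k)

rng = np.random.default_rng(0)
N, alpha = 100, 5.0
for K in (50, 500, 5000):
    nonzeros = [sample_beta_bernoulli(N, K, alpha, rng).sum() for _ in range(200)]
    # The average number of nonzero entries stays near N * alpha = 500,
    # no matter how large K is, so the allocation matrix is sparse.
    print(K, np.mean(nonzeros))
```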
4. Main Contribution
4.1. Beta-Bernoulli Dropout

Inspired by the latent-feature-model interpretation of layers in neural networks, we propose a Bayesian neural network layer overlaid with binary random masks sampled from a finite-dimensional beta-Bernoulli prior. Specifically, let W be a parameter of a neural network layer, and let z_n ∈ {0, 1}^K be a binary mask vector to be applied to the n-th observation x_n. The dimension of W need not be equal to K. Instead, we may enforce arbitrary group sparsity by sharing the binary masks among multiple elements of W. For instance, let W ∈ R^{K×L×M} be a parameter tensor in a convolutional neural network with K channels. To enforce channel-wise sparsity, we introduce z_n ∈ {0, 1}^K of dimension K, and the resulting masked parameter W̃_n for the n-th observation is given as

{ z_{n,k} W_{k,ℓ,m} | (k, ℓ, m) = (1, 1, 1), ..., (K, L, M) },   (9)

where W_{k,ℓ,m} is the (k, ℓ, m)-th element of W. From now on, with a slight abuse of notation, we denote this binary mask multiplication as

W̃_n = z_n ⊗ W,   (10)

with appropriate sharing of binary mask random variables. The generative process of our Bayesian neural network is then described as

W ~ N(0, λ²I),   π ~ ∏_{k=1}^K beta(π_k; α/K, 1),   z_n | π ~ ∏_{k=1}^K Ber(z_{n,k}; π_k),   W̃_n = z_n ⊗ W.   (11)

Note the difference between our model and the model in (Gal & Ghahramani, 2016). In the latter, only a Gaussian prior is placed on the parameter W, and the dropout is applied in the variational distribution q(W) approximating p(W | D). Our model, on the other hand, includes the binary mask z_n in the prior, and the posterior for the binary masks should also be approximated.

The goal of posterior inference is to compute the posterior distribution p(W, Z, π | D), where Z = {z_1, ..., z_N}. We approximate this posterior with a variational distribution of the form

q(W, Z, π | X) = δ_Ŵ(W) ∏_{k=1}^K q(π_k) ∏_{n=1}^N ∏_{k=1}^K q(z_{n,k} | π_k),   (12)

where we omit the layer indices for simplicity. For W, we conduct a computationally efficient point estimate to obtain the single value Ŵ, with the weight-decay regularization arising from the zero-mean Gaussian prior. For π, following (Nalisnick & Smyth, 2017), we use the Kumaraswamy distribution (Kumaraswamy, 1980) for q(π_k):

q(π_k; a_k, b_k) = a_k b_k π_k^{a_k − 1} (1 − π_k^{a_k})^{b_k − 1},   (13)

since it closely resembles the beta distribution and is easily reparametrizable as

π_k(u; a_k, b_k) = (1 − u^{1/b_k})^{1/a_k},   u ~ Unif([0, 1]).   (14)

We further assume that q(z_{n,k} | π_k) = p(z_{n,k} | π_k) = Ber(π_k). During training, z_k is sampled by reparametrization with continuous relaxation (Maddison et al., 2017; Jang et al., 2017; Gal et al., 2017),

z_k = sgm( (1/τ) ( log (π_k / (1 − π_k)) + log (u / (1 − u)) ) ),   (15)

where τ is the temperature of the continuous relaxation, u ~ Unif([0, 1]), and sgm(x) = 1 / (1 + e^{−x}).
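A minimal sketch of how q(π) and the relaxed masks can be sampled during training, assuming PyTorch; log_a and log_b are unconstrained parametrizations of the Kumaraswamy parameters a_k, b_k, and the function names are ours.

```python
import torch

def sample_pi(log_a, log_b, eps=1e-6):
    """Kumaraswamy reparametrization of q(pi_k), Eqs. (13)-(14)."""
    a, b = log_a.exp(), log_b.exp()
    u = torch.rand_like(a).clamp(eps, 1.0 - eps)
    return (1.0 - u.pow(1.0 / b)).pow(1.0 / a)

def sample_relaxed_mask(pi, temperature, eps=1e-6):
    """Binary-concrete relaxation of z_k ~ Ber(pi_k), Eq. (15)."""
    pi = pi.clamp(eps, 1.0 - eps)
    u = torch.rand_like(pi).clamp(eps, 1.0 - eps)
    logits = torch.log(pi) - torch.log1p(-pi) + torch.log(u) - torch.log1p(-u)
    return torch.sigmoid(logits / temperature)

# During training, a sampled mask multiplies the channel-wise shared weights
# or, equivalently, the corresponding activations; e.g., for a conv layer
# with K output channels:
#   pi = sample_pi(log_a, log_b)
#   z = sample_relaxed_mask(pi, tau)
#   out = conv(x) * z.view(1, -1, 1, 1)
```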
The KL divergence between the prior and the variational distribution is obtained in closed form (Nalisnick & Smyth, 2017):

D_KL[q(Z, π) || p(Z, π)] = ∑_{k=1}^K { ((a_k − α/K) / a_k) (−γ − Ψ(b_k) − 1/b_k) + log (a_k b_k / (α/K)) − (b_k − 1)/b_k },   (16)

where γ is the Euler-Mascheroni constant and Ψ(·) is the digamma function. Note that the infinite series in the KL divergence vanishes because of the choice p(π_k) = beta(π_k; α/K, 1).

We can apply the SGVB framework described in Section 3.1 to optimize the variational parameters {a_k, b_k}_{k=1}^K. After training, the prediction for a novel input x_* is given as

p(y_* | x_*, D) = E_{p(z_*, π, W | D)}[p(y_* | f(x_*; z_* ⊗ W))] ≈ E_{q(z_*, π)}[p(y_* | f(x_*; z_* ⊗ Ŵ))],   (17)

and we found that the following naïve approximation works well in practice:

p(y_* | x_*, D) ≈ p(y_* | f(x_*; E_q[z_*] ⊗ Ŵ)),   (18)

where E_q[z_{*,k}] = E_{q(π_k)}[π_k] and

E_{q(π_k)}[π_k] = b_k Γ(1 + a_k^{−1}) Γ(b_k) / Γ(1 + a_k^{−1} + b_k).   (19)

4.2. Dependent Beta-Bernoulli Dropout

We now describe our Bayesian neural network model with an input-dependent beta-Bernoulli dropout prior, constructed as follows:

W ~ N(0, λ²I),   π ~ ∏_{k=1}^K beta(π_k; α/K, 1),   z_n | π, x_n ~ ∏_{k=1}^K Ber(z_{n,k}; ϕ_k(x_{n,k})).   (20)

Here, x_n is the input to the dropout layer. For convolutional layers, we apply global average pooling to the tensors to obtain vectorized inputs. In principle, we could introduce another fully connected layer, (ϕ_1(x_{n,1}), ..., ϕ_K(x_{n,K})) = sgm(V x_n + c), with additional parameters V ∈ R^{K×K} and c ∈ R^K, but this is undesirable for network sparsification. Rather than adding parameters for a fully connected layer, we propose a simple yet effective way to generate input-dependent probabilities with minimal parameters involved. Specifically, we construct each ϕ_k(x_{n,k}) independently as follows:

ϕ_k(x_{n,k}) = π_k · clamp( γ_k (x_{n,k} − μ_k) / σ_k + β_k, ε ),   (21)

where μ_k and σ_k are estimates of the mean and standard deviation of the k-th component of the inputs, γ_k and β_k are scaling and shifting parameters to be learned, ε > 0 is a small tolerance to prevent overflow, and clamp(x, ε) = min(1 − ε, max(ε, x)). The parameterization in (21) is motivated by batch normalization (Ioffe & Szegedy, 2015). The intuition behind this construction is as follows (see Fig. 1 for a graphical example). The inputs after standardization are approximately distributed as N(0, 1), and thus are centered around zero. If we pass them through min(1 − ε, max(ε, x)), most insignificant dimensions will have probability close to zero. However, some inputs may be important regardless of the significance of the current activation; in that case, we expect the corresponding shifting parameter β_k to be large. Thus, through β = (β_1, ..., β_K) we control the overall sparsity, but we want the β_k to be small unless they are required to be large, so as to obtain sparse outcomes. We enforce this by placing a prior distribution β ~ N(0, ρ²I).
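The input-dependent probabilities of Eq. (21) need only a few per-channel parameters. Below is a sketch assuming PyTorch, with running estimates of μ_k and σ_k maintained as in batch normalization; for brevity, β is kept as a deterministic parameter here, whereas the model above places a Gaussian prior and a variational posterior on it. The module name and defaults are ours.

```python
import torch
import torch.nn as nn

class DependentGate(nn.Module):
    """Input-dependent dropout probabilities phi_k(x_{n,k}) of Eq. (21)."""

    def __init__(self, num_features, eps=1e-3, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        self.gamma = nn.Parameter(torch.ones(num_features))   # scale gamma_k
        self.beta = nn.Parameter(torch.zeros(num_features))   # shift beta_k
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))

    def forward(self, x, pi):
        # x: (batch, K) pooled activations; pi: (K,) samples or means of q(pi_k).
        if self.training:
            mean, var = x.mean(0), x.var(0, unbiased=False)
            with torch.no_grad():
                self.running_mean.lerp_(mean, self.momentum)
                self.running_var.lerp_(var, self.momentum)
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / (var + 1e-5).sqrt()          # standardize as in BN
        gate = (self.gamma * x_hat + self.beta).clamp(self.eps, 1.0 - self.eps)
        return pi * gate                                   # phi_k(x) = pi_k * clamp(...)
```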
The goal of variational inference is hence to learn the posterior distribution p(W, Z, π, β | D), and we approximate this with a variational distribution of the form

q(W, Z, π, β | X) = δ_Ŵ(W) ∏_{k=1}^K q(π_k) q(β_k) ∏_{n=1}^N ∏_{k=1}^K q(z_{n,k} | π_k, x_n),   (22)

where q(π_k) is the same as in beta-Bernoulli dropout, q(β_k) = N(β_k; η_k, κ_k²), and q(z_{n,k} | π_k, x_n) = p(z_{n,k} | π_k, x_n). In principle, we could instead introduce an inference network q(z | π, x, y) and minimize the KL divergence between q(z | π, x, y) and p(z | π, x), but this would introduce a discrepancy between training and testing when sampling z and also make optimization cumbersome; hence we simply set them equal (see (Sohn et al., 2015) for a related discussion). The KL divergence is computed as

D_KL[q(Z, π | X) || p(Z, π)] + D_KL[q(β) || p(β)],   (23)

where the first term was described for beta-Bernoulli dropout and the second term can be computed analytically. The prediction for a novel input x_* is done similarly to beta-Bernoulli dropout, with the naïve approximation for the expectation:

p(y_* | x_*, D) ≈ p(y_* | f(x_*; E_q[z_*] ⊗ Ŵ)),   (24)
where

E_q[z_{*,k}] = E_q[π_k] · clamp( γ_k (x_{*,k} − μ_k) / σ_k + η_k, ε ).   (25)

Figure 1. An example illustrating the intuition behind Eq. (21). Each block represents a histogram of the output distribution: a set of activations (1st block) is standardized (2nd block), rescaled and shifted (3rd block), and transformed into probabilities (4th block).

Two-stage pruning scheme. Since π_k ≥ ϕ_k(x_{n,k}) for all x_{n,k}, we expect the resulting network to be sparser than the network pruned only with beta-Bernoulli dropout (i.e., only with π_k). To achieve this, we propose a two-stage pruning scheme, where we first prune the network with beta-Bernoulli dropout, and then prune the network again with ϕ_k(x_{n,k}) while holding the variables π fixed. By fixing π, the resulting network is guaranteed to be sparser than the network before the second pruning.
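At test time, both variants reduce to deterministic scaling and thresholding of the expected masks. The following sketch uses the Kumaraswamy mean of Eq. (19); the function names and the default threshold value are ours, not the settings used in the experiments.

```python
import torch

def expected_pi(log_a, log_b):
    """E_q[pi_k] for the Kumaraswamy posterior, Eq. (19)."""
    a, b = log_a.exp(), log_b.exp()
    return b * torch.exp(torch.lgamma(1.0 + 1.0 / a) + torch.lgamma(b)
                         - torch.lgamma(1.0 + 1.0 / a + b))

def prune_with_bb(log_a, log_b, threshold=1e-3):
    """Stage 1 (BB): drop units whose expected keep probability is negligible."""
    pi_mean = expected_pi(log_a, log_b)
    keep = pi_mean > threshold
    return keep, pi_mean

def runtime_mask(pi_mean, keep, gate):
    """Stage 2 (DBB): with pi fixed, prune further per input.

    gate is clamp(gamma_k * x_hat_k + eta_k, eps), computed as in Eq. (25).
    Because phi_k(x) <= pi_k, the input-dependent mask can only remove more
    units than stage 1, never fewer.
    """
    phi = pi_mean * gate
    return torch.where(keep, phi, torch.zeros_like(phi))
```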
5. Experiments
We now compare our beta-Bernoulli dropout (BB) and input-dependent beta-Bernoulli dropout (DBB) against other structure learning/pruning algorithms on several neural networks using benchmark datasets. The code to replicate our experiments will be made available at https://github.com/juho-lee/bbdrop.
Experiment Settings. We follow the common experimental setting used by existing work to evaluate pruning performance, and use LeNet 500-300, LeNet 5-Caffe (https://github.com/BVLC/caffe/blob/master/examples/mnist), and a VGG-like network (Zagoruyko, 2015) on the MNIST (LeCun et al., 1998), CIFAR-10, and CIFAR-100 (Krizhevsky & Hinton, 2009) datasets. For baselines, we use the following recent Bayesian pruning methods: 1) SBP: Structured Bayesian Pruning (Neklyudov et al., 2017), 2) VIB: Variational Information Bottleneck (Dai et al., 2018), 3) L0: L0 regularization (Louizos et al., 2018), 4) GD: Generalized Dropout (Srinivas & Babu, 2016), and 5) CD: Concrete Dropout (Gal et al., 2017). We faithfully tune all hyperparameters of the baselines on a validation set to find reasonable solutions that balance accuracy and sparsification, while fixing the batch size (100) and the maximum number of epochs (200) to match our experimental setting.
Implementation Details. We pretrain all networks using the standard training procedure before fine-tuning for network sparsification (Neklyudov et al., 2017; Dai et al., 2018; Louizos et al., 2018). While pruning, we set the learning rate for the weights W to be 0.1 times that for the variational parameters, as in (Neklyudov et al., 2017). We used Adam (Kingma & Ba, 2015) for all methods. For DBB, as mentioned in Section 4.2, we first prune networks with BB, and then prune again with DBB while holding the variational parameters for q(π) fixed.

We report all hyperparameters of BB and DBB for reproducibility. We set the ratio α/K to the same small value for all layers of BB and DBB. In principle, we could fix K to a large number and tune α; however, in the network sparsification task, K is given as the number of neurons/filters to be pruned, so we instead fix the ratio α/K to a small value for all layers. In the testing phase, we prune the neurons/filters whose expected dropout mask probability E_q[π_k] falls below a fixed small threshold; we tried several threshold values, but the differences were insignificant. For the input-dependent dropout, since the number of pruned neurons/filters differs across inputs, we report the running average over the test data. We fix the temperature of the concrete distribution τ and the prior standard deviation ρ of β for all experiments. We ran all experiments three times and report the median and standard deviation.

To control the tradeoff between classification error and pruned network size, we run each algorithm with various tradeoff parameters. For VIB and L0, we controlled the tradeoff parameters originally introduced in those papers. For the variational inference based algorithms, including SBP and BB, we scaled the KL terms in the ELBO with a tradeoff parameter γ ≥ 1; note that with γ ≥ 1 the modified ELBO is still a lower bound on the marginal likelihood. For DBB, we use fixed parameter settings but retrain the model from different runs of BB trained with different tradeoff parameters γ. For more detailed settings of the tradeoff control, please refer to the appendix.
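The tradeoff control used for the variational methods simply scales the KL term in the ELBO; a one-line sketch of the modified objective (our own function, not from the released code):

```python
def scaled_elbo(log_lik, kl_terms, gamma=1.0):
    """ELBO with the KL term scaled by a tradeoff parameter gamma >= 1.

    Larger gamma pushes q toward the sparse prior, trading accuracy for
    sparsity; for gamma >= 1 the objective remains a lower bound on the
    marginal likelihood.
    """
    return log_lik - gamma * sum(kl_terms)
```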
Figure 2. Top: Classification errors and sparsification performance of various pruning methods for LeNet-500-300 and LeNet5-Caffe on the MNIST dataset (the original LeNet-500-300, with 784-500-300 neurons, obtains 1.63% error, and the original LeNet5-Caffe, with 20-50-800-500 neurons/filters, obtains 0.71% error, both at 1.0x FLOPs and 100% memory). Bottom: Error-speedup tradeoff plots (left) and input regions pruned by DBB for four different examples (right).

We use the LeNet 500-300 and LeNet 5-Caffe networks on MNIST for comparison. Following convention, we apply dropout to the inputs of the fully connected layers, and right after the convolution for the convolutional layers. We report the tradeoff between error and speedup in terms of FLOPs in Fig. 2, and show one representative result per algorithm (Fig. 2, top) to compare the speedups and memory savings with respect to a particular error level. For LeNet 5-Caffe, following (Dai et al., 2018; Neklyudov et al., 2017), we used larger tradeoff parameters for the first two convolutional layers; please refer to the appendix for details.

With both networks, BB and DBB either achieve significantly smaller error than the baseline methods, or significant speedup and memory saving at similar error rates. DBB, with its input-adaptive pruning, obtains larger speedup and memory saving than BB, which is best seen in the error-speedup tradeoff plot.
Figure 3. Correlation coefficients of the class averages of ϕ(x) for the four layers in LeNet5-Caffe (conv20, conv50, dense800, dense500); darker cells indicate higher correlation between the class-average values of ϕ(x).

On LeNet-500-300, DBB prunes a large number of neurons in the input layer: since the inputs to this network are simply vectorized pixel values, it can prune the inputs according to the digit classes, as shown in Fig. 2. We also observe that the dropout masks generated by DBB tend to be generic at lower network layers to extract common features, but become class-specific at higher layers to specialize features for class discriminativity. Fig. 3 supports this observation, showing the correlation coefficients between the class-average values of ϕ(x) learned by DBB; they clearly show the tendency to share filters in lower layers and to be specific in higher layers. We observed similar behavior of DBB in the experiments with VGG on both the CIFAR-10 and CIFAR-100 datasets.
Figure 4. Top: Classification errors and sparsification performance of various pruning methods on the CIFAR-10 and CIFAR-100 datasets (the original VGG-like network obtains 7.13% error on CIFAR-10 and 33.10% on CIFAR-100, at 1.0x FLOPs and 100% memory). Bottom: Error-speedup tradeoff plots (left) and an empirical analysis of the filters learned by DBB (right).
We further compare the pruning algorithms on the VGG-like network on the CIFAR-10 and CIFAR-100 datasets. Fig. 4 summarizes the performance of each algorithm in a particular setting, where BB and DBB achieve both impressive speedups and memory savings with significantly improved accuracy. When compared with the baseline sparsification methods, they either achieve lower error at similar sparsification rates, or better speedup and memory saving at similar error rates. The error-speedup tradeoff plots in the bottom row of Fig. 4 show that DBB generally achieves larger speedup than BB at similar error rates, and than all other baselines. Further analysis of the filters retained by DBB in Fig. 4 shows that DBB either retains most filters (layer 3) or performs generic pruning (layer 8) at lower layers, while performing diversified pruning at higher layers (layer 15). Moreover, at layer 15, instances from the same class (bus) retain similar filters, while instances from different classes (bus vs. apple) retain different filters.
6. Conclusion
We have proposed a novel beta-Bernoulli dropout for network regularization and sparsification, in which we learn dropout probabilities for each neuron either in an input-independent or an input-dependent manner. Our beta-Bernoulli dropout learns the distribution of sparse Bernoulli dropout masks for each neuron in a variational inference framework, in contrast to existing work that learned the distribution of Gaussian multiplicative noise or weights, and obtains significantly more compact networks than these competing approaches. Further, our dependent beta-Bernoulli dropout, which input-adaptively decides which neurons to drop, improves on the input-independent beta-Bernoulli dropout, both in terms of the size of the final network and run-time computation. Future work includes network structure learning (e.g., of a tree structure) using a generalized version of the method, where the dropout mask is applied to a block of weights rather than to each hidden unit.
References
Ba, J. and Frey, B. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems 26, 2013.

Dai, B., Zhou, C., and Wipf, D. Compressing neural networks using the variational information bottleneck. arXiv:1802.10399, 2018.

Feng, J. and Darrell, T. Learning the structure of deep convolutional networks. In IEEE International Conference on Computer Vision, 2015.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

Gal, Y., Hron, J., and Kendall, A. Concrete dropout. In Advances in Neural Information Processing Systems, 2017.

Griffiths, T. L. and Ghahramani, Z. Infinite latent feature models and the Indian buffet process. In NIPS, 2005.

Han, S., Mao, H., and Dally, W. J. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the International Conference on Learning Representations, 2016.

Ioffe, S. and Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

Jang, E., Gu, S., and Poole, B. Categorical reparametrization with Gumbel-softmax. In Proceedings of the International Conference on Learning Representations, 2017.

Kingma, D. P. and Ba, J. L. Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.

Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparametrization trick. In Advances in Neural Information Processing Systems 28, 2015.

Krizhevsky, A. and Hinton, G. E. Learning multiple layers of features from tiny images. Technical report, Computer Science Department, University of Toronto, 2009.

Kumaraswamy, P. A generalized probability density function for double-bounded random processes. Journal of Hydrology, 1980.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, 2017.

Louizos, C., Welling, M., and Kingma, D. P. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, 2018.

Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: a continuous relaxation of discrete random variables. In Proceedings of the International Conference on Learning Representations, 2017.

Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.

Nalisnick, E. and Smyth, P. Stick-breaking variational autoencoders. In Proceedings of the International Conference on Learning Representations, 2017.

Neklyudov, K., Molchanov, D., Ashukha, A., and Vetrov, D. Structured Bayesian pruning via log-normal multiplicative noise. In Advances in Neural Information Processing Systems, 2017.

Ren, L., Wang, Y., Dunson, D. B., and Carin, L. The kernel beta process. In Advances in Neural Information Processing Systems 24, 2011.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems 28, 2015.

Srinivas, S. and Babu, R. V. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.

Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Thibaux, R. and Jordan, M. I. Hierarchical beta processes and the Indian buffet process. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, 2007.

Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems 29, 2016.

Williamson, S., Orbanz, P., and Ghahramani, Z. Dependent Indian buffet processes. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.

Zagoruyko, S. 92.45% on CIFAR-10 in Torch. 2015.

Zhou, M., Yang, H., Sapiro, G., and Dunson, D. B. Dependent hierarchical beta process for image interpolation and denoising. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.
Appendices
A. More details on the experiments
We first describe the tradeoff parameter settings used in the experiments.

• For the variational inference based methods (BB, SBP), we scaled the KL term by γ ≥ 1 and tested several values of γ.

• For GD, we tested several values of the hyperparameter α.

• For DBB, we did not apply any tradeoff control; instead, we started from different results of BB produced with various tradeoff parameters.

• For VIB, we tested several values of the tradeoff parameter γ, with separate grids for LeNet-500-300 and for LeNet5-Caffe and VGG-like.

• For L0, we tested several values of λ, with separate grids for LeNet-500-300/LeNet5-Caffe and for VGG-like.

For LeNet5-Caffe, we used larger tradeoff parameters for the first two convolutional layers, because their penalties are relatively underestimated due to the small number of filters (20, 50) compared to the fully connected layers (800-500).

• For BB, SBP, and GD, we multiplied the KL scaling factor γ by constant factors for the first two convolutional layers.

• For VIB, following the original paper, we multiplied the tradeoff parameters of the first and second convolutional layers by the sizes of the corresponding feature maps.

• For L0 regularization, we used the setting specified in the original paper (L0-sep).

B. Additional results
We present the errors, speedups in FLOPs, and memory savings for all tradeoff settings of every algorithm. The results for LeNet-500-300 and LeNet5-Caffe are displayed in Table 1, and the results for VGG-like on CIFAR-10 and CIFAR-100 are displayed in Table 2.
Table 1. Comparison of pruning methods on LeNet-500-300 and LeNet5-Caffe with MNIST. Error and Memory are in %.
Table 2. Comparison of pruning methods on VGG-like with CIFAR-10 and CIFAR-100. Error and Memory are in %.