Adaptive Network Sparsification with Dependent Variational Beta-Bernoulli Dropout
Juho Lee, Saehoon Kim, Jaehong Yoon, Hae Beom Lee, Eunho Yang, Sung Ju Hwang
Abstract
While variational dropout approaches have been shown to be effective for network sparsification, they are still suboptimal in the sense that they set the dropout rate for each neuron without consideration of the input data. With such input-independent dropout, each neuron is evolved to be generic across inputs, which makes it difficult to sparsify networks without accuracy loss. To overcome this limitation, we propose adaptive variational dropout whose probabilities are drawn from a sparsity-inducing beta-Bernoulli prior. It allows each neuron to be evolved either to be generic or specific for certain inputs, or dropped altogether. Such input-adaptive sparsity-inducing dropout allows the resulting network to tolerate a larger degree of sparsity without losing its expressive power, by removing redundancies among features. We validate our dependent variational beta-Bernoulli dropout on multiple public datasets, on which it obtains significantly more compact networks than baseline methods, with consistent accuracy improvements over the base networks.
1. Introduction
One of the main obstacles in applying deep learning to large-scale problems and low-power computing systems is the large number of network parameters, as it can lead to excessive memory and computational overheads. To tackle this problem, researchers have explored network sparsification methods to remove unnecessary connections in a network, which is implementable either by weight pruning (Han et al., 2016) or sparsity-inducing regularizations (Wen et al., 2016).

Recently, variational Bayesian approaches have been shown to be useful for network sparsification, outperforming non-Bayesian counterparts. They take a completely different approach from the conventional methods that either use thresholding or sparsity-inducing norms on parameters, and use the well-known dropout regularization instead. Specifically, these approaches use variational dropout (Kingma et al., 2015), which adds multiplicative stochastic noise to each neuron, as a means of obtaining sparse neural networks. Removal of unnecessary neurons can be done either by setting the dropout rate individually for each neuron with an unbounded dropout rate (Molchanov et al., 2017) or by pruning based on the signal-to-noise ratio (Neklyudov et al., 2017).

While these variational dropout approaches do yield compact networks, they are suboptimal in that the dropout rate for each neuron is learned completely independently of the given input data and labels. With input-independent dropout regularization, each neuron has no choice but to encode generic information for all possible inputs, since it does not know what inputs and tasks it will be given at evaluation time, as each neuron will be retained with a fixed rate regardless of the input. Obtaining a high degree of sparsity in such a setting is difficult, as dropping any of the neurons will result in information loss. For maximal utilization of the network capacity, and thus to obtain a more compact model, each neuron should be either irreplaceably generic and used by all tasks, or highly specialized for a task such that there exists minimal redundancy among the learned representations. This goal can be achieved by adaptively setting the dropout probability for each input, such that some of the neurons are retained with high probability only for certain types of inputs and tasks.

To this end, we propose a novel input-dependent variational dropout regularization for network sparsification. We first propose beta-Bernoulli dropout, which learns to set the dropout rate for each individual neuron by generating the dropout mask from a beta-Bernoulli prior, and show how to train it using variational inference. This dropout regularization is a proper way of obtaining a Bayesian neural network and also sparsifies the network, since the beta-Bernoulli distribution is a sparsity-inducing prior. Then, we propose dependent beta-Bernoulli dropout, which is an input-dependent version
of our variational dropout regularization.

Such adaptive regularization has been utilized for general network regularization by a non-Bayesian and non-sparsity-inducing model (Ba & Frey, 2013); yet, the increased memory and computational overheads that come from learning additional weights for dropout mask generation made it less appealing for generic network regularization. In our case of network sparsification, however, the overheads at training time are more than rewarded by the reduced memory and computational requirements at evaluation time, thanks to the high degree of sparsification obtained in the final output model.

We validate our dependent beta-Bernoulli variational dropout regularizer on multiple public datasets for network sparsification performance and prediction error, on which it obtains more compact networks with substantially reduced prediction errors when compared with both the base network and existing network sparsification methods. Further analysis of the learned dropout probability for each unit reveals that our input-adaptive variational dropout approach generates a clearly distinguishable dropout mask for each task, and thus enables each task to utilize different sets of neurons for specialization.

Our contribution in this paper is threefold:

• We propose beta-Bernoulli dropout, a novel dropout regularizer which learns to generate a Bernoulli dropout mask for each neuron with a sparsity-inducing prior, and obtains a high degree of sparsity without accuracy loss.

• We further propose dependent beta-Bernoulli dropout, which yields a significantly more compact network than input-independent beta-Bernoulli dropout, and further performs run-time pruning for even lower computational cost.

• Our beta-Bernoulli dropout regularizations provide novel ways to implement a sparse Bayesian neural network, and we provide a variational inference framework for learning it.
2. Related Work
Deep neural networks are known to be prone to overfitting due to their large number of parameters. Dropout (Srivastava et al., 2014) is an effective regularization that helps prevent overfitting by reducing coadaptations of the units in the network. During dropout training, the hidden units of the network are randomly dropped with a fixed probability p, which is equivalent to multiplying the units by Bernoulli noise z ~ Ber(1 − p). It was later found that multiplying Gaussian noise with the same mean and variance, z ~ N(1, p/(1 − p)), works just as well or even better (Srivastava et al., 2014).

Dropout regularizations generally treat the dropout rate p as a hyperparameter to be tuned, but there have been several studies that aim to automatically determine a proper dropout rate. Kingma et al. (2015) propose to determine the variance of the Gaussian dropout by stochastic gradient variational Bayes. Generalized dropout (Srinivas & Babu, 2016) places a beta prior on the dropout rate and learns the posterior of the dropout rate through variational Bayes. They showed that by adjusting the hyperparameters of the beta prior, one can obtain several regularization algorithms with different characteristics. Our beta-Bernoulli dropout is similar to one of its special cases, but while they obtain the dropout estimates via point estimates and compute the gradients of the binary random variables with biased heuristics, we approximate the posterior distribution of the dropout rate with variational distributions and compute asymptotically unbiased gradients for the binary random variables.

Ba & Frey (2013) proposed adaptive dropout (standout), where the dropout rate for each individual neuron is determined as a function of the inputs. This idea is similar in spirit to our dependent beta-Bernoulli dropout, but they use heuristics to model this function, while we use a proper variational Bayesian approach to obtain the dropout rates. One drawback of their model is the increased memory and computational cost from the additional parameters introduced for dropout mask generation, which is not negligible when the network is large. Our model also requires additional parameters, but the increased cost at training time is rewarded at evaluation time, as it yields a significantly sparser network than the baseline model as an effect of the sparsity-inducing prior.

Recently, there has been growing interest in structure learning or sparsification of deep neural networks. Han et al. (2016) proposed a strategy to iteratively prune weak network weights for efficient computation, and Wen et al. (2016) proposed a group sparsity learning algorithm to drop neurons, filters, or even residual blocks in deep neural networks. In Bayesian learning, various sparsity-inducing priors have been demonstrated to efficiently prune network weights with little drop in accuracy (Molchanov et al., 2017; Louizos et al., 2017; Neklyudov et al., 2017; Louizos et al., 2018; Dai et al., 2018). From the nonparametric Bayesian perspective, Feng & Darrell (2015) proposed an IBP-based algorithm that learns a proper number of channels in convolutional neural networks using the asymptotic small-variance limit approximation of the IBP. While our dropout regularizer is motivated by the IBP as with this work, our work is differentiated from it by the input-adaptive adjustment of dropout rates, which allows each neuron to specialize into features specific to some subsets of tasks.
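For concreteness, the two multiplicative-noise formulations of dropout recalled at the beginning of this section can be written as below. This is a small illustrative sketch in PyTorch, not code from any of the cited papers.

```python
import torch

def bernoulli_dropout(h, p):
    """Multiply activations by z ~ Ber(1 - p), rescaled so that E[z * h] = h."""
    z = torch.bernoulli(torch.full_like(h, 1.0 - p))
    return h * z / (1.0 - p)

def gaussian_dropout(h, p):
    """Multiply by z ~ N(1, p / (1 - p)), which matches the first two moments
    of the rescaled Bernoulli mask (Srivastava et al., 2014)."""
    std = (p / (1.0 - p)) ** 0.5
    return h * (1.0 + std * torch.randn_like(h))
```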
3. Background
3.1. Bayesian Neural Networks and SGVB

Suppose that we are given a neural network f(x; W) parametrized by W, a training set D = {(x_n, y_n)}_{n=1}^N, and a likelihood p(y | f(x; W)) chosen according to the problem of interest (e.g., the categorical distribution for a classification task). In Bayesian neural networks, the parameter W is treated as a random variable drawn from a pre-specified prior distribution p(W), and the goal of learning is to compute the posterior distribution p(W | D):

p(W | D) ∝ p(W) ∏_{n=1}^N p(y_n | f(x_n; W)).   (1)

When a novel input x_* is given, the prediction y_* is obtained as a distribution, by mixing W over p(W | D) as follows:

p(y_* | x_*, D) = ∫ p(y_* | f(x_*; W)) p(W | D) dW.   (2)

Unfortunately, p(W | D) is in general computationally intractable due to the computation of p(D), and thus we resort to approximate inference schemes. Specifically, we use variational Bayes (VB), where we posit a variational distribution q(W; φ) of known parametric form and minimize the KL divergence between it and the true posterior p(W | D). It turns out that minimizing D_KL[q(W; φ) || p(W | D)] is equivalent to maximizing the evidence lower bound (ELBO),

L(φ) = ∑_{n=1}^N E_q[log p(y_n | f(x_n; W))] − D_KL[q(W; φ) || p(W)],   (3)

where the first term measures the expected log-likelihood of the dataset w.r.t. q(W; φ), and the second term regularizes q(W; φ) so that it does not deviate too much from the prior. The parameter φ is learned by gradient descent, but this involves two challenges. First, the expected log-likelihood is intractable in many cases, and so is its gradient. To resolve this, we assume that q(W; φ) is reparametrizable, so that we can obtain i.i.d. samples from q(W; φ) by computing a differentiable transformation of i.i.d. noise (Kingma & Welling, 2014; Rezende et al., 2014), ε^(s) ~ r(ε), W^(s) = T(ε^(s); φ). Then we can obtain a low-variance unbiased estimator of the gradient, namely

∇_φ E_q[log p(y_n | f(x_n; W))] ≈ (1/S) ∑_{s=1}^S ∇_φ log p(y_n | f(x_n; T(ε^(s); φ))).   (4)

The second challenge is that the number of training instances N may be too large, which makes it impossible to compute the summation of all expected log-likelihood terms. To address this, we employ stochastic gradient descent, where we approximate the full sum with a summation over a uniformly sampled mini-batch B,

∑_{n=1}^N ∇_φ E_q[log p(y_n | f(x_n; W))] ≈ (N / |B|) ∑_{n ∈ B} ∇_φ E_q[log p(y_n | f(x_n; W))].   (5)

Combining the reparametrization and the mini-batch sampling, we obtain an unbiased estimator of ∇_φ L(φ) to update φ. This procedure, often referred to as stochastic gradient variational Bayes (SGVB) (Kingma & Welling, 2014), is guaranteed to converge to a local optimum under proper learning-rate scheduling.
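The SGVB estimator above is straightforward to implement. The following is a minimal sketch written in PyTorch, with a fully factorized Gaussian q(W) chosen purely for illustration (the model in Section 4 instead keeps a point estimate of W); the function and argument names are ours.

```python
import torch
import torch.nn.functional as F

def sgvb_elbo_estimate(mu, log_sigma, x_batch, y_batch, n_train, forward_fn):
    """One-sample SGVB estimate of the ELBO, combining Eqs. (3)-(5).

    mu, log_sigma : variational parameters of a factorized Gaussian q(W).
    forward_fn    : forward_fn(x, w) runs the network with sampled weights w.
    n_train       : total number of training instances N.
    """
    # Reparametrization: W = mu + sigma * eps with eps ~ N(0, I)  (Eq. 4).
    eps = torch.randn_like(mu)
    w = mu + log_sigma.exp() * eps

    # Expected log-likelihood, rescaled from the mini-batch to the full
    # dataset (Eq. 5); a classification likelihood is assumed here.
    logits = forward_fn(x_batch, w)
    log_lik = -F.cross_entropy(logits, y_batch, reduction='mean') * n_train

    # KL[q(W) || p(W)] for q = N(mu, sigma^2) and p = N(0, I), in closed form.
    kl = 0.5 * (mu.pow(2) + (2.0 * log_sigma).exp() - 2.0 * log_sigma - 1.0).sum()

    # Maximize this estimate (or minimize its negative) over (mu, log_sigma).
    return log_lik - kl
```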
3.2. Latent Feature Models and the Indian Buffet Process

In latent feature models, data are assumed to be generated as combinations of latent features:

d_n = f(W z_n) = f( ∑_{k=1}^K z_{n,k} w_k ),   (6)

where z_{n,k} = 1 means that d_n possesses the k-th feature w_k, and f is an arbitrary function.

The Indian buffet process (IBP) (Griffiths & Ghahramani, 2005) is a generative process for binary matrices with an infinite number of columns. Given N data points, the IBP generates a binary matrix Z ∈ {0, 1}^{N×K} whose n-th row encodes the feature indicator z_n^T. The IBP is suitable as a prior process in latent feature models, since it generates a possibly infinite number of columns and adaptively adjusts the number of features to a given dataset. Hence, with an IBP prior we need not specify the number of features in advance. One interesting observation is that, while it is a marginal of the beta-Bernoulli process (Thibaux & Jordan, 2007), the IBP may also be understood as a limit of the finite-dimensional beta-Bernoulli process. More specifically, the IBP with parameter α > 0 can be obtained as

π_k ~ beta(α/K, 1),   z_{n,k} ~ Ber(π_k),   K → ∞.   (7)

This beta-Bernoulli process naturally induces sparsity in the latent feature allocation matrix Z. As K → ∞, the expected number of nonzero entries in Z converges to Nα (Griffiths & Ghahramani, 2005), where α is a hyperparameter controlling the overall sparsity level of Z.

In this paper, we relate the latent feature model (6) to neural networks with dropout masks. Specifically, the binary random variables z_{n,k} correspond to dropout indicators, and the features w correspond to the inputs or intermediate units in neural networks. From this connection, we can think of a hierarchical Bayesian model where we place the IBP, or finite-dimensional beta-Bernoulli priors, on the binary dropout indicators. We expect that, due to the property of the IBP favoring sparse models, the resulting neural network would also be sparse.

One important assumption in the IBP is that features are exchangeable: the distribution is invariant to permutations of feature assignments. This assumption makes posterior inference convenient, but restricts flexibility when we want to model the dependency of feature assignments on input covariates x, such as times or spatial locations. To this end, Williamson et al. (2010) proposed the dependent Indian buffet process (dIBP), which triggered a line of follow-up work (Zhou et al., 2011; Ren et al., 2011). These models can be summarized by the following generative process:

π ~ p(π),   z | π, x ~ Ber(g(π, x)),   (8)

where g(·, ·) is an arbitrary function mapping π and x to a probability. In our latent feature interpretation of neural network layers above, the input covariates x correspond to the input or the activations of the previous layer. In other words, we build a data-dependent dropout model where the dropout rates depend on the inputs. In the main contribution section, we explain in detail how we construct these data-dependent dropout layers.
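As a quick illustration of the sparsity induced by the finite beta-Bernoulli prior in Eq. (7), the sketch below (NumPy; the names are ours) samples Z for increasing K and checks that the expected number of active entries approaches Nα, so that most of the K columns remain unused.

```python
import numpy as np

def sample_beta_bernoulli(N, K, alpha, rng):
    """Sample a binary matrix Z from the finite beta-Bernoulli prior (Eq. 7)."""
    pi = rng.beta(alpha / K, 1.0, size=K)    # pi_k ~ Beta(alpha/K, 1)
    return rng.binomial(1, pi, size=(N, K))  # z_{n,k} ~ Ber(pi_k)

rng = np.random.default_rng(0)
N, alpha = 100, 5.0
for K in (50, 500, 5000):
    nonzeros = [sample_beta_bernoulli(N, K, alpha, rng).sum() for _ in range(200)]
    # The average number of nonzero entries stays near N * alpha = 500,
    # no matter how large K is, so the allocation matrix is sparse.
    print(K, np.mean(nonzeros))
```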
4. Main Contribution
4.1. Beta-Bernoulli Dropout

Inspired by the latent-feature-model interpretation of layers in neural networks, we propose a Bayesian neural network layer overlaid with binary random masks sampled from a finite-dimensional beta-Bernoulli prior. Specifically, let W be a parameter of a neural network layer, and let z_n ∈ {0, 1}^K be a binary mask vector to be applied to the n-th observation x_n. The dimension of W need not be equal to K. Instead, we may enforce arbitrary group sparsity by sharing the binary masks among multiple elements of W. For instance, let W ∈ R^{K×L×M} be a parameter tensor in a convolutional neural network with K channels. To enforce channel-wise sparsity, we introduce z_n ∈ {0, 1}^K of dimension K, and the resulting masked parameter W̃_n for the n-th observation is given as

{ z_{n,k} W_{k,ℓ,m} | (k, ℓ, m) = (1, 1, 1), ..., (K, L, M) },   (9)

where W_{k,ℓ,m} is the (k, ℓ, m)-th element of W. From now on, with a slight abuse of notation, we denote this binary mask multiplication as

W̃_n = z_n ⊗ W,   (10)

with appropriate sharing of binary mask random variables. The generative process of our Bayesian neural network is then described as

W ~ N(0, λ²I),   π ~ ∏_{k=1}^K beta(π_k; α/K, 1),   z_n | π ~ ∏_{k=1}^K Ber(z_{n,k}; π_k),   W̃_n = z_n ⊗ W.   (11)

Note the difference between our model and the model in (Gal & Ghahramani, 2016). In the latter, only a Gaussian prior is placed on the parameter W, and the dropout is applied in the variational distribution q(W) approximating p(W | D). Our model, on the other hand, includes the binary mask z_n in the prior, and the posterior for the binary masks should also be approximated.

The goal of posterior inference is to compute the posterior distribution p(W, Z, π | D), where Z = {z_1, ..., z_N}. We approximate this posterior with a variational distribution of the form

q(W, Z, π | X) = δ_Ŵ(W) ∏_{k=1}^K q(π_k) ∏_{n=1}^N ∏_{k=1}^K q(z_{n,k} | π_k),   (12)

where we omit the layer indices for simplicity. For W, we conduct a computationally efficient point estimate to obtain the single value Ŵ, with the weight-decay regularization arising from the zero-mean Gaussian prior. For π, following (Nalisnick & Smyth, 2017), we use the Kumaraswamy distribution (Kumaraswamy, 1980) for q(π_k):

q(π_k; a_k, b_k) = a_k b_k π_k^{a_k − 1} (1 − π_k^{a_k})^{b_k − 1},   (13)

since it closely resembles the beta distribution and is easily reparametrizable as

π_k(u; a_k, b_k) = (1 − u^{1/b_k})^{1/a_k},   u ~ Unif([0, 1]).   (14)

We further assume that q(z_{n,k} | π_k) = p(z_{n,k} | π_k) = Ber(π_k). During training, z_k is sampled by reparametrization with continuous relaxation (Maddison et al., 2017; Jang et al., 2017; Gal et al., 2017),

z_k = sgm( (1/τ) ( log (π_k / (1 − π_k)) + log (u / (1 − u)) ) ),   (15)

where τ is the temperature of the continuous relaxation, u ~ Unif([0, 1]), and sgm(x) = 1 / (1 + e^{−x}).
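A minimal sketch of how q(π) and the relaxed masks can be sampled during training, assuming PyTorch; log_a and log_b are unconstrained parametrizations of the Kumaraswamy parameters a_k, b_k, and the function names are ours.

```python
import torch

def sample_pi(log_a, log_b, eps=1e-6):
    """Kumaraswamy reparametrization of q(pi_k), Eqs. (13)-(14)."""
    a, b = log_a.exp(), log_b.exp()
    u = torch.rand_like(a).clamp(eps, 1.0 - eps)
    return (1.0 - u.pow(1.0 / b)).pow(1.0 / a)

def sample_relaxed_mask(pi, temperature, eps=1e-6):
    """Binary-concrete relaxation of z_k ~ Ber(pi_k), Eq. (15)."""
    pi = pi.clamp(eps, 1.0 - eps)
    u = torch.rand_like(pi).clamp(eps, 1.0 - eps)
    logits = torch.log(pi) - torch.log1p(-pi) + torch.log(u) - torch.log1p(-u)
    return torch.sigmoid(logits / temperature)

# During training, a sampled mask multiplies the channel-wise shared weights
# or, equivalently, the corresponding activations; e.g., for a conv layer
# with K output channels:
#   pi = sample_pi(log_a, log_b)
#   z = sample_relaxed_mask(pi, tau)
#   out = conv(x) * z.view(1, -1, 1, 1)
```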
The KL divergence between the prior and the variational distribution is obtained in closed form (Nalisnick & Smyth, 2017):

D_KL[q(Z, π) || p(Z, π)] = ∑_{k=1}^K { ((a_k − α/K) / a_k) (−γ − Ψ(b_k) − 1/b_k) + log (a_k b_k / (α/K)) − (b_k − 1)/b_k },   (16)

where γ is the Euler-Mascheroni constant and Ψ(·) is the digamma function. Note that the infinite series in the KL divergence vanishes because of the choice p(π_k) = beta(π_k; α/K, 1).

We can apply the SGVB framework described in Section 3.1 to optimize the variational parameters {a_k, b_k}_{k=1}^K. After training, the prediction for a novel input x_* is given as

p(y_* | x_*, D) = E_{p(z_*, π, W | D)}[p(y_* | f(x_*; z_* ⊗ W))] ≈ E_{q(z_*, π)}[p(y_* | f(x_*; z_* ⊗ Ŵ))],   (17)

and we found that the following naïve approximation works well in practice:

p(y_* | x_*, D) ≈ p(y_* | f(x_*; E_q[z_*] ⊗ Ŵ)),   (18)

where E_q[z_{*,k}] = E_{q(π_k)}[π_k] and

E_{q(π_k)}[π_k] = b_k Γ(1 + a_k^{−1}) Γ(b_k) / Γ(1 + a_k^{−1} + b_k).   (19)

4.2. Dependent Beta-Bernoulli Dropout

We now describe our Bayesian neural network model with an input-dependent beta-Bernoulli dropout prior, constructed as follows:

W ~ N(0, λ²I),   π ~ ∏_{k=1}^K beta(π_k; α/K, 1),   z_n | π, x_n ~ ∏_{k=1}^K Ber(z_{n,k}; ϕ_k(x_{n,k})).   (20)

Here, x_n is the input to the dropout layer. For convolutional layers, we apply global average pooling to the tensors to obtain vectorized inputs. In principle, we could introduce another fully connected layer, (ϕ_1(x_{n,1}), ..., ϕ_K(x_{n,K})) = sgm(V x_n + c), with additional parameters V ∈ R^{K×K} and c ∈ R^K, but this is undesirable for network sparsification. Rather than adding parameters for a fully connected layer, we propose a simple yet effective way to generate input-dependent probabilities with minimal parameters involved. Specifically, we construct each ϕ_k(x_{n,k}) independently as follows:

ϕ_k(x_{n,k}) = π_k · clamp( γ_k (x_{n,k} − μ_k) / σ_k + β_k, ε ),   (21)

where μ_k and σ_k are estimates of the mean and standard deviation of the k-th component of the inputs, γ_k and β_k are scaling and shifting parameters to be learned, ε > 0 is a small tolerance to prevent overflow, and clamp(x, ε) = min(1 − ε, max(ε, x)). The parameterization in (21) is motivated by batch normalization (Ioffe & Szegedy, 2015). The intuition behind this construction is as follows (see Fig. 1 for a graphical example). The inputs after standardization are approximately distributed as N(0, 1), and thus are centered around zero. If we pass them through min(1 − ε, max(ε, x)), most insignificant dimensions will have probability close to zero. However, some inputs may be important regardless of the significance of the current activation; in that case, we expect the corresponding shifting parameter β_k to be large. Thus, through β = (β_1, ..., β_K) we control the overall sparsity, but we want the β_k to be small unless they are required to be large, so as to obtain sparse outcomes. We enforce this by placing a prior distribution β ~ N(0, ρ²I).
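The input-dependent probabilities of Eq. (21) need only a few per-channel parameters. Below is a sketch assuming PyTorch, with running estimates of μ_k and σ_k maintained as in batch normalization; for brevity, β is kept as a deterministic parameter here, whereas the model above places a Gaussian prior and a variational posterior on it. The module name and defaults are ours.

```python
import torch
import torch.nn as nn

class DependentGate(nn.Module):
    """Input-dependent dropout probabilities phi_k(x_{n,k}) of Eq. (21)."""

    def __init__(self, num_features, eps=1e-3, momentum=0.1):
        super().__init__()
        self.eps = eps
        self.momentum = momentum
        self.gamma = nn.Parameter(torch.ones(num_features))   # scale gamma_k
        self.beta = nn.Parameter(torch.zeros(num_features))   # shift beta_k
        self.register_buffer('running_mean', torch.zeros(num_features))
        self.register_buffer('running_var', torch.ones(num_features))

    def forward(self, x, pi):
        # x: (batch, K) pooled activations; pi: (K,) samples or means of q(pi_k).
        if self.training:
            mean, var = x.mean(0), x.var(0, unbiased=False)
            with torch.no_grad():
                self.running_mean.lerp_(mean, self.momentum)
                self.running_var.lerp_(var, self.momentum)
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / (var + 1e-5).sqrt()          # standardize as in BN
        gate = (self.gamma * x_hat + self.beta).clamp(self.eps, 1.0 - self.eps)
        return pi * gate                                   # phi_k(x) = pi_k * clamp(...)
```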
The goal of variational inference is hence to learn the posterior distribution p(W, Z, π, β | D), and we approximate this with a variational distribution of the form

q(W, Z, π, β | X) = δ_Ŵ(W) ∏_{k=1}^K q(π_k) q(β_k) ∏_{n=1}^N ∏_{k=1}^K q(z_{n,k} | π_k, x_n),   (22)

where q(π_k) is the same as in beta-Bernoulli dropout, q(β_k) = N(β_k; η_k, κ_k²), and q(z_{n,k} | π_k, x_n) = p(z_{n,k} | π_k, x_n). In principle, we could instead introduce an inference network q(z | π, x, y) and minimize the KL divergence between q(z | π, x, y) and p(z | π, x), but this would introduce a discrepancy between training and testing when sampling z and also make optimization cumbersome; hence we simply set them equal (see (Sohn et al., 2015) for a related discussion). The KL divergence is computed as

D_KL[q(Z, π | X) || p(Z, π)] + D_KL[q(β) || p(β)],   (23)

where the first term was described for beta-Bernoulli dropout and the second term can be computed analytically. The prediction for a novel input x_* is done similarly to beta-Bernoulli dropout, with the naïve approximation for the expectation:

p(y_* | x_*, D) ≈ p(y_* | f(x_*; E_q[z_*] ⊗ Ŵ)),   (24)
where

E_q[z_{*,k}] = E_q[π_k] · clamp( γ_k (x_{*,k} − μ_k) / σ_k + η_k, ε ).   (25)

Figure 1. An example illustrating the intuition behind Eq. (21). Each block represents a histogram of the output distribution: a set of activations (1st block) is standardized (2nd block), rescaled and shifted (3rd block), and transformed into probabilities (4th block).

Two-stage pruning scheme. Since π_k ≥ ϕ_k(x_{n,k}) for all x_{n,k}, we expect the resulting network to be sparser than the network pruned only with beta-Bernoulli dropout (i.e., only with π_k). To achieve this, we propose a two-stage pruning scheme, where we first prune the network with beta-Bernoulli dropout, and then prune the network again with ϕ_k(x_{n,k}) while holding the variables π fixed. By fixing π, the resulting network is guaranteed to be sparser than the network before the second pruning.
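At test time, both variants reduce to deterministic scaling and thresholding of the expected masks. The following sketch uses the Kumaraswamy mean of Eq. (19); the function names and the default threshold value are ours, not the settings used in the experiments.

```python
import torch

def expected_pi(log_a, log_b):
    """E_q[pi_k] for the Kumaraswamy posterior, Eq. (19)."""
    a, b = log_a.exp(), log_b.exp()
    return b * torch.exp(torch.lgamma(1.0 + 1.0 / a) + torch.lgamma(b)
                         - torch.lgamma(1.0 + 1.0 / a + b))

def prune_with_bb(log_a, log_b, threshold=1e-3):
    """Stage 1 (BB): drop units whose expected keep probability is negligible."""
    pi_mean = expected_pi(log_a, log_b)
    keep = pi_mean > threshold
    return keep, pi_mean

def runtime_mask(pi_mean, keep, gate):
    """Stage 2 (DBB): with pi fixed, prune further per input.

    gate is clamp(gamma_k * x_hat_k + eta_k, eps), computed as in Eq. (25).
    Because phi_k(x) <= pi_k, the input-dependent mask can only remove more
    units than stage 1, never fewer.
    """
    phi = pi_mean * gate
    return torch.where(keep, phi, torch.zeros_like(phi))
```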
5. Experiments
We now compare our beta-Bernoulli dropout (BB) and input-dependent beta-Bernoulli dropout (DBB) against other structure learning/pruning algorithms on several neural networks using benchmark datasets. The code to replicate our experiments will be made available at https://github.com/juho-lee/bbdrop.
Experiment Settings. We follow the common experimental setting used by existing work to evaluate pruning performance, and use LeNet 500-300, LeNet 5-Caffe (https://github.com/BVLC/caffe/blob/master/examples/mnist), and a VGG-like network (Zagoruyko, 2015) on the MNIST (LeCun et al., 1998), CIFAR-10, and CIFAR-100 (Krizhevsky & Hinton, 2009) datasets. For baselines, we use the following recent Bayesian pruning methods: 1) SBP: Structured Bayesian Pruning (Neklyudov et al., 2017), 2) VIB: Variational Information Bottleneck (Dai et al., 2018), 3) L0: L0 regularization (Louizos et al., 2018), 4) GD: Generalized Dropout (Srinivas & Babu, 2016), and 5) CD: Concrete Dropout (Gal et al., 2017). We faithfully tune all hyperparameters of the baselines on a validation set to find reasonable solutions that balance accuracy and sparsification, while fixing the batch size (100) and the maximum number of epochs (200) to match our experimental setting.
Implementation Details. We pretrain all networks using the standard training procedure before fine-tuning for network sparsification (Neklyudov et al., 2017; Dai et al., 2018; Louizos et al., 2018). While pruning, we set the learning rate for the weights W to be 0.1 times that for the variational parameters, as in (Neklyudov et al., 2017). We used Adam (Kingma & Ba, 2015) for all methods. For DBB, as mentioned in Section 4.2, we first prune networks with BB, and then prune again with DBB while holding the variational parameters for q(π) fixed.

We report all hyperparameters of BB and DBB for reproducibility. We set the ratio α/K to the same small value for all layers of BB and DBB. In principle, we could fix K to a large number and tune α; however, in the network sparsification task, K is given as the number of neurons/filters to be pruned, so we instead fix the ratio α/K to a small value for all layers. In the testing phase, we prune the neurons/filters whose expected dropout mask probability E_q[π_k] falls below a fixed small threshold; we tried several threshold values, but the differences were insignificant. For the input-dependent dropout, since the number of pruned neurons/filters differs across inputs, we report the running average over the test data. We fix the temperature of the concrete distribution τ and the prior standard deviation ρ of β for all experiments. We ran all experiments three times and report the median and standard deviation.

To control the tradeoff between classification error and pruned network size, we run each algorithm with various tradeoff parameters. For VIB and L0, we controlled the tradeoff parameters originally introduced in those papers. For the variational inference based algorithms, including SBP and BB, we scaled the KL terms in the ELBO with a tradeoff parameter γ ≥ 1; note that with γ ≥ 1 the modified ELBO is still a lower bound on the marginal likelihood. For DBB, we use fixed parameter settings but retrain the model from different runs of BB trained with different tradeoff parameters γ. For more detailed settings of the tradeoff control, please refer to the appendix.
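The tradeoff control used for the variational methods simply scales the KL term in the ELBO; a one-line sketch of the modified objective (our own function, not from the released code):

```python
def scaled_elbo(log_lik, kl_terms, gamma=1.0):
    """ELBO with the KL term scaled by a tradeoff parameter gamma >= 1.

    Larger gamma pushes q toward the sparse prior, trading accuracy for
    sparsity; for gamma >= 1 the objective remains a lower bound on the
    marginal likelihood.
    """
    return log_lik - gamma * sum(kl_terms)
```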
Figure 2. Top: Classification errors and sparsification performance of various pruning methods for LeNet-500-300 and LeNet5-Caffe on the MNIST dataset (the original LeNet-500-300, with 784-500-300 neurons, obtains 1.63% error, and the original LeNet5-Caffe, with 20-50-800-500 neurons/filters, obtains 0.71% error, both at 1.0x FLOPs and 100% memory). Bottom: Error-speedup tradeoff plots (left) and input regions pruned by DBB for four different examples (right).

We use the LeNet 500-300 and LeNet 5-Caffe networks on MNIST for comparison. Following convention, we apply dropout to the inputs of the fully connected layers, and right after the convolution for the convolutional layers. We report the tradeoff between error and speedup in terms of FLOPs in Fig. 2, and show one representative result per algorithm (Fig. 2, top) to compare the speedups and memory savings with respect to a particular error level. For LeNet 5-Caffe, following (Dai et al., 2018; Neklyudov et al., 2017), we used larger tradeoff parameters for the first two convolutional layers; please refer to the appendix for details.

With both networks, BB and DBB either achieve significantly smaller error than the baseline methods, or significant speedup and memory saving at similar error rates. DBB, with its input-adaptive pruning, obtains larger speedup and memory saving than BB, which is best seen in the error-speedup tradeoff plot.
Figure 3. Correlation coefficients of the class averages of ϕ(x) for the four layers in LeNet5-Caffe (conv20, conv50, dense800, dense500); darker cells indicate higher correlation between the class-average values of ϕ(x).

On LeNet-500-300, DBB prunes a large number of neurons in the input layer: since the inputs to this network are simply vectorized pixel values, it can prune the inputs according to the digit classes, as shown in Fig. 2. We also observe that the dropout masks generated by DBB tend to be generic at lower network layers to extract common features, but become class-specific at higher layers to specialize features for class discriminativity. Fig. 3 supports this observation, showing the correlation coefficients between the class-average values of ϕ(x) learned by DBB; they clearly show the tendency to share filters in lower layers and to be specific in higher layers. We observed similar behavior of DBB in the experiments with VGG on both the CIFAR-10 and CIFAR-100 datasets.
Figure 4. Top: Classification errors and sparsification performance of various pruning methods on the CIFAR-10 and CIFAR-100 datasets (the original VGG-like network obtains 7.13% error on CIFAR-10 and 33.10% on CIFAR-100, at 1.0x FLOPs and 100% memory). Bottom: Error-speedup tradeoff plots (left) and an empirical analysis of the filters learned by DBB (right).
We further compare the pruning algorithms on the VGG-like network on the CIFAR-10 and CIFAR-100 datasets. Fig. 4 summarizes the performance of each algorithm in a particular setting, where BB and DBB achieve both impressive speedups and memory savings with significantly improved accuracy. When compared with the baseline sparsification methods, they either achieve lower error at similar sparsification rates, or better speedup and memory saving at similar error rates. The error-speedup tradeoff plots in the bottom row of Fig. 4 show that DBB generally achieves larger speedup than BB at similar error rates, and than all other baselines. Further analysis of the filters retained by DBB in Fig. 4 shows that DBB either retains most filters (layer 3) or performs generic pruning (layer 8) at lower layers, while performing diversified pruning at higher layers (layer 15). Moreover, at layer 15, instances from the same class (bus) retain similar filters, while instances from different classes (bus vs. apple) retain different filters.
6. Conclusion
We have proposed a novel beta-Bernoulli dropout for network regularization and sparsification, in which we learn dropout probabilities for each neuron either in an input-independent or an input-dependent manner. Our beta-Bernoulli dropout learns the distribution of sparse Bernoulli dropout masks for each neuron in a variational inference framework, in contrast to existing work that learned the distribution of Gaussian multiplicative noise or weights, and obtains significantly more compact networks than these competing approaches. Further, our dependent beta-Bernoulli dropout, which input-adaptively decides which neurons to drop, improves on the input-independent beta-Bernoulli dropout, both in terms of the size of the final network and run-time computation. Future work includes network structure learning (e.g., of a tree structure) using a generalized version of the method, where the dropout mask is applied to a block of weights rather than to each hidden unit.
References
Ba, J. and Frey, B. Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems 26, 2013.

Dai, B., Zhou, C., and Wipf, D. Compressing neural networks using the variational information bottleneck. arXiv:1802.10399, 2018.

Feng, J. and Darrell, T. Learning the structure of deep convolutional networks. In IEEE International Conference on Computer Vision, 2015.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

Gal, Y., Hron, J., and Kendall, A. Concrete dropout. In Advances in Neural Information Processing Systems, 2017.

Griffiths, T. L. and Ghahramani, Z. Infinite latent feature models and the Indian buffet process. In NIPS, 2005.

Han, S., Mao, H., and Dally, W. J. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proceedings of the International Conference on Learning Representations, 2016.

Ioffe, S. and Szegedy, C. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

Jang, E., Gu, S., and Poole, B. Categorical reparametrization with Gumbel-softmax. In Proceedings of the International Conference on Learning Representations, 2017.

Kingma, D. P. and Ba, J. L. Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations, 2014.

Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparametrization trick. In Advances in Neural Information Processing Systems 28, 2015.

Krizhevsky, A. and Hinton, G. E. Learning multiple layers of features from tiny images. Technical report, Computer Science Department, University of Toronto, 2009.

Kumaraswamy, P. A generalized probability density function for double-bounded random processes. Journal of Hydrology, 1980.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, 2017.

Louizos, C., Welling, M., and Kingma, D. P. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, 2018.

Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: a continuous relaxation of discrete random variables. In Proceedings of the International Conference on Learning Representations, 2017.

Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, 2017.

Nalisnick, E. and Smyth, P. Stick-breaking variational autoencoders. In Proceedings of the International Conference on Learning Representations, 2017.

Neklyudov, K., Molchanov, D., Ashukha, A., and Vetrov, D. Structured Bayesian pruning via log-normal multiplicative noise. In Advances in Neural Information Processing Systems, 2017.

Ren, L., Wang, Y., Dunson, D. B., and Carin, L. The kernel beta process. In Advances in Neural Information Processing Systems 24, 2011.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems 28, 2015.

Srinivas, S. and Babu, R. V. Generalized dropout. arXiv preprint arXiv:1611.06791, 2016.

Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Thibaux, R. and Jordan, M. I. Hierarchical beta processes and the Indian buffet process. In Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, 2007.

Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems 29, 2016.

Williamson, S., Orbanz, P., and Ghahramani, Z. Dependent Indian buffet processes. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 2010.

Zagoruyko, S. 92.45% on CIFAR-10 in Torch. 2015.

Zhou, M., Yang, H., Sapiro, G., and Dunson, D. B. Dependent hierarchical beta process for image interpolation and denoising. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.
Appendices
A. More details on the experiments
We first describe the tradeoff parameter settings used in the experiments.

• For the variational inference based methods (BB, SBP), we scaled the KL term by γ ≥ 1 and tested several values of γ.

• For GD, we tested several values of the hyperparameter α.

• For DBB, we did not apply any tradeoff control; instead, we started from different results of BB produced with various tradeoff parameters.

• For VIB, we tested several values of the tradeoff parameter γ, with separate grids for LeNet-500-300 and for LeNet5-Caffe and VGG-like.

• For L0, we tested several values of λ, with separate grids for LeNet-500-300/LeNet5-Caffe and for VGG-like.

For LeNet5-Caffe, we used larger tradeoff parameters for the first two convolutional layers, because their penalties are relatively underestimated due to the small number of filters (20, 50) compared to the fully connected layers (800-500).

• For BB, SBP, and GD, we multiplied the KL scaling factor γ by constant factors for the first two convolutional layers.

• For VIB, following the original paper, we multiplied the tradeoff parameters of the first and second convolutional layers by the sizes of the corresponding feature maps.

• For L0 regularization, we used the setting specified in the original paper (L0-sep).

B. Additional results
We present the errors, speedups in FLOPs, and memory savings for all tradeoff settings of every algorithm. The results for LeNet-500-300 and LeNet5-Caffe are displayed in Table 1, and the results for VGG-like on CIFAR-10 and CIFAR-100 are displayed in Table 2.
Table 1. Comparison of pruning methods on LeNet-500-300 and LeNet5-Caffe with MNIST. Error and Memory are in %.
Table 2. Comparison of pruning methods on VGG-like with CIFAR-10 and CIFAR-100. Error and Memory are in %.