Nonparametric Bayesian Deep Networks with Local Competition
Konstantinos P. Panousis, Sotirios Chatzis, Sergios Theodoridis
Abstract
The aim of this work is to enable inference of deep networks that retain high accuracy for the least possible model complexity, with the latter deduced from the data during inference. To this end, we revisit deep networks that comprise competing linear units, as opposed to nonlinear units that do not entail any form of (local) competition. In this context, our main technical innovation consists in an inferential setup that leverages solid arguments from Bayesian nonparametrics. We infer both the needed set of connections or locally competing sets of units, as well as the required floating-point precision for storing the network parameters. Specifically, we introduce auxiliary discrete latent variables representing which initial network components are actually needed for modeling the data at hand, and perform Bayesian inference over them by imposing appropriate stick-breaking priors. As we experimentally show using benchmark datasets, our approach yields networks with a smaller computational footprint than the state of the art, without compromising predictive accuracy.
1. Introduction
Deep neural networks (DNNs) (LeCun et al., 2015) are flexible models that represent complex functions as a combination of simpler primitives. Despite their success in a wide range of applications, they typically suffer from overparameterization: they entail millions of weights, a large fraction of which is actually redundant. This leads to unnecessary computational burden, and limits their scalability to commodity hardware devices, such as mobile phones and cars.
In addition, this fact renders DNNs susceptible to strong overfitting tendencies that may severely undermine their generalization capacity.

The deep learning community has devoted significant effort to addressing overfitting in deep learning; ℓ2 regularization, Dropout, and variational variants thereof are characteristic such examples (Gal & Ghahramani, 2015). However, the scope of regularization is limited to effectively training (and retaining) all network weights. Addressing redundancy in deep networks requires data-driven structure shrinkage and weight compression techniques.

A popular type of solution to this end consists in training a condensed student network by leveraging a previously trained full-fledged teacher network (Ba & Caruana, 2014; Hinton et al., 2015). However, this paradigm suffers from two main drawbacks: (i) one cannot avoid the computational costs and overfitting tendencies related to training a large deep network; on the contrary, the total training costs are augmented with the weight distillation and training costs of the student network; and (ii) the student teaching procedure itself entails a great deal of heuristics and assorted artistry in designing effective teacher distillation.

As an alternative, several researchers have examined the application of network component (unit/connection) pruning criteria. In most cases, these criteria are applied on top of some appropriate regularization technique. In this context, Bayesian Neural Networks (BNNs) have been proposed as a fully probabilistic paradigm for formulating DNNs (Gal & Ghahramani, 2015; Graves, 2011), obtained by imposing a prior distribution over their weights. Then, appropriate posteriors are inferred, and predictive distributions are obtained via marginalization in the Bayesian averaging sense. This way, BNNs induce strong regularization under a solid inferential framework. In addition, they naturally allow for reducing the floating-point precision used to store the network weights. Specifically, since Bayesian inference boils down to drawing samples from an inferred weight posterior, the higher the inferred weight posterior variance, the lower the needed floating-point precision (Louizos et al., 2017).

Finally, Chatzis (2018) recently considered addressing these problems by introducing an additional set of auxiliary Bernoulli latent variables, which explicitly indicate the utility of each component (in an "on/off" fashion). In this context, they obtain a sparsity-inducing behavior by imposing appropriate stick-breaking priors (Ishwaran & James, 2001) over the postulated auxiliary latent variables. Their study, although limited to variational autoencoders, showed promising results in a variety of benchmarks.

On the other hand, a prevalent characteristic of modern deep networks is the use of nonlinear units in each hidden layer. Even though this sort of functionality offers a mathematically convenient way of creating a hierarchical model, it is also well understood that it does not come with strong biological plausibility. Indeed, there is an increasing body of evidence supporting that neurons in biological systems that have similar functional properties are aggregated together in modules or columns where local competition takes place (Kandel et al., 1991; Andersen et al., 1969; Eccles et al., 1967; Stefanis, 1969; Douglas & Martin, 2004; Lansner, 2009).
This is effected via the entailed lateral inhibition mechanisms, under which only a single neuron within a block can be active at a steady state.

Drawing from this inspiration, several researchers have examined the development of deep networks which replace nonlinear units with local competition mechanisms among simpler linear units. As it has been shown, such local winner-takes-all (LWTA) networks can discover effective sparsely distributed representations of their input stimuli (Lee & Seung, 1999; Olshausen & Field, 1996), and constitute universal function approximators, as powerful as networks with threshold or sigmoidal units (Maass, 1999; 2000). In addition, this type of network organization has been argued to give rise to a number of interesting properties, including automatic gain control, noise suppression, and prevention of catastrophic forgetting in online learning (Srivastava et al., 2013; Grossberg, 1988; McCloskey & Cohen, 1989).

This paper draws from these results, and attempts to offer a principled and systematic paradigm for inferring the needed network complexity and compressing its parameters. We posit that the capacity to infer an explicit posterior distribution of component (connection/unit) utility in the context of LWTA-based deep networks may offer significant advantages in model effectiveness and computational efficiency. The proposed inferential construction relies on nonparametric Bayesian inference arguments, namely stick-breaking priors; we employ these tools in a fashion tailored to the unique structural characteristics of LWTA networks. This way, we give rise to a data-driven mechanism that intelligently adapts the complexity of the model structure and infers the needed floating-point precision.

We derive efficient training and inference algorithms for our model by relying on stochastic gradient variational Bayes (SGVB). We dub our approach Stick-Breaking LWTA (SB-LWTA) networks. We evaluate our approach using well-known benchmark datasets. The provided empirical evidence vouches for the capacity of our approach to yield predictive accuracy at least competitive with the state of the art, while enabling automatic inference of the model complexity, concurrently with model parameter estimation. This results in trained networks with a much smaller memory footprint than the competition, without the need of extensively applying heuristic criteria.

The remainder of this paper is organized as follows: In Section 2, we introduce the proposed approach. In Section 3, we provide the training and inference algorithms of our model. In Section 4, we perform an extensive experimental evaluation of our approach, and provide insights into its functionality. Finally, in the concluding section, we summarize the contribution of this work, and discuss directions for further research.
2. Proposed Approach
In this work, we introduce a paradigm of designing deep networks whereby the output of each layer is derived from blocks of competing linear units, and appropriate arguments from nonparametric statistics are employed to infer network component utility in a Bayesian sense. An outline of the envisaged modeling rationale is provided in Fig. 1.

In the following, we begin our exposition by briefly introducing the Indian Buffet Process (IBP) prior (Griffiths & Ghahramani, 2005); we employ this prior to enable inference of which components introduced into the model at initialization time are actually needed for modeling the data at hand. Then, we proceed to the definition of our proposed model.
The IBP is a probability distribution over infinite binary matrices. By using it as a prior, it allows for inferring how many components are needed for modeling a given set of observations, in a way that ensures sparsity in the obtained representations (Theodoridis, 2015). In addition, it also allows for the emergence of new components as new observations appear. Teh et al. (2007) presented a stick-breaking construction for the IBP, which renders it amenable to variational inference. Let us consider N observations, and denote as Z = [z_{i,k}]_{i,k=1}^{N,K} a binary matrix where each entry indicates the existence of component k in observation i. Taking the infinite limit, K → ∞, we arrive at the following hierarchical representation for the IBP (Teh et al., 2007):

u_k ∼ Beta(α, 1),   π_k = ∏_{i=1}^{k} u_i,   z_{i,k} ∼ Bernoulli(π_k)

Here, α > 0 is the innovation hyperparameter of the IBP, which controls the magnitude of the induced sparsity. In practice, K → ∞ denotes a setting whereby we obtain an overcomplete representation of the observed data; that is, K equals the input dimensionality.
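For illustration purposes, the following minimal NumPy sketch draws a binary utility matrix Z from the (truncated) stick-breaking representation above; the function and variable names, as well as the truncation level, are our own choices and not part of the model specification:

```python
import numpy as np

def sample_ibp_stick_breaking(N, K, alpha, rng=np.random.default_rng(0)):
    """Draw Z ~ IBP(alpha) via its truncated stick-breaking construction."""
    u = rng.beta(alpha, 1.0, size=K)       # u_k ~ Beta(alpha, 1)
    pi = np.cumprod(u)                     # pi_k = prod_{i<=k} u_i: decreasing stick lengths
    Z = rng.binomial(1, pi, size=(N, K))   # z_{i,k} ~ Bernoulli(pi_k)
    return Z, pi

Z, pi = sample_ibp_stick_breaking(N=5, K=10, alpha=2.0)
print(pi.round(2))   # stick lengths decay with k, inducing sparsity
print(Z)
```

Because the stick lengths π_k decay geometrically with k, components with large index are rarely switched on, which is exactly the sparsity-inducing behavior exploited in the following.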
Figure 1.
A graphical illustration of the proposed architecture. Bold edges denote active (effective) connections (with z = 1); nodes with bold contours denote winner units (with ξ = 1); rectangles denote LWTA blocks. We consider U = 2 competitors in each LWTA block, k = 1, ..., K.

Let {x_n}_{n=1}^{N} ∈ R^J be an input dataset containing N observations, with J features each. Hidden layers in traditional neural networks contain nonlinear units; they are presented with linear combinations of the inputs, obtained via a weights matrix W ∈ R^{J×K}, and produce output vectors {y_n}_{n=1}^{N} ∈ R^K as input to the next layer. In our approach, this mechanism is replaced by the introduction of LWTA blocks in the hidden layers, each containing a set of competing linear units. The layer input is originally presented to each block, via different weights for each unit; thus, the weights of the connections are now organized into a three-dimensional matrix W ∈ R^{J×K×U}, where K denotes the number of blocks and U is the number of competing units therein.

Let us consider a layer of the proposed model. Within each block, the linear units compute their activations; then, the block selects one winner unit on the basis of a competitive random sampling procedure we describe next, and sets the rest to zero. This way, we yield a sparse layer output, encoded into the vectors {y_n}_{n=1}^{N} ∈ R^{K·U} that are fed to the next layer. In the following, we encode the outcome of local competition between the units of each block via the discrete latent vectors ξ_n ∈ one_hot(U)^K, where one_hot(U) is the set of one-hot vectors with U components. These denote the winning unit out of the U competitors in each of the K blocks of the layer, when presented with the n-th datapoint.

To allow for inferring which layer connections must be retained, we adopt concepts from the field of Bayesian nonparametrics. Specifically, we commence by introducing a matrix of binary latent variables, Z ∈ {0, 1}^{J×K}. The (j, k)-th entry therein is equal to one if the j-th input is presented to the k-th block, and equal to zero otherwise; in the latter case, the corresponding set of weights, {w_{j,k,u}}_{u=1}^{U}, is effectively canceled out from the model. Subsequently, we impose an IBP prior over Z, to allow for performing inference over it in a way that promotes retention of only the barely needed components, as explained above. Turning to the winner sampling procedure within each LWTA block, we postulate that the latent variables ξ_n are also driven from the layer input, and exploit the connection utility information encoded into the inferred Z matrices.

Let us begin with defining the expression of the layer output, y_n ∈ R^{K·U}. Following the above-prescribed rationale, we have:

[y_n]_{k,u} = [ξ_n]_{k,u} Σ_{j=1}^{J} (w_{j,k,u} · z_{j,k}) · [x_n]_j ∈ R    (1)

where we denote as [h]_l the l-th component of a vector h. In this expression, we consider that the winner indicator latent vectors are drawn from a Categorical (posterior) distribution of the form:

q([ξ_n]_k) = Discrete([ξ_n]_k | softmax(Σ_{j=1}^{J} [w_{j,k,u}]_{u=1}^{U} · z_{j,k} · [x_n]_j))    (2)

where [w_{j,k,u}]_{u=1}^{U} denotes the vector concatenation of the set {w_{j,k,u}}_{u=1}^{U}, and [ξ_n]_k ∈ one_hot(U).
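To make Eqs. (1)-(2) concrete, the following NumPy sketch computes the output of one dense SB-LWTA layer; this is a simplified illustration under our own naming conventions, which samples winners and uses given utility indicators directly, rather than performing the variational inference of Section 3:

```python
import numpy as np

def sb_lwta_layer(x, W, Z, rng=np.random.default_rng(0)):
    """One dense SB-LWTA layer (Eqs. 1-2). x: (J,), W: (J, K, U), Z: (J, K) binary."""
    J, K, U = W.shape
    # Per-block, per-unit linear activations with pruned connections:
    # h[k, u] = sum_j w[j, k, u] * z[j, k] * x[j]
    h = np.einsum('j,jku,jk->ku', x, W, Z)
    # Winner probabilities within each block (softmax of Eq. 2)
    p = np.exp(h - h.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # Sample a one-hot winner indicator xi[k] per block
    xi = np.zeros((K, U))
    for k in range(K):
        xi[k, rng.choice(U, p=p[k])] = 1.0
    # Eq. (1): only the winner's activation survives; output is flattened to length K*U
    return (xi * h).reshape(K * U)

x = np.ones(8)
W = np.random.default_rng(1).normal(size=(8, 4, 2))   # J=8 inputs, K=4 blocks, U=2 units
Z = np.ones((8, 4))                                   # all connections retained
print(sb_lwta_layer(x, W, Z).shape)                   # (8,)
```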
On the other hand, the utility latent variables, Z, are independently drawn from Bernoulli posteriors that read:

q(z_{j,k}) = Bernoulli(z_{j,k} | π̃_{j,k})    (3)

where the π̃_{j,k} are obtained through model training (Section 3.1).

Turning to the prior specification of the model latent variables, we consider a symmetric Discrete prior over the winner unit indicators, [ξ_n]_k ∼ Discrete(1/U), and an IBP prior over the utility indicators:

u_k ∼ Beta(α, 1),   π_k = ∏_{i=1}^{k} u_i,   z_{j,k} ∼ Bernoulli(π_k)  ∀j    (4)

Finally, we define a distribution over the weight matrices, W. For simplicity, we impose a spherical prior W ∼ ∏_{j,k,u} N(w_{j,k,u} | 0, 1), and seek to infer a posterior distribution q(W) = ∏_{j,k,u} N(w_{j,k,u} | μ_{j,k,u}, σ²_{j,k,u}). This concludes the formulation of a layer of the proposed SB-LWTA model.

Further, we consider a variant of SB-LWTA which allows for accommodating convolutional operations. These are of importance when dealing with signals of 2D structure, e.g., images. To perform a convolution operation over an input tensor {X_n}_{n=1}^{N} ∈ R^{H×L×C} at a network layer, we define a set of kernels, each with weights W_k ∈ R^{h×l×C×U}, where h, l, C, U are the kernel height, length, number of channels, and number of competing feature maps, respectively, and k ∈ {1, ..., K}. Hence, contrary to the grouping of linear units into LWTA blocks in Fig. 1, the proposed convolutional variant performs local competition among feature maps. That is, each (convolutional) kernel is treated as an LWTA block. Each layer of our convolutional SB-LWTA networks comprises multiple kernels of competing feature maps.

We provide a graphical illustration of the proposed convolutional variant of SB-LWTA in Fig. 2. Under this model variant, we define the utility latent indicator variables, z, over whole kernels, that is, full LWTA blocks. If the inferred posterior, q(z_k = 1), over the k-th block is low, then the block is effectively omitted from the network. Our insight motivating this modeling choice concerns the resulting computational complexity. Specifically, this formulation allows for completely removing kernels, thus reducing the number of executed convolution operations. Hence, this construction facilitates efficiency, since convolution is computationally expensive.

Under this rationale, a layer of the proposed convolutional variant represents an input, X_n, via an output tensor Y_n ∈ R^{H×L×K·U}, obtained as the concatenation along the last dimension of the subtensors {[Y_n]_k}_{k=1}^{K} ∈ R^{H×L×U} defined below:

[Y_n]_k = [ξ_n]_k · ((W_k · z_k) ⋆ X_n)    (5)

where "⋆" denotes the convolution operation and [ξ_n]_k ∈ one_hot(U). Local competition among feature maps within an LWTA block (kernel) is implemented via a sampling procedure which is driven from the feature map output, yielding:

q([ξ_n]_k) = Discrete([ξ_n]_k | softmax(Σ_{h',l'} [z_k W_k ⋆ X_n]_{h',l',u}))    (6)

We postulate a prior [ξ_n]_k ∼ Discrete(1/U). We consider

q(z_k) = Bernoulli(z_k | π̃_k)    (7)

with corresponding priors:

u_k ∼ Beta(α, 1),   π_k = ∏_{i=1}^{k} u_i,   z_k ∼ Bernoulli(π_k)    (8)

Finally, we again impose a spherical N(0, 1) prior over the network weights, and seek to infer posteriors of the form N(μ, σ²). This concludes the formulation of the convolutional layers of the proposed SB-LWTA model.
Obviously, this type of convolutional layer may be succeeded by a conventional pooling layer, as deemed necessary in the application at hand.
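The kernel-level gating of Eqs. (5)-(6) can be sketched as follows; this is an illustrative PyTorch-style snippet under our own naming conventions, which again treats the utility indicators z and the winner indicators ξ as sampled quantities rather than inferring them:

```python
import torch
import torch.nn.functional as F

def sb_lwta_conv_layer(X, kernels, z):
    """Convolutional SB-LWTA layer (Eqs. 5-6), with z and the winners handled directly.

    X:       input of shape (1, C, H, L)
    kernels: list of K weight tensors W_k, each of shape (U, C, h, l) -- one LWTA block per kernel
    z:       binary kernel-utility indicators z_k, shape (K,)
    """
    outputs = []
    for k, W_k in enumerate(kernels):
        h = F.conv2d(X, W_k * z[k], padding='same')      # (W_k * z_k) convolved with X_n: U competing maps
        logits = h.sum(dim=(2, 3)).squeeze(0)            # Eq. (6): spatially summed map activations
        winner = torch.multinomial(torch.softmax(logits, dim=0), 1).squeeze()
        xi = F.one_hot(winner, num_classes=h.shape[1]).float()
        outputs.append(h * xi.view(1, -1, 1, 1))         # only the winner feature map is retained (Eq. 5)
    return torch.cat(outputs, dim=1)                     # output Y_n of shape (1, K*U, H, L)

X = torch.randn(1, 3, 8, 8)                              # C = 3 input channels
kernels = [torch.randn(2, 3, 5, 5) for _ in range(4)]    # K = 4 kernels (blocks), U = 2 maps each
z = torch.tensor([1.0, 0.0, 1.0, 1.0])                   # the second kernel is switched off
print(sb_lwta_conv_layer(X, kernels, z).shape)           # torch.Size([1, 8, 8, 8])
```

Note how a kernel whose utility indicator is zero contributes an all-zero subtensor; at test time such kernels can simply be skipped, which is the source of the computational savings discussed above.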
3. Training and Inference Algorithms
To train the proposed model, we resort to maximization of the resulting ELBO expression. Specifically, we adopt stochastic gradient variational Bayes (SGVB) combined with: (i) the standard reparameterization trick for the postulated Gaussian weights, W; (ii) the Gumbel-Softmax relaxation trick (Maddison et al., 2017) for the introduced latent indicator variables, ξ and Z; and (iii) the Kumaraswamy reparameterization trick (Kumaraswamy, 1980) for the stick variables u.

Specifically, when it comes to the entailed Beta-distributed stick variables of the IBP prior, we can easily observe that these are not amenable to the reparameterization trick, in contrast to the postulated Gaussian weights. To address this issue, one can approximate their variational posteriors q(u_k) = Beta(u_k | a_k, b_k) via the Kumaraswamy distribution (Kumaraswamy, 1980):

q(u_k; a_k, b_k) = a_k b_k u_k^{a_k − 1} (1 − u_k^{a_k})^{b_k − 1}    (9)

Samples from this distribution can be reparameterized as follows (Nalisnick & Smyth, 2016):

u_k = (1 − (1 − X)^{1/b_k})^{1/a_k},   X ∼ U(0, 1)    (10)

On the other hand, in the case of the Discrete (Categorical or Bernoulli) latent variables of our model, performing back-propagation through reparameterized drawn samples becomes infeasible. Recently, the solution of introducing appropriate continuous relaxations has been proposed by different research teams (Jang et al., 2017; Maddison et al., 2017).
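For concreteness, the following sketch implements the Kumaraswamy draw of Eq. (10) and the Concrete/Gumbel-Softmax relaxation of Eqs. (11)-(12) given after Fig. 2; it is our own illustrative code, and the function and variable names are not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def kumaraswamy_sample(a, b):
    """Reparameterized draw u ~ Kumaraswamy(a, b) (Eq. 10); differentiable in a and b."""
    x = rng.uniform(size=np.shape(a))
    return (1.0 - (1.0 - x) ** (1.0 / b)) ** (1.0 / a)

def gumbel_softmax_sample(log_eta, lam):
    """Relaxed one-hot sample from unnormalized log-probabilities log_eta (Eqs. 11-12)."""
    g = -np.log(-np.log(rng.uniform(size=log_eta.shape)))   # Gumbel(0, 1) noise
    y = (log_eta + g) / lam
    y = np.exp(y - y.max())
    return y / y.sum()                                       # approaches a one-hot vector as lam -> 0

print(kumaraswamy_sample(np.array([2.0, 3.0]), np.array([1.0, 1.0])))
print(gumbel_softmax_sample(np.log(np.array([0.2, 0.5, 0.3])), lam=0.5))
```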
Figure 2.
A convolutional variant of our approach. Bold frames denote active (effective) kernels (LWTA blocks of competing feature maps), with z = 1. Bold rectangles denote winner feature maps (with ξ = 1).

Let η ∈ (0, ∞)^K be the unnormalized probabilities of a considered Discrete distribution, X = [X_k]_{k=1}^{K}, and λ ∈ (0, ∞) be a hyperparameter referred to as the temperature of the relaxation. Then, the drawn samples of X are expressed as differentiable functions of the form:

X_k = exp((log η_k + G_k)/λ) / Σ_{i=1}^{K} exp((log η_i + G_i)/λ)    (11)

G_k = −log(−log U_k),   U_k ∼ Uniform(0, 1)    (12)

In our work, the values of λ are annealed during training, as suggested in Jang et al. (2017).

We introduce the mean-field (posterior independence) assumption across layers, as well as among the latent variables ξ and Z pertaining to the same layer. All the posterior expectations in the ELBO are computed by drawing MC samples under the Normal, Gumbel-Softmax, and Kumaraswamy reparameterization tricks, respectively. On this basis, ELBO maximization is performed using standard off-the-shelf stochastic gradient techniques; specifically, we adopt ADAM (Kingma & Ba, 2014) with default settings. For completeness' sake, the expression of the eventually obtained ELBO that is optimized via ADAM is provided in the Supplementary Material.

Having trained the model posteriors, we can now use them to effect inference for unseen data. In this context, SB-LWTA offers two main advantages over conventional techniques:

(i) By exploiting the inferred component utility latent indicator variables, we can naturally devise a method for omitting the contribution of components that are effectively deemed unnecessary. To this end, one may introduce a cut-off threshold, τ; any component with inferred corresponding posterior q(z) below τ is omitted from computation.

We emphasize that this mechanism is in stark contrast to recent related work in the field of BNNs; in these cases, utility is only implicitly inferred, by thresholding higher-order moments of hierarchical densities over the values of the network weights themselves, W (see also the related discussion in Sec. 1). For instance, Louizos et al. (2017) imposed the following prior over the network weights:

z ∼ p(z),   w ∼ N(0, z²)    (13)

where p(z) can be a Horseshoe-type or log-uniform prior. However, such a modeling scheme requires extensive heuristics for the appropriate, ad hoc, selection of the prior p(z) hyperparameter values that can facilitate the desired sparsity, and of the associated thresholds at each network layer. On the contrary, our principled paradigm enables fully automatic, data-driven inference of network utility, using dedicated latent variables to infer which network components are needed. We only need to specify one global hyperparameter, namely the innovation hyperparameter α of the IBP, and one global truncation threshold, τ. Even more importantly, our model is not sensitive to small fluctuations of these values. This is a unique advantage of our model compared to the alternatives, as it obviates the need for extensive heuristic search of hyperparameter values.

(ii) The provision of a full Gaussian posterior distribution over the network weights, W, offers a natural way of reducing the floating-point bit precision level of the network implementation.
Specifically, the posterior variance of the network weights constitutes a measure of uncertainty in their estimates. Therefore, we can leverage this uncertainty information to assess which bits are significant, and remove the ones which fluctuate too much under approximate posterior sampling. The unit round-off necessary to represent the weights is computed by making use of the mean of the weight variances, in a fashion similar to Louizos et al. (2017).

We emphasize that, contrary to Louizos et al. (2017), our model is endowed with the important benefit that the procedure of bit precision selection for the network weights relies on different posteriors than the component omission process. We posit that by disentangling these two processes, we reduce the tendency of the model to underestimate posterior variance. Thus, we may yield stronger network compression while retaining predictive performance.

Finally, we turn to prediction generation. To be Bayesian, we would need to sample several configurations of the weights in order to assess the predictive density, and perform averaging; this is inefficient for real-world testing. Here, we adopt a common approximation, as in Louizos et al. (2017) and Neklyudov et al. (2017); that is, we perform traditional forward propagation using the means of the weight posteriors in place of the weight values. Concerning winner selection, we compute the posteriors q(ξ) and select the unit with maximum probability as the winner; that is, we resort to hard winner selection, instead of performing sampling. Lastly, we retain all network components whose posteriors, q(z), exceed the imposed truncation threshold, τ.
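The following sketch summarizes this test-time procedure for a dense layer; it is a simplified illustration under our own assumptions, where mu holds the inferred posterior means of the weights, q_z the connection-utility posteriors, and the truncation threshold tau is an arbitrary example value:

```python
import numpy as np

def sb_lwta_predictive_layer(x, mu, q_z, tau=1e-2):
    """Deterministic test-time forward pass of a dense SB-LWTA layer.

    x:   input, shape (J,)
    mu:  posterior means of the weights, shape (J, K, U)
    q_z: connection-utility posteriors q(z_jk = 1), shape (J, K)
    tau: truncation threshold; connections with q_z < tau are dropped
    """
    z_hat = (q_z >= tau).astype(float)              # prune components deemed unnecessary
    h = np.einsum('j,jku,jk->ku', x, mu, z_hat)     # activations with posterior-mean weights
    winners = h.argmax(axis=1)                      # hard winner selection per block
    y = np.zeros_like(h)
    rows = np.arange(h.shape[0])
    y[rows, winners] = h[rows, winners]             # keep only the winner of each block
    return y.reshape(-1)                            # flattened (K*U,) layer output

x = np.ones(8)
mu = np.random.default_rng(2).normal(size=(8, 4, 2))
q_z = np.random.default_rng(3).uniform(size=(8, 4))
print(sb_lwta_predictive_layer(x, mu, q_z).shape)   # (8,)
```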
4. Experimental Evaluation
In the following, we evaluate the two variants of our SB-LWTA approach. We assess the predictive performance of the model, and its requirements in terms of floating-point bit precision and number of trained parameters. We also compare the effectiveness of local competition among linear units to standard nonlinearities.
In our experiments, the stick variables are drawn from a Beta(1, 1) prior. The hyperparameters of the approximate Kumaraswamy posteriors of the sticks are initialized as follows: the a_k's are set equal to the number of LWTA blocks of their corresponding layer, while the b_k's are always initialized to a common fixed value. All other initializations are random within the corresponding support sets. The employed cut-off threshold, τ, is set to a fixed small value in all experiments. The evaluated simple SB-LWTA networks omit connections on the basis of the corresponding latent indicators z falling below the set threshold τ. Analogously, when using the proposed convolutional SB-LWTA architecture, we omit full LWTA blocks (convolutional kernels).

We first consider the classical LeNet-300-100 feedforward architecture. We initially assess LWTA nonlinearities regarding their classification performance and bit precision requirements, compared to ReLU and Maxout (Goodfellow et al., 2013) activations.
Table 1.
Classification accuracy and bit precision for the LeNet-300-100 architecture. All connections are retained. Bit precision refers to the precision (in bits) required to represent the weights of each of the three layers.

ACTIVATION           ERROR (%)   BIT PRECISION (ERROR %)
ReLU                 1.60        2/4/10 (1.62)
Maxout / 2 units
Maxout / 4 units
SB-LWTA / 2 units
SB-LWTA / 4 units                (1.5)

To this end, we replace the K LWTA blocks and the U units therein (Fig. 1) with (i) K maxout blocks, each comprising U units, and (ii) K · U ReLU units (see the supplementary material); no other regularization techniques are used, e.g., dropout. These alternatives are trained by imposing Gaussian priors over the network weights and inferring the corresponding posteriors via SGVB. We consider two alternative configurations comprising: 1) 150 and 50 LWTA blocks on the first and second layer, respectively, with two competing units each; and 2) 75 and 25 LWTA blocks with four competing units each. This experimental setup allows us to examine the effect of the number of competing LWTA units on model performance, with all competitors initialized at the same number of trainable weights. We use MNIST in these experiments.

Further, we consider the LeNet-5-Caffe convolutional net, which we also evaluate on MNIST. The original LeNet-5-Caffe comprises 20 5x5 kernels (feature maps) on the first layer, 50 5x5 kernels (feature maps) on the second layer, and a dense layer with 500 units on the third. In our (convolutional) SB-LWTA implementation, we consider 10 5x5 kernels (LWTA blocks) with 2 competing feature maps each on the first layer, and 25 5x5 kernels with 2 competing feature maps each on the second layer. The intermediate pooling layers are similar to the reference architecture. We additionally consider an implementation comprising 4 competing feature maps deployed within 5 5x5 kernels on the first layer, and 12 5x5 kernels on the second layer, reducing the total feature maps of the second layer to 48.

Finally, we perform experimental evaluations on a more challenging benchmark dataset, namely CIFAR-10 (Krizhevsky & Hinton, 2009). To enable the wide replicability of our results within the community, we employ a computationally light convolutional architecture proposed by Alex Krizhevsky, which we dub ConvNet. The architecture comprises two layers with 64 5x5 kernels (feature maps), followed by two dense layers with 384 and 192 units, respectively. Similar to LeNet-5-Caffe, our SB-LWTA implementation consists in splitting the original architecture into pairs of competing feature maps on each layer. For completeness' sake, an extra experiment on CIFAR-10, dealing with a much larger network (VGG-like architecture), can be found in the provided Supplementary Material.
Table 2.
Computational footprint reduction experiments. SB-ReLU denotes a variant of SB-LWTA using ReLU units.
Architecture: LeNet-300-100

METHOD                                   ERROR (%)   # WEIGHTS (PER LAYER)   BIT PRECISION (PER LAYER)
Original                                             235K / 30K / 1K         23 / 23 / 23
StructuredBP (Neklyudov et al., 2017)    1.7
Sparse-VD (Molchanov et al., 2017)       1.92
BC-GHS (Louizos et al., 2017)            1.8
SB-ReLU                                  1.75
SB-LWTA (2 units)                        1.7
SB-LWTA (4 units)                        1.75

LeNet-300-100.
We train the network from scratch on the MNIST dataset, without using any data augmentation procedure. In Table 1, we compare the classification performance of our approach, employing 2 or 4 competing LWTA units, to LeNet-300-100 configurations employing commonly used nonlinearities. The results reported in this table for our approach are obtained without omitting connections whose utility posteriors, q(z), fall below the cut-off threshold, τ. In the second column of this table, we observe that our SB-LWTA model offers competitive accuracy and improves over the considered alternatives when operating at full bit precision (float32). The third column of this table shows how network performance changes when we attempt to reduce bit precision for both our model and the considered competitors. Bit precision reduction is based on the inferred weight posterior variance, similar to Louizos et al. (2017) (see also the supplementary material). Note that, following IEEE 754-2008, a floating-point representation comprises three quantities: a 1-bit sign, w exponent bits, and t = p − 1 precision bits (Zuras et al., 2008); thus, for the 32-bit format, t = 23 is the original bit precision. As we observe, not only does our approach yield a clearly improved accuracy in this case, but it also imposes the lowest memory footprint.

The corresponding comparative results obtained when we employ the considered threshold to reduce the computational costs are depicted in Table 2. As we observe, our approach continues to yield competitive accuracy; this is on par with the best performing alternative, which, however, requires a significantly higher number of weights combined with up to an order of magnitude higher bit precision. Thus, our approach yields the same accuracy with a lighter computational footprint. Indeed, it is important to note that our approach remains at the top of the list in terms of the obtained accuracy while retaining the least number of weights, despite the fact that it was initialized in the same dense fashion as the alternatives. Even more importantly, our method completely outperforms all the alternatives when it comes to its final bit precision requirements.

Finally, it is significant to note that by replacing the LWTA blocks in our model with ReLU units, a variant we dub SB-ReLU in Table 2, we yield clearly inferior outcomes. This constitutes strong evidence that LWTA mechanisms, at least as implemented in our work, offer benefits over conventional nonlinearities.

Figure 3.
Probabilities of winner selection for each digit in the test set, for the first 10 blocks of the second layer of the LeNet-300-100 network with two competing units; black denotes very high winning probability, while white denotes very low probability.
LeNet-5-Caffe and ConvNet convolutional architectures.
For the LeNet-5-Caffe architecture, we train the network from scratch. In Table 3, we provide the obtained comparative effectiveness of our approach, employing 2 or 4 competing LWTA feature maps. Our approach requires the least number of feature maps, while at the same time offering significantly higher compression rates in terms of bit precision, as well as better classification accuracy than the best considered alternative. By using the SB-ReLU variant of our approach, we once again yield inferior performance compared to SB-LWTA, reaffirming the benefits of LWTA mechanisms compared to conventional nonlinearities.

To obtain comparative results for the ConvNet architecture, we additionally implement the BC-GNJ and BC-GHS models with the default parameters described in Louizos et al. (2017). The learned architectures, along with their classification accuracy and bit precision requirements, are illustrated in Table 3. Similar to the LeNet-5-Caffe convolutional architecture, our method retains the least number of feature maps, while at the same time providing the most competitive bit precision requirements, accompanied by higher predictive accuracy compared to the competition.
Table 3.
Learned Convolutional Architectures.

ARCHITECTURE    METHOD                                   ERROR (%)   FEATURE MAPS (CONV. LAYERS)   BIT PRECISION (ALL LAYERS)
LeNet-5-Caffe   Original                                 0.9         20 / 50                       23 / 23 / 23 / 23
                StructuredBP (Neklyudov et al., 2017)    0.86
                VIBNet (Dai et al., 2018)                1.0
                Sparse-VD (Molchanov et al., 2017)       1.0
                BC-GHS (Louizos et al., 2017)            1.0
                SB-ReLU                                  0.9
                SB-LWTA-2                                0.9
                SB-LWTA-4                                0.8
ConvNet         Original                                             64 / 64                       23 in all layers
                BC-GNJ (Louizos et al., 2017)            18.6
                BC-GHS (Louizos et al., 2017)            17.9
                SB-LWTA-2                                17.5

Further Insights.
Finally, we scrutinize the competition patterns established within the LWTA blocks of an SB-LWTA network. To this end, we focus on the second layer of the LeNet-300-100 network, with blocks comprising two competing units. Initially, we examine the distribution of the winner selection probabilities, and how they vary over the ten MNIST classes. In Figure 3, we depict these probabilities for the first ten blocks of the network, averaged over all the data points in the test set. As we observe, the distribution of winner selection probabilities is unique for each digit. This provides empirical evidence that the trained winner selection mechanism successfully encodes salient discriminative patterns with strong generalization value in the test set.

Further, in Figure 4, we examine the overlap of winner selection among the MNIST digits. Specifically, for each digit, we compute the most often winning unit in each LWTA block, and derive the fraction of overlapping winning units over all blocks, for each pair of digits. It is apparent that winner overlap is quite low; that is, for any pair of digits, only a limited fraction of blocks share the same most frequently winning unit. This is another strong empirical result, reaffirming that the winner selection process encodes discriminative patterns of generalization value.
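The overlap statistic visualized in Fig. 4 can be computed as in the following sketch; this is our own illustrative code, where winners is assumed to hold, for each digit class, the index of the most frequently winning unit in each LWTA block:

```python
import numpy as np

def winner_overlap_matrix(winners):
    """winners: array of shape (num_classes, num_blocks) with per-block winner indices.

    Returns, for every pair of classes, the fraction of LWTA blocks whose most
    frequent winner coincides (the quantity visualized in Fig. 4)."""
    C = winners.shape[0]
    overlap = np.zeros((C, C))
    for i in range(C):
        for j in range(C):
            overlap[i, j] = np.mean(winners[i] == winners[j])
    return overlap

# Toy example: 10 digit classes, 100 LWTA blocks, U = 2 competing units per block
rng = np.random.default_rng(0)
winners = rng.integers(0, 2, size=(10, 100))
print(winner_overlap_matrix(winners).round(2))
```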
Computational Times.
As a concluding note, let us now discuss the computational time required by SB-LWTA networks, and how it compares to the baselines. One training epoch takes, on average, 10% more computational time for a network formulated under the SB-LWTA paradigm compared to a conventional network formulation (dubbed "Original" in Tables 2 and 3). On the other hand, prediction generation is immensely faster, since SB-LWTA significantly reduces the effective network size. For instance, in the LeNet-5-Caffe experiments, SB-LWTA reduces prediction time by one order of magnitude over the baseline.
Figure 4.
MNIST dataset: winning-unit overlap among digits. Black denotes that the winning units of all LWTA blocks are the same; moving towards white, overlap drops.
5. Conclusions
In this paper, we examined how we can enable deep networks to infer, in a data-driven fashion, the extent of the computational footprint they need so as to effectively model a training dataset. To this end, we introduced a deep network design principle with two core innovations: i) the utilization of LWTA nonlinearities, implemented as statistical arguments via discrete sampling techniques; and ii) the establishment of a network component utility inference paradigm, implemented by resorting to nonparametric Bayesian processes. Our assumption was that the careful blend of these core innovations would allow for immensely reducing the computational footprint of the networks without undermining predictive accuracy. Our experiments have provided strong empirical support for our approach, which outperformed all related attempts, and yielded a state-of-the-art combination of accuracy and computational footprint. These findings motivate us to further examine the efficacy of these principles in the context of other challenging machine learning problems, including generative modeling and lifelong learning. These constitute our ongoing and future research directions.
Acknowledgments
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. K. Panousis' research was co-financed by Greece and the European Union (European Social Fund - ESF) through the Operational Programme "Human Resources Development, Education and Lifelong Learning" in the context of the project "Strengthening Human Resources Research Potential via Doctorate Research" (MIS-5000432), implemented by the State Scholarships Foundation (IKY). S. Chatzis' research was partially supported by the Research Promotion Foundation of Cyprus, through the grant INTERNATIONAL/USA/0118/0037.
References
Andersen, P., Gross, G. N., Lomo, T., and Sveen, O. Participation of inhibitory and excitatory interneurones in the control of hippocampal cortical output. UCLA Forum Med Sci, 1969.

Ba, J. and Caruana, R. Do deep nets really need to be deep? In Proc. NIPS, 2014.

Chatzis, S. Indian buffet process deep generative models for semi-supervised classification. In IEEE ICASSP, 2018.

Dai, B., Zhu, C., Guo, B., and Wipf, D. Compressing neural networks using the variational information bottleneck. In Proc. ICML, 2018.

Douglas, R. J. and Martin, K. A. Neuronal circuits of the neocortex. Annu. Rev. Neurosci., 27, 2004.

Eccles, J. C., Szentagothai, J., and Ito, M. The Cerebellum as a Neuronal Machine. Springer-Verlag, 1967.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv:1506.02142, 2015.

Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. In Proc. ICML, 2013.

Graves, A. Practical variational inference for neural networks. In Proc. NIPS, 2011.

Griffiths, T. L. and Ghahramani, Z. Infinite latent feature models and the Indian buffet process. In Proc. NIPS, 2005.

Grossberg, S. The art of adaptive pattern recognition by a self-organizing neural network. Computer, 1988.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.

Ishwaran, H. and James, L. F. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 2001.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-Softmax. In Proc. ICLR, 2017.

Kandel, E. R., Schwartz, J. H., and Jessell, T. M. (eds.). Principles of Neural Science. Third edition, 1991.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, 2009.

Kumaraswamy, P. A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46(1), 1980.

Lansner, A. Associative memory models: from the cell-assembly theory to biophysically detailed cortex simulations. Trends in Neurosciences, 32(3), 2009.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 2015.

Lee, D. D. and Seung, H. S. Learning the parts of objects by nonnegative matrix factorization. Nature, 401, 1999.

Louizos, C., Ullrich, K., and Welling, M. Bayesian compression for deep learning. In Proc. NIPS, 2017.

Maass, W. Neural computation with winner-take-all as the only nonlinear operation. In Proc. NIPS, 1999.

Maass, W. On the computational power of winner-take-all. Neural Computation, 2000.

Maddison, C. J., Mnih, A., and Teh, Y. W. The Concrete distribution: A continuous relaxation of discrete random variables. In Proc. ICLR, 2017.

McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 1989.

Molchanov, D., Ashukha, A., and Vetrov, D. Variational dropout sparsifies deep neural networks. In Proc. ICML, 2017.

Nalisnick, E. and Smyth, P. Stick-breaking variational autoencoders. In Proc. ICLR, 2016.

Neklyudov, K., Molchanov, D., Ashukha, A., and Vetrov, D. P. Structured Bayesian pruning via log-normal multiplicative noise. In Proc. NIPS, 2017.

Olshausen, B. A. and Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.

Srivastava, R. K., Masci, J., Kazerounian, S., Gomez, F., and Schmidhuber, J. Compete to compute. In Proc. NIPS, 2013.

Stefanis, C. Interneuronal mechanisms in the cortex. UCLA Forum Med Sci, 1969.

Teh, Y. W., Görür, D., and Ghahramani, Z. Stick-breaking construction for the Indian buffet process. In Proc. AISTATS, 2007.

Theodoridis, S. Machine Learning: A Bayesian and Optimization Perspective. Academic Press, 2015.