From Hard to Soft: Understanding Deep Network Nonlinearities via Vector Quantization and Statistical Inference
Randall Balestriero & Richard G. Baraniuk
Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA
[email protected]

ABSTRACT
Nonlinearity is crucial to the performance of a deep (neural) network (DN). To date there has been little progress understanding the menagerie of available nonlinearities, but recently progress has been made on understanding the rôle played by piecewise affine and convex nonlinearities like the ReLU and absolute value activation functions and max-pooling. In particular, DN layers constructed from these operations can be interpreted as max-affine spline operators (MASOs) that have an elegant link to vector quantization (VQ) and $K$-means. While this is good theoretical progress, the entire MASO approach is predicated on the requirement that the nonlinearities be piecewise affine and convex, which precludes important activation functions like the sigmoid, hyperbolic tangent, and softmax. This paper extends the MASO framework to these and an infinitely large class of new nonlinearities by linking deterministic MASOs with probabilistic Gaussian Mixture Models (GMMs). We show that, under a GMM, piecewise affine, convex nonlinearities like ReLU, absolute value, and max-pooling can be interpreted as solutions to certain natural "hard" VQ inference problems, while sigmoid, hyperbolic tangent, and softmax can be interpreted as solutions to corresponding "soft" VQ inference problems. We further extend the framework by hybridizing the hard and soft VQ optimizations to create a $\beta$-VQ inference that interpolates between hard, soft, and linear VQ inference. A prime example of a $\beta$-VQ DN nonlinearity is the swish nonlinearity, which offers state-of-the-art performance in a range of computer vision tasks but was developed ad hoc by experimentation. Finally, we validate with experiments an important assertion of our theory, namely that DN performance can be significantly improved by enforcing orthogonality in its linear filters.

1 INTRODUCTION
Deep (neural) networks (DNs) have recently come to the fore in a wide range of machine learning tasks, from regression to classification and beyond. A DN is typically constructed by composing a large number of linear/affine transformations interspersed with up/down-sampling operations and simple scalar nonlinearities such as the ReLU, absolute value, sigmoid, hyperbolic tangent, etc. Goodfellow et al. (2016). Scalar nonlinearities are crucial to a DN's performance. Indeed, without nonlinearity, the entire network would collapse to a simple affine transformation. But to date there has been little progress understanding and unifying the menagerie of nonlinearities, with few reasons to choose one over another other than intuition or experimentation.
Recently, progress has been made on understanding the rôle played by piecewise affine and convex nonlinearities like the ReLU, leaky ReLU, and absolute value activations and downsampling operations like max-, average-, and channel-pooling Balestriero & Baraniuk (2018b;a). In particular, these operations can be interpreted as max-affine spline operators (MASOs) Magnani & Boyd (2009); Hannah & Dunson (2013) that enable a DN to find a locally optimized piecewise affine approximation to the prediction operator given training data. A spline-based prediction is made in two steps. First, given an input signal $x$, we determine which region of the spline's partition of the domain (the input signal space) it falls into. Second, we apply to $x$ the fixed (in this case affine) function that is assigned to that partition region to obtain the prediction $\widehat{y} = f(x)$.

The key result of Balestriero & Baraniuk (2018b;a) is that any DN layer constructed from a combination of linear and piecewise affine and convex operations is a MASO, and hence the entire DN is merely a composition of MASOs.

MASOs have the attractive property that their partition of the signal space (the collection of multidimensional "knots") is completely determined by their affine parameters (slopes and offsets). This provides an elegant link to vector quantization (VQ) and $K$-means clustering. That is, during learning, a DN implicitly constructs a hierarchical VQ of the training data that is then used for spline-based prediction.

This is good progress for DNs based on ReLU, absolute value, and max-pooling, but what about DNs based on classical, high-performing nonlinearities that are neither piecewise affine nor convex like the sigmoid, hyperbolic tangent, and softmax, or fresh nonlinearities like the swish Ramachandran et al. (2017) that has been shown to outperform others on a range of tasks?
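The two-step spline prediction described above can be made concrete with a minimal sketch (ours, not from the paper; the slopes and offsets are hypothetical) of a scalar max-affine spline: step one is an argmax over the affine pieces, step two applies the selected piece.

import numpy as np

# Hypothetical spline parameters: R = 3 affine pieces (slopes a_r, offsets b_r).
a = np.array([0.0, 1.0, 2.0])   # slopes
b = np.array([0.0, 0.0, -1.0])  # offsets

def spline_predict(x):
    # Step 1: region selection -- find which affine piece attains the max at x.
    r = np.argmax(a * x + b)
    # Step 2: apply the affine mapping assigned to that partition region.
    return a[r] * x + b[r], r

y, region = spline_predict(1.5)
print(y, region)  # the convex, piecewise affine prediction and its VQ region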
Contributions.
In this paper, we address this gap in the DN theory by developing a new framework that unifies a wide range of DN nonlinearities and inspires and supports the development of new ones.
The key idea is to leverage the yinyang relationship between deterministic VQ/$K$-means and probabilistic Gaussian Mixture Models (GMMs) Biernacki et al. (2000). Under a GMM, piecewise affine, convex nonlinearities like ReLU and absolute value can be interpreted as solutions to certain natural hard inference problems, while sigmoid and hyperbolic tangent can be interpreted as solutions to corresponding soft inference problems. We summarize our primary contributions as follows:

Contribution 1: We leverage the well-understood relationship between VQ, $K$-means, and GMMs to propose the Soft MASO (SMASO) model, a probabilistic GMM that extends the concept of a deterministic MASO DN layer. Under the SMASO model, hard maximum a posteriori (MAP) inference of the VQ parameters corresponds to conventional deterministic MASO DN operations that involve piecewise affine and convex functions, such as fully connected and convolution matrix multiplication; ReLU, leaky-ReLU, and absolute value activation; and max-, average-, and channel-pooling. These operations assign the layer's input signal (feature map) to the VQ partition region corresponding to the closest centroid in terms of the Euclidean distance.

Contribution 2: A hard VQ inference contains no information regarding the confidence of the VQ region selection, which is related to the distance from the input signal to the region boundary. In response, we develop a method for soft MAP inference of the VQ parameters based on the probability that the layer input belongs to a given VQ region.
Switching from hard to soft VQ inference recovers several classical and powerful nonlinearities and provides an avenue to derive completely new ones.
We illustrate by showing that the soft versions of ReLU and max-pooling are the sigmoid gated linear unit and softmax pooling, respectively. We also find a home for the sigmoid, hyperbolic tangent, and softmax in the framework as a new kind of DN layer where the MASO output is the VQ probability.

Contribution 3: We generalize hard and soft VQ to what we call $\beta$-VQ inference, where $\beta \in (0,1)$ is a free and learnable parameter. This parameter interpolates the VQ from linear ($\beta \to 0$), to probabilistic SMASO ($\beta = 0.5$), to deterministic MASO ($\beta \to 1$). We show that the $\beta$-VQ version of the hard ReLU activation is the swish nonlinearity, which offers state-of-the-art performance in a range of computer vision tasks but was developed ad hoc through experimentation Ramachandran et al. (2017).

Contribution 4: Seen through the MASO lens, current DNs solve a simplistic per-unit (per-neuron), independent VQ optimization problem at each layer. In response, we extend the SMASO GMM to a factorial GMM that supports jointly optimal VQ optimization across all units in a layer. Since the factorial aspect of the new model would make naïve VQ inference exponentially computationally complex, we develop a simple sufficient condition under which we can achieve efficient, tractable, jointly optimal VQ inference. The condition is that the linear "filters" feeding into any nonlinearity should be orthogonal. We propose two simple strategies to learn approximately and truly orthogonal weights and show on three different datasets that both offer significant improvements in classification performance. Since orthogonalization can be applied to an arbitrary DN, this result and our theoretical understanding are of independent interest.

This paper is organized as follows. After reviewing the theory of MASOs and VQ for DNs in Section 2, we formulate the GMM-based extension to SMASOs in Section 3. Section 4 develops the hybrid $\beta$-VQ inference with a special case study on the swish nonlinearity. Section 5 extends the SMASO to a factorial GMM and shows the power of DN orthogonalization. We wrap up in Section 6 with directions for future research. Proofs of the various results appear in several appendices in the Supplementary Material.

2 BACKGROUND ON MAX-AFFINE SPLINES AND DEEP NETWORKS
We first briefly review max-affine spline operators (MASOs) in the context of understanding the inner workings of DNs Balestriero & Baraniuk (2018b;a). A MASO is an operator $S[A,B]: \mathbb{R}^D \to \mathbb{R}^K$ that maps an input vector of length $D$ into an output vector of length $K$ by leveraging $K$ independent max-affine splines Magnani & Boyd (2009); Hannah & Dunson (2013), each with $R$ regions, which are piecewise affine and convex mappings. The MASO parameters consist of the "slopes" $A \in \mathbb{R}^{K \times R \times D}$ and the "offsets/biases" $B \in \mathbb{R}^{K \times R}$. See Appendix A for the precise definition. Given the input $z^{(\ell-1)} \in \mathbb{R}^{D^{(\ell-1)}}$ (i.e., $D = D^{(\ell-1)}$) and parameters $A^{(\ell)}, B^{(\ell)}$, a MASO produces the output $z^{(\ell)} \in \mathbb{R}^{D^{(\ell)}}$ (i.e., $K = D^{(\ell)}$) via

$[z^{(\ell)}]_k = \big[ S[A^{(\ell)}, B^{(\ell)}](z^{(\ell-1)}) \big]_k = \max_{r=1,\dots,R^{(\ell)}} \Big( \big\langle [A^{(\ell)}]_{k,r,\cdot},\, z^{(\ell-1)} \big\rangle + [B^{(\ell)}]_{k,r} \Big), \quad (1)$

where $[z^{(\ell)}]_k$ denotes the $k$th dimension of $z^{(\ell)}$ and $R^{(\ell)}$ is the number of regions in the splines' partition of the input space $\mathbb{R}^{D^{(\ell-1)}}$.

An important consequence of (1) is that a MASO is completely determined by its slope and offset parameters without needing to specify the partition of the input space (the "knots" when $D = 1$). Indeed, solving (1) automatically computes an optimized partition of the input space $\mathbb{R}^D$ that is equivalent to a vector quantization (VQ) Nasrabadi & King (1988); Gersho & Gray (2012). We can make the VQ aspect explicit by rewriting (1) in terms of the Hard-VQ (HVQ) matrix $T_H^{(\ell)} \in \mathbb{R}^{D^{(\ell)} \times R^{(\ell)}}$. This VQ matrix contains $D^{(\ell)}$ stacked one-hot row vectors, each with the one-hot position at index $[t^{(\ell)}]_k \in \{1,\dots,R^{(\ell)}\}$ corresponding to the $\arg\max$ over $r = 1,\dots,R^{(\ell)}$ of (1):

$[z^{(\ell)}]_k = \sum_{r=1}^{R^{(\ell)}} [T_H^{(\ell)}]_{k,r} \Big( \big\langle [A^{(\ell)}]_{k,r,\cdot},\, z^{(\ell-1)} \big\rangle + [B^{(\ell)}]_{k,r} \Big). \quad (2)$

We retrieve (1) from (2) by noting that $[t^{(\ell)}]_k = \arg\max_{r=1,\dots,R^{(\ell)}} \big( \langle [A^{(\ell)}]_{k,r,\cdot}, z^{(\ell-1)} \rangle + [B^{(\ell)}]_{k,r} \big)$.

The key background result for this paper is that the layers of a very large class of DNs are MASOs. Hence, such a DN is a composition of MASOs.

Theorem 1.
Any DN layer comprising a linear operator (e.g., fully connected or convolution) composed with a convex and piecewise affine operator (such as a ReLU, leaky-ReLU, or absolute value activation; max/average/channel-pooling; maxout; all with or without skip connections) is a MASO Balestriero & Baraniuk (2018b;a).
Appendix A provides the parameters $A^{(\ell)}, B^{(\ell)}$ for the MASO corresponding to the $\ell$th layer of any DN constructed from linear plus piecewise affine and convex components. Given this connection, we will identify $z^{(\ell-1)}$ above as the input (feature map) to the MASO DN layer and $z^{(\ell)}$ as the output (feature map). We also identify $[z^{(\ell)}]_k$ in (1) and (2) as the output of the $k$th unit (aka neuron) of the $\ell$th layer. MASOs for higher-dimensional tensor inputs/outputs are easily developed by flattening.
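The following NumPy sketch (our illustration, with random placeholder weights, not the authors' code) evaluates a MASO layer both as the max in (1) and via the one-hot HVQ matrix in (2), and checks that a two-region MASO with one zero slope row reproduces ReLU, as implied by Theorem 1.

import numpy as np

rng = np.random.default_rng(0)
D, K, R = 5, 4, 2
W = rng.normal(size=(K, D))

A = np.stack([W, np.zeros((K, D))], axis=1)  # slopes, shape (K, R, D)
B = np.zeros((K, R))                         # offsets, shape (K, R)

z = rng.normal(size=D)
scores = np.einsum('krd,d->kr', A, z) + B    # inner products plus offsets
out_maso = scores.max(axis=1)                # equation (1)

T_hard = np.eye(R)[scores.argmax(axis=1)]    # one-hot HVQ matrix of (2)
out_hvq = (T_hard * scores).sum(axis=1)      # equation (2)

assert np.allclose(out_maso, np.maximum(W @ z, 0.0))  # ReLU recovered
assert np.allclose(out_maso, out_hvq)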
3 MAX-AFFINE SPLINES MEET GAUSSIAN MIXTURE MODELS

The MASO/HVQ connection provides deep insights into how a DN clusters and organizes signals layer by layer in a hierarchical fashion Balestriero & Baraniuk (2018b;a). However, the entire approach requires that the nonlinearities be piecewise affine and convex, which precludes important activation functions like the sigmoid, hyperbolic tangent, and softmax. The goal of this paper is to extend the MASO analysis framework of Section 2 to these and an infinitely large class of other nonlinearities by linking deterministic MASOs with probabilistic Gaussian Mixture Models (GMMs).
3.1 FROM MASO TO GMM VIA $K$-MEANS
For now, we focus on a single unit $k$ from layer $\ell$ of a MASO DN, which contains both linear and nonlinear operators; we generalize below in Section 5. The key to the MASO mechanism lies in the VQ variables $[t^{(\ell)}]_k$ $\forall k$, since they fully determine the output via (2). For a special choice of bias, the VQ variable computation is equivalent to the $K$-means algorithm Balestriero & Baraniuk (2018b;a).

Proposition 1.
Given $[B^{(\ell)}]_{k,r} = -\frac{1}{2}\big\|[A^{(\ell)}]_{k,r,\cdot}\big\|^2$, the MASO VQ partition corresponds to a $K$-means clustering with centroids $[A^{(\ell)}]_{k,r,\cdot}$, computed via $[\widehat{t}^{(\ell)}]_k = \arg\min_{r=1,\dots,R^{(\ell)}} \big\|[A^{(\ell)}]_{k,r,\cdot} - z^{(\ell-1)}\big\|^2$. (It would be more accurate to call this $R^{(\ell)}$-means clustering in this case.)

For example, consider a layer $\ell$ using a ReLU activation function. Unit $k$ of that layer partitions its input space using a $K$-means model with $R^{(\ell)} = 2$ centroids: the origin of the input space and the unit layer parameter $[A^{(\ell)}]_{k,1,\cdot}$. The input is mapped to the partition region corresponding to the closest centroid in terms of the Euclidean distance, and the corresponding affine mapping for that region is used to project the input and produce the layer output as in (2).

We now leverage the well-known relationship between $K$-means and Gaussian Mixture Models (GMMs) Bishop (2006) to GMM-ize the deterministic VQ process of max-affine splines. As we will see, the constraint on the value of $[B^{(\ell)}]_{k,r}$ in Proposition 1 will be relaxed thanks to the GMM's ability to work with a nonuniform prior over the regions (in contrast to $K$-means).

To move from a deterministic MASO model to a probabilistic GMM, we reformulate the HVQ selection variable $[t^{(\ell)}]_k$ as an unobserved categorical variable $[t^{(\ell)}]_k \sim \mathrm{Cat}([\pi^{(\ell)}]_{k,\cdot})$ with parameter $[\pi^{(\ell)}]_{k,\cdot} \in \Delta_{R^{(\ell)}}$, where $\Delta_{R^{(\ell)}}$ is the simplex of dimension $R^{(\ell)}$. Armed with this, we define the following generative model for the layer input $z^{(\ell-1)}$ as a mixture of $R^{(\ell)}$ Gaussians with means $[A^{(\ell)}]_{k,r,\cdot} \in \mathbb{R}^{D^{(\ell-1)}}$ and identical isotropic covariance with parameter $\sigma^2$:

$z^{(\ell-1)} = \sum_{r=1}^{R^{(\ell)}} 1\big([t^{(\ell)}]_k = r\big)\, [A^{(\ell)}]_{k,r,\cdot} + \epsilon, \quad (3)$

with $\epsilon \sim \mathcal{N}(0, I\sigma^2)$. Note that this GMM generates an independent vector input $z^{(\ell-1)}$ for every unit $k = 1,\dots,D^{(\ell)}$ in layer $\ell$. For reasons that will become clear below in Section 3.3, we will refer to the GMM model (3) as the Soft MASO (SMASO) model. We develop a joint, factorial model for the entire MASO layer (and not just one unit) in Section 5.
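A quick numerical check of Proposition 1 (our sketch, with hypothetical centroids): under the stated bias constraint, the MASO argmax and the nearest-centroid ($K$-means) assignment coincide, since $\langle a, z \rangle - \frac{1}{2}\|a\|^2$ is maximized exactly when $\|a - z\|^2$ is minimized.

import numpy as np

rng = np.random.default_rng(1)
R, D = 4, 3
A_k = rng.normal(size=(R, D))           # the R centroids of unit k
B_k = -0.5 * (A_k ** 2).sum(axis=1)     # the constrained biases of Prop. 1

z = rng.normal(size=D)
t_maso = np.argmax(A_k @ z + B_k)                    # MASO VQ variable
t_kmeans = np.argmin(((A_k - z) ** 2).sum(axis=1))   # nearest-centroid rule
assert t_maso == t_kmeans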
3.2 HARD VQ INFERENCE
Given the GMM (3) and an input $z^{(\ell-1)}$, we can compute a hard inference of the optimal VQ selection variable $[t^{(\ell)}]_k$ via the maximum a posteriori (MAP) principle

$[\widehat{t}^{(\ell)}]_k = \arg\max_{t=1,\dots,R^{(\ell)}} p\big(t \mid z^{(\ell-1)}\big). \quad (4)$

The following result is proved in Appendix E.1.

Theorem 2.
The hard, MAP inference of the latent selection variable $[\widehat{t}^{(\ell)}]_k$ given in (4) can be computed via the MASO HVQ (1):

$[\widehat{t}^{(\ell)}]_k = \arg\max_{r=1,\dots,R^{(\ell)}} \Big( \big\langle [A^{(\ell)}]_{k,r,\cdot},\, z^{(\ell-1)} \big\rangle + [B^{(\ell)}]_{k,r} \Big) \quad \forall A^{(\ell)}, \forall B^{(\ell)}, \quad (5)$

with $\sigma^2 = 1$ and $[\pi^{(\ell)}]_{k,r} = \dfrac{\exp\big([B^{(\ell)}]_{k,r} + \frac{1}{2}\|[A^{(\ell)}]_{k,r,\cdot}\|^2\big)}{\sum_{r'} \exp\big([B^{(\ell)}]_{k,r'} + \frac{1}{2}\|[A^{(\ell)}]_{k,r',\cdot}\|^2\big)}$, $r = 1,\dots,R^{(\ell)}$. The optimal HVQ selection matrix is given by $[\widehat{T}_H^{(\ell)}]_{k,r} = 1\big(r = [\widehat{t}^{(\ell)}]_k\big)$.

Note in Theorem 2 that the bias constraint of Proposition 1 (which can be interpreted as imposing a uniform prior $[\pi^{(\ell)}]_{k,\cdot}$) is completely relaxed.

HVQ inference of the selection matrix sheds light on some of the drawbacks that affect any DN employing piecewise affine, convex activation functions. First, during gradient-based learning, the gradient will propagate back only through the activated VQ regions that correspond to the few 1-hot entries in $T_H^{(\ell)}$. The parameters of other regions will not be updated; this is known as the "dying neurons phenomenon" Trottier et al. (2017); Agarap (2018). Second, the overall MASO mapping is continuous but not differentiable, which leads to unexpected gradient jumps during learning. Third, the HVQ inference contains no information regarding the confidence of the VQ region selection, which is related to the distance of the query point to the region boundary. As we will now see, this extra information can be very useful and gives rise to a range of classical and new activation functions.
3.3 SOFT VQ INFERENCE
We can overcome many of the limitations of HVQ inference in DNs by replacing the 1-hot entries of the HVQ selection matrix with the probability that the layer input belongs to a given VQ region:

$[\widehat{T}_S^{(\ell)}]_{k,r} = p\big([t^{(\ell)}]_k = r \mid z^{(\ell-1)}\big) = \dfrac{\exp\big(\langle [A^{(\ell)}]_{k,r,\cdot}, z^{(\ell-1)} \rangle + [B^{(\ell)}]_{k,r}\big)}{\sum_{r'} \exp\big(\langle [A^{(\ell)}]_{k,r',\cdot}, z^{(\ell-1)} \rangle + [B^{(\ell)}]_{k,r'}\big)}, \quad (6)$

which follows from the simple structure of the GMM. This corresponds to a soft inference of the categorical variable $[t^{(\ell)}]_k$. (The observant reader will recognize this as the E-step of the GMM's EM learning algorithm.) Note that $T_S^{(\ell)} \to T_H^{(\ell)}$ as the noise variance in (3) $\to 0$. Given the SVQ selection matrix, the MASO output is still computed via (2). The SVQ matrix can be computed indirectly from an entropy-penalized MASO optimization; the following is proved in Appendix E.2.
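A minimal sketch (ours) of the SVQ row (6) for a single unit. Dividing the scores by a noise variance sigma2 is our shorthand for redoing the inference under a general GMM noise level; letting sigma2 shrink illustrates the convergence of $T_S^{(\ell)}$ to the one-hot $T_H^{(\ell)}$ noted above.

import numpy as np

def svq_row(A_k, B_k, z, sigma2=1.0):
    # Soft region memberships of (6), as a numerically stable softmax.
    scores = (A_k @ z + B_k) / sigma2
    e = np.exp(scores - scores.max())
    return e / e.sum()

rng = np.random.default_rng(2)
A_k = rng.normal(size=(3, 4))
B_k = rng.normal(size=3)
z = rng.normal(size=4)
print(svq_row(A_k, B_k, z))               # soft memberships, sum to 1
print(svq_row(A_k, B_k, z, sigma2=1e-3))  # nearly one-hot: approaches HVQ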
Proposition 2. The entries of the SVQ selection matrix $[\widehat{T}_S^{(\ell)}]_{k,\cdot}$ from (6) solve the following entropy-penalized maximization, where $H(\cdot)$ is the Shannon entropy:

$[\widehat{T}_S^{(\ell)}]_{k,\cdot} = \arg\max_{t \in \Delta_{R^{(\ell)}}} \sum_{r=1}^{R^{(\ell)}} [t]_r \Big( \big\langle [A^{(\ell)}]_{k,r,\cdot},\, z^{(\ell-1)} \big\rangle + [B^{(\ell)}]_{k,r} \Big) + H(t). \quad (7)$
3.4 SOFT VQ MASO NONLINEARITIES
Remarkably, switching from HVQ to SVQ MASO inference recovers several classical and powerful nonlinearities and provides an avenue to derive completely new ones. Given a set of MASO parameters $A^{(\ell)}, B^{(\ell)}$ for calculating the layer-$\ell$ output of a DN via (1), we can derive two distinctly different DNs: one based on the HVQ inference of (5) and one based on the SVQ inference of (6). The following results are proved in Appendix E.5.

Proposition 3.
The MASO parameters $A^{(\ell)}, B^{(\ell)}$ that induce the ReLU activation under HVQ induce the sigmoid gated linear unit Elfwing et al. (2018) under SVQ.
Proposition 4.
The MASO parameters $A^{(\ell)}, B^{(\ell)}$ that induce the max-pooling nonlinearity under HVQ induce softmax-pooling Boureau et al. (2010) under SVQ.
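To illustrate Propositions 3 and 4, a small sketch (our code, our notation) contrasting each hard nonlinearity with its soft counterpart: ReLU versus the sigmoid gated linear unit $u\,\sigma(u)$, and max-pooling versus softmax pooling (a softmax-weighted average of the pooled values).

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

u = np.linspace(-3.0, 3.0, 7)
relu_hard = np.maximum(u, 0.0)   # HVQ: hard region selection
silu_soft = u * sigmoid(u)       # SVQ: linear piece weighted by its posterior

v = np.array([0.5, 2.0, -1.0])   # values entering one pooling window
pool_hard = v.max()              # HVQ max-pooling
w = np.exp(v - v.max())
w /= w.sum()
pool_soft = (w * v).sum()        # SVQ softmax pooling (soft-weighted average)
print(relu_hard, silu_soft, pool_hard, pool_soft)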
Appendix C discusses how the GMM and SVQ formulations shed new light on the impact of parameter initialization in DN learning plus how these formulations can be extended further.

3.5 ADDITIONAL NONLINEARITIES AS SOFT DN LAYERS
Changing viewpoint slightly, we can also derive classical nonlinearities like the sigmoid, tanh, and softmax Goodfellow et al. (2016) from the soft inference perspective.
VQ Type                          Value for $[T^{(\ell)}]_k$                                            Examples
Hard VQ (HVQ)                    $\arg\max_{t \in \Delta_{R^{(\ell)}}} P(t)$                           ReLU, max-pooling
Soft VQ (SVQ)                    $\arg\max_{t \in \Delta_{R^{(\ell)}}} P(t) + H(t)$                    SiGLU, softmax-pooling
$\beta$-VQ, $\beta \in [0,1]$    $\arg\max_{t \in \Delta_{R^{(\ell)}}} \beta P(t) + (1-\beta) H(t)$    swish, $\beta$-softmax-pooling

Table 1: Impact of different VQ strategies for a MASO layer, with $P(t) := \sum_{r=1}^{R^{(\ell)}} [t]_r \big( \langle [A^{(\ell)}]_{k,r,\cdot}, z^{(\ell-1)} \rangle + [B^{(\ell)}]_{k,r} \big)$.

Consider a new soft DN layer whose unit output $[z^{(\ell)}]_k$ is not the piecewise affine spline of (2) but rather the probability $[z^{(\ell)}]_k = p\big([t^{(\ell)}]_k = 1 \mid z^{(\ell-1)}\big)$ that the input $z^{(\ell-1)}$ falls into the given VQ region. The following propositions are proved in Appendix E.6.

Proposition 5.
The MASO parameters $A^{(\ell)}, B^{(\ell)}$ that induce the ReLU activation under HVQ induce the sigmoid activation in the corresponding soft DN layer.

A similar train of thought recovers the softmax nonlinearity typically used at the DN output for classification problems.
Proposition 6.
The MASO parameters $A^{(\ell)}, B^{(\ell)}$ that induce a fully-connected-pooling layer under HVQ (with output dimension $D^{(L)}$ equal to the number of classes $C$) induce the softmax nonlinearity in the corresponding soft DN layer.
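A minimal sketch (ours, using the two-region ReLU parameterization from Section 3.1 with the active region indexed first) of the soft DN layer of Proposition 5: the posterior probability of the active region is exactly the sigmoid of the pre-activation.

import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(4, 5))   # slope rows of the active regions
z = rng.normal(size=5)

u = W @ z                                 # pre-activations
p_active = np.exp(u) / (np.exp(u) + 1.0)  # p(t_k = active | z), from (6)
assert np.allclose(p_active, 1.0 / (1.0 + np.exp(-u)))  # the sigmoid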
4 HYBRID HARD/SOFT INFERENCE VIA ENTROPY REGULARIZATION

Combining (5) and (6) yields a hybrid optimization for a new $\beta$-VQ that recovers hard, soft, and linear VQ inference as special cases:

$[\widehat{T}_\beta^{(\ell)}]_{k,\cdot} = \arg\max_{t \in \Delta_{R^{(\ell)}}} [\beta^{(\ell)}]_k \sum_{r=1}^{R^{(\ell)}} [t]_r \Big( \big\langle [A^{(\ell)}]_{k,r,\cdot},\, z^{(\ell-1)} \big\rangle + [B^{(\ell)}]_{k,r} \Big) + \big(1 - [\beta^{(\ell)}]_k\big) H(t). \quad (8)$

The new hyper-parameter $[\beta^{(\ell)}]_k \in (0,1)$. The following is proved in Appendix E.3.

Theorem 3.
The unique global optimum of (8) is given by

$[\widehat{T}_\beta^{(\ell)}]_{k,r} = \dfrac{\exp\Big(\frac{[\beta^{(\ell)}]_k}{1 - [\beta^{(\ell)}]_k} \big( \langle [A^{(\ell)}]_{k,r,\cdot}, z^{(\ell-1)} \rangle + [B^{(\ell)}]_{k,r} \big)\Big)}{\sum_{j=1}^{R^{(\ell)}} \exp\Big(\frac{[\beta^{(\ell)}]_k}{1 - [\beta^{(\ell)}]_k} \big( \langle [A^{(\ell)}]_{k,j,\cdot}, z^{(\ell-1)} \rangle + [B^{(\ell)}]_{k,j} \big)\Big)}. \quad (9)$

The $\beta$-VQ covers all of the theory developed above as special cases: $\beta = 1$ yields HVQ, $\beta = \frac{1}{2}$ yields SVQ, and $\beta = 0$ yields a linear MASO with $[\widehat{T}_0^{(\ell)}]_{k,r} = \frac{1}{R^{(\ell)}}$. See Figure 1 for examples of how the $\beta$ parameter interacts with three example activation functions. Note also the attractive property that (9) is differentiable with respect to $[\beta^{(\ell)}]_k$.

The $\beta$-VQ supports the development of new, high-performance DN nonlinearities. For example, the swish activation $\sigma_{\mathrm{swish}}(u) = \sigma_{\mathrm{sig}}([\eta^{(\ell)}]_k u)\, u$ extends the sigmoid gated linear unit Elfwing et al. (2018) with the learnable parameter $[\eta^{(\ell)}]_k$ Ramachandran et al. (2017). Numerous experimental studies have shown that DNs equipped with a learned swish activation significantly outperform those with more classical activations like ReLU and sigmoid.

Proposition 7.
The MASO $A^{(\ell)}, B^{(\ell)}$ parameters that induce the ReLU nonlinearity under HVQ induce the swish nonlinearity under $\beta$-VQ, with $[\eta^{(\ell)}]_k = \frac{[\beta^{(\ell)}]_k}{1 - [\beta^{(\ell)}]_k}$.

Table 1 summarizes some of the many nonlinearities that are within reach of the $\beta$-VQ. (The tanh activation is obtained similarly by reparametrizing $A^{(\ell)}$ and $B^{(\ell)}$; see Appendix E.6. Best performance was usually achieved with $[\eta^{(\ell)}]_k \in (0,1)$ Ramachandran et al. (2017).)

Figure 1: Starting from MASO parameters $A^{(\ell)}, B^{(\ell)}$ for which HVQ yields the ReLU, absolute value, and an arbitrary convex activation function, we explore how changing $\beta$ in the $\beta$-VQ alters the induced activation function. Solid black: HVQ ($\beta = 1$); dashed black: SVQ ($\beta = \frac{1}{2}$); red: $\beta$-VQ with intermediate values of $\beta$. Interestingly, note how some of the $\beta$-VQ activations are nonconvex.
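A minimal sketch (ours) of Proposition 7: with the ReLU parameterization, the $\beta$-VQ soft selection (9) weights the linear piece by $\sigma_{\mathrm{sig}}(\eta u)$ with $\eta = \beta/(1-\beta)$, which is exactly the swish activation.

import numpy as np

def beta_vq_activation(u, beta):
    # Proposition 7: eta = beta / (1 - beta); output u * sigmoid(eta * u).
    eta = beta / (1.0 - beta)
    return u / (1.0 + np.exp(-eta * u))

u = np.linspace(-4.0, 4.0, 9)
print(beta_vq_activation(u, 0.5))  # beta = 1/2: SVQ, the sigmoid gated linear unit
print(beta_vq_activation(u, 0.9))  # beta -> 1: approaches the HVQ ReLU
print(np.maximum(u, 0.0))          # HVQ ReLU for comparison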
5 OPTIMAL JOINT VQ INFERENCE VIA ORTHOGONALIZATION
The GMM (3) models the impact of only a single layer unit on the layer-$\ell$ input $z^{(\ell-1)}$. We can easily extend this model to a factorial model for $z^{(\ell-1)}$ that enables all $D^{(\ell)}$ units at layer $\ell$ to combine their syntheses:

$z^{(\ell-1)} = \sum_{k=1}^{D^{(\ell)}} \sum_{r=1}^{R^{(\ell)}} 1\big([t^{(\ell)}]_k = r\big)\, [A^{(\ell)}]_{k,r,\cdot} + \epsilon, \quad (10)$

with $\epsilon \sim \mathcal{N}(0, I\sigma^2)$. This new model is a mixture of $R^{(\ell)}$ Gaussians with means $[A^{(\ell)}]_{k,r,\cdot} \in \mathbb{R}^{D^{(\ell-1)}}$ and identical isotropic covariances with variance $\sigma^2$. The factorial aspect of the model means that the number of possible combinations of the $t^{(\ell)}$ values grows exponentially with the number of units. Hence, inferring the latent variables $t^{(\ell)}$ quickly becomes intractable.

However, we can break this combinatorial barrier and achieve efficient, tractable VQ inference by constraining the MASO slope parameters $A^{(\ell)}$ to be orthogonal:

$\big\langle [A^{(\ell)}]_{k,r,\cdot},\, [A^{(\ell)}]_{k',r',\cdot} \big\rangle = 0 \quad \forall k \neq k',\ \forall r, r'. \quad (11)$

Orthogonality is achieved in a fully connected layer (multiplication by the dense matrix $W^{(\ell)}$ composed with activation or pooling) when the rows of $W^{(\ell)}$ are orthogonal. Orthogonality is achieved in a convolution layer (multiplication by the convolution matrix $C^{(\ell)}$ composed with activation or pooling) when the rows of $C^{(\ell)}$ are either non-overlapping or properly apodized; see Appendix E.4 for the details plus the proof of the following result.

Theorem 4.
If the slope parameters $A^{(\ell)}$ of a MASO are orthogonal in the sense of (11), then the random variables $[t^{(\ell)}]_1 \mid z^{(\ell-1)}, \dots, [t^{(\ell)}]_{D^{(\ell)}} \mid z^{(\ell-1)}$ of the model (10) are independent, and hence $p\big([t^{(\ell)}]_1, \dots, [t^{(\ell)}]_{D^{(\ell)}} \mid z^{(\ell-1)}\big) = \prod_{k=1}^{D^{(\ell)}} p\big([t^{(\ell)}]_k \mid z^{(\ell-1)}\big)$.

Per-unit orthogonality brings the benefit of "uncorrelated unit firing," which has been shown to provide many practical advantages in DNs Srivastava et al. (2014). Orthogonality also renders the joint MAP inference of the factorial model's VQs tractable. The following result is proved in Appendix E.4.
Corollary 1.
When the conditions of Theorem 4 are fulfilled, the joint MAP estimate for the VQs of the factorial model (10) is

$\widehat{t}_f^{(\ell)} = \arg\max_{t \in \{1,\dots,R^{(\ell)}\} \times \dots \times \{1,\dots,R^{(\ell)}\}} p\big(t \mid z^{(\ell-1)}\big) = \big[ [\widehat{t}^{(\ell)}]_1, \dots, [\widehat{t}^{(\ell)}]_{D^{(\ell)}} \big]^\top \quad (12)$

and thus can be computed with linear complexity in the number of units.

The advantages of orthogonal or near-orthogonal filters have been explored empirically in various settings, from GANs Brock et al. (2016) to RNNs Huang et al. (2017), typically demonstrating improved performance.

Table 2: For the largeCNN architecture (detailed in Appendix D), we tabulate the classification accuracy (larger is better) and its standard deviation, averaged over runs with different Adam learning rates (LR = 0.01, 0.005, 0.001) on SVHN, CIFAR10, and CIFAR100. In each case, orthogonal fully-connected and convolution matrices improve the classification accuracy over the baseline.

Table 2 tabulates the results of a simple confirmation experiment with the largeCNN architecture described in Appendix D. We added to the standard cross-entropy loss a term $\lambda \sum_k \sum_{k' \neq k} \sum_{r,r'} \big\langle [A^{(\ell)}]_{k,r,\cdot},\, [A^{(\ell)}]_{k',r',\cdot} \big\rangle^2$ that penalizes non-orthogonality (recall (11)). We did not cross-validate the penalty coefficient $\lambda$ but instead set it equal to 1. The tabulated results show clearly that favoring orthogonal filters improves accuracy across both different datasets and different learning settings.

Since the orthogonality penalty does not guarantee true orthogonality but simply favors it, we performed one additional experiment where we reparametrized the fully-connected and convolution matrices using the Gram-Schmidt (GS) process Daniel et al. (1976) so that they were truly orthogonal. Thanks to the differentiability of all of the operations involved in the GS process, we can backpropagate the loss to the orthogonalized filters in order to update them in learning. We also used the swish activation, which we showed to be a $\beta$-VQ nonlinearity in Section 4. Since the GS process adds significant computational overhead to the learning algorithm, we conducted only one experiment, on the largest dataset (CIFAR100). The exactly orthogonalized largeCNN achieved a classification accuracy that is a major improvement over all of the results in the bottom (CIFAR100) cell of Table 2. This indicates that there are good reasons to try to improve on the simple orthogonality-penalty-based approach.
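A NumPy sketch (ours, not the training code used for Table 2) of the two strategies just described: the non-orthogonality penalty, written with squared inner products so that it is nonnegative (the extraction dropped superscripts, so the exact exponent in the paper's penalty is our reading), and a differentiable Gram-Schmidt reparameterization yielding exactly orthogonal rows.

import numpy as np

def ortho_penalty(W):
    # W: (n_filters, dim); penalize off-diagonal entries of the Gram matrix,
    # i.e., inner products between distinct filters.
    G = W @ W.T
    return ((G - np.diag(np.diag(G))) ** 2).sum()

def gram_schmidt(W):
    # Classical Gram-Schmidt; every step is differentiable, so a loss on the
    # orthogonalized filters can be backpropagated to the raw parameters.
    Q = []
    for w in W:
        for q in Q:
            w = w - (w @ q) * q
        Q.append(w / np.linalg.norm(w))
    return np.stack(Q)

rng = np.random.default_rng(4)
W = rng.normal(size=(3, 8))
print(ortho_penalty(W))                  # > 0 for generic filters
Q = gram_schmidt(W)
print(np.allclose(Q @ Q.T, np.eye(3)))   # True: exactly orthogonal rows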
6 FUTURE WORK

Our development of the SMASO model opens the door to several new research questions. First, we have merely scratched the surface in the exploration of new nonlinear activation functions and pooling operators based on the SVQ and $\beta$-VQ. For example, the soft- or $\beta$-VQ versions of leaky-ReLU, absolute value, and other piecewise affine and convex nonlinearities could outperform the new swish nonlinearity. Second, replacing the entropy penalty in (7) and (8) with a different penalty will create entirely new classes of nonlinearities that inherit the rich analytical properties of MASO DNs. Third, orthogonal DN filters will enable new analysis techniques and DN probing methods, since, from a signal processing point of view, problems such as denoising, reconstruction, and compression have been extensively studied in terms of orthogonal filters. Fourth, the Gram-Schmidt exact orthogonalization routine is computationally intense for very deep and wide DNs. We plan to explore methods based on recursion and parallelism to speed up the computations.

REFERENCES
A. F. Agarap. Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375, 2018.
R. Balestriero and R. Baraniuk. Mad max: Affine spline insights into deep learning. arXiv preprint arXiv:1805.06576, 2018a.
R. Balestriero and R. G. Baraniuk. A spline theory of deep networks. In Proc. Int. Conf. Mach. Learn., volume 80, pp. 374–383, Jul. 2018b.
C. Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell., 22(7):719–725, 2000.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006.
Y. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In Proc. Int. Conf. Mach. Learn., pp. 111–118, 2010.
A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Neural photo editing with introspective adversarial networks. arXiv preprint arXiv:1609.07093, 2016.
J. W. Daniel, W. B. Gragg, L. Kaufman, and G. W. Stewart. Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization. Math. Comput., 30(136):772–795, 1976.
S. Elfwing, E. Uchibe, and K. Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw., 2018.
A. Gersho and R. M. Gray. Vector Quantization and Signal Compression. Springer, 2012.
X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proc. 13th Int. Conf. AI Statist., volume 9, pp. 249–256, 2010.
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning, volume 1. MIT Press, 2016.
L. A. Hannah and D. B. Dunson. Multivariate convex regression with adaptive partitioning. J. Mach. Learn. Res., 14(1):3261–3294, 2013.
L. Huang, X. Liu, B. Lang, A. W. Yu, Y. Wang, and B. Li. Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. arXiv preprint arXiv:1709.06079, 2017.
A. Magnani and S. P. Boyd. Convex piecewise-linear fitting. Optim. Eng., 10(1):1–17, 2009.
N. M. Nasrabadi and R. A. King. Image coding using vector quantization: A review. IEEE Trans. Commun., 36(8):957–971, 1988.
P. Ramachandran, B. Zoph, and Q. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941v2, Oct. 2017.
R. K. Srivastava, J. Masci, F. Gomez, and J. Schmidhuber. Understanding locally competitive networks. arXiv preprint arXiv:1410.1165, 2014.
L. Trottier, P. Giguère, and B. Chaib-draa. Parametric exponential linear unit for deep convolutional neural networks. pp. 207–214. IEEE, 2017.
E. W. Weisstein. CRC Concise Encyclopedia of Mathematics. CRC Press, 2002.

SUPPLEMENTARY MATERIALS
A BACKGROUND
A Deep Network (DN) is an operator $f_\Theta: \mathbb{R}^D \to \mathbb{R}^C$ that maps an input signal $x \in \mathbb{R}^D$ to an output prediction $y \in \mathbb{R}^C$. All current DNs can be written as a composition of $L$ intermediate mappings called layers:

$f_\Theta(x) = \big( f_{\theta^{(L)}}^{(L)} \circ \dots \circ f_{\theta^{(1)}}^{(1)} \big)(x), \quad (13)$

where $\Theta = \{\theta^{(1)}, \dots, \theta^{(L)}\}$ is the collection of the network's parameters from each layer. The DN layer at level $\ell$ is an operator $f_{\theta^{(\ell)}}^{(\ell)}$ that takes as input the vector-valued signal $z^{(\ell-1)}(x) \in \mathbb{R}^{D^{(\ell-1)}}$ and produces the vector-valued output $z^{(\ell)}(x) \in \mathbb{R}^{D^{(\ell)}}$ with $D^{(L)} = C$. The signals $z^{(\ell)}(x)$, $\ell > 0$, are typically called feature maps, and the input is denoted as $z^{(0)}(x) = x$. For concreteness, we will focus here on processing multi-channel images $x$, but our results can be adapted by adjusting the appropriate dimensionalities. We will use two equivalent representations for the signal and feature maps, one based on tensors and one based on flattened vectors. In the tensor representation, $z^{(\ell)}$ contains $C^{(\ell)}$ channels of size $I^{(\ell)} \times J^{(\ell)}$ pixels. In the vector representation, $[z^{(\ell)}(x)]_k$ represents the $k$th dimension of the flattened, vector version of $z^{(\ell)}(x)$. Hence, $D^{(\ell)} = C^{(\ell)} I^{(\ell)} J^{(\ell)}$, $C^{(L)} = C$, $I^{(L)} = 1$, and $J^{(L)} = 1$. For conciseness we will often denote $z^{(\ell)}(x)$ as $z^{(\ell)}$.

When using nonlinearities and pooling that are piecewise affine and convex, the layers and the whole DN fall under the analysis of max-affine spline operators (MASOs) developed in Balestriero & Baraniuk (2018a). In this framework, a max-affine spline operator with parameters $A^{(\ell)} \in \mathbb{R}^{D^{(\ell)} \times R \times D^{(\ell-1)}}$ and $B^{(\ell)} \in \mathbb{R}^{D^{(\ell)} \times R}$ is defined as

$z^{(\ell)} = S\big[A^{(\ell)}, B^{(\ell)}\big](z^{(\ell-1)}) = \begin{bmatrix} \max_{r=1,\dots,R} \langle [A^{(\ell)}]_{1,r,\cdot}, z^{(\ell-1)} \rangle + [B^{(\ell)}]_{1,r} \\ \vdots \\ \max_{r=1,\dots,R} \langle [A^{(\ell)}]_{K,r,\cdot}, z^{(\ell-1)} \rangle + [B^{(\ell)}]_{K,r} \end{bmatrix}. \quad (14)$

Any DN layer made of convex and piecewise affine nonlinearities or pooling can be rewritten exactly as a MASO. Hence, such operators take the place of the layer mappings in (13). We first proceed by modifying (14) to highlight the internal inference problem. We introduce the VQ-matrix $T^{(\ell)} \in \mathbb{R}^{D^{(\ell)} \times R}$, which will be used to make the mapping region-specific, as in

$A^{(\ell)}[T^{(\ell)}] = \begin{bmatrix} \big( \sum_{r=1}^{R} [T^{(\ell)}]_{1,r} [A^{(\ell)}]_{1,r,\cdot} \big)^\top \\ \vdots \\ \big( \sum_{r=1}^{R} [T^{(\ell)}]_{K,r} [A^{(\ell)}]_{K,r,\cdot} \big)^\top \end{bmatrix}, \qquad B^{(\ell)}[T^{(\ell)}] = \begin{bmatrix} \sum_{r=1}^{R} [T^{(\ell)}]_{1,r} [B^{(\ell)}]_{1,r} \\ \vdots \\ \sum_{r=1}^{R} [T^{(\ell)}]_{K,r} [B^{(\ell)}]_{K,r} \end{bmatrix}, \quad (15)$

effectively making $A^{(\ell)}[T^{(\ell)}]$ a matrix of shape $(D^{(\ell)}, D^{(\ell-1)})$ and $B^{(\ell)}[T^{(\ell)}]$ a vector of length $D^{(\ell)}$. Hence the VQ-matrix is used to combine the per-region parameters. In a standard MASO, each row of $T^{(\ell)}$ is a one-hot vector with the one-hot position corresponding to the region in which the input falls. Due to the one-hot encoding present in $T^{(\ell)}$, we refer to this inference as a hard-VQ.

Proposition 8.
For a MASO, the VQ-matrix is denoted $T_H^{(\ell)}$ and is obtained via the internal maximization process of (14). It corresponds to the (hard-)VQ of the input. Once it is computed, the output is a simple affine transform of the input:

$z^{(\ell)} = A^{(\ell)}[T_H^{(\ell)}]\, z^{(\ell-1)} + B^{(\ell)}[T_H^{(\ell)}], \quad (16)$

with $[T_H^{(\ell)}]_{k,r} = 1\big( r = \arg\max_{r'=1,\dots,R} \langle [A^{(\ell)}]_{k,r',\cdot}, z^{(\ell-1)} \rangle + [B^{(\ell)}]_{k,r'} \big)$.

The VQ matrix $T_H^{(\ell)}$ always belongs to the set of all matrices with one one-hot position (from $1$ to $R$) per row, one row for each of the output dimensions $k = 1,\dots,D^{(\ell)}$. We denote this VQ-matrix space as $\mathcal{T}_H^{(\ell)} = \big\{ [a_1, \dots, a_{D^{(\ell)}}]^\top,\ a_k \in \{e_1, \dots, e_R\} \big\}$ with $e_r = \delta_r$, $\dim(e_r) = R$.
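A minimal sketch (ours, with random placeholder parameters) of (15)-(16): given a hard VQ matrix $T$ with one one-hot row per output unit, the layer output is the affine map $A^{(\ell)}[T]\,z + B^{(\ell)}[T]$, which matches the MASO max of (14).

import numpy as np

rng = np.random.default_rng(5)
K, R, D = 4, 3, 6
A = rng.normal(size=(K, R, D))
B = rng.normal(size=(K, R))
z = rng.normal(size=D)

scores = np.einsum('krd,d->kr', A, z) + B
T = np.eye(R)[scores.argmax(axis=1)]            # hard VQ matrix (one-hot rows)

A_T = np.einsum('kr,krd->kd', T, A)             # A[T], shape (K, D), eq. (15)
B_T = np.einsum('kr,kr->k', T, B)               # B[T], shape (K,)
z_out = A_T @ z + B_T                           # eq. (16)
assert np.allclose(z_out, scores.max(axis=1))   # matches the MASO max (14)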
B ORTHOGONAL FILTERS DETAILS
The developed results on orthogonality induce orthogonality directly in the case of fully connected layers. In the case of a convolutional layer, they require orthogonality as well as non-overlapping patches. This is not practical, as it considerably reduces the spatial dimensions, making very deep networks unsuitable. As such, we now propose a brief approximation result. Due to the specific structure of the convolution operator, we are able to provide tractable inference coupled with an apodization scheme. To demonstrate this, we first highlight that any input can be represented as a direct sum of its apodized patches. Then we see that filtering apodized patches with a filter is equivalent to convolving the input with apodized filters.

We first need to introduce the patch notation. We define a patch $P[z^{(\ell-1)}](p_i, p_j)$, $(p_i, p_j) \in \{1,\dots,I^{(\ell)}\} \times \{1,\dots,J^{(\ell)}\}$, as the slice of the input with indices $c = 1,\dots,K^{(\ell)}$ (all channels) and $(i,j) \in \{p_i,\dots,p_i + I_C^{(\ell)}\} \times \{p_j,\dots,p_j + J_C^{(\ell)}\}$; hence a patch starts at position $(p_i, p_j)$ and has the same shape as the filters.

Apodizing a signal corresponds to applying an apodization (or windowing) function Weisstein (2002) $h$ to it via a Hadamard product. Define the apodization function $h: \Omega(I_C^{(\ell)}, J_C^{(\ell)}) \to \mathbb{R}_+$ with $\Omega(I_C^{(\ell)}, J_C^{(\ell)}) = \{1,\dots,I_C^{(\ell)}\} \times \{1,\dots,J_C^{(\ell)}\}$, where we recall that $(I_C^{(\ell)}, J_C^{(\ell)})$ is the spatial shape of the convolution filters. Given a function $h$ such that $\sum_{u \in \Omega(I_C^{(\ell)}, J_C^{(\ell)})} h(u) = 1$, one can represent an input by summing its apodized patches, as in

$[z^{(\ell)}]_{k,i,j} = \sum_{(p_i, p_j) \in \{i - I_C^{(\ell)},\dots,i\} \times \{j - J_C^{(\ell)},\dots,j\}} P[z^{(\ell-1)}](p_i, p_j) \odot h. \quad (17)$

The above highlights the ability to treat an input via its collection of patches, on the condition that the defined apodization function is applied. With the above, we can demonstrate how minimizing the per-patch reconstruction loss leads to minimizing the overall input modeling error:

$0 \le \Big\| \sum_{i,j} \big( h \odot P[z^{(\ell)}](i,j) - [W^{(\ell)}]_{t^{(\ell)}(i,j)} \big) \Big\|^2 \le \sum_{i,j} \big\| h \odot P[z^{(\ell)}](i,j) - [W^{(\ell)}]_{t^{(\ell)}(i,j)} \big\|^2, \quad (18)$

which represents the internal modeling of the factorial model applied across filters and patches. As a result, performing the per-position minimization minimizes an upper bound that ultimately reaches the global minimum, since

$\big\| P[z^{(\ell-1)}](p_i, p_j) - P[\widehat{z}^{(\ell-1)}](p_i, p_j) \big\|^2 \to 0 \ \Rightarrow\ \Big\| z^{(\ell-1)} - \sum_{(p_i, p_j)} P[\widehat{z}^{(\ell-1)}](p_i, p_j) \Big\|^2 = 0. \quad (19)$
C INTERPRETATION: INITIALIZATION AND INPUT SPACE PARTITIONING
The GMM formulation and related inference also allow interpretation of the internal layer parameters. First, we demonstrate how the region prior $\pi^{(\ell)}$ is affected by the layer parameters, especially at initialization. Then we highlight how our result allows us to generalize the input space partitioning results from Balestriero & Baraniuk (2018b;a).

Region Prior.
The region prior of the GMM-MASO model $[\pi^{(\ell)}]_{k,\cdot}$ (recall Theorem 2) depends on the bias and norm of the layer weights as $[\pi^{(\ell)}]_{k,r} \propto e^{[B^{(\ell)}]_{k,r} + \frac{1}{2}\|[A^{(\ell)}]_{k,r,\cdot}\|^2}$. We can study what this region prior looks like at initialization. At initialization, common practice uses $[B^{(\ell)}]_{k,r} = 0$, $\forall k, r$, and $[A^{(\ell)}]_{k,r,d} \sim \mathcal{N}(0, (v^{(\ell)})^2)$. This bias initialization leads to a cluster prior probability proportional to the norm of the weights. For example, the case of absolute value leads to $E\big(\|[A^{(\ell)}]_{k,1,\cdot}\|^2\big) = E\big(\|[A^{(\ell)}]_{k,2,\cdot}\|^2\big)$ and thus a uniform prior $E\big([\pi^{(\ell)}]_{k,\cdot}\big) = (0.5, 0.5)^\top$ for any initialization standard deviation $v^{(\ell)}$. On the other hand, ReLU always has $\|[A^{(\ell)}]_{k,2,\cdot}\|^2 = 0$ and $E\big(\|[A^{(\ell)}]_{k,1,\cdot}\|^2\big) = D^{(\ell)}(v^{(\ell)})^2$. If one uses Xavier initialization Glorot & Bengio (2010), then $D^{(\ell)}(v^{(\ell)})^2 = 1$, and we thus have a nonuniform prior probability $[\pi^{(\ell)}]_{k,\cdot}$. The latter slightly favors the inactive state of the ReLU and thus sparser activations. In general, the smaller $v^{(\ell)}$ is, the more the region prior will favor the inactive state of the ReLU.

Input Space Partitioning.
We now generalize the ability to study the input space partitioning, which was previously limited to the special case $[B^{(\ell)}]_{k,r} = -\frac{1}{2}\|[A^{(\ell)}]_{k,r,\cdot}\|^2$ (recall Proposition 1). Studying the input space partition is crucial, since the MASO property implies that, for each input region, an observation is transformed via a simple affine transformation. However, deriving insights about the actual partition is cumbersome: analytical formulas are impractical, so one has to probe the input space and record the observed VQ for each point to estimate the input space partitioning. We are now able to derive clear links between the MASO partition and standard models, which allows much more efficient computation of the input space partitions.

Corollary 2.
A MASO with arbitrary parameters $[A^{(\ell)}]_{k,r,\cdot}, [B^{(\ell)}]_{k,r}$ has the same input space partitioning as a GMM with the parameters given in Theorem 2.

This augments the previous study of the MASO input space partitioning, which related it only to $K$-means (recall Proposition 1) and required specific bias values.

D DEEP NETWORK TOPOLOGIES AND DATASETS
We first present the topologies used in the experiments. We use the notation ResNetD-W for the standard wide ResNet-based topology with depth D and width W. We thus have the following network architectures for smallCNN and largeCNN:

largeCNN
Conv2DLayer(layers[-1], 96, 3, pad='same')
Conv2DLayer(layers[-1], 96, 3, pad='same')
Conv2DLayer(layers[-1], 96, 3, pad='same', stride=2)
Conv2DLayer(layers[-1], 192, 3, pad='same')
Conv2DLayer(layers[-1], 192, 3, pad='same')
Conv2DLayer(layers[-1], 192, 3, pad='same', stride=2)
Conv2DLayer(layers[-1], 192, 3, pad='valid')
Conv2DLayer(layers[-1], 192, 1)
Conv2DLayer(layers[-1], 10, 1)
GlobalPoolLayer(layers[-1], 2)

Here Conv2DLayer(layers[-1], 192, 3, pad='valid') denotes a standard 2D convolution with 192 filters of spatial size (3, 3) and valid padding (no padding).
E PROOFS
E.1 THEOREM 2
Proof.
With $\sigma^2 = 1$ and the prior of Theorem 2, maximizing the log-probability of the model corresponds to the following chain of equivalent maximizations:

$[\widehat{t}^{(\ell)}]_k = \arg\max_r \; \big\langle [A^{(\ell)}]_{k,r,\cdot},\, z^{(\ell-1)} \big\rangle + [B^{(\ell)}]_{k,r}$

$= \arg\max_r \; \big\langle [A^{(\ell)}]_{k,r,\cdot},\, z^{(\ell-1)} \big\rangle + [B^{(\ell)}]_{k,r} + \tfrac{1}{2}\|[A^{(\ell)}]_{k,r,\cdot}\|^2 - \tfrac{1}{2}\|[A^{(\ell)}]_{k,r,\cdot}\|^2$

$= \arg\max_r \; \big\langle [A^{(\ell)}]_{k,r,\cdot},\, z^{(\ell-1)} \big\rangle + [B^{(\ell)}]_{k,r} + \tfrac{1}{2}\|[A^{(\ell)}]_{k,r,\cdot}\|^2 - \log\Big( \sum_{r'} e^{[B^{(\ell)}]_{k,r'} + \frac{1}{2}\|[A^{(\ell)}]_{k,r',\cdot}\|^2} \Big) - \tfrac{1}{2}\|[A^{(\ell)}]_{k,r,\cdot}\|^2$

(subtracting the log-sum term does not change the $\arg\max$, since it does not depend on $r$)

$= \arg\max_r \; \big\langle [A^{(\ell)}]_{k,r,\cdot},\, z^{(\ell-1)} \big\rangle + \log\Bigg( \frac{e^{[B^{(\ell)}]_{k,r} + \frac{1}{2}\|[A^{(\ell)}]_{k,r,\cdot}\|^2}}{\sum_{r'} e^{[B^{(\ell)}]_{k,r'} + \frac{1}{2}\|[A^{(\ell)}]_{k,r',\cdot}\|^2}} \Bigg) - \tfrac{1}{2}\|[A^{(\ell)}]_{k,r,\cdot}\|^2$

$= \arg\max_r \; \log\big( p(z^{(\ell-1)} \mid r)\, p(r) \big) \quad \text{(up to terms independent of } r\text{)}$

$= \arg\max_r \; p(z^{(\ell-1)} \mid r)\, p(r).$

We also remind the reader that $\arg\max_r p(z^{(\ell-1)} \mid r) p(r) = \arg\max_r \log\big(p(z^{(\ell-1)} \mid r) p(r)\big)$. Based on the above, it is straightforward to derive (5).

E.2 ENTROPY REGULARIZED OPTIMIZATION
Proof.
We are interested in the following optimization problem:

$[t^{(\ell)*}]_k = \arg\max_{q} F(q, \Theta) = \arg\max_{q} E_q\big[\log\big(p(z^{(\ell-1)} \mid [t^{(\ell)}]_k)\, p([t^{(\ell)}]_k)\big)\big] + H\big([t^{(\ell)}]_k\big)$

$= \arg\max_{u \in \Delta_R} \Bigg( \sum_r [u]_r \Big[ -\tfrac{1}{2\sigma^2}\|z^{(\ell-1)} - \mu_r\|^2 + \log(\pi_r) \Big] - \sum_r [u]_r \log([u]_r) \Bigg).$

We now use the KKT conditions and a Lagrange multiplier to optimize the loss function (per $k$) including the equality constraint:

$\mathcal{L}(u, \lambda) = \sum_r [u]_r \Big[ -\tfrac{1}{2\sigma^2}\|z^{(\ell-1)} - \mu_r\|^2 + \log(\pi_r) \Big] - \sum_r [u]_r \log([u]_r) + \lambda \Big( \sum_r [u]_r - 1 \Big).$

Due to strong duality, we can directly optimize the primal and dual problems jointly by setting all the partial derivatives to $0$. Denoting $A_r := -\tfrac{1}{2\sigma^2}\|z^{(\ell-1)} - \mu_r\|^2 + \log(\pi_r)$, we obtain

$\frac{\partial \mathcal{L}}{\partial [u]_p} = A_p - \log([u]_p) - 1 - \lambda \quad \forall p, \qquad \frac{\partial \mathcal{L}}{\partial \lambda} = \sum_r [u]_r - 1.$

Setting the derivatives to $0$ leads to $[u]_p = e^{A_p - 1 - \lambda}$, $\forall p$. Summing over $p$:

$\sum_p [u]_p = \sum_p e^{A_p - 1 - \lambda} = 1 \ \Longrightarrow\ e^{\lambda + 1} = \sum_p e^{A_p} \ \Longrightarrow\ \lambda = \log\Big( \sum_p e^{A_p} \Big) - 1,$

which, plugged back into the above equation, leads to

$[u]_p = e^{A_p - 1 - \lambda} = \frac{e^{A_p}}{\sum_{p'} e^{A_{p'}}}.$

E.3 THEOREM 3

For the proof of Theorem 3, please refer to the proof in E.2, applying the convex combination with coefficients $\beta$.
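A small numerical check (ours) of the closed form derived in E.2: the softmax $[u]_p = e^{A_p}/\sum_{p'} e^{A_{p'}}$ attains the maximum of $\sum_p [u]_p A_p + H(u)$ over the simplex, and the optimal value equals the log-sum-exp of the scores.

import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=5)

def objective(u):
    return (u * A).sum() - (u * np.log(u)).sum()   # linear term plus entropy

u_star = np.exp(A) / np.exp(A).sum()
best = objective(u_star)
for _ in range(1000):
    u = rng.dirichlet(np.ones(5))                  # random point on the simplex
    u = np.clip(u, 1e-12, None); u /= u.sum()      # guard against log(0)
    assert objective(u) <= best + 1e-9
print(np.allclose(best, np.log(np.exp(A).sum())))  # optimum = log-sum-exp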
E.4 THEOREM 4

Proof. The proof of this inference and VQ equality is essentially the same as that of the GMM-MASO result (E.1), with the addition of the following first step. By the orthogonality condition (11), all cross terms $\langle [W^{(\ell)}]_{k,r_k,\cdot}, [W^{(\ell)}]_{k',r_{k'},\cdot} \rangle$ with $k \neq k'$ vanish, so that, for any configuration $r \in \{1,\dots,R^{(\ell)}\}^{D^{(\ell)}}$,

$\Big\| z^{(\ell-1)} - \sum_{k=1}^{D^{(\ell)}} [W^{(\ell)}]_{k,r_k,\cdot} \Big\|^2 = \|z^{(\ell-1)}\|^2 - 2\sum_{k=1}^{D^{(\ell)}} \big\langle [W^{(\ell)}]_{k,r_k,\cdot},\, z^{(\ell-1)} \big\rangle + \sum_{k=1}^{D^{(\ell)}} \big\| [W^{(\ell)}]_{k,r_k,\cdot} \big\|^2.$

Using the same results as before, the joint objective therefore decouples, and we can rewrite the joint optimization as multiple independent per-unit optimization problems.

E.5 PROPOSITIONS 3 AND 4