Neural Bayes: A Generic Parameterization Method for Unsupervised Representation Learning
Devansh Arpit, Huan Wang, Caiming Xiong, Richard Socher, Yoshua Bengio
Abstract
We introduce a parameterization method called Neural Bayes which allows computing statistical quantities that are in general difficult to compute and opens avenues for formulating new objectives for unsupervised representation learning. Specifically, given an observed random variable x and a latent discrete variable z, we can express p(x|z), p(z|x) and p(z) in closed form in terms of a sufficiently expressive function (e.g., a neural network) using our parameterization, without restricting the class of these distributions. To demonstrate its usefulness, we develop two independent use cases for this parameterization:

1. Mutual Information Maximization (MIM): MIM has become a popular means for self-supervised representation learning. Neural Bayes allows us to compute mutual information between observed random variables x and latent discrete random variables z in closed form. We use this for learning image representations and show its usefulness on downstream classification tasks.

2. Disjoint Manifold Labeling: Neural Bayes allows us to formulate an objective which can optimally label samples from disjoint manifolds present in the support of a continuous distribution. This can be seen as a specific form of clustering where each disjoint manifold in the support is a separate cluster. We design clustering tasks that obey this formulation and empirically show that the model optimally labels the disjoint manifolds.

Our code is available at https://github.com/salesforce/NeuralBayes
1. Introduction
Humans have the ability to automatically categorize objects and entities through sub-consciously defined notions of similarity.
2. Neural Bayes
Consider a data distribution p(x) from which we have access to i.i.d. samples x ∈ R^n. We suppose that this marginal distribution is a union of K conditionals, where the k-th density is denoted by p(x|z=k) ∈ R_+ and the corresponding probability mass by p(z=k) ∈ R_+. Here z is a discrete random variable with K states. We now introduce the parameterization that allows us to implicitly factorize any marginal distribution into conditionals as described above. Aside from the technical details, the key idea behind this parameterization is Bayes' rule.
Lemma 1. Let p(x|z=k) and p(z) be any conditional and marginal distribution defined for a continuous random variable x and a discrete random variable z. If E_{x∼p(x)}[L_k(x)] ≠ 0 ∀k ∈ [K], then there exists a non-parametric function L(x): R^n → R_+^K for any given input x ∈ R^n with the property \sum_{k=1}^{K} L_k(x) = 1 ∀x such that,

    p(x|z=k) = \frac{L_k(x)\, p(x)}{E_{x\sim p(x)}[L_k(x)]}, \qquad p(z=k) = E_x[L_k(x)], \qquad p(z=k|x) = L_k(x)    (1)

and this parameterization is consistent.

Thus the function L can be seen as a form of soft categorization of input samples. In practice, we use a neural network with sufficient capacity and a softmax output to realize this function L. We name our parameterization method Neural Bayes and replace L with L_θ to denote the parameters of the network. By imposing different conditions on the structure of z through meaningful objectives, we obtain qualitatively different kinds of factorization of the marginal p(x), and the function L_θ then encodes the posterior for that factorization. In summary, if one formulates any objective that involves the terms p(x|z), p(z) or p(z|x), where x is an observed random variable and z is a discrete latent random variable, then these terms can be substituted with L_k(x)·p(x)/E_x[L_k(x)], E_x[L_k(x)] and L_k(x) respectively.

On an important note, the Neural Bayes parameterization requires using the term E_x[L_k(x)], through which computing gradients is infeasible in general. A general discussion around this can be found in Appendix A. Nonetheless, we show that mini-batch gradients can have good fidelity for one of the objectives we propose using our parameterization. In the next two sections, we explore two different ways of factorizing p(x), resulting in qualitatively different goals of unsupervised representation learning.
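As a concrete illustration of the parameterization, the following minimal PyTorch sketch shows how the quantities in Lemma 1 can be computed from a softmax network. This is not the authors' released implementation; the network `L_net`, its layer sizes, and the batch of random inputs are placeholders chosen only for illustration.

```python
import torch
import torch.nn as nn

K = 10  # number of discrete states of z (illustrative choice)

# L_theta: any sufficiently expressive network with a softmax output, so that
# each output row is non-negative and sums to 1, i.e. sum_k L_k(x) = 1.
L_net = nn.Sequential(
    nn.Linear(784, 500), nn.ReLU(),
    nn.Linear(500, K), nn.Softmax(dim=1),
)

x = torch.randn(256, 784)      # stand-in for a batch of i.i.d. samples from p(x)
L = L_net(x)                   # L[i, k] = L_k(x_i) = p(z = k | x_i)
p_z = L.mean(dim=0)            # Monte-Carlo estimate of p(z = k) = E_x[L_k(x)]
# p(x | z = k) equals L[:, k] * p(x) / p_z[k]; note that p(x) itself never needs
# to be evaluated in the objectives built from this parameterization.
```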
3. Mutual Information Maximization (MIM)
Suppose we want to find a discrete latent representation z (with K states) for the distribution p(x) such that the mutual information MI(x, z) is maximized (Linsker, 1988). Such an encoding z must be very efficient since it has to capture the maximum possible information about the continuous distribution p(x) in just K discrete states. Assuming we can learn such an encoding, we are interested in computing p(z|x) since it tells us the likelihood of x belonging to each discrete state of z, thereby performing a soft categorization which may be useful for downstream tasks. In the proposition below, we show an objective for computing p(z|x) for a discrete latent representation z that maximizes MI(x, z).

Proposition 1 (Neural Bayes-MIM-v1). Let L(x): R^n → R_+^K be a non-parametric function for any given input x ∈ R^n with the property \sum_{k=1}^{K} L_k(x) = 1 ∀x. Consider the following objective,

    L^* = \arg\max_L \; E_x\!\left[ \sum_{k=1}^{K} L_k(x) \log \frac{L_k(x)}{E_x[L_k(x)]} \right]    (2)

Then L^*_k(x) = p(z^* = k | x), where z^* ∈ \arg\max_z MI(x, z).

The proof essentially involves expressing MI in terms of p(z|x) and p(z), which can be substituted using the Neural Bayes parameterization. However, the objective proposed above poses a challenge: it contains the term E_x[L_k(x)], for which computing high-fidelity gradients in a mini-batch setting is problematic (see Appendix A). We can overcome this problem for the MIM objective because gradients through certain terms turn out to be 0, as shown by the following theorem.

Theorem 1 (Gradient Simplification). Denote,

    J(θ) = -E_x\!\left[ \sum_{k=1}^{K} L_{θk}(x) \log \frac{L_{θk}(x)}{E_x[L_{θk}(x)]} \right]    (3)

    \hat{J}(θ) = -E_x\!\left[ \sum_{k=1}^{K} L_{θk}(x) \log \left\langle \frac{L_{θk}(x)}{E_x[L_{θk}(x)]} \right\rangle \right]    (4)

where ⟨·⟩ indicates that gradients are not computed through the argument. Then ∂J(θ)/∂θ = ∂\hat{J}(θ)/∂θ.

The above theorem implies that as long as we plug in a decent estimate of E_x[L_{θk}(x)] into the objective, unbiased gradients can be computed without needing the entire dataset. Note that the objective can be re-written as,

    \min_θ \; -E_x\!\left[ \sum_{k=1}^{K} L_{θk}(x) \log \langle L_{θk}(x) \rangle \right] + \sum_{k=1}^{K} E_x[L_k(x)] \log \langle E_x[L_k(x)] \rangle    (5)

The second term is the negative entropy of the discrete latent representation p(z=k) := E_x[L_k(x)], which acts as a uniform prior. In other words, this term encourages learning a latent code z such that all states of z activate uniformly over the marginal input distribution x. This is an attribute of distributed representations, which is a fundamental goal in deep learning. We can therefore further encourage this behavior by treating the coefficient of this term as a hyper-parameter. In our experiments we confirm both the distributed-representation behavior of this term and the benefit of using a hyper-parameter as its coefficient.
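The stop-gradient operation ⟨·⟩ in Theorem 1 translates directly into code. The following is a minimal PyTorch sketch (not the paper's released code; the function name and the ε constant are ours) of the surrogate objective \hat{J}(θ) from Eq. 4, in which the log argument is detached so that mini-batch gradients coincide with those of the exact objective whenever the plugged-in estimate of E_x[L_k(x)] is accurate.

```python
import torch

def mim_v1_surrogate(L, eps=1e-6):
    """L: (batch, K) softmax outputs of the encoder; rows are L_theta(x_i)."""
    # plug-in mini-batch estimate of E_x[L_k(x)]
    p_z = L.mean(dim=0)
    # detach the whole log argument, mirroring <.> in Eq. 4; gradients then flow
    # only through the L_theta(x) factor multiplying the log.
    log_ratio = (torch.log(L + eps) - torch.log(p_z + eps)).detach()
    return -(L * log_ratio).sum(dim=1).mean()
```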
In practice we found that an alternative formulation of the second term in Eq. 5 results in better performance and more interpretable filters. Specifically, we replace it with the following cross-entropy formulation,

    R_p(θ) := -\sum_{k=1}^{K} \left[ \frac{1}{K} \log\!\big(E_x[L_k(x)]\big) + \frac{K-1}{K} \log\!\big(1 - E_x[L_k(x)]\big) \right]    (6)

While both the second term in Eq. 5 and R_p(θ) are minimized when E_x[L_k(x)] = 1/K, the latter formulation provides much stronger gradients during optimization when E_x[L_k(x)] approaches 1 (see Appendix C.1 for details); E_x[L_k(x)] = 1 is undesirable since it discourages distributed representations. Unbiased gradients can be computed through Eq. 6 as long as a good estimate of E_x[L_k(x)] is plugged in. Also note that the condition E_x[L_k(x)] ∉ {0, 1} of Lemma 1 is met by the Neural Bayes-MIM objective implicitly during optimization, as discussed in the above paragraph in regard to distributed representations.

Implementation: The final Neural Bayes-MIM-v2 objective is,

    \min_θ \; -E_x\!\left[ \sum_{k=1}^{K} L_{θk}(x) \log \langle L_{θk}(x) + ε \rangle \right] + (1 + α) \cdot R_p(θ) + β \cdot R_c    (7)

where α and β are hyper-parameters, R_c is the smoothness regularization introduced in Section 4.2, and ε is a small scalar used to prevent numerical instability. Qualitatively, we find that the regularization R_c prevents filters from memorizing the input samples. Finally, we apply the first two terms in Eq. 7 to all hidden layers of a deep network at different scales (computed by spatial average pooling followed by a Softmax). These two regularizations gave a significant performance boost. Thorough implementation details are provided in Appendix B. For brevity, we refer to our final objective as Neural Bayes-MIM in the rest of the paper.

To compute a good estimate of the gradients, we use the following trick. During optimization, we compute gradients using a sufficiently large mini-batch of size MBS (e.g., 500) that fits in memory (so that the estimate of E_x[L_k(x)] is reasonable), and accumulate these gradients until BS samples are seen (e.g., 2000); the accumulated gradients are averaged before updating the parameters to further reduce estimation error.
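The following PyTorch sketch puts Eq. 6 and Eq. 7 together for a single softmax layer and illustrates the MBS/BS gradient-accumulation schedule. It is an illustrative rendering rather than the released implementation: the function names, the toy network, the random data, and the ε value are ours, and the smoothness term R_c is omitted (see Section 4 and Appendix B for its definition).

```python
import torch

def neural_bayes_mim_loss(L, alpha=2.0, eps=1e-6):
    """Eq. 7 without the R_c term. L: (batch, K) softmax outputs."""
    K = L.shape[1]
    # first term of Eq. 7: the log argument is detached (the <.> operator)
    cond_ent = -(L * torch.log(L.detach() + eps)).sum(dim=1).mean()
    # cross-entropy uniform prior R_p of Eq. 6, pulling E_x[L_k(x)] towards 1/K
    p_z = L.mean(dim=0)
    r_p = -((1.0 / K) * torch.log(p_z + eps)
            + ((K - 1.0) / K) * torch.log(1.0 - p_z + eps)).sum()
    return cond_ent + (1.0 + alpha) * r_p

# toy accumulation loop: gradients from MBS-sized mini-batches are accumulated
# and the parameters are updated only once BS samples have been seen
net = torch.nn.Sequential(torch.nn.Linear(32, 10), torch.nn.Softmax(dim=1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
MBS, BS = 500, 2000
steps_per_update = BS // MBS
opt.zero_grad()
for step in range(1, 2 * steps_per_update + 1):
    x = torch.randn(MBS, 32)                      # stand-in for a real mini-batch
    loss = neural_bayes_mim_loss(net(x)) / steps_per_update
    loss.backward()                               # accumulate gradients
    if step % steps_per_update == 0:
        opt.step()
        opt.zero_grad()
```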
4. Disjoint Manifold Labeling (DML)
A distribution is defined over a support. In many cases, the support may be a set of disjoint manifolds. In this task, our goal is to label samples from each disjoint manifold with a distinct value. This formulation can be seen as a generalization of subspace clustering (Ma et al., 2008), where affine manifolds are considered. To make the problem concrete, we first formalize the definition of a disjoint manifold.
Definition 1 (Connected Set). We say that a set S ⊂ R^n is a connected set (disjoint manifold) if for any x, y ∈ S there exists a continuous path between x and y such that all the points on the path also belong to S.

To identify such disjoint manifolds in a distribution, we exploit the observation that only partitions that separate one disjoint manifold from the others have high divergence between the respective conditional distributions, while partitions that cut through a disjoint manifold result in conditional distributions with low divergence between them. Therefore, the objective we propose for this task is to partition the unlabeled data distribution p(x) into conditional distributions q_i(x) such that a divergence between them is maximized. By doing so we recover the conditional distributions defined over the disjoint manifolds (we prove optimality in Theorem 2). We begin with two disjoint manifolds and extend this idea to multiple disjoint manifolds in Appendix G.

Let J be a symmetric divergence (e.g., Jensen-Shannon divergence, Wasserstein divergence, etc.), and let q_1 and q_2 be the disjoint conditional distributions that we want to learn. Then the aforementioned objective can be written formally as follows:

    \max_{q_1, q_2, \pi \in (0,1)} \; J(q_1(x) \| q_2(x))    (8)
    s.t. \int_x q_1(x) = 1, \quad \int_x q_2(x) = 1, \quad q_1(x) \cdot \pi + q_2(x) \cdot (1 - \pi) = p(x).

Since our goal is simply to assign labels to data samples x according to which manifold they belong to, instead of learning conditional distributions as achieved by Eq. (8), we would like to learn a function L(x) which maps samples from disjoint manifolds to distinct labels. To do so, below we derive an objective equivalent to Eq. (8) that learns such a function L(x).

Proposition 2 (Neural Bayes-DML). Let L(x): R^n → [0, 1] be a non-parametric function for any given input x ∈ R^n, and let J be the Jensen-Shannon divergence. Define scalars f_1(x) := L(x)/E_x[L(x)] and f_2(x) := (1 - L(x))/(1 - E_x[L(x)]). Then the objective in Eq. (8) is equivalent to,

    \max_L \; \frac{1}{2} E_x\!\left[ f_1(x) \log\!\left( \frac{f_1(x)}{f_1(x) + f_2(x)} \right) \right] + \frac{1}{2} E_x\!\left[ f_2(x) \log\!\left( \frac{f_2(x)}{f_1(x) + f_2(x)} \right) \right] + \log 2    (9)
    s.t. \; E_x[L(x)] \notin \{0, 1\}.    (10)

Optimality: We now prove the optimality of the proposed objective towards discovering disjoint manifolds present in the support of a probability density function p(x).

Theorem 2 (Optimality). Let p(x) be a probability density function over R^n whose support is the union of two non-empty connected sets (Definition 1) S_1 and S_2 that are disjoint, i.e. S_1 ∩ S_2 = ∅. Let L(x) ∈ [0, 1] belong to the class of continuous functions and be learned by solving the objective in Eq. (9). Then the objective in Eq. (9) is maximized if and only if one of the following is true:

    L(x) = 1 ∀x ∈ S_1 and L(x) = 0 ∀x ∈ S_2,    or    L(x) = 0 ∀x ∈ S_1 and L(x) = 1 ∀x ∈ S_2.

The above theorem proves that optimizing the derived objective over the space of functions L implicitly partitions the data distribution into maximally separated conditionals by assigning a distinct label to the points in each manifold. Most importantly, the theorem shows that the continuity condition on the function L(x) plays an important role. Without this condition, the network cannot identify disjoint manifolds.

The constraint in Proposition 2 is a boundary condition required for technical reasons in Lemma 1. In practice we do not worry about it because optimization itself avoids situations where E_x[L(x)] ∈ {0, 1}.
To see the reason behind this, note that except when initialized in a way such that E_x[L(x)] ∈ {0, 1}, the log terms are negative by definition. Since the denominators of f_1 and f_2 are E_x[L(x)] and 1 - E_x[L(x)] respectively, the objective is maximized when E_x[L(x)] moves away from 0 and 1. Thus, for any reasonable initialization, optimization itself pushes E_x[L(x)] away from 0 and 1.

Smoothness of L_θ(·): As shown in Theorem 2, the proposed objectives can optimally recover disjoint manifolds only when the function L_θ(·) is continuous. In practice we found that enforcing the function to be smooth (and thus also continuous) helps significantly. Therefore, after experimenting with a handful of heuristics for regularizing L_θ, we found the following finite-difference Jacobian regularization to be effective (L(·) can be scalar- or vector-valued),

    R_c = \frac{1}{B} \sum_{i=1}^{B} \frac{\| L_θ(x_i) - L_θ(x_i + ζ \cdot \hat{δ}_i) \|^2}{ζ^2}    (11)

where \hat{δ}_i := δ_i / \|δ_i\| is a normalized noise vector computed independently for each sample x_i in a batch of size B as,

    δ_i := X v_i.    (12)

Here X ∈ R^{n×B} is the matrix containing the batch of samples, and each dimension of v_i ∈ R^B is sampled i.i.d. from a standard Gaussian. This computation ensures that the perturbation lies in the span of the data, which we found to be important. Finally, ζ is the scale of the normalized noise added to all samples in a batch; since we always normalize the datasets to have zero mean and unit variance across all dimensions, we sample ζ from a zero-mean Gaussian with a small standard deviation.

Implementation: We implement the binary-partition Neural Bayes-DML using a Monte-Carlo sampling approximation of the following objective,

    \min_θ \; \frac{1}{2} E_x\!\left[ f_1(x) \log\!\left( \frac{f_2(x)}{f_1(x)} \right) \right] + \frac{1}{2} E_x\!\left[ f_2(x) \log\!\left( \frac{f_1(x)}{f_2(x)} \right) \right] + β \cdot R_c    (13)

where f_1(x) := L_θ(x)/(E_x[L_θ(x)] + ε) and f_2(x) := (1 - L_θ(x))/(1 - E_x[L_θ(x)] + ε). Here ε is a small scalar used to prevent numerical instability, and β is a hyper-parameter that controls the continuity of L. The multi-partition case can be implemented in a similar way. Due to the need for computing E_x[L_θ(x)] in the objective, optimizing it with gradient descent methods using small batch sizes is not possible. Therefore we experiment with this method on datasets where gradients can be computed for the very large batch size needed to approximate the gradient through E_x[L_θ(x)] sufficiently well.
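Below is a hedged PyTorch sketch of the binary Neural Bayes-DML loss together with the finite-difference smoothness penalty R_c. It minimizes the negative of the Proposition 2 objective (Eq. 9, dropping the constant log 2) rather than reproducing the released code; the function names, the ε constants, and the noise scale 0.5 are our own illustrative choices.

```python
import torch

def dml_loss(L, eps=1e-6):
    """L: (batch,) outputs of L_theta in [0, 1], e.g. from a sigmoid head."""
    p = L.mean()                                   # Monte-Carlo estimate of E_x[L_theta(x)]
    f1 = L / (p + eps)
    f2 = (1.0 - L) / (1.0 - p + eps)
    s = f1 + f2
    # negative of Eq. (9); both log terms are non-positive, so the loss is >= 0
    return -0.5 * (f1 * torch.log(f1 / s + eps)).mean() \
           - 0.5 * (f2 * torch.log(f2 / s + eps)).mean()

def smoothness_reg(L_net, x, zeta_std=0.5):
    """Finite-difference Jacobian penalty R_c (Eq. 11); the perturbation of each
    sample lies in the span of the batch (Eq. 12). zeta_std is an assumed value."""
    B = x.shape[0]
    v = torch.randn(B, B)                          # v_i ~ N(0, I_B)
    delta = v @ x                                  # delta_i = X v_i
    delta = delta / (delta.norm(dim=1, keepdim=True) + 1e-12)
    zeta = zeta_std * torch.randn(1)               # random scale of the perturbation
    diff = (L_net(x) - L_net(x + zeta * delta)).view(B, -1)
    return (diff.pow(2).sum(dim=1) / (zeta.pow(2) + 1e-12)).mean()
```

As noted above, the expectation E_x[L_θ(x)] inside the loss is estimated from the same batch, so this loss is only meaningful with large (or full) batches.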
5. Experiments
Instead of aiming for state-of-the-art results, our goal in this section is to conduct a preliminary (but thorough) set of experiments using Neural Bayes-MIM to understand the behavior of the algorithm and the hyper-parameters involved, and to perform a fair comparison with popular existing methods for self-supervised learning. Therefore, we use the following simple CNN encoder architecture Enc in our experiments: C(200, ·, ·, ·) - P(2, ·, ·, max) - C(500, ·, ·, ·) - C(700, ·, ·, ·) - P(2, ·, ·, max) - C(1000, ·, ·, ·). For a 32×32×3 input image x, the output of this encoder Enc(x) is a spatially downsampled stack of 1000 feature maps. The encoder is initialized using orthogonal initialization (Saxe et al., 2013), batch normalization (Ioffe & Szegedy, 2015) is used after each convolution layer, and ReLU non-linearities are used. All datasets are normalized to have dimension-wise zero mean and unit variance. Early stopping in all experiments is done using the test set (following previous work). We broadly follow the experimental setup of Hjelm et al. (2019). We do not use any data augmentation in our experiments. After training the encoder, we freeze its features and train a 1-hidden-layer (200 units) classifier to get the final test accuracy. Extending the algorithm to more complex architectures (e.g., ResNets (He et al., 2016)), the use of multiple data augmentation techniques, and other advanced regularizations (e.g., see Bachman et al. (2019)) is left as future work.
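For reference, a PyTorch sketch of an encoder in the spirit of the one described above is given here. The kernel size, stride, and padding entries were not recoverable from the text, so the 3×3, stride-1, padding-1 choices below are assumptions rather than the paper's exact values; the filter counts, pooling, batch normalization, and ReLU placement follow the description.

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # assumed 3x3 convolution with stride 1 and padding 1 (not stated in the text)
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
    )

encoder = nn.Sequential(
    conv_block(3, 200),
    nn.MaxPool2d(kernel_size=2),
    conv_block(200, 500),
    conv_block(500, 700),
    nn.MaxPool2d(kernel_size=2),
    conv_block(700, 1000),
)
```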
5.1.1. Ablation Studies

Behavior of Neural Bayes-MIM-v1 (Eq. 5) vs. Neural Bayes-MIM (v2, Eq. 6): The experiments and details are discussed in Appendix C.2. The main differences are: 1. the majority of the filters learned by the v1 objective are dead, as opposed to the v2 objective, which encourages distributed representations; 2. the performance of v2 is better than that of the v1 objective.
Visualization of Filters: We visualize the filters learned by the Neural Bayes-MIM objective on MNIST digits and qualitatively study the effects of the regularizations used. For this we train a deep fully connected network with 3 hidden layers, each of width 500, using Adam with learning rate 0.001, batch size 500, and no weight decay for 50 epochs (other Adam hyper-parameters are kept standard). We train three configurations: 1. α = 0, β = 4; 2. α = 4, β = 0; 3. α = 4, β = 4. The learned filters are shown in Figure 1. We find that the uniform prior regularization (α > 0) prevents dead filters, while the smoothness regularization (β > 0) prevents input memorization.

(Shorthand used for architectures: a) conv layer: C(number of filters, filter size, stride size, padding); b) pooling: P(kernel size, stride, padding, pool mode).)
Figure 1. MNIST filters learned using the Neural Bayes-MIM objective (Eq. 7): (a) α = 0, β = 4; (b) α = 4, β = 0; (c) α = 4, β = 4. A majority of filters are dead when the regularization coefficient α = 0. Filters memorize input samples when the regularization coefficient β = 0. Using both regularization terms results in filters that mainly capture parts of inputs, which are good for distributed representations. See Figure 6 (appendix) for full images.

Performance due to Regularizations and State Scaling: We now evaluate the effects of the various components involved in the Neural Bayes-MIM objective: the coefficients α and β, and applying the objective at different scales of the hidden states. We use the CIFAR-10 dataset for these experiments.

In the first experiment, for each number of scales considered, we vary α and β and record the final performance, thus capturing the variation in performance due to all three components. We consider two scaling configurations: 1. no pooling is applied to the hidden layers; 2. for each hidden layer, we spatially average pool the state using a 2×2 pooling filter with a stride of 2. For the encoder used in our experiments (which has 4 internal hidden layers post ReLU), this gives us 4 and 8 states respectively (including the original un-scaled hidden layers) on which to apply the Neural Bayes-MIM objective. After getting all the states, we apply the Softmax activation to each state along the channel dimension so that the Neural Bayes parameterization holds. For states with spatial height and width, the objective is applied to each spatial (x, y) location separately and averaged. Also, for states with height (or width) less than the pooling size, we use the height (or width) as the pooling size.

We train Neural Bayes-MIM on the full training set for 100 epochs using Adam with learning rate 0.001 (other Adam hyper-parameters are standard), mini-batch size 500, batch size 2000, and no weight decay. In the first 32 experiments, α and β are sampled uniformly at random from fixed ranges.
Figure 2. The effect of Neural Bayes-MIM hyper-parameters on the final test performance for CIFAR-10. A CNN encoder is trained using Neural Bayes-MIM with different configurations of the hyper-parameters α, β and the scaling of states on which the objective is applied. A one-hidden-layer classifier (with 200 units) is then trained using labels on these frozen features to get the final test accuracy. The plots show that performance is significantly worse when α = β = 0 and no scaling is used, showing their important role as regularizers. The black dotted line is the baseline performance when a randomly initialized network (with identical architecture) is used as the encoder. Green and blue dotted lines are the averages of all the green and blue points respectively.

In the next 5 experiments, α is set to 0 while β is sampled uniformly. In the next 5 experiments, β is set to 0 while α is sampled uniformly. Thus in total we run 42 experiments for each number of scalings considered.

Once we get a trained Enc(x), we train a 1-hidden-layer (200 units) MLP classifier on the frozen features from Enc(x) using the labels in the training set. This training is done for 100 epochs using Adam with learning rate 0.001 (other Adam hyper-parameters are standard), batch size 128 and weight decay 0.

As a baseline for these experiments, we use a randomly initialized encoder Enc(x). Since there are no tunable hyper-parameters in this case, we perform a grid search on the classifier hyper-parameters: weight decay (4 values), batch size (2 values) and learning rate (2 values). This yields a total of 16 configurations, and the resulting test accuracy is used as our baseline.

The performance of encoders under the aforementioned configurations is shown in Figure 2. It is clear that both the hyper-parameters α and especially β play an important role in the quality of the representations learned. Also, applying Neural Bayes-MIM at different scales of the network states significantly improves the average and best performance.

Effect of Mini-batch size (MBS) and Batch size (BS): During implementation, we proposed to compute gradients using a reasonably large mini-batch of size MBS and to accumulate gradients until BS samples are seen. This is done to overcome the gradient estimation problem due to the E_x[L_k(x)] term in Neural Bayes-MIM. Here we evaluate the effect of these two hyper-parameters on the final test performance. We choose MBS from {50, 100, 250, 500} and BS from {50, 250, 500, 2000, 3000}. For each combination of MBS and BS, we train the CNN encoder using Neural Bayes-MIM with α = 2 and β = 4 (chosen by examining Figure 2); the rest of the training settings are kept identical to those used for the Figure 2 experiment.

MBS \ BS    50      250     500     2000    3000
50          40.62   42.97   41.41   75      78.91
100         N/A     67.97   66.41   78.12   78.91
250         N/A     76.56   78.91   82.03   84.38
500         N/A     N/A     82.03   78.91   79.69
Table 1.
Effect of mini-batch size (MBS) and batch size (BS) on the final test accuracy of Neural Bayes-MIM on CIFAR-10. Gradients are computed using batches of size MBS and accumulated until BS samples are seen before a parameter update. As expected, a sufficiently large MBS is needed for computing high-fidelity gradients due to the E_x[L_k(x)] term. Gradient accumulation using BS further helps. All models are trained for the same number of epochs.

Table 1 shows the final test accuracy on CIFAR-10 for each combination of the hyper-parameters MBS and BS. We make two observations: 1. using a very small MBS (e.g., 50 and 100) typically results in poor performance (even worse than that of a random encoder), while a larger MBS significantly improves performance; 2. using a larger BS further improves performance in most cases (even when MBS is small).

Accuracy vs. Epochs: Finally, we plot the evolution of accuracy over epochs for all the models learned in the experiments of Figure 2. For Neural Bayes-MIM we use the models with scaling (42 in total), and all 16 models for the random encoder. The convergence plot is shown in Figure 3.
5.1.2. Final Classification Performance
We compare the final test accuracy of Neural Bayes-MIM with 3 baselines: a random encoder (described in the ablation studies), Deep Infomax (Hjelm et al., 2019), and Rotation Prediction based representation learning (Gidaris et al., 2018). We evaluate on the benchmark image datasets CIFAR-10, CIFAR-100 (Krizhevsky, 2009) and STL-10 (Coates et al., 2011).
Figure 3.
Mean (solid line) and standard deviation (error band) of the accuracy evolution during MLP classifier training on CIFAR-10 using the Neural Bayes-MIM encoder (blue) and a random encoder (green). Features learned by Neural Bayes-MIM allow training to start at a much higher accuracy and converge faster compared to a random encoder.
Random Network refers to the use of a randomly initialized network. The experimental details for it are identical to those in our ablation involving a hyper-parameter search over 16 configurations, done for each dataset separately. DIM results are reported from Hjelm et al. (2019). We omit the STL-10 number for DIM because we resize images to a much smaller size in our runs than used in DIM.

Rotation Prediction refers to the algorithm of Gidaris et al. (2018), where the encoder is learned by training it to predict the rotation of unlabeled images. We use the same CNN architecture as in the previous experiments, with a linear classifier added on top, and train it to predict 4 rotation angles: 0, 90, 180 and 270 degrees. We run this pre-training with 8 configurations of hyper-parameters: batch size ∈ {25, 50} (each batch further includes the rotated copies of each sample, making the total batch sizes 100 and 200), two values of weight decay (including 0), and two learning rates. For each run, we then train a 1-hidden-layer (200 units) classifier on top of the frozen features with two learning rates. We report the best performance over all runs. Since Kolesnikov et al. (2019) report that lower layers in CNN architectures trained with rotation prediction have better performance on downstream tasks, we also train classifiers on the 2nd layer and report their performance, which is significantly better.

The following describes the experimental details for Neural Bayes-MIM. We use α = 2 and β = 4 (chosen roughly by examining Figure 2), and MBS = 500, BS = 4000 in all the experiments. Note that these values are not tuned for STL-10 and CIFAR-100. For CIFAR-10 and STL-10 each, we run 4 configurations of Neural Bayes-MIM over two learning rates and two weight decay values (including 0). For each run, we then train a 1-hidden-layer (200 units) classifier on top of the frozen features with two learning rates. We report the best performance over all runs.
Encoder \ Dataset               CIFAR-10   CIFAR-100   STL-10
Random Network                  67.97      42.97       53.91
Rotation Prediction (RP)        33.59      6.25        25.78
DIM (Hjelm et al., 2019)        80.95      49.74       -
Neural Bayes-MIM                ·          ·           ·
RP (2nd layer)                  ·          ·           ·
Neural Bayes-MIM (2nd layer)    ·          ·           ·

Table 2.
Classification performance of a one-hidden-layer MLP classifier trained on frozen features from the listed encoder models and datasets (no data augmentation was used in the experiments we ran). STL-10 was resized to a smaller image size in our runs than in DIM due to memory restrictions. Numbers reported from the DIM paper are their best numbers (omitting STL-10 due to the difference in image size).

Figure 4.
Neural Bayes-DML network predictions on synthetic datasets. The different colors denote the label predicted by the network L_θ(·), thresholded at 0.5. The darker shades denote predictions made at training data points, while the lighter shades denote predictions on the rest of the space.

For CIFAR-100, we take the encoder that produces the best performance on CIFAR-10, train a classifier with the 2 learning rates, and report the best of the 2 runs. Similar to rotation prediction, we also train classifiers on the 2nd layer and report their performance.

Table 2 reports the classification performance of all the methods. We note that all experiments were done with the CNN architecture and without any data augmentation. Neural Bayes-MIM outperforms the baseline methods in general. However, when using 2nd-layer features, rotation prediction (RP) performs better. We hope to further improve the performance of Neural Bayes-MIM with additional regularizations similar to Bachman et al. (2019).

Clustering in general is an ill-posed problem. However, in our problem setup the definition is precise: our goal is to optimally label all the disjoint manifolds present in the support of a distribution. Since this is a unique goal that is not generally considered in the literature, as empirical verification we show qualitative results on 2D synthetic datasets in Figure 4. The top 2 sub-figures have 2 clusters and the bottom 2 have 3 clusters. For all experiments we use a 4-layer MLP with 400 hidden units per layer, batch normalization, ReLU activations, and a Softmax activation in the last layer. In all cases we train using the Adam optimizer with a learning rate of 0.001, a batch size of 400 and no weight decay, and train until convergence. The regularization coefficient β was chosen (from a small range of values) as the one that resulted in optimal clustering. For generality, these 2D datasets were projected to high dimensions (512) by appending 510 dimensions of zero entries to each sample and then applying a random rotation before performing clustering. The datasets were then projected back to the original 2D space for visualizing predictions (a minimal sketch of this transformation is given below). Additional experiments can be found in Appendix H.
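The following short NumPy sketch illustrates the projection step described above: 2D points are zero-padded to 512 dimensions and randomly rotated before clustering, and predictions can be mapped back to the plane for visualization. The toy two-blob dataset and all names are ours; the paper's synthetic datasets differ.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in 2D dataset: two well-separated blobs (i.e., two disjoint "manifolds")
x2d = np.concatenate([rng.normal(-2.0, 0.3, size=(200, 2)),
                      rng.normal(+2.0, 0.3, size=(200, 2))])
# append 510 zero dimensions, then apply a random orthogonal rotation
x512 = np.concatenate([x2d, np.zeros((x2d.shape[0], 510))], axis=1)
Q, _ = np.linalg.qr(rng.normal(size=(512, 512)))   # random orthogonal matrix
x_rot = x512 @ Q                                   # inputs given to the network
# to visualize predictions in the plane: (x_rot @ Q.T)[:, :2] recovers the 2D data
```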
6. Related Work
Neural Bayes-MIM maximizes mutual information for learning useful representations in a self-supervised way. Introduced in Linsker (1988) and Bell & Sejnowski (1995), there are a myriad of self-supervised methods that involve MIM. As discussed in Vincent et al. (2010), auto-encoder based methods achieve this goal implicitly by minimizing the reconstruction error of the input samples under an isotropic Gaussian assumption. Deep Infomax (DIM, Hjelm et al. (2019)) instead uses MINE (Belghazi et al., 2018) to estimate MI and maximize it, while applying it to both local and global features and imposing priors on the learned representation. Hjelm et al. (2019) have also shown that DIM performs better than representations learned by auto-encoder based methods such as VAE (Kingma & Welling, 2013), β-VAE (Higgins et al., 2017) and adversarial auto-encoders (Makhzani et al., 2015), among others such as Noise as Targets (Bojanowski & Joulin, 2017) and BiGAN (Donahue et al., 2016). Contrastive Predictive Coding (Oord et al., 2018) also maximizes MI, by predicting lower-layer representations from higher layers using a contrastive loss instead of a reconstruction loss.

Unlike the aforementioned methods that learn continuous latent representations, Neural Bayes-MIM implicitly learns discrete latent representations. We note that the estimate of mutual information due to the Neural Bayes parameterization in the Neural Bayes-MIM-v1 objective (Eq. 5) turns out to be identical to the one proposed in IMSAT (Hu et al., 2017). However, there are important differences: 1. we provide theoretical justifications for the parameterization used (Lemma 1) and show in Theorem 1 why it is feasible to compute high-fidelity gradients for this objective in the mini-batch setting even though it contains the term E_x[L_k(x)]; in contrast, the justification used in IMSAT is that optimizing with mini-batches is equivalent to optimizing an upper bound of the original objective; 2. while the MI part of IMSAT was introduced in the context of clustering, we improve the MI formulation (Eq. 7) and introduce regularization terms and state scaling, which are important for learning useful representations with the Neural Bayes-MIM objective that perform well on downstream classification tasks; 3. we perform extensive ablation studies exposing the role of the introduced regularizations; 4. the goal of our paper is broader, i.e., to introduce the Neural Bayes parameterization that can be used for formulating new objectives.

From the perspective of learning discrete latent representations, Neural Bayes-MIM has similarities with VQ-VAE (Oord et al., 2017). However, similar to other auto-encoder based methods, VQ-VAE imposes the isotropy assumption in the reconstruction loss.

In many self-supervised methods, the idea is to learn useful representations by predicting non-trivial information about the input. Examples of such methods are Rotation Prediction (Gidaris et al., 2018), Exemplar (Dosovitskiy et al., 2014), Jigsaw (Noroozi & Favaro, 2016) and Relative Patch Location (Doersch et al., 2015). Kolesnikov et al. (2019) have extensively compared these methods and found that Rotation Prediction (RP) in general outperforms or performs on par with the latter methods.
For the aforementioned reasons, we compared Neural Bayes-MIM with RP and DIM.

Numerous recent papers have proposed clustering algorithms for unsupervised representation learning, such as Deep Clustering (Caron et al., 2018), information based clustering (Ji et al., 2019), Spectral Clustering (Shaham et al., 2018), and Associative Deep Clustering (Haeusser et al., 2018). Our goal with regard to clustering in Neural Bayes-DML is in general different from such methods. Our objective is aimed at labeling disjoint manifolds in a distribution. Thus it can be seen as a generalization of traditional subspace clustering methods (Ma et al., 2008; Liu et al., 2010) from affine subspaces to arbitrary manifolds.
7. Conclusion
We proposed a parameterization method that can be used to express an arbitrary set of distributions p(x|z), p(z|x) and p(z) in closed form using a neural network with sufficient capacity, which can in turn be used to formulate new objective functions. We formulated two different objectives that use this parameterization, aimed at different goals of self-supervised learning: learning deep network features using the infomax principle, and identifying disjoint manifolds in the support of continuous distributions. We presented theoretical and empirical analyses of both objectives, focusing especially on the former since it has broader applications.

Acknowledgments
I (DA) was supported by IVADO during my time at MILA and am currently supported by Salesforce. There are many people who have directly or indirectly contributed to this work and we would like to thank them. During the early phase of research on Neural Bayes-DML (in the context of which the Neural Bayes parameterization was developed), Chen Xing pointed out an intuition which led me to simplify its optimization procedure. We thank Ali Madani and Ehsan Hosseini-Asl for exploring Neural Bayes-DML for unsupervised representation learning for images. We thank Min Lin for taking interest in the connection between Neural Bayes-DML and mutual information, which led me to the idea that mutual information can be computed using the parameterization. We thank Aadyot Bhatnagar and Weiran Wang for proof-checking the paper and providing helpful feedback. We thank Devon Hjelm and Alex Fedorov for discussing their algorithm Deep Infomax in great detail. Finally, we thank Aaron Courville, Sharan Vaswani, Nikhil Naik, Isabela Albuquerque, Lav Varshney, Yu Bai, Jonathan Binas, David Krueger, Tegan Maharaj and Govardana Sachithanandam Ramachandran for helpful discussions.
References
Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15509-15519, 2019.

Belghazi, M. I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, R. D. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.

Bell, A. J. and Sejnowski, T. J. An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, 1995.

Bojanowski, P. and Joulin, A. Unsupervised learning by predicting noise. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 517-526. JMLR.org, 2017.

Bornstein, M. H. and Arterberry, M. E. The development of object categorization in young children: Hierarchical inclusiveness, age, perceptual attribute, and group versus individual analyses. Developmental Psychology, 46(2):350, 2010.

Caron, M., Bojanowski, P., Joulin, A., and Douze, M. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132-149, 2018.

Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215-223, 2011.

Doersch, C., Gupta, A., and Efros, A. A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422-1430, 2015.

Donahue, J., Krähenbühl, P., and Darrell, T. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., and Brox, T. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 766-774, 2014.

Gidaris, S., Singh, P., and Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.

Gopnik, A. and Meltzoff, A. The development of categorization in the second year and its relation to other cognitive and linguistic developments. Child Development, pp. 1523-1531, 1987.

Haeusser, P., Plapp, J., Golkov, V., Aljalbout, E., and Cremers, D. Associative deep clustering: Training a classification network with no labels. In German Conference on Pattern Recognition, pp. 18-32. Springer, 2018.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations (ICLR), 2017.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations (ICLR), 2019.

Hu, W., Miyato, T., Tokui, S., Matsumoto, E., and Sugiyama, M. Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1558-1567. JMLR.org, 2017.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Ji, X., Henriques, J. F., and Vedaldi, A. Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9865-9874, 2019.

Johnston, K. E., Bittinger, K., Smith, A., and Madole, K. L. Developmental changes in infants' and toddlers' attention to gender categories. Merrill-Palmer Quarterly (1982-), pp. 563-584, 2001.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kolesnikov, A., Zhai, X., and Beyer, L. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1920-1929, 2019.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.

Linsker, R. Self-organization in a perceptual network. Computer, 21(3):105-117, 1988.

Liu, G., Lin, Z., and Yu, Y. Robust subspace segmentation by low-rank representation. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 663-670, 2010.

Ma, Y., Yang, A. Y., Derksen, H., and Fossum, R. Estimation of subspace arrangements with applications in modeling and segmenting mixed data. SIAM Review, 50(3):413-458, 2008.

Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

Noroozi, M. and Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69-84. Springer, 2016.

Oord, A. v. d., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017.

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Rakison, D. H. and Yermolayeva, Y. Infant categorization. Wiley Interdisciplinary Reviews: Cognitive Science, 1(6):894-905, 2010.

Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Shaham, U., Stanton, K., Li, H., Nadler, B., Basri, R., and Kluger, Y. SpectralNet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587, 2018.

Smith, L. B. and Samuelson, L. An attentional learning account of the shape bias: Reply to Cimpian and Markman (2005) and Booth, Waxman, and Huang (2005). Developmental Psychology, 42(6):1339-1343, 2006.

Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371-3408, 2010.

Younger, B. A. and Fearing, D. D. Parsing items into separate categories: Developmental change in infant categorization. Child Development, 70(2):291-303, 1999.
Appendix

A. Gradient Computation Problem for the E_x[L_θ(x)] Term

The Neural Bayes parameterization contains the term E_x[L_θ(x)]. Computing an unbiased gradient through this term is in general difficult without the use of very large batch sizes, even though the quantity E_x[L_θ(x)] itself may have a good estimate using very few samples. For instance, consider the scalar function ψ(t) = 1 + 0.01 sin(ωt).
Consider the scenario when ω → ∞. The quantity E[ψ(t)] can be estimated very accurately using even one example. Further, E[ψ(t)] = 1, hence ∂E[ψ(t)]/∂t = 0. However, when using a finite number of samples, the approximation of ∂E[ψ(t)]/∂t can have very high variance due to improper cancelling of the gradient terms from individual samples.

In the case of Neural Bayes-MIM we found that gradients through terms involving E_x[L_θ(x)] were 0. This allows us to estimate gradients for this objective reliably in the mini-batch setting. But in general it may be challenging to do so, and solving objectives using the Neural Bayes parameterization may require a customized work-around for each objective.

B. Implementation Details of the Neural Bayes-MIM Objective
We apply the Neural Bayes-MIM objective (Eq. 7) to all the hidden layers at different scales (using average pooling). We now discuss its implementation details. Consider the CNN architecture used in our experiments: C(200, ·, ·, ·) - P(2, ·, ·, max) - C(500, ·, ·, ·) - C(700, ·, ·, ·) - P(2, ·, ·, max) - C(1000, ·, ·, ·). Let h_i (i ∈ {0, 1, 2, 3}) denote the 4 hidden-layer ReLU outputs after the 4 convolution layers. All these hidden states have height and width dimensions in addition to the channel dimension. For a mini-batch B, these hidden states are therefore 4-dimensional tensors. Let the 4 dimensions of the i-th state be denoted by |B| × C_i × H_i × W_i, where the dimensions denote batch size, number of channels, height and width. Denote by S the Softmax function applied along the channel dimension, and by P the average pooling operation P(2, ·, ·, avg). Further, denote h_i := P(h_{i-4}) (i ∈ {4, 5, 6, 7}) as the scaled versions of the original states computed by average pooling, and define the numbers C_i, H_i and W_i accordingly. Then the total Neural Bayes-MIM objective for this architecture is given by,

    \min_θ \; -\frac{1}{|B|} \sum_{x \in B} \sum_{i=0}^{7} \frac{1}{H_i W_i} \sum_{h,w=1}^{H_i, W_i} \sum_{k=1}^{C_i} S(h^i_{k,h,w}(x)) \log \langle S(h^i_{k,h,w}(x)) + ε \rangle + (1 + α) \cdot R_p(θ) + β \cdot R_c    (14)

where,

    R_p(θ) := -\sum_{i=0}^{7} \frac{1}{H_i W_i} \sum_{h,w=1}^{H_i, W_i} \sum_{k=1}^{C_i} \left[ \frac{1}{C_i} \log\!\left( \frac{1}{|B|} \sum_{x \in B} S(h^i_{k,h,w}(x)) \right) + \frac{C_i - 1}{C_i} \log\!\left( 1 - \frac{1}{|B|} \sum_{x \in B} S(h^i_{k,h,w}(x)) \right) \right]    (15)

and,

    R_c = \frac{1}{|B|} \sum_{x \in B} \frac{\| P(h_k(x)) - P(h_k(x + ζ \cdot \hat{δ})) \|^2}{ζ^2}    (16)

where \hat{δ} := δ / \|δ\| is a normalized noise vector computed independently for each sample x in the batch B as,

    δ := X v.    (17)

Here X ∈ R^{n×|B|} is the matrix containing the batch of samples, and each dimension of v ∈ R^{|B|} is sampled i.i.d. from a standard Gaussian. This computation ensures that the perturbation lies in the span of the data. Finally, ζ is the scale of the normalized noise added to all samples in a batch; since we always normalize the datasets to have zero mean and unit variance across all dimensions, we sample ζ from a zero-mean Gaussian with a small standard deviation. Note that for the architecture used, P(h_k(x)) results in an output with height and width equal to 1, hence the output is effectively a 2D matrix of size |B| × C_k. Finally, the gradient from this mini-batch is accumulated and averaged over multiple batches before updating the parameters for a more accurate estimate of the gradients.
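The multi-scale application of the first two terms can be sketched in PyTorch as follows. This is an illustration under our own naming rather than the released code: `neural_bayes_mim_loss` refers to the per-batch loss sketched in Section 3, each hidden state is assumed to be at least 2×2 spatially, and the pooling arguments are assumptions.

```python
import torch
import torch.nn.functional as F

def mim_per_location(s, alpha=2.0):
    """s: (B, C, H, W) channel-wise Softmax of a hidden state. The MIM terms are
    applied at every spatial location separately and averaged over locations."""
    B, C, H, W = s.shape
    s = s.permute(2, 3, 0, 1).reshape(H * W, B, C)
    # neural_bayes_mim_loss is the single-layer sketch from Section 3
    return torch.stack([neural_bayes_mim_loss(s_hw, alpha=alpha) for s_hw in s]).mean()

def mim_multiscale(hidden_states, alpha=2.0):
    """hidden_states: list of post-ReLU (B, C, H, W) tensors from the encoder.
    Each state is used twice: as-is and after 2x2 average pooling (state scaling)."""
    loss = 0.0
    for h in hidden_states:
        for state in (h, F.avg_pool2d(h, kernel_size=2, stride=2)):
            loss = loss + mim_per_location(F.softmax(state, dim=1), alpha=alpha)
    return loss
```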
C. Additional Analysis of Neural Bayes-MIM

C.1. Gradient Strength of the Uniform Prior in Neural Bayes-MIM-v1 (Eq. 5) vs. Neural Bayes-MIM-v2 (Eq. 6)
As discussed in the main text, the term

    R_p^{v1}(θ) := \sum_{k=1}^{K} E_x[L_k(x)] \log E_x[L_k(x)]    (18)

acts as a uniform prior encouraging the representations to be distributed. However, gradients are much stronger when E_x[L_k(x)] approaches 1 for the alternative cross-entropy formulation,

    R_p^{v2}(θ) := -\sum_{k=1}^{K} \left[ \frac{1}{K} \log E_x[L_k(x)] + \frac{K-1}{K} \log(1 - E_x[L_k(x)]) \right]    (19)

To see this, note that the gradient of R_p^{v1}(θ) is given by,

    \frac{∂R_p^{v1}(θ)}{∂θ} = \sum_{k=1}^{K} \frac{∂E_x[L_{θk}(x)]}{∂θ} \log E_x[L_{θk}(x)] + \sum_{k=1}^{K} \frac{∂E_x[L_{θk}(x)]}{∂θ}    (20)
    = \sum_{k=1}^{K} \frac{∂E_x[L_{θk}(x)]}{∂θ} \log E_x[L_{θk}(x)] + \frac{∂E_x[\sum_{k=1}^{K} L_{θk}(x)]}{∂θ}    (21)
    = \sum_{k=1}^{K} \frac{∂E_x[L_{θk}(x)]}{∂θ} \log E_x[L_{θk}(x)]    (22)

where the last equality holds due to the linearity of expectation and because \sum_{k=1}^{K} L_{θk}(x) = 1 by design. On the other hand, the gradient of R_p^{v2}(θ) is given by,

    \frac{∂R_p^{v2}(θ)}{∂θ} = -\sum_{k=1}^{K} \frac{1}{K} \left( \frac{1}{E_x[L_k(x)]} - \frac{K-1}{1 - E_x[L_k(x)]} \right) \frac{∂E_x[L_{θk}(x)]}{∂θ}    (23)

When the representation being learned is such that the marginal p(z) peaks along a single state k, i.e., E_x[L_k(x)] → 1 (making the representation degenerate), the gradient of the k-th term for v1 is given by,

    \frac{∂E_x[L_{θk}(x)]}{∂θ} \log E_x[L_{θk}(x)] ≈ 0    (24)

while that for v2 behaves like,

    -\frac{1}{K} \left( \frac{1}{E_x[L_k(x)]} - \frac{K-1}{1 - E_x[L_k(x)]} \right) \frac{∂E_x[L_{θk}(x)]}{∂θ} ≈ \lim_{c → 0^+} \frac{1}{c} \cdot \frac{∂E_x[L_{θk}(x)]}{∂θ}    (25)

whose magnitude approaches infinity as E_x[L_k(x)] → 1. Thus R_p^{v2}(θ) is beneficial in terms of gradient strength.

C.2. Empirical Comparison between Neural Bayes-MIM-v1 (Eq. 5) and Neural Bayes-MIM-v2 (Eq. 6)
To empirically understand the difference in behavior between the v1 and v2 versions of the Neural Bayes-MIM objective, we first plot the filters learned by the v1 objective and compare them with those learned by the v2 objective. The filters learned by the v1 objective are shown in Figure 5, using the configuration α = 4, β = 4. It can be seen that most filters are dead. We tried other configurations as well without any change in the outcome. Since the v1 and v2 objectives differ only in the formulation of the uniform prior regularization, as explained in the previous section, we believe that v1 leads to dead filters because of the weak gradients from its regularization term.

In the second set of experiments, we train many models using the Neural Bayes-MIM-v1 and Neural Bayes-MIM-v2 objectives separately with different hyper-parameter configurations, similar to the setting of Figure 2. The performance scatter plot is shown in Figure 7. We find that Neural Bayes-MIM-v2 has better average and best performance compared with Neural Bayes-MIM-v1.
Figure 5. MNIST filters learned using the Neural Bayes-MIM-v1 objective (Eq. 5) with the configuration α = 4, β = 4. The majority of filters are dead. For comparison with filters from Neural Bayes-MIM-v2, see Figure 6 (bottom).
Figure 6. MNIST filters learned using the Neural Bayes-MIM objective (Eq. 7): (a) α = 0, β = 4; (b) α = 4, β = 0; (c) α = 4, β = 4. The majority of filters are dead when the regularization coefficient α = 0. Filters memorize input samples when the regularization coefficient β = 0. Using both regularization terms results in filters that mainly capture parts of inputs, which are good for distributed representations.
Figure 7. Performance of Neural Bayes-MIM-v1 vs. Neural Bayes-MIM-v2. A CNN encoder is trained using Neural Bayes-MIM with different configurations of the hyper-parameters α, β and the scaling of states on which the objective is applied. A one-hidden-layer classifier (with 200 units) is then trained using labels on these frozen features to get the final test accuracy. Green and blue dotted lines are the averages of all the green and blue points respectively. Neural Bayes-MIM-v2 has better average and best performance compared with Neural Bayes-MIM-v1.

D. Proof of Lemma 1
Lemma 1
Let p(x|z=k) and p(z) be any conditional and marginal distribution defined for a continuous random variable x and a discrete random variable z. If E_{x∼p(x)}[L_k(x)] ≠ 0 ∀k ∈ [K], then there exists a non-parametric function L(x): R^n → R_+^K for any given input x ∈ R^n with the property \sum_{k=1}^{K} L_k(x) = 1 ∀x such that,

    p(x|z=k) = \frac{L_k(x)\, p(x)}{E_{x\sim p(x)}[L_k(x)]}, \qquad p(z=k) = E_x[L_k(x)], \qquad p(z=k|x) = L_k(x)    (26)

and this parameterization is consistent.

Proof: First we show existence. Notice that there exists a non-parametric function g_k(x) := p(x|z=k)/p(x) ∀x ∈ supp(p(x)). Denote G_k(x) = p(z=k) g_k(x). Then,

    E_x[G_k(x)] = E_x[p(z=k) g_k(x)] = p(z=k)    (27)

and,

    \frac{G_k(x)}{E_x[G_k(x)]} = \frac{p(z=k) g_k(x)}{p(z=k)} = \frac{p(x|z=k)}{p(x)}    (28)

Thus L_k := G_k works. To verify that this parameterization is consistent, note that for any k,

    \int_x p(x|z=k) = \int_x \frac{L_k(x)\, p(x)}{E_{x\sim p(x)}[L_k(x)]} = 1    (29)

where we use the condition E_{x∼p(x)}[L_k(x)] ≠ 0 ∀k ∈ [K]. Secondly, we note that,

    \sum_{k=1}^{K} p(x|z=k) \cdot p(z=k) = \sum_{k=1}^{K} \frac{L_k(x)\, p(x)}{E_x[L_k(x)]} \cdot E_x[L_k(x)]    (30)
    = \sum_{k=1}^{K} L_k(x)\, p(x)    (31)
    = p(x)    (32)

where the last equality is due to the condition \sum_{k=1}^{K} L_k(x) = 1 ∀x. Thirdly,

    \sum_{k=1}^{K} p(z=k) = \sum_{k=1}^{K} E_x[L_k(x)]    (33)
    = E_x\!\left[ \sum_{k=1}^{K} L_k(x) \right]    (34)
    = 1

Finally, we have from Bayes' rule:

    p(z=k|x) = \frac{p(x|z=k) \cdot p(z=k)}{p(x)}    (35)
    = \frac{L_k(x)\, p(x)}{E_{x\sim p(x)}[L_k(x)]} \cdot \frac{E_{x\sim p(x)}[L_k(x)]}{p(x)}    (36)
    = L_k(x)    (37)

where the second equality holds because of the existence and consistency proofs of p(x|z=k) := L_k(x) p(x)/E_{x∼p(x)}[L_k(x)] and p(z=k) := E_x[L_k(x)] shown above. □

E. Proofs for Neural Bayes-MIM
Proposition 1 (Neural Bayes-MIM-v1, Proposition 1 in the main text). Let L(x): R^n → R_+^K be a non-parametric function for any given input x ∈ R^n with the property \sum_{k=1}^{K} L_k(x) = 1 ∀x. Consider the following objective,

    L^* = \arg\max_L \; E_x\!\left[ \sum_{k=1}^{K} L_k(x) \log \frac{L_k(x)}{E_x[L_k(x)]} \right]    (38)

Then L^*_k(x) = p(z^* = k|x), where z^* ∈ \arg\max_z MI(x, z).

Proof: Using the Neural Bayes parameterization in Lemma 1, we have,
    MI(x, z) = \int_x \sum_{k=1}^{K} p(x, z=k) \log \frac{p(x, z=k)}{p(x)\, p(z=k)}    (39)
    = \int_x \sum_{k=1}^{K} p(z=k|x)\, p(x) \log \frac{p(z=k|x)}{p(z=k)}    (40)
    = \int_x \sum_{k=1}^{K} L_k(x)\, p(x) \log \frac{L_k(x)}{E_x[L_k(x)]}    (41)
    = E_{x\sim p(x)}\!\left[ \sum_{k=1}^{K} L_k(x) \log \frac{L_k(x)}{E_x[L_k(x)]} \right]    (42)

Therefore the two objectives are equivalent and we have a closed-form estimate of the mutual information. Given that z^* is a maximizer of MI(x, z), since L is a non-parametric function, there exists an L^* such that p(z^* = k|x) = L^*_k(x) due to Lemma 1. □

Theorem 1 (Theorem 1 in the main text). Denote,

    J(θ) = -E_x\!\left[ \sum_{k=1}^{K} L_{θk}(x) \log \frac{L_{θk}(x)}{E_x[L_{θk}(x)]} \right]    (43)

    \hat{J}(θ) = -E_x\!\left[ \sum_{k=1}^{K} L_{θk}(x) \log \left\langle \frac{L_{θk}(x)}{E_x[L_{θk}(x)]} \right\rangle \right]    (44)

where ⟨·⟩ denotes that gradients are not computed through the argument. Then ∂J(θ)/∂θ = ∂\hat{J}(θ)/∂θ.

Proof: We note that,

    J(θ) = -E_x\!\left[ \sum_{k=1}^{K} L_{θk}(x) \log \frac{L_{θk}(x)}{E_x[L_{θk}(x)]} \right]    (45)
    = -E_x\!\left[ \sum_{k=1}^{K} L_{θk}(x) \log L_{θk}(x) \right] + E_x\!\left[ \sum_{k=1}^{K} L_{θk}(x) \log E_x[L_{θk}(x)] \right]    (46)

Denote the first term by T_1 and the second term by T_2. Then, due to the chain rule,

    \frac{∂T_1}{∂θ} = -E_x\!\left[ \sum_{k=1}^{K} \frac{∂L_{θk}(x)}{∂θ} \log L_{θk}(x) \right] - E_x\!\left[ \sum_{k=1}^{K} \frac{L_{θk}(x)}{L_{θk}(x)} \cdot \frac{∂L_{θk}(x)}{∂θ} \right]    (47)
    = -E_x\!\left[ \sum_{k=1}^{K} \frac{∂L_{θk}(x)}{∂θ} \log L_{θk}(x) \right] - E_x\!\left[ \sum_{k=1}^{K} \frac{∂L_{θk}(x)}{∂θ} \right]    (48)
    = -E_x\!\left[ \sum_{k=1}^{K} \frac{∂L_{θk}(x)}{∂θ} \log L_{θk}(x) \right] - E_x\!\left[ \frac{∂ \sum_{k=1}^{K} L_{θk}(x)}{∂θ} \right]    (49)
    = -E_x\!\left[ \sum_{k=1}^{K} \frac{∂L_{θk}(x)}{∂θ} \log L_{θk}(x) \right]    (50)

where the last equality holds because \sum_{k=1}^{K} L_{θk}(x) = 1 by design. Similarly, due to the chain rule,

    \frac{∂T_2}{∂θ} = E_x\!\left[ \sum_{k=1}^{K} \frac{∂L_{θk}(x)}{∂θ} \log E_x[L_{θk}(x)] \right] + E_x\!\left[ \sum_{k=1}^{K} \frac{L_{θk}(x)}{E_x[L_{θk}(x)]} \cdot \frac{∂E_x[L_{θk}(x)]}{∂θ} \right]    (51)
    = E_x\!\left[ \sum_{k=1}^{K} \frac{∂L_{θk}(x)}{∂θ} \log E_x[L_{θk}(x)] \right] + \sum_{k=1}^{K} \frac{E_x[L_{θk}(x)]}{E_x[L_{θk}(x)]} \cdot \frac{∂E_x[L_{θk}(x)]}{∂θ}    (52)
    = E_x\!\left[ \sum_{k=1}^{K} \frac{∂L_{θk}(x)}{∂θ} \log E_x[L_{θk}(x)] \right] + \sum_{k=1}^{K} E_x\!\left[ \frac{∂L_{θk}(x)}{∂θ} \right]    (53)
    = E_x\!\left[ \sum_{k=1}^{K} \frac{∂L_{θk}(x)}{∂θ} \log E_x[L_{θk}(x)] \right] + E_x\!\left[ \frac{∂ \sum_{k=1}^{K} L_{θk}(x)}{∂θ} \right]    (54)
    = E_x\!\left[ \sum_{k=1}^{K} \frac{∂L_{θk}(x)}{∂θ} \log E_x[L_{θk}(x)] \right]    (55)

where once again the last equality holds due to the linearity of expectation and because \sum_{k=1}^{K} L_{θk}(x) = 1 by design. Thus the gradient of J is given by,

    \frac{∂J(θ)}{∂θ} = -E_x\!\left[ \sum_{k=1}^{K} \frac{∂L_{θk}(x)}{∂θ} \log \frac{L_{θk}(x)}{E_x[L_{θk}(x)]} \right]    (56)

which is the same as ∂\hat{J}(θ)/∂θ. This concludes the proof. □

F. Proofs for Neural Bayes-DML (Binary Case)
Proposition 2 (Neural Bayes-DML, Proposition 2 in the main text). Let L(x): R^n → [0, 1] be a non-parametric function for any given input x ∈ R^n, and let J be the Jensen-Shannon divergence. Define scalars f_1(x) := L(x)/E_x[L(x)] and f_2(x) := (1 - L(x))/(1 - E_x[L(x)]). Then the objective in Eq. (8) is equivalent to,

    \max_L \; \frac{1}{2} E_x\!\left[ f_1(x) \log\!\left( \frac{f_1(x)}{f_1(x) + f_2(x)} \right) \right] + \frac{1}{2} E_x\!\left[ f_2(x) \log\!\left( \frac{f_2(x)}{f_1(x) + f_2(x)} \right) \right] + \log 2    (57)
    s.t. \; E_x[L(x)] \notin \{0, 1\}    (58)

Proof: Using the Neural Bayes parameterization from Lemma 1 for the binary case, we set,

    q_1(x) := \frac{L(x)\, p(x)}{E_{x\sim p(x)}[L(x)]}, \qquad q_2(x) := \frac{(1 - L(x))\, p(x)}{1 - E_{x\sim p(x)}[L(x)]}    (59)

These parameterizations therefore automatically satisfy the constraints in Eq. (8). Finally, using the definition of the JS divergence, the maximization problem in Eq. (8) can be written as,

    \max_{q_1, q_2} \; \frac{1}{2} \int_x q_1(x) \log \frac{q_1(x)}{0.5 \cdot (q_1(x) + q_2(x))} + q_2(x) \log \frac{q_2(x)}{0.5 \cdot (q_1(x) + q_2(x))}    (60)

Substituting q_1 and q_2 with their respective parameterizations and using the definitions of f_1(x) and f_2(x) completes the proof. □

Theorem 2 (Optimality, Theorem 2 in the main text). Let p(x) be a probability density function over R^n whose support is the union of two non-empty connected sets (Definition 1) S_1 and S_2 that are disjoint, i.e. S_1 ∩ S_2 = ∅. Let L(x) ∈ [0, 1] belong to the class of continuous functions and be learned by solving the objective in Eq. (9). Then the objective in Eq. (9) is maximized if and only if one of the following is true:

    L(x) = 1 ∀x ∈ S_1 and L(x) = 0 ∀x ∈ S_2,    or    L(x) = 0 ∀x ∈ S_1 and L(x) = 1 ∀x ∈ S_2    (61)

Proof: The two cases exist in the theorem due to symmetry. Recall the definitions of f_1(x) and f_2(x) in Eq. (9),

    f_1(x) := \frac{L(x)}{E_x[L(x)]}, \qquad f_2(x) := \frac{1 - L(x)}{1 - E_x[L(x)]}    (62)

where L(x) ∈ [0, 1] for a feasible L(x), and therefore π := E_x[L(x)] ∈ (0, 1) due to the conditions specified in this theorem. Thus f_1(x) ∈ [0, 1/π] and f_2(x) ∈ [0, 1/(1-π)]. By design, the terms log(f_1(x)/(f_1(x) + f_2(x))) and log(f_2(x)/(f_1(x) + f_2(x))) are non-positive. Thus, for any x ∈ S_1 ∪ S_2,

    F(x) = f_1(x) \log\!\left( \frac{f_1(x)}{f_1(x) + f_2(x)} \right) + f_2(x) \log\!\left( \frac{f_2(x)}{f_1(x) + f_2(x)} \right)    (63)

is maximized only when L(x) = 0 or L(x) = 1, leading to F(x) = 0. Therefore, the objective in Eq. (9) is maximized by setting L(x) = 0 or L(x) = 1 ∀x ∈ S_1 ∪ S_2.

Finally, since L(x) is a continuous function, there do not exist x_1, x_2 ∈ S_1 such that L(x_1) = 0 and L(x_2) = 1. We prove this by contradiction. Suppose there exists a pair (x_1, x_2) of this kind. Then along any path connecting x_1 and x_2 within S_1, there must exist a point where L(x) is not continuous, since L(x) = 0 or L(x) = 1 ∀x ∈ S_1 ∪ S_2 to satisfy the maximization condition. This is a contradiction. By symmetry, the same argument holds for x_1, x_2 ∈ S_2. Therefore one of the two cases mentioned in the theorem must be the optimal solution for L(x) in Eq. (9). Thus we have proved the claim. □

G. Neural Bayes-DML: Extension to Multiple Partitions
G. Neural Bayes-DML: Extension to Multiple Partitions

In order to extend our proposal to multiple partitions (say \(K\)), the idea is to find a conditional distribution \(q_k\) (\(k \in [K]\)) corresponding to each of the \(K\) partitions such that the divergence between the conditional distribution of every partition and the conditional distribution of the combined remaining partitions is maximized. Specifically, we propose the following primary objective,
\begin{align}
\max_{q_k,\; \pi_k \neq 0,\; \forall k \in [K]} \quad & \frac{1}{K} \sum_{k=1}^{K} J\big(q_k(x) \,\|\, \bar{q}_k(x)\big) \tag{64}\\
\text{s.t.} \quad & \int_x q_k(x) = 1 \quad \forall k \in [K] \tag{65}\\
& \sum_{k=1}^{K} q_k(x) \cdot \pi_k = p(x) \tag{66}\\
& \sum_{k=1}^{K} \pi_k = 1 \tag{67}
\end{align}
where \(\bar{q}_k(x)\) is the conditional distribution corresponding to the full data distribution excluding the partition defined by \(q_k(x)\). Formally,
\[
\bar{q}_k(x) := \frac{p(x) - q_k(x) \cdot \pi_k}{1 - \pi_k} \tag{68}
\]
The theorem below then shows an equivalent way of solving the above objective.
Theorem 3 Let \(L(x) : \mathbb{R}^n \rightarrow \mathbb{R}_+^K\) be a non-parametric function for any given input \(x \in \mathbb{R}^n\) with the property \(\sum_{k=1}^{K} L_k(x) = 1\;\forall x\), and let \(J\) be the Jensen-Shannon divergence. Define the scalars \(f_k(x) := \frac{L_k(x)}{\mathbb{E}_x[L_k(x)]}\) and \(\bar{f}_k(x) := \frac{1 - L_k(x)}{1 - \mathbb{E}_x[L_k(x)]}\). Then the objective in Eq. (64) is equivalent to,
\begin{align}
\max_{L_k,\; \forall k \in [K]} \quad & \frac{1}{2K} \cdot \mathbb{E}_x\!\left[ \sum_{k=1}^{K} f_k(x) \cdot \log\!\left( \frac{f_k(x)}{f_k(x) + \bar{f}_k(x)} \right) + \bar{f}_k(x) \cdot \log\!\left( \frac{\bar{f}_k(x)}{\bar{f}_k(x) + f_k(x)} \right) \right] + \log 2 \tag{69}\\
\text{s.t.} \quad & \mathbb{E}_x[L_k(x)] = \pi_k
\end{align}
Here \(L_k(x)\) denotes the \(k\)-th unit of \(L(x)\).

Proof: Similar to the proof of Proposition 2, the main idea is to parameterize \(q_k\) and \(\bar{q}_k\) as follows,
\[
q_k(x) := \frac{L_k(x) \cdot p(x)}{\mathbb{E}_{x \sim p(x)}[L_k(x)]}, \qquad \bar{q}_k(x) := \frac{(1 - L_k(x)) \cdot p(x)}{1 - \mathbb{E}_{x \sim p(x)}[L_k(x)]} \tag{70}
\]
To verify that these parameterizations are valid, note that,
\[
\int_x q_k(x) = \int_x \frac{L_k(x) \cdot p(x)}{\mathbb{E}_{x \sim p(x)}[L_k(x)]} = 1 \tag{71}
\]
Similarly, \(\int_x \bar{q}_k(x) = 1\). To verify that the second constraint is satisfied, we use the above parameterization, substitute \(\mathbb{E}_x[L_k(x)] = \pi_k\), and get,
\begin{align}
\sum_{k=1}^{K} \frac{L_k(x) \cdot p(x)}{\pi_k} \cdot \pi_k &= p(x) \cdot \left( \sum_{k=1}^{K} L_k(x) \right) \tag{72}\\
&= p(x) \tag{73}
\end{align}
where the last equality uses the definition of \(L(x)\). Also notice that each \(\pi_k \in [0, 1]\), and thus \(\mathbb{E}_x[L_k(x)] = \pi_k\) is feasible for any arbitrary distribution \(q_k(x)\) when \(L_k(x) \geq 0\).

Finally, using the proposed parameterization we have,
\begin{align}
\bar{q}_k(x) &= \frac{p(x) - q_k(x) \cdot \pi_k}{1 - \pi_k} \tag{74}\\
&= p(x) \cdot \frac{1 - \frac{L_k(x)}{\mathbb{E}_x[L_k(x)]} \cdot \pi_k}{1 - \pi_k} \tag{75}\\
&= p(x) \cdot \frac{1 - L_k(x)}{1 - \mathbb{E}_x[L_k(x)]} \tag{76}\\
&= \bar{f}_k(x) \cdot p(x) \tag{77}
\end{align}
where we have used the fact that \(\mathbb{E}_x[L_k(x)] = \pi_k\). Using the definition of the JS divergence, the maximization problem in Eq. (64) can be written as,
\[
\max_{L_k,\; \forall k \in [K]} \; \frac{1}{2K} \sum_{k=1}^{K} \int_x q_k(x) \cdot \log\!\left( \frac{q_k(x)}{0.5 \cdot (q_k(x) + \bar{q}_k(x))} \right) + \bar{q}_k(x) \cdot \log\!\left( \frac{\bar{q}_k(x)}{0.5 \cdot (\bar{q}_k(x) + q_k(x))} \right) \tag{78}
\]
Substituting \(q_k\) and \(\bar{q}_k\) with their respective parameterizations and using the definitions of \(f_k(x)\) and \(\bar{f}_k(x)\) completes the proof. □

In terms of implementation, we propose to simply have \(K\) output units in the label-generating network \(L_\theta\) while sharing the rest of the network. Also, we use a Softmax activation at the output layer to satisfy the properties of \(L\) specified in the above theorem.
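Following this implementation note, the multi-partition objective in Eq. (69) can be computed directly from the K softmax outputs. The sketch below is ours, under the same assumptions as the earlier snippets: `probs` are mini-batch softmax outputs \(L_\theta(x)\), batch means serve as plug-in estimates of \(\pi_k = \mathbb{E}_x[L_k(x)]\), and the smoothness regularization is left out.

```python
import math
import torch

def neural_bayes_dml_loss(probs: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative of the K-partition objective in Eq. (69), estimated on a mini-batch.

    probs: softmax outputs L_theta(x) of shape (batch, K), rows summing to 1.
    """
    pi = probs.mean(dim=0, keepdim=True).clamp(eps, 1.0 - eps)   # plug-in estimate of pi_k, shape (1, K)
    f = probs / pi                                               # f_k(x)
    f_bar = (1.0 - probs) / (1.0 - pi)                           # bar f_k(x)
    denom = f + f_bar + eps
    per_unit = f * torch.log((f + eps) / denom) + f_bar * torch.log((f_bar + eps) / denom)
    num_partitions = probs.shape[1]
    return -(per_unit.sum(dim=1).mean() / (2.0 * num_partitions) + math.log(2.0))
```

For K = 2 the two softmax units are complementary, so this expression coincides with the binary objective of Proposition 2.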
H. Additional Neural Bayes-DML Experiments

We run an experiment on MNIST. We randomly split the training set into a training set and a validation set. In this experiment, we train a CNN with the following architecture: C(100, ., ., .) − P(2, ., ., max) − C(100, ., ., .) − C(200, ., ., .) − P(2, ., ., max) − C(500, ., ., .) − P(., ., ., avg) − FC(10). Here P(., ., ., avg) denotes that the entire spatial field is average-pooled down to a 1 × 1 height-width, and FC(10) denotes a fully connected layer with output dimension 10. Finally, Softmax is applied at the output and the network is trained using the Neural Bayes-DML objective. We optimize the objective using Adam with learning rate 0.001, batch size 5000, and no weight decay for 100 epochs (other Adam hyper-parameters are kept standard). We use β = 1 for the smoothness regularization coefficient. Once this network is trained, we train a linear classifier on top of its 10-dimensional output using Adam with an identical configuration, except that a batch size of 128 is used. We early-stop on the validation set of MNIST and report the test accuracy of that model. The classifier reaches . test accuracy. This experiment shows that the MNIST classes lie on nearly disjoint manifolds and that Neural Bayes-DML can correctly label them. As a baseline, a linear classifier trained on features from a randomly initialized CNN with the identical architecture reaches .97% test accuracy.
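For reference, the linear-probe evaluation described above can be sketched as follows. This is our own illustrative code, not the released implementation: the encoder, data loader, and the early-stopping logic are placeholders, and the frozen Neural Bayes-DML network is assumed to provide the 10-dimensional features.

```python
import torch
from torch import nn

def train_linear_probe(encoder: nn.Module, train_loader, feat_dim: int = 10,
                       num_classes: int = 10, epochs: int = 100, lr: float = 1e-3,
                       device: str = "cuda") -> nn.Linear:
    """Train a linear classifier on frozen Neural Bayes-DML features.

    train_loader yields (images, labels) mini-batches (e.g. batch size 128);
    early stopping on a validation set is omitted here for brevity.
    """
    encoder.to(device).eval()
    probe = nn.Linear(feat_dim, num_classes).to(device)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                features = encoder(images)          # frozen 10-dimensional softmax output
            loss = criterion(probe(features), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return probe
```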