The Variational InfoMax Learning Objective
Vincenzo Crescimanna
Department of Computer Science, University of Stirling, Stirling, UK
Bruce Graham
Department of Computer Science, University of Stirling, Stirling, UK
Abstract
Bayesian inference and the Information Bottleneck are the two most popular objectives for neural networks, but they can be optimised only via a variational lower bound: the Variational Information Bottleneck (VIB). In this manuscript we show that the two objectives are actually equivalent to the InfoMax: maximise the information between the data and the labels. The InfoMax representation of the two objectives is relevant not only per se, since it helps to understand the role of the network capacity, but also because it allows us to derive a variational objective, the Variational InfoMax (VIM), that maximises them directly without resorting to any lower bound. The theoretical improvement of VIM over VIB is highlighted by the computational experiments, where the model trained by VIM improves over the VIB model in three different tasks: accuracy, robustness to noise and representation quality.
Introduction

Deep neural networks are a flexible family of models that easily scale to millions of parameters and data points. Due to the large number of parameters involved, training such models while avoiding overfitting is not easy. Indeed, it is well known that minimising the naive accuracy term, or in general any metric that is a distance between the predicted and the real labels, is not a good objective. In particular, as observed in (Zhang et al., 2016), really powerful networks (e.g. convolutional nets) can be trained with success on random labels. The latter scenario means that the network is no longer learning a description (representation) of the data with the associated labels, but a function from the weights to the labels (Achille and Soatto, 2018a).

In light of this empirical observation, many heuristic regulariser techniques were proposed to bound the information conveyed in the weights: from the classic L1/L2 weight regularisation, bounding the norm of the weights, to more recent ones such as Dropout (Srivastava et al., 2014) and batch normalization (Ioffe and Szegedy, 2015), bounding the entropy of the weights. Such heuristic approaches are not easy to interpret, and their hyper-parameter tuning is often not trivial.

A solution to the interpretability issue is to consider a Bayesian description, and read the neural network as a model describing the distribution associating the data to the labels. Under this perspective it is possible to relate the Dropout technique to the Bayesian inference problem (Kingma et al., 2015) and to describe the neural net as an information channel (Alemi et al., 2016; Achille and Soatto, 2018b). The latter two descriptions provide two regularised objectives: Variational Dropout (VD) (Kingma et al., 2015), aiming to learn the optimal weights, and the Variational Information Bottleneck (VIB) (Alemi et al., 2016), aiming to learn the optimal representation of the data. Although the two tasks, optimal weights and optimal representations, are intuitively related, and VD is a special case of VIB, as observed in (Alemi et al., 2016) the objective optimised to learn the optimal weights is not the one optimised to learn an optimal representation, and vice versa.

In this manuscript we try to address the reasons for such counter-intuitive behaviour by studying the network from an information theory perspective. In particular we consider a third definition of the optimal network: the one maximising the mutual information between the data and its labels, the InfoMax (IM) principle. The IM description has a twofold relevance, theoretical and computational. From the theoretical side it allows us to identify the objective regulariser as a network capacity constraint, and in particular to prove that the optimal network is learning both optimal weights and optimal representations, i.e. VD and VIB should have the same optimum. But its main advantage is computational, since the IM can be optimised directly via a variational network, the same one used for VD and VIB, which optimise a lower bound of the same principle. The theoretical advantages of the introduced objective are confirmed by the experimental results, where the model trained by optimising the Variational InfoMax (VIM) performs better than VIB in three different tasks: accuracy, network robustness and representation quality.
Given a dataset D, containing a set of N observations of tuples (x, y), samples of the random variables (X, Y) ∼ p(X, Y), the goal is to learn a model with parameter θ ∼ p(θ) of the conditional probability p(y|x, θ), such that for any x ∼ p(x), p(y, x|θ) = p(y|x, θ)p(x) coincides with the real p(y, x); i.e., find a model p(y|·, θ) such that for any distance D the following objective is optimised:

min_θ D(p(y, x), p(y, x|θ)).   (1)

The naive idea, minimising the negative log-likelihood

min_θ (1/N) Σ_{i=1}^{N} −log p(y_i|x_i, θ),   (2)

leads to a model prone to overfit. Indeed, minimising the negative log-likelihood is equivalent to minimising the Kullback-Leibler divergence D_KL(p(y|x)||p(y|x, θ)) = E_{p(y|x)}[log p(y|x) − log p(y|θ, x)], which can be optimised by a distribution p(y|θ) that does not depend on the input x. The latter phenomenon, where the information about the labels comes only from the weights, is undesirable, and coincides with complete overfitting.
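The identity behind this observation is that the expected negative log-likelihood is the entropy of the true conditional plus the KL divergence to the model; a toy numerical check (the distributions are arbitrary illustrative values, not from the paper):

import numpy as np

# True conditional p(y|x) for one fixed x, and a model p(y|x, theta):
# toy illustrative values, not from the paper.
p = np.array([0.7, 0.3])
q = np.array([0.6, 0.4])

cross_entropy = -(p * np.log(q)).sum()   # E_{p(y|x)}[-log p(y|x, theta)]
entropy = -(p * np.log(p)).sum()         # H(p(y|x)), independent of theta
kl = (p * np.log(p / q)).sum()           # D_KL(p(y|x) || p(y|x, theta))

# cross-entropy = entropy + KL, so minimising the NLL minimises the KL term
assert np.isclose(cross_entropy, entropy + kl)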
Of the many regulariser techniques proposed, the most popular is Dropout. Mimicking the biological behaviour of real neural networks, it was proposed in (Srivastava et al., 2014) to train an artificial neural network using only some of its units, i.e. to drop out some units during training according to a distribution p(ξ) ∼ B(p). As observed in (Baldi and Vershynin, 2018), the dropout technique is a way to restrict the space of distributions p(y|x, θ) that the network can learn, i.e. the network capacity.

The original formulation with Bernoulli noise is not stable and not easy to train, so a relevant improvement was provided in (Wang and Manning, 2013), where it was observed that introducing a multiplicative Gaussian noise p(ξ) ∼ N(1, α), with α = (1 − p)/p and p the retain probability of the corresponding Bernoulli dropout, behaves like the Bernoulli one, with the advantage of more robust and faster training. Moreover, as observed in (Kingma et al., 2015), the introduction of the Gaussian noise allows the noise to be moved from the units to the weights.

For the sake of clarity, we describe the phenomenon in the case where the network is a single layer with linear activation; the generalisation to deep networks follows naturally. Let us suppose V is the weight matrix to learn and A and B, respectively, the input and the output layers. Then in the Gaussian dropout case we have that

B = (A · ξ)V,  ξ ∼ N(1, α),   (3)

which is equivalent, by the associative property of the elementwise multiplication, to B = A·Ṽ, with Ṽ = (ṽ_{i,j}) and ṽ_{i,j} = ξ_i v_{i,j}.

By this description, the network p(y|·, θ) can be read as a composition of two distributions: the regression p(y|·, W), a function of the last layer W, and the weight inference q(W|·, φ), described by the rest of the network weights φ and the noise ξ, i.e. (W, φ) = θ. Thanks to this description we can read the network trained by minimising the negative log-likelihood with Gaussian dropout as optimising the objective

E_{q(W|D,φ)}[−log p(y|x, W)].   (4)

Observing that (4) is a loose lower bound of the unfeasible to compute KL-divergence

−D_KL(p(y, x|θ)||p(y, x)),   (5)

(Kingma et al., 2015) provided an approximation of the KL divergence D_KL(q(W|D, φ)||p(W)) and then proposed to optimise the Variational Inference (VI) objective

E_{q(W|D,φ)}[−log p(y|x, W)] + D_KL(q(W|D, φ)||p(W)),   (6)

which is a tight approximation of the term (5), with p(W), the prior of the regression weights, supposed known.

In the section above we observed that the continuous noise can be moved from the latent units to the weights. Let us now leave the noise in the latent units. In this setting the network p(y|·, θ) is the composition of two sub-nets: the decoder p(y|z, W) and the encoder q(z|·, φ), i.e. p(y|x, θ) = p(y|z, θ)q(z|x, φ) for any x, where the random variable Z is defined according to (3) as Z = A · ξ. In light of this observation the VI objective (6) can be rewritten as

E_{q(z|x,φ)}[−log p(y|z, W)] + E_{q(z|x,φ)}[D_KL(q(z|x, φ)||p(z))].   (7)

In this way we have moved our attention from the weights θ, a huge number of parameters difficult to interpret, to the easier to describe latent variable Z.
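Returning to equation (3), the equivalence between multiplicative noise on the units and noise folded into the weights can be checked numerically; a toy sketch for a single input row (shapes and noise level are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1, 5))    # a single input row (toy size, assumed)
V = rng.normal(size=(5, 3))    # the weight matrix to learn
alpha = 0.25                   # variance of the multiplicative noise
xi = rng.normal(1.0, np.sqrt(alpha), size=A.shape)   # xi ~ N(1, alpha)

B_unit_noise = (A * xi) @ V    # noise applied to the units, as in (3)
V_tilde = xi.T * V             # the same noise folded into the weights
B_weight_noise = A @ V_tilde

assert np.allclose(B_unit_noise, B_weight_noise)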
According to (Tishby et al., 2000) it is possible to define an optimal network X → Z → Y as the one learning a representation Z that is a minimal sufficient statistic of X for Y, i.e. a description of the input data containing only the information necessary to distinguish the elements of one class from another. Formally, the minimal sufficient representation is the random variable Z optimising the following objective:

min_φ I(Z; X|φ)  s.t.  I(Y; Z|W) = I(Y; X|θ),   (8)

where the conditional mutual information I(B; A|W), defined as I(B; A|W) = H(B|W) − H(B|A, W), is a measure of the information conveyed by A to B in a channel defined by weights W, with the conditional entropy H(B|W) = E_{p(b,w)}[−log p(b|w)] denoting a measure of the information lost by B about W.

The objective in (8) is intractable, but it is possible to optimise a lower bound of its Lagrangian form:

max_{θ,φ} I(Y; Z|W) − β I(Z; X|φ).   (9)

Indeed, observing that

• H(Y|W) = H(Y) is constant,
• H(Y|Z, W) ≤ E_{q(z|x,φ)}[−log p(y|z, W)],
• I(Z; X, φ) ≤ E_{q(z|x,φ)}[D_KL(q(z|x, φ)||p(z))],

the following objective, the Variational Information Bottleneck (VIB),

E_{q(z|x,φ)}[−log p(y|z, W)] + β E_{q(z|x,φ)}[D_KL(q(z|x, φ)||p(z))],   (10)

is a variational lower bound of the original IB in (8). Let us observe that, since I(Z; X|φ) ≤ I(Z; X, φ), optimising (10) is equivalent to optimising a lower bound of (9). The VIB model in (10), which is a generalisation of the VD (7), was independently derived in (Achille and Soatto, 2018b) and (Alemi et al., 2017), where it was observed to be an outperforming regulariser, leading to robust learning (in agreement with the Bayes theory) and optimal representation quality (in agreement with the IB theory); but, as observed in (Alemi et al., 2016), the Lagrange hyper-parameter β chosen to maximise the accuracy is not the same one used to learn robust weights and a good quality representation. We suppose that this issue arises from the fact that VD (7) and VIB (10) are optimising a lower bound of their respective objectives, and that the choice of the prior p(z) is arbitrary and often equal to the easy to compute unit variance Normal distribution.
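For concreteness, the VIB objective (10) can be sketched with a diagonal Gaussian encoder and the reparameterisation trick; a minimal PyTorch sketch, where the encoder and decoder interfaces are assumptions, not the authors' code:

import torch
import torch.nn.functional as F

def vib_loss(encoder, decoder, x, y, beta):
    # encoder(x) is assumed to return the mean and log-variance of the
    # diagonal Gaussian q(z|x, phi); decoder(z) returns class logits.
    mu, logvar = encoder(x)
    # reparameterisation: z = mu + sigma * eps, eps ~ N(0, I)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    nll = F.cross_entropy(decoder(z), y)          # E_q[-log p(y|z, W)]
    # closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior
    kl = 0.5 * (logvar.exp() + mu.pow(2) - 1.0 - logvar).sum(1).mean()
    return nll + beta * kl                        # objective (10)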
In the previous section we described two different definitions of the optimal network: optimal Bayesian inference (5) and minimal sufficient representation Z (8). In this section we provide an information theoretic description of the first principle and we show that it is equivalent to the second one.

The InfoMax

The mutual information between the variables X and Y is a constant of the system, and it is defined as I(X; Y) := D_KL(p(X, Y)||p(X)p(Y)). By the properties of the KL divergence, the mutual information can be decomposed as follows:

I(X; Y) = D_KL(p(X, Y)||p(X, Y|θ)) + D_KL(p(X, Y|θ)||p(X|θ)p(Y|θ)) + D_KL(p(X|θ)p(Y|θ)||p(X)p(Y)),   (11)

where the third term is trivially zero for any θ, and the second term is the conditional mutual information I(Y; X|θ). Noting that the first term is the inference objective (5) to minimise, the latter problem can be rewritten as the following InfoMax objective

max_θ I(Y; X|θ),   (12)

with optimum value θ* satisfying the following equality: I(Y; X, θ*) = I(Y; X|θ*) = I(Y; X).

By the highlighted equivalence between the InfoMax and Bayesian inference, it is possible to show the equivalence between the Bayesian inference (5) and the Information Bottleneck (8). In order to prove such an assertion it is enough to show that the optimal solution learnt by (12) is minimal and sufficient.
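Before stating the result, it is worth writing out the chain rule of mutual information that both the equality above and the proof below rely on (a restatement of relations already in the text, not an additional assumption):

\[
  I(Y; X, \theta) = I(Y; X \mid \theta) + I(Y; \theta),
  \qquad
  I(Y; X, \theta^{*}) = I(Y; X \mid \theta^{*}) = I(Y; X)
  \;\Rightarrow\; I(Y; \theta^{*}) = 0,
\]

i.e. at the InfoMax optimum the weights themselves carry no information about the labels.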
Proposition. The network with parameters θ* optimising the InfoMax objective (12), i.e. I(Y; X|θ*) = I(Y; X), is learning, in the hidden layer, a minimal sufficient representation Z of the input X for the variable Y.

Proof. Let us observe that a model p(y|x, θ) optimises (12) if I(Y; θ) = 0; indeed I(Y; X, θ) = I(Y; X|θ) + I(Y; θ). Then, in order to prove the proposition, it is enough to show that the parameter optimising (8) satisfies I(Y; θ) = 0.

A representation Z of X is sufficient for Y if there exists a function φ such that Z = φ(X) and

p(y|x) = p(y|φ(x), W) for any x,   (13)

or equivalently, see (Cover and Thomas, 2012) section 2.9, if it satisfies the following equality:

I(Y; X) = I(Y; Z|W).   (14)

By the deterministic property of φ, I(Y; φ) = 0 for any sufficient statistic. Then it remains to show that only for the minimal sufficient statistic Z does it hold that I(Y; W) = 0.

A sufficient statistic Z is minimal if the encoding information I(Z; X) is minimal, or equivalently, since Z = φ(X) and then H(Z|X) = 0 for any φ, if the entropy H(Z) = H(φ(X)) is minimal. Since by (13) we have that H(Y|X) = H(Y|Z, W) + H(Z), we obtain that a minimal sufficient representation is associated to a maximal H(Y|Z, W), or equivalently to a minimal I(Y; Z, W). But, by (14), and remembering that I(Y; Z, W) = I(Y; Z|W) + I(Y; W), we have that only for a minimal sufficient representation I(Y; W) = 0. Q.E.D.

Thanks to this proposition we showed that the similarity between the variational objectives (6) and (10) is not a coincidence, but comes from the equivalence of the two theoretical objectives from which they were derived. Moreover, we showed that both the Bayesian inference (5) and the IB (8) problems are equivalent to the IM (12). Such a relationship allows us to derive an alternative variational objective that optimises the IM directly, without resorting to any lower bound approximation, and moreover to highlight the role of the network capacity and why it should be bounded.

The direct optimisation of the InfoMax objective (12) is unfeasible: it is necessary to rewrite it. Let us start by observing that the feasible to optimise negative log-likelihood, E[−log p(y|x, θ)], is equivalent to optimising the MI I(Y; X, θ), an upper bound of the desired conditional information I(Y; X|θ). Then it is useful to rewrite the IM (12) in terms of I(Y; X, θ):

max_θ I(Y; X, θ),  s.t.  I(Y; X, θ) ≤ I(Y; X).   (15)

In this new form we are asserting that the network capacity C(θ) = sup_θ I(Y; X, θ), the maximum value that the mutual information can reach, has to be equal to the visible mutual information I(Y; X). Indeed, without such a bound the information can achieve the value H(X) + H(θ), which is the scenario of pure overfitting. The capacity, as a function of the weights, is in general unfeasible to compute, but given the observation made above on the relationship between weights and representation, in the following we try to write the capacity in terms of the representation.

In a network of the type X → Z → Y, by the Data Processing Inequality the MI I(Z; X|φ) is an upper bound of I(Y; X|θ).
Then C(θ) ≤ I(Z; X|φ); moreover, by equation (8), we have that for an optimal parameter θ* = (W*, φ*) the optimal capacity I(Y; X) coincides with the encoding information I(Z; X, φ*) = H(Z), where the latter equality follows from the sufficiency property of Z. Then the InfoMax objective (12) can be written as follows:

max_{W,φ} I(Y; φ(X), W),  s.t.  H(φ(X)) = I(X; Y).   (16)

The alternative formulation (16) no longer depends on the parameter θ; everything is defined in terms of the sub-networks, and this highlights the relationship between the network capacity and the entropy of the latent layer, underlining that the choice of the prior is fundamental in order to have proper learning. Indeed, if a prior p(z) with high variance is prone to over-fit, a prior with small variance will under-fit.

The prior of Z

In the analysis above we have seen that the choice of the prior is fundamental. This is in principle a real issue, since the possible distributions are infinite. For this reason, before deriving the variational objective optimising the IM, we recall that in almost any case it is possible to restrict our attention to a standard Gaussian distribution. Such an observation is the classic principle on which the Normalising Flow technique (Rezende and Mohamed, 2015) is based. The proof is divided into two steps: in the first it is shown that there exists an invertible function g for which the objective is unchanged, since I(Y; g(Z)) = I(Y; Z) and H(g(Z)) = H(Z), see (Cover and Thomas, 2012) chapter 2. The second step follows by the Inverse Function Theorem, where, as observed in (Kingma et al., 2016), locally almost any function can be approximated by an invertible function.

Given these observations, we can assume without loss of generality that the latent variable is distributed according to p(z) ∼ N(0, σ²I), such that H(Z) = I(X; Y). In this way the IM can be rewritten as

max_{W,φ} I(Y; φ(X), W),  s.t.  q(z|φ) ∼ N(0, σ²I),   (17)

an objective that depends only on the variance of the prior and not on its shape.
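Since the differential entropy of N(0, σ²I_K) is H(Z) = (K/2) log(2πeσ²), the constraint H(Z) = I(X; Y) in principle pins down σ. A hypothetical helper illustrating this (the paper instead treats σ as a hyper-parameter, and the target value below is purely illustrative):

import numpy as np

def sigma_for_capacity(target_nats, K):
    # Differential entropy of N(0, sigma^2 I_K): H = (K/2) log(2*pi*e*sigma^2).
    # Solving H = target_nats for sigma gives the prior scale matching the
    # capacity constraint H(Z) = I(X; Y) in (16)-(17).
    return np.sqrt(np.exp(2.0 * target_nats / K) / (2.0 * np.pi * np.e))

# hypothetical target of 10 nats with a K = 256 dimensional latent space
print(sigma_for_capacity(10.0, 256))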
The variational objective

The advantage of the alternative representation of the IM (17) is that it can be optimised via the following variational method:

max_{φ,W} E_{q(z|x,φ)}[log p(y|z, W)] − β D(q(z|φ)||p(z)),  s.t.  p(z) ∼ N(0, σ²I),   (18)

a Lagrangian relaxed form of the intractable variational objective

max_{φ,W} E_{q(z|x,φ)}[log p(y|z, W)],  s.t.  q(z|φ) ∼ N(0, σ²I).

D is any function measuring the distance between two distributions, e.g. the KL-divergence, and β is the Lagrangian multiplier associated to the chosen divergence.

Experiments

In this section we compare the behaviour of the same stochastic neural network trained by optimising respectively the VIB and VIM objectives. The section is divided into two parts. In the first, considering the same setting analysed in (Alemi et al., 2016), MNIST data and a fully-connected network, we show that the network trained with VIM outperforms the one trained with VIB, and that the optimal-accuracy VIM model is the most robust to noise and has the best quality representation. This is in agreement with the theory section, where a maximally informative (maximal accuracy) model is the one learning the minimal sufficient representation (good quality representation) and minimising the Bayesian inference problem (robustness to noise). In the second part we consider a more challenging setting, CIFAR10 data and a convolutional network, to describe the role of the two hyper-parameters: the variance of the prior σ² and the Lagrange multiplier β. We observe that the choice of σ is relevant for both variational objectives and should not be neglected.

In all the experiments we consider as metric D in (18) the Maximum Mean Discrepancy (MMD), an approximation of the KL divergence (Zhao et al., 2017), defined as

MMD(q(z)||p(z)) = sup_{f : ||f||_{H_k} ≤ 1} E_{p(z)}[f(Z)] − E_{q(z)}[f(Z)],

where H_k is the Reproducing Kernel Hilbert Space associated to the positive definite kernel k(z, z′) = K/(K + ‖z − z′‖²) ≥ 0, with K the dimension of the latent space, i.e. z ∈ R^K.

The first setting that we consider to evaluate VIM is the one already considered by (Alemi et al., 2016) and (Pereyra et al., 2017), where it is observed that the VIB objective outperforms the three most popular heuristic methods: Dropout, Label Smoothing (Szegedy et al., 2015) and Confidence Penalty (Pereyra et al., 2017). Consistently with (Alemi et al., 2016), we consider a network whose encoder is an MLP with fully connected layers ending in a layer of dimension K, with ReLU activations, where K is the dimension of the representation space, and whose decoder is a logistic regression with Softmax activation, i.e. p(y = c|z, W) = exp(y_c)/Σ_{c′} exp(y_{c′}), where y = (y_c)_{c=1}^{10} = Wz + b. Since the goal of this manuscript is not to provide state of the art performance, nor to assert that VIM is the best regulariser in any setting, but simply to observe that VIM is a tighter approximation of the IB objective than VIB, in all the experiments we consider the same (network) hyper-parameters used in (Alemi et al., 2016), and we use the Adam optimiser (Kingma and Ba, 2014).
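As a reference for the divergence D used in all the experiments, the MMD with the kernel defined above can be estimated from samples; a minimal PyTorch sketch (a plug-in estimator rather than the supremum form, assumed, not the authors' code):

import torch

def imq_kernel(a, b):
    # k(z, z') = K / (K + ||z - z'||^2), with K the latent dimension,
    # as defined in the text.
    K = a.shape[1]
    return K / (K + torch.cdist(a, b).pow(2))

def mmd(z_q, z_p):
    # Plug-in estimate of MMD(q(z) || p(z)) from samples of the two
    # distributions: E[k(q,q)] + E[k(p,p)] - 2 E[k(q,p)].
    return (imq_kernel(z_q, z_q).mean()
            + imq_kernel(z_p, z_p).mean()
            - 2.0 * imq_kernel(z_q, z_p).mean())

# usage: z sampled from the encoder, prior samples scaled by sigma
# penalty = mmd(z, sigma * torch.randn_like(z))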
Accuracy

The first task of a neural network is to predict the correct label, so the first metric that we consider to evaluate the objective is the test accuracy of the trained network. As we see from table 1 and figure 1, the network trained with VIM, having standard deviation σ = 1 and Lagrangian β = 10⁻³, slightly outperforms the best VIB solution with the same objective hyper-parameters β and σ. Obviously, as we can see in figure 1, the accuracy performance is a function of both the objective hyper-parameters β and σ, and it is simply a coincidence that both VIM and VIB are optimised by the same couple (β, σ). Indeed, as we will see in the 2d MNIST case (figure 2) and in the CIFAR10 case (figure 4), the optimal hyper-parameters for the two objectives are not necessarily the same.

Table 1: Comparison of test error on MNIST (smaller is better), with Z ∼ N(0, I), I ∈ R^{K×K}, K = 256

Model                                           error (%)
Baseline                                        1.38
Dropout (Alemi et al., 2016)                    1.34
Label Smoothing (Pereyra et al., 2017)          1.23
Confidence Penalty (Pereyra et al., 2017)       1.17
VIB (β = 10⁻³, σ = 1) (Alemi et al., 2016)      1.13
VIM (β = 10⁻³, σ = 1)

Figure 1: Test error on MNIST for the VIM- and VIB-trained networks. (a) Test error of the same network trained with VIM and VIB as a function of β, for fixed σ = 1. (b) Test error of the same network trained with VIM with two different priors σ, as a function of β.

Robustness

We observed that an optimal network is one learning weights that do not share any information with the label, which means that an optimal network should be robust to noise. In particular, as observed in (Szegedy et al., 2015), small perturbations of the input, sometimes just a single pixel, can lead to a wrong classification. For this reason, in agreement with (Alemi et al., 2016), we decided to measure the robustness of the network by the magnitude of the adversarial corruption that leads to a misclassification.

Formally, given a network M and an input x with label C_i such that M(x) = C_i, the successful adversary A(x) of x ∈ C_i, for a (targeted) attack with target C_j, i ≠ j, is the closest element x′, with respect to a prescribed measure, such that M(A(x)) = C_j. Defining x′ as the successful adversary of x, the robustness of the network is defined as the average distance ||x − x′||_n, with n ∈ {0, 2, ∞}. In particular, in our experiments, consistent with the choice made in (Alemi et al., 2016), we compute the adversarial attack for the first ten zero digits in the test set with adversary target the label one, i.e. M(A(x)) = C_1 with x ∈ C_0, using the adversary method proposed in (Carlini and Wagner, 2017) optimised according to the L2 distance.

Table 2: Distance between the original and the adversarial sample

Model                      L2      L0      L∞
Baseline                   2.20    713     0.37
VIB (β = 10−, σ = 1)       3.38
VIB (β = 10−, σ = 1)       3.58    697     0.63
VIM (β = 10⁻³, σ = 1)

As we see from the results listed in table 2, the VIM model with σ = 1 and β = 10⁻³, obtaining the best accuracy performance, is the most robust with respect to all the metrics considered but L0, which, as observed in (Alemi et al., 2016), decreases when the L2 distance increases. This result gives visible evidence of the theoretical equivalence between the InfoMax, the objective associated to the network accuracy, and the Bayesian inference, the objective ensuring the network robustness. Indeed, the VIM optimised network is the one having maximal accuracy and it is also the most robust to adversarial attack.
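For concreteness, the per-sample distances of the kind reported in Table 2 can be computed as follows; the set of norms {L0, L2, L∞} is our reading of the table and should be taken as an assumption:

import numpy as np

def robustness_distances(x, x_adv):
    # Per-sample distances between an input and its successful adversary;
    # the set of norms {L0, L2, Linf} is assumed.
    diff = (x_adv - x).ravel()
    return {
        "L0": int(np.count_nonzero(diff)),   # number of changed pixels
        "L2": float(np.linalg.norm(diff)),   # Euclidean perturbation size
        "Linf": float(np.abs(diff).max()),   # largest single-pixel change
    }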
Quality of the representation

According to IB theory an optimal network should learn a good quality representation. The debate on what is a good representation is open; in this manuscript we follow the definition given in (Mathieu et al., 2018) and we consider a good representation as one that decomposes the hidden factors of the data. To evaluate the decomposition of the representation we consider two properties: clustering and sparseness. Indeed, a representation with high clustering is one that is able to separate out the hidden factors of the visible data, allowing us to recognise whether an element belongs to a certain class; and a sparse representation can be thought of as one where each embedding has a significant proportion of its dimensions off, i.e. close to 0 (Olshausen and Field, 1996). To evaluate the clustering, we compute the adjusted Rand index adjR between the sets C_i, identified by a classic K-means trained with 10 clusters, and the sets of representations associated to the labels L_i; defining a_i = C_i ∩ L_i, the adjR index is defined as

adjR = (Σ_i |a_i|)/(Σ_i |C_i|) ∈ [0, 1],

yielding 1 for a complete overlap between the clusters and the correct sets and 0 if no point lies in the intersection between the two sets. For the sparseness we consider the Hoyer extrinsic metric (Hurley and Rickard, 2009),

Hoyer(z) = (√d − ‖z‖₁/‖z‖₂)/(√d − 1) ∈ [0, 1],

yielding 0 for a fully dense vector and 1 for a fully sparse vector. Since in our experiments we are considering latent representations with different variances, and high sparseness can be simply associated to a large variance, following the same approach as in (Mathieu et al., 2018) we evaluate the Hoyer metric on a normalised representation vector, i.e. Hoyer = Hoyer(z/σ).

Table 3: Adjusted Rand and Hoyer index of the learned representation (higher is better)

Model                      adjR     Hoyer
Baseline                   0.938    0.37
VIB (β = 10−, σ = 1)       0.948    0.31
VIB (β = 10−, σ = 1)       0.951    0.33
VIM (β = 10⁻³, σ = 1)

In total agreement with the accuracy and robustness performance discussed above, we see in table 3 that the VIM trained network is the one learning the best quality representation, confirming empirically that the Bayesian inference (5), the Information Bottleneck (8) and the InfoMax (12) are actually the same objective. Moreover, we observe that also in this case the VIB trained network with β = 10− is learning a better representation than the counterpart with β = 10⁻³ that has optimal accuracy, suggesting that the VIB trained model cannot be optimised to be the best in all three tasks at the same time.
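The Hoyer metric above is straightforward to compute; a minimal NumPy sketch (ours, not the authors' code), including the normalisation by the standard deviation used in the text:

import numpy as np

def hoyer(z, eps=1e-12):
    # Hoyer sparseness (Hurley and Rickard, 2009):
    # (sqrt(d) - ||z||_1 / ||z||_2) / (sqrt(d) - 1), in [0, 1].
    d = z.size
    ratio = np.abs(z).sum() / (np.linalg.norm(z) + eps)
    return (np.sqrt(d) - ratio) / (np.sqrt(d) - 1.0)

def normalised_hoyer(z):
    # as in the text, normalise by the standard deviation first
    return hoyer(z / z.std())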
2d latent
For the sake of completeness we considered also the case K = 2. This scenario is useful to visualise what the network is learning, and to see the behaviour in a more challenging scenario than the one considered above. We see in figure 2 that, as confirmed by the Hoyer and Rand indices, the representations learnt by VIM are well clustered, with minimal intersection between the different clusters (high Rand score), and symmetric around the origin, i.e. representations close to zero and hence more sparse. We conclude that, in agreement with what was observed for the case K = 256, a better representation corresponds to a smaller test error. Let us notice that in this case the optimal σ parameter differs for the two variational objectives. Such a phenomenon will also be visible in another challenging case, CIFAR10, which we discuss below.

Figure 2: 2d learnt representation of the MNIST data. (a) baseline, adjR: 0.81, Hoyer: 0.305, test error: 4.87%; (b) VIB, adjR: 0.81, Hoyer: 0.305, test error: 3.61%; (c) VIM, adjR: 0.901, Hoyer: 0.328, test error: 3.05%. The network trained with VIM (c) is the most informative (smaller test error) and it is learning the best representation (higher Rand and Hoyer indices).

Classifying the MNIST data, although a classic benchmark, is a quite simple task, and the differences between the two variational objectives are small. The aim of this section is to show that the differences between the two considered objectives become apparent in a more challenging context, and that the choice of the variational hyper-parameters is fundamental in order to have good performance. For this reason, in this section we decided to train a convolutional neural network to classify the CIFAR10 dataset. We take this setting into consideration since, as observed in (Zhang et al., 2016), a classical CNN without regulariser is prone to overfit, and moreover it was observed in (Achille and Soatto, 2018b) that with the VIB objective the overfitting phenomenon essentially disappears and the accuracy performance is improved drastically. We performed the experiments considering an encoder network of four convolutional layers with filters of fixed size and an increasing number of kernels, each followed by batch normalisation, as illustrated in table 4, and as decoder a classic logistic regression as in the MNIST setting. The structure of the network is similar to the one considered in (Zhang et al., 2016), and as already observed in (Achille and Soatto, 2018b) the batch normalisation is added only to have more stable computation, without really affecting the final results. The network is trained using Adam, with the learning rate decreased by a factor of 2 after 30 epochs.

Table 4: CNN architecture of the encoder network used for the CIFAR experiments

Input (32 × 32 × 3)
Conv ( × , 128)
BN + ReLU
Conv ( × , 256)
BN + ReLU
Conv ( × , 512)
BN + ReLU
Conv ( × , 1024)
BN + ReLU
Fully connected K, K = 64
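As a concrete reading of Table 4, the encoder could be sketched as follows; the filter size, stride and padding are elided in the source, so the values below are assumptions, and the PyTorch sketch is ours, not the authors' code:

import torch.nn as nn

def conv_block(c_in, c_out):
    # Conv + BN + ReLU block; kernel size, stride and padding are assumptions
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
    )

encoder = nn.Sequential(
    conv_block(3, 128),      # increasing number of kernels, as in Table 4
    conv_block(128, 256),
    conv_block(256, 512),
    conv_block(512, 1024),
    nn.Flatten(),
    nn.LazyLinear(64),       # fully connected layer to the K = 64 latent space
)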
As we can see in figure 4, the difference between the VIB and VIM trained models is clear in this scenario. Both models are optimised by a Lagrangian parameter β = 10−, but while the VIM model has its minimum for σ = 2., the VIB is minimised by σ = 0.. In accordance with what was seen in the 2d MNIST setting, and in agreement with what was observed in the theory section: when VIB performs well, VIM cannot improve much; the performance gap is instead larger in the more challenging case, where VIB obtains results that are far from optimal.

Figure 4: Comparison of the test error of the CNN trained with VIM and VIB, as a function of the two hyper-parameters σ and β, on CIFAR10. (a) Test error as a function of β for a fixed σ (VIM σ = 2., VIB σ = 0.). (b) Test error as a function of σ for a fixed β = 1. A correct choice of the parameters is fundamental, but in general VIM outperforms VIB.

We conclude by describing the quality of the learned representation, to better understand the role of the two hyper-parameters and the odd behaviour of VIB, where the hyper-parameters associated to the best accuracy are not the same as those associated to the best quality representation. As we see in figure 3, the two objectives are learning representations of similar quality, apart from the strange behaviour of the VIM trained model for β = 10, which is learning a really sparse representation that is then difficult to clusterise. From the results in figure 3 it is possible to make two observations. First, the choice of the prior entropy is relevant for both variational objectives: indeed, even if the VIB model is more robust, the difference in performance between the two VIB variants (σ = 0., σ = 2.) is not negligible; see also the accuracy performance in figure 4. Secondly, we underline that, as observed in the MNIST framework, while in the VIM case the single model that has the best accuracy is also the one learning the best representation, in the VIB context this assertion does not hold true. Indeed, while the minimal test error is obtained for σ = 0. (see figure 4), the best Hoyer metric is achieved by the VIB with σ = 2. (figure 3). This phenomenon is presumably a symptom of a non-accurate objective.

Figure 3: Evaluation of the learned representation by the VIM and VIB optimisers. (a) Adjusted Rand index, for VIM and VIB, as a function of log(β); as expected, VIM is less robust to a change of σ. (b) Hoyer index, for VIM and VIB, as a function of log(β); the two models have really similar results: apart from the isolated case β = 10, the best results are obtained by the two models with highest σ. Note that VIM with σ = 2. (blue line) almost always improves over the other VIM with smaller σ; such behaviour does not hold true in the VIB case.

Conclusion

In this manuscript we presented the Variational InfoMax (VIM), a variational objective optimising the InfoMax, an objective equivalent to Bayesian inference and the Information Bottleneck: maximise the information between the input data and the output labels. Differently from the Variational Information Bottleneck (VIB), which optimises a lower bound of the IM, VIM optimises the principle directly. The theoretical differences appear clearly in the computational experiments, where the VIM trained models outperform the VIB trained ones in test accuracy, network robustness and representation quality. Moreover, the VIM derivation discloses the role of the latent prior, and in particular of its entropy, which coincides with the network capacity, and hence with the maximal information that can be transmitted via the network. Such observations, confirmed in the experiments, suggest considering the variance of the prior as a hyper-parameter of the objective. In future work we will try to overcome this issue by considering the latent variance as an objective term to optimise, in a fashion similar to the variational tempering technique (Mandt et al., 2016). The equivalence between Bayesian inference and InfoMax, and its easy optimisation, suggests describing the LifeLong learning problem, learning more than one task with the same network, from an information theoretic perspective.
In particular, in future work we will investigate a natural extension of the InfoMax to the LifeLong scenario, the conditional InfoMax: given a network already trained for a task A, with learned representation Z_A, train the same net for a task B by optimising the Z_A-conditioned mutual information between the visible data x_B and the labels y_B (of task B), I(Y_B; X_B | Z_A).

References

Achille, A. and S. Soatto, 2018a. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(50):1–34.

Achille, A. and S. Soatto, 2018b. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2897–2905.

Alemi, A. A., I. Fischer, J. V. Dillon, and K. Murphy, 2016. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410.

Alemi, A. A., B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy, 2017. Fixing a broken ELBO. arXiv preprint arXiv:1711.00464.

Baldi, P. and R. Vershynin, 2018. On neuronal capacity. In Advances in Neural Information Processing Systems, pp. 7729–7738.

Carlini, N. and D. Wagner, 2017. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE.

Cover, T. M. and J. A. Thomas, 2012. Elements of Information Theory. John Wiley & Sons.

Hurley, N. and S. Rickard, 2009. Comparing measures of sparsity. IEEE Transactions on Information Theory, 55(10):4723–4741.

Ioffe, S. and C. Szegedy, 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Kingma, D. P. and J. Ba, 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P., T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, 2016. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751.

Kingma, D. P., T. Salimans, and M. Welling, 2015. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583.

Mandt, S., J. McInerney, F. Abrol, R. Ranganath, and D. Blei, 2016. Variational tempering. In Artificial Intelligence and Statistics, pp. 704–712.

Mathieu, E., T. Rainforth, S. Narayanaswamy, and Y. W. Teh, 2018. Disentangling disentanglement. arXiv preprint arXiv:1812.02833.

Olshausen, B. A. and D. J. Field, 1996. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609.

Pereyra, G., G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton, 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548.

Rezende, D. J. and S. Mohamed, 2015. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.

Srivastava, N., G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Szegedy, C., W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.

Tishby, N., F. C. Pereira, and W. Bialek, 2000. The information bottleneck method. arXiv preprint physics/0004057.

Wang, S. and C. Manning, 2013. Fast dropout training. In International Conference on Machine Learning, pp. 118–126.

Zhang, C., S. Bengio, M. Hardt, B. Recht, and O. Vinyals, 2016. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.

Zhao, S., J. Song, and S. Ermon, 2017. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262.