Understanding the Behaviour of the Empirical Cross-Entropy Beyond the Training Distribution
Matias Vera
Facultad de Ingeniería, Universidad de Buenos Aires, Argentina. [email protected]
Pablo Piantanida
Mila and Université de Montréal, Canada; L2S and CentraleSupélec-CNRS-UPSud, France. [email protected]
Leonardo Rey Vega
Facultad de Ingeniería, Universidad de Buenos Aires, Argentina. [email protected]
Abstract
Machine learning theory has mostly focused on generalization to samples from the same distribution as the training data. However, a better understanding of generalization beyond the training distribution, where the observed distribution changes, is also fundamentally important to achieve a more powerful form of generalization. In this paper, we attempt to study, through the lens of information measures, how a particular architecture behaves when the true probability law of the samples is potentially different at training and testing times. Our main result is that the testing gap between the empirical cross-entropy and its statistical expectation (measured with respect to the testing probability law) can be bounded with high probability by the mutual information between the input testing samples and the corresponding representations, generated by the encoder obtained at training time. These results of theoretical nature are supported by numerical simulations showing that the mentioned mutual information is representative of the testing gap, capturing qualitatively its dynamics in terms of the hyperparameters of the network.
Most theories of generalization for classification, both theoretical and empirical, assume that models are trained and tested using data drawn from some fixed distribution. What is often needed in practice, however, is to learn a classifier which performs well on a target domain whose distribution differs significantly from that of the training data, possibly involving concepts previously observed by the learner but with some feature changes (Bengio et al., 2019). In general, we would expect the result of the learning phase to generalize well to other distributions, with the added functionality of being able to monitor the dynamics of generalization due to external factors or non-stationarities of the involved distributions. In this paper, we investigate through the lens of information-theoretic principles how the testing gap between the empirical cross-entropy and its statistical expectation (computed from the target distribution) behaves when the true probability law of the data samples is potentially different at training and testing time.

According to classical statistical learning theory (Boucheron et al., 2005), models with many parameters tend to overfit by representing the training data too accurately, therefore diminishing their ability to generalize to unseen data (Bishop, 2006). Interestingly enough, this phenomenon does not seem to be happening with Deep Neural Networks (DNNs), which even with many parameters and a modest
number of training samples present good generalization properties, as shown by Neyshabur et al. (2017); Zhang et al. (2016).

Stochastic representations extend the classical learning problem to include graphical models, even neural ones such as Variational Auto-Encoders (VAEs) (Kingma and Welling, 2013) or Restricted Boltzmann Machines (RBMs) (Hinton, 2012). With these models, the mutual information between feature inputs and their representations becomes a relevant quantity. Empirical studies based on the Information Bottleneck (IB) method (Tishby et al., 1999) have shown that this mutual information may be related to the overfitting problem in DNNs. The IB method studies the tradeoff between accuracy and information complexity measured in terms of the mutual information. Statistical rates on the empirical estimates of the corresponding IB tradeoffs have been reported in Shamir et al. (2010). Schwartz-Ziv and Tishby (2017) show, based on empirical evidence, that the generalization capacity of DNNs is induced by an implicit data compression of the feature inputs. However, these trade-offs are still controversial and the question remains open. As a matter of fact, recent works by Amjad and Geiger (2018) and Saxe et al. (2018) report difficulties in using the IB trade-off as an information-regularized objective for training. Indeed, it is known that estimating mutual information from high-dimensional data samples is a difficult and challenging problem for which even state-of-the-art methods might lead to misleading numerical results (Belghazi et al., 2018).

We investigate the probability concentration of the testing gap between the empirical cross-entropy and its statistical expectation, measured with respect to the testing probability distribution, which may be different from the training distribution. More specifically, there are two attributes that characterize our model: (a) we study a deviation on the target (testing) dataset once the training stage is finished, i.e., for a given soft-classifier; and (b) we consider randomized encoders, which allows us not only to study deterministic feed-forward structures, but also stochastic models.

Theorem 1 provides the rigorous statement of the cross-entropy deviation bound that is the basis of this work. In contrast to standard learning concentration inequalities, this bound scales with $\log(n)/\sqrt{n}$ and $1/\sqrt{n}$ for an $n$-length dataset. Our bound depends on several factors: a mutual information term between the input and the representations generated from it using a reference selected encoder; a second term that measures the decoder efficiency; and two other magnitudes that measure how robust the problem can be considered, motivated by Xu and Mannor (2012) among others. Despite the fact that our results may not lead to the tightest bounds, they are intended to reflect the importance of information-theoretic concepts in the problem of representation learning and the different trade-offs that can be established between information measures and quantities of interest in statistical learning.

An empirical investigation of the interplay between the cross-entropy deviation and the mutual information is provided on high-dimensional datasets of natural images with rotations and translations on the target domain.
These simulations show the ability of the mutual information to predict the behavior of the gap for three well-known stochastic representation models: (a) the standard Variational Auto-Encoder (VAE) (Kingma and Welling, 2013); (b) the log-normal encoder presented in the Information Dropout scheme (Achille and Soatto, 2016); and (c) the classical encoder based on Restricted Boltzmann Machines (RBMs) (Hinton, 2012). These results validate the fact that the mutual information is an important quantity, strongly related to the generalization properties of learning algorithms, and motivate further studies in this respect.

The rest of the paper is organized as follows. In Section 2, we introduce a stochastic model which combines the randomized encoder concept of Achille and Soatto (2016) with the possibility of decomposing the classifier into an encoder and a decoder (Schwartz-Ziv and Tishby, 2017). In Section 3 we present our general concentration inequality based on the above mentioned mutual information and other specific magnitudes, which will be discussed. In Section 4 we show numerical evidence for some selected models, and in Section 5 we provide concluding remarks. Major mathematical details are relegated to the Appendix.

We are interested in the problem of pattern classification, consisting in the prediction of the unknown class that matches an observation. An observation or example is a sample $x \in \mathcal{X}$ which has an associated label $y \in \mathcal{Y}$ (a finite space). A $|\mathcal{Y}|$-ary classifier is defined by a (stochastic) decision rule $Q_{\hat{Y}|X} : \mathcal{X} \to \mathcal{P}(\mathcal{Y})$, where $\hat{Y} \in \mathcal{Y}$ denotes the random variable associated to the classifier output and $X$ is the random observed example. Typically, this classifier is trained with samples generated according to an unknown training distribution. It is also assumed that the testing examples and their corresponding labels are generated in an i.i.d. fashion according to $p_{XY} := p_X P_{Y|X}$ (which could possibly be different from the above mentioned training distribution).

The problem of finding a good classifier can be divided into that of simultaneously finding a (possibly randomized) encoder $q_{U|X} : \mathcal{X} \to \mathcal{P}(\mathcal{U})$ that maps raw data to a higher-dimensional (feature) space $\mathcal{U}$ and a soft-decoder $Q_{\hat{Y}|U} : \mathcal{U} \to \mathcal{P}(\mathcal{Y})$ which maps the representation to a probability distribution on the label space $\mathcal{Y}$. These mappings induce an equivalent classifier:
$$Q_{\hat{Y}|X}(y|x) = \mathbb{E}_{q_{U|X}}\!\left[ Q_{\hat{Y}|U}(y|U) \,\middle|\, X = x \right]. \qquad (1)$$

Remark. In the standard methodology with deep representations, we consider $L$ randomized encoders ($L$ layers) $\{q_{U_l|U_{l-1}}\}_{l=1}^{L}$ with $U_0 \equiv X$. Although this appears at first to be more general, it can be cast formally into the one-layer formulation induced by the marginal distribution that relates the input and the final $L$-th output layer. Therefore, results on the one-layer formulation also apply to the $L$-layer formulation, and we shall thus focus on the one-layer case without loss of generality.

This representation contains several cases of interest, such as the feed-forward neural network case as well as genuinely graphical models such as VAEs or RBMs. The computation of (1) requires marginalizing out $u \in \mathcal{U}$, which could be computationally prohibitive in practice. We use the cross-entropy as a loss function:
$$\ell(x, y) := \ell\big(q_{U|X}(\cdot|x),\, Q_{\hat{Y}|U}(y|\cdot)\big) = \mathbb{E}_{q_{U|X}}\!\left[ -\log Q_{\hat{Y}|U}(y|U) \,\middle|\, X = x \right]. \qquad (2)$$
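To make the role of the stochastic encoder concrete, the following minimal NumPy sketch (ours, not part of the paper) estimates the induced classifier (1) and the cross-entropy loss (2) by Monte Carlo sampling of the representation; `encoder_sample` and `decoder_probs` are hypothetical callables standing in for $q_{U|X}$ and $Q_{\hat{Y}|U}$.

```python
import numpy as np

def induced_classifier(x, encoder_sample, decoder_probs, n_mc=128):
    """Monte Carlo estimate of Q_{Y_hat|X}(.|x) in Eq. (1):
    average the decoder output over representations U ~ q_{U|X}(.|x)."""
    u = encoder_sample(x, n_mc)        # shape (n_mc, m): samples of U given X = x
    p = decoder_probs(u)               # shape (n_mc, |Y|): Q_{Y_hat|U}(.|u) per sample
    return p.mean(axis=0)              # shape (|Y|,)

def cross_entropy_loss(x, y, encoder_sample, decoder_probs, n_mc=128):
    """Monte Carlo estimate of the loss in Eq. (2):
    E_{U ~ q_{U|X}(.|x)} [ -log Q_{Y_hat|U}(y|U) ]."""
    u = encoder_sample(x, n_mc)
    p = decoder_probs(u)[:, y]         # Q_{Y_hat|U}(y|u) for each sampled u
    return float(np.mean(-np.log(np.clip(p, 1e-12, None))))
```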
The learner's goal is to select $(q_{U|X}, Q_{\hat{Y}|U})$ by minimizing the expected risk $\mathcal{L}(q_{U|X}, Q_{\hat{Y}|U}) := \mathbb{E}_{p_{XY}}[\ell(X, Y)]$. At training time this is done using the empirical risk computed with i.i.d. samples from the training distribution. After the training phase is over, we would like to evaluate the performance of the obtained $(q_{U|X}, Q_{\hat{Y}|U})$ on a testing dataset $S_n := \{(x_1, y_1), \dots, (x_n, y_n)\}$, which is independent from the training set (but not necessarily sampled from the same distribution). The testing risk is defined by
$$\mathcal{L}_{\mathrm{emp}}(q_{U|X}, Q_{\hat{Y}|U}, S_n) := \frac{1}{n} \sum_{i=1}^{n} \ell(x_i, y_i). \qquad (3)$$
Since the testing risk is evaluated on a finite number of samples, its evaluation may be sensitive to sampling noise. The gap, defined next, is a measure of how an encoder-decoder pair could perform on data unseen at training time and contained in the testing set $S_n$.

Definition 1 (Error gap). Given a stochastic encoder $q_{U|X} : \mathcal{X} \to \mathcal{P}(\mathcal{U})$ and decoder $Q_{\hat{Y}|U} : \mathcal{U} \to \mathcal{P}(\mathcal{Y})$, the error gap is given by
$$\mathcal{E}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n) := \Big| \mathcal{L}_{\mathrm{emp}}(q_{U|X}, Q_{\hat{Y}|U}, S_n) - \mathcal{L}(q_{U|X}, Q_{\hat{Y}|U}) \Big|, \qquad (4)$$
which quantifies the error associated to $(q_{U|X}, Q_{\hat{Y}|U})$ when $\mathcal{L}_{\mathrm{emp}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ is considered an estimate of $\mathcal{L}(q_{U|X}, Q_{\hat{Y}|U})$ (calculated with the testing distribution).

Remark. The definition of this gap should not be confused with the typical one used for PAC-style bounds (Devroye et al., 1997). The latter is computed w.r.t. the training dataset, while here we study a deviation bound on the testing set, i.e., after the training stage has been accomplished. Our focus is reasonable for scenarios where the testing statistics evolve over time and may not match the training distribution. With a slight abuse of notation, we write $p_{XY} := p_X P_{Y|X}$ even though $p_X$ is a pdf and $P_{Y|X}$ is a pmf.
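For illustration only (not from the paper), a small sketch of how the empirical testing risk (3) and the error gap (4) would be computed once a loss function such as the one above is available; in practice the expected risk can only be approximated, e.g., with a much larger held-out set.

```python
import numpy as np

def testing_risk(loss_fn, test_x, test_y):
    """Empirical testing risk, Eq. (3): average loss over the n test pairs."""
    return float(np.mean([loss_fn(x, y) for x, y in zip(test_x, test_y)]))

def error_gap(loss_fn, test_x, test_y, true_risk):
    """Error gap, Eq. (4): |empirical testing risk - expected risk|.
    `true_risk` stands for L(q, Q) under the testing law; here it is assumed
    to be supplied externally (e.g. approximated on a large held-out set)."""
    return abs(testing_risk(loss_fn, test_x, test_y) - true_risk)
```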
Information Theoretic Bounds on the Gap

In this section, we first present our main result in Theorem 1, which is a bound on the gap (4) that holds with probability at least $1-\delta$, as a function of a fixed randomized encoder-decoder pair $(q_{U|X}, Q_{\hat{Y}|U})$. In particular, we show that the mutual information between the input raw data and its representation controls the gap with a scaling $O\!\left(\frac{\log(n)}{\sqrt{n}}\right)$, which leads to a so-called informational deviation error bound. The information measures used in this work are (Cover and Thomas, 2006): the Kullback-Leibler (KL) divergence $D(p_X \| q_X) := \mathbb{E}_{p_X}\!\left[\log \frac{p_X(X)}{q_X(X)}\right]$; the conditional KL divergence $D(p_{Y|X} \| q_{Y|X} \,|\, p_X) := \mathbb{E}_{p_X}\!\left[ D\big(p_{Y|X}(\cdot|X) \,\|\, q_{Y|X}(\cdot|X)\big) \right]$; and the mutual information $I(p_X; p_{Y|X}) := D(p_{Y|X} \| p_Y \,|\, p_X)$.
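As a concrete illustration of these definitions (ours, not part of the original text, and restricted to discrete alphabets), the three quantities can be evaluated as follows:

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) between two pmfs over the same finite alphabet."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def conditional_kl(p_y_given_x, q_y_given_x, p_x):
    """Conditional KL D(p_{Y|X} || q_{Y|X} | p_X): average of per-x divergences."""
    return float(sum(px * kl(py, qy)
                     for px, py, qy in zip(p_x, p_y_given_x, q_y_given_x)))

def mutual_information(p_x, p_y_given_x):
    """I(p_X; p_{Y|X}) = D(p_{Y|X} || p_Y | p_X), with p_Y the induced marginal."""
    p_y_given_x = np.asarray(p_y_given_x, float)   # shape (|X|, |Y|), rows p_{Y|X}(.|x)
    p_y = np.asarray(p_x, float) @ p_y_given_x      # marginal p_Y
    return conditional_kl(p_y_given_x, np.tile(p_y, (len(p_x), 1)), p_x)
```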
In this section, we will make use of the following assumptions.

Assumptions 1. We assume $\mathcal{X} = \mathrm{Supp}(p_X)$ and $P_Y(y_{\min}) := \min_{y\in\mathcal{Y}} P_Y(y) > 0$, without loss of generality, since zero-probability events can be ignored. We also assume that $\mathrm{Vol}(\mathcal{U}) < \infty$ and that the selected encoder-decoder pair $(q_{U|X}, Q_{\hat{Y}|U})$ is such that $Q_{\hat{Y}|U}(y_{\min}|u_{\min}) := \inf_{y\in\mathcal{Y},\, u\in\mathcal{U}} Q_{\hat{Y}|U}(y|u) \ge \eta > 0$.

When $\mathcal{U}$ is continuous, the condition $\mathrm{Vol}(\mathcal{U}) < \infty$ means that $\mathcal{U}$ is a bounded set; if $\mathcal{U}$ is discrete, it means that $\mathcal{U}$ has finite cardinality. The assumption $Q_{\hat{Y}|U}(y_{\min}|u_{\min}) > 0$ is typically valid for the soft-max decoder, since in practice its parameters never diverge.

Theorem 1 (Information-theoretic bound). For every $\delta \in (0,1)$, with probability at least $1-\delta$ over the choice of $S_n \sim p_{XY}$, the gap satisfies
$$\mathcal{E}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n) \le \inf_{K\in\mathbb{N}}\ 2\epsilon(K) + A_\delta\sqrt{I(p_X; q_{U|X})}\cdot\frac{\log(n)}{\sqrt{n}}\,r(K) + D_\delta\cdot D_{\mathrm{HL}}\big(Q^{D}_{Y|U}\,\big\|\,Q_{\hat{Y}|U}\,\big|\,q^{D}_{U}\big) + \frac{C_\delta}{\sqrt{n}} + O\!\left(\frac{\log(n)}{n}\right), \qquad (5)$$
for every $(q_{U|X}, Q_{\hat{Y}|U})$ that meets Assumptions 1, where $D_{\mathrm{HL}}$ is the (conditional) Hellinger distance
$$D_{\mathrm{HL}}\big(Q^{D}_{Y|U}\,\big\|\,Q_{\hat{Y}|U}\,\big|\,q^{D}_{U}\big) = \sqrt{\frac{1}{2}\,\mathbb{E}_{q^{D}_{U}}\!\left[\sum_{y\in\mathcal{Y}}\Big(\sqrt{Q_{\hat{Y}|U}(y|U)}-\sqrt{Q^{D}_{Y|U}(y|U)}\Big)^{2}\right]}, \qquad (6)$$
the constants are defined as $A_\delta := \sqrt{2}\,B_\delta$, $B_\delta := \sqrt{\log\big(\tfrac{|\mathcal{Y}|+4}{\delta}\big)}$, $C_\delta := 2\,\mathrm{Vol}(\mathcal{U})\,e^{-1} + B_\delta\sqrt{|\mathcal{Y}|}\,\log\big(\tfrac{\mathrm{Vol}(\mathcal{U})}{P_Y(y_{\min})}\big)$ and $D_\delta := Q_{\hat{Y}|U}^{-1/2}(y_{\min}|u_{\min})\sqrt{\tfrac{|\mathcal{Y}|+4}{\delta}}$; and
$$\epsilon(K) = \sup_{\substack{1\le k\le K,\ y\in\mathcal{Y},\\ x\in\mathcal{K}^{(y)}_k}} \Big|\ell(x, y) - \ell\big(x^{(k,y)}, y\big)\Big|, \qquad r(K) = \frac{1}{\min\limits_{1\le k\le K,\ y\in\mathcal{Y}} \int_{\mathcal{K}^{(y)}_k} p_X(x)\,dx}, \qquad (7)$$
where $\big(\{\mathcal{K}^{(y)}_k\}_{k=1}^{K}, \{x^{(k,y)}\}_{k=1}^{K}\big)_{y\in\mathcal{Y}}$ are $|\mathcal{Y}|$ partitions of $\mathcal{X}$ and their respective centroids. They are functions of the natural number $K$ and such that, for each $y\in\mathcal{Y}$, $\bigcup_{k=1}^{K}\mathcal{K}^{(y)}_k = \mathcal{X}$, $\mathcal{K}^{(y)}_i \cap \mathcal{K}^{(y)}_j = \emptyset$ for all $1\le i<j\le K$, and $\mathrm{Vol}(\mathcal{K}^{(y)}_k) > 0$ for all $1\le k\le K$; and
$$q^{D}_{U}(u) = \sum_{k=1}^{K}\sum_{y\in\mathcal{Y}} q_{U|X}\big(u\,\big|\,x^{(k,y)}\big)\int_{\mathcal{K}^{(y)}_k} p_{XY}(x, y)\,dx, \qquad (8)$$
$$Q^{D}_{Y|U}(y|u) = \frac{\sum_{k=1}^{K} q_{U|X}\big(u\,\big|\,x^{(k,y)}\big)\int_{\mathcal{K}^{(y)}_k} p_{XY}(x, y)\,dx}{q^{D}_{U}(u)} \qquad (9)$$
are the distributions induced by the quantization of the testing distribution $p_{XY}(x,y)$ by the above mentioned partitions.

Figure 1: Comparison between the original MNIST dataset and our proposed version with rotations and translations.

The proof is relegated to the Appendix. This bound has some important terms which are worth analyzing:
• Encoder and decoder $(q_{U|X}, Q_{\hat{Y}|U})$: the encoder and the decoder in Theorem 1 are considered to be given. However, without loss of generality, they can correspond to the encoder and decoder learned during training. In this case, the bound in (5) should be further averaged with respect to the randomness of the training samples (according to the training distribution) and the training algorithm itself.
• $I(p_X; q_{U|X})$: the mutual information between the raw data $X$ and its randomized representation $U$ is a regularization term used to reduce overfitting, which some authors interpret as a "measure of information complexity" (Achille and Soatto, 2016; Alemi et al., 2016; Vera et al., 2018). Theorem 1 is a first step towards explaining how and why this effect happens.
This term presents a scaling rate of $n^{-1/2}\log(n)$ and it is the most important term of our deviation bound. In Section 4 we present an empirical analysis of its importance.
• $D_{\mathrm{HL}}\big(Q^{D}_{Y|U}\,\|\,Q_{\hat{Y}|U}\,|\,q^{D}_{U}\big)$: the Hellinger distance can be seen as a measure of the efficiency of the decoder in comparison with the decoder $Q^{D}_{Y|U}$, which is induced by the randomized encoder $q_{U|X}$ and the quantized testing distribution. When $Q_{\hat{Y}|U} = Q^{D}_{Y|U}$ this term is zero, suggesting that this selection minimizes the error gap. In this sense, we are interested in decoders with sufficient degrees of freedom, such as the soft-max decoder.
• $\epsilon(K)$ and $r(K)$: motivated by robust quantization (Xu and Mannor, 2012), these functions define, for each $y\in\mathcal{Y}$, an artificial discretization of the space $\mathcal{X}$ into cells (partition elements). This discretization allows us to introduce some information-theoretic techniques and results for discrete alphabets during the proof of our result. While $\epsilon(K)$ is associated with the robustness of the loss function over the partition elements, $r(K)$ is the inverse of the minimum probability of falling into a cell. There is a tradeoff between them: while $\epsilon(K)$ is a decreasing function of $K$ (when the number of cells is increased, they may be smaller), $r(K)$ is an increasing one (smaller cells enclose less probability).
• $Q_{\hat{Y}|U}(y_{\min}|u_{\min})$: using the maximum value of the loss function could be a poor choice to obtain a deviation bound with a dependence on the decoder. In our case, a more sensible approach is followed by using the Hellinger distance between $Q_{\hat{Y}|U}$ and $Q^{D}_{Y|U}$. It can be seen that this decoder-dependent term will not be relevant in two scenarios: when the decoder selection $Q_{\hat{Y}|U}$ is close to $Q^{D}_{Y|U}$ (in a Hellinger distance sense) and when $Q_{\hat{Y}|U}(y_{\min}|u_{\min})$ is not too small (through $D_\delta$). These two possibilities, together with the $1/\sqrt{n}$ scaling of this term, justify disregarding it when the number of samples is large enough.
• $\mathrm{Vol}(\mathcal{U})$: note that if ReLU activations are implemented (whose output volume is limited for bounded inputs), $\mathrm{Vol}(\mathcal{U})$ is expected to be larger than for the case of sigmoid activations. As a consequence, the mutual information will have a greater influence on generalization with saturated activations. This observation matches the analysis of Saxe et al. (2018).

Experimental Results
In this section, we experimentally check our bound on some stochastic models used in practice. We will show that the mutual information is representative of the gap behavior; i.e., after a training stage, we compare a quantile of $\mathcal{E}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ (the magnitude to be bounded) with $A\cdot\sqrt{I(p_X; q_{U|X})} + C$, where $A$ and $C$ are universal constants representative of the corresponding ones in Theorem 1. The magnitudes are compared for several rules indexed by a Lagrange multiplier defined in each experiment. That is, we first select an encoder $q_{U|X}$ and a decoder $Q_{\hat{Y}|U}$ based on a Lagrange multiplier $\lambda$ in the cost function and on the training set, and then we evaluate the risk deviation on features independent of those used during training, i.e., according to the testing dataset $S_n$. We focus on the qualitative behavior, so we plot the quantile of the gap and the mutual information on different axes in order to get rid of the mentioned constants.

We will show that the mutual information is representative of the behavior of this gap, even when the distribution from which the test samples are generated does not match the training law. Since our main goal is not to present a new classification methodology competitive with state-of-the-art methods, we restrict ourselves to small databases, as motivated in the work of Neyshabur et al. (2017). We sample a small random subset of MNIST (the standard dataset of handwritten digits) as the training set, and the algorithms are tested both with the standard MNIST dataset and with a disturbed version with translations and rotations. Random translations are drawn from a uniform distribution between -5 and 5 pixels (quantized) for each axis, and random rotations are drawn from a uniform distribution over $(-\pi, \pi)$ for the angle, as can be seen in Fig. 1. Experiments with the CIFAR-10 dataset (natural images) can be found in the Appendix.

In most applications, the alphabets are such that $\mathcal{X}\subset\mathbb{R}^d$ and $\mathcal{U}\subset\mathbb{R}^m$, where $d$ is the number of input units and $m$ is the number of hidden units of the auxiliary variable, so we refer to the vector random variables $X$ and $U$ respectively. We approximate $\mathcal{L}(q_{U|X}, Q_{\hat{Y}|U})$ with a large held-out dataset and $\mathcal{L}_{\mathrm{emp}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ with different independent mini-testing datasets, i.e., using the rest of the features. The $0.9$-quantile of $\mathcal{E}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ ($\delta = 0.1$) is computed from the different values of the testing risk $\mathcal{L}_{\mathrm{emp}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$. Finally, there is the difficulty of implementing a mutual information estimator. To this end, we make use of the variational bound (Cover and Thomas, 2006) to upper bound the mutual information by $I(p_X; q_{U|X}) \le D(q_{U|X} \,\|\, \tilde{q}_U \,|\, p_X)$, where $\tilde{q}_U$ is an auxiliary prior pdf (Kingma and Welling, 2013; Achille and Soatto, 2016). Consider for $\tilde{q}_U$ a product distribution $\tilde{q}_U(u) = \prod_{j=1}^m \tilde{q}_{U_j}(u_j)$. It is straightforward to check that
$$\sqrt{I(p_X; q_{U|X})} \le \sqrt{\sum_{j=1}^{m} D\big(q_{U_j|X}(\cdot|X) \,\big\|\, \tilde{q}_{U_j} \,\big|\, p_X\big)}. \qquad (10)$$
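The two quantities compared throughout this section, the 0.9-quantile of the gap and the variational bound (10), can be computed as in the following sketch (our illustration, with hypothetical inputs: `test_risks` holds the empirical risks over the mini-testing sets and `per_dim_kl` the per-coordinate divergence estimates given in closed form below).

```python
import numpy as np

def gap_quantile(test_risks, true_risk, q=0.9):
    """0.9-quantile of the error gap (4): `test_risks` holds L_emp evaluated on many
    independent mini-testing sets, `true_risk` approximates L under the testing law."""
    gaps = np.abs(np.asarray(test_risks, dtype=float) - true_risk)
    return float(np.quantile(gaps, q))

def mi_upper_bound(per_dim_kl):
    """Right-hand side of Eq. (10): sqrt of the summed per-coordinate divergences
    D(q_{U_j|X} || q~_{U_j} | p_X), an upper bound on sqrt(I(p_X; q_{U|X}))."""
    return float(np.sqrt(np.sum(per_dim_kl)))
```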
We implement a parametric estimation of the KL divergence $D\big(q_{U_j|X}(\cdot|X) \,\|\, \tilde{q}_{U_j} \,|\, \hat{P}_X\big)$, where $\hat{P}_X$ is the empirical pmf (the KL estimate is an average over the sample), for each of the following architectures: a normal encoder with normal prior (Kingma and Welling, 2013), a log-normal encoder with log-normal prior (Achille and Soatto, 2016), and an RBM encoder with prior $\frac{1}{n}\sum_{i=1}^{n} q_{U_j|X}(u_j|x_i)$ (Hinton, 2012). We refer to each of the examples by its encoder: Normal, log-Normal and RBM, respectively. Note that the mutual information does not depend on the decoder, so we always use a soft-max output layer for simplicity. Experimental details can be found in the Appendix.

Gaussian Variational Auto-Encoders (VAEs) introduce a normal encoder $U_j \,|\, X = x \sim \mathcal{N}(\mu_j(x), \sigma_j^2(x))$, $j \in [1:m]$, where $\mu_j(x)$ and $\log \sigma_j^2(x)$ are constructed via DNNs (vectorized), a standard normal prior $\tilde{U}_j \sim \mathcal{N}(0, 1)$, and the decoder input is generated by simple sampling using the reparameterization trick (Kingma and Welling, 2013). In this case each KL divergence in (10) can be estimated as follows:
$$D\big(q_{U_j|X}(\cdot|X) \,\big\|\, \tilde{q}_{U_j} \,\big|\, \hat{P}_X\big) = \frac{1}{2n} \sum_{i=1}^{n} \big( -\log \sigma_j^2(x_i) + \sigma_j^2(x_i) + \mu_j^2(x_i) - 1 \big). \qquad (11)$$

Figure 2: Comparison between the $0.9$-quantile of $\mathcal{E}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ (solid) and the mutual information variational bound (10) (dashed), as a function of the Lagrange multiplier, for the normal encoder and testing with: (a) images generated with the training distribution, (b) images generated with another distribution.

Fig. 2a and 2b show the $0.9$-quantile of $\mathcal{E}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ and the mutual information variational bound (10) as a function of the training Lagrange multiplier $\lambda$, testing with the training and disturbed MNIST dataset respectively. There is a decreasing tendency, indicating the vanishing of possible overfitting when the regularization is incremented. More complex tasks, such as testing with images sampled from another distribution (Fig. 2b), generate an error gap behavior closer to the mutual information.

Information Dropout proposes log-normal encoders $U_j = f_j(X)\, e^{\alpha_j(X) Z}$, $j \in [1:m]$, where $Z \sim \mathcal{N}(0,1)$, $f_j(x)$ and $\alpha_j(x)$ are constructed via DNNs (vectorized), and the decoder input is generated by simple sampling using the reparameterization trick. In Achille and Soatto (2016), the authors recommend using a log-normal prior $\tilde{U}_j \sim \log\mathcal{N}(\mu_j, \sigma_j^2)$ when $f(x) = [f_1(x), \dots, f_m(x)]$ is a DNN with soft-plus activation, where $\mu_j$ and $\sigma_j^2$ are variables to be trained. Since $U_j \,|\, X = x \sim \log\mathcal{N}(\log f_j(x), \alpha_j^2(x))$ and the KL divergence is invariant under reparametrizations, the divergence between two log-normal distributions is equal to the divergence between the corresponding normal distributions. Therefore, using the formula for the KL divergence between normal random variables (Cover and Thomas, 2006), we obtain
$$D\big(q_{U_j|X}(\cdot|X) \,\big\|\, \tilde{q}_{U_j} \,\big|\, \hat{P}_X\big) = \frac{1}{n} \sum_{i=1}^{n} D\big( \mathcal{N}(\log f_j(x_i), \alpha_j^2(x_i)) \,\big\|\, \mathcal{N}(\mu_j, \sigma_j^2) \big) \qquad (12)$$
$$= \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\alpha_j^2(x_i) + (\log f_j(x_i) - \mu_j)^2}{2\sigma_j^2} - \log\frac{\alpha_j(x_i)}{\sigma_j} - \frac{1}{2} \right]. \qquad (13)$$
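A minimal NumPy sketch (ours, not the authors' code) of the closed-form per-coordinate terms (11) and (13); `mu`, `log_var`, `log_f`, `alpha` stand for the encoder outputs evaluated on the $n$ inputs, and `prior_mu`, `prior_sigma` for the trained prior parameters.

```python
import numpy as np

def kl_normal_vs_std_normal(mu, log_var):
    """Eq. (11): sample average of KL( N(mu_j(x), sigma_j^2(x)) || N(0, 1) ),
    one value per hidden unit j. `mu`, `log_var` have shape (n, m)."""
    var = np.exp(log_var)
    return 0.5 * np.mean(-log_var + var + mu**2 - 1.0, axis=0)       # shape (m,)

def kl_lognormal_vs_lognormal(log_f, alpha, prior_mu, prior_sigma):
    """Eq. (13): sample average of the KL between log-N(log f_j(x), alpha_j^2(x)) and
    log-N(prior_mu_j, prior_sigma_j^2), via the corresponding normal distributions.
    `log_f`, `alpha` have shape (n, m); `prior_mu`, `prior_sigma` have shape (m,)."""
    term = (alpha**2 + (log_f - prior_mu)**2) / (2.0 * prior_sigma**2)
    return np.mean(term - np.log(alpha / prior_sigma) - 0.5, axis=0)  # shape (m,)
```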
Fig. 3a and 3b show the $0.9$-quantile of $\mathcal{E}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ and the mutual information variational bound (10) as a function of the training Lagrange multiplier $\lambda$, testing with the training and disturbed MNIST dataset respectively. Again, there is a decreasing tendency indicating the vanishing of possible overfitting when the regularization is incremented. In this case the two behaviors are quite close because the prior distribution is also trained.

Consider the standard RBM models studied in Hinton (2012); Srivastava et al. (2014). For every $j \in [1:m]$, $U_j$ given $X = x$ is distributed as a Bernoulli
random variable with parameter $\sigma(b_j + w_j^{T} x)$ (sigmoid activation). Selecting the product prior $\tilde{q}_{U_j}(u_j) = \frac{1}{n}\sum_{i=1}^{n} q_{U_j|X}(u_j|x_i)$, we obtain
$$D\big(q_{U_j|X}(\cdot|X) \,\big\|\, \tilde{q}_{U_j} \,\big|\, \hat{P}_X\big) = \frac{1}{n}\sum_{i=1}^{n}\left[ \sigma(b_j + \langle w_j, x_i\rangle)\, \log\!\left( \frac{\sigma(b_j + \langle w_j, x_i\rangle)}{\frac{1}{n}\sum_{k=1}^{n} \sigma(b_j + \langle w_j, x_k\rangle)} \right) + \big(1 - \sigma(b_j + \langle w_j, x_i\rangle)\big)\, \log\!\left( \frac{1 - \sigma(b_j + \langle w_j, x_i\rangle)}{\frac{1}{n}\sum_{k=1}^{n} \big(1 - \sigma(b_j + \langle w_j, x_k\rangle)\big)} \right) \right]. \qquad (14)$$

Figure 3: Comparison between the $0.9$-quantile of $\mathcal{E}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ (solid) and the mutual information variational bound (10) (dashed), as a function of the Lagrange multiplier, for the log-normal encoder and testing with: (a) images generated with the training distribution, (b) images generated with another distribution.

Figure 4: Comparison between the $0.9$-quantile of $\mathcal{E}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ (solid) and the mutual information variational bound (10) (dashed), as a function of the Lagrange multiplier, for the RBM encoder and testing with: (a) images generated with the training distribution, (b) images generated with another distribution.

Figures 4a and 4b show the $0.9$-quantile of $\mathcal{E}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ and the mutual information variational bound (10) as a function of the Lagrange multiplier $\lambda$, testing with the training and disturbed MNIST dataset respectively. Again, the behaviors are similar and decreasing, both when testing with samples generated with the training distribution and with samples generated with another one.

We presented a theoretical investigation of a typical classification task in which we have training data from a source domain, but we wish the testing gap between the empirical cross-entropy and its statistical expectation (measured with respect to a possibly different testing probability law) to be as small as possible. Our main result (Theorem 1) is that the testing gap can be bounded with high probability by the mutual information between the input testing samples and the corresponding representations, the Hellinger distance which measures the decoder efficiency, and other less relevant constants. Our empirical study suggests that the mutual information may be a good measure to capture the dynamics of the gap with respect to important training parameters. We finally presented a simple experimental setup which shows a strong correlation between the represented gap and the mutual information between the raw inputs and their representations. Further work will be needed to provide stronger support to these numerical results in the presence of other sources of non-stationarity between training and testing datasets.
Acknowledgment
The work of Prof. Pablo Piantanida was supported by the European Commission's Marie Sklodowska-Curie Actions (MSCA), through the Marie Sklodowska-Curie IF (H2020-MSCA-IF-2017-EF-797805). The work of Matias Vera was supported by the Peruilh PhD Scholarship from Facultad de Ingeniería, Universidad de Buenos Aires.
References
Achille, A. and Soatto, S. (2016). Information Dropout: learning optimal representations through noisy computation. ArXiv e-prints.

Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. (2016). Deep variational information bottleneck. CoRR, abs/1612.00410.

Amjad, R. A. and Geiger, B. C. (2018). Learning representations for neural network-based classification using the information bottleneck principle. CoRR, abs/1802.09766.

Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Hjelm, R. D., and Courville, A. C. (2018). Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholm, Sweden, pages 530–539.

Bengio, Y., Deleu, T., Rahaman, N., Ke, N. R., Lachapelle, S., Bilaniuk, O., Goyal, A., and Pal, C. J. (2019). A meta-transfer objective for learning to disentangle causal mechanisms. CoRR, abs/1901.10912.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Boucheron, S., Bousquet, O., and Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375.

Chopra, P. and Yadav, S. K. (2018). Restricted Boltzmann machine and softmax regression for fault detection and classification. Complex & Intelligent Systems, 4(1):67–77.

Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience.

Devroye, L., Györfi, L., and Lugosi, G. (1997). A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics. Springer, 2nd edition.

Ghosal, S., Ghosh, J., and Van der Vaart, A. (2000). Convergence rates of posterior distributions. The Annals of Statistics, 28(2):500–531.

Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Comput., 14(8):1771–1800.

Hinton, G. E. (2012). A practical guide to training restricted Boltzmann machines. In Proceedings of Neural Networks: Tricks of the Trade (2nd ed.), pages 599–619. Springer.

Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Comput., 18(7):1527–1554.

Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. In Proc. of the 2nd Int. Conf. on Learning Representations (ICLR).

Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.

Neyshabur, B., Tomioka, R., Salakhutdinov, R., and Srebro, N. (2017). Geometry of optimization and implicit regularization in deep learning. CoRR, abs/1705.03071.

Saxe, A., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B., and Cox, D. (2018). On the information bottleneck theory of deep learning. In Proc. of the 6th Int. Conf. on Learning Representations (ICLR).

Schwartz-Ziv, R. and Tishby, N. (2017). Opening the black box of deep neural networks via information. CoRR, abs/1703.00810.

Shamir, O., Sabato, S., and Tishby, N. (2010). Learning and generalization with the information bottleneck. Theor. Comput. Sci., 411(29-30):2696–2711.

Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. J. of Mach. Learning Research, 15(1):1929–1958.

Srivastava, N., Salakhutdinov, R., and Hinton, G. E. (2013). Modeling documents with deep Boltzmann machines. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI 2013, Bellevue, WA, USA, August 11-15, 2013.

Tishby, N., Pereira, F. C., and Bialek, W. (1999). The information bottleneck method. In Proc. of the 37th Annu. Allerton Conf. on Communication, Control and Computing, pages 368–377.

Vera, M., Rey Vega, L., and Piantanida, P. (2018). Compression-based regularization with an application to multitask learning. IEEE Journal of Selected Topics in Signal Processing, 12(5):1063–1076.

Xu, H. and Mannor, S. (2012). Robustness and generalization. Machine Learning, 86(3):391–423.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. CoRR.

Appendix
Notation and conventions
Let $\mathcal{P}(\mathcal{X})$ denote the set of all probability measures $P$ over the set $\mathcal{X}$. The probability mass function (pmf) or, in the case of continuous random variables, the probability density function (pdf) of $X$ is denoted interchangeably by $P_X$ or $p_X$. $\mathrm{Supp}(p_X)$ denotes the support of the distribution, i.e., the closure of $\{x \in \mathcal{X} : p_X(x) > 0\}$. $\mathrm{Vol}(\mathcal{X}) = \int_{\mathcal{X}} dx$ if $\mathcal{X}$ is continuous, and $\mathrm{Vol}(\mathcal{X}) = |\mathcal{X}|$ if $\mathcal{X}$ is discrete. $\|\cdot\|$ denotes the usual Euclidean norm of a vector and $\langle\cdot,\cdot\rangle$ the canonical inner product. We use $\mathbb{E}_p[\cdot]$ and $\mathrm{Var}_p(\cdot)$ to denote the mathematical expectation and variance, respectively, measured with respect to $p$. The information measures used in this work are (Cover and Thomas, 2006): the entropy $H(P_X) := \mathbb{E}_{P_X}[-\log P_X(X)]$; the conditional entropy $H(P_{Y|X}|P_X) := \mathbb{E}_{P_X P_{Y|X}}[-\log P_{Y|X}(Y|X)]$; the relative entropy $D(P_X\|Q_X) := \mathbb{E}_{P_X}\big[\log\frac{P_X(X)}{Q_X(X)}\big]$; the conditional relative entropy $D(P_{Y|X}\|Q_{Y|X}|P_X) := \mathbb{E}_{P_X}\big[D\big(P_{Y|X}(\cdot|X)\,\|\,Q_{Y|X}(\cdot|X)\big)\big]$; and the mutual information $I(P_X; P_{Y|X}) := D(P_{Y|X}\|P_Y|P_X)$. We consider a testing dataset $S_n$ of $n$ samples. We study a deviation bound after training (over $S_n$), so we assume that the training set is given implicitly.

A Proof of Theorem 1
In this Appendix we prove the main result of this work. We will use some well-known results, listed in Appendix C. We are looking for a relationship between the error gap and the mutual information, which is an information-theoretic measure. Information theory in general, and mutual information in particular, has many known results, a large fraction of them for discrete spaces. For example, Shamir et al. (2010) bound the deviation of the mutual information between labels $Y$ and hidden representations $U$ through the mutual information between hidden representations and inputs $X$ with discrete alphabets. However, in most learning problems it is more appropriate to consider a continuous alphabet $\mathcal{X}$. In order to use those discrete results, our first step is to analyze the error introduced by a reasonable discretization. This approach has points in common with the robust-algorithms theory of Xu and Mannor (2012).

Lemma 1.
Let the $|\mathcal{Y}|$ partitions of $\mathcal{X}$ and their respective centroids, $\big(\{\mathcal{K}^{(y)}_k\}_{k=1}^{K}, \{x^{(k,y)}\}_{k=1}^{K}\big)$ for each $y\in\mathcal{Y}$, be functions of the natural number $K$, where for each $y\in\mathcal{Y}$ the partitions satisfy $\bigcup_{k=1}^{K}\mathcal{K}^{(y)}_k = \mathcal{X}$, $\mathcal{K}^{(y)}_i\cap\mathcal{K}^{(y)}_j = \emptyset$ for all $1\le i<j\le K$, and $\int_{\mathcal{K}^{(y)}_k} dx > 0$ for all $1\le k\le K$. Then the error gap (4) can be bounded as
$$\mathcal{E}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n) \le 2\epsilon(K) + \mathcal{E}^{D}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$$
almost surely, where $\epsilon(K)$ was defined in (7) and $\mathcal{E}^{D}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ is defined as
$$\mathcal{E}^{D}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n) = \Big|\mathcal{L}^{D}(q_{U|X}, Q_{\hat{Y}|U}) - \mathcal{L}^{D}_{\mathrm{emp}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)\Big|, \qquad (15)$$
where
$$\mathcal{L}^{D}(q_{U|X}, Q_{\hat{Y}|U}) = \sum_{k=1}^{K}\sum_{y\in\mathcal{Y}} P^{D}_{XY}(k, y)\,\ell\big(x^{(k,y)}, y\big), \qquad (16)$$
$$\mathcal{L}^{D}_{\mathrm{emp}}(q_{U|X}, Q_{\hat{Y}|U}, S_n) = \frac{1}{n}\sum_{k=1}^{K}\ \sum_{\substack{i\in[1:n]\\ x_i\in\mathcal{K}^{(y_i)}_k}} \ell\big(x^{(k,y_i)}, y_i\big), \qquad (17)$$
and $P^{D}_{XY} : \{x^{(k,y)} : k\in[1:K],\ y\in\mathcal{Y}\}\times\mathcal{Y}\to[0,1]$ is
$$P^{D}_{XY}(x, y) = \sum_{k=1}^{K}\mathbb{1}\big\{x = x^{(k,y)}\big\}\int_{\mathcal{K}^{(y)}_k} p_{XY}(x', y)\,dx'. \qquad (18)$$
On the one hand, $P^{D}_{XY}$ is a pmf such that $P_Y$ (the true marginal) is its marginal over $\mathcal{Y}$. On the other hand, the other marginal $P^{D}_{X}(x) = \sum_{y\in\mathcal{Y}} P^{D}_{XY}(x, y)$ has the elements of the set $\mathcal{A} = \{x^{(k,y)} : 1\le k\le K,\ y\in\mathcal{Y}\}$ as atoms.

Proof.
Triangle inequality allow to relate error gaps as, E gap ( q U | X , Q ˆ Y | U , S n ) ≤ (cid:12)(cid:12)(cid:12) L D ( q U | X , Q ˆ Y | U ) − L D emp ( q U | X , Q ˆ Y | U , S n ) (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) L ( q U | X , Q ˆ Y | U ) − L D ( q U | X , Q ˆ Y | U ) (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) L D emp ( q U | X , Q ˆ Y | U , S n ) − L emp ( q U | X , Q ˆ Y | U , S n ) (cid:12)(cid:12)(cid:12) , (19)where the last term is the discrete error gap E D gap ( q U | X , Q ˆ Y | U , S n ) . The other terms in (19) can bebounded using the fact that P DX ( x ( k,y ) ) = P (cid:16) X ∈ K ( Y ) k , Y = y (cid:17) and definition of (cid:15) ( K ) (7): (cid:12)(cid:12)(cid:12) L ( q U | X , Q ˆ Y | U ) − L D ( q U | X , Q ˆ Y | U ) (cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) K (cid:88) k =1 |Y| (cid:88) y =1 P DXY ( x ( k,y ) , y ) (cid:16) E (cid:104) (cid:96) ( X, Y ) | Y = y, X ∈ K ( Y ) k (cid:105) − (cid:96) ( x ( k,y ) , y ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (20) ≤ (cid:15) ( K ) (21) (cid:12)(cid:12)(cid:12) L D emp ( q U | X , Q ˆ Y | U , S n ) − L emp ( q U | X , Q ˆ Y | U , S n ) (cid:12)(cid:12)(cid:12) = 1 n (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) K (cid:88) k =1 (cid:88) i ∈ [1: n ] x i ∈K k (cid:104) (cid:96) ( x ( k,y i ) , y i ) − (cid:96) ( x i , y i ) (cid:105)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (22) ≤ (cid:15) ( K ) . (23)The second step in our proof is to measure the decoupling between the encoder and decoder, i.e. whatis the error when considering Q DY | U (9) as decoder. The following lemma separates the decoder term. Lemma 2.
The discrete error gap can be bounded as
$$\mathcal{E}^{D}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n) \le \mathcal{E}^{D}_{\mathrm{gap}}(q_{U|X}, Q^{D}_{Y|U}, S_n) + d(S_n) \qquad (24)$$
almost surely, where
$$d(S_n) = \left|\frac{1}{n}\sum_{k=1}^{K}\ \sum_{\substack{i\in[1:n]\\ x_i\in\mathcal{K}_k}} T\big(x^{(k,y_i)}, y_i\big) - D\big(Q^{D}_{Y|U}\,\big\|\,Q_{\hat{Y}|U}\,\big|\,q^{D}_{U}\big)\right| \qquad (25)$$
and $T(x,y) := \mathbb{E}_{q_{U|X}}\!\left[\log\!\left(\frac{Q^{D}_{Y|U}(y|U)}{Q_{\hat{Y}|U}(y|U)}\right)\,\middle|\,X=x\right]$. Note that $\mathbb{E}_{P^{D}_{XY}}[T(X,Y)] = D\big(Q^{D}_{Y|U}\,\|\,Q_{\hat{Y}|U}\,|\,q^{D}_{U}\big)$, so $d(S_n)$ is a deviation of $T(X,Y)$.

Proof.
It is immediate to see that: (cid:96) (cid:0) q U | X ( ·| x ) , Q ˆ Y | U ( y |· ) (cid:1) = (cid:96) (cid:0) q U | X ( ·| x ) , Q DY | U ( y |· ) (cid:1) + T ( x, y ) (26)So, taking expectation (as (16) and (17)) and using triangle inequality, we can prove the lemma.The third step of our proof is to bound the new gap E D gap ( q U | X , Q DY | U , S n ) . For that, we useempirical distributions ˆ P DXY , ˆ P DX , ˆ P Y , ˆ P DX | Y as the occurrence rate of S n ; e.g. ˆ P DXY ( k, y ) = | (cid:110) ( x i ,y i ) ∈S n : y i = y, x i ∈K ( y ) k (cid:111) | n . Also, we define the empirical distributions generated from the en-coder ˆ q DU , ˆ Q DY | U , ˆ q DU | Y generated from the encoder; e.g. q DU ( u ) = (cid:80) x ∈A q U | X ( u | x ) ˆ P DX ( x ) .12 emark The marginal P Y and its empirical pmf ˆ P Y has not got superscript D because it matcheswith the true pmf (without quantization). Lemma 3.
The gap E D gap ( q U | X , Q DY | U , S n ) can be bounded as, E D gap ( q U | X , Q DY | U , S n ) ≤ D (cid:16) ˆ P DXY (cid:13)(cid:13) P DXY (cid:17) + (cid:90) U φ (cid:18)(cid:13)(cid:13)(cid:13) P DX − ˆP DX (cid:13)(cid:13)(cid:13) (cid:113) V (cid:0) q U | X ( u |· ) (cid:1)(cid:19) du + log (cid:18) Vol ( U ) P Y ( y min ) (cid:19) (cid:112) |Y| (cid:13)(cid:13)(cid:13) P Y − ˆP Y (cid:13)(cid:13)(cid:13) + O (cid:16) (cid:107) P Y − ˆP Y (cid:107) (cid:17) + E P Y (cid:104) (cid:90) U φ (cid:18)(cid:13)(cid:13)(cid:13) P DX | Y ( ·| Y ) − ˆP DX | Y ( ·| Y ) (cid:13)(cid:13)(cid:13) (cid:113) V (cid:0) q U | X ( u |· ) (cid:1)(cid:19) du (cid:105) (27) almost surely, where φ ( · ) is defined in (64) and V ( c ) := (cid:107) c − ¯ c a (cid:107) , (28) with c ∈ R a , a ∈ N , ¯ c = a (cid:80) ai =1 c i , and a is the vector of ones of length a . Notation P Y means the pmf P Y as a vector P Y = [ P Y (1) , · · · , P Y ( |Y| )] , so we can apply it vectornorms (cid:107) · (cid:107) and V ( · ) operator. This operator measures the dispersion of the components of ¯ a aroundthe mean, where V ( c ) ≤ (cid:107) c − b a (cid:107) , ∀ b ∈ R . Proof.
Adding and subtracting ˆ P DXY ( x ( k,y ) , y ) E q U | X (cid:20) log (cid:18) Q DY | U ( y | U ) (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) X = x ( k,y ) (cid:21) we can provevia triangle inequality that E D gap ( q U | X , Q DY | U , S n ) (29) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:88) ∀ ( k,y ) (cid:104) P DXY ( x ( k,y ) , y ) − ˆ P DXY ( x ( k,y ) , y ) (cid:105) E q U | X (cid:34) log (cid:32) Q DY | U ( y | U ) (cid:33)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X = x ( k,y ) (cid:35)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12) H ( Q DY | U | q DU ) − H ( ˆ Q DY | U | ˆ q DU ) (cid:12)(cid:12)(cid:12) + D (cid:16) ˆ Q DY | U (cid:13)(cid:13) Q DY | U (cid:12)(cid:12) ˆ q DU (cid:17) . (30)We can bound the second term in (30) using the inequality: D (cid:16) ˆ Q DY | U (cid:13)(cid:13) Q DY | U (cid:12)(cid:12) ˆ q DU (cid:17) ≤ D (cid:16) ˆ Q DY | U ˆ q DU (cid:13)(cid:13) Q DY | U q DU (cid:17) ≤ D (cid:16) ˆ P DXY (cid:13)(cid:13) P DXY (cid:17) . (31)The first term of (30) can be bounded as: (cid:12)(cid:12)(cid:12) H ( Q DY | U | q DU ) −H ( ˆ Q DY | U | ˆ q DU ) (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12) H ( P Y ) − H ( ˆ P Y ) (cid:12)(cid:12)(cid:12) + (cid:12)(cid:12) H d ( q DU ) −H d (ˆ q DU ) (cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) H d ( q DU | Y | P Y ) − H d (ˆ q DU | Y | ˆ P Y ) (cid:12)(cid:12)(cid:12) , (32)where H d is the differential entropy when U is continuous and the classical entropy when U isdiscrete. The terms (cid:12)(cid:12) H d ( q DU ) −H d (ˆ q DU ) (cid:12)(cid:12) and (cid:12)(cid:12)(cid:12) H d ( q DU | Y | P Y ) − H d (ˆ q DU | Y | ˆ P Y ) (cid:12)(cid:12)(cid:12) can be bounded byLemmas 9 and 10 respectively. Finally, it is clear that P Y (cid:55)→ H ( P Y ) is differentiable and a first orderTaylor expansion yields: H ( P Y ) − H ( ˆ P Y ) = (cid:28) ∂ H ( P Y ) ∂ P Y , P Y − ˆP Y (cid:29) + O (cid:16) (cid:107) P Y − ˆP Y (cid:107) (cid:17) , (33)where ∂ H ( P Y ) ∂P Y ( y ) = − log P Y ( y ) − for each y ∈ Y . Then, applying Cauchy-Schwartz inequality thelemma was proved: (cid:12)(cid:12)(cid:12) H ( P Y ) −H ( ˆ P Y ) (cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:68) log P Y , P Y − ˆP Y (cid:69)(cid:12)(cid:12)(cid:12) + O (cid:16) (cid:107) P Y − ˆP Y (cid:107) (cid:17) (34) ≤ (cid:107) log P Y (cid:107) (cid:13)(cid:13)(cid:13) P Y − ˆP Y (cid:13)(cid:13)(cid:13) + O (cid:16) (cid:107) P Y − ˆP Y (cid:107) (cid:17) (35) ≤ log (cid:18) P Y ( y min ) (cid:19) (cid:112) |Y| (cid:13)(cid:13)(cid:13) P Y − ˆP Y (cid:13)(cid:13)(cid:13) + O (cid:16) (cid:107) P Y − ˆP Y (cid:107) (cid:17) . (36)13he combination of lemmas 1, 2 and 3 allow to bound the error gap with probability one. The fourthstep is to use a concentration inequality over the terms: D (cid:0) ˆ P DXY (cid:107) P DXY (cid:1) , (cid:107) P DX − ˆP DX (cid:107) , (cid:107) P Y − ˆP Y (cid:107) , (cid:107) P DX | Y ( ·| y ) − ˆP DX | Y ( ·| y ) (cid:107) for y ∈ Y and d ( S n ) simultaneously. Lemma 7 guarantees that thebounds hold simultaneously over all these |Y| + 4 quantities, by replacing δ with δ/ ( |Y| + 4) . Withprobability at least − δ , we apply Lemmas 6, 8 and Chebyshev inequality Devroye et al. (1997): D (cid:16) ˆ P DXY (cid:107) P DXY (cid:17) ≤ |X ||Y| log( n + 1) n + 1 n log (cid:18) |Y| + 4 δ (cid:19) = O (cid:18) log( n ) n (cid:19) . 
(37)
$$\max\Big\{\|\mathbf{P}_Y-\hat{\mathbf{P}}_Y\|,\ \|\mathbf{P}^{D}_X-\hat{\mathbf{P}}^{D}_X\|,\ \|\mathbf{P}^{D}_{X|Y}(\cdot|y)-\hat{\mathbf{P}}^{D}_{X|Y}(\cdot|y)\|\Big\} \le \frac{\sqrt{\log\big(\frac{|\mathcal{Y}|+4}{\delta}\big)}}{\sqrt{n}} \equiv \frac{B_\delta}{\sqrt{n}}, \qquad (38)$$
$$d(S_n) \le \sqrt{\frac{|\mathcal{Y}|+4}{n\,\delta}}\ \sqrt{\mathrm{Var}_{P^{D}_{XY}}\big(T(X,Y)\big)}. \qquad (39)$$
In order to bound the last variance, we state the following lemma.

Lemma 4.
Variance of T random variable can be bounded asVar P DXY ( T ( X, Y )) ≤ (cid:113) Q ˆ Y | U ( y min | u min ) D HL (cid:16) Q DY | U (cid:107) Q ˆ Y | U | q DU (cid:17) , (40) where h is the Hellinger distance (6) .Proof. A similar result can be founded Ghosal et al. (2000). Variance can be bounded asVar P DXY ( T ( X, Y )) =
Var P DXY (cid:32) E q U | X (cid:34) log (cid:32) Q DY | U ( Y | U ) Q ˆ Y | U ( Y | U ) (cid:33)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X (cid:35)(cid:33) (41) ≤ Var Q DY | U q DU (cid:32) log (cid:32) Q DY | U ( Y | U ) Q ˆ Y | U ( Y | U ) (cid:33)(cid:33) (42) ≤ E Q DY | U q DU (cid:34) log (cid:32) Q DY | U ( Y | U ) Q ˆ Y | U ( Y | U ) (cid:33)(cid:35) . (43)Note that Q ˆ Y | U ( y min | u min ) ≥ sup (cid:26) Q DY | U ( y | u ) Q ˆ Y | U ( y | u ) , (cid:27) . For every c ≤ and x ≥ c we have the inequality x ≤ e − c/ (cid:0) e x/ − (cid:1) . So call c = log Q ˆ Y | U ( y min | u min ) and x = log Q ˆ Y | U ( y | u ) Q DY | U ( y | u ) , the inequalitycan be written as: log Q DY | U ( y | u ) Q ˆ Y | U ( y | u ) ≤ (cid:113) Q ˆ Y | U ( y min | u min ) (cid:118)(cid:117)(cid:117)(cid:116) Q ˆ Y | U ( y | u ) Q DY | U ( y | u ) − . (44)Taking expectation term by term over q DU Q DY | U the proof is over.We generate the following bound for the error gap using concentration inequalities (37), (38) and(39). Lemma 5.
For every δ ∈ (0 , , with probability at least − δ over the choice of S n ∼ p XY , thegap satisfies: E gap ( q U | X , Q ˆ Y | U , S n ) ≤ inf K ∈ N (cid:15) ( K ) + A δ (cid:113) I ( P DX ; q U | X ) · log( n ) √ n r ( K )+ D δ · D HL (cid:16) Q DY | U (cid:107) Q ˆ Y | U | q DU (cid:17) + C δ √ n + O (cid:18) log( n ) n (cid:19) , (45) ∀ ( q U | X , Q ˆ Y | U ) that meets Assumptions 1. roof. We bound the error gap using lemmas 1, 2, 3, 4 and 11, with probability at least − δ : E gap ( q U | X , Q ˆ Y | U , S n ) ≤ (cid:15) ( K ) + D δ √ n D HL (cid:0) Q DY | U (cid:107) Q ˆ Y | U | q DU (cid:1) + log (cid:18) Vol ( U ) P Y ( y min ) (cid:19) (cid:112) |Y| B δ √ n + 2 (cid:90) U φ (cid:32) B δ √ n (cid:114) V (cid:16) q U | X ( u |· ) (cid:17)(cid:33) du + O (cid:18) log( n ) n (cid:19) (46) ≤ (cid:15) ( K ) + D δ √ n D HL (cid:0) Q DY | U (cid:107) Q ˆ Y | U | q DU (cid:1) + log (cid:18) Vol ( U ) P Y ( y min ) (cid:19) (cid:112) |Y| B δ √ n + log( n ) √ n B δ (cid:90) U (cid:113) V (cid:0) q U | X ( u |· ) (cid:1) du + 2 Vol ( U ) e − √ n + O (cid:18) log( n ) n (cid:19) . (47)We relate the mutual information I ( P DX ; q U | X ) with (cid:82) U (cid:113) V (cid:0) q U | X ( u |· ) (cid:1) du . Its proof follows froman application of Pinsker’s inequality (Cover and Thomas, 2006, Lemma 11.6.1) (cid:107) P − P (cid:107) ≤ D ( P (cid:107) P ) and the fact that V ( c ) ≤ (cid:107) c − b a (cid:107) , ∀ b ∈ R : (cid:113) V (cid:0) q U | X ( u |· ) (cid:1) ≤ (cid:115)(cid:88) x ∈A (cid:2) q U | X ( u | x ) − q DU ( u ) (cid:3) (48) = q DU ( u ) (cid:118)(cid:117)(cid:117)(cid:116)(cid:88) x ∈A (cid:34) Q DX | U ( x | u ) P DX ( x ) − (cid:35) (49) ≤ q DU ( u ) (cid:88) x ∈A (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Q DX | U ( x | u ) P DX ( x ) − (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (50) = q DU ( u ) (cid:88) x ∈A P DX ( k ) (cid:12)(cid:12)(cid:12) Q DX | U ( x | u ) − P DX ( x ) (cid:12)(cid:12)(cid:12) (51) ≤ √ r ( K ) · q DU ( u ) (cid:114) D (cid:16) Q DX | U ( ·| u ) (cid:107) P DX (cid:17) , (52)where Q DX | U ( k | u ) = q U | X ( u | x ( k ) ) P DX ( k ) q DU ( u ) . So, using Jensen inequality (cid:90) U (cid:113) V (cid:0) q U | X ( u |· ) (cid:1) du ≤ √ r ( K ) E q DU (cid:34)(cid:114) D (cid:16) Q DX | U ( ·| u ) (cid:107) P DX (cid:17)(cid:35) (53) ≤ √ · r ( K ) · (cid:113) I ( P DX ; q U | X ) , (54)for all K ∈ N . Finally the lemma is proved taking the infimum over K .Finally, using the identity I ( p X ; q U | X ) = I ( p XY ; q U | X ) and the fact that the k where a sample x belongs is a deterministic function of x and y , we can use the Data Processing Inequality Coverand Thomas (2006) I ( P DX ; q U | X ) ≤ I ( p XY ; q U | X ) = I ( p X ; q U | X ) . With this result the lemma 5becomes in our Theorem 1. Remark It is worth mentioning the differences between our result and that presented in Shamiret al. (2010). While we work with the cross-entropy gap making appear several terms in lemma 3,their only bounded the mutual information gap: |I (cid:0) q DU ; Q DY | U (cid:1) − I (cid:0) ˆ q DU ; ˆ Q DY | U (cid:1) |≤ (cid:12)(cid:12) H d ( q DU ) − H d (ˆ q DU ) (cid:12)(cid:12) + (cid:12)(cid:12)(cid:12) H d ( q DU | Y | P Y ) − H d (ˆ q DU | Y | ˆ P Y ) (cid:12)(cid:12)(cid:12) . (55) For this reason we had to use more concentration inequalities including (37) and Chebyshev, whiletheir work only uses Lemma 8. In addition, our final mutual information is expressed in terms ofcontinuous pdf, while theirs is a function of the discrete pmf. 
Finally, some constants were subtly reduced, and we extend the result to continuous representations $U$.

B Experimental Details and Other Results
In this section we give the details of the simulations in Section 4 and present new simulations using the CIFAR-10 dataset (Krizhevsky, 2009). We sample two different random subsets, one of MNIST (the standard dataset of handwritten digits) and one of CIFAR-10 (natural images), to use as training sets of the same small size. It is important to emphasize the main difference between the datasets: while MNIST poses a rather simple task, CIFAR-10 is a more difficult one, especially with a small number of samples and without convolutional DNNs. From this observation, we would expect that in the latter case the classifier will not be trained well enough to reach a performance close to optimal. We approximate $\mathcal{L}(q_{U|X}, Q_{\hat{Y}|U})$ with a large held-out dataset and $\mathcal{L}_{\mathrm{emp}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ with different independent mini-testing datasets, i.e., using the rest of the features. The $0.9$-quantile of $\mathcal{E}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ ($\delta = 0.1$) is computed from the different values of the testing risk. The values reported in each simulation are the average of three independent runs, choosing at random, in each case, different sets. Below are the architectures used in each setup:
• Normal encoder:
The DNNs used were a feed-forward layer of hidden units with ReLU activation followed by another linear layer for each parameter ($\mu$ and $\log\sigma^2$); i.e., each parameter, $\mu$ and $\log\sigma^2$, is produced by a two-layer network whose first layer is shared. We train with a fixed learning rate and batch size for a fixed number of epochs. The cost function considered during the training phase was of the form
$$\mathcal{L}_{\mathrm{emp}}(q_{U|X}, Q_{\hat{Y}|U}, \mathcal{D}_l) + \lambda \sum_{j=1}^{m} \frac{1}{l} \sum_{i=1}^{l} D\big(q_{U_j|X}(\cdot|x_i)\,\big\|\,\tilde{q}_{U_j}\big), \qquad (56)$$
where $\lambda$ is the regularization Lagrange multiplier and $\mathcal{D}_l$ is the $l$-sample training dataset (a schematic implementation of this objective is sketched after this list).
• LogNormal encoder:
The DNNs used for $f(x)$ had a feed-forward structure with two layers of hidden units with soft-plus activation, and for $\alpha(x)$ a feed-forward layer of hidden units with a sigmoid activation multiplied by a constant smaller than one, so that the maximum variance of the log-normal error distribution is bounded, following Achille and Soatto (2016). We train with a fixed learning rate and batch size for a fixed number of epochs. The cost function is the same as in (56).
• RBM encoder:
Eq. (14) is difficult to use as a regularizer, even when training with the contrastive divergence learning procedure (Hinton, 2002). Instead, we rely on the usual RBM regularization: weight decay, a traditional way to improve the generalization capacity. We explore the effect of the Lagrange multiplier $\lambda$, the so-called weight-cost, on both the gap and the mutual information. This meta-parameter controls the gradient weight decay, i.e., the cost function can be written as
$$\mathrm{CD}_{\mathrm{RBM}} + \lambda \|\mathbf{W}\|_F^2, \qquad (57)$$
where $\mathrm{CD}_{\mathrm{RBM}}$ is the classical unsupervised RBM cost function trained with the contrastive divergence learning procedure (Hinton, 2012), and $\mathbf{W}$ is the matrix that has $w_j$, $j \in [1:m]$, as columns. In order to compute the gap we add to the output of the last RBM layer a soft-max regression decoder trained separately. Several authors have combined RBMs with soft-max regression, among them Hinton et al. (2006), Srivastava et al. (2013) and Chopra and Yadav (2018). Following suggestions from Hinton (2012), we study weight-costs $\lambda$ in the range recommended there. We use fixed learning rates, batch size and number of hidden units, train for a fixed number of epochs, and start with a small momentum that is increased after the first epochs.
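As an illustration of the regularized objective (56) used for the Normal encoder (a sketch under our own assumptions, not the authors' code), written with PyTorch and a single reparameterized sample per input:

```python
import torch
import torch.nn.functional as F

def regularized_loss(logits, labels, mu, log_var, lam):
    """Training objective in the spirit of Eq. (56) for the Normal encoder:
    empirical cross-entropy plus lam * sum_j of the average KL(q_{U_j|X} || N(0,1)).
    `logits` are decoder outputs for U sampled with the reparameterization trick,
    `mu`, `log_var` are the encoder outputs for the current mini-batch."""
    ce = F.cross_entropy(logits, labels)                              # empirical cross-entropy
    kl_per_dim = 0.5 * (-log_var + log_var.exp() + mu.pow(2) - 1.0)   # KL to N(0,1), (batch, m)
    return ce + lam * kl_per_dim.mean(dim=0).sum()                    # average over batch, sum over units

# Reparameterization trick used to produce the decoder input:
# u = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
```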
The following subsections present new experiments using the CIFAR-10 dataset: a comparison of the behavior of the error gap and the mutual information, and an example of the tradeoff between $\epsilon(K)$ and $r(K)$.

B.1 Comparison between the behavior of the error gap and mutual information

In this section we repeat the simulations of Section 4 for the CIFAR-10 dataset, testing with undisturbed images. Figures 5a, 5b and 5c show the $0.9$-quantile of $\mathcal{E}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ and the mutual information variational bound (10) as a function of the Lagrange multiplier $\lambda$ for the Normal encoder, the logNormal encoder and the RBM encoder, respectively. In these cases, the mutual information and the error gap have behaviors closer to each other than for MNIST.

Figure 5: Comparison between the $0.9$-quantile of $\mathcal{E}_{\mathrm{gap}}(q_{U|X}, Q_{\hat{Y}|U}, S_n)$ (solid) and the mutual information variational bound (10) (dashed) for the CIFAR-10 dataset and: (a) Normal encoder, (b) LogNormal encoder, (c) RBM encoder.

B.2 Discretization Tradeoff with CIFAR-10 dataset
We use the setup presented in Section 4.1 to implement numerically this tradeoff between $\epsilon(K)$ and $r(K)$ through a K-means-like algorithm. For every $K$, we iterate between:
• Sample coloring: given the loss centroids $\{\ell(x^{(k)}, y)\}_{k=1}^{K}$ for each $y \in \mathcal{Y}$, we assign $x_i \in \mathcal{K}_{k_i}$, where $k_i$ is computed as
$$k_i = \arg\min_{1 \le k \le K}\ \max_{y \in \mathcal{Y}} \Big| \ell(x_i, y) - \ell\big(x^{(k)}, y\big) \Big|; \qquad (58)$$
• Centroid update: we compute, for each $k$ and each $y \in \mathcal{Y}$,
$$\ell\big(x^{(k)}, y\big) = \frac{1}{|\{i : x_i \in \mathcal{K}_k\}|} \sum_{i : x_i \in \mathcal{K}_k} \ell(x_i, y). \qquad (59)$$
After that, we estimate $\epsilon(K)$ and $r(K)$ as
$$\epsilon(K) = \max_{1 \le i \le n}\ \max_{y \in \mathcal{Y}} \Big| \ell(x_i, y) - \ell\big(x^{(k_i)}, y\big) \Big|, \qquad r(K) = \frac{1}{\min_{1 \le k \le K} |\{i : x_i \in \mathcal{K}_k\}| / n}, \qquad (60)$$
using a subset of CIFAR-10. This algorithm focuses on minimizing $\epsilon(K)$ within the tradeoff, and this function is quite sensitive to the samples (it only depends on the maximum). Fig. 6 shows a noisy tradeoff, which achieves its minimum at $K = 16$.
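A compact NumPy sketch (ours) of the alternating procedure above; `loss_matrix` is a hypothetical array holding $\ell(x_i, y)$ for every sample $i$ and label $y$, and empty cells are simply skipped.

```python
import numpy as np

def epsilon_r_tradeoff(loss_matrix, K, n_iter=20, seed=0):
    """Estimate epsilon(K) and r(K) via the coloring / centroid iterations (58)-(60).
    loss_matrix: array of shape (n, |Y|) with entries l(x_i, y)."""
    loss_matrix = np.asarray(loss_matrix, dtype=float)
    rng = np.random.default_rng(seed)
    n = loss_matrix.shape[0]
    centroids = loss_matrix[rng.choice(n, size=K, replace=False)]   # initial loss centroids, (K, |Y|)
    for _ in range(n_iter):
        # coloring step (58): assign each sample to the centroid with the smallest worst-case difference
        diff = np.abs(loss_matrix[:, None, :] - centroids[None, :, :]).max(axis=2)   # (n, K)
        assign = diff.argmin(axis=1)
        # centroid step (59): average the losses of the samples assigned to each cell
        for k in range(K):
            if np.any(assign == k):
                centroids[k] = loss_matrix[assign == k].mean(axis=0)
    eps = np.abs(loss_matrix - centroids[assign]).max()              # epsilon(K), Eq. (60)
    counts = np.bincount(assign, minlength=K)
    r = 1.0 / (counts[counts > 0].min() / n)                         # r(K), Eq. (60), nonempty cells
    return float(eps), float(r)
```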
C Auxiliary Results

In this appendix, some auxiliary facts used in the proofs of the main results are listed without proof.
Lemma 6 (Theorem 11.2.1 (Cover and Thomas, 2006)) . Let P ∈ P ( X ) be a discrete probabilitydistribution and let ˆ P be its empirical estimation over a n -data set S n . Then, D ( ˆ P (cid:107) P ) ≤ |X | log( n + 1) n + 1 n log(1 /δ ) (61) with probability at least − δ over the choice of S n . We compute the loss centroid inside the true centroid x ( k ) . ( K ) + A / p I ( X ; U ) " l og n p n " r ( K ) Figure 6: Tradeoff between (cid:15) ( K ) and r ( K ) vs the number of cells, highlighting its minimum value,in a Normal Encoder example with CIFAR-10 dataset. Lemma 7 (Union bound application) . Let {A k } mk =1 events such that Pr( A k ) ≥ − δ for each k ∈ [1 : m ] . Then, P ( (cid:84) mk =1 A k ) ≥ − δm . Lemma 8 (Application of McDiarmid’s Inequality to the vector probability) . Let P ∈ P ( X ) beany probability distribution and let ˆ P be its empirical estimation over a n -data set S n . Then, withprobability at least − δ over S n : (cid:107) P − ˆP (cid:107) ≤ (cid:112) log(1 /δ ) √ n . (62) Lemma 9 (Adaptation of Shamir et al. (2010)) . Consider the encoder given by q U | X . We have (cid:12)(cid:12) H d ( q DU ) − H d (ˆ q DU ) (cid:12)(cid:12) ≤ (cid:90) U φ (cid:18) (cid:107) P X − ˆP X (cid:107) (cid:113) V (cid:0) q U | X ( u |· ) (cid:1)(cid:19) du (63) with φ ( x ) = x ≤ − x log( x ) 0 < x < e − e − x ≥ e − (64) for (cid:107) P X − ˆP X (cid:107) not so small. Lemma 10 (Adaptation of Shamir et al. (2010)) . Let U a compact space. Consider the encoder q U | X , then (cid:12)(cid:12)(cid:12) H d ( q DU | Y | P Y ) − H d (ˆ q DU | Y | ˆ P Y ) (cid:12)(cid:12)(cid:12) ≤ (cid:107) P Y − ˆP Y (cid:107) (cid:112) |Y| log Vol ( U ) (65) + E P Y (cid:20)(cid:90) U φ (cid:18)(cid:13)(cid:13)(cid:13) P X | Y ( ·| Y ) − ˆP X | Y ( ·| Y ) (cid:13)(cid:13)(cid:13) (cid:113) V (cid:0) q U | X ( u |· ) (cid:1)(cid:19) du (cid:21) , for max y (cid:107) P X | Y ( ·| y ) − ˆP X | Y ( ·| y ) (cid:107) not so small Lemma 11 (Shamir et al. (2010)) . Let n ≥ a e , then φ (cid:16) a √ n (cid:17) ≤ a n ) √ n + e − √ n ..