The Role of Mutual Information in Variational Classifiers
Matias Vera
CSC, CONICET, Buenos Aires, Argentina
[email protected]

Leonardo Rey Vega
Facultad de Ingeniería, Universidad de Buenos Aires, and CSC, CONICET, Buenos Aires, Argentina
[email protected]

Pablo Piantanida
Laboratoire des Signaux et Systèmes, CentraleSupélec, CNRS, Université Paris-Saclay, Gif-sur-Yvette, France
[email protected]
October 23, 2020

Abstract
Overfitting data is a well-known phenomenon related to the generation of a model that mimics too closely (or exactly) a particular instance of data, and may therefore fail to predict future observations reliably. In practice, this behaviour is controlled by various regularization techniques, some of them heuristic, which are motivated by upper bounds to the generalization error. In this work, we study the generalization error of classifiers relying on stochastic encodings trained on the cross-entropy loss, which is often used in deep learning for classification problems. We derive bounds to the generalization error showing that there exists a regime where the generalization error is bounded by the mutual information between input features and the corresponding representations in the latent space, which are randomly generated according to the encoding distribution. Our bounds provide an information-theoretic understanding of generalization in the so-called class of variational classifiers, which are regularized by a Kullback-Leibler (KL) divergence term. These results give theoretical grounds for the highly popular KL term in variational inference methods that was already recognized to act effectively as a regularization penalty. We further observe connections with well-studied notions such as Variational Autoencoders, Information Dropout, Information Bottleneck and Boltzmann Machines. Finally, we perform numerical experiments on MNIST and CIFAR datasets and show that mutual information is indeed highly representative of the behaviour of the generalization error.
The work of Matias Vera was supported by a CONICET postdoctoral scholarship. The work of Leonardo Rey Vega was supported by grant UBACyT 20020170100470BA from the University of Buenos Aires. The work of Prof. Pablo Piantanida was partially supported by the European Commission's Marie Sklodowska-Curie Actions (MSCA), through the Marie Sklodowska-Curie IF (H2020-MSCA-IF-2017-EF-797805-STRUDEL).
1 Introduction

The major challenge in representation learning is to learn the different explanatory factors of variation behind a given dataset. Learning models are often guided by the objective of optimizing performance on training data, when the real objective is to generalize well to unseen data. The generalization error, i.e., the difference between expected and empirical risk, is the standard measure for quantifying the capacity of an algorithm to generalize learned patterns from seen data to unseen ones (Mohri et al., 2018, Chapter 2). An upper bound that dominates the aforementioned error can often provide an indication of what is a good regularization term for the empirical risk minimization. In this way, investigating the information-theoretic impact of different regularization techniques on the generalization error is a first step in the understanding of these methods. In this work, we investigate upper bounds to the generalization error in terms of the mutual information between input features and latent representations, intended for classifiers relying on stochastic encodings trained on the cross-entropy loss. More precisely, we present a PAC-learning result that analyzes the impact of regularization techniques based on the information shared by the inputs $X$ with the corresponding latent representations $U(X)$ of the algorithm. We observe interesting connections with Variational Autoencoders (VAEs) (Kingma and Welling, 2013) and the so-called Information Bottleneck (IB) method (Tishby et al., 1999), among others.

The main result of the paper can be loosely summarized (for the specific details see Theorem 1) as follows:

$$\big|\, \text{true cross-entropy} - \text{empirical cross-entropy} \,\big| \leq \mathcal{O}\!\left( \sqrt{I\big(U(X);X\big)}\, \frac{\log(n)}{\sqrt{n}} \right) \quad (1)$$

with high probability, where $n$ is the sample size of the training set and $I\big(U(X);X\big)$ indicates the mutual information between the random input features $X$ and the latent representations $U(X)$ generated through a stochastic encoding $Q_{U|X} : \mathcal{X} \to \mathcal{U}$. This result formally motivates the use of the above mutual information as a regularizing term to control the amount of information conveyed by latent representations about the input features, leading to the minimization of the following objective:

$$\text{empirical cross-entropy} + \lambda \cdot I\big(U(X);X\big), \quad \forall \lambda \geq 0, \quad (2)$$

which is commonly implemented via an appropriate (empirical) upper bound to the mutual information, using a Kullback-Leibler (KL) divergence term:

$$I\big(U(X);X\big) \leq \mathbb{E}_{p_X}\Big[ \mathrm{KL}\big( q_{U|X}(\cdot|X) \,\|\, q_U \big) \Big], \quad \forall q_U, \quad (3)$$

according to some known prior (often normal) distribution $q_U$ on the latent representation space. Although a theoretical understanding of the connection between the above mutual information and the generalization of the cross-entropy loss remains elusive in the literature, the impact of the multiplier $\lambda$ together with the KL bound (3) in the training objective (2) has been shown empirically to improve the performance of some deep learning algorithms by Achille and Soatto (2018a); Kingma and Welling (2013), among other works.

Our analysis of the generalization error is framed within the classification problem in a specific framework.
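To fix ideas, the following minimal sketch shows how the objective (2) with the bound (3) is typically assembled in practice for a Gaussian encoder. The module names (`enc_mu`, `enc_sigma`, `dec`) and shapes are illustrative assumptions of this sketch, not the paper's architecture:

```python
import torch
from torch import nn
from torch.distributions import Normal, kl_divergence

def regularized_objective(enc_mu, enc_sigma, dec, x, y, lam):
    # q_{U|X}(.|x): a Gaussian encoder; enc_sigma is assumed to output
    # strictly positive values (e.g., through a softplus).
    q_u_given_x = Normal(enc_mu(x), enc_sigma(x))
    prior = Normal(0.0, 1.0)                     # a known prior q_U
    u = q_u_given_x.rsample()                    # reparameterized sample of U
    ce = nn.functional.cross_entropy(dec(u), y)  # empirical cross-entropy
    # Upper bound (3) on I(U(X); X): average over the batch of the KL to the
    # prior, summed across latent dimensions.
    kl = kl_divergence(q_u_given_x, prior).sum(-1).mean()
    return ce + lam * kl                         # objective (2)
```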
There are two main ingredients that characterize our model: (a) we focus on randomized encodings, which allows us not only to study deterministic algorithms, but also to cover graphical methods such as variational classifiers (Maggipinto et al., 2020) and Restricted Boltzmann Machines (RBMs) (Hinton, 2012); and (b) our analysis assumes the cross-entropy as the loss to be trained. It is worth mentioning some of our motivation behind these assumptions. From a purely mathematical viewpoint, by admitting stochastic encoding rules, the chances of finding predictors that are optimal in some way can only be improved (Yamanishi, 1992). On the other hand, the cross-entropy loss is often used to train state-of-the-art deep learning algorithms for classification problems. Furthermore, the cross-entropy is even employed as a performance metric for deep learning algorithms ending with a soft-max layer (Goodfellow et al., 2016, Section 6.2.2), which not only indicates a hard decision but also grants a notion of how likely each class is (i.e., the so-called calibration). Unfortunately, modern neural networks may be poorly calibrated even when they are accurate (Guo et al., 2017).
The idea of relating the generalization capacity of modern machine learning algorithms to information measures is not new. In the last decade, the topic has received considerable attention (Vincent et al., 2010; Russo and Zou, 2015; Kingma and Welling, 2013; Achille and Soatto, 2018b; Halbersberg et al., 2020). Perhaps one of the most exciting, but also controversial, approaches is the so-called Information Bottleneck (IB) method by Tishby and Zaslavsky (2015). The IB principle postulates that the mutual information between inputs and the corresponding latent representations is related to the overfitting problem, establishing a trade-off between accuracy and a measure of compression which is interpreted as the information complexity. Although the information-theoretic principles are well-grounded, the experimental validation of this trade-off still remains a challenging task because of the underlying difficulties in estimating mutual information with statistical confidence in high-dimensional spaces, as was recently reported by Pichler et al. (2020). Fortunately, in some cases closed-form approximations to the mutual information can be derived, as is the case for VAEs and RBMs, due to the presence of stochastic encodings of latent representations. Statistical rates on the empirical estimates corresponding to IB trade-offs have been reported by Shamir et al. (2010), and a deviation bound related to the cross-entropy loss was reported by Vera et al. (2018a). In Shwartz-Ziv and Tishby (2017), it is empirically shown that deep networks undergo two phases, consisting of an initial fitting phase and a subsequent compression phase of the features, where the last phase is causally related to the well-known generalization performance of deep neural networks. However, subsequent works by Amjad and Geiger (2018); Saxe et al. (2018) report that none of these claims hold true in general. In other words, regularization based on (2) could improve performance, but there is not enough evidence to conclude that the compression phase is responsible for the well-known generalization capabilities of deep neural networks.

The robustness framework defined by Xu and Mannor (2012) provides a novel approach, different from complexity or stability (Bousquet and Elisseeff, 2002) arguments, for studying the performance of learning algorithms in terms of the generalization error. It is shown there that feed-forward neural networks are robust provided that the norm of the weights in each layer is bounded. In this context, Sokolic et al. (2017b) explore the Jacobian matrix of the model as a bound to the generalization error, extending results to convolutional networks as well. Sokolic et al. (2017a) showed that the generalization-error bound of a stable invariant classifier is much smaller than that of a robust non-invariant classifier.
From a different perspective, generalization in deep neural networks was also studied by Neyshabur et al. (2017), where it is shown that some other forms of capacity control, different from network size, play a central role in learning these networks. Also, Zhang et al. (2017) concluded that the effective capacity of several successful neural network architectures is large enough to shatter the training data; consequently, these models are in principle rich enough to memorize the full training data.

The rest of the paper is organized as follows. In Section 2, we introduce the underlying learning model following (Achille and Soatto, 2018b). In Section 3, we explore some of the existing connections with other approaches presented in the literature. In Section 4, we present our main result. Sections 5.1 and 5.2 discuss some results in view of the assumptions made in the main theorem and further discuss the results. Finally, in Section 6 we provide numerical evidence for some selected models, and concluding remarks are relegated to Section 7. Major mathematical details are given in Appendices A to C.
Notation and Conventions
Table 1 presents a dictionary of the most relevant symbols used across this paper.
2 The Learning Model

We are interested in the problem of pattern classification, consisting in the prediction of the unknown class matching an observation. This framework is defined by three main elements: (1) the source model that defines the data probability law; (2) the representation space of hidden units, which allows us to divide the decision rule into an encoder and a decoder; and (3) the class model defined by the architecture of the algorithm, which determines the set of possible solutions.

1. Source model: Let $(\mathcal{Y}, \mathcal{B}_\mathcal{Y})$ and $(\mathcal{X}, \mathcal{B}_\mathcal{X})$ be two measurable spaces, where $\mathcal{Y}$ is discrete with $|\mathcal{Y}| < \infty$ and $\mathcal{X} \subseteq \mathbb{R}^{d_x}$, and let $P_{X|Y}(\cdot|y)$, $y \in \mathcal{Y}$, be a collection of probability distributions such that for every $y \in \mathcal{Y}$, $P_{X|Y}(\cdot|y)$ is a probability measure on $(\mathcal{X}, \mathcal{B}_\mathcal{X})$, and for every fixed $B \in \mathcal{B}_\mathcal{X}$, $P_{X|Y}(B|\cdot)$ is measurable on $(\mathcal{Y}, \mathcal{B}_\mathcal{Y})$ with respect to $P_Y$. $P_Y$ and $P_{X|Y}(\cdot|y)$ induce the probability measure $P_{XY}$ on $(\mathcal{X} \times \mathcal{Y}, \mathcal{B}_\mathcal{X} \times \mathcal{B}_\mathcal{Y})$, where for every $y \in \mathcal{Y}$ we assume that there exists a corresponding probability density function, denoted by $p_{XY}(x,y)$, with respect to the usual Lebesgue measure on $\mathcal{X}$. (This assumption is not strictly needed and is only made for simplicity.)

2. Representation model: Let $(\mathcal{U}, \mathcal{B}_\mathcal{U})$ be a measurable space with representation space $\mathcal{U} \subseteq \mathbb{R}^{d_u}$, and let $Q_{U|X}(u|x)$, $x \in \mathcal{X}$, be a collection of probability distributions such that for every $x \in \mathcal{X}$, $Q_{U|X}(\cdot|x)$ is a probability measure on $(\mathcal{U}, \mathcal{B}_\mathcal{U})$, and for every fixed $B \in \mathcal{B}_\mathcal{U}$, $Q_{U|X}(B|\cdot)$ is measurable on $(\mathcal{X}, \mathcal{B}_\mathcal{X})$ with respect to $P_X$. Finally, let $Q_{\hat{Y}|U}(y|u)$, $u \in \mathcal{U}$, be a collection of probability distributions such that for every $u \in \mathcal{U}$, $Q_{\hat{Y}|U}(\cdot|u)$ is a probability measure on $(\mathcal{Y}, \mathcal{B}_\mathcal{Y})$, and for every fixed $B \in \mathcal{B}_\mathcal{Y}$, $Q_{\hat{Y}|U}(B|\cdot)$ is measurable on $(\mathcal{U}, \mathcal{B}_\mathcal{U})$ with respect to $P_U$. Typically, the distribution $Q_{U|X}(\cdot|x)$ has the appropriate regularity conditions in order to have a density $q_{U|X}(\cdot|x)$ with respect to the Lebesgue measure on $\mathcal{U}$ for almost every $x \in \mathcal{X}$. The above definitions induce a stochastic decision rule $Q_{\hat{Y}|X}$ that is modelled by the concatenation of the corresponding encoder-decoder pair $\big( q_{U|X}, Q_{\hat{Y}|U} \big)$.

3. Class of encoding distributions and predictors: The encoder/decoder selection is not arbitrary; there is a parametric model $\mathcal{H} := \{ f_\theta : \theta \in \Theta \}$, with $\Theta$ a subset of a finite-dimensional Euclidean space, where $f_\theta \equiv \big( q^\theta_{U|X}, Q^\theta_{\hat{Y}|U} \big)$. When we need to refer to the random representations generated by the encoder $q^\theta_{U|X}$, we use the notation $U_\theta$. When we need to focus on specific parameter details for the encoder and decoder separately, we write $\theta \equiv (\theta_E, \theta_D)$, with $\theta_E$ and $\theta_D$ the encoder and decoder parameters, respectively, and $f_\theta \equiv \big( q^{\theta_E}_{U|X}, Q^{\theta_D}_{\hat{Y}|U} \big)$.

We will concern ourselves with learning representation models (randomized encoders) and inference models (randomized decoders) from randomly generated samples.
Table 1: Table of symbols.

$\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{U}$ : Input, target and representation spaces.
$P_Y$ : Original pmf.
$p_X$ and $p_{XY}$ : Original pdfs, or mixture of pdf and pmf.
$P_X$ and $P_{X|Y}(\cdot|y)$ : Probability measures associated to $p_X$ and $p_{X|Y}(\cdot|y)$.
$\mathbb{E}[\cdot]$ and $\mathrm{Var}(\cdot)$ : Mathematical expectation and variance.
$\theta \in \Theta$ : Parameter and parameter space.
$f_\theta := (q^\theta_{U|X}, Q^\theta_{\hat{Y}|U})$ : Randomized encoder and decoder.
$q^\theta_U$, $Q^\theta_{Y|U}$ : pdfs and pmfs generated from $p_{XY}\, q^\theta_{U|X}$.
$\ell_\theta(x,y)$ : Cross-entropy loss function generated by $(q^\theta_{U|X}, Q^\theta_{\hat{Y}|U})$.
$\tilde{\ell}_\theta(x,y)$ : Cross-entropy loss function generated by $(q^\theta_{U|X}, Q^\theta_{Y|U})$.
$\mathcal{S}_n$ : Training dataset.
$\mathcal{L}(f_\theta)$ : Expected risk.
$\mathcal{L}_{\mathrm{emp}}(f_\theta, \mathcal{S}_n)$ : Empirical risk.
$\hat{\theta}_n$ : Parameter that minimizes the empirical risk.
$\mathcal{E}_{\text{gen-err}}(\mathcal{S}_n)$ : Generalization error.
$\mathcal{O}(a_n)$, $o(a_n)$ : Big-O and small-o notation.
$|\cdot|$ : Cardinality.
$d_x$, $d_u$ : Input and representation dimensions.
$\Omega = \{\ell_\theta(x,y) : \theta \in \Theta\}$ : Loss function class.
$\mathrm{KL}(\cdot\|\cdot)$ : Kullback-Leibler divergence.
$I(\cdot;\cdot)$, $H(\cdot)$, $H_d(\cdot)$ : Mutual information, entropy and differential entropy.
$\mathcal{F}^{(\varepsilon)}_\Omega$ : Finite parameter set.
$\{\mathcal{K}^{(y)}_k\}_{k=1}^K$, $\{x^{(k,y)}\}_{k=1}^K$ : Cells and centroids of the input-space discretization, for each $y \in \mathcal{Y}$.
$\epsilon_\theta(\mathcal{K})$, $r(\mathcal{K})$ : Basic elements of the input-space discretization.
$P^D_{XY}$, $P^D_X$, $P^D_{X|Y}$ : pmfs generated from the input-space discretization.
$q^{D,\theta}_U$, $Q^{D,\theta}_{Y|U}$, $q^{D,\theta}_{U|Y}$, $Q^{D,\theta}_{X|U}$ : pdfs and pmfs generated from $P^D_{XY}\, q^\theta_{U|X}$.
$T_\theta(x,y)$ : Difference between the loss function and an artificial loss function.
$\mathcal{E}_{\mathrm{gap}}(f_\theta, \mathcal{S}_n)$ : Error-gap.
$\mathcal{L}^D(f_\theta)$ : Discretized version of the expected risk.
$\mathcal{L}^D_{\mathrm{emp}}(f_\theta, \mathcal{S}_n)$ : Discretized version of the empirical risk.
$\mathcal{E}^D_{\mathrm{gap}}(f_\theta, \mathcal{S}_n)$ : Discretized version of the error-gap.
$\hat{P}^D_{XY}$, $\hat{P}^D_X$, $\hat{P}^D_{X|Y}$, $\hat{P}_Y$ : Empirical estimates over $\mathcal{S}_n$ (occurrence rates).
$\hat{q}^{D,\theta}_U$, $\hat{Q}^{D,\theta}_{Y|U}$, $\hat{q}^{D,\theta}_{U|Y}$, $\hat{Q}^{D,\theta}_{X|U}$ : pdfs and pmfs generated from $\hat{P}^D_{XY}\, q^\theta_{U|X}$.
$\|\cdot\|$ : 2-norm of finite vectors.
$\langle\cdot,\cdot\rangle$ : Euclidean inner product.
The problem of finding a good classifier can be divided into that of simultaneously finding a (possibly randomized) encoder $q^\theta_{U|X}$ that maps raw data to a (latent) space $\mathcal{U}$, and a soft-decoder $Q^\theta_{\hat{Y}|U}$ which maps the representation to a probability distribution on the labels $\mathcal{Y}$. These mappings induce a classifier:

$$Q^\theta_{\hat{Y}|X}(y|x) = \mathbb{E}_{q^\theta_{U|X}}\Big[ Q^\theta_{\hat{Y}|U}(y|U) \,\Big|\, X = x \Big]. \quad (4)$$

Remark: In the standard methodology with deep representations, we consider $L$ randomized encoders ($L$ layers) $\{q^\theta_{U_l|U_{l-1}}\}_{l=1}^L$ with $U_0 \equiv X$. Although this appears at first to be more general, it can be cast formally using the one-layer formulation induced by the marginal distribution that relates the input and the final $L$-th output layer. Therefore, results on the one-layer formulation also apply to the $L$-layer formulation, and thus the focus of the mathematical developments will be on the one-layer case.

This representation contains several cases of interest, such as feed-forward neural networks as well as genuinely graphical models (e.g., VAEs or RBMs). The computation of (4) requires marginalizing out $u \in \mathcal{U}$, which is in general computationally prohibitive in practice due to the large number of involved terms (especially when $\mathcal{U}$ is a discrete space). A variational upper bound is used to rewrite this problem in the following form:

$$-\log Q^\theta_{\hat{Y}|X}(y|x) \leq \mathbb{E}_{q^\theta_{U|X}}\Big[ -\log Q^\theta_{\hat{Y}|U}(y|U) \,\Big|\, X = x \Big], \quad (5)$$

which simply follows by applying Jensen's inequality (Cover and Thomas, 2006). Equality in (5) holds for the feed-forward neural network case, where $U = g(X)$ almost surely for some $g : \mathcal{X} \to \mathcal{U}$, and $Q^\theta_{\hat{Y}|X}(y|x) = Q^\theta_{\hat{Y}|U}(y|g(x))$. The above equation suggests using the cross-entropy as a loss function:

$$\ell_\theta(x,y) := \mathbb{E}_{q^\theta_{U|X}}\Big[ -\log Q^\theta_{\hat{Y}|U}(y|U) \,\Big|\, X = x \Big]. \quad (6)$$

The learner's goal is to select $f_\theta := (q^\theta_{U|X}, Q^\theta_{\hat{Y}|U})$ minimizing the upper bound in (5), the so-called expected risk:

$$\mathcal{L}(f_\theta) := \mathbb{E}_{p_{XY}}[\ell_\theta(X,Y)]. \quad (7)$$

Obviously, since $p_{XY}$ is unknown, the risk cannot be directly measured, and it is common to measure the agreement of candidates with a training dataset based on the empirical risk. Given the training set $\mathcal{S}_n = \{(X_k, Y_k)\}_{k=1}^n$, the empirical risk is defined by

$$\mathcal{L}_{\mathrm{emp}}(f_\theta, \mathcal{S}_n) := \frac{1}{n} \sum_{(x,y) \in \mathcal{S}_n} \ell_\theta(x,y), \quad (8)$$

where the parameter selection $\hat{\theta}_n := f(\mathcal{S}_n)$ is chosen to minimize (8):

$$\hat{\theta}_n := \arg\min_{\theta \in \Theta} \mathcal{L}_{\mathrm{emp}}(f_\theta, \mathcal{S}_n). \quad (9)$$

The generalization error is defined by

$$\mathcal{E}_{\text{gen-err}}(\mathcal{S}_n) := \mathcal{L}_{\mathrm{emp}}\big( f_{\hat{\theta}_n}, \mathcal{S}_n \big) - \mathcal{L}\big( f_{\hat{\theta}_n} \big). \quad (10)$$

Note that $\mathcal{L}\big( f_{\hat{\theta}_n} \big) = \mathcal{L}_{\mathrm{emp}}\big( f_{\hat{\theta}_n}, \mathcal{S}_n \big) - \mathcal{E}_{\text{gen-err}}(\mathcal{S}_n)$, i.e., the generalization error is the complement to empirical risk minimization in order to avoid overfitting.

3 Connections with Related Approaches

There is an interesting connection between the risk minimization of the cross-entropy loss and the IB principle presented by Tishby et al. (1999):
Definition 1 (Information Bottleneck)
The IB method (Tishby et al., 1999) consists in finding the $q_{U|X}$ that minimizes the functional

$$\mathcal{L}^{(\lambda)}_{\mathrm{IB}}(q_{U|X}) := H(Y|U) + \lambda \cdot I\big(U(X);X\big), \quad (11)$$

for a suitable multiplier $\lambda \geq 0$, where

$$q_U(u) := \mathbb{E}_{p_X}\big[ q_{U|X}(u|X) \big], \quad (12)$$

$$Q_{Y|U}(y|u) := \frac{\mathbb{E}_{p_X}\big[ q_{U|X}(u|X)\, P_{Y|X}(y|X) \big]}{q_U(u)}. \quad (13)$$

In a parametric classification problem, the IB can be interpreted as a minimization of the conditional entropy $H(Y|U_\theta)$ with a regularization term $I\big(U_\theta(X);X\big)$. This cost function is a lower bound, independent of the decoder $Q^\theta_{\hat{Y}|U}$, on the risk:

$$\mathcal{L}\big( q^\theta_{U|X}, Q^\theta_{\hat{Y}|U} \big) = \mathbb{E}_{q^\theta_U}\Big[ \mathrm{KL}\big( Q^\theta_{Y|U} \,\|\, Q^\theta_{\hat{Y}|U} \big) \Big] + H(Y|U_\theta) \quad (14)$$
$$\geq H(Y|U_\theta) \quad (15)$$
$$= \mathcal{L}\big( q^\theta_{U|X}, Q^\theta_{Y|U} \big), \quad (16)$$

where the equality in (15) holds if and only if $Q^\theta_{\hat{Y}|U} = Q^\theta_{Y|U}$ almost surely; i.e., in order to minimize the risk, the learner should choose the decoder induced by the encoder as in (13), while trying to minimize the resulting risk with respect to the encoder. Typically, in real-world applications, the probability measure $p_{XY}$ is unknown, and usually the learning algorithm chooses a decoder belonging to an appropriately defined parametric class which does not necessarily contain (13). Either way, the decoder $Q^\theta_{Y|U}$ induced by the encoder will always exist, and with it the cross-entropy loss induced only by the encoder is defined:

$$\tilde{\ell}_\theta(x,y) := \mathbb{E}_{q^\theta_{U|X}}\Big[ -\log Q^\theta_{Y|U}(y|U) \,\Big|\, X = x \Big]. \quad (17)$$

Additionally, we observe that using an arbitrary $\tilde{q}_U \in \mathcal{P}(\mathcal{U})$ (a so-called prior) in a classical variational setting,

$$\mathcal{L}^{(\lambda)}_{\mathrm{IB}}(q^\theta_{U|X}) = H(Y|U_\theta) + \lambda \cdot \Big[ \mathbb{E}_{p_X}\big[ \mathrm{KL}\big( q^\theta_{U|X} \,\|\, \tilde{q}_U \big) \big] - \mathrm{KL}\big( q^\theta_U \,\|\, \tilde{q}_U \big) \Big] \quad (18)$$
$$\leq H(Y|U_\theta) + \lambda \cdot \mathbb{E}_{p_X}\big[ \mathrm{KL}\big( q^\theta_{U|X} \,\|\, \tilde{q}_U \big) \big] \quad (19)$$
$$\equiv \mathcal{L}^{(\lambda)}_{\mathrm{VA}}(q^\theta_{U|X}, \tilde{q}_U). \quad (20)$$

The surrogate (20) shares a lot in common with a slightly more general form of the VAE discussed in (Kingma and Welling, 2013) and Achille and Soatto (2018b), where the latent space is regularized using a normal prior $\tilde{q}_U$.

Another connection to the present work can be found in the algorithmic robustness theory of Xu and Mannor (2012), in which a $K$-element partition $\mathcal{K}$ of the space $\mathcal{X}$ is proposed, obtaining, among others, the following result:

$$\mathcal{E}_{\text{gen-err}}(\mathcal{S}_n) \leq \inf_{\mathcal{K}} \left\{ d_\theta(\mathcal{K}) + M_\theta \sqrt{\frac{2K \log(2) + 2\log(1/\delta)}{n}} \right\} \quad (21)$$

with probability at least $1 - \delta$, where $\mathcal{S}_n$ denotes the training set of size $n$, $M_\theta$ is the maximum value of the loss function, $K$ is the number of cells of the partition, and $d_\theta(\mathcal{K})$ is the maximum diameter of any element of the partition measured in terms of the cost function. $(K, d_\theta(\mathcal{K}))$ are parameters that define the robustness of the decision rule. Analogous to these two parameters, our bound provided in the next section involves two magnitudes $(\epsilon(\mathcal{K}), r(\mathcal{K}))$ that can be linked to some aspect of robustness in the learning problem (see Section 4).

Further works where mutual information plays a significant role in generalization are reported within the well-known framework of Bayesian PAC-learning by Russo and Zou (2015); Xu and Raginsky (2017); Bassily et al. (2018); Graepel et al. (2005). There, the generalization error is upper bounded by the mutual information between the training set and the algorithm (i.e., the learned parameters if a parametric class model is allowed). More precisely, the learning process consists in mapping training data to an algorithm according to some class via a Markov kernel $p_{\hat{\theta}|\mathcal{S}_n}$.
Then, the generalization error is bounded as follows:

$$\Big| \mathbb{E}_{p_{\mathcal{S}_n} p_{\hat{\theta}|\mathcal{S}_n}}\big[ \mathcal{E}_{\text{gen-err}}(\mathcal{S}_n) \big] \Big| \leq M_\theta \sqrt{\frac{I\big( \hat{\theta}(\mathcal{S}_n); \mathcal{S}_n \big)}{2n}}, \quad (22)$$

where $M_\theta$ is the maximum value of the loss function. However, in our present work, the framework, the underlying hypothesis and the tools are fundamentally different from Bayesian PAC-learning. The mutual information we obtain in this paper is between input features and latent representations, which is very different from the above mutual information between the parameters and the training set. Our focus is to study the role of the mutual information between features and latent representations as a potential candidate to control regularization, as it was already shown to be useful in (Achille and Soatto, 2018a,b; Alemi et al., 2016; Vera et al., 2018b), among others. Nonetheless, the fact that two completely different approaches and models yield connections between mutual information and generalization confirms the importance of studying information-theoretic quantities in the context of statistical learning.

4 Main Result

In this section, we present our main result in Theorem 1, which is a bound on the generalization error. In particular, we show that the mutual information between the input raw data and its representation controls the generalization with a scaling $\mathcal{O}\big( \frac{\log(n)}{\sqrt{n}} \big)$, which leads to a so-called informational generalization error bound. To this end, we will need the following assumptions.

Assumptions 1
The following assumptions are made:

1. The input space $\mathcal{X} \subset \mathbb{R}^{d_x}$ has finite volume $\mathrm{Vol}(\mathcal{X}) < \infty$ and the target space $\mathcal{Y}$ is finite, $|\mathcal{Y}| < \infty$. In addition, $\mathcal{X} \equiv \mathrm{Supp}(p_X)$ and $P_Y(y_{\min}) := \min_{y \in \mathcal{Y}} P_Y(y) > 0$. These are extremely mild conditions, as we can always discard the sets of zero probability in $\mathcal{X}$ and $\mathcal{Y}$.

2. Every encoder in the parametric class $q^\theta_{U|X}(u|\cdot)$ is continuous in $x$ for all $u \in \mathcal{U} \subset \mathbb{R}^{d_u}$, and its marginal pdf has finite second-order moment:

$$\sup_{x \in \mathcal{X}} \sup_{\theta \in \Theta} \max_{j \in [1:d_u]} \mathbb{E}_{q^\theta_{U_j|X}}\big[ U_j^2 \,\big|\, X = x \big] \leq S < \infty, \quad (23)$$

where $u_j$ denotes the $j$-th entry of $u \in \mathcal{U} \subset \mathbb{R}^{d_u}$, with $j = 1, \ldots, d_u$.

3. Every decoder in the parametric model allocates non-zero probability mass:

$$Q^\theta_{\hat{Y}|U}(y|u) > 0, \quad \forall \theta \in \Theta,\; u \in \mathcal{U},\; y \in \mathcal{Y}. \quad (24)$$

4. The class of loss functions, denoted by $\Omega := \{\ell_\theta(x,y) : \theta \in \Theta\}$, is totally bounded.

Theorem 1 involves some steps which are worth explaining beforehand, to facilitate the understanding of its statement. In particular, it uses two discretization procedures: one for the parametric space of the loss functions $\Omega$, which is required to derive a uniform deviation of the generalization error (i.e., using the assumption that $\Omega$ is totally bounded), and another one for the input (feature) space $\mathcal{X}$, which is needed to introduce information-theoretic measures such as the mutual information. In more precise terms:

– The discretization of a parametric class of loss functions is a common procedure when we need a bound for the probability of uniform deviations with respect to an infinite set. The typical approach is to bound the required probability using a worst-case criterion over a finite number of events with the help of the union bound. The use of the Vapnik dimension (Devroye et al., 1997) of the underlying class, or of covering numbers, is the most common approach in classical learning theory (Devroye et al., 1997). The latter is the approach we follow here as well. In particular, we make use of the totally bounded hypothesis on $\Omega$, which allows us to guarantee that for all $\varepsilon > 0$ there exists a finite parameter set $\mathcal{F}^{(\varepsilon)}_\Omega := \{\theta_i\}_{i=1}^{|\mathcal{F}^{(\varepsilon)}_\Omega|} \subset \Theta$ with $|\mathcal{F}^{(\varepsilon)}_\Omega| < \infty$ such that for all $\theta \in \Theta$ (or $\ell_\theta \in \Omega$) there exists $i^* \in \{1, \ldots, |\mathcal{F}^{(\varepsilon)}_\Omega|\}$ satisfying

$$\sup_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} \big| \ell_\theta(x,y) - \ell_{\theta_{i^*}}(x,y) \big| < \varepsilon. \quad (25)$$
– The discretization of the input space $\mathcal{X}$ is used to introduce into our problem an information-theoretic quantity, such as the mutual information, which is well defined for finite-alphabet spaces. To this end, we introduce an artificial quantization of the input space following the approach in (Xu and Mannor, 2012). Let us define, for each $y \in \mathcal{Y}$, a finite $\mathcal{B}_\mathcal{X}$-measurable partition of the feature space $\mathcal{X}$ into $K$ connected cells $\{\mathcal{K}^{(y)}_k\}_{k=1}^K$ satisfying $\bigcup_{k=1}^K \mathcal{K}^{(y)}_k \equiv \mathcal{X}$, $\mathcal{K}^{(y)}_i \cap \mathcal{K}^{(y)}_j = \emptyset$ for all $1 \leq i < j \leq K$, and $\int_{\mathcal{K}^{(y)}_k} dx > 0$ for all $1 \leq k \leq K$. In addition, let $\{x^{(k,y)}\}_{k=1}^K$ be the respective cell centroids for each $y \in \mathcal{Y}$, so the partition family $\mathcal{K}$ is given by

$$\mathcal{K} = \Big\{ K, \big( \{\mathcal{K}^{(y)}_k\}_{k=1}^K, \{x^{(k,y)}\}_{k=1}^K \big)_{y \in \mathcal{Y}} \Big\}. \quad (26)$$

The partition $\mathcal{K}$ defines the cell radius (measured with respect to the encoder):

$$\Delta_\theta(\mathcal{K}) = \sup_{\substack{1 \leq k \leq K \\ (x,u,y) \in \mathcal{K}^{(y)}_k \times \mathcal{U} \times \mathcal{Y}}} \Big| q^\theta_{U|X}(u|x) - q^\theta_{U|X}\big(u|x^{(k,y)}\big) \Big|. \quad (27)$$

The choice of the partition determines how small these magnitudes can be made. When $K$ increases, every cell $\mathcal{K}^{(y)}_k$ naturally shrinks and the radius $\Delta_\theta(\mathcal{K})$ decreases (under the appropriate regularity conditions for the parametric class of encoders). The above discretization induces the following probability distribution for the input and output samples:

$$P^D_{XY}(x,y) := \sum_{k=1}^K \mathbb{1}\big\{ x = x^{(k,y)} \big\}\, P_{X|Y}\big(\mathcal{K}^{(y)}_k \big| y\big)\, P_Y(y). \quad (28)$$

This probability distribution defines a new probability measure from $P^D_{XY}\, q^\theta_{U|X}$ and a modified loss function, which will be used in the proof of Theorem 1:

$$\ell^D_\theta(x,y) := \mathbb{E}_{q^\theta_{U|X}}\Big[ -\log Q^{D,\theta}_{Y|U}(y|U) \,\Big|\, X = x \Big], \quad (29)$$

where

$$Q^{D,\theta}_{Y|U}(y|u) = \frac{\sum_{k=1}^K P^D_{XY}\big(x^{(k,y)}, y\big)\, q^\theta_{U|X}\big(u|x^{(k,y)}\big)}{\sum_{y' \in \mathcal{Y}} \sum_{k=1}^K P^D_{XY}\big(x^{(k,y')}, y'\big)\, q^\theta_{U|X}\big(u|x^{(k,y')}\big)}. \quad (30)$$

From the above definitions, it is understood implicitly that the partition of $\mathcal{X}$ is independent of the parametric class of encoders. However, the value of $\Delta_\theta(\mathcal{K})$ is clearly a function of $\theta$: for different values of $\theta$ this number can be different. If the class of encoders is sufficiently well-behaved, however, this variation will not be very wild. It is important to mention that this last discretization procedure is only used in the proof of the next theorem; it does not restrict the validity of the theorem for continuous input probability distributions. The reason for this is the monotonicity of the mutual information with respect to finite measurable partitions of the input space (the reader is referred to Pinsker (1964) for further details).

Now we are able to present our main result.

Theorem 1 (Main result)
For every $\delta \in (0,1)$, there exists a parameter set $\mathcal{F}^{(\varepsilon)}_\Omega := \{\theta_i\}_{i=1}^{|\mathcal{F}^{(\varepsilon)}_\Omega|} \subset \Theta$ with $|\mathcal{F}^{(\varepsilon)}_\Omega| < \infty$ such that

$$\mathbb{P}\left( |\mathcal{E}_{\text{gen-err}}(\mathcal{S}_n)| \leq \inf_{\varepsilon > 0} \left\{ \sup_{\theta \in \mathcal{F}^{(\varepsilon)}_\Omega} B\!\left( \theta, \frac{\delta}{|\mathcal{F}^{(\varepsilon)}_\Omega|} \right) + 2\varepsilon \right\} \right) \geq 1 - \delta, \quad (31)$$

where

$$B(\theta, \delta) = \inf_{\mathcal{K}} \left\{ \epsilon(\mathcal{K}) + r(\mathcal{K})\, A_\delta \sqrt{I\big(U_\theta(X);X\big)} \cdot \frac{\log(n)}{\sqrt{n}} + \inf_{\beta > 0} e^{-1}\, g_\theta(\beta)^{\frac{1+\beta}{\beta}}\, \frac{\Big[ \sqrt{r(\mathcal{K})}\, B_\delta \sqrt{I\big(U_\theta(X);X\big)} \Big]^{\beta}}{\sqrt{n}} \right\} + \frac{C_\delta + D_\delta \cdot \sqrt{\mathbb{E}_{p_{XY}}\big[ T_\theta^2(X,Y) \big]}}{\sqrt{n}} + \mathcal{O}\!\left( \frac{\log(n)}{n} \right) \quad (32)$$
with

$$A_\delta := \sqrt{\frac{2}{\delta}}, \qquad B_\delta := 1 + \sqrt{\log\!\left( \frac{|\mathcal{Y}| + 4}{\delta} \right)}, \quad (33)$$

$$C_\delta := \log\!\left( \frac{(4\pi e S)^{d_u/2}}{P_Y(y_{\min})} \right) \sqrt{|\mathcal{Y}|}\, B_\delta, \qquad D_\delta := \sqrt{\frac{|\mathcal{Y}| + 4}{\delta}}, \quad (34)$$

and

$$g_\theta(\beta) := \sup_{x,z \in \mathcal{X}} \left( \int_{\mathcal{U}} q^\theta_{U|X}(u|x) \Big( q^\theta_{U|X}(u|z) \Big)^{-\beta} du \right)^{\frac{1}{2\beta}}, \quad (35)$$

$$T_\theta(x,y) := \ell_\theta(x,y) - \tilde{\ell}_\theta(x,y), \quad (36)$$

$$\epsilon(\mathcal{K}) := \sup_{\substack{1 \leq k \leq K,\; y \in \mathcal{Y} \\ x \in \mathcal{K}^{(y)}_k,\; \theta \in \Theta}} \Big| \tilde{\ell}_\theta(x,y) - \ell^D_\theta\big(x^{(k,y)}, y\big) \Big|, \quad (37)$$

$$r(\mathcal{K}) := \frac{1}{\min_{1 \leq k \leq K,\, y \in \mathcal{Y}} P_X\big( \mathcal{K}^{(y)}_k \big)}. \quad (38)$$

The proof is relegated to Appendix A. This bound has some important terms which are worth analyzing:

– $I\big(U_\theta(X);X\big)$: The mutual information between the raw data $X$ and its randomized representation $U(X)$ appears to be related to the generalization capabilities and thus to overfitting. It was interpreted as a "measure of information complexity" (Achille and Soatto, 2018b; Alemi et al., 2016; Vera et al., 2018b). Theorem 1 is a first step towards explaining how and why this effect happens. This term presents a scaling rate of $n^{-1/2}\log(n)$ and is the main term in the generalization bound. It is well known (Devroye et al., 1997) that there exist bounds showing that the generalization error vanishes faster with $n$, but this intermediate regime appears to be visible in practical scenarios, and these results can capture the dynamics of the generalization error with respect to some of the hyperparameters regardless of this poorer scaling. In Section 6, we present an empirical analysis that supports this claim.

– $\mathbb{E}_{p_{XY}}\big[ T_\theta^2(X,Y) \big]$: This term can be interpreted as a measure of the decoder efficiency. It is basically the mean-square error between the loss function (6) and the modified loss $\tilde{\ell}_\theta(x,y)$ in (17) induced solely by the encoder. This magnitude can also be understood as a measure of similarity between the decoder $Q^\theta_{\hat{Y}|U}$ and the decoder $Q^\theta_{Y|U}$ induced by the encoder:

$$T_\theta(x,y) = \mathbb{E}_{q^\theta_{U|X}}\left[ \log \frac{Q^\theta_{Y|U}(y|U)}{Q^\theta_{\hat{Y}|U}(y|U)} \,\middle|\, X = x \right]. \quad (39)$$

When $Q^\theta_{\hat{Y}|U} = Q^\theta_{Y|U}$ this term is zero, suggesting that this selection could have a beneficial effect on the generalization error. This result is consistent with the bottleneck effect: when this decoder is selected, the generalization error is controlled mainly by $I\big(U_\theta(X);X\big)$. When the selection of the decoder $Q^\theta_{\hat{Y}|U}$ is close to $Q^\theta_{Y|U}$ (according to $T_\theta(x,y)$) and the number of samples $n$ is such that $1/\sqrt{n}$ is considerably lower than $\frac{\log(n)}{\sqrt{n}}$, this term can be neglected.

– $\epsilon(\mathcal{K})$ and $r(\mathcal{K})$: The trade-off between these two magnitudes is obviously related to the original trade-off between $K$ and $\Delta_\theta(\mathcal{K})$ in (27): $\epsilon(\mathcal{K})$ increases with $\Delta_\theta(\mathcal{K})$ (i.e., if the encoders are close in the sense of (27), the decoders $Q^\theta_{Y|U}$ and $Q^{D,\theta}_{Y|U}$ will necessarily be close as well), while $r(\mathcal{K})$ increases with $K$ (smaller cells enclose less probability). As the exact trade-off is highly dependent, not only on the encoder parametric class but also on the exact input distribution, an accurate description of it is not easy to obtain. However, under some mild extra assumptions some analysis can be carried out, as shown in Section 5.2.
– $g_\theta(\beta)^{\frac{1+\beta}{\beta}}$: This term depends entirely on the encoder. On the one hand, $\beta \mapsto \frac{1+\beta}{\beta}$ is a decreasing function, i.e., $\frac{1+\beta}{\beta} \to \infty$ when $\beta \to 0^+$ and $\frac{1+\beta}{\beta} \to 1$ when $\beta \to \infty$. On the other hand, $g_\theta(\beta)$ is an increasing function of $\beta$, with $g_\theta(1) \geq \sqrt{\mathrm{Vol}(\mathcal{U})}$. So, when $\mathcal{U}$ is not bounded, $\beta$ should be limited to $(0,1]$. In any case, this does not seem to be critical if the encoder class is chosen carefully, e.g., normal or log-normal distributions, for which this term can be shown to be finite provided that $\beta \in (0,1]$.

5 On the Assumptions of Theorem 1

5.1 On the Total Boundedness of $\Omega$

A critical requirement for Theorem 1 is the total-boundedness hypothesis on $\Omega$. This hypothesis allows us to analyze the generalization error using a worst-case criterion (over the $\varepsilon$-net $\mathcal{F}^{(\varepsilon)}_\Omega$) for the generalization error (63). It is important to verify whether this hypothesis can be justified in practical scenarios; the goal of this section is to show that this is indeed the case. In order to show our claim, we will consider a common and popular choice of parametric encoder/decoder class. For the class of encoders, we consider the case of Gaussian encoders (Achille and Soatto, 2018b; Kingma and Welling, 2013). That is,

$$q^{\theta_E}_{U|X}(u|x) = \prod_{i=1}^{d_u} \mathcal{N}\big( \mu_i(x, \beta_i), \sigma_i^2(x, \alpha_i) \big), \quad (40)$$

where $\mathcal{N}(\mu, \sigma^2)$ denotes a Gaussian pdf with mean $\mu$ and variance $\sigma^2$. The functions $\mu_i(x, \beta_i)$ and $\sigma_i(x, \alpha_i)$, $i = 1, \ldots, d_u$, are typically deep feed-forward neural nets, where the parameters $\theta_E \equiv \{\alpha_i, \beta_i\}_{i=1}^{d_u}$ are learned by minimizing the empirical risk (8). We assume that $\alpha_i, \beta_i \in \mathbb{R}^l$ for $i = 1, \ldots, d_u$. With these definitions, it is clear that the total set of parameters for the encoder, that is $\theta_E$, lives in $\Theta_E \subseteq \mathbb{R}^{2ld_u}$. Note that with this choice for the encoder, $\mathcal{U} = \mathbb{R}^{d_u}$.

For the decoder parametric class we consider the well-known soft-max architecture:

$$Q^{\theta_D}_{\hat{Y}|U}(k|u) = \frac{\exp\{\langle w_k, u \rangle + b_k\}}{\sum_{i=1}^{|\mathcal{Y}|} \exp\{\langle w_i, u \rangle + b_i\}}, \quad k = 1, \ldots, |\mathcal{Y}|, \quad (41)$$

where $\theta_D \equiv \{w_i, b_i\}_{i=1}^{|\mathcal{Y}|}$, with $w_i \in \mathbb{R}^{d_u}$ and $b_i \in \mathbb{R}$ for $i = 1, \ldots, |\mathcal{Y}|$, are the parameters of the soft-max, chosen also by minimization of the empirical risk. We clearly see that the total set of decoder parameters ($\theta_D$) is in $\Theta_D \subseteq \mathbb{R}^{|\mathcal{Y}|(d_u+1)}$. With these definitions, we can rewrite the set $\Omega$ as:

$$\Omega = \left\{ \mathbb{E}_{q^{\theta_E}_{U|X}}\Big[ -\log Q^{\theta_D}_{\hat{Y}|U}(k|U) \,\Big|\, X = x \Big] : (\theta_E, \theta_D) \in \Theta_E \times \Theta_D \right\}. \quad (42)$$

Now we are ready to present the main result of this section, whose assumptions and proof are relegated to Appendix B.

Theorem 2 (Total-boundedness of $\Omega$): Under the set of Assumptions 2, $\Omega$ in (42) is totally bounded.

Remark: Notice that the result of this theorem is valid for Gaussian encoders. However, it is not difficult to show that the result also holds true for other well-behaved encoders (e.g., log-normal (Achille and Soatto, 2018b)), although additional efforts are needed to show it.
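As an illustration of the parametric class (40)-(41), a minimal sketch in PyTorch could look as follows. The hidden-layer structure and names are assumptions made for this example, not the configuration used in the paper:

```python
import torch
import torch.nn as nn

class GaussianEncoderSoftmaxDecoder(nn.Module):
    """Sketch of a Gaussian encoder (40) with input-dependent mean and
    variance, followed by a soft-max decoder (41) acting on a sampled
    representation U."""
    def __init__(self, d_x, d_u, n_classes, hidden=1024):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(d_x, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, d_u)         # mean mu_i(x) of each U_i
        self.log_var = nn.Linear(hidden, d_u)    # log sigma_i^2(x); keeps variance positive
        self.decoder = nn.Linear(d_u, n_classes) # soft-max weights (w_k, b_k) in (41)

    def forward(self, x):
        h = self.body(x)
        mu, log_var = self.mu(h), self.log_var(h)
        u = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # U ~ q_{U|X}(.|x)
        return self.decoder(u), mu, log_var  # logits define Q_{Y|U}(.|u) via soft-max
```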
5.2 On the Trade-off between $\epsilon(\mathcal{K})$ and $r(\mathcal{K})$

In this section, we present two very simple lemmas that allow us to have a first glimpse of the trade-off between $\epsilon(\mathcal{K})$ and $r(\mathcal{K})$, which are the most relevant terms in Theorem 1, and give some basic notion of the scaling with the cell number $K$.
Lemma 1
For every partition family $\mathcal{K}$, $\epsilon(\mathcal{K}) \leq \mathcal{O}\big( \sup_{\theta \in \Theta} \Delta_\theta(\mathcal{K}) \big)$.

Proof: From the definition of $\epsilon(\mathcal{K})$ in (37), we can write:

$$\epsilon(\mathcal{K}) \leq \sup_{\theta \in \Theta} \left\{ \epsilon_\theta(\mathcal{K}) + \max_{\substack{1 \leq k \leq K \\ y \in \mathcal{Y}}} \mathbb{E}_{q^\theta_{U|X}}\left[ \left| \log \frac{Q^{D,\theta}_{Y|U}(y|U)}{Q^\theta_{Y|U}(y|U)} \right| \,\middle|\, X = x^{(k,y)} \right] \right\}, \quad (43)$$

where

$$\epsilon_\theta(\mathcal{K}) = \sup_{(k,x,y) \in \mathcal{C}} \Big| \tilde{\ell}_\theta(x,y) - \tilde{\ell}_\theta\big(x^{(k,y)}, y\big) \Big| \quad (44)$$

and $\mathcal{C} = \{(k,x,y) : 1 \leq k \leq K,\; y \in \mathcal{Y},\; x \in \mathcal{K}^{(y)}_k\}$. It is easy to check, from the continuity of $\tilde{\ell}_\theta(x,y)$ with respect to $q_{U|X}(\cdot|x)$, that $\epsilon_\theta(\mathcal{K}) = \mathcal{O}(\Delta_\theta(\mathcal{K}))$. Something similar happens with the second term of (43), where for every $u \in \mathcal{U}$ and $y \in \mathcal{Y}$:

$$\log Q^\theta_{Y|U}(y|u) = \log \frac{\sum_{k=1}^K \int_{\mathcal{K}^{(y)}_k} p_{XY}(x,y)\, q^\theta_{U|X}(u|x)\, dx}{\sum_{y' \in \mathcal{Y}} \sum_{k=1}^K \int_{\mathcal{K}^{(y')}_k} p_{XY}(x,y')\, q^\theta_{U|X}(u|x)\, dx} \quad (45)$$

$$\leq \log \frac{\sum_{k=1}^K P^D_{XY}(k,y) \big( q^\theta_{U|X}(u|x^{(k,y)}) + \Delta_\theta(\mathcal{K}) \big)}{\sum_{y' \in \mathcal{Y}} \sum_{k=1}^K P^D_{XY}(k,y') \big( q^\theta_{U|X}(u|x^{(k,y')}) - \Delta_\theta(\mathcal{K}) \big)}. \quad (46)$$

Then, it is not hard to verify that:

$$\left| \log \frac{Q^{D,\theta}_{Y|U}(y|u)}{Q^\theta_{Y|U}(y|u)} \right| \leq \mathcal{O}(\Delta_\theta(\mathcal{K})), \quad (47)$$

from which we can conclude that $\epsilon(\mathcal{K}) \leq \mathcal{O}\big( \sup_{\theta \in \Theta} \Delta_\theta(\mathcal{K}) \big)$.

When $K$ increases, $\Delta_\theta(\mathcal{K})$ decreases, and Lemma 1 shows that $\epsilon(\mathcal{K})$ tends to decrease as well. The following lemma studies the trade-off between $r(\mathcal{K})$ and $\epsilon(\mathcal{K})$ under rather reasonable assumptions.

Lemma 2
Let $q^\theta_{U|X}(u|\cdot)$ be a parametric class of Lipschitz continuous encoders, and suppose that for every partition family $\mathcal{K}$:

$$\min_{\substack{1 \leq k \leq K \\ y \in \mathcal{Y}}} \mathbb{P}\big( X \in \mathcal{K}^{(y)}_k \big) \geq \mathcal{O}\!\left( \frac{1}{K} \right), \qquad \max_{\substack{1 \leq k \leq K \\ y \in \mathcal{Y}}} \mathrm{Vol}\big( \mathcal{K}^{(y)}_k \big) \leq \mathcal{O}\!\left( \frac{1}{K} \right). \quad (48)$$

Then, $r(\mathcal{K}) \leq \mathcal{O}(K)$, $\epsilon(\mathcal{K}) \leq \mathcal{O}\big( K^{-1/d_x} \big)$ and $\Delta_\theta(\mathcal{K}) \leq \mathcal{O}\big( K^{-1/d_x} \big)$.

Remark: The conditions required in Lemma 2 are not highly restrictive. Notice that it is possible to show that

$$\min_{\substack{1 \leq k \leq K \\ y \in \mathcal{Y}}} \mathbb{P}\big( X \in \mathcal{K}^{(y)}_k \big) \leq \frac{1}{K}, \qquad \max_{\substack{1 \leq k \leq K \\ y \in \mathcal{Y}}} \mathrm{Vol}\big( \mathcal{K}^{(y)}_k \big) \geq \frac{\mathrm{Vol}(\mathcal{X})}{K}, \quad (49)$$

with equalities for equiprobable and equivolume partitions, respectively. As a consequence, it is reasonable to think that, provided good partitions can be found for (32), the above quantities do not deviate significantly from these behaviours.

Proof: The claim $r(\mathcal{K}) \leq \mathcal{O}(K)$ is an immediate consequence of (48). From the Lipschitz continuity of $q^\theta_{U|X}(u|\cdot)$ in $\mathcal{K}^{(y)}_k$ and the relationship with the volume, it is easy to show that

$$\Delta_\theta(\mathcal{K}) = \mathcal{O}\left( \sup_{(k,x,y) \in \mathcal{C}} \big\| x - x^{(k,y)} \big\| \right) = \mathcal{O}\Big( V_M(\mathcal{K})^{1/d_x} \Big), \quad (50)$$
where

$$V_M(\mathcal{K}) = \max_{\substack{1 \leq k \leq K \\ y \in \mathcal{Y}}} \mathrm{Vol}\big( \mathcal{K}^{(y)}_k \big) = \mathcal{O}(K^{-1}). \quad (51)$$

Then, $\Delta_\theta(\mathcal{K}) \leq \mathcal{O}\big( K^{-1/d_x} \big)$ and, using Lemma 1, $\epsilon(\mathcal{K}) \leq \mathcal{O}\big( K^{-1/d_x} \big)$.

6 Numerical Experiments

The goal of our experiments in this section is to validate the theory in Theorem 1, by showing that the mutual information is indeed representative of the generalization error, at least in a qualitative fashion. In other words, by controlling the mutual information between features and representations, we aim at investigating whether we can better control generalization in architectures of limited capacity. The magnitudes are compared for several rules $(q^\theta_{U|X}, Q^\theta_{\hat{Y}|U})$, considering also the influence of the Lagrange multiplier, defined in each experiment, that controls the level of regularization during training.

Remark: Theorem 1 studies the link between generalization error and mutual information for a given $\theta$, which is then turned into a worst-case bound over a finite parameter set $\mathcal{F}^{(\varepsilon)}_\Omega$. As this set cannot be known in practice, we make the comparison for the $\theta$ found by the algorithm during the training stage. Since this worst-case criterion is used for theoretical convenience to derive inequalities, and the comparison to be made is merely qualitative, this decision does not seem far-fetched and maintains the spirit of the bound.

A remaining difficulty is the implementation of a mutual information estimator. In practice we have a product-form encoder, $q^\theta_{U|X}(u|x) = \prod_{j=1}^{d_u} q^\theta_{U_j|X}(u_j|x)$, but the marginal distribution $q^\theta_U(u) = \mathbb{E}_{p_X}\big[ q^\theta_{U|X}(u|X) \big]$ does not satisfy this property. To this end, we make use of a variational bound (Cover and Thomas, 2006) on the mutual information as follows:

$$I\big( U_\theta(X); X \big) = \mathbb{E}_{p_X}\Big[ \mathrm{KL}\big( q^\theta_{U|X} \,\|\, \tilde{q}^\theta_U \big) \Big] - \mathrm{KL}\big( q^\theta_U \,\|\, \tilde{q}^\theta_U \big) \quad (52)$$
$$\leq \sum_{j=1}^{d_u} \mathbb{E}_{p_X}\Big[ \mathrm{KL}\big( q^\theta_{U_j|X}(\cdot|X) \,\|\, \tilde{q}^\theta_{U_j} \big) \Big], \quad (53)$$

where $\tilde{q}^\theta_U(u) = \prod_{j=1}^{d_u} \tilde{q}^\theta_{U_j}(u_j)$ is an auxiliary prior pdf (Kingma and Welling, 2013; Achille and Soatto, 2018b), which may or may not depend on $\theta$. The best choice in the above inequality is $\tilde{q}^\theta_{U_j}(u_j) = \mathbb{E}_{p_X}\big[ q^\theta_{U_j|X}(u_j|X) \big]$, $j \in [1:d_u]$. We make use of a parametric estimator of the KL divergence based on:

$$\mathbb{E}_{p_X}\Big[ \mathrm{KL}\big( q^\theta_{U_j|X}(\cdot|X) \,\|\, \tilde{q}^\theta_{U_j} \big) \Big] \approx \frac{1}{n} \sum_{i=1}^n \mathrm{KL}\big( q^\theta_{U_j|X}(\cdot|x_i) \,\|\, \tilde{q}^\theta_{U_j} \big). \quad (54)$$

We refer to each of the examples by the specifics of its encoder: Example 1 uses a normal encoder, Example 2 uses a log-normal one, and Example 3 an RBM (see Table 2). Note that the mutual information does not depend on the decoder, so for simplicity we always use a soft-max output layer. The values reported in each simulation are the average of three independent simulations, choosing at random, in each case, different sets for training and testing.

Table 2: Architectures to be implemented.

Example | $q^\theta_{U_j|X}$ | $Q^\theta_{\hat{Y}|U}$ | $\tilde{q}^\theta_{U_j}$ | Motivated by
Ex. 1 | Normal | Softmax | $\mathcal{N}(0,1)$ | Kingma and Welling (2013)
Ex. 2 | Log-Normal | Softmax | Log-Normal | Achille and Soatto (2018b)
Ex. 3 | RBM | Softmax | $\frac{1}{n}\sum_{i=1}^n q^\theta_{U_j|X}(u_j|x_i)$ | Hinton (2012)
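A minimal sketch of how the empirical bound (53)-(54) can be evaluated over a dataset is given below. The callable `kl_per_sample_fn` is a hypothetical helper assumed to return the per-dimension KL terms for a batch of inputs:

```python
import torch

def mutual_information_bound(kl_per_sample_fn, data_loader):
    """Monte-Carlo version of the bound (53)-(54): average, over the dataset,
    of the per-sample KL divergences summed across latent dimensions.
    Assumes the loader yields (x, y) pairs and that kl_per_sample_fn(x)
    returns a (batch, d_u) tensor of KL(q_{U_j|X}(.|x_i) || q~_{U_j}) values."""
    total, count = 0.0, 0
    with torch.no_grad():
        for x, _ in data_loader:
            kl = kl_per_sample_fn(x)   # shape: (batch, d_u)
            total += kl.sum().item()   # sum over latent dims and batch
            count += x.shape[0]
    return total / count               # estimate of the MI upper bound
```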
Table 3: Best accuracy achieved with each architecture.

Dataset | Normal | Log-Normal | RBM
MNIST | 0.951 | 0.933 |
CIFAR-10 | 0.429 | 0.446 |
As our main goal is not to present a new classification methodology, competitive with state-of-the-art methods, we restrict ourselves to small subsets of databases, as motivated by Neyshabur et al. (2017). More specifically, we sample two different random subsets of MNIST (a standard dataset of handwritten digits) and CIFAR-10 (natural images, Krizhevsky (2009)). The size of the training set is the same for both datasets (on the order of a few thousand samples). It is important to emphasize the main difference between the datasets: MNIST poses a task that is simpler than CIFAR-10, especially in the presence of small sample sizes and without convolutional networks. From this observation, we deal with two really different regimes: one in which high accuracy is achieved and another in which the algorithm is unable to achieve good performance, as can be seen in Table 3.

Remark: In fact, the RBM encoder supposes a discrete latent variable $U$, while our main result is for continuous random variables. Theorem 1 can be modified for discrete alphabets by relying on tools from (Shamir et al., 2010; Vera et al., 2018a).

6.1 Variational Autoencoders

Gaussian Variational Autoencoders (VAEs) introduce a conditionally independent normal encoder $U_j | X = x \sim \mathcal{N}\big( \mu_j(x), \sigma_j^2(x) \big)$, $j \in [1:d_u]$, where $\mu_j(x)$ and $\log \sigma_j^2(x)$ are constructed via deep neural networks (vectorized functions of $\theta$), together with a standard normal prior $\tilde{U}_j \sim \mathcal{N}(0,1)$; the decoder input is generated by sampling based on the well-known reparameterization trick (Kingma and Welling, 2013). Each KL divergence involved in expression (54) can be computed as

$$\mathrm{KL}\big( q^\theta_{U_j|X}(\cdot|x_i) \,\|\, \tilde{q}^\theta_{U_j} \big) = \frac{1}{2}\Big( -\log \sigma_j^2(x_i) + \sigma_j^2(x_i) + \mu_j^2(x_i) - 1 \Big). \quad (55)$$

We consider a deep neural network composed of a feed-forward layer of hidden units with ReLU activation, followed by another linear layer for each parameter ($\mu$ and $\log \sigma^2$); that is, each parameter, $\mu$ and $\log \sigma^2$, is given by a two-layer network where the first layer is common to both. (Implementing $\log \sigma^2$ instead of $\sigma^2$ prevents the algorithm from finding degenerate normals of zero variance.) The learning rate, the batch-size and the number of training epochs are kept fixed across all values of the Lagrange multiplier. The cost function considered during the training phase was of the form

$$\mathcal{L}_{\mathrm{emp}}\big( q^\theta_{U|X}, Q^\theta_{\hat{Y}|U}, \mathcal{S}_n \big) + \lambda \sum_{j=1}^{d_u} \frac{1}{n} \sum_{i=1}^n \mathrm{KL}\big( q^\theta_{U_j|X}(\cdot|x_i) \,\|\, \tilde{q}_{U_j} \big), \quad (56)$$

where $\lambda \geq 0$ is the regularizing Lagrange multiplier. This approach is known as the $\beta$-VAE (Higgins et al., 2017), which matches the classic VAE for a Lagrange multiplier $\lambda = 1$. A $\beta$-variational classifier was used by Li et al. (2019) and Maggipinto et al. (2020), among others.

Fig. 1 shows the generalization error and mutual information behaviour for experiments on the MNIST and CIFAR-10 datasets. As the mutual information estimator is used as a regularization term, it is reasonable to expect it to decrease with the Lagrange multiplier. In addition, the KL divergence is a classic regularization term for a normal encoder, for which a decreasing behavior of the generalization error is also expected. Both decreasing behaviors are not only experimentally corroborated in Fig. 1, but a certain similarity is seen in the way both decrease, especially in our experiments with CIFAR-10. For the MNIST dataset, the behavior becomes more similar as the Lagrange multiplier grows. It should also be added that the mutual information estimated with the training set and with the testing set are very much the same. Therefore, one can have a qualitative notion of the generalization error with respect to the Lagrange multiplier without the need of relying on a validation set.
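The closed form (55) is straightforward to implement. A minimal sketch, assuming `mu` and `log_var` are the encoder outputs for a batch:

```python
import torch

def gaussian_kl_to_standard_normal(mu, log_var):
    """Per-sample, per-dimension KL in (55) between N(mu_j(x_i), sigma_j^2(x_i))
    and the standard normal prior N(0, 1)."""
    return 0.5 * (-log_var + log_var.exp() + mu.pow(2) - 1.0)

# Summing over latent dimensions and averaging over the n training samples
# gives the regularizer added to the empirical risk in (56):
# reg = gaussian_kl_to_standard_normal(mu, log_var).sum(dim=1).mean()
```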
Fig. 1: Generalization error and mutual information for a normal encoder architecture. Curves on the LHS correspond to experiments on the MNIST database and those on the RHS correspond to CIFAR-10. (The four panels plot the generalization error and the mutual information, estimated on the training and testing sets, against the Lagrange multiplier.)
6.2 Information Dropout

Information dropout proposes conditionally independent log-normal encoders $U_j = f_j(X)\, e^{\alpha_j(X) Z}$, $j \in [1:d_u]$, where $Z \sim \mathcal{N}(0,1)$, and $f_j(x)$ and $\alpha_j(x)$ are constructed via deep feed-forward neural networks (i.e., vectorized functions of $\theta$); the decoder input is generated by sampling using the reparameterization trick. We follow the approach of Achille and Soatto (2018b), where it is recommended to use a log-normal prior $\tilde{U}_j \sim \log\mathcal{N}(\mu_j, \sigma_j^2)$, with $f(x) = [f_1(x), \cdots, f_{d_u}(x)]$ denoting a deep feed-forward neural network with soft-plus activation, and where $(\mu_j, \sigma_j^2)$ are variables to be learned.

In order to compute the KL divergence, note that $U_j | X = x \sim \log\mathcal{N}\big( \log f_j(x), \alpha_j^2(x) \big)$. Since the KL divergence is invariant under reparametrizations, the divergence between two log-normal distributions is equal to the divergence between the corresponding normal distributions (Cover and Thomas, 2006). Therefore, using the formula for the KL divergence of normal random variables, we obtain

$$\mathrm{KL}\big( q^\theta_{U_j|X}(\cdot|x_i) \,\|\, \tilde{q}^\theta_{U_j} \big) = \mathrm{KL}\big( \log\mathcal{N}(\log f_j(x_i), \alpha_j^2(x_i)) \,\|\, \log\mathcal{N}(\mu_j, \sigma_j^2) \big) \quad (57)$$
$$= \mathrm{KL}\big( \mathcal{N}(\log f_j(x_i), \alpha_j^2(x_i)) \,\|\, \mathcal{N}(\mu_j, \sigma_j^2) \big) \quad (58)$$
$$= \frac{\alpha_j^2(x_i) + \big( \log(f_j(x_i)) - \mu_j \big)^2}{2\sigma_j^2} - \log \frac{\alpha_j(x_i)}{\sigma_j} - \frac{1}{2}. \quad (59)$$

The DNN used for $f(x)$ is a feed-forward structure with two layers of hidden units with a soft-plus activation, while $\alpha^2(x) = [\alpha_1^2(x), \cdots, \alpha_{d_u}^2(x)]$ is a feed-forward layer of hidden units with a scaled sigmoid activation, so that the maximum variance of the log-normal error distribution is bounded and null variances are prevented (Achille and Soatto, 2018b). The learning rate, the batch-size and the number of training epochs are kept fixed across all runs. The cost function trained was the same as in (56).
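For reference, the log-normal KL (59) is also a one-line computation. A minimal sketch, where `log_f` and `alpha2` are the encoder outputs and `mu`, `sigma2` the learned prior parameters (all tensor names are illustrative):

```python
import torch

def lognormal_kl(log_f, alpha2, mu, sigma2):
    """KL in (59) between the encoder logN(log f_j(x_i), alpha_j^2(x_i)) and
    the learned prior logN(mu_j, sigma_j^2); by invariance under
    reparametrization it equals the KL of the corresponding normals."""
    return 0.5 * ((alpha2 + (log_f - mu).pow(2)) / sigma2
                  - torch.log(alpha2 / sigma2) - 1.0)
```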
Fig. 2: Generalization error and mutual information for a log-normal encoder architecture. Curves on the left correspond to experiments on the MNIST database and those on the right correspond to CIFAR-10. (The four panels plot the generalization error and the mutual information, estimated on the training and testing sets, against the Lagrange multiplier.)
Fig. 2 shows the generalization error and mutual information behaviour for experiments on the MNIST and CIFAR-10 datasets. As in the normal-encoder case, it is expected that both the mutual information and the generalization error exhibit a decreasing behavior: the first because it is controlled by the Lagrange multiplier, and the second because this type of regularization is known to perform well for this architecture (Achille and Soatto, 2018b). Both decreasing behaviors are not only experimentally corroborated in Fig. 2, but both curves also have a similar shape in terms of decay. Again, this similarity is more pronounced in the experiments on CIFAR-10, or on MNIST for moderately high Lagrange multipliers. It is also seen that the mutual-information estimate obtained with the training set is almost as good as the one obtained with the testing set.
6.3 Restricted Boltzmann Machines

In this section, we consider the standard models for RBMs studied in Hinton (2012); Srivastava et al. (2014). For every $j \in [1:d_u]$, $U_j$ given $X = x$ is distributed as a Bernoulli random variable with parameter $\sigma(b_j + \langle w_j, x \rangle)$ (sigmoid activation). By selecting the product distribution $\tilde{q}_{U_j}(u_j) = \frac{1}{n} \sum_{i=1}^n q_{U_j|X}(u_j|x_i)$, we obtain

$$\mathrm{KL}\big( q^\theta_{U_j|X}(\cdot|x_i) \,\|\, \tilde{q}^\theta_{U_j} \big) = \sigma\big(b_j + \langle w_j, x_i \rangle\big) \log\!\left( \frac{\sigma\big(b_j + \langle w_j, x_i \rangle\big)}{\frac{1}{n} \sum_{k=1}^n \sigma\big(b_j + \langle w_j, x_k \rangle\big)} \right) + \Big( 1 - \sigma\big(b_j + \langle w_j, x_i \rangle\big) \Big) \log\!\left( \frac{1 - \sigma\big(b_j + \langle w_j, x_i \rangle\big)}{\frac{1}{n} \sum_{k=1}^n \Big( 1 - \sigma\big(b_j + \langle w_j, x_k \rangle\big) \Big)} \right). \quad (60)$$

Eq. (60) is difficult to use as a regularizer, even when training with the contrastive divergence learning procedure of Hinton (2002). Instead, we rely on the usual RBM regularization: weight decay. This is a traditional way to improve the generalization capacity. We explore the effect of the Lagrange multiplier $\lambda$, the so-called weight-cost, on both the generalization error and the mutual information. This meta-parameter controls the weight decay of the gradient, i.e., the cost function can be written as

$$\mathrm{CD}_{\mathrm{RBM}} + \lambda \sum_{j=1}^{d_u} \|w_j\|^2, \quad (61)$$

where $\mathrm{CD}_{\mathrm{RBM}}$ is the classical unsupervised RBM cost function trained via the contrastive divergence learning procedure of Hinton (2012). In order to compute the generalization error, we add to the output of the last RBM layer a soft-max regression decoder, trained separately. Several authors have combined RBMs with soft-max regression (Hinton et al., 2006; Srivastava et al., 2013; Chopra and Yadav, 2018), among others.

Following suggestions from Hinton (2012), we study the Lagrange multiplier over a range of small values and plot the curves on a logarithmic scale (from $10^{-5}$ to $10^{-1}$, as shown in Fig. 3). The learning rates, the batch-size, the number of hidden units, the number of training epochs and the momentum schedule are kept fixed across all runs.

Fig. 3: Generalization error and mutual information for an RBM encoder architecture. Curves on the left correspond to experiments on the MNIST database and those on the right correspond to CIFAR-10. (The four panels plot the generalization error and the mutual information, estimated on the training and testing sets, against the Lagrange multiplier on a logarithmic scale.)
Fig. 3 shows the generalization error and mutual information behaviour for experiments on the MNIST and CIFAR-10 datasets. As weight decay is a standard regularization term, it is reasonable to expect a decreasing behavior of the generalization error. We observe a decreasing behavior of the mutual information as well, which is close enough to that of the generalization error. Again, the mutual information estimator over the training set performs as well as over the testing set.
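Although (60) is not used as the training regularizer here, it is the quantity evaluated when estimating the mutual information bound for the RBM. A minimal sketch for one latent unit, with illustrative tensor names:

```python
import torch

def rbm_unit_kl(p_i, p_all):
    """KL in (60) for one Bernoulli latent unit: p_i = sigma(b_j + <w_j, x_i>)
    is the unit's activation probability for sample x_i, and p_all is the
    1-D tensor of activation probabilities over the whole training set,
    whose mean gives the aggregated prior q~_{U_j}."""
    p_bar = p_all.mean()  # (1/n) sum_k sigma(b_j + <w_j, x_k>)
    return (p_i * torch.log(p_i / p_bar)
            + (1 - p_i) * torch.log((1 - p_i) / (1 - p_bar)))
```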
7 Summary and Concluding Remarks

In this work we presented a theoretical study of a typical classification task where the learner is interested in the behaviour of the generalization error between the expected cross-entropy risk and its empirical approximation measured with respect to the training set. The main result is stated in Theorem 1, which shows that the generalization error can be upper bounded (with high probability) by two major terms: one depending on the mutual information between the input and the corresponding latent representations (across the parametric class of stochastic encoders in use), and a measure related to the decoder's efficiency, among other factors. Our proof borrows tools from concepts of robust algorithms, introducing a discretization of the input (feature) space. Besides this, we provided formal support to show that the critical assumptions needed for the proof of Theorem 1 can be easily fulfilled in practice by popular classes of encoders (cf. Section 5.1).

Experimental results on real-image datasets, using well-known encoder models based on deep neural networks, were used to validate the existence of the statistical regime predicted by Theorem 1, where the mutual information between input features and the corresponding latent representations is able to predict relatively well the behaviour of the generalization error with respect to some important structural parameters of the learning algorithm, e.g., the Lagrange multiplier weighting the regularization term in the cross-entropy loss. Of course, further numerical analysis will be needed, in particular exploring more sophisticated deep neural network models, but our results indicate that regularization by mutual information can have a direct influence on the generalization error.
A Proof of Theorem 1
In this Appendix we will prove the main theorem of this work. We will use some well-known results, listed in Appendix C. Beforebeginning some comments regarding the general approach to be followed in the proof are presented:1. Statistical dependence between the training set S n and the parameters θ (or equivalent the encoder/decoder pair) is amajor issue when trying to obtain bounds on the probability tails of the generalization error. The first step to avoid thiscomplication is to use the totally bounded hypothesis about class loss Ω (item 4 of Assumptions 1) to break-down theabove mentioned dependency with a worst-case criterion. This is approach is common in the classical statistical learningtheory.2. While our decision rule depends on jointly on the encoder and decoder, the mutual information term will only dependson the encoder. For this reason we will decouple the influence of the decoder in our bound in a different term.3. We will look for a relationship between error gap (to be defined in Section A.1) and mutual information, which is a information-theoretic measure. Although mutual information is well-defined for continuous alphabets (under some reg-ularity conditions), the case for discrete alphabets is very important and generally, easier to handle. This is the casein Shannon theory (Cover and Thomas, 2006) and also in some results in learning theory.For example in (Shamir et al.,2010), authors bounds the deviation over the mutual information between labels Y and hidden representations U , throughthe mutual information between hidden representations and inputs X with discrete alphabets. Of course, through the useof variational representations, like the Donsker-Varadhan formula (Donsker and Varadhan, 1983), mutual informationfor continuous alphabets can be easily introduced in learning problems (Russo and Zou, 2015; Xu and Raginsky, 2017).However, such results lead to mutual information terms between the inputs and the outputs of the learning algorithm.Although such results are very interesting, it is often very difficult to model the conditional distribution of the outputof the algorithm given the input samples (which is needed for the full characterization of the mutual information term).Our approach here is different. Inspired by the information bottleneck criterion, we consider the effect on the general-ization error, of the mutual information between the inputs X and the the representations U , which are generated usingthe parametric class on encoders. An easier way to obtain such term is considering that the input space is discrete (as in(Shamir et al., 2010)). As this hypothesis is not easy to justify in a typical learning problem, we will achieve our desiredresult through a careful discretization of the input space and the use, in a final step, of the well-known Data ProcessingInequality . The approach used for the discretization of the input space has points in common with the robust-algorithmstheory (Xu and Mannor, 2012). A.1 Error-gap: A Worst-Case Bound for the Generalization Error
A.1 Error-gap: A Worst-Case Bound for the Generalization Error

Our first step is to bound the generalization error by a worst-case criterion using the hypothesis on $\Omega$ (item 4 of Assumptions 1). As a matter of fact, the Data Processing Inequality is not strictly needed here: it suffices to use the well-known monotonicity properties of f-divergence measures (of which the mutual information is a special case) (Pinsker, 1964).
Lemma 3
Under Assumptions 1, let $\mathcal{F}^{(\varepsilon)}_\Omega = \{\theta_i\}_{i=1}^{|\mathcal{F}^{(\varepsilon)}_\Omega|} \subset \Theta$, with $|\mathcal{F}^{(\varepsilon)}_\Omega| < \infty$, be the $\varepsilon$-net introduced in (25) for $\varepsilon > 0$. Let $\delta > 0$ and $\alpha_\delta$ be such that

$$\min_{\theta \in \mathcal{F}^{(\varepsilon)}_\Omega} \mathbb{P}\big(\mathcal{E}_{\mathrm{gap}}(f_\theta, S_n) \le \alpha_\delta - 2\varepsilon\big) > 1 - \delta, \qquad (62)$$

where the error gap is defined as

$$\mathcal{E}_{\mathrm{gap}}(f_\theta, S_n) = \big|\mathcal{L}(f_\theta) - \mathcal{L}_{\mathrm{emp}}(f_\theta, S_n)\big|. \qquad (63)$$

Then the generalization error is upper bounded, with probability at least $1-\delta$, by $\alpha_{\delta/|\mathcal{F}^{(\varepsilon)}_\Omega|}$, i.e.,

$$\mathbb{P}\Big(|\mathcal{E}_{\mathrm{gen\text{-}err}}(S_n)| \le \alpha_{\delta/|\mathcal{F}^{(\varepsilon)}_\Omega|}\Big) \ge 1 - \delta. \qquad (64)$$

Remark: Notice that we introduce what we have called the error gap in order to provide a bound for the generalization error. The error gap is defined for fixed parameters, i.e., independently of the training set. In this sense, the error gap is not the same as the generalization error, but both concepts are related. This is similar to Vapnik's approach (Devroye et al., 1997, Chapter 12), in which the generalization error can be bounded by a worst case over the error gap:

$$|\mathcal{E}_{\mathrm{gen\text{-}err}}(S_n)| \le \sup_{\theta\in\Theta} \mathcal{E}_{\mathrm{gap}}(f_\theta, S_n). \qquad (65)$$

Proof: As $\Omega$ is totally bounded, there exists $\theta^* \in \mathcal{F}^{(\varepsilon)}_\Omega$ such that

$$\sup_{x\in\mathcal{X}} \max_{y\in\mathcal{Y}} \big|\ell_{\hat\theta_n}(x,y) - \ell_{\theta^*}(x,y)\big| < \varepsilon, \qquad (66)$$

where $\hat\theta_n$ is the minimizer over $\Omega$ of the empirical risk as defined in (9). Then,

$$\big|\mathcal{L}(f_{\hat\theta_n}) - \mathcal{L}(f_{\theta^*})\big| \le \varepsilon, \qquad (67)$$
$$\big|\mathcal{L}_{\mathrm{emp}}(f_{\hat\theta_n}, S_n) - \mathcal{L}_{\mathrm{emp}}(f_{\theta^*}, S_n)\big| \le \varepsilon. \qquad (68)$$

The generalization error can be bounded using the triangle inequality:

$$|\mathcal{E}_{\mathrm{gen\text{-}err}}(S_n)| = \big|\mathcal{L}(f_{\hat\theta_n}) - \mathcal{L}_{\mathrm{emp}}(f_{\hat\theta_n}, S_n)\big| \qquad (69)$$
$$\le \big|\mathcal{L}(f_{\theta^*}) - \mathcal{L}_{\mathrm{emp}}(f_{\theta^*}, S_n)\big| + 2\varepsilon \qquad (70)$$
$$\le \max_{\theta\in\mathcal{F}^{(\varepsilon)}_\Omega} \big|\mathcal{L}(f_\theta) - \mathcal{L}_{\mathrm{emp}}(f_\theta, S_n)\big| + 2\varepsilon \qquad (71)$$
$$= \max_{\theta\in\mathcal{F}^{(\varepsilon)}_\Omega} \mathcal{E}_{\mathrm{gap}}(f_\theta, S_n) + 2\varepsilon, \qquad (72)$$

which allows us to write:

$$\mathbb{P}\big(|\mathcal{E}_{\mathrm{gen\text{-}err}}(S_n)| > \alpha_\delta\big) \le \mathbb{P}\Big(\max_{\theta\in\mathcal{F}^{(\varepsilon)}_\Omega} \mathcal{E}_{\mathrm{gap}}(f_\theta, S_n) > \alpha_\delta - 2\varepsilon\Big) \qquad (73)$$
$$\le \sum_{\theta\in\mathcal{F}^{(\varepsilon)}_\Omega} \mathbb{P}\big(\mathcal{E}_{\mathrm{gap}}(f_\theta, S_n) > \alpha_\delta - 2\varepsilon\big) \qquad (74)$$
$$\le \big|\mathcal{F}^{(\varepsilon)}_\Omega\big| \max_{\theta\in\mathcal{F}^{(\varepsilon)}_\Omega} \mathbb{P}\big(\mathcal{E}_{\mathrm{gap}}(f_\theta, S_n) > \alpha_\delta - 2\varepsilon\big) \qquad (75)$$
$$\le \big|\mathcal{F}^{(\varepsilon)}_\Omega\big|\,\delta. \qquad (76)$$

Finally, replacing $\delta$ by $\delta/|\mathcal{F}^{(\varepsilon)}_\Omega|$ in (62), we conclude that $\alpha_{\delta/|\mathcal{F}^{(\varepsilon)}_\Omega|}$ is an upper bound, with probability at least $1-\delta$, on the generalization error.
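The union-bound argument (73)-(76) can be visualized with a small Monte Carlo experiment; the finite "net" of parameters and the loss family below are hypothetical stand-ins for $\mathcal{F}^{(\varepsilon)}_\Omega$ and $\ell_\theta$.

# Monte Carlo illustration of the union bound in Lemma 3: for a finite net of
# parameters, P(max_theta gap > t) <= |F| * max_theta P(gap > t).
# The loss family below is a hypothetical stand-in.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 5000
thetas = np.linspace(0.1, 0.9, 8)                 # epsilon-net of 8 "parameters"

# loss(theta, z) = 1{z <= theta}; the true risk is theta for Z ~ Uniform(0,1)
z = rng.random((trials, n))
emp_risk = (z[None, :, :] <= thetas[:, None, None]).mean(axis=2)  # (8, trials)
gaps = np.abs(emp_risk - thetas[:, None])

t = 0.08
p_single = (gaps > t).mean(axis=1)                # per-theta tail probabilities
p_max = (gaps.max(axis=0) > t).mean()             # tail of the worst case
print(f"max_theta P(gap>t) = {p_single.max():.4f}")
print(f"P(max gap>t)       = {p_max:.4f} <= |F|*max = {len(thetas)*p_single.max():.4f}")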
A.2 Decoupling the Decoder's Influence

We now decouple the effect of the decoder from the error gap. The following lemma allows us to achieve this:
Lemma 4
Under Assumptions 1, the error gap can be bounded as

$$\mathcal{E}_{\mathrm{gap}}\big(f_\theta, S_n\big) \le \mathcal{E}_{\mathrm{gap}}\big(q^\theta_{U|X}, Q^\theta_{Y|U}, S_n\big) + d(S_n), \qquad (77)$$

where $Q^\theta_{Y|U}(y|u)$ is the decoder induced by the encoder given in (17), which we rewrite as

$$Q^\theta_{Y|U}(y|u) = \frac{P_Y(y)\,\mathbb{E}_{p_{X|Y}}\big[q^\theta_{U|X}(u|X)\,\big|\,Y=y\big]}{\mathbb{E}_{p_X}\big[q^\theta_{U|X}(u|X)\big]}, \qquad (78)$$

and where $d(S_n)$ is

$$d(S_n) = \bigg|\frac{1}{n}\sum_{i=1}^n T_\theta(x_i, y_i) - \mathbb{E}_{q^\theta_U}\Big[\mathrm{KL}\big(Q^\theta_{Y|U}\,\big\|\,Q^\theta_{\hat Y|U}\big)\Big]\bigg|, \qquad (79)$$

with $T_\theta(x,y) := \mathbb{E}_{q^\theta_{U|X}}\Big[\log\frac{Q^\theta_{Y|U}(y|U)}{Q^\theta_{\hat Y|U}(y|U)}\,\Big|\,X=x\Big]$.

Proof: It is easy to see that

$$\ell_\theta(x,y) = \tilde\ell_\theta(x,y) + T_\theta(x,y), \qquad (80)$$

where, as we already know, $\ell_\theta(x,y)$ is the cross-entropy loss function (depending on the specific choice of the encoder and decoder) and $\tilde\ell_\theta(x,y)$ is the modified cross-entropy function given by

$$\tilde\ell_\theta(x,y) \equiv \mathbb{E}_{q^\theta_{U|X}}\big[-\log Q^\theta_{Y|U}(y|U)\,\big|\,X=x\big].$$

Notice that the main difference between these two cross-entropies is that, while $\ell_\theta(x,y)$ depends on both the encoder and the decoder chosen from the parametric class, $\tilde\ell_\theta(x,y)$ depends only on the encoder. In this case the decoder is given by (78), i.e., the optimum decoder implied by the information bottleneck criterion as discussed in Section 3. Taking expectations in (80) with respect to $p_{XY}$ and with respect to the empirical distribution of the samples $S_n$, we obtain, respectively,

$$\mathcal{L}(f_\theta) = \mathbb{E}_{p_{XY}}\big[\tilde\ell_\theta(X,Y)\big] + \mathbb{E}_{q^\theta_U}\Big[\mathrm{KL}\big(Q^\theta_{Y|U}\,\big\|\,Q^\theta_{\hat Y|U}\big)\Big]$$

and

$$\mathcal{L}_{\mathrm{emp}}(f_\theta, S_n) = \frac{1}{n}\sum_{i=1}^n \tilde\ell_\theta(x_i, y_i) + \frac{1}{n}\sum_{i=1}^n T_\theta(x_i, y_i),$$

where we have used that $\mathbb{E}_{p_{XY}}\big[T_\theta(X,Y)\big] = \mathbb{E}_{q^\theta_U}\big[\mathrm{KL}\big(Q^\theta_{Y|U}\|Q^\theta_{\hat Y|U}\big)\big]$. Subtracting these two equations and taking absolute values on both sides, we obtain

$$\mathcal{E}_{\mathrm{gap}}\big(f_\theta, S_n\big) = \bigg|\mathbb{E}_{p_{XY}}\big[\tilde\ell_\theta(X,Y)\big] - \frac{1}{n}\sum_{i=1}^n\big[\tilde\ell_\theta(x_i,y_i) + T_\theta(x_i,y_i)\big] + \mathbb{E}_{q^\theta_U}\Big[\mathrm{KL}\big(Q^\theta_{Y|U}\,\big\|\,Q^\theta_{\hat Y|U}\big)\Big]\bigg|.$$

Defining the modified error gap as

$$\mathcal{E}_{\mathrm{gap}}\big(q^\theta_{U|X}, Q^\theta_{Y|U}, S_n\big) \equiv \bigg|\mathbb{E}_{p_{XY}}\big[\tilde\ell_\theta(X,Y)\big] - \frac{1}{n}\sum_{i=1}^n \tilde\ell_\theta(x_i,y_i)\bigg| \qquad (81)$$

and using the triangle inequality, we obtain the desired result.

The most important fact in this simple lemma is that the decoder's influence is captured in the term $d(S_n)$ (which is a simple deviation), while the encoder's influence is captured through the definition of the modified loss function $\tilde\ell_\theta(x,y)$ and the corresponding gap $\mathcal{E}_{\mathrm{gap}}(q^\theta_{U|X}, Q^\theta_{Y|U}, S_n)$. It is important to note the role of the optimal decoder $Q^\theta_{Y|U}$ matched to the encoder $q^\theta_{U|X}$ in both the gap $\mathcal{E}_{\mathrm{gap}}(q^\theta_{U|X}, Q^\theta_{Y|U}, S_n)$ and the term $d(S_n)$, where the distance between the parametric decoder $Q^\theta_{\hat Y|U}$ and $Q^\theta_{Y|U}$ (in terms of the Kullback-Leibler divergence) is explicit.
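For a fully discrete toy model, the decomposition (80) and the induced decoder (78) can be checked directly; the alphabet sizes and the randomly drawn distributions below are hypothetical.

# Numerical check of the decomposition (80) for a discrete toy model:
# ell(x,y) = ell_tilde(x,y) + T(x,y), with the induced decoder of Eq. (78).
# All distributions below are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
nx, ny, nu = 4, 3, 5
p_xy = rng.random((nx, ny)); p_xy /= p_xy.sum()                      # joint p(x,y)
q_u_x = rng.random((nx, nu)); q_u_x /= q_u_x.sum(1, keepdims=True)   # encoder q(u|x)
dec_hat = rng.random((nu, ny)); dec_hat /= dec_hat.sum(1, keepdims=True)  # decoder Q_hat(y|u)

# Induced (IB-optimal) decoder: Q(y|u) proportional to sum_x p(x,y) q(u|x)
q_uy = p_xy.T @ q_u_x                                                # shape (ny, nu)
dec_opt = (q_uy / q_uy.sum(0, keepdims=True)).T                      # shape (nu, ny)

ell       = -(q_u_x @ np.log(dec_hat))                               # ell(x,y), shape (nx, ny)
ell_tilde = -(q_u_x @ np.log(dec_opt))                               # modified cross-entropy
T         =  q_u_x @ np.log(dec_opt / dec_hat)                       # correction term T(x,y)
print(np.allclose(ell, ell_tilde + T))                               # True: Eq. (80)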
A.3 Analysis of the Term $\mathcal{E}_{\mathrm{gap}}$

Now we can focus on bounding $\mathcal{E}_{\mathrm{gap}}(q^\theta_{U|X}, Q^\theta_{Y|U}, S_n)$, which depends only on the encoder.
In order to do that, we consider the previously introduced discretization of the input space $\mathcal{X}$. Let us define, for each $y\in\mathcal{Y}$, a finite $\mathcal{B}_{\mathcal{X}}$-measurable partition of the feature space $\mathcal{X}$ into $K$ cells $\{\mathcal{K}^{(y)}_k\}_{k=1}^K$ satisfying

$$\bigcup_{k=1}^K \mathcal{K}^{(y)}_k \equiv \mathcal{X}, \qquad \mathcal{K}^{(y)}_i \cap \mathcal{K}^{(y)}_j = \emptyset\ \ \forall\, 1\le i<j\le K, \qquad \int_{\mathcal{K}^{(y)}_k} dx > 0\ \ \forall\, 1\le k\le K.$$

In addition, let $\{x^{(k,y)}\}_{k=1}^K$ denote the respective centroids for each $y\in\mathcal{Y}$, so that the partition family $\mathcal{K}$ is defined as in (26). This partition induces the probability distribution $P^D_{XY}$ in (28), whose marginal pmfs are $P_Y$ (the true one) and $P^D_X(x) = \sum_{y\in\mathcal{Y}} P^D_{XY}(x,y)$, which has the elements of the set $\mathcal{A} = \{x^{(k,y)} : 1\le k\le K,\ y\in\mathcal{Y}\}$ as atoms. The distribution $P^D_{XY}$ and the encoder $q^\theta_{U|X}$ define a probability measure from which $Q^{D,\theta}_{Y|U}$ in (30) is obtained. Also, $q^{D,\theta}_U$ is given by

$$q^{D,\theta}_U(u) := \sum_{k=1}^K \sum_{y\in\mathcal{Y}} q^\theta_{U|X}\big(u\,\big|\,x^{(k,y)}\big)\, P_{X|Y}\big(\mathcal{K}^{(y)}_k\,\big|\,y\big)\, P_Y(y). \qquad (82)$$

The discretization procedure introduces the magnitudes $\epsilon(\mathcal{K})$ and $r(\mathcal{K})$ defined in (38). From the above definitions, it is implicitly understood that the partition of $\mathcal{X}$ is independent of the parametric class of modified loss functions. However, $\epsilon(\mathcal{K})$ and $r(\mathcal{K})$ depend on the parametric class of encoders and on the true input distribution, respectively.
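A minimal sketch of this discretization for a one-dimensional input space, dropping the dependence on the label $y$ for brevity; the input distribution and the Gaussian encoder below are hypothetical.

# Sketch of the input-space discretization: K cells on [0,1], their centroids,
# the induced atomic pmf P^D_X, and q^D_U(u) of Eq. (82). The encoder is a
# hypothetical Gaussian with mean mu(x) = 2x - 1 and fixed sigma.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x_samples = rng.beta(2, 5, size=10000)          # stand-in for the true p_X
K = 20
edges = np.linspace(0, 1, K + 1)
centroids = 0.5 * (edges[:-1] + edges[1:])      # x^{(k)} for each cell
p_d = np.histogram(x_samples, bins=edges)[0] / len(x_samples)  # P^D_X atoms

sigma = 0.3
u = np.linspace(-3, 3, 1001)
q_u_given_k = norm.pdf(u[None, :], (2 * centroids - 1)[:, None], sigma)
q_d_u = p_d @ q_u_given_k                       # Eq. (82): mixture over the atoms
print(f"sum of P^D_X atoms = {p_d.sum():.3f}, integral of q^D_U ~ {np.trapz(q_d_u, u):.3f}")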
We are ready to establish the following lemma:

Lemma 5
Under the set of Assumptions 1, let $\mathcal{K}$ be a partition family. The modified error gap (81) can be bounded as

$$\mathcal{E}_{\mathrm{gap}}\big(q^\theta_{U|X}, Q^\theta_{Y|U}, S_n\big) \le 2\,\epsilon(\mathcal{K}) + \mathcal{E}^D_{\mathrm{gap}}\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}, S_n\big),$$

where $\epsilon(\mathcal{K})$ was defined in (38), $Q^{D,\theta}_{Y|U}$ in (30), and $\mathcal{E}^D_{\mathrm{gap}}\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}, S_n\big)$ is defined as

$$\mathcal{E}^D_{\mathrm{gap}}\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}, S_n\big) = \Big|\mathcal{L}^D\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}\big) - \mathcal{L}^D_{\mathrm{emp}}\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}, S_n\big)\Big|, \qquad (83)$$

where

$$\mathcal{L}^D\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}\big) = \sum_{k=1}^K \sum_{y\in\mathcal{Y}} P^D_{XY}(k,y)\,\ell^D_\theta\big(x^{(k,y)}, y\big), \qquad (84)$$

$$\mathcal{L}^D_{\mathrm{emp}}\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}, S_n\big) = \frac{1}{n}\sum_{k=1}^K \sum_{\substack{i\in[1:n]\\ x_i\in\mathcal{K}^{(y_i)}_k}} \ell^D_\theta\big(x^{(k,y_i)}, y_i\big), \qquad (85)$$

$P^D_{XY}$ is given in (28) and $\ell^D_\theta$ in (29).

Proof: The triangle inequality allows us to write

$$\mathcal{E}_{\mathrm{gap}}\big(q^\theta_{U|X}, Q^\theta_{Y|U}, S_n\big) \le \Big|\mathcal{L}^D\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}\big) - \mathcal{L}^D_{\mathrm{emp}}\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}, S_n\big)\Big| + \Big|\mathcal{L}\big(q^\theta_{U|X}, Q^\theta_{Y|U}\big) - \mathcal{L}^D\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}\big)\Big| + \Big|\mathcal{L}^D_{\mathrm{emp}}\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}, S_n\big) - \mathcal{L}_{\mathrm{emp}}\big(q^\theta_{U|X}, Q^\theta_{Y|U}, S_n\big)\Big|, \qquad (86)$$

where the first term is $\mathcal{E}^D_{\mathrm{gap}}\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}, S_n\big)$. The other terms in (86) can be bounded using the fact that $P^D_{XY}\big(x^{(k,y)},y\big) = \mathbb{P}\big(X\in\mathcal{K}^{(Y)}_k, Y=y\big)$ and the definition of $\epsilon(\mathcal{K})$ in (38). For example, for the second term we can write

$$\Big|\mathcal{L}\big(q^\theta_{U|X}, Q^\theta_{Y|U}\big) - \mathcal{L}^D\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}\big)\Big| = \bigg|\sum_{k=1}^K \sum_{y=1}^{|\mathcal{Y}|} P^D_{XY}\big(x^{(k,y)},y\big)\Big(\mathbb{E}_{p_{XY}}\big[\tilde\ell_\theta(X,Y)\,\big|\,Y=y, X\in\mathcal{K}^{(Y)}_k\big] - \ell^D_\theta\big(x^{(k,y)},y\big)\Big)\bigg| \qquad (87)$$
$$\le \epsilon(\mathcal{K}). \qquad (88)$$

The third term is treated similarly to show that

$$\Big|\mathcal{L}^D_{\mathrm{emp}}\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}, S_n\big) - \mathcal{L}_{\mathrm{emp}}\big(q^\theta_{U|X}, Q^\theta_{Y|U}, S_n\big)\Big| \le \epsilon(\mathcal{K}). \qquad (89)$$

In order to make progress in the proof of the theorem, we introduce the following notation for the empirical distributions $\widehat{P}^D_{XY}$, $\widehat{P}^D_X$, $\widehat{P}_Y$, $\widehat{P}^D_{X|Y}$, defined as the occurrence rates in $S_n$; e.g.,

$$\widehat{P}^D_{XY}(k,y) = \frac{\big|\{(x_i,y_i)\in S_n : y_i=y,\ x_i\in\mathcal{K}^{(y)}_k\}\big|}{n}.$$

Also, we define the distributions $\widehat{q}^{D,\theta}_U$, $\widehat{Q}^{D,\theta}_{Y|U}$, $\widehat{q}^{D,\theta}_{U|Y}$ induced by the encoder $q^\theta_{U|X}$ and the empirical distribution $\widehat{P}^D_{XY}$. In addition, the notation used in the main body of the paper for information quantities, such as entropy or mutual information, is insufficient here, since it does not specify clearly the distribution with which the random variables are sampled. For this reason we introduce a change of notation for such quantities: $H(Y|U_\theta) \equiv H\big(Q^\theta_{Y|U}\,\big|\,q^\theta_U\big)$ and $I(U_\theta(X); X) \equiv I\big(p_X; q^\theta_{U|X}\big)$.
Lemma 6
Under Assumptions 1, the gap $\mathcal{E}^D_{\mathrm{gap}}\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}, S_n\big)$ can be bounded as

$$\mathcal{E}^D_{\mathrm{gap}}\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}, S_n\big) \le \mathrm{KL}\big(\widehat{P}^D_{XY}\,\big\|\,P^D_{XY}\big) + \int_{\mathcal{U}} \phi\Big(\big\|P^D_X - \widehat{P}^D_X\big\|_2\,\sqrt{V\big(q^\theta_{U|X}(u|\cdot)\big)}\Big)\,du + \log\bigg(\frac{(4\pi e S)^{d_u/2}}{P_Y(y_{\min})}\bigg)\sqrt{|\mathcal{Y}|}\,\big\|P_Y - \widehat{P}_Y\big\|_2 + O\Big(\big\|P_Y - \widehat{P}_Y\big\|_2^2\Big) + \mathbb{E}_{P_Y}\bigg[\int_{\mathcal{U}} \phi\Big(\big\|P^D_{X|Y}(\cdot|Y) - \widehat{P}^D_{X|Y}(\cdot|Y)\big\|_2\,\sqrt{V\big(q^\theta_{U|X}(u|\cdot)\big)}\Big)\,du\bigg], \qquad (90)$$

where $\phi(\cdot)$ is defined as

$$\phi(x) = \begin{cases} 0 & x \le 0 \\ -x\log(x) & 0 < x < e^{-1} \\ e^{-1} & x \ge e^{-1} \end{cases} \qquad (91)$$

and $V(\cdot)$ is defined as

$$V(c) := \big\|c - \bar{c}\,\mathbf{1}_a\big\|_2^2, \qquad (92)$$

with $c\in\mathbb{R}^a$, $a\in\mathbb{N}$, $\bar{c} = \frac{1}{a}\sum_{i=1}^a c_i$, and $\mathbf{1}_a$ the vector of ones of length $a$.

Remark: The notation $P_Y$ is used to regard the pmf $P_Y$ as a vector $P_Y = [P_Y(1), \dots, P_Y(|\mathcal{Y}|)]$, so that we can apply to it the Euclidean norm $\|\cdot\|_2$ and the operator $V(\cdot)$. Notice that this is well-defined as soon as the support of the considered pmfs is finite and discrete.

Proof: Adding and subtracting $\widehat{P}^D_{XY}\big(x^{(k,y)},y\big)\,\mathbb{E}_{q^\theta_{U|X}}\Big[\log\widehat{Q}^{D,\theta}_{Y|U}(y|U)\,\Big|\,X=x^{(k,y)}\Big]$, the triangle inequality allows us to write

$$\mathcal{E}^D_{\mathrm{gap}}\big(q^\theta_{U|X}, Q^{D,\theta}_{Y|U}, S_n\big) = \bigg|\sum_{\forall(k,y)} \Big[P^D_{XY}\big(x^{(k,y)},y\big) - \widehat{P}^D_{XY}\big(x^{(k,y)},y\big)\Big]\,\mathbb{E}_{q^\theta_{U|X}}\Big[\log Q^{D,\theta}_{Y|U}(y|U)\,\Big|\,X=x^{(k,y)}\Big]\bigg|$$
$$\le \Big|H\big(Q^{D,\theta}_{Y|U}\,\big|\,q^{D,\theta}_U\big) - H\big(\widehat{Q}^{D,\theta}_{Y|U}\,\big|\,\widehat{q}^{D,\theta}_U\big)\Big| + \mathbb{E}_{\widehat{q}^{D,\theta}_U}\Big[\mathrm{KL}\big(\widehat{Q}^{D,\theta}_{Y|U}\,\big\|\,Q^{D,\theta}_{Y|U}\big)\Big]. \qquad (93)$$

We can bound the second term in (93) using the inequality

$$\mathbb{E}_{\widehat{q}^{D,\theta}_U}\Big[\mathrm{KL}\big(\widehat{Q}^{D,\theta}_{Y|U}\,\big\|\,Q^{D,\theta}_{Y|U}\big)\Big] \le \mathrm{KL}\big(\widehat{Q}^{D,\theta}_{Y|U}\,\widehat{q}^{D,\theta}_U\,\big\|\,Q^{D,\theta}_{Y|U}\,q^{D,\theta}_U\big) \le \mathrm{KL}\big(\widehat{P}^D_{XY}\,\big\|\,P^D_{XY}\big). \qquad (94)$$

The first term of (93) can be bounded as

$$\Big|H\big(Q^{D,\theta}_{Y|U}\,\big|\,q^{D,\theta}_U\big) - H\big(\widehat{Q}^{D,\theta}_{Y|U}\,\big|\,\widehat{q}^{D,\theta}_U\big)\Big| \le \Big|H(P_Y) - H(\widehat{P}_Y)\Big| + \Big|H_d\big(q^{D,\theta}_U\big) - H_d\big(\widehat{q}^{D,\theta}_U\big)\Big| + \Big|H_d\big(q^{D,\theta}_{U|Y}\,\big|\,P_Y\big) - H_d\big(\widehat{q}^{D,\theta}_{U|Y}\,\big|\,\widehat{P}_Y\big)\Big|, \qquad (95)$$

where $H_d$ denotes the differential entropy. The terms $\big|H_d\big(q^{D,\theta}_U\big) - H_d\big(\widehat{q}^{D,\theta}_U\big)\big|$ and $\big|H_d\big(q^{D,\theta}_{U|Y}|P_Y\big) - H_d\big(\widehat{q}^{D,\theta}_{U|Y}|\widehat{P}_Y\big)\big|$ can be bounded by Lemmas 11 and 13, respectively.
Finally, it is clear that $P_Y \mapsto H(P_Y)$ is differentiable, and a first-order Taylor expansion yields

$$H(P_Y) - H(\widehat{P}_Y) = \bigg\langle \frac{\partial H(P_Y)}{\partial P_Y},\ P_Y - \widehat{P}_Y \bigg\rangle + O\Big(\big\|P_Y - \widehat{P}_Y\big\|_2^2\Big), \qquad (96)$$

where $\frac{\partial H(P_Y)}{\partial P_Y(y)} = -\log P_Y(y) - 1$ for each $y\in\mathcal{Y}$. Then, applying the Cauchy-Schwarz inequality, the lemma is proved:

$$\big|H(P_Y) - H(\widehat{P}_Y)\big| \le \big|\big\langle \log P_Y,\ P_Y - \widehat{P}_Y \big\rangle\big| + O\Big(\big\|P_Y - \widehat{P}_Y\big\|_2^2\Big) \qquad (97)$$
$$\le \big\|\log P_Y\big\|_2\,\big\|P_Y - \widehat{P}_Y\big\|_2 + O\Big(\big\|P_Y - \widehat{P}_Y\big\|_2^2\Big) \qquad (98)$$
$$\le \log\bigg(\frac{1}{P_Y(y_{\min})}\bigg)\sqrt{|\mathcal{Y}|}\,\big\|P_Y - \widehat{P}_Y\big\|_2 + O\Big(\big\|P_Y - \widehat{P}_Y\big\|_2^2\Big). \qquad (99)$$

The combination of Lemmas 4, 5 and 6 allows us to bound the error gap.
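The first-order entropy perturbation bound (97)-(99) is easy to probe numerically; the pmf below is hypothetical.

# Numerical probe of the entropy perturbation bound (99):
# |H(P) - H(P_hat)| versus log(1/P_min) * sqrt(|Y|) * ||P - P_hat||_2.
import numpy as np

rng = np.random.default_rng(3)
p = np.array([0.5, 0.25, 0.15, 0.10])           # hypothetical pmf P_Y
H = lambda q: float(-np.sum(q[q > 0] * np.log(q[q > 0])))

for n in [100, 1000, 10000]:
    p_hat = rng.multinomial(n, p) / n
    lhs = abs(H(p) - H(p_hat))
    rhs = np.log(1 / p.min()) * np.sqrt(len(p)) * np.linalg.norm(p - p_hat)
    print(f"n={n:6d}  |H-H_hat|={lhs:.4f}  first-order bound={rhs:.4f}")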
A.4 Bounds Related to Concentration Inequalities
Many terms in the above lemmas are deviations of random variables from their means. As such, they can be analyzed with well-known concentration inequalities. This is the case, simultaneously, for $\mathrm{KL}\big(\widehat{P}^D_{XY}\|P^D_{XY}\big)$, $\big\|P^D_X - \widehat{P}^D_X\big\|_2$, $\big\|P_Y - \widehat{P}_Y\big\|_2$, $\big\|P^D_{X|Y}(\cdot|y) - \widehat{P}^D_{X|Y}(\cdot|y)\big\|_2$ for $y\in\mathcal{Y}$, and $d(S_n)$. With probability at least $1-\delta$, applying Lemmas 9 and 10 and the Chebyshev inequality (Devroye et al., 1997, Theorem A.16), we obtain:

$$\mathrm{KL}\big(\widehat{P}^D_{XY}\,\big\|\,P^D_{XY}\big) \le |\mathcal{X}||\mathcal{Y}|\,\frac{\log(n+1)}{n} + \frac{1}{n}\log\bigg(\frac{|\mathcal{Y}|+4}{\delta}\bigg) = O\Big(\frac{\log(n)}{n}\Big), \qquad (100)$$

$$\max\Big\{\big\|P_Y - \widehat{P}_Y\big\|_2,\ \big\|P^D_X - \widehat{P}^D_X\big\|_2,\ \big\|P^D_{X|Y}(\cdot|y) - \widehat{P}^D_{X|Y}(\cdot|y)\big\|_2\Big\} \le \frac{\sqrt{\log\big(\frac{|\mathcal{Y}|+4}{\delta}\big)}}{\sqrt{n}} \equiv \frac{\mathrm{B}_\delta}{\sqrt{n}}, \qquad (101)$$

$$d(S_n) \le \sqrt{\frac{|\mathcal{Y}|+4}{n\,\delta}}\,\sqrt{\mathrm{Var}_{p_{XY}}\big(T_\theta(X,Y)\big)}. \qquad (102)$$
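A quick Monte Carlo check of the KL concentration behind (100) (an application of Lemma 9); the alphabet and pmf below are hypothetical.

# Monte Carlo check of the KL concentration used in (100): the event
# KL(P_hat||P) <= |X| log(n+1)/n + log(1/delta)/n should hold w.p. >= 1 - delta.
import numpy as np

rng = np.random.default_rng(4)
p = np.array([0.4, 0.3, 0.2, 0.1])              # hypothetical pmf
n, delta, trials = 500, 0.05, 2000
bound = len(p) * np.log(n + 1) / n + np.log(1 / delta) / n

def kl(q, p):
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

hits = sum(kl(rng.multinomial(n, p) / n, p) <= bound for _ in range(trials))
print(f"empirical coverage {hits/trials:.3f} (should be >= {1 - delta})")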
Using the concentration inequalities (100), (101) and (102), we have the following lemma for the error gap:

Lemma 7

Under Assumptions 1, for every $\delta\in(0,1)$ the gap satisfies:

$$\mathbb{P}\Bigg(\mathcal{E}_{\mathrm{gap}}(f_\theta, S_n) \le \inf_{\mathcal{K},\,\beta>0}\bigg\{2\,\epsilon(\mathcal{K}) + A_\delta\,\sqrt{r(\mathcal{K})\,I\big(P^D_X; q^\theta_{U|X}\big)}\,\frac{\log(n)}{\sqrt{n}} + \frac{2e^{-1}(1+\beta)\,g_{D,\theta}(\beta)}{\sqrt{n}}\Big[\sqrt{r(\mathcal{K})}\,\mathrm{B}_\delta\,\sqrt{I\big(P^D_X; q^\theta_{U|X}\big)}\Big]^{\frac{\beta}{1+\beta}}\bigg\} + \frac{C_\delta + D_\delta\,\sqrt{\mathbb{E}_{p_{XY}}\big[T_\theta(X,Y)^2\big]}}{\sqrt{n}} + O\Big(\frac{\log(n)}{n}\Big)\Bigg) \ge 1-\delta, \qquad (103)$$

where $g_{D,\theta}(\beta) = \sqrt{\mathbb{E}_{q^{D,\theta}_U}\Big[q^{D,\theta}_U(U)^{-\frac{\beta}{1+\beta}}\Big]}$.

Proof: Using Lemmas 4, 5, 6 and 12, with probability at least $1-\delta$ we have:

$$\mathcal{E}_{\mathrm{gap}}(f_\theta, S_n) \le 2\,\epsilon(\mathcal{K}) + \frac{D_\delta}{\sqrt{n}}\sqrt{\mathrm{Var}_{p_{XY}}\big(T_\theta(X,Y)\big)} + \log\bigg(\frac{(4\pi e S)^{d_u/2}}{P_Y(y_{\min})}\bigg)\sqrt{|\mathcal{Y}|}\,\frac{\mathrm{B}_\delta}{\sqrt{n}} + 2\int_{\mathcal{U}} \phi\bigg(\frac{\mathrm{B}_\delta}{\sqrt{n}}\sqrt{V\big(q^\theta_{U|X}(u|\cdot)\big)}\bigg)du + O\Big(\frac{\log(n)}{n}\Big) \qquad (104)$$

$$\le 2\,\epsilon(\mathcal{K}) + \frac{D_\delta}{\sqrt{n}}\sqrt{\mathbb{E}_{p_{XY}}\big[T_\theta(X,Y)^2\big]} + \log\bigg(\frac{(4\pi e S)^{d_u/2}}{P_Y(y_{\min})}\bigg)\sqrt{|\mathcal{Y}|}\,\frac{\mathrm{B}_\delta}{\sqrt{n}} + \frac{\log(n)}{\sqrt{n}}\,\mathrm{B}_\delta \int_{\mathcal{U}} \sqrt{V\big(q^\theta_{U|X}(u|\cdot)\big)}\,du + \frac{2(1+\beta)e^{-1}\,\mathrm{B}_\delta^{\frac{\beta}{1+\beta}}}{\sqrt{n}} \int_{\mathcal{U}} V\big(q^\theta_{U|X}(u|\cdot)\big)^{\frac{\beta}{2(1+\beta)}}\,du + O\Big(\frac{\log(n)}{n}\Big) \qquad (105)$$

for every $\beta>0$, where we have used that $\mathrm{Var}_{p_{XY}}\big(T_\theta(X,Y)\big) \le \mathbb{E}_{p_{XY}}\big[T_\theta(X,Y)^2\big]$. Next, we relate the mutual information $I\big(P^D_X; q^\theta_{U|X}\big)$ with $V\big(q^\theta_{U|X}(u|\cdot)\big)$. This follows from an application of Pinsker's inequality (Cover and Thomas, 2006, Lemma 11.6.1), $\|P_1 - P_2\|_1^2 \le 2\,\mathrm{KL}(P_1\|P_2)$, and the fact that $V(c) \le \|c - b\,\mathbf{1}_a\|_2^2$ for all $b\in\mathbb{R}$:

$$V\big(q^\theta_{U|X}(u|\cdot)\big) \le \sum_{x\in\mathcal{A}} \Big[q^\theta_{U|X}(u|x) - q^{D,\theta}_U(u)\Big]^2 \qquad (106)$$
$$= q^{D,\theta}_U(u)^2 \sum_{x\in\mathcal{A}} \bigg[\frac{Q^{D,\theta}_{X|U}(x|u)}{P^D_X(x)} - 1\bigg]^2 \qquad (107)$$
$$\le q^{D,\theta}_U(u)^2 \bigg(\sum_{x\in\mathcal{A}} \bigg|\frac{Q^{D,\theta}_{X|U}(x|u)}{P^D_X(x)} - 1\bigg|\bigg)^2 \qquad (108)$$
$$= q^{D,\theta}_U(u)^2 \bigg(\sum_{x\in\mathcal{A}} \frac{1}{P^D_X(x)}\Big|Q^{D,\theta}_{X|U}(x|u) - P^D_X(x)\Big|\bigg)^2 \qquad (109)$$
$$\le 2\,r(\mathcal{K})\,q^{D,\theta}_U(u)^2\,\mathrm{KL}\big(Q^{D,\theta}_{X|U}(\cdot|u)\,\big\|\,P^D_X\big), \qquad (110)$$

where the last step uses Pinsker's inequality and the definition of $r(\mathcal{K})$ in (38),
and where $Q^{D,\theta}_{X|U}(x|u) = q^\theta_{U|X}(u|x)\,P^D_X(x)\big/q^{D,\theta}_U(u)$. So, using Jensen's inequality,

$$\int_{\mathcal{U}} \sqrt{V\big(q^\theta_{U|X}(u|\cdot)\big)}\,du \le \sqrt{2\,r(\mathcal{K})}\,\mathbb{E}_{q^{D,\theta}_U}\bigg[\sqrt{\mathrm{KL}\big(Q^{D,\theta}_{X|U}(\cdot|U)\,\big\|\,P^D_X\big)}\bigg] \qquad (111)$$
$$\le \sqrt{2\,r(\mathcal{K})}\,\sqrt{I\big(P^D_X; q^\theta_{U|X}\big)}. \qquad (112)$$

Similarly, we proceed using the Cauchy-Schwarz inequality:

$$\int_{\mathcal{U}} V\big(q^\theta_{U|X}(u|\cdot)\big)^{\frac{\beta}{2(1+\beta)}}\,du \le \big(2\,r(\mathcal{K})\big)^{\frac{\beta}{2(1+\beta)}}\,\mathbb{E}_{q^{D,\theta}_U}\bigg[q^{D,\theta}_U(U)^{-\frac{\beta}{1+\beta}}\,\mathrm{KL}\big(Q^{D,\theta}_{X|U}(\cdot|U)\,\big\|\,P^D_X\big)^{\frac{\beta}{2(1+\beta)}}\bigg] \qquad (113)$$
$$\le \big(2\,r(\mathcal{K})\big)^{\frac{\beta}{2(1+\beta)}}\,g_{D,\theta}(\beta)\,\sqrt{\mathbb{E}_{q^{D,\theta}_U}\bigg[\mathrm{KL}\big(Q^{D,\theta}_{X|U}(\cdot|U)\,\big\|\,P^D_X\big)^{\frac{\beta}{1+\beta}}\bigg]} \qquad (114)$$
$$\le \big(2\,r(\mathcal{K})\big)^{\frac{\beta}{2(1+\beta)}}\,g_{D,\theta}(\beta)\cdot I\big(P^D_X; q^\theta_{U|X}\big)^{\frac{\beta}{2(1+\beta)}}. \qquad (115)$$

The lemma is finally proved by taking the infimum over the partition $\mathcal{K}$ and over $\beta>0$.

Remark: It is worth mentioning the differences between our result and those presented in (Shamir et al., 2010). While we work with the cross-entropy gap, they only bounded the mutual information gap:

$$\Big|I\big(q^{D,\theta}_U; Q^{D,\theta}_{Y|U}\big) - I\big(\widehat{q}^{D,\theta}_U; \widehat{Q}^{D,\theta}_{Y|U}\big)\Big| \le \Big|H_d\big(q^{D,\theta}_U\big) - H_d\big(\widehat{q}^{D,\theta}_U\big)\Big| + \Big|H_d\big(q^{D,\theta}_{U|Y}\,\big|\,P_Y\big) - H_d\big(\widehat{q}^{D,\theta}_{U|Y}\,\big|\,\widehat{P}_Y\big)\Big|. \qquad (116)$$

For this reason, our proofs are substantially different. In addition, we consider continuous representations for $U$, while they work with discrete and finite alphabets. Finally, in our case, some constants were subtly reduced.
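The chain (106)-(109), together with Pinsker's inequality, can be verified pointwise in $u$ for a small discretized input alphabet; the atoms and Gaussian encoder parameters below are hypothetical, and we deliberately check only the steps up to (109) plus Pinsker, not the final constant in (110).

# Pointwise check (in u) of the steps (106)-(109) plus Pinsker's inequality,
# for a 3-atom discretized input and hypothetical Gaussian encoders.
import numpy as np
from scipy.stats import norm

p = np.array([0.5, 0.3, 0.2])                    # P^D_X over atoms
mu, sig = np.array([-1.0, 0.0, 1.5]), np.array([0.7, 0.5, 0.9])

for u0 in [-1.0, 0.0, 2.0]:
    c = norm.pdf(u0, mu, sig)                    # vector q(u0|x) over atoms
    q_u = float(p @ c)                           # q^D_U(u0)
    Q_post = c * p / q_u                         # Q^D_{X|U}(.|u0), Bayes rule
    V = np.sum((c - c.mean())**2)                # definition (92)
    step109 = q_u**2 * np.sum(np.abs(Q_post / p - 1))**2
    kl = np.sum(Q_post * np.log(Q_post / p))
    l1sq = np.sum(np.abs(Q_post - p))**2         # ||Q - P||_1^2 <= 2 KL (Pinsker)
    print(f"u={u0:+.1f}: V={V:.3e} <= (109)={step109:.3e};  L1^2={l1sq:.3e} <= 2KL={2*kl:.3e}")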
We also have the following lemma:

Lemma 8

$$\sqrt{\mathbb{E}_{q^{D,\theta}_U}\Big[q^{D,\theta}_U(U)^{-\frac{\beta}{1+\beta}}\Big]} \le \sup_{x,z\in\mathcal{X}} \sqrt{\int_{\mathcal{U}} q^\theta_{U|X}(u|x)\,\Big(q^\theta_{U|X}(u|z)\Big)^{-\frac{\beta}{1+\beta}}\,du}. \qquad (117)$$

Proof
The function $f(x) = x^{-\frac{\beta}{1+\beta}}$ is convex for $x>0$ and all $\beta>0$, so we can use Jensen's inequality in (82):

$$q^{D,\theta}_U(u)^{-\frac{\beta}{1+\beta}} \le \sum_{k=1}^K \sum_{y\in\mathcal{Y}} P_{XY}\big(\mathcal{K}^{(y)}_k, y\big)\,\Big(q^\theta_{U|X}\big(u\,\big|\,x^{(k,y)}\big)\Big)^{-\frac{\beta}{1+\beta}}. \qquad (118)$$

With this inequality we can bound the expectation as:

$$\mathbb{E}_{q^{D,\theta}_U}\Big[q^{D,\theta}_U(U)^{-\frac{\beta}{1+\beta}}\Big] \le \sum_{k=1}^K \sum_{y\in\mathcal{Y}} P_{XY}\big(\mathcal{K}^{(y)}_k, y\big) \int_{\mathcal{U}} \sum_{l=1}^K \sum_{y'\in\mathcal{Y}} P_{XY}\big(\mathcal{K}^{(y')}_l, y'\big)\,q^\theta_{U|X}\big(u\,\big|\,x^{(l,y')}\big)\,\Big(q^\theta_{U|X}\big(u\,\big|\,x^{(k,y)}\big)\Big)^{-\frac{\beta}{1+\beta}}\,du$$
$$\le \sup_{x,z\in\mathcal{X}} \int_{\mathcal{U}} q^\theta_{U|X}(u|x)\,\Big(q^\theta_{U|X}(u|z)\Big)^{-\frac{\beta}{1+\beta}}\,du. \qquad (119)$$

From this last expression, the result of the lemma is immediate.

At this point the proof of Theorem 1 is easily concluded. In the first place, it is easy to see that $I\big(p_X; q^\theta_{U|X}\big) = I\big(p_{XY}; q^\theta_{U|X}\big)$. As the cell $\mathcal{K}^{(y)}_k$ to which a particular $x\in\mathcal{X}$ belongs is a deterministic function of $x$ and $y$, the Data Processing Inequality (Cover and Thomas, 2006, Theorem 2.8.1) gives $I\big(P^D_X; q^\theta_{U|X}\big) \le I\big(p_{XY}; q^\theta_{U|X}\big) = I\big(p_X; q^\theta_{U|X}\big)$. The result from Lemma 7 can then be written as $\mathcal{E}_{\mathrm{gap}}\big(q^\theta_{U|X}, Q^\theta_{\hat Y|U}, S_n\big) \le B(\theta,\delta)$, with probability at least $1-\delta$, where $B(\theta,\delta)$ was defined in (32). Using Lemma 3 with $\alpha_\delta = 2\varepsilon + \sup_{\theta\in\mathcal{F}^{(\varepsilon)}_\Omega} B(\theta,\delta)$, the proof of Theorem 1 is concluded.
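The final Data Processing step can also be illustrated numerically: merging atoms of $X$ (as the discretization does) cannot increase the mutual information with $U$. The fine and coarse alphabets below are a hypothetical toy.

# DPI illustration: coarsening the input alphabet (merging atoms into cells)
# cannot increase I(X;U). Hypothetical discrete X with a Gaussian encoder.
import numpy as np
from scipy.stats import norm

p_fine = np.array([0.2, 0.3, 0.1, 0.15, 0.15, 0.1])
mu_fine = np.array([-2.0, -1.5, -0.2, 0.1, 1.4, 1.8])
sigma = 0.6
u = np.linspace(-6, 6, 3001)

def mi(p, q_ux):
    q_u = p @ q_ux
    return float(p @ np.trapz(q_ux * np.log(q_ux / q_u), u, axis=1))

q_ux_fine = norm.pdf(u[None, :], mu_fine[:, None], sigma)

# merge pairs of atoms into 3 cells; q(u|cell) is the conditional mixture
groups = [(0, 1), (2, 3), (4, 5)]
p_coarse = np.array([p_fine[list(g)].sum() for g in groups])
q_ux_coarse = np.stack([(p_fine[list(g)] @ q_ux_fine[list(g)]) / p_fine[list(g)].sum()
                        for g in groups])
print(f"I(fine)={mi(p_fine, q_ux_fine):.4f} >= I(coarse)={mi(p_coarse, q_ux_coarse):.4f}")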
B Proof of Theorem 2

Clearly, $\Omega$ is a set of functions for which we want to find conditions to possess the total-boundedness property under the supremum norm, i.e.,

$$\|f\|_\infty \equiv \sup_{x\in\mathcal{X},\, y\in\mathcal{Y}} |f(x,y)|, \qquad f\in\Omega. \qquad (120)$$
It is well known that the Arzelà-Ascoli Theorem (Rudin, 1986, Theorem 7.25) gives necessary and sufficient conditions for the set $\Omega$ to be totally bounded. These conditions are the equicontinuity and uniform boundedness of $\Omega$:

– Uniform boundedness: $\Omega$ is uniformly bounded if there exists $K<\infty$ such that

$$|f(x,y)| < K, \qquad \forall (x,y)\in\mathcal{X}\times\mathcal{Y},\ \forall f\in\Omega. \qquad (121)$$

– Equicontinuity: $\Omega$ is equicontinuous if for every $\epsilon>0$ there exists $\delta(\epsilon)>0$ such that, for all $(x_1,y_1),(x_2,y_2)\in\mathcal{X}\times\mathcal{Y}$ and all $f\in\Omega$,

$$|f(x_1,y_1) - f(x_2,y_2)| < \epsilon \quad \text{if } \|(x_1,y_1) - (x_2,y_2)\| < \delta(\epsilon). \qquad (122)$$

Notice that in our case, where $\mathcal{Y}$ is a finite and discrete space, equicontinuity can be considered only over the input space $\mathcal{X}$, that is,

$$|f(x_1,k) - f(x_2,k)| < \epsilon \quad \text{if } \|x_1 - x_2\| < \delta(\epsilon), \qquad \forall k = 1,\dots,|\mathcal{Y}|, \qquad (123)$$

where $\|\cdot\|$ is an appropriate norm on the input space $\mathcal{X}$. As in our case $\mathcal{X}$ is contained in $\mathbb{R}^{d_x}$, this norm can be taken as the usual Euclidean norm.

The main result of this section is to show that, under appropriate assumptions, the set $\Omega$ in Eq. (42) is equicontinuous and uniformly bounded. The set of assumptions we consider is described below.

Assumptions 2: We assume the following:

1. For every $k = 1,\dots,|\mathcal{Y}|$, $(w_k, b_k)\in\mathcal{W}\subset\mathbb{R}^{|\mathcal{Y}|(d_u+1)}$ with $\mathrm{diam}(\mathcal{W}) < M_1 < \infty$.

2. For every $i\in[1:d_u]$, $\alpha_i\in A\subset\mathbb{R}^l$ and $\beta_i\in B\subset\mathbb{R}^l$ with $\mathrm{diam}(A) < M_2 < \infty$ and $\mathrm{diam}(B) < M_2 < \infty$.

3. For every $i\in[1:d_u]$, the functions $\mu_i(x,\beta_i)$ and $\sigma_i(x,\alpha_i)$ are uniformly bounded. That is, there exist $M_3, M_4 < \infty$ such that $|\mu_i(x,\beta_i)| < M_3$ and $|\sigma_i(x,\alpha_i)| < M_4$ for all $x\in\mathcal{X}$, all $(\alpha_i,\beta_i)\in A\times B$ and all $i\in[1:d_u]$.

4. For every $i\in[1:d_u]$, the functions $\mu_i(x,\beta_i)$ and $\sigma_i(x,\alpha_i)$ are uniformly Lipschitz as functions of $x\in\mathcal{X}$. In precise terms, there exist $K_1, K_2 < \infty$ such that $|\mu_i(x_1,\beta_i) - \mu_i(x_2,\beta_i)| < K_1\|x_1-x_2\|$ for all $x_1,x_2\in\mathcal{X}$, $\beta_i\in B$, $i\in[1:d_u]$, and $|\sigma_i(x_1,\alpha_i) - \sigma_i(x_2,\alpha_i)| < K_2\|x_1-x_2\|$ for all $x_1,x_2\in\mathcal{X}$, $\alpha_i\in A$, $i\in[1:d_u]$.
5. There exists $\eta > 0$ such that $\sigma_i(x,\alpha_i) > \eta$ for all $x\in\mathcal{X}$, all $\alpha_i\in A$ and all $i\in[1:d_u]$.

Assumptions 1 and 2 are usually enforced in practical situations. The parameters of the encoder and decoder found through empirical risk minimization usually belong to compact sets in the parameter space, practically preventing them from diverging during the training phase. In many cases this is enforced through the use of a proper regularization of the empirical risk. Assumptions 3 and 4 are typically satisfied for several feed-forward architectures. For example, for ReLU activation functions, Assumption 2 is sufficient for Assumption 4 to hold. With the additional assumption that $\mathcal{X}$ is bounded, Assumption 3 will also hold for ReLU activation functions. For sigmoid activations, Assumption 3 is valid even if $\mathcal{X}$ is not bounded. The constants $K_1$ and $K_2$ could be written explicitly in terms of the number $L$ of layers of the feed-forward architecture and of the properties of the activation functions used (in order to keep the expressions simpler, we will not do this). Moreover, Assumption 4 can be relaxed to ask only for equicontinuity of the functions $\mu_i(x,\beta_i)$ and $\sigma_i(x,\alpha_i)$ without any loss. Assumption 5 is needed to avoid degeneration of the Gaussian encoders; it is also easy to enforce in a typical parameter learning scenario.

By the Arzelà-Ascoli Theorem, we need to show that $\Omega$ is uniformly bounded and equicontinuous. Let us begin with the uniform boundedness property. Using $\ell_{\theta_E,\theta_D}(x,k) = \mathbb{E}_{q^{\theta_E}_{U|X}}\Big[-\log Q^{\theta_D}_{\hat Y|U}(k|U)\,\Big|\,X=x\Big]$, we can write:

$$|\ell_{\theta_E,\theta_D}(x,k)| = \bigg|\int_{\mathcal{U}} \prod_{i=1}^{d_u} \mathcal{N}\big(\mu_i(x,\beta_i),\sigma_i^2(x,\alpha_i)\big)\bigg[\langle w_k,u\rangle + b_k - \log\sum_{i=1}^{|\mathcal{Y}|}\exp\{\langle w_i,u\rangle + b_i\}\bigg]du\bigg| \qquad (124)$$
$$\le \Big|\big\langle w_k,\ \mathbb{E}_{q^{\theta_E}_{U|X}}[U]\big\rangle\Big| + |b_k| + \bigg|\int_{\mathcal{U}} \prod_{i=1}^{d_u} \mathcal{N}\big(\mu_i(x,\beta_i),\sigma_i^2(x,\alpha_i)\big)\,\log\sum_{i=1}^{|\mathcal{Y}|}\exp\{\langle w_i,u\rangle + b_i\}\,du\bigg| \qquad (125)$$
$$\le \|w_k\|\bigg(\sum_{i=1}^{d_u}\mu_i^2(x,\beta_i)\bigg)^{1/2} + |b_k| + \int_{\mathcal{U}} \prod_{i=1}^{d_u} \mathcal{N}\big(\mu_i(x,\beta_i),\sigma_i^2(x,\alpha_i)\big)\,\bigg|\log\sum_{i=1}^{|\mathcal{Y}|}\exp\{\langle w_i,u\rangle + b_i\}\bigg|\,du, \qquad (126)\text{-}(127)$$

where we have used the subadditivity of the absolute value and the Cauchy-Schwarz inequality. Using the underlying assumptions, it is not difficult to check that

$$\|w_k\|\bigg(\sum_{i=1}^{d_u}\mu_i^2(x,\beta_i)\bigg)^{1/2} + |b_k| < M_1\,d_u^{1/2}\,M_3 + M_1. \qquad (128)$$

For the other term we can use the following easy-to-obtain inequality:

$$\bigg|\log\sum_{i=1}^{|\mathcal{Y}|}\exp\{\langle w_i,u\rangle + b_i\}\bigg| \le \max_{i\in[1:|\mathcal{Y}|]}\big\{|\langle w_i,u\rangle + b_i|\big\} + \log|\mathcal{Y}|. \qquad (129)$$

Then:

$$\int_{\mathcal{U}} \prod_{i=1}^{d_u} \mathcal{N}\big(\mu_i(x,\beta_i),\sigma_i^2(x,\alpha_i)\big)\,\bigg|\log\sum_{i=1}^{|\mathcal{Y}|}\exp\{\langle w_i,u\rangle + b_i\}\bigg|\,du \le \mathbb{E}_{q^{\theta_E}_{U|X}}\bigg[\max_{i\in[1:|\mathcal{Y}|]}\big\{|\langle w_i,U\rangle + b_i|\big\}\,\bigg|\,X=x\bigg] + \log|\mathcal{Y}|. \qquad (130)$$

It is straightforward to write:

$$\mathbb{E}_{q^{\theta_E}_{U|X}}\bigg[\max_{i\in[1:|\mathcal{Y}|]}\big\{|\langle w_i,U\rangle + b_i|\big\}\bigg] \le \mathbb{E}_{q^{\theta_E}_{U|X}}\bigg[\max_{i\in[1:|\mathcal{Y}|]}\big\{\|w_i\|\,\|U\|\big\}\,\bigg|\,X=x\bigg] + \max_{i\in[1:|\mathcal{Y}|]}|b_i| \qquad (131)$$
$$\le M_1\cdot\mathbb{E}_{q^{\theta_E}_{U|X}}\big[\|U\|\,\big|\,X=x\big] + M_1 \qquad (132)$$
$$\le M_1\sqrt{\mathbb{E}_{q^{\theta_E}_{U|X}}\big[\|U\|^2\,\big|\,X=x\big]} + M_1 \qquad (133)$$
$$= M_1\sqrt{\sum_{i=1}^{d_u}\Big(\mu_i^2(x,\beta_i) + \sigma_i^2(x,\alpha_i)\Big)} + M_1 \qquad (134)$$
$$\le M_1\,d_u^{1/2}\big(M_3^2 + M_4^2\big)^{1/2} + M_1, \qquad (135)$$

where we have used the Cauchy-Schwarz and Jensen inequalities and the set of Assumptions 2. Combining the above results, we obtain that $\Omega$ is uniformly bounded.

For the equicontinuity of $\Omega$ we can write:

$$|\ell_{\theta_E,\theta_D}(x_1,k) - \ell_{\theta_E,\theta_D}(x_2,k)| \le \int_{\mathcal{U}} \Big|q^{\theta_E}_{U|X}(u|x_1) - q^{\theta_E}_{U|X}(u|x_2)\Big|\cdot\Big|\log Q^{\theta_D}_{\hat Y|U}(k|u)\Big|\,du \qquad (136)$$
$$\le \int_{\mathcal{U}} \Big|q^{\theta_E}_{U|X}(u|x_1) - q^{\theta_E}_{U|X}(u|x_2)\Big|\cdot\big(M_1\|u\| + M_1 + \log|\mathcal{Y}|\big)\,du, \qquad (137)$$

where we have used (129) and the set of Assumptions 2. Consider the total variation between two probability densities, which can be written as

$$\mathrm{TV}(p;q) \equiv \int |p(x) - q(x)|\,dx. \qquad (138)$$

Using this definition we can write:

$$|\ell_{\theta_E,\theta_D}(x_1,k) - \ell_{\theta_E,\theta_D}(x_2,k)| \le \big(M_1 + \log|\mathcal{Y}|\big)\,\mathrm{TV}\Big(q^{\theta_E}_{U|X}(\cdot|x_1);\,q^{\theta_E}_{U|X}(\cdot|x_2)\Big) + M_1 \int_{\mathcal{U}} \|u\|\,\Big|q^{\theta_E}_{U|X}(u|x_1) - q^{\theta_E}_{U|X}(u|x_2)\Big|\,du. \qquad (139)$$

The second term in the above equation can be bounded using Cauchy-Schwarz, yielding:

$$\int_{\mathcal{U}} \|u\|\,\Big|q^{\theta_E}_{U|X}(u|x_1) - q^{\theta_E}_{U|X}(u|x_2)\Big|\,du \le \mathrm{TV}^{1/2}\Big(q^{\theta_E}_{U|X}(\cdot|x_1);\,q^{\theta_E}_{U|X}(\cdot|x_2)\Big)\times\bigg(\int_{\mathcal{U}} \|u\|^2\,\Big|q^{\theta_E}_{U|X}(u|x_1) - q^{\theta_E}_{U|X}(u|x_2)\Big|\,du\bigg)^{1/2} \qquad (140)$$
$$\le \mathrm{TV}^{1/2}\Big(q^{\theta_E}_{U|X}(\cdot|x_1);\,q^{\theta_E}_{U|X}(\cdot|x_2)\Big)\times\Big(\mathbb{E}_{q^{\theta_E}_{U|X}}\big[\|U\|^2\,\big|\,X=x_1\big] + \mathbb{E}_{q^{\theta_E}_{U|X}}\big[\|U\|^2\,\big|\,X=x_2\big]\Big)^{1/2} \qquad (141)$$
$$\le \sqrt{2}\cdot d_u^{1/2}\big(M_3^2 + M_4^2\big)^{1/2}\,\mathrm{TV}^{1/2}\Big(q^{\theta_E}_{U|X}(\cdot|x_1);\,q^{\theta_E}_{U|X}(\cdot|x_2)\Big). \qquad (142)$$

We see that the equicontinuity of $\Omega$ depends on the continuity properties of the total variation distance between two Gaussian encoders with inputs $x_1$ and $x_2$. The total variation distance between two Gaussian pdfs is difficult to compute in closed form (Devroye et al., 2020). However, it can be easily bounded using Pinsker's inequality (Pinsker, 1964):

$$\mathrm{TV}(p;q) \le \sqrt{2\,\mathrm{KL}(p\|q)}. \qquad (143)$$

As the encoders are Gaussian, the KL divergence can be computed in closed form, and thus

$$\mathrm{TV}\Big(q^{\theta_E}_{U|X}(\cdot|x_1);\,q^{\theta_E}_{U|X}(\cdot|x_2)\Big) \le \bigg(\sum_{i=1}^{d_u} \log\frac{\sigma_i^2(x_2,\alpha_i)}{\sigma_i^2(x_1,\alpha_i)} + \frac{\sigma_i^2(x_1,\alpha_i)}{\sigma_i^2(x_2,\alpha_i)} - 1 + \frac{\big[\mu_i(x_1,\beta_i) - \mu_i(x_2,\beta_i)\big]^2}{\sigma_i^2(x_2,\alpha_i)}\bigg)^{1/2}. \qquad (144)$$

It will suffice to analyze each of the following quantities:

$$\log\frac{\sigma_i^2(x_2,\alpha_i)}{\sigma_i^2(x_1,\alpha_i)} + \frac{\sigma_i^2(x_1,\alpha_i)}{\sigma_i^2(x_2,\alpha_i)} - 1 + \frac{\big[\mu_i(x_1,\beta_i) - \mu_i(x_2,\beta_i)\big]^2}{\sigma_i^2(x_2,\alpha_i)}. \qquad (145)$$

Using items 3 and 5 in Assumptions 2, we can write:

$$\big|\sigma_i^2(x_1,\alpha_i) - \sigma_i^2(x_2,\alpha_i)\big| \le \big|\sigma_i(x_1,\alpha_i) - \sigma_i(x_2,\alpha_i)\big|\,\big|\sigma_i(x_1,\alpha_i) + \sigma_i(x_2,\alpha_i)\big| \qquad (146)$$
$$\le 2\,M_4\,K_2\,\|x_1 - x_2\|. \qquad (147)$$

Using the fact that $\sigma_i^2(x_2,\alpha_i) \le \sigma_i^2(x_1,\alpha_i) + 2M_4K_2\|x_1-x_2\|$ and item 5 in Assumptions 2, we can write:

$$\log\frac{\sigma_i^2(x_2,\alpha_i)}{\sigma_i^2(x_1,\alpha_i)} \le \log\bigg(1 + \frac{2M_4K_2\|x_1-x_2\|}{\sigma_i^2(x_1,\alpha_i)}\bigg) \qquad (148)$$
$$\le \frac{2M_4K_2\|x_1-x_2\|}{\sigma_i^2(x_1,\alpha_i)} \qquad (149)$$
$$\le \frac{2M_4K_2\|x_1-x_2\|}{\eta^2}. \qquad (150)$$

Similarly, we have

$$\frac{\sigma_i^2(x_1,\alpha_i)}{\sigma_i^2(x_2,\alpha_i)} - 1 \le \frac{1}{\sigma_i^2(x_2,\alpha_i)}\,\big|\sigma_i^2(x_1,\alpha_i) - \sigma_i^2(x_2,\alpha_i)\big| \qquad (151)$$
$$\le \frac{2M_4K_2\|x_1-x_2\|}{\eta^2}. \qquad (152)$$

Finally,

$$\frac{\big[\mu_i(x_1,\beta_i) - \mu_i(x_2,\beta_i)\big]^2}{\sigma_i^2(x_2,\alpha_i)} \le \frac{K_1^2\,\|x_1-x_2\|^2}{\eta^2}. \qquad (153)$$

Combining all the above results, we obtain:

$$|\ell_{\theta_E,\theta_D}(x_1,k) - \ell_{\theta_E,\theta_D}(x_2,k)| \le \big(M_1 + \log|\mathcal{Y}|\big)\,\frac{d_u^{1/2}}{\eta}\Big(K_1^2\|x_1-x_2\|^2 + 4M_4K_2\|x_1-x_2\|\Big)^{1/2} + \frac{\sqrt{2}\,M_1\,d_u^{3/4}\big(M_3^2+M_4^2\big)^{1/2}}{\eta^{1/2}}\Big(K_1^2\|x_1-x_2\|^2 + 4M_4K_2\|x_1-x_2\|\Big)^{1/4}, \qquad (154)$$

from which equicontinuity immediately follows.
C Auxiliary Results

In this Appendix, some auxiliary facts used in the proof of the main result are presented.
C.1 Some Basic Results
Lemma 9 (Cover and Thomas, 2006, Theorem 11.2.1) Let $P\in\mathcal{P}(\mathcal{X})$ be a discrete probability distribution and let $\widehat{P}$ be its empirical estimate from a data set of size $n$. Then,

$$\mathbb{P}\bigg(\mathrm{KL}\big(\widehat{P}\,\big\|\,P\big) \le |\mathcal{X}|\,\frac{\log(n+1)}{n} + \frac{1}{n}\log(1/\delta)\bigg) \ge 1-\delta, \qquad (155)$$

for all $\delta\in(0,1)$.

Lemma 10 (Application of McDiarmid's Inequality)
Let $P\in\mathcal{P}(\mathcal{X})$ be any probability distribution and let $\widehat{P}$ be its empirical estimate from a data set of size $n$. Then,

$$\mathbb{P}\bigg(\big\|P - \widehat{P}\big\|_2 \le \frac{\sqrt{\log(1/\delta)}}{\sqrt{n}}\bigg) \ge 1-\delta, \qquad (156)$$

for all $\delta\in(0,1)$.

Lemma 11 (Modified result from (Shamir et al., 2010))
Let $X$ and $U$ be two random variables ($X$ discrete and $U$ continuous) distributed according to $P_X$ and $q_{U|X}$, respectively, and let $\widehat{P}_X$ be the empirical estimate of $P_X$ from a sample of size $n$. Then,

$$\big|H_d(q_U) - H_d(\widehat{q}_U)\big| \le \int_{\mathcal{U}} \phi\Big(\big\|P_X - \widehat{P}_X\big\|_2\,\sqrt{V\big(q_{U|X}(u|\cdot)\big)}\Big)\,du, \qquad (157)$$

where $q_U(u) = \mathbb{E}_{P_X}\big[q_{U|X}(u|X)\big]$, $\widehat{q}_U(u) = \mathbb{E}_{\widehat{P}_X}\big[q_{U|X}(u|X)\big]$, $\phi(\cdot)$ is defined in (91) and $V(\cdot)$ in (92), for $\|P^D_X - \widehat{P}^D_X\|_2$ small enough.

C.2 Additional Auxiliary Results
Lemma 12
Let $n \ge a^2 e^2$ with $a > 0$. Then, for every $\beta>0$,

$$\phi\Big(\frac{a}{\sqrt{n}}\Big) \le \frac{a\log(n)}{2\sqrt{n}} + \frac{(1+\beta)e^{-1}\,a^{\frac{\beta}{1+\beta}}}{\sqrt{n}}.$$

Proof: The function $\phi(\cdot)$ is defined in (91); for $n \ge a^2 e^2$,

$$\phi\Big(\frac{a}{\sqrt{n}}\Big) = \frac{a\log(n)}{2\sqrt{n}} + \frac{a\log(1/a)}{\sqrt{n}}. \qquad (158)$$

In order to bound the second summand, we look for the maximum of the following function:

$$f_\beta(x) = \frac{x\log(1/x)}{x^{\frac{\beta}{1+\beta}}} = -x^{\frac{1}{1+\beta}}\log(x). \qquad (159)$$

It is easy to compute its derivative:

$$f'_\beta(x) = -\frac{1}{1+\beta}\,x^{\frac{1}{1+\beta}-1}\log(x) - x^{\frac{1}{1+\beta}-1} = -x^{\frac{1}{1+\beta}-1}\bigg[\frac{1}{1+\beta}\log(x) + 1\bigg]. \qquad (160)$$

The derivative vanishes at $x = e^{-(1+\beta)}$, and this point is a maximum because $f'_\beta(x) > 0$ for $x < e^{-(1+\beta)}$ and $f'_\beta(x) < 0$ for $x > e^{-(1+\beta)}$. Finally,

$$x\log\Big(\frac{1}{x}\Big) = f_\beta(x)\,x^{\frac{\beta}{1+\beta}} \le f_\beta\big(e^{-(1+\beta)}\big)\,x^{\frac{\beta}{1+\beta}} = (1+\beta)\,e^{-1}\,x^{\frac{\beta}{1+\beta}}. \qquad (161)$$
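The elementary bound (161) can be checked on a grid; the values of beta below are arbitrary.

# Grid check of the bound (161): x*log(1/x) <= ((1+beta)/e) * x^(beta/(1+beta)).
import numpy as np

x = np.linspace(1e-6, 1.0, 100000)
for beta in [0.5, 1.0, 4.0]:
    lhs = x * np.log(1 / x)
    rhs = (1 + beta) / np.e * x**(beta / (1 + beta))
    print(f"beta={beta}: bound holds on grid -> {bool(np.all(lhs <= rhs + 1e-12))}")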
Lemma 13

Under Assumptions 1:

$$\Big|H_d\big(q^{D,\theta}_{U|Y}\,\big|\,P_Y\big) - H_d\big(\widehat{q}^{D,\theta}_{U|Y}\,\big|\,\widehat{P}_Y\big)\Big| \le \big\|P_Y - \widehat{P}_Y\big\|_2\,\sqrt{|\mathcal{Y}|}\,\frac{d_u}{2}\log(4\pi e S) + \mathbb{E}_{P_Y}\bigg[\int_{\mathcal{U}} \phi\Big(\big\|P^D_{X|Y}(\cdot|Y) - \widehat{P}^D_{X|Y}(\cdot|Y)\big\|_2\,\sqrt{V\big(q^\theta_{U|X}(u|\cdot)\big)}\Big)\,du\bigg], \qquad (162)$$

for $\max_y \big\|P^D_{X|Y}(\cdot|y) - \widehat{P}^D_{X|Y}(\cdot|y)\big\|_2$ small enough. In the present context, this magnitude is $O(n^{-1/2})$.
Proof
Using the triangle and Cauchy-Schwarz inequalities, we obtain:

$$\Big|H_d\big(q^{D,\theta}_{U|Y}\,\big|\,P_Y\big) - H_d\big(\widehat{q}^{D,\theta}_{U|Y}\,\big|\,\widehat{P}_Y\big)\Big| = \bigg|\sum_{y\in\mathcal{Y}} P_Y(y)\,H_d\big(q^{D,\theta}_{U|Y}(\cdot|y)\big) - \widehat{P}_Y(y)\,H_d\big(\widehat{q}^{D,\theta}_{U|Y}(\cdot|y)\big)\bigg| \qquad (163)$$
$$\le \bigg|\sum_{y\in\mathcal{Y}} P_Y(y)\Big(H_d\big(q^{D,\theta}_{U|Y}(\cdot|y)\big) - H_d\big(\widehat{q}^{D,\theta}_{U|Y}(\cdot|y)\big)\Big)\bigg| + \bigg|\sum_{y\in\mathcal{Y}} \Big(P_Y(y) - \widehat{P}_Y(y)\Big)\,H_d\big(\widehat{q}^{D,\theta}_{U|Y}(\cdot|y)\big)\bigg| \qquad (164)$$
$$\le \sum_{y\in\mathcal{Y}} P_Y(y)\,\Big|H_d\big(q^{D,\theta}_{U|Y}(\cdot|y)\big) - H_d\big(\widehat{q}^{D,\theta}_{U|Y}(\cdot|y)\big)\Big| + \big\|P_Y - \widehat{P}_Y\big\|_2\cdot\sqrt{\sum_{y\in\mathcal{Y}} H_d\big(\widehat{q}^{D,\theta}_{U|Y}(\cdot|y)\big)^2}. \qquad (165)$$

We apply Lemma 11 to the first term in (165) and obtain:

$$\sum_{y\in\mathcal{Y}} P_Y(y)\,\Big|H_d\big(q^{D,\theta}_{U|Y}(\cdot|y)\big) - H_d\big(\widehat{q}^{D,\theta}_{U|Y}(\cdot|y)\big)\Big| \le \mathbb{E}_{P_Y}\bigg[\int_{\mathcal{U}} \phi\Big(\big\|P^D_{X|Y}(\cdot|Y) - \widehat{P}^D_{X|Y}(\cdot|Y)\big\|_2\,\sqrt{V\big(q^\theta_{U|X}(u|\cdot)\big)}\Big)\,du\bigg]. \qquad (166)$$

For the second term of (165), we bound the differential entropy using Hadamard's inequality and the fact that it is maximized by Gaussian random variables:

$$\sum_{y\in\mathcal{Y}} H_d\big(\widehat{q}^{D,\theta}_{U|Y}(\cdot|y)\big) \le \sum_{y\in\mathcal{Y}} \frac{1}{2}\log\Big[(2\pi e)^{d_u}\det\Big(\Sigma_{\widehat{q}^{D,\theta}_{U|Y}}(U\,|\,Y=y)\Big)\Big] \qquad (167)$$
$$\le \sum_{y\in\mathcal{Y}} \bigg(\frac{1}{2}\sum_{j=1}^{d_u} \log\Big[2\pi e\cdot\mathrm{Var}_{\widehat{q}^{D,\theta}_{U_j|Y}}\big(U_j\,\big|\,Y=y\big)\Big]\bigg), \qquad (168)$$

where $\Sigma_{\widehat{q}^{D,\theta}_{U|Y}}(U|Y=y)$ denotes the covariance matrix associated with $\widehat{q}^{D,\theta}_{U|Y}(\cdot|y)$. In order to bound the last term, we make use of the law of total variance:

$$\mathrm{Var}_{\widehat{q}^{D,\theta}_{U_j|Y}}\big(U_j\,\big|\,Y=y\big) = \mathbb{E}_{\widehat{P}^D_{X|Y}}\Big[\mathrm{Var}_{q^\theta_{U_j|X}}\big(U_j\,\big|\,X\big)\,\Big|\,Y=y\Big] + \mathrm{Var}_{\widehat{P}^D_{X|Y}}\Big(\mathbb{E}_{q^\theta_{U_j|X}}\big[U_j\,\big|\,X\big]\,\Big|\,Y=y\Big) \qquad (169)$$
$$\le \mathbb{E}_{\widehat{P}^D_{X|Y}}\Big[\mathrm{Var}_{q^\theta_{U_j|X}}\big(U_j\,\big|\,X\big) + \mathbb{E}_{q^\theta_{U_j|X}}\big[U_j\,\big|\,X\big]^2\,\Big|\,Y=y\Big] \qquad (170)$$
$$\le 2S, \qquad (171)$$

where we use that $\mathrm{Var}_{q^\theta_{U_j|X}}(U_j|X=x) \le S$ and $\big|\mathbb{E}_{q^\theta_{U_j|X}}[U_j|X=x]\big| \le \sqrt{S}$ for all $x\in\mathcal{X}$. Finally,

$$\sum_{y\in\mathcal{Y}} H_d\big(\widehat{q}^{D,\theta}_{U|Y}(\cdot|y)\big) \le |\mathcal{Y}|\,\frac{d_u}{2}\log(4\pi e S). \qquad (172)$$

References
Achille A, Soatto S (2018) Emergence of invariance and disentangling in deep representations. Journal of Machine Learning Research (JMLR) 19
Achille A, Soatto S (2018a) Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12):2897–2905
Alemi AA, Fischer I, Dillon JV, Murphy K (2016) Deep variational information bottleneck. CoRR abs/1612.00410, URL http://arxiv.org/abs/1612.00410
Amjad RA, Geiger BC (2018) Learning representations for neural network-based classification using the information bottleneck principle. CoRR abs/1802.09766, URL http://arxiv.org/abs/1802.09766
Bassily R, Moran S, Nachum I, Shafer J, Yehudayoff A (2018) Learners that use little information. In: Proceedings of Machine Learning Research, PMLR, vol 83, pp 25–55
Bousquet O, Elisseeff A (2002) Stability and generalization. J Mach Learn Res 2:499–526, DOI 10.1162/153244302760200704
Chopra P, Yadav SK (2018) Restricted boltzmann machine and softmax regression for fault detection and classification. Complex & Intelligent Systems 4(1):67–77
Cover TM, Thomas JA (2006) Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience
Devroye L, Györfi L, Lugosi G (1997) A Probabilistic Theory of Pattern Recognition, Applications of Mathematics, vol 31, 2nd edn. Springer
Devroye L, Mehrabian A, Reddad T (2020) The total variation distance between high-dimensional gaussians. arXiv:1810.08693 [math, stat], URL http://arxiv.org/abs/1810.08693
Donsker MD, Varadhan SRS (1983) Asymptotic evaluation of certain markov process expectations for large time. iv. Communications on Pure and Applied Mathematics 36(2):183–212, DOI 10.1002/cpa.3160360204
Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. MIT Press
Graepel T, Herbrich R, Shawe-Taylor J (2005) PAC-bayesian compression bounds on the prediction error of learning algorithms for classification. Springer Machine Learning 59:55–76
Guo C, Pleiss G, Sun Y, Weinberger KQ (2017) On calibration of modern neural networks. In: Proceedings of the International Conference on Machine Learning ICML, Sydney
Halbersberg D, Wienreb M, Lerner B (2020) Joint maximization of accuracy and information for learning the structure of a bayesian network classifier. Springer Machine Learning 109:1039–1099
Higgins I, Matthey L, Pal A, Burgess C, Glorot X, Botvinick M, Mohamed S, Lerchner A (2017) β-VAE: Learning basic visual concepts with a constrained variational framework. In: Proceedings of the International Conference on Learning Representations ICLR, Toulon
Hinton GE (2002) Training products of experts by minimizing contrastive divergence. Neural Comput 14(8):1771–1800, DOI 10.1162/089976602760128018
Hinton GE (2012) A practical guide to training restricted boltzmann machines. In: Neural Networks: Tricks of the Trade (2nd ed.), Springer, pp 599–619
Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554
Kingma DP, Welling M (2013) Auto-encoding variational bayes. In: Proc. of the 2nd Int. Conf. on Learning Representations (ICLR)
Krizhevsky A (2009) Learning multiple layers of features from tiny images. Tech. rep., University of Toronto
Li Y, Bradshaw J, Sharma Y (2019) Are generative classifiers more robust to adversarial attacks? In: Proceedings of Machine Learning Research, PMLR, Long Beach, California, USA, vol 97, pp 3804–3814
Maggipinto M, Terzi M, Susto GA (2020) β-variational classifiers under attack. CoRR
Mohri M, Rostamizadeh A, Talwalkar A (2018) Foundations of Machine Learning, 2nd edn. Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA
Neyshabur B, Tomioka R, Salakhutdinov R, Srebro N (2017) Geometry of optimization and implicit regularization in deep learning. CoRR abs/1705.03071
Pichler G, Piantanida P, Koliander G (2020) On the estimation of information measures of continuous distributions.
Pinsker M (1964) Information and information stability of random variables and processes. Holden-Day series in time series analysis, Holden-Day
Rudin W (1986) Principles of Mathematical Analysis. McGraw-Hill Book Co.
Russo D, Zou J (2015) How much does your data exploration overfit? Controlling bias via information usage. arXiv:1511.05219 [cs, stat], URL http://arxiv.org/abs/1511.05219
Saxe A, Bansal Y, Dapello J, Advani M, Kolchinsky A, Tracey B, Cox D (2018) On the information bottleneck theory of deep learning. In: Proc. of the 6th Int. Conf. on Learning Representations (ICLR)
Schwartz-Ziv R, Tishby N (2017) Opening the black box of deep neural networks via information. CoRR abs/1703.00810, URL http://arxiv.org/abs/1703.00810
Shamir O, Sabato S, Tishby N (2010) Learning and generalization with the information bottleneck. Theor Comput Sci 411(29-30):2696–2711, DOI 10.1016/j.tcs.2010.04.006, URL http://dx.doi.org/10.1016/j.tcs.2010.04.006
Sokolic J, Giryes R, Sapiro G, Rodrigues M (2017a) Generalization error of invariant classifiers. In: Proceedings of Machine Learning Research, PMLR, Fort Lauderdale, FL, USA, vol 54, pp 1094–1103
Sokolic J, Giryes R, Sapiro G, Rodrigues MRD (2017b) Robust large margin deep neural networks. IEEE Transactions on Signal Processing 65:4265–4280
Srivastava N, Salakhutdinov R, Hinton GE (2013) Modeling documents with deep boltzmann machines. In: Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI 2013, Bellevue, WA, USA, August 11-15, 2013
Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J of Mach Learning Research 15(1):1929–1958
Tishby N, Zaslavsky N (2015) Deep learning and the information bottleneck principle. CoRR abs/1503.02406, URL http://arxiv.org/abs/1503.02406
Tishby N, Pereira FC, Bialek W (1999) The information bottleneck method. In: Proc. of the 37th Annu. Allerton Conf. on Communication, Control and Computing, pp 368–377
Vera M, Piantanida P, Rey Vega L (2018a) The role of the information bottleneck in representation learning. In: IEEE Int. Symp. on Inform. Theory (ISIT)
Vera M, Rey Vega L, Piantanida P (2018b) Compression-based regularization with an application to multitask learning. IEEE Journal of Selected Topics in Signal Processing 12(5):1063–1076
Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol PA (2010) Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J of Mach Learning Research 11:3371–3408
Xu A, Raginsky M (2017) Information-theoretic analysis of generalization capability of learning algorithms. arXiv:1705.07809 [cs, math, stat], URL http://arxiv.org/abs/1705.07809
Xu H, Mannor S (2012) Robustness and generalization. Machine Learning 86(3):391–423, DOI 10.1007/s10994-011-5268-1, URL https://doi.org/10.1007/s10994-011-5268-1
Yamanishi K (1992) A learning criterion for stochastic rules. Springer Machine Learning 9:165–203
Zhang C, Bengio S, Hardt M, Recht B, Vinyals O (2017) Understanding deep learning requires rethinking generalization. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, URL https://openreview.net/forum?id=Sy8gdB9xx