Learning under Distribution Mismatch and Model Misspecification
Mohammad Saeed Masiha, Amin Gohari, Mohammad Hossein Yassaee, Mohammad Reza Aref
LLearning under Distribution Mismatch and ModelMisspecification
Mohammad Saeed Masiha Amin Gohari Mohammad Hossein YassaeeMohammad Reza ArefFebruary 24, 2021
Abstract
We study learning algorithms when there is a mismatch between the distributions ofthe training and test datasets of a learning algorithm. The effect of this mismatch on thegeneralization error and model misspecification are quantified. Moreover, we provide aconnection between the generalization error and the rate-distortion theory, which allowsone to utilize bounds from the rate-distortion theory to derive new bounds on the gen-eralization error and vice versa. In particular, the rate-distortion based bound strictlyimproves over the earlier bound by Xu and Raginsky even when there is no mismatch.We also discuss how “auxiliary loss functions” can be utilized to obtain upper bounds onthe generalization error.
In a learning algorithm, a distribution mismatch occurs when the training dataset and the testdataset are not drawn from the same distribution. This mismatch might also occur if trainingdata is corrupted or if the statistical distribution of the data changes from training to testing.For example, suppose that (in the Covid era) a pharmaceutical company located in region Rhas developed a drug for Covid-19 (in statistical terms, the company has tuned the parametersof a process that describes how to mix different chemicals to make a drug). Clinical experimentsshow high effectiveness (say 95%) of this treatment for the population that resides in regionR. There is an urgent need for the drug and the company lacks time to test the medicineon other populations with possibly different genetic backgrounds (in statistical terms, with adifferent distribution from the distribution of the population in region R). Hence it is requiredto have some guarantee on how the effectiveness of treatment for the population R generalizesto other populations. As another example, in federated learning, a centralized model is trainedbased on chunks of training data originating from a number of clients, which may be mobilephones, other mobile devices, or sensors. While the training data may come from only a limitednumber of clients, statistical guarantees on the learning algorithm should be expressed in termsof testing on a population-averaged model of all client distributions, which might be differentfrom the training distribution.Distribution mismatch can manifest itself in different ways: consider a data scientist in acompany who is given access to a training dataset and asked to make a recommendation abouta decision for the company. The training dataset is corrupted and its distribution slightly differsfrom that of the test data. The data scientist might run a learning algorithm A and utilizeits output on the training data to make a recommendation. In the first part of this paper,we study the effect of distribution mismatch on the generalization error of algorithm A . Next,1 a r X i v : . [ c s . I T ] F e b ssume that the company’s manager impresses upon the data scientist the importance of thedecision for the company and asks about his confidence level about his recommendation. Toaddress this question, the data scientist needs to come up with a mathematical model for thedata and give guarantees based on that model. For instance, the data scientist might choosethe parametric class of Gaussian distributions, partly based on the training data histograms(many methods to find a family of distributions for data samples are data-driven). Since thetraining data is corrupted, this process could lead to model misspecification. In the second partof this paper, we study how model misspecification affects theoretical guarantees of a learningalgorithm. Generalization error under distribution mismatch:
Distribution mismatch is thesubject of previous studies in transfer learning or domain adaptation [1–5]. In the first partof this paper, we provide information-theoretic bounds on the generalization error under adistribution mismatch. Designing algorithms with low generalization error is a key challengein machine learning. It is known that under certain assumptions, the generalization error ofa learning algorithm can be bounded from above in terms of the mutual information betweenthe input and output of the algorithm [6, 7] (see also [8–17] for various generalizations andextensions using other measures of dependence). These works assume that the test data aredrawn from the same distribution as the training data. Herein, we provide bounds on thegeneralization error of the learning algorithm assuming a bound on the KL divergence betweenthe test and training distributions as well as a bound on the mutual information between theinput and output of the learning algorithm. One of our bounds is based on (to the best ofour knowledge) a novel connection between generalization error and the rate-distortion theory.When specialized to the case of no-mismatch, this bound strictly improves over the bound in [7](see Corollary 1 and Figure 1).A question that we also address in this section is as follows: in case of having no mismatchbetween the training and test distributions, having more training data samples leads to in-creasingly better estimates of the unknown distribution of the data. On the other hand, in caseof a mismatch, increasing the number of training samples can only provide more informationabout the training distribution. In the limit of the number of samples going to infinity, wewill perfectly learn the training distribution but will still have a residual ambiguity about thetest distribution: we will only know that the test distribution is at a certain KL distance fromthe training distribution. If we are in a regime where the error is dominated by this residualambiguity in the test distribution, the value of training samples gradually depreciates as wegather more samples. Subsequently, we might have insufficient incentive to gather more train-ing samples. This shows that there is an “optimal” number of samples associated with ourproblem. To the best of our knowledge, this question has not been addressed in the literatureso far. We address the above question as follows: In Corollary 1, we provide an upper boundof generalization error in terms of γ + r/n where γ is the KL-divergence between the train-ing and test distributions, r is the mutual information between the input and output of thelearning algorithm and n is the sample size. If γ > n (or small values of r ), the term γ becomes the dominant term, and theeffect of r/n vanishes in the upper bound. This happens when r/n is of the same order as γ .For a fixed sample size n and γ , it suffices to work with algorithms that have input-outputmutual information r satisfying r ≈ nγ . In other words, since the training data is drawn froma different distribution than the test data, limited overfitting will not affect the generaliza-tion error. Next, we give a lower bound on the generalization error in Corollary 2 underdistribution mismatch. Similar to the upper bound, this lower bound on the generalizationerror involves the summation of two terms. The first term is a constant (bounded from aboveby the KL-divergence between the training and test distribution, e.g. 
see (61)) and anotherterm (depending on the input-output mutual information of the algorithm) and vanishing in n .2inally, we also consider the performance of the ERM algorithm under distribution mismatchin Theorem 9. We present an upper bound on excess risk. Increasing the number of samplesdoes not make the upper bound vanish and we get a constant upper bound (due to distributionmismatch) when the number of samples tends to infinity. Model misspecification:
A learning algorithm has access to a training dataset that isdrawn from an unknown distribution. This unknown distribution is commonly assumed tobelong to a known family of distributions P . A learning “model” provides a description for thefamily P , and a learning algorithm is required to have good performance when the data is drawnfrom any arbitrary distribution belonging to P . We say that model misspecification occurs whenthe data distribution does not belong to P . The amount of misspecification may be measuredby the minimum KL-divergence from the true distribution to the family of distributions in class P [18]. Model misspecification is a key consideration in statistics [18, 19]. For instance, [19]shows that Bayesian methods are not optimal for learning predictive models unless the modelclass is perfectly specified. In the second part of the paper, we fix a uniformly stable learningalgorithm A and assume a notion of sample complexity for the class P . Then we bound thesample complexity under a distribution µ (cid:48) / ∈ P based on the minimum KL-divergence of µ (cid:48) from the family P . Organization:
The rest of this paper is organized as follows. The paper splits into twoparts: section 2 gives our results on generalization error while Section 3 is dedicated to modelmisspecification. In Section 2.1 we formally define learning with mismatched (training and testdata) distributions. Section 2.2 provides a connection between the rate-distortion theory andthe generalization error, along with upper and lower bounds on the generalization error. Theperformance of the ERM algorithm on the training data when there is a distribution mismatchis also studied. Section 3 studies model mismatch for the class of uniformly-stable algorithms.Finally, Section 4 discusses some ideas to improve the upper bounds on the generalization errorgiven in Section 2.2.
Notation and preliminaries:
Random variables are shown in capital letters, whereastheir realizations are shown in lowercase letters. We show sets with calligraphic font. For arandom variable X generated from a distribution µ , we use E X ∼ µ to denote the expectationtaken over X with distribution µ and P X means the distribution over X . We use D ( µ (cid:107) ν ) and D α ( µ (cid:107) ν ) = α − log (cid:82) (cid:0) dµdν (cid:1) α dν ( x ) to denote the KL divergence and the Renyi divergence oforder α respectively. In particular, we have D ( µ (cid:107) ν ) = log (1 + χ ( µ (cid:107) ν )) where χ -divergenceis defined as χ ( µ (cid:107) ν ) = E ν (cid:0) dµdν − (cid:1) . Given two random variables X and Y , we use theshorthand X ≤ − δ Y to denote P [ X ≤ Y ] ≥ − δ . Observe that X ≤ − δ Y and Y ≤ − δ Z implies X ≤ − δ − δ Z by the union bound.We write g ( n ) ∼ Ω( f ( n )) when g ( n ) ≥ c · f ( n ) for large enough n .The concept of subgaussianity is defined as follows: Definition 1.
The random variable X is said to be sub-Gaussian with parameter σ if ∀ s ∈ RE [ e s ( X − E [ X ]) ] ≤ e σ s . (1) Using the Chernoff ’s bound, we obtain, P ( | X − E X | > t ) ≤ e − t σ . (2)The following lemma relates the expectation of a measurable function over two differentdistributions: Lemma 1 (Donsker-Varadhan) . Let X be a sample space and let P be a distribution on X .Let Q be a distribution on X with the support which is a subset of the P support. Then for anymeasurable function φ : X → R with respect to P , we have ln (cid:0) E P [ e φ ( X ) ] (cid:1) ≥ E Q [ φ ( X )] − D ( Q (cid:107) P ) . emma 2. [Coupling] Given the marginal distributions µ and µ (cid:48) on Z , one can find a coupling π ( z, z (cid:48) ) on ( z, z (cid:48) ) ∈ Z × Z such that ( Z, Z (cid:48) ) ∼ π satisfy Z ∼ µ , Z (cid:48) ∼ µ (cid:48) and P π [ Z (cid:54) = Z (cid:48) ] = (cid:107) µ − µ (cid:48) (cid:107) T V where (cid:107) µ − µ (cid:48) (cid:107) T V is defined as (cid:107) µ − µ (cid:48) (cid:107) T V = sup A ∈Z [ µ ( A ) − µ (cid:48) ( A )] . Consider an instance space Z , a hypothesis space W and a non-negative loss function (cid:96) : W × Z → R + . Assume that the test and training samples are produced (in an i.i.d. fashion)from two unknown distributions µ and µ (cid:48) on Z respectively. A training dataset of size n isshown by the n -tuple, S (cid:48) = ( Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n ) ∈ Z n of i.i.d. random elements according to anunknown distribution µ (cid:48) . A learning algorithm is characterized by a probabilistic mapping A ( · )(a Markov Kernel) that maps training data S (cid:48) to the random variable W (cid:48) = A ( S (cid:48) ) ∈ W asthe output hypothesis. The population risk of a hypothesis w ∈ W is computed on the testdistribution µ as follows: L µ ( w ) (cid:44) E µ [ (cid:96) ( w, Z )] = (cid:90) Z (cid:96) ( w, z ) µ ( dz ) , ∀ w ∈ W . (3)The goal of learning is to ensure that under any data generating distribution µ , the populationrisk of the output hypothesis W (cid:48) is small, either in expectation or with high probability. Since µ and µ (cid:48) are unknown, the learning algorithm cannot directly compute L µ ( w ) for any w ∈ W ,but can compute the empirical risk of w on the training dataset S (cid:48) as an approximation, whichis defined as L S (cid:48) ( w ) (cid:44) n n (cid:88) i =1 (cid:96) ( w, Z (cid:48) i ) . (4)The true objective of the learning algorithm, L µ ( W (cid:48) ), is unknown to the learning algorithmwhile the empirical risk L S (cid:48) ( W (cid:48) ) is known. The generalization gap is defined as the differencebetween these two quantities as [3, 4]gen µ ( W (cid:48) , S (cid:48) ) = L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) ) , (5)where W (cid:48) = A ( S (cid:48) ) is the output of the algorithm A on the input S (cid:48) ∼ ( µ (cid:48) ) ⊗ n . In commonalgorithms such as empirical risk minimization (ERM) and gradient descent, L S (cid:48) ( W (cid:48) ) is min-imized [20, 21]. Therefore, to control L µ ( W (cid:48) ) we need to bound gen µ ( W (cid:48) , S (cid:48) ) from above (inexpectation or with high probability). Observe that gen µ ( W (cid:48) , S (cid:48) ), as defined in (5), is a ran-dom variable and a function of ( S (cid:48) , W (cid:48) ). The generalization error is the expected value ofgen µ ( W (cid:48) , S (cid:48) ): gen ( µ, µ (cid:48) , A ) = E [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] . 
(6)When there is no-mismatch, i.e., µ = µ (cid:48) , we denote the generalization error by gen ( µ, A ) forsimplicity. The following upper bound on the generalization error is given in [7] (see also [6]):4 heorem 1 ( [7]) . Assume that there is no distribution mismatch, i.e., µ (cid:48) = µ . Suppose (cid:96) ( w, Z ) is σ -subgaussian under Z ∼ µ for all w ∈ W . Take an arbitrary algorithm A that runs on atraining dataset S (cid:48) . Then the generalization error is bounded as gen( µ, A ) ≤ (cid:114) σ n I (cid:0) S (cid:48) ; A ( S (cid:48) ) (cid:1) . Let us write the sharpest possible bound on the generalization error given an upper bound r on I ( S (cid:48) ; A ( S (cid:48) )): D ( r ) (cid:44) sup P W (cid:48)| S (cid:48) : I ( W (cid:48) ; S (cid:48) ) ≤ r E [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] (7)where the supremum in (7) is over all Markov kernels P W (cid:48) | S (cid:48) with a bounded input/outputmutual information and S (cid:48) ∼ ( µ (cid:48) ) ⊗ n . We claim that D ( r ) is related to the rate-distortionfunction. To see this, consider a rate-distortion problem where the input symbol space is S ,the reproduction space is W and the following distortion function between a symbol w and aninput symbol s is used: ∆( w, s ) = L s ( w ) − L µ ( w ) . With this definition, from (7), we obtain − D ( r ) = inf P W (cid:48)| S (cid:48) : I ( W (cid:48) ; S (cid:48) ) ≤ r E [∆( W (cid:48) , S (cid:48) )] (8)which is in the rate-distortion form.With D ( r ) defined as in (7), it follows that for any arbitrary algorithm A with I (cid:0) S (cid:48) ; A ( S (cid:48) ) (cid:1) ≤ r we have gen ( µ, µ (cid:48) , A ) ≤ D ( r ) . This upper bound does not require any subgaussianity assumption on the loss function. Fromthis viewpoint, Theorem 1 is just a convenient and explicit lower bound on a rate-distortionfunction under an extra assumption on the loss function (for the no distribution mismatchcase). We formalize this intuition in Theorem 3.Computing the upper bound D ( r ) is a convex optimization problem and there are efficientalgorithms for computing it [22]. However, computation of the bound can be practically difficultif the sample size n is large. The following theorem provides a computable upper bound thatrequires running an optimization when the sample size is just one. Theorem 2.
For any arbitrary loss function (cid:96) ( w, z ) , and algorithm A that runs on a trainingdataset S (cid:48) of size n , we have gen ( µ, µ (cid:48) , A ) ≤ D (cid:18) I ( S (cid:48) ; A ( S (cid:48) )) n (cid:19) where D ( r ) (cid:44) max P ˆ W | Z (cid:48) : I ( ˆ W ; Z (cid:48) ) ≤ r E (cid:104) L µ ( ˆ W ) − (cid:96) ( ˆ W , Z (cid:48) ) (cid:105) (9) where Z (cid:48) ∈ Z is distributed according to µ (cid:48) . Furthermore, to compute the maximum in (9) , itsuffices to compute the maximum over all conditional distributions P ˆ W | Z for ˆ W ∈ W such thatthe support of ˆ W can be chosen of size at most |Z| + 1 . While the literature commonly takes the reproduction space to be the same as the input symbol space, therate-distortion theory does not formally require that. |Z| < ∞ ) and does not requireany subguassianity assumption on the loss function, the bound in Theorem 1 is in a veryexplicit form. Moreover, the bound in Theorem 1 (for the case of no-mismatch) dependsonly on mutual information I (cid:0) S (cid:48) ; A ( S (cid:48) ) (cid:1) while the bound in Theorem 2 depends on µ , µ (cid:48) and I (cid:0) S (cid:48) ; A ( S (cid:48) ) (cid:1) . However, one can obtain a bound from Theorem 2 that does not depend on µ , µ (cid:48) by maximizing the bound in Theorem 2 over all distributions µ and µ (cid:48) such that D ( µ (cid:48) (cid:107) µ ) ≤ γ for some γ >
0. We show that even after this maximization, the bounds in Theorem 2 is stillan improvement over Theorem 1. To show this, we need to prove that the bound in Theorem2 is always less than or equal to the bound in Theorem 1 for any arbitrary µ and µ (cid:48) satisfying D ( µ (cid:48) (cid:107) µ ) ≤ γ and the subgaussianity assumption on the loss function. Below, we give a generalresult for the rate-distortion function and deduce the relation between the bounds in Theorem1 and Theorem 2 as a corollary to it. Theorem 3.
Consider a generic rate-distortion problem for X ∼ ζ and a distortion function d ( x, ˆ x ) ∈ R . Let φ ( · ) be a function defined on ( − b, for some b ∈ (0 , ∞ ] as follows: φ ( λ ) = sup ˆ x log E η (cid:2) e λd ( X, ˆ x ) (cid:3) , for some distribution η on X (possibly different from ζ ). Then, inf P ˆ X | X : I ( ˆ X ; X ) ≤ r E (cid:104) d ( X, ˆ X ) (cid:105) ≥ sup − b<λ< (cid:26) λ [ r + D ( ζ X (cid:107) η X )] + 1 λ φ ( λ ) (cid:27) . (10)Proof of Theorem 3 is in Section 6.2.We apply the above theorem to obtain an upper bound on the bound given in Theorem 2as follows: let X = Z (cid:48) ∼ µ (cid:48) , ˆ X = ˆ W and d ( z (cid:48) , ˆ w ) = − [ L µ ( ˆ w ) − (cid:96) ( ˆ w, z (cid:48) )]. Corollary 1.
Suppose that (cid:96) ( w, Z ) is σ -subgaussian for every w ∈ W under the distribu-tion µ on Z . Take an arbitrary algorithm A that runs on a training dataset S (cid:48) . Then when I (cid:0) S (cid:48) ; A ( S (cid:48) ) (cid:1) ≤ r and D ( µ (cid:48) (cid:107) µ ) ≤ γ for some r, γ ≥ , then D ( r ) ≤ (cid:114) σ γ + 2 σ n r. (11) Remark 1.
Under the assumptions of Corollary 1, we deduce that gen ( µ, µ (cid:48) , A ) ≤ (cid:114) σ γ + 2 σ n r. (12) This generalizes the bound in Theorem 1 to the case of having mismatch.
Example 1.
Let W = Z = { , } and consider a learning problem on a data set S (cid:48) with thesize n = 1 with loss function (cid:96) ( w, z ) = w · z . Figure 1 depicts the bound in Theorem 1 versusthe maximum of the bound in Theorem 2 over all distributions µ on { , } for the case of no-mismatch for a particular loss function. Note that the distortion function itself depends on thechoice of µ and this makes it difficult to find a closed form expression for the maximum of thebound in Theorem 2 over all distributions µ . Example 2.
Let W = [0 , , Z = { , } and consider a learning problem on a data set S (cid:48) withthe size n = 1 with loss function (cid:96) ( w, z ) = | w − z | . Figure 2 depicts the bound in Theorem 1versus the maximum of the bound in Theorem 2 over all distributions µ on { , } for the caseof no-mismatch for a particular loss function. µ , assuming no distribution mismatch, W = Z = { , } and (cid:96) ( w, z ) = w · z .Figure 2: The bound in Theorem 1 versus the maximum of the upper bound in Theorem2 over all distributions µ , assuming no distribution mismatch, W = [0 , , Z = { , } and (cid:96) ( w, z ) = | w − z | . 7 .2.1 An improved upper bound In [8], a strengthened version of Theorem 1 is given as follows:
Theorem 4 ( [8]) . Suppose that the loss function (cid:96) ( w, Z ) is σ -subgaussian under the distri-bution µ on Z for any w ∈ W . For µ (cid:48) = µ , we have: gen( µ, µ (cid:48) , A ) ≤ n n (cid:88) i =1 (cid:113) σ I (cid:0) Z (cid:48) i ; A ( S (cid:48) ) (cid:1) . (13)The following variant of Theorem 4 holds for the case with distribution mismatch: Theorem 5.
For any arbitrary loss function (cid:96) ( w, z ) , and algorithm A that runs on a trainingdataset S (cid:48) = ( Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n ) of size n , we have gen ( µ, µ (cid:48) , A ) ≤ n n (cid:88) i =1 D ( I ( Z (cid:48) i ; A ( S (cid:48) ))) where D ( r ) is given in (9) . Moreover, if the loss function (cid:96) ( w, Z ) under µ is σ -subgaussianfor all w ∈ W , we further have gen ( µ, µ (cid:48) , A ) ≤ n n (cid:88) i =1 D ( I ( Z (cid:48) i ; A ( S (cid:48) ))) ≤ n n (cid:88) i =1 (cid:113) σ (cid:2) I (cid:0) Z (cid:48) i ; A ( S (cid:48) ) (cid:1) + D ( µ (cid:48) (cid:107) µ ) (cid:3) . Proof of Theorem 5 is in Section 6.3.
Next, we consider lower bounds on the generalization error. Similar to (7), the following lowerbound on the generalization error given an upper bound r on I ( S (cid:48) ; A ( S (cid:48) )) can be written:inf P W (cid:48)| S (cid:48) : I ( W (cid:48) ; S (cid:48) ) ≤ r E [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] (14)where the infimum in (14) is over all Markov kernels P W (cid:48) | S (cid:48) with a bounded input/outputmutual information and S (cid:48) ∼ ( µ (cid:48) ) ⊗ n . However, the bound in this form may not be useful.To see this, assume that µ = µ (cid:48) . One possible choice for W (cid:48) in (14) is a constant randomvariable. For this choice, I ( W (cid:48) ; S (cid:48) ) = 0 ≤ r and the bound in (14) vanishes. It follows that D ( r ) ≤
0. However, we are interested in a lower bound on the generalization error in termsof the population risk. In order to prevent W (cid:48) from being a constant random variable, weattempt to find a lower bound on the generalization error in terms of both I ( S (cid:48) ; A ( S (cid:48) )) and anassumption about the marginal distribution of the output of the algorithm A ( S (cid:48) ). In particular,we assume that I ( S (cid:48) ; A ( S (cid:48) )) ≤ r and A ( S (cid:48) ) ∼ p W (cid:48) ∈ M for a family M of distributions on W .We aim to find a lower bound on gen ( µ, µ (cid:48) , A ) that depends on both r and M . The sharpestsuch bound is D (cid:0) r, M (cid:1) = inf P W (cid:48) ∈M inf P W (cid:48) ,S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ): I ( W (cid:48) ; S (cid:48) ) ≤ r E [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] (15)where U ( P W (cid:48) , P S (cid:48) ) is the set of all couplings of two marginal distribution P W (cid:48) and P S (cid:48) on W ×S .When r = 0, the set U ( P W (cid:48) , P S (cid:48) ) includes only the product distribution P W (cid:48) P S (cid:48) and D (cid:0) , M (cid:1) can be computed explicitly. The following theorem gives an explicit lower bound on the gener-alization error when r >
0: 8 heorem 6.
Let ψ ( λ ) be a function satisfying ψ ( λ ) ≥ sup ν ∈M E ν (cid:2) e λ [ (cid:96) ( W,z ) − E ν [ (cid:96) ( W,z )]] (cid:3) , ∀ z ∈ Z . Then, we have: D (cid:0) , M (cid:1) ≥ D (cid:0) r, M (cid:1) ≥ D (cid:0) , M (cid:1) − inf λ ≥ [ λr − λ ( ψ (1 /nλ ) n − . Corollary 2.
Suppose that (cid:96) ( W (cid:48) , z ) is α -subgaussian under any P W (cid:48) ∈ M for all z ∈ Z .Considering the special choice of λ = α/ √ nr , we deduce D (cid:0) , P W (cid:48) (cid:1) ≥ D (cid:0) r, P W (cid:48) (cid:1) ≥ D (cid:0) , P W (cid:48) (cid:1) − √ n (cid:20) α √ r √ α √ r ( e r − (cid:21) . Therefore, gen ( µ, µ (cid:48) , A ) ≥ D (cid:0) , P W (cid:48) (cid:1) − √ n (cid:34) α (cid:112) I ( S (cid:48) ; A ( S (cid:48) )) √ α (cid:112) I ( S (cid:48) ; A ( S (cid:48) )) ( e I ( S (cid:48) ; A ( S (cid:48) )) − (cid:35) . Proof of the above theorem can be found in Section 6.4. The following theorem givesanother lower bound on the generalization error which can be compared with the upper boundin Theorem 2:
Theorem 7.
For any arbitrary loss function (cid:96) ( w, z ) , and algorithm A that runs on a trainingdataset S (cid:48) of size n and induces a marginal distribution on A ( S (cid:48) ) in M , we have gen ( µ, µ (cid:48) , A ) ≥ D (cid:18) I ( S (cid:48) ; A ( S (cid:48) ) n , M (cid:19) where D ( r, M ) (cid:44) inf P W (cid:48) ∈M min P ˆ W,Z (cid:48) ∈ U ( P W (cid:48) ,µ (cid:48) ): I ( ˆ W ; Z (cid:48) ) ≤ r E (cid:104) L µ ( ˆ W ) − (cid:96) ( ˆ W , Z (cid:48) ) (cid:105) . (16)The proof is given in the Section 6.5. High probability guarantees:
Just as the excess distortion probability of a rate-distortioncode has been subject of many studies in information theory (see [23, 24] for two examples), anumber of “high probability” upper bounds on the generalization gap are also reported in theliterature. Here the problem is to find an upper bound on P [gen µ ( W (cid:48) , S (cid:48) ) ≥ η ] = P [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) ) ≥ η ]for some given η .The following bound is a generalization of a bound in [15] to include distribution mismatch.Our method for deriving this inequality is different from the one used in [15], and similar tothe one used in [11]. Theorem 8.
Take some algorithm A that runs on a training dataset S (cid:48) and produces an outputhypothesis W (cid:48) = A ( S (cid:48) ) . Let (cid:96) ( w, Z ) be a loss function which is σ -subgaussian under thedistribution µ on Z for all w ∈ W . Then, we have P [ | gen µ ( W (cid:48) , S (cid:48) ) | ≥ η ] ≤ − n (cid:16) η − σ D ( µ (cid:48) (cid:107) µ ) (cid:17) − σ D ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) )3 σ . (17)9roof of the above theorem is given in Section 6.6. Performance of the ERM algorithm:
As an application of Theorem 8, let us considerthe ERM algorithm which is defined as follows: W ERM ( S (cid:48) ) = arg min w ∈W L S (cid:48) ( w ) . (18)Then, we claim the following upper bound on excess risk of the ERM algorithm: Theorem 9.
Let (cid:96) ( w, Z ) be σ -subgaussian under the distribution µ on Z for every w . Considerthe ERM learning algorithm A ERM as defined in (18) . Then, with probability of at least − δ , L µ ( A ERM ( S (cid:48) )) ≤ min w ∈W L µ ( w ) + (cid:115) σ D ( µ (cid:48) (cid:107) µ ) + 2 σ log (cid:0) δ (cid:1) n + (cid:115) σ D ( µ (cid:48) (cid:107) µ ) + 2 σ (cid:2) D ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) + 3 log (cid:0) δ (cid:1)(cid:3) n . (19)Proof of the above theorem can be found in Section 6.7. Take an algorithm A along with a sample-complexity guarantee for a family of distributions P , i.e., the model has specified the class P . Given δ, (cid:15) >
0, sample complexity is defined as n ( A , P , (cid:15), δ ) = min (cid:26) N : ∀ n > N, sup µ ∈P P (cid:20) L µ ( A ( S )) − min w ∈W L µ ( w ) ≤ (cid:15) (cid:21) ≥ − δ (cid:27) where S = ( Z , Z , · · · , Z n ) ∼ µ ⊗ n is the training data. We would like to find the increase insample-complexity if we expand the set P to P γ = { µ (cid:48) : inf µ ∈P D ( µ (cid:48) (cid:107) µ ) ≤ γ } . The set P γ relates to model misspecification when it is llimited in KL divergence of at most γ .We utilize the following alternative equivalent definition of sample-complexity: (cid:15) ( A , P , n, δ ) = inf (cid:26) x ∈ R : sup µ ∈P P (cid:20) L µ ( A ( S )) − min w ∈W L µ ( w ) ≤ x (cid:21) ≥ − δ (cid:27) . We restrict to uniformly-stable algorithms. In general terms, a learning algorithm is said tobe stable if a small change of the input to the algorithm does not change the output of thealgorithm much. Examples of stability definitions include uniform stability defined by Bousquetand Elisseeff [25]. The definition of stability that we adopt in this paper is as follows:
Definition 2.
Given non-negative real numbers β i ( n ) we say that the A is called uniformly-stable if for any s = ( z , z , · · · , z n ) , s = ( z , z , · · · , z n ) ∈ Z n , the following inequalityholds (almost surely): | (cid:96) ( A ( s ) , z ) − (cid:96) ( A ( s ) , z ) | ≤ n (cid:88) i =1 β i ( n ) [ z i (cid:54) = z i ] , ∀ z ∈ Z . (20)10 heorem 10. Let (cid:96) ( w, Z ) be σ -subgaussian over Z with distribution µ ∈ P for every w . Then,for every n, γ > and δ ∈ [0 , , we have (cid:15) ( A , P γ , n, δ ) ≤ (cid:15) ( A , P , n, δ ) + n (cid:88) i =1 β i + 2 (cid:112) σ γ, (21) and (cid:15) ( A , P γ , n, δ ) ≤ (cid:15) ( A , P , n, δ/
2) + f ( δ ) , (22) where the function f is defined as f ( δ ) (cid:44) (cid:112) σ γ + (cid:112) σ [log(2 /δ ) + γ ] + 1 g ( δ/ n (cid:88) i =1 ln (cid:18) g ( δ/ β i ) − √ γ (cid:19) ,g ( δ ) (cid:44) (cid:114) /δ ) + γ ] σ . (23)Proof of Theorem 10 is given in Section 6.8. Remark 2.
While the upper bound (21) is in terms of (cid:80) ni =1 β i , the upper bound (22) is not.Numerical simulations suggest that the bound (22) is better than the bound (21) if β i ∼ Ω( √ n ) .The regime β i ∼ Ω( √ n ) could be of importance, e.g., see [20, 21]. While D ( r ) (as defined in (7)) is the sharpest possible bound on the generalization error givenan upper bound r on I ( S (cid:48) ; A ( S (cid:48) )), the single-letter bound D ( r/n ) in Theorem 2 is not. In fact,the following relaxation is used in the proof of Theorem 2: instead of producing one outputhypothesis W for the entire sequence S (cid:48) = ( Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n ), we produce n output hypothesis˜ W , ˜ W , · · · , ˜ W n . To tighten the gap between D ( r ) and D ( r/n ), one needs to answer thefollowing question: given a joint distribution ( W, Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n ), what are the set of marginaldistributions on ( W, Z (cid:48) i )? For instance if W is a binary random variable and ( Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n )are i.i.d., W cannot have high dependence with all of the Z (cid:48) i ’s. Motivated by the above question, in the rest of this section we present a general idea whichmay be used on its own, or in conjunction with the ideas in the previous section to improvethe upper bound given in Theorem 2. Let ˜ (cid:96) ( w, z ) be an “auxiliary” loss function; an arbitraryloss function of our choice which can be different from the original loss function (cid:96) ( w, z ). Weshow that the average risk of the ERM algorithm on the auxiliary loss function ˜ (cid:96) can be usedto bound the generalization error of a different algorithm A , which runs on the same trainingdata as the ERM algorithm, but with the original loss function (cid:96) ( w, z ). Let ERM ( z (cid:48) , · · · , z (cid:48) n ) = min w n (cid:88) i =1 n ˜ (cid:96) ( w, z (cid:48) i )be the risk of the ERM algorithm given a training sequence s (cid:48) = ( z (cid:48) , z (cid:48) , · · · , z (cid:48) n ) according to˜ (cid:96) . Let v n = E S (cid:48) ∼ ( ρ (cid:48) ) ⊗ n ERM ( Z (cid:48) , · · · , Z (cid:48) n ) In particular, using mutual information as the measure of dependence we have the following: for a bi-nary W and a sequence ( Z (cid:48) , · · · , Z (cid:48) n ) of independent random variables, we have 1 ≥ I ( W ; Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n ) ≥ (cid:80) i I ( W ; Z (cid:48) i ). See (31) for a proof. Thus, sum of correlations between W and Z (cid:48) i is no more than one bit.
11e the average risk of the ERM algorithm. Let us, for now, assume that v n is known to us.Take an arbitrary algorithm A . Let W (cid:48) = A ( S (cid:48) ) Then, the risk of A with respect to ˜ (cid:96) isgreater than or equal the risk of the ERM algorithm, i.e., E (cid:34) n (cid:88) i =1 n ˜ (cid:96) ( W (cid:48) , Z (cid:48) i ) (cid:35) ≥ v n . (24)Let Q be a random variable, independent of all previously defined variables, and uniform onthe set { , , · · · , n } . Set ˜ Z = Z (cid:48) Q . Observe that ˜ Z ∼ µ (cid:48) because Z (cid:48) i ∼ µ (cid:48) for all i and Q isindependent of ( Z (cid:48) , · · · , Z (cid:48) n ). Using this definition for ˜ Z , the risk of A with respect to the loss˜ (cid:96) equals E [˜ (cid:96) ( W (cid:48) , ˜ Z )] = E (cid:34) n (cid:88) i =1 n ˜ (cid:96) ( W (cid:48) , Z (cid:48) i ) (cid:35) (25)and the generalization error with respect to the loss (cid:96) can be characterized asgen ( µ, µ (cid:48) , A ) = 1 n n (cid:88) i =1 E [ L µ ( W (cid:48) ) − (cid:96) ( W (cid:48) , Z (cid:48) i )] = E (cid:104) L µ ( W (cid:48) ) − (cid:96) ( W (cid:48) , ˜ Z ) (cid:105) . (26)From (24), (25) and (26) we obtain the following upper bound on the generalization error of (cid:96) :gen ( µ, µ (cid:48) , A ) ≤ max P ˆ W | ˜ Z : E [˜ (cid:96) ( ˆ W , ˜ Z )] ≥ v n E (cid:104) L µ ( ˆ W ) − (cid:96) ( ˆ W , ˜ Z ) (cid:105) (27)where ˜ Z ∈ Z is distributed according to µ (cid:48) . The above bound has a similar form as the onegiven in Theorem 2. Observe that (27) provides a generalization bound on the algorithm A based on the sole assumption that it uses a training data of size n . If more is known aboutthe algorithm, e.g. an upper bound on the input and output mutual information, we can writebetter bounds as follows: Theorem 11.
Let ˜ D ( r ) (cid:44) max P ˆ W | ˜ Z : I ( ˆ W ; ˜ Z ) ≤ r, E [˜ (cid:96) ( ˆ W , ˜ Z )] ≥ v n E (cid:104) L µ ( ˆ W ) − (cid:96) ( ˆ W , ˜ Z ) (cid:105) . (28) Then, D ( r ) ≤ ˜ D ( r/n ) ≤ D ( r/n ) . Proof of the Theorem 11 can be found in Section 6.9.
Example 3.
Consider the setting in Example 1. Figure 3 illustrates this improvement in D ( r/n ) when ˜ (cid:96) ( w, z ) = − [ w (cid:54) = z ] and n = 10 . Example 4.
Consider the setting in Example 2. Figure 3 illustrates this improvement in D ( r/n ) when ˜ (cid:96) ( w, z ) = ( w − z ) and n = 10 . In order to use the bound in Theorem 11, one must know the value of v n . However, this isnot known in practice. For instance, consider the special case of loss function ˜ (cid:96) ( w, z ) = ( w − z ) .Given a training data ( z (cid:48) , z (cid:48) , · · · , z (cid:48) n ), the output of the ERM algorithm with the quadratic lossis just the average of the traning data samples and v n equals n − n Var µ (cid:48) ( Z (cid:48) ) . (cid:96) ( w, z ) = − [ w (cid:54) = z ] for W = Z = { , } and n = 10 and the original loss function (cid:96) ( w, z ) = w · z .Figure 4: The bound in Theorem 2 and its improved version via the auxiliary loss function˜ (cid:96) ( w, z ) = ( w − z ) for the learning setting W = [0 , , Z = { , } and n = 10 and the originalloss function (cid:96) ( w, z ) = | w − z | . 13he variance of the test data is not known, but can be estimated from the training datasetitself. Below we show how to estimate v n by running the ERM algorithm on the availabletraining data. Assume that the auxiliary loss satisfies | ˜ (cid:96) ( w, z ) − ˜ (cid:96) ( w, z (cid:48) ) | ≤ c for all w, z, z (cid:48) .Then, we have ERM ( z (cid:48) , z (cid:48) , · · · , z (cid:48) n ) = min w n n (cid:88) i =1 ˜ (cid:96) ( w, z i ) ≤ min w (cid:34) cn + 1 n ˜ (cid:96) ( w, z (cid:48)(cid:48) ) + 1 n n (cid:88) i =2 ˜ (cid:96) ( w, z (cid:48) i ) (cid:35) = cn + ERM ( z (cid:48)(cid:48) , z (cid:48) , · · · , z (cid:48) n ) . Then McDiarmid’s inequality implies high concentration around expected value for the ERMalgorithm: P (cid:2)(cid:12)(cid:12) ERM − E [ ERM ] (cid:12)(cid:12) ≥ t (cid:3) ≤ e − nt c . Thus, one can find an estimate for v n with high probability based on the available training datasequence.At the end, we remark that it is also possible to write bounds based on multiple auxiliaryloss functions rather than just one. The first author is also grateful to Dr. Mohammad Mahdi Mojahedian for helpful discussionson learning from heterogeneous data in mixture models which gave birth to some ideas in thiswork.
In the following sections we present the proofs of the results stated in the previous section intheir order of appearance.
Let ˜ w = ( ˜ w , ˜ w , · · · , ˜ w n ) ∈ W n be a sequence of length n . Let¯ D ( r ) (cid:44) sup P ˜ W | S (cid:48) : I ( ˜ W ; S (cid:48) ) ≤ r n n (cid:88) i =1 E (cid:104) L µ ( ˜ W i ) − (cid:96) ( ˜ W i , Z (cid:48) i ) (cid:105) (29)where S (cid:48) = ( Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n ) and ˜ W = ( ˜ W , ˜ W , · · · , ˜ W n ). Observe that if the entries of thevector ˜ W are all equal, the expression in (29) reduces to the one in (7). Therefore, in (29)we are taking the supremum over a larger set. Thus, ¯ D ( r ) ≥ D ( r ). It follows that for anyalgorithm A satisfying I (cid:0) S (cid:48) ; A ( S (cid:48) ) (cid:1) ≤ r , we havegen ( µ, µ (cid:48) , A ) ≤ ¯ D ( r ) . We claim that ¯ D ( r ) = D ( r/n ). The proof follows similar steps as in [26, Section 3.6.2]for lossy compression. However, we provide a proof for completeness. We first claim that¯ D ( r ) ≥ D ( r/n ). To see this, take some P ˆ W | Z (cid:48) in (9) and take p ( ˜ w | s (cid:48) ) = n (cid:89) i =1 p ˆ W | Z (cid:48) ( ˜ w i | z (cid:48) i ) . P ˆ W | Z (cid:48) in (29) shows that ¯ D ( r ) ≥ D ( r/n ).It remains to show that ¯ D ( r ) ≤ D ( r/n ). Take some arbitrary P ˜ W | S (cid:48) satisfying I ( ˜ W ; S (cid:48) ) ≤ r . We have r ≥ I ( ˜ W ; S (cid:48) )= (cid:88) i I ( ˜ W ; Z (cid:48) i | Z (cid:48) i − )= (cid:88) i I ( ˜ W , Z (cid:48) i − ; Z (cid:48) i ) (30) ≥ (cid:88) i I ( ˜ W i ; Z (cid:48) i ) (31)where (30) follows from the fact that Z (cid:48) i are i.i.d. random variables. We also have1 n n (cid:88) i =1 E (cid:104) L µ ( ˜ W i ) − (cid:96) ( ˜ W i , Z (cid:48) i ) (cid:105) ≤ n n (cid:88) i =1 D ( I ( ˜ W i ; Z (cid:48) i )) (32) ≤ D (cid:32) n n (cid:88) i =1 I ( ˜ W i ; Z (cid:48) i ) (cid:33) (33) ≤ D (cid:16) rn (cid:17) (34)where (32) follows from the definition of D , (33) follows from concavity of D ( · ), and (34)follows from (31) and the fact that D ( · ) is an increasing function. Concavity of D ( · ) followsfrom the fact that mutual information I ( ˜ W ; S (cid:48) ) is convex in P ˜ W | S (cid:48) for a fixed distribution on P S (cid:48) .Since P ˜ W | S (cid:48) was an arbitrary conditional distribution satisfying I ( ˜ W ; S (cid:48) ) ≤ r , we deducefrom (32)-(34) that D ( r/n ) ≥ ¯ D ( r ) as desired.The cardinality bounds on the auxiliary random variable ˆ W in the definition of D comesfrom the standard Caratheodory-Bunt [27] arguments and is omitted. Given the distribution ζ ( x ) and some arbitrary conditional distribution ζ (ˆ x | x ), let ζ ( x, ˆ x ) = ζ (ˆ x | x ) ζ ( x ). Set q ( x, ˆ x ) = η ( x ) ζ (ˆ x ) and f ( x, ˆ x ) = λd ( x, ˆ x ) where − b < λ <
0. From theDonsker-Varadhan representation, we obtain that D ( ζ X, ˆ X (cid:107) q X, ˆ X ) ≥ λ E ζ [ d ( ˆ X, X )] − log E q (cid:104) e λ ( d ( X, ˆ X )) (cid:105) , (35)Using independence of X and ˆ X under q we can write for − b < λ < E q (cid:104) e λ ( d ( X, ˆ X )) (cid:105) = log E ˆ X ∼ ζ (cid:110) E X ∼ η (cid:104) e λ ( d ( X, ˆ X )) (cid:105)(cid:111) ≤ sup ˆ x log E η (cid:2) e λd ( X, ˆ x ) (cid:3) ≤ φ ( λ ) . Then from (35) and consider λ < E ζ [ d ( ˆ X, X )] ≥ λ D ( ζ X, ˆ X (cid:107) q X, ˆ X ) + 1 λ φ ( λ ) . Moreover, D ( ζ X, ˆ X (cid:107) q X, ˆ X ) = I ζ ( ˆ X ; X ) + D ( ζ X (cid:107) η X ) and I ζ ( ˆ X ; X ) ≤ r . Thus, E ζ [ d ( ˆ X, X )] ≥ sup − b<λ< (cid:26) λ [ r + D ( ζ X (cid:107) η X )] + 1 λ φ ( λ ) (cid:27) .
15n conclusion, inf P ˆ X | X : I ζ ( ˆ X ; X ) ≤ r E ζ (cid:104) d ( X, ˆ X ) (cid:105) ≥ sup − b<λ< (cid:26) λ [ r + D ( ζ X (cid:107) η X )] + 1 λ φ ( λ ) (cid:27) . (36) The inequality 1 n n (cid:88) i =1 D ( I ( Z (cid:48) i ; A ( S (cid:48) ))) ≤ n n (cid:88) i =1 (cid:113) σ (cid:2) I (cid:0) Z (cid:48) i ; A ( S (cid:48) ) (cid:1) + D ( µ (cid:48) (cid:107) µ ) (cid:3) . follows from Corollary 1. To show the inequalitygen ( µ, µ (cid:48) , A ) ≤ n n (cid:88) i =1 D ( I ( Z (cid:48) i ; A ( S (cid:48) )))take some algorithm A and let W (cid:48) = A ( S (cid:48) n ). Then,gen ( µ, µ (cid:48) , A ) = 1 n n (cid:88) i =1 E [ L µ ( W (cid:48) ) − (cid:96) ( W (cid:48) , Z (cid:48) i )] ≤ n n (cid:88) i =1 D ( I ( W (cid:48) ; Z (cid:48) i )) (37)where (37) follows from the definition of D . It suffices to prove the lower bound when a fixed distribution P W (cid:48) ∈ M is chosen for the outputof the algorithm because a minimum can be taken over all P W (cid:48) ∈ M from both sides of thedesired inequality at the end. We have D ( r ) = min P W (cid:48) S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ): I ( W (cid:48) ; S (cid:48) ) ≤ r E [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] (38)= min P W (cid:48) S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ) max λ ≥ E [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] + λD ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) − λr (39) ≥ max λ ≥ min P W (cid:48) S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ) E [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] + λD ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) − λr (40) ≥ max λ ≥ [ D (0) + λ − λψ (1 / ( nλ )) n − λr ] (41)= D (0) − min λ ≥ [ λr + λ ( ψ (1 / ( nλ )) n − . (42)where (41) follows from Lemma 3. Lemma 3.
Let (cid:96) ( W (cid:48) , z ) satisfies ψ ( λ ) ≥ E P W (cid:48) (cid:104) e λ [ (cid:96) ( W (cid:48) ,z ) − E PW (cid:48) [ (cid:96) ( W (cid:48) ,z )] ] (cid:105) , ∀ z ∈ Z . Then, for any λ ≥ P W (cid:48) S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ) E P W (cid:48) S (cid:48) [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] + λD ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) ≥ D (0) − λ ( ψ (1 / ( nλ )) n − . (43)16 roof. Assume that W (cid:48) ∼ ζ and S (cid:48) ∼ β are the marginal distributions of W (cid:48) and S (cid:48) . Setting∆( w, s ) = L µ ( w ) − L s ( w ) , we can express the left hand side of (43) asmin ( W (cid:48) ,S (cid:48) ) ∼ π ∈ U ( ζ,β ) E π ∆( W (cid:48) , S (cid:48) ) + λ (cid:90) W×Z ⊗ n φ (cid:18) dπ ( w (cid:48) , s (cid:48) ) dζ ( w (cid:48) ) dβ ( s (cid:48) ) (cid:19) dζ ( w (cid:48) ) dβ ( s (cid:48) ) (44)where φ ( x ) = x log( x ) − x + 1. We find the dual problem of the above optimization problem.Introducing the Lagrange multipliers f and g associated to the constraints, the Lagrangianreads L ( λ, ζ, β ) = E π ∆( W (cid:48) , S (cid:48) ) + λ (cid:90) W×Z ⊗ n φ (cid:18) dπ ( w (cid:48) , s (cid:48) ) dζ ( w (cid:48) ) dβ ( s (cid:48) ) (cid:19) dζ ( w (cid:48) ) dβ ( s (cid:48) )+ (cid:90) W f ( w (cid:48) ) (cid:18) dζ ( w (cid:48) ) − (cid:90) Z ⊗ n dπ ( w (cid:48) , s (cid:48) ) (cid:19) + (cid:90) Z ⊗ n g ( s (cid:48) ) (cid:18) dβ ( s (cid:48) ) − (cid:90) W dπ ( w (cid:48) , s (cid:48) ) (cid:19) . The dual Lagrange function is given by min π L ( λ, α, β ) over all π ( w (cid:48) , s (cid:48) ) ≥
0. Note that incomputing the minimum we do not require (cid:80) w (cid:48) ,s (cid:48) π ( w (cid:48) , s (cid:48) ) = 1. Observe thatmin π L ( λ, ζ, β )= (cid:90) W f ( w (cid:48) ) dζ ( w (cid:48) ) + (cid:90) Z ⊗ n g ( s (cid:48) ) dβ ( s (cid:48) )+ λ min π (cid:18)(cid:90) W×Z ⊗ n (cid:18) φ (cid:18) dπ ( w (cid:48) , s (cid:48) ) dζ ( w (cid:48) ) dβ ( s (cid:48) ) (cid:19) + ∆( w (cid:48) , s (cid:48) ) − f ( w (cid:48) ) − g ( s (cid:48) ) λ dπ ( w (cid:48) , s (cid:48) ) dζ ( w (cid:48) ) dβ ( s (cid:48) ) (cid:19) dζ ( w (cid:48) ) dβ ( s (cid:48) ) (cid:19) = (cid:90) W f ( w (cid:48) ) dζ ( w (cid:48) ) + (cid:90) Z ⊗ n g ( s (cid:48) ) dβ ( s (cid:48) ) − λ (cid:90) W×Z ⊗ n φ ∗ (cid:18) f ( w (cid:48) ) + g ( s (cid:48) ) − ∆( w (cid:48) , s (cid:48) ) λ (cid:19) dζ ( w (cid:48) ) dβ ( s (cid:48) ) , where φ ∗ is the Legendre transform of φ given by φ ∗ ( y ) = sup x ≥ (cid:2) xy − φ ( x ) (cid:3) = e y − . Thus, we obtainmin π L ( λ, ζ, β ) = E [ g ( S (cid:48) )] + E [ f ( W (cid:48) )] + λ − λ E P S (cid:48) P W (cid:48) (cid:20) exp (cid:18) − ∆( W (cid:48) , S (cid:48) ) − g ( S (cid:48) ) − f ( W (cid:48) ) λ (cid:19)(cid:21) . (45)From weak duality, for every continuous functions f and g , E [ g ( S (cid:48) )] + E [ f ( W (cid:48) )] + λ − λ E P S (cid:48) P W (cid:48) (cid:20) exp (cid:18) − ∆( W (cid:48) , S (cid:48) ) − g ( S (cid:48) ) − f ( W (cid:48) ) λ (cid:19)(cid:21) ≤ min P W (cid:48) S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ) E P W (cid:48) S (cid:48) [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] + λD ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) . (46)Assigning g ( s (cid:48) ) = − E W (cid:48) ∼ ζ L s (cid:48) ( W (cid:48) ) and f ( w (cid:48) ) = L µ ( w (cid:48) ) and using the fact that ∆( w, s ) = L µ ( w ) − L s ( w ) , we obtain E [ L µ ( W (cid:48) ) − L µ (cid:48) ( W (cid:48) )] + λ − λ E P S (cid:48) P W (cid:48) (cid:20) exp (cid:18) L S (cid:48) ( W (cid:48) ) − E P W (cid:48) L S (cid:48) ( W (cid:48) ) λ (cid:19)(cid:21) min P W (cid:48) S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ) E P W (cid:48) S (cid:48) [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] + λD ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) . (47)We give an upper bound for the exponential term as follows: E P S (cid:48) P W (cid:48) (cid:20) exp (cid:18) L S (cid:48) ( W (cid:48) ) − E P W (cid:48) L S (cid:48) ( W (cid:48) ) λ (cid:19)(cid:21) = n (cid:89) i =1 E P Z (cid:48) i P W (cid:48) (cid:20) exp (cid:18) (cid:96) ( W (cid:48) , Z (cid:48) i ) − E P W (cid:48) (cid:96) ( W (cid:48) , Z (cid:48) i ) nλ (cid:19)(cid:21) ≤ n (cid:89) i =1 sup z (cid:48) ∈Z E P W (cid:48) (cid:20) exp (cid:18) (cid:96) ( W (cid:48) , z (cid:48) ) − E P W (cid:48) (cid:96) ( W (cid:48) , z (cid:48) ) nλ (cid:19)(cid:21) ≤ (cid:18) ψ (cid:18) nλ (cid:19)(cid:19) n . Thus, min P W (cid:48) S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ) E P W (cid:48) S (cid:48) [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] + λD ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) ≥ E [ L µ ( W (cid:48) ) − L µ (cid:48) ( W (cid:48) )] + λ − λ (cid:18) ψ (cid:18) nλ (cid:19)(cid:19) n = D (0) + λ − λ (cid:18) ψ (cid:18) nλ (cid:19)(cid:19) n . The proof is similar to the proof of Theorem 2. As in Section 6.1, we let ˜ w = ( ˜ w , ˜ w , · · · , ˜ w n ) ∈W n be a sequence of length n . 
Let˜ D ( r ) (cid:44) inf 1 n n (cid:88) i =1 E (cid:104) L µ ( ˜ W i ) − (cid:96) ( ˜ W i , Z (cid:48) i ) (cid:105) (48)where the infimum is over P ˜ W ,S (cid:48) satisfying P ˜ W i ,S (cid:48) ∈ U ( P W (cid:48) , P S (cid:48) ) and I ( ˜ W ; S (cid:48) ) ≤ r . Observethat if the entries of the vector ˜ W are all equal, the expression in (48) reduces to the one in(16). Therefore, in (48) we are taking the infimum over a larger set. Thus, ˜ D ( r ) ≤ D ( r ). Itfollows that for any algorithm A satisfying I (cid:0) S (cid:48) ; A ( S (cid:48) ) (cid:1) ≤ r , we havegen ( µ, µ (cid:48) , A ) ≥ ˜ D ( r ) . We claim that ˜ D ( r ) = D ( r/n ). The rest of the proof follows similar lines as in the proof ofTheorem 2 given in Section 6.1. Thus, it is omitted. We would like to bound L µ ( A ( S (cid:48) )) − L S (cid:48) ( A ( S (cid:48) )) from above. Let S = ( Z , Z , · · · , Z n ) bedistributed according to µ ⊗ n , while S (cid:48) = ( Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n ) was distributed according to ( µ (cid:48) ) ⊗ n .From the subgaussian assumption, we have E P S P W (cid:48) [exp ( λL µ ( W (cid:48) ) − λL S ( W (cid:48) ))] ≤ exp (cid:18) λ σ n (cid:19) . (49)Using a change of measure argument, we obtain E P W (cid:48) S (cid:48) (cid:20) exp (cid:18) λL µ ( W (cid:48) ) − λL S (cid:48) ( W (cid:48) ) − λ σ n − log dP W (cid:48) S (cid:48) dP W (cid:48) dP S (cid:19)(cid:21) ≤ . (50)Using Markov’s inequality P [ X > δ ] < E [ X ] δ , we deduce P W (cid:48) S (cid:48) (cid:20) exp (cid:18) λL µ ( W (cid:48) ) − λL S (cid:48) ( W (cid:48) ) − λ σ n − log dP W (cid:48) S (cid:48) dP W (cid:48) dP S (cid:19) ≥ δ (cid:21) ≤ δ. (51)18quivalently, P W (cid:48) S (cid:48) (cid:20) λL µ ( W (cid:48) ) − λL S (cid:48) ( W (cid:48) ) ≥ λ σ n + log dP W (cid:48) S (cid:48) dP W (cid:48) dP S + log (cid:18) δ (cid:19)(cid:21) ≤ δ. (52)Thus, the following inequality holds with probability at least 1 − δ : L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) ) ≤ − δ λσ n + 1 λ log dP W (cid:48) S (cid:48) dP W (cid:48) dP S + 1 λ log (cid:18) δ (cid:19) . (53)Using Chernoff’s bound on log (cid:16) dP W (cid:48) S (cid:48) dP W (cid:48) dP S (cid:48) (cid:17) , we get for 1 < α , P W (cid:48) S (cid:48) (cid:20) log (cid:18) dP W (cid:48) S (cid:48) dP W (cid:48) dP S (cid:19) ≥ t (cid:21) ≤ E P W (cid:48) S (cid:48) (cid:34) e ( α −
1) log (cid:18) dPW (cid:48) S (cid:48) PW (cid:48) PS (cid:19) (cid:35) e ( α − t = e ( α − D α ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S ) − t ) . Thus, log (cid:18) dP W (cid:48) S (cid:48) dP W (cid:48) dP S (cid:19) ≤ − δ D α ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S ) + 1 α − (cid:18) δ (cid:19) . For the case of no-mismatch, the above equation together with (53) recovers the result of [15]once we optimize over λ .We use Lemma 4 to show the following inequality: D α ( P W (cid:48) S (cid:48) || P W (cid:48) P S ) ≤ D α − p ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) + nD α − q ( µ (cid:48) (cid:107) µ ) , where p and q are non-negative and Holder conjugate. Therefore, combining with (53), we get L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) ) ≤ − δ λσ n + 1 λ D α − p ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) + nλ D α − q ( µ (cid:48) (cid:107) µ ) + α ( α − λ log (cid:18) δ (cid:19) . Optimizing over λ yields L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) ) ≤ − δ (cid:115) σ D α − q ( µ (cid:48) (cid:107) µ ) + 2 σ (cid:2) D α − p ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) + αα − log (cid:0) δ (cid:1)(cid:3) n . (54)Then with α = and p = q = 2, we get from equation (54), P [ | gen µ ( W (cid:48) , S (cid:48) ) | ≥ η ] ≤ − n (cid:16) η − σ D ( µ (cid:48) (cid:107) µ ) (cid:17) − σ D ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) )3 σ . Lemma 4.
For < α, p, q < ∞ with p + q = 1 , we have D α ( P W (cid:48) S (cid:48) || P W (cid:48) P S ) ≤ D α − p ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) + nD α − q ( µ (cid:48) (cid:107) µ ) . Proof.
We use Holder’s inequality for 1 < p, q < ∞ in following inequality:exp (( α − D α ( P W (cid:48) S (cid:48) || P W (cid:48) P S )) = (cid:90) (cid:18) dP W (cid:48) S (cid:48) dP W (cid:48) dP S (cid:19) α − dP W (cid:48) S (cid:48) (cid:90) (cid:18) dP W (cid:48) S (cid:48) dP W (cid:48) dP (cid:48) S dP (cid:48) S dP S (cid:19) α − dP W (cid:48) S (cid:48) ≤ (cid:32)(cid:90) (cid:18) dP W (cid:48) S (cid:48) dP W (cid:48) dP (cid:48) S (cid:19) p ( α − dP W (cid:48) S (cid:48) (cid:33) p (cid:32)(cid:90) (cid:18) dP (cid:48) S dP S (cid:19) q ( α − dP W (cid:48) S (cid:48) (cid:33) q = exp (cid:0) ( α − D p ( α − ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) (cid:1) exp (cid:0) ( α − D q ( α − ( P S (cid:48) (cid:107) P S ) (cid:1) . Then we get, D α ( P W (cid:48) S (cid:48) || P W (cid:48) P S ) ≤ D p ( α − ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) + nD q ( α − ( µ (cid:48) (cid:107) µ ) . Proof.
Let w = arg min w ∈W L µ ( w ). Let A ∗ be an algorithm that outputs w regardless of thetraining data sequence. Then, using Theorem 8 with A ∗ we obtain L S (cid:48) ( w ) − L µ ( w ) ≤ − δ (cid:115) σ D ( µ (cid:48) (cid:107) µ ) + 2 σ log (cid:0) δ (cid:1) n . (55)On the other hand, from the definition of the ERM algorithm we have L S (cid:48) ( A ERM ( S (cid:48) )) ≤ L S (cid:48) ( w ) . (56)Since L µ ( w ) = min w ∈W L µ ( w ), it follows that L S (cid:48) ( A ERM ( S (cid:48) )) ≤ − δ min w ∈W L µ ( w ) + (cid:115) σ D ( µ (cid:48) (cid:107) µ ) + 2 σ log (cid:0) δ (cid:1) n . (57)Then using Theorem 8, L µ ( A ERM ( S (cid:48) )) − min w ∈W L µ ( w ) = L µ ( A ERM ( S (cid:48) )) − L S (cid:48) ( A ERM ( S (cid:48) )) + L S (cid:48) ( A ERM ( S (cid:48) )) − min w ∈W L µ ( w ) ≤ − δ (cid:115) σ D ( µ (cid:48) (cid:107) µ ) + 2 σ (cid:2) D ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) + 2 log (cid:0) δ (cid:1)(cid:3) n + (cid:115) σ D ( µ (cid:48) (cid:107) µ ) + 2 σ log (cid:0) δ (cid:1) n . (58) Take some arbitrary µ ∈ P and µ (cid:48) ∈ P γ . It suffices to find a bound on the difference (cid:15) ( A , µ (cid:48) , n, δ ) − (cid:15) ( A , µ, n, δ ) that depends only on D ( µ (cid:48) (cid:107) µ ). From Lemma 2, given the train-ing data S (cid:48) = ( Z (cid:48) , · · · , Z (cid:48) n ) ∼ ( µ (cid:48) ) ⊗ n , we can define S = ( Z , · · · , Z n ) ∼ µ ⊗ n such that ( Z i , Z (cid:48) i )are i.i.d. for 1 ≤ i ≤ n and P [ Z i (cid:54) = Z (cid:48) i ] = (cid:107) µ − µ (cid:48) (cid:107) T V . (59)For the first upper bound (21), we write L µ (cid:48) ( A ( S (cid:48) )) = L µ (cid:48) ( A ( S (cid:48) ) − L µ (cid:48) ( A ( S )) + L µ ( A ( S )) + L µ (cid:48) ( A ( S )) − L µ ( A ( S ))20 a ) ≤ − δ n (cid:88) i =1 β i + min w ∈W L µ ( w ) + (cid:15) ( A , P , n, δ ) + L µ (cid:48) ( A ( S )) − L µ ( A ( S )) ( b ) ≤ n (cid:88) i =1 β i + min w ∈W L µ ( w ) + (cid:15) ( A , P , n, δ ) + (cid:112) σ γ, (60)where, (a) comes from the uniform stability condition and definition of (cid:15) ( A , P , n, δ ). Inequality(b) is derived using Lemma 1 as follows L µ (cid:48) ( A ( S )) − L µ ( A ( S )) = E µ (cid:48) [ (cid:96) ( A ( S ) , Z (cid:48) ) − E µ [ (cid:96) ( A ( S ) , Z )]] ≤ λ D ( µ (cid:48) (cid:107) µ ) + 1 λ log E µ (cid:104) e λ [ (cid:96) ( A ( S ) ,Z (cid:48) ) − E µ [ (cid:96) ( A ( S ) ,Z )]] (cid:105) ≤ λ D ( µ (cid:48) (cid:107) µ ) + λσ . Optimizing on λ and D ( µ (cid:48) (cid:107) µ ) ≤ γ , we get L µ (cid:48) ( A ( S )) − L µ ( A ( S )) ≤ (cid:112) σ γ. (61)Next, we give an upper bound for min w L µ ( w ). Let w = arg min w L µ ( w ) and w (cid:48) = arg min w L µ (cid:48) ( w ).A similar argument as above gives L µ ( w ) ≤ L µ ( w ) − L µ (cid:48) ( w (cid:48) ) + L µ (cid:48) ( w (cid:48) ) ≤ L µ ( w (cid:48) ) − L µ (cid:48) ( w (cid:48) ) + L µ (cid:48) ( w (cid:48) ) ≤ (cid:112) σ D ( µ (cid:48) (cid:107) µ ) + L µ (cid:48) ( w (cid:48) ) . (62)Using (60), we deduce L µ (cid:48) ( W ( S (cid:48) )) ≤ min w ∈W L µ (cid:48) ( w ) + (cid:15) ( A , P , n, δ ) + n (cid:88) i =1 β i + 2 (cid:112) σ γ. 
(63)This completes the proof for the first upper bound.For the second upper bound (22), the population risk of the learning algorithm with respectto µ (cid:48) by using Lemma 1 could be written as, λL µ (cid:48) ( A ( S (cid:48) )) = (cid:90) Z λ(cid:96) ( A ( S (cid:48) ) , z ) µ (cid:48) ( dz )= E Z ∼ µ (cid:48) [ λ(cid:96) ( A ( S (cid:48) ) , Z )] ≤ ln (cid:16) E Z ∼ µ (cid:2) exp (cid:0) λ(cid:96) ( A ( S (cid:48) ) , Z ) (cid:1)(cid:3) (cid:17) + D ( µ (cid:48) (cid:107) µ ) . (64)Note that both sides of (64) are random variables (and functions of S (cid:48) ) and Z is taken tobe independent of S (cid:48) .Considering the stability notion of algorithm from Definition 2, the following inequalityholds almost surely: (cid:12)(cid:12)(cid:12) (cid:96) ( A ( S (cid:48) ) , z ) − (cid:96) ( A ( S ) , z ) (cid:12)(cid:12)(cid:12) ≤ n (cid:88) i =1 β i ( n ) [ Z i (cid:54) = Z (cid:48) i ] , ∀ z ∈ Z . Therefore, if we take Z ∼ µ independent of ( S, S (cid:48) ), we deduce that (cid:12)(cid:12)(cid:12) (cid:96) ( A ( S (cid:48) ) , Z ) − (cid:96) ( A ( S ) , Z ) (cid:12)(cid:12)(cid:12) ≤ n (cid:88) i =1 β i ( n ) [ Z i (cid:54) = Z (cid:48) i ] . (65)21ext, we bound the random variable E Z ∼ µ (cid:2) exp (cid:0) λ(cid:96) ( A ( S (cid:48) ) , Z ) (cid:3) in (64) from above as follows: E Z ∼ µ (cid:2) exp (cid:0) λ(cid:96) ( A ( S (cid:48) ) , Z ) (cid:1)(cid:3) ( a ) ≤ E Z ∼ µ (cid:34) exp (cid:32) λ(cid:96) ( A ( S ) , Z ) + λ n (cid:88) i =1 β i [ Z i (cid:54) = Z (cid:48) i ] (cid:33)(cid:35) ( b ) ≤ exp (cid:16) λ E Z (cid:96) ( A ( S ) , Z ) + λ σ / λ n (cid:88) i =1 β i [ Z i (cid:54) = Z (cid:48) i ] (cid:17) ( c ) ≤ − δ exp (cid:16) λ min w L µ ( w ) + λ(cid:15) + λ σ / λ n (cid:88) i =1 β i [ Z i (cid:54) = Z (cid:48) i ] (cid:17) , (66)where ( a ) comes from (65), inequality ( b ) comes from the subgaussianity of (cid:96) ( A ( S ) , Z ) in termsof Z for any fixed S and ( c ) is derived from the definition of (cid:15) = (cid:15) ( A , P , n, δ ).Next, from Markov’s inequality we haveexp (cid:16) λ min w L µ ( w ) + λ(cid:15) + λ σ / λ n (cid:88) i =1 β i [ Z i (cid:54) = Z (cid:48) i ] (cid:17) ( d ) ≤ − δ (cid:48) δ (cid:48) E S,S (cid:48) exp (cid:32) λ min w L µ ( w ) + λ(cid:15) + λ σ / λ n (cid:88) i =1 β i [ Z i (cid:54) = Z (cid:48) i ] (cid:33) = 1 δ (cid:48) exp( λ σ /
Next, applying Markov's inequality to the right-hand side of (66), viewed as a random variable through $(S,S')$, with probability at least $1-\delta'$ we have
$$\exp\Big(\lambda\min_w L_\mu(w) + \lambda\epsilon + \frac{\lambda^2\sigma^2}{2} + \lambda\sum_{i=1}^n\beta_i\mathbb{1}[Z_i\neq Z'_i]\Big) \le \frac{1}{\delta'}\,\mathbb{E}_{S,S'}\exp\Big(\lambda\min_w L_\mu(w) + \lambda\epsilon + \frac{\lambda^2\sigma^2}{2} + \lambda\sum_{i=1}^n\beta_i\mathbb{1}[Z_i\neq Z'_i]\Big)$$
$$= \frac{1}{\delta'}\exp\Big(\frac{\lambda^2\sigma^2}{2}\Big)\exp\Big(\lambda\min_w L_\mu(w) + \lambda\epsilon\Big)\times\prod_{i=1}^n\Big(e^{\lambda\beta_i}\|\mu-\mu'\|_{TV} + 1 - \|\mu-\mu'\|_{TV}\Big), \quad (67)$$
where the last equality follows from (59) and the independence of the indicators $\mathbb{1}[Z_i\neq Z'_i]$.

Using (64), (66) and (67), we find the following upper bound on $L_{\mu'}(\mathcal{A}(S'))$ with probability at least $1-\delta-\delta'$:
$$L_{\mu'}(\mathcal{A}(S')) \le \min_w L_\mu(w) + \epsilon + \frac{1}{\lambda}\log(1/\delta') + \frac{\lambda\sigma^2}{2} + \frac{1}{\lambda}\sum_{i=1}^n\ln\Big(e^{\lambda\beta_i}\|\mu-\mu'\|_{TV} + 1 - \|\mu-\mu'\|_{TV}\Big) + \frac{1}{\lambda}D(\mu'\|\mu)$$
$$\le \min_w L_\mu(w) + \epsilon + \frac{1}{\lambda}\log(1/\delta') + \frac{\lambda\sigma^2}{2} + \frac{1}{\lambda}\sum_{i=1}^n\ln\Big(e^{\lambda\beta_i}\sqrt{\gamma/2} + 1 - \sqrt{\gamma/2}\Big) + \frac{\gamma}{\lambda},$$
where we used $D(\mu'\|\mu)\le\gamma$ together with Pinsker's inequality, $\|\mu-\mu'\|_{TV}\le\sqrt{D(\mu'\|\mu)/2}\le\sqrt{\gamma/2}$; note that the product in (67) is increasing in the total variation distance. The choice of
$$\lambda = g(\delta') = \sqrt{\frac{2\big[\log(1/\delta') + \gamma\big]}{\sigma^2}}$$
yields, with probability at least $1-\delta-\delta'$,
$$L_{\mu'}(\mathcal{A}(S')) \le \min_w L_\mu(w) + \epsilon(\mathcal{A},\mathcal{P},n,\delta) + \sqrt{2\sigma^2\big[\log(1/\delta') + \gamma\big]} + \frac{1}{g(\delta')}\sum_{i=1}^n\ln\Big(e^{g(\delta')\beta_i}\sqrt{\gamma/2} + 1 - \sqrt{\gamma/2}\Big).$$
Therefore, setting $\delta' = \delta/2$ and replacing $\delta$ by $\delta/2$ above, we get the following upper bound on $L_{\mu'}(\mathcal{A}(S'))$ with probability at least $1-\delta$:
$$L_{\mu'}(\mathcal{A}(S')) \le \min_w L_\mu(w) + \epsilon(\mathcal{A},\mathcal{P},n,\delta/2) + \sqrt{2\sigma^2\big[\log(2/\delta) + \gamma\big]} + \frac{1}{g(\delta/2)}\sum_{i=1}^n\ln\Big(e^{g(\delta/2)\beta_i}\sqrt{\gamma/2} + 1 - \sqrt{\gamma/2}\Big).$$
Finally, using (62) to replace $\min_w L_\mu(w)$ by $\min_w L_{\mu'}(w)$,
$$L_{\mu'}(\mathcal{A}(S')) \le \min_w L_{\mu'}(w) + \epsilon(\mathcal{A},\mathcal{P},n,\delta/2) + f(\delta),$$
where
$$f(\delta) \triangleq \sqrt{2\sigma^2\gamma} + \sqrt{2\sigma^2\big[\log(2/\delta) + \gamma\big]} + \frac{1}{g(\delta/2)}\sum_{i=1}^n\ln\Big(e^{g(\delta/2)\beta_i}\sqrt{\gamma/2} + 1 - \sqrt{\gamma/2}\Big).$$
This completes the proof.
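Remark (numerical sanity check). Two steps above lend themselves to a quick numerical verification: the factorization in (67), which holds because the disagreement indicators are i.i.d. Bernoulli with parameter $\|\mu-\mu'\|_{TV}$, and the optimality of the choice $\lambda = g(\delta')$. The Python sketch below uses illustrative parameter values only.

```python
import numpy as np

# (i) The equality in (67): for independent B_i ~ Bernoulli(p) with
#     p = ||mu - mu'||_TV, E[exp(lam * sum_i beta_i * B_i)] factorizes
#     into prod_i (exp(lam * beta_i) * p + 1 - p).
rng = np.random.default_rng(2)
n, p, lam = 5, 0.1, 1.5
beta = rng.uniform(0.01, 0.1, size=n)     # stability coefficients beta_i
B = rng.random((500000, n)) < p           # i.i.d. disagreement indicators
mc = np.mean(np.exp(lam * (B * beta).sum(axis=1)))
closed = np.prod(np.exp(lam * beta) * p + 1 - p)
print(mc, closed)                         # Monte Carlo estimate matches the product

# (ii) The choice lam = g(delta') = sqrt(2[log(1/delta') + gamma] / sigma^2)
#      minimizes (log(1/delta') + gamma)/lam + lam * sigma^2 / 2, with
#      optimal value sqrt(2 sigma^2 [log(1/delta') + gamma]).
sigma, gamma, delta_p = 1.0, 0.05, 0.05
c = np.log(1 / delta_p) + gamma
lams = np.linspace(0.1, 10, 2000)
vals = c / lams + lams * sigma**2 / 2
print(vals.min(), np.sqrt(2 * sigma**2 * c), lams[vals.argmin()],
      np.sqrt(2 * c / sigma**2))          # minimum and argmin match the closed forms
```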
It is clear from their definitions that $\tilde{D}(r/n) \le D(r/n)$. By the definition of $v_n$, for any arbitrary $p_{W'|S'}$ with $S' = (Z'_1, Z'_2, \dots, Z'_n)$ we have
$$\mathbb{E}\Big[\sum_{i=1}^n \frac{1}{n}\tilde{\ell}(W', Z'_i)\Big] \ge v_n.$$
It follows that
$$D(r) = \max_{P_{W'|S'}:\; I(W';S')\le r} \mathbb{E}\big[L_\mu(W') - L_{S'}(W')\big] = \max_{P_{W'|S'}:\; I(W';S')\le r,\ \mathbb{E}[\sum_{i=1}^n \frac{1}{n}\tilde{\ell}(W',Z'_i)]\ge v_n} \mathbb{E}\big[L_\mu(W') - L_{S'}(W')\big],$$
since the added constraint is satisfied by every $p_{W'|S'}$ and hence does not change the value of the maximum. A similar argument as in (31) shows that for any arbitrary $p_{W'|S'}$ we have
$$I(W';S') \ge \sum_{i=1}^n I(W';Z'_i).$$
Thus,
$$D(r) \le \max_{P_{W'|S'}:\; \frac{1}{n}\sum_{i=1}^n I(W';Z'_i)\le \frac{r}{n},\ \frac{1}{n}\sum_{i=1}^n \mathbb{E}[\tilde{\ell}(W',Z'_i)]\ge v_n}\ \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[L_\mu(W') - \ell(W',Z'_i)\big].$$
Take some arbitrary $p_{W'|S'}$ and a time-sharing random variable $Q$ uniform on $\{1,2,\dots,n\}$, independent of all previously defined variables. Note that
$$I(W';Z'_Q) \le I(Q,W';Z'_Q) = I(Q;Z'_Q) + I(W';Z'_Q\,|\,Q) = I(W';Z'_Q\,|\,Q) = \frac{1}{n}\sum_{i=1}^n I(W';Z'_i) \le \frac{r}{n}, \quad (69)$$
where the middle equality in (69) follows from the fact that the $Z'_i$'s are i.i.d., so $Z'_Q \sim \mu'$ is independent of $Q$ and $I(Q;Z'_Q) = 0$. We also have
$$\mathbb{E}\big[\tilde{\ell}(W',Z'_Q)\big] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\tilde{\ell}(W',Z'_i)\big] \ge v_n.$$
Thus, the joint distribution $p_{W',Z'_Q}$ satisfies the constraints of $\tilde{D}(r/n)$. Moreover, $Z'_Q \sim \mu'$ and
$$\mathbb{E}\big[L_\mu(W') - \ell(W',Z'_Q)\big] = \frac{1}{n}\,\mathbb{E}\sum_{i=1}^n \big[L_\mu(W') - \ell(W',Z'_i)\big].$$
Thus, we deduce that $D(r) \le \tilde{D}(r/n)$, as desired.
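Remark (numerical sanity check). The time-sharing step (69) can be illustrated on a toy example with $n = 2$ binary samples, $Z'_i$ i.i.d. Bernoulli$(1/2)$, and the deterministic choice $W' = Z'_1$. The Python sketch below computes the relevant mutual information terms from explicit joint pmfs; the example and the helper function are illustrative only.

```python
import numpy as np

def mutual_information(joint):
    # I(X;Y) in nats for a joint pmf given as a 2-D array.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask]))

n = 2
# Joint pmf over (W', Z'_1, Z'_2) with Z'_i i.i.d. Bernoulli(1/2) and W' = Z'_1.
p = np.zeros((2, 2, 2))
for z1 in range(2):
    for z2 in range(2):
        p[z1, z1, z2] = 0.25

I_w_z1 = mutual_information(p.sum(axis=2))   # I(W'; Z'_1) = log 2
I_w_z2 = mutual_information(p.sum(axis=1))   # I(W'; Z'_2) = 0

# Joint pmf of (W', Z'_Q) with Q uniform on {1, 2}, independent of the rest.
p_w_zq = 0.5 * p.sum(axis=2) + 0.5 * p.sum(axis=1)
I_w_zq = mutual_information(p_w_zq)

avg = (I_w_z1 + I_w_z2) / n                  # = I(W'; Z'_Q | Q)
print(I_w_zq <= avg + 1e-12, I_w_zq, avg)    # True, ~0.131 <= ~0.347 (nats)
```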