Learning under Distribution Mismatch and Model Misspecification
Mohammad Saeed Masiha, Amin Gohari, Mohammad Hossein Yassaee, Mohammad Reza Aref
LLearning under Distribution Mismatch and ModelMisspecification
Mohammad Saeed Masiha Amin Gohari Mohammad Hossein YassaeeMohammad Reza ArefFebruary 24, 2021
Abstract
We study learning algorithms when there is a mismatch between the distributions ofthe training and test datasets of a learning algorithm. The effect of this mismatch on thegeneralization error and model misspecification are quantified. Moreover, we provide aconnection between the generalization error and the rate-distortion theory, which allowsone to utilize bounds from the rate-distortion theory to derive new bounds on the gen-eralization error and vice versa. In particular, the rate-distortion based bound strictlyimproves over the earlier bound by Xu and Raginsky even when there is no mismatch.We also discuss how “auxiliary loss functions” can be utilized to obtain upper bounds onthe generalization error.
In a learning algorithm, a distribution mismatch occurs when the training dataset and the testdataset are not drawn from the same distribution. This mismatch might also occur if trainingdata is corrupted or if the statistical distribution of the data changes from training to testing.For example, suppose that (in the Covid era) a pharmaceutical company located in region Rhas developed a drug for Covid-19 (in statistical terms, the company has tuned the parametersof a process that describes how to mix different chemicals to make a drug). Clinical experimentsshow high effectiveness (say 95%) of this treatment for the population that resides in regionR. There is an urgent need for the drug and the company lacks time to test the medicineon other populations with possibly different genetic backgrounds (in statistical terms, with adifferent distribution from the distribution of the population in region R). Hence it is requiredto have some guarantee on how the effectiveness of treatment for the population R generalizesto other populations. As another example, in federated learning, a centralized model is trainedbased on chunks of training data originating from a number of clients, which may be mobilephones, other mobile devices, or sensors. While the training data may come from only a limitednumber of clients, statistical guarantees on the learning algorithm should be expressed in termsof testing on a population-averaged model of all client distributions, which might be differentfrom the training distribution.Distribution mismatch can manifest itself in different ways: consider a data scientist in acompany who is given access to a training dataset and asked to make a recommendation abouta decision for the company. The training dataset is corrupted and its distribution slightly differsfrom that of the test data. The data scientist might run a learning algorithm A and utilizeits output on the training data to make a recommendation. In the first part of this paper,we study the effect of distribution mismatch on the generalization error of algorithm A . Next,1 a r X i v : . [ c s . I T ] F e b ssume that the company’s manager impresses upon the data scientist the importance of thedecision for the company and asks about his confidence level about his recommendation. Toaddress this question, the data scientist needs to come up with a mathematical model for thedata and give guarantees based on that model. For instance, the data scientist might choosethe parametric class of Gaussian distributions, partly based on the training data histograms(many methods to find a family of distributions for data samples are data-driven). Since thetraining data is corrupted, this process could lead to model misspecification. In the second partof this paper, we study how model misspecification affects theoretical guarantees of a learningalgorithm. Generalization error under distribution mismatch:
Distribution mismatch is thesubject of previous studies in transfer learning or domain adaptation [1–5]. In the first partof this paper, we provide information-theoretic bounds on the generalization error under adistribution mismatch. Designing algorithms with low generalization error is a key challengein machine learning. It is known that under certain assumptions, the generalization error ofa learning algorithm can be bounded from above in terms of the mutual information betweenthe input and output of the algorithm [6, 7] (see also [8–17] for various generalizations andextensions using other measures of dependence). These works assume that the test data aredrawn from the same distribution as the training data. Herein, we provide bounds on thegeneralization error of the learning algorithm assuming a bound on the KL divergence betweenthe test and training distributions as well as a bound on the mutual information between theinput and output of the learning algorithm. One of our bounds is based on (to the best ofour knowledge) a novel connection between generalization error and the rate-distortion theory.When specialized to the case of no-mismatch, this bound strictly improves over the bound in [7](see Corollary 1 and Figure 1).A question that we also address in this section is as follows: in case of having no mismatchbetween the training and test distributions, having more training data samples leads to in-creasingly better estimates of the unknown distribution of the data. On the other hand, in caseof a mismatch, increasing the number of training samples can only provide more informationabout the training distribution. In the limit of the number of samples going to infinity, wewill perfectly learn the training distribution but will still have a residual ambiguity about thetest distribution: we will only know that the test distribution is at a certain KL distance fromthe training distribution. If we are in a regime where the error is dominated by this residualambiguity in the test distribution, the value of training samples gradually depreciates as wegather more samples. Subsequently, we might have insufficient incentive to gather more train-ing samples. This shows that there is an “optimal” number of samples associated with ourproblem. To the best of our knowledge, this question has not been addressed in the literatureso far. We address the above question as follows: In Corollary 1, we provide an upper boundof generalization error in terms of γ + r/n where γ is the KL-divergence between the train-ing and test distributions, r is the mutual information between the input and output of thelearning algorithm and n is the sample size. If γ > n (or small values of r ), the term γ becomes the dominant term, and theeffect of r/n vanishes in the upper bound. This happens when r/n is of the same order as γ .For a fixed sample size n and γ , it suffices to work with algorithms that have input-outputmutual information r satisfying r ≈ nγ . In other words, since the training data is drawn froma different distribution than the test data, limited overfitting will not affect the generaliza-tion error. Next, we give a lower bound on the generalization error in Corollary 2 underdistribution mismatch. Similar to the upper bound, this lower bound on the generalizationerror involves the summation of two terms. The first term is a constant (bounded from aboveby the KL-divergence between the training and test distribution, e.g. 
see (61)) and anotherterm (depending on the input-output mutual information of the algorithm) and vanishing in n .2inally, we also consider the performance of the ERM algorithm under distribution mismatchin Theorem 9. We present an upper bound on excess risk. Increasing the number of samplesdoes not make the upper bound vanish and we get a constant upper bound (due to distributionmismatch) when the number of samples tends to infinity. Model misspecification:
A learning algorithm has access to a training dataset that isdrawn from an unknown distribution. This unknown distribution is commonly assumed tobelong to a known family of distributions P . A learning “model” provides a description for thefamily P , and a learning algorithm is required to have good performance when the data is drawnfrom any arbitrary distribution belonging to P . We say that model misspecification occurs whenthe data distribution does not belong to P . The amount of misspecification may be measuredby the minimum KL-divergence from the true distribution to the family of distributions in class P [18]. Model misspecification is a key consideration in statistics [18, 19]. For instance, [19]shows that Bayesian methods are not optimal for learning predictive models unless the modelclass is perfectly specified. In the second part of the paper, we fix a uniformly stable learningalgorithm A and assume a notion of sample complexity for the class P . Then we bound thesample complexity under a distribution µ (cid:48) / ∈ P based on the minimum KL-divergence of µ (cid:48) from the family P . Organization:
The rest of this paper is organized as follows. The paper splits into twoparts: section 2 gives our results on generalization error while Section 3 is dedicated to modelmisspecification. In Section 2.1 we formally define learning with mismatched (training and testdata) distributions. Section 2.2 provides a connection between the rate-distortion theory andthe generalization error, along with upper and lower bounds on the generalization error. Theperformance of the ERM algorithm on the training data when there is a distribution mismatchis also studied. Section 3 studies model mismatch for the class of uniformly-stable algorithms.Finally, Section 4 discusses some ideas to improve the upper bounds on the generalization errorgiven in Section 2.2.
Notation and preliminaries:
Random variables are shown in capital letters, whereastheir realizations are shown in lowercase letters. We show sets with calligraphic font. For arandom variable X generated from a distribution µ , we use E X ∼ µ to denote the expectationtaken over X with distribution µ and P X means the distribution over X . We use D ( µ (cid:107) ν ) and D α ( µ (cid:107) ν ) = α − log (cid:82) (cid:0) dµdν (cid:1) α dν ( x ) to denote the KL divergence and the Renyi divergence oforder α respectively. In particular, we have D ( µ (cid:107) ν ) = log (1 + χ ( µ (cid:107) ν )) where χ -divergenceis defined as χ ( µ (cid:107) ν ) = E ν (cid:0) dµdν − (cid:1) . Given two random variables X and Y , we use theshorthand X ≤ − δ Y to denote P [ X ≤ Y ] ≥ − δ . Observe that X ≤ − δ Y and Y ≤ − δ Z implies X ≤ − δ − δ Z by the union bound.We write g ( n ) ∼ Ω( f ( n )) when g ( n ) ≥ c · f ( n ) for large enough n .The concept of subgaussianity is defined as follows: Definition 1.
The random variable X is said to be sub-Gaussian with parameter σ if ∀ s ∈ RE [ e s ( X − E [ X ]) ] ≤ e σ s . (1) Using the Chernoff ’s bound, we obtain, P ( | X − E X | > t ) ≤ e − t σ . (2)The following lemma relates the expectation of a measurable function over two differentdistributions: Lemma 1 (Donsker-Varadhan) . Let X be a sample space and let P be a distribution on X .Let Q be a distribution on X with the support which is a subset of the P support. Then for anymeasurable function φ : X → R with respect to P , we have ln (cid:0) E P [ e φ ( X ) ] (cid:1) ≥ E Q [ φ ( X )] − D ( Q (cid:107) P ) . emma 2. [Coupling] Given the marginal distributions µ and µ (cid:48) on Z , one can find a coupling π ( z, z (cid:48) ) on ( z, z (cid:48) ) ∈ Z × Z such that ( Z, Z (cid:48) ) ∼ π satisfy Z ∼ µ , Z (cid:48) ∼ µ (cid:48) and P π [ Z (cid:54) = Z (cid:48) ] = (cid:107) µ − µ (cid:48) (cid:107) T V where (cid:107) µ − µ (cid:48) (cid:107) T V is defined as (cid:107) µ − µ (cid:48) (cid:107) T V = sup A ∈Z [ µ ( A ) − µ (cid:48) ( A )] . Consider an instance space Z , a hypothesis space W and a non-negative loss function (cid:96) : W × Z → R + . Assume that the test and training samples are produced (in an i.i.d. fashion)from two unknown distributions µ and µ (cid:48) on Z respectively. A training dataset of size n isshown by the n -tuple, S (cid:48) = ( Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n ) ∈ Z n of i.i.d. random elements according to anunknown distribution µ (cid:48) . A learning algorithm is characterized by a probabilistic mapping A ( · )(a Markov Kernel) that maps training data S (cid:48) to the random variable W (cid:48) = A ( S (cid:48) ) ∈ W asthe output hypothesis. The population risk of a hypothesis w ∈ W is computed on the testdistribution µ as follows: L µ ( w ) (cid:44) E µ [ (cid:96) ( w, Z )] = (cid:90) Z (cid:96) ( w, z ) µ ( dz ) , ∀ w ∈ W . (3)The goal of learning is to ensure that under any data generating distribution µ , the populationrisk of the output hypothesis W (cid:48) is small, either in expectation or with high probability. Since µ and µ (cid:48) are unknown, the learning algorithm cannot directly compute L µ ( w ) for any w ∈ W ,but can compute the empirical risk of w on the training dataset S (cid:48) as an approximation, whichis defined as L S (cid:48) ( w ) (cid:44) n n (cid:88) i =1 (cid:96) ( w, Z (cid:48) i ) . (4)The true objective of the learning algorithm, L µ ( W (cid:48) ), is unknown to the learning algorithmwhile the empirical risk L S (cid:48) ( W (cid:48) ) is known. The generalization gap is defined as the differencebetween these two quantities as [3, 4]gen µ ( W (cid:48) , S (cid:48) ) = L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) ) , (5)where W (cid:48) = A ( S (cid:48) ) is the output of the algorithm A on the input S (cid:48) ∼ ( µ (cid:48) ) ⊗ n . In commonalgorithms such as empirical risk minimization (ERM) and gradient descent, L S (cid:48) ( W (cid:48) ) is min-imized [20, 21]. Therefore, to control L µ ( W (cid:48) ) we need to bound gen µ ( W (cid:48) , S (cid:48) ) from above (inexpectation or with high probability). Observe that gen µ ( W (cid:48) , S (cid:48) ), as defined in (5), is a ran-dom variable and a function of ( S (cid:48) , W (cid:48) ). The generalization error is the expected value ofgen µ ( W (cid:48) , S (cid:48) ): gen ( µ, µ (cid:48) , A ) = E [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] . 
(6)When there is no-mismatch, i.e., µ = µ (cid:48) , we denote the generalization error by gen ( µ, A ) forsimplicity. The following upper bound on the generalization error is given in [7] (see also [6]):4 heorem 1 ( [7]) . Assume that there is no distribution mismatch, i.e., µ (cid:48) = µ . Suppose (cid:96) ( w, Z ) is σ -subgaussian under Z ∼ µ for all w ∈ W . Take an arbitrary algorithm A that runs on atraining dataset S (cid:48) . Then the generalization error is bounded as gen( µ, A ) ≤ (cid:114) σ n I (cid:0) S (cid:48) ; A ( S (cid:48) ) (cid:1) . Let us write the sharpest possible bound on the generalization error given an upper bound r on I ( S (cid:48) ; A ( S (cid:48) )): D ( r ) (cid:44) sup P W (cid:48)| S (cid:48) : I ( W (cid:48) ; S (cid:48) ) ≤ r E [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] (7)where the supremum in (7) is over all Markov kernels P W (cid:48) | S (cid:48) with a bounded input/outputmutual information and S (cid:48) ∼ ( µ (cid:48) ) ⊗ n . We claim that D ( r ) is related to the rate-distortionfunction. To see this, consider a rate-distortion problem where the input symbol space is S ,the reproduction space is W and the following distortion function between a symbol w and aninput symbol s is used: ∆( w, s ) = L s ( w ) − L µ ( w ) . With this definition, from (7), we obtain − D ( r ) = inf P W (cid:48)| S (cid:48) : I ( W (cid:48) ; S (cid:48) ) ≤ r E [∆( W (cid:48) , S (cid:48) )] (8)which is in the rate-distortion form.With D ( r ) defined as in (7), it follows that for any arbitrary algorithm A with I (cid:0) S (cid:48) ; A ( S (cid:48) ) (cid:1) ≤ r we have gen ( µ, µ (cid:48) , A ) ≤ D ( r ) . This upper bound does not require any subgaussianity assumption on the loss function. Fromthis viewpoint, Theorem 1 is just a convenient and explicit lower bound on a rate-distortionfunction under an extra assumption on the loss function (for the no distribution mismatchcase). We formalize this intuition in Theorem 3.Computing the upper bound D ( r ) is a convex optimization problem and there are efficientalgorithms for computing it [22]. However, computation of the bound can be practically difficultif the sample size n is large. The following theorem provides a computable upper bound thatrequires running an optimization when the sample size is just one. Theorem 2.
For any arbitrary loss function (cid:96) ( w, z ) , and algorithm A that runs on a trainingdataset S (cid:48) of size n , we have gen ( µ, µ (cid:48) , A ) ≤ D (cid:18) I ( S (cid:48) ; A ( S (cid:48) )) n (cid:19) where D ( r ) (cid:44) max P ˆ W | Z (cid:48) : I ( ˆ W ; Z (cid:48) ) ≤ r E (cid:104) L µ ( ˆ W ) − (cid:96) ( ˆ W , Z (cid:48) ) (cid:105) (9) where Z (cid:48) ∈ Z is distributed according to µ (cid:48) . Furthermore, to compute the maximum in (9) , itsuffices to compute the maximum over all conditional distributions P ˆ W | Z for ˆ W ∈ W such thatthe support of ˆ W can be chosen of size at most |Z| + 1 . While the literature commonly takes the reproduction space to be the same as the input symbol space, therate-distortion theory does not formally require that. |Z| < ∞ ) and does not requireany subguassianity assumption on the loss function, the bound in Theorem 1 is in a veryexplicit form. Moreover, the bound in Theorem 1 (for the case of no-mismatch) dependsonly on mutual information I (cid:0) S (cid:48) ; A ( S (cid:48) ) (cid:1) while the bound in Theorem 2 depends on µ , µ (cid:48) and I (cid:0) S (cid:48) ; A ( S (cid:48) ) (cid:1) . However, one can obtain a bound from Theorem 2 that does not depend on µ , µ (cid:48) by maximizing the bound in Theorem 2 over all distributions µ and µ (cid:48) such that D ( µ (cid:48) (cid:107) µ ) ≤ γ for some γ >
0. We show that even after this maximization, the bounds in Theorem 2 is stillan improvement over Theorem 1. To show this, we need to prove that the bound in Theorem2 is always less than or equal to the bound in Theorem 1 for any arbitrary µ and µ (cid:48) satisfying D ( µ (cid:48) (cid:107) µ ) ≤ γ and the subgaussianity assumption on the loss function. Below, we give a generalresult for the rate-distortion function and deduce the relation between the bounds in Theorem1 and Theorem 2 as a corollary to it. Theorem 3.
Consider a generic rate-distortion problem for X ∼ ζ and a distortion function d ( x, ˆ x ) ∈ R . Let φ ( · ) be a function defined on ( − b, for some b ∈ (0 , ∞ ] as follows: φ ( λ ) = sup ˆ x log E η (cid:2) e λd ( X, ˆ x ) (cid:3) , for some distribution η on X (possibly different from ζ ). Then, inf P ˆ X | X : I ( ˆ X ; X ) ≤ r E (cid:104) d ( X, ˆ X ) (cid:105) ≥ sup − b<λ< (cid:26) λ [ r + D ( ζ X (cid:107) η X )] + 1 λ φ ( λ ) (cid:27) . (10)Proof of Theorem 3 is in Section 6.2.We apply the above theorem to obtain an upper bound on the bound given in Theorem 2as follows: let X = Z (cid:48) ∼ µ (cid:48) , ˆ X = ˆ W and d ( z (cid:48) , ˆ w ) = − [ L µ ( ˆ w ) − (cid:96) ( ˆ w, z (cid:48) )]. Corollary 1.
Suppose that (cid:96) ( w, Z ) is σ -subgaussian for every w ∈ W under the distribu-tion µ on Z . Take an arbitrary algorithm A that runs on a training dataset S (cid:48) . Then when I (cid:0) S (cid:48) ; A ( S (cid:48) ) (cid:1) ≤ r and D ( µ (cid:48) (cid:107) µ ) ≤ γ for some r, γ ≥ , then D ( r ) ≤ (cid:114) σ γ + 2 σ n r. (11) Remark 1.
Under the assumptions of Corollary 1, we deduce that gen ( µ, µ (cid:48) , A ) ≤ (cid:114) σ γ + 2 σ n r. (12) This generalizes the bound in Theorem 1 to the case of having mismatch.
Example 1.
Let W = Z = { , } and consider a learning problem on a data set S (cid:48) with thesize n = 1 with loss function (cid:96) ( w, z ) = w · z . Figure 1 depicts the bound in Theorem 1 versusthe maximum of the bound in Theorem 2 over all distributions µ on { , } for the case of no-mismatch for a particular loss function. Note that the distortion function itself depends on thechoice of µ and this makes it difficult to find a closed form expression for the maximum of thebound in Theorem 2 over all distributions µ . Example 2.
Let W = [0 , , Z = { , } and consider a learning problem on a data set S (cid:48) withthe size n = 1 with loss function (cid:96) ( w, z ) = | w − z | . Figure 2 depicts the bound in Theorem 1versus the maximum of the bound in Theorem 2 over all distributions µ on { , } for the caseof no-mismatch for a particular loss function. µ , assuming no distribution mismatch, W = Z = { , } and (cid:96) ( w, z ) = w · z .Figure 2: The bound in Theorem 1 versus the maximum of the upper bound in Theorem2 over all distributions µ , assuming no distribution mismatch, W = [0 , , Z = { , } and (cid:96) ( w, z ) = | w − z | . 7 .2.1 An improved upper bound In [8], a strengthened version of Theorem 1 is given as follows:
Theorem 4 ( [8]) . Suppose that the loss function (cid:96) ( w, Z ) is σ -subgaussian under the distri-bution µ on Z for any w ∈ W . For µ (cid:48) = µ , we have: gen( µ, µ (cid:48) , A ) ≤ n n (cid:88) i =1 (cid:113) σ I (cid:0) Z (cid:48) i ; A ( S (cid:48) ) (cid:1) . (13)The following variant of Theorem 4 holds for the case with distribution mismatch: Theorem 5.
For any arbitrary loss function (cid:96) ( w, z ) , and algorithm A that runs on a trainingdataset S (cid:48) = ( Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n ) of size n , we have gen ( µ, µ (cid:48) , A ) ≤ n n (cid:88) i =1 D ( I ( Z (cid:48) i ; A ( S (cid:48) ))) where D ( r ) is given in (9) . Moreover, if the loss function (cid:96) ( w, Z ) under µ is σ -subgaussianfor all w ∈ W , we further have gen ( µ, µ (cid:48) , A ) ≤ n n (cid:88) i =1 D ( I ( Z (cid:48) i ; A ( S (cid:48) ))) ≤ n n (cid:88) i =1 (cid:113) σ (cid:2) I (cid:0) Z (cid:48) i ; A ( S (cid:48) ) (cid:1) + D ( µ (cid:48) (cid:107) µ ) (cid:3) . Proof of Theorem 5 is in Section 6.3.
Next, we consider lower bounds on the generalization error. Similar to (7), the following lowerbound on the generalization error given an upper bound r on I ( S (cid:48) ; A ( S (cid:48) )) can be written:inf P W (cid:48)| S (cid:48) : I ( W (cid:48) ; S (cid:48) ) ≤ r E [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] (14)where the infimum in (14) is over all Markov kernels P W (cid:48) | S (cid:48) with a bounded input/outputmutual information and S (cid:48) ∼ ( µ (cid:48) ) ⊗ n . However, the bound in this form may not be useful.To see this, assume that µ = µ (cid:48) . One possible choice for W (cid:48) in (14) is a constant randomvariable. For this choice, I ( W (cid:48) ; S (cid:48) ) = 0 ≤ r and the bound in (14) vanishes. It follows that D ( r ) ≤
0. However, we are interested in a lower bound on the generalization error in termsof the population risk. In order to prevent W (cid:48) from being a constant random variable, weattempt to find a lower bound on the generalization error in terms of both I ( S (cid:48) ; A ( S (cid:48) )) and anassumption about the marginal distribution of the output of the algorithm A ( S (cid:48) ). In particular,we assume that I ( S (cid:48) ; A ( S (cid:48) )) ≤ r and A ( S (cid:48) ) ∼ p W (cid:48) ∈ M for a family M of distributions on W .We aim to find a lower bound on gen ( µ, µ (cid:48) , A ) that depends on both r and M . The sharpestsuch bound is D (cid:0) r, M (cid:1) = inf P W (cid:48) ∈M inf P W (cid:48) ,S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ): I ( W (cid:48) ; S (cid:48) ) ≤ r E [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] (15)where U ( P W (cid:48) , P S (cid:48) ) is the set of all couplings of two marginal distribution P W (cid:48) and P S (cid:48) on W ×S .When r = 0, the set U ( P W (cid:48) , P S (cid:48) ) includes only the product distribution P W (cid:48) P S (cid:48) and D (cid:0) , M (cid:1) can be computed explicitly. The following theorem gives an explicit lower bound on the gener-alization error when r >
0: 8 heorem 6.
Let ψ ( λ ) be a function satisfying ψ ( λ ) ≥ sup ν ∈M E ν (cid:2) e λ [ (cid:96) ( W,z ) − E ν [ (cid:96) ( W,z )]] (cid:3) , ∀ z ∈ Z . Then, we have: D (cid:0) , M (cid:1) ≥ D (cid:0) r, M (cid:1) ≥ D (cid:0) , M (cid:1) − inf λ ≥ [ λr − λ ( ψ (1 /nλ ) n − . Corollary 2.
Suppose that (cid:96) ( W (cid:48) , z ) is α -subgaussian under any P W (cid:48) ∈ M for all z ∈ Z .Considering the special choice of λ = α/ √ nr , we deduce D (cid:0) , P W (cid:48) (cid:1) ≥ D (cid:0) r, P W (cid:48) (cid:1) ≥ D (cid:0) , P W (cid:48) (cid:1) − √ n (cid:20) α √ r √ α √ r ( e r − (cid:21) . Therefore, gen ( µ, µ (cid:48) , A ) ≥ D (cid:0) , P W (cid:48) (cid:1) − √ n (cid:34) α (cid:112) I ( S (cid:48) ; A ( S (cid:48) )) √ α (cid:112) I ( S (cid:48) ; A ( S (cid:48) )) ( e I ( S (cid:48) ; A ( S (cid:48) )) − (cid:35) . Proof of the above theorem can be found in Section 6.4. The following theorem givesanother lower bound on the generalization error which can be compared with the upper boundin Theorem 2:
Theorem 7.
For any arbitrary loss function (cid:96) ( w, z ) , and algorithm A that runs on a trainingdataset S (cid:48) of size n and induces a marginal distribution on A ( S (cid:48) ) in M , we have gen ( µ, µ (cid:48) , A ) ≥ D (cid:18) I ( S (cid:48) ; A ( S (cid:48) ) n , M (cid:19) where D ( r, M ) (cid:44) inf P W (cid:48) ∈M min P ˆ W,Z (cid:48) ∈ U ( P W (cid:48) ,µ (cid:48) ): I ( ˆ W ; Z (cid:48) ) ≤ r E (cid:104) L µ ( ˆ W ) − (cid:96) ( ˆ W , Z (cid:48) ) (cid:105) . (16)The proof is given in the Section 6.5. High probability guarantees:
Just as the excess distortion probability of a rate-distortioncode has been subject of many studies in information theory (see [23, 24] for two examples), anumber of “high probability” upper bounds on the generalization gap are also reported in theliterature. Here the problem is to find an upper bound on P [gen µ ( W (cid:48) , S (cid:48) ) ≥ η ] = P [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) ) ≥ η ]for some given η .The following bound is a generalization of a bound in [15] to include distribution mismatch.Our method for deriving this inequality is different from the one used in [15], and similar tothe one used in [11]. Theorem 8.
Take some algorithm A that runs on a training dataset S (cid:48) and produces an outputhypothesis W (cid:48) = A ( S (cid:48) ) . Let (cid:96) ( w, Z ) be a loss function which is σ -subgaussian under thedistribution µ on Z for all w ∈ W . Then, we have P [ | gen µ ( W (cid:48) , S (cid:48) ) | ≥ η ] ≤ − n (cid:16) η − σ D ( µ (cid:48) (cid:107) µ ) (cid:17) − σ D ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) )3 σ . (17)9roof of the above theorem is given in Section 6.6. Performance of the ERM algorithm:
As an application of Theorem 8, let us considerthe ERM algorithm which is defined as follows: W ERM ( S (cid:48) ) = arg min w ∈W L S (cid:48) ( w ) . (18)Then, we claim the following upper bound on excess risk of the ERM algorithm: Theorem 9.
Let (cid:96) ( w, Z ) be σ -subgaussian under the distribution µ on Z for every w . Considerthe ERM learning algorithm A ERM as defined in (18) . Then, with probability of at least − δ , L µ ( A ERM ( S (cid:48) )) ≤ min w ∈W L µ ( w ) + (cid:115) σ D ( µ (cid:48) (cid:107) µ ) + 2 σ log (cid:0) δ (cid:1) n + (cid:115) σ D ( µ (cid:48) (cid:107) µ ) + 2 σ (cid:2) D ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) + 3 log (cid:0) δ (cid:1)(cid:3) n . (19)Proof of the above theorem can be found in Section 6.7. Take an algorithm A along with a sample-complexity guarantee for a family of distributions P , i.e., the model has specified the class P . Given δ, (cid:15) >
0, sample complexity is defined as n ( A , P , (cid:15), δ ) = min (cid:26) N : ∀ n > N, sup µ ∈P P (cid:20) L µ ( A ( S )) − min w ∈W L µ ( w ) ≤ (cid:15) (cid:21) ≥ − δ (cid:27) where S = ( Z , Z , · · · , Z n ) ∼ µ ⊗ n is the training data. We would like to find the increase insample-complexity if we expand the set P to P γ = { µ (cid:48) : inf µ ∈P D ( µ (cid:48) (cid:107) µ ) ≤ γ } . The set P γ relates to model misspecification when it is llimited in KL divergence of at most γ .We utilize the following alternative equivalent definition of sample-complexity: (cid:15) ( A , P , n, δ ) = inf (cid:26) x ∈ R : sup µ ∈P P (cid:20) L µ ( A ( S )) − min w ∈W L µ ( w ) ≤ x (cid:21) ≥ − δ (cid:27) . We restrict to uniformly-stable algorithms. In general terms, a learning algorithm is said tobe stable if a small change of the input to the algorithm does not change the output of thealgorithm much. Examples of stability definitions include uniform stability defined by Bousquetand Elisseeff [25]. The definition of stability that we adopt in this paper is as follows:
Definition 2.
Given non-negative real numbers β i ( n ) we say that the A is called uniformly-stable if for any s = ( z , z , · · · , z n ) , s = ( z , z , · · · , z n ) ∈ Z n , the following inequalityholds (almost surely): | (cid:96) ( A ( s ) , z ) − (cid:96) ( A ( s ) , z ) | ≤ n (cid:88) i =1 β i ( n ) [ z i (cid:54) = z i ] , ∀ z ∈ Z . (20)10 heorem 10. Let (cid:96) ( w, Z ) be σ -subgaussian over Z with distribution µ ∈ P for every w . Then,for every n, γ > and δ ∈ [0 , , we have (cid:15) ( A , P γ , n, δ ) ≤ (cid:15) ( A , P , n, δ ) + n (cid:88) i =1 β i + 2 (cid:112) σ γ, (21) and (cid:15) ( A , P γ , n, δ ) ≤ (cid:15) ( A , P , n, δ/
2) + f ( δ ) , (22) where the function f is defined as f ( δ ) (cid:44) (cid:112) σ γ + (cid:112) σ [log(2 /δ ) + γ ] + 1 g ( δ/ n (cid:88) i =1 ln (cid:18) g ( δ/ β i ) − √ γ (cid:19) ,g ( δ ) (cid:44) (cid:114) /δ ) + γ ] σ . (23)Proof of Theorem 10 is given in Section 6.8. Remark 2.
While the upper bound (21) is in terms of (cid:80) ni =1 β i , the upper bound (22) is not.Numerical simulations suggest that the bound (22) is better than the bound (21) if β i ∼ Ω( √ n ) .The regime β i ∼ Ω( √ n ) could be of importance, e.g., see [20, 21]. While D ( r ) (as defined in (7)) is the sharpest possible bound on the generalization error givenan upper bound r on I ( S (cid:48) ; A ( S (cid:48) )), the single-letter bound D ( r/n ) in Theorem 2 is not. In fact,the following relaxation is used in the proof of Theorem 2: instead of producing one outputhypothesis W for the entire sequence S (cid:48) = ( Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n ), we produce n output hypothesis˜ W , ˜ W , · · · , ˜ W n . To tighten the gap between D ( r ) and D ( r/n ), one needs to answer thefollowing question: given a joint distribution ( W, Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n ), what are the set of marginaldistributions on ( W, Z (cid:48) i )? For instance if W is a binary random variable and ( Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n )are i.i.d., W cannot have high dependence with all of the Z (cid:48) i ’s. Motivated by the above question, in the rest of this section we present a general idea whichmay be used on its own, or in conjunction with the ideas in the previous section to improvethe upper bound given in Theorem 2. Let ˜ (cid:96) ( w, z ) be an “auxiliary” loss function; an arbitraryloss function of our choice which can be different from the original loss function (cid:96) ( w, z ). Weshow that the average risk of the ERM algorithm on the auxiliary loss function ˜ (cid:96) can be usedto bound the generalization error of a different algorithm A , which runs on the same trainingdata as the ERM algorithm, but with the original loss function (cid:96) ( w, z ). Let ERM ( z (cid:48) , · · · , z (cid:48) n ) = min w n (cid:88) i =1 n ˜ (cid:96) ( w, z (cid:48) i )be the risk of the ERM algorithm given a training sequence s (cid:48) = ( z (cid:48) , z (cid:48) , · · · , z (cid:48) n ) according to˜ (cid:96) . Let v n = E S (cid:48) ∼ ( ρ (cid:48) ) ⊗ n ERM ( Z (cid:48) , · · · , Z (cid:48) n ) In particular, using mutual information as the measure of dependence we have the following: for a bi-nary W and a sequence ( Z (cid:48) , · · · , Z (cid:48) n ) of independent random variables, we have 1 ≥ I ( W ; Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n ) ≥ (cid:80) i I ( W ; Z (cid:48) i ). See (31) for a proof. Thus, sum of correlations between W and Z (cid:48) i is no more than one bit.
11e the average risk of the ERM algorithm. Let us, for now, assume that v n is known to us.Take an arbitrary algorithm A . Let W (cid:48) = A ( S (cid:48) ) Then, the risk of A with respect to ˜ (cid:96) isgreater than or equal the risk of the ERM algorithm, i.e., E (cid:34) n (cid:88) i =1 n ˜ (cid:96) ( W (cid:48) , Z (cid:48) i ) (cid:35) ≥ v n . (24)Let Q be a random variable, independent of all previously defined variables, and uniform onthe set { , , · · · , n } . Set ˜ Z = Z (cid:48) Q . Observe that ˜ Z ∼ µ (cid:48) because Z (cid:48) i ∼ µ (cid:48) for all i and Q isindependent of ( Z (cid:48) , · · · , Z (cid:48) n ). Using this definition for ˜ Z , the risk of A with respect to the loss˜ (cid:96) equals E [˜ (cid:96) ( W (cid:48) , ˜ Z )] = E (cid:34) n (cid:88) i =1 n ˜ (cid:96) ( W (cid:48) , Z (cid:48) i ) (cid:35) (25)and the generalization error with respect to the loss (cid:96) can be characterized asgen ( µ, µ (cid:48) , A ) = 1 n n (cid:88) i =1 E [ L µ ( W (cid:48) ) − (cid:96) ( W (cid:48) , Z (cid:48) i )] = E (cid:104) L µ ( W (cid:48) ) − (cid:96) ( W (cid:48) , ˜ Z ) (cid:105) . (26)From (24), (25) and (26) we obtain the following upper bound on the generalization error of (cid:96) :gen ( µ, µ (cid:48) , A ) ≤ max P ˆ W | ˜ Z : E [˜ (cid:96) ( ˆ W , ˜ Z )] ≥ v n E (cid:104) L µ ( ˆ W ) − (cid:96) ( ˆ W , ˜ Z ) (cid:105) (27)where ˜ Z ∈ Z is distributed according to µ (cid:48) . The above bound has a similar form as the onegiven in Theorem 2. Observe that (27) provides a generalization bound on the algorithm A based on the sole assumption that it uses a training data of size n . If more is known aboutthe algorithm, e.g. an upper bound on the input and output mutual information, we can writebetter bounds as follows: Theorem 11.
Let ˜ D ( r ) (cid:44) max P ˆ W | ˜ Z : I ( ˆ W ; ˜ Z ) ≤ r, E [˜ (cid:96) ( ˆ W , ˜ Z )] ≥ v n E (cid:104) L µ ( ˆ W ) − (cid:96) ( ˆ W , ˜ Z ) (cid:105) . (28) Then, D ( r ) ≤ ˜ D ( r/n ) ≤ D ( r/n ) . Proof of the Theorem 11 can be found in Section 6.9.
Example 3.
Consider the setting in Example 1. Figure 3 illustrates this improvement in D ( r/n ) when ˜ (cid:96) ( w, z ) = − [ w (cid:54) = z ] and n = 10 . Example 4.
Consider the setting in Example 2. Figure 3 illustrates this improvement in D ( r/n ) when ˜ (cid:96) ( w, z ) = ( w − z ) and n = 10 . In order to use the bound in Theorem 11, one must know the value of v n . However, this isnot known in practice. For instance, consider the special case of loss function ˜ (cid:96) ( w, z ) = ( w − z ) .Given a training data ( z (cid:48) , z (cid:48) , · · · , z (cid:48) n ), the output of the ERM algorithm with the quadratic lossis just the average of the traning data samples and v n equals n − n Var µ (cid:48) ( Z (cid:48) ) . (cid:96) ( w, z ) = − [ w (cid:54) = z ] for W = Z = { , } and n = 10 and the original loss function (cid:96) ( w, z ) = w · z .Figure 4: The bound in Theorem 2 and its improved version via the auxiliary loss function˜ (cid:96) ( w, z ) = ( w − z ) for the learning setting W = [0 , , Z = { , } and n = 10 and the originalloss function (cid:96) ( w, z ) = | w − z | . 13he variance of the test data is not known, but can be estimated from the training datasetitself. Below we show how to estimate v n by running the ERM algorithm on the availabletraining data. Assume that the auxiliary loss satisfies | ˜ (cid:96) ( w, z ) − ˜ (cid:96) ( w, z (cid:48) ) | ≤ c for all w, z, z (cid:48) .Then, we have ERM ( z (cid:48) , z (cid:48) , · · · , z (cid:48) n ) = min w n n (cid:88) i =1 ˜ (cid:96) ( w, z i ) ≤ min w (cid:34) cn + 1 n ˜ (cid:96) ( w, z (cid:48)(cid:48) ) + 1 n n (cid:88) i =2 ˜ (cid:96) ( w, z (cid:48) i ) (cid:35) = cn + ERM ( z (cid:48)(cid:48) , z (cid:48) , · · · , z (cid:48) n ) . Then McDiarmid’s inequality implies high concentration around expected value for the ERMalgorithm: P (cid:2)(cid:12)(cid:12) ERM − E [ ERM ] (cid:12)(cid:12) ≥ t (cid:3) ≤ e − nt c . Thus, one can find an estimate for v n with high probability based on the available training datasequence.At the end, we remark that it is also possible to write bounds based on multiple auxiliaryloss functions rather than just one. The first author is also grateful to Dr. Mohammad Mahdi Mojahedian for helpful discussionson learning from heterogeneous data in mixture models which gave birth to some ideas in thiswork.
In the following sections we present the proofs of the results stated in the previous section intheir order of appearance.
Let ˜ w = ( ˜ w , ˜ w , · · · , ˜ w n ) ∈ W n be a sequence of length n . Let¯ D ( r ) (cid:44) sup P ˜ W | S (cid:48) : I ( ˜ W ; S (cid:48) ) ≤ r n n (cid:88) i =1 E (cid:104) L µ ( ˜ W i ) − (cid:96) ( ˜ W i , Z (cid:48) i ) (cid:105) (29)where S (cid:48) = ( Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n ) and ˜ W = ( ˜ W , ˜ W , · · · , ˜ W n ). Observe that if the entries of thevector ˜ W are all equal, the expression in (29) reduces to the one in (7). Therefore, in (29)we are taking the supremum over a larger set. Thus, ¯ D ( r ) ≥ D ( r ). It follows that for anyalgorithm A satisfying I (cid:0) S (cid:48) ; A ( S (cid:48) ) (cid:1) ≤ r , we havegen ( µ, µ (cid:48) , A ) ≤ ¯ D ( r ) . We claim that ¯ D ( r ) = D ( r/n ). The proof follows similar steps as in [26, Section 3.6.2]for lossy compression. However, we provide a proof for completeness. We first claim that¯ D ( r ) ≥ D ( r/n ). To see this, take some P ˆ W | Z (cid:48) in (9) and take p ( ˜ w | s (cid:48) ) = n (cid:89) i =1 p ˆ W | Z (cid:48) ( ˜ w i | z (cid:48) i ) . P ˆ W | Z (cid:48) in (29) shows that ¯ D ( r ) ≥ D ( r/n ).It remains to show that ¯ D ( r ) ≤ D ( r/n ). Take some arbitrary P ˜ W | S (cid:48) satisfying I ( ˜ W ; S (cid:48) ) ≤ r . We have r ≥ I ( ˜ W ; S (cid:48) )= (cid:88) i I ( ˜ W ; Z (cid:48) i | Z (cid:48) i − )= (cid:88) i I ( ˜ W , Z (cid:48) i − ; Z (cid:48) i ) (30) ≥ (cid:88) i I ( ˜ W i ; Z (cid:48) i ) (31)where (30) follows from the fact that Z (cid:48) i are i.i.d. random variables. We also have1 n n (cid:88) i =1 E (cid:104) L µ ( ˜ W i ) − (cid:96) ( ˜ W i , Z (cid:48) i ) (cid:105) ≤ n n (cid:88) i =1 D ( I ( ˜ W i ; Z (cid:48) i )) (32) ≤ D (cid:32) n n (cid:88) i =1 I ( ˜ W i ; Z (cid:48) i ) (cid:33) (33) ≤ D (cid:16) rn (cid:17) (34)where (32) follows from the definition of D , (33) follows from concavity of D ( · ), and (34)follows from (31) and the fact that D ( · ) is an increasing function. Concavity of D ( · ) followsfrom the fact that mutual information I ( ˜ W ; S (cid:48) ) is convex in P ˜ W | S (cid:48) for a fixed distribution on P S (cid:48) .Since P ˜ W | S (cid:48) was an arbitrary conditional distribution satisfying I ( ˜ W ; S (cid:48) ) ≤ r , we deducefrom (32)-(34) that D ( r/n ) ≥ ¯ D ( r ) as desired.The cardinality bounds on the auxiliary random variable ˆ W in the definition of D comesfrom the standard Caratheodory-Bunt [27] arguments and is omitted. Given the distribution ζ ( x ) and some arbitrary conditional distribution ζ (ˆ x | x ), let ζ ( x, ˆ x ) = ζ (ˆ x | x ) ζ ( x ). Set q ( x, ˆ x ) = η ( x ) ζ (ˆ x ) and f ( x, ˆ x ) = λd ( x, ˆ x ) where − b < λ <
0. From theDonsker-Varadhan representation, we obtain that D ( ζ X, ˆ X (cid:107) q X, ˆ X ) ≥ λ E ζ [ d ( ˆ X, X )] − log E q (cid:104) e λ ( d ( X, ˆ X )) (cid:105) , (35)Using independence of X and ˆ X under q we can write for − b < λ < E q (cid:104) e λ ( d ( X, ˆ X )) (cid:105) = log E ˆ X ∼ ζ (cid:110) E X ∼ η (cid:104) e λ ( d ( X, ˆ X )) (cid:105)(cid:111) ≤ sup ˆ x log E η (cid:2) e λd ( X, ˆ x ) (cid:3) ≤ φ ( λ ) . Then from (35) and consider λ < E ζ [ d ( ˆ X, X )] ≥ λ D ( ζ X, ˆ X (cid:107) q X, ˆ X ) + 1 λ φ ( λ ) . Moreover, D ( ζ X, ˆ X (cid:107) q X, ˆ X ) = I ζ ( ˆ X ; X ) + D ( ζ X (cid:107) η X ) and I ζ ( ˆ X ; X ) ≤ r . Thus, E ζ [ d ( ˆ X, X )] ≥ sup − b<λ< (cid:26) λ [ r + D ( ζ X (cid:107) η X )] + 1 λ φ ( λ ) (cid:27) .
15n conclusion, inf P ˆ X | X : I ζ ( ˆ X ; X ) ≤ r E ζ (cid:104) d ( X, ˆ X ) (cid:105) ≥ sup − b<λ< (cid:26) λ [ r + D ( ζ X (cid:107) η X )] + 1 λ φ ( λ ) (cid:27) . (36) The inequality 1 n n (cid:88) i =1 D ( I ( Z (cid:48) i ; A ( S (cid:48) ))) ≤ n n (cid:88) i =1 (cid:113) σ (cid:2) I (cid:0) Z (cid:48) i ; A ( S (cid:48) ) (cid:1) + D ( µ (cid:48) (cid:107) µ ) (cid:3) . follows from Corollary 1. To show the inequalitygen ( µ, µ (cid:48) , A ) ≤ n n (cid:88) i =1 D ( I ( Z (cid:48) i ; A ( S (cid:48) )))take some algorithm A and let W (cid:48) = A ( S (cid:48) n ). Then,gen ( µ, µ (cid:48) , A ) = 1 n n (cid:88) i =1 E [ L µ ( W (cid:48) ) − (cid:96) ( W (cid:48) , Z (cid:48) i )] ≤ n n (cid:88) i =1 D ( I ( W (cid:48) ; Z (cid:48) i )) (37)where (37) follows from the definition of D . It suffices to prove the lower bound when a fixed distribution P W (cid:48) ∈ M is chosen for the outputof the algorithm because a minimum can be taken over all P W (cid:48) ∈ M from both sides of thedesired inequality at the end. We have D ( r ) = min P W (cid:48) S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ): I ( W (cid:48) ; S (cid:48) ) ≤ r E [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] (38)= min P W (cid:48) S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ) max λ ≥ E [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] + λD ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) − λr (39) ≥ max λ ≥ min P W (cid:48) S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ) E [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] + λD ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) − λr (40) ≥ max λ ≥ [ D (0) + λ − λψ (1 / ( nλ )) n − λr ] (41)= D (0) − min λ ≥ [ λr + λ ( ψ (1 / ( nλ )) n − . (42)where (41) follows from Lemma 3. Lemma 3.
Let (cid:96) ( W (cid:48) , z ) satisfies ψ ( λ ) ≥ E P W (cid:48) (cid:104) e λ [ (cid:96) ( W (cid:48) ,z ) − E PW (cid:48) [ (cid:96) ( W (cid:48) ,z )] ] (cid:105) , ∀ z ∈ Z . Then, for any λ ≥ P W (cid:48) S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ) E P W (cid:48) S (cid:48) [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] + λD ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) ≥ D (0) − λ ( ψ (1 / ( nλ )) n − . (43)16 roof. Assume that W (cid:48) ∼ ζ and S (cid:48) ∼ β are the marginal distributions of W (cid:48) and S (cid:48) . Setting∆( w, s ) = L µ ( w ) − L s ( w ) , we can express the left hand side of (43) asmin ( W (cid:48) ,S (cid:48) ) ∼ π ∈ U ( ζ,β ) E π ∆( W (cid:48) , S (cid:48) ) + λ (cid:90) W×Z ⊗ n φ (cid:18) dπ ( w (cid:48) , s (cid:48) ) dζ ( w (cid:48) ) dβ ( s (cid:48) ) (cid:19) dζ ( w (cid:48) ) dβ ( s (cid:48) ) (44)where φ ( x ) = x log( x ) − x + 1. We find the dual problem of the above optimization problem.Introducing the Lagrange multipliers f and g associated to the constraints, the Lagrangianreads L ( λ, ζ, β ) = E π ∆( W (cid:48) , S (cid:48) ) + λ (cid:90) W×Z ⊗ n φ (cid:18) dπ ( w (cid:48) , s (cid:48) ) dζ ( w (cid:48) ) dβ ( s (cid:48) ) (cid:19) dζ ( w (cid:48) ) dβ ( s (cid:48) )+ (cid:90) W f ( w (cid:48) ) (cid:18) dζ ( w (cid:48) ) − (cid:90) Z ⊗ n dπ ( w (cid:48) , s (cid:48) ) (cid:19) + (cid:90) Z ⊗ n g ( s (cid:48) ) (cid:18) dβ ( s (cid:48) ) − (cid:90) W dπ ( w (cid:48) , s (cid:48) ) (cid:19) . The dual Lagrange function is given by min π L ( λ, α, β ) over all π ( w (cid:48) , s (cid:48) ) ≥
0. Note that incomputing the minimum we do not require (cid:80) w (cid:48) ,s (cid:48) π ( w (cid:48) , s (cid:48) ) = 1. Observe thatmin π L ( λ, ζ, β )= (cid:90) W f ( w (cid:48) ) dζ ( w (cid:48) ) + (cid:90) Z ⊗ n g ( s (cid:48) ) dβ ( s (cid:48) )+ λ min π (cid:18)(cid:90) W×Z ⊗ n (cid:18) φ (cid:18) dπ ( w (cid:48) , s (cid:48) ) dζ ( w (cid:48) ) dβ ( s (cid:48) ) (cid:19) + ∆( w (cid:48) , s (cid:48) ) − f ( w (cid:48) ) − g ( s (cid:48) ) λ dπ ( w (cid:48) , s (cid:48) ) dζ ( w (cid:48) ) dβ ( s (cid:48) ) (cid:19) dζ ( w (cid:48) ) dβ ( s (cid:48) ) (cid:19) = (cid:90) W f ( w (cid:48) ) dζ ( w (cid:48) ) + (cid:90) Z ⊗ n g ( s (cid:48) ) dβ ( s (cid:48) ) − λ (cid:90) W×Z ⊗ n φ ∗ (cid:18) f ( w (cid:48) ) + g ( s (cid:48) ) − ∆( w (cid:48) , s (cid:48) ) λ (cid:19) dζ ( w (cid:48) ) dβ ( s (cid:48) ) , where φ ∗ is the Legendre transform of φ given by φ ∗ ( y ) = sup x ≥ (cid:2) xy − φ ( x ) (cid:3) = e y − . Thus, we obtainmin π L ( λ, ζ, β ) = E [ g ( S (cid:48) )] + E [ f ( W (cid:48) )] + λ − λ E P S (cid:48) P W (cid:48) (cid:20) exp (cid:18) − ∆( W (cid:48) , S (cid:48) ) − g ( S (cid:48) ) − f ( W (cid:48) ) λ (cid:19)(cid:21) . (45)From weak duality, for every continuous functions f and g , E [ g ( S (cid:48) )] + E [ f ( W (cid:48) )] + λ − λ E P S (cid:48) P W (cid:48) (cid:20) exp (cid:18) − ∆( W (cid:48) , S (cid:48) ) − g ( S (cid:48) ) − f ( W (cid:48) ) λ (cid:19)(cid:21) ≤ min P W (cid:48) S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ) E P W (cid:48) S (cid:48) [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] + λD ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) . (46)Assigning g ( s (cid:48) ) = − E W (cid:48) ∼ ζ L s (cid:48) ( W (cid:48) ) and f ( w (cid:48) ) = L µ ( w (cid:48) ) and using the fact that ∆( w, s ) = L µ ( w ) − L s ( w ) , we obtain E [ L µ ( W (cid:48) ) − L µ (cid:48) ( W (cid:48) )] + λ − λ E P S (cid:48) P W (cid:48) (cid:20) exp (cid:18) L S (cid:48) ( W (cid:48) ) − E P W (cid:48) L S (cid:48) ( W (cid:48) ) λ (cid:19)(cid:21) min P W (cid:48) S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ) E P W (cid:48) S (cid:48) [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] + λD ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) . (47)We give an upper bound for the exponential term as follows: E P S (cid:48) P W (cid:48) (cid:20) exp (cid:18) L S (cid:48) ( W (cid:48) ) − E P W (cid:48) L S (cid:48) ( W (cid:48) ) λ (cid:19)(cid:21) = n (cid:89) i =1 E P Z (cid:48) i P W (cid:48) (cid:20) exp (cid:18) (cid:96) ( W (cid:48) , Z (cid:48) i ) − E P W (cid:48) (cid:96) ( W (cid:48) , Z (cid:48) i ) nλ (cid:19)(cid:21) ≤ n (cid:89) i =1 sup z (cid:48) ∈Z E P W (cid:48) (cid:20) exp (cid:18) (cid:96) ( W (cid:48) , z (cid:48) ) − E P W (cid:48) (cid:96) ( W (cid:48) , z (cid:48) ) nλ (cid:19)(cid:21) ≤ (cid:18) ψ (cid:18) nλ (cid:19)(cid:19) n . Thus, min P W (cid:48) S (cid:48) ∈ U ( P W (cid:48) ,P S (cid:48) ) E P W (cid:48) S (cid:48) [ L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) )] + λD ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) ≥ E [ L µ ( W (cid:48) ) − L µ (cid:48) ( W (cid:48) )] + λ − λ (cid:18) ψ (cid:18) nλ (cid:19)(cid:19) n = D (0) + λ − λ (cid:18) ψ (cid:18) nλ (cid:19)(cid:19) n . The proof is similar to the proof of Theorem 2. As in Section 6.1, we let ˜ w = ( ˜ w , ˜ w , · · · , ˜ w n ) ∈W n be a sequence of length n . 
Let˜ D ( r ) (cid:44) inf 1 n n (cid:88) i =1 E (cid:104) L µ ( ˜ W i ) − (cid:96) ( ˜ W i , Z (cid:48) i ) (cid:105) (48)where the infimum is over P ˜ W ,S (cid:48) satisfying P ˜ W i ,S (cid:48) ∈ U ( P W (cid:48) , P S (cid:48) ) and I ( ˜ W ; S (cid:48) ) ≤ r . Observethat if the entries of the vector ˜ W are all equal, the expression in (48) reduces to the one in(16). Therefore, in (48) we are taking the infimum over a larger set. Thus, ˜ D ( r ) ≤ D ( r ). Itfollows that for any algorithm A satisfying I (cid:0) S (cid:48) ; A ( S (cid:48) ) (cid:1) ≤ r , we havegen ( µ, µ (cid:48) , A ) ≥ ˜ D ( r ) . We claim that ˜ D ( r ) = D ( r/n ). The rest of the proof follows similar lines as in the proof ofTheorem 2 given in Section 6.1. Thus, it is omitted. We would like to bound L µ ( A ( S (cid:48) )) − L S (cid:48) ( A ( S (cid:48) )) from above. Let S = ( Z , Z , · · · , Z n ) bedistributed according to µ ⊗ n , while S (cid:48) = ( Z (cid:48) , Z (cid:48) , · · · , Z (cid:48) n ) was distributed according to ( µ (cid:48) ) ⊗ n .From the subgaussian assumption, we have E P S P W (cid:48) [exp ( λL µ ( W (cid:48) ) − λL S ( W (cid:48) ))] ≤ exp (cid:18) λ σ n (cid:19) . (49)Using a change of measure argument, we obtain E P W (cid:48) S (cid:48) (cid:20) exp (cid:18) λL µ ( W (cid:48) ) − λL S (cid:48) ( W (cid:48) ) − λ σ n − log dP W (cid:48) S (cid:48) dP W (cid:48) dP S (cid:19)(cid:21) ≤ . (50)Using Markov’s inequality P [ X > δ ] < E [ X ] δ , we deduce P W (cid:48) S (cid:48) (cid:20) exp (cid:18) λL µ ( W (cid:48) ) − λL S (cid:48) ( W (cid:48) ) − λ σ n − log dP W (cid:48) S (cid:48) dP W (cid:48) dP S (cid:19) ≥ δ (cid:21) ≤ δ. (51)18quivalently, P W (cid:48) S (cid:48) (cid:20) λL µ ( W (cid:48) ) − λL S (cid:48) ( W (cid:48) ) ≥ λ σ n + log dP W (cid:48) S (cid:48) dP W (cid:48) dP S + log (cid:18) δ (cid:19)(cid:21) ≤ δ. (52)Thus, the following inequality holds with probability at least 1 − δ : L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) ) ≤ − δ λσ n + 1 λ log dP W (cid:48) S (cid:48) dP W (cid:48) dP S + 1 λ log (cid:18) δ (cid:19) . (53)Using Chernoff’s bound on log (cid:16) dP W (cid:48) S (cid:48) dP W (cid:48) dP S (cid:48) (cid:17) , we get for 1 < α , P W (cid:48) S (cid:48) (cid:20) log (cid:18) dP W (cid:48) S (cid:48) dP W (cid:48) dP S (cid:19) ≥ t (cid:21) ≤ E P W (cid:48) S (cid:48) (cid:34) e ( α −
1) log (cid:18) dPW (cid:48) S (cid:48) PW (cid:48) PS (cid:19) (cid:35) e ( α − t = e ( α − D α ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S ) − t ) . Thus, log (cid:18) dP W (cid:48) S (cid:48) dP W (cid:48) dP S (cid:19) ≤ − δ D α ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S ) + 1 α − (cid:18) δ (cid:19) . For the case of no-mismatch, the above equation together with (53) recovers the result of [15]once we optimize over λ .We use Lemma 4 to show the following inequality: D α ( P W (cid:48) S (cid:48) || P W (cid:48) P S ) ≤ D α − p ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) + nD α − q ( µ (cid:48) (cid:107) µ ) , where p and q are non-negative and Holder conjugate. Therefore, combining with (53), we get L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) ) ≤ − δ λσ n + 1 λ D α − p ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) + nλ D α − q ( µ (cid:48) (cid:107) µ ) + α ( α − λ log (cid:18) δ (cid:19) . Optimizing over λ yields L µ ( W (cid:48) ) − L S (cid:48) ( W (cid:48) ) ≤ − δ (cid:115) σ D α − q ( µ (cid:48) (cid:107) µ ) + 2 σ (cid:2) D α − p ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) + αα − log (cid:0) δ (cid:1)(cid:3) n . (54)Then with α = and p = q = 2, we get from equation (54), P [ | gen µ ( W (cid:48) , S (cid:48) ) | ≥ η ] ≤ − n (cid:16) η − σ D ( µ (cid:48) (cid:107) µ ) (cid:17) − σ D ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) )3 σ . Lemma 4.
For < α, p, q < ∞ with p + q = 1 , we have D α ( P W (cid:48) S (cid:48) || P W (cid:48) P S ) ≤ D α − p ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) + nD α − q ( µ (cid:48) (cid:107) µ ) . Proof.
We use Holder’s inequality for 1 < p, q < ∞ in following inequality:exp (( α − D α ( P W (cid:48) S (cid:48) || P W (cid:48) P S )) = (cid:90) (cid:18) dP W (cid:48) S (cid:48) dP W (cid:48) dP S (cid:19) α − dP W (cid:48) S (cid:48) (cid:90) (cid:18) dP W (cid:48) S (cid:48) dP W (cid:48) dP (cid:48) S dP (cid:48) S dP S (cid:19) α − dP W (cid:48) S (cid:48) ≤ (cid:32)(cid:90) (cid:18) dP W (cid:48) S (cid:48) dP W (cid:48) dP (cid:48) S (cid:19) p ( α − dP W (cid:48) S (cid:48) (cid:33) p (cid:32)(cid:90) (cid:18) dP (cid:48) S dP S (cid:19) q ( α − dP W (cid:48) S (cid:48) (cid:33) q = exp (cid:0) ( α − D p ( α − ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) (cid:1) exp (cid:0) ( α − D q ( α − ( P S (cid:48) (cid:107) P S ) (cid:1) . Then we get, D α ( P W (cid:48) S (cid:48) || P W (cid:48) P S ) ≤ D p ( α − ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) + nD q ( α − ( µ (cid:48) (cid:107) µ ) . Proof.
Let w = arg min w ∈W L µ ( w ). Let A ∗ be an algorithm that outputs w regardless of thetraining data sequence. Then, using Theorem 8 with A ∗ we obtain L S (cid:48) ( w ) − L µ ( w ) ≤ − δ (cid:115) σ D ( µ (cid:48) (cid:107) µ ) + 2 σ log (cid:0) δ (cid:1) n . (55)On the other hand, from the definition of the ERM algorithm we have L S (cid:48) ( A ERM ( S (cid:48) )) ≤ L S (cid:48) ( w ) . (56)Since L µ ( w ) = min w ∈W L µ ( w ), it follows that L S (cid:48) ( A ERM ( S (cid:48) )) ≤ − δ min w ∈W L µ ( w ) + (cid:115) σ D ( µ (cid:48) (cid:107) µ ) + 2 σ log (cid:0) δ (cid:1) n . (57)Then using Theorem 8, L µ ( A ERM ( S (cid:48) )) − min w ∈W L µ ( w ) = L µ ( A ERM ( S (cid:48) )) − L S (cid:48) ( A ERM ( S (cid:48) )) + L S (cid:48) ( A ERM ( S (cid:48) )) − min w ∈W L µ ( w ) ≤ − δ (cid:115) σ D ( µ (cid:48) (cid:107) µ ) + 2 σ (cid:2) D ( P W (cid:48) S (cid:48) (cid:107) P W (cid:48) P S (cid:48) ) + 2 log (cid:0) δ (cid:1)(cid:3) n + (cid:115) σ D ( µ (cid:48) (cid:107) µ ) + 2 σ log (cid:0) δ (cid:1) n . (58) Take some arbitrary µ ∈ P and µ (cid:48) ∈ P γ . It suffices to find a bound on the difference (cid:15) ( A , µ (cid:48) , n, δ ) − (cid:15) ( A , µ, n, δ ) that depends only on D ( µ (cid:48) (cid:107) µ ). From Lemma 2, given the train-ing data S (cid:48) = ( Z (cid:48) , · · · , Z (cid:48) n ) ∼ ( µ (cid:48) ) ⊗ n , we can define S = ( Z , · · · , Z n ) ∼ µ ⊗ n such that ( Z i , Z (cid:48) i )are i.i.d. for 1 ≤ i ≤ n and P [ Z i (cid:54) = Z (cid:48) i ] = (cid:107) µ − µ (cid:48) (cid:107) T V . (59)For the first upper bound (21), we write L µ (cid:48) ( A ( S (cid:48) )) = L µ (cid:48) ( A ( S (cid:48) ) − L µ (cid:48) ( A ( S )) + L µ ( A ( S )) + L µ (cid:48) ( A ( S )) − L µ ( A ( S ))20 a ) ≤ − δ n (cid:88) i =1 β i + min w ∈W L µ ( w ) + (cid:15) ( A , P , n, δ ) + L µ (cid:48) ( A ( S )) − L µ ( A ( S )) ( b ) ≤ n (cid:88) i =1 β i + min w ∈W L µ ( w ) + (cid:15) ( A , P , n, δ ) + (cid:112) σ γ, (60)where, (a) comes from the uniform stability condition and definition of (cid:15) ( A , P , n, δ ). Inequality(b) is derived using Lemma 1 as follows L µ (cid:48) ( A ( S )) − L µ ( A ( S )) = E µ (cid:48) [ (cid:96) ( A ( S ) , Z (cid:48) ) − E µ [ (cid:96) ( A ( S ) , Z )]] ≤ λ D ( µ (cid:48) (cid:107) µ ) + 1 λ log E µ (cid:104) e λ [ (cid:96) ( A ( S ) ,Z (cid:48) ) − E µ [ (cid:96) ( A ( S ) ,Z )]] (cid:105) ≤ λ D ( µ (cid:48) (cid:107) µ ) + λσ . Optimizing on λ and D ( µ (cid:48) (cid:107) µ ) ≤ γ , we get L µ (cid:48) ( A ( S )) − L µ ( A ( S )) ≤ (cid:112) σ γ. (61)Next, we give an upper bound for min w L µ ( w ). Let w = arg min w L µ ( w ) and w (cid:48) = arg min w L µ (cid:48) ( w ).A similar argument as above gives L µ ( w ) ≤ L µ ( w ) − L µ (cid:48) ( w (cid:48) ) + L µ (cid:48) ( w (cid:48) ) ≤ L µ ( w (cid:48) ) − L µ (cid:48) ( w (cid:48) ) + L µ (cid:48) ( w (cid:48) ) ≤ (cid:112) σ D ( µ (cid:48) (cid:107) µ ) + L µ (cid:48) ( w (cid:48) ) . (62)Using (60), we deduce L µ (cid:48) ( W ( S (cid:48) )) ≤ min w ∈W L µ (cid:48) ( w ) + (cid:15) ( A , P , n, δ ) + n (cid:88) i =1 β i + 2 (cid:112) σ γ. 
(63)This completes the proof for the first upper bound.For the second upper bound (22), the population risk of the learning algorithm with respectto µ (cid:48) by using Lemma 1 could be written as, λL µ (cid:48) ( A ( S (cid:48) )) = (cid:90) Z λ(cid:96) ( A ( S (cid:48) ) , z ) µ (cid:48) ( dz )= E Z ∼ µ (cid:48) [ λ(cid:96) ( A ( S (cid:48) ) , Z )] ≤ ln (cid:16) E Z ∼ µ (cid:2) exp (cid:0) λ(cid:96) ( A ( S (cid:48) ) , Z ) (cid:1)(cid:3) (cid:17) + D ( µ (cid:48) (cid:107) µ ) . (64)Note that both sides of (64) are random variables (and functions of S (cid:48) ) and Z is taken tobe independent of S (cid:48) .Considering the stability notion of algorithm from Definition 2, the following inequalityholds almost surely: (cid:12)(cid:12)(cid:12) (cid:96) ( A ( S (cid:48) ) , z ) − (cid:96) ( A ( S ) , z ) (cid:12)(cid:12)(cid:12) ≤ n (cid:88) i =1 β i ( n ) [ Z i (cid:54) = Z (cid:48) i ] , ∀ z ∈ Z . Therefore, if we take Z ∼ µ independent of ( S, S (cid:48) ), we deduce that (cid:12)(cid:12)(cid:12) (cid:96) ( A ( S (cid:48) ) , Z ) − (cid:96) ( A ( S ) , Z ) (cid:12)(cid:12)(cid:12) ≤ n (cid:88) i =1 β i ( n ) [ Z i (cid:54) = Z (cid:48) i ] . (65)21ext, we bound the random variable E Z ∼ µ (cid:2) exp (cid:0) λ(cid:96) ( A ( S (cid:48) ) , Z ) (cid:3) in (64) from above as follows: E Z ∼ µ (cid:2) exp (cid:0) λ(cid:96) ( A ( S (cid:48) ) , Z ) (cid:1)(cid:3) ( a ) ≤ E Z ∼ µ (cid:34) exp (cid:32) λ(cid:96) ( A ( S ) , Z ) + λ n (cid:88) i =1 β i [ Z i (cid:54) = Z (cid:48) i ] (cid:33)(cid:35) ( b ) ≤ exp (cid:16) λ E Z (cid:96) ( A ( S ) , Z ) + λ σ / λ n (cid:88) i =1 β i [ Z i (cid:54) = Z (cid:48) i ] (cid:17) ( c ) ≤ − δ exp (cid:16) λ min w L µ ( w ) + λ(cid:15) + λ σ / λ n (cid:88) i =1 β i [ Z i (cid:54) = Z (cid:48) i ] (cid:17) , (66)where ( a ) comes from (65), inequality ( b ) comes from the subgaussianity of (cid:96) ( A ( S ) , Z ) in termsof Z for any fixed S and ( c ) is derived from the definition of (cid:15) = (cid:15) ( A , P , n, δ ).Next, from Markov’s inequality we haveexp (cid:16) λ min w L µ ( w ) + λ(cid:15) + λ σ / λ n (cid:88) i =1 β i [ Z i (cid:54) = Z (cid:48) i ] (cid:17) ( d ) ≤ − δ (cid:48) δ (cid:48) E S,S (cid:48) exp (cid:32) λ min w L µ ( w ) + λ(cid:15) + λ σ / λ n (cid:88) i =1 β i [ Z i (cid:54) = Z (cid:48) i ] (cid:33) = 1 δ (cid:48) exp( λ σ /
Next, applying Markov's inequality to the right-hand side of (66), viewed as a random variable through $(S,S')$, with probability at least $1-\delta'$ we have
$$\exp\Big(\lambda\min_w L_\mu(w) + \lambda\epsilon + \frac{\lambda^2\sigma^2}{2} + \lambda\sum_{i=1}^n\beta_i\mathbb{1}[Z_i\neq Z'_i]\Big) \le \frac{1}{\delta'}\,\mathbb{E}_{S,S'}\exp\Big(\lambda\min_w L_\mu(w) + \lambda\epsilon + \frac{\lambda^2\sigma^2}{2} + \lambda\sum_{i=1}^n\beta_i\mathbb{1}[Z_i\neq Z'_i]\Big)$$
$$= \frac{1}{\delta'}\exp\Big(\frac{\lambda^2\sigma^2}{2}\Big)\exp\Big(\lambda\min_w L_\mu(w) + \lambda\epsilon\Big)\times\prod_{i=1}^n\Big(e^{\lambda\beta_i}\|\mu-\mu'\|_{TV} + 1 - \|\mu-\mu'\|_{TV}\Big), \quad (67)$$
where the last equality follows from (59) and the independence of the indicators $\mathbb{1}[Z_i\neq Z'_i]$.

Using (64), (66) and (67), we find the following upper bound on $L_{\mu'}(\mathcal{A}(S'))$ with probability at least $1-\delta-\delta'$:
$$L_{\mu'}(\mathcal{A}(S')) \le \min_w L_\mu(w) + \epsilon + \frac{1}{\lambda}\log(1/\delta') + \frac{\lambda\sigma^2}{2} + \frac{1}{\lambda}\sum_{i=1}^n\ln\Big(e^{\lambda\beta_i}\|\mu-\mu'\|_{TV} + 1 - \|\mu-\mu'\|_{TV}\Big) + \frac{1}{\lambda}D(\mu'\|\mu)$$
$$\le \min_w L_\mu(w) + \epsilon + \frac{1}{\lambda}\log(1/\delta') + \frac{\lambda\sigma^2}{2} + \frac{1}{\lambda}\sum_{i=1}^n\ln\Big(e^{\lambda\beta_i}\sqrt{\gamma/2} + 1 - \sqrt{\gamma/2}\Big) + \frac{\gamma}{\lambda},$$
where we used $D(\mu'\|\mu)\le\gamma$ together with Pinsker's inequality, $\|\mu-\mu'\|_{TV}\le\sqrt{D(\mu'\|\mu)/2}\le\sqrt{\gamma/2}$; note that the product in (67) is increasing in the total variation distance. The choice of
$$\lambda = g(\delta') = \sqrt{\frac{2\big[\log(1/\delta') + \gamma\big]}{\sigma^2}}$$
yields, with probability at least $1-\delta-\delta'$,
$$L_{\mu'}(\mathcal{A}(S')) \le \min_w L_\mu(w) + \epsilon(\mathcal{A},\mathcal{P},n,\delta) + \sqrt{2\sigma^2\big[\log(1/\delta') + \gamma\big]} + \frac{1}{g(\delta')}\sum_{i=1}^n\ln\Big(e^{g(\delta')\beta_i}\sqrt{\gamma/2} + 1 - \sqrt{\gamma/2}\Big).$$
Therefore, setting $\delta' = \delta/2$ and replacing $\delta$ by $\delta/2$ above, we get the following upper bound on $L_{\mu'}(\mathcal{A}(S'))$ with probability at least $1-\delta$:
$$L_{\mu'}(\mathcal{A}(S')) \le \min_w L_\mu(w) + \epsilon(\mathcal{A},\mathcal{P},n,\delta/2) + \sqrt{2\sigma^2\big[\log(2/\delta) + \gamma\big]} + \frac{1}{g(\delta/2)}\sum_{i=1}^n\ln\Big(e^{g(\delta/2)\beta_i}\sqrt{\gamma/2} + 1 - \sqrt{\gamma/2}\Big).$$
Finally, using (62) to replace $\min_w L_\mu(w)$ by $\min_w L_{\mu'}(w)$,
$$L_{\mu'}(\mathcal{A}(S')) \le \min_w L_{\mu'}(w) + \epsilon(\mathcal{A},\mathcal{P},n,\delta/2) + f(\delta),$$
where
$$f(\delta) \triangleq \sqrt{2\sigma^2\gamma} + \sqrt{2\sigma^2\big[\log(2/\delta) + \gamma\big]} + \frac{1}{g(\delta/2)}\sum_{i=1}^n\ln\Big(e^{g(\delta/2)\beta_i}\sqrt{\gamma/2} + 1 - \sqrt{\gamma/2}\Big).$$
This completes the proof.
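Remark (numerical sanity check). Two steps above lend themselves to a quick numerical verification: the factorization in (67), which holds because the disagreement indicators are i.i.d. Bernoulli with parameter $\|\mu-\mu'\|_{TV}$, and the optimality of the choice $\lambda = g(\delta')$. The Python sketch below uses illustrative parameter values only.

```python
import numpy as np

# (i) The equality in (67): for independent B_i ~ Bernoulli(p) with
#     p = ||mu - mu'||_TV, E[exp(lam * sum_i beta_i * B_i)] factorizes
#     into prod_i (exp(lam * beta_i) * p + 1 - p).
rng = np.random.default_rng(2)
n, p, lam = 5, 0.1, 1.5
beta = rng.uniform(0.01, 0.1, size=n)     # stability coefficients beta_i
B = rng.random((500000, n)) < p           # i.i.d. disagreement indicators
mc = np.mean(np.exp(lam * (B * beta).sum(axis=1)))
closed = np.prod(np.exp(lam * beta) * p + 1 - p)
print(mc, closed)                         # Monte Carlo estimate matches the product

# (ii) The choice lam = g(delta') = sqrt(2[log(1/delta') + gamma] / sigma^2)
#      minimizes (log(1/delta') + gamma)/lam + lam * sigma^2 / 2, with
#      optimal value sqrt(2 sigma^2 [log(1/delta') + gamma]).
sigma, gamma, delta_p = 1.0, 0.05, 0.05
c = np.log(1 / delta_p) + gamma
lams = np.linspace(0.1, 10, 2000)
vals = c / lams + lams * sigma**2 / 2
print(vals.min(), np.sqrt(2 * sigma**2 * c), lams[vals.argmin()],
      np.sqrt(2 * c / sigma**2))          # minimum and argmin match the closed forms
```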
It is clear from their definitions that $\tilde{D}(r/n) \le D(r/n)$. By the definition of $v_n$, for any arbitrary $p_{W'|S'}$ with $S' = (Z'_1, Z'_2, \dots, Z'_n)$ we have
$$\mathbb{E}\Big[\sum_{i=1}^n \frac{1}{n}\tilde{\ell}(W', Z'_i)\Big] \ge v_n.$$
It follows that
$$D(r) = \max_{P_{W'|S'}:\; I(W';S')\le r} \mathbb{E}\big[L_\mu(W') - L_{S'}(W')\big] = \max_{P_{W'|S'}:\; I(W';S')\le r,\ \mathbb{E}[\sum_{i=1}^n \frac{1}{n}\tilde{\ell}(W',Z'_i)]\ge v_n} \mathbb{E}\big[L_\mu(W') - L_{S'}(W')\big],$$
since the added constraint is satisfied by every $p_{W'|S'}$ and hence does not change the value of the maximum. A similar argument as in (31) shows that for any arbitrary $p_{W'|S'}$ we have
$$I(W';S') \ge \sum_{i=1}^n I(W';Z'_i).$$
Thus,
$$D(r) \le \max_{P_{W'|S'}:\; \frac{1}{n}\sum_{i=1}^n I(W';Z'_i)\le \frac{r}{n},\ \frac{1}{n}\sum_{i=1}^n \mathbb{E}[\tilde{\ell}(W',Z'_i)]\ge v_n}\ \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[L_\mu(W') - \ell(W',Z'_i)\big].$$
Take some arbitrary $p_{W'|S'}$ and a time-sharing random variable $Q$ uniform on $\{1,2,\dots,n\}$, independent of all previously defined variables. Note that
$$I(W';Z'_Q) \le I(Q,W';Z'_Q) = I(Q;Z'_Q) + I(W';Z'_Q\,|\,Q) = I(W';Z'_Q\,|\,Q) = \frac{1}{n}\sum_{i=1}^n I(W';Z'_i) \le \frac{r}{n}, \quad (69)$$
where the middle equality in (69) follows from the fact that the $Z'_i$'s are i.i.d., so $Z'_Q \sim \mu'$ is independent of $Q$ and $I(Q;Z'_Q) = 0$. We also have
$$\mathbb{E}\big[\tilde{\ell}(W',Z'_Q)\big] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}\big[\tilde{\ell}(W',Z'_i)\big] \ge v_n.$$
Thus, the joint distribution $p_{W',Z'_Q}$ satisfies the constraints of $\tilde{D}(r/n)$. Moreover, $Z'_Q \sim \mu'$ and
$$\mathbb{E}\big[L_\mu(W') - \ell(W',Z'_Q)\big] = \frac{1}{n}\,\mathbb{E}\sum_{i=1}^n \big[L_\mu(W') - \ell(W',Z'_i)\big].$$
Thus, we deduce that $D(r) \le \tilde{D}(r/n)$, as desired.
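Remark (numerical sanity check). The time-sharing step (69) can be illustrated on a toy example with $n = 2$ binary samples, $Z'_i$ i.i.d. Bernoulli$(1/2)$, and the deterministic choice $W' = Z'_1$. The Python sketch below computes the relevant mutual information terms from explicit joint pmfs; the example and the helper function are illustrative only.

```python
import numpy as np

def mutual_information(joint):
    # I(X;Y) in nats for a joint pmf given as a 2-D array.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask]))

n = 2
# Joint pmf over (W', Z'_1, Z'_2) with Z'_i i.i.d. Bernoulli(1/2) and W' = Z'_1.
p = np.zeros((2, 2, 2))
for z1 in range(2):
    for z2 in range(2):
        p[z1, z1, z2] = 0.25

I_w_z1 = mutual_information(p.sum(axis=2))   # I(W'; Z'_1) = log 2
I_w_z2 = mutual_information(p.sum(axis=1))   # I(W'; Z'_2) = 0

# Joint pmf of (W', Z'_Q) with Q uniform on {1, 2}, independent of the rest.
p_w_zq = 0.5 * p.sum(axis=2) + 0.5 * p.sum(axis=1)
I_w_zq = mutual_information(p_w_zq)

avg = (I_w_z1 + I_w_z2) / n                  # = I(W'; Z'_Q | Q)
print(I_w_zq <= avg + 1e-12, I_w_zq, avg)    # True, ~0.131 <= ~0.347 (nats)
```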