A Limitation of the PAC-Bayes Framework
Roi Livni∗   Shay Moran†

April 2020

∗ Tel Aviv University, Department of Electrical Engineering
† Google AI, Princeton
Abstract
PAC-Bayes is a useful framework for deriving generalization bounds which was introduced by McAllester ('98). This framework has the flexibility of deriving distribution- and algorithm-dependent bounds, which are often tighter than VC-related uniform convergence bounds. In this manuscript we present a limitation of the PAC-Bayes framework. We demonstrate an easy learning task which is not amenable to a PAC-Bayes analysis.

Specifically, we consider the task of linear classification in 1D; it is well known that this task is learnable using just $O(\log(1/\delta)/\epsilon)$ examples. On the other hand, we show that this fact cannot be proved using a PAC-Bayes analysis: for any algorithm that learns 1-dimensional linear classifiers there exists a (realizable) distribution for which the PAC-Bayes bound is arbitrarily large.

1 Introduction

The classical setting of supervised binary classification considers learning algorithms that receive (binary) labelled examples and are required to output a predictor or a classifier that predicts the label of new and unseen examples. Within this setting, Probably Approximately Correct (PAC) generalization bounds quantify the success of an algorithm to approximately predict with high probability. The PAC-Bayes framework, introduced in [20, 31] and further developed in [19, 18, 27], provides PAC-flavored bounds for Bayesian algorithms that produce
Gibbs-classifiers (also called stochastic classifiers). Such algorithms, instead of outputting a single classifier, output a probability distribution over the family of classifiers. Their performance is measured by the expected success of prediction, where the expectation is taken with respect to both the sampled data and the sampled classifier.

A PAC-Bayes generalization bound relates the generalization error of the algorithm to the KL distance between the stochastic output classifier and some prior distribution $P$. In more detail, the bound is comprised of two terms: the empirical error of the output Gibbs-classifier, and a complexity term that scales with the KL distance between the output posterior and the prior (see Theorem 1 below).
2 Preliminaries

We consider the standard setting of binary classification. Let $\mathcal{X}$ denote the domain and $\mathcal{Y} = \{\pm 1\}$ the label space. We study learning algorithms that observe as input a sample $S$ of labelled examples drawn independently from an unknown target distribution $D$ supported on $\mathcal{X}\times\mathcal{Y}$. The output of the algorithm is an hypothesis $h : \mathcal{X}\to\mathcal{Y}$, and its goal is to minimize the 0/1-loss

$$L_D(h) = \mathbb{E}_{(x,y)\sim D}\bigl[\mathbb{1}[h(x)\neq y]\bigr].$$

We will focus on the setting where the distribution $D$ is realizable with respect to a fixed hypothesis class $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$ which is known in advance. That is, it is assumed that there exists $h^\star\in\mathcal{H}$ such that $L_D(h^\star) = 0$.

Let $S = \langle(x_1,y_1),\ldots,(x_m,y_m)\rangle\in(\mathcal{X}\times\mathcal{Y})^m$ be a sample of labelled examples. The empirical error $L_S$ with respect to $S$ is defined by

$$L_S(h) = \frac{1}{m}\sum_{i=1}^m \mathbb{1}[h(x_i)\neq y_i].$$

We will use the following notation: for a sample $S = \langle(x_1,y_1),\ldots,(x_m,y_m)\rangle$, let $\bar S$ denote the underlying set of unlabeled examples, $\bar S = \{x_i : i\leq m\}$.

The Class of Thresholds.
For $k\in\mathbb{N}$ let $h_k : \mathbb{N}\to\{\pm 1\}$ denote the threshold function

$$h_k(x) = \begin{cases} -1 & x\leq k, \\ +1 & x > k. \end{cases}$$

The class of thresholds $\mathcal{H}_{\mathbb{N}}$ is the class $\mathcal{H}_{\mathbb{N}} := \{h_k : k\in\mathbb{N}\}$ over the domain $\mathcal{X}_{\mathbb{N}} := \mathbb{N}$. Similarly, for a finite $n\in\mathbb{N}$ let $\mathcal{H}_n$ denote the class of all thresholds restricted to the domain $\mathcal{X}_n := [n] = \{1,\ldots,n\}$. Note that $S$ is realizable with respect to $\mathcal{H}_{\mathbb{N}}$ if and only if either (i) $y_i = +1$ for all $i\leq m$, or (ii) there exists $1\leq j\leq m$ such that $y_i = -1 \iff x_i\leq x_j$.

A basic fact in statistical learning is that $\mathcal{H}_{\mathbb{N}}$ is PAC-learnable. That is, there exists an algorithm $A$ such that for every realizable distribution $D$, if $A$ is given a sample of $O(\log(1/\delta)/\epsilon)$ examples drawn from $D$, then with probability at least $1-\delta$ the output hypothesis $h_S$ satisfies $L_D(h_S)\leq\epsilon$. In fact, any algorithm $A$ which returns an hypothesis $h_k\in\mathcal{H}_{\mathbb{N}}$ that is consistent with the input sample will satisfy the above guarantee. Such algorithms are called empirical risk minimizers (ERMs). We stress that the above sample complexity bound is independent of the domain size. In particular it applies to $\mathcal{H}_n$ for every $n$, as well as to the infinite class $\mathcal{H}_{\mathbb{N}}$. For further reading, we refer to textbooks on the subject, such as [29, 21].
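To make this concrete, here is a minimal Python sketch of an ERM for thresholds (our own illustration; the helper names are hypothetical and not from the text). It picks any consistent threshold and, empirically, its error does not depend on the domain size:

```python
import random

def erm_threshold(sample):
    """Return a threshold k consistent with a realizable sample.

    h_k labels x with -1 if x <= k and +1 otherwise, so the largest
    negatively-labelled point is always a consistent choice.
    """
    negatives = [x for x, y in sample if y == -1]
    return max(negatives) if negatives else 0  # all-positive sample: h_0

def sample_from(n, target_k, m):
    """Draw m examples with x uniform on [1, n], labelled by h_{target_k}."""
    xs = [random.randint(1, n) for _ in range(m)]
    return [(x, -1 if x <= target_k else +1) for x in xs]

# The ERM's error decays with m, independently of the domain size n.
n, target_k, m = 10**9, 10**9 // 3, 100
k_hat = erm_threshold(sample_from(n, target_k, m))
test = sample_from(n, target_k, 10000)
err = sum((-1 if x <= k_hat else +1) != y for x, y in test) / len(test)
print(f"estimated population error with m={m}: {err:.3f}")
```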
PAC-Bayes. PAC-Bayes bounds are concerned with stochastic classifiers, or Gibbs-classifiers. A Gibbs-classifier is defined by a distribution $Q$ over hypotheses. The distribution $Q$ is sometimes referred to as a posterior. The loss of a Gibbs-classifier with respect to a distribution $D$ is given by the expected loss over the drawn hypothesis and test point, namely:

$$L_D(Q) = \mathbb{E}_{h\sim Q,\,(x,y)\sim D}\bigl[\mathbb{1}[h(x)\neq y]\bigr].$$

A key advantage of the PAC-Bayes framework is its flexibility in deriving generalization bounds that do not depend on an hypothesis class. Instead, it provides bounds that depend on the KL distance between the output posterior and a fixed prior $P$. Recall that the KL divergence between distributions $P$ and $Q$ is defined as follows:

$$\mathrm{KL}(P\|Q) = \mathbb{E}_{x\sim P}\Bigl[\log\frac{P(x)}{Q(x)}\Bigr].$$
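For posteriors and priors with finite support, both quantities are straightforward to compute. The following sketch (our illustration, with assumed helper names) evaluates the Gibbs loss of a posterior represented as a dictionary over hypotheses, and the KL divergence above, with the usual convention that the divergence is infinite when $Q$ places mass where $P$ places none:

```python
import math

def gibbs_loss(posterior, examples):
    """Expected 0/1-loss of a Gibbs classifier.

    `posterior` maps a hypothesis (a callable x -> +-1) to its probability;
    the loss averages over both the drawn hypothesis and the examples.
    """
    return sum(q * sum(h(x) != y for x, y in examples) / len(examples)
               for h, q in posterior.items())

def kl(q, p):
    """KL(Q || P) for distributions given as dicts over hypotheses."""
    total = 0.0
    for h, qh in q.items():
        if qh == 0:
            continue
        ph = p.get(h, 0.0)
        if ph == 0:
            return math.inf  # Q puts mass where P puts none
        total += qh * math.log(qh / ph)
    return total
```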
Then, the classical PAC-Bayes bound asserts the following:

Theorem 1 (PAC-Bayes Generalization Bound [20]). Let $D$ be a distribution over examples, let $P$ be a prior distribution over hypotheses, and let $\delta > 0$. Denote by $S$ a sample of size $m$ drawn independently from $D$. Then, the following event occurs with probability at least $1-\delta$: for every posterior distribution $Q$,

$$L_D(Q) \leq L_S(Q) + O\left(\sqrt{\frac{\mathrm{KL}(Q\|P) + \ln(\sqrt{m}/\delta)}{m}}\right).$$

The above bound relates the generalization error to the KL divergence between the posterior and the prior. Remarkably, the prior distribution $P$ can be chosen as a function of the target distribution $D$, allowing one to obtain distribution-dependent generalization bounds.

3 Main Result

We next present the main result in this manuscript. Proofs are provided in Section 5. The statements use the following function $\Phi(m,\gamma,n)$, which is defined for $m, n > 0$ and $\gamma\in(0,1)$:

$$\Phi(m,\gamma,n) = \frac{\log^{(m)}(n)}{(m/\gamma)^{5m}}.$$

Here, $\log^{(k)}(x)$ denotes the iterated logarithm, i.e.

$$\log^{(k)}(x) = \underbrace{\log(\log(\cdots(\log}_{k\text{ times}}(x))\cdots)).$$

An important observation is that $\lim_{n\to\infty}\Phi(m,\gamma,n) = \infty$ for every fixed $m$ and $\gamma$.

Theorem 2 (Main Result). Let $n, m > 0$ be integers, and let $\gamma\in(0,1/2)$. Consider the class $\mathcal{H}_n$ of thresholds over the domain $\mathcal{X}_n = [n]$. Then, for any learning algorithm $A$ which is defined on samples of size $m$, there exists a realizable distribution $D = D_A$ such that for any prior $P$ the following event occurs with probability at least $1/8$ over the input sample $S\sim D^m$:

$$\mathrm{KL}(Q_S\|P) = \tilde\Omega\left(\frac{\gamma^2}{m^2}\log\Bigl(\frac{\Phi(m,\gamma,n)}{m}\Bigr)\right) \quad\text{or}\quad L_D(Q_S) \geq \frac{1}{2} - \gamma - \frac{m}{\Phi(m,\gamma,n)},$$

where $Q_S$ denotes the posterior outputted by $A$. (We use here the standard convention that if $P(\{x : Q(x) = 0\}) > 0$ then $\mathrm{KL}(P\|Q) = \infty$.)
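The next snippet (our illustration, using the reconstruction of $\Phi$ displayed above) shows just how slowly the iterated logarithm grows; $n$ must be of tower-like magnitude in $m$ before $\Phi(m,\gamma,n)$, and with it the KL lower bound of Theorem 2, becomes appreciable:

```python
import math

def iterated_log(x, k):
    """log^{(k)}(x): apply the natural logarithm k times."""
    for _ in range(k):
        x = math.log(x)
    return x

def phi(m, gamma, n):
    """Phi(m, gamma, n) = log^{(m)}(n) / (m/gamma)^{5m}, as defined above."""
    return iterated_log(n, m) / (m / gamma) ** (5 * m)

# log^{(3)} of an 80-digit number is already below 2: Phi grows to infinity
# with n for fixed m and gamma, but at an iterated-logarithmic rate.
print(iterated_log(10**80, 3))    # ~1.65
print(phi(2, 0.25, 10**80) > 0)   # True, but the value is tiny
```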
To demonstrate how this result implies a limitation of the PAC-Bayes framework, pick $\gamma = 1/4$ and consider any algorithm $A$ which learns thresholds over the natural numbers $\mathcal{X}_{\mathbb{N}} = \mathbb{N}$ with confidence $1-\delta$, error $\epsilon < 1/2 - \gamma = 1/4$, and $m$ examples. Since $\Phi(m, 1/4, n)$ tends to infinity with $n$ for any fixed $m$, the above result implies the existence of a realizable distribution $D_n$ supported on $\mathcal{X}_n\subseteq\mathbb{N}$ such that the PAC-Bayes bound with respect to any possible prior $P$ will produce vacuous bounds. We summarize it in the following corollary.

Corollary 1 (PAC-learnability of linear classifiers cannot be explained by PAC-Bayes). Let $\mathcal{H}_{\mathbb{N}}$ denote the class of thresholds over $\mathcal{X}_{\mathbb{N}} = \mathbb{N}$ and let $m > 0$. Then, for every algorithm $A$ that maps input samples $S$ of size $m$ to output posteriors $Q_S$ and for every arbitrarily large $N > 0$ there exists a realizable distribution $D$ such that, for any prior $P$, with probability at least $1/8$ over $S\sim D^m$ one of the following holds:

$$\mathrm{KL}(Q_S\|P) > N \quad\text{or}\quad L_D(Q_S) > 1/8.$$

A different interpretation of Theorem 2 is that in order to derive meaningful PAC-Bayes generalization bounds for PAC-learning thresholds over a finite domain $\mathcal{X}_n$, the sample complexity must grow to infinity with the domain size $n$ (it is at least $\Omega(\log^\star(n))$). In contrast, the true sample complexity of this problem is $O(\log(1/\delta)/\epsilon)$, which is independent of $n$.

4 Proof Techniques

A common approach for proving impossibility results in computer science (and in machine learning in particular) exploits a minmax principle, whereby one specifies a fixed hard distribution over inputs, and establishes the desired impossibility result for any algorithm with respect to random inputs from that distribution. As an example, consider the "No-Free-Lunch Theorem", which establishes that the VC dimension lower bounds the sample complexity of PAC-learning a class $\mathcal{H}$. Here, one fixes the distribution to be uniform over a shattered set of size $d = \mathrm{VC}(\mathcal{H})$, and argues that every learning algorithm must observe $\Omega(d)$ examples. (See e.g. Theorem 5.1 in [29].)

Such "minmax" proofs establish a stronger assertion: they apply even to algorithms that "know" the input distribution. For example, the No-Free-Lunch Theorem applies even to learning algorithms that are designed given the knowledge that the marginal distribution is uniform over some shattered set.

Interestingly, such an approach is bound to fail in proving Theorem 2. The reason is that if the marginal distribution $D_{\mathcal{X}}$ over $\mathcal{X}_n$ is fixed, then one can pick an $\epsilon/2$-cover $\mathcal{C}_n\subseteq\mathcal{H}_n$ of size $|\mathcal{C}_n| = O(1/\epsilon)$, and use any empirical risk minimizer for $\mathcal{C}_n$.
(We note in passing that any empirical risk minimizer learns thresholds with these parameters using < 50 examples.) Here, an $\epsilon/2$-cover means that $\mathcal{C}_n$ satisfies

$$(\forall h\in\mathcal{H}_n)(\exists c\in\mathcal{C}_n) : \Pr_{x\sim D_{\mathcal{X}}}\bigl(c(x)\neq h(x)\bigr) \leq \epsilon/2.$$

Then, by picking the prior distribution $P$ to be uniform over $\mathcal{C}_n$, one obtains a PAC-Bayes bound which scales with the entropy $H(P) = \log|\mathcal{C}_n| = O(\log(1/\epsilon))$, and yields a $\mathrm{poly}(1/\epsilon, \log(1/\delta))$ generalization bound, which is independent of $n$. In other words, in the context of Theorem 2, there is no single distribution which is "hard" for all algorithms.
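A sketch of this cover construction (in Python, with hypothetical names): two thresholds disagree exactly on the probability mass between them, so greedily taking a threshold after every $\epsilon$ of cumulative mass yields an $\epsilon$-cover whose size depends only on $\epsilon$:

```python
def epsilon_cover(cdf, n, eps):
    """Threshold indices forming an eps-cover of H_n under a fixed marginal.

    `cdf(k)` = Pr_{x ~ D_X}[x <= k].  Thresholds h_k and h_k' disagree
    exactly on the mass strictly between k and k', so one threshold per
    eps of cumulative mass suffices.
    """
    cover, last_mass = [0], 0.0
    for k in range(1, n + 1):
        if cdf(k) - last_mass >= eps:
            cover.append(k)
            last_mass = cdf(k)
    return cover

# Uniform marginal on [1, n]: the cover size is ~1/eps, independent of n,
# so a prior uniform over the cover gives a non-vacuous PAC-Bayes bound.
n, eps = 10**6, 0.05
print(len(epsilon_cover(lambda k: k / n, n, eps)))  # about 1/eps
```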
Thus, to overcome this difficulty one must come up with a "method" which assigns to any given algorithm $A$ a "hard" distribution $D = D_A$ which witnesses Theorem 2 with respect to $A$. The challenge is that $A$ is an arbitrary algorithm; e.g. it may be improper or add different sorts of noise to its output classifier. (In particular, $A$ may output hypotheses which are not thresholds, or Gibbs-classifiers supported on hypotheses which are not thresholds.)

The "method" we use in the proof of Theorem 2 exploits Ramsey theory. In a nutshell, Ramsey theory provides powerful tools which allow one to detect, for any learning algorithm, a large homogeneous set such that the behavior of $A$ on inputs from the homogeneous set is highly regular. Then, we consider the uniform distribution over the homogeneous set to establish Theorem 2.

We note that similar applications of Ramsey theory in proving lower bounds in computer science date back to the 80's [22]. For more recent usages see e.g. [7, 9, 8, 1]. Our proof closely follows the argument of [1], which establishes an impossibility result for learning $\mathcal{H}_n$ by differentially-private algorithms.

Technical Comparison with the Work by Alon et al. [1].
For readers who are familiar with the work of [1], let us summarize the main differences between the two proofs. The main challenge in extending the technique from [1] to prove Theorem 2 is that PAC-Bayes bounds are only required to hold for typical samples. This is unlike the notion of differential privacy (which was the focus of [1]), which is defined with respect to all samples. Thus, establishing a lower bound in the context of differential privacy is easier: one only needs to demonstrate a single sample for which privacy is breached. However, to prove Theorem 2 one has to demonstrate that the lower bound applies to many samples. Concretely, this affects the following parts of the proof:

(i) The Ramsey argument in the current manuscript (Lemma 1) is more complex: to overcome the above difficulty we needed to modify the coloring, and the overall construction is more convoluted.

(ii) Once Ramsey's Theorem is applied and the homogeneous subset $R_n\subseteq\mathcal{X}_n$ is derived, one still needs to derive a lower bound on the PAC-Bayes quantity. This requires a technical argument (Lemma 2) which is tailored to the definition of PAC-Bayes. Again, this lemma is more complicated than the corresponding lemma in [1].

(iii) Even with Lemma 1 and Lemma 2 in hand, the remaining derivation of Theorem 2 still requires a careful analysis which involves defining several "bad" events and bounding their probabilities. Again, this is all a consequence of the fact that the PAC-Bayes quantity is an "average-case" complexity measure.

4.1 Proof Sketch and Key Definitions

The proof of Theorem 2 consists of two steps: (i) detecting a hard distribution $D = D_A$ which witnesses Theorem 2 with respect to the assumed algorithm $A$, and (ii) establishing the conclusion of Theorem 2 given the hard distribution $D$. The first part is combinatorial (it exploits Ramsey theory), and the second part is more information-theoretic. For the purpose of exposition, we focus in this technical overview on a specific algorithm $A$. This will make the introduction of the key definitions and the presentation of the main technical tools more accessible.

The algorithm $A$. Let $S = \langle(x_1,y_1),\ldots,(x_m,y_m)\rangle$ be an input sample. The algorithm $A$ outputs the posterior distribution $Q_S$ which is defined as follows: let $h_{x_i} = \mathbb{1}[x > x_i] - \mathbb{1}[x\leq x_i]$ denote the threshold corresponding to the $i$'th input example. The posterior $Q_S$ is supported on $\{h_{x_i}\}_{i=1}^m$, and to each $h_{x_i}$ it assigns a probability according to a decreasing function of its empirical risk. (So, hypotheses with lower risk are more probable.) The specific choice of the decreasing function does not matter, but for concreteness let us pick the function $\exp(-x)$. Thus,

$$Q_S(h_{x_i}) \propto \exp\bigl(-L_S(h_{x_i})\bigr). \tag{1}$$

While one can directly prove that the above algorithm does not admit a PAC-Bayes analysis, we provide here an argument which follows the lines of the general case. We start by explaining the key property of homogeneity, which allows us to detect the hard distribution.
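Equation (1) fully specifies $A$; the following Python sketch is a direct transcription (the function name, and the representation of $h_{x_i}$ by the point $x_i$ itself, are our choices):

```python
import math

def algorithm_A(sample):
    """The posterior Q_S of Equation (1): supported on the thresholds
    h_{x_i} induced by the input points, with mass proportional to
    exp(-L_S(h_{x_i})).  h_{x_i} predicts +1 iff x > x_i, so we represent
    it by the point x_i."""
    m = len(sample)
    def emp_risk(t):
        return sum((+1 if x > t else -1) != y for x, y in sample) / m
    weights = {x_i: math.exp(-emp_risk(x_i)) for x_i, _ in sample}
    z = sum(weights.values())
    return {x_i: w / z for x_i, w in weights.items()}

S = [(3, -1), (10, -1), (25, +1), (40, +1)]
print(algorithm_A(S))  # thresholds with lower empirical risk get more mass
```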
The first step in the proof of Theorem 2 takes the given algorithm and identifies a large subset of the domain on which its behavior is homogeneous.
In particular, we will soon see that the algorithm $A$ is homogeneous on the entire domain $\mathcal{X}_n$. In order to define homogeneity, we use the following equivalence relation between samples:

Definition 1 (Equivalent Samples). Let $S = \langle(x_1,y_1),\ldots,(x_m,y_m)\rangle$ and $S' = \langle(x'_1,y'_1),\ldots,(x'_m,y'_m)\rangle$ be two samples. We say that $S$ and $S'$ are equivalent if for all $i, j\leq m$ the following holds:
1. $x_i\leq x_j \iff x'_i\leq x'_j$, and
2. $y_i = y'_i$.

For example, $\langle(1,-),(5,+),(8,+)\rangle$ and $\langle(10,-),(70,+),(100,+)\rangle$ are equivalent, but $\langle(3,-),(6,+),(4,+)\rangle$ is not equivalent to them (because of Item 1).

For a point $x\in\mathcal{X}_n$, let $\mathrm{pos}(x; S)$ denote the number of examples in $S$ that are less than or equal to $x$:

$$\mathrm{pos}(x; S) = \bigl|\{x_i\in\bar S : x_i\leq x\}\bigr|. \tag{2}$$

For a sample $S = \langle(x_1,y_1),\ldots,(x_m,y_m)\rangle$, let $\pi(S)$ denote the order-type of $S$:

$$\pi(S) = \bigl(\mathrm{pos}(x_1; S), \mathrm{pos}(x_2; S), \ldots, \mathrm{pos}(x_m; S)\bigr). \tag{3}$$

So, the samples $\langle(1,-),(5,+),(8,+)\rangle$ and $\langle(10,-),(70,+),(100,+)\rangle$ have order-type $\pi = (1,2,3)$, whereas $\langle(3,-),(6,+),(4,+)\rangle$ has order-type $\pi = (1,3,2)$. Note that $S, S'$ are equivalent if and only if they have the same labels-vectors and the same order-type. Thus, we encode the equivalence class of a sample by the pair $(\pi,\bar y)$, where $\pi$ denotes its order-type and $\bar y = (y_1\ldots y_m)$ denotes its labels-vector. The pair $(\pi,\bar y)$ is called the equivalence-type of $S$.

We claim that $A$ satisfies the following property of homogeneity:

Property 1 (Homogeneity). The algorithm $A$ possesses the following property: for every two equivalent samples $S, S'$ and every $x, x'\in\mathcal{X}_n$ such that $\mathrm{pos}(x, S) = \mathrm{pos}(x', S')$,

$$\Pr_{h\sim Q_S}[h(x) = 1] = \Pr_{h'\sim Q_{S'}}[h'(x') = 1],$$

where $Q_S, Q_{S'}$ denote the Gibbs-classifiers outputted by $A$ on the samples $S, S'$.

In short, homogeneity means that the probability that $h\sim Q_S$ satisfies $h(x) = 1$ depends only on $\mathrm{pos}(x, S)$ and on the equivalence-type of $S$. To see that $A$ is indeed homogeneous, let $S, S'$ be equivalent samples and let $Q_S, Q_{S'}$ denote the corresponding Gibbs-classifiers outputted by $A$. Then, for every $x, x'$ such that $\mathrm{pos}(x, S) = \mathrm{pos}(x', S')$, Equation (1) yields that

$$\Pr_{h\sim Q_S}\bigl[h(x) = +1\bigr] = \sum_{i\,:\,x_i < x} Q_S(h_{x_i}) = \frac{\sum_{i : x_i < x}\exp\bigl(-L_S(h_{x_i})\bigr)}{\sum_{i'\leq m}\exp\bigl(-L_S(h_{x_{i'}})\bigr)}.$$

This quantity depends only on the equivalence-type of $S$ and on $\mathrm{pos}(x, S)$: the empirical errors $L_S(h_{x_i})$ are determined by the equivalence-type, and the set $\{i : x_i < x\}$ is determined by $\mathrm{pos}(x, S)$. The same expression is therefore obtained for $S'$ and $x'$.

Before we continue to define the hard distribution for algorithm $A$, let us discuss how the proof of Theorem 2 handles arbitrary algorithms that are not necessarily homogeneous. The general case complicates the argument in two ways. First, the notion of homogeneity is relaxed to an approximate variant which is defined next. Here, an order-type $\pi$ is called a permutation if $\pi(i)\neq\pi(j)$ for every distinct $i, j\leq m$. (Indeed, in this case $\pi = (\pi(x_1)\ldots\pi(x_m))$ is a permutation of $1\ldots m$.) Note that the order-type of $S = \langle(x_1,y_1)\ldots(x_m,y_m)\rangle$ is a permutation if and only if all the points in $S$ are distinct (i.e. $x_i\neq x_j$ for all $i\neq j$).

Definition 2 (Approximate Homogeneity). An algorithm $B$ is $\gamma$-approximately $m$-homogeneous if the following holds: let $S, S'$ be two equivalent samples of length $m$ whose order-type is a permutation, and let $x\notin\bar S$, $x'\notin\bar S'$ be such that $\mathrm{pos}(x, S) = \mathrm{pos}(x', S')$. Then,

$$\bigl|Q_S(x) - Q_{S'}(x')\bigr| \leq \frac{\gamma}{8m}, \tag{4}$$

where $Q_S, Q_{S'}$ denote the Gibbs-classifiers outputted by $B$ on the samples $S, S'$, and we use the shorthand $Q_S(x) := \Pr_{h\sim Q_S}[h(x) = 1]$.
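These definitions are mechanical to implement. The sketch below (our own helper names) computes $\mathrm{pos}$, the order-type, and the equivalence relation of Definition 1, and reproduces the examples given above:

```python
def pos(x, xs):
    """pos(x; S): number of sample points less than or equal to x (Eq. (2))."""
    return sum(1 for xi in xs if xi <= x)

def order_type(sample):
    """pi(S) = (pos(x_1; S), ..., pos(x_m; S)) (Eq. (3))."""
    xs = [x for x, _ in sample]
    return tuple(pos(xi, xs) for xi in xs)

def equivalent(s1, s2):
    """Definition 1: same labels-vector and same order-type."""
    return ([y for _, y in s1] == [y for _, y in s2]
            and order_type(s1) == order_type(s2))

a = [(1, -1), (5, +1), (8, +1)]
b = [(10, -1), (70, +1), (100, +1)]
c = [(3, -1), (6, +1), (4, +1)]
print(order_type(a), order_type(c))        # (1, 2, 3) vs (1, 3, 2)
print(equivalent(a, b), equivalent(a, c))  # True False
```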
Second, we need to identify a sufficiently large subdomain on which the assumed algorithm is approximately homogeneous. This is achieved by the next lemma, which is based on a Ramsey argument.

Lemma 1 (Large Approximately Homogeneous Sets). Let $m, n > 0$ and $\gamma\in(0,1)$, and let $B$ be an algorithm that is defined over input samples of size $m$ over $\mathcal{X}_n$. Then, there is $\mathcal{X}'\subseteq\mathcal{X}_n$ of size $|\mathcal{X}'|\geq\Phi(m,\gamma,n)$ such that the restriction of $B$ to input samples from $\mathcal{X}'$ is $\gamma$-approximately $m$-homogeneous.

We prove Lemma 1 in Section 5.2. For the rest of this exposition we rely on Property 1, as it simplifies the presentation of the main ideas.

The Hard Distribution $D$. We are now ready to finish the first step and define the "hard" distribution $D$. Define $D$ to be uniform over examples $(x,y)$ such that $y = h_{n/2}(x)$. So, each drawn example $(x,y)$ satisfies that $x$ is uniform in $\mathcal{X}_n$ and $y = -1$ if and only if $x\leq n/2$. In the general case, $D$ will be defined in the same way with respect to the detected homogeneous subdomain.

Homogeneity $\Longrightarrow$ Lower Bound: Sensitivity. We next outline the second step of the proof, which establishes Theorem 2 using the hard distribution $D$. Specifically, we show that for a sample $S\sim D^m$,

$$\mathrm{KL}(Q_S\|P) = \tilde\Omega\Bigl(\frac{1}{m^2}\log(|\mathcal{X}_n|)\Bigr),$$

with a constant probability bounded away from zero. (In the general case $|\mathcal{X}_n|$ is replaced by $\Phi(m,\gamma,n)$ – the size of the homogeneous set.)

Sensitive Indices. We begin with describing the key property of homogeneous learners. Let $(\pi,\bar y)$ denote the equivalence-type of the input sample $S$. By homogeneity (Property 1), there is a list of numbers $p_0,\ldots,p_m$, which depends only on the equivalence-type $(\pi,\bar y)$, such that $\Pr_{h\sim Q_S}[h(x) = 1] = p_i$ for every $x\in\mathcal{X}_n\setminus\bar S$, where $i = \mathrm{pos}(x, S)$. The crucial observation is that there exists an index $i\leq m$ which is sensitive in the sense that

$$p_i - p_{i-1} \geq \frac{1}{m}. \tag{5}$$

Indeed, consider $x_j$ such that $h_{x_j} = \arg\min_k L_S(h_{x_k})$, and let $i = \mathrm{pos}(x_j, S)$. Then,

$$p_i - p_{i-1} = Q_S(h_{x_j}) = \frac{\exp\bigl(-L_S(h_{x_j})\bigr)}{\sum_{i'\leq m}\exp\bigl(-L_S(h_{x_{i'}})\bigr)} \geq \frac{1}{m},$$

where the last inequality holds because the numerator is the largest of the $m$ terms in the denominator. In the general case we show that any homogeneous algorithm that learns $\mathcal{H}_n$ satisfies Equation (5) for typical samples, with $1/m$ replaced by $\gamma/(2m)$ (see Claim 1). The intuition is that any algorithm that learns the distribution $D$ must output a Gibbs-classifier $Q_S$ such that for typical points $x$: if $x > n/2$ then $\Pr_{h\sim Q_S}[h(x)=1]\approx 1$, and if $x\leq n/2$ then $\Pr_{h\sim Q_S}[h(x)=1]\approx 0$. Thus, when traversing all $x$'s from 1 up to $n$, there must be a jump between $p_{i-1}$ and $p_i$ for some $i$.

From Sensitive Indices to a Lower Bound on the KL-divergence. How do sensitive indices imply a lower bound on PAC-Bayes? This is the most technical part of the proof. The crux of it is a connection between sensitivity and the KL-divergence, which we discuss next. Consider a sensitive index $i$ and let $x_j$ be the input example such that $\mathrm{pos}(x_j, S) = i$. For $\hat x\in\mathcal{X}_n$, let $S_{\hat x}$ denote the sample obtained by replacing $x_j$ with $\hat x$:

$$S_{\hat x} = \langle(x_1,y_1),\ldots,(x_{j-1},y_{j-1}),(\hat x, y_j),(x_{j+1},y_{j+1}),\ldots,(x_m,y_m)\rangle,$$

and let $Q_{\hat x} := Q_{S_{\hat x}}$ denote the posterior outputted by $A$ given the sample $S_{\hat x}$. Consider the set $I\subseteq\mathcal{X}_n$ of all points $\hat x$ such that $S_{\hat x}$ is equivalent to $S$. Homogeneity implies that for every distinct $x, \hat x\in I$,

$$\Pr_{h\sim Q_{\hat x}}[h(x) = 1] = \begin{cases} p_{i-1} & x < \hat x, \\ p_i & x > \hat x. \end{cases}$$

Combined with the fact that $p_i - p_{i-1}\geq 1/m$, this implies a lower bound on the KL-divergence between an arbitrary prior $P$ and $Q_{\hat x}$ for most $\hat x\in I$.
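For the concrete algorithm $A$ this is easy to see numerically: the list $p_0,\ldots,p_m$ is the running sum of the posterior of Equation (1), so one of the $m$ jumps must be at least $1/m$. A self-contained sketch (hypothetical names; it re-implements $Q_S$ from Equation (1)):

```python
import math

def posterior_A(sample):
    """Q_S from Equation (1): mass on each support point x_i (standing for
    h_{x_i}) proportional to exp(-L_S(h_{x_i}))."""
    m = len(sample)
    risk = lambda t: sum((+1 if x > t else -1) != y for x, y in sample) / m
    w = {x: math.exp(-risk(x)) for x, _ in sample}
    z = sum(w.values())
    return {x: wx / z for x, wx in w.items()}

def profile(sample):
    """The list p_0, ..., p_m: p_i = Pr_{h~Q_S}[h(x)=1] for any fresh x
    with pos(x, S) = i, i.e. the posterior mass on thresholds below x."""
    Q = posterior_A(sample)
    ps = [0.0]
    for x in sorted(Q):
        ps.append(ps[-1] + Q[x])
    return ps

S = [(3, -1), (10, -1), (25, +1), (40, +1)]
ps = profile(S)
jumps = [ps[i] - ps[i - 1] for i in range(1, len(ps))]
print(max(jumps) >= 1 / len(S))  # True: the m jumps sum to 1, as in Eq. (5)
```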
The above discussion is summarized in the following lemma:

Lemma 2 (Sensitivity Lemma). Let $I$ be a linearly ordered set and let $\{Q_{\hat x}\}_{\hat x\in I}$ be a family of posteriors supported on $\{\pm 1\}^I$. Suppose there are $q_1 < q_2\in[0,1]$ such that for every $x, \hat x\in I$:

$$x < \hat x \implies \Pr_{h\sim Q_{\hat x}}[h(x)=1] \leq q_1 + \frac{q_2-q_1}{4}, \qquad x > \hat x \implies \Pr_{h\sim Q_{\hat x}}[h(x)=1] \geq q_2 - \frac{q_2-q_1}{4}.$$

Then, for every prior distribution $P$, if $\hat x\in I$ is drawn uniformly at random, the following event occurs with probability at least $1/4$:

$$\mathrm{KL}(Q_{\hat x}\|P) = \Omega\Bigl((q_2-q_1)^2\,\frac{\log|I|}{\log\log|I|}\Bigr).$$

The sensitivity lemma tells us that in the above situation, the KL divergence between $Q_{\hat x}$ and any prior $P$, for a random choice of $\hat x$, scales in terms of two quantities: the distance between the two values, $q_2 - q_1$, and the size of $I$.

The proof of Lemma 2 is provided in Section 5.3. In a nutshell, the strategy is to bound from below $\mathrm{KL}(Q^r_{\hat x}\|P^r)$, where $r$ is sufficiently small; the desired lower bound then follows from the chain rule, $\mathrm{KL}(Q_{\hat x}\|P) = \frac{1}{r}\mathrm{KL}(Q^r_{\hat x}\|P^r)$. Obtaining the lower bound with respect to the $r$-fold products is the crux of the proof. In short, we will exhibit events $E_{\hat x}$ such that $Q^r_{\hat x}(E_{\hat x})\geq 2/3$ for every $\hat x\in I$, but $P^r(E_{\hat x})$ is tiny for at least a quarter of the $\hat x$'s. This implies a lower bound on $\mathrm{KL}(Q^r_{\hat x}\|P^r)$, since

$$\mathrm{KL}(Q^r_{\hat x}\|P^r) \geq \mathrm{KL}\bigl(Q^r_{\hat x}(E_{\hat x})\,\|\,P^r(E_{\hat x})\bigr)$$

by the data-processing inequality (here $\mathrm{KL}(a\|b)$ on numbers denotes the KL divergence between Bernoulli distributions with parameters $a$ and $b$).
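Both ingredients of this strategy are elementary to check numerically. The sketch below (our illustration) computes the Bernoulli KL divergence and confirms that an event with $Q^r(E)\geq 2/3$ but $P^r(E)\leq 2^{-(b-2)}$ already forces a divergence of order $b$:

```python
import math

def binary_kl(a, b):
    """kl(a || b): KL divergence between Bernoulli(a) and Bernoulli(b)."""
    if b in (0.0, 1.0):
        return 0.0 if a == b else math.inf
    out = 0.0
    if a > 0:
        out += a * math.log(a / b)
    if a < 1:
        out += (1 - a) * math.log((1 - a) / (1 - b))
    return out

# Data processing: if Q^r(E) >= 2/3 while P^r(E) <= 2^-(b-2), then
# KL(Q^r || P^r) >= kl(2/3 || 2^-(b-2)) = Omega(b).
b = 40
print(binary_kl(2 / 3, 2.0 ** -(b - 2)))  # roughly (2/3)*(b-2)*ln(2)
```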
Wrapping Up. We now continue in deriving a lower bound for $A$. Consider an input sample $S\sim D^m$. In order to apply Lemma 2, fix any equivalence-type $(\pi,\bar y)$ with a sensitive index $i$ and let $x_j$ be such that $\mathrm{pos}(x_j; S) = i$. The key step is to condition the random sample $S$ on $(\pi,\bar y)$ as well as on $\{x_t\}_{t=1}^m\setminus\{x_j\}$ – all sample points besides the sensitive point $x_j$. Thus, only $x_j$ remains to be drawn in order to fully specify $S$. Note then, that by symmetry $\hat x$ is uniformly distributed in a set $I\subseteq\mathcal{X}_n$, and plugging $q_2 := p_i$, $q_1 := p_{i-1}$ in Lemma 2 yields that for any prior distribution $P$:

$$\mathrm{KL}(Q_S\|P) \geq \tilde\Omega\Bigl(\frac{1}{m^2}\log(|I|)\Bigr),$$

with probability at least $1/4$. Note that we are not quite done, since the size $|I|$ is a random variable which depends on the type $(\pi,\bar y)$ and the sample points $\{x_k\}_{k\neq j}$. However, the distribution of $|I|$ can be analyzed by elementary tools. In particular, we show that $|I|\geq\Omega(|\mathcal{X}_n|/m^2)$ with high enough probability, which yields the desired lower bound on the PAC-Bayes quantity. (In the general case $|\mathcal{X}_n|$ is replaced by the size of the homogeneous set.)

5 Proofs

5.1 Proof of Theorem 2

Let $A$ be an algorithm as in the premise of Theorem 2. That is, $A$ receives as input a labeled sample $S$ of length $m$ and outputs a posterior $Q_S$. By Lemma 1, there exists $\mathcal{X}'\subseteq\mathcal{X}_n$ of size $|\mathcal{X}'| = k\geq\Phi(m,\gamma,n)$ such that the restriction of $A$ to inputs from $\mathcal{X}'$ is $\gamma$-approximately $m$-homogeneous. Without loss of generality, assume that $\mathcal{X}' = \mathcal{X}_k$ consists of the first $k$ points in $\mathcal{X}_n$ and that $k$ is an even number.

By the definition of approximate homogeneity (Definition 2) it follows that for every equivalence-type $(\pi,\bar y)$, where $\pi$ is a permutation, there is a list $(p^{(\pi,\bar y)}_i)_{i=0}^m\in[0,1]^{m+1}$ such that for every sample $S\in(\mathcal{X}_k\times\{\pm 1\})^m$ whose type is $(\pi,\bar y)$ and every $x\in\mathcal{X}_k\setminus\bar S$:

$$\bigl|Q_S(x) - p^{(\pi,\bar y)}_i\bigr| \leq \frac{\gamma}{8m}, \quad\text{where } i = \mathrm{pos}(x, S).$$

For the rest of the proof fix $D$ to be the distribution over examples $(x,y)$ such that $x$ is drawn uniformly from $\mathcal{X}_k$ and $y = -1$ if and only if $x\leq k/2$. The underlying property we will require is summarized in the following claim:

Claim 1. Let $(\pi,\bar y)$ be an equivalence-type, where $\pi$ is a permutation. Then, one of the following holds: either there exists a sensitive index $1\leq i\leq m$ such that

$$\bigl|p^{(\pi,\bar y)}_i - p^{(\pi,\bar y)}_{i-1}\bigr| \geq \frac{\gamma}{2m}, \tag{6}$$

or else, $L_D(Q_S)\geq\frac{1}{2}-\gamma-\frac{m}{k}$ with probability 1 over $S\sim D^m(\cdot\,|\,(\pi,\bar y))$.

The proof of Claim 1 is deferred to Section 5.1.1. With Claim 1 in hand we proceed with the proof of Theorem 2. Let $S$ be a sample and let $(\pi,\bar y)$ denote its equivalence-type. Define an interval $I(S)\subseteq\mathcal{X}_k$ as follows:

• If $\pi$ is not a permutation then $I(S) = \emptyset$.
• If $(\pi,\bar y)$ does not have a sensitive index that satisfies Equation (6) then $I(S) = \emptyset$.
• Finally, if $\pi$ is a permutation and $(\pi,\bar y)$ has a sensitive index $i$ (for concreteness, let $i$ be the minimal sensitive index) then set

$$I(S) = \begin{cases} (x_j^-, x_j^+) & \frac{k}{2}\notin(x_j^-, x_j^+), \\ (x_j^-, \frac{k}{2}] & \frac{k}{2}\in(x_j^-, x_j^+) \text{ and } y_j = -1, \\ (\frac{k}{2}, x_j^+) & \frac{k}{2}\in(x_j^-, x_j^+) \text{ and } y_j = +1, \end{cases}$$

where $x_j$ is such that $\mathrm{pos}(x_j; S) = i$, and $x_j^- = \max(\{x_t : x_t < x_j\}\cup\{0\})$ and $x_j^+ = \min(\{x_t : x_t > x_j\}\cup\{k+1\})$.

We next define two events which will be used to finish the proof. First, consider the event that the drawn sample $S$ satisfies either

$$\mathrm{KL}(Q_S\|P) = \Omega\Bigl(\frac{\gamma^2}{m^2}\,\frac{\log|I(S)|}{\log\log|I(S)|}\Bigr), \tag{7}$$

or

$$L_D(Q_S) \geq \frac{1}{2} - \gamma - \frac{m}{k}. \tag{8}$$

(We use here the convention that $\frac{\log x}{\log\log x} = -\infty$ for $x\leq 2$; alternatively, one can assume that Equation (7) holds vacuously if $|I(S)| = 0$.) We show that this event occurs with probability at least $1/4$:

Claim 2. Define $E_1$ to be the event

$$E_1 = \bigl\{S\in(\mathcal{X}_k\times\{\pm 1\})^m : S \text{ satisfies Equation (7) or Equation (8)}\bigr\}.$$

Then, $E_1$ occurs with probability at least $1/4$ over $S\sim D^m$.

The proof of Claim 2 is deferred to Section 5.1.2. The second event we consider is that the drawn sample $S$ satisfies either Equation (8) or

$$|I(S)| \geq \frac{\Phi(m,\gamma,n)}{8(m+1)^2}. \tag{9}$$

We show that this event occurs with probability at least $7/8$:

Claim 3. Define $E_2$ to be the event $E_2 = \{S : S \text{ satisfies Equation (8) or Equation (9)}\}$. Then $E_2$ occurs with probability at least $7/8$ over $S\sim D^m$.

The proof of Claim 3 is deferred to Section 5.1.3. With Claims 2 and 3 in hand, the proof of Theorem 2 is completed as follows. First, a union bound implies that the event $E_1\cap E_2$ occurs with probability at least $1/8$. That is, with probability at least $1/8$ either Equation (8) holds and we are done, or else, if Equation (8) doesn't hold, then both Equations (7) and (9) hold simultaneously, which yields that

$$\mathrm{KL}(Q_S\|P) \geq \Omega\Bigl(\frac{\gamma^2}{m^2}\,\frac{\log|I(S)|}{\log\log|I(S)|}\Bigr) \geq \Omega\Biggl(\frac{\gamma^2}{m^2}\,\frac{\log\frac{\Phi(m,\gamma,n)}{8(m+1)^2}}{\log\log\frac{\Phi(m,\gamma,n)}{8(m+1)^2}}\Biggr),$$

where the first inequality is by Equation (7) and the second by Equation (9). This concludes the proof of Theorem 2. We are thus left with proving Claims 1 to 3.

5.1.1 Proof of Claim 1

Let $(\pi,\bar y)$ be an equivalence-type such that $\pi$ is a permutation. Assume that

$$L_D(Q_S) < \frac{1}{2} - \gamma - \frac{m}{k} \tag{10}$$

occurs with positive probability over $S\sim D^m(\cdot\,|\,\pi,\bar y)$. We first show that there is $i$ such that

$$\bigl|p^{(\pi,\bar y)}_i - p^{(\pi,\bar y)}_0\bigr| > \gamma/2. \tag{11}$$

Indeed, assume the contrary, and fix a sample $S$ with type $(\pi,\bar y)$ which satisfies Equation (10). Recall that $A$ is homogeneous on $\mathcal{X}_k$, hence for every $x\notin\bar S$,

$$\bigl|Q_S(x) - p^{(\pi,\bar y)}_i\bigr| \leq \frac{\gamma}{8m}, \quad\text{where } i = \mathrm{pos}(x, S).$$

On the other hand, since Equation (11) is not met by any $i$, it follows that for every $x\notin\bar S$:

$$\bigl|Q_S(x) - p^{(\pi,\bar y)}_0\bigr| \leq \bigl|Q_S(x) - p^{(\pi,\bar y)}_i\bigr| + \bigl|p^{(\pi,\bar y)}_i - p^{(\pi,\bar y)}_0\bigr| \leq \frac{\gamma}{8m} + \frac{\gamma}{2} \leq \gamma.$$

Thus, $\Pr_{h\sim Q_S}[h(x) = 1]\in[p^{(\pi,\bar y)}_0 - \gamma,\ p^{(\pi,\bar y)}_0 + \gamma]$ for every $x\in\mathcal{X}_k\setminus\bar S$.
Now, since $\Pr_{(x,y)\sim D}[y = 1] = 1/2$, it follows that

$$L_D(Q_S) \geq \frac{1}{2} - \gamma - \frac{m}{k}.$$

Indeed, for every $x\notin\bar S$: if $x\leq k/2$ then $h\sim Q_S$ errs on $x$ with probability at least $q_1 = p^{(\pi,\bar y)}_0 - \gamma$, and if $x > k/2$ then $h\sim Q_S$ errs on $x$ with probability at least $q_2 = 1 - (p^{(\pi,\bar y)}_0 + \gamma)$. Thus, the expected loss of $h\sim Q_S$ conditioned on $x\notin\bar S$ is at least $\frac{q_1+q_2}{2} = \frac{1}{2} - \gamma$, and the above inequality follows by taking into account that $h\sim Q_S$ may have zero error on the $m$ points in $\bar S$. This contradicts Equation (10).

Finally, let $i$ be some index that satisfies Equation (11); then, because $1\leq i\leq m$, we obtain via telescoping that there must be some $i'\leq i$ such that

$$\bigl|p^{(\pi,\bar y)}_{i'} - p^{(\pi,\bar y)}_{i'-1}\bigr| \geq \frac{\gamma}{2m}.$$

5.1.2 Proof of Claim 2

It is enough to show that $E_1$ occurs with probability at least $1/4$ over $S\sim D^m(\cdot\,|\,\pi,\bar y)$ for every fixed equivalence-type $(\pi,\bar y)$. Indeed, by summing over all equivalence-types, the law of total probability then implies that $E_1$ occurs with probability at least $1/4$ over $S\sim D^m$.

Fix an equivalence-type $(\pi,\bar y)$. We may assume that $\pi$ is a permutation and that $(\pi,\bar y)$ has a sensitive index (or else, by Claim 1, Equation (8) holds with probability 1 and we are done). If Equation (8) holds with probability at least $1/4$ then $E_1$ occurs with probability at least $1/4$ and we are done. It thus suffices to show that Equation (7) holds with probability at least $1/4$. By Claim 1, there is a sensitive index $i$ such that

$$\bigl|p^{(\pi,\bar y)}_i - p^{(\pi,\bar y)}_{i-1}\bigr| \geq \frac{\gamma}{2m}.$$

Let $x_j$ in $S$ be such that $\mathrm{pos}(x_j; S) = i$. It will be convenient to consider the following (slightly convoluted) process of sampling a pair of (correlated) samples from $D^m(\cdot\,|\,\pi,\bar y)$:

1. Sample $T = \langle(x_1,y_1)\ldots(x_m,y_m)\rangle\sim D^m(\cdot\,|\,\pi,\bar y)$.
2. Resample only the sensitive point $x_j$ while keeping all other points fixed, as well as the equivalence-type $(\pi,\bar y)$. Let $\hat x$ denote the newly sampled point and let $T_{\hat x}$ denote the sample obtained by replacing $x_j$ by $\hat x$.
3. Set $S = T_{\hat x}$.

Note that both $T$ and $S$ are drawn from $D^m(\cdot\,|\,\pi,\bar y)$ and that $I(T) = I(S)$ always. Since the marginal distribution of $D$ is uniform over $\mathcal{X}_k$, by symmetry it follows that the point $\hat x$ drawn in Step 2 is uniform in the interval $I(T) = I(S)$. Our next step is to apply Lemma 2 to the family of distributions $\{Q_{T_{\hat x}}\}_{\hat x\in I(T)}$. Towards this end, we first fix $T$ and show that the premise of Lemma 2 is satisfied, with $I = I(T)$, $q_1 = p^{(\pi,\bar y)}_{i-1}$ and $q_2 = p^{(\pi,\bar y)}_i$. (Here we assume without loss of generality that $p^{(\pi,\bar y)}_{i-1} < p^{(\pi,\bar y)}_i$; if the reverse inequality holds, then the argument follows by applying Lemma 2 with respect to the reverse linear order over $I(T)$.) Indeed, by homogeneity it follows that for each $x\in I(T)$, if $x < \hat x$ then

$$\Bigl|\Pr_{h\sim Q_{\hat x}}[h(x)=1] - p^{(\pi,\bar y)}_{i-1}\Bigr| \leq \frac{\gamma}{8m} \leq \frac{\bigl|p^{(\pi,\bar y)}_i - p^{(\pi,\bar y)}_{i-1}\bigr|}{4} \qquad\text{(because $i$ is sensitive),}$$

and similarly, if $x > \hat x$:

$$\Bigl|\Pr_{h\sim Q_{\hat x}}[h(x)=1] - p^{(\pi,\bar y)}_i\Bigr| \leq \frac{\bigl|p^{(\pi,\bar y)}_i - p^{(\pi,\bar y)}_{i-1}\bigr|}{4}.$$

Thus, applying Lemma 2 to the family $\{Q_{T_{\hat x}}\}_{\hat x\in I(T)}$ yields that for every $T$ sampled in Step 1, the following holds with probability at least $1/4$ over $\hat x$:

$$\mathrm{KL}(Q_S\|P) = \mathrm{KL}(Q_{T_{\hat x}}\|P) \geq \Omega\Bigl(\bigl(p^{(\pi,\bar y)}_i - p^{(\pi,\bar y)}_{i-1}\bigr)^2\,\frac{\log|I(T)|}{\log\log|I(T)|}\Bigr) \geq \Omega\Bigl(\frac{\gamma^2}{m^2}\,\frac{\log|I(S)|}{\log\log|I(S)|}\Bigr).$$

Note that the above holds for any fixed $T$. Taking expectation over $T$, it follows that with probability at least $1/4$ over $S\sim D^m(\cdot\,|\,(\pi,\bar y))$,

$$\mathrm{KL}(Q_S\|P) \geq \Omega\Bigl(\frac{\gamma^2}{m^2}\,\frac{\log|I(S)|}{\log\log|I(S)|}\Bigr).$$

As discussed, taking expectation over the equivalence-type concludes the proof.
5.1.3 Proof of Claim 3

Consider $S\sim D^m$ where $S = \langle(x_1,y_1),\ldots,(x_m,y_m)\rangle$. We claim that with probability at least $7/8$, every two unlabeled examples $x_i, x_j$ with $i\neq j$ are at distance at least $\frac{k}{8(m+1)^2}$ from each other and from $k/2$. Indeed, fix any distinct $x', x''\in\{x_1,\ldots,x_m,k/2\}$. Recall that the points $x_1,\ldots,x_m$ are sampled uniformly and independently from $\mathcal{X}_k$. Thus, the probability that $0\leq x'-x'' < \frac{k}{8(m+1)^2}$ is at most $\frac{1}{8(m+1)^2}$. A union bound over all ordered pairs (there are at most $(m+1)^2$ of them) implies that the following holds with probability at least $7/8$ over $S\sim D^m$:

$$\Bigl(\forall \text{ distinct } x', x''\in\bigl\{x_1,\ldots,x_m,\tfrac{k}{2}\bigr\}\Bigr) : \ |x'-x''| \geq \frac{k}{8(m+1)^2}. \tag{12}$$

We will now show that the latter event implies $E_2$. Let $S$ be a sample satisfying Equation (12). In particular, $x_i\neq x_j$ for every distinct $i, j\leq m$, and so the order-type $\pi = \pi(S)$ is a permutation. Now, if $S$ satisfies Equation (8) then $S\in E_2$ and we are done. Else, by Claim 1 there exists a sensitive index that satisfies Equation (6), and therefore $I(S) = (x', x'')$, where $x', x''$ are distinct points in $\{x_1,\ldots,x_m,k/2\}\cup\{0, k+1\}$. Thus,

$$|I(S)| \geq \frac{k}{8(m+1)^2},$$

and, since $k\geq\Phi(m,\gamma,n)$, Equation (9) holds, which also gives $S\in E_2$. Thus, every $S$ which satisfies Equation (12) is in $E_2$, and so $E_2$ occurs with probability at least $7/8$.
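The separation event of Equation (12) is also easy to validate by simulation. The following sketch (our own illustration, with arbitrary choices of $k$ and $m$) estimates its probability:

```python
import random

def well_separated(xs, k, m):
    """Event (12): the points x_1..x_m together with k/2 are pairwise at
    distance at least k / (8 * (m + 1)**2)."""
    pts = sorted(xs + [k / 2])
    gap = k / (8 * (m + 1) ** 2)
    return all(b - a >= gap for a, b in zip(pts, pts[1:]))

k, m, trials = 10**6, 10, 10_000
hits = sum(well_separated([random.randint(1, k) for _ in range(m)], k, m)
           for _ in range(trials))
print(hits / trials)  # empirically ~0.9, consistent with the 7/8 guarantee
```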
5.2 Proof of Lemma 1

We next prove Lemma 1, which establishes the existence of a "largish" homogeneous set with respect to an arbitrary algorithm $A$.

Notation. Recall from Equation (2) the definition of $\mathrm{pos}(x, S)$, which was defined for a sample $S$ and a point $x$. It will be convenient to extend this definition to sets: for $R\subseteq\mathcal{X}_n$ and $x\in\mathcal{X}_n$ define $\mathrm{pos}(x, R) = |\{x'\in R : x'\leq x\}|$.

From Sets to Samples. Let $(\pi,\bar y)$ be an equivalence-type whose order-type is a permutation and let $D = \{x_1 < \ldots < x_m\}\subseteq\mathcal{X}_n$ be a set of $m$ points. Denote by $D^{\pi,\bar y} = \langle(x_{i_j}, y_{i_j})\rangle_{j=1}^m$ the sample obtained by ordering and labeling the elements of $D$ such that $D^{\pi,\bar y}$ has type $(\pi,\bar y)$; that is, $D^{\pi,\bar y}$ is defined such that for every $j\leq m$,

$$\pi(j) = \mathrm{pos}(x_{i_j}, D^{\pi,\bar y}) = \mathrm{pos}(x_{i_j}, D) \quad\text{and}\quad \bar y = (y_1,\ldots,y_m). \tag{13}$$

A Coloring. We define a coloring over subsets $D\subseteq\mathcal{X}_n$ of size $|D| = m+1$. Let $D = \{x_0 < x_1 < \ldots < x_m\}$ be an $(m+1)$-subset of $\mathcal{X}_n$. The coloring assigned to $D$ is

$$c(D) = \bigl\{(p^{\pi,\bar y}_0,\ldots,p^{\pi,\bar y}_m) : (\pi,\bar y) \text{ is an equivalence-type s.t. } \pi \text{ is a permutation}\bigr\},$$

where each $p^{\pi,\bar y}_i$ is defined as follows: let $D_{-i} = D\setminus\{x_i\}$. For each equivalence-type $(\pi,\bar y)$ such that $\pi$ is a permutation, consider the sample $D^{\pi,\bar y}_{-i}$ (see Equation (13)), and define $p^{\pi,\bar y}_i$ to be the fraction of the form $t\cdot\frac{\gamma}{8m}$ for $t\in\mathbb{N}$ which is closest to $\Pr_{h\sim Q^{\pi,\bar y}_{-i}}[h(x_i) = 1]$, where $Q^{\pi,\bar y}_{-i}$ is the stochastic classifier obtained by applying $A$ to $D^{\pi,\bar y}_{-i}$. Since the total number of equivalence-types whose order-type is a permutation is at most $m!\cdot 2^m$, it follows that the total number of colors is at most

$$m!\cdot 2^m\cdot\Bigl\lceil\frac{8m}{\gamma}+1\Bigr\rceil^{m+1} \leq (m/\gamma)^{4m}.$$

Ramsey. We next apply Ramsey's Theorem to derive a large $\mathcal{X}'\subseteq\mathcal{X}_n$ such that every subset $D\subseteq\mathcal{X}'$ of size $m+1$ has the same color. Later we will argue that $A$ is $\gamma$-approximately homogeneous with respect to $\mathcal{X}'$, which will finish the proof. We will use the following quantitative version of Ramsey's Theorem due to [13] (see also the book [14], or Theorem 10.1 in the survey by [23]). Here, the tower function $\mathrm{twr}_k(x)$ is defined by the recursion

$$\mathrm{twr}_i(x) = \begin{cases} x & i = 1, \\ 2^{\mathrm{twr}_{i-1}(x)} & i > 1. \end{cases}$$

Theorem 3 (Ramsey Theorem [13]). Let $s > t\geq 2$ and $q$ be integers, and let $N\geq\mathrm{twr}_t(3sq\log q)$. Then, for every coloring of the subsets of size $t$ of a universe of size $N$ using $q$ colors there is a homogeneous subset of size $s$. (A subset of the universe is homogeneous if all of its $t$-subsets have the same color.)

Stated differently, Theorem 3 guarantees the existence of a homogeneous subset of size

$$\frac{\log^{(t-1)}(N)}{3q\log q}. \tag{14}$$

Thus, by plugging $q := (m/\gamma)^{4m}$, $t := m+1$, $N := n$ in Equation (14) we get a homogeneous set $\mathcal{X}'\subseteq\mathcal{X}_n$ of size

$$|\mathcal{X}'| \geq \frac{\log^{(m)}(n)}{3(m/\gamma)^{4m}\cdot 4m\log(m/\gamma)} \geq \frac{\log^{(m)}(n)}{(m/\gamma)^{5m}} = \Phi(m,\gamma,n).$$

Wrapping-up. It remains to show that $A$ is $\gamma$-approximately homogeneous with respect to $\mathcal{X}'$. By the construction of $\mathcal{X}'$ there exists a specific color

$$L = \bigl\{(p^{\pi,\bar y}_i)_{i=0}^m : (\pi,\bar y) \text{ is an equivalence-type s.t. } \pi \text{ is a permutation}\bigr\}$$

such that $c(D) = L$ for every $D = \{x_0 < \ldots < x_m\}\subseteq\mathcal{X}'$. We need to show that for every pair of equivalent samples $S, S'$ whose order-type is a permutation and for every $x\in\mathcal{X}'\setminus\bar S$, $x'\in\mathcal{X}'\setminus\bar S'$ such that $\mathrm{pos}(x, S) = \mathrm{pos}(x', S')$:

$$\Bigl|\Pr_{h\sim Q_S}[h(x)=1] - \Pr_{h'\sim Q_{S'}}[h'(x')=1]\Bigr| \leq \frac{\gamma}{8m}.$$

Let $(\pi,\bar y)$ be an equivalence-type such that $\pi$ is a permutation, let $S$ be any sample over $\mathcal{X}'$ whose equivalence-type is $(\pi,\bar y)$, and let $x\in\mathcal{X}'\setminus\bar S$. Consider the set $D = \{x_j : j\leq m\}\cup\{x\}$ and set $i = \mathrm{pos}(x, S)$. By the definition of $D^{\pi,\bar y}_{-i}$ we have $D^{\pi,\bar y}_{-i} = S$, and hence, by the definition of $p^{\pi,\bar y}_i$,

$$\Bigl|\Pr_{h\sim Q_S}[h(x)=1] - p^{\pi,\bar y}_i\Bigr| = \Bigl|\Pr_{h\sim Q^{\pi,\bar y}_{-i}}[h(x)=1] - p^{\pi,\bar y}_i\Bigr| \leq \frac{\gamma}{16m}.$$

Since the latter holds for every sample $S$ whose order-type is $(\pi,\bar y)$ and every $x\notin\bar S$, it follows that for every pair of samples $S, S'$ whose order-type is $(\pi,\bar y)$ and every $x\in\mathcal{X}'\setminus\bar S$, $x'\in\mathcal{X}'\setminus\bar S'$ such that $i := \mathrm{pos}(x, S) = \mathrm{pos}(x', S')$:

$$\Bigl|\Pr_{h\sim Q_S}[h(x)=1] - \Pr_{h'\sim Q_{S'}}[h'(x')=1]\Bigr| \leq \Bigl|\Pr_{h\sim Q_S}[h(x)=1] - p^{\pi,\bar y}_i\Bigr| + \Bigl|\Pr_{h'\sim Q_{S'}}[h'(x')=1] - p^{\pi,\bar y}_i\Bigr| \leq \frac{\gamma}{16m} + \frac{\gamma}{16m} = \frac{\gamma}{8m}.$$

This finishes the proof.

5.3 Proof of Lemma 2

Notation. We will assume without loss of generality that $I = \{1, 2, 3, \ldots, |I|\}$. Also, to simplify the presentation, we will assume that $|I|$ is a power of 2, i.e. $|I| = 2^b$ for some $b\in\mathbb{N}$. (Removing this assumption is straightforward, but complicates some of the notation.)

Overview. Let $P$ be an arbitrary prior supported on $\{\pm 1\}^I$. Our goal is to show that at least $1/4$ of all $\hat x$'s in $I$ satisfy

$$\mathrm{KL}(Q_{\hat x}\|P) \geq \Omega\Bigl((q_2-q_1)^2\,\frac{\log|I|}{\log\log|I|}\Bigr) = \Omega\Bigl((q_2-q_1)^2\,\frac{b}{\log(b)}\Bigr).$$

The proof strategy is to bound from below $\mathrm{KL}(Q^m_{\hat x}\|P^m)$, where $m$ is sufficiently small (within this section, $m$ denotes the number of independent copies, not the sample size); the desired lower bound then follows from the chain rule:

$$\mathrm{KL}(Q_{\hat x}\|P) = \frac{1}{m}\mathrm{KL}(Q^m_{\hat x}\|P^m).$$
Obtaining the lower bound with respect to the $m$-fold products is the crux of the proof. In a nutshell, we will exhibit events $E_{\hat x}$ such that for every odd $\hat x\in I$, $Q^m_{\hat x}(E_{\hat x})\geq 2/3$, but for $1/4$ of the $\hat x$'s, $P^m(E_{\hat x})$ is tiny. This implies a lower bound on $\mathrm{KL}(Q^m_{\hat x}\|P^m)$, since

$$\mathrm{KL}(Q^m_{\hat x}\|P^m) \geq \mathrm{KL}\bigl(Q^m_{\hat x}(E_{\hat x})\,\|\,P^m(E_{\hat x})\bigr)$$

by the data-processing inequality.

Construction of the Events $E_{\hat x}$. For every Gibbs-classifier $Q\in\{Q_{\hat x} : \hat x\in I\}\cup\{P\}$ define its rounded hypothesis $h_Q : I\to\{\pm 1\}$ as follows:

$$h_Q(x) = \begin{cases} -1 & \Pr_{h\sim Q}[h(x)=1] \leq \frac{q_1+q_2}{2}, \\ +1 & \Pr_{h\sim Q}[h(x)=1] > \frac{q_1+q_2}{2}. \end{cases}$$

To simplify notation, let $h_{\hat x} = h_{Q_{\hat x}}$. Note that by the assumption of Lemma 2:

$$h_{\hat x}(x) = \begin{cases} -1 & x < \hat x, \\ +1 & x > \hat x. \end{cases} \tag{15}$$

In words, each $h_{\hat x}$ is a threshold with a sign-change either right before $\hat x$ or right after it. Next, given $h : I\to\{\pm 1\}$, consider the following iterative process, which applies binary search on $h$ towards detecting a pair of subsequent coordinates which contain a sign-change.

Binary-Search. Input: $h : I\to\{\pm 1\}$.
1. Set $I_0 = [a_0, b_0]$, where $a_0 = 0$, $b_0 = |I| = 2^b$.
2. For $j = 0, 1, \ldots$:
   (a) If $|I_j|\leq 2$ then output $I_j$.
   (b) Query the coordinate $h(m_j)$, where $m_j = \frac{a_j + b_j}{2}$.
   (c) If $h(m_j) = +1$ then set $a_{j+1} = a_j$, $b_{j+1} = m_j$.
   (d) Else, set $a_{j+1} = m_j$, $b_{j+1} = b_j$.

The following observations follow from the standard analysis of binary search:
1. The process ends after $b-1$ iterations, and each of the points $m_j$ queried in Item (b) is an even number.
2. If the process is applied to a threshold $h$ which changes sign from $-$ to $+$ between $x$ and $x+1$, then the output interval $I_{out}$ contains the sign-change. Thus, by Equation (15), if we apply this process to $h = h_{\hat x}$ for odd $\hat x$ then $\hat x\in I_{out}$.

Given a sequence of hypotheses $h_1,\ldots,h_m : I\to\{\pm 1\}$, define the empirical rounded hypothesis $\bar h_{h^m}$ by:

$$\bar h_{h^m}(x) = \begin{cases} -1 & \frac{1}{m}\sum_{i=1}^m\mathbb{1}[h_i(x)=1] \leq \frac{q_1+q_2}{2}, \\ +1 & \frac{1}{m}\sum_{i=1}^m\mathbb{1}[h_i(x)=1] > \frac{q_1+q_2}{2}. \end{cases}$$

Consider $h_1,\ldots,h_m\sim Q_{\hat x}$ for an odd $\hat x\in I$. The following claim shows that with high probability, applying the binary search to $\bar h_{h^m}$ yields an output interval $I_{out}$ such that $\hat x\in I_{out}$.

Claim 4. Let $\hat x\leq 2^b$ be an odd number. Let $J_{out}$ denote the interval outputted by applying the binary search to $h_{\hat x}$, and let $I_{out}$ denote the interval outputted by applying the binary search to $\bar h_{h^m}$, where $h_1,\ldots,h_m\sim Q_{\hat x}$ are drawn independently. Then,

$$\Pr_{h_1\ldots h_m\sim Q^m_{\hat x}}[I_{out}\neq J_{out}] \leq b\cdot\exp\Bigl(-\frac{m(q_2-q_1)^2}{8}\Bigr).$$

In particular, if $m = \frac{8(\ln(b)+2)}{(q_2-q_1)^2}$ then $\Pr[\hat x\notin I_{out}]\leq 1/3$.

Proof. Let $x_0, x_1, \ldots, x_{b-2}$ be the coordinates queried by the binary search on $h_{\hat x}$. We will show that with high probability $\bar h_{h^m}(x_i) = h_{\hat x}(x_i)$ for every $i$, which implies that $J_{out} = I_{out}$. Let $i\leq b-2$ and let

$$\mu_i = \Pr_{h\sim Q_{\hat x}}[h(x_i) = +1].$$

Note that $\hat x\neq x_i$ (because $x_i$ is even and $\hat x$ is odd). Therefore, by the assumption of Lemma 2:

$$\mu_i \begin{cases} \leq q_1 + \frac{q_2-q_1}{4} = \frac{q_1+q_2}{2} - \frac{q_2-q_1}{4} & x_i < \hat x, \\ \geq q_2 - \frac{q_2-q_1}{4} = \frac{q_1+q_2}{2} + \frac{q_2-q_1}{4} & x_i > \hat x. \end{cases}$$

Hence, by a Chernoff bound (stated for $x_i < \hat x$; the case $x_i > \hat x$ is symmetric):

$$\Pr_{h_1\ldots h_m}\bigl[\bar h_{h^m}(x_i)\neq h_{\hat x}(x_i)\bigr] \leq \Pr_{h_1\ldots h_m}\Bigl[\frac{1}{m}\sum_{j=1}^m\mathbb{1}[h_j(x_i)=1] \geq \mu_i + \frac{q_2-q_1}{4}\Bigr] \leq \exp\Bigl(-\frac{m(q_2-q_1)^2}{8}\Bigr).$$

Thus, by taking a union bound over all $i\leq b-2$, we get that $\bar h_{h^m}(x_i) = h_{\hat x}(x_i)$ for every $i\leq b-2$ with probability at least $1 - \log(|I|)\cdot\exp(-m(q_2-q_1)^2/8)$. In particular, with the above probability we have $J_{out} = I_{out}$. Lastly, assume $m = \frac{8(\ln(b)+2)}{(q_2-q_1)^2}$. Then $b\cdot\exp(-m(q_2-q_1)^2/8) = b\cdot e^{-(\ln(b)+2)} = e^{-2}\leq 1/3$, and therefore $\Pr[J_{out} = I_{out}]\geq 2/3$. Since $h_{\hat x}$ is a threshold which changes sign either right before $\hat x$ or right after $\hat x$, it follows that $\hat x\in J_{out}$, and therefore $\Pr[\hat x\in I_{out}]\geq 2/3$.
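A runnable transcription of the binary-search process (our own; the interval is represented by its endpoints, and the assumptions on $h$ and the domain size follow the text):

```python
def binary_search_sign_change(h, size):
    """The process from the text: given h : {1,...,size} -> {-1,+1} with
    size a power of two, repeatedly halve [a, b] by querying the (even)
    midpoint, until the interval (a, b] contains at most 2 points."""
    a, b = 0, size
    while b - a > 2:
        mid = (a + b) // 2        # even: a and b stay even throughout
        if h(mid) == +1:
            b = mid               # sign change at or before mid
        else:
            a = mid               # sign change after mid
    return a, b                   # the surviving interval is (a, b]

# A threshold flipping from -1 to +1 at an odd point x_hat is bracketed:
size, x_hat = 2**10, 437
h = lambda x: +1 if x >= x_hat else -1
a, b = binary_search_sign_change(h, size)
print(a < x_hat <= b)  # True
```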
We are now ready to define the events $E_{\hat x}$. Set $m = \frac{8(\ln(b)+2)}{(q_2-q_1)^2}$, according to Claim 4, and let $E_{\hat x}$ denote the event that $\hat x\in I_{out}$. That is, $E_{\hat x}$ is the set of all sequences $h_1,\ldots,h_m$ such that $\hat x\in I_{out}$, where $I_{out}$ is the interval outputted by the binary search on $\bar h_{h^m}$. Thus, Claim 4 says that $Q^m_{\hat x}(E_{\hat x})\geq 2/3$ for every odd $\hat x$.

Bounding the KL-divergence. We next use the events $E_{\hat x}$ to lower bound $\mathrm{KL}(Q_{\hat x}\|P)$ for odd $\hat x$:

$$\begin{aligned} \mathrm{KL}(Q_{\hat x}\|P) &= \frac{1}{m}\mathrm{KL}(Q^m_{\hat x}\|P^m) &&\text{(chain rule)} \\ &\geq \frac{1}{m}\mathrm{KL}\bigl(Q^m_{\hat x}(E_{\hat x})\,\|\,P^m(E_{\hat x})\bigr) &&\text{(data-processing inequality)} \\ &\geq \frac{1}{m}\Bigl(-\frac{2}{3}\log\frac{3}{2} - \frac{1}{3}\log 3 - \frac{2}{3}\log P^m(E_{\hat x})\Bigr) \\ &\geq \frac{-\frac{2}{3}\log\bigl(P^m(E_{\hat x})\bigr) - 2}{m}. \end{aligned}$$

Therefore, to lower bound $\mathrm{KL}(Q_{\hat x}\|P)$ it suffices to show that $P^m(E_{\hat x})$ is small. We next establish this for $1/4$ of the $\hat x$'s in $I$. Note that whenever $\hat x_1, \hat x_2\in I$ are odd and distinct, then $E_{\hat x_1}\cap E_{\hat x_2} = \emptyset$. Indeed, this follows since the outputted interval $I_{out}$ is of size $\leq 2$ and hence contains at most one odd number. Thus,

$$\sum_{\hat x \text{ odd}} P^m(E_{\hat x}) \leq 1.$$

In particular, since there are $2^{b-1}$ odd numbers in $I$, at least half of them must satisfy $P^m(E_{\hat x})\leq 2^{-(b-2)}$. Taken together, we obtain that at least $1/4$ of all $\hat x\in I$ satisfy:

$$\mathrm{KL}(Q_{\hat x}\|P) \geq \frac{\frac{2}{3}(b-2)\log 2 - 2}{m} = \frac{\bigl(\frac{2}{3}(b-2)\log 2 - 2\bigr)(q_2-q_1)^2}{8(\ln(b)+2)} = \Omega\Bigl((q_2-q_1)^2\,\frac{b}{\log(b)}\Bigr),$$

which finishes the proof of Lemma 2.

6 Discussion

In this work we presented a limitation of the PAC-Bayes framework by showing that PAC-learnability of one-dimensional thresholds cannot be established using PAC-Bayes.

Perhaps the biggest caveat of our result is the mild dependence of the bound on the size of the domain in Theorem 2. In fact, Theorem 2 does not exclude the possibility of PAC-learning thresholds over $\mathcal{X}_n$ with sample complexity that scales with $O(\log^* n)$ such that the PAC-Bayes bound vanishes. It would be interesting to explore this possibility; one promising direction is to borrow ideas from the differential privacy literature: [3] and [6] designed a private learning algorithm for thresholds with sample complexity $\exp(\log^* n)$; this bound was later improved by [15] to $\tilde O((\log^* n)^{1.5})$. Also, [5] showed that finite Littlestone dimension is sufficient for private learnability, and it would be interesting to extend these results to the context of PAC-Bayes. Let us note that in the context of pure differential privacy, the connection between PAC-Bayes analysis and privacy has been established in [12].

Another aspect is the implication of our work for learning algorithms beyond the uniform PAC setting. Indeed, many successful and practical algorithms exhibit sample complexity that depends on the target distribution; e.g., the $k$-Nearest-Neighbor algorithm eventually learns any target distribution (with a distribution-dependent rate). The first point we address in this context concerns interpolating algorithms. These are learners that achieve zero (or close to zero) training error (i.e. they interpolate the training set). Examples of such algorithms include kernel machines, boosting, random forests, as well as deep neural networks [4, 26]. PAC-Bayes analysis has been utilized in this context, for example, to provide margin-dependent generalization guarantees for kernel machines [17]. It is therefore natural to ask whether our lower bound has implications in this context.
As a simple case-study, consider the 1-Nearest-Neighbour algorithm. Observe that this algorithm forms a proper and consistent learner for the class of 1-dimensional thresholds (indeed, given any realizable sample it will output the threshold which maximizes the margin), and therefore enjoys a very fast learning rate. On the other hand, our result implies that for any algorithm (including 1-Nearest-Neighbor) that is amenable to a PAC-Bayes analysis, there is a distribution realizable by thresholds on which it has high population error. Thus, no algorithm with a PAC-Bayes generalization bound can match the performance of nearest-neighbour with respect to such distributions.

Finally, this work also relates to a recent attempt to explain generalization through the implicit bias of learning algorithms: it is commonly argued that the generalization performance of algorithms can be explained by an implicit algorithmic bias. Building upon the flexibility of providing distribution-dependent generalization bounds, the PAC-Bayes framework has seen a resurgence of interest in this context towards explaining generalization in large-scale, modern-time practical algorithms [24, 25, 11, 12, 2]. Indeed, PAC-Bayes bounds seem to provide non-vacuous bounds in several relevant domains [16, 12]. Nevertheless, the work here shows that any algorithm that can learn 1D thresholds is necessarily not biased, in the PAC-Bayes sense, towards a (possibly distribution-dependent) prior. We mention that recently, [10] showed that SGD's generalization performance indeed cannot be attributed to some implicit bias of the algorithm that governs the generalization.

Acknowledgements

The authors would like to acknowledge Steve Hanneke for suggesting and encouraging them to write this manuscript.

References

[1] Noga Alon, Roi Livni, Maryanthe Malliaris, and Shay Moran. Private PAC learning implies finite Littlestone dimension. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 852–860, 2019.

[2] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.

[3] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. Theory of Computing, 12(1):1–61, 2016.

[4] Mikhail Belkin, Daniel J. Hsu, and Partha Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems, pages 2300–2311, 2018.

[5] Mark Bun, Roi Livni, and Shay Moran. An equivalence between private classification and online prediction. arXiv preprint arXiv:2003.00563, 2020.

[6] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil Vadhan. Differentially private release and learning of threshold functions. In IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 634–649. IEEE, 2015.

[7] Mark Bun. New Separations in the Complexity of Differential Privacy. PhD thesis, Harvard University, Graduate School of Arts & Sciences, 2016.

[8] Alon Cohen, Avinatan Hassidim, Haim Kaplan, Yishay Mansour, and Shay Moran. Learning to screen. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 8612–8621, 2019.

[9] José R. Correa, Paul Dütting, Felix A. Fischer, and Kevin Schewior. Prophet inequalities for i.i.d.
random variables from an unknown distribution. In Proceedings of the 2019 ACM Conference on Economics and Computation (EC 2019), pages 3–17. ACM, 2019.

[10] Assaf Dauber, Meir Feder, Tomer Koren, and Roi Livni. Can implicit bias explain generalization? Stochastic convex optimization as a case study. arXiv preprint arXiv:2003.06152, 2020.

[11] Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.

[12] Gintare Karolina Dziugaite and Daniel M. Roy. Data-dependent PAC-Bayes priors via differential privacy. In Advances in Neural Information Processing Systems, pages 8430–8441, 2018.

[13] Paul Erdős and Richard Rado. Combinatorial theorems on classifications of subsets of a given set. Proceedings of the London Mathematical Society, 3(1):417–439, 1952.

[14] Ronald L. Graham, Bruce L. Rothschild, and Joel H. Spencer. Ramsey Theory, volume 20. John Wiley & Sons, 1990.

[15] Haim Kaplan, Katrina Ligett, Yishay Mansour, Moni Naor, and Uri Stemmer. Privately learning thresholds: Closing the exponential gap. arXiv preprint arXiv:1911.10137, 2019.

[16] John Langford and Rich Caruana. (Not) bounding the true error. In Advances in Neural Information Processing Systems, pages 809–816, 2002.

[17] John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Advances in Neural Information Processing Systems, pages 439–446, 2003.

[18] David McAllester. Simplified PAC-Bayesian margin bounds. In Learning Theory and Kernel Machines, pages 203–215. Springer, 2003.

[19] David A. McAllester. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 164–170, 1999.

[20] David A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.

[21] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.

[22] Shlomo Moran, Marc Snir, and Udi Manber. Applications of Ramsey's theorem to decision tree complexity. Journal of the ACM (JACM), 32(4):938–949, 1985.

[23] Dhruv Mubayi and Andrew Suk. A survey of hypergraph Ramsey problems. arXiv preprint arXiv:1707.04229, 2017.

[24] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.

[25] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.

[26] Ruslan Salakhutdinov. Deep learning tutorial at the Simons Institute, Berkeley, 2017.

[27] Matthias Seeger. PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3(Oct):233–269, 2002.

[28] Matthias Seeger, John Langford, and Nimrod Megiddo. An improved predictive accuracy bound for averaging classifiers. In Proceedings of the 18th International Conference on Machine Learning, pages 290–297, 2001.

[29] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[30] John Shawe-Taylor and David Hardoon. PAC-Bayes analysis of maximum entropy classification. In Artificial Intelligence and Statistics, pages 480–487, 2009.

[31] John Shawe-Taylor and Robert C. Williamson.
A PAC analysis of a Bayesian estimator. In Proceedings of the Tenth Annual Conference on Computational Learning Theory, pages 2–9, 1997.