A note on the price of bandit feedback for mistake-bounded online learning
Jesse Geneson

February 2, 2021
Abstract
The standard model and the bandit model are two generalizations of the mistake-bound model to online multiclass classification. In both models the learner guesses a classification in each round, but in the standard model the learner receives the correct classification after each guess, while in the bandit model the learner is only told whether or not their guess is correct in each round. For any set F of multiclass classifiers, define opt_std(F) and opt_bandit(F) to be the optimal worst-case number of prediction mistakes in the standard and bandit models respectively.

Long (Theoretical Computer Science, 2020) claimed that for all M > 1 and infinitely many k, there exists a set F of functions from a set X to a set Y of size k such that opt_std(F) = M and opt_bandit(F) ≥ (1 − o(1))(|Y| ln |Y|) opt_std(F). The proof of this result depended on the following lemma, which is false e.g. for all primes p ≥ 3, s = (the all 1 vector), t = (the all 2 vector), and all z.

Lemma: Fix n ≥ 2 and prime p, and let u be chosen uniformly at random from {0, . . . , p − 1}^n. For any s, t ∈ {1, . . . , p − 1}^n with s ≠ t and for any z ∈ {0, . . . , p − 1}, we have Pr(t · u = z mod p | s · u = z mod p) = 1/p.

We show that this lemma is false precisely when s and t are multiples of each other mod p. Then using a new lemma, we fix Long's proof.

Keywords: online learning, bandit feedback, mistake-bound model, learning theory
Auer et al. [1] introduced two generalizations of the mistake-bound model [5] called the standard model and the bandit model. Let F be a set of functions from some set X to a finite set Y. In the standard model, the adversary selects some f ∈ F that the learner does not know. In each round t of the learning process, the adversary gives the learner some x_t ∈ X, the learner predicts the output of f with input x_t, and the adversary tells them the correct value of f(x_t). The bandit model is similar, except at the end of each round, the adversary tells the learner yes or no instead of the correct value of f(x_t). In both models, in any round the adversary may change the function f to any other function in F as long as it is consistent with their previous answers.

The goal of the learner in each model is to minimize the number of prediction mistakes, while the adversary wants to maximize the number of prediction mistakes. Define opt_std(F) and opt_bandit(F) to be the number of prediction mistakes in the standard and bandit models respectively if both the learner and adversary play optimally.

Long [6] proved that opt_bandit(F) ≤ (1 + o(1))(|Y| ln |Y|) opt_std(F) for all such F, and claimed that the upper bound is best possible up to the leading constant. In order to show this, Long claimed that for all M > 1 and infinitely many k, there exists a set F of functions from a set X to a set Y of size k such that opt_std(F) = M and opt_bandit(F) ≥ (1 − o(1))(|Y| ln |Y|) opt_std(F). The proof used probabilistic methods inspired by [9, 10, 4, 7].

In particular, part of the proof used Chebyshev's inequality and required a set of random variables to be pairwise independent. The pairwise independence was proved using the following lemma, which is false e.g. for all primes p ≥ 3, s = (the all 1 vector), t = (the all 2 vector), and all z.
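Since the vectors involved are short, this failure (and the claimed value 1/p) is easy to check by brute-force enumeration. Below is a minimal sketch; the parameters p = 5 and n = 2 are arbitrary small choices made here for illustration, not values from [6]:

```python
from itertools import product

def cond_prob(s, t, z, p):
    """Pr(t*u = z mod p | s*u = z mod p) for u uniform on {0,...,p-1}^n."""
    dot = lambda a, u: sum(ai * ui for ai, ui in zip(a, u)) % p
    cond = [u for u in product(range(p), repeat=len(s)) if dot(s, u) == z]
    return sum(dot(t, u) == z for u in cond) / len(cond)

p = 5
s, t = (1, 1), (2, 2)                   # t = 2s mod p: multiples of each other
print(cond_prob(s, t, 0, p))            # 1.0 when z = 0, not 1/p
print(cond_prob(s, t, 1, p))            # 0.0 when z != 0, not 1/p
print(cond_prob((1, 2), (1, 3), 0, p))  # 0.2 = 1/p: these are not multiples
```

The last line previews the repaired statement proved below: for pairs that are not multiples of each other mod p, the conditional probability really is 1/p.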
The error in the proof of the following lemma occurs on the last “=” in Appendix B of [6].

Lemma 1.1. [6] Fix n ≥ 2 and prime p, and let u be chosen uniformly at random from {0, . . . , p − 1}^n. For any s, t ∈ {1, . . . , p − 1}^n with s ≠ t and for any z ∈ {0, . . . , p − 1}, we have Pr(t · u = z mod p | s · u = z mod p) = 1/p.

Neither s nor t is the all 0 vector, so s is a multiple of t mod p if and only if t is a multiple of s mod p. We can see that Lemma 1.1 is false when s and t are multiples of each other mod p. If z = 0 and s and t are multiples of each other mod p, then Pr(t · u = z mod p | s · u = z mod p) = 1, since t · u = c(s · u) mod p for some c, so s · u = 0 mod p implies t · u = 0 mod p. On the other hand, if z ≠ 0 and s and t are multiples of each other mod p with s ≠ t, then Pr(t · u = z mod p | s · u = z mod p) = 0.

In the next section, we show that Lemma 1.1 is true when s and t are not multiples of each other mod p. We use this fact to fix the proof from [6] and show that for all M > 1 and infinitely many k, there exists a set F of functions from a set X to a set Y of size k such that opt_std(F) = M and opt_bandit(F) ≥ (1 − o(1))(|Y| ln |Y|) opt_std(F). In the proof of the main result, we will use the following lemma from [6].
Lemma 2.1. [6] Fix n ≥ 1 and prime p, and let u be chosen uniformly at random from {0, . . . , p − 1}^n. For any x ∈ {0, . . . , p − 1}^n − {0} and for any y ∈ {0, . . . , p − 1}, Pr(x · u = y mod p) = 1/p.

Now we prove a new lemma. We will use this in place of the false Lemma 1.1 from [6] when we prove that for all
M > 1 and infinitely many k, there exists a set F of functions from a set X to a set Y of size k such that opt_std(F) = M and opt_bandit(F) ≥ (1 − o(1))(|Y| ln |Y|) opt_std(F).

Lemma 2.2.
Fix n ≥ 2 and prime p, and let u be chosen uniformly at random from {0, . . . , p − 1}^n. For any s, t ∈ {1, . . . , p − 1}^n that are not multiples of each other mod p and for any z ∈ {0, . . . , p − 1}, we have Pr(t · u = z mod p | s · u = z mod p) = 1/p.

Proof. By Lemma 2.1 and the definition of conditional probability, we have

Pr(t · u = z mod p | s · u = z mod p) = Pr(t · u = z mod p ∧ s · u = z mod p) / Pr(s · u = z mod p) = p Pr(t · u = z mod p ∧ s · u = z mod p).

Moreover, Pr(t · u = z mod p ∧ s · u = z mod p) = |{u ∈ {0, . . . , p − 1}^n : t · u = z mod p ∧ s · u = z mod p}| / p^n. In order to calculate this cardinality, we must find the number of solutions u ∈ {0, . . . , p − 1}^n to the system of equations t · u = z mod p and s · u = z mod p.

Treating s and t as row vectors, we form the augmented matrix with rows (s, z) and (t, z) and row-reduce it. Since s and t are not multiples of each other mod p, they are linearly independent, so the reduced row echelon form of this matrix has two pivot entries. Therefore the system of equations t · u = z mod p and s · u = z mod p has two dependent variables u_i and u_j for some i ≠ j, and n − 2 independent variables u_k with k ≠ i and k ≠ j. There are p choices for each of the independent variables, and the dependent variables are determined by the values of the independent variables, so there are p^(n−2) solutions u ∈ {0, . . . , p − 1}^n to the system of equations t · u = z mod p and s · u = z mod p.

Thus Pr(t · u = z mod p ∧ s · u = z mod p) = p^(n−2)/p^n = 1/p^2, so Pr(t · u = z mod p | s · u = z mod p) = p Pr(t · u = z mod p ∧ s · u = z mod p) = 1/p.

With this new lemma, we obtain the following lemma, which is analogous to a lemma in [6] that followed from the false Lemma 1.1.

Lemma 2.3.
For any subset S ⊆ {1, . . . , p − 1}^n, there is an element u ∈ {0, . . . , p − 1}^n such that for all z ∈ {0, . . . , p − 1}, |{x ∈ S : x · u = z mod p}| ≤ |S|/p + 2p√|S|.

Proof. Suppose that S is any subset of {1, . . . , p − 1}^n, and let u be chosen uniformly at random from {0, . . . , p − 1}^n. For each z ∈ {0, . . . , p − 1}, define T_z as the set of x ∈ S for which x · u = z mod p. By Lemma 2.1 and linearity of expectation, we have E(|T_z|) = |S|/p for all z. By Lemma 2.1, Lemma 2.2, and the definition of S, the events s · u = z mod p and t · u = z mod p are pairwise independent for any distinct s, t ∈ S that are not multiples of each other mod p. We split into two cases for z: z ≠ 0 and z = 0.

First, suppose that z ≠ 0. For each s ∈ S, define the indicator random variable X_{s,z} so that X_{s,z} = 1 if s · u = z mod p, and X_{s,z} = 0 otherwise. If s and t are not multiples of each other mod p, then Cov(X_{s,z}, X_{t,z}) = 0. If s and t are multiples of each other mod p with s ≠ t, then Cov(X_{s,z}, X_{t,z}) = E(X_{s,z} X_{t,z}) − E(X_{s,z})E(X_{t,z}) = −1/p^2 <
0. Since |T_z| = Σ_{s∈S} X_{s,z}, we have Var(|T_z|) = Var(Σ_{s∈S} X_{s,z}) = Σ_{s∈S} Var(X_{s,z}) + Σ_{s≠t} Cov(X_{s,z}, X_{t,z}) ≤ Σ_{s∈S} Var(X_{s,z}) = |S|(1/p − 1/p^2) < |S|/p. By Chebyshev's inequality, Pr(|T_z| ≥ |S|/p + 2p√|S|) ≤ 1/(4p^3).

Now, suppose that z = 0. For each s ∈ S, define the indicator random variable X_{s,z} so that X_{s,z} = 1 if s · u = z mod p, and X_{s,z} = 0 otherwise. If s and t are not multiples of each other mod p, then Cov(X_{s,z}, X_{t,z}) = 0. If s and t are multiples of each other mod p with s ≠ t, then Cov(X_{s,z}, X_{t,z}) = E(X_{s,z} X_{t,z}) − E(X_{s,z})E(X_{t,z}) = 1/p − 1/p^2 < 1/p. Note that there are at most (p − 2)|S| ordered pairs (s, t) for which s and t are multiples of each other mod p with s ≠ t. Since |T_z| = Σ_{s∈S} X_{s,z}, we have Var(|T_z|) = Var(Σ_{s∈S} X_{s,z}) = Σ_{s∈S} Var(X_{s,z}) + Σ_{s≠t} Cov(X_{s,z}, X_{t,z}) < |S|/p + (p − 2)|S|/p < |S|. By Chebyshev's inequality, Pr(|T_z| ≥ |S|/p + 2p√|S|) ≤ 1/(4p^2).

By the union bound, Pr(∀z |T_z| ≤ |S|/p + 2p√|S|) ≥ 1 − (p − 1)/(4p^3) − 1/(4p^2) ≥ 1/2. Thus we can choose u randomly, and with probability at least 1/2 we will have |{x ∈ S : x · u = z mod p}| ≤ |S|/p + 2p√|S| for all z ∈ {0, . . . , p − 1}.

The proof of the following theorem is the same as in [6]; we include it for completeness. Let p be any prime number. For all a ∈ {0, . . . , p − 1}^n, define f_a : {0, . . . , p − 1}^n → {0, . . . , p − 1} so that f_a(x) = a · x mod p, and define F_L(p, n) = {f_a : a ∈ {0, . . . , p − 1}^n}. It is known that opt_std(F_L(p, n)) = n for all primes p and n > 0.

Theorem 2.4.
For all M > 1 and infinitely many k, there exists a set F of functions from a set X to a set Y of size k such that opt_std(F) = M and opt_bandit(F) ≥ (1 − o(1))(|Y| ln |Y|) opt_std(F).

Proof. Fix n ≥ 2 and a prime p ≥
5. We let F = F_L(p, n), with X = {0, . . . , p − 1}^n and Y = {0, . . . , p − 1}. Let S = {1, . . . , p − 1}^n, so |S| = (p − 1)^n.

Let R_1 = {f_a : a ∈ S} ⊆ F_L(p, n). In each round t, the adversary maintains the set R_t of members of {f_a : a ∈ S} that are consistent with its previous answers, it always answers no, and it picks x_t for round t that minimizes max_{ŷ_t} |R_t ∩ {f : f(x_t) = ŷ_t}|.

By Lemma 2.3, we have |R_{t+1}| ≥ |R_t| − |R_t|/p − 2p√|R_t| ≥ |R_t| − |R_t|/p − |R_t|/(p√(ln p)) = (1 − 1/p − 1/(p√(ln p)))|R_t|, as long as |R_t| ≥ 4p^4 ln p.

By induction on the previous inequality, we have |R_t| ≥ (1 − 1/p − 1/(p√(ln p)))^(t−1) (p − 1)^n. If (1 − 1/p − 1/(p√(ln p)))^(b−1) (p − 1)^n ≥ 4p^4 ln p, then the adversary can guarantee b wrong guesses before |R_t| < 4p^4 ln p. This is true for b = (1 − o(1)) np ln p, completing the proof.

In the last proof, we assumed that M ≥ 2, since the construction takes n = M (as opt_std(F_L(p, n)) = n) and Lemma 2.2 requires n ≥ 2.
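As a sanity check on Lemma 2.3, the following sketch enumerates every u ∈ {0, . . . , p − 1}^n in a small case and confirms that the bound holds for at least half of them, so in particular a good u exists. The parameters p = 5 and n = 4 are arbitrary small choices made here for illustration (chosen so that the bound is not vacuous), and S is the full set {1, . . . , p − 1}^n used in the proof of Theorem 2.4:

```python
from itertools import product
from math import sqrt

p, n = 5, 4                                  # small illustrative choices
S = list(product(range(1, p), repeat=n))     # S = {1,...,p-1}^n, |S| = (p-1)^n
bound = len(S) / p + 2 * p * sqrt(len(S))    # |S|/p + 2p*sqrt(|S|)

def within_bound(u):
    """Check |{x in S : x*u = z mod p}| <= bound for every residue z."""
    counts = [0] * p
    for x in S:
        counts[sum(xi * ui for xi, ui in zip(x, u)) % p] += 1
    return max(counts) <= bound

frac_good = sum(within_bound(u) for u in product(range(p), repeat=n)) / p**n
print(frac_good)   # at least 1/2, as the proof of Lemma 2.3 guarantees
```

Here bound ≈ 211.2 < |S| = 256, so the check is non-trivial; in this small case almost every u works, with the all-zero vector being a notable exception.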