Bandits with Switching Costs: T^{2/3} Regret
Ofer Dekel, Microsoft Research, [email protected]
Jian Ding∗, University of Chicago, [email protected]
Tomer Koren∗, Technion, [email protected]
Yuval Peres, Microsoft Research, [email protected]
Abstract
We study the adversarial multi-armed bandit problem in a setting where the player incurs a unit cost each time he switches actions. We prove that the player's $T$-round minimax regret in this setting is $\widetilde{\Theta}(T^{2/3})$, thereby closing a fundamental gap in our understanding of learning with bandit feedback. In the corresponding full-information version of the problem, the minimax regret is known to grow at a much slower rate of $\Theta(\sqrt{T})$. The difference between these two rates provides the first indication that learning with bandit feedback can be significantly harder than learning with full-information feedback (previous results only showed a different dependence on the number of actions, but not on $T$).

In addition to characterizing the inherent difficulty of the multi-armed bandit problem with switching costs, our results also resolve several other open problems in online learning. One direct implication is that learning with bandit feedback against bounded-memory adaptive adversaries has a minimax regret of $\widetilde{\Theta}(T^{2/3})$. Another implication is that the minimax regret of online learning in adversarial Markov decision processes (MDPs) is $\widetilde{\Theta}(T^{2/3})$. The key to all of our results is a new randomized construction of a multi-scale random walk, which is of independent interest and likely to prove useful in additional settings.

∗ Most of this work was done while the author was at Microsoft Research, Redmond.

1 Introduction
Online learning with a finite set of actions is a fundamental problem in machine learning, with two important special cases: the Adversarial (Non-Stochastic) Multi-Armed Bandit problem (Auer et al., 2002) and Predicting with Expert Advice (Cesa-Bianchi et al., 1997; Freund and Schapire, 1997). This problem is often presented as a $T$-round repeated game between a player and an adversary: on each round of the game, the player chooses an action from the set $[k] = \{1, \ldots, k\}$ and incurs a loss in $[0,1]$ for that action. The player is allowed to randomize, i.e., on each round he selects a distribution over actions and draws an action from that distribution. The loss corresponding to each action on each round is set in advance by the adversary, and in particular, the loss of each action can vary from round to round. The player's goal is to minimize the total loss accumulated over the course of the game.

The bandit problem and the experts problem differ in the feedback received by the player after each round (in the bandit problem, each action is called an arm; in the experts problem, each action is called an expert). In the bandit problem, the player only observes his own loss (a single number) on each round; this is called bandit feedback. In the experts problem, the player observes the loss assigned to each possible action (for a total of $k$ real numbers in each round); this is called full feedback or full information. A player that receives bandit feedback must balance an exploration/exploitation trade-off, while a player that receives full feedback is only concerned with exploitation.

For example, say that we manage an investment portfolio, we receive daily advice from $k$ financial experts, and on each day we must follow the advice of one expert. The loss associated with each expert on each day reflects the amount of money we would lose by following that expert's advice on that day. If we know the advice given by each expert, the problem is said to provide full feedback. Alternatively, if we purchase advice from a single expert on each day, and the advice of the other $k - 1$ experts remains unknown, the problem provides bandit feedback.

We study these problems with an additional switching cost: in addition to the losses chosen by the adversary, the player pays a penalty each time his action differs from the one he played on the previous round. In the motivating example described above, switching our primary financial consultant may require terminating a contract with the previous expert and negotiating a contract with the new one, or it may just cost us the fees and commissions that result from a significant change in investment strategy. Switching costs arise naturally in a variety of other applications: in online web applications, switching the content of a website too frequently can be annoying to users; in industrial applications, switching actions might entail reconfiguring a production line. Moreover, Geulen et al. (2010) reduced a family of online buffering problems to switching cost problems; similarly, Gyorgy and Neu (2011) used the switching cost setting to solve the limited-delay universal lossy source coding problem.

We focus on analyzing the inherent difficulty of online learning with switching costs, using the game-theoretic notion of minimax regret. To define this notion, we must first specify the game more precisely. Before the game begins, the adversary fixes a sequence of loss functions $\ell_1, \ldots, \ell_T$, where each $\ell_t$ maps the action set $[k]$ to $[0,1]$; in other words, the adversary is oblivious (to the player's actions). On round $t$, the player selects a distribution over the set of actions and draws an action $X_t$ from that distribution. The player then incurs the loss $\ell_t(X_t) + \mathbb{1}{\{X_t \ne X_{t-1}\}}$, which includes the adversarially chosen loss $\ell_t(X_t)$ and the switching cost. To make the loss on the first round well-defined, we set $X_0 = 0$ (so the first action always counts as a switch). The player's cumulative loss at the end of the game equals $\sum_{t=1}^{T} \big( \ell_t(X_t) + \mathbb{1}{\{X_t \ne X_{t-1}\}} \big)$.

Since the loss functions are adversarial, the cumulative loss is only meaningful when compared to an adequate baseline.
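As an aside, the protocol just described is easy to state in code. The following is a minimal Python sketch of the player's cumulative loss under an oblivious loss sequence (the helper name and the 0-indexed actions are our conventions, for illustration only):

```python
import numpy as np

def cumulative_loss(losses, actions):
    """Total loss of an action sequence, including unit switching costs.

    losses:  array of shape (T, k); losses[t, x] is the loss in [0, 1]
             of action x on round t
    actions: length-T sequence of actions in {0, ..., k-1}
    """
    total, prev = 0.0, None           # prev = None mimics X_0 = 0: the first
    for t, x in enumerate(actions):   # round always counts as a switch
        total += losses[t, x] + (x != prev)
        prev = x
    return total

rng = np.random.default_rng(0)
losses = rng.uniform(size=(3, 2))            # T = 3 rounds, k = 2 actions
print(cumulative_loss(losses, [0, 1, 1]))    # pays 2 switch penalties
```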
Therefore, we compare the player's cumulative loss to the loss of the best fixed policy (in hindsight), which is a policy that chooses the same action on all $T$ rounds. Formally, we define the player's regret at the end of the game as
\[ R = \sum_{t=1}^{T} \big( \ell_t(X_t) + \mathbb{1}{\{X_t \ne X_{t-1}\}} \big) - \min_{x \in [k]} \sum_{t=1}^{T} \ell_t(x) . \tag{1} \]
While regret measures the player's performance on a given instance of the game, the inherent difficulty of the game itself is measured by the minimax expected regret (or just minimax regret for brevity). Intuitively, the minimax regret is the expected regret when both the adversary and the player behave optimally. Formally, the minimax regret is the minimum over all randomized player strategies, of the maximum over all loss sequences, of $\mathbb{E}[R]$. In this paper, our primary focus is to determine the asymptotic growth rate of the minimax regret as a function of the number of rounds $T$ and the number of actions $k$.

Minimax regret rates are already well understood in several of the settings discussed above. Without switching costs, the minimax regret of the adversarial multi-armed bandit problem is $\Theta(\sqrt{Tk})$ (see Auer et al. (2002); Cesa-Bianchi and Lugosi (2006)), and the minimax regret of the experts problem is $\Theta(\sqrt{T \log k})$ (see Littlestone and Warmuth (1994); Freund and Schapire (1997); Cesa-Bianchi and Lugosi (2006)). This implies that when no switching costs are added, the bandit problem is not substantially more difficult than the experts problem (at least when the number of actions is constant), despite the added burden of exploration.

When switching costs are added, the previous literature does not provide a full characterization of the minimax regret. Clearly, the lower bounds that hold without switching costs still apply when switching costs are added. In the full-feedback setting with switching costs, the Follow the Lazy Leader algorithm (Kalai and Vempala, 2005) and the Shrinking Dartboard algorithm (Geulen et al., 2010) both guarantee a matching upper bound of $O(\sqrt{T \log k})$, so the minimax regret is $\Theta(\sqrt{T \log k})$. However, the minimax regret of the bandit problem with switching costs was not well understood. Arora et al. (2012) presented a simple algorithm with a guaranteed regret of $O(k^{1/3} T^{2/3})$, but a matching lower bound was not known.

Recently, Cesa-Bianchi et al. (2013) addressed this gap, but fell short of resolving it. Specifically, they modified the game by allowing the loss per round to drift out of the interval $[0,1]$ and to possibly grow in magnitude to be as large as $\Theta(\sqrt{T})$. In this setting, they proved that the minimax regret (with a constant number of actions $k$) grows at a rate of $\widetilde{\Theta}(T^{2/3})$. However, allowing an unbounded loss per round is quite uncommon and not very natural. Also, it isn't clear what implications their results have for the original problem (i.e., with bounded losses), and whether their $\widetilde{\Theta}(T^{2/3})$ rate is merely an artifact of the enlarged range of admissible loss values.

Our main result is a new $\widetilde{\Omega}(T^{2/3})$ lower bound on the regret of the multi-armed bandit problem with switching costs (in the standard setup, with losses bounded in $[0,1]$):
Theorem 1. For any randomized player strategy that relies on bandit feedback, there exists a sequence of loss functions $\ell_1, \ldots, \ell_T$ (where $\ell_t : [k] \to [0,1]$) that incurs a regret of $R = \widetilde{\Omega}(k^{1/3} T^{2/3})$, provided that $k \le T$.

When combined with the upper bound in Arora et al. (2012), our result implies that the minimax regret of the multi-armed bandit problem with switching costs is $\widetilde{\Theta}(k^{1/3} T^{2/3})$. Thus, when switching costs are added, the bandit problem becomes substantially more difficult than the corresponding experts problem. To the best of our knowledge, this is the first example that exhibits (even for constant $k$) a clear gap between the asymptotic difficulty, as $T$ grows, of online learning with bandit and full feedback.

To prove Theorem 1, we apply (the easy direction of) Yao's minimax principle (Yao, 1977), which states that the regret of a randomized player against the worst-case loss sequence is at least the minimax regret of the optimal deterministic player against a stochastic loss sequence. In other words, as an intermediate step toward proving Theorem 1, we construct a stochastic sequence of loss functions $L_1, \ldots, L_T$, where each $L_t$ is a random function from $[k]$ to $[0,1]$, such that
\[ \mathbb{E}\bigg[ \sum_{t=1}^{T} \big( L_t(X_t) + \mathbb{1}{\{X_t \ne X_{t-1}\}} \big) - \min_{x \in [k]} \sum_{t=1}^{T} L_t(x) \bigg] = \widetilde{\Omega}(k^{1/3} T^{2/3}) \]
for any deterministic player strategy.

After proving our lower bound for a unit switching cost, we generalize it to arbitrary switching costs (e.g., a switching cost of $T^q$ for a given exponent $q$). Another corollary is that any algorithm that guarantees a regret of $O(\sqrt{T})$ (without switching costs), such as the algorithm presented in Auer et al. (2002), can be forced to make $\widetilde{\Omega}(T)$ switches. Finally, we observe that our problem is a special case of an online Markov decision process (MDP) learning problem with adversarial rewards and bandit feedback, and therefore the minimax regret of that problem is also $\widetilde{\Omega}(T^{2/3})$.

The paper is organized as follows: in Sec. 2 we describe the general construction of the stochastic loss sequence, and in Sec. 3 we present the stochastic process that underlies our construction. We then prove our lower bound on regret in Sec. 4 and present extensions and implications in Sec. 5. Throughout, we use the notation $U_{i:j}$ as shorthand for the sequence $U_i, \ldots, U_j$.

2 The Stochastic Loss Sequence
Input: time horizon $T > 0$, number of actions $k \ge 2$.
1. Set $\epsilon = k^{1/3} T^{-1/3} / (9 \log T)$ and $\sigma = 1/(9 \log T)$.
2. Choose $\chi \in [k]$ uniformly at random.
3. Draw $T$ independent zero-mean $\sigma^2$-variance Gaussians $\xi_{1:T}$.
4. Define $W_{0:T}$ recursively by $W_0 = 0$ and, for all $t \in [T]$, $W_t = W_{\rho(t)} + \xi_t$, where $\rho(t) = t - 2^{\delta(t)}$ and $\delta(t) = \max\{ i \ge 0 : 2^i \text{ divides } t \}$.
5. For all $t \in [T]$ and $x \in [k]$, set $L'_t(x) = W_t + \tfrac12 - \epsilon \cdot \mathbb{1}{\{\chi = x\}}$ and $L_t(x) = \mathrm{clip}\big(L'_t(x)\big)$, where $\mathrm{clip}(\alpha) = \min\{\max\{\alpha, 0\}, 1\}$.
Output: loss functions $L_{1:T}$.

Figure 1: The adversary's randomized algorithm for generating a loss sequence $L_{1:T}$, which ensures an expected regret of $\widetilde{\Omega}(k^{1/3} T^{2/3})$ against any deterministic player.

In this section we present our construction of a stochastic sequence of loss functions, $L_{1:T}$, which ensures an expected regret of $\widetilde{\Omega}(k^{1/3} T^{2/3})$ against any deterministic player. The adversary's algorithm for generating the sequence $L_{1:T}$ is given in Fig. 1. The key to this algorithm is the stochastic process $W_{0:T}$, defined on lines 3–4 of Fig. 1. The adversary draws a concrete sequence from this process and uses it to define the loss values of all $k$ actions. First, the adversary picks an action $\chi \in [k]$ uniformly at random, to serve as the best action (whose loss is always smaller than the loss of the other actions), and defines the intermediate loss function sequence $L'_{1:T}$, whose values are not guaranteed to be bounded in $[0,1]$. The loss of each action $x \ne \chi$ is simply set to $L'_t(x) = W_t + \tfrac12$. The loss of the best action $\chi$ is set to $L'_t(\chi) = W_t + \tfrac12 - \epsilon$, where $\epsilon$ is a predefined gap parameter, and is therefore consistently better than the losses of the other actions. The loss sequence $L_{1:T}$ is obtained by taking the intermediate sequence $L'_{1:T}$ and projecting each of its values to the interval $[0,1]$.

As he observes the losses, the player attempts to identify which of the $k$ actions has the smaller loss (or equivalently, to reveal the value of $\chi$). Although the loss values of the best action are deterministically separated from those of the other actions by a constant gap, the player only observes one loss value on each round, and never knows if his chosen action incurred the higher loss or the lower loss. Our analysis shows that the player's ability to uncover information about the identity of the best action depends on the characteristics of the stochastic process $W_{0:T}$. For example, if this process were an i.i.d. sequence, it is easy to see that the player could identify the best action by estimating the expected loss of every action to within $\epsilon/2$, using $O(\sigma^2/\epsilon^2)$ samples of each action and at most $k - 1$ switches. This is where the choice of the process $W_{0:T}$ plays a central role. We show that a careful choice of the stochastic process $W_{0:T}$ ensures that the amount of information uncovered by the player during the game is tightly controlled by the number of switches he performs. Therefore, to detect the best action, the player must switch actions frequently and pay the associated switching costs.
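The construction of Fig. 1 is short enough to transcribe directly. The following Python sketch mirrors it (the function name and the 0-indexed actions are our conventions; we use the natural logarithm, since the base of the logarithm only affects constants):

```python
import math
import numpy as np

def generate_losses(T, k, rng):
    """Sample a loss sequence L_1..L_T following the construction of Fig. 1."""
    eps = k ** (1 / 3) * T ** (-1 / 3) / (9 * math.log(T))  # gap of best action
    sigma = 1 / (9 * math.log(T))                           # noise scale
    chi = rng.integers(k)                                   # best action (0-indexed)
    xi = rng.normal(0.0, sigma, size=T + 1)                 # xi[1..T]; xi[0] unused

    W = np.zeros(T + 1)                                     # multi-scale random walk
    for t in range(1, T + 1):
        parent = t & (t - 1)                                # rho(t) = t - 2^{delta(t)}
        W[t] = W[parent] + xi[t]

    L = W[1:, None] + 0.5 - eps * (np.arange(k) == chi)     # L'_t(x), shape (T, k)
    return np.clip(L, 0.0, 1.0), chi                        # clip to [0, 1]

losses, best = generate_losses(T=1000, k=3, rng=np.random.default_rng(1))
print(best, losses[:2])   # column `best` is pointwise smaller before clipping
```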
3 The Underlying Stochastic Process

The key to our analysis is a careful choice of the stochastic process $W_{0:T}$ that underlies the definition of $L_{1:T}$. In this section we describe a family of stochastic processes with a controllable dependence structure, which includes i.i.d. Gaussian sequences and simple Gaussian random walks as special cases.

3.1 Processes Defined by a Parent Function

Let $\xi_{1:T}$ be a sequence of independent zero-mean Gaussian random variables with variance $\sigma^2$. Let $\rho : [T] \to \{0\} \cup [T]$ be a function that assigns to each $t \in [T]$ a parent $\rho(t)$. We allow $\rho$ to be any function that satisfies $\rho(t) < t$ for all $t$. Now define
\[ W_0 = 0 , \qquad \forall\, t \in [T] : \;\; W_t = W_{\rho(t)} + \xi_t . \]
Note that the constraint $\rho(t) < t$ guarantees that a recursive application of $\rho$ always leads back to zero. The definition of the parent function $\rho$ determines the behavior of the stochastic process. For example, setting $\rho(t) = 0$ implies that $W_t = \xi_t$ for all $t$, so the stochastic process is simply a sequence of i.i.d. Gaussians. On the other hand, setting $\rho(t) = t - 1$ yields a simple Gaussian random walk. Other choices of $\rho$ can create interesting dependencies between the variables of the stochastic process.

We highlight two properties of the parent function $\rho$ (and consequently, of the induced stochastic process) that are essential to our analysis.

Definition 1 (ancestors, depth). Given a parent function $\rho$, the set of ancestors of $t$ is denoted by $\rho^*(t)$ and defined as the set of indices that are encountered when $\rho$ is applied recursively to $t$. Formally, $\rho^*(t)$ is defined recursively by
\[ \rho^*(0) = \emptyset , \qquad \forall\, t \in [T] : \;\; \rho^*(t) = \rho^*\big(\rho(t)\big) \cup \{\rho(t)\} . \tag{2} \]
The depth of $\rho$ is then defined as $d(\rho) = \max_{t \in [T]} |\rho^*(t)|$.

Note that we can write $W_t = \xi_t + \sum_{s \in \rho^*(t)} \xi_s$, where $\xi_0 = 0$. Thus, if $d(\rho) = d$, the induced stochastic process includes sums of at most $d$ independent Gaussians, each with variance $\sigma^2$. This implies the following bound.
Lemma 1. Let $W_{0:T}$ be the stochastic process defined by the parent function $\rho$. Then for all $\delta \in (0,1)$,
\[ \mathbb{P}\Big( \max_{t \in [T]} |W_t| \le \sigma \sqrt{2\, d(\rho) \log\tfrac{T}{\delta}} \Big) \ge 1 - \delta . \]

Proof. For any $t \in [T]$, $W_t$ is normally distributed with zero mean and variance bounded by $d(\rho)\sigma^2$. Since a standard Gaussian variable $Z$ satisfies $\mathbb{P}(|Z| \ge z) \le \exp(-z^2/2)$ for any $z \ge 0$, we infer that
\[ \mathbb{P}\Big( |W_t| \ge \sigma \sqrt{2\, d(\rho) \log\tfrac{T}{\delta}} \Big) \le \exp\big( {-\log\tfrac{T}{\delta}} \big) = \tfrac{\delta}{T} . \]
The above holds for each $t \in [T]$, and the lemma follows from the union bound.

Lemma 1 implies that the depth of $\rho$ and the variance $\sigma^2$ determine how far the process $W_{0:T}$ will drift. Since we require a process that is bounded with high probability, we need to minimize the depth of $\rho$. (We could counter the effect of a deep $\rho$ by setting $\sigma$ to be small, but if we do so, the resulting process would not be able to mask the $\epsilon$ gap between the losses of the different actions.) This consideration rules out the simple Gaussian random walk, whose depth is $T$.

Definition 2 (cut, width). Given a parent function $\rho$, define
\[ \mathrm{cut}(t) = \{ s \in [T] : \rho(s) < t \le s \} , \]
the set of rounds that are separated from their parent by $t$. The width of $\rho$ is then defined as $w(\rho) = \max_{t \in [T]} |\mathrm{cut}(t)|$.

Note that $|\mathrm{cut}(t)|$ is an integer between 1 and $T$ for any $t \in [T]$, since $t$ itself always belongs to $\mathrm{cut}(t)$. One extreme is the simple Gaussian random walk ($\rho(t) = t-1$), for which $\mathrm{cut}(t) = \{t\}$ and therefore $w(\rho) = 1$; the other extreme is the i.i.d. Gaussian sequence ($\rho(t) = 0$), for which $|\mathrm{cut}(t)| = T - t + 1$, and therefore $w(\rho) = T$. (The width of $\rho$ coincides with the cut-width of the numbered graph it determines; see Chung and Seymour (1989).)

Our analysis in Sec. 4.1 shows that any information that the player uncovers about the identity of the best action can be attributed to a switch performed on the current round or on a past round (where the first round is always considered to be a switch). Moreover, we prove that the amount of information that can be extracted from a switch at time $t$ is controlled by the size of $\mathrm{cut}(t)$. Therefore, a process with a small width forces the player to perform many switches. This rules out the sequence of i.i.d. Gaussians, as it is too wide and reveals too much information to a player that selects the same action repeatedly.
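The two extremes can be checked mechanically. The following is a small brute-force sketch (function names are ours; the width computation is quadratic in T, which is fine for illustration):

```python
def depth(T, rho):
    """d(rho) = max over t of |rho*(t)|: length of the parent chain down to 0."""
    def chain_len(t):
        n = 0
        while t > 0:
            t, n = rho(t), n + 1
        return n
    return max(chain_len(t) for t in range(1, T + 1))

def width(T, rho):
    """w(rho) = max over t of |cut(t)|, where cut(t) = {s : rho(s) < t <= s}."""
    return max(sum(1 for s in range(1, T + 1) if rho(s) < t <= s)
               for t in range(1, T + 1))

T = 64
print(depth(T, lambda t: 0), width(T, lambda t: 0))          # i.i.d.:       1, 64
print(depth(T, lambda t: t - 1), width(T, lambda t: t - 1))  # random walk: 64, 1
```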
3.2 The Multi-Scale Random Walk

To prove our lower bound, we require a stochastic process that is neither too deep nor too wide. We present such a process, called the Multi-scale Random Walk (MRW), whose depth and width are both logarithmic in $T$. The MRW is defined by the parent function
\[ \rho(t) = t - 2^{\delta(t)} , \quad \text{where} \quad \delta(t) = \max\{ i \ge 0 : 2^i \text{ divides } t \} . \tag{3} \]
Put another way, $\rho(t)$ is obtained by taking the binary representation of $t$, identifying the lowest-order 1, and flipping it to 0. For example, if $t = 10110100_2$ (which equals the decimal number 180) then $\rho(t) = 10110000_2$ (which equals the decimal number 176).

Figure 2: An illustration of the MRW process for $T = 7$. (Top) The MRW with a directed edge from $\rho(t)$ to $t$, for each $t \in [T]$. (Bottom) The MRW can be equivalently described as the values at the leaves of a binary tree, where the value at each leaf is obtained by summing the i.i.d. Gaussian variables $\xi_t$ on the (right) edges along the path from the root.

Fig. 2 depicts the MRW process for $T = 7$. Notice that the process takes steps on multiple scales, each of which corresponds to a different power of two. An alternative description of the same process can be obtained by considering a binary tree whose leaves correspond to the random variables $W_{1:T}$, as depicted in Fig. 2. In this description, we associate the right edges of the tree, enumerated in a DFS traversal order, with the Gaussian variables $\xi_{1:T}$. Then, each $W_t$ is defined as the sum of the $\xi_j$'s encountered along the path from the root to the leaf corresponding to $W_t$.
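The "flip the lowest-order 1" rule is the classic bit trick t & (t - 1). A short self-check sketch (helper names are ours) that also verifies the ancestor count used in Lemma 2 below:

```python
def mrw_parent(t):
    """rho(t) = t - 2^{delta(t)}: clear the lowest-order 1 bit of t."""
    return t & (t - 1)

def delta(t):
    """The largest i such that 2^i divides t (for t >= 1)."""
    i = 0
    while t % 2 == 0:
        t, i = t // 2, i + 1
    return i

# The bit trick agrees with Eq. (3), and with the example in the text.
assert all(mrw_parent(t) == t - 2 ** delta(t) for t in range(1, 1025))
assert mrw_parent(0b10110100) == 0b10110000      # 180 -> 176

def num_ancestors(t):
    """|rho*(t)|: equals the number of 1 bits of t (cf. Lemma 2 below)."""
    n = 0
    while t > 0:
        t, n = mrw_parent(t), n + 1
    return n

assert all(num_ancestors(t) == bin(t).count("1") for t in range(1, 1025))
print("all MRW checks passed")
```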
We conclude the section with the following lemma, which summarizes the properties of the MRW process used in our analysis.

Lemma 2. The depth and width of the MRW are both upper-bounded by $\lfloor \log_2 T \rfloor + 1$.

Proof. Let $n = \lfloor \log_2 T \rfloor + 1$ and note that any integer $t \in [T]$ can be written using $n$ bits. We shall prove that, for all $t \in [T]$, the number $|\rho^*(t)|$ is bounded by (in fact, is equal to) the number of 1's in the $n$-digit binary representation of $t$, while $|\mathrm{cut}(t)|$ is bounded by the number of 0's in that representation plus one. This immediately implies the lemma, as $|\rho^*(t)|$ and $|\mathrm{cut}(t)|$ are both positive and their sum is then at most $n + 1$.

First, observe that the number of 1's in the representation of the parent $\rho(t)$ is one less than the number of 1's in the representation of $t$, and that $|\rho^*(0)| = 0$. Hence, $|\rho^*(t)|$ equals the number of 1's in the binary representation of $t$.

Moving on to the width, choose any $t \in [T]$ and consider the cut it defines. We show that each $s \in \mathrm{cut}(t) \setminus \{t\}$ corresponds to a distinct zero in the $n$-bit binary representation of $t$. Let $s \in \mathrm{cut}(t) \setminus \{t\}$ and denote $j = \delta(s)$. Note that $\rho(s) = s - 2^j$ is a multiple of $2^{j+1}$, so we can write $s - 2^j = a \cdot 2^{j+1}$ for some integer $a$. By the definition of the cut, and since $s \ne t$, we have $a \cdot 2^{j+1} < t < a \cdot 2^{j+1} + 2^j$. Consequently, $s = 2^{j+1} \lfloor t/2^{j+1} \rfloor + 2^j$ and the coefficient of $2^j$ in the binary representation of $t$ is zero. Together with the fact that $t \in \mathrm{cut}(t)$, we have shown that the size of the cut defined by $t$ is at most the number of zero bits in its binary representation plus one.

4 Proof of the Lower Bound

In this section we prove our main result: an $\widetilde{\Omega}(k^{1/3} T^{2/3})$ lower bound on the expected regret of the multi-armed bandit with switching costs, when the loss functions are stochastic (as constructed in Fig. 1) and the player is deterministic. Our result is stated formally in the following theorem.
Theorem 2. Let $L_{1:T}$ be the stochastic sequence of loss functions defined in Fig. 1. Then for $T \ge \max\{k, 6\}$, the expected regret (as defined in Eq. (1)) of any deterministic player against this sequence is at least $k^{1/3} T^{2/3} / (100 \log T)$.

Our analysis requires some new notation. First, let $M = \sum_{t=1}^{T} \mathbb{1}{\{X_t \ne X_{t-1}\}}$ be the number of switches in the action sequence $X_{1:T}$ (recall that we arbitrarily set $X_0 = 0$, so the first round always counts as a switch). Also, for all $t \in [T]$, let $Z_t = L_t(X_t)$ be the loss observed by the player on round $t$. Recall our assumption that $X_t$, the player's action on round $t$, is a deterministic function of his past observations $Z_{1:t-1}$.

4.1 A Key Lemma

We begin the analysis with a key lemma that relates the player's ability to identify the best action to the number of switches he performs. This lemma also highlights the importance of finding a stochastic process with a small width $w(\rho)$. The lemma bounds the distance between each one of the conditional probability measures
\[ Q_i(\cdot) = \mathbb{P}(\cdot \mid \chi = i) , \quad i = 1, 2, \ldots, k , \]
and the probability measure $Q_0$ that corresponds to an (imaginary) adversary that uses $\chi = 0$; thus, $Q_0(\cdot)$ is the probability measure when all actions incur the same loss. Let $\mathcal{F}$ be the $\sigma$-algebra generated by the player's observations $Z_{1:T}$. Then the total variation distance between $Q_0$ and $Q_i$ on $\mathcal{F}$ is defined as
\[ d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) = \sup_{A \in \mathcal{F}} \big| Q_0(A) - Q_i(A) \big| . \]
This distance captures the player's ability to identify whether action $i$ is better than or equivalent to the other actions, based on the loss values he observes. The following lemma upper-bounds this distance in terms of the number of switches the player performs to or from action $i$, denoted by the random variable $M_i$, and the width $w(\rho)$ of the underlying stochastic process. Here we use the notation $\mathbb{E}_{Q_j}$ to refer to the expectation with respect to the distribution $Q_j$, for any $j = 0, 1, \ldots, k$.
Lemma 3. For all $i \in [k]$, it holds that
\[ d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) \le \frac{\epsilon}{2\sigma} \sqrt{w(\rho)\, \mathbb{E}_{Q_0}[M_i]} \quad \text{and} \quad d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) \le \frac{\epsilon}{2\sigma} \sqrt{w(\rho)\, \mathbb{E}_{Q_i}[M_i]} . \]

To see the significance of this lemma, consider first the case $k = 2$, where $M_1 = M_2 = M$ by definition. By the triangle inequality, $d^{\mathcal{F}}_{\mathrm{TV}}(Q_1, Q_2) \le d^{\mathcal{F}}_{\mathrm{TV}}(Q_1, Q_0) + d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_2)$. Concavity of the square root yields
\[ \sqrt{\mathbb{E}_{Q_1}[M]} + \sqrt{\mathbb{E}_{Q_2}[M]} \le \sqrt{2\big( \mathbb{E}_{Q_1}[M] + \mathbb{E}_{Q_2}[M] \big)} = 2\sqrt{\mathbb{E}[M]} . \]
The second claim of Lemma 3 for $k = 2$ now implies that $d^{\mathcal{F}}_{\mathrm{TV}}(Q_1, Q_2) \le (\epsilon/\sigma)\sqrt{w(\rho)\, \mathbb{E}[M]}$. This inequality clarifies the dilemma facing the player: if he switches actions frequently, so that $\mathbb{E}[M] = \Omega(T^{2/3}/\log T)$, the switching costs alone guarantee the desired lower bound on regret. Otherwise, $\mathbb{E}[M] = o(T^{2/3}/\log T)$; since $\epsilon/\sigma = \Theta(T^{-1/3})$ and $w(\rho) = \Theta(\log T)$, the distance $d^{\mathcal{F}}_{\mathrm{TV}}(Q_1, Q_2)$ then tends to zero with $T$, so the player is unable to distinguish between the two actions and suffers an expected regret of order $\Theta(\epsilon T) = \Theta(T^{2/3}/\log T)$. We do not formalize this argument here, since we prove the lower bound for any $k$ below.
Proof of Lemma 3. Let $Y_0 = 0$ and $Y_t = L'_t(X_t)$ for all $t \in [T]$. Note that $X_t$ is a deterministic function of $Y_{0:t-1}$. Define $Y_S = \{Y_t\}_{t \in S}$ and let $\Delta(Y_S \mid Y_{S'})$ be the relative entropy (i.e., the Kullback-Leibler divergence) between the joint distribution of $Y_S$, conditioned on $Y_{S'}$, under $Q_0$ and under $Q_i$. Namely,
\[ \Delta(Y_S \mid Y_{S'}) = \mathbb{E}_{Q_0}\bigg[ \log \frac{Q_0(Y_S \mid Y_{S'})}{Q_i(Y_S \mid Y_{S'})} \bigg] . \tag{4} \]
For brevity, also define $\Delta(Y_S) = \Delta(Y_S \mid \emptyset)$. We use the chain rule for relative entropy (see, e.g., Theorem 2.5.3 in Cover and Thomas (2006)) to decompose $\Delta(Y_{0:T})$ as
\[ \Delta(Y_{0:T}) = \Delta(Y_0) + \sum_{t=1}^{T} \Delta\big( Y_t \mid Y_{\rho^*(t)} \big) \tag{5} \]
and deal separately with each term in the sum. First note that $\Delta(Y_0) = 0$, as $Y_0$ is a constant. The value of $\Delta\big( Y_t \mid Y_{\rho^*(t)} \big)$ is computed by considering three separate cases. If $\mathbb{1}{\{X_t = i\}} = \mathbb{1}{\{X_{\rho(t)} = i\}}$ (that is, the player chose action $i$ on both of the rounds $t$ and $\rho(t)$, or on neither of them), then the distribution of $Y_t$ conditioned on $Y_{\rho^*(t)}$ is $N(Y_{\rho(t)}, \sigma^2)$ under both $Q_0$ and $Q_i$, where $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. If $X_t = i$ and $X_{\rho(t)} \ne i$, then the distribution of $Y_t$ conditioned on $Y_{\rho^*(t)}$ is $N(Y_{\rho(t)}, \sigma^2)$ under $Q_0$ and $N(Y_{\rho(t)} - \epsilon, \sigma^2)$ under $Q_i$. Finally, if $X_t \ne i$ and $X_{\rho(t)} = i$, then the distribution of $Y_t$ conditioned on $Y_{\rho^*(t)}$ is $N(Y_{\rho(t)}, \sigma^2)$ under $Q_0$ and $N(Y_{\rho(t)} + \epsilon, \sigma^2)$ under $Q_i$. Overall,
\[ \Delta\big( Y_t \mid Y_{\rho^*(t)} \big) = Q_0\big( X_t = i, X_{\rho(t)} \ne i \big) \cdot d_{\mathrm{KL}}\big( N(0, \sigma^2) \,\big\|\, N(-\epsilon, \sigma^2) \big) + Q_0\big( X_t \ne i, X_{\rho(t)} = i \big) \cdot d_{\mathrm{KL}}\big( N(0, \sigma^2) \,\big\|\, N(\epsilon, \sigma^2) \big) = \frac{\epsilon^2}{2\sigma^2}\, Q_0(A_t) , \tag{6} \]
where $A_t = \big\{ X_t = i, X_{\rho(t)} \ne i \big\} \cup \big\{ X_t \ne i, X_{\rho(t)} = i \big\}$ is the event that the player switched an odd number of times (and in particular, at least once) from or to action $i$ between rounds $\rho(t)$ and $t$. Substituting Eq. (6) into Eq. (5) gives
\[ \Delta(Y_{0:T}) = \frac{\epsilon^2}{2\sigma^2} \sum_{t=1}^{T} Q_0(A_t) = \frac{\epsilon^2}{2\sigma^2}\, \mathbb{E}_{Q_0}\bigg[ \sum_{t=1}^{T} \mathbb{1}{\{A_t\}} \bigg] . \tag{7} \]
The event $A_t$ implies that there exists at least one time $s$ of a switch from or to action $i$ such that $t \in \mathrm{cut}(s)$. Therefore, if we let $S_1, \ldots, S_{M_i}$ denote the random sequence of times of such switches (in the action sequence $X_{1:T}$), then
\[ \sum_{t=1}^{T} \mathbb{1}{\{A_t\}} \le \sum_{r=1}^{M_i} \sum_{t \in \mathrm{cut}(S_r)} \mathbb{1}{\{A_t\}} \le \sum_{r=1}^{M_i} |\mathrm{cut}(S_r)| \le w(\rho)\, M_i . \]
Plugging this inequality back into Eq. (7) gives
\[ \Delta(Y_{0:T}) \le \frac{\epsilon^2 w(\rho)}{2\sigma^2}\, \mathbb{E}_{Q_0}[M_i] . \]
Pinsker's inequality (Lemma 11.6.1 in Cover and Thomas (2006)) now implies that
\[ \sup_{A \in \mathcal{F}'} \big( Q_0(A) - Q_i(A) \big) \le \frac{\epsilon}{2\sigma} \sqrt{w(\rho)\, \mathbb{E}_{Q_0}[M_i]} , \]
where $\mathcal{F}'$ is the $\sigma$-algebra generated by $Y_{0:T}$. We can replace $\mathcal{F}'$ with $\mathcal{F}$ above, and obtain $d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i)$ on the left-hand side, simply because $Z_{1:T}$ is a deterministic function of $Y_{0:T}$ and therefore $\mathcal{F} \subseteq \mathcal{F}'$. This proves the first claim of the lemma. To prove the second claim, we simply reverse the roles of $Q_0$ and $Q_i$ in the arguments above and obtain the same bound on the total variation distance, but in terms of the expectation with respect to the distribution $Q_i$.
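The two ingredients of the final step, the KL divergence between $\epsilon$-shifted Gaussians and Pinsker's inequality, can be sanity-checked numerically. In this sketch (function names are ours) the exact total variation $2\Phi(\epsilon/2\sigma) - 1$ is compared against the Pinsker bound $\sqrt{\mathrm{KL}/2} = \epsilon/2\sigma$:

```python
import math

def kl_shifted_gaussians(eps, sigma):
    """d_KL(N(0, sigma^2) || N(eps, sigma^2)) = eps^2 / (2 sigma^2)."""
    return eps ** 2 / (2 * sigma ** 2)

def tv_shifted_gaussians(eps, sigma):
    """Exact total variation: the two densities cross at eps/2, giving
    2 * Phi(eps / (2 sigma)) - 1 = erf(eps / (2 sigma sqrt(2)))."""
    return math.erf(eps / (2 * sigma * math.sqrt(2)))

eps, sigma = 0.05, 1.0
print(tv_shifted_gaussians(eps, sigma))                  # ~0.0199, true distance
print(math.sqrt(kl_shifted_gaussians(eps, sigma) / 2))   # 0.025, Pinsker bound
```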
4.2 Regret Lower Bound

With Lemma 3 in hand, we can prove Theorem 2 and conclude Theorem 1. We begin with a simple corollary of the lemma.

Corollary 1. It holds that
\[ \frac{1}{k} \sum_{i=1}^{k} d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) \le \frac{\epsilon}{\sigma\sqrt{2k}} \sqrt{w(\rho)\, \mathbb{E}_{Q_0}[M]} . \]

Proof. Averaging the first inequality of Lemma 3 over $i = 1, 2, \ldots, k$, using the concavity of the square root, and noting that $\sum_{i=1}^{k} M_i = 2M$ (as each switch is counted twice in the sum) yields
\[ \frac{1}{k} \sum_{i=1}^{k} d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) \le \frac{\epsilon}{2\sigma} \cdot \frac{1}{k} \sum_{i=1}^{k} \sqrt{w(\rho)\, \mathbb{E}_{Q_0}[M_i]} \le \frac{\epsilon}{\sigma\sqrt{2k}} \sqrt{w(\rho)\, \mathbb{E}_{Q_0}[M]} , \]
as claimed.

We now turn to analyzing the player's expected regret. Using the definitions above, this regret can be written as
\[ R = \sum_{t=1}^{T} L_t(X_t) + M - \min_{x \in [k]} \sum_{t=1}^{T} L_t(x) . \]
As a tool in our analysis, we also define the hypothetical regret, with respect to the unclipped loss functions $L'_{1:T}$, that the player would suffer on the same action sequence $X_{1:T}$. Namely,
\[ R' = \sum_{t=1}^{T} L'_t(X_t) + M - \min_{x \in [k]} \sum_{t=1}^{T} L'_t(x) . \]
The next lemma shows that, in expectation, the regret $R$ can be lower bounded in terms of $R'$.
Lemma 4. Assume that $T \ge \max\{k, 6\}$. Then $\mathbb{E}[R] \ge \mathbb{E}[R'] - \epsilon T / 6$.

Proof. We consider the event $B = \{\forall t : L_t = L'_t\}$, and first show that $\mathbb{P}(B) \ge 5/6$. As the process $W_{0:T}$ has depth $d(\rho) \le \lfloor \log_2 T \rfloor + 1 \le 2 \log T$, applying Lemma 1 with $\delta = 1/T \le 1/6$ gives, with probability at least $1 - 1/T \ge 5/6$,
\[ |W_t| \le \sigma \sqrt{2\, d(\rho) \log(T/\delta)} \le \sigma \sqrt{8 \log^2 T} \le 3\sigma \log T \quad \text{for all } t \in [T] . \]
Thus, setting $\sigma = 1/(9 \log T)$, we obtain that
\[ \mathbb{P}\Big( \forall\, t \in [T] : \; \tfrac12 + W_t \in \big[\tfrac16, \tfrac56\big] \Big) \ge \tfrac56 . \]
For $T \ge \max\{k, 6\}$ we have $\epsilon < 1/6$, so $L'_t(x) \in [0,1]$ for all $x \in [k]$ whenever $\tfrac12 + W_t \in [\tfrac16, \tfrac56]$. This implies that $\mathbb{P}(B) \ge 5/6$. If $B$ takes place then $R = R'$; otherwise, $M \le R \le R' \le M + \epsilon T$, so that $R' - R \le \epsilon T$. Therefore,
\[ \mathbb{E}[R'] - \mathbb{E}[R] = \mathbb{E}[R' - R \mid \lnot B] \cdot \mathbb{P}(\lnot B) \le \epsilon T / 6 , \]
as required.
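The clipping event can also be estimated empirically. A Monte Carlo sketch (the parameters are ours; it ignores the small $\epsilon$ shift, so it slightly understates $\mathbb{P}(\lnot B)$):

```python
import math
import numpy as np

def clipping_probability(T, trials, rng):
    """Monte Carlo estimate of P(not B): some |W_t| exceeds 1/2, so that
    1/2 + W_t leaves [0, 1] and clip() becomes active."""
    sigma = 1 / (9 * math.log(T))
    hits = 0
    for _ in range(trials):
        xi = rng.normal(0.0, sigma, size=T + 1)
        W = np.zeros(T + 1)
        for t in range(1, T + 1):
            W[t] = W[t & (t - 1)] + xi[t]     # multi-scale random walk
        hits += np.abs(W[1:]).max() > 0.5
    return hits / trials

print(clipping_probability(T=4096, trials=200, rng=np.random.default_rng(2)))
```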
Next, we relate the hypothetical regret $R'$ to the total variation distance between $Q_0$ and the distributions $Q_i$.

Lemma 5. The quantity $\mathbb{E}[R']$ is lower bounded in terms of the distributions $Q_0, Q_1, \ldots, Q_k$ as
\[ \mathbb{E}[R'] \ge \frac{\epsilon T}{2} - \frac{\epsilon T}{k} \sum_{i=1}^{k} d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) + \mathbb{E}[M] . \]

Proof. For $i \in [k]$, let $N_i$ denote the number of times the player picks action $i$, so that we can write $R' = \epsilon (T - N_\chi) + M$. Consequently,
\[ \mathbb{E}[R'] = \frac{1}{k} \sum_{i=1}^{k} \mathbb{E}\big[ \epsilon (T - N_i) + M \,\big|\, \chi = i \big] = \epsilon T - \frac{\epsilon}{k} \sum_{i=1}^{k} \mathbb{E}_{Q_i}[N_i] + \mathbb{E}[M] . \tag{8} \]
On the other hand, for all $i \in [k]$ and $t \in [T]$, the event $\{X_t = i\}$ is in the $\sigma$-algebra $\mathcal{F}$, so
\[ Q_i(X_t = i) - Q_0(X_t = i) \le d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) . \]
Summing over $t = 1, \ldots, T$ yields $\mathbb{E}_{Q_i}[N_i] - \mathbb{E}_{Q_0}[N_i] \le T \cdot d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i)$, whence
\[ \sum_{i=1}^{k} \mathbb{E}_{Q_i}[N_i] \le T \sum_{i=1}^{k} d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) + \sum_{i=1}^{k} \mathbb{E}_{Q_0}[N_i] = T \sum_{i=1}^{k} d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) + T . \]
Plugging this into Eq. (8) and using $k \ge 2$ gives
\[ \mathbb{E}[R'] \ge \epsilon T - \frac{\epsilon T}{k} \sum_{i=1}^{k} d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) - \frac{\epsilon T}{k} + \mathbb{E}[M] \ge \frac{\epsilon T}{2} - \frac{\epsilon T}{k} \sum_{i=1}^{k} d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) + \mathbb{E}[M] , \]
as claimed.

We are now ready to prove Theorem 2.
Proof of Theorem 2. We first prove the theorem for deterministic players that make no more than $\epsilon T$ switches on any sequence of loss functions, and relax this assumption towards the end of the proof. For algorithms with this property, we have $Q_0(M > \epsilon T) = Q_i(M > \epsilon T) = 0$ for all $i \in [k]$. Since $\{M \ge m\} \in \mathcal{F}$, this implies
\[ \mathbb{E}_{Q_0}[M] - \mathbb{E}_{Q_i}[M] = \sum_{m=1}^{\lfloor \epsilon T \rfloor} \big( Q_0(M \ge m) - Q_i(M \ge m) \big) \le \epsilon T \cdot d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) \]
for all $i \in [k]$, which gives
\[ \mathbb{E}_{Q_0}[M] - \mathbb{E}[M] = \frac{1}{k} \sum_{i=1}^{k} \big( \mathbb{E}_{Q_0}[M] - \mathbb{E}_{Q_i}[M] \big) \le \frac{\epsilon T}{k} \sum_{i=1}^{k} d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) . \]
Combining this inequality with Lemma 4 and Lemma 5 yields
\[ \mathbb{E}[R] \ge \mathbb{E}[R'] - \frac{\epsilon T}{6} \ge \frac{\epsilon T}{2} - \frac{\epsilon T}{6} - \frac{\epsilon T}{k} \sum_{i=1}^{k} d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) + \mathbb{E}[M] \ge \frac{\epsilon T}{3} - \frac{2 \epsilon T}{k} \sum_{i=1}^{k} d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) + \mathbb{E}_{Q_0}[M] . \]
On the other hand, recall Lemma 2, which states that the width of the MRW process is bounded by $w(\rho) \le \lfloor \log_2 T \rfloor + 1 \le 2 \log T$. Corollary 1 together with this bound gives
\[ \frac{1}{k} \sum_{i=1}^{k} d^{\mathcal{F}}_{\mathrm{TV}}(Q_0, Q_i) \le \frac{\epsilon}{\sigma\sqrt{k}} \sqrt{\mathbb{E}_{Q_0}[M] \log T} . \]
Plugging this into the previous inequality and using the notation $m = \sqrt{\mathbb{E}_{Q_0}[M]}$ results in the lower bound
\[ \mathbb{E}[R] \ge \frac{\epsilon T}{3} + m \Big( m - \frac{2\epsilon^2}{\sigma\sqrt{k}}\, T \sqrt{\log T} \Big) . \]
The right-hand side, which is minimized at $m = \big( \epsilon^2 / (\sigma\sqrt{k}) \big) T \sqrt{\log T}$, can be further lower bounded by $\epsilon T/3 - \big( \epsilon^4 / (\sigma^2 k) \big) T^2 \log T$. Using our choice of $\sigma = 1/(9 \log T)$ and $\epsilon = k^{1/3} T^{-1/3} / (9 \log T)$ gives
\[ \mathbb{E}[R] \ge \Big( \frac{1}{27} - \frac{1}{81} \Big) \cdot \frac{k^{1/3} T^{2/3}}{\log T} \ge \frac{k^{1/3} T^{2/3}}{50 \log T} . \tag{9} \]
This proves the theorem for algorithms with the assumed property. In order to relax this assumption, note that we can turn any player algorithm into an algorithm that makes at most $\epsilon T$ switches, simply by halting the algorithm once it makes $\lfloor \epsilon T \rfloor$ switches and repeating its last action on the remaining rounds. The regret $R^*$ of the modified algorithm equals $R$ unless $M > \epsilon T$, and in the latter case $R \ge M > \epsilon T$ while $R^* \le R + \epsilon T \le 2R$, so $\mathbb{E}[R^*] \le 2\,\mathbb{E}[R]$. Since $\mathbb{E}[R^*]$ is lower bounded by the right-hand side of Eq. (9), this implies the claimed lower bound of $k^{1/3} T^{2/3} / (100 \log T)$ on the expected regret of any deterministic player.

Finally, we can prove Theorem 1.
Proof of Theorem 1. Recall that any randomized algorithm is equivalent to an a-priori random choice of a deterministic algorithm, to which the statement of Theorem 2 applies. Hence, since the adversary is oblivious to the player's actions, the statement of Theorem 2 for a randomized player (where the expectation is now taken with respect to both the functions $L_{1:T}$ and the player's random bits) follows by taking the expectation over the player's internal randomization. The fact that the expectation of the regret with respect to the randomization in $L_{1:T}$ is lower bounded by the stated quantity implies that there exists some realization $\ell_{1:T}$ of the variables $L_{1:T}$ for which the regret is lower bounded by the same quantity. This gives the result of Theorem 1.

5 Extensions and Implications
In this section we present a few extensions of our results and discuss several implications.

5.1 Binary Losses
In our construction of a randomized adversary, described in Sec. 2, the loss values $L_t(x)$ are all real numbers in the interval $[0,1]$. If binary losses are desired, we can instead define the loss of action $x$ at time $t$ to be the outcome of a biased coin toss with bias $L_t(x)$, namely, a Bernoulli random variable with expectation $L_t(x)$. In this sequence of binary loss functions, action $\chi$ is consistently better in expectation by an $\epsilon$ gap, which is sufficient in our analysis. Our arguments regarding the player's inability to identify the best action still apply, since the feedback he observes is only further obscured by the additional random noise.
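A minimal sketch of this coin-toss reduction, assuming a loss matrix L of shape (T, k) such as the one produced by the Fig. 1 construction:

```python
import numpy as np

def binarize_losses(L, rng):
    """Each entry becomes a coin toss with bias L[t, x], preserving the
    expected eps gap of the best action chi."""
    return (rng.uniform(size=L.shape) < L).astype(float)

rng = np.random.default_rng(3)
L = rng.uniform(size=(5, 3))          # stand-in for the clipped losses of Fig. 1
print(binarize_losses(L, rng))        # entries in {0.0, 1.0}
```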
5.2 Arbitrary Switching Costs

Assume that each switch incurs a cost of $c > 0$ to the player, instead of a unit cost as before. Repeating the proof of Theorem 2, we are able to get an $\widetilde{\Omega}(c^{1/3} k^{1/3} T^{2/3})$ lower bound, which is tight with respect to $T$, $k$ and $c$ (up to poly-logarithmic factors) in light of the upper bound of Arora et al. (2012).

Theorem 3. Let the cost of each switch be $c > 0$ and assume that $T > c \cdot \max\{k, 6\}$. For any randomized player strategy that relies on bandit feedback, there exists a sequence of loss functions $\ell_{1:T}$ (where $\ell_t : [k] \to [0,1]$) that incurs a regret of $R = \widetilde{\Omega}(c^{1/3} k^{1/3} T^{2/3})$.

Proof. Redefine the gap between the actions in the construction of the functions $L_{1:T}$ to $\epsilon = (ck)^{1/3} T^{-1/3} / (9 \log T)$. Using the same notation as in the proof of Theorem 2, we can show that
\[ \mathbb{E}[R] \ge \frac{\epsilon T}{3} + m \Big( c\, m - \frac{2\epsilon^2}{\sigma\sqrt{k}}\, T \sqrt{\log T} \Big) . \]
The right-hand side is minimized at $m = \big( \epsilon^2 / (c \sigma \sqrt{k}) \big) T \sqrt{\log T}$, and is lower bounded by $\epsilon T/3 - \big( \epsilon^4 / (\sigma^2 c k) \big) T^2 \log T$. Setting $\epsilon = (ck)^{1/3} T^{-1/3} / (9 \log T)$ and using our choice of $\sigma = 1/(9 \log T)$ gives the lower bound
\[ \mathbb{E}[R] \ge \frac{c^{1/3} k^{1/3} T^{2/3}}{50 \log T} . \]
Proceeding as in the proofs of Theorem 2 and Theorem 1, we establish the existence of the required sequence of loss functions $\ell_{1:T}$.

5.3 Tradeoff between Loss and Switches

As a corollary of Theorem 3, we can quantify the tradeoff between the loss accumulated by a multi-armed bandit algorithm and the number of switches it performs. For simplicity, we treat the number of actions $k$ as a constant and state the result only in terms of $T$.
Theorem 4. Let $\mathcal{A}$ be a multi-armed bandit algorithm that guarantees an expected regret (without switching costs) of $\widetilde{O}(T^\alpha)$. Then there exists a sequence of loss functions that forces $\mathcal{A}$ to make $\widetilde{\Omega}(T^{2(1-\alpha)})$ switches.

In particular, the popular EXP3 algorithm (Auer et al., 2002) guarantees a regret of $O(\sqrt{T})$ without switching costs. In this case, Theorem 4 implies that EXP3 can be forced to make $\widetilde{\Omega}(T)$ switches; the block below works through this instance.
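To make the exponent bookkeeping of the proof below concrete, here is the instantiation for EXP3 with a hypothetical switch budget of $\widetilde{O}(T^{0.9})$ (the numbers $0.9$ and $0.52$ are our illustrative choices, not quantities from the theorem):

```latex
% Worked instance of the proof of Theorem 4 for EXP3 (alpha = 1/2).
\[
  \alpha = \tfrac12,\;\; \beta = 0.9
  \;\Rightarrow\; \alpha + \tfrac{\beta}{2} = 0.95 < 1,
  \qquad \text{pick } \gamma = 0.52 \in \bigl(\alpha,\, 1 - \tfrac{\beta}{2}\bigr) = (0.5,\, 0.55).
\]
\[
  c = T^{3\gamma - 2} = T^{-0.44}
  \;\Rightarrow\;
  \widetilde{O}\bigl(T^{\alpha} + c\,T^{\beta}\bigr)
  = \widetilde{O}\bigl(T^{0.5} + T^{0.46}\bigr)
  = \widetilde{o}\bigl(T^{0.52}\bigr),
\]
% which contradicts the Omega-tilde(c^{1/3} T^{2/3}) = Omega-tilde(T^{0.52})
% bound of Theorem 3; hence EXP3 cannot keep its switch count at
% O-tilde(T^{0.9}), and in fact must make Omega-tilde(T^{2(1-alpha)}) =
% Omega-tilde(T) switches.
```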
Proof of Theorem 4. Assume the contrary, i.e., that $\mathcal{A}$ can guarantee a regret of $\widetilde{O}(T^\alpha)$ (without switching costs) while making $\widetilde{O}(T^\beta)$ switches over any sequence of $T$ loss functions, with $\alpha + \beta/2 < 1$. In this case, we can pick a real number $\gamma$ such that $\alpha < \gamma < 1 - \beta/2$. Consider the performance of this algorithm in a setting where the cost of a switch is $c = T^{3\gamma - 2}$. Clearly, the expected regret (including switching costs) of the algorithm in this setting is upper bounded by
\[ \widetilde{O}\big( T^\alpha + T^{3\gamma - 2} \cdot T^\beta \big) = \widetilde{o}(T^\gamma) , \]
over any sequence of loss functions, as $\alpha < \gamma$ and $3\gamma - 2 + \beta < \gamma$. This contradicts Theorem 3, which guarantees the existence of a loss sequence that incurs a regret (including switching costs) of $\widetilde{\Omega}\big( T^{(3\gamma - 2)/3} \cdot T^{2/3} \big) = \widetilde{\Omega}(T^\gamma)$.

5.4 Adversarial Markov Decision Processes

The multi-armed bandit problem with switching costs is a special case of the online adversarial deterministic Markov decision process (ADMDP) with bandit feedback (see Dekel and Hazan (2013) for a formal description of this setting). The important aspect of the ADMDP setting is that the player has a state, and that his loss on each round depends both on his action and on his current state. Moreover, the player's action on round $t$ determines his state on round $t + 1$. The $k$-armed bandit problem with switching costs can be described as a $k$-state ADMDP, where each state represents the player's previous action. The player incurs the loss associated with the action he chooses and pays an additional cost whenever he changes his state.

As a result, our lower bound applies to the class of ADMDP problems. Dekel and Hazan (2013) prove a matching upper bound, which implies that the (undiscounted) minimax regret of the ADMDP problem is $\widetilde{\Theta}(T^{2/3})$. The ADMDP setting belongs to the more general class of adversarial MDPs with bandit feedback (Yu et al., 2009; Neu et al., 2010), where the state transitions are allowed to be stochastic. This implies an $\widetilde{\Omega}(T^{2/3})$ lower bound on the (undiscounted) minimax regret of the general setting.
6 Summary

In this paper, we proved that the $T$-round $k$-action multi-armed bandit problem with switching costs has a minimax regret of $\widetilde{\Theta}(k^{1/3} T^{2/3})$, and is therefore strictly harder than the corresponding experts problem (with full feedback). To the best of our knowledge, this is the first example of a setting in which learning with bandit feedback is significantly harder than learning with full-information feedback (in terms of the dependence on $T$). Our analysis shows that the difficulty of this problem stems from the player's need to pay for exploring the quality of the different actions. Since this problem is a special case of online learning with bandit feedback against a bounded-memory adaptive adversary, we conclude that the minimax regret of the general setting is also $\widetilde{\Omega}(T^{2/3})$, which matches the upper bounds of Arora et al. (2012). We also showed how our construction resolves several other open problems in online learning. Moreover, we believe that the multi-scale random walk, defined in Sec. 3.2, will prove to be a useful tool in other settings.

References
R. Arora, O. Dekel, and A. Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, 2012.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the ACM, 44(3):427–485, May 1997.

N. Cesa-Bianchi, O. Dekel, and O. Shamir. Online learning with switching costs and other adaptive adversaries. In Advances in Neural Information Processing Systems 26, 2013.

F. R. K. Chung and P. D. Seymour. Graphs with small bandwidth and cutwidth. Discrete Mathematics, 75(1–3):113–119, 1989.

T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2006.

O. Dekel and E. Hazan. Better rates for any adversarial deterministic MDP. In Proceedings of the Thirtieth International Conference on Machine Learning, 2013.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

S. Geulen, B. Vöcking, and M. Winkler. Regret minimization for online buffering problems using the weighted majority algorithm. In Proceedings of the 23rd International Conference on Learning Theory, pages 132–143, 2010.

A. György and G. Neu. Near-optimal rates for limited-delay universal lossy source coding. In Information Theory Proceedings (ISIT), 2011 IEEE International Symposium on, pages 2218–2222. IEEE, 2011.

A. Kalai and S. Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71:291–307, 2005.

N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.

G. Neu, A. György, C. Szepesvári, and A. Antos. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems 23, pages 1804–1812, 2010.

A. Yao. Probabilistic computations: Toward a unified measure of complexity. In Proceedings of the 18th IEEE Symposium on Foundations of Computer Science (FOCS), pages 222–227, 1977.

J. Y. Yu, S. Mannor, and N. Shimkin. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34(3):737–757, 2009.