Improper Reinforcement Learning with Gradient-based Policy Optimization
Mohammadi Zaki, Avinash Mohan, Aditya Gopalan, and Shie Mannor
Indian Institute of Science, Bengaluru; Technion, Haifa
Email: [email protected], [email protected], [email protected], [email protected]
Abstract
We consider an improper reinforcement learning setting where the learner is given $M$ base controllers for an unknown Markov Decision Process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. We propose a gradient-based approach that operates over a class of improper mixtures of the controllers. The value function of the mixture and its gradient may not be available in closed form; however, we show that we can employ rollouts and simultaneous perturbation stochastic approximation (SPSA) for explicit gradient descent optimization. We derive convergence and convergence rate guarantees for the approach assuming access to a gradient oracle. Numerical results on a challenging constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when each constituent policy at its disposal is unstable.

1 Introduction

A natural approach to designing effective controllers for large, complex systems is to first approximate the system using a tried-and-true Markov decision process (MDP) model, such as the Linear Quadratic Regulator (LQR) [12] or tabular MDPs [5], and then compute (near-)optimal policies for the assumed model. Though this yields favorable results in principle, it is quite possible that errors in describing or understanding the system (leading to misspecified models) cause 'overfitting', resulting in subpar controllers in practice. Moreover, in many cases stability of the designed controller may be crucial, or more desirable than optimizing a fine-grained cost function. From the controller design standpoint, it is often easier, cheaper and more interpretable to specify or hardcode control policies based on domain-specific principles, e.g., anti-lock braking system (ABS) controllers [30]. For these reasons, we investigate in this paper a promising, general-purpose reinforcement learning (RL) approach to designing controllers given pre-designed ensembles of basic or atomic controllers, which (a) allows for flexibly combining the given controllers to obtain richer policies than the atomic policies, and, at the same time, (b) can preserve the basic structure of the given class of controllers and confer a high degree of interpretability on the resulting hybrid policy.

Overview of the approach.
We consider a situation where we are given 'black-box' access to $M$ controllers (maps from states to action distributions) $\{K_1, \ldots, K_M\}$ for an unknown MDP; by this we mean that we can choose to invoke any of the given controllers at any point during the operation of the system. (We use the terms 'policy' and 'controller' interchangeably in this article.) With the understanding that the given family of controllers is reasonable, we frame the problem of learning the best combination of the controllers by trial and error. We first set up an improper policy class of all randomized mixtures of the $M$ given controllers; each such mixture is parameterized by a probability distribution over the $M$ base controllers. Applying an improper policy in this class amounts to selecting, independently at each time, a base controller according to this distribution and implementing the recommended action as a function of the present state of the system.

The learner's goal, therefore, is to find the best-performing mixture policy by iteratively testing from the pool of given controllers and observing the resulting state-action-reward trajectory. To this end we develop a new gradient-based RL optimization algorithm that operates on a softmax parameterization of each mixture (probability distribution) of the $M$ basic controllers, and takes steps by following the gradient of the return of the current probability distribution to reach the optimum mixture. This is reminiscent of the standard policy gradient (PG) method with a softmax parameterization of the policy over a discrete state and action space. However, there is a basic difference: the underlying parameterization in our setting is over a set of given controllers, which could be potentially abstract and defined for complex MDPs with continuous state/action spaces, instead of the PG view where the parameterization directly defines the policy in terms of the state-action map. Our algorithm, therefore, hews more closely to a meta-RL framework, in that we operate over a set of controllers that have themselves been designed using some optimization framework to which we are agnostic. This confers a great deal of generality on our approach, since the class of controllers can now be chosen to promote any desirable secondary characteristic such as interpretability, ease of implementation or cost effectiveness.

It is also worth noting that our approach is different from treating each of the base controllers as an 'expert' and applying standard mixture-of-experts algorithms, e.g., Hedge or Exponentiated Gradient [21, 4, 18, 27]. Whereas the latter approach is tailored to converge to the best single controller (under the usual gradient approximation framework) and hence qualifies as a 'proper' learning algorithm, our optimization problem is posed over the improper class of mixture policies, which not only contains each atomic controller but also allows a true mixture (i.e., one which puts positive probability on at least two elements) of many atomic controllers to achieve optimality; we exhibit concrete examples where this is indeed possible.
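To make the execution of an improper mixture concrete, here is a minimal Python sketch (our illustration, not the paper's code; the environment interface `env.reset`/`env.step` and the list `controllers` of state-to-action maps are assumptions):

```python
import numpy as np

def run_mixture(env, controllers, w, horizon, seed=0):
    """Roll out the improper mixture given by the distribution w over
    M base controllers: at every step a controller is drawn i.i.d. from w
    and the action it recommends at the current state is executed."""
    rng = np.random.default_rng(seed)
    s = env.reset()
    total = 0.0
    for _ in range(horizon):
        m = rng.choice(len(controllers), p=w)   # sample a base controller
        a = controllers[m](s, rng)              # controller: state -> action
        s, r = env.step(a)
        total += r
    return total
```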
Our Contributions. We make the following contributions in this context:

1. We begin with an example motivating the need for improper learning in RL (Sec. 2), where we also informally describe our approach and its strengths. To the best of our knowledge, the improper learning-based approach we adopt in this paper is novel. We then formally define the problem and supply related notation (Sec. 3).

2. We develop a gradient-based RL algorithm to iteratively tune a softmax parameterization of an improper (mixture) policy defined over the base controllers (Algorithm 1). While this algorithm, Softmax Policy Gradient (or SoftMax PG), relies on the availability of value function gradients, we later propose a modification that we call GradEst (Algorithm 4) to rectify this. GradEst uses a combination of rollouts and Simultaneous Perturbation Stochastic Approximation (SPSA) [8] to estimate the value gradient at the current mixture distribution.

3. We begin analyzing the performance of SoftMax PG with the instructive special case of multi-armed bandits (Sec. 4.1). For a horizon of $T$ steps, we recover the well-known $O(\log T)$ bound on regret [20] with both perfect and estimated value gradients (see Algorithm 2). Further, when perfect value gradients are available, we show an $O(1/t)$ rate of convergence to the optimal value function, $t$ being the current round.

4. Moving on to the case of finite state-action MDPs, we show a convergence rate of $O(1/t)$ to the optimal value function. To do this we employ a novel non-uniform Łojasiewicz-type inequality [22] that lower bounds the 2-norm of the value gradient in terms of the suboptimality of the current mixture policy's value. Essentially, this helps establish that when the gradient of the value function hits zero, the value function is itself close to the optimum. Along the way, we also establish the $\beta$-smoothness of the value function of our improper controller, which may be of independent interest.

5. We corroborate our theory using multiple simulations on a well-known constrained queueing system example. As discussed in Sec. 2, this is a countable-state, finite-action MDP. In our experiments (see Sec. 6), we eschew access to exact value gradients and instead rely on a combination of rollouts and SPSA to estimate them. Results show that our algorithm quickly converges to the correct mixture of the available atomic controllers.

Related Work. Before we delve into our problem, it is vital to distinguish the approach investigated in the present paper from the plethora of existing algorithms based on 'proper learning'. Essentially, these algorithms try to find an (approximately) optimal policy for the MDP under investigation; in stochastic control parlance, they try to get close to the Bellman fixed point of the MDP. These approaches can broadly be classified into two groups: model-based and model-free.

The former is based on first learning the dynamics of the unknown MDP, followed by planning for the learnt model. Algorithms in this class include Thompson Sampling-based approaches [28, 29, 16] and optimism-based approaches such as the UCRL algorithm [5], both achieving order-wise optimal $O(\sqrt{T})$ regret.

A particular class of MDPs that has been studied extensively is the Linear Quadratic Regulator (LQR), a continuous state-action MDP with linear state dynamics and quadratic cost [12]. Let $x_t \in \mathbb{R}^m$ be the current state and let $u_t \in \mathbb{R}^n$ be the action applied at time $t$. The infinite-horizon average-cost minimization problem for LQR is to find a policy choosing actions $\{u_t\}_{t \geq 0}$ so as to
\[ \text{minimize} \ \lim_{T \to \infty} \mathbb{E}\left[\frac{1}{T} \sum_{t=1}^{T} x_t^\top Q x_t + u_t^\top R u_t\right] \quad \text{such that} \ x_{t+1} = A x_t + B u_t + n(t), \]
where $n(t)$ is i.i.d. zero-mean noise and the matrices $A$ and $B$ are unknown to the learner. Earlier works like [1, 17] proposed algorithms based on the well-known optimism principle (with confidence ellipsoids around estimates of $A$ and $B$); these show regret bounds of $O(\sqrt{T})$. However, such approaches do not focus on the stability of the closed-loop system. [12] describe a robust controller design which seeks to minimize the worst-case performance of the system given the error in the estimation process.
They show a sample complexity analysis guaranteeing a convergence rate of $O(1/\sqrt{N})$ to the optimal policy for the given LQR, $N$ being the number of rollouts. More recently, certainty equivalence [24] was shown to achieve $O(\sqrt{T})$ regret for LQRs. Further, [9] show that it is possible to achieve $O(\log T)$ regret if either one of the matrices $A$ or $B$ is known to the learner, and also provide a lower bound showing that $\Omega(\sqrt{T})$ regret is unavoidable when both are unknown.

The model-free approach, on the other hand, bypasses model estimation and directly learns the value function of the unknown MDP. While the most popular among these have historically been Q-learning, TD-learning [37] and SARSA [31], algorithms based on gradient-based policy optimization have been gaining considerable attention of late, following their stunning success at playing the game of Go, long viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves. [35] and more recently [36] use the policy gradient method combined with a neural network representation to beat human experts. Indeed, the policy gradient method has become a cornerstone of modern RL and has given birth to an entire class of highly efficient policy search algorithms such as TRPO [32], PPO [33], and MADDPG [23].

Despite its excellent empirical performance, not much was known about theoretical guarantees for this approach until recently. There is now a growing body of promising results showing convergence rates for PG algorithms over finite state-action MDPs [2, 34, 7, 25], where the parameterization is over the entire space of state-action pairs, i.e., $\mathbb{R}^{S \times A}$. In particular, [7] show that projected gradient descent does not suffer from spurious local optima on the simplex, and [2] show that with the softmax parameterization, PG converges to the global optimum asymptotically. [34] show an $O(1/\sqrt{t})$ convergence rate for mirror descent. [25] show that with softmax policy gradient, convergence to the global optimum occurs at rate $O(1/t)$, and at rate $O(e^{-t})$ with entropy regularization.

We end this section noting once again that all of the above works concern proper learning. Improper learning, on the other hand, has been studied separately in statistical learning theory in the i.i.d. setting [11, 10]. In this framework, also called representation-independent learning, the learning algorithm is not restricted to output a hypothesis from the given set of hypotheses.

To our knowledge, [3] is the only existing work that attempts to frame and solve policy optimization over an improper class, via boosting a given class of controllers. However, that paper is situated in the rather different context of non-stochastic control and assumes perfect knowledge of (i) the memory-boundedness of the MDP, and (ii) the state noise vector in every round, which amounts to essentially knowing the MDP transition dynamics. We work in the stochastic MDP setting and, moreover, assume no access to the MDP's transition kernel. Further, [3] also assume that all the atomic controllers available to them are stabilizing, which, when working with an unknown MDP, is a very strong assumption to make. We make no such assumptions on our atomic controller class and, as we show in Sec. 2 and Sec. 6, our algorithms even begin with provably unstable controllers and yet succeed in stabilizing the system.

In summary, the problem that we address concerns finding the best among a given class of controllers. None of these need be optimal for the MDP at hand. Moreover, our PG algorithm could very well converge to an improper mixture of these controllers, meaning that the output of our algorithm need not be any of the atomic controllers we are provided with. This setting, to the best of our knowledge, has not hitherto been investigated in the RL literature.
2 A Motivating Example

An ideal example that helps motivate the need for improper learning, while simultaneously illustrating its capabilities, is the scheduling problem in a constrained queueing network. Such systems are widely used to model communication networks in the literature [6].

The system, shown in Fig. 1a, comprises two queues fed by independent, stochastic arrival processes $A_i(t)$, $i \in \{1, 2\}$, $t \in \mathbb{N}$. The length of Queue $i$, measured at the beginning of time slot $t$, is denoted by $Q_i(t) \in \mathbb{Z}_+$. A common server serves both queues and can drain at most one packet from the system in a time slot (hence, a constrained queueing system). The server, therefore, needs to decide which of the two queues it intends to serve in a given slot (we assume that once the server chooses to serve a packet, service succeeds with probability 1). The server's decision is denoted by the vector $D(t) \in \mathcal{A} := \{[0,0], [1,0], [0,1]\}$, where a '1' denotes service and a '0' denotes lack thereof.

Figure 1: Motivating example: constrained queueing network with 2 queues. (a) $Q_i(t)$ is the length of Queue $i$ ($i \in \{1,2\}$) at the beginning of time slot $t$, $A_i(t)$ is its packet arrival process, and $D(t) \in \{[0,0], [1,0], [0,1]\}$. (b) $K_1$ and $K_2$ by themselves can only stabilize $C_1 \cup C_2$ (gray rectangles). With improper learning, however, we enlarge the set of stabilizable arrival rates by the triangle $\Delta ABC$ (the extra capacity achieved through improper learning), bounded by the line $\lambda_1 + \lambda_2 = 1$.

The capacity region of this network (see Fig. 1b) is given by
\[ \Lambda := \left\{ \lambda \in \mathbb{R}^2_+ : \lambda_1 + \lambda_2 < 1 \right\}. \]
For simplicity, we assume that the processes $(A_i(t))_{t=0}^{\infty}$ are both i.i.d. Bernoulli, with $\mathbb{E} A_i(t) = \lambda_i$. Note that the arrival rate $\lambda = [\lambda_1, \lambda_2]$ is unknown to the learner. Defining $(x)^+ := \max\{0, x\}$ for all $x \in \mathbb{R}$, the queue length evolution is given by the equations
\[ Q_i(t+1) = (Q_i(t) - D_i(t))^+ + A_i(t+1), \quad i \in \{1, 2\}. \tag{1} \]
Let $\mathcal{F}_t$ denote the state-action history until time $t$, and $\mathcal{P}(\mathcal{A})$ the space of all probability distributions on $\mathcal{A}$. We aim to find a policy $\pi : \mathcal{F}_t \to \mathcal{P}(\mathcal{A})$ to minimize the discounted system backlog given by
\[ J^\pi(Q(0)) := \mathbb{E}^\pi_{Q(0)} \sum_{t=0}^{\infty} \gamma^t \left( Q_1(t) + Q_2(t) \right). \tag{2} \]
Any policy $\pi$ with $J^\pi(Q(0)) < \infty$ for all $Q(0) \in \mathbb{Z}^2_+$ is said to be stabilizing (or, equivalently, a stable policy). It is well known that there exist stabilizing policies iff $\lambda_1 + \lambda_2 < 1$ [39]. Moreover, the stationary policy $\pi_{\mu_1, \mu_2}$ defined by
\[ \pi_{\mu_1, \mu_2}(Q) = [1,0] \ \text{w.p.} \ \mu_1, \quad [0,1] \ \text{w.p.} \ \mu_2, \quad [0,0] \ \text{w.p.} \ 1 - \mu_1 - \mu_2, \quad \forall Q \in \mathbb{Z}^2_+ \tag{3} \]
can provably stabilize the system iff $\mu_i > \lambda_i$ for all $i \in \{1, 2\}$. Now, assume our control set consists of two stationary policies $K_1, K_2$ with $K_1 \equiv \pi_{\varepsilon, 1-\varepsilon}$, $K_2 \equiv \pi_{1-\varepsilon, \varepsilon}$ and sufficiently small $\varepsilon > 0$; that is, we have $M = 2$ controllers $K_1, K_2$. Clearly, neither of these can, by itself, stabilize a network with $\lambda = [0.49, 0.49]$. However, an improper mixture of the two that selects $K_1$ and $K_2$ each with probability $1/2$ can. Indeed, our technique stabilizes every arrival rate in $C_1 \cup C_2 \cup \Delta ABC$, without prior knowledge of $[\lambda_1, \lambda_2]$. In other words, our algorithm enlarges the stability region by the triangle $\Delta ABC$, over and above $C_1 \cup C_2$.
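The stabilizing effect of mixing is easy to reproduce in simulation. The following minimal Python sketch (our illustration, not the authors' experiment code) simulates the two-queue system of Fig. 1a under a mixture of $K_1 = \pi_{\varepsilon, 1-\varepsilon}$ and $K_2 = \pi_{1-\varepsilon, \varepsilon}$; with $\lambda = [0.49, 0.49]$, either controller alone lets one queue drift, while the 50/50 mixture keeps the time-averaged backlog bounded:

```python
import numpy as np

def simulate(lmbda, mix, T=200_000, eps=0.05, seed=0):
    """Two-queue system of Fig. 1a under a mixture of K1 = pi_{eps,1-eps}
    and K2 = pi_{1-eps,eps}; `mix` is the probability of picking K1."""
    rng = np.random.default_rng(seed)
    q = np.zeros(2, dtype=np.int64)
    backlog = 0
    for _ in range(T):
        k = 0 if rng.random() < mix else 1            # sample a base controller
        mu = (eps, 1 - eps) if k == 0 else (1 - eps, eps)
        serve = 0 if rng.random() < mu[0] else 1      # controller's random decision
        if q[serve] > 0:
            q[serve] -= 1                             # at most one departure per slot
        q += rng.random(2) < lmbda                    # Bernoulli arrivals, eq. (1)
        backlog += q.sum()
    return backlog / T                                # time-averaged total backlog

# Each controller alone is unstable at lambda = [0.49, 0.49]; the 50/50 mixture is not.
for m in (0.0, 1.0, 0.5):
    print(m, simulate(np.array([0.49, 0.49]), m))
```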
In Sec. 6, we expand upon this example and show, among other things, (1) how our improper learner converges to the stabilizing mixture of the available policies, and (2) that if the optimal policy is among the $M$ available controllers, our algorithm finds and converges to it.

3 Problem Statement and Notation
A (finite) Markov Decision Process $(\mathcal{S}, \mathcal{A}, P, r, \rho, \gamma)$ is specified by a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a transition probability matrix $P$, where $P(\tilde{s} \mid s, a)$ is the probability of transitioning into state $\tilde{s}$ upon taking action $a \in \mathcal{A}$ in state $s$, a single-stage reward function $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, a starting state distribution $\rho$ over $\mathcal{S}$, and a discount factor $\gamma \in (0, 1)$. A policy or controller $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$ specifies a decision-making strategy in which the learner chooses actions $(a_t)$ adaptively based on the current state $(s_t)$, i.e., $a_t \sim \pi(s_t)$. Together with $P$, the pair $\pi$ and $\rho$ induces a probability measure $\mathbb{P}^\pi_\rho$ on the space of all sample paths of the underlying Markov process, and we denote by $\mathbb{E}^\pi_\rho$ the associated expectation operator. The value function of policy $\pi$ (also called the value of policy $\pi$), denoted by $V^\pi$, is the total discounted reward obtained by following $\pi$, i.e.,
\[ V^\pi(\rho) := \mathbb{E}^\pi_\rho \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t). \tag{4} \]

Improper Learning.
We assume that the learner is provided with a finite number of (stationary) controllers $\mathcal{C} := \{K_1, \cdots, K_M\}$ and, as described below, sets up a parameterized improper policy class $\mathcal{I}_{\text{soft}}(\mathcal{C})$ that depends on $\mathcal{C}$. The aim, therefore, is to identify the best policy for the given MDP within this class, i.e.,
\[ \pi^* = \operatorname*{argmax}_{\pi \in \mathcal{I}_{\text{soft}}(\mathcal{C})} V^\pi(\rho). \tag{5} \]
We now describe the construction of the class $\mathcal{I}_{\text{soft}}(\mathcal{C})$.

The Softmax Policy Class.
We assign a weight $\theta_m \in \mathbb{R}$ to each controller $K_m \in \mathcal{C}$ and define $\theta := [\theta_1, \cdots, \theta_M]$. The improper class $\mathcal{I}_{\text{soft}}$ is parameterized by $\theta$ as follows. In each round, the policy $\pi_\theta \in \mathcal{I}_{\text{soft}}(\mathcal{C})$ chooses a controller drawn from $\mathrm{softmax}(\theta)$, i.e., the probability of choosing controller $K_m$ is given by
\[ \pi_\theta(m) := \frac{e^{\theta_m}}{\sum_{m'=1}^{M} e^{\theta_{m'}}}. \tag{6} \]
Note, therefore, that in every round our algorithm interacts with the MDP only through the controller sampled in that round (see Figure 2). In the rest of the paper we deal exclusively with a fixed, given $\mathcal{C}$ and the resultant $\mathcal{I}_{\text{soft}}$; we therefore overload the notation $\pi_{\theta_t}(a \mid s)$, for any $a \in \mathcal{A}$ and $s \in \mathcal{S}$, to denote the probability with which the algorithm chooses action $a$ in state $s$ at time $t$. For ease of notation, whenever the context is clear, we also drop the subscript $\theta$, i.e., $\pi_{\theta_t} \equiv \pi_t$. Hence, we have at any time $t \geq 1$,
\[ \pi_{\theta_t}(a \mid s) = \sum_{m=1}^{M} \pi_{\theta_t}(m)\, K_m(s, a). \tag{7} \]
Since we deal with gradient-based methods in the sequel, we define the value gradient of policy $\pi_\theta \in \mathcal{I}_{\text{soft}}$ by $\nabla_\theta V^{\pi_\theta} \equiv \frac{dV^{\pi_{\theta_t}}}{d\theta_t}$. We say that $V^{\pi_\theta}$ is $\beta$-smooth if $\nabla_\theta V^{\pi_\theta}$ is $\beta$-Lipschitz [2]. Finally, for any two integers $a$ and $b$, let $I_{ab}$ denote the indicator that $a = b$.
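As a quick illustration of (6) and (7), the following sketch (assuming the base controllers are given as an $M \times S \times A$ NumPy array `K`, an assumption for illustration) computes the mixture weights and the induced state-action policy:

```python
import numpy as np

def mixture_weights(theta):
    """Eq. (6): softmax distribution over the M base controllers."""
    z = np.exp(theta - theta.max())        # shift for numerical stability
    return z / z.sum()

def induced_policy(theta, K):
    """Eq. (7): pi_theta(a|s) = sum_m pi_theta(m) K_m(s, a).

    K has shape (M, S, A); the result has shape (S, A)."""
    w = mixture_weights(theta)
    return np.einsum('m,msa->sa', w, K)
```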
Contrast with the traditional PG approach: We emphasize that this problem is fundamentally different from the traditional policy gradient approach, where the parameterization completely defines the policy in terms of the state-action mapping. One can use the methodology followed in [25] by assigning a parameter $\theta_{s,m}$ for every $s \in \mathcal{S}$, $m \in [M]$. With some calculation, it can be shown that this is equivalent to the tabular setting with $S$ states and $M$ actions, with the new 'reward' defined by $\tilde{r}(s, m) := \sum_{a \in \mathcal{A}} K_m(s, a)\, r(s, a)$, where $r(s, a)$ is the usual expected reward obtained at state $s$ playing action $a \in \mathcal{A}$. By following the approach in [25] on this modified setting, it can be shown that the policy converges for each $s \in \mathcal{S}$: $\pi_\theta(m^*(s) \mid s) \to 1$, which is the optimal policy in that setting. However, the problem that we address is to select a single mixture (from within $\mathcal{I}_{\text{soft}}$, the convex hull of the given $M$ controllers) which would guarantee maximum return if one plays that single mixture for all time.

Figure 2: A black-box view of our improper learning approach through softmax policy gradient.

4 The SoftMax PG Algorithm

Our policy gradient algorithm, SoftMax PG, is shown in Algorithm 1. The parameters $\theta \in \mathbb{R}^M$ which define the policy are updated by following the gradient of the value function at the current policy parameters. The policy $\pi_\theta(m)$ is defined as in (6).

Algorithm 1 Softmax Policy Gradient (SoftMax PG)
Input: learning rate $\eta > 0$. Initialize $\theta_m = 1$ for all $m \in [M]$, $s_0 \sim \mu$.
for $t = 1$ to $T$ do
    Choose controller $m_t \sim \pi_t$.
    Play action $a_t \sim K_{m_t}(s_t, \cdot)$.
    Observe $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
    Update: $\theta_{t+1} = \theta_t + \eta \, \frac{dV^{\pi_{\theta_t}}(\mu)}{d\theta_t}$.
end for

In the rest of this section we consider the instructive sub-case $S = 1$, i.e., the multi-armed bandit. We provide regret bounds for two cases: (1) when the value gradient $\frac{dV^{\pi_{\theta_t}}(\mu)}{d\theta_t}$ (in the gradient update) is available in each round, and (2) when it needs to be estimated.
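A minimal sketch of the Algorithm 1 parameter update, assuming access to a gradient oracle `grad_V` (e.g., the closed form available for bandits in Sec. 4.1, or the sample-based estimator of Algorithm 4). With an exact oracle the parameter path does not depend on the sampled trajectory, so the sketch keeps only the ascent step:

```python
import numpy as np

def softmax_pg(grad_V, M, eta, T):
    """Algorithm 1 (SoftMax PG) with a gradient oracle.

    grad_V(theta) must return dV^{pi_theta}(mu)/dtheta as an R^M vector."""
    theta = np.ones(M)                         # theta_m = 1 for all m
    for _ in range(T):
        theta = theta + eta * grad_V(theta)    # gradient ascent on the value
    z = np.exp(theta - theta.max())
    return z / z.sum()                         # final mixture pi_theta
```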
4.1 Bandit over Bandits

Note once again that each controller in this case is a probability distribution over the $A$ arms of the bandit. We consider the scenario where the agent, at each time $t \geq 1$, has to choose a probability distribution $K_{m_t}$ from a set of $M$ probability distributions over the actions $\mathcal{A}$. She then plays an action $a_t \sim K_{m_t}$. This differs from the standard MAB in that the learner cannot choose actions directly; instead, she chooses from a given set of controllers to play actions. Note that the $V$ function has no state argument since $S = 1$. Let $\mu \in [0,1]^A$ be the mean reward vector of the arms. The value function for any given mixture $\pi \in \mathcal{P}([M])$ is
\[ V^\pi := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, \pi\right] = \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}[r_t \mid \pi] = \sum_{t=0}^{\infty} \gamma^t \sum_{a \in \mathcal{A}} \sum_{m=1}^{M} \pi(m)\, K_m(a)\, \mu_a = \frac{1}{1-\gamma} \sum_{m=1}^{M} \pi_m\, \mu^\top K_m = \frac{1}{1-\gamma} \sum_{m=1}^{M} \pi_m\, r^\mu_m, \tag{8} \]
where the interpretation of $r^\mu_m$ is that it is the mean reward one obtains if controller $m$ is chosen at any round $t$. Since $V^\pi$ is linear in $\pi$, the maximum is attained at one of the base controllers: $\pi^*$ puts mass 1 on $m^*$, where $m^* := \operatorname*{argmax}_{m \in [M]} V^{K_m}$, and $V^{K_m}$ is the value obtained using $K_m$ for all time. In the sequel, we assume $\Delta_i := r^\mu_{m^*} - r^\mu_i > 0$ for all $i \neq m^*$. With access to the exact value gradient at each step, we have the following result.
Theorem 4.1.
With $\eta = \frac{2(1-\gamma)}{5}$ and $\theta^{(1)}_m = 1/M$ for all $m \in [M]$, and with availability of the true gradient, we have, $\forall t \geq 1$,
\[ V^{\pi^*} - V^{\pi_{\theta_t}} \leq \frac{5M^2}{(1-\gamma)\, t}. \]
Also, defining regret for a time horizon of $T$ rounds as
\[ R(T) := \sum_{t=1}^{T} \left( V^{\pi^*} - V^{\pi_{\theta_t}} \right), \tag{9} \]
we show as a corollary to Thm. 4.1 that,

Corollary 4.1.1.
\[ R(T) \leq \min\left\{ \frac{5M^2}{1-\gamma}\,(1 + \log T),\ \frac{\sqrt{5}\, M}{1-\gamma}\, \sqrt{T} \right\}. \]
Proof sketch of Theorem 4.1. The proof flows along similar lines as that of Theorem 3 in [25]. First we establish that $V^\pi$ is $\frac{5}{2(1-\gamma)}$-smooth. Since smoothness alone does not guarantee convergence or a convergence rate, we also establish a lower bound on the norm of the gradient of the value function at every step $t$, as below (inequalities of this type are called Łojasiewicz inequalities [22]).

Lemma 4.2 (Lower bound on norm of gradient).
\[ \left\| \frac{\partial V^{\pi_\theta}}{\partial \theta} \right\|_2 \geq \pi_\theta(m^*) \left( V^{\pi^*} - V^{\pi_\theta} \right). \]

Intuitively, this 'gradient domination' helps the algorithm avoid getting stuck at a local maximum. We refer the reader to Section C for all the proofs for this section.
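For the bandit case the gradient oracle is available in closed form (eq. (13) in Appendix C): $\frac{\partial V^{\pi_\theta}}{\partial \theta_m} = \frac{1}{1-\gamma}\, \pi_\theta(m)(r_m - \pi_\theta^\top r)$. The following sketch (our illustration) implements it and checks the gradient-domination bound of Lemma 4.2 numerically on a random instance:

```python
import numpy as np

rng = np.random.default_rng(1)
M, gamma = 5, 0.9
r = rng.random(M)                          # r^mu_m: mean reward of controller m

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_V(theta):
    """Closed-form bandit value gradient, eq. (13)."""
    pi = softmax(theta)
    return pi * (r - pi @ r) / (1 - gamma)

theta = rng.normal(size=M)
pi = softmax(theta)
m_star = int(r.argmax())
lhs = np.linalg.norm(grad_V(theta))
rhs = pi[m_star] * (r[m_star] - pi @ r) / (1 - gamma)   # pi(m*) (V* - V^pi)
assert lhs >= rhs - 1e-12                  # Lemma 4.2 holds at this point
```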
When value gradients are unavailable, we follow a direct policy gradient on the simplex instead of a softmax parameterization (see Algorithm 2). At each round $t \geq 1$, the learning rate $\eta$ is chosen asynchronously for each controller $m$, as $\eta = \alpha \pi_t(m)$ for some $\alpha \in (0, 1)$, to ensure that the iterates remain inside the simplex. Note that finding the best mixture is equivalent to solving
\[ \min_{\pi \in \mathcal{P}([M])} \sum_{m=1}^{M} \pi(m) \left( r^\mu(m^*) - r^\mu(m) \right). \]
A direct gradient with respect to the parameters $\pi(m)$ gives us the update rule of the policy gradient algorithm. The other changes in the update step (eq. (10)) stem from the fact that the true means of the arms are unavailable, and from importance sampling.

Algorithm 2 Projection-free Policy Gradient (for MABs)
Input: learning rate $\eta \in (0, 1)$. Initialize $\pi_1(m) = 1/M$ for all $m \in [M]$.
for $t = 1$ to $T$ do
    $m^*(t) \leftarrow \operatorname*{argmax}_{m \in [M]} \pi_t(m)$.
    Choose controller $m_t \sim \pi_t$.
    Play action $a_t \sim K_{m_t}$; receive reward $R_{m_t}$ by pulling arm $a_t$.
    Update, for all $m \in [M]$, $m \neq m^*(t)$:
    \[ \pi_{t+1}(m) = \pi_t(m) + \eta \left( \frac{R_{m_t}\, I_{m_t m}}{\pi_t(m)} - \frac{R_{m_t}\, I_{m_t m^*(t)}}{\pi_t(m^*(t))} \right). \tag{10} \]
    Set $\pi_{t+1}(m^*(t)) = 1 - \sum_{m \neq m^*(t)} \pi_{t+1}(m)$.
end for
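A sketch of one round of the eq. (10) update (our illustration; `pull(m)` is an assumed helper that plays an action from controller $m$ and returns the observed reward in $[0, 1]$, and $\alpha$ must be small enough that the iterate stays in the simplex):

```python
import numpy as np

def pf_pg_round(pi, pull, alpha, rng):
    """One round of Algorithm 2 with the asynchronous rate eta = alpha*pi_t(m)."""
    M = len(pi)
    lead = int(pi.argmax())                    # m*(t), the current leader
    m_t = rng.choice(M, p=pi)                  # sample a controller
    R = pull(m_t)                              # observed reward
    new = pi.copy()
    for m in range(M):
        if m == lead:
            continue
        eta = alpha * pi[m]                    # per-coordinate learning rate
        new[m] = pi[m] + eta * (R * (m_t == m) / pi[m]
                                - R * (m_t == lead) / pi[lead])
    new[lead] = 1.0 - (new.sum() - new[lead])  # keep the iterate on the simplex
    return new
```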
We have the following result.

Theorem 4.3. With $\alpha$ chosen to be less than $\frac{\Delta_{\min}}{r^\mu_{m^*} - \Delta_{\min}}$, $(\pi_t)$ is a Markov process with $\pi_t(m^*) \to 1$ as $t \to \infty$, a.s. Further, the regret up to any time $T$ is bounded as
\[ R(T) \leq \frac{1}{1-\gamma} \sum_{m \neq m^*} \frac{\Delta_m}{\alpha\, \Delta_{\min}} \log T + C, \]
where $C := \frac{1}{1-\gamma} \sum_{t \geq 1} \mathbb{P}\{m^*(t) \neq m^*\} < \infty$.

The proof follows along similar lines as that in [13]. Again, we refer to Section C in the Appendix for the proof of this theorem. We make a couple of remarks.
Note 1. The "cost" of not knowing the true gradient appears as the dependence on $\Delta_{\min}$ in the regret, which is absent when the true gradient is available (see Theorem 4.1 and Corollary 4.1.1). The dependence on $\Delta_{\min}$, as is well known from the work of [19], is unavoidable.
Note 2. The dependence of $\alpha$ on $\Delta_{\min}$ can be removed by a more sophisticated choice of learning rate, at the cost of an extra $\log T$ factor in the regret [13].

5 Convergence Rate for Finite MDPs

In this section, we describe the convergence rate of the softmax policy gradient Algorithm 1 for MDPs with $S \geq 2$ states. The main result shows that SoftMax PG converges to the best in-class policy at a rate $O(1/t)$. Furthermore, the theorem shows an explicit dependence on the number of controllers $M$.

Theorem 5.1 (Convergence of Policy Gradient). With $\{\theta_t\}_{t \geq 1}$ generated as in Algorithm 1 and using a learning rate $\eta = \frac{(1-\gamma)^3}{\gamma^2 + 4\gamma + 5}$, for all $t \geq 1$,
\[ V^*(\rho) - V^{\pi_{\theta_t}}(\rho) \leq \frac{M}{t} \left( \frac{\gamma^2 + 4\gamma + 5}{c\,(1-\gamma)^2} \right)^2 \left\| \frac{d^{\pi^*}_\mu}{\mu} \right\|_\infty^2 \left\| \frac{1}{\mu} \right\|_\infty. \]
The quantity $c$ in the statement is the minimum probability mass that the policy gradient algorithm (Algorithm 1) puts on the controllers on which the best mixture $\pi^*$ puts positive probability.
Proof sketch of Theorem 5.1. We highlight here the main steps of the proof. We begin by showing that in our case the gradient of $V^{\pi_\theta}(\mu)$ simplifies as follows.

Lemma 5.2 (Gradient Simplification). The softmax policy gradient with respect to the parameter $\theta \in \mathbb{R}^M$ is
\[ \frac{\partial}{\partial \theta_m} V^{\pi_\theta}(\mu) = \frac{1}{1-\gamma} \sum_{s \in \mathcal{S}} d^{\pi_\theta}_\mu(s)\, \pi_\theta(m)\, \tilde{A}(s, m), \]
where $\tilde{A}(s, m) := \tilde{Q}(s, m) - V(s)$, $\tilde{Q}(s, m) := \sum_{a \in \mathcal{A}} K_m(s, a)\, Q^{\pi_\theta}(s, a)$, and $d^{\pi_\theta}_\mu(\cdot)$ is the discounted state visitation measure starting from initial distribution $\mu$ and following policy $\pi_\theta$.

The interpretation here is that $\tilde{A}(s, m)$ is the advantage of following controller $m$ at state $s$ and then following the policy $\pi_\theta$ for all time, versus following $\pi_\theta$ always. As mentioned in Section 4.1, we proceed by proving smoothness of the $V$ function over the space $\mathbb{R}^M$.

Lemma 5.3. $V^{\pi_\theta}(\mu)$ is $\frac{\gamma^2 + 4\gamma + 5}{2(1-\gamma)^3}$-smooth.

Next we show a novel Łojasiewicz-type inequality which lower bounds the magnitude of the gradient of the value function.

Lemma 5.4 (Non-uniform Łojasiewicz inequality).
\[ \left\| \frac{\partial}{\partial \theta} V^{\pi_\theta}(\mu) \right\|_2 \geq \frac{1}{\sqrt{M}} \left( \min_{m :\, \pi^*(m) > 0} \pi_\theta(m) \right) \left\| \frac{d^{\pi^*}_\rho}{d^{\pi_\theta}_\mu} \right\|_\infty^{-1} \left[ V^*(\rho) - V^{\pi_\theta}(\rho) \right]. \]

The rest of the proof follows by combining Lemmas 5.2, 5.3 and 5.4, followed by an induction argument over $t \geq 1$.

6 Experimental Results

We now discuss the results of implementing our algorithms on the constrained queueing example described in Sec. 2. Note that the underlying MDP in this case evolves over $\mathbb{Z}^2_+$, which is a countable set. This does not pose a restriction for our algorithm, since it only needs to deal with controllers already capable of handling countable state spaces.

However, since neither value functions nor value gradients for this problem are available in closed form, in this section we modify SoftMax PG (Algorithm 1) to make it generally implementable, using a combination of (1) rollouts to estimate the value function of the current (improper) policy and (2) a stochastic approximation-based approach to estimate its value gradient. The gradient estimation algorithm, GradEst, is shown in Algorithm 4. Specifically, in order to estimate the value gradient, we use the approach in [15], noting that for a function $V : \mathbb{R}^M \to \mathbb{R}$, the gradient $\nabla V$ satisfies
\[ \nabla V(\theta) \approx \mathbb{E}\left[ \left( V(\theta + \alpha u) - V(\theta) \right) u \right] \cdot \frac{M}{\alpha}, \tag{11} \]
where $\alpha \in (0, 1)$ and $u$ is chosen uniformly at random on the unit sphere. Since the second term has zero mean,
\[ \mathbb{E}\left[ \left( V(\theta + \alpha u) - V(\theta) \right) u \right] \cdot \frac{M}{\alpha} = \mathbb{E}\left[ V(\theta + \alpha u)\, u \right] \cdot \frac{M}{\alpha}. \]
The expression above requires evaluation of the value function at the point $\theta + \alpha u$. Since the value function may not be explicitly computable, we employ rollouts for its evaluation.

We study two different settings: (1) the optimal policy is a strict improper combination of the available controllers, and (2) the optimal policy is at a corner point, i.e., one of the available controllers itself is optimal. Our simulations show that in both cases PG converges to the correct controller distribution. We provide all details about hyperparameters in Sec. E in the Appendix.

Recall the example discussed in Sec. 2. We consider Bernoulli arrivals with rates $\lambda = [\lambda_1, \lambda_2]$ and two base/atomic controllers $\{K_1, K_2\}$, where controller $K_i$ serves Queue $i$ with probability 1, $i = 1, 2$.
As can be seen in Fig. 3a, when $\lambda = [0.49, 0.49]$ (equal arrival rates), GradEst converges to an improper mixture policy that serves each queue with probability $[0.5, 0.5]$.

Figure 3: Softmax policy gradient applied to the queueing example: convergence to the best mixture policy. (a) Arrival rate $(\lambda_1, \lambda_2) = (0.49, 0.49)$. (b) Unequal arrival rates, with $\lambda_2 = 0.4$. (c) (Estimated) value functions for the case with the two base policies and Longest Queue First ("LQF"). (d) Case with 3 experts: Always Queue 1, Always Queue 2, and LQF.
Serving either queue with any smaller probability cannot stabilize the system (the unserved queue would obviously increase without bound). Figure 3b shows that with unequal arrival rates too, GradEst quickly converges to the best policy.

Fig. 3c shows the evolution of the value function of GradEst (in blue) compared with those of the base controllers (red) and the Longest Queue First (LQF) policy which, as the name suggests, always serves the longest queue in the system (black). LQF, like any policy that always serves a nonempty queue in the system whenever there is one, is known to be optimal in the sense of delay minimization for this system [26]. See Sec. E in the Appendix for more details about this experiment.

Finally, Fig. 3d shows the result of the second experimental setting with three base controllers, one of which is delay optimal. The first two are $K_1, K_2$ as before and the third controller, $K_3$, is LQF (the tie-breaking rule is irrelevant). Notice that $K_1, K_2$ are both queue-length-agnostic, meaning they could attempt to serve empty queues as well. LQF, on the other hand, always and only serves nonempty queues. Hence, in this case the optimal policy is attained at one of the corner points, i.e., $[0, 0, 1]$.

Algorithm 3 Softmax PG with Gradient Estimation (SPGE)
Input: learning rate $\eta > 0$. Initialize $\theta_m = 1$ for all $m \in [M]$, $s_0 \sim \mu$.
for $t = 1$ to $T$ do
    Choose controller $m_t \sim \pi_t$.
    Play action $a_t \sim K_{m_t}(s_t, \cdot)$.
    Observe $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
    $\widetilde{\nabla}_{\theta_t} V^{\pi_{\theta_t}}(\mu) = \mathrm{GradEst}(\theta_t)$.
    Update: $\theta_{t+1} = \theta_t + \eta\, \widetilde{\nabla}_{\theta_t} V^{\pi_{\theta_t}}(\mu)$.
end for
Algorithm 4 GradEst
Input: policy parameters $\theta$, perturbation size $\alpha > 0$.
for $i = 1$ to runs do
    $u_i \sim \mathrm{Unif}(\mathbb{S}^{M-1})$.
    $\theta_\alpha = \theta + \alpha u_i$; $\pi_\alpha = \mathrm{softmax}(\theta_\alpha)$.
    for $l = 1$ to rollouts do
        Generate a trajectory of length $l_t$ according to the policy $\pi_\alpha$: $(s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_{l_t}, a_{l_t}, r_{l_t})$.
        $\mathrm{reward}_l = \sum_{j=0}^{l_t} \gamma^j r_j$.
    end for
    $mr(i) = \mathrm{mean}(\mathrm{reward})$.
end for
$\mathrm{GradValue} = \frac{1}{\mathrm{runs}} \sum_{i=1}^{\mathrm{runs}} mr(i)\, u_i \cdot \frac{M}{\alpha}$.
Return: $\mathrm{GradValue}$.
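A compact Python sketch of Algorithm 4 (our illustration; `rollout(pi)` is an assumed helper that returns one truncated discounted return under the mixture `pi`, and `runs`, `rollouts`, `alpha` are hyperparameters):

```python
import numpy as np

def grad_est(theta, rollout, alpha=0.1, runs=50, rollouts=10, seed=0):
    """One-point SPSA estimate of the value gradient, eq. (11)."""
    rng = np.random.default_rng(seed)
    M = len(theta)
    g = np.zeros(M)
    for _ in range(runs):
        u = rng.normal(size=M)
        u /= np.linalg.norm(u)                 # u ~ Unif(S^{M-1})
        z = np.exp(theta + alpha * u)          # perturbed parameters
        pi_alpha = z / z.sum()
        mr = np.mean([rollout(pi_alpha) for _ in range(rollouts)])
        g += mr * u * (M / alpha)
    return g / runs
```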
7 Conclusion

In this paper, we considered the problem of choosing the best mixture of controllers for reinforcement learning, making a first attempt at improper learning in the RL setting. One natural option in this case is to run each controller separately for a long time and then choose the best one based on estimated returns. While quite plausible, this "explore-then-exploit" approach is likely to be severely suboptimal in terms of rates. Moreover, it is not clear how it could obtain the best mixture of base controllers, as opposed to the best base controller. We recall the queueing example (Sec. 2), where the best mixture may be strictly superior to each base controller.

Further, our mixing of controllers was simple: we looked to compete with the best fixed mixture. One can consider a richer class of mixtures. For example, an attention model can be used to choose which controller to use, or other state-dependent models may be relevant. The learning architecture should not change dramatically, since we are using gradients for the selection process, which is currently simple but may be replaced by a more complex architecture. Another option is to artificially force switching across controllers to occur less frequently than in every round. This can help create momentum and allow the controlled process to 'mix' better when using complex controllers.

Finally, in the present setting the base controllers are fixed. It would be interesting to consider adding adaptive, or 'learning', controllers alongside the fixed ones. Including the base controllers can provide a baseline performance below which the performance of the learning controllers would not drop.
References

[1] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 1–26, Budapest, Hungary, 2011. JMLR Workshop and Conference Proceedings.
[2] Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in Markov decision processes. In Proceedings of Thirty Third Conference on Learning Theory, pages 64–66. PMLR, 2020.
[3] Naman Agarwal, Nataly Brukhim, Elad Hazan, and Zhou Lu. Boosting for control of dynamical systems. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 96–103. PMLR, 2020.
[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331, 1995.
[5] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In Advances in Neural Information Processing Systems, volume 21, pages 89–96. Curran Associates, Inc., 2009.
[6] Dimitri Bertsekas and Robert Gallager. Data Networks (2nd Ed.). Prentice-Hall, Inc., USA, 1992.
[7] Jalaj Bhandari and D. Russo. Global optimality guarantees for policy gradient methods. ArXiv, abs/1906.01786, 2019.
[8] Vivek S. Borkar. Stochastic Approximation. Cambridge University Press, 2008.
[9] Asaf Cassel, Alon Cohen, and Tomer Koren. Logarithmic regret for learning linear quadratic regulators efficiently. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1328–1337. PMLR, 2020.
[10] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. More data speeds up training time in learning halfspaces over sparse vectors. In Advances in Neural Information Processing Systems, volume 26, pages 145–153. Curran Associates, Inc., 2013.
[11] Amit Daniely, Nati Linial, and Shai Shalev-Shwartz. From average case complexity to improper learning complexity. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, STOC '14, pages 441–448, New York, NY, USA, 2014. Association for Computing Machinery.
[12] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. arXiv e-prints, October 2017.
[13] D. Denisov and N. Walton. Regret analysis of a Markov policy gradient algorithm for multi-arm bandits. ArXiv, abs/2007.10229, 2020.
[14] Rick Durrett. Probability: Theory and Examples, 2011.
[15] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: Gradient descent without a gradient. In SODA '05, pages 385–394, USA, 2005. Society for Industrial and Applied Mathematics.
[16] Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized Markov decision processes. In Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 861–898, Paris, France, 2015. PMLR.
[17] Morteza Ibrahimi, Adel Javanmard, and Benjamin Roy. Efficient reinforcement learning for high dimensional linear quadratic systems. In Advances in Neural Information Processing Systems, volume 25, pages 2636–2644. Curran Associates, Inc., 2012.
[18] Tomáš Kocák, Gergely Neu, Michal Valko, and Remi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, volume 27, pages 613–621. Curran Associates, Inc., 2014.
[19] T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[20] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[21] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.
[22] S. Łojasiewicz. Les équations aux dérivées partielles (Paris, 1962), 1963.
[23] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments, 2020.
[24] Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, volume 32, pages 10154–10164. Curran Associates, Inc., 2019.
[25] Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergence rates of softmax policy gradient methods. In Proceedings of the 37th International Conference on Machine Learning, pages 6820–6829. PMLR, 2020.
[26] A. Mohan, A. Chattopadhyay, and A. Kumar. Hybrid MAC protocols for low-delay scheduling. In , pages 47–55, Los Alamitos, CA, USA, October 2016. IEEE Computer Society.
[27] Gergely Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. In Advances in Neural Information Processing Systems, volume 28, pages 3168–3176. Curran Associates, Inc., 2015.
[28] Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, volume 26, pages 3003–3011. Curran Associates, Inc., 2013.
[29] Y. Ouyang, Mukul Gagrani, A. Nayyar, and R. Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In NIPS, 2017.
[30] Mircea-Bogdan Radac and Radu-Emil Precup. Data-driven model-free slip control of anti-lock braking systems using reinforcement Q-learning. Neurocomputing, 275(C):317–329, January 2018.
[31] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical report, 1994.
[32] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1889–1897, Lille, France, 2015. PMLR.
[33] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
[34] Lior Shani, Yonathan Efroni, and Shie Mannor. Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. ArXiv, abs/1909.02769, 2020.
[35] David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016.
[36] Satinder Singh, Andy Okun, and Andrew Jackson. Artificial intelligence: Learning to play Go from scratch. Nature, 550(7676):336–337, October 2017.
[37] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
[38] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, volume 12, pages 1057–1063. MIT Press, 2000.
[39] L. Tassiulas and A. Ephremides. Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks. IEEE Transactions on Automatic Control, 37(12):1936–1948, 1992.
Glossary of Symbols

1. $\mathcal{S}$: state space.
2. $\mathcal{A}$: action space.
3. $S$: cardinality of $\mathcal{S}$.
4. $A$: cardinality of $\mathcal{A}$.
5. $M$: number of controllers.
6. $K_i$: controller $i$, $i = 1, \cdots, M$. For a finite state-action space MDP, $K_i$ is a matrix of size $S \times A$, where each row is a probability distribution over the actions.
7. $\mathcal{C}$: given collection of $M$ controllers.
8. $\mathcal{I}_{\text{soft}}(\mathcal{C})$: improper policy class set up by the learner.
9. $\theta \in \mathbb{R}^M$: parameters (weights) assigned to the controllers, updated each round by the learner.
10. $\pi(\cdot)$: probability of choosing controllers.
11. $\pi(\cdot \mid s)$: probability of choosing an action given state $s$. Note that in our setting, given $\pi(\cdot)$ over controllers (see previous item) and the set of controllers, $\pi(\cdot \mid s)$ is completely defined, i.e., $\pi(a \mid s) = \sum_{m=1}^{M} \pi(m) K_m(s, a)$. Hence we simply use $\pi$ to denote the policy followed, whenever the context is clear.
12. $r(s, a)$: immediate (one-step) reward obtained if action $a$ is played in state $s$.
13. $P(s' \mid s, a)$: probability of transitioning to state $s'$ from state $s$ having taken action $a$.
14. $V^\pi(\rho) := \mathbb{E}_{s \sim \rho}[V^\pi(s)] = \mathbb{E}^\pi_\rho \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)$: value function starting with initial distribution $\rho$ over states and following policy $\pi$.
15. $Q^\pi(s, a) := \mathbb{E}\left[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V^\pi(s') \right]$.
16. $\tilde{Q}^\pi(s, m) := \mathbb{E}\left[ \sum_{a \in \mathcal{A}} K_m(s, a) \left( r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V^\pi(s') \right) \right]$.
17. $A^\pi(s, a) := Q^\pi(s, a) - V^\pi(s)$.
18. $\tilde{A}(s, m) := \tilde{Q}^\pi(s, m) - V^\pi(s)$.
19. $d^\pi_\nu(s) := \mathbb{E}_{s_0 \sim \nu}\left[ (1-\gamma) \sum_{t=0}^{\infty} \gamma^t\, \mathbb{P}\left[ s_t = s \mid s_0, \pi, P \right] \right]$: a distribution over the states, called the "discounted state visitation measure".

Figure 4: An example of an MDP with controllers as defined in (12) having a non-concave value function. The MDP has 5 states, with the two actions 'right' and 'up' (plus a dummy 'null' action at the terminal states); states $s_2$, $s_3$ and $s_4$ are terminal. The only transition with nonzero reward is $s_1 \to s_2$.

B Non-concavity of the Value Function
We begin by showing that the value function $V^\pi(s)$ is non-concave; hence standard convex optimization techniques for maximization may get stuck in local optima. We note once again that this is different from the non-concavity of $V^\pi$ when the parameterization is over the entire state-action space, i.e., $\mathbb{R}^{S \times A}$. We show here that for direct parameterization the value function is non-concave, where by direct parameterization we mean that the controllers $K_m$ are weighted by $\theta_m \in \mathbb{R}$, with $\theta_i \geq 0$ for all $i \in [M]$ and $\sum_{i=1}^{M} \theta_i = 1$. A similar argument holds for the softmax parameterization, which we outline in Note 3.

Lemma B.1 (Non-concavity of the value function). There is an MDP and a set of controllers for which the value function maximization problem (i.e., (5)) is non-concave under direct parameterization, i.e., $\theta \mapsto V^{\pi_\theta}$ is non-concave.

Proof. Consider the MDP shown in Figure 4 with 5 states, $s_0, \ldots, s_4$, of which $s_2$, $s_3$ and $s_4$ are terminal. The figure also shows the allowed transitions and the rewards obtained from those transitions. Let the action set $\mathcal{A}$ consist of only three actions $\{a_1, a_2, a_3\} \equiv \{\text{right}, \text{up}, \text{null}\}$, where 'null' is a dummy action included to accommodate the three terminal states. Let us consider the case $M = 2$. The two controllers $K_i \in \mathbb{R}^{S \times A}$, $i = 1, 2$ (rows indexed by states $s_0, \ldots, s_4$, columns by actions), are
\[ K_1 = \begin{pmatrix} 1/4 & 3/4 & 0 \\ 1/4 & 3/4 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}, \qquad K_2 = \begin{pmatrix} 3/4 & 1/4 & 0 \\ 3/4 & 1/4 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}. \tag{12} \]
Let $\theta(1) = (1, 0)^\top$ and $\theta(2) = (0, 1)^\top$, and fix the initial state to be $s_0$. Since a nonzero reward is only earned on the $s_1 \to s_2$ transition, we note for any policy $\pi$ that $V^\pi(s_0) = \pi(a_1 \mid s_0)\, \pi(a_1 \mid s_1)\, r$. We also have
\[ (K_1 + K_2)/2 = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 1/2 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}. \]
We will show that $\frac{1}{2} V^{\pi_{\theta(1)}} + \frac{1}{2} V^{\pi_{\theta(2)}} > V^{\pi_{(\theta(1)+\theta(2))/2}}$. We observe the following:
\[ V^{\pi_{\theta(1)}}(s_0) = V^{K_1}(s_0) = \tfrac{1}{4} \cdot \tfrac{1}{4} \cdot r = \tfrac{r}{16}, \qquad V^{\pi_{\theta(2)}}(s_0) = V^{K_2}(s_0) = \tfrac{3}{4} \cdot \tfrac{3}{4} \cdot r = \tfrac{9r}{16}, \]
where $V^{K}(s_0)$ denotes the value obtained by starting from state $s_0$ and following the controller matrix $K$ for all time. On the other hand,
\[ V^{\pi_{(\theta(1)+\theta(2))/2}} = V^{(K_1+K_2)/2}(s_0) = \tfrac{1}{2} \cdot \tfrac{1}{2} \cdot r = \tfrac{r}{4}. \]
Hence we see that
\[ \tfrac{1}{2} V^{\pi_{\theta(1)}} + \tfrac{1}{2} V^{\pi_{\theta(2)}} = \tfrac{r}{32} + \tfrac{9r}{32} = \tfrac{10r}{32} = 1.25 \cdot \tfrac{r}{4} > \tfrac{r}{4} = V^{\pi_{(\theta(1)+\theta(2))/2}}. \]
This shows that $\theta \mapsto V^{\pi_\theta}$ is non-concave, which concludes the proof.
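The arithmetic above is easy to verify numerically; the sketch below (our reconstruction, under the state/action labeling used in the proof) evaluates the value at $\theta(1)$, $\theta(2)$ and their midpoint:

```python
import numpy as np

r = 1.0
# Probability of taking 'right' at s0 and s1 under K1, K2 and their midpoint.
p = {'K1': 0.25, 'K2': 0.75, 'mid': 0.5}
V = {k: q * q * r for k, q in p.items()}   # V(s0) = pi(right|s0) * pi(right|s1) * r

avg = 0.5 * (V['K1'] + V['K2'])            # = r/32 + 9r/32 = 10r/32
print(avg, V['mid'])                       # 0.3125 > 0.25: the value is non-concave
```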
Note 3. For the softmax parametrization, fix some $\varepsilon \in (0, 1)$ and set $\theta(1) = (\log(1-\varepsilon), \log \varepsilon)^\top$ and $\theta(2) = (\log \varepsilon, \log(1-\varepsilon))^\top$. A similar calculation using the softmax projection shows that under $\theta(1)$ we follow the matrix $(1-\varepsilon) K_1 + \varepsilon K_2$, which yields a value of $(1/4 + \varepsilon/2)^2\, r$. Under $\theta(2)$ we follow the matrix $\varepsilon K_1 + (1-\varepsilon) K_2$, which yields a value of $(3/4 - \varepsilon/2)^2\, r$. On the other hand, $(\theta(1) + \theta(2))/2$ amounts to playing the matrix $(K_1 + K_2)/2$, yielding a value of $r/4$, as above. One can easily verify that $(1/4 + \varepsilon/2)^2\, r + (3/4 - \varepsilon/2)^2\, r > 2 \cdot r/4$ for small $\varepsilon$. This shows the non-concavity of $\theta \mapsto V^{\pi_\theta}$ under the softmax parameterization.

C Proofs for MABs
C.1 Proofs for MABs with perfect gradient knowledge
Recall from eq. (8) that the value function for any given policy $\pi \in \mathcal{P}([M])$ (a distribution over the given $M$ controllers, which are themselves distributions over the actions $\mathcal{A}$) simplifies to
\[ V^\pi = \frac{1}{1-\gamma} \sum_{m=1}^{M} \pi_m\, \mu^\top K_m = \frac{1}{1-\gamma} \sum_{m=1}^{M} \pi_m\, r^\mu_m, \]
where $\mu$ is the (unknown) vector of mean rewards of the arms and $r^\mu_m := \mu^\top K_m$, $m = 1, \cdots, M$, is the mean reward obtained by choosing to play controller $K_m$. For ease of notation, we drop the superscript $\mu$ in the proofs of this section. We first show a simplification of the gradient of the value function with respect to the parameter $\theta$. Fix an $m' \in [M]$; then, using $\partial \pi_\theta(m)/\partial \theta_{m'} = \pi_\theta(m)\{I_{mm'} - \pi_\theta(m')\}$,
\[ \frac{\partial}{\partial \theta_{m'}} V^{\pi_\theta} = \frac{1}{1-\gamma} \sum_{m=1}^{M} \frac{\partial \pi_\theta(m)}{\partial \theta_{m'}}\, r_m = \frac{1}{1-\gamma} \sum_{m=1}^{M} \pi_\theta(m) \{ I_{mm'} - \pi_\theta(m') \}\, r_m = \frac{1}{1-\gamma}\, \pi_\theta(m') \left( r_{m'} - \pi_\theta^\top r \right). \tag{13} \]
Next we show that $V^\pi$ is $\beta$-smooth. A function $f : \mathbb{R}^M \to \mathbb{R}$ is $\beta$-smooth if, for all $\theta', \theta \in \mathbb{R}^M$,
\[ \left| f(\theta') - f(\theta) - \left\langle \frac{d}{d\theta} f(\theta),\, \theta' - \theta \right\rangle \right| \leq \frac{\beta}{2}\, \| \theta' - \theta \|^2. \]
Let $S := \frac{d^2}{d\theta^2} V^{\pi_\theta}$, a matrix of size $M \times M$. For $1 \leq i, j \leq M$,
\begin{align}
S_{ij} &= \left( \frac{d}{d\theta} \left( \frac{d}{d\theta} V^{\pi_\theta} \right) \right)_{ij} \tag{14} \\
&= \frac{1}{1-\gamma} \frac{d\left( \pi_\theta(i)\, (r(i) - \pi_\theta^\top r) \right)}{d\theta_j} \tag{15} \\
&= \frac{1}{1-\gamma} \left( \frac{d\pi_\theta(i)}{d\theta_j} (r(i) - \pi_\theta^\top r) + \pi_\theta(i) \frac{d(r(i) - \pi_\theta^\top r)}{d\theta_j} \right) \tag{16} \\
&= \frac{1}{1-\gamma} \left( \pi_\theta(i)\, (I_{ij} - \pi_\theta(j))\, (r(i) - \pi_\theta^\top r) - \pi_\theta(i)\, \pi_\theta(j)\, (r(j) - \pi_\theta^\top r) \right). \tag{17}
\end{align}
Next, let $y \in \mathbb{R}^M$. Then
\begin{align*}
|y^\top S y| &= \Big| \sum_{i=1}^{M} \sum_{j=1}^{M} S_{ij}\, y(i)\, y(j) \Big| \\
&= \frac{1}{1-\gamma} \Big| \sum_{i=1}^{M} \pi_\theta(i)\, (r(i) - \pi_\theta^\top r)\, y(i)^2 - 2 \sum_{i=1}^{M} \pi_\theta(i)\, (r(i) - \pi_\theta^\top r)\, y(i) \sum_{j=1}^{M} \pi_\theta(j)\, y(j) \Big| \\
&\leq \frac{1}{1-\gamma} \Big| \sum_{i=1}^{M} \pi_\theta(i)\, (r(i) - \pi_\theta^\top r)\, y(i)^2 \Big| + \frac{2}{1-\gamma} \Big| \sum_{i=1}^{M} \pi_\theta(i)\, (r(i) - \pi_\theta^\top r)\, y(i) \Big|\, \Big| \sum_{j=1}^{M} \pi_\theta(j)\, y(j) \Big| \\
&\leq \frac{1}{1-\gamma} \left\| \pi_\theta \odot (r - \pi_\theta^\top r) \right\|_\infty \| y \odot y \|_1 + \frac{2}{1-\gamma} \left\| \pi_\theta \odot (r - \pi_\theta^\top r) \right\|_1\, \| y \|_\infty\, \| \pi_\theta \|_1\, \| y \|_\infty,
\end{align*}
where the last step is by Hölder's inequality.
We observe that, since the rewards are bounded in $[0,1]$ and $\sum_i \pi_\theta(i) = 1$,
\[ \left\| \pi_\theta \odot (r - \pi_\theta^\top r) \right\|_1 = \sum_{i=1}^{M} \left| \pi_\theta(i)\, (r(i) - \pi_\theta^\top r) \right| = \sum_{i=1}^{M} \pi_\theta(i) \left| r(i) - \pi_\theta^\top r \right| \leq \max_{i = 1, \ldots, M} \left| r(i) - \pi_\theta^\top r \right| \leq 1. \]
Next, for any $i \in [M]$,
\[ \left| \pi_\theta(i)\, (r(i) - \pi_\theta^\top r) \right| = \Big| \pi_\theta(i)\, r(i) - \pi_\theta(i)^2\, r(i) - \sum_{j \neq i} \pi_\theta(i)\, \pi_\theta(j)\, r(j) \Big| \leq \pi_\theta(i)(1 - \pi_\theta(i)) + \pi_\theta(i)(1 - \pi_\theta(i)) \leq \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2}, \]
so that $\left\| \pi_\theta \odot (r - \pi_\theta^\top r) \right\|_\infty \leq 1/2$. Combining the above two inequalities with the facts that $\| \pi_\theta \|_1 = 1$ and $\| y \|_\infty \leq \| y \|_2$, we get
\[ |y^\top S y| \leq \frac{1}{1-\gamma} \left\| \pi_\theta \odot (r - \pi_\theta^\top r) \right\|_\infty \| y \odot y \|_1 + \frac{2}{1-\gamma} \left\| \pi_\theta \odot (r - \pi_\theta^\top r) \right\|_1 \| y \|_2^2 \leq \frac{1}{1-\gamma} \left( \frac{1}{2} + 2 \right) \| y \|_2^2 = \frac{5}{2(1-\gamma)} \| y \|_2^2. \]
Hence $V^{\pi_\theta}$ is $\beta$-smooth with $\beta = \frac{5}{2(1-\gamma)}$.
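A numerical sanity check (our sketch) of the $\beta = \frac{5}{2(1-\gamma)}$ bound, via finite-difference Hessians of the bandit value function at random points:

```python
import numpy as np

rng = np.random.default_rng(0)
M, gamma, h = 4, 0.9, 1e-4
r = rng.random(M)

def V(theta):
    """Bandit value function, eq. (8)."""
    z = np.exp(theta - theta.max())
    return (z / z.sum()) @ r / (1 - gamma)

theta = rng.normal(size=M)
H = np.zeros((M, M))
E = np.eye(M)
for i in range(M):                         # central-style second differences
    for j in range(M):
        H[i, j] = (V(theta + h*E[i] + h*E[j]) - V(theta + h*E[i])
                   - V(theta + h*E[j]) + V(theta)) / h**2

spec = np.max(np.abs(np.linalg.eigvalsh((H + H.T) / 2)))
print(spec, 5 / (2 * (1 - gamma)))         # spectral norm <= 5/(2(1-gamma))
```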
Next we prove Lemma 4.2.

Proof of Lemma 4.2. Recall the simplification of the gradient of $V^\pi$, i.e., eq. (13):
\[ \frac{\partial}{\partial \theta_m} V^{\pi_\theta} = \frac{1}{1-\gamma} \sum_{m'=1}^{M} \pi_\theta(m') \{ I_{m'm} - \pi_\theta(m) \}\, r_{m'} = \frac{1}{1-\gamma}\, \pi(m) \left( r(m) - \pi^\top r \right). \]
Taking norms on both sides,
\[ \left\| \frac{\partial V^{\pi_\theta}}{\partial \theta} \right\|_2 = \frac{1}{1-\gamma} \sqrt{ \sum_{m=1}^{M} \pi(m)^2 \left( r(m) - \pi^\top r \right)^2 } \geq \frac{1}{1-\gamma} \sqrt{ \pi(m^*)^2 \left( r(m^*) - \pi^\top r \right)^2 } = \frac{1}{1-\gamma}\, \pi(m^*) \left( \pi^* - \pi \right)^\top r = \pi(m^*) \left[ V^{\pi^*} - V^{\pi_\theta} \right], \]
where $\pi^* = e_{m^*}$.

We will now prove Theorem 4.1 and Corollary 4.1.1. We restate the result here.

Theorem 4.1.
With $\eta = \frac{2(1-\gamma)}{5}$ and $\theta^{(1)}_m = 1/M$ for all $m \in [M]$, and with availability of the true gradient, we have, for all $t \geq 1$, $V^{\pi^*} - V^{\pi_{\theta_t}} \leq \frac{5M^2}{(1-\gamma)\, t}$.

Proof. First, note that since $V^\pi$ is smooth we have
\begin{align*}
V^{\pi_{\theta_t}} - V^{\pi_{\theta_{t+1}}} &\leq - \left\langle \frac{d}{d\theta_t} V^{\pi_{\theta_t}},\, \theta_{t+1} - \theta_t \right\rangle + \frac{5}{4(1-\gamma)} \| \theta_{t+1} - \theta_t \|^2 \\
&= - \eta \left\| \frac{d}{d\theta_t} V^{\pi_{\theta_t}} \right\|^2 + \frac{5}{4(1-\gamma)}\, \eta^2 \left\| \frac{d}{d\theta_t} V^{\pi_{\theta_t}} \right\|^2 = \left\| \frac{d}{d\theta_t} V^{\pi_{\theta_t}} \right\|^2 \left( \frac{5\eta^2}{4(1-\gamma)} - \eta \right) = - \frac{1-\gamma}{5} \left\| \frac{d}{d\theta_t} V^{\pi_{\theta_t}} \right\|^2 \\
&\leq - \frac{1-\gamma}{5} \left( \pi_{\theta_t}(m^*) \right)^2 \left[ V^{\pi^*} - V^{\pi_{\theta_t}} \right]^2 \qquad \text{(by Lemma 4.2)} \\
&\leq - \frac{1-\gamma}{5} \Big( \underbrace{ \inf_{1 \leq s \leq t} \pi_{\theta_s}(m^*) }_{=:\ c_t} \Big)^2 \left[ V^{\pi^*} - V^{\pi_{\theta_t}} \right]^2.
\end{align*}
The first inequality is by smoothness; the first equality uses the update equation in Algorithm 1. Next, let $\delta_t := V^{\pi^*} - V^{\pi_{\theta_t}}$. We have
\[ \delta_{t+1} - \delta_t \leq - \frac{1-\gamma}{5}\, c_t^2\, \delta_t^2. \tag{18} \]

Claim: for all $t \geq 1$, $\delta_t \leq \frac{5}{(1-\gamma)\, c_t^2} \cdot \frac{1}{t}$.

We prove the claim by induction on $t$. Since $\delta_t \leq \frac{1}{1-\gamma}$ and $c_t \leq 1$, the claim is true for $t = 1$. Define $\varphi_t := \frac{5}{(1-\gamma)\, c_t^2}$, fix a $t \geq 2$, and assume $\delta_t \leq \frac{\varphi_t}{t}$. Let $g : \mathbb{R} \to \mathbb{R}$ be defined as $g(x) = x - \frac{x^2}{\varphi_t}$. One can verify easily that $g$ is monotonically increasing on $[0, \varphi_t/2]$. Then, by (18),
\[ \delta_{t+1} \leq \delta_t - \frac{\delta_t^2}{\varphi_t} = g(\delta_t) \leq g\!\left( \frac{\varphi_t}{t} \right) = \frac{\varphi_t}{t} - \frac{\varphi_t}{t^2} = \varphi_t \left( \frac{1}{t} - \frac{1}{t^2} \right) \leq \frac{\varphi_t}{t+1} \leq \frac{\varphi_{t+1}}{t+1}, \]
where the last step uses that $c_t$ is non-increasing in $t$. This completes the proof of the claim. We will show that $c_t \geq 1/M$ in the next lemma; we first complete the proof of the corollary assuming this.

Fix $T \geq 1$ and observe that, for $t \leq T$, $\delta_t \leq \frac{5}{(1-\gamma)\, c_t^2\, t} \leq \frac{5}{(1-\gamma)\, c_T^2\, t}$. Hence
\[ \sum_{t=1}^{T} \left( V^{\pi^*} - V^{\pi_{\theta_t}} \right) = \frac{1}{1-\gamma} \sum_{t=1}^{T} \left( \pi^* - \pi_{\theta_t} \right)^\top r \leq \frac{5}{(1-\gamma)\, c_T^2} \left( 1 + \log T \right). \]
Also we have, using (18) and $\delta_1 \leq \frac{1}{1-\gamma}$,
\[ \sum_{t=1}^{T} \left( V^{\pi^*} - V^{\pi_{\theta_t}} \right) = \sum_{t=1}^{T} \delta_t \leq \sqrt{T} \sqrt{ \sum_{t=1}^{T} \delta_t^2 } \leq \sqrt{T} \sqrt{ \sum_{t=1}^{T} \frac{5}{(1-\gamma)\, c_T^2} \left( \delta_t - \delta_{t+1} \right) } \leq \frac{1}{c_T} \sqrt{ \frac{5T}{(1-\gamma)^2} } = \frac{\sqrt{5}}{(1-\gamma)\, c_T} \sqrt{T}. \]
We next show that with $\theta^{(1)}_m = 1/M$ for all $m$, i.e., uniform initialization, $\inf_{t \geq 1} c_t = 1/M$, which then completes the proof of Theorem 4.1 and of Corollary 4.1.1.

Lemma C.1.
We have inf t (cid:62) π θ t ( m ∗ ) > . Furthermore, with uniform initialization of the parameters θ (1) m ,i.e., /M, ∀ m ∈ [ M ] , we have inf t (cid:62) π θ t ( m ∗ ) = M .Proof. We will show that there exists t such that inf t (cid:62) π θ t ( m ∗ ) = min (cid:54) t (cid:54) t π θ t ( m ∗ ), where t = min { t : π θ t ( m ∗ ) (cid:62) C } .We define the following sets. S = (cid:26) θ : dV π θ dθ m ∗ (cid:62) dV π θ dθ m , ∀ m (cid:54) = m ∗ (cid:27) S = { θ : π θ ( m ∗ ) (cid:62) π θ ( m ) , ∀ m (cid:54) = m ∗ }S = { θ : π θ ( m ∗ ) (cid:62) C } Note that S depends on the choice of C . Let C := M − ∆ M +∆ . We claim the following: Claim 2. ( i ) θ t ∈ S = ⇒ θ t +1 ∈ S and ( ii ) θ t ∈ S = ⇒ π θ t +1 ( m ∗ ) (cid:62) π θ t ( m ∗ ). Proof of Claim 2. ( i ) Fix a m (cid:54) = m ∗ . We will show that if dV πθ dθ t ( m ∗ ) (cid:62) dV πθ dθ t ( m ) , then dV πθ dθ t +1 ( m ∗ ) (cid:62) dV πθ dθ t +1 ( m ) . Thiswill prove the first part.Case (a): π θ t ( m ∗ ) (cid:62) π θ t ( m ). This implies, by the softmax property, that θ t ( m ∗ ) (cid:62) θ t ( m ). After gradientascent update step we have: θ t +1 ( m ∗ ) = θ t ( m ∗ ) + η dV π θt dθ t ( m ∗ ) (cid:62) θ t ( m ) + η dV π θt dθ t ( m )= θ t +1 ( m ) . This again implies that θ t +1 ( m ∗ ) (cid:62) θ t +1 ( m ). By the definition of derivative of V π θ w.r.t θ t (see eq (13)), dV π θ dθ t +1 ( m ∗ ) = 11 − γ π θ t +1 ( m ∗ ) ( r ( m ∗ ) − π T θ t +1 r )= 11 − γ π θ t +1 ( m ) ( r ( m ) − π T θ t +1 r )= dV π θ dθ t +1 ( m ) . This implies θ t +1 ∈ S .Case (b): π θ t ( m ∗ ) < π θ t ( m ). We first note the following equivalence: dV π θ dθ ( m ∗ ) (cid:62) dV π θ dθ ( m ) ←→ ( r ( m ∗ ) − r ( m )) (cid:18) − π θ ( m ∗ ) π θ ( m ∗ ) (cid:19) ( r ( m ∗ ) − π T θ r ) . hich can be simplified as:( r ( m ∗ ) − r ( m )) (cid:18) − π θ ( m ∗ ) π θ ( m ∗ ) (cid:19) ( r ( m ∗ ) − π T θ r ) = ( r ( m ∗ ) − r ( m )) (1 − exp ( θ t ( m ∗ ) − θ t ( m ))) ( r ( m ∗ ) − π T θ r ) . The above condition can be rearranged as: r ( m ∗ ) − r ( m ) (cid:62) (1 − exp ( θ t ( m ∗ ) − θ t ( m ))) (cid:0) r ( m ∗ ) − π T θ t r (cid:1) . By lemma D.7, we have that V π θt +1 (cid:62) V π θt = ⇒ π T θ t +1 r (cid:62) π T θ t r . Hence,0 < r ( m ∗ ) − π T θ t +1 r (cid:54) π T θ t r . Also, we note: θ t +1 ( m ∗ ) − θ t +1 ( m ) = θ t ( m ∗ ) + η dV π t dθ t ( m ∗ ) − θ t +1 ( m ) − η dV π t dθ t ( m ) (cid:62) θ t ( m ∗ ) − θ t ( m ) . This implies, 1 − exp ( θ t +1 ( m ∗ ) − θ t +1 ( m )) (cid:54) − exp ( θ t ( m ∗ ) − θ t ( m )).Next, we observe that by the assumption π t ( m ∗ ) < π t ( m ), we have1 − exp ( θ t ( m ∗ ) − θ t ( m )) = 1 − π t ( m ∗ ) π t ( m ) > . Hence we have,(1 − exp ( θ t +1 ( m ∗ ) − θ t +1 ( m ))) (cid:16) r ( m ∗ ) − π T θ t +1 r (cid:17) (cid:54) (1 − exp ( θ t ( m ∗ ) − θ t ( m ))) (cid:0) r ( m ∗ ) − π T θ t r (cid:1) (cid:54) r ( m ∗ ) − r ( m ) . Equivalently, (cid:18) − π t +1 ( m ∗ ) π t +1 ( m ) (cid:19) ( r ( m ∗ ) − π T t +1 r ) (cid:54) r ( m ∗ ) − r ( m ) . Finishing the proof of the claim 2(i).(ii) Let θ t ∈ S . We observe that: π t +1 ( m ∗ ) = exp( θ t +1 ( m ∗ )) M (cid:80) m =1 exp( θ t +1 ( m ))= exp( θ t ( m ∗ ) + η dV πt dθ t ( m ∗ ) ) M (cid:80) m =1 exp( θ t ( m ) + η dV πt dθ t ( m ) ) (cid:62) exp( θ t ( m ∗ ) + η dV πt dθ t ( m ∗ ) ) M (cid:80) m =1 exp( θ t ( m ) + η dV πt dθ t ( m ∗ ) )= exp( θ t ( m ∗ )) M (cid:80) m =1 exp( θ t ( m )) = π t ( m ∗ )This completes the proof of Claim 2(ii). laim 3. S ⊂ S and S ⊂ S . Proof.
To show that $S_2\subset S_1$, let $\theta\in S_2$. We have $\pi_\theta(m^*)\ge\pi_\theta(m)$ for all $m\ne m^*$. Then
\[
\frac{dV^{\pi_\theta}}{d\theta(m^*)} = \frac{1}{1-\gamma}\,\pi_\theta(m^*)\big(r(m^*)-\pi_\theta^\top r\big) \;\ge\; \frac{1}{1-\gamma}\,\pi_\theta(m)\big(r(m)-\pi_\theta^\top r\big) = \frac{dV^{\pi_\theta}}{d\theta(m)}.
\]
This shows that $\theta\in S_1$. For the second part of the claim we assume $\theta\in S_3\cap S_2^c$, because if $\theta\in S_2$ we are done. Let $m\ne m^*$. We have
\begin{align*}
\frac{dV^{\pi_\theta}}{d\theta(m^*)}-\frac{dV^{\pi_\theta}}{d\theta(m)} &= \frac{1}{1-\gamma}\Big(\pi_\theta(m^*)\big(r(m^*)-\pi_\theta^\top r\big)-\pi_\theta(m)\big(r(m)-\pi_\theta^\top r\big)\Big)\\
&= \frac{1}{1-\gamma}\Big(2\pi_\theta(m^*)\big(r(m^*)-\pi_\theta^\top r\big)+\sum_{i\ne m^*,m}\pi_\theta(i)\big(r(i)-\pi_\theta^\top r\big)\Big)\\
&= \frac{1}{1-\gamma}\Big(\Big(2\pi_\theta(m^*)+\sum_{i\ne m^*,m}\pi_\theta(i)\Big)\big(r(m^*)-\pi_\theta^\top r\big)-\sum_{i\ne m^*,m}\pi_\theta(i)\big(r(m^*)-r(i)\big)\Big)\\
&\ge \frac{1}{1-\gamma}\Big(\Big(2\pi_\theta(m^*)+\sum_{i\ne m^*,m}\pi_\theta(i)\Big)\big(r(m^*)-\pi_\theta^\top r\big)-\sum_{i\ne m^*,m}\pi_\theta(i)\Big)\\
&\ge \frac{1}{1-\gamma}\Big(\Big(2\pi_\theta(m^*)+\sum_{i\ne m^*,m}\pi_\theta(i)\Big)\frac{\Delta}{M}-\sum_{i\ne m^*,m}\pi_\theta(i)\Big).
\end{align*}
Here the second equality uses $\sum_i\pi_\theta(i)(r(i)-\pi_\theta^\top r)=0$, the first inequality uses $r(m^*)-r(i)\le 1$, and the last inequality uses $r(m^*)-\pi_\theta^\top r\ge\Delta\,(1-\pi_\theta(m^*))\ge\Delta/M$, since $\theta\in S_2^c$ forces $\pi_\theta(m^*)<1/2$. Observe that $\sum_{i\ne m^*,m}\pi_\theta(i) = 1-\pi(m^*)-\pi(m)$. Using this and rearranging we get
\[
\frac{dV^{\pi_\theta}}{d\theta(m^*)}-\frac{dV^{\pi_\theta}}{d\theta(m)} \;\ge\; \frac{1}{1-\gamma}\Big(\pi(m^*)\Big(1+\frac{\Delta}{M}\Big)-\Big(1-\frac{\Delta}{M}\Big)+\pi(m)\Big(1-\frac{\Delta}{M}\Big)\Big) \;\ge\; \frac{1}{1-\gamma}\,\pi(m)\Big(1-\frac{\Delta}{M}\Big) \;\ge\; 0.
\]
The second-to-last inequality follows because $\theta\in S_3$ and the choice of $C$. This completes the proof of Claim 3.

Claim 4.
There exists a finite $t_0$ such that $\theta_{t_0}\in S_3$.

Proof.
The proof of this claim relies on the asymptotic convergence result of [2]. We note that their convergence result holds for our choice of $\eta = 2(1-\gamma)/5$. As noted in [25], the choice of $\eta$ is used to justify the gradient ascent lemma D.7. Hence we have $\pi_{\theta_t}\to\pi^*$ as $t\to\infty$. Therefore, there exists a finite $t_0$ such that $\pi_{\theta_{t_0}}(m^*)\ge C$ and hence $\theta_{t_0}\in S_3$.

This completes the proof that there exists a $t_0$ such that $\inf_{t\ge 1}\pi_{\theta_t}(m^*) = \min_{1\le t\le t_0}\pi_{\theta_t}(m^*)$: once $\theta_{t_0}\in S_3$, by Claim 3, $\theta_{t_0}\in S_1$; further, by Claim 2, $\theta_t\in S_1$ for all $t\ge t_0$, and $\pi_{\theta_t}(m^*)$ is non-decreasing after $t_0$.

With uniform initialization, $\theta_1(m^*) = 1/M \ge \theta_1(m)$ for all $m\ne m^*$. Hence $\pi_{\theta_1}(m^*)\ge\pi_{\theta_1}(m)$ for all $m\ne m^*$. This implies $\theta_1\in S_2$, which implies $\theta_1\in S_1$. As established in Claim 2, $S_1$ remains invariant under gradient ascent updates, implying $t_0 = 1$. Hence we have that $\inf_{t\ge 1}\pi_{\theta_t}(m^*) = \pi_{\theta_1}(m^*) = 1/M$, completing the proof of Theorem 4.1 and Corollary 4.1.1.
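This behaviour is easy to probe numerically. The following is a minimal sketch (not the paper's experiment code) of exact-gradient softmax policy gradient on a toy three-controller bandit; the reward vector, discount and horizon are arbitrary illustrative choices. From uniform initialization it checks that $\pi_{\theta_t}(m^*)$ is non-decreasing, so that $\inf_t\pi_{\theta_t}(m^*) = 1/M$, and that $t\,\delta_t$ stays bounded, consistent with the $O(1/t)$ rate.

```python
import numpy as np

# Minimal sketch (illustrative values, not the paper's experiment code):
# exact-gradient softmax policy gradient on a 3-controller bandit instance.
gamma = 0.9
eta = 2 * (1 - gamma) / 5            # step size 1/beta used in the proofs
r = np.array([0.3, 0.5, 0.9])        # r(m*) = 0.9, so Delta = 0.4
theta = np.zeros(3)                  # uniform initialization: pi_1(m) = 1/M

pis, gaps = [], []
for t in range(5000):
    pi = np.exp(theta) / np.exp(theta).sum()
    pis.append(pi[2])
    gaps.append((r.max() - pi @ r) / (1 - gamma))      # delta_t
    theta += eta * pi * (r - pi @ r) / (1 - gamma)     # eq. (13) gradient

assert np.all(np.diff(pis) >= -1e-12)  # pi_t(m*) non-decreasing (Claim 2)
print(min(pis))                        # = 1/M = 1/3, attained at t = 1
print(gaps[-1] * len(gaps))            # t * delta_t stays bounded: O(1/t)
```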
C.2 Proofs for MABs with noisy gradients

We recall Theorem 4.3.
Theorem 4.3.
With the value of $\alpha$ chosen to be less than $\frac{\Delta_{\min}}{r(m^*)-\Delta_{\min}}$, $(\pi_t)$ is a Markov process with $\pi_t(m^*)\to 1$ as $t\to\infty$, a.s. Further, the regret up to any time $T$ is bounded as
\[
R(T) \;\le\; \frac{1}{1-\gamma}\cdot\frac{\sum_{m\ne m^*}\Delta_m}{\alpha\,\Delta_{\min}^2}\,\log T + C.
\]

Proof.
The proof is an extension of that of Theorem 1 of [13] to our setting. It is divided into three main parts. In the first part we show that the recurrence time of the process $\{\pi_t(m^*)\}_{t\ge 1}$ is almost surely finite. Next, we bound the expected value of the time taken by the process $\pi_t(m^*)$ to reach 1. Finally, we show that almost surely $\lim_{t\to\infty}\pi_t(m^*) = 1$; in other words, the process $\{\pi_t(m^*)\}_{t\ge 1}$ is transient. We use all these facts to show a regret bound.

Recall $m^*(t) := \operatorname{argmax}_{m\in[M]}\pi_t(m)$. We start by defining the following quantities, which will be useful for the analysis of Algorithm 2. Let
\[
\tau := \min\{t\ge 1 : \pi_t(m^*) > 1/2\}.
\]
Next, let
\[
S := \{\pi\in\mathcal{P}([M]) : (1-\alpha)/2 \le \pi(m^*) < 1/2\}.
\]
In addition, we define for any $a>0$,
\[
S_a := \{\pi\in\mathcal{P}([M]) : (1-\alpha)/a \le \pi(m^*) < 1/a\}.
\]
Observe that if $\pi_t(m^*)\ge 1/a$ and $\pi_{t+1}(m^*) < 1/a$, then $\pi_{t+1}\in S_a$. This fact follows just from the update step of Algorithm 2, choosing $\eta = \alpha\pi_t(m)$ for every $m\ne m^*$.

Lemma C.2.
For $\alpha>0$ such that $\alpha < \frac{\Delta_{\min}}{r(m^*)-\Delta_{\min}}$, we have $\sup_{\pi\in S}\,\mathbb{E}\big[\tau\,\big|\,\pi_1 = \pi\big] < \infty$.

Proof.
The proof here is given for completeness. We first note the following useful result: for a sequence of positive real numbers $\{a_n\}_{n\ge 1}$ satisfying
\[
a_{n+1} \le a_n - b\,a_n^2
\]
for some $b>0$, we always have
\[
a_n \;\le\; \frac{a_1}{1+a_1 b\,n}.
\]
This inequality follows by rearranging and observing that $a_n$ is a non-increasing sequence; a complete proof can be found in, e.g., ([13], Appendix A.1).

Returning to the proof of the lemma, we proceed by showing that $1/\pi_t(m^*) - ct$ is a supermartingale for some $c>0$. Let $\Delta := \Delta_{\min}$ for ease of notation. Note that if the condition on $\alpha$ holds, then there exists an $\varepsilon>0$ such that $(1+\varepsilon)(1+\alpha) < r^*/(r^*-\Delta)$, where $r^* := r(m^*)$. We choose $c$ to be
\[
c := \frac{\alpha\,r^*}{1+\alpha} - \alpha(r^*-\Delta)(1+\varepsilon) \;>\; 0.
\]
Next, let $x$ be greater than $M$ and satisfying
\[
\frac{x}{x-\alpha M} \;\le\; 1+\varepsilon.
\]
Let $\xi_x := \min\{t\ge 1 : \pi_t(m^*) > 1/x\}$. Since $m^*(t)\ne m^*$ for $t = 1,\dots,\xi_x-1$, we have $\pi_{t+1}(m^*) = (1+\alpha)\pi_t(m^*)$ w.p. $\pi_t(m^*)\,r^*$, and $\pi_{t+1}(m^*) = \pi_t(m^*) - \alpha\pi_t(m^*)^2/\pi_t(m^*(t))$ w.p. $\pi_t(m^*(t))\,r^*(t)$, where $r^*(t) := r(m^*(t))$. Let $y(t) := 1/\pi_t(m^*)$; then we observe by a short calculation that
\[
y(t+1) = \begin{cases} y(t) - \dfrac{\alpha}{1+\alpha}\,y(t) & \text{w.p. } \dfrac{r^*}{y(t)},\\[4pt] y(t) + \dfrac{\alpha\,y(t)}{\pi_t(m^*(t))\,y(t)-\alpha} & \text{w.p. } \pi_t(m^*(t))\,r^*(t),\\[4pt] y(t) & \text{otherwise.}\end{cases}
\]
We see that
\[
\mathbb{E}\big[y(t+1)\,\big|\,\mathcal{H}(t)\big] - y(t) = \frac{r^*}{y(t)}\Big(-\frac{\alpha}{1+\alpha}y(t)\Big) + \pi_t(m^*(t))\,r^*(t)\,\frac{\alpha\,y(t)}{\pi_t(m^*(t))\,y(t)-\alpha} \;\le\; \alpha(r^*-\Delta)(1+\varepsilon) - \frac{\alpha\,r^*}{1+\alpha} = -c.
\]
The inequality holds because $r^*(t)\le r^*-\Delta$, and because $\pi_t(m^*(t))\ge 1/M$ and $y(t)\ge x$ give $\frac{\pi_t(m^*(t))y(t)}{\pi_t(m^*(t))y(t)-\alpha}\le\frac{x}{x-\alpha M}\le 1+\varepsilon$. By the Optional Stopping Theorem [14],
\[
-c\,\mathbb{E}[\xi_x\wedge t] \;\ge\; \mathbb{E}[y(\xi_x\wedge t)] - \mathbb{E}[y(1)] \;\ge\; -\frac{x}{1-\alpha}.
\]
The final inequality holds because $\pi_1(m^*)\ge(1-\alpha)/x$. Next, applying the Monotone Convergence Theorem gives that $\mathbb{E}[\xi_x] \le \frac{x}{c(1-\alpha)}$. Finally, to complete the proof of Lemma C.2 we refer the reader to (Appendix A.2, [13]); the remaining steps follow from standard Markov chain arguments.

Next we define an embedded Markov chain $\{p(s), s\in\mathbb{Z}_+\}$ as follows. First let
\[
\sigma(k) := \min\{t\ge\tau(k) : \pi_t(m^*) < 1/2\} \quad\text{and}\quad \tau(k) := \min\{t\ge\sigma(k-1) : \pi_t(m^*)\ge 1/2\}.
\]
Note that within the region $[\tau(k),\sigma(k))$, $\pi_t(m^*)\ge 1/2$, and within $[\sigma(k),\tau(k+1))$, $\pi_t(m^*) < 1/2$. We next analyze the rate at which $\pi_t(m^*)$ approaches 1. Define $p(s) := \pi_{t_s}(m^*)$, where
\[
t_s = s + \sum_{i=0}^{k}\big(\tau(i+1)-\sigma(i)\big) \quad\text{for } s\in\Bigg[\sum_{i=0}^{k}\big(\sigma(i)-\tau(i)\big),\ \sum_{i=0}^{k+1}\big(\sigma(i)-\tau(i)\big)\Bigg).
\]
Also let $\sigma_s := \min\{t>0 : \pi_{t+t_s}(m^*) < 1/2\}$ and $\tau_s := \min\{t>\sigma_s : \pi_{t+t_s}(m^*)\ge 1/2\}$.

Lemma C.3. The process $\{p(s)\}_{s\ge 1}$ is a submartingale. Further, $p(s)\to 1$ as $s\to\infty$. Finally,
\[
\mathbb{E}[p(s)] \;\ge\; 1 - \frac{1}{1+\frac{\alpha\Delta^2}{\sum_{m'\ne m^*}\Delta_{m'}}\,s}.
\]

Proof.
We first observe that
\[
p(s+1) = \begin{cases}\pi_{t_s+1}(m^*) & \text{if } \pi_{t_s+1}(m^*)\ge 1/2,\\ \pi_{t_s+\tau_s}(m^*) & \text{if } \pi_{t_s+1}(m^*) < 1/2.\end{cases}
\]
Since $\pi_{t_s+\tau_s}(m^*)\ge 1/2$, we have that $p(s+1)\ge\pi_{t_s+1}(m^*)$ and $p(s) = \pi_{t_s}(m^*)$. Since at times $t_s$ we have $\pi_{t_s}(m^*) > 1/2$, we know that $m^*$ is the leading arm. Thus, by the update step, for all $m\ne m^*$,
\[
\pi_{t_s+1}(m) = \pi_{t_s}(m) + \alpha\,\pi_{t_s}(m)\Big[I_m R_m(t_s) - I_{m^*}R_{m^*}(t_s)\,\frac{\pi_{t_s}(m)}{\pi_{t_s}(m^*)}\Big].
\]
Taking expectations on both sides,
\[
\mathbb{E}\big[\pi_{t_s+1}(m)\,\big|\,\mathcal{H}(t_s)\big] - \pi_{t_s}(m) = \alpha\,\pi_{t_s}(m)^2\,(r_m - r_{m^*}) = -\alpha\Delta_m\,\pi_{t_s}(m)^2.
\]
Summing over all $m\ne m^*$:
\[
\pi_{t_s}(m^*) - \mathbb{E}\big[\pi_{t_s+1}(m^*)\,\big|\,\mathcal{H}(t_s)\big] = -\alpha\sum_{m\ne m^*}\Delta_m\,\pi_{t_s}(m)^2.
\]
By Jensen's inequality,
\[
\sum_{m\ne m^*}\Delta_m\pi_{t_s}(m)^2 = \Big(\sum_{m'\ne m^*}\Delta_{m'}\Big)\sum_{m\ne m^*}\frac{\Delta_m}{\sum_{m'\ne m^*}\Delta_{m'}}\,\pi_{t_s}(m)^2 \;\ge\; \Big(\sum_{m'\ne m^*}\Delta_{m'}\Big)\Bigg(\sum_{m\ne m^*}\frac{\Delta_m\,\pi_{t_s}(m)}{\sum_{m'\ne m^*}\Delta_{m'}}\Bigg)^2 \;\ge\; \frac{\Delta^2\big(\sum_{m\ne m^*}\pi_{t_s}(m)\big)^2}{\sum_{m'\ne m^*}\Delta_{m'}} = \frac{\Delta^2\big(1-\pi_{t_s}(m^*)\big)^2}{\sum_{m'\ne m^*}\Delta_{m'}}.
\]
Hence we get
\[
p(s) - \mathbb{E}\big[p(s+1)\,\big|\,\mathcal{H}(t_s)\big] \;\le\; -\frac{\alpha\Delta^2\,(1-p(s))^2}{\sum_{m'\ne m^*}\Delta_{m'}} \;\Longrightarrow\; \mathbb{E}\big[p(s+1)\,\big|\,\mathcal{H}(t_s)\big] \;\ge\; p(s) + \frac{\alpha\Delta^2\,(1-p(s))^2}{\sum_{m'\ne m^*}\Delta_{m'}}.
\]
This immediately implies that $\{p(s)\}_{s\ge 1}$ is a submartingale. Since $\{p(s)\}$ is non-negative and bounded by 1, by the Martingale Convergence Theorem $\lim_{s\to\infty}p(s)$ exists. We will now show that the limit is 1. Clearly, it is sufficient to show that $\limsup_{s\to\infty}p(s) = 1$. For $a>2$, let
\[
\varphi_a := \min\Big\{s\ge 1 : p(s)\ge\frac{a-1}{a}\Big\}.
\]
As is shown in [13], it is sufficient to show $\varphi_a<\infty$ with probability 1, because then one can define a sequence of stopping times for increasing $a$, each finite w.p. 1, which implies that $p(s)\to 1$. By the previous display, we have
\[
\mathbb{E}\big[p(s+1)\,\big|\,\mathcal{H}(t_s)\big] - p(s) \;\ge\; \frac{\alpha\Delta^2}{\big(\sum_{m'\ne m^*}\Delta_{m'}\big)\,a^2}
\]
as long as $p(s)\le\frac{a-1}{a}$. Hence, by applying the Optional Stopping Theorem and rearranging, we get
\[
\mathbb{E}[\varphi_a] \;\le\; \lim_{s\to\infty}\mathbb{E}[\varphi_a\wedge s] \;\le\; \frac{\big(\sum_{m'\ne m^*}\Delta_{m'}\big)\,a^2}{\alpha\Delta^2}\big(1-\mathbb{E}[p(1)]\big) \;<\; \infty.
\]
Since $\varphi_a$ is a non-negative random variable with finite expectation, $\varphi_a<\infty$ a.s. Let $q(s) = 1-p(s)$. Using Jensen's inequality once more, we have
\[
\mathbb{E}[q(s+1)] - \mathbb{E}[q(s)] \;\le\; -\frac{\alpha\Delta^2}{\sum_{m'\ne m^*}\Delta_{m'}}\big(\mathbb{E}[q(s)]\big)^2.
\]
By the useful result stated at the beginning of the proof of Lemma C.2, we get
\[
\mathbb{E}[q(s)] \;\le\; \frac{\mathbb{E}[q(1)]}{1+\frac{\alpha\Delta^2\,\mathbb{E}[q(1)]}{\sum_{m'\ne m^*}\Delta_{m'}}\,s} \;\le\; \frac{1}{1+\frac{\alpha\Delta^2}{\sum_{m'\ne m^*}\Delta_{m'}}\,s}.
\]
This completes the proof of the lemma.

Finally, we provide a lemma to tie the results above together; we refer to (Appendix A.5, [13]) for its proof.
Lemma C.4. $\sum_{t\ge 1}\mathbb{P}[\pi_t(m^*) < 1/2] < \infty$. Also, with probability 1, $\pi_t(m^*)\to 1$ as $t\to\infty$.

Proof of the regret bound: Since $r^* - r(m)\le 1$, we have by the definition of regret (see eq. (9))
\[
R(T) = \mathbb{E}\Bigg[\frac{1}{1-\gamma}\sum_{t=1}^T\Big(\sum_{m=1}^M\pi^*(m)r_m - \pi_t(m)r_m\Big)\Bigg].
\]
Here we recall that $\pi^* = e_{m^*}$, so we have
\begin{align*}
R(T) &= \frac{1}{1-\gamma}\,\mathbb{E}\Bigg[\sum_{t=1}^T\Big(\sum_{m=1}^M\big(\pi^*(m)r_m-\pi_t(m)r_m\big)\Big)\Bigg] = \frac{1}{1-\gamma}\,\mathbb{E}\Bigg[\sum_{m=1}^M\Big(\sum_{t=1}^T\big(\pi^*(m)r_m-\pi_t(m)r_m\big)\Big)\Bigg]\\
&= \frac{1}{1-\gamma}\,\mathbb{E}\Bigg[\sum_{t=1}^T\Big(r^*-\sum_{m=1}^M\pi_t(m)r_m\Big)\Bigg] = \frac{1}{1-\gamma}\,\mathbb{E}\Bigg[\sum_{t=1}^T r^*\big(1-\pi_t(m^*)\big)-\sum_{t=1}^T\sum_{m\ne m^*}\pi_t(m)r_m\Bigg]\\
&= \frac{1}{1-\gamma}\,\mathbb{E}\Bigg[\sum_{t=1}^T\sum_{m\ne m^*}r^*\pi_t(m)-\sum_{t=1}^T\sum_{m\ne m^*}\pi_t(m)r_m\Bigg] = \frac{1}{1-\gamma}\sum_{m\ne m^*}(r^*-r_m)\,\mathbb{E}\Bigg[\sum_{t=1}^T\pi_t(m)\Bigg].
\end{align*}
Hence we have
\[
R(T) = \frac{1}{1-\gamma}\sum_{m\ne m^*}(r^*-r_m)\,\mathbb{E}\Bigg[\sum_{t=1}^T\pi_t(m)\Bigg] \;\le\; \frac{1}{1-\gamma}\sum_{m\ne m^*}\mathbb{E}\Bigg[\sum_{t=1}^T\pi_t(m)\Bigg] = \frac{1}{1-\gamma}\,\mathbb{E}\Bigg[\sum_{t=1}^T\big(1-\pi_t(m^*)\big)\Bigg].
\]
We analyze the last term:
\[
\mathbb{E}\Bigg[\sum_{t=1}^T\big(1-\pi_t(m^*)\big)\Bigg] = \mathbb{E}\Bigg[\sum_{t=1}^T\big(1-\pi_t(m^*)\big)\mathbb{I}\{\pi_t(m^*)\ge 1/2\}\Bigg] + \mathbb{E}\Bigg[\sum_{t=1}^T\big(1-\pi_t(m^*)\big)\mathbb{I}\{\pi_t(m^*)<1/2\}\Bigg] \;\le\; \mathbb{E}\Bigg[\sum_{t=1}^T\big(1-\pi_t(m^*)\big)\mathbb{I}\{\pi_t(m^*)\ge 1/2\}\Bigg] + C_1,
\]
where $C_1 := \sum_{t=1}^\infty\mathbb{P}[\pi_t(m^*)<1/2] < \infty$ by Lemma C.4. Next we observe that
\[
\mathbb{E}\Bigg[\sum_{t=1}^T\big(1-\pi_t(m^*)\big)\mathbb{I}\{\pi_t(m^*)\ge 1/2\}\Bigg] = \mathbb{E}\Bigg[\sum_{s=1}^T q(s)\,\mathbb{I}\{\pi_{t_s}(m^*)\ge 1/2\}\Bigg] \;\le\; \sum_{s=1}^T\mathbb{E}[q(s)] \;\le\; \sum_{s=1}^T\frac{1}{1+\frac{\alpha\Delta^2}{\sum_{m'\ne m^*}\Delta_{m'}}\,s} \;\le\; \sum_{s=1}^T\frac{\sum_{m'\ne m^*}\Delta_{m'}}{\alpha\Delta^2\,s} \;\le\; \frac{\sum_{m'\ne m^*}\Delta_{m'}}{\alpha\Delta^2}\,(1+\log T).
\]
Putting things together, we get
\[
R(T) \;\le\; \frac{1}{1-\gamma}\Bigg(\frac{\sum_{m'\ne m^*}\Delta_{m'}}{\alpha\Delta^2}(1+\log T) + C_1\Bigg) \;=\; \frac{1}{1-\gamma}\cdot\frac{\sum_{m'\ne m^*}\Delta_{m'}}{\alpha\Delta^2}\,\log T + C,
\]
where $C$ absorbs the $T$-independent terms. This completes the proof of Theorem 4.3.
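For intuition, here is a small simulation sketch of the update step analysed above. The update rule is our reconstruction from the transition probabilities used in the proofs (a rewarded arm is reinforced multiplicatively, with the mass rebalanced against the current leading arm $m^*(t)$); the mean rewards, $\alpha$ and horizon are illustrative choices, not the paper's experiment settings.

```python
import numpy as np

# Sketch of the noisy-gradient bandit update, reconstructed from the proof of
# Theorem 4.3 (illustrative, not the paper's code).  Bernoulli rewards; alpha
# must satisfy alpha < Delta_min / (r(m*) - Delta_min) = 0.4 / 0.5 = 0.8 here.
rng = np.random.default_rng(0)
r = np.array([0.3, 0.5, 0.9])                 # mean rewards, Delta_min = 0.4
alpha, M = 0.5, len(r)
pi = np.full(M, 1.0 / M)

pseudo_regret = 0.0
for t in range(20000):
    lead = int(np.argmax(pi))                 # leading arm m*(t)
    m = rng.choice(M, p=pi)                   # play M_t ~ pi_t
    pseudo_regret += r.max() - pi @ r         # (1 - gamma) * per-round regret
    if rng.random() < r[m]:                   # Bernoulli reward R = 1
        if m == lead:                         # leader rewarded: others shrink,
            others = np.arange(M) != lead     # pi(m) *= 1 - a*pi(m)/pi(lead)
            pi[others] -= alpha * pi[others] ** 2 / pi[lead]
            pi[lead] = 1.0 - pi[others].sum()
        else:                                 # pi(m) *= 1 + a, leader pays
            pi[lead] -= alpha * pi[m]
            pi[m] *= 1.0 + alpha
    pi /= pi.sum()                            # guard against float drift

print(pi)             # mass typically concentrates on the best arm (m* = 2)
print(pseudo_regret)  # grows like log T, consistent with Theorem 4.3
```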
D Proofs for MDPs
First we recall the policy gradient theorem.
Theorem D.1 (Policy Gradient Theorem [38]).
\[
\frac{\partial}{\partial\theta}V^{\pi_\theta}(\mu) = \frac{1}{1-\gamma}\sum_{s\in\mathcal{S}}d^{\pi_\theta}_\mu(s)\sum_{a\in\mathcal{A}}\frac{\partial\pi_\theta(a|s)}{\partial\theta}\,Q^{\pi_\theta}(s,a).
\]
Let $s\in\mathcal{S}$ and $m\in[M]$. Let $\tilde{Q}^{\pi_\theta}(s,m) := \sum_{a\in\mathcal{A}}K_m(s,a)\,Q^{\pi_\theta}(s,a)$. Also let $\tilde{A}(s,m) := \tilde{Q}(s,m)-V(s)$.

Lemma D.2.
\[
\frac{\partial}{\partial\theta_{m'}}V^{\pi_\theta}(\mu) = \frac{1}{1-\gamma}\sum_{s\in\mathcal{S}}d^{\pi_\theta}_\mu(s)\,\pi_{m'}\,\tilde{A}^{\pi_\theta}(s,m').
\]
Proof. From the policy gradient theorem D.1, we have:
\begin{align*}
\frac{\partial}{\partial\theta_{m'}}V^{\pi_\theta}(\mu) &= \frac{1}{1-\gamma}\sum_{s}d^{\pi_\theta}_\mu(s)\sum_{a}\frac{\partial\pi_\theta(a|s)}{\partial\theta_{m'}}\,Q^{\pi_\theta}(s,a)\\
&= \frac{1}{1-\gamma}\sum_{s}d^{\pi_\theta}_\mu(s)\sum_{a}\frac{\partial}{\partial\theta_{m'}}\Big(\sum_{m=1}^M\pi_\theta(m)K_m(s,a)\Big)Q^{\pi_\theta}(s,a)\\
&= \frac{1}{1-\gamma}\sum_{s}d^{\pi_\theta}_\mu(s)\sum_{m=1}^M\sum_{a}\Big(\frac{\partial}{\partial\theta_{m'}}\pi_\theta(m)\Big)K_m(s,a)\,Q(s,a)\\
&= \frac{1}{1-\gamma}\sum_{s}d^{\pi_\theta}_\mu(s)\sum_{a}\pi_{m'}\Big(K_{m'}(s,a)-\sum_{m=1}^M\pi_mK_m(s,a)\Big)Q(s,a)\\
&= \frac{1}{1-\gamma}\sum_{s}d^{\pi_\theta}_\mu(s)\,\pi_{m'}\Big[\sum_{a}K_{m'}(s,a)Q(s,a)-\sum_{a}\sum_{m=1}^M\pi_mK_m(s,a)Q(s,a)\Big]\\
&= \frac{1}{1-\gamma}\sum_{s}d^{\pi_\theta}_\mu(s)\,\pi_{m'}\big[\tilde{Q}(s,m')-V(s)\big] = \frac{1}{1-\gamma}\sum_{s}d^{\pi_\theta}_\mu(s)\,\pi_{m'}\,\tilde{A}^{\pi_\theta}(s,m').
\end{align*}

Lemma D.3. $V^{\pi_\theta}(\mu)$ is $\frac{7\gamma^2+4\gamma+5}{2(1-\gamma)^3}$-smooth.

Proof. The proof uses ideas from [2] and [25]. Let $\theta_\alpha = \theta+\alpha u$, where $u\in\mathbb{R}^M$ and $\alpha\in\mathbb{R}$. For any $s\in\mathcal{S}$,
\begin{align*}
\sum_a\bigg|\frac{\partial\pi_{\theta_\alpha}(a|s)}{\partial\alpha}\Big|_{\alpha=0}\bigg| &= \sum_a\bigg|\Big\langle\frac{\partial\pi_{\theta_\alpha}(a|s)}{\partial\theta_\alpha}\Big|_{\alpha=0},\frac{\partial\theta_\alpha}{\partial\alpha}\Big\rangle\bigg| = \sum_a\bigg|\Big\langle\frac{\partial\pi_{\theta_\alpha}(a|s)}{\partial\theta_\alpha}\Big|_{\alpha=0},u\Big\rangle\bigg|\\
&= \sum_a\bigg|\sum_{m''=1}^M\sum_{m=1}^M\pi_{\theta_{m''}}(I_{mm''}-\pi_{\theta_m})K_m(s,a)\,u(m'')\bigg|\\
&= \sum_a\bigg|\sum_{m''=1}^M\pi_{\theta_{m''}}\Big(K_{m''}(s,a)\,u(m'')-\sum_{m=1}^M\pi_{\theta_m}K_m(s,a)\,u(m'')\Big)\bigg|\\
&\le \sum_a\sum_{m''=1}^M\pi_{\theta_{m''}}K_{m''}(s,a)|u(m'')| + \sum_a\sum_{m''=1}^M\sum_{m=1}^M\pi_{\theta_{m''}}\pi_{\theta_m}K_m(s,a)|u(m'')|\\
&= \sum_{m''=1}^M\pi_{\theta_{m''}}|u(m'')|\underbrace{\sum_aK_{m''}(s,a)}_{=1} + \sum_{m''=1}^M\sum_{m=1}^M\pi_{\theta_{m''}}\pi_{\theta_m}|u(m'')|\underbrace{\sum_aK_m(s,a)}_{=1}\\
&= 2\sum_{m''=1}^M\pi_{\theta_{m''}}|u(m'')| \;\le\; 2\|u\|.
\end{align*}
Next we bound the second derivative:
\[
\sum_a\bigg|\frac{\partial^2\pi_{\theta_\alpha}(a|s)}{\partial\alpha^2}\Big|_{\alpha=0}\bigg| = \sum_a\bigg|\Big\langle\frac{\partial}{\partial\theta_\alpha}\frac{\partial\pi_{\theta_\alpha}(a|s)}{\partial\alpha}\Big|_{\alpha=0},u\Big\rangle\bigg| = \sum_a\bigg|\Big\langle\frac{\partial^2\pi_{\theta_\alpha}(a|s)}{\partial\theta_\alpha^2}\Big|_{\alpha=0}u,\,u\Big\rangle\bigg|.
\]
Let $H^{a,\theta} := \frac{\partial^2\pi_\theta(a|s)}{\partial\theta^2}\in\mathbb{R}^{M\times M}$. We have
\begin{align*}
H^{a,\theta}_{i,j} &= \frac{\partial}{\partial\theta_j}\Big(\sum_{m=1}^M\pi_{\theta_i}(I_{mi}-\pi_{\theta_m})K_m(s,a)\Big) = \frac{\partial}{\partial\theta_j}\Big(\pi_{\theta_i}K_i(s,a)-\sum_{m=1}^M\pi_{\theta_i}\pi_{\theta_m}K_m(s,a)\Big)\\
&= \pi_{\theta_j}(I_{ij}-\pi_{\theta_i})K_i(s,a)-\sum_{m=1}^MK_m(s,a)\frac{\partial(\pi_{\theta_i}\pi_{\theta_m})}{\partial\theta_j}\\
&= \pi_j(I_{ij}-\pi_i)K_i(s,a)-\sum_{m=1}^MK_m(s,a)\big(\pi_j(I_{ij}-\pi_i)\pi_m+\pi_i\pi_j(I_{mj}-\pi_m)\big)\\
&= \pi_j\Big((I_{ij}-\pi_i)K_i(s,a)-\sum_{m=1}^M\pi_m(I_{ij}-\pi_i)K_m(s,a)-\sum_{m=1}^M\pi_i(I_{mj}-\pi_m)K_m(s,a)\Big).
\end{align*}
Plugging this into the second derivative, we get
\begin{align*}
\bigg|\Big\langle\frac{\partial^2\pi_\theta(a|s)}{\partial\theta^2}u,u\Big\rangle\bigg| &= \bigg|\sum_{j=1}^M\sum_{i=1}^MH^{a,\theta}_{i,j}u_iu_j\bigg|\\
&= \bigg|\sum_{i=1}^M\pi_iu_i^2\Big(K_i(s,a)-\sum_{m=1}^M\pi_mK_m(s,a)\Big)-2\sum_{i=1}^M\pi_iu_i\sum_{j=1}^M\pi_ju_j\Big(K_j(s,a)-\sum_{m=1}^M\pi_mK_m(s,a)\Big)\bigg|\\
&\le \sum_{i=1}^M\pi_iu_i^2\underbrace{\Big|K_i(s,a)-\sum_{m=1}^M\pi_mK_m(s,a)\Big|}_{\le 1}+2\sum_{i=1}^M\pi_i|u_i|\sum_{j=1}^M\pi_j|u_j|\underbrace{\Big|K_j(s,a)-\sum_{m=1}^M\pi_mK_m(s,a)\Big|}_{\le 1}\\
&\le \|u\|^2+2\sum_{i=1}^M\pi_i|u_i|\sum_{j=1}^M\pi_j|u_j| \;\le\; 3\|u\|^2.
\end{align*}
The rest of the proof is similar to [25], and we include it for completeness.
Define $P(\alpha)\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}$, where for all $(s,s')$,
\[
[P(\alpha)]_{(s,s')} = \sum_{a\in\mathcal{A}}\pi_{\theta_\alpha}(a|s)\,P(s'|s,a).
\]
The derivative w.r.t. $\alpha$ is
\[
\Big[\frac{\partial}{\partial\alpha}P(\alpha)\Big|_{\alpha=0}\Big]_{(s,s')} = \sum_{a\in\mathcal{A}}\Big[\frac{\partial}{\partial\alpha}\pi_{\theta_\alpha}(a|s)\Big|_{\alpha=0}\Big]P(s'|s,a).
\]
For any vector $x\in\mathbb{R}^{|\mathcal{S}|}$,
\[
\Big[\frac{\partial}{\partial\alpha}P(\alpha)\Big|_{\alpha=0}x\Big](s) = \sum_{s'\in\mathcal{S}}\sum_{a\in\mathcal{A}}\Big[\frac{\partial}{\partial\alpha}\pi_{\theta_\alpha}(a|s)\Big|_{\alpha=0}\Big]P(s'|s,a)\,x(s').
\]
The $\ell_\infty$ norm can be upper-bounded as
\[
\Big\|\frac{\partial}{\partial\alpha}P(\alpha)\Big|_{\alpha=0}x\Big\|_\infty = \max_{s\in\mathcal{S}}\bigg|\sum_{s'}\sum_a\Big[\frac{\partial}{\partial\alpha}\pi_{\theta_\alpha}(a|s)\Big|_{\alpha=0}\Big]P(s'|s,a)\,x(s')\bigg| \le \max_{s\in\mathcal{S}}\sum_{s'}\sum_a\bigg|\frac{\partial}{\partial\alpha}\pi_{\theta_\alpha}(a|s)\Big|_{\alpha=0}\bigg|P(s'|s,a)\,\|x\|_\infty \le 2\|u\|\,\|x\|_\infty.
\]
Now we find the second derivative,
\[
\Big[\frac{\partial^2P(\alpha)}{\partial\alpha^2}\Big|_{\alpha=0}\Big]_{(s,s')} = \sum_{a\in\mathcal{A}}\Big[\frac{\partial^2\pi_{\theta_\alpha}(a|s)}{\partial\alpha^2}\Big|_{\alpha=0}\Big]P(s'|s,a),
\]
and taking the $\ell_\infty$ norm,
\[
\Big\|\Big[\frac{\partial^2P(\alpha)}{\partial\alpha^2}\Big|_{\alpha=0}\Big]x\Big\|_\infty \le \max_s\sum_{s'}\sum_a\bigg|\frac{\partial^2\pi_{\theta_\alpha}(a|s)}{\partial\alpha^2}\Big|_{\alpha=0}\bigg|P(s'|s,a)\,\|x\|_\infty \le 3\|u\|^2\|x\|_\infty.
\]
Next we observe that the value function of $\pi_{\theta_\alpha}$ satisfies
\[
V^{\pi_{\theta_\alpha}}(s) = \underbrace{\sum_{a\in\mathcal{A}}\pi_{\theta_\alpha}(a|s)\,r(s,a)}_{r_{\theta_\alpha}(s)} + \gamma\sum_{a\in\mathcal{A}}\pi_{\theta_\alpha}(a|s)\sum_{s'\in\mathcal{S}}P(s'|s,a)\,V^{\pi_{\theta_\alpha}}(s').
\]
In matrix form,
\[
V^{\pi_{\theta_\alpha}} = r_{\theta_\alpha}+\gamma P(\alpha)V^{\pi_{\theta_\alpha}} \;\Longrightarrow\; V^{\pi_{\theta_\alpha}} = (\mathrm{Id}-\gamma P(\alpha))^{-1}r_{\theta_\alpha}.
\]
Let $M(\alpha) := (\mathrm{Id}-\gamma P(\alpha))^{-1} = \sum_{t=0}^\infty\gamma^t[P(\alpha)]^t$. Also, observe that $M(\alpha)\mathbf{1} = \frac{1}{1-\gamma}\mathbf{1}$, so that $\|[M(\alpha)]_{i,:}\|_1 = \frac{1}{1-\gamma}$ for every $i$, where $[M(\alpha)]_{i,:}$ is the $i$th row of $M(\alpha)$. Hence for any vector $x\in\mathbb{R}^{|\mathcal{S}|}$,
\[
\|M(\alpha)x\|_\infty \le \frac{1}{1-\gamma}\|x\|_\infty.
\]
By Assumption 1, we have $\|r_{\theta_\alpha}\|_\infty = \max_s|r_{\theta_\alpha}(s)| \le 1$. Next we bound the derivative of $r_{\theta_\alpha}$ w.r.t. $\alpha$:
\begin{align*}
\bigg|\frac{\partial r_{\theta_\alpha}(s)}{\partial\alpha}\bigg| &= \bigg|\Big(\frac{\partial r_{\theta_\alpha}(s)}{\partial\theta_\alpha}\Big)^\top\frac{\partial\theta_\alpha}{\partial\alpha}\bigg| = \bigg|\sum_{m''=1}^M\sum_{m=1}^M\sum_{a\in\mathcal{A}}\pi_{\theta_\alpha}(m'')(I_{mm''}-\pi_{\theta_\alpha}(m))K_m(s,a)\,r(s,a)\,u(m'')\bigg|\\
&= \bigg|\sum_{m''=1}^M\sum_{a}\pi_{\theta_\alpha}(m'')K_{m''}(s,a)r(s,a)u(m'')-\sum_{m''=1}^M\sum_{m=1}^M\sum_{a}\pi_{\theta_\alpha}(m'')\pi_{\theta_\alpha}(m)K_m(s,a)r(s,a)u(m'')\bigg|\\
&\le \bigg|\sum_{m''=1}^M\sum_{a}\pi_{\theta_\alpha}(m'')K_{m''}(s,a)r(s,a)-\sum_{m''=1}^M\sum_{m=1}^M\sum_{a}\pi_{\theta_\alpha}(m'')\pi_{\theta_\alpha}(m)K_m(s,a)r(s,a)\bigg|\,\|u\|_\infty \;\le\; \|u\|.
\end{align*}
Similarly, we can bound the second derivative,
\[
\Big\|\frac{\partial^2r_{\theta_\alpha}}{\partial\alpha^2}\Big\|_\infty = \max_s\bigg|\frac{\partial^2r_{\theta_\alpha}(s)}{\partial\alpha^2}\bigg| = \max_s\bigg|\Big(\frac{\partial^2r_{\theta_\alpha}(s)}{\partial\theta_\alpha^2}\,\frac{\partial\theta_\alpha}{\partial\alpha}\Big)^\top\frac{\partial\theta_\alpha}{\partial\alpha}\bigg| \le \frac{5}{2}\|u\|^2.
\]
Next, the derivative of the value function w.r.t. $\alpha$ is given by
\[
\frac{\partial V^{\pi_{\theta_\alpha}}(s)}{\partial\alpha} = \gamma\,e_s^\top M(\alpha)\frac{\partial P(\alpha)}{\partial\alpha}M(\alpha)r_{\theta_\alpha} + e_s^\top M(\alpha)\frac{\partial r_{\theta_\alpha}}{\partial\alpha},
\]
and the second derivative by
\[
\frac{\partial^2V^{\pi_{\theta_\alpha}}(s)}{\partial\alpha^2} = \underbrace{2\gamma^2e_s^\top M(\alpha)\frac{\partial P(\alpha)}{\partial\alpha}M(\alpha)\frac{\partial P(\alpha)}{\partial\alpha}M(\alpha)r_{\theta_\alpha}}_{T_1} + \underbrace{\gamma\,e_s^\top M(\alpha)\frac{\partial^2P(\alpha)}{\partial\alpha^2}M(\alpha)r_{\theta_\alpha}}_{T_2} + \underbrace{2\gamma\,e_s^\top M(\alpha)\frac{\partial P(\alpha)}{\partial\alpha}M(\alpha)\frac{\partial r_{\theta_\alpha}}{\partial\alpha}}_{T_3} + \underbrace{e_s^\top M(\alpha)\frac{\partial^2r_{\theta_\alpha}}{\partial\alpha^2}}_{T_4}.
\]
We use the bounds derived above to bound each of the terms in the above display. The calculations here are the same as those for Lemma 7 in [25], except for the particular values of the bounds; hence we directly state the final bounds that we obtain and refer to [25] for the detailed but elementary calculations:
\[
|T_1| \le \frac{8\gamma^2}{(1-\gamma)^3}\|u\|^2,\qquad |T_2| \le \frac{3\gamma}{(1-\gamma)^2}\|u\|^2,\qquad |T_3| \le \frac{4\gamma}{(1-\gamma)^2}\|u\|^2,\qquad |T_4| \le \frac{5/2}{1-\gamma}\|u\|^2.
\]
Combining the above bounds we get
\[
\bigg|\frac{\partial^2V^{\pi_{\theta_\alpha}}(s)}{\partial\alpha^2}\Big|_{\alpha=0}\bigg| \le \Big(\frac{8\gamma^2}{(1-\gamma)^3}+\frac{3\gamma}{(1-\gamma)^2}+\frac{4\gamma}{(1-\gamma)^2}+\frac{5/2}{1-\gamma}\Big)\|u\|^2 = \frac{7\gamma^2+4\gamma+5}{2(1-\gamma)^3}\|u\|^2.
\]
Finally, let $y\in\mathbb{R}^M$ and fix a $\theta\in\mathbb{R}^M$:
\[
\bigg|y^\top\frac{\partial^2V^{\pi_\theta}(s)}{\partial\theta^2}y\bigg| = \bigg|\frac{y^\top}{\|y\|}\frac{\partial^2V^{\pi_\theta}(s)}{\partial\theta^2}\frac{y}{\|y\|}\bigg|\,\|y\|^2 \le \max_{\|u\|=1}\bigg|\Big\langle\frac{\partial^2V^{\pi_\theta}(s)}{\partial\theta^2}u,u\Big\rangle\bigg|\,\|y\|^2 = \max_{\|u\|=1}\bigg|\frac{\partial^2V^{\pi_{\theta_\alpha}}(s)}{\partial\alpha^2}\Big|_{\alpha=0}\bigg|\,\|y\|^2 \le \frac{7\gamma^2+4\gamma+5}{2(1-\gamma)^3}\|y\|^2.
\]
Let $\theta_\xi := \theta+\xi(\theta'-\theta)$, where $\xi\in[0,1]$. Then for all $s,\theta,\theta'$,
\[
\bigg|V^{\pi_{\theta'}}(s)-V^{\pi_\theta}(s)-\Big\langle\frac{\partial V^{\pi_\theta}(s)}{\partial\theta},\theta'-\theta\Big\rangle\bigg| = \frac{1}{2}\bigg|(\theta'-\theta)^\top\frac{\partial^2V^{\pi_{\theta_\xi}}(s)}{\partial\theta_\xi^2}(\theta'-\theta)\bigg| \le \frac{7\gamma^2+4\gamma+5}{4(1-\gamma)^3}\|\theta'-\theta\|^2.
\]
Since $V^{\pi_\theta}(s)$ is $\frac{7\gamma^2+4\gamma+5}{2(1-\gamma)^3}$-smooth for every $s$, $V^{\pi_\theta}(\mu)$ is also $\frac{7\gamma^2+4\gamma+5}{2(1-\gamma)^3}$-smooth.

Lemma D.4 (Value Difference Lemma-1). For any two policies $\pi$ and $\pi'$, and for any state $s\in\mathcal{S}$, the following is true:
\[
V^{\pi'}(s)-V^{\pi}(s) = \frac{1}{1-\gamma}\sum_{s'\in\mathcal{S}}d^{\pi'}_s(s')\sum_{m=1}^M\pi'_m\,\tilde{A}(s',m).
\]
Proof.
\begin{align*}
V^{\pi'}(s)-V^{\pi}(s) &= \sum_{m=1}^M\pi'_m\tilde{Q}'(s,m)-\sum_{m=1}^M\pi_m\tilde{Q}(s,m)\\
&= \sum_{m=1}^M\pi'_m\big(\tilde{Q}'(s,m)-\tilde{Q}(s,m)\big)+\sum_{m=1}^M(\pi'_m-\pi_m)\tilde{Q}(s,m)\\
&= \sum_{m=1}^M(\pi'_m-\pi_m)\tilde{Q}(s,m)+\gamma\sum_{m=1}^M\pi'_m\underbrace{\sum_{a\in\mathcal{A}}K_m(s,a)}_{\sum_m\pi'_mK_m(\cdot)=\pi_{\theta'}(a|s)}\sum_{s'\in\mathcal{S}}P(s'|s,a)\big[V^{\pi'}(s')-V^{\pi}(s')\big]\\
&= \frac{1}{1-\gamma}\sum_{s'\in\mathcal{S}}d^{\pi'}_s(s')\sum_{m'=1}^M(\pi'_{m'}-\pi_{m'})\tilde{Q}(s',m')\\
&= \frac{1}{1-\gamma}\sum_{s'\in\mathcal{S}}d^{\pi'}_s(s')\sum_{m'=1}^M\pi'_{m'}\big(\tilde{Q}(s',m')-V(s')\big)\\
&= \frac{1}{1-\gamma}\sum_{s'\in\mathcal{S}}d^{\pi'}_s(s')\sum_{m'=1}^M\pi'_{m'}\,\tilde{A}(s',m').
\end{align*}

Lemma D.5 (Value Difference Lemma-2). For any two policies $\pi$ and $\pi'$ and state $s\in\mathcal{S}$, the following is true:
\[
V^{\pi'}(s)-V^{\pi}(s) = \frac{1}{1-\gamma}\sum_{s'\in\mathcal{S}}d^{\pi}_s(s')\sum_{m=1}^M(\pi'_m-\pi_m)\,\tilde{Q}^{\pi'}(s',m).
\]
Proof.
We will use $\tilde{Q}$ for $\tilde{Q}^\pi$ and $\tilde{Q}'$ for $\tilde{Q}^{\pi'}$ as a shorthand.
\begin{align*}
V^{\pi'}(s)-V^{\pi}(s) &= \sum_{m=1}^M\pi'_m\tilde{Q}'(s,m)-\sum_{m=1}^M\pi_m\tilde{Q}(s,m)\\
&= \sum_{m=1}^M(\pi'_m-\pi_m)\tilde{Q}'(s,m)+\sum_{m=1}^M\pi_m\big(\tilde{Q}'(s,m)-\tilde{Q}(s,m)\big)\\
&= \sum_{m=1}^M(\pi'_m-\pi_m)\tilde{Q}'(s,m)+\gamma\sum_{m=1}^M\pi_m\Big(\sum_{a\in\mathcal{A}}K_m(s,a)\sum_{s'\in\mathcal{S}}P(s'|s,a)V'(s')-\sum_{a\in\mathcal{A}}K_m(s,a)\sum_{s'\in\mathcal{S}}P(s'|s,a)V(s')\Big)\\
&= \sum_{m=1}^M(\pi'_m-\pi_m)\tilde{Q}'(s,m)+\gamma\sum_{a\in\mathcal{A}}\pi_\theta(a|s)\sum_{s'\in\mathcal{S}}P(s'|s,a)\big[V'(s')-V(s')\big]\\
&= \frac{1}{1-\gamma}\sum_{s'\in\mathcal{S}}d^{\pi}_s(s')\sum_{m=1}^M(\pi'_m-\pi_m)\,\tilde{Q}'(s',m).
\end{align*}
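Lemma D.2 is easy to verify numerically. The following is a minimal sketch with a randomly generated MDP and base controllers (a test fixture, not anything taken from the paper's experiments) that compares the closed-form gradient $\frac{1}{1-\gamma}\sum_s d^{\pi_\theta}_\mu(s)\,\pi_m\,\tilde{A}^{\pi_\theta}(s,m)$ against a finite-difference approximation of $\partial V^{\pi_\theta}(\mu)/\partial\theta_m$.

```python
import numpy as np

# Numerical sanity check of Lemma D.2 on a randomly generated test MDP.
rng = np.random.default_rng(1)
S, A, M, gamma = 4, 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))       # P[s, a, s']
r = rng.random((S, A))                           # rewards in [0, 1]
K = rng.dirichlet(np.ones(A), size=(M, S))       # K[m, s, a]: base controllers
mu = np.full(S, 1.0 / S)

def value(theta):
    pi = np.exp(theta) / np.exp(theta).sum()     # mixture weights pi_theta(m)
    pol = np.einsum('m,msa->sa', pi, K)          # pi_theta(a|s)
    Ppol = np.einsum('sa,sap->sp', pol, P)       # induced transition matrix
    rpol = (pol * r).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * Ppol, rpol)
    Q = r + gamma * P @ V                        # Q[s, a]
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * Ppol.T, mu)
    return pi, V, Q, d, mu @ V

theta = rng.normal(size=M)
pi, V, Q, d, _ = value(theta)
Qt = np.einsum('msa,sa->sm', K, Q)               # Q_tilde(s, m)
grad = (d[:, None] * pi[None, :] * (Qt - V[:, None])).sum(axis=0) / (1 - gamma)

eps = 1e-6                                       # finite-difference check
for m in range(M):
    e = np.zeros(M); e[m] = eps
    fd = (value(theta + e)[4] - value(theta - e)[4]) / (2 * eps)
    print(grad[m], fd)                           # the two columns should agree
```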
The reward satisfies $r(s,a)\in[0,1]$ for all pairs $(s,a)\in\mathcal{S}\times\mathcal{A}$.

Assumption 2.
Let $\pi^* := \operatorname{argmax}_{\pi\in\mathcal{P}([M])}V^{\pi}(s)$. We make the following assumption:
\[
\mathbb{E}_{m\sim\pi^*}\big[\tilde{Q}^{\pi_\theta}(s,m)\big]-V^{\pi_\theta}(s) \;\ge\; 0,\quad\forall s\in\mathcal{S},\ \forall\pi_\theta\in\Pi.
\]
Let the best controller be a point in the $M$-simplex, i.e., $K^* := \sum_{m=1}^M\pi^*_mK_m$.

Lemma D.6 (Non-uniform Łojasiewicz inequality).
\[
\Big\|\frac{\partial}{\partial\theta}V^{\pi_\theta}(\mu)\Big\|_2 \;\ge\; \frac{1}{\sqrt{M}}\Big(\min_{m:\pi^*_m>0}\pi_{\theta_m}\Big)\Bigg\|\frac{d^{\pi^*}_\rho}{d^{\pi_\theta}_\mu}\Bigg\|_\infty^{-1}\big[V^*(\rho)-V^{\pi_\theta}(\rho)\big].
\]
Proof.
\begin{align*}
\Big\|\frac{\partial}{\partial\theta}V^{\pi_\theta}(\mu)\Big\|_2 &= \Bigg(\sum_{m=1}^M\Big(\frac{\partial V^{\pi_\theta}(\mu)}{\partial\theta_m}\Big)^2\Bigg)^{1/2} \ge \frac{1}{\sqrt{M}}\sum_{m=1}^M\bigg|\frac{\partial V^{\pi_\theta}(\mu)}{\partial\theta_m}\bigg| \quad\text{(Cauchy--Schwarz)}\\
&= \frac{1}{\sqrt{M}}\sum_{m=1}^M\frac{1}{1-\gamma}\bigg|\sum_{s\in\mathcal{S}}d^{\pi_\theta}_\mu(s)\,\pi_m\tilde{A}(s,m)\bigg| \quad\text{(Lemma 5.2)}\\
&\ge \frac{1}{\sqrt{M}}\sum_{m=1}^M\pi^*_m\pi_m\frac{1}{1-\gamma}\bigg|\sum_{s\in\mathcal{S}}d^{\pi_\theta}_\mu(s)\,\tilde{A}(s,m)\bigg|\\
&\ge \Big(\min_{m:\pi^*_m>0}\pi_m\Big)\frac{1}{\sqrt{M}}\sum_{m=1}^M\pi^*_m\frac{1}{1-\gamma}\bigg|\sum_{s\in\mathcal{S}}d^{\pi_\theta}_\mu(s)\,\tilde{A}(s,m)\bigg|\\
&\ge \Big(\min_{m:\pi^*_m>0}\pi_m\Big)\frac{1}{\sqrt{M}}\bigg|\sum_{s\in\mathcal{S}}d^{\pi_\theta}_\mu(s)\sum_{m=1}^M\pi^*_m\frac{1}{1-\gamma}\tilde{A}(s,m)\bigg|\\
&= \Big(\min_{m:\pi^*_m>0}\pi_m\Big)\frac{1}{\sqrt{M}}\sum_{s\in\mathcal{S}}d^{\pi_\theta}_\mu(s)\sum_{m=1}^M\pi^*_m\frac{1}{1-\gamma}\tilde{A}(s,m) \quad\text{(Assumption 2)}\\
&\ge \frac{1}{\sqrt{M}}\frac{1}{1-\gamma}\Big(\min_{m:\pi^*_m>0}\pi_m\Big)\Bigg\|\frac{d^{\pi^*}_\rho}{d^{\pi_\theta}_\mu}\Bigg\|_\infty^{-1}\sum_{s\in\mathcal{S}}d^{\pi^*}_\rho(s)\sum_{m=1}^M\pi^*_m\tilde{A}(s,m)\\
&= \frac{1}{\sqrt{M}}\Big(\min_{m:\pi^*_m>0}\pi_m\Big)\Bigg\|\frac{d^{\pi^*}_\rho}{d^{\pi_\theta}_\mu}\Bigg\|_\infty^{-1}\big[V^*(\rho)-V^{\pi_\theta}(\rho)\big] \quad\text{(Lemma D.4)}.
\end{align*}

D.1 Proof of Theorem 5.1
Theorem 5.1 (Convergence of Policy Gradient). With $\{\theta_t\}_{t\ge 1}$ generated as in Algorithm 1 and using a learning rate $\eta = \frac{2(1-\gamma)^3}{7\gamma^2+4\gamma+5}$, for all $t\ge 1$,
\[
V^*(\rho)-V^{\pi_{\theta_t}}(\rho) \;\le\; \frac{M}{t}\cdot\frac{7\gamma^2+4\gamma+5}{c^2(1-\gamma)^6}\Bigg\|\frac{d^{\pi^*}_\mu}{\mu}\Bigg\|_\infty^2\,\Big\|\frac{1}{\mu}\Big\|_\infty.
\]
Let $\beta := \frac{7\gamma^2+4\gamma+5}{2(1-\gamma)^3}$. We have that
\begin{align*}
V^*(\rho)-V^{\pi_\theta}(\rho) &= \frac{1}{1-\gamma}\sum_{s\in\mathcal{S}}d^{\pi_\theta}_\rho(s)\sum_{m=1}^M(\pi^*_m-\pi_m)\tilde{Q}^{\pi^*}(s,m) \quad\text{(Lemma D.5)}\\
&= \frac{1}{1-\gamma}\sum_{s\in\mathcal{S}}\frac{d^{\pi_\theta}_\rho(s)}{d^{\pi_\theta}_\mu(s)}\,d^{\pi_\theta}_\mu(s)\sum_{m=1}^M(\pi^*_m-\pi_m)\tilde{Q}^{\pi^*}(s,m)\\
&\le \frac{1}{1-\gamma}\Bigg\|\frac{d^{\pi_\theta}_\rho}{d^{\pi_\theta}_\mu}\Bigg\|_\infty\sum_{s\in\mathcal{S}}d^{\pi_\theta}_\mu(s)\sum_{m=1}^M(\pi^*_m-\pi_m)\tilde{Q}^{\pi^*}(s,m)\\
&\le \frac{1}{(1-\gamma)^2}\Big\|\frac{1}{\mu}\Big\|_\infty\sum_{s\in\mathcal{S}}d^{\pi_\theta}_\mu(s)\sum_{m=1}^M(\pi^*_m-\pi_m)\tilde{Q}^{\pi^*}(s,m) = \frac{1}{1-\gamma}\Big\|\frac{1}{\mu}\Big\|_\infty\big[V^*(\mu)-V^{\pi_\theta}(\mu)\big] \quad\text{(Lemma D.5)},
\end{align*}
where we used $d^{\pi_\theta}_\mu(s)\ge(1-\gamma)\mu(s)$. Let $\delta_t := V^*(\mu)-V^{\pi_{\theta_t}}(\mu)$. Then
\begin{align*}
\delta_{t+1}-\delta_t &= V^{\pi_{\theta_t}}(\mu)-V^{\pi_{\theta_{t+1}}}(\mu)\\
&\le -\frac{1}{2\beta}\Big\|\frac{\partial}{\partial\theta}V^{\pi_{\theta_t}}(\mu)\Big\|^2 \quad\text{(Lemmas 5.3 and D.7)}\\
&\le -\frac{1}{2\beta M}\Big(\min_{m:\pi^*_m>0}\pi_{\theta_t}(m)\Big)^2\Bigg\|\frac{d^{\pi^*}_\mu}{d^{\pi_{\theta_t}}_\mu}\Bigg\|_\infty^{-2}\delta_t^2 \quad\text{(Lemma 5.4)}\\
&\le -\frac{(1-\gamma)^2}{2\beta M}\Big(\min_{m:\pi^*_m>0}\pi_{\theta_t}(m)\Big)^2\Bigg\|\frac{d^{\pi^*}_\mu}{\mu}\Bigg\|_\infty^{-2}\delta_t^2\\
&\le -\frac{(1-\gamma)^2}{2\beta M}\Big(\inf_{t\ge 1}\min_{m:\pi^*_m>0}\pi_{\theta_t}(m)\Big)^2\Bigg\|\frac{d^{\pi^*}_\mu}{\mu}\Bigg\|_\infty^{-2}\delta_t^2 = -\frac{(1-\gamma)^2c^2}{2\beta M}\Bigg\|\frac{d^{\pi^*}_\mu}{\mu}\Bigg\|_\infty^{-2}\delta_t^2,
\end{align*}
where $c := \inf_{t\ge 1}\min_{m:\pi^*_m>0}\pi_{\theta_t}(m)$.
We assume that the constant $c>0$.

Hence we have that
\[
\delta_{t+1} \;\le\; \delta_t - \frac{(1-\gamma)^2c^2}{2\beta M}\Bigg\|\frac{d^{\pi^*}_\mu}{\mu}\Bigg\|_\infty^{-2}\delta_t^2. \tag{19}
\]
The rest of the proof follows from an induction argument over $t$. For ease of notation, let
\[
\varphi := \frac{2\beta M}{c^2(1-\gamma)^2}\Bigg\|\frac{d^{\pi^*}_\mu}{\mu}\Bigg\|_\infty^2.
\]
We need to show that $\delta_t\le\varphi/t$ for all $t\ge 1$. Since $\delta_1\le\frac{1}{1-\gamma}$ and $c\in(0,1]$, we have $\delta_1\le\varphi$, which settles the base case $t=1$. For $t\ge 2$, assume $\delta_t\le\varphi/t$. Let $g:\mathbb{R}\to\mathbb{R}$ be the function $g(x) = x-\frac{x^2}{\varphi}$. One can verify easily that $g$ is monotonically increasing on $\big[0,\varphi/2\big]$ (indeed $g'(x) = 1-2x/\varphi\ge 0$ there), and $\delta_t\le\varphi/t\le\varphi/2$. Then, with equation (19),
\[
\delta_{t+1} \;\le\; \delta_t-\frac{\delta_t^2}{\varphi} = g(\delta_t) \;\le\; g\Big(\frac{\varphi}{t}\Big) = \frac{\varphi}{t}-\frac{\varphi}{t^2} = \varphi\Big(\frac{1}{t}-\frac{1}{t^2}\Big) \;\le\; \frac{\varphi}{t+1}.
\]
This completes the proof.
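The recursion $\delta_{t+1}\le\delta_t-\delta_t^2/\varphi \Rightarrow \delta_t\le\varphi/t$ is the same elementary fact used in Lemma C.2; a short numerical check, with arbitrary illustrative values, is below.

```python
# Quick check (illustrative values) that delta_{t+1} <= delta_t - delta_t^2/phi
# implies delta_t <= phi/t, iterating the worst case (equality) recursion.
phi, delta = 50.0, 10.0                 # any delta_1 <= phi works
for t in range(1, 10**6):
    assert delta <= phi / t + 1e-9
    delta -= delta ** 2 / phi
```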
Lemma D.7.
Let $f:\mathbb{R}^M\to\mathbb{R}$ be $\beta$-smooth. Then gradient ascent with learning rate $\frac{1}{\beta}$ guarantees, for consecutive iterates $x\in\mathbb{R}^M$ and $x' = x+\frac{1}{\beta}\frac{df(x)}{dx}$,
\[
f(x)-f(x') \;\le\; -\frac{1}{2\beta}\Big\|\frac{df(x)}{dx}\Big\|^2.
\]
Proof.
\[
f(x)-f(x') \le -\Big\langle\frac{\partial f(x)}{\partial x},\,x'-x\Big\rangle+\frac{\beta}{2}\|x'-x\|^2 = -\frac{1}{\beta}\Big\|\frac{df(x)}{dx}\Big\|^2+\frac{\beta}{2}\cdot\frac{1}{\beta^2}\Big\|\frac{df(x)}{dx}\Big\|^2 = -\frac{1}{2\beta}\Big\|\frac{df(x)}{dx}\Big\|^2.
\]

E Simulation Details
Choice of hyperparameters.
In the simulations, we set the learning rate to be $10^{-}$, runs $=10$, rollouts $=10$, $l_t = 30$, the discount factor $\gamma = 0.$, and $\alpha = 1/\sqrt{\mathrm{runs}}$. All the simulations have been run for 25 trials and the results shown are averaged over them. We capped the queue sizes at $B = 500$.

Here, we justify the value of the two policies that each always serve one fixed queue, plotted as straight lines in Figure 3c. Let us find the value of the policy which always serves queue 1; the calculation for the other expert (serving queue 2 only) is similar. Let $q_i(t)$ denote the length of queue $i$ at time $t$. Since this expert always serves queue 1, queue 1 remains empty in expectation while queue 2 grows by $\lambda$ per slot, so the expected cost suffered in any round $t$ is $c_t = q_1(t)+q_2(t) = 0+t\lambda$. Starting with empty queues at $t=0$,
\[
V^{\mathrm{Expert}}(\mathbf{0}) = \mathbb{E}\Bigg[\sum_{t=0}^T\gamma^t c_t\;\Big|\;\mathrm{Expert}\Bigg] = \sum_{t=0}^T\gamma^t\,t\,\lambda \;\le\; \frac{\lambda\,\gamma}{(1-\gamma)^2}.
\]
Plugging in $\lambda = 0.49$ and the discount factor above yields the constant value plotted in Figure 3c.
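As a sanity check, the following minimal simulation sketch reproduces the single-queue expert's discounted cost and compares it with the bound $\lambda\gamma/(1-\gamma)^2$. The Bernoulli($\lambda$) arrival model, unit services, and the concrete discount $\gamma = 0.99$ are assumptions made for this illustration, not values taken verbatim from the paper.

```python
import numpy as np

# Illustrative sketch: discounted cost of the expert that always serves
# queue 1.  Bernoulli(lam) arrivals and gamma = 0.99 are assumptions made
# for this example, not settings taken verbatim from the paper.
rng = np.random.default_rng(0)
gamma, lam, T, B, trials = 0.99, 0.49, 3000, 500, 25

costs = np.zeros(T)
for _ in range(trials):
    q1 = q2 = 0
    for t in range(T):
        costs[t] += (q1 + q2) / trials       # c_t = q1(t) + q2(t)
        q1 = min(B, q1 + rng.binomial(1, lam))   # arrivals to each queue
        q2 = min(B, q2 + rng.binomial(1, lam))
        q1 = max(0, q1 - 1)                  # expert: always serve queue 1

value = (gamma ** np.arange(T) * costs).sum()
print(value, lam * gamma / (1 - gamma) ** 2)  # empirical cost vs. the bound
```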