Gambler’s Ruin Bandit Problem
Nima Akbarzadeh, Cem Tekin
Bilkent University, Electrical and Electronics Engineering Department, Ankara, Turkey
Abstract—In this paper, we propose a new multi-armed bandit problem called the
Gambler’s Ruin Bandit Problem (GRBP). In the GRBP, the learner proceeds in a sequence of rounds, where each round is a Markov Decision Process (MDP) with two actions (arms): a continuation action that moves the learner randomly over the state space around the current state; and a terminal action that moves the learner directly into one of the two terminal states (the goal and the dead-end state). The current round ends when a terminal state is reached, and the learner incurs a positive reward only when the goal state is reached. The objective of the learner is to maximize its long-term reward (the expected number of times the goal state is reached), without any prior knowledge of the state transition probabilities. We first prove a result on the form of the optimal policy for the GRBP. Then, we define the regret of the learner with respect to an omnipotent oracle, which acts optimally in each round, and prove that it increases logarithmically over rounds. We also identify a condition under which the learner’s regret is bounded. A potential application of the GRBP is optimal medical treatment assignment, in which the continuation action corresponds to a conservative treatment and the terminal action corresponds to a risky treatment such as surgery.
I. INTRODUCTION
Multi-armed bandits (MAB) are used to model a plethora of applications that require sequential decision making under uncertainty, ranging from clinical trials [1] to web advertising [2]. In the conventional MAB [3], [4], the learner chooses an action from a finite set of actions at each round and receives a random reward. The goal of the learner is to maximize its long-term expected reward by choosing actions that yield high rewards. This is a non-trivial task, since the reward distributions are not known beforehand. Numerous order-optimal index-based learning rules have been developed for the conventional MAB [4]–[6]. These rules act myopically by choosing the action with the maximum index in each round. Situations that require multiple actions to be taken in each round cannot be modeled using the conventional MAB. As an example, consider medical treatment administration. At the beginning of each round a patient arrives at the intensive care unit (ICU) with a random initial health state. The goal state is defined as discharge and the dead-end state is defined as death. Actions correspond to treatment options that move the patient randomly over the state space. The objective is to maximize the expected number of patients that are discharged by learning the optimal treatment policy using the observations gathered from the previous patients. In the example given above, each round corresponds to a goal-oriented Markov Decision Process (MDP) with dead-ends
Cem Tekin is supported by TUBITAK 2232 Fellowship (116C043).
[7]. The learner knows the state space and the goal and dead-end states, but does not know the state transition probabilities a priori. At each round, the learner chooses a sequence of actions and only observes the state transitions that result from the chosen actions. In the literature, this kind of feedback information is called bandit feedback [8]. Motivated by the application described above, we propose a new MAB problem in which multiple arms are selected in each round until a terminal state is reached. Due to its resemblance to the
Gambler’s Ruin Problem [9]–[11], we call this new MAB problem the Gambler’s Ruin Bandit Problem (GRBP). In the GRBP, the system proceeds in a sequence of rounds ρ ∈ {1, 2, . . .}. Each round is modeled as an MDP (as in Fig. 1) with unknown state transition probabilities and terminal (absorbing) states. The set of terminal states includes a goal state G and a dead-end state D, and the non-terminal states are ordered between the goal and dead-end states. In each non-terminal state, there are two possible actions: a continuation action (action C) that moves the learner randomly over the state space around the current state; and a terminal action (action F) that moves the learner directly into a terminal state. Starting from a random, non-terminal initial state, the learner chooses a sequence of actions and observes the resulting state transitions until a terminal state is reached. The learner incurs a unit reward if the goal state is reached. Otherwise, it incurs no reward. The goal of the learner is to maximize its cumulative expected reward over the rounds. If the state transition probabilities were known beforehand, an omnipotent oracle with unlimited computational power could calculate the optimal policy that maximizes the probability of hitting the goal state from any initial state, and then select its actions according to the optimal policy. We define the regret of the learner by round ρ as the difference in the expected number of times the goal state is reached by the omnipotent oracle and by the learner by round ρ. First, we show that the optimal policy for the GRBP can be computed in a straightforward manner: there exists a threshold state above which it is always optimal to take action C and on or below which it is always optimal to take action F. Then, we propose an online learning algorithm for the learner, and bound its regret for two different regions in which the actual state transition probabilities can lie. The regret is bounded (finite) in one region, while it is logarithmic in the number of rounds in the other region.
These bounds are problem-specific, in the sense that they are functions of the state transition probabilities. Finally, we illustrate the behavior of the regret as a function of the state transition probabilities through numerical experiments.
Figure 1. State transition model of the GRBP. Only state transitions out of state s are shown. Dashed arrows correspond to possible state transitions by taking action F, while solid arrows correspond to possible state transitions by taking action C. Weights on the arrows correspond to state transition probabilities. The state transition probabilities for all other non-terminal states are the same as for state s.
The contributions of this paper can be summarized as follows:
• We define a new MAB problem, called the GRBP, in which the learner takes a sequence of actions in each round with the objective of reaching the goal state.
• We show that using conventional MAB algorithms such as UCB1 [4] in the GRBP by enumerating all deterministic Markov policies is very inefficient and results in high regret.
• We prove that the optimal policy for the GRBP has a threshold form and that the value of the threshold can be calculated in a computationally efficient way.
• We derive bounds on the regret of the learner with respect to an omnipotent oracle that acts optimally. Unlike the conventional MAB, where the regret growth is at least logarithmic in the number of rounds [3], in the GRBP the regret can be either logarithmic or bounded, based on the values of the state transition probabilities. We explicitly characterize the region of state transition probabilities in which the regret is bounded.
The remainder of the paper is organized as follows. Related work is given in Section II. The GRBP is defined in Section III. The form of the optimal policy for the GRBP is given in Section IV. The learning algorithm for the GRBP is given in Section V together with its regret analysis. Numerical results are shown in Section VI.
The conclusion is given in Section VII.
II. RELATED WORK
A. Gambler’s Ruin Problem
If action F is removed from the GRBP, it becomes the Gambler’s Ruin Problem. In the model of Hunter et al. [10] of the Gambler’s Ruin Problem, in addition to the standard outcome of moving one state to the left or right, two extra outcomes are also considered. One outcome changes the state immediately to G, while the other changes the state immediately to D. These outcomes are referred to as Windfall and Catastrophe outcomes, respectively. The ruin and winning probabilities and the duration of the game are calculated based on these additional outcomes. In another model [11], modifications such as the chance of absorption in states other than G and D and the chance of staying in the same state are considered. The ruin and winning probabilities are calculated according to the proposed state transition model. Unlike the GRBP, which is an MDP, the Gambler’s Ruin Problem is a Markov chain. Moreover, the ruin and winning probabilities in the models above can be calculated exactly, since the transition probabilities are assumed to be known.
B. MDPs
The GRBP is closely related to goal-oriented MDPs and stochastic shortest path problems [12]. For these problems, in each state (or time epoch), an action has to be taken with the aim of reaching the goal state (G) with minimum cost. For this task, the optimal policy has to be determined beforehand using the set of known transition probabilities. Recently, progress has been made in obtaining solutions for MDPs that have dead-end (D) states in addition to goal (G) states [7], [13]. These solutions require value iteration and heuristic search methods to be performed using the knowledge of the transition probabilities. To the best of our knowledge, a reinforcement learning algorithm that works without knowing the transition probabilities a priori and that achieves logarithmic regret bounds has not yet been developed for these problems. Reinforcement learning in MDPs has been considered by numerous researchers [14], [15]. In these works, it is assumed that the underlying MDP is unknown but ergodic, i.e., it is possible to reach from any state to all other states with a positive probability under any policy. These works adopt the principle of optimism under uncertainty to choose an action that maximizes the expected reward among a set of MDP models that are consistent with the estimated transition probabilities. Unlike these works, in the GRBP (i) the MDP is not ergodic, and (ii) the reward is obtained only in the terminal state and not after each chosen action.
C. Multi-armed Bandits
Over the last decade, many variations of the MAB problem have been studied and many different learning algorithms have been proposed, including the Gittins index [16], upper confidence bound policies (UCB-1, UCB-2, Normalized UCB, KL-UCB) [4]–[6], greedy policies (the ε-greedy algorithm) [4] and Thompson sampling [17] (see [8] for a comprehensive analysis of the MAB problem). The performance of a learning algorithm for a MAB problem is measured using the notion of regret. For the stochastic MAB problem [3], the regret is defined as the difference between the total (expected) reward of the learning algorithm and that of an oracle which acts optimally based on complete knowledge of the problem parameters. It is shown that the regret grows logarithmically in the number of rounds for this problem. The GRBP can be viewed as a MAB problem in which each arm corresponds to a policy. Since the set of possible deterministic policies for the GRBP is exponential in the number of states, it is infeasible to use algorithms developed for MAB problems to directly learn the optimal policy by experimenting with different policies over different rounds. In addition, the GRBP model does not fit into the combinatorial models proposed in prior works [18]. Due to these differences, existing MAB solutions cannot solve the GRBP in an efficient way. Therefore, a new learning methodology that exploits the structure of the GRBP is needed.
III. PROBLEM FORMULATION
A. Definition of the GRBP
In the GRBP, the system is composed of a finite set of states S := {D, 1, . . . , G}, where the integer D = 0 denotes the dead-end state and G denotes the goal state. The set of initial (starting) states is denoted by S̃ := {1, . . . , G − 1}. The system operates in rounds (ρ = 1, 2, . . .). The initial state of each round is drawn from a probability distribution q(s), s ∈ S̃ over the set of initial states S̃, such that 1 − q(1) > 0. The current round ends and the next round starts when the learner hits state D or G. Because of this, D and G are called terminal states. All other states are called non-terminal states. Each round is divided into multiple time slots, in each of which the learner takes an action from the action set A := {C, F} with the aim of reaching state G. Here, C denotes the continuation action and F the terminal action. According to Fig. 1, action C moves the learner one state to the right or to the left of the current state. Action F moves the learner directly to one of the terminal states. The possible outcomes of each action in a non-terminal state s are shown in Fig. 1. Let s^ρ_t denote the state at the beginning of the t-th time slot of round ρ and a^ρ_t denote the action taken at the t-th time slot of round ρ. The state transition probabilities for action C are given by
Pr(s^ρ_{t+1} = s + 1 | s^ρ_t = s, a^ρ_t = C) = p_C, t ≥ 1, s ∈ S̃
Pr(s^ρ_{t+1} = s − 1 | s^ρ_t = s, a^ρ_t = C) = p_D, t ≥ 1, s ∈ S̃
where p_C + p_D = 1. The state transition probabilities for action F are given by
Pr(s^ρ_{t+1} = G | s^ρ_t = s, a^ρ_t = F) = p_F, t ≥ 1, s ∈ S̃
Pr(s^ρ_{t+1} = D | s^ρ_t = s, a^ρ_t = F) = 1 − p_F, t ≥ 1, s ∈ S̃
where 0 < p_F < 1. If the state transition probabilities are known, each round can be modeled as an MDP and an optimal policy can be found by dynamic programming [12], [19].
B. Value Functions, Rewards and the Optimal Policy
Let π = (π_1, π_2, . . .), where π_t : S̃ → A, t ≥ 1, represent a deterministic Markov policy. π is a stationary policy if π_t = π_{t′} for all t and t′. For this case we will simply use π : S̃ → A to denote a stationary deterministic Markov policy. Since the time horizon is infinite within a round and the state transition probabilities are time-invariant, it is sufficient to search for the optimal policy within the set of stationary deterministic Markov policies, which is denoted by Π. Let V^π(s) denote the probability of reaching G by using policy π given that the system is in state s. Let Q^π(s, a) denote the probability of reaching G by taking action a in state s, and then continuing according to policy π. We have
Q^π(s, C) = p_C V^π(s + 1) + p_D V^π(s − 1),
Q^π(s, F) = p_F
for s ∈ S̃. Hence, V^π(s), s ∈ S̃ can be computed by solving the following set of equations:
V^π(G) = 1, V^π(D) = 0,
V^π(s) = Q^π(s, π(s)), ∀s ∈ S̃
where π(s) denotes the action selected by π in state s. The value of policy π is defined as
V^π := Σ_{s∈S̃} q(s) V^π(s).
The optimal policy is denoted by π* := arg max_{π∈Π} V^π and the value of the optimal policy is denoted by V* := max_{π∈Π} V^π. The optimal policy is characterized by the Bellman optimality equations: for all s ∈ S̃,
V*(s) = max{p_F V*(G), p_C V*(s + 1) + p_D V*(s − 1)}
= max{p_F, p_C V*(s + 1) + p_D V*(s − 1)}. (1)
As it is sufficient to search for the optimal policy within stationary deterministic Markov policies, and since there are only two actions that can be taken in each s ∈ S̃, the number of all such policies is 2^{G−1}. In Section IV, we will prove that the optimal policy for the GRBP has a simple threshold form, which reduces the number of policies to learn from 2^{G−1} to 2.
C. Online Learning in the GRBP
As we described in the previous subsection, when the state transition probabilities are known, the optimal policy and its probability of reaching the goal can be found by solving the Bellman optimality equations. When the learner does not know p_C and p_F, the optimal policy cannot be computed a priori, and hence needs to be learned. We define the learning loss of the learner, who is not aware of the optimal policy a priori, with respect to an oracle, who knows the optimal policy from the initial round, as the regret given by
Reg(T) := T V* − Σ_{ρ=1}^{T} V^{π̂_ρ}
where π̂_ρ denotes the policy that is used by the learner in round ρ. Let N_π(T) denote the number of times policy π is used by the learner by round T. For any policy π, let Δ_π := V* − V^π denote the suboptimality gap of that policy. The regret can be rewritten as
Reg(T) = Σ_{π∈Π} N_π(T) Δ_π. (2)
In this paper, we will design learning algorithms that minimize the growth rate of the expected regret, i.e., E[Reg(T)]. A straightforward way to do this would be to employ the UCB1 algorithm [4] or its variants [6] by taking each policy as an arm. The result below states a logarithmic bound on the expected regret when UCB1 is used.
Theorem 1.
When UCB1 in [4] is used to select the policy to follow at the beginning of each round (with set of arms Π), we have
E[Reg(T)] ≤ 8 Σ_{π: V^π < V*} (log T / Δ_π) + (1 + π²/3) Σ_{π∈Π} Δ_π.
Proof: See [4].
As shown in Theorem 1, the expected regret of UCB1 depends linearly on the number of suboptimal policies. For the GRBP, the number of policies can be very large. For instance, we have 2^{G−1} different stationary deterministic Markov policies for the defined problem. This implies that using UCB1 to learn the optimal policy is highly inefficient for the GRBP. The learning algorithm we propose in Section V exploits a result on the form of the optimal policy that will be derived in Section IV to learn the optimal policy in a fast manner. This learning algorithm calculates an estimated optimal policy using the estimated transition probabilities, and hence learns much faster than applying UCB1 naively. Moreover, it can even achieve bounded regret (instead of logarithmic regret) in some special cases.
IV. FORM OF THE OPTIMAL POLICY
In this section, we prove that the optimal policy for the GRBP has a threshold form. The value of the threshold depends only on the state transition probabilities and the number of states. First, we give the definition of a stationary threshold policy.
Definition 1. π is a stationary threshold policy if there exists τ ∈ {0, 1, . . . , G − 1} such that π(s) = C for all s > τ and π(s) = F for all s ≤ τ. We use π^tr_τ to denote the stationary threshold policy with threshold τ. The set of stationary threshold policies is given by Π^tr := {π^tr_τ}_{τ∈{0,1,...,G−1}}.
The next lemma constrains the set of policies in which the optimal policy lies.
Lemma 1. In the GRBP it is always optimal to select action C at s ∈ S̃ − {1}.
Proof: By (1), for s ∈ S̃ − {1} we have V*(s) = max{p_F, p_C V*(s + 1) + p_D V*(s − 1)}. If V*(s) = p_F, this implies that
p_C V*(s + 1) + p_D V*(s − 1) ≤ p_F ⇒ V*(s − 1) ≤ (p_F − p_C V*(s + 1)) / p_D. (3)
By definition,
p_F ≤ V*(s), ∀s ∈ S̃.
(4)
Therefore,
(p_F − p_C V*(s + 1)) / p_D ≤ (p_F − p_C p_F) / p_D = p_F,
which in combination with (3) implies that V*(s − 1) ≤ p_F. According to (4), we find that V*(s − 1) = p_F. Then, we conclude that
V*(s) = p_F ⇒ V*(s − 1) = p_F, ∀s ∈ S̃ − {1}.
This also implies that
V*(s + 1) ≤ (p_F − p_D V*(s − 1)) / p_C = p_F.
Consequently, if V*(s) = p_F for some s ∈ S̃ − {1}, then
V*(s) = p_F, ∀s ∈ S̃ − {1}. (5)
By (5), if V*(s) = p_F for some s ∈ S̃ − {1}, then this implies that V*(G − 1) = p_F. Since V*(G) = 1, we have
V*(G − 1) = max{p_F, p_C + p_D p_F} = p_F ⇒ p_F ≥ p_C + p_D p_F ⇒ p_F(1 − p_D) ≥ p_C ⇒ p_F ≥ 1 ⇒ p_F = 1.
This shows that unless p_F = 1, it is suboptimal to select action F in the states S̃ − {1}, and since p_F = 1 is a trivial case, we disregard it. Hence, it is always optimal to select action C at s ∈ S̃ − {1}.
The result of Lemma 1 holds independently of the set of transition probabilities and the number of states. Lemma 1 leaves only two candidates for the optimal policy. The first candidate is the policy which selects action C at every state s ∈ S̃. The second candidate selects action C in all states except state 1. Hence, the optimal policy is always in the set {π^tr_0, π^tr_1}. This reduces the set of policies to consider from 2^{G−1} to 2. Let r := p_D / p_C denote the failure ratio of action C. The next lemma gives the value functions for π^tr_1 and π^tr_0.
Lemma 2. In the GRBP we have
(i) V^{π^tr_1}(s) = p_F + (1 − p_F)(1 − r^{s−1})/(1 − r^{G−1}) when r ≠ 1, and p_F + (1 − p_F)(s − 1)/(G − 1) when r = 1;
(ii) V^{π^tr_0}(s) = (1 − r^s)/(1 − r^G) when r ≠ 1, and s/G when r = 1,
for s ∈ S̃.
Proof: (i): For π^tr_1 we have:
V^{π^tr_1}(G) = 1
V^{π^tr_1}(G − 1) = p_C V^{π^tr_1}(G) + p_D V^{π^tr_1}(G − 2)
. . .
V^{π^tr_1}(2) = p_C V^{π^tr_1}(3) + p_D V^{π^tr_1}(1)
V^{π^tr_1}(1) = p_F.
Using p_C + p_D = 1 to rewrite each equation,
(p_C + p_D) V^{π^tr_1}(G − 1) = p_C + p_D V^{π^tr_1}(G − 2)
. . .
(p_C + p_D) V^{π^tr_1}(2) = p_C V^{π^tr_1}(3) + p_D p_F
⇒ p_C (V^{π^tr_1}(G − 1) − 1) = p_D (V^{π^tr_1}(G − 2) − V^{π^tr_1}(G − 1))
. . .
p_C (V^{π^tr_1}(s + 1) − V^{π^tr_1}(s + 2)) = p_D (V^{π^tr_1}(s) − V^{π^tr_1}(s + 1))
. . .
p_C (V^{π^tr_1}(2) − V^{π^tr_1}(3)) = p_D (p_F − V^{π^tr_1}(2))
⇒ V^{π^tr_1}(G − 1) − V^{π^tr_1}(G) = r^{G−2} (p_F − V^{π^tr_1}(2))
. . .
V^{π^tr_1}(s) − V^{π^tr_1}(s + 1) = r^{s−1} (p_F − V^{π^tr_1}(2))
. . .
V^{π^tr_1}(2) − V^{π^tr_1}(3) = r (p_F − V^{π^tr_1}(2)). (6)
Summation of all the terms results in
1 − V^{π^tr_1}(2) = (V^{π^tr_1}(2) − p_F) Σ_{i=1}^{G−2} r^i (7)
⇒ V^{π^tr_1}(2) Σ_{i=0}^{G−2} r^i = 1 + p_F Σ_{i=1}^{G−2} r^i
⇒ V^{π^tr_1}(2) Σ_{i=0}^{G−2} r^i = 1 − p_F + p_F Σ_{i=0}^{G−2} r^i
⇒ V^{π^tr_1}(2) = p_F + (1 − p_F) / (Σ_{i=0}^{G−2} r^i)
⇒ V^{π^tr_1}(2) = p_F + (1 − p_F)(1 − r)/(1 − r^{G−1}).
Then, for the s-th state, we have to sum up to the (s − 1)-th equation in (6):
V^{π^tr_1}(s) − V^{π^tr_1}(2) = (V^{π^tr_1}(2) − p_F) Σ_{i=1}^{s−2} r^i
⇒ V^{π^tr_1}(s) = p_F + (V^{π^tr_1}(2) − p_F) Σ_{i=0}^{s−2} r^i (8)
⇒ V^{π^tr_1}(s) = p_F + (1 − p_F)(1 − r^{s−1})/(1 − r^{G−1}). (9)
For the fair case, r has to be set to 1 in (7) and (8). Then, V^{π^tr_1}(2) = p_F + (1 − p_F)/(G − 1) and V^{π^tr_1}(s) = p_F + (1 − p_F)(s − 1)/(G − 1).
Case (ii): Since action F is never selected by π^tr_0, for this case the standard analysis of the gambler’s ruin problem applies. Thus, the probability of hitting G from state s is
(1 − r^s)/(1 − r^G) (10)
for r ≠ 1, and s/G for r = 1 [20].
The form of the optimal policy is given in the following theorem.
Theorem 2. In the GRBP, the optimal policy is π^tr_{τ*}, where
τ* = sign(p_F − (1 − r)/(1 − r^G)) when r ≠ 1, and τ* = sign(p_F − 1/G) when r = 1,
where sign(x) = 1 if x is nonnegative and 0 otherwise.
Proof: Since we have found in Lemma 1 that it is always optimal to select action C when the state is in {2, . . . , G − 1}, to find the optimal policy it is sufficient to compare the value functions of the two policies for s = 1. When r ≠ 1, this gives π* = π^tr_1 if (1 − r)/(1 − r^G) < p_F, and π* = π^tr_0 otherwise. Similarly, if r = 1 and 1/G < p_F, then π* = π^tr_1. Otherwise, π* = π^tr_0.
Using these, the value of the optimal threshold is given as
τ* = sign(p_F − (1 − r)/(1 − r^G)) if r ≠ 1, and τ* = sign(p_F − 1/G) if r = 1,
which completes the proof.
When r ≠ 1, the term (1 − r)/(1 − r^G) represents the probability of hitting G starting from state 1 by always selecting action C. This probability is equal to 1/G when r = 1. Because of this, it is optimal to take the terminal action in some cases for which p_C > p_F. Although the continuation action can move the system state in the direction of the goal state for some time, the long-term chance of hitting the goal state by taking the continuation action can be lower than the chance of hitting the goal state by immediately taking the terminal action at state 1. The equation of the boundary at which the optimal policy changes from π^tr_0 to π^tr_1 is
p_F = B(r) := (1 − r)/(1 − r^G) (11)
when r ≠ 1. This decision boundary is illustrated in Fig. 2 for different values of G. We call the region of transition probabilities for which π^tr_0 is optimal the exploration region, and the region for which π^tr_1 is optimal the no-exploration region. In the exploration region, the optimal policy does not take action F in any round. Therefore, any learning algorithm that needs to learn how well action F performs needs to explore action F. As the value of G increases, the area of the exploration region decreases, due to the fact that the probability of hitting the goal state by only taking action C decreases. When (1 − r)/(1 − r^G) = p_F, both π^tr_0 and π^tr_1 are optimal. For this case, we favor π^tr_1 because it always ends the current round.
Figure 2. The boundary between the exploration and no-exploration regions (shown for G = 5, 50, 100).
V. AN ONLINE LEARNING ALGORITHM AND ITS REGRET ANALYSIS
In this section, we propose a learning algorithm that minimizes the regret when the state transition probabilities are unknown.
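At each round, such an algorithm can simply plug estimated probabilities into the threshold rule of Theorem 2. The following sketch (ours, with illustrative function and variable names not taken from the paper) shows that computation:

```python
def optimal_threshold(p_C, p_F, G):
    """Return tau* from Theorem 2: 1 -> take F at state 1, 0 -> always take C.

    Illustrative sketch; assumes 0 < p_C < 1 and 0 < p_F < 1.
    """
    r = (1.0 - p_C) / p_C              # failure ratio r = p_D / p_C
    if r != 1.0:                       # p_C = 0.5 gives r = 1.0 exactly in floats
        boundary = (1.0 - r) / (1.0 - r ** G)   # B(r) in (11)
    else:
        boundary = 1.0 / G
    # sign(x) = 1 if x is nonnegative, 0 otherwise (ties favor the terminal policy)
    return 1 if p_F - boundary >= 0 else 0
```

The rule only compares p_F against the continuation-only hitting probability from state 1, so its cost is independent of the number of policies 2^{G−1}.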
The proposed algorithm forms estimates of the state transition probabilities based on the history of state transitions and then uses these estimates, together with the form of the optimal policy obtained in Section IV, to calculate an estimated optimal policy at each round.
A. Greedy Exploitation with Threshold Based Exploration
The learning algorithm for the GRBP is called Greedy Exploitation with Threshold Based Exploration (GETBE) and its pseudocode is given in Algorithm 1. Unlike conventional MAB algorithms [3], [4], [6], which require all arms to be sampled at least logarithmically many times, GETBE does not need to sample all policies (arms) logarithmically many times to find the optimal policy with a sufficiently high probability. GETBE achieves this by utilizing the form of the optimal policy derived in the previous section. Although GETBE does not require all policies to be explored, it requires exploration of action F when the estimated optimal policy never selects action F. This forced exploration is done to guarantee that GETBE does not get stuck in the suboptimal policy.
GETBE keeps the counters N^G_F(ρ), N_F(ρ), N^u_C(ρ) and N_C(ρ): (i) N^G_F(ρ) is the number of times action F was selected and terminal state G was entered upon selection of action F by the beginning of round ρ; (ii) N_F(ρ) is the number of times action F was selected by the beginning of round ρ; (iii) N^u_C(ρ) is the number of times a transition from some state s to s + 1 happened (i.e., the state moved up) after selecting action C by the beginning of round ρ; (iv) N_C(ρ) is the number of times action C was selected by the beginning of round ρ. Let T_F(ρ) and T_C(ρ) represent the number of times action F and action C are selected in round ρ, respectively. Since action F is a terminal action, it can be selected at most once in each round. However, action C can be selected multiple times in the same round.
Let T^G_F(ρ) and T^u_C(ρ) represent the number of times state G is reached after the selection of action F and the number of times the state moved up after the selection of action C in round ρ, respectively. At the beginning of round ρ, GETBE forms the transition probability estimates p̂^F_ρ := N^G_F(ρ)/N_F(ρ) and p̂^C_ρ := N^u_C(ρ)/N_C(ρ) that correspond to actions F and C, respectively. Then, it computes the estimated optimal policy π̂_ρ by using the form of the optimal policy given in Theorem 2 for the GRBP. If π̂_ρ = π^tr_1, then GETBE operates in greedy exploitation mode by acting according to π^tr_1 for the entire round. Else, if π̂_ρ = π^tr_0, then GETBE operates in triggered exploration mode and selects action F in the first time slot of that round if N_F(ρ) < D(ρ), where D(ρ) is a non-decreasing control function that is an input of GETBE. This control function helps GETBE avoid getting stuck in the suboptimal policy by forcing the selection of action F, although it is suboptimal according to π̂_ρ. When N_F(ρ) ≥ D(ρ), GETBE employs π̂_ρ for the entire round. At the end of round ρ, the values of the counters are updated as follows:
N_F(ρ + 1) = N_F(ρ) + T_F(ρ)
N^G_F(ρ + 1) = N^G_F(ρ) + T^G_F(ρ)
N_C(ρ + 1) = N_C(ρ) + T_C(ρ)
N^u_C(ρ + 1) = N^u_C(ρ) + T^u_C(ρ). (12)
These values are used to estimate the transition probabilities that will be used at the beginning of round ρ + 1, for which the above procedure repeats. In the analysis of GETBE, we will show that when N_F(ρ) ≥ D(ρ), the probability that GETBE selects the suboptimal policy is very small, which implies that the incurred regret is very small.
B.
Regret Analysis
In this section, we bound the (expected) regret of GETBE. We show that GETBE achieves bounded regret when the unknown transition probabilities lie in the no-exploration region, and logarithmic (in the number of rounds) regret when they lie in the exploration region. Based on Theorem 2, GETBE only needs to learn the optimal policy from the set of policies {π^tr_0, π^tr_1}. Using this fact and taking the expectation of (2), the expected regret of GETBE can be written as
E[Reg(T)] = Σ_{π∈{π^tr_0, π^tr_1}} E[N_π(T)] Δ_π. (13)
Let Δ(s) := |V^{π^tr_1}(s) − V^{π^tr_0}(s)|, s ∈ S̃, be the suboptimality gap when the initial state is s. For any π ∈ {π^tr_0, π^tr_1}, we have Δ_π ≤ Δ_max, where Δ_max := max_{s∈S̃} Δ(s). The next lemma gives closed-form expressions for Δ(s) and Δ_max.
Lemma 3. We have
Δ(s) = ((G − s)/(G − 1)) |p_F − 1/G| if r = 1, and
Δ(s) = ((r^{s−1} − r^{G−1})/(1 − r^{G−1})) |p_F − (1 − r)/(1 − r^G)| if r ≠ 1,
I(·) denotes the indicator function, which is 1 if the expression inside evaluates true and 0 otherwise.
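As an aside (ours, not part of the paper), the closed forms of Lemma 3 for r ≠ 1 can be checked numerically against the value functions of Lemma 2; the function names below are illustrative:

```python
def value_pi1(s, p_C, p_F, G):
    # V^{pi^tr_1}(s) from Lemma 2(i); sketch assumes r != 1
    r = (1 - p_C) / p_C
    return p_F + (1 - p_F) * (1 - r ** (s - 1)) / (1 - r ** (G - 1))

def value_pi0(s, p_C, G):
    # V^{pi^tr_0}(s) from Lemma 2(ii); sketch assumes r != 1
    r = (1 - p_C) / p_C
    return (1 - r ** s) / (1 - r ** G)

def gap(s, p_C, p_F, G):
    # Delta(s) from Lemma 3 for r != 1
    r = (1 - p_C) / p_C
    coeff = (r ** (s - 1) - r ** (G - 1)) / (1 - r ** (G - 1))
    return coeff * abs(p_F - (1 - r) / (1 - r ** G))
```

For any admissible (s, p_C, p_F, G) with r ≠ 1, gap(s, ...) agrees with |value_pi1(s, ...) − value_pi0(s, ...)| up to floating-point error.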
then ˆ τ ρ = sign (ˆ p Fρ − /G ) else ˆ τ ρ = sign (ˆ p Fρ − − ˆ r ρ − (ˆ r ρ ) G ) end if 11: Set ˆ π ρ = π tr ˆ τ ρ while s ρt (cid:54) = G or D do if (ˆ π ρ = π tr && N F ( ρ ) < D ( ρ )) || ( s ρt ≤ ˆ τ ρ ) then 14: Select action F , observe state s ρt +1 T F ( ρ ) = T F ( ρ ) + 1 , T GF ( ρ ) = I ( s ρt +1 = G ) else 17: Select action C , observe state s ρt +1 T C ( ρ ) = T C ( ρ ) + 1 T uC ( ρ ) = T uC ( ρ ) + I ( s ρt +1 = s ρt + 1) t = t + 1 end if end while 23: Update the counters according to (12)24: ρ = ρ + 1 end while and ∆ max = | p F − G | if r = 1 | p F − − r − r G | if r (cid:54) = 1 Proof: According to Lemma 2 we have Case (i) r = 1 : ∆( s ) = | V π tr ( s ) − V π tr ( s ) | = | p F + (1 − p F ) s − G − − sG | = | p F ( G − sG − s − G − − sG | = G − sG − | p F − G | . The above equation is maximized when s = 1 . Therefore,when r = 1 , ∆ max = max s ∈ ˜ S ∆( s ) = | p F − G | . Case (ii) r (cid:54) = 1 : ∆( s ) = | V π tr ( s ) − V π tr ( s ) | = | p F + (1 − p F ) r s − − r G − − − r s − r G − | = | p F ( r G − − r s − r G − − r s − − r G − − − r s − r G − | = | p F ( r G − − r s − r G − − r s − r s − + r G − − r G (1 − r G − )(1 − r G ) | = ( r G − − r s − r G − − | p F − − r − r G | . Again, the above equation is maximized when s = 1 .Therefore, when r (cid:54) = 1 , ∆ max = max s ∈ ˜ S ∆( s ) = | p F − − r − r G | . Next, we bound E [ N π ( T )] for the suboptimal policy in aseries of lemmas. From (11), it is clear that the boundary isa function of r . Let r = − xx . Then, the boundary becomesa function of x by which we have B ( x ) = (1 − − xx ) / (1 − ( 1 − xx ) G ) . Let δ be the minimum Euclidean distance of pair ( p C , p F )from the boundary ( x, B ( x ) ) given in Fig. 2. The value of δ specifies the hardness of GRBP. 
When δ is small, it is harder to distinguish the optimal policy from the suboptimal policy. If the pair of estimated transition probabilities (p̂^C_ρ, p̂^F_ρ) in round ρ lies within a ball around (p_C, p_F) with radius less than δ, then GETBE will select the optimal policy in that round. The probability that GETBE selects the optimal policy is therefore lower bounded by the probability that the estimated transition probabilities lie in a ball centered at (p_C, p_F) with radius δ.
The following lemma provides a lower bound on the expected number of times each action is selected by GETBE. This result will be used when bounding the regret of GETBE.
Lemma 4. (i) Let p_{F,1} be the probability of taking action F in round ρ when π̂_ρ = π^tr_1, and let p_{C,1} be the probability of taking action C at least once in round ρ when π̂_ρ = π^tr_1. Then,
p_{C,1} = 1 − q(1)
p_{F,1} = Σ_{s=1}^{G−1} ((G − s)/(G − 1)) q(s) if r = 1, and
p_{F,1} = Σ_{s=1}^{G−1} ((r^{s−1} − r^{G−1})/(1 − r^{G−1})) q(s) if r ≠ 1.
(ii) Let D(ρ) := γ log ρ, where γ > 1/p_{F,1}, and
f_C(ρ) := 0.5 p_{C,1} ρ, f_F(ρ) := 0.5 p_{F,1} (γ − √γ) log ρ.
Let ρ′_C be the first round in which 0.5 p_{C,1} ρ − p_{C,1} ⌈D(ρ)⌉ − √((ρ − ⌈D(ρ)⌉) log ρ) becomes positive, and let ρ′_F be the first round in which both 0.5 p_{F,1} (γ − √γ) log ρ − √(γ log ρ) and ρ − 1 − ⌈D(ρ)⌉ become positive. Then, for a ∈ {F, C} we have
Pr(N_a(ρ) < f_a(ρ)) ≤ 1/ρ², for ρ ≥ ρ′_a.
Proof: The following expressions will be used in the proof:
• N_1(ρ): the number of rounds by round ρ for which π̂_ρ = π^tr_1.
• N_0(ρ): the number of rounds by round ρ for which π̂_ρ = π^tr_0.
• N_{F,1}(ρ): the number of rounds by round ρ in which action F is taken when π̂_ρ = π^tr_1.
• N_{C,1}(ρ): the number of rounds by round ρ in which action C is taken when π̂_ρ = π^tr_1.
• n_a(ρ): the indicator of the event that action a is selected at least once in round ρ.
(i) When π̂_ρ = π_tr^1, action C is not taken only if the initial state is 1. Hence, p_{C,1} = 1 − Pr(s_0^ρ = 1) = 1 − q_0(1). Let H denote the event that state 1 is reached before state G when π̂_ρ = π_tr^1. We have
p_{F,1} = Σ_{s=1}^{G−1} Pr(H | s_0^ρ = s) q_0(s).
When r = 1, p_{F,1} is given by the ruin probabilities of a fair gambler's ruin problem over the states 1, . . . , G, where states 1 and G are the terminal states. For this problem, the probability of hitting G from state s is (s − 1)/(G − 1). Hence, the probability of hitting state 1 from state s is
Pr(H | s_0^ρ = s) = 1 − (s − 1)/(G − 1) = (G − s)/(G − 1).
When r ≠ 1, the problem is equivalent to an unfair gambler's ruin problem over the same states, in which the probability of hitting G from state s is (1 − r^{s−1})/(1 − r^{G−1}). Then, the probability of hitting state 1 from state s becomes
Pr(H | s_0^ρ = s) = 1 − (1 − r^{s−1})/(1 − r^{G−1}) = (r^{s−1} − r^{G−1})/(1 − r^{G−1}).

(ii) Since action C might be selected more than once in a round, we have N_a(ρ) ≥ n_a(1) + n_a(2) + · · · + n_a(ρ). This holds because in the initialization of GETBE, each action is selected once. Strictly speaking, we derive the lower bounds for N_a(ρ + 1), but these lower bounds also hold for N_a(ρ) because of the way GETBE is initialized. For a set of rounds ρ ∈ T ⊂ {1, . . . , T}, the n_a(ρ) are in general not identically distributed; but if π̂_ρ is the same for all rounds ρ ∈ T, then the n_a(ρ) are identically distributed.

First, assume that N_1(ρ) = k, 0 ≤ k ≤ ρ. Then, the probability that action C is selected at least once in each of these k rounds is p_{C,1}. Let j_i denote the index of the round in which the estimated optimal policy is π_tr^1 for the i-th time. The Bernoulli random variables n_C(j_i), i = 1, . . . , k, are independent and identically distributed. Hence, the Hoeffding bound given in Appendix A can be used to upper-bound the deviation probability of the sum of these random variables from its expectation.
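The hitting probabilities used for p_{F,1} can be checked numerically by solving the absorption equations h(s) = p_C h(s+1) + (1 − p_C) h(s−1) with h(1) = 0 and h(G) = 1. The sketch below is ours (not from the paper) and compares an iterative solution with the closed forms above:

```python
# Sketch (ours): verify the gambler's-ruin hitting probabilities over the
# states 1..G with 1 and G absorbing, as used in the proof of Lemma 4(i).

def hit_goal_iterative(p_c, G, s, sweeps=20000):
    """P(hit G before 1 | start at s), via Gauss-Seidel sweeps."""
    h = [0.0] * (G + 1)
    h[G] = 1.0                              # boundary conditions: h(1)=0, h(G)=1
    for _ in range(sweeps):
        for state in range(2, G):           # interior states 2..G-1
            h[state] = p_c * h[state + 1] + (1.0 - p_c) * h[state - 1]
    return h[s]

def hit_goal_closed_form(p_c, G, s, tol=1e-12):
    r = (1.0 - p_c) / p_c
    if abs(r - 1.0) < tol:
        return (s - 1) / (G - 1)                         # fair case
    return (1.0 - r ** (s - 1)) / (1.0 - r ** (G - 1))   # unfair case
```

The complement 1 − h(s) is then Pr(H | s_0^ρ = s), the probability that the round ends with the terminal action F under π_tr^1.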
Since the estimated optimal policy will be π_tr^{−1} in the remaining ρ − k rounds, the number of these rounds in which action F is selected is at most ⌈D(ρ)⌉. Therefore, the probability of taking action C is zero in at most ⌈D(ρ)⌉ rounds. Let ρ_D := ρ − ⌈D(ρ)⌉, and let N'_C(ρ) denote the sum of ρ random variables drawn from an i.i.d. Bernoulli process with parameter p_{C,1}. Then,
N_C(ρ) ≥ ρ − k − ⌈D(ρ)⌉ + Σ_{i=1}^{k} n_C(j_i) ≥ Σ_{i=1}^{ρ−⌈D(ρ)⌉} n_C(j_i) = N'_C(ρ − ⌈D(ρ)⌉) = N'_C(ρ_D). (14)
According to the Hoeffding bound in Appendix A, we have for z > 0
Pr(N'_C(ρ_D) − E(N'_C(ρ_D)) ≤ −z) ≤ e^{−2z²/ρ_D}.
When z = √(ρ_D log ρ), the above bound becomes
Pr(N'_C(ρ_D) ≤ E(N'_C(ρ_D)) − √(ρ_D log ρ)) ≤ 1/ρ²
⇒ Pr(N'_C(ρ_D) ≤ p_{C,1}(ρ − ⌈D(ρ)⌉) − √((ρ − ⌈D(ρ)⌉) log ρ)) ≤ 1/ρ².
Then, by using (14) we obtain
Pr(N_C(ρ) ≤ p_{C,1}(ρ − ⌈D(ρ)⌉) − √((ρ − ⌈D(ρ)⌉) log ρ)) ≤ 1/ρ².
Since ρ'_C is the first round in which 0.5 p_{C,1} ρ − p_{C,1} ⌈D(ρ)⌉ − √(ρ_D log ρ) becomes positive, on or after ρ'_C we have p_{C,1} ρ_D − √(ρ_D log ρ) > 0.5 p_{C,1} ρ. Therefore, we can replace p_{C,1} ρ_D − √(ρ_D log ρ) with 0.5 p_{C,1} ρ, which gives
Pr(N_C(ρ) ≤ 0.5 p_{C,1} ρ) ≤ 1/ρ²,  for ρ ≥ ρ'_C,
which is equivalent to
Pr(N_C(ρ) ≤ f_C(ρ)) ≤ 1/ρ²,  for ρ ≥ ρ'_C. (15)
Again, assume that N_1(ρ) = k. Then, the probability of selecting action F is p_{F,1} in each of these k rounds. Let R denote the set of the remaining ρ − k rounds. In a round ρ_r ∈ R, action F is selected only if N_F(ρ_r) ≤ D(ρ_r).
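The one-sided Hoeffding bound used in this step can be sanity-checked by simulation. The sketch below is ours (the parameters are illustrative, not from the paper); it estimates the lower-tail probability of a Bernoulli sum and compares it against exp(−2z²/n):

```python
import math, random

# Sketch (ours): empirical lower-tail probability of a sum of n Bernoulli(p)
# variables versus the Hoeffding bound exp(-2 z^2 / n) from Appendix A.

def lower_tail_vs_hoeffding(n, p, z, trials=20000, seed=7):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(1 for _ in range(n) if rng.random() < p)
        if s <= n * p - z:                 # deviation of z below the mean
            hits += 1
    return hits / trials, math.exp(-2.0 * z * z / n)
```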
Among the rounds in R, the number of rounds in which action F is selected is bounded below by min{ρ − k, ⌈D(ρ − k)⌉}. Moreover, n_F(j_i), i = 1, 2, . . . is a sequence of i.i.d. Bernoulli random variables with parameter p_{F,1}. From the argument above, we obtain
N_F(ρ) ≥ min{ρ − k, ⌈D(ρ − k)⌉} + Σ_{i=1}^{k} n_F(j_i) ≥ Σ_{i=1}^{k + min{ρ−k, ⌈D(ρ−k)⌉}} n_F(j_i).
When min{ρ − k, ⌈D(ρ − k)⌉} = ρ − k, we have
N_F(ρ) ≥ Σ_{i=1}^{ρ} n_F(j_i) ≥ Σ_{i=1}^{⌈D(ρ)⌉} n_F(j_i),  for ρ ≥ ρ'_F.
When min{ρ − k, ⌈D(ρ − k)⌉} = ⌈D(ρ − k)⌉, we have
N_F(ρ) ≥ Σ_{i=1}^{k + ⌈D(ρ−k)⌉} n_F(j_i).
Next, we will show that D(ρ − k) + k ≥ D(ρ) when ρ is sufficiently large. First, min{ρ − k, ⌈D(ρ − k)⌉} = ⌈D(ρ − k)⌉ implies that
ρ ≥ ⌈D(ρ − k)⌉ + k ≥ D(ρ − k) + k. (16)
Also, D(ρ − k) + k ≥ D(ρ) is equivalent to
D(ρ) − D(ρ − k) ≤ k ⇔ γ log(ρ/(ρ − k)) ≤ k ⇔ ρ/(ρ − k) ≤ e^{k/γ} ⇔ ρ ≥ k e^{k/γ}/(e^{k/γ} − 1). (17)
Using the results in (16) and (17), we conclude that D(ρ − k) + k ≥ D(ρ) holds when
D(ρ − k) + k ≥ k e^{k/γ}/(e^{k/γ} − 1). (18)
By setting D(ρ) = γ log ρ and manipulating (18) we get
γ log(ρ − k) + k ≥ k e^{k/γ}/(e^{k/γ} − 1)
⇔ γ log(ρ − k) ≥ k (e^{k/γ}/(e^{k/γ} − 1) − 1)
⇔ log(ρ − k) ≥ (k/γ)/(e^{k/γ} − 1)
⇔ ρ − k ≥ e^{(k/γ)/(e^{k/γ} − 1)}
⇔ ρ ≥ k + e^{(k/γ)/(e^{k/γ} − 1)}. (19)
First, we examine the term h(k) := e^{(k/γ)/(e^{k/γ} − 1)}. We will show that h(k) ∈ [1, e] for all k ∈ ℝ₊. By applying L'Hôpital's rule we get
lim_{k→0} (k/γ)/(e^{k/γ} − 1) = lim_{k→0} (1/γ)/((1/γ) e^{k/γ}) = 1 and lim_{k→∞} (k/γ)/(e^{k/γ} − 1) = 0.
From these two limits and the continuity of the exponential function, we conclude that
lim_{k→0} h(k) = e^{lim_{k→0} (k/γ)/(e^{k/γ}−1)} = e and lim_{k→∞} h(k) = e^{lim_{k→∞} (k/γ)/(e^{k/γ}−1)} = 1.
Next, we will show that h(k) = e^{(k/γ)/(e^{k/γ} − 1)} is decreasing in k. Since the exponential function is increasing, it is sufficient to show that (k/γ)/(e^{k/γ} − 1) is decreasing in k. We have
d/dk [(k/γ)/(e^{k/γ} − 1)] = [(1/γ)(e^{k/γ} − 1) − (k/γ)(e^{k/γ}/γ)] / (e^{k/γ} − 1)².
The denominator is always positive for k > 0. Therefore, we only consider the numerator and write it as
(1/γ)(e^{k/γ} − 1) − (k/γ)(e^{k/γ}/γ) = ((γ − k) e^{k/γ} − γ)/γ².
As γ² is positive, we only need to show that (γ − k) e^{k/γ} − γ is always negative. The derivative of this expression is −(k/γ) e^{k/γ}, which is negative for k > 0. We also have (γ − k) e^{k/γ} − γ = 0 at k = 0. These two facts imply that (γ − k) e^{k/γ} − γ is negative for k > 0, by which we conclude that h(k) is decreasing in k. Hence, we have
1 ≤ h(k) ≤ e.
This implies that k + h(k) ≤ k + e, so (19) holds when ρ ≥ k + e. Thus, when k ≤ ρ − e, we have
Σ_{i=1}^{k + ⌈D(ρ−k)⌉} n_F(j_i) ≥ Σ_{i=1}^{⌈D(ρ)⌉} n_F(j_i).
The only cases that are left out are k = ρ, k = ρ − 1 and k = ρ − 2. But we know from the definition of ρ'_F that for ρ ≥ ρ'_F, ρ − 2 − ⌈D(ρ)⌉ is positive. Hence, for these cases we also have
Σ_{i=1}^{k + ⌈D(ρ−k)⌉} n_F(j_i) ≥ Σ_{i=1}^{ρ−2} n_F(j_i) ≥ Σ_{i=1}^{⌈D(ρ)⌉} n_F(j_i).
Let N'_F(ρ) denote the sum of the n_F(j_i) over ρ rounds.
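The two properties just established (h decreasing, 1 ≤ h(k) ≤ e) are easy to confirm numerically. The following sketch (ours, with an arbitrary choice of γ) samples h on a grid:

```python
import math

# Sketch (ours): h(k) = exp((k/gamma) / (exp(k/gamma) - 1)) should be
# decreasing in k and contained in [1, e], as shown above.

def h(k, gamma):
    u = k / gamma
    return math.exp(u / (math.exp(u) - 1.0))

gamma = 5.0                                   # arbitrary positive gamma
ks = [0.01 * i for i in range(1, 2001)]       # grid over (0, 20]
vals = [h(k, gamma) for k in ks]
assert all(a >= b for a, b in zip(vals, vals[1:]))   # non-increasing
assert all(1.0 <= v <= math.e for v in vals)         # stays in [1, e]
```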
From all of the cases derived above, we obtain
N_F(ρ) ≥ Σ_{i=1}^{⌈D(ρ)⌉} n_F(j_i) = N'_F(⌈D(ρ)⌉),  for ρ ≥ ρ'_F. (20)
Now, by using the Hoeffding bound we have
Pr(N'_F(⌈D(ρ)⌉) − E(N'_F(⌈D(ρ)⌉)) ≤ −z) ≤ e^{−2z²/⌈D(ρ)⌉},
and if z = √(⌈D(ρ)⌉ log ρ), then
Pr(N'_F(⌈D(ρ)⌉) < E(N'_F(⌈D(ρ)⌉)) − √(⌈D(ρ)⌉ log ρ)) ≤ 1/ρ²
Pr(N'_F(⌈D(ρ)⌉) < p_{F,1} ⌈D(ρ)⌉ − √(⌈D(ρ)⌉ log ρ)) ≤ 1/ρ².
By using (20), we get
Pr(N_F(ρ) < p_{F,1} ⌈D(ρ)⌉ − √(⌈D(ρ)⌉ log ρ)) ≤ 1/ρ². (21)
Then, by using D(ρ) = γ log ρ, γ > 1/p_{F,1}², we have
p_{F,1} ⌈D(ρ)⌉ − √(⌈D(ρ)⌉ log ρ) = p_{F,1} ⌈γ log ρ⌉ − √(⌈γ log ρ⌉ log ρ)
≥ p_{F,1} γ log ρ − √((γ log ρ + 1) log ρ)
= p_{F,1} γ log ρ − √(γ log² ρ + log ρ)
≥ p_{F,1} γ log ρ − √γ log ρ − √(log ρ) (22)
= (p_{F,1} γ − √γ) log ρ − √(log ρ), (23)
where (22) follows from the subadditivity of the square root: for a, b > 0 we have √a + √b > √(a + b), since (√a + √b)² > a + b.
Next, we will show that (23) becomes positive when ρ is large enough. To do this, we first observe that the first term in (23) is always positive, since
γ > 1/p_{F,1}² ⇒ p_{F,1} √γ − 1 > 0 ⇒ p_{F,1} γ − √γ > 0. (24)
Since log ρ increases at a higher rate than √(log ρ), the quantity 0.5 (p_{F,1} γ − √γ) log ρ − √(log ρ) increases after some round, and since lim_{ρ→∞} 0.5 (p_{F,1} γ − √γ) log ρ − √(log ρ) = ∞, it is positive after some round.
From the statement of the lemma, ρ'_F is greater than or equal to this round. Therefore, for ρ ≥ ρ'_F,
(p_{F,1} γ − √γ) log ρ − √(log ρ) > 0.5 (p_{F,1} γ − √γ) log ρ.
Using this and (23), we obtain
p_{F,1} ⌈D(ρ)⌉ − √(⌈D(ρ)⌉ log ρ) ≥ (p_{F,1} γ − √γ) log ρ − √(log ρ) ≥ 0.5 (p_{F,1} γ − √γ) log ρ.
Then, we use this result and (21) to get
Pr(N_F(ρ) ≤ 0.5 (p_{F,1} γ − √γ) log ρ) ≤ 1/ρ²,  for ρ ≥ ρ'_F,
which is equivalent to
Pr(N_F(ρ) ≤ f_F(ρ)) ≤ 1/ρ²,  for ρ ≥ ρ'_F. (25)

The expected regret given in (13) can be decomposed into two parts: (i) the regret in rounds in which the suboptimal policy is selected, and (ii) the regret in rounds in which the optimal policy is selected and GETBE explores. Let IR(T) denote the number of rounds by round T in which the suboptimal policy is selected. The first part of the regret is upper bounded by E(IR(T)), since the reward in a round is either 0 or 1. Similarly, the second part of the regret is upper bounded by the number of explorations when the optimal policy is π_tr^{−1}. When the optimal policy is π_tr^1, exploration is performed only when the suboptimal policy is selected; hence, there is no additional regret due to exploration, since all of it is accounted for in the first part of the regret.

Let A_ρ denote the event that the suboptimal policy is selected in round ρ, and let
C_ρ := {|p_C − p̂_{C,ρ}| ≥ δ/√2} ∪ {|p_F − p̂_{F,ρ}| ≥ δ/√2}.
It can be shown that on the event C_ρ^c, the Euclidean distance between (p_C, p_F) and (p̂_{C,ρ}, p̂_{F,ρ}) is less than δ. This implies that on C_ρ^c the optimal policy is selected; therefore, C_ρ contains the event that the optimal policy is not selected. Using the linearity of expectation and the union bound, we obtain
E[IR(T)] = E[Σ_{ρ=1}^{T} I(A_ρ)] ≤ Σ_{ρ=1}^{T} Σ_{a∈{F,C}} Pr(|p_a − p̂_{a,ρ}| ≥ δ/√2).
(26)
Let I_ρ^exp be the indicator of the event that GETBE explores in round ρ. By the discussion above, we have
E[Reg(T) | π* = π_tr^1] ≤ E[IR(T)] (27)
E[Reg(T) | π* = π_tr^{−1}] ≤ ∆_max E[IR(T)] + E[Σ_{ρ=1}^{T} I_ρ^exp]. (28)
Next, we bound the expected regret of GETBE for the GRBP using (27) and (28).

Theorem 3. Let x₁ := (1 + √(1 + 24 p_{F,1}/δ²))/(2 p_{F,1}). Assume that the control function is D(ρ) = γ log ρ, where γ > max{(x₁)², 1/p_{F,1}²}. Let ρ'' be the first round in which δ² ≥ 3 log ρ / f_a(ρ) holds for both actions, let ρ' := max{ρ'_C, ρ'_F, ρ''}, and let
w := 4ρ' + 2π²/(3(1 − e^{−δ²})).
Then, the regret of GETBE is bounded by
E[Reg(T) | π* = π_tr^1] ≤ w and E[Reg(T) | π* = π_tr^{−1}] ≤ ⌈D(T)⌉ + w ∆_max.

Proof: First, we bound E[IR(T)]. For this, we exchange the order of the summations in (26), which gives
E[IR(T)] ≤ Σ_{a∈{F,C}} Σ_{ρ=1}^{T} Pr(|p_a − p̂_{a,ρ}| ≥ δ/√2). (29)
Let N*_F(ρ) := N_F^G(ρ) and N*_C(ρ) := N_C^u(ρ). By using the law of total probability and the Hoeffding inequality, we obtain for a ∈ {F, C}
Σ_{ρ=1}^{T} Pr(|p_a − p̂_{a,ρ}| ≥ δ/√2)
= Σ_{ρ=1}^{T} Σ_{n=1}^{∞} Pr(|p_a − N*_a(ρ)/N_a(ρ)| ≥ δ/√2 | N_a(ρ) = n) Pr(N_a(ρ) = n)
= Σ_{ρ=1}^{T} Σ_{n=1}^{∞} Pr(|n p_a − N*_a(ρ)| ≥ n δ/√2 | N_a(ρ) = n) Pr(N_a(ρ) = n)
≤ Σ_{ρ=1}^{T} Σ_{n=1}^{∞} e^{−2(nδ/√2)²/n} Pr(N_a(ρ) = n)
= Σ_{ρ=1}^{T} Σ_{n=1}^{∞} e^{−nδ²} Pr(N_a(ρ) = n). (30)
For each action, we use the result of Lemma 4 to divide the summation in (30) into two parts. Note that the bounds on N_a(ρ) given in Lemma 4 hold when ρ ≥ ρ' ≥ max{ρ'_C, ρ'_F}.
Therefore, we have
Σ_{ρ=1}^{T} Σ_{n=1}^{∞} e^{−nδ²} Pr(N_a(ρ) = n)
= Σ_{ρ=1}^{ρ'−1} Σ_{n=1}^{∞} e^{−nδ²} Pr(N_a(ρ) = n) + Σ_{ρ=ρ'}^{T} Σ_{n=1}^{∞} e^{−nδ²} Pr(N_a(ρ) = n)
= K' + Σ_{ρ=ρ'}^{T} Σ_{n=1}^{∞} e^{−nδ²} Pr(N_a(ρ) = n), (31)
where K' = Σ_{ρ=1}^{ρ'−1} Σ_{n=1}^{∞} e^{−nδ²} Pr(N_a(ρ) = n), which is finite since ρ' is finite. Since
Σ_{n=1}^{∞} Pr(N_a(ρ) = n) = 1 and e^{−nδ²} ≤ 1,
we have Σ_{n=1}^{∞} e^{−nδ²} Pr(N_a(ρ) = n) ≤ 1. Therefore, an upper bound on K' is given by
K' = Σ_{ρ=1}^{ρ'−1} Σ_{n=1}^{∞} e^{−nδ²} Pr(N_a(ρ) = n) ≤ Σ_{ρ=1}^{ρ'−1} 1 < ρ'. (32)
We have
Σ_{ρ=ρ'}^{T} Σ_{n=1}^{∞} e^{−nδ²} Pr(N_a(ρ) = n)
= Σ_{ρ=ρ'}^{T} Σ_{n=1}^{f_a(ρ)} e^{−nδ²} Pr(N_a(ρ) = n) + Σ_{ρ=ρ'}^{T} Σ_{n=f_a(ρ)+1}^{∞} e^{−nδ²} Pr(N_a(ρ) = n). (33)
For the first summation in (33), we use (15) and (25) as an upper bound on Pr(N_a(ρ) = n) for each action, since this is the case n ≤ f_a(ρ). Therefore,
Σ_{ρ=ρ'}^{T} Σ_{n=1}^{f_a(ρ)} e^{−nδ²} Pr(N_a(ρ) = n) ≤ Σ_{ρ=1}^{T} Σ_{n=1}^{f_a(ρ)} e^{−nδ²} (1/ρ²) ≤ Σ_{ρ=1}^{∞} Σ_{n=1}^{∞} e^{−nδ²} (1/ρ²) ≤ π²/(6(1 − e^{−δ²})). (34)
For the second summation in (33), we first show that δ² ≥ 3 log ρ / f_a(ρ) for each action when ρ ≥ ρ'.
For a = F, we have δ² ≥ 3 log ρ / f_F(ρ) = 6/(p_{F,1} γ − √γ), since γ ≥ (x₁)². The proof is as follows. Note that the term p_{F,1} γ − √γ is positive because of (24). In order to have δ² ≥ 6/(p_{F,1} γ − √γ), we must have p_{F,1} γ − √γ − 6/δ² ≥ 0. This can be rewritten as a second-order polynomial inequality g(x) = ax² + bx + c ≥ 0, where a = p_{F,1}, b = −1, c = −6/δ² and x = √γ. Since γ is positive, we look for positive values of x for which g(x) is non-negative. Also, g(x) is a convex function, since its second derivative is 2a, which is positive.
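The geometric-times-Basel bound in (34) can be verified numerically. The sketch below is ours (δ is an arbitrary illustrative value); it truncates both sums at large cutoffs and checks them against the closed-form bound:

```python
import math

# Sketch (ours): the truncated double sum of e^{-n delta^2} / rho^2 stays
# below pi^2 / (6 (1 - e^{-delta^2})), matching the bound in (34).

def truncated_double_sum(delta, n_max=5000, rho_max=100000):
    inner = sum(math.exp(-n * delta * delta) for n in range(1, n_max + 1))
    outer = sum(1.0 / (rho * rho) for rho in range(1, rho_max + 1))
    return inner * outer

delta = 0.3                                    # arbitrary hardness parameter
bound = math.pi ** 2 / (6.0 * (1.0 - math.exp(-delta ** 2)))
assert truncated_double_sum(delta) <= bound
```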
Hence, g(x) is non-negative for positive x greater than the largest root. The roots of g(x) are given by
x₁ = (1 + √(1 + 24 p_{F,1}/δ²))/(2 p_{F,1}), x₂ = (1 − √(1 + 24 p_{F,1}/δ²))/(2 p_{F,1}).
It is clear that only x₁ is positive. Thus, g(x) is non-negative for x = √γ ≥ x₁. Therefore, γ has to be at least (x₁)² so that δ² ≥ 6/(p_{F,1} γ − √γ).
For a = C, we have 3 log ρ / f_C(ρ) = 3 log ρ / (0.5 p_{C,1} ρ) = 6 log ρ / (p_{C,1} ρ). This quantity decreases as ρ increases and converges to zero in the limit as ρ goes to infinity. Hence, it becomes smaller than δ² after some round. Therefore, for ρ ≥ ρ', we have δ² ≥ 3 log ρ / f_a(ρ) for both actions. Thus,
Σ_{ρ=ρ'}^{T} Σ_{n=f_a(ρ)+1}^{∞} e^{−nδ²} Pr(N_a(ρ) = n) ≤ Σ_{ρ=ρ'}^{T} e^{−f_a(ρ)δ²} Σ_{n=f_a(ρ)+1}^{∞} Pr(N_a(ρ) = n)
≤ Σ_{ρ=ρ'}^{T} e^{−f_a(ρ)δ²} ≤ Σ_{ρ=ρ'}^{T} e^{−3 log ρ} = Σ_{ρ=ρ'}^{T} 1/ρ³ ≤ Σ_{ρ=1}^{∞} 1/ρ² = π²/6. (35)
Finally, we combine the results of (35), (34) and (32) together with the result of (31), and sum the final result over the two actions to get a bound for the expression in (29). This results in
E[IR(T)] ≤ 2(ρ' + π²/(6(1 − e^{−δ²})) + π²/6) ≤ 4ρ' + 2π²/(3(1 − e^{−δ²})) = w. (36)
Assume that the optimal policy is π_tr^1. Then, the expected number of rounds in which the suboptimal policy is selected is finite and bounded by w (independent of T) by (36). In this case, exploration is done only when the suboptimal policy is selected, so there is no extra regret term due to exploration. Therefore,
E[Reg(T) | π* = π_tr^1] ≤ 4ρ' + 2π²/(3(1 − e^{−δ²})) = w.
Assume that the optimal policy is π_tr^{−1}. Similar to the previous case, the expected number of rounds in which the suboptimal policy is selected is at most w. Since the suboptimal policy in this case is π_tr^1, it is always played when it is selected (no exploration); hence, the regret in these rounds is at most ∆_max per round, i.e., at most w ∆_max in total. However, the learner will explore action F when the optimal policy is selected.
This results in additional regret. Since the number of explorations performed by GETBE by round T is bounded by ⌈D(T)⌉, the regret that results from explorations is also bounded by ⌈D(T)⌉. Therefore,
E[Reg(T) | π* = π_tr^{−1}] ≤ ⌈D(T)⌉ + w ∆_max.

Theorem 3 bounds the expected regret of GETBE. When π* = π_tr^1, Reg(T) = O(1), since both actions are selected with positive probability by the optimal policy in each round. When π* = π_tr^{−1}, Reg(T) = O(D(T)), since GETBE forces exploration of action F logarithmically many times in order to avoid getting stuck in the suboptimal policy.

VI. NUMERICAL RESULTS

We create a synthetic medical treatment selection problem based on [21]. Each state is assumed to be a stage of gastric cancer (G = 4, D = 0). The goal state is defined as at least three years of survival. Action C is assumed to be chemotherapy and action F is assumed to be surgery. For action C, p_C is determined by using the average survival rates for young and old groups at different stages of cancer given in [21]: for each stage, the survival rate at three years is taken to be the probability of hitting G by taking action C continuously, and p_C is set accordingly. The five-year survival rate of surgery given in [22] is used to set p_F.

The regrets shown in Figs. 3 and 4 correspond to different variants of GETBE, named GETBE-SM, GETBE-PS and GETBE-UCB. Each variant updates the state transition probabilities in a different way. GETBE-SM uses the control function together with sample mean estimates of the state transition probabilities. Unlike GETBE-SM, GETBE-UCB and GETBE-PS do not use the control function. GETBE-PS uses posterior sampling from the Beta distribution [17] to sample and update p_F and p_C. GETBE-UCB adds an inflation term equal to √(2 log(N_F(ρ) + N_C(ρ))/N_a(ρ)) to the sample mean estimates of the state transition probabilities that correspond to action a.
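The two candidate policies and their values can also be checked by direct simulation of GRBP rounds. The sketch below is ours (not the paper's experiment code) and uses hypothetical parameters; it plays rounds under a threshold policy π_tr^τ that takes F whenever the state is at most τ:

```python
import random

# Sketch (ours): simulate GRBP rounds under a threshold policy. Action C
# moves the state +1 w.p. p_C and -1 otherwise; action F jumps to G w.p.
# p_F and to the dead-end state 0 otherwise. Reward is 1 iff G is reached.

def play_round(p_c, p_f, tau, G, s0, rng):
    s = s0
    while s not in (0, G):
        if s <= tau:                          # terminal action F
            s = G if rng.random() < p_f else 0
        else:                                 # continuation action C
            s = s + 1 if rng.random() < p_c else s - 1
    return 1 if s == G else 0

def empirical_value(p_c, p_f, tau, G, s0, rounds=200000, seed=3):
    rng = random.Random(seed)
    return sum(play_round(p_c, p_f, tau, G, s0, rng) for _ in range(rounds)) / rounds
```

With G = 4, s0 = 2 and p_C = 0.5, the never-F policy (τ = −1) wins with probability s0/G = 0.5, while τ = 1 yields p_F + (1 − p_F)(s0 − 1)/(G − 1); both match the value functions used in Lemma 3.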
PS-PolSelection and UCB-PolSelection treat each policy as a super-arm, and use PS and UCB methods, respectively, to select the best policy among the two threshold policies. Instead of updating the state transition probabilities, they directly update the rewards of the policies.

The initial state distribution is taken to be the uniform distribution. Initial estimates of the transition probabilities are formed by setting N_F(1) = 1, N_F^G(1) ∼ Unif[0, 1], N_C(1) = 1, N_C^u(1) ∼ Unif[0, 1]. The time horizon is fixed in advance, and the control function is set to D(ρ) = 15 log ρ. Reported results are averaged over multiple iterations.

Figure 3. Regrets of GETBE and the other algorithms as a function of the number of rounds, when the transition probabilities lie in the no-exploration region.

In Fig. 3 the regrets of GETBE and the other algorithms are shown for the p_F and p_C values given above. For this case, the optimal policy is π_tr^1 and all variants of GETBE achieve finite regret, as expected. However, the regrets of UCB-PolSelection and PS-PolSelection increase logarithmically, since they sample each policy logarithmically many times.

Figure 4. Regrets of GETBE and the other algorithms as a function of the number of rounds, when the transition probabilities lie in the exploration region.

Figure 5. Regret of GETBE for different values of p_C, p_F. The black line shows the boundary.

Next, we change p_C and p_F in order to show how the algorithms perform when the optimal policy is π_tr^{−1}. The result for this case is given in Fig. 4. As expected, the regret grows logarithmically over the rounds for all variants of GETBE, PS-PolSelection and UCB-PolSelection. GETBE-PS achieves the lowest regret for this case. Fig. 5 illustrates the regret of GETBE-SM as a function of p_F and p_C for T = 1000.
As the state transition probabilities shift from the no-exploration region to the exploration region, the regret increases, as expected.

VII. CONCLUSION

In this paper, we introduced the Gambler's Ruin Bandit Problem. We characterized the form of the optimal policy for this problem, and then developed a learning algorithm called GETBE that operates on the GRBP to learn the optimal policy when the transition probabilities are unknown. We proved that the regret of this algorithm is either bounded (finite) or logarithmic in the number of rounds, depending on the region in which the true transition probabilities lie. In addition to the regret bounds, we illustrated the performance of our algorithm via numerical experiments.

REFERENCES
[1] S. S. Villar, J. Bowden, and J. Wason, "Multi-armed bandit models for the optimal design of clinical trials: Benefits and challenges," Statistical Science, vol. 30, no. 2, pp. 199–215, 2015.
[2] C. Tekin and M. van der Schaar, "RELEAF: An algorithm for learning and exploiting relevance," IEEE J. Sel. Topics Signal Process., vol. 9, no. 4, pp. 716–727, 2015.
[3] T. L. Lai and H. Robbins, "Asymptotically efficient adaptive allocation rules," Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.
[4] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, pp. 235–256, 2002.
[5] A. Garivier and O. Cappé, "The KL-UCB algorithm for bounded stochastic bandits and beyond," in COLT, 2011, pp. 359–376.
[6] P. Auer and R. Ortner, "UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem," Periodica Mathematica Hungarica, vol. 61, no. 1-2, pp. 55–65, 2010.
[7] A. Kolobov, Mausam, and D. Weld, "A theory of goal-oriented MDPs with dead ends," in UAI, 2012, pp. 438–447.
[8] S. Bubeck and N. Cesa-Bianchi, "Regret analysis of stochastic and nonstochastic multi-armed bandit problems," Foundations and Trends in Machine Learning, vol. 5, no. 1, pp.
1–122, 2012.
[9] L. Takács, "On the classical ruin problems," J. Amer. Statistical Association, vol. 64, pp. 889–906, 1969.
[10] B. Hunter, A. C. Krinik, C. Nguyen, J. M. Switkes, and H. F. von Bremen, "Gambler's ruin with catastrophe and windfalls," Journal of Statistical Theory and Practice, vol. 2, no. 2, pp. 199–219, 2008.
[11] T. van Uem, "Maximum and minimum of modified gambler's ruin problem," arXiv:1301.2702, 2013.
[12] D. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific.
[13] F. Teichteil-Königsbuch, "Stochastic safest and shortest path problems," in AAAI, 2012.
[14] A. Tewari and P. Bartlett, "Optimistic linear programming gives logarithmic regret for irreducible MDPs," Advances in Neural Information Processing Systems, vol. 20, pp. 1505–1512, 2008.
[15] P. Auer, T. Jaksch, and R. Ortner, "Near-optimal regret bounds for reinforcement learning," in Advances in Neural Information Processing Systems, 2009, pp. 89–96.
[16] J. C. Gittins and D. M. Jones, "A dynamic allocation index for the sequential design of experiments," in Progress in Statistics, J. Gani (ed.), pp. 241–266, 1974.
[17] S. Agrawal and N. Goyal, "Analysis of Thompson sampling for the multi-armed bandit problem," Journal of Machine Learning Research, vol. 23, no. 39, pp. 285–294, 2012.
[18] N. Cesa-Bianchi and G. Lugosi, "Combinatorial bandits," Journal of Computer and System Sciences, vol. 78, no. 5, pp. 1404–1422, 2012.
[19] R. Bellman and R. E. Kalaba, Dynamic Programming and Modern Control Theory, vol. 81, 1965.
[20] M. A. El-Shehawey, "On the gambler's ruin problem for a finite Markov chain," Statistics and Probability Letters, vol. 79, pp. 1590–1595, 2009.
[21] T. Isobe, K. Hashimoto, J. Kizaki, M. Miyagi, K. Aoyagi, K. Koufuji, and K. Shirouzu, "Characteristics and prognosis of gastric cancer in young patients," International Journal of Oncology.

APPENDIX
Let X₁, X₂, . . . , X_n be random variables with range [0, 1] and E[X_t | X₁, . . . , X_{t−1}] = µ.
Let S_n = Σ_{i=1}^{n} X_i. Then, for any nonnegative z,
Pr(S_n ≤ E(S_n) − z) ≤ exp(−2z²/n)
Pr(|S_n − E(S_n)| ≥ z) ≤ 2 exp(−2z²/n).