Differentiable Linear Bandit Algorithm
Kaige Yang
University College London [email protected]
Laura Toni
University College London [email protected]
Abstract
Upper Confidence Bound (UCB) is among the most commonly used methods for linear multi-arm bandit problems. While conceptually and computationally simple, this method relies heavily on its confidence bounds, failing to strike the optimal exploration-exploitation trade-off if these bounds are not properly set. In the literature, confidence bounds are typically derived from concentration inequalities based on assumptions on the reward distribution, e.g., sub-Gaussianity. The validity of these assumptions, however, is unknown in practice. In this work, we aim at learning the confidence bound in a data-driven fashion, making it adaptive to the actual problem structure. Noting that existing UCB-typed algorithms are not differentiable with respect to the confidence bound, we first propose a novel differentiable linear bandit algorithm. Then, we introduce a gradient estimator, which allows the confidence bound to be learned via iterative gradient ascent. Theoretically, we show that the proposed algorithm achieves a $\tilde{O}(\hat{\beta}\sqrt{dT})$ upper bound on the $T$-round regret, where $d$ is the dimension of the arm features and $\hat{\beta}$ is the learned size of the confidence bound. Empirical results show that $\hat{\beta}$ is significantly smaller than its theoretical upper bound and that the proposed algorithms outperform baseline ones on both synthetic and real-world datasets.

Introduction

Multi-Arm Bandit (MAB) [5] is an online decision-making problem, in which an agent selects arms sequentially and observes stochastic rewards as feedback. The goal of the agent is to maximize the expected cumulative reward over a number of trials. The expected reward of each arm is unknown a priori and is learned from experience by the agent. As a consequence, the agent needs to balance the selection of arms to improve its knowledge (exploration) and the selection of the highest rewarding arm given the knowledge acquired thus far (exploitation). This is formalized as the so-called exploration-exploitation trade-off, and bandit algorithms are designed to strike this trade-off. One class of MAB problems is the linear MAB [8], in which each arm is described by a feature vector and the expected reward follows a linear model over this feature vector and an unknown parameter vector. Each arm's feature vector is known a priori by the agent and is considered as a hint on the arm reward. The learning problem boils down to the agent inferring the unknown parameter vector, based on the history (selected arms and received rewards), and selecting arms accordingly.

One popular algorithm to solve linear MAB is the Upper Confidence Bound (UCB) [3] [8] [1]. Its popularity is motivated by its conceptual simplicity and strong theoretical guarantees. UCB-typed algorithms rely on the construction of an upper confidence bound, which is the estimated reward inflated according to the level of uncertainty of the estimate. At each decision opportunity, the agent selects the arm with the highest upper confidence bound. This reflects the Optimism in Face of Uncertainty principle: either an arm with high estimated reward (exploitation) or high uncertainty (exploration) is selected. However, to properly balance exploration and exploitation, it is fundamental to establish a tight confidence bound [16].
Preprint. Under review.

In most existing works, confidence bounds are derived from concentration inequalities [1] [4] [18] given a priori assumptions on the reward distribution (e.g., sub-Gaussianity). These bounds achieve strong minimax theoretical guarantees, outperforming competitor algorithms such as LinTS [2]. While these bounds are essential for the theoretical analysis, they do not necessarily translate into practice. In fact, these constructed confidence bounds are typically conservative in practice, as noted in [19] [14]. This is because concentration inequalities are usually built from assumed reward distributions instead of the actual data (or problem structure). This results in non-adaptive and potentially wide confidence bounds, which in turn lead to suboptimal performance in practice. Alternatively, in this work we aim to learn the confidence bound in a data-driven fashion, making it adaptive to the actual problem structure. Inspired by [6], we aim at having a parametrized cumulative reward function that is differentiable with respect to the confidence bound, which can then be optimized. The key challenge is that existing UCB-typed algorithms are non-differentiable with respect to the confidence bound, mainly due to the maximization of the UCB index (i.e., due to the presence of the arg max operator in OFUL [1] and LinUCB [8]). To address this, we propose a novel differentiable UCB-typed linear bandit algorithm and introduce a gradient estimator which enables the confidence bound to be learned via gradient ascent.

Our proposed algorithm contains two core components. First, we consider a more informative UCB-based index than the classical UCB index used in OFUL [1] and LinUCB [8], which not only summarizes the history of each arm but also differentiates arms into suboptimal and non-suboptimal arms. Second, we consider a softmax function, which transforms each index into a probability distribution, where the probability for each suboptimal arm to be selected is arbitrarily small. Conversely, the probability for a non-suboptimal arm to be selected is larger for arms with larger index. The key idea is that exploration is conducted by selecting arms with large index more often than others, while exploitation is achieved by soft-eliminating suboptimal arms (arbitrarily small probability of being selected). The softmax function ensures the differentiability of the reward function, paving the way to learning the confidence bound via gradient ascent. Based on this, we provide two linear bandit algorithms for learning the confidence bound in offline and online settings, respectively. Theoretically, we provide a regret upper bound for the offline learning setting.

In summary, we make the following contributions:
• We propose a novel UCB-typed linear bandit algorithm in which the expected cumulative reward is a differentiable function of the confidence bound.
• We introduce a gradient estimator and show how the confidence bound can be learned via gradient ascent in both the offline and online settings.
• Theoretically, we prove a $\tilde{O}(\hat{\beta}\sqrt{dT})$ upper bound on the $T$-round regret, where $\hat{\beta}$ is the learned size of the confidence bound in the offline setting.
• Empirically, we show that $\hat{\beta}$ is significantly smaller than its theoretical upper bound, leading to substantially lower cumulative regret with respect to state-of-the-art baselines on synthetic and real-world datasets.

Notation: $[K]$ denotes the set $\{1, 2, ..., K\}$. Arms are indexed by $i, j \in \mathcal{A}$. We use boldface lower-case letters, e.g., $x$, to denote vectors and boldface upper-case letters, e.g., $M$, to denote matrices. For a positive definite matrix $M \in \mathbb{R}^{d \times d}$ and a vector $x \in \mathbb{R}^d$, we denote the weighted 2-norm by $\|x\|_M = \sqrt{x^T M x}$. Each arm $k$ is represented by the feature vector $x_k \in \mathbb{R}^d$. We denote by $\mathbb{P}$ and $\mathbb{E}$ the probability distribution and the expectation operator, respectively.

Related Work

Our work is inspired by [6], which was the first attempt at addressing policy-gradient optimization of bandit policies via a differentiable bandit algorithm. However, there are fundamental differences between [6] and our work. First, the authors proposed a differentiable bandit framework for the Bayesian MAB problem, which is not directly applicable to linear MAB problems. Conversely, we propose a differentiable UCB-typed linear bandit algorithm. Second, the main goal of [6] is to learn the learning rate (coldness parameter) of the softmax function, while our algorithm aims at learning the size of the confidence bound. Third, we propose algorithms for both offline and online settings, while [6] covered the offline setting only. Moreover, in [6] a regret analysis was provided for MAB with two arms. In contrast, we provide a regret analysis for linear MAB with an arbitrary finite number of arms in the offline setting.

Another work focused on data-dependent UCB is [12]. The authors proposed an algorithm called BootstrappedUCB. In [12], the stochastic reward is assumed to be a sub-Weibull random variable. A multiplier bootstrap was employed to approximate the reward distribution, and the bootstrapped quantile acted as the UCB to facilitate exploration. Their algorithm was deployed on both MAB and linear MAB problems, while the regret analysis covered MAB only. Similarly to this work, other bootstrap techniques were employed in [7] [9] [22].
Although aiming at the same goal (a data-dependent UCB), these works are fundamentally different from our approach. Our algorithm is a differentiable bandit algorithm in which we rely on a gradient estimator to learn the UCB. Their algorithms are non-differentiable, relying on the bootstrapped quantile of the assumed reward distribution to construct the UCB.

Bootstrap techniques were also used for Thompson Sampling exploration in [19], in which the authors proposed the BootstrapThompson algorithm for MAB. Bootstrap techniques were used to sample observations from historical and pseudo observations in order to approximate the posterior distribution, which was then used to encourage exploration. As an extension, [23] generalized this technique to Gaussian-reward MAB, while [13] and [14] proposed extensions to contextual linear bandits, achieving the same regret bound as LinTS [2]. The problem they aimed to address was the computational infeasibility of inferring the posterior distribution when the reward follows a nonlinear model. This departs from our goal, which is rather to learn the confidence bound from data.

Our work can be viewed as a subtle combination of EXP3 [5] and Phased Elimination [17]. EXP3 was designed for MAB, where arms with higher empirical average reward are assigned larger probability by a softmax function. The coldness parameter of the softmax function is a tunable hyper-parameter chosen by the user. In our work, we propose a novel scheme to set this parameter automatically in a data-driven fashion. Moreover, although EXP3 is a differentiable bandit algorithm, it is not a UCB-typed algorithm. Phased Elimination eliminates suboptimal arms based on the same index as ours and selects non-suboptimal arms uniformly (pure exploration). There are several fundamental differences between this approach and our work: i) the confidence bound in our work is learned from data and not from concentration inequalities, leading to a less conservative bound; ii) Phased Elimination is a non-differentiable algorithm; iii) Phased Elimination achieves optimality in a worst-case scenario (minimax regret), while our algorithm obtains an empirical gain by being data dependent.

In summary, to the best of our knowledge, our work is the first differentiable UCB-typed linear bandit algorithm which enables the confidence bound to be learned purely from data, without relying on concentration inequalities or assumptions on the form of the reward distribution.
Problem Formulation

We consider the stochastic linear bandit with an arm set $\mathcal{A}$ and a time horizon of $T$ rounds. The arm set contains $K$ arms, i.e., $|\mathcal{A}| = K$, where $K$ could be large. Each arm $i \in \mathcal{A}$ is associated with a known feature vector $x_i \in \mathbb{R}^d$. The expected reward of each arm, $\mu_i = x_i^T \theta$, follows a linear relationship over $x_i$ and an unknown parameter vector $\theta$. Similarly to other works in the bandit literature, we assume that the arm features and the parameter vector are bounded, $\|x\| \le L$ and $\|\theta\| \le C$, where $L > 0$ and $C > 0$. At the beginning of each decision opportunity $t \in [T]$, the learning agent selects one arm $i \in \mathcal{A}$ within the arm set $\mathcal{A}$. Upon this selection, the agent observes the instantaneous reward $y_t \in [0, 1]$, which is drawn independently from a distribution with unknown mean $\mu_i = x_i^T \theta$. The agent aims to maximize the expected cumulative reward over the time horizon $T$. Namely,
$$Y_T = \sum_{t=1}^{T} \mathbb{E}[y_t] \quad (1)$$
This is equivalent to minimizing the expected cumulative regret, which measures the difference between the expected cumulative reward if the optimal arm were always selected and the agent's expected cumulative reward. Denoting by $\mu^* = \max_{i \in \mathcal{A}} x_i^T \theta$ the expected reward of the optimal arm, we get
$$R_T = T\mu^* - \sum_{t=1}^{T} \mathbb{E}[y_t]. \quad (2)$$
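To make the setup concrete, the following is a minimal sketch of the environment just described (arm features, unknown parameter $\theta$, and the per-round regret behind Eq. 2). The class name, the Gaussian noise model, and all constants are our own illustrative assumptions; the paper itself only assumes bounded rewards with linear means.

```python
import numpy as np

class LinearBanditEnv:
    """Stochastic linear bandit: K arms with features x_i in R^d and mean reward x_i^T theta."""

    def __init__(self, K=50, d=10, noise_std=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.X = rng.uniform(-1, 1, size=(K, d))
        self.X /= np.linalg.norm(self.X, axis=1, keepdims=True)  # unit-norm features
        theta = rng.normal(size=d)
        self.theta = theta / np.linalg.norm(theta)                # unit-norm parameter vector
        self.mu = self.X @ self.theta                             # expected rewards mu_i
        self.noise_std = noise_std
        self.rng = rng

    def pull(self, i):
        """Return a noisy reward for arm i (Gaussian noise is an illustrative stand-in)."""
        return self.mu[i] + self.noise_std * self.rng.normal()

    def regret(self, i):
        """Per-round expected regret mu* - mu_i; summed over t this gives R_T of Eq. 2."""
        return self.mu.max() - self.mu[i]
```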
(The Phased Elimination baseline discussed above refers to "Phased elimination with G-optimal exploration", page 258 of [17].)

Algorithm 1: SoftUCB
Input: $\beta$, $\mathcal{A}$, $K$, $T$, $\alpha$.
Initialization: $V_0 = \alpha I \in \mathbb{R}^{d \times d}$, $b_0 = 0 \in \mathbb{R}^d$, $\hat{\theta}_0 = 0 \in \mathbb{R}^d$, $\gamma_0 = 0$.
for $t \in [1, T]$ do
  1. Find $S_{i,t}$, $\forall i \in \mathcal{A}$, via Eq. 7 with $\beta$.
  2. Find $\pi_t$ via Eq. 8 with $\gamma_{t-1}$.
  3. Select arm $i_t \in \mathcal{A}$ randomly following $\pi_t$ and receive payoff $y_t$.
  4. Update $V_t \leftarrow V_{t-1} + x_t x_t^T$, $b_t \leftarrow b_{t-1} + x_t y_t$ and $\hat{\theta}_t = V_t^{-1} b_t$.
  5. Update $\gamma_t$ via Eq. 9.
end

Upper Confidence Bound (UCB). The upper confidence bound algorithm, e.g., OFUL [1], is designed based on the Optimism in Face of Uncertainty principle. The key aspect is to construct a confidence bound on the estimated reward of each arm. Formally, at each round $t$, the confidence bound is defined as
$$|\hat{\mu}_{i,t} - \mu_i| \le \beta \|x_i\|_{V_t^{-1}}, \quad \forall i \in \mathcal{A} \quad (3)$$
where $\hat{\mu}_{i,t}$ is the estimate of the reward of arm $i$ at round $t$ and $V_t = \sum_{s=1}^{t} x_s x_s^T$ is the Gram matrix up to round $t$. Then, the agent selects the arm with the highest upper confidence bound as follows:
$$i_t = \arg\max_{i \in \mathcal{A}} \ \hat{\mu}_{i,t} + \beta \|x_i\|_{V_t^{-1}} \quad (4)$$
It is well known that the tighter the bound in Eq. 3, the better the balance between exploration and exploitation [16]. Most existing confidence bounds are established from concentration inequalities, e.g., the Hoeffding inequality [4], self-normalized bounds [1], the Azuma inequality [17], the Bernstein inequality [18]. As a specific example, under the assumption that the stochastic reward is an $R$-sub-Gaussian variable, one of the state-of-the-art high-probability upper bounds on $\beta$, derived from properties of self-normalized martingales, was given by [1]:
$$\beta \le R \sqrt{2 \log\Big(\frac{1}{\delta}\Big) + d \log\Big(1 + \frac{T}{d}\Big)} + \sqrt{\alpha}\, C \quad (5)$$
where $\alpha$ is the regularizer parameter of the least-squares estimator, $1 - \delta$ is the probability with which Eq. 3 holds, and $\|\theta\| \le C$. The tightness of this (and other) bounds relies on the validity of assumptions on the reward distribution, which is unfortunately unknown in practice. Alternatively, we aim at learning the confidence bound, i.e., $\beta$, in a data-driven fashion without any a priori assumption on the unknown reward distribution except the linearity of the mean reward, i.e., $\mu_i = x_i^T \theta$, $\forall i \in \mathcal{A}$.

In this section, we first present our proposed algorithm, whose expected cumulative reward is a differentiable function of the confidence bound. Then, we provide a gradient estimator which enables the confidence bound to be learned via gradient ascent. Next, we propose two algorithms to learn the confidence bound in the offline and online settings, respectively. Finally, we prove a regret upper bound for the offline learning setting.
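Before detailing SoftUCB, a minimal reference sketch of the classical selection rule of Eq. 4, assuming a fixed confidence width $\beta$ (which would be set, e.g., via Eq. 5); the function and variable names are ours.

```python
import numpy as np

def ucb_select(X, V, b, beta):
    """Classical UCB selection (Eq. 4): argmax_i  mu_hat_i + beta * ||x_i||_{V^{-1}}.

    X: (K, d) arm features, V: (d, d) Gram matrix, b: (d,) sum of x_s * y_s, beta: confidence width.
    """
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                                     # least-squares estimate
    mu_hat = X @ theta_hat                                    # estimated rewards
    widths = np.sqrt(np.einsum("kd,de,ke->k", X, V_inv, X))  # ||x_i||_{V^{-1}}
    return int(np.argmax(mu_hat + beta * widths))
```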
Our proposed algorithm, named SoftUCB, is shown in Algorithm 2. SoftUCB contains two core components: a UCB-based index $S_{i,t}$ and an arm selection policy $\pi_t$. Formally, for $i \in \mathcal{A}$, $\hat{\mu}_{i,t} = x_i^T \hat{\theta}_t$, where $\hat{\theta}_t = V_t^{-1} \sum_{s=1}^{t} x_s y_s$ is the least-squares estimator and $V_t = \sum_{s=1}^{t} x_s x_s^T$ is the Gram matrix up to round $t$. Let us denote by $i^* = \arg\max_{i \in \mathcal{A}} \hat{\mu}_{i,t} - \beta \|x_i\|_{V_t^{-1}}$ the arm with the largest lower confidence bound at round $t$. Let us also define
$$\phi_{i,t} = \|x_i\|_{V_t^{-1}} + \|x_{i^*}\|_{V_t^{-1}} \quad \text{and} \quad \hat{\Delta}_{i,t} = \hat{\mu}_{i^*,t} - \hat{\mu}_{i,t} \quad (6)$$
where $\beta$ is the confidence bound defined in Eq. 3 and $\hat{\Delta}_{i,t}$ is the estimated reward gap between $i^*$ and $i$. Equipped with the above notation, we are now ready to introduce the UCB-based index $S_{i,t}$, defined as
$$S_{i,t} = \beta \phi_{i,t} - \hat{\Delta}_{i,t}. \quad (7)$$
It is worth noting that $S_{i,t}$ is more informative than the classical UCB index of Eq. 4, because of the following two key properties: i) $S_{i,t}$ differentiates arms into suboptimal and non-suboptimal arms. Specifically, $S_{i,t} < 0$ identifies arms which are suboptimal, i.e., $\Delta_i = \mu^* - \mu_i > 0$, and which therefore could be eliminated (i.e., not selected by the agent); ii) $S_{i,t} \ge S_{j,t} \ge 0$ implies that the upper confidence bounds satisfy $\hat{\mu}_{i,t} + \beta \|x_i\|_{V_t^{-1}} \ge \hat{\mu}_{j,t} + \beta \|x_j\|_{V_t^{-1}}$, and therefore arm $i$ is more likely to be selected, in line with the Optimism in Face of Uncertainty principle. These two properties are stated formally in Lemma 1.
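As an illustration, a minimal sketch of how the index of Eq. 7 can be computed from the least-squares estimate; the function and variable names are ours, and the computation mirrors the definitions in Eq. 6.

```python
import numpy as np

def softucb_index(X, V, b, beta):
    """SoftUCB index of Eq. 7: S_i = beta * phi_i - Delta_hat_i (a sketch, names are ours).

    phi_i = ||x_i||_{V^{-1}} + ||x_{i*}||_{V^{-1}}, with i* the arm maximizing the lower
    confidence bound, and Delta_hat_i = mu_hat_{i*} - mu_hat_i (Eq. 6).
    """
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b
    mu_hat = X @ theta_hat
    widths = np.sqrt(np.einsum("kd,de,ke->k", X, V_inv, X))  # ||x_i||_{V^{-1}}
    i_star = int(np.argmax(mu_hat - beta * widths))           # largest lower confidence bound
    phi = widths + widths[i_star]
    gap_hat = mu_hat[i_star] - mu_hat
    # S_i < 0 flags arm i as suboptimal (Lemma 1) and places it in the soft-eliminated set L_t.
    S = beta * phi - gap_hat
    return S, phi, mu_hat
```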
Lemma 1. If $S_{i,t} < 0$, arm $i$ is a suboptimal arm, i.e., $\mu^* - \mu_i > 0$. If $S_{i,t} \ge S_{j,t} \ge 0$, then the upper confidence bounds satisfy $\hat{\mu}_{i,t} + \beta \|x_i\|_{V_t^{-1}} \ge \hat{\mu}_{j,t} + \beta \|x_j\|_{V_t^{-1}}$.

The proof is provided in Appendix A.

We now describe the arm selection strategy. At each round $t \in [T]$, the probability for arm $i$ to be selected is defined as
$$p_{i,t} = \frac{\exp(\gamma_t S_{i,t})}{\sum_{j=1}^{K} \exp(\gamma_t S_{j,t})} \quad (8)$$
where $\gamma_t > 0$ is the coldness parameter controlling the concentration of the distribution (policy) $\pi_t = [p_{1,t}, p_{2,t}, ..., p_{K,t}]$, and it is set as
$$\gamma_t = \frac{\log\big(\frac{\delta |\mathcal{L}_t|}{1 - \delta}\big)}{\tilde{S}_{\max,t}} \quad (9)$$
where at each round $t$ the arm set $\mathcal{A}$ is divided into two subsets $\mathcal{U}_t$ and $\mathcal{L}_t$, with $\mathcal{U}_t \cup \mathcal{L}_t = \mathcal{A}$ and $\mathcal{U}_t \cap \mathcal{L}_t = \emptyset$. Namely, $\mathcal{L}_t$ is the set of suboptimal arms (i.e., $i \in \mathcal{L}_t$ if $S_{i,t} < 0$) and $\mathcal{U}_t$ is the set of non-suboptimal arms (i.e., $i \in \mathcal{U}_t$ if $S_{i,t} \ge 0$). $\tilde{S}_{\max,t} = \max_{i \in \mathcal{U}_t} S_{i,t}$, $|\mathcal{L}_t|$ is the cardinality of $\mathcal{L}_t$, and $\delta$ is a probability hyper-parameter explained in the following lemma.

Lemma 2. At any round $t \in [T]$, for any $\delta \in (0, 1)$, setting $\gamma_t \ge \log\big(\frac{\delta |\mathcal{L}_t|}{1 - \delta}\big) / \tilde{S}_{\max,t}$ guarantees that $p_{\mathcal{U}_t} = \sum_{i \in \mathcal{U}_t} p_{i,t} \ge \delta$ and $p_{\mathcal{L}_t} = \sum_{i \in \mathcal{L}_t} p_{i,t} < 1 - \delta$.

The proof is provided in Appendix B.

According to Lemma 2, Eq. 9 guarantees that suboptimal arms ($i \in \mathcal{L}_t$) are selected with an arbitrarily small probability (i.e., $p_{\mathcal{L}_t} < 1 - \delta \approx 0$ when $\delta \approx 1$). This leads to a soft-elimination of suboptimal arms. Furthermore, a positive $\gamma_t$ guarantees $p_{i,t} \ge p_{j,t}$ if $S_{i,t} \ge S_{j,t} \ge 0$, $\forall i, j \in \mathcal{U}_t$, which obeys the Optimism in Face of Uncertainty principle.

Overall, SoftUCB (soft-)eliminates suboptimal arms and selects non-suboptimal arms according to the index in Eq. 7, which favors the selection of arms with either high estimated reward or high uncertainty.
We now show that the expected cumulative reward of SoftUCB is a differentiable function of $\beta$ and introduce a gradient estimator. Formally, given the expected cumulative reward defined in Eq. 1 and SoftUCB described above, the optimization objective is defined as
$$\max_{\beta} Y_T = \max_{\beta} \sum_{t=1}^{T} \mathbb{E}[y_t] = \max_{\beta} \sum_{t=1}^{T} \sum_{i=1}^{K} p_{i,t} \mu_i, \quad \text{s.t.} \ |\mu_i - \hat{\mu}_{i,t}| \le \beta \|x_i\|_{V_t^{-1}}, \ \forall i \in \mathcal{A}, \ t \in [T] \quad (10)$$
The imposed constraint ensures that $\beta \|x_i\|_{V_t^{-1}}$ is indeed an actual upper confidence bound (UCB) at any round $t \in [T]$ for any arm $i \in \mathcal{A}$. Applying the Lagrange multipliers gives the new objective:
$$\max_{\beta} \sum_{t=1}^{T} \sum_{i=1}^{K} p_{i,t} \mu_i - \eta \big( |\mu_i - \hat{\mu}_{i,t}| - \beta \|x_i\|_{V_t^{-1}} \big), \quad \text{s.t.} \ \eta > 0 \quad (11)$$
The gradient with respect to $\beta$, denoted $g(\beta)$, can be derived as (proof in Appendix C):
$$g(\beta) = \sum_{t=1}^{T} \sum_{i=1}^{K} p_{i,t} \mu_i \left( \gamma_t \phi_{i,t} - \frac{\sum_{j=1}^{K} \gamma_t \phi_{j,t} \exp(\gamma_t S_{j,t})}{\sum_{j=1}^{K} \exp(\gamma_t S_{j,t})} \right) + \eta \|x_i\|_{V_t^{-1}} \quad (12)$$
Note that $\mu_i$ is unknown in practice; it is therefore replaced by its empirical estimate $\hat{\mu}_{i,t}$, leading to the following gradient estimator:
$$\hat{g}(\beta) = \sum_{t=1}^{T} \sum_{i=1}^{K} p_{i,t} \hat{\mu}_{i,t} \left( \gamma_t \phi_{i,t} - \frac{\sum_{j=1}^{K} \gamma_t \phi_{j,t} \exp(\gamma_t S_{j,t})}{\sum_{j=1}^{K} \exp(\gamma_t S_{j,t})} \right) + \eta \|x_i\|_{V_t^{-1}} \quad (13)$$
The gradient estimator $\hat{g}(\beta)$ in Eq. 13 enables $\beta$ to be learned via gradient ascent. As a stochastic gradient method, under standard conditions on the learning rate, e.g., Robbins-Monro [20], $\hat{\beta}$ is expected to converge to a local optimum.

Equipped with the gradient estimator $\hat{g}(\beta)$ (Eq. 13), we now show how to learn $\beta$ in the offline and online settings. The corresponding algorithms, named SoftUCB-offline and SoftUCB-online, are presented in Appendix E.
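As an illustration, a sketch of the softmax policy of Eq. 8 and of the gradient estimator of Eq. 13. The per-round storage format (`history`) and all names are our own assumptions; the quantities it holds ($p_{i,t}$, $\hat{\mu}_{i,t}$, $\phi_{i,t}$, $\gamma_t$, $\|x_i\|_{V_t^{-1}}$) are those defined above.

```python
import numpy as np

def softmax_policy(S, gamma):
    """Arm-selection policy of Eq. 8: p_i proportional to exp(gamma * S_i)."""
    z = gamma * S
    z = z - z.max()                    # numerical stabilization, does not change the policy
    p = np.exp(z)
    return p / p.sum()

def gradient_estimator(history, eta):
    """Gradient estimator g_hat(beta) of Eq. 13, summed over the stored rounds.

    `history` is a list of per-round dicts with keys 'p', 'mu_hat', 'phi', 'gamma',
    'widths' (widths[i] = ||x_i||_{V_t^{-1}}); eta is the Lagrange multiplier.
    """
    g = 0.0
    for h in history:
        p, mu_hat, phi = h["p"], h["mu_hat"], h["phi"]
        gamma, widths = h["gamma"], h["widths"]
        # score-function term: grad_beta log p_i = gamma*phi_i - sum_j gamma*phi_j*p_j (Eq. 39)
        grad_log_p = gamma * phi - np.sum(gamma * phi * p)
        g += np.sum(p * mu_hat * grad_log_p + eta * widths)
    return g
```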
Offline setting. In this setting, multiple $T$-round trajectories of the bandit problem with the same arm set $\mathcal{A}$ are used to train $\beta$, which is refined after each $T$-round trajectory. The key steps are to initialize $\hat{\beta}$ and run SoftUCB on $\mathcal{A}$ for $N$ training trajectories, each trajectory containing $T$ rounds. After each trajectory $n \in [N]$, update $\hat{\beta}_n \leftarrow \hat{\beta}_{n-1} + \lambda \hat{g}(\beta)$ via Eq. 13, where $\lambda$ is the learning step. At the end of the training, run SoftUCB on $\mathcal{A}$ with $\hat{\beta} = \hat{\beta}_N$.

As a result of the training, the value of $\hat{\beta}$ is optimized in such a way that it maximizes the expected cumulative reward on the arm set $\mathcal{A}$. Empirically, the $\hat{\beta}$ to which the algorithm converges is substantially smaller than its theoretical upper bound of Eq. 5. This translates into a significant regret reduction. In a following subsection we provide a theoretical regret upper bound for SoftUCB-offline.
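A minimal sketch of the offline procedure just described, reusing the illustrative environment and the helpers sketched earlier (`softucb_index`, `softmax_policy`, `gradient_estimator`); the horizon, learning rate, $\delta$ and $\eta$ values below are arbitrary placeholders, not the paper's settings.

```python
import numpy as np

def run_softucb(env, beta, T=1000, alpha=1.0, delta=0.99, seed=0):
    """One T-round trajectory of SoftUCB (Algorithm 1); returns the stored per-round quantities."""
    rng = np.random.default_rng(seed)
    K, d = env.X.shape
    V, b, gamma = alpha * np.eye(d), np.zeros(d), 0.0
    history = []
    for _ in range(T):
        S, phi, mu_hat = softucb_index(env.X, V, b, beta)   # Eq. 7
        p = softmax_policy(S, gamma)                         # Eq. 8, using the previous gamma
        i = rng.choice(K, p=p)
        y = env.pull(i)
        V += np.outer(env.X[i], env.X[i])                    # rank-one update of the Gram matrix
        b += env.X[i] * y
        widths = np.sqrt(np.einsum("kd,de,ke->k", env.X, np.linalg.inv(V), env.X))
        history.append({"p": p, "mu_hat": mu_hat, "phi": phi, "gamma": gamma, "widths": widths})
        n_low, s_max = int(np.sum(S < 0)), S[S >= 0].max(initial=0.0)
        if n_low > 0 and s_max > 1e-12:                      # Eq. 9: soft-eliminate arms in L_t
            gamma = max(np.log(delta * n_low / (1 - delta)) / s_max, 0.0)
    return history

def train_beta_offline(env, n_trajectories=20, lr=0.01, eta=0.01):
    """SoftUCB-offline (Algorithm 3): gradient ascent on beta across training trajectories."""
    beta = 0.0
    for _ in range(n_trajectories):
        history = run_softucb(env, beta)
        beta += lr * gradient_estimator(history, eta)        # ascent step with Eq. 13
    return beta
```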
While the above method is fully adaptive to the structure of $\mathcal{A}$, it places a burden on the computational complexity. Specifically, the computational complexity of SoftUCB-offline is $O(NKT)$, since we run SoftUCB for $N$ trajectories with $K$ arms and $T$ rounds per trajectory. This is much higher than other linear bandit algorithms such as LinUCB [1] and LinTS [2]. To mitigate this issue, we propose SoftUCB-online, which learns $\beta$ within one trajectory in an online fashion.

Online setting. In this setting, $\hat{\beta}$ is updated online during one $T$-round trajectory. Specifically, $\hat{\beta}$ is initialized and SoftUCB is run on $\mathcal{A}$ for $T$ rounds. At the end of each round $t \in [T]$, update $\hat{\beta}_t \leftarrow \hat{\beta}_{t-1} + \lambda \hat{g}_t(\beta)$, where $\lambda$ is the learning step and $\hat{g}_t(\beta)$ is the gradient estimator (Eq. 15, defined below). This reduces the computational complexity to $O(KT)$, since it does not require the $N$ training trajectories, which is at the same level as OFUL [1], LinUCB [8] and LinTS [2].

In this setting, $Y_T = \sum_{t=1}^{T} \mathbb{E}[y_t]$, the objective function we aim at maximizing, is not available before the end of the trajectory. To obviate this problem, similarly to policy gradient methods for non-episodic reinforcement learning problems [21], we update $\hat{\beta}$ to maximize the average reward per round $\hat{Y}_t$. Formally, at each round $t$, $\hat{Y}_t$ consists of two parts: the observed cumulative reward up to round $t$ and the bootstrapped future reward under the current policy $\pi_t = [p_{1,t}, p_{2,t}, ..., p_{K,t}]$. This translates into the following problem formulation:
$$\max_{\beta} \hat{Y}_t = \max_{\beta} \left( \sum_{s=1}^{t} \sum_{i=1}^{K} p_{i,s} \hat{\mu}_{i,s} + (T - t) \sum_{i=1}^{K} p_{i,t} \hat{\mu}_{i,t} \right) / T, \quad \text{s.t.} \ |\hat{\mu}_{i,t} - \mu_i| \le \beta \|x_i\|_{V_t^{-1}}, \ \forall i \in \mathcal{A} \quad (14)$$
The gradient estimator $\hat{g}_t(\beta)$ at round $t$ can be derived as
$$\hat{g}_t(\beta) = \frac{1}{T} \left( \sum_{s=1}^{t} \sum_{i=1}^{K} \hat{\mu}_{i,s} \nabla_{\beta}\, p_{i,s} + (T - t) \sum_{i=1}^{K} \hat{\mu}_{i,t} \nabla_{\beta}\, p_{i,t} + \eta \|x_i\|_{V_t^{-1}} \right) \quad (15)$$
It is worth noting that, at the end of the trajectory ($t = T$), $\hat{Y}_t$ coincides with the offline objective $Y_T$ of Eq. 10, up to the normalization by $T$.

Table 1: The comparison between $\hat{\beta}$ (offline) and the theoretical bound $\tilde{\beta}$.
Columns: d = 5, T = 2 | d = 5, T = 2 | d = 5, T = 2 | d = 10, T = 2 | d = 15, T = 2
$\hat{\beta}$ = | $\hat{\beta}$ = | $\hat{\beta}$ = | $\hat{\beta}$ = | $\hat{\beta}$ =
$\tilde{\beta}$ = 2.56 | $\tilde{\beta}$ = 2.66 | $\tilde{\beta}$ = 2.76 | $\tilde{\beta}$ = 3.25 | $\tilde{\beta}$ = 3.

[Figure 1: Learning curves of SoftUCB-offline and SoftUCB-online. Panels: (a) $\hat{\beta}$ (offline), (b) $R_T$ (offline), (c) $\hat{\beta}$ (online), (d) $R_T$ (online).]
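For completeness, a sketch of the per-round gradient estimator of Eq. 15 used by SoftUCB-online, in the same illustrative `history` format assumed above; at the end of each round one would apply $\hat{\beta}_t \leftarrow \hat{\beta}_{t-1} + \lambda\, \hat{g}_t(\beta)$.

```python
import numpy as np

def grad_beta_policy(p, phi, gamma):
    """d p_i / d beta via the score function: p_i * (gamma*phi_i - sum_j gamma*phi_j*p_j), Eq. 39."""
    return p * (gamma * phi - np.sum(gamma * phi * p))

def online_gradient(history, T, widths, eta):
    """Gradient estimator g_hat_t(beta) of Eq. 15; `history` holds rounds 1..t,
    widths[i] = ||x_i||_{V_t^{-1}} at the current round, eta is the Lagrange multiplier."""
    t = len(history)
    acc = 0.0
    for h in history:                   # observed part: sum over s = 1..t
        acc += np.sum(h["mu_hat"] * grad_beta_policy(h["p"], h["phi"], h["gamma"]))
    last = history[-1]                  # bootstrapped part: (T - t) times the current round
    acc += (T - t) * np.sum(last["mu_hat"] * grad_beta_policy(last["p"], last["phi"], last["gamma"]))
    acc += eta * np.sum(widths)         # Lagrangian term
    return acc / T
```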
Theorem 1. Let $\mathbb{E}[r_t] = \mathbb{E}[\mu^* - \sum_{i=1}^{K} p_{i,t} \mu_i]$ be the expected regret at round $t \in [T]$. Let $\hat{\beta} = \beta_N$ be the confidence bound learned in the offline training setting after $N$ trajectories of $T$ rounds. Assume that $\gamma_t$ follows Lemma 2 and $\delta \approx 1$. The cumulative regret of SoftUCB is bounded as
$$R_T = \sum_{t=1}^{T} \mathbb{E}[r_t] \le 4\sqrt{2}\, \hat{\beta} \delta \sqrt{T d \log\Big(\alpha + \frac{T}{d}\Big)} = \tilde{O}\Big( \hat{\beta} \sqrt{dT \log\Big(\frac{T}{d}\Big)} \Big) \quad (16)$$
where $\tilde{O}(\cdot)$ hides absolute constants. The proof is contained in Appendix D.
Theorem 1 provides a regret upper bound for SoftUCB in the offline setting. To compare this regret bound with that of other algorithms, we show $\hat{\beta}$ explicitly in the upper bound. Our regret bound scales with $d$ and $T$ as the $O(\beta\sqrt{dT})$ regret bound of existing UCB-typed algorithms, e.g., OFUL [1], LinUCB [8], Giro [13]. Since we make no assumption on the reward distribution, we cannot derive a theoretical upper bound on $\hat{\beta}$. However, it is worth noting that the empirical results (in the next section) show that $\hat{\beta}$ is significantly smaller than the theoretical upper bound of Eq. 5. The theoretical analysis of the online setting is left for future work.

Experiments

Our experimental evaluation aims to answer the following questions: (1) Does the learning curve of $\hat{\beta}$ converge in the offline and online settings? (2) Is $\hat{\beta}$ lower than its theoretical counterpart? (3) How do our proposed algorithms perform compared to baseline ones?

In the synthetic datasets, there are $K = 50$ arms with feature vectors drawn uniformly from $[-1, 1]$. The dimension of the arm features is set as $d = 10, 20$. Arm feature vectors are normalized to be unit vectors. The parameter vector $\theta$ is generated as a random unit vector. The noise level is fixed across runs and the regularizer parameter is $\alpha = 1$. We use two real-world datasets: Jester [11] and MovieLens [15] (see Appendix G for more details). We compare the proposed algorithms with baseline ones, namely LinUCB [1], LinTS [2] and $\epsilon$-greedy [21]. The $\beta$ in LinUCB is set as in Eq. 5, LinTS follows [2], and $\epsilon$-greedy uses a fixed exploration rate $\epsilon$.

Fig. 1 depicts the learning curves of $\hat{\beta}$ and the corresponding $R_T$ in both the offline and online settings for the synthetic dataset with feature dimension $d = 10$. In both settings, $\hat{\beta}$ and $R_T$ converge. Note that in the offline setting $\hat{\beta}$ is optimized to maximize the expected cumulative reward (Eq. 10), while in the online setting $\hat{\beta}$ is optimized to maximize the average reward per round (Eq. 14).

In Table 1, we compare the $\hat{\beta}$ obtained from offline training and its theoretically suggested counterpart $\tilde{\beta}$ given by Eq. 5. Clearly, $\hat{\beta}$ is consistently and significantly smaller than $\tilde{\beta}$ in all cases. This is because $\hat{\beta}$ is adaptive to the structure of $\mathcal{A}$, while $\tilde{\beta}$ is derived from a worst-case (minimax) analysis. Note that the value of $\hat{\beta}$ is highly data-dependent: the values reported here are only valid for our experimental data. However, it is reasonable to expect $\hat{\beta}$ to be smaller than $\tilde{\beta}$ in general. The corresponding learning curves are shown in Appendix F.

[Figure 2: Performance of the algorithms on synthetic and real-world datasets. Panels: (a) d = 10, (b) d = 20, (c) MovieLens, (d) Jester.]

In Fig. 2, the proposed algorithms converge to lower cumulative regret than the baselines. There are two reasons: first, the confidence bound $\hat{\beta}$ is optimized; second, the proposed algorithm (softly) eliminates suboptimal arms, which accelerates the rate of convergence. It is worth noting that the regret of SoftUCB-online is large in the initial phase. This is because at the beginning, when $\gamma = 0$ and $|\mathcal{L}_t| = 0$, SoftUCB-online selects arms uniformly, which results in large regret. Later, when suboptimal arms are identified, $|\mathcal{L}_t| > 0$ and $\gamma_t > 0$ according to Eq. 9; suboptimal arms are soft-eliminated and non-suboptimal arms are selected following the index of Eq. 7, which controls the regret.

Finally, during our experiments we noticed that the convergence of $\hat{\beta}$ in both the offline and online settings is sensitive to the Lagrange multiplier $\eta$. With a large $\eta$, the gradient ascent algorithm fails to converge, because the gradient estimator of Eq. 15 is dominated by $\eta\|x_i\|_{V_t^{-1}}$. On the other hand, a too small $\eta$ does not enforce the key constraint $|\hat{\mu}_{i,t} - \mu_i| \le \beta\|x_i\|_{V_t^{-1}}$, which can lead to erroneously eliminating the optimal arm. Therefore, the hyper-parameter $\eta$ needs to be tuned carefully during experiments.

Conclusion
We propose SoftUCB, a novel UCB-typed linear bandit algorithm based on an adaptive confidence bound, resulting in a less conservative algorithm with respect to UCB-typed algorithms with constructed confidence bounds. The key novelty is to express the expected cumulative reward as a differentiable function of the confidence bound and to derive a gradient estimator, which enables the confidence bound to be learned via gradient ascent. The estimated confidence bound $\hat{\beta}$ can be updated under offline/online training settings with the proposed SoftUCB-offline and SoftUCB-online, respectively. Theoretically, we provide a $\tilde{O}(\hat{\beta}\sqrt{dT})$ regret upper bound for SoftUCB in the offline setting. Empirically, we show that $\hat{\beta}$ is significantly smaller than its theoretical counterpart, leading to a reduction of the cumulative regret compared to state-of-the-art baselines.

There are several directions for future work. First, our work can be combined with meta-learning algorithms, e.g., MAML [10], to learn a confidence bound which is adaptive to the common structure of a set of bandit tasks. Second, we believe our work can be generalized to reinforcement learning (RL) tasks, where the exploration-exploitation trade-off is a long-standing challenge.
Broader Impact Discussion
Our work is an algorithm for the multi-arm bandit (MAB) problem. On the novelty side, our work automates exploration in bandit problems. Such an algorithm could be used in recommendation systems and clinical trials. On the positive side, our work balances the exploration-exploitation trade-off in a problem-dependent way, which might improve customer satisfaction or patients' health care. On the negative side, depending on the deployed application, the recommended contents might be unsuitable for some users. To mitigate this issue, domain knowledge might be required to filter the recommended contents before releasing them to users. Regarding health care applications, expert supervision is essential to avoid any potential hazard.
References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
[2] Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135, 2013.
[3] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
[4] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[5] Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331. IEEE, 1995.
[6] Craig Boutilier, Chih-Wei Hsu, Branislav Kveton, Martin Mladenov, Csaba Szepesvari, and Manzil Zaheer. Differentiable bandit exploration. arXiv preprint arXiv:2002.06772, 2020.
[7] Richard Y Chen, Szymon Sidor, Pieter Abbeel, and John Schulman. UCB exploration via Q-ensembles. arXiv preprint arXiv:1706.01502, 2017.
[8] Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
[9] Adam N Elmachtoub, Ryan McNellis, Sechan Oh, and Marek Petrik. A practical method for solving contextual bandit problems using decision trees. arXiv preprint arXiv:1706.04687, 2017.
[10] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org, 2017.
[11] Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001.
[12] Botao Hao, Yasin Abbasi Yadkori, Zheng Wen, and Guang Cheng. Bootstrapping upper confidence bound. In Advances in Neural Information Processing Systems, pages 12123–12133, 2019.
[13] Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, and Craig Boutilier. Perturbed-history exploration in stochastic multi-armed bandits. arXiv preprint arXiv:1902.10089, 2019.
[14] Branislav Kveton, Csaba Szepesvari, Zheng Wen, Mohammad Ghavamzadeh, and Tor Lattimore. Garbage in, reward out: Bootstrapping exploration in multi-armed bandits. arXiv preprint arXiv:1811.05154, 2018.
[15] Shyong Lam and Jon Herlocker. MovieLens data sets. Department of Computer Science and Engineering at the University of Minnesota, 2006.
[16] Tor Lattimore and Csaba Szepesvari. The end of optimism? An asymptotic analysis of finite-armed linear bandits. arXiv preprint arXiv:1610.04491, 2016.
[17] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. preprint, 2018.
[18] Volodymyr Mnih, Csaba Szepesvári, and Jean-Yves Audibert. Empirical Bernstein stopping. In Proceedings of the 25th International Conference on Machine Learning, pages 672–679, 2008.
[19] Ian Osband and Benjamin Van Roy. Bootstrapped Thompson sampling and deep exploration. arXiv preprint arXiv:1507.00300, 2015.
[20] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[21] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
[22] Liang Tang, Yexi Jiang, Lei Li, Chunqiu Zeng, and Tao Li. Personalized recommendation via parameter-free contextual bandits. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 323–332, 2015.
[23] Sharan Vaswani, Branislav Kveton, Zheng Wen, Anup Rao, Mark Schmidt, and Yasin Abbasi-Yadkori. New insights into bootstrapping for bandits. arXiv preprint arXiv:1805.09793, 2018.

Appendix A
This section contains the proof of Lemma 1.
Proof.
Suppose $S_{i,t} < 0$, that is,
$$\beta\,(\|x_{i^*}\|_{V_t^{-1}} + \|x_i\|_{V_t^{-1}}) < \hat{\mu}_{i^*,t} - \hat{\mu}_{i,t}. \quad (17)$$
Rearranging terms gives
$$\hat{\mu}_{i,t} + \beta \|x_i\|_{V_t^{-1}} < \hat{\mu}_{i^*,t} - \beta \|x_{i^*}\|_{V_t^{-1}}. \quad (18)$$
Note that $|\mu_i - \hat{\mu}_{i,t}| \le \beta \|x_i\|_{V_t^{-1}}$, $\forall i \in \mathcal{A}$. Then,
$$\hat{\mu}_{i^*,t} - \beta \|x_{i^*}\|_{V_t^{-1}} \le \mu_{i^*} \quad (19)$$
and
$$\mu_i \le \hat{\mu}_{i,t} + \beta \|x_i\|_{V_t^{-1}}. \quad (20)$$
Combining Eqs. 18-20, we have
$$\mu_i < \mu_{i^*} \le \mu^*. \quad (21)$$
Recall that, by definition, $i^* = \arg\max_{i \in \mathcal{A}} \hat{\mu}_{i,t} - \beta \|x_i\|_{V_t^{-1}}$ is the arm with the largest lower confidence bound at round $t$. Therefore, $\Delta_i = \mu^* - \mu_i > 0$; in words, arm $i$ is suboptimal.

Suppose now $S_{i,t} \ge S_{j,t} \ge 0$, that is,
$$\beta\,(\|x_{i^*}\|_{V_t^{-1}} + \|x_i\|_{V_t^{-1}}) - (\hat{\mu}_{i^*,t} - \hat{\mu}_{i,t}) \ge \beta\,(\|x_{i^*}\|_{V_t^{-1}} + \|x_j\|_{V_t^{-1}}) - (\hat{\mu}_{i^*,t} - \hat{\mu}_{j,t}), \quad (22)$$
since both indices are computed with respect to the same reference arm
$$i^* = \arg\max_{j \in [K]} \hat{\mu}_{j,t} - \beta \|x_j\|_{V_t^{-1}} \quad (23)$$
at round $t$. The common terms $\beta \|x_{i^*}\|_{V_t^{-1}} - \hat{\mu}_{i^*,t}$ cancel, and rearranging gives
$$\beta \|x_i\|_{V_t^{-1}} + \hat{\mu}_{i,t} \ge \beta \|x_j\|_{V_t^{-1}} + \hat{\mu}_{j,t}. \quad (24)$$

Appendix B
This section contains the proof of Lemma 2.
Proof.
$$p_{\mathcal{U}_t} = \frac{\sum_{i \in \mathcal{U}_t} \exp(\gamma_t S_{i,t})}{\sum_{i \in \mathcal{U}_t} \exp(\gamma_t S_{i,t}) + \sum_{j \in \mathcal{L}_t} \exp(\gamma_t S_{j,t})} \quad (25)$$
By definition, $S_{j,t} < 0$, $\forall j \in \mathcal{L}_t$. Thus,
$$\exp(\gamma_t S_{j,t}) < 1, \quad \forall j \in \mathcal{L}_t. \quad (26)$$
Then,
$$\sum_{j \in \mathcal{L}_t} \exp(\gamma_t S_{j,t}) < |\mathcal{L}_t|. \quad (27)$$
Therefore,
$$p_{\mathcal{U}_t} > \frac{\sum_{i \in \mathcal{U}_t} \exp(\gamma_t S_{i,t})}{\sum_{i \in \mathcal{U}_t} \exp(\gamma_t S_{i,t}) + |\mathcal{L}_t|}. \quad (28)$$
For any probability $\delta \in (0, 1)$, we can find a $\gamma_t$ such that $p_{\mathcal{U}_t} \ge \delta$, namely such that
$$\frac{\sum_{i \in \mathcal{U}_t} \exp(\gamma_t S_{i,t})}{\sum_{i \in \mathcal{U}_t} \exp(\gamma_t S_{i,t}) + |\mathcal{L}_t|} \ge \delta. \quad (29)$$
Rearranging terms gives
$$\sum_{i \in \mathcal{U}_t} \exp(\gamma_t S_{i,t}) \ge \frac{\delta |\mathcal{L}_t|}{1 - \delta}. \quad (30)$$
Taking the logarithm on both sides,
$$\log\left( \sum_{i \in \mathcal{U}_t} \exp(\gamma_t S_{i,t}) \right) \ge \log\left( \frac{\delta |\mathcal{L}_t|}{1 - \delta} \right). \quad (31)$$
The left-hand side is a LogSumExp, which can be lower bounded by
$$\log\left( \sum_{i \in \mathcal{U}_t} \exp(\gamma_t S_{i,t}) \right) \ge \max_{i \in \mathcal{U}_t} \gamma_t S_{i,t} = \gamma_t \max_{i \in \mathcal{U}_t} S_{i,t}. \quad (32)$$
Denoting $\tilde{S}_{\max,t} = \max_{i \in \mathcal{U}_t} S_{i,t}$ and requiring
$$\gamma_t \tilde{S}_{\max,t} \ge \log\left( \frac{\delta |\mathcal{L}_t|}{1 - \delta} \right), \quad (33)$$
we have
$$\gamma_t \ge \frac{\log\big( \frac{\delta |\mathcal{L}_t|}{1 - \delta} \big)}{\tilde{S}_{\max,t}}. \quad (34)$$
Therefore, if $\gamma_t$ satisfies Eq. 34, then
$$p_{\mathcal{U}_t} \ge \delta. \quad (35)$$
Clearly, $p_{\mathcal{L}_t} < 1 - \delta$, since $p_{\mathcal{L}_t} + p_{\mathcal{U}_t} = 1$.

Appendix C
This section contains the derivation of the gradients.
Proof.
$$\max_{\beta} Y_T = \max_{\beta} \sum_{t=1}^{T} \mathbb{E}[y_t] = \max_{\beta} \sum_{t=1}^{T} \sum_{i=1}^{K} p_{i,t} \mu_i, \quad \text{s.t.} \ |\mu_i - \hat{\mu}_{i,t}| - \beta \|x_i\|_{V_t^{-1}} \le 0, \ \forall i \in \mathcal{A}, \ \forall t \in [T] \quad (36)$$
Applying the Lagrange multipliers, the optimization objective becomes
$$\max_{\beta} \sum_{t=1}^{T} \sum_{i=1}^{K} p_{i,t} \mu_i - \eta \big( |\mu_i - \hat{\mu}_{i,t}| - \beta \|x_i\|_{V_t^{-1}} \big), \quad \text{s.t.} \ \eta > 0 \quad (37)$$
Applying the score-function identity $\nabla_{\theta} f(\theta) = f(\theta) \nabla_{\theta} \log f(\theta)$ to $p_{i,t}$, with
$$\log p_{i,t} = \gamma_t S_{i,t} - \log \sum_{j=1}^{K} \exp(\gamma_t S_{j,t}), \quad (38)$$
gives
$$\nabla_{\beta} \log p_{i,t} = \gamma_t \phi_{i,t} - \frac{\sum_{j=1}^{K} \gamma_t \phi_{j,t} \exp(\gamma_t S_{j,t})}{\sum_{j=1}^{K} \exp(\gamma_t S_{j,t})}. \quad (39)$$
Then, the gradient $g(\beta)$ is
$$g(\beta) = \sum_{t=1}^{T} \sum_{i=1}^{K} \mu_i p_{i,t} \left( \gamma_t \phi_{i,t} - \frac{\sum_{j=1}^{K} \gamma_t \phi_{j,t} \exp(\gamma_t S_{j,t})}{\sum_{j=1}^{K} \exp(\gamma_t S_{j,t})} \right) + \eta \|x_i\|_{V_t^{-1}}. \quad (40)$$
The gradient estimator $\hat{g}(\beta)$ is obtained by replacing $\mu_i$ with $\hat{\mu}_{i,t} = x_i^T \hat{\theta}_t$, where $\hat{\theta}_t = V_t^{-1} \sum_{s=1}^{t} x_s y_s$ is the least-squares estimator:
$$\hat{g}(\beta) = \sum_{t=1}^{T} \sum_{i=1}^{K} \hat{\mu}_{i,t} p_{i,t} \left( \gamma_t \phi_{i,t} - \frac{\sum_{j=1}^{K} \gamma_t \phi_{j,t} \exp(\gamma_t S_{j,t})}{\sum_{j=1}^{K} \exp(\gamma_t S_{j,t})} \right) + \eta \|x_i\|_{V_t^{-1}}. \quad (41)$$
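As a quick sanity check of Eq. 39, the closed-form derivative of the softmax policy with respect to $\beta$ can be compared against a finite difference; the numbers below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 6
phi = rng.uniform(0.1, 1.0, size=K)   # phi_i, treated as constant in beta (as in the derivation)
gap = rng.uniform(0.0, 0.5, size=K)   # Delta_hat_i
gamma, beta, eps = 2.0, 0.7, 1e-6

def policy(b):
    """Softmax policy of Eq. 8 with S_i = b*phi_i - Delta_hat_i."""
    z = gamma * (b * phi - gap)
    z = z - z.max()                    # stabilization; does not change the probabilities
    p = np.exp(z)
    return p / p.sum()

p = policy(beta)
analytic = p * (gamma * phi - np.sum(gamma * phi * p))           # p_i * grad_beta log p_i, Eq. 39
numeric = (policy(beta + eps) - policy(beta - eps)) / (2 * eps)  # central finite difference
print(np.max(np.abs(analytic - numeric)))                        # prints a value close to zero
```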
Appendix D

This section contains the proof of Theorem 1.

Proof.
The probability of selecting each arm is defined as
$$p_{i,t} = \frac{\exp(\gamma_t S_{i,t})}{\sum_{j=1}^{K} \exp(\gamma_t S_{j,t})} \quad (42)$$
and $S_{i,t}$ is defined as
$$S_{i,t} = \hat{\beta}\phi_{i,t} - \hat{\Delta}_{i,t} = \hat{\beta}\,(\|x_i\|_{V_t^{-1}} + \|x_{i^*}\|_{V_t^{-1}}) - (\hat{\mu}_{i^*,t} - \hat{\mu}_{i,t}). \quad (43)$$
The cumulative regret to be bounded is
$$R_T = \sum_{t=1}^{T}\mathbb{E}[r_t] = \sum_{t=1}^{T}\big(\mu^* - \mathbb{E}[y_t]\big) = \sum_{t=1}^{T}\Big(\mu^* - \sum_{i=1}^{K} p_{i,t}\mu_i\Big) = \sum_{t=1}^{T}\sum_{i=1}^{K} p_{i,t}(\mu^* - \mu_i) = \sum_{t=1}^{T}\sum_{i=1}^{K} p_{i,t}\Delta_i \quad (44)$$
where we use $\sum_{i=1}^{K} p_{i,t} = 1$.

At each round $t$, the arm set $\mathcal{A}$ is divided into two subsets $\mathcal{U}_t$ and $\mathcal{L}_t$ with $\mathcal{U}_t \cup \mathcal{L}_t = \mathcal{A}$: arm $i \in \mathcal{U}_t$ if $S_{i,t} \ge 0$ and arm $i \in \mathcal{L}_t$ if $S_{i,t} < 0$. Then,
$$\mathbb{E}[r_t] = \sum_{i=1}^{K} p_{i,t}\Delta_i = \sum_{i \in \mathcal{U}_t} p_{i,t}\Delta_i + \sum_{i \in \mathcal{L}_t} p_{i,t}\Delta_i. \quad (45)$$
Suppose $\gamma_t$ follows Lemma 2; then $\sum_{i \in \mathcal{L}_t} p_{i,t} < 1-\delta$. Assume $\Delta_i \le 1$, $\forall i \in \mathcal{A}$. Then,
$$\mathbb{E}[r_t] \le \sum_{i \in \mathcal{U}_t} p_{i,t}\Delta_i + (1-\delta). \quad (46)$$
By setting $\delta \approx 1$, we have $1-\delta \approx 0$: arms in $\mathcal{L}_t$ are unlikely to be selected, so the second term can be dropped. Therefore,
$$\mathbb{E}[r_t] \le \sum_{i \in \mathcal{U}_t} p_{i,t}\Delta_i \quad (47)$$
and thus
$$\mathbb{E}[r_t] \le \sum_{i \in \mathcal{U}_t} p_{i,t}\,(\mu^* - \mu_i). \quad (48)$$
Note that at each round $t$, $|\hat{\mu}_{i,t} - \mu_i| \le \hat{\beta}\|x_i\|_{V_t^{-1}}$, $\forall i \in [K]$. Then
$$\mu^* \le \hat{\mu}_{*,t} + \hat{\beta}\|x_*\|_{V_t^{-1}} \quad (49)$$
and
$$\mu_i \ge \hat{\mu}_{i,t} - \hat{\beta}\|x_i\|_{V_t^{-1}}. \quad (50)$$
Thus,
$$\mu^* - \mu_i \le \hat{\beta}\,(\|x_*\|_{V_t^{-1}} + \|x_i\|_{V_t^{-1}}) + (\hat{\mu}_{*,t} - \hat{\mu}_{i,t}). \quad (51)$$
Note that $\hat{\mu}_{*,t} - \hat{\mu}_{i,t} \le \hat{\mu}_{i^*,t} - \hat{\mu}_{i,t}$, where $i^* = \arg\max_{j \in [K]} \hat{\mu}_{j,t} - \hat{\beta}\|x_j\|_{V_t^{-1}}$. Therefore,
$$\mu^* - \mu_i \le \hat{\beta}\,(\|x_*\|_{V_t^{-1}} + \|x_i\|_{V_t^{-1}}) + (\hat{\mu}_{i^*,t} - \hat{\mu}_{i,t}). \quad (52)$$
Since $i \in \mathcal{U}_t$, $S_{i,t} \ge 0$, that is, $\hat{\mu}_{i^*,t} - \hat{\mu}_{i,t} \le \hat{\beta}\,(\|x_{i^*}\|_{V_t^{-1}} + \|x_i\|_{V_t^{-1}})$. Then,
$$\mu^* - \mu_i \le \hat{\beta}\,(\|x_*\|_{V_t^{-1}} + \|x_i\|_{V_t^{-1}}) + \hat{\beta}\,(\|x_{i^*}\|_{V_t^{-1}} + \|x_i\|_{V_t^{-1}}). \quad (53)$$
Define $\psi_t = \max_{i \in [K]} \|x_i\|_{V_t^{-1}}$. We have
$$\mu^* - \mu_i \le 4\hat{\beta}\psi_t. \quad (54)$$
Plugging this into Eq. 48 gives
$$\mathbb{E}[r_t] \le 4\hat{\beta}\psi_t \sum_{i \in \mathcal{U}_t} p_{i,t}. \quad (55)$$
Since we assume $\gamma_t$ follows Lemma 2 and $\delta \approx 1$, we have $p_{\mathcal{U}_t} = \sum_{i \in \mathcal{U}_t} p_{i,t} \approx \delta$. Therefore,
$$\mathbb{E}[r_t] \le 4\hat{\beta}\psi_t\, p_{\mathcal{U}_t} \le 4\hat{\beta}\delta\psi_t. \quad (56)$$
Thus, by the Cauchy-Schwarz inequality, the cumulative regret satisfies
$$R_T = \sum_{t=1}^{T}\mathbb{E}[r_t] \le \sqrt{T \sum_{t=1}^{T}\mathbb{E}[r_t]^2} \le 4\hat{\beta}\delta\sqrt{T \sum_{t=1}^{T}\psi_t^2}. \quad (57)$$
From Lemma 3 (stated below), we have
$$\sum_{t=1}^{T}\psi_t^2 \le 2 d \log\Big(\alpha + \frac{T}{d}\Big). \quad (58)$$
Plugging this into Eq. 57,
$$R_T \le 4\hat{\beta}\delta\sqrt{2 T d \log\Big(\alpha + \frac{T}{d}\Big)} = \tilde{O}\Big(\hat{\beta}\sqrt{T d \log\Big(1 + \frac{T}{d}\Big)}\Big) \quad (59)$$
where $\delta$ is the probability parameter chosen by the user.

Lemma 3. (Lemma 11 in [1])
$$\sum_{t=1}^{T}\|x_t\|^2_{V_t^{-1}} \le 2\log\det(V_T) \le 2 d \log\Big(\alpha + \frac{T}{d}\Big) \quad (60)$$

Appendix E
SoftUCB , SoftUCB offline and
SoftUCB online . Algorithm 2:
SoftUCB
Input : β , A , K , T , α . Initialization : V = α I ∈ R d × d , b = ∈ R d , ˆ θ = ∈ R d , γ = 0 . for t ∈ [1 , T ] do
1. Find S i,t , ∀ i ∈ A via Eq. 7 with β .2. Find π t via Eq. 8 with γ t − .3. Select arm i t ∈ A randomly following π t and receive payoff y t .4. Update V t ← V t + x t x Tt , b t ← b t − + x t y t and ˆ θ t = V − t b t .5. Update γ t via Eq. 9 . endAlgorithm 3: SoftUCB offline
Input: $\mathcal{A}$, $K$, $T$, $N$, $\lambda$, $\eta$.
Initialization: $\beta_0 = 0$, $\hat{\beta} = 0$.
for $n \in [1, N]$ do
  1. Run SoftUCB on $\mathcal{A}$ for $T$ rounds with $\beta = \beta_{n-1}$.
  2. Update $\beta_n \leftarrow \beta_{n-1} + \lambda \hat{g}(\beta)$ via Eq. 13.
end
Output: $\hat{\beta} \leftarrow \beta_N$.
Run SoftUCB on $\mathcal{A}$ with $\beta = \hat{\beta}$.
SoftUCB online
Input : A , K , T , α , λ , η Initialization : β = 0 , V = αI ∈ R d × d , b = ∈ R d , ˆ θ = ∈ R d , γ = 0 . for t ∈ [1 , T ] do
1. Select arm i t ∈ [ K ] randomly following π t and receive payoff y t .2. Update V t ← V t + x t x Tt , b t ← b t − + x t y t and ˆ θ t = V − t b t .3. Update β t ← β t − + λ ˆ g t ( β ) via Eq. 15. end ppendix F This section contains the learning curves of
SoftUCB offline . (a) d = 5 , T = 2 (b) d = 5 , T = 2 (c) d = 5 , T = 2 Figure 3: Learning curves of
SoftUCB offline (a) d = 10 , T = 2 (b) d = 15 , T = 2 Figure 4: Learning curves of
SoftUCB offline
Appendix G
The dataset Jester contains ratings of 40 jokes from 19,891 users. We sample K = 50 users randomly as arms. Their ratings of the top 39 jokes are used as feature vectors. Then, to reduce the sparsity, we apply a principal component analysis algorithm to reduce the dimension to d = 10. Their ratings of the 40th joke are used as rewards. At each round, the algorithm selects one user to recommend the joke to, and the reward is the rating given by that user. MovieLens contains 6k users and their ratings of 40k movies. Since not every user rates all movies, there is a large amount of missing ratings. We factorize the rating matrix to fill in the missing values. The rest works the same as in