Correlated Bandits for Dynamic Pricing via the ARC algorithm
Samuel N. Cohen ∗ Tanut Treetanthiploet † February 9, 2021
Abstract
The Asymptotic Randomised Control (ARC) algorithm provides a rigorous approximation to the optimal strategy for a wide class of Bayesian bandits, while retaining reasonable computational complexity. In particular, it allows a decision maker to observe signals in addition to their rewards, to incorporate correlations between the outcomes of different choices, and to have nontrivial dynamics for their estimates. The algorithm is guaranteed to asymptotically optimise the expected discounted payoff, with error depending on the initial uncertainty of the bandit. In this paper, we consider a batched bandit problem where observations arrive from a generalised linear model; we extend the ARC algorithm to this setting. We apply this to a classic dynamic pricing problem based on a Bayesian hierarchical model and demonstrate that the ARC algorithm outperforms alternative approaches.

Keywords: multi-armed bandit, parametric bandit, generalised linear model, dynamic pricing

MSC2020: 62J12, 90B50, 91B38, 93C41
∗ Mathematical Institute, University of Oxford, Woodstock Rd, Oxford, OX2 6GG, UK, [email protected]
† Mathematical Institute, University of Oxford, Woodstock Rd, Oxford, OX2 6GG, UK, [email protected]

1 Introduction

In the multi-armed bandit problem, a decision maker needs to sequentially decide between acting to reveal data about a system and acting to generate profit. The central idea of the multi-armed bandit is that the agent has $K$ 'options', or equivalently a bandit with $K$ arms, and must choose which arm to play at each time. Playing an arm results in a reward generated from a fixed but unknown distribution, which must be inferred 'on-the-fly'.

In the classic multi-armed bandit problem, the reward of each arm is assumed to be independent of the others (Gittins and Jones [9], Agrawal [1], Lattimore and Szepesvári [13]), and it is the only observation obtained by the decision maker at each step. In practice, we often observe signals in addition to the rewards, and there is often correlation between the distributions of outcomes for different choices.

For example, in a dynamic pricing problem (Dubé and Misra [6], Misra et al. [14]), an agent wants to fix the price of a single product, from a finite set of prices $\{c_1, ..., c_K\}$, to maximise revenue. We know that when the price is high, the demand $p(c_k)$ (which we interpret as the chance that each customer will buy the product) is low, but each sale yields a higher return. The agent's reward on a given day is $c_k S_k$, where $S_k \,|\, N \sim B(N, p(c_k))$ is the number of customers who buy the product and $N$ is the number of customers on a particular day. On each day, the agent wishes to choose $c_k$ to maximise $c_k \mathbb{E} S_k = c_k p(c_k) \mathbb{E}(N)$. Unfortunately, the agent does not know the true demand $p(c_k)$, and needs to infer it over time.

The situation above can be modelled as a multi-armed bandit problem, where each price corresponds to an arm of the bandit. However, this is not a classical bandit problem. First, we observe more than the reward: we observe both the number of customers entering the shop and the number of sales, while the reward only reflects the number of sales. The second difference is more fundamental, in that we know that there is some correlation between the number of sales (and thus the rewards) of different price choices. For example, when the price $c_k$ increases, we expect the demand $p(c_k)$ to decrease.

In recent work, bandits with correlation have been considered (Filippi et al. [8] and Rusmevichientong and Tsitsiklis [15]). However, these approaches require the distribution of the reward to follow a Generalised Linear Model (GLM), and the reward is the only observation. In the case of the dynamic pricing problem described above, the number of customers on each day will vary and the reward is scaled with the price, making this an unnatural assumption.

In our earlier work [5], we considered a wide class of Bayesian multi-armed bandits as a stochastic control problem (equivalently, a Markov decision problem), allowing a wide range of flexibility in modelling. Applying an asymptotic expansion, together with a regularisation, leads to the Asymptotic Randomised Control (ARC) algorithm. We also showed that the ARC algorithm gives a near-optimal value for the total discounted reward problem (2.1).
However, the implementation in [5] is limited to the case where the estimation procedure admits a simple prior-posterior conjugate pair.

In this paper, we consider a general class of multi-armed bandit problems where the observations of each arm arrive from a parametric exponential family in batches, with a shared parameter, following a generalised linear model. We also generalise the implementation of the ARC algorithm to tackle cases without exact prior-posterior conjugacy, by applying the Kalman filter to the generalised linear model as in Fahrmeir [7].

The paper proceeds as follows. In Section 2, we formulate dynamic online pricing as a generalised linear bandit model and give a description of how we can use large-sample theory and Bayesian statistics to propagate our beliefs. We then give a brief description of bandit algorithms which may be considered as candidates for the generalised linear bandit problem and outline the implementation of the ARC algorithm in Section 3. Finally, in Section 4, we use experimental data from Dubé and Misra [6] to illustrate the performance of each bandit algorithm with observations from a generalised linear model.

2 Dynamic pricing as a bandit problem

In September 2015, Dubé and Misra [6] ran an experiment, in collaboration with the business-to-business company ZipRecruiter.com, to choose an optimal price in an online sales problem. Their experiment ran in two stages, as an offline learning problem: first collecting data using randomly assigned prices, and then testing their optimal price.
Figure 1: (a) The relation between the price and the logit of the acquisition rate. (b) The relation between the price and the expected reward per customer.

In contrast, Misra et al. [14] used the same data to illustrate dynamic online pricing as a classical multi-armed bandit problem, and tackled this problem using a modification of the classical UCB algorithm.¹

¹ Filippi et al. [8] consider the same observation sequence, but they assume that the observation is the reward.

The relation between the price and the logit of the acquisition rate suggested by the first-stage data is illustrated in Figure 1(a); the corresponding expected reward per customer (i.e. $c_k p(c_k)$) generated by the best-fit line and the observed data is illustrated in Figure 1(b). Guided by Figure 1, it is reasonable to consider a logistic model for the probability of subscription; the reward, however, does not fit as naturally into a GLM framework. Given this model, the next question is how to use it to sequentially choose prices. This requires us to combine the multi-armed bandit problem with generalised linear regression.

2.1 The multi-armed bandit problem

We now give a formal specification of our multi-armed bandit problem. Suppose that at each time $t$, we need to choose one of $K$ arms. After choosing the $k$th arm at time $t$, we observe a random variable $Y^{(k)}_t$ whose distribution varies with $k$, and collect a reward $R^{(k)}\big(Y^{(k)}_t\big)$. The collection of previous observations is used as historical data on which to base our next decision, and the problem repeats.

To provide a mathematical framework, let $(\Omega, \mathbb{P}, \mathcal{F})$ be a probability space equipped with a random variable $\Theta$, a family of random variables $(Y^{(k)}_t)_{t \in \mathbb{N},\, k=1,...,K}$ and a sequence $(\zeta_t)_{t \in \mathbb{N}}$ with $\zeta_t \sim \mathrm{IID}\ U[0,1]$, independent of $\big((Y^{(k)}_t), \Theta\big)$. The random variable $Y^{(k)}_t$ is interpreted as the observation when the $k$th arm is chosen at time $t$, the process $(\zeta_t)$ is used to allow random decisions, and $\Theta$ is a hidden parameter specifying the distribution of all $(Y^{(k)}_t)$ in a Bayesian manner. We assume (for simplicity) that $(Y^{(1)}_t, ..., Y^{(K)}_t)_{t \in \mathbb{N}}$ are conditionally independent for each $t$ given $\Theta$.

At time $t$, we need to choose an action $A_t$ taking values in $\{1, ..., K\}$. We can incorporate additional randomisation by instead choosing a sequence $(U_t)$ taking values in $\Delta_K := \{u \in [0,1]^K : \sum_{i=1}^K u_i = 1\}$, where the $k$th component of $U_t$ represents the (conditional) probability that we choose the $k$th arm. The corresponding actions and filtrations (representing historical observations) are given by $A_t = A(U_t, \zeta_t)$, where $A(u, \zeta) := \inf\big\{i : \sum_{k=1}^i u_k \geq \zeta\big\}$ and $\mathcal{F}^U_t := \sigma\big(\zeta_1, Y^{(A_1)}_1, ..., \zeta_t, Y^{(A_t)}_t\big)$.

The random variable $U_{t+1}$ must be chosen based on information at time $t$, that is, $U$ is $(\mathcal{F}^U_t)_{t \geq 0}$-predictable. We emphasise that, for any strategy $(U_t)$, the parameter $\Theta$ is not generally $\mathcal{F}^U_t$-measurable for any $t \geq 0$,
that is, it is impossible to completely resolve our uncertainty over the distribution of outcomes.

The objective of the multi-armed bandit is to choose a strategy $U$ to optimise some objective, and this objective varies in the literature. One common objective is to minimise the cumulative (frequentist) regret (e.g. Agrawal [1], Auer et al. [2], Kaufmann et al. [10], Lai and Robbins [12]):
$$R\big(\Theta, T, (U_t)\big) := \sum_{t=1}^{T} \Big( \mathbb{E}\big[R^{(A^*)}\big(Y^{(A^*)}_t\big) \,\big|\, \Theta\big] - \mathbb{E}\big[R^{(A_t)}\big(Y^{(A_t)}_t\big) \,\big|\, \Theta\big] \Big), \quad \text{where } A^* = \operatorname{arg\,max}_k \mathbb{E}\big[R^{(k)}\big(Y^{(k)}_t\big) \,\big|\, \Theta\big].$$

One may also minimise the Bayesian regret (Russo and Van Roy [17, 18]):
$$r\big(\pi, T, (U_t)\big) := \mathbb{E}^\pi\big[R\big(\Theta, T, (U_t)\big)\big],$$
where $\pi$ is a prior belief for the parameter $\Theta$.

An alternative objective is to maximise the total discounted reward (Gittins and Jones [9], Ryzhov et al. [20], Cohen and Treetanthiploet [5]):
$$V\big(\pi, \beta, (U_t)\big) := \mathbb{E}^\pi\Big[\sum_{t=1}^{\infty} \beta^{t-1} R^{(A_t)}\big(Y^{(A_t)}_t\big)\Big] \tag{2.1}$$
for a given discount rate $\beta \in (0,1)$.

2.2 Generalised Linear Bandit Model

In this section, we focus on our dynamic pricing problem, where observations are sampled from an exponential family whose parameter depends on our decision.

At the beginning of each day, we need to choose a price from the set $\{c_1, ..., c_K\}$. On day $t$, with chosen price $c_k$, we observe $N^{(k)}_t$ customers arriving at the store. In order to capture the relations between demands at different prices, we suppose that the probability that each customer buys the product can be modelled by a logistic model, i.e. the relation between the demand $p(c_k)$ and the price $c_k$ is given by
$$\operatorname{logit}\big(p(c_k)\big) = \Gamma_0 + \Gamma_1 c_k = (\Gamma_0, \Gamma_1)^\top (1, c_k) =: \Theta^\top x^{(k)},$$
where $\operatorname{logit}(p) = \log\big(\tfrac{p}{1-p}\big)$. The parameter $\Theta = (\Gamma_0, \Gamma_1)$ is unknown. At the end of day $t$, we observe $Y^{(k)}_t := \big(N^{(k)}_t, \{Q^{(k)}_{i,t}\}_{i=1,...,N^{(k)}_t}\big)$, where $Q^{(k)}_{i,t}$ takes values in $\{0, 1\}$ and indicates whether the product is bought by the $i$th customer, and collect a reward $R^{(k)}\big(Y^{(k)}_t\big) = c_k \sum_{i=1}^{N^{(k)}_t} Q^{(k)}_{i,t}$.

More generally, we can fit the dynamic pricing problem into the GLM framework. Let $\{x^{(1)}, ..., x^{(K)}\} \subseteq \mathbb{R}^l$ be a collection of features to be chosen by the agent, each corresponding to a choice in $\{1, ..., K\}$, and let $\Theta$ be a random variable taking values in $\mathbb{R}^l$ representing an unknown parameter. After choosing $x^{(k)}$ at time $t$, the agent observes $N^{(k)}_t$ independent random variables $\big(Q^{(k)}_{i,t}\big)_{i=1}^{N^{(k)}_t}$, where
$$Q^{(k)}_{i,t} \,\big|\, \Theta \sim \mathrm{IID}\ h(q) \exp\big(\Phi^{(k)} q - G(\Phi^{(k)})\big), \tag{2.2}$$
$\Phi^{(k)} = \phi\big(\Theta^\top x^{(k)}\big)$ and $Q^{(k)}_{i,t}$ is independent of $N^{(k)}_t$. Here, $h$, $\phi$ and $G$ are known functions. For simplicity, we will assume that, for each fixed $k$, the process $\big(N^{(k)}_t\big)_{t=1}^\infty$ is IID with known distribution.² After observing $Q^{(k)}_{i,t}$ and $N^{(k)}_t$, we obtain a reward $R^{(k)}\big(N^{(k)}_t, Q^{(k)}_{1,t}, ..., Q^{(k)}_{N^{(k)}_t,t}\big)$.

² In fact, one may allow the distribution of $(N^{(k)}_t)_{t=1}^\infty$ to be unknown. In this case, all of our approximation arguments follow through with an extra posterior update step for the parameter of $(N^{(k)}_t)$. We leave this to the reader in order to simplify our discussion.

Remark 2.1. A straightforward extension is to allow the observation $Q$ to be such that $T(Q)$ belongs to an exponential family for some function $T$ in an arbitrary dimension. This follows the same analysis and will allow us to use the same model to consider, e.g., pricing for multiple products.
2.2.1 Updating our beliefs

In order to implement a multi-armed bandit algorithm, we need an efficient way to update our estimate of the parameter $\Theta$, together with its precision. As our observations are obtained in batches, we shall use a large-sample approximation to update via a normal-normal conjugate model.

From (2.2), the mean and variance of $Q$ given the parameter $\Theta = \theta$ are
$$\mathbb{E}_\theta\big(Q^{(k)}_{i,t}\big) = G'\big(\phi^{(k)}\big) \quad \text{and} \quad \operatorname{Var}_\theta\big(Q^{(k)}_{i,t}\big) = G''\big(\phi^{(k)}\big),$$
where $\phi^{(k)} = \phi(\theta^\top x^{(k)})$.

Suppose that the link function $\phi$ is invertible and differentiable. If our model is non-degenerate, $G'$ must also be invertible. We define $\psi := (G' \circ \phi)^{-1}$. It then follows from the Central Limit Theorem and the Delta method that
$$\sqrt{n}\big(\Psi_n - \theta^\top x^{(k)}\big) \xrightarrow{d} N\Big(0,\; 1\big/\big[G''\big(\phi^{(k)}\big)\, \phi'\big(\theta^\top x^{(k)}\big)^2\big]\Big) \tag{2.3}$$
as $n \to \infty$, where $\Psi_n = \psi\big(\frac{1}{n}\sum_{i=1}^n Q^{(k)}_{i,t}\big)$. Moreover, by Slutsky's lemma, $\sqrt{n V_n}\big(\Psi_n - \theta^\top x^{(k)}\big) \xrightarrow{d} N(0, 1)$, where $V_n := V(\Psi_n)$ and $V(\psi) := G''\big(\phi(\psi)\big)\big(\phi'(\psi)\big)^2$.

When $n$ is not large, $\psi\big(\frac{1}{n}\sum_{i=1}^n Q^{(k)}_{i,t}\big)$ may not be well-defined for some values of $Q$. This is the case for the logistic model when $\frac{1}{n}\sum_{i=1}^n Q^{(k)}_{i,t} \in \{0, 1\}$. In order to avoid this degeneracy, if $M_{t-1}$ is our running estimate of $\theta$, we consider a linear expansion of $\psi$ around $\psi^{-1}\big(M^\top_{t-1} x^{(k)}\big)$, which approximates the expected value of $Q^{(k)}_{i,t}$. This approach was used by Fahrmeir [7] to derive an extended Kalman filter with GLM observations, as in (2.4).

Suppose that the posterior of $\Theta$ at time $t - 1$ satisfies $\Theta \,|\, \mathcal{F}^U_{t-1} \sim N(M_{t-1}, \Sigma_{t-1})$. Then, after observing $\{Q^{(k)}_{i,t}\}_{i=1,...,N^{(k)}_t}$, the posterior can be approximately updated by the Kalman filter equations
$$M_t = \Sigma_t\big(\Sigma_{t-1}^{-1} M_{t-1} + S^{(k)}_t \Psi^{(k)}_t x^{(k)}\big), \qquad \Sigma_t = \big(\Sigma_{t-1}^{-1} + S^{(k)}_t x^{(k)} (x^{(k)})^\top\big)^{-1}, \qquad \Psi^{(k)}_t = M^\top_{t-1} x^{(k)} + \big(P^{(k)}_t - \hat{P}^{(k)}_{t-1}\big)\, \psi'\big(\hat{P}^{(k)}_{t-1}\big), \tag{2.4}$$
where $S^{(k)}_t := N^{(k)}_t V\big(\Psi^{(k)}_t\big)$, $\hat{P}^{(k)}_{t-1} := \psi^{-1}\big(M^\top_{t-1} x^{(k)}\big)$, and $P^{(k)}_t := \sum_{i=1}^{N^{(k)}_t} Q^{(k)}_{i,t} \big/ N^{(k)}_t$.

By using the Woodbury identity, (2.4) simplifies to
$$M_t = M_{t-1} + R^{(k)}_t\big(\Psi^{(k)}_t - M^\top_{t-1} x^{(k)}\big)\, \Sigma_{t-1} x^{(k)}, \qquad \Sigma_t = \Sigma_{t-1} - R^{(k)}_t\, \Sigma_{t-1} x^{(k)} (x^{(k)})^\top \Sigma_{t-1}, \tag{2.5}$$
where $R^{(k)}_t = S^{(k)}_t \big/ \big(S^{(k)}_t (x^{(k)})^\top \Sigma_{t-1} x^{(k)} + 1\big)$.
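To make the update concrete, the following is a minimal sketch of (2.5) for the logistic model ($\phi$ the identity, $G(z) = \log(1+e^z)$, so $\psi = \operatorname{logit}$ and $V(\psi) = G''(\psi)$). The function and variable names are our own, not from [5] or [7].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glm_kalman_update(m, Sigma, x, N, Q_sum):
    """One batched posterior update (2.5) for the logistic bandit:
    N customers saw the chosen price, Q_sum of them bought.
    m (l,) and Sigma (l, l) are the current posterior mean/covariance."""
    eta = float(m @ x)
    p_hat = sigmoid(eta)                       # \hat P = psi^{-1}(m^T x) = G'(m^T x)
    p_emp = Q_sum / N                          # P_t = empirical purchase rate
    Psi = eta + (p_emp - p_hat) / (p_hat * (1 - p_hat))  # psi'(p) = 1/(p(1-p))
    V = sigmoid(Psi) * (1 - sigmoid(Psi))      # V(Psi) = G''(Psi) (canonical link)
    S = N * V                                  # S_t = N_t V(Psi_t)
    Sx = Sigma @ x
    R = S / (S * float(x @ Sx) + 1.0)          # gain R_t in (2.5)
    return m + R * (Psi - eta) * Sx, Sigma - R * np.outer(Sx, Sx)
```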
3 Bandit algorithms

There are a number of approaches commonly used to solve the multi-armed bandit problem. In this section, we present a few approaches which can be adapted for application to the multi-armed bandit with generalised linear observations. We will implement these using the approximate posterior $\Theta \,|\, \mathcal{F}^U_t \sim N(M_t, \Sigma_t)$, where $(M_t, \Sigma_t)$ is given by (2.5).

• $\epsilon$-Greedy (Vermorel and Mohri [22]): With probability $1 - \epsilon$, we choose an arm based on its maximal expected reward, and with probability $\epsilon$, we choose an arm uniformly at random.

• Explore-then-commit (ETC) (Rusmevichientong and Tsitsiklis [15]): Let $\epsilon \in (0, 1)$. We explore for the first $\lfloor \epsilon T \rfloor$ trials and then choose an arm which maximises the estimated expected reward for each remaining trial.

• Thompson Sampling (TS) (Thompson [21], Russo et al. [19]): We select according to the posterior probability that each arm is 'best'. In particular, we choose a randomised strategy $U_t$ such that
$$\big(U^{TS}_t\big)_i = \mathbb{P}\big(A^{TS}_t = i \,\big|\, \mathcal{F}^U_{t-1}\big) = \mathbb{P}\big(A^* = i \,\big|\, \mathcal{F}^U_{t-1}\big),$$
where $A^* := \operatorname{arg\,max}_{i \in \mathcal{A}} \mathbb{E}\big(R^{(i)}(Y^{(i)}) \,\big|\, \Theta\big)$. This algorithm can be implemented by sampling $\hat{\Theta}_t$ at time $t$ from its posterior distribution (i.e. conditional on $\mathcal{F}^U_{t-1}$). We can then choose $A^{TS}_t = \operatorname{arg\,max}_{i \in \mathcal{A}} \mathbb{E}\big(R^{(i)}(Y^{(i)}) \,\big|\, \Theta = \hat{\Theta}_t\big)$; see the sketch following this list.³

³ One may also see our parameter update (2.5) as the Laplace approximation method for Thompson Sampling (see [19, Chapter 5]).

• Bayesian Upper Confidence Bound (Bayes-UCB) (Kaufmann et al. [10], Filippi et al. [8]): This is an optimistic strategy which chooses the arm based on an upper bound of the reward:
$$A^{\text{Bayes-UCB}}_t = \operatorname{arg\,max}_i Q\Big(p,\; R^{(i)}\big(Y^{(i)}_t\big) \,\Big|\, \mathcal{F}^U_{t-1}\Big),$$
where $Q(p, X)$ is the $p$-quantile of $X$.
Remark 3.1. Kaufmann et al. [10] prove a theoretical guarantee of optimal order for the classical (independent) Bernoulli bandit when $p = 1 - 1/\big(t (\log T)^c\big)$ and $c \geq 5$; their simulations suggest that $c = 0$ performs well in practice. We will use these values to implement the Bayes-UCB in our simulations.

• Knowledge Gradient (KG) (Ryzhov et al. [20]): At each time, we pretend that we will only update our posterior once, immediately after the current trial. This simplifies the objective function (2.1) and suggests the choice
$$A^{KG}_t = \operatorname{arg\,max}_k \mathbb{E}\Big[R^{(k)}\big(Y^{(k)}_t\big) + \Big(\tfrac{\beta}{1-\beta}\Big) \tilde{V}^{(k)}_{t+1} \,\Big|\, \mathcal{F}^U_t\Big], \quad \text{where } \tilde{V}^{(k)}_{t+1} = \mathbb{E}\Big[\max_j R^{(j)}\big(Y^{(j)}_{t+1}\big) \,\Big|\, \mathcal{F}^U_t,\, A_t = k\Big].$$
The expectation can be computed using an appropriate quadrature or Monte Carlo approach if it is not available explicitly.

• Information-Directed Sampling (IDS) (Russo and Van Roy [18], Kirschner and Krause [11]): This algorithm chooses based on the information ratio between the single-step regret and the information gain. We define
$$\Delta_t(u) := \sum_{k \in \mathcal{A}} u_k\, \mathbb{E}\Big(R^{(A^*)}\big(Y^{(A^*)}_t\big) - R^{(k)}\big(Y^{(k)}_t\big) \,\Big|\, \mathcal{F}_{t-1}\Big),$$
where $A^* := \operatorname{arg\,max}_{k \in \mathcal{A}} \mathbb{E}\big(R^{(k)}(Y^{(k)}) \,\big|\, \Theta\big)$. Let $H$ be an $(\mathcal{F}^U_t)$-adapted process taking values in $\mathbb{R}^K_+$ whose $k$th component represents the information gain when the $k$th arm is chosen, and define $G_t(u) := \sum_{k=1}^K u_k H^{(k)}_t$. The IDS algorithm chooses an arm randomly according to the probabilities
$$U^{IDS}_t = \operatorname{arg\,min}_{u \in \Delta_K} \Big(\frac{\Delta_t(u)^2}{G_t(u)}\Big).$$
In Russo and Van Roy [18], $H^{(k)}_t$ is defined to be the improvement in the Shannon entropy of $A^*$'s posterior when the $k$th arm is chosen at time $t$. However, computing this $H^{(k)}_t$ is expensive. Thus, in our simulation, we will consider $H^{(k)}_t = \mathbb{E}\big(\mathcal{H}(\Theta_{t-1}) - \mathcal{H}(\Theta_t) \,\big|\, \mathcal{F}_{t-1},\, A_t = k\big)$, where $\mathcal{H}(\Theta_t)$ is the Shannon entropy of the posterior distribution of $\Theta$ at time $t$. This information gain is considered in Kirschner and Krause [11].

• Asymptotic Randomised Control (ARC) [5]: We choose an arm to maximise the asymptotic value of our decision. We summarise the required calculation in Section 3.2.

In our simulation, we will also consider the classical UCB and UCB-tuned algorithms (Auer et al. [2]) as candidates, as considered in Misra et al. [14]. These algorithms are similar to the Bayes-UCB algorithm but consider the problem from a frequentist perspective and ignore the correlation between outcomes. In particular, they do not use (2.5) to propagate beliefs, but record the reward of each arm separately. The reader may refer to Misra et al. [14] or the survey by Burtini et al. [3].⁴

⁴ Filippi et al. [8] consider the UCB algorithm with correlation from a frequentist perspective, where an additional rescaling of the bandwidth appears due to the presence of the link function. Nonetheless, their formulation is restricted to the case where the reward is the only observation.
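As an illustration of how these candidates use the Gaussian posterior, here is a minimal sketch of one Thompson-sampling decision for the pricing model of Section 2.2, using the expected-revenue form $n_k c_k G'(\theta^\top x^{(k)})$ derived in Section 4. The function and parameter names are ours.

```python
import numpy as np

def thompson_choice(m, Sigma, prices, n_visitors, rng):
    """Sample Theta-hat from the approximate posterior N(m, Sigma) and play
    the price maximising the estimated expected revenue n c_k G'(Theta^T x)."""
    theta = rng.multivariate_normal(m, Sigma)
    X = np.column_stack([np.ones(len(prices)), prices])   # rows x^(k) = (1, c_k)
    revenue = n_visitors * prices * (1.0 / (1.0 + np.exp(-(X @ theta))))
    return int(np.argmax(revenue))

# usage: k = thompson_choice(m, Sigma, np.array([19., 39., 59.]), 270,
#                            np.random.default_rng(0))
```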
3.2 Implementation of the ARC algorithm

The key idea of the ARC algorithm is to give an estimate of the optimal solution to (2.1) via a Markov decision process with $(m, \Sigma)$ as an underlying state. A smooth approximation is obtained by introducing a preference for random decisions in the objective function (2.1), in particular adding a reward $\lambda \mathcal{H}(A_t)$ to $R^{(A_t)}$ in (2.1), where $\mathcal{H}$ is a smooth entropy function (e.g. Shannon entropy, which we use here). The scale of this preference is controlled through the parameter $\lambda$, which is determined dynamically in order to have a negligible effect when uncertainty is low. This approximation results in a semi-index strategy, which amounts to computing the solution $a \in \mathbb{R}^K$ to the fixed-point equation⁵
$$a = f + \Big(\frac{\beta}{1-\beta}\Big) L^\lambda(a),$$
where the term $f = \big[\mathbb{E}\big(R^{(k)}(Y^{(k)}) \,\big|\, \mathcal{F}^U_t\big)\big]_{k=1,...,K}$ is the vector of expected rewards over one time step (quantifying the gain from immediate exploitation) and $L^\lambda(a)$ is an exploration term, given in (3.1) below. Intuitively, the term $L^\lambda(a)$ combines the derivatives of $f$, with respect to the parameter estimate and its precision, to give an approximation of the expected increase in future payoffs which would be generated from the observations $Y^{(k)}_t$, separately from the immediate reward.

⁵ This fixed-point calculation is the only computationally expensive part of the algorithm, and standard methods can be applied.

With this interpretation of $f$ and $L^\lambda$, the components of $a$ can be interpreted as measuring the immediate reward and the increase in total reward arising from each choice, taking into account the effect of learning on future rewards. The entropy term results in the ARC algorithm applying a softmax function to $a$, yielding conditional probabilities of choosing each arm, rather than a deterministic choice. We refer to [5, Section 6] for a more detailed overview of this algorithm.
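Since the entropy regularisation means the ARC strategy plays arm $k$ with softmax probability $\nu^\lambda_k(a)$ (defined formally in (3.1) below), a numerically stable sketch of this choice rule is as follows (our own naming):

```python
import numpy as np

def softmax_probs(a, lam):
    """nu^lambda(a): arm-choice probabilities from the semi-indices a.
    Subtracting max(a) avoids overflow without changing the result."""
    w = np.exp((a - np.max(a)) / lam)
    return w / w.sum()
```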
The implementation of the ARC algorithm requires the computation of the dynamics of the posterior parameters and the derivatives of the expected one-period reward with respect to the underlying state (i.e., the posterior parameters). In our model, we use (2.5) to simplify the posterior dynamics for the generalised linear bandit, allowing us to estimate the relevant terms in the ARC algorithm for this case. The implementation is given by the following steps.

Step I: Estimate the dynamics of the posterior parameters. The ARC algorithm requires the computation of how we expect the parameter estimate and its precision to change. In the case of interest, we will treat the posterior mean $m$ and variance $\Sigma$ of the parameter $\Theta$ as the 'estimator' and 'precision' in the ARC algorithm.

Let $m$ and $\Sigma$ be the posterior mean and variance conditional on $\mathcal{F}^U_t$, and let $M^{(k)}$ and $\Sigma^{(k)}$ be their updates after we choose the $k$th arm. As discussed in Section 2.2.1, we can update $M^{(k)}$ and $\Sigma^{(k)}$ by (2.5). From (2.3), we estimate $\Psi^{(k)} \,|\, \Theta \approx N\big(\Theta^\top x^{(k)}, 1/S^{(k)}\big)$. Assuming the posterior $\Theta \sim N(m, \Sigma)$ gives $\Theta^\top x^{(k)} \sim N\big(m^\top x^{(k)}, (x^{(k)})^\top \Sigma x^{(k)}\big)$. Hence, the conditional distribution of $\Psi^{(k)} \,|\, (m, \Sigma)$ can be approximated by
$$\Psi^{(k)} \,\big|\, (m, \Sigma) \approx N\Big(m^\top x^{(k)},\; \frac{S^{(k)} (x^{(k)})^\top \Sigma x^{(k)} + 1}{S^{(k)}}\Big).$$
Therefore, (2.5) yields an approximate innovation representation
$$\Delta M^{(k)} \approx \Big(\frac{S^{(k)}}{S^{(k)} (x^{(k)})^\top \Sigma x^{(k)} + 1}\Big)^{1/2} \Sigma x^{(k)} Z,$$
where $Z \sim N(0, 1)$. Since $\Theta \sim N(m, \Sigma)$, when $\Sigma$ is small, we may estimate $S^{(k)} \approx n_k V\big(m^\top x^{(k)}\big)$, where $n_k = \mathbb{E}\big(N^{(k)}\big)$. Hence, in the notation of [5], the dynamics of our state $(m, \Sigma)$ can be approximated by
$$\mathbb{E}_{m,\Sigma}\big(\Delta M^{(k)}\big) \approx \tilde{\mu}^{(k)}(m, \Sigma) := 0, \qquad \operatorname{Var}_{m,\Sigma}\big(\Delta M^{(k)}\big) \approx \big(\tilde{\sigma}\tilde{\sigma}^\top\big)^{(k)}(m, \Sigma) := w(m, \Sigma, k)\, \Sigma x^{(k)} \big(x^{(k)}\big)^\top \Sigma, \qquad \mathbb{E}_{m,\Sigma}\big(\Delta \Sigma^{(k)}\big) \approx \tilde{b}^{(k)}(m, \Sigma) := -w(m, \Sigma, k)\, \Sigma x^{(k)} \big(x^{(k)}\big)^\top \Sigma,$$
where $w(m, \Sigma, k) := \dfrac{n_k V\big(m^\top x^{(k)}\big)}{n_k V\big(m^\top x^{(k)}\big) (x^{(k)})^\top \Sigma x^{(k)} + 1}$.

Step II: Compute the expected reward $f$ and learning function $L^\lambda$ using the estimated dynamics. We next compute the expected reward given the (estimated) posterior parameters, that is, $f : \mathbb{R}^l \times S^l_+ \to \mathbb{R}^K$ with components
$$f_k(m, \Sigma) := \mathbb{E}\Big(R^{(k)}\big(Y^{(k)}\big) \,\Big|\, \Theta \sim N(m, \Sigma)\Big),$$
where $S^l_+$ is the family of positive definite $\mathbb{R}^{l \times l}$ matrices, and the learning function $L^\lambda : \mathbb{R}^K \times \mathbb{R}^l \times S^l_+ \to \mathbb{R}^K$ with components
$$L^\lambda_k(a, m, \Sigma) := \big\langle \mathcal{B}^\lambda(a, m, \Sigma);\, \mathbb{E}_{m,\Sigma}\big(\Delta \Sigma^{(k)}\big) \big\rangle + \big\langle \mathcal{M}^\lambda(a, m, \Sigma);\, \mathbb{E}_{m,\Sigma}\big(\Delta M^{(k)}\big) \big\rangle + \frac{1}{2} \big\langle \Xi^\lambda(a, m, \Sigma);\, \operatorname{Var}_{m,\Sigma}\big(\Delta M^{(k)}\big) \big\rangle, \tag{3.1}$$
where we define
$$\mathcal{B}^\lambda := \sum_k \nu^\lambda_k(a)\, (\partial_\Sigma f_k), \qquad \mathcal{M}^\lambda := \sum_k \nu^\lambda_k(a)\, (\partial_m f_k), \qquad \Xi^\lambda := \sum_k \nu^\lambda_k(a)\, \big(\partial^2_m f_k\big) + \frac{1}{\lambda} \sum_{j,k} \eta^\lambda_{jk}(a)\, (\partial_m f_j)(\partial_m f_k)^\top,$$
$$\nu^\lambda_k(a) := \exp\big(a_k/\lambda\big) \Big/ \sum_j \exp\big(a_j/\lambda\big), \qquad \eta^\lambda_{jk}(a) := \nu^\lambda_j(a)\big(I(j = k) - \nu^\lambda_k(a)\big).$$
The operator $\langle \cdot\,; \cdot \rangle$ is interpreted as an inner product of tensors, i.e. if $u, v \in \mathbb{R}^l$, then $\langle u; v \rangle = u \cdot v$, and if $u, v \in \mathbb{R}^{l \times l}$, then $\langle u; v \rangle = \operatorname{Tr}(uv)$.

The ARC algorithm focuses on the regime where $\Sigma$ is small.
Hence, we estimate our expected reward by $\tilde{f} : \mathbb{R}^l \to \mathbb{R}^K$ with components
$$f_k(m, \Sigma) \approx \tilde{f}_k(m) := \mathbb{E}\bigg(R^{(k)}\Big(N^{(k)}, \big(Q^{(k)}_i\big)_{i=1}^{N^{(k)}}\Big) \,\bigg|\, \Theta = m\bigg) = h_k\big(m^\top x^{(k)}\big), \tag{3.2}$$
for some function $h_k$. The last equality follows from the fact that $\{Q^{(k)}_i \,|\, \Theta = m\}_{i=1,...,\infty} \sim \mathrm{IID}\ g\big(q;\, m^\top x^{(k)}\big)$ and $N^{(k)}$ is independent of $\{Q^{(k)}_i\}_{i=1,...,\infty}$.

Using the estimates $\tilde{\mu}^{(k)}$, $\big(\tilde{\sigma}\tilde{\sigma}^\top\big)^{(k)}$, $\tilde{b}^{(k)}$ and $\tilde{f}_k$, we can approximate $L^\lambda$ in the ARC algorithm by
$$\tilde{L}^\lambda_k := \frac{1}{2} w(m, \Sigma, k) \Bigg[ \sum_{j=1}^K \nu^\lambda_j(a)\, h''_j\big(m^\top x^{(j)}\big) \big(r_{kj}(\Sigma)\big)^2 + \frac{1}{\lambda}\Bigg( \sum_{j=1}^K \nu^\lambda_j(a) \Big(h'_j\big(m^\top x^{(j)}\big)\, r_{kj}(\Sigma)\Big)^2 - \bigg( \sum_{j=1}^K \nu^\lambda_j(a)\, h'_j\big(m^\top x^{(j)}\big)\, r_{kj}(\Sigma) \bigg)^{\!2}\, \Bigg) \Bigg], \tag{3.3}$$
where $r_{kj}(\Sigma) := \big(x^{(j)}\big)^\top \Sigma\, x^{(k)}$.
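For the logistic pricing model, where $h_k(y) = n_k c_k G'(y)$ (see Section 4), both $\tilde f$ in (3.2) and $\tilde L^\lambda$ in (3.3) can be evaluated in closed form. A vectorised sketch, with our own naming (`X` stacks the features $x^{(k)}$ as rows):

```python
import numpy as np

def f_and_L(a, m, Sigma, X, n, c, lam):
    """Evaluate f~ (3.2) and L~^lambda (3.3) for h_k(y) = n_k c_k G'(y),
    with G(y) = log(1 + e^y). X is (K, l); n, c are (K,) arrays."""
    s = 1.0 / (1.0 + np.exp(-(X @ m)))          # G'(m^T x^(k))
    h = n * c * s                               # f~_k = h_k(m^T x^(k))
    h1 = n * c * s * (1 - s)                    # h_k'
    h2 = n * c * s * (1 - s) * (1 - 2 * s)      # h_k''
    r = X @ Sigma @ X.T                         # r_kj(Sigma) = (x^(j))^T Sigma x^(k)
    V = s * (1 - s)                             # V(m^T x^(k)) for the canonical link
    w = n * V / (n * V * np.diag(r) + 1.0)      # w(m, Sigma, k)
    nu = np.exp((a - a.max()) / lam)
    nu /= nu.sum()                              # nu^lambda(a)
    term1 = (r ** 2) @ (nu * h2)                # sum_j nu_j h_j'' r_kj^2
    term2 = (r ** 2) @ (nu * h1 ** 2)           # sum_j nu_j (h_j' r_kj)^2
    term3 = (r @ (nu * h1)) ** 2                # (sum_j nu_j h_j' r_kj)^2
    L = 0.5 * w * (term1 + (term2 - term3) / lam)
    return h, L
```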
Step III: Use the ARC algorithm for the generalised linear bandit. Finally, to apply the ARC algorithm, one also needs to choose a rescaling function to encourage randomisation when estimates are uncertain. Here, we choose $\lambda_\rho(m, \Sigma) := \rho \|\Sigma\|$, using the Euclidean norm $\|\Sigma\| = \sqrt{\langle \Sigma; \Sigma \rangle}$. We refer to [5] for further discussion of the effect of this choice and alternatives.

The implementation of this algorithm is then given by the pseudocode of Algorithm 1. To run the ARC algorithm, we need to choose the parameters $\rho > 0$ and $\beta \in (0, 1)$: $\rho$ determines the randomness of our early choices, encouraging early exploration, while $\beta$ is a discount factor, which is used to value future rewards. These parameters can be chosen by the user.⁶

⁶ It is suggested by Russo [16], for a related approach, that $\beta = 1 - 1/T$, where $T$ is the total number of trials, might be an appropriate choice.

Algorithm 1 ARC Algorithm
Input: $\rho, \beta, m, \Sigma, T$
Let $\tilde{f}$ and $\tilde{L}^\lambda$ be the functions given in (3.2) and (3.3)
for $t = 1, ..., T$ do
    Define $\lambda = \rho \|\Sigma\|$
    Solve $a = \tilde{f}(m) + \big(\tfrac{\beta}{1-\beta}\big) \tilde{L}^\lambda(a, m, \Sigma)$ for $a$
    Sample $A_t$ with $\mathbb{P}(A_t = k) = \nu^\lambda_k(a)$
    Obtain observations $(N, Q)$ and collect the reward
    Update the posterior parameters $(m, \Sigma)$ as in (2.5)
end for
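Putting the pieces together, the following is a sketch of Algorithm 1 using the helper functions sketched above (`f_and_L`, `softmax_probs`, `glm_kalman_update`). The environment `observe` is a hypothetical callback returning a day's $\big(N, \sum_i Q_i\big)$ for the chosen arm; all names are ours.

```python
import numpy as np

def run_arc(observe, prices, n_mean, m, Sigma, rho, beta, T, seed=0, fp_iters=50):
    """Algorithm 1: ARC for the logistic pricing bandit (a sketch)."""
    rng = np.random.default_rng(seed)
    prices = np.asarray(prices, dtype=float)
    X = np.column_stack([np.ones(len(prices)), prices])   # x^(k) = (1, c_k)
    n = np.full(len(prices), float(n_mean))               # n_k = E(N^(k))
    total_reward = 0.0
    for t in range(T):
        lam = rho * np.linalg.norm(Sigma)                 # lambda = rho ||Sigma||_F
        a, _ = f_and_L(np.zeros(len(prices)), m, Sigma, X, n, prices, lam)
        for _ in range(fp_iters):                         # fixed point by iteration
            f, L = f_and_L(a, m, Sigma, X, n, prices, lam)
            a = f + (beta / (1.0 - beta)) * L
        k = rng.choice(len(prices), p=softmax_probs(a, lam))
        N, Q_sum = observe(k)                             # hypothetical environment
        total_reward += prices[k] * Q_sum
        m, Sigma = glm_kalman_update(m, Sigma, X[k], N, Q_sum)
    return total_reward
```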
It was shown in [5] that, for a smooth function $f$ and
$$\tilde{V}\big(\pi, \beta, (U_t)\big) := \mathbb{E}^\pi\Big[\sum_{t=1}^\infty \beta^{t-1} f(A_t, M_{t-1}, \Sigma_{t-1})\Big], \tag{3.4}$$
we have
$$\Big| \tilde{V}\big(\pi, \beta, (U^{ARC}_t)\big) - \sup_U \tilde{V}\big(\pi, \beta, (U_t)\big) \Big| = O\big(\|\Sigma_0\|\big).$$
In addition, it was also shown that $\|\Sigma_t\| \to 0$ as $t \to \infty$, $\mathbb{P}^\pi$-a.s. (given that $G$ is sufficiently nice).

For the multi-armed bandit problem, one may wish to consider the total discounted reward (2.1). However, when we do not have a prior-posterior conjugate pair, which is the case for this GLM framework, the equality between (2.1) and (3.4) cannot be guaranteed. Nonetheless, given that the batch size is relatively large, the difference between (2.1) and (3.4) is small.

4 Simulation using historical data

As the strategy followed determines the observations available, we cannot directly use historical data to test the algorithms. However, we can use the data to construct an environment in which to run tests. We take a Bayesian view and build a simple hierarchical model (with an improper uniform prior). Effectively, this assumes that our observations come from an exchangeable copy of the world we would deploy our bandits in, with the same (unknown) realised value of $\Theta$. We then use a Laplace approximation, as in Russo et al. [19, Chapter 5] or Chapelle and Li [4], to obtain a posterior sample to simulate the markets.
Remark 4.1. It is worth emphasising that, when implementing the algorithms in the simulation, we do not assume that the algorithms know the distribution that we use to simulate the parameter $\Theta$. Instead, the simulations are initialised with an almost uninformative prior.

To construct a posterior for $\Theta$ given historical data, we assume that we have a collection of observations $(q^{(1)}_i)_{i=1}^{r_1}, (q^{(2)}_i)_{i=1}^{r_2}, ..., (q^{(K)}_i)_{i=1}^{r_K}$ from an exponential family modelled by (2.2). We denote the corresponding log-likelihood function of $\Theta$ by $\ell\big(\theta; (q^{(k)}_i)\big)$.

Let $\hat{\theta}$ be the maximum likelihood estimator, i.e. $\hat{\theta} = \operatorname{arg\,max}_\theta \ell\big(\theta; (q^{(k)}_i)\big)$. Then we may approximate the log-likelihood function by
$$\ell\big(\theta; (q^{(k)}_i)\big) \approx \frac{1}{2} (\theta - \hat{\theta})^\top\, \partial^2_\theta \ell\big(\hat{\theta}; (q^{(k)}_i)\big)\, (\theta - \hat{\theta}) + c,$$
for $c$ independent of $\theta$. Therefore, starting from the uninformative (improper, uniform) prior, we can estimate the posterior of the parameter $\Theta$ by
$$\Theta \sim N\Big(\hat{\theta},\; \big(-\partial^2_\theta \ell\big(\hat{\theta}; (q^{(k)}_i)\big)\big)^{-1}\Big). \tag{4.1}$$
In particular, if the parameter for the exponential family is given in its canonical form, i.e. when $\phi$ in (2.2) is the identity, then the observed Fisher information is
$$-\partial^2_\theta \ell\big(\hat{\theta}; (q^{(k)}_i)\big) = \sum_{k=1}^K r_k\, G''\big(\hat{\theta}^\top x^{(k)}\big)\, x^{(k)} \big(x^{(k)}\big)^\top.$$
We will use the simulated values of $\Theta$ as the (hidden) realisations with which to test our algorithms.
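As a sketch, the posterior (4.1) for the logistic model can be computed from stage-one counts as follows (here `r[k]` customers saw price $c_k$ and `q[k]` of them subscribed; function and variable names are ours):

```python
import numpy as np
from scipy.optimize import minimize

def stage_one_posterior(prices, r, q):
    """MLE and Laplace posterior (4.1) for logit(p(c)) = theta_0 + theta_1 c."""
    X = np.column_stack([np.ones(len(prices)), prices])

    def neg_loglik(theta):
        eta = X @ theta
        # binomial log-likelihood: sum_k [ q_k eta_k - r_k log(1 + e^{eta_k}) ]
        return -np.sum(q * eta - r * np.logaddexp(0.0, eta))

    theta_hat = minimize(neg_loglik, x0=np.zeros(2)).x
    s = 1.0 / (1.0 + np.exp(-(X @ theta_hat)))
    # observed Fisher information: sum_k r_k G''(theta^T x^(k)) x^(k) x^(k)^T
    info = X.T @ (X * (r * s * (1.0 - s))[:, None])
    return theta_hat, np.linalg.inv(info)
```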
Finally, we can implement the multi-armed bandit algorithms to run a simulation of the dynamic online pricing problem.

Recall that in this model, at each time $t$, we need to choose $x^{(k)} = (1, c_k) \in \mathbb{R}^2$, where $c_k$ is the price of the product. We then observe the number of customers $N^{(k)}_t$ and whether each customer buys the product or not, in terms of binary random variables $\big(Q^{(k)}_{i,t}\big)_{i=1}^{N^{(k)}_t}$. The reward that we receive in each step is given by $R^{(k)}\big(N^{(k)}_t, Q^{(k)}_{1,t}, ..., Q^{(k)}_{N^{(k)}_t,t}\big) := c_k \sum_{i=1}^{N^{(k)}_t} Q^{(k)}_{i,t}$. We model $\big(Q^{(k)}_{i,t}\big)$ by a logistic model, i.e. $G(z) = \log(1 + e^z)$ and $\phi(z) = z$ for the functions $\phi$ and $G$ in (2.2). We can now write down the expected reward,
$$\mathbb{E}_\theta\Big(R^{(k)}\big(N^{(k)}_t, Q^{(k)}_{1,t}, ..., Q^{(k)}_{N^{(k)}_t,t}\big)\Big) = n_k c_k\, G'\big(\theta^\top x^{(k)}\big),$$
where $n_k = \mathbb{E}\big(N^{(k)}_t\big)$. In particular, the function $h_k$ in (3.2) and (3.3) is given by $h_k(y) = n_k c_k G'(y)$.

In stage one of their experiment, Dubé and Misra [6] randomly assigned one of ten experimental pricing cells to 7,867 different customers who reached ZipRecruiter's paywall. The exact numbers of customers that were assigned to each price are not reported; hence, we will assume that there are exactly 787 customers for each price. We then use their reported subscription rates to estimate the numbers of customers who decided to subscribe at each price.

Using this data, we apply (4.1) to infer an approximate posterior distribution
$$\Theta \,\big|\, \big(q^{(k)}_i\big) \sim N\big(\hat{\theta},\, \hat{\Sigma}\big), \tag{4.2}$$
where $\hat{\theta}$ is the maximum likelihood estimate computed from the stage-one data and $\hat{\Sigma}$ is the inverse of the corresponding observed Fisher information.

In order to compare performance, we will consider each algorithm over a period of one year (365 days) and only allow the agent to change the price at the end of each day. We assume that a common price must be shown to all customers on each day. We will also assume that the chosen price does not affect the number of customers reaching the paywall, i.e. we assume that $N^{(k)}_t \equiv N_t$.

We run $5 \times 10^3$ independent simulations of different market situations, where for each simulation the parameter $\Theta$ is sampled from (4.2). We also independently sample $(N_t)_{t=1}^{365} \sim \mathrm{IID\ Poisson}(270)$ to represent the number of visitors on each day.⁷ The simulated subscription probability and the simulated expected revenue per customer, for each price level, are illustrated in Figure 2.

⁷ In [6], it is observed that ZipRecruiter.com had roughly 8,000 visitors per month; hence, it is reasonable to assume that there are roughly 270 visitors per day.
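The following is a sketch of one simulated market: each run draws a hidden $\Theta$ from (4.2) and then generates a day's observations for the chosen price. Parameter names are ours; the returned callback matches the `observe` interface used in the ARC sketch above.

```python
import numpy as np

def make_market(theta_mean, theta_cov, prices, visitors_rate=270, seed=0):
    """Returns an observe(k) callback for one simulated market: the hidden
    Theta is drawn once from the posterior (4.2), then each call generates
    N ~ Poisson(visitors_rate) visitors who buy w.p. G'(Theta^T x^(k))."""
    rng = np.random.default_rng(seed)
    theta = rng.multivariate_normal(theta_mean, theta_cov)   # hidden parameter

    def observe(k):
        N = rng.poisson(visitors_rate)
        p = 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * prices[k])))
        return N, rng.binomial(N, p)                         # (N_t, sum_i Q_{i,t})

    return observe
```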
We apply each algorithm described in Section 3, with $m_0 = (0, 0)^\top$ and $\Sigma_0 = I$ as an initial prior for $\Theta$, and use (2.5) to propagate the prior.

To assess the performance of each algorithm, one often compares the cumulative pseudo-regret of each algorithm given the true parameter $\Theta$:
$$R\big(\theta, T, (A_t)\big) := \sum_{t=1}^T \Big( \max_k h_k(\theta) - h_{A_t}(\theta) \Big),$$
where $h_k(\theta) := \mathbb{E}\big[R^{(k)}\big(Y^{(k)}_t\big) \,\big|\, \Theta = \theta\big] = n c_k\, G'\big(\theta^\top x^{(k)}\big)$, $G(y) = \log(1 + e^y)$, $(A_t)$ is the sequence of actions that the algorithm chooses, and $n = \mathbb{E}\big(N^{(k)}_t\big)$.
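Given the realised $\theta$ and an algorithm's action sequence, the cumulative pseudo-regret above reduces to the following (a sketch, with $n = 270$ as in our simulation):

```python
import numpy as np

def cumulative_pseudo_regret(theta, prices, actions, n=270):
    """R(theta, T, (A_t)) with h_k(theta) = n c_k G'(theta^T x^(k))."""
    c = np.asarray(prices, dtype=float)
    h = n * c / (1.0 + np.exp(-(theta[0] + theta[1] * c)))   # h_k(theta)
    return np.cumsum(h.max() - h[np.asarray(actions)])
```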
Figure 2: (a) The average subscription rate in our simulation, with two standard deviation bands. (b) The average reward per customer in our simulation, with two standard deviation bands.

Figure 3 shows the mean, median, 0.75 quantile and 0.90 quantile of the cumulative pseudo-regret of each algorithm described in Section 3. We see that most algorithms outperform the classical UCB and UCB-tuned algorithms used in Misra et al. [14], which is unsurprising, as these approaches ignore the correlation between demand at different prices. We also see that the ARC algorithm outperforms all other algorithms, both on average and in extreme cases (as shown by the quantile plots).

In addition to the regret criteria, the agent may not want to change the displayed price too frequently. This is why the Explore-Then-Commit (ETC) algorithm is often preferred when considering a pricing problem. Figure 4 shows that the ARC and KG algorithms typically require a small number of price changes but still achieve reasonably low regret (as shown in Figure 3), whereas Thompson sampling and the UCB-type algorithms require a larger number of price changes.
Figure 3: (a) Mean, (b) median, (c) 0.75 quantile and (d) 0.90 quantile of the cumulative expected-expected pseudo-regret, plotted against the number of trials. Algorithms shown: ARC ($\rho \in \{0.1, 1, 5, 10\}$, $\beta = 0.997$); ETC ($\epsilon \in \{0.005, 0.01, 0.05, 0.1\}$); Bayes-UCB ($c \in \{0, 5\}$); classic UCB; UCB-tuned; $\epsilon$-greedy ($\epsilon \in \{0, 0.05, 0.1\}$); KG ($\beta = 0.997$); Thompson sampling; IDS.
Figure 4: (a) Mean, (b) median, (c) 0.75 quantile and (d) 0.90 quantile of the number of price changes, plotted against the number of trials, for the same algorithms and parameters as in Figure 3.
Acknowledgements:
Samuel Cohen thanks the Oxford-Man Institute for research support and acknowledges the support of The Alan Turing Institute under the Engineering and Physical Sciences Research Council grant EP/N510129/1. Tanut Treetanthiploet acknowledges support of the Development and Promotion of Science and Technology Talents Project (DPST) of the Government of Thailand.
References

[1] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078, 1995.

[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, pages 235–256, 2002.

[3] G. Burtini, J. Loeppky, and R. Lawrence. A survey of online experiment design with the stochastic multi-armed bandit. arXiv:1510.00757v4, 2015.

[4] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. Advances in Neural Information Processing Systems 24, pages 2249–2257, 2011.

[5] S. N. Cohen and T. Treetanthiploet. Asymptotic Randomised Control with applications to bandits. arXiv:2010.07252, 2020.

[6] J.-P. Dubé and S. Misra. Scalable price targeting. NBER Working Paper No. w23775, National Bureau of Economic Research, 2017.

[7] L. Fahrmeir. Posterior mode estimation by extended Kalman filtering for multivariate dynamic generalized linear models. Journal of the American Statistical Association, pages 501–509, 1992.

[8] S. Filippi, O. Cappé, A. Garivier, and C. Szepesvári. Parametric bandits: the generalized linear case. In NIPS'10: Proceedings of the 23rd International Conference on Neural Information Processing Systems, pages 586–594, 2010.

[9] J. C. Gittins and D. M. Jones. A dynamic allocation index for the sequential design of experiments. In J. Gani, editor, Progress in Statistics, pages 241–266. North-Holland, Amsterdam, 1974.

[10] E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In Artificial Intelligence and Statistics, pages 592–600, 2012.

[11] J. Kirschner and A. Krause. Information Directed Sampling and bandits with heteroscedastic noise. Proceedings of Machine Learning Research, pages 1–28, 2018.

[12] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, pages 4–22, 1985.

[13] T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2019.

[14] K. Misra, E. M. Schwartz, and J. Abernethy. Dynamic online pricing with incomplete information using multiarmed bandit experiments. Marketing Science, pages 226–252, 2019.

[15] P. Rusmevichientong and J. N. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, pages 395–411, 2009.

[16] D. Russo. A note on the equivalence of upper confidence bounds and Gittins indices for patient agents. arXiv:1904.04732, 2019.

[17] D. Russo and B. Van Roy. An information-theoretic analysis of Thompson Sampling. Journal of Machine Learning Research, pages 1–30, 2016.

[18] D. Russo and B. Van Roy. Learning to optimize via Information-Directed Sampling. Operations Research, pages 1–23, 2017.

[19] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen. A tutorial on Thompson Sampling. Foundations and Trends in Machine Learning, 11(1):1–96, 2018.

[20] I. O. Ryzhov, W. B. Powell, and P. I. Frazier. The Knowledge Gradient algorithm for a general class of online learning problems. Operations Research, pages 180–195, 2012.

[21] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.
Biometrika, pages 285–294, 1933.

[22] J. Vermorel and M. Mohri. Multi-armed bandit algorithms and empirical evaluation. In European Conference on Machine Learning (ECML), pages 437–448. Springer, 2005.