Expanding on Repeated Consumer Search Using Multi-Armed Bandits and Secretaries
Chan Tung Yu Marco
University of Hong Kong
December 24, 2020
Abstract
We seek to take a different approach to deriving the optimal search policy for the repeated consumer search model found in Fishman & Rob (1995), with the main motivation of dropping the assumption of prior knowledge of the price distribution F(p) in each period. We do this by incorporating the famous multi-armed bandit (MAB) problem. We start by modifying the MAB framework to fit the setting of the repeated consumer search model and formulate the objective as a dynamic optimization problem. Then, given any sequence of exploration, we assign a value to each store in that sequence using Bellman equations. We then break the problem down into individual optimal stopping problems for each period, which coincide with the framework of the famous secretary problem, for which we derive the optimal stopping policy. We will see that implementing the optimal stopping policy in each period solves the original dynamic optimization problem by 'forward induction' reasoning.

In the repeated consumer search model (Fishman & Rob, 1995), consumers face a market of stores selling the same product but with price dispersion. Consumers have the objective of maximizing their own surplus by purchasing a unit of the product at the lowest possible price, but they face the additional challenge of not knowing the price charged in each store. They can resolve this lack of information by searching (exploration), which incurs a search cost c. Hence the essence of the consumer's problem is to decide whether or not exploration is worthwhile. The setting is also dynamic: the consumer repeats this purchasing process in every time period until the terminal period. The dynamic setting is used to portray the potential for long-term business relationships between firms and consumers.

The model in Fishman & Rob (1995) proceeds with the assumption that consumers know the price distribution F(p) beforehand in each period.
∗ HKU Fall Semester 2020; ECON 6077: Topics in Economics Research I - Industrial Organization, final project.

We drop this assumption, with the goal of analyzing consumer search in a more general setting. By formulating the search model in the MAB framework, we hope to adopt and modify computational methods and results that provide further insight into a consumer's optimal search rule under relaxed assumptions. It should also be noted that this paper's analysis is focused on the consumer (demand) side. Further analysis of the producer side will be required in order to gain any insight into the implications this more general setting may have for the market equilibrium.
This section will give a brief overview and summary of relevant literature and existing models.
Here we introduce our starting point, the repeated purchase search model for a single good, more formally. It is a natural extension of the price dispersion model in Reinganum (1979), which models the equilibrium of firms subject to productivity shocks (represented by randomly distributed marginal costs) and consumers with imperfect information. (We give an overview of the MAB problem in a later section.)

The model's setup starts with a large number n of identical consumers with individual demand for the single good x = D(p). Each consumer purchases a unit of the good every period t = 1, 2, …. A consumer's surplus from purchasing at price p̂ at time t is

S(p̂) = ∫_{p̂}^{∞} D(p) dp.

Consumers are assumed to know only the distribution of prices charged by firms in the market, not the individual price charged by each firm. In each period, consumers are allowed to sequentially search and sample prices from firms they have not bought from in the previous period, at a cost c > 0. In contrast, a consumer can costlessly learn the current price of any seller she bought from in the last period; this can be interpreted as the consumer starting every new time period at the store she bought from in the last period. The idea is that consumers retain all information about sellers they previously bought from, but no information about firms they merely visited, or bought from more than a period ago.

The surplus of a consumer who searches n_t stores and buys at price p_t is given by S(p_t) − n_t c, yielding the sum of discounted surpluses

V(t) = Σ_{τ=t}^{∞} δ^{τ−t} [ S(p_τ) − n_τ c ]   (1)

with discount factor δ ∈ (0, 1).

On the supply side, each firm's marginal cost ω can take two distinct values {ω_L, ω_H} with ω_L ≤ ω_H. We also assume that the marginal cost of a firm satisfies the Markov property, that is,

β_H = P(ω_t = ω_H | ω_{t−1} = ω_H),
β_L = P(ω_t = ω_L | ω_{t−1} = ω_L).

For simplification, β = β_L = β_H is assumed, so β is the persistence probability of a firm maintaining its current productivity level.
That is, in every given period, a firm's state of being high or low cost is subject to change as dictated by β. This cost state is determined at the start of every period and remains unchanged throughout the entire period. Assuming the system has a steady state, it can be shown that the numbers of low-cost and high-cost firms will be equal regardless of the value of β.

Given the state of a firm, it can be shown that in equilibrium low-cost firms charge p_L = p_L^m and high-cost firms charge p_H = min{p^r, p_H^m}, where p^m is the profit-maximizing (monopoly) price and p^r is the (unknown) reservation price of consumers. (Each firm maximizes expected discounted profits Π_t = E[ Σ_{τ=t}^{∞} δ^{τ−t} D(p_τ)(p_τ − ω_τ) ].) It should be noted that the reservation price p^r is what dictates the consumers' search rule: if p_j > p^r, continue searching; otherwise, settle and purchase.

In a steady-state equilibrium, we assume that consumers engage in optimal sequential search to maximize their discounted net surplus (1). A steady-state equilibrium is then characterized by a (joint) price distribution F(p_1, p_2, …, p_M), given M firms in the market, that is unchanging over time and satisfies:

1. Given F(p), the consumer's search rule maximizes their discounted net surplus (1).
2. Given F(p), ω_j, the number of customers last period, and consumers' search rules, each firm maximizes its discounted expected profits.
3. The firms' profit maximization gives rise to the same F(p) given in 1. and 2.

(Note that F here can be formulated via a binomial distribution, given that the p_j are all independent and can be formulated as Bernoulli random variables.)

As in Reinganum (1979), it can be shown that p_L < p^r; that is, it is not worthwhile for consumers to keep searching until the lowest price p_L is found if they have already found a price sufficiently close to p_L. In equilibrium, p_H = p^r.
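The steady-state claim above (equal numbers of high- and low-cost firms for any β) can be checked directly: with symmetric persistence β, the two-state chain's stationary distribution is (1/2, 1/2). Below is a minimal illustrative sketch in Python (the paper's own code, in Appendix B, is in R); the function name and parameter values are ours:

```python
# Two-state cost process for a firm: index 0 = high cost, 1 = low cost.
# With persistence beta, the transition matrix is [[beta, 1-beta], [1-beta, beta]].
def stationary(beta, steps=10_000):
    """Iterate the state distribution forward until it converges."""
    dist = [1.0, 0.0]              # start with a firm known to be high cost
    for _ in range(steps):
        dist = [dist[0] * beta + dist[1] * (1 - beta),
                dist[0] * (1 - beta) + dist[1] * beta]
    return dist

for beta in (0.1, 0.5, 0.9):
    pi = stationary(beta)
    # Regardless of beta, the long-run split is half high cost, half low cost.
    assert abs(pi[0] - 0.5) < 1e-9 and abs(pi[1] - 0.5) < 1e-9
```

Whatever the persistence level, the symmetric transition probabilities wash out in the long run, which is why the cost (and hence price) split is half-and-half in the steady state.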
Writing out the expected surpluses for a consumer, together with the fact that at the steady state there is an equal number of high- and low-cost firms, we can formulate the system of equations

V̄ = −c + (1/2) V_H + (1/2) V_L
V_H = S(p^r) + δ [ β V_H + (1 − β) V_L ]
V_L = S(p_L) + δ [ β V_L + (1 − β) V_H ]   (2)

where V̄ is the expected discounted surplus from search continuation. At the reservation price, the consumer is indifferent between settling and searching, V̄ = V_H, which gives V_L − V_H = 2c; subtracting the last two equations of (2) gives V_L − V_H = (S(p_L) − S(p^r)) / (1 − δ(2β − 1)). Combining the two, we arrive at

S(p^r) = S(p_L) + (4βδ − 2δ − 2) c.   (3)

To understand this result, consider the two extreme cases, β = 1 and β = 0. When β = 1, all firms are 'fixed' (low-price firms will indefinitely offer low prices). Given that p^r is the current price on hand, the lifetime benefit of finding a low-price firm is

Σ_{τ=t}^{∞} δ^{τ−t} [ S(p_L) − S(p^r) ] = (S(p_L) − S(p^r)) / (1 − δ).   (4)

Also notice that when we substitute β = 1 into (3) we get

c = (1/2) · (S(p_L) − S(p^r)) / (1 − δ).   (5)

Recall that since there are equal numbers of high- and low-cost firms, the expected benefit of searching is 1/2 multiplied by (4). Hence, the consumer sets the reservation price p^r so as to equate the expected benefit of searching with the search cost c, as given by (5). The intuition is that search is most valuable when β = 1: since all firms are fixed, the consumer can immediately pin down the discounted lifetime benefit of all future expenditures when deciding between settling with the current price and searching.

On the other hand, when β = 0, there is no persistence of a firm's state across periods. Since firms then switch state every period, together with the fact that the numbers of low- and high-cost firms are half-and-half at the steady state, searching yields no lasting advantage and still incurs the search cost c.
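Under our reading of system (2), condition (3) can be sanity-checked numerically: solve the two Bellman equations for (V_H, V_L) as a 2×2 linear system and verify that, when S(p^r) is set according to (3), the indifference condition V_L − V_H = 2c holds. A hedged Python sketch (function name and all parameter values are illustrative placeholders, not from the paper):

```python
# Numerical check of equation (3): if S(p^r) = S(p_L) + (4*beta*delta - 2*delta - 2)*c,
# then the Bellman values from system (2) satisfy the indifference V_L - V_H = 2c.
def bellman_values(S_r, S_L, beta, delta):
    """Solve the 2x2 linear system for (V_H, V_L) by Cramer's rule:
       (1 - delta*beta) V_H - delta*(1-beta) V_L = S_r
      -delta*(1-beta) V_H + (1 - delta*beta) V_L = S_L"""
    a, b = 1 - delta * beta, -delta * (1 - beta)
    det = a * a - b * b
    V_H = (a * S_r - b * S_L) / det
    V_L = (a * S_L - b * S_r) / det
    return V_H, V_L

for beta in (0.0, 0.3, 0.7, 1.0):
    S_L, delta, c = 10.0, 0.95, 0.2                      # placeholder values
    S_r = S_L + (4 * beta * delta - 2 * delta - 2) * c   # equation (3)
    V_H, V_L = bellman_values(S_r, S_L, beta, delta)
    assert abs((V_L - V_H) - 2 * c) < 1e-9               # indifference holds
```

The check passes for every β, including the two extreme cases discussed above.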
For a consumer who has p^r on hand, were she to find a low-cost firm, her one-period benefit of search would be [S(p_L) − S(p^r)], which occurs with probability 1/2 given the equal number of high- and low-cost firms at the steady state. Accounting also for the certain reversal of states in the following period when β = 0, substituting β = 0 into (3) gives

c = (1/2) · (S(p_L) − S(p^r)) / (1 + δ),   (6)

and consumers similarly choose p^r to satisfy the indifference condition (6). (Note that p_H = p^r implies that consumers are indifferent between settling for p_H and continuing their search.)
We can see that, in the end, although the cost distribution is invariant to the persistence probability β, the price distribution is not.

The multi-armed bandit (MAB) problem is a classic reinforcement learning problem introduced by Thompson (1933); see Weng (2018) and Slivkins (2019) for overviews. The problem exemplifies the trade-off dilemma between exploration and exploitation (very much like our search theory models). 'Armed bandit' is a nickname given to casino slot machines, where one can pull the arm to receive a random reward (essentially a lottery).

The most basic framework of the MAB problem is that of 'stochastic bandits'. The problem is formulated as follows. There are K armed bandits, each of which provides a random reward r_j ∈ [0, 1] distributed by F_j, drawn i.i.d. (independently and identically distributed) over time, for j = 1, 2, …, K. Thus, the K armed bandits can be summarized by the vector of mean rewards (µ_1, …, µ_K).

Define A_t as the set of actions at time t, which describes the interaction with a single bandit at time t; in our case A_1 = A_2 = … = A_T, and an action a_t can be interpreted as the arm chosen to be pulled at time t. So, if action a_t is taken at time t on bandit j, then the 'action value' of a_t is the mean reward Q(a_t) = µ_j and the realized reward is π(a_t) = r_j, where π is the reward function. The agent's objective is to maximize her total reward over T rounds. Of course, in this model the agent is assumed not to know the reward distributions F_1, F_2, …, F_K. Thus the dilemma arises: in every round, the agent must choose between trying a different arm (exploration/search) and continuing to pull the same arm (exploitation).

The standard measure of performance of a strategy/algorithm in the MAB problem, or equivalently an alternative formulation of the MAB objective, is to minimize what is called 'regret'. Regret is measured using the best arm as a benchmark. Let µ* = max_j µ_j, that is, µ* is the highest mean reward (the mean reward of the optimal arm). Then define the regret R at time t as

R(t) = t µ* − Σ_{τ=1}^{t} Q(a_τ)   (7)

where Q(a) = E[π(a)] is the action value of a.

Generally, there are three ways the agent can go about forming a strategy:

1. No exploration (trivial case)
2. Random exploration
3. Strategic exploration with preferences over uncertainty

There are several famous algorithms (the ε-greedy algorithm, Upper Confidence Bounds (UCB)) the agent can choose to implement, each falling into one of the above three categories.
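To make the regret measure (7) concrete, the following sketch runs the ε-greedy algorithm (a random-exploration strategy from the list above) on a small set of Bernoulli arms and computes its regret. The function name, arm means, horizon, and ε are illustrative choices, not values from the paper:

```python
import random

def eps_greedy_regret(mus, T=10_000, eps=0.1, seed=0):
    """Run epsilon-greedy on Bernoulli arms with means `mus`;
    return the regret per (7): T*mu_star - sum of action values Q(a_t)."""
    rng = random.Random(seed)
    K = len(mus)
    pulls, means = [0] * K, [0.0] * K         # per-arm pull counts / sample means
    chosen_value = 0.0                         # running sum of Q(a_t) = mu_{a_t}
    for _ in range(T):
        if rng.random() < eps or min(pulls) == 0:
            a = rng.randrange(K)               # explore: pick a random arm
        else:
            a = max(range(K), key=lambda j: means[j])   # exploit: best sample mean
        r = 1.0 if rng.random() < mus[a] else 0.0       # Bernoulli reward draw
        pulls[a] += 1
        means[a] += (r - means[a]) / pulls[a]           # incremental mean update
        chosen_value += mus[a]
    return T * max(mus) - chosen_value

regret = eps_greedy_regret([0.3, 0.5, 0.7])
assert 0 <= regret < 10_000 * 0.4   # far below the regret of always pulling the worst arm
```

Because ε-greedy keeps exploring at a fixed rate, its regret grows linearly in T at rate roughly ε times the average suboptimality gap; the strategic-exploration algorithms such as UCB improve on this.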
The MAB problem can also be approached using dynamic programming methods (see McCall & McCall (1987) and Gittins (1974)), which will be the main approach used in this paper.

We can already see the many parallels between the MAB problem and the consumer search model, the main difference being that the agent has no information regarding the reward distributions of each bandit. Another striking difference is that the repeated purchase model (Fishman & Rob, 1995), as formulated in the MAB framework, can be seen as a more 'forgiving' version of the classical MAB, in that consumers can sample prices (at a search cost c) before committing, whereas in the classical MAB the agent would choose an arm and immediately receive whatever it offers.

First, we formulate the repeated search model (Fishman & Rob, 1995) using Bernoulli bandits, i.e. bandits whose state space is {0, 1}. The consumers can be represented by a single agent, and the M firms by M Bernoulli bandits (an M-armed Bernoulli bandit). For the store chosen at time t, denoted j(t), define the state of store j(t) as

x[j(t)] = 1 if store j(t) is low cost at time t, and 0 if it is high cost.   (8)

Recall that the state x[j(t)] follows a (stationary) Markov process, and we found that in a steady state the probability of finding a high- or low-cost store is 1/2 each. The reward function is then defined as

π(x[j(t)]) = S(p_L) if x[j(t)] = 1,  S(p_H) if x[j(t)] = 0.   (9)

That is, the reward function depends on the store chosen and its state at time t. In terms of the bandit framework, j(t) is the arm chosen at time t and the realized reward is r_j = π(x[j(t)]), which we will sometimes shorten to π[x(t)].

One other important item to mention is that, given the reward function formulation in (9), we are assuming that firms on the producer side behave optimally and that prices reflect their productivity state, as in Fishman & Rob (1995): low-cost firms offer p_L and high-cost firms offer p_H, with p_L < p_H.

Then, after incorporating the search cost c, the consumer's objective is the dynamic optimization problem

max_{ {n(t), j(t)}_{t=0}^{∞} } Σ_{t=0}^{∞} δ^t [ π(x[j(t)]) − c · n(t) ]   (10)

with δ ∈ (0, 1): choosing the store to purchase from, j(t), given the optimal number of stores searched, n(t), in each period t.

This formulation of the search problem as a MAB follows a form very similar to that in McCall & McCall (1987), where dynamic programming methods are combined with the Gittins index introduced in Gittins (1974), the decision rule being 'always play the bandit with the largest index'; we take a similar but different approach.

One major difference between our context and other MAB formulations is that untouched bandits do not 'freeze'. That is, the marginal cost of a store is still subject to change (via the Markov process) regardless of whether or not that store has been visited. Figure 1 depicts a visualization of the problem, where each black circle represents store j's (unobserved) state at time t, an orange circle represents a store whose state is observed at t, P_{x,y} is the transition probability, and c is the search cost. The blue arrows represent the Markov transitions of each store going forward in time, while the red arrow indicates search across stores within a period.

Figure 1: State space diagram
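The non-freezing feature of Figure 1 is straightforward to simulate. The sketch below (function and parameter names such as `simulate` and `policy_n` are ours, and all numeric values are hypothetical placeholders) lets every store's cost state evolve each period whether visited or not, and evaluates the realized discounted objective (10) for a naive fixed-sample-size search rule. It illustrates why the choice of n(t) matters: over-searching dissipates surplus through the cost c:

```python
import random

# Sketch of the market in Figure 1: M stores whose cost states all evolve
# every period (untouched bandits do not 'freeze'). Placeholder values throughout.
def simulate(policy_n, M=10, beta=0.7, T=200, S_L=10.0, S_H=6.0, c=0.5,
             delta=0.95, seed=1):
    """Realized discounted surplus (10) when the consumer samples `policy_n`
    additional stores each period and buys at the best store seen."""
    rng = random.Random(seed)
    states = [rng.random() < 0.5 for _ in range(M)]   # True = low cost (steady-state split)
    total, here = 0.0, 0                              # `here` = store we start at (free to observe)
    for t in range(T):
        sampled = {here} | set(rng.sample(range(M), policy_n))
        best = max(sampled, key=lambda j: states[j])  # prefer a low-cost store
        total += delta ** t * ((S_L if states[best] else S_H) - c * policy_n)
        here = best                                   # start at this store next period
        # every store's state transitions, visited or not
        states = [s if rng.random() < beta else not s for s in states]
    return total

assert simulate(policy_n=2) > simulate(policy_n=9)    # over-searching wastes the cost c
```

A modest amount of sampling already finds a low-cost store most periods, while sampling nearly the whole market pays c·n(t) every period for little additional gain.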
To reiterate the setup: there are M stores in the market, each of which is subject to changes in productivity via a Markov process represented by the transition matrix P. Given our two-state setup,

P = [ β_L      1 − β_L
      1 − β_H  β_H ].   (11)

(Although we will not be assigning each store a Gittins index, we will assign values to each store, as shown below; the policy we derive is also an optimal stopping rule. A reminder, too, that we are not considering a spatial model, so a consumer can travel to any single store, starting from any store, at the search cost c.)

While we drop the consumers' knowledge of the price distribution F(p) in each time period, we maintain the assumption that they are aware of the transition probabilities, i.e. the transition matrix P. This can be interpreted as consumers being aware of external market factors influencing the productivity shocks of stores. (Note that consumers' knowledge of the transition probabilities may lead them to derive the stationary distribution of prices, if there is one to be found. This, however, concerns an individual store and the probability distribution of that particular store's state at a given time t.)

Similar to Gittins (1974) and McCall & McCall (1987), to formulate the problem as a dynamic program the consumer assigns a value to each store. To see how, consider the following illustration. First, at the initial period t = 0, let the consumer's initial position be at one of the M stores; that is, she gets to explore the first store free of charge (this assumption smoothens the framework when formulating the Bellman equations). Then, before the start of each period, the consumer predetermines the order in which she explores any additional stores (any store explored after the initial store is considered 'additional'). Since the price distribution is unknown to her, the sequential order she chooses for exploration is inconsequential. Denote the order of the stores she plans to explore as {1, 2, …, M − 1}. Then, starting at t = 0, we define the value placed on each store j = 0, 1, 2, …, M − 1 as

V_0[x_0(0)] = π[x_0(0)] + δ Σ_{s∈{0,1}} P_{x_0(0),s} · V(s)
V_1[x_0(0), x_1(0)] = max{ π[x_0(0)], π[x_1(0)] } − c + δ Σ_{s∈{0,1}} P_{x*_1(0),s} · V(s)
⋮
V_{M−1}[x_0(0), …, x_{M−1}(0)] = max{ π[x_0(0)], …, π[x_{M−1}(0)] } − (M − 1)c + δ Σ_{s∈{0,1}} P_{x*_{M−1}(0),s} · V(s)   (12)

where the subscript j on V_j denotes the order in which that additional store was visited, i.e. the j-th additional store visited in the exploration order {1, 2, …, M − 1}; x_m(0) denotes the state of store m; and x*_m(0) = argmax_{k∈{0,1,…,m}} π[x_k(0)] at t = 0.

There are several important things to note regarding the formulation in (12). First, this formulation is valid for any period t. Next, the initial dynamic optimization problem in (10) required choices of both the number of stores to explore, n(t), and the store to purchase from, j(t), in each period t. By assigning values to each store as in (12), we have broken the problem down into a sequence of individual optimal stopping problems.

To see how, first notice that the values in (12) take in the stores observed thus far and provide the maximum expected value for the current period and the subsequent ones. (A short reminder: we assume the consumer can return to any store she has sampled at no cost, but only within that period.) At t = 0 (or any subsequent time period), the consumer is first faced with the choice of whether to settle for the value V_0 = V_0[x_0(0)] or go on to the first store in the predetermined search order {1, 2, …, M − 1} to realize the value V_1. By construction, once she moves on to V_1, there is no reverting back to the value V_0, as the search cost c is built into the value V_1. Similarly, she can choose whether to discard V_1 to realize V_2, and so on, up to V_{M−1}.
We now see that this formulation is precisely that of the famous secretary problem (also known as the marriage problem), popularized in the early 1960s by Martin Gardner. We will first formulate the optimal stopping problem at t = 0 and derive its optimal policy.

At the start of t = 0, the consumer is faced with the sequence of values {V_0, V_1, V_2, …, V_{M−1}}, of which only V_0 is known initially. Had the consumer been clairvoyant, she would be able to rank the stores by value from greatest to least. Alas, her objective is, given the predetermined exploration sequence {1, 2, …, M − 1}, to find an optimal stopping policy that maximizes her received value. The optimal stopping policy will take the following form: 'Pass through a certain number of stores, and after that pick the first store that yields the highest value so far.'

Before we proceed with the derivation, we introduce some notation and make some simplifications for convenience. For the purpose of deriving the optimal stopping policy, we restrict attention to the rank order of the stores' values V_j rather than the actual values themselves, so the consumer maximizing her probability of finding the best store is equivalent to maximizing her expected received value. Hence, for this section, the term 'value' will also refer to the probability of finding the best store.

• Let m be the number of stores visited/sampled, so the consumer starts with m = 1, at the store with value V_0.
• Let n be the number of stores not yet visited, so M = m + n.
• We say that a store is 'viable' if its value is the highest seen so far.
• Define Y_m to be the value when m stores have been visited and the m-th visited store has been discarded.
• Define U_m to be the value when m stores have been visited and the m-th store is viable.
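The cutoff form just described ('pass through some stores, then take the first viable one') can be validated by Monte Carlo simulation on randomly ordered values. A small illustrative sketch (the function name, trial count, and M are arbitrary choices of ours):

```python
import math, random

def success_prob(M, skip, trials=20_000, seed=0):
    """Monte Carlo probability that 'pass `skip` stores, then accept the
    first viable (best-so-far) store' selects the overall best of M values."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        vals = [rng.random() for _ in range(M)]
        best_seen = max(vals[:skip], default=-1.0)
        pick = vals[-1]                       # forced to settle for the last store
        for v in vals[skip:]:
            if v > best_seen:
                pick = v                      # first viable store after the pass
                break
        wins += pick == max(vals)
    return wins / trials

# With the classic cutoff near M/e, the success probability is close to 1/e.
p = success_prob(M=20, skip=7)
assert abs(p - 1 / math.e) < 0.05
```

Since only the rank order of the simulated values matters, the uniform draws stand in for any continuous value distribution, mirroring the rank-order simplification above.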
(Some remarks before the derivation. We use the dynamic programming approach found in Beckmann (1990), and the derivation can be applied to any arbitrary period t; see Ferguson et al. (1989) for a history and review of the secretary problem. On the indexing: if j is the index in the search order, then m = j + 1, so after exploring the (M−1)-th store we have m = M. The V(s) term inside the values of (12) reflects that next period's value involves zero additional searches, as the consumer is assumed to start at the store she purchased from last period. Finally, if after receiving a value V_j the consumer is convinced that V_j is very likely the highest value, then maximizing the probability of finding the highest value is indeed the same as maximizing her received value.)

Suppose m stores have been visited and the m-th store is not chosen. Then the value for the consumer is

Y_m = P{(m+1)-th store is not viable} · Y_{m+1} + P{(m+1)-th store is viable} · U_{m+1}
    = (m/(m+1)) Y_{m+1} + (1/(m+1)) U_{m+1},   (13)

since, in a random order, the (m+1)-th store is the best of the first m + 1 with probability 1/(m+1). Now consider the case that the m-th store is viable. There is a decision either to choose or discard it. If the m-th store is chosen, then the value is P{m-th is the best} = m/M, which can be interpreted as the 'search termination' value. Since Y_m has already been established, we have

U_m = max{ m/M, Y_m }.   (14)

Using backward induction, we initiate with the last store by setting m = M, so Y_M = 0 by definition (discarding the final store's value leaves zero probability of finding the best value) and U_M = max{1, 0} = 1. Then by (13),

Y_{M−1} = ((M−1)/M) Y_M + (1/M) U_M = 1/M   (15)

and

U_{M−1} = max{ (M−1)/M, 1/M } = (M−1)/M.   (16)

Continuing to m = M − 2,

Y_{M−2} = ((M−2)/(M−1)) Y_{M−1} + (1/(M−1)) U_{M−1} = ((M−2)/M)( 1/(M−2) + 1/(M−1) )   (17)

and

U_{M−2} = max{ (M−2)/M, ((M−2)/M)( 1/(M−2) + 1/(M−1) ) }.   (18)

A pattern starts to emerge: for m = M − n, n = 1, 2, …, M − 1,

Y_m = (m/M)( 1/m + 1/(m+1) + … + 1/(M−1) )
U_m = (m/M) max{ 1, 1/m + 1/(m+1) + … + 1/(M−1) }.   (19)

Deriving the Optimal Stopping Policy

From (19) we can see that (a) Σ_{k=m}^{M−1} 1/k is decreasing in m, and (b) whenever Σ_{k=m}^{M−1} 1/k < 1 we have U_m > Y_m; that is, the value of the m-th store being viable is greater than the value received by skipping the m-th store.

Let m = m* be a critical cutoff point such that the optimal policy chooses the first viable store m ≥ m*. As such, m = m* solves

Σ_{k=m*}^{M−1} 1/k ≤ 1 < Σ_{k=m*−1}^{M−1} 1/k.   (20)

That is, the (m*−1)-th store is the last store such that

((m*−1)/M) Σ_{k=m*−1}^{M−1} 1/k > (m*−1)/M.   (21)

The LHS of (21) is the value of rejecting the (m*−1)-th store, and the RHS is the probability that the (m*−1)-th store is the best. So m = m* is the first store at which the probability that the current store is the best is greater than or equal to the value of rejecting it. It then follows that Y_{m*−1} is the value received by the optimal stopping policy:

Y_{m*−1} = ((m*−1)/M) Σ_{k=m*−1}^{M−1} 1/k.   (22)

(See Appendix A for the derivation of (22).) It can also be shown that U_1 = Y_1 = U_2 = Y_2 = … = U_{m*−1} = Y_{m*−1}, by noting from (21) that Y_{m*−1} > (m*−1)/M, so U_{m*−1} = Y_{m*−1}, and similarly for all smaller m.

We can approximate the solution m* by using

∫_{m*}^{M} (1/x) dx ≈ Σ_{k=m*}^{M−1} 1/k ≤ 1,   (23)

ln(M/m*) ≈ 1 ⟹ m*/M ≈ 1/e.   (24)

That is, the optimal stopping policy's probability of finding the best store is approximately 1/e. (The 1/e solution is a well-known result, demonstrated in Derman (1970) and others; see Ferguson et al. (1989) for details on the history.)

Looking closer at (23), we notice that in fact

∫_{m*}^{M} (1/x) dx < Σ_{k=m*}^{M−1} 1/k ≤ 1,   (25)

so that

m* = ⌊M/e⌋ + 1 ⟹ M/m* < e,   (26)

where ⌊·⌋ is the floor function. Then from (23) we see that Y_m ≈ (m/M) ln(M/m).
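The backward induction in (13)-(16) is also easy to carry out exactly by computer, giving both the cutoff m* and the policy's value without the integral approximation (23). A Python sketch of the recursion (the function name and indexing conventions are ours):

```python
import math

def secretary_dp(M):
    """Backward induction on (13)-(14): returns (m_star, value), where
    m_star is the first m at which stopping on a viable store is optimal."""
    Y = [0.0] * (M + 1)            # Y[M] = 0: discarding the last store loses
    U = [0.0] * (M + 1)
    U[M] = 1.0
    for m in range(M - 1, 0, -1):
        Y[m] = (m / (m + 1)) * Y[m + 1] + (1 / (m + 1)) * U[m + 1]   # (13)
        U[m] = max(m / M, Y[m])                                      # (14)
    m_star = next(m for m in range(1, M + 1) if m / M >= Y[m])
    return m_star, Y[m_star - 1] if m_star > 1 else U[1]

m_star, value = secretary_dp(100)
# The cutoff sits near M/e and the policy's value is near 1/e, as in (24).
assert abs(m_star - 100 / math.e) < 2
assert abs(value - 1 / math.e) < 0.02
```

For M = 100 the recursion gives a cutoff of 38 with value ≈ 0.371, matching the 1/e asymptotics while remaining exact for finite M.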
To get a better understanding of the result (24), write Y_m as a function of m with parameter M,

Y(m; M) = (m/M) ln(M/m).   (27)

Figure 2 shows a visualization of how the optimal m value changes as M changes. From Figure 2 we can see that as m/M increases (say via M decreasing), the optimal cutoff m* decreases, hence the idea of accepting the first viable value from store m* onward.

Figure 2: Visualization of Y(m; M) across M values (curves Y(m; 2) and Y(m; 5) shown, with the reference level 1/e)

Table 1 shows the optimal stopping policy's m* cutoff values and Y values for varying numbers of stores M. We can see that the value Y tends to, and fluctuates around, 1/e for a high enough number of stores M.

Finally, we may conjecture that implementing this optimal stopping policy every period is indeed an optimal strategy. Looking back at the value formulation in (12), we can apply it to any arbitrary time period t. By doing so, if the consumer finds the optimal value in that period, denoted V*(t), it means that the current surplus gained from the store, plus the expected value next period, less the search cost, is the highest among all stores at t. Using 'forward induction' reasoning, we can see that, by virtue of the dynamic program formulation, the desirability of an action in the present is influenced by what may happen in the future. As a result, the optimal policy that maximizes the probability of finding the store with the maximum value V*(t) in every period t is indeed optimal overall.

Table 1: Optimal stopping policy values
M      M/e       m∗     m∗ − 1    Y
1      0.368     0      0         1
2      0.736     0      0         0.…
3      1.104     1      0         0.…
4      1.472     1      0         0.…
5      1.839     1      0         0.…
6      2.207     2      1         0.…
7      2.575     2      1         0.…
8      2.943     2      1         0.…
9      3.311     3      2         0.…
10     3.679     3      2         0.…
11     4.047     4      3         0.…
12     4.415     4      3         0.…
13     4.782     4      3         0.…
14     5.150     5      4         0.…
15     5.518     5      4         0.…
16     5.886     5      4         0.…
17     6.254     6      5         0.…
18     6.622     6      5         0.…
19     6.990     6      5         0.…
20     7.358     7      6         0.…
50     18.394    18     17        0.…
100    36.788    36     35        0.…
300    110.364   110    109       0.…

(See Appendix B for the R code used to obtain these values.)

In this paper, we focused solely on the consumers' optimal policy, without considering the effect they may have on the firm (store) side. It is reasonable to expect that consumers' behavior will have an impact on the actions of firms, so the next question is: given that consumers use the optimal policy we derived, will there be any changes in firms' price-setting behavior, compared to, say, the case in Fishman & Rob (1995)? In our analysis we maintained the assumption that stores' prices reflect their productivity state (high-cost stores charge higher prices). However, in Reinganum (1979) and Fishman & Rob (1995) this was a result of firms behaving optimally, given that they were able to deduce the consumers' reservation price p^r. Since our search policy does not use a reservation price, there is additional work to be done to see how firms will behave under the search policy derived here. To this end, one may consider using the 'adversarial bandit' variant of the MAB introduced by Auer et al. (2002).

We have also taken quite a different approach to deriving the consumer's optimal search rule. The search rule in Fishman & Rob (1995) is characterized by the reservation price p^r, where the rule is

'If p_j > p^r, continue searching; otherwise settle for p_j.'   (28)

compared to our optimal stopping rule for a given number of stores M,

'Pass through m* − 1 stores, after which pick the first store that yields the highest value so far.'   (29)

They are similar in that both policies are characterized by a cutoff: the reservation price p^r in Fishman & Rob (1995) and m* in ours. The basis of Fishman and Rob's search rule is to induce an indifference condition with the reservation price p^r, given a persistence probability β and search cost c. We have seen that changes in β indeed influence the consumer's choice, with consumers choosing lower reservation prices the more persistent the stores' states are. This is also reflected in the policy we derived.
The consumer picking a certain store for its value V_j is also a result of the transition/persistence probabilities, as the value takes the future expectation into account. To illustrate, suppose the reward the consumer receives, π[x_j(t)], is indeed high. If this store is likely to continue its current state into the future, this will be reflected in the value V as defined in (12). Conversely, if the store's state is highly volatile, this will also be reflected in V, affecting its desirability to the consumer.

In the end, we can see that although the policies were formulated very differently, they share similarities in what they do, in that both try to make the best decision based on current and expected future payoffs.

References

Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem.
SIAM Journal on Computing, 32(1), 48-77.

Beckmann, M. (1990). Dynamic programming and the secretary problem. Computers & Mathematics with Applications, (11), 25-28.

Derman, C. (1970). Finite state Markovian decision processes (Tech. Rep.).

Ferguson, T. S., et al. (1989). Who solved the secretary problem? Statistical Science, 4(3), 282-289.

Fishman, A., & Rob, R. (1995). The durability of information, market efficiency and the size of firms. International Economic Review, 19-36.

Gittins, J. (1974). A dynamic allocation index for the sequential design of experiments. Progress in Statistics, 241-266.

McCall, B. P., & McCall, J. J. (1987). A sequential study of migration and job search. Journal of Labor Economics, (4, Part 1), 452-476.

Reinganum, J. F. (1979). A simple model of equilibrium price dispersion. Journal of Political Economy, 87(4), 851-858.

Slivkins, A. (2019). Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272.

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285-294.

Weng, L. (2018). The multi-armed bandit problem and its solutions. lilianweng.github.io/lil-log. Retrieved from http://lilianweng.github.io/lil-log/2018/01/23/the-multi-armed-bandit-problem-and-its-solutions.html

Appendices

A  Value of the Optimal Stopping Policy
We can write the optimal stopping policy's value Y = Y_{m*−1} in terms of the cutoff m*:

Y = ((m*−1)/m*) Y_{m*} + (1/m*) U_{m*}
  = ((m*−1)/m*) (m*/M) Σ_{k=m*}^{M−1} 1/k + (1/m*) (m*/M)
  = ((m*−1)/M)( 1/m* + 1/(m*+1) + … + 1/(M−1) ) + 1/M
  = ((m*−1)/M)( 1/(m*−1) + 1/m* + … + 1/(M−1) )
  = ((m*−1)/M) Σ_{k=m*−1}^{M−1} 1/k,   (30)

where the second line of (30) uses the fact mentioned above that m* is the first store to satisfy Σ_{k=m*}^{M−1} 1/k ≤ 1, so that U_{m*} = max{ m*/M, (m*/M) Σ_{k=m*}^{M−1} 1/k } = m*/M.

B  R Code for Deriving Policy Values
Visit the following URL for the R code:
https://github.com/ctymarco/MABandSecretaries/blob/b1345a7d2c051129d98bcf82c081be1434756bc1/values.R