Expanding on Repeated Consumer Search Using Multi-Armed Bandits and Secretaries
Chan Tung Yu Marco
University of Hong Kong
December 24, 2020
Abstract
We seek to take a different approach to deriving the optimal search policy for the repeated consumer search model found in Fishman & Rob (1995), with the main motivation of dropping the assumption of prior knowledge of the price distribution F(p) in each period. We do this by incorporating the famous multi-armed bandit (MAB) problem. We start by modifying the MAB framework to fit the setting of the repeated consumer search model and formulate the objective as a dynamic optimization problem. Then, given any sequence of exploration, we assign a value to each store in that sequence using Bellman equations. We then break the problem down into individual optimal stopping problems for each period, which coincide with the framework of the famous secretary problem, for which we derive the optimal stopping policy. We will see that implementing the optimal stopping policy in each period solves the original dynamic optimization problem by 'forward induction' reasoning.

In the repeated consumer search model (Fishman & Rob, 1995), consumers face a market of stores selling the same product but with price dispersion. Consumers have the objective of maximizing their own surplus by purchasing a unit of the product at the lowest possible price, but they face the additional challenge of not knowing the price charged in each store. They can resolve this lack of information by searching (exploration), which incurs a search cost c. Hence the essence of the consumer's problem is to decide whether or not exploration is worthwhile. The setting is also dynamic: the consumer repeats this purchasing process in every time period until the terminal period. The dynamic setting is used to portray the potential for long-term business relationships between firms and consumers.

The model in Fishman & Rob (1995) proceeds with the assumption that consumers know the price distribution F(p) beforehand in each period.
∗ HKU Fall Semester 2020; ECON 6077: Topics in Economics Research I - Industrial Organization, final project.

We drop this assumption, with the goal of analyzing consumer search in a more general setting. By formulating the search model in the MAB framework, we hope to adopt and modify computational methods and results that provide further insight into a consumer's optimal search rule under relaxed assumptions. It should also be noted that this paper's analysis is focused on the consumer (demand) side. Further analysis of the producer side will be required in order to gain any insight into the implications this more general setting may have for the market equilibrium.
This section will give a brief overview and summary of relevant literature and existing models.
Here we introduce our starting point, the repeated purchase search model for a single good, more formally. It is a natural extension of the price dispersion model in Reinganum (1979), which models the equilibrium of firms subject to productivity shocks (represented by randomly distributed marginal costs) and consumers with imperfect information. (We give an overview of the MAB problem in a later section.)

The model's setup starts with a large number n of identical consumers with individual demand for the single good x = D(p). Each consumer purchases a unit of the good every period t = 1, 2, …. A consumer's surplus from purchasing at price p̂ at time t is

S(p̂) = ∫_{p̂}^{∞} D(p) dp.

Consumers are assumed to know only the distribution of prices charged by firms in the market, not the individual price charged by each firm. In each period, consumers are allowed to sequentially search and sample prices from firms they have not bought from in the previous period, at a cost c > 0. In contrast, a consumer can costlessly learn the current price of any seller she bought from in the last period; this can be interpreted as the consumer starting every new time period at the store she bought from in the last period. The idea is that consumers retain all information about sellers they previously bought from, but no information about firms they merely visited, or bought from more than a period ago.

The surplus of a consumer who searches n_t stores and buys at price p_t is given by S(p_t) − n_t c, yielding the sum of discounted surpluses

V(t) = Σ_{τ=t}^{∞} δ^{τ−t} [ S(p_τ) − n_τ c ]   (1)

with discount factor δ ∈ (0, 1).

On the supply side, each firm's marginal cost ω can take two distinct values {ω_L, ω_H} with ω_L ≤ ω_H. We also assume that the marginal cost of a firm satisfies the Markov property, that is,

β_H = P(ω_t = ω_H | ω_{t−1} = ω_H),
β_L = P(ω_t = ω_L | ω_{t−1} = ω_L).

For simplification, β = β_L = β_H is assumed, so β is the persistence probability of a firm maintaining its current productivity level.
That is, in every given period, a firm's state of being high or low cost is subject to change as dictated by β. This cost state is determined at the start of every period and remains unchanged throughout the entire period. Assuming the system has a steady state, it can be shown that the numbers of low-cost and high-cost firms will be equal regardless of the value of β.

Given the state of a firm, it can be shown that in equilibrium low-cost firms charge p_L = p_L^m and high-cost firms charge p_H = min{p^r, p_H^m}, where p^m is the profit-maximizing (monopoly) price and p^r is the (unknown) reservation price of consumers. (Each firm maximizes expected discounted profits Π_t = E[ Σ_{τ=t}^{∞} δ^{τ−t} D(p_τ)(p_τ − ω_τ) ].) It should be noted that the reservation price p^r is what dictates the consumers' search rule: if p_j > p^r, continue searching; otherwise, settle and purchase.

In a steady-state equilibrium, we assume that consumers engage in optimal sequential search to maximize their discounted net surplus (1). A steady-state equilibrium is then characterized by a (joint) price distribution F(p_1, p_2, …, p_M), given M firms in the market, that is unchanging over time and satisfies:

1. Given F(p), the consumer's search rule maximizes their discounted net surplus (1).
2. Given F(p), ω_j, the number of customers last period, and consumers' search rules, each firm maximizes its discounted expected profits.
3. The firms' profit maximization gives rise to the same F(p) given in 1. and 2.

(Note that F here can be formulated via a binomial distribution, given that the p_j are all independent and can be formulated as Bernoulli random variables.)

As in Reinganum (1979), it can be shown that p_L < p^r; that is, it is not worthwhile for consumers to keep searching until the lowest price p_L is found if they have already found a price sufficiently close to p_L. In equilibrium, p_H = p^r.
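The steady-state claim above (equal numbers of high- and low-cost firms for any β) can be checked directly: with symmetric persistence β, the two-state chain's stationary distribution is (1/2, 1/2). Below is a minimal illustrative sketch in Python (the paper's own code, in Appendix B, is in R); the function name and parameter values are ours:

```python
# Two-state cost process for a firm: index 0 = high cost, 1 = low cost.
# With persistence beta, the transition matrix is [[beta, 1-beta], [1-beta, beta]].
def stationary(beta, steps=10_000):
    """Iterate the state distribution forward until it converges."""
    dist = [1.0, 0.0]              # start with a firm known to be high cost
    for _ in range(steps):
        dist = [dist[0] * beta + dist[1] * (1 - beta),
                dist[0] * (1 - beta) + dist[1] * beta]
    return dist

for beta in (0.1, 0.5, 0.9):
    pi = stationary(beta)
    # Regardless of beta, the long-run split is half high cost, half low cost.
    assert abs(pi[0] - 0.5) < 1e-9 and abs(pi[1] - 0.5) < 1e-9
```

Whatever the persistence level, the symmetric transition probabilities wash out in the long run, which is why the cost (and hence price) split is half-and-half in the steady state.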
Writing out the expected surpluses for a consumer, together with the fact that at the steady state there is an equal number of high- and low-cost firms, we can formulate the system of equations

V̄ = −c + (1/2) V_H + (1/2) V_L
V_H = S(p^r) + δ [ β V_H + (1 − β) V_L ]
V_L = S(p_L) + δ [ β V_L + (1 − β) V_H ]   (2)

where V̄ is the expected discounted surplus from search continuation. At the reservation price, the consumer is indifferent between settling and searching, V̄ = V_H, which gives V_L − V_H = 2c; subtracting the last two equations of (2) gives V_L − V_H = (S(p_L) − S(p^r)) / (1 − δ(2β − 1)). Combining the two, we arrive at

S(p^r) = S(p_L) + (4βδ − 2δ − 2) c.   (3)

To understand this result, consider the two extreme cases, β = 1 and β = 0. When β = 1, all firms are 'fixed' (low-price firms will indefinitely offer low prices). Given that p^r is the current price on hand, the lifetime benefit of finding a low-price firm is

Σ_{τ=t}^{∞} δ^{τ−t} [ S(p_L) − S(p^r) ] = (S(p_L) − S(p^r)) / (1 − δ).   (4)

Also notice that when we substitute β = 1 into (3) we get

c = (1/2) · (S(p_L) − S(p^r)) / (1 − δ).   (5)

Recall that since there are equal numbers of high- and low-cost firms, the expected benefit of searching is 1/2 multiplied by (4). Hence, the consumer sets the reservation price p^r so as to equate the expected benefit of searching with the search cost c, as given by (5). The intuition is that search is most valuable when β = 1: since all firms are fixed, the consumer can immediately pin down the discounted lifetime benefit of all future expenditures when deciding between settling with the current price and searching.

On the other hand, when β = 0, there is no persistence of a firm's state across periods. Since firms then switch state every period, together with the fact that the numbers of low- and high-cost firms are half-and-half at the steady state, searching yields no lasting advantage and still incurs the search cost c.
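Under our reading of system (2), condition (3) can be sanity-checked numerically: solve the two Bellman equations for (V_H, V_L) as a 2×2 linear system and verify that, when S(p^r) is set according to (3), the indifference condition V_L − V_H = 2c holds. A hedged Python sketch (function name and all parameter values are illustrative placeholders, not from the paper):

```python
# Numerical check of equation (3): if S(p^r) = S(p_L) + (4*beta*delta - 2*delta - 2)*c,
# then the Bellman values from system (2) satisfy the indifference V_L - V_H = 2c.
def bellman_values(S_r, S_L, beta, delta):
    """Solve the 2x2 linear system for (V_H, V_L) by Cramer's rule:
       (1 - delta*beta) V_H - delta*(1-beta) V_L = S_r
      -delta*(1-beta) V_H + (1 - delta*beta) V_L = S_L"""
    a, b = 1 - delta * beta, -delta * (1 - beta)
    det = a * a - b * b
    V_H = (a * S_r - b * S_L) / det
    V_L = (a * S_L - b * S_r) / det
    return V_H, V_L

for beta in (0.0, 0.3, 0.7, 1.0):
    S_L, delta, c = 10.0, 0.95, 0.2                      # placeholder values
    S_r = S_L + (4 * beta * delta - 2 * delta - 2) * c   # equation (3)
    V_H, V_L = bellman_values(S_r, S_L, beta, delta)
    assert abs((V_L - V_H) - 2 * c) < 1e-9               # indifference holds
```

The check passes for every β, including the two extreme cases discussed above.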
For a consumer who has p^r on hand, were she to find a low-cost firm, her one-period benefit of search would be [S(p_L) − S(p^r)], which occurs with probability 1/2 given the equal number of high- and low-cost firms at the steady state. Accounting also for the certain reversal of states in the following period when β = 0, substituting β = 0 into (3) gives

c = (1/2) · (S(p_L) − S(p^r)) / (1 + δ),   (6)

and consumers similarly choose p^r to satisfy the indifference condition (6). (Note that p_H = p^r implies that consumers are indifferent between settling for p_H and continuing their search.)
We can see that, in the end, although the cost distribution is invariant to the persistence probability β, the price distribution is not.

The multi-armed bandit (MAB) problem is a classic reinforcement learning problem introduced by Thompson (1933); see Weng (2018) and Slivkins (2019) for overviews. The problem exemplifies the trade-off dilemma between exploration and exploitation (very much like our search theory models). 'Armed bandit' is a nickname given to casino slot machines, where one can pull the arm to receive a random reward (essentially a lottery).

The most basic framework of the MAB problem is that of 'stochastic bandits'. The problem is formulated as follows. There are K armed bandits, each of which provides a random reward r_j ∈ [0, 1] distributed by F_j, drawn i.i.d. (independently and identically distributed) over time, for j = 1, 2, …, K. Thus, the K armed bandits can be summarized by the vector of mean rewards (µ_1, …, µ_K).

Define A_t as the set of actions at time t, which describes the interaction with a single bandit at time t; in our case A_1 = A_2 = … = A_T, and an action a_t can be interpreted as the arm chosen to be pulled at time t. So, if action a_t is taken at time t on bandit j, then the 'action value' of a_t is the mean reward Q(a_t) = µ_j and the realized reward is π(a_t) = r_j, where π is the reward function. The agent's objective is to maximize her total reward over T rounds. Of course, in this model the agent is assumed not to know the reward distributions F_1, F_2, …, F_K. Thus the dilemma arises: in every round, the agent must choose between trying a different arm (exploration/search) and continuing to pull the same arm (exploitation).

The standard measure of performance of a strategy/algorithm in the MAB problem, or equivalently an alternative formulation of the MAB objective, is to minimize what is called 'regret'. Regret is measured using the best arm as a benchmark. Let µ* = max_j µ_j, that is, µ* is the highest mean reward (the mean reward of the optimal arm). Then define the regret R at time t as

R(t) = t µ* − Σ_{τ=1}^{t} Q(a_τ)   (7)

where Q(a) = E[π(a)] is the action value of a.

Generally, there are three ways the agent can go about forming a strategy:

1. No exploration (trivial case)
2. Random exploration
3. Strategic exploration with preferences over uncertainty

There are several famous algorithms (the ε-greedy algorithm, Upper Confidence Bounds (UCB)) the agent can choose to implement, each falling into one of the above three categories.
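To make the regret measure (7) concrete, the following sketch runs the ε-greedy algorithm (a random-exploration strategy from the list above) on a small set of Bernoulli arms and computes its regret. The function name, arm means, horizon, and ε are illustrative choices, not values from the paper:

```python
import random

def eps_greedy_regret(mus, T=10_000, eps=0.1, seed=0):
    """Run epsilon-greedy on Bernoulli arms with means `mus`;
    return the regret per (7): T*mu_star - sum of action values Q(a_t)."""
    rng = random.Random(seed)
    K = len(mus)
    pulls, means = [0] * K, [0.0] * K         # per-arm pull counts / sample means
    chosen_value = 0.0                         # running sum of Q(a_t) = mu_{a_t}
    for _ in range(T):
        if rng.random() < eps or min(pulls) == 0:
            a = rng.randrange(K)               # explore: pick a random arm
        else:
            a = max(range(K), key=lambda j: means[j])   # exploit: best sample mean
        r = 1.0 if rng.random() < mus[a] else 0.0       # Bernoulli reward draw
        pulls[a] += 1
        means[a] += (r - means[a]) / pulls[a]           # incremental mean update
        chosen_value += mus[a]
    return T * max(mus) - chosen_value

regret = eps_greedy_regret([0.3, 0.5, 0.7])
assert 0 <= regret < 10_000 * 0.4   # far below the regret of always pulling the worst arm
```

Because ε-greedy keeps exploring at a fixed rate, its regret grows linearly in T at rate roughly ε times the average suboptimality gap; the strategic-exploration algorithms such as UCB improve on this.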
The MAB problem can also be approached using dynamic programming methods (see McCall & McCall (1987) and Gittins (1974)), which will be the main approach used in this paper.

We can already see the many parallels between the MAB problem and the consumer search model, the main difference being that the agent has no information regarding the reward distributions of each bandit. Another striking difference is that the repeated purchase model (Fishman & Rob, 1995), as formulated in the MAB framework, can be seen as a more 'forgiving' version of the classical MAB, in that consumers can sample prices (at a search cost c) before committing, whereas in the classical MAB the agent would choose an arm and immediately receive whatever it offers.

First, we formulate the repeated search model (Fishman & Rob, 1995) using Bernoulli bandits, i.e. bandits whose state space is {0, 1}. The consumers can be represented by a single agent, and the M firms by M Bernoulli bandits (an M-armed Bernoulli bandit). For the store chosen at time t, denoted j(t), define the state of store j(t) as

x[j(t)] = 1 if store j(t) is low cost at time t, and 0 if it is high cost.   (8)

Recall that the state x[j(t)] follows a (stationary) Markov process, and we found that in a steady state the probability of finding a high- or low-cost store is 1/2 each. The reward function is then defined as

π(x[j(t)]) = S(p_L) if x[j(t)] = 1,  S(p_H) if x[j(t)] = 0.   (9)

That is, the reward function depends on the store chosen and its state at time t. In terms of the bandit framework, j(t) is the arm chosen at time t and the realized reward is r_j = π(x[j(t)]), which we will sometimes shorten to π[x(t)].

One other important item to mention is that, given the reward function formulation in (9), we are assuming that firms on the producer side behave optimally and that prices reflect their productivity state, as in Fishman & Rob (1995): low-cost firms offer p_L and high-cost firms offer p_H, with p_L < p_H.

Then, after incorporating the search cost c, the consumer's objective is the dynamic optimization problem

max_{ {n(t), j(t)}_{t=0}^{∞} } Σ_{t=0}^{∞} δ^t [ π(x[j(t)]) − c · n(t) ]   (10)

with δ ∈ (0, 1): choosing the store to purchase from, j(t), given the optimal number of stores searched, n(t), in each period t.

This formulation of the search problem as a MAB follows a form very similar to that in McCall & McCall (1987), where dynamic programming methods are combined with the Gittins index introduced in Gittins (1974), the decision rule being 'always play the bandit with the largest index'; we take a similar but different approach.

One major difference between our context and other MAB formulations is that untouched bandits do not 'freeze'. That is, the marginal cost of a store is still subject to change (via the Markov process) regardless of whether or not that store has been visited. Figure 1 depicts a visualization of the problem, where each black circle represents store j's (unobserved) state at time t, an orange circle represents a store whose state is observed at t, P_{x,y} is the transition probability, and c is the search cost. The blue arrows represent the Markov transitions of each store going forward in time, while the red arrow indicates search across stores within a period.

Figure 1: State space diagram
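The non-freezing feature of Figure 1 is straightforward to simulate. The sketch below (function and parameter names such as `simulate` and `policy_n` are ours, and all numeric values are hypothetical placeholders) lets every store's cost state evolve each period whether visited or not, and evaluates the realized discounted objective (10) for a naive fixed-sample-size search rule. It illustrates why the choice of n(t) matters: over-searching dissipates surplus through the cost c:

```python
import random

# Sketch of the market in Figure 1: M stores whose cost states all evolve
# every period (untouched bandits do not 'freeze'). Placeholder values throughout.
def simulate(policy_n, M=10, beta=0.7, T=200, S_L=10.0, S_H=6.0, c=0.5,
             delta=0.95, seed=1):
    """Realized discounted surplus (10) when the consumer samples `policy_n`
    additional stores each period and buys at the best store seen."""
    rng = random.Random(seed)
    states = [rng.random() < 0.5 for _ in range(M)]   # True = low cost (steady-state split)
    total, here = 0.0, 0                              # `here` = store we start at (free to observe)
    for t in range(T):
        sampled = {here} | set(rng.sample(range(M), policy_n))
        best = max(sampled, key=lambda j: states[j])  # prefer a low-cost store
        total += delta ** t * ((S_L if states[best] else S_H) - c * policy_n)
        here = best                                   # start at this store next period
        # every store's state transitions, visited or not
        states = [s if rng.random() < beta else not s for s in states]
    return total

assert simulate(policy_n=2) > simulate(policy_n=9)    # over-searching wastes the cost c
```

A modest amount of sampling already finds a low-cost store most periods, while sampling nearly the whole market pays c·n(t) every period for little additional gain.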
To reiterate the setup: there are M stores in the market, each of which is subject to changes in productivity via a Markov process represented by the transition matrix P. Given our two-state setup,

P = [ β_L      1 − β_L
      1 − β_H  β_H ].   (11)

(Although we will not be assigning each store a Gittins index, we will assign values to each store, as shown below; the policy we derive is also an optimal stopping rule. A reminder, too, that we are not considering a spatial model, so a consumer can travel to any single store, starting from any store, at the search cost c.)

While we drop the consumers' knowledge of the price distribution F(p) in each time period, we maintain the assumption that they are aware of the transition probabilities, i.e. the transition matrix P. This can be interpreted as consumers being aware of external market factors influencing the productivity shocks of stores. (Note that consumers' knowledge of the transition probabilities may lead them to derive the stationary distribution of prices, if there is one to be found. This, however, concerns an individual store and the probability distribution of that particular store's state at a given time t.)

Similar to Gittins (1974) and McCall & McCall (1987), to formulate the problem as a dynamic program the consumer assigns a value to each store. To see how, consider the following illustration. First, at the initial period t = 0, let the consumer's initial position be at one of the M stores; that is, she gets to explore the first store free of charge (this assumption smoothens the framework when formulating the Bellman equations). Then, before the start of each period, the consumer predetermines the order in which she explores any additional stores (any store explored after the initial store is considered 'additional'). Since the price distribution is unknown to her, the sequential order she chooses for exploration is inconsequential. Denote the order of the stores she plans to explore as {1, 2, …, M − 1}. Then, starting at t = 0, we define the value placed on each store j = 0, 1, 2, …, M − 1 as

V_0[x_0(0)] = π[x_0(0)] + δ Σ_{s∈{0,1}} P_{x_0(0),s} · V(s)
V_1[x_0(0), x_1(0)] = max{ π[x_0(0)], π[x_1(0)] } − c + δ Σ_{s∈{0,1}} P_{x*_1(0),s} · V(s)
⋮
V_{M−1}[x_0(0), …, x_{M−1}(0)] = max{ π[x_0(0)], …, π[x_{M−1}(0)] } − (M − 1)c + δ Σ_{s∈{0,1}} P_{x*_{M−1}(0),s} · V(s)   (12)

where the subscript j on V_j denotes the order in which that additional store was visited, i.e. the j-th additional store visited in the exploration order {1, 2, …, M − 1}; x_m(0) denotes the state of store m; and x*_m(0) = argmax_{k∈{0,1,…,m}} π[x_k(0)] at t = 0.

There are several important things to note regarding the formulation in (12). First, this formulation is valid for any period t. Next, the initial dynamic optimization problem in (10) required choices of both the number of stores to explore, n(t), and the store to purchase from, j(t), in each period t. By assigning values to each store as in (12), we have broken the problem down into a sequence of individual optimal stopping problems.

To see how, first notice that the values in (12) take in the stores observed thus far and provide the maximum expected value for the current period and the subsequent ones. (A short reminder: we assume the consumer can return to any store she has sampled at no cost, but only within that period.) At t = 0 (or any subsequent time period), the consumer is first faced with the choice of whether to settle for the value V_0 = V_0[x_0(0)] or go on to the first store in the predetermined search order {1, 2, …, M − 1} to realize the value V_1. By construction, once she moves on to V_1, there is no reverting back to the value V_0, as the search cost c is built into the value V_1. Similarly, she can choose whether to discard V_1 to realize V_2, and so on, up to V_{M−1}.
We now see that this formulation is precisely that of the famous secretary problem (also known as the marriage problem), popularized in the early 1960s by Martin Gardner. We will first formulate the optimal stopping problem at t = 0 and derive its optimal policy.

At the start of t = 0, the consumer is faced with the sequence of values {V_0, V_1, V_2, …, V_{M−1}}, of which only V_0 is known initially. Had the consumer been clairvoyant, she would be able to rank the stores by value from greatest to least. Alas, her objective is, given the predetermined exploration sequence {1, 2, …, M − 1}, to find an optimal stopping policy that maximizes her received value. The optimal stopping policy will take the following form: 'Pass through a certain number of stores, and after that pick the first store that yields the highest value so far.'

Before we proceed with the derivation, we introduce some notation and make some simplifications for convenience. For the purpose of deriving the optimal stopping policy, we restrict attention to the rank order of the stores' values V_j rather than the actual values themselves, so the consumer maximizing her probability of finding the best store is equivalent to maximizing her expected received value. Hence, for this section, the term 'value' will also refer to the probability of finding the best store.

• Let m be the number of stores visited/sampled, so the consumer starts with m = 1, at the store with value V_0.
• Let n be the number of stores not yet visited, so M = m + n.
• We say that a store is 'viable' if its value is the highest seen so far.
• Define Y_m to be the value when m stores have been visited and the m-th visited store has been discarded.
• Define U_m to be the value when m stores have been visited and the m-th store is viable.
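The cutoff form just described ('pass through some stores, then take the first viable one') can be validated by Monte Carlo simulation on randomly ordered values. A small illustrative sketch (the function name, trial count, and M are arbitrary choices of ours):

```python
import math, random

def success_prob(M, skip, trials=20_000, seed=0):
    """Monte Carlo probability that 'pass `skip` stores, then accept the
    first viable (best-so-far) store' selects the overall best of M values."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        vals = [rng.random() for _ in range(M)]
        best_seen = max(vals[:skip], default=-1.0)
        pick = vals[-1]                       # forced to settle for the last store
        for v in vals[skip:]:
            if v > best_seen:
                pick = v                      # first viable store after the pass
                break
        wins += pick == max(vals)
    return wins / trials

# With the classic cutoff near M/e, the success probability is close to 1/e.
p = success_prob(M=20, skip=7)
assert abs(p - 1 / math.e) < 0.05
```

Since only the rank order of the simulated values matters, the uniform draws stand in for any continuous value distribution, mirroring the rank-order simplification above.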
(Some remarks before the derivation. We use the dynamic programming approach found in Beckmann (1990), and the derivation can be applied to any arbitrary period t; see Ferguson et al. (1989) for a history and review of the secretary problem. On the indexing: if j is the index in the search order, then m = j + 1, so after exploring the (M−1)-th store we have m = M. The V(s) term inside the values of (12) reflects that next period's value involves zero additional searches, as the consumer is assumed to start at the store she purchased from last period. Finally, if after receiving a value V_j the consumer is convinced that V_j is very likely the highest value, then maximizing the probability of finding the highest value is indeed the same as maximizing her received value.)

Suppose m stores have been visited and the m-th store is not chosen. Then the value for the consumer is

Y_m = P{(m+1)-th store is not viable} · Y_{m+1} + P{(m+1)-th store is viable} · U_{m+1}
    = (m/(m+1)) Y_{m+1} + (1/(m+1)) U_{m+1},   (13)

since, in a random order, the (m+1)-th store is the best of the first m + 1 with probability 1/(m+1). Now consider the case that the m-th store is viable. There is a decision either to choose or discard it. If the m-th store is chosen, then the value is P{m-th is the best} = m/M, which can be interpreted as the 'search termination' value. Since Y_m has already been established, we have

U_m = max{ m/M, Y_m }.   (14)

Using backward induction, we initiate with the last store by setting m = M, so Y_M = 0 by definition (discarding the final store's value leaves zero probability of finding the best value) and U_M = max{1, 0} = 1. Then by (13),

Y_{M−1} = ((M−1)/M) Y_M + (1/M) U_M = 1/M   (15)

and

U_{M−1} = max{ (M−1)/M, 1/M } = (M−1)/M.   (16)

Continuing to m = M − 2,

Y_{M−2} = ((M−2)/(M−1)) Y_{M−1} + (1/(M−1)) U_{M−1} = ((M−2)/M)( 1/(M−2) + 1/(M−1) )   (17)

and

U_{M−2} = max{ (M−2)/M, ((M−2)/M)( 1/(M−2) + 1/(M−1) ) }.   (18)

A pattern starts to emerge: for m = M − n, n = 1, 2, …, M − 1,

Y_m = (m/M)( 1/m + 1/(m+1) + … + 1/(M−1) )
U_m = (m/M) max{ 1, 1/m + 1/(m+1) + … + 1/(M−1) }.   (19)

Deriving the Optimal Stopping Policy

From (19) we can see that (a) Σ_{k=m}^{M−1} 1/k is decreasing in m, and (b) whenever Σ_{k=m}^{M−1} 1/k < 1 we have U_m > Y_m; that is, the value of the m-th store being viable is greater than the value received by skipping the m-th store.

Let m = m* be a critical cutoff point such that the optimal policy chooses the first viable store m ≥ m*. As such, m = m* solves

Σ_{k=m*}^{M−1} 1/k ≤ 1 < Σ_{k=m*−1}^{M−1} 1/k.   (20)

That is, the (m*−1)-th store is the last store such that

((m*−1)/M) Σ_{k=m*−1}^{M−1} 1/k > (m*−1)/M.   (21)

The LHS of (21) is the value of rejecting the (m*−1)-th store, and the RHS is the probability that the (m*−1)-th store is the best. So m = m* is the first store at which the probability that the current store is the best is greater than or equal to the value of rejecting it. It then follows that Y_{m*−1} is the value received by the optimal stopping policy:

Y_{m*−1} = ((m*−1)/M) Σ_{k=m*−1}^{M−1} 1/k.   (22)

(See Appendix A for the derivation of (22).) It can also be shown that U_1 = Y_1 = U_2 = Y_2 = … = U_{m*−1} = Y_{m*−1}, by noting from (21) that Y_{m*−1} > (m*−1)/M, so U_{m*−1} = Y_{m*−1}, and similarly for all smaller m.

We can approximate the solution m* by using

∫_{m*}^{M} (1/x) dx ≈ Σ_{k=m*}^{M−1} 1/k ≤ 1,   (23)

ln(M/m*) ≈ 1 ⟹ m*/M ≈ 1/e.   (24)

That is, the optimal stopping policy's probability of finding the best store is approximately 1/e. (The 1/e solution is a well-known result, demonstrated in Derman (1970) and others; see Ferguson et al. (1989) for details on the history.)

Looking closer at (23), we notice that in fact

∫_{m*}^{M} (1/x) dx < Σ_{k=m*}^{M−1} 1/k ≤ 1,   (25)

so that

m* = ⌊M/e⌋ + 1 ⟹ M/m* < e,   (26)

where ⌊·⌋ is the floor function. Then from (23) we see that Y_m ≈ (m/M) ln(M/m).
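The backward induction in (13)-(16) is also easy to carry out exactly by computer, giving both the cutoff m* and the policy's value without the integral approximation (23). A Python sketch of the recursion (the function name and indexing conventions are ours):

```python
import math

def secretary_dp(M):
    """Backward induction on (13)-(14): returns (m_star, value), where
    m_star is the first m at which stopping on a viable store is optimal."""
    Y = [0.0] * (M + 1)            # Y[M] = 0: discarding the last store loses
    U = [0.0] * (M + 1)
    U[M] = 1.0
    for m in range(M - 1, 0, -1):
        Y[m] = (m / (m + 1)) * Y[m + 1] + (1 / (m + 1)) * U[m + 1]   # (13)
        U[m] = max(m / M, Y[m])                                      # (14)
    m_star = next(m for m in range(1, M + 1) if m / M >= Y[m])
    return m_star, Y[m_star - 1] if m_star > 1 else U[1]

m_star, value = secretary_dp(100)
# The cutoff sits near M/e and the policy's value is near 1/e, as in (24).
assert abs(m_star - 100 / math.e) < 2
assert abs(value - 1 / math.e) < 0.02
```

For M = 100 the recursion gives a cutoff of 38 with value ≈ 0.371, matching the 1/e asymptotics while remaining exact for finite M.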
To get a better understanding of the result (24), write Y_m as a function of m with parameter M,

Y(m; M) = (m/M) ln(M/m).   (27)

Figure 2 shows a visualization of how the optimal m value changes as M changes. From Figure 2 we can see that as m/M increases (say via M decreasing), the optimal cutoff m* decreases, hence the idea of accepting the first viable value from store m* onward.

Figure 2: Visualization of Y(m; M) across M values (curves Y(m; 2) and Y(m; 5) shown, with the reference level 1/e)

Table 1 shows the optimal stopping policy's m* cutoff values and Y values for varying numbers of stores M. We can see that the value Y tends to, and fluctuates around, 1/e for a high enough number of stores M.

Finally, we may conjecture that implementing this optimal stopping policy every period is indeed an optimal strategy. Looking back at the value formulation in (12), we can apply it to any arbitrary time period t. By doing so, if the consumer finds the optimal value in that period, denoted V*(t), it means that the current surplus gained from the store, plus the expected value next period, less the search cost, is the highest among all stores at t. Using 'forward induction' reasoning, we can see that, by virtue of the dynamic program formulation, the desirability of an action in the present is influenced by what may happen in the future. As a result, the optimal policy that maximizes the probability of finding the store with the maximum value V*(t) in every period t is indeed optimal overall.

Table 1: Optimal stopping policy values
M      M/e       m∗     m∗ − 1    Y
1      0.368     0      0         1
2      0.736     0      0         0.…
3      1.104     1      0         0.…
4      1.472     1      0         0.…
5      1.839     1      0         0.…
6      2.207     2      1         0.…
7      2.575     2      1         0.…
8      2.943     2      1         0.…
9      3.311     3      2         0.…
10     3.679     3      2         0.…
11     4.047     4      3         0.…
12     4.415     4      3         0.…
13     4.782     4      3         0.…
14     5.150     5      4         0.…
15     5.518     5      4         0.…
16     5.886     5      4         0.…
17     6.254     6      5         0.…
18     6.622     6      5         0.…
19     6.990     6      5         0.…
20     7.358     7      6         0.…
50     18.394    18     17        0.…
100    36.788    36     35        0.…
300    110.364   110    109       0.…

(See Appendix B for the R code used to obtain these values.)

In this paper, we focused solely on the consumers' optimal policy, without considering the effect they may have on the firm (store) side. It is reasonable to expect that consumers' behavior will have an impact on the actions of firms, so the next question is: given that consumers use the optimal policy we derived, will there be any changes in firms' price-setting behavior, compared to, say, the case in Fishman & Rob (1995)? In our analysis we maintained the assumption that stores' prices reflect their productivity state (high-cost stores charge higher prices). However, in Reinganum (1979) and Fishman & Rob (1995) this was a result of firms behaving optimally, given that they were able to deduce the consumers' reservation price p^r. Since our search policy does not use a reservation price, there is additional work to be done to see how firms will behave under the search policy derived here. To this end, one may consider using the 'adversarial bandit' variant of the MAB introduced by Auer et al. (2002).

We have also taken quite a different approach to deriving the consumer's optimal search rule. The search rule in Fishman & Rob (1995) is characterized by the reservation price p^r, where the rule is

'If p_j > p^r, continue searching; otherwise settle for p_j.'   (28)

compared to our optimal stopping rule for a given number of stores M,

'Pass through m* − 1 stores, after which pick the first store that yields the highest value so far.'   (29)

They are similar in that both policies are characterized by a cutoff: the reservation price p^r in Fishman & Rob (1995) and m* in ours. The basis of Fishman and Rob's search rule is to induce an indifference condition with the reservation price p^r, given a persistence probability β and search cost c. We have seen that changes in β indeed influence the consumer's choice, with consumers choosing lower reservation prices the more persistent the stores' states are. This is also reflected in the policy we derived.
The consumer picking a certain store for its value V_j is also a result of the transition/persistence probabilities, as the value takes the future expectation into account. To illustrate, suppose the reward the consumer receives, π[x_j(t)], is indeed high. If this store is likely to continue its current state into the future, this will be reflected in the value V as defined in (12). Conversely, if the store's state is highly volatile, this will also be reflected in V, affecting its desirability to the consumer.

In the end, we can see that although the policies were formulated very differently, they share similarities in what they do, in that both try to make the best decision based on current and expected future payoffs.

References

Auer, P., Cesa-Bianchi, N., Freund, Y., & Schapire, R. E. (2002). The nonstochastic multiarmed bandit problem.
SIAM Journal on Computing, 32(1), 48-77.

Beckmann, M. (1990). Dynamic programming and the secretary problem. Computers & Mathematics with Applications, (11), 25-28.

Derman, C. (1970). Finite state Markovian decision processes (Tech. Rep.).

Ferguson, T. S., et al. (1989). Who solved the secretary problem? Statistical Science, 4(3), 282-289.

Fishman, A., & Rob, R. (1995). The durability of information, market efficiency and the size of firms. International Economic Review, 19-36.

Gittins, J. (1974). A dynamic allocation index for the sequential design of experiments. Progress in Statistics, 241-266.

McCall, B. P., & McCall, J. J. (1987). A sequential study of migration and job search. Journal of Labor Economics, (4, Part 1), 452-476.

Reinganum, J. F. (1979). A simple model of equilibrium price dispersion. Journal of Political Economy, 87(4), 851-858.

Slivkins, A. (2019). Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272.

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285-294.

Weng, L. (2018). The multi-armed bandit problem and its solutions. lilianweng.github.io/lil-log. Retrieved from http://lilianweng.github.io/lil-log/2018/01/23/the-multi-armed-bandit-problem-and-its-solutions.html

Appendices

A  Value of the Optimal Stopping Policy
We can write the optimal stopping policy's value Y = Y_{m*−1} in terms of the cutoff m*:

Y = ((m*−1)/m*) Y_{m*} + (1/m*) U_{m*}
  = ((m*−1)/m*) (m*/M) Σ_{k=m*}^{M−1} 1/k + (1/m*) (m*/M)
  = ((m*−1)/M)( 1/m* + 1/(m*+1) + … + 1/(M−1) ) + 1/M
  = ((m*−1)/M)( 1/(m*−1) + 1/m* + … + 1/(M−1) )
  = ((m*−1)/M) Σ_{k=m*−1}^{M−1} 1/k,   (30)

where the second line of (30) uses the fact mentioned above that m* is the first store to satisfy Σ_{k=m*}^{M−1} 1/k ≤ 1, so that U_{m*} = max{ m*/M, (m*/M) Σ_{k=m*}^{M−1} 1/k } = m*/M.

B  R Code for Deriving Policy Values
Visit the following URL for the R code:
https://github.com/ctymarco/MABandSecretaries/blob/b1345a7d2c051129d98bcf82c081be1434756bc1/values.R