Nearly Bounded Regret of Re-solving Heuristics in Price-based Revenue Management

Yining Wang
Warrington College of Business, University of Florida

September 8, 2020
Abstract
Price-based revenue management is an important class of problems in operations management. In its simplest form, a retailer sells a single product over $T$ consecutive time periods, subject to a constraint on the initial inventory level. While the optimal pricing policy over $T$ periods can be obtained via dynamic programming, such an approach is sometimes undesirable because of its enormous computational cost. Approximately optimal policies, such as the re-solving heuristic, are often applied as computationally tractable alternatives. In this paper, we prove the following results:

1. We prove that a popular and commonly used re-solving heuristic attains an $O(\ln\ln T)$ regret compared to the value of the optimal DP pricing policy. This improves the $O(\ln T)$ regret upper bound established in the prior work of Jasin (2014).

2. We prove that there is an $\Omega(\ln T)$ gap between the value of the optimal DP pricing policy and that of a static LP relaxation. This complements our upper bound result in showing that the static LP relaxation is not an adequate information-relaxed benchmark when analyzing price-based revenue management algorithms.

Keywords: re-solving, self-adjusting controls, price-based revenue management, dynamic pricing
1 Introduction

We consider the simplest example of price-based revenue management, in which the retailer sells a single product repeatedly over $T$ consecutive time periods, subject to an initial inventory level constraint. More specifically, let $f: [0,1] \to [\underline{d}, \bar{d}]$ be a fixed demand rate function that is monotonically decreasing, and let $x_T \in (\underline{d}, \bar{d})$ be an inventory ratio parameter. The price-based revenue management model consists of $T$ consecutive selling periods, with an initial inventory level of $y_0 = x_T T$. At time $t$, the retailer sets a price $p_t \in [0,1]$; the instantaneous demand $d_t$, revenue $r_t$, and remaining inventory level $y_t$ are governed by the following model:
\[ d_t = f(p_t) + \xi_t, \qquad r_t = p_t \min\{d_t, y_{t-1}\}, \qquad y_t = \max\{0,\, y_{t-1} - d_t\}, \tag{1} \]
where $\xi_1, \cdots, \xi_T \overset{\mathrm{i.i.d.}}{\sim} Q$ are i.i.d. centered additive noise variables.

The retailer's objective is to design an admissible pricing policy $\pi$ to maximize his/her expected revenue over $T$ periods. A pricing policy $\pi$ is admissible if the advertised price $p_t$ at time $t$ is decided based only on the inventory level at the beginning of the $t$-th time period. Mathematically, an admissible policy $\pi$ can be parameterized as $\pi = (\pi_1, \cdots, \pi_T)$, where $\pi_t$ is a certain random function that maps from $y_{t-1}$ to $p_t \in [0,1]$. The expected revenue of a policy $\pi$ can then be written as
\[ R^{\pi}(T, y_0) := \mathbb{E}^{\pi}\Bigl[ \sum_{t=1}^{T} r_t \,\Big|\, p_t \sim \pi_t(y_{t-1}) \Bigr]. \tag{2} \]

An optimal policy $\pi^*$ maximizing $R^{\pi}(T, x_T T)$ defined in Eq. (2) can in principle be obtained via dynamic programming. Such an approach, however, is computationally very expensive because there are an infinite number of states (inventory levels) and actions (advertised prices). Although discretization is possible, it is not an exact solution and soon becomes intractable when the discretization grid becomes too dense. Furthermore, with multiple products for sale (e.g., network revenue management) the numbers of states and prices grow exponentially, and the approach is therefore intractable.

The seminal work of Gallego and Van Ryzin (1994) proposed a useful and easy-to-compute benchmark for understanding and developing approximately optimal dynamic pricing control protocols. Suppose the inverse function of $f$ exists and let $x^* = \arg\max_{x \in [\underline{d}, \bar{d}]} r(x)$, where $r(x) = x f^{-1}(x)$. The following results are established in (Gallego and Van Ryzin 1994):

Theorem 1 (Gallego and Van Ryzin (1994)). For any admissible policy $\pi$ and $y_0 = x_T T$,
\[ R^{\pi}(T, y_0) \le T r(\min\{x_T, x^*\}). \]
Furthermore, for the static pricing policy $\pi^s: p_t \equiv f^{-1}(\min\{x_T, x^*\})$,
\[ R^{\pi^s}(T, y_0) \ge T r(\min\{x_T, x^*\}) - O(\sqrt{T}). \]

It has been an interesting question to further improve the $O(\sqrt{T})$ gap in Theorem 1 by considering more sophisticated yet still computationally efficient dynamic pricing strategies. In the work of Jasin (2014) the gap is reduced from $O(\sqrt{T})$ to $O(\ln T)$, as shown by the following result:

Theorem 2 (Jasin (2014)). Let $\pi^r = (\pi^r_1, \cdots, \pi^r_T)$ be the re-optimizing pricing strategy defined as
\[ p_t = f^{-1}\bigl(\max(\underline{d}, \min\{y_{t-1}/(T-t+1),\, x^*\})\bigr). \]
Then $R^{\pi^r}(T, y_0) \ge T r(\min\{x_T, x^*\}) - O(\ln T)$, where $y_0 = x_T T$.
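To make the policy in Theorem 2 concrete, the following Python sketch simulates one sales horizon of the model (1) under the re-solving rule. The linear demand curve, the uniform noise distribution, and all numeric parameters are illustrative choices of ours, not prescribed by the model.

```python
import numpy as np

def simulate_resolving(T, x_T, f, f_inv, d_lo, x_star, rng):
    """Simulate one sales horizon of the re-solving policy in Theorem 2.

    With y_{t-1} units and T - t + 1 periods left, the policy prices at
    f^{-1}(max(d_lo, min(y_{t-1} / (T - t + 1), x_star))), i.e., it
    re-solves the static relaxation at every period.
    """
    y = x_T * T                                   # initial inventory y_0 = x_T * T
    revenue = 0.0
    for t in range(1, T + 1):
        z = max(d_lo, min(y / (T - t + 1), x_star))   # clipped demand target
        p = f_inv(z)                              # posted price
        d = f(p) + rng.uniform(-0.05, 0.05)       # realized demand, Eq. (1)
        revenue += p * min(d, y)                  # collect revenue r_t
        y = max(0.0, y - d)                       # update inventory
    return revenue

# Illustrative (assumed) linear demand: f(p) = 0.75 - 0.5 p on p in [0, 1],
# so r(x) = x (1.5 - 2 x) is maximized at x* = 3/8.
f = lambda p: 0.75 - 0.5 * p
f_inv = lambda z: (0.75 - z) / 0.5
rng = np.random.default_rng(0)
print(simulate_resolving(T=1000, x_T=5 / 16, f=f, f_inv=f_inv,
                         d_lo=0.25, x_star=3 / 8, rng=rng))
```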
[Figure 1 appears here: a schematic comparing four benchmarks (optimal DP, static LP relaxation, re-solving, stationary policy) and the gaps between them: $O(\sqrt{T})$ (Gallego & van Ryzin 1994, Manage. Sci.), $O(\ln T)$ (Jasin 2014, Oper. Res.), $O(\ln\ln T)$ (this paper), and $\Omega(\ln T)$ (this paper).]
Figure 1: Graphical illustration of our results and comparison with existing results.

Although Theorems 1 and 2 are often studied in the context of network revenue management, in which more than one product is present, for even the simplest single-product case it remains open whether the $O(\ln T)$ gap in Theorem 2 can be further reduced. This question is answered from both the upper bound and the lower bound sides in this paper, as we present in the next section.

1.1 Our results and contributions

In this paper we establish two main results: a nearly bounded regret guarantee for the re-solving heuristic, and a logarithmic lower bound on the gap between the value of the optimal pricing policy and the static LP relaxation. Figure 1 summarizes the results established in this paper (in red) and compares them with existing results in the prior literature (in blue).

Our first main result, stated as Theorem 3 later in this paper, asserts that for any fixed $x_T \in (\underline{d}, x^*)$ the cumulative regret of the re-solving heuristic $\pi^r$ is upper bounded by an iterated logarithmic term $O(\ln\ln T)$, compared against the expected reward of the optimal dynamic pricing policy $\pi^*$. Apart from the obvious improvement from $O(\ln T)$ to $O(\ln\ln T)$ in asymptotic regret upper bounds, our proof technique differs from existing works, which compare the expected reward of the re-solving policy to a certain information-relaxed benchmark, such as the static LP solution or the hindsight optimum benchmark. In contrast, because most benchmarks in the price-based revenue management setting are likely to be loose, we compare the value of $\pi^r$ directly with the value of the optimal DP policy $\pi^*$ by carefully analyzing the demand correction structures in $\pi^*$.

Our second main result, stated as Theorem 4 later in this paper, shows that there is an $\Omega(\ln T)$ lower bound on the gap between the expected revenue of the re-solving heuristic $\pi^r$ and the static LP relaxation benchmark $T r(\min\{x^*, x_T\})$. Coupled with the $O(\ln\ln T)$ regret upper bound established in Theorem 3, this shows that there is an $\Omega(\ln T)$ lower bound on the gap between the value of the optimal policy $\pi^*$ and the static LP relaxation as well. This demonstrates the fundamental limitation of analyses conducted against the static LP relaxation or similar criteria: these information-relaxed benchmarks give the pricing policy too much information ahead of time and are therefore too loose for price-based revenue management problems.

1.2 Related work

The most relevant prior research to our paper is the work of Jasin (2014), who studied the network revenue management problem and showed that a re-optimization heuristic attains an $O(\ln T)$ asymptotic regret upper bound under mild conditions. Jasin (2014) also shows that infrequent re-solving has similar theoretical performance guarantees and is much more computationally efficient. In this paper, we improve the regret of frequent re-solving to $O(\ln\ln T)$, which is an iterated logarithmic term in the time horizon $T$ and is very close to bounded. Our analysis differs from the one in (Jasin 2014) in that we directly compare the expected revenue of re-solving with the value of the optimal DP policy, instead of a static LP relaxation. Additionally, we complement our results with an $\Omega(\ln T)$ lower bound between the expected revenue of the optimal DP policy and the static LP relaxation.

The idea of using simple, easy-to-compute pricing policies to approximate the optimal dynamic pricing strategy originates from the works of Gallego and Van Ryzin (1994, 1997), who studied static price policies.
Maglaras and Meissner (2006) showed that frequent re-solving does not diminish revenue asymptotically. Chen and Farias (2013) studied a single-product pricing problem under a specific class of demand models, and showed that re-optimization strictly improves the asymptotic performance compared to static price strategies. Due to modeling differences, the results in (Chen and Farias 2013) are not directly comparable with our setting. More specifically, Chen and Farias (2013) studied a market-size stochastic process that models inter-temporal correlations and non-stationarity in demand. As a result, the model in Chen and Farias (2013) is harder, and therefore weaker performance guarantees are derived. In the model of (Chen and Farias 2013) the competitive ratio of the static price strategy is $O(1/\ln T)$ while the competitive ratio of re-optimization is around 0.5 when properly tuned; in contrast, in the model considered in our paper (independent and stationary demands) the static price strategy has a $1 - O(1/\sqrt{T})$ competitive ratio while re-optimization has a $1 - O(\ln\ln T/T)$ competitive ratio.

Re-solving has also been studied in several other settings such as quantity-based revenue management, for example in (Reiman and Wang 2008, Cooper 2002, Secomandi 2008, Jasin 2014, Jasin and Kumar 2013, Bumpensanti and Wang 2020, Wu et al. 2015). The quantity-based revenue management model exhibits some quite different structures from the price-based model we study, such as the fact that re-optimization has the potential of lowering the expected revenue, and the possibility of achieving bounded regret (i.e., $O(1)$ regret) by using hindsight-optimum (HO) benchmarks. Vera et al. (2019) studied a price-based revenue management model with a finite set of candidate prices.

A related yet significantly different problem is dynamic pricing with demand learning, in which the underlying demand rate function is unknown and needs to be learnt in the pricing process. Some representative recent works include (Besbes and Zeevi 2009, Keskin and Zeevi 2014, Wang et al. 2014, Besbes and Zeevi 2015, Broder and Rusmevichientong 2012, Cheung et al. 2017, Lei et al. 2014), and many more. In contrast, in this paper the retailer is assumed to have full information about the underlying demand distributions. Because of the retailer's full information about the demand function, our lower bounds/negative results are proved using completely different techniques from the lower bounds in (Broder and Rusmevichientong 2012, Wang et al. 2019), which rely on the retailer's lack of knowledge about the underlying demand function.

1.3 Assumptions

We make the following standard assumptions throughout this paper.

1. The demand rate function $f: [0,1] \to [\underline{d}, \bar{d}]$ satisfies $f(0) = \bar{d}$ and $f(1) = \underline{d} > 0$, and admits an inverse $g = f^{-1}$. Furthermore, there exists a constant $C < \infty$ such that $|f(p) - f(p')| \le C|p - p'|$ for all $p, p' \in [0,1]$, and $|g(d) - g(d')| \le C|d - d'|$ for all $d, d' \in [\underline{d}, \bar{d}]$.

2. The expected revenue $r(d) = d f^{-1}(d)$, as a function of the demand rate $d$, is concave and three times continuously differentiable, with $\sup_d |r'(d)| + |r''(d)| + |r'''(d)| < \infty$. Furthermore, there exist constants $0 < m \le M < \infty$ such that $m \le -r''(d) \le M$ for all $d \in [\underline{d}, \bar{d}]$.

3. The noise variables $\{\xi_t\}_{t=1}^{T}$ are i.i.d. sampled from an underlying distribution $Q$. Furthermore, $\mathbb{E}_Q[\xi_t] = 0$, $|\xi_t| \le B_\xi$ almost surely for some constant $0 < B_\xi \le \underline{d}$, and $\mathbb{E}_Q[\xi_t^2] > 0$. Note that $B_\xi \le \underline{d}$ also ensures that the realized demands are non-negative almost surely.
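For concreteness, the following linear demand family (an illustrative example we add here) satisfies Assumptions 1 and 2 with explicit constants; pairing it with any centered noise supported on $[-\underline{d}, \underline{d}]$ with positive variance satisfies Assumption 3 as well:

```latex
f(p) = \bar{d} - (\bar{d} - \underline{d})\,p, \qquad
g(x) = f^{-1}(x) = \frac{\bar{d} - x}{\bar{d} - \underline{d}}, \qquad
C = \max\Bigl\{\bar{d} - \underline{d},\ \tfrac{1}{\bar{d} - \underline{d}}\Bigr\},
\\[4pt]
r(x) = x\,g(x) = \frac{x(\bar{d} - x)}{\bar{d} - \underline{d}}, \qquad
-r''(x) \equiv \frac{2}{\bar{d} - \underline{d}}
\;\Longrightarrow\; m = M = \frac{2}{\bar{d} - \underline{d}}, \qquad
x^* = \max\Bigl\{\underline{d},\ \tfrac{\bar{d}}{2}\Bigr\}.
```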
2 Nearly bounded regret of the re-solving heuristic

When the (normalized) initial inventory level $x_T$ exceeds the optimal demand rate $x^* = \arg\max_{x \in [\underline{d}, \bar{d}]} r(x)$, it is easy to verify that the stationary policy $\pi^s: p_t \equiv f^{-1}(x^*)$ has constant regret for sufficiently large $T$.

Proposition 1.
Suppose $x_T \in (x^*, \bar{d})$ and let $\pi^s: p_t \equiv f^{-1}(x^*)$ be the stationary pricing policy defined in Theorem 1. Then for sufficiently large $T$,
\[ R^{\pi^s}(T, x_T T) \ge T r(x^*) - O(1) \ge R^{\pi^*}(T, x_T T) - O(1). \]

Proof of Proposition 1. Define $\mathcal{F} := \{\forall t,\ \sum_{\tau=1}^{t} \xi_\tau \le T(x_T - x^*)\}$ to be the event that the initial inventory is not completely depleted throughout the $T$ selling periods (under $\pi^s$ the expected demand is $x^*$ per period, so on $\mathcal{F}$ the total demand after any $t \le T$ periods is at most $t x^* + T(x_T - x^*) \le T x_T$). By Hoeffding's inequality, for every $t$ it holds that $\Pr[\sum_{\tau=1}^{t} \xi_\tau > T(x_T - x^*)] \le O(\exp\{-T^2(x_T - x^*)^2/t\}) \le O(\exp\{-T(x_T - x^*)^2\})$. With a union bound over all $t = 1, 2, \cdots, T$, we have $\Pr[\mathcal{F}] \ge 1 - O(T e^{-\Omega(T)})$. Subsequently, with the definition of $\bar{\xi} = \frac{1}{T}\sum_{t=1}^{T}\xi_t$ and $\mathbb{E}[\bar{\xi}] = 0$, $|\bar{\xi}| \le B_\xi = O(1)$ a.s., we have
\[ R^{\pi^s}(T, x_T T) \ge \mathbb{E}\Bigl[\sum_{t=1}^{T} r_t 1\{\mathcal{F}\}\Bigr] \ge T r(x^*)\Pr[\mathcal{F}] - T\bigl|\mathbb{E}[\bar{\xi} 1\{\mathcal{F}\}]\bigr| = T r(x^*)\Pr[\mathcal{F}] - T\bigl|\mathbb{E}[\bar{\xi} 1\{\mathcal{F}^c\}]\bigr| \ge T r(x^*)\bigl(1 - O(T e^{-\Omega(T)})\bigr) - O(T^2 e^{-\Omega(T)}) = T r(x^*) - O(1), \]
which is to be proved.
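As a quick numerical sanity check of Proposition 1 (our own illustration, reusing the linear demand curve assumed in the earlier sketch), the average regret of $\pi^s$ against $T r(x^*)$ stays bounded as $T$ grows:

```python
import numpy as np

# Stationary policy p_t = f^{-1}(x*) with ample inventory (x_T > x*).
f = lambda p: 0.75 - 0.5 * p            # illustrative linear demand (assumed)
g = lambda x: (0.75 - x) / 0.5          # inverse demand f^{-1}
x_star, x_T, T, runs = 3 / 8, 0.5, 2000, 200
p = g(x_star)                           # fixed price throughout
rng = np.random.default_rng(1)

regrets = []
for _ in range(runs):
    y, revenue = x_T * T, 0.0
    for _ in range(T):
        d = f(p) + rng.uniform(-0.05, 0.05)    # bounded centered noise
        revenue += p * min(d, y)
        y = max(0.0, y - d)
    regrets.append(T * x_star * p - revenue)   # regret vs. T r(x*)
print(np.mean(regrets))                 # stays O(1) as T grows (Proposition 1)
```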
The case of insufficient inventory, $x_T \in (\underline{d}, x^*)$, on the other hand, is much more complicated. The stationary policy $\pi^s: p_t \equiv f^{-1}(x_T)$ typically suffers $\Omega(\sqrt{T})$ regret. On the other hand, the work of Jasin (2014) established that the regret of $\pi^r$ when measured against $T r(x_T)$ is at most $O(\ln T)$. Our next theorem improves the regret to the iterated logarithm by switching from the static LP benchmark $T r(x_T)$ directly to the expected revenue of the optimal DP policy.

Theorem 3. Suppose $x_T \in (\underline{d}, x^*)$ and let $\pi^r$ be the re-solving policy defined in Theorem 2. Let also $\pi^*$ be the optimal DP pricing policy. For sufficiently large $T$, it holds that
\[ R^{\pi^r}(T, x_T T) \ge R^{\pi^*}(T, x_T T) - O(\ln\ln T). \]

Theorem 3 is the main result of this section and its proof is given in Sec. 4. Note that instead of comparing with the static LP benchmark $T r(x_T)$, Theorem 3 compares the value of $\pi^r$ directly with the optimal DP pricing policy $\pi^*$, allowing for tighter regret bounds. On the other hand, the $O(\ln\ln T)$ regret gap does not hold when compared against the $T r(x_T)$ benchmark, as we shall establish in the next section.

3 A logarithmic gap from the static LP relaxation

In this section, we show that the regret of the re-solving policy $\pi^r$ measured against the static LP benchmark $T r(x_T)$ (in the insufficient inventory case) is at least logarithmic, $\Omega(\ln T)$.

Theorem 4.
Suppose $x_T \in (\underline{d}, x^*)$ and let $\pi^r$ be the re-solving policy defined in Theorem 2. For sufficiently large $T$, it holds that
\[ R^{\pi^r}(T, x_T T) \le T r(x_T) - \Omega(\ln T). \]

Theorem 4 is the main result of this section and is proved later in Sec. 4. Because Theorem 3 shows that $R^{\pi^*}(T, x_T T)$ exceeds $R^{\pi^r}(T, x_T T)$ by at most $O(\ln\ln T)$, Theorem 4 implies that there is a logarithmic lower bound $\Omega(\ln T)$ on the gap between the value of the optimal DP pricing policy and the static LP relaxation as well. In the prior works of (Bumpensanti and Wang 2020, Vera et al. 2019), benchmarks weaker than the static LP relaxation are considered too, such as the "hindsight optimum" benchmark, which assumes the pricing policy has knowledge of the average realized demands in later time periods. In the appendix of this paper we show that a popular version of the hindsight optimum benchmark has $O(1)$ regret when measured against the static LP benchmark $T r(x_T)$, and is therefore also $\Omega(\ln T)$ away from the value of the optimal DP policy.

4 Proofs of main results

To present our proofs we first define some notations. Let $\varphi^*_t(x) = R^{\pi^*}(t, xt)$ and $\varphi^r_t(x) = R^{\pi^r}(t, xt)$ be the expected cumulative revenues of the optimal DP pricing policy $\pi^*$ and the re-solving policy $\pi^r$, respectively. For $\tau \le T$, let $x^*_\tau$ and $x^r_\tau$ be the random variables of the normalized inventory levels under policies $\pi^*$ and $\pi^r$ when there are $\tau$ time periods remaining. Let also $p_\tau, z_\tau, \xi_\tau$ be the price, expected demand, and stochastic demand noise when there are $\tau$ periods remaining. These notations are summarized in Table 1, with some additional notations defined later as the proof proceeds.

Table 1: Notations used in the proof.
- $\varphi^*_t(x) = R^{\pi^*}(t, xt)$: reward of $\pi^*$ with $t$ periods and $xt$ units of inventory.
- $\varphi^r_t(x) = R^{\pi^r}(t, xt)$: reward of re-solving with $t$ periods and $xt$ units of inventory.
- $z_\tau = f(p_\tau)$: the expected demand at time $\tau$.
- $\xi_\tau \overset{\mathrm{i.i.d.}}{\sim} Q$, $|\xi_\tau| \le B_\xi$ a.s., $\mathbb{E}[\xi_\tau] = 0$: the stochastic demand noise at time $\tau$.
- $x^*_\tau, x^r_\tau$: remaining inventory divided by $\tau$; normalized inventory levels under policies $\pi^*$ and $\pi^r$.
- $\Delta_\tau$ (see Eq. (3)): the optimal demand correction with $\tau$ periods remaining.
- $\Delta_{\to t} = \frac{\Delta_T}{T-1} + \frac{\Delta_{T-1}}{T-2} + \cdots + \frac{\Delta_{t+1}}{t}$: harmonic series of demand corrections up to $t$.
- $\xi_{\to t} = \frac{\xi_T}{T-1} + \frac{\xi_{T-1}}{T-2} + \cdots + \frac{\xi_{t+1}}{t}$: harmonic series of demand noises up to $t$.
- $T^\sharp$ (see Eq. (4)): stopping time such that $\{x^*_\tau, x^r_\tau\}_{\tau \ge T^\sharp}$ are well-behaved.

The rest of this section is organized as follows. In the first two subsections we establish some properties of the optimal policy $\pi^*$ and the re-solving policy $\pi^r$. More specifically, we establish upper and lower bounds on the expected rewards $\varphi^*_t(\cdot), \varphi^r_t(\cdot)$ using the key quantities $\{\Delta_{\to t}\}$ (harmonic series of optimal demand corrections), $\{\xi_{\to t}\}$ (harmonic series of stochastic noise variables), and $T^\sharp$ (a carefully defined stopping time ensuring that the process is well-behaved before $T^\sharp$). We then proceed with the proofs of Theorems 3 and 4 by carefully analyzing the differences in the expansions of $\varphi^*_t(\cdot), \varphi^r_t(\cdot)$.
4.1 Properties of the optimal policy $\pi^*$

For any $\tau \ge 1$ and $x^*_\tau \ge \underline{d}/\tau$, the value of the optimal policy $\pi^*$ is defined by the following value iteration formula:
\[ \varphi^*_\tau(x^*_\tau) = \max_{\Delta}\Bigl\{ r(x^*_\tau + \Delta) + \mathbb{E}\Bigl[\varphi^*_{\tau-1}\Bigl(x^*_\tau - \frac{\Delta + \xi_\tau}{\tau-1}\Bigr)\Bigr] \Bigr\} = r(x^*_\tau + \Delta_\tau) + \mathbb{E}[\varphi^*_{\tau-1}(x^*_{\tau-1})], \tag{3} \]
where $x^*_{\tau-1} = x^*_\tau - \frac{\Delta_\tau + \xi_\tau}{\tau-1}$, and the maximization over $\Delta$ is subject to the constraint that $x^*_\tau + \Delta \in [\underline{d}, \bar{d}]$. The random variable $\Delta_\tau$ is thus defined as the maximizing value of $\Delta$, which in turn depends on the random variable $x^*_\tau$.

For any $t < T$ let
\[ \Delta_{\to t} := \frac{\Delta_T}{T-1} + \frac{\Delta_{T-1}}{T-2} + \cdots + \frac{\Delta_{t+1}}{t} \qquad \text{and} \qquad \xi_{\to t} := \frac{\xi_T}{T-1} + \frac{\xi_{T-1}}{T-2} + \cdots + \frac{\xi_{t+1}}{t}. \]
(For $t = T$, $\Delta_{\to t} = \xi_{\to t} = 0$.) Define the stopping time $T^\sharp$ as
\[ T^\sharp := \max\Bigl[ \bigl\{\lceil \ln^2 T \rceil\bigr\} \cup \bigl\{ t : \max\bigl(|\Delta_{\to t} + \xi_{\to t}|,\ |\xi_{\to t}|\bigr) > \min(x_T - \underline{d},\ x^* - x_T)/2 \bigr\} \Bigr], \tag{4} \]
where $x_T \in (\underline{d}, x^*)$ is the normalized initial inventory level (i.e., the initial inventory level is $x_T T$) and $x^* = \arg\max_{x \in [\underline{d}, \bar{d}]} r(x)$ is the optimal demand rate without inventory considerations. Intuitively, $T^\sharp$ is the first time mark at which the remaining inventory level is either too low or too high. It is easy to verify that, as $t$ runs over $T, T-1, \cdots, 1$, the random variable $T^\sharp$ is a stopping time, since its defining condition at $t$ only depends on $\Delta_{\to t}$ and $\xi_{\to t}$, both of which are available at time $t$. It then holds that
\[ x^*_\tau = x_T - \Delta_{\to\tau} - \xi_{\to\tau}, \qquad \forall \tau \ge T^\sharp, \tag{5} \]
where $x^*_\tau \cdot \tau$ is the random variable of the total remaining inventory when there are $\tau$ time periods remaining under policy $\pi^*$.
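Since $T^\sharp$ only depends on the harmonic sums $\Delta_{\to t}$ and $\xi_{\to t}$, it can be evaluated by a single backward scan. Below is a minimal sketch (ours), assuming the corrections and noises are handed in as already-realized sequences, which is only possible in simulation where the optimal policy's corrections can be recorded:

```python
import math

def stopping_time(delta, xi, x_T, d_lo, x_star):
    """Compute the stopping time T_sharp of Eq. (4).

    delta[tau] and xi[tau] are the demand correction and noise realized
    when tau periods remain (index 0 unused); Delta_{->t} and xi_{->t}
    are the harmonic sums defined above Eq. (4).
    """
    T = len(xi) - 1
    c = min(x_T - d_lo, x_star - x_T)
    t0 = math.ceil(math.log(T) ** 2)      # floor value ceil(ln^2 T), as in Eq. (4)
    delta_to = xi_to = 0.0                # Delta_{->T} = xi_{->T} = 0
    for t in range(T - 1, t0 - 1, -1):    # t = T-1, T-2, ..., t0
        delta_to += delta[t + 1] / t      # append Delta_{t+1}/t
        xi_to += xi[t + 1] / t            # append xi_{t+1}/t
        if max(abs(delta_to + xi_to), abs(xi_to)) > c / 2:
            return t                      # first hit while descending = largest such t
    return t0                             # condition never triggered above t0
```

Scanning $t$ downward from $T-1$ means the first violation found is the largest violating $t$, which matches the outer maximum in Eq. (4).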
Lemma 1. Let $x_T \in (0, x^*)$ and let $T^\sharp$ and $\{x^*_\tau, \Delta_\tau\}_{\tau \ge T^\sharp}$ be as defined in Eqs. (3,4,5). Then
\[ \varphi^*_T(x_T) \le \mathbb{E}\Bigl[ \sum_{\tau=T^\sharp+1}^{T} r(x^*_\tau + \Delta_\tau) + T^\sharp r(x^*_{T^\sharp}) \Bigr]. \]

Proof of Lemma 1.
The reward collected in time periods $T, T-1, \cdots, T^\sharp+1$ is $\sum_{\tau=T^\sharp+1}^{T}[r(x^*_\tau + \Delta_\tau) + f^{-1}(x^*_\tau + \Delta_\tau)\xi_\tau]$. When there are $T^\sharp$ periods remaining, the total remaining inventory is the random variable $x^*_{T^\sharp} T^\sharp$. By the definition of $T^\sharp$ and the fact that $T^\sharp \ge \lceil \ln^2 T \rceil$, it is clear that $x^*_{T^\sharp} \in [\underline{d}, x^*]$ for sufficiently large $T$. It is a well-known result (see Theorem 1 of this paper, or (Gallego and Van Ryzin 1994)) that $\varphi^*_{T^\sharp}(x) \le T^\sharp r(x)$ for all $x \in [\underline{d}, x^*]$. Subsequently,
\[ \varphi^*_T(x_T) \le \mathbb{E}\Bigl[\sum_{\tau=T^\sharp+1}^{T}\bigl[r(x^*_\tau + \Delta_\tau) + f^{-1}(x^*_\tau + \Delta_\tau)\xi_\tau\bigr] + T^\sharp r(x^*_{T^\sharp})\Bigr] = \mathbb{E}\Bigl[\sum_{\tau=T^\sharp+1}^{T} r(x^*_\tau + \Delta_\tau) + T^\sharp r(x^*_{T^\sharp})\Bigr], \]
where the second equality holds because $\mathbb{E}[\sum_{\tau=T^\sharp+1}^{T} f^{-1}(x^*_\tau + \Delta_\tau)\xi_\tau] = 0$ thanks to Doob's optional stopping theorem.

Lemma 2.
For $x_T \in (0, x^*)$, it holds that $\mathbb{E}[T^\sharp] = O(\ln^2 T)$.

Remark 1.
In the $O(\cdot)$ notation in Lemma 2 we omit constants depending on $x_T$, $x^*$ and $r(\cdot)$.

Proof of Lemma 2. Fix any $t \le T$ and note that $\xi_{\to t}$ is the sum of $(T-t)$ centered independent random variables with variance $\mathbb{E}[|\xi_{\to t}|^2] = \sum_{\tau=t+1}^{T} O(1/(\tau-1)^2) = O(1/t)$. Note also that each $|\xi_\tau|/(\tau-1)$ is at most $B_\xi/t$ almost surely. By Bernstein's inequality, with probability $1-\delta$ it holds that
\[ |\xi_{\to t}| \le O\bigl(t^{-1}\ln(1/\delta)\bigr) + O\bigl(t^{-1/2}\sqrt{\ln(1/\delta)}\bigr). \]
Let $T_0 = \lceil \ln^2 T \rceil$. With the above inequality and a union bound, we have for sufficiently large $T$ that
\[ \Pr\bigl[\forall t \ge T_0,\ |\xi_{\to t}| \le \min(x_T - \underline{d},\ x^* - x_T)/4\bigr] \ge 1 - O(T^{-2}). \tag{6} \]
Let $\mathcal{E}$ be the event that $\forall t \ge T_0 = \lceil \ln^2 T \rceil$, $|\xi_{\to t}| \le \min(x_T - \underline{d}, x^* - x_T)/4$. Eq. (6) shows that $\Pr[\mathcal{E}^c] = O(T^{-2})$. Lemma 1 and the law of total expectation imply that
\[ \varphi^*_T(x_T) \le \mathbb{E}\Bigl[\Bigl(\sum_{\tau=T^\sharp+1}^{T} r(x^*_\tau + \Delta_\tau) + T^\sharp r(x^*_{T^\sharp})\Bigr)1\{\mathcal{E}\}\Bigr] + O(1). \tag{7} \]
Next, consider arbitrary $\tau \ge T^\sharp + 1$. Using the smoothness and concavity of $r(\cdot)$, it holds that
\[ r(x^*_\tau + \Delta_\tau) = r(x_T - \Delta_{\to\tau} - \xi_{\to\tau} + \Delta_\tau) \le r(x_T) + r'(x_T)(-\Delta_{\to\tau} - \xi_{\to\tau} + \Delta_\tau). \tag{8} \]
Similarly, for $x^*_{T^\sharp} = x_T - \Delta_{\to T^\sharp} - \xi_{\to T^\sharp}$ it holds that
\[ r(x^*_{T^\sharp}) \le r(x_T) + r'(x_T)(-\Delta_{\to T^\sharp} - \xi_{\to T^\sharp}) - \frac{m}{2}\bigl|\Delta_{\to T^\sharp} + \xi_{\to T^\sharp}\bigr|^2. \tag{9} \]
Combining Eqs. (8,9) we have
\[ \sum_{\tau=T^\sharp+1}^{T} r(x^*_\tau + \Delta_\tau) + T^\sharp r(x^*_{T^\sharp}) \le T r(x_T) + r'(x_T)\Bigl[\sum_{\tau=T^\sharp+1}^{T}(-\Delta_{\to\tau} - \xi_{\to\tau} + \Delta_\tau) - T^\sharp \Delta_{\to T^\sharp} - T^\sharp \xi_{\to T^\sharp}\Bigr] - \frac{m}{2}T^\sharp\bigl|\Delta_{\to T^\sharp} + \xi_{\to T^\sharp}\bigr|^2 = T r(x_T) - r'(x_T)\Bigl[\sum_{\tau=T^\sharp+1}^{T}\xi_\tau\Bigr] - \frac{m}{2}T^\sharp\bigl|\Delta_{\to T^\sharp} + \xi_{\to T^\sharp}\bigr|^2. \tag{10} \]
By the law of total expectation, for every $\tau$ it holds that $|\mathbb{E}[\xi_\tau 1\{\mathcal{E}\}]| = |\mathbb{E}[\xi_\tau 1\{\mathcal{E}^c\}]| = O(T^{-2})$. Combining Eqs. (10) and (7), we have
\[ \varphi^*_T(x_T) \le T r(x_T) + O(1) - \frac{m}{2}\mathbb{E}\bigl[T^\sharp|\Delta_{\to T^\sharp} + \xi_{\to T^\sharp}|^2 1\{\mathcal{E}\}\bigr] \le T r(x_T) + O(1) - \frac{m}{8}\min(x_T - \underline{d},\ x^* - x_T)^2\,\mathbb{E}\bigl[T^\sharp 1\{(T^\sharp > \lceil\ln^2 T\rceil)\cap\mathcal{E}\}\bigr] \tag{11} \]
\[ = T r(x_T) + O(1) - \Omega(1)\times\mathbb{E}\bigl[T^\sharp 1\{(T^\sharp > \lceil\ln^2 T\rceil)\cap\mathcal{E}\}\bigr]. \]
Here, Eq. (11) holds because, on the event $\mathcal{E}$, $T^\sharp > \lceil\ln^2 T\rceil$ implies $|\xi_{\to T^\sharp} + \Delta_{\to T^\sharp}| \ge \min(x_T - \underline{d}, x^* - x_T)/2$: indeed, $|\xi_{\to T^\sharp}| \le \min(x_T - \underline{d}, x^* - x_T)/4$ on $\mathcal{E}$, so the triggering clause in Eq. (4) must be the first one, and in particular $|\Delta_{\to T^\sharp}| \ge \min(x_T - \underline{d}, x^* - x_T)/4$. On the other hand, the results of (Jasin 2014) show that $\varphi^*_T(x_T) \ge T r(x_T) - O(\ln T)$. Subsequently,
\[ \mathbb{E}\bigl[T^\sharp 1\{(T^\sharp > \lceil\ln^2 T\rceil)\cap\mathcal{E}\}\bigr] = O(\ln T). \]
Finally,
\[ \mathbb{E}[T^\sharp] = \mathbb{E}\bigl[T^\sharp 1\{(T^\sharp > \lceil\ln^2 T\rceil)\cap\mathcal{E}\}\bigr] + \mathbb{E}\bigl[T^\sharp 1\{(T^\sharp = \lceil\ln^2 T\rceil)\cap\mathcal{E}\}\bigr] + \mathbb{E}[T^\sharp 1\{\mathcal{E}^c\}] \le O(\ln T) + O(\ln^2 T) + O(1) = O(\ln^2 T), \]
which is to be demonstrated.

4.2 Properties of the re-solving policy $\pi^r$

For any $\tau \ge 1$ and $x^r_\tau \in [\underline{d}/T, x^*]$, the value of the re-solving policy $\pi^r$ can be written as
\[ \varphi^r_\tau(x^r_\tau) = r(x^r_\tau) + \mathbb{E}\Bigl[\varphi^r_{\tau-1}\Bigl(x^r_\tau - \frac{\xi_\tau}{\tau-1}\Bigr)\Bigr]. \tag{12} \]
Note that Eq. (12) does not hold for $x^r_\tau > x^*$, in which case the re-solving policy $\pi^r$ would commit to $z_\tau = x^*$ instead of $z_\tau = x^r_\tau$. Comparing Eq. (12) with Eq. (3), we remark that the re-solving heuristic $\pi^r$ is the special case of the dynamic programming policy with decision rule $\Delta_\tau \equiv 0$ whenever $x^r_\tau \le x^*$. Recall the definition of the stopping time $T^\sharp$ in Eq. (4). Because of the upper bound $|\xi_{\to t}| \le \min\{x_T - \underline{d},\ x^* - x_T\}/2$ for all $t > T^\sharp$, we have that
\[ x^r_\tau = x_T - \xi_{\to\tau}, \qquad \forall \tau \ge T^\sharp. \tag{13} \]

Lemma 3.
Let $x_T \in (0, x^*)$ and let $T^\sharp$ and $\{x^r_\tau\}_{\tau \ge T^\sharp}$ be as defined in Eqs. (4,13). Then
\[ \mathbb{E}\Bigl[\sum_{\tau=T^\sharp+1}^{T} r(x^r_\tau) + T^\sharp r(x^r_{T^\sharp}) - O(\ln T^\sharp)\Bigr] \le \varphi^r_T(x_T) \le \mathbb{E}\Bigl[\sum_{\tau=T^\sharp+1}^{T} r(x^r_\tau) + T^\sharp r(x^r_{T^\sharp})\Bigr]. \]

Proof of Lemma 3.
Suppose that when there are $T^\sharp$ time periods left, the remaining inventory level is $x^r_{T^\sharp} T^\sharp$ for some $x^r_{T^\sharp} \in (0, x^*]$. The static LP relaxation gives $\varphi^r_{T^\sharp}(x^r_{T^\sharp}) \le T^\sharp r(x^r_{T^\sharp})$. On the other hand, the results of Jasin (2014) assert that the re-solving heuristic has logarithmic regret compared against the static LP benchmark, or more specifically $\varphi^r_{T^\sharp}(x^r_{T^\sharp}) \ge T^\sharp r(x^r_{T^\sharp}) - O(\ln T^\sharp)$. The rest of the proof is identical to the proof of Lemma 1.

4.3 Proof of Theorem 3

In this section we prove Theorem 3. By Lemmas 1 and 3, for any fixed $x_T \in (\underline{d}, x^*)$ and sufficiently large $T$ it holds that
\[ \varphi^*_T(x_T) \le \mathbb{E}\Bigl[\sum_{\tau=T^\sharp+1}^{T} r(x^*_\tau + \Delta_\tau) + T^\sharp r(x^*_{T^\sharp})\Bigr]; \tag{14} \]
\[ \varphi^r_T(x_T) \ge \mathbb{E}\Bigl[\sum_{\tau=T^\sharp+1}^{T} r(x^r_\tau) + T^\sharp r(x^r_{T^\sharp}) - O(\ln T^\sharp)\Bigr], \tag{15} \]
where $x^*_\tau = x_T - \Delta_{\to\tau} - \xi_{\to\tau}$ and $x^r_\tau = x_T - \xi_{\to\tau}$ for all $\tau \ge T^\sharp$, $\Delta_{\to\tau} = \frac{\Delta_T}{T-1} + \cdots + \frac{\Delta_{\tau+1}}{\tau}$, $\xi_{\to\tau} = \frac{\xi_T}{T-1} + \cdots + \frac{\xi_{\tau+1}}{\tau}$, and $T^\sharp$ is the stopping time defined in Eq. (4).

For any $\tau \ge T^\sharp$, by the smoothness and concavity of $r(\cdot)$, it holds that
\[ r(x^*_\tau + \Delta_\tau) - r(x^r_\tau) \le r'(x_T - \xi_{\to\tau})[\Delta_\tau - \Delta_{\to\tau}] - \frac{m}{2}|\Delta_\tau - \Delta_{\to\tau}|^2 \le r'(x_T)[\Delta_\tau - \Delta_{\to\tau}] - r''(x_T)\xi_{\to\tau}[\Delta_\tau - \Delta_{\to\tau}] + O(|\xi_{\to\tau}|^2)|\Delta_\tau - \Delta_{\to\tau}| - \frac{m}{2}|\Delta_\tau - \Delta_{\to\tau}|^2 \le r'(x_T)[\Delta_\tau - \Delta_{\to\tau}] - r''(x_T)\xi_{\to\tau}[\Delta_\tau - \Delta_{\to\tau}] + \frac{1}{2m}O(|\xi_{\to\tau}|^4). \tag{16} \]
Similarly, for $\tau = T^\sharp$, we have
\[ r(x^*_{T^\sharp}) - r(x^r_{T^\sharp}) \le -r'(x_T)\Delta_{\to T^\sharp} + r''(x_T)\xi_{\to T^\sharp}\Delta_{\to T^\sharp} + \frac{1}{2m}O(|\xi_{\to T^\sharp}|^4). \tag{17} \]
Combining Eqs. (16,17) we obtain
\[ \varphi^*_T(x_T) - \varphi^r_T(x_T) \le \mathbb{E}\bigl[ r'(x_T)\,\mathcal{A} - r''(x_T)\,\mathcal{B} + O(1)\times\mathcal{C} + O(\ln T^\sharp) \bigr], \tag{18} \]
where the random variables $\mathcal{A}, \mathcal{B}, \mathcal{C}$ are defined as
\[ \mathcal{A} = \sum_{\tau=T^\sharp+1}^{T}[\Delta_\tau - \Delta_{\to\tau}] - T^\sharp\Delta_{\to T^\sharp}, \qquad \mathcal{B} = \sum_{\tau=T^\sharp+1}^{T}\xi_{\to\tau}[\Delta_\tau - \Delta_{\to\tau}] - T^\sharp\xi_{\to T^\sharp}\Delta_{\to T^\sharp}, \qquad \mathcal{C} = \sum_{\tau=T^\sharp+1}^{T}|\xi_{\to\tau}|^4 + T^\sharp|\xi_{\to T^\sharp}|^4. \]
We next analyze the three terms $\mathcal{A}, \mathcal{B}, \mathcal{C}$ separately. Recall the definition $\Delta_{\to\tau} = \frac{\Delta_T}{T-1} + \cdots + \frac{\Delta_{\tau+1}}{\tau}$. With elementary algebra it is easy to verify that
\[ \mathcal{A} = \sum_{\tau=T^\sharp+1}^{T}[\Delta_\tau - \Delta_{\to\tau}] - T^\sharp\Delta_{\to T^\sharp} = 0. \tag{19} \]
For the $\mathcal{B}$ term, re-organizing all terms involving each $\xi_t$, $t > T^\sharp$, we obtain
\[ \mathcal{B} = \sum_{t=T^\sharp+1}^{T}\frac{\xi_t}{t-1}\Bigl[\sum_{\tau=T^\sharp+1}^{t-1}(\Delta_\tau - \Delta_{\to\tau}) - T^\sharp\Delta_{\to T^\sharp}\Bigr] = \sum_{t=T^\sharp+1}^{T}\frac{\xi_t}{t-1}\bigl[-(t-1)\Delta_{\to(t-1)}\bigr] = -\sum_{t=T^\sharp+1}^{T}\xi_t\Delta_{\to(t-1)}. \]
Note that, because $\Delta_{\to(t-1)} = \frac{\Delta_T}{T-1} + \cdots + \frac{\Delta_t}{t-1}$ only involves demand corrections $\Delta_T, \Delta_{T-1}, \cdots, \Delta_t$ made when there are at least $t$ periods remaining, and the DP policy must be non-anticipating, it holds that $\mathbb{E}[\xi_t\Delta_{\to(t-1)}] = \mathbb{E}[\Delta_{\to(t-1)}\,\mathbb{E}[\xi_t \mid \Delta_{\to(t-1)}]] = 0$, since $\mathbb{E}[\xi_t \mid \Delta_{\to(t-1)}] = \mathbb{E}_Q[\xi_t] = 0$. Therefore, by Doob's optional stopping theorem we have
\[ \mathbb{E}[\mathcal{B}] = -\mathbb{E}\Bigl[\sum_{t=T^\sharp+1}^{T}\xi_t\Delta_{\to(t-1)}\Bigr] = 0. \tag{20} \]
Finally, we upper bound the expectation of the term $\mathcal{C}$. Recall the definition $\xi_{\to t} = \frac{\xi_T}{T-1} + \cdots + \frac{\xi_{t+1}}{t}$. Clearly, $\xi_{\to t}$ is the sum of centered, independently distributed random variables with $\mathbb{E}[\xi_{\to t}] = 0$ and $\mathbb{E}[|\xi_{\to t}|^2] = O(1/(T-1)^2 + \cdots + 1/t^2) = O(1/t)$.
Note also that each $|\xi_\tau/(\tau-1)|$ term is upper bounded by $B_\xi/t$ almost surely. By Bernstein's inequality, with probability $1-\delta$ it holds that
\[ |\xi_{\to t}| \le O\bigl(t^{-1}\ln(1/\delta)\bigr) + O\bigl(t^{-1/2}\sqrt{\ln(1/\delta)}\bigr). \]
Setting $\delta = 1/T^3$ and taking a union bound over all $t \ge T^\sharp$, it holds with probability $1 - O(T^{-2})$ that
\[ |\xi_{\to t}| \le O\bigl(t^{-1}\ln T + t^{-1/2}\sqrt{\ln T}\bigr), \qquad \forall t \ge T^\sharp. \]
Consequently, with probability $1 - O(T^{-2})$ we have
\[ \mathcal{C} = \sum_{t=T^\sharp+1}^{T}|\xi_{\to t}|^4 + T^\sharp|\xi_{\to T^\sharp}|^4 \le \sum_{t=T^\sharp+1}^{T}O\bigl(t^{-4}\ln^4 T + t^{-2}\ln^2 T\bigr) + T^\sharp\times O\bigl([T^\sharp]^{-4}\ln^4 T + [T^\sharp]^{-2}\ln^2 T\bigr) \le O\Bigl(\frac{\ln^4 T}{[T^\sharp]^3} + \frac{\ln^2 T}{T^\sharp}\Bigr) \le O(1), \]
where the last inequality holds because $T^\sharp \ge \ln^2 T$ almost surely. On the other hand, because $|\xi_{\to t}| \le B_\xi \ln T$ almost surely, we have $\mathcal{C} \le O(T\ln^4 T)$ almost surely. Therefore,
\[ \mathbb{E}[\mathcal{C}] \le O(1) + O(T\ln^4 T)\times O(T^{-2}) = O(1). \tag{21} \]
Combining Eqs. (19,20,21) with Eq. (18), we have
\[ \varphi^*_T(x_T) - \varphi^r_T(x_T) \le O(1) + O(\mathbb{E}[\ln T^\sharp]) \le O(\ln(\mathbb{E}[T^\sharp])) \le O(\ln\ln T), \]
where the second inequality is Jensen's and the last inequality holds by applying Lemma 2. This completes the proof of Theorem 3.

4.4 Proof of Theorem 4

Recall that $x^r_\tau = x_T - \xi_{\to\tau}$ for all $\tau \ge T^\sharp$, where $T^\sharp$ is the stopping time defined in Eq. (4). Expanding the difference $r(x^r_\tau) - r(x_T)$ at $x_T$ and using the smoothness and concavity of $r(\cdot)$, we have
\[ r(x^r_\tau) - r(x_T) \le -r'(x_T)\xi_{\to\tau} - \frac{m}{2}|\xi_{\to\tau}|^2. \]
Invoking Lemma 3, we have
\[ \varphi^r_T(x_T) - T r(x_T) \le \mathbb{E}\Bigl[\sum_{\tau=T^\sharp+1}^{T}\bigl(r(x^r_\tau) - r(x_T)\bigr) + T^\sharp\bigl(r(x^r_{T^\sharp}) - r(x_T)\bigr)\Bigr] \le -r'(x_T)\,\mathbb{E}\Bigl[\sum_{\tau=T^\sharp+1}^{T}\xi_{\to\tau} + T^\sharp\xi_{\to T^\sharp}\Bigr] - \frac{m}{2}\mathbb{E}\Bigl[\sum_{\tau=T^\sharp+1}^{T}|\xi_{\to\tau}|^2 + T^\sharp|\xi_{\to T^\sharp}|^2\Bigr]. \tag{22} \]
For the first term in Eq. (22), we have
\[ \mathbb{E}\Bigl[\sum_{\tau=T^\sharp+1}^{T}\xi_{\to\tau} + T^\sharp\xi_{\to T^\sharp}\Bigr] = \mathbb{E}\Bigl[\sum_{t=T^\sharp+1}^{T}\xi_t\Bigr] = 0, \tag{23} \]
where the last equality holds thanks to Doob's optional stopping theorem. For the second term in Eq. (22), we have
\[ \mathbb{E}\Bigl[\sum_{\tau=T^\sharp+1}^{T}|\xi_{\to\tau}|^2 + T^\sharp|\xi_{\to T^\sharp}|^2\Bigr] \ge \mathbb{E}\Bigl[\sum_{t=T^\sharp+1}^{T}|\xi_{\to t}|^2\Bigr] \ge \mathbb{E}\Bigl[\sum_{t=T^\sharp+1}^{T}\Omega(1/t)\Bigr] \tag{24} \]
\[ \ge \Omega\bigl(\ln T - \mathbb{E}[\ln T^\sharp]\bigr) \ge \Omega\bigl(\ln T - \ln(\mathbb{E}[T^\sharp])\bigr) = \Omega(\ln T), \tag{25} \]
where Eq. (24) holds by Doob's optional stopping theorem (since $\mathbb{E}[|\xi_{\to t}|^2]$ is a deterministic quantity and, by Assumption 3, $\mathbb{E}[|\xi_{\to t}|^2] = \Omega(1/t)$), and Eq. (25) holds by applying Lemma 2 and Jensen's inequality. Combining Eqs. (22,23,25) completes the proof of Theorem 4.

5 Numerical results

We corroborate the theoretical findings of this paper with a simple numerical experiment. In the simulation we adopt a Bernoulli demand model $\Pr[d_t = 1 \mid p_t] = \alpha - \beta p_t$, $\Pr[d_t = 0 \mid p_t] = 1 - \Pr[d_t = 1 \mid p_t]$ with $p_t \in [0,1]$, $\alpha = 3/4$ and $\beta = 1/2$. The (normalized) initial inventory level is $x_T = 5/16$, so that with $T$ time periods the initial inventory level is $x_T T = 5T/16$. The optimal demand rate $x^*$ without inventory constraints is $x^* = 3/8 > x_T$, and the static LP relaxation suggests a $T r(x_T) = (35/128)T \approx 0.27T$ expected revenue. We select the Bernoulli demand model because the inventory-level states are discrete and therefore the optimal dynamic programming pricing policy can be obtained exactly.

[Table 2 appears here. Caption: Regret for the static LP relaxation, the optimal stationary policy $\pi^s$, and the re-solving heuristic $\pi^r$, compared against the value of the optimal DP pricing policy. Columns: $\log_2 T$, static LP, $\pi^s$, $\pi^r$; the numerical entries were not recovered.]

In Table 2 we report the regret of the static LP relaxation, the optimal stationary policy $\pi^s: p_t \equiv f^{-1}(x_T) = 7/8$, and the re-solving policy $\pi^r$. All regret is defined with respect to the value (expected reward) of the optimal DP pricing policy, and the regret for the static LP relaxation is negative, since the static LP relaxation always upper bounds the value of any policy.
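For reference, the following sketch shows how the exact DP value can be computed for this Bernoulli model. The closed-form inner maximization is our own derivation from the stated linear purchase probability (not a step spelled out in the text), and the printed quantity illustrates the LP-versus-DP gap of Theorem 4.

```python
import numpy as np

def bernoulli_dp_value(T, y0, alpha=0.75, beta=0.5):
    """Exact DP value for the Bernoulli model Pr[d_t = 1 | p_t] = alpha - beta * p_t.

    V[y] holds the optimal expected revenue with the current number of
    remaining periods and y units of inventory.  The inner maximization
    over p in [0, 1] of
        (alpha - beta*p) * (p + V_prev[y-1] - V_prev[y]) + V_prev[y]
    is a concave quadratic in p, maximized at p = alpha/(2*beta) - delta/2
    (clipped to [0, 1]), where delta = V_prev[y-1] - V_prev[y] is minus
    the marginal value of one unit of inventory.
    """
    V = np.zeros(y0 + 1)                       # value with 0 periods left
    for _ in range(T):
        V_prev, V = V, np.zeros(y0 + 1)
        for y in range(1, y0 + 1):
            delta = V_prev[y - 1] - V_prev[y]
            p = min(1.0, max(0.0, alpha / (2 * beta) - delta / 2))
            q = alpha - beta * p               # sale probability at price p
            V[y] = q * (p + V_prev[y - 1]) + (1 - q) * V_prev[y]
    return V[y0]

T = 1024
y0 = 5 * T // 16                               # x_T = 5/16
x_T = y0 / T
static_lp = T * x_T * (1.5 - 2 * x_T)          # T * r(x_T) with r(x) = x f^{-1}(x)
print(static_lp - bernoulli_dp_value(T, y0))   # positive gap, growing as Omega(ln T)
```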
[Figure 2 appears here: regret versus $\log_2 T$ for the static LP relaxation, the stationary policy, and the re-solving policy.]
Figure 2: Plots of the regret of the static LP relaxation, the optimal stationary policy $\pi^s$, and the re-solving heuristic $\pi^r$, compared against the value of the optimal DP pricing policy.

Both the stationary policy $\pi^s$ and the re-solving heuristic $\pi^r$ are simulated repeatedly for each value of $T$, ranging from $T = 2^6 = 64$ to $T = 2^{15} = 32{,}768$, to obtain accurate estimates of their expected rewards. We also plot the regret in Figure 2 to make the regret growth of each policy more intuitive. As we can see from Table 2, the gap between the value of the optimal policy and the value of the static LP relaxation grows nearly linearly as the number of time periods $T$ grows geometrically, which verifies the $\Omega(\ln T)$ growth rate established in Theorem 4. On the other hand, the growth of the regret of the re-solving heuristic $\pi^r$ stagnates at $T \ge 2^{10}$ and is nearly the same for $T$ ranging from $2^{10} = 1024$ to $2^{15} = 32768$. This shows that the asymptotic growth of the regret of $\pi^r$ is far slower than $O(\ln T)$ and is compatible with the $O(\ln\ln T)$ regret upper bound we proved in Theorem 3.

6 Conclusion
In this paper, we analyze the re-solving heuristic in single-product price-based revenue management and establish two complementary theoretical results: the re-solving heuristic attains $O(\ln\ln T)$ regret compared against the value of the optimal dynamic programming pricing policy, and there exists an $\Omega(\ln T)$ lower bound on the gap between the re-solving heuristic (as well as the expected revenue of the optimal policy) and the static LP relaxation.

Going forward, one obvious question is whether it is possible to further sharpen the regret upper bound from the iterated logarithm $O(\ln\ln T)$ to bounded regret $O(1)$, which the numerical results presented in the previous section suggest should hold. Technically speaking, the $O(\ln\ln T)$ term in our analysis arises from the expectation of the stopping time $T^\sharp$, which characterizes how well-behaved the normalized inventory levels are before $T^\sharp$. To further reduce the impact of $T^\sharp$, one needs to carefully analyze the behavior of both the optimal pricing policy and the re-solving heuristic in the cases when inventories run out too fast or not fast enough, so that the normalized inventory levels near the end of the $T$ selling periods fall outside their typical ranges.

Appendix: the Hindsight-Optimum (HO) benchmark
The HO benchmark was adopted in (Bumpensanti and Wang 2020) to develop constant-regret re-optimizing algorithms for item-based network revenue management. Since in item-based network revenue management the demand rates are not affected by the (adaptively chosen) prices, the formulation in (Bumpensanti and Wang 2020) is not directly applicable to our setting. Instead, we formulate an HO benchmark following the strategy in (Vera et al. 2019), which also considered price-based revenue management with a finite set of prices.
Definition 1 (The HO benchmark). For any $p$, define the random variable $D_T(p) := \sum_{t=1}^{T} d_t$ as the total realized demand under the fixed price $p_t \equiv p$. A policy $\pi$ is HO-admissible if at time $t$ the price decision $p_t$ depends only on $\{p_{t'}, x_{t'}, d_{t'}\}_{t' < t}$ and $\{D_T(p)\}_{p \in [0,1]}$ in hindsight. We denote by $R^{\mathrm{HO}}(T, y_0)$ the optimal expected revenue attainable by HO-admissible policies. Clearly, such policies are more powerful than ordinary admissible policies, which only know the expected demand but not the realized demand for a specific price $p$.

Our next proposition shows that the HO benchmark $R^{\mathrm{HO}}(T, y_0)$ has a constant gap compared against the static LP benchmark $T r(x_T)$. Hence, it is also $\Omega(\ln T)$ away from the re-solving heuristic and the optimal DP solution.

Proposition 2. For any $x_T \in (\underline{d}, x^*)$, it holds that $R^{\mathrm{HO}}(T, y_0) \ge T r(x_T) - O(1)$, where $y_0 = x_T T$.

The proof of Proposition 2 is straightforward and presented below. The conclusion of Theorem 4 then holds with $T r(x_T)$ replaced by $R^{\mathrm{HO}}(T, x_T T)$.

Proof of Proposition 2. It is clear that, in our setting of $\xi_1, \cdots, \xi_T$ being i.i.d., knowing $\{D_T(p)\}_{p \in [0,1]}$ is equivalent to knowing $\bar{\xi} = \frac{1}{T}\sum_{t=1}^{T}\xi_t$, since $D_T(p) = T(f(p) + \bar{\xi})$ for all $p$. Now consider the policy of fixed prices $p_t \equiv g(x_T + \bar{\xi})$. Since $\mathbb{E}[\bar{\xi}^2] = O(1/T)$, the expected regret of such a policy can be bounded as
\[ T\,\mathbb{E}_{\bar{\xi}}\bigl[(x_T + \bar{\xi})f^{-1}(x_T + \bar{\xi})\bigr] - T x_T f^{-1}(x_T) = T\,\mathbb{E}_{\bar{\xi}}[r(x_T + \bar{\xi}) - r(x_T)] \ge T\,\mathbb{E}_{\bar{\xi}}\Bigl[r'(x_T)\bar{\xi} - \frac{M}{2}\bar{\xi}^2\Bigr] = -\frac{M}{2}T\,\mathbb{E}[\bar{\xi}^2] = -\frac{M}{2}\times O(1) = -O(1), \]
which is to be demonstrated.

References

Besbes, Omar, Assaf Zeevi. 2009. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations Research 57(6) 1407–1420.

Besbes, Omar, Assaf Zeevi. 2015. On the (surprising) sufficiency of linear models for dynamic pricing with demand learning. Management Science 61(4) 723–739.

Broder, Josef, Paat Rusmevichientong. 2012. Dynamic pricing under a general parametric choice model. Operations Research 60(4) 965–980.

Bumpensanti, Pornpawee, He Wang. 2020. A re-solving heuristic with uniformly bounded loss for network revenue management. Management Science (forthcoming).

Chen, Yiwei, Vivek F. Farias. 2013. Simple policies for dynamic pricing with imperfect forecasts. Operations Research 61(3) 612–624.

Cheung, Wang Chi, David Simchi-Levi, He Wang. 2017. Dynamic pricing and demand learning with limited price experimentation. Operations Research 65(6) 1722–1731.

Cooper, William L. 2002. Asymptotic behavior of an allocation policy for revenue management. Operations Research 50(4) 720–727.

Gallego, Guillermo, Garrett Van Ryzin. 1994. Optimal dynamic pricing of inventories with stochastic demand over finite horizons. Management Science 40(8) 999–1020.

Gallego, Guillermo, Garrett Van Ryzin. 1997. A multiproduct dynamic pricing problem and its applications to network yield management. Operations Research 45(1) 24–41.

Jasin, Stefanus. 2014. Reoptimization and self-adjusting price control for network revenue management. Operations Research 62(5) 1168–1178.

Jasin, Stefanus, Sunil Kumar. 2013. Analysis of deterministic LP-based booking limit and bid price controls for revenue management. Operations Research 61(6) 1312–1320.

Keskin, N. Bora, Assaf Zeevi. 2014. Dynamic pricing with an unknown demand model: Asymptotically optimal semi-myopic policies. Operations Research 62(5) 1142–1167.

Lei, Yanzhe Murray, Stefanus Jasin, Amitabh Sinha. 2014. Near-optimal bisection search for nonparametric dynamic pricing with inventory constraint. Ross School of Business Paper (1252).

Maglaras, Constantinos, Joern Meissner. 2006. Dynamic pricing strategies for multiproduct revenue management problems. Manufacturing & Service Operations Management 8(2) 136–148.

Reiman, Martin I., Qiong Wang. 2008. An asymptotically optimal policy for a quantity-based network revenue management problem. Mathematics of Operations Research 33(2) 257–282.

Secomandi, Nicola. 2008. An analysis of the control-algorithm re-solving issue in inventory and revenue management. Manufacturing & Service Operations Management 10(3) 468–483.

Vera, Alberto, Siddhartha Banerjee, Itai Gurvich. 2019. Online allocation and pricing: Constant regret via Bellman inequalities. arXiv preprint arXiv:1906.06361.

Wang, Yining, Boxiao Chen, David Simchi-Levi. 2019. Multi-modal dynamic pricing. Available at SSRN 3489355.

Wang, Zizhuo, Shiming Deng, Yinyu Ye. 2014. Close the gaps: A learning-while-doing algorithm for single-product revenue management problems. Operations Research 62(2) 318–331.

Wu, Huasen, Rayadurgam Srikant, Xin Liu, Chong Jiang. 2015. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. Advances in Neural Information Processing Systems. 433–441.