[PDF] Learning in structured MDPs with convex cost functions: Improved regret bounds for inventory management

Abstract

Full PDF

LLearning in structured MDPs with convex cost functions:Improved regret bounds for inventory management

SHIPRA AGRAWAL AND RANDY JIA

We consider a stochastic inventory control problem under censored demands, lost sales, and positive lead times.This is a fundamental problem in inventory management, with significant literature establishing near-optimalityof a simple class of policies called “base-stock policies” for the underlying Markov Decision Process (MDP), aswell as convexity of long run average-cost under those policies. We consider the relatively less studied problemof designing a learning algorithm for this problem when the underlying demand distribution is unknown. Thegoal is to bound regret of the algorithm when compared to the best base-stock policy. We utilize the convexityproperties and a newly derived bound on bias of base-stock policies to establish a connection to stochasticconvex bandit optimization.Our main contribution is a learning algorithm with a regret bound of ˜ O ( L √ T + D ) for the inventory controlproblem. Here L is the fixed and known lead time, and D is an unknown parameter of the demand distributiondescribed roughly as the number of time steps needed to generate enough demand for depleting one unit ofinventory. Notably, even though the state space of the underlying MDP is continuous and L -dimensional, ourregret bounds depend linearly on L . Our results significantly improve the previously best known regret bounds forthis problem where the dependence on L was exponential and many further assumptions on demand distributionwere required. The techniques presented here may be of independent interest for other settings that involve largestructured MDPs but with convex cost functions. Many operations management problems involve making decisions sequentially over time, wherethe outcome of a decision may depend on the current state of the system in addition to an uncertaindemand or customer arrival process. This includes several online decision making problems inrevenue and supply chain management. There, the sales revenue and supply costs incurred as aresult of pricing and ordering decisions may depend on the current level of inventory in stock, backorders, outstanding orders etc., in addition to the uncertain demand and/or supply for the products.Markov Decision Process (MDP) is a useful framework for modeling these sequential decisionmaking problems. In a typical formulation, the state of the MDP captures the current position ofinventory. The reward (observed sales) depends on the the current state of the inventory in additionto the demand. The stochastic state transition and reward generation models capture the uncertaintyin demand.A fundamental yet notoriously difficult problem in this area is the periodic inventory controlproblem under positive lead times and lost sales [20, 21]. In this problem, in each of the T sequentialdecision making periods, the decision maker takes into account the current on-hand inventory andthe pipeline of outstanding orders to decide the new order. There is a fixed delay (i.e., lead time)between placing an order and receiving it. A random demand is generated from a static distributionindependently in every period. However, the demand information is censored in the sense that thedecision maker observes only the sales, i.e., the minimum of demand and on-hand inventory. Anyunmet demand is lost, and incurs a penalty called lost sales penalty. Any leftover inventory at the endof a period incurs a holding cost. The aim is to minimize the aggregate long term inventory holdingcost and lost sales penalty. There is a significant existing research that develops a Markov model (orsemi-Markov model due to unobserved lost sales penalty) for this problem, and studies methods forcomputing optimal policies, assuming the demand distribution is either known or can be efficientlysimulated (e.g., see survey in [7]). In particular, a simple class of policies called base-stock policies Manuscript submitted for review to ACM Economics & Computation 2019 (EC ’19). a r X i v : . [ c s . L G ] M a y hipra Agrawal and Randy Jia optimal for this problem [7, 9]. Under a base-stock policy,the inventory position is always maintained at a target “base-stock level.” Notably, when using abase-stock policy, the infinite horizon average cost function for the inventory control MDP can beshown to be convex in the base-stock level [12]. Therefore, under known demand model, convexoptimization can be used to compute the optimal base-stock policy.In this paper, we considered the relatively less studied problem of periodic inventory controlwhen the decision maker does not know the demand distribution a priori . The goal is to designa learning algorithm that can use the observed outcomes of past decisions to implicitly learn theunknown underlying MDP model and adaptively improve the decision making strategy over time,aka a reinforcement learning algorithm. Following the near-optimality of base-stock policies, we aimto bound regret of the learning algorithm when compared to the best base-stock policy as benchmark.The two main challenges in designing an efficient learning algorithm for the inventory controlproblem described above are presented by the censored demand and the positive lead time . Thecensored demand assumption results in an exploration-exploitation tradeoff for the learning algorithm.Since the decision maker can only observe the sales, which is the minimum of demand and theon-hand inventory for a product, the quality of samples available for demand estimation of a productdepend crucially on the past ordering decisions. For example, suppose that due to past orderingpolicies, a certain product was maintained at a low inventory level for most of the past sales periods,then the higher quantiles of the demand distribution for that product would be unobserved. Therefore,in order to ensure accurate demand learning, large inventory states need to be sufficiently explored.However, this exploration needs to be balanced with the holding cost incurred for any leftoverinventory. There has been recent work on exploration-exploitation algorithms for regret minimizationin finite MDPs with regret bounds that depend linearly or sublinearly on the size of the state spaceand action space (e.g., [2, 3, 11]). However, the positive lead time in delivery of an order results in amuch enlarged state space (exponential in lead time) for the inventory control problem, since the stateneeds to track all the outstanding orders in the pipeline. There is a further issue of discretization, sincethe state space (inventory position) and action space (orders) is continuous. Discretizing over a gridwould give a further enlarged state space and action space. As a result, none of the above-mentionedreinforcement learning techniques can be applied directly to obtain useful regret bounds for theinventory control problem considered here.The main insight in this paper is that even though the state space is large, the convexity of theaverage cost function under the benchmark policies (here, base-stock policies) can be used to designan efficient learning algorithm for this MDP. We use the relation between bias and infinite horizonaverage cost of a policy given by Bellman equations, to provide a connection between stochasticconvex bandit optimization and the problem of learning and optimization in such MDPs. Specifically,we build upon the algorithm for stochastic convex optimization with bandit feedback from [1] toderive a simple algorithm that achieves an ˜ O ( L √ T + D ) regret bound for the inventory control problem.Here, L is the fixed and known lead time. And, D is a parameter of the demand distribution F , definedas the expected number of independent draws needed from distribution F for the sum to exceed .Importantly, although our regret bound depends on D , our algorithm does not need to know thisparameter.Our regret bound substantially improves the existing results for this problem by [8, 19], wherethe regret bounds grow exponentially with the lead time L (roughly as D L √ T ), and many furtherassumptions on demand distribution are required for the bounds to hold. A more detailed comparisonto related work is provided later in the text. More importantly, we believe that our algorithm designand analysis techniques can be applied in an almost blackbox manner for minimizing regret in other with increase in lost sales penaltyhipra Agrawal and Randy Jia Organization.

The rest of the paper is organized as follows. In the next two subsections, we providethe formal problem definition and describe our main results, along with a precise comparison ofour regret bounds to closely related work. In section §2, we use an MRP (Markov Reward Process)formulation to prove some key technical results, including convexity and bounded bias of base-stockpolicies. These insights form the basis of algorithm design and regret analysis in sections §3 and §4respectively. We conclude in section §5.

We consider a single product stochastic inventory control problem with lost sales and positive leadtimes. The problem setting considered here is similar to the setting considered in [9, 19]. In this, aninventory manager makes sequential decisions in discrete time steps t = , . . . , T . In the beginningof every time step t , the inventory manager observes the current inventory level inv t , and L previousunfulfilled orders in the pipeline, denoted as o t − L , o t − L + , . . . , o t − , for a single product. Here, L ≥ is the lead time defined as the delay (number of time steps) between placing an order and receiving it.Initially in step , there is no inventory (inv = ) and no unfulfilled orders. Based on this information,the manager decides the amount o t ∈ R of the product to order in the current time step.The next inventory position is then obtained through the following sequence of events. First, theorder o t − L that was made L time steps earlier, arrives, so that the on-hand inventory level becomes I t = inv t + o t − L . Then, an unobserved demand d t ≥ is generated from an unknown demanddistribution F , independent of the previous time steps. Sales is the minimum of the on-hand inventoryand demand, i.e., sales y t : = min { I t , d t } . The decision maker only observes the sales y t and notthe actual demand d t , the demand information is therefore censored . A holding cost of h ( I t − d t ) + is incurred on remaining inventory and a lost sales penalty of p ( d t − I t ) + is incurred on part of thedemand that could not be served due to insufficient on-hand inventory. That is, the cost incurred atend of step t is, ¯ C t = h ( I t − d t ) + + p ( d t − I t ) + . (1)Here, h and p are pre-specified constants denoting per unit holding cost and per unit lost sales penalty,respectively. Note that the lost sales and therefore the lost sales penalty is unobserved by the decisionmaker.Figure 1 illustrates the timing of arrival of orders and demand. The next step t + begins with theleftover inventory inv t + : = ( I t − d t ) + = ( inv t + o t − L − d t ) + (2)and the new pipeline of outstanding orders o t − L + , . . . , o t .An online learning algorithm for this problem would sequentially decide the orders o , . . . , o T ,under demand censoring, and without a priori knowing the demand distribution. The objective is tominimize the total expected cost E [ (cid:205) Tt = ¯ C t ]. Base-stock policies aka order up to policies form an important class of policies for the inventorycontrol problem. Under this policy, the inventory manger always orders a quantity that brings thetotal inventory position (i.e., sum of leftover inventory plus outstanding orders) to some fixed valueknown as the base-stock level, if possible. Specifically, let in the beginning of step t , the leftoverinventory be inv t and the outstanding orders be o t − L , . . . , o t − . Then, on using a base-stock policywith level x , the order o t in step t is given by hipra Agrawal and Randy Jia Fig. 1. Timing of arrival of orders and demand at time t . o t = ( x − inv t − (cid:205) Li = o t − i ) + .[9, 20] provide empirical results that show that base-stock policies work well in many applications,and furthermore, [9] show that as the ratio of lost sales to holding cost penalty increases to infinity,the ratio of the cost of the best base-stock policy to the optimal cost converges to . Since the ratioof lost sales to holding cost penalty is typically large in applications, best base-stock policy can beconsidered close to optimal. Regret against the best base-stock policy.

Considering the asymptotic optimality of base-stockpolicies, several past works consider a more tractable objective of minimizing regret of an onlinealgorithm compared to the best base-stock policy [8, 19].Let ¯ C xt , t = , . . . , denote the sequence of costs incurred on running the base-stock policy withlevel x . Define λ x as the expected infinite horizon average cost of the base-stock policy, when startingfrom no inventory or outstanding orders, i.e., λ x : = E (cid:34) lim T →∞ T T (cid:213) t = ¯ C xt | inv = (cid:35) (3)Following result from [12] shows that this long-run average cost is convex in x .L EMMA

DERIVED FROM T HEOREM OF [12]). Given a demand distribution F such that F ( ) > , i.e., there is non-zero probability of zero demand. Then, for any x ≥ , the expected infinitehorizon average cost (i.e., λ x ) is convex in x . R EMARK Theorem 12 of [12] actually proves convexity of average cost when starting froman inventory level inv = x . However, in definition of λ x , we assumed starting inventory inv = .On starting from no-inventory and outstanding orders, and using base-stock policy with level x , thesystem will reach the state with inventory x and no-outstanding orders in finite (exactly L ) steps.Therefore, λ x is same as the expected infinite horizon average cost on starting with inv = x . Regret of an algorithm is defined as difference in its total cost in time T compared to the asymptoticcost of the best base-stock policy. That is,Regret ( T ) : = E (cid:34) T (cid:213) t = ¯ C t (cid:35) − T (cid:18) min x ∈[ , U ] λ x (cid:19) (4) hipra Agrawal and Randy Jia ¯ C t , t = , . . . , is the sequence of costs incurred on running the algorithm starting fromno-inventory and no outstanding orders. [ , U ] is some pre-specified range of base-stock levels to beconsidered. Before we formally state our main theorem, we need to define D , a parameter of the demanddistribution F that appears in our regret bounds. However, it is important to note that our algorithmdoes not need to know the parameter D . Definition 1.2.

We define D as the expected number of independent samples needed from distribu-tion F for the sum of those samples to exceed . More precisely, let d , d , d , . . . , be a sequence ofindependent samples generated from the demand distribution F , and let τ be the minimum numbersuch that (cid:205) τi = d i ≥ . Then define D : = E [ τ ] . We also refer to D as the expected time to deplete oneunit of inventory. We will assume that the demand distribution F is such that D is finite.Our main result is stated as follows.T HEOREM

Given a demand distribution F such that F ( ) > and the expected time D todeplete one unit of inventory is finite. Then, for L ≥ , there exists an algorithm (Algorithm 1) for theinventory control problem with regret bounded as:Regret ( T ) ≤ ˜ O (cid:16) D max ( h , p ) U + ( L + ) max ( h , p ) U √ T (cid:17) , with probability at least − T . For T ≥ ( DU ) , this implies a regret bound ofRegret ( T ) ≤ ˜ O (cid:16) ( L + ) max ( h , p ) U √ T (cid:17) . Here ˜ O (·) hides logarithmic factors in h , p , U , L , T , and absolute constants. Here, constants max ( h , p ) and U define the scale of the problem. Note that the regret bound hasa very mild (additive) dependence on the parameter D of the demand distribution, defined as theexpected time to deplete one unit of inventory. We conjecture that such a dependence on D in theregret may be unavoidable; any time a learning algorithm reaches an inventory level higher thanthe optimal base-stock policy, it must necessarily wait time steps roughly proportional to D for theinventory to deplete, in order to play a better policy. Only an algorithm that never overshoots theoptimal inventory level may avoid incurring this waiting time. However, without a priori knowledgeof the optimal level, an exploration based learning algorithm is unlikely to avoid this completely.This dependence also reminds of the dependence on diameter of an MDP in regret bounds for RLalgorithms (e.g. see [2, 11, 17]).R EMARK The condition F ( ) > in the above theorem is required only for using the resulton convexity of infinite horizon average cost in Theorem 12 of [12] (see Lemma 1.1). The convexityresult can in fact also be shown to hold under some alternate conditions like finite support of demand,or under sufficient discretization of demand. R EMARK One may consider an alternative regret definition that compares difference in totalcost of the algorithm in time T to the total cost of the best base-stock policy. That is,Regret ′ ( T ) : = E (cid:34) T (cid:213) t = ( ¯ C t − ¯ C x ∗ t ) (cid:35) where x ∗ = arg min x ∈[ , U ] λ x . We show that our proof implies a bound similar to Theorem 1.3 for this alternative regret definition. hipra Agrawal and Randy Jia Some earlier works on exploration-exploitation algorithms for inventory control problem [6, 10] pro-vide ˜ O (√ T ) regret bounds, but under zero-lead time [10] and/or perishable inventory [6] assumptions.The inventory control problem considered here is exactly the same as that considered in recent workby [19], and earlier work by [8]. Therefore, we provide a precise comparison to the results obtainedin those works. Our result matches the O (√ T ) dependence on T in [19], improving on the O ( T ) dependence originally given in [8]. Further, it can be shown (see [19], Proposition 1) that for T > ,the expected regret for any learning algorithm in this setting is lower bounded by Ω (√ T ) , and thusour bound is optimal in T (within logarithmic factors).More importantly, our regret bound scales linearly in L as opposed to the exponential dependenceon L in [19]. Specifically, the regret bound achieved by [19] is of order ˜ O ( max ( h , p ) U (cid:0) c (cid:1) L √ T ) .Besides exponential dependence on L , the constant c here is given by product of some positiveprobabilities for demand to take values in certain ranges, which requires several further assumptionson distribution F (see Assumption 1 of [19]).Among other related work, the results in [4, 5, 14] imply ˜ O (√ T ) bounds for variations of inventorycontrol problems under adversarial demand , however, under significant simplifying assumptionssuch as all remaining inventory perishes at the end of time period and there is no lead time. Undersuch assumptions there is no state dependence across periods and the problem becomes closer toonline learning. Our algorithm design and analysis will utilize some key structural properties provided by base-stockpolicies. Specifically, we prove properties of a Markov Reward Process (MRP) obtained on runninga base-stock policy for the inventory control problem. An MRP extends a Markov chain by adding areward (or cost) to each state. In particular, the stochastic process obtained on fixing a policy in anMDP is an MRP.To define the MRP studied here, we observe that if we start with an on-hand inventory and apipeline of outstanding orders that sum to less than or equal to x , then using base-stock policy x willorder the amount o t to bring the sum to exactly x , i.e, o t = x − ( inv t + (cid:205) Li = o t − i ) = x − I t − (cid:205) L − i = o t − i .From here on, the base-stock policy will always order whatever is consumed due to demand, i.e., o t + = y t where y t = min { I t , d t } is the observed sales. And, the sum of inventory and outstandingorders will be maintained as x .Based on this observation, we define an MRP with state at time t as the tuple of available inventoryand outstanding orders, i.e., s t = ( I t , o t − L + , . . . , o t ) . The MRP starts from a state where all theentries in this tuple sum to x . The base-stock policy will maintain this feature with the new state attime t + being s t + = ( I t − y t + o t − L + , o t − L + , . . . , o t , o t + ) , where o t + = y t when using base-stockpolicy.We also define a cost C x ( s t ) associated with each state in this MRP. We define this as C x ( s t ) = E [ C xt | s t ] where C xt is a pseudo-cost defined as a modification of true cost ¯ C xt : C xt = ¯ C xt − pd t = h ( I t − y t ) − py t . (5)The advantage of using this pseudo-cost is that since I t and y t are observable, the pseudo-cost iscompletely observed. On the other hand, recall that the “lost sales" in the true cost are not observed.Further, since the modification pd t does not depend on the policy or algorithm being used, later (seeLemma 3.1) we will be able to show that the regret computed using this version of the cost is in factexactly the same as the regret Regret ( T ) in (4). hipra Agrawal and Randy Jia Definition 2.1 (Markov reward process M( x , s ) ). We define MRP M( x , s ) as the bipartitestochastic process {( s t , C x ( s t )) ; t = , , , . . . } . Here s t and C x ( s t ) denotes state and the reward (cost) at time t in this MRP.Let S x denote the set of ( L + ) -dimensional non-negative vectors whose components sum to x .The process starts in state s ∈ S x . Given state s t = ( s t ( ) , s t ( ) , . . . , s t ( L )) , new state at time t + isgiven by s t + : = ( s t ( ) − y t + s t ( ) , s t ( ) , . . . , s t ( L ) , y t ) (6)where y t = min { s t ( ) , d t } , d t ∼ F . Observe that if s ∈ S x , we have s t ∈ S x for all t by the abovetransition process. Cost function C x ( s t ) is defined as: C x ( s t ) = E [ C xt | s t ] where C xt : = h ( s t ( ) − y t ) − py t . (7)Two important quantities are the loss and bias of this MRP. Definition 2.2 (Loss and Bias).

For any s ∈ S x , loss д x ( s ) and bias v x ( s ) of MRP M( x , s ) aredefined as: д x ( s ) : = E (cid:2) lim T →∞ T (cid:205) Tt = C x ( s t )| s = s (cid:3) v x ( s ) : = E (cid:2) lim T →∞ (cid:205) Tt = C x ( s t ) − д x ( s t )| s = s (cid:3) R EMARK Technically, for the above limits to exist, and for some other known results on MRPsused later, we need finite state space and finite action space (see Chapter 8.2 in [15]). Since werestrict to orders within range [ , U ] , and all states s ∈ S x are vectors in [ , x ] L with x ∈ [ , U ] , wecan obtain finite state space and action space by discretizing demand and orders using a uniform gridwith spacing ϵ ∈ ( , ) . Discretizing this way will give us a state space and action space of size (cid:0) Uϵ (cid:1) L and Uϵ , respectively. In fact, we can use arbitrary small precision parameter ϵ , since our bounds willnot depend on the size of state space or action space. We therefore ignore this technicality in rest ofthe paper. The following lemma formally connects the loss of the MRP to the asymptotic average cost λ x ofa base-stock policy (refer to (3)) used in defining regret. This connection will allow us to use thepseudo-costs instead of unobserved true costs.L EMMA

Let s ′ ∈ S x be given by s ′ : = ( x , , . . . , ) . Then, λ x = д x ( s ′ ) + pµ with µ being the mean of the demand distribution F . P ROOF . On using base stock policy with level x starting in no inventory x and no outstandingorders, the first order will be x , which will arrive at time step L + . The orders and on-hand inventorywill be for the first L time steps I = I = . . . , I L = . All the sales is lost, and therefore, the truecost in each of these steps is pd t . In step L + , we will have an on-hand inventory I L + = x and nooutstanding orders. And, from here on, the system will follow a Markov reward process M( x , s ) with s = s ′ . Therefore, by relation (see (5)) between pseudo-cost C xt and true cost (lost sales penalty andholding cost) ¯ C xt , we have C xt = ¯ C xt − pd t , for t ≥ L + . Therefore, hipra Agrawal and Randy Jia д x ( s ′ ) = lim T →∞ E (cid:34) T L + T (cid:213) t = L + C x ( s t )| s L + = s ′ (cid:35) = lim T →∞ E (cid:34) T T + L (cid:213) t = ( ¯ C xt − pd t ) | inv = (cid:35) = λ x − pµ . □ Next, we prove some important properties of bias and loss of this MRP, namely,(1) that loss is independent of the starting state and convex in x (§2.1),(2) a bound on bias starting from any state (§2.2), and(3) a concentration lemma bounding the difference between the loss (i.e., expected infinite horizonaverage cost) and finite horizon average cost observed on running a base-stock policy (§2.3).These results are presented in the next three subsections and will be crucial in algorithm design andanalysis presented in the subsequent sections. To derive these properties, we first prove a bound onthe difference in aggregate cost (termed as “value") on starting from two different states. This result(proved in Lemma 2.5) forms a key technical result utilized in proving all the above properties. Definition 2.4 (Value).

For any s ∈ S x the value V xT ( s ) in time T of MRP M( x , s ) is defined as: V xT ( s ) : = E (cid:2)(cid:205) Tt = C x ( s t )| s = s (cid:3) . L EMMA

OUNDED DIFFERENCE IN VALUE ). For any x , T , and s , s ′ ∈ S x , V xT ( s ) − V xT ( s ′ ) ≤

36 max ( h , p ) Lx . P ROOF . For L = , s = s ′ = ( x ) and hence both sides are zero. Consider when L ≥ . One wayto bound the difference in the two values V xT ( s ) and V xT ( s ′ ) is to upper bound the expected numberof steps to reach a common state starting from s and s ′ . Once a common state is reached, from thatpoint onward, the two processes will have the same value. For example, if there is demand for L consecutive time steps, then both processes will reach state ( x , , . . . , ) . Therefore, the difference invalues can be upper bounded by a quantity proportional to inverse of the probability that demandis for L consecutive steps. Unfortunately, this probability is exponentially small in L . In fact, theexponential dependence of regret on previous works (e.g., [8, 19]) can be traced to using an argumentlike above somewhere in the analysis. Instead we achieve a bound with linear dependence in L byusing a more careful analysis of the costs incurred on starting from different states.For any s ∈ S x , we define m xT ( s ) : = (cid:205) Tt = I t to be the total on-hand inventory level, and n xT ( s ) : = (cid:205) Tt = y t to be the total sales in T time steps, on starting from state s . Then, V xT ( s ) : = E [ T (cid:213) i = C xt | s = s ] = E [ T (cid:213) i = hI t − ( h + p ) y t | s = s ] = E [ h ( m xT ( s )) − ( h + p )( n xT ( s ))] . Thus, the difference between values V xT ( s ) and V xT ( s ′ ) can be bounded by bounding difference intotal on-hand inventory | m xT ( s ) − m xT ( s ′ )| and total sales | n xT ( s ) − n xT ( s ′ )| . We bound this differenceby first comparing pairs of states s , s ′ that satisfy s ′ ⪰ s , with the relation ⪰ defined as the propertythat for some index k ≥ , the first k entries satisfy s ′ ( ) ≥ s ( ) , . . . , s ′ ( k ) ≥ s ( k ) , and the remaining L + − k entries satisfy s ′ ( k + ) ≤ s ( k + ) , . . . , s ′ ( L ) ≤ s ( L ) .For such pairs, we can bound the difference in total sales and total on-hand inventory as | n xT ( s ) − n xT ( s ′ )| ≤ x and | m xT ( s ) − m xT ( s ′ )| ≤ Lx . hipra Agrawal and Randy Jia s ′ vs. s . We show that initially more sales are observed on starting from s ′ , since s ′ ⪰ s . Over time, thesystem keeps alternating between states with s ′ t ⪰ s t and s ′ t ⪯ s t in cycles of length at most L . Theadditional sales in one cycle with s ′ t ⪰ s t compensates for the lower sales in the next cycle with s ′ t ⪯ s t , so that the total difference is bounded. The formal proofs for bounding difference in salesand on-hand inventory are provided in Lemma B.6 and Lemma B.7, respectively, in the appendix.Then we use the observation that ˆ s ⪰ s for all states s ∈ S x when ˆ s : = ( x , , , . . . , ) . Therefore,we can apply the above results to conclude | n xT ( s ) − n xT ( ˆ s )| ≤ x and | m xT ( s ) − m xT ( ˆ s )| ≤ Lx , implying | V xT ( s ) − V xT ( ˆ s )| = (cid:12)(cid:12) E [ h ( m xT ( s ) − m xT ( ˆ s )) − ( h + p )( n xT ( s ) − n xT ( ˆ s ))] (cid:12)(cid:12) ≤ ( h + p ) Lx . Since the above holds for any state s , we have that for two arbitrary starting states s , s ′ ∈ S x , | V xT ( s ) − V xT ( s ′ )| = | V xT ( s ) − V xT ( ˆ s ) + V xT ( ˆ s ) − V xT ( s ′ )| ≤ ( h + p ) Lx ≤

36 max ( h , p ) Lx . □ Next, we use the value difference lemma to show that the loss д x ( s ) is independent of the startingstate s ∈ S x in this MRP.L EMMA

NIFORM LOSS LEMMA ). For any x , s , s ′ ∈ S x , д x ( s ′ ) = д x ( s ) = : д x . P ROOF . Using definition of V xT ( s ) and д x ( s ) , д x ( s ) = lim T →∞ T V xT ( s ) so that using Lemma 2.5 | д x ( s ) − д x ( s ′ )| = | lim T →∞ T V xT ( s ) − lim T →∞ T V xT ( s ′ )| ≤ lim T →∞

36 max ( h , p ) LxT = , since both limits exist (see Remark 4). Hence for any s , s ′ ∈ S x , д x ( s ′ ) = д x ( s ) . □ Now the convexity of д x follows almost immediately from convexity of λ x and the relation givenin Lemma 2.3L EMMA

ONVEXITY LEMMA ). Assuming demand distribution F is such that there is aconstant probability of demand, i.e., F ( ) > . Then, for any base-stock level x , and s ∈ S x , д x ( s ) is convex in x . P ROOF . Let s ′ : = ( x , , . . . , ) and let µ be the mean of demand distribution F . By Lemma 2.3we have that д x ( s ′ ) = λ x − pµ . Therefore, under given assumption that demand distribution F has anon-zero probability of zero demand, we can use Lemma 1.1 to conclude that the first term is convexin x , which implies д x ( s ′ ) is convex. Now, by Lemma 2.6, for any state s ∈ S x , д x ( s ) = д x ( s ′ ) .Therefore, д x ( s ) is convex in x for all s ∈ S x . □ Now, the following can be obtained as a corollary of Lemma 2.5 and the definition of bias.L

EMMA

OUNDED BIAS LEMMA ). For any x and s , s ′ ∈ S x , v x ( s ) − v x ( s ′ ) ≤

36 max ( h , p ) Lx . hipra Agrawal and Randy Jia ROOF . From Lemma 2.6, д x ( s t ) = д x ( s ′ t ) = д x for all t . Now by definition of v x (·) , v x ( s ) = E (cid:2) lim T →∞ (cid:205) Tt = C x ( s t ) − д x | s = s (cid:3) = lim T →∞ V xT ( s ) − Tд x and v x ( s ′ ) = E (cid:2) lim T →∞ (cid:205) Tt = C x ( s t ) − д x | s = s ′ (cid:3) = lim T →∞ V xT ( s ′ ) − Tд x . We note that both of the above limits exists (see Remark 4), and hence by Lemma 2.5, v x ( s ) − v x ( s ′ ) = lim T →∞ ( V xT ( s ) − Tд x ) − lim T →∞ ( V xT ( s ′ ) − Tд x ) = lim T →∞ V xT ( s ) − V xT ( s ′ )≤

36 max ( h , p ) Lx . □ We use the following known relation between loss and bias which holds under finite state and actionspace.L

EMMA

HEOREM

For any s ∈ S x , the bias and loss satisfy the followingequation: д x ( s ) = C x ( s ) + E s ′ ∼ P x ( s ) [ v x ( s ′ )] − v x ( s ) . L EMMA

ONCENTRATION LEMMA ). Given a base-stock level x , let γ > and N = log ( T ) γ .Then, for any s ∈ S x , with probability − T , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:213) t = C x ( s t ) − д x ( s ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤

108 max ( h , p ) Lxγ . P ROOF . By Theorem 2.9, the loss д x and bias v x satisfy: д x ( s ) = C x ( s ) + E s ′ ∼ P x ( s ) [ v x ( s ′ )] − v x ( s ) for all states s ∈ S x . Note that this equation continues to hold if any constant c is added to all v x ( s ) .Therefore, for the purpose of using this equation, without loss of generality, we can assume that min s ∈S x v x ( s ) = , and from Lemma 2.8 we have ≤ v x ( s ) ≤

36 max ( h , p ) Lx for all s ∈ S x .Also, note that if s ∈ S x then all subsequent states s t in MRP M( x , s ) are in S x . Therefore,from Lemma 2.6, д x ( s ) = д x ( s t ) for all t . We use these observations to derive the following. hipra Agrawal and Randy Jia (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:32) N N (cid:213) t = C x ( s t ) (cid:33) − д x ( s ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:213) t = ( C x ( s t ) − д x ( s t )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:213) t = ( C x ( s t ) − ( C x ( s t ) + E s ′ ∼ P x ( s t ) [ v x ( s ′ )] − v x ( s t )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N (cid:213) t = v x ( s t ) − E s ′ ∼ P x ( s t ) [ v x ( s ′ )] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12) N ( v x ( s ) − E s ′ ∼ P x ( s N ) [ v x ( s ′ )]) + N N − (cid:213) t = v x ( s t + ) − E s ′ ∼ P x ( s t ) [ v x ( s ′ )] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤

36 max ( h , p ) LxN + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) N N − (cid:213) t = v x ( s t + ) − E s t + ∼ P x ( s t ) [ v x ( s t + )] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) . Now, let ∆ t + : = v x ( s t + ) − E s ′ ∼ P x ( s t ) [ v x ( s ′ )] . Note that E [ ∆ t + | s t ] = and hence ∆ t ’s form a martingale difference sequence with | ∆ t | ≤

36 max ( h , p ) Lx for all t (the distribution P x ( s t ) is supported only on states in S x ). Thus, we canapply Azuma-Hoeffding’s inequality (Theorem A.1 in the Appendix) to show that for any ϵ > , P (| N − (cid:213) t = ∆ t | ≥ ϵ ) ≤ exp (− ϵ ( N − )(

36 max ( h , p ) Lx ) ) . Therefore, by setting ϵ =

72 max ( h , p ) Lx (cid:112) N log ( T ) , we obtain that with probability at least − T , | N N (cid:213) t = C x ( s t ) − д x ( s )| ≤

36 max ( h , p ) LxN + N (

72 max ( h , p ) Lx (cid:112) N log ( T ))≤

108 max ( h , p ) Lx (cid:114) log ( T ) N The result follows by substituting N = log ( T ) γ . □ R EMARK Observe from the above lemmas that the bias, and the difference between expectedtotal cost and asymptotic cost, are when lead time is . We design a learning algorithm for the inventory control problem when the demand distribution F is a priori unknown. The algorithm seeks to minimize regret in the total expected cost comparedto the asymptotic cost of the best base-stock policies in a pre-specified range [ , U ] (refer to regretdefinition in §1.1). The algorithm receives as input, the range [ , U ] of base-stock levels to competewith, the fixed delay parameter L , and the time horizon T , but not the demand distribution F . Challenges and main ideas.

Our algorithm crucially utilizes the observations made in section §2regarding convexity of the average cost when a base-stock policy is used. Based on this observation, hipra Agrawal and Randy Jia -dimensional stochastic convex bandits.In the stochastic convex bandit problem, in every round the decision maker chooses a decision x t and observes a noisy realization of f ( x t ) , where f is some fixed but unknown convex function,and the noise is i.i.d. across rounds. The goal of an online algorithm is to use past observations tomake decisions x t , t = , . . . , T in order to minimize the regret against the best single decision, i.e.,minimize regret (cid:205) Tt = ( f ( x t ) − f ( x ∗ )) where x ∗ = arg min x ∈ X f ( x ) . Therefore, based on the definitionof regret in the inventory control problem, one may want to consider a mapping to the stochasticconvex bandit problem by setting f ( x ) as λ x , the average cost of the base-stock policy with level x .However, there are several challenges in achieving this mapping. Firstly, the instantaneous holdingcost and lost sales penalty depends on the current inventory state, and therefore is not a noisyrealization of f ( x ) = λ x (more precisely, the noise is not i.i.d. across rounds). Further, a part of theinstantaneous cost, that is, the lost sales penalty, is not even observed.We overcome these challenges using the construction of pseudo-cost and the concentration resultsderived in the previous section. In particular, in Lemma 2.3 we proved that expected infinite horizonaverage pseudo-cost д x ( s ) starting from a state s ∈ S differs from the average true cost λ x byamount pµ . Since this deviation of pµ is fixed and does not depend on the policy used , the followingequivalence between regret in pseudo-cost vs. regret in true costs follows almost immediately.L EMMA

Recall ¯ C t = h ( I t − d t ) + + p ( d t − I t ) + is the true cost in step t . Let C t = ¯ C t − pd t bethe observed cost (i.e., pseudo-cost) at time t . Then, regret under the true cost is equivalent to theregret under the observed cost, i.e.,Regret ( T ) : = E [ T (cid:213) t = ¯ C t ] − T (cid:18) min x ∈[ , U ] λ x (cid:19) = E [ T (cid:213) t = C t ] − T (cid:18) min x ∈[ , U ] д x (cid:19) . where д x is the loss of MRP M( x , s ) starting in any state s ∈ S x (refer to Definition 2.2 and Lemma2.6). P ROOF . Regret ( T ) = E [ (cid:205) Tt = ¯ C t ] − T (cid:0) min x ∈[ , U ] λ x (cid:1) (using Lemma 2.3 and 2.6) = E [ (cid:205) Tt = C t + pd t ] − T (cid:0) min x ∈[ , U ] д x + pµ (cid:1) (using independent demand assumption) = E [ (cid:205) Tt = C t ] − T (cid:0) min x ∈[ , U ] д x (cid:1) . □ Thus, we can focus on designing an algorithm for minimizing pseudo-costs. Further, the concen-tration results in Lemma 2.10, derived by bounding bias of base-stock policies, allow us to developconfidence intervals on estimates of cost functions in a manner similar to stochastic convex banditalgorithms.

Algorithm description.

Our algorithm is derived from the algorithm in [1] for -dimensionalstochastic convex bandits with convex function f ( x ) = д x . Following are the main components ofour algorithm. Working interval of base-stock level:

Our algorithm maintains a high probability confidenceinterval that contains an optimal base-stock level. Initially, this is set as [ , U ] , the pre-specifiedrange received as an input. As the algorithm progresses, the working interval is refined by discardingportions of this interval which have low probability of containing the optimal base-stock level. Epoch and round structure:

Our algorithm proceeds in epochs ( k = , , . . . ), a group of consecutivetime steps where the same working interval of base-stock level is maintained throughout an epoch hipra Agrawal and Randy Jia [ l k , r k ] . Epochs are also further split up into groups of consecutive time steps calledrounds. In round i of epoch k , the algorithm first plays the policy π , which is to order in everytime step, until the sum of total inventory and on hand orders falls below x l : = l k + r k − l k . Then,the algorithm plays policies π x l , π x c , π x r , denoting base stock policies corresponding to base-stocklevels x l : = l k + r k − l k , x c : = l k + r k − l k , x r : = l k + ( r k − l k ) , respectively, for N i time steps each.Note that on executing base-stock policies in the given order, the algorithm always starts executinga base-stock policy π x for x ∈ ( x l , x c , x r ) at a total inventory position below x . Therefore, it willimmediately (in one step) reach the desired inventory position x . Here N i = log ( T )/ γ i with γ i = − i .Therefore, the number of observations quadruples in each round. At the end of every round, theseobservations are used to update a confidence interval estimate for average cost. An epoch ends whenthe confidence intervals at the end of a round meet a certain condition, as defined next. Updating confidence intervals.

Given a vector C N = ( C , C , ..., C N ) , define LB ( C N ) : = N N (cid:213) i = C i − Hγ , and U B ( C N ) : = N N (cid:213) i = C i + Hγ , (8)where γ = (cid:113) log ( T ) N , H : =

576 max ( h , p )( L + ) U and L is the known lead time.Now, let C ℓ N , C cN , C rN denote the N = N i realizations of pseudo-costs ( C xt ) observed on runningbase-stock policy π x for each of the three levels x ∈ [ x l , x c , x r ] in round i . Then, at the end of round i , the algorithm computes three intervals: [ LB ( C aN ) , U B ( C aN )] for a ∈ { l , c , r } .Using the Lemma 2.10 proven in the previous section to bound the difference between expectedcost and expected asymptotic cost, and Lemma A.2 to bound the difference between empiricalcost and expected cost, we can show that (see Lemma A.3) the loss of each of these base-stockpolicies д x a ∈ [ LB ( C aN ) , U B ( C aN )] with probability − T . Therefore, each of these intervals is ahigh confidence intervals for the respective loss, with endpoints determined by N observed empiricalcosts.At the end of every round i of an epoch k , the algorithm uses the updated confidence intervalsto check if either the portion [ l k , x l ] or the portion [ x r , r k ] of the working interval [ l k , r k ] can beeliminated. Given the confidence intervals, the test used for this purpose is exactly the same as in [1],and uses convexity properties of function д x . If the test succeeds, at least / of the working intervalis eliminated and the epoch k ends.The algorithm is summarized as Algorithm 1. In this section, we prove the regret bound stated in Theorem 1.3 for Algorithm 1. Given the keytechnical results proven in section §2, the regret analysis follows steps similar to the regret analysisfor stochastic convex bandits in [1]. We use the notation f ( x ) = д x in this proof to connect theregret analysis here to the analysis for stochastic convex bandits with convex function f . Let x ∗ = min x ∈[ , U ] д x = min x ∈[ , U ] f ( x ) . And, let C t be the cost (i.e., pseudo-cost) observed at time t .Then, by Lemma 3.1, Regret ( T ) = E [ (cid:205) Tt = C t ] − (cid:205) Tt = f ( x ∗ ) . Also define an event E such that all confidence intervals [ LB ( C aN ) , U B ( C aN )] calculated in Algorithm1 satisfy: д x a ∈ [ LB ( C aN ) , U B ( C aN )] for every epoch k , round i and a ∈ { l , c , r } . The analysis in thissection will be condition on E , and the probability P (E) will be addressed at the end. hipra Agrawal and Randy Jia Algorithm 1

Learning algorithm for the inventory control problem

Inputs:

Base-stock range [ , U ] , lead time L , time horizon T . Initialize: l : = , r : = U . for epochs k = , , . . . , do Set w k : = r k − l k , the width of the working interval [ l k , r k ] .Set x l : = l k + w k / , x c : = l k + w k / , and x r : = l k + w k / . for round i = , , . . . , do Let γ i = − i and N = log ( T ) γ i .Play policy π until a time step t with inventory position ( inv t + o t − + · · · + o t − L ) ≤ x l .Play policy π x l , π x c , π x r , each for N time steps to observe N realizations of pseudo-costs( C t = ¯ C t − pd t ); store as vectors C lN , C cN , C rN respectively.If at any point during the above two steps, the total number of time steps reaches T , exit.For each a ∈ { l , c , r } , use C aN to calculate a confidence interval [ LB ( C aN ) , U B ( C aN )] of length Hγ i as given by (8), where H =

576 max ( h , p )( L + ) U . if max { LB ( C lN ) , LB ( C rN )} ≥ min { U B ( C lN ) , U B ( C cN ) , U B ( C rN )} + Hγ i thenif LB ( C lN ) ≥ LB ( C rN ) then l k + : = x l and r k + = r k . if LB ( C lN ) < LB ( C rN ) then l k + : = l k and r k + = x r Go to next epoch k + . else Go to next round i + . end ifend forend for We divide the regret in two parts: first we consider the regret over the set of times steps T i , k , at thebeginning of each epoch k and round i where policy π is played until the leftover inventory depletesto a level below x l . We denote the total contribution of regret from these steps (across all epochs androunds) as Regret ( T ) . Since the cost incurred at any time step is at most max ( h , p )( U ) , this part ofthe regret is bounded byRegret ( T ) ≤ max ( h , p )( U ) · E [ (cid:205) epoch k (cid:205) round i in epoch k | T k , i , |] .To bound expected number of steps in T k , i , , observe that ordering zero for L steps will result in atmost U inventory on hand and no orders in the pipeline. By definition of D , the expected number oftime steps to deplete U units of inventory is upper bounded by DU . Therefore, E [| T k , i , |] ≤ L + DU .Since any epoch has at most T time steps, and each successive round within an epoch has fourtimes the number of time steps as the previous, there are at most log ( T ) rounds per epoch. Also,in Lemma C.4 we show that, under E , the number of epochs is bounded by log / ( T ) . Intuitively,this holds because in every epoch we eliminate at least ( / ) th of the working interval. Using theseobservations, the regret from all the time steps where policy π was executed is bounded byRegret ( T ) ≤ log / ( T ) log ( T )( L + DU ) max ( h , p )( U ) . (9)Next, we consider the regret over all remaining time steps, denoted as Regret ( T ) . Algorithm 1plays the base-stock policies with level x l , x c , or x r in these steps, where these levels are updatedat the end of every epoch. Consider a round i in epoch k . Let T k , i , l , T k , i , c , T k , i , r , be the set of (atmost) N i = log ( T ) γ i consecutive times where policies π x l , π x c , π x r are played, respectively, in round i of epoch k . Here, γ i = − i , and recall that H =

576 max ( h , p )( L + ) U . Let x t denote the base-stock hipra Agrawal and Randy Jia t . By Lemma 2.10, for epoch k , round i and a ∈ { l , c , r } ,under E , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:213) t ∈ T k , i , a ( C t − f ( x t )) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ N i Hγ i = H log ( T ) γ i . Substituting above, we can derive thatRegret = E  (cid:213) epoch k round i (cid:213) a ∈{ l , c , r } (cid:213) t ∈ T k , i , a ( C t − f ( x ∗ )  = E  (cid:213) epoch k round i (cid:213) a ∈{ l , c , r } (cid:213) t ∈ T k , i , a ( C t − f ( x t ) + f ( x t ) − f ( x ∗ ))  ≤ E  (cid:213) epoch k round i (cid:169)(cid:173)(cid:171) N i Hγ i + (cid:213) a ∈{ l , c , r } (cid:213) t ∈ T k , i , a ( f ( x t ) − f ( x ∗ )) (cid:170)(cid:174)(cid:172) . (10)Now observe that for any round i of epoch k in which the algorithm does not terminate, the totalnumber of time steps is bounded by T . So for a ∈ { l , c , r } , we have | T k , i , a | = log ( T ) γ i ≤ T , whichimplies γ i ≥ (cid:113) log ( T ) T . Since γ i + = γ i , let us define γ min : = (cid:113) log ( T ) T so that γ min ≤ γ j for anyround j . Recall that γ i = − i so we can bound the geometric series: (cid:213) k (cid:213) i N i Hγ i = (cid:213) k (cid:213) i (cid:18) H log ( T ) γ i (cid:19) ≤ log / ( T ) (cid:18) H log ( T ) γ min (cid:19) . (11)Substituting the value of γ min we get a bound of H log / ( T ) (cid:112) T log ( T ) on the first term in (10).Now, consider the second term in (10). We use the results in [1] regarding the convergence of theconvex optimization algorithm to bound the gap between f ( x t ) and f ( x ∗ ) . Intuitively, in every epochthe working interval shrinks by a constant factor, so that x t ∈ { x l , x c , x r } are closer and closer to theoptimal level x ∗ . Therefore, the gap | f ( x t ) − f ( x ∗ )| can be bounded using a Lipschitz property of f proven in Lemma C.1 that shows | f ( x t ) − f ( x ∗ )| ≤ max ( h , p )| x t − x ∗ | . Specifically, we adapt theproof from [1] to derive the following bound (details are in Lemma C.5 in appendix): (cid:213) k , i , a , t ∈ T k , i , a f ( x ( t )) − f ( x ∗ ) ≤ H log / ( T ) (cid:112) T log ( T ) . (12)Substituting, in (10), Regret ( T ) is bounded by:Regret ( T ) ≤ H log / ( T ) (cid:112) T log ( T ) + H log / ( T ) (cid:112) T log ( T ) = H log / ( T ) (cid:112) T log ( T ) . (13)And, combining with the bound on Regret ( T ) from (9), we get the following regret bound:Regret ( T ) ≤ log / ( T ) log ( T )( L + DU ) max ( h , p )( U ) + H log / ( T ) (cid:112) T log ( T ) = ˜ O (cid:16) D max ( h , p ) U + ( L + ) max ( h , p ) U √ T (cid:17) . We complete the proof of the theorem statement by noting all the analysis has been conditioned on E where д x a ∈ [ LB ( C aN ) , U B ( C aN )] for every epoch k , round i and a ∈ { l , c , r } . By Lemma 2.10, thecondition is satisfied with probability at least − T for each k , i , a . Since there are no more than T time steps and therefore at most T plays of any policy, by union bound P (E) ≥ − T , hipra Agrawal and Randy Jia − T .Finally, to see that a similar regret bound holds for the alternative regret definition in Remark 3,we compare the two regret definitions:Regret ′ ( T ) = Regret ( T ) + T λ x ∗ − E (cid:34) T (cid:213) t = ¯ C x ∗ t (cid:35) Now, use (5) and Lemma 2.3 to convert true costs to pseudo-costs; and then apply Lemma 2.10 toobtain

T λ x ∗ − E (cid:34) T (cid:213) t = ¯ C x ∗ t (cid:35) = Tд x ∗ − E (cid:34) T (cid:213) t = C x ∗ t (cid:35) ≤ O ( H (cid:112) T log ( T )) Hence the two regret bounds are of the same order.

We presented an algorithm to minimize regret in the periodic inventory control problem undercensored demand, lost sales, and positive lead time, when compared to the best base-stock policy. Byusing convexity properties of the long run average cost function and a newly proven bound on bias ofbase-stock policies, we extend a stochastic convex bandit algorithm to obtain a simple algorithm thatsubstantially improves upon the existing solutions for this problem. In particular, the regret bound forour algorithm maintains an optimal dependence on T , while also achieving a linear dependence onthe other problem parameters like lead time. The algorithm design and analysis techniques developedhere may be useful for obtaining efficient solutions for other classes of learning problems where theMDPs involved may be large, but the long-run average cost under benchmark policies is convex. REFERENCES [1] Alekh Agarwal, Dean P Foster, Daniel J Hsu, Sham M Kakade, and Alexander Rakhlin. Stochastic convex optimizationwith bandit feedback. In

Advances in Neural Information Processing Systems , pages 1035–1043, 2011.[2] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In

Advances in Neural Information Processing Systems , pages 1184–1194, 2017.[3] Peter L Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weaklycommunicating MDPs. In

Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence , pages35–42. AUAI Press, 2009.[4] Gábor Bartók, Dean P Foster, Dávid Pál, Alexander Rakhlin, and Csaba Szepesvári. Partial monitoring - classification,regret bounds, and algorithms.

Mathematics of Operations Research , 39(4):967–997, 2014.[5] Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization.

Operations research , 63(5):1227–1244, 2015.[6] Omar Besbes and Alp Muharremoglu. On implications of demand censoring in the newsvendor problem.

ManagementScience , 59(6):1407–1424, 2013.[7] Marco Bijvank and Iris FA Vis. Lost-sales inventory theory: A review.

European Journal of Operational Research ,215(1):1–13, 2011.[8] Woonghee Tim Huh, Ganesh Janakiraman, John A Muckstadt, and Paat Rusmevichientong. An adaptive algorithm forfinding the optimal base-stock policy in lost sales inventory systems with censored demand.

Mathematics of OperationsResearch , 34(2):397–416, 2009.[9] Woonghee Tim Huh, Ganesh Janakiraman, John A Muckstadt, and Paat Rusmevichientong. Asymptotic optimality oforder-up-to policies in lost sales inventory systems.

Management Science , 55(3):404–420, 2009.[10] Woonghee Tim Huh and Paat Rusmevichientong. A nonparametric asymptotic analysis of inventory planning withcensored demand.

Mathematics of Operations Research , 34(1):103–123, 2009.[11] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning.

Journal ofMachine Learning Research , 11(Apr):1563–1600, 2010.[12] Ganesh Janakiraman and Robin O Roundy. Lost-sales problems with stochastic lead times: Convexity results forbase-stock policies.

Operations Research , 52(5):795–803, 2004.[13] Hau Leung Lee and Morris A Cohen. A note on the convexity of performance measures of m/m/c queueing systems.

Journal of Applied Probability , 20(4):920–923, 1983.hipra Agrawal and Randy Jia [14] Gábor Lugosi, Mihalis G Markakis, and Gergely Neu. On the hardness of inventory management with censored demanddata. arXiv preprint arXiv:1710.05739 , 2017.[15] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming . John Wiley & Sons, 2014.[16] J George Shanthikumar and David D Yao. Optimal server allocation in a system of multi-server stations.

ManagementScience , 33(9):1173–1180, 1987.[17] Ambuj Tewari and Peter L Bartlett. Optimistic linear programming gives logarithmic regret for irreducible MDPs. In

Advances in Neural Information Processing Systems , pages 1505–1512, 2008.[18] Richard R Weber. Note - on the marginal benefit of adding servers to g/gi/m queues.

Management Science , 26(9):946–951,1980.[19] Huanan Zhang, Xiuli Chao, and Cong Shi. Closing the gap: A learning algorithm for the lost-sales inventory systemwith lead times. 2017.[20] Paul Zipkin. Old and new methods for lost-sales inventory systems.

Operations Research , 56(5):1256–1263, 2008.[21] Paul Herbert Zipkin.

Foundations of inventory management . 2000.hipra Agrawal and Randy Jia A CONCENTRATION BOUNDS T HEOREM

A.1 (A

ZUMA -H OEFFDING INEQUALITY ). Let X , X , . . . be a martingale differencesequence with | X i | ≤ c for all i . Then for all ϵ > and n ∈ N , P (| n (cid:213) i = X i | ≥ ϵ ) ≤ exp (− ϵ nc ) . We note that instantaneous costs observed from running Algorithm 1 are different from the costsdefined in the cost function of M( x , s ) . The following lemma gives a concentration bound on the N step observed instantaneous costs and the expected N step cost. In conjunction with Lemma 2.10,these two results are used to give a high probability confidence interval containing the true loss atbase-stock level x , using only observed samples of the instantaneous cost.L EMMA

A.2 (C

ONCENTRATION OF N OBSERVED COST AND EXPECTED COST ). Given a base-stock level x ∈ [ , U ] , let H : =

576 max ( h , p )( L + ) U , γ > and N ≥ log ( T ) γ . Then for any s ∈ S x ,with probability − T , N | N (cid:213) t = C x ( s t ) − N (cid:213) t = C xt | ≤ Hγ . P ROOF . Let F n be the filtration with respect to the states s , s , . . . , s n . Define Y n − : = E [ N (cid:213) t = C xt |F n ] . Then the sequence Y , . . . , Y N forms a Doob martingale sequence with Y = (cid:205) Nt = C x ( s t ) and Y N = (cid:205) Nt = C xt . Furthermore for n ∈ { , , . . . , N } , | Y n − Y n − | = | C xn − + V xN − n + ( s n ) − V xN − n + ( s n − ))| = | C xn − + V xN − n + ( s n ) − ( V xN − n + ( s n − ) + E [ C xN | s n − ])|≤

36 max ( h , p ) Lx + ( h , p ) x ≤

36 max ( h , p )( L + ) x using Lemma 2.5 to bound the difference in values and the fact that the maximum cost in a singletime step with at most x units of inventory in the pipeline is bounded above by max ( h , p ) x to boundone time instance difference in observed and expected cost. Now, we can apply Azuma-Hoeffding’sinequality (Theorem A.1) to derive that for any ϵ > , P (| N (cid:213) i = Y i − Y i − | ≥ ϵ ) ≤ exp (− ϵ N (

36 max ( h , p ) Lx ) ) . Therefore, by setting ϵ =

72 max ( h , p ) Lx (cid:112) N log ( T ) , we obtain that with probability at least − T , N | Y N − Y | = N | N (cid:213) t = C x ( s t ) − N (cid:213) i = C xt | ≤ H (cid:114) log ( T ) N . The result follow by substituting N ≥ log ( T ) γ . □ hipra Agrawal and Randy Jia EMMA

A.3.

Let C N be the vector formed by sequence of observed (pseudo) costs on runningbase-stock policy with level x , with N ≥ log ( T ) γ . Following the definition of LB ( C N ) , U B ( C N ) givenby (8), with probability − T , д x ∈ [ LB ( C N ) , U B ( C N )] . P ROOF . By Lemmas 2.10 and A.2 (noting that C i = C xi ), with probability − T , | N N (cid:213) i = C x ( s i ) − д x | ≤

108 max ( h , p ) Lxγ ≤ Hγ and | N N (cid:213) i = C x ( s i ) − N N (cid:213) i = C i | ≤ Hγ . Hence | N N (cid:213) i = C i − д x | = | N N (cid:213) i = C i − N N (cid:213) i = C x ( s i ) + N N (cid:213) i = C x ( s i ) − д x | ≤ Hγ so N N (cid:213) i = C i − Hγ ≤ д x ≤ N N (cid:213) i = C i + Hγ . The result follows noting from (8) that LB ( C N ) = N N (cid:213) i = C i − Hγ and U B ( C N ) = N N (cid:213) i = C i + Hγ . □ B PROOF DETAILS FOR LEMMA 2.5

In this section we provide the proof details for results used in Lemma 2.5, when L ≥ . Recallthat MRP M( x , s ) is defined such that state s = ( s ( ) , s ( ) , . . . , s ( L )) with s ( ) being the on-handinventory after the current time step’s order arrival and new order, and s ( ) , . . . , s ( L ) are outstandingorders, with s ( L ) being the most recent order, scheduled to arrive L time steps from the current time.New orders are placed such that at every time step t = , , . . . , T we have (cid:205) Li = s t ( i ) = x , where s t isthe state at time t . We observe on-hand inventory level I t : = s t ( ) and sales given by y t : = min ( d t , I t ) .The sales y t also happens to be the order placed in the next time step (the first order y is a bitdifferent, it is such that state s has total inventory level x ). The new state at time t + is given by s t + = ( s t ( ) − y t + s t ( ) , s t ( ) , . . . , s t ( L ) , y t ) . Let n T ( s ) : = (cid:205) Tt = y t denote the sum of sales from time to T , and m T ( s ) : = (cid:205) Tt = I t the sum ofon-hand inventory levels. B.1 Bounding cumulative observed sales

We bound the difference between the total sales in time T starting from two different states s , s ′ whenthe states satisfy the following property given below. Definition B.1.

Define states s : = ( s ( ) , s ( ) , . . . , s ( L )) , s ′ : = ( s ′ ( ) , s ′ ( ) , . . . , s ′ ( L )) . We say that s ′ ⪰ s if s ′ = ( s ( ) + δ , s ( ) + δ , . . . , s ( L ) + δ L ) where δ + δ + . . . + δ L = and there exists some ≤ k ≤ L − such that δ i ≥ for all i ∈ { , , . . . , k } and δ i ≤ for all i ∈ { k + , k + , . . . , L } .We first provide a simple bound on n T ( s ′ ) − n T ( s ) when s ′ ⪰ s and T ≤ L + which will beuseful in our proof for larger T . hipra Agrawal and Randy Jia EMMA

B.2.

Assume s ′ ⪰ s and define Y t : = (cid:205) ti = y t , Y ′ t : = (cid:205) ti = y ′ t to be the total observedsales up to time t starting from state s , s ′ , respectively. Then for t = , , . . . , L + , we have that Y ′ t − Y t ≤ max ≤ k ≤ t − ( δ + . . . + δ k ) . P ROOF . We prove this statement by induction on t . For t = , y ′ − y = min ( s ( ) + δ , d ) − min ( s ( ) , d ) ≤ δ , since s ′ ⪰ s implies that s ( ) + δ ≥ s ( ) . Assume for any time up to t − the hypothesis holds.Then, consider time t and observe that: I ′ t = s ′ t ( ) = ( s ( ) + δ + s ( ) + δ + . . . + s ( t − ) + δ t − ) − ( y ′ + . . . + y ′ t − ) and I t = s t ( ) = ( s ( ) + s ( ) + . . . + s ( t − )) − ( y + . . . + y t − ) , so subtracting we get I ′ t − I t + Y ′ t − − Y t − = δ + δ + . . . + δ t − . (14)Now, we write Y ′ t − Y t = y ′ t − y t + Y ′ t − − Y t − = min ( I ′ t , d t ) − min ( I t , d t ) + Y ′ t − − Y t − . There are four cases to consider:(1) d t ≤ I ′ t , d t ≤ I t : In this case Y ′ t − Y t = d t − d t + Y ′ t − − Y t − = Y ′ t − − Y t − ≤ max ≤ k ≤ t − ( δ + . . . + δ k ) ≤ max ≤ k ≤ t − ( δ + . . . + δ k ) by the induction hypothesis.(2) d t ≥ I ′ t , d t ≥ I t : In this case Y ′ t − Y t = I ′ t − I t + Y ′ t − − Y t − = δ + . . . + δ t − ≤ max ≤ k ≤ t − ( δ + . . . + δ k ) by (14).(3) I t ≤ d t ≤ I ′ t : In this case Y ′ t − Y t = d t − I t + Y ′ t − − Y t − = d t − I ′ t + I ′ t − I t + Y ′ t − − Y t − = d t − I ′ t + δ + . . . + δ t − ≤ δ + . . . + δ t − ≤ max ≤ k ≤ t − ( δ + . . . + δ k ) by (14).(4) I ′ t ≤ d t ≤ I t : In this case Y ′ t − Y t = I ′ t − d t + Y ′ t − − Y t − ≤ Y ′ t − − Y t − ≤ max ≤ k ≤ t − ( δ + . . . + δ k ) ≤ max ≤ k ≤ t − ( δ + . . . + δ k ) by the induction hypothesis.Therefore, we have proven that under the induction hypothesis Y ′ t − Y t ≤ max ≤ k ≤ t − ( δ + . . . + δ k ) and the desired result for all t ∈ { , , . . . , L + } follows by induction. □ L EMMA

B.3.

Consider the MRPs on following base-stock policy with level x starting in states s , s ′ ∈ S x with s ′ ⪰ s . Let I t = s t ( ) , I ′ t = s ′ t ( ) be the on-hand inventory levels in the two processesat time t . Then, if I ′ t − I t ≥ for all t ∈ { , , . . . , L + } , then it holds that n T ( s ′ L + ) = n T ( s L + ) forany T . P ROOF . If we have I ′ t − I t ≥ then the respective sales at time t satisfy y ′ t ≥ y t as well.Therefore, each entry of state s ′ L + = ( I ′ L + , y ′ , y ′ , . . . , y ′ L ) is at least the respective entry of state s L + = ( I L + , y , y , . . . , y L ) . Since the total sum of the entries in each state is equal to x , we concludethat s ′ L + = s L + and hence n T ( s ′ L + ) = n T ( s L + ) for any T . □ Above lemma shows that if we ever observe a t with s ′ t ⪰ s t and the next L consecutive on-handinventory levels are at least as high in the process starting from state s ′ t compared to starting fromstate s t , then the two processes will reach an identical state at time t + L and hence all future observedsales will be the same. Utilizing this property, for states s ′ ⪰ s we can define the following sequenceof times: hipra Agrawal and Randy Jia Definition B.4.

Given starting states s ′ ⪰ s , define a sequence of times = σ < τ < σ < τ < σ < . . . ≤ Γ such that for i ≥ , τ i is the first time after t = σ i − at which I ′ τ i < I τ i , σ i is the first time after t = τ i at which I ′ σ i > I σ i , and Γ is the first time at which s ′ Γ = s Γ . By the previous lemma, τ i − σ i − ≤ L + and σ i − τ i ≤ L + (whenever τ i , σ i exist).L EMMA

B.5.

Given starting states s ′ ⪰ s and the sequence defined above, s ′ σ i ⪰ s σ i and s ′ τ i ⪯ s τ i for all i , where τ i , σ i ≤ Γ . P ROOF . We have s ′ σ ⪰ s σ is the starting state at time t = = σ . If time t = τ exists, then τ − σ ≤ L + by Lemma B.3. Furthermore, we can show that s ′ τ ⪯ s τ . To see this, s ′ τ = ( I ′ τ , s ′ σ ( τ − σ ) , s ′ σ ( τ + − σ ) , . . . , s ′ σ ( L ) , y ′ σ , . . . , y ′ τ − ) and s τ = ( I τ , s σ ( τ − σ ) , s σ ( τ + − σ ) , . . . , s σ ( L ) , y σ , . . . , y τ − ) . By definition of τ , for times t ∈ { σ , σ + , . . . , τ − } we have I ′ t ≥ I t and hence y ′ t ≥ y t . Wealso know that I ′ τ < I τ . It suffices to show that s ′ σ ( i ) ≤ s σ ( i ) for all i ∈ { τ − σ , τ + − σ , . . . , L } .Recall I ′ τ = I ′ τ − − y ′ τ − + s ′ σ ( τ − − σ ) and I τ = I τ − − y τ − + s σ ( τ − − σ ) so that: I ′ τ − − y ′ τ − + s ′ σ ( τ − − σ ) < I τ − − y τ − + s σ ( τ − − σ ) ≤ I ′ τ − − y ′ τ − + s σ ( τ − − σ ) where the last inequality is because I ′ τ − ≥ I τ − implies (for any demand d τ − ), I ′ τ − − y ′ τ − = I ′ τ − − min ( I ′ τ − , d τ − ) ≥ I τ − − min ( I τ − , d τ − ) = I τ − − y τ − . Hence, s ′ σ ( τ − − σ ) < s σ ( τ − − σ ) and because s ′ ⪰ s , s ′ σ ( i ) ≤ s σ ( i ) holds for all i ∈ { τ − σ , τ + − σ , . . . , L } . So we have shown that s ′ τ ⪯ s τ .We can inductively apply the above argument for each successive σ i , τ i , so that s ′ σ i ⪰ s σ i and s ′ τ i ⪯ s τ i for all i . □ Finally, we are ready to bound the difference in total observed sales in time T between two states s ′ ⪰ s under policy π x .L EMMA

B.6.

Let s ′ , s ∈ S x , and s ′ ⪰ s . Then, | n xT ( s ′ ) − n xT ( s )| ≤ x . P ROOF . Let sequence σ , τ , σ , . . . be as in Definition B.4. First we show that n xT ( s ′ ) − n xT ( s ) ≥ − x . Let us assume that in our sequence of times the last σ is σ M . Then note that n xT ( s ′ ) − n xT ( s ) = T (cid:213) t = y ′ t − T (cid:213) t = y t = M − (cid:213) i = (cid:32) σ i + − (cid:213) j = σ i ( y ′ j − y j ) (cid:33) + T (cid:213) t = σ M ( y ′ t − y t ) . We will show that (cid:205) σ i + − j = σ i ( y ′ j − y j ) ≥ for any i = , , . . . , M − . Consider the process startingfrom states s ′ σ i , s σ i , where s ′ σ i ⪰ s σ i by the previous lemma. hipra Agrawal and Randy Jia ( y ′ σ i + . . . + y ′ τ i + − ) − ( y σ i + . . . + y τ i + − ) = [( s ′ σ i ( ) − s σ i ( )) + . . . + ( s ′ σ i ( τ i + − − σ i ) − s σ i ( τ i + − − σ i ))] − ( I ′ τ i + − I τ i + ) . Now consider the process starting from states s ′ τ i + ⪯ s τ i + . Recall that s ′ τ i + = ( I ′ τ i + , s ′ σ i ( τ i + − σ i ) , s ′ σ i ( τ i + + − σ i ) , . . . , s ′ σ i ( L ) , y ′ σ i , . . . , y ′ τ i + − ) and s τ i + = ( I τ i + , s σ i ( τ i + − σ i ) , s σ i ( τ i + + − σ i ) , . . . , s σ i ( L ) , y σ i , . . . , y τ i + − ) , and as proved in the previous lemma I ′ τ i + < I τ i + , s σ ′ i ( i ) ≤ s σ i ( i ) for all i ∈ { τ i + − σ i , . . . , L } , and y ′ t ≥ y t for all t ∈ { σ i , . . . , τ i + − } . So we have by Lemma B.2 that ( y τ i + + . . . + y σ i + − ) − ( y ′ τ i + + . . . + y ′ σ i + − )≤ ( I τ i + − I ′ τ i + ) + [( s σ i ( τ i + − σ i ) − s ′ σ i ( τ i + − σ i )) + . . . + ( s σ i ( L ) − s ′ σ i ( L ))] = ( I τ i + − I ′ τ i + ) + [( s ′ σ i ( ) − s σ i ( )) + . . . + ( s ′ σ i ( τ i + − − σ i ) − s σ i ( τ i + − − σ i ))] where the last equality follows from the fact that the sum of the entries in the states is always thesame.Combining the two results, we have that for any i = , , . . . , M − , σ i + − (cid:213) j = σ i ( y ′ j − y j ) = ( y ′ σ i + . . . + y ′ σ i + − ) − ( y σ i + . . . + y σ i + − ) ≥ . Therefore, we can conclude that n xT ( s ′ ) − n xT ( s ) ≥ T (cid:213) t = σ M ( y ′ t − y t ) = ˆ Γ (cid:213) t = σ M ( y ′ t − y t ) , where ˆ Γ : = min ( Γ , T ) . By our construction of the σ , τ sequence, ˆ Γ − σ M + ≤ ( L + ) . Note thatover any L + consecutive time steps, the total observed sales difference in those L + times is atmost x for any two starting states. So n xT ( s ′ ) − n xT ( s ) ≥ (cid:205) ˆ Γ t = τ M ( y ′ t − y t ) ≥ − x .To complete the proof, we show in a similar way that n xT ( s ′ ) − n xT ( s ) ≤ x . Let us assume that inour sequence of times the last τ is τ N . Then note that n xT ( s ′ ) − n xT ( s ) = T (cid:213) t = y ′ t − T (cid:213) t = y t = τ − (cid:213) t = ( y ′ t − y t ) + N − (cid:213) i = (cid:32) τ i + − (cid:213) j = τ i ( y ′ j − y j ) (cid:33) + T (cid:213) t = τ N ( y ′ t − y t ) . For any i = , , . . . , N − , consider the process starting from states s ′ τ i , s τ i , where s ′ τ i ⪯ s τ i bythe previous lemma. By an identical argument as above, (cid:205) τ i + − j = τ i ( y ′ j − y j ) ≤ , and (cid:205) Tt = τ N ( y ′ t − y t ) = (cid:205) ˆ Γ t = τ N ( y ′ t − y t ) ≤ x . Noting that there are at most L + time steps in (cid:205) τ − t = ( y ′ t − y t ) , it is boundedby x , so we have shown that n xT ( s ′ ) − n xT ( s ) ≤ x and hence with the other result, | n xT ( s ′ ) − n xT ( s )| ≤ x . □ hipra Agrawal and Randy Jia B.2 Bounding cumulative on-hand inventory level L EMMA

B.7.

Let s ′ , s ∈ S x , and s ′ ⪰ s . Then, | m xT ( s ) − m xT ( s ′ )| ≤ Lx . P ROOF . Recall from before y t , I t is the observed sales and on-hand inventory level at the beginningof time t ≥ , respectively. Under base-stock level x policy, the order placed at time t is precisely y t (assume without loss of generality y = , that is, we start at a state with total inventory level x ).Also note that with starting state s = ( s ( ) , s ( ) , . . . , s ( L )) it makes sense to denote y = S ( L ) , y − = s ( L − ) , . . . , y − L = s ( ) . The on-hand inventory level transitions as follows: I t + = I t − y t + y t − L . Therefore, we can write that the inventory level at some time k ≥ : I k = I + k − (cid:213) j = ( y j − L − y j ) and hence the total sum of all inventory levels to time T is: T (cid:213) k = I k = T (cid:213) k = ( I + k − (cid:213) j = ( y j − L − y j )) = T I + T (cid:213) k = k − (cid:213) j = ( y j − L − y j ) = T I + T − (cid:213) i = ( T − i )( y i − L − y i ) = T I + T − (cid:213) i = ( T − i ) y i − L − T − (cid:213) i = ( T − i ) y i . Now, if we break up the summations on the right hand side and reindex, T − (cid:213) i = ( T − i ) y i − L = L (cid:213) i = ( T − i ) y i − L + T − (cid:213) i = L + ( T − i ) y i − L = L (cid:213) i = ( T − i ) y i − L + T − L − (cid:213) i = ( T − L − i ) y i and T − (cid:213) i = ( T − i ) y i = T − L − (cid:213) i = ( T − i ) y i + T − (cid:213) i = T − L ( T − i ) y i , so that: T (cid:213) k = I k = T I + T − (cid:213) i = ( T − i ) y i − L − T − (cid:213) i = ( T − i ) y i = T I + L (cid:213) i = ( T − i ) y i − L − (cid:32) T − L − (cid:213) i = Ly i + T − (cid:213) i = T − L ( T − i ) y i (cid:33) . Define ϵ : = T − (cid:213) i = Ly i − ( T − L − (cid:213) i = Ly i + T − (cid:213) i = T − L ( T − i ) y i ) ≥ hipra Agrawal and Randy Jia ϵ = L − (cid:213) i = iy T − L + i ≤ L L − (cid:213) i = y T − L + i ≤ Lx because in L − consecutive time steps, the total sales following π x cannot exceed x .We can write T (cid:213) k = I k = T I + L (cid:213) i = ( T − i ) y i − L − T − (cid:213) i = Ly i + ϵ = T I + L (cid:213) i = ( T − i ) y i − L − T − (cid:213) i = Ly i + ϵ = L (cid:213) i = ( T − i ) s ( i ) − L T − (cid:213) i = y i + ϵ from the known values of y , . . . , y − L .Now let I ′ t , y ′ t , s ′ ( i ) , ϵ ′ be the respective values if the starting state is s ′ instead of s , and that s ′ ⪰ s .By above, the difference | m Tx ( s ′ ) − m Tx ( s )| can be bounded as | m Tx ( s ′ ) − m Tx ( s )| = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T (cid:213) k = I ′ k − T (cid:213) k = I k (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) L (cid:213) i = ( T − i ) s ′ ( i ) − L (cid:213) i = ( T − i ) s ( i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + L (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) T − (cid:213) i = ( y ′ i − y i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + | ϵ ′ − ϵ |≤ ( T x − ( T − L ) x ) + L ( x ) + ( Lx ) = Lx where we bound | (cid:205) Li = ( T − i ) s ′ ( i ) − (cid:205) Li = ( T − i ) s ( i )| by the largest possible value that occurs when s ′ = ( x , , . . . , ) and s = ( , . . . , , x ) , and use Lemma B.6 to bound (cid:205) T − i = ( y ′ i − y i ) . □ C PROOF DETAILS FOR THEOREM 1.3

Below we present additional results required for Theorem 1.3. Recall that f ( x ) : = д x is a convexfunction and our confidence intervals are defined as in (8). Also recall that E is the event when allconfidence intervals [ LB ( C aN ) , U B ( C aN )] calculated in Algorithm 1 satisfy: д x a ∈ [ LB ( C aN ) , U B ( C aN )] for every epoch k , round i and a ∈ { l , c , r } .L EMMA

C.1.

For f ( x ) : = д x and x ∈ [ , U ] , the Lipschitz factor of f ( x ) is max ( h , p ) . That is, for δ ≥ , | f ( x + δ ) − f ( x )| ≤ max ( h , p ) δ . P ROOF . Let us compare the loss д x + δ vs. д x on executing base-stock policy with level x + δ vs. x . Let us assume the starting state for the two MRPs are s = ( x + δ , , . . . , ) and s = ( x , , . . . , ) respectively (recall from Lemma 2.6 that loss is independent of the starting state). We compare thetwo losses by coupling the execution of the two MRPs. For every time t , let s t : = ( I t , o t − L + , . . . , o t ) be the state of the system following policy with level x + δ , and s t : = ( I t , o t − L + , . . . , o t ) be thestate of the system following policy with level x . Define s ≥ s if every entry in s is at least therespective entry in s .We will first show by induction that at each time step t , s t ≥ s t . In the first time step, we have s = ( x + δ , , . . . , ) ≥ ( x , , . . . , ) = s . From then on, the new order placed in time t + is the hipra Agrawal and Randy Jia t . Therefore if at time t we have that s t ≥ s t , then the ordersat time t + satisfy o t + = min ( d t , I t ) ≥ min ( d t , I t ) = o t + . Also, I t + = ( I t − min ( d t , I t )) + o t − L ≥( I t − min ( d t , I t )) + o t − L = I t + . Hence we have s t + ≥ s t + . By induction, we have that for every t ≥ , s t ≥ s t .We complete the proof by noting that additionally, at every time t , the total sum of the entries of s t is exactly δ greater than the sum of the entries of s t . Therefore, the difference ≤ I t − I t ≤ δ for every t , which implies the difference in sales ≤ y t − y t = min ( d t , I t ) − min ( d t , I t ) ≤ δ . Also ≤ ( I t − y t ) − ( I t − y t ) ≤ δ . Recall pseudo-cost C x + δt = ( I t − y t ) h − py t , and C xt = ( I t − y t ) h − py t ,therefore, we have for every t and every sequence of demand realizations, | C x + δt − C xt | ≤ max ( h , p ) δ . By definition of loss f ( x ) = д x as average of pseudo-costs (see Definition 2.2), we have | f ( x + δ ) − f ( x )| ≤ max ( h , p ) δ . □ The proofs for the remaining lemmas provided below are similar to the proofs of the correspondinglemmas in [1]. We include the proofs here for completeness.L

EMMA

C.2 (L

EMMA IN [1]). Recall [ l k , r k ] denotes the working interval in epoch k ofAlgorithm 1, with [ l , r ] : = [ , U ] . Then, under event E , for epoch k ending in round i , the workinginterval [ l k + , r k + ] for the next epoch k + contains every x ∈ [ l k , r k ] such that f ( x ) ≤ f ( x ∗ ) + Hγ i ,where H =

576 max ( h , p )( L + ) U . In particular, x ∗ ∈ [ l k , r k ] for all epochs k . P ROOF . Under Algorithm 1, the epoch k ends in round i because max { LB ( C lN ) , LB ( C rN )} ≥ min { U B ( C lN ) , U B ( C cN ) , U B ( C rN )} + Hγ i . Hence either:(1) LB ( C lN ) ≥ U B ( C rN ) + Hγ i ,(2) LB ( C rN ) ≥ U B ( C lN ) + Hγ i , or(3) max { LB ( C lN ) , LB ( C rN )} ≥ U B ( C cN ) + Hγ i .Consider the case (1) (case (2) is analogous). Then, f ( x l ) ≥ f ( x r ) + Hγ i . We need to show that every x ∈ [ l k , l k + ] has f ( x ) ≥ f ( x ∗ ) + Hγ i . Pick x ∈ [ l k , x l ] so that x l ∈ [ x , x r ] .Then x l = tx + ( − t ) x r for some ≤ t ≤ so by convexity f ( x l ) ≤ t f ( x ) + ( − t ) f ( x r ) . This implies that f ( x ) ≥ f ( x r ) + f ( x l ) − f ( x r ) t ≥ f ( x r ) + Hγ i t ≥ f ( x ∗ ) + Hγ i , where we used that t ≤ .Now consider the case (3). Assume without loss of generality that LB ( C lN ) ≥ LB ( C rN ) . Then wehave f ( x l ) ≥ f ( x c ) + Hγ i . We need to show that every x ∈ [ l k , l k + ] has f ( x ) ≥ f ( x ∗ ) + Hγ i . This follows from the sameargument as above with x r replaced by x c . The fact that x ∗ ∈ [ l k , r k ] for all epochs k follows byinduction. □ hipra Agrawal and Randy Jia EMMA

C.3 (L

EMMA IN [1]). Under E , if epoch k does not end in round i , then f ( x ) ≤ f ( x ∗ ) + Hγ i for each x ∈ { x r , x c , x l } . P ROOF . Under Algorithm 1, round i continues to round i + if max { LB ( C lN ) , LB ( C rN )} < min { U B ( C lN ) , U B ( C cN ) , U B ( C rN )} + Hγ i . We observe that since each confidence interval is of length Hγ i , this means that f ( x l ) , f ( x c ) , f ( x r ) are contained in an interval of length at most Hγ i . By Lemma C.2, x ∗ ∈ [ l k , r k ] . Without loss ofgenerality, assume x ∗ ≤ x c . Then there exists t ≥ such that x ∗ = x c + t ( x c − x r ) , so that x c = + t x ∗ + t + t x r . Note that t ≤ because | x c − l k | = w k and | x r − x c | = w k , so t = | x ∗ − x c || x r − x c | ≤ | l k − x c || x r − x c | = w k / w k / = . Since f is convex, f ( x c ) ≤ + t f ( x ∗ ) + t + t f ( x r ) and so f ( x ∗ ) ≥ ( + t ) (cid:16) f ( x c ) − t + t f ( x r ) (cid:17) = f ( x r ) + ( + t )( f ( x c ) − f ( x r ))≥ f ( x r ) − ( + t )| f ( x c ) − f ( x r )|≥ f ( x r ) − ( + t ) Hγ i ≥ f ( x r ) − Hγ i . Thus for each x ∈ { x l , x c , x r } , f ( x ) ≤ f ( x r ) + Hγ i ≤ f ( x ∗ ) + Hγ i . □ L EMMA

C.4 (L

EMMA IN [1]). Under E , the total number of epochs K is bounded by log / ( T ) . P ROOF . Observe that for any round i that does not terminate the algorithm, N i = log ( T ) γ i ≤ T (sincealgorithm terminates upon reaching T time steps), which implies γ i ≥ (cid:113) log ( T ) T . Since γ i + = γ i , letus define γ min : = (cid:113) log ( T ) T so that γ min ≤ γ i for any γ i . Define the interval I : = [ x ∗ − Hγ min max ( h , p ) , x ∗ + Hγ min max ( h , p ) ] , so that for any x ∈ I , f ( x ) − f ( x ∗ ) ≤ max ( h , p )| x − x ∗ | ≤ Hγ min by Lemma C.1. Now, for any epoch k ′ which ends in round i ′ , Hγ min ≤ Hγ i ′ and hence by LemmaC.2 we have I ⊆ { x ∈ [ , U ] : f ( x ) ≤ f ( x ∗ ) + Hγ i ′ } ⊆ [ l k ′ + , r k ′ + ] . So for any epoch k ′ , the length of interval I is no more than the length of interval [ l k ′ + , r k ′ + ] , and so Hγ min max ( h , p ) ≤ w k ′ + . hipra Agrawal and Randy Jia w k ′ + ≤ w k ′ for any k ′ = , , . . . , K − , we have that for k ′ = K − , Hγ min max ( h , p ) = H max ( h , p ) (cid:114) log ( T ) T ≤ w K ≤ ( ) K − w = ( )( ) K ( U ) . Rearranging the inequality we get that K ≤

12 log / ( ( h , p ) ( U ) T H log ( T ) ) ≤ log / ( T ) since H =

576 max ( h , p )( L + ) U . □ L EMMA

C.5 (L

EMMA IN [1]). Recall T k , i , a is the set of consecutive times where base stockpolicy level x a is played in round i of epoch k , for a ∈ { l , c , r } . Then under E , we can bound (overtime steps of all such k , i , a ) (cid:213) k , i , a , t ∈ T k , i , a f ( x ( t )) − f ( x ∗ ) ≤ H log / ( T ) (cid:112) T log ( T ) . P ROOF . Let us first fix an epoch k and assume it ends in round i ( k ) . If i ( k ) = , then by LemmaC.1, (cid:213) i , a , t ∈ T k , i , a ( f ( x t ) − f ( x ∗ )) ≤ N max ( h , p )| x t − x ∗ | ≤ (cid:18) ( T ) γ (cid:19) max ( h , p ) U . (15)Otherwise, if i ( k ) > , then (cid:213) i , a , t ∈ T k , i , a ( f ( x t ) − f ( x ∗ )) = i ( k )− (cid:213) i = (cid:213) a , t ∈ T k , i , a ( f ( x t ) − f ( x ∗ )) + (cid:213) a , t ∈ T k , i ( k ) , a ( f ( x t ) − f ( x ∗ )) . By Lemma C.3, for each x t ∈ { x r , x c , x l } , f ( x t ) − f ( x ∗ ) ≤ Hγ i for all i = , , . . . , i ( k ) − . Also, γ i ( k )− = γ i ( k ) , so when i ( k ) > , (cid:213) i , a , t ∈ T k , i , a ( f ( x t ) − f ( x ∗ )) ≤ i ( k )− (cid:213) i = (cid:213) a , t ∈ T k , i , a ( Hγ i ) + (cid:213) a , t ∈ T k , i ( k ) , a ( Hγ i ( k )− )≤ i ( k )− (cid:213) i = ( N i )( Hγ i ) + ( N i ( k ) )( Hγ i ( k ) )≤ i ( k ) (cid:213) i = N i Hγ i ≤ H log ( T ) γ min , hipra Agrawal and Randy Jia i ( k ) , (cid:213) i , a , t ∈ T k , i , a ( f ( x t ) − f ( x ∗ )) ≤ (cid:18) ( T ) γ (cid:19) max ( h , p ) U + (cid:213) i = N i Hγ i (since γ min ≤ γ = / ) ≤ (cid:18) ( T ) γ min (cid:19) max ( h , p ) U + H log ( T ) γ min (since H =

576 max ( h , p )( L + ) U ) ≤ H log ( T ) γ min . Therefore, over all epochs k , by Lemma C.4 (cid:213) k , i , a , t ∈ T k , i , a ( f ( x t ) − f ( x ∗ )) ≤ log / ( T ) (cid:18) H log ( T ) γ min (cid:19) , and the result follows from substituting γ min = (cid:113) log ( T ) T ..