How Linear Reward Helps in Online Resource Allocation
Guanting Chen† Xiaocheng Li♦ Yinyu Ye‡
†Institute for Computational and Mathematical Engineering, Stanford University
♦Imperial College Business School
‡Department of Management Science and Engineering, Stanford University
{guanting, chengli1, yinyu-ye}@stanford.edu

Abstract
In this paper, we consider an online stochastic resource allocation problem which takes a linear program as its underlying form. We analyze an adaptive allocation algorithm and derive a constant regret bound that does not depend on the number of time periods (the number of decision variables), under the condition that the objective coefficients of the linear program are linear in the corresponding constraint coefficients. Furthermore, the constant regret bound does not require knowledge of the underlying distribution.
1 Introduction

In this paper, we present and analyze an algorithm that achieves bounded regret for a class of online allocation problems. Specifically, the allocation problem can be formulated as a linear program (LP):

max Σ_{j=1}^n r_j^⊤ x_j    (1)
s.t. Σ_{j=1}^n A_j x_j ≤ b
     1^⊤ x_j ≤ 1, x_j ≥ 0, j = 1, ..., n,

where r_j = (r_{j1}, ..., r_{jl})^⊤ ∈ R^l, A_j = (a_{j1}, ..., a_{jl}) ∈ R^{m×l}, and a_{js} = (a_{1js}, ..., a_{mjs})^⊤ for j = 1, ..., n and s = 1, ..., l. The decision variables are x = (x_1, ..., x_n) where x_j = (x_{j1}, ..., x_{jl})^⊤ for j = 1, ..., n.

In the online setting, the parameters of the optimization problem (1) are revealed in an online fashion and one needs to determine the values of the decision variables sequentially. Specifically, at each time t, the coefficients (r_t, A_t) are revealed, and we need to decide the value of x_t instantly. Different from the offline setting, at time t we do not have the information of the subsequent coefficients to be revealed, i.e., {(r_j, A_j)}_{j=t+1}^n. Problem (1) in the online setting is often referred to as Online Linear Programming (Agrawal et al., 2014; Kesselheim et al., 2014; Li, 2020). However, in this paper, since the reward r_t is linear in the resource A_t, maximizing the reward is equivalent to maximizing the efficiency of the resource allocation process. For this reason, we identify our formulation as an online allocation problem. The allocation problem with different specifications of the coefficients encompasses a wide range of classic and modern applications, including the secretary problem (Ferguson et al., 1989), the knapsack problem (Kellerer et al., 2003), the resource allocation problem (Dantzig, 1965), the network routing problem (Buchbinder and Naor, 2009), the matching and AdWords problem (Mehta et al., 2005), the combinatorial auction problem (Agrawal et al., 2009), etc.

There has been a proliferation of literature developing and analyzing algorithms for the online allocation problem. One stream investigates conditions for the existence of an online algorithm that achieves bounded regret (not dependent on the number of decision variables or decision time periods). Specifically, a line of works (Jasin and Kumar, 2012; Wu et al., 2015; Bumpensanti and Wang, 2020a) study the canonical quantity-based network revenue management problem and design algorithms that achieve bounded regret under knowledge of the customer arrival distribution. A common algorithmic design in these works is the re-solving technique, which computes a control policy by periodically solving a linear program specified by the known arrival distribution. One challenge for this problem is the degeneracy of the underlying LP; it has been overcome by Bumpensanti and Wang (2020a) via an infrequent re-solving scheme which updates the control policy only at a few selected points. Another line of works (Vera et al., 2019; Vera and Banerjee, 2019; Banerjee and Freund, 2020) devise algorithms that achieve bounded regret for various problems including dynamic pricing, the knapsack problem, and the bin packing problem. The authors develop a novel and intuitive approach called "compensated coupling" to derive regret upper bounds for an online algorithm. The idea is to bound the cumulative expected loss induced by the decisions of the online algorithm against the offline benchmark.
As in the aforementioned works on network revenue management, this line of works is also built upon knowledge of the underlying distribution.

Efforts have also been made to relax the assumption of knowing the underlying distribution. Recently, Banerjee et al. (2020) consider a single historical trace of observations as a substitute for knowledge of the true distribution and derive an algorithm that achieves bounded regret for certain constrained online optimization problems. In a certain sense, the single historical trace contributes Ω(n) observations (with n being the length of the horizon). In a similar spirit, Shivaswamy and Joachims (2012) characterize the number of historical observations needed to achieve bounded regret for the stochastic multi-armed bandit problem. The historical observations required in these two works can be viewed as a warm start for the online procedure, and the empirical distribution constructed from the historical observations provides a moderately good estimate of the true underlying distribution at the very beginning of the online procedure. In this work, we pursue further along this line and identify conditions under which bounded regret can be achieved without any historical observations.

The closest work to our result is the paper (Asadpour et al., 2019), where the authors derive bounded regret for a resource allocation problem to study the effectiveness of the long-chain design without knowing the true distribution. Technically, the formulation in (Asadpour et al., 2019) can be cast in (1) by imposing a binary structure on the constraint matrices A_j, along with certain other conditions. Meanwhile, Li and Ye (2019) study the online linear programming problem with a provable log(n) regret bound, and the authors observe empirically that bounded regret can be achieved on some problem instances through an action-history-dependent design. Our results generalize the above findings and aim to reveal the mathematical structure that guarantees bounded regret.

This paper is organized as follows. In Section 2 we introduce the model, the algorithm, and the main results. In Sections 3 and 4 we present the analysis of bounded regret under different conditions. For the underlying LP, we categorize the resources into two parts, the binding resources and the non-binding resources, and we provide a different regret analysis for each. Accordingly, in Section 3 we provide the regret analysis for the special case where all resources are binding. In Section 4 we tackle the more general setup where there are both binding and non-binding resources. Extensions to multi-dimensional problems are presented in Section 5.

2 Main Results
For now, we focus on the one-dimensional case where l = 1 in (1), for notational simplicity. The problem then reduces to a one-dimensional online linear programming problem:

max Σ_{j=1}^n r_j x_j    (2)
s.t. Σ_{j=1}^n a_j x_j ≤ b
     0 ≤ x_j ≤ 1, j = 1, ..., n,

where a_j = (a_{1j}, ..., a_{mj})^⊤ ∈ R^m and the decision variables are x = (x_1, ..., x_n) ∈ R^n. There are m constraints and n decision variables. Throughout the paper, we use i to index the constraints and j or t to index the decision variables.

Assumption 1 (Distribution). We assume:
(a) The coefficient pairs (r_j, a_j) are i.i.d. samples from a distribution P. The distribution P has a finite and known support {(v_k, u_k)}_{k=1}^K where v_k ∈ R and u_k ∈ R^m. Specifically, P((r_j, a_j) = (v_k, u_k)) = p_k for k = 1, ..., K, and the parameters p = (p_1, ..., p_K)^⊤ are unknown.
(b) There exists a vector λ ≥ 0 such that v_k = u_k^⊤ λ for k = 1, ..., K.
(c) The right-hand side b = n d ≥ 0 where d = (d_1, ..., d_m)^⊤.

Assumption 1 (a) states that the support of the distribution is finite and known but the parameters are unknown. Assumption 1 (b) says that the objective coefficient r_j is linear in the constraint coefficient vector a_j. This linear dependency holds in application contexts such as resource allocation (Asadpour et al., 2019) and hospital scheduling (Conforti et al., 2014); it requires that the return received in each time period be linearly dependent on the allocated resource. Moreover, the linear growth condition b = nd is widely adopted in the context of online allocation problems (Asadpour et al., 2019; Li and Ye, 2019; Bumpensanti and Wang, 2020a). In our context, the condition is mild in that if b = o(n) we could always set a time horizon n' ≪ n such that b = n'd, and the linear growth condition holds for n'.

In the online allocation problem, at each time t we decide the value of x_t: when x_t = 1, we allocate the resource a_t to the request, and when x_t = 0, we reject the request. Partial acceptance of a request is allowed. As in the offline problem, we need to conform to the constraints throughout the procedure, i.e., no shorting of the resources is allowed. We consider the regret as the performance measure, formally defined as

Reg_n^π = E[ R*_n − Σ_{t=1}^n r_t x_t ],

where the random variable R*_n represents the optimal objective value of the "offline" problem (2) and the x_t's represent the online solution. Here π denotes the online algorithm/policy used in making the online decisions, and the expectation is taken with respect to the (r_j, a_j)'s.

From Assumption 1 we can derive an underlying linear program comparable to the "offline" LP (2):

max Σ_{k=1}^K v_k p_k q_k
s.t. Σ_{k=1}^K u_k p_k q_k ≤ d
     0 ≤ q_k ≤ 1, k = 1, ..., K.    (3)

The intuition is that at each time step, the reward and resource consumption are drawn from the distribution specified in Assumption 1 (a), and the resource constraint becomes the averaged resource capacity d := b/n. We refer to (3) as the Deterministic Linear Program (DLP).

To better characterize the online procedure, we introduce a few additional notations. Define b_t as the remaining constraint capacity at the beginning of time t and d_t = b_t/(n − t + 1) as the average constraint capacity for the remaining time periods. Specifically, we use b_{n+1} to denote the remaining constraint capacity at the end of the horizon, and the initial d_1 = (d_{1,1}, ..., d_{1,m})^⊤ = b_1/n = b/n = d. Also, let n_k(t) be the number of observations of (v_k, u_k) up to time t, for k = 1, ..., K.
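To make the setup concrete, here is a minimal Python sketch (not from the paper) that generates a random instance satisfying Assumption 1 and solves both the offline LP (2) and the DLP (3) with scipy.optimize.linprog; the dimensions, the vector λ, and the capacity level are illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, K, n = 3, 5, 1000

# Known support: columns u_k and rewards v_k = u_k^T lambda (Assumption 1(b)).
U = rng.uniform(0.0, 1.0, size=(m, K))      # u_k stored as columns
lam = rng.uniform(0.5, 1.5, size=m)         # hypothetical lambda >= 0
v = U.T @ lam                               # v_k = u_k^T lambda
p = rng.dirichlet(np.ones(K))               # unknown type probabilities
d = 0.8 * U @ p                             # average capacity; b = n d
b = n * d

# Draw the i.i.d. types for the whole horizon and form (r_t, a_t).
types = rng.choice(K, size=n, p=p)
r, A = v[types], U[:, types]

# Offline LP (2): max r^T x  s.t.  A x <= b, 0 <= x <= 1.
off = linprog(-r, A_ub=A, b_ub=b, bounds=[(0, 1)] * n, method="highs")
print("offline value R*_n:", -off.fun)

# DLP (3): max sum_k v_k p_k q_k  s.t.  sum_k u_k p_k q_k <= d, 0 <= q <= 1.
dlp = linprog(-(v * p), A_ub=U * p, b_ub=d, bounds=[(0, 1)] * K, method="highs")
print("DLP value per period:", -dlp.fun)
```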
Since no shorting is allowed, the remaining constraint vector b_{n+1} must be element-wise non-negative.

From Assumption 1 again, one can write the "offline" LP (2) at time t as a realized sample of the DLP (3):

max Σ_{k=1}^K v_k (n_k(t)/t) q_k
s.t. Σ_{k=1}^K u_k (n_k(t)/t) q_k ≤ d_t
     0 ≤ q_k ≤ 1, k = 1, ..., K.    (4)

We call this formulation the Sample Linear Program (SLP). One observation is that when t = n, an optimal solution q* of (4) can be transformed into an optimal solution x* of (2): for every integer t ∈ [1, n], we can identify a mapping T : N → N such that (r_t, a_t) = (v_{T(t)}, u_{T(t)}), and let x*_t = q*_{T(t)}. This idea is used in constructing Algorithm 1.

Also, we remark that the linearity condition in Assumption 1 (b) provides an alternative representation of the (online) objective value. Specifically,

Σ_{t=1}^n r_t x_t = λ^⊤ Σ_{t=1}^n a_t x_t = λ^⊤ b − λ^⊤ b_{n+1},

where b_{n+1} denotes the remaining constraint capacity at the end of the horizon. The implication is that for this type of allocation problem, the goal is to deplete the constraints/resources as much as possible. Consequently, the main challenge is to balance the consumption of the different constraints: the "right" amount of resource allocation is given by (3), and we should try to avoid the case where some constraint is exhausted while another constraint still has much remaining capacity compared to (3).
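The telescoping identity above follows directly from r_t = λ^⊤ a_t; a short numeric sanity check (with arbitrary made-up data and an arbitrary 0/1 decision sequence) is:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 200
lam = rng.uniform(0.5, 1.5, size=m)        # the vector from Assumption 1(b)
A = rng.uniform(0.0, 1.0, size=(m, n))     # columns a_t
r = A.T @ lam                              # r_t = lambda^T a_t
x = rng.integers(0, 2, size=n)             # arbitrary online decisions
b = A.sum(axis=1)                          # some initial capacity
b_end = b - A @ x                          # remaining capacity b_{n+1}
# sum_t r_t x_t == lambda^T b - lambda^T b_{n+1}
print(np.isclose(r @ x, lam @ b - lam @ b_end))  # True
```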
The main algorithm for solving the online allocation problem is presented in Algorithm 1. At each time t, the algorithm solves the linear program (4) to compute the probability of acceptance for each type of observation (v_k, u_k). In (4), the left-hand side of the constraints and the objective function are both based on the empirical counts of past observations, and the right-hand side of the constraints is the normalized resource capacity. The algorithm follows the same spirit as the probabilistic allocation control algorithm with re-estimation and re-solving in (Jasin, 2015). In our algorithm, the re-solving frequency, i.e., the frequency of updating the acceptance probabilities, is set to one.

Algorithm 1: Adaptive Allocation Algorithm
Input: b, n, {(v_k, u_k)}_{k=1}^K
Initialize b_1 = b, d_1 = b_1/n; set x_1 = 1
for t = 2, ..., n do
    Compute b_t = b_{t−1} − a_{t−1} x_{t−1} and d_t = b_t/(n − t + 1)
    Solve the linear program in the decision variables (q_1, ..., q_K):
        max Σ_{k=1}^K v_k n_k(t−1) q_k    (5)
        s.t. Σ_{k=1}^K u_k n_k(t−1) q_k ≤ (t−1) d_t
             0 ≤ q_k ≤ 1, k = 1, ..., K
    Denote the optimal solution by q*_t = (q*_{1t}, ..., q*_{Kt})
    Observe (r_t, a_t) and identify (r_t, a_t) = (v_k, u_k) for some k. If type k has not occurred before, solve (5) again with n_k(t−1) set to 1 and obtain the solution q*_t = (q*_{1t}, ..., q*_{Kt}); then set n_k(t−1) back to 0
    Set x_t = 1 with probability q*_{kt} and x_t = 0 with probability 1 − q*_{kt}, when the constraints permit; otherwise set x_t = 0
    Update the counts: n_k(t) = n_k(t−1) + 1 if (r_t, a_t) = (v_k, u_k), and n_j(t) = n_j(t−1) for j ≠ k
end for
Output: x = (x_1, ..., x_n)
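Below is a minimal, hypothetical Python sketch of Algorithm 1 with the LP (5) solved by scipy.optimize.linprog. The instance generator and the all-or-nothing capacity check are simplifying assumptions of this sketch (the model allows partial acceptance); only the re-solving logic mirrors the pseudocode above.

```python
import numpy as np
from scipy.optimize import linprog

def solve_slp(v, U, counts, rhs):
    """LP (5): max sum_k v_k n_k q_k  s.t.  sum_k u_k n_k q_k <= rhs, 0 <= q <= 1."""
    w = counts.astype(float)
    res = linprog(-(v * w), A_ub=U * w, b_ub=rhs,
                  bounds=[(0, 1)] * len(v), method="highs")
    return res.x

def adaptive_allocation(v, U, b, n, types, rng):
    K = U.shape[1]
    counts = np.zeros(K, dtype=int)
    b_t, x = b.copy(), np.zeros(n)
    for t in range(n):                      # 0-based; time period is t + 1
        k = types[t]
        if t == 0:
            q_k = 1.0                       # x_1 = 1 as in Algorithm 1
        else:
            d_t = b_t / (n - t)             # average remaining capacity
            w = counts.copy()
            if w[k] == 0:                   # first occurrence of type k:
                w[k] = 1                    # re-solve with its count set to 1
            q = solve_slp(v, U, w, t * d_t)
            q_k = q[k]
        accept = rng.random() < q_k and np.all(b_t >= U[:, k])
        x[t] = 1.0 if accept else 0.0
        b_t = b_t - x[t] * U[:, k]          # consume resources
        counts[k] += 1                      # counts update regardless of acceptance
    return x, b_t

rng = np.random.default_rng(0)
m, K, n = 3, 5, 500
U = rng.uniform(0.0, 1.0, size=(m, K))
lam = rng.uniform(0.5, 1.5, size=m)
v = U.T @ lam
p = rng.dirichlet(np.ones(K))
b = 0.8 * n * (U @ p)
types = rng.choice(K, size=n, p=p)
x, b_end = adaptive_allocation(v, U, b, n, types, rng)
print("online reward:", (v[types] * x).sum(), "remaining capacity:", b_end)
```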
This is in contrast with recent work (Bumpensanti and Wang, 2020a), where an infrequent re-solving scheme is advocated. We emphasize that this is because of the difference in problem setting: we do not assume prior knowledge of the underlying distribution, so a frequent updating scheme for the distribution parameters (through the empirical counts) and the acceptance probabilities is necessary.

Throughout the paper, we denote by [n] the index set {1, ..., n} for any n ∈ N. We denote by Ā the complement of an event A. Inequalities between vectors are understood element-wise. The following list summarizes the notations used in this paper.

• m: the number of constraints.
• n: the number of decision variables in the offline problem (2).
• K: the number of different order types (also the number of decision variables in the DLP (3)).
• p: the probability vector that characterizes the distribution over the K order types.
• a_j: the j-th column of the constraint matrix A in (2).
• u_k: the k-th column of the constraint matrix U in (3).
• r_j: the j-th coefficient of the objective function in (2).
• v_k: the k-th coefficient of the objective function in (3).
• λ: the vector describing the linear relation v_k = λ^⊤ u_k.
• b = [b_1, ..., b_m]^⊤: initial constraint capacity.
• d = [d_1, ..., d_m]^⊤: initial average constraint capacity.
• b_t: constraint capacity at time t.
• d_t: average constraint capacity at time t.
• ā, d̄, d̲, r̄, p̲: ā = max_{k∈[K]} ‖u_k‖_∞, d̄ = ‖d‖_∞, d̲ = min_{i∈[m]} d_i, r̄ = ‖r‖_∞, and p̲ = min_{k∈[K]} p_k.
• R*_n: the optimal objective value of (2).
• R̄*_n: the optimal objective value of (3) multiplied by n.
• Ū, v̄, d̄: the matrix, reward, and constraint of the standard form of the DLP (22).
• q̄*: q̄* = [q*, s*, z*]^⊤ is the optimal solution of the DLP in standard form (22), where q* is the optimal solution of (3) and the rest are slack variables.
• B̄*, N̄*: B̄* is the index set of the optimal basic variables of (22), and N̄* is the index set of the optimal non-basic variables.
• T*, L*: T* is the index set of the binding resources of the solution of (3), and L* is the index set of the non-binding resources.
• χ, Ψ, ψ, σ: χ = min{q̄*_i : q̄*_i ≠ 0}, Ψ is the vector of reduced costs, ψ = min{|Ψ_i| : Ψ_i ≠ 0}, and σ = σ_min(Ū_{B̄*}).
• l*: the optimal dual solution of (26).
• N_k(t): the number of type-k orders the algorithm has accepted up to time t.
• n_k(t): the number of type-k orders that have appeared up to time t.

We now summarize the main results of the paper. Notice that the behavior of the online LP (2) relies heavily on the underlying DLP (3), and we discuss the regret of (2) under two different conditions on the underlying DLP (3): the first being that all resources in (3) are binding, and the second being that there is at least one non-binding resource. For simplicity, we refer to these conditions as the "binding" case and the "general" case.

To prove bounded regret, we require the following assumptions for the binding and the general case, stated here informally.
Assumption for the Binding Case (Informal): There exists a constant ∆ > 0 such that for all d' ∈ D = ⊗_{i=1}^m [d_i − ∆, d_i + ∆], the solution of (3) exists and all resources are binding.

Assumption for the General Case (Informal): We require that the solution of (3) in standard form be unique and non-degenerate.

Although the assumptions for the binding case and the general case look different, they share the same spirit: both require a "stability" condition under which the binding/non-binding structure stays the same. In fact, the uniqueness and non-degeneracy assumption implies robustness to small perturbations d' ∈ D. Therefore, for the binding case we can make a weaker assumption that allows the underlying DLP to have some degeneracy.

Lastly, we state informal versions of the regret results for the binding case and the general case.

Theorem for the Binding Case (Informal): Under Assumption 1 and the assumption for the binding case, Algorithm 1 has regret of the form

Reg_n ≤ O(m) + O(n^{1/3} exp(−n^{1/3})).

Theorem for the General Case (Informal): Under Assumption 1 and the assumption for the general case, Algorithm 1 has regret of the form

Reg_n ≤ O(m) + O(mn exp(−n)) + O(n^{1/3} exp(−n^{1/3})).

Although the binding case and the general case have different assumptions, the techniques of the binding case are used extensively in the analysis of the general case. As a result, for simplicity, we first present the analysis of the binding case in Section 3 and reuse the results in the more complex setting of Section 4. One will find that the analysis in Section 3 is not only useful in proving the bounded regret for the binding resources in Section 4, but is also essential in the analysis for the non-binding resources.
3 Regret Analysis for Binding Case

In this section, we analyze the regret of Algorithm 1 for the case where all the resources are binding, in which the LP problem becomes a set-partition problem. Besides Assumption 1, we assume the following.
Assumption 2.
There exists a constant ∆ > 0 such that one can find a vector q = (q_1, ..., q_K)^⊤ ∈ [0, 1]^K that satisfies

Σ_{k=1}^K u_k p_k q_k = d'

for all d' ∈ D = ⊗_{i=1}^m [d_i − ∆, d_i + ∆].

Assumption 2 requires that all the constraints be binding for the DLP (3). In other words, there exists a way to allocate the resources such that the capacity of every constraint is exhausted (in expectation), which maximizes the return. It also requires that the bindingness still hold under small perturbations of d.

We first present the idea and then elaborate on the details in the following subsections. We will see that the analysis centers around the process d_t. To start, recall that D = ⊗_{i=1}^m [d_i − ∆, d_i + ∆] in Assumption 2 imposes a stability condition on the right-hand-side parameter d = b/n. Throughout the analysis, we reserve the notation d = (d_1, ..., d_m)^⊤ for the (initial) average resource capacity and use d' to refer to an arbitrary value in R^m_+. Let

D' := ⊗_{i=1}^m [d_i − ∆/2, d_i + ∆/2]

be a strict subset of D. Ideally, the process d_t should stay within D' (or D) throughout the horizon, as this would imply that the resources are exhausted only at the very end of the horizon. Define x_t(d') as the online solution of Algorithm 1 at the t-th time period if the input d_t of the LP (5) equals d' ∈ R^m_+. Consider the following event defined on the space of histories up to time t − 1:

E_t := { H_{t−1} : sup_{d'∈D'} ‖ E[a_t x_t(d') | H_{t−1}] − d' ‖_∞ ≤ ǫ_t },

where the history H_{t−1} = (r_1, a_1, ..., r_{t−1}, a_{t−1}). Specifically, we choose

ǫ_t := ā for t ≤ κn, and ǫ_t := 1/t^β for t > κn,    (6)

where ā (an upper bound for the infinity norm of a_t) is defined above and the parameter β can take any value in (0, 1/2). The constant κ ∈ (0, 1) will be specified later, and without loss of generality we assume κn takes an integer value.

Algorithm 1 re-solves the LP (5) at every time step, and the aim is to ensure that the expected resource consumption at each time t stays close to d_t. In this way, it prevents the resources from an early depletion or a left-over at the end of the horizon. The event E_t characterizes the algorithm's behavior in this respect: the "good" event is that E[a_t x_t(d') | H_{t−1}], the expected resource consumption at time t, stays within ǫ_t of d_t. In the first κn time periods, since we still have plenty of resources, we can tolerate a relatively large deviation by choosing ǫ_t = ā; consequently, E_t happens almost surely in the first κn time periods. For t > κn, we define ǫ_t = 1/t^β with β ∈ (0, 1/2). Roughly speaking, at time t the estimation error is on the order of 1/√t. Here we choose a much wider interval in defining ǫ_t for two reasons: (i) it guarantees that the event E_t happens with high probability; (ii) as long as the deviation is no greater than 1/t^β, the adaptive (re-solving) mechanism ensures the stability of the process d_t. For completeness, we define E_1 as the whole space Ω. We also emphasize that the event imposes uniformity over the possible values of d_t through the supremum over d' ∈ D'.

By the definition of the online procedure, the dynamics of d_t are characterized by

d_{t+1} = b_{t+1}/(n − t) = (b_t − a_t x_t)/(n − t) = d_t − (a_t x_t − d_t)/(n − t)

for t = 1, ..., n − 1. In this light, the event E_t states that the conditional expectation of a_t x_t − d_t in the last expression has norm no greater than ǫ_t whenever d_t ∈ D'.
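This one-step update can be sanity-checked directly from the definitions (the numbers below are arbitrary):

```python
import numpy as np

n, t = 100, 7
d_t = np.array([0.5, 0.8])                 # current average capacity
ax = np.array([0.3, 1.1])                  # realized consumption a_t * x_t
b_t = (n - t + 1) * d_t                    # since d_t = b_t / (n - t + 1)
d_next = (b_t - ax) / (n - t)              # d_{t+1} = b_{t+1} / (n - t)
print(np.allclose(d_next, d_t - (ax - d_t) / (n - t)))   # True
```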
Next, we define a stopping time to capture the "bad" event that either d_t ∉ D' or the complement of E_t occurs. Mathematically,

τ̃ := min { t ≤ n : d_t ∉ D' or H_{t−1} ∉ E_t } ∪ { n + 1 }.

With τ̃, we define an auxiliary process d̃_t as follows:

d̃_t = d_t for t < τ̃, and d̃_t = d_{τ̃} for t ≥ τ̃.

By definition, the process d̃_t freezes its value once d_t exits the region D' or the bad event happens. The motivation for defining the process d̃_t can be seen from the following inequality:

P(d_s ∉ D' for some s ≤ t)
= P(d_s ∉ D' for some s ≤ t, ∩_{s=1}^t E_s) + P(d_s ∉ D' for some s ≤ t, ∪_{s=1}^t Ē_s)
≤ P(d̃_s ∉ D' for some s ≤ t) + Σ_{s=1}^t P((r_1, a_1, ..., r_{s−1}, a_{s−1}) ∉ E_s),    (7)

where Ē denotes the complement of an event E. For the first part of the last line, note that given ∩_{s=1}^t E_s, the event that d_s ∉ D' for some s ≤ t is equivalent to the event τ̃ ≤ t, and thus it entails d̃_s ∉ D'. The second part is obtained by simply ignoring the condition on d_s and then taking a union bound over s = 1, ..., t.

Now we discuss the motivation for defining d̃_t and the significance of the inequality (7). To derive a regret upper bound for the algorithm, we strive to bound the probability that d_s ∉ D'. The inequality (7) separates this probability into two components. The first component involves the process d̃_t, which is a very "regular" process: if t < τ̃, the fluctuation of d̃_t is subject to the event E_t, and if t ≥ τ̃, its value freezes. Furthermore, since the event E_t restricts the magnitude of the fluctuations, the process d̃_t behaves roughly like a martingale, which creates much convenience in analyzing the first component. The second component of (7) concerns the probabilities of the events Ē_t, which can be analyzed individually for each t. Technically, the inequality (7) disentangles the process stability from the estimation error: its first component concerns the process stability given good estimates, while its second component concerns the probability of the estimates being good. Piecing the two components together, we obtain a bound on the probability that d_t exits the region D'. In the following two subsections, we analyze the two components separately and then combine the results to derive the regret bound.
3.1 Analysis of the first component in (7)

The following theorem states a concentration result for a martingale difference sequence X_t. Our approach to the first component can be viewed as a two-step procedure: first, we construct a martingale based on the process d̃_t; specifically, the constructed martingale and d̃_t share the same initial value d, and the difference between the martingale and d̃_t is controllably small. Second, we apply Theorem 1 to the constructed martingale and argue that both the martingale and d̃_t stay within D' (near the initial value d) with high probability.

Theorem 1 (Hoeffding's inequality for dependent data (van de Geer, 2002)). Consider a sequence of random variables {X_t}_{t=1}^n adapted to the filtration {H_t} with E[X_t | H_{t−1}] = 0 for t = 1, ..., n, where H_0 = ∅. Suppose L_t, U_t are H_{t−1}-measurable random variables such that L_t ≤ X_t ≤ U_t almost surely for t = 1, ..., n. Let S_t = Σ_{s=1}^t X_s and V_t = Σ_{s=1}^t (U_s − L_s)². Then the following inequality holds for all b > 0, c > 0, and n ∈ N_+:

P( |S_t| ≥ b and V_t ≤ c for some t ∈ {1, ..., n} ) ≤ 2 e^{−2b²/c}.
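A quick Monte Carlo illustration of Theorem 1, using an i.i.d. bounded sequence as the simplest martingale difference sequence (all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, half = 200, 20000, 0.5
# X_t i.i.d. uniform on [-half, half]: an MDS with L_t = -half, U_t = half,
# so V_n = n * (2 * half)**2 deterministically.
X = rng.uniform(-half, half, size=(trials, n))
max_abs_S = np.abs(X.cumsum(axis=1)).max(axis=1)    # max_t |S_t|
b = 10.0
V_n = n * (2 * half) ** 2
print("empirical tail:", (max_abs_S >= b).mean(),
      "theorem bound:", 2 * np.exp(-2 * b**2 / V_n))
```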
The following lemma implements the above idea. It utilizes the result in Theorem 1 and provides an upper bound on the first component in (7).

Lemma 1.
The following inequality holds for all n ≥ n_0 and t ≤ n − 1:

P( d̃_s ∉ D' for some s ≤ t ) ≤ 2m e^{−∆²(n−t−1)/(32ā²)},    (8)

where ā is defined above and ∆ (hence D') can be chosen arbitrarily. The constant n_0 is defined as the minimal integer such that n_0 ≥ (∆/(16ā))^{−1} + 2 and (log n_0)/n_0^β ≤ κ^β ∆/8. Here κ = 1 − exp(−∆/(16ā)) and β is a fixed number in (0, 1/2) as in the definition of the events E_t.

Proof. We first analyze the process d̃_{i,t}, the i-th component of d̃_t, and then take a union bound over i. Define H_t = {(r_s, a_s)}_{s=1}^t for t = 1, ..., n. Let Y_t := d̃_{i,t+1} − d̃_{i,t} for t ≥ 1 and X_t := Y_t − E[Y_t | H_{t−1}]. In this way, to analyze the process d̃_{i,t}, we only need to analyze the summation Σ_{s=1}^{t−1} Y_s. From the definition of the process d̃_t, we know that when t ≥ τ̃, d̃_{i,t+1} = d̃_{i,t}, and when 1 ≤ t < τ̃,

d̃_{i,t+1} = d̃_{i,t} − (a_{i,t} x_t − d̃_{i,t})/(n − t).

Since d̃_{i,t} is H_{t−1}-measurable, we have

|X_t| = | (a_{i,t}x_t − d̃_{i,t})/(n−t) − E[(a_{i,t}x_t − d̃_{i,t})/(n−t) | H_{t−1}] |
      = (1/(n−t)) | a_{i,t}x_t − E[a_{i,t}x_t | H_{t−1}] | ≤ ā/(n−t)    (9)

for each t ≤ n − 1. So we can define L_t := −ā/(n−t) and U_t := ā/(n−t), and the conditions of Theorem 1 are met for the processes X_t, L_t, and U_t. Then, as in Theorem 1,

V_t = Σ_{s=1}^t (U_s − L_s)² = Σ_{s=1}^t 4ā²/(n−s)² ≤ 4ā²/(n−t−1)

for t = 1, ..., n − 1. From Theorem 1, we know that

P( |Σ_{j=1}^s X_j| ≥ ∆' for some s ≤ t ) ≤ 2 e^{−∆'²(n−t−1)/(2ā²)}    (10)

holds for all ∆' > 0 and t ≤ n − 1.

With this bound on the summation of the X_t's, we return to analyze the summation of the Y_t's by bounding the difference between the two sequences. By definition,

|X_t − Y_t| = |E[Y_t | H_{t−1}]| = |E[d̃_{i,t+1} − d̃_{i,t} | H_{t−1}]|
            = (1/(n−t)) |E[(a_{i,t}x_t − d̃_{i,t}) I(t < τ̃) | H_{t−1}]|
            ≤ ǫ_t/(n−t) = (ā/(n−t)) I(t ≤ κn) + (1/((n−t)t^β)) I(t > κn)    (11)

for 1 ≤ t ≤ n − 1. The second line comes from the definition of Y_t, splitting with the two indicators I(t < τ̃) and I(t ≥ τ̃); the process freezes and d̃_{i,t+1} = d̃_{i,t} for t ≥ τ̃. The last line comes from the definitions of E_t, τ̃, and ǫ_t.

By summing (11), we have, for s ≤ n − 1,

| Σ_{j=1}^s X_j − Σ_{j=1}^s Y_j | ≤ Σ_{j=1}^{κn} ā/(n−j) + Σ_{j=κn+1}^s 1/((n−j) j^β).

The next step is to find a proper value of κ such that the right-hand side is bounded by ∆/4. For the first part, we have

Σ_{j=1}^{κn} ā/(n−j) ≤ ā ∫_{n−κn−1}^{n−1} dx/x = ā log( (n−1)/(n−κn−1) )
    ≤ ā log( (n−1)/((n−2)(1−κ)) ) = ā ( log((n−1)/(n−2)) − log(1−κ) ).

For the second part,

Σ_{j=κn+1}^{n−1} 1/((n−j) j^β) ≤ (1/(κn)^β) Σ_{j=κn+1}^{n−1} 1/(n−j) ≤ (log n)/(κn)^β.

Henceforth, if we set κ = 1 − exp(−∆/(16ā)) and define n_0 as the minimal integer such that n_0 ≥ (∆/(16ā))^{−1} + 2 and (log n_0)/n_0^β ≤ κ^β ∆/8, then the following inequality holds for n ≥ n_0:

Σ_{j=1}^{κn} ā/(n−j) + Σ_{j=κn+1}^{n−1} 1/((n−j) j^β) ≤ ∆/8 + ∆/8 = ∆/4.
With this choice of κ and n ≥ n_0,

| Σ_{j=1}^s X_j − Σ_{j=1}^s Y_j | ≤ ∆/4

holds almost surely. Consequently,

{ |d̃_{i,s} − d_i| > ∆/2 for some s ≤ t }
= { |Σ_{j=1}^{s−1} Y_j| > ∆/2 for some s ≤ t }
= { |Σ_{j=1}^s Y_j| > ∆/2 for some s ≤ t−1 }
⊆ { |Σ_{j=1}^s X_j| > ∆/4 for some s ≤ t−1 }.

Therefore, we set ∆' = ∆/4 in (10) and apply a union bound over the constraint indices i = 1, ..., m:

P( d̃_s ∉ D' for some s ≤ t ) ≤ 2m e^{−∆²(n−t−1)/(32ā²)}

for t ≤ n − 1 and n ≥ n_0. □

3.2 Analysis of the second component in (7)

To analyze the second component, we first establish sufficient conditions under which all the constraints are binding for the sampled LP (5). In other words, if we replace the true probabilities p_k with accurate enough estimates, the linear program (5) still satisfies Assumption 2. In a certain sense, this can be viewed as a sensitivity result for the LP (5): when the input parameters of the LP are perturbed by a small amount, the bindingness structure of the LP is maintained.

Recall that p̲ := min_{1≤k≤K} p_k denotes the smallest probability and d̲ := min_{1≤i≤m} d_i the smallest entry of the initial average constraint level d; both p̲ and d̲ are positive.

Lemma 2.
For t = 1, ..., n, if d_t = d' = (d'_1, ..., d'_m)^⊤ ∈ D' and

| n_k(t)/t − p_k | ≤ p̲∆/(2(d̄+∆))

for each k = 1, ..., K, then the optimal solution q* = (q*_1, ..., q*_K)^⊤ of the scaled LP (5) (at time t+1) satisfies

Σ_{k=1}^K u_k n_k(t) q*_k = t d'

for each d' ∈ D'.
Proof. We provide a constructive proof based on the condition in Assumption 2. For each d' ∈ D', consider the problem

max_{y=(y_1,...,y_K)^⊤} Σ_{k=1}^K v_k p_k y_k
s.t. Σ_{k=1}^K u_k p_k y_k ≤ d' · (d̄+∆)/(d̄+∆/2)
     0 ≤ y_k ≤ 1.    (12)

The definition of D' entails ‖d − d'‖_∞ ≤ ∆/2. Thus we have

d'_i · (d̄+∆)/(d̄+∆/2) ≤ (d_i + ∆/2) · (d̄+∆)/(d̄+∆/2) ≤ d_i + ∆

for each i = 1, ..., m (and, since the ratio is at least one, also d'_i · (d̄+∆)/(d̄+∆/2) ≥ d_i − ∆). Recall that d = (d_1, ..., d_m)^⊤ represents the initial average constraint capacity. Therefore d' · (d̄+∆)/(d̄+∆/2) ∈ D. From Assumption 2, we know that there exists y* = (y*_1, ..., y*_K)^⊤ with 0 ≤ y*_k ≤ 1 such that

Σ_{k=1}^K u_k p_k y*_k = d' · (d̄+∆)/(d̄+∆/2).

Scaling both sides of the equality,

Σ_{k=1}^K u_k (n_k(t)/t) ( (p_k/(n_k(t)/t)) · (d̄+∆/2)/(d̄+∆) · y*_k ) = d'.    (13)

From the condition |n_k(t)/t − p_k| ≤ p̲∆/(2(d̄+∆)) we know

p_k − n_k(t)/t ≤ p_k ∆/(2(d̄+∆)), k = 1, ..., K,

and it thus implies

(p_k/(n_k(t)/t)) · (d̄+∆/2)/(d̄+∆) ≤ 1.

So, if we let

q*_k = (p_k/(n_k(t)/t)) · (d̄+∆/2)/(d̄+∆) · y*_k for k = 1, ..., K,

then q* is feasible for the scaled LP (5) with d_t = d', and by the linearity in Assumption 1 (b) any feasible solution that exhausts the resources is optimal. □
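A toy numeric check of this construction (one resource and two types, so d̄ = d̲ = d; the counts are chosen to satisfy the condition of Lemma 2):

```python
import numpy as np

# Toy binding instance with m = 1, K = 2.
u = np.array([[1.0, 2.0]])                 # columns u_k
p = np.array([0.5, 0.5])
d, Delta = 1.0, 0.2
d_prime = 1.05                             # any point of D'
target = d_prime * (d + Delta) / (d + Delta / 2)
y = np.full(2, target / (u @ p).item())    # solves sum_k u_k p_k y_k = target
t, n_k = 1000, np.array([520, 480])        # counts within the lemma's tolerance
q = p / (n_k / t) * (d + Delta / 2) / (d + Delta) * y
print(np.isclose(((u * n_k) @ q).item() / t, d_prime),  # binding equality holds
      bool(np.all(q <= 1)))                             # and q is feasible
```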
According to Lemma 2, we define the event

A_t^{(k)} := { |n_k(t)/t − p_k| ≤ p̲∆/(2(d̄+∆)) }.

Now we are in good shape to analyze the events E_t and bound the second term in (7). The idea is to apply Lemma 2 and employ the following argument to bound the probability of E_t by the intersection of a sequence of concentration events. Let q*(d') = (q*_1(d'), ..., q*_K(d'))^⊤ ∈ R^K be the optimal solution of (5) with d_t = d', for all d' ∈ D'. From the algorithm, we know

E[a_{t+1} x_{t+1}(d') | H_t] = Σ_{k=1}^K u_k p_k q*_k(d').

This is because, for each k, the probability of seeing (v_k, u_k) in the coming arrival is p_k, and the acceptance probability for such an arrival is q*_k(d'). From Lemma 2, we know that, conditional on the event ∩_{k=1}^K A_t^{(k)},

d' = Σ_{k=1}^K u_k (n_k(t)/t) q*_k(d').

Then, taking the difference,

E[a_{t+1} x_{t+1}(d') | H_t] − d' = Σ_{k=1}^K u_k (p_k − n_k(t)/t) q*_k(d').    (14)

The identity (14) enables us to bound the left-hand side with a concentration argument. Specifically, define the event

B_t^{(k)} := { |n_k(t)/t − p_k| ≤ 1/(2āKt^β) }.

Then, conditional on the event ∩_{k=1}^K (A_t^{(k)} ∩ B_t^{(k)}),

‖E[a_{t+1} x_{t+1}(d') | H_t] − d'‖_∞ = ‖Σ_{k=1}^K u_k (p_k − n_k(t)/t) q*_k(d')‖_∞
    ≤ Σ_{k=1}^K |p_k − n_k(t)/t| q*_k(d') ā ≤ min{1/t^β, ā} ≤ ǫ_t,

where ā is defined above. The equality is justified by the preceding argument relying on the events A_t^{(k)}, while the subsequent inequality relies on the events B_t^{(k)}. The inequality is useful in that it upper-bounds our target (the left-hand side, the bias in resource consumption) by the estimation errors of the p_k's. Consequently,

∩_{k=1}^K (A_t^{(k)} ∩ B_t^{(k)}) ⊂ E_t

for each t = 1, ..., n. Thus the following lemma provides a lower bound on the probability P(E_t) and concludes our analysis of the second component in (7).

Lemma 3. The following inequality holds for each t = 1, ..., n:

P( ∩_{k=1}^K (A_t^{(k)} ∩ B_t^{(k)}) ) ≥ 1 − 2K exp( −p̲²∆² t/(2(d̄+∆)²) ) − 2K exp( −t^{1−2β}/(2ā²K²) ).
Proof. We have

P( ∩_{k=1}^K (A_t^{(k)} ∩ B_t^{(k)}) ) ≥ 1 − Σ_{k=1}^K P(Ā_t^{(k)}) − Σ_{k=1}^K P(B̄_t^{(k)}),

where Ē denotes the complement of an event E. Next we analyze each term in the summations with Hoeffding's inequality. Specifically,

n_k(t) = Σ_{s=1}^t I((r_s, a_s) = (v_k, u_k)),

where I(·) denotes the indicator function, the I((r_s, a_s) = (v_k, u_k))'s are i.i.d. random variables, and E[I((r_s, a_s) = (v_k, u_k))] = p_k for each k = 1, ..., K. Therefore, we have

P(Ā_t^{(k)}) = P( |n_k(t)/t − p_k| > p̲∆/(2(d̄+∆)) ) ≤ 2 exp( −p̲²∆² t/(2(d̄+∆)²) )

and

P(B̄_t^{(k)}) = P( |n_k(t)/t − p_k| > 1/(2āKt^β) ) ≤ 2 exp( −t^{1−2β}/(2ā²K²) ).

Combining these two inequalities completes the proof. □

Now we complete the regret analysis by combining the results of the two preceding subsections. First, we provide a more careful analysis of (7). If we define the stopping time

τ := min{ t ≤ n : d_t ∉ D' } ∪ { n+1 },

then for t ≤ n,

P(τ ≤ t) = P(d_s ∉ D' for some s ≤ t)
≤ P(d̃_s ∉ D' for some s ≤ t) + Σ_{s=1}^t P((r_1, a_1, ..., r_{s−1}, a_{s−1}) ∉ E_s)
= P(d̃_s ∉ D' for some s ≤ t) + Σ_{s=κn+1}^t P((r_1, a_1, ..., r_{s−1}, a_{s−1}) ∉ E_s),

where the second line comes from (7) and the third line from the definition of E_s. By a "more careful" analysis we mean that the second component takes into account the definition of ǫ_t, so the first κn summands vanish.

With n_0 and κ defined in Lemma 1, if n ≥ n_0, we have

E[τ] ≥ Σ_{t=1}^n (1 − P(τ ≤ t))
≥ n − 1 − Σ_{t=1}^{n−1} P(∃ s ≤ t s.t. d̃_s ∉ D') − Σ_{t=1}^n Σ_{s=κn+1}^t P(Ē_s)
= n − 1 − 2m (1 − e^{−∆²(n−1)/(32ā²)})/(1 − e^{−∆²/(32ā²)}) − Σ_{t=1}^n Σ_{s=κn+1}^t P(Ē_s),    (15)

where the last line applies Lemma 1 to the first summation. The following result applies Lemma 3 to the second summation.

Lemma 4.
The following inequality holds for all n ∈ N_+ and β ∈ (0, 1/2):

Σ_{t=1}^n Σ_{s=κn+1}^t P(Ē_s)
≤ (4K(d̄+∆)² n/(p̲²∆²)) exp( −(p̲²∆²κ/(2(d̄+∆)²)) n ) + (4K³ā² n^{1+2β}/(1−2β)) exp( −(κ^{1−2β}/(2ā²K²)) n^{1−2β} )
= O( n^β exp(−n^{1−2β}) ).    (16)

Proof.

Σ_{s=κn+1}^t P(Ē_s) ≤ Σ_{s=κn+1}^n [ 2K exp(−p̲²∆² s/(2(d̄+∆)²)) + 2K exp(−s^{1−2β}/(2ā²K²)) ]
≤ ∫_{κn}^n [ 2K exp(−p̲²∆² s/(2(d̄+∆)²)) + 2K exp(−s^{1−2β}/(2ā²K²)) ] ds.    (17)

For the second term, observe that for a > 0 and β', κ ∈ (0, 1) we have the bound

∫_{κn}^n e^{−a x^{β'}} dx ≤ (n^{1−β'}/(aβ')) ∫_{κn}^n a β' x^{β'−1} e^{−a x^{β'}} dx ≤ (n^{1−β'}/(aβ')) e^{−a(κn)^{β'}}.    (18)

Combining (17) and (18) (the latter applied with β' = 1 − 2β) and summing over t yields

Σ_{t=1}^n Σ_{s=κn+1}^t P(Ē_s) ≤ (4K(d̄+∆)² n/(p̲²∆²)) exp( −(p̲²∆²κ/(2(d̄+∆)²)) n ) + (4K³ā² n^{1+2β}/(1−2β)) exp( −(κ^{1−2β}/(2ā²K²)) n^{1−2β} ). □

Noting that

2m (1 − e^{−∆²(n−1)/(32ā²)})/(1 − e^{−∆²/(32ā²)}) ≤ 128 m ā²/∆²

(which follows from 1 − e^{−x} ≥ x/2 for x ∈ (0, 1]), we have

E[τ] ≥ n − 1 − 128 m ā²/∆² − (4K(d̄+∆)² n/(p̲²∆²)) exp( −(p̲²∆²κ/(2(d̄+∆)²)) n ) − (4K³ā² n^{1+2β}/(1−2β)) exp( −(κ^{1−2β}/(2ā²K²)) n^{1−2β} ).    (19)

Lastly, we present the bound on the regret in the following theorem, which shows that the regret is uniformly bounded with respect to n.

Theorem 2.
Under Assumptions 1 and 2, the regret of Algorithm 1 satisfies

Reg_n^π ≤ O(m) + O( n^{1/3} exp(−n^{1/3}) )

for all m, n ∈ N_+, where π denotes the policy specified by Algorithm 1.

Proof. Recall that the regret is

Reg_n^π = E[ R*_n − Σ_{t=1}^n r_t x_t ],

where R*_n represents the optimal objective value of the "offline" problem (2) and

Σ_{t=1}^n r_t x_t = λ^⊤ Σ_{t=1}^n a_t x_t = λ^⊤ b − λ^⊤ b_{n+1}.

Under Assumption 1, we have R*_n ≤ λ^⊤ b, and by combining the results above,

Reg_n^π = E[ R*_n − Σ_{t=1}^n r_t x_t ] ≤ E[ λ^⊤ b − λ^⊤ b + λ^⊤ b_{n+1} ] ≤ λ^⊤ E[b_{n+1}],    (20)

where λ is defined as in Assumption 1. Since τ ≤ n+1 and d_{τ−1} ∈ D', we have

λ^⊤ E[b_{n+1}] ≤ λ^⊤ E[b_{τ−1}] = λ^⊤ E[(n−τ+2) d_{τ−1}] ≤ λ^⊤ (d + (∆/2)·1) E[n−τ+2].    (21)

Finally, from (20), (21), and (19), we have

Reg_n^π ≤ λ^⊤ (d + (∆/2)·1) ( O(m) + O(n^β exp(−n^{1−2β})) ) ≤ r̄ ( O(m) + O(n^β exp(−n^{1−2β})) )

when n ≥ n_0. As defined in Lemma 1, n_0 is the minimal integer such that n_0 ≥ (∆/(16ā))^{−1} + 2 and (log n_0)/n_0^β ≤ κ^β ∆/8, where κ = 1 − exp(−∆/(16ā)) and β is a fixed number in (0, 1/2). Here the first inequality comes from omitting the constants in (19), and the second from the fact that λ^⊤ (d + (∆/2)·1) ≤ r̄, which follows from Assumption 2. Choosing β = 1/3 completes the proof. □

4 Regret Analysis for General Case
In this section we discuss the more general case, in which the corresponding DLP (3) has both binding and non-binding resources. The roadmap for proving bounded regret in the general case can be summarized as follows. Lemma 5 reduces bounding the regret to bounding three components: first, the regret associated with the binding resources must be bounded; second, for certain order types we should accept almost all orders; and third, for other order types we should accept almost none. For bounding the binding resources, we rely heavily on the results of Section 3. For the last two components, Lemma 6 enables us to determine whether to accept or reject an order type by verifying whether the corresponding primal variable is basic or non-basic. Under Assumption 3, the non-degeneracy and uniqueness condition implies a stability condition: as n grows, the sampled LP stays close enough to the DLP that the standardized sampled LP (32) eventually shares the same index sets of basic and non-basic variables with the standardized DLP (22).

To facilitate the analysis, we introduce the standard form of the DLP (3). We denote by q the vector [q_1, ..., q_K]^⊤, by v_p the vector [v_1 p_1, ..., v_K p_K]^⊤, by s the vector [s_1, ..., s_m]^⊤, by z the vector [z_1, ..., z_K]^⊤, and by U_p the matrix [u_1 p_1, ..., u_K p_K]. The standard form can be written as

max v_p^⊤ q
s.t. U_p q + s = d
     q + z = 1
     q, s, z ≥ 0.    (22)

Notice that if we denote by I_m and I_K the identity matrices of dimensions m and K, and define Ū ∈ R^{(m+K)×(m+2K)}, v̄, q̄ ∈ R^{m+2K}, and d̄ ∈ R^{m+K} by

Ū := [ U_p  I_m  0 ; I_K  0  I_K ],  q̄ = [q; s; z],  v̄ = [v_p; 0; 0],  d̄ = [d; 1],

then (22) can be written as

max v̄^⊤ q̄  s.t. Ū q̄ = d̄, q̄ ≥ 0.    (23)

Denote [n] := {1, ..., n}, by B̄* ⊆ [2K+m] the set of optimal basic variables of the LP (22), and by N̄* ⊆ [2K+m] the set of optimal non-basic variables.
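The construction of (22)–(23) is mechanical; a small self-contained sketch (illustrative dimensions, with one resource deliberately left slack so that both binding and non-binding resources appear) is:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, K = 2, 4
Up = rng.uniform(0.1, 1.0, size=(m, K))        # columns u_k p_k
vp = rng.uniform(0.1, 1.0, size=K)             # entries v_k p_k
d = Up @ np.full(K, 0.6); d[0] += 0.5          # capacity; resource 0 left slack

# Block construction of (23): U_bar q_bar = d_bar with q_bar = [q, s, z].
U_bar = np.block([[Up, np.eye(m), np.zeros((m, K))],
                  [np.eye(K), np.zeros((K, m)), np.eye(K)]])
v_bar = np.concatenate([vp, np.zeros(m + K)])
d_bar = np.concatenate([d, np.ones(K)])

res = linprog(-v_bar, A_eq=U_bar, b_eq=d_bar,
              bounds=[(0, None)] * (m + 2 * K), method="highs")
q, s, z = res.x[:K], res.x[K:K + m], res.x[K + m:]
print("q*:", np.round(q, 3), "s*:", np.round(s, 3), "z*:", np.round(z, 3))
```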
Next, we state the assumption needed for the general case.

Assumption 3. The optimal solution of (22) is unique and non-degenerate, i.e.,

|{i : q*_i ≠ 0, i = 1, ..., K}| + |{i : s*_i ≠ 0, i = 1, ..., m}| + |{i : z*_i ≠ 0, i = 1, ..., K}| = m + K.

Assumption 3 implies that the matrix Ū_{B̄*} is non-singular. Moreover, because of the non-degeneracy assumption we know that q̄*_i > 0 for all i ∈ B̄*. Hence we define

χ := min{ q̄*_i : q̄*_i ≠ 0 },  σ := σ_min(Ū_{B̄*}),    (24)

and from Assumption 3 we know that χ, σ > 0.

Next, we introduce another tool for the analysis of the regret. Consider the dual problem of the original DLP:

min Σ_{i=1}^m l_i d_i + Σ_{k=1}^K p_k y_k
s.t. u_k^⊤ l + y_k ≥ v_k, k = 1, ..., K
     l_i ≥ 0, i = 1, ..., m
     y_k ≥ 0, k = 1, ..., K,    (25)

which can be simplified to

min_{l ≥ 0} Σ_{i=1}^m l_i d_i + Σ_{k=1}^K p_k (v_k − u_k^⊤ l)^+.    (26)

From Assumption 3 we know there exists a unique optimal dual solution l* for the dual of the DLP. Then, within horizon n, if we denote by N_k(n) the number of type-k orders accepted by the algorithm and by R̄*_n the optimal value of the DLP multiplied by n, we can bound the regret as in the following lemma.

Lemma 5.
Reg_n^π ≤ E[ l*^⊤ ( b − Σ_{k=1}^K u_k N_k(n) ) ] + E[ Σ_{k=1}^K (n p_k − N_k(n)) (v_k − u_k^⊤ l*)^+ ] + E[ Σ_{k=1}^K N_k(n) (u_k^⊤ l* − v_k)^+ ].    (27)

Without loss of generality, assume that |{i : s*_i = 0, i = 1, ..., m}| > 0 and |{i : s*_i ≠ 0, i = 1, ..., m}| > 0, i.e., we have both binding and non-binding resources. When all resources are non-binding, the optimal dual price l* is 0, hence the first term vanishes, and the general result still holds.

For the first term: since the original LP has both binding and non-binding resources, complementary slackness implies that the optimal dual price is positive for a binding resource and 0 for a non-binding resource. Henceforth, if we denote by T* ⊆ [m] the indices of the binding resources and by L* ⊆ [m] the indices of the non-binding resources, we have

E[ l*^⊤ ( b − Σ_{k=1}^K u_k N_k(n) ) ] = E[ l*_{T*}^⊤ ( b_{T*} − Σ_{k=1}^K u_{T*,k} N_k(n) ) ].    (28)

Hence, if the remaining binding resource b_{T*} − Σ_{k=1}^K u_{T*,k} N_k(n) can be bounded, then the first term is bounded.

The second term says that for an order type with v_k − u_k^⊤ l* > 0, in order to get bounded regret we need to guarantee that the number of such orders we reject is bounded in expectation. Similarly, the third term says that for an order type with v_k − u_k^⊤ l* < 0, we should accept only a bounded (in expectation) number of them.

Since the "right" decision for an order type involves the sign of v_k − u_k^⊤ l*, a natural idea is to look at the reduced costs. For the standard form (22), for every i ∈ [2K+m] the reduced cost Ψ_i is defined by

Ψ_i = v̄_i − v̄_{B̄*}^⊤ Ū_{B̄*}^{−1} Ū_i.    (29)

Notice that the dual LP (25) is also the dual of the standard form (22). We denote l̄* = [l*, y*]^⊤, and from Assumption 3 we know that the reduced costs can be written in the dual form

Ψ = v̄ − Ū^⊤ l̄*.    (30)

From the optimality conditions of linear programming we also know that

B̄* = { i : Ψ_i = 0 },  N̄* = { i : Ψ_i < 0 }.    (31)

Next, we define ψ := min{ |Ψ_i| : Ψ_i ≠ 0 }, and by examining the reduced costs we can make the following observation.
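Assuming SciPy ≥ 1.7 (for the res.eqlin.marginals equality duals returned by the HiGHS backend), the reduced costs (30) and the index sets (31) can be recovered numerically as in the following sketch; the sign flip on the marginals converts the minimization duals to our maximization convention.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, K = 2, 4
Up = rng.uniform(0.1, 1.0, size=(m, K))
vp = rng.uniform(0.1, 1.0, size=K)
d = Up @ np.full(K, 0.6); d[0] += 0.5          # resource 0 non-binding
U_bar = np.block([[Up, np.eye(m), np.zeros((m, K))],
                  [np.eye(K), np.zeros((K, m)), np.eye(K)]])
v_bar = np.concatenate([vp, np.zeros(m + K)])
d_bar = np.concatenate([d, np.ones(K)])
res = linprog(-v_bar, A_eq=U_bar, b_eq=d_bar,
              bounds=[(0, None)] * (m + 2 * K), method="highs")

# linprog minimizes -v_bar^T q_bar, so the max-form duals are the negated marginals.
l_bar = -res.eqlin.marginals
Psi = v_bar - U_bar.T @ l_bar                  # reduced costs (30)
B_star = np.where(np.abs(Psi) < 1e-9)[0]       # Psi_i = 0 -> basic, cf. (31)
N_star = np.where(Psi < -1e-9)[0]              # Psi_i < 0 -> non-basic
print("B*:", B_star, "N*:", N_star, "dual prices l*:", np.round(l_bar[:m], 4))
```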
Lemma 6. For i ∈ [K], the condition v_i − u_i^⊤ l* < 0 is sufficient and necessary for q*_i = 0; v_i − u_i^⊤ l* > 0 is sufficient for q*_i = 1; and q*_i = 1 is sufficient for v_i − u_i^⊤ l* ≥ 0.

Lemma 6 gives us the relationship between the regret, the optimal basis, and the reduced costs. For i ∈ [K], q*_i = 0 (which is equivalent to Ψ_i < 0) means not accepting order type i. Therefore, as long as the optimal basis does not change, we will always reject the order types with v_i − u_i^⊤ l* < 0. For an order type with v_i − u_i^⊤ l* > 0, q*_i = 1 (which implies z*_i = 0 and hence Ψ_{i+K+m} < 0) means always accepting the order; hence we will always accept the types i with v_i − u_i^⊤ l* > 0. However, Lemma 6 also suggests that we might always be accepting a type i with v_i − u_i^⊤ l* = 0; from the second and third terms in Lemma 5 we know that this does not affect the regret.

If the sampled LP (32) has the same optimal basis as (22), then from (31) and the analysis above we know that we are accepting and rejecting the right types of orders. More specifically, an unchanged optimal basis implies that for every i that we must reject (v_i − u_i^⊤ l* < 0) or accept (v_i − u_i^⊤ l* > 0), we always make the right decision (q*_i = 0 or z*_i = 0, respectively). The rest of this section describes how we can ensure that the sampled LP has the "right" optimal basis over time.

Intuitively, as t increases, the sampled LP (32) converges to the DLP (22). It turns out that the stability condition not only helps us identify the optimal basis of the sampled LP, but also plays an essential role in prescribing the behavior of the average constraint process d_t for both the binding and the non-binding resources: in order to apply the same tools as in Section 3 to the binding resources, we need some property ensuring the feasibility condition of Lemma 2. Assumption 3 provides similar conditions that are robust to perturbation, and hence ensures a similar feasibility condition.

WLOG, we assume the average resource constraint is ordered so that d = [d_{T*}, d_{L*}]^⊤, where d_{T*} corresponds to the binding resources and d_{L*} to the non-binding resources. As Algorithm 1 proceeds to time t+1, recall that it solves the sampled LP (4). We need some extra notation for its standard form. We denote by v_t the vector [v_1 n_1(t)/t, ..., v_K n_K(t)/t]^⊤, by U_t the matrix [u_1 n_1(t)/t, ..., u_K n_K(t)/t], and define Ū_t ∈ R^{(m+K)×(m+2K)}, v̄_t ∈ R^{m+2K}, d̄_t ∈ R^{m+K} by

Ū_t := [ U_t  I_m  0 ; I_K  0  I_K ],  v̄_t = [v_t; 0; 0],  d̄_t = [d_{T*,t}; d_{L*,t}; 1],

and consider

max v̄_t^⊤ q̄  s.t. Ū_t q̄ = d̄_t, q̄ ≥ 0.    (32)

We remark that (32) can be viewed as the standard-form DLP (22) with some noise on the reward vector v̄, the matrix Ū, and the demand process d̄. The next lemma states conditions ensuring that the optimal bases of (32) and (22) coincide. Before stating the lemma, we define the approximation errors, for i ∈ [K],

U^ǫ_t := Ū_t − Ū,  v^ǫ_t := v̄_t − v̄,  d^ǫ_t := d̄_t − d̄,
u^ǫ_{t,i} = u_i (n_i(t)/t − p_i),  v^ǫ_{t,i} = v_i (n_i(t)/t − p_i).    (33)

Next, we state the stability lemma.

Lemma 7.
Under Assumption 3 and the conditions

‖u^ǫ_{t,i}‖_∞ ≤ min{1, σ, σ²}·min{χ, ψ} / (12 d̄ r̄ ā K √(K+m)) for i ∈ B̄*,   ‖u^ǫ_{t,i}‖_∞ ≤ σψ / (r̄ √(K(K+m))) for i ∈ N̄*,
|v^ǫ_{t,i}| ≤ σψ / (ā √(K(K+m))) for i ∈ B̄*,   |v^ǫ_{t,i}| ≤ ψ/2 for i ∈ N̄*,
|d^ǫ_{t,i}| ≤ σχ / √(K+m) for i ∈ T*,   d^ǫ_{t,i} ≥ −σχ / √(K+m) for i ∈ L*,    (34)

the optimal solution of (32) is unique and has the same optimal basis as the DLP in standard form.
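The following toy experiment illustrates (but does not prove) the stability claim: re-solving after small random perturbations of v̄ and d̄ leaves the set of positive variables, hence the basis under non-degeneracy, unchanged. The perturbation level is an arbitrary choice, not the constants of (34).

```python
import numpy as np
from scipy.optimize import linprog

def basis_of(v_bar, U_bar, d_bar, tol=1e-9):
    res = linprog(-v_bar, A_eq=U_bar, b_eq=d_bar,
                  bounds=[(0, None)] * len(v_bar), method="highs")
    return frozenset(np.where(res.x > tol)[0])   # identifies the basis when the
                                                 # solution is non-degenerate

rng = np.random.default_rng(3)
m, K = 2, 4
Up = rng.uniform(0.1, 1.0, size=(m, K))
vp = rng.uniform(0.1, 1.0, size=K)
d = Up @ np.full(K, 0.6); d[0] += 0.5
U_bar = np.block([[Up, np.eye(m), np.zeros((m, K))],
                  [np.eye(K), np.zeros((K, m)), np.eye(K)]])
v_bar = np.concatenate([vp, np.zeros(m + K)])
d_bar = np.concatenate([d, np.ones(K)])

B0 = basis_of(v_bar, U_bar, d_bar)
eps = 1e-3                                       # small perturbation level
stable = all(
    basis_of(v_bar + eps * rng.standard_normal(v_bar.shape), U_bar,
             d_bar + eps * rng.standard_normal(d_bar.shape)) == B0
    for _ in range(20))
print("optimal basis unchanged under small perturbations:", stable)
```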
Now we have all the tools for bounding the first term in Lemma 5, and the technique is the same as in Section 3 except for some minor modifications due to the introduction of non-binding resources. We slightly overload notation by defining

∆ := σχ / √(K+m),
D := ( ⊗_{i∈T*} [d_i − ∆, d_i + ∆] ) ⊗ ( ⊗_{i∈L*} [d_i − ∆/2, +∞) ).    (35)

Notice that in D the regions for the binding and the non-binding resources differ, for two reasons: first, the average constraint level of a non-binding resource drifts upward in expectation; second, for the non-binding resources we need to leave some room for the concentration analysis later. As before, the analysis proceeds up to the stopping time τ:

τ := min{ t ≤ n : d_t ∉ D } ∪ { n+1 }.    (36)

Then, by defining

ǫ_{T*,t} := ā for t ≤ κn and 1/t^β for t > κn,
ǫ_{L*,t} := ā + ∆ − d̲ for t ≤ κn and −∆/4 for t > κn,

we can use the same two-component analysis with

E_t := { H_{t−1} : sup_{d'∈D} ‖ E[a_{T*,t} x_t(d') | H_{t−1}] − d'_{T*} ‖_∞ ≤ ǫ_{T*,t} and sup_{d'∈D} max_{i∈L*} ( E[a_{i,t} x_t(d') | H_{t−1}] − d'_i ) ≤ ǫ_{L*,t} },
τ̃ := min{ t ≤ n : d_t ∉ D or H_{t−1} ∉ E_t } ∪ { n+1 }.

The intuition for E_t is that we do not want d_{L*} to drift so far that the binding/non-binding structure changes (recall from the stability Lemma 7 that the binding and non-binding structure does not change while d_t stays in a certain region containing D). With τ̃, we define an auxiliary process d̃_t as

d̃_t = d_t for t < τ̃, and d̃_t = d_{τ̃} for t ≥ τ̃,

which leads to the same decomposition

P(d_s ∉ D for some s ≤ t) ≤ P(d̃_s ∉ D for some s ≤ t) + Σ_{s=1}^t P((r_1, a_1, ..., r_{s−1}, a_{s−1}) ∉ E_s).    (37)

We present the following results below.
Lemma 8. The following inequality holds for all n ≥ n_0 and t ≤ n − 1:

P( d̃_s ∉ D for some s ≤ t ) ≤ 2m e^{−∆²(n−t−1)/(32ā²)},    (38)

where the constant n_0 is defined as the minimal integer such that n_0 ≥ max{ (∆/(8ā))^{−1}, (∆/(8(ā+∆−d̲)))^{−1} } + 2 and (log n_0)/n_0^β ≤ κ^β ∆/4, β is a fixed number in (0, 1/2) as in the definition of the events E_t, and

κ = ( 1 − exp(−∆/(8ā)) ) ∧ ( 1 − exp(−∆/(8(ā+∆−d̲))) ).

Lemma 8 provides the bound for the first term in (37), and the technique is very similar to that of Lemma 1. Next, to bound the second term in (37), we define
γ := min{ min{1, σ, σ²}·min{χ, ψ} / (12 d̄ r̄ ā K √(K+m)),  σψ / (r̄ ā √(K(K+m))),  ψ/(2r̄),  ∆/(4āK) },

which is a constant ensuring that the stability conditions (34) of Lemma 7 hold. We overload the definitions of the events A_t^{(k)} and B_t^{(k)} by

A_t^{(k)} := { |n_k(t)/t − p_k| ≤ γ },  B_t^{(k)} := { |n_k(t)/t − p_k| ≤ 1/(2āKt^β) }.

Similarly to Lemma 2, within the region D and under the event ∩_{k=1}^K A_t^{(k)}, Lemma 7 implies that the optimal basis does not change; the complementary slackness pattern is thus unchanged, and in particular the binding resources remain binding. Therefore, we have the feasibility condition

d'_{T*} = Σ_{k=1}^K u_{T*,k} (n_k(t)/t) q*_k(d'),

which is essential in our proof. For B_t^{(k)}, the bound |n_k(t)/t − p_k| ≤ 1/(2āKt^β) ensures the condition in E_t for the binding indices T*, and the bound |n_k(t)/t − p_k| ≤ ∆/(4āK) (from A_t^{(k)} and the definition of γ) ensures the condition in E_t for the non-binding indices L*. We then state the bound for the second term in (37).

Lemma 9.
For t > κn, ∩_{k=1}^K (A_t^{(k)} ∩ B_t^{(k)}) ⊂ E_t.

Lemma 10.
Under Assumptions 1 and 3, we have, for t ≤ n − 1,

P(E_t) = 1 for t ≤ κn,
P(E_t) ≥ 1 − 2K exp(−2γ²t) − 2K exp(−t^{1−2β}/(2ā²K²)) for t > κn,
P(τ < t) ≤ 2m exp(−∆²(n−t−1)/(32ā²)) + (K/γ²) exp(−2γ²κn) + (4K³ā² n^{2β}/(1−2β)) exp(−(κ^{1−2β}/(2ā²K²)) n^{1−2β}),    (39)

and

E[ l*^⊤ ( b − Σ_{k=1}^K u_k N_k(n) ) ] ≤ O(m) + O(mn exp(−n)) + O(n^{1/3} exp(−n^{1/3})).    (40)

Lemma 10 is analogous to (19), and from it we can directly obtain the regret bound for the binding resources. We are then in good shape to derive the final result.

Recall from Lemma 5 that we need to bound E[ Σ_{k=1}^K (n p_k − N_k(n))(v_k − u_k^⊤ l*)^+ ], i.e., we must accept the right orders, and E[ Σ_{k=1}^K N_k(n)(u_k^⊤ l* − v_k)^+ ], i.e., we must reject the wrong orders. From Lemma 6 and the analysis following it, we know that it suffices to bound the probability that the optimal basis of (32) differs from that of (22) in each round. We state the result below.

Lemma 11.
Under Assumptions 1 and 3, by implementing Algorithm 1 we have

E[ Σ_{k=1}^K (n p_k − N_k(n))(v_k − u_k^⊤ l*)^+ ] + E[ Σ_{k=1}^K N_k(n)(u_k^⊤ l* − v_k)^+ ]
≤ η ( 128 m ā²/∆² + (Kn/γ²) exp(−2γ²n) + (4K³ā² n^{2β}/(1−2β)) exp(−(κ^{1−2β}/(2ā²K²)) n^{1−2β}) + K/γ² ),    (41)

where η = max_{k∈[K]} |v_k − u_k^⊤ l*|.

Finally, we present the final result.
Theorem 3.
Under Assumptions 1 and 3, Algorithm 1 gives a regret of order

Reg_n^π ≤ O(m) + O(mn exp(−n)) + O(n^{1/3} exp(−n^{1/3})).    (42)

5 Multi-dimensional Case
In this section, we discuss how the above results extend to the multi-dimensional case presented at the beginning of the paper. When l > 1, recall from (1) that the underlying LP is

max Σ_{j=1}^n r_j^⊤ x_j
s.t. Σ_{j=1}^n A_j x_j ≤ b
     1^⊤ x_j ≤ 1, x_j ≥ 0, j = 1, ..., n.

Applications:
This multi-dimensional formulation with linear rewards covers the following problems as special cases: the resource allocation problem (Asadpour et al., 2019), the operating room reservation/scheduling problem (Conforti et al., 2014; Stein et al., 2020), and the AdWords problem (Mehta, 2013). In all three examples, there is a linear structure between the reward r_j and the resource consumption A_j, so they can be viewed as special cases of our formulation.

For the multidimensional case, the distributional assumption becomes:
Assumption 4. We assume:
(a) The coefficient pairs (r_j, A_j) are i.i.d. samples from a distribution P. The distribution P has a finite and known support {(v_k, u_k)}_{k=1}^K where v_k ∈ R^l and u_k ∈ R^{m×l}. Specifically, P((r_j, A_j) = (v_k, u_k)) = p_k for k = 1, ..., K, and the parameters p = (p_1, ..., p_K)^⊤ are unknown.
(b) There exists a vector λ ∈ R^m with λ ≥ 0 such that v_k = u_k^⊤ λ for k = 1, ..., K.
(c) The right-hand side b = n d ≥ 0 where d = (d_1, ..., d_m)^⊤.

The multidimensional DLP takes the form

max Σ_{k=1}^K v_k^⊤ q_k p_k
s.t. Σ_{k=1}^K u_k q_k p_k ≤ d
     1^⊤ q_k ≤ 1, q_k ≥ 0, k = 1, ..., K,    (43)

and the multidimensional SLP at time t takes the form

max Σ_{k=1}^K v_k^⊤ q_k (n_k(t)/t)
s.t. Σ_{k=1}^K u_k q_k (n_k(t)/t) ≤ d_t
     1^⊤ q_k ≤ 1, q_k ≥ 0, k = 1, ..., K.    (44)

We can think of the problem as follows: at each round t, we are presented with an order bundle whose type k is drawn i.i.d. from the distribution. Inside an order bundle of type k there are l different orders, and we may choose at most one order j out of the l. The theoretical framework for the binding case and the general case in Sections 3 and 4 still works for the multidimensional case. Therefore, in the rest of this section we outline how to obtain the analogous results. For succinctness, we focus on the parts that differ from the previous arguments; the rest is the same.
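A sketch of the multidimensional DLP (43) as a flat LP, with the K·l variables q_{kj} stacked into one vector and the bundle constraints 1^⊤ q_k ≤ 1 encoded by a Kronecker product (all data illustrative):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, K, l = 3, 4, 2
u = [rng.uniform(0.0, 1.0, size=(m, l)) for _ in range(K)]   # u_k in R^{m x l}
lam = rng.uniform(0.5, 1.5, size=m)
v = [uk.T @ lam for uk in u]                                 # v_k = u_k^T lambda
p = rng.dirichlet(np.ones(K))
d = sum(pk * uk @ np.full(l, 0.4) for pk, uk in zip(p, u))   # feasible capacity

# Stack q = (q_1, ..., q_K) in R^{Kl}: m resource rows, then K bundle rows.
c = -np.concatenate([pk * vk for pk, vk in zip(p, v)])       # maximize
A_res = np.hstack([pk * uk for pk, uk in zip(p, u)])         # sum_k u_k q_k p_k <= d
A_bun = np.kron(np.eye(K), np.ones((1, l)))                  # 1^T q_k <= 1
res = linprog(c, A_ub=np.vstack([A_res, A_bun]),
              b_ub=np.concatenate([d, np.ones(K)]),
              bounds=[(0, None)] * (K * l), method="highs")
print("DLP value:", -res.fun, "q*:", np.round(res.x.reshape(K, l), 3))
```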
The assumption for multidimensional bindingness becomes:

Assumption 5.
There exists a constant ∆ > 0 such that one can find a matrix q = (q_1, ..., q_K) ∈ R^{l×K}, with q_k ≥ 0 and 1^⊤ q_k ≤ 1, that satisfies

Σ_{k=1}^K u_k q_k p_k = d'

for all d' ∈ D = ⊗_{i=1}^m [d_i − ∆, d_i + ∆].

Similar to Section 3, we can carry out the same analysis as before. First, we take the following quantities, defined in the same manner as in the one-dimensional case:

D' := ⊗_{i=1}^m [d_i − ∆/2, d_i + ∆/2],
E_t := { H_{t−1} : sup_{d'∈D'} ‖ E[A_t x_t(d') | H_{t−1}] − d' ‖_∞ ≤ ǫ_t },
ǫ_t := ā for t ≤ κn and 1/t^β for t > κn,

which lead to the same decomposition

P(d_s ∉ D' for some s ≤ t) ≤ P(d̃_s ∉ D' for some s ≤ t) + Σ_{s=1}^t P((r_1, A_1, ..., r_{s−1}, A_{s−1}) ∉ E_s).

Lemma 1 carries over trivially, since its analysis does not involve l. For the second term, Lemma 2 also holds, as follows.

Lemma 12.
For $t = 1, ..., n$, if $d_t = d' = (d'_1, ..., d'_m)^\top \in \mathcal D'$ and
$$\Big|\frac{n_k(t)}{t} - p_k\Big| \le \frac{p_k \Delta}{2(\bar d + \Delta)} \quad \text{for each } k = 1, ..., K,$$
then the optimal solution $q^* = (q_1^*, ..., q_K^*)$ of the scaled LP at time $t+1$ satisfies
$$\sum_{k=1}^K u_k q_k^* n_k(t) = t\, d' \quad \text{for each } d' \in \mathcal D'.$$

Proof.
The proof follows essentially the same argument, based on the observation that for $y_k \in \mathbb R^m$ such that $y_k \ge 0$ and $q^\top y_k \le 1$, the inequality also holds for $\gamma y_k$ with $\gamma \in [0, 1]$. Lastly, from the definition of the events
$$\mathcal A^{(k)}_t := \Big\{\Big|\frac{n_k(t)}{t} - p_k\Big| \le \frac{p_k \Delta}{2(\bar d + \Delta)}\Big\} \qquad \text{and} \qquad \mathcal B^{(k)}_t := \Big\{\Big|\frac{n_k(t)}{t} - p_k\Big| \le \frac{1}{\bar a K t^{\beta}}\Big\},$$
we know that the derivation of $\cap_{k=1}^K \big(\mathcal A^{(k)}_t \cap \mathcal B^{(k)}_t\big) \subset \mathcal E_t$ is independent of $l$, because only the maximum norms of $p$, $A$, and $d$ are required. Therefore, the result in Theorem 2 also holds for the multi-dimensional case under Assumptions 4 and 5.

In the general case, with a slight abuse of notation, we denote by $q \in \mathbb R^{l \times K}$ the matrix $[q_1, \cdots, q_K]$, by $v_p$ the stacked vector of the columns $[p_1 v_1, \cdots, p_K v_K]$, by $U_p \in \mathbb R^{m \times Kl}$ the matrix $[p_1 u_1, \cdots, p_K u_K]$, by $s$ the vector $[s_1, \ldots, s_m]^\top$, and by $z$ the vector $[z_1, \ldots, z_K]^\top$. Next, for the notation of the standard form, we define $\bar U \in \mathbb R^{(m+K)\times(m+K+Kl)}$, $\bar v \in \mathbb R^{m+K+Kl}$, and $\bar d \in \mathbb R^{m+K}$ as
$$\bar U := \begin{pmatrix} U_p & I_m & 0 \\ \bar I_K & 0 & I_K \end{pmatrix}, \qquad \bar q = \begin{pmatrix} q \\ s \\ z \end{pmatrix}, \qquad \bar v = \begin{pmatrix} v_p \\ 0 \\ 0 \end{pmatrix}, \qquad \bar d = \begin{pmatrix} d \\ \mathbf 1 \end{pmatrix},$$
where $\bar I_K \in \mathbb R^{K \times Kl}$ is the matrix whose $k$-th row has entries $l(k-1)+1$ through $lk$ equal to $1$ and all other entries equal to $0$. Then the standard form of the DLP (43) can be written as
$$\max \; \bar v^\top \bar q \quad \text{s.t.} \quad \bar U \bar q = \bar d, \qquad \bar q \ge 0. \tag{45}$$
The assumption then becomes the following.

Assumption 6.
The optimal solution to (45) is unique and non-degenerate, i.e.,
$$\big|\{i : q_i^* \neq 0,\, i = 1, \cdots, Kl\}\big| + \big|\{i : s_i^* \neq 0,\, i = 1, \cdots, m\}\big| + \big|\{i : z_i^* \neq 0,\, i = 1, \cdots, K\}\big| = m + K.$$
Then, we write $u_k = [u_{k,1}, \cdots, u_{k,l}]$ and $v_k = [v_{k,1}, \cdots, v_{k,l}]$, and denote by $N_{kj}(n)$ the number of times that we accept order type $j$ inside order bundle type $k$ (which corresponds to the $j$-th column of $u_k$). For convenience, from now on we refer to the $j$-th order in an order bundle of type $k$ as type $kj$. Similar to Lemma 5, one obtains the general regret bound below.

Lemma 13.
$$\begin{aligned} \mathrm{Reg}^{\pi}_n \le\; & \mathbb E\Big[l^{*\top}\Big(b - \sum_{k=1}^K \sum_{j=1}^l u_{kj} N_{kj}(n)\Big)\Big] + \mathbb E\Big[\sum_{k=1}^K \sum_{j=1}^l N_{kj}(n)\big(u_{kj}^\top l^* - v_{kj}\big)^+\Big] \\ &+ \mathbb E\Big[\sum_{k=1}^K \Big(np_k \max_{j\in[l]}\big(v_{kj} - u_{kj}^\top l^*\big)^+ - \sum_{j=1}^l N_{kj}(n)\big(v_{kj} - u_{kj}^\top l^*\big)^+\Big)\Big]. \end{aligned} \tag{46}$$
Lemma 13 tells us that, besides making sure the binding resources yield bounded regret, we should avoid two kinds of mistakes. First, the second term in (46) tells us that for any order type $j$ from any order bundle type $k$ with $u_{kj}^\top l^* - v_{kj} > 0$, we should not accept it, which is consistent with Section 4. The second case is a little different from the one-dimensional case: from the third term we know that, for a specific order bundle $k$, if there is any order $j$ such that $v_{kj} - u_{kj}^\top l^* > 0$, we should always accept an order type $j$ with
$$j = \arg\max_{j \in [l]} \{v_{kj} - u_{kj}^\top l^*\}. \tag{47}$$
If there is more than one maximizer, and we denote the set of maximizers in (47) by $\mathcal M_k \subseteq [l]$, then we have to make sure that $\sum_{j \in \mathcal M_k} q^*_{kj} = 1$. The intuition is that only if $\sum_{j \in \mathcal M_k} q^*_{kj} = 1$ can we guarantee that $np_k - \sum_{j=1}^l N_{kj}(n)$ is bounded in expectation. The condition $\sum_{j \in \mathcal M_k} q^*_{kj} = 1$ is equivalent to the slack variable $z_k^* = 0$, hence we can use the relationship between the optimal basis and the reduced costs again, as illustrated in the sketch below.
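As an illustration of this decision rule, here is a small sketch; the helper name and all numbers are hypothetical, and it assumes the dual prices $l^*$ have already been estimated (e.g., from the scaled LP).

```python
import numpy as np

def choose_order(v_k, u_k, l_star):
    """Pick the order j inside bundle k with the largest reduced reward
    v_kj - u_kj^T l_star; reject the whole bundle if none is positive."""
    reduced = v_k - u_k.T @ l_star        # vector of v_kj - u_kj^T l*, length l
    j = int(np.argmax(reduced))
    return j if reduced[j] > 0 else None  # None means "reject the bundle"

# Made-up example: l = 3 orders, m = 2 resources.
u_k = np.array([[0.2, 0.5, 0.1],
                [0.4, 0.1, 0.3]])         # columns are the u_kj
v_k = np.array([0.30, 0.45, 0.20])
l_star = np.array([0.5, 0.6])             # estimated dual prices
print(choose_order(v_k, u_k, l_star))     # prints 1, the only positive reduced reward
```

When several orders tie for the maximum in (47), any split of the acceptance over the maximizer set $\mathcal M_k$ is compatible with the condition $\sum_{j \in \mathcal M_k} q^*_{kj} = 1$ discussed above.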
Lemma 14.
For type $kj$, $v_{kj} - u_{kj}^\top l^* < 0$ is a sufficient condition for $q^*_{kj} = 0$; and for order bundle $k$, $\max_{j\in[l]}\big(v_{kj} - u_{kj}^\top l^*\big)^+ > 0$ is a sufficient condition for $\sum_{j \in \mathcal M_k} q^*_{kj} = 1$.

Lemma 14 tells us again that if the optimal basis does not change, we always make the "right" decision, and the bounded-regret result follows in exactly the same way as in Section 4. Therefore, the result in Theorem 3 also holds for the multi-dimensional case under Assumptions 4 and 6.
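For completeness, here is a minimal numpy sketch of the standard-form construction (45); the dimensions are assumed, and $U_p$ is taken to be $[p_1 u_1, \cdots, p_K u_K]$ as in the construction above.

```python
import numpy as np

def standard_form(u, v, p, d):
    """Assemble (U_bar, v_bar, d_bar) of (45) from u: (K, m, l) resource
    matrices, v: (K, l) rewards, p: (K,) probabilities, d: (m,) resources."""
    K, m, l = u.shape
    U_p = np.concatenate([p[k] * u[k] for k in range(K)], axis=1)  # m x Kl
    v_p = (p[:, None] * v).ravel()                                 # length Kl
    I_bar = np.kron(np.eye(K), np.ones((1, l)))                    # K x Kl
    top = np.hstack([U_p, np.eye(m), np.zeros((m, K))])
    bottom = np.hstack([I_bar, np.zeros((K, m)), np.eye(K)])
    U_bar = np.vstack([top, bottom])          # (m+K) x (m+K+Kl), as in (45)
    v_bar = np.concatenate([v_p, np.zeros(m + K)])
    d_bar = np.concatenate([d, np.ones(K)])
    return U_bar, v_bar, d_bar

# The equality-form LP (45) can then be solved with, e.g.,
# scipy.optimize.linprog(-v_bar, A_eq=U_bar, b_eq=d_bar,
#                        bounds=[(0, None)] * v_bar.size)
```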
References
Agrawal, Shipra, Erick Delage, Mark Peters, Zizhuo Wang, Yinyu Ye. 2009. A unified framework for dynamic pari-mutuel information market design. Proceedings of the 10th ACM Conference on Electronic Commerce. 255–264.
Agrawal, Shipra, Zizhuo Wang, Yinyu Ye. 2014. A dynamic near-optimal algorithm for online linear programming. Operations Research 62(4) 876–890.
Appa, Gautam. 2002. On the uniqueness of solutions to linear programs. Journal of the Operational Research Society 53(10) 1127–1132.
Asadpour, Arash, Xuan Wang, Jiawei Zhang. 2019. Online resource allocation with limited flexibility. Management Science.
Banerjee, Siddhartha, Daniel Freund. 2020. Uniform loss algorithms for online stochastic decision-making with applications to bin packing. Abstracts of the 2020 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems. 1–2.
Banerjee, Siddhartha, Itai Gurvich, Alberto Vera. 2020. Constant regret in online allocation: On the sufficiency of a single historical trace.
Buchbinder, Niv, Joseph Naor. 2009. Online primal-dual algorithms for covering and packing. Mathematics of Operations Research 34(2) 270–286.
Bumpensanti, Pornpawee, He Wang. 2020a. A re-solving heuristic with uniformly bounded loss for network revenue management. Management Science.
Bumpensanti, Pornpawee, He Wang. 2020b. A re-solving heuristic with uniformly bounded loss for network revenue management. Management Science.
Conforti, Michele, Gérard Cornuéjols, Giacomo Zambelli. 2014. Integer Programming, vol. 271. Springer.
Dantzig, George Bernard. 1965. Linear Programming and Extensions, vol. 48. Princeton University Press.
Ferguson, Thomas S. 1989. Who solved the secretary problem? Statistical Science 4(3) 282–289.
Jasin, Stefanus. 2015. Performance of an LP-based control for revenue management with unknown demand parameters. Operations Research 63(4) 909–915.
Jasin, Stefanus, Sunil Kumar. 2012. A re-solving heuristic with bounded revenue loss for network revenue management with customer choice. Mathematics of Operations Research 37(2) 313–345.
Kellerer, Hans, Ulrich Pferschy, David Pisinger. 2004. Knapsack Problems. Springer.
Kesselheim, Thomas, Andreas Tönnis, Klaus Radke, Berthold Vöcking. 2014. Primal beats dual on online packing LPs in the random-order model. Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing. ACM, 303–312.
Li, Xiaocheng. 2020. Online linear programming: Dual convergence and new algorithms. Ph.D. thesis, Stanford University.
Li, Xiaocheng, Yinyu Ye. 2019. Online linear programming: Dual convergence, new algorithms, and regret bounds. arXiv preprint arXiv:1909.05499.
Mehta, Aranyak. 2013. Online matching and ad allocation. Foundations and Trends in Theoretical Computer Science 8(4) 265–368.
Mehta, Aranyak, Amin Saberi, Umesh Vazirani, Vijay Vazirani. 2005. AdWords and generalized on-line matching. Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS). IEEE, 264–273.
Shivaswamy, Pannagadatta, Thorsten Joachims. 2012. Multi-armed bandit problems with history. Artificial Intelligence and Statistics. 1046–1054.
Stein, Clifford, Van-Anh Truong, Xinshang Wang. 2020. Advance service reservations with heterogeneous customers. Management Science 66(7) 2929–2950.
van de Geer, Sara A. 2002. On Hoeffding's inequality for dependent random variables. Empirical Process Techniques for Dependent Data. Springer, 161–169.
Vera, Alberto, Siddhartha Banerjee. 2019. The Bayesian prophet: A low-regret framework for online decision making. ACM SIGMETRICS Performance Evaluation Review 47(1) 81–82.
Vera, Alberto, Siddhartha Banerjee, Itai Gurvich. 2019. Online allocation and pricing: Constant regret via Bellman inequalities. arXiv preprint arXiv:1906.06361.
Wu, Huasen, Rayadurgam Srikant, Xin Liu, Chong Jiang. 2015. Algorithms with logarithmic or sublinear regret for constrained contextual bandits. Advances in Neural Information Processing Systems. 433–441.

Appendix
Proof of Lemma 5.
We have
$$\begin{aligned}
\mathrm{Reg}^{\pi}_n &= \mathbb{E}\Big[R^*_n - \sum_{t=1}^n r_t x_t\Big] \le \mathbb{E}\Big[\bar R^*_n - \sum_{t=1}^n r_t x_t\Big] \\
&= \mathbb{E}\Big[l^{*\top} b + \sum_{t=1}^n (r_t - a_t^\top l^*)^+ - \sum_{t=1}^n (r_t - a_t^\top l^*)x_t - \sum_{t=1}^n a_t^\top l^* x_t\Big] \\
&= \mathbb{E}\Big[l^{*\top} b + \sum_{k=1}^K np_k(v_k - u_k^\top l^*)^+ - \sum_{k=1}^K (v_k - u_k^\top l^*)N_k(n) - \sum_{k=1}^K u_k^\top l^* N_k(n)\Big] \\
&= \mathbb{E}\Big[l^{*\top} b - \sum_{k=1}^K u_k^\top l^* N_k(n)\Big] + \mathbb{E}\Big[\sum_{k=1}^K np_k(v_k - u_k^\top l^*)^+ - \sum_{k=1}^K (v_k - u_k^\top l^*)N_k(n)\Big] \\
&= \mathbb{E}\Big[l^{*\top}\Big(b - \sum_{k=1}^K u_k N_k(n)\Big)\Big] + \mathbb{E}\Big[\sum_{k=1}^K (np_k - N_k(n))(v_k - u_k^\top l^*)^+\Big] + \mathbb{E}\Big[\sum_{k=1}^K N_k(n)(u_k^\top l^* - v_k)^+\Big],
\end{aligned} \tag{48}$$
where in the second line $\bar R^*_n$ is the optimal objective value of (3) multiplied by $n$. The inequality holds because the DLP is a relaxation of the original offline linear program (Bumpensanti and Wang, 2020b). The second equality holds because of duality (26), the third equality holds because we group the incoming orders into their corresponding categories, and the rest follows from algebra.
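As a concrete illustration of this bookkeeping, here is a small Monte Carlo sketch. Everything in it is made up for illustration: the instance, the dual-price guess `l_star`, and the stand-in acceptance rule (which ignores resource feasibility); it is not the paper's Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(0)
K, m, n = 3, 2, 10_000
p = np.array([0.5, 0.3, 0.2])                    # arrival probabilities
u = rng.uniform(0.1, 1.0, size=(m, K))           # columns are the u_k
l_star = np.array([0.4, 0.7])                    # made-up dual prices l*
v = u.T @ l_star + np.array([0.05, -0.03, 0.0])  # rewards, slightly perturbed
b = n * (u @ p * 0.6)                            # resource budget b = n d

# Stand-in policy: accept type k iff its reduced reward is positive.
types = rng.choice(K, size=n, p=p)
red = v - u.T @ l_star                           # v_k - u_k^T l*
accept = red[types] > 0
N = np.bincount(types[accept], minlength=K)      # N_k(n)

term1 = l_star @ (b - u @ N)                     # l*^T (b - sum_k u_k N_k(n))
term2 = ((n * p - N) * np.maximum(red, 0)).sum()
term3 = (N * np.maximum(-red, 0)).sum()
print(term1, term2, term3)                       # the three terms of (48)
```

Because this stand-in rule never accepts an order with negative reduced reward, the third term is exactly zero and the second has mean zero, which is precisely the behavior the decomposition isolates.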
Proof of Lemma 6.
Firstly, from (30) we know that if $i \in [K]$, we have $\Psi_i = v_i - u_i^\top l^* - y_i^*$. Then notice that for $i \in [K]$,
$$v_i - u_i^\top l^* - y_i^* < 0 \;\Longleftrightarrow\; v_i - u_i^\top l^* < 0,$$
because $y_i^* > 0$ would strictly increase the dual objective and hence yield a contradiction. Hence, noticing that $\Psi_i < 0$ is equivalent to $q_i^* = 0$, the first statement follows.

Next, notice that for $i \in [K]$ we have $v_i - u_i^\top l^* > 0 \Leftrightarrow y_i^* > 0$. This is because the dual constraint requires $u_i^\top l + y_i \ge v_i$. If $v_i - u_i^\top l^* > 0$, the optimal choice is $y_i^* = v_i - u_i^\top l^* > 0$. Conversely, if $y_i^* > 0$, it is impossible to have $v_i - u_i^\top l^* \le 0$, since otherwise setting $y_i^* = 0$ would contradict optimality. Then, since complementary slackness tells us that $y_i^* > 0$ implies $q_i^* = 1$, we have $v_i - u_i^\top l^* > 0 \Rightarrow q_i^* = 1$. It is also worth noticing that since $q_i^* = 1$ implies $z_i^* = 0$, from (31) we know $\Psi_{i+K+m} < 0$.

Lastly, to show that $q_i^* = 1 \Rightarrow v_i - u_i^\top l^* \ge 0$, note that $q_i^* = 1$ implies $v_i - u_i^\top l^* - y_i^* = 0$, and hence $v_i - u_i^\top l^* = y_i^* \ge 0$.

Proof of Lemma 7.
Recall that $\bar{\mathcal B}^*$ and $\bar{\mathcal N}^*$ denote the optimal and non-optimal basis of the standard-form LP (22), respectively. The idea is to show that the perturbed LP (32) shares the same optimal and non-optimal basis under condition (34). Denote the solution of the perturbed LP (32) by $(\hat q, \hat s, \hat z)$. Specifically, assuming $\bar U_{t,\bar{\mathcal B}^*}$ is invertible, $(\hat q, \hat s, \hat z)$ is defined by
$$(\hat q, \hat s, \hat z)_{\bar{\mathcal B}^*} = \big(\bar U_{t,\bar{\mathcal B}^*}\big)^{-1} \bar d_t, \qquad (\hat q, \hat s, \hat z)_{\bar{\mathcal N}^*} = 0.$$
Next, we prove the following four results:
(a) The matrix $\bar U_{t,\bar{\mathcal B}^*}$ is non-singular, and thus $(\hat q, \hat s, \hat z)$ is well-defined.
(b) $(\hat q, \hat s, \hat z) \ge 0$, and thus $(\hat q, \hat s, \hat z)$ is a feasible solution to (32).
(c) The reduced costs of the non-basic variables in $\bar{\mathcal N}^*$ are all negative, and therefore $(\hat q, \hat s, \hat z)$ is the unique optimal solution to the perturbed LP (32).
(d) The non-binding resource $d_{\mathcal L^*,t}$ does not need an upper bound as a restriction.

To show part (a), we prove that the smallest singular value of the matrix is positive. We use $\sigma_{\min}(M)$ and $\sigma_{\max}(M)$ to denote the smallest and largest singular values of a matrix $M$. Then we have
$$\sigma_{\min}\big(\bar U_{t,\bar{\mathcal B}^*}\big) \ge \sigma_{\min}\big(\bar U_{\bar{\mathcal B}^*}\big) - \sigma_{\max}\big(-U^\epsilon_{\bar{\mathcal B}^*}\big) = \sigma - \sigma_{\max}\big(-U^\epsilon_{\bar{\mathcal B}^*}\big) \ge \sigma - \sqrt{m+K}\,\big\|U^\epsilon_{\bar{\mathcal B}^*}\big\|_\infty,$$
where the first inequality comes from Weyl's inequality on singular values, the equality comes from the definition of $\sigma$, and the last inequality is obtained from the relation between the spectral norm and the infinity norm of a matrix. From condition (34), we have $\sigma_{\min}(\bar U_{t,\bar{\mathcal B}^*}) \ge \sigma/2$, and consequently,
$$\sigma_{\max}\Big(\big(\bar U_{t,\bar{\mathcal B}^*}\big)^{-1}\Big) \le \frac{2}{\sigma}. \tag{49}$$
For part (b), we show the non-negativity of the solution $(\hat q, \hat s, \hat z)$. To keep the notation light, we omit the subscript $t$ on the perturbation terms ($U^\epsilon_t$, $v^\epsilon_t$, and so on) in the remainder of the proof. From Assumption 3 we know that $(q^*, s^*, z^*)_{\bar{\mathcal B}^*} \ge \chi > 0$, where the inequality holds element-wise. To ensure that $(\hat q, \hat s, \hat z)_{\bar{\mathcal B}^*}$ is strictly positive, it suffices to show
$$\big\|(\hat q, \hat s, \hat z)_{\bar{\mathcal B}^*} - (q^*, s^*, z^*)_{\bar{\mathcal B}^*}\big\|_\infty = \big\|(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}(\bar d + d^\epsilon) - (\bar U_{\bar{\mathcal B}^*})^{-1}\bar d\big\|_\infty \le \frac{\chi}{2}.$$
From condition (34), if $\max_{i \in \bar{\mathcal B}^* \cap [K]} \|u^\epsilon_i\|_\infty \le \frac{\sigma^2 \chi}{8\bar d K\sqrt{K+m}}$ and $\|d^\epsilon\|_\infty \le \frac{\sigma\chi}{8\sqrt{K+m}}$, then
$$\begin{aligned}
\big\|(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}(\bar d + d^\epsilon) - (\bar U_{\bar{\mathcal B}^*})^{-1}\bar d\big\|_\infty
&\le \big\|(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\bar d - (\bar U_{\bar{\mathcal B}^*})^{-1}\bar d\big\|_\infty + \big\|(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} d^\epsilon\big\|_\infty \\
&\le \big\|(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} - (\bar U_{\bar{\mathcal B}^*})^{-1}\big\|_\infty \big\|\bar d\big\|_\infty + \big\|(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} d^\epsilon\big\|_\infty \\
&\le \bar d\,\big\|(\bar U_{\bar{\mathcal B}^*})^{-1}\big(\bar U_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} - I\big)\big\|_\infty + \big\|(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} d^\epsilon\big\|_\infty \\
&= \bar d\,\big\|(\bar U_{\bar{\mathcal B}^*})^{-1}\big((\bar U + U^\epsilon - U^\epsilon)_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} - I\big)\big\|_\infty + \big\|(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} d^\epsilon\big\|_\infty \\
&= \bar d\,\big\|(\bar U_{\bar{\mathcal B}^*})^{-1} U^\epsilon_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big\|_\infty + \big\|(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} d^\epsilon\big\|_\infty \\
&\le \bar d\,\big\|(\bar U_{\bar{\mathcal B}^*})^{-1}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big\|_\infty \big\|U^\epsilon_{\bar{\mathcal B}^*}\big\|_\infty + \big\|(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} d^\epsilon\big\|_\infty \\
&\le \bar d\sqrt{K+m}\,\sigma_{\max}\big((\bar U_{\bar{\mathcal B}^*})^{-1}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big)\big\|U^\epsilon_{\bar{\mathcal B}^*}\big\|_\infty + \sqrt{K+m}\,\sigma_{\max}\big((\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big)\big\|d^\epsilon\big\|_\infty \\
&\le \frac{2\bar d\sqrt{K+m}}{\sigma^2}\big\|U^\epsilon_{\bar{\mathcal B}^*}\big\|_\infty + \frac{2\sqrt{K+m}}{\sigma}\big\|d^\epsilon\big\|_\infty \\
&\le \frac{2\bar d K\sqrt{K+m}}{\sigma^2}\max_{i \in \bar{\mathcal B}^* \cap [K]}\|u^\epsilon_i\|_\infty + \frac{2\sqrt{K+m}}{\sigma}\|d^\epsilon\|_\infty \le \frac{\chi}{2},
\end{aligned}$$
where the seventh line uses the relation between the spectral norm $\sigma_{\max}$ and the $L_\infty$ norm, the eighth line comes from the definition of $\sigma$ following Assumption 3 together with (49), and the last line follows from $\|U^\epsilon_{\bar{\mathcal B}^*}\|_\infty \le K\max_{i \in \bar{\mathcal B}^* \cap [K]}\|u^\epsilon_i\|_\infty$ (the constant $K$ appears because $U^\epsilon_{\bar{\mathcal B}^*}$ has at most $K$ non-zero columns) and condition (34). Thus we finish the part on feasibility.

For part (c), we prove that the reduced costs of the non-basic variables in $\bar{\mathcal N}^*$ are all strictly negative. We treat the decision variables corresponding to the main variable $q$ and the slack variables $s$ and $z$ separately. For a non-basic variable $q_i$ corresponding to the $i$-th order, the reduced cost of the perturbed LP (32), denoted by $\Psi^t_i$, can be expressed as
$$\begin{aligned} \Psi^t_i &:= v_i + v^\epsilon_i - (\bar v + v^\epsilon)^\top_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}(u_i + u^\epsilon_i) \\ &= v_i - \bar v^\top_{\bar{\mathcal B}^*}(\bar U_{\bar{\mathcal B}^*})^{-1} u_i + \bar v^\top_{\bar{\mathcal B}^*}(\bar U_{\bar{\mathcal B}^*})^{-1} u_i + v^\epsilon_i - (\bar v + v^\epsilon)^\top_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}(u_i + u^\epsilon_i) \\ &= \Psi_i + v^\epsilon_i + \bar v^\top_{\bar{\mathcal B}^*}\big((\bar U_{\bar{\mathcal B}^*})^{-1} - (\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big) u_i - v^{\epsilon\top}_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} u_i - \bar v^\top_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} u^\epsilon_i - v^{\epsilon\top}_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} u^\epsilon_i. \end{aligned} \tag{50}$$
Since $\Psi_i < 0$, a sufficient condition for $\Psi^t_i < 0$ is that the absolute values of the remaining five terms are each no greater than $|\Psi_i|/5$. Next, we bound these five terms. For $v^\epsilon_i$, the inequality $|v^\epsilon_i| \le |\Psi_i|/5$ is directly implied by condition (34).
For the next term, we have
$$\begin{aligned} \big|\bar v^\top_{\bar{\mathcal B}^*}\big((\bar U_{\bar{\mathcal B}^*})^{-1} - (\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big) u_i\big| &\le \big\|\bar v_{\bar{\mathcal B}^*}\big\|_1 \cdot \big\|\big((\bar U_{\bar{\mathcal B}^*})^{-1} - (\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big) u_i\big\|_\infty \\ &\le K\bar r\,\big\|(\bar U_{\bar{\mathcal B}^*})^{-1}\big(I - \bar U_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big) u_i\big\|_\infty \\ &= K\bar r\,\big\|(\bar U_{\bar{\mathcal B}^*})^{-1} U^\epsilon_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} u_i\big\|_\infty \\ &\le K\bar r\,\big\|(\bar U_{\bar{\mathcal B}^*})^{-1}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big\|_\infty \big\|U^\epsilon_{\bar{\mathcal B}^*}\big\|_\infty \|u_i\|_\infty \\ &\le K\bar r\bar a\,\big\|(\bar U_{\bar{\mathcal B}^*})^{-1}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big\|_\infty \big\|U^\epsilon_{\bar{\mathcal B}^*}\big\|_\infty \\ &\le \bar r\bar a K\sqrt{K+m}\,\sigma_{\max}\big((\bar U_{\bar{\mathcal B}^*})^{-1}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big)\big\|U^\epsilon_{\bar{\mathcal B}^*}\big\|_\infty \\ &\le \frac{2\bar r\bar a K\sqrt{K+m}}{\sigma^2}\big\|U^\epsilon_{\bar{\mathcal B}^*}\big\|_\infty \le \frac{2\bar r\bar a K^2\sqrt{K+m}}{\sigma^2}\max_{i \in \bar{\mathcal B}^* \cap [K]}\|u^\epsilon_i\|_\infty, \end{aligned} \tag{51}$$
where the first line is obtained by Hölder's inequality, the fourth line by sub-multiplicativity and (49), the sixth line from the relation between the spectral norm and the $L_\infty$ norm, and the last line again from $\|U^\epsilon_{\bar{\mathcal B}^*}\|_\infty \le K\max_{i \in \bar{\mathcal B}^* \cap [K]}\|u^\epsilon_i\|_\infty$. Thus, from (34) we have
$$\big|\bar v^\top_{\bar{\mathcal B}^*}\big((\bar U_{\bar{\mathcal B}^*})^{-1} - (\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big) u_i\big| \le \psi.$$
For the next term, we have
$$\begin{aligned} \big|v^{\epsilon\top}_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} u_i\big| &\le \big\|v^\epsilon_{\bar{\mathcal B}^*}\big\|_2 \big\|(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} u_i\big\|_2 \le \sqrt K\,\big\|v^\epsilon_{\bar{\mathcal B}^*}\big\|_\infty \sigma_{\max}\big((\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big)\|u_i\|_2 \\ &\le \sqrt{K(K+m)}\,\bar a\,\big\|v^\epsilon_{\bar{\mathcal B}^*}\big\|_\infty \sigma_{\max}\big((\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big) \le \frac{2\sqrt{K(K+m)}\,\bar a}{\sigma}\big\|v^\epsilon_{\bar{\mathcal B}^*}\big\|_\infty. \end{aligned} \tag{52}$$
Thus, from (34) again we have $\big|v^{\epsilon\top}_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} u_i\big| \le \psi$. Similarly, for the last two terms we have
$$\big|\bar v^\top_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} u^\epsilon_i\big| \le \frac{2\sqrt{K(K+m)}\,\bar r}{\sigma}\max_{i \in \bar{\mathcal N}^* \cap [K]}\|u^\epsilon_i\|_\infty, \qquad \big|v^{\epsilon\top}_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} u^\epsilon_i\big| \le \frac{2\sqrt{K(K+m)}}{\sigma}\max_{i \in \bar{\mathcal N}^* \cap [K]}\|u^\epsilon_i\|_\infty \big\|v^\epsilon_{\bar{\mathcal B}^*}\big\|_\infty. \tag{53}$$
Both of them are no larger than $\psi$ because of condition (34). Therefore, for any non-basic and non-slack variable $q_i$, we conclude that its reduced cost in the perturbed LP satisfies $\Psi^t_i < 0$.

Next, for a non-basic slack variable $s_j$ or $z_j$, similarly to the proof for $q_i$, the reduced cost $\Psi^t_j$ in the perturbed LP (32) satisfies
$$\begin{aligned} \Psi^t_j &= -(\bar v + v^\epsilon)^\top_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} e_j = \Psi_j + \bar v^\top_{\bar{\mathcal B}^*}\bar U^{-1}_{\bar{\mathcal B}^*} e_j - (\bar v + v^\epsilon)^\top_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} e_j \\ &= \Psi_j + \bar v^\top_{\bar{\mathcal B}^*}\big(\bar U^{-1}_{\bar{\mathcal B}^*} - (\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\big) e_j - v^{\epsilon\top}_{\bar{\mathcal B}^*}(\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*} e_j \\ &\ge \Psi_j - \frac{2K^{3/2}\sqrt{K+m}\,\bar r}{\sigma^2}\max_{i \in \bar{\mathcal B}^* \cap [K]}\|u^\epsilon_i\|_\infty - \frac{2\sqrt K}{\sigma}\big\|v^\epsilon_{\bar{\mathcal B}^*}\big\|_\infty, \end{aligned}$$
where $e_j$ is the vector with value 1 in the $j$-th entry and 0 otherwise. From condition (34), $\Psi^t_j \le \Psi_j + \psi < 0$. Thus part (c) is done.

For part (d), we already know that under the condition
$$\big\|u^\epsilon_{t,i}\big\|_\infty \le \begin{cases} \dfrac{\min\{1,\sigma,\sigma^2\}\cdot\min\{\chi,\psi\}}{12\,\bar d\,\bar r\,\bar a K\sqrt{K+m}}, & i \in \mathcal B^*, \\[6pt] \dfrac{\sigma\psi}{2\bar r\sqrt{K(K+m)}}, & i \in \mathcal N^*, \end{cases} \qquad |v^\epsilon_{t,i}| \le \begin{cases} \dfrac{\sigma\psi}{2\bar a\sqrt{K(K+m)}}, & i \in \mathcal B^*, \\[6pt] \psi, & i \in \mathcal N^*, \end{cases} \qquad \big\|d^\epsilon_t\big\|_\infty \le \frac{\sigma\chi}{8\sqrt{K+m}}, \tag{54}$$
the optimal solution of (32) is unique and shares the optimal basis with the solution of (22). Since the optimal basis is the same, from the previous proof we know that the reduced cost of each non-basic variable is strictly negative; therefore, for the case $d^\epsilon_{\mathcal L^*,t} \ge \frac{\sigma\chi}{8\sqrt{K+m}}$, if we can find an optimal solution on the optimal basis, the solution will be optimal and unique (Appa, 2002). Consider the offline LP (32) under the condition
$$\big\|u^\epsilon_{t,i}\big\|_\infty \le \begin{cases} \dfrac{\min\{1,\sigma,\sigma^2\}\cdot\min\{\chi,\psi\}}{12\,\bar d\,\bar r\,\bar a K\sqrt{K+m}}, & i \in \mathcal B^*, \\[6pt] \dfrac{\sigma\psi}{2\bar r\sqrt{K(K+m)}}, & i \in \mathcal N^*, \end{cases} \qquad |v^\epsilon_{t,i}| \le \begin{cases} \dfrac{\sigma\psi}{2\bar a\sqrt{K(K+m)}}, & i \in \mathcal B^*, \\[6pt] \psi, & i \in \mathcal N^*, \end{cases} \qquad \begin{cases} |d^\epsilon_{t,i}| \le \dfrac{\sigma\chi}{8\sqrt{K+m}}, & i \in \mathcal T^*, \\[6pt] d^\epsilon_{t,i} = -\dfrac{\sigma\chi}{8\sqrt{K+m}}, & i \in \mathcal L^*. \end{cases} \tag{55}$$
We know that since the optimal basis does not change, in the optimal solution of this LP the slack variable $\hat s_i$, for $K+i \in \bar{\mathcal B}^* \cap [K+1, K+m]$, is positive, and from this we know that $i \in \mathcal L^*$. This LP has a basic optimal solution $(\hat q, \hat s, \hat z)$ such that $(\hat q, \hat s, \hat z)_{\bar{\mathcal B}^*} = (\bar U + U^\epsilon)^{-1}_{\bar{\mathcal B}^*}\bar d_t$, and next we derive the optimal solution of another LP problem in which the constraints for the non-binding resources are larger. More specifically, for any $\bar d^{(1)}_t = \bar d_t + \bar d^\delta$, where
$$\bar d^\delta_i = \begin{cases} \delta_i > 0, & i \in \mathcal L^*, \\ 0, & i \notin \mathcal L^*, \end{cases} \tag{56}$$
the basic optimal solution of the offline LP (32) under condition (55) (except that the resource constraint is now $\bar d^{(1)}_t$) is given by $(\hat q, \hat s, \hat z) + \bar q^\delta$, where
$$\bar q^\delta_i = \begin{cases} \bar d^\delta_{i-K}, & i \in [K+1, K+m], \\ 0, & \text{otherwise}. \end{cases} \tag{57}$$
Since we are only increasing constraints that have slack, we know that $(\hat q, \hat s, \hat z) + \bar q^\delta$ is the optimal solution, and by construction its set of basic variables is still $\bar{\mathcal B}^*$. Moreover, because the reduced costs of the non-basic variables are still strictly negative, the solution is unique. Finally, this implies that the optimal solution is still unique and shares the same optimal basis as (22).

Proof of Lemma 8.
We treat the binding and non-binding indices separately. For a binding index $i \in \mathcal T^*$ and $t = 1, ..., n$, let
$$Y_t := \tilde d_{i,t+1} - \tilde d_{i,t}, \qquad X_t := Y_t - \mathbb E[Y_t \mid \mathcal H_{t-1}].$$
For the binding case the setup is exactly the same as in the proof of Lemma 1. Therefore, from Theorem 1 we know that
$$\mathbb P\Big(\Big|\sum_{j=1}^s X_j\Big| \ge \frac{\Delta}{2} \text{ for some } s \le t\Big) \le e^{-\frac{\Delta^2(n-t)}{8\bar a^2}} \tag{58}$$
holds for $t \le n-1$. The difference in the proof of Lemma 8 is that we have to make sure the new choice of $\kappa$ satisfies the bound
$$\Big|\sum_{j=1}^s X_j - \sum_{j=1}^s Y_j\Big| \le \sum_{j=1}^{\kappa n} \frac{\bar a}{n-j} + \sum_{j=\kappa n+1}^{s} \frac{1}{(n-j)\,j^{\beta}} \le \frac{\Delta}{2}.$$
For the first part,
$$\sum_{j=1}^{\kappa n} \frac{\bar a}{n-j} \le \bar a\Big(\log\Big(\frac{n-1}{n-2}\Big) - \log(1-\kappa)\Big).$$
For the second part,
$$\sum_{j=\kappa n+1}^{n-1} \frac{1}{(n-j)\,j^{\beta}} \le \frac{1}{(\kappa n)^{\beta}}\sum_{j=\kappa n+1}^{n-1} \frac{1}{n-j} \le \frac{\log n}{(\kappa n)^{\beta}}.$$
From our new definition that $\kappa \le 1 - \exp\big(-\frac{\Delta}{8\bar a}\big)$, and that $n_1$ is the minimal integer such that $n_1 \ge \big(\frac{\Delta}{8\bar a}\big)^{-1} + 2$ and $\frac{\log n_1}{n_1^{\beta}} \le \frac{\kappa^{\beta}\Delta}{4}$, we know that the first and second parts are both bounded by $\Delta/4$. Then the following inequality holds for $n \ge n_1$:
$$\sum_{j=1}^{\kappa n} \frac{\bar a}{n-j} + \sum_{j=\kappa n+1}^{n-1} \frac{1}{(n-j)\,j^{\beta}} \le \frac{\Delta}{4} + \frac{\Delta}{4} = \frac{\Delta}{2}.$$
With this choice of $\kappa$ and $n \ge n_1$, we have $\big|\sum_{j=1}^s X_j - \sum_{j=1}^s Y_j\big| \le \frac{\Delta}{2}$ almost surely. Consequently,
$$\Big\{|\tilde d_{i,s} - d_i| > \Delta \text{ for some } s \le t\Big\} = \Big\{\Big|\sum_{j=1}^{s-1} Y_j\Big| > \Delta \text{ for some } s \le t\Big\} = \Big\{\Big|\sum_{j=1}^{s} Y_j\Big| > \Delta \text{ for some } s \le t-1\Big\} \subseteq \Big\{\Big|\sum_{j=1}^{s} X_j\Big| > \frac{\Delta}{2} \text{ for some } s \le t-1\Big\}.$$
Therefore, for $i \in \mathcal T^*$ we know that $\mathbb P\big(\tilde d_{i,s} \notin \mathcal D' \text{ for some } s \le t\big) \le e^{-\frac{\Delta^2(n-t)}{8\bar a^2}}$ for $t \le n-1$ and $n \ge n_1$.

Next, for a non-binding index $i \in \mathcal L^*$, we use the same definitions of $Y_t$ and $X_t$:
$$Y_t := \tilde d_{i,t+1} - \tilde d_{i,t}, \qquad X_t := Y_t - \mathbb E[Y_t \mid \mathcal H_{t-1}].$$
From the fact that $Y_t = \frac{1}{n-t}\big(d_t - a_t x_t(d_t)\big)\mathbb 1\{\tilde\tau > t\}$, that $\{\tilde\tau > t\}$ is $\mathcal H_{t-1}$-measurable, and that $\{\tilde\tau > t\}$ implies $d_t \in \mathcal D$ and $\mathcal H_{t-1} \in \mathcal E_t$, we have, on $\{\tilde\tau > t\}$,
$$\mathbb E\big[d_{i,t} - a_{i,t} x_t(d_t) \mid \mathcal H_{t-1}\big] \ge \inf_{d' \in \mathcal D}\Big(d'_{\mathcal L^*} - \mathbb E[a_{\mathcal L^*,t} x_t(d') \mid \mathcal H_{t-1}]\Big) \ge \begin{cases} d_i - \Delta - \bar a, & t \le \kappa n, \\[2pt] \dfrac{\Delta}{4}, & t > \kappa n, \end{cases}$$
and hence
$$\mathbb E[Y_t \mid \mathcal H_{t-1}] = \frac{1}{n-t}\mathbb E\big[\big(d_{i,t} - a_{i,t} x_t(d_t)\big)\mathbb 1\{\tilde\tau > t\} \,\big|\, \mathcal H_{t-1}\big] \ge \begin{cases} \dfrac{d_i - \Delta - \bar a}{n-t}, & t \le \kappa n, \\[4pt] 0, & t > \kappa n, \end{cases} \tag{59}$$
almost surely. Next, similarly to Lemma 1, we have that
$$\mathbb P\Big(\sum_{j=1}^s X_j \le -\frac{\Delta}{4} \text{ for some } s \le t\Big) \le e^{-\frac{\Delta^2(n-t)}{32\bar a^2}} \tag{60}$$
holds for $t \le n-1$. Then we need to show that
$$\sum_{j=1}^s X_j - \sum_{j=1}^s Y_j = -\sum_{j=1}^s \mathbb E[Y_j \mid \mathcal H_{j-1}] \le \sum_{j=1}^{\kappa n} \frac{\bar a + \Delta - d_i}{n-j} \le \frac{\Delta}{4}.$$
Again, notice that
$$\sum_{j=1}^{\kappa n} \frac{\bar a + \Delta - d_i}{n-j} \le (\bar a + \Delta - d_i)\Big(\log\Big(\frac{n-1}{n-2}\Big) - \log(1-\kappa)\Big).$$
From our new definition that $\kappa \le 1 - \exp\big(-\frac{\Delta}{8(\bar a + \Delta - d_i)}\big)$, and that $n_1$ is the minimal integer such that $n_1 \ge \big(\frac{\Delta}{8(\bar a + \Delta - d_i)}\big)^{-1} + 2$, we know that the above is bounded by $\Delta/4$. Therefore, with this choice of $\kappa$ and $n \ge n_1$, we have $\sum_{j=1}^s X_j - \sum_{j=1}^s Y_j \le \frac{\Delta}{4}$ almost surely. Next, we have for $i \in \mathcal L^*$
$$\Big\{\tilde d_{i,s} - d_i \le -\frac{\Delta}{2} \text{ for some } s \le t\Big\} = \Big\{\sum_{j=1}^{s-1} Y_j \le -\frac{\Delta}{2} \text{ for some } s \le t\Big\} = \Big\{\sum_{j=1}^{s} Y_j \le -\frac{\Delta}{2} \text{ for some } s \le t-1\Big\} \subseteq \Big\{\sum_{j=1}^{s} X_j \le -\frac{\Delta}{4} \text{ for some } s \le t-1\Big\}.$$
Therefore, for $i \in \mathcal L^*$ we have that $\mathbb P\big(\tilde d_{i,s} \notin \mathcal D \text{ for some } s \le t\big) \le e^{-\frac{\Delta^2(n-t)}{32\bar a^2}}$. To sum up, by taking a union bound, we know that for all $n \ge n_1$ and $t \le n-1$, we have
$$\mathbb P\big(\tilde d_s \notin \mathcal D \text{ for some } s \le t\big) \le m\,e^{-\frac{\Delta^2(n-t)}{32\bar a^2}}.$$

Proof of Lemma 9.
Under $\cap_{k=1}^K \mathcal A^{(k)}_t$, for any $d' \in \mathcal D$, Lemma 7 tells us that there exists $\hat d$, depending on $d'$, such that
$$\hat d = \sum_{k=1}^K u_k \frac{n_k(t)}{t}\, q^*_k(d'),$$
with
$$\hat d_{\mathcal T^*} = d'_{\mathcal T^*} = \sum_{k=1}^K u_{\mathcal T^*,k} \frac{n_k(t)}{t}\, q^*_k(d'), \qquad \hat d_{\mathcal L^*} = \sum_{k=1}^K u_{\mathcal L^*,k} \frac{n_k(t)}{t}\, q^*_k(d') \le d'_{\mathcal L^*}. \tag{61}$$
Therefore, from equation (61) and Algorithm 1 we know that
$$d'_{\mathcal L^*} - \mathbb E[a_{\mathcal L^*,t} x_t(d') \mid \mathcal H_{t-1}] = d'_{\mathcal L^*} - \sum_{k=1}^K u_{\mathcal L^*,k}\, p_k\, q^*_k(d') = d'_{\mathcal L^*} - \hat d_{\mathcal L^*} + \sum_{k=1}^K u_{\mathcal L^*,k}\Big(\frac{n_k(t)}{t} - p_k\Big) q^*_k(d') \ge \Big(d_{\mathcal L^*} - \frac{\Delta}{2}\Big) + \big(\Delta - d_{\mathcal L^*}\big) - \frac{\Delta}{4} = \frac{\Delta}{4}, \tag{62}$$
where in the last inequality, $d'_{\mathcal L^*} \ge d_{\mathcal L^*} - \frac{\Delta}{2}$ follows from the fact that $d' \in \mathcal D$, $\hat d_{\mathcal L^*} \le d_{\mathcal L^*} - \Delta$ follows from Lemma 7 (more specifically, from the positivity of the slack variables and the stability of the optimal basic index), and the last term is from the inequality ensured by $\mathcal A^{(k)}_t$ that $\big|\frac{n_k(t)}{t} - p_k\big| \le \frac{\Delta}{4\bar a K}$ for every $k$. Therefore, under $\cap_{k=1}^K \mathcal A^{(k)}_t$ we have proved the property
$$\sup_{d' \in \mathcal D}\Big(\mathbb E[a_{\mathcal L^*,t} x_t(d') \mid \mathcal H_{t-1}] - d'_{\mathcal L^*}\Big) \le -\frac{\Delta}{4}.$$
Then, in order to show $\cap_{k=1}^K \big(\mathcal A^{(k)}_t \cap \mathcal B^{(k)}_t\big) \subset \mathcal E_t$, it suffices to show that
$$\sup_{d' \in \mathcal D}\big\|\mathbb E[a_{\mathcal T^*,t} x_t(d') \mid \mathcal H_{t-1}] - d'_{\mathcal T^*}\big\|_\infty \le t^{-\beta}.$$
Under the event $\cap_{k=1}^K \big(\mathcal A^{(k)}_t \cap \mathcal B^{(k)}_t\big)$, from equation (61) and Algorithm 1 we know that
$$\big\|\mathbb E[a_{\mathcal T^*,t} x_t(d') \mid \mathcal H_{t-1}] - d'_{\mathcal T^*}\big\|_\infty = \Big\|\sum_{k=1}^K u_{\mathcal T^*,k}\Big(p_k - \frac{n_k(t)}{t}\Big) q^*_k(d')\Big\|_\infty \le t^{-\beta}, \tag{63}$$
where the last inequality comes from $\cap_{k=1}^K \mathcal B^{(k)}_t$. Hence we are done with the proof.

Proof of Lemma 10.
When $t \le \kappa n$, from the design of $\epsilon_{\mathcal T^*,t}$ and $\epsilon_{\mathcal L^*,t}$ we know that $\mathbb P(\mathcal E_t) = 1$. When $t > \kappa n$, the inequality follows directly from Lemma 9 and Hoeffding's inequality. For $t \le n-1$,
$$\begin{aligned} \mathbb P(\tau < t) &= \mathbb P\big(d_s \notin \mathcal D \text{ for some } s \le t\big) \le \mathbb P\big(\tilde d_s \notin \mathcal D \text{ for some } s \le t\big) + \sum_{s=1}^t \mathbb P\big((r_1, a_1, ..., r_{s-1}, a_{s-1}) \notin \mathcal E_s\big) \\ &\le \mathbb P\big(\tilde d_s \notin \mathcal D \text{ for some } s \le t\big) + \sum_{s=\kappa n+1}^t 2K\exp\big(-\gamma^2 s\big) + \sum_{s=\kappa n+1}^t 2K\exp\Big(-\frac{2 s^{1-2\beta}}{\bar a^2 K^2}\Big) \\ &\le m\exp\Big(-\frac{\Delta^2(n-t)}{32\bar a^2}\Big) + \frac{2K}{\gamma^2}\exp\big(-\gamma^2\kappa n\big) + \frac{K\bar a^2 n^{\beta}}{\beta}\exp\Big(-\frac{\kappa^{1-2\beta}}{K^2\bar a^2}\,n^{1-2\beta}\Big), \tag{64} \end{aligned}$$
where the last line comes from (18) and the analysis above. Next, in the same way as in the proof of Lemma 4, we have
$$\begin{aligned} \mathbb E[\tau] &= \sum_{t=1}^n \big(1 - \mathbb P(\tau \le t)\big) \\ &\ge n - 1 - \sum_{t=1}^{n-1}\Big(m\exp\Big(-\frac{\Delta^2(n-t)}{32\bar a^2}\Big) + \frac{2K}{\gamma^2}\exp\big(-\gamma^2\kappa n\big) + \frac{K\bar a^2 n^{\beta}}{\beta}\exp\Big(-\frac{\kappa^{1-2\beta}}{K^2\bar a^2}\,n^{1-2\beta}\Big)\Big) \\ &= n - 1 - m\,\frac{1 - e^{-\Delta^2(n-1)/(32\bar a^2)}}{1 - e^{-\Delta^2/(32\bar a^2)}}\,e^{-\Delta^2/(32\bar a^2)} - \frac{2Kn}{\gamma^2}\exp\big(-\gamma^2\kappa n\big) - \frac{K\bar a^2 n^{\beta+1}}{\beta}\exp\Big(-\frac{\kappa^{1-2\beta}}{K^2\bar a^2}\,n^{1-2\beta}\Big) \\ &\ge n - 1 - \frac{32 m\bar a^2}{\Delta^2} - \frac{2Kn}{\gamma^2}\exp\big(-\gamma^2\kappa n\big) - \frac{K\bar a^2 n^{\beta+1}}{\beta}\exp\Big(-\frac{\kappa^{1-2\beta}}{K^2\bar a^2}\,n^{1-2\beta}\Big). \tag{65} \end{aligned}$$
Lastly, similar to the proof of Theorem 2, we have
$$\begin{aligned} \mathbb E\Big[l^{*\top}\Big(b - \sum_{k=1}^K u_k N_k(n)\Big)\Big] &= \mathbb E\Big[l^{*\top}_{\mathcal T^*}\Big(b_{\mathcal T^*} - \sum_{k=1}^K u_{\mathcal T^*,k} N_k(n)\Big)\Big] \le l^{*\top}_{\mathcal T^*}\mathbb E[b_{\mathcal T^*,n+1}] \le l^{*\top}_{\mathcal T^*}\mathbb E[b_{\mathcal T^*,\tau-1}] \\ &= l^{*\top}_{\mathcal T^*}\mathbb E\big[(n - \tau + 2)\, d_{\mathcal T^*,\tau-1}\big] \le l^{*\top}_{\mathcal T^*}\Big(d_{\mathcal T^*} + \frac{\Delta}{2}\Big)\mathbb E[n - \tau + 2] \\ &\le O(m) + O\big(mn\exp(-n)\big) + O\big(n^{1/2}\exp(-n^{1/2})\big), \tag{66} \end{aligned}$$
where the last line comes from the fact that $1/\gamma$ is of the order $m$.

Proof of Lemma 11.
Suppose that at time $t$ the arriving order consumes resources $u_k$, i.e., $a_t = u_k$, and it corresponds to a type of order that we should reject, i.e., $v_k - u_k^\top l^* < 0$. We know that on the event $\{d_t \in \mathcal D\} \cap \big\{\cap_{k=1}^K \mathcal A^{(k)}_t\big\}$ the optimal basis does not change. From Lemma 6, for those order types $k$ with $v_k - u_k^\top l^* < 0$, if we denote by $q^{(t)}$ the optimal solution at time $t$, we have $q^{(t)}_k = 0$, which means we never mistakenly accept such an order. Likewise, $\{d_t \in \mathcal D\} \cap \big\{\cap_{k=1}^K \mathcal A^{(k)}_t\big\}$ also prevents us from mistakenly rejecting order type $k$ when $v_k - u_k^\top l^* > 0$. Henceforth,
$$\{\text{making a mistake at round } t\} \subseteq \{d_t \notin \mathcal D\} \cup \big\{\cup_{k=1}^K \mathcal A^{(k),c}_t\big\}.$$
Therefore, if we denote by $\hat N_k(n)$ the number of mistakes we have made for order type $k$ up to $t = n$, we know that
$$\hat N_k(n) = \begin{cases} N_k(n), & \text{for } u_k^\top l^* - v_k > 0, \\ \sum_{s=1}^n \mathbb 1\{a_s = u_k\} - N_k(n), & \text{for } v_k - u_k^\top l^* > 0. \end{cases}$$
For simplicity of notation, denote $\eta = \max_{k\in[K]}\big|v_k - u_k^\top l^*\big|$. Then
$$\begin{aligned} &\mathbb E\Big[\sum_{k=1}^K (np_k - N_k(n))(v_k - u_k^\top l^*)^+\Big] + \mathbb E\Big[\sum_{k=1}^K N_k(n)(u_k^\top l^* - v_k)^+\Big] \\ &\le \eta\,\mathbb E\Big[\sum_{k=1}^K (np_k - N_k(n))\mathbb 1\{v_k - u_k^\top l^* > 0\}\Big] + \eta\,\mathbb E\Big[\sum_{k=1}^K N_k(n)\mathbb 1\{u_k^\top l^* - v_k > 0\}\Big] \\ &\le \eta\sum_{k=1}^K \mathbb E\big[\hat N_k(n)\big] = \eta\sum_{t=1}^n \mathbb E\Big[\sum_{k=1}^K \big(\hat N_k(t) - \hat N_k(t-1)\big)\Big] \le \eta\sum_{t=1}^n \Big(\mathbb P(d_t \notin \mathcal D) + \mathbb P\big(\cup_{k=1}^K \mathcal A^{(k),c}_t\big)\Big). \tag{67} \end{aligned}$$
Hence we know that
$$\begin{aligned} &\mathbb E\Big[\sum_{k=1}^K (np_k - N_k(n))(v_k - u_k^\top l^*)^+\Big] + \mathbb E\Big[\sum_{k=1}^K N_k(n)(u_k^\top l^* - v_k)^+\Big] \\ &\le \eta + \eta\sum_{t=1}^{n-1}\Big(m\exp\Big(-\frac{\Delta^2(n-t)}{32\bar a^2}\Big) + \frac{2K}{\gamma^2}\exp\big(-\gamma^2\kappa n\big) + \frac{K\bar a^2 n^{\beta}}{\beta}\exp\Big(-\frac{\kappa^{1-2\beta}}{K^2\bar a^2}\,n^{1-2\beta}\Big) + 2K\exp\big(-\gamma^2 t\big)\Big) \\ &\le \eta\Big(\frac{32m\bar a^2}{\Delta^2} + \frac{2Kn}{\gamma^2}\exp\big(-\gamma^2\kappa n\big) + \frac{K\bar a^2 n^{\beta+1}}{\beta}\exp\Big(-\frac{\kappa^{1-2\beta}}{K^2\bar a^2}\,n^{1-2\beta}\Big) + \frac{2K}{\gamma^2}\Big) \\ &\le O(m) + O\big(mn\exp(-n)\big) + O\big(n^{1/2}\exp(-n^{1/2})\big), \tag{68} \end{aligned}$$
where the last line comes from the fact that $1/\gamma$ is of the order $m$.

Proof of Lemma 13.
$$\begin{aligned} \mathrm{Reg}^{\pi}_n &\le l^{*\top} b + \sum_{k=1}^K np_k \max_{j\in[l]}\big(v_{kj} - u_{kj}^\top l^*\big)^+ - \sum_{t=1}^n r_t x_t \\ &= l^{*\top} b + \sum_{k=1}^K np_k \max_{j\in[l]}\big(v_{kj} - u_{kj}^\top l^*\big)^+ - \sum_{k=1}^K\sum_{j=1}^l v_{kj} N_{kj}(n) \\ &= l^{*\top} b + \sum_{k=1}^K np_k \max_{j\in[l]}\big(v_{kj} - u_{kj}^\top l^*\big)^+ - \sum_{k=1}^K\sum_{j=1}^l \big(v_{kj} - u_{kj}^\top l^*\big) N_{kj}(n) - \sum_{k=1}^K\sum_{j=1}^l u_{kj}^\top l^* N_{kj}(n) \\ &= l^{*\top}\Big(b - \sum_{k=1}^K\sum_{j=1}^l u_{kj} N_{kj}(n)\Big) + \sum_{k=1}^K\Big(np_k \max_{j\in[l]}\big(v_{kj} - u_{kj}^\top l^*\big)^+ - \sum_{j=1}^l N_{kj}(n)\big(v_{kj} - u_{kj}^\top l^*\big)\Big) \\ &\le l^{*\top}\Big(b - \sum_{k=1}^K\sum_{j=1}^l u_{kj} N_{kj}(n)\Big) + \sum_{k=1}^K\Big(np_k \max_{j\in[l]}\big(v_{kj} - u_{kj}^\top l^*\big)^+ - \sum_{j=1}^l N_{kj}(n)\big(v_{kj} - u_{kj}^\top l^*\big)^+\Big) + \sum_{k=1}^K\sum_{j=1}^l N_{kj}(n)\big(u_{kj}^\top l^* - v_{kj}\big)^+. \end{aligned} \tag{69}$$
Taking expectations on both sides gives (46).

Proof of Lemma 14.
$v_{kj} - u_{kj}^\top l^* < 0$ implies that $v_{kj} - u_{kj}^\top l^* - y_k^* < 0$, hence from complementary slackness we know that $q^*_{kj} = 0$. Next, when $\max_{j\in[l]}\big(v_{kj} - u_{kj}^\top l^*\big)^+ > 0$, we know that in this case $y_k^* = \max_{j\in[l]}\big(v_{kj} - u_{kj}^\top l^*\big)^+$. From complementary slackness, we know that $y_k^* > 0$ implies $\sum_{j=1}^l q^*_{kj} = 1$.
Then, to show $\sum_{j \in \mathcal M_k} q^*_{kj} = 1$, it suffices to show that $q^*_{kj} = 0$ for $j \notin \mathcal M_k$. When $j \notin \mathcal M_k$, we know that $v_{kj} - u_{kj}^\top l^* - y_k^* < 0$, and from complementary slackness again we know $q^*_{kj} = 0$.