Dual Mirror Descent for Online Allocation Problems
Haihao Lu ∗ Santiago Balseiro † Vahab Mirrokni ‡ February 2020
Abstract
We consider online allocation problems with concave revenue functions and resource constraints, which are central problems in revenue management and online advertising. In these settings, requests arrive sequentially during a finite horizon and, for each request, a decision maker needs to choose an action that consumes a certain amount of resources and generates revenue. The revenue function and resource consumption of each request are drawn independently and at random from a probability distribution that is unknown to the decision maker. The objective is to maximize cumulative revenues subject to a constraint on the total consumption of resources.

We design a general class of algorithms that achieve sub-linear expected regret compared to the hindsight optimal allocation. Our algorithms operate in the Lagrangian dual space: they maintain a dual multiplier for each resource that is updated using online mirror descent. By choosing the reference function accordingly, we recover the dual sub-gradient descent and dual exponential weights algorithms. The resulting algorithms are simple, efficient, and shown to attain the optimal order of regret when the length of the horizon and the initial number of resources are scaled proportionally. We discuss applications to online bidding in repeated auctions with budget constraints and online proportional matching with high entropy.
∗ University of Chicago - Booth School of Business, and Google Research. † Columbia Business School - Decision, Risk and Operations, and Google Research. ‡ Google Research.

1 Introduction

A central problem in revenue management and online advertising is the online allocation of requests subject to resource constraints. In revenue management, for example, firms such as hotels and airlines need to decide, when a request for a room or a flight arrives, whether to accept or decline the request (Talluri and van Ryzin, 2004). In search advertising, each time a user makes a search, the search engine has an opportunity to show an advertisement next to the organic search results (Mehta et al., 2007b). For each arriving user, the website collects bids from various advertisers who are interested in showing an ad and then needs to decide, in real time, which ad to show to the user. Such decisions are not made in isolation because resources are limited: hotels have a limited number of rooms, planes have a limited number of seats, and advertisers have limited budgets.

In this paper, we study allocation problems with concave revenue functions and resource constraints. Requests arrive sequentially during a finite horizon and, for each request, the decision maker needs to choose an action that consumes a certain amount of resources and generates revenue. The objective of the decision maker is to maximize cumulative revenues subject to a constraint on the total consumption of resources. The revenue function and resource consumption of each request are learned by the decision maker before making a decision. For example, airlines know the fare requested by the consumer before deciding whether to sell the ticket, and search engines know advertisers' bids before deciding which ad to show. We assume that the revenue function and resource consumption of each request are drawn independently and at random from a fixed probability distribution.
In practice, decision makers rarely know the probability distribution of requests in advance. Thus motivated, we consider a data-driven setting in which the underlying probability distribution is unknown to the decision maker. The performance of an online algorithm is measured using regret, which is given by the difference between the revenue attained by the optimal allocation with the benefit of hindsight (also referred to as the offline optimum) and the cumulative revenue collected by the algorithm.

We design a general class of algorithms that operate in the Lagrangian dual space. If the optimal dual variables were known in advance, the decision maker could, in principle, use these dual variables to price resources and decompose the problem across time periods. In practice, however, the optimal dual variables depend on the entire sequence of requests and are not known to the decision maker in advance. Our algorithms circumvent this issue by maintaining a dual multiplier for each resource, which is updated after each request using online mirror descent. Actions are then taken using the estimated dual variables as a proxy for the opportunity cost of consuming resources. By choosing the reference function accordingly, we recover the dual sub-gradient descent and dual exponential weights algorithms.

From a computational perspective, our algorithms are efficient; in many cases the dual variables can be updated after each request in linear time. This is in sharp contrast to most existing algorithms, which require periodically solving large convex optimization problems (see Section 1.2). In many applications, such as online advertising, a massive number of decisions need to be made in milliseconds, and solving large optimization problems is not operationally feasible.

We show that our algorithms attain regret of order O(√T) when the length of the horizon T and the initial number of resources are scaled proportionally (Theorem 1).
Because no algorithm can attain regret lower than Ω(√T) under our minimal assumptions (Lemma 1), these two results imply that our algorithms attain the optimal order of regret. To establish our regret bounds, we need to overcome two challenges: lower bounding the cumulative performance of our algorithm relative to the benchmark, and showing that resources are not depleted too early in the horizon. We next describe these two challenges.

Recall that even though our algorithms operate in the dual space, performance is ultimately measured in the primal space. Therefore, standard results from the online mirror descent literature do not directly apply to our setting, as these provide upper bounds on dual performance while our analysis requires lower bounds on primal performance. We overcome this first challenge by providing a novel analysis of dual online mirror descent that yields suitable lower bounds on primal performance (Proposition 3).

A requisite for obtaining good primal performance is not depleting resources too early; otherwise, the decision maker could miss good future opportunities. Our algorithms have a natural self-correcting feature that prevents them from depleting resources too early. By design, they target consuming a constant amount of resources per period so as to deplete resources exactly at the end of the horizon. When a request consumes more (less) resources than the target, the corresponding dual variable is increased (decreased). Because resources are then priced higher (lower), future actions are chosen to consume resources more conservatively (aggressively). As a result, using the update rule of the dual variables, we can show that our algorithms never deplete resources too early (Proposition 2). Our main result follows from combining these results.

We then discuss applications to online bidding in repeated auctions with budget constraints and to online matching with high entropy (Section 5).
As of 2019, around 85% of all display advertisements are bought programmatically, that is, using automated algorithms (eMarketer, 2019). A common mechanism used by advertisers to buy ad slots is real-time auctions: each time a user visits a website, an auction is run to determine the ad to be shown in the user's browser. Because there is a large number of these advertising opportunities in a given day, advertisers set budgets to control their cumulative expenditure. There is thus a need to develop data-driven algorithms to optimize advertisers' bids in repeated auctions with budgets. This problem has been studied recently by Balseiro and Gur (2017, 2019), who provide a dual sub-gradient descent algorithm that yields O(√T) regret. Our algorithms attain similar regret bounds with considerably weaker restrictions on the inputs: in particular, they assume that values and competing bids are independent and that the dual objective is thrice differentiable and strongly convex, while we require no such assumptions.

Online matching is another central problem in computer science, with applications in online advertisement allocation, job/server allocation in cloud computing, product recommendation under resource constraints, etc. It has been shown that a high-entropy proportional matching can lead to additional desirable properties, such as fairness and diversity (Lan et al., 2010; Venkatasubramanian, 2010; Qin and Zhu, 2013; Ahmed et al., 2017). We here study the online advertisement allocation problem, where at each time period the decision maker matches an incoming impression with one advertiser (who may have a capacity constraint), aiming to maximize the total revenue over all incoming impressions while keeping a high entropy of such matchings. Recently, Agrawal et al. (2018) studied a multi-round offline proportional matching algorithm for this problem setting. Our algorithm leads to a simple online counterpart to Agrawal et al. (2018) that yields similar regret/complexity bounds.
Dughmi et al. (2017) introduced a dual-based online algorithm for proportional matching with an exponential-based update. Their algorithm, however, requires an estimate of the value of the benchmark. When the value of the benchmark is not known, an estimate can be obtained by solving a convex optimization problem. Our algorithm, in comparison, does not require knowing the value of the benchmark nor solving convex optimization problems.

We conclude the paper by presenting numerical experiments on online proportional matching, which validate our results.

1.2 Related Work

Online allocation problems have a rich history in computer science and operations research. Online allocation problems with linear revenue functions have been studied extensively in the so-called random permutation model. In the random permutation model, an adversary first selects a sequence of requests, which are then presented to the decision maker in random order. This model is more general than our setting, in which requests are drawn independently and at random from an unknown distribution. Devanur and Hayes (2009) study online allocation problems in which revenues are proportional to the amount of resources consumed (this is referred to as the AdWords problem) and present a dual training algorithm with two phases: a training phase, in which data is used to estimate the dual variables by solving a linear program, and an exploitation phase, in which actions are taken using the estimated dual variables. Their algorithm can be shown to obtain regret of order O(T^{2/3}). Feldman et al. (2010) present similar training-based algorithms for more general linear online allocation problems with similar regret guarantees. Pushing these ideas one step further, Agrawal et al. (2014) consider an algorithm that dynamically updates the dual variables by periodically solving a linear program using all data collected so far. This more sophisticated algorithm improves upon previous work by obtaining regret of order O(T^{1/2}).
Compared to these papers, our algorithms work for general concave revenue functions, and for the linear case we obtain similar or better regret guarantees with simpler update rules that do not require solving large linear programs.

Devanur et al. (2019) study linear online allocation problems when requests are drawn independently and at random from an unknown distribution, and provide algorithms that achieve O(T^{1/2}) regret. A key feature of their algorithms is that they require knowledge or estimates of the value of the benchmark (which in their case is the optimal allocation under the expected instance). When the value of the benchmark is known, they provide a simple algorithm that, similarly to ours, does not require solving a linear program in each stage. Their algorithm also maintains dual variables for the resource constraints, which are updated using an exponential update. When the value of the benchmark is unknown, they provide an algorithm that estimates the value of the benchmark by working in phases of geometrically increasing length. This algorithm, however, requires solving a linear program in each phase to estimate the value of the benchmark.

Closest to ours is Agrawal and Devanur (2015), which studies general online allocation problems that allow for concave objectives and convex feasibility constraints. They present a general class of algorithms that maintain dual variables for the constraints, which are updated using any black-box online convex optimization algorithm. When the objective is non-linear, they present fast algorithms that do not require solving a convex program when an estimate of the value of the benchmark is known. When an estimate is not available, their algorithm requires periodically solving a convex optimization program to estimate the value of the benchmark. Additionally, they allow resource constraints to be violated; they show that constraints are violated by at most O(√T).
Because we require constraints to be satisfied for every realization, their algorithms are not feasible in our setting. When the objective is linear, they present an algorithm that updates dual variables using multiplicative weight updates and satisfies resource constraints for every realization. This algorithm, however, requires an estimate of the value of the benchmark that is obtained by solving a linear program once at the beginning of the horizon (Agrawal, 2019). Our paper extends their work by developing simple algorithms for concave objectives that do not require estimates of the value of the benchmark and satisfy constraints for every realization under a large class of update rules. As a matter of fact, a key contribution of our work is showing that, under a large class of reference functions, dual mirror descent does not deplete resources too early (Proposition 2).

Our algorithm is an online dual mirror descent algorithm. It has been known in the optimization literature that mirror descent naturally minimizes a primal-dual gap in both the deterministic and stochastic settings (Bach, 2015; Lu and Freund, 2018). However, the results therein do not apply directly to our setting because (i) as we will show later, our goal is to maximize the revenue in the online setting (i.e., (1)) rather than to maximize the natural primal objective (i.e., (19) in the appendix); and (ii) we do not allow violations of the budget constraints in our online setting, while in the offline setting satisfying these constraints is easy since we can always shrink variables after the fact.

Our algorithms attain regret of order O(√T), which is tight under our minimal assumptions on the input (Lemma 1). Jasin (2015) studies linear allocation problems and shows that it is possible to attain O(log T) regret when the expected instance is non-degenerate.
His algorithm periodically re-estimates the distribution of requests and computes a primal control by periodically solving a linear program with the re-estimated parameters. Li and Ye (2019) study linear allocation problems under the assumption that the distribution of requests is absolutely continuous with uniformly bounded densities. They present a dual algorithm that attains O(log T) regret; their algorithm updates dual variables by solving a dual linear program in each stage using all data collected so far. The assumptions of these two papers essentially impose that the dual objective is strongly convex at the optimal dual variables. In comparison, under our weaker assumptions, the dual objective cannot be guaranteed to be strongly convex, which leads to an Ω(√T) lower bound on regret. Similar distinctions arise in online convex optimization, where convexity vs. strong convexity of the primal objective functions determines whether Θ(√T) vs. Θ(log T) regret is attainable (see, e.g., Hazan et al. 2016).

There is also a stream of literature that studies online allocation problems with linear utility functions when the input is adversarial (Mehta et al., 2007a; Feldman et al., 2009). In this case, it is generally impossible to attain sublinear regret and, instead, the focus is on designing algorithms that obtain constant-factor approximations to the offline optimum solution.

Notation. We define R^n_+ := {x ∈ R^n | x ≥ 0} and R^n_{++} := {x ∈ R^n | x > 0}. We use [m] as shorthand for {1, . . . , m}, 1 denotes the all-one vector, and e_j is the j-th standard unit vector.

2 Problem Formulation

We consider the following generic online convex problem with resource constraints:

(O):  max_{x: x_t ∈ X} ∑_{t=1}^T f_t(x_t)  s.t.  ∑_{t=1}^T b_t x_t ≤ Tρ,  (1)

where x_t ∈ X ⊆ R^d is the decision variable at time t, f_t : R^d → R is the concave revenue function received at time t, b_t ∈ R^{m×d}_+ is the entry-wise non-negative cost matrix received at time t, and ρ ∈ R^m_{++}
is the positive resource constraint vector. In the online setting, at each time period 1 ≤ t ≤ T, we receive a request (f_t, b_t), and we use an algorithm A to make a real-time decision x_t based on the current request (f_t, b_t) and the previous history H_{t−1} := {f_s, b_s, x_s}_{s=1}^{t−1}, i.e.,

x_t = A(f_t, b_t | H_{t−1}).  (2)

Moreover, the constraints

∑_{s=1}^t b_s x_s ≤ ρT,  x_t ∈ X  (3)

must be satisfied for every t ≤ T. The above process generates total revenue ∑_{t=1}^T f_t(x_t) at the end of the T time periods, and our goal is to design an algorithm A that maximizes this revenue while satisfying constraint (3).

We assume the requests (f_t, b_t) are generated i.i.d. from an unknown distribution P ∈ I, i.e., (f_t, b_t) ∈ {(f_1, b_1), . . . , (f_n, b_n)} with probability P((f_t, b_t) = (f_i, b_i)) = p_i, where I denotes a family of distributions satisfying some regularity conditions (to be further discussed in Assumption 2). In particular, we define the expected revenue of an algorithm A over distribution P as

R(A | P) = E_P[ ∑_{t=1}^T f_t(x_t) ],

where x_t is computed by (2). The baseline we compare with is the expected revenue of the optimal solution in hindsight, which is also referred to as the offline problem in the computer science literature. This amounts to solving for the optimal allocation under full information of all requests and then taking expectations over all possible realizations:

OPT(P) = E_P[ max_{x_t ∈ X} ∑_{t=1}^T f_t(x_t)  s.t.  ∑_{t=1}^T b_t x_t ≤ Tρ ].  (4)

We further define the regret of algorithm A as

Regret(A | P) := OPT(P) − R(A | P),

and the worst-case regret of algorithm A over a family of distributions I as

Regret(A | I) := sup_{P ∈ I} {OPT(P) − R(A | P)}.
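As a concrete illustration of the protocol (2)-(3) and the hindsight benchmark, consider a toy linear instance (the instance, the greedy policy, and all names below are our own illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear instance (illustrative): X = [0, 1], one resource (m = 1),
# f_t(x) = r_t * x, cost b_t = c_t, total budget T * rho.
T, rho = 100, 0.3
r = rng.uniform(0.0, 1.0, size=T)   # revenue coefficients
c = rng.uniform(0.1, 1.0, size=T)   # resource costs

def run(policy):
    """Run an online policy as in (2)-(3): take the void action x_t = 0
    whenever the requested action would violate the remaining budget."""
    budget, revenue = T * rho, 0.0
    for t in range(T):
        x = policy(r[t], c[t])
        if c[t] * x > budget:       # infeasible -> void action
            x = 0.0
        budget -= c[t] * x
        revenue += r[t] * x
    return revenue

def hindsight_opt():
    """Optimal allocation with full information: with a single linear
    constraint this is a fractional knapsack, solved greedily by ratio."""
    budget, revenue = T * rho, 0.0
    for t in sorted(range(T), key=lambda t: -r[t] / c[t]):
        x = min(1.0, budget / c[t])
        revenue += r[t] * x
        budget -= c[t] * x
    return revenue

naive = run(lambda rt, ct: 1.0)     # accept every request until budget runs out
opt = hindsight_opt()
print(opt - naive)                  # per-sample regret of the naive policy
```

Any feasible online policy earns at most the hindsight optimum, so the printed difference is a per-sample regret.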
Since the probability distribution P is unknown to the decision maker, our goal is to design an algorithm A that works well for any distribution P ∈ I, namely, one with low worst-case regret Regret(A | I).

2.1 The Dual Problem to (1)

In this section, we provide an upper bound on OPT(P), which we call the offline dual problem to (1); moreover, this dual problem inspires our main algorithm (Algorithm 1 in Section 2.2) for solving (1). Such an upper bound in the linear case has been considered extensively in the literature (see, e.g., Talluri and van Ryzin 1998).

Define

f*_i(c) := max_{x ∈ X} {f_i(x) − cᵀx}  (5)

as the conjugate function of f_i(x) (restricted to X), and define D : R^m → R as

D(μ) := ∑_{i=1}^n p_i f*_i(b_iᵀμ) + μᵀρ.

Then D(μ) provides a valid upper bound on OPT(P):

Proposition 1.
It holds for any μ ≥ 0 that

OPT(P) ≤ T D(μ).  (6)

Furthermore, we call

(D):  min_{μ ≥ 0} D(μ) = ∑_{i=1}^n p_i f*_i(b_iᵀμ) + μᵀρ  (7)

the offline dual problem to (1).

2.2 Dual Mirror Descent Algorithm

Online mirror descent is a standard algorithm in online convex optimization (Hazan et al., 2016). In this section, we present the online mirror descent algorithm applied to the dual problem (7), while our goal is to obtain a good solution to the original primal problem (1).

To describe the mirror descent algorithm, first recall that the Bregman divergence with respect to a given convex reference function h(·) is defined as V_h(x, y) := h(x) − h(y) − ⟨∇h(y), x − y⟩. Algorithm 1 presents the main algorithm we study in this paper. At time t, we receive a request (f_t, b_t) and compute the optimal response x̃_t that maximizes an opportunity-cost-adjusted revenue of this request based on the current dual solution μ_t. We then take this action (i.e., x_t = x̃_t) if it does not exceed the remaining resources; otherwise we take the void action (i.e., x_t = 0). Notice that it follows from the definition of the conjugate function (5) that −b_t x̃_t ∈ ∂f*_t(b_tᵀμ_t). Thus g̃_t := −b_t x̃_t + ρ is an unbiased stochastic estimator of the gradient of the dual objective D(μ) at μ_t:

E_P[g̃_t] = E_P[−b_t x̃_t + ρ] ∈ ∑_{i=1}^n p_i ∂f*_i(b_iᵀμ_t) + ρ ⊆ ∂D(μ_t).

Algorithm 1: Dual Mirror Descent Algorithm for (1)
Input:
Initial dual solution μ_0, total number of time periods T, initial resources B_0 = Tρ, reference function h(·) : R^m → R, and step-size η.
for t = 0, . . . , T − 1 do
    Receive (f_t, b_t) ∼ P, i.e., P((f_t, b_t) = (f_i, b_i)) = p_i.
    Make the primal decision and update the remaining resources:
        x̃_t = argmax_{x ∈ X} {f_t(x) − μ_tᵀ b_t x},
        x_t = x̃_t if b_t x̃_t ≤ B_t, and x_t = 0 otherwise,
        B_{t+1} = B_t − b_t x_t.
    Obtain a stochastic sub-gradient of D(μ_t): g̃_t = −b_t x̃_t + ρ.
    Update the dual variable by mirror descent:
        μ_{t+1} = argmin_{μ ≥ 0} ⟨g̃_t, μ⟩ + (1/η) V_h(μ, μ_t).  (8)
end

We then utilize g̃_t to update the dual variable by performing an online mirror descent step (8) with step-size η.

Algorithm 1 only takes an initial dual variable and a step-size as inputs and is simple to implement. In most cases, the mirror descent step can be computed in linear time, as (8) admits a closed-form solution. For example, if the reference function is h(μ) = ∑_j μ_j log(μ_j), the dual update (8) becomes μ_{t+1} = μ_t ⊙ exp(−η g̃_t) (element-wise product), which recovers the online exponential weights algorithm for solving (7); if the reference function is h(μ) = (1/2)‖μ‖₂², the dual update (8) becomes μ_{t+1} = Proj_{μ ≥ 0}{μ_t − η g̃_t}, which recovers the online sub-gradient descent method for solving (7).¹

¹ More precisely, f*_i(c) is the conjugate function of f_i(x) + 𝟙{x ∈ X} under the standard definition of the conjugate function, where 𝟙{x ∈ X} is the indicator function of the constraint.

3 Regret Bound
In this section, we present the worst-case regret bound of Algorithm 1 for solving (1). First we statethe assumptions required in our analysis.
Assumption 1 (Assumptions on constraint set X). We assume that: (i) X is a convex and bounded set in R^d_+, and (ii) 0 ∈ X.

The above assumption implies that we can only take non-negative actions. Moreover, we can always take the void action x_t = 0 in order to make sure we do not exceed the resource constraints. This guarantees the existence of a feasible solution.

Assumption 2 (Assumptions on distribution family I). For any P ∈ I, it holds that:
1. P has finite support: S(P) := {(f_1, b_1), . . . , (f_n, b_n)}.
2. For any (f_i, b_i) ∈ S(P), it holds that (i) f_i(x) ≥ 0 for all x ∈ X; (ii) f_i(0) = 0; (iii) b_i ≥ 0; and (iv) f_i(x) is a concave function on X.
3. There exists f̄ ∈ R_{++} such that f_i(x) ≤ f̄ for any x ∈ X and (f_i, b_i) ∈ S(P).
4. There exists b̄ ∈ R_{++} such that ‖b_i x‖_∞ ≤ b̄ for any x ∈ X and (f_i, b_i) ∈ S(P).

We herein assume P has finite support for simplicity of the argument; we do not see any particular reason that prevents our results from holding in the infinite-support case. The upper bounds f̄ and b̄ impose regularity on the probability class I, and they will appear in the regret bound. The assumption b_i ≥ 0 implies that we cannot replenish resources once they are consumed. The assumption f_i(0) = 0 is without loss of generality, since we can always subtract a constant from the function f_i(x).

Assumption 3 (Assumptions on resource parameter ρ). We assume there exist ρ̄, ρ ∈ R_{++} such that for any j ∈ [m], ρ ≤ ρ_j ≤ ρ̄.

Remark 1.
Without loss of generality, we can assume ρ_j = 1 for any j ∈ [m] by rescaling the j-th row of b_i. This may lead to a slightly more favorable regret bound, but we herein keep ρ for generality.

Definition 1.
We define μ^max ∈ R^m such that μ^max_j := f̄/ρ_j + 1. As we will show later in Proposition 2, as long as 0 ≤ μ_0 ≤ μ^max, the dual variable obtained by Algorithm 1 satisfies μ_t ≤ μ^max at any time t. In other words, the iterates μ_t of Algorithm 1 always stay in the domain D := {μ ∈ R^m | 0 ≤ μ ≤ μ^max}.

Assumption 4 (Assumptions on reference function h(·)). We assume:
1. h(μ) is coordinate-wise separable, i.e., h(μ) = ∑_{j=1}^m h_j(μ_j), where each h_j(·) is a convex univariate function.
2. h(μ) is σ₁-strongly convex in the ℓ₁-norm on D, i.e., h(μ₁) ≥ h(μ₂) + ⟨∇h(μ₂), μ₁ − μ₂⟩ + (σ₁/2)‖μ₁ − μ₂‖₁² for any μ₁, μ₂ ∈ D.
3. h(μ) is σ₂-strongly convex in the ℓ₂-norm on D, i.e., h(μ₁) ≥ h(μ₂) + ⟨∇h(μ₂), μ₁ − μ₂⟩ + (σ₂/2)‖μ₁ − μ₂‖₂² for any μ₁, μ₂ ∈ D.

Strong convexity of the reference function is a standard assumption in the analysis of mirror descent algorithms (Bubeck, 2015). Indeed, strong convexity in the ℓ₁-norm and in the ℓ₂-norm are equivalent (up to a dimension-dependent constant); we here assume strong convexity in both norms in order to obtain a tighter regret bound. If h(·) is not a coordinate-wise separable function, the sub-problem (8) can be hard to solve. Furthermore, most examples in the mirror descent literature utilize coordinate-wise separable reference functions (Nemirovsky and Yudin, 1983; Beck and Teboulle, 2003; Bubeck, 2015; Lu et al., 2018; Lu, 2017).

The next theorem presents the worst-case regret bound of Algorithm 1.
Theorem 1.
Consider Algorithm 1 with step-size η ≤ σ₁/b̄ and initial solution 0 ≤ μ_0 ≤ μ^max, and suppose Assumptions 1-4 are satisfied. Then it holds for any T ≥ 1 that

Regret(A | I) ≤ ((b̄ + ρ̄)²/(2σ₂)) ηT + V_h(0, μ_0)/η + (f̄/(ρη)) ‖∇h(μ^max) − ∇h(μ_0)‖_∞ + f̄b̄/ρ.  (9)

When choosing η = O(1/√T), we obtain Regret(A | I) ≤ O(√T) when T is sufficiently large, and therefore our algorithm yields sublinear regret.

Remark 2.
In this remark, we assume ρ̄ = ρ = 1 (this is without loss of generality, as mentioned in Remark 1). We consider two special cases of Theorem 1 when T is sufficiently large:

1. Suppose h(μ) = (1/2)‖μ‖₂² and μ_0 = 0. Then Algorithm 1 recovers dual online sub-gradient descent, and with a proper step-size η we obtain

Regret(A | I) ≤ √(m f̄) (b̄ + 1) √T + f̄b̄.
2. Suppose h(μ) = ∑_{j=1}^m μ_j log μ_j and μ_0 = e^{−1}·1. Then Algorithm 1 recovers the dual multiplicative update algorithm, and with a proper step-size η we obtain

Regret(A | I) ≤ f̄b̄ + 2√( m (b̄ + 1) (f̄ + 1) (f̄ (log(f̄ + 1) + 1) + m e^{−1}) ) √T.
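The two special cases of Remark 2 can be sketched in a few lines for linear revenues on X = [0, 1] (a rough sketch under assumed parameters; `dual_mirror_descent` and the synthetic instance are our own, not the authors' code):

```python
import numpy as np

def dual_mirror_descent(requests, rho, T, eta, mu0, update="subgradient"):
    """Sketch of Algorithm 1 for linear revenues f_t(x) = r_t * x on X = [0, 1]
    with m resources. `update` selects the reference function of Remark 2:
      "subgradient" -> h(mu) = ||mu||^2 / 2, projected sub-gradient step;
      "exponential" -> entropy reference, multiplicative-weights step."""
    mu = np.array(mu0, dtype=float)
    rho = np.asarray(rho, dtype=float)
    B = T * rho                                 # remaining resources, B_0 = T rho
    revenue = 0.0
    for r_t, b_t in requests:
        # Best response to the cost-adjusted revenue: max_{x in [0,1]} (r_t - mu'b_t) x.
        x_tilde = 1.0 if r_t - mu @ b_t > 0.0 else 0.0
        x_t = x_tilde if np.all(b_t * x_tilde <= B) else 0.0  # void action if infeasible
        B -= b_t * x_t
        revenue += r_t * x_t
        g = -b_t * x_tilde + rho                # stochastic sub-gradient of D at mu_t
        if update == "exponential":
            mu = mu * np.exp(-eta * g)          # closed-form mirror step (8)
        else:
            mu = np.maximum(mu - eta * g, 0.0)  # projected sub-gradient step (8)
    return revenue, mu

# Usage on a synthetic linear instance with one resource.
rng = np.random.default_rng(1)
T = 1000
reqs = [(rng.uniform(), rng.uniform(0.1, 1.0, size=1)) for _ in range(T)]
rev_sg, _ = dual_mirror_descent(reqs, [0.3], T, eta=1 / np.sqrt(T), mu0=[0.0])
rev_ew, _ = dual_mirror_descent(reqs, [0.3], T, eta=1 / np.sqrt(T),
                                mu0=[np.exp(-1)], update="exponential")
```

Both variants enforce feasibility by falling back to the void action, mirroring Algorithm 1; only the mirror step (8) differs.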
We next discuss the tightness of our regret bound. The following result, which we reproduce without proof, shows that one cannot hope to attain regret lower than Ω(√T) under our modeling assumptions.

Lemma 1 (Lemma 1 from Arlotto and Gurvich 2019). For every T ≥ 1, there exists a probability distribution P such that inf_A Regret(A | P) ≥ C√T, where C is a constant independent of T.

In other words, for every horizon T, there exists a probability distribution under which all algorithms (even those that know the probability distribution) incur Ω(√T) regret. The worst-case distribution used in the proof of the result assigns mass to three points, with one point having mass of order 1/√T. Because the regret bound of Algorithm 1 provided in Theorem 1 does not depend on the probability mass function of the distribution P, it readily follows that our algorithm also attains O(√T) regret in such worst-case instances. This implies that our algorithm attains the optimal order of regret when the length of the horizon and the initial number of resources are scaled proportionally.

There are two major steps in the proof of Theorem 1. We need to show that: (i) Algorithm 1 does not deplete resources too early (Proposition 2); and (ii) before running out of resources, the average cumulative revenue is close to a dual objective value (Proposition 3), which provides an upper bound on OPT(P). We next present these two major steps.

Step 1 (Lower bound on the stopping time):
First, we define the stopping time of Algorithm 1:
Definition 2.
We define the stopping time τ_A of Algorithm 1 as the first time less than T at which there exists a resource j such that

∑_{t=1}^{τ_A} (b_t)_jᵀ x_t + b̄ ≥ ρ_j T.
Notice that τ_A is a random variable; moreover, the resource constraints are not violated before the stopping time τ_A. The next proposition shows that the stopping time τ_A is close to the end of the horizon T.

Proposition 2.
Consider Algorithm 1 with step-size η ≤ σ₁/b̄. Then it holds that μ_t ≤ μ^max for any t ≤ T. Furthermore, it holds with probability 1 that

T − τ_A ≤ (1/(ηρ)) ‖∇h(μ^max) − ∇h(μ_0)‖_∞ + b̄/ρ.  (10)

Step 2 (Primal-dual bound on the cumulative revenue):
We here study the primal-dual gap up to the stopping time τ_A. Notice that before the stopping time τ_A, Algorithm 1 performs standard mirror descent steps on the dual function.

Let the random variable γ_t denote the type of request in time period t, i.e., γ_t determines the (stochastic) sample i in the t-th iteration of Algorithm 1. Then μ_{t+1} is a random variable that depends on all previous values γ_1, . . . , γ_t, and we denote this string of random variables by ξ_t = {γ_1, . . . , γ_t}.

The next proposition presents a primal-dual bound on the cumulative revenue of Algorithm 1 before the stopping time τ_A.

Proposition 3.
Consider Algorithm 1 with a given step-size η under Assumptions 1-4. Let τ_A be the stopping time defined in Definition 2, and denote μ̄_{τ_A} = (1/τ_A) ∑_{t=1}^{τ_A} μ_t. Then the following inequality holds:

E_P[ τ_A D(μ̄_{τ_A}) − ∑_{t=1}^{τ_A} f_t(x_t) ] ≤ ((b̄ + ρ̄)²/(2σ₂)) η E_P[τ_A] + V_h(0, μ_0)/η.

Combining the above two steps with Proposition 1, we can show that the cumulative revenue up to the stopping time is not far from the optimal revenue of the offline problem (1). We present the proofs of Proposition 2, Proposition 3, and Theorem 1 in Appendix A.2, Appendix A.3, and Appendix A.4, respectively.
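As a numerical sanity check of the weak-duality bound in Proposition 1, one can evaluate D(μ) on a grid for a toy linear instance and compare it against the (per-period) primal optimum. The instance and helper names below are illustrative assumptions on our part:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear instance (ours, illustrative): f_i(x) = r_i * x on X = [0, 1],
# one resource with rate rho, uniform distribution P over n request types.
n, rho = 50, 0.3
r = rng.uniform(0.0, 1.0, size=n)
c = rng.uniform(0.1, 1.0, size=n)
p = np.full(n, 1.0 / n)

def f_star(i, cost):
    """Conjugate (5) restricted to X = [0, 1]: max_{x in [0,1]} (r_i - cost) x."""
    return max(r[i] - cost, 0.0)

def D(mu):
    """Dual function: sum_i p_i f*_i(b_i' mu) + mu' rho."""
    return sum(p[i] * f_star(i, c[i] * mu) for i in range(n)) + mu * rho

def primal_opt():
    """Per-period optimum of the expected instance:
    max sum_i p_i r_i x_i  s.t.  sum_i p_i c_i x_i <= rho, x_i in [0, 1]
    (a fractional knapsack, solved exactly by the greedy ratio rule)."""
    budget, val = rho, 0.0
    for i in sorted(range(n), key=lambda i: -r[i] / c[i]):
        x = min(1.0, budget / (p[i] * c[i]))
        val += p[i] * r[i] * x
        budget -= p[i] * c[i] * x
    return val

opt = primal_opt()
duals = [D(mu) for mu in np.linspace(0.0, 5.0, 2001)]
print(min(duals) - opt)   # nonnegative by weak duality (6)
```

Since (7) is the Lagrangian dual of the expected linear program, every grid value of D(μ) dominates the primal optimum, matching (6).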
5 Applications

In this section, we discuss applications of Algorithm 1 to bidding in repeated auctions with budgets and to online matching with high entropy.
5.1 Bidding in Repeated Auctions with Budgets

Most online advertisements are sold using auctions in which advertisers bid based on viewer-specific information. Typically, advertisers participate in a large number of auctions on a given day, and they set budgets to control their cumulative expenditure throughout the day. We discuss how to apply our methods to the problem of bidding in repeated auctions with budgets.

We consider an advertiser with a budget ρT that limits the cumulative expenditure over T auctions. Each request corresponds to an auction in which an impression becomes available for sale. When the t-th impression arrives, the advertiser first learns a value v_t for winning the impression based on viewer-specific information and then determines a bid w_t to submit to the auction. We assume that impressions are sold using a second-price auction. Denoting by b_t the highest bid submitted by competitors, the advertiser wins whenever his bid is the highest (i.e., w_t ≥ b_t) and pays the second-highest bid in case of winning (i.e., b_t 𝟙{w_t ≥ b_t}). To simplify the exposition, we assume that ties are broken in favor of the advertiser. At the point of bidding, the advertiser does not know the highest competing bid. Consistent with practice, we assume that the advertiser only observes his payment in case of winning. Values and competing bids are drawn i.i.d. from an unknown, discrete distribution.

The problem of bidding in repeated auctions with budgets has been studied recently by Balseiro and Gur (2017, 2019), who present an adaptive pacing strategy that attempts to learn an optimal Lagrange multiplier using sub-gradient descent. Their adaptive pacing strategy is shown to attain O(√T) regret under restrictive assumptions on the distribution of inputs: specifically, they assume that values and competing bids are independent, and that D(μ) is thrice differentiable and strongly convex. In practice, however, values and competing bids are positively correlated.
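Such dual pacing can be sketched as follows, using the sub-gradient special case of the mirror step (a hypothetical simulation with synthetic values and competing bids; `bid_pacing` and its parameters are our own illustration, not the strategy of Balseiro and Gur):

```python
import numpy as np

def bid_pacing(values, comp_bids, rho, eta, mu0=0.0):
    """Rough sketch of dual pacing for second-price auctions with budget
    T * rho. Only the realized payment q_t is observed by the bidder."""
    T = len(values)
    budget, mu, utility = T * rho, mu0, 0.0
    for v, b in zip(values, comp_bids):
        w = min(v / (1.0 + mu), budget)        # paced bid, capped by remaining budget
        q = b if w >= b else 0.0               # second-price payment upon winning
        budget -= q
        utility += (v - b) if w >= b else 0.0  # realized net utility
        g = -q + rho                           # stochastic dual sub-gradient
        mu = max(mu - eta * g, 0.0)            # projected sub-gradient step
    return utility, budget

rng = np.random.default_rng(3)
T = 2000
vals = rng.uniform(0.0, 1.0, size=T)
bids = rng.uniform(0.0, 1.0, size=T)
util, left = bid_pacing(vals, bids, rho=0.1, eta=1 / np.sqrt(T))
```

The bid is capped by the remaining budget, so the budget constraint holds on every sample path; only the realized payment enters the dual update.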
Our algorithms attain similar regret bounds without such restrictive assumptions on the inputs.

With the benefit of hindsight, a decision maker can win an auction by bidding an amount equal to the highest competing bid (i.e., $w_t = b_t$). Therefore, the optimal solution in hindsight reduces

Algorithm 2: Online Dual Mirror Descent Algorithm for Bidding in Repeated Auctions
Input: Initial dual solution $\mu_0$, reference function $h(\cdot): \mathbb{R} \to \mathbb{R}$, step-size $\eta$.
for $t = 0, \ldots, T-1$ do
  Receive an impression with value $v_t$.
  Bid $w_t = \min\{\tilde w_t, B_t\}$ where $\tilde w_t = v_t/(1+\mu_t)$ and $B_t$ is the remaining budget.
  Observe the payment $q_t = b_t \mathbb{1}\{w_t \ge b_t\}$.
  Obtain a stochastic dual sub-gradient $\tilde g_t = -q_t + \rho$.
  Update the dual variable using mirror descent: $\mu_{t+1} = \arg\min_{\mu \ge 0} \langle \tilde g_t, \mu \rangle + \frac{1}{\eta} V_h(\mu, \mu_t)$.
end

to solving a knapsack problem in which the impressions to be won are chosen to maximize the net utility subject to the budget constraint. The problem is given by:
$$\max_{x_t \in \{0,1\}} \ \sum_{t=1}^T (v_t - b_t) x_t \quad \text{s.t.} \ \sum_{t=1}^T b_t x_t \le T\rho,$$
where $x_t \in \{0,1\}$ is a decision variable indicating whether the advertiser wins the $t$-th impression.

Note that the informational assumptions are different from the ones of our baseline model because the competing bid $b_t$ is not assumed to be known at the point of bidding. Interestingly, because ads are sold using an ex-post incentive-compatible auction, such information is not necessary for our algorithm: the algorithm only needs to know the payment incurred. As a matter of fact, our analysis applies to any other ex-post incentive-compatible auction.

This problem can be mapped to our framework by setting $f_t(x) = (v_t - b_t)x$. Denoting by $\mu_t \ge 0$ the dual multiplier of the budget constraint, the primal decision in Algorithm 1 is given by
$$\tilde x_t = \arg\max_{x \in \{0,1\}} \{ f_t(x) - \mu_t b_t x \} = \mathbb{1}\{ v_t \ge (1+\mu_t) b_t \}.$$
This decision can be implemented by bidding $w_t = v_t/(1+\mu_t)$ without knowing the maximum competing bid. We present the formal algorithm in Algorithm 2. Theorem 1 readily implies that choosing $\eta \sim 1/\sqrt{T}$ yields a regret of $O(\sqrt{T})$.

5.2 Proportional Matching with High Entropy

We consider an online matching problem using the terminology from online advertising. Suppose there are $n$ different impressions and $m$ advertisers.
At time period $t$, an impression with revenue vector $r_t \in \mathbb{R}^m$ arrives, i.e., if we allocate it to advertiser $j \in [m]$, then it generates revenue $(r_t)_j$. In the online setting, the impressions arrive sequentially. For each time period $t$, we decide an assignment probability variable $x_t \in \mathcal{X} := \{ x \in \mathbb{R}^m_+ \mid \sum_{i=1}^m x_i \le 1 \}$, and assign the arriving impression to advertiser $j$ with probability $(x_t)_j$. Notice that with probability $1 - \sum_{j=1}^m (x_t)_j$ the impression is not assigned to any advertiser, and in practice, such impressions will go to other traffic. Suppose there are in total $T$ time periods, and we assume the capacity of the $j$-th advertiser is $\rho_j T$. Define
$$H(x) := -\sum_{j=1}^m x_j \log(x_j) - \Big(1 - \sum_{j=1}^m x_j\Big) \log\Big(1 - \sum_{j=1}^m x_j\Big)$$
to be the entropy function of the assignment probability $x$.

We herein study high-entropy fractional matching, where the goal is to find a fractional matching $\{x_t\}_t$ that maximizes revenue with an entropy regularizer. The hindsight problem is:
$$\max_{x_t \in \mathcal{X}} \ \sum_{t=1}^T r_t^\top x_t + \lambda H(x_t) \quad \text{s.t.} \ \sum_{t=1}^T x_t \le T\rho, \qquad (11)$$
where $\lambda$ is the parameter of the entropy regularizer; the random realization $v_t$ of the assignment is defined by (12) below.

A matching with high entropy has been shown to possess many additional desirable properties, for example, higher fairness and higher diversity (Lan et al., 2010; Venkatasubramanian, 2010; Qin and Zhu, 2013; Ahmed et al., 2017). Recently, Agrawal et al. (2018) designed a multi-round offline proportional allocation algorithm for solving (11). Our algorithm, in contrast, is a simpler online algorithm and does not need to be run over multiple rounds. A major difference is that their capacity constraints can be violated during the run because they allow rescaling the variables at the end of the run to satisfy the capacity constraints, while we do not allow such violations in our online setting. Dughmi et al.
(2017) introduced a dual-based online algorithm for proportional matching with an exponential-based update. Their algorithm, however, requires an estimate of the value of the benchmark. When the value of the benchmark is not known, an estimate can be obtained by solving a convex optimization problem. Our algorithm, in comparison, does not require knowing the value of the benchmark nor solving convex optimization problems. Refer to Agrawal et al. (2018) for a more detailed literature review on the background of this problem.

Algorithm 3 is a variant of Algorithm 1 for the above proportional matching problem (11), with $f_t(x) = r_t^\top x + \lambda H(x)$ and $b_t = I$. The only difference is that in the constraints we need to take into account the actual realization of the probabilistic matching. Define the random variable
$$v_t = \begin{cases} e_j & \text{w.p. } x_j \\ 0 & \text{w.p. } 1 - \sum_{j=1}^m x_j, \end{cases} \qquad (12)$$

Algorithm 3: Online Dual Mirror Descent Algorithm for Proportional Matching Problems with High Entropy
Input: Initial dual solution $\mu_0$, and step-size $\eta$.
for $t = 0, \ldots, T-1$ do
  Receive an impression with revenue vector $r_t$, and regularized revenue function $f_t(x) = r_t^\top x + \lambda H(x)$.
  Decide the assignment probability and update the remaining capacity:
  $$x_t = \arg\max_{x \in \mathcal{X}} \{ f_t(x) - \mu_t^\top x \}, \quad \text{or equivalently} \quad (x_t)_j = \frac{\exp((r_t(j) - \mu_t(j))/\lambda)}{\sum_{l=1}^m \exp((r_t(l) - \mu_t(l))/\lambda) + 1}.$$
  Make the allocation decision: $v_t$ is set based on $x_t$ by (12) if it does not exceed the capacity constraint; otherwise set $v_t = 0$.
  Obtain a stochastic dual sub-gradient: $\tilde g_t = -x_t + \rho$.
  Update the dual variable using mirror descent: $\mu_{t+1} = \arg\min_{\mu \ge 0} \langle \tilde g_t, \mu \rangle + \frac{1}{\eta} V_h(\mu, \mu_t)$.
end

Figure 1: Regret versus horizon $T$ for the numerical experiments on proportional matching with high entropy.

where $e_j \in \mathbb{R}^m$ is the $j$-th standard unit vector in $\mathbb{R}^m$. Then $v_t$ characterizes the realized assignment of the impression at time $t$. In the online problem, the constraint of (11) is stated in terms of $v_t$, i.e., the random realization of the decision variable $x_t$. Let $\zeta$ denote the random variable that determines the realization in the above process; then the results in Theorem 1 still hold after taking the expectation over $\zeta$ on the left-hand side of (9):

Proposition 4.
Consider Algorithm 3 with step-size $\eta \sim O(1/\sqrt{T})$ and initial solution $\mu_0 = 0$ for solving (11). Then it holds that
$$\mathrm{Regret}(A \mid \mathcal{I}) := \sup_{P \in \mathcal{I}} \{ \mathrm{OPT}(P) - \mathbb{E}_\zeta R(A \mid P) \} \le O\big(\sqrt{T}\big).$$

Here, we present a numerical experiment on proportional matching with high entropy (Section 5.2) to verify our theoretical results.
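Before turning to the experiment, the per-step computations of Algorithm 3 can be sketched as follows. This is a minimal sketch in which all names are ours; the dual update shown uses a Euclidean reference function (the choice used in our experiments), under which the mirror step is a projected sub-gradient step:

```python
import math
import random

def assignment_probs(r, mu, lam):
    """Closed-form maximizer of f_t(x) - mu^T x with f_t(x) = r^T x + lam*H(x):
    a softmax over advertisers, with an extra unit weight in the denominator
    for the option of leaving the impression unassigned."""
    weights = [math.exp((rj - mj) / lam) for rj, mj in zip(r, mu)]
    z = sum(weights) + 1.0
    return [w / z for w in weights]

def realize_assignment(x, rng):
    """Sample v_t as in (12): advertiser j w.p. x_j, unassigned otherwise."""
    u, acc = rng.random(), 0.0
    for j, xj in enumerate(x):
        acc += xj
        if u < acc:
            return j
    return None  # impression goes to other traffic

def dual_step(mu, x, rho, eta):
    """Mirror-descent step on the dual with a Euclidean reference function,
    i.e. a projected sub-gradient step with gradient g_t = -x_t + rho."""
    return [max(0.0, m - eta * (-xj + rj)) for m, xj, rj in zip(mu, x, rho)]
```

Note that the dual step uses the assignment probabilities $x_t$, not the realized assignment $v_t$, which is why $\tilde g_t$ does not depend on the realization $\zeta$ in the proof of Proposition 4.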
Data generation:
We use the dataset introduced by Balseiro et al. (2014). They consider the problem faced by a publisher who has to deliver impressions to advertisers so as to maximize click-through rates. (They consider the secondary objective of maximizing revenue from a spot market, which we do not take into account for these experiments.) We incorporate the entropy regularizer $H(x)$ into the objective with parameter $\lambda = 0.$ In each problem instance there are $m$ advertisers; advertiser $j$ can be assigned at most $\rho_j T$ impressions. The revenue vector $r_t$ gives the expected click-through rate of assigning the impression to each advertiser. In their paper, they parametrically estimate click-through rates using mixtures of log-normal distributions. Because they do not report the actual data used to estimate their model, we instead take their estimated model as a generative model and sample impressions from the distributions provided in their paper. We generated , samples for each publisher, and we present results for publisher 2 from their dataset, which has 12 advertisers.

Random trials:
There are two layers of randomness in Algorithm 3: randomness coming from the data (i.e., $P$), and randomness coming from the proportional matching (i.e., $\zeta$). In the numerical experiments, we first obtain random datasets of size $T$ chosen uniformly at random from the , samples (for the first layer of randomness), and for each dataset, we run Algorithm 3 times (for the second layer of randomness). In total, we run random trials, and report the average regret with . confidence interval in Figure 1.

Regret computation:
For each random trial with a given horizon $T$, we compute the cumulative revenue until the stopping time $\tau_A$ of the algorithm, i.e., the first time that the remaining capacity of one advertiser is strictly less than $\bar b$. We then compute the average cumulative revenue over the trials as our estimate of the expected revenue of Algorithm 3, i.e., $R(A \mid P)$. We compute OPT by solving the offline problem (1) with , samples. We report the following regret: $\mathrm{Regret}(A \mid P) = T/ \times \mathrm{OPT} - R(A \mid P)$ and its . confidence interval for each value of $T$.

Implementation Details
For each value of $T$, we plot the average regret and its . confidence interval over the random trials. For all experiments, we start from $\mu_0 = 0$, utilize $h(x) = \frac{1}{2}\|x\|_2^2$ as our reference function (thus the algorithm is dual sub-gradient descent), and choose $\eta = 1/\sqrt{T}$ as the step-size.

Observation
From Figure 1, we can clearly see that the regret grows at the rate of $\sqrt{T}$, which verifies the results in Theorem 1.

In this paper, we present a class of simple and efficient algorithms for online allocation problems with concave revenue functions. We show that our algorithms attain $O(\sqrt{T})$ regret, which matches the lower bound. Numerical experiments validate our results. Interesting future research directions are to explore whether better regret bounds can be obtained under more restrictive assumptions on the inputs and to study the performance of online dual mirror descent on more general inputs (e.g., non-stationary or adversarial inputs).

References
Shipra Agrawal. Private communications. 2019.
Shipra Agrawal and Nikhil R. Devanur. Fast algorithms for online stochastic convex programming. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '15, pages 1405–1424. Society for Industrial and Applied Mathematics, 2015.
Shipra Agrawal, Zizhuo Wang, and Yinyu Ye. A dynamic near-optimal algorithm for online linear programming. Operations Research, 62(4):876–890, 2014.
Shipra Agrawal, Morteza Zadimoghaddam, and Vahab Mirrokni. Proportional allocation: Simple, distributed, and diverse matching with high entropy. In International Conference on Machine Learning, pages 99–108, 2018.
Faez Ahmed, John P. Dickerson, and Mark Fuge. Diverse weighted bipartite b-matching. arXiv preprint arXiv:1702.07134, 2017.
Alessandro Arlotto and Itai Gurvich. Uniformly bounded regret in the multisecretary problem. Stochastic Systems, 9(3):231–260, 2019.
Francis Bach. Duality between subgradient and conditional gradient methods. SIAM Journal on Optimization, 25(1):115–129, 2015.
Santiago R. Balseiro and Yonatan Gur. Learning in repeated auctions with budgets: Regret minimization and equilibrium. In Proceedings of the 2017 ACM Conference on Economics and Computation, EC '17, page 609. Association for Computing Machinery, 2017.
Santiago R. Balseiro and Yonatan Gur. Learning in repeated auctions with budgets: Regret minimization and equilibrium. Management Science, 65(9):3952–3968, 2019.
Santiago R. Balseiro, Jon Feldman, Vahab Mirrokni, and S. Muthukrishnan. Yield optimization of display advertising with ad exchange. Management Science, 60(12):2886–2907, 2014.
A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
Gong Chen and Marc Teboulle. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization, 3(3):538–543, 1993.
Nikhil R. Devanur and Thomas P. Hayes. The adwords problem: Online keyword matching with budgeted bidders under random permutations. In Proceedings of the 10th ACM Conference on Electronic Commerce, EC '09, pages 71–78. ACM, 2009.
Nikhil R. Devanur, Kamal Jain, Balasubramanian Sivan, and Christopher A. Wilkens. Near optimal online algorithms and fast approximation algorithms for resource allocation problems. J. ACM, 66(1), 2019.
Shaddin Dughmi, Jason D. Hartline, Robert Kleinberg, and Rad Niazadeh. Bernoulli factories and black-box reductions in mechanism design. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC '17, 2017.
Jon Feldman, Nitish Korula, Vahab Mirrokni, S. Muthukrishnan, and Martin Pál. Online ad assignment with free disposal. In Proceedings of the 5th International Workshop on Internet and Network Economics, WINE '09, pages 374–385. Springer-Verlag, 2009.
Jon Feldman, Monika Henzinger, Nitish Korula, Vahab S. Mirrokni, and Cliff Stein. Online stochastic packing applied to display ad allocation. In Proceedings of the 18th Annual European Conference on Algorithms: Part I, ESA '10, pages 182–194. Springer-Verlag, 2010.
Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.
Stefanus Jasin. Performance of an LP-based control for revenue management with unknown demand parameters. Operations Research, 63(4):909–915, 2015.
Tian Lan, David Kao, Mung Chiang, and Ashutosh Sabharwal. An Axiomatic Theory of Fairness in Network Resource Allocation. IEEE, 2010.
Xiaocheng Li and Yinyu Ye. Online linear programming: Dual convergence, new algorithms, and regret bounds. 2019.
Haihao Lu. "Relative-continuity" for non-Lipschitz non-smooth convex optimization using stochastic (or deterministic) mirror descent. arXiv preprint arXiv:1710.04718, 2017.
Haihao Lu and Robert M. Freund. Generalized stochastic Frank-Wolfe algorithm with stochastic "substitute" gradient for structured convex optimization. arXiv preprint arXiv:1807.07680, 2018.
Haihao Lu, Robert Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.
Aranyak Mehta, Amin Saberi, Umesh Vazirani, and Vijay Vazirani. Adwords and generalized online matching. J. ACM, 54:22:1–22:19, October 2007a.
Aranyak Mehta, Amin Saberi, Umesh Vazirani, and Vijay Vazirani. Adwords and generalized online matching. J. ACM, 54(5):22–es, October 2007b.
A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, 1983.
Lijing Qin and Xiaoyan Zhu. Promoting diversity in recommendation by entropy regularizer. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.
Kalyan Talluri and Garrett van Ryzin. An analysis of bid-price controls for network revenue management. Management Science, 44(11):1577–1593, 1998.
Kalyan T. Talluri and Garrett J. van Ryzin. The Theory and Practice of Revenue Management. International Series in Operations Research & Management Science, Vol. 68. Springer, 2004.
Venkat Venkatasubramanian. Fairness is an emergent self-organized property of the free market for labor. Entropy, 12(6):1514–1531, 2010.

A Proofs in the Paper
A.1 Proof of Proposition 1
Notice that for any $\mu \ge 0$, it holds that
$$\begin{aligned}
\mathrm{OPT}(P) &= \mathbb{E}_P\left[ \max_{x_t \in \mathcal{X}} \sum_{t=1}^T f_t(x_t) \ \text{s.t.} \ \sum_{t=1}^T b_t x_t \le T\rho \right] \\
&\le \mathbb{E}_P\left[ \max_{x_t \in \mathcal{X}} \sum_{t=1}^T f_t(x_t) + T \mu^\top \rho - \mu^\top \sum_{t=1}^T b_t x_t \right] \\
&= T\, \mathbb{E}_P\left[ \max_{x \in \mathcal{X}} f(x) - \mu^\top b x + \mu^\top \rho \right] \\
&= T \left( \sum_{i=1}^n p_i \max_{x \in \mathcal{X}} \{ f_i(x) - \mu^\top b_i x \} + \mu^\top \rho \right) \\
&= T \left( \sum_{i=1}^n p_i f_i^*(b_i^\top \mu) + \mu^\top \rho \right), \qquad (13)
\end{aligned}$$
where the first inequality is because of the feasibility of $x$ and $\mu \ge 0$, and the last equality is due to the definition of $f_i^*$. This finishes the proof.

A.2 Proof of Proposition 2
The key step in the proof of Proposition 2 is the following lemma, which shows that the dual update (8) never exceeds the upper bound $\mu^{\max}$ when the step-size $\eta$ is small enough.

Lemma 2. Let $\tilde g = b \nabla f^*(b^\top \mu) + \rho$ with $(b, f) \in \{(b_1, f_1), \ldots, (b_n, f_n)\}$, and $\mu^+ = \arg\min_{\hat\mu \ge 0} \langle \tilde g, \hat\mu \rangle + \frac{1}{\eta} V_h(\hat\mu, \mu)$. Suppose $\mu \le \mu^{\max}$ and $\eta \le \sigma/\bar b$; then it holds that $\mu^+ \le \mu^{\max}$.

Proof. Denote $J := \{ j \mid \mu^+_j > 0 \}$; then we just need to show $\mu^+_j \le \mu^{\max}_j$ for any $j \in J$. Following the update rule (8), it holds for any $j \in J$ that
$$\dot h_j(\mu^+_j) = \dot h_j(\mu_j) - \eta \tilde g_j = \dot h_j(\mu_j) - \eta (b)^\top_j \nabla f^*(b^\top \mu) - \eta \rho_j. \qquad (14)$$
Define $h^*_j(c) = \max_{\mu_j} \{ c \mu_j - h_j(\mu_j) \}$ as the conjugate function of $h_j(\mu_j)$; then by the properties of conjugate functions, $h^*_j(\cdot)$ is a $\frac{1}{\sigma}$-smooth univariate convex function. Furthermore, $\dot h^*_j(\cdot)$ is increasing, and $\dot h^*_j(\dot h_j(\mu_j)) = \mu_j$.

Now define $\tilde x := \arg\max_{x \in \mathcal{X}} \{ f(x) - \mu^\top b x \} = -\nabla f^*(b^\top \mu)$. Then it holds that $f(0) \le f(\tilde x) - \mu^\top b \tilde x \le \bar f - \mu^\top b \tilde x$, whereby $\mu^\top b \tilde x \le \bar f$. Since $\mu \ge 0$, $b \ge 0$, and $\tilde x \in \mathcal{X} \subseteq \mathbb{R}^d_+$, it holds for any $j \in J$ that $(b)^\top_j \tilde x \le \frac{\bar f}{\mu_j}$. Meanwhile, it follows from the definition of $\bar b$ that $(b)^\top_j \tilde x \le \bar b$. Together with (14), it holds that
$$\dot h_j(\mu^+_j) \le \dot h_j(\mu_j) + \eta \min\left( \frac{\bar f}{\mu_j}, \bar b \right) - \eta \rho_j. \qquad (15)$$
If $\frac{\bar f}{\rho_j} \le \mu_j \le \mu^{\max}_j$, we have $\min\left( \frac{\bar f}{\mu_j}, \bar b \right) - \rho_j \le 0$, thus it holds that $\mu^+_j \le \mu_j \le \mu^{\max}_j$ by utilizing (15) and the monotonicity of $\dot h_j$. Otherwise, $\mu_j \le \frac{\bar f}{\rho_j}$, and furthermore,
$$\mu^+_j = \dot h^*_j(\dot h_j(\mu^+_j)) \le \dot h^*_j(\dot h_j(\mu_j) + \eta \bar b) \le \dot h^*_j(\dot h_j(\mu_j)) + \frac{\eta \bar b}{\sigma} \le \frac{\bar f}{\rho_j} + 1 = \mu^{\max}_j,$$
where the first inequality is from (15) and the monotonicity of $\dot h^*_j(\cdot)$, the second inequality is from $\dot h^*_j(\dot h_j(\mu_j)) = \mu_j$ and the $\frac{1}{\sigma}$-smoothness of $h^*_j(\cdot)$, the last inequality utilizes $\mu_j \le \frac{\bar f}{\rho_j}$ and $\eta \le \sigma/\bar b$, and the last equality follows from Definition 1. This finishes the proof of Lemma 2.

Proof of Proposition 2:
First, a direct application of Lemma 2 shows that for any $t$, $\mu_t \le \mu^{\max}$. Next, it follows from the definition of $\tau_A$ (Definition 2) that there exists $j$ such that $\sum_{t=1}^{\tau_A} (b_t)^\top_j x_t + \bar b \ge \rho_j T$. By the definition of $\tilde g_t$, we have
$$\sum_{t=1}^{\tau_A} (\tilde g_t)_j = \rho_j \tau_A - \sum_{t=1}^{\tau_A} (b_t)^\top_j x_t \le \rho_j \tau_A - \rho_j T + \bar b,$$
thus
$$T - \tau_A \le \frac{\bar b - \sum_{t=1}^{\tau_A} (\tilde g_t)_j}{\rho_j}. \qquad (16)$$
On the other hand, it follows from the update rule (8) that for any $t \le \tau_A$, $\dot h_j((\mu_{t+1})_j) \ge \dot h_j((\mu_t)_j) - \eta (\tilde g_t)_j$. Thus,
$$\sum_{t=1}^{\tau_A} -(\tilde g_t)_j \le \frac{1}{\eta} \left( \dot h_j((\mu_{\tau_A+1})_j) - \dot h_j((\mu_0)_j) \right) \le \frac{1}{\eta} \left( \dot h_j(\mu^{\max}_j) - \dot h_j((\mu_0)_j) \right), \qquad (17)$$
where the last inequality is due to the monotonicity of $\dot h_j(\cdot)$. Combining (16) and (17), we reach
$$T - \tau_A \le \max_j \left\{ \frac{\dot h_j(\mu^{\max}_j) - \dot h_j((\mu_0)_j)}{\eta \rho_j} + \frac{\bar b}{\rho_j} \right\}.$$
This finishes the proof by noticing that $\rho_j \ge \rho$ and $\dot h_j(\mu^{\max}_j) - \dot h_j((\mu_0)_j) \le \|\nabla h(\mu^{\max}) - \nabla h(\mu_0)\|_\infty$.

A.3 Proof of Proposition 3
Before proving Proposition 3, we first introduce some new notation used in the proof. By the definition of the conjugate function, we can rewrite the dual problem (7) as the following saddle-point problem:
$$(S): \quad \min_{\mu \ge 0} \max_{y \in p\mathcal{X}} \ L(y, \mu) := \sum_{i=1}^n p_i f_i(y_i/p_i) - \mu^\top B y + \mu^\top \rho, \qquad (18)$$
where $y := [y_1, \ldots, y_n] \in \mathbb{R}^{nd}$, $B := [b_1; \ldots; b_n] \in \mathbb{R}^{m \times nd}$, and $p\mathcal{X} := \{ y \mid y_i \in p_i \mathcal{X} \} \subseteq \mathbb{R}^{nd}_+$. By minimizing over $\mu$ in (18), we obtain the following primal problem:
$$(P): \quad \max_y \ P(y) := \sum_{i=1}^n p_i f_i(y_i/p_i) \qquad (19)$$
$$\text{s.t.} \quad B y \le \rho \qquad (20)$$
$$y \in p\mathcal{X}. \qquad (21)$$
The decision variable $y_i/p_i \in \mathcal{X}$ can be interpreted as the expected action to be taken when a request of type $i$ arrives. Therefore, $(P)$ can be interpreted as a deterministic optimization problem in which resource constraints can be satisfied in expectation. In the linear case, this problem is sometimes referred to as the deterministic linear program (Talluri and van Ryzin, 1998) or the expected instance (Devanur et al., 2019). Moreover, we define an auxiliary primal variable sequence $\{z_t\}_{t=1,\ldots,T}$:
$$z_t = \arg\max_{z \in p\mathcal{X}} L(z, \mu_t). \qquad (22)$$
As a direct consequence of (18) and (22), we obtain:
$$g_t := \nabla D(\mu_t) = \nabla_\mu L(z_t, \mu_t). \qquad (23)$$

Proof of Proposition 3.
It follows from the definitions of $\tilde g_t$, $\bar b$, and $\bar\rho$ that
$$\mathbb{E}_{\gamma_t} \|\tilde g_t\|_\infty^2 \le \left( \mathbb{E}_{\gamma_t} \|b_t x_t\|_\infty + \|\rho\|_\infty \right)^2 \le \left( \bar b + \bar\rho \right)^2. \qquad (24)$$
Note that $\mu_t \in \sigma(\xi_{t-1})$, $g_t \in \sigma(\xi_{t-1})$, and $\tilde g_t \in \sigma(\xi_t)$, where $\sigma(X)$ denotes the sigma algebra generated by a stochastic process $X$. Notice $\mathbb{E}_{\gamma_t} \tilde g_t = g_t$; thus it holds for any $\mu \in \mathcal{D}$ that
$$\begin{aligned}
\langle g_t, \mu_t - \mu \rangle &= \langle \mathbb{E}_{\gamma_t}[\tilde g_t \mid \mu_t], \mu_t - \mu \rangle \\
&\le \mathbb{E}_{\gamma_t}\left[ \langle \tilde g_t, \mu_t - \mu_{t+1} \rangle + \tfrac{1}{\eta} V_h(\mu, \mu_t) - \tfrac{1}{\eta} V_h(\mu, \mu_{t+1}) - \tfrac{1}{\eta} V_h(\mu_{t+1}, \mu_t) \,\Big|\, \mu_t \right] \\
&\le \mathbb{E}_{\gamma_t}\left[ \langle \tilde g_t, \mu_t - \mu_{t+1} \rangle + \tfrac{1}{\eta} V_h(\mu, \mu_t) - \tfrac{1}{\eta} V_h(\mu, \mu_{t+1}) - \tfrac{\sigma}{2\eta} \|\mu_{t+1} - \mu_t\|^2 \,\Big|\, \mu_t \right] \\
&\le \mathbb{E}_{\gamma_t}\left[ \tfrac{\eta}{2\sigma} \|\tilde g_t\|_\infty^2 + \tfrac{1}{\eta} V_h(\mu, \mu_t) - \tfrac{1}{\eta} V_h(\mu, \mu_{t+1}) \,\Big|\, \mu_t \right] \\
&\le \tfrac{\eta}{2\sigma} \left( \bar b + \bar\rho \right)^2 + \tfrac{1}{\eta} V_h(\mu, \mu_t) - \mathbb{E}_{\gamma_t}\left[ \tfrac{1}{\eta} V_h(\mu, \mu_{t+1}) \,\Big|\, \mu_t \right], \qquad (25)
\end{aligned}$$
where the first inequality follows from the Three-Point Property stated in Lemma 3.2 of Chen and Teboulle (1993), the second inequality is by the strong convexity of $h$, the third inequality uses that $a^2 + b^2 \ge 2ab$ for $a, b \in \mathbb{R}$ and Cauchy-Schwarz to obtain
$$\frac{\sigma}{2\eta} \|\mu_{t+1} - \mu_t\|^2 + \frac{\eta}{2\sigma} \|\tilde g_t\|_\infty^2 \ge \|\mu_{t+1} - \mu_t\| \|\tilde g_t\|_\infty \ge |\langle \tilde g_t, \mu_t - \mu_{t+1} \rangle|,$$
and the last inequality follows from (24). Taking expectations with respect to $\xi_{t-1}$ and multiplying by $\eta$ on both sides of (25) yields:
$$\mathbb{E}_{\xi_{t-1}}[\eta \langle g_t, \mu_t - \mu \rangle] \le \frac{\eta^2 (\bar b + \bar\rho)^2}{2\sigma} + \mathbb{E}_{\xi_{t-1}}[V_h(\mu, \mu_t)] - \mathbb{E}_{\xi_t}[V_h(\mu, \mu_{t+1})]. \qquad (26)$$
Consider the process $M_t = \sum_{s=1}^t \eta \langle g_s, \mu_s - \mu \rangle - \mathbb{E}_{\xi_{s-1}}[\eta \langle g_s, \mu_s - \mu \rangle]$, which is a martingale with respect to $\xi_t$ (i.e., $M_t \in \sigma(\xi_t)$ and $\mathbb{E}[M_{t+1} \mid \xi_t] = M_t$) with increments bounded by
$$|M_t - M_{t-1}| \le \eta \left( \|g_t\|_\infty + \mathbb{E}_{\xi_{t-1}} \|g_t\|_\infty \right) \|\mu_t - \mu\|_1 \le 2\eta (\bar b + \bar\rho) m \|\mu_t - \mu\|_\infty \le 4\eta m (\bar b + \bar\rho) \left( \frac{\bar f}{\rho} + 1 \right) < \infty,$$
where the first inequality is Cauchy-Schwarz, the second inequality is from $\|g_t\|_\infty \le \bar b + \bar\rho$ almost surely, and the last inequality uses $\mu_t \in \mathcal{D}$ by Lemma 2. Since $\tau_A$ is a stopping time with respect to $\xi_t$ and $\tau_A$ is bounded, the Optional Stopping Theorem implies that $\mathbb{E}[M_{\tau_A}] = 0$. Therefore,
$$\mathbb{E}\left[ \sum_{t=1}^{\tau_A} \eta \langle g_t, \mu_t - \mu \rangle \right] = \mathbb{E}\left[ \sum_{t=1}^{\tau_A} \mathbb{E}_{\xi_{t-1}}[\eta \langle g_t, \mu_t - \mu \rangle] \right] \le \frac{\eta^2 (\bar b + \bar\rho)^2}{2\sigma} \mathbb{E}[\tau_A] + V_h(\mu, \mu_0), \qquad (27)$$
where the inequality follows from summing up (26) from $t = 1$ to $t = \tau_A$, telescoping, and using that the Bregman divergence is non-negative.

On the other hand, by choosing $\mu = 0$ it holds that
$$\begin{aligned}
\sum_{t=1}^{\tau_A} \eta \langle g_t, \mu_t - \mu \rangle &= \sum_{t=1}^{\tau_A} \eta \langle \nabla_\mu L(z_t, \mu_t), \mu_t - \mu \rangle \\
&= \sum_{t=1}^{\tau_A} \eta \left( L(z_t, \mu_t) - L(z_t, \mu) \right) \\
&= \sum_{t=1}^{\tau_A} \eta \left( L(z_t, \mu_t) - P(z_t) - \mu^\top (\rho - B z_t) \right) \\
&= \sum_{t=1}^{\tau_A} \eta \left( D(\mu_t) - P(z_t) - \mu^\top (\rho - B z_t) \right) \\
&\ge \tau_A \eta \left( D(\bar\mu_{\tau_A}) - \frac{\sum_{t=1}^{\tau_A} P(z_t)}{\tau_A} \right) - \eta \sum_{t=1}^{\tau_A} \mu^\top (\rho - B z_t) \\
&= \tau_A \eta \left( D(\bar\mu_{\tau_A}) - \frac{\sum_{t=1}^{\tau_A} P(z_t)}{\tau_A} \right), \qquad (28)
\end{aligned}$$
where the first equality uses (23), the second equality is because $L(z, \mu)$ is linear in $\mu$, the third equality uses the definition of $L$, the fourth equality is from $z_t = \arg\max_z L(z, \mu_t)$, the inequality uses the convexity of $D(\cdot)$ in $\mu$, and the last equality is because $\mu = 0$.
Combining (27) and (28) and choosing $\mu = 0$, we obtain:
$$\mathbb{E}\left[ \tau_A D(\bar\mu_{\tau_A}) - \sum_{t=1}^{\tau_A} P(z_t) \right] \le \frac{\eta (\bar b + \bar\rho)^2}{2\sigma} \mathbb{E}[\tau_A] + \frac{V_h(0, \mu_0)}{\eta}. \qquad (29)$$
Notice that $\mu_t$ and $z_t$ are measurable given the sigma algebra $\sigma(\xi_{t-1})$. From the updates of $x_t$ and $z_t$, we know that if a request of type $i$ is realized in the $t$-th iteration, then $x_t = (z_t)_i / p_i$. Thus it holds for any $t \le \tau_A$ that
$$\mathbb{E}_{\gamma_t}[f_t(x_t) \mid \xi_{t-1}] = \sum_{i=1}^n p_i f_i((z_t)_i / p_i) = P(z_t).$$
Therefore, another martingale argument yields that
$$\mathbb{E}\left[ \sum_{t=1}^{\tau_A} f_t(x_t) \right] = \mathbb{E}\left[ \sum_{t=1}^{\tau_A} P(z_t) \right]. \qquad (30)$$
Combining (29) and (30) finishes the proof.

A.4 Proof of Theorem 1

Proof of Theorem 1.
For any $P \in \mathcal{I}$, we have for any $\tau_A$ that
$$\mathrm{OPT}(P) = \frac{\tau_A}{T} \mathrm{OPT}(P) + \frac{T - \tau_A}{T} \mathrm{OPT}(P) \le \tau_A D(\bar\mu_{\tau_A}) + (T - \tau_A) \bar f,$$
where the inequality uses (6) and the fact that $\mathrm{OPT}(P) \le T \bar f$. Therefore,
$$\begin{aligned}
\mathrm{Regret}(A \mid P) &= \mathrm{OPT}(P) - R(A \mid P) \\
&\le \mathbb{E}_P\left[ \tau_A D(\bar\mu_{\tau_A}) + (T - \tau_A) \bar f - \sum_{t=1}^T f_t(x_t) \right] \\
&\le \mathbb{E}_P\left[ \tau_A D(\bar\mu_{\tau_A}) - \sum_{t=1}^{\tau_A} f_t(x_t) \right] + \mathbb{E}_P\left[ (T - \tau_A) \bar f \right] \\
&\le \frac{\eta (\bar b + \bar\rho)^2}{2\sigma} \mathbb{E}_P[\tau_A] + \frac{V_h(0, \mu_0)}{\eta} + \frac{\bar f}{\rho \eta} \|\nabla h(\mu^{\max}) - \nabla h(\mu_0)\|_\infty + \frac{\bar f \bar b}{\rho} \\
&\le \frac{\eta (\bar b + \bar\rho)^2}{2\sigma} T + \frac{V_h(0, \mu_0)}{\eta} + \frac{\bar f}{\rho \eta} \|\nabla h(\mu^{\max}) - \nabla h(\mu_0)\|_\infty + \frac{\bar f \bar b}{\rho}, \qquad (31)
\end{aligned}$$
where the second inequality is because $\tau_A \le T$ and $f_t(x_t) \ge 0$, the third inequality uses Proposition 2 and Proposition 3, and the last inequality is from $\tau_A \le T$ almost surely. Moreover, (31) holds for any $P \in \mathcal{I}$, which finishes the proof of Theorem 1.

A.5 Proof of Proposition 4
Proof of Proposition 4.
The proof essentially follows from the proof of Theorem 1 after taking the expectation over $\zeta$. Notice that $\tilde g_t$ does not depend on the realization of $\zeta$; thus Proposition 3 still holds. The only part requiring major modification in the proof is the bound on the stopping time (i.e., Proposition 2). Indeed, (10) no longer holds almost surely, but it still holds in expectation, which is enough to show Proposition 4. The difficulty in establishing this result is that the stopping time is now defined in terms of the realized allocation $v_t$ instead of the action $x_t$. For consistency of the proof, we still use the notation for the general problem (1) herein. Our goal is now to show that
$$\mathbb{E}_\zeta[T - \tau_A] \le \max_j \left\{ \frac{\dot h_j(\mu^{\max}_j) - \dot h_j((\mu_0)_j)}{\eta \rho_j} + \frac{\bar b}{\rho_j} \right\}. \qquad (32)$$
Consider the sigma algebra $\mathcal{F}_t = \sigma(\mathcal{H}_t, f_{t+1}, b_{t+1})$ where $\mathcal{H}_t = \{f_s, b_s, v_s\}_{s=1}^t$ is the previous history. Note that $x_{t+1} \in \mathcal{F}_t$ but $v_{t+1} \in \mathcal{F}_{t+1}$. First, it holds that $M_t := \sum_{s=1}^t b_s (v_s - x_s)$ is a martingale with respect to $\mathcal{F}_t$, because $M_t \in \mathcal{F}_t$, $\mathbb{E}_\zeta(\|M_t\|) \le 2 \bar b t$ is bounded for any $t$, and $\mathbb{E}_\zeta[M_{t+1} - M_t \mid \mathcal{F}_t] = b_{t+1} \mathbb{E}[v_{t+1} - x_{t+1} \mid x_{t+1}] = 0$. Recall that $\tau_A$ is the first time that
$$\sum_{t=0}^{\tau_A} (b_t)^\top_j v_t + \bar b \ge \rho_j T. \qquad (33)$$
Notice that $(b_t)^\top_j v_t$ on the left-hand side of (33) is measurable with respect to $\mathcal{F}_t$; thus $\tau_A$ is a stopping time with respect to $\mathcal{F}_t$. It then follows from the Martingale Optional Stopping Theorem that $\mathbb{E}_\zeta[M_{\tau_A}] = \mathbb{E}_\zeta[M_0] = 0$. Therefore, because $\tilde g_t = -x_t + \rho$, we obtain
$$\mathbb{E}_\zeta\left[ \sum_{t=1}^{\tau_A} (\tilde g_t)_j \right] = \mathbb{E}_\zeta\left[ \rho_j \tau_A - \sum_{t=1}^{\tau_A} (b_t)^\top_j x_t \right] = \mathbb{E}_\zeta\left[ \rho_j \tau_A - \sum_{t=1}^{\tau_A} (b_t)^\top_j v_t \right] \le \mathbb{E}_\zeta\left[ \rho_j \tau_A - \rho_j T + \bar b \right],$$
where the second equality is from $\mathbb{E}_\zeta[M_{\tau_A}] = 0$ and the inequality from (33). Thus
$$\mathbb{E}_\zeta[T - \tau_A] \le \mathbb{E}_\zeta\left[ \frac{\bar b - \sum_{t=1}^{\tau_A} (\tilde g_t)_j}{\rho_j} \right].$$
Notice that $\tilde g_t$ does not depend on the realized allocation $v_t$; thus Lemma 2 and (17) still hold, which finishes the proof of (32). Proposition 4 can then be proved by following the exact steps in the proof of Theorem 1 after taking an additional expectation over $\zeta$.