The Primal-Dual method for Learning Augmented Algorithms
Etienne Bamas ∗ EPFL, Lausanne, Switzerland [email protected]
Andreas Maggiori ∗ EPFL, Lausanne, Switzerland [email protected]
Ola Svensson ∗ EPFL, Lausanne, Switzerland [email protected]
Abstract
The extension of classical online algorithms when provided with predictions is a new and active research area. In this paper, we extend the primal-dual method for online algorithms in order to incorporate predictions that advise the online algorithm about the next action to take. We use this framework to obtain novel algorithms for a variety of online covering problems. We compare our algorithms to the cost of the true and predicted offline optimal solutions and show that these algorithms outperform any online algorithm when the prediction is accurate while maintaining good guarantees when the prediction is misleading.
In the classical field of online algorithms the input is presented in an online fashion and the algorithm is required to make irrevocable decisions without knowing the future. The performance is often measured in terms of worst-case guarantees with respect to an optimal offline solution. In this paper we consider minimization problems; formally, we say that an online algorithm ALG is c-competitive if on any input I, the cost c_ALG(I) of the solution output by ALG on input I satisfies c_ALG(I) ≤ c·OPT(I), where OPT(I) denotes the cost of the offline optimum. Due to the uncertainty about the future, online algorithms tend to be overly cautious, which sometimes causes their performance in real-world situations to be far from what a machine learning (ML) algorithm would have achieved. Indeed, in many practical applications future events follow patterns which are easily predictable using ML methods. In [19], Lykouris and Vassilvitskii formalized a general framework for incorporating ML predictions into online algorithms and designed an extension of the marking algorithm to solve the online caching problem when provided with predictions. This work was quickly followed by many other papers studying different learning-augmented online problems such as scheduling [17], caching [2, 26], ski rental [10, 16, 25, 28], clustering [7] and other problems [12, 23]. The main challenge is to incorporate the prediction without knowing how the prediction was computed and, in particular, without making any assumption on the quality of the prediction. This setting is natural since in real-world situations predictions are provided by ML algorithms that rarely come with worst-case guarantees on their accuracy. Thus, the difficulty in designing a learning-augmented algorithm is to find a good balance: on the one hand, blindly following the prediction might lead to a very bad solution if the prediction is misleading; on the other hand, if the algorithm does not trust the prediction at all, it will simply never benefit from an excellent prediction. The aforementioned results resolve this issue by designing smart algorithms which exploit the problem structure to achieve a good trade-off between these two cases. (∗ Equal contribution.) In this paper we take a different perspective.
Instead of focusing on a specific problem and trying to integrate predictions, we show how to extend a very powerful algorithmic method, the Primal-Dual method, to the design of online learning-augmented algorithms. We underline that despite the generality of our extension technique, it produces online learning-augmented algorithms in a fairly simple and straightforward manner.

The Primal-Dual method.
The Primal-Dual (PD) method is a very powerful algorithmic technique for designing online algorithms. It was first introduced by Alon et al. [1] to design an online algorithm for the classical online set cover problem and later extended to many other problems such as weighted caching [3], revenue maximization in ad-auctions, TCP acknowledgement and ski rental [5]. We refer to the survey of Buchbinder and Naor [4] for more references on this technique. In a few words, the technique consists in formulating the online problem as a linear program P complemented by its dual D. Subsequently, the algorithm builds online a feasible fractional solution to both the primal P and the dual D. Every time an update of the primal and dual variables is made, the cost of the primal increases by some amount ΔP while the cost of the dual increases by some amount ΔD. The competitive ratio of the fractional solution is then obtained by upper bounding the ratio ΔP/ΔD and using weak duality. An integral solution is then obtained by an online rounding scheme applied to the fractional solution.

Preliminary notions for Learning Augmented (LA) algorithms.
LA algorithms receive as input a prediction A, an instance I which is revealed online, and a robustness parameter λ, and output a solution of cost c_ALG(A, I, λ). Intuitively, λ indicates our confidence in the prediction, with smaller values reflecting higher confidence. We denote by S(A, I) the cost of the output solution on input I if the algorithm blindly follows the prediction A. We avoid defining the prediction A explicitly so as to easily fit different prediction settings. For instance, if the prediction A is a predicted solution (without necessarily revealing the predicted instance), then blindly following it would simply mean outputting the predicted solution A. For each result presented in this paper, it will be clear what the prediction A and the cost S(A, I) are. Given this, we restate some useful definitions introduced in [19, 25] in our context. For any 0 < λ ≤ 1, we say that an LA algorithm is C(λ)-consistent and R(λ)-robust if the cost of the output solution satisfies:

    c_ALG(A, I, λ) ≤ min{ C(λ)·S(A, I), R(λ)·OPT(I) }.    (1)

If A is accurate (S(A, I) ≈ OPT(I)) and at the same time we trust the prediction, we would like our performance to be close to the offline optimum. Thus, ideally C(λ) should approach 1 as λ approaches 0. In the same spirit, a value of λ close to 1 denotes no trust in the prediction, and in that case our algorithm should not be much worse than the best pure online algorithm. Therefore, R(1) should be close to the competitive ratio of the best pure online algorithm. We also mention that in some other papers, such as [25], a smoothness criterion on the consistency bound is required. In these papers the setting is slightly different. Indeed, the prediction A is a predicted instance I_pred and the error is defined to describe how far I_pred is from the real instance I. With this in mind, an algorithm is said to be smooth if its performance degrades smoothly as the error increases.
We emphasize that, in the applications considered in this paper, this smoothness property is implicitly included in the value of S(A, I), which degrades smoothly with the quality of the prediction.

Our contributions.
We show how to extend the Primal-Dual method (when predictions are provided) for solving problems that can be formulated as covering problems. The algorithms designed using this technique receive as input a robustness parameter λ and incorporate a prediction. If the prediction is accurate, our algorithms can be arbitrarily close to the offline optimum (beating known lower bounds for classical online algorithms) while being robust to failures of the predictor. We first apply our Primal-Dual Learning Augmented (PDLA) technique to the online version of the weighted set cover problem, which constitutes the most canonical example of a covering Linear Program (LP). For that problem we show how we can easily modify the Primal-Dual algorithm to incorporate predictions. Even though in this case the prediction may not seem very natural, this result reveals that we can use PDLA to design learning-augmented algorithms for the large class of problems that can be formulated as a covering LP. We then continue by addressing problems in which the prediction model is much more natural. Using the PDLA technique, we first design an algorithm which recovers the results of Purohit et al. [25] for the ski rental problem, and we also prove that the consistency-robustness trade-off of that algorithm is optimal. We additionally design a learning-augmented algorithm for a generalization of ski rental, namely the Bahncard problem. Finally, we turn our attention to a problem which arises in network congestion control, the TCP acknowledgement problem. We design an LA algorithm for that problem and conduct experiments which confirm our claims. We note that the analysis of the algorithms designed using PDLA is (arguably) simple and boils down to (1) proving robustness with (essentially) the same proof as in the original Primal-Dual technique and (2) proving consistency using a simple charging argument that, without making use of the dual, relates the cost incurred by our algorithms to the prediction.
In addition to that, using PDLA, the design of online LA algorithms is almost automatic. We emphasize that the preexisting online rounding schemes to obtain an integral solution from a fractional solution still apply to our learning-augmented algorithms. Hence, throughout the paper we focus only on building a fractional solution and provide appropriate references for the rounding scheme.

In this section we apply PDLA to solve the online weighted set cover problem when provided with predictions. Set cover is arguably the most canonical example of a covering problem and the framework that we develop readily applies to other covering problems. In particular, we use the framework to give tight or nearly-tight LA algorithms for ski rental, Bahncard, and dynamic TCP acknowledgement, which are all problems that can be formulated as covering LPs.
Primal: minimize Σ_{S∈F} w_S·x_S subject to Σ_{S∈F(e)} x_S ≥ 1 for all e ∈ U, and x_S ≥ 0 for all S ∈ F.
Dual: maximize Σ_{e∈U} y_e subject to Σ_{e∈S} y_e ≤ w_S for all S ∈ F, and y_e ≥ 0 for all e ∈ U.
Figure 1: Primal-dual formulation of weighted set cover.
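To see weak duality at work on the LP pair of Figure 1, here is a small self-contained check on a toy instance (our own illustrative example, not taken from the paper): any feasible dual solution lower-bounds the optimal cover cost.

```python
from itertools import combinations

# Toy instance of Figure 1: universe U = {1, 2, 3}, three unit-weight sets.
sets = {"S1": {1, 2}, "S2": {2, 3}, "S3": {1, 3}}
w = {S: 1.0 for S in sets}

def opt_cost():
    """Integral offline optimum by brute force over all subfamilies."""
    best = float("inf")
    names = list(sets)
    for r in range(len(names) + 1):
        for combo in combinations(names, r):
            covered = set().union(*(sets[S] for S in combo)) if combo else set()
            if covered >= {1, 2, 3}:
                best = min(best, sum(w[S] for S in combo))
    return best

# A feasible dual: y_e = 1/2 for every element gives sum_{e in S} y_e = 1 <= w_S
# for each set, so by weak duality sum_e y_e = 3/2 lower-bounds the optimum.
y = {e: 0.5 for e in (1, 2, 3)}
assert all(sum(y[e] for e in sets[S]) <= w[S] for S in sets)
assert sum(y.values()) <= opt_cost()
```

Here the optimum uses two sets (cost 2), and the dual certificate 3/2 is a valid lower bound, which is exactly the mechanism the PD analysis exploits.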
The weighted set cover problem.
In this problem, we are given a universe U = {e_1, e_2, ..., e_n} of n elements and a family F of m sets over this universe; each set S ∈ F has a weight w_S and each element e is covered by any set in F(e) = {S ∈ F | e ∈ S}. Let d = max_{e∈U} |F(e)| denote the maximum number of sets that cover one element. Our goal is to select sets so as to cover all elements while minimizing the total weight. In its online version, elements are given one by one and it is unknown to the algorithm which elements will arrive and in which order. When a new element arrives, it is required to cover it by adding a new set if necessary. Removing a set from the current solution to decrease its cost is not allowed. Alon et al. [1] first studied the online version, designing an almost optimal O(log n log d)-competitive algorithm. We note that the O(log n) factor comes from the integrality gap of the linear program formulation of the problem (Figure 1) while the O(log d) factor is due to the online nature of the problem. Since Alon et al. [1] designed an online rounding scheme at a multiplicative cost of O(log n), we will focus on building an increasing fractional solution to the set cover problem (i.e., x_S can only increase over time for all S).

PDLA for weighted set cover.
Algorithm 2 takes as input a predicted covering A ⊆ F and a robustness parameter λ ∈ (0, 1]. While an instance I is revealed in an online fashion, an increasing fractional solution {x_S}_{S∈F} ∈ [0, 1]^F is built. Note that F(e) ∩ A are the sets which cover e in the prediction. To simplify the description, we assume that |F(e) ∩ A| ≥ 1 for all e, i.e., the prediction forms a feasible solution. The algorithm without this assumption can be found in appendix A.

Algorithm intuition.
We first turn our attention to the original online algorithm of Alon et al. [1], described in Algorithm 1. To get an intuition, assume that w_S = 1 for all S and consider the very first arrival of an element e. After the first execution of the while loop, e is covered and x_S = 1/|F(e)| for all S ∈ F(e). In other words, the online algorithm creates a uniform distribution over the sets in F(e), reflecting in such a way its unawareness about the future. On the contrary, Algorithm 2 uses the prediction to adjust the increase rate of the primal variables, augmenting more aggressively the primal variables of sets which are predicted to be in the optimal offline solution. Indeed, after the first execution of the while loop, sets which belong to A get a value of λ/|F(e)| + (1−λ)/|F(e)∩A|, while sets which are not chosen by the prediction get λ/|F(e)|. We continue by exposing our main conceptual contribution. To that end, let S(A, I) denote the cost of the covering solution described by prediction A on instance I.

Algorithm 1: Primal-Dual method for online weighted set cover [1].
  Initialize: x_S ← 0 for all S, y_e ← 0 for all e
  for each element e that just arrived do
    while Σ_{S∈F(e)} x_S < 1 do
      /* Primal update */
      for all S ∈ F(e) do
        x_S ← x_S·(1 + 1/w_S) + 1/(w_S·|F(e)|)
      end for
      /* Dual update */
      y_e ← y_e + 1
    end while
  end for

Algorithm 2: PDLA for online weighted set cover.
  Input: λ, A
  Initialize: x_S ← 0 for all S, y_e ← 0 for all e
  for each element e that just arrived do
    while Σ_{S∈F(e)} x_S < 1 do
      /* Primal update */
      for all S ∈ F(e) with S ∈ A do
        x_S ← x_S·(1 + 1/w_S) + λ/(w_S·|F(e)|) + (1−λ)/(w_S·|F(e)∩A|)
      end for
      for all S ∈ F(e) with S ∉ A do
        x_S ← x_S·(1 + 1/w_S) + λ/(w_S·|F(e)|)
      end for
      /* Dual update */
      y_e ← y_e + 1
    end while
  end for

Theorem 1.
Assuming A is a feasible solution, the cost of the fractional solution output by Algorithm 2 satisfies

    c_PDLA(A, I, λ) ≤ min{ O(1/(1−λ))·S(A, I), O(log(d/λ))·OPT(I) }.

Proof sketch.
The proof is split into two parts. The first part is to bound the cost of the algorithm by the term O(1/(1−λ))·S(A, I). As mentioned in the introduction, we use a charging argument to do so. After each execution of the while loop we can decompose the primal increase into two parts: ΔP_c, which denotes the increase due to sets in F(e) ∩ A, and ΔP_u, which denotes the increase due to sets in F(e) \ A; thus for the overall primal increase ΔP we have ΔP = ΔP_c + ΔP_u. We continue by upper bounding ΔP_u as a function of λ and ΔP_c, that is ΔP_u ≤ O(λ/(1−λ))·ΔP_c, and deducing that ΔP ≤ O(1/(1−λ))·ΔP_c. The consistency proof then terminates by noting that since ΔP_c is generated by sets in the prediction, we can charge this increase to S(A, I). The robustness bound, which is independent of the prediction, is obtained by mimicking the proof of the original online algorithm of Alon et al. [1]. See appendix A for more details.

Primal: minimize B·x + Σ_{j∈[N]} f_j subject to x + f_j ≥ 1 for all j ∈ [N], and x, f_j ≥ 0 for all j ∈ [N].
Dual: maximize Σ_{j∈[N]} y_j subject to Σ_{j∈[N]} y_j ≤ B, and 0 ≤ y_j ≤ 1 for all j ∈ [N].
Figure 2: Primal-dual formulation of the ski rental problem.

As another application of PDLA we design a learning-augmented algorithm for one of the simplest and most well studied online problems, the ski rental problem. In this problem, every new day, one has to decide whether to rent skis for this day, which costs 1 dollar, or to buy skis for the rest of the vacation at a cost of B dollars. In its offline version the total number of vacation days, N, is known in advance and the problem becomes trivial. From the primal-dual formulation of the problem (Figure 2) it is clear that if B < N the optimal strategy is to buy the skis on day one, while if B ≥ N the optimal strategy is to always rent. In the online setting the difficulty lies in the fact that we do not know N in advance.
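As a warm-up, the offline optimum is min(N, B), and the classical deterministic break-even strategy (rent for the first B − 1 days, then buy) never pays more than twice that. A quick check of this folklore guarantee (our own illustrative code):

```python
def opt_cost(N, B):
    """Offline optimum: buy immediately if B < N, otherwise always rent."""
    return min(N, B)

def break_even_cost(N, B):
    """Classical deterministic strategy: rent for the first B - 1 days,
    then buy on day B if the vacation lasts that long."""
    return N if N < B else (B - 1) + B

# If N < B the strategy pays exactly OPT; otherwise it pays 2B - 1 <= 2B,
# so it is 2-competitive on every instance.
assert all(break_even_cost(N, B) <= 2 * opt_cost(N, B)
           for B in range(1, 60) for N in range(1, 200))
```

The fractional primal-dual algorithms discussed next improve on this by randomizing, and the learning-augmented variant further interpolates between the prediction and this worst-case behavior.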
A deterministic 2-competitive online algorithm has been known for a long time [13] and a randomized e/(e−1) ≈ 1.58-competitive algorithm was also designed later [14]. Both competitive ratios are known to be optimal for deterministic and randomized algorithms respectively. This problem has already been studied in various learning-augmented settings [10, 16, 25, 28]. Our approach recovers, using the primal-dual method, the results of [25]. As in [25], our prediction A will be the total number of vacation days N_pred.

PDLA for ski rental.
To simplify the description, we denote an instance of the problem as I = (N, B) and define the function e(z) = (1 + 1/B)^{z·B}. Note that if B → ∞, then e(z) approaches e^z, hence the choice of notation. In an integral solution, the variable x is 1 to indicate that the skis are bought and 0 otherwise. In the same spirit, f_j indicates whether we rent on day j or not. Buchbinder et al. [5] showed how to easily turn a fractional monotone solution (i.e., one where it is not permitted to decrease a variable) into an online randomized algorithm of expected cost equal to the cost of the fractional solution. Hence we focus only on building a fractional solution online. Algorithm 3 is due to [5] and uses the Primal-Dual method to solve the problem. Each new day j, a new constraint x + f_j ≥ 1 is revealed. To satisfy this constraint, the algorithm updates the primal and dual variables while trying to maintain (1) the ratio ΔP/ΔD as small as possible and (2) the primal and dual solutions feasible. As in the online weighted set cover problem, the key idea for extending Algorithm 3 to the learning-augmented Algorithm 4 is to use the prediction N_pred in order to adjust the rate at which each variable is increased. Thus, when N_pred > B we increase the buying variable more aggressively than the pure online algorithm. Here, the cost of blindly following the prediction N_pred is S(N_pred, I) = B·1{N_pred > B} + N·1{N_pred ≤ B}.

Algorithm 3: Primal-Dual for ski rental [5].
  Initialize: x ← 0, f_j ← 0 for all j; c ← e(1), c' ← 1
  for each new day j such that x + f_j < 1 do
    /* Primal update */
    f_j ← 1 − x
    x ← (1 + 1/B)·x + 1/(B·(c − 1))
    /* Dual update */
    y_j ← c'
  end for

Algorithm 4: PDLA for ski rental.
  Input: λ, N_pred
  Initialize: x ← 0, f_j ← 0 for all j
  if N_pred ≥ B then
    /* Prediction suggests buying */
    c ← e(λ), c' ← 1
  else
    /* Prediction suggests renting */
    c ← e(1/λ), c' ← λ
  end if
  for each new day j such that x + f_j < 1 do
    /* Primal update */
    f_j ← 1 − x
    x ← (1 + 1/B)·x + 1/(B·(c − 1))
    /* Dual update */
    y_j ← c'
  end for

In the following we assume that either λB or B/λ is an integer (depending on whether c equals e(λ) or e(1/λ) respectively in Algorithm 4). Our results do not change qualitatively by rounding up to the closest integer. See appendix B for details.

Theorem 2 (PDLA for ski rental). For any λ ∈ (0, 1], the cost of PDLA for ski rental is bounded as follows:

    c_PDLA(N_pred, I, λ) ≤ min{ λ/(1−e(−λ))·S(N_pred, I), 1/(1−e(−λ))·OPT(I) }.

Proof sketch.
The robustness bound is proved essentially using the same proof as in the original analysis of Algorithm 3 in [5]. For the consistency bound, we first note that after an update the primal increase is 1 + 1/(c − 1) = c/(c − 1); depending on the value of c we then distinguish two cases. If N_pred ≥ B then Algorithm 4 is always aggressive in buying. In this case it is easy to show that at most λB updates are made before we get x ≥ 1. Once x ≥ 1, no more updates are needed. Since each aggressive update costs at most 1 + 1/(e(λ) − 1) = e(λ)/(e(λ) − 1) = 1/(1 − e(−λ)), we get that the total cost paid by Algorithm 4 is at most λB/(1 − e(−λ)) = S(N_pred, I)·λ/(1 − e(−λ)). Similarly, in the second case N_pred < B and the algorithm increases the buying variable less aggressively. In this case each update costs at most 1 + 1/(e(1/λ) − 1) = 1/(1 − e(−1/λ)) and at most N such updates are made; therefore Algorithm 4 pays at most N/(1 − e(−1/λ)) = S(N_pred, I)·1/(1 − e(−1/λ)). To conclude the consistency proof, note that 1/(1 − e(−1/λ)) ≤ λ/(1 − e(−λ)) (see Lemma 19, inequality (2)). In addition to recovering the positive results of [25], we additionally show in appendix D that this consistency-robustness trade-off is optimal.

Lemma 3.
Any λ/(1−e^{−λ})-consistent learning-augmented algorithm for ski rental has robustness R(λ) ≥ 1/(1−e^{−λ}).

To emphasize how PDLA permits us to tackle more general problems, we apply the same ideas to a generalization of the ski rental problem, namely the Bahncard problem [9]. This problem models a situation where a tourist makes multiple trips every day. Before any new trip, the tourist has two choices: either buy a ticket for that particular trip at a cost of 1, or buy a discount card, at a cost of B, that allows buying tickets at a cheaper price of β < 1. The discount card remains valid during T days. Note that ski rental is modeled by taking β = 0 and T → ∞. In the learning-augmented version of the problem we are given a prediction A which consists in a collection of times at which we are advised to acquire the discount card. We state the main result on this problem and defer the proof to Appendix B.

Theorem 4 (PDLA for the Bahncard problem). For any λ ∈ (0, 1], any β ∈ [0, 1) and B/(1−β) → ∞, we have the following guarantees on any instance I and prediction A:

    cost_PDLA(A, I, λ) ≤ min{ (λ/(1−β+λβ))·((e^λ−β)/(e^λ−1))·S(A, I), ((e^λ−β)/(e^λ−1))·OPT(I) }.

Primal: minimize Σ_{t∈T} x_t + Σ_{j∈M} Σ_{t ≥ t(j)} (1/d)·f_jt subject to f_jt + Σ_{k=t(j)}^{t} x_k ≥ 1 for all j and t ≥ t(j); f_jt ≥ 0 for all j, t ≥ t(j); and x_t ≥ 0 for all t ∈ T.
Dual: maximize Σ_{j∈M} Σ_{t ≥ t(j)} y_jt subject to Σ_{j : t(j) ≤ t} Σ_{t' ≥ t} y_jt' ≤ 1 for all t ∈ T, and 0 ≤ y_jt ≤ 1/d for all j, t ≥ t(j).
Figure 3: Primal-dual formulation of the TCP acknowledgement problem.

In this section, we continue by applying PDLA to a classic network congestion problem of the Transmission Control Protocol (TCP).
During a TCP interaction, a server receives a stream of packets and replies back to the sender acknowledging that each packet arrived correctly. Instead of sending an acknowledgement for each packet separately, the server can choose to delay its response and acknowledge multiple packets simultaneously via a single TCP response. Of course, in this scenario there is an additional cost incurred due to the delayed packets, which is the total latency incurred by those packets. Thus, on one hand sending too many acknowledgements (acks) overloads the network; on the other hand sending one ack for all the packets slows down the TCP interaction. Hence a good trade-off has to be achieved, and the objective function which we aim to minimize is the sum of the total number of acknowledgements plus the total latency. The problem was first modeled by Dooly et al. [8], who showed how to solve the offline problem optimally in quadratic time along with a deterministic 2-competitive online algorithm. Karlin et al. [15] provided the first e/(e−1)-competitive randomized algorithm, which was later shown to be optimal by Seiden [27]. The problem was later solved using the primal-dual method by Buchbinder et al. [5], who also obtained an e/(e−1)-competitive algorithm. Figure 3 presents the primal-dual formulation of the problem. In this formulation each packet j arrives at time t(j) and is acknowledged by the first ack sent after t(j). Here, variable x_t corresponds to sending an ack at time t and f_jt is set to one (in the integral solution) if packet j was not acknowledged by time t. The time granularity is controlled by the parameter d and each additional time unit of latency comes at a cost of 1/d. As in the ski rental problem, there is no integrality gap and a fractional monotone solution can be converted to a randomized algorithm in a lossless manner (see [5] for more details).
The PDLA algorithm and its theoretical analysis

Our prediction consists in a collection of times A at which the prediction suggests sending an ack. Let α(t) be the next time t' ≥ t at which the prediction sends an ack. With this definition, if the prediction is followed blindly, each packet j is acknowledged at time α(t(j)), incurring a latency cost of (α(t(j)) − t(j))/d. In the same spirit as for the ski rental problem, we adapt the pure online Algorithm 5 into the learning-augmented Algorithm 6. Algorithm 6 adjusts the rate at which we increase the primal and dual variables according to the prediction A. Thus, if a packet j at time t is "uncovered" by our fractional solution (Σ_{k=t(j)}^{t} x_k + f_jt < 1) and "covered" by A (α(t(j)) ≤ t), we increase x_t at a faster rate. To simplify the description of Algorithm 6 we define e(z) = (1 + 1/d)^{z·d}. To get to the continuous-time case, we will take the limit d → ∞, so the reader should intuitively think of e(z) ≈ e^z.

Algorithm 5: Primal-Dual method for TCP acknowledgement [5].
  Initialize: x ← 0, y ← 0
  for all times t do
    for all packets j such that Σ_{k=t(j)}^{t} x_k < 1 do
      c ← e(1), c' ← 1/d
      /* Primal update */
      f_jt ← 1 − Σ_{k=t(j)}^{t} x_k
      x_t ← x_t + (1/d)·(Σ_{k=t(j)}^{t} x_k + 1/(c − 1))
      /* Dual update */
      y_jt ← c'
    end for
  end for

Algorithm 6: PDLA for TCP acknowledgement.
  Input: λ, A
  Initialize: x ← 0, y ← 0
  for all times t do
    for all packets j such that Σ_{k=t(j)}^{t} x_k < 1 do
      if t ≥ α(t(j)) then
        /* Prediction already acknowledged packet j */
        c ← e(λ), c' ← 1/d
      else
        /* Prediction did not acknowledge packet j yet */
        c ← e(1/λ), c' ← λ/d
      end if
      /* Primal update */
      f_jt ← 1 − Σ_{k=t(j)}^{t} x_k
      x_t ← x_t + (1/d)·(Σ_{k=t(j)}^{t} x_k + 1/(c − 1))
      /* Dual update */
      y_jt ← c'
    end for
  end for

We continue by presenting Algorithm 6's guarantees together with a proof sketch. As before, I denotes the TCP ack problem instance which is revealed in an online fashion. The full proof is deferred to appendix C.

Theorem 5 (PDLA for TCP ack). For any prediction A, any instance I of the TCP ack problem, any parameter λ ∈ (0, 1], and d → ∞, Algorithm 6 outputs a fractional solution of cost at most

    c_PDLA(A, I, λ) ≤ min{ λ/(1−e^{−λ})·S(A, I), 1/(1−e^{−λ})·OPT(I) }.

Proof sketch.
The two bounds are proven separately. For the robustness bound, while our analysis is slightly more technical, we use the same idea as the original analysis in [5], that is, upper bounding the ratio ΔP/ΔD in every iteration and using weak duality. The consistency proof uses a simple charging scheme that can be seen as a generalization of our consistency proof for the ski rental problem. We essentially have two cases: big (c = e(λ)) and small (c = e(1/λ)) updates. In the case of a small update, a simple calculation reveals that the increase in cost of the solution is at most ΔP = (1/d)·(1 − Σ_{k=t(j)}^{t} x_k) + (1/d)·(Σ_{k=t(j)}^{t} x_k + 1/(e(1/λ) − 1)) = (1/d)·e(1/λ)/(e(1/λ) − 1) = (1/d)·1/(1 − e(−1/λ)). Notice then that whenever Algorithm 6 does a small update at time t due to request j, prediction A pays a latency cost of 1/d since it has not yet acknowledged request j. Hence the primal increase, which is at most (1/d)·1/(1 − e(−1/λ)), can be charged to the latency cost 1/d paid by A with a multiplicative factor 1/(1 − e(−1/λ)) ≤ λ/(1 − e(−λ)) (see Lemma 19, inequality (3)). The case of big updates is slightly different. Consider a time t at which A sends an acknowledgement and consider the big updates performed by Algorithm 6 for packets j that arrived before that time (t(j) ≤ t). We claim that at most ⌈λd⌉ such big updates can be made. Indeed, big updates are more aggressive (i.e., x_t increases faster), and a "covering" due to Σ_{k=t(j)}^{t'} x_k ≥ 1 is reached after only ⌈λd⌉ updates (after this point, the packets that arrived before time t will never force Algorithm 6 to make an update). Thus Algorithm 6's cost due to these big updates is at most ⌈λd⌉·(cost of a big update) = ⌈λd⌉·(1/d)·1/(1 − e(−λ)) ≈ λ/(1 − e(−λ)), which can be charged to the cost of 1 incurred by A for sending an ack at time t.
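The fractional updates of Algorithm 6 are easy to simulate on a small discretized instance. The sketch below is our own illustrative code (names and the numerical tolerance are ours; the authors' public repository is the reference implementation), with the per-step loops kept simple rather than efficient.

```python
def pdla_tcp_ack(arrivals, pred_acks, lam, d):
    """Fractional PDLA for TCP acknowledgement (sketch of Algorithm 6).

    arrivals:  sorted packet arrival steps t(j)
    pred_acks: sorted steps at which the prediction sends an ack
    lam:       robustness parameter in (0, 1]
    d:         steps per time unit; each step of latency costs 1/d
    Returns the fractional primal cost (ack mass + latency).
    """
    e_ = lambda z: (1.0 + 1.0 / d) ** (z * d)   # e(z), close to exp(z) for large d

    def alpha(t):  # first predicted ack at or after step t (inf if none)
        later = [a for a in pred_acks if a >= t]
        return later[0] if later else float("inf")

    x = {}                                      # x[t]: fractional ack mass at step t
    pending = sorted(arrivals)
    cost, t = 0.0, 0
    while pending:
        for tj in list(pending):
            if tj > t:
                break                           # not arrived yet (list is sorted)
            cov = sum(x.get(k, 0.0) for k in range(tj, t + 1))
            if cov >= 1.0 - 1e-9:               # tolerance for rounding noise
                pending.remove(tj)              # packet fully acknowledged
                continue
            # "big" update iff the prediction has already acked this packet
            c = e_(lam) if t >= alpha(tj) else e_(1.0 / lam)
            cost += (1.0 - cov) / d             # latency term f_jt / d
            dx = (cov + 1.0 / (c - 1.0)) / d    # primal increase of x_t
            x[t] = x.get(t, 0.0) + dx
            cost += dx                          # ack term
        t += 1
    return cost

# One packet at step 0 and a prediction that acks immediately: S(A, I) = 1,
# so the consistency bound of Theorem 5 caps the cost at lam / (1 - e(-lam)).
cost = pdla_tcp_ack(arrivals=[0], pred_acks=[0], lam=0.5, d=10)
```

On this instance the simulation performs only big updates and its cost matches the charging argument above: roughly ⌈λd⌉ updates of cost about (1/d)·1/(1 − e(−λ)) each.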
Experiments

We present experimental results that confirm the theoretical analysis of Algorithm 6 for the TCP acknowledgement problem. The code is publicly available at https://github.com/etienne4/PDLA. We experiment on various types of distributions for packet arrival inputs. Historically, the distribution of TCP packets was often assumed to follow some Poisson distribution ([20, 30]). However, it was later shown that this assumption is not always representative of reality. In particular, real-world distributions often exhibit a heavy tail (i.e., there is still a significant probability of seeing a huge number of packets arriving at some time). To better integrate this into models, heavy-tailed distributions such as the Pareto distribution are often suggested (see for instance [11, 22]). This motivates our choice of distributions for random packet arrival instances. We experiment on the Poisson distribution, the Pareto distribution, and a custom distribution that we introduce, which seems to generate the most challenging instances for our algorithms.
Input distributions.
In all our instances, we fix the subdivision parameter d, which means that every second is split into d time units. We then define an array in which the i-th entry determines how many requests arrive at the i-th time step. Each entry in the array is drawn independently of the others from a distribution D. In the case of a Poisson distribution, we set D = P(1) (the Poisson distribution of mean 1). For the Pareto distribution, we choose D to be the Lomax distribution (which is a special case of the Pareto distribution) with shape parameter chosen so that the mean is 1 ([29]). Finally, we define the iterated Poisson distribution as follows. Fix an integer n > 1 and µ > 0. Draw X_0 ∼ P(µ). Then for i from 1 to n draw X_i ∼ P(X_{i−1}). The final value returned is X_n. This distribution, while still having an expectation of µ, appears to generate more spikes than the classical Poisson distribution. The interest of this distribution in our case is that it generates more challenging instances than the other two (i.e., the competitive ratios of online algorithms are closer to the worst-case bounds). In our experiments, we choose µ = 1 and n = 10. Plots of typical instances under these laws can be seen in appendix C. Note that for all these distributions, the expected value of each entry is 1.

Noisy prediction.
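The iterated Poisson generator defined above, together with the replacement-rate perturbation described in this subsection, can be sketched as follows (our own illustrative code using Knuth's classical Poisson sampler, since the Python standard library has none; the public repository is the reference implementation):

```python
import math
import random

def poisson(mu, rng):
    """Knuth's classical Poisson sampler; adequate for the small means used here."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def iterated_poisson(mu, n, rng):
    """X_0 ~ P(mu), then X_i ~ P(X_{i-1}) for i = 1..n; return X_n.
    The mean stays mu, but 0 is absorbing and large draws feed back into
    the next draw, producing the spiky instances described above."""
    x = poisson(mu, rng)
    for _ in range(n):
        x = poisson(x, rng)
    return x

def perturb(instance, p, sampler, rng):
    """Replacement-rate noise model: independently, delete an entry (set it
    to 0) with probability p and add a fresh sample Y ~ D with probability p."""
    noisy = []
    for v in instance:
        u = 0 if rng.random() < p else v      # deletion
        if rng.random() < p:                  # addition
            u += sampler()
        noisy.append(u)
    return noisy

# A synthetic arrival instance and a perturbed copy used to build the prediction.
rng = random.Random(0)
instance = [iterated_poisson(1.0, 10, rng) for _ in range(1000)]
perturbed = perturb(instance, 0.3, lambda: iterated_poisson(1.0, 10, rng), rng)
```

With p = 0 the instance is untouched, while p = 1 yields an instance entirely independent of the original, matching the extreme points of the replacement rate used in the plots.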
The prediction A is produced as follows. We perturb the real instance with noise, then compute an optimal solution on this perturbed instance and use it as a prediction. More precisely, we introduce a replacement rate p ∈ [0, 1]. Then we go through the instance generated according to some distribution D and, for each entry, with probability p we set this entry to 0 (i.e., we delete this entry) and with probability p we add to this entry a random variable Y ∼ D. Both operations, adding and deleting, are performed independently of each other. We then test our algorithm with different values of the robustness parameter λ ∈ {1, 0.8, 0.6, 0.4}.

Results.
The plots in Figure 4 present the average competitive ratios of Algorithm 6 over experiments for each distribution and each value of λ . As expected, with a perfect prediction, settinga lower λ will yield a much better solution while setting λ = 1 simply means that we run the pureonline algorithm of Buchbinder et al. [5] (that achieves the best possible competitive ratio for the pureonline problem). On the most challenging instances generated by the iterated Poisson distribution(Figure 4c), even with a replacement rate of where the prediction is simply an instance totallyuncorrelated to the real instance, our algorithm maintains good guarantees for small values of λ . Wenote that in all the experiments the competitive ratios achieved by Algorithm 6 are better than therobustness guarantees of Theorem 5, which are { . , . , . , . } for λ ∈ { , . , . , . } respectively. In addition to that, all the competitive ratios degrade smoothly as the error increaseswhich confirms our earlier discussion about smoothness.8 .0 0.2 0.4 0.6 0.8 1.0Replacement rate1.11.21.3 C o m p e t i t i v e r a t i o Lambda = 1Lambda = 0.8Lambda = 0.6Lambda = 0.4 (a) Poisson distribution C o m p e t i t i v e r a t i o Lambda = 1Lambda = 0.8Lambda = 0.6Lambda = 0.4 (b) Pareto distribution C o m p e t i t i v e r a t i o Lambda = 1Lambda = 0.8Lambda = 0.6Lambda = 0.4 (c) Iterated Poisson distribution
Figure 4: Competitive ratios under various distributions and replacement rates from 0 to 1.

In this paper we present the PDLA technique, a learning augmented version of the classic primal-dual technique, and apply it to design algorithms for some classic online problems when a prediction is provided. Since the primal-dual technique is used to solve many more covering problems, such as weighted caching or load balancing [4], an interesting research direction would be to apply PDLA to those problems and (hopefully) obtain tight consistency-robustness trade-off guarantees (such as the one achieved by Algorithm 4 and proved in Lemma 3). In addition, we suspect that this work might provide insights not only for covering but also for some packing problems which are solved with the primal-dual technique in the classic online model (e.g. revenue maximization in ad-auctions [5]). Finally, another interesting direction would be to incorporate predictions into the primal-dual technique when it is used to solve covering problems with a non-linear (e.g. convex) objective function.
Broader Impact
The field of learning augmented algorithms lies at the intersection of machine learning and online algorithms, trying to combine the best of the two worlds. Learning augmented algorithms are particularly suited for critical applications where maintaining worst-case guarantees is mandatory but, at the same time, predictions about the future are possible. Thus, our work represents a stepping stone towards (easily) integrating ML predictions into such applications, thereby increasing the possible benefits of ML to society. PDLA offers a recipe for incorporating predictions into classical online covering problems: first solve the online problem using the primal-dual technique, and then use the prediction to change the rate at which primal and dual variables increase or decrease. We believe that since the idea behind this technique is simple and does not require much domain-specific knowledge, it may be applicable to different problems and can also be implemented in practice.
Acknowledgments and Disclosure of Funding
This research is supported by the Swiss National Science Foundation project 200021-184656 "Randomness in Problem Instances and Randomized Algorithms". Andreas Maggiori was supported by the Swiss National Science Fund (SNSF) grant "Spatial Coupling of Graphical Models in Communications, Signal Processing, Computer Science and Statistical Physics".

References

[1] Noga Alon, Baruch Awerbuch, and Yossi Azar. The online set cover problem. In
Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, STOC '03, pages 100–105, New York, NY, USA, 2003. Association for Computing Machinery. ISBN 1581136749. doi: 10.1145/780542.780558. URL https://doi.org/10.1145/780542.780558.

[2] Antonios Antoniadis, Christian Coester, Marek Elias, Adam Polak, and Bertrand Simon. Online metric algorithms with untrusted predictions, 2020.

[3] N. Bansal, N. Buchbinder, and J. Naor. A primal-dual randomized algorithm for weighted paging. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 507–517, 2007.

[4] Niv Buchbinder and Joseph (Seffi) Naor. The design of competitive online algorithms via a primal-dual approach.
Found. Trends Theor. Comput. Sci., 3(2–3):93–263, February 2009. ISSN 1551-305X. doi: 10.1561/0400000024. URL https://doi.org/10.1561/0400000024.

[5] Niv Buchbinder, Kamal Jain, and Joseph (Seffi) Naor. Online primal-dual algorithms for maximizing ad-auctions revenue. In Lars Arge, Michael Hoffmann, and Emo Welzl, editors,
Algorithms – ESA 2007 , pages 253–264, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg.[6] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. Friendship and mobility: User movementin location-based social networks. In
Proceedings of the 17th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining , KDD ’11, page 1082–1090, NewYork, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450308137. doi:10.1145/2020408.2020579. URL https://doi.org/10.1145/2020408.2020579 .[7] Yihe Dong, Piotr Indyk, Ilya Razenshteyn, and Tal Wagner. Learning space partitionsfor nearest neighbor search. In
Eighth International Conference on Learning Representations (ICLR), April 2020.

[8] Daniel R. Dooly, Sally A. Goldman, and Stephen D. Scott. TCP dynamic acknowledgment delay (extended abstract): theory and practice. In
Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, pages 389–398, 1998.

[9] Rudolf Fleischer. On the Bahncard problem.
Theoretical Computer Science, 268(1):161–174, 2001. ISSN 0304-3975. doi: 10.1016/S0304-3975(00)00266-8. On-line Algorithms '98.

[10] Sreenivas Gollapudi and Debmalya Panigrahi. Online algorithms for rent-or-buy with expert advice. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,
Proceedings of the 36thInternational Conference on Machine Learning , volume 97 of
Proceedings of Machine Learning Research, pages 2319–2327, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/gollapudi19a.html.

[11] Weibo Gong, Yong Liu, Vishal Misra, and Don Towsley. On the tails of web file size distributions. In Proceedings of the 39th Allerton Conference on Communication, Control, and Computing, 2001.

[12] Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. Learning-based frequency estimation algorithms. In International Conference on Learning Representations (ICLR), 2019. URL https://openreview.net/forum?id=r1lohoCqY7.

[13] A. R. Karlin, M. S. Manasse, L. Rudolph, and D. D. Sleator. Competitive snoopy caching. In 27th Annual Symposium on Foundations of Computer Science (FOCS), pages 244–254, 1986.

[14] Anna R. Karlin, Mark S. Manasse, Lyle A. McGeoch, and Susan Owicki. Competitive randomized algorithms for non-uniform problems. In
Proceedings of the First Annual ACM-SIAMSymposium on Discrete Algorithms , SODA ’90, page 301–309, USA, 1990. Society for Indus-trial and Applied Mathematics. ISBN 0898712513.[15] Anna R. Karlin, Claire Kenyon, and Dana Randall. Dynamic tcp acknowledgement andother stories about e/(e-1). In
Proceedings of the Thirty-Third Annual ACM Symposium onTheory of Computing , STOC ’01, page 502–509, New York, NY, USA, 2001. Associationfor Computing Machinery. ISBN 1581133499. doi: 10.1145/380752.380845. URL https://doi.org/10.1145/380752.380845 . 1016] Rohan Kodialam. Optimal algorithms for ski rental with soft machine-learned predictions.
CoRR , abs/1903.00092, 2019. URL http://arxiv.org/abs/1903.00092 .[17] Silvio Lattanzi, Thomas Lavastida, Benjamin Moseley, and Sergei Vassilvitskii. Online schedul-ing via learned weights. In
Proceedings of the 2020 ACM-SIAM Symposium on DiscreteAlgorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020 , pages 1859–1877, 2020.doi: 10.1137/1.9781611975994.114. URL https://doi.org/10.1137/1.9781611975994.114 .[18] Russell Lee, Mohammad H. Hajiesmaili, and Jian Li. Learning-assisted competitive algorithmsfor peak-aware energy scheduling.
CoRR , abs/1911.07972, 2019. URL http://arxiv.org/abs/1911.07972 .[19] Thodoris Lykouris and Sergei Vassilvitskii. Competitive caching with machine learned ad-vice. In
Proceedings of the 35th International Conference on Machine Learning, ICML 2018,Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018 , pages 3302–3311, 2018. URL http://proceedings.mlr.press/v80/lykouris18a.html .[20] M. Marathe and W. Hawe. Predicted capacity of ethernet in a university environment. In
Proceedings of Southcon 1982 , pages 1–10, 1982.[21] Andres Muñoz Medina and Sergei Vassilvitskii. Revenue optimization with approxi-mate bid predictions. In
Advances in Neural Information Processing Systems 30: An-nual Conference on Neural Information Processing Systems 2017, 4-9 December 2017,Long Beach, CA, USA , pages 1858–1866, 2017. URL http://papers.nips.cc/paper/6782-revenue-optimization-with-approximate-bid-predictions .[22] Michael Mitzenmacher. Dynamic models for file sizes and double pareto distributions.
In-ternet Math. , 1(3):305–333, 2003. URL https://projecteuclid.org:443/euclid.im/1109190964 .[23] Michael Mitzenmacher. A model for learned bloom filters and optimizing by sand-wiching. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,and R. Garnett, editors,
Advances in Neural Information Processing Systems 31 , pages464–473. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7328-a-model-for-learned-bloom-filters-and-optimizing-by-sandwiching.pdf .[24] Michael Mitzenmacher. Scheduling with Predictions and the Price of Misprediction. arXive-prints , art. arXiv:1902.00732, February 2019.[25] Manish Purohit, Zoya Svitkina, and Ravi Kumar. Improving online algorithms via MLpredictions. In
Advances in Neural Information Processing Systems 31: Annual Con-ference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December2018, Montréal, Canada , pages 9684–9693, 2018. URL http://papers.nips.cc/paper/8174-improving-online-algorithms-via-ml-predictions .[26] Dhruv Rohatgi. Near-optimal bounds for online caching with machine learned advice. In
Proceedings of the Thirty-First Annual ACM-SIAM Symposium on Discrete Algorithms , SODA’20, page 1834–1845, USA, 2020. Society for Industrial and Applied Mathematics.[27] Steven S. Seiden. A guessing game and randomized online algorithms. In
Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, STOC '00, pages 592–601, New York, NY, USA, 2000. Association for Computing Machinery. ISBN 1581131844. doi: 10.1145/335305.335385. URL https://doi.org/10.1145/335305.335385.

[28] Shufan Wang, Jian Li, and Shiqiang Wang. Online algorithms for multi-shop ski rental with machine learned predictions. arXiv e-prints, art. arXiv:2002.05808, February 2020.

[29] Wikipedia contributors. Lomax distribution, 2004. URL https://en.wikipedia.org/wiki/Lomax_distribution. [Online; accessed 18-May-2020].

[30] Michael Wilson. A historical view of network traffic models. 2006.

[31] Yinfeng Xu and Weijun Xu. Competitive algorithms for online leasing problem in probabilistic environments. In
Advances in Neural Networks - ISNN 2004, International Symposium on Neural Networks, Dalian, China, August 19-21, 2004, Proceedings, Part II, pages 725–730, 2004. doi: 10.1007/978-3-540-28648-6_116. URL https://doi.org/10.1007/978-3-540-28648-6_116.
A Missing proofs for Set Cover
We first present the slightly modified algorithm, where we do not need the prediction to form a feasible solution. As mentioned in the main paper, when an element e is uncovered by the prediction, i.e. |F(e) ∩ A| = 0, we simply run the purely online algorithm (λ = 1).

Algorithm 7 PDLA for Online Weighted Set Cover
Input: λ, A
Initialize: x_S ← 0, y_e ← 0 for all S, e
for all elements e that just arrived do
  while ∑_{S ∈ F(e)} x_S < 1 do
    for all S ∈ F(e) do
      if |F(e) ∩ A| ≥ 1 then
        /* Primal update (more aggressive if 1{S ∈ A} = 1) */
        x_S ← x_S · (1 + 1/w_S) + λ/(w_S · |F(e)|) + (1 − λ) · 1{S ∈ A}/(w_S · |F(e) ∩ A|)
      else
        /* e is not covered by the prediction */
        x_S ← x_S · (1 + 1/w_S) + 1/(w_S · |F(e)|)
      end if
    end for
    /* Dual update */
    y_e ← y_e + 1
  end while
end for

We start by proving that the dual constraints are violated only by a multiplicative factor of O(log(d/λ)). Thus, scaling down the dual solution of Algorithm 7 by O(log(d/λ)) creates a feasible dual solution, which will permit us to use weak duality.

Lemma 6.
Let y be the dual solution built by Algorithm 7. Then y/Θ(log(d/λ)) is a feasible solution to the dual problem.

Proof. The proof essentially follows the same path as in [4]. The only constraints that can be violated are of the form ∑_{e ∈ S} y_e ≤ w_S for some S ∈ F. Consider one such constraint. At every update of the primal variable x_S the sum ∑_{e ∈ S} y_e increases by 1, since the dual variable corresponding to the newly arrived element increases by 1. We prove by induction on the number of such updates that at any point in time

x_S ≥ (λ/d) · ((1 + 1/w_S)^{∑_{e ∈ S} y_e} − 1).

Indeed, when no update concerning S has been done we have x_S = 0 and ∑_{e ∈ S} y_e = 0. Suppose the claim holds after k updates of the variable x_S, i.e. ∑_{e ∈ S} y_e = k. Now assume that a newly arrived element e* ∈ S provokes a primal update from x_S^old to x_S^new and increases its dual value by one, i.e. y_{e*}^new = y_{e*}^old + 1. Then we always have

x_S^new ≥ x_S^old · (1 + 1/w_S) + min{ 1/(|F(e)| · w_S), λ/(|F(e)| · w_S) + (1 − λ) · 1{S ∈ A}/(|F(e) ∩ A| · w_S) } ≥ x_S^old · (1 + 1/w_S) + λ/(d · w_S).

Thus, by the induction hypothesis,

x_S^new ≥ (λ/d) · ((1 + 1/w_S)^{∑_{e ∈ S∖{e*}} y_e + y_{e*}^old} − 1) · (1 + 1/w_S) + λ/(d · w_S) = (λ/d) · ((1 + 1/w_S)^{∑_{e ∈ S∖{e*}} y_e + y_{e*}^new} − 1) = (λ/d) · ((1 + 1/w_S)^{∑_{e ∈ S} y_e} − 1).

Since w_S ≥ 1, we have (1 + 1/w_S)^{w_S} ≥ 2, thus:

x_S ≥ (λ/d) · ((1 + 1/w_S)^{w_S · (∑_{e ∈ S} y_e)/w_S} − 1) ≥ (λ/d) · (2^{(∑_{e ∈ S} y_e)/w_S} − 1).

We continue by upper bounding the value of x_S. Note that once x_S ≥ 1, no more primal updates can happen; therefore whenever an update is made we have x_S < 1 just before the update.
Thus:

x_S^new ≤ x_S^old · (1 + 1/w_S) + max{ λ/(w_S · |F(e)|) + (1 − λ) · 1{S ∈ A}/(w_S · |F(e) ∩ A|), 1/(w_S · |F(e)|) } ≤ 2 · x_S^old + 1 ≤ 3.

Combining the lower and upper bounds on x_S we get that

∑_{e ∈ S} y_e ≤ log₂(3d/λ + 1) · w_S = O(log(d/λ)) · w_S,

which concludes the proof.

Lemma 7 (Robustness). The competitive ratio of Algorithm 7 is always bounded by O(log(d/λ)).

Proof.
We denote as before by x_S^old and x_S^new the primal variables before and after the update, respectively. Each time the while loop is executed we have ∑_{S ∈ F(e)} x_S^old < 1, and the increase in the dual is ∆D = 1. Denote by δx_S = x_S^new − x_S^old the increase of the variable for a specific set S. If an element is covered by the prediction then it holds that:

∆P = ∑_{S ∈ F(e)} w_S · δx_S = ∑_{S ∈ F(e) ∩ A} w_S · δx_S + ∑_{S ∈ F(e) ∖ (F(e) ∩ A)} w_S · δx_S = ∑_{S ∈ F(e)} (x_S^old + λ/|F(e)|) + ∑_{S ∈ F(e) ∩ A} (1 − λ)/|F(e) ∩ A| = ∑_{S ∈ F(e)} x_S^old + λ + 1 − λ ≤ 2.

By repeating the same calculation we get that if an element is uncovered by the prediction then:

∆P = ∑_{S ∈ F(e)} w_S · δx_S = ∑_{S ∈ F(e)} (x_S^old + 1/|F(e)|) = ∑_{S ∈ F(e)} x_S^old + 1 ≤ 2.

Overall we have that:
1. At any iteration, ∆P/∆D ≤ 2.
2. The final primal solution is feasible.
3. By Lemma 6, denoting by y the final dual solution, y/Θ(log(d/λ)) is feasible.

Thus, by weak duality, the competitive ratio of Algorithm 7 is upper bounded by 2 · O(log(d/λ)) = O(log(d/λ)).

In the following we do not assume that our prediction A forms a feasible solution. Therefore we will denote by
1. S(A, I) the cost of the (possibly partial) covering if prediction A is followed blindly,
2. C_nc the cost of optimally covering the elements which are not covered by the prediction,
3. c_PDLA(A, I, λ) the cost of the covering solution computed by Algorithm 7.

Lemma 8 (Consistency). c_PDLA(A, I, λ) ≤ O(1/(1 − λ)) · S(A, I) + O(log d) · C_nc.

Proof. We split the analysis in two parts. First, we look at the case when an element which is uncovered by the prediction arrives. In this case Algorithm 7 emulates the pure online algorithm (λ = 1).
More precisely, by the same calculations as before, we can show that y_nc, the solution of the dual problem restricted to the uncovered elements, satisfies the property that y_nc/O(log d) is feasible. Therefore, for those elements, by Lemma 7 the cost of Algorithm 7 is upper bounded by O(log d) · C_nc.

We turn our attention to the more interesting case where the prediction covers an element. In this case, after the execution of the while loop we decompose the primal increase into two parts: ∆P_c, which denotes the increase due to sets S chosen by A (1{S ∈ A} = 1), and ∆P_u, which denotes the increase due to sets S not chosen by the prediction (1{S ∈ A} = 0); thus we have ∆P = ∆P_c + ∆P_u. Let c = {S ∈ F(e) : 1{S ∈ A} = 1} and u = {S ∈ F(e) : 1{S ∈ A} = 0}. We then have:

∆P_c = ∑_{S ∈ c} x_S + λ · |c|/(|c| + |u|) + 1 − λ ≥ λ/d + 1 − λ,
∆P_u = ∑_{S ∈ u} x_S + λ · |u|/(|c| + |u|) ≤ 1 + λ,

since ∑_S x_S < 1, |c|/(|c| + |u|) ≥ 1/d and |u|/(|c| + |u|) ≤ 1. Combining the two bounds we get ∆P_u ≤ (1 + λ)/(λ/d + 1 − λ) · ∆P_c and consequently:

∆P ≤ (1 + (1 + λ)/(λ/d + 1 − λ)) · ∆P_c = O(1/(1 − λ)) · ∆P_c.

Since the cost increase ∆P_c is caused by sets which are selected by the prediction, we can charge this cost to the corresponding increase of S(A, I), losing only a multiplicative O(1) factor. Combining the two cases concludes the proof.

B Missing proofs for ski rental and the Bahncard problem
We detail here the missing proofs from section 3. We first prove our results regarding the ski rentalproblem and then focus on the Bahncard problem.
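As a companion to the proofs that follow, the fractional update of the PDLA ski rental algorithm can be sketched in Python. This is our reconstruction from the update rule analyzed in Lemma 9 below (big updates at rate 1/(e(λ)−1) when the prediction says buy, small updates at rate 1/(e(1/λ)−1) otherwise); the function and variable names are ours, not the paper's.

```python
def e_(z, B):
    # e(z) = (1 + 1/B)^(z*B), which tends to exp(z) as B grows
    return (1.0 + 1.0 / B) ** (z * B)

def pdla_ski_rental(B, n_days, n_pred, lam):
    """Fractional PDLA for ski rental (sketch).

    x is the fraction of the skis bought so far; every day with x < 1
    we rent the remaining 1 - x fraction and increase x.  If the
    prediction says buy (n_pred > B) we use the aggressive rate
    1/(e(lam) - 1); otherwise the cautious rate 1/(e(1/lam) - 1).
    Returns the total fractional cost over n_days actual skiing days.
    """
    rate = e_(lam, B) - 1.0 if n_pred > B else e_(1.0 / lam, B) - 1.0
    x, cost = 0.0, 0.0
    for _ in range(n_days):
        if x >= 1.0:
            break                      # skis fully bought: no further cost
        dx = (x + 1.0 / rate) / B      # multiplicative-additive update
        cost += B * dx + (1.0 - x)     # buy dx of the skis, rent the rest
        x += dx
    return cost
```

Note that each paying day costs exactly 1 + 1/rate, so with λ = 1 and a prediction to buy, at most about B such days occur, each of cost roughly e/(e−1), in line with the bounds proved below.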
B.1 The ski rental problem
We provide here a full proof of Theorem 2. In our setting, the prediction A is the predicted number of skiing days N_pred, and S(A, I) = S(N_pred, I) = B · 1{N_pred > B} + N · 1{N_pred ≤ B} is the cost of following the prediction blindly. We first prove an easy lemma about the feasibility of the dual solution.

Lemma 9.
Let y be the dual solution built by Algorithm 4. Then y is a feasible solution (assuming B/λ is integral if the prediction suggests to rent).

Proof. To see this, note that the only constraint that might be violated is the constraint ∑_{j ∈ [N]} y_j ≤ B. Denote by S the value of the sum ∑_{j ∈ [N]} y_j. Note that once x ≥ 1, the value of S never changes anymore. The value of S increases by 1 for every big update and by λ for every small update. In the case N_pred > B, the algorithm always does big updates (the prediction suggests to buy). We claim that at most ⌈λB⌉ big updates can be made before x ≥ 1. We denote by x(k) the value of x after k updates. We then prove by induction that

x(k) ≥ (e(k/B) − 1)/(e(λ) − 1)

(recall that e(z) = (1 + 1/B)^{z·B} ≈ e^z). Clearly, if k = 0, we have x(0) ≥ 0. Now assume this is the case after k updates; we then have

x(k + 1) = (1 + 1/B) · x(k) + 1/((e(λ) − 1) · B) ≥ (1 + 1/B) · (e(k/B) − 1)/(e(λ) − 1) + 1/((e(λ) − 1) · B) = ((1 + 1/B) · (e(k/B) − 1) + 1/B)/(e(λ) − 1) = (e((k + 1)/B) − 1)/(e(λ) − 1),

which ends the induction. Hence at most ⌈λB⌉ ≤ B big updates can be made before x ≥ 1. This implies that S ≤ B at the end of the algorithm. In the case where N_pred ≤ B, we prove in exactly the same way that at most ⌈B/λ⌉ updates are performed before x ≥ 1. Hence we have S ≤ λ · ⌈B/λ⌉. By assumption, B/λ is an integer, hence S ≤ B and y is again feasible.

We can now finish the main proof.

Proof of Theorem 2.
We first prove the robustness bound. By Lemma 9, we know that the dual solution is feasible. Hence what remains is to upper bound the ratio ∆P/∆D and use weak duality. In the case of a big update we have

∆P/∆D = ∆P = 1 + 1/(e(λ) − 1) = 1/(1 − e(−λ)).

In the case of a small update we have

∆P/∆D = ∆P/λ = (1/λ) · 1/(1 − e(−1/λ)) ≤ 1/(1 − e(−λ)),

where the last inequality comes from Lemma 19, inequality (2). By weak duality, we have the robustness bound.

To prove consistency, we have two cases. If N_pred ≤ B, then Algorithm 4 does at most N updates, each of cost at most 1/(1 − e(−1/λ)), while the prediction A pays a cost of N. Noting again that, by Lemma 19, 1/(1 − e(−1/λ)) ≤ λ/(1 − e(−λ)) ends the proof of consistency in this case. The other case is different. As in the proof of Lemma 9, we still have x(k) ≥ (e(k/B) − 1)/(e(λ) − 1), hence at most ⌈λB⌉ ≤ B updates are done by the algorithm, each of cost at most 1/(1 − e(−λ)), hence a total cost of at most ⌈λB⌉/(1 − e(−λ)). Since we assume in this case that λB is integral and the prediction A pays a cost of B, the competitive ratio is indeed λ/(1 − e(−λ)).

B.2 The Bahncard problem

History of the problem.
The Bahncard problem, initially introduced in [9], models a situation where a tourist makes multiple trips every day. Before any new trip, the tourist has two choices: either buy a ticket for that particular trip at a cost of 1, or buy a discount card at a cost of B and use this discount card to get the ticket for a price of β < 1. The discount card is then valid for the rest of that day and for the next T − 1 days. This generalizes the ski rental problem in several ways: first, the discount expires after a fixed amount of time; second, buying only offers a discount and not a free trip. Note that if β = 0 and T → ∞ we recover the ski rental problem. Karlin et al. [15] designed an optimal randomized online algorithm of competitive ratio (e − β)/(e − 1) when B → ∞.

PDLA for the Bahncard problem.
We design, using PDLA, a learning augmented algorithm for the Bahncard problem. The final goal is to prove Theorem 4. An interesting feature of our algorithm is that, as for the TCP acknowledgement problem, it does not need to be given the full prediction in advance. If Bahncards are bought by the prediction A at a set of times {t_1, t_2, ..., t_k}, the algorithm does not need to know before time t_i that Bahncard i is bought. For instance, one can think of the prediction as an employee of the station giving short-term advice to a traveller every time he shows up at the station.

We now give the primal-dual formulation of the Bahncard problem along with its corresponding learning augmented algorithm. We mention that, to the best of our knowledge, no online algorithm using the primal-dual method was designed for this problem before; hence the primal-dual formulation (Figure 5) of the problem is new. In an integral solution, we would have x_t = 1 if the solution buys a Bahncard at time t and x_t = 0 otherwise. Then f_j represents the fractional amount of trip j, occurring at time t(j), that is bought at full price, and d_j the amount of the trip bought at the discounted price. The first natural constraint, d_j + f_j ≥ 1, says that each trip should be paid for entirely, either at the discounted or at the full price. We then have the constraint ∑_{t = t(j)−T}^{t(j)} x_t ≥ d_j, which says that to be able to buy a ticket at the discounted price, at least one Bahncard must have been bought in the last T time steps.

Figure 5: Primal-dual formulation of the Bahncard problem.
Primal:
  minimize B · ∑_{t ∈ T} x_t + ∑_{j ∈ M} (β · d_j + f_j)
  subject to: d_j + f_j ≥ 1 ∀j
              ∑_{t = t(j)−T}^{t(j)} x_t ≥ d_j ∀j
              x_t ≥ 0 ∀t ∈ T;  d_j, f_j ≥ 0 ∀j

Dual:
  maximize ∑_{j ∈ M} c_j
  subject to: c_j ≤ 1 ∀j
              c_j − b_j ≤ β ∀j
              ∑_{j : t(j)−T ≤ t ≤ t(j)} b_j ≤ B ∀t ∈ T
              c_j, b_j ≥ 0 ∀j

Following the same idea as for the ski rental problem, we guide the updates in the primal-dual algorithm with the advice provided. We define the function e(z) = (1 + (1 − β)/B)^{z · B/(1 − β)}. Again, for B/(1 − β) → ∞, the reader should think of e(z) intuitively as e^z. The parameter z takes value λ or 1/λ, depending on whether we want to do a big or a small update in the primal. As for ski rental, when we do a small update, we need to scale down the dual update by a factor of λ to maintain feasibility of the dual solution.

The rule to decide whether an update should be big or small is the following: if the prediction A bought a Bahncard less than T time steps in the past (i.e. if the predicted solution currently has a valid Bahncard), the update is big; otherwise the update is small. In Algorithm 8, we denote by l_A(t) the latest time before time t at which the prediction A bought a Bahncard, with the convention that l_A(t) = −∞ if no Bahncard was bought before time t. Of course, in this problem it is possible that trips show up while the fractional solution already has a full Bahncard available (i.e. ∑_{t = t(j)−T}^{t(j)} x_t ≥ 1). In this case there is no point in buying more fractions of a Bahncard, and the algorithm does what we call a minimal update.

Algorithm 8
PDLA Online Primal-Dual for the Bahncard problem
Input: λ, A
Initialize: x, d, f ← 0; c, b ← 0
for all trips j do
  if ∑_{t = t(j)−T}^{t(j)} x_t ≥ 1 then   /* minimal update */
    d_j ← 1;  c_j ← β
  end if
  if ∑_{t = t(j)−T}^{t(j)} x_t < 1 then
    if t(j) ≤ l_A(t(j)) + T then   /* big update */
      d_j ← ∑_{t = t(j)−T}^{t(j)} x_t;  f_j ← 1 − d_j
      x_{t(j)} ← x_{t(j)} + ((1 − β)/B) · (∑_{t = t(j)−T}^{t(j)} x_t + 1/(e(λ) − 1))
      b_j ← 1 − β;  c_j ← b_j + β
    end if
    if t(j) > l_A(t(j)) + T then   /* small update */
      d_j ← ∑_{t = t(j)−T}^{t(j)} x_t;  f_j ← 1 − d_j
      x_{t(j)} ← x_{t(j)} + ((1 − β)/B) · (∑_{t = t(j)−T}^{t(j)} x_t + 1/(e(1/λ) − 1))
      b_j ← λ · (1 − β);  c_j ← b_j + β
    end if
  end if
end for

We first prove that the dual solution built by the algorithm is almost feasible.
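For reference in the proofs, the per-trip update of Algorithm 8 can be condensed into the following sketch. This is our reconstruction: the function name, the `(dx, d_j, f_j)` tuple interface, and the variable names are ours, not the paper's.

```python
def bahncard_update(x_window, B, beta, lam, predicted_valid):
    """One PDLA update for a trip j (sketch).

    x_window is the sum of x_t over the current T-step window.
    Returns (dx, d_j, f_j): the increase of x_{t(j)} and the
    discounted / full-price fractions bought for this trip.
    """
    if x_window >= 1.0:
        return 0.0, 1.0, 0.0                      # minimal update
    z = lam if predicted_valid else 1.0 / lam     # big vs. small update
    # e(z) = (1 + (1-beta)/B)^(z*B/(1-beta)), roughly exp(z) for large B
    e_z = (1.0 + (1.0 - beta) / B) ** (z * B / (1.0 - beta))
    dx = (1.0 - beta) / B * (x_window + 1.0 / (e_z - 1.0))
    return dx, x_window, 1.0 - x_window
```

One can check that the primal increase B·dx + β·d_j + f_j of a non-minimal update equals (e(z) − β)/(e(z) − 1) regardless of x_window, which is exactly the quantity bounded in the lemmas below.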
Lemma 10.
Let (c, b) be the dual solution built by Algorithm 8. Then (c, b)/(1 + (1 − β)/B) is feasible.

Proof. Note that the constraints c_j ≤ 1 and c_j − b_j ≤ β are clearly maintained by the algorithm, and scaling both c and b down by some factor larger than 1 does not alter their feasibility. Hence we focus only on the constraints of the form ∑_{j : t(j)−T ≤ t ≤ t(j)} b_j ≤ B for a fixed time t. Note that during a minimal update the value of b_j is not changed, hence only small or big updates can alter the value of the sum ∑_{j : t(j)−T ≤ t ≤ t(j)} b_j. As in the proofs for ski rental, denote by b the number of big updates that are counted in this sum and by s the number of small updates.

We first notice that once ∑_{t' = t}^{t+T} x_{t'} ≥ 1, no updates that alter the constraint ∑_{j : t(j)−T ≤ t ≤ t(j)} b_j can happen. To see this, note that upon arrival of a trip j between time t and t + T, we have ∑_{t' = t(j)−T}^{t(j)} x_{t'} ≥ ∑_{t' = t}^{t(j)} x_{t'} = ∑_{t' = t}^{t+T} x_{t'} (the last equality holds since the variables x_{t'} for t' > t(j) are still 0 at this point).

Denote by S the value of the sum ∑_{t' = t}^{t+T} x_{t'}. Note that for a big update, the value of S increases to at least S · (1 + (1 − β)/B) + ((1 − β)/B) · 1/(e(λ) − 1). Similarly, for a small update the new value of the sum is at least S · (1 + (1 − β)/B) + ((1 − β)/B) · 1/(e(1/λ) − 1). Hence we can apply Lemma 20 directly with d = B/(1 − β) to conclude that once b + λs ≥ B/(1 − β), we have S ≥ 1.

Since a big update increases the sum ∑_{j : t(j)−T ≤ t ≤ t(j)} b_j by 1 − β and a small update increases it by λ(1 − β), we see that the first time the constraint ∑_{j : t(j)−T ≤ t ≤ t(j)} b_j ≤ B is violated, we have S ≥ 1.
Now, since each update adds at most 1 − β to the sum ∑_{j : t(j)−T ≤ t ≤ t(j)} b_j, we can conclude that at the end of the algorithm we have ∑_{j : t(j)−T ≤ t ≤ t(j)} b_j ≤ B + 1 − β, hence the conclusion.

We then prove the robustness of Algorithm 8 with the following lemma.

Lemma 11 (Robustness). For any λ ∈ (0, 1] and any β ∈ [0, 1), PDLA for the Bahncard problem is ((e(λ) − β) · (1 + (1 − β)/B)/(e(λ) − 1))-robust.

Proof. Algorithm 8 makes three possible types of updates. For a minimal update, we have ∆P = ∆D = β. For a small update we have

∆P = (1 − β) · (∑_{t = t(j)−T}^{t(j)} x_t + 1/(e(1/λ) − 1)) + β · ∑_{t = t(j)−T}^{t(j)} x_t + 1 − ∑_{t = t(j)−T}^{t(j)} x_t = 1 + (1 − β)/(e(1/λ) − 1) = (e(1/λ) − β)/(e(1/λ) − 1)

and ∆D = λ(1 − β) + β = β(1 − λ) + λ, hence the ratio is

∆P/∆D = (1/(β(1 − λ) + λ)) · (e(1/λ) − β)/(e(1/λ) − 1).

Similarly, in the case of a big update we have ∆P = (e(λ) − β)/(e(λ) − 1) and ∆D = 1, which gives a ratio of

∆P/∆D = (e(λ) − β)/(e(λ) − 1).

We can conclude by Lemma 19 (inequality (7)) that the ratio of primal cost increase to dual cost increase is always bounded by ∆P/∆D ≤ (e(λ) − β)/(e(λ) − 1). Using Lemma 10 along with weak duality is enough to conclude that the cost of the fractional solution built by the algorithm is bounded as follows:

cost_PDLA(A, I, λ) ≤ ((e(λ) − β) · (1 + (1 − β)/B)/(e(λ) − 1)) · OPT,

which ends the proof.

For consistency, we analyze the algorithm's cost in two parts. When the heuristic algorithm A buys its i-th Bahncard at some time t_i, define the interval I_i = [t_i, t_i + T], which represents the set of times during which this specific Bahncard is valid. This creates a family of intervals I_1, ..., I_k if A buys k Bahncards. Note that we can assume that all these intervals are disjoint, since if the prediction A suggests buying a new Bahncard before the previous one expires, it is always better to postpone this purchase to the end of the validity of the current Bahncard.

Lemma 12.
Denote by (∆P)_{I_i} the increase in the primal cost of Algorithm 8 during interval I_i and by cost(A)_{I_i} what the prediction A pays during this same interval I_i (including the purchase of the Bahncard at the beginning of I_i). Then, for all i, we have

(∆P)_{I_i}/cost(A)_{I_i} ≤ (⌈λ · B/(1 − β)⌉/(B + β · ⌈λ · B/(1 − β)⌉)) · (e(λ) − β)/(e(λ) − 1).

Proof.
Assume that m trips are requested during the interval I_i. Then we first have cost(A)_{I_i} = B + βm (A buys a Bahncard and then pays the discounted price for every trip in the interval I_i).

As for Algorithm 8, for each trip j we are in one of the first two cases: either ∑_{t = t(j)−T}^{t(j)} x_t ≥ 1, in which case the increase in the primal is ∆P = β, or we are in the second case, in which case the increase in the primal is

∆P = (1 − β) · (∑_{t = t(j)−T}^{t(j)} x_t + 1/(e(λ) − 1)) + β · ∑_{t = t(j)−T}^{t(j)} x_t + 1 − ∑_{t = t(j)−T}^{t(j)} x_t = 1 + (1 − β)/(e(λ) − 1).

We claim that updates of the second case can happen at most ⌈λ · B/(1 − β)⌉ times during the interval I_i. To see this, denote by S(l) the value of ∑_{t' ≥ t_i} x_{t'} after l big updates in the interval I_i. Note that once ∑_{t' ≥ t_i} x_{t'} ≥ 1, big updates cannot happen anymore. Hence all we need to prove is that S(⌈λ · B/(1 − β)⌉) ≥ 1. We prove by induction that

S(k) ≥ (e(k · (1 − β)/B) − 1)/(e(λ) − 1).

This is indeed true for k = 0, as S(0) is the value of ∑_{t' ≥ t_i} x_{t'} before any big update was made in I_i, hence S(0) ≥ 0. Now assume this holds for some k and compute

S(k + 1) ≥ (1 + (1 − β)/B) · S(k) + ((1 − β)/B) · 1/(e(λ) − 1) ≥ ((1 + (1 − β)/B) · (e(k · (1 − β)/B) − 1) + (1 − β)/B)/(e(λ) − 1) = (e((k + 1) · (1 − β)/B) − 1)/(e(λ) − 1),

which concludes the induction.

Hence on the interval I_i, the total increase in the cost of the solution can be bounded as follows:

(∆P)_{I_i} ≤ min{⌈λ · B/(1 − β)⌉, m} · (1 + (1 − β)/(e(λ) − 1)) + max{0, m − ⌈λ · B/(1 − β)⌉} · β.

One can see that the worst possible case for the ratio (∆P)_{I_i}/cost(A)_{I_i} is obtained for m = ⌈λ · B/(1 − β)⌉ and is bounded by

(∆P)_{I_i}/cost(A)_{I_i} ≤ (⌈λ · B/(1 − β)⌉ · (1 + (1 − β)/(e(λ) − 1)))/(B + β · ⌈λ · B/(1 − β)⌉) = (⌈λ · B/(1 − β)⌉/(B + β · ⌈λ · B/(1 − β)⌉)) · (e(λ) − β)/(e(λ) − 1).

We then consider times t that do not belong to any interval I_i. More precisely, we upper bound the value (∆P)_j, i.e. the increase in the cost of the primal solution due to a trip j such that t(j) does not belong to any interval I_i. Note that in this case the prediction always pays a cost of 1.

Lemma 13.
For any trip $j$ such that $t(j) \notin \bigcup_i I_i$, we have that $(\Delta P)_j \leq \frac{e(1/\lambda)-\beta}{e(1/\lambda)-1}$.

Proof.
Note that for such a trip Algorithm 8 pays either the cost of a small update, which is $1 + \frac{1-\beta}{e(1/\lambda)-1} = \frac{e(1/\lambda)-\beta}{e(1/\lambda)-1}$, or the cost of a minimal update, which is $\beta$.

For simplicity and better readability, we will formulate the final theorem of this section only for $B/(1-\beta) \to \infty$.

Theorem (Theorem 4 restated). For any $\lambda \in (0,1]$, any $\beta \in [0,1)$ and $B/(1-\beta) \to \infty$, we have the following guarantees on any instance $I$:
\[
\mathrm{cost}_{\mathrm{PDLA}}(\mathcal{A}, I, \lambda) \leq \min\left\{ \frac{\lambda}{1-\beta+\lambda\beta}\cdot\frac{e^{\lambda}-\beta}{e^{\lambda}-1}\cdot S(\mathcal{A}, I),\; \frac{e^{\lambda}-\beta}{e^{\lambda}-1}\cdot \mathrm{OPT} \right\}
\]
Proof.
By taking the limit in Lemma 11, we see that the cost of the solution output by Algorithm 8 is at most $\frac{e^{\lambda}-\beta}{e^{\lambda}-1}\cdot \mathrm{OPT}$, which proves the second bound in the theorem.

For the first bound, note that we can write the final cost of the solution as
\[
\mathrm{cost}_{\mathrm{PDLA}}(\mathcal{A}, I, \lambda) = \Delta P = \sum_i (\Delta P)_{I_i} + \sum_{j:\, t(j)\notin \bigcup_i I_i} (\Delta P)_j.
\]
By taking the limit in Lemma 12 we get that
\[
\sum_i (\Delta P)_{I_i} \leq \frac{\lambda}{1-\beta+\beta\lambda}\cdot\frac{e^{\lambda}-\beta}{e^{\lambda}-1}\cdot \sum_i \mathrm{cost}(\mathcal{A})_{I_i},
\]
and by taking the limit in Lemma 13, we get that
\[
\sum_{j:\, t(j)\notin \bigcup_i I_i} (\Delta P)_j \leq \frac{e^{1/\lambda}-\beta}{e^{1/\lambda}-1}\cdot \sum_{j:\, t(j)\notin \bigcup_i I_i} \mathrm{cost}(\mathcal{A})_j.
\]
By using Lemma 19 (inequality (6)), we see that
\[
\max\left\{ \frac{\lambda}{1-\beta+\beta\lambda}\cdot\frac{e^{\lambda}-\beta}{e^{\lambda}-1},\; \frac{e^{1/\lambda}-\beta}{e^{1/\lambda}-1} \right\} = \frac{\lambda}{1-\beta+\beta\lambda}\cdot\frac{e^{\lambda}-\beta}{e^{\lambda}-1},
\]
which ends the proof.

We finish this section by proving that a fractional solution can be rounded online into a randomized integral solution whose expected cost is equal to the cost of the fractional solution. While the rounding is very similar to the existing rounding of Buchbinder et al. [5] for ski rental or TCP acknowledgement, we include it here for completeness, as the Bahncard problem was never solved in a primal-dual way before. The argument is summarized in the following lemma. Lemma 14.
Given a fractional solution $(x, d, f)$ to the Bahncard problem, it can be rounded online into an integral solution of expected cost equal to the fractional cost of $(x, d, f)$.

Proof. Choose some real number $p$ uniformly at random in the interval $[0,1)$. Then arrange the variables $x_t$ on the real line iteratively as follows: each time $t$ takes an interval $I_t$ of length $x_t$ right after the interval taken by $x_{t-1}$. Then buy a Bahncard at every time $t$ such that the interval corresponding to time $t$ contains the real number $p+k$ for some integer $k$. We check first that the expected buying cost is
\[
B\cdot\sum_t \Pr\left(\exists k:\; p+k \in I_t\right) = B\cdot\sum_t x_t.
\]
Next, to compute the total expected price of the tickets, notice that if a Bahncard was bought in the previous $T$ time steps, we can pay a discounted price, otherwise we need to pay the full price of $1$. For a trip $j$, the probability that a Bahncard was bought in the previous $T$ time steps is at least $\sum_{t=t(j)-T}^{t(j)} x_t$. Hence with probability at least $\sum_{t=t(j)-T}^{t(j)} x_t \geq d_j$ we pay a price of $\beta$, and with probability at most $1 - d_j \leq f_j$ we pay a price of $1$, which ends the proof.

C Missing proofs for TCP
C.1 Plots of instances
We briefly show in Figures 6, 7, and 8 how typical instances look under the various distributions (the vertical axis of each plot is the number of requests).

Figure 6: Typical instance under Poisson distribution
Figure 7: Typical instance under Pareto distribution
Figure 8: Typical instance under iterated Poisson distribution

C.2 Theoretical analysis
Recall that we define in this section $e(z) = (1+1/d)^{z\cdot d}$, which will be roughly equal to $e^z$ for big $d$. The big updates are then the updates where $z$ is set to $\lambda$, and during a small update $z$ is set to $1/\lambda$. We first analyze the consistency of this algorithm. To this end, denote by $n_{\mathcal{A}}$ the number of acknowledgements sent by $\mathcal{A}$ and by $\mathrm{latency}(\mathcal{A})$ the latency paid by the prediction $\mathcal{A}$. Lemma 15.
For any $\lambda \in (0,1]$ and any $d > 0$,
\[
\mathrm{cost}_{\mathrm{PDLA}}(\mathcal{A}, I, \lambda) \leq n_{\mathcal{A}}\cdot\frac{\lceil \lambda d\rceil}{d}\cdot\frac{1}{1-e(-\lambda)} + \mathrm{latency}(\mathcal{A})\cdot\frac{1}{1-e(-1/\lambda)}.
\]
Proof.
We will use a charging argument to analyze the performance of Algorithm 6. Note that for a small update, the increase in cost of the fractional solution is
\[
\Delta P = \frac{1}{d}\left(1 - \sum_{k=t(j)}^{t} x_k\right) + \frac{1}{d}\left(\sum_{k=t(j)}^{t} x_k + \frac{1}{e(1/\lambda)-1}\right) = \frac{1}{d}\cdot\frac{1}{1-e(-1/\lambda)}.
\]
However, for every small update that is made, it must be that $\mathcal{A}$ pays a latency of at least $1/d$. Hence the total cost of small updates made by Algorithm 6 is at most $\mathrm{latency}(\mathcal{A})\cdot\frac{1}{1-e(-1/\lambda)}$.

Secondly, we bound the total cost of big updates of our algorithm. Let $t_0$ be a time at which $\mathcal{A}$ sends an acknowledgement. Let $Y$ be the set of big updates made because of jobs $j$ that are acknowledged at time $t_0$ by $\mathcal{A}$ (these big updates are hence made at some time $t \geq t_0$). We claim that $|Y| \leq \lceil \lambda d\rceil$. To prove this, denote by $S(l)$ the value of $\sum_{k=t_0}^{+\infty} x_k$ after $l$ such big updates (there might be small updates influencing this value, but only to make it bigger). Notice that once $\sum_{k=t_0}^{+\infty} x_k \geq 1$, there is no remaining update in $Y$. We prove by induction that
\[
S(l) \geq \frac{(1+1/d)^{l}-1}{(1+1/d)^{\lambda d}-1}.
\]
This is clear for $l=0$ as $S(0) \geq 0$. Now assume this is the case for some value $l$ and apply a big update at time $t$ for a job $j$ to get
\[
S(l+1) = S(l) + \frac{1}{d}\left(\sum_{k=t(j)}^{t} x_k + \frac{1}{e(\lambda)-1}\right) \geq S(l)\cdot\left(1+\frac{1}{d}\right) + \frac{1/d}{(1+1/d)^{\lambda d}-1} \geq \frac{(1+1/d)^{l}-1}{(1+1/d)^{\lambda d}-1}\cdot\left(1+\frac{1}{d}\right) + \frac{1/d}{(1+1/d)^{\lambda d}-1} = \frac{(1+1/d)^{l+1}-1}{(1+1/d)^{\lambda d}-1},
\]
where we used that, since we are considering an update due to a request $j$ acknowledged at time $t_0$ by the predicted solution, it must be that $t(j) \leq t_0$ and hence $\sum_{k=t(j)}^{t} x_k \geq \sum_{k=t_0}^{t} x_k$. Hence we get that $S(\lceil \lambda d\rceil) \geq 1$, which implies that $|Y| \leq \lceil \lambda d\rceil$.

By a similar calculation as for the small update case, we have that the cost of a big update is
\[
\Delta P = \frac{1}{d}\cdot\frac{1}{1-e(-\lambda)}.
\]
The cost of $Y$ is charged to the acknowledgement that $\mathcal{A}$ pays at time $t_0$, which finishes the proof.

Taking the limit $d \to +\infty$ we get the following corollary: Corollary 16.
For any $\lambda \in (0,1]$ and taking $d \to +\infty$, we have that
\[
\mathrm{cost}_{\mathrm{PDLA}}(\mathcal{A}, I, \lambda) \leq n_{\mathcal{A}}\cdot\frac{\lambda}{1-e^{-\lambda}} + \mathrm{latency}(\mathcal{A})\cdot\frac{1}{1-e^{-1/\lambda}}.
\]
We then prove the robustness of the algorithm with the following lemmas.
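The counting argument in the proof of Lemma 15 above (at most $\lceil \lambda d\rceil$ big updates can be charged to a single acknowledgement of $\mathcal{A}$) is easy to check numerically. The sketch below is our own sanity check, not part of the paper, and all function names are ours: it iterates the worst-case recursion $S \leftarrow (1+1/d)\,S + \frac{1}{d(e(\lambda)-1)}$ from $S=0$ and counts the updates needed before the sum reaches $1$.

```python
import math

def e_d(z, d):
    # e(z) = (1 + 1/d)^(z*d), which tends to exp(z) for large d
    return (1.0 + 1.0 / d) ** (z * d)

def big_updates_until_one(lam, d):
    """Iterate the worst-case recursion S <- (1 + 1/d)*S + 1/(d*(e(lam)-1))
    from S = 0 and count how many big updates are needed until S >= 1."""
    S, count = 0.0, 0
    inc = 1.0 / (d * (e_d(lam, d) - 1.0))
    while S < 1.0:
        S = (1.0 + 1.0 / d) * S + inc
        count += 1
    return count

# The induction in the proof predicts exactly ceil(lam * d) big updates.
for lam in (0.1234, 0.5678):
    for d in (10, 100, 1000):
        assert big_updates_until_one(lam, d) == math.ceil(lam * d)
```

In each case the count matches $\lceil \lambda d\rceil$ exactly, in line with the induction $S(l) \geq \frac{(1+1/d)^l - 1}{(1+1/d)^{\lambda d}-1}$, which holds with equality for the worst-case recursion.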
Lemma 17.
Let $y$ be the dual solution produced by Algorithm 6. Then $y/(1+1/d)$ is feasible.

Proof. Notice that the constraints of the second type (i.e. $0 \leq y_{jt} \leq 1/d$) are always satisfied since $0 < \lambda \leq 1$. We now check that the constraints of the first type are almost satisfied (within some factor $(1+1/d)$). Fix a time $t \in T$ and consider the corresponding constraint:
\[
\sum_{j\,:\, t \geq t(j)}\;\sum_{t' \geq t} y_{jt'} \leq 1.
\]
Note that a small update for some job $j$ such that $t(j) \leq t$ increases the sum above by $\lambda/d$, while a big update increases it by $1/d$. Notice that once $\sum_{t' \geq t} x_{t'} \geq 1$, no more such updates will be performed. Denote by $S$ the value of $\sum_{t' \geq t} x_{t'}$. Notice that for a big update, the sum $S$ becomes $\left(1+\frac{1}{d}\right)\cdot S + \frac{1/d}{(1+1/d)^{\lambda d}-1}$. Similarly, for a small update it becomes $\left(1+\frac{1}{d}\right)\cdot S + \frac{1/d}{(1+1/d)^{d/\lambda}-1}$.

Hence, if we denote by $s$ the number of small updates in this sum and by $b$ the number of big updates, by Lemma 20 we have that if $\lambda s + b \geq d$ then $\sum_{t' \geq t} x_{t'} \geq 1$. This directly implies that the value of $\sum_{j : t \geq t(j)} \sum_{t' \geq t} y_{jt'}$ is at most $1 + 1/d$ at the end of the algorithm (each update in the dual is of value at most $1/d$). Therefore scaling down all the $y_{jt}$ by a multiplicative factor $(1+1/d)$ yields a feasible solution to the dual. Lemma 18.
For $d \to +\infty$, Algorithm 6 outputs a solution of cost at most $\frac{1}{1-e^{-\lambda}}\cdot \mathrm{OPT}$.
Proof.
We first compare the increase $\Delta P$ in the primal value to the increase $\Delta D$ in the dual value at every update. We claim that for every update we have
\[
\frac{\Delta P}{\Delta D} \leq \frac{1}{1-e(-\lambda)}.
\]
In the case of a big update we directly have $\Delta P = \frac{1}{d}\left(1+\frac{1}{e(\lambda)-1}\right) = \frac{1}{d}\cdot\frac{1}{1-e(-\lambda)}$ and $\Delta D = \frac{1}{d}$. In the case of a small update we have $\Delta D = \frac{\lambda}{d}$ and $\Delta P = \frac{1}{d}\left(1+\frac{1}{e(1/\lambda)-1}\right) = \frac{1}{d}\cdot\frac{1}{1-e(-1/\lambda)}$, and we conclude by applying Lemma 19 (inequality (3)) that we always have
\[
\frac{\Delta P}{\Delta D} \leq \frac{1}{1-e(-\lambda)}.
\]
By Lemma 17, $y/(1+1/d)$ is a feasible dual solution. Hence taking $d \to +\infty$, together with the previous remark and weak duality, we get the result.

Combining Corollary 16 and Lemma 18 yields Theorem 5.

D Optimality bound
Lemma 3.
Any $\frac{\lambda}{1-e^{-\lambda}}$-consistent learning augmented algorithm for ski rental has robustness $R(\lambda) \geq \frac{1}{1-e^{-\lambda}}$.

Proof.
For simplicity, we will consider the ski-rental problem in the continuous case, which corresponds to the behaviour of the discrete version when $B \to \infty$. In this problem, the cost of buying is $1$ and a randomized algorithm has to define a (buying) probability distribution $\{p_t\}_{t \geq 0}$. Moreover, consider the case where the true number of vacation days satisfies $t_{\mathrm{end}} \in [0,1] \cup (2,\infty)$. In such a case we can assume w.l.o.g. that $p_t = 0$ for all $t > 1$. Indeed, moving buying probability mass from any $p_t$ with $t > 1$ to $p_1$ does not increase the cost of the randomized algorithm. Assume now that the prediction suggests that the end of the vacations is at $\hat{t}_{\mathrm{end}} > 2$; thus the optimal offline solution, if the prediction is correct, is to buy the skis in the beginning for a total cost of $1$. Since the algorithm has to define a probability distribution in $[0,1]$, $\{p_t\}$ needs to satisfy the equality constraint $\int_0^1 p_t\,dt = 1$. Moreover, note that when the prediction is correct, i.e. $t_{\mathrm{end}} > 2$, the LA algorithm suffers an expected cost of $\int_0^1 (t+1)p_t\,dt$ while the optimum offline has a cost of $1$. Thus the consistency requirement forces the distribution to satisfy the inequality $\int_0^1 (t+1)p_t\,dt \leq \frac{\lambda}{1-e^{-\lambda}}$. Now assume that the best possible LA algorithm is $c$-robust. If $t_{\mathrm{end}} \leq 1$ then the LA algorithm's cost is $\int_0^{t_{\mathrm{end}}} (t+1)p_t\,dt + t_{\mathrm{end}}\int_{t_{\mathrm{end}}}^1 p_t\,dt$ while the optimum offline cost is $t_{\mathrm{end}}$. Thus, due to $c$-robustness, we have that for every $t' \in [0,1]$, $\int_0^{t'} (t+1)p_t\,dt + t'\int_{t'}^1 p_t\,dt \leq c\,t'$. We calculate the best possible robustness $c$ with the following LP.

Figure 9: Primal Robustness for ski-rental problem.
Primal:
\[
\begin{aligned}
\text{minimize}\quad & c\\
\text{subject to:}\quad & \textstyle\int_0^1 p_t\,dt = 1\\
& \textstyle\int_0^1 (t+1)\,p_t\,dt \leq \frac{\lambda}{1-e^{-\lambda}}\\
& \textstyle\int_0^{t'} (t+1)\,p_t\,dt + t'\int_{t'}^1 p_t\,dt \leq c\,t' \qquad \forall t' \in [0,1]\\
& p_{t'} \geq 0 \qquad \forall t' \in [0,1]
\end{aligned}
\]
To lower bound the best possible robustness $c$, we will present a feasible solution to the dual of the LP in Figure 9. The dual variables $\lambda_d$ and $\lambda_c$ correspond respectively to the first and second primal constraints in Figure 9. The dual variables $\lambda_t$, $\forall t \in [0,1]$, correspond to the robustness constraints described in the third line of the primal. The corresponding dual is:

Figure 10: Dual Robustness for ski-rental problem.

Dual:
\[
\begin{aligned}
\text{maximize}\quad & \lambda_d - \lambda_c\cdot\frac{\lambda}{1-e^{-\lambda}}\\
\text{subject to:}\quad & \textstyle\int_0^1 t\,\lambda_t\,dt \leq 1\\
& \textstyle\lambda_d - (t'+1)\lambda_c \leq \int_0^{t'} t\,\lambda_t\,dt + (t'+1)\int_{t'}^1 \lambda_t\,dt \qquad \forall t' \in [0,1]\\
& \lambda_c \geq 0,\quad \lambda_t \geq 0 \qquad \forall t \in [0,1]
\end{aligned}
\]
Let $K = \frac{1}{1 - \lambda e^{-\lambda} - e^{-\lambda}}$. Then set $\lambda_t = K\cdot e^{-t}\cdot \mathbb{1}\{t \leq \lambda\}$, $\lambda_d = K$ and $\lambda_c = K\cdot e^{-\lambda}$. We first prove that this dual solution is feasible. For the first constraint notice that
\[
\int_0^1 t\,\lambda_t\,dt = K\cdot\int_0^{\lambda} t e^{-t}\,dt = K\cdot\left(1 - (\lambda+1)e^{-\lambda}\right) = 1.
\]
For the second type of constraint, first in the case $t' > \lambda$ we get
\[
\int_0^{t'} t\,\lambda_t\,dt + (t'+1)\int_{t'}^1 \lambda_t\,dt = \int_0^{\lambda} t\,\lambda_t\,dt = 1
\]
and
\[
\lambda_d - (t'+1)\lambda_c \leq \lambda_d - (\lambda+1)\lambda_c = K\cdot\left(1 - (\lambda+1)e^{-\lambda}\right) = 1,
\]
hence these constraints are satisfied. In the second case $t' \leq \lambda$, we have that
\[
\int_0^{t'} t\,\lambda_t\,dt + (t'+1)\int_{t'}^1 \lambda_t\,dt = K\cdot\left(\int_0^{t'} t e^{-t}\,dt + (t'+1)\int_{t'}^{\lambda} e^{-t}\,dt\right) = K\cdot\left(1 - (t'+1)e^{-t'} + (t'+1)\left(e^{-t'} - e^{-\lambda}\right)\right) = K\cdot\left(1 - (t'+1)e^{-\lambda}\right) = \lambda_d - (t'+1)\lambda_c,
\]
which proves that these constraints are also satisfied. Hence this dual solution is feasible.
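The feasibility computations above reduce to a handful of elementary integrals, so they can also be verified numerically. The sketch below is our own check, not part of the paper, and all function names are ours: it confirms by midpoint integration that the proposed dual solution satisfies the first constraint, the second family of constraints on a grid of $t'$, and that its objective value is $1/(1-e^{-\lambda})$.

```python
import math

def dual_solution(lam):
    """The dual solution proposed in the proof: lambda_d, lambda_c, and the
    density lambda_t = K * exp(-t) * 1{t <= lam}."""
    K = 1.0 / (1.0 - lam * math.exp(-lam) - math.exp(-lam))
    def lam_t(t):
        return K * math.exp(-t) if t <= lam else 0.0
    return K, K * math.exp(-lam), lam_t  # lambda_d, lambda_c, lambda_t

def integrate(f, a, b, n=40000):
    # midpoint rule; accurate enough for this sanity check
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def check_dual_feasibility(lam, tol=2e-3):
    lam_d, lam_c, lam_t = dual_solution(lam)
    # first dual constraint: integral of t * lambda_t over [0, 1] equals 1
    assert abs(integrate(lambda t: t * lam_t(t), 0.0, 1.0) - 1.0) <= tol
    # second family of constraints, checked on a grid of t'
    for i in range(1, 10):
        tp = i / 10.0
        lhs = lam_d - (tp + 1.0) * lam_c
        rhs = integrate(lambda t: t * lam_t(t), 0.0, tp) \
            + (tp + 1.0) * integrate(lam_t, tp, 1.0)
        assert lhs <= rhs + tol
    # the dual objective equals the claimed lower bound 1 / (1 - exp(-lam))
    obj = lam_d - lam_c * lam / (1.0 - math.exp(-lam))
    assert abs(obj - 1.0 / (1.0 - math.exp(-lam))) <= 1e-9
    return obj

for lam in (0.3, 0.6, 0.9):
    check_dual_feasibility(lam)
```

For $t' \leq \lambda$ the constraint holds with equality, which is why a small numerical tolerance is needed.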
Finally, note that the cost of this dual solution is
\[
\lambda_d - \lambda_c\cdot\frac{\lambda}{1-e^{-\lambda}} = K\cdot\left(1 - \frac{\lambda}{1-e^{-\lambda}}\cdot e^{-\lambda}\right) = K\cdot\frac{1 - e^{-\lambda} - \lambda e^{-\lambda}}{1-e^{-\lambda}} = \frac{1}{1-e^{-\lambda}}.
\]
By weak duality, we conclude that the best robustness cannot be better than $\frac{1}{1-e^{-\lambda}}$.

E Technical lemmas
A few inequalities that will be useful:
Lemma 19.
For any $d > 0$, any $0 < \lambda \leq 1$, and any $\beta \in [0,1)$, we have:
\[
\frac{\lambda}{1-e^{-\lambda}} \geq \frac{1}{1-e^{-1/\lambda}} \quad (2)
\]
\[
\frac{\lambda}{1-(1+1/d)^{-\lambda d}} \geq \frac{1}{1-(1+1/d)^{-d/\lambda}} \quad (3)
\]
\[
\frac{1}{e^{\lambda}-1} \geq \frac{\frac{1-\lambda}{\lambda}\cdot e^{1/\lambda}+1}{e^{1/\lambda}-1} \quad (4)
\]
\[
\frac{1}{(1+1/d)^{\lambda d}-1} \geq \frac{\frac{1-\lambda}{\lambda}\cdot (1+1/d)^{d/\lambda}+1}{(1+1/d)^{d/\lambda}-1} \quad (5)
\]
\[
\frac{\lambda}{1-\beta+\beta\lambda}\cdot\frac{e^{\lambda}-\beta}{e^{\lambda}-1} \geq \frac{e^{1/\lambda}-\beta}{e^{1/\lambda}-1} \quad (6)
\]
\[
(\lambda+\beta-\beta\lambda)\cdot\frac{e^{\lambda}-\beta}{e^{\lambda}-1} \geq \frac{e^{1/\lambda}-\beta}{e^{1/\lambda}-1} \quad (7)
\]
Proof.
Since the formal proof of (2) and (4) seems to require heavy calculations, and since they are easy to check on a computer, we will only give a proof by a plot (see Figures 11a and 11b). For 11b, note that
\[
(4) \iff \frac{1}{e^{\lambda}-1} - \frac{\frac{1-\lambda}{\lambda} + e^{-1/\lambda}}{1-e^{-1/\lambda}} \geq 0.
\]

Figure 11: Plots for (2) and (4). (a) Plot of $\frac{\lambda}{1-e^{-\lambda}} - \frac{1}{1-e^{-1/\lambda}}$. (b) Plot of $\frac{1}{e^{\lambda}-1} - \frac{\frac{1-\lambda}{\lambda} + e^{-1/\lambda}}{1-e^{-1/\lambda}}$.

We now prove that inequality (2) implies inequality (3). To this end, notice that we can write $(1+1/d)^d = e^x$ for some $x \in (0,1)$ since $(1+1/d)^d \in (1,e)$ for all $d > 0$. We prove that for any $x \in (0,1]$,
\[
\frac{\lambda\left(1-e^{-x/\lambda}\right)}{1-e^{-x\lambda}} \geq \frac{\lambda\left(1-e^{-1/\lambda}\right)}{1-e^{-\lambda}},
\]
which will imply our claim since by inequality (2) the right-hand side is at least $1$. First note that this is equivalent to proving that
\[
g_{\lambda}(x) = \left(1-e^{-\lambda}\right)\cdot\left(1-e^{-x/\lambda}\right) - \left(1-e^{-1/\lambda}\right)\cdot\left(1-e^{-x\lambda}\right) \geq 0.
\]
Taking the derivative of $g_{\lambda}(x)$ we obtain
\[
g'_{\lambda}(x) = \frac{1-e^{-\lambda}}{\lambda}\cdot e^{-x/\lambda} - \lambda\left(1-e^{-1/\lambda}\right)\cdot e^{-x\lambda},
\]
hence we can write
\[
g'_{\lambda}(x) \geq 0 \iff e^{x(\lambda-1/\lambda)} \geq \lambda^2\cdot\frac{1-e^{-1/\lambda}}{1-e^{-\lambda}}.
\]
Notice that the left-hand side in this inequality is non-increasing in $x$ because $\lambda \in (0,1]$. Also notice that $g_{\lambda}(0) = g_{\lambda}(1) = 0$. These two facts together imply that $g_{\lambda}$ is first increasing for $x \in (0,c]$ then decreasing for $x \in (c,1]$, for some unknown $c$. In particular, we indeed have that $g_{\lambda}(x) \geq 0$, which ends the proof of inequality (3).

Similarly, we prove that inequality (4) implies inequality (5). Again we write $(1+1/d)^d = e^x$ for some $x \in (0,1)$. We first rewrite inequality (5):
\begin{align*}
(5) &\iff \frac{1}{e^{\lambda x}-1} \geq \frac{\frac{1-\lambda}{\lambda}\cdot e^{x/\lambda}+1}{e^{x/\lambda}-1}\\
&\iff \frac{1}{e^{\lambda x}-1} \geq \frac{\frac{1-\lambda}{\lambda}\cdot\left(e^{x/\lambda}-1\right)+\frac{1}{\lambda}}{e^{x/\lambda}-1}\\
&\iff \lambda\left(e^{x/\lambda}-1\right) \geq (1-\lambda)\left(e^{x/\lambda}-1\right)\left(e^{\lambda x}-1\right) + \left(e^{\lambda x}-1\right)\\
&\iff \lambda\left(e^{x/\lambda}-1\right) - (1-\lambda)\left(e^{x/\lambda}-1\right)\left(e^{\lambda x}-1\right) - \left(e^{\lambda x}-1\right) \geq 0.
\end{align*}
Define the following function $h_{\lambda}(x) = \lambda\left(e^{x/\lambda}-1\right) - (1-\lambda)\left(e^{x/\lambda}-1\right)\left(e^{\lambda x}-1\right) - \left(e^{\lambda x}-1\right)$. One can first compute:
1) + 1 λ e x/λ ( e λx − (cid:19) − λe λx = e x/λ − λe λx − (1 − λ ) · (cid:18) ( λ + 1 /λ ) e x ( λ +1 /λ ) − λe λx − λ e x/λ (cid:19) = e x/λ · (cid:18) − λλ (cid:19) + e λx · ( − λ + λ (1 − λ )) − e x ( λ +1 /λ ) · (1 − λ ) · (cid:18) λ + 1 λ (cid:19) = e x/λ λ − λ e λx − e x ( λ +1 /λ ) λ · (1 − λ ) · ( λ + 1) Hence we can rewrite h (cid:48) λ ( x ) (cid:62) ⇐⇒ e x/λ λ − λ e λx − e x ( λ +1 /λ ) λ · (1 − λ ) · ( λ + 1) (cid:62) ⇐⇒ e x/λ − λ e λx − e x ( λ +1 /λ ) · (1 − λ ) · ( λ + 1) (cid:62) ⇐⇒ − λ e x ( λ − /λ ) − e xλ · (1 − λ ) · ( λ + 1) (cid:62) Let us define i λ ( x ) = 1 − λ e x ( λ − /λ ) − e xλ · (1 − λ ) · ( λ + 1) and we derive i (cid:48) λ ( x ) = − λ · ( λ − /λ ) · e x ( λ − /λ ) − λe λx · (1 − λ ) · ( λ + 1) We can now notice that i (cid:48) λ ( x ) (cid:62) ⇐⇒ − λ · ( λ − /λ ) · e x ( λ − /λ ) − λe λx · (1 − λ ) · ( λ + 1) (cid:62) ⇐⇒ − λ · ( λ − /λ ) · e − x/λ − λ (1 − λ ) · ( λ + 1) (cid:62) ⇐⇒ λ · (1 /λ − λ ) · e − x/λ − λ (1 − λ ) · ( λ + 1) (cid:62) Since the left hand side is decreasing as x increases we only need to check one extreme value whichis i (cid:48) λ (0) . We write i (cid:48) λ (0) (cid:54) ⇐⇒ λ · (1 /λ − λ ) − λ · (1 − λ ) · ( λ + 1) (cid:54) ⇐⇒ λ − λ − ( λ + λ − λ − λ ) (cid:54) ⇐⇒ − λ + 2 λ − λ (cid:54) ⇐⇒ − λ · ( λ − (cid:54) hence we always have i (cid:48) λ (0) (cid:54) .Therefore we get that i (cid:48) λ ( x ) (cid:54) for all x and λ . Note that i λ (0) = 1 − λ − (1 − λ )( λ + 1) =1 − λ − λ − λ + λ = λ − λ (cid:62) . Therefore we get that h λ is first positive on some interval [0 , c ] and then negative for x ∈ [ c, ∞ ) . Therefore h λ is first increasing then decreasing. Notice that h λ (0) = 0 and h λ (1) (cid:62) by inequality (4). Hence inequality (5) is true for all x ∈ [0 , whichconcludes the proof.Finally, the proof of (6) and (7) are quicker and similar. Note that(6) ⇐⇒ λ · e λ − βe λ − (cid:62) (1 − β + βλ ) · e /λ − βe /λ − which is equivalent to a polynomial (in β ) of degree being positive. 
The leading coefficient of thispolynomial P is negative and we notice that P (1) = 0 and that P (0) (cid:62) by (2). All these factstogether imply that P ( β ) (cid:62) for all β ∈ [0 , . The proof of (7) is similar. Lemma 20.
Let $0 < \lambda \leq 1$, $d > 0$ and define the following functions (for $x \in \mathbb{R}$):
\[
f(x) = \left(1+\frac{1}{d}\right)\cdot x + \frac{1}{d\left((1+1/d)^{\lambda d}-1\right)}, \qquad g(x) = \left(1+\frac{1}{d}\right)\cdot x + \frac{1}{d\left((1+1/d)^{d/\lambda}-1\right)}.
\]
Given $S \geq 0$ and a word $w \in \{a,b\}^*$, we define a sequence $S_w$ recursively as follows:
\[
S_{w.y} = \begin{cases} S & \text{if } w.y = \varepsilon\\ f(S_w) & \text{if } y = a\\ g(S_w) & \text{if } y = b \end{cases}
\]
Then for any $w \in \{a,b\}^*$ such that $|w|_a + \lambda|w|_b \geq d$, we have that $S_w \geq 1$.

Proof. Let $w' = b\ldots ba\ldots a = b^{|w|_b}a^{|w|_a}$ be the word made of $|w|_b$ consecutive $b$s followed by $|w|_a$ consecutive $a$s. Then we claim that $S_{w'} \leq S_w$. This directly follows from the fact that for any real number $x$, $f(g(x)) \leq g(f(x))$. Noticing this, we can repeatedly swap the positions of an $a$ followed by a $b$, each swap only reducing the final value. We keep doing this until all the $b$s in $w$ end up in front position. With standard computations one can check that
\[
S_{b^{|w|_b}} = S\cdot(1+1/d)^{|w|_b} + \frac{(1+1/d)^{|w|_b}-1}{(1+1/d)^{d/\lambda}-1}.
\]
For ease of notation define $S' = S_{b^{|w|_b}}$. Using the assumption that $|w|_a + \lambda|w|_b \geq d$ and that $S \geq 0$, we get that
\[
S' \geq \frac{(1+1/d)^{(d-|w|_a)/\lambda}-1}{(1+1/d)^{d/\lambda}-1}.
\]
Again using standard calculations we get that
\[
S_{w'} = S'\cdot(1+1/d)^{|w|_a} + \frac{(1+1/d)^{|w|_a}-1}{(1+1/d)^{\lambda d}-1},
\]
which implies
\[
S_{w'} \geq \frac{(1+1/d)^{(d-|w|_a)/\lambda}-1}{(1+1/d)^{d/\lambda}-1}\cdot(1+1/d)^{|w|_a} + \frac{(1+1/d)^{|w|_a}-1}{(1+1/d)^{\lambda d}-1}.
\]
Define $h(x) = \frac{(1+1/d)^{(d-x)/\lambda}-1}{(1+1/d)^{d/\lambda}-1}\cdot(1+1/d)^{x} + \frac{(1+1/d)^{x}-1}{(1+1/d)^{\lambda d}-1}$.
We finish the proof by proving that for any $0 < \lambda \leq 1$, any $d > 0$ and any $x \geq 0$, we have that $h(x) \geq 1$. Note that $h(0) = 1$ and that
\[
h'(x) = \ln(1+1/d)\cdot\left(\frac{(1+1/d)^{x}}{(1+1/d)^{\lambda d}-1} - \frac{1-\lambda}{\lambda}\cdot\frac{(1+1/d)^{(d-(1-\lambda)x)/\lambda}}{(1+1/d)^{d/\lambda}-1} - \frac{(1+1/d)^{x}}{(1+1/d)^{d/\lambda}-1}\right).
\]
To study the sign of $h'(x)$ we can drop the factor $\ln(1+1/d)$ and write
\begin{align*}
h'(x) \geq 0 &\iff \frac{(1+1/d)^{x}}{(1+1/d)^{\lambda d}-1} - \frac{1-\lambda}{\lambda}\cdot\frac{(1+1/d)^{(d-(1-\lambda)x)/\lambda}}{(1+1/d)^{d/\lambda}-1} - \frac{(1+1/d)^{x}}{(1+1/d)^{d/\lambda}-1} \geq 0\\
&\iff \frac{1}{(1+1/d)^{\lambda d}-1} - \frac{1-\lambda}{\lambda}\cdot\frac{(1+1/d)^{(d-x)/\lambda}}{(1+1/d)^{d/\lambda}-1} - \frac{1}{(1+1/d)^{d/\lambda}-1} \geq 0.
\end{align*}
Clearly the last expression is increasing as $x$ increases, hence we can limit ourselves to proving that $h'(0) \geq 0$, which we can rewrite as
\begin{align*}
h'(0) \geq 0 &\iff \frac{1}{(1+1/d)^{\lambda d}-1} - \frac{1-\lambda}{\lambda}\cdot\frac{(1+1/d)^{d/\lambda}}{(1+1/d)^{d/\lambda}-1} - \frac{1}{(1+1/d)^{d/\lambda}-1} \geq 0\\
&\iff \frac{1}{(1+1/d)^{\lambda d}-1} \geq \frac{\frac{1-\lambda}{\lambda}\cdot(1+1/d)^{d/\lambda}+1}{(1+1/d)^{d/\lambda}-1},
\end{align*}
which is exactly inequality (5) of Lemma 19. Hence $h$ is non-decreasing and $h(x) \geq h(0) = 1$ for all $x \geq 0$, which concludes the proof.
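In the same spirit as the proof by plot used for (2) and (4), all six inequalities of Lemma 19 and the conclusion of Lemma 20 can be checked numerically. The code below is our own verification sketch, not part of the paper (all names are ours); it evaluates the inequalities on a parameter grid and runs the word recursion of Lemma 20 on a few explicitly constructed words satisfying $|w|_a + \lambda|w|_b \geq d$.

```python
import math

def lemma19_holds(lam, d, beta, tol=1e-9):
    """Check inequalities (2)-(7) of Lemma 19 for one parameter triple."""
    e = math.exp
    ed = lambda z: (1.0 + 1.0 / d) ** (z * d)  # e(z) = (1 + 1/d)^(z*d)
    checks = [
        lam / (1 - e(-lam)) >= 1 / (1 - e(-1 / lam)) - tol,                                  # (2)
        lam / (1 - ed(-lam)) >= 1 / (1 - ed(-1 / lam)) - tol,                                # (3)
        1 / (e(lam) - 1) >= ((1 - lam) / lam * e(1 / lam) + 1) / (e(1 / lam) - 1) - tol,     # (4)
        1 / (ed(lam) - 1) >= ((1 - lam) / lam * ed(1 / lam) + 1) / (ed(1 / lam) - 1) - tol,  # (5)
        lam / (1 - beta + beta * lam) * (e(lam) - beta) / (e(lam) - 1)
            >= (e(1 / lam) - beta) / (e(1 / lam) - 1) - tol,                                 # (6)
        (lam + beta - beta * lam) * (e(lam) - beta) / (e(lam) - 1)
            >= (e(1 / lam) - beta) / (e(1 / lam) - 1) - tol,                                 # (7)
    ]
    return all(checks)

def S_word(word, lam, d, S=0.0):
    """The recursion of Lemma 20: letter 'a' applies f, letter 'b' applies g."""
    cf = 1.0 / (d * ((1 + 1.0 / d) ** (lam * d) - 1))
    cg = 1.0 / (d * ((1 + 1.0 / d) ** (d / lam) - 1))
    for y in word:
        S = (1 + 1.0 / d) * S + (cf if y == 'a' else cg)
    return S

for lam in (0.1, 0.3, 0.5, 0.7, 0.9):
    for d in (5, 50, 500):
        for beta in (0.0, 0.25, 0.5, 0.75, 0.9):
            assert lemma19_holds(lam, d, beta)

# Lemma 20: whenever |w|_a + lam * |w|_b >= d, the final value is at least 1.
lam, d = 0.5, 20
for word in ('a' * 20, 'b' * 41, 'ab' * 14, 'b' * 20 + 'a' * 10):
    assert word.count('a') + lam * word.count('b') >= d
    assert S_word(word, lam, d) >= 1.0 - 1e-9
```

The word `'b' * 20 + 'a' * 10` is exactly the worst-case ordering $b^{|w|_b}a^{|w|_a}$ used in the proof, at the boundary $|w|_a + \lambda|w|_b = d$.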