Learning Augmented Energy Minimization via Speed Scaling
Étienne Bamas, Andreas Maggiori, Lars Rohwedder, Ola Svensson
Étienne Bamas∗, EPFL, Switzerland, [email protected]
Andreas Maggiori∗, EPFL, Switzerland, [email protected]
Lars Rohwedder∗, EPFL, Switzerland, [email protected]
Ola Svensson∗, EPFL, Switzerland, [email protected]
∗Equal contribution.
Abstract
As power management has become a primary concern in modern data centers, computing resources are being scaled dynamically to minimize energy consumption. We initiate the study of a variant of the classic online speed scaling problem, in which machine learning predictions about the future can be integrated naturally. Inspired by recent work on learning-augmented online algorithms, we propose an algorithm which incorporates predictions in a black-box manner and outperforms any online algorithm if the accuracy is high, yet maintains provable guarantees if the prediction is very inaccurate. We provide both theoretical and experimental evidence to support our claims.
1 Introduction

Online problems can be informally defined as problems where we are required to make irrevocable decisions without knowing the future. The classical way of dealing with such problems is to design algorithms which provide provable bounds on the ratio between the value of the algorithm's solution and the optimal (offline) solution (the competitive ratio). Here, no assumption about the future is made. Unfortunately, this no-assumption regime comes at a high cost: because the algorithm has to be overly prudent and prepare for all possible future events, the guarantees are often poor. Due to the success story of machine learning (ML), a recent line of work, first proposed by Lykouris and Vassilvitskii [13] and Medina and Vassilvitskii [14], suggests incorporating the predictions provided by ML algorithms into the design of online algorithms. While some related approaches were considered before (see e.g. Xu and Xu [16]), attention to this subject has increased substantially in recent years [7, 8, 10, 11, 12, 13, 14, 15]. An obvious caveat is that ML predictors often come with no worst-case guarantees, and so we would like our algorithm to be robust to misleading predictions. We follow the terminology introduced by Purohit et al. [15], where consistency is the performance of an algorithm when the predictor is perfectly accurate, while robustness is a worst case guarantee that does not depend on the quality of the prediction. The goal of the works above is to design algorithms which provably beat the classical online algorithms in the consistency case, while being robust when the predictor fails.
Problem.
The problem we are considering is motivated by the following scenario. Consider a server that receives requests in an online fashion. For each request some computational work has to be done and, as a measure of Quality-of-Service, we require that each request is answered within some fixed time. In order to satisfy all the requests in time, the server can dynamically change its processor speed at any time. However, the power consumption can be a super-linear function of the processing speed (more precisely, we model the power consumption as s^α, where s is the processing speed and α > 1). Therefore, the problem of minimizing energy becomes non-trivial. This problem can be considered in the online model where the server has no information about future tasks at all. However, this assumption seems unnecessarily restrictive, as these requests tend to follow patterns that can be predicted. For this reason a good algorithm should be able to incorporate given predictions about the future. Similar scenarios appear in real-world systems, for instance in dynamic frequency scaling of CPUs or in autoscaling of cloud applications [4, 9]. In the case of autoscaling, ML advice is already being incorporated into online algorithms in practice [4]. However, on the theory side, while the above speed scaling problem was introduced in a seminal paper by Yao et al. [17], who studied it in both the online and offline settings (see also [2, 3]), it has not been considered in the learning augmented setting.

Contributions.
We formalize an intuitive and well-founded prediction model for the classic speed scaling problem. We show that our problem is non-trivial by providing an unconditional lower bound demonstrating that an algorithm cannot be optimal when the prediction is correct and at the same time retain robustness. We then focus on our main contribution, which is the design and analysis of a simple and efficient algorithm that incorporates any ML predictor as a black box without making any further assumptions. We achieve this in a modular way: first, we show that there is a consistent (but not robust) online algorithm. Then we develop a technique to make any online algorithm (which may use the prediction) robust at a small cost. Moreover, we design general methods that allow algorithms to cope with small perturbations in the prediction. In addition to the theoretical analysis, we also provide an experimental evaluation that supports our claims on both synthetic and real datasets. For most of the paper we focus on a restricted case of the speed scaling problem of Yao et al. [17], where predictions can be integrated naturally. However, we show that with more sophisticated algorithms our techniques extend well to the general case.
Related work.
On the one hand, the field of learning augmented algorithms is relatively new, with a lot of recent exciting results (see for example Gollapudi and Panigrahi [7], Hsu et al. [8], Kodialam [10], Lattanzi et al. [11], Lee et al. [12], Lykouris and Vassilvitskii [13], Medina and Vassilvitskii [14], Purohit et al. [15], Xu and Xu [16]). On the other hand, the speed scaling problem proposed by Yao et al. in [17] is well understood in both the offline and online setting. In its full generality, a set of tasks, each with its own arrival time, deadline, and workload, needs to be completed in time while the speed is scaled in order to minimize energy. In the offline setting, Yao et al. proved that the problem can be solved in polynomial time by a greedy algorithm. In the online setting, in which the jobs are revealed only at their release time, Yao et al. designed two different algorithms: (1) the AVERAGE RATE heuristic (AVR), for which they proved a bound of 2^{α−1} · α^α on the competitive ratio. This analysis was later proved to be asymptotically tight by Bansal et al. [3]. (2) The OPTIMAL AVAILABLE heuristic (OA), which was shown to be α^α-competitive in [2]. In the same paper, Bansal et al. proposed a third online algorithm named BKP for which they proved a competitive ratio asymptotically equivalent to 2e^{α+1}. While these competitive ratios, exponential in α, might not seem satisfying, Bansal et al. also proved that the exponential dependency cannot be better than e^α. A number of variants of the problem have also been considered in the offline setting (no preemption allowed, precedence constraints, nested jobs, and more, listed in a recent survey by Gerards et al. [6]) and from a stochastic optimization point of view (see for instance [1]). It is important to note that, while in theory the problem is interesting in the general case, i.e., when α is an input parameter, in practice we usually focus on small values of α such as 2 or 3, since they model certain physical laws (see e.g. Bansal et al. [2]). Although the BKP algorithm provides the best asymptotic guarantee, OA or AVR often lead to better solutions for small α and therefore remain relevant.

We define the Uniform Speed Scaling problem, a natural restricted version of the speed scaling problem [17], where predictions can be integrated naturally. While the restricted version is our main focus, as it allows for a cleaner exposition and prediction model, we also show that our techniques can be adapted to more complex algorithms, yielding similar results for the general problem (see Section 3.4 for further extensions).
Problem definition.
An instance of the problem can be formally described as a triple (w, D, T) where [0, T] is a finite time horizon and at each time i ∈ {0, ..., T − D} jobs with a total workload w_i ∈ Z_{≥0} arrive, which have to be completed by time i + D. To do so, we can adjust the speed s_i(t) at which each workload w_i is processed for t ∈ [i, i + D]. Jobs may be processed in parallel. The overall speed of our processing unit at time t is the sum s(t) = Σ_i s_i(t), which yields a power consumption of s(t)^α, where α > 1 is a problem specific constant. Since we want to finish each job on time, we require that the amount of work dedicated to job i in the interval [i, i + D] equals w_i; in other words, ∫_i^{i+D} s_i(t) dt = w_i. In the offline setting, the whole instance is known in advance, i.e., the vector of workloads w is entirely accessible. In the online problem, at time i, the algorithm is only aware of all workloads w_j with j ≤ i, i.e., the jobs that were released before time i. As noted by Bansal et al. [2], in the offline setting the problem can be formulated concisely as the following mathematical program:

Definition 1 (Uniform Speed Scaling problem). On input (w, D, T) compute the optimal solution for

min ∫_0^T s(t)^α dt
s.t. ∫_i^{i+D} s_i(t) dt = w_i for all i,
Σ_i s_i(t) = s(t) for all t,
s_i(t) ≥ 0 for all i and all t.
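To make the objective concrete, here is a minimal sketch (our illustration, not code from the paper; the function names and the sampling resolution dt are assumptions) that evaluates the energy of a discretized schedule and checks feasibility of the per-job speed profiles:

```python
def energy(total_speed, alpha, dt):
    """Energy of a speed profile sampled every dt time units: sum of s(t)^alpha * dt."""
    return sum(s ** alpha for s in total_speed) * dt

def is_feasible(per_job_speed, w, D, dt, tol=1e-6):
    """Check that each job i receives w[i] units of work within [i, i + D].

    per_job_speed[i] is the sampled speed profile s_i over the whole horizon.
    """
    for i, profile in enumerate(per_job_speed):
        lo, hi = int(i / dt), int((i + D) / dt)
        done = sum(profile[lo:hi]) * dt          # work performed inside [i, i + D]
        outside = any(s > tol for s in profile[:lo] + profile[hi:])
        if abs(done - w[i]) > tol or outside:
            return False
    return True
```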
In contrast, we refer to the problem of Yao et al. [17] as the General Speed Scaling problem. The difference is that there the time the processor is given to complete each job is not necessarily equal across jobs. More precisely, we replace w and D by a set of jobs J_j = (r_j, d_j, w_j), where r_j is the time the job becomes available, d_j is the deadline by which it must be completed, and w_j is the work to be completed. As a shorthand, we sometimes refer to these two problems as the uniform deadlines case and the general deadlines case. As mentioned before, Yao et al. [17] provide a simple optimal greedy algorithm that runs in polynomial time. As for the online setting, we emphasize that both the general and the uniform speed scaling problem are non-trivial. More specifically, we prove that no online algorithm can have a competitive ratio better than Ω((6/5)^α), even in the uniform case (see Theorem 9 in Appendix B). We provide a few additional insights on the performance of online algorithms for the uniform deadlines case. Although the AVR algorithm was proved to be 2^{α−1} · α^α-competitive by Yao et al. [17] with a quite technical proof, we show, with a simple proof, that AVR is in fact 2^α-competitive in the uniform deadlines case, and we provide an almost matching lower bound on the competitive ratio (see Theorem 10 and Theorem 11 in the appendix).

Note that in both problems the processor is allowed to run multiple jobs in parallel. However, we underline that restricting the problem to the case where the processor is only allowed to run at most one job at any given point in time is equivalent. Indeed, given a feasible solution s(t) = Σ_i s_i(t) in the parallel setting, rescheduling the jobs sequentially according to the earliest deadline first (EDF) policy creates a feasible solution of the same (energy) cost where at each point in time only one job is processed.
Prediction model and error measure.

In the following, we present the model of prediction we are considering. Recall that an instance of the problem is defined by a time horizon [0, T], a duration D, and a vector of workloads w_i, i = 0, 1, ..., T − D. A natural prediction is simply to give the algorithm a predicted instance (w^pred, D, T) at time t = 0. From now on, we will refer to the ground truth work vector as w^real and to the predicted instance as w^pred.

We define the error err of the prediction as

err(w^real, w^pred) = ||w^real − w^pred||_α^α = Σ_i |w^real_i − w^pred_i|^α.

We simply write err when w^real and w^pred are clear from the context. The motivation for using α in the definition of err, and not some other constant p, comes from strong impossibility results. Clearly, guarantees for higher values of p are weaker than for lower p. Therefore, we would like to set p as low as possible. However, we show that p needs to be at least α in order to make a sensible use of a prediction (see Theorem 13 in the supplementary material). We further note that it may seem natural to consider a predictor that is able to renew its prediction over time, e.g., by providing our algorithm a new prediction at every integral time i. To this end, in Appendix D, we show how to naturally extend all our results from the single prediction to the evolving prediction model.
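As a quick illustration, the error measure is a single line of code; this snippet is ours and the names are assumptions:

```python
def prediction_error(w_real, w_pred, alpha):
    # err(w_real, w_pred) = sum_i |w_real[i] - w_pred[i]|^alpha
    return sum(abs(r - p) ** alpha for r, p in zip(w_real, w_pred))
```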
Finally, we restate some desirable properties, previously defined in [13, 15], that a learning augmented algorithm should have. Recall that the prediction is a source of unreliable information about the remaining instance and that the algorithm is oblivious to the quality of this prediction. In the following we denote by OPT the energy cost of the optimal offline schedule and by ε > 0 a robustness parameter of the algorithm; the smaller ε is, the more we trust the prediction.

If the prediction is perfectly accurate, i.e., the entire instance can be derived from the prediction, then the provable guarantees should be better than what a pure online algorithm can achieve. Ideally, the algorithm produces an offline optimal solution or comes close to it. By close to optimal, we mean that the cost of the algorithm (when the prediction is perfectly accurate) should be at most c(α, ε) · OPT, where c(α, ε) tends to 1 as ε approaches 0. This characteristic will be called consistency.

The competitive ratio of the algorithm should always be bounded, even for arbitrarily bad (adversarial) predictions. Ideally, the competitive ratio is somewhat comparable to the competitive ratio of algorithms from the literature for the pure online case. Formally, the cost of the algorithm should always be bounded by r(α, ε) · OPT for some function r(α, ε). This characteristic will be called robustness.

A perfect prediction is a strong requirement. The consistency property should transition smoothly over all ranges of errors, that is, the algorithm's guarantees should deteriorate smoothly as the prediction error increases. Formally, the cost of the algorithm should always be at most c(α, ε) · OPT + f(α, ε, err) for some function f such that f(α, ε, 0) = 0 for any α, ε. This last property will be called smoothness.

Note that our definitions of consistency and robustness depend on the problem specific constant α, which is unavoidable (see Theorem 9 in the appendix). The dependence on the robustness parameter ε is justified, because no algorithm can be perfectly consistent and robust at the same time (see Theorem 12 in the appendix); hence a trade-off is necessary.

In this section we develop two modular building blocks to obtain a consistent, smooth, and robust algorithm. The first block is an algorithm which computes a schedule online taking into account the prediction for the future. This algorithm is consistent and smooth, but not robust. Then we describe a generic method to robustify an arbitrary online algorithm at a small cost. Finally, we give a summary of the theoretical qualities of the full algorithm and a full description in pseudo-code. We note that in Appendix H and Appendix F we present additional building blocks (see Section 3.4 for an overview).
In the following we describe a learning-augmented online algorithm, which we call LAS-TRUST.
Preparation.
We compute an optimal schedule s^pred for the predicted jobs. An optimal schedule can always be normalized such that each workload w^pred_i is completely scheduled in an interval [a_i, b_i] at a uniform speed c_i, that is,

s^pred_i(t) = c_i if t ∈ [a_i, b_i], and 0 otherwise.

Furthermore, the intervals [a_i, b_i] are non-overlapping. For details we refer the reader to the optimal offline algorithm by Yao et al. [17], which always creates such a schedule.

The online algorithm.
At time i we first schedule w^real_i at uniform speed in [a_i, b_i], but we cap the speed at c_i. If this does not complete the job, that is, w^real_i > c_i(b_i − a_i) = w^pred_i, we schedule the remaining work uniformly in the interval [i, i + D]. More formally, we define s_i(t) = s'_i(t) + s''_i(t), where

s'_i(t) = min{ w^real_i/(b_i − a_i), c_i } if t ∈ [a_i, b_i], and 0 otherwise,

and

s''_i(t) = (1/D) · max{ 0, w^real_i − w^pred_i } if t ∈ [i, i + D], and 0 otherwise.
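A minimal sketch of these two speed components (our illustration; it assumes the normalized predicted schedule is given by the interval [a_i, b_i] and speed c_i for each job):

```python
def las_trust_speed(i, w_real_i, w_pred_i, a_i, b_i, c_i, D):
    """Speed function s_i = s'_i + s''_i of LAS-TRUST for the job released at time i."""
    def s(t):
        # s'_i: follow the predicted schedule, capped at the predicted speed c_i.
        s1 = min(w_real_i / (b_i - a_i), c_i) if a_i <= t <= b_i else 0.0
        # s''_i: spread any unpredicted excess work uniformly over [i, i + D].
        s2 = max(0.0, w_real_i - w_pred_i) / D if i <= t <= i + D else 0.0
        return s1 + s2
    return s
```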
Analysis.

It is easy to see that the algorithm is consistent: if the prediction of w^real_i is perfect (w^pred_i = w^real_i), the job will be scheduled at speed c_i in the interval [a_i, b_i]. If all predictions are perfect, this is exactly the optimal schedule.

Theorem 2.
For every 0 < δ ≤ 1, the cost of the schedule produced by the algorithm LAS-TRUST
is bounded by (1 + δ)^α · OPT + (12/δ)^α · err.

Proof.
Define w^+_i = max{0, w^real_i − w^pred_i} as the additional work at time i compared to the predicted work. Likewise, define w^−_i = max{0, w^pred_i − w^real_i}. We use OPT(w^+) and OPT(w^−) to denote the costs of optimal schedules of these workloads w^+ and w^−, respectively. We will first relate the energy of the schedule s(t) to the optimal energy for the predicted instance, i.e., OPT(w^pred). Then we will relate OPT(w^pred) to OPT(w^real).

For the former, let s'_i and s''_i be defined as in the algorithm. Observe that s'_i(t) ≤ s^pred_i(t) for all i and t. Hence, the energy for the partial schedule s' (by itself) is at most OPT(w^pred). Furthermore, by definition we have that s''_i(t) = w^+_i/D on [i, i + D]. In other words, s'' is exactly the AVR schedule on instance w^+. By the analysis of AVR, we know that the total energy of s'' is at most 2^α OPT(w^+). Since the energy function is non-linear, we cannot simply add the energy of both speeds. Instead, we use the following inequality: for all x, y ≥ 0 and 0 < γ ≤ 1, it holds that

(x + y)^α ≤ (1 + γ)^α x^α + (2/γ)^α y^α.

This follows from a simple case distinction whether y ≤ γx. Thus (substituting δ/3 for γ), the energy of the schedule s is bounded by

∫ (s'(t) + s''(t))^α dt ≤ (1 + δ/3)^α ∫ s'(t)^α dt + (6/δ)^α ∫ s''(t)^α dt ≤ (1 + δ/3)^α OPT(w^pred) + (12/δ)^α OPT(w^+).   (1)

For the last inequality we used that the competitive ratio of AVR is 2^α.

In order to relate OPT(w^pred) and OPT(w^real), we argue similarly. Notice that scheduling w^real optimally (by itself) and then scheduling w^− using AVR forms a valid solution for w^pred. Hence,

OPT(w^pred) ≤ (1 + δ/3)^α OPT(w^real) + (12/δ)^α OPT(w^−).

Inserting this inequality into (1), we conclude that the energy of the schedule s is at most

(1 + δ/3)^{2α} OPT(w^real) + (12/δ)^α (OPT(w^+) + OPT(w^−)) ≤ (1 + δ)^α OPT(w^real) + (12/δ)^α · err.

The last inequality follows from the fact that the error function ||·||_α^α is always an upper bound on the energy of the optimal schedule (by scheduling every job within the next time unit).
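For completeness, the case distinction behind the inequality used in the proof can be written out as follows (our rendering of the standard argument):

```latex
% Case 1: y <= gamma * x.
(x+y)^\alpha \le \big((1+\gamma)x\big)^\alpha = (1+\gamma)^\alpha x^\alpha .
% Case 2: y > gamma * x, so x < y/\gamma; using 0 < \gamma \le 1:
(x+y)^\alpha \le \Big(\tfrac{y}{\gamma} + y\Big)^\alpha
            = \Big(\tfrac{1+\gamma}{\gamma}\Big)^\alpha y^\alpha
            \le \Big(\tfrac{2}{\gamma}\Big)^\alpha y^\alpha .
% In both cases, (x+y)^\alpha \le (1+\gamma)^\alpha x^\alpha + (2/\gamma)^\alpha y^\alpha .
```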
In this section, we describe a method ROBUSTIFY that takes any online algorithm which guarantees to complete each job within (1 − δ)D time, that is, with some slack to its deadline, and turns it into a robust algorithm without increasing the energy of the schedule produced. Here δ > 0 can be chosen at will, but it impacts the robustness guarantee. We remark that the slack constraint is easy to achieve: in Appendix E we prove that decreasing D to (1 − δ)D increases the energy of the optimum schedule only very mildly. Specifically, if we let OPT(w^real, (1 − δ)D, T) and OPT(w^real, D, T) denote the costs of optimal schedules of workload w^real with durations (1 − δ)D and D, respectively, then:

Claim 3.
For any instance (w^real, D, T) we have that

OPT(w^real, (1 − δ)D, T) ≤ (1/(1 − δ))^{α−1} · OPT(w^real, D, T).

Hence, running a consistent algorithm with duration (1 − δ)D will not increase the cost significantly. Alternatively, we can run the online algorithm with D, but increase the generated speed function by a factor 1/(1 − δ) and reschedule all jobs using EDF. This also results in a schedule where all jobs are completed within (1 − δ)D time.

Figure 1: A schedule and its convolution.

For a schedule s of (w^real, (1 − δ)D, T) we define the δ-convolution operator, which returns the schedule s^(δ) of the original instance (w^real, D, T) given by

s^(δ)_i(t) = (1/(δD)) ∫_{t−δD}^{t} s_i(r) dr

for each i (letting s_i(r) = 0 if r < 0). See Figure 1 for an illustration. The name comes from the fact that this operator is the convolution of s_i(t) with the function f(t) that takes value 1/(δD) if 0 ≤ t ≤ δD and value 0 otherwise.

Next we state three key properties of the convolution operator, all of which follow from easy observations or standard arguments that are deferred to Appendix G.

Claim 4. If s is a feasible schedule for (w^real, (1 − δ)D, T) then s^(δ) is a feasible schedule for (w^real, D, T).
Claim 5. The cost of the schedule s^(δ) is not higher than that of s, that is,

∫_0^T (s^(δ)(t))^α dt ≤ ∫_0^T (s(t))^α dt.

Let s^AVR_i(t) denote the speed of workload w^real_i under the AVERAGE RATE heuristic, that is, s^AVR_i(t) = w^real_i/D if i ≤ t ≤ i + D and s^AVR_i(t) = 0 otherwise. We relate s^(δ)_i(t) to s^AVR_i(t).

Claim 6.
Let s be a feasible schedule for (w^real, (1 − δ)D, T). Then s^(δ)_i(t) ≤ (1/δ) · s^AVR_i(t).
VERAGE R ATE is at most α (see Appendix B), we get (cid:90) T ( s ( δ ) ( t )) α dt (cid:54) (cid:18) δ (cid:19) α (cid:90) T ( s AVR ( t )) α dt (cid:54) (cid:18) δ (cid:19) α OPT . We conclude with the following theorem, which follows immediately from the previous claims.
We conclude with the following theorem, which follows immediately from the previous claims.

Theorem 7.
Given an online algorithm that produces a schedule s for (w^real, (1 − δ)D, T), we can compute online a schedule s^(δ) with

∫_0^T (s^(δ)(t))^α dt ≤ min{ ∫_0^T (s(t))^α dt, (2/δ)^α · OPT }.
By combining LAS-TRUST and ROBUSTIFY, we obtain an algorithm LAS (see Algorithm 1) which has the following properties. See Appendix A for a formal argument.
Theorem 8.
For any given ε > 0, algorithm LAS constructs a schedule of cost at most

min{ (1 + ε) OPT + O(α/ε)^α · err, O(α/ε)^α · OPT }.

Algorithm 1 LEARNING AUGMENTED SCHEDULING (LAS)
Input: T, D, and w^pred initially, and w^real in an online fashion
Output: A feasible schedule (s_i)_{i=0}^{T−D}

Let δ > 0 with ((1 + δ)/(1 − δ))^α = 1 + ε.
Compute an optimal offline schedule for (w^pred, (1 − δ)D, T) in which the jobs w^pred_i are run at uniform speeds c_i on disjoint intervals [a_i, b_i], using [17].
on arrival of w^real_i do
    Let s'_i(t) = min{ w^real_i/(b_i − a_i), c_i } if t ∈ [a_i, b_i], and 0 otherwise.
    Let s''_i(t) = (1/((1 − δ)D)) · max{ 0, w^real_i − w^pred_i } if t ∈ [i, i + (1 − δ)D], and 0 otherwise.
    Let s_i(t) = (1/(δD)) · ∫_{t−δD}^{t} (s'_i(r) + s''_i(r)) dr.
end on

3.4 Other Extensions

In Appendix H we also consider the General Speed Scaling problem (the problem with general deadlines) and show that a more sophisticated method allows us to robustify any algorithm even in this more general setting. Hence, for this case we can also obtain an algorithm that is almost optimal in the consistency case and always robust.

The careful reader may have noted that one can craft instances on which the error function err is very sensitive to small shifts in the prediction. An illustrative example is as follows. Consider a predicted workload w^pred defined by w^pred_i = 1 for those time steps i that are divisible by some large constant, and w^pred_i = 0 for all other time steps. If the real instance w^real is a small shift of w^pred, say w^real_{i+1} = w^pred_i, then the prediction error err(w^real, w^pred) is large, although w^pred intuitively forms a good prediction of w^real. To overcome this sensitivity, we first generalize the definition of err to err_η, which is tolerant to small shifts in the workload. In particular, err_η(w^real, w^pred) = 0 for the example given above. We then give a generic method for transforming an algorithm so as to obtain guarantees with respect to err_η instead of err at a small loss. Details can be found in Appendix F.

4 Experiments

In this section, we test the LAS algorithm on both synthetic and real datasets. We calculate the competitive ratios with respect to the offline optimum. We fix α = 3 in all our experiments, as this value models the power consumption of modern processors (see Bansal et al. [2]). For each experiment, we compare our LAS algorithm to the three main online algorithms that exist for this problem, which are AVR and OA by Yao et al. [17] and BKP by Bansal et al. [2]. We note that the code is publicly available at https://github.com/andreasr27/LAS.
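Putting the pieces together, a compact sketch of Algorithm 1's speed computation might look as follows (our illustration; the numerical averaging grid and all names are assumptions, and the offline schedule (a_i, b_i, c_i) is assumed to be produced by the algorithm of [17] on the shrunk instance):

```python
def choose_delta(eps, alpha):
    # Solve ((1 + d) / (1 - d))^alpha = 1 + eps for d, as in Algorithm 1.
    r = (1 + eps) ** (1 / alpha)
    return (r - 1) / (r + 1)

def las_speed(i, w_real_i, w_pred_i, a_i, b_i, c_i, D, eps, alpha):
    """Speed s_i of LAS: LAS-TRUST run with shrunk deadline, then delta-convolved."""
    d = choose_delta(eps, alpha)
    D_shrunk = (1 - d) * D

    def trust(t):  # s'_i + s''_i with deadline (1 - d) * D
        s1 = min(w_real_i / (b_i - a_i), c_i) if a_i <= t <= b_i else 0.0
        s2 = max(0.0, w_real_i - w_pred_i) / D_shrunk if i <= t <= i + D_shrunk else 0.0
        return s1 + s2

    def s(t, grid=256):  # numerically average trust over [t - d*D, t]
        h = d * D / grid
        return sum(trust(t - d * D + (k + 0.5) * h) for k in range(grid)) * h / (d * D)

    return s
```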
Artificial datasets.

In the synthetic data case, we mimic the request pattern of a typical data center application by simulating a bounded random walk. In the following we write Z ∼ U{m, M} when sampling an integer uniformly at random in the range [m, M]. Subsequently, we fix three integers s, m, M, where [m, M] defines the range in which the walk should stay. For each integral time i we sample X_i ∼ U{−s, s}. Then we set w_0 ∼ U{m, M} and w_{i+1} to be the median value of the list {m, w_i + X_i, M}; that is, if the value w_i + X_i remains in the predefined range we do not change it, otherwise we round it to the closest point in the range. For this type of ground truth instance we test our algorithm coupled with three different predictors: the accurate predictor, for which we set ˜w_i ∼ w_i + U{−s, s}; the random predictor, where we set ˜w_i ∼ U{m, M}; and the misleading predictor, for which ˜w_i = (M − w_i) + m. (A sketch of this setup in code is given at the end of this subsection.) In each case we perform 20 experiment runs.

The results are summarized in Table 1. In the first two cases (accurate and random predictors) we present the average competitive ratios of every algorithm over all runs. In contrast, for the last column (misleading predictor) we present the maximum competitive ratio of each algorithm taken over the 20 runs, to highlight the worst case robustness of LAS.

Table 1: Artificial dataset results. We used m = 20, M = 80, s = 5, T = 220 and D = 20.

Algorithm       Accurate   Random   Misleading
AVR             1.268      1.268    1.383
BKP             7.880      7.880    10.380
OA              1.199      1.199    1.361
LAS, ε = 0.…    …          …        …

We note that in the first case, where the predictor is relatively accurate but still noisy, LAS is consistently better than any online algorithm, achieving a competitive ratio close to 1 for small values of ε. In the second case, the predictor does not give us useful information about the future, since it is completely uncorrelated with the ground truth instance. In such a case, LAS achieves a performance similar to the best online algorithms. In the third case, the predictor tries to mislead our algorithm by creating a prediction which constitutes a symmetric (around (m + M)/2) random walk with respect to the true instance. When coupled with such a predictor, as expected, LAS performs worse than the best online algorithm, but it still maintains an acceptable competitive ratio. Furthermore, augmenting the robustness parameter ε, and thereby trusting the predictor less, improves the competitive ratio in this case.
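The instance generator and the three predictors referenced above can be sketched as follows (our code, mirroring the description; it is not the authors' implementation):

```python
import random

def random_walk_instance(T, m, M, s):
    """Bounded random walk: w_0 ~ U{m, M}, w_{i+1} = clamp(w_i + U{-s, s}, m, M)."""
    w = [random.randint(m, M)]
    for _ in range(T - 1):
        step = random.randint(-s, s)
        w.append(min(M, max(m, w[-1] + step)))
    return w

def predictor(w, kind, m, M, s):
    """The three predictors: accurate (noisy copy), random, and misleading."""
    if kind == "accurate":
        return [x + random.randint(-s, s) for x in w]
    if kind == "random":
        return [random.randint(m, M) for _ in w]
    return [(M - x) + m for x in w]  # misleading: mirrored around (m + M) / 2
```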
Real dataset.

We provide additional evidence that the LAS algorithm outperforms purely online algorithms by conducting experiments on the login requests to
BrightKite [5], a no longer functioning social network. We note that this dataset was previously used in the context of learning augmented algorithms by Lykouris and Vassilvitskii [13]. In order to emphasize the fact that even a very simple predictor can drastically improve the scheduling performance, we use the arguably simplest predictor possible: the access patterns of the previous day serve as the prediction for the current day (a sketch of this predictor follows at the end of the section). In Figure 2 we compare the performance of the LAS algorithm for different values of the robustness parameter ε with respect to AVR and OA. We did not include BKP, since its performance is substantially worse than that of all other algorithms. Note that our algorithm shows a substantial improvement with respect to both AVR and OA, while maintaining a low competitive ratio even when the prediction error is high (for instance in the last days). In the first days, where the prediction error is low, the smaller setting of ε (which trusts the prediction more) yields a slightly better average competitive ratio than the larger setting. However, when the prediction error is high, the larger setting of ε is better. On average from the first to the last day of the timeline, LAS obtains a better competitive ratio than both AVR and OA for both tested values of ε, thus beating the online algorithms in both cases.

Figure 2: From top to bottom: the first two graphs show the performance of LAS for two values of ε with respect to the online algorithms AVR and OA. The bottom graph presents the prediction error. The timeline was discretized in chunks of ten minutes and D was set to 20.

More experiments regarding the influence of the parameter α on the performance of the LAS algorithm can be found in Appendix I.
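The previous-day predictor mentioned above is equally simple; a sketch (ours, assuming the number of time steps per day is known):

```python
def previous_day_predictor(counts, steps_per_day):
    """Predict each time step's workload by the observed value one day earlier."""
    return [counts[i - steps_per_day] if i >= steps_per_day else 0
            for i in range(len(counts))]
```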
Broader impact

As climate change is a severe issue, minimizing the environmental impact of modern computer systems has become a priority. High energy consumption and the CO2 emissions related to it are among the main factors increasing the environmental impact of computer systems. While our work considers a specific problem related to scheduling, we would like to emphasize that a considerable percentage of real-world systems already have the ability to dynamically scale their computing resources to minimize their energy consumption. Thus, studying models (like the one presented in this paper) with this capability is a line of work with huge potential societal impact. In addition, although the analysis of the guarantees provided by our algorithm is not straightforward, the algorithm itself is relatively simple. This makes us optimistic that insights from this work can be used in practice, contributing to minimizing the environmental impact of computer infrastructures.
Acknowledgments and Disclosure of Funding

This research is supported by the Swiss National Science Foundation project 200021-184656 “Randomness in Problem Instances and Randomized Algorithms”. Andreas Maggiori was supported by the Swiss National Science Fund (SNSF) grant “Spatial Coupling of Graphical Models in Communications, Signal Processing, Computer Science and Statistical Physics”.

References

[1] Lachlan L. H. Andrew, Minghong Lin, and Adam Wierman. Optimality, fairness, and robustness in speed scaling designs. In
Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 37–48, 2010.

[2] Nikhil Bansal, Tracy Kimbrel, and Kirk Pruhs. Speed scaling to manage energy and temperature. J. ACM, 54(1):3:1–3:39, 2007. doi: 10.1145/1206035.1206038. URL https://doi.org/10.1145/1206035.1206038.

[3] Nikhil Bansal, David P. Bunde, Ho-Leung Chan, and Kirk Pruhs. Average rate speed scaling. In LATIN 2008: Theoretical Informatics, 8th Latin American Symposium, Búzios, Brazil, April 7-11, 2008, Proceedings, pages 240–251, 2008. doi: 10.1007/978-3-540-78773-0_21. URL https://doi.org/10.1007/978-3-540-78773-0_21.

[4] Jeff Barr. New – predictive scaling for EC2, powered by machine learning. AWS News Blog, November 2018. URL https://aws.amazon.com/blogs/aws/new-predictive-scaling-for-ec2-powered-by-machine-learning/.

[5] Eunjoon Cho, Seth A. Myers, and Jure Leskovec. Friendship and mobility: User movement in location-based social networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 1082–1090, New York, NY, USA, 2011. Association for Computing Machinery. ISBN 9781450308137. doi: 10.1145/2020408.2020579. URL https://doi.org/10.1145/2020408.2020579.

[6] Marco E. T. Gerards, Johann L. Hurink, and Philip K. F. Hölzenspies. A survey of offline algorithms for energy minimization under deadline constraints. J. Scheduling, 19(1):3–19, 2016. URL https://doi.org/10.1007/s10951-015-0463-8.

[7] Sreenivas Gollapudi and Debmalya Panigrahi. Online algorithms for rent-or-buy with expert advice. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 2319–2327, 2019. URL http://proceedings.mlr.press/v97/gollapudi19a.html.

[8] Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. Learning-based frequency estimation algorithms. In International Conference on Learning Representations (ICLR), 2019. URL https://openreview.net/forum?id=r1lohoCqY7.

[9] Craig Kitterman. Autoscaling Windows Azure applications. Microsoft Azure Blog, June 2013. URL https://azure.microsoft.com/de-de/blog/autoscaling-windows-azure-applications/.

[10] Rohan Kodialam. Optimal algorithms for ski rental with soft machine-learned predictions. CoRR, abs/1903.00092, 2019. URL http://arxiv.org/abs/1903.00092.

[11] Silvio Lattanzi, Thomas Lavastida, Benjamin Moseley, and Sergei Vassilvitskii. Online scheduling via learned weights. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, pages 1859–1877, 2020. doi: 10.1137/1.9781611975994.114. URL https://doi.org/10.1137/1.9781611975994.114.

[12] Russell Lee, Mohammad H. Hajiesmaili, and Jian Li. Learning-assisted competitive algorithms for peak-aware energy scheduling. CoRR, abs/1911.07972, 2019. URL http://arxiv.org/abs/1911.07972.

[13] Thodoris Lykouris and Sergei Vassilvitskii. Competitive caching with machine learned advice. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 3302–3311, 2018. URL http://proceedings.mlr.press/v80/lykouris18a.html.

[14] Andres Muñoz Medina and Sergei Vassilvitskii. Revenue optimization with approximate bid predictions. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 1858–1866, 2017. URL http://papers.nips.cc/paper/6782-revenue-optimization-with-approximate-bid-predictions.

[15] Manish Purohit, Zoya Svitkina, and Ravi Kumar. Improving online algorithms via ML predictions. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 9684–9693, 2018. URL http://papers.nips.cc/paper/8174-improving-online-algorithms-via-ml-predictions.

[16] Yinfeng Xu and Weijun Xu. Competitive algorithms for online leasing problem in probabilistic environments. In Advances in Neural Networks - ISNN 2004, International Symposium on Neural Networks, Dalian, China, August 19-21, 2004, Proceedings, Part II, pages 725–730, 2004. doi: 10.1007/978-3-540-28648-6_116. URL https://doi.org/10.1007/978-3-540-28648-6_116.

[17] F. Frances Yao, Alan J. Demers, and Scott Shenker. A scheduling model for reduced CPU energy. In 36th Annual Symposium on Foundations of Computer Science (FOCS 1995), pages 374–382, 1995. doi: 10.1109/SFCS.1995.492493. URL https://doi.org/10.1109/SFCS.1995.492493.
A Omitted Proofs from Section 3
Theorem 8.
For any given ε > 0, algorithm LAS constructs a schedule of cost at most

min{ (1 + ε) OPT + O(α/ε)^α · err, O(α/ε)^α · OPT }.

Proof.
We choose δ such that ((1 + δ)/(1 − δ))^α = 1 + ε. Note that δ ≥ ε/(6α), so 1/δ = O(α/ε). By Claim 3 we know that

OPT(w^real, (1 − δ)D, T) ≤ (1/(1 − δ))^{α−1} · OPT.

Hence, by Theorem 2, algorithm LAS-TRUST
constructs a schedule with cost at most

((1 + δ)/(1 − δ))^α · OPT + O(1/δ)^α · err.

Finally, we apply ROBUSTIFY
and with Theorem 7 obtain a bound of

min{ ((1 + δ)/(1 − δ))^α OPT + O(1/δ)^α err, O(1/δ)^α OPT } ≤ min{ (1 + ε) OPT + O(α/ε)^α err, O(α/ε)^α OPT }.

B Pure online algorithms for uniform deadlines
Since most related results concern the general speed scaling problem, we give some insights into the uniform speed scaling problem in the online setting without predictions. We first give a lower bound on the competitive ratio of any online algorithm for the simplest case where D = 2, and then provide an almost tight analysis of the competitive ratio of AVR.

Theorem 9.
There is no (randomized) online algorithm with an (expected) competitive ratio better than
Ω((6/5)^α).

Proof. Consider D = 2 and two instances J_1 and J_2. Instance J_1 consists of only one job that is released at time 0 with workload 1, and J_2 consists of the same first job together with a second job which starts at time 1 with workload 2.

In both instances, the optimal schedule runs at uniform speed at all times. In the first instance, it runs the single job for 2 units of time at speed 1/2. The energy cost is therefore 1/2^{α−1}. In the second instance, it first runs the first job at speed 1 for one unit of time and then the second job at speed 1 for 2 units of time. Hence, it has an energy cost of 3.

Now consider an online algorithm. Before time 1 both instances are identical and the algorithm therefore behaves the same. In particular, it has to decide how much work of job 1 to process between time 0 and 1. Let us fix some γ ≥ 0 as a threshold for the amount of work dedicated to job 1 by the algorithm before time 1. We have the following two cases depending on the instance.

1. If the algorithm processes more than γ units of work on job 1 before time 1, then for instance J_1 the energy cost is at least γ^α. Hence the competitive ratio is at least γ^α · 2^{α−1}.

2. On the contrary, if the algorithm does less than γ units of work before the release of the second job, then in instance J_2 the algorithm has to complete at least 3 − γ units of work between time 1 and 3. Hence, its competitive ratio is at least (2/3) · ((3 − γ)/2)^α.

Choosing γ such that these two competitive ratios are equal gives γ = 3/(4^{(α−1)/α} · 3^{1/α} + 1) and yields a lower bound on the competitive ratio of at least

2^{α−1} · (3/(4^{(α−1)/α} · 3^{1/α} + 1))^α.

This term asymptotically approaches (1/2) · (6/5)^α, and this already proves the theorem for deterministic algorithms. More precisely, it proves that any deterministic algorithm has a competitive ratio of at least Ω((6/5)^α) on at least one of the two instances J_1 or J_2. Hence, by defining a probability distribution over inputs such that p(J_1) = p(J_2) = 1/2 and applying Yao's minimax principle, we get that the expected competitive ratio of any randomized online algorithm is at least

(1/2) · 2^{α−1} · (3/(4^{(α−1)/α} · 3^{1/α} + 1))^α,

which again gives Ω((6/5)^α) as a lower bound, this time against randomized algorithms.

We now turn to the more specific case of the AVR algorithm with the following two results. We recall that the AVR algorithm was shown to be 2^{α−1} · α^α-competitive by Yao et al. [17] in the general deadlines case. In the case of uniform deadlines, the competitive ratio of AVR is actually much better and the proofs are much less technical than the original analysis of Yao et al. Recall that for each job i with workload w_i, release r_i, and deadline d_i, AVR defines a speed s_i(t) = w_i/(d_i − r_i) if t ∈ [r_i, d_i] and 0 otherwise.
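As a reference point for these bounds, a direct implementation of AVR for the uniform deadlines case takes only a few lines (our sketch; time is discretized at integer granularity for simplicity):

```python
def avr_speeds(w, D):
    """AVERAGE RATE heuristic with uniform deadlines: job i runs at w[i]/D on [i, i+D).

    Returns the total speed s(t) sampled at unit time steps t = 0, 1, ...
    """
    T = len(w) + D
    s = [0.0] * T
    for i, work in enumerate(w):
        for t in range(i, i + D):
            s[t] += work / D
    return s
```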
Theorem 10. AVR is 2^α-competitive for the uniform speed scaling problem.

Proof. Let (w, D, T) be a job instance and let s^OPT be the speed function of the optimal schedule for this instance.
Let s^AVR be the speed function produced by the AVERAGE RATE heuristic on the same instance. It suffices to show that for any time t we have s^AVR(t) ≤ 2 · s^OPT(t). Fix some t. We assume w.l.o.g. that the optimal schedule runs each job j isolated for a total time of p*_j. By optimality of the schedule, the speed during this time is uniform, i.e., exactly w_j/p*_j. Denote by j_t the job that is processed in the optimal schedule at time t.

Let j be some job with r_j ≤ t ≤ r_j + D. It must be that

w_j/p*_j ≤ w_{j_t}/p*_{j_t} = s^OPT(t).   (2)

Note that all jobs j with r_j ≤ t ≤ r_j + D are processed completely between t − D and t + D. Therefore,

Σ_{j: r_j ≤ t ≤ r_j + D} p*_j ≤ 2D.

With (2) it follows that

Σ_{j: r_j ≤ t ≤ r_j + D} w_j ≤ s^OPT(t) · Σ_{j: r_j ≤ t ≤ r_j + D} p*_j ≤ 2D · s^OPT(t).

We conclude that

s^AVR(t) = Σ_{j: r_j ≤ t ≤ r_j + D} w_j/D ≤ 2 · s^OPT(t).

Next, we show that our upper bound on the exponential dependency in α of the competitive ratio of AVR (in Theorem 10) is tight for the uniform deadlines case.
Theorem 11. Asymptotically (as α approaches ∞), the competitive ratio of the AVR algorithm for the uniform deadlines case is at least 2^α/(2√e · α).
Proof. Assume α > 1 and consider a two-job instance with one job arriving at time 0 with workload 1 and one job arriving at time (1 − 1/α)D with workload 1. One can check that the optimal schedule runs at constant speed throughout the whole instance, for a total energy of

(2/((2 − 1/α)D))^α · (2 − 1/α)D.
On the other hand, on the interval [(1 − 1/α)D, D], AVR runs at speed 2/D. This implies the following lower bound on the competitive ratio:

((2/D)^α · (1/α)D) / ((2/((2 − 1/α)D))^α · (2 − 1/α)D) = (2^{α−1}/α) · (1 − 1/(2α))^{α−1},

which approaches 2^α/(2√e · α) as α tends to infinity.

C Impossibility results for learning augmented speed scaling
This section is devoted to proving some impossibility results about learning augmented algorithms in the context of speed scaling. We first prove that our trade-offs between consistency and robustness are essentially optimal. Again, we describe an instance as a triple (w, D, T).

Theorem 12.
Assume a deterministic learning enhanced algorithm is (1 + ε/2)^{α−2}-consistent for some α > 2 and any small enough constant ε > 0 (independently of D). Then the worst case competitive ratio of this algorithm cannot be better than Ω(1/ε)^{α−1}.

Proof. Fix D big enough so that ⌈εD⌉ ≤ 2εD. Consider two different job instances J_1 and J_2: J_1 contains only one job of workload 1 released at time 0, and J_2 contains an additional job of workload 1/ε released at time ⌈εD⌉. On the first instance, the optimal cost is 1/D^{α−1}, while the optimal energy cost for J_2 is at most

(1/⌈εD⌉)^{α−1} + D/(εD)^α ≤ (1/ε)^α · (1 + ε)/D^{α−1}.

Assume the algorithm is given the job of workload 1 released at time 0, and additionally the prediction consists of one job of workload 1/ε released at time ⌈εD⌉. Note that until time ⌈εD⌉ the algorithm cannot tell the difference between instances J_1 and J_2.

Depending on how much the algorithm works before time ⌈εD⌉, we distinguish the following cases.

1. If the algorithm works more than 1/2, then the energy spent by the algorithm until time ⌈εD⌉ is at least

(1/2)^α / ⌈εD⌉^{α−1} = Ω(1/(εD))^{α−1}.
2. However, if it works less than 1/2, then on instance J_2 a total work of at least 1/ε + 1 −
1/2 = 1/2 + 1/ε remains to be done in D time units. Hence the energy consumption on instance J_2 is at least (1/2 + 1/ε)^α / D^{α−1}.

If the algorithm is (1 + ε/2)^{α−2}-consistent, then it must be that the algorithm works more than 1/2 before time ⌈εD⌉; otherwise, by the second case of the analysis, the competitive ratio would be at least

(1/2 + 1/ε)^α / ((1/ε)^α (1 + ε)) = (1 + ε/2)^α / (1 + ε) > (1 + ε/2)^{α−2},

where the last inequality holds for α > 2 and ε small enough. However, this means that if the algorithm runs on instance J_1 (i.e., the prediction is incorrect), then by the first case the competitive ratio is at least Ω(1/ε)^{α−1}.

We then argue that one cannot hope to rely on an ℓ_p norm with p < α to measure the error.

Theorem 13.
Fix some α and D, and let p be such that p < α. Suppose there is an algorithm which, given some prediction w^pred, computes a solution of value at most C · OPT + C' · ||w − w^pred||_p^p. Here C and C' are constants that can be chosen as an arbitrary function of α and D. Then there also exists an algorithm for the online problem (without predictions) which is (C + ε)-competitive for every ε > 0.
In other words, predictions do not help if we choose p < α.

Proof.
In the online algorithm we use the prediction-based algorithm A_P as a black box. We set the prediction ˜w to all zeros. We forward each job to A_P, but scale its work by a large factor M. It is obvious that by scaling, the optimum of the instance increases exactly by a factor M^α. The error in the prediction, however, increases less:

||M · w − M · w^pred||_p^p = M^p · ||w − w^pred||_p^p.

We run the jobs as A_P does, but scale them down by M again. Thus, we get a schedule of value

M^{−α} (M^α · C · OPT + M^p · C' · ||w − w^pred||_p^p) = C · OPT + M^{p−α} · C' · ||w − w^pred||_p^p.   (3)

Now if we choose M large enough, the second term in (3) becomes insignificant. We relate the prediction error to the optimum. First note that OPT ≥ (1/D^α) · ||w||_α^α, since the optimal solution cannot be less expensive than running all jobs i disjointly at speed w_i/D for time D. Second, note that ||w||_p^p ≤ ||w||_α^α, since |x|^p ≤ |x|^α for any integer x ≥ 0 (recall that we assumed our workloads to be integral). Hence we get that

||w − w^pred||_p^p = ||w||_p^p ≤ D^α · OPT.

Choosing M sufficiently large gives M^{p−α} C' D^α < ε, which implies that (3) is at most (C + ε) · OPT.

D Extension to evolving predictors
In this section, we extend the result of Section 3 to the case where the algorithm is provided several predictions over time. In particular, we assume that the algorithm is provided a new prediction at each integral time t. The setting is natural, as for a very long timeline it is intuitive that the predictor might renew its prediction over time. Since making a mistake in the prediction of a very far future also seems less hurtful than making a mistake in predicting an immediate future, we define a generalized error metric incorporating this idea.

Let 0 < λ < 1 be a parameter that describes how fast the confidence in a prediction deteriorates with the time until the expected arrival of the predicted job. Define the prediction received at time t as a workload vector w^pred(t). Recall that we are still considering the uniform deadlines case, hence an instance is defined as a triple (w, D, T). We then define the total error of a series of predictions as

err^(λ) = Σ_t Σ_{i=t+1}^∞ |w^real_i − w^pred_i(t)|^α · λ^{i−t}.

In the following we reduce the evolving predictions model to the single prediction one. We would like to prove similar results as in the single prediction setting with respect to err^(λ). In order to do so, we split the instance into parts of bounded time horizon, solve each one independently with a single prediction, and show that this also gives a guarantee based on err^(λ). In particular, we use the algorithm for the single prediction model as a black box.

The basic idea is as follows. If no job were to arrive for a duration of D, then the instance before this interval and the one after it could be solved independently. This is because any job in the earlier instance must finish before any job in the later instance can start. Hence, they cannot interfere. At random points, we ignore all jobs for a duration of D, thereby splitting the instance. The ignored jobs will be scheduled sub-optimally using AVR. If we only do this occasionally, i.e., after intervals of length ≫ D, the error we introduce is negligible.

We proceed by defining the splitting procedure formally. Consider the timeline as infinite in both directions. To split the instance, we define some interval length 2kD, where k ∈ N will be specified later. We split the infinite timeline into contiguous intervals of length 2kD. Moreover, we choose an offset x ∈ {0, ..., k − 1} uniformly at random. Using these values, we define intervals I_i = [(2(i − 1)k − 2x)D, (2ik − 2x)D). We denote by t_i = (2(i − 1)k − 2x)D the start time of the interval I_i. Consequently, the end of I_i is t_{i+1}.

In each interval I_i, we solve the instance given by the jobs entirely contained in this interval using our algorithm with the most recent prediction as of time t_i, i.e., w^pred(t_i), and schedule the jobs accordingly. We write s^ALG(i) for this schedule. The jobs that overlap two contiguous intervals are scheduled independently using the AVERAGE RATE heuristic. The schedule for the jobs overlapping with intervals I_i and I_{i+1} will be referred to as s^AVR(i).

It is easy to see that this algorithm is robust: the energy of the produced schedule is

∫ (Σ_i [s^ALG(i)(t) + s^AVR(i)(t)])^α dt ≤ 2^α ∫ (Σ_i s^ALG(i)(t))^α dt + 2^α ∫ (Σ_i s^AVR(i)(t))^α dt.
Moreover, the first term can be bounded by 2^α · O(α/ε)^α OPT using Theorem 8, and the second term can be bounded by 2^α · 2^α OPT because of Theorem 10. This gives an overall bound of O(α/ε)^α on the competitive ratio.

In the rest of the section we focus on the consistency/smoothness guarantee. We first bound the costs of s^ALG(i) and s^AVR(i) in isolation (ignoring potential interferences). Using these bounds, we derive an overall guarantee for the algorithm's cost.

Lemma 14.

E(Σ_i ∫ s^AVR(i)(t)^α dt) ≤ (2^α/k) · OPT.
Proof.
Fix some i and let us call O_i the job instance consisting of the jobs overlapping with both intervals I_i and I_{i+1}. By Theorem 10 the energy used by AVR is at most a 2^α-factor away from the optimum schedule. Hence,

∫ s^AVR(i)(t)^α dt ≤ 2^α OPT(O_i).

Now denote by s^OPT the speed function of the optimum schedule over the whole instance. Then
OPT(O_i) ≤ ∫_{t_{i+1}−D}^{t_{i+1}+D} s^OPT(t)^α dt.

This holds because s^OPT processes some work during [t_{i+1} − D, t_{i+1} + D] which has to include all of O_i. Hence, we have that

E(Σ_i OPT(O_i)) ≤ (1/k) Σ_{x=0}^{k−1} Σ_i ∫_{(2ik−2x)D−D}^{(2ik−2x)D+D} s^OPT(t)^α dt ≤ (1/k) ∫ s^OPT(t)^α dt = (1/k) · OPT.
The second inequality holds because the integrals are over disjoint ranges. Together with the bound on s^AVR(i), we get the claimed inequality.

Lemma 15.

Σ_i ∫ s^ALG(i)(t)^α dt ≤ (1 + ε) OPT + O(α/ε)^α · λ^{−2kD} · err^(λ).

Proof. Note that for any i,

Σ_{t=t_i+1}^{t_{i+1}} |w^real_t − w^pred_t(t_i)|^α ≤ λ^{−2kD} · Σ_{t=t_i+1}^{t_{i+1}} |w^real_t − w^pred_t(t_i)|^α · λ^{t−t_i}.

Hence,

Σ_i Σ_{t=t_i+1}^{t_{i+1}} |w^real_t − w^pred_t(t_i)|^α ≤ λ^{−2kD} · err^(λ).

Using Theorem 8 for each ∫ s^ALG(i)(t)^α dt, we get a bound depending on Σ_{t=t_i+1}^{t_{i+1}} |w^real_t − w^pred_t(t_i)|^α. Summing over i and using the inequality above finishes the proof of the lemma.

We are ready to state the consistency/smoothness guarantee of the splitting algorithm.

Theorem 16.
With robustness parameter O(ε/α), the splitting algorithm produces in expectation a schedule of cost at most

(1 + ε) OPT + O(α/ε)^α · λ^{−(D/ε)·O(α/ε)^α} · err^(λ).

In other words, we get the same guarantee as in the single prediction case, except that the dependency on the error is larger by a factor of λ^{−(D/ε)·O(α/ε)^α}. The exponential dependency on D may seem unsatisfying, but (1) it cannot be avoided (see Theorem 17) and (2) for moderate values of λ, e.g. λ = 1 − 1/D, this exponential dependency vanishes.

Proof.
We will make use of the following inequality: for all a, b ≥ 0 and 0 < δ ≤ 1, it holds that

(a + b)^α ≤ (1 + δ) a^α + (3α/δ)^α b^α.

This follows from a simple case distinction whether b ≤ a · δ/(2α). In expectation, the cost of the algorithm is bounded by

E[∫ (Σ_i [s^ALG(i)(t) + s^AVR(i)(t)])^α dt] ≤ (1 + ε) E[∫ (Σ_i s^ALG(i)(t))^α dt] + (3α/ε)^α E[∫ (Σ_i s^AVR(i)(t))^α dt] ≤ (1 + ε) E[Σ_i ∫ s^ALG(i)(t)^α dt] + (1/k)(6α/ε)^α OPT.

By choosing k = (1/ε)(6α/ε)^α the latter term becomes ε · OPT. With Lemma 15 we can bound the term above by

(1 + ε) OPT + O(α/ε)^α · λ^{−(D/ε)·O(α/ε)^α} · err^(λ).

Scaling ε by a constant yields the claimed guarantee.

We complement the result of this section with an impossibility result. We allow the parameter λ in the definition of err^(λ) to be a function of D, in which case we write λ(D).

Theorem 17.
Let the error err^(λ) in the evolving prediction model be defined with some 0 < λ(D) < 1 that can depend on D. Suppose there is an algorithm which computes a solution of value at most C · OPT + C'(D) · err^(λ), where C is independent of D and

C'(D) = o( (1 − λ(D)^D) / (λ(D)^D · D^α) ).

Then there also exists an algorithm for the online problem (without predictions) which is (C + ε)-competitive for every ε > 0.
In particular, note that for λ independent of D, this shows that an exponential dependency on D is needed in C'(D), as we get in Theorem 16.

Proof.
The structure of the proof is similar to that of Theorem 13. We pass an instance to the assumed algorithm, but set the prediction to all zeros. Unlike the previous proof, we keep the same workloads when passing the jobs, but subdivide D into D · k time steps, where k will be specified later. This will decrease the cost of every solution by a factor k^{α−1}.

Take an instance with interval length D. As in the proof of Theorem 13, we have that ||w^real||_α^α ≤ D^α · OPT. Consider the error parameter err^(λ)' for the instance with D' = D · k. We observe that

err^(λ)' = Σ_t Σ_{i=t+1}^∞ |w^real_{k·i}|^α · λ(D')^{k(i−t)} ≤ ||w^real||_α^α · Σ_{i=1}^∞ λ(D')^{k·i} ≤ ||w^real||_α^α · λ(D')^k/(1 − λ(D')^k) ≤ D^α · (λ(D')^k/(1 − λ(D')^k)) · OPT.
Hence, by definition the algorithm produces a solution of cost

C · OPT/k^{α−1} + C'(D') · err^(λ)' ≤ (C/k^{α−1} + D^α · (λ(D')^k/(1 − λ(D')^k)) · C'(D')) · OPT

for the subdivided instance. Transferring it back to the original instance, we get a cost of

(C + k^{α−1} · D^α · (λ(D')^k/(1 − λ(D')^k)) · C'(D')) · OPT.
Therefore, if k^{α−1} · (λ(D · k)^k/(1 − λ(D · k)^k)) · C'(D · k) tends to 0 as k grows, then for any ε > 0 we can fix k big enough so that the cost of the algorithm is at most (C + ε) · OPT.

E A shrinking lemma
Recall that by applying the earliest-deadline-first policy, we can normalize every schedule to run at most one job at each time; we say it is run isolated. Moreover, if a job is run isolated, it is always better to run it at a uniform speed (by convexity of x ↦ x^α on x ≥ 0). Hence, an optimal schedule can be characterized solely by the total time p_j for which each job is run. Given such p_j, we give a necessary and sufficient condition for when a schedule that runs each job isolated for p_j time exists. Note that we assume we are in the general deadlines case: each job j comes with a release time r_j and a deadline d_j, and the EDF policy might cause some jobs to be preempted.

Lemma 18.
Let there be a set of n jobs with release times r_j and deadlines d_j for each job j. Let p_j denote the total duration for which j should be processed. Scheduling the jobs isolated earliest-deadline-first, with the constraint to never run a job before its release time, will complete every job j before time d_j if and only if for every interval [t, t'] it holds that

Σ_{j: t ≤ r_j, d_j ≤ t'} p_j ≤ t' − t.   (4)

Proof.
For the one direction, let $t, t'$ be such that (4) is not fulfilled. Since the jobs with $t \leq r_j$ cannot be processed before $t$, the last such job $j'$ to be completed must finish after
\[ t + \sum_{j:\, t \leq r_j,\, d_j \leq t'} p_j > t + t' - t = t' \geq d_{j'}. \]
For the other direction, suppose that some job is not completed by its deadline; we exhibit an interval $[t, t']$ violating (4). To this end, let $j'$ be the first job that finishes strictly after $d_{j'}$ and consider the interval $I_0 = [r_{j'}, d_{j'}]$. We now define the following operation that transforms an interval $I_k$ into an interval $I_{k+1}$: let $t_{\inf}$ be the smallest release time among all jobs that are processed in interval $I_k$, and define $I_{k+1} = [t_{\inf}, d_{j'}]$. We apply this operation iteratively to obtain interval $I_{k+1}$ from interval $I_k$. We claim the following properties, which we prove by induction:
1. For any $k \geq 0$, the machine is never idle in interval $I_k$.
2. For any $k \geq 0$, all jobs that are processed in $I_k$ have a deadline at most $d_{j'}$.

For $I_0 = [r_{j'}, d_{j'}]$: since job $j'$ is not finished by time $d_{j'}$, the machine is never idle in that interval. Additionally, if a job is processed in this interval, its deadline must be earlier than $d_{j'}$, since we process in EDF order. Assume both items hold for $I_k$ and consider $I_{k+1}$, which we denote by $[a_{k+1}, d_{j'}]$. By construction, there is a job, denoted $j_{k+1}$, released at time $a_{k+1}$ that is not finished by time $a_k$. Therefore the machine cannot be idle at any time in $[a_{k+1}, a_k]$, hence at any time in $I_{k+1}$ by the induction hypothesis. Furthermore, consider a job processed in $I_{k+1} \setminus I_k$. Its deadline must be earlier than the deadline of job $j_{k+1}$. But job $j_{k+1}$ is processed in interval $I_k$, which implies that its deadline is earlier than $d_{j'}$; this ends the induction.

Denote by $k'$ the first index such that $I_{k'} = I_{k'+1}$, and define $I_{\infty} = I_{k'}$. By construction, all jobs processed in $I_{\infty}$ have their release time in $I_{\infty}$, and by induction the machine is never idle in this interval and all jobs processed in $I_{\infty}$ have their deadline in $I_{\infty}$. Since job $j'$ is not finished by time $d_{j'}$, the previous remarks give
\[ \sum_{j:\, r_j, d_j \in I_{\infty}} p_j > |I_{\infty}|, \]
which yields a counterexample to (4).

We can now prove two shrinking lemmas that are needed in the procedure ROBUSTIFY and its generalization to general deadlines.
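Before turning to the shrinking lemmas, note that condition (4) is easy to test directly, since it suffices to check intervals whose endpoints are release times and deadlines. A minimal sketch (the instance encoding is our own, purely for illustration):

```python
from itertools import product

def edf_feasible(jobs):
    """jobs: list of (r, d, p): release time, deadline, total processing time.
    Checks condition (4): for every interval [t, t'], the jobs fully contained
    in it (t <= r_j and d_j <= t') have total processing time at most t' - t."""
    releases = sorted({r for r, _, _ in jobs})
    deadlines = sorted({d for _, d, _ in jobs})
    for t, t2 in product(releases, deadlines):
        if t2 <= t:
            continue
        if sum(p for r, d, p in jobs if t <= r and d <= t2) > t2 - t:
            return False
    return True

assert edf_feasible([(0, 2, 1.0), (0, 2, 1.0)])      # fits exactly
assert not edf_feasible([(0, 1, 0.8), (0, 1, 0.8)])  # overloads [0, 1]
```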
Lemma 19.
Let $0 \leq \mu < 1$. For any instance $I$, consider the instance $I'$ where the deadline of job $j$ is set to $d'_j = r_j + (1-\mu)(d_j - r_j)$ (i.e., we shrink the window of each job by a factor $(1-\mu)$). Then
\[ \mathrm{OPT}(I') \leq \frac{\mathrm{OPT}(I)}{(1-\mu)^{\alpha-1}}. \]
Additionally, assuming $0 \leq \mu < 1/2$, consider the instance $I''$ where the deadline of job $j$ is set to $d''_j = r_j + (1-\mu)(d_j - r_j)$ and the release time is set to $r''_j = r_j + \mu(d_j - r_j)$. Then
\[ \mathrm{OPT}(I'') \leq \frac{\mathrm{OPT}(I)}{(1-2\mu)^{\alpha-1}}. \]

Proof.
W.l.o.g. we can assume that the optimal schedule $s$ for $I$ runs each job isolated and at a uniform speed. By optimality of the schedule and convexity, each job $j$ must be run at a constant speed $s_j$ for a total duration of $p_j$. Consider the first case and define a speed $s'_j = \frac{s_j}{1-\mu}$ for all $j$ (hence the total processing time becomes $p'_j = (1-\mu) \cdot p_j$).

Assume now that in the new instance $I'$ we run the jobs earliest-deadline-first with the constraint that no job is run before its release time (with the processing times $p'_j$). We will prove using Lemma 18 that all deadlines are satisfied. Consider an interval $[t, t']$. We then have
\[ \sum_{j:\, t \leq r_j,\, d'_j \leq t'} p'_j = (1-\mu) \cdot \sum_{j:\, t \leq r_j,\, d'_j \leq t'} p_j \leq (1-\mu) \cdot \sum_{j:\, t \leq r_j,\, d_j \leq \frac{t'-\mu t}{1-\mu}} p_j, \]
where the last inequality comes from the fact that $t' \geq d'_j = d_j - \mu(d_j - r_j)$, which implies that $d_j \leq \frac{t' - \mu r_j}{1-\mu} \leq \frac{t' - \mu t}{1-\mu}$ by using $r_j \geq t$. By Lemma 18 and the fact that $s$ is a feasible schedule for $I$,
we have that
\[ \sum_{j:\, t \leq r_j,\, d'_j \leq t'} p'_j \leq (1-\mu) \cdot \left(\frac{t'-\mu t}{1-\mu} - t\right) = (1-\mu) \cdot \frac{t'-t}{1-\mu} = t' - t, \]
which implies by Lemma 18 that running all jobs EDF with processing times $p'_j$ satisfies all deadlines $d'_j$. Now notice that the cost of this schedule is at most $\frac{1}{(1-\mu)^{\alpha-1}}$ times that of the original schedule $s$, which proves the first bound (each job is run $\frac{1}{1-\mu}$ times faster but for a time $(1-\mu)$ times shorter).

The proof of the second case is similar. Note that for any $[t, t']$, if
\[ d''_j = r_j + (1-\mu)(d_j - r_j) = (1-\mu)d_j + \mu r_j \leq t' \quad\text{and}\quad r''_j = r_j + \mu(d_j - r_j) = (1-\mu)r_j + \mu d_j \geq t, \]
then we have
\[ (1-\mu)d_j \leq t' - \mu r_j \leq t' - \frac{\mu}{1-\mu}(t - \mu d_j) \iff (1-\mu)d_j - \frac{\mu^2}{1-\mu}\, d_j \leq t' - \frac{\mu}{1-\mu} \cdot t \iff d_j \big((1-\mu)^2 - \mu^2\big) \leq (1-\mu)t' - \mu t \iff d_j \leq \frac{(1-\mu)t' - \mu t}{1-2\mu}. \]
Similarly, we have
\[ (1-\mu)r_j \geq t - \mu d_j \geq t - \frac{\mu}{1-\mu}(t' - \mu r_j) \iff (1-\mu)r_j - \frac{\mu^2}{1-\mu}\, r_j \geq t - \frac{\mu}{1-\mu} \cdot t' \iff r_j \geq \frac{(1-\mu)t - \mu t'}{1-2\mu}. \]
Notice that
\[ \frac{(1-\mu)t' - \mu t}{1-2\mu} - \frac{(1-\mu)t - \mu t'}{1-2\mu} = \frac{t'-t}{1-2\mu}. \]
Therefore, if we set the speed at which each job is processed to $s''_j = \frac{s_j}{1-2\mu}$, then the processing time is $p''_j = (1-2\mu) \cdot p_j$ and we can write
\[ \sum_{j:\, t \leq r''_j,\, d''_j \leq t'} p''_j = (1-2\mu) \cdot \sum_{j:\, t \leq r''_j,\, d''_j \leq t'} p_j \leq (1-2\mu) \cdot \sum_{j:\, \frac{(1-\mu)t - \mu t'}{1-2\mu} \leq r_j,\, d_j \leq \frac{(1-\mu)t' - \mu t}{1-2\mu}} p_j \leq (1-2\mu) \cdot \frac{t'-t}{1-2\mu} = t' - t \]
by Lemma 18. Hence we can conclude similarly as in the previous case.
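The transformations in Lemma 19 are mechanical, and for a single-job instance the optimum has the closed form $w^{\alpha}/(d-r)^{\alpha-1}$, so the first bound can be checked directly (a sketch under our own encoding; it holds with equality in this one-job case):

```python
def shrink(jobs, mu, shift_release=False):
    """Shrink each window by a factor (1 - mu); with shift_release, also move
    the release forward by mu*(d - r), giving the instance I'' of Lemma 19."""
    return [(r + mu * (d - r) if shift_release else r,
             r + (1 - mu) * (d - r), w) for r, d, w in jobs]

def one_job_opt(r, d, w, alpha):
    # A single job is optimally run at uniform speed w/(d-r):
    # cost = (w/(d-r))^alpha * (d-r) = w^alpha / (d-r)^(alpha-1).
    return w ** alpha / (d - r) ** (alpha - 1)

alpha, mu = 3.0, 0.2
r, d, w = 0.0, 10.0, 5.0
r1, d1, w1 = shrink([(r, d, w)], mu)[0]
ratio = one_job_opt(r1, d1, w1, alpha) / one_job_opt(r, d, w, alpha)
assert ratio <= 1 / (1 - mu) ** (alpha - 1) + 1e-9
```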
F Making an algorithm noise tolerant

The idea for achieving noise tolerance is that, by Lemma 19, we know that if we delay each job's arrival slightly (e.g., by $\eta D$), we can still obtain a near optimal solution. This gives us time to reassign arriving jobs within a small interval in order to make the input more similar to the prediction. We first, in Section F.1, generalize the error function $\mathrm{err}$ to a more noise tolerant error function $\mathrm{err}_{\eta}$. We then, in Section F.2, give a general procedure for making an algorithm noise tolerant (see Theorem 20).

F.1 Noise tolerant measure of error

For motivation, recall the example given in the main body. Specifically, consider a predicted workload $w^{\mathrm{pred}}$ defined by $w^{\mathrm{pred}}_i = 1$ for those time steps $i$ that are divisible by some large constant, and $w^{\mathrm{pred}}_i = 0$ for all other time steps. If the real instance $w^{\mathrm{real}}$ is a small shift of $w^{\mathrm{pred}}$, say $w^{\mathrm{real}}_{i+1} = w^{\mathrm{pred}}_i$, then the prediction error $\mathrm{err}(w^{\mathrm{real}}, w^{\mathrm{pred}})$ is large, although $w^{\mathrm{pred}}$ intuitively forms a good prediction of $w^{\mathrm{real}}$. To overcome this sensitivity to noise, we generalize the definition of $\mathrm{err}$. For two workload vectors $w, w'$ and a parameter $\eta \geq 0$, we say that $w$ is in the $\eta$-neighborhood of $w'$, denoted by $w \in N_{\eta}(w')$, if $w$ can be obtained from $w'$ by moving workload at most $\eta D$ time steps forward or backward in time. Formally, $w \in N_{\eta}(w')$ if there exists a solution $\{x_{ij}\}$ to the following system of linear equations:
\[ w_i = \sum_{j=i-\eta D}^{i+\eta D} x_{ij} \quad \forall i, \qquad w'_j = \sum_{i=j-\eta D}^{j+\eta D} x_{ij} \quad \forall j. \]
The concept of $\eta$-neighborhood is inspired by the notion of earth mover's distance but is adapted to our setting. Intuitively, the variable $x_{ij}$ denotes how much of the load $w_i$ has been moved to time unit $j$ in order to obtain $w'$. Also note that it is a symmetric and reflexive relation, i.e., if $w \in N_{\eta}(w')$ then $w' \in N_{\eta}(w)$, and $w \in N_{\eta}(w)$.

We now generalize the measure of prediction error as follows. For a parameter $\eta \geq 0$, an instance $w^{\mathrm{real}}$, and a prediction $w^{\mathrm{pred}}$, we define the $\eta$-prediction error, denoted by $\mathrm{err}_{\eta}$, as
\[ \mathrm{err}_{\eta}(w^{\mathrm{real}}, w^{\mathrm{pred}}) = \min_{w \in N_{\eta}(w^{\mathrm{pred}})} \mathrm{err}(w^{\mathrm{real}}, w). \]
Note that by symmetry we have that $\mathrm{err}_{\eta}(w^{\mathrm{real}}, w^{\mathrm{pred}}) = \mathrm{err}_{\eta}(w^{\mathrm{pred}}, w^{\mathrm{real}})$. Furthermore, we have that $\mathrm{err}_{\eta} = \mathrm{err}$ if $\eta = 0$, but it may be much smaller for $\eta > 0$. To see this, consider the vectors $w^{\mathrm{pred}}$ and $w^{\mathrm{real}}_i = w^{\mathrm{pred}}_{i+1}$ given in the motivational example above. While $\mathrm{err}(w^{\mathrm{pred}}, w^{\mathrm{real}})$ is large, we have $\mathrm{err}_{\eta}(w^{\mathrm{pred}}, w^{\mathrm{real}}) = 0$ for any $\eta$ with $\eta D \geq 1$. Indeed, the definition of $\mathrm{err}_{\eta}$ is chosen exactly so as to allow for a certain amount of noise (calibrated by the parameter $\eta$) in the prediction.
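Membership in $N_{\eta}(w')$ is a small transportation-style feasibility problem, so it can be checked with any LP solver. A minimal sketch (our own illustration; it assumes SciPy is available and imposes $x_{ij} \geq 0$, which the interpretation of $x_{ij}$ as moved load suggests):

```python
import numpy as np
from scipy.optimize import linprog

def in_eta_neighborhood(w, w2, radius):
    """Feasibility of the system defining w in N_eta(w2): x_ij >= 0, supported
    on |i - j| <= radius (radius plays the role of eta*D), marginals w and w2."""
    n = len(w)
    pairs = [(i, j) for i in range(n) for j in range(n) if abs(i - j) <= radius]
    A = np.zeros((2 * n, len(pairs)))
    for k, (i, j) in enumerate(pairs):
        A[i, k] = 1.0      # row i:     sum_j x_ij = w_i
        A[n + j, k] = 1.0  # row n + j: sum_i x_ij = w2_j
    res = linprog(np.zeros(len(pairs)), A_eq=A, b_eq=np.concatenate([w, w2]),
                  bounds=[(0, None)] * len(pairs), method="highs")
    return res.status == 0  # 0 means a feasible transport was found

w_pred = np.array([0., 1., 0., 0., 1., 0.])
w_real = np.roll(w_pred, 1)  # the shifted instance from the example above
assert in_eta_neighborhood(w_real, w_pred, radius=1)      # so err_1 = 0 here
assert not in_eta_neighborhood(w_real, w_pred, radius=0)  # while err_0 > 0
```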
F.2 Noise tolerant procedure

We give a general procedure for making an algorithm A noise tolerant under the mild condition that A is monotone: we say that an algorithm is monotone if, given a prediction $w^{\mathrm{pred}}$ and duration $D$, the cost of scheduling a workload $w$ is at least as large as that of scheduling a workload $w'$ whenever $w \geq w'$ (coordinate-wise). That increasing the workload should only increase the cost of a schedule is a natural condition that, in particular, all our algorithms satisfy.

Theorem 20.
Suppose there is a monotone learning-augmented online algorithm A for the uniform speed scaling problem that, given prediction $w^{\mathrm{pred}}$, computes a schedule of an instance $w^{\mathrm{real}}$ of value at most
\[ \min\big\{C \cdot \mathrm{OPT} + C'\, \mathrm{err}(w^{\mathrm{real}}, w^{\mathrm{pred}}),\; C''\, \mathrm{OPT}\big\}. \]
Then, for every $\eta \geq 0$ and $\zeta > 0$, there is a learning-augmented online algorithm NOISE-ROBUST(A) that, given prediction $w^{\mathrm{pred}}$, computes a schedule of $w^{\mathrm{real}}$ of value at most $((1+\eta)(1+\zeta))^{O(\alpha)}$ times
\[ \min\big\{C \cdot \mathrm{OPT} + (1/\zeta)^{O(\alpha)}\, (C + C')\, \mathrm{err}_{\eta}(w^{\mathrm{real}}, w^{\mathrm{pred}}),\; C''\, \mathrm{OPT}\big\}. \]

The pseudo-code of the online algorithm NOISE-ROBUST(A),
obtained from A, is given in Algorithm 2. To simplify notation, we assume that $\eta D$ evaluates to an integer, and we have extended the vectors $w$ and $w'$ to take value $0$ outside the range $[0, T-D]$.

Algorithm 2 NOISE-ROBUST(A)
Input:
Algorithm A, prediction $w^{\mathrm{pred}}$, and parameters $\eta \geq 0$, $\zeta > 0$
Initialize A with prediction $\tilde{w}^{\mathrm{pred}}_i = (1+\zeta)\, w^{\mathrm{pred}}_{i-\eta D}$ and duration $(1-2\eta)D$
Let $w^{\mathrm{online}}$ and $\tilde{w}^{\mathrm{real}}$ be workload vectors, initialized to $0$
on time step $i$ do
    $W \leftarrow w^{\mathrm{real}}_i$
    for $j \in \{i-\eta D, \ldots, i+\eta D\}$ do
        if $w^{\mathrm{online}}_j + W \leq (1+\zeta)\, w^{\mathrm{pred}}_j$ then
            $x_{ij} \leftarrow W$; $W \leftarrow 0$; $w^{\mathrm{online}}_j \leftarrow w^{\mathrm{online}}_j + x_{ij}$
        else if $w^{\mathrm{online}}_j < (1+\zeta)\, w^{\mathrm{pred}}_j$ then
            $x_{ij} \leftarrow (1+\zeta)\, w^{\mathrm{pred}}_j - w^{\mathrm{online}}_j$; $W \leftarrow W - x_{ij}$; $w^{\mathrm{online}}_j \leftarrow (1+\zeta)\, w^{\mathrm{pred}}_j$
        end if
    end for
    // Distribute remaining workload $W$ evenly
    for $j \in \{i-\eta D, \ldots, i+\eta D\}$ do
        $x_{ij} \leftarrow x_{ij} + W/(2\eta D + 1)$; $w^{\mathrm{online}}_j \leftarrow w^{\mathrm{online}}_j + W/(2\eta D + 1)$
    end for
    $\tilde{w}^{\mathrm{real}}_i \leftarrow w^{\mathrm{online}}_{i-\eta D}$
    Feed the job with workload $\tilde{w}^{\mathrm{real}}_i$ to A
end on

[Figure 3: construction of $w^{\mathrm{online}}$ from $w^{\mathrm{real}}$ and $w^{\mathrm{pred}}$.]

The algorithm constructs a vector $w^{\mathrm{online}} \in N_{\eta}(w^{\mathrm{real}})$ while trying to minimize $\mathrm{err}(w^{\mathrm{online}}, w^{\mathrm{pred}})$. Each component $w^{\mathrm{online}}_i$ will be finalized at time $i + \eta D$. Hence, we forward the jobs to A with a delay of $\eta D$.

The vector is constructed as follows. Suppose a job $w^{\mathrm{real}}_i$ arrives. The algorithm first (the first for-loop of Algorithm 2) greedily assigns the workload to the time steps $j = i-\eta D, i-\eta D+1, \ldots, i+\eta D$ from left to right, subject to the constraint that no time step receives a workload higher than $(1+\zeta)\, w^{\mathrm{pred}}_j$. If not all of the workload of $w^{\mathrm{real}}_i$ was assigned in this way, the overflow is assigned uniformly to the time steps from $i-\eta D$ to $i+\eta D$ (the second for-loop). Since each $w^{\mathrm{online}}_j$ can only receive workload during time steps $j-\eta D, \ldots, j+\eta D$, it is finalized at time $j+\eta D$. Thus, at time $i$ we can safely forward $w^{\mathrm{online}}_{i-\eta D}$ to the algorithm A; hence, we set the workload of the algorithm's instance to $\tilde{w}^{\mathrm{real}}_i = w^{\mathrm{online}}_{i-\eta D}$ (the last two steps). This shift, together with the fact that a job $w^{\mathrm{real}}_i$ may be assigned to $w^{\mathrm{online}}_{i+\eta D}$, i.e., $\eta D$ time steps forward in time, is the reason why we run each job with an interval of length $(1-2\eta)D$. Shrinking the interval of each job allows us to make this shift and reassignment while still guaranteeing that each job is finished by its original deadline.

For an example, consider Figure 3. Here we assume that $\eta D = 1$ and, for illustrative purposes, that $\zeta = 0$. At time $0$, a workload $w^{\mathrm{real}}_0 = 1$ is released. The algorithm NOISE-ROBUST(A) then greedily constructs $w^{\mathrm{online}}$ by filling the available slots in $w^{\mathrm{pred}}_{-1}$, $w^{\mathrm{pred}}_0$, and $w^{\mathrm{pred}}_1$. Since $w^{\mathrm{pred}}_0 = 3$, it fits all of the workload of $w^{\mathrm{real}}_0$ at time $0$. Similarly, the workloads $w^{\mathrm{real}}_2$ and $w^{\mathrm{real}}_3$ both fit under the capacity given by $w^{\mathrm{pred}}$. Now consider the workload $w^{\mathrm{real}}_4 = 2$ released at time $4$. At this point, the capacity at time $4$ is fully occupied and there is one unit of capacity left at time $3$. Hence, NOISE-ROBUST(A) will first assign one unit of $w^{\mathrm{real}}_4$ to the third time slot and then split the remaining unit of workload uniformly across the time steps $3, 4, 5$. The obtained vector $w^{\mathrm{online}}$ is depicted on the right of Figure 3. The workload $w^{\mathrm{online}}$ is then fed online to the algorithm A (giving a schedule of $w^{\mathrm{online}}$ and thus of $w^{\mathrm{real}}$), so that at time $i$, A receives the job $\tilde{w}^{\mathrm{real}}_i = w^{\mathrm{online}}_{i-\eta D} = w^{\mathrm{online}}_{i-1}$ with a deadline of $i + (1-2\eta)D = i + D - 2$. This deadline is chosen so as to guarantee that a job is finished by A within its original deadline.
Indeed, by this selection, the last part of the job $w^{\mathrm{real}}_4$ that was assigned to $w^{\mathrm{online}}_5$ is fed to A at time $6$ and is guaranteed to finish by time $6 + (1-2\eta)D = 4 + D$, which is its original deadline.

Having described the algorithm, we proceed to analyze its guarantees, which will prove Theorem 20.
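For concreteness, the reassignment loop of Algorithm 2 (constructing $w^{\mathrm{online}}$; the forwarding to A is omitted) can be sketched as follows. This is our own toy implementation and data layout, not the authors' code:

```python
def noise_robust_reassign(w_real, w_pred, eta_D, zeta):
    """Greedily repack the arriving workload under the capacities
    (1+zeta)*w_pred, spreading any overflow uniformly, as in Algorithm 2.
    Slot j of w_online is stored at index j + eta_D (j may be negative)."""
    n = len(w_real)
    w_online = [0.0] * (n + 2 * eta_D)
    cap = lambda j: (1 + zeta) * (w_pred[j] if 0 <= j < n else 0.0)
    for i, W in enumerate(w_real):
        window = range(i - eta_D, i + eta_D + 1)
        for j in window:  # greedy left-to-right fill up to capacity
            take = min(W, max(cap(j) - w_online[j + eta_D], 0.0))
            w_online[j + eta_D] += take
            W -= take
        if W > 0:         # distribute the overflow uniformly
            for j in window:
                w_online[j + eta_D] += W / (2 * eta_D + 1)
    return w_online  # w_online[i - eta_D] is what would be fed to A at time i

print(noise_robust_reassign([1, 0, 1, 1, 2, 0], [3, 0, 1, 1, 1, 0],
                            eta_D=1, zeta=0.0))
```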
Analysis.

We start by analyzing the noise tolerance of NOISE-ROBUST(A).

Lemma 21.
The schedule computed by NOISE-ROBUST(A) has cost at most $(1+O(\eta))^{\alpha}\, C''\, \mathrm{OPT}$.

Proof.
Let OPT and OPT$'$ denote the cost of an optimum schedule of the original instance $w^{\mathrm{real}}$ with duration $D$ and of the instance $\tilde{w}^{\mathrm{real}}$ with duration $(1-2\eta)D$ fed to A, respectively. The lemma then follows by showing that $\mathrm{OPT}' \leq (1+O(\eta))^{\alpha}\, \mathrm{OPT}$.

To show this inequality, consider an optimal schedule $s$ of $w^{\mathrm{real}}$ subject to the constraint that every job $w^{\mathrm{real}}_i$ is scheduled within the time interval $[i + 2\eta D, i + (1-2\eta)D]$. By Lemma 19, the cost of this schedule is at most $(1+O(\eta))^{\alpha}\, \mathrm{OPT}$. The statement therefore follows by arguing that $s$ also gives a feasible schedule of $\tilde{w}^{\mathrm{real}}$ with duration $(1-2\eta)D$. To see this, note that NOISE-ROBUST(A) moves the workload $w^{\mathrm{real}}_i$ to a subset of $\tilde{w}^{\mathrm{real}}_i, \tilde{w}^{\mathrm{real}}_{i+1}, \ldots, \tilde{w}^{\mathrm{real}}_{i+2\eta D}$. All of these jobs are allowed to be processed during $[i + 2\eta D, i + (1-2\eta)D]$. It follows that the part of these jobs that corresponds to $w^{\mathrm{real}}_i$ can be processed in the computed schedule $s$ (whenever it processes $w^{\mathrm{real}}_i$), since $s$ processes that job in the time interval $[i + 2\eta D, i + (1-2\eta)D]$. By doing this "reverse-mapping" for every job, we can thus use $s$ as a schedule for the instance $\tilde{w}^{\mathrm{real}}$ with duration $(1-2\eta)D$.

We now proceed to analyze the consistency and smoothness. The following lemma is the main technical part of the analysis. We use the common notation $(a)_+$ for $\max\{a, 0\}$.

Lemma 22.
The workload vector $w^{\mathrm{online}}$ produced by NOISE-ROBUST(A) satisfies
\[ \sum_i \Big[\big(w^{\mathrm{online}}_i - (1+\zeta)\, w^{\mathrm{pred}}_i\big)_+\Big]^{\alpha} \leq (1/\zeta)^{O(\alpha)} \cdot \min_{w \in N_{\eta}(w^{\mathrm{real}})} \sum_i \Big[\big(w_i - w^{\mathrm{pred}}_i\big)_+\Big]^{\alpha}. \]

The more technical proof of this lemma is given in Section F.2.1. Here, we explain how it implies the consistency and smoothness bounds of Theorem 20. For a workload vector $w$, we use the notation $\mathrm{OPT}(w)$ and $\mathrm{OPT}'(w)$ to denote the cost of an optimal schedule of workload $w$ with duration $D$ and $(1-2\eta)D$, respectively. Now let $\hat{w}^{\mathrm{online}}$ be the workload vector defined by $\hat{w}^{\mathrm{online}}_i = \max\{w^{\mathrm{online}}_i, (1+\zeta)\, w^{\mathrm{pred}}_i\}$. We analyze the cost of the schedule produced by A for $\hat{w}^{\mathrm{online}}$ (shifted by $\eta D$). This also bounds the cost of running A with $\tilde{w}^{\mathrm{real}}$: since A is monotone, the cost of the schedule computed for the workload $\hat{w}^{\mathrm{online}}$ (shifted by $\eta D$) can only be greater than that computed for $\tilde{w}^{\mathrm{real}}$, which equals $w^{\mathrm{online}}$ (shifted by $\eta D$). Furthermore, we have by Lemma 22 that
\[ \mathrm{err}\big(\hat{w}^{\mathrm{online}}, (1+\zeta)\, w^{\mathrm{pred}}\big) = \sum_i \Big[\big(w^{\mathrm{online}}_i - (1+\zeta)\, w^{\mathrm{pred}}_i\big)_+\Big]^{\alpha} \leq (1/\zeta)^{O(\alpha)}\, \mathrm{err}_{\eta}(w^{\mathrm{real}}, w^{\mathrm{pred}}). \tag{5} \]
It follows from the assumptions on A that the schedule computed by NOISE-ROBUST(A) has cost at most
\[ C \cdot \mathrm{OPT}'(\hat{w}^{\mathrm{online}}) + C' \cdot \mathrm{err}\big(\hat{w}^{\mathrm{online}}, (1+\zeta)\, w^{\mathrm{pred}}\big) \leq C \cdot \mathrm{OPT}'(\hat{w}^{\mathrm{online}}) + (1/\zeta)^{O(\alpha)} \cdot C' \cdot \mathrm{err}_{\eta}(w^{\mathrm{real}}, w^{\mathrm{pred}}). \]
The following lemma implies the consistency and smoothness, as stated in Theorem 20, by relating
$\mathrm{OPT}'(\hat{w}^{\mathrm{online}})$ with the cost $\mathrm{OPT} = \mathrm{OPT}(w^{\mathrm{real}})$.

Lemma 23.
We have
\[ \mathrm{OPT}'(\hat{w}^{\mathrm{online}}) \leq ((1+\eta)(1+\zeta))^{O(\alpha)} \left(\mathrm{OPT}(w^{\mathrm{real}}) + (1/\zeta)^{O(\alpha)}\, \mathrm{err}_{\eta}(w^{\mathrm{real}}, w^{\mathrm{pred}})\right). \]

Proof.
By the exact same arguments as in the proof of Theorem 2, we have for any $\eta' > 0$,
\[ \mathrm{OPT}'(\hat{w}^{\mathrm{online}}) \leq (1+\eta')^{\alpha}\, \mathrm{OPT}'\big((1+\zeta)\, w^{\mathrm{pred}}\big) + O(1/\eta')^{\alpha}\, \mathrm{err}\big(\hat{w}^{\mathrm{online}}, (1+\zeta)\, w^{\mathrm{pred}}\big) \leq (1+\eta')^{\alpha}\, \mathrm{OPT}'\big((1+\zeta)\, w^{\mathrm{pred}}\big) + O(1/\eta')^{\alpha} (1/\zeta)^{O(\alpha)}\, \mathrm{err}_{\eta}(w^{\mathrm{real}}, w^{\mathrm{pred}}), \]
where we used (5) for the second inequality. By Lemma 19, decreasing the duration by a factor $(1-2\eta)$ only increases the cost by a factor $(1+O(\eta))^{\alpha}$, and so $\mathrm{OPT}'((1+\zeta)\, w^{\mathrm{pred}}) \leq (1+O(\eta))^{\alpha}\, \mathrm{OPT}((1+\zeta)\, w^{\mathrm{pred}})$. Furthermore, as a schedule for a workload $w^{\mathrm{pred}}$ gives a schedule for $(1+\zeta)\, w^{\mathrm{pred}}$ by increasing the speed by a factor $(1+\zeta)$, we get $\mathrm{OPT}'((1+\zeta)\, w^{\mathrm{pred}}) \leq (1+O(\eta))^{\alpha} (1+\zeta)^{\alpha}\, \mathrm{OPT}(w^{\mathrm{pred}})$. Hence, by choosing $\eta' = \zeta$,
\[ \mathrm{OPT}'(\hat{w}^{\mathrm{online}}) \leq (1+O(\eta))^{\alpha} (1+\zeta)^{O(\alpha)}\, \mathrm{OPT}(w^{\mathrm{pred}}) + (1/\zeta)^{O(\alpha)}\, \mathrm{err}_{\eta}(w^{\mathrm{real}}, w^{\mathrm{pred}}). \]
It remains to upper bound
$\mathrm{OPT}(w^{\mathrm{pred}})$ by $\mathrm{OPT}(w^{\mathrm{real}})$. Let $w = \operatorname{argmin}_{w \in N_{\eta}(w^{\mathrm{pred}})} \mathrm{err}(w, w^{\mathrm{real}})$, so that $\mathrm{err}_{\eta}(w^{\mathrm{real}}, w^{\mathrm{pred}}) = \mathrm{err}(w^{\mathrm{real}}, w)$. By again applying the arguments of Theorem 2, we have for any $\eta' > 0$,
\[ \mathrm{OPT}(w) \leq (1+\eta')^{\alpha}\, \mathrm{OPT}(w^{\mathrm{real}}) + O(1/\eta')^{\alpha}\, \mathrm{err}(w^{\mathrm{real}}, w). \]
Now consider an optimal schedule of $w$ subject to the constraint that for every time $t$ the job $w_t$ is scheduled within the interval $[t + \eta D, t + (1-\eta)D]$. By Lemma 19, this schedule has cost at most $(1+O(\eta))^{\alpha}\, \mathrm{OPT}(w)$. Observe that this schedule for $w$ also defines a feasible schedule for $w^{\mathrm{pred}}$, since the time of any job is shifted by at most $\eta D$ in $w$. Hence, by again selecting $\eta' = \zeta$,
\[ \mathrm{OPT}(w^{\mathrm{pred}}) \leq (1+O(\eta))^{\alpha}\, \mathrm{OPT}(w) \leq (1+O(\eta))^{\alpha} \left((1+\zeta)^{\alpha}\, \mathrm{OPT}(w^{\mathrm{real}}) + (1/\zeta)^{O(\alpha)}\, \mathrm{err}_{\eta}(w^{\mathrm{real}}, w^{\mathrm{pred}})\right). \]
Finally, by combining all inequalities, we get
\[ \mathrm{OPT}'(\hat{w}^{\mathrm{online}}) \leq (1+O(\eta))^{\alpha} (1+\zeta)^{O(\alpha)} \left(\mathrm{OPT}(w^{\mathrm{real}}) + (1/\zeta)^{O(\alpha)}\, \mathrm{err}_{\eta}(w^{\mathrm{real}}, w^{\mathrm{pred}})\right), \]
which proves the lemma.

F.2.1 Proof of Lemma 22
The lemma is trivially true if there were no jobs that had remaining workload to be assigned uniformly, i.e., if we always have $W = 0$ when the second for-loop (the uniform distribution step) of NOISE-ROBUST(A) is reached. So suppose that there was at least one such job, and consider the directed bipartite graph $G$ with bipartitions $A$ and $B$ defined as follows:
• $A$ contains a vertex for each component of $w^{\mathrm{real}}$ and $B$ contains one for each component of $w^{\mathrm{online}}$. In other words, $A$ and $B$ contain one vertex for each time unit.
• There is an arc from $i \in A$ to $j \in B$ if $|i - j| \leq \eta D$, that is, if $w^{\mathrm{real}}_i$ could potentially be assigned to $w^{\mathrm{online}}_j$.
• There is an arc from $j \in B$ to $i \in A$ if part of the workload of $w^{\mathrm{real}}_i$ was assigned to $w^{\mathrm{online}}_j$ by NOISE-ROBUST(A), i.e., if $x_{ij} > 0$.

Now let $t$ be the last time step such that the online algorithm had to assign the remaining workload of $w^{\mathrm{real}}_t$ uniformly. So, by selection, $t + \eta D$ is the last time step such that $w^{\mathrm{online}}_{t+\eta D} > (1+\zeta)\, w^{\mathrm{pred}}_{t+\eta D}$. For $k \geq 0$, define the sets
\[ A_k = \{i \in A : \text{the shortest path from } t \text{ to } i \text{ has length } 2k \text{ in } G\}, \qquad B_k = \{j \in B : \text{the shortest path from } t \text{ to } j \text{ has length } 2k+1 \text{ in } G\}. \]
Here $t$ stands for the corresponding vertex in $A$. The set $A_k$ consists of those time steps for which the corresponding jobs in $w^{\mathrm{real}}$ have been moved in $w^{\mathrm{online}}$ to the time slots in $B_{k-1}$ but not to any time slot in $B_{k-2}, B_{k-3}, \ldots, B_0$; and $B_k$ consists of all the time slots where the jobs corresponding to $A_k$ could have been assigned (but no job in $A_{k-1}, A_{k-2}, \ldots, A_0$ could have been assigned). By the selection of $t$ and the construction of $w^{\mathrm{online}}$, these sets satisfy the following two properties:

Claim 24.
The sets $(A_k, B_k)_{k \geq 0}$ satisfy:
• For any time step $j \in \bigcup_k B_k$ we have $w^{\mathrm{online}}_j \geq (1+\zeta)\, w^{\mathrm{pred}}_j$.
• For any two time steps $i_k \in A_k$ and $i_\ell \in A_\ell$ with $k > \ell$, we have $|i_k - i_\ell| \leq 2\eta D (k - \ell + 2)$.

Proof of claim. In the proof of the claim we use the notation $\ell(A_k)$ and $\ell(B_k)$ to denote the left-most (earliest) time step in $A_k$ and $B_k$, respectively. The proof is by induction on $k \geq 0$ with the following induction hypothesis (IH):
1. For any time step $j \in B_k$ we have $w^{\mathrm{online}}_j \geq (1+\zeta)\, w^{\mathrm{pred}}_j$.
2. $B_0 = \{t - \eta D, \ldots, t + \eta D\}$, and for any (non-empty) $B_k$ with $k > 0$ we have $B_k = \{\ell(B_k), \ldots, \ell(B_{k-1}) - 1\}$ and $\ell(B_{k-1}) - \ell(B_k) \leq 2\eta D$.

The first part of IH immediately implies the first part of the claim. The second part implies the second part of the claim as follows: any time step in $A_\ell$ has a time step in $B_\ell$ that differs by at most $\eta D$. Similarly, for any time step in $A_k$ there is a time step in $B_{k-1}$ at distance at most $\eta D$. Now, by the second part of the induction hypothesis, the distance between these time steps in $B_{k-1}$ and $B_\ell$ is at most $(k - \ell + 1)\, 2\eta D$.

We complete the proof by verifying the inductive hypothesis. For the base case $k = 0$, we have $B_0 = \{t - \eta D, \ldots, t + \eta D\}$ by definition, since $A_0 = \{t\}$. We also have that the first part of IH holds by the definition of NOISE-ROBUST(A) and the fact that the overflow of job $w^{\mathrm{real}}_t$ was uniformly assigned to these time steps.

For the inductive step, consider a time step $i \in A_k$. By definition, $w^{\mathrm{real}}_i$ was assigned to a time step in $B_{k-1}$ but to no time step in $B_{k-2} \cup \ldots \cup B_0$. Now suppose toward contradiction that there is a time step $j \in A_{k-1}$ such that $j < i$. But then, by the greedy strategy of NOISE-ROBUST(A) (jobs are assigned left-to-right), we reach the contradiction that $w^{\mathrm{real}}_i$ must have been assigned to a time step in $B_{k-2} \cup \ldots \cup B_0$ if $k \geq 2$, since then $w^{\mathrm{real}}_j$ is assigned to a time step in $B_{k-2}$. For $k = 1$, we have $j = t$, and so all time steps in $B_0$ were full (with respect to capacity $(1+\zeta)\, w^{\mathrm{pred}}$) after $t$ was processed. Hence, in this case, $w^{\mathrm{real}}_i$ could only be assigned to a time step in $B_0$ if it had overflow that was uniformly assigned by NOISE-ROBUST(A), which contradicts the selection of $t$.

We thus have that each time step in $A_k$ is smaller than the earliest time step in $A_{k-1}$. It follows that $B_k = \{\ell(B_k), \ldots, \ell(B_{k-1}) - 1\}$, where $\ell(B_k) = \ell(A_k) - \eta D$. The bound $\ell(B_{k-1}) - \ell(B_k) \leq 2\eta D$ then follows since, by definition, $\{\ell(A_k) - \eta D, \ldots, \ell(A_k) + \eta D\}$ must intersect $B_{k-1}$. This completes the inductive step for the second part of IH. For the first part, note that the job $w^{\mathrm{real}}_{\ell(A_k)}$ was also assigned to $B_{k-1}$ by NOISE-ROBUST(A). By the greedy left-to-right strategy, this only happens if the capacity of all time steps in $B_k$ is saturated.

Now let $p$ be the smallest index such that $w^{\mathrm{real}}(A_{p+1}) + w^{\mathrm{real}}(A_{p+2}) \leq \zeta' \sum_{i=0}^{p} w^{\mathrm{real}}(A_i)$, where we select $\zeta' = \zeta/8$. We have
\[ \sum_{i=0}^{p+1} w^{\mathrm{real}}(A_i) \geq \sum_{i=0}^{p} w^{\mathrm{online}}(B_i) \geq (1+\zeta) \sum_{i=0}^{p} w^{\mathrm{pred}}(B_i), \tag{6} \]
where the first inequality holds by the definition of the sets and the second is by the first part of the above claim.
In addition, by the selection of $p$,
\[ \sum_{i=0}^{p} w^{\mathrm{real}}(A_i) \geq (1-\zeta') \sum_{i=0}^{p+2} w^{\mathrm{real}}(A_i). \tag{7} \]
Now let $q = \max\{p - 2/(\zeta')^2,\, 0\}$. We claim the following inequality:
\[ \sum_{i=q}^{p} w^{\mathrm{real}}(A_i) \geq (1-\zeta') \sum_{i=0}^{p} w^{\mathrm{real}}(A_i). \tag{8} \]
The inequality is trivially true if $q = 0$. Otherwise, we have by the selection of $p$,
\[ \sum_{i=q}^{p} w^{\mathrm{real}}(A_i) = (1-\zeta') \sum_{i=q}^{p} w^{\mathrm{real}}(A_i) + \zeta' \sum_{i=q}^{p} w^{\mathrm{real}}(A_i) \geq (1-\zeta') \sum_{i=q}^{p} w^{\mathrm{real}}(A_i) + \frac{(p-q)(\zeta')^2}{2} \sum_{i=0}^{q-1} w^{\mathrm{real}}(A_i) \geq (1-\zeta') \sum_{i=q}^{p} w^{\mathrm{real}}(A_i) + \sum_{i=0}^{q-1} w^{\mathrm{real}}(A_i), \]
and so (8) holds.

We are now ready to complete the proof of the lemma. Let $w^*$ be a minimizer of the right-hand side, i.e.,
\[ w^* = \operatorname{argmin}_{w \in N_{\eta}(w^{\mathrm{real}})} \sum_i \Big[\big(w_i - w^{\mathrm{pred}}_i\big)_+\Big]^{\alpha}. \]
Divide the time steps of the instance into $T_0$, $B_{p+1}$, $T_1$ and $T_2$, where $T_0$ contains all time steps earlier than $\ell(B_{p+1})$, $T_1$ contains the time steps in $\bigcup_{i=0}^{p} B_i$, and $T_2$ contains the remaining time steps, i.e., those after $t + \eta D$. By the selection of $t$, we have $w^{\mathrm{online}}_i \leq (1+\zeta)\, w^{\mathrm{pred}}_i$ for all $i \in T_2$. We thus have that $\sum_i \big[(w^{\mathrm{online}}_i - (1+\zeta)\, w^{\mathrm{pred}}_i)_+\big]^{\alpha}$ equals
\[ \sum_{i \in T_0} \Big[\big(w^{\mathrm{online}}_i - (1+\zeta)\, w^{\mathrm{pred}}_i\big)_+\Big]^{\alpha} + \sum_{i \in B_{p+1} \cup T_1} \Big[\big(w^{\mathrm{online}}_i - (1+\zeta)\, w^{\mathrm{pred}}_i\big)_+\Big]^{\alpha}. \]
We start by analyzing the second sum. The only jobs in $w^{\mathrm{real}}$ that contribute to the workload of $w^{\mathrm{online}}$ at the time steps in $B_{p+1} \cup T_1$ are by definition those corresponding to time steps in $A_0 \cup \ldots \cup A_{p+2}$. In the worst case, $w^{\mathrm{pred}}$ is $0$ during these time steps and the jobs in $w^{\mathrm{real}}$ are uniformly assigned to the same $2\eta D + 1$ time steps. This gives us the upper bound
\[ \sum_{i \in B_{p+1} \cup T_1} \Big[\big(w^{\mathrm{online}}_i - (1+\zeta)\, w^{\mathrm{pred}}_i\big)_+\Big]^{\alpha} \leq \left(\frac{\sum_{i=0}^{p+2} w^{\mathrm{real}}(A_i)}{2\eta D + 1}\right)^{\alpha} \cdot (2\eta D + 1) \leq (1+2\zeta')^{\alpha} \left(\frac{\sum_{i=0}^{p} w^{\mathrm{real}}(A_i)}{2\eta D}\right)^{\alpha} 2\eta D. \]
At the same time, combining (6), (7), and (8) gives us
\[ \sum_{i=q}^{p} w^{\mathrm{real}}(A_i) \geq (1-\zeta')^2 (1+\zeta) \sum_{i=0}^{p} w^{\mathrm{pred}}(B_i) \geq (1+\zeta/2) \sum_{i=0}^{p} w^{\mathrm{pred}}(B_i). \]
By definition, the jobs in $w^{\mathrm{real}}$ corresponding to time steps $\bigcup_{k=q}^{p} A_k$ can only be assigned to $w^{\mathrm{online}}$ during the time steps $T_1 = \bigcup_{k=0}^{p} B_k$. Therefore, as the difference between the largest and smallest time in $\bigcup_{k=q}^{p} A_k$ is at most $2\eta D (p - q + 2)$ (second statement of the above claim), and thus the workload of those time steps can be assigned to at most $2\eta D (p - q + 4)$ time steps, we have
\[ \sum_{i \in T_1} \Big[\big(w^*_i - w^{\mathrm{pred}}_i\big)_+\Big]^{\alpha} \geq \left(\frac{\sum_{i=q}^{p} w^{\mathrm{real}}(A_i) - \sum_{i=0}^{p} w^{\mathrm{pred}}(B_i)}{(p-q+4) \cdot 2\eta D}\right)^{\alpha} \cdot (p-q+4) \cdot 2\eta D \geq \big(c \cdot \zeta^3\big)^{\alpha} \left(\frac{\sum_{i=0}^{p} w^{\mathrm{real}}(A_i)}{2\eta D}\right)^{\alpha} \cdot 2\eta D \]
for an absolute constant $c$. It follows that
\[ \sum_{i \in B_{p+1} \cup T_1} \Big[\big(w^{\mathrm{online}}_i - (1+\zeta)\, w^{\mathrm{pred}}_i\big)_+\Big]^{\alpha} \leq \left(\frac{1+2\zeta'}{c\, \zeta^3}\right)^{\alpha} \sum_{i \in T_1} \Big[\big(w^*_i - w^{\mathrm{pred}}_i\big)_+\Big]^{\alpha}. \]
We have thus upper bounded the sum on the left, over time steps in $B_{p+1} \cup T_1$, by the sum on the right over only time steps in $T_1$. Since NOISE-ROBUST(A) does not assign the workload $w^{\mathrm{real}}_i$ for $i \in T_0$ to $w^{\mathrm{online}}$ on any of the time steps in $T_1$, we can repeatedly apply the same arguments on the time steps in $T_0$ to show
\[ \sum_{i \in T_0} \Big[\big(w^{\mathrm{online}}_i - (1+\zeta)\, w^{\mathrm{pred}}_i\big)_+\Big]^{\alpha} \leq \left(\frac{1+2\zeta'}{c\, \zeta^3}\right)^{\alpha} \sum_{i \in T_0 \cup B_{p+1}} \Big[\big(w^*_i - w^{\mathrm{pred}}_i\big)_+\Big]^{\alpha}, \]
yielding the statement of the lemma.
G ROBUSTIFY for uniform deadlines

Here we provide the proofs of Claim 4, Claim 5, and Claim 6.
Claim 4. If $s$ is a feasible schedule for $(w^{\mathrm{real}}, (1-\delta)D, T)$, then $s^{(\delta)}$ is a feasible schedule for $(w^{\mathrm{real}}, D, T)$.

Proof. Since $s$ is a feasible schedule for $(w^{\mathrm{real}}, (1-\delta)D, T)$, we have that
\[ \int_{r_i}^{r_i+D} s^{(\delta)}_i(t)\, dt = \int_{r_i}^{r_i+D} \frac{1}{\delta D} \left(\int_{t-\delta D}^{t} s_i(t')\, dt'\right) dt = \int_{r_i}^{r_i+(1-\delta)D} s_i(t') \left(\int_{t'}^{t'+\delta D} \frac{1}{\delta D}\, dt\right) dt' = w_i. \]

Claim 5.
The cost of schedule $s^{(\delta)}$ is not higher than that of $s$; that is,
\[ \int_0^T \big(s^{(\delta)}(t)\big)^{\alpha}\, dt \leq \int_0^T \big(s(t)\big)^{\alpha}\, dt. \]

Proof.
The proof only uses Jensen's inequality, in the second step, and the statement can be calculated as follows:
\[ \int_0^T \big(s^{(\delta)}(t)\big)^{\alpha} dt = \int_0^T \left(\frac{1}{\delta D} \int_{t-\delta D}^{t} s(t')\, dt'\right)^{\alpha} dt \leq \int_0^T \frac{1}{\delta D} \left(\int_{t-\delta D}^{t} \big(s(t')\big)^{\alpha}\, dt'\right) dt = \int_0^T \big(s(t')\big)^{\alpha} \left(\int_{t'}^{t'+\delta D} \frac{1}{\delta D}\, dt\right) dt' = \int_0^T \big(s(t)\big)^{\alpha}\, dt. \]

Claim 6. Let $s$ be a feasible schedule for $(w^{\mathrm{real}}, (1-\delta)D, T)$. Then $s^{(\delta)}_i(t) \leq \frac{1}{\delta}\, s^{\mathrm{AVR}}_i(t)$.

Proof.
We have that
\[ s^{(\delta)}_i(t) = \frac{1}{\delta D} \int_{t-\delta D}^{t} s_i(t')\, dt' \leq \frac{1}{\delta D} \int_{r_i}^{r_i+D} s_i(t')\, dt' = \frac{w_i}{\delta D} = \frac{s^{\mathrm{AVR}}_i(t)}{\delta}. \]
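In a discretized form, $s^{(\delta)}$ is simply a trailing moving average of width $\delta D$, and Claims 4 and 5 have direct numerical analogues (a sketch under our own discretization; zero-padding mimics the schedule vanishing outside its support):

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(0, 5, size=200)  # an arbitrary nonnegative speed profile
alpha, window = 2.5, 10          # `window` plays the role of delta*D

# Trailing average over `window` steps; the output is window - 1 steps longer
# because the smoothed schedule extends past the end of the original one.
s_smooth = np.convolve(s, np.ones(window) / window)

assert np.isclose(s_smooth.sum(), s.sum())                     # work preserved (cf. Claim 4)
assert (s_smooth ** alpha).sum() <= (s ** alpha).sum() + 1e-9  # Jensen (cf. Claim 5)
```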
H ROBUSTIFY for general deadlines

In this section, we discuss generalizations of our techniques to general deadlines. Recall that an instance with general deadlines is defined by a set $\mathcal{J}$ of jobs $J_j = (r_j, d_j, w_j)$, where $r_j$ is the time the job becomes available, $d_j$ is the deadline by which it must be completed, and $w_j$ is the work to be completed. For $\delta > 0$, we use the notation $\mathcal{J}_\delta$ to denote the instance obtained from $\mathcal{J}$ by shrinking the duration of each job by a factor $(1-\delta)$. That is, for each job $(r_j, d_j, w_j) \in \mathcal{J}$, $\mathcal{J}_\delta$ contains the job $(r_j,\, r_j + (1-\delta)(d_j - r_j),\, w_j)$.

Our main result in this section generalizes ROBUSTIFY to general deadlines.
Theorem 25. For any $\delta > 0$, given an online algorithm for general deadlines that produces a schedule for $\mathcal{J}_\delta$ of cost $C$, we can compute online a schedule for $\mathcal{J}$ of cost at most
\[ \min\left\{\left(\frac{1}{1-\delta}\right)^{\alpha-1} C,\; \frac{1}{2}\left(\frac{2\alpha}{\delta^2}\right)^{\alpha} \cdot \mathrm{OPT}\right\}, \]
where OPT denotes the cost of an optimal schedule of $\mathcal{J}$.

Since it is easy to design a consistent algorithm by just blindly following the prediction, we have the following corollary.
Corollary 26. There exists a learning augmented online algorithm for the General Speed Scaling problem, parameterized by $\varepsilon > 0$, with the following guarantees:
• Consistency: If the prediction is accurate, then the cost of the returned schedule is at most $(1+\varepsilon)\, \mathrm{OPT}$.
• Robustness: Irrespective of the prediction, the cost of the returned schedule is at most $O(\alpha^3/\varepsilon^2)^{\alpha} \cdot \mathrm{OPT}$.

Proof of Corollary.
Consider the algorithm that blindly follows the prediction and plays an optimal schedule of $\mathcal{J}_\delta$ in the consistent case. That is, given the prediction of $\mathcal{J}$, it schedules all jobs that agree with the prediction according to the optimal schedule of the predicted $\mathcal{J}_\delta$; the workload of the remaining jobs $j$ that were wrongly predicted is scheduled uniformly during their duration, from release time $r_j$ to deadline $d_j$. In the consistent case, when the prediction is accurate, the cost of the computed schedule thus equals the cost $\mathrm{OPT}(\mathcal{J}_\delta)$ of an optimal schedule of $\mathcal{J}_\delta$. Furthermore, we have by Lemma 19 that
\[ \mathrm{OPT}(\mathcal{J}_\delta) \leq \left(\frac{1}{1-\delta}\right)^{\alpha-1} \mathrm{OPT}, \]
where OPT denotes the cost of an optimal schedule of $\mathcal{J}$. Applying Theorem 25 to this algorithm, we thus obtain an algorithm that is also robust. Specifically, we obtain an algorithm with the following guarantees:
• If the prediction is accurate, then the computed schedule has cost at most $\left(\frac{1}{1-\delta}\right)^{2(\alpha-1)} \cdot \mathrm{OPT}$.
• The cost of the computed schedule is always at most $\frac{1}{2}\left(\frac{2\alpha}{\delta^2}\right)^{\alpha} \cdot \mathrm{OPT}$.
The corollary thus follows by selecting $\delta = \Theta(\varepsilon/\alpha)$ so that $1/(1-\delta)^{2(\alpha-1)} = 1+\varepsilon$.

We remark that one can also define "smooth" algorithms for general deadlines as we did in the uniform case. However, the prediction model and the measure of error quickly get complex and notation heavy. Indeed, our main motivation for studying the Uniform Speed Scaling problem is that it is a clean but still relevant version that allows for a natural prediction model.

We proceed by proving the main theorem of this section, Theorem 25.

The procedure GENERAL-ROBUSTIFY. We describe the procedure GENERAL-ROBUSTIFY that generalizes ROBUSTIFY to general deadlines. Its analysis then implies Theorem 25. Let A denote the online algorithm of Theorem 25 that produces a schedule of $\mathcal{J}_\delta$ of cost $C$. To simplify the description of GENERAL-ROBUSTIFY, we fix $\Delta > 0$ and assume that the schedule $s$ output by A only changes at times that are multiples of $\Delta$. This is without loss of generality, as we can let $\Delta$ tend to $0$. To simplify our calculations, we further assume that $\delta(d_j - r_j)/\Delta$ evaluates to an integer for all jobs $(r_j, d_j, w_j) \in \mathcal{J}$.

The time line is thus partitioned into time intervals of length $\Delta$ so that in each time interval either no job is processed by $s$ or exactly one job is processed at constant speed by $s$. We denote by $s(t)$ the speed at which $s$ processes the job $j(t)$ during the $t$:th time interval, where we let $s(t) = 0$ and $j(t) = \bot$ if no job is processed by $s$ during this time interval.

To describe the schedule computed by GENERAL-ROBUSTIFY, we further divide each time interval into a base part of length $(1-\delta)\Delta$ and an auxiliary part of length $\delta\Delta$. In the $t$:th time interval, GENERAL-ROBUSTIFY schedules job $j(t)$ at a certain speed $s^{\mathrm{base}}(t)$ during the base part, and a subset $\mathcal{J}(t) \subseteq \mathcal{J}$ of the jobs is scheduled during the auxiliary part, each $i \in \mathcal{J}(t)$ at a speed $s^{\mathrm{aux}}_i(t)$. These quantities are computed by GENERAL-ROBUSTIFY online at the start of the $t$:th time interval as follows:
• Let $s^{\mathrm{aux}}(t) = \sum_{i \in \mathcal{J}(t)} s^{\mathrm{aux}}_i(t)$ be the current speed of the auxiliary part and let $D_{j(t)} = d_{j(t)} - r_{j(t)}$ be the duration of job $j(t)$.
• If $s(t)/(1-\delta) \leq s^{\mathrm{aux}}(t)$, then set $s^{\mathrm{base}}(t) = s(t)/(1-\delta)$.
• Otherwise, set $s^{\mathrm{base}}(t)$ so that
\[ (1-\delta)\Delta\, s^{\mathrm{base}}(t) + \left(s^{\mathrm{base}}(t) - s^{\mathrm{aux}}(t)\right) \delta^2 D_{j(t)} = s(t)\Delta \tag{9} \]
and add $j(t)$ to $\mathcal{J}(t), \mathcal{J}(t+1), \ldots, \mathcal{J}(t + \delta D_{j(t)}/\Delta - 1)$, with all auxiliary speeds $s^{\mathrm{aux}}_{j(t)}(t), s^{\mathrm{aux}}_{j(t)}(t+1), \ldots, s^{\mathrm{aux}}_{j(t)}(t + \delta D_{j(t)}/\Delta - 1)$ set to $s^{\mathrm{base}}(t) - s^{\mathrm{aux}}(t)$.
This completes the formal description of GENERAL-ROBUSTIFY; a short code sketch of the per-interval update follows the example below.

Before proceeding to its analysis, which implies Theorem 25, we explain the example depicted in Figure 4. Schedule $s$, illustrated on the left, schedules a blue, red, and green job during the first, second, and third time interval, respectively. We have that $\delta/\Delta$ times the durations of the blue job and the red job are $3$ and $4$, respectively. GENERAL-ROBUSTIFY now produces the schedule on the right, where the auxiliary parts are indicated by the horizontal stripes. When the blue job is scheduled, it is partitioned among the base part of the first interval and evenly among the auxiliary parts of the first, second and third intervals, so that the speed in the first interval is the same in the base part and the auxiliary part. Similarly, when the red job is scheduled, GENERAL-ROBUSTIFY splits it among the base part of the second interval and evenly among the auxiliary parts of the second, third, fourth and fifth intervals, so that the speed during the base part equals the speed of the auxiliary part during the second interval. Finally, the green job is processed at a small speed and is thus only scheduled in the base part of the third interval (with a speed increased by a factor $\frac{1}{1-\delta}$).
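To make the per-interval rule concrete, the following sketch solves (9) and returns the auxiliary slice that is spread over the next $\delta D_{j(t)}/\Delta$ intervals (our own illustration; the function name and units are assumptions):

```python
def general_robustify_step(s_t, s_aux_t, delta, Delta, D_j):
    """One interval of GENERAL-ROBUSTIFY: returns (s_base, aux_slice), where
    aux_slice is the speed added to the auxiliary part of the next
    delta*D_j/Delta intervals (0 when the auxiliary part already suffices)."""
    if s_t / (1 - delta) <= s_aux_t:
        return s_t / (1 - delta), 0.0
    # Solve (1-delta)*Delta*s_base + (s_base - s_aux_t)*delta^2*D_j = s_t*Delta.
    s_base = (s_t * Delta + delta ** 2 * D_j * s_aux_t) / \
             ((1 - delta) * Delta + delta ** 2 * D_j)
    return s_base, s_base - s_aux_t

# The processed work matches what A processes in this interval, as in (9):
s_t, s_aux_t, delta, Delta, D_j = 4.0, 0.5, 0.2, 1.0, 50.0
s_base, aux = general_robustify_step(s_t, s_aux_t, delta, Delta, D_j)
work = (1 - delta) * Delta * s_base + aux * (delta * Delta) * (delta * D_j / Delta)
assert abs(work - s_t * Delta) < 1e-9
```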
Analysis.

We show that GENERAL-ROBUSTIFY satisfies the guarantees stipulated by Theorem 25.

We first argue that GENERAL-ROBUSTIFY produces a feasible schedule for $\mathcal{J}$. During the $t$:th interval, the schedule $s$ computed by A processes $\Delta \cdot s(t)$ work of job $j(t)$. We argue that GENERAL-ROBUSTIFY processes the same amount of work from this time interval. At the time when this interval is considered by GENERAL-ROBUSTIFY, there are two cases:
• If $s(t)/(1-\delta) \leq s^{\mathrm{aux}}(t)$, then $s^{\mathrm{base}}(t) = s(t)/(1-\delta)$, so GENERAL-ROBUSTIFY processes $(1-\delta)\Delta\, s(t)/(1-\delta) = s(t)\Delta$ work of $j(t)$ during the base part of the $t$:th time interval.
• Otherwise, GENERAL-ROBUSTIFY processes $(1-\delta)\Delta\, s^{\mathrm{base}}(t)$ of $j(t)$ during the base part of the $t$:th time interval and $\delta\Delta \left(s^{\mathrm{base}}(t) - s^{\mathrm{aux}}(t)\right)$ during the auxiliary part of each of the $\delta D_{j(t)}/\Delta$ time intervals $t, t+1, \ldots, t + \delta D_{j(t)}/\Delta - 1$. By the selection (9), it thus follows that GENERAL-ROBUSTIFY processes all work $s(t)\Delta$ from this time interval in this case as well.

[Figure 4: Given the schedule on the left, GENERAL-ROBUSTIFY produces the schedule on the right.]

The schedule of GENERAL-ROBUSTIFY thus completely processes every job. Furthermore, since each job is delayed by at most $\delta D_{j(t)}$ time units, it is a feasible schedule for $\mathcal{J}$: we started with a schedule for $\mathcal{J}_\delta$, which completes each job $j$ by time $r_j + (1-\delta)D_j$. It remains to prove the robustness and consistency guarantees of Theorem 25.

Lemma 27 (Robustness). GENERAL-ROBUSTIFY computes a schedule of cost at most $\frac{1}{2}\left(\frac{2\alpha}{\delta^2}\right)^{\alpha} \cdot \mathrm{OPT}$.

Proof.
By the definition of the algorithm, we have for each time interval that the speed of the base part is at most the speed of the auxiliary part. Letting $s^{\mathrm{base}}(t)$ and $s^{\mathrm{aux}}(t)$ denote the speed of the base and auxiliary part of the $t$:th time interval, we thus have
\[ \sum_t \left((1-\delta)\, s^{\mathrm{base}}(t)^{\alpha} + \delta\, s^{\mathrm{aux}}(t)^{\alpha}\right) \leq \sum_t s^{\mathrm{aux}}(t)^{\alpha}. \]
Now, the part of a job $j$ that is processed during the auxiliary parts has been uniformly assigned to at least $\delta^2 D_j$ units of time. It follows that the speed at any auxiliary time interval is at most $1/\delta^2$ times the speed at that time of the AVERAGE RATE heuristic (AVR). The lemma now follows since that heuristic is known [17] to have competitive ratio at most $(2\alpha)^{\alpha}/2$.

Lemma 28 (Consistency). GENERAL-ROBUSTIFY computes a schedule of cost at most $\left(\frac{1}{1-\delta}\right)^{\alpha-1} \cdot C$, where $C$ denotes the cost of the schedule $s$ computed by A.

Proof. For $t \geq 0$, let $h(t)$ be the schedule that processes the workload of the first $t$ time intervals as in the schedule computed by GENERAL-ROBUSTIFY, while the workload of each remaining time interval is processed during the base part of that time interval by increasing the speed by a factor $1/(1-\delta)$. Hence, $h(0)$ is the schedule that processes the workload of all time intervals during the base part at a speed-up of $1/(1-\delta)$, and $h(\infty)$ equals the schedule produced by GENERAL-ROBUSTIFY. By definition, the cost of $h(0)$ equals
\[ \left(\frac{1}{1-\delta}\right)^{\alpha} (1-\delta) \cdot C = \left(\frac{1}{1-\delta}\right)^{\alpha-1} C, \]
and so the lemma follows by observing that for every $t \geq 1$ the cost of $h(t)$ is at most the cost of $h(t-1)$. To see this, consider the two cases of GENERAL-ROBUSTIFY when considering the $t$:th time interval:
• If $s(t)/(1-\delta) \leq s^{\mathrm{aux}}(t)$, then GENERAL-ROBUSTIFY processes all the workload during the base part at a speed of $s^{\mathrm{base}}(t) = s(t)/(1-\delta)$. Hence, in this case, the schedules $h(t)$ and $h(t-1)$ process the workload of the $t$:th time interval identically, and so they have equal costs.
• Otherwise, GENERAL-ROBUSTIFY partitions the workload of the $t$:th time interval among the base part of the $t$:th interval and $\delta D_{j(t)}/\Delta$ many auxiliary parts, so that the speed at each of these parts is strictly less than $s(t)/(1-\delta)$. Hence, since $h(t)$ processes the workload of the $t$:th time interval at lower speeds than $h(t-1)$, its cost is strictly lower if $\alpha > 1$ (and the cost is equal if $\alpha = 1$).

I Additional Experiments