A New Approach to Capacity Scaling Augmented With Unreliable Machine Learning Predictions
Daan Rutten* and Debankur Mukherjee
Georgia Institute of Technology
January 29, 2021

Abstract
Modern data centers suffer from immense power consumption. The erratic behavior of internet traffic forces data centers to maintain excess capacity in the form of idle servers in case the workload suddenly increases. As an idle server still consumes a significant fraction of the peak energy, major data center operators like Amazon AWS have heavily invested in capacity scaling solutions. In simple terms, capacity scaling solutions aim to deactivate servers if the demand is low and to activate them again when the workload increases. To do so, an algorithm needs to strike a delicate balance between the power consumption, flow-time, and the switching cost. Over the last decade, the research community has developed sophisticated competitive online algorithms with worst-case guarantees. In the presence of historic data patterns, prescriptions from Machine Learning (ML) predictions typically outperform such competitive algorithms. This, however, comes at the cost of sacrificing the robustness of performance, since unpredictable surges in the workload are not uncommon. The current work builds on the emerging paradigm of augmenting unreliable ML predictions with online algorithms to develop novel robust algorithms that enjoy the benefits of both worlds.

In this paper, we analyze a continuous-time model for capacity scaling, where the goal is to minimize the weighted sum of flow-time, switching cost, and power consumption in an online fashion. The model generalizes much of the earlier related approaches. We propose a low-complexity algorithm, called the Adaptive Balanced Capacity Scaling (ABCS) algorithm, that has access to black-box ML predictions. Although ABCS is completely oblivious to the accuracy of these predictions, its performance does depend on the error of the predictions. In particular, if the predictions turn out to be accurate in hindsight, we prove that ABCS is (1 + ε)-competitive. Moreover, even when the predictions are inaccurate, ABCS guarantees a uniformly bounded competitive ratio. Finally, we investigate the performance of the ABCS algorithm on a real-world dataset and carry out extensive numerical experiments, which positively support the theoretical results.

Keywords: energy efficiency, online algorithms, competitive analysis, speed scaling, competitive ratio

* Email: [email protected]

1 Introduction
Modern data centers suffer from immense power consumption, which amounts to a massive economic and environmental impact. In 2014, data centers alone contributed to about 1.8% of the total U.S. electricity consumption [52] and this is projected to reach 7% in 2030 [46]. Consequently, data center providers are constantly striving to optimize their servers for energy efficiency, pushing the hardware's efficiency to nearly its limit. At this point, algorithmic improvements appear to be critical in order to achieve substantial further gain [52]. A common practice for data centers has been to reserve significant excess service capacity in the form of idle servers [53], even though a typical idle server still consumes about 44% of its peak power consumption [52]. The recommendation from the U.S. Department of Energy [52], industry [17, 25, 47], and the academic research community [1, 21, 34] is, therefore, to implement dynamic capacity scaling functionality based on the demand. If the demand is low, the service capacity should be scaled down by deactivating servers, while at peak times, it should be scaled up by increasing the number of active servers. Instead of physically turning servers on or off, such dynamic scaling functionality is often implemented by carefully allocating a fraction of servers to other, lower priority services and quickly bringing them back at times of high demand; see [13, 51, 54] for a more detailed account. This maximizes the utilization of the system and hence minimizes the power consumption.

The call for algorithmic solutions to capacity scaling has inspired a vibrant line of research over the last decade [1, 5, 8, 19, 21, 27, 34, 35, 44, 45]. The problem fits into the framework of online algorithms, where the goal is to design algorithms that dynamically scale the current service capacity, based on the past and current system information. Here, the performance of an algorithm is captured in terms of the Competitive Ratio (CR), which is defined as the worst possible ratio between the cost incurred by the online algorithm and that by the offline optimum algorithm. Note that the online algorithm has information only about the past and the present, while the offline optimum has accurate information about all future input variables, such as the task-arrival process in the context of the current article. The key advantage of such strong performance guarantees lies in their robustness, that is, the algorithm safeguards against the worst-case scenario.

However, any of today's modern large-scale systems has access to massive historical data, which, combined with standard Machine Learning (ML) algorithms, can reveal definitive patterns. In these cases, simply following the recommendations obtained from the ML predictions typically outperforms any competitive algorithm. Netflix is an example of a company implementing capacity scaling in practice. Instead of relying on competitive online algorithms, Netflix has implemented ML algorithms in their Scryer system [47]. They noted that their demand usually follows regular patterns, allowing them to accurately predict the demand during a day based on data from previous weeks. Most of the time, the performance of the machine learning algorithm is therefore excellent. However, besides empirical verification, the performance of such ML predictions is not guaranteed. In fact, repeated observations show that unexpected surges in the workload are not at all uncommon [11, 32, 47], which cause a significant adverse impact on the system performance.

The contrasting approaches between academia and industry reveal a gap between what we are able to prove and what is desirable in practice. While online algorithms do not require any information about future arrivals, in practice, these predictions are usually available. At the same time, an algorithm should not blindly trust the predictions because occasionally the accuracy of the predictions can be significantly poor. The current work aims to bridge this gap by incorporating ML predictions directly into the competitive analysis framework. In particular, we propose a novel low-complexity algorithm for capacity scaling, the Adaptive Balanced Capacity Scaling (ABCS) algorithm, which has access to a black-box predictor, lending predictions about future arrivals. Critically, not only is ABCS completely unaware of the prediction's accuracy, we also refrain from making any statistical assumptions on the accuracy. Hence, this excludes any attempt to learn the prediction's accuracy, since accurate past predictions do not necessarily warrant the quality of future predictions. The main challenge therefore is to design near-optimal algorithms which intelligently accept and reject the recommendations given by the ML predictor without knowing or learning their accuracy. Note, however, that the performance of the ABCS algorithm does depend on the (unknown) error of the prediction, and it ensures, among others, two most desirable properties: (i) consistency, i.e., if the predictions turn out to be accurate in hindsight, then ABCS automatically replicates almost the optimal solution, and (ii) competitiveness, i.e., if the predictions are inaccurate in hindsight, then the performance of ABCS is at most an absolute constant factor times the minimum cost. The formal definitions of consistency and competitiveness are given in Section 3. It is worth emphasizing that this work is not concerned with how the ML predictions are obtained and uses them as a black box.
We will use a canonical continuous-time dynamical system model that is used to analyze algorithms for energy efficiency; see for example [1, 8, 34, 35, 37] for variations. Consider a system with a large number of homogeneous servers. Each server is in either of two states: active or inactive. Let m(t) denote the number of active servers at time t. Workload arrives into the system in continuous time and gets processed at instantaneous rate m(t). The system has a buffer of infinite capacity, where the unprocessed workload can wait until it is executed. We will assume that there is an unknown and arbitrary arrival rate function λ(t) that represents the arrival process; see Section 2 for further details. We do not impose any restrictions on λ(·). To contrast this with the often-studied case when the workload arrival is stochastic, λ(·) can be thought of as an individual sample path of the corresponding stochastic arrival process. At any time, the system may decide to increase or decrease m(t) in an online fashion. However, it pays a switching cost each time a server is activated. This represents the cost of terminating the lower priority service running at the inactive server and related migration costs [34, 35, 37, 51]. The goal of the system is to minimize the weighted sum of the flow-time, the switching cost, and the power consumption [37]. The flow-time is defined as the total time tasks spend in the system and is a measure of the response time [1, 8]. We will analyze the performance of an algorithm by its competitive ratio, the worst-case ratio between the cost of the online algorithm and the minimum offline cost, over all possible arrival rate functions λ(·). We further assume that the algorithm receives predictions about future workload through an ML oracle [36]. More precisely, at time t = 0, the oracle predicts the arrival rate function for the entire time horizon to be ˜λ(·). The algorithm may use these predictions to increase or decrease the number of servers accordingly. For instance, if the oracle predicts that the demand in the next hour will increase, then the algorithm might proactively increase the number of servers. However, as mentioned before, it is crucial that the algorithm is completely oblivious to the accuracy of these predictions. We measure the accuracy of the predictions in terms of the mean absolute error (MAE) between the predicted arrival rate function ˜λ and the actual rate function λ (see Definition 3.1). Our contributions in the current paper are twofold:
(1) Purely online algorithm with worst-case guarantees.
First, we propose a novel purely online algorithm for capacity scaling, called Balanced Capacity Scaling (BCS). This purely online scenario, or the scenario of traditional competitive analysis, is equivalent to having predictions with infinite error. There are several fundamental works that have considered the purely online scenario for capacity scaling [19, 34, 35, 44, 45]. We extend the state of the art in this area by analyzing a more general model in continuous time and where unprocessed workload is allowed to wait. In fact, we show that a class of popular algorithms that were previously proposed is not constant competitive in the more general case (see Proposition 4.13). We show that the BCS algorithm is 5-competitive in the general case (Corollary 4.2) and is 2-competitive when waiting is not allowed and workload must be processed immediately upon arrival (Theorem 4.3). BCS is easy to implement and is memoryless, i.e., it only depends on the current state of the system and not on the past. Also, we prove a lower bound result that any deterministic online algorithm must have a competitive ratio of at least 2.549 (Proposition 4.4), which implies that the problem is strictly harder than the classical ski-rental problem, a benchmark for online algorithms.

(2) Augmenting unreliable ML predictions.
When ML predictions are available, we first propose an adaptive algorithm, called Adapt to the Prediction (AP), which ensures consistency. That is, we prove (Theorem 4.6) that the competitive ratio of the AP algorithm is at most 1 + Θ(η), where η is a suitable measure of the prediction's accuracy and is a function of the MAE between the predicted arrival rate function ˜λ and the actual rate function λ (Definition 3.2). The AP algorithm does not follow the predictions blindly. Rather, it dynamically scales the number of servers in an online fashion as the past predictions turn out to be inaccurate. Although the performance of the AP algorithm is near-optimal as η → 0, it is not constant competitive if predictions are completely inaccurate (η = ∞). Thus, it does not provide any worst-case guarantees. This is a feat shared by many recent adaptive algorithms in the literature (see Remark 4.7).

Finally, we combine the ideas behind the BCS and the AP algorithms to propose an algorithm that is both competitive and consistent. This brings us to the main contribution of the paper. We propose the Adaptive Balanced Capacity Scaling (ABCS) algorithm, which uses the structure of BCS and utilizes AP as a subroutine. ABCS has a hyperparameter r ≥ 1, which can be fixed at any value before implementing the algorithm. It represents our confidence in the ML predictions. If we choose r = 1, then the algorithm works as a purely online one and disregards all predictions. In this case, ABCS is 5-competitive, as before. However, for any fixed r > 1, we prove (Corollary 4.9) that the competitive ratio of ABCS is at most

  min{ (1 + Θ(η))·(1 + Θ(r⁻¹)), Θ(r) },    (1.1)

where η is the prediction's accuracy as before. There are a number of consequences of the above result. We start by emphasizing that although the competitive ratio is a function of the error η, the algorithm is completely oblivious to it. Now, the higher we fix the value of r to be, the closer the competitive ratio of ABCS gets to 1 if the predictions turn out to be accurate in hindsight. In this case, if the predictions are completely inaccurate (η = ∞), its competitive ratio is at most Θ(r), a constant that depends only on r and not on η. ABCS is therefore robust against unpredictable surges in workload, while providing near-optimal performance if the predictions are accurate.

Another interesting thing to note is that for r > 1, the competitive ratio in (1.1) is the minimum of two terms: the first term, which we call the Optimistic Competitive Ratio (OCR), is smaller when the prediction is accurate, and the second term, which we call the Pessimistic Competitive Ratio (PCR), is smaller when the prediction is inaccurate. From the algorithm designer's perspective, there is a clear trade-off between OCR and PCR, which is conveniently controlled by the confidence hyperparameter r. It is important to note that ABCS provides performance guarantees for any fixed r ≥ 1; the choice of r reflects the risk that the system designer is willing to take in the pessimistic case against the gain in the optimistic case. See Remark 4.10 for further discussion. This trade-off, however, is not specific to our algorithm. In fact, we prove a negative result in Proposition 4.12 that any algorithm which is (1 + δ)-competitive in the optimistic case has a competitive ratio of at least Ω(1/δ) in the pessimistic case.

To test the performance of our algorithms in practice, we implemented them on both a real-world dataset of DNS requests observed at a campus network [39] and a set of artificial datasets, and the performance turns out to be excellent. See Section 5 for details.

1.1 Related work

Over the past two decades, the rapid growth of data centers and its immense power consumption have inspired a vibrant line of research in optimizing the energy efficiency of such systems [10, 15, 50, 55]. Below, we provide an overview of a few influential works relevant to the current paper.

The capacity scaling problem was introduced in a seminal paper by Lin et al. [34], who analyze a discrete-time model of a data center. At each time step t, the cost of operating m(t) servers is determined by the switching cost and an arbitrary convex function g_t(m(t)), which, for example, specifies the cost of increased power consumption versus response time. At time step t, the system reveals the function g_t and accurate functions g_{t+1}, g_{t+2}, ..., g_{t+w} in a prediction window of w future time steps. Lin et al. [34] propose an algorithm, called the LCP algorithm, and prove that it is 3-competitive. Surprisingly, the performance of the LCP algorithm does not improve if w > 0, i.e., if predictions are available.
When g_t has a fixed form, our results generalize this work to continuous time, and to predictions that are not necessarily accurate. Moreover, the performance of our algorithm provably improves in the presence of predictions.

Lu et al. [35] consider a scenario where tasks cannot wait in queue and must be served immediately upon arrival. They discover that in this case, the capacity scaling problem reduces to solving a number of independent ski-rental problems. The authors then propose an algorithm and prove that it is 2-competitive. Our model, in addition, includes the response time, which directly generalizes the framework of [35]. This flexibility introduces a whole new dimension in the space of possible decisions. For example, since the results of Lu et al. [35] lack any form of delay, tasks are processed at the same time by any algorithm. Our model allows an algorithm-dependent delay of serving tasks, which desynchronizes the time at which tasks are processed at a server across different algorithms and hence significantly complicates the analysis. Mazzucco and Dyachuk [40] analyze a related problem in which the number of servers is periodically updated and a task is lost if a server is not immediately available to serve it. The goal of their algorithm is to balance the power consumption and the cost of losing tasks. Galloway et al. [18] and later Gandhi et al. [20, 22] perform an empirical study of data centers. Their results show that significant power savings are possible, while maintaining much of the latency of the network.

A well-studied problem that is somewhat related to our setup is speed scaling. Here, the goal is to optimize the processing speed of a single server and to minimize the weighted sum of the flow-time and power consumption, while the switching cost is zero. In contrast to our model, the scheduling of jobs also plays a crucial role here. The power consumption is typically cubic in the processing speed. A seminal paper in this area is by Bansal et al. [8], which proposes an algorithm that schedules the task with the shortest remaining processing time (SRPT) first and processes it at a speed such that the power consumption is equal to the number of waiting tasks plus one. The authors prove that this algorithm is (3 + ε)-competitive. Later papers have extended the case of the single server to processor sharing systems [56] and parallel processors with deadline constraints [2]. The problem of speed scaling has also been analyzed in the case the inter-arrival times and required processing times are exponentially distributed [4].

Any algorithm for the capacity scaling problem consists of two components: first, to activate servers and second, to deactivate servers. For a single server, a natural abstraction of the latter problem is the famous ski-rental problem, as first introduced by Karlin et al. [29]. The ski-rental problem has been applied to cases of capital investment [6, 14], TCP acknowledgement [28] and cache coherence [3]. Irani et al. [27] analyze the ski-rental problem when multiple power-down states are available, such as active, sleeping, hibernating, and inactive. The power consumption in each state is different and moving between the states incurs a switching cost. Augustine et al. [5] generalize these results when the transition costs between the different states are not additive.
Although the current work focuses on only two states, i.e., active and inactive, we expect that the algorithm and proofs are general enough to accommodate multiple power-down states, which we leave as interesting future work. Khanafer et al. [30] analyze the ski-rental problem in a stochastic context.

Lykouris and Vassilvitskii [36] initiated the study of online algorithms augmented by ML predictions. They show how to adapt the marker algorithm for the caching problem to obtain a competitive ratio of 2 if the predictions are perfectly accurate, and a bounded competitive ratio if the predictions are inaccurate. The idea has since been applied to bipartite matching [31], ski-rental and scheduling on a single machine [48], bloom-filters [41], and frequency estimation [26]. Lee et al. [33] propose an algorithm which operates onsite generators to reduce the peak energy usage of data centers. Although related to the current work, their algorithm works independently of the capacity scaling happening inside the data center. Bamas et al. [7] discuss an algorithm augmented by predictions for the related problem of speed scaling discussed above, in the case of parallel processors with deadline constraints. Similar to our results, Bamas et al. [7] identify a trade-off between what we call an Optimistic Competitive Ratio and a Pessimistic Competitive Ratio. We prove, for the capacity scaling problem considered in the current work, that any algorithm must exhibit such a trade-off (see Proposition 4.12 for details). Mahdian et al. [38] propose an algorithm which naively switches between an optimistic and a pessimistic scheduling algorithm to minimize the makespan when routing tasks to multiple machines.

Recently, the notion of a predictor has also emerged in stochastic scheduling. Mitzenmacher [42] introduces the predictor as a probability density function g(x, y) for a task with actual service time x and predicted service time y. Here the author analyzes the shortest predicted job first (SPJF) and shortest predicted remaining processing time (SPRPT) queueing disciplines for a single queue and determines the price of misprediction, i.e., the ratio of the cost if perfect information of the service time distribution is known and the cost if only predictions are available. For multiple queues, Mitzenmacher [43] has simulated the supermarket model or the 'power-of-d' model, to show empirically that the availability of predictions greatly improves performance.

A different line of work, called online algorithms with advice, questions how many bits of perfect future information are necessary to reproduce the optimal offline algorithm (see [12] for a survey). The difference with the current work is that we do not assume that the predictions are perfect but instead have arbitrary accuracy.

When the arrival process and service times are stochastic, there are several major works that consider energy efficiency of the system. Gandhi et al. [19] provide an exact analysis of the M/M/k/setup system. The system is similar to the M/M/k class of Markov chains, i.e., tasks arrive according to a Poisson process and require an exponentially distributed processing time. To process the tasks, the system has access to a maximum of k servers. According to the algorithm in [19], if a task arrives and there are no available servers, the system moves one server to its setup state, where it remains for an exponentially random time before the server becomes active. The authors provide a sophisticated method to analyze the system exactly.
Maccio and Down [37] analyze a similar system for a broader class of cost functions. When each server has a dedicated separate queue, Mukherjee et al. [44] and Mukherjee and Stolyar [45] analyze the case where the setup times and standby times (the time a server remains idle before it is deactivated) are independent exponentially distributed. In this case they propose an algorithm that achieves asymptotic optimality for both the response time and power consumption in the large-system limit. Earlier research has also modeled the response time as a constraint rather than charging a cost for the response time [24]. Here, each task is presented with a deadline and the task should be served before this deadline or it is irrevocably lost. The earliest deadline first (EDF) queueing discipline has been proven to be effective in this case [16].

The remainder of the paper is organized as follows. Section 2 describes the model. Section 3 introduces some preliminary concepts and definitions related to the ML predictions, such as the error. Section 4 introduces our algorithms and the main results, of which the high-level proof ideas are provided in Section 6. Most of the technical proofs are given in the appendix. Section 5 presents extensive numerical experiments, including the performance of our algorithms on a real-world dataset. Finally, Section 7 concludes our work and presents directions for future research.
2 Model description

We now introduce a general model for capacity scaling. Let ω, β and θ be fixed non-negative parameters of the model. We will assume that the tasks waiting in the buffer accumulate a waiting cost at rate ω > 0, the cost of activating a server is β > 0, and each active server accumulates a power consumption cost at rate θ ≥ 0. An instance of the problem consists of a finite time horizon T > 0, a given initial number of servers m(0) ≥ 0, and an unknown and arbitrary function λ : [0, T] → R₊ representing the arrival process. The model is:

  minimize_{m : (0,T] → R₊}  ω·∫₀ᵀ q(s) ds + β·limsup_{δ↓0} Σ_{i=0}^{⌊T/δ⌋} [m(iδ + δ) − m(iδ)]⁺ + θ·∫₀ᵀ m(s) ds
  subject to  q(t) = ∫₀ᵗ (λ(s) − m(s))·1{q(s) > 0 or λ(s) ≥ m(s)} ds for all t ∈ [0, T],
              m(t) ≥ 0 for all t ∈ (0, T],    (2.1)

where [x]⁺ = max(x, 0). To solve the optimization problem above, an algorithm needs to determine the function m(·), given the parameters ω, β, θ. Note that our goal is to investigate online algorithms, where λ(·) is revealed to the algorithm in an online fashion. In other words, at time t, the algorithm must determine m(t) depending only on λ(s) for s ∈ [0, t]. We will precisely state these assumptions later. For an algorithm that runs m(t) servers at time t, the cost accumulated until time t is defined as

  Cost_λ(m, t) := ω·∫₀ᵗ q(s) ds + β·limsup_{δ↓0} Σ_{i=0}^{⌊t/δ⌋} [m(iδ + δ) − m(iδ)]⁺ + θ·∫₀ᵗ m(s) ds.    (2.2)

We will compare the total cost Cost_λ(m, T) for an online algorithm to that of the offline minimum, defined as

  Opt := inf_{m : (0,T] → R₊} Cost_λ(m, T),    (2.3)

and without loss of generality, we will assume Opt < ∞ throughout the paper.

Remark 2.1. The argument that minimizes (2.3) exists, as stated by the next proposition. The proof of Proposition 2.2 is given in Appendix A.1. The difficulty in the proof lies in dealing with the second term in (2.2), which makes the function Cost_λ(m, T) discontinuous in m(·) w.r.t. the L¹ norm.

Proposition 2.2. There exists m* : [0, T] → R₊ such that Cost_λ(m*, T) = Opt.

The model in (2.1) actually combines some well-studied state-of-the-art models [1, 8, 34, 35, 37]. To see how it relates to the problem of capacity scaling, note that the objective function in (2.1) is a weighted sum of three metrics. Below, we clarify each of them. These three metrics are common performance measures of the system, such as the response time or the power consumption. The parameters ω, β and θ represent the weights assigned to each of these metrics. The three metrics are as follows:

(i) The flow-time.
The flow-time is defined as the total time a task spends in the system and captures the response time of the system. Note that the average response time per unit of workload is ∫₀ᵀ q(s) ds / ∫₀ᵀ λ(s) ds; see also [1, 8]. The weight ω is the cost attributed to the response time (e.g., in dollars per second). The weight ω could, for example, be determined based on loss of revenue or user dissatisfaction as a result of increased response time.

(ii) The switching cost.
As in [34, 35, 51], the parameter β can be viewed as the cost to increase the number of active servers (e.g., in dollars per server). This may include, for example, the cost to terminate a lower priority service and related migration costs. In practice, these costs are usually equivalent to the cost of running the server for multiple hours [34]. The total switching cost is β times the number of times a server is made active.

(iii) The power consumption.
The power consumption is proportional to the total time servers are in the active state [35]. The weight θ represents the cost of power (e.g., in dollars per server per second).

Also, the constraints in (2.1) model the dynamics of capacity scaling, and q(·) can be viewed as the queue length process or the remaining workload process. Note that (2.1) does not require q(t) or m(t) to be integer-valued. This is a fairly standard relaxation, since a service may typically request a fraction of the server's capacity [51, 54] and a single task is tiny; see for example [34, 44]. The system in (2.1) can also be interpreted as a fluid counterpart of a discrete system. Figure 1 depicts the model schematically.

Figure 1: The system receives tasks at rate λ(t) and operates m(t) servers. The workload is q(t).

Remark 2.3. The model in (2.1) assumes that the service capacity can be increased nearly instantaneously. Hence, it does not include the so-called setup time. Besides being a common assumption in competitive analysis (see for example [34, 35]), this is also not completely unreasonable in practice. This is mainly because servers are not usually physically turned off in reality. Instead, when a server becomes "inactive", the server's capacity will be used by other low-priority services. Then, activating a server means quickly terminating such low-priority services; see [51, 54] for a more detailed account. From a theoretical standpoint, the assumption of a zero setup time is also necessary to get a uniformly bounded competitive ratio, as stated in the next lemma. For the sake of Lemma 2.4, let us assume that in the capacity scaling problem in (2.1) there is an additional setup time τ > 0, i.e., if the algorithm decides to increase the number of servers at time t, then the number of servers is increased at time t + τ. The proof of Lemma 2.4 is provided in Appendix A.2.
Lemma 2.4. Let A be any deterministic algorithm for the capacity scaling problem in (2.1) and assume that there is an additional setup time τ > 0 before the number of servers can be increased. Also, let CR denote the competitive ratio of A. Then there exists θ such that

  CR ≥ ωτ/β.    (2.4)

In other words, with a setup time, there does not exist any deterministic algorithm with uniformly bounded competitive ratio.
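The objective in (2.1)-(2.2) is straightforward to evaluate numerically for any given capacity trajectory. The following sketch is our illustration, not part of the paper: it discretizes the dynamics by forward Euler, and the function names, step size, and example parameter values are assumptions chosen for readability.

```python
import numpy as np

def simulate_cost(lam, m, T, dt, omega, beta, theta, q0=0.0):
    """Discretize (2.1)-(2.2): evolve the workload q(t) under arrival rate
    lam(t) and capacity m(t), and return the (waiting, switching, power) costs.

    lam, m: callables returning the arrival rate / number of servers at time t.
    """
    n = int(T / dt)
    q = q0
    waiting = switching = power = 0.0
    m_prev = m(0.0)
    for i in range(n):
        t = i * dt
        m_cur = m(t)
        # Switching cost: beta per unit of capacity switched ON (increases only).
        switching += beta * max(m_cur - m_prev, 0.0)
        m_prev = m_cur
        # The queue evolves only while there is backlog or arrivals exceed
        # capacity, mirroring the indicator 1{q(s) > 0 or lam(s) >= m(s)} in (2.1).
        if q > 0.0 or lam(t) >= m_cur:
            q = max(q + (lam(t) - m_cur) * dt, 0.0)
        waiting += omega * q * dt          # flow-time term
        power += theta * m_cur * dt        # power-consumption term
    return waiting, switching, power

# Example: constant capacity against a sinusoidal arrival rate.
lam = lambda t: 500.0 * (1.0 + np.sin(2.0 * np.pi * t / 24.0))
m = lambda t: 600.0
print(sum(simulate_cost(lam, m, T=24.0, dt=0.01, omega=1.0, beta=4.0, theta=1.0)))
```

The offline optimum (2.3) minimizes the sum of these three terms over all trajectories m(·), which is what the linear program of Section 5.3 computes.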
Formulation (2.1) is fairly easy to solve as an offline optimization problem (in Section 5.3 we present a linear program to solve the offline problem). However, as mentioned earlier, we are interested in an online algorithm. Specifically, we distinguish two scenarios:

1. Purely online scenario. The system reveals m(0), ω, β, θ and, at time t, also λ(s) for s ∈ [0, t] to the online algorithm, but not λ(s) for any s > t. The purely online scenario corresponds to the setting where predictions may not be available and provides a natural starting point of our investigation. We discuss a competitive algorithm for the purely online scenario in Section 4.1. Additionally, in this purely online scenario, our algorithm does not require the system to reveal the finite time horizon T upfront.

2. Machine learning scenario. In addition to the assumptions in the purely online scenario above, at time t = 0, an ML predictor predicts the arrival rate function of the entire interval, that is, it predicts the arrival rate function to be ˜λ : [0, T] → R₊. The ML predictor may, for example, be trained on the past observed workload on a day. For the purpose of the current work, we treat the predictor as a black box. We discuss a consistent algorithm for the machine learning scenario in Section 4.2.1, for which the competitive ratio degrades gracefully with the prediction's accuracy. However, the algorithm is not competitive in the worst case. Finally, in Section 4.2.2 we discuss an algorithm for the machine learning scenario, which is simultaneously competitive and consistent, by combining the algorithms from Sections 4.1 and 4.2.1.

The idea of using online algorithms with unreliable machine-learned advice was first introduced in [36] in the context of competitive caching. The next section provides the necessary details of the framework of [36].

3 Preliminaries

This section briefly presents the competitive analysis framework for algorithms that have access to ML predictions. The setup was first introduced in [36] and we adapt it here for the current scenario. We measure the errors in predictions by the mean absolute error (MAE) between the true and the predicted label, which is commonly used in state-of-the-art machine learning algorithms [23, 49].
Definition 3.1. The error in the prediction ˜λ(·) with respect to the actual arrival rate λ(·) is

  ‖˜λ − λ‖_MAE = (1/T)·∫₀ᵀ |˜λ(t) − λ(t)| dt.    (3.1)

To measure the performance of an algorithm augmented by an ML predictor, we will define the competitive ratio as a function of the prediction's accuracy. However, before stating the definition of competitive ratio, we introduce the level of accuracy of a prediction.

Definition 3.2. Fix a finite time horizon T, arrival rate function λ(·), and initial number of servers m(0). Let Opt be as defined in (2.3). For η > 0, we say that a prediction ˜λ is η-accurate for the instance (T, λ, m(0)) if

  ‖˜λ − λ‖_MAE ≤ η · Opt / T.    (3.2)

The definition of the prediction's accuracy is intimately tied to the cost of the optimal solution. As already argued in [36], since Opt is a linear functional of λ, normalizing the error by the cost of the optimal solution is necessary. This is because the definition should be invariant to scaling and padding arguments. For example, if we double both λ(·) and ˜λ(·), then the prediction's accuracy should still be the same.

Let A be any algorithm for (2.1). The performance of A is measured by the competitive ratio CR(η), which itself is a function of the accuracy η. The following definition is an adaptation of [36, Definition 3] for the current setup.

Definition 3.3. Fix a finite time horizon T, arrival rate function λ(·), and an initial number of servers m(0). Let A be any algorithm for (2.1) and m(t) denote its number of servers when it has access to a prediction ˜λ, and Opt be as defined in (2.3). We say that A has a competitive ratio at most CR for the instance (T, λ, m(0)) and prediction ˜λ if

  Cost_λ(m, T) ≤ CR · Opt + Φ(m(0)),    (3.3)

where Φ(m(0)) is a constant depending only on m(0). We say that the competitive ratio of A is at most CR(η), if the competitive ratio is at most CR(η) for all instances (T, λ, m(0)) and any η-accurate prediction ˜λ.
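As an illustration of Definitions 3.1 and 3.2, the following sketch (ours, assuming a uniform sampling grid, with `opt_cost` a placeholder for the offline optimum Opt in (2.3)) evaluates the MAE error and the η-accuracy test.

```python
import numpy as np

def mae_error(lam, lam_hat, dt):
    """Definition 3.1: (1/T) * integral over [0, T] of |lam_hat(t) - lam(t)| dt,
    with lam and lam_hat sampled on a uniform grid of step dt."""
    T = len(lam) * dt
    return np.sum(np.abs(lam_hat - lam)) * dt / T

def is_eta_accurate(lam, lam_hat, dt, eta, opt_cost):
    """Definition 3.2: the prediction is eta-accurate for the instance if
    ||lam_hat - lam||_MAE <= eta * Opt / T; opt_cost stands in for Opt in (2.3)."""
    T = len(lam) * dt
    return mae_error(lam, lam_hat, dt) <= eta * opt_cost / T

# Example on a 24-hour horizon sampled every 0.1 h (values are illustrative).
t = np.arange(0.0, 24.0, 0.1)
lam = 500.0 * (1.0 + np.sin(2.0 * np.pi * t / 24.0))
lam_hat = np.full_like(lam, 500.0)            # predict only the average
print(mae_error(lam, lam_hat, dt=0.1))
```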
Definition 3.4. Let A be any algorithm for (2.1), and CR(η) denote its competitive ratio when it has access to an η-accurate prediction. Then, we say that algorithm A is (i) ρ-consistent if CR(0) = ρ; (ii) α-robust if CR(η) = O(α(η)); (iii) γ-competitive if CR(η) ≤ γ for all η ∈ [0, ∞].

4 Algorithms and main results

4.1 Purely online scenario

We first discuss a competitive algorithm in the purely online scenario. Recall that in this case the system reveals m(0), ω, β, θ and, at time t, also λ(s) for s ∈ [0, t] to the algorithm, but not λ(s) for any s > t. Moreover, as mentioned earlier, the results in this section also hold when the finite time horizon T is not revealed upfront. The Balanced Capacity Scaling (BCS) algorithm that we propose is parameterized by two non-negative numbers r₁ and r₂. Algorithm 1 below describes BCS for any fixed choices of r₁ and r₂.
Algorithm 1: BCS(r₁, r₂)
Choose m(·) such that at each time t ≥ 0,

  dm(t)/dt = (r₁ω·q(t) − r₂θ·m(t)) / β.    (4.1)

We start by briefly discussing the intuition behind BCS. At each time t ≥ 0, BCS computes the derivative of the number of servers, i.e., how fast the system should increase or decrease the service capacity. Note that if we solve equation (4.1), then we obtain the number of servers m(t), which is differentiable for all t ≥ 0. The two parameters r₁ and r₂ control how fast the algorithm reacts, by increasing or decreasing the number of servers respectively. If the workload q(t) is non-zero, then the first term in the right-hand side of equation (4.1) increases the number of servers at rate r₁. The second term is an "inertia term" which decreases the number of servers at rate r₂. Note that if we integrate equation (4.1), we obtain

  ∫₀ᵗ r₁ω·q(s) ds = ∫₀ᵗ β·(dm(s)/ds) ds + ∫₀ᵗ r₂θ·m(s) ds.    (4.2)

This means that BCS aims to carefully balance the flow-time with the switching cost plus the power consumption. The BCS algorithm is memoryless and computationally cheap. The derivative of the number of servers depends only on the current workload and number of servers, without requiring knowledge about the past workload, number of servers or arrival rate. BCS can therefore be implemented without any memory requirements.

We are able to characterize the performance of BCS analytically, for any fixed choices of r₁ and r₂. Theorem 4.1 below characterizes the competitive ratio of the BCS algorithm. The proof of Theorem 4.1 is provided in Section 6.1.
Theorem 4.1. Let CR denote the competitive ratio of BCS (Algorithm 1). Then,

  CR ≤ (1 + 1/r₁ + 1/(2r₂)) · max(r₁, 2r₂).    (4.3)

The optimal choice of the parameters is r₁ = r₂ = 1. Corollary 4.2 states that BCS is 5-competitive in this case.
Corollary 4.2. Let CR denote the competitive ratio of BCS (Algorithm 1). If r₁ = 1 and r₂ = 1, then CR ≤ 5.

Moreover, in the special case when tasks are not allowed to wait and must be served immediately upon arrival (ω = ∞), BCS turns out to be 2-competitive, as stated in Theorem 4.3. The proof of Theorem 4.3 is given in Appendix A.3.

Theorem 4.3. Let CR denote the competitive ratio of BCS (Algorithm 1). If r₁ = 1, r₂ = 1 and ω = ∞, then CR ≤ 2.

Note that the capacity scaling problem has previously been related to the classical ski-rental problem [5, 27, 35], which is 2-competitive. As it turns out, when tasks are allowed to wait, the formulation in (2.1) of the capacity scaling problem is strictly harder than the ski-rental problem, as Proposition 4.4 states below. Proposition 4.4 is proved in Appendix A.4.
Proposition 4.4. Let A be any deterministic algorithm for the capacity scaling problem in (2.1) in the purely online scenario, and CR denote its competitive ratio. There exist choices for ω, β, and θ such that CR ≥ 2.549. In other words, any deterministic algorithm is at least 2.549-competitive.

Remark 4.5. We should note that the proof of Proposition 4.4 assumes that the finite time horizon T is not revealed upfront. We leave it to future work to identify a (possibly weaker) lower bound if T is known to the algorithm.

4.2 Augmenting ML predictions

To augment the BCS algorithm with machine learning predictions, we proceed in two steps. First, in Section 4.2.1, we introduce the Adapt to the Prediction (AP) algorithm. We prove that the competitive ratio of AP degrades gracefully with the prediction's accuracy, although AP is not competitive. Second, in Section 4.2.2, we discuss how to combine the BCS and AP algorithms to obtain the ABCS algorithm, which follows the predictions but is robust against inaccurate predictions and therefore competitive. A short implementation sketch of BCS is given below.
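The sketch is our forward-Euler discretization of (4.1), not the paper's pseudocode; the step size and parameter values are illustrative.

```python
def bcs_step(q, m, lam_t, dt, omega, beta, theta, r1=1.0, r2=1.0):
    """One Euler step of BCS (Algorithm 1): dm/dt = (r1*omega*q - r2*theta*m)/beta.
    Returns the updated (workload, capacity) pair; r1 = r2 = 1 is the choice
    behind the 5-competitive guarantee of Corollary 4.2."""
    dm = (r1 * omega * q - r2 * theta * m) / beta
    m_next = max(m + dm * dt, 0.0)
    # Queue update mirrors the constraint in (2.1).
    if q > 0.0 or lam_t >= m:
        q = max(q + (lam_t - m) * dt, 0.0)
    return q, m_next

# BCS is memoryless: each step needs only the current (q, m) and lam(t).
q, m = 0.0, 0.0
for i in range(1000):
    q, m = bcs_step(q, m, lam_t=500.0, dt=0.01, omega=1.0, beta=4.0, theta=1.0)
```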
4.2.1 The Adapt to the Prediction (AP) algorithm

We will now turn our attention to the machine learning scenario. Recall that in this case, at time t = 0, the algorithm receives a predicted arrival rate function ˜λ : [0, T] → R₊. Note that a trivial way to implement the predictions is to blindly trust the predictions, i.e., to let

  m ∈ argmin_{m : (0,T] → R₊} Cost_˜λ(m, T).    (4.4)

The above minimum exists (see Remark 2.1). However, in this case, the performance decays drastically even for relatively small prediction errors. Indeed, if the actual arrival rate λ(·) is higher than the predicted arrival rate ˜λ(·) at the start, then the associated workload could stay in the queue until the end of the time horizon [0, T] and incur a significant waiting cost. We instead propose the Adapt to the Prediction (AP) algorithm, which consists of an offline and an online component. The offline component computes an estimate for the number of servers upfront based on ˜λ(·). The online component follows the offline estimate, but dynamically adapts the number of servers based on discrepancies between the predicted and actual arrival rates. Let us define

  ∆λ(t) := (λ(t) − ˜λ(t))⁺ for t ≥ 0, and ∆λ(t) := 0 for t < 0.

Algorithm 2: AP
Choose m(·) such that at each time t ≥ 0,

  m(t) = m₁(t) + m₂(t),    (4.5)

where

  m₁ ∈ argmin_{m : (0,T] → R₊} Cost_˜λ(m, T),    (4.6)
  dm₂(t)/dt = √(ω/β)·(∆λ(t) − ∆λ(t − √(β/ω))).    (4.7)

The number of servers under the AP algorithm consists of two components, an offline component m₁ and an online component m₂. The offline component m₁ is determined upfront by the optimal number of servers to handle the predicted arrival rate ˜λ. The online component m₂ is determined in an online manner and it reacts if the actual arrival rate turns out to be higher than the predicted arrival rate. Note that if we solve equation (4.7), then we obtain the number of servers m(t), which is differentiable for all t ≥ 0. The online component works as follows. If ∆λ(t) > 0, then the online component increases the service capacity at rate √(ω/β)·∆λ(t). In other words, for each additional unit of workload received, the number of servers is increased by √(ω/β). After a fixed time of √(β/ω) the number of servers is decreased again. Intuitively, if ω ≫ β, then the online component turns on many servers for a short period of time, whereas if β ≫ ω, then the online component turns on a few servers for a longer period of time. The constants are chosen to carefully balance the flow-time with the switching cost.

Although the optimization in the offline component might be expensive, it has to be performed only once at the start. Moreover, if the predictions are based on historical data, the offline component m₁ might even be precomputed and retrieved from memory at the start. The online component, in contrast, is computationally cheap.
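A minimal sketch of the online component m₂ in (4.5)-(4.7) follows, under the assumption of a uniform time grid; the function name and the discretization are ours. Each unit of unpredicted workload switches on √(ω/β) servers, which are kept for √(β/ω) time.

```python
from collections import deque
import math

def ap_online_component(excess_rates, dt, omega, beta):
    """Sketch of m2 in (4.5)-(4.7): for each unit of unpredicted workload,
    sqrt(omega/beta) servers are added and kept on for sqrt(beta/omega) time.
    excess_rates[i] = max(lam(t_i) - lam_hat(t_i), 0), i.e., Delta-lambda."""
    rate = math.sqrt(omega / beta)   # servers per unit of excess workload
    hold = math.sqrt(beta / omega)   # how long the extra servers stay on
    window = deque()                 # recent excess workload, one entry per step
    m2 = []
    for d in excess_rates:
        window.append(d * dt)        # unpredicted workload in this step
        # Keep only the last `hold` time units of excess workload.
        while len(window) * dt > hold:
            window.popleft()
        m2.append(rate * sum(window))
    return m2

# Example: a burst of 100 req/s above the prediction during t in [1, 2).
excess = [100.0 if 1.0 <= i * 0.01 < 2.0 else 0.0 for i in range(500)]
print(max(ap_online_component(excess, dt=0.01, omega=1.0, beta=4.0)))
```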
The competitive ratio of AP, of course, depends on the accuracy of the predictions. Theorem 4.6 characterizes the performance of AP. Recall the definition of the competitive ratio CR(η) from Section 3.

Theorem 4.6. Let CR(η) denote the competitive ratio of AP (Algorithm 2) when it has access to an η-accurate prediction. Then,

  CR(η) ≤ 1 + (√(ωβ) + θ)·η.    (4.8)

The proof of Theorem 4.6 is provided in Appendix A.5. If η is small, then the competitive ratio is close to one. In fact, the AP algorithm replicates the optimal solution exactly (in the L¹ sense) if the predictions turn out to be accurate, and is 1-consistent. Moreover, the competitive ratio also degrades gracefully in the prediction's accuracy, which, as discussed earlier, is not achieved by the offline component m₁ alone.

Remark 4.7. Although AP does not follow the predictions blindly, the AP algorithm is not competitive, since it is not hard to verify that CR(η) → ∞ as η → ∞ (e.g., let ˜λ(t) → ∞ for all t ∈ [0, T]). Note that most algorithms proposed in the literature, such as the RHC and LCP algorithms from [34], follow the predictions blindly and therefore are not competitive. Hence, the goal in the next subsection is to combine the above approaches of BCS and AP to obtain an algorithm which follows the predictions most of the time, but ignores the predictions when appropriate.

4.2.2 The Adaptive Balanced Capacity Scaling (ABCS) algorithm

We now answer the question of how to follow the predictions most of the time without trusting them blindly. The Adaptive Balanced Capacity Scaling (ABCS) algorithm we propose strategically combines the BCS and AP algorithms introduced earlier. Let ˜m(·) be the number of servers as turned on by AP (Algorithm 2). Let ˜q(·) be the queue length process under the AP algorithm, that is,

  ˜q(t) = ∫₀ᵗ (λ(s) − ˜m(s))·1{˜q(s) > 0 or λ(s) ≥ ˜m(s)} ds for all t ≥ 0.    (4.9)

The ABCS algorithm is parameterized by four non-negative numbers R₁ ≥ r₁ ≥ 0 and R₂ ≥ r₂ ≥ 0. Algorithm 3 below describes ABCS for any fixed choices of r₁, r₂, R₁, R₂.

Algorithm 3: ABCS(r₁, r₂, R₁, R₂)
Choose m(·) such that at each time t ≥ 0,

  dm(t)/dt = (ˆr₁(t)ω·q(t) − ˆr₂(t)θ·m(t)) / β,    (4.10)

where

  ˆr₁(t) = r₁ if m(t) − ˜m(t) > [q(t) − ˜q(t)]⁺·√(ω/β), and R₁ if m(t) − ˜m(t) ≤ [q(t) − ˜q(t)]⁺·√(ω/β);
  ˆr₂(t) = R₂ if m(t) > ˜m(t) and q(t) ≤ ˜q(t), and r₂ if m(t) ≤ ˜m(t) or q(t) > ˜q(t).    (4.11)

In spirit, the ABCS algorithm works similarly to the BCS algorithm. In fact, if R₁ = r₁ and R₂ = r₂, then the ABCS algorithm is equivalent to the BCS algorithm and disregards predictions altogether. However, in contrast to the constant rates r₁ and r₂ of BCS, the rates at which ABCS reacts are captured by the state-dependent rate functions ˆr₁(t) and ˆr₂(t). The reason behind the precise choices of ˆr₁(t) and ˆr₂(t) will be clear later from the performance of the algorithm. From a high-level perspective, these are chosen to approach the behavior of the advised number of servers ˜m(t) of AP. Indeed, if ABCS has less than the advised number of servers ˜m(t), then it increases m(t) at the higher rate R₁ and decreases it at the lower rate r₂. Similarly, if ABCS has "sufficiently more" servers than the advised number ˜m(t), then it increases m(t) at the lower rate r₁ and decreases it at the higher rate R₂. The number of servers of ABCS therefore judiciously approaches the number of advised servers. However, it does not blindly follow ˜m(t), to protect against inaccurate predictions. For example, if the workload q(t) is significantly higher than the current number of servers m(t), then ABCS will always increase the number of servers at a non-zero rate.
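The state-dependent rates (4.11) translate directly into code. The sketch below is our discretization and assumes access to the AP subroutine's state (q_ap, m_ap); note that taking R₁ = r₁ and R₂ = r₂ makes both rates constant, recovering BCS.

```python
import math

def abcs_rates(q, m, q_ap, m_ap, omega, beta, r1, r2, R1, R2):
    """State-dependent rates (4.11) of ABCS (Algorithm 3). (q, m) is ABCS's own
    state; (q_ap, m_ap) is the state of the AP subroutine."""
    gap = max(q - q_ap, 0.0) * math.sqrt(omega / beta)
    r1_hat = r1 if m - m_ap > gap else R1             # increase slowly once well ahead of AP
    r2_hat = R2 if (m > m_ap and q <= q_ap) else r2   # decrease fast only when safe
    return r1_hat, r2_hat

def abcs_step(q, m, q_ap, m_ap, lam_t, dt, omega, beta, theta, r1, r2, R1, R2):
    """One Euler step of (4.10): dm/dt = (r1_hat*omega*q - r2_hat*theta*m)/beta."""
    r1_hat, r2_hat = abcs_rates(q, m, q_ap, m_ap, omega, beta, r1, r2, R1, R2)
    m_next = max(m + (r1_hat * omega * q - r2_hat * theta * m) / beta * dt, 0.0)
    if q > 0.0 or lam_t >= m:
        q = max(q + (lam_t - m) * dt, 0.0)
    return q, m_next

# Example step with high confidence in the predictions (R's large, r's small).
q, m = abcs_step(5.0, 2.0, 4.0, 3.0, lam_t=10.0, dt=0.01,
                 omega=1.0, beta=4.0, theta=1.0,
                 r1=0.125, r2=0.125, R1=8.0, R2=8.0)
```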
Our main result characterizes the performance of ABCS analytically, which is presented in Theorem 4.8 below. The proof of Theorem 4.8 is provided in Section 6.2. Recall the definition of the competitive ratio CR(η) from Section 3.

Theorem 4.8. Let CR(η) denote the competitive ratio of ABCS (Algorithm 3) when it has access to an η-accurate prediction. Then,

  CR(η) ≤ min( (1 + (√(ωβ) + θ)·η)·OCR, PCR ),    (4.12)

where OCR and PCR are the explicit constants of (4.13), each the maximum of a small number of terms built from constants c₁, ..., c₆ that depend only on the parameters r₁, r₂, R₁, and R₂.

Theorem 4.8 characterizes the competitive ratio of ABCS explicitly for any choices of the parameters. Note that for any value of η, the competitive ratio is at most PCR. Moreover, if η is small, then the competitive ratio is close to OCR. It is straightforward to check from Theorem 4.8 that ABCS satisfies the three desiderata of Definition 3.4. In particular, ABCS is OCR-consistent and PCR-competitive. The constants OCR and PCR, of course, depend on the parameters R₁, r₁, R₂ and r₂. Corollary 4.9 provides guidance on how to choose these parameters. For ease of understanding, we have stated Corollary 4.9 using asymptotic comparison symbols.
Corollary 4.9. Let r ≥ 1 be a hyperparameter, representing the confidence in the predictions. Let CR(η) denote the competitive ratio of ABCS (Algorithm 3) when it has access to an η-accurate prediction. If R₁ = Θ(r), r₁ = Θ(r⁻¹), R₂ = Θ(r) and r₂ = Θ(r⁻¹), then

  CR(η) ≤ min( (1 + (√(ωβ) + θ)·η)·(1 + Θ(r⁻¹)), Θ(r) ).    (4.14)

Corollary 4.9 characterizes the trade-off between the OCR and the PCR. If the confidence in the predictions, r, is set at a high value, then OCR tends to 1. However, the value of PCR tends to become large in this case, even though, importantly, it remains uniformly bounded as η → ∞. Figure 2a plots the competitive ratio as a function of η and the confidence hyperparameter r. For fixed r, the competitive ratio increases linearly in η. If η is large enough, then the competitive ratio becomes constant in η at a value of PCR. On the other hand, for fixed η, the competitive ratio is always 5 in the case of zero confidence (R₁ = r₁ and R₂ = r₂). Then, as the confidence increases, the competitive ratio decreases slowly if η is small enough and increases rapidly if η is large.

Figure 2: The analytical performance of ABCS (Algorithm 3). (a) The competitive ratio as a function of the normalized accuracy of the predictions (√(ωβ) + θ)·η and the confidence hyperparameter r; the competitive ratio decreases as predictions become more accurate, but remains bounded as the accuracy becomes worse. (b) The Pessimistic Competitive Ratio (PCR) as a function of the Optimistic Competitive Ratio (OCR); the figure interpolates between the purely online scenario (OCR = PCR = 5) and the machine learning scenario (OCR = 1, PCR = ∞).
Remark 4.10. Figure 2b plots the value of PCR on the y-axis, for a fixed OCR on the x-axis. Figure 2b depicts the interpolation between the purely online scenario (OCR = PCR = 5) and the machine learning scenario (OCR = 1, PCR = ∞). The current work generalizes these two extremes to any scenario in between. As mentioned in the introduction, we provide performance guarantees for ABCS for any value of the confidence hyperparameter r ≥ 1. However, the specific choice of r would depend on where the system designer wants to place the system on the blue curve in Figure 2b; view it as a risk-vs-gain curve. The figure shows that if one chooses a value of r so that, if the predictions turn out to be accurate, ABCS would be 2-competitive, then that would put the system at the risk of being up to about 90-competitive if the predictions turn out to be completely wrong. Later, in Proposition 4.12 we show that the growth rate of the OCR-vs-PCR curve that we obtain for ABCS is nearly optimal, in the sense that any algorithm which is (1 + δ)-competitive in the optimistic case must be at least Ω(1/δ)-competitive in the pessimistic case.

Remark 4.11. Recently, there has been some interest in understanding the performance of algorithms when an estimate of the prediction's accuracy η is available in terms of some probability distribution [42, 43]. In such cases, Theorem 4.8 allows one to calculate the optimal choice of the confidence hyperparameter r that minimizes the expected competitive ratio. Assume that the prediction's accuracy η follows some distribution µ(·). The distribution µ(·) might, for example, be estimated based on historically observed data. For a fixed r, note that OCR and PCR are functions of r. Writing κ := √(ωβ) + θ for brevity, denote

  ζ(r) := (PCR − OCR) / (κ·OCR).

The expected value of the random competitive ratio of ABCS is then

  E_{η∼µ}[CR(η)] = ∫₀^∞ min((1 + κη)·OCR, PCR) dµ(η)
                 = κ·OCR·∫₀^{ζ(r)} η dµ(η) + OCR·∫₀^{ζ(r)} dµ(η) + PCR·∫_{ζ(r)}^∞ dµ(η)
                 = κ·OCR·E[η·1{η ≤ ζ(r)}] + OCR·P(η ≤ ζ(r)) + PCR·P(η > ζ(r)).    (4.15)

Therefore, if either the distribution or an estimate thereof is known, then the parameters of ABCS can be chosen to minimize the expected competitive ratio; a numerical sketch of this computation is given below.
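The optimization over r described in Remark 4.11 can be carried out numerically. The following Monte Carlo sketch is ours; the OCR/PCR curves are placeholders with the qualitative shape of Corollary 4.9, not the exact constants of Theorem 4.8.

```python
import numpy as np

def expected_cr(eta_samples, ocr, pcr, kappa):
    """Monte Carlo estimate of (4.15): E[min((1 + kappa*eta) * OCR, PCR)],
    where kappa = sqrt(omega*beta) + theta and eta ~ mu."""
    return np.mean(np.minimum((1.0 + kappa * eta_samples) * ocr, pcr))

# Example: eta ~ Exponential(1). The curves below only mimic Corollary 4.9
# (OCR -> 1 and PCR = Theta(r)); they are illustrative placeholders.
rng = np.random.default_rng(0)
etas = rng.exponential(scale=1.0, size=100_000)
ocr_of_r = lambda r: 1.0 + 4.0 / r
pcr_of_r = lambda r: 5.0 * r
best_r = min(np.arange(1.0, 20.0, 0.5),
             key=lambda r: expected_cr(etas, ocr_of_r(r), pcr_of_r(r), kappa=2.0))
print(best_r)
```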
Theorem 4.8 and Corollary 4.9 demonstrate that there is a trade-off between the OCR and the PCR. The following proposition shows that this trade-off is, in fact, inherent to the problem and is not an artifact of the algorithm.

Proposition 4.12. Let A be any deterministic algorithm for the capacity scaling problem in (2.1), and CR(η) denote its competitive ratio when it has access to an η-accurate prediction. There exist choices of ω, β, and θ such that, for any δ > 0, if CR(0) ≤ 1 + δ, then

  CR(Θ(1/δ)) = Ω(1/δ).    (4.16)

In other words, any deterministic algorithm that is (1 + δ)-consistent must be Ω(1/δ)-competitive.

Proposition 4.12 is proved in Appendix A.9. As mentioned earlier, an algorithm for capacity scaling must consist of two components: one component is to decide when to activate a server and the other component is to decide when to deactivate a server. For the latter problem, a popular state-of-the-art approach is to implement a power-down timer [5, 19, 27, 29, 35, 44]. The power-down timer works as follows: each time a server becomes idle, the system starts a timer corresponding to that server. If the server remains idle after the timer expires, then the server is deactivated. Algorithm 4 shows the Timer algorithm.

Algorithm 4: The Timer Algorithm
At each time t ≥ 0, deactivate any server that has remained idle for β/θ time.
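A sketch of the deactivation rule of Algorithm 4 follows (our rendering; the bookkeeping of idle times is an implementation assumption).

```python
def timer_deactivations(idle_since, t_now, beta, theta):
    """Sketch of Algorithm 4: a server is deactivated once it has been idle for
    beta/theta time. idle_since maps server id -> time it became idle; busy
    servers are absent from the map."""
    timeout = beta / theta
    return [s for s, t_idle in idle_since.items() if t_now - t_idle >= timeout]

# Example: with beta = 4 and theta = 1, servers idle for >= 4 time units go off.
print(timer_deactivations({1: 0.0, 2: 3.5}, t_now=4.0, beta=4.0, theta=1.0))
```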
We end this section by pointing out that although the Timer algorithm above has proven to be successful under specific (especially stochastic) scenarios, the worst-case performance of the algorithm in the current context is poor, as the following proposition shows. More specifically, Proposition 4.13 shows that the Timer algorithm does not have a uniformly bounded competitive ratio. To the best of our knowledge, there did not exist any uniformly competitive algorithm for the capacity scaling problem where ω is finite, until the current work. Proposition 4.13 is proved in Appendix A.10.

Proposition 4.13. Fix any rule for activating servers and any ω > 0. There exist choices of β and θ such that the competitive ratio of the Timer algorithm (Algorithm 4) grows unboundedly with ω; in particular, the Timer algorithm is not uniformly competitive.

Remark 4.14. The constant β/θ is the default choice if ω = ∞ and is used in popular state-of-the-art approaches [5, 27, 29, 34]. The result of Proposition 4.13 extends readily to any constant multiple of β/θ. However, for other choices, the competitive ratio depends on the rule for activating servers.

Figure 3: The performance of our algorithms evaluated on a real-world dataset. (a) The real-world arrival pattern (λ) across four consecutive days and the number of servers of Opt, AP, BCS and the Timer algorithm. (b) The competitive ratio (CR) of AP, ABCS and the Timer algorithm as a function of the confidence hyperparameter (r). The CR of AP and ABCS further depends on the type of predictions provided (type 1-4). The CR of AP for type 1 and 2 is at least 20 and has not been presented here. The CR of ABCS for type 1 corresponds with the CR of ABCS for type 3.
5 Numerical experiments

5.1 Real-world dataset

We first verify the performance of the proposed algorithms on a real-world dataset of internet traffic. The dataset consists of DNS requests observed at a campus network across four consecutive days in April 2016 [39]. We extracted the number of requests at each second to obtain the arrival rate function λ(t). Furthermore, we assume that each server consumes 850 W at a price of 0.15 cents per kWh, and β is equal to the power cost of running a server for four hours. We will investigate the influence of the weights ω, β and θ and the confidence hyperparameter r on the performance. Moreover, we will test the influence of the predictions on the performance of ABCS and AP. The results in this section will illustrate the performance of these algorithms in practice, as opposed to the worst-case results presented before.

The real-world arrival pattern fluctuates between 0-20 requests per second during night time and up to 1500 requests per second at peak hours. Figure 4 presents the competitive ratios of AP (Algorithm 2), ABCS (Algorithm 3), and the Timer algorithm (Algorithm 4) as a function of the weights ω, β and θ. AP and ABCS further depend on the predictions provided. Let λ(·) be the arrival rate. The types of predictions used are as follows:
• Type 1. The system does not reveal any predictions, i.e., ˜λ(t) = 0 for all t ∈ [0, T].
• Type 2. The system reports the moving average (MA) across three hours, i.e., ˜λ(t) = (min(t, 1.5) + min(T − t, 1.5))⁻¹ · ∫_{max(t−1.5, 0)}^{min(t+1.5, T)} λ(s) ds for all t ∈ [0, T]; a sketch of this predictor follows the list.
• Type 3. The system provides perfect predictions, i.e., ˜λ(t) = λ(t) for all t ∈ [0, T].
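For concreteness, here is a sketch of the Type 2 moving-average predictor (our implementation of the formula above, assuming a uniform grid in hours).

```python
import numpy as np

def moving_average_prediction(lam, t_grid, half_width=1.5):
    """Type 2 predictor: the centered moving average over a 3-hour window,
    lam_hat(t) = (min(t, w) + min(T - t, w))^{-1} * integral of lam over
    [max(t - w, 0), min(t + w, T)], with w = half_width (in hours)."""
    T = t_grid[-1]
    dt = t_grid[1] - t_grid[0]
    lam_hat = np.empty_like(lam)
    for i, t in enumerate(t_grid):
        lo = np.searchsorted(t_grid, max(t - half_width, 0.0))
        hi = np.searchsorted(t_grid, min(t + half_width, T), side="right")
        length = min(t, half_width) + min(T - t, half_width)
        lam_hat[i] = np.sum(lam[lo:hi]) * dt / length  # window average
    return lam_hat
```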
Figure 4: The competitive ratio (CR) of AP, ABCS, and the Timer algorithm as a function of the weights ω, β and θ for a real-world arrival pattern ((a) CR as a function of ω; (b) CR as a function of θ; (c) CR as a function of β). The confidence hyperparameter of ABCS is r = 8. The CR of AP and ABCS further depends on the type of predictions provided (type 1-3). The CR of AP for type 1 and 2 is at least 20 and has not been presented here.

The AP algorithm performed poorly for type 1 and type 2 and its competitive ratio has not been presented in Figure 4. The poor performance was already expected from Theorem 4.6, because the prediction's accuracy is low. ABCS has a significantly lower competitive ratio than AP for inaccurate predictions (type 1 and type 2), proving that ABCS is robust against inaccurate predictions. Moreover, even if only the moving average (MA) is available, the performance of ABCS improves in most cases. The competitive ratio of ABCS if accurate predictions are available (type 3) is close to 1 for any choice of weights. Also, note that ABCS outperforms the Timer algorithm for almost any type of prediction and choice of weights.

Figure 3b presents the competitive ratio of AP (Algorithm 2), ABCS (Algorithm 3) and the Timer algorithm (Algorithm 4) as a function of the confidence hyperparameter (r) as introduced in Corollary 4.9. The competitive ratio only slightly depends on the confidence hyperparameter. The competitive ratio of ABCS for type 1 matches the competitive ratio for type 3, which is surprising and generally not true in light of Figure 4. The competitive ratio of ABCS increases as the confidence increases if inaccurate predictions are available (type 2) and decreases if accurate predictions are available (type 3).

5.2 Artificial datasets

To empirically verify the performance against extreme cases, we tested the proposed algorithms on three artificially generated datasets. The arrival rate functions from these datasets are not intended to model real-world internet traffic. Rather, the arrival rate functions are highly stylized versions of specific patterns occurring in real-world traffic. The three patterns considered are a sinusoidal, a sharp peak, and a step-function, as seen in Figure 5. Figure 5 also presents an example run of AP (Algorithm 2), BCS (Algorithm 1), and the Timer algorithm (Algorithm 4). The results will provide insights into how our algorithms respond to features of the demand process. For example, the sinusoidal represents an average of the arrival rate during a day, while the sharp peak may represent an unpredictable surge in workload. For each pattern, we will investigate the influence of the weights ω, β and θ and the confidence hyperparameter r on the performance. Moreover, for each pattern, we have further identified three to four predicted patterns to investigate the influence of the predictions on the performance.

Figure 5: The three artificial arrival patterns (λ) considered ((a) sinusoidal; (b) sharp peak; (c) step-function) and the number of servers of Opt, AP, BCS and the Timer algorithm.
The sinusoidal pattern represents a stylized periodic arrival rate function. The si-nusoidal pattern has a period of 24 hours, varying between λ ( t ) = λ ( t ) = ω , β and θ . AP and ABCS further depend onthe predictions provided. Let λ ( · ) be the arrival rate. The type of predictions are as follows:• Type 1.
The system does not reveal any predictions, i.e. ˜ λ ( t ) = t ∈ [ T ] .• Type 2.
The system predicts only the average of the arrival rate, i.e. ˜ λ ( t ) =
500 for all t ∈ [ T ] .• Type 3.
The system predicts the opposite of the arrival rate, i.e. ˜ λ ( t ) = − λ ( t ) for all t ∈ [ T ] .• Type 4.
The system provides perfect predictions, i.e. ˜ λ ( t ) = λ ( t ) for all t ∈ [ T ] .The competitive ratio of AP decreases as more accurate predictions are available. Generally,the availability of perfect predictions (type 4) guarantees a competitive ratio of 1, while thecompetitive ratio of opposite predictions (type 3) is higher than not revealing any predictionsat all (type 1). Whether the competitive ratio increases or decreases if the average of the arrivalrate (type 2) is available depends on the weights. The ABCS algorithm is more robust againstinaccurate predictions; the competitive ratio never exceeds 1.45 for any type of prediction orchoice of weights. For this pattern, ABCS is able to closely replicate the optimal solution, evenwithout the availability of predictions. Also, ABCS performs better than the Timer algorithm foralmost any choice of weights. Sharp peak.
The sharp peak pattern represents a (potentially unpredictable) surge in workloadthat lasts for a short time. The shark peak lasts five minutes at λ ( t ) = ω , β and θ . The type of predictions are as follows:• Type 1.
The system does not reveal any predictions, i.e. ˜ λ ( t ) = t ∈ [ T ] .• Type 2.
The system predicts only 50% of the peak height, i.e. ˜ λ ( t ) = · λ ( t ) for all t ∈ [ T ] .• Type 3.
The system provides perfect predictions, i.e. ˜ λ ( t ) = λ ( t ) for all t ∈ [ T ] .20 a) CR as a function of ω (b) CR as a function of θ (c) CR as a function of β Figure 6: The competitive ratio (CR) of AP, ABCS and the Timer algorithm as a function of theweights ω , β and θ for a sinusoidal arrival pattern. The confidence hyperparameter of ABCS is r =
8. The CR of AP and ABCS further depends on the type of predictions provided (type 1-4). (a) CR as a function of ω (b) CR as a function of β (c) CR as a function of θ Figure 7: The competitive ratio (CR) of AP, ABCS and the Timer algorithm as a function of theweights ω , β and θ for a sharp peak arrival pattern. The confidence hyperparameter of ABCS is r =
8. The CR of AP and ABCS further depends on the type of predictions provided (type 1-3).The competitive ratio of both AP and ABCS decreases as more accurate predictions are available.The competitive ratio is highest if the system does not reveal any predictions (type 1), followed byif the system predicts only 50% of the peak height (type 2) and if predictions are perfect (type 3).Moreover, if the predictions are inaccurate (type 1 and type 2), then ABCS significantly outper-forms AP and hence is robust against unpredictable surges in workload. Still, if the predictionsare accurate (type 3), then ABCS performs close to the optimal solution for almost any choice ofweights.
Step-function.
The step-function pattern is another periodic arrival rate function with suddenzero-one steps. The step-function pattern has a period of 24 hours, alternating between λ ( t ) = λ ( t ) = ω , β and θ . The type of predictions are similar to the sinusoidal pattern as follows:• Type 1.
The system does not reveal any predictions, i.e. ˜ λ ( t ) = t ∈ [ T ] .• Type 2.
The system predicts only the average of the arrival rate, i.e. ˜ λ ( t ) =
500 for all t ∈ [ T ] . 21 a) CR as a function of ω (b) CR as a function of β (c) CR as a function of θ Figure 8: The competitive ratio (CR) of AP, ABCS and the Timer algorithm as a function of theweights ω , β and θ for a step-function arrival pattern. The confidence hyperparameter of ABCSis r =
8. The CR of AP and ABCS further depends on the type of predictions provided (type 1-4). (a) Sinusoidal (b) Sharp peak (c) Step-function
Figure 9: The competitive ratio (CR) of AP, ABCS and the Timer algorithm as a function ofthe confidence hyperparameter ( r ). The CR of AP and ABCS further depends on the type ofpredictions provided (type 1-4).• Type 3.
The system predicts the opposite of the arrival rate, i.e. ˜ λ ( t ) = − λ ( t ) for all t ∈ [ T ] .• Type 4.
The system provides perfect predictions, i.e. ˜ λ ( t ) = λ ( t ) for all t ∈ [ T ] .The competitive ratio of both AP and ABCS decreases as more accurate predictions are available.Also, whether the system provides opposite predictions or does not provide any predictionsdoes not matter for the competitive ratio. The competitive ratio decreases if the system predictsthe average of the arrival rate. The performance of ABCS seems to increase as β decreases and θ increases, which is similar to but more pronounced as compared to the sinusoidal arrival pattern. Confidence.
Finally, we discuss the dependence of the performance on the confidence hyper-parameter ( r ) as introduced in Corollary 4.9. Figure 9 presents the competitive ratio as a functionof the confidence hyperparameter. Surprisingly, the competitive ratio decreases as the confidenceincreases even if the predictions are inaccurate. We attribute this to the fact that, for these arrivalpatterns, it is beneficial to increase the number of servers quickly in response to an increase inworkload. If the confidence hyperparameter is higher and as a result R is higher, then ABCS isable to increase the capacity quickly. 22 .3 The offline optimum The optimal offline solution to (2.1) can be computed by evaluating a linear program. Here, forthe sake of numerical complexity, we assume that the arrival rate function is constant in each δ -interval for arbitrary δ > δ >
0, suchthat λ ( i δ + s ) = λ ( i δ ) for all s ∈ [ δ ) and i =
0, 1, . . . , (cid:98) T / δ (cid:99) . (5.1)The assumption is reasonable in practice since the smallest time unit in the datasets presentedhere is one second and we could take δ = T is divisible by δ and let n = T / δ , q = m = m ( ) . The linear program isminimize m , d ∈ R n and q ∈ R n + ωδ · n ∑ i = q i + q i + + β · n ∑ i = d i + θδ · n ∑ i = m i subject to q i + ≥ q i + (cid:90) i δ ( i − ) δ λ ( t ) d t − δ m i for all i =
1, . . . , nd i ≥ m i − m i − for all i =
1, . . . , nq i + , d i , m i ≥ i =
1, . . . , n (5.2)The linear program returns an approximation of the optimal offline solution of (2.1) as demon-strated by the next lemma. Lemma 5.1 is proved in Appendix A.6. Lemma 5.1.
Fix a finite time horizon T, arrival rate function λ ( · ) , and the initial number of serversm ( ) . Let O pt be as defined in (2.3) and let m ( · ) be an optimal solution of the linear program (5.2) . Then, C ost λ ( m , T ) ≤ (cid:18) + ωδ θ (cid:19) (cid:18)(cid:18) + ωδ β (cid:19) O pt + ωδ m ( ) (cid:19) . (5.3) We will provide a high-level overview of the proof of Theorem 4.1 and refer to the appendix forthe details. Recall that the BCS algorithm is a special case of the ABCS algorithm (let R = r , R = r ). To prove Theorem 4.1, we will, in fact, establish a more general result in Proposition6.1 below, where the rates r and r may vary as rate functions over time. Proposition 6.1 statesthat the competitive ratio of ABCS never exceeds PCR irrespective of the magnitude of the errorin prediction. Theorem 4.1 thus follows immediately by letting R = r and R = r . Proposition 6.1.
Fix a finite time horizon T, arrival rate function λ ( · ) and initial number of serversm ( ) . Let O pt be as defined in (2.3) and m ( t ) be the number of servers of ABCS (Algorithm 3) when ithas access to a prediction ˜ λ . Then C ost λ ( m , T ) ≤ PCR · O pt + β · m ( ) r , (6.1) for all instances ( T , λ , m ( )) and predictions ˜ λ , where PCR is as defined in (4.13) . The proof of Proposition 6.1 is based on a potential function argument and is provided inAppendix A.7. We end this section by giving a proof-sketch of Proposition 6.1.23 roof sketch of Proposition 6.1.
Let m ( · ) be the number of servers of ABCS and m ∗ ( · ) be a differ-entiable optimal solution to the offline optimization problem (2.1). Appendix A.7 shows how toextend this to arbitrary non-differentiable solutions. Let Φ ( · ) be a non-negative potential functionsuch that d Φ ( t ) d t + ∂ C ost λ ( m , t ) ∂ t ≤ PCR · ∂ C ost λ ( m ∗ , t ) ∂ t , (6.2)assuming, for now, that such a Φ ( · ) exists. We integrate equation (6.2) from time t = t = T to obtainC ost λ ( m , T ) ≤ PCR · C ost λ ( m ∗ , T ) + Φ ( ) − Φ ( T ) ≤ PCR · C ost λ ( m ∗ , T ) + Φ ( ) , (6.3)where the last step follows because Φ ( T ) is non-negative. The proof of Proposition 6.1 is thereforecompleted if we manage to find a potential function Φ ( t ) satisfying equation (6.2) and Φ ( ) = β · m ( ) r . Define the potential function Φ ( t ) such that Φ ( t ) = (cid:40) c β · ( d R ( t ) − m ( t ) + m ∗ ( t )) , if m ( t ) > m ∗ ( t ) c β · ( d r ( t ) − m ( t ) + m ∗ ( t )) if m ( t ) ≤ m ∗ ( t )+ β · m ( t ) r + c R θ · [ q ( t ) − q ∗ ( t )] + , (6.4)where c and c are as defined in equation (4.13) and d r ( t ) = (cid:115) r ω · ([ q ( t ) − q ∗ ( t )] + ) β + ( m ( t ) − m ∗ ( t )) . (6.5)Note that Φ ( t ) is non-negative and Φ ( ) = β · m ( ) r . It remains to show that this choice of Φ ( t ) satisfies equation (6.2), which profoundly relies on d r ( t ) . The full argument involves a casedistinction and is provided in Appendix A.7. For this proof sketch, let us only consider the casethat m ( t ) > m ∗ ( t ) , q ( t ) > q ∗ ( t ) , d m d t ≥ d m ∗ d t ≥
0. Recall that, by the definition of ABCS,d q ( t ) d t = λ ( t ) − m ( t ) , d m ( t ) d t = ˆ r ( t ) ω · q ( t ) − ˆ r ( t ) θ · m ( t ) β ≤ R ω · q ( t ) β . (6.6)The derivative of d R ( t ) is at most β · d d R d t ≤ d R ( t ) − · R ω · ( q ( t ) − q ∗ ( t ))( λ ( t ) − m ( t ))+ R ω · ( q ∗ ( t ) − q ( t ))( λ ( t ) − m ∗ ( t ))+ R ω · q ( t ) · ( m ( t ) − m ∗ ( t ))+ β · d m ∗ d t · ( m ∗ ( t ) − m ( t )) = d R ( t ) − · ( m ( t ) − m ∗ ( t )) (cid:18) R ω · q ∗ ( t ) − β · d m ∗ d t (cid:19) ≤ R ω · q ∗ ( t ) . (6.7)Crucially, the derivative does not contain any terms involving m ( t ) or q ( t ) , but only q ∗ ( t ) , whichis easy to bound by the cost of the optimal solution. Thus, the derivative of Φ ( t ) can be upperbounded as follows:d Φ ( t ) d t ≤ c R ω · q ∗ ( t ) − (cid:18) + r (cid:19) β · d m d t + c β · d m ∗ d t + (cid:18) + R r (cid:19) θ · ( m ∗ ( t ) − m ( t )) , (6.8)24here some the constants c and c have been expanded according to their definitions in (4.13).Observe that the derivative of the potential function Φ ( t ) exactly cancels the cost of ABCS, since ω · q ( t ) ≤ β r · d m ( t ) d t + R θ m ( t ) r . We therefore obtaind Φ ( t ) d t + ∂ C ost λ ( m , t ) ∂ t ≤ c R ω · q ∗ ( t ) + c β · d m ∗ d t + (cid:18) + R r (cid:19) θ · m ∗ ( t ) ≤ max (cid:18) c R , c , 1 + R r (cid:19) · ∂ C ost λ ( m ∗ , t ) ∂ t . (6.9)Note that removing the term d r ( t ) from Φ ( t ) would yield a similar form as equation (6.8). How-ever, the resulting potential function is not non-negative, hence the need for the term d r ( t ) . Theorem 4.8 states that the competitive ratio is at most the minimum of the Optimistic Compet-itive Ratio (OCR) and the Pessimistic Competitive Ratio (PCR). Proposition 6.1 in the previoussection showed that the competitive ratio of ABCS is at most PCR. To prove the bound on thecompetitive ratio by OCR, we will relate the performance of ABCS to the cost achieved by thesubroutine AP. Proposition 6.2 below states that the cost of ABCS differs by at most a factor ofOCR from the cost of AP.
Proposition 6.2.
Fix a finite time horizon T, arrival rate function λ ( · ) and initial number of serversm ( ) . Let O pt be as defined in (2.3) and m ( t ) be the number of servers of BCS (Algorithm 1) when it hasaccess to a prediction ˜ λ . Then C ost λ ( m , T ) ≤ OCR · C ost λ ( ˜ m , T ) + β · m ( ) R , (6.10) for all instances ( T , λ , m ( )) and predictions ˜ λ , where ˜ m ( t ) represents the advised number of servers ofAP and OCR is as defined in (4.13) . The proof of Proposition 6.2 is based on a potential function argument and is provided inAppendix A.8. The proof follows a similar structure as the proof of Proposition 6.1. We nowhave all the ingredients to prove Theorem 4.8.
Proof of Theorem 4.8.
The proof follows almost immediately from Proposition 6.1 and 6.2. Notethat CR ( η ) ≤ PCR because C ost λ ( m , T ) ≤ PCR · O pt + β · m ( ) r , (6.11)by Proposition 6.1. Next, CR ( η ) ≤ ( + ( (cid:112) ωβ + θ ) η ) · OCR becauseC ost λ ( m , T ) ≤ OCR · C ost λ ( ˜ m , T ) + β · m ( ) R ≤ ( + ( (cid:112) ωβ + θ ) η ) · OCR · O pt + β · m ( ) R , (6.12)by Proposition 6.2 and Theorem 4.6. 25 emark . Note that Proposition 6.2 is oblivious to the choice of AP as the source of the advisednumber of servers ˜ m . Therefore, if there exists an algorithm similar to AP but with a better errordependence, then it is straightforward to extend ABCS to use this algorithm as the source forthe advised number of servers instead. As the proof of Theorem 4.8 shows, the improved errordependence carries over immediately. In this paper, we explored how ML predictions can be used to improve the performance ofcapacity scaling solutions without sacrificing robustness. We extend the state of the art in ca-pacity scaling by analyzing a more general model in continuous time where tasks are allowedto wait, in which case popular earlier proposed algorithms are not constant competitive. TheBalanced Capacity Scaling (BCS) algorithm we proposed is 5-competitive in the general case. Wealso introduced the Adapt to the Prediction (AP) algorithm which is 1-competitive if the MLpredictions are accurate. Finally, we proposed the Adaptive Balanced Capacity Scaling (ABCS)algorithm, which combines the ideas behind BCS and AP. We proved that, in the presence ofaccurate ML predictions, ABCS is ( + ε ) -competitive and its performance degrades gracefullyin the prediction’s accuracy. Moreover, the competitive ratio of ABCS is at most Θ (cid:0) ε − (cid:1) whenML predictions are inaccurate. Although the competitive ratio of ABCS depends on the accu-racy, the algorithm is oblivious to it. In the context of data centers, since real-world internettraffic is erratic, any implemented capacity scaling solution must be robust against sudden un-predictable surges in workload. Our results yield significant reductions in power consumptionwhile maintaining robustness against these sudden spikes. The theoretical results are supportedby extensive numerical experiments on a real-world dataset. Finally, in an ongoing work, weare exploring how the confidence hyperparameter of ABCS can be learned over time if there arestatistical guarantees on the prediction’s accuracy. DM thanks Ravi Kumar (Google) for inspiring discussions that started this project. We alsogratefully acknowledge the financial support for this project from the ARC-TRIAD fellowship.
References [1] Albers, S. and Fujiwara, H. (2007). Energy-efficient algorithms for flow time minimization.
ACM Transactions on Algorithms (TALG) , 3(4):49–es.[2] Albers, S., M ¨uller, F., and Schmelzer, S. (2014). Speed scaling on parallel processors.
Algorith-mica , 68(2):404–425.[3] Anderson, C. and Karlin, A. R. (1996). Two adaptive hybrid cache coherency protocols. In
Proceedings. Second International Symposium on High-Performance Computer Architecture , pages303–313. IEEE.[4] Ata, B. and Shneorson, S. (2006). Dynamic control of an M/M/1 service system with ad-justable arrival and service rates.
Management Science , 52(11):1778–1791.265] Augustine, J., Irani, S., and Swamy, C. (2004). Optimal power-down strategies. In , pages 530–539. IEEE.[6] Azar, Y., Bartal, Y., Feuerstein, E., Fiat, A., Leonardi, S., and Ros´en, A. (1999). On capitalinvestment.
Algorithmica , 25(1):22–36.[7] Bamas, E., Maggiori, A., Rohwedder, L., and Svensson, O. (2020). Learning augmented energyminimization via speed scaling. arXiv preprint arXiv:2010.11629 .[8] Bansal, N., Chan, H.-L., and Pruhs, K. (2009). Speed scaling with an arbitrary power function.In
Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms , pages 693–701. SIAM.[9] Barbu, V. and Precupanu, T. (2012).
Convexity and optimization in Banach spaces . SpringerScience & Business Media.[10] Barroso, L. A. and H ¨olzle, U. (2007). The case for energy-proportional computing.
Computer ,40(12):33–37.[11] Bodik, P. (2010).
Automating datacenter operations using machine learning . PhD thesis, UCBerkeley.[12] Boyar, J., Favrholdt, L. M., Kudahl, C., Larsen, K. S., and Mikkelsen, J. W. (2017). Onlinealgorithms with advice: A survey.
ACM Computing Surveys (CSUR) , 50(2):1–34.[13] Cortez, E., Bonde, A., Muzio, A., Russinovich, M., Fontoura, M., and Bianchini, R. (2017). Re-source central: Understanding and predicting workloads for improved resource managementin large cloud platforms. In
Proceedings of the 26th Symposium on Operating Systems Principles ,pages 153–167.[14] Damaschke, P. (2003). Nearly optimal strategies for special cases of on-line capital invest-ment.
Theoretical Computer Science , 302(1-3):35–44.[15] Dayarathna, M., Wen, Y., and Fan, R. (2015). Data center energy consumption modeling: Asurvey.
IEEE Communications Surveys & Tutorials , 18(1):732–794.[16] Doytchinov, B., Lehoczky, J., and Shreve, S. (2001). Real-time queues in heavy traffic withearliest-deadline-first queue discipline.
Annals of Applied Probability , pages 332–378.[17] Facebook (2014). Making Facebook’s software infrastructure more energy efficient withAutoscale. https://engineering.fb.com/production-engineering/making-facebook-s-software-infrastructure-more-energy-efficient-with-autoscale/ .[18] Galloway, J., Smith, K., and Carver, J. (2012). An empirical study of power aware loadbalancing in local cloud architectures. In , pages 232–236. IEEE.[19] Gandhi, A., Doroudi, S., Harchol-Balter, M., and Scheller-Wolf, A. (2013). Exact analysis ofthe M/M/k/setup class of Markov chains via recursive renewal reward. In
Proceedings of theACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems ,pages 153–166. 2720] Gandhi, A., Dube, P., Karve, A., Kochut, A., and Zhang, L. (2014). Adaptive, model-drivenautoscaling for cloud applications. In , pages 57–64.[21] Gandhi, A., Gupta, V., Harchol-Balter, M., and Kozuch, M. A. (2010). Optimality anal-ysis of energy-performance trade-off for server farm management.
Performance Evaluation ,67(11):1155–1171.[22] Gandhi, A., Harchol-Balter, M., Raghunathan, R., and Kozuch, M. A. (2012). Autoscale: Dy-namic, robust capacity management for multi-tier data centers.
ACM Transactions on ComputerSystems (TOCS) , 30(4):1–26.[23] Gao, J. (2014). Machine learning applications for data center optimization.[24] Goldman, S. A., Parwatikar, J., and Suri, S. (2000). Online scheduling with hard deadlines.
Journal of Algorithms , 34(2):370–389.[25] Google (2016). Data centers get fit on efficiency. https://blog.google/outreach-initiatives/environment/data-centers-get-fit-on-efficiency/ .[26] Hsu, C.-Y., Indyk, P., Katabi, D., and Vakilian, A. (2019). Learning-based frequency estima-tion algorithms. In
International Conference on Learning Representations .[27] Irani, S., Shukla, S., and Gupta, R. (2002). Competitive analysis of dynamic power man-agement strategies for systems with multiple power saving states. In
Proceedings 2002 Design,Automation and Test in Europe Conference and Exhibition , pages 117–123. IEEE.[28] Karlin, A. R., Kenyon, C., and Randall, D. (2003). Dynamic TCP acknowledgment and otherstories about e/(e- 1).
Algorithmica , 36:209–224.[29] Karlin, A. R., Manasse, M. S., Rudolph, L., and Sleator, D. D. (1988). Competitive snoopycaching.
Algorithmica , 3(1-4):79–119.[30] Khanafer, A., Kodialam, M., and Puttaswamy, K. P. (2013). The constrained ski-rental prob-lem and its application to online cloud cost optimization. In ,pages 1492–1500. IEEE.[31] Kumar, R., Purohit, M., Schild, A., Svitkina, Z., and Vee, E. (2018). Semi-online bipartitematching. arXiv preprint arXiv:1812.00134 .[32] Lassettre, E., Coleman, D. W., Diao, Y., Froehlich, S., Hellerstein, J. L., Hsiung, L., Mum-mert, T., Raghavachari, M., Parker, G., Russell, L., et al. (2003). Dynamic surge protection:An approach to handling unexpected workload surges with resource actions that have leadtimes. In
International Workshop on Distributed Systems: Operations and Management , pages 82–92. Springer.[33] Lee, R., Hajiesmaili, M. H., and Li, J. (2019). Learning-assisted competitive algorithms forpeak-aware energy scheduling. arXiv preprint arXiv:1911.07972 .[34] Lin, M., Wierman, A., Andrew, L. L., and Thereska, E. (2012). Dynamic right-sizing forpower-proportional data centers.
IEEE/ACM Transactions on Networking , 21(5):1378–1391.2835] Lu, T., Chen, M., and Andrew, L. L. (2012). Simple and effective dynamic provisioningfor power-proportional data centers.
IEEE Transactions on Parallel and Distributed Systems ,24(6):1161–1171.[36] Lykouris, T. and Vassilvtiskii, S. (2018). Competitive caching with machine learned advice.In
International Conference on Machine Learning , pages 3296–3305.[37] Maccio, V. J. and Down, D. G. (2015). On optimal policies for energy-aware servers.
Perfor-mance Evaluation , 90:36–52.[38] Mahdian, M., Nazerzadeh, H., and Saberi, A. (2012). Online optimization with uncertaininformation.
ACM Transactions on Algorithms (TALG) , 8(1):1–29.[39] Manmeet, S., Maninder, S., and Sanmeet, K. (2019). TI-2016 DNS dataset.
IEEE Dataport .[40] Mazzucco, M. and Dyachuk, D. (2012). Optimizing cloud providers revenues via energyefficient server allocation.
Sustainable Computing: Informatics and Systems , 2(1):1–12.[41] Mitzenmacher, M. (2018). A model for learned bloom filters and optimizing by sandwiching.In
Advances in Neural Information Processing Systems , pages 464–473.[42] Mitzenmacher, M. (2019a). Scheduling with predictions and the price of misprediction. arXivpreprint arXiv:1902.00732 .[43] Mitzenmacher, M. (2019b). The supermarket model with known and predicted service times. arXiv preprint arXiv:1905.12155 .[44] Mukherjee, D., Dhara, S., Borst, S. C., and van Leeuwaarden, J. S. H. (2017). Optimal serviceelasticity in large-scale distributed systems.
Proceedings of the ACM on Measurement and Analysisof Computing Systems , 1(1):1–28.[45] Mukherjee, D. and Stolyar, A. (2019). Join idle queue with service elasticity: Large-scaleasymptotics of a nonmonotone system.
Stochastic Systems , 9(4):338–358.[46] Nature (2018). How to stop data centres from gobbling up the world’s electricity. .[47] Netflix (2013). Scryer: Netflix’s Predictive Auto Scaling Engine. https://netflixtechblog.com/scryer-netflixs-predictive-auto-scaling-engine-a3f8fc922270 .[48] Purohit, M., Svitkina, Z., and Kumar, R. (2018). Improving online algorithms via ml predic-tions. In
Advances in Neural Information Processing Systems , pages 9661–9670.[49] Qi, J., Du, J., Siniscalchi, S. M., Ma, X., and Lee, C.-H. (2020). On mean absolute error fordeep neural network based vector-to-vector regression.
IEEE Signal Processing Letters .[50] Rong, H., Zhang, H., Xiao, S., Li, C., and Hu, C. (2016). Optimizing energy consumption fordata centers.
Renewable and Sustainable Energy Reviews , 58:674–691.[51] Rzadca, K., Findeisen, P., Swiderski, J., Zych, P., Broniek, P., Kusmierek, J., Nowak, P.,Strack, B., Witusowski, P., Hand, S., et al. (2020). Autopilot: workload autoscaling at google.In
Proceedings of the Fifteenth European Conference on Computer Systems , pages 1–16.2952] Shehabi, A., Smith, S., Sartor, D., Brown, R., Herrlin, M., Koomey, J., Masanet, E., Horner, N.,Azevedo, I., and Lintner, W. (2016). United States Data Center Energy Usage Report. Technicalreport, Lawrence Berkeley National Lab.[53] Sverdlik, Y. (2020). How Zoom, Netflix, and Dropbox are Staying Online During the Pan-demic. .[54] Tirmazi, M., Barker, A., Deng, N., Haque, M. E., Qin, Z. G., Hand, S., Harchol-Balter, M., andWilkes, J. (2020). Borg: the next generation. In
Proceedings of the Fifteenth European Conferenceon Computer Systems , pages 1–14.[55] Urgaonkar, B., Shenoy, P., Chandra, A., and Goyal, P. (2005). Dynamic provisioning of multi-tier internet applications. In
Second International Conference on Autonomic Computing (ICAC’05) ,pages 217–228. IEEE.[56] Wierman, A., Andrew, L. L., and Tang, A. (2009). Power-aware speed scaling in processorsharing systems. In
IEEE INFOCOM 2009 , pages 2007–2015. IEEE.
A Proofs
This section provides the proofs which have been omitted from the main text.
A.1 Proof of Proposition 2.2
Fix a finite time horizon T , arrival rate function λ ( · ) and initial number of servers m ( ) . Let V + ( f ) = lim sup δ ↓ (cid:98) T / δ (cid:99) ∑ i = [ f ( i δ + δ ) − f ( i δ )] + (A.1)for a function f : [ T ] → R . The definition of V + ( f ) is closely related to the notion of boundedvariation. The bounded variation of a function f : [ T ] → R is defined as V ( f ) = sup (cid:40) n ∑ i = | f ( z i ) − f ( z i − ) | such that { z i } ni = is a partition of [ T ] (cid:41) . (A.2)Let m n be a sequence of functions such that C ost λ ( m n , T ) → O pt as n → ∞ . There exists N ∈ N such that C ost λ ( m n , T ) ≤ · O pt for all n ≥ N . As a result, V + ( m n ) is uniformly bounded for n ≥ N . Note that, without increasing the cost, we can set m n ( T ) =
0. The bounded variation and V + ( m n ) are then related as V ( m n ) = V + ( m n ) + m ( ) . (A.3)The rest of the proof depends on the following compactness theorem [9]. Theorem A.1. (Helly’s selection theorem) Let f n : [ T ] → R be a sequence of functions and supposethat the next two conditions hold:(i) The sequence f n has uniformly bounded variation, i.e., sup n ∈ N V ( f n ) < ∞ , DM: Can you cite precise theorem number? ii) The sequence f n is uniformly bounded at a point, i.e., there exists t ∈ [ T ] such that { f n ( t ) } ∞ n = isa bounded set.Then, there exists a subsequence f n k of f n and a function f : [ T ] → R such that(i) f n k converges to f pointwise as k → ∞ ,(ii) f n k converges to f in L as k → ∞ ,(iii) V ( f ) ≤ lim inf k → ∞ V ( f n k ) . Recall the infinite sequence m N , m N + , . . . introduced above. Condition (i) in Theorem A.1holds because V ( m n ) = V + ( m n ) + m ( ) ≤ pt β + m ( ) , (A.4)for all n ≥ N . Moreover, condition (ii) in Theorem A.1 holds for t =
0, since m n ( ) = m ( ) forall n ∈ N . Hence, there exists a subsequence m n k of m n and a function m ∗ : [ T ] → R such that m n k → m ∗ pointwise and in the L norm, as k → ∞ , and V + ( m ∗ ) = ( V ( m ∗ ) − m ( )) /2 ≤ lim inf k → ∞ ( V ( m n k ) − m ( )) /2 = lim inf k → ∞ V + ( m n k ) . (A.5)Therefore, since the flow-time and the power cost are continuous in m with respect to the L norm,C ost λ ( m ∗ , T ) = ω · (cid:90) T q ∗ ( t ) d t + β · V + ( m ∗ ) + θ · (cid:90) T m ∗ ( t ) d t ≤ ω · lim k → ∞ (cid:90) T q n k ( t ) d t + β · lim inf k → ∞ V + ( m n k ) + θ · lim k → ∞ (cid:90) T m n k ( t ) d t ≤ lim k → ∞ C ost λ ( m n k , T ) = O pt , (A.6)which completes the proof of the proposition. A.2 Proof of Lemma 2.4
Fix any algorithm A and parameters ω and β . We will construct an instance for which C ost λ ( m , T ) ≥ ωτ β · O pt + Φ ( ) . Throughout the example we will assume that Φ ( ) is zero without loss of gen-erality.Let λ ( t ) = ρδ τ ( t ) , where δ τ ( t ) is the Dirac delta function at τ , i.e., δ τ ( τ ) = ∞ , δ τ ( t ) = t (cid:54) = τ and (cid:82) ∞ δ τ ( s ) d s =
1. The value of ρ will be specified later. Let m ( ) =
0, the finitetime horizon T = τ and θ =
0. Let m ( t ) be the number of servers of A for the instance. Wedistinguish two cases depending on sup t ∈ [ τ ,2 τ ) m ( t ) .1. First, consider the case when sup t ∈ [ τ ,2 τ ) m ( t ) < τ − . Fix ρ =
2. One possible alternativesolution of (2.1) is m ∗ ( t ) = t ∈ [ τ , 2 τ ] . This solution does not incur any waiting cost.The value of the optimal solution is therefore at most O pt ≤ β . The cost of A is at leastC ost λ ( m , T ) > ω · (cid:82) ττ (cid:0) ρ − ( s − τ ) τ − (cid:1) d s ≥ ωτ ≥ ωτ β · O pt .2. Next, consider the case when sup t ∈ [ τ ,2 τ ) m ( t ) ≥ τ − . Fix ρ =
0. The optimal solution of(2.1) is m ∗ ( t ) = t ∈ [ T ] . The value of the optimal solution is O pt =
0. The cost of A is at least C ost λ ( m , T ) ≥ β · τ − ≥ ωτ β · O pt .This completes the proof of the Lemma. 31 .3 Proof of Theorem 4.3 Fix a finite time horizon T , arrival rate function λ ( · ) and initial number of servers m ( ) . Let m ∗ ( t ) be a solution of the offline optimization problem (2.1) and m ( t ) be the number of serversof BCS (Algorithm 1). We will prove thatC ost ( m , T ) ≤ · C ost ( m ∗ , T ) + β · m ( ) , (A.7)where we have omitted λ from the notation C ost λ ( m , T ) . Overview of the proof.
Let t ≤ t ≤ . . . be a partitioning of the interval [ T ] such that (i) m ( t ) is monotone in [ t k , t k + ] and (ii) either m ( t ) > m ∗ ( t ) or m ( t ) ≤ m ∗ ( t ) in [ t k , t k + ] for all k ∈ N .The goal of the proof will be to find a non-negative potential function Φ ( t ) such that Φ ( t k + ) − Φ ( t k ) + C ost ( m , t k + ) − C ost ( m , t k ) ≤ · ( C ost ( m ∗ , t k + ) − C ost ( m ∗ , t k )) , (A.8)for all k ∈ N . We sum equation (A.8) over k ∈ N to obtainC ost ( m , T ) ≤ · C ost ( m ∗ , T ) + Φ ( ) − Φ ( T ) ≤ · C ost ( m ∗ , T ) + Φ ( ) , (A.9)where the last step follows because Φ ( T ) is non-negative. The proof of Theorem 4.3 is thereforecompleted if we manage to find a non-negative potential function Φ ( t ) satisfying equation (A.8)and Φ ( ) = β · m ( ) . Choice of Φ ( t ) . Define the potential function Φ ( t ) such that Φ ( t ) = (cid:40) β · m ( t ) , if m ( t ) > m ∗ ( t ) β · m ∗ ( t ) − β · m ( t ) if m ( t ) ≤ m ∗ ( t ) (A.10)Note that Φ ( t ) is non-negative and Φ ( ) = β · m ( ) . Verification of (A.8) . We continue by verifying equation (A.8). Fix k ∈ N . We distinguish twocases, depending on whether m ( t ) is decreasing or non-decreasing in [ t k , t k + ] .(i) Assume that m ( s ) is decreasing for s ∈ [ t k , t k + ] . Recall that, by definition,d m ( t ) d t = − θ · m ( t ) β , (A.11)for t ∈ [ t k , t k + ] and therefore m ( t k + s ) = m ( t k ) · exp (cid:18) − θ · s β (cid:19) , (A.12)for s ∈ [ t k + − t k ] and hence θ · (cid:90) t k + t k m ( s ) d s = β · m ( t k ) (cid:18) − exp (cid:18) − θ · ( t k + − t k ) β (cid:19)(cid:19) = β · ( m ( t k ) − m ( t k + )) . (A.13)32e further distinguish two cases depending on whether m ( t ) > m ∗ ( t ) or m ( t ) ≤ m ∗ ( t ) in [ t k , t k + ] . First, consider the case that m ( s ) > m ∗ ( s ) for s ∈ [ t k , t k + ] . Then, Φ ( t k + ) − Φ ( t k ) + C ost ( m , t k + ) − C ost ( m , t k )= β · ( m ( t k + ) − m ( t k )) + β · ( m ( t k ) − m ( t k + ))= ≤ ( C ost ( m ∗ , t k + ) − C ost ( m ∗ , t k )) . (A.14)Next, consider the case that that m ( s ) ≤ m ∗ ( s ) for s ∈ [ t k , t k + ] . Then, Φ ( t k + ) − Φ ( t k ) + C ost ( m , t k + ) − C ost ( m , t k )= β · ( m ∗ ( t k + ) − m ∗ ( t k )) − β · ( m ( t k + ) − m ( t k )) + β · ( m ( t k ) − m ( t k + ))= β · ( m ∗ ( t k + ) − m ∗ ( t k )) + θ · (cid:90) t k + t k m ( s ) d s ≤ β · ( m ∗ ( t k + ) − m ∗ ( t k )) + θ · (cid:90) t k + t k m ∗ ( s ) d s ≤ ( C ost ( m ∗ , t k + ) − C ost ( m ∗ , t k )) . (A.15)(ii) Assume that m ( s ) is non-decreasing for s ∈ [ t k , t k + ] . Note that, since tasks are not allowedto wait, m ∗ ( t ) ≥ λ ( t ) for all t ∈ [ T ] . Recall that, by definition, if the arrival rate λ ( t ) ishigher than the number of servers m ( t ) then BCS increases the number of servers to matchthe arrival rate. Therefore, m ( s ) = λ ( s ) for s ∈ [ t k , t k + ] because m ( t ) is non-decreasing in [ t k , t k + ] . Hence, m ∗ ( s ) ≥ m ( s ) = λ ( s ) for s ∈ [ t k , t k + ] and Φ ( t k + ) − Φ ( t k ) + C ost ( m , t k + ) − C ost ( m , t k )= β · ( m ∗ ( t k + ) − m ∗ ( t k )) − β · ( m ( t k + ) − m ( t k ))+ β · ( m ( t k + ) − m ( t k )) + θ · (cid:90) t k + t k m ( s ) d s ≤ β · ( m ∗ ( t k + ) − m ∗ ( t k )) + θ · (cid:90) t k + t k m ∗ ( s ) d s ≤ ( C ost ( m ∗ , t k + ) − C ost ( m ∗ , t k )) . (A.16) A.4 Proof of Proposition 4.4
Fix any algorithm A , and let m ( t ) denote its number of servers. We will construct an instance ( T , λ , m ( )) for which C ost λ ( m , T ) ≥ · O pt + Φ ( ) . Throughout the example, we willassume that Φ ( ) is zero, without loss of generality.Let λ ( t ) = t ∈ [ T ] and m ( ) =
0. The time horizon T will be specified later. Fix β = ω = θ =
0. Let the prediction ˜ λ ( t ) = t and as a result the advised numberof servers ˜ m ( t ) = t . Let m ( t ) be the number of servers of A for the instance. Define τ : = inf { t | m ( t ) > t } or τ = ∞ , if the infimum does not exist. We distinguish two casesdepending on the value of τ .1. First, consider the case when τ ≤ T = τ . The optimal solution to (2.1) is m ∗ ( t ) = t ∈ [ T ] . The value of the optimal solution is purely due to flow-time and isequal to O pt = τ /2.At time t = τ , algorithm A has at least m ( τ ) > τ servers. The flow-time is atleast (cid:82) τ q ( t ) d t ≥ (cid:82) τ (cid:82) t − s d s d t ≥ τ /2 − τ /12, because m ( t ) ≤ t for33 ∈ [ τ ) . The cost of A is therefore at least C ost λ ( m , T ) ≥ τ + τ /2 − τ /12 ≥ · τ /2 = · O pt , where the second inequality follows because τ ≤ τ > T =
3. The optimal solution to (2.1) is m ∗ ( t ) = t ∈ [ T ] . The value of the optimal solution is purely due to switching costand is equal to O pt = t = A is at least q ( ) ≥ (cid:82) − t d t = t = m ( t ) = + q ( ) / √ ≥ t ∈ ( T ] . The flow-time is therefore at least (cid:82) T q ( t ) d t ≥ (cid:82) (cid:82) s − s d s d t + q ( ) / √ ≥ m ( t ) ≤ t for t ∈ [ τ ) . The cost of A is thereforeat least C ost λ ( m , T ) ≥ ≥ · O pt .Hence, the statement follows. A.5 Proof of Theorem 4.6
Proof.
Fix a finite time horizon T , arrival rate function λ ( · ) and initial number of servers m ( ) .Let ˜ λ ( · ) be the predicted arrival rate. The idea to the proof is to separate the cost into the cost ofthe offline and the online component. More specifically, we claim thatC ost λ ( m , T ) ≤ C ost ˜ λ ( m , T ) + (cid:16)(cid:112) ωβ + θ (cid:17) T · (cid:107) ∆ λ (cid:107) MAE . (A.17)To see why, note that q ( t ) = (cid:90) t ( λ ( s ) − m ( s )) { q ( s ) > λ ( s ) ≥ m ( s ) } d s ≤ (cid:90) t ( ˜ λ ( s ) − m ( s )) { q ( s ) > λ ( s ) ≥ m ( s ) } d s + (cid:90) t ( ∆ λ ( s ) − m ( s )) { q ( s ) > λ ( s ) ≥ m ( s ) } d s ≤ (cid:90) t ( ˜ λ ( s ) − m ( s )) { q ( s ) > λ ( s ) ≥ m ( s ) } d s + (cid:90) t ( ∆ λ ( s ) − m ( s )) { q ( s ) > λ ( s ) ≥ m ( s ) } d s = q ( t ) + q ( t ) . (A.18)Therefore, the flow-time of the algorithm is at most the sum of the flow-time of the offlinecomponent on ˜ λ and the online component on ∆ λ . Similarly, since the switching cost and thepower cost are linear in the number of servers m ( · ) , the cost of the algorithm is at mostC ost λ ( m , T ) ≤ C ost ˜ λ ( m , T ) + C ost ∆ λ ( m , T ) . (A.19)We will further bound the cost of the online component m . Let [ t , t + δ ) ⊆ [ T ] be an arbitrarytime interval for δ > ∆ q ( t ) = (cid:82) t + δ t ∆ λ ( s ) d s . We will bound the cost due tothe ∆ q ( t ) workload received in this time interval. The number of servers m ( t ) increases by (cid:112) ω / ( β ) · ∆ q ( t ) in the interval. Moreover, after a time of (cid:112) β / ω , the number of servers m ( t ) decreases again by (cid:112) ω / ( β ) · ∆ q ( t ) . Throughout (cid:2) t , t + (cid:112) β / ω (cid:1) , the queue length due to this34raction of the workload decreases linearly as q ( t + s ) = ∆ q ( t ) − (cid:112) ω / ( β ) · ∆ q ( t ) · s until theworkload is completely handled. The cost due to waiting is therefore ω · (cid:90) (cid:113) βω (cid:18) ∆ q ( t ) − (cid:114) ω β · ∆ q ( t ) · s (cid:19) d s = ω · (cid:114) β ω · ∆ q ( t ) . (A.20)Note that, since δ can be chosen arbitrarily small, the waiting cost in the interval [ t , t + δ ) isnegligible. The switching cost is β · (cid:112) ω / ( β ) · ∆ q ( t ) and the power cost is θ · (cid:112) β / ω · (cid:112) ω / ( β ) · ∆ q ( t ) = θ · ∆ q ( t ) . The cost of the online component is thereforeC ost ∆ λ ( m , T ) ≤ lim δ ↓ (cid:98) T / δ (cid:99) ∑ i = (cid:16)(cid:112) ωβ + θ (cid:17) · ∆ q ( i δ ) = (cid:16)(cid:112) ωβ + θ (cid:17) T · (cid:107) ∆ λ (cid:107) MAE , (A.21)which proves equation (A.17) by combining (A.19) and (A.21). Similarly, let ∆ λ ∗ ( t ) = ˜ λ ( t ) − λ ( t ) .Then, by interchanging the actual arrival rate λ and the predicted arrival rate ˜ λ in equation (A.17),we find that C ost ˜ λ ( m ∗ , T ) ≤ C ost λ ( m ∗ , T ) + C ost ∆ λ ∗ ( m ∗ , T ) , (A.22)where m ∗ ( t ) = m ∗ ( t ) + m ∗ ( t ) and m ∗ ∈ arg min m : ( T ] → R + C ost λ ( m , T ) , (A.23)d m ∗ ( t ) d t = (cid:114) ω β · (cid:16) ∆ λ ∗ ( t ) − ∆ λ ∗ (cid:16) t − (cid:112) β / ω (cid:17)(cid:17) . (A.24)Finally, we combine (A.17) and (A.22) to find thatC ost λ ( m , T ) ≤ C ost ˜ λ ( m , T ) + (cid:16)(cid:112) ωβ + θ (cid:17) T · (cid:107) ∆ λ (cid:107) MAE ≤ C ost ˜ λ ( m ∗ , T ) + (cid:16)(cid:112) ωβ + θ (cid:17) T · (cid:107) ∆ λ (cid:107) MAE ≤ C ost λ ( m ∗ , T ) + (cid:16)(cid:112) ωβ + θ (cid:17) T · (cid:107) ˜ λ − λ (cid:107) MAE , (A.25)where the first inequality follows by (A.17), the second inequality follows because m achievesthe minimum cost on ˜ λ and the third inequality follows by (A.22). 
This completes the proofbecause m ∗ is the optimal offline solution on λ . A.6 Proof of Lemma 5.1
Fix a finite time horizon T , arrival rate function λ ( · ) and initial number of servers m ( ) . Assumethat T is divisible by δ and let n = T / δ . Let C : = { f : [ T ] → R + | f ( i δ + s ) = f ( i δ ) for all s ∈ [ δ ) and i =
0, 1, . . . , n − } (A.26)be the subspace of the space of functions which are constant in each δ -interval. Recall thatby assumption, λ ∈ C . We note that each f ∈ C is equivalently represented by a vector f =( f ( ) , f ( δ ) , . . . , f ( n − )) ∈ R n and vice versa. We will therefore interchangeably use vectornotation to denote an element from C . Claim A.2.
We claim that inf m ∈C C ost λ ( m , T ) ≤ (cid:18) + ωδ β (cid:19) inf m : [ T ] → R + C ost λ ( m , T ) + ωδ · m ( ) roof. Let m ∗ : [ T ] → R + be arbitrary and let m i = δ (cid:82) i δ ( i − ) δ m ∗ ( t ) d t . We will prove thatC ost λ ( m , T ) ≤ (cid:18) + ωδ β (cid:19) C ost λ ( m ∗ , T ) + ωδ · m ( ) m is at most the switching cost of m ∗ and the power cost of m is equal to thepower cost of m ∗ . We will therefore focus on the flow-time cost. The queue length of m ∗ at theendpoints of each δ -interval is at least the queue length of m as follows, q ∗ ( i δ ) = q ∗ (( i − ) δ ) + (cid:90) i δ ( i − ) δ ( λ i − m ∗ ( s )) { q ∗ ( s ) > λ i ≥ m ∗ ( s ) } d s ≥ (cid:20) q ∗ (( i − ) δ ) + (cid:90) i δ ( i − ) δ ( λ i − m ∗ ( s )) d s (cid:21) + ≥ [ q (( i − ) δ ) + δλ i − δ m i ] + = q ( i δ ) , (A.29)where the inequality q ∗ (( i − ) δ ) ≥ q (( i − ) δ ) follows by induction on i . Define ∆ i = sup t ∈ [( i − ) δ , i δ ] m ∗ ( t ) − inf t ∈ [( i − ) δ , i δ ] m ∗ ( t ) , (A.30)and observe that n ∑ i = ∆ i ≤ m ( ) + ε ↓ (cid:98) T / ε (cid:99) ∑ i = [ m ∗ ( i ε + ε ) − m ∗ ( i ε )] + ≤ m ( ) + ost λ ( m ∗ , T ) β . (A.31)Then, the flow-time cost of m ∗ in each δ -interval is at least, (cid:90) i δ ( i − ) δ q ∗ ( t ) d t = (cid:90) i δ ( i − ) δ (cid:20) q ∗ (( i − ) δ ) + (cid:90) t ( i − ) δ ( λ i − m ( s )) { q ∗ ( s ) > λ i ≥ m ( s ) } d s (cid:21) d t ≥ (cid:90) i δ ( i − ) δ (cid:20) q ∗ (( i − ) δ ) + (cid:90) t ( i − ) δ ( λ i − m ( s )) d s (cid:21) + d t ≥ (cid:90) i δ ( i − ) δ (cid:20) q (( i − ) δ ) + (cid:90) t ( i − ) δ ( λ i − m i − ∆ i ) d s (cid:21) + d t = (cid:90) i δ ( i − ) δ [ q ( t ) − ( t − ( i − ) δ ) ∆ i ] + d t ≥ (cid:90) i δ ( i − ) δ q ( t ) d t − δ ∆ i ω · (cid:90) T q ( t ) d t − ω · (cid:90) T q ∗ ( t ) d t ≤ ωδ · n ∑ i = ∆ i ≤ ωδ · m ( ) + ωδ · C ost λ ( m ∗ , T ) β , (A.33)where the second inequality follows by (A.31). This completes the proof of the claim.Let O bj λ ( m , T ) denote the value of the objective in (5.2) for m ∈ C . Claim A.3.
We claim that O bj λ ( m , T ) = C ost λ ( m , T ) + ω · n ∑ i = (cid:18) δ · q i − q i ( m i − λ i ) (cid:19) { q i > and q i + = }≤ (cid:18) + ωδ θ (cid:19) C ost λ ( m , T ) , (A.34) for any m ∈ C . roof. Let m ∈ C be arbitrary. Note that, since m ∈ C , the switching cost is ∑ ni = [ m i − m i − ] + andthe power cost is ∑ ni = δ m i , which matches the terms in O bj λ ( m , T ) . We will therefore focus onthe flow-time cost. Denote q = ( q ( ) , q ( δ ) , . . . , q ( n )) ∈ R n + . The flow-time is equal to (cid:90) i δ ( i − ) δ q ( t ) d t = (cid:90) δ [ q i + ( λ i − m i ) t ] + d t = δ · q i + q i + { q i = q i + > } + q i ( m i − λ i ) { q i > q i + = } , (A.35)because q ( · ) increases or decreases linearly. This completes the equality in the claim. To see whythe inequality holds, note that n ∑ i = (cid:18) δ · q i − q i ( m i − λ i ) (cid:19) { q i > q i + = } ≤ δ · n ∑ i = q i { q i > q i + = }≤ δ · n ∑ i = δ m i , ≤ δ · C ost λ ( m , T ) θ , (A.36)where the second inequality follows because δ ( m i − λ i ) ≥ q i . This completes the proof of theclaim.We now finish the proof of Lemma 5.1. Let m ∈ C be an optimal solution to (5.2). Moreover,define m ∗ = arg min m ∈C C ost λ ( m , T ) . (A.37)Then, the cost of m is at most,C ost λ ( m , T ) ≤ O bj λ ( m , T ) ≤ O bj λ ( m ∗ , T ) ≤ (cid:18) + ωδ θ (cid:19) C ost λ ( m ∗ , T ) ≤ (cid:18) + ωδ θ (cid:19) (cid:18)(cid:18) + ωδ β (cid:19) O pt + ωδ m ( ) (cid:19) , (A.38)which completes the proof of the lemma. A.7 Proof of Proposition 6.1
Fix a finite time horizon T , arrival rate function λ ( · ) and initial number of servers m ( ) . Let m ∗ ( t ) be a solution of the offline optimization problem (2.1) and q ∗ ( t ) the corresponding workload.Assume that the solution achieves a finite cost. If there does not exist a solution which achievesfinite cost, then Proposition 6.1 follows immediately. Without loss of generality, assume that m ∗ ( t ) is differentiable. To see why this is possible, assume that m ∗ ( t ) is not differentiable. Definethe interpolation m ∗ δ ( t ) of m ∗ ( t ) such that m ∗ δ ( t ) = (cid:90) t + δ t m ∗ ( s ) δ d s , (A.39)which is differentiable for all δ >
0. Also, note that (cid:90) t t m ∗ δ ( t ) d t = (cid:90) t t (cid:90) t + δ t m ∗ ( s ) δ d s d t → (cid:90) t t m ∗ ( t ) d t as δ →
0, (A.40)37or any 0 ≤ t ≤ t ≤ ∞ . The cost of m ∗ ( t ) and m ∗ δ ( t ) therefore coincide, asymptotically as δ →
0. As a result, each function m ∗ ( · ) can be written as the limit of a sequence of differentiablefunctions m ∗ δ ( · ) and we therefore assume that m ∗ ( t ) is differentiable without loss of generality. Overview of the proof.
Let m ( t ) be the number of servers of ABCS (Algorithm 3) and q ( t ) be the corresponding workload. The goal of the proof will be to find a non-negative potentialfunction Φ ( t ) such that d Φ ( t ) d t + ∂ C ost ( m , t ) ∂ t ≤ PCR · ∂ C ost ( m ∗ , t ) ∂ t , (A.41)where we have omitted λ from the notation C ost λ ( m , t ) . Note that C ost ( m , t ) and C ost ( m ∗ , t ) are differentiable because m ( t ) and m ∗ ( t ) are differentiable. We integrate equation (A.41) fromtime t = t = T to obtainC ost ( m , T ) ≤ PCR · C ost ( m ∗ , T ) + Φ ( ) − Φ ( T ) ≤ PCR · C ost ( m ∗ , T ) + Φ ( ) , (A.42)where the last step follows because Φ ( T ) is non-negative. The proof of Proposition 6.1 is thereforecompleted if we manage to find a differentiable potential function Φ ( t ) satisfying equation (A.41)and Φ ( ) = β · m ( ) r . Choice of Φ ( t ) . Define the potential function Φ ( t ) such that Φ ( t ) = (cid:40) c β · ( d R ( t ) − m ( t ) + m ∗ ( t )) , if m ( t ) > m ∗ ( t ) c β · ( d r ( t ) − m ( t ) + m ∗ ( t )) if m ( t ) ≤ m ∗ ( t )+ β · m ( t ) r + c R θ · [ q ( t ) − q ∗ ( t )] + , (A.43)where d r ( t ) = (cid:115) r ω · ([ q ( t ) − q ∗ ( t )] + ) β + ( m ( t ) − m ∗ ( t )) . (A.44)Note that Φ ( t ) is non-negative and Φ ( ) = β · m ( ) r . The sophisticated reader might remark thatthere are points in the domain for which Φ ( t ) is not differentiable. As there can only be countablymany of these points, these points do not influence the integral of equation (A.41) and we simplyignore these points in the analysis. Verification of (A.41) . We continue by verifying equation (A.41). We distinguish four cases,depending on whether q ( t ) > q ∗ ( t ) or q ( t ) ≤ q ∗ ( t ) and m ( t ) > m ∗ ( t ) or m ( t ) ≤ m ∗ ( t ) .(i) Assume that q ( t ) > q ∗ ( t ) and m ( t ) > m ∗ ( t ) . Recall that, by definition,d q d t = λ ( t ) − m ( t ) , d m d t = ˆ r ( t ) ω · q ( t ) − ˆ r ( t ) θ · m ( t ) β ≤ R ω · q ( t ) β . (A.45)38he derivative of d R ( t ) is therefore at most β · d d R ( t ) d t ≤ d R ( t ) − · R ω · ( q ( t ) − q ∗ ( t ))( λ ( t ) − m ( t ))+ R ω · ( q ∗ ( t ) − q ( t ))( λ ( t ) − m ∗ ( t ))+ R ω · q ( t ) · ( m ( t ) − m ∗ ( t ))+ β · d m ∗ d t · ( m ∗ ( t ) − m ( t )) = d R ( t ) − · ( m ( t ) − m ∗ ( t )) (cid:18) R ω · q ∗ ( t ) − β · d m ∗ d t (cid:19) ≤ R ω · q ∗ ( t ) + β · (cid:20) − d m ∗ d t (cid:21) + . (A.46)The derivative of the potential function Φ ( t ) is thend Φ ( t ) d t ≤ c R ω · q ∗ ( t ) + c β · (cid:32)(cid:20) − d m ∗ d t (cid:21) + + d m ∗ d t (cid:33) − (cid:18) c − r (cid:19) β · d m d t + c R θ · ( λ ( t ) − m ( t ) − λ ( t ) + m ∗ ( t )) ≤ c R ω · q ∗ ( t ) + c β · (cid:20) d m ∗ d t (cid:21) + − (cid:18) + r (cid:19) β · d m d t + (cid:18) + R + R r (cid:19) θ · ( m ∗ ( t ) − m ( t )) . (A.47)The derivative of the cumulative cost C ost ( m , t ) is ∂ C ost ( m , t ) ∂ t = ω · q ( t ) + β · (cid:20) d m d t (cid:21) + + θ · m ( t ) ≤ β r · d m d t + β · (cid:20) d m d t (cid:21) + + (cid:18) + R r (cid:19) θ · m ( t ) . (A.48)We sum equation (A.47) and (A.48) and cancel terms to obtaind Φ ( t ) d t + ∂ C ost ( m , t ) ∂ t ≤ c R ω · q ∗ ( t ) + c β · (cid:20) d m ∗ d t (cid:21) + + (cid:18) + R + R r (cid:19) θ · m ∗ ( t ) ≤ PCR · ∂ C ost ( m ∗ , t ) ∂ t . (A.49)Note that if d m d t ≥
0, then the sum follows immediately. If d m d t <
0, we apply the bound − β · d m d t ≤ R θ · m ( t ) − r ω · q ( t ) ≤ R θ · m ( t ) . (A.50)(ii) Assume that q ( t ) ≤ q ∗ ( t ) and m ( t ) > m ∗ ( t ) . The potential function Φ ( t ) simplifies to Φ ( t ) = β · m ( t ) r . (A.51)The derivative of the potential function Φ ( t ) is thend Φ ( t ) d t = β r · d m d t ≤ R ω r · q ∗ ( t ) − θ · m ( t ) . (A.52)39he derivative of the cumulative cost C ost ( m , t ) is ∂ C ost ( m , t ) ∂ t = ω · q ( t ) + β · (cid:20) d m d t (cid:21) + + θ · m ( t ) ≤ ω · q ( t ) + β · [ R ω · q ( t ) − r θ · m ( t )] + + θ · m ( t ) ≤ ( + R ) ω · q ∗ ( t ) + θ · m ( t ) (A.53)We sum equation (A.52) and (A.53) and cancel terms to obtaind Φ ( t ) d t + ∂ C ost ( m , t ) ∂ t ≤ (cid:18) + R + R r (cid:19) ω · q ∗ ( t ) ≤ PCR · ∂ C ost ( m ∗ , t ) ∂ t . (A.54)(iii) Assume that q ( t ) > q ∗ ( t ) and m ( t ) ≤ m ∗ ( t ) . Recall that, by definition,d q d t = λ ( t ) − m ( t ) , d m d t = ˆ r ( t ) ω · q ( t ) − ˆ r ( t ) θ · m ( t ) β ≥ r ω · q ( t ) − R θ · m ( t ) β . (A.55)The derivative of d r ( t ) is therefore at most β · d d r ( t ) d t ≤ d r ( t ) − · r ω · ( q ( t ) − q ∗ ( t ))( λ ( t ) − m ( t ))+ r ω · ( q ∗ ( t ) − q ( t ))( λ ( t ) − m ∗ ( t ))+( r ω · q ( t ) − R θ · m ( t ))( m ( t ) − m ∗ ( t ))+ β · d m ∗ d t · ( m ∗ ( t ) − m ( t )) ≤ d r ( t ) − · ( m ∗ ( t ) − m ( t )) (cid:18) R θ · m ( t ) + β · d m ∗ d t (cid:19) ≤ R θ · m ( t ) + β · (cid:20) d m ∗ d t (cid:21) + . (A.56)The derivative of the potential function Φ ( t ) is thend Φ ( t ) d t ≤ c R θ · m ( t ) + c β · (cid:32)(cid:20) d m ∗ d t (cid:21) + + d m ∗ d t (cid:33) − (cid:18) c − r (cid:19) β · d m d t + c R θ · ( λ ( t ) − m ( t ) − λ ( t ) + m ∗ ( t )) ≤ c R θ · m ∗ ( t ) + c β · (cid:20) d m ∗ d t (cid:21) + − (cid:18) c − r (cid:19) β · d m d t . (A.57)Similar to before, the derivative of the cumulative cost C ost ( m , t ) is ∂ C ost ( m , t ) ∂ t ≤ β r · d m d t + β · (cid:20) d m d t (cid:21) + + (cid:18) + R r (cid:19) θ · m ( t ) . (A.58)We sum equation (A.57) and (A.58) and cancel terms to obtaind Φ ( t ) d t + ∂ C ost ( m , t ) ∂ t ≤ c β · (cid:20) d m ∗ d t (cid:21) + + (cid:18) c R + − R r (cid:19) θ · m ∗ ( t ) ≤ PCR · ∂ C ost ( m ∗ , t ) ∂ t . (A.59)40iv) Assume that q ( t ) ≤ q ∗ ( t ) and m ( t ) ≤ m ∗ ( t ) . The potential function Φ ( t ) simplifies to Φ ( t ) = c β · ( m ∗ ( t ) − m ( t )) + β · m ( t ) r . (A.60)The derivative of the potential function Φ ( t ) is thend Φ ( t ) d t = c β · d m ∗ d t − (cid:18) c − r (cid:19) β · d m d t . (A.61)Similar to before, the derivative of the cumulative cost C ost ( m , t ) is ∂ C ost ( m , t ) ∂ t ≤ β r · d m d t + β · (cid:20) d m d t (cid:21) + + (cid:18) + R r (cid:19) θ · m ( t ) . (A.62)We sum equation (A.61) and (A.62) and cancel terms to obtaind Φ ( t ) d t + ∂ C ost ( m , t ) ∂ t ≤ c β · (cid:20) d m ∗ d t (cid:21) + + (cid:18) c R + − R r (cid:19) θ · m ∗ ( t ) ≤ PCR · ∂ C ost ( m ∗ , t ) ∂ t . (A.63) A.8 Proof of Proposition 6.2
Fix a finite time horizon T , arrival rate function λ ( · ) and initial number of servers m ( ) . Let ˜ m ( · ) be the number of advised servers of AP and ˜ q ( · ) be the corresponding workload. Assume thatthe advised number of servers achieves finite cost. If the advised number of servers does notachieve finite cost, then Proposition 6.2 follows immediately. Without loss of generality, similarto the proof of Proposition 6.1, assume that ˜ m ( t ) is differentiable. Overview of the proof.
As argued before (see the proof of Proposition 6.1), the proof of Propo-sition 6.2 requires us to find a non-negative potential function Φ ( t ) such thatd Φ ( t ) d t + ∂ C ost ( m , t ) ∂ t ≤ PCR · ∂ C ost ( ˜ m , t ) ∂ t , (A.64)where we have omitted λ from the notation C ost λ ( m , t ) . Choice of Φ ( t ) . Define the potential function Φ ( t ) such that Φ ( t ) = (cid:40) c β · ( d r ( t ) − m ( t ) + ˜ m ( t )) if ˆ r ( t ) = r , c β · d R ( t ) − c β · ( m ( t ) − ˜ m ( t )) if ˆ r ( t ) = R , + β · m ( t ) R + c θ · [ q ( t ) − ˜ q ( t )] + , (A.65)where d r ( t ) = (cid:115) r ω · ([ q ( t ) − ˜ q ( t )] + ) β + ( m ( t ) − ˜ m ( t )) . (A.66)41ote that Φ ( ) = β · m ( ) R . If ˆ r ( t ) = r or m ( t ) ≤ ˜ m ( t ) then Φ ( t ) is trivially non-negative. Assumethat ˆ r ( t ) = R and m ( t ) > ˜ m ( t ) , then Φ ( t ) ≥ c β · d R ( t ) − c β · ( m ( t ) − ˜ m ( t )) ≥ (cid:16) c (cid:112) + R − c (cid:17) β · ( m ( t ) − ˜ m ( t )) ≥
0, (A.67)and hence Φ ( t ) is non-negative. The sophisticated reader might remark that there are points inthe domain for which Φ ( t ) is not differentiable. As there can only be countably many of thesepoints, these points do not influence the integral of equation (A.64) and we simply ignore thesepoints in the analysis. Verification of (A.64) . We continue by verifying equation (A.64). We distinguish eight cases,depending on whether q ( t ) > ˜ q ( t ) or q ( t ) ≤ ˜ q ( t ) , m ( t ) > ˜ m ( t ) or m ( t ) ≤ ˜ m ( t ) and ˆ r ( t ) = r orˆ r ( t ) = R .(i.a) Assume that q ( t ) > ˜ q ( t ) , m ( t ) > ˜ m ( t ) and ˆ r ( t ) = r . Note that ˆ r ( t ) = r because q ( t ) > ˜ q ( t ) . Recall that, by definition,d q ( t ) d t = λ ( t ) − m ( t ) , d m ( t ) d t = r ω · q ( t ) − r θ · m ( t ) β . (A.68)The derivative of d r ( t ) is therefore at most β · d d r ( t ) d t ≤ d r ( t ) − · r ω · ( q ( t ) − ˜ q ( t ))( λ ( t ) − m ( t ))+ r ω · ( ˜ q ( t ) − q ( t ))( λ ( t ) − ˜ m ( t ))+( r ω · q ( t ) − r θ · m ( t ))( m ( t ) − ˜ m ( t ))+ β · d ˜ m d t · ( ˜ m ( t ) − m ( t )) = d r ( t ) − · ( m ( t ) − ˜ m ( t )) (cid:18) r ω · ˜ q ( t ) − r θ · m ( t ) − β · d ˜ m d t (cid:19) ≤ r ω · ˜ q ( t ) − r θ √ + r · m ( t ) + β · (cid:20) − d ˜ m d t (cid:21) + − β √ + r · (cid:20) d ˜ m d t (cid:21) + . (A.69)The derivative of the potential function Φ ( t ) is thend Φ ( t ) d t ≤ c r ω · ˜ q ( t ) − c r θ √ + r · m ( t )+ c β · (cid:20) − d ˜ m d t (cid:21) + − c β √ + r · (cid:20) d ˜ m d t (cid:21) + + c β · d ˜ m d t − (cid:18) c − R (cid:19) β · d m d t + c θ · ( λ ( t ) − m ( t ) − λ ( t ) + ˜ m ( t )) ≤ c r ω · ˜ q ( t ) − r θ r · m ( t ) + (cid:18) + R (cid:19) β · (cid:20) d ˜ m d t (cid:21) + − (cid:18) + r (cid:19) β · d m d t + c θ · ( ˜ m ( t ) − m ( t )) . (A.70)The derivative of the cumulative cost C ost ( m , t ) is ∂ C ost ( m , t ) ∂ t = ω · q ( t ) + β · (cid:20) d m d t (cid:21) + + θ · m ( t )= β r · d m d t + β · (cid:20) d m d t (cid:21) + + (cid:18) + r r (cid:19) θ · m ( t ) . (A.71)42e sum equation (A.70) and (A.71) and cancel terms to obtaind Φ ( t ) d t + ∂ C ost ( m , t ) ∂ t ≤ c r ω · ˜ q ( t ) + (cid:18) + R (cid:19) β · (cid:20) d ˜ m d t (cid:21) + + c θ · ˜ m ( t ) ≤ OCR · ∂ C ost ( ˜ m , t ) ∂ t . (A.72)Note that if d m ( t ) d t ≥
0, then the sum follows immediately. If d m ( t ) d t <
0, we apply the bound − β · d m ( t ) d t = r θ · m ( t ) − r ω · q ( t ) ≤ r θ · m ( t ) . (A.73)(i.b) Assume that q ( t ) > ˜ q ( t ) , m ( t ) > ˜ m ( t ) and ˆ r ( t ) = R . Note that ˆ r ( t ) = r because q ( t ) > ˜ q ( t ) . The derivative of d R ( t ) is therefore at most β · d d R ( t ) d t ≤ d R ( t ) − · R ω · ( q ( t ) − ˜ q ( t ))( λ ( t ) − m ( t ))+ R ω · ( ˜ q ( t ) − q ( t ))( λ ( t ) − ˜ m ( t ))+ R ω · q ( t ) · ( m ( t ) − ˜ m ( t ))+ β · d ˜ m d t · ( ˜ m ( t ) − m ( t )) = d R ( t ) − · ( m ( t ) − ˜ m ( t )) (cid:18) R ω · ˜ q ( t ) − β · d ˜ m d t (cid:19) ≤ R ω √ + R · ˜ q ( t ) + β √ + R · (cid:20) − d ˜ m d t (cid:21) + . (A.74)The derivative of the potential function Φ ( t ) is thend Φ ( t ) d t ≤ c R ω √ + R · ˜ q ( t ) + c β √ + R · (cid:20) − d ˜ m d t (cid:21) + + c β · d ˜ m d t − (cid:18) c − R (cid:19) d m d t + c θ · ( λ ( t ) − m ( t ) − λ ( t ) + ˜ m ( t )) ≤ c R ω √ + R · ˜ q ( t ) + c β · (cid:20) d ˜ m d t (cid:21) + − (cid:18) + R (cid:19) β · d m d t + c θ · ( ˜ m ( t ) − m ( t )) . (A.75)The derivative of the cumulative cost C ost ( m , t ) is ∂ C ost ( m , t ) ∂ t = β R · d m d t + β · (cid:20) d m d t (cid:21) + + (cid:18) + r R (cid:19) θ · m ( t ) . (A.76)We sum equation (A.75) and (A.76) and cancel terms to obtaind Φ ( t ) d t + ∂ C ost ( m , t ) ∂ t ≤ c R √ + R ω · ˜ q ( t ) + c β · (cid:20) d ˜ m d t (cid:21) + + c θ · ˜ m ( t ) ≤ OCR · ∂ C ost ( ˜ m , t ) ∂ t . (A.77)43ii.a) Assume that q ( t ) ≤ ˜ q ( t ) , m ( t ) > ˜ m ( t ) and ˆ r ( t ) = r . Note that ˆ r ( t ) = R because m ( t ) > ˜ m ( t ) and q ( t ) ≤ ˜ q ( t ) . The potential function Φ ( t ) simplifies to Φ ( t ) = β · m ( t ) R (A.78)The derivative of the potential function Φ ( t ) is thend Φ ( t ) d t = β R · d m d t = r ω R · q ( t ) − θ · m ( t ) . (A.79)The derivative of the cumulative cost C ost ( m , t ) is ∂ C ost ( m , t ) ∂ t = ω · q ( t ) + β · (cid:20) d m d t (cid:21) + θ · m ( t )= ω · q ( t ) + β · (cid:20) r ω · q ( t ) β − R θ · m ( t ) β (cid:21) + + θ · m ( t ) ≤ ( + r ) ω · q ( t ) + θ · m ( t ) . (A.80)We sum equation (A.79) and (A.80) and cancel terms to obtaind Φ ( t ) d t + ∂ C ost ( m , t ) ∂ t ≤ (cid:18) + r + r R (cid:19) ω · ˜ q ( t ) ≤ OCR · ∂ C ost ( ˜ m , t ) ∂ t . (A.81)(ii.b) Assume that q ( t ) ≤ ˜ q ( t ) , m ( t ) > ˜ m ( t ) and ˆ r ( t ) = R . However, ˆ r ( t ) = R implies that m ( t ) − ˜ m ( t ) ≤ [ q ( t ) − ˜ q ( t )] + · (cid:113) ω β = q ( t ) > ˜ q ( t ) , m ( t ) ≤ ˜ m ( t ) and ˆ r ( t ) = r . However, ˆ r ( t ) = r implies that m ( t ) − ˜ m ( t ) > [ q ( t ) − ˜ q ( t )] · (cid:113) ω β ≥ q ( t ) > ˜ q ( t ) , m ( t ) ≤ ˜ m ( t ) and ˆ r ( t ) = R . Note that ˆ r ( t ) = r because m ( t ) ≤ ˜ m ( t ) . The derivative of d R ( t ) is therefore at most β · d d R ( t ) d t ≤ d R ( t ) − · R ω · ( q ( t ) − ˜ q ( t ))( λ ( t ) − m ( t ))+ R ω · ( ˜ q ( t ) − q ( t ))( λ ( t ) − ˜ m ( t ))+( R ω · q ( t ) − r θ · m ( t ))( m ( t ) − ˜ m ( t ))+ β · d ˜ m d t · ( ˜ m ( t ) − m ( t )) ≤ d R ( t ) − · ( ˜ m ( t ) − m ( t )) (cid:18) r θ · m ( t ) + β · d ˜ m d t (cid:19) ≤ r θ · m ( t ) + β · (cid:20) d ˜ m d t (cid:21) + . 
(A.82)44he derivative of the potential function Φ ( t ) is thend Φ ( t ) d t ≤ c r θ · m ( t ) + c β · (cid:20) d ˜ m d t (cid:21) + + c β · d ˜ m d t − (cid:18) c − R (cid:19) β · d m d t + c θ · ( λ ( t ) − m ( t ) − λ ( t ) + ˜ m ( t )) ≤ c r θ · m ( t ) + ( c + c ) β · (cid:20) d ˜ m d t (cid:21) + − (cid:18) + R (cid:19) β · d m d t + c θ · ( ˜ m ( t ) − m ( t )) (A.83)The derivative of the cumulative cost C ost ( m , t ) is ∂ C ost ( m , t ) ∂ t = β R · d m d t + β · (cid:20) d m d t (cid:21) + + (cid:18) + r R (cid:19) θ · m ( t ) . (A.84)We sum equation (A.83) and (A.84) and cancel terms to obtaind Φ ( t ) d t + ∂ C ost ( m , t ) ∂ t ≤ ( c + c ) β · (cid:20) d ˜ m d t (cid:21) + + c θ · ˜ m ( t ) ≤ OCR · ∂ C ost ( ˜ m , t ) ∂ t . (A.85)(iv.a) Assume that q ( t ) ≤ ˜ q ( t ) , m ( t ) ≤ ˜ m ( t ) and ˆ r ( t ) = r . However, ˆ r ( t ) = r implies that m ( t ) − ˜ m ( t ) > [ q ( t ) − ˜ q ( t )] + · (cid:113) ω β = q ( t ) ≤ ˜ q ( t ) , m ( t ) ≤ ˜ m ( t ) and ˆ r ( t ) = R . Note that ˆ r ( t ) = r because m ( t ) ≤ ˜ m ( t ) . The potential function Φ ( t ) simplifies to Φ ( t ) = ( c + c ) β · ( ˜ m ( t ) − m ( t )) + β · m ( t ) R (A.86)The derivative of the potential function Φ ( t ) is thend Φ ( t ) d t = ( c + c ) β · d ˜ m d t − (cid:18) c + + R (cid:19) β · d m d t . (A.87)The derivative of the cumulative cost C ost ( m , t ) is ∂ C ost ( m , t ) ∂ t = β R · d m d t + β · (cid:20) d m d t (cid:21) + + (cid:18) + r R (cid:19) θ · m ( t ) . (A.88)We sum equation (A.87) and (A.88) and cancel terms to obtaind Φ ( t ) d t + ∂ C ost ( m , t ) ∂ t ≤ ( c + c ) β · (cid:20) d ˜ m d t (cid:21) + + c θ · ˜ m ( t ) ≤ OCR · ∂ C ost ( ˜ m , t ) ∂ t . (A.89)45 .9 Proof of Proposition 4.12 Fix any algorithm A , and let CR ( η ) denote its competitive ratio when it has access to an η -accurate prediction. Fix δ > ( ) ≤ + δ . We will construct an instance forwhich C ost λ ( m , T ) ≥ O pt / ( δ ) + Φ ( ) . Throughout the example, we will assume that Φ ( ) iszero without loss of generality.Let T = + √ δ , m ( ) = λ ( t ) = t ∈ [ √ δ ) . The value of λ ( t ) for t ∈ [ √ δ , T ] will be specified later. Fix β = ω = θ =
0. Let the prediction ˜ λ ( t ) = t ∈ [ T ] and let m ( t ) be the number of servers of A for the instance. We distinguish two cases depending on thevalue of m ( √ δ ) .1. First, consider the case when m ( √ δ ) <
3. Fix λ ( t ) = t ∈ [ √ δ , T ] . The optimalsolution of (2.1) is m ∗ ( t ) = t ∈ [ T ] . The value of the optimal solution is purely dueto switching cost and is equal to O pt =
2. As ˜ λ ( t ) = λ ( t ) , the prediction ˜ λ is perfect. Hence,by the assumption that A is ( + δ ) -consistent, we must have C ost λ ( m , T ) ≤ ( + δ ) O pt forthis instance. We will verify this.At time t = √ δ , the queue length of A is at least q ( √ δ ) ≥ (cid:82) √ δ λ ( t ) − m ( t ) d t > √ δ .The optimal solution starting from time t = √ δ is m ( t ) = + q ( √ δ ) / √ > + δ for t ∈ ( √ δ , T ] . As a result, the flow-time is at least (cid:82) T q ( t ) d t ≥ q ( √ δ ) / √ > δ . The cost of A is therefore at least C ost λ ( m , T ) > + δ ≥ ( + δ ) O pt . This is a contradiction with ourassumption and the next case must occur.2. Next, consider the case when m ( √ δ ) ≥
3. Fix λ ( t ) = t ∈ [ √ δ , T ] . The optimalsolution of (2.1) is m ∗ ( t ) = t ∈ [ T ] if δ is sufficiently small. The queue length satisfies q ∗ ( t ) = t for t ∈ [ √ δ ) and q ∗ ( t ) = √ δ − ( t − √ δ ) for t ∈ [ √ δ , T ] . The value of theoptimal solution is purely due to flow-time and is equal to O pt = δ + δ = δ .The cost of A is at least C ost λ ( m , T ) ≥ ≥ O pt / ( δ ) due to switching cost. Moreover,the mean absolute error (MAE) is T · (cid:13)(cid:13) ˜ λ − λ (cid:13)(cid:13) MAE = (cid:82) T (cid:12)(cid:12) ˜ λ ( t ) − λ ( t ) (cid:12)(cid:12) d t = δ -accurate.Hence, equation (4.16) follows. A.10 Proof of Proposition 4.13
Let m ( t ) denote the number of servers of the Timer algorithm. We will construct an instance ( T , λ , m ( )) for which C ost λ ( m , T ) ≥ O pt / ( ω ) + Φ ( ) for any Φ ( ) .Let T = n ω − , m ( ) = ω − and λ ( t ) = ∑ n − i = ω − · δ i ω − ( t ) , where δ t ( t ) is the Dirac delta at t , i.e. δ t ( t ) = ∞ , δ t ( t ) = t (cid:54) = t and (cid:82) ∞ δ t ( s ) d s =
1. Fix β = ω − and θ =
1. Let m ( t ) be the number of servers of the Timer algorithm for the instance. Since the time betweensubsequent arrivals is ω − = β / θ , the Timer Algorithm does not turn off any servers. Hence, thenumber of servers is m ( t ) = ω − for t ∈ [ T ] . The cost is thereforeC ost λ ( m , T ) = θ T · ω − = n ω − . (A.90)Let m ∗ ( t ) = t ∈ [ T ] . The corresponding queue length satisfies q ∗ ( i ω − + t ) = ω − − t for t ∈ [ ω − ) and i =
0, . . . , n −
1. The cost of the optimal solution is at most the cost of this46lgorithm, i.e.O pt ≤ ω · (cid:90) T q ∗ ( t ) d t + θ · (cid:90) T m ∗ ( t ) d t = n ω · (cid:90) ω − ( ω − − t ) d t + θ · (cid:90) T t = n ω · ω − + θ T = n ω − n ≥ ω · Φ ( ) and thereforeC ost λ ( m , T ) = n ω − ≥ n ω − + Φ ( ) ≥ ω · O pt + Φ ( ))