Queues with Small Advice

Abstract

Motivated by recent work on scheduling with predicted job sizes, we consider the performance of scheduling algorithms with minimal advice, namely a single bit. Besides demonstrating the power of very limited advice, such schemes are quite natural. In the prediction setting, one bit of advice can be used to model a simple prediction as to whether a job is "large" or "small"; that is, whether a job is above or below a given threshold. Further, one-bit advice schemes can correspond to mechanisms that tell whether to put a job at the front or the back of the queue, a limitation which may be useful in many implementation settings. Finally, queues with a single bit of advice have a simple enough state that they can be analyzed in the limiting mean-field analysis framework for the power of two choices. Our work follows in the path of recent work by showing that even small amounts of even possibly inaccurate information can greatly improve scheduling performance.


Michael Mitzenmacher ∗ June 30, 2020


∗ School of Engineering and Applied Sciences, Harvard University. Supported in part by NSF grants CCF-1563710 and CCF-1535795. This is an arXiv draft, to be submitted and subject to changes.

In queueing settings where the required service time for a job is known, strategies that take advantage of that information, such as Shortest Job First (SJF) or Shortest Remaining Processing Time (SRPT), can yield significant performance improvements over blind strategies such as First In First Out (FIFO). However, exact knowledge of the service times is a great deal to ask for in practice. Here we consider the setting where one is given much more limited information. Specifically, we consider the case where, for each job, a queue gets only one bit of information, or advice, regarding the job size.

While a one-bit limitation may seem unusual, there are both theoretical and practical motivations for such a study. Online algorithms with small amounts of optimal advice have been a subject of study in the theoretical literature (see, e.g., the survey [3]); such work highlights the potential for additional information to improve performance. Considering the case of just one bit of information is an interesting limiting case. Further, one-bit advice can naturally correspond to informing whether a job should be placed at the front or the back of the queue; for some queue implementations, such as in hardware or other highly constrained settings, one may desire this simplicity over more complicated data structures for managing job placement in the queue.

However, as a more concrete practical motivation, recently researchers have studied queues with predicted service times, rather than exact service times, where such predictions might naturally be provided by a machine learning algorithm [4, 13, 14, 15, 17, 19]. Indeed, the queueing setting is one natural example of an expanding line of work where predictions can be used to improve algorithms, particularly in scheduling (see, e.g., [8, 9, 11, 15, 17]). Our setting here of one-bit predictions can model a natural setting where the prediction corresponds to whether a job's service time is believed to be above or below a fixed threshold. Such predictions may be simpler to implement or more accurate than schemes that attempt to provide a prediction of the exact service time.

For single-queue settings, our work uses standard queueing theoretic analysis techniques. Here we generally follow the (folklore) approach of using Kleinrock's Conservation Law to derive formulae for the conditional waiting time of a job according to its service time; this approach dates back to at least the work of O'Donovan [16], from whose framework and notation we borrow. The derivations can also be readily obtained using the analysis of priority systems, following the framework presented in, for example, [6]. The goal here is not to suggest new methods of analysis, but instead to:

• show how the problem of scheduling with limited predicted information can naturally be analyzed;
• demonstrate how even limited advice and predictions can provide large performance gains; and
• show some interesting derivations for the special case we refer to as exponential predictions.

We also examine one-bit prediction schemes with large numbers of queues using the power of two choices. Here each arrival chooses the better of two randomly selected queues (or more generally from $d$ randomly selected queues) from a large system of (homogeneous) queues. This study shares many of the same motivations as for single queues; moreover, it may offer a first step to some open questions in the area, such as analyzing the power of two choices when using Shortest Remaining Processing Time or related schemes (see, e.g., [14]).

Finally, more generally, we believe this work also highlights some aspects of using machine learning predictions that may provide guidance for the design of machine learning prediction settings.
For example, we see that some predictions may be much more important than others; in queueing settings, it seems generally much more important to identify long jobs correctly than short jobs, as long jobs will block many other jobs from service.

We consider M/G/1 queueing systems, with arrival rate $\lambda$ and where the processing times are independently sampled according to the cumulative distribution $F(x)$ with corresponding density $f(x)$. We follow some of the notation from [16]. We assume the expected service time has been scaled so the mean service time is 1 (that is, $E[F] = 1$). Note $E[F^2]$ is the second moment for the service time. We further let

$$V = \frac{\lambda E[F^2]}{2}$$

be the expected remaining service time of the job being served at the time of a random arrival. We also let

$$\rho(t) = \lambda \int_0^t x f(x)\,dx$$

be the rate at which load is added to the queue from jobs with service time at most $t$, and correspondingly

$$\rho = \lambda \int_0^\infty x f(x)\,dx = \lambda.$$

The Conservation Law

As described in [16], Kleinrock's Conservation Law says that for a queue with Poisson arrivals satisfying basic assumptions (such as the queue is busy whenever there are jobs in the system), the expected load $L$ on the system at a random time point (e.g., in the stationary distribution) satisfies

$$L(1 - \rho) = V,$$

where again $V$ is the expected load due to the job in service and $\rho$ is the total rate at which load is added to the system. The law allows simple derivations of conditional expected waiting times, by looking at appropriate subsystems of jobs.

We consider the case of an advisor that provides a single bit of advice per job. Specifically, we consider the strategy where the advice bit is 0 if the job's service time is less than some threshold $T$, and 1 otherwise. The job is placed at the front of the queue if the advice bit is 0, and at the back of the queue otherwise.
We consider preemptive and non-preemptive queues, where in the preemptive case a job placed at the front will preempt the job currently receiving service. We later generalize the one bit of advice to prediction-based systems, where the prediction is whether the service time for the job is larger or smaller than the threshold.

We first consider arriving jobs with service time at most $T$. Here we do not require the conservation rule; such a job is placed at the front of the queue, although it has to wait for the job, if any, in service to complete. Further, any additional jobs of service time at most $T$ that arrive before this job starts service are placed ahead of the arriving job being considered. We denote the expected waiting time for a job with service time $t$, by which we mean the time spent by an incoming job in the stationary distribution waiting before starting to obtain service, by $W_1(t)$. We denote the expected sojourn time, by which we mean the entire time spent by an incoming job in the system, by $S_1(t)$.

The expected time an arriving job has to wait for an existing job being processed is $V$. It follows from standard busy period analysis that additional incoming jobs increase the expected waiting time by a factor of $1/(1 - \rho(T))$, and so

$$W_1(t) = \frac{V}{1 - \rho(T)}.$$

The expected sojourn time for such jobs is thus

$$S_1(t) = \frac{V}{1 - \rho(T)} + t.$$

For jobs with service time $t$ larger than $T$, we consider the subsystem of all jobs, and use the notation $W_2(t)$ and $S_2(t)$ for the corresponding quantities. In this setting we have

$$L = \frac{V}{1 - \rho}$$

from the conservation law. For any job with service time larger than $T$, any new job with service time at most $T$ that arrives will be placed ahead of this job until it is served. Hence

$$W_2(t) = \frac{L}{1 - \rho(T)} = \frac{V}{(1 - \rho)(1 - \rho(T))},$$

and

$$S_2(t) = \frac{V}{(1 - \rho)(1 - \rho(T))} + t.$$
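As a quick numerical illustration of the two conditional waiting times in the non-preemptive system (a worked example added for concreteness; the values $\lambda = 0.5$ and $T = 1$ are arbitrary illustrative choices), take exponentially distributed service times with mean 1. Then $E[F^2] = 2$, so $V = \lambda$, and $\rho(T) = \lambda(1 - (T+1)e^{-T})$:

$$\rho(1) = 0.5\,(1 - 2e^{-1}) \approx 0.132, \qquad
W(t)\big|_{t \le T} = \frac{V}{1 - \rho(T)} \approx \frac{0.5}{0.868} \approx 0.576, \qquad
W(t)\big|_{t > T} = \frac{V}{(1 - \rho)(1 - \rho(T))} \approx \frac{0.5}{0.5 \times 0.868} \approx 1.152.$$

In this instance a job above the threshold waits exactly $1/(1-\rho) = 2$ times as long, in expectation, as a job below it.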
For a given service distribution $F$ and threshold $T$, the total expected waiting time $W$ in the system is

$$W = \frac{V F(T)}{1 - \rho(T)} + \frac{V (1 - F(T))}{(1 - \rho)(1 - \rho(T))} = \frac{V (1 - F(T)\rho)}{(1 - \rho)(1 - \rho(T))}.$$

The expected sojourn time $S$ satisfies $S = W + 1$. Minimizing $W$ (or $S$) can be accomplished numerically.

As an example we discuss throughout this work, for exponentially distributed service times, $V = \lambda$, $F(T) = 1 - e^{-T}$, $\rho = \lambda$, and $\rho(T) = \lambda(1 - (T+1)e^{-T})$. We find the expected sojourn time for this case, which we refer to as $S_{e,n}$, is then

$$S_{e,n} = \frac{\lambda(1 - \lambda + \lambda e^{-T})}{(1 - \lambda)(1 - \lambda(1 - (T+1)e^{-T}))} + 1 = \frac{1 - \lambda + \lambda(T+1)e^{-T} - \lambda^2 T e^{-T}}{(1 - \lambda)(1 - \lambda(1 - (T+1)e^{-T}))}.$$

Taking the derivative, we find the optimal $T$ value occurs when

$$\frac{1 - \lambda}{\lambda} = \frac{e^{-T}}{T - 1},$$

or equivalently we seek $T$ that satisfies

$$\lambda = \frac{T - 1}{e^{-T} + T - 1}.$$

In particular, as $\lambda$ goes to 1, the optimal $T$ increases to infinity, and as $\lambda$ goes to 0, the optimal $T$ goes to 1. It is perhaps worth noting that a threshold $T$ of 4 corresponds to a $\lambda$ larger than 0.99; that is, in this case, we do not see very large thresholds even under high load.

As another example, we consider service distributions following the Weibull distribution with cumulative distribution $F(x) = 1 - e^{-\sqrt{2x}}$. The Weibull distribution is heavy-tailed; while the average service time of this distribution remains 1, the second moment is 6, so there are many more very long jobs as compared to the exponential distribution. Weibull distributions are commonly used for queueing simulations, as heavy-tailed service time distributions are more realistic for many scenarios.

For this Weibull distribution, $V = 3\lambda$ and $\rho(T) = \lambda(1 - e^{-\sqrt{2T}}(T + \sqrt{2T} + 1))$ are computed easily. The expected sojourn time in this case, which we denote by $S_{w,n}$, is then given by

$$S_{w,n} = \frac{3\lambda(1 - \lambda + \lambda e^{-\sqrt{2T}})}{(1 - \lambda)(1 - \lambda(1 - e^{-\sqrt{2T}}(T + \sqrt{2T} + 1)))} + 1.$$
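The closed form for exponentially distributed service times and the condition for the optimal threshold can be checked numerically. The following is a minimal sketch, not code from the paper: the function names and the bisection bracket `[1, 60]` are assumptions made for illustration, and the threshold condition is used in the equivalent form $(1-\lambda)(T-1) = \lambda e^{-T}$.

```python
import math


def sojourn_exp_nonpreemptive(lam, T):
    """Expected sojourn time with exact one-bit advice, no preemption,
    exponential service times with mean 1 (so V = lam and rho = lam)."""
    frac_short = 1.0 - math.exp(-T)                       # F(T)
    load_short = lam * (1.0 - (T + 1.0) * math.exp(-T))   # rho(T)
    waiting = lam * (1.0 - frac_short * lam) / ((1.0 - lam) * (1.0 - load_short))
    return waiting + 1.0


def optimal_threshold_exp(lam, lo=1.0, hi=60.0, iters=200):
    """Solve (1 - lam) * (T - 1) = lam * exp(-T) for T by bisection.

    The left side minus the right side is negative at T = 1 and increasing
    in T, so a sign change is bracketed for 0 < lam < 1 (hypothetical bracket).
    """
    def gap(T):
        return (1.0 - lam) * (T - 1.0) - lam * math.exp(-T)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for lam in (0.8, 0.9, 0.95):
        best = optimal_threshold_exp(lam)
        print(lam, round(best, 4), round(sojourn_exp_nonpreemptive(lam, best), 4))
```

With a very large threshold every job counts as short and the formula collapses to the FIFO value $1/(1-\lambda)$, which gives a simple sanity check on the implementation.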
The preemptive system

We first consider arriving jobs with service time at most $T$. Again, here we do not require the conservation rule; such a job is placed at the front of the queue, and any additional jobs of service time at most $T$ that arrive before this job starts service are placed ahead of the arriving job being considered. We use $W_1(t)$ and $S_1(t)$ as before.

Clearly $W_1(t) = 0$. However, we also consider the effect of preemptions. Since any job of size at most $T$ will preempt the job, the expected sojourn time is

$$S_1(t) = \frac{t}{1 - \rho(T)}.$$

For jobs with service times larger than $T$, we consider the subsystem of all jobs, and again use the notation $W_2(t)$ and $S_2(t)$ for the corresponding quantities. In this setting we have

$$L = \frac{V}{1 - \rho}$$

from the conservation law. While waiting, any job of service time at most $T$ is placed ahead of any job of size greater than $T$, so again

$$W_2(t) = \frac{L}{1 - \rho(T)} = \frac{V}{(1 - \rho)(1 - \rho(T))}.$$

Because of the preemption, the expected time from the start of service until finishing service increases to $t/(1 - \rho(T))$, and so

$$S_2(t) = \frac{V}{(1 - \rho)(1 - \rho(T))} + \frac{t}{1 - \rho(T)}.$$

In this case the total expected waiting time is

$$W = \frac{V(1 - F(T))}{(1 - \rho)(1 - \rho(T))},$$

and the total expected sojourn time is

$$S = \frac{V(1 - F(T)) + 1 - \rho}{(1 - \rho)(1 - \rho(T))}.$$

For exponentially distributed service times, we find the expected sojourn time in this case, which we refer to as $S_{e,p}$, is then

$$S_{e,p} = \frac{1 - \lambda + \lambda e^{-T}}{(1 - \lambda)(1 - \lambda(1 - (T+1)e^{-T}))}.$$

One can readily check that $S_{e,p} < S_{e,n}$ for any value of $T > 0$; indeed,

$$\lambda S_{e,p} = S_{e,n} - 1,$$

so

$$S_{e,n} - S_{e,p} = 1 - (1 - \lambda) S_{e,p} > 0.$$

Also, the optimal value of $T$ is again given by

$$\frac{1 - \lambda}{\lambda} = \frac{e^{-T}}{T - 1}.$$

For the Weibull distribution, we have the corresponding expression

$$S_{w,p} = \frac{1 - \lambda + 3\lambda e^{-\sqrt{2T}}}{(1 - \lambda)(1 - \lambda(1 - e^{-\sqrt{2T}}(T + \sqrt{2T} + 1)))}.$$

We consider a simple model where the probability of a misprediction for a given item depends only on its service time, independent of other jobs and other considerations.
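The relation between the preemptive and non-preemptive sojourn times for exponentially distributed service times, $\lambda S_{e,p} = S_{e,n} - 1$, can be confirmed numerically. The sketch below is illustrative only; the function names are assumptions and do not come from the paper.

```python
import math


def _shared_terms(lam, T):
    """Numerator and denominator shared by the two exponential closed forms."""
    numerator = 1.0 - lam + lam * math.exp(-T)
    denominator = (1.0 - lam) * (1.0 - lam * (1.0 - (T + 1.0) * math.exp(-T)))
    return numerator, denominator


def sojourn_exp_preemptive(lam, T):
    """Exact one-bit advice, short jobs preempt the job in service."""
    numerator, denominator = _shared_terms(lam, T)
    return numerator / denominator


def sojourn_exp_nonpreemptive(lam, T):
    """Exact one-bit advice, no preemption."""
    numerator, denominator = _shared_terms(lam, T)
    return lam * numerator / denominator + 1.0


if __name__ == "__main__":
    for lam in (0.5, 0.8, 0.95):
        for T in (0.5, 1.0, 2.0, 4.0):
            s_p = sojourn_exp_preemptive(lam, T)
            s_n = sojourn_exp_nonpreemptive(lam, T)
            # Difference between lam * S_p and S_n - 1 should be at rounding level.
            print(lam, T, s_p, s_n, abs(lam * s_p - (s_n - 1.0)))
```

Because both expressions share the same numerator and denominator, they are minimized by the same threshold, which is why the optimality condition is unchanged by preemption in this case.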
Specifically, we suppose we have a desired threshold $T$, and our prediction is simply our best guess as to whether a job's service time is larger or less than $T$. We define $g_T(x)$ to be the probability that a job of size $x$ is predicted to be less than $T$. While one can imagine more complex prediction models, this model is quite natural, and is useful for examining the potential power of predictions. To deal with the predictions, we now let

$$Q(T) = \int_0^\infty f(x) g_T(x)\,dx,$$

and

$$\rho'(T) = \lambda \int_0^\infty x f(x) g_T(x)\,dx.$$

Here $\rho'(T)$ can be interpreted as the rate at which load arrives to the system from jobs with predicted service time at most $T$, and similarly $Q(T)$ is the fraction of jobs predicted to have service time at most $T$.

We first consider arriving jobs with predicted service time at most $T$ and actual service time $t$. Following the same reasoning we have previously used, the waiting time $W'_1(t)$ for such jobs is given by

$$W'_1(t) = \frac{V}{1 - \rho'(T)}.$$

For jobs with predicted service time greater than $T$,

$$W'_2(t) = \frac{V}{(1 - \rho)(1 - \rho'(T))}.$$

The total expected waiting time per job is therefore given by

$$W' = \frac{V \int_0^\infty f(x) g_T(x)\,dx}{1 - \rho'(T)} + \frac{V \int_0^\infty f(x)(1 - g_T(x))\,dx}{(1 - \rho)(1 - \rho'(T))} = \frac{V \left(1 - \rho \int_0^\infty f(x) g_T(x)\,dx\right)}{(1 - \rho)(1 - \rho'(T))} = \frac{V(1 - \rho Q(T))}{(1 - \rho)(1 - \rho'(T))}.$$

In particular, we see that the only changes from the setting without the prediction are that in the $1 - F(T)\rho$ term in the numerator, the $F(T)$ has been replaced by the more complex integral expression $Q(T)$, and similarly the $1 - \rho(T)$ term in the denominator has become $1 - \rho'(T)$.

A model suggested in [13, 14] considers the setting where a prediction for a job with service time $z$ is itself exponentially distributed with mean $z$; we refer to this as the exponential prediction model.
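The general waiting-time formula with predictions can be evaluated for any prediction function $g_T$ by numerical integration. The sketch below assumes exponentially distributed service times with mean 1 and uses a hypothetical symmetric-noise classifier (a job is labeled correctly with probability $1 - \epsilon$) purely for illustration; the classifier, the midpoint integration rule, the truncation point, and all names are assumptions rather than part of the paper.

```python
import math


def midpoint_integral(func, upper, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [0, upper]."""
    h = upper / steps
    return h * sum(func((i + 0.5) * h) for i in range(steps))


def waiting_with_predictions(lam, g, upper=40.0):
    """W' = V (1 - rho Q(T)) / ((1 - rho)(1 - rho'(T))) for exp(1) service times.

    g(x) is the probability that a job of size x is predicted to be short.
    The integrals are truncated at `upper`, where the density is negligible.
    """
    V = lam          # lam * E[F^2] / 2 with E[F^2] = 2
    rho = lam
    Q = midpoint_integral(lambda x: math.exp(-x) * g(x), upper)
    rho_pred = lam * midpoint_integral(lambda x: x * math.exp(-x) * g(x), upper)
    return V * (1.0 - rho * Q) / ((1.0 - rho) * (1.0 - rho_pred))


def noisy_threshold(T, eps):
    """Hypothetical classifier: correct label with probability 1 - eps."""
    return lambda x: (1.0 - eps) if x < T else eps


if __name__ == "__main__":
    for eps in (0.0, 0.1, 0.3, 0.5):
        print(eps, waiting_with_predictions(0.5, noisy_threshold(1.0, eps)))
```

With `eps = 0` the result reduces to the exact-advice waiting time, and with `eps = 0.5` the label carries no information, so $Q = 1/2$ and $\rho' = \rho/2$ and the expression simplifies to the FIFO waiting time $V/(1-\rho)$.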
While not necessarily realistic, this model often allows for mathematical derivations, and provides a useful starting point for considering the effects of predictions. With this model, $g_T(x) = 1 - e^{-T/x}$, and hence for exponentially distributed service times

$$Q(T) = \int_0^\infty f(x) g_T(x)\,dx = 1 - \int_0^\infty e^{-x - T/x}\,dx = 1 - 2\sqrt{T} K_1(2\sqrt{T}),$$

where $K_1$ is a modified Bessel function of the second kind. Also

$$\rho'(T) = \lambda \int_0^\infty \left(x e^{-x} - x e^{-x - T/x}\right) dx = \lambda\left(1 - 2T K_2(2\sqrt{T})\right),$$

where $K_2$ is a modified Bessel function of the second kind (with a different parameter).

The expected sojourn time for this case, which we refer to as $S_{e^*,n}$, is then

$$S_{e^*,n} = \frac{\lambda\left(1 - \lambda\left(1 - 2\sqrt{T} K_1(2\sqrt{T})\right)\right)}{(1 - \lambda)\left(1 - \lambda\left(1 - 2T K_2(2\sqrt{T})\right)\right)} + 1.$$

There does not appear to be a simple form for the derivative of this expression that allows us to write a simple form for the optimal value of $T$, although it can be found numerically.

We can perform similar calculations for our Weibull distribution. Here we have

$$Q(T) = 1 - \int_0^\infty \frac{1}{\sqrt{2x}}\, e^{-\sqrt{2x} - T/x}\,dx = 1 - \sqrt{\frac{T}{2\pi}}\; G^{3,0}_{0,3}\!\left(\frac{T}{2} \,\Big|\, -\frac{1}{2}, 0, \frac{1}{2}\right),$$

where here $G$ is the Meijer $G$-function. Similarly

$$\rho'(T) = \lambda \int_0^\infty \sqrt{\frac{x}{2}} \left(e^{-\sqrt{2x}} - e^{-\sqrt{2x} - T/x}\right) dx = \lambda\left(1 - \sqrt{\frac{2T}{\pi}}\; G^{3,0}_{0,3}\!\left(\frac{T}{2} \,\Big|\, -\frac{1}{2}, 1, \frac{3}{2}\right)\right),$$

where again $G$ is the Meijer $G$-function.

The expected sojourn time for this case, which we refer to as $S_{w^*,n}$, is then

$$S_{w^*,n} = \frac{3\lambda\left(1 - \lambda\left(1 - \sqrt{\frac{T}{2\pi}}\; G^{3,0}_{0,3}\!\left(\frac{T}{2} \,\Big|\, -\frac{1}{2}, 0, \frac{1}{2}\right)\right)\right)}{(1 - \lambda)\left(1 - \lambda\left(1 - \sqrt{\frac{2T}{\pi}}\; G^{3,0}_{0,3}\!\left(\frac{T}{2} \,\Big|\, -\frac{1}{2}, 1, \frac{3}{2}\right)\right)\right)} + 1.$$

The preemptive system

For the preemptive system, again let

$$Q(T) = \int_0^\infty f(x) g_T(x)\,dx,$$

and

$$\rho'(T) = \lambda \int_0^\infty x f(x) g_T(x)\,dx.$$

We first consider arriving jobs with predicted service time at most $T$ and actual service time $t$. Such jobs will have no waiting time, but their expected sojourn time is

$$S_1(t) = \frac{t}{1 - \rho'(T)}.$$
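The Bessel-function form of $Q(T)$ for exponentially distributed service times under exponential predictions can be cross-checked without any special-function library. The sketch below evaluates $K_\nu(z)$ through the standard integral representation $K_\nu(z) = \int_0^\infty e^{-z\cosh t}\cosh(\nu t)\,dt$ (valid for $z > 0$) and compares $2\sqrt{T}K_1(2\sqrt{T})$ with a direct numerical integral of $e^{-x - T/x}$. The function names, step counts, and truncation points are illustrative assumptions.

```python
import math


def bessel_k(nu, z, upper=20.0, steps=20_000):
    """Modified Bessel function of the second kind via
    K_nu(z) = int_0^inf exp(-z cosh t) cosh(nu t) dt, midpoint rule."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += math.exp(-z * math.cosh(t)) * math.cosh(nu * t)
    return h * total


def direct_integral(T, upper=50.0, steps=100_000):
    """Midpoint-rule value of int_0^inf exp(-x - T/x) dx, truncated at `upper`."""
    h = upper / steps
    return h * sum(math.exp(-(i + 0.5) * h - T / ((i + 0.5) * h)) for i in range(steps))


def sojourn_exp_predictions_nonpreemptive(lam, T):
    """Non-preemptive sojourn time under exponential predictions, exp(1) service."""
    root = math.sqrt(T)
    Q = 1.0 - 2.0 * root * bessel_k(1, 2.0 * root)
    rho_pred = lam * (1.0 - 2.0 * T * bessel_k(2, 2.0 * root))
    return lam * (1.0 - lam * Q) / ((1.0 - lam) * (1.0 - rho_pred)) + 1.0


if __name__ == "__main__":
    print(direct_integral(1.0), 2.0 * bessel_k(1, 2.0))
    for T in (0.5, 1.0, 2.0, 4.0):
        print(T, sojourn_exp_predictions_nonpreemptive(0.5, T))
```

If SciPy is available, `scipy.special.kv` could replace the hand-rolled `bessel_k`; the integral form is used here only to keep the sketch dependency-free.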
For jobs with predicted service time greater than or equal to $T$, following the same reasoning as in the case without predictions, we have

$$W_2(t) = \frac{V}{(1 - \rho)(1 - \rho'(T))},$$

and

$$S_2(t) = \frac{V}{(1 - \rho)(1 - \rho'(T))} + \frac{t}{1 - \rho'(T)}.$$

We therefore find the total expected waiting time is

$$W = \frac{V(1 - Q(T))}{(1 - \rho)(1 - \rho'(T))},$$

and the total expected sojourn time is

$$S = \frac{V(1 - Q(T))}{(1 - \rho)(1 - \rho'(T))} + \frac{1}{1 - \rho'(T)} = \frac{V(1 - Q(T)) + 1 - \rho}{(1 - \rho)(1 - \rho'(T))}.$$

For the exponential prediction model, the expected sojourn time, which we refer to as $S_{e^*,p}$, is then

$$S_{e^*,p} = \frac{2\lambda\sqrt{T} K_1(2\sqrt{T}) + 1 - \lambda}{(1 - \lambda)\left(1 - \lambda\left(1 - 2T K_2(2\sqrt{T})\right)\right)}.$$

We again here have the relation

$$\lambda S_{e^*,p} = S_{e^*,n} - 1,$$

showing that preemption is always helpful in this setting.

For the Weibull model,

$$S_{w^*,p} = \frac{1 - \lambda + 3\lambda \sqrt{\frac{T}{2\pi}}\; G^{3,0}_{0,3}\!\left(\frac{T}{2} \,\Big|\, -\frac{1}{2}, 0, \frac{1}{2}\right)}{(1 - \lambda)\left(1 - \lambda\left(1 - \sqrt{\frac{2T}{\pi}}\; G^{3,0}_{0,3}\!\left(\frac{T}{2} \,\Big|\, -\frac{1}{2}, 1, \frac{3}{2}\right)\right)\right)}.$$

One-Bit Advice with Multiple Queues

In this section, we consider one-bit threshold schemes for settings with multiple queues. In particular, we consider the "power of $d$ choices" (also known as the "balanced allocations") setting, where we consider the number of queues growing to infinity, and each arrival chooses the best queue from a small constant-sized subset of randomly selected queues [2, 12, 18]. An advantage here of queues based on one bit of advice is that their state can easily be represented, allowing the type of mean-field analysis that is typical for such systems. We note that analysis of the power of $d$ choices with, for example, exact job sizes using queueing schemes such as Shortest Remaining Processing Time remains an intriguing open question (see, e.g., [14] for more discussion), although simply using FIFO queues and choosing the least loaded queue from a constant number of choices has been analyzed [7].

As our purpose here is primarily to demonstrate how schemes utilizing one bit can be analyzed in this framework, we choose a relatively simple example, based on the anecdotal 80-20 rule, that 20% of the jobs cause 80% of the work. We assume that there are two types of jobs: long jobs have exponentially distributed service times with mean $\mu_1$, and short jobs have exponentially distributed service times with mean $\mu_2 < \mu_1$. Long jobs arrive with rate $\lambda_1 n$ and short jobs arrive with rate $\lambda_2 n$, where $n$ is the number of queues in the system. While this model is general, it can encompass the 80-20 rule, where long jobs are less frequent but require much more work; for example, if $\lambda_1 \mu_1 = 4 \lambda_2 \mu_2$, then long jobs are relatively rare but contribute 80% of the work.

In the prediction setting, we assume long jobs are misclassified as short jobs independently with probability $q_1$, and short jobs are misclassified as long jobs independently with probability $q_2$. (We may view the case without predictions, where the one bit of advice is accurate, as corresponding to $q_1 = q_2 = 0$, with the resulting equations.)
A more useful interpretation for our analysis is that a job that is classified as a long job, or labeled long, is actually long with probability $p_L = \lambda_1(1 - q_1)/(\lambda_1(1 - q_1) + \lambda_2 q_2)$, and similarly a job that is classified as a short job, or labeled short, is actually short with probability $p_S = \lambda_2(1 - q_2)/(\lambda_2(1 - q_2) + \lambda_1 q_1)$. Similarly, the arrival rate for jobs that are labeled long is $\lambda_L = \lambda_1(1 - q_1) + \lambda_2 q_2$ and the arrival rate for jobs that are labeled short is $\lambda_S = \lambda_2(1 - q_2) + \lambda_1 q_1$.

For serving jobs, we give labeled short jobs priority, and serve them in FIFO fashion; similarly, labeled long jobs are served using FIFO. We suggest a simple, convenient method for choosing queues, although many variations are possible and can be studied similarly. We choose the "best" of $d$ queues chosen independently and uniformly at random for a constant $d$, where we determine the best as follows. First, an empty queue has highest priority; an empty queue will always be selected if it is one of the $d$ chosen. Otherwise, we ignore the label and the time already spent being served of the job being served. Jobs that are predicted to be short shall choose the queue with the fewest queued labeled short jobs, breaking ties in favor of the queue with the fewest labeled long jobs (and then randomly if two queues match), and similarly jobs that are predicted to be long shall choose the queue with the smallest number of queued labeled long jobs, breaking ties in favor of the queue with fewer short jobs (and then randomly if two queues match). Again, one could imagine more complex policies based on minimizing the expected time until service; such policies can be studied using the same framework.

We derive equations describing the system state in the mean field limit, where the number of queues goes to infinity. (This approach can be formalized using the theory of density dependent jump Markov chains, following the work of Kurtz; see [5, 10, 20] for examples.)
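The label statistics and the queue-selection rule described above can be sketched as follows. This is an illustrative reading of the rule, not the paper's code: the tuple layout `(queued_short, queued_long, busy)`, the string labels, and all function names are assumptions introduced here.

```python
import random


def label_statistics(lam_long, lam_short, q_long, q_short):
    """Arrival rates and accuracies by label.

    q_long:  probability a long job is mislabeled short (q_1 in the text).
    q_short: probability a short job is mislabeled long (q_2 in the text).
    Returns (lambda_L, lambda_S, p_L, p_S); assumes both label rates are positive.
    """
    lam_L = lam_long * (1.0 - q_long) + lam_short * q_short
    lam_S = lam_short * (1.0 - q_short) + lam_long * q_long
    p_L = lam_long * (1.0 - q_long) / lam_L
    p_S = lam_short * (1.0 - q_short) / lam_S
    return lam_L, lam_S, p_L, p_S


def choose_queue(candidates, label, rng=random):
    """Pick the best of the sampled queues for a job with the given label.

    candidates: list of (queued_short, queued_long, busy) tuples, one per
    sampled queue; busy is False only for an empty queue.
    label: "short" or "long" (the one-bit prediction).
    Returns the index of the chosen candidate.
    """
    def priority(index):
        queued_short, queued_long, busy = candidates[index]
        if label == "short":
            primary, secondary = queued_short, queued_long
        else:
            primary, secondary = queued_long, queued_short
        # Empty queues first, then fewest same-label jobs, then fewest
        # other-label jobs, then a random tie-break.
        return (1 if busy else 0, primary, secondary, rng.random())

    return min(range(len(candidates)), key=priority)


if __name__ == "__main__":
    print(label_statistics(0.1, 0.4, 0.2, 0.4))
    print(choose_queue([(0, 3, True), (1, 0, True)], "short"))
```

The running job is deliberately excluded from the comparison except for the empty/busy distinction, matching the statement that its label and elapsed service time are ignored.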
The state of a single queue can be represented by a triple $(s, \ell, c)$, where $s$ is the number of queued jobs that are labeled short, $\ell$ is the number of queued jobs that are labeled long, and $c$ is 1 if the current running job is long and 2 if it is short. The state $(0, 0, 0)$ is used for an empty queue; this is the only state where $c \neq 1, 2$. Let $x_{(s,\ell,c)}(t)$ represent the fraction of queues in state $(s, \ell, c)$ at time $t$; we drop the $t$ where the meaning is clear. We use $\hat{x}_{(s,\ell,c)}$ to refer to the equilibrium values for these quantities.

Note that our setting allows a relatively simple analysis by having service times be exponentially distributed and predictions depend only on the type of the job, instead of its running time. Because of this, to keep the state of a queue it suffices to keep the type of the running job, as this gives the distribution of the remaining time it is in service. This approach can be extended to more general service times and predictions; see, e.g., [1, 7] for the appropriate framework. At a high level, for such generalizations, the state of the queue must track how long the current running job has been in the system; the distribution function for remaining time in service, which is derived by taking the proper weighted average over types, then determines whether the service will complete over the next time interval $dt$.

To write the equations describing the limiting behavior of these systems, we use some additional notation. Let $z_{(2,s,\ell)}$ be the fraction of queues with lower priority than a queue with $s$ queued labeled short jobs and $\ell$ queued labeled long jobs when a job labeled short arrives (in terms of being chosen by our algorithm), and similarly define $z_{(1,s,\ell)}$ for when a job labeled long arrives. Here again we drop the implicit dependence on $t$.
These $z$ values can be readily computed by dynamic programming or even brute force given the $x_{(s,\ell,c)}$. For $c = 1, 2$ and $c' = 1, 2$, let

$$w_{(c',s,\ell,c)} = \left(\left(z_{(c',s,\ell)} + x_{(s,\ell,1)} + x_{(s,\ell,2)}\right)^d - \left(z_{(c',s,\ell)}\right)^d\right) \frac{x_{(s,\ell,c)}}{x_{(s,\ell,1)} + x_{(s,\ell,2)}}.$$

Then $w_{(c',s,\ell,c)}$ gives the probability that an incoming job labeled $c'$ chooses a queue in state $(s, \ell, c)$. For the empty queue, we have the special case

$$w_{(c',0,0,0)} = 1 - \left(1 - x_{(0,0,0)}\right)^d,$$

and it is convenient to let $w_{(c',s,\ell,c)} = 0$ if $s < 0$ or $\ell < 0$.

The limiting mean field equations when $s > 0$ are then

$$\frac{dx_{(s,\ell,1)}}{dt} = \lambda_S w_{(2,s-1,\ell,1)} + \lambda_L w_{(1,s,\ell-1,1)} + \mu_1 x_{(s+1,\ell,1)}(1 - p_S) + \mu_2 x_{(s+1,\ell,2)}(1 - p_S) - \left(\mu_1 x_{(s,\ell,1)} + \lambda_S w_{(2,s,\ell,1)} + \lambda_L w_{(1,s,\ell,1)}\right),$$

and

$$\frac{dx_{(s,\ell,2)}}{dt} = \lambda_S w_{(2,s-1,\ell,2)} + \lambda_L w_{(1,s,\ell-1,2)} + \mu_1 x_{(s+1,\ell,1)} p_S + \mu_2 x_{(s+1,\ell,2)} p_S - \left(\mu_2 x_{(s,\ell,2)} + \lambda_S w_{(2,s,\ell,2)} + \lambda_L w_{(1,s,\ell,2)}\right).$$
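The choice probabilities $w$ can be computed directly from a table of state fractions, using the brute-force approach to $z$ mentioned above. The sketch below is an illustration under stated assumptions: states are stored in a dictionary keyed by `(s, l, c)`, label 1 denotes a job labeled long and label 2 a job labeled short (mirroring the index convention for $c'$), and the function name is hypothetical.

```python
def choice_probabilities(x, label, d):
    """Probability that an arriving job with the given label joins each state.

    x: dict mapping (s, l, c) -> fraction of queues in that state, where
       (0, 0, 0) is the empty queue and c in {1, 2} marks a busy queue.
    label: 1 for a job labeled long, 2 for a job labeled short.
    d: number of queues sampled per arrival.
    """
    empty = x.get((0, 0, 0), 0.0)

    # Total fraction of busy queues sharing the same queued counts (s, l).
    mass = {}
    for (s, l, c), frac in x.items():
        if c != 0:
            mass[(s, l)] = mass.get((s, l), 0.0) + frac

    def rank(s, l):
        # Smaller rank is preferred: compare same-label count first.
        return (s, l) if label == 2 else (l, s)

    w = {(0, 0, 0): 1.0 - (1.0 - empty) ** d}
    for (s, l, c), frac in x.items():
        if c == 0:
            continue
        total = mass[(s, l)]
        if total == 0.0:
            w[(s, l, c)] = 0.0
            continue
        # Brute-force z: fraction of queues strictly less preferred than (s, l).
        z = sum(m for (s2, l2), m in mass.items() if rank(s2, l2) > rank(s, l))
        w[(s, l, c)] = ((z + total) ** d - z ** d) * frac / total
    return w


if __name__ == "__main__":
    fractions = {(0, 0, 0): 0.3, (0, 0, 1): 0.1, (0, 0, 2): 0.2,
                 (1, 0, 2): 0.15, (0, 1, 1): 0.15, (1, 1, 1): 0.1}
    for label in (1, 2):
        probs = choice_probabilities(fractions, label, d=2)
        print(label, sum(probs.values()))
```

Two checks follow from the formula: the probabilities over all states sum to 1 (the terms telescope once states are ordered by priority), and with $d = 1$ each $w$ equals the corresponding fraction $x$, since a single uniformly sampled queue is chosen regardless of its state.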
The cases where $s = 0$ and $\ell > 0$ are given by

$$\frac{dx_{(0,\ell,1)}}{dt} = \lambda_L w_{(1,0,\ell-1,1)} + \mu_1 x_{(1,\ell,1)}(1 - p_S) + \mu_2 x_{(1,\ell,2)}(1 - p_S) + \mu_1 x_{(0,\ell+1,1)} p_L + \mu_2 x_{(0,\ell+1,2)} p_L - \left(\mu_1 x_{(0,\ell,1)} + \lambda_S w_{(2,0,\ell,1)} + \lambda_L w_{(1,0,\ell,1)}\right),$$

and

$$\frac{dx_{(0,\ell,2)}}{dt} = \lambda_L w_{(1,0,\ell-1,2)} + \mu_1 x_{(1,\ell,1)} p_S + \mu_2 x_{(1,\ell,2)} p_S + \mu_1 x_{(0,\ell+1,1)}(1 - p_L) + \mu_2 x_{(0,\ell+1,2)}(1 - p_L) - \left(\mu_2 x_{(0,\ell,2)} + \lambda_S w_{(2,0,\ell,2)} + \lambda_L w_{(1,0,\ell,2)}\right).$$

And finally, for queues without any waiting jobs, we have

$$\frac{dx_{(0,0,1)}}{dt} = \lambda_L p_L w_{(1,0,0,0)} + \lambda_S (1 - p_S) w_{(2,0,0,0)} + \mu_1 x_{(0,1,1)} p_L + \mu_2 x_{(0,1,2)} p_L + \mu_1 x_{(1,0,1)}(1 - p_S) + \mu_2 x_{(1,0,2)}(1 - p_S) - \left(\mu_1 x_{(0,0,1)} + \lambda_S w_{(2,0,0,1)} + \lambda_L w_{(1,0,0,1)}\right);$$

$$\frac{dx_{(0,0,2)}}{dt} = \lambda_S p_S w_{(2,0,0,0)} + \lambda_L (1 - p_L) w_{(1,0,0,0)} + \mu_1 x_{(0,1,1)}(1 - p_L) + \mu_2 x_{(0,1,2)}(1 - p_L) + \mu_1 x_{(1,0,1)} p_S + \mu_2 x_{(1,0,2)} p_S - \left(\mu_2 x_{(0,0,2)} + \lambda_S w_{(2,0,0,2)} + \lambda_L w_{(1,0,0,2)}\right);$$

$$\frac{dx_{(0,0,0)}}{dt} = \mu_1 x_{(0,0,1)} + \mu_2 x_{(0,0,2)} - \left(\lambda_S w_{(2,0,0,0)} + \lambda_L w_{(1,0,0,0)}\right).$$

(Recall that $w_{(1,0,0,0)} = w_{(2,0,0,0)}$, since an empty queue is always selected when it is among the $d$ choices.)

In Section 5 below, we compare the results from solving the differential equations numerically with simulations.

In the simulations for single queues, each data point is obtained by simulating initially empty queues over 1000000 units of time, and taking the average response time for all jobs that terminate after time 100000. We then take the average over 100 simulations. Waiting for the first 10% allows the system to approach the stationary distribution, and we run for sufficient time that recording only completed jobs has small influence.

Before presenting results, we emphasize that we have checked the experimental results for single queues against the equations we have derived in sections 2 and 3. They match very closely; in general terms, nearly all the averaged simulation results presented are within 1% of the values derived from the equations. (Individual simulation runs can vary more significantly; the maximum and minimum times over our trials vary by 5-10% for exponentially distributed service times, and by 10-20% for Weibull service times.) As such, we do not present further results comparing equations to simulations here.

Our first simulations are for exponentially distributed service times. While we have done more simulations at various arrival rates, we present results for $\lambda = 0.8$, $0.9$, and $0.95$. This focuses on the more interesting case of reasonably high arrival rates, while keeping the values within a reasonable range for presentation. As a baseline, when the arrival rate is $\lambda$, the expected time a job spends in such a system in equilibrium with a FIFO queue is $1/(1 - \lambda)$.

We first show the results of the experiments, comparing the results with and without preemption, and with and without prediction. Figure 1 shows the results as the threshold varies given correct one-bit advice, and Figure 2 shows the results under the exponential prediction model. The figures are at the same scale so results can be compared. We see here that preemption, as expected, provides some gains, and the cost for using prediction is not too large.
In particular, one bit of even sometimes incorrect advice substantiallyreduces the average time in system over simple FIFO queueing. The results show that in this setting choosinga threshold near the optimal rather than the optimal does not substantially affect the results.We also do experiments for the Weibull distribution with cumulative distribution − e −√ x . As abaseline, when the arrival rate is λ , the expected time a job spends in such a system in equilibrium with aFIFO queue under this Weibull distribution is (1 + λ ) / (1 − λ ) .Similar to before, Figure 3 shows the results as the threshold varies given correct one-bit advice, andFigure 4 shows the results under the exponential prediction model. Note that the range of thresholds is muchlarger. One would expect larger thresholds would be optimal for a heavy-tailed distribution, as the downside While arguably we could have simply trusted the equations, we ﬁnd verifying results via simulation worthwhile. A v e r a g e t i m e i n s y s t e m T : Threshold λ = 0.80, PE λ = 0.80, Non-PE λ = 0.90, PE λ = 0.90, Non-PE λ = 0.95, PE λ = 0.95, Non-PE Figure 1: Performance under threshold schemeswith exact information, exponentially distributedservice times. A v e r a g e t i m e i n s y s t e m T : Threshold λ = 0.80, PE λ = 0.80, Non-PE λ = 0.90, PE λ = 0.90, Non-PE λ = 0.95, PE λ = 0.95, Non-PE Figure 2: Performance under threshold schemeswith predicted information, exponentially dis-tributed service times, exponential predictions. A v e r a g e t i m e i n s y s t e m T : Threshold λ = 0.80, PE λ = 0.80, Non-PE λ = 0.90, PE λ = 0.90, Non-PE λ = 0.95, PE λ = 0.95, Non-PE Figure 3: Performance under threshold schemeswith exact information, Weibull distributed ser-vice times. 
A v e r a g e t i m e i n s y s t e m T : Threshold λ = 0.80, PE λ = 0.80, Non-PE λ = 0.90, PE λ = 0.90, Non-PE λ = 0.95, PE λ = 0.95, Non-PE Figure 4: Performance under threshold schemeswith predicted information, Weibull distributedservice times, exponential predictions.IFO THRESHOLD THRESHOLD SRPT PREDICTION PREDICTION SPRPT λ NO PREEMPT PREEMPT NO PREEMPT PREEMPT0.50 2.000 1.783 1.564 1.425 1.850 1.698 1.6590.60 2.500 2.089 1.814 1.604 2.209 2.013 1.9400.70 3.333 2.542 2.203 1.875 2.761 2.517 2.3690.80 5.000 3.329 2.910 2.355 3.757 3.451 3.1430.90 10.00 5.278 4.755 3.552 6.366 5.960 5.0970.95 20.00 8.535 7.914 5.532 10.848 10.372 8.4240.98 50.00 16.495 15.735 10.436 22.418 21.909 16.696Table 1: Results for exponentially distributed service times. Prediction results are using exponential predic-tions. FIFO THRESHOLD THRESHOLD SRPT PREDICTION PREDICTION SPRPTNO PREEMPT PREEMPT NO PREEMPT PREEMPT0.50 4.000 3.012 1.608 1.411 3.155 1.736 1.9400.60 5.500 3.676 1.867 1.574 3.918 2.062 2.2800.70 8.000 4.565 2.258 1.813 4.983 2.568 2.7500.80 13.00 5.955 2.951 2.217 6.721 3.481 3.5190.90 29.00 8.940 4.649 3.154 10.630 5.790 5.2240.95 58.00 13.223 7.448 4.517 16.546 9.846 7.7880.98 148.0 22.451 15.194 7.666 29.346 20.918 13.404Table 2: Results for Weibull distributed service times. Prediction results are using exponential predictions.of having a very large job at the front of the queue is more substantial. Further, in this setting, preemp-tion offers more substantial gains, and clearly pushes the optimal threshold to larger values, as preemptionsigniﬁcantly reduces the impact of a long job holding the queue. 
In the Weibull setting, while the cost of using predicted advice rather than optimal advice is larger, the potential (and actual) gains from using predicted advice are even more substantial.

To shed additional light, we compare one-bit advice with two extremes: no advice, in which case the queue uses FIFO, and full knowledge of the processing times, in which case the queue uses Shortest Remaining Processing Time (SRPT). In the case of predictions, we consider the exponential prediction model, comparing our schemes against FIFO and Shortest Predicted Remaining Processing Time (SPRPT) [13], which uses SRPT scheduling on the predicted times. For the FIFO results we simply use the standard formula for expected time in the system; we could similarly use formulae for the other results, but present results from simulations. In particular, for our one-bit schemes, we choose the best threshold from the simulation results, and allow preemptions.

As we can see in Tables 1 and 2, one-bit schemes greatly improve on FIFO across the board, with a greater improvement for Weibull distributions, as one would expect. For example, with exponential service times at λ = 0.90, the preemptive threshold scheme reduces the average time in system from 10.00 under FIFO to 4.755, compared with 3.552 for SRPT. Indeed, one-bit threshold schemes achieve a large fraction of the benefit that would arise from full knowledge of processing times, and one-bit predictions achieve a large fraction of the benefit that would arise from more detailed predictions. We also find that preemption is helpful, and more so for the heavier-tailed distribution, as one would expect. Perhaps most important, using predictions provides large gains, nearly as good as with exact information, showing the large potential for even simple predictions to provide large value in scheduling.

| Scheme | Simulations | Diff. Eqns. |
|---|---|---|
| 1 Choice | 24.208 | − |
| SRPT | 2.366 | − |
| Shorter Queue, FIFO | 4.967 | − |
| Pred 0.0, 0.0 | 3.394 | 3.392 |
| Pred 0.1, 0.1 | 3.690 | 3.688 |
| Pred 0.2, 0.2 | 4.010 | 4.007 |
| Pred 0.3, 0.3 | 4.353 | 4.347 |
| Pred 0.4, 0.4 | 4.717 | 4.711 |
| Pred 0.5, 0.5 | 5.105 | 5.098 |
| Pred 0.2, 0.4 | 4.280 | 4.276 |
| Pred 0.4, 0.2 | 4.402 | 4.395 |
| Pred 0.11, 0.61 | 4.617 | 4.611 |

Table 3: Results for queueing systems with 1000 queues, 2 choices and baseline comparisons.
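The FIFO baselines for the single-queue experiments follow from the Pollaczek-Khinchine formula for the M/G/1 queue, E[T] = E[S] + λE[S²]/(2(1 − ρ)). The sketch below evaluates it for both service distributions. It assumes mean-1 service times, and for the Weibull case it assumes shape 1/2 and scale 1/2 (cumulative distribution 1 − e^{−√(2x)}), a parameterization adopted here because it reproduces, for example, the tabulated FIFO entries at λ = 0.50 and λ = 0.98. The function names are illustrative.

```python
import math


def fifo_mean_time(lam: float, mean_s: float, second_moment_s: float) -> float:
    """Pollaczek-Khinchine mean time in system for an M/G/1 FIFO queue."""
    rho = lam * mean_s
    if not 0.0 <= rho < 1.0:
        raise ValueError("load must satisfy 0 <= rho < 1")
    return mean_s + lam * second_moment_s / (2.0 * (1.0 - rho))


def weibull_moment(n: int, shape: float = 0.5, scale: float = 0.5) -> float:
    """n-th moment of a Weibull distribution: scale^n * Gamma(1 + n / shape)."""
    return scale ** n * math.gamma(1.0 + n / shape)


def fifo_exponential(lam: float) -> float:
    # Exponential with mean 1: E[S] = 1, E[S^2] = 2.
    return fifo_mean_time(lam, 1.0, 2.0)


def fifo_weibull(lam: float) -> float:
    # Assumed shape 1/2 and scale 1/2: E[S] = 1, E[S^2] = 6.
    return fifo_mean_time(lam, weibull_moment(1), weibull_moment(2))


if __name__ == "__main__":
    for lam in (0.5, 0.8, 0.98):
        print(lam, round(fifo_exponential(lam), 3), round(fifo_weibull(lam), 3))
```

With these moments the formula simplifies to 1/(1 − λ) for the exponential case and (1 + 2λ)/(1 − λ) for the Weibull case.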

We present results here for an example using the power of two choices, to demonstrate that the results from the differential equations match simulations, and to show the effectiveness of working with predictions. In our example we follow the 80-20 rule; we choose parameters λ = 0 . , µ = 3 . , λ = 0 . , and µ = 0 . . The overall load on the system is therefore . . For simulations, each data point is obtained by simulating systems of 1000 (initially empty) queues over 100000 units of time, and taking the average response time for all jobs that terminate after time 10000. We then take the average over 100 simulations. For the differential equations, we simply used Euler’s method over time steps of − over time . (This provides an accurate calculation for the “fixed point”, or stationary distribution, corresponding to the solution of these equations.) All experiments use two choices. For comparison, we also provide simulation results for three baselines: randomly choosing a single queue and using FIFO; choosing a queue based on the least loaded and using SRPT within the queue; and choosing the shorter of two queues with FIFO processing. We provide results for various predictions, where the two values after “Pred” in the table are the misprediction probabilities q for long jobs and for short jobs, respectively.

The main takeaways from Table 3, beyond the fact that the differential equations are quite accurate (less than 0.2% difference in these examples), are that one-bit predictions can provide benefits over the already excellent performance of choosing the shorter queue; even when all predictions are only 60% accurate, we see some gain in performance. Comparing the row with misprediction rates 0.2 for long jobs and 0.4 for short jobs (4.280) against the row with rates 0.4 and 0.2 (4.402) shows that predictions for long jobs are more important, even though there are far fewer long jobs. Giving a long job higher priority, where it can block short jobs, has a more prominent effect than misclassifying a short job.
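To illustrate the kind of fixed-point calculation described above, the following sketch applies Euler's method to the classic supermarket model with two choices, exponential service times with mean 1, and FIFO service, for which the fixed point is known in closed form [12, 18]. It does not implement the equations for one-bit predictions, which are not reproduced in this section. The step size, time horizon, truncation length, and function names are all assumptions made for illustration.

```python
def supermarket_fixed_point(lam: float, d: int = 2, max_len: int = 30,
                            dt: float = 0.01, t_end: float = 100.0) -> list:
    """Euler iteration for ds_i/dt = lam*(s_{i-1}^d - s_i^d) - (s_i - s_{i+1}),
    where s_i is the fraction of queues holding at least i jobs."""
    # s[0] = 1 always; start from empty queues; s[max_len + 1] is held at 0
    # as a truncation boundary (an assumption of this sketch).
    s = [1.0] + [0.0] * (max_len + 1)
    steps = int(round(t_end / dt))
    for _ in range(steps):
        nxt = s[:]
        for i in range(1, max_len + 1):
            arrivals = lam * (s[i - 1] ** d - s[i] ** d)
            departures = s[i] - s[i + 1]
            nxt[i] = s[i] + dt * (arrivals - departures)
        s = nxt
    return s


def mean_time_in_system(s: list, lam: float) -> float:
    """Little's law: mean jobs per queue (sum of tail fractions) over lam."""
    return sum(s[1:]) / lam


if __name__ == "__main__":
    lam = 0.5
    s = supermarket_fixed_point(lam)
    for i in range(1, 5):
        print(i, s[i], lam ** (2 ** i - 1))
    print("mean time in system:", mean_time_in_system(s, lam))
```

For λ = 0.5 the iteration settles at s_i = λ^(2^i − 1), and Little's law then gives a mean time in system of about 1.266.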
The greater importance of predicting long jobs correctly demonstrates that the goal of a machine learning algorithm in this setting should not be simply to maximize the number of correct predictions; a machine learning algorithm can do better by predicting the long jobs well. (See [14] for a similar discussion.)

As an extreme example of this, choosing misprediction rates of 0.11 for long jobs and 0.61 for short jobs leads to a total error rate of 51% over all jobs (with 80% of arrivals being short jobs, 0.8 × 0.61 + 0.2 × 0.11 = 0.51), as short jobs have a much higher arrival rate than long jobs. But even though more than half of the jobs have their types predicted incorrectly, because long jobs are predicted correctly most of the time, such predictions still perform notably better than not using predictions and just choosing the shorter queue (4.617 versus 4.967 in Table 3).

Conclusions

We have looked at the setting of queueing systems with one bit of advice, where a primary motivation is the potential for machine learning algorithms to provide simple but useful predictions to improve scheduling. In the case of single queues, we see that a natural probabilistic model for predictions leads to relatively straightforward equations that can be used to determine where one would ideally choose a threshold to separate long and short jobs. For large-scale queueing systems, where the power of two choices can be used, we have shown that one-bit predictions allow for fluid limit analysis. We view this as a potential step forward for the interesting open problem of determining the behavior of systems using the power of two choices with scheduling via shortest remaining processing time or other scheduling schemes based on the service time.

We believe there remain several interesting directions to explore in this space. The use of predictions in more complex settings, such as call centers, may provide significant value. A challenging underlying question, when “jobs” may correspond to people, is how to define appropriate notions of fairness, so that jobs that are mispredicted by a machine learning algorithm do not suffer overly from the algorithm’s behavior.

References

[1] Reza Aghajani, Xingjie Li, and Kavita Ramanan. Mean-field dynamics of load-balancing networks with general service distributions. arXiv preprint arXiv:1512.05056, 2015.
[2] Yossi Azar, Andrei Z. Broder, Anna R. Karlin, and Eli Upfal. Balanced allocations. SIAM J. Comput., 29(1):180–200, 1999.
[3] Joan Boyar, Lene M Favrholdt, Christian Kudahl, Kim S Larsen, and Jesper W Mikkelsen. Online algorithms with advice: a survey. ACM SIGACT News, 47(3):93–129, 2016.
[4] Matteo Dell’Amico, Damiano Carra, and Pietro Michiardi. PSBS: Practical size-based scheduling. IEEE Transactions on Computers, 65(7):2199–2212, 2015.
[5] S. N. Ethier and T. G. Kurtz. Markov Processes: Characterization and Convergence. John Wiley and Sons, 1986.
[6] Mor Harchol-Balter. Performance modeling and design of computer systems: queueing theory in action. Cambridge University Press, 2013.
[7] Tim Hellemans and Benny Van Houdt. On the power-of-d-choices with least loaded server selection. POMACS, 2(2):27:1–27:22, 2018.
[8] Chen-Yu Hsu, Piotr Indyk, Dina Katabi, and Ali Vakilian. Learning-Based Frequency Estimation Algorithms. In , 2019.
[9] Tim Kraska, Alex Beutel, Ed H Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pages 489–504, 2018.
[10] T. G. Kurtz. Solutions of Ordinary Differential Equations as Limits of Pure Jump Markov Processes. Journal of Applied Probability, 7:49–58, 1970.
[11] Thodoris Lykouris and Sergei Vassilvitskii. Competitive caching with machine learned advice. In Proceedings of the 35th International Conference on Machine Learning, pages 3302–3311, 2018.
[12] Michael Mitzenmacher. The power of two choices in randomized load balancing. IEEE Trans. Parallel Distrib. Syst., 12(10):1094–1104, 2001.
[13] Michael Mitzenmacher. Scheduling with predictions and the price of misprediction. In 11th Innovations in Theoretical Computer Science Conference (ITCS 2020), pages 14:1–14:18, 2020.
[14] Michael Mitzenmacher. The Supermarket Model with Known and Predicted Service Times. arXiv preprint arXiv:1905.12155, 2019.
[15] Michael Mitzenmacher and Sergei Vassilvitskii. Algorithms with Predictions. In Beyond the Worst-Case Analysis of Algorithms, edited by Tim Roughgarden, Cambridge University Press, 2020.
[16] T.M. O’Donovan. Direct solutions of M/G/1 priority queueing models. Revue française d’automatique, informatique, recherche opérationnelle, 10(V1):107–111, 1976.
[17] Manish Purohit, Zoya Svitkina, and Ravi Kumar. Improving online algorithms via ML predictions. In Advances in Neural Information Processing Systems, pages 9684–9693, 2018.
[18] Nikita Dmitrievna Vvedenskaya, Roland L’vovich Dobrushin, and Fridrikh Izrailevich Karpelevich. Queueing system with selection of the shortest of two queues: An asymptotic approach. Problemy Peredachi Informatsii, 32(1):20–34, 1996.
[19] Adam Wierman and Misja Nuyens. Scheduling despite inexact job-size information. Performance Evaluation Review, 36(1):25–36, 2008.
[20] N.C. Wormald. Differential equations for random processes and random graphs.