Efficiency-Loss of Greedy Schedules in Non-Preemptive Processing of Jobs with Decaying Value
Carri W. Chan
Electrical Engineering Department
Stanford University
Stanford, CA 94305
[email protected]

Nicholas Bambos
Electrical Engineering Department and Management Science & Engineering Department
Stanford University
Stanford, CA 94305
[email protected]
Abstract
We consider the problem of dynamically scheduling J jobs on N processors for non-preemptive execution, where the value of each job (the reward garnered upon its completion) decays over time. All jobs are initially available in a buffer and the distributions of their service times are known. When a processor becomes available, one must decide which free job to schedule so as to maximize the total expected reward accrued for the completion of all jobs. Such problems arise in diverse application areas, e.g. scheduling of patients for medical procedures, supply chains of perishable goods, packet scheduling for delay-sensitive communication network traffic, etc. Computation of optimal schedules is generally intractable, while online low-complexity schedules are often essential in practice. It is shown that the simple greedy/myopic schedule provably achieves performance within a factor 2 + E[max_j σ_j]/min_j E[σ_j] of optimal. This bound can be further improved when the service times are identically distributed. Various aspects of the greedy schedule are examined, and it is demonstrated to perform quite close to optimal in some practical situations, despite the fact that it ignores reward decay deeper in time.

1 Introduction

Consider a queueing/scheduling system (as in Fig. 1), where a finite number J of jobs wait in a buffer, each to be processed by one of N servers/processors. Time is slotted. The service/processing requirement σ_j of each job j is random and its distribution f_j(σ_j) is known. All processors operate at service rate 1; hence, the service time of each job is invariant to the processor to which it is assigned. Service is non-preemptive: job service cannot be interrupted mid-processing to be resumed later or discontinued. The completion of job j in time slot t garners a reward w_j(t) ≥ 0, which decays with time (i.e. w_j(t) is non-increasing in t).
The goal is to schedule the jobs on the processors so as to maximize the aggregate reward accrued when all jobs complete execution.

As will become clear below, a key complicating factor is that job service is non-preemptive, inducing a 'combinatorial twist' on the problem. Under preemptive processing, the latter would wash away and the problem would become much simpler. Another complicating factor is that the rewards/values w_j(t) decay over time in a general way; special cases might be significantly easier to handle (though still not necessarily easy). A third complicating factor is the general distributions of the stochastic job processing times σ_j (even though these are independent across different jobs); for special distributions the problem can become significantly simpler (and the results tighter). We aim to address the problem in the most general setting arising in a variety of applications (see below), which may actually require online (real-time) schedule implementation. In that case, since the complexity of computing the optimal job schedule is prohibitive, one seeks simple and practical schedules (implementable online) whose performance is within provable bounds from optimal. In this paper, we focus on a greedy/myopic schedule defined below and study its efficiency. We discuss these factors below in conjunction with prior work and a variety of applications.

Figure 1: System Diagram: J jobs wait to be processed on one of N machines. The processing time of each job is independent of the processor and other jobs.

There are diverse applications where job completion rewards decay over time. For example, such is the case with patient scheduling in health-care systems.
Delays in treatment often lead to deterioration of patient health (see, for instance, [1]), which may result in reduction of the eventual treatment impact; this is obviously the case with various medical procedures, operations, etc. Indeed, a number of studies have demonstrated that delayed treatment results in increased patient mortality [2-6]. Moreover, in a related study [7], over 60% of physicians reported dissatisfaction with delays in viewing test results, which subsequently led to delays in treatment. It is likely that increased mortality is primarily induced via deterioration of patient health condition and the resulting reduction of benefit from eventual treatment. This is how the effect of treatment delay is modeled in this paper.

On the other hand, in information technology, reward decay occurs in various situations; for example, in multimedia packet scheduling for transmission over wireless links. Each packet corresponds to a job which is completed once the packet is successfully received at the receiver; until then, it is repeatedly transmitted (non-preemptive processing). Transmission time until successful reception is random, due both to random packet sizes and to randomly varying wireless channel quality. In the simplest case, video packets have a single deadline and reward is only received if the packet arrives prior to its deadline expiration. In more advanced schemes, multiple deadlines are considered (a decreasing, piecewise-constant reward decay function), reflecting coding interdependencies across packets. Indeed, even if a packet misses its initial deadline, it could still improve the quality of the received and reconstructed video, because other packets which depend on it may still be able to meet their deadlines [8].

As with multimedia packet scheduling above and similar situations of task scheduling in parallel computing systems, we can consider jobs that contain interdependencies within our model. The completion of a single job j garners reward r_j.
However, other jobs may rely on that one too, either because they cannot begin processing until it is completed (due to data-passing, precedence constraints, etc.) or because their processing accuracy/quality depends on output from that job (e.g. decoding dependencies). Therefore, the 'effective' reward generated is actually w_j(t) = r_j − f(t), where the increasing function f(t) reflects the detrimental effect that completing job j after delay t has on the jobs depending on it. In fact, our formulation allows for the case where even r_j is a decaying function of time.

A third application area where job completion rewards may decay over time is the case of perishable items, like food, medicine, etc. For example, the quality of food items (milk, eggs, etc.) decays with time. The scheduling problem is when to release these items for sale given varying transportation times (from storage to shelf) and the decaying reward R(t). It is also possible to have a cost s for each time slot the item remains in storage, so that the effective reward of an item once it is released for sale is C(t) = R(t) − st.

When rewards do not decay over time but stay constant, job scheduling problems may be cast in the framework of 'multiarmed bandit' problems [9, 10]. Furthermore, optimal policies for certain 'well-behaved' decaying reward functions (such as linear and exponential) have been developed (see [9, 10] and related works). Unfortunately, under general decaying rewards, solving for the optimal schedule becomes very difficult.

There has been related work on delay-sensitive scheduling in networking. In the case of broadcast scheduling in computer networks, jobs correspond to requests for pages (files). Due to the broadcast nature of a wireless channel, multiple requests can be satisfied with the transmission of a single page. In [11], a greedy algorithm is shown to be a constant-factor approximation for throughput maximization of broadcast scheduling in the case of equal-sized pages.
In a similar scenario, an online preemptive algorithm is shown to be Ω(√n)-competitive, where n is the number of pages that can be requested [12]. Our work differs from this prior work in that 1) we allow for arbitrary decaying rewards, rather than restricting to step functions that drop when the deadline expires, and 2) jobs are non-preemptive and have varied lengths (and all jobs are available at time 0).

A substantial body of work has focused on scheduling for perishable products (see [13] for a review). The focus is on finding an optimal ordering policy given the lifetime and demand of the perishable items. In [14], the authors study how to maximize the utility garnered by delivering perishable goods, such as ready-mixed concrete, and minimize costs subject to stochasticity in transportation times. The authors formulate a mathematical program to solve the problem and propose heuristic algorithms for use in practice. Interestingly, the perishable items in this case have a fixed lifetime, after which they are rendered useless (a deadline). Our formulation here allows for general decay.

In [15], the authors look at how to schedule an M/M/1 queue where rewards decay exponentially, dependent on each job's sojourn time due to the 'impatient' nature of the users. A greedy policy is shown to be optimal in the case of identical decay rates of these impatient users. Our scheduling problem is closely related to a number of instances of the Multiarmed Bandit Problem. When rewards exhibit 'well-behaved' decay (identical rates, constant rates, etc.), it is possible to find optimal, or near-optimal, algorithms [9, 10, 16, 17]. This is not always the case for arbitrary decay.

In a problem similar to the one we study in this paper, a greedy algorithm is shown to be a constant-factor approximation when job completions generate rewards according to general decaying reward functions [18]. The main distinction between that work and ours is that the previous work allows for job preemption, while we consider the case where, once a job is scheduled, it occupies the machine until it completes. This constraint adds an extra layer of complexity.

Indeed, non-preemption makes the scheduling problem we study substantially more difficult. Non-preemptive interval scheduling is studied in [12, 19], among others. Jobs can either be scheduled during their specified interval or rejected. The end of the interval corresponds to the deadline of the corresponding job. If ∆ is the ratio of the largest job size to the smallest job size, then an online algorithm cannot be better than O(log ∆)-competitive. Our work differs from this prior work because we consider arbitrary decay of rewards and assume all jobs are available at time 0. The decaying reward functions make this a more general and difficult scheduling problem. However, our result also relies on ∆, the ratio between the largest and smallest jobs.

Still, there are instances where optimal schedules can be found for arbitrary decaying rewards. In a parallel scenario to ours, jobs can be scheduled, non-preemptively, multiple times. For this problem, the reward function for completing a particular job decays with the number of times that job has been completed. In this case, a greedy policy is optimal for arbitrary decaying rewards [10]. This problem is parallel to ours in that it allows for arbitrary decaying rewards. However, the decay does not depend on the completion time of the job, but rather on the number of times that job has been completed. In our case, each job is only processed a single time.

Relating back to our scenario where the rewards decay with time, it is again the case that for 'well-behaved' decaying functions (linear and exponential), policies based on an index rule are optimal [9, 10]. The policy we propose in this paper is also an index rule.
In fact, the proposed policy is very closely related to the 'c-µ'-type scheduling rules (see, for instance, [10, 20]), where the objective is to minimize cost (rather than maximize reward) when costs are linearly or concavely increasing. One of the main distinctions between our work and that line is that we consider multiple servers. Unfortunately, the optimality of the 'c-µ' rule does not extend to this case. Furthermore, linear/concave decaying rewards are just single instances of our more general formulation of decaying rewards. It is also important to recognize that many of the results of this prior work hold in heavy-traffic regimes, where much of the fine-grained optimization required outside heavy traffic is washed out.

In this paper, we study the efficacy of a greedy scheduling algorithm for non-preemptive jobs whose rewards decay arbitrarily with time. A number of applications exhibit such behavior, such as patient scheduling in hospitals, packet scheduling in multimedia communication systems, and supply chain management for perishable goods. It is shown that finding an optimal scheduling policy for such systems is NP-hard. As such, finding simple heuristics is highly desirable. We show that a greedy algorithm is guaranteed to be within a factor of ∆ + 2 of optimal, where ∆ is the ratio of the largest job completion time to the smallest. This bound is improved in some special cases. Via numerical studies, we see that, in practice, the greedy policy is likely to perform much closer to optimal, which suggests it is a reasonable heuristic for practical deployment. To the best of our knowledge, this is the first look at non-preemptive scheduling of jobs with arbitrary decaying rewards.

The rest of the paper is structured as follows. In Section 2 we formally introduce the scheduling model we will study. In Section 3 we propose and study the performance of a greedy scheduling policy.
The main result, a bound on the loss of efficiency due to greedy scheduling, is given in Section 3.2. In Section 4, we examine some special cases where this bound can be improved. In Section 5, we evaluate the performance of the greedy policy via a simulation study. Finally, we conclude in Section 6.

2 Scheduling Model

Consider a set of J jobs, indexed by j ∈ J = {1, 2, . . . , J}, and N processors/servers, indexed by n ∈ N = {1, 2, . . . , N}. Each job j ∈ J has a random processing requirement σ_j and can be processed by any processor n ∈ N. All processors have service rate 1 and each one can process a single job at a time. Service is non-preemptive in the sense that once a processor starts executing a job it cannot stop until completion. Time is slotted and indexed by t ∈ {0, 1, 2, ...}. We denote the distribution of the service times by f_j(σ_j).

Assumption 1
The random job processing times σ_j, j ∈ J, are 1) statistically independent with P(σ_j < ∞) = 1, and 2) their distributions f_j(σ_j) do not depend on time. However, the job processing times are not necessarily identically distributed.

Let b_j(t) be the residual service time of job j at time t. Initially, b_j(0) = σ_j for each j ∈ J. The backlog state of the system at time t is the vector

b(t) = (b_1(t), b_2(t), ..., b_j(t), ..., b_J(t)).  (1)

It evolves from the initial state b(0) = (σ_1, σ_2, ..., σ_j, ..., σ_J) to the final state b(T) = (0, 0, ..., 0, ..., 0) by assigning processors to process the jobs non-preemptively, until all jobs have finished execution at some (random) time T. Note that for each job j ∈ J, b_j(t) = σ_j implies that j has not started processing by t (has not been scheduled before t), while b_j(t) = 0 implies that the job finished execution before (or at) time t. Indeed, if job j starts execution at time slot t_j and finishes at the beginning of time slot T_j, then σ_j = T_j − t_j and

b_j(t) = σ_j, for t < t_j;  σ_j − [t − t_j], for t = t_j, t_j + 1, ..., T_j − 1;  0, for t ≥ T_j.  (2)

As discussed later, the start times t_j are chosen by the scheduling policy, while the end times T_j are then determined by the fact that scheduling is non-preemptive, so that T_j = t_j + σ_j.

The job service times are random and their true values are not observable ex ante or known a priori; they can only be seen ex post, after a job has completed processing. However, the values x_j(t), tracking which jobs are completed at each time t,

x_j(t) = 1, if job j has not completed processing by time slot t;  0, if job j has completed by time slot t,  (3)

are directly observable for each job j ∈ J. We work below with the observable 'backlog state'

x(t) = (x_1(t), x_2(t), ..., x_j(t), ..., x_J(t))  (4)

in {0, 1}^J, which tracks which jobs are completed and which are still waiting to complete processing at time t.
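The bookkeeping in (2) and (3) can be sketched in a few lines (a minimal illustration; the function and variable names are ours, not the paper's):

```python
def residual(t, t_start, sigma):
    """b_j(t) from (2): equals sigma before the job starts at t_start, drains by
    one per slot during (non-preemptive) service, and is 0 from T_j = t_start + sigma on."""
    T = t_start + sigma
    if t < t_start:
        return sigma
    if t < T:
        return sigma - (t - t_start)
    return 0

def x_state(t, t_start, sigma):
    """x_j(t) from (3): 1 while job j has not completed, 0 once it has."""
    return 1 if residual(t, t_start, sigma) > 0 else 0
```

For instance, a job with σ_j = 3 that starts in slot t_j = 2 has b_j(4) = 1 and is complete (x_j = 0) from t = 5 onward.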
To fully specify the state of a job, we define y_j(t) as the time slot t' < t in which job j began processing. Specifically,

y_j(t) = t', if job j began processing in time slot t' < t;  ∅, if job j has not begun processing prior to time slot t (necessarily, x_j(t) = 1).  (5)

Hence, any job with y_j(t) = ∅ (where ∅ is some null symbol) has not yet begun processing and is free to be scheduled. If x_j(t) = 1 and y_j(t) ≠ ∅, then job j has not completed and is still being processed, due to the non-preemptive nature of the service discipline. Once a job is scheduled in time slot t_j, then y_j(t) = t_j for all t > t_j. The service state is then

y(t) = (y_1(t), y_2(t), ..., y_j(t), ..., y_J(t))  (6)

in {{0, 1, ..., t − 1} ∪ ∅}^J, and it tracks when (and if) each job began processing. In time slot t, one can calculate the distribution of the remaining service time b_j(t) given the distribution of σ_j, based on when (if) the job started processing and whether it has completed. Only the distribution of b_j(t) is known, as the job service time is only observable once the job completes processing. Therefore, x_j and y_j can be jointly leveraged to compute the distribution of the residual service time of job j.

We next define the state z_n(t) of processor n ∈ N, which tracks which job it is assigned to process in time slot t. Specifically,

z_n(t) = j, if processor n is still executing job j ∈ J at the beginning of time slot t;  0, if processor n is free at the beginning of time slot t, hence available for allocation,  (7)

and the processor state is

z(t) = (z_1(t), z_2(t), ..., z_n(t), ..., z_N(t))  (8)

in {0, 1, ..., j, ..., J}^N; it tracks the free vs. allocated processors at the beginning of time slot t.

At the beginning of each time slot t, each job j with y_j(t) = ∅ (not yet started) can be scheduled on (matched with) a processor n with z_n(t) = 0 (free) to start execution.
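For concreteness, the free jobs and processors of (5)-(8), and the resulting feasible matchings, can be enumerated as follows (a brute-force sketch with our own names; jobs and processors are keyed by their indices, and None stands for the null symbol ∅):

```python
def free_jobs(x, y):
    """Jobs j with x_j = 1 (not complete) and y_j = None (not yet started)."""
    return [j for j in x if x[j] == 1 and y[j] is None]

def free_processors(z):
    """Processors n with z_n = 0, i.e. free for allocation."""
    return [n for n in z if z[n] == 0]

def feasible_matchings(jobs, procs):
    """All feasible sets A of (job, processor) pairs: each free job and each
    free processor appears in at most one pair."""
    if not jobs or not procs:
        return [[]]
    j, rest = jobs[0], jobs[1:]
    out = list(feasible_matchings(rest, procs))            # leave job j unmatched
    for n in procs:
        others = [p for p in procs if p != n]
        out += [[(j, n)] + m for m in feasible_matchings(rest, others)]
    return out
```

With x = {1: 1, 2: 1, 3: 0}, y = {1: None, 2: 0, 3: 0} and z = {1: 0, 2: 2}, only job 1 is free and only processor 1 is idle, so the feasible matchings are the empty one and {(1, 1)}.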
The observable state of the system at the beginning of time slot t is

s_t = (x(t), y(t), z(t)).  (9)

Recall that from x(t) and y(t) we can determine the distribution of the remaining service time b(t). So the global state (9) yields the distribution of the remaining backlog and also tracks the processor state. The state space S is the set of all states the system may attain throughout its evolution. We denote by x(s) the projection of the state onto the x-coordinate, and we apply similar notation for y(s) and z(s).

Given the free jobs and processors at state s_t, we denote by A(s_t) the set of job-processor matchings (schedules) that can be selected, i.e. that are feasible, at the beginning of time slot t. These matchings are in addition to those already in place for jobs which are in mid-processing, due to the non-preemptive nature of execution. Note that at each time t, for any feasible job-processor matching A ∈ A(s_t), we have that (j, n) ∈ A implies x_j(t) = 1, y_j(t) = ∅ and z_n(t) = 0, meaning processor n is free and job j has not started processing. Also, only one free job can be matched to each free processor and vice-versa (hence, (j, n), (k, m) ∈ A with (j, n) ≠ (k, m) implies j ≠ k and n ≠ m). Despite the fact that A(s_t) is clearly a function of s_t, we may occasionally suppress s_t for notational simplicity.

The completion of job j by the end of time slot t garners non-negative reward w_j(t). We assume the reward decays over time, as follows.

Assumption 2
For each job j ∈ J, the reward function w_j(t) ≥ 0 decays over time; that is, it is non-increasing in t (it may be piecewise constant). This immediately accounts for raw deadlines by setting w_j(t) = 1{t ≤ d_j}, where d_j is the deadline of job j.

Recall that if job j is scheduled on processor n at the beginning of time slot t, it will finish by the beginning of time slot t + σ_j. Therefore, the reward 'locked' at the beginning of time slot t, given that a job-processor matching A ∈ A(s_t) is chosen for this slot, is simply

R_t(s, A) = Σ_{(j,n) ∈ A} w_j(t + σ_j).  (10)

It is desirable to design a control (scheduling, matching) policy choosing at each t a job-processor matching in A(s_t) so as to maximize the total expected reward accrued until all jobs have been executed. Since at time t the realization of σ_j is unknown for each job j that has not completed by t, any control policy is a priori unaware of the exact reward accrued from a particular action at t. Only the statistics of this reward are known. Specifically, let π be a scheduling policy which chooses a job-processor matching π_t(s_t) ∈ A(s_t) at t, and let Π be the set of all such policies. Define the expected total reward-to-go under a policy π, starting at state s ∈ S in time slot t, as

J^π_t(s) = E[ Σ_{t'=t}^{T} R_{t'}(s_{t'}, π_{t'}(s_{t'})) | s_t = s ],  (11)

where T is the (random) time at which all jobs have completed execution; T may depend on the policy π used. Note that if we wanted to consider a finite, deterministic horizon T̃, we could appropriately generate a schedule based on the modified, truncated reward functions w̃_j(t), such that w̃_j(t) = w_j(t) for all t ≤ T̃, and w̃_j(t) = 0 otherwise. The expectation is taken over the random service times σ_j of the jobs.
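As a sanity check on the definitions, the locked reward (10) and the reward-to-go (11) of a fixed policy can be estimated by simulation on a toy single-machine instance (a sketch; the names and the Monte Carlo set-up are ours):

```python
import random

def locked_reward(t, matching, sigma, w):
    """R_t(s, A) from (10), for one realization of the service times sigma."""
    return sum(w[j](t + sigma[j]) for j, _ in matching)

def reward_to_go(order, sample_sigma, w, trials=20000, seed=0):
    """Estimate J^pi_0 from (11) for the single-machine policy pi that starts
    the jobs in the fixed sequence `order` (averaging over the sigma_j)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sigma = {j: sample_sigma(j, rng) for j in order}
        t = 0
        for j in order:                 # non-preemptive: the next job must wait
            total += locked_reward(t, [(j, 0)], sigma, w)
            t += sigma[j]
    return total / trials
```

For deterministic service times (2, 1) and the common reward w_j(t) = max(4 − t, 0), starting the short job first yields reward-to-go 4, versus 3 for the reverse order.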
We let

J*_t(s) = max_{π ∈ Π} J^π_t(s)  (12)

denote the expected total reward-to-go under the optimal policy, π* = argmax_{π ∈ Π} J^π_t(s).

The optimal reward-to-go function (or value function) J* and the optimal scheduling policy π* can in principle be computed via dynamic programming. Once all jobs have been completed, x = 0 and no more reward can be earned. Therefore, J*_t(s) = 0 for all s = (x, y, z) such that x = 0.

Given the current state s_t = s and the matching A between free jobs and processors enabled at the beginning of time slot t, the system transitions to state s_{t+1} = s' at the beginning of time slot t + 1 with probability P_A(s_{t+1} = s' | s_t = s). For example, if the service times σ_j are geometrically distributed with parameters p_j, and the system is in state s_t = s = (x, y, z) and matching A ∈ A(s_t) is chosen, then the system transitions to state s_{t+1} = s' = (x', y', z') with the following probabilities:

P_A(x'_j = 0 | s) = 1, if x(s)_j = 0;  p_j, if y_j(s) < t or (j, n) ∈ A for some n;  0, otherwise.

P_A(x'_j = 1 | s) = 1, if x(s)_j = 1, y_j(s) = ∅ and (j, n) ∉ A for all n;  1 − p_j, if y_j(s) < t or (j, n) ∈ A for some n;  0, otherwise.

P_A(y'_j = t | s) = 1, if (j, n) ∈ A for some n;  0, otherwise.

P_A(y'_j = y_j | s) = 1, if (j, n) ∉ A for all n;  0, otherwise.

P_A(z'_n = j | s) = 1 − p_j, if z(s)_n = j or (j, n) ∈ A;  0, otherwise.

P_A(z'_n = 0 | s) = p_j, if z(s)_n = j or (j, n) ∈ A;  1, if z(s)_n = 0 and (j, n) ∉ A for all j;  0, otherwise.
(13)

We can now recursively obtain J* using the Bellman recursion

J*_t(s) = max_{A ∈ A(s)} { E[ Σ_{(j,n) ∈ A} w_j(t + σ_j) ] + Σ_{s' ∈ S} P_A[s_{t+1} = s' | s_t = s] J*_{t+1}(s') }
        = max_{A ∈ A(s)} { E[ R_t(s, A) + J*_{t+1}(S̃(s, A)) ] },  (14)

where S̃(s, A) is the random next state encountered given that we start in state s and action A is taken. The solution can be found using the value iteration method.

Proposition 1
There exists an optimal control solution to (14) which is obtainable via value iteration.
Proof:
Once the queue is emptied, Bellman's recursion terminates. When x = 0, there are no more jobs left to be processed; no action can generate any reward, and the optimal policy will never leave this state once it reaches it. Moreover, there exists a policy which completes all jobs and causes Bellman's recursion to terminate in finite time (e.g. process all jobs on a single server n in random order; because P(σ_j < ∞) = 1, all jobs complete in finite time). This guarantees the existence of a stationary optimal policy which is obtainable via value iteration [21]. ∎

Of course, this approach is computationally intractable: the state space (the set of all (x, y, z)) is exponentially large, which makes such problems pragmatically difficult.

We now show that a special case of the non-preemptive scheduling problem is NP-hard. Consider a deterministic version of the problem, where the completion time of job j is σ_j with probability 1. Let w_j(t) = v_j for t ≤ K and w_j(t) = 0 otherwise. We can think of v_j as the value of job j and of K as the deadline shared by all jobs. This version of the non-preemptive scheduling problem with decaying rewards is equivalent to the 0/1 Multiple-Knapsack Problem, which is known to be NP-complete.

Theorem 1
The non-preemptive scheduling problem with decaying rewards is NP-hard.
Proof:
In the 0/1 Multiple Knapsack Problem, there are J objects of sizes σ_j and values v_j to be placed in N knapsacks of capacity K. Reward is only accrued if an entire object is placed in a knapsack; fractional objects are not allowed. The value of the optimal packing of objects equals the reward of the optimal scheduling policy for the above instance of non-preemptive scheduling with decaying rewards, and the transformation between instances is immediate. This completes the proof. ∎

3 The Greedy Policy

In light of Theorem 1, finding an optimal policy for the scheduling problem at hand is computationally intractable. Therefore, it is highly desirable to find simple but effective heuristics for practical deployment. In this section, we examine one such policy.

A natural heuristic policy is the greedy policy, which in state s, with F = Σ_n 1{z(s)_n = 0} free processors, chooses the F available jobs with maximum expected reward rate over the following time-step, E[w_j(t(s) + σ_j)]/E[σ_j]. That is,

π^g_t(s) = argmax_{A ∈ A} Σ_{(j,n) ∈ A} E[w_j(t(s) + σ_j)]/E[σ_j].  (15)

Such a policy is adaptive, but it ignores the evolution of the reward functions w_j(t) and their impact on rewards accrued in future states. We denote by J^g_t(s) the reward garnered by the greedy policy starting in state s.

We start with an instructive example which demonstrates the nature (and degree) of the sub-optimality of the greedy policy.
Example 1 (Greedy Sub-Optimality) Consider the case with 2 jobs and 1 machine. Time is initialized to 0, so that t = 0, J = 2 and N = 1. Assume that each job is waiting to begin processing, so that x_1 = x_2 = 1 and y_1 = y_2 = ∅. The service times are geometric, and the expected service times of jobs 1 and 2 are M and 1, respectively, i.e. p_1 = 1/M and p_2 = 1. The reward functions are:

For j = 1:  w_1(t) = M², t ≤ 1;  0, t > 1.
For j = 2:  w_2(t) = 1 + ε, for all t, for some ε > 0.

Hence, the completion of job 1 generates reward M² if it is completed in the first time slot; otherwise, no reward is received. On the other hand, job 2 generates reward 1 + ε, regardless of which time slot it is completed in. Therefore, the reward rates are:

E[w_1(t + σ_1)]/E[σ_1] = 1, t = 0;  0, t > 0.
E[w_2(t + σ_2)]/E[σ_2] = 1 + ε, for all t.

In time slot t = 0, the greedy policy schedules job 2 because its reward rate (1 + ε) is greater than that of job 1 (1). Job 2 completes processing in one time slot and generates reward w_2(1) = 1 + ε. At t = 1, only job 1 remains to be processed. However, the service time of job 1 is at least one time slot, so when job 1 is completed at t = 1 + σ_1 > 1, no reward is generated. Hence the greedy policy generates a total expected reward of 1 + ε.

On the other hand, the optimal policy realizes the reward of job 1 is degrading and schedules it first. With probability 1/M, job 1 will complete by time slot t = 1 and generate reward M². However, with probability 1 − 1/M it will take more than one time slot and generate no reward, since w_1(t) = 0 for t > 1. Upon the completion of job 1, job 2 is scheduled, and it completes processing in 1 time slot. Since w_2(t) = 1 + ε for all t, this results in additional reward 1 + ε. Hence, the total expected reward generated by the optimal policy is M + 1 + ε. Comparing the performance of the optimal and greedy policies gives J*(s)/J^g(s) = (M + 1 + ε)/(1 + ε).
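The arithmetic of Example 1 is easy to confirm with a short Monte Carlo simulation (a sketch; the function names and parameter values are ours):

```python
import random

def example_rewards(M=10.0, eps=0.1, trials=200000, seed=7):
    """Expected reward in Example 1: greedy order (job 2 first) vs. job 1 first."""
    rng = random.Random(seed)
    w1 = lambda t: M * M if t <= 1 else 0.0   # job 1: M^2 in the first slot only
    w2 = lambda t: 1.0 + eps                  # job 2: constant 1 + eps
    greedy = opt = 0.0
    for _ in range(trials):
        sigma1 = 1                            # geometric, success probability 1/M
        while rng.random() > 1.0 / M:
            sigma1 += 1
        sigma2 = 1                            # p_2 = 1: always a single slot
        greedy += w2(sigma2) + w1(sigma2 + sigma1)   # job 2 first: job 1 is too late
        opt += w1(sigma1) + w2(sigma1 + sigma2)      # job 1 first
    return greedy / trials, opt / trials
```

With M = 10 and ε = 0.1, the estimates concentrate near 1 + ε = 1.1 and M + 1 + ε = 11.1, matching the ratio (M + 1 + ε)/(1 + ε) derived above.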
Letting ε → 0, it is easy to see that the greedy policy results in an (M + 1)-approximation, where M = E[max_j σ_j]/min_j E[σ_j] = ∆. This suggests that the approximation factor of the greedy policy depends on the relationship between the job service times. The following subsection specifies this relation.

3.2 A (2 + ∆)-Approximation

In this section we will show that the greedy heuristic is within a factor 2 + ∆ of optimal, where

∆ = E[σ_max]/min_j E[σ_j].  (16)

Before we can prove this result, we first establish a few properties of the system and of the optimal value function J*_t.

We begin with a monotonicity property based on the number of jobs remaining to be processed. Intuitively, if one were given an additional set of jobs to process, the reward that can be garnered by the completion of the original set of jobs in conjunction with the additional jobs is more than if those extra jobs were not available. Consider two states s and s', which are nearly identical except that state s has more jobs to process than state s'. In other words, all jobs that have been completed in the s-system have also been completed in the s'-system. Similarly, any job that has started processing in the s-system has also started processing in the s'-system, at the exact same time on the same machine. Any additional jobs in state s are jobs that have not started processing, but have already been completed in state s'. That is, the additional jobs are only available for processing in the s-system. Then the reward-to-go generated starting in state s is larger than that starting in state s'. The following lemma formalizes this intuition.
Lemma 1 (Monotonicity in Jobs) Consider states s and s' such that state s has more jobs than state s', and any job that has started in state s' started processing in the exact same time slot in state s, so that for each job j: x(s)_j ≥ x(s')_j and

y(s)_j = y(s')_j, if x(s)_j = x(s')_j;  ∅, if x(s)_j > x(s')_j.

Also, in both states, each processor n is either idle or busy processing the same job: z(s)_n = z(s')_n. For all states s and s' which satisfy these conditions, the following holds: J*_t(s) ≥ J*_t(s').

Proof:
Consider a coupling of the systems starting at s and s', such that they see the same realizations of the service times σ_j (and of the residual service times for jobs that have already started processing). This is possible for all jobs j ∈ J_{s'} = {j ∈ J | x(s')_j = 1} ⊆ J_s = {j ∈ J | x(s)_j = 1}, because they have the same distributions. Here J_s and J_{s'} denote the jobs still to be completed in the systems starting in states s and s', respectively.

Let π*(s') denote the optimal scheduling policy starting from state s'. Consider a policy π̃ that starts in state s, mimics π*(s') until all jobs j ∈ J_{s'} are completed, and then completes the rest of the jobs j ∈ J_s \ J_{s'} in sequential order. That is, under π̃ the scheduler initially pretends that the jobs j ∈ J_s \ J_{s'} do not exist and uses the optimal policy under this assumption; once these jobs are completed, it processes the remaining jobs in an arbitrary order. Said another way, the π̃ policy blocks processing of the additional jobs in state s (j ∈ J_s \ J_{s'}) and optimally processes the remaining jobs. Once these jobs (j ∈ J_{s'}) are completed, the π̃ policy 'unlocks' the remaining, additional jobs and processes them in an arbitrary manner. Fig. 2 demonstrates the relationship between π*(s') and π̃ for a single server over a particular sample path of service times.

Figure 2: Monotonicity in Jobs: A single-server scenario. The s-system is given additional jobs, j = 5 and j = 6. The s'-system uses policy π*(s') to optimally process all jobs j = 1, 2, 3, 4. The s-system uses policy π̃, which mimics π*(s') until all jobs j = 1, 2, 3, 4 are completed at time T_{s'}, and then processes the remaining additional jobs.

Let T_j be the completion time of job j in the s-system when using policy π̃. Similarly, let T*_j be the completion time of job j in the s'-system under the optimal policy π*(s').
By our coupling, for all $j \in \mathcal{J}_{s'}$, $T_j = T^*_j$, i.e. the completion time of job $j$ is identical under the $s$-system, which uses policy $\tilde\pi$, and under the $s'$-system, which uses policy $\pi^*(s')$. (Notice in Fig. 2, jobs $1, 2, 3, 4$ complete at the same time in the $s'$- and $s$-systems.) We use the notation $J^*_t(s \mid \sigma)$ for the optimal reward-to-go given the filtration of the job service times, i.e. given a sample path of realizations of the $\sigma_j$. We employ similar notation for $J^{\tilde\pi}_t$. We have:
\[
J^*_t(s \mid \sigma) \ge J^{\tilde\pi}_t(s \mid \sigma) = \sum_{j \in \mathcal{J}_s} w_j(T_j) = \sum_{j \in \mathcal{J}_{s'}} w_j(T_j) + \sum_{j \in \mathcal{J}_s \setminus \mathcal{J}_{s'}} w_j(T_j) = J^*_t(s' \mid \sigma) + \sum_{j \in \mathcal{J}_s \setminus \mathcal{J}_{s'}} w_j(T_j) \ge J^*_t(s' \mid \sigma).
\]
The first inequality comes from the optimality of $J^*_t(\cdot)$. The first equality comes from the definition of the reward function, $T_j$, and the $\tilde\pi$ policy. The third equality comes from the coupling of the two systems, so that $T_j = T^*_j$ for all $j \in \mathcal{J}_{s'}$. The last inequality comes from the non-negativity of the rewards in Assumption 2. Taking expectations over the $\sigma_j$ yields the desired result. □

Next, we consider a property of the optimal policy. In every time slot, there will be a set (possibly empty) of free machines ($z_n = 0$). In each time slot, the optimal policy will assign a job to every free machine, assuming there are enough available jobs. That is, while there are still jobs waiting to be processed, no machine will idle under the optimal policy.

Lemma 2 (Non-idling) Suppose in state $s$ there are $F = |\{n \in \mathcal{N} \mid z_n(s) = 0\}|$ free machines, and the number of jobs remaining to be processed is $K = |\{j \in \mathcal{J} \mid x_j(s) = 1, y_j(s) = \emptyset\}|$. Then, under the optimal policy $\pi^*(s)$, the number of job-processor pairs executed in the next time slot will be $|A| = \min\{K, F\}$, i.e. the optimal policy is non-idling.

Proof:
The proof is by contradiction. What needs to be shown is that nothing can be gained by idling ($|A| < \min\{K, F\}$). Suppose that under the optimal policy a processor remains free (idles) even though there is an available job to work on. Consider another policy $\tilde\pi$ which is identical to the $\pi^*$ policy except that it begins processing each job on the idling machine one time slot earlier. Because rewards are non-increasing in time (Assumption 2), processing the jobs earlier can only increase the reward. This contradicts the optimality of the idling policy; hence, no optimal policy will idle. □

Now consider two systems which are identical, except one machine is tied up longer in the second system. The following lemma says that the maximum amount of additional revenue accrued by the first system for being able to start processing earlier is given by the reward rate of the greedy job; that is, the job of maximum reward rate amongst those in processing or waiting.
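As a concrete illustration of the greedy index used throughout (a hypothetical sketch, not code from the paper), one can compute $g = \arg\max_j E[w_j(t+\sigma_j)]/E[\sigma_j]$ directly from each job's reward function and a finite-support service-time distribution; the encoding of jobs as (reward function, distribution) pairs is an assumption made here for illustration:

```python
# Hypothetical sketch: compute the greedy job g = argmax_j E[w_j(t+sigma_j)] / E[sigma_j].
# Each job is described by a reward function w_j and a finite service-time
# distribution given as (service_time, probability) pairs (an assumed encoding).
def greedy_job(t, jobs):
    def rate(j):
        w, dist = jobs[j]
        exp_reward = sum(p * w(t + s) for s, p in dist)   # E[w_j(t + sigma_j)]
        exp_service = sum(p * s for s, p in dist)         # E[sigma_j]
        return exp_reward / exp_service
    return max(jobs, key=rate)

# Job 1's reward expires after slot 5; job 2's reward is constant.
jobs = {
    1: (lambda t: 3.6 if t <= 5 else 0.0, [(2, 0.5), (4, 0.5)]),
    2: (lambda t: 1.0, [(1, 1.0)]),
}
print(greedy_job(0, jobs))   # job 1 wins early ...
print(greedy_job(4, jobs))   # ... but job 2 wins once job 1's reward has expired
```

Note that the index is evaluated at the current slot $t$, so the same job set can yield different greedy choices at different decision epochs; this myopia with respect to future decay is exactly what the later bounds quantify.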
Lemma 3 (Greedy Revenue) Consider a state $s_t = s$ in time slot $t$ and let $g$ denote the index of a greedy job, i.e.
\[
g = \arg\max_{j \in \mathcal{J}_s} \frac{E[w_j(t + \sigma_j)]}{E[\sigma_j]}
\]
over all jobs which are mid-processing or have not started ($\mathcal{J}_s = \{k \in \mathcal{J} \mid x_k(s) = 1\}$).

Denote by $s_g$ and $s_i$ two states which are related to state $s$ in the following manner. The two states are identical to state $s$, except on free machine $n_g$ ($z_{n_g}(s) = 0$). In state $s_g$, machine $n_g$ is occupied by a replica of job $g$, meaning it has the same service time as job $g$; however, its completion does not generate any reward, nor does it affect the completion of the original job $g$. Similarly, in state $s_i$, machine $n_g$ is occupied by a replica of job $i$. Said in notation: $x_j(s_i) = x_j(s_g) = x_j(s)$ and $y_j(s_i) = y_j(s_g) = y_j(s)$ for all $j$; $z_n(s_i) = z_n(s_g) = z_n(s)$ for all $n \ne n_g$; and $z_{n_g}(s_g) = g$ while $z_{n_g}(s_i) = i$ for some arbitrary job index $i$ and machine $n_g$. Then,
\[
E[J^*_t(s_i)] \le \Big(1 - \frac{E[\sigma_i]}{E[\sigma_g]} + \frac{E[\max_{j \in \mathcal{J}_s} \sigma_j]}{E[\sigma_g]}\Big) E[w_g(t + \sigma_g)] + E[J^*_t(s_g)].
\]
Proof:
We begin by coupling the systems such that they see the same realizations of the service times. Note that the replicated jobs which currently occupy machine $n_g$ need not have the same service times as their original jobs, $i$ or $g$, despite having the same distributions.

Consider a policy $\tilde\pi$ for the $s_g$-system which mimics the $\pi^*(s_i)$ policy. While processor $n_g$ is occupied by replica job $g$, which blocks processing of other jobs, the $s_g$-system will simulate the service times of jobs on processor $n_g$. There are two possible cases, $\sigma_i \ge \sigma_g$ and $\sigma_i < \sigma_g$.

Case 1, $\sigma_i \ge \sigma_g$: the $\tilde\pi$ policy idles on machine $n_g$ until $t + \sigma_i$ (the time at which machine $n_g$ becomes free in the $s_i$-system). At this point, the $s_g$-system is 'synced' with the $s_i$-system and it proceeds with executing the optimal policy for the $s_i$-system, $\pi^*(s_i)$. See Fig. 3 for a single-processor example of such a scenario.
[Figure 3: Case 1, $\sigma_i \ge \sigma_g$. The optimal policy is used for the $s_i$-system. The $s_g$-system uses policy $\tilde\pi$, which mimics $\pi^*(s_i)$. Because replica job $i$ completes after replica job $g$, the $\tilde\pi$ policy idles. Note that job $g$ is processed twice in the $s_g$-system because the first copy is just a replica. Job $i$ is only processed once in the $s_i$-system because, even though the copy on machine $n_g$ is a replica, the original had already completed processing.]

If $T^*_j(s_i)$ is the completion time of job $j$ in the $s_i$-system under the optimal policy $\pi^*(s_i)$, and $T_j$ is the completion time of job $j$ in the $s_g$-system under the $\tilde\pi$ policy, then $T_j = T^*_j(s_i)$. Employing similar notation as before, we consider the reward-to-go on a single realized sample path of service times, given by $\sigma$ and the event $\sigma_i \ge \sigma_g$:
\begin{align}
J^*_t(s_i \mid \sigma, \sigma_i \ge \sigma_g) &= \sum_{j \in \mathcal{J}_s} w_j(T^*_j(s_i)) \tag{17} \\
&= \sum_{j \in \mathcal{J}_s} w_j(T_j) \nonumber \\
&= J^{\tilde\pi}_t(s_g \mid \sigma, \sigma_i \ge \sigma_g) \nonumber \\
&\le J^*_t(s_g \mid \sigma, \sigma_i \ge \sigma_g) \nonumber \\
&\le J^*_t(s_g \mid \sigma, \sigma_i \ge \sigma_g) + \frac{E[w_g(t + \sigma_g) \mid \sigma_i \ge \sigma_g]}{E[\sigma_g \mid \sigma_i \ge \sigma_g]} \, E[\sigma_g - \sigma_i + \sigma_{\max} \mid \sigma_i \ge \sigma_g] \nonumber
\end{align}

Case 2, $\sigma_i < \sigma_g$: In this case, $\tilde\pi$ cannot exactly mimic the $\pi^*(s_i)$ policy because machine $n_g$ will continue to be busy after replica $i$ completes in the $s_i$-system. The $\tilde\pi$ policy will simulate the processing of jobs on $n_g$ while the machine is still busy. Let $\mathcal{J}_{\mathrm{sim}}$ denote the set of jobs whose processing is simulated. Despite the fact that these simulated jobs will not actually be completed, the $\tilde\pi$ policy assumes they are. The $\tilde\pi$ policy continues to follow the $\pi^*(s_i)$ policy until all jobs are 'completed', in the sense that they are actually completed or their completion was simulated because processor $n_g$ was busy under the $s_g$-system when it was free under the $s_i$-system.

The $\tilde\pi$ policy then finishes processing the simulated jobs ($j \in \mathcal{J}_{\mathrm{sim}}$) in an arbitrary manner so that they are actually completed. That is, the actual completion of the simulated jobs is deferred until after the rest of the jobs have completed processing. Fig. 4 shows an example sample path of this scenario. If $T^*_j(s_i)$ is the completion time of job $j$ in the $s_i$-system under the optimal policy $\pi^*(s_i)$, and $T_j$ is the completion time of job $j$ in the $s_g$-system under the $\tilde\pi$ policy, then $T_j = T^*_j(s_i)$ for all $j \notin \mathcal{J}_{\mathrm{sim}}$.

Then (again employing the notation given the filtration of the $\sigma_j$ and the case $\sigma_i < \sigma_g$):
\begin{align}
J^*_t(s_i \mid \sigma, \sigma_i < \sigma_g) &= \sum_{j \in \mathcal{J}_s} w_j(T^*_j(s_i)) \nonumber \\
&= \sum_{j \in \mathcal{J}_{\mathrm{sim}}} w_j(T^*_j(s_i)) + \sum_{j \notin \mathcal{J}_{\mathrm{sim}}} w_j(T^*_j(s_i)) \nonumber \\
&\le \sum_{j \in \mathcal{J}_{\mathrm{sim}}} w_j(T^*_j(s_i)) + \sum_{j \notin \mathcal{J}_{\mathrm{sim}}} w_j(T^*_j(s_i)) + \sum_{j \in \mathcal{J}_{\mathrm{sim}}} w_j(T'_j) \nonumber \\
&= \sum_{j \in \mathcal{J}_{\mathrm{sim}}} w_j(T^*_j(s_i)) + J^{\tilde\pi}_t(s_g \mid \sigma, \sigma_i < \sigma_g) \nonumber \\
&\le \sum_{j \in \mathcal{J}_{\mathrm{sim}}} w_j(T^*_j(s_i)) + J^*_t(s_g \mid \sigma, \sigma_i < \sigma_g) \tag{18}
\end{align}
where $T'_j$ denotes the actual (deferred) completion time of simulated job $j$ under the $\tilde\pi$ policy.
[Figure 4: Case 2, $\sigma_i < \sigma_g$. The optimal policy is used for the $s_i$-system. The $s_g$-system uses policy $\tilde\pi$, which mimics $\pi^*(s_i)$. Because replica $i$ completes before replica $g$, the $\tilde\pi$ policy is blocked until $t + \sigma_g$. From time $t + \sigma_i$, the $\tilde\pi$ policy simulates the processing of jobs on machine $n_g$. The machine will idle once replica job $g$ completes and before the last simulated job finishes its simulated processing. At time $\tau$, the $\tilde\pi$ policy is able to follow the $\pi^*(s_i)$ policy. The simulated jobs are then completed in an arbitrary order after the $\pi^*(s_i)$ policy completes at time $T_{s_i}$. Note that jobs $g$ and $i$ are processed once in each system because the original jobs have already completed processing (the replicas are processed by time $t$).]

The first inequality comes from the non-negativity of rewards. The third equality comes from our coupling and the definition of the $\tilde\pi$ policy. The last inequality comes from the optimality of $J^*_t$. Taking expectations over the $\sigma_j$ (equivalently, over the $T^*_j(s_i)$), and using a little algebra in (18):
\begin{align}
J^*_t(s_i \mid \sigma_i < \sigma_g) &\le \sum_{j \in \mathcal{J}_{\mathrm{sim}}} E[w_j(T^*_j(s_i)) \mid \sigma_i < \sigma_g] + J^*_t(s_g \mid \sigma_i < \sigma_g) \tag{19} \\
&\le \sum_{j \in \mathcal{J}_{\mathrm{sim}}} E[\sigma_j \mid \sigma_i < \sigma_g] \, \frac{E[w_j(t + \sigma_j) \mid \sigma_i < \sigma_g]}{E[\sigma_j \mid \sigma_i < \sigma_g]} + J^*_t(s_g \mid \sigma_i < \sigma_g) \nonumber \\
&\le \max_k \frac{E[w_k(t + \sigma_k) \mid \sigma_i < \sigma_g]}{E[\sigma_k \mid \sigma_i < \sigma_g]} \sum_{j \in \mathcal{J}_{\mathrm{sim}}} E[\sigma_j \mid \sigma_i < \sigma_g] + J^*_t(s_g \mid \sigma_i < \sigma_g) \nonumber \\
&\le \frac{E[w_g(t + \sigma_g) \mid \sigma_i < \sigma_g]}{E[\sigma_g \mid \sigma_i < \sigma_g]} \, E[\sigma_g + \sigma_l - \sigma_i \mid \sigma_i < \sigma_g] + J^*_t(s_g \mid \sigma_i < \sigma_g) \nonumber \\
&\le \frac{E[w_g(t + \sigma_g) \mid \sigma_i < \sigma_g]}{E[\sigma_g \mid \sigma_i < \sigma_g]} \, E[\sigma_g - \sigma_i + \sigma_{\max} \mid \sigma_i < \sigma_g] + J^*_t(s_g \mid \sigma_i < \sigma_g) \nonumber
\end{align}
The second inequality comes from the fact that, for all $j$, $T^*_j(s_i) \ge t + \sigma_j$, since the earliest time a job can begin processing is $t$ and all $w_j(t)$ are non-increasing in $t$ (Assumption 2). For the fourth inequality, consider the total service time of the simulated jobs: simulated processing begins at $t + \sigma_i$ and finishes at $\tau > t + \sigma_g$. In particular, there exists some $l$ such that the first time machine $n_g$ is free under policy $\tilde\pi$ is $\tau < t + \sigma_g + \sigma_l$, i.e. $l$ is the last simulated job. Hence the total service time of the simulated jobs is bounded above by $(t + \sigma_g + \sigma_l) - (t + \sigma_i)$; combined with the definition of the greedy job $g$, this yields the fourth inequality. The final inequality uses $\sigma_l \le \sigma_{\max}$.

Combining (17) and (19), and taking expectations over the events $\sigma_i \ge \sigma_g$ and $\sigma_i < \sigma_g$, yields:
\[
J^*_t(s_i) \le \frac{E[w_g(t + \sigma_g)]}{E[\sigma_g]} \big( E[\sigma_g] - E[\sigma_i] + E[\sigma_{\max}] \big) + E[J^*_t(s_g)] = \Big(1 - \frac{E[\sigma_i]}{E[\sigma_g]} + \frac{E[\sigma_{\max}]}{E[\sigma_g]}\Big) E[w_g(t + \sigma_g)] + E[J^*_t(s_g)],
\]
which concludes the proof. □

Suppose we were able to process a job without using a machine. The total reward gained with the use of this 'virtual machine' is greater than the reward gained without it.
Define $S': \mathcal{S} \times \mathcal{J} \to \mathcal{S}$ as the operation which reduces state $s$ to state $s'_i = S'(s, i)$ by removing a job $i$ which has not yet begun processing in state $s$. That is, starting in state $s$, select a job $i$ that has not been completed; job $i$ is completed and generates its associated reward without tying up a processor. Said in notation: $\forall n$: $z_n(s'_i) = z_n(s)$; $\forall j \ne i$: $x_j(s'_i) = x_j(s)$ and $y_j(s'_i) = y_j(s)$; but $x_i(s'_i) = (x_i(s) - 1)^+$ and $y_i(s'_i) = \emptyset$.

Lemma 4 (Virtual Machine Rewards) For all states $s$ and any job $i$, let state $S'(s, i)$ denote the resulting state if job $i$ were processed without occupying a processor, with reward $w_i$ generated upon completion. Then:
\[
J^*_t(s) \le E[w_i(t + \sigma_i)] + J^*_t(S'(s, i)).
\]
Proof:
Consider a coupling of the systems starting in states $s$ and $s'_i = S'(s, i)$ such that they see the same realizations of the service times for all jobs. Let $\pi^*(s)$ denote the optimal scheduling policy starting from state $s$.

In the $s'_i$-system, we call job $i$ a 'fictitious' job. It is fictitious because it does not actually exist (it has already completed) under the $s'_i$-system. Consider a policy $\tilde\pi$ which assumes that job $i$ is a 'real' (available, not yet processed) job and executes the optimal policy under this assumption; i.e., at time slot $t$ it assumes it is in state $s$ (rather than $s'_i$) and executes the optimal policy $\pi^*_t(s)$. When $\tilde\pi$ schedules job $i$, there is no job to actually process, so the processor will idle while it simulates the processing time for job $i$, which is identically distributed to $\sigma_i$ under the $s$-system. See Fig. 5 for a single-machine example of the $\tilde\pi$ and $\pi^*(s)$ policies given a sample path of service-time realizations.
[Figure 5: Virtual machine: a single-server scenario. Under the $s'_i$-system, job $2$ is processed on a virtual machine. The $s$-system uses the optimal policy $\pi^*(s)$ to process all jobs. The $s'_i$-system uses policy $\tilde\pi$, which mimics $\pi^*(s)$; because job $j = 2$ has already been processed on the 'virtual machine', the $\tilde\pi$ policy idles.]

Let $T_j$ be the completion time of job $j$ under the $\tilde\pi$ policy. Note that $T_i$ is the completion time of the fictitious job, $i$. Let $t_i$ denote the random time at which job $i$ begins 'processing' under this policy. Under our coupling, $T_j$ is precisely the start time of job $j$ plus its processing time under $\pi^*(s)$ for the $s$-system. Hence,
\[
J^*_t(s \mid \sigma) = \sum_j w_j(T_j) = \sum_{j \ne i} w_j(T_j) + w_i(T_i) = J^{\tilde\pi}_t(s'_i \mid \sigma) + w_i(t_i + \sigma_i) \le J^*_t(s'_i \mid \sigma) + w_i(t + \sigma_i).
\]
The inequality results from the non-increasing property of the reward functions in Assumption 2 (since $t_i \ge t$) and from the optimality of $J^*_t(\cdot)$. Taking expectations over the $\sigma_j$ yields the desired result. □

We are now in position to prove the main result of this paper. Let $\Delta = E[\max_j \sigma_j] / \min_j E[\sigma_j]$ as in (16).

Theorem 2
For all states $s \in \mathcal{S}$, the following performance guarantee for the greedy policy holds:
\[
J^*_t(s) \le (2 + \Delta) J^g_t(s).
\]
Proof:
The proof proceeds by induction on the number of jobs remaining to be processed, $\sum_{j \in \mathcal{J}} \mathbf{1}\{y_j = \emptyset\}$. The claim is trivially true if there is only one job remaining to be processed, since the greedy and optimal policies then coincide. Now consider a state $s$ such that $\sum_j \mathbf{1}\{y_j(s) = \emptyset\} = K$, and assume that the claim is true for all states $s'$ with $K > \sum_j \mathbf{1}\{y_j(s') = \emptyset\}$.

If $\pi^*_t(s) = \pi^g_t(s)$, then the next state encountered and the rewards generated in both systems are identically distributed, so the induction hypothesis immediately yields the result for state $s$.

Consider the case where $\pi^*_t(s) \ne \pi^g_t(s)$. Denote by $\mathcal{J}^*$ and $\mathcal{J}^g$ the sets of jobs processed by the optimal and greedy policies in state $s$. Note that these sets depend on the current time slot $t$ and the state $s$; however, we suppress this dependence for notational compactness. Recall that, by Lemma 2, $|\mathcal{J}^*| = |\mathcal{J}^g|$. Let $A^*$ and $A^g$ denote the optimal and greedy scheduling actions, respectively, given state $s$ in time slot $t$.

Taking definitions from before, we define $\tilde S(s, A)$ as the random next state encountered given that we start in state $s$ and action $A$ is taken. Also, $S'(s, i)$ is state $s$ with the completion of job $i$, i.e. job $i$ is completed ($x_i = 0$) without using a processor.

Define the operator $\hat S: \mathcal{S} \times \mathcal{A} \to \mathcal{S}$, which transforms state $s$ by tying up machines with replicas of the jobs defined by $A$. That is, $\hat s = \hat S(s, A)$ is the state where the jobs given by $A$ begin processing on the corresponding machines, but no reward is generated for their completion and they remain to be processed at a later time (reward is generated upon this second completion). This second completion may occur before or after the completion of the replicated job. $A$ defines which jobs are replicated and which machines they are processed on, and hence occupy; replicated jobs do not generate any reward. Put another way, $\hat s$ is a new state where machines are occupied for an amount of time defined by the service times of the jobs in $A$.
Said in notation, $x_j(\hat S(s, A)) = x_j(s)$ and $y_j(\hat S(s, A)) = y_j(s)$ for all $j$, while $z_n(\hat S(s, A)) = j$ if $(j, n) \in A$ and $z_n(\hat S(s, A)) = z_n(s)$ otherwise.

We have:
\begin{align}
J^*_t(s) &= \sum_{j \in \mathcal{J}^*} E[w_j(t + \sigma_j)] + E[J^*_t(\tilde S(s, A^*))] \nonumber \\
&\le \sum_{(i,g) \in (\mathcal{J}^*, \mathcal{J}^g)} \frac{E[\sigma_i]}{E[\sigma_g]} E[w_g(t + \sigma_g)] + E[J^*_t(\tilde S(s, A^*))] \nonumber \\
&\le \sum_{(i,g) \in (\mathcal{J}^*, \mathcal{J}^g)} \frac{E[\sigma_i]}{E[\sigma_g]} E[w_g(t + \sigma_g)] + E[J^*_t(\hat S(s, A^*))] \tag{20}
\end{align}
The first inequality comes from the definition of the greedy policy: the reward rate of each greedy job is at least that of the optimal job it is paired with. The second inequality comes from Lemma 1, by putting back the jobs in $A^*$: the machines are occupied by replicas of the jobs defined by $A^*$, but the original jobs are placed back to be completed at a later date. These additional jobs generate more reward, as shown in Lemma 1.

Continuing from (20), we switch $A^*$ with $A^g$. That is, instead of tying up the machines with replicas of the optimal jobs, they are tied up with replicas of the greedy jobs.
Because $|\mathcal{J}^*| = |\mathcal{J}^g|$ and the processing times on each machine are identical, we can consider each machine individually and use Lemma 3, so that
\begin{align}
&\sum_{(i,g) \in (\mathcal{J}^*, \mathcal{J}^g)} \frac{E[\sigma_i]}{E[\sigma_g]} E[w_g(t + \sigma_g)] + E[J^*_t(\hat S(s, A^*))] \nonumber \\
&\quad\le \sum_{(i,g) \in (\mathcal{J}^*, \mathcal{J}^g)} \frac{E[\sigma_i]}{E[\sigma_g]} E[w_g(t + \sigma_g)] + \sum_{(i,g) \in (\mathcal{J}^*, \mathcal{J}^g)} E[w_g(t + \sigma_g)] \Big(1 - \frac{E[\sigma_i]}{E[\sigma_g]} + \frac{E[\sigma_{\max}]}{E[\sigma_g]}\Big) + E[J^*_t(\hat S(s, A^g))] \nonumber \\
&\quad= \sum_{g \in \mathcal{J}^g} E[w_g(t + \sigma_g)] \Big(1 + \frac{E[\sigma_{\max}]}{E[\sigma_g]}\Big) + E[J^*_t(\hat S(s, A^g))] \tag{21}
\end{align}
Continuing from (21), we now complete the greedy jobs without occupying any machines:
\begin{align}
&\sum_{g \in \mathcal{J}^g} E[w_g(t + \sigma_g)] \Big(1 + \frac{E[\sigma_{\max}]}{E[\sigma_g]}\Big) + E[J^*_t(\hat S(s, A^g))] \nonumber \\
&\quad\le \sum_{g \in \mathcal{J}^g} E[w_g(t + \sigma_g)] \Big(2 + \frac{E[\sigma_{\max}]}{E[\sigma_g]}\Big) + E[J^*_t(\tilde S(s, A^g))] \nonumber \\
&\quad\le \sum_{g \in \mathcal{J}^g} \Big(2 + \frac{E[\sigma_{\max}]}{\min_k E[\sigma_k]}\Big) E[w_g(t + \sigma_g)] + \Big(2 + \frac{E[\sigma_{\max}]}{\min_k E[\sigma_k]}\Big) E[J^g_t(\tilde S(s, A^g))] \nonumber \\
&\quad= (2 + \Delta) J^g_t(s) \nonumber
\end{align}
The first inequality comes from the use of 'virtual machines' for the greedy jobs, via Lemma 4: completing each greedy job without occupying a machine contributes an extra $E[w_g(t + \sigma_g)]$ per greedy job, after which the replicas occupying the machines correspond in distribution to the greedy jobs actually being processed, i.e. to $\tilde S(s, A^g)$. The second inequality comes from the induction hypothesis. This concludes the proof. □
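To see the guarantee of Theorem 2 in action, the following self-contained sketch (a hypothetical single-machine instance with deterministic service times, not taken from the paper) computes the optimal schedule by brute force over all processing orders, runs the greedy schedule, and checks that $J^* \le (2 + \Delta) J^g$. With deterministic service times, $\Delta$ reduces to $\max_j \sigma_j / \min_j \sigma_j$:

```python
from itertools import permutations

# Hypothetical deterministic single-machine instance, for illustration only.
sigma = {1: 1, 2: 2, 3: 3}
eps = 0.1
def w(j, t):
    if j == 1:
        return 1 + eps if t <= 1 else 0.0   # job 1's reward expires after slot 1
    return 1.0 if t <= 10 else 0.0          # jobs 2 and 3 expire at t = 10

def total_reward(order):
    t, total = 0, 0.0
    for j in order:
        t += sigma[j]                        # non-preemptive completion time
        total += w(j, t)
    return total

# Optimal schedule by exhaustive search over all J! orders.
opt = max(total_reward(p) for p in permutations(sigma))

# Greedy: at each decision epoch, start the free job with the highest
# reward rate w_j(t + sigma_j) / sigma_j.
t, greedy, free = 0, 0.0, set(sigma)
while free:
    g = max(free, key=lambda j: w(j, t + sigma[j]) / sigma[j])
    t += sigma[g]
    greedy += w(g, t)
    free.remove(g)

delta = max(sigma.values()) / min(sigma.values())
assert opt <= (2 + delta) * greedy           # Theorem 2's guarantee holds
print(opt, greedy, 2 + delta)
```

On this benign instance greedy actually matches the optimum; degenerate instances such as Example 2 below show the ratio can be much closer to the worst case.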
As shown in [9, 10], the greedy policy is optimal for linearly or exponentially decaying reward functions. Under a few other special cases, the bound in Theorem 2 can be improved.

4.1 Identical Processing Times
Suppose that all job service times are independent and identically distributed; e.g., in the case of Geometric service times, $p_j = p$ for all $j$. In general, there is no closed-form expression for $E[\sigma_{\max}]$; however, in this case the bound can be improved to a factor of $2$. To do this, Lemma 3 needs to be modified.

Lemma 5 (Greedy Revenue, i.i.d. processing times) Consider a state $s_t = s$ in time slot $t$ and let $g$ denote the index of a greedy job, i.e. $g = \arg\max_{j \in \mathcal{J}_s} E[w_j(t + \sigma_j)] / E[\sigma_j]$ over all jobs which are mid-processing or have not started ($\mathcal{J}_s = \{k \in \mathcal{J} \mid x_k(s) = 1\}$). Denote by $s_g$ and $s_i$ two states which are related to state $s$ as follows: $x_j(s_i) = x_j(s_g) = x_j(s)$ and $y_j(s_i) = y_j(s_g) = y_j(s)$ for all $j$; $z_n(s_i) = z_n(s_g) = z_n(s)$ for all $n \ne n_g$; and $z_{n_g}(s_g) = g$ while $z_{n_g}(s_i) = i$ for some arbitrary job index $i$ and machine $n_g$. That is, in state $s_g$, machine $n_g$ is occupied by a replica of job $g$; and in state $s_i$, machine $n_g$ is occupied by a replica of job $i$. Then,
\[
E[J^*_t(s_i)] = E[J^*_t(s_g)].
\]
Proof:
Couple the systems such that they see the same realizations of the service times of jobs $i$ and $g$, which currently occupy machine $n_g$. This coupling is possible since the service times are i.i.d. Therefore, under this coupling there is no difference between states $s_i$ and $s_g$, since these 'jobs' are only occupying the machine and are not generating any rewards. Hence, $E[J^*_t(s_i)] = E[J^*_t(s_g)]$. □

Now we are able to prove an improved bound on the performance of the greedy policy.
Theorem 3
Let the service time of job $j$ be distributed according to density function $f_j(\sigma)$. If all job service times are independent and identically distributed according to $f(\sigma)$, i.e. $f_j(\sigma) = f(\sigma)$ for all $j$, then for all states $s \in \mathcal{S}$ the greedy policy is guaranteed to be within a factor of $2$ of optimal:
\[
J^*_t(s) \le 2 J^g_t(s).
\]
Proof:
Under this scenario, Lemma 3 can be replaced by Lemma 5 in the proof of Theorem 2. Hence, $E[J^*_t(\hat S(s, A^*))] = E[J^*_t(\hat S(s, A^g))]$: since the distributions of completion times are identical, the amount of time a processor is busy is independent of which job it is processing. Instead of replicating the entire proof here, we examine how (20), (21), and (22) change.

The only difference for (20) is that $E[\sigma_j] = E[\sigma_i]$ for all $i, j$, which allows for a slight simplification:
\begin{align}
J^*_t(s) &= \sum_{j \in \mathcal{J}^*} E[w_j(t + \sigma_j)] + E[J^*_t(\tilde S(s, A^*))] \nonumber \\
&\le \sum_{(i,g) \in (\mathcal{J}^*, \mathcal{J}^g)} \frac{E[\sigma_i]}{E[\sigma_g]} E[w_g(t + \sigma_g)] + E[J^*_t(\tilde S(s, A^*))] \nonumber \\
&\le \sum_{g \in \mathcal{J}^g} E[w_g(t + \sigma_g)] + E[J^*_t(\hat S(s, A^*))] \tag{22}
\end{align}
Now, with the improvement of Lemma 3 to Lemma 5, (21) reduces significantly:
\[
\sum_{g \in \mathcal{J}^g} E[w_g(t + \sigma_g)] + E[J^*_t(\hat S(s, A^*))] = \sum_{g \in \mathcal{J}^g} E[w_g(t + \sigma_g)] + E[J^*_t(\hat S(s, A^g))] \tag{23}
\]
Finally, utilizing Lemma 4 and completing (and generating the rewards of) the greedy jobs gives:
\begin{align}
\sum_{g \in \mathcal{J}^g} E[w_g(t + \sigma_g)] + E[J^*_t(\hat S(s, A^g))] &\le 2 \sum_{g \in \mathcal{J}^g} E[w_g(t + \sigma_g)] + E[J^*_t(\tilde S(s, A^g))] \tag{24} \\
&\le 2 \sum_{g \in \mathcal{J}^g} E[w_g(t + \sigma_g)] + 2 E[J^g_t(\tilde S(s, A^g))] \nonumber \\
&= 2 J^g_t(s) \nonumber
\end{align}
□

In the case of i.i.d. service times, the greedy policy corresponds to scheduling the job with the highest expected reward over the (identically distributed) completion times. While this seems an intuitive policy, the following example shows what can go wrong.
Example 2
Consider the case with $2$ jobs and $1$ machine ($J = 2$ and $N = 1$). We begin at $t = 0$. Assume that neither job has begun processing, so that $x_1 = x_2 = 1$ and $y_1 = y_2 = \emptyset$. The service times of jobs $1$ and $2$ are both deterministic and equal to $1$. The reward functions are:
\[
w_1(t) = \begin{cases} 1 - \epsilon, & t \le 1 \\ 0, & t > 1 \end{cases} \qquad\qquad w_2(t) = 1, \;\; \forall t
\]
for $\epsilon > 0$. So the completion of job $1$ only generates revenue if it is completed in the first time slot, while job $2$ generates the same revenue regardless of which time slot it is completed in. Therefore, the reward rates are:
\[
\frac{E[w_1(t + \sigma_1)]}{E[\sigma_1]} = \begin{cases} 1 - \epsilon, & t = 0 \\ 0, & t > 0 \end{cases} \qquad\qquad \frac{E[w_2(t + \sigma_2)]}{E[\sigma_2]} = 1, \;\; \forall t
\]
Clearly, the greedy policy is to schedule job $2$ and then job $1$, since the reward rate of job $2$ is greater than that of job $1$ ($\epsilon > 0$). However, when job $1$ completes at $t = 2$, it generates no reward, since $w_1(2) = 0$. This results in reward $1$. On the other hand, the optimal policy recognizes that the reward of job $1$ is degrading, schedules it first, and schedules job $2$ second. This results in reward $2 - \epsilon$. We thus see that $J^*_t(s) = (2 - \epsilon) J^g_t(s)$ here.

In light of the example just shown, the bound in Theorem 3 is tight (as $\epsilon \to 0$).

We have proven a worst-case bound for arbitrary decaying rewards. If the time-scale of decay is very long compared to the time-scale of job completion times, then the rewards are nearly constant during the processing time of a job. In particular, as the decay goes to zero over the time-scale of job completion times, the performance of the greedy heuristic approaches the performance of the optimal policy.

We will now formally define the time-scale of decay, via a difference-equation specification. Let $\delta = \max_{t,k,m} E[w_k(t) - w_k(t + \sigma_m)] \ge 0$. We will show that as $\delta \to 0$, $J^g_t(s) \to J^*_t(s)$. To do this, we start with a few preliminary results.

The first is that, as $\delta \to 0$, rewards become invariant to the completion time. Rewards are generated upon the completion of each job.
However, as $\delta \to 0$, the reward generated at the completion time of a job is nearly the reward that would have been generated had the job had zero processing time.

Lemma 6 (Time-Invariant Rewards) For any jobs $i, j$ and time slot $t$, as the time-scale of decay $\delta$ approaches $0$, the reward generated for completing job $j$ is invariant to shifts in time by the service time of job $i$, $\sigma_i$. In particular,
\[
\lim_{\delta \to 0} E[w_j(t + \sigma_i)] = w_j(t).
\]
Proof:
For any job indices $i, j$ and time slot $t$:
\[
\big| E[w_j(t + \sigma_i)] - w_j(t) \big| \le \max_{\tau,k,m} \big| E[w_k(\tau + \sigma_m)] - w_k(\tau) \big| = \delta, \tag{25}
\]
which implies that $\big| E[w_j(t + \sigma_i)] - w_j(t) \big| \to 0$ as $\delta \to 0$. □

Because rewards are nearly constant over the time-scale of job completions, starting a job $\sigma_j$ time slots later does not significantly reduce the aggregate reward accrued. The following lemma is similar to Lemma 3, for slowly decaying reward functions. Define $\hat S(s, A)$ as in Section 3.2, so that $\hat S(s, A)$ is the state where the jobs given by $A$ are processed on the corresponding machines, but they are not removed and no reward is generated for this initial processing. These replica jobs occupy the machines, making them unable to process other jobs in the meantime; however, they do not generate reward. In notation, $x_j(\hat S(s, A)) = x_j(s)$ and $y_j(\hat S(s, A)) = y_j(s)$ for all $j$, while $z_n(\hat S(s, A)) = j$ for all $(j, n) \in A$ and $z_n(\hat S(s, A)) = z_n(s)$ otherwise.

Lemma 7 (Delayed Machine) Let $\hat s = \hat S(s, A)$ denote the resulting state if the machines in $A$ are occupied, but all the jobs have the same (un)processed status as in state $s$. Then, starting in any state $s$ and given action $A$, the difference in optimal reward-to-go between states $s$ and $\hat s$ goes to $0$ as the time-scale of decay $\delta$ goes to $0$, i.e.
\[
\big| J^*_t(s) - E[J^*_t(\hat S(s, A))] \big| \to 0 \quad \text{as } \delta \to 0.
\]
Proof:
To begin, note that $J^*_t(s) \ge E[J^*_t(\hat S(s, A))]$. To see this, couple the job completion times and let $\tilde\pi$ denote a policy starting from state $s$ but mimicking the optimal policy starting from state $\hat S(s, A) = \hat s$, namely $\pi^*(\hat s)$. Under the $\tilde\pi$ policy, machine $n_k$ idles for $\sigma_k$ time slots before proceeding if $(k, n_k) \in A$: the $s$-system simply delays processing any new jobs until the replica jobs in the $\hat s$-system are completed. In this case, the completion times of jobs are identical under the $\tilde\pi$ and $\pi^*(\hat s)$ policies. Hence $J^*_t(\hat s) = J^{\tilde\pi}_t(s) \le J^*_t(s)$, by the optimality of $J^*_t(s)$.

Now, to show the convergence result, couple the job completion times under the $s$- and $\hat s$-systems. Let $\sigma^* = \max_{(k,n) \in A} \sigma_k$ be the maximal service time of the jobs in $A$. Consider a policy $\tilde\pi$ for the $\hat s$-system which idles for $\sigma^*$ time slots and begins processing new jobs at time $t' = t + \sigma^*$, but assumes that $t' = t$. Therefore, $\tilde\pi$ coincides precisely with $\pi^*(s)$ shifted in time by $\sigma^*$. In other words, the $\tilde\pi$ policy waits until $t'$, at which point all replica jobs are completed, and then begins processing new jobs as if no time has passed and $t' = t$. For the $s$-system, let $T^*_j$ be the completion time of job $j$ under the optimal policy $\pi^*(s)$. Then $T^{\tilde\pi}_j = T^*_j + \sigma^*$ is the completion time of job $j$ under the $\tilde\pi$ policy.
Now, given some $\epsilon > 0$:
\begin{align}
\big| J^*_t(s) - E[J^*_t(\hat s)] \big| &\le \big| J^*_t(s) - E[J^{\tilde\pi}_t(\hat s)] \big| \nonumber \\
&= \Big| E\Big[\sum_j w_j(T^*_j)\Big] - E\Big[\sum_j w_j(T^{\tilde\pi}_j)\Big] \Big| \nonumber \\
&= \Big| \sum_j E\big[w_j(T^*_j) - w_j(T^*_j + \sigma^*)\big] \Big| \nonumber \\
&\le J\delta < \epsilon \tag{26}
\end{align}
The last inequality comes from Lemma 6: each term is at most $\delta$, and since $\delta \to 0$ there exists $\delta < \epsilon / J$. □

Now we are in position to prove that the performance of the greedy policy approaches the performance of the optimal policy when the decay of rewards is slow compared to the job completion times.
Theorem 4 (Slowly Decaying Rewards) For any state $s \in \mathcal{S}$, as the time-scale of decay goes to $0$, i.e. $\delta \to 0$, the performance of the greedy policy approaches that of the optimal policy:
\[
J^g_t(s) \to J^*_t(s).
\]
Proof:
The proof is by induction on the number of jobs remaining to begin processing. Clearly, when only one job remains, the greedy and optimal policies coincide. Now assume the claim is true with $K - 1$ jobs remaining; we show it is true for $K$ jobs.

Denote by $\mathcal{J}^*$ and $\mathcal{J}^g$ the sets of jobs processed by the optimal and greedy policies in state $s$. Recall that, by Lemma 2, $|\mathcal{J}^*| = |\mathcal{J}^g|$. Let $A^*$ and $A^g$ denote the optimal and greedy scheduling actions, respectively. As before, $\tilde S(s, A)$ is the next state given that we start in state $s$ and take action $A$, and $\hat S(s, A)$ is the state with the machines in $A$ occupied by replica jobs which generate no reward.

Suppose we are given $\epsilon > 0$. Define $\delta_{\epsilon,1}$ such that for all $\delta < \delta_{\epsilon,1}$, $| J^*_t(s) - E[J^*_t(\hat S(s, A))] | < \epsilon/2$; this is possible by Lemma 7. Define $\delta_{\epsilon,2}$ such that for all $\delta < \delta_{\epsilon,2}$, $| J^*_t(s') - J^g_t(s') | < \epsilon/2$ for any $s'$ with $K - 1$ jobs remaining; this is possible by our inductive hypothesis. Let $\delta_\epsilon = \min\{\delta_{\epsilon,1}, \delta_{\epsilon,2}\}$. For any $\delta < \delta_\epsilon$:
\begin{align}
J^*_t(s) &\le E[J^*_t(\hat S(s, A^g))] + \epsilon/2 \nonumber \\
&\le \sum_{j \in \mathcal{J}^g} E[w_j(t + \sigma_j)] + E[J^*_t(\tilde S(s, A^g))] + \epsilon/2 \nonumber \\
&\le \sum_{j \in \mathcal{J}^g} E[w_j(t + \sigma_j)] + E[J^g_t(\tilde S(s, A^g))] + \epsilon \nonumber \\
&= J^g_t(s) + \epsilon \nonumber
\end{align}
The first inequality is due to Lemma 7, for state $s$ and the action given by $A^g$. The second inequality is by Lemma 4, removing the greedy jobs. The third inequality is by the inductive hypothesis.

By the optimality of $J^*_t$, $| J^*_t(s) - J^g_t(s) | = J^*_t(s) - J^g_t(s)$. So for $\delta < \delta_\epsilon$, $| J^*_t(s) - J^g_t(s) | < \epsilon$, which proves our claim. □

This result is intuitive: as the time-scale of decay becomes negligible relative to the time-scale of job completion times, rewards can be viewed as essentially constant. As such, it does not matter in which order the jobs are completed, since all will be completed.
Hence, any policy, and certainly the greedy policy, is nearly optimal. However, the convergence rate to optimality will vary across policies.
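This convergence can be illustrated with a toy computation (a hypothetical two-job instance, not from the paper): job 1's reward drops by $\delta$ after its first slot, while job 2's reward is constant and marginally larger, so greedy defers job 1 and forfeits exactly $\delta$; the optimal/greedy ratio therefore tends to $1$ as the decay time-scale $\delta \to 0$:

```python
from itertools import permutations

# Hypothetical instance illustrating Theorem 4: two unit-service-time jobs on
# one machine. Job 1's reward loses `delta` after slot 1; job 2's reward is
# constant and marginally larger, so greedy serves job 2 first.
def ratio(delta):
    sigma = {1: 1, 2: 1}
    w = {1: lambda t: 2.0 - (delta if t >= 2 else 0.0),
         2: lambda t: 2.01}
    def value(order):
        t, v = 0, 0.0
        for j in order:
            t += sigma[j]
            v += w[j](t)
        return v
    opt = max(value(p) for p in permutations(sigma))
    # greedy: maximize w_j(t + sigma_j) / sigma_j at each decision epoch
    t, g, free = 0, 0.0, set(sigma)
    while free:
        j = max(free, key=lambda j: w[j](t + sigma[j]) / sigma[j])
        t += sigma[j]
        g += w[j](t)
        free.remove(j)
    return opt / g

for d in (1.0, 0.1, 0.01):
    print(d, ratio(d))   # ratio shrinks toward 1 as the decay rate d shrinks
```

Here greedy's loss is exactly $\delta$, matching the intuition of Theorem 4: the gap vanishes with the decay time-scale, while the rate of convergence depends on the instance.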
In the previous sections, we have shown performance guarantees for a greedy policy when scheduling jobs with decaying rewards. In light of Example 1 and Theorem 2, the loss in performance due to use of the greedy policy can be at least $\Delta + 1$ but can do no worse than $\Delta + 2$. In this section, we show that, in practice, the greedy performance is likely to be much better.

In order to enable computation of an optimal policy, we assume that the number of jobs is finite and small. Even with a finite number of jobs, $|\mathcal{S}|$ grows exponentially fast in several problem parameters, which forces us to limit the size of the problem instances we consider. In particular, we consider problems with a single machine and jobs with finite deadlines; that is, no reward is accrued after $t = 100$. We assume job completion times are Geometric, with the $p_j$ evenly spaced between $p_{\min}$ and $p_{\max}$. Since there is no closed-form distribution for $\sigma_{\max}$, see Appendix A for how to find an upper bound on $E[\sigma_{\max}]$ and, subsequently, an upper bound on $\Delta$. We consider a number of decaying reward functions, depicted in Fig. 6.

[Figure 6: Different types of decaying reward functions $w(t)$: (a) step, (b) linear, (c) exponential, (d) parabolic, (e) 2-step.]

The constants defining each reward function are drawn uniformly; all experimental results are averaged over different realizations of these constants, with repeated experiments for each such set.

In Table 1, we summarize the performance of the greedy policy for the reward functions shown in Fig. 6. Here $\min_j E[\sigma_j] = 1/p_{\max}$, and the bound on $E[\max_j \sigma_j]$ from Appendix A yields an upper bound on $\Delta$. We can see that while the optimal policy achieves larger reward than the greedy policy, the observed gap is far smaller than the guarantee provided by Theorem 2.
Because we have finite deadlines for each job, there exists some $T_{\max}$ such that, for all $j$, $w_j(t) = 0$ for all $t > T_{\max}$. Therefore, the reward function with exponential decay is slightly modified from the standard notion of exponential decay, in which $w_j(t) \to 0$ but $w_j(t) > 0$ for any $t < \infty$. Hence, the greedy policy is not optimal for this exponential decay with a finite deadline.

It is interesting to note that the performance of the greedy policy seems to degrade as the number of jobs increases. We examine this more closely in Fig. 7 under a step reward function, where rewards are constant until a fixed deadline, as in Fig. 6a. Clearly, the greedy and optimal policies coincide when there is only one job. As the number of jobs increases, the performance of the greedy policy degrades; however, the loss in performance is much less than the bound of Theorem 2 guarantees. That is a worst-case bound, and while there are degenerate cases whose performance approaches it, it seems that in practice the performance of the greedy policy is likely to be much better.

Table 1: Performance of the greedy policy versus the optimal policy ($J^*_t / J^g_t$) for different types of decaying reward functions and $J$ jobs.

Type         J = 2    J = 5    J = 8
Step         1.0065   1.0931   1.1287
Linear       1.0133   1.0576   1.1289
Exponential  1.0609   1.0433   1.0590
Parabolic    1.0265   1.0382   1.0667
2-step       1.0218   1.1007   1.1520

Figure 7: Performance loss ($J^*_t / J^g_t$) as the number of jobs ($J$) increases.

From Theorem 2, the performance of the greedy policy depends on $\Delta$, the ratio between the largest and smallest expected service times. In our previous experiments, we have seen that $J^*_t / J^g_t$ is far below the worst-case bound. We now examine whether the performance of the greedy policy varies significantly as we change $\Delta$. We fix $p_{\max}$ and vary $p_{\min}$; this varies the upper bound on $\Delta$ (as calculated in Appendix A), $\Delta_{UB}$. In Fig. 8, we see how the performance of the greedy policy ($J^*_t / J^g_t$) varies with $\Delta$.
As expected, as $\Delta$ increases, so does the loss in performance. However, it is interesting to note that $\Delta$ must be very large before the degradation in performance is significant. In fact, over a large range of $\Delta_{UB}$, $J^*_t / J^g_t$ is nearly constant and the greedy policy performs close to optimal. Even when $\Delta = 260$, $J^*_t / J^g_t$ remains far below the worst-case guarantee.
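These experiments can be mimicked on toy instances: with deterministic service times and a single machine, the optimal schedule is simply the best job order, found by brute force, and can be compared against the myopic order to produce a ratio analogous to $J^*_t / J^g_t$. The instance below is made up for illustration; it is not one of the paper's experiments:

```python
from itertools import permutations

def total_reward(order, sigmas, rewards):
    """Reward of completing jobs in the given order (deterministic sigmas)."""
    t, total = 0, 0.0
    for j in order:
        t += sigmas[j]
        total += rewards[j](t)
    return total

def greedy_order(sigmas, rewards):
    """Myopic order: repeatedly start the job with the largest immediate reward."""
    t, order, free = 0, [], set(range(len(sigmas)))
    while free:
        j = max(free, key=lambda i: rewards[i](t + sigmas[i]))
        free.remove(j)
        order.append(j)
        t += sigmas[j]
    return order

# toy instance: three jobs with linearly decaying rewards of different slopes
sigmas = [2, 3, 5]
rewards = [lambda t: max(10.0 - t, 0.0),
           lambda t: max(12.0 - 2 * t, 0.0),
           lambda t: max(20.0 - t, 0.0)]

opt = max(total_reward(list(o), sigmas, rewards)
          for o in permutations(range(3)))
grd = total_reward(greedy_order(sigmas, rewards), sigmas, rewards)
ratio = opt / grd  # always >= 1; small values mean greedy is near-optimal
```

On this instance the myopic rule grabs the large slowly-decaying reward first and forfeits the steeply-decaying one, so the ratio exceeds 1, mirroring the moderate losses in Table 1.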
Figure 8: Performance loss ($J^*_t / J^g_t$) as $\Delta_{UB}$ increases.

This loss is much better than the theory guarantees. Depending on the system parameters, $\Delta$ can be arbitrarily large, which would lead to arbitrarily large degradation in the performance of the greedy policy. While we have seen via Example 1 that the performance of the greedy policy can be highly dependent on $\Delta$, we suspect this to be a degenerate example. We expect that, in practice, the performance of the greedy policy will be closer to the performance of the optimal policy.

In this paper, we have studied online stochastic non-preemptive scheduling of jobs with decaying rewards. Arbitrary decaying reward functions enable this model to capture various distastes for delay that are more general than the standard exponential or linear decay, as well as fixed (random or deterministic) deadlines. Using stochastic dynamic programming techniques, we are able to show that a greedy heuristic is guaranteed to be within a factor of $\Delta + 2$ of optimal, where $\Delta = \mathbb{E}[\max_j \sigma_j] / \min_j \mathbb{E}[\sigma_j]$ is the ratio of longest to shortest expected service times. While there exist degenerate scenarios where the performance loss of the proposed policy is near this worst-case bound, we expect the performance loss to be much smaller in many practical scenarios of interest.

This is a first look at non-preemptive scheduling with arbitrary decaying rewards. Some questions that remain are how to account for job arrivals and processor-dependent service times. When there are job arrivals, due to the non-preemptive service discipline, it may be optimal for a machine to idle in order to be free upon arrival of the new job. However, doing so requires some estimate or knowledge of future job arrivals, which may not be available. With processor-dependent service times, optimal policies may also call for idling. Consider a scenario where one machine is much faster than the rest.
Then an optimal policy may process all jobs on this fast machine, causing the other machines to idle. Allowing for idling policies significantly complicates the optimization problem at hand. One option is to consider only non-idling policies and maximize reward over this class of policies. It can be shown via a highly degenerate example that requiring non-idling service disciplines can significantly degrade performance. However, for many scenarios this constraint is very natural. For instance, in service applications such as health-care facilities, making customers (patients) wait when there are available servers (doctors) is unlikely to be tolerated.

These are just some extensions to the general model we have analyzed. In this paper, we have considered the performance of an online scheduling algorithm for jobs with arbitrary decaying rewards. We have shown a worst-case performance bound for this policy compared to the optimal offline algorithm. While there are some rare instances in which the loss in performance of the proposed greedy policy is significant, in practice we expect the performance loss to be small. This, along with the simplicity of the algorithm, makes it highly desirable for real-world implementation.
A   A Bound on $\sigma_{\max}$

Suppose the service time of job $j$ is Geometrically distributed with parameter $p_j$. Furthermore, $p_j$ is uniformly distributed on $[p_{\min}, p_{\max}]$.

Using the fact that $\sigma_j$ is Geometrically distributed, i.e. $P(\sigma_j \le x) = 1 - (1 - p_j)^x$, gives:
$$
P(\sigma_{\max} > x) = 1 - \prod_{j=1}^{J} P(\sigma_j \le x)
= 1 - \prod_{j=1}^{J} \left(1 - (1 - p_j)^x\right)
\le 1 - \left(1 - (1 - p_{\min})^x\right)^J \qquad (27)
$$
Taking the expectation of $\sigma_{\max}$ gives:
$$
\mathbb{E}[\sigma_{\max}] = \sum_{x=0}^{\infty} P(\sigma_{\max} > x)
\le \sum_{x=0}^{\infty} \left[1 - \left(1 - (1 - p_{\min})^x\right)^J\right] \qquad (28)
$$
We can numerically evaluate (28) to get an upper bound on $\mathbb{E}[\sigma_{\max}]$ and, hence, an upper bound on $\Delta$. In particular, since $\min_j \mathbb{E}[\sigma_j] \ge 1/p_{\max}$,
$$
\Delta \le \Delta_{UB} = p_{\max} \sum_{x=0}^{\infty} \left[1 - \left(1 - (1 - p_{\min})^x\right)^J\right].
$$
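The sum in (28) converges quickly, since the summand decays geometrically once $(1-p_{\min})^x$ is small. A small numerical sketch of evaluating $\Delta_{UB}$; the truncation point and stopping tolerance are our own choices:

```python
def delta_ub(p_min, p_max, J, x_max=100000):
    """Upper bound Delta_UB = p_max * sum_x [1 - (1 - (1-p_min)^x)^J].

    The summand is the tail bound on P(sigma_max > x) from (27); the sum
    approximates E[sigma_max], and p_max >= 1/min_j E[sigma_j] for
    Geometric(p_j) service times with p_j <= p_max.
    """
    tail_sum = 0.0
    for x in range(x_max + 1):
        term = 1.0 - (1.0 - (1.0 - p_min) ** x) ** J
        tail_sum += term
        if term < 1e-12:  # remaining tail is geometrically small; safe to stop
            break
    return p_max * tail_sum
```

As a sanity check, with a single job ($J = 1$) and $p_{\min} = p_{\max} = p$ the sum reduces to $\mathbb{E}[\sigma] = 1/p$, so $\Delta_{UB} = 1$, and the bound grows with the number of jobs $J$, as expected.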