Reinforcement Learning-based Admission Control in Delay-sensitive Service Systems
Majid Raeis, Ali Tizghadam and Alberto Leon-Garcia
Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada
Emails: [email protected], [email protected] and [email protected]
Abstract—Ensuring quality of service (QoS) guarantees in service systems is a challenging task, particularly when the system is composed of more fine-grained services, such as service function chains. An important QoS metric in service systems is the end-to-end delay, which becomes even more important in delay-sensitive applications, where the jobs must be completed within a time deadline. Admission control is one way of providing an end-to-end delay guarantee, where the controller accepts a job only if it has a high probability of meeting the deadline. In this paper, we propose a reinforcement learning-based admission controller that guarantees a probabilistic upper-bound on the end-to-end delay of the service system, while minimizing the probability of unnecessary rejections. Our controller only uses the queue length information of the network and requires no knowledge about the network topology or system parameters. Since long-term performance metrics are of great importance in service systems, we take an average-reward reinforcement learning approach, which is well suited to infinite horizon problems. Our evaluations verify that the proposed RL-based admission controller is capable of providing probabilistic bounds on the end-to-end delay of the network, without using system model information.
Index Terms—Admission control, queueing networks, reinforcement learning, delay-sensitive applications.

I. INTRODUCTION
Providing quality of service (QoS) guarantees in complex service systems is often a challenging task. Service function chaining (SFC) is one such example, in which the end-to-end service is provided through a sequence of service functions (SFs) such as firewalls, load balancers and deep packet inspectors. One important QoS metric in service chains is the end-to-end delay, which is particularly important in delay-sensitive applications in which a job must be completed within some specific deadline. Admission control (AC) is one way of providing QoS, where the controller offers end-to-end delay guarantees by rejecting the jobs that are likely to violate the delay requirement. This may also result in a higher throughput, since there will be more room for future arrivals.

An important challenge in designing controllers for service networks is the lack of knowledge about the dynamics of the system, particularly when the system becomes more complex. This is a reason why classic network control algorithms often fall short on practicality. Reinforcement learning (RL) is a natural candidate for dealing with this issue. In the RL framework, the agent (controller) interacts with the environment (network system) and optimizes its policy without knowledge about the dynamics or topology of the system. In this paper, we focus on RL-based control mechanisms for providing QoS in service systems, without limiting ourselves to a particular application.

One of the earliest works on the use of reinforcement learning for call admission control is [1], which aims to maximize the earned revenue while providing QoS guarantees to the users. An RL-based call admission controller for cellular networks has been proposed in [2], which improves the quality of service and reduces the call-blocking probability of hand-off calls. Service function chaining is another example of service networks in which QoS can be of great importance. The authors in [3] proposed a DQN (deep Q-learning) based QoS/QoE-aware service function chaining scheme in NFV-enabled 5G networks, considering QoS metrics such as delay, throughput and bandwidth. The authors of [4], [5] proposed an automatic service and admission controller, called AutoSAC, with an application in network function chaining. AutoSAC provides a service controller for automatic VNF scaling, and an admission controller that guarantees that the accepted jobs meet their end-to-end deadlines. This admission controller uses the worst-case expected delays to make its decisions, which involve loose estimates of the delay and therefore jeopardise the throughput. In [6], the authors proposed a deep reinforcement learning approach to handle complex and dynamic SFC embedding scenarios in IoT, where the average SFC processing delay has been used as the primary embedding objective. Another body of literature studies the problem of QoS measurement and control in the context of general queueing systems. The end-to-end delay distributions of tandem and acyclic queueing networks have been studied in [7], [8], where mixture density networks (MDNs) [9] have been used to learn the distributions, which are then used for providing probabilistic bounds on the end-to-end delay of the network. The authors in [10] have used a general queueing model to learn a network control policy. More specifically, a model-based reinforcement learning approach has been used to find the optimal policy that minimizes the average job delay in queueing networks.
The reader can refer to [11]–[13] for some related works in other service system contexts. In this paper, we use queueing theory as a general framework for modeling service systems, and therefore our results are applicable to a wide range of applications. In contrast to the prior works, we adopt an average-reward reinforcement learning approach, which is well-suited for non-episodic tasks where long-term performance metrics are of great importance [9], [14]. Our main contributions can be summarized as follows:

• We propose an RL-based admission controller that provides a probabilistic upper-bound on the end-to-end delay of the system, whereas most of the existing works focus on the average delay, which is less informative.

• Our controller is able to minimize the probability of unnecessary rejections. In order to accomplish this, we use a simulated environment in parallel with the original environment to determine whether a rejection decision was a good decision, or a wrong choice such that the job could have been accepted.

• In contrast to the classical queueing methods, our controller only observes the queue length information and does not require any information about the network topology, service or inter-arrival time distributions.

The paper is organized as follows. In Section II, we describe the queueing system model and formulate the admission control problem as an optimization problem. We briefly review some basic concepts in reinforcement learning, especially the average-reward setting, in Section III. In Section IV, we formulate the problem as an average-reward reinforcement learning problem and discuss the implementation challenges that need to be addressed. The evaluation of the proposed RL-based admission controller is presented in the context of service function chaining in Section V. Finally, Section VI presents conclusions.

II. SYSTEM MODEL AND PROBLEM SETTING
We consider multi-server queueing systems, with First Come First Serve (FCFS) service discipline, as the building blocks of the service networks that we study. Furthermore, we study tandem queues and simple acyclic queueing networks, as shown in Fig. 1. In a tandem topology, a customer must go through all the stages to receive the end-to-end service, while in an acyclic topology, the customers randomly go through one of the branches with the specified probabilities in Fig. 1b. We do not assume specific distributions for the service times or the inter-arrival times and therefore, these processes can have arbitrary stationary distributions.

[Fig. 1. Network topologies (a) Tandem queue (b) Acyclic queue.]

Now, consider a network consisting of $N$ queueing systems, where system $n$, $1 \leq n \leq N$, is a multi-server queueing system with $c_n$ homogeneous servers having service rates $\mu_n$. Let $q_n$ denote the queue length of the $n$th queueing system upon arrival of a job at the entrance of the network (1st queue). The end-to-end delay of a new arrival is represented by $d$. Moreover, we consider an admission controller at the entrance of the network, which decides whether to accept or reject an incoming job based on the queue length information of all the constituent systems, i.e., $(q_1, q_2, \cdots, q_N)$. The policy of the admission controller is represented by $\pi$, and the acceptance and rejection actions are denoted by $A$ and $R$, respectively.

As mentioned earlier, the reason for considering the admission controller is to provide some sort of QoS to the customers. More specifically, our goal in designing the admission controller is to guarantee a probabilistic upper-bound on the end-to-end delay of the accepted jobs, i.e., $P(d > d_{ub} \mid A) \leq \epsilon_{ub}$, where $d_{ub}$ denotes the upper-bound and $\epsilon_{ub}$ is the violation probability. Many different policies may result in the same probabilistic upper-bound, and we are interested in the policy that results in the minimum probability of unnecessary rejections. Therefore, we can express the admission control design as an optimization problem:

$$\max_{\pi} \; -p^a_{\pi}(R)\, P(d < d_{ub} \mid R) \quad \text{s.t.} \quad P(d < d_{ub} \mid A) \geq 1 - \epsilon_{ub}, \qquad (1)$$

where $p^a_{\pi}(a)$, $a \in \{A, R\}$, is the steady-state probability of choosing action $a$ under policy $\pi$, and the objective is to minimize the probability of unnecessary rejection of an arrival that could have met the deadline, i.e., $P\big(R \cap (d < d_{ub})\big) = p^a_{\pi}(R)\, P(d < d_{ub} \mid A)\big|_{A \to R} = p^a_{\pi}(R)\, P(d < d_{ub} \mid R)$.
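To make the model concrete, the following is a minimal simulation sketch (ours, not the authors' code) of a multi-server FCFS tandem network that estimates the violation probability $P(d > d_{ub})$ without any admission control. The exponential distributions, parameter values and function names are placeholders; the model itself allows arbitrary stationary distributions.

```python
import heapq
import random

def simulate_tandem(num_jobs, arrival_rate, servers, rates, d_ub, seed=0):
    """Estimate P(d > d_ub) for an uncontrolled FCFS tandem network.

    servers[n] = number of homogeneous servers c_n at stage n,
    rates[n]   = per-server service rate mu_n at stage n.
    Note: jobs are processed in arrival order, ignoring possible
    overtaking between multi-server stages; adequate as a rough sketch.
    """
    rng = random.Random(seed)
    # free_at[n]: min-heap of times at which each server of stage n frees up
    free_at = [[0.0] * c for c in servers]
    for h in free_at:
        heapq.heapify(h)

    t, violations = 0.0, 0
    for _ in range(num_jobs):
        t += rng.expovariate(arrival_rate)        # next arrival at the 1st queue
        done = t
        for n, mu in enumerate(rates):
            earliest = heapq.heappop(free_at[n])  # earliest-free server (FCFS)
            begin = max(done, earliest)           # wait if all servers are busy
            done = begin + rng.expovariate(mu)    # service completion time
            heapq.heappush(free_at[n], done)
        if done - t > d_ub:                       # end-to-end delay d vs. d_ub
            violations += 1
    return violations / num_jobs

# e.g., a three-stage tandem as in Fig. 1a (all numbers illustrative):
print(simulate_tandem(100_000, 0.5, [3, 2, 2], [0.2, 0.3, 0.3], d_ub=15))
```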
III. BACKGROUND ON REINFORCEMENT LEARNING

In this section, we briefly review the basic concepts of reinforcement learning and discuss the average-reward setting. The basic elements of a reinforcement learning problem are the agent and the environment, which have iterative interactions with each other. The environment is modeled by a Markov decision process (MDP), which is specified by $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$, with state space $\mathcal{S}$, action space $\mathcal{A}$, state-transition probability matrix $\mathcal{P}$ and reward function $\mathcal{R}$. At each time step $t$, the agent observes a state $s_t \in \mathcal{S}$, takes action $a_t \in \mathcal{A}$, transits to state $s_{t+1} \in \mathcal{S}$ and receives a reward of $\mathcal{R}(s_t, a_t, s_{t+1})$. The agent's actions are defined by its policy $\pi$, where $\pi(a|s)$ is the probability of taking action $a$ in state $s$.

The formal definition of an agent's goal varies based on the problem setting. In episodic tasks, in which there is a notion of terminal state, the goal of the agent is to maximize the long-term expected return, where the return is defined as $G_t = \sum_{k=t+1}^{T} R_k$. In continuing settings, in which the interaction between the agent and the environment continues forever, we cannot use the same return function as before, since $T = \infty$ and the reward can easily become unbounded. One way of handling this issue is by using discounting. In this approach, the agent tries to maximize the expected discounted return, which is defined as $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$, $0 \leq \gamma < 1$. In the discounted setting, the agent cares more about the immediate rewards in the near future than about the delayed rewards in the far future. However, in many applications such as computer networks, we are interested in the average performance of the system in the long run and we care as much about the future as we do about the present. There is a third setting for formulating the goal in RL problems, which is called the average reward setting [9]. In this setting, the goal of the agent is to maximize the average reward per time step, which under the ergodicity assumption can be obtained as [9]

$$r(\pi) = \sum_{s} p^s_{\pi}(s) \sum_{a} \pi(a|s) \sum_{s', r} P(s', r \mid s, a)\, r, \qquad (2)$$

where $p^s_{\pi}(s)$ is the steady-state probability of being in state $s$ in a given time step following policy $\pi$. The average reward setting is a good candidate for formulating the goal in continuing environments, where we care about the future as much as the present. The differential return is defined as $G_t = \sum_{k=t+1}^{\infty} (R_k - r(\pi))$. Similar to the notion of action-value function in the discounted setting, we can define the differential action-value function (Q-function) as $Q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid s_t = s, a_t = a]$, which denotes the expected return starting from state $s$, taking action $a$, and following policy $\pi$. Defining the optimal Q-function as $Q^*(s, a) = \max_{\pi} Q_{\pi}(s, a)$, we can obtain the optimal policy as

$$\pi^*(a|s) = \begin{cases} 1 & \text{if } a = \arg\max_{a'} Q^*(s, a'), \\ 0 & \text{otherwise}. \end{cases} \qquad (3)$$

The optimal Q-function must satisfy the Bellman optimality equation for the average reward setting:

$$Q^*(s, a) = \mathbb{E}\left[ r - \max_{\pi} r(\pi) + \max_{a'} Q^*(s', a') \;\Big|\; s, a \right]. \qquad (4)$$

Similar to the discounted setting, the Bellman optimality equation is not usually used directly to obtain the optimal Q-function. Instead, we use an iterative method to solve the Bellman equation, which is similar to the well-known Q-learning algorithm and is called R-learning [15].
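As a toy numerical instance of (2) (all numbers are ours, purely illustrative), consider a two-state Markov chain induced by a fixed policy: the average reward is the expected one-step reward weighted by the stationary distribution.

```python
import numpy as np

# Toy two-state MDP under a fixed policy pi (illustrative numbers).
# P[s, s'] is the state-transition matrix induced by pi,
# R[s]     is the expected one-step reward in state s under pi.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
R = np.array([1.0, -2.0])

# The stationary distribution p_pi solves p = p P with sum(p) = 1,
# i.e., it is the left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
p = np.real(evecs[:, np.argmax(np.real(evals))])
p /= p.sum()

r_pi = p @ R    # average reward per step: Eq. (2) with actions marginalized out
print(p, r_pi)  # p = [0.8, 0.2], r(pi) = 0.4
```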
In R-learning, the Q-function is updated at each time step as

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} - \bar{r} + \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big], \qquad (5)$$

where $\alpha$ represents the learning rate and $\bar{r}$ is an approximation of $r(\pi)$. At each time step in which the behaviour policy acts greedily, i.e., $a_t = \arg\max_a Q(s_t, a)$, $\bar{r}$ is updated as

$$\bar{r} \leftarrow \bar{r} + \beta \big[ r_{t+1} - \bar{r} + \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \big], \qquad (6)$$

where $\beta$ is the step-size parameter.
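Updates (5) and (6) transcribe directly into code. The following tabular sketch is ours; the $\epsilon$-greedy exploration and the step-size values are illustrative choices, not prescribed by the paper.

```python
from collections import defaultdict
import random

class RLearning:
    """Tabular R-learning, Eqs. (5)-(6); hyper-parameters are illustrative."""

    def __init__(self, actions, alpha=0.01, beta=0.001, eps=0.1):
        self.Q = defaultdict(float)   # Q[(s, a)]; states must be hashable
        self.r_bar = 0.0              # running estimate of r(pi)
        self.actions, self.alpha, self.beta, self.eps = actions, alpha, beta, eps

    def greedy(self, s):
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def act(self, s):
        if random.random() < self.eps:            # explore
            return random.choice(self.actions), False
        return self.greedy(s), True               # greedy flag feeds Eq. (6)

    def update(self, s, a, r, s_next, was_greedy):
        delta = (r - self.r_bar
                 + max(self.Q[(s_next, b)] for b in self.actions)
                 - self.Q[(s, a)])
        self.Q[(s, a)] += self.alpha * delta      # Eq. (5)
        if was_greedy:
            self.r_bar += self.beta * delta       # Eq. (6), greedy steps only
```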
IV. ADMISSION CONTROL AS A REINFORCEMENT LEARNING PROBLEM

A. Problem Formulation
In this section, we formulate the admission control task as a reinforcement learning problem in the average-reward setting. Our environment is a tandem (acyclic) queueing network and the agent is the admission controller at the entrance of the network. The goal is to design an admission controller that minimizes the average number of unnecessary rejections per time step, while guaranteeing a probabilistic upper-bound on the end-to-end delay of the network. In order to achieve this goal, our controller can interact with the environment upon arrival of each job. Therefore, each time step is the interval between two consecutive job arrivals. Now, let us define the components of our reinforcement learning problem as follows:
• State: The vector of queue lengths of all the constituent queueing systems upon a job arrival, i.e., $s = (q_1, q_2, \cdots, q_N)$, where $q_n$ represents the queue length of the $n$th queueing system.
• Action: The possible actions are whether to accept ($a = A$) or reject ($a = R$) a job upon its arrival. Here we consider deterministic policies and therefore, the action will be a deterministic function of the state, i.e., $\pi(a|s) = 0$ or $1$, $a \in \{A, R\}$. Therefore, we use $a(s) = i$, $i \in \{A, R\}$, to denote the action taken in state $s$.
• Reward: Designing the reward function is the most challenging part of the problem. The reward function must be defined such that maximizing the average reward results in the desired goal formulated in (1). In order to simplify the discussion, we make the following assumption: we assume that we know the end-to-end delay of the network at any time $t$, i.e., the end-to-end delay that a potential arrival at time $t$ would experience if it were accepted by the admission controller. This is clearly an unrealistic assumption, since some of the accepted jobs might not have departed by the next time step, and some might even be rejected and therefore never go through the network. However, this assumption will help us in designing the reward function. We will discuss how the reward function can be calculated under realistic assumptions later in this section.

Now, consider the following reward function:

$$r_n = \begin{cases} r_{A_1} & \text{if } a = A \text{ and } d_n < d_{ub}, \\ r_{A_2} & \text{if } a = A \text{ and } d_n > d_{ub}, \\ r_{R_1} & \text{if } a = R \text{ and } d_n < d_{ub}, \\ r_{R_2} & \text{if } a = R \text{ and } d_n > d_{ub}, \end{cases} \qquad (7)$$

where $r_n$ denotes the immediate reward obtained in time step $n$, and $d_n$ represents the end-to-end delay of the $n$th arrival. The average reward per time step for policy $\pi$ can be written as

$$\begin{aligned} r(\pi) &= \sum_{s} p^s_{\pi}(s)\, \mathbb{E}_{\pi}[r \mid s] \\ &= \sum_{s \,:\, a(s)=A} p^s_{\pi}(s)\, \mathbb{E}_{\pi}[r \mid s] + \sum_{s \,:\, a(s)=R} p^s_{\pi}(s)\, \mathbb{E}_{\pi}[r \mid s] \\ &= \sum_{s \,:\, a(s)=A} p^s_{\pi}(s) \big[ r_{A_1} P(d < d_{ub} \mid s) + r_{A_2} P(d > d_{ub} \mid s) \big] \\ &\quad + \sum_{s \,:\, a(s)=R} p^s_{\pi}(s) \big[ r_{R_1} P(d < d_{ub} \mid s) + r_{R_2} P(d > d_{ub} \mid s) \big], \end{aligned} \qquad (8)$$

where $P(d < d_{ub} \mid s, \pi)$ is written as $P(d < d_{ub} \mid s)$ for simplicity. Furthermore, since for a given action $a \in \{A, R\}$ we have

$$\sum_{s \,:\, a(s)=a} p^s_{\pi}(s)\, P(d < d_{ub} \mid s) = P\big((d < d_{ub}) \cap a\big) = p^a_{\pi}(a)\, P(d < d_{ub} \mid a),$$

where $p^a_{\pi}(a) = \sum_{s} p^s_{\pi}(s)\, \pi(a|s)$, we can write Eq. (8) as

$$r(\pi) = p^a_{\pi}(A) \big[ (r_{A_1} - r_{A_2})\, P(d < d_{ub} \mid A) + r_{A_2} \big] + p^a_{\pi}(R) \big[ (r_{R_1} - r_{R_2})\, P(d < d_{ub} \mid R) + r_{R_2} \big]. \qquad (9)$$

Now, we should choose the parameters of the reward function such that the goal formulated in (1) is achieved. Let us first define the Lagrangian function associated with problem (1) as

$$L(\pi, \lambda) = -p^a_{\pi}(R)\, P(d < d_{ub} \mid R) + \lambda \big( P(d < d_{ub} \mid A) - (1 - \epsilon) \big), \qquad (10)$$

where $\lambda$ is the Lagrange multiplier associated with the QoS constraint in our optimization problem. The Lagrangian dual function is defined as $g(\lambda) = \max_{\pi} L(\pi, \lambda)$, and therefore the dual problem is

$$\min_{\lambda} \; g(\lambda), \quad \text{s.t.} \quad \lambda \geq 0. \qquad (11)$$

Now, using Eq. (9) and choosing $r_{A_1} = \epsilon\lambda$, $r_{A_2} = -(1-\epsilon)\lambda$, $r_{R_1} = -1$ and $r_{R_2} = 0$, we have

$$r_{\lambda}(\pi) = -p^a_{\pi}(R)\, P(d < d_{ub} \mid R) + \lambda\, p^a_{\pi}(A) \big( P(d < d_{ub} \mid A) - (1 - \epsilon) \big), \qquad (12)$$

where we use $r_{\lambda}(\pi)$ instead of $r(\pi)$ to emphasize its dependence on $\lambda$. As can be seen from Eqs. (10) and (12), $r_{\lambda}(\pi)$ corresponds to the Lagrangian function of the same problem as (1), where both sides of the inequality constraint are multiplied by $p^a_{\pi}(A)$, i.e.,

$$\max_{\pi} \; -p^a_{\pi}(R)\, P(d < d_{ub} \mid R) \quad \text{s.t.} \quad p^a_{\pi}(A)\, P(d < d_{ub} \mid A) \geq p^a_{\pi}(A)(1 - \epsilon). \qquad (13)$$

Since we are interested in scenarios in which a policy with $p^a_{\pi}(A) > 0$ is feasible, problems (1) and (13) become equivalent, and maximizing the average reward $r_{\lambda}(\pi)$ with respect to $\pi$ is the same as computing the Lagrangian dual function associated with problem (13), i.e.,

$$\tilde{g}(\lambda) = \max_{\pi} \tilde{L}(\pi, \lambda) = \max_{\pi} r_{\lambda}(\pi), \qquad (14)$$

where $\tilde{L}(\pi, \lambda) = r_{\lambda}(\pi)$ is the Lagrangian function associated with problem (13).
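The resulting reward rule (7), with the parameter choice $r_{A_1} = \epsilon\lambda$, $r_{A_2} = -(1-\epsilon)\lambda$, $r_{R_1} = -1$, $r_{R_2} = 0$, fits in a few lines; the function name and signature below are ours.

```python
def reward(action, delay, d_ub, eps, lam):
    """Immediate reward of Eq. (7) with r_A1 = eps*lam, r_A2 = -(1-eps)*lam,
    r_R1 = -1 and r_R2 = 0, so that maximizing the average reward evaluates
    the Lagrangian r_lambda(pi) of Eq. (12)."""
    met = delay < d_ub
    if action == "A":
        return eps * lam if met else -(1.0 - eps) * lam
    return -1.0 if met else 0.0   # rejection: penalized only when the job
                                  # would actually have met the deadline
```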
In view of (14), $\lambda$ can be seen as a hyper-parameter of our RL problem, and choosing the proper $\lambda$ results in achieving the goal formulated in (1). It should be noted that, based on the KKT (Karush-Kuhn-Tucker) conditions, the optimal point $\lambda^*$ must satisfy $\lambda^* \big( P(d < d_{ub} \mid A) - (1 - \epsilon_{ub}) \big) = 0$.

B. Implementation Challenges
As mentioned earlier in this section, the immediate reward for a given time step depends on the end-to-end delay of the job arriving at the beginning of that time step. There are two practical issues regarding this design of the reward function that must be addressed. First, there is no guarantee that an accepted job will finish its end-to-end service by the next time step, and therefore the immediate reward cannot be calculated for the corresponding action until the job has departed. The second practical issue is that the rejected jobs never go through the network and therefore, the end-to-end delay is not defined for those jobs. We discuss how these issues can be addressed in this subsection.

Let us first define $\mathcal{D}_t$ as the set of departed jobs in time step $t$. Therefore, in time step $t$, the end-to-end delay and the immediate reward can be calculated for the jobs in $\mathcal{D}_t$. Hence, instead of updating the Q-function at time step $t$ only based on the experience gained in the current time step, we use the experiences of the jobs departed in time step $t$, i.e., the jobs belonging to $\mathcal{D}_t$. As shown in Fig. 2, at any time step $t$, we store the current state $s_t$, the taken action $a_t$ and the next state of the environment $s'_t$ as an incomplete experience tuple, i.e., $(s_t, a_t, s'_t, r_t = ?)$, in a buffer. Once the job has departed, the environment returns the corresponding reward $r_t$ and we can then use the complete experience tuple $(s_t, a_t, s'_t, r_t)$ to update the Q-function. Since at time step $t$ we have $|\mathcal{D}_t|$ departures and therefore $|\mathcal{D}_t|$ new complete experience tuples, the Q-function may either not get updated in this time step ($|\mathcal{D}_t| = 0$), or get updated multiple times ($|\mathcal{D}_t| > 1$). It should be noted that the main purpose of the buffer is to store the incomplete experiences until their corresponding rewards become available. However, this buffer can also be used as a replay buffer in deep Q-learning settings, where a mini-batch method is used to update the weights by sampling experiences uniformly from the replay buffer.

Now, let us address the second practical issue, regarding the rejected jobs. Since the rejected jobs do not go through the network, the real environment cannot be used to produce the rewards as defined in Eq. (7). Instead, we can use a simulated model to generate hypothetical experiences and use them to train the controller. As shown in Fig. 2, whenever the agent rejects a job, we simulate a parallel environment, shown in grey, with the same state as the original environment upon the job arrival, and send the job into the simulated network to measure its end-to-end delay. If the end-to-end delay is smaller than $d_{ub}$, the simulated environment returns a reward of $r_{R_1}$; otherwise it returns $r_{R_2}$. It should be mentioned that only the immediate reward is generated by the simulated environment, while the next state comes from the real environment. Furthermore, since the simulated environment can be fast-forwarded, the end-to-end delay and therefore the corresponding reward can be calculated within the same time step. As a result, we can use the simulated experience to update the policy in the same time step. Algorithm 1 shows our R-learning-based algorithm for training the controller.

[Fig. 2. Illustration of our reinforcement learning problem. $\mathcal{D}_t$ denotes the set of rewards that correspond to the departed jobs, or possibly the rejected job, in time step $t$.]
[Fig. 3. Performance of the admission controller for different values of λ: (a) maximized average reward as a function of λ (Eq. (14)); (b) QoS violation probability.]

Algorithm 1: Admission control using R-learning

  Initialize r̄ and Q(s, a) arbitrarily for all s ∈ S, a ∈ A
  Initialize s_1
  for t = 1 to max_step do
      if rand(·) < ε then
          Choose action a_t randomly; flag_t = FALSE
      else
          a_t = argmax_a Q(s_t, a); flag_t = TRUE
      end
      Take action a_t and observe s_{t+1} and r_t = {r_i | i ∈ D_t}
      if t ∉ D_t then
          Store incomplete experience (s_t, a_t, s_{t+1}, −) and flag_t in the buffer
      end
      for i in D_t do
          Retrieve (s_i, a_i, s_{i+1}, −) and flag_i from the buffer and append r_i
          δ_i ← r_i − r̄ + max_a Q(s_{i+1}, a) − Q(s_i, a_i)
          Q(s_i, a_i) ← Q(s_i, a_i) + α δ_i
          if flag_i = TRUE then
              r̄ ← r̄ + β δ_i
          end
      end
  end
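A Python rendering of Algorithm 1 is sketched below, reusing the RLearning class and reward() helper from above. The env and sim objects and their interfaces are hypothetical stand-ins for the real network and the resettable parallel copy used to score rejections; they are ours, not the paper's.

```python
def train(agent, env, sim, d_ub, eps_ub, lam, max_step=200_000):
    """R-learning admission control (Algorithm 1, sketched).

    Assumed (hypothetical) interfaces:
      env.state()  -> queue-length vector upon a job arrival
      env.step(a)  -> (next_state, departed), where departed is a set of
                      (job_id, delay) pairs and job_id is the time step at
                      which the job was accepted
      sim.delay_if_accepted(state) -> fast-forwarded end-to-end delay of a
                      rejected job in the parallel simulated copy
    """
    pending = {}                              # job_id -> (s, a, s_next, flag)
    s = env.state()
    for t in range(max_step):
        a, was_greedy = agent.act(s)
        s_next, departed = env.step(a)
        if a == "A":
            # Reward unknown until the job departs: store incomplete tuple.
            pending[t] = (s, a, s_next, was_greedy)
        else:
            # Rejected job never traverses the network: score it in the
            # simulated copy and update in the same time step.
            d = sim.delay_if_accepted(s)
            r = reward("R", d, d_ub, eps_ub, lam)
            agent.update(s, "R", r, s_next, was_greedy)
        for job_id, delay in departed:        # complete stored experiences
            si, ai, sn, fl = pending.pop(job_id)
            r = reward("A", delay, d_ub, eps_ub, lam)
            agent.update(si, ai, r, sn, fl)
        s = s_next
```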
TABLE I. SIMULATION PARAMETERS

Topology | Num. of servers [c_1, ..., c_N] | Service rates [µ_1, ..., µ_N]
Tandem   | [3, –, 2]                       | [0.–, 0.–, 0.–]
Acyclic  | [5, –, –, 2]                    | [0.–, 0.–, 0.–, 0.–]

Distribution         | Parameters
Gamma (Arrival)      | λ = 0.–, SCV = –
Gamma (Service time) | SCV = –

V. EVALUATION AND RESULTS
As discussed in Section I, service function chaining is one example of a service network that can benefit from our RL-based admission controller. In this section, we evaluate the performance of our admission controller in two different scenarios. In the first scenario, we consider a service chain as in Fig. 5a, where the goal is to provide a probabilistic upper-bound on the end-to-end delay of the accepted jobs. In the second scenario, we consider a service chain with a topology as in Fig. 5b, where the application requires that the jobs meet an end-to-end deadline, otherwise they are considered useless.

Let us start with the first experiment. We model the service function chain with a tandem queueing system as in Fig. 1a, the parameters of which are summarized in Table I. We assume that the job inter-arrival times and the service times have Gamma distributions (Table I). In both experiments, time is normalized by the mean service time of the ingress queue, i.e., $1/(c_1 \mu_1)$. Our goal is to design an admission controller that provides an upper-bound of $d_{ub} = 15$ on the end-to-end delay of the jobs, with violation probability $\epsilon_{ub}$ and minimum unnecessary job rejections. In order to achieve this goal, we use our proposed RL-based admission controller. As discussed in Section IV, the reward function parameters in Eq. (7) are set to $r_{A_1} = \epsilon_{ub}\lambda$, $r_{A_2} = -(1-\epsilon_{ub})\lambda$, $r_{R_1} = -1$ and $r_{R_2} = 0$, where $\lambda = \lambda^*$ is the optimal solution to Eq. (14). As mentioned earlier, the optimal value of the hyper-parameter $\lambda$, i.e., $\lambda^*$, must satisfy $\lambda^* \big( P(d < d_{ub} \mid A) - (1 - \epsilon_{ub}) \big) = 0$, where $\lambda^* \geq 0$. Therefore, at the optimal point either $\lambda^* = 0$ or $P(d > d_{ub} \mid A) = \epsilon_{ub}$. Fig. 3a shows the optimized average reward, i.e., $\tilde{g}(\lambda) = \max_{\pi} r_{\lambda}(\pi)$, as a function of $\lambda$. Based on Eq. (14), $\lambda^*$ can be obtained by finding the $\lambda$ for which the optimized average reward function, i.e., the Lagrangian dual function $\tilde{g}(\lambda)$, is minimized. As shown in Fig. 3a, the minimum is achieved at $\lambda^* = 8$. This can also be verified using Fig. 3b, in which $\lambda^*$ satisfies the QoS constraint.

[Fig. 4. Performance of the RL-based admission controller in scenario I: (a) QoS constraint, (b) objective function, (c) acceptance rate; and in scenario II: (d) QoS constraint, (e) objective function, (f) throughput with and without AC.]

[Fig. 5. Service function chain topologies.]
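Finding $\lambda^*$ as in Fig. 3 amounts to a one-dimensional sweep over candidate multipliers: train one controller per $\lambda$ and keep the one minimizing the dual $\tilde{g}(\lambda)$. The sketch below is ours and reuses the hypothetical train() routine from Section IV; it uses the agent's running average reward as an estimate of $\max_{\pi} r_{\lambda}(\pi)$.

```python
def pick_lambda(lambdas, make_agent, make_env, make_sim, d_ub, eps_ub):
    """Approximate the dual problem (11)/(14): train one controller per
    candidate lambda and keep the lambda minimizing the optimized average
    reward g~(lambda).  make_agent/make_env/make_sim are factory callables
    for fresh instances (hypothetical, as is train() above)."""
    best = None
    for lam in lambdas:
        agent, env, sim = make_agent(), make_env(), make_sim()
        train(agent, env, sim, d_ub, eps_ub, lam)
        g = agent.r_bar                  # estimate of max_pi r_lambda(pi)
        if best is None or g < best[0]:
            best = (g, lam, agent)
    return best[1], best[2]              # lambda*, trained controller

# e.g., sweeping lambdas = [0, 2, 4, 6, 8, 10]; Fig. 3a locates the
# minimum of the dual at lambda* = 8 for scenario I.
```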
Now, using $\lambda^*$, the admission controller, i.e., the agent, has been trained with 4 different initial seeds using Algorithm 1. Figs. 4a-4c show the performance of our trained controller in scenario I. The dark blue curves and the pale blue regions show the average and the standard error bands, respectively. Fig. 4a shows that the QoS constraint is satisfied, i.e., $P(d > d_{ub} \mid A)$ converges to $\epsilon_{ub}$ as we train the agent. This is to be expected at the optimal point, since $\lambda^* > 0$. On the other hand, Fig. 4b shows the maximization of the objective function, which is equivalent to minimizing the average number of unnecessary rejections per time step. As can be seen in Fig. 4c, the acceptance rate of the trained admission controller, which is equal to the throughput of the system in this scenario, converges as well.

In the second experiment, we consider a service function chain as in Fig. 5b, where service function SF2 has two instances with separate physical resources. Therefore, we can model the system by an acyclic network as in Fig. 1b, with two parallel branches. The parameters of the model are summarized in Table I. We assume that the application imposes a deadline of $d_{ub} = 20$ on the end-to-end delay of the jobs. Using our proposed method, we can train an admission controller that guarantees that the accepted jobs meet the deadline with probability $1 - \epsilon_{ub}$. We can use a similar approach as in the previous experiment to find $\lambda^*$, which is equal to $\lambda^* = 5$ in this setting. As shown in Fig. 4d, the probability that an accepted job misses the end-to-end deadline converges to $\epsilon_{ub}$. Moreover, Fig. 4f shows that the admission controller has tremendously improved the throughput of the service function chain, compared to the case with no admission controller.

VI. CONCLUSIONS