Stable Online Computation Offloading via Lyapunov-guided Deep Reinforcement Learning
Suzhi Bi∗, Liang Huang†, Hui Wang∗, and Ying-Jun Angela Zhang‡
∗College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
†College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China
‡Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong SAR
E-mail: {bsz,wanghsz}@szu.edu.cn, [email protected], [email protected]

Abstract—In this paper, we consider a multi-user mobile-edge computing (MEC) network with time-varying wireless channels and stochastic user task data arrivals in sequential time frames. In particular, we aim to design an online computation offloading algorithm to maximize the network data processing capability subject to long-term data queue stability and average power constraints. The online algorithm is practical in the sense that the decisions for each time frame are made without assuming knowledge of future channel conditions and data arrivals. We formulate the problem as a multi-stage stochastic mixed integer non-linear programming (MINLP) problem that jointly determines the binary offloading (each user computes its task either locally or at the edge server) and system resource allocation decisions in sequential time frames. To address the coupling of the decisions across time frames, we propose a novel framework, named LyDROO, that combines the advantages of Lyapunov optimization and deep reinforcement learning (DRL). Specifically, LyDROO first applies Lyapunov optimization to decouple the multi-stage stochastic MINLP into deterministic per-frame MINLP subproblems of much smaller size. It then integrates model-based optimization and model-free DRL to solve the per-frame MINLP problems with very low computational complexity. Simulation results show that the proposed LyDROO achieves optimal computation performance while satisfying all the long-term constraints.
Besides, it induces very low execution latency, which makes it particularly suitable for real-time implementation in fast-fading environments.
I. INTRODUCTION
The emerging mobile-edge computing (MEC) technology allows wireless devices (WDs) to offload intensive computation tasks to an edge server (ES) in the vicinity to reduce the computation energy and time cost [1]. Compared to the naive scheme that offloads all tasks for edge execution, opportunistic computation offloading, which dynamically assigns tasks to be computed either locally or at the ES, has shown significant performance improvement under time-varying network conditions, such as wireless channel gains [2], harvested energy level [3], service availability [4], and task input-output dependency [5]. In general, it requires solving a mixed integer non-linear programming (MINLP) problem that jointly determines the binary offloading decision (i.e., whether or not to offload the computation) and the communication/computation resource allocation (e.g., task offloading time and local/edge CPU frequencies) [5]–[7]. To tackle the prohibitively high computational complexity of solving MINLP problems, many works have proposed reduced-complexity sub-optimal algorithms, such as local-search based heuristics [5], [6], decomposition-oriented search [6], and convex relaxations of the binary variables [7], [14]. However, aside from performance losses, the above algorithms still require a large number of numerical iterations to produce a satisfying solution. It is therefore too costly to implement conventional optimization algorithms in a highly dynamic MEC environment, where the MINLP needs to be frequently re-solved once the system parameters, such as wireless link quality, vary. The recent development of data-driven deep reinforcement learning (DRL) provides a promising alternative to tackle the online computation offloading problem.
In a nutshell, the DRL framework takes a model-free approach that uses deep neural networks (DNNs) to directly learn the optimal mapping from the "state" (e.g., time-varying system parameters) to the "action" (e.g., offloading decisions and resource allocation) that maximizes the "reward" (e.g., data processing rate). Example implementations include deep Q-learning networks (DQN) [8], double DQN [9], and actor-critic DRL [10], [11]. In particular, our previous work [11] proposes a hybrid framework, named DROO (Deep Reinforcement learning-based Online Offloading), to combine the advantages of conventional model-based optimization and model-free DRL methods. The integrated learning and optimization approach leads to more robust and faster convergence of the online training process, thanks to the accurate estimation of the reward values corresponding to each sampled action. Apart from optimizing the computation performance, it is equally important to guarantee stable system operation, such as data queue stability and bounded average power consumption. However, most of the existing DRL-based methods do not impose long-term performance constraints (e.g., [8]–[11]). Instead, they resort to heuristic approaches that discourage unfavorable actions in each time frame by introducing penalty terms related to, for example, packet drop events [9] and energy consumption [8]. A well-known framework for online joint utility maximization and stability control is
Lyapunov optimization [12]. It decouples a multi-stage stochastic optimization into sequential per-stage deterministic subproblems, while providing a theoretical guarantee of long-term system stability. Some recent works have applied Lyapunov optimization to design computation offloading strategies in MEC networks (e.g., [13]–[15]). However, one still needs to solve a hard MINLP in each per-stage subproblem to obtain the joint binary offloading and resource allocation decisions. To tackle the intractability, some works have designed reduced-complexity heuristics, such as the continuous relaxation in [14] and the decoupling heuristic in [15]. These, however, suffer from the same performance-complexity tradeoff dilemma as [5]–[7]. In this paper, we consider a multi-user MEC network with an ES assisting the computation of N WDs, where the computation task data arrive at the WDs' data queues stochastically in sequential time frames. We aim to design an online computation offloading algorithm, in the sense that the decisions for each time frame are made without knowing the future channel conditions and task arrivals. The objective is to maximize the network data processing capability under long-term data queue stability and average power constraints. To tackle the problem, we propose a Lyapunov-guided Deep Reinforcement learning (DRL)-based Online Offloading (LyDROO) framework that combines the advantages of Lyapunov optimization and DRL. In particular, we first apply Lyapunov optimization to decouple the multi-stage stochastic MINLP into per-frame deterministic MINLP problems. Then, in each frame, we integrate model-based optimization and model-free DRL to solve the per-frame MINLP problems with very low computational complexity. Simulation results show that the proposed LyDROO algorithm converges very fast to the optimal computation rate while meeting all the long-term stability constraints.
Compared to a myopic benchmark algorithm that greedily maximizes the computation rate in each time frame, the proposed LyDROO achieves a much larger stable capacity region, i.e., it can stabilize the data queues under much heavier task data arrivals.

II. SYSTEM MODEL AND PROBLEM FORMULATION
We consider an MEC network where an ES assists the computation of $N$ WDs in sequential time frames of equal duration $T$. Within the $t$-th time frame, we denote $A_i^t$ (in bits) as the raw task data arrival at the data queue of the $i$-th WD. We assume that $A_i^t$ follows an i.i.d. exponential distribution with mean $\lambda_i$, for $i = 1, \cdots, N$. We denote the channel gain between the $i$-th WD and the ES as $h_i^t$. Under the block fading assumption, $h_i^t$ remains constant within a time frame but varies independently across different frames.

In the $t$-th time frame, suppose that a tagged WD $i$ processes $D_i^t$ bits of data and produces the computation output at the end of the time frame. Within each time frame, we assume that the WDs adopt a binary computation offloading rule [1], i.e., the raw data is processed either locally at the WD or remotely at the ES. The offloading WDs share a common bandwidth $W$ for transmitting the task data to the ES in a TDMA manner. We use a binary variable $x_i^t$ to denote the offloading decision. When the $i$-th WD processes the data locally ($x_i^t = 0$), we denote the local CPU frequency as $f_i^t$, which is upper bounded by $f_i^{\max}$. The raw data (in bits) processed locally and the energy consumed within the time frame are [1]
$$D_{i,L}^t = f_i^t T/\phi, \qquad E_{i,L}^t = \kappa \left(f_i^t\right)^3 T, \qquad \forall\, x_i^t = 0, \quad (1)$$
respectively. Here, $\phi > 0$ denotes the number of computation cycles needed to process one bit of raw data and $\kappa > 0$ denotes the computing energy efficiency parameter. Otherwise, when the data is offloaded for edge execution ($x_i^t = 1$), we denote $P_i^t$ as the transmit power, constrained by the maximum power $P_i^t \le P_i^{\max}$, and $\tau_i^t T$ as the amount of time allocated to the $i$-th WD for computation offloading. Here, $\tau_i^t \in [0,1]$ and $\sum_{i=1}^N \tau_i^t \le 1$.
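The local-computing model in (1) is straightforward to evaluate numerically. The sketch below uses $T = 1$ s and $\phi = 100$ as in the simulation section, while the energy-efficiency value $\kappa = 10^{-26}$ is an illustrative assumption, not a value stated in this paper.

```python
def local_computation(f_hz, T=1.0, phi=100, kappa=1e-26):
    """Bits processed and energy consumed by local computing, per Eq. (1).

    f_hz:  local CPU frequency in cycles/s
    phi:   CPU cycles needed to process one bit of raw data
    kappa: computing energy-efficiency parameter (illustrative value)
    """
    bits = f_hz * T / phi            # D_{i,L}^t = f_i^t * T / phi
    energy = kappa * f_hz**3 * T     # E_{i,L}^t = kappa * (f_i^t)^3 * T
    return bits, energy

# A 0.3 GHz CPU running for one frame of T = 1 second:
bits, energy = local_computation(f_hz=0.3e9)
```

The cubic dependence of energy on CPU frequency is why lowering $f_i^t$ (when the queue is short) saves energy far faster than it loses throughput.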
The energy consumed on data offloading is $E_{i,O}^t = P_i^t \tau_i^t T$, such that $P_i^t = E_{i,O}^t/(\tau_i^t T)$. Similar to [3] and [6], we neglect the delay of edge computing and result downloading, such that the amount of data processed at the edge within the time frame is
$$D_{i,O}^t = \frac{W \tau_i^t T}{v_u} \log_2\!\left(1 + \frac{E_{i,O}^t h_i^t}{\tau_i^t T N_0}\right), \qquad \forall\, x_i^t = 1, \quad (2)$$
where $v_u \ge 1$ represents the rate loss due to communication overhead and $N_0$ denotes the noise power.

Let $D_i^t \triangleq (1-x_i^t) D_{i,L}^t + x_i^t D_{i,O}^t$ and $E_i^t \triangleq (1-x_i^t) E_{i,L}^t + x_i^t E_{i,O}^t$ denote the bits computed and the energy consumed in time frame $t$. We define the computation rate $r_i^t$ and power consumption $e_i^t$ in the $t$-th time frame as
$$r_i^t = \frac{D_i^t}{T} = (1-x_i^t)\frac{f_i^t}{\phi} + x_i^t \frac{W \tau_i^t}{v_u}\log_2\!\left(1 + \frac{e_{i,O}^t h_i^t}{\tau_i^t N_0}\right), \qquad e_i^t = \frac{E_i^t}{T} = (1-x_i^t)\kappa \left(f_i^t\right)^3 + x_i^t e_{i,O}^t, \quad (3)$$
where $e_{i,O}^t \triangleq E_{i,O}^t/T$. For simplicity of exposition, we assume $T = 1$ without loss of generality in the following derivations.

Let $Q_i(t)$ denote the queue length of the $i$-th WD at the beginning of the $t$-th time frame, such that
$$Q_i(t+1) = \max\left\{Q_i(t) - \tilde D_i^t + A_i^t,\, 0\right\}, \quad i = 1, \cdots, N, \quad (4)$$
where $\tilde D_i^t = \min(Q_i(t), D_i^t)$ and $Q_i(1) = 0$. In the following derivation, we enforce the data causality constraint $D_i^t \le Q_i(t)$, implying that $Q_i(t) \ge 0$ holds for any $t$. Thus, the queue dynamics simplify to
$$Q_i(t+1) = Q_i(t) - D_i^t + A_i^t, \quad i = 1, \cdots, N. \quad (5)$$
If $\lim_{K\to\infty} \frac{1}{K}\sum_{t=1}^K \mathbb{E}[Q_i(t)] < \infty$ holds for $Q_i(t)$, we refer to the queue as strongly stable, where the expectation is taken with respect to the random channel fading and task data arrivals [12]. By Little's law, a strongly stable data queue translates to a finite processing delay for each task data bit.

In this paper, we aim to design an online algorithm to maximize the long-term average weighted sum computation rate of all the WDs under the data queue stability and average power constraints.
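The queue recursion (4)-(5) can be sketched directly; the arrival mean and the fixed per-frame service used below are illustrative numbers, not values from the paper.

```python
import random

def queue_step(Q, D, A):
    """One slot of the data-queue recursion of Eq. (4):
    Q(t+1) = max(Q(t) - D~ + A, 0), with D~ = min(Q(t), D).
    Under the data-causality constraint D <= Q(t), the max() never binds,
    which gives the simplified dynamics of Eq. (5)."""
    D = min(D, Q)            # data causality: cannot process more than is queued
    return max(Q - D + A, 0)

# Toy trace: i.i.d. exponential arrivals with mean lam, fixed service D per frame.
random.seed(0)
lam, D = 2.0, 3.0            # service rate exceeds arrival rate -> strongly stable
Q = 0.0
for _ in range(1000):
    Q = queue_step(Q, D, random.expovariate(1 / lam))
```

Since the service rate here exceeds the mean arrival rate, the sample path of $Q$ stays bounded, illustrating the strong-stability condition below Eq. (5).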
We denote $\mathbf{x}^t = [x_1^t, \cdots, x_N^t]$, $\boldsymbol{\tau}^t = [\tau_1^t, \cdots, \tau_N^t]$, $\mathbf{f}^t = [f_1^t, \cdots, f_N^t]$, and $\mathbf{e}_O^t = [e_{1,O}^t, \cdots, e_{N,O}^t]$, and let $\mathbf{x} = \{\mathbf{x}^t\}_{t=1}^K$, $\boldsymbol{\tau} = \{\boldsymbol{\tau}^t\}_{t=1}^K$, $\mathbf{f} = \{\mathbf{f}^t\}_{t=1}^K$, and $\mathbf{e}_O = \{\mathbf{e}_O^t\}_{t=1}^K$. We formulate the problem as a multi-stage stochastic MINLP:
$$\begin{aligned}
\underset{\mathbf{x},\boldsymbol{\tau},\mathbf{f},\mathbf{e}_O}{\text{maximize}} \quad & \lim_{K\to\infty} \frac{1}{K}\sum_{t=1}^{K}\sum_{i=1}^{N} c_i r_i^t && (6a)\\
\text{subject to} \quad & \sum\nolimits_{i=1}^{N}\tau_i^t \le 1,\ \forall t, \qquad x_i^t \in \{0,1\},\ \forall i,t, && (6c)\\
& (1-x_i^t)\frac{f_i^t}{\phi} + x_i^t \frac{W\tau_i^t}{v_u}\log_2\!\left(1+\frac{e_{i,O}^t h_i^t}{\tau_i^t N_0}\right) \le Q_i(t),\ \forall t,i, && (6d)\\
& \lim_{K\to\infty}\frac{1}{K}\sum_{t=1}^{K}\mathbb{E}\left[(1-x_i^t)\kappa\left(f_i^t\right)^3 + x_i^t e_{i,O}^t\right] \le \gamma_i,\ \forall i, && (6e)\\
& \lim_{K\to\infty}\frac{1}{K}\sum_{t=1}^{K}\mathbb{E}[Q_i(t)] < \infty,\ \forall i, && (6f)\\
& \tau_i^t, f_i^t, e_{i,O}^t \ge 0, \quad f_i^t \le f_i^{\max}, \quad e_{i,O}^t \le P_i^{\max}\tau_i^t,\ \forall i,t. && (6g)
\end{aligned}$$
Here, $c_i$ denotes the fixed weight of the $i$-th WD. (6c) contains the offloading time and binary offloading constraints. Notice that $\tau_i^t = e_{i,O}^t = 0$ must hold at the optimum if $x_i^t = 0$; similarly, $f_i^t = 0$ must hold if $x_i^t = 1$. (6d) corresponds to the data causality constraint. (6e) corresponds to the average power constraint, where $\gamma_i > 0$ is the power threshold. (6f) contains the data queue stability constraints. Under the stochastic channels and data arrivals, it is hard to satisfy the long-term constraints when the decisions are made in each time frame without future knowledge. Besides, the fast-varying channel condition requires real-time decision-making in each short time frame, e.g., within the channel coherence time. In the following, we propose a novel LyDROO framework that solves (6) with both high robustness and efficiency.

III. LYAPUNOV-BASED MULTI-STAGE DECOUPLING
In this section, we apply Lyapunov optimization to decouple (6) into per-frame deterministic problems. To cope with the average power constraints (6e), we introduce $N$ virtual energy queues $\{Y_i(t)\}_{i=1}^N$, one for each WD. Specifically, we set $Y_i(1) = 0$ and update the queue as
$$Y_i(t+1) = \max\left(Y_i(t) + \nu\left(e_i^t - \gamma_i\right),\, 0\right), \quad (7)$$
for $i = 1, \cdots, N$ and $t = 1, 2, \cdots$, where $e_i^t$ in (3) is the energy consumption in the $t$-th time frame and $\nu$ is a positive scaling factor. Intuitively, when the virtual energy queues are stable, the average power consumption $e_i^t$ does not exceed $\gamma_i$, and thus the constraints in (6e) are satisfied.

We define $Z(t) = \{Q(t), Y(t)\}$ as the total queue backlog, where $Q(t) = \{Q_i(t)\}_{i=1}^N$ and $Y(t) = \{Y_i(t)\}_{i=1}^N$. Then, we introduce the Lyapunov function $L(Z(t))$ and the Lyapunov drift $\Delta L(Z(t))$ as [12]
$$L(Z(t)) = \frac{1}{2}\left(\sum_{i=1}^N Q_i(t)^2 + \sum_{i=1}^N Y_i(t)^2\right), \qquad \Delta L(Z(t)) = \mathbb{E}\left\{L(Z(t+1)) - L(Z(t)) \,\middle|\, Z(t)\right\}. \quad (8)$$
To maximize the time-average computation rate while stabilizing $Z(t)$, we use the drift-plus-penalty minimization approach [12]. Specifically, we seek to minimize an upper bound of the following drift-plus-penalty expression in every time frame $t$:
$$\Lambda(Z(t)) \triangleq \Delta L(Z(t)) - V\cdot\sum_{i=1}^N \mathbb{E}\left\{c_i r_i^t \,\middle|\, Z(t)\right\}, \quad (9)$$
where $V > 0$ is an "importance" weight that scales the penalty. The following theorem derives an upper bound of $\Lambda(Z(t))$. Theorem 1:
Given any queue backlog $Z(t)$, the drift-plus-penalty in (9) is upper bounded as
$$\Lambda(Z(t)) \le \hat B + \sum_{i=1}^N Q_i(t)\,\mathbb{E}\left[\left(A_i^t - D_i^t\right) \middle|\, Z(t)\right] + \sum_{i=1}^N \left\{Y_i(t)\,\mathbb{E}\left[e_i^t - \gamma_i \,\middle|\, Z(t)\right] - V\,\mathbb{E}\left[c_i r_i^t \,\middle|\, Z(t)\right]\right\}, \quad (10)$$
where $\hat B$ is a finite constant.

Proof: Due to the page limit, please refer to the detailed proof in the online technical report [16]. $\blacksquare$
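The virtual-queue update (7) and the quadratic Lyapunov function (8) are one-liners; the sketch below uses the paper's $\nu = 1000$, while the queue values and the per-frame energy in the usage lines are illustrative.

```python
def virtual_energy_step(Y, e, gamma, nu=1000.0):
    """Virtual energy-queue update of Eq. (7): Y(t+1) = max(Y(t) + nu*(e - gamma), 0).
    A stable virtual queue implies the average power constraint (6e) is met."""
    return max(Y + nu * (e - gamma), 0.0)

def lyapunov(Q, Y):
    """Quadratic Lyapunov function of Eq. (8): L(Z(t)) = (1/2)(sum Q_i^2 + sum Y_i^2)."""
    return 0.5 * (sum(q * q for q in Q) + sum(y * y for y in Y))

# A frame that overshoots the power budget gamma grows the virtual queue...
Y = virtual_energy_step(Y=0.0, e=0.1, gamma=0.08)
# ...which in turn inflates the Lyapunov function, pushing later decisions to save power.
L = lyapunov(Q=[2.0, 1.0], Y=[Y])
```

Minimizing the conditional expected growth of $L$ plus the rate penalty is exactly what the per-frame objective (11) below does.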
In the $t$-th time frame, we apply the technique of opportunistic expectation minimization [12]. That is, we observe the queue backlogs $Z(t)$ and accordingly decide the joint offloading and resource allocation control action to minimize the upper bound in (10). By removing the terms that are constant given the observation at the beginning of the $t$-th time frame, the algorithm decides the action by maximizing
$$\sum_{i=1}^N \left(Q_i(t) + V c_i\right) r_i^t - \sum_{i=1}^N Y_i(t)\, e_i^t, \quad (11)$$
where $r_i^t$ and $e_i^t$ are given in (3). Intuitively, this tends to increase the computation rates of WDs that have a long data queue backlog or a large weight, while penalizing those that have exceeded the average power threshold. We introduce an auxiliary variable $r_{i,O}^t$ for each WD $i$ and denote $\mathbf{r}_O^t = \{r_{i,O}^t\}_{i=1}^N$. Taking into account the per-frame constraints, we solve the following deterministic per-frame subproblem in the $t$-th time frame:
$$\begin{aligned}
\underset{\mathbf{x}^t,\boldsymbol{\tau}^t,\mathbf{f}^t,\mathbf{e}_O^t,\mathbf{r}_O^t}{\text{maximize}} \quad & \sum_{i=1}^N \left(Q_i(t) + V c_i\right) r_i^t - \sum_{i=1}^N Y_i(t)\, e_i^t && (12a)\\
\text{subject to} \quad & \sum\nolimits_{i=1}^N \tau_i^t \le 1, && (12c)\\
& f_i^t/\phi \le Q_i(t), \qquad r_{i,O}^t \le Q_i(t),\ \forall i, && (12d)\\
& r_{i,O}^t \le \frac{W\tau_i^t}{v_u}\log_2\!\left(1+\frac{e_{i,O}^t h_i^t}{\tau_i^t N_0}\right),\ \forall i, && (12e)\\
& f_i^t \le f_i^{\max}, \qquad e_{i,O}^t \le P_i^{\max}\tau_i^t,\ \forall i, && (12f)\\
& x_i^t \in \{0,1\}, \qquad \tau_i^t, f_i^t, e_{i,O}^t \ge 0,\ \forall i. && (12g)
\end{aligned}$$
Notice that the above constraints (12d) and (12e) are equivalent to (6d), because there is exactly one non-zero term on the left-hand side of (6d) at the optimum. It can be shown that, for a (6) that is feasible under the given data arrival rates and power constraints, we can satisfy all long-term constraints in (6) by solving the per-frame subproblems in an online fashion; the detailed proof is omitted due to the page limit. The remaining difficulty then lies in solving the MINLP (12) in each time frame.
IV. LYAPUNOV-GUIDED DRL FOR ONLINE OFFLOADING
Recall that to solve (12) in the $t$-th time frame, we observe $\xi^t \triangleq \{h_i^t, Q_i(t), Y_i(t)\}_{i=1}^N$, consisting of the channel gains $\{h_i^t\}_{i=1}^N$ and the system queue states $\{Q_i(t), Y_i(t)\}_{i=1}^N$, and accordingly decide the control action $\{\mathbf{x}^t, \mathbf{y}^t\}$, including the binary offloading decision $\mathbf{x}^t$ and the continuous resource allocation $\mathbf{y}^t \triangleq \{\tau_i^t, f_i^t, e_{i,O}^t, r_{i,O}^t\}_{i=1}^N$.

Suppose that the value of $\mathbf{x}^t$ in (12) is given. We denote the index set of users with $x_i^t = 1$ as $\mathcal{M}^t$, and the complementary set of users with $x_i^t = 0$ as $\overline{\mathcal{M}}^t$. Then, the remaining problem of optimizing the resource allocation $\mathbf{y}^t$ is as follows:
$$\begin{aligned}
\underset{\boldsymbol{\tau}^t,\mathbf{f}^t,\mathbf{e}_O^t,\mathbf{r}_O^t}{\text{maximize}} \quad & \sum_{j\in\overline{\mathcal{M}}^t}\left\{a_j^t f_j^t/\phi - Y_j(t)\,\kappa\left(f_j^t\right)^3\right\} + \sum_{i\in\mathcal{M}^t}\left\{a_i^t r_{i,O}^t - Y_i(t)\, e_{i,O}^t\right\} && (13a)\\
\text{subject to} \quad & \sum\nolimits_{i\in\mathcal{M}^t}\tau_i^t \le 1, \qquad e_{i,O}^t \le P_i^{\max}\tau_i^t,\ \forall i\in\mathcal{M}^t, && (13c)\\
& f_j^t/\phi \le Q_j(t), \qquad f_j^t \le f_j^{\max},\ \forall j\in\overline{\mathcal{M}}^t, && (13d)\\
& r_{i,O}^t \le Q_i(t),\ \forall i\in\mathcal{M}^t, && (13e)\\
& r_{i,O}^t \le \frac{W\tau_i^t}{v_u}\log_2\!\left(1+\frac{e_{i,O}^t h_i^t}{\tau_i^t N_0}\right),\ \forall i\in\mathcal{M}^t, && (13f)
\end{aligned}$$
where $a_i^t \triangleq Q_i(t) + V c_i$ is a parameter. A close observation shows that although (12) is a non-convex optimization problem, the resource allocation problem (13) is in fact convex once $\mathbf{x}^t$ is fixed. We denote $G(\mathbf{x}^t, \xi^t)$ as the optimal value of (12) obtained by optimizing $\mathbf{y}^t$ given the offloading decision $\mathbf{x}^t$ and parameter $\xi^t$, i.e., the optimal value of (13), which can be efficiently computed by off-the-shelf or customized convex algorithms. In this sense, solving (12) is equivalent to finding the optimal offloading decision $(\mathbf{x}^t)^*$, where
$$\left(\mathbf{x}^t\right)^* = \arg\max_{\mathbf{x}^t \in \{0,1\}^N} G\left(\mathbf{x}^t, \xi^t\right). \quad (14)$$
In general, obtaining $(\mathbf{x}^t)^*$ requires enumerating $2^N$ offloading decisions, which leads to significantly high computational
complexity even when $N$ is moderate (e.g., $N = 10$). Other search-based methods, such as branch-and-bound and block coordinate descent, are also time-consuming when $N$ is large. In practice, neither approach is applicable to online decision-making under fast-varying channel conditions. Leveraging the DRL technique, we propose the LyDROO algorithm to construct a policy $\pi$ that maps from the input $\xi^t$ to the optimal action $(\mathbf{x}^t)^*$, i.e., $\pi: \xi^t \mapsto (\mathbf{x}^t)^*$, with very low complexity, e.g., tens of milliseconds of execution delay for $N = 10$. As illustrated in Fig. 1, LyDROO consists of four main modules that operate in a sequential and iterative manner as detailed below.

Fig. 1: The schematics of the proposed LyDROO algorithm.
1) Actor Module:
The actor module consists of a DNN and an action quantizer. Here, we apply a fully-connected multi-layer perceptron with two hidden layers, and use a sigmoid activation function at the output layer. At the beginning of the $t$-th time frame, we denote the parameters of the DNN as $\theta_t$, which are randomly initialized following the standard normal distribution when $t = 1$. With the input $\xi^t$, the DNN outputs a relaxed offloading decision $\hat{\mathbf{x}}^t \in [0,1]^N$, where the input-output relation is expressed as
$$\Pi_{\theta_t}: \xi^t \mapsto \hat{\mathbf{x}}^t = \left\{\hat x_i^t \in [0,1],\ i = 1, \cdots, N\right\}. \quad (15)$$
We then quantize the continuous $\hat{\mathbf{x}}^t$ into $M_t$ feasible candidate binary offloading actions, denoted as
$$\Upsilon_{M_t}: \hat{\mathbf{x}}^t \mapsto \Omega^t = \left\{\mathbf{x}_j^t \mid \mathbf{x}_j^t \in \{0,1\}^N,\ j = 1, \cdots, M_t\right\}, \quad (16)$$
where $\Omega^t$ denotes the set of candidate offloading actions in the $t$-th time frame. Notice that the number of binary actions $M_t = |\Omega^t|$ is a time-dependent design parameter.

A good quantization function should balance the exploration-exploitation tradeoff in generating the offloading actions to ensure good training convergence. Intuitively, the $\{\mathbf{x}_j^t\}$'s should be close to $\hat{\mathbf{x}}^t$ (measured by Euclidean distance) to make effective use of the DNN's output, and meanwhile sufficiently separated to avoid premature convergence to a sub-optimal solution in the training process. Here, we apply a noisy order-preserving (NOP) quantization method, which can generate up to $M_t \le 2N$ candidate actions (please refer to [16] for the detailed quantization procedures). The NOP method generates the first $M_t/2$ actions ($M_t$ is assumed to be an even number) by applying the order-preserving quantizer (OPQ) in [11] to $\hat{\mathbf{x}}^t$. To obtain
the remaining $M_t/2$ actions, we first generate a noisy version of $\hat{\mathbf{x}}^t$, denoted as $\tilde{\mathbf{x}}^t = \mathrm{Sigmoid}(\hat{\mathbf{x}}^t + \mathbf{n})$, where the random Gaussian noise $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_N)$, with $\mathbf{I}_N$ being the identity matrix, and $\mathrm{Sigmoid}(\cdot)$ is the element-wise sigmoid function that bounds each entry of $\tilde{\mathbf{x}}^t$ within $(0,1)$. Then, we produce the remaining $M_t/2$ actions $\mathbf{x}_m^t$, for $m = M_t/2 + 1, \cdots, M_t$, by applying the OPQ to $\tilde{\mathbf{x}}^t$.
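Under the assumption that the OPQ of [11] thresholds $\hat{\mathbf{x}}$ first at 0.5 and then at the entries of $\hat{\mathbf{x}}$ closest to 0.5 (the tie-breaking details of [11]/[16] may differ), the NOP procedure can be sketched as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def opq(x_hat, m):
    """Order-preserving quantizer (sketch): the 1st action thresholds x_hat at 0.5;
    the k-th action (k >= 2) instead thresholds at the entry of x_hat that is
    (k-1)-th closest to 0.5, so successive actions flip the most 'uncertain' bits."""
    order = np.argsort(np.abs(x_hat - 0.5))          # entries closest to 0.5 first
    actions = [(x_hat > 0.5).astype(int)]
    for k in range(1, m):
        th = x_hat[order[(k - 1) % len(x_hat)]]
        if th <= 0.5:
            actions.append((x_hat >= th).astype(int))
        else:
            actions.append((x_hat > th).astype(int))
    return actions

def nop(x_hat, M, rng):
    """Noisy order-preserving (NOP) quantization sketch: M/2 noise-free OPQ actions
    plus M/2 OPQ actions of a sigmoid-bounded noisy copy, per the text above."""
    x_tilde = sigmoid(x_hat + rng.standard_normal(x_hat.shape))
    return opq(x_hat, M // 2) + opq(x_tilde, M // 2)
```

For example, `opq(np.array([0.2, 0.6, 0.9]), 2)` first returns the plain threshold action `[0, 1, 1]`, then re-thresholds at the entry 0.6 (the one closest to 0.5), flipping exactly the most ambiguous bit.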
2) Critic Module:
Following the actor module, the critic module evaluates the candidate actions $\{\mathbf{x}_j^t\}$ and selects the best offloading action $\mathbf{x}^t$. LyDROO leverages the model information to evaluate each binary offloading action by analytically solving the optimal resource allocation problem, which leads to more robust and faster convergence of the DRL training process. Specifically, LyDROO selects the best action $\mathbf{x}^t$ as
$$\mathbf{x}^t = \arg\max_{\mathbf{x}_j^t \in \Omega^t} G\left(\mathbf{x}_j^t, \xi^t\right), \quad (17)$$
where $G(\mathbf{x}_j^t, \xi^t)$ is obtained by optimizing the resource allocation in (13) given $\mathbf{x}_j^t$. Intuitively, a larger $M_t = |\Omega^t|$ results in better solution performance, but a higher execution delay. To balance this performance-complexity tradeoff, we propose an adaptive procedure to set a time-varying $M_t$. Denote $m_t \in [0, M_t - 1]$ as the index of the best action $\mathbf{x}^t \in \Omega^t$. We define $m_t^* = \mathrm{mod}(m_t, M_t/2)$, which represents the order of $\mathbf{x}^t$ among either the $M_t/2$ noise-free or the $M_t/2$ noise-added candidate actions. The key idea is that $m_t^*$ gradually decreases as the actor DNN approaches the optimal policy over time. In practice, we set a maximum $M_1 = 2N$ initially and update $M_t$ every $\delta_M \ge 1$ time frames as
$$M_t = 2\cdot\min\left(\max\left(m_{t-1}^*, \cdots, m_{t-\delta_M}^*\right) + 1,\ N\right). \quad (18)$$
The additional 1 in the first term inside the min operator allows $M_t$ to increase over time.
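The adaptive rule (18) is a one-liner; the history of best-action order indices $m^*$ in the call below is an illustrative example.

```python
def update_Mt(mstar_history, N):
    """Adaptive number of candidate actions, Eq. (18):
    M_t = 2 * min( max(m*_{t-1}, ..., m*_{t-delta_M}) + 1, N ).
    As the DNN improves, the best action's order index m* shrinks, so M_t decays
    and fewer convex problems (13) are solved per frame; the '+1' lets M_t grow back."""
    return 2 * min(max(mstar_history) + 1, N)

# Best actions over the last delta_M frames ranked at most 3rd among their half:
M_t = update_Mt([3, 1, 0], N=10)
```

Note the cap at $2N$: even if a recent best action had a large order index, $M_t$ never exceeds the initial maximum.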
3) Policy Update Module:
LyDROO uses $(\xi^t, \mathbf{x}^t)$ as a labeled input-output sample for updating the policy of the DNN. In particular, we maintain a replay memory that only stores the most recent $q$ data samples. In practice, starting with an initially empty memory, we begin training the DNN after collecting more than $q/2$ data samples. Then, the DNN is trained periodically, once every $\delta_T$ time slots, to avoid model over-fitting. When $\mathrm{mod}(t, \delta_T) = 0$, we randomly select a batch of data samples $\{(\xi^\tau, \mathbf{x}^\tau), \tau \in \mathcal{S}_t\}$, where $\mathcal{S}_t$ denotes the set of time indices of the selected samples. We then update the parameters of the DNN by minimizing its average cross-entropy loss over the selected data samples. When the training completes, the actor module uses the updated parameters $\theta_{t+1}$ in the next time frame.
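The replay memory is a plain FIFO buffer with uniform batch sampling; a minimal sketch is below (the DNN training step itself, a cross-entropy minimization over the sampled batch, is omitted).

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size replay memory: keeps only the most recent q (state, action)
    pairs and serves uniform random batches for the periodic DNN re-training."""
    def __init__(self, q=1024):
        self.buf = deque(maxlen=q)   # oldest samples are evicted automatically

    def add(self, xi, x):
        self.buf.append((xi, x))

    def sample(self, batch_size=32):
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

# Adding more samples than the capacity keeps only the newest q of them:
mem = ReplayMemory(q=4)
for t in range(10):
    mem.add(("xi", t), ("x", t))
```

Bounding the memory to the most recent $q$ samples is what lets the DNN track a drifting optimal policy: stale state-action pairs from an earlier, worse policy age out of the training set.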
4) Queueing Module:
As a by-product of the critic module, we obtain the optimal resource allocation $\mathbf{y}^t$ associated with $\mathbf{x}^t$. Accordingly, the system executes the joint computation offloading and resource allocation action $\{\mathbf{x}^t, \mathbf{y}^t\}$, which processes data $\{D_i^t\}_{i=1}^N$ and consumes energy $\{e_i^t\}_{i=1}^N$ as given in (3). Based on $\{D_i^t, e_i^t\}_{i=1}^N$ and the data arrivals $\{A_i^t\}_{i=1}^N$ observed in the $t$-th time frame, the queueing module then updates the data and energy queues $\{Q_i(t+1), Y_i(t+1)\}_{i=1}^N$ using (5) and (7) at the beginning of the $(t+1)$-th time frame. With the wireless channel gain observations $\{h_i^{t+1}\}_{i=1}^N$, the system feeds the input parameter $\xi^{t+1} = \{h_i^{t+1}, Q_i(t+1), Y_i(t+1)\}_{i=1}^N$ to the DNN and starts a new iteration from Step 1).

With the above actor-critic-update loop, the DNN consistently learns from the best and most recent state-action pairs, leading to a better policy $\pi_{\theta_t}$ that gradually approximates the optimal mapping that solves (14). We summarize the pseudo-code of LyDROO in Algorithm 1. The best offloading solution $\mathbf{x}^t$ is obtained efficiently by solving the convex resource allocation problem (13) for each candidate action. Besides, the training of the DNN is performed in parallel with the execution of task offloading and computation, and thus does not incur additional training delay overhead.

Algorithm 1:
LyDROO algorithm for solving (6).
input: Parameters $V$, $\{\gamma_i, c_i\}_{i=1}^N$, $K$, training interval $\delta_T$, $M_t$ update interval $\delta_M$;
output: Control actions $\{\mathbf{x}^t, \mathbf{y}^t\}_{t=1}^K$;
  Initialize the DNN with random parameters $\theta_1$ and an empty replay memory; $M_1 \leftarrow 2N$;
  Initialize the data and energy queues: $Q_i(1) = Y_i(1) = 0$, for $i = 1, \cdots, N$;
  for $t = 1, 2, \ldots, K$ do
    Observe the input $\xi^t = \{h_i^t, Q_i(t), Y_i(t)\}_{i=1}^N$ and update $M_t$ using (18) if $\mathrm{mod}(t, \delta_M) = 0$;
    Generate a relaxed offloading action $\hat{\mathbf{x}}^t = \Pi_{\theta_t}(\xi^t)$ with the DNN;
    Quantize $\hat{\mathbf{x}}^t$ into binary actions $\{\mathbf{x}_j^t \mid j = 1, \cdots, M_t\}$ using the NOP method;
    Compute $G(\mathbf{x}_j^t, \xi^t)$ by optimizing $\mathbf{y}_j^t$ in (13) for each $\mathbf{x}_j^t$;
    Select the best solution $\mathbf{x}^t$ with (17) and execute the joint action $(\mathbf{x}^t, \mathbf{y}^t)$;
    Update the replay memory by adding $(\xi^t, \mathbf{x}^t)$;
    if $\mathrm{mod}(t, \delta_T) = 0$ then
      Uniformly sample a batch $\{(\xi^\tau, \mathbf{x}^\tau) \mid \tau \in \mathcal{S}_t\}$ from the memory;
      Train the DNN with $\{(\xi^\tau, \mathbf{x}^\tau) \mid \tau \in \mathcal{S}_t\}$ and update $\theta_t$;
    end
    $t \leftarrow t + 1$;
    Update $\{Q_i(t), Y_i(t)\}_{i=1}^N$ based on $(\mathbf{x}^{t-1}, \mathbf{y}^{t-1})$ and the data arrival observations $\{A_i^{t-1}\}_{i=1}^N$ using (5) and (7).
  end
V. SIMULATIONS
In this section, we use simulations to evaluate the performance of the proposed LyDROO algorithm. All the computations are carried out on a TensorFlow 2.0 platform with an Intel Core i5-4570 3.2 GHz CPU and 12 GB of memory. We assume that the average channel gain $\bar h_i$ follows the path-loss model $\bar h_i = A_d \left(\frac{3\times 10^8}{4\pi f_c d_i}\right)^{d_e}$, $i = 1, \cdots, N$, where $A_d = 3$ denotes the antenna gain, $f_c = 915$ MHz denotes the carrier frequency, $d_e = 3$ denotes the path-loss exponent, and $d_i$ (in meters) denotes the distance between the $i$-th WD and the ES. $h_i^t$ follows an i.i.d. Rician distribution with line-of-sight link gain equal to $0.3\,\bar h_i$. The noise power is $N_0 = W\upsilon$, with noise power spectral density $\upsilon = -174$ dBm/Hz. Unless otherwise stated, we consider $N = 10$ WDs equally spaced with $d_i = 120 + 15(i-1)$ meters, for $i = 1, \cdots, N$. The weight $c_i = 1.5$ if $i$ is an odd number and $c_i = 1$ otherwise. Other parameters are set as: $T = 1$ second, $W = 2$ MHz, $f_i^{\max} = 0.3$ GHz, $P_i^{\max} = 0.1$ watt, $\gamma_i = 0.08$ watt, $V = 20$, $\phi = 100$, $\nu = 1000$, $q = 1024$, $\delta_T = 10$, $\delta_M = 32$, and $|\mathcal{S}_t| = 32$. For performance comparison, we consider two benchmark methods: 1) Lyapunov-guided Coordinate Descent (LyCD): the only difference from LyDROO is that LyCD applies the coordinate descent (CD) method [6] to solve (12), which iteratively applies a one-dimensional search to update the binary offloading decision vector $\mathbf{x}^t$. We have verified through extensive simulations that the CD method achieves close-to-optimal performance in solving (12); therefore, we consider LyCD as a target benchmark for LyDROO.
2) Myopic optimization [11]: the Myopic method neglects the data queue backlogs and maximizes the weighted sum computation rate in each time frame $t$ by solving
$$\underset{\mathbf{x}^t,\boldsymbol{\tau}^t,\mathbf{f}^t,\mathbf{e}_O^t,\mathbf{r}_O^t}{\text{maximize}}\ \sum_{i=1}^N c_i r_i^t \quad \text{subject to}\ (12c)\text{–}(12g),\quad e_i^t \le t\gamma_i - \sum_{l=1}^{t-1} e_i^l,\ \forall i,$$
where the constraint $e_i^t \le t\gamma_i - \sum_{l=1}^{t-1} e_i^l$ ensures that the average power constraint in (6e) is satisfied up to the $t$-th time frame, and $\{e_i^l \mid l < t\}$ are the known past energy consumptions.

In Fig. 2(a)-(c), we evaluate the convergence of the proposed LyDROO and the two benchmark methods. We consider two data arrival rates, $\lambda_i = 2.5$ and $3$ Mbps for all $i$, and plot the average data queue length, the average power consumption, and the weighted sum computation rate over time. We consider i.i.d. random realizations over 10,000 time frames, where each point in the figure is a moving-window average. In Fig. 2(a), we observe that for the low data arrival rate $\lambda_i = 2.5$, all the schemes keep the data queues stable and achieve similar computation rate performance. Besides, they all satisfy the average power constraint in Fig. 2(b). Meanwhile, they achieve identical rate performance in Fig. 2(c). When we increase $\lambda_i$ to $3$, the average data queue length of the Myopic method increases almost linearly with time, indicating that the data arrival rate has surpassed its computation capacity (i.e., achievable sum computation rate). On the other hand, both the LyCD and LyDROO methods can stabilize the data queues. Of the two, the LyCD method maintains a lower data queue length over all time frames. The data queue length of LyDROO increases quickly in the early time frames. However, as the embedded DNN gradually approaches the optimal policy, it quickly drops and eventually converges to a queue length and rate performance similar to those of the LyCD method after around $t = 7{,}000$, indicating fast convergence even under a highly dynamic queueing system. In Fig.
2(d), we vary the data arrival rate $\lambda_i$ and plot the performance after convergence (points with infinite data queue length are not plotted). We omit the results for larger arrival rates because we observe that none of the three
schemes can maintain queue stability, i.e., the arrival rate surpasses the achievable sum computation rate. All three schemes satisfy the average power constraints under the different $\lambda_i$. The data queues are stable with LyCD and LyDROO under all the considered $\lambda_i$, while the queue lengths of the Myopic scheme become infinite beyond a threshold arrival rate. The results show that both LyDROO and LyCD achieve a much larger stable capacity region than the Myopic method, and thus are more robust under heavy workload. We also observe that LyCD and LyDROO achieve identical computation rate performance in all the considered cases. This is because when the data queues are long-term stable, the average computation rate of the $i$-th WD (the departure rate of its data queue) equals the data arrival rate $\lambda_i$, and thus the achievable average weighted sum computation rate is $\sum_{i=1}^N c_i \lambda_i$ for both schemes. In fact, this also indicates that both LyDROO and LyCD achieve the optimal computation rate performance in all the considered setups.

Fig. 2: Convergence performance (a)-(c) and the impact of the data arrival rate $\lambda_i$ (d) for different schemes.

We further examine the computational complexity. Here, we consider a fixed total network workload of 30 Mbps and equally allocate $\lambda_i = 30/N$ Mbps to each WD, for $N \in \{10, 20, 30\}$. The locations of the $N$ WDs are evenly spaced within $[120, 255]$ meters distance to the ES. LyDROO and LyCD achieve similar computation rate performance for all $N$, and all the long-term constraints are satisfied. In terms of execution delay, however, LyCD incurs acceptable latency when $N = 10$, but significantly longer latency than LyDROO when $N = 30$.
Because the channel coherence time of a common indoor IoT system is no larger than several seconds, the long execution latency makes LyCD costly, or even infeasible, in a practical MEC system that requires online offloading decisions. The proposed LyDROO algorithm, in contrast, incurs a very small latency overhead relative to the time frame, even for N = 30. Therefore, the LyDROO algorithm can be efficiently applied in an MEC system under fast channel variation.

VI. CONCLUSIONS
In this paper, we have studied an online stable computation offloading problem in a multi-user MEC network under stochastic wireless channels and task data arrivals. We formulated a multi-stage stochastic MINLP problem that maximizes the average weighted sum computation rate of all the WDs under long-term queue stability and average power constraints. To tackle the problem, we proposed the LyDROO framework, which combines the advantages of Lyapunov optimization and DRL. We showed that the proposed approach achieves optimal computation rate performance while satisfying all the long-term constraints. Besides, it incurs very low execution delay in generating an online action and converges within a relatively small number of iterations.