Adaptive Processor Frequency Adjustment for Mobile Edge Computing with Intermittent Energy Supply
Tiansheng Huang, Weiwei Lin, Ying Li, Xiumin Wang, Qingbo Wu, Rui Li, Ching-Hsien Hsu, Albert Y. Zomaya
Abstract—With astonishing speed, bandwidth, and scale, Mobile Edge Computing (MEC) has played an increasingly important role in the next generation of connectivity and service delivery. Yet, along with the massive deployment of MEC servers, the ensuing energy issue is now on an increasingly urgent agenda. In the current context, the large-scale deployment of renewable-energy-supplied MEC servers is perhaps the most promising solution to the incoming energy issue. Nonetheless, as a result of the intermittent nature of their power sources, these specially designed MEC servers must be more cautious about their energy usage, in a bid to maintain their service sustainability as well as their service standard. Targeting optimization on a single-server MEC scenario, we in this paper propose NAFA, an adaptive processor frequency adjustment solution, to enable an effective plan of the server's energy usage. By learning from the historical data revealing request arrival and energy harvest patterns, the deep reinforcement learning-based solution is capable of making intelligent schedules of the server's processor frequency, so as to strike a good balance between service sustainability and service quality. The superior performance of NAFA is substantiated by real-data-based experiments, wherein NAFA demonstrates up to a 20% increase in average request acceptance ratio and up to a 50% reduction in average request processing time.
Index Terms—Deep Reinforcement Learning, Event-driven Scheduling, Mobile Edge Computing, Online Learning, Semi-Markov Decision Process.
1 Introduction

Lately, Mobile Edge Computing (MEC) has emerged as a powerful computing paradigm for future Internet of Things (IoT) scenarios. MEC servers are mostly deployed in proximity to the users, with the merits of seamless coverage and extremely low communication latency to the users. Also, MEC servers feature a light-weight deployment, enabling their potential large-scale application in various scenarios.

• The authors would like to thank the three anonymous reviewers for their constructive comments. Special thanks are due to Dr. Minxian Xu of Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, for his constructive advice on experiment setup. This work is supported by National Natural Science Foundation of China (62072187, 61872084), Guangdong Major Project of Basic and Applied Basic Research (2019B030302002), Guangzhou Science and Technology Program key projects (202007040002, 201907010001), and Fundamental Research Funds for the Central Universities, SCUT (2019ZD26).
• T. Huang, W. Lin, Y. Li, and X. Wang are with the School of Computer Science and Engineering, South China University of Technology, China. Email: [email protected], [email protected], [email protected], [email protected].
• C.-H. Hsu is with the Department of Computer Science and Information Engineering, Asia University, Taichung, Taiwan. Email: [email protected].
• Q. Wu is with the College of Computer, National University of Defense Technology, Changsha 410073, China. Email: [email protected].
• R. Li is with Peng Cheng Laboratory, Shenzhen 518000, China. Email: [email protected].
• A. Y. Zomaya is with the School of Computer Science, The University of Sydney, Sydney, Australia. Email: [email protected].

Despite an attractive prospect, two crucial issues might be encountered in real applications:
1) The large-scale deployment of grid-powered MEC servers has almost exhausted the existing energy resource and resulted in an enormous carbon footprint. This in essence goes against the green computing initiative.
2) With millions or billions of small servers deployed amid every corner of the city, some of the locations could be quite unaccommodating for the construction of grid power facilities, and even for those in a good condition, the construction and operation overhead of the power facility alone should not be taken lightly.

With these two challenges encountered during the large-scale application of MEC, we initiate an alternative usage of intermittent energy supply. These supplies could be solar power, wind power, or wireless power, etc. Amid these alternatives, renewable energy, such as solar power and wind power, could elegantly address both concerns. Wireless power still breaches the green computing initiative, but can at least save the construction and operation cost of a complete grid power system for all the computing units.

However, all of these intermittent energy supplies reveal a nature of unreliability: power has to be stored in a battery with limited capacity for future use, and this limited energy is clearly incapable of handling all the workloads when a mass of requests are submitted.
With this concern, servers have two potential options to accommodate the increasing workloads:
1) They directly reject some of the requests (perhaps those requiring greater computation), in a bid to save energy for the subsequent requests.
2) They lower the processing frequency of their cores to save energy and accommodate the increasing workloads. Nevertheless, this option does not come without a cost: each request might suffer a prolonged processing time.

These heuristic ideas of energy conservation elicit the problem we discuss in this paper. We are particularly interested in the potential request treatments (i.e., should we reject an incoming request, and if not, how shall we schedule the processing frequency for it) and their consequent effects on the overall system performance.

The problem becomes even more sophisticated when regarding the unknown and non-stationary request arrival and energy harvest patterns. Traditional rule-based methods are clearly incompetent in this volatile context: as a result of their inflexibility, even though they might work in a particular setting (a specific arrival pattern, for example), they may not work equally well in a completely different environment.

Being so motivated, we shall design a solution that is both effective and adaptive, able to cope with the uncertainty brought by the energy supply and request patterns. Moreover, the solution should be highly programmable, allowing custom design based on the operators' expectations towards different performance metrics.

The major contributions of our work are presented in the following:
1) We have analyzed the real working procedure of an intermittent-energy-driven MEC system, based on which we propose an event-driven scheduling scheme, which is deemed to match well with the working pattern of this system.
2) Moreover, we propose an energy reservation mechanism to accommodate the event-driven feature. This mechanism is novel and has not been available in other sources, to our best knowledge.
3) Based on the proposed scheduling mechanism, we have formulated an optimization problem, which covers a few necessary system constraints and two main objectives: cumulative processing time and acceptance ratio.
4) We propose a deep reinforcement learning-based solution (NAFA) to optimize the joint request treatment and frequency adjustment action.
5) We conduct experiments based on real solar data. Our experimental results substantiate the effectiveness and adaptiveness of our proposed solutions. In addition to the general results, we also develop a profound analysis of the working patterns of different solutions.

To the best knowledge of the authors, our main focus on event-driven scheduling for an intermittent-energy-supplied system has not been presented in other sources. Also, our proposed solution happens to be the first tractable deep reinforcement learning solution to an SMDP model. It has the potential to be applied to other network systems that follow a similar event-driven working pattern. In this regard, we consider our novel work a major contribution to the field.
2 Related Work
AI-driven algorithms have been broadly studied thanks to their attractive efficiency and flexibility. In many fields, such as service orchestration [1], vehicular or industrial IoT networks (see [2], [3]), and crowdsensing (see [4], [5]), lines of research on AI application have been conducted. The proposed AI-driven algorithms all achieve significant performance enhancement compared with traditional methodologies. Among these emerging topics, AI-driven MEC can be regarded as one of the hottest agenda items. To illustrate, in [6], Chen et al. combined the technique of Deep Q Network (DQN) with a multi-object computation offloading scenario. By maintaining a neural network inside its memory, the MU is enabled to intelligently select an offloading object among the accessible base stations. Wei et al. in [7] introduced a reinforcement learning algorithm to address the offloading problem in the IoT scenario, based on which they further proposed a value function approximation method in an attempt to accelerate the learning speed. In [8], Min et al. further considered the privacy factors in healthcare IoT offloading scenarios and proposed a reinforcement learning-based scheme for privacy-ensured data offloading.

Refs. [6]–[9] mentioned above are all based on the Markov Decision Process (MDP), in which the action scheduling is performed periodically at a fixed time interval. However, in a real scenario, the workflow of an agent (server or MU) is usually event-driven. To be specific, the agent is supposed to promptly make the schedule and perform actions once an event has occurred (e.g., a request arrives), not wait until a fixed periodicity is met.
Motivated by the gap between current research and the agent's real working pattern, we consider the Semi-Markov Decision Process (SMDP) as our basic model instead. Unlike the conventional MDP model, in an SMDP model the time interval between two sequential actions does not necessarily need to be the same. As such, it is well suited to an event-triggered schedule, like the one we need for request scheduling.

SMDP is a powerful modeling tool that has been applied in many fields, such as wireless networks, operations management, etc. In [10], Zheng et al. first applied SMDP to the scenario of vehicular cloud computing systems. A discounted reward is adopted in their system model, based on which the authors further proposed a value iteration method to derive the optimal policy. In [11], SMDP is first applied to energy harvesting wireless networks. For problem-solving, the authors adopted a model-based policy
iteration method. However, the model-based solution proposed in the above work cannot address the problem when the state transition probability is unknown; in addition, it cannot deal with the well-known curse-of-dimensionality issue.

To fix this gap, based on an SMDP model that is exclusively designed for an NB-IoT edge computing system, Lei et al. in [12] further proposed a reinforcement learning-based algorithm. However, the proposed algorithm is still too restricted and cannot be applied to our currently studied problem, since several assumptions (e.g., exponential sojourn time between events) must be made in advance. Normally, these assumptions are inevitable for the derivation of state-value or policy-value estimation in an SMDP model, but unfortunately, they could be the main culprit leading to great divergence between theory and reality.

As a result, we in this paper jettison all the assumptions typically presented in an SMDP formulation. Alternatively, we adopt a Double Deep Q Network (DDQN) (see [13]) to predict the state-action value function and to derive a near-optimal policy based on it. Our method is a more general solution for an SMDP model and is designed mostly from a practical and usable standpoint.

Fig. 1: System architecture for an intermittent-energy-supplied MEC system.
3 Problem Formulation
In this paper, we target the optimization problem in a multi-user, single-server MEC system that is driven by an intermittent energy supply (e.g., renewable energy, wireless charging). As depicted in Fig. 1, multiple central processing units (CPUs) and a limited-capacity battery, together with communication and energy harvesting modules, physically constitute an MEC server. Our proposed MEC system schedules requests following an event-driven workflow, i.e., our request scheduling process is evoked immediately once a request is collected. This event-driven process can be specified by the following steps:
1) Collect the request and check the current system status, e.g., the battery and energy reservation status and the CPU core status.
2) Schedule the incoming request based on its characteristics and the current system status. Explicitly, the scheduler is supposed to make the following decisions:
   a) Decide whether to accept the request or not.
   b) If accepted, decide the processing frequency of the request. We refer to the processing frequency as the frequency that a CPU core might turn to while processing this particular request. The scheduler should not choose a frequency that might potentially over-reserve the currently available energy or overload the available cores.
3) Reserve energy for the request based on the decided frequency. A CPU core cannot use more than the reserved energy to process this request.
4) The request is scheduled to the corresponding CPU core and processing starts.

Given that the system we study is not powered by a reliable energy supply (e.g., coal power), a careful plan of the available energy is supposed to be made in order to promote the system performance.
In this subsection, we shall formally introduce the optimization problem by a rigorous mathematical formulation (key notations and their meanings are given in Table 1).

TABLE 1: Key notations for problem formulation
  Notation              Meaning
  a_i                   action scheduled for the i-th request
  f_n                   n-th processing frequency option (GHz)
  d_i                   data size of the i-th request (bits)
  ν                     computation complexity per bit
  κ                     effective switched capacitance
  m                     number of CPU cores
  τ_{a_i}               processing time of the i-th request
  e_{a_i}               energy consumption of the i-th request
  B_i                   battery status when the i-th request arrives
  B_max                 maximum battery capacity
  λ_{i,i+1} / ρ_{i,i+1} captured/consumed energy between arrivals of the i-th and (i+1)-th requests
  S_i                   reserved energy when the i-th request arrives
  Ψ_i                   number of working CPU cores when the i-th request arrives
  η                     tradeoff parameter

We consider the optimization problem for an MEC server with m CPU cores, all of whose frequencies can be adaptively adjusted via the Dynamic Voltage and Frequency Scaling (DVFS) technique. We first specify the "action" for the optimization problem.
The decision (or action) in this system specifies the treatment of the incoming request. Explicitly, it specifies 1) whether to accept the request or not, and 2) the processing frequency of the request. Formally, let i index an incoming request by its arrival order. The action for the i-th request is denoted by a_i. Explicitly, we note that:
1) a_i ∈ {1, 2, . . . , n} denotes the action index of the CPU frequency at which the request is scheduled, where n represents the maximum index (or the total number of potential options) for frequency adjustment. For option n, an f_n GHz frequency will be scheduled to the request.
2) Specially, when a_i = 0, the request will be rejected immediately.

The action we take has a direct influence on the request processing time, which can be specified by:

  τ_{a_i} = ν · d_i / f_{a_i}   if a_i ∈ {1, . . . , n};   τ_{a_i} = 0   if a_i = 0    (1)

where we denote d_i as the processing data size of an offloading request and ν as the required CPU cycles for computing one bit of the offloaded data. Without loss of generality, we assume d_i, i.e., the processing data size, to be a stochastic variable, while regarding ν, i.e., the required CPU cycles per bit, as fixed for all requests.

Also, by specifying the processing frequency, we can derive the energy consumption for processing the request, which can be given as:

  e_{a_i} = κ f_{a_i}^2 · ν · d_i   if a_i ∈ {1, . . . , n};   e_{a_i} = 0   if a_i = 0    (2)

where we denote κ as the effective switched capacitance of the CPUs.

Recall that the MEC server is driven by an intermittent energy supply, which indicates that the server has to store the captured energy in its battery for future use. Concretely, we shall introduce a virtual queue, which serves as a measurement of the server's current battery status. Formally, we let a virtual queue, whose backlog is denoted by B_i, capture the energy status when the i-th request arrives. B_i evolves following this rule:

  B_{i+1} = min{ B_i + λ_{i,i+1} − ρ_{i,i+1}, B_max }    (3)

Here,
1) λ_{i,i+1} represents the amount of energy that was captured by the server between the arrivals of the i-th and (i+1)-th requests.
2) ρ_{i,i+1} is the consumed energy during the same interval. It is notable that the consumed energy, i.e., ρ_{i,i+1}, is highly related to the frequencies of the currently running cores and, for that reason, is also relevant to the past actions, regarding the fact that it is the past actions that determine the frequencies of the currently running cores.
3) B_max is the maximum capacity of the battery. We use this value to cap the battery status, since the energy in store cannot exceed the full capacity of the server's battery.

Upon receiving each coming request, we shall reserve the corresponding amount of energy for it so that the server has enough energy to finish this request. This reservation mechanism is essential in maintaining the stability of our proposed system. To be specific, we first have to construct a virtual queue to record the energy reservation status. The energy reservation queue, with backlog S_i, evolves between request arrivals following this rule:

  S_{i+1} = max{ S_i + e_{a_i} − ρ_{i,i+1}, 0 }    (4)

where ρ_{i,i+1} is the consumed energy (same definition as in Eq. (3)) and e_{a_i} is the energy consumption for the i-th request (same definition as in Eq. (2)). By this means, S_{i+1} exactly measures how much energy has been reserved by the 1-st to the i-th requests.
1. Some may wonder about the rationale behind the subtraction of ρ_{i,i+1}. Consider the case that at a specific timestamp (e.g., the timestamp at which the i-th request arrives), the processing of a past request (e.g., the (i−1)-th request) could be unfinished, but has already consumed some of the reserved energy. Then, the energy reserved for it should subtract the already-consumed part.
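To make the bookkeeping in Eqs. (1)–(4) concrete, the following minimal Python sketch simulates the two virtual queues between consecutive request arrivals. It is an illustrative reading of the formulas, not the authors' implementation; the function and variable names (e.g., `step_queues`) are our own.

```python
# A minimal sketch of the dynamics in Eqs. (1)-(4).
# Names are illustrative; nu = required CPU cycles per bit,
# kappa = effective switched capacitance, freqs[a-1] = f_a.

def processing_time(a, d, freqs, nu):
    """Eq. (1): tau = nu * d / f_a for a >= 1, 0 for rejection (a == 0)."""
    return 0.0 if a == 0 else nu * d / freqs[a - 1]

def energy_consumption(a, d, freqs, nu, kappa):
    """Eq. (2): e = kappa * f_a^2 * nu * d for a >= 1, 0 for rejection."""
    return 0.0 if a == 0 else kappa * freqs[a - 1] ** 2 * nu * d

def step_queues(B, S, e_a, lam, rho, B_max):
    """Advance the battery backlog (Eq. (3)) and the reservation
    backlog (Eq. (4)) from the i-th arrival to the (i+1)-th arrival.
    lam: energy harvested in between; rho: energy consumed in between."""
    B_next = min(B + lam - rho, B_max)
    S_next = max(S + e_a - rho, 0.0)
    return B_next, S_next
```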
By maintaining the energy reservation status, we can specify an energy constraint to control the action when the energy has already been fully reserved. The constraint can be given as follows:

  S_i + e_{a_i} ≤ B_i    (5)

By this constraint, we ensure that the request has to be rejected if not enough (unreserved) energy is available.

Recall that we assume a total number of m CPU cores are available for request processing. When the computation resources are already fully loaded, the system has no option but to reject the arriving requests. So, we introduce the following constraint:

  Ψ_i + I{a_i ≠ 0} ≤ m    (6)

where Ψ_i denotes the number of currently working CPU cores. By this constraint, we ensure that the server cannot be overloaded by the accepted requests.

Now we shall formally introduce the problem that we aim to optimize, which is given as follows:
(P1)
  Obj1: min_{a_i} Σ_{i=1}^{∞} τ_{a_i}
  Obj2: max_{a_i} Σ_{i=1}^{∞} I{a_i ≠ 0}
  s.t.  C1: S_i + e_{a_i} ≤ B_i
        C2: Ψ_i + I{a_i ≠ 0} ≤ m    (7)

There are two objectives that we need to consider in this problem: 1) the cumulative processing time of the system, and 2) the cumulative acceptance. Besides, two constraints, i.e., the Energy Constraint and the Resource Constraint, have been covered in our system constraints.

Obviously, P1 is unsolvable due to the following facts:
1) Existence of stochastic variables. Stochastic variables, e.g., d_i (data size of requests), λ_{i,i+1} (energy supplied), and the request arrival rate, persist in P1.
2) Multi-objective quantification. The two objectives considered in our problem are mutually exclusive, and there is not an explicit quantification between them.

To bridge the gap, we 1) set up a tradeoff parameter to balance the two objectives and transform the problem into a single-objective optimization problem, and 2) transform the objective function into an expected form. This leads to our newly formulated P2:

(P2)
  max_{a_i} Σ_{i=1}^{∞} E[ I{a_i ≠ 0} − η τ_{a_i} ]
  s.t.  C1: S_i + e_{a_i} ≤ B_i
        C2: Ψ_i + I{a_i ≠ 0} ≤ m    (8)

where η serves as the tradeoff parameter to balance cumulative acceptance and cumulative processing time. P2 is more concrete after the transformation, but still, we encounter the following challenges when solving P2:
1) Exact constraints. Both C1 and C2 are exact constraints that should be strictly respected for each request. This completely precludes the possibility of applying an offline deterministic policy to solve the problem, noticing that B_i and S_i in C1 and Ψ_i in C2 are all stochastic for each request.
2) Unknown and non-stationary supplied pattern of energy. The supplied pattern of intermittent energy is highly discrepant, varying from place to place and hour to hour. As such, the stochastic process λ_{i,i+1} in Eq. (3) may not be stationary, i.e., λ_{i,i+1} samples from an ever-changing stochastic distribution that is relevant to the request order i, and moreover, it is unknown to the scheduler.
3) Unknown and non-stationary arrival pattern. The arrival pattern of the requests is unknown to the system and, likewise, could be highly discrepant at temporal scales.
4) Unknown stochastic distribution of requests' data size. The processing data size could vary between requests and is typically unknown to the scheduler.
5) Coupling effects of actions between requests. The past actions towards an earlier request might have a direct effect on the later scheduling process (see the evolvement of S_i and B_i).

Regarding the above challenges, we have to resort to an online optimization solution for problem-solving. The solution is expected to learn the stochastic pattern from historical knowledge and, meanwhile, strictly comply with the system constraints.

4 Deep Reinforcement Learning-Based Solution
To develop our reinforcement learning solution, we shall first transform the problem into a Semi-Markov Decision Process (SMDP) formulation. In our SMDP formulation, we regard the system state as the system running status at the moment a request has come, or, in other words, when the scheduler is supposed to take an action. After an action is taken by the scheduler, a reward is achieved, and the state (or system status) correspondingly transfers. The process is event-driven, that is, the scheduler decides the action once a request has come but does nothing while awaiting. As a result, the time interval between two sequential actions may not be the same. This is the core characteristic of an SMDP model and is the major difference from a normal MDP model.

Now we shall specify the three principal elements (i.e., states, actions, and rewards) in our SMDP formulation in sequence. Please note that most of the notations we use henceforth are consistent with those in Section 3.
2. Consider the energy harvest pattern of solar power: there is a significant difference in harvest magnitude between day and night.
A system state is given as a tuple:

  s_i ≜ { T_i, B_i, S_i, ψ_{i,1}, . . . , ψ_{i,n}, d_i }    (9)

Explicitly,
1) T_i is the timestamp at which the i-th request arrives at the MEC server. We incorporate this element into our state in order to accommodate the temporal factor that persists in the energy and request patterns.
2) B_i is the battery status, as specified in Eq. (3).
3) S_i is the energy reservation status, as specified in Eq. (4).
4) ψ_{i,n} denotes the number of CPU cores running at the frequency of f_n GHz. Meanwhile, it is intuitive to see that Ψ_i = Σ_{n'=1}^{n} ψ_{i,n'}, where Ψ_i is the total number of running cores specified in Eq. (6).
5) d_i is the data size of the i-th request.

The information captured by a state could be comprehended as the current system status that might support the action scheduling for the incoming i-th request. More explicitly, we argue that the formulated states should at least cover the system status that helps construct a possible action set, but could cover more types of informative knowledge to support the decision (e.g., T_i in our current formulation, which will be formally analyzed later). The concept of a possible action set will be given in our explanation of system actions.

Once a request arrives, given the current state (i.e., the observed system status), the scheduler is supposed to take an action, deciding the treatment of the request. As indicated by the system constraints C1 and C2 in P2, we are supposed to place a restriction on the to-be-taken action. By specifying the states (or observing the system status), it is not difficult to find that we can indeed obtain a closed-form possible action set, as follows:

  a_i ∈ A_{s_i} ≜ { a | Ψ_i + I{a ≠ 0} ≤ m, S_i + e_a ≤ B_i }    (10)

where Ψ_i, B_i, and S_i are all covered in our state formulation. By taking actions from the defined possible action set, we address challenge 1), exact constraints, that we specified below P2.

Given a state and an action, a system reward will be incurred, in the following form:

  r(s_i, a_i) = I{a_i ≠ 0} − η τ_{a_i}    (11)

The reward for different treatments of a request is consistent with our objective formulation in P2 (see Eq. (8)). The goal of our MDP formulation is to maximize the expected achieved rewards, which means that we aim to maximize Σ_{i=1}^{∞} E[ I{a_i ≠ 0} − η τ_{a_i} ], the same form as the objective in P2. Besides, here we can informally regard I{a_i ≠ 0} as a "real reward" for accepting a request and τ_{a_i} as the "penalty" for processing a request.

After an action is taken, the state will transfer to another one when the next request arrival occurs. The transition probability to a specific state is assumed to be stationary given the current state and the current action, i.e., we assume that:

  p(s_{i+1} | s_i, a_i) = constant    (12)

This assumption is the core part of our MDP formulation, and it is our main motivation for covering T_i and ψ_{i,1}, . . . , ψ_{i,n} (rather than Ψ_i) in our state formulation. After all, we hope that the state transition probability, i.e., p(s_{i+1} | s_i, a_i), is indeed a constant. Explicitly, given a_i, T_i, ψ_{i,1}, . . . , ψ_{i,n}, and d_i, we expect the probability that s_i transfers to s_{i+1} to be a fixed constant. To reach this goal, given s_i, we need to make sure that the corresponding elements (e.g., B_{i+1}, S_{i+1}, and ψ_{i+1,1}, . . . , ψ_{i+1,n}) in state s_{i+1} follow the same joint stationary distribution. A stationary distribution here means that, given s_i, the distribution is no longer relevant to the request order i.
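Before proceeding to the stationarity observations, we pause for a concrete illustration of the possible action set (Eq. (10)) and the reward (Eq. (11)) just defined. The sketch below is a direct reading of the two formulas under the same notation; the function names are ours, and rejection (a = 0) is taken to be always feasible.

```python
# Sketch of the possible action set (Eq. (10)) and reward (Eq. (11)).
# freqs[a-1] is f_a; action 0 is rejection.

def possible_actions(Psi, B, S, d, freqs, nu, kappa, m):
    """Return A_s: a frequency option a is feasible if a core is free
    (Psi + 1 <= m) and its reserved energy fits the budget (S + e_a <= B)."""
    actions = [0]  # rejection always satisfies both constraints
    for a in range(1, len(freqs) + 1):
        e_a = kappa * freqs[a - 1] ** 2 * nu * d   # Eq. (2)
        if Psi + 1 <= m and S + e_a <= B:
            actions.append(a)
    return actions

def reward(a, d, freqs, nu, eta):
    """Eq. (11): +1 for acceptance, minus eta times processing time."""
    if a == 0:
        return 0.0
    return 1.0 - eta * (nu * d / freqs[a - 1])     # uses Eq. (1)
```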
To show this desirable property, we first need to show the following:

Observation 1 (Stationary Energy Arrival Given State). By specifying T_i, the captured energy between two sequential requests (i.e., λ_{i,i+1}) can roughly be regarded as a sample from a stationary distribution, i.e., the value of the r.v. λ_{i,i+1} is not relevant to i (the order of the request) given T_i.

Remark. This assumption is derived from our experience: many intermittent energy supply sources, such as solar power and wind power, have a clear diurnal pattern, but if we fix the time to a specific timestamp in a day, the energy harvest can roughly be regarded as a sample from a fixed distribution. Informally, we assume a fixed energy harvest rate for a fixed timestamp, e.g., a 1000 Joules/s harvest rate at 3:00 pm, and this distribution is no longer relevant to i. In this way, we respond to the non-stationarity issue in challenge 2), unknown and non-stationary supplied pattern of energy, that we previously proposed below P2. Similarly, one can easily extend the state formulation by covering more system status (e.g., the location of the MEC server, if we aim to train a general model for a multiple-MEC-servers scenario) to yield a more accurate estimation.

Observation 2 (Stationary Request Arrival Given State). By specifying T_i, the time interval between request arrivals should also be considered as a sample from a stationary distribution (though still unknown). This means that the time interval between two arrivals (denoted by t_{i,i+1}) is no longer relevant to i if given T_i.

Remark. By this state formulation, we address the non-stationarity issue in challenge 3), unknown and non-stationary arrival pattern, that we previously proposed below P2.
Observation 3 (Stationary Energy Consumption Given State). By specifying a_i, ψ_{i,1}, . . . , ψ_{i,n}, and a stationary time interval between two arrivals, the consumed energy between two requests (i.e., ρ_{i,i+1}) can roughly be regarded as a sample from a stationary distribution.

Remark. Our explanation for Observation 3 is that the consumed energy ρ_{i,i+1} is actually determined by the current CPU frequencies as well as the time interval between requests (i.e., t_{i,i+1}). Explicitly, we can roughly estimate that ρ_{i,i+1} = Σ_{n'=1}^{n} (ψ_{i,n'} + I{a_i = n'}) · κ f_{n'}^3 · t_{i,i+1}. But this calculation is not 100% accurate, since a CPU core could have gone to sleep before the arrival of the next request, as a result of finishing the processing of a request. As a refinement, we further assume ρ_{i,i+1} = Σ_{n'=1}^{n} (ψ_{i,n'} + I{a_i = n'}) · κ f_{n'}^3 · t_{i,i+1} − y(a_i, ψ_{i,1}, . . . , ψ_{i,n}, t_{i,i+1}), where y(a_i, ψ_{i,1}, . . . , ψ_{i,n}, t_{i,i+1}) is a stochastic noise caused by the halfway sleeping of CPU cores. Without loss of generality, y(a_i, ψ_{i,1}, . . . , ψ_{i,n}, t_{i,i+1}) is assumed to be a sample from a stationary distribution given a_i, ψ_{i,1}, . . . , ψ_{i,n}, and t_{i,i+1}. By this assumption, we state that ρ_{i,i+1} is stationary if given a_i, ψ_{i,1}, . . . , ψ_{i,n}, and a stationary t_{i,i+1}.

Observation 4 (Stationary Battery and Reservation Status Given State).
By specifying s_i and a_i, B_{i+1} and S_{i+1} can roughly be regarded as samples from two stationary distributions.

Remark. From Eqs. (3) and (4), we find that B_{i+1} and S_{i+1} are relevant to λ_{i,i+1}, ρ_{i,i+1}, e_{a_i}, and s_i. As per Observations 1 and 3, λ_{i,i+1} and ρ_{i,i+1} are both stationary given s_i and a_i. In addition, we note that e_{a_i} is stationary given the same condition, since d_i is assumed to be a sample from a stationary distribution and there is no other stochastic factor in Eq. (2).

Observation 5 (Stationary Core Status Given State).
By specifying s_i and a_i, ψ_{i+1,1}, . . . , ψ_{i+1,n} can roughly be regarded as samples from a stationary distribution.

Remark. Here we simply regard that ψ_{i+1,n} = ψ_{i,n} + I{a_i = n} − z(ψ_{i,n}, t_{i,i+1}), where z(ψ_{i,n}, t_{i,i+1}) denotes the reduction in the number of active cores at f_n GHz during the t_{i,i+1} interval. Instinctively, we feel that the reduction, i.e., z(ψ_{i,n}, t_{i,i+1}), should at least have some sort of bearing on the running core status, i.e., ψ_{i,n}, and the time interval between requests, i.e., t_{i,i+1}. Informally, we might simply imagine that for each time unit, an expected reduction of α_n · ψ_{i,n} would possibly be incurred, where α_n should only be relevant to the evaluated core frequency (i.e., f_n). Then, following this strand of reasoning, z(ψ_{i,n}, t_{i,i+1}) should be a stochastic variable, and it can roughly be regarded as a sample from a stationary distribution if given ψ_{i,n} and t_{i,i+1}.

Observation 6 (Stationary Time and Data Size Given State).
By specifying s_i and t_{i,i+1}, the timestamp at which the next request comes, i.e., T_{i+1} = T_i + t_{i,i+1}, is naturally stationary. Besides, the data size of a request, i.e., d_{i+1}, is naturally assumed to be a sample from a stationary distribution (i.e., not relevant to the arrival order i).

Combining Observations 4, 5, and 6, we roughly infer that there should be a unique constant specifying the transition probability from one state to another. However, we note that this conclusion is not formal in a very rigorous sense (and it does not necessarily have to be!), given the complexity that exists in the analysis as well as the not-necessarily-100%-correct assumptions we make in the observations. Our real intention in discussing the stationarity property is actually to render the readers an instruction about state formulation, and to show how to extend the state in a different application case.
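As a small aid to Observation 3, the sketch below computes the rough estimate of the inter-arrival energy consumption ρ_{i,i+1} from the core status. It uses the κf³ power model implied by Eq. (2), leaves out the stochastic correction y(·), and all names are illustrative assumptions rather than the authors' code.

```python
# Rough estimate of rho_{i,i+1} per Observation 3, ignoring the
# stochastic correction y(.) for cores that sleep halfway.

def estimate_rho(psi, a, t_gap, freqs, kappa):
    """psi[n-1] = number of cores running at f_n when request i arrives;
    a = action for request i; t_gap = t_{i,i+1} in seconds.
    Each active core at f_n is assumed to draw kappa * f_n^3 power."""
    rho = 0.0
    for n, f in enumerate(freqs, start=1):
        cores = psi[n - 1] + (1 if a == n else 0)
        rho += cores * kappa * f ** 3 * t_gap
    return rho
```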
In this section, we shall show the reader a simplified working procedure of our proposed SMDP model. As shown in Fig. 2, we consider three possible frequency adjustment options for each incoming request, i.e., a_i ∈ {0, 1, 2, 3}, where action a_i = 0 means rejection and actions a_i = 1, 2, 3 respectively correspond to a frequency setting of f_1, f_2, f_3. Our example basically illustrates the following workflow:
1) Before the 1-st request's arrival, a specific amount of energy (i.e., acquired energy, represented by the green bar) has been collected by the energy harvesting modules. Then, at the instant of the 1-st request's arrival, the scheduler decides to take action a_1 = 1 based on the current state. Once the action is taken, a) a sleeping core is woken and starts to run at the frequency of f_1, b) a specific amount of energy is reserved for the processing of this request, and c) the acquired energy is officially deemed stored energy. Correspondingly, the state instantly transfers to a virtual post-action state, i.e., a) the first bit in the core status (i.e., the number of running cores at the frequency of f_1) is correspondingly flipped to 1, b) the reserved energy is updated, and c) the newly acquired energy is absorbed into the stored energy.
2) Then, the state (i.e., the system status) continues to evolve over time: the running cores continuously consume the reserved energy and the stored energy (the same amount is consumed from the reserved energy and the stored energy). The evolvement of the state pauses when the second request arrives. After observing the current system status (i.e., the current state), the scheduler takes action a_2 = 2 this time, and the state similarly transfers to the post-action state.
3) Again, after the action is taken, the state evolves over time, and within this time interval, a core (at frequency f_1) finishes its task. Predictably, when the 3-rd request arrives, a state with core status (0, 1, 0) is observed, and since the available energy (the sum of acquired energy and stored energy) is not sufficient for processing this request, the scheduler has no choice but to reject the request, which makes a_3 = 0. This time, the post-action state does not experience a substantial change compared with the observed state (except that the acquired energy has been absorbed).
4) The same evolvement continues, and the same action and state transformation process is repeated for the later requests.

Please note that in our former formulation, we do not involve the consideration of the post-action state, since we only care about the state transformation between two formal states s_i and s_{i+1}. We present the concept of the post-action state here mainly in a bid to render the readers a whole picture of how the system states might evolve over time.

3. Note that in our formal state formulation (see Eq. (9)), we do not distinguish between acquired energy and stored energy, but are only interested in the battery status, which is the sum of these two terms. We make this distinction in this example mainly to support our illustration of the key idea.

Fig. 2: Illustration of the event-driven SMDP model.
Recall that our ultimate goal is to maximize the expected cumulative reward, which means that we need to find a deterministic optimal policy π̃* such that:

  π̃*(s) = argmax_{a ∈ A_s} Q̃_{π̃*}(s, a)    (13)

where

  Q̃_π̃(s, a) = E_{s_2, s_3, ...} [ r(s_1, a_1) + Σ_{i=2}^{∞} r(s_i, π̃(s_i)) | s_1 = s, a_1 = a ]    (14)

represents the expected cumulative rewards, starting from an initial state s and an initial action a. π̃* is a deterministic policy that promises us the best action for accumulating expected rewards. However, it is intuitive to find that Q̃_π̃(s, a) is not a convergent value no matter how the policy π̃ is defined (see the summation function), which makes it meaningless to derive π̃* in this form.

To address this issue, we alternatively define a discounted expected cumulative reward, in the following form:

  Q_π(s, a) = E_{s_2, s_3, ...} [ r(s_1, a_1) + Σ_{i=2}^{∞} β^{i−1} r(s_i, π(s_i)) | s_1 = s, a_1 = a ]    (15)

where 0 < β < 1 is the discount factor and s_1 is an initial state. We instead need to find a near-optimal policy π* such that:

  π*(s) = argmax_{a ∈ A_s} Q_{π*}(s, a)    (16)

where s is an initial state. Q_π(s, a) is usually referred to as the state-action value function, and each Q_π(s, a) regarding a different policy π has a finite convergent value, which makes it concrete to find Q_{π*}(s, a). As long as we know Q_{π*}(s, a), we are allowed to derive the near-optimal policy π* by Eq. (16). Moreover, Q_π(s, a) still conserves certain information, even after our discounting of future rewards: we discount much of the reward that might be obtained in the distant future but not that much of the near one, so it actually partially reveals the exact value of a state-action pair (i.e., the cumulative rewards that might be gained in the future). Actually, we can state that π* ≈ π̃*. Later in our analysis, we alternatively search for π* as our target.

To derive Q_{π*}(s, a), we need to notice that:

  Q_{π*}(s, π*(s)) = max_{a ∈ A_s} Q_{π*}(s, a)    (17)

The above result follows from our definition of π* in Eq. (16). Besides, as per Eq. (15), Q_π(s, a) can indeed be rewritten in the following form:

  Q_π(s, a) = E_{s_2} [ r(s_1, a_1) + β Q_π(s_2, π(s_2)) | s_1 = s, a_1 = a ]    (18)

Plugging π* into Eq. (18) yields:

  Q_{π*}(s, a) = E_{s_2} [ r(s_1, a_1) + β Q_{π*}(s_2, π*(s_2)) | s_1 = s, a_1 = a ]    (19)
And plugging Eq. (17) into Eq. (19), we have:

  Q_{π*}(s, a) = E_{s_2} [ r(s_1, a_1) + β max_{a_2 ∈ A_{s_2}} Q_{π*}(s_2, a_2) | s_1 = s, a_1 = a ]    (20)

As per the Markov and temporally homogeneous property of an SMDP, we have:

  Q_{π*}(s, a) = E_{s'} [ r(s, a) + β max_{a' ∈ A_{s'}} Q_{π*}(s', a') ]    (21)

where s' is a random variable (r.v.) representing the next state that the current state s will transfer to. This equation is called the Bellman optimality, and it indeed specifies the value of taking a specific action, which is basically composed of two parts:
1) The first part is the expected reward that is immediately obtained after the action has been taken, embodied by E[r(s, a)].
2) The second part is the discounted future expected reward after the action has been taken, embodied by E[β max_{a' ∈ A_{s'}} Q_{π*}(s', a')]. This literally addresses challenge 5), coupling effects of actions between requests, that we proposed below P2. The coupling effects are in fact embodied by these discounted expected rewards: by taking a different action, the state might experience a different transition probability, thereby affecting the future rewards. By this MDP formulation, we are enabled to have a concrete model of this coupling effect on the future.

Besides, as specified by the challenges proposed below P2, we know that both the energy and request arrival patterns, as well as the data size distribution, are all unknown to the scheduler, which means that it is hopeless to derive the closed-form Q_{π*}(s, a). More explicitly, since the state transition probability p(s' | s, a) (or p(s_{i+1} | s_i, a_i) in Eq. (12)) is an unknown constant, we cannot expand the expectation in Eq. (21), making Q_{π*}(s, a) unachievable in an analytical way. Without knowledge of Q_{π*}(s, a), we are unable to fulfill our ultimate goal, i.e., to derive π*.

Fortunately, there still exists an alternative path to derive Q_{π*}(s, a). (Henceforth, we use Q(s, a) to denote Q_{π*}(s, a) for the sake of brevity.) If we have a sufficient amount of data over a specific state and action (captured by a set X_{s,a}), each piece of which is a 4-element tuple x ≜ (s, a, r(s, a), s'), then we can estimate Q(s, a) by:

  Q(s, a) ≈ (1 / |X_{s,a}|) Σ_{x ∈ X_{s,a}} [ r(s, a) + β max_{a' ∈ A_{s'}} Q(s', a') ]    (22)

where | · | returns the cardinality of a set. Equivalently, we aim to minimize the estimation error, i.e., to minimize the following term:

  (1 / |X|) Σ_{x ∈ X} [ Q(s, a) − ( r(s, a) + β max_{a' ∈ A_{s'}} Q(s', a') ) ]^2    (23)

where X = X_{s_1,a_1} ∪ · · · ∪ X_{s_1,a_n} ∪ . . . captures all the available data. In this way, we do not need prior knowledge of the transition probability, i.e., p(s' | s, a); rather, we learn it from the real state transition data. This learning-based method completely resolves the unknown-distribution issues we raised below P2.

Unfortunately, this initial idea is still unworkable. The major impediment is that there is an infinite number of states in our formulated problem! We simply cannot iteratively achieve Q(s, a) for every possible state-action pair, but at least this initial idea points out a concrete direction that elicits our later double deep Q network solution.
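The estimation in Eqs. (22)–(23) can be read as a plain averaging of TD targets over transition tuples (s, a, r, s'). The sketch below shows that reading for a hypothetical finite state space with a tabular Q; it is only a stepping stone to the network-based version, and all names are ours.

```python
# Tabular reading of Eqs. (22)-(23): average the TD targets over the
# collected tuples (s, a, r, s'). Only workable for small state spaces.

from collections import defaultdict

def update_tabular_q(Q, data, possible_actions, beta):
    """Q: dict mapping (s, a) -> value (e.g., a defaultdict(float));
    data: list of (s, a, r, s_next); possible_actions(s): the set A_s.
    Performs one sweep of the empirical-mean estimate in Eq. (22)."""
    targets = defaultdict(list)
    for s, a, r, s_next in data:
        best_next = max(Q[(s_next, a2)] for a2 in possible_actions(s_next))
        targets[(s, a)].append(r + beta * best_next)
    for key, vals in targets.items():
        Q[key] = sum(vals) / len(vals)  # empirical mean of the targets
    return Q
```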
In the previous subsection, we provided an initial idea about how to derive Q(s, a) in a learning way. But we simply cannot record Q(s, a) for each state due to the unbounded state space. This hidden issue motivates us to use a neural network to predict the optimal state-action value (abbreviated as the Q value henceforth). Explicitly, we input a specific state to the neural network, and it outputs the Q values for actions 0 to n. By this means, we do not need to record the Q value; instead, it is estimated based on the output after going through the neural network.

Formally, the estimated Q value can be denoted by Q(s, a; θ_i), where θ_i denotes the parameters of the neural network (termed the Q network henceforth) after i steps of training. Following the same idea we proposed before, we expect to minimize the estimation error (or loss henceforth), in this form:

  L_i(θ_i) = (1 / |X̃_i|) Σ_{x ∈ X̃_i} [ Q(s, a; θ_i) − ( r(s, a) + β max_{a' ∈ A_{s'}} Q(s', a'; θ_i) ) ]^2    (24)

where X̃_i is a subset of the total available data when training at the i-th step, often referred to as a mini-batch. We introduce this concept here since the data acquisition process is an online process. In other words, we do not have all the training data from the start, but obtain it through continuous interaction (i.e., taking actions) and continuous policy updates. As a result of this continuous update of data, it is simply not "cost-effective" to involve all the historical data we currently have in every step of training. As a refinement, only a subset of the data, i.e., X̃_i, is involved in the loss back-propagation for each step of training.

However, the above-defined loss, though quite intuitive, would possibly lead to training instability of the neural network, since it is updated too radically (see [14]). As per [13], a double-network solution ensures more stable performance. In this solution, we introduce another network, known as the target network, whose parameters are denoted by θ_i^−. The target network has exactly the same architecture as the Q network, and its parameters are overridden periodically by those of the Q network. Informally, it serves as a "mirror" of the past Q network, which significantly reduces the training variation of the current Q network. The rewritten loss function after adopting the double-network architecture is:

  L_i(θ_i) = (1 / |X̃_i|) Σ_{x ∈ X̃_i} [ Q(s, a; θ_i) − ( r(s, a) + β Q(s', argmax_{a' ∈ A_{s'}} Q(s', a'; θ_i); θ_i^−) ) ]^2    (25)

where θ_i^− denotes the parameters of the target network.

Now we shall formally introduce our proposed solution, termed Neural network-based Adaptive Frequency Adjustment (NAFA), whose running procedures in the training and application stages are respectively shown in Algorithm 1 and Algorithm 2.
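Before the pseudocode, here is a minimal PyTorch sketch of the double-network loss in Eq. (25). It assumes batched tensors and a feasibility mask encoding A_{s'} (infeasible actions are excluded from the argmax); the helper names are ours, not the authors' code.

```python
import torch
import torch.nn.functional as F

def ddqn_loss(q_net, target_net, s, a, r, s_next, feasible_mask, beta):
    """Eq. (25). s, s_next: [B, state_dim]; a: [B] long; r: [B] float;
    feasible_mask: [B, n_actions] bool encoding A_{s'} (True = allowed)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta_i)
    with torch.no_grad():
        q_next = q_net(s_next)                             # argmax by online net
        q_next = q_next.masked_fill(~feasible_mask, float("-inf"))
        a_star = q_next.argmax(dim=1, keepdim=True)
        # ...evaluated by the target net (theta_i^-)
        target = r + beta * target_net(s_next).gather(1, a_star).squeeze(1)
    return F.mse_loss(q_sa, target)
```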
Algorithm 1 Training Stage of NAFA
Input: initial/minimum exploration factor ε_0 / ε_min; discount factors for rewards/exploration β / ξ; tradeoff parameter η; learning rate γ; update periodicity of the target network ζ; steps per episode N_max; training episodes ep_max; batch size |X̃_i|; memory size X_max
Output: after-trained network parameters θ_final

  Initialize θ_i and θ_i^− with arbitrary values
  Initialize i = 1, ε = ε_0
  Initialize empty replay memory X
  Initialize the environment (or system status)
  for ep ∈ {1, 2, . . . , ep_max} do
      Wait until the first request comes
      repeat
          Observe the current system status s
          if random() < ε then
              Randomly select action a from A_s
          else
              a = argmax_{a ∈ A_s} Q(s, a; θ_i)
          end if
          Perform action a and realize reward r(s, a)
          Wait until the next request comes
          Observe the current system status s'
          Store (s, a, r(s, a), s') into replay memory X
          Sample minibatch X̃_i ∼ X
          Update parameters following
              θ_{i+1} = θ_i − γ ∇_{θ_i} L_i(θ_i)    (26)
          ε = ε_min + (ε_0 − ε_min) · exp(−i/ξ)
          if i mod ζ = 0 then
              θ_{i+1}^− = θ_{i+1}
          else
              θ_{i+1}^− = θ_i^−
          end if
          i = i + 1
      until i > ep · N_max
      Reset the environment (or system status)
  end for
  θ_final = θ_i

Overall, the running procedure of NAFA can be summarized as follows:
1) Initialization: NAFA first initializes the Q network with arbitrary values and sets the exploration factor to a pre-set value.
2) Iterated Training: We divide the training into several episodes. The environment (i.e., the system status) is reset to the initial stage each time an episode ends. In each episode, after the first request comes, the following sub-procedures are performed in sequence:
   a) Schedule Action: NAFA observes the current system status s (or state in our MDP formulation). Targeting the to-be-scheduled request (i.e., the i-th request), NAFA adopts an ε-greedy strategy. Explicitly, with probability ε, NAFA randomly explores the action space and randomly selects an action. With probability 1 − ε, NAFA greedily selects the action based on the current system status s and the current Q network's output.
   b) Interaction: NAFA performs the decided action a within the environment. A corresponding reward is realized as per Eq. (11). Note that the environment here could be the real operation environment of an MEC server, or a simulation environment that is set up for training purposes.
   c) Data Integration: After the interaction, NAFA sleeps until a new request arrives. Once evoked, NAFA checks the current status s' (regarded as the next state in the current iteration) and stores it into the replay memory, collectively with the current state s, the action a, and the realized reward r(s, a). The replay memory has a maximum size X_max. If the replay memory goes full, the oldest data is replaced by the new data.
   d) Update Q Network: A fixed batch size of data is sampled from the replay memory into the minibatch X̃_i. Then, NAFA performs back-propagation with learning rate γ on the Q network, using the samples within the minibatch X̃_i and the loss function defined in Eq. (25).
   e) Update Exploration Factor and Target Q Network: NAFA discounts ε using the discount factor ξ and updates the target network with a periodicity of ζ steps. After that, NAFA starts a new training iteration.
3) Application: After training has finished (i.e., it has gone through ep_max · N_max steps of training), the algorithm uses the obtained Q network (whose parameters are denoted by θ_final) to schedule requests in the real operation environment, i.e., based on the current state s_i, the action a_i = argmax_{a ∈ A_{s_i}} Q(s_i, a; θ_final) will be taken for each request.
Algorithm 2 Application Stage of NAFA
Input: after-trained network parameters θ_final
Output: action sequence a_1, a_2, . . .

  i = 1
  for each request that comes do
      Observe system status s_i
      a_i = argmax_{a ∈ A_{s_i}} Q(s_i, a; θ_final)
      Perform action a_i
      i = i + 1
  end for
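A minimal Python sketch of Algorithm 2 follows: greedy action selection over the feasibility set A_{s_i} using the trained network. The state encoding and the feasible-action list are assumed inputs (mirroring Eqs. (9) and (10)); all names are hypothetical.

```python
import torch

def select_action(q_net, state_vec, feasible, device="cpu"):
    """Application-stage rule of Algorithm 2: argmax over A_{s_i} of
    Q(s_i, a; theta_final). feasible: list of allowed action indices."""
    with torch.no_grad():
        s = torch.as_tensor(state_vec, dtype=torch.float32, device=device)
        q = q_net(s.unsqueeze(0)).squeeze(0)   # Q values for all actions
    return max(feasible, key=lambda a: q[a].item())
```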
5 Experiment

• Programming and running environment: We have implemented NAFA and the simulation environment on the basis of PyTorch and gym (a site package that is commonly used for environment construction in RL). Besides, all the computation in our simulation is run by a high-performance workstation (Dell PowerEdge T630 with 2x GTX 1080Ti).

• Simulation of energy arrival:
In our simulation, NAFA is deployed on an MEC server which is driven by solar power (a typical example of intermittent energy supply). To simulate the energy arrival pattern, we use data from HelioClim-3, a satellite-derived solar radiation database. Explicitly, we derive the Global Horizontal Irradiance (GHI) data in Kampala, Uganda (latitude 0.329, longitude 32.499) from 2005-01-01 to 2005-12-31 (for the training dataset) and from 2006-01-01 to 2006-12-31 (for the test dataset). Note that the GHI data for some days are missing in the dataset, so we use 300 out of 365 days of intact data respectively from the training and test datasets during our experiment. From the GHI data, we calculate the amount of arriving energy during a given time interval [t_1, t_2] as:

  λ_{t_1,t_2} = ∫_{t_1}^{t_2} panel_size · GHI(t) dt    (27)

where panel_size is the solar panel size of the MEC server. From λ_{t_1,t_2}, we can derive the energy arrival between two request arrivals, i.e., λ_{i,i+1}.

• Simulation of request arrival and data size:
We use a Poisson process (with arrival rate λ_r) to generate the request arrival events (a similar simulation setting is available in [15], [16]). The data size of a request follows a uniform distribution (a similar setting is available in [17], [18]) on the scale of 10 MB to 30 MB, i.e., d_i ∼ Uniform(10, 30) MB.

• Action space:
We consider 4 possible actions for each request (three correspond to different levels of frequency and one is the rejection action). Formally, for the i-th request, a_i ∈ {0, 1, 2, 3}, where actions {1, 2, 3} correspond to the processing frequencies {f_1, f_2, f_3} (in GHz) and action 0 induces request rejection.

TABLE 2: Simulation parameters
  Symbol       Meaning                           Value
  ν            computation complexity            2e4
  κ            effective switched capacitance    1e-28
  m            number of CPU cores               12
  B_max        battery capacity                  1e6 (Joules)
  panel_size   solar panel size                  0.5 (m²)

TABLE 3: Training hyper-parameters for NAFA
  Symbol   Meaning                              Value
  ε_0      initial exploration factor           0.5
  ε_min    minimum exploration factor           0.01
  ξ        discount factor for exploration      3e4
  γ        learning rate                        5e-4
  β        discount factor for rewards          0.995
  ζ        target network update periodicity    5000
  |X̃_i|    batch size                           80
  X_max    size of replay buffer                1e6

• Network Structure:
The target network and the Q network in NAFA have exactly the same structure: a Deep Neural Network (DNN) model with two hidden layers, with 200 and 100 neurons respectively, activated by ReLU, and an output layer, which outputs the estimated Q values for all 4 actions.

• Simulation parameters:
All of the simulation parameters are specified in Table 2.

• Hyper-parameters and training details of NAFA:
All hyper-parameters are available in Table 3. In our simulation, we use ep_max = 30 × … episodes to train the model. The simulation time for each episode of training is 10 straight days. We reset the simulation environment based on different GHI data once the last request within these 10 days has been scheduled. Besides, it is important to note that, since we only have 300 days of GHI data as our training dataset, we re-use the same 300 days of GHI data for the later training episodes.

5. So, in our real implementation, the training steps N_max are not necessarily identical for all the training episodes.

• Uniformization of states:
In our implementation of NAFA, we have uniformized all the elements' values in a state to the scale of [0, 1] before input to the neural network. We are prone to believe that such a uniformization might potentially improve the training performance.
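To make the setup concrete, here is a small sketch of how the energy-arrival integral in Eq. (27) and the Poisson request arrivals could be simulated. The GHI series, step size, and names are illustrative assumptions, not the authors' gym environment.

```python
import random

def energy_arrival(ghi, t1, t2, panel_size=0.5, dt=1.0):
    """Eq. (27) via a left Riemann sum: lambda_{t1,t2} =
    integral of panel_size * GHI(t) dt; ghi(t) returns W/m^2."""
    total, t = 0.0, t1
    while t < t2:
        step = min(dt, t2 - t)
        total += panel_size * ghi(t) * step
        t += step
    return total  # Joules

def request_arrivals(rate_per_s, horizon_s, seed=0):
    """Poisson process: cumulative exponential inter-arrival times."""
    rng, t, times = random.Random(seed), 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)
        if t > horizon_s:
            return times
        times.append(t)
```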
For evaluation purposes, we implement three baseline methods, specified as follows:

1) Best Fit (BF): Best Fit is a rule-based scheduling algorithm derived from [19]. In our problem, BF tends to reserve energy for future use by scheduling the minimum processing frequency for the incoming request. Explicitly, it selects the action:

  a_i = argmin_{a ∈ A_{s_i}\{0}} f_a   if |A_{s_i}| > 1;   a_i = 0   otherwise    (28)

2) Worst Fit (WF):
Worst Fit is another rule-based scheduling algorithm derived from [20]. In our problem, WF is desperate to reduce the processing time of each request. It achieves this goal by scheduling the maximum processing frequency for the incoming request as far as possible. Explicitly, it selects the action:

  a_i = argmax_{a ∈ A_{s_i}\{0}} f_a   if |A_{s_i}| > 1;   a_i = 0   otherwise    (29)

3) linUCB: linUCB is an online learning solution that is typically applied in a linear contextual bandit model (see [21]). In our setting, linUCB learns the immediate rewards achieved under different contexts (or states in our formulation) and greedily selects the action that maximizes its estimated immediate reward. However, it disregards the state transition probability, or in other words, it ignores the rewards that might be obtained in the future. Besides, as with NAFA, in our implementation of linUCB we perform the same uniformization process on a state before its training. We do this in a bid to ensure a fair comparison.

In our experiment, we train linUCB and NAFA for the same number of episodes. After training, we use 300 straight days of simulation based on the same test dataset to validate the performance of the different scheduling strategies.

With the request arrival rate fixed at λ_r = 30 and different settings of the tradeoff parameter η, we shall show how the different scheduling strategies work. The results can be viewed in the box plot shown in Fig. 3. From this plot, we can derive the following observations:
1) In all the settings of η, NAFA outperforms all the other baselines in terms of average returned rewards. This result strongly shows the superiority of our proposed solution: not only is NAFA capable of adaptively adjusting its action policy based on the operator's preference (embodied by the tradeoff η), but it also has a stronger learning performance compared with the other learning algorithm, i.e., linUCB.
2) As η becomes bigger, NAFA's and linUCB's average rewards approach 0, while the rule-based algorithms (i.e., WF and BF) reduce to a negative number. Our explanation for this phenomenon is that when η is set to a sufficiently high value, the penalty brought by processing time exceeds the real reward brought by accepting a request (see our definition of rewards in Eq. (11)). As such, accepting a request that takes too much time to process is no longer a beneficial action under this extreme tradeoff setting, so both NAFA and linUCB learn to accept only a small portion of requests (perhaps those with smaller data sizes) and only acquire a slightly positive reward.

Recall that the reward is composed of exactly two parts: the real reward of accepting a request, and a penalty on processing time. To draw a clearer picture of these two parts, we show in Fig. 4 how the different algorithms (and different settings of η) perform in terms of processing time and rejection ratio when fixing λ_r = 30. Intuitively, we derive the following observations:
1) Comparing BF with WF, BF leads to a higher average processing time, but meanwhile, a lower average rejection ratio is also observed. This phenomenon is wholly comprehensible considering the working patterns of BF and WF. BF tends to schedule the incoming request at a lower processing frequency in an attempt to conserve energy for future use. This conservative action might lead to a higher average processing time, but at the same time, it is supposed to yield a lower rejection ratio when the power supply is limited (e.g.,
at night). By contrast, WF might experience more rejections due to power shortage at such times, as a result of its prodigal manner.
2) With a larger tradeoff η, NAFA and linUCB experience a drop in terms of processing latency, but a rise in the rejection ratio is also observed. This again corroborates that the tradeoff parameter η defined in the rewards is functioning well, meeting the original design purpose. In this way, operators should be able to adaptively adjust the algorithm's performance based on their own appetites towards the two objectives.
3) NAFA-0, NAFA-1, and NAFA-2 significantly outperform BF in terms of both objectives, i.e., smaller processing time and lower rejection ratio. Besides, NAFA-2 and NAFA-3 also outperform WF. For the same tradeoff parameter, linUCB is outperformed by NAFA in terms of both objectives in most of the experiment groups. Finally, no algorithm outperforms NAFA in both objectives. These observations further justify the superiority of NAFA.

To have a closer inspection of the algorithms' performance and to explore the hidden motivation that leads to rejection, in Fig. 5 we show the composition of request treatments while again fixing λ_r = 30. Based on the figure, the following observations follow:
1) Most of the rejections of WF are caused by the energy being fully reserved, while most of the rejections of BF result from the resources being fully loaded. This phenomenon is in accordance with their respective behavioral patterns. BF, which is thrifty on energy usage, might experience too-long processing times, which in turn