Adaptive Processor Frequency Adjustment for Mobile Edge Computing with Intermittent Energy Supply
Tiansheng Huang, Weiwei Lin, Ying Li, Xiumin Wang, Qingbo Wu, Rui Li, Ching-Hsien Hsu, Albert Y. Zomaya
Abstract—With astonishing speed, bandwidth, and scale, Mobile Edge Computing (MEC) has played an increasingly important role in the next generation of connectivity and service delivery. Yet, along with the massive deployment of MEC servers, the ensuing energy issue is now on an increasingly urgent agenda. In the current context, the large-scale deployment of renewable-energy-supplied MEC servers is perhaps the most promising solution to the incoming energy issue. Nonetheless, as a result of the intermittent nature of their power sources, these specially designed MEC servers must be more cautious about their energy usage, in a bid to maintain their service sustainability as well as their service standard. Targeting optimization on a single-server MEC scenario, we in this paper propose NAFA, an adaptive processor frequency adjustment solution, to enable an effective plan of the server's energy usage. By learning from the historical data revealing request arrival and energy harvest patterns, the deep reinforcement learning-based solution is capable of making intelligent schedules of the server's processor frequency, so as to strike a good balance between service sustainability and service quality. The superior performance of NAFA is substantiated by real-data-based experiments, wherein NAFA demonstrates up to a 20% increase in average request acceptance ratio and up to a 50% reduction in average request processing time.
Index Terms—Deep Reinforcement Learning, Event-driven Scheduling, Mobile Edge Computing, Online Learning, Semi-Markov Decision Process.
1 Introduction

Lately, Mobile Edge Computing (MEC) has emerged as a powerful computing paradigm for future Internet of Things (IoT) scenarios. MEC servers are mostly deployed in proximity to the users, with the merits of seamless coverage and extremely low communication latency to the users. Also, MEC servers feature a light-weight deployment, enabling their potential large-scale application in various scenarios.

• The authors would like to thank the three anonymous reviewers for their constructive comments. Special thanks are due to Dr. Minxian Xu of Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, for his constructive advice on experiment setup. This work is supported by National Natural Science Foundation of China (62072187, 61872084), Guangdong Major Project of Basic and Applied Basic Research (2019B030302002), Guangzhou Science and Technology Program key projects (202007040002, 201907010001), and Fundamental Research Funds for the Central Universities, SCUT (2019ZD26).
• T. Huang, W. Lin, Y. Li, and X. Wang are with the School of Computer Science and Engineering, South China University of Technology, China. Email: [email protected], [email protected], [email protected], [email protected].
• C.-H. Hsu is with the Department of Computer Science and Information Engineering, Asia University, Taichung, Taiwan. Email: [email protected].
• Q. Wu is with the College of Computer, National University of Defense Technology, Changsha 410073, China. Email: [email protected].
• R. Li is with Peng Cheng Laboratory, Shenzhen 518000, China. Email: [email protected].
• A. Y. Zomaya is with the School of Computer Science, The University of Sydney, Sydney, Australia. Email: [email protected].

Despite an attractive prospect, two crucial issues might be encountered in real applications:
1) The large-scale deployment of grid-powered MEC servers has almost exhausted the existing energy resource and resulted in an enormous carbon footprint. This in essence goes against the green computing initiative.
2) With millions or billions of small servers deployed amid every corner of the city, some of the locations could be quite unaccommodating for the construction of grid power facilities, and even for those in a good condition, the construction and operation overhead of the power facility alone should not be taken lightly.

With these two challenges encountered during the large-scale application of MEC, we initiate an alternative usage of intermittent energy supply. These supplies could be solar power, wind power, or wireless power, etc. Amid these alternatives, renewable energy, such as solar power and wind power, could elegantly address both concerns. Wireless power still breaches the green computing initiative, but can at least save the construction and operation cost of a complete grid power system for all the computing units.

However, all of these intermittent energy supplies reveal a nature of unreliability: power has to be stored in a battery with limited capacity for future use, and this limited energy is clearly incapable of handling all the workloads when a mass of requests are submitted.
With this concern, servers have two potential options to accommodate the increasing workloads:
1) They directly reject some of the requests (perhaps those requiring greater computation), in a bid to save energy for the subsequent requests.
2) They lower the processing frequency of their cores to save energy and accommodate the increasing workloads. Nevertheless, this option does not come without a cost: each request might suffer a prolonged processing time.

These heuristic ideas of energy conservation elicit the problem we discuss in this paper. We are particularly interested in the potential request treatments (i.e., should we reject an incoming request, and if not, how shall we schedule the processing frequency for it) and their consequent effects on the overall system performance.

The problem becomes even more sophisticated when regarding the unknown and non-stationary request arrival and energy harvest patterns. Traditional rule-based methods are clearly incompetent in this volatile context: as a result of their inflexibility, even though they might work in a particular setting (a specific arrival pattern, for example), they may not work equally well in a completely different environment.

Being so motivated, we shall design a solution that is both effective and adaptive, able to cope with the uncertainty brought by the energy supply and request patterns. Moreover, the solution should be highly programmable, allowing custom design based on the operators' expectations towards different performance metrics.

The major contributions of our work are presented in the following:
1) We have analyzed the real working procedure of an intermittent-energy-driven MEC system, based on which we propose an event-driven scheduling scheme, which is deemed to match well with the working pattern of this system.
2) Moreover, we propose an energy reservation mechanism to accommodate the event-driven feature. This mechanism is novel and has not been available in other sources, to our best knowledge.
3) Based on the proposed scheduling mechanism, we have formulated an optimization problem, which covers a few necessary system constraints and two main objectives: cumulative processing time and acceptance ratio.
4) We propose a deep reinforcement learning-based solution (NAFA) to optimize the joint request treatment and frequency adjustment action.
5) We conduct experiments based on real solar data. Our experimental results substantiate the effectiveness and adaptiveness of our proposed solutions. In addition to the general results, we also develop a profound analysis of the working patterns of different solutions.

To the best knowledge of the authors, our main focus on event-driven scheduling for an intermittent-energy-supplied system has not been presented in other sources. Also, our proposed solution happens to be the first tractable deep reinforcement learning solution to an SMDP model. It has the potential to be applied to other network systems that follow a similar event-driven working pattern. In this regard, we consider our novel work a major contribution to the field.
2 Related Work
AI-driven algorithms have been broadly studied thanks to their attractive efficiency and flexibility. In many fields, such as service orchestration [1], vehicular or industrial IoT networks (see [2], [3]), and crowdsensing (see [4], [5]), lines of research on AI application have been conducted. The proposed AI-driven algorithms all achieve significant performance enhancement compared with traditional methodologies. Among these emerging topics, AI-driven MEC can be regarded as one of the hottest agenda items. To illustrate, in [6], Chen et al. combined the technique of Deep Q Network (DQN) with a multi-object computation offloading scenario. By maintaining a neural network inside its memory, the MU is enabled to intelligently select an offloading object among the accessible base stations. Wei et al. in [7] introduced a reinforcement learning algorithm to address the offloading problem in the IoT scenario, based on which they further proposed a value function approximation method in an attempt to accelerate the learning speed. In [8], Min et al. further considered the privacy factors in healthcare IoT offloading scenarios and proposed a reinforcement learning-based scheme for privacy-ensured data offloading.

Refs. [6]–[9] mentioned above are all based on the Markov Decision Process (MDP), in which the action scheduling is performed periodically at a fixed time interval. However, in a real scenario, the workflow of an agent (server or MU) is usually event-driven. To be specific, the agent is supposed to promptly make the schedule and perform actions once an event has occurred (e.g., a request arrives), not wait until a fixed periodicity is met.
Motivated by the gap between current research and the agent's real working pattern, we consider the Semi-Markov Decision Process (SMDP) as our basic model instead. Unlike the conventional MDP model, in an SMDP model the time interval between two sequential actions does not necessarily need to be the same. As such, it is well suited to an event-triggered schedule, like the one we need for request scheduling.

SMDP is a powerful modeling tool that has been applied in many fields, such as wireless networks, operations management, etc. In [10], Zheng et al. first applied SMDP to the scenario of vehicular cloud computing systems. A discounted reward is adopted in their system model, based on which the authors further proposed a value iteration method to derive the optimal policy. In [11], SMDP is first applied to energy harvesting wireless networks. For problem-solving, the authors adopted a model-based policy
iteration method. However, the model-based solution proposed in the above work cannot address the problem when the state transition probability is unknown; in addition, it cannot deal with the well-known curse-of-dimensionality issue.

To fix this gap, based on an SMDP model that is exclusively designed for an NB-IoT edge computing system, Lei et al. in [12] further proposed a reinforcement learning-based algorithm. However, the proposed algorithm is still too restricted and cannot be applied to our currently studied problem, since several assumptions (e.g., exponential sojourn time between events) must be made in advance. Normally, these assumptions are inevitable for the derivation of state-value or policy-value estimation in an SMDP model, but unfortunately, they could be the main culprit leading to great divergence between theory and reality.

As a result, we in this paper jettison all the assumptions typically presented in an SMDP formulation. Alternatively, we adopt a Double Deep Q Network (DDQN) (see [13]) to predict the state-action value function and to derive a near-optimal policy based on it. Our method is a more general solution for an SMDP model and is designed mostly from a practical and usable standpoint.

Fig. 1: System architecture for an intermittent-energy-supplied MEC system.
3 Problem Formulation
In this paper, we target the optimization problem in a multi-user, single-server MEC system that is driven by an intermittent energy supply (e.g., renewable energy, wireless charging). As depicted in Fig. 1, multiple central processing units (CPUs) and a limited-capacity battery, together with communication and energy harvesting modules, physically constitute an MEC server. Our proposed MEC system schedules requests following an event-driven workflow, i.e., our request scheduling process is evoked immediately once a request is collected. This event-driven process can be specified by the following steps:
1) Collect the request and check the current system status, e.g., the battery and energy reservation status and the CPU core status.
2) Schedule the incoming request based on its characteristics and the current system status. Explicitly, the scheduler is supposed to make the following decisions:
   a) Decide whether to accept the request or not.
   b) If accepted, decide the processing frequency of the request. We refer to the processing frequency as the frequency that a CPU core might turn to while processing this particular request. The scheduler should not choose a frequency that might potentially over-reserve the currently available energy or overload the available cores.
3) Reserve energy for the request based on the decided frequency. A CPU core cannot use more than the reserved energy to process this request.
4) The request is scheduled to the corresponding CPU core and processing starts.

Given that the system we study is not powered by a reliable energy supply (e.g., coal power), a careful plan of the available energy is supposed to be made in order to promote the system performance.
In this subsection, we shall formally introduce the optimization problem by a rigorous mathematical formulation (key notations and their meanings are given in Table 1).

TABLE 1: Key notations for problem formulation
  Notation              Meaning
  a_i                   action scheduled for the i-th request
  f_n                   n-th processing frequency option (GHz)
  d_i                   data size of the i-th request (bits)
  ν                     computation complexity per bit
  κ                     effective switched capacitance
  m                     number of CPU cores
  τ_{a_i}               processing time of the i-th request
  e_{a_i}               energy consumption of the i-th request
  B_i                   battery status when the i-th request arrives
  B_max                 maximum battery capacity
  λ_{i,i+1} / ρ_{i,i+1} captured/consumed energy between arrivals of the i-th and (i+1)-th requests
  S_i                   reserved energy when the i-th request arrives
  Ψ_i                   number of working CPU cores when the i-th request arrives
  η                     tradeoff parameter

We consider the optimization problem for an MEC server with m CPU cores, all of whose frequencies can be adaptively adjusted via the Dynamic Voltage and Frequency Scaling (DVFS) technique. We first specify the "action" for the optimization problem.
The decision (or action) in this system specifies the treatment of the incoming request. Explicitly, it specifies 1) whether to accept the request or not, and 2) the processing frequency of the request. Formally, let i index an incoming request by its arrival order. The action for the i-th request is denoted by a_i. Explicitly, we note that:
1) a_i ∈ {1, 2, . . . , n} denotes the action index of the CPU frequency at which the request is scheduled, where n represents the maximum index (or the total number of potential options) for frequency adjustment. For option n, an f_n GHz frequency will be scheduled to the request.
2) Specially, when a_i = 0, the request will be rejected immediately.

The action we take has a direct influence on the request processing time, which can be specified by:

  τ_{a_i} = ν · d_i / f_{a_i}   if a_i ∈ {1, . . . , n};   τ_{a_i} = 0   if a_i = 0    (1)

where we denote d_i as the processing data size of an offloading request and ν as the required CPU cycles for computing one bit of the offloaded data. Without loss of generality, we assume d_i, i.e., the processing data size, to be a stochastic variable, while regarding ν, i.e., the required CPU cycles per bit, as fixed for all requests.

Also, by specifying the processing frequency, we can derive the energy consumption for processing the request, which can be given as:

  e_{a_i} = κ f_{a_i}^2 · ν · d_i   if a_i ∈ {1, . . . , n};   e_{a_i} = 0   if a_i = 0    (2)

where we denote κ as the effective switched capacitance of the CPUs.

Recall that the MEC server is driven by an intermittent energy supply, which indicates that the server has to store the captured energy in its battery for future use. Concretely, we shall introduce a virtual queue, which serves as a measurement of the server's current battery status. Formally, we let a virtual queue, whose backlog is denoted by B_i, capture the energy status when the i-th request arrives. B_i evolves following this rule:

  B_{i+1} = min{ B_i + λ_{i,i+1} − ρ_{i,i+1}, B_max }    (3)

Here,
1) λ_{i,i+1} represents the amount of energy that was captured by the server between the arrivals of the i-th and (i+1)-th requests.
2) ρ_{i,i+1} is the consumed energy during the same interval. It is notable that the consumed energy, i.e., ρ_{i,i+1}, is highly related to the frequencies of the currently running cores and, for that reason, is also relevant to the past actions, regarding the fact that it is the past actions that determine the frequencies of the currently running cores.
3) B_max is the maximum capacity of the battery. We use this value to cap the battery status, since the energy in store cannot exceed the full capacity of the server's battery.

Upon receiving each coming request, we shall reserve the corresponding amount of energy for it so that the server has enough energy to finish this request. This reservation mechanism is essential in maintaining the stability of our proposed system. To be specific, we first have to construct a virtual queue to record the energy reservation status. The energy reservation queue, with backlog S_i, evolves between request arrivals following this rule:

  S_{i+1} = max{ S_i + e_{a_i} − ρ_{i,i+1}, 0 }    (4)

where ρ_{i,i+1} is the consumed energy (same definition as in Eq. (3)) and e_{a_i} is the energy consumption for the i-th request (same definition as in Eq. (2)). By this means, S_{i+1} exactly measures how much energy has been reserved by the 1-st to the i-th requests.
1. Some may wonder about the rationale behind the subtraction of ρ_{i,i+1}. Consider the case that at a specific timestamp (e.g., the timestamp at which the i-th request arrives), the processing of a past request (e.g., the (i−1)-th request) could be unfinished, but has already consumed some of the reserved energy. Then, the energy reserved for it should subtract the already-consumed part.
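To make the bookkeeping in Eqs. (1)–(4) concrete, the following minimal Python sketch simulates the two virtual queues between consecutive request arrivals. It is an illustrative reading of the formulas, not the authors' implementation; the function and variable names (e.g., `step_queues`) are our own.

```python
# A minimal sketch of the dynamics in Eqs. (1)-(4).
# Names are illustrative; nu = required CPU cycles per bit,
# kappa = effective switched capacitance, freqs[a-1] = f_a.

def processing_time(a, d, freqs, nu):
    """Eq. (1): tau = nu * d / f_a for a >= 1, 0 for rejection (a == 0)."""
    return 0.0 if a == 0 else nu * d / freqs[a - 1]

def energy_consumption(a, d, freqs, nu, kappa):
    """Eq. (2): e = kappa * f_a^2 * nu * d for a >= 1, 0 for rejection."""
    return 0.0 if a == 0 else kappa * freqs[a - 1] ** 2 * nu * d

def step_queues(B, S, e_a, lam, rho, B_max):
    """Advance the battery backlog (Eq. (3)) and the reservation
    backlog (Eq. (4)) from the i-th arrival to the (i+1)-th arrival.
    lam: energy harvested in between; rho: energy consumed in between."""
    B_next = min(B + lam - rho, B_max)
    S_next = max(S + e_a - rho, 0.0)
    return B_next, S_next
```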
By maintaining the energy reservation status, we can specify an energy constraint to control the action when the energy has already been fully reserved. The constraint can be given as follows:

  S_i + e_{a_i} ≤ B_i    (5)

By this constraint, we ensure that the request has to be rejected if not enough (unreserved) energy is available.

Recall that we assume a total number of m CPU cores are available for request processing. When the computation resources are already fully loaded, the system has no option but to reject the arriving requests. So, we introduce the following constraint:

  Ψ_i + I{a_i ≠ 0} ≤ m    (6)

where Ψ_i denotes the number of currently working CPU cores. By this constraint, we ensure that the server cannot be overloaded by the accepted requests.

Now we shall formally introduce the problem that we aim to optimize, which is given as follows:
(P1)
  Obj1: min_{a_i} Σ_{i=1}^{∞} τ_{a_i}
  Obj2: max_{a_i} Σ_{i=1}^{∞} I{a_i ≠ 0}
  s.t.  C1: S_i + e_{a_i} ≤ B_i
        C2: Ψ_i + I{a_i ≠ 0} ≤ m    (7)

There are two objectives that we need to consider in this problem: 1) the cumulative processing time of the system, and 2) the cumulative acceptance. Besides, two constraints, i.e., the Energy Constraint and the Resource Constraint, have been covered in our system constraints.

Obviously, P1 is unsolvable due to the following facts:
1) Existence of stochastic variables. Stochastic variables, e.g., d_i (data size of requests), λ_{i,i+1} (energy supplied), and the request arrival rate, persist in P1.
2) Multi-objective quantification. The two objectives considered in our problem are mutually exclusive, and there is not an explicit quantification between them.

To bridge the gap, we 1) set up a tradeoff parameter to balance the two objectives and transform the problem into a single-objective optimization problem, and 2) transform the objective function into an expected form. This leads to our newly formulated P2:

(P2)
  max_{a_i} Σ_{i=1}^{∞} E[ I{a_i ≠ 0} − η τ_{a_i} ]
  s.t.  C1: S_i + e_{a_i} ≤ B_i
        C2: Ψ_i + I{a_i ≠ 0} ≤ m    (8)

where η serves as the tradeoff parameter to balance cumulative acceptance and cumulative processing time. P2 is more concrete after the transformation, but still, we encounter the following challenges when solving P2:
1) Exact constraints. Both C1 and C2 are exact constraints that should be strictly respected for each request. This completely precludes the possibility of applying an offline deterministic policy to solve the problem, noticing that B_i and S_i in C1 and Ψ_i in C2 are all stochastic for each request.
2) Unknown and non-stationary supplied pattern of energy. The supplied pattern of intermittent energy is highly discrepant, varying from place to place and hour to hour. As such, the stochastic process λ_{i,i+1} in Eq. (3) may not be stationary, i.e., λ_{i,i+1} samples from an ever-changing stochastic distribution that is relevant to the request order i, and moreover, it is unknown to the scheduler.
3) Unknown and non-stationary arrival pattern. The arrival pattern of the requests is unknown to the system and, likewise, could be highly discrepant at temporal scales.
4) Unknown stochastic distribution of requests' data size. The processing data size could vary between requests and is typically unknown to the scheduler.
5) Coupling effects of actions between requests. The past actions towards an earlier request might have a direct effect on the later scheduling process (see the evolvement of S_i and B_i).

Regarding the above challenges, we have to resort to an online optimization solution for problem-solving. The solution is expected to learn the stochastic pattern from historical knowledge and, meanwhile, strictly comply with the system constraints.

4 Deep Reinforcement Learning-Based Solution
To develop our reinforcement learning solution, we shall first transform the problem into a Semi-Markov Decision Process (SMDP) formulation. In our SMDP formulation, we regard the system state as the system running status at the moment a request has come, or, in other words, when the scheduler is supposed to take an action. After an action is taken by the scheduler, a reward is achieved, and the state (or system status) correspondingly transfers. The process is event-driven, that is, the scheduler decides the action once a request has come but does nothing while awaiting. As a result, the time interval between two sequential actions may not be the same. This is the core characteristic of an SMDP model and is the major difference from a normal MDP model.

Now we shall specify the three principal elements (i.e., states, actions, and rewards) in our SMDP formulation in sequence. Please note that most of the notations we use henceforth are consistent with those in Section 3.
2. Consider the energy harvest pattern of solar power: there is a significant difference in harvest magnitude between day and night.
A system state is given as a tuple:

  s_i ≜ { T_i, B_i, S_i, ψ_{i,1}, . . . , ψ_{i,n}, d_i }    (9)

Explicitly,
1) T_i is the timestamp at which the i-th request arrives at the MEC server. We incorporate this element into our state in order to accommodate the temporal factor that persists in the energy and request patterns.
2) B_i is the battery status, as specified in Eq. (3).
3) S_i is the energy reservation status, as specified in Eq. (4).
4) ψ_{i,n} denotes the number of CPU cores running at the frequency of f_n GHz. Meanwhile, it is intuitive to see that Ψ_i = Σ_{n'=1}^{n} ψ_{i,n'}, where Ψ_i is the total number of running cores specified in Eq. (6).
5) d_i is the data size of the i-th request.

The information captured by a state could be comprehended as the current system status that might support the action scheduling for the incoming i-th request. More explicitly, we argue that the formulated states should at least cover the system status that helps construct a possible action set, but could cover more types of informative knowledge to support the decision (e.g., T_i in our current formulation, which will be formally analyzed later). The concept of a possible action set will be given in our explanation of system actions.

Once a request arrives, given the current state (i.e., the observed system status), the scheduler is supposed to take an action, deciding the treatment of the request. As indicated by the system constraints C1 and C2 in P2, we are supposed to place a restriction on the to-be-taken action. By specifying the states (or observing the system status), it is not difficult to find that we can indeed obtain a closed-form possible action set, as follows:

  a_i ∈ A_{s_i} ≜ { a | Ψ_i + I{a ≠ 0} ≤ m, S_i + e_a ≤ B_i }    (10)

where Ψ_i, B_i, and S_i are all covered in our state formulation. By taking actions from the defined possible action set, we address challenge 1), exact constraints, that we specified below P2.

Given a state and an action, a system reward will be incurred, in the following form:

  r(s_i, a_i) = I{a_i ≠ 0} − η τ_{a_i}    (11)

The reward for different treatments of a request is consistent with our objective formulation in P2 (see Eq. (8)). The goal of our MDP formulation is to maximize the expected achieved rewards, which means that we aim to maximize Σ_{i=1}^{∞} E[ I{a_i ≠ 0} − η τ_{a_i} ], the same form as the objective in P2. Besides, here we can informally regard I{a_i ≠ 0} as a "real reward" for accepting a request and τ_{a_i} as the "penalty" for processing a request.

After an action is taken, the state will transfer to another one when the next request arrival occurs. The transition probability to a specific state is assumed to be stationary given the current state and the current action, i.e., we assume that:

  p(s_{i+1} | s_i, a_i) = constant    (12)

This assumption is the core part of our MDP formulation, and it is our main motivation for covering T_i and ψ_{i,1}, . . . , ψ_{i,n} (rather than Ψ_i) in our state formulation. After all, we hope that the state transition probability, i.e., p(s_{i+1} | s_i, a_i), is indeed a constant. Explicitly, given a_i, T_i, ψ_{i,1}, . . . , ψ_{i,n}, and d_i, we expect the probability that s_i transfers to s_{i+1} to be a fixed constant. To reach this goal, given s_i, we need to make sure that the corresponding elements (e.g., B_{i+1}, S_{i+1}, and ψ_{i+1,1}, . . . , ψ_{i+1,n}) in state s_{i+1} follow the same joint stationary distribution. A stationary distribution here means that, given s_i, the distribution is no longer relevant to the request order i.
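Before proceeding to the stationarity observations, we pause for a concrete illustration of the possible action set (Eq. (10)) and the reward (Eq. (11)) just defined. The sketch below is a direct reading of the two formulas under the same notation; the function names are ours, and rejection (a = 0) is taken to be always feasible.

```python
# Sketch of the possible action set (Eq. (10)) and reward (Eq. (11)).
# freqs[a-1] is f_a; action 0 is rejection.

def possible_actions(Psi, B, S, d, freqs, nu, kappa, m):
    """Return A_s: a frequency option a is feasible if a core is free
    (Psi + 1 <= m) and its reserved energy fits the budget (S + e_a <= B)."""
    actions = [0]  # rejection always satisfies both constraints
    for a in range(1, len(freqs) + 1):
        e_a = kappa * freqs[a - 1] ** 2 * nu * d   # Eq. (2)
        if Psi + 1 <= m and S + e_a <= B:
            actions.append(a)
    return actions

def reward(a, d, freqs, nu, eta):
    """Eq. (11): +1 for acceptance, minus eta times processing time."""
    if a == 0:
        return 0.0
    return 1.0 - eta * (nu * d / freqs[a - 1])     # uses Eq. (1)
```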
To show this desirable property, we first need to show the following:

Observation 1 (Stationary Energy Arrival Given State). By specifying T_i, the captured energy between two sequential requests (i.e., λ_{i,i+1}) can roughly be regarded as a sample from a stationary distribution, i.e., the value of the r.v. λ_{i,i+1} is not relevant to i (the order of the request) given T_i.

Remark. This assumption is derived from our experience: many intermittent energy supply sources, such as solar power and wind power, have a clear diurnal pattern, but if we fix the time to a specific timestamp in a day, the energy harvest can roughly be regarded as a sample from a fixed distribution. Informally, we assume a fixed energy harvest rate for a fixed timestamp, e.g., a 1000 Joules/s harvest rate at 3:00 pm, and this distribution is no longer relevant to i. In this way, we respond to the non-stationarity issue in challenge 2), unknown and non-stationary supplied pattern of energy, that we previously proposed below P2. Similarly, one can easily extend the state formulation by covering more system status (e.g., the location of the MEC server, if we aim to train a general model for a multiple-MEC-servers scenario) to yield a more accurate estimation.

Observation 2 (Stationary Request Arrival Given State). By specifying T_i, the time interval between request arrivals should also be considered as a sample from a stationary distribution (though still unknown). This means that the time interval between two arrivals (denoted by t_{i,i+1}) is no longer relevant to i if given T_i.

Remark. By this state formulation, we address the non-stationarity issue in challenge 3), unknown and non-stationary arrival pattern, that we previously proposed below P2.
Observation 3 (Stationary Energy Consumption Given State). By specifying a_i, ψ_{i,1}, . . . , ψ_{i,n}, and a stationary time interval between two arrivals, the consumed energy between two requests (i.e., ρ_{i,i+1}) can roughly be regarded as a sample from a stationary distribution.

Remark. Our explanation for Observation 3 is that the consumed energy ρ_{i,i+1} is actually determined by the current CPU frequencies as well as the time interval between requests (i.e., t_{i,i+1}). Explicitly, we can roughly estimate that ρ_{i,i+1} = Σ_{n'=1}^{n} (ψ_{i,n'} + I{a_i = n'}) · κ f_{n'}^3 · t_{i,i+1}. But this calculation is not 100% accurate, since a CPU core could have gone to sleep before the arrival of the next request, as a result of finishing the processing of a request. As a refinement, we further assume ρ_{i,i+1} = Σ_{n'=1}^{n} (ψ_{i,n'} + I{a_i = n'}) · κ f_{n'}^3 · t_{i,i+1} − y(a_i, ψ_{i,1}, . . . , ψ_{i,n}, t_{i,i+1}), where y(a_i, ψ_{i,1}, . . . , ψ_{i,n}, t_{i,i+1}) is a stochastic noise caused by the halfway sleeping of CPU cores. Without loss of generality, y(a_i, ψ_{i,1}, . . . , ψ_{i,n}, t_{i,i+1}) is assumed to be a sample from a stationary distribution given a_i, ψ_{i,1}, . . . , ψ_{i,n}, and t_{i,i+1}. By this assumption, we state that ρ_{i,i+1} is stationary if given a_i, ψ_{i,1}, . . . , ψ_{i,n}, and a stationary t_{i,i+1}.

Observation 4 (Stationary Battery and Reservation Status Given State).
By specifying s_i and a_i, B_{i+1} and S_{i+1} can roughly be regarded as samples from two stationary distributions.

Remark. From Eqs. (3) and (4), we find that B_{i+1} and S_{i+1} are relevant to λ_{i,i+1}, ρ_{i,i+1}, e_{a_i}, and s_i. As per Observations 1 and 3, λ_{i,i+1} and ρ_{i,i+1} are both stationary given s_i and a_i. In addition, we note that e_{a_i} is stationary given the same condition, since d_i is assumed to be a sample from a stationary distribution and there is no other stochastic factor in Eq. (2).

Observation 5 (Stationary Core Status Given State).
By specifying s_i and a_i, ψ_{i+1,1}, . . . , ψ_{i+1,n} can roughly be regarded as samples from a stationary distribution.

Remark. Here we simply regard that ψ_{i+1,n} = ψ_{i,n} + I{a_i = n} − z(ψ_{i,n}, t_{i,i+1}), where z(ψ_{i,n}, t_{i,i+1}) denotes the reduction in the number of active cores at f_n GHz during the t_{i,i+1} interval. Instinctively, we feel that the reduction, i.e., z(ψ_{i,n}, t_{i,i+1}), should at least have some sort of bearing on the running core status, i.e., ψ_{i,n}, and the time interval between requests, i.e., t_{i,i+1}. Informally, we might simply imagine that for each time unit, an expected reduction of α_n · ψ_{i,n} would possibly be incurred, where α_n should only be relevant to the evaluated core frequency (i.e., f_n). Then, following this strand of reasoning, z(ψ_{i,n}, t_{i,i+1}) should be a stochastic variable, and it can roughly be regarded as a sample from a stationary distribution if given ψ_{i,n} and t_{i,i+1}.

Observation 6 (Stationary Time and Data Size Given State).
By specifying s_i and t_{i,i+1}, the timestamp at which the next request comes, i.e., T_{i+1} = T_i + t_{i,i+1}, is naturally stationary. Besides, the data size of a request, i.e., d_{i+1}, is naturally assumed to be a sample from a stationary distribution (i.e., not relevant to the arrival order i).

Combining Observations 4, 5, and 6, we roughly infer that there should be a unique constant specifying the transition probability from one state to another. However, we note that this conclusion is not formal in a very rigorous sense (and it does not necessarily have to be!), given the complexity that exists in the analysis as well as the not-necessarily-100%-correct assumptions we make in the observations. Our real intention in discussing the stationarity property is actually to render the readers an instruction about state formulation, and to show how to extend the state in a different application case.
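As a small aid to Observation 3, the sketch below computes the rough estimate of the inter-arrival energy consumption ρ_{i,i+1} from the core status. It uses the κf³ power model implied by Eq. (2), leaves out the stochastic correction y(·), and all names are illustrative assumptions rather than the authors' code.

```python
# Rough estimate of rho_{i,i+1} per Observation 3, ignoring the
# stochastic correction y(.) for cores that sleep halfway.

def estimate_rho(psi, a, t_gap, freqs, kappa):
    """psi[n-1] = number of cores running at f_n when request i arrives;
    a = action for request i; t_gap = t_{i,i+1} in seconds.
    Each active core at f_n is assumed to draw kappa * f_n^3 power."""
    rho = 0.0
    for n, f in enumerate(freqs, start=1):
        cores = psi[n - 1] + (1 if a == n else 0)
        rho += cores * kappa * f ** 3 * t_gap
    return rho
```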
In this section, we shall show the reader a simplified working procedure of our proposed SMDP model. As shown in Fig. 2, we consider three possible frequency adjustment options for each incoming request, i.e., a_i ∈ {0, 1, 2, 3}, where action a_i = 0 means rejection and actions a_i = 1, 2, 3 respectively correspond to a frequency setting of f_1, f_2, f_3. Our example basically illustrates the following workflow:
1) Before the 1-st request's arrival, a specific amount of energy (i.e., acquired energy, represented by the green bar) has been collected by the energy harvesting modules. Then, at the instant of the 1-st request's arrival, the scheduler decides to take action a_1 = 1 based on the current state. Once the action is taken, a) a sleeping core is woken and starts to run at the frequency of f_1, b) a specific amount of energy is reserved for the processing of this request, and c) the acquired energy is officially deemed stored energy. Correspondingly, the state instantly transfers to a virtual post-action state, i.e., a) the first bit in the core status (i.e., the number of running cores at the frequency of f_1) is correspondingly flipped to 1, b) the reserved energy is updated, and c) the newly acquired energy is absorbed into the stored energy.
2) Then, the state (i.e., the system status) continues to evolve over time: the running cores continuously consume the reserved energy and the stored energy (the same amount is consumed from the reserved energy and the stored energy). The evolvement of the state pauses when the second request arrives. After observing the current system status (i.e., the current state), the scheduler takes action a_2 = 2 this time, and the state similarly transfers to the post-action state.
3) Again, after the action is taken, the state evolves over time, and within this time interval, a core (at frequency f_1) finishes its task. Predictably, when the 3-rd request arrives, a state with core status (0, 1, 0) is observed, and since the available energy (the sum of acquired energy and stored energy) is not sufficient for processing this request, the scheduler has no choice but to reject the request, which makes a_3 = 0. This time, the post-action state does not experience a substantial change compared with the observed state (except that the acquired energy has been absorbed).
4) The same evolvement continues, and the same action and state transformation process is repeated for the later requests.

Please note that in our former formulation, we do not involve the consideration of the post-action state, since we only care about the state transformation between two formal states s_i and s_{i+1}. We present the concept of the post-action state here mainly in a bid to render the readers a whole picture of how the system states might evolve over time.

3. Note that in our formal state formulation (see Eq. (9)), we do not distinguish between acquired energy and stored energy, but are only interested in the battery status, which is the sum of these two terms. We make this distinction in this example mainly to support our illustration of the key idea.

Fig. 2: Illustration of the event-driven SMDP model.
Recall that our ultimate goal is to maximize the expected cumulative reward, which means that we need to find a deterministic optimal policy π̃* such that:

  π̃*(s) = argmax_{a ∈ A_s} Q̃_{π̃*}(s, a)    (13)

where

  Q̃_π̃(s, a) = E_{s_2, s_3, ...} [ r(s_1, a_1) + Σ_{i=2}^{∞} r(s_i, π̃(s_i)) | s_1 = s, a_1 = a ]    (14)

represents the expected cumulative rewards, starting from an initial state s and an initial action a. π̃* is a deterministic policy that promises us the best action for accumulating expected rewards. However, it is intuitive to find that Q̃_π̃(s, a) is not a convergent value no matter how the policy π̃ is defined (see the summation function), which makes it meaningless to derive π̃* in this form.

To address this issue, we alternatively define a discounted expected cumulative reward, in the following form:

  Q_π(s, a) = E_{s_2, s_3, ...} [ r(s_1, a_1) + Σ_{i=2}^{∞} β^{i−1} r(s_i, π(s_i)) | s_1 = s, a_1 = a ]    (15)

where 0 < β < 1 is the discount factor and s_1 is an initial state. We instead need to find a near-optimal policy π* such that:

  π*(s) = argmax_{a ∈ A_s} Q_{π*}(s, a)    (16)

where s is an initial state. Q_π(s, a) is usually referred to as the state-action value function, and each Q_π(s, a) regarding a different policy π has a finite convergent value, which makes it concrete to find Q_{π*}(s, a). As long as we know Q_{π*}(s, a), we are allowed to derive the near-optimal policy π* by Eq. (16). Moreover, Q_π(s, a) still conserves certain information, even after our discounting of future rewards: we discount much of the reward that might be obtained in the distant future but not that much of the near one, so it actually partially reveals the exact value of a state-action pair (i.e., the cumulative rewards that might be gained in the future). Actually, we can state that π* ≈ π̃*. Later in our analysis, we alternatively search for π* as our target.

To derive Q_{π*}(s, a), we need to notice that:

  Q_{π*}(s, π*(s)) = max_{a ∈ A_s} Q_{π*}(s, a)    (17)

The above result follows from our definition of π* in Eq. (16). Besides, as per Eq. (15), Q_π(s, a) can indeed be rewritten in the following form:

  Q_π(s, a) = E_{s_2} [ r(s_1, a_1) + β Q_π(s_2, π(s_2)) | s_1 = s, a_1 = a ]    (18)

Plugging π* into Eq. (18) yields:

  Q_{π*}(s, a) = E_{s_2} [ r(s_1, a_1) + β Q_{π*}(s_2, π*(s_2)) | s_1 = s, a_1 = a ]    (19)
And plugging Eq. (17) into Eq. (19), we have:

  Q_{π*}(s, a) = E_{s_2} [ r(s_1, a_1) + β max_{a_2 ∈ A_{s_2}} Q_{π*}(s_2, a_2) | s_1 = s, a_1 = a ]    (20)

As per the Markov and temporally homogeneous property of an SMDP, we have:

  Q_{π*}(s, a) = E_{s'} [ r(s, a) + β max_{a' ∈ A_{s'}} Q_{π*}(s', a') ]    (21)

where s' is a random variable (r.v.) representing the next state that the current state s will transfer to. This equation is called the Bellman optimality, and it indeed specifies the value of taking a specific action, which is basically composed of two parts:
1) The first part is the expected reward that is immediately obtained after the action has been taken, embodied by E[r(s, a)].
2) The second part is the discounted future expected reward after the action has been taken, embodied by E[β max_{a' ∈ A_{s'}} Q_{π*}(s', a')]. This literally addresses challenge 5), coupling effects of actions between requests, that we proposed below P2. The coupling effects are in fact embodied by these discounted expected rewards: by taking a different action, the state might experience a different transition probability, thereby affecting the future rewards. By this MDP formulation, we are enabled to have a concrete model of this coupling effect on the future.

Besides, as specified by the challenges proposed below P2, we know that both the energy and request arrival patterns, as well as the data size distribution, are all unknown to the scheduler, which means that it is hopeless to derive the closed-form Q_{π*}(s, a). More explicitly, since the state transition probability p(s' | s, a) (or p(s_{i+1} | s_i, a_i) in Eq. (12)) is an unknown constant, we cannot expand the expectation in Eq. (21), making Q_{π*}(s, a) unachievable in an analytical way. Without knowledge of Q_{π*}(s, a), we are unable to fulfill our ultimate goal, i.e., to derive π*.

Fortunately, there still exists an alternative path to derive Q_{π*}(s, a). (Henceforth, we use Q(s, a) to denote Q_{π*}(s, a) for the sake of brevity.) If we have a sufficient amount of data over a specific state and action (captured by a set X_{s,a}), each piece of which is a 4-element tuple x ≜ (s, a, r(s, a), s'), then we can estimate Q(s, a) by:

  Q(s, a) ≈ (1 / |X_{s,a}|) Σ_{x ∈ X_{s,a}} [ r(s, a) + β max_{a' ∈ A_{s'}} Q(s', a') ]    (22)

where | · | returns the cardinality of a set. Equivalently, we aim to minimize the estimation error, i.e., to minimize the following term:

  (1 / |X|) Σ_{x ∈ X} [ Q(s, a) − ( r(s, a) + β max_{a' ∈ A_{s'}} Q(s', a') ) ]^2    (23)

where X = X_{s_1,a_1} ∪ · · · ∪ X_{s_1,a_n} ∪ . . . captures all the available data. In this way, we do not need prior knowledge of the transition probability, i.e., p(s' | s, a); rather, we learn it from the real state transition data. This learning-based method completely resolves the unknown-distribution issues we raised below P2.

Unfortunately, this initial idea is still unworkable. The major impediment is that there is an infinite number of states in our formulated problem! We simply cannot iteratively achieve Q(s, a) for every possible state-action pair, but at least this initial idea points out a concrete direction that elicits our later double deep Q network solution.
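The estimation in Eqs. (22)–(23) can be read as a plain averaging of TD targets over transition tuples (s, a, r, s'). The sketch below shows that reading for a hypothetical finite state space with a tabular Q; it is only a stepping stone to the network-based version, and all names are ours.

```python
# Tabular reading of Eqs. (22)-(23): average the TD targets over the
# collected tuples (s, a, r, s'). Only workable for small state spaces.

from collections import defaultdict

def update_tabular_q(Q, data, possible_actions, beta):
    """Q: dict mapping (s, a) -> value (e.g., a defaultdict(float));
    data: list of (s, a, r, s_next); possible_actions(s): the set A_s.
    Performs one sweep of the empirical-mean estimate in Eq. (22)."""
    targets = defaultdict(list)
    for s, a, r, s_next in data:
        best_next = max(Q[(s_next, a2)] for a2 in possible_actions(s_next))
        targets[(s, a)].append(r + beta * best_next)
    for key, vals in targets.items():
        Q[key] = sum(vals) / len(vals)  # empirical mean of the targets
    return Q
```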
In the previous subsection, we provided an initial idea about how to derive Q(s, a) in a learning way. But we simply cannot record Q(s, a) for each state due to the unbounded state space. This hidden issue motivates us to use a neural network to predict the optimal state-action value (abbreviated as the Q value henceforth). Explicitly, we input a specific state to the neural network, and it outputs the Q values for actions 0 to n. By this means, we do not need to record the Q value; instead, it is estimated based on the output after going through the neural network.

Formally, the estimated Q value can be denoted by Q(s, a; θ_i), where θ_i denotes the parameters of the neural network (termed the Q network henceforth) after i steps of training. Following the same idea we proposed before, we expect to minimize the estimation error (or loss henceforth), in this form:

  L_i(θ_i) = (1 / |X̃_i|) Σ_{x ∈ X̃_i} [ Q(s, a; θ_i) − ( r(s, a) + β max_{a' ∈ A_{s'}} Q(s', a'; θ_i) ) ]^2    (24)

where X̃_i is a subset of the total available data when training at the i-th step, often referred to as a mini-batch. We introduce this concept here since the data acquisition process is an online process. In other words, we do not have all the training data from the start, but obtain it through continuous interaction (i.e., taking actions) and continuous policy updates. As a result of this continuous update of data, it is simply not "cost-effective" to involve all the historical data we currently have in every step of training. As a refinement, only a subset of the data, i.e., X̃_i, is involved in the loss back-propagation for each step of training.

However, the above-defined loss, though quite intuitive, would possibly lead to training instability of the neural network, since it is updated too radically (see [14]). As per [13], a double-network solution ensures more stable performance. In this solution, we introduce another network, known as the target network, whose parameters are denoted by θ_i^−. The target network has exactly the same architecture as the Q network, and its parameters are overridden periodically by those of the Q network. Informally, it serves as a "mirror" of the past Q network, which significantly reduces the training variation of the current Q network. The rewritten loss function after adopting the double-network architecture is:

  L_i(θ_i) = (1 / |X̃_i|) Σ_{x ∈ X̃_i} [ Q(s, a; θ_i) − ( r(s, a) + β Q(s', argmax_{a' ∈ A_{s'}} Q(s', a'; θ_i); θ_i^−) ) ]^2    (25)

where θ_i^− denotes the parameters of the target network.

Now we shall formally introduce our proposed solution, termed Neural network-based Adaptive Frequency Adjustment (NAFA), whose running procedures in the training and application stages are respectively shown in Algorithm 1 and Algorithm 2.
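Before the pseudocode, here is a minimal PyTorch sketch of the double-network loss in Eq. (25). It assumes batched tensors and a feasibility mask encoding A_{s'} (infeasible actions are excluded from the argmax); the helper names are ours, not the authors' code.

```python
import torch
import torch.nn.functional as F

def ddqn_loss(q_net, target_net, s, a, r, s_next, feasible_mask, beta):
    """Eq. (25). s, s_next: [B, state_dim]; a: [B] long; r: [B] float;
    feasible_mask: [B, n_actions] bool encoding A_{s'} (True = allowed)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta_i)
    with torch.no_grad():
        q_next = q_net(s_next)                             # argmax by online net
        q_next = q_next.masked_fill(~feasible_mask, float("-inf"))
        a_star = q_next.argmax(dim=1, keepdim=True)
        # ...evaluated by the target net (theta_i^-)
        target = r + beta * target_net(s_next).gather(1, a_star).squeeze(1)
    return F.mse_loss(q_sa, target)
```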
Algorithm 1 Training Stage of NAFA
Input: initial/minimum exploration factor ε_0 / ε_min; discount factors for rewards/exploration β / ξ; tradeoff parameter η; learning rate γ; update periodicity of the target network ζ; steps per episode N_max; training episodes ep_max; batch size |X̃_i|; memory size X_max
Output: after-trained network parameters θ_final

  Initialize θ_i and θ_i^− with arbitrary values
  Initialize i = 1, ε = ε_0
  Initialize empty replay memory X
  Initialize the environment (or system status)
  for ep ∈ {1, 2, . . . , ep_max} do
      Wait until the first request comes
      repeat
          Observe the current system status s
          if random() < ε then
              Randomly select action a from A_s
          else
              a = argmax_{a ∈ A_s} Q(s, a; θ_i)
          end if
          Perform action a and realize reward r(s, a)
          Wait until the next request comes
          Observe the current system status s'
          Store (s, a, r(s, a), s') into replay memory X
          Sample minibatch X̃_i ∼ X
          Update parameters following
              θ_{i+1} = θ_i − γ ∇_{θ_i} L_i(θ_i)    (26)
          ε = ε_min + (ε_0 − ε_min) · exp(−i/ξ)
          if i mod ζ = 0 then
              θ_{i+1}^− = θ_{i+1}
          else
              θ_{i+1}^− = θ_i^−
          end if
          i = i + 1
      until i > ep · N_max
      Reset the environment (or system status)
  end for
  θ_final = θ_i

Overall, the running procedure of NAFA can be summarized as follows:
1) Initialization: NAFA first initializes the Q network with arbitrary values and sets the exploration factor to a pre-set value.
2) Iterated Training: We divide the training into several episodes. The environment (i.e., the system status) is reset to the initial stage each time an episode ends. In each episode, after the first request comes, the following sub-procedures are performed in sequence:
   a) Schedule Action: NAFA observes the current system status s (or state in our MDP formulation). Targeting the to-be-scheduled request (i.e., the i-th request), NAFA adopts an ε-greedy strategy. Explicitly, with probability ε, NAFA randomly explores the action space and randomly selects an action. With probability 1 − ε, NAFA greedily selects the action based on the current system status s and the current Q network's output.
   b) Interaction: NAFA performs the decided action a within the environment. A corresponding reward is realized as per Eq. (11). Note that the environment here could be the real operation environment of an MEC server, or a simulation environment that is set up for training purposes.
   c) Data Integration: After the interaction, NAFA sleeps until a new request arrives. Once evoked, NAFA checks the current status s' (regarded as the next state in the current iteration) and stores it into the replay memory, collectively with the current state s, the action a, and the realized reward r(s, a). The replay memory has a maximum size X_max. If the replay memory goes full, the oldest data is replaced by the new data.
   d) Update Q Network: A fixed batch size of data is sampled from the replay memory into the minibatch X̃_i. Then, NAFA performs back-propagation with learning rate γ on the Q network, using the samples within the minibatch X̃_i and the loss function defined in Eq. (25).
   e) Update Exploration Factor and Target Q Network: NAFA discounts ε using the discount factor ξ and updates the target network with a periodicity of ζ steps. After that, NAFA starts a new training iteration.
3) Application: After training has finished (i.e., it has gone through ep_max · N_max steps of training), the algorithm uses the obtained Q network (whose parameters are denoted by θ_final) to schedule requests in the real operation environment, i.e., based on the current state s_i, the action a_i = argmax_{a ∈ A_{s_i}} Q(s_i, a; θ_final) will be taken for each request.
Algorithm 2 Application Stage of NAFA
Input: after-trained network parameters θ_final
Output: action sequence a_1, a_2, . . .

  i = 1
  for each request that comes do
      Observe system status s_i
      a_i = argmax_{a ∈ A_{s_i}} Q(s_i, a; θ_final)
      Perform action a_i
      i = i + 1
  end for
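A minimal Python sketch of Algorithm 2 follows: greedy action selection over the feasibility set A_{s_i} using the trained network. The state encoding and the feasible-action list are assumed inputs (mirroring Eqs. (9) and (10)); all names are hypothetical.

```python
import torch

def select_action(q_net, state_vec, feasible, device="cpu"):
    """Application-stage rule of Algorithm 2: argmax over A_{s_i} of
    Q(s_i, a; theta_final). feasible: list of allowed action indices."""
    with torch.no_grad():
        s = torch.as_tensor(state_vec, dtype=torch.float32, device=device)
        q = q_net(s.unsqueeze(0)).squeeze(0)   # Q values for all actions
    return max(feasible, key=lambda a: q[a].item())
```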
5 Experiment

• Programming and running environment: We have implemented NAFA and the simulation environment on the basis of PyTorch and gym (a site package that is commonly used for environment construction in RL). Besides, all the computation in our simulation is run by a high-performance workstation (Dell PowerEdge T630 with 2x GTX 1080Ti).

• Simulation of energy arrival:
In our simulation, NAFA is deployed on an MEC server which is driven by solar power (a typical example of intermittent energy supply). To simulate the energy arrival pattern, we use data from HelioClim-3, a satellite-derived solar radiation database. Explicitly, we derive the Global Horizontal Irradiance (GHI) data in Kampala, Uganda (latitude 0.329, longitude 32.499) from 2005-01-01 to 2005-12-31 (for the training dataset) and from 2006-01-01 to 2006-12-31 (for the test dataset). Note that the GHI data for some days are missing in the dataset, so we use 300 out of 365 days of intact data respectively from the training and test datasets during our experiment. From the GHI data, we calculate the amount of arriving energy during a given time interval [t_1, t_2] as:

  λ_{t_1,t_2} = ∫_{t_1}^{t_2} panel_size · GHI(t) dt    (27)

where panel_size is the solar panel size of the MEC server. From λ_{t_1,t_2}, we can derive the energy arrival between two request arrivals, i.e., λ_{i,i+1}.

• Simulation of request arrival and data size:
We use a Poisson process (with arrival rate λ_r) to generate the request arrival events (a similar simulation setting is available in [15], [16]). The data size of a request follows a uniform distribution (a similar setting is available in [17], [18]) on the scale of 10 MB to 30 MB, i.e., d_i ∼ Uniform(10, 30) MB.

• Action space:
We consider 4 possible actions for each request (three correspond to different levels of frequency and one is the rejection action). Formally, for the i-th request, a_i ∈ {0, 1, 2, 3}, where actions {1, 2, 3} correspond to the processing frequencies {f_1, f_2, f_3} (in GHz) and action 0 induces request rejection.

TABLE 2: Simulation parameters
  Symbol       Meaning                           Value
  ν            computation complexity            2e4
  κ            effective switched capacitance    1e-28
  m            number of CPU cores               12
  B_max        battery capacity                  1e6 (Joules)
  panel_size   solar panel size                  0.5 (m²)

TABLE 3: Training hyper-parameters for NAFA
  Symbol   Meaning                              Value
  ε_0      initial exploration factor           0.5
  ε_min    minimum exploration factor           0.01
  ξ        discount factor for exploration      3e4
  γ        learning rate                        5e-4
  β        discount factor for rewards          0.995
  ζ        target network update periodicity    5000
  |X̃_i|    batch size                           80
  X_max    size of replay buffer                1e6

• Network Structure:
The target network and the Q network in NAFA have exactly the same structure: a Deep Neural Network (DNN) model with two hidden layers, with 200 and 100 neurons respectively, activated by ReLU, and an output layer, which outputs the estimated Q values for all 4 actions.

• Simulation parameters:
All of the simulation parameters are specified in Table 2.

• Hyper-parameters and training details of NAFA:
All hyper-parameters are available in Table 3. In our simulation, we use ep_max = 30 × … episodes to train the model. The simulation time for each episode of training is 10 straight days. We reset the simulation environment based on different GHI data once the last request within these 10 days has been scheduled. Besides, it is important to note that, since we only have 300 days of GHI data as our training dataset, we re-use the same 300 days of GHI data for the later training episodes.

5. So, in our real implementation, the training steps N_max are not necessarily identical for all the training episodes.

• Uniformization of states:
In our implementation of NAFA, we have uniformized all the elements' values in a state to the scale of [0, 1] before input to the neural network. We are prone to believe that such a uniformization might potentially improve the training performance.
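To make the setup concrete, here is a small sketch of how the energy-arrival integral in Eq. (27) and the Poisson request arrivals could be simulated. The GHI series, step size, and names are illustrative assumptions, not the authors' gym environment.

```python
import random

def energy_arrival(ghi, t1, t2, panel_size=0.5, dt=1.0):
    """Eq. (27) via a left Riemann sum: lambda_{t1,t2} =
    integral of panel_size * GHI(t) dt; ghi(t) returns W/m^2."""
    total, t = 0.0, t1
    while t < t2:
        step = min(dt, t2 - t)
        total += panel_size * ghi(t) * step
        t += step
    return total  # Joules

def request_arrivals(rate_per_s, horizon_s, seed=0):
    """Poisson process: cumulative exponential inter-arrival times."""
    rng, t, times = random.Random(seed), 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)
        if t > horizon_s:
            return times
        times.append(t)
```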
For evaluation purposes, we implement three baseline methods, specified as follows:

1) Best Fit (BF): Best Fit is a rule-based scheduling algorithm derived from [19]. In our problem, BF tends to reserve energy for future use by scheduling the minimum processing frequency for the incoming request. Explicitly, it selects the action:

  a_i = argmin_{a ∈ A_{s_i}\{0}} f_a   if |A_{s_i}| > 1;   a_i = 0   otherwise    (28)

2) Worst Fit (WF):
Worst Fit is another rule-based scheduling algorithm derived from [20]. In our problem, WF is desperate to reduce the processing time of each request. It achieves this goal by scheduling the maximum processing frequency for the incoming request as far as possible. Explicitly, it selects the action:

  a_i = argmax_{a ∈ A_{s_i}\{0}} f_a   if |A_{s_i}| > 1;   a_i = 0   otherwise    (29)

3) linUCB: linUCB is an online learning solution that is typically applied in a linear contextual bandit model (see [21]). In our setting, linUCB learns the immediate rewards achieved under different contexts (or states in our formulation) and greedily selects the action that maximizes its estimated immediate reward. However, it disregards the state transition probability, or in other words, it ignores the rewards that might be obtained in the future. Besides, as with NAFA, in our implementation of linUCB we perform the same uniformization process on a state before its training. We do this in a bid to ensure a fair comparison.

In our experiment, we train linUCB and NAFA for the same number of episodes. After training, we use 300 straight days of simulation based on the same test dataset to validate the performance of the different scheduling strategies.

With the request arrival rate fixed at λ_r = 30 and different settings of the tradeoff parameter η, we shall show how the different scheduling strategies work. The results can be viewed in the box plot shown in Fig. 3. From this plot, we can derive the following observations:
1) In all the settings of η, NAFA outperforms all the other baselines in terms of average returned rewards. This result strongly shows the superiority of our proposed solution: not only is NAFA capable of adaptively adjusting its action policy based on the operator's preference (embodied by the tradeoff η), but it also has a stronger learning performance compared with the other learning algorithm, i.e., linUCB.
2) As η becomes bigger, NAFA's and linUCB's average rewards approach 0, while the rule-based algorithms (i.e., WF and BF) reduce to a negative number. Our explanation for this phenomenon is that when η is set to a sufficiently high value, the penalty brought by processing time exceeds the real reward brought by accepting a request (see our definition of rewards in Eq. (11)). As such, accepting a request that takes too much time to process is no longer a beneficial action under this extreme tradeoff setting, so both NAFA and linUCB learn to accept only a small portion of requests (perhaps those with smaller data sizes) and only acquire a slightly positive reward.

Recall that the reward is composed of exactly two parts: the real reward of accepting a request, and a penalty on processing time. To draw a clearer picture of these two parts, we show in Fig. 4 how the different algorithms (and different settings of η) perform in terms of processing time and rejection ratio when fixing λ_r = 30. Intuitively, we derive the following observations:
1) Comparing BF with WF, BF leads to a higher average processing time, but meanwhile, a lower average rejection ratio is also observed. This phenomenon is wholly comprehensible considering the working patterns of BF and WF. BF tends to schedule the incoming request at a lower processing frequency in an attempt to conserve energy for future use. This conservative action might lead to a higher average processing time, but at the same time, it is supposed to yield a lower rejection ratio when the power supply is limited (e.g.,
at night). By contrast, WF might experience more rejections due to power shortage at such times, as a result of its prodigal manner.
2) With a larger tradeoff η, NAFA and linUCB experience a drop in terms of processing latency, but a rise in the rejection ratio is also observed. This again corroborates that the tradeoff parameter η defined in the rewards is functioning well, meeting the original design purpose. In this way, operators should be able to adaptively adjust the algorithm's performance based on their own appetites towards the two objectives.
3) NAFA-0, NAFA-1, and NAFA-2 significantly outperform BF in terms of both objectives, i.e., smaller processing time and lower rejection ratio. Besides, NAFA-2 and NAFA-3 also outperform WF. For the same tradeoff parameter, linUCB is outperformed by NAFA in terms of both objectives in most of the experiment groups. Finally, no algorithm outperforms NAFA in both objectives. These observations further justify the superiority of NAFA.

To have a closer inspection of the algorithms' performance and to explore the hidden motivation that leads to rejection, in Fig. 5 we show the composition of request treatments while again fixing λ_r = 30. Based on the figure, the following observations follow:
1) Most of the rejections of WF are caused by the energy being fully reserved, while most of the rejections of BF result from the resources being fully loaded. This phenomenon is in accordance with their respective behavioral patterns. BF, which is thrifty on energy usage, might experience too-long processing times, which in turn