Deep Reinforcement Learning-based Task Offloading in Satellite-Terrestrial Edge Computing Networks

Dali Zhu, Haitao Liu, Ting Li, Jiyan Sun, Jie Liang, Hangsheng Zhang, Liru Geng, and Yinlong Liu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
{zhudali, liuhaitao, liting0715, sunjiyan, liangjie, zhanghangsheng, gengliru, liuyinlong}@iie.ac.cn
*The corresponding author
Abstract—In remote regions (e.g., mountains and deserts), cellular networks are usually sparsely deployed or unavailable. With the appearance of new applications (e.g., industrial automation and environment monitoring) in remote regions, resource-constrained terminals become unable to meet the latency requirements. Meanwhile, offloading tasks to an urban terrestrial cloud (TC) via satellite link leads to high delay. To tackle these issues, the satellite edge computing architecture has been proposed, i.e., users can offload computing tasks to visible satellites for execution. However, existing works are usually limited to offloading tasks in pure satellite networks, and make offloading decisions based on predefined models of users. Besides, the runtime consumption of existing algorithms is rather high. In this paper, we study the task offloading problem in satellite-terrestrial edge computing networks, where tasks can be executed by the satellite or by the urban TC. The proposed Deep Reinforcement learning-based Task Offloading (DRTO) algorithm accelerates the learning process by adjusting the number of candidate offloading locations. In addition, the offloading location and bandwidth allocation depend only on the current channel states. Simulation results show that DRTO achieves near-optimal offloading cost with much lower runtime consumption, which makes it more suitable for satellite-terrestrial networks with fast-fading channels.
Index Terms—Satellite-terrestrial networks, edge computing, deep reinforcement learning, computation offloading, mixed-integer programming
I. INTRODUCTION
With the emergence of 5G technology and the expansion of human activities, new applications such as industrial automation [1] and real-time environmental monitoring [2] [3] appear in remote regions. However, due to expensive construction and maintenance costs, cellular base stations are usually sparsely deployed or unavailable in remote regions [4]. When resource-constrained terminals cannot meet the latency requirements of these new applications, computing tasks are offloaded to the urban terrestrial cloud (TC) [5] for execution via satellites [6] [7]. However, the long propagation distance between remote terminals and the urban TC leads to high latency, which cannot meet the requirements of some delay-sensitive applications. Thanks to the emergence of low-earth-orbit (LEO) satellites, the propagation delay is significantly reduced. Furthermore, researchers proposed the satellite edge computing (SatEC) architecture [8] [9] [10] by referring to mobile edge computing (MEC) [11]. Remote terminals can directly offload computing tasks to nearby visible satellites for execution, which further reduces the offloading delay.

Recently, there have been several efforts focusing on task offloading in SatEC networks. Zhang et al. [12] proposed a satellite-aerial integrated computing architecture, where ground/aerial users offload tasks to high-altitude platforms or LEO satellites. Considering the intermittent communication caused by satellite orbiting, Wang et al. [13] proposed an IoT-to-satellite offloading method based on game theory. However, they do not consider the cooperation between SatEC servers and urban terrestrial data centers. Actually, due to the limited computing capacity and energy reservation of satellites, when a large number of tasks are simultaneously offloaded, SatEC servers need to cooperate with the urban TC to provide satisfying computing service. As shown in Fig. 1, in a satellite-terrestrial integrated network, the LEO access satellite can choose to locally execute the offloaded tasks, or transparently forward them to its connected urban TC.

Furthermore, although some existing works focus on offloading in satellite-terrestrial integrated networks, they require predefined models. For example, the flight trajectories of aerial users are required in [12], and the flight trajectory of the unmanned aerial vehicle is required in [14]; these are usually difficult to obtain in practice. Instead, we propose to make offloading decisions based only on the current channel states, which are more convenient to obtain. In addition, to optimize the delay and energy consumption in SatEC networks, researchers usually formulate the offloading decision and bandwidth allocation problem as a mixed-integer programming (MIP) problem [15] [16]. 3D hypergraph matching [12], a game-theoretic approach [13] and a multiple-satellite offloading method [17] have been proposed to solve the hard MIP problem. However, all of them require a considerable number of iterations to reach a satisfying optimum. Hence, they are not suitable for making real-time offloading decisions, especially under the fast fading channels [18] caused by the high-speed movement of LEO satellites [19].
Fig. 1. Task Offloading in Satellite-Terrestrial Edge Computing Networks
In this paper, we consider a satellite-terrestrial edge computing network and model the offloading cost as the weighted sum of latency and energy consumption. To minimize the offloading cost, the offloading location decision and bandwidth allocation are formulated as a MIP problem. Then, we propose a low-complexity Deep Reinforcement learning-based Task Offloading (DRTO) algorithm to solve it. Specifically, a deep neural network (DNN) [20] takes only the current channel states as inputs and outputs a relaxed offloading location, which is then quantized into a set of candidate binary offloading locations. Given a candidate location, a bandwidth allocation convex problem is solved with the CVXPY [21] tool. The main contributions of this paper are summarized as follows:

• Satellite-terrestrial cooperative offloading. We consider a satellite-terrestrial cooperative edge computing architecture, where tasks can be executed by either the SatEC server or the urban TC. The offloading location decision and bandwidth allocation are formulated as a MIP problem.

• Model-free learning. The proposed DRTO algorithm makes offloading decisions based only on the current channel states. Meanwhile, DRTO can improve its offloading policy by learning from the real-time trend of channel states, which adapts to the high dynamics of satellite-terrestrial networks.

• Low time complexity. Compared with traditional optimization methods, DRTO completely removes the need to solve the hard MIP problem. Furthermore, we dynamically adjust the size of the action space to speed up the learning process. Simulation results show that the runtime consumption of DRTO is significantly decreased, while the offloading cost performance is not compromised.

The rest of this paper is organized as follows: Section II describes the system model and formulates the offloading cost minimization problem. The details of the DRTO algorithm are introduced in Section III. In Section IV, simulation results are presented. Finally, the paper is concluded in Section V.

II. SYSTEM MODEL AND PROBLEM FORMULATION
As shown in Fig. 1, LEO satellites fly above the surface of the earth at high speed and connect the remote STs to the ground station. The TC is directly connected to the ground station via an optical fiber, so its transmission delay can be ignored. We assume that the access satellite is always available, and consider N STs denoted by N = {1, 2, ..., N} and a TC within the coverage of the same access satellite. For simplicity, we refer to the wireless signal traveling from an ST to its access satellite as the 1st hop, and from the access satellite to the TC as the 2nd hop. We assume the access satellite can measure channel states before deciding the offloading locations and allocating the bandwidth. The notations used throughout the paper are listed in Table I.

TABLE I: NOTATIONS USED IN THIS PAPER

Notation    | Description
x_n         | Offloading location of the n-th ST
α_n B       | Bandwidth allocated to the n-th ST
α_{N+n} B   | Bandwidth allocated for forwarding the task of the n-th ST
B           | Total bandwidth of the access satellite
p_n         | Transmission power of the n-th ST
p_SAT       | Transmission power of the access satellite
h_n         | Channel gain between the n-th ST and its access satellite
h_TC        | Channel gain between the access satellite and the TC
N_0         | Noise power at the receiver
L           | Size of task
k           | Computational intensity
f_SAT       | CPU frequency of the SatEC server
f_TC        | CPU frequency of the TC
p_c         | Computing power consumption of the SatEC server
λ           | Latency-energy weight parameter
A. Offloading Location

For the task offloaded by the n-th ST, its access satellite can choose to process it locally or to transparently forward it to the connected TC. We denote the offloading location of the n-th ST as x_n, where x_n = 1 and x_n = 0 denote the SatEC server and the TC, respectively.

B. Offloading Cost
The quality of service (QoS) mainly depends on user-perceived latency and energy consumption. Moreover, considering the precious energy reservation of satellites, we also include the energy consumption of satellites in the cost. The detailed definitions of the offloading cost for different locations are given as follows:

1) Offloaded to the SatEC server: When tasks are offloaded to the SatEC server, the cost mainly consists of the STs' transmission cost and the SatEC server's computing cost. We denote α_n as the proportion of bandwidth allocated to the n-th ST; then the n-th ST's 1st-hop transmission rate is given by C_{1,n} = α_n B log2(1 + p_n h_n / N_0), where B denotes the total bandwidth of the access satellite, p_n denotes the transmission power of the n-th ST, h_n denotes the channel gain between the n-th ST and its access satellite, and N_0 denotes the noise power at the receiver.

Based on the 1st-hop transmission rate C_{1,n}, the transmission latency is given by T_{1,n} = L / C_{1,n}, where L denotes the task size (in bits). Then, the energy consumed by the n-th ST for transmission is given by E_{1,n} = p_n T_{1,n}. We simply ignore the queuing delay. The computing latency at the SatEC server is given by T^c_{1,n} = kL / f_SAT, where k denotes the computational intensity (in cycles/bit) of the task, and f_SAT denotes the CPU frequency (in cycles/s) of the SatEC server. The energy consumed by the SatEC server for computing is given by E^c_{1,n} = p_c T^c_{1,n}, where p_c denotes the computing power consumption (in Watt) of the SatEC server.

Therefore, the total latency perceived by the n-th ST and the energy consumed for the n-th ST are given by T^SAT_n = T_{1,n} + T^c_{1,n} and E^SAT_n = E_{1,n} + E^c_{1,n}, respectively.
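As an illustration, both branches of the per-task cost model of Sec. II-B (this SatEC branch, and the TC branch defined in the next subsection) can be sketched in plain Python. Symbols follow Table I; every numeric value used with these functions is illustrative, not one of the paper's simulation settings.

```python
import math

def cost_sat(alpha_n, B, p_n, h_n, N0, L, k, f_sat, p_c, lam):
    """Weighted latency-energy cost when the task runs on the SatEC server."""
    C1 = alpha_n * B * math.log2(1 + p_n * h_n / N0)  # 1st-hop rate C_{1,n}
    T1 = L / C1                                       # uplink latency T_{1,n}
    E1 = p_n * T1                                     # ST transmission energy
    Tc = k * L / f_sat                                # on-board computing latency
    Ec = p_c * Tc                                     # satellite computing energy
    return lam * (T1 + Tc) + (1 - lam) * (E1 + Ec)

def cost_tc(alpha_n, alpha_fwd, B, p_n, h_n, p_sat, h_tc, N0, L, k, f_tc, lam):
    """Weighted cost when the task is forwarded to the terrestrial cloud."""
    C1 = alpha_n * B * math.log2(1 + p_n * h_n / N0)       # 1st hop
    T1 = L / C1
    E1 = p_n * T1
    C2 = alpha_fwd * B * math.log2(1 + p_sat * h_tc / N0)  # 2nd hop C_{2,n}
    T2 = L / C2
    E2 = p_sat * T2
    Tc = k * L / f_tc   # TC computing latency; TC computing energy is ignored
    return lam * (T1 + T2 + Tc) + (1 - lam) * (E1 + E2)
```

Note how the TC branch trades extra transmission latency and forwarding energy for a faster CPU frequency f_TC, which is exactly the trade-off the offloading decision x_n resolves.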
2) Offloaded to the TC: When tasks are offloaded to the TC, apart from the transmission cost of the STs, the forwarding cost of the access satellite and the computing cost of the TC should be included. We denote α_{N+n} as the proportion of bandwidth allocated for forwarding the task of the n-th ST; then the 2nd-hop transmission rate for the n-th ST is given by C_{2,n} = α_{N+n} B log2(1 + p_SAT h_TC / N_0), where p_SAT denotes the transmission power of the access satellite and h_TC denotes the channel gain between the access satellite and the TC. Therefore, the forwarding latency and energy consumption for the n-th ST are given by T_{2,n} = L / C_{2,n} and E_{2,n} = p_SAT T_{2,n}, respectively. The computing latency at the TC is given by T^c_{2,n} = kL / f_TC, where f_TC denotes the CPU frequency (in cycles/s) of the TC. Thanks to the continuous electrical power supply of the TC, we simply ignore its computing energy consumption. Therefore, the total latency perceived by the n-th ST and the energy consumed for the n-th ST are given by T^TC_n = T_{1,n} + T_{2,n} + T^c_{2,n} and E^TC_n = E_{1,n} + E_{2,n}, respectively.

C. Problem Formulation
As mentioned above, the offloading cost is mainly composed of latency and energy consumption, which depend on the offloading locations, the current channel states and the bandwidth allocation. Therefore, the offloading cost minimization problem P is formulated as follows:

P: min_{x,α} F(x, α) = Σ_{n=1}^{N} { x_n [λ T^SAT_n + (1 − λ) E^SAT_n] + (1 − x_n) [λ T^TC_n + (1 − λ) E^TC_n] }   (1a)

s.t. x_n ∈ {0, 1}, ∀n ∈ N   (1b)
     0 ≤ Σ_{n=1}^{2N} α_n ≤ 1   (1c)
     α_n ≥ 0, ∀n ∈ {1, ..., 2N}   (1d)

where λ denotes the weight parameter balancing latency and energy consumption, and the bandwidth proportions cover both the 1st-hop entries α_1, ..., α_N and the 2nd-hop forwarding entries α_{N+1}, ..., α_{2N}.

It can be seen that problem P is a mixed-integer programming problem, in which the 0-1 integer variable x and the continuous variable α are mutually coupled. Such problems are commonly reformulated by a specific relaxation approach and then solved by powerful convex optimization techniques. However, these methods perform considerable iterations, and the original problem cannot be solved within the channel coherence time, especially when many STs simultaneously offload tasks. To tackle this dilemma, we propose an effective low-complexity Deep Reinforcement learning-based Task Offloading algorithm to obtain a near-optimal solution. Specifically, we adopt a DNN to map the current channel states to offloading locations, and improve the DNN via reinforcement learning.

III. DRTO: DEEP REINFORCEMENT LEARNING FOR TASK OFFLOADING
To minimize the offloading cost, we design an offloading algorithm π : h → x* that quickly selects the optimal offloading location x* = [x*_1, x*_2, ..., x*_N] based only on the current channel state h = [h_1, h_2, ..., h_N, h_TC].

Fig. 2. The diagram of DRTO
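As a toy illustration of the mapping π that the DNN approximates (the ReLU/sigmoid architecture is described in the next subsection), a plain-Python forward pass producing a relaxed location x̂ ∈ (0, 1)^N might look as follows. The layer sizes, random weights and input values here are placeholders, not the paper's trained network.

```python
import math
import random

def dnn_forward(h, weights):
    """Forward pass: ReLU on hidden layers, sigmoid on the output layer."""
    a = h
    for i, (W, b) in enumerate(weights):
        z = [sum(w * x for w, x in zip(row, a)) + bi for row, bi in zip(W, b)]
        last = (i == len(weights) - 1)
        # sigmoid squashes the output into (0, 1); ReLU elsewhere
        a = [1 / (1 + math.exp(-v)) if last else max(0.0, v) for v in z]
    return a

def random_layer(n_in, n_out, rng):
    """Placeholder zero-bias layer with small random Gaussian weights."""
    W = [[rng.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
N = 5  # number of STs (illustrative)
weights = [random_layer(N, 8, rng), random_layer(8, 8, rng), random_layer(8, N, rng)]
x_hat = dnn_forward([0.2, 0.9, 0.1, 0.7, 0.4], weights)  # relaxed location in (0,1)^N
```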
The diagram of DRTO is shown in Fig. 2. First, the DNN takes the current channel gain h as input and generates a relaxed offloading location x̂. Then, we quantize the relaxed location x̂ into K candidate binary offloading locations, namely x_1, x_2, ..., x_K. The optimal location x* is obtained by solving a series of bandwidth allocation convex problems. Subsequently, the newly obtained channel state-offloading location pair (h, x*) is added to the replay memory. A random batch is sampled from the memory to improve the DNN every δ time frames. To further reduce the runtime consumption, we dynamically adjust K to speed up the learning process. The details of the above stages are described in the following subsections. The pseudocode of the DRTO algorithm is summarized in Algorithm 1.

A. Generate the Offloading Location
As shown in the upper part of Fig. 2, in each time frame, the fully connected DNN takes the current channel gain h as input and generates a relaxed offloading location x̂ = [x̂_1, x̂_2, ..., x̂_N] (each entry is relaxed into the (0, 1) interval). Then, the relaxed location x̂ is quantized into K binary locations. Given a candidate location x_k, DRTO solves a bandwidth allocation convex problem and obtains the offloading cost. Subsequently, the optimal offloading location x* is selected according to the minimal offloading cost.

Although the mapping from channel state to offloading location is unknown and complex, thanks to the universal approximation theorem [22], we adopt a fully connected DNN to approximate this mapping. The DNN is characterized by the weights that connect the hidden neurons and is composed of four layers, namely the input layer, two hidden layers and the output layer.

Algorithm 1 The DRTO Algorithm
Input: Current channel gain h.
Output: Optimal offloading location x* and corresponding bandwidth allocation α*.
for t = 1, 2, ..., T do
    The DNN generates a relaxed offloading location x̂.
    Quantize x̂ into K_t candidate binary offloading locations x_k, k ∈ {1, 2, ..., K_t}.
    for k = 1, 2, ..., K_t do
        Given binary offloading location x_k, obtain the bandwidth allocation α_{x_k} and offloading cost F(x_k, α_{x_k}) by solving P'.
    end for
    Obtain the optimal offloading location x* = argmin_{x_k, k ∈ {1,...,K_t}} F(x_k, α_{x_k}).
    Add the newly obtained channel state-offloading location pair (h_t, x*) to the replay memory.
    if t mod δ == 0 then
        Sample a random batch from the memory to train the DNN.
    end if
    if t mod Δ == 0 then
        Adjust K_t using (6).
    end if
end for

Here, we use the ReLU activation function in the hidden layers and the sigmoid activation function in the output layer, so each entry of the output relaxed offloading location satisfies x̂_n ∈ (0, 1). Then, x̂ is quantized into K candidate binary offloading locations, where K ∈ [1, N]. Intuitively, a larger K creates higher diversity in the candidate offloading location set, thus increasing the chance of finding the globally optimal offloading location, but resulting in higher computational complexity. We adopt the order-preserving quantization method proposed in [23] to trade off performance and complexity. In order-preserving quantization, K is relatively small, but the diversity of candidate offloading locations is guaranteed. Its main idea is to preserve the order during quantization, i.e., for each quantized location x_k = [x_{k,1}, x_{k,2}, ..., x_{k,N}], x_{k,n} ≤ x_{k,m} should hold if x̂_n ≤ x̂_m, for all n, m ∈ {1, 2, ..., N}. Specifically, the series of K quantized locations {x_k} is generated as follows:

1) Each entry of the 1st binary offloading location x_1 is given by

x_{1,n} = 1 if x̂_n > 0.5, and x_{1,n} = 0 if x̂_n ≤ 0.5,  for n = 1, 2, ..., N.  (2)

2) For the remaining K − 1 offloading locations, we first sort the entries of x̂ according to their distance to 0.5, i.e., |x̂_(1) − 0.5| ≤ |x̂_(2) − 0.5| ≤ ... ≤ |x̂_(N) − 0.5|, where x̂_(n) denotes the n-th sorted entry. Then, each entry of the k-th offloading location x_k, k = 2, 3, ..., K, is given by

x_{k,n} = 1 if x̂_n > x̂_(k−1),
x_{k,n} = 1 if x̂_n = x̂_(k−1) and x̂_(k−1) ≤ 0.5,
x_{k,n} = 0 if x̂_n = x̂_(k−1) and x̂_(k−1) > 0.5,
x_{k,n} = 0 if x̂_n < x̂_(k−1),
for n = 1, 2, ..., N.  (3)

Having obtained the K candidate offloading locations, given a candidate offloading location x_k, the original offloading cost minimization problem P is transformed into a convex problem on α:

P': min_α F(x_k, α)  (4a)
s.t. 0 ≤ Σ_{n=1}^{2N} α_n ≤ 1  (4b)

which can be solved by a convex optimization tool such as CVXPY [21]. We then obtain the optimal bandwidth allocation α*_{x_k} and the minimum offloading cost F(x_k, α*_{x_k}) for the given candidate offloading location x_k. By repeatedly solving problem P' for each candidate offloading location, the best offloading location is selected by

x* = argmin_{{x_k}, k = 1, 2, ..., K} F(x_k, α*_{x_k})  (5)

along with its corresponding optimal bandwidth allocation α*.

B. Update the Offloading Policy
Due to the rapid changes of satellite-terrestrial channel states, the offloading policy should be updated in time to reduce the offloading cost. Different from traditional deep learning, the training samples of DRTO are composed of the latest channel states h and offloading locations x*. Since the current offloading location is generated according to the policy of the last time frame, the training samples in adjacent time frames are strongly correlated. If the latest samples are used to train the DNN immediately, the network will be updated in an inefficient way, and the offloading policy may even fail to converge. Following the experience replay mechanism [24] proposed by Google DeepMind, the newly obtained state-location pair (h, x*) is added to the replay memory, replacing the oldest one if the memory is full. Subsequently, a random batch is sampled from the memory to improve the DNN. The cross-entropy loss is reduced using the Adam optimizer [25]. Such iterations repeat, and the policy of the DNN is gradually improved.

By utilizing the experience replay mechanism, we construct a dynamic training dataset for the DNN. Thanks to random sampling, convergence is accelerated because the correlation between training samples is reduced. Since the memory space is finite, the DNN is updated only according to recent experience, so the offloading policy π always adapts to recent channel changes.

C. Dynamically Adjust K

For each candidate offloading location, a bandwidth allocation convex problem is solved. Intuitively, a larger K leads to a better temporary offloading decision and a better long-term offloading policy. However, to select the optimal offloading location x* in each time frame, repeatedly solving the bandwidth allocation problem P' K times leads to high computational complexity. Therefore, there is a trade-off between performance and complexity in the setting of K.
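As a concrete sketch, the order-preserving quantization of (2)-(3) that generates the K (or K_t) candidate locations can be implemented in a few lines; the relaxed vector in the usage example is illustrative.

```python
def order_preserving_quantize(x_hat, K):
    """Generate K binary candidate locations from relaxed x_hat, per (2)-(3)."""
    N = len(x_hat)
    # eq. (2): threshold every entry at 0.5
    candidates = [[1 if v > 0.5 else 0 for v in x_hat]]
    # sort entry indices by distance to the 0.5 decision boundary
    order = sorted(range(N), key=lambda n: abs(x_hat[n] - 0.5))
    # eq. (3): the k-th candidate thresholds at the (k-1)-th sorted entry
    for k in range(1, K):
        th = x_hat[order[k - 1]]
        cand = [1 if (v > th or (v == th and th <= 0.5)) else 0 for v in x_hat]
        candidates.append(cand)
    return candidates

# usage: 3 candidates from a relaxed location for N = 5 STs
cands = order_preserving_quantize([0.2, 0.9, 0.1, 0.7, 0.4], 3)
```

The first candidate is the plain 0.5-thresholding of x̂; each subsequent candidate flips the entry closest to the decision boundary, which is why a small K already yields a diverse candidate set.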
Fig. 3. The index of the optimal offloading location with K = N = 5

With a fixed K = N, we plot the index of the optimal offloading location in each time frame. As shown in Fig. 3, at the very beginning of the learning process, the index of the optimal offloading location is relatively large. As the offloading policy improves, we observe that most of the optimal offloading locations are the first location generated by the above order-preserving quantization method. This indicates that a large value of K is computationally inefficient and unnecessary; in other words, most of the quantized offloading locations in each time frame are redundant. Therefore, to speed up the algorithm, we can gradually adjust K without compromising performance.

We denote K_t as the number of quantized offloading locations at time frame t, and k*_t as the index of the optimal location selected at time frame t. Inspired by [23], we initially set K_1 = N. Every Δ time frames, K_t is adjusted once. In an adjustment time frame, to increase the diversity of candidate offloading locations, K_t is tuned to max(k*_{t−1}, ..., k*_{t−Δ}) + 1. Therefore, K_t is given by

K_t = N, if t = 1;
K_t = min( max(k*_{t−1}, ..., k*_{t−Δ}) + 1, N ), if t mod Δ = 0;
K_t = K_{t−1}, otherwise.  (6)

IV. SIMULATION RESULTS
In this section, the performance of the proposed DRTO algorithm is evaluated via simulations. The average channel gain h_n or h_TC follows the free-space path loss model

h = A_d ( c / (4π f_c d) )^{d_e},  (7)

where A_d denotes the antenna gain, f_c the carrier frequency, d the distance, and d_e the path loss exponent. The first and second hidden layers of the DNN have … and … hidden neurons, respectively. The initial parameters of the DNN follow a zero-mean normal distribution. The DRTO algorithm is implemented in Python with TensorFlow 2.0. We respectively evaluate the performance of convergence, offloading cost and runtime. Other default parameters are listed in Table II.
TABLE II: SIMULATION PARAMETERS SETUP

Parameter                                                  | Value
Transmission power of ST p_n and satellite p_SAT (W)       | 1, 3
Antenna gain A_d and path loss exponent d_e                | …
Carrier frequency f_c (GHz)                                | 30
Total bandwidth B (MHz)                                    | 800
Receiver noise power N_0 (W)                               | 10^{−…}
Task size L (MB)                                           | 100
Computational intensity k (cycles/bit)                     | 10
Computing power consumption of SatEC server p_c (W)        | 0.5
CPU frequency of SatEC server f_SAT and TC f_TC (GHz)      | 0.4, 3
Latency-energy weight parameter λ                          | …
Training interval δ                                        | …
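Evaluating the free-space path-loss model (7) is straightforward; in this sketch the antenna gain A_d and path-loss exponent d_e are placeholder values rather than the paper's Table II settings, and the satellite distance is illustrative.

```python
import math

def free_space_gain(d, fc, A_d=4.11, d_e=2.8):
    """Average channel gain h = A_d * (c / (4*pi*fc*d))**d_e, per eq. (7).
    A_d and d_e defaults are illustrative placeholders."""
    c = 3e8  # speed of light (m/s)
    return A_d * (c / (4 * math.pi * fc * d)) ** d_e

# gain for an ST roughly 780 km from a LEO access satellite at fc = 30 GHz
h_st = free_space_gain(780e3, 30e9)
```

As expected from the model, the gain decreases monotonically with distance d and carrier frequency f_c.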
A. The Performance of Convergence

The DRTO algorithm is evaluated over a long run of time frames. In Fig. 4, we plot the training loss of the DNN, which gradually decreases and then stabilizes; the remaining fluctuation is mainly due to the random sampling of training data.

Fig. 4. The training loss of DRTO
In Fig. 5, we plot the normalized offloading cost, defined as

F̂(x*, α*) = F(x*, α*) / min_{x' ∈ {0,1}^N} F(x', α_{x'}),  (8)

where the numerator denotes the offloading cost achieved by the DRTO algorithm, and the denominator denotes the optimal offloading cost obtained by greedily enumerating all 2^N offloading locations. We set the update interval Δ = 64. As we can see, during the early time frames the normalized offloading cost fluctuates significantly, indicating that the offloading policy has not yet converged. Eventually, the normalized offloading cost converges to 1 in most time frames; only a few frames fluctuate slightly above 1 due to the rapid channel fading when inter-satellite handover occurs. In spite of this fluctuation, the DRTO algorithm still achieves near-optimal offloading cost performance.

Fig. 5. Normalized offloading cost with Δ = 64
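The denominator of (8) requires enumerating all 2^N binary offloading locations. A minimal sketch of that baseline follows; the cost function here is a stand-in for illustration, not the paper's full model.

```python
from itertools import product

def enumerate_best(N, cost_fn):
    """Evaluate all 2^N binary offloading locations and keep the cheapest."""
    best_x, best_cost = None, float("inf")
    for x in product((0, 1), repeat=N):  # 2^N candidate locations
        c = cost_fn(x)
        if c < best_cost:
            best_x, best_cost = x, c
    return best_x, best_cost

def toy_cost(x):
    # stand-in: pretend STs 0 and 2 are cheaper on the satellite (x_n = 1)
    return sum((1 - x[n]) if n in (0, 2) else x[n] for n in range(len(x)))

best_x, best_cost = enumerate_best(5, toy_cost)  # -> (1, 0, 1, 0, 0), cost 0
```

The exponential 2^N loop is exactly why enumeration is used only as an offline benchmark, while DRTO inspects at most K_t ≤ N candidates per frame.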
B. The Performance of Offloading Cost

Regarding offloading cost performance, we compare our DRTO algorithm with five representative benchmarks to demonstrate its superiority:

• Distributed Deep Learning-based Offloading (DDLO) [26]. Multiple DNNs take the duplicated channel gain as input, and each DNN generates a candidate offloading location. Then, the optimal offloading location is selected with respect to the minimum offloading cost. In the comparison with DRTO, we assume that DDLO is composed of N DNNs.

• Coordinate Descent (CD) [27]. The CD algorithm is a traditional numerical optimization method, which iteratively swaps the offloading location of the ST that yields the largest offloading cost decrement. The iteration stops when the offloading cost cannot be further decreased by swapping any offloading location.

• Enumeration. We enumerate all 2^N offloading location combinations and greedily select the best one.

• Pure TC Computing. The LEO access satellite forwards all tasks to the TC for execution.

• Pure SatEC Computing. The LEO access satellite locally executes all tasks.

We consider N = 5 STs attached to the same access satellite. In Fig. 6, we compare the average offloading cost per time frame achieved by the different offloading algorithms. As we can see, DRTO achieves similar performance to the greedy enumeration method, which verifies the optimality of DRTO. Since the optimal offloading location combination is unique, any other combination leads to higher offloading cost. In addition, DRTO achieves lower offloading cost, with about 17.5% and 23.6% reduction compared to the Pure TC Computing and Pure SatEC Computing methods, respectively, which indicates the necessity of cooperation between SatEC servers and TCs to provide satisfying computing service.

Fig. 6. Average Offloading Cost by Different Algorithms (N = 5)

C. The Performance of Runtime Consumption
Finally, we evaluate the runtime performance of DRTO. Since Pure TC Computing and Pure SatEC Computing are static, we compare DRTO with the other three dynamic benchmarks. Specifically, we record the total runtime consumption of the different algorithms over the same sequence of time frames, and compute the average runtime per time frame. The runtime comparison is shown in Fig. 7.

Although the four dynamic algorithms achieve similar offloading cost performance (Fig. 6), DRTO consumes the lowest runtime, with about 42.6%, 87.3% and 96.6% reduction compared to DDLO, CD and Enumeration, respectively, when N = 7. In addition, the runtime consumption of DRTO and DDLO does not explode when the network scale increases. This is because the DNN accurately fits the complex mapping from channel states to offloading locations; compared with the traditional CD or Enumeration methods, the action space of DRTO and DDLO is significantly reduced, resulting in far fewer iterations. In the comparison with DDLO, at the very beginning of the learning process, the action space of DRTO is the same as that of DDLO. As the offloading policy improves, the number of quantized candidate offloading locations in DRTO is dynamically adjusted, so the action space of DRTO is further reduced.

Actually, the channel coherence time is extremely short due to the high-speed movement of satellites. DRTO can quickly generate offloading locations and bandwidth allocations without compromising offloading cost performance, which better adapts to the fast channel fading in satellite-terrestrial edge computing networks.

Fig. 7. Average Execution Latency by Different Algorithms
V. CONCLUSION
In this paper, we investigate the joint offloading location decision and bandwidth allocation problem in satellite-terrestrial edge computing networks, and propose the DRTO algorithm to minimize the offloading cost based on currently observed channel states. DRTO improves its offloading policy by learning from past offloading experiences via reinforcement learning. To achieve faster convergence, we preserve order when generating candidate offloading locations and dynamically adjust K during the learning process. Simulation results show that our DRTO algorithm achieves near-optimal offloading cost performance comparable to existing algorithms, while significantly reducing runtime consumption, making real-time offloading optimization truly viable under the fast fading channels of satellite-terrestrial edge computing networks.

ACKNOWLEDGEMENT
This work was supported by the National Key Research and Development Program of China (No. 2017YFB0801900), the Priority Research Program of the Chinese Academy of Sciences (No. XDC02011000), the Chinese National Key Laboratory of Science and Technology on Information System Security (No. 6142111190303), and the Confidential Research Program (No. BMKY2018B17).

REFERENCES
[1] L. F. Abanto-Leon and G. H. A. Sim, "Fairness-aware hybrid precoding for mmwave noma unicast/multicast transmissions in industrial iot," in IEEE ICC, 2020.
[2] Z. Liu, X. Liu, and K. Li, "Deeper exercise monitoring for smart gym using fused rfid and cv data," in IEEE INFOCOM, 2020.
[3] Y. Wan, K. Xu, G. Xue, and F. Wang, "Iotargos: A multi-layer security monitoring system for internet-of-things in smart homes," in IEEE INFOCOM, 2020.
[4] Y. Jia, J. Zhang, P. Wang, L. Liu, X. Zhang, and W. Wang, "Collaborative transmission in hybrid satellite-terrestrial networks: Design and implementation," in IEEE WCNC, 2020.
[5] G. Guan, B. Li, Y. Gao, Y. Zhang, J. Bu, and W. Dong, "Tinylink 2.0: Integrating device, cloud, and client development for iot applications," in ACM MobiCom, 2020.
[6] T. Lv, W. Liu, H. Huang, and X. Jia, "Optimal data downloading by using inter-satellite offloading in leo satellite networks," in IEEE GLOBECOM, 2016.
[7] M. Zhang and W. Zhou, "Energy-efficient collaborative data downloading by using inter-satellite offloading," in IEEE GLOBECOM, 2019.
[8] R. Xie, Q. Tang, Q. Wang, X. Liu, F. R. Yu, and T. Huang, "Satellite-terrestrial integrated edge computing networks: Architecture, challenges, and open issues," IEEE Network, 2020.
[9] L. Yan, S. Cao, Y. Gong, H. Han, J. Wei, Y. Zhao, and S. Yang, "Satec: A 5g satellite edge computing framework based on microservice architecture," Sensors, 2019.
[10] Y. Wang, J. Yang, X. Guo, and Z. Qu, "Satellite edge computing for the internet of things in aerospace," Sensors, 2019.
[11] P. Mach and Z. Becvar, "Mobile edge computing: A survey on architecture and computation offloading," IEEE Communications Surveys and Tutorials, 2017.
[12] L. Zhang, H. Zhang, C. Guo, H. Xu, L. Song, and Z. Han, "Satellite-aerial integrated computing in disasters: User association and offloading decision," in IEEE ICC, 2020.
[13] Y. Wang, J. Yang, X. Guo, and Z. Qu, "A game-theoretic approach to computation offloading in satellite edge computing," IEEE Access, 2020.
[14] C. Zhou, W. Wu, H. He, P. Yang, F. Lyu, N. Cheng, and X. Shen, "Delay-aware iot task scheduling in space-air-ground integrated network," in IEEE GLOBECOM, 2019.
[15] J. Kim, T. Kim, M. Hashemi, C. G. Brinton, and D. J. Love, "Joint optimization of signal design and resource allocation in wireless d2d edge computing," in IEEE INFOCOM, 2020.
[16] S. Huang, G. Li, E. Ben-Awuah, B. O. Afum, and N. Hu, "A stochastic mixed integer programming framework for underground mining production scheduling optimization considering grade uncertainty," IEEE Access, 2020.
[17] J. Gao, L. Zhao, and X. Shen, "Service offloading in terrestrial-satellite systems: User preference and network utility," in IEEE GLOBECOM, 2019.
[18] P. Ramirez-Espinosa and F. J. Lopez-Martinez, "On the utility of the inverse gamma distribution in modeling composite fading channels," in IEEE GLOBECOM, 2019.
[19] H. Maattanen, B. Hofstrom, S. Euler, J. Sedin, X. Lin, O. Liberg, G. Masini, and M. Israelsson, "5g nr communication over geo or leo satellite systems: 3gpp ran higher layer standardization aspects," in IEEE GLOBECOM, 2019.
[20] H.-J. Jeong, H.-J. Lee, C. H. Shin, and S.-M. Moon, "Ionn: Incremental offloading of neural network computations from mobile devices to edge servers," in ACM Symposium on Cloud Computing, 2018.
[21] S. Diamond and S. Boyd, "CVXPY: A Python-embedded modeling language for convex optimization," Journal of Machine Learning Research, 2016.
[22] S. Marsland, Machine learning: an algorithmic perspective. CRC Press, 2015.
[23] L. Huang, S. Bi, and Y. J. A. Zhang, "Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks," IEEE TMC, 2020.
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing atari with deep reinforcement learning," arXiv, 2013.
[25] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv, 2014.
[26] L. Huang, X. Feng, A. Feng, Y. Huang, and L. P. Qian, "Distributed deep learning-based offloading for mobile edge computing networks," Mobile Networks and Applications, 2018.
[27] S. Bi and Y. J. Zhang, "Computation rate maximization for wireless powered mobile-edge computing with binary computation offloading," IEEE Transactions on Wireless Communications, 2018.