Deep Reinforcement Learning-based Task Offloading in Satellite-Terrestrial Edge Computing Networks

Dali Zhu, Haitao Liu, Ting Li, Jiyan Sun, Jie Liang, Hangsheng Zhang, Liru Geng, and Yinlong Liu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
{zhudali, liuhaitao, liting0715, sunjiyan, liangjie, zhanghangsheng, gengliru, liuyinlong}@iie.ac.cn
*The corresponding author
Abstract—In remote regions (e.g., mountains and deserts), cellular networks are usually sparsely deployed or unavailable. With the appearance of new applications (e.g., industrial automation and environment monitoring) in remote regions, resource-constrained terminals become unable to meet the latency requirements. Meanwhile, offloading tasks to an urban terrestrial cloud (TC) via satellite link leads to high delay. To tackle these issues, the satellite edge computing architecture has been proposed, i.e., users can offload computing tasks to visible satellites for execution. However, existing works are usually limited to offloading tasks in pure satellite networks, and make offloading decisions based on predefined models of users. Besides, the runtime consumption of existing algorithms is rather high. In this paper, we study the task offloading problem in satellite-terrestrial edge computing networks, where tasks can be executed by the satellite or by the urban TC. The proposed Deep Reinforcement learning-based Task Offloading (DRTO) algorithm accelerates the learning process by adjusting the number of candidate offloading locations. In addition, the offloading location and bandwidth allocation depend only on the current channel states. Simulation results show that DRTO achieves near-optimal offloading cost with much lower runtime consumption, which makes it more suitable for satellite-terrestrial networks with fast-fading channels.
Index Terms—Satellite-terrestrial networks, edge computing, deep reinforcement learning, computation offloading, mixed-integer programming
I. INTRODUCTION
With the emergence of 5G technology and the expansion of human activities, new applications such as industrial automation [1] and real-time environmental monitoring [2] [3] appear in remote regions. However, due to expensive construction and maintenance costs, cellular base stations are usually sparsely deployed or unavailable in remote regions [4]. When resource-constrained terminals cannot meet the latency requirements of these new applications, computing tasks are offloaded to the urban terrestrial cloud (TC) [5] for execution via satellites [6] [7]. However, the long propagation distance between remote terminals and the urban TC leads to high latency, which cannot meet the requirements of some delay-sensitive applications. Thanks to the emergence of low-earth-orbit (LEO) satellites, the propagation delay is significantly reduced. Furthermore, researchers proposed the satellite edge computing (SatEC) architecture [8] [9] [10] by referring to mobile edge computing (MEC) [11]. Remote terminals can directly offload computing tasks to nearby visible satellites for execution, which further reduces the offloading delay.

Recently, there have been several efforts focusing on task offloading in SatEC networks. Zhang et al. [12] proposed a satellite-aerial integrated computing architecture, where ground/aerial users offload tasks to high-altitude platforms or LEO satellites. Considering the intermittent communication caused by satellite orbiting, Wang et al. [13] proposed an IoT-to-satellite offloading method based on game theory. However, they do not consider the cooperation between SatEC servers and urban terrestrial data centers. Actually, due to the limited computing capacity and energy reservation of satellites, when a large number of tasks are simultaneously offloaded, SatEC servers need to cooperate with the urban TC to provide satisfying computing service. As shown in Fig. 1, in a satellite-terrestrial integrated network, the LEO access satellite can choose to locally execute the offloaded tasks, or transparently forward them to its connected urban TC.

Furthermore, although some existing works focus on offloading in satellite-terrestrial integrated networks, they require predefined models. For example, the flight trajectories of aerial users are required in [12], and the flight trajectory of the unmanned aerial vehicle is required in [14]; these are usually difficult to obtain in practice. Instead, we propose to make offloading decisions based only on the current channel states, which are more convenient to obtain. In addition, to optimize the delay and energy consumption in SatEC networks, researchers usually formulate the offloading decision and bandwidth allocation problem as a mixed-integer programming (MIP) problem [15] [16]. 3D hypergraph matching [12], a game-theoretic approach [13] and a multiple-satellite offloading method [17] have been proposed to solve the hard MIP problem. However, all of them require a considerable number of iterations to reach a satisfying optimum. Hence, they are not suitable for making real-time offloading decisions, especially under the fast fading channels [18] caused by the high-speed movement of LEO satellites [19].
Fig. 1. Task Offloading in Satellite-Terrestrial Edge Computing Networks
In this paper, we consider a satellite-terrestrial edge computing network and model the offloading cost as the weighted sum of latency and energy consumption. To minimize the offloading cost, the offloading location decision and bandwidth allocation are formulated as a MIP problem. Then, we propose a low-complexity Deep Reinforcement learning-based Task Offloading (DRTO) algorithm to solve it. Specifically, a deep neural network (DNN) [20] takes only the current channel states as inputs and outputs a relaxed offloading location, which is then quantized into a set of candidate binary offloading locations. Given a candidate location, a bandwidth allocation convex problem is solved with the CVXPY [21] tool. The main contributions of this paper are summarized as follows:

• Satellite-terrestrial cooperative offloading. We consider a satellite-terrestrial cooperative edge computing architecture, where tasks can be executed by either the SatEC server or the urban TC. The offloading location decision and bandwidth allocation are formulated as a MIP problem.

• Model-free learning. The proposed DRTO algorithm makes offloading decisions based only on the current channel states. Meanwhile, DRTO can improve its offloading policy by learning from the real-time trend of channel states, which adapts to the high dynamics of satellite-terrestrial networks.

• Low time complexity. Compared with traditional optimization methods, DRTO completely removes the need to solve the hard MIP problem. Furthermore, we dynamically adjust the size of the action space to speed up the learning process. Simulation results show that the runtime consumption of DRTO is significantly decreased, while the offloading cost performance is not compromised.

The rest of this paper is organized as follows: Section II describes the system model and formulates the offloading cost minimization problem. The details of the DRTO algorithm are introduced in Section III. In Section IV, simulation results are presented. Finally, the paper is concluded in Section V.

II. SYSTEM MODEL AND PROBLEM FORMULATION
As shown in Fig. 1, LEO satellites fly above the surface of the earth at high speed and connect the remote STs to the ground station. The TC is directly connected to the ground station via an optical fiber, so its transmission delay can be ignored. We assume that the access satellite is always available, and consider N STs denoted by N = {1, 2, ..., N} and a TC within the coverage of the same access satellite. For simplicity, we refer to the wireless signal traveling from an ST to its access satellite as the 1st hop, and from the access satellite to the TC as the 2nd hop. We assume the access satellite can measure channel states before deciding the offloading locations and allocating the bandwidth. The notations used throughout the paper are listed in Table I.

TABLE I: NOTATIONS USED IN THIS PAPER

Notation    | Description
x_n         | Offloading location of the n-th ST
α_n B       | Bandwidth allocated to the n-th ST
α_{N+n} B   | Bandwidth allocated for forwarding the task of the n-th ST
B           | Total bandwidth of the access satellite
p_n         | Transmission power of the n-th ST
p_SAT       | Transmission power of the access satellite
h_n         | Channel gain between the n-th ST and its access satellite
h_TC        | Channel gain between the access satellite and the TC
N_0         | Noise power at the receiver
L           | Size of task
k           | Computational intensity
f_SAT       | CPU frequency of the SatEC server
f_TC        | CPU frequency of the TC
p_c         | Computing power consumption of the SatEC server
λ           | Latency-energy weight parameter
A. Offloading Location

For the task offloaded by the n-th ST, its access satellite can choose to process it locally or to transparently forward it to the connected TC. We denote the offloading location of the n-th ST as x_n, where x_n = 1 and x_n = 0 denote the SatEC server and the TC, respectively.

B. Offloading Cost
The quality of service (QoS) mainly depends on user-perceived latency and energy consumption. Moreover, considering the precious energy reservation of satellites, we also include the energy consumption of satellites in the cost. The detailed definitions of the offloading cost for different locations are given as follows:

1) Offloaded to the SatEC server: When tasks are offloaded to the SatEC server, the cost mainly consists of the STs' transmission cost and the SatEC server's computing cost. We denote α_n as the proportion of bandwidth allocated to the n-th ST; then the n-th ST's 1st-hop transmission rate is given by C_{1,n} = α_n B log2(1 + p_n h_n / N_0), where B denotes the total bandwidth of the access satellite, p_n denotes the transmission power of the n-th ST, h_n denotes the channel gain between the n-th ST and its access satellite, and N_0 denotes the noise power at the receiver.

Based on the 1st-hop transmission rate C_{1,n}, the transmission latency is given by T_{1,n} = L / C_{1,n}, where L denotes the task size (in bits). Then, the energy consumed by the n-th ST for transmission is given by E_{1,n} = p_n T_{1,n}. We simply ignore the queuing delay. The computing latency at the SatEC server is given by T^c_{1,n} = kL / f_SAT, where k denotes the computational intensity (in cycles/bit) of the task, and f_SAT denotes the CPU frequency (in cycles/s) of the SatEC server. The energy consumed by the SatEC server for computing is given by E^c_{1,n} = p_c T^c_{1,n}, where p_c denotes the computing power consumption (in Watt) of the SatEC server.

Therefore, the total latency perceived by the n-th ST and the energy consumed for the n-th ST are given by T^SAT_n = T_{1,n} + T^c_{1,n} and E^SAT_n = E_{1,n} + E^c_{1,n}, respectively.
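As an illustration, both branches of the per-task cost model of Sec. II-B (this SatEC branch, and the TC branch defined in the next subsection) can be sketched in plain Python. Symbols follow Table I; every numeric value used with these functions is illustrative, not one of the paper's simulation settings.

```python
import math

def cost_sat(alpha_n, B, p_n, h_n, N0, L, k, f_sat, p_c, lam):
    """Weighted latency-energy cost when the task runs on the SatEC server."""
    C1 = alpha_n * B * math.log2(1 + p_n * h_n / N0)  # 1st-hop rate C_{1,n}
    T1 = L / C1                                       # uplink latency T_{1,n}
    E1 = p_n * T1                                     # ST transmission energy
    Tc = k * L / f_sat                                # on-board computing latency
    Ec = p_c * Tc                                     # satellite computing energy
    return lam * (T1 + Tc) + (1 - lam) * (E1 + Ec)

def cost_tc(alpha_n, alpha_fwd, B, p_n, h_n, p_sat, h_tc, N0, L, k, f_tc, lam):
    """Weighted cost when the task is forwarded to the terrestrial cloud."""
    C1 = alpha_n * B * math.log2(1 + p_n * h_n / N0)       # 1st hop
    T1 = L / C1
    E1 = p_n * T1
    C2 = alpha_fwd * B * math.log2(1 + p_sat * h_tc / N0)  # 2nd hop C_{2,n}
    T2 = L / C2
    E2 = p_sat * T2
    Tc = k * L / f_tc   # TC computing latency; TC computing energy is ignored
    return lam * (T1 + T2 + Tc) + (1 - lam) * (E1 + E2)
```

Note how the TC branch trades extra transmission latency and forwarding energy for a faster CPU frequency f_TC, which is exactly the trade-off the offloading decision x_n resolves.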
2) Offloaded to the TC: When tasks are offloaded to the TC, apart from the transmission cost of the STs, the forwarding cost of the access satellite and the computing cost of the TC should be included. We denote α_{N+n} as the proportion of bandwidth allocated for forwarding the task of the n-th ST; then the 2nd-hop transmission rate for the n-th ST is given by C_{2,n} = α_{N+n} B log2(1 + p_SAT h_TC / N_0), where p_SAT denotes the transmission power of the access satellite and h_TC denotes the channel gain between the access satellite and the TC. Therefore, the forwarding latency and energy consumption for the n-th ST are given by T_{2,n} = L / C_{2,n} and E_{2,n} = p_SAT T_{2,n}, respectively. The computing latency at the TC is given by T^c_{2,n} = kL / f_TC, where f_TC denotes the CPU frequency (in cycles/s) of the TC. Thanks to the continuous electrical power supply of the TC, we simply ignore its computing energy consumption. Therefore, the total latency perceived by the n-th ST and the energy consumed for the n-th ST are given by T^TC_n = T_{1,n} + T_{2,n} + T^c_{2,n} and E^TC_n = E_{1,n} + E_{2,n}, respectively.

C. Problem Formulation
As mentioned above, the offloading cost is mainly composed of latency and energy consumption, which depend on the offloading locations, the current channel states and the bandwidth allocation. Therefore, the offloading cost minimization problem P is formulated as follows:

P: min_{x,α} F(x, α) = Σ_{n=1}^{N} { x_n [λ T^SAT_n + (1 − λ) E^SAT_n] + (1 − x_n) [λ T^TC_n + (1 − λ) E^TC_n] }   (1a)

s.t. x_n ∈ {0, 1}, ∀n ∈ N   (1b)
     0 ≤ Σ_{n=1}^{2N} α_n ≤ 1   (1c)
     α_n ≥ 0, ∀n ∈ {1, ..., 2N}   (1d)

where λ denotes the weight parameter balancing latency and energy consumption, and the bandwidth proportions cover both the 1st-hop entries α_1, ..., α_N and the 2nd-hop forwarding entries α_{N+1}, ..., α_{2N}.

It can be seen that problem P is a mixed-integer programming problem, in which the 0-1 integer variable x and the continuous variable α are mutually coupled. Such problems are commonly reformulated by a specific relaxation approach and then solved by powerful convex optimization techniques. However, these methods perform considerable iterations, and the original problem cannot be solved within the channel coherence time, especially when many STs simultaneously offload tasks. To tackle this dilemma, we propose an effective low-complexity Deep Reinforcement learning-based Task Offloading algorithm to obtain a near-optimal solution. Specifically, we adopt a DNN to map the current channel states to offloading locations, and improve the DNN via reinforcement learning.

III. DRTO: DEEP REINFORCEMENT LEARNING FOR TASK OFFLOADING
To minimize the offloading cost, we design an offloading algorithm π : h → x* that quickly selects the optimal offloading location x* = [x*_1, x*_2, ..., x*_N] based only on the current channel state h = [h_1, h_2, ..., h_N, h_TC].

Fig. 2. The diagram of DRTO
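As a toy illustration of the mapping π that the DNN approximates (the ReLU/sigmoid architecture is described in the next subsection), a plain-Python forward pass producing a relaxed location x̂ ∈ (0, 1)^N might look as follows. The layer sizes, random weights and input values here are placeholders, not the paper's trained network.

```python
import math
import random

def dnn_forward(h, weights):
    """Forward pass: ReLU on hidden layers, sigmoid on the output layer."""
    a = h
    for i, (W, b) in enumerate(weights):
        z = [sum(w * x for w, x in zip(row, a)) + bi for row, bi in zip(W, b)]
        last = (i == len(weights) - 1)
        # sigmoid squashes the output into (0, 1); ReLU elsewhere
        a = [1 / (1 + math.exp(-v)) if last else max(0.0, v) for v in z]
    return a

def random_layer(n_in, n_out, rng):
    """Placeholder zero-bias layer with small random Gaussian weights."""
    W = [[rng.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
N = 5  # number of STs (illustrative)
weights = [random_layer(N, 8, rng), random_layer(8, 8, rng), random_layer(8, N, rng)]
x_hat = dnn_forward([0.2, 0.9, 0.1, 0.7, 0.4], weights)  # relaxed location in (0,1)^N
```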
The diagram of DRTO is shown in Fig. 2. First, the DNN takes the current channel gain h as input and generates a relaxed offloading location x̂. Then, we quantize the relaxed location x̂ into K candidate binary offloading locations, namely x_1, x_2, ..., x_K. The optimal location x* is obtained by solving a series of bandwidth allocation convex problems. Subsequently, the newly obtained channel state-offloading location pair (h, x*) is added to the replay memory. A random batch is sampled from the memory to improve the DNN every δ time frames. To further reduce the runtime consumption, we dynamically adjust K to speed up the learning process. The details of the above stages are described in the following subsections. The pseudocode of the DRTO algorithm is summarized in Algorithm 1.

A. Generate the Offloading Location
As shown in the upper part of Fig. 2, in each time frame, the fully connected DNN takes the current channel gain h as input and generates a relaxed offloading location x̂ = [x̂_1, x̂_2, ..., x̂_N] (each entry is relaxed into the (0, 1) interval). Then, the relaxed location x̂ is quantized into K binary locations. Given a candidate location x_k, DRTO solves a bandwidth allocation convex problem and obtains the offloading cost. Subsequently, the optimal offloading location x* is selected according to the minimal offloading cost.

Although the mapping from channel state to offloading location is unknown and complex, thanks to the universal approximation theorem [22], we adopt a fully connected DNN to approximate this mapping. The DNN is characterized by the weights that connect the hidden neurons and is composed of four layers, namely the input layer, two hidden layers and the output layer.

Algorithm 1 The DRTO Algorithm
Input: Current channel gain h.
Output: Optimal offloading location x* and corresponding bandwidth allocation α*.
for t = 1, 2, ..., T do
    The DNN generates a relaxed offloading location x̂.
    Quantize x̂ into K_t candidate binary offloading locations x_k, k ∈ {1, 2, ..., K_t}.
    for k = 1, 2, ..., K_t do
        Given binary offloading location x_k, obtain the bandwidth allocation α_{x_k} and offloading cost F(x_k, α_{x_k}) by solving P'.
    end for
    Obtain the optimal offloading location x* = argmin_{x_k, k ∈ {1,...,K_t}} F(x_k, α_{x_k}).
    Add the newly obtained channel state-offloading location pair (h_t, x*) to the replay memory.
    if t mod δ == 0 then
        Sample a random batch from the memory to train the DNN.
    end if
    if t mod Δ == 0 then
        Adjust K_t using (6).
    end if
end for

Here, we use the ReLU activation function in the hidden layers and the sigmoid activation function in the output layer, so each entry of the output relaxed offloading location satisfies x̂_n ∈ (0, 1). Then, x̂ is quantized into K candidate binary offloading locations, where K ∈ [1, N]. Intuitively, a larger K creates higher diversity in the candidate offloading location set, thus increasing the chance of finding the globally optimal offloading location, but resulting in higher computational complexity. We adopt the order-preserving quantization method proposed in [23] to trade off performance and complexity. In order-preserving quantization, K is relatively small, but the diversity of candidate offloading locations is guaranteed. Its main idea is to preserve the order during quantization, i.e., for each quantized location x_k = [x_{k,1}, x_{k,2}, ..., x_{k,N}], x_{k,n} ≤ x_{k,m} should hold if x̂_n ≤ x̂_m, for all n, m ∈ {1, 2, ..., N}. Specifically, the series of K quantized locations {x_k} is generated as follows:

1) Each entry of the 1st binary offloading location x_1 is given by

x_{1,n} = 1 if x̂_n > 0.5, and x_{1,n} = 0 if x̂_n ≤ 0.5,  for n = 1, 2, ..., N.  (2)

2) For the remaining K − 1 offloading locations, we first sort the entries of x̂ according to their distance to 0.5, i.e., |x̂_(1) − 0.5| ≤ |x̂_(2) − 0.5| ≤ ... ≤ |x̂_(N) − 0.5|, where x̂_(n) denotes the n-th sorted entry. Then, each entry of the k-th offloading location x_k, k = 2, 3, ..., K, is given by

x_{k,n} = 1 if x̂_n > x̂_(k−1),
x_{k,n} = 1 if x̂_n = x̂_(k−1) and x̂_(k−1) ≤ 0.5,
x_{k,n} = 0 if x̂_n = x̂_(k−1) and x̂_(k−1) > 0.5,
x_{k,n} = 0 if x̂_n < x̂_(k−1),
for n = 1, 2, ..., N.  (3)

Having obtained the K candidate offloading locations, given a candidate offloading location x_k, the original offloading cost minimization problem P is transformed into a convex problem on α:

P': min_α F(x_k, α)  (4a)
s.t. 0 ≤ Σ_{n=1}^{2N} α_n ≤ 1  (4b)

which can be solved by a convex optimization tool such as CVXPY [21]. We then obtain the optimal bandwidth allocation α*_{x_k} and the minimum offloading cost F(x_k, α*_{x_k}) for the given candidate offloading location x_k. By repeatedly solving problem P' for each candidate offloading location, the best offloading location is selected by

x* = argmin_{{x_k}, k = 1, 2, ..., K} F(x_k, α*_{x_k})  (5)

along with its corresponding optimal bandwidth allocation α*.

B. Update the Offloading Policy
Due to the rapid changes of satellite-terrestrial channel states, the offloading policy should be updated in time to reduce the offloading cost. Different from traditional deep learning, the training samples of DRTO are composed of the latest channel states h and offloading locations x*. Since the current offloading location is generated according to the policy of the last time frame, the training samples in adjacent time frames are strongly correlated. If the latest samples are used to train the DNN immediately, the network will be updated in an inefficient way, and the offloading policy may even fail to converge. Following the experience replay mechanism [24] proposed by Google DeepMind, the newly obtained state-location pair (h, x*) is added to the replay memory, replacing the oldest one if the memory is full. Subsequently, a random batch is sampled from the memory to improve the DNN. The cross-entropy loss is reduced using the Adam optimizer [25]. Such iterations repeat, and the policy of the DNN is gradually improved.

By utilizing the experience replay mechanism, we construct a dynamic training dataset for the DNN. Thanks to random sampling, convergence is accelerated because the correlation between training samples is reduced. Since the memory space is finite, the DNN is updated only according to recent experience, so the offloading policy π always adapts to recent channel changes.

C. Dynamically Adjust K

For each candidate offloading location, a bandwidth allocation convex problem is solved. Intuitively, a larger K leads to a better temporary offloading decision and a better long-term offloading policy. However, to select the optimal offloading location x* in each time frame, repeatedly solving the bandwidth allocation problem P' K times leads to high computational complexity. Therefore, there is a trade-off between performance and complexity in the setting of K.
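As a concrete sketch, the order-preserving quantization of (2)-(3) that generates the K (or K_t) candidate locations can be implemented in a few lines; the relaxed vector in the usage example is illustrative.

```python
def order_preserving_quantize(x_hat, K):
    """Generate K binary candidate locations from relaxed x_hat, per (2)-(3)."""
    N = len(x_hat)
    # eq. (2): threshold every entry at 0.5
    candidates = [[1 if v > 0.5 else 0 for v in x_hat]]
    # sort entry indices by distance to the 0.5 decision boundary
    order = sorted(range(N), key=lambda n: abs(x_hat[n] - 0.5))
    # eq. (3): the k-th candidate thresholds at the (k-1)-th sorted entry
    for k in range(1, K):
        th = x_hat[order[k - 1]]
        cand = [1 if (v > th or (v == th and th <= 0.5)) else 0 for v in x_hat]
        candidates.append(cand)
    return candidates

# usage: 3 candidates from a relaxed location for N = 5 STs
cands = order_preserving_quantize([0.2, 0.9, 0.1, 0.7, 0.4], 3)
```

The first candidate is the plain 0.5-thresholding of x̂; each subsequent candidate flips the entry closest to the decision boundary, which is why a small K already yields a diverse candidate set.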
Fig. 3. The index of the optimal offloading location with K = N = 5

With a fixed K = N, we plot the index of the optimal offloading location in each time frame. As shown in Fig. 3, at the very beginning of the learning process, the index of the optimal offloading location is relatively large. As the offloading policy improves, we observe that most of the optimal offloading locations are the first location generated by the above order-preserving quantization method. This indicates that a large value of K is computationally inefficient and unnecessary; in other words, most of the quantized offloading locations in each time frame are redundant. Therefore, to speed up the algorithm, we can gradually adjust K without compromising performance.

We denote K_t as the number of quantized offloading locations at time frame t, and k*_t as the index of the optimal location selected at time frame t. Inspired by [23], we initially set K_1 = N. Every Δ time frames, K_t is adjusted once. In an adjustment time frame, to increase the diversity of candidate offloading locations, K_t is tuned to max(k*_{t−1}, ..., k*_{t−Δ}) + 1. Therefore, K_t is given by

K_t = N, if t = 1;
K_t = min( max(k*_{t−1}, ..., k*_{t−Δ}) + 1, N ), if t mod Δ = 0;
K_t = K_{t−1}, otherwise.  (6)

IV. SIMULATION RESULTS
In this section, the performance of the proposed DRTO algorithm is evaluated via simulations. The average channel gain h_n or h_TC follows the free-space path loss model

h = A_d ( c / (4π f_c d) )^{d_e},  (7)

where A_d denotes the antenna gain, f_c the carrier frequency, d the distance, and d_e the path loss exponent. The first and second hidden layers of the DNN have … and … hidden neurons, respectively. The initial parameters of the DNN follow a zero-mean normal distribution. The DRTO algorithm is implemented in Python with TensorFlow 2.0. We respectively evaluate the performance of convergence, offloading cost and runtime. Other default parameters are listed in Table II.
TABLE II: SIMULATION PARAMETERS SETUP

Parameter                                                  | Value
Transmission power of ST p_n and satellite p_SAT (W)       | 1, 3
Antenna gain A_d and path loss exponent d_e                | …
Carrier frequency f_c (GHz)                                | 30
Total bandwidth B (MHz)                                    | 800
Receiver noise power N_0 (W)                               | 10^{−…}
Task size L (MB)                                           | 100
Computational intensity k (cycles/bit)                     | 10
Computing power consumption of SatEC server p_c (W)        | 0.5
CPU frequency of SatEC server f_SAT and TC f_TC (GHz)      | 0.4, 3
Latency-energy weight parameter λ                          | …
Training interval δ                                        | …
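Evaluating the free-space path-loss model (7) is straightforward; in this sketch the antenna gain A_d and path-loss exponent d_e are placeholder values rather than the paper's Table II settings, and the satellite distance is illustrative.

```python
import math

def free_space_gain(d, fc, A_d=4.11, d_e=2.8):
    """Average channel gain h = A_d * (c / (4*pi*fc*d))**d_e, per eq. (7).
    A_d and d_e defaults are illustrative placeholders."""
    c = 3e8  # speed of light (m/s)
    return A_d * (c / (4 * math.pi * fc * d)) ** d_e

# gain for an ST roughly 780 km from a LEO access satellite at fc = 30 GHz
h_st = free_space_gain(780e3, 30e9)
```

As expected from the model, the gain decreases monotonically with distance d and carrier frequency f_c.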
A. The Performance of Convergence

The DRTO algorithm is evaluated over a long run of time frames. In Fig. 4, we plot the training loss of the DNN, which gradually decreases and then stabilizes; the remaining fluctuation is mainly due to the random sampling of training data.

Fig. 4. The training loss of DRTO
In Fig. 5, we plot the normalized offloading cost, defined as

F̂(x*, α*) = F(x*, α*) / min_{x' ∈ {0,1}^N} F(x', α_{x'}),  (8)

where the numerator denotes the offloading cost achieved by the DRTO algorithm, and the denominator denotes the optimal offloading cost obtained by greedily enumerating all 2^N offloading locations. We set the update interval Δ = 64. As we can see, during the early time frames the normalized offloading cost fluctuates significantly, indicating that the offloading policy has not yet converged. Eventually, the normalized offloading cost converges to 1 in most time frames; only a few frames fluctuate slightly above 1 due to the rapid channel fading when inter-satellite handover occurs. In spite of this fluctuation, the DRTO algorithm still achieves near-optimal offloading cost performance.

Fig. 5. Normalized offloading cost with Δ = 64
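The denominator of (8) requires enumerating all 2^N binary offloading locations. A minimal sketch of that baseline follows; the cost function here is a stand-in for illustration, not the paper's full model.

```python
from itertools import product

def enumerate_best(N, cost_fn):
    """Evaluate all 2^N binary offloading locations and keep the cheapest."""
    best_x, best_cost = None, float("inf")
    for x in product((0, 1), repeat=N):  # 2^N candidate locations
        c = cost_fn(x)
        if c < best_cost:
            best_x, best_cost = x, c
    return best_x, best_cost

def toy_cost(x):
    # stand-in: pretend STs 0 and 2 are cheaper on the satellite (x_n = 1)
    return sum((1 - x[n]) if n in (0, 2) else x[n] for n in range(len(x)))

best_x, best_cost = enumerate_best(5, toy_cost)  # -> (1, 0, 1, 0, 0), cost 0
```

The exponential 2^N loop is exactly why enumeration is used only as an offline benchmark, while DRTO inspects at most K_t ≤ N candidates per frame.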
B. The Performance of Offloading Cost

Regarding offloading cost performance, we compare our DRTO algorithm with five representative benchmarks to demonstrate its superiority:

• Distributed Deep Learning-based Offloading (DDLO) [26]. Multiple DNNs take the duplicated channel gain as input, and each DNN generates a candidate offloading location. Then, the optimal offloading location is selected with respect to the minimum offloading cost. In the comparison with DRTO, we assume that DDLO is composed of N DNNs.

• Coordinate Descent (CD) [27]. The CD algorithm is a traditional numerical optimization method, which iteratively swaps the offloading location of the ST that yields the largest offloading cost decrement. The iteration stops when the offloading cost cannot be further decreased by swapping any offloading location.

• Enumeration. We enumerate all 2^N offloading location combinations and greedily select the best one.

• Pure TC Computing. The LEO access satellite forwards all tasks to the TC for execution.

• Pure SatEC Computing. The LEO access satellite locally executes all tasks.

We consider N = 5 STs attached to the same access satellite. In Fig. 6, we compare the average offloading cost per time frame achieved by the different offloading algorithms. As we can see, DRTO achieves similar performance to the greedy enumeration method, which verifies the optimality of DRTO. Since the optimal offloading location combination is unique, any other combination leads to higher offloading cost. In addition, DRTO achieves lower offloading cost, with about 17.5% and 23.6% reduction compared to the Pure TC Computing and Pure SatEC Computing methods, respectively, which indicates the necessity of cooperation between SatEC servers and TCs to provide satisfying computing service.

Fig. 6. Average Offloading Cost by Different Algorithms (N = 5)

C. The Performance of Runtime Consumption
Finally, we evaluate the runtime performance of DRTO. Since Pure TC Computing and Pure SatEC Computing are static, we compare DRTO with the other three dynamic benchmarks. Specifically, we record the total runtime consumption of the different algorithms over the same sequence of time frames, and compute the average runtime per time frame. The runtime comparison is shown in Fig. 7.

Although the four dynamic algorithms achieve similar offloading cost performance (Fig. 6), DRTO consumes the lowest runtime, with about 42.6%, 87.3% and 96.6% reduction compared to DDLO, CD and Enumeration, respectively, when N = 7. In addition, the runtime consumption of DRTO and DDLO does not explode when the network scale increases. This is because the DNN accurately fits the complex mapping from channel states to offloading locations; compared with the traditional CD or Enumeration methods, the action space of DRTO and DDLO is significantly reduced, resulting in far fewer iterations. In the comparison with DDLO, at the very beginning of the learning process, the action space of DRTO is the same as that of DDLO. As the offloading policy improves, the number of quantized candidate offloading locations in DRTO is dynamically adjusted, so the action space of DRTO is further reduced.

Actually, the channel coherence time is extremely short due to the high-speed movement of satellites. DRTO can quickly generate offloading locations and bandwidth allocations without compromising offloading cost performance, which better adapts to the fast channel fading in satellite-terrestrial edge computing networks.

Fig. 7. Average Execution Latency by Different Algorithms
V. CONCLUSION
In this paper, we investigate the joint offloading location decision and bandwidth allocation problem in satellite-terrestrial edge computing networks, and propose the DRTO algorithm to minimize the offloading cost based on currently observed channel states. DRTO improves its offloading policy by learning from past offloading experiences via reinforcement learning. To achieve faster convergence, we preserve order when generating candidate offloading locations and dynamically adjust K during the learning process. Simulation results show that our DRTO algorithm achieves near-optimal offloading cost performance comparable to existing algorithms, while significantly reducing runtime consumption, making real-time offloading optimization truly viable under the fast fading channels of satellite-terrestrial edge computing networks.

ACKNOWLEDGEMENT
This work was supported by the National Key Research and Development Program of China (No. 2017YFB0801900), the Priority Research Program of the Chinese Academy of Sciences (No. XDC02011000), the Chinese National Key Laboratory of Science and Technology on Information System Security (No. 6142111190303), and the Confidential Research Program (No. BMKY2018B17).

REFERENCES
[1] L. F. Abanto-Leon and G. H. A. Sim, "Fairness-aware hybrid precoding for mmwave noma unicast/multicast transmissions in industrial iot," in IEEE ICC, 2020.
[2] Z. Liu, X. Liu, and K. Li, "Deeper exercise monitoring for smart gym using fused rfid and cv data," in IEEE INFOCOM, 2020.
[3] Y. Wan, K. Xu, G. Xue, and F. Wang, "Iotargos: A multi-layer security monitoring system for internet-of-things in smart homes," in IEEE INFOCOM, 2020.
[4] Y. Jia, J. Zhang, P. Wang, L. Liu, X. Zhang, and W. Wang, "Collaborative transmission in hybrid satellite-terrestrial networks: Design and implementation," in IEEE WCNC, 2020.
[5] G. Guan, B. Li, Y. Gao, Y. Zhang, J. Bu, and W. Dong, "Tinylink 2.0: Integrating device, cloud, and client development for iot applications," in ACM MobiCom, 2020.
[6] T. Lv, W. Liu, H. Huang, and X. Jia, "Optimal data downloading by using inter-satellite offloading in leo satellite networks," in IEEE GLOBECOM, 2016.
[7] M. Zhang and W. Zhou, "Energy-efficient collaborative data downloading by using inter-satellite offloading," in IEEE GLOBECOM, 2019.
[8] R. Xie, Q. Tang, Q. Wang, X. Liu, F. R. Yu, and T. Huang, "Satellite-terrestrial integrated edge computing networks: Architecture, challenges, and open issues," IEEE Network, 2020.
[9] L. Yan, S. Cao, Y. Gong, H. Han, J. Wei, Y. Zhao, and S. Yang, "Satec: A 5g satellite edge computing framework based on microservice architecture," Sensors, 2019.
[10] Y. Wang, J. Yang, X. Guo, and Z. Qu, "Satellite edge computing for the internet of things in aerospace," Sensors, 2019.
[11] P. Mach and Z. Becvar, "Mobile edge computing: A survey on architecture and computation offloading," IEEE Communications Surveys and Tutorials, 2017.
[12] L. Zhang, H. Zhang, C. Guo, H. Xu, L. Song, and Z. Han, "Satellite-aerial integrated computing in disasters: User association and offloading decision," in IEEE ICC, 2020.
[13] Y. Wang, J. Yang, X. Guo, and Z. Qu, "A game-theoretic approach to computation offloading in satellite edge computing," IEEE Access, 2020.
[14] C. Zhou, W. Wu, H. He, P. Yang, F. Lyu, N. Cheng, and X. Shen, "Delay-aware iot task scheduling in space-air-ground integrated network," in IEEE GLOBECOM, 2019.
[15] J. Kim, T. Kim, M. Hashemi, C. G. Brinton, and D. J. Love, "Joint optimization of signal design and resource allocation in wireless d2d edge computing," in IEEE INFOCOM, 2020.
[16] S. Huang, G. Li, E. Ben-Awuah, B. O. Afum, and N. Hu, "A stochastic mixed integer programming framework for underground mining production scheduling optimization considering grade uncertainty," IEEE Access, 2020.
[17] J. Gao, L. Zhao, and X. Shen, "Service offloading in terrestrial-satellite systems: User preference and network utility," in IEEE GLOBECOM, 2019.
[18] P. Ramirez-Espinosa and F. J. Lopez-Martinez, "On the utility of the inverse gamma distribution in modeling composite fading channels," in IEEE GLOBECOM, 2019.
[19] H. Maattanen, B. Hofstrom, S. Euler, J. Sedin, X. Lin, O. Liberg, G. Masini, and M. Israelsson, "5g nr communication over geo or leo satellite systems: 3gpp ran higher layer standardization aspects," in IEEE GLOBECOM, 2019.
[20] H.-J. Jeong, H.-J. Lee, C. H. Shin, and S.-M. Moon, "Ionn: Incremental offloading of neural network computations from mobile devices to edge servers," in ACM Symposium on Cloud Computing, 2018.
[21] S. Diamond and S. Boyd, "CVXPY: A Python-embedded modeling language for convex optimization," Journal of Machine Learning Research, 2016.
[22] S. Marsland, Machine learning: an algorithmic perspective. CRC Press, 2015.
[23] L. Huang, S. Bi, and Y. J. A. Zhang, "Deep reinforcement learning for online computation offloading in wireless powered mobile-edge computing networks," IEEE TMC, 2020.
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing atari with deep reinforcement learning," arXiv, 2013.
[25] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv, 2014.
[26] L. Huang, X. Feng, A. Feng, Y. Huang, and L. P. Qian, "Distributed deep learning-based offloading for mobile edge computing networks," Mobile Networks and Applications, 2018.
[27] S. Bi and Y. J. Zhang, "Computation rate maximization for wireless powered mobile-edge computing with binary computation offloading," IEEE Transactions on Wireless Communications, 2018.