A General Framework for Charger Scheduling Optimization Problems
A PREPRINT
Xual Li
Center for Advanced Computer Studies, School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA 70503. [email protected]
Miao Jin
Center for Advanced Computer Studies, School of Computing and Informatics, University of Louisiana at Lafayette, Lafayette, LA 70503. [email protected]
January 15, 2020

ABSTRACT
This paper presents a general framework to tackle a diverse range of NP-hard charger scheduling problems, optimizing the trajectory of mobile chargers to prolong the life of a Wireless Rechargeable Sensor Network (WRSN), a system consisting of sensors with rechargeable batteries and mobile chargers. Existing solutions to charger scheduling problems require problem-specific design and a trade-off between solution quality and computing time. Instead, we observe that instances of the same type of charger scheduling problem are solved repeatedly, with similar combinatorial structure but different data. We consider the search for an optimal charger schedule as a trial-and-error process, and the objective function of a charging optimization problem as the reward, a scalar feedback signal for each search. We propose a deep reinforcement learning-based charger scheduling optimization framework. The biggest advantage of the framework is that a diverse range of domain-specific charger scheduling strategies can be learned automatically from previous experiences. The framework also reduces the complexity of algorithm design for individual charger scheduling optimization problems. We pick three representative charger scheduling optimization problems, design algorithms based on the proposed deep reinforcement learning framework, implement them, and compare them with existing ones. Extensive simulation results show that our algorithms based on the proposed framework outperform all existing ones.
Keywords
Wireless rechargeable sensor networks · Mobile charger scheduling · Deep reinforcement learning
Wireless Rechargeable Sensor Networks (WRSNs), consisting of a group of sensors with rechargeable batteries and one or multiple mobile chargers, have become a promising solution to the energy limitation problem in Wireless Sensor Networks (WSNs). However, since a mobile charger needs to travel close to a sensor node to charge it, inefficient path planning of chargers may result in not just wasted energy but also the death of nodes and the failure of network tasks. Therefore, charger scheduling optimization has become a popular research topic that focuses on optimizing the trajectory of a mobile charger to prolong the life of a WRSN system.

Charger scheduling optimization problems can be roughly classified into two groups. One provides a budget, such as the total charging time or the energy spent on both charging and traveling; a charger seeks a path that maximizes an objective function, such as the total energy charged into nodes or the number of nodes charged within the budget Liang et al. [2017], Chen et al. [2016]. The other requires a given set of sensor nodes to be charged; a charger seeks a path that minimizes an objective function, such as the total energy spent on the road, the travel distance, or the total charging time Shi et al. [2011], Xie et al. [2012], He et al. [2014], Lu et al. [2015], Li and Jin [2019].
Charger scheduling optimization problems are in general NP-hard. They are tackled by exact, approximation, and heuristic algorithms. Specifically, an exact algorithm with a provable optimality guarantee uses an enumeration strategy but fails when the size of the dataset increases. An approximation algorithm with provable solution quality is in general desirable. However, the computational complexity of an approximation algorithm may also go sky-high when the size of the dataset or the required solution quality increases. A heuristic algorithm is another alternative, fast to compute although lacking an optimality or quality guarantee. Overall, existing solutions require problem-specific design and a trade-off between the solution quality requirement and computing time.

Existing algorithm design for charger scheduling optimization problems fails to exploit a common characteristic of these problems: instances of the same type of problem are solved repeatedly with similar combinatorial structure but different data Bello et al. [2016], Khalil et al. [2017]. Instead, we consider the search for an optimal charger schedule as a trial-and-error process, and the objective function of a charging optimization problem as the reward, a scalar feedback signal for each search. Specifically, we model charger scheduling optimization problems with a weighted graph and consider the objective function as a cumulative reward collected by charging sensor nodes along a path in one cycle. We then build a charger scheduling optimization framework based on deep reinforcement learning, a fundamental approach that combines reinforcement learning with deep neural networks to optimize a strategy maximizing the cumulative reward. The biggest advantage of the framework is that a diverse range of domain-specific charger scheduling strategies can be learned automatically from previous experiences, i.e., different graphs with various sizes.
The framework also reduces the complexity of algorithm design for individual charger scheduling optimization problems. We introduce the framework in four steps:
Graph construction:
We use a weighted graph to model charger scheduling optimization problems.
Graph representation:
We apply the structure2vec technique Dai et al. [2016], Khalil et al. [2017] to represent the weighted graph.
Deep reinforcement learning based framework:
The framework includes all the key components of a typical reinforcement learning algorithm, with functions specifically designed for charger scheduling optimization problems.
Deep-Q-Network algorithm:
A Deep-Q-Network (DQN) is applied to learn the policy, i.e., an optimal charger scheduling strategy.

The rest of this paper is organized as follows: Section 2 gives a brief review of charger scheduling optimization problems and introduces the basic concepts of deep reinforcement learning. Sections 3, 4, and 5 introduce three representative charger scheduling optimization problems, respectively, where the one in Section 5 has not been studied before. Section 6 provides the details of the proposed deep reinforcement learning-based charger scheduling optimization framework and shows step by step the algorithms designed to tackle the three charging problems based on the proposed framework. Section 7 presents the simulation results and a performance comparison between the algorithms designed based on the proposed framework and existing ones. Section 8 concludes the paper.
Charger scheduling optimization has been a popular research topic that focuses on optimizing the trajectory of a mobile charger to prolong the life of a WRSN system. Charger scheduling optimization problems can be roughly classified into two groups.

In the first group, a charger with a given budget (e.g., the total charging time, or the total energy spent on charging and traveling) seeks a path that maximizes an objective function (e.g., the total energy charged into nodes, or the total number of nodes being charged) Liang et al. [2017], Chen et al. [2016]. In Liang et al. [2017], a charger with an energy bound seeks a path that maximizes the charging rewards. The authors consider two scenarios, where sensors can be charged to full capacity at one time or to a certain energy level over several visits, and provide an approximation algorithm. In Chen et al. [2016], a charger seeks a charging path that maximizes the number of mobile sensor nodes charged within a given charging time. The authors discretize the moving trajectory of each sensor node by time and provide a quasi-polynomial time approximation algorithm.

In the second group, a charger is required to charge a given set of sensor nodes and seeks a path that minimizes an objective function (e.g., the total energy spent on the road, the total travel distance, or the total charging time) Shi et al. [2011], Xie et al. [2012], He et al. [2014], Lu et al. [2015], Li and Jin [2019], Ma et al. [2018], Xu et al. [2020]. A solution may not exist for problems of this category. In Shi et al. [2011], the authors consider a scenario where a mobile charger periodically travels inside a network to charge each sensor node. To maximize the ratio of the
charger's vacation time over the cycle time, the authors prove that its optimal traveling path in each renewable cycle is the shortest Hamiltonian cycle, and they propose a heuristic solution. In Xie et al. [2012], the authors apply multi-node wireless energy transfer technology and extend the study in Shi et al. [2011] to a scenario where multiple nodes can be charged at the same time. In He et al. [2014], the authors introduce a tree-based charging schedule to minimize the travel distance of a charger in robotic sensor networks. To guarantee that the charging schedule is depletion free for every robot, they provide theoretical guidance on setting the remaining-energy threshold at which robots request energy replenishment. In Lu et al. [2015], the authors consider minimizing both the travel distance of a charger and the charging delay of sensor nodes as a set of nested Traveling Salesman Problems. In Ma et al. [2018], a mobile charger charges multiple sensors simultaneously to minimize the travel distance. In Xu et al. [2020], the authors consider a multi-node charging setting where multiple neighboring sensors can be charged simultaneously by a single charger. They provide an approximation algorithm to minimize the longest delay of sensors waiting for charge.

We pick three representative problems that cover the two major groups and design issues in WRSNs in Sections 3, 4, and 5, respectively. We use them as concrete examples to illustrate how our deep reinforcement learning-based framework can be applied and provide solutions with superior performance in the following sections.
Given a goal-directed agent in an uncertain environment, the agent interacts with the environment through observations, actions, and feedback (rewards) on actions Sutton et al. [1998]. At each time step t, the agent observes the current state s_t and chooses an action a. The state of the environment then transits to the next state s_{t+1} and the agent receives a reward r_t. T(s_{t+1} | s_t, a) is a transition probability function indicating the probability that the environment transfers to state s_{t+1} if the agent takes action a at state s_t. The agent learns through previous experiences to select actions that maximize the expected cumulative discounted reward E[∑_{t=0}^{∞} γ^t r_t], where the discount rate γ is between 0 and 1.

The agent takes an action from A = {a_1, ..., a_n} at a state from S = {s_1, ..., s_m} based on a policy denoted by π(S, A). A policy can be either deterministic or stochastic. If the action space is discrete and the policy is deterministic, we choose value-based reinforcement learning, e.g., Q-learning. If the action space is continuous and the policy is stochastic, we choose policy-based reinforcement learning, e.g., policy gradient. Considering that the action space of charger scheduling optimization problems is discrete, i.e., a charger decides which exact node to charge next, we choose the value-based reinforcement learning technique, Q-learning, to build the framework.

Q-learning is a model-free reinforcement learning method that does not require an agent with full knowledge of the whole environment. The agent simply maintains a Q-table, which stores a Q-value, where Q stands for the quality/reward of each state-action pair. The agent selects the action that maximizes Q in the current state. However, it is infeasible to learn all the state-action pairs for most practical problems. Therefore, a function approximation technique Hornik [1991] is commonly used.
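The tabular Q-learning update behind this choice can be sketched in a few lines. The function below is a generic illustration rather than part of the proposed framework; the state and action objects are placeholders:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward the bootstrapped
    target r + gamma * max_b Q(s_next, b)."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q[(s, a)]
```

For example, starting from an empty table, a single update with unit reward moves Q(s, a) from 0 to alpha * r.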
For Q-learning, a function approximator Q(S, A; Θ) is parameterized by Θ with a size much smaller than the number of all possible state-action pairs. Deep Q-Network (DQN) Riedmiller [2005] applies deep neural networks as function approximators, combined with techniques including the experience replay method Mnih et al. [2013]. Considering the exponentially increasing number of state-action pairs in charger scheduling optimization problems, we choose DQN to build the framework for charger scheduling optimization problems in WRSNs.

A battery-powered mobile sensor usually follows a pre-designed mobility pattern to do jobs such as scientific exploration. The mobile network charging path optimization problem studied in Chen et al. [2016] seeks an optimal path for a mobile charger to charge as many mobile sensor nodes as possible within a fixed time horizon.
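Two DQN ingredients mentioned above, experience replay and exploratory action selection, can be sketched as follows. The class and function names are illustrative, and the Q-network itself is abstracted away as a list of per-action values:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay: store (s, a, r, s_next, done)
    transitions and sample minibatches to decorrelate updates."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick the argmax action with probability 1 - epsilon,
    otherwise a uniformly random action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

In training, transitions generated by epsilon-greedy rollouts are pushed into the buffer, and sampled minibatches drive the Q-network updates.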
The network model in Chen et al. [2016] assumes a set of battery-powered mobile sensor nodes that need to be recharged periodically. Denote V = {v_i | 1 ≤ i ≤ n} as the set of nodes deployed over a planar Field of Interest (FoI) and p_i(t) (1 ≤ i ≤ n) as the position of v_i at time t. A sensor node v_i is equipped with a rechargeable battery with capacity B. The trajectory of a mobile sensor node can be a curve of any form without imposing any constraint on its mobility pattern, but its trajectory (i.e., the moving position p_i(t) of node v_i at any t) is known to the charger. A mobile sensor node remains stationary while being charged.
Assume the moving speed of a mobile charger is faster than the upper bound of the moving speed of a mobile sensor. A mobile charger travels from a starting point and charges mobile sensor nodes one by one to a required battery level denoted as α before returning to the end point. The mobile charger needs to maximize the number of sensor nodes charged within a maximum charging timespan denoted as C. The total charging time along the charging path P is denoted as Γ(P), including the traveling time of the charger, the charging time of the sensor nodes, and the total sojourn time when the charger stays at a position without charging any mobile sensor. The total charging time Γ(P) needs to be less than the maximum charging timespan C.

The mobile network charging path optimization problem is defined as follows.

Definition 1.
Given a required battery level α and a maximum charging timespan C, the mobile network charging path optimization problem is to schedule an optimal path

P* = argmax_{Γ(P) ≤ C} |Λ(P)|,   (1)

where Λ(P) is the set of nodes charged on the path P.
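For intuition, Γ(P) and |Λ(P)| can be evaluated directly for any candidate charging order. The sketch below assumes straight-line travel at constant speed, a fixed charging rate, zero sojourn time, and static residual energies; these simplifications are ours, not part of the model in Chen et al. [2016]:

```python
import math

def path_stats(path, positions, residual, speed, rate, alpha, B):
    """Return (Gamma(P), |Lambda(P)|): total travel + charging time along
    the order `path`, and the number of nodes charged to level alpha*B."""
    total_time, charged = 0.0, 0
    for u, v in zip(path, path[1:]):
        total_time += math.dist(positions[u], positions[v]) / speed  # travel
        total_time += max(0.0, alpha * B - residual[v]) / rate       # charging
        charged += 1
    return total_time, charged
```

Keeping only candidate paths with Γ(P) ≤ C reproduces the feasibility test in equation (1).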
Theorem 1.
The mobile network charging path optimization problem is APX-hard and hence NP-hard.

Proof.
The proof can be found in the reference Chen et al. [2016].
A quasi-polynomial time algorithm that achieves a poly-logarithmic approximation is introduced in Chen et al. [2016]. The trajectory of each mobile sensor node is discretized by a time step ∆t to construct a directed acyclic graph. Vertices of the graph represent the discretized points along the trajectories of the mobile sensor nodes, plus the starting and ending positions of the charger. The approximation algorithm recursively decomposes the problem of searching for an optimal charging path into sub-problems of searching for sub-paths.

The computational complexity of the algorithm is O((n_d · min(n_d, C) · log C)^L), where n_d is the number of nodes after discretization, C is the maximum charging timespan, and L is the recursion level of the algorithm. Let n represent the original number of mobile sensor nodes; then n_d = nC/∆t. For a network with large size n and a long charging timespan C, the approximation algorithm has to sacrifice solution quality by either increasing the time step size ∆t or decreasing the recursion level L.

Given a set of energy-critical sensors and a mobile charger with a fixed energy capacity, the fully charging reward maximization problem studied in Liang et al. [2017] seeks an optimal tour for the mobile charger to maximize the sum of charging energy rewards.
The network model in Liang et al. [2017] assumes a set of stationary sensor nodes, V = {v_i | 1 ≤ i ≤ n}, deployed over a planar FoI with locations P = {p_i | 1 ≤ i ≤ n}. Each sensor node v_i is equipped with a rechargeable battery with a capacity of B.

A mobile charger with an energy capacity IE sends its departure time, denoted as t_0, from the service station to each sensor node. When receiving the message, node v_i estimates its residual energy at t_0, denoted as B_i(t_0). If B_i(t_0) is below an energy threshold α, i.e., B_i(t_0)/B ≤ α, v_i calculates a positive integer π_i (1 ≤ π_i ≤ n) that models the gain/reward of being charged. The reward is proportional to the amount of energy required to charge v_i to its full capacity. Node v_i then sends a charging request (id, p_i, π_i, B_i(t_0)), including node v_i's ID, position p_i, charging reward π_i, and residual energy B_i(t_0) at t_0, to the charger.
The mobile charger seeks a charging path P that maximizes the total reward of the nodes charged. Note that the total amount of energy consumed in charging sensor nodes and traveling along P should be no larger than the energy capacity IE of the mobile charger. The fully charging reward maximization problem is defined as follows.

Definition 2 (Fully charging reward maximization problem). The fully charging reward maximization problem is to schedule a charging path P

max ∑_{v_i ∈ P} π_i   (2a)
s.t. ∑_{v_i ∈ P} (B − B_i(t_0)) + Ω(P) · ξ ≤ IE,   (2b)

where Ω(P) is the length of path P, and ξ is the amount of energy consumed per traveling unit by the charger.
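The objective (2a) and the capacity constraint (2b) are cheap to evaluate for any candidate path, which is exactly what a learning agent needs as its reward signal. A minimal sketch, with illustrative names and the service station as the first element of the path:

```python
import math

def reward_and_feasible(path, positions, residual, prize, B, xi, IE):
    """Total prize of a candidate charging path P (eq. 2a), and whether
    charging energy plus travel energy Omega(P)*xi stays within IE (eq. 2b)."""
    charging_energy = sum(B - residual[v] for v in path[1:])
    omega = sum(math.dist(positions[path[i]], positions[path[i + 1]])
                for i in range(len(path) - 1))  # path length Omega(P)
    total_prize = sum(prize[v] for v in path[1:])
    return total_prize, charging_energy + omega * xi <= IE
```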
Theorem 2.
The fully charging reward maximization problem is NP-hard.

Proof.
The proof is contained in the reference Liang et al. [2017].

An approximation algorithm is introduced in Liang et al. [2017]. The basic idea of the algorithm is to reduce the fully charging reward maximization problem to the orienteering problem, for which an approximation algorithm is introduced in Bansal et al. [2004]. The orienteering problem is to find a path P in a graph G from a starting vertex to an ending one such that the total reward collected from the vertices along P, ∑_{v_i ∈ P} π_i, is maximized, and the length of P is no longer than the length of a maximum path Ω(P_max). The time complexity of the approximation algorithm is T_ori(3|V(G)|, |V(G)| + |E(G)|), where T_ori(|V(G)|, |E(G)|) is the time complexity of the approximation algorithm in Bansal et al. [2004].

One basic requirement of WSN deployment is full area coverage of a field of interest (FoI). Multiple coverage, where each point of the FoI is covered by at least k different sensors with k > 1 (k-coverage), is often applied to increase the sensing accuracy of data fusion and to enhance fault tolerance in case of node failures Yang et al. [2006], Simon et al. [2007], Liggins et al. [2008], Z. Zhou and Gupta [2009], Bai et al. [2011], Li et al. [2015]. A common practice to achieve k-coverage is a high density of sensor nodes randomly distributed over the monitored FoI Kumar et al. [2004], Hefeeda and Bagheri [2007].

The optimal k-coverage charging problem studies the mobile charger scheduling scenario in which the k-coverage ability of a network system needs to be maintained. A node sends a charging request with its position information and a charging deadline estimated from its current residual energy and battery consumption rate. A mobile charger seeks a path to charge sensor nodes before their charging deadlines under the constraint of maintaining the k-coverage ability of the monitored area.
At the same time, the charger tries to maximize the energy usage efficiency, i.e., to minimize the energy consumed on traveling per tour.

The optimal k-coverage charging problem has not been studied before. We describe the network model, formulate the problem, analyze its hardness, and propose a dynamic programming-based algorithm in detail in the following sections.

We consider a set of stationary sensor nodes, V = {v_i | 1 ≤ i ≤ n}, deployed over a planar FoI denoted as A with locations P = {p_i | 1 ≤ i ≤ n}. For each sensor node v_i, we assume a disk sensing model with sensing range r. If
the Euclidean distance between a point q ∈ A and node position p_i is within distance r, i.e., ||p_i − q|| ≤ r, then the point q is covered by sensor node v_i, and we write v_i(q) = 1, as shown in equation (3):

v_i(q) = { 1, if ||p_i − q|| ≤ r; 0, otherwise. }   (3)

Definition 3 (Full Coverage). If for any point q ∈ A there exists at least one sensor node covering it, i.e., ∑_{i=1}^{n} v_i(q) ≥ 1, then area A is fully covered.

Definition 4 (K-Coverage). If for any point q ∈ A there exist at least k ≥ 1 sensor nodes covering it, i.e., ∑_{i=1}^{n} v_i(q) ≥ k, then area A is k-covered. It is obvious that full coverage is a special case of k-coverage with k = 1.

A sensor node v_i is equipped with a rechargeable battery with capacity B, and B_i(t) denotes the residual energy of sensor node v_i at time t. A mobile charger sends its departure time, denoted as t_0, from the service station to each sensor node. When receiving the message, node v_i estimates its residual energy B_i(t_0) at t_0. If it is below an energy threshold α, i.e., B_i(t_0)/B ≤ α, sensor node v_i sends a charging request (id, p_i, D_i), including its ID, position p_i, and charging deadline D_i, to the charger. A charging deadline, i.e., an energy exhaustion time, is estimated from the residual energy B_i(t_0) at t_0 and an average battery consumption rate denoted as β_i; specifically, D_i = B_i(t_0)/β_i. Note that nodes may have different energy consumption rates.

A mobile charger with an average moving speed s is responsible for selecting and charging a number of sensor nodes before their charging deadlines to maintain the k-coverage of area A, and it also seeks a path with minimum energy consumption on traveling. We assume that the time spent on a charging path is less than the operation time of the sensors, so a sensor node only needs to be charged once in each tour.
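Definitions 3 and 4 can be checked numerically by sampling points of the FoI. Treating "every point q ∈ A" exactly requires the subregion argument introduced later, so the finite sample below is only an illustrative approximation:

```python
import math

def covered_by(q, p, r):
    """Disk sensing model of eq. (3): v_i(q) = 1 iff ||p_i - q|| <= r."""
    return 1 if math.dist(p, q) <= r else 0

def is_k_covered(sample_points, sensor_positions, r, k):
    """k-coverage (Def. 4) over a finite sample of points of area A."""
    return all(sum(covered_by(q, p, r) for p in sensor_positions) >= k
               for q in sample_points)
```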
Unless under an extremely dense sensor deployment, we consider that a mobile charger charges sensor nodes one by one, because the energy transfer efficiency reduces dramatically with distance; for instance, the efficiency drops to 45 percent at the charging distance reported in Kurs [2007]. We denote by r_c the energy transfer rate of a mobile charger. The charging time is defined as follows:

Definition 5 (Charging Time). Denote by P a charging path and by t_P(v_i) the charging time along P at node v_i. If P goes from node v_i to node v_j, the charging time at node v_j is

t_P(v_j) = t_P(v_i) + (B − B_i(t_P(v_i)))/r_c + d_ij/s, if t_P(v_i) + (B − B_i(t_P(v_i)))/r_c + d_ij/s ≤ D_j; ∞, otherwise,   (4)

where d_ij is the Euclidean distance between nodes v_i and v_j and s is the average moving speed of a mobile charger. The residual energy B_i(t) is estimated as B_i(t) = B_i(t_0) − β_i · (t − t_0).

The optimal k-coverage charging problem is then formulated as follows.

Definition 6 (Optimal k-coverage charging problem). Given a set of sensor nodes V = {v_i | 1 ≤ i ≤ n}, randomly deployed over a planar region A with locations P = {p_i | 1 ≤ i ≤ n} such that every point of A is at least k-covered initially, the optimal k-coverage charging problem is to schedule a charging path P

min |P|   (5a)
s.t. ∑_{i=1}^{n} v_i(q) ≥ k, ∀ q ∈ A,   (5b)
t_P(v_i) ≤ D_i, ∀ v_i ∈ P.   (5c)

Note that a charger does not need to respond to all the nodes sending requests.
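The recurrence of equation (4) propagates charging times along a candidate path. The sketch below uses static residual energies instead of the time-dependent estimate B_i(t), a simplification made only for illustration:

```python
import math

def charging_times(path, positions, residual, deadlines, B, r_c, s):
    """Charging times t_P(v) along `path` per eq. (4): finishing charging
    the previous node and traveling onward must not miss the next node's
    deadline, otherwise t_P becomes infinity (infeasible)."""
    t = {path[0]: 0.0}
    for u, v in zip(path, path[1:]):
        arrival = (t[u] + (B - residual[u]) / r_c                # charge u fully
                   + math.dist(positions[u], positions[v]) / s)  # travel to v
        t[v] = arrival if arrival <= deadlines[v] else math.inf
        if math.isinf(t[v]):
            break  # the rest of the path is infeasible
    return t
```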
Theorem 3.
The optimal k-coverage charging problem is NP-hard.

Proof. To prove the NP-hardness of the optimal k-coverage charging problem, we show that an NP-hard problem, the traveling salesman problem with deadlines, which seeks a minimum-cost tour visiting all nodes before their deadlines, can be reduced to a trivial case of the optimal k-coverage charging problem in polynomial time. We consider the trivial case where k = 1 and the initial deployment of sensor nodes has no coverage redundancy. A mobile charger then needs to charge all the sensor nodes sending requests before their deadlines to maintain full coverage. It is straightforward to see that a solution of the traveling salesman problem with deadlines is also a solution of this trivial case of the optimal k-coverage charging problem and vice versa. Since even finding a feasible path for the traveling salesman problem with deadlines is NP-complete Savelsbergh [1985], the optimal k-coverage charging problem is NP-hard.

Considering that the sensing region of a sensor node v_i is centered at p_i with radius r_i, the disk-shaped sensing regions of a network divide a planar FoI A into a set of subregions, marked as A = {a_j | 1 ≤ j ≤ m}. Then ∑_{i=1}^{n} v_i(a_j) is the number of sensors with subregion a_j within their sensing ranges. In the definition of the optimal k-coverage charging problem, we assume ∑_{i=1}^{n} v_i(a_j) ≥ k for the initial deployment of a network. Denote by r(a_j) the number of sensor nodes sending charging requests whose sensing regions include subregion a_j. Three cases exist for a subregion a_j:

Case I: ∑_{i=1}^{n} v_i(a_j) − r(a_j) ≥ k: none of the requests is essential for a_j.

Case II: ∑_{i=1}^{n} v_i(a_j) = k and ∑_{i=1}^{n} v_i(a_j) − r(a_j) < k: a mobile charger needs to charge all the sensor nodes sending requests whose sensing regions contain subregion a_j before their charging deadlines.
Case III: ∑_{i=1}^{n} v_i(a_j) > k and ∑_{i=1}^{n} v_i(a_j) − r(a_j) < k: a mobile charger needs to charge at least k − ∑_{i=1}^{n} v_i(a_j) + r(a_j) of the sensor nodes sending requests whose sensing regions contain subregion a_j before their charging deadlines.

A table denoted as T, with size m, is constructed to store the minimum number of sensors to charge for each subregion a_j. Specifically, T[j] = k − ∑_{i=1}^{n} v_i(a_j) + r(a_j); if the value is negative, we simply set T[j] to zero.

For a sensor node v_i sending a request, we divide its time window [t_0, t_0 + D_i] into a set of time units {t_i^k | 0 ≤ k ≤ D_i}, where t_i^0 = t_0 and t_i^{D_i} = D_i. We represent node v_i by a set of discretized nodes {v_i(t_i^k) | 0 ≤ k ≤ D_i}, where v_i(t_i^k) represents node v_i at time t_i^k. We construct a directed graph denoted as G with vertices and edges defined as follows.

Vertices.
The vertex set V(G) includes the discretized sensor nodes sending charging requests, i.e., {v_i(t_i^k) | 0 ≤ k ≤ D_i}.

Edges.
There exists a directed edge from v_i(t_i^k) to v_j(t_j^{k'}) in the edge set E(G) if and only if

t_i^k + (B − B_i)/r_c + d_ij/s > t_j^{k'−1} and t_i^k + (B − B_i)/r_c + d_ij/s ≤ t_j^{k'}, where k' > 0.

A directed edge ensures that a mobile charger arrives at sensor node v_j before its charging deadline, with the charging time being the arrival time of the charger.

Theorem 4. G is a directed acyclic graph (DAG).
Proof.
Suppose there exists a cycle in G. Assume vertices v_i(t_i^k) and v_j(t_j^{k'}) are on the cycle. Along the directed path from v_i(t_i^k) to v_j(t_j^{k'}), it is obvious that t_i^k < t_j^{k'}. However, along the directed path from v_j(t_j^{k'}) to v_i(t_i^k), we have t_j^{k'} < t_i^k, a contradiction. So G is a directed acyclic graph.

Definition 7 (Clique). A set of nodes {v_i(t_i^k) | 0 ≤ k ≤ D_i} in G is defined as a clique if they correspond to the same node v_i at different time units.

Definition 8 (Feasible Path). A path P in G is feasible if it passes through no more than one vertex of each clique and, at the same time, charging along P satisfies the k-coverage requirement of the given network.

We design a dynamic programming-based algorithm to find an optimal charging path for the k-coverage charging problem. To make sure that the computed charging path passes through no more than one discretized vertex of a sensor node, we apply the color coding technique introduced in Alon et al. [1995] to assign each vertex a color. Specifically, we generate a coloring function c_v : V → {1, ..., n} that assigns each sensor node a unique node color. Each sensor node then passes its node color to its discretized copies. A path in G is said to be colorful if each vertex on it is colored by a distinct node color. It is obvious that a colorful path in G passes through no more than one discretized vertex of each sensor node.

To take into consideration the traveling distance from the service station to individual sensor nodes, we add an extra vertex denoted as v_0 and connect it with directed edges to the vertices {v_i(t_i^k) | k = 0} in G. The length of the edge from v_0 to v_i(t_i^0) is the Euclidean distance between the service station and sensor node v_i. The table T constructed in Sec. 5.4 is stored at v_0.

We first topologically sort the new graph, i.e., G + v_0. Then we start from v_0 to find colorful paths by traversing from left to right in linearized order.
Specifically, v_0 checks the neighbors connected by outgoing edges and sends table T to those contributing to the decrease of at least one table entry. Once a vertex v_i(t_i^0) receives T, it checks the subregions within its sensing range and updates the corresponding entries of T. v_i(t_i^0) also generates a color set C = {c(v_i(t_i^0))} and stores it with T, which indicates a colorful path of length |C|.

Similarly, suppose the algorithm has traversed to vertex v_i(t_i^k). We check each color set C stored at v_i(t_i^k) and each outgoing edge to a vertex v_j(t_j^{k'}). If c(v_j(t_j^{k'})) ∉ C and charging v_j(t_j^{k'}) helps decrease at least one entry of the T associated with C, we add the color set C' = C + {c(v_j(t_j^{k'}))} along with the updated T to the collection of v_j(t_j^{k'}).

After the update of the last vertex in linearized order, we check the stored tables T in each node and identify those with all-zero entries. A color set associated with a T with all-zero entries represents a colorful path that is a feasible solution of the k-coverage problem.

A path can be easily recovered from a color set. The basic idea is to start from a vertex v_i(t_i^k) with a color set C. We check the stored color sets of the vertices connected to v_i(t_i^k) by incoming edges. Assume we identify a neighbor node v_j(t_j^{k'}) storing the color set C − {c(v_i(t_i^k))}; then we continue to trace back the path from v_j(t_j^{k'}) with the color set C − {c(v_i(t_i^k))}. When we trace back to v_0, we have recovered the whole charging path. Among all feasible charging paths, the one with minimal traveling distance is the optimal one.

Lemma 1.
The algorithm returns an optimal solution of the k-coverage charging problem, i.e., a feasible path maximizing the energy usage efficiency, if one exists.

Proof. We first show that the algorithm returns a feasible path. A path returned by the algorithm is a colorful one, which guarantees that the path passes through a sensor node no more than once. In the meantime, the charging time at each sensor node along the path is before its deadline; otherwise a directed edge along the path would not exist. An array T with all-zero entries makes sure that the k-coverage is maintained. When the algorithm has traversed to the i-th node in linearized order, each colorful path passing through the node has been stored in the node.

Note that the computational complexity of the dynamic programming algorithm can increase exponentially in the worst case, because the number of color sets stored at a vertex can grow exponentially with the number of sensor nodes n.
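The color-coding dynamic program above can be sketched compactly: states are (color set, residual table T) pairs propagated through the DAG in topological order, and any state whose table reaches all zeros corresponds to a feasible path (Def. 8). The `effect` callback, which decrements the table entries a vertex's sensing region covers, and the dictionary-based graph encoding are illustrative assumptions; paths here may start at any vertex, whereas the full algorithm starts only at v_0:

```python
def colorful_feasible_sets(order, edges, color, effect, T0):
    """Color-coding DP over a DAG given in topological `order`.
    states[v] holds (frozenset of colors, table T) pairs for colorful
    paths ending at v; a table of all zeros marks a feasible path."""
    states = {v: set() for v in order}
    feasible = []
    for v in order:
        new = {(frozenset([color[v]]), effect(v, T0))}  # path starting at v
        for u in order:
            if v in edges.get(u, ()):
                for colors, table in states[u]:
                    if color[v] not in colors:          # keep the path colorful
                        new.add((colors | {color[v]}, effect(v, table)))
        states[v] |= new
        feasible += [c for c, t in new if all(x == 0 for x in t)]
    return feasible
```

For instance, with two sensors each covering one of two subregions that each still need one charge, only the path through both sensors zeroes the table.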
We present the framework in four steps and use the three representative charger scheduling optimization problems discussed in Secs. 3, 4, and 5 as concrete examples to illustrate the algorithm design based on the framework.
Given a set of sensor nodes V = {v_i | 1 ≤ i ≤ n} deployed over a planar region with initial positions P = {p_i | 1 ≤ i ≤ n}, the sensing range of a sensor node v_i is r and the sensing model is a disk. Each node v_i is equipped with a rechargeable battery of capacity B. When the residual energy of v_i, represented as B_i(t) at time t, falls below a specific threshold, v_i sends a charging request to a mobile charger. The charger collects all charging requests before leaving the base station, denoted by v_0, at time t_0. The average speed of the charger is s. We use a weighted graph G(V, E, ω) to model the charger scheduling optimization problems. Specifically,

Vertices. A vertex v_i ∈ V(G) represents a sensor node sending a charging request.

Edges. An edge e(v_i, v_j) ∈ E(G) indicates a possible charging path of a charger from sensor node v_i to v_j without the violation of any constraints.

Edge Weight. The weight ω(v_i, v_j) assigned to edge e(v_i, v_j) is the Euclidean distance between nodes v_i and v_j.

More constraints are added to the graph model for each charger scheduling optimization problem. Specifically,

• Mobile Network Charging Path Optimization Problem: The graph is a fully connected undirected one with an edge connecting every pair of sensor nodes. The edge weights change dynamically because the sensor nodes are mobile. With the assumption that the exact location of each mobile node is known to a charger at any time t, we discretize the time and update the graph in each time step.

• Fully Charging Reward Maximization Problem: The graph is a fully connected undirected one with an edge connecting every pair of sensor nodes. Each vertex v_i is additionally associated with a positive integer in the range [1, n] to model the gain of charging v_i by a mobile charger. A sensor with less residual energy is assigned a larger prize, as it needs to be charged more urgently.

• Optimal k-coverage Charging Problem:

Vertices. The vertex set V(G) includes all the sensor nodes, i.e., {v_i | 1 ≤ i ≤ n}, and a start vertex denoted as v_0. Each vertex v_i has a deadline D_i. The location of v_0 is the service station, and the deadline of v_0 is inf. Note that the deadline of sensor nodes without sending any request is set as .

Edges. For any sensor nodes v_i and v_j in V(G), d_ij is the Euclidean distance between v_i and v_j. There exists a directed edge v_i → v_j in E(G) if and only if the inequality below holds:

    (B − B_i(t)) / r_c + d_ij / s ≤ D_j    (6)

where B_i(t) denotes the residual energy of sensor node i at the charger departure time t, r_c is the energy transfer rate of the mobile charger, and s is its average moving speed. If a charging path P goes from node v_i to node v_j, the charging time beginning at node v_j is given by Def. 5.

Definition 9 (Feasible Path). A path P in G is a feasible one if it starts from and ends at v_0, does not traverse a repeated vertex, and charges vertices before their charging deadlines. At the same time, charging along P satisfies the k-coverage requirement of the given network.

As we have mentioned in Sec. 2.2, deep Q-learning applies a parametrized function Q(S, A; Θ) to approximate a large state-action value function Q(S, A) that includes the combination of all pairs of states and actions, where Q(S, A; Θ) is parameterized by Θ with size much smaller than Q(S, A). It is expected that the state-action value function Q(S, A; Θ) well summarizes the state S of the current problem solving, i.e., it incorporates the current partial solution, including the selected vertices and their order, into the graph model constructed in Sec. 6.1. The state-action value function Q(S, A; Θ) should also estimate the reward value when a new vertex is picked, considered as taking an action A in the current state S. However, it is challenging to accurately describe both S and A on a graph. They may depend on the global and local statistics of the current graph. There exist different graph representation techniques. We apply structure2vec Dai et al.
[2016], Khalil et al. [2017], a deep learning-based graph embedding technique, to compute a p-dimensional node embedding for each vertex and a feature vector with the same dimension for the graph. The state-action value function Q(S, A; Θ) can then be defined by the computed feature vector and node embedding parameterized by Θ. The parameters Θ will be learned later using a deep reinforcement learning algorithm.

States, actions, transition, rewards, and policy are key components of any typical reinforcement learning algorithm. We add an objective function, an insertion function, and a stop function to the deep reinforcement learning-based charging framework. Before we discuss these key components of the framework, we briefly explain some terms borrowed from reinforcement learning. An episode in the framework refers to a complete sequence of sensor nodes charged under constraints or requirements. A step, denoted by t, refers to charging a single sensor node in an episode.

Objective function:
An objective function, denoted by f, reflects the quality of a partial solution to a charging problem. The definition of f varies from one charging problem to another, e.g., from maximizing the number of charged nodes to minimizing the traveling distance. To keep consistent, we negate f when the problem requires minimizing some value.

Insertion function:
An insertion function, denoted by g, is designed to insert a vertex v into the best position of a partial solution, e.g., s_{t+1} := g(s_t, v).

Stop function:
A stop function checks the problem constraints and determines when to stop the current episode.
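As a minimal sketch, the interplay of the objective function f, the insertion function g, and the stop function can be written as a generic episode loop. All names here are illustrative assumptions, and `pick` stands in for the learned policy discussed below.

```python
def run_episode(candidates, f, g, stop, pick):
    """One framework episode: repeatedly pick a candidate vertex, insert it
    with g, and terminate when the stop function flags a constraint violation
    (the offending vertex is dropped). Returns the solution and its f-value."""
    state = []                         # partial solution, starting from v_0
    while candidates:
        v = pick(state, candidates)    # policy: choose the next vertex
        nxt = g(state, v)              # insert v at its best position
        if stop(nxt):                  # constraint violated: remove v, stop
            break
        state = nxt
        candidates = [c for c in candidates if c != v]
    return state, f(state)
```

For instance, with f = len, an appending g, and a stop bound of at most two charged nodes, the loop returns the longest admissible prefix of the picked sequence.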
States:
A state s_t is an ordered list of the visited vertices at time step t, represented as a p-dimensional vector using the structure2vec Dai et al. [2016], Khalil et al. [2017] graph representation technique. It is a partial solution s_t ⊆ V(G). The initial state is denoted by s_0 = (v_0). A stop function determines the termination state s_end.

Actions:
Actions include all the candidate vertices at state s_t. Such candidacy depends on the definition of the individual charging problem; we discuss it later. Similar to a state, an action is represented as a p-dimensional node embedding using the structure2vec Dai et al. [2016], Khalil et al. [2017] graph embedding technique.

Transition:
Suppose v is a candidate vertex. After taking the action v at time step t, state s_t transitions to s_{t+1} := g(s_t, v), where v is inserted into the best position by the insertion function g.

Rewards:
A reward function, denoted by r(s_t, v), reflects the change of the objective function f after taking action v at state s_t and transitioning to state s_{t+1}:

    r(s_t, v) = f(s_{t+1}) − f(s_t).    (7)

It is obvious that the cumulative reward of an episode equals the objective function of the termination state, i.e., Σ_{i=1}^{n} r(s_i, v_i) = f(s_end), assuming the episode takes n steps.

Policy:
We choose an epsilon-greedy policy. The policy chooses an action achieving the current highest reward. However, with a small probability, it instead randomly selects the next action to avoid being stuck in a locally optimal solution. Specifically, at time step t, an action v_t is selected by

    v_t = argmax_{v ∈ \bar{s}_t} Q(s_t, v; Θ)   with probability 1 − ε,
    v_t = a random vertex in \bar{s}_t          otherwise.    (8)

We now illustrate how to tailor the model for each of the selected charger scheduling optimization problems.

Objective function: The objective function f counts the number of selected vertices, i.e., charged mobile nodes.

Insertion function:
The insertion function g simply inserts the vertex v selected at time step t at the end of state s_t.

Stop function:
The stop function checks s_{t+1} to make sure that the total charging time is no larger than a given maximum timespan of the charger. Otherwise, the selected vertex v is removed.

Actions:
Actions at time step t include all non-selected vertices at the current state s_t.

Rewards:
The reward function r(s_t, v) is defined as

    r(s_t, v) = 1,    (9)

because inserting one vertex into a partial solution increases the objective function f by one.
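For this problem the tailored components are simple enough to sketch directly. The helper names, the distance callback, and the timespan accounting below are assumptions for illustration, not the paper's code:

```python
def fits_timespan(path, dist, speed, charge_time, max_span):
    """Problem-specific stop check: the total travel time plus charging time
    along `path` must not exceed the charger's maximum timespan."""
    travel = sum(dist(path[i], path[i + 1]) for i in range(len(path) - 1))
    total = travel / speed + sum(charge_time[v] for v in path[1:])
    return total <= max_span

def count_charged(path):
    """Objective f: the number of charged nodes, excluding the start vertex."""
    return len(path) - 1
```

Each successful insertion then contributes reward 1, matching Eq. (9).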
Objective function: The objective function f counts the total prizes collected from the selected vertices, i.e., the mobile nodes charged. The prize at each vertex is proportional to the energy charged to reach its full capacity.

Insertion function:
The insertion function g finds a position to insert v into s_t that minimizes the energy the charger spends on the road, defined as

    g(s_t, v) = argmin_i {E_i + E_i^t},    (10)

where E_i + E_i^t is the energy the charger spends on the path when charging vertex v as the i-th one among the selected vertices.

Stop function:
The stop function checks s_{t+1} to make sure that the total amount of energy consumed on sensor charging and traveling is no greater than the energy capacity of a mobile charger. Otherwise, the selected vertex v is removed.

Actions:
Actions at time step t include all non-selected vertices at the current state s_t.

Rewards:
The reward function r(s_t, v) is defined as

    r(s_t, v) = E_v,    (11)

where E_v is the energy charged to bring v to its full capacity.

Objective function: The objective function f counts the total traveling distance of a charger. Because the problem requires minimizing the traveling distance, we negate f.

Insertion function:
The insertion function g finds a position to insert v into s_t that maximizes the reward and ensures each node is charged before its deadline.

Stop function: The stop function checks whether the current charging path guarantees the k-coverage requirement of the area. If the requirement is satisfied or R(S, v) is −inf, the algorithm terminates.

States:
A state S is a partial solution S ⊆ V(G), an ordered list of visited vertices. The first vertex in S is v_0.

Actions:
Let \bar{S} contain the vertices that are not in S and have at least one edge from vertices in S. An action is a vertex v from \bar{S} returning the maximum reward. After taking the action v, the partial solution S is updated as S' := (S, v), where

    v = argmax_{v ∈ \bar{S}} Q(S, v).    (12)

(S, v) denotes appending v to the best position after v_0 in S, i.e., the position that introduces the least traveling distance and maintains a valid charging time for all vertices in the new list.

Rewards:
The reward function R(S, v) is defined as the change of the traveling distance when taking the action v and transitioning from the state S to a new one S'. Assume v_i and v_j are two adjacent vertices in S, v_0 is the first vertex in S, and v_t is the last vertex in S. The reward function R(S, v_k) is defined as follows:

    R(S, v_k) = −min(d_ik + d_kj − d_ij, d_tk + d_k0 − d_t0)   if t_{S'}(v) ≠ inf,
    R(S, v_k) = −inf                                           otherwise,    (13)

where d_ij is the Euclidean distance between nodes v_i and v_j, and t_{S'}(v) is the updated charging time of a node in the path formed by S' after inserting v_k.

We adopt the deep Q-network algorithm introduced in Khalil et al. [2017] to learn the parameters Θ of the state-action value function Q(S, v; Θ). The adopted algorithm updates the parameters Θ every n steps instead of every single step, to obtain a more accurate estimation of the objective function. In particular, we apply the experience replay method in Mnih et al. [2013] to update Θ. The experience learned in each step is stored in a dataset. When we update Θ, a sample batch is randomly selected from the dataset. The experience replay method helps break data correlations and avoid oscillations of the parameters.
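The piecewise reward of Eq. (13) can be sketched as follows, assuming a Euclidean `dist` callback and a precomputed flag saying whether every charging deadline stays finite after the insertion; the minimum here ranges over every adjacent pair plus the position appended after the last vertex, generalizing the two candidate terms shown in the equation:

```python
import math

def insertion_reward(S, vk, dist, deadlines_ok):
    """Eq. (13) sketch: the negated cheapest increase in travel distance when
    inserting vk into tour S, or -inf when a charging deadline is violated."""
    if not deadlines_ok:
        return -math.inf
    # Cheapest insertion between any adjacent pair (v_i, v_j) of S.
    between = (min(dist(S[i], vk) + dist(vk, S[i + 1]) - dist(S[i], S[i + 1])
                   for i in range(len(S) - 1))
               if len(S) > 1 else math.inf)
    # Appending after the last vertex v_t, closing the tour back to v_0.
    closing = dist(S[-1], vk) + dist(vk, S[0]) - dist(S[-1], S[0])
    return -min(between, closing)
```

Inserting a vertex that lies on an existing segment costs nothing, so its reward is zero; an infeasible insertion is rejected with reward −inf.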
The input of the network is the p-dimensional vector produced by the graph embedding network, and the output is the optimal solution for the current state. The advantage of DQN is that it can handle delayed rewards, which represent the way to optimize the objective function. In each step of the algorithm, the graph embedding method is used to update the current partial solution, and the new p-dimensional vector, which contains the newest information, is used for the next step. Algorithm 1 summarizes the major steps of the algorithm.

Algorithm 1
DQN Algorithm
Initialize replay memory H to capacity C
for each episode do
    Initialize state s_0 = (v_0)
    for step t = 1 to n do
        Select v_t by (8)
        Add v_t to the partial solution by the insertion function: s_{t+1} := g(s_t, v)
        Calculate the reward r(s_t, v)
        Store the tuple (s_t, v, r(s_t, v), s_{t+1}) in H
        Sample a random batch (s_l, v_l, r(s_l, v_l), s_{l+1}) from H
        Update the network parameters Θ by the squared loss (y_l − Q(s_l, v_l; Θ))^2, where

            y_l = r(s_l, v) + γ max_{v'} Q(s_{l+1}, v'; Θ)   if s_{l+1} is non-terminal and v' ∈ \bar{s}_l,
            y_l = r(s_l, v)                                  otherwise.    (14)

        if s_{t+1} satisfies the stop function then
            break
        end if
    end for
end for
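Algorithm 1 can be exercised end to end on a toy instance. The sketch below substitutes a tabular Q-function for the paper's graph-embedding network and replays a single stored transition per step; it is an illustrative simplification under those assumptions, not the implementation used in the experiments:

```python
import random
from collections import deque

def train_dqn(vertices, reward_fn, episodes=200, gamma=0.9, eps=0.1,
              alpha=0.5, seed=0):
    """Toy Algorithm 1: build tours over `vertices`, store transitions in a
    replay memory, and regress Q toward the Eq. (14) target."""
    rng = random.Random(seed)
    Q = {}                      # (state tuple, action) -> value
    H = deque(maxlen=1000)      # replay memory
    for _ in range(episodes):
        s = ()                  # empty partial solution (start vertex elided)
        while len(s) < len(vertices):
            cand = [v for v in vertices if v not in s]
            # epsilon-greedy selection, Eq. (8)
            if rng.random() < eps:
                v = rng.choice(cand)
            else:
                v = max(cand, key=lambda a: Q.get((s, a), 0.0))
            s2 = s + (v,)
            H.append((s, v, reward_fn(s, v), s2))
            # replay one stored transition and apply the Eq. (14) target
            sl, vl, rl, sl2 = rng.choice(H)
            nxt = [a for a in vertices if a not in sl2]
            y = rl + (gamma * max(Q.get((sl2, a), 0.0) for a in nxt)
                      if nxt else 0.0)
            q = Q.get((sl, vl), 0.0)
            Q[(sl, vl)] = q + alpha * (y - q)
            s = s2
    return Q
```

On a two-vertex instance whose reward favors charging one particular node first, the learned values make the greedy policy pick that node at the start of an episode.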
We implement DQN algorithms designed based on the proposed deep reinforcement learning-based charger scheduling optimization framework and compare their performance with existing algorithms on the selected charger scheduling optimization problems.
We compare the DQN algorithm designed based on the proposed framework with the quasi-polynomial time approximation algorithm (APP) introduced in Chen et al. [2016] and two other heuristic algorithms, a greedy one and a random one.

However, the computing time of the APP algorithm in Chen et al. [2016] is very sensitive to the network size n, the time step size ∆t, and the charging timespan C, and can increase dramatically to days with large n, small ∆t, or long C. To control the running time, we choose networks with moderate sizes and a small charging timespan C = 30 minutes. We compare the computing time and the number of sensors charged by these algorithms and study the impact of varying network size and charger capacity on the performance in Secs. 7.1.2 and 7.1.3, respectively. Note that we choose a low recursion level L = 3 for the APP algorithm in Chen et al. [2016] to control the running time.

We deploy sensors by uniform random distribution with the size n varying from to in a Euclidean square of size [100, ] m. A charger starts from the center of the monitored area and ends at the upper right corner. The speed of the charger is m/s. We use the random way point model to simulate the trace of a mobile sensor node with its average speed randomly chosen from [0, ] m/s.
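The random way point model used for the mobile traces can be sketched as follows; the function and parameter names are illustrative, and the square side and speed bound (elided in the extracted text) are passed in as arguments:

```python
import math
import random

def random_waypoint(start, side, v_max, steps, dt, seed=0):
    """Random way point sketch: pick a uniform waypoint in the [0, side]^2
    square and a speed in [0, v_max], walk toward it each time step of length
    dt, and draw a new waypoint and speed on arrival. Returns the positions."""
    rng = random.Random(seed)
    x, y = start
    target = (rng.uniform(0, side), rng.uniform(0, side))
    speed = rng.uniform(0, v_max)
    trace = []
    for _ in range(steps):
        dx, dy = target[0] - x, target[1] - y
        d = math.hypot(dx, dy)
        if d <= speed * dt:                  # waypoint reached: jump, redraw
            x, y = target
            target = (rng.uniform(0, side), rng.uniform(0, side))
            speed = rng.uniform(0, v_max)
        else:                                # move toward the waypoint
            x, y = x + dx / d * speed * dt, y + dy / d * speed * dt
        trace.append((x, y))
    return trace
```

Since each segment moves in a straight line between in-square points, the trace never leaves the deployment square.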
Figure 1: (a) The number of mobile sensors charged with the maximum timespan of charger C = 30 minutes and ∆t = 300 s under a varying network size n. (b) The number of mobile sensors charged with the maximum timespan of charger C = 30 minutes and n = 10 under a varying time step size.

The battery capacity B of each sensor is . KJ Chen et al. [2016]. The initial battery level of node i is randomly chosen from [0, B]. g_i[(1 − ε)α] is the time to charge node i to (1 − ε)α. The required charging level is α = 90%, ε = 0. . The average energy transfer rate is W.

Figure 1(a) and Table 1 compare the number of mobile sensors charged within the maximum timespan and the corresponding computing time of each algorithm, respectively, with ∆t = 300 s and a varying network size n. With an increased network size, more mobile sensors can be charged, but the DQN algorithm consistently charges a significantly higher number of mobile sensors than any other algorithm. At the same time, the computing time of the DQN algorithm remains stable as n increases. By contrast, the computing time of the APP algorithm in Chen et al. [2016] increases dramatically. Overall, the DQN algorithm outperforms all others.

The time step size ∆t has a big impact on the performance of the APP algorithm in Chen et al. [2016], as shown in Figure 1(b) and Table 1. The number of mobile sensors charged by the APP algorithm in Chen et al. [2016] increases with a decreased time step size ∆t, but at the cost of sky-high computing time. On the contrary, ∆t has no effect on the DQN algorithm because of the way the graph is constructed, as introduced in Sec. 6.1. The DQN algorithm again outperforms all other algorithms.

Paper Liang et al.
[2017] provides a -approximation algorithm to solve the fully charging reward maximization problem by reducing the original problem to a classical orienteering one Bansal et al. [2004]. We compare the DQN algorithm designed based on the proposed framework with the approximation algorithm (APP) in Liang et al. [2017], the Minimum Spanning Tree (MST) algorithm, the Capacitated Minimum Spanning Tree (CMST) algorithm, and a greedy one. We compare the computing time and the total energy spent on charging sensors and study the impact of varying network size and charger capacity on the performance of these algorithms in Secs. 7.2.2 and 7.2.3, respectively.

The simulation area is a Euclidean square [1000, ] m. We randomly deploy sensors with size n ranging from to . A mobile charger with energy capacity IE = 300 KJ travels with an average speed of m/s from the center of the monitored area and back to it, with an average of J/m spent on traveling Liang et al. [2017]. The battery capacity B of each sensor is . KJ Liang et al. [2017]. The initial battery level of node v_i is randomly chosen from (0, B]. A sensor node sends a charging request when its energy falls below of its capacity B.
Table 1: Computing time under different network sizes n and time step sizes ∆t.

Figure 2(a) and Table 2 compare the energy spent on sensor charging and the corresponding computing time of each algorithm, respectively, with the network size n increased from to . It is clear that the DQN algorithm spends much more energy charging sensors than any other algorithm. Considering a charger with a fixed energy capacity and a group of sensors scattered in a field, the energy spent on charging sensors decreases with an increased network size for all algorithms. However, with the increased network size, the computing time of the APP algorithm in Liang et al. [2017] increases dramatically, while that of the DQN algorithm remains stable. Overall, the DQN algorithm outperforms all other algorithms.

Figure 2(b) and Table 2 compare the energy spent on sensor charging and the corresponding computing time of each algorithm, respectively, with a varying charger capacity. With the charger capacity IE increased from KJ to KJ, the energy spent on charging sensor nodes increases for all algorithms. The DQN algorithm consistently delivers much more energy to sensors than any other algorithm. At the same time, the computing time of the DQN algorithm stays stable, while that of the APP algorithm increases dramatically. Overall, the DQN algorithm still outperforms all other algorithms.
We evaluate the DQN algorithm designed based on the proposed framework and compare it with the dynamic programming-based algorithm and three other heuristic algorithms, including an Ant Colony System (ACS)-based algorithm, a Random algorithm, and a Greedy algorithm, as explained briefly in Sec. 7.3.2. We compare the computing time to find a feasible charging path and the energy a charger spends on traveling for these algorithms. Considering that network settings may affect the performance, we study the impact of varying network size n, coverage requirement k, and remaining energy threshold α in Secs. 7.3.3, 7.3.4, and 7.3.5, respectively. Specifically, we assume a monitored area is at least k-covered initially. Sensor node v_i estimates the charging deadline D_i based on its residual energy B_i(t) and the experimental energy consumption rate in Zhu et al. [2009].

We set up a Euclidean square [500, ] m as a simulation area and randomly deploy sensor nodes with size ranging from to in the square such that the area is at least k-covered initially, where k varies from to . The sensing range r is m. The base and service station of the charger are co-located in the center of the square. A charger with a
Figure 2: (a) Energy spent on sensor charging for a mobile charger with a total energy capacity IE = 300 KJ under a varying network size n. (b) Energy spent on sensor charging for a mobile charger with a varying energy capacity IE and a fixed network size n = 120.

Table 2: Computing time under different network sizes n and charger capacities IE.
starting point from the service station has an average traveling speed of m/s and consumes J/m Liang et al. [2017]. The battery capacity B of each sensor is . KJ Liang et al. [2017]. The remaining energy threshold α varies from . to . . A sensor sends a charging request before the charger leaves the service station. The sensor includes in the request the estimated energy exhaustion time based on its current residual energy and energy consumption rate. To simulate such a request, we consider the residual battery of a sensor to be a uniform random variable B_i between (0. , . ) KJ Shi et al. [2011]. The energy exhaustion time is D_i = B_i / β_i. We choose the energy consumption rate β_i from the historical record of real sensors in Zhu et al. [2009], where the rate varies according to the remaining energy and the arrival charging time at the sensor. The energy transfer rate r_c is W Chen et al. [2016]. The discretized time step size is s.

We implement three heuristic algorithms for comparison: an Ant Colony System (ACS)-based algorithm, a Random algorithm, and a Greedy algorithm.

The ACS algorithm solves the traveling salesman problem with an approach similar to the foraging behavior of real ants Colorni et al. [1991], Dorigo and Gambardella [1997], Gutjahr [2000]. Ants seek paths from their nest to food sources and leave a chemical substance called pheromone along the paths they traverse. Later ants sense the pheromone left by earlier ones and tend to follow a trail with a stronger pheromone. Over a period of time, the shorter paths between the nest and food sources are likely to be traveled more often than the longer ones. Therefore, shorter paths accumulate more pheromone, reinforcing these paths.

Similarly, the ACS algorithm places agents at some vertices of a graph. Each agent performs a series of random moves from the current vertex to a neighboring one based on the transition probability of the connecting edge. After an agent has finished its tour, the length of the tour is calculated, and the local pheromone amounts of the edges along the tour are updated based on the quality of the tour.
After all agents have finished their tours, the shortest one is chosen, and the global pheromone amounts of the edges along that tour are updated. The procedure continues until certain criteria are satisfied. When applying the ACS algorithm to solve the traveling salesman problem with deadlines, two local heuristic functions are introduced in Cheng and Mao [2007] to exclude paths that violate the deadline constraints.

We modify the ACS algorithm introduced in Cheng and Mao [2007] for the optimal k-coverage charging problem. Agents start from and end at v_0. Denote by τ_ij(t) the amount of global pheromone deposited on edge v_i → v_j and by ∆τ_ij(t) the increased amount at the t-th iteration. ∆τ_ij(t) is defined as

    ∆τ_ij(t) = 1/L*   if v_i → v_j ∈ P*,
    ∆τ_ij(t) = 0      otherwise,    (15)

where L* is the traveling distance of the shortest feasible tour P* at the t-th iteration. The global pheromone τ_ij(t) is updated according to the following equation:

    τ_ij(t) = (1 − θ) τ_ij(t − 1) + θ ∆τ_ij(t),    (16)

where θ is the global pheromone decay parameter. The local pheromone is updated in a similar way, where θ is replaced by a local pheromone decay parameter and ∆τ_ij(t) is set to the initial pheromone value. We also modify the stop criteria of an agent such that the traveling path satisfies the requirement of k-coverage, or the traveling time of the current path is inf, or the agent is stuck at a vertex based on the transition rule.

The Random and Greedy algorithms work much more straightforwardly. The Random algorithm randomly chooses, as the next node to charge, a node not in the same clique as the existing path and with an outgoing edge from the current one. The Greedy algorithm always chooses the nearest node not in the same clique as the existing path with an outgoing edge from the current one. The Random and Greedy algorithms terminate when they either find a feasible path or are locally stuck. Note that for the ACS and Random algorithms, we always run multiple times and choose the best solution.

We set the coverage requirement k = 3 and the remaining energy threshold α = 0. for a sensor node to send a charging request. Table 3 compares the performance, including the computation time to find a feasible charging path and the energy spent on traveling, when the network size n varies from to . The energy spent on traveling decreases with an increased n since there are more redundant sensors to maintain the k-coverage requirement. Again, the DQN algorithm outperforms all other competing algorithms.
Table 3: Performance comparison under different sizes of sensor network n (k = 3, α = 0. )

Algorithm   n    Computation Time (s)   Feasible Path Found   Traveling Energy (kJ)
Dynamic     48   156700                 Yes                   771
DQN         48   83                     Yes                   771
ACS         48   92                     Yes                   951
Random      48   0.0004                 Yes                   2085
Greedy      48   0.0004                 Yes                   888
Dynamic     64   455                    Yes                   702
DQN         64   20                     Yes                   702
ACS         64   19                     Yes                   702
Random      64   0.0003                 No                    –
Greedy      64   0.0003                 Yes                   846
Dynamic     72   362                    Yes                   567
DQN         72   32                     Yes                   567
ACS         72   57                     Yes                   567
Random      72   0.0003                 Yes                   1941
Greedy      72   0.0003                 Yes                   810
Dynamic     80   268                    Yes                   345
DQN         80   20                     Yes                   345
ACS         80   18                     Yes                   345
Random      80   0.0003                 No                    –
Greedy      80   0.0003                 Yes                   375
We set the number of sensor nodes n = 64 and the remaining energy threshold α = 0. for a sensor node to send a charging request. Table 4 compares the performance, including the computation time to find a feasible charging path and the energy spent on traveling, when the coverage requirement k varies from to . The traveling energy increases with increased k because more sensor nodes need to be charged to satisfy the coverage requirement. The dynamic programming algorithm, with an exponentially increasing computing time, can only find a feasible charging path when k is small. By contrast, the computing time of the DQN algorithm, including its training time, grows slowly with the increase of k. The ACS algorithm performs better than the Random and Greedy algorithms, but overall, the DQN algorithm significantly outperforms all other competing algorithms.

Tables 5 and 6 give the performance of the different algorithms with the remaining energy threshold α varying from . to . under two network settings: n = 32 and k = 2, and n = 48 and k = 3, respectively. With an increased remaining energy threshold, the traveling energy increases in both network settings. The Random and Greedy algorithms fail to detect a feasible path in many cases. The dynamic programming algorithm runs out of memory when α is large. Both the ACS and DQN algorithms find feasible paths in all cases. However, the performance of the ACS algorithm decreases with the increase of α. The DQN algorithm consistently and significantly outperforms all other comparison algorithms in both traveling energy and computing time.

We introduce a deep reinforcement learning-based charger scheduling optimization framework. The biggest advantage of the framework is that a diverse range of domain-specific charger scheduling strategies can be learned automatically from previous experiences, i.e., different graphs with various sizes.
A framework also simplifies the complexity ofalgorithm design for individual charger scheduling optimization problem. We compare the performance of algorithmsdesigned based on the proposed framework with existing ones on a set of representative charger scheduling optimiza-17 rXiv
Template
A P
REPRINT
Table 4: Performance comparison under different coverage requirement k n = 64 , α = 0 . Algorithm k Computation Feasible TravelingTime Path Energy(s) Found (kJ)Dynamic 2 0.102 Yes 249DQN 16 Yes 249ACS 5 Yes 249Random 0.0006 Yes 249Greedy 0.0007 Yes 288Dynamic 3 455 Yes 702DQN 20 Yes 702ACS 19 Yes 702Random 0.0003 No –Greedy 0.0003 Yes 846Dynamic 4 – – –DQN 71 Yes 1089ACS 73 Yes 1188Random 0.0006 No –Greedy 0.0005 Yes 1254
Table 5: Performance comparison under different remaining energy threshold α when k = 2 , n = 32 k = 2 , n = 32 Algorithm α Computation Feasible TravelingTime Path Energy(s) Found (kJ)Dynamic 0.2 0.0012 Yes 405DQN 13 Yes 405ACS 3 Yes 405Random 0.0002 No –Greedy 0.0003 Yes 420Dynamic 0.4 9760 Yes 696DQN 26 Yes 696ACS 38 Yes 855Random 0.0003 No –Greedy 0.0003 No 891Dynamic 0.6 135742 Yes 1071DQN 116 Yes 1071ACS 232 Yes 2289Random 0.0006 No –Greedy 0.0004 No –Dynamic 0.8 – – –DQN 136 Yes 1080ACS 400 Yes 2544Random 0.002 No –Greedy 0.001 No – rXiv Template
A P
REPRINT
Table 6: Performance comparison under different remaining energy threshold α when k = 3 , n = 48 k = 3 , n = 48 Algorithm α Computation Feasible TravelingTime Path Energy(s) Found (kJ)Dynamic 0.2 0.02 Yes 348DQN 10 Yes 348ACS 1 Yes 348Random 0.0003 No –Greedy 0.0002 Yes 348Dynamic 0.4 145602 Yes 750DQN 81 Yes 750ACS 90 Yes 930Random 0.0004 Yes 2070Greedy 0.0004 Yes 870Dynamic 0.6 – – –DQN 138 Yes 1020ACS 466 Yes 1521Random 0.001 No –Greedy 0.001 No –Dynamic 0.8 – – –DQN 481 Yes 1272ACS 4200 Yes 4788Random 0.01 No –Greedy 0.02 No – tion problems. Extensive simulation and comparison results show that algorithms based on the proposed frameworkoutperform all existing ones and achieve close to optimal results.
References
Weifa Liang, Zichuan Xu, Wenzheng Xu, Jiugen Shi, Guoqiang Mao, and Sajal K. Das. Approximation algorithms for charging reward maximization in rechargeable sensor networks via a mobile charger. IEEE/ACM Transactions on Networking, 25(5):3161–3174, 2017.

Lin Chen, Shan Lin, and Hua Huang. Charge me if you can: Charging path optimization and scheduling in mobile networks. In Proceedings of the 17th ACM International Symposium on Mobile Ad Hoc Networking and Computing, pages 101–110. ACM, 2016.

Yi Shi, Liguang Xie, Y. Thomas Hou, and Hanif D. Sherali. On renewable sensor networks with wireless energy transfer. In INFOCOM, 2011 Proceedings IEEE, pages 1350–1358. IEEE, 2011.

Liguang Xie, Yi Shi, Y. Thomas Hou, Wenjing Lou, Hanif D. Sherali, and Scott F. Midkiff. On renewable sensor networks with wireless energy transfer: The multi-node case. In Sensor, Mesh and Ad Hoc Communications and Networks (SECON), 2012 9th Annual IEEE Communications Society Conference on, pages 10–18. IEEE, 2012.

Liang He, Peng Cheng, Yu Gu, Jianping Pan, Ting Zhu, and Cong Liu. Mobile-to-mobile energy replenishment in mission-critical robotic sensor networks. In INFOCOM, 2014 Proceedings IEEE, pages 1195–1203. IEEE, 2014.

Xiao Lu, Ping Wang, Dusit Niyato, Dong In Kim, and Zhu Han. Wireless charging technologies: Fundamentals, standards, and network applications. IEEE Communications Surveys & Tutorials, 18(2):1413–1452, 2015.

Xuan Li and Miao Jin. Optimal k-coverage charging problem. arXiv preprint arXiv:1901.09129, 2019.

Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.

Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, pages 6351–6361, 2017.
Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pages 2702–2711, 2016.

Yu Ma, Weifa Liang, and Wenzheng Xu. Charging utility maximization in wireless rechargeable sensor networks by charging multiple sensors simultaneously. IEEE/ACM Transactions on Networking, 26(4):1591–1604, 2018.

Wenzheng Xu, Weifa Liang, Xiaohua Jia, Haibin Kan, Yinlong Xu, and Xinming Zhang. Minimizing the maximum charging delay of multiple mobile chargers under the multi-node energy charging scheme. IEEE Transactions on Mobile Computing, 2020.

Richard S. Sutton, Andrew G. Barto, et al. Reinforcement Learning: An Introduction. MIT Press, 1998.

Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991.

Martin Riedmiller. Neural fitted Q iteration: first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pages 317–328. Springer, 2005.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Nikhil Bansal, Avrim Blum, Shuchi Chawla, and Adam Meyerson. Approximation algorithms for deadline-TSP and vehicle routing with time-windows. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 166–174. ACM, 2004.

Shuhui Yang, Fei Dai, Mihaela Cardei, Jie Wu, and Floyd Patterson. On connected multiple point coverage in wireless sensor networks. Journal of Wireless Information Networks, 2006, 2006.

Gyula Simon, Miklós Molnár, László Gönczy, and Bernard Cousin. Dependable k-coverage algorithms for sensor networks. In Proc. of IMTC, 2007.

Martin Liggins, David Hall, and James Llinas. Handbook of Multisensor Data Fusion: Theory and Practice, Second Edition. CRC Press, 2008.

Z. Zhou, S. R. Das, and H. Gupta. Variable radii connected sensor cover in sensor networks. ACM Trans. Sensor Networks, 5(1):8:1–8:36, 2009.

X. Bai, Z. Yun, D. Xuan, B. Chen, and W. Zhao. Optimal multiple-coverage of sensor networks. In
INFOCOM, 2011Proceedings IEEE , page 2498–2506. IEEE, 2011.Feng Li, Jun Luo, Wenping Wang, and Ying He. Autonomous deployment for load balancing k -surface coverage insensor networks. IEEE Transactions on Wireless Communications , 14(1):279–293, 2015.Santosh Kumar, Ten H. Lai, and József Balogh. On k-coverage in a mostly sleeping sensor network. In
Proceedingsof the 10th Annual International Conference on Mobile Computing and Networking , MobiCom ’04, pages 144–158,2004.M. Hefeeda and M. Bagheri. Randomized k-coverage algorithms for dense sensor networks. In
INFOCOM, 2007Proceedings IEEE , pages 2376–2380, 2007.Andre Kurs.
Power transfer through strongly coupled resonances . PhD thesis, Massachusetts Institute of Technology,2007.Martin WP Savelsbergh. Local search in routing problems with time windows.
Annals of Operations research , 4(1):285–305, 1985.Noga Alon, Raphael Yuster, and Uri Zwick. Color-coding.
Journal of the ACM (JACM) , 42(4):844–856, 1995.Ting Zhu, Ziguo Zhong, Yu Gu, Tian He, and Zhi-Li Zhang. Leakage-aware energy synchronization for wireless sensornetworks. In
Proceedings of the 7th International Conference on Mobile Systems, Applications, and Services , pages319–332, 2009.A. Colorni, M. Dorigo, and V. Maniezzo. Distributed optimization by ant colonies. In
Proceedings of ECAL91 -European Conference on Artificial Life , page 134–142, 1991.Marco Dorigo and Luca Maria Gambardella. Ant colony system: a cooperative learning approach to the travelingsalesman problem.
IEEE Transactions on evolutionary computation , 1(1):53–66, 1997.Walter J. Gutjahr. A graph-based ant system and its convergence.
Future Gener. Comput. Syst. , 16(9):873–888, 2000.Chi-Bin Cheng and Chun-Pin Mao. A modified ant colony system for solving the travelling salesman problem withtime windows.