AoI Minimization in Status Update Control with Energy Harvesting Sensors
Mohammad Hatami∗, Markus Leinonen∗, and Marian Codreanu†

∗Centre for Wireless Communications – Radio Technologies, University of Oulu, Finland. e-mail: [email protected], [email protected].
†Department of Science and Technology, Linköping University, Sweden. e-mail: [email protected].

Abstract
Information freshness is crucial for time-critical IoT applications, e.g., environment monitoring and control systems. We consider an IoT-based status update system with multiple users, multiple energy harvesting sensors, and a wireless edge node. The users are interested in time-sensitive information about physical quantities, each measured by a sensor. Users send requests to the edge node where a cache contains the most recently received measurements from each sensor. To serve a request, the edge node either commands the sensor to send a status update or retrieves the aged measurement from the cache. We aim at finding the best action of the edge node to minimize the age of information of the served measurements. We model this problem as a Markov decision process and develop reinforcement learning (RL) algorithms: a model-based value iteration method and a model-free Q-learning method. We also propose a Q-learning method for the realistic case where the edge node is informed about the sensors' battery levels only via the status updates. Furthermore, properties of an optimal policy are analytically characterized. Simulation results show that an optimal policy is a threshold-based policy and that the proposed RL methods significantly reduce the average cost as compared to several baseline methods.
Index terms –
Internet of Things (IoT), age of information (AoI), energy harvesting, reinforcementlearning (RL), value iteration, dynamic programming, Q-learning.
I. INTRODUCTION
Internet of Things (IoT) is an emerging technology to connect different devices and applications with minimal human intervention. IoT enables the users to effectively interact with the physical surrounding environment and empowers context-aware applications like smart cities [1]. A typical IoT network consists of multiple wireless sensors which measure physical phenomena and communicate the obtained measurements to a destination for further processing. Two inherent features of such networks are: 1) stringent energy limitations of battery-powered sensors which, however, may be counteracted by harvesting energy from environmental sources such as sun, heat, and ambient RF [2]–[4], and 2) the transient nature of data, i.e., the sensors' measurements become outdated after a while. This calls for the design of IoT sensing techniques where the sensors sample and send a minimal number of measurements to conserve energy while providing the end users highly fresh data, as required by time-sensitive applications.

The freshness of information can be quantified by the recently emerged metric, the age of information (AoI) [5]–[9]. Formally, AoI is defined as the time elapsed since the latest successfully received status update packet at the destination was generated at a source node. The works that address AoI in energy harvesting IoT networks and cache updating systems can be divided into two main classes: 1) works that focus on analyzing the AoI in a specific scenario under their proposed status update control/scheduling policies [10]–[14], and 2) works that focus on finding an optimal control/scheduling policy for a specific system model. For the latter class, there are two main approaches. The first approach involves finding an optimal policy by applying different tools from optimization theory [15]–[20]. Such approaches need exact information about the models and statistics of the environment, e.g., the energy harvesting probabilities of the sensors. The second category includes designs relying on dynamic programming and learning methods [21]–[26]. In this paper, we focus on this category and find an optimal control policy that minimizes the AoI about the sensors' measurements received by the users in an energy harvesting IoT network.

A particular interest has arisen in designing AoI-aware IoT networks [10], [11]. In [10], a threshold-based age-dependent random access algorithm has been proposed for massive IoT networks, in which an IoT device transmits a status update when its age is greater than a predefined threshold. In [11], the authors presented a stochastic geometry analysis of the average AoI for a cellular-based IoT network wherein the IoT devices can communicate in a device-to-device fashion and also send status updates to the base stations.

AoI has also been investigated in cache updating systems [15]–[17]. In [15], the authors introduced a popularity-weighted AoI metric for updating dynamic content in a local cache, where the content is subject to version updates. The authors in [16] considered a system consisting of a library of time-varying files, a server that at all times observes the current version of all files, and a cache that stores the current versions of all files but afterwards has to update these files from the server.
The aim of this work was to design an optimization-based update policy that minimizes the average AoI of all files with respect to a given popularity distribution. The authors in [17] considered a cache updating system with a source, a single cache, and a user, and found an analytical expression for the average freshness of the files at the user under their proposed threshold policy.

The works [12]–[14] focused on analyzing the AoI in energy harvesting IoT networks. The authors in [12] considered a known energy harvesting model and proposed a threshold adaptation algorithm to maximize the hit rate in an IoT sensing network. In [13], the authors analyzed the average AoI in a cache-enabled status updating system with an energy harvesting sensor that monitors a random process. In [14], the author derived a closed-form expression for the average AoI in a wireless powered sensor network.

Age-optimal policies for status update packet transmissions in energy harvesting networks have been derived in [18]–[20] by using different methods from optimization theory. In [18], age-optimal transmission policies for energy harvesting two-hop networks have been investigated. In [19], the authors explored the benefits of erasure status feedback for online timely updating for an energy harvesting sensor with a unit-sized battery. In [20], the authors derived an optimal update policy for an energy harvesting source that sends status updates to a network interface queue for delivery to a monitoring system.

Several works have tackled the problem of designing an AoI-optimal status update system by using dynamic programming and learning based methods [21]–[26]. In this line of work, the authors modeled the problem as a Markov decision process (MDP), and found an optimal policy using model-based reinforcement learning (RL) methods based on dynamic programming, e.g., the value iteration algorithm, and/or model-free RL methods, e.g., Q-learning. A comprehensive survey of RL based methods for autonomous IoT networks was presented in [27]. The authors in [21] used deep RL to solve a cache replacement problem with a limited cache size and transient data in an IoT network. In [22], the authors studied average AoI minimization in cognitive radio energy harvesting communications. In [23], deep RL was used to minimize AoI in a real-time multi-node monitoring system, in which the sensors are powered through wireless energy transfer by the destination. In [24], a real-time IoT monitoring system, in which the IoT devices sample a physical random process and send status updates to a destination, has been considered. The authors derived optimal sampling and updating policies that enable the IoT devices to minimize the average AoI at the destination. In [25], the authors studied the problem of finding an optimal device scheduling and status update sampling policy that minimizes the average AoI for a real-time IoT monitoring system with nonuniform sizes of status update packets under noisy channels. Minimizing AoI in a wireless ad hoc network via deep RL has been investigated in [26].

We consider an IoT-based status update system consisting of multiple users, multiple energy harvesting IoT sensors, and a wireless edge node. The users are interested in time-sensitive information about physical quantities, each of which is measured by a sensor. The users send their requests to the edge node, which acts as a gateway between the users and the sensors. The edge node has a cache storage which stores the most recently received measurements of each physical quantity.
To serve a user's request, the edge node can either command the corresponding sensor to sample and send a fresh measurement in the form of a status update packet, or use the available aged data in the cache. The former leads to serving a user with a fresh measurement, yet at the cost of increased energy consumption at the sensor. The latter prevents the activation of the sensors for every request so that the sensors can utilize the sleep mode to save a considerable amount of energy [12], but the data forwarded to the users becomes stale. This results in an inherent trade-off between the AoI about the physical quantities at the users and the energy consumption of the sensors.

The main objective of this paper is to find the best action of the edge node at each time slot, which is called an optimal policy, to minimize a cost function that penalizes the information staleness of the data served to the users; herein, the information staleness/freshness is quantified by the AoI. We model the problem as an MDP and propose three RL based algorithms to obtain an optimal policy. Namely, we first derive the state transition probabilities of the MDP and devise a model-based value iteration algorithm relying on dynamic programming. Then, we develop a model-free Q-learning algorithm which does not require the knowledge of the state transition probabilities. Furthermore, as a practical consideration, we propose a Q-learning method for a realistic scenario where the edge node is informed about the sensors' battery levels only via the status update packets. Consequently, the edge node does not know the exact battery level of each sensor at each time slot, but only the battery level from each sensor's last update. Moreover, structural properties of an optimal policy are analytically characterized. Simulation results show that the proposed RL algorithms – including the Q-learning method with partial battery knowledge – significantly reduce the average cost compared to several baseline methods.
A. Contributions
To summarize, the main contributions of our paper are as follows:
• We consider an IoT based status update system with multiple users, multiple energy harvesting IoT sensors, and an edge node under probabilistic models for the energy harvesting and wireless communications from the sensors to the edge node.
• We formulate a problem of finding an optimal policy to serve the users' requests so as to minimize the AoI about the physical processes at the users under energy limitations at the sensors and unreliable reception of the status updates.
• We model the considered problem as an MDP and provide the necessary definitions for the search and evaluation of an optimal policy via learning.
• We derive the state transition probabilities of the MDP and propose a model-based value iteration algorithm to find an optimal policy.
• We propose a model-free Q-learning method to search for an optimal policy, which does not require the knowledge of the state transition probabilities.
• As a practical consideration, we propose a Q-learning method for the realistic scenario where the edge node is informed about the sensors' battery levels only via the status updates.
• We derive structural properties of an optimal policy analytically and show that an optimal policy has a threshold-based structure with respect to the AoI in a specific scenario.
• Extensive numerical experiments are conducted to show that an optimal policy is a threshold-based policy and that the proposed RL algorithms significantly reduce the average cost as compared to several baseline policies.
• The proposed Q-learning algorithm relying on the inexact battery knowledge is demonstrated to be a viable solution in practice.

The most related works to this paper are [13], [18], [19], [21], [24]–[26], with the following differences to our work. The work [13] is different in that it did not aim to find an optimal policy but rather analyzed the average AoI in a cache-enabled status updating system with an energy harvesting sensor. While we use learning based methods, the works [18], [19] are based on different methods from optimization theory. Different from these lines of work, we also propose a model-free RL approach, i.e., Q-learning, in which prior knowledge about the statistics of the environment, e.g., the energy harvesting probability and the transmit success probability of the link between the sensors and the edge node, is not needed. The works [21], [24]–[26] did not consider energy limitations at the source nodes, whereas we consider energy harvesting source nodes – sensors – in which the sensors rely only on the energy harvested from the environment. Preliminary results of this paper appear in [28].
B. Organization
The paper is organized as follows. Section II presents the system model and problem definition. The Markov decision process and the definition of optimal policies are presented in Section III. Our proposed three RL-based status update control algorithms are developed in Section IV. Structural properties of an optimal policy are analytically characterized in Section V. Simulation results are presented in Section VI. Concluding remarks are drawn in Section VII.

II. SYSTEM MODEL AND PROBLEM FORMULATION
A. Network Model
We consider an IoT sensing network consisting of multiple users (data consumers), a wireless edge node, and a set K = {1, . . . , K} of K energy harvesting sensors (data producers), as depicted in Fig. 1. Users are interested in time-sensitive information about physical quantities (e.g., temperature or humidity) which are independently measured by the K sensors; formally, sensor k ∈ K measures a physical quantity f_k. We assume that there is no direct link between the users and the sensors, and the edge node acts as a gateway between them. Thus, the users' requests for the values of f_k, k ∈ K, are served (only) via the edge node.

Fig. 1: An IoT sensing network consisting of multiple users (data consumers), one wireless edge node (i.e., the gateway), and a set of K energy harvesting wireless IoT sensors (data producers). The procedure of serving a request by using fresh data is shown by green lines, and the blue lines show the procedure of serving a request by using the previous measurements already existing in the cache.

The system operates in a slotted time fashion, i.e., time is divided into slots labeled with discrete indices t ∈ N. At the beginning of slot t, users request the values of physical quantities f_k from the edge node. Formally, let r_k(t) ∈ {0, 1}, t = 1, 2, . . . , denote the random process of requesting the value of f_k at the beginning of slot t; r_k(t) = 1 if the value of f_k is requested and r_k(t) = 0 otherwise. Note that at each time slot, there can be multiple requests arriving at the edge node.

The edge node is equipped with a cache storage that stores the most recently received measurement of each physical quantity f_k. Upon receiving a request for the value of f_k at slot t (i.e., r_k(t) = 1), the edge node can either command sensor k to perform a new measurement and send a status update or use the previous measurement from the local cache, to serve the request. Let a_k(t) ∈ {0, 1} denote the command action of the edge node at slot t; a_k(t) = 1 if the edge node commands sensor k to send a status update and a_k(t) = 0 otherwise.

We assume that all the requests that arrive at the beginning of slot t are handled during the same slot t. This assumption is invoked by the following considerations. First, we assume that the edge node communicates the values to the users in an instantaneous and error-free fashion. Second, we assume that at each slot t, the edge node can command multiple sensors to send their values of f_k during the same slot t and that these command actions a_k(t), k ∈ K, are independent across k. This models the case when the sensors have independent communication channels to the edge node. At this stage, we note that while the communications between the edge node and the users are error-free, the transmissions from the sensors to the edge node are prone to errors; this channel model is detailed in Section II-C.

B. Energy Harvesting Sensors
We assume that the sensors rely on the energy harvested from the environment. Sensor k stores the harvested energy into a battery of finite size B_k (units of energy). Formally, let b_k(t) denote the battery level of sensor k at the beginning of slot t. Thus, b_k(t) ∈ {0, . . . , B_k}. In general, a status update packet contains the measured value of a monitored process and a time stamp representing the time when the sample was generated.
We consider a common assumption (see e.g., [18], [29]–[32]) that transmitting a status update from each sensor to the edge node consumes one unit of energy. Once sensor k is commanded by the edge node to send a status update (i.e., a_k(t) = 1), sensor k sends a status update if it has at least one unit of energy in its battery (i.e., b_k(t) ≥ 1). Let random variable d_k(t) ∈ {0, 1} denote the action of sensor k at slot t; d_k(t) = 1 if sensor k sends a status update to the edge node and d_k(t) = 0 otherwise. Accordingly, the relation between the action of sensor k (i.e., d_k(t)) and the command action of the edge node (i.e., a_k(t)) can be expressed as

d_k(t) = a_k(t) 𝟙{b_k(t) ≥ 1}, (1)

where 𝟙{·} is the indicator function. Note that the quantity d_k(t) in (1) also characterizes the energy consumption of sensor k at slot t.

We model the energy arrivals at the sensors as independent Bernoulli processes with intensities λ_k, k ∈ K. Let e_k(t) ∈ {0, 1}, t = 1, 2, . . . , denote the energy arrival process of sensor k. Thus, the probability that sensor k harvests one unit of energy during one time slot is λ_k, i.e., Pr{e_k(t) = 1} = λ_k, k ∈ K, t = 1, 2, . . . .

Finally, using the defined quantities b_k(t), d_k(t), and e_k(t), the evolution of the battery level of sensor k is expressed as

b_k(t + 1) = min{b_k(t) + e_k(t) − d_k(t), B_k}. (2)
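To make the per-slot energy dynamics concrete, the following minimal Python sketch implements (1) and (2) for a single sensor; the function and variable names are ours (illustrative), and the Bernoulli energy arrivals follow the model above.

```python
import random

def sensor_energy_step(b, a, lam, B_max):
    """One slot of sensor-side energy dynamics, per (1)-(2).

    b     : current battery level b_k(t), an integer in {0, ..., B_max}
    a     : edge-node command a_k(t) in {0, 1}
    lam   : energy harvesting probability lambda_k (Bernoulli arrivals)
    B_max : finite battery capacity B_k
    Returns (d, b_next): the sensor action d_k(t) and battery b_k(t+1).
    """
    d = a if b >= 1 else 0                 # (1): update only if commanded and battery non-empty
    e = 1 if random.random() < lam else 0  # Bernoulli(lambda_k) energy arrival e_k(t)
    b_next = min(b + e - d, B_max)         # (2): battery evolution, clipped at capacity
    return d, b_next

# Example: a sensor with capacity B_k = 15 and lambda_k = 0.3, commanded every slot.
b = 0
for t in range(5):
    d, b = sensor_energy_step(b, a=1, lam=0.3, B_max=15)
```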
C. Communication Between the Edge Node and the Sensors

We consider an error-free binary/single-bit command link from the edge node to each sensor [19], [33], and an error-prone wireless communication link from each sensor to the edge node, as illustrated in Fig. 2. If a sensor sends a status update packet to the edge node, the transmission through the wireless link can be either successful or failed. Let h_k(t) = 1 denote the event that a status update from sensor k has been successfully received by the edge node at slot t. Otherwise, h_k(t) = 0, which accounts for both the cases that either 1) sensor k sends a status update but the transmission fails, or 2) the sensor does not send a status update at all. Let ξ_k be the conditional probability that, given that sensor k transmits a status update, it is successfully received by the edge node, i.e., Pr{h_k(t) = 1 | d_k(t) = 1} = ξ_k, k ∈ K, t = 1, 2, . . . . Thus, ξ_k represents the transmit success probability of the link from sensor k to the edge node.

Fig. 2: The link between each sensor and the edge node consists of an error-free binary command link from the edge node to each sensor and an error-prone wireless communication link from each sensor to the edge node.
D. Age of Information

Age of information (AoI) is a destination-centric metric that quantifies the freshness of information of a remotely observed random process [5]–[7]. Formally, let ∆_k(t) be the AoI about the physical quantity f_k at the edge node at the beginning of slot t, i.e., the number of time slots elapsed since the generation of the most recently received status update packet from sensor k. Let u_k(t) denote the most recent time slot in which the edge node received a status update packet from sensor k, i.e., u_k(t) = max{t′ | t′ < t, h_k(t′) = 1}; thus, the AoI about f_k can be written as the random process ∆_k(t) = t − u_k(t). We make a common assumption (see e.g., [22]–[25]) that ∆_k(t) is upper-bounded by a finite value ∆_k,max, i.e., ∆_k(t) ∈ {1, 2, . . . , ∆_k,max}. This is reasonable, because after ∆_k(t) reaches a high value ∆_k,max, the available measurement about physical process f_k becomes excessively stale/expired, so further counting would be irrelevant.

At each time slot, the AoI either drops to one if the edge node receives a status update from the corresponding sensor, or increases by one otherwise. Accordingly, the evolution of ∆_k(t) can be written as

∆_k(t + 1) = 1, if h_k(t) = 1; min{∆_k(t) + 1, ∆_k,max}, if h_k(t) = 0, (3)

which can be expressed compactly as ∆_k(t + 1) = min{(1 − h_k(t))∆_k(t) + 1, ∆_k,max}.
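As a small illustration of (3), the Python sketch below updates the cached AoI for one sensor; the function names and the choice of ∆_k,max are ours, for illustration only.

```python
def aoi_step(delta, h, delta_max):
    """AoI evolution per (3): reset to 1 on a received update, otherwise age by one slot.

    delta     : current AoI Delta_k(t), an integer in {1, ..., delta_max}
    h         : reception indicator h_k(t) in {0, 1}
    delta_max : finite AoI cap Delta_k,max
    """
    if h == 1:
        return 1
    return min(delta + 1, delta_max)

# Equivalent compact form, mirroring the closed-form expression after (3):
def aoi_step_compact(delta, h, delta_max):
    return min((1 - h) * delta + 1, delta_max)
```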
E. Cost Function and Problem Formulation

We consider a cost function that penalizes the information staleness of the requested measurements received by the users. We define the per-sensor immediate cost at slot t as

c_k(t) = r_k(t) β_k ∆_k(t + 1), (4)

where β_k is a pre-defined weight parameter accounting for the importance of the freshness of physical quantity f_k, and ∆_k(t + 1) is the AoI defined in (3). Note that when the value of f_k is not requested at slot t, i.e., r_k(t) = 0, the immediate cost becomes c_k(t) = 0, as desired. Moreover, since the requests for the values of the physical quantities come at the beginning of slot t and the edge node sends the values to the users at the end of the same slot, ∆_k(t + 1) is the effective AoI about f_k seen by the users.

The objective of our work is as follows. We aim to find the best action of the edge node at each time slot, i.e., a_k(t), t = 1, 2, . . . , k ∈ K, called an optimal policy, that minimizes the long-term average cost, defined as

C̄ = lim_{T→∞} (1/T) Σ_{t=1}^{T} Σ_{k=1}^{K} c_k(t). (5)

In order to shed light on the search for such an optimal policy, we next present several points regarding the problem structure. First, recall from Section II-A that in order to serve the requests for the value of f_k at slot t (i.e., r_k(t) = 1), the edge node can either command sensor k to send a status update, i.e., a_k(t) = 1, or use the available data in the cache, i.e., a_k(t) = 0. The former action (i.e., a_k(t) = 1), depending on the battery of sensor k and the condition of the communication link between sensor k and the edge node, may lead to having a fresh measurement (i.e., the AoI drops to one, ∆_k(t + 1) = 1, minimizing the immediate cost c_k(t) in (4)), yet at the cost of consuming one unit of energy from the battery of sensor k. On the other hand, the latter action (i.e., a_k(t) = 0) provides energy saving at the cost of serving the requests by stale data. This introduces an inherent trade-off between (myopically) minimizing the immediate cost and saving energy for the possible future requests to minimize the cost in the long run.

Second, it is easy to verify that if there are no requests for the value of f_k at slot t (i.e., r_k(t) = 0), the optimal action a_k(t) that minimizes the long-term average cost (5) is a_k(t) = 0. In this case, the immediate cost (4) becomes zero (i.e., c_k(t) = 0), and furthermore, the command action a_k(t) = 0 implies d_k(t) = 0 as per (1), leading to energy saving for sensor k. Therefore, the search for an optimal policy boils down to finding the optimal actions a_k(t) for the cases with r_k(t) = 1.
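As a quick worked example of (4), with hypothetical numbers: suppose β_k = 1 and the value of f_k is requested at slot t (r_k(t) = 1). If the request is served from the cache and the cached measurement ends the slot four slots old, i.e., ∆_k(t + 1) = 4, then c_k(t) = 1 · 1 · 4 = 4. If instead a commanded update is successfully received, the AoI drops to ∆_k(t + 1) = 1 and c_k(t) = 1; and if f_k is not requested (r_k(t) = 0), then c_k(t) = 0 regardless of the AoI.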
Remark 1. As described in Section II-A, the command action of the edge node for a given sensor does not affect the decisions for the others, i.e., the actions a_k(t) at any slot t are independent across sensors k ∈ K. Thus, the problem of finding the optimal actions a_k(t), k ∈ K, that minimize (5) is separable across sensors k ∈ K.

Based on Remark 1, we express the cost in (5) equivalently as

C̄ = Σ_{k=1}^{K} C̄_k, (6)

where C̄_k is the long-term average cost associated with sensor k, i.e., the per-sensor long-term average cost, defined as

C̄_k = lim_{T→∞} (1/T) Σ_{t=1}^{T} c_k(t), k = 1, . . . , K. (7)

Thus, minimizing the system-wise cost in (5) reduces to minimizing the K per-sensor long-term average costs in (7). This will be a key factor in developing our reinforcement learning (RL) algorithms in Section IV. Prior to this, in Section III, we model the considered problem as a Markov decision process (MDP) and give definitions of optimal policies, which are needed in our algorithm development.
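Because the objective (5) decomposes into the per-sensor costs (7), each C̄_k can be estimated by simulating one sensor in isolation. The Python sketch below does this for the simple rule "command whenever the quantity is requested" (the greedy baseline considered later in Section VI); all names and parameter values here are illustrative assumptions, not taken from the paper's experiments.

```python
import random

def estimate_avg_cost(lam, xi, p_req, B_max, delta_max, beta=1.0, T=200_000, seed=0):
    """Monte Carlo estimate of the per-sensor long-term average cost (7)
    for the policy a_k(t) = r_k(t) (command whenever the quantity is requested)."""
    rng = random.Random(seed)
    b, delta = 0, 1                                   # battery b_k and cached AoI Delta_k
    total_cost = 0.0
    for _ in range(T):
        r = 1 if rng.random() < p_req else 0          # request r_k(t)
        a = r                                         # policy under evaluation
        d = a if b >= 1 else 0                        # sensor action (1)
        h = 1 if d == 1 and rng.random() < xi else 0  # channel success with prob. xi_k
        e = 1 if rng.random() < lam else 0            # Bernoulli energy arrival
        b = min(b + e - d, B_max)                     # battery evolution (2)
        delta = 1 if h == 1 else min(delta + 1, delta_max)  # AoI evolution (3)
        total_cost += r * beta * delta                # immediate cost (4): r_k(t) beta_k Delta_k(t+1)
    return total_cost / T                             # empirical average, cf. (7)

# Illustrative parameters (not the paper's simulation setup):
print(estimate_avg_cost(lam=0.3, xi=0.8, p_req=0.5, B_max=15, delta_max=64))
```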
III. MARKOV DECISION PROCESS AND OPTIMAL POLICIES

As discussed in Section II-E, the problem of finding an optimal policy that minimizes the long-term cost in (5) is separable across the sensors. Thus, we present the derivation of such an optimal policy for a particular sensor k but, clearly, the derivations are valid for any sensor k ∈ K; the edge node runs in parallel one policy for each sensor in the network. First, we model the problem as an MDP. Then, we give a formal definition of an optimal policy, followed by introducing the key quantities needed to evaluate and search for such an optimal policy. All these serve as preliminaries for the development of our RL-based algorithms in Section IV.

A. MDP Modeling
The MDP model associated with sensor k is defined by the tuple {S_k, A_k, P_k(s_k(t+1) | s_k(t), a_k(t)), c_k(s_k(t), a_k(t)), γ}, where
• S_k is the state set. Let s_k(t) ∈ S_k denote the state at slot t, which is defined as s_k(t) = {b_k(t), ∆_k(t)}, where 1) b_k(t) is the battery level of sensor k given by (2), i.e., b_k(t) ∈ {0, 1, . . . , B_k}, and 2) ∆_k(t) is the AoI about the physical quantity f_k in the local cache, i.e., ∆_k(t) ∈ {1, 2, . . . , ∆_k,max}.
• A_k = {0, 1} is the action set. The action selected by the edge node at slot t is denoted by a_k(t) ∈ A_k (see Section II-A).
• P_k(s_k(t+1) | s_k(t), a_k(t)) is the state transition probability that maps a state-action pair at slot t onto a distribution of states at slot t + 1.
• c_k(s_k(t), a_k(t)) is the immediate cost function, i.e., the cost of taking action a_k(t) in state s_k(t), which is also denoted simply by c_k(t), and is calculated using (4).
• γ ∈ [0, 1) is a discount factor used to weight the immediate cost relative to the future costs.

B. Optimal Policy
In an MDP environment, the immediate and long-term costs that the agent – the edge node in our model – expects to receive depend on the actions the edge node takes at each time slot, which are selected based on a policy. Intuitively, a policy π_k defines the edge node's action selection in any given state. Generally, policies can be stochastic or deterministic [34, Sect. 1.3]. A stochastic policy π_k = π_k(a | s) : S_k × A_k → [0, 1] is defined as a mapping from state s ∈ S_k to a probability of choosing each possible action a ∈ A_k. A deterministic policy is a special case of the stochastic policy where in each state s ∈ S_k, π_k(a | s) = 1 for some a ∈ A_k.

The discounted long-term accumulated cost is defined as

C_k(t) = Σ_{τ=0}^{∞} γ^τ c_k(t + τ), (8)

where c_k(·) is the immediate cost calculated using (4). Our goal is to find an optimal policy π*_k that minimizes the expected long-term cost in (8), defined as

π*_k = arg min_{π_k} E_{π_k}[C_k(t) | π_k], (9)

where E_{π_k}[·] denotes the expected value of C_k(t) given that the edge node follows policy π_k. Herein, we use the same notation π_k for both stochastic and deterministic policies.

Having defined an optimal policy, we now present essential definitions as a means to search for such an optimal policy. These serve as a basis for our algorithms developed in Section IV.

C. State-Value and Action-Value Functions
In order to evaluate policies and search for an optimal policy π*_k, we define the state-value and action-value functions. The state-value function specifies how beneficial it is for the edge node to be in a particular state under a policy π_k. Formally, the state-value function of state s ∈ S_k under a policy π_k, denoted by v_{π_k}(s), is the expected long-term cost when starting in state s and following the policy π_k thereafter, and it can be written as

v_{π_k}(s) ≜ E_{π_k}[C_k(t) | s_k(t) = s], ∀s ∈ S_k. (10)

The action-value function specifies how beneficial it is for the edge node to perform a particular action in a state under a policy π_k. Formally, the action-value function, denoted by q_{π_k}(s, a), is the expected long-term cost for taking an action a ∈ A_k in state s ∈ S_k and thereafter following the policy π_k, and it can be written as

q_{π_k}(s, a) ≜ E_{π_k}[C_k(t) | s_k(t) = s, a_k(t) = a], ∀s ∈ S_k, a ∈ A_k. (11)

Value functions define a partial ordering over policies. More precisely, a policy π_k is defined to be better than or equal to a policy π′_k (i.e., π_k ≥ π′_k) if and only if v_{π_k}(s) ≤ v_{π′_k}(s) for all s ∈ S_k [34, Sect. 3.6]. Therefore, an optimal policy π*_k, which is better than or equal to all other policies, minimizes the state-value function for all states. Although there may be more than one optimal policy, they all achieve the same state-value function, called the optimal state-value function, denoted by v*_k(s), which is expressed as

v*_k(s) ≜ min_{π_k} v_{π_k}(s), ∀s ∈ S_k. (12)

Note that optimal policies also share the same action-value function, called the optimal action-value function. More precisely, the optimal action-value function for state s ∈ S_k and action a ∈ A_k is denoted by q*_k(s, a) and is defined as

q*_k(s, a) ≜ min_{π_k} q_{π_k}(s, a), ∀s ∈ S_k, a ∈ A_k. (13)

The optimal action-value function q*_k(s, a) represents the minimum expected long-term cost that the edge node is going to get if it is in state s, takes action a, and follows an optimal policy π*_k from there onwards. Accordingly, an optimal deterministic policy π*_k can be obtained by choosing the action a that minimizes q*_k(s, a) in each state s, which can be expressed as

π*_k(a | s) = 1, if a = arg min_{a ∈ A_k} q*_k(s, a); 0, otherwise, ∀s ∈ S_k. (14)

According to (14), the knowledge of the optimal action-value function q*_k(s, a) suffices to find an optimal policy π*_k. Also, an optimal policy π*_k can be found via the optimal state-value function v*_k(s), provided that the state transition probabilities are known. In this case, we first find the optimal action-value function q*_k(s, a), given that v*_k(s) is available for all the states, and then find an optimal policy using (14). More precisely, under an optimal policy π*_k, for any state s ∈ S_k and its possible successor states s′ ∈ S_k, the relationship between the optimal state-value and action-value functions can be derived as

q*_k(s, a) = E_{π*_k}[Σ_{τ=0}^{∞} γ^τ c_k(t + τ) | s_k(t) = s, a_k(t) = a]
          = E_{π*_k}[c_k(t) + γ C_k(t + 1) | s_k(t) = s, a_k(t) = a]
          = E_{π*_k}[c_k(t) + γ v*_k(s_k(t + 1)) | s_k(t) = s, a_k(t) = a]
          = Σ_{s′ ∈ S_k} P_k(s′ | s, a)[c_k(s, a) + γ v*_k(s′)], ∀s ∈ S_k, ∀a ∈ A_k. (15)

In summary, one can find an optimal policy if either 1) the optimal action-value function q*_k(s, a) is available, or 2) the optimal state-value function v*_k(s) and the state transition probabilities P_k(s′ | s, a) are available. We next discuss how to find v*_k(s) and q*_k(s, a).

A fundamental property of the optimal state-value and action-value functions is that they satisfy particular recursive relationships, called Bellman optimality equations, which can be used to find the optimal state-value and action-value functions [34, Sect. 3.5]. Formally, under an optimal policy π*_k, the recursive relationship between the optimal state-value function of state s, v*_k(s), and the optimal state-value function of its possible successor state s′, v*_k(s′), is given by

v*_k(s) = min_{a ∈ A_k} q*_k(s, a) = min_{a ∈ A_k} Σ_{s′ ∈ S_k} P_k(s′ | s, a)[c_k(s, a) + γ v*_k(s′)], ∀s ∈ S_k. (16)

The recursive equation in (16) is called the Bellman optimality equation for the optimal state-value function v*_k(s). It expresses the fact that the value of a state under an optimal policy must equal the expected long-term cost for the best action in that state.

Assuming the availability of the state transition probabilities P_k(s′ | s, a), the Bellman optimality equation in (16) can be used to estimate the optimal state-value function recursively; this is the basis for our proposed value iteration algorithm developed in Section IV-A. Similar to (16), the Bellman optimality equation for the optimal action-value function q*_k(s, a) is expressed as

q*_k(s, a) = Σ_{s′ ∈ S_k} P_k(s′ | s, a)[c_k(s, a) + γ min_{a′ ∈ A_k} q*_k(s′, a′)], ∀s ∈ S_k, a ∈ A_k. (17)

The Bellman optimality equation in (17) is the basis for our proposed Q-learning algorithms devised in Section IV-B and Section IV-C.

IV. REINFORCEMENT LEARNING BASED STATUS UPDATE CONTROL ALGORITHMS
In this section, we develop three RL-based status update control algorithms for the considered IoT network. The algorithms fall into two main categories: model-free RL and model-based RL. For the MDP model described in Section III-A, we first develop a model-based value iteration algorithm relying on dynamic programming in Section IV-A, and then in Section IV-B, we propose a model-free Q-learning algorithm. As a practical consideration, in Section IV-C, we redefine the presented state definition of the MDP and propose a Q-learning method for the scenario where the edge node is informed of the sensors' battery levels only via the status update packets. As a key advantage, the proposed algorithms are simple with low implementation complexity, which is an important point in practice.

A. Value Iteration Algorithm

Value iteration is a model-based RL method that finds the optimal state-value function v*_k(s), and consequently, an optimal policy π*_k, by turning the Bellman optimality equation (16) into an iterative update procedure [34, Section 4.4].
1) Derivation of the State Transition Probabilities:
In order to apply (16), the value iteration requires the knowledge of the state transition probabilities of the MDP (see Section III-A). These are derived in the following. In the considered system model, for a given action a_k(t), the state transition probabilities are functions of both the energy harvesting rate λ_k and the transmit success probability ξ_k, which were defined in Sections II-B and II-C, respectively. The probability of transition from state s_k(t) to state s_k(t + 1) under action a_k(t) is given by

P_k(s_k(t+1) | s_k(t) = {b_k(t) < B_k, ∆_k(t)}, a_k(t) = 0) =
  λ_k,      if s_k(t+1) = {b_k(t+1) = b_k(t) + 1, ∆_k(t+1) = min{∆_k(t) + 1, ∆_k,max}};
  1 − λ_k,  if s_k(t+1) = {b_k(t+1) = b_k(t), ∆_k(t+1) = min{∆_k(t) + 1, ∆_k,max}};
  0,        otherwise. (18a)

P_k(s_k(t+1) | s_k(t) = {b_k(t) = B_k, ∆_k(t)}, a_k(t) = 0) =
  1,        if s_k(t+1) = {b_k(t+1) = B_k, ∆_k(t+1) = min{∆_k(t) + 1, ∆_k,max}};
  0,        otherwise. (18b)

P_k(s_k(t+1) | s_k(t) = {b_k(t) = 0, ∆_k(t)}, a_k(t) = 1) =
  λ_k,      if s_k(t+1) = {b_k(t+1) = 1, ∆_k(t+1) = min{∆_k(t) + 1, ∆_k,max}};
  1 − λ_k,  if s_k(t+1) = {b_k(t+1) = 0, ∆_k(t+1) = min{∆_k(t) + 1, ∆_k,max}};
  0,        otherwise. (18c)

P_k(s_k(t+1) | s_k(t) = {b_k(t) > 0, ∆_k(t)}, a_k(t) = 1) =
  λ_k ξ_k,             if s_k(t+1) = {b_k(t+1) = b_k(t), ∆_k(t+1) = 1};
  λ_k (1 − ξ_k),       if s_k(t+1) = {b_k(t+1) = b_k(t), ∆_k(t+1) = min{∆_k(t) + 1, ∆_k,max}};
  (1 − λ_k) ξ_k,       if s_k(t+1) = {b_k(t+1) = b_k(t) − 1, ∆_k(t+1) = 1};
  (1 − λ_k)(1 − ξ_k),  if s_k(t+1) = {b_k(t+1) = b_k(t) − 1, ∆_k(t+1) = min{∆_k(t) + 1, ∆_k,max}};
  0,                   otherwise. (18d)

In brief, the first three expressions (18a)–(18c) correspond to cases where sensor k does not send a status update, whereas in (18d) sensor k sends a status update. These cases are detailed in the following.

The first case (18a) corresponds to the situation in which the edge node does not command sensor k (i.e., a_k(t) = 0), and thus, the sensor does not send a status update. The second case (18b) is similar in that a_k(t) = 0, but differently from (18a), the battery of sensor k is full and thus, there is no room left for a possible harvested energy unit. In the third case (18c), sensor k is commanded to send a status update, but since its battery is empty (i.e., b_k(t) = 0), no update takes place. Since there is no update in any of the three cases (18a)–(18c), the AoI about the physical quantity f_k in the local cache increases by one. Moreover, in cases (18a) and (18c), a possible harvested energy unit increases the battery level of sensor k by one. The fourth case (18d) stands for the case in which the edge node commands sensor k to send a status update and sensor k has at least one unit of energy in its battery. In this case, sensor k sends the status update, consuming one unit of energy. Here, four possible events can occur, depending on the success of the transmission attempt and the energy arrivals. Namely, the transmitted status update is prone to a transmission failure, reaching the edge node with probability ξ_k.
Also, sensor k has a chance to harvest one unit of energy, which occurs with probability λ_k.
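The case analysis in (18a)–(18d) translates directly into a small lookup routine. The following Python sketch (with our own illustrative naming) returns the distribution over next states {b′, ∆′} for a given state-action pair.

```python
def transition_probs(b, delta, a, lam, xi, B_max, delta_max):
    """Return {(b_next, delta_next): probability} implementing (18a)-(18d).

    b, delta : current state s_k(t) = {b_k(t), Delta_k(t)}
    a        : edge-node command a_k(t) in {0, 1}
    lam, xi  : energy harvesting and transmit success probabilities
    """
    aged = min(delta + 1, delta_max)          # AoI if no update is received
    if a == 0 and b < B_max:                  # (18a): no command, battery not full
        return {(b + 1, aged): lam, (b, aged): 1 - lam}
    if a == 0 and b == B_max:                 # (18b): no command, battery full
        return {(B_max, aged): 1.0}
    if a == 1 and b == 0:                     # (18c): command, but battery empty
        return {(1, aged): lam, (0, aged): 1 - lam}
    # (18d): command and b >= 1 -> update sent, one energy unit consumed
    return {
        (b, 1): lam * xi,                     # harvested and update received
        (b, aged): lam * (1 - xi),            # harvested, transmission failed
        (b - 1, 1): (1 - lam) * xi,           # not harvested, update received
        (b - 1, aged): (1 - lam) * (1 - xi),  # not harvested, transmission failed
    }
```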
2) Algorithm Summary:
Having defined the state transition probabilities above, we now employ the Bellman optimality equation (16) and set up an iterative update procedure, the value iteration algorithm, to find an optimal policy π*_k. The proposed value iteration algorithm is presented in Algorithm 1. Next, we detail the algorithm steps.

The algorithm consists of four main stages: 1) start with an arbitrary initial approximation for the optimal state-value function, e.g., v*_k(s) = 0, ∀s ∈ S_k, 2) in each iteration, update the estimated value of v*_k(s), ∀s ∈ S_k, 3) stop when the maximum difference in v*_k(s) between two consecutive iterations is below a pre-defined threshold θ, and 4) determine an optimal deterministic policy π*_k(a | s) by using (15) and (14).

Algorithm 1  Value iteration algorithm for estimating the optimal state-value function
  Initialize v*_k(s) = 0, k ∈ K, ∀s ∈ S_k, and determine a small threshold θ > 0.
  for k = 1, . . . , K do
    repeat  {Update v*_k(s)}
      δ = 0  {For stopping criterion}
      for s ∈ S_k do
        ν = v*_k(s)
        v*_k(s) = min_{a ∈ A_k} Σ_{s′ ∈ S_k} P_k(s′ | s, a)[c_k(s, a) + γ v*_k(s′)]
        δ = max{δ, |ν − v*_k(s)|}  {Maximum deviation between the iterations}
      end for
    until δ < θ
  end for
  for k = 1, . . . , K do
    for s ∈ S_k do
      Output a deterministic policy π*_k(a | s) such that
        π*_k(a | s) = 1, if a = arg min_{a ∈ A_k} Σ_{s′ ∈ S_k} P_k(s′ | s, a)[c_k(s, a) + γ v*_k(s′)]; 0, otherwise
    end for
  end for

In the value iteration algorithm, it is assumed that the state transition probabilities are known in advance. According to (18), in order to calculate the state transition probabilities P_k(s′ | s, a), the probabilistic model of the environment, i.e., the energy harvesting probability λ_k and the transmit success probability ξ_k, is assumed to be known, which is not always available in practice. For the case in which the state transition probabilities are unknown, we use a model-free RL algorithm to find an optimal policy. This is carried out in the next subsections.
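A compact Python rendering of Algorithm 1 for a single sensor is sketched below. It reuses the transition_probs helper sketched after (18) and, purely for illustration, treats every slot as requested (r_k(t) = 1) so that the expected immediate cost of a transition is β_k ∆_k(t + 1); the helper names and default parameter values are ours, not the paper's.

```python
def value_iteration(lam, xi, B_max, delta_max, beta=1.0, gamma=0.9, theta=1e-4):
    """Algorithm 1 (per sensor): estimate v*_k and extract a deterministic policy."""
    states = [(b, d) for b in range(B_max + 1) for d in range(1, delta_max + 1)]
    v = {s: 0.0 for s in states}

    def q(s, a):
        # One Bellman backup term: sum_{s'} P(s'|s,a) [ beta * Delta' + gamma * v(s') ],
        # assuming the slot is requested so that the cost (4) is beta_k * Delta_k(t+1).
        return sum(p * (beta * s_next[1] + gamma * v[s_next])
                   for s_next, p in transition_probs(*s, a, lam, xi, B_max, delta_max).items())

    while True:                              # value-update sweeps
        max_dev = 0.0
        for s in states:
            old = v[s]
            v[s] = min(q(s, 0), q(s, 1))
            max_dev = max(max_dev, abs(old - v[s]))
        if max_dev < theta:                  # stopping criterion
            break

    policy = {s: (0 if q(s, 0) <= q(s, 1) else 1) for s in states}  # greedy extraction via (14)-(15)
    return v, policy
```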
B. Q-learning Algorithm

Q-learning is an online model-free RL algorithm that estimates/learns the optimal action-value functions by experience and finds an optimal policy iteratively. The main difference to the value iteration algorithm in Section IV-A is that Q-learning does not require the knowledge of the state transition probabilities P_k(s′ | s, a).

In the Q-learning method, the estimated action-value function for sensor k, denoted as Q_k(s, a), s ∈ S_k, a ∈ A_k, directly approximates the optimal action-value function q*_k(s, a) in (13) [34, Sect. 6.5]. The convergence Q_k → q*_k requires that all state-action pairs continue to be updated. To satisfy this condition, a typical approach is to use the "exploration-exploitation" technique in the action selection. The ε-greedy algorithm is one such method that trades off exploration and exploitation [34, Sect. 6.5]. Intuitively, exploration is finding more information about the environment, while exploitation is exploiting known information to minimize the long-term cost.

Our proposed Q-learning algorithm is presented in Algorithm 2. To allow exploration-exploitation, the edge node takes either a random or greedy action at slot t; the probability of taking a random action is denoted by ε(t), and thus, the probability of exploiting the greedy action a_k(t) = arg min_{a ∈ A_k} Q_k(s_k(t), a) is 1 − ε(t). Generally, during the initial iterations, it is better to set ε(t) high in order to learn the underlying dynamics, i.e., to allow more exploration. On the other hand, in stationary settings and once enough observations are made, small values of ε(t) become preferable to increase the tendency to exploitation.

As shown in the Q-table update step of Algorithm 2, at each slot/iteration, the value of the Q-function of the current state is updated based on the action taken and the resulting next state, where α(t) represents the learning rate at slot t.

Algorithm 2  Online status update control algorithm via Q-learning
  Initialize Q_k(s, a) = 0, ∀s ∈ S_k, a ∈ A_k, k ∈ K
  for each slot t = 1, 2, 3, . . . do
    for k = 1, . . . , K do
      if r_k(t) = 0 then
        a_k(t) = 0
      else
        a_k(t) is chosen according to the following probability:
          a_k(t) = arg min_{a ∈ A_k} Q_k(s_k(t), a), w.p. 1 − ε(t); a random action a ∈ A_k, w.p. ε(t)
        if a_k(t) = 1 then
          Command sensor k to send a status update packet
          if b_k(t) > 0 then
            d_k(t) = 1
          else
            d_k(t) = 0
          end if
        else
          d_k(t) = 0
        end if
      end if
      Update AoI according to (3) and calculate c_k(t)
    end for
    Wait for the next requests and compute s_k(t + 1), ∀k ∈ K
    for k = 1, . . . , K do
      Update the Q-table: Q_k(s_k(t), a_k(t)) ← (1 − α(t)) Q_k(s_k(t), a_k(t)) + α(t) (c_k(t) + γ min_{a ∈ A_k} Q_k(s_k(t + 1), a))
    end for
  end for
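For concreteness, the sketch below implements the per-sensor ε-greedy action selection and the tabular Q-update of Algorithm 2 in Python, driven by a simulated environment step that follows the dynamics of Section II; the function names, the exploration schedule, and all numeric values are illustrative assumptions, not the paper's exact settings.

```python
import random
from collections import defaultdict

def env_step(b, delta, a, lam, xi, B_max, delta_max, rng):
    """Simulated sensor/channel step following (1)-(3)."""
    d = a if b >= 1 else 0
    h = 1 if d == 1 and rng.random() < xi else 0
    e = 1 if rng.random() < lam else 0
    b_next = min(b + e - d, B_max)
    delta_next = 1 if h == 1 else min(delta + 1, delta_max)
    return b_next, delta_next

def q_learning(lam=0.3, xi=0.8, p_req=0.5, B_max=15, delta_max=64,
               beta=1.0, gamma=0.9, alpha=0.1, T=500_000, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                   # Q[(state, action)], state = (b_k, Delta_k)
    b, delta = 0, 1
    for t in range(1, T + 1):
        r = 1 if rng.random() < p_req else 0
        s = (b, delta)
        if r == 0:
            a = 0                            # never command when nothing is requested
        else:
            eps = max(0.02, 1.0 / t**0.5)    # illustrative exploration schedule
            if rng.random() < eps:
                a = rng.choice((0, 1))       # explore
            else:
                a = 0 if Q[(s, 0)] <= Q[(s, 1)] else 1   # exploit (greedy action)
        b, delta = env_step(b, delta, a, lam, xi, B_max, delta_max, rng)
        c = r * beta * delta                 # immediate cost (4), with Delta_k(t+1)
        s_next = (b, delta)
        target = c + gamma * min(Q[(s_next, 0)], Q[(s_next, 1)])
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target   # Q-table update of Algorithm 2
    return Q
```

In the partial-battery variant of Section IV-C, the same selection and update rules apply unchanged, with the battery component of the state replaced by the last reported level b̃_k(t).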
C. Q-Learning Algorithm with Partial Battery Knowledge

In Section III-A, we modeled the state of the MDP as s_k(t) = {b_k(t), ∆_k(t)}. Consequently, both the proposed value iteration algorithm in Section IV-A and the Q-learning algorithm in Section IV-B rely on the assumption that the edge node knows the exact battery levels of the sensors at each time slot. This requires extra coordination between the edge node and the sensors at each time slot, which may not always be feasible. In this section, we consider a realistic environment where the edge node is informed about the battery levels of the sensors only via the status update packets. Consequently, the edge node has only partial knowledge about the battery levels at each time slot.

Since we consider a case where the edge node is informed about the battery levels of the sensors only via the status update packets, we need to modify the state definition of the MDP accordingly. A status update packet generated at the beginning of slot t consists of the value of physical quantity f_k, the battery level of sensor k (i.e., b_k(t)), and the timestamp t when the sample was generated. Let b̃_k(t) denote the knowledge about the battery level of sensor k at the edge node at time slot t. Formally, b̃_k(t) = b_k(u_k(t)), where u_k(t) represents the most
recent time slot in which the edge node received a status update packet from sensor k, i.e., u_k(t) = max{t′ | t′ < t, h_k(t′) = 1} (see Section II-D). In other words, at time slot t, b̃_k(t) describes what the battery level of sensor k was at the beginning of the most recent time slot at which the edge node received a status update from sensor k. To conclude, the edge node does not know the exact battery level of the sensors at each time slot, but it only has the partial/outdated knowledge based on each sensor's last update.

Based on the discussions above, we modify the state definition of the MDP defined in Section III-A as s_k(t) = {b̃_k(t), ∆_k(t)}. Thus, as compared to the setting with exact battery knowledge, the state contains b̃_k(t) instead of b_k(t). However, with this state definition, it is impossible to calculate the state transition probabilities and use the value iteration algorithm. In particular, the underlying decision process is non-Markovian (i.e., not an MDP), caused by the uncertainty that exists in the wireless channel. For better clarification, consider state s_k(t) = {b̃_k(t), ∆_k(t)} and action a_k(t) = 0; the next state is s_k(t + 1) = {b̃_k(t), min{∆_k(t) + 1, ∆_k,max}} with probability one. However, given s_k(t) and a_k(t) = 1, it is impossible to calculate the state transition probabilities without knowing the actions taken by the edge node during the last ∆_k(t) − 1 slots, i.e., a_k(t − ∆_k(t)), . . . , a_k(t − 1). This is because the energy consumed by the sensor is unknown during these ∆_k(t) − 1 slots (in which, by definition, no update has been received); at each such slot, three indistinguishable cases might have happened: 1) the edge node commanded the sensor, but the transmission failed, or 2) the edge node commanded the sensor and it could not send a status update because its battery was empty, or 3) the edge node did not command the sensor. While the first case consumes one unit of energy from the battery of the sensor, the second and third cases do not. This means that in order to model the underlying decision process as an MDP and be able to calculate the state transition probabilities, the exact actions taken by the edge node during the last ∆_k(t) − 1 slots must be included in the state definition. More precisely, at slot t, the state would be defined as s_k(t) = {b̃_k(t), ∆_k(t), a_k(t − ∆_k(t)), . . . , a_k(t − 1)}.
This, however, makes the state space grow exponentially in terms of ∆_k(t).

Despite the aforementioned non-Markovian property of the decision process, we apply the Q-learning presented in Algorithm 2 for the partial battery knowledge case with state definition s_k(t) = {b̃_k(t), ∆_k(t)}. Recall that the Q-learning algorithm does not need any prior knowledge about the state transition probabilities. We will assess the performance of this Q-learning method via simulations in Section VI, and show that it achieves considerable performance gains compared to several baseline methods.

V. STRUCTURAL PROPERTIES OF AN OPTIMAL POLICY
In this section, we analyze the properties of an optimal policy defined in (9). We first prove that the optimal state-value function has monotonic properties. Then, we exploit this monotonicity to prove that an optimal policy has a threshold-based structure with respect to the AoI for the case where the link from sensor k to the edge node is perfect, i.e., ξ_k = 1. Threshold-based structures are also numerically illustrated in Section VI-B. Note that for the perfect channel case ξ_k = 1, we have an MDP, and we can compute the state transition probabilities. Next, we present two propositions, which are used to prove the properties of an optimal policy expressed in Theorem 1.
Proposition 1.
The optimal state-value function v*_k(s) is (i) non-decreasing with respect to the AoI, and (ii) non-increasing with respect to the battery level. The proof is presented in Appendix A.
Proposition 2.
For the case where the link from sensor k to the edge node is perfect, i.e., ξ_k = 1, the difference between the optimal action-value functions for the two actions, denoted by δq*_k(s) = q*_k(s, 1) − q*_k(s, 0), is non-increasing with respect to the AoI. The proof is presented in Appendix B.
Theorem 1.
For the case where the link from sensor k to the edge node is perfect, i.e., ξ_k = 1, an optimal policy has a threshold-based structure with respect to the AoI.

Proof. Proving that an optimal policy has a threshold-based structure with respect to the AoI is equivalent to showing that if the optimal action in state s₁ = {b, ∆₁} is a*_k(s₁) = 1, then for all the states s₂ = {b, ∆₂}, in which ∆₂ ≥ ∆₁, the optimal action is a*_k(s₂) = 1 as well. According to Proposition 2, q*_k(s₂, 1) − q*_k(s₂, 0) ≤ q*_k(s₁, 1) − q*_k(s₁, 0). The optimal action in state s₁ is a*_k(s₁) = 1, thus q*_k(s₁, 1) − q*_k(s₁, 0) ≤ 0. Accordingly, q*_k(s₂, 1) − q*_k(s₂, 0) ≤ 0, which shows that the optimal action for state s₂ is a*_k(s₂) = 1.

VI. SIMULATION RESULTS
In this section, simulation results are presented to demonstrate the performance of the proposed value iteration algorithm summarized in Algorithm 1 and the proposed Q-learning algorithms – Q-learning with exact and partial battery knowledge – obtained by Algorithm 2.
A. Simulation Setup
The simulation scenario consists of K = 3 energy harvesting sensors, i.e., K = {1, 2, 3}. Each sensor k ∈ K has a battery of finite capacity B_k = 15 units of energy. At each time slot, the probability that the value of f_k is requested (i.e., r_k(t) = 1) is denoted by p_k, i.e., Pr{r_k(t) = 1} = p_k, and the same request probability p_k is used for all k ∈ K. The weight parameters in (4) are set as β_k = 1, ∀k ∈ K. For the value iteration method summarized in Algorithm 1, we use a small stopping threshold θ and a fixed discount factor γ < 1. For the Q-learning method summarized in Algorithm 2, we use an exponentially decaying exploration probability of the form ε(t) = 0.02 + c·e^(−ε_d t) with decay parameter ε_d; the learning rate α(t) is set to a larger constant value during the first 1/ε_d iterations and to a smaller constant value thereafter.
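The exploration and learning-rate schedules can be implemented in a few lines; the constants below are illustrative placeholders only and are not the exact values used in the paper's experiments.

```python
import math

EPS_MIN, EPS_SCALE, EPS_DECAY = 0.02, 0.98, 1e-5   # illustrative constants
ALPHA_INITIAL, ALPHA_FINAL = 0.5, 0.05             # illustrative learning rates

def epsilon(t):
    """Exploration probability epsilon(t) = eps_min + c * exp(-eps_d * t)."""
    return EPS_MIN + EPS_SCALE * math.exp(-EPS_DECAY * t)

def alpha(t):
    """Larger learning rate during the initial ~1/eps_d slots, smaller afterwards."""
    return ALPHA_INITIAL if t <= 1.0 / EPS_DECAY else ALPHA_FINAL
```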
B. Structure of Optimal Deterministic Policy

We analyze the structural properties of an optimal deterministic policy obtained by the value iteration algorithm for a particular sensor, i.e., sensor 1, and investigate the effect of the energy harvesting probability λ and the transmit success probability ξ.

Fig. 3 illustrates the structure of the obtained optimal deterministic policy for different values of the energy harvesting probability λ with a fixed transmit success probability ξ. Each point represents a potential state of the system as a pair of values of the battery level and AoI, (b, ∆). In particular, a red circle indicates that the optimal action in a given state is that the edge node does not command the sensor (i.e., a = 0), and a blue square indicates that the optimal action is that the edge node commands the sensor to send a status update (i.e., a = 1). The set of blue points is referred to as the command region hereinafter.

From Fig. 3(a)–(d), we observe that the optimal deterministic policy has a threshold-based structure with respect to the battery level and the AoI, which can be expressed as follows:
1) If the optimal action in state s = {b, ∆} is a = 1, then for all the states s′ = {b′, ∆}, in which b′ ≥ b, the optimal action is a = 1 as well.
2) If the optimal action in state s = {b, ∆} is a = 1, then for all the states s′ = {b, ∆′}, in which ∆′ ≥ ∆, the optimal action is a = 1 as well.
In Section V, we analytically proved the latter statement for the special case ξ_k = 1; here, we numerically show that an optimal policy has a threshold-based structure with respect to the AoI for all values of ξ_k as well. To exemplify this threshold-based structure in Fig. 3(a), consider a point (5, ∆) that belongs to the command region; then the optimal action at all the points (5, ∆′) with ∆′ ≥ ∆, and at all the points (b, ∆) with b ≥ 5, is also a = 1.

By comparing Figs. 3(a)–(d) with each other, we observe that the command region (i.e., the set of blue square points) enlarges by increasing the energy harvesting probability λ. This is due to the fact that since the sensor harvests energy more often, the edge node commands the sensor to send fresh measurements more often. Note that Fig. 3(d) is associated with an extreme case in which the sensor harvests energy at each time slot; in this case, there is always at least one unit of energy available in the battery of the sensor, and thus, for all the states with b ≥ 1, the optimal action is a = 1.

Fig. 3: Structure of an optimal deterministic policy π* obtained by the value iteration algorithm for each state s = {b, ∆} (battery level vs. AoI) with a fixed transmit success probability ξ, for increasing values of the energy harvesting probability λ across panels (a)–(d), with λ = 1 in panel (d). Red circle: no command (a = 0); blue square: command (a = 1).

Fig. 4 illustrates the threshold-based structure of the obtained optimal deterministic policy for different values of the transmit success probability ξ with a fixed energy harvesting probability λ. Figs. 4(a)–(d) illustrate that the command region expands by increasing the transmit success probability ξ. This is due to the fact that by increasing ξ, the communication link from the sensor to the edge node becomes more reliable, and thus, the edge node commands the sensor more often as it has more confidence about receiving the transmitted status update packet. Fig.
4(a) depicts an extreme case with ξ = 0, in which the link from the sensor to the edge node is always in the failed state and the edge node never receives any commanded status update; to conserve the sensor's battery, the optimal action is clearly a = 0.

Fig. 4: Structure of an optimal deterministic policy π* obtained by the value iteration algorithm for each state s = {b, ∆} (battery level vs. AoI) with a fixed energy harvesting probability λ, for increasing values of the transmit success probability ξ across panels (a)–(d), with ξ = 0 in panel (a) and ξ = 1 in panel (d). Red circle: no command (a = 0); blue square: command (a = 1).

C. Performance and Learning Behaviour of the Proposed Algorithms
We investigate the performance and learning behaviour of the proposed Q-learning algorithms with exact and partial battery knowledge. To this end, we analyze the performance of the proposed algorithms in terms of the long-term average costs defined in (5) and (7). As a remark, the value iteration algorithm serves as a lower bound to the proposed Q-learning algorithms since it knows the exact statistical model of the environment, and consequently, the state transition probabilities of the underlying MDP. Similarly, the Q-learning method with the exact battery knowledge (referred to as
Q-learning-exact hereinafter) is a lower bound to the Q-learning algorithm having only the partial battery knowledge (referred to as
Q-learning-partial hereinafter).

For comparison, we consider two baseline policies: a greedy and a random policy. In the greedy policy, whenever the value of physical quantity f_k is requested (i.e., r_k(t) = 1), the edge node commands sensor k to send a status update (i.e., a_k(t) = 1), regardless of the battery state and AoI; sensor k sends a status update if the battery is non-empty, i.e., b_k(t) ≥ 1. In the random policy, whenever the value of physical quantity f_k is requested (i.e., r_k(t) = 1), the edge node selects a random action a_k(t) ∈ {0, 1} according to the discrete uniform distribution.

Fig. 5: Learning behaviour of the proposed value iteration algorithm and Q-learning algorithms in comparison to the baseline policies. Panels: (a) average cost for sensor 1, C̄₁; (b) average cost for sensor 2, C̄₂; (c) average cost for sensor 3, C̄₃; (d) average cost for all the sensors, C̄. In each panel, the average cost is plotted against the time index for the value iteration, Q-learning (exact battery), Q-learning (partial knowledge), greedy, and random policies.

Fig. 5 depicts the performance of each algorithm for three different energy harvesting probabilities λ₁, λ₂, and λ₃ for sensors 1, 2, and 3, respectively (λ₁ being the lowest and λ₃ the highest), and identical transmit success probabilities ξ_k, ∀k ∈ K. Figs. 5(a)–(c) are associated with the per-sensor long-term average cost (C̄_k) for sensors 1, 2, and 3, respectively. Fig. 5(d) illustrates the long-term average cost over all the sensors (C̄). As shown in Fig. 5(d), Q-learning-exact performs close to the value iteration algorithm, and the proposed RL algorithms outperform the baseline methods in terms of the long-term average cost. Q-learning-exact, and also the value iteration algorithm, reduces the average cost approximately by a factor of 2 compared to the greedy algorithm. Furthermore, the average cost decreases roughly
Fig. 5: Learning behaviour of the proposed value iteration algorithm and Q-learning algorithms in comparison to the baseline policies. Panels (a)–(c) show the long-term average cost for sensors 1, 2, and 3 ( C̄_1, C̄_2, C̄_3 ), respectively, and panel (d) shows the long-term average cost over all the sensors ( C̄ ), each plotted versus the time index for value iteration, Q-learning (exact battery), Q-learning (partial knowledge), the greedy policy, and the random policy.

Fig. 5 depicts the performance of each algorithm for the sensors' energy harvesting probabilities λ_1, λ_2, and λ_3 (sensor 1 having the lowest and sensor 3 the highest) and a common transmit success probability ξ_k, ∀k ∈ K. Figs. 5(a)–(c) are associated with the per-sensor long-term average cost ( C̄_k ) for sensors 1, 2, and 3, respectively, and Fig. 5(d) illustrates the long-term average cost over all the sensors ( C̄ ). As shown in Fig. 5(d), Q-learning-exact performs close to the value iteration algorithm, and the proposed RL algorithms outperform the baseline methods in terms of the long-term average cost. Q-learning-exact, and also the value iteration algorithm, reduces the average cost approximately by a factor of 2 compared to the greedy algorithm. Furthermore, the average cost decreases by roughly 30 % for Q-learning-partial compared to the greedy algorithm.

Interestingly, the gap between Q-learning-partial and Q-learning-exact is small when the energy harvesting probability is high enough. As shown in Figs. 5(a)–(c), the largest gap occurs for the sensor with the lowest energy harvesting probability, i.e., sensor 1; on the contrary, the smallest gap is obtained for sensor 3, which has the highest energy harvesting probability. This is due to the fact that when energy becomes scarce, the edge node receives status updates more rarely; consequently, the information about the battery levels at the edge node becomes more outdated, i.e., more uncertain, inhibiting the capability of Q-learning-partial to take the near-optimal actions taken by Q-learning-exact. Overall, Fig. 5 demonstrates that the proposed algorithm for the realistic scenario performs well even though the edge node acts on outdated battery information.

In Fig. 5(a), the greedy policy performs as poorly as the random policy, because the energy harvesting probability is low, and thus, it is highly sub-optimal to command the sensor in all states. As can be seen in Figs. 5(a)–(c), the lowest long-term average cost is associated with the sensor that has the highest energy harvesting probability, i.e., sensor 3. This is because sensor 3 harvests energy more often, and thus, it can send status updates more frequently upon receiving a command from the edge node. Recall that the command region enlarges as the energy harvesting probability increases, i.e., the edge node commands the sensor more frequently. By comparing Figs. 5(a)–(c) with each other, we observe that as the energy harvesting probability λ_k increases, the long-term average cost of the value iteration algorithm, and also of Q-learning, moves toward the long-term average cost of the greedy policy. This is because, by increasing the energy harvesting probability, the command region enlarges, and thus, an optimal policy tends to the greedy policy.
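For completeness, the core update shared by both Q-learning variants is the standard tabular Q-learning rule; the two variants differ only in which battery information enters the state observed by the edge node. The sketch below is a generic illustration of that update with an ε-greedy exploration rule; the state encoding, learning-rate schedule, and numerical values are assumptions for illustration, not the exact choices of the paper.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1      # assumed learning rate, discount factor, exploration rate
ACTIONS = (0, 1)                        # 0: serve from the cache, 1: command a status update

Q = defaultdict(float)                  # tabular Q-function, keyed by (state, action)

def select_action(state):
    """Epsilon-greedy selection; costs are minimized, so the greedy choice is the argmin."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return min(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, cost, next_state):
    """One-step tabular Q-learning update in its cost-minimization form."""
    target = cost + GAMMA * min(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])
```

For Q-learning-exact, state would include the true battery level b_k(t); for Q-learning-partial, it would be built only from information available at the edge node, e.g., the battery value carried by the most recently received status update.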
VII. CONCLUSIONS

We investigated a status update control problem in an IoT sensing network consisting of multiple users, multiple energy harvesting sensors, and a wireless edge node. We modeled the problem as an MDP and proposed two reinforcement learning (RL) based algorithms: a model-based value iteration method relying on dynamic programming, and a model-free Q-learning method. Furthermore, we developed a Q-learning method for the realistic case in which the edge node does not know the exact battery levels. The proposed Q-learning schemes do not need any information about the energy harvesting model. Simulation results showed that an optimal policy has a threshold-based structure, and the proposed RL algorithms significantly reduce the long-term average cost compared to several baseline methods.

APPENDIX
A. Proof of Proposition 1

Proof.
As discussed in Section IV-A, the optimal state-value function $v^*_k(s)$ can be computed iteratively by the value iteration algorithm. In the value iteration algorithm, the optimal state-value function of state $s$ at iteration $n = 1, 2, \ldots$, denoted by $v^*_k(s)^{(n)}$, is updated as (see (16))
$$
v^*_k(s)^{(n)} = \min_{a \in \mathcal{A}_k} \sum_{s' \in \mathcal{S}_k} P_k(s' \mid s, a)\big[c_k(s,a) + \gamma v^*_k(s')^{(n-1)}\big] = \min_{a \in \mathcal{A}_k} q^*_k(s,a)^{(n-1)}, \quad \forall s \in \mathcal{S}_k. \quad (19)
$$
Thus, an optimal policy at the $n$th iteration is given by
$$
\pi^*_k(a \mid s)^{(n)} =
\begin{cases}
1, & \text{if } a = \arg\min_{a \in \mathcal{A}_k} q^*_k(s,a)^{(n)}, \\
0, & \text{otherwise},
\end{cases}
\quad \forall s \in \mathcal{S}_k. \quad (20)
$$
Accordingly, an optimal action in state $s$ at the $n$th iteration, denoted by $a^*_k(s)^{(n)}$, is expressed as
$$
a^*_k(s)^{(n)} = \arg\min_{a \in \mathcal{A}_k} q^*_k(s,a)^{(n)}. \quad (21)
$$
For any arbitrary initialization $v^*_k(s)^{(0)}$, the sequence $\{v^*_k(s)^{(n)}\}$ can be shown to converge to the optimal state-value function $v^*_k(s)$ [34, Sect. 4.4]. This fact can be expressed as
$$
\lim_{n \to \infty} v^*_k(s)^{(n)} = v^*_k(s). \quad (22)
$$

(i) In order to prove that $v^*_k(s)$ is non-decreasing with respect to the AoI, let us define two states $s_1 = \{b, \Delta_1\}$ and $s_2 = \{b, \Delta_2\}$, where $\Delta_1 \ge \Delta_2$. We show that $v^*_k(s_1) \ge v^*_k(s_2)$. According to (22), it suffices to prove that $v^*_k(s_1)^{(n)} \ge v^*_k(s_2)^{(n)}$, $\forall n$. We prove this by mathematical induction. The initial values can be chosen arbitrarily, e.g., $v^*_k(s_1)^{(0)} = 0$ and $v^*_k(s_2)^{(0)} = 0$; thus, the relation $v^*_k(s_1)^{(n)} \ge v^*_k(s_2)^{(n)}$ holds for $n = 0$. Assume that $v^*_k(s_1)^{(n)} \ge v^*_k(s_2)^{(n)}$ for some $n$. We need to prove that $v^*_k(s_1)^{(n+1)} \ge v^*_k(s_2)^{(n+1)}$ as well. From (19) and (21), we have
$$
\begin{aligned}
v^*_k(s_2)^{(n+1)} - v^*_k(s_1)^{(n+1)}
&= \min_{a \in \mathcal{A}_k} q^*_k(s_2,a)^{(n)} - \min_{a \in \mathcal{A}_k} q^*_k(s_1,a)^{(n)} \\
&= q^*_k\big(s_2, a^*_k(s_2)^{(n)}\big)^{(n)} - q^*_k\big(s_1, a^*_k(s_1)^{(n)}\big)^{(n)} \\
&\overset{(a)}{\le} q^*_k\big(s_2, a^*_k(s_1)^{(n)}\big)^{(n)} - q^*_k\big(s_1, a^*_k(s_1)^{(n)}\big)^{(n)}, \quad (23)
\end{aligned}
$$
where (a) follows from the fact that taking action $a^*_k(s_1)^{(n)}$ in state $s_2$ is not necessarily optimal. We show that $q^*_k\big(s_2, a^*_k(s_1)^{(n)}\big)^{(n)} - q^*_k\big(s_1, a^*_k(s_1)^{(n)}\big)^{(n)} \le 0$ for all possible actions $a^*_k(s_1)^{(n)} \in \{0, 1\}$. We present the proof for the case corresponding to (18d) where $b \ge 1$ and $a^*_k(s_1)^{(n)} = 1$; for the other three cases corresponding to (18a)–(18c), the proof follows similarly. We have
$$
\begin{aligned}
&q^*_k(s_2, 1)^{(n)} - q^*_k(s_1, 1)^{(n)} \\
&\quad= \sum_{s' \in \mathcal{S}_k} P_k(s' \mid s_2, 1)\big[c_k(s_2, 1) + \gamma v^*_k(s')^{(n)}\big]
 - \sum_{s' \in \mathcal{S}_k} P_k(s' \mid s_1, 1)\big[c_k(s_1, 1) + \gamma v^*_k(s')^{(n)}\big] \\
&\quad\overset{(a)}{=} \lambda_k \xi_k \big(\gamma v^*_k(b, 1)^{(n)}\big) + (1-\lambda_k)\xi_k \big(\gamma v^*_k(b-1, 1)^{(n)}\big) \\
&\qquad+ \lambda_k (1-\xi_k)\big(\min\{\Delta_2+1, \Delta_{k,\max}\} + \gamma v^*_k(b, \min\{\Delta_2+1, \Delta_{k,\max}\})^{(n)}\big) \\
&\qquad+ (1-\lambda_k)(1-\xi_k)\big(\min\{\Delta_2+1, \Delta_{k,\max}\} + \gamma v^*_k(b-1, \min\{\Delta_2+1, \Delta_{k,\max}\})^{(n)}\big) \\
&\qquad- \lambda_k \xi_k \big(\gamma v^*_k(b, 1)^{(n)}\big) - (1-\lambda_k)\xi_k \big(\gamma v^*_k(b-1, 1)^{(n)}\big) \\
&\qquad- \lambda_k (1-\xi_k)\big(\min\{\Delta_1+1, \Delta_{k,\max}\} + \gamma v^*_k(b, \min\{\Delta_1+1, \Delta_{k,\max}\})^{(n)}\big) \\
&\qquad- (1-\lambda_k)(1-\xi_k)\big(\min\{\Delta_1+1, \Delta_{k,\max}\} + \gamma v^*_k(b-1, \min\{\Delta_1+1, \Delta_{k,\max}\})^{(n)}\big) \\
&\quad= \underbrace{(1-\xi_k)\big(\min\{\Delta_2+1, \Delta_{k,\max}\} - \min\{\Delta_1+1, \Delta_{k,\max}\}\big)}_{(b)\,\le\,0} \\
&\qquad+ \underbrace{\gamma \lambda_k (1-\xi_k)\big(v^*_k(b, \min\{\Delta_2+1, \Delta_{k,\max}\})^{(n)} - v^*_k(b, \min\{\Delta_1+1, \Delta_{k,\max}\})^{(n)}\big)}_{(c)\,\le\,0} \\
&\qquad+ \underbrace{\gamma (1-\lambda_k)(1-\xi_k)\big(v^*_k(b-1, \min\{\Delta_2+1, \Delta_{k,\max}\})^{(n)} - v^*_k(b-1, \min\{\Delta_1+1, \Delta_{k,\max}\})^{(n)}\big)}_{(d)\,\le\,0} \\
&\quad\le 0,
\end{aligned}
$$
where in step (a) we use the result of (18a), step (b) follows from the assumption $\Delta_2 \le \Delta_1$, and steps (c) and (d) follow from the induction assumption.

(ii) In order to prove that $v^*_k(s)$ is non-increasing with respect to the battery level, we define two states $s_1 = \{b_1, \Delta\}$ and $s_2 = \{b_2, \Delta\}$, where $b_2 \ge b_1$. By using induction and following steps similar to those in (i), one can show that $v^*_k(s_1) \ge v^*_k(s_2)$.
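As a numerical sanity check on parts (i) and (ii), the monotonicity of $v^*_k$ can also be observed by running the recursion (19) on a small instance. The Python sketch below is a compact variant of the earlier value iteration sketch, using the same kind of illustrative single-sensor dynamics (the four branches mirror the expansion above); all parameter values and the cost convention are assumptions for illustration only.

```python
import numpy as np

# Assumed illustrative parameters: battery cap, AoI cap, harvest prob., success prob., discount
B, D, LAM, XI, G = 4, 8, 0.3, 0.7, 0.9

def branches(b, d, a):
    """(probability, next_state, cost) triples mirroring the four branches of the expansion above."""
    du = min(d + 1, D)
    if a == 1 and b >= 1:
        return [(LAM * XI, (b, 1), 0.0), ((1 - LAM) * XI, (b - 1, 1), 0.0),
                (LAM * (1 - XI), (b, du), du), ((1 - LAM) * (1 - XI), (b - 1, du), du)]
    return [(LAM, (min(b + 1, B), du), du), (1 - LAM, (b, du), du)]

V = np.zeros((B + 1, D + 1))
for _ in range(2000):                              # fixed-point iteration of the backup in (19)
    V = np.array([[min(sum(p * (c + G * V[s]) for p, s, c in branches(b, d, a)) for a in (0, 1))
                   if d else 0.0
                   for d in range(D + 1)] for b in range(B + 1)])

# Proposition 1 on this instance: v* non-decreasing in the AoI, non-increasing in the battery level
assert np.all(np.diff(V[:, 1:], axis=1) >= -1e-9), "v* should not decrease with the AoI"
assert np.all(np.diff(V[:, 1:], axis=0) <= 1e-9), "v* should not increase with the battery level"
print("Proposition 1 monotonicity holds on this instance")
```

A similar check on the difference between the two action values across the AoI illustrates the threshold behaviour used in the proof of Proposition 2 below.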
B. Proof of Proposition 2

Proof. We define two states $s_1 = \{b, \Delta_1\}$ and $s_2 = \{b, \Delta_2\}$, where $\Delta_1 \ge \Delta_2$. We show that $\delta q^*_k(s_1) \ge \delta q^*_k(s_2)$, which can be rewritten as $q^*_k(s_2, 1) - q^*_k(s_1, 1) - q^*_k(s_2, 0) + q^*_k(s_1, 0) \ge 0$. We present the proof for the case where $1 \le b < B_k$; for the other two cases, i.e., $b = 0$ and $b = B_k$, the proof follows similarly. We have
$$
\begin{aligned}
&q^*_k(s_2, 1) - q^*_k(s_1, 1) - q^*_k(s_2, 0) + q^*_k(s_1, 0) \\
&\quad= \sum_{s' \in \mathcal{S}_k} P_k(s' \mid s_2, 1)\big[c_k(s_2, 1) + \gamma v^*_k(s')\big]
 - \sum_{s' \in \mathcal{S}_k} P_k(s' \mid s_1, 1)\big[c_k(s_1, 1) + \gamma v^*_k(s')\big] \\
&\qquad- \sum_{s' \in \mathcal{S}_k} P_k(s' \mid s_2, 0)\big[c_k(s_2, 0) + \gamma v^*_k(s')\big]
 + \sum_{s' \in \mathcal{S}_k} P_k(s' \mid s_1, 0)\big[c_k(s_1, 0) + \gamma v^*_k(s')\big] \\
&\quad= \lambda_k \big(\gamma v^*_k(b, 1)\big) + (1-\lambda_k)\big(\gamma v^*_k(b-1, 1)\big)
 - \lambda_k \big(\gamma v^*_k(b, 1)\big) - (1-\lambda_k)\big(\gamma v^*_k(b-1, 1)\big) \\
&\qquad- \lambda_k \big(\min\{\Delta_2+1, \Delta_{k,\max}\} + \gamma v^*_k(b+1, \min\{\Delta_2+1, \Delta_{k,\max}\})\big) \\
&\qquad- (1-\lambda_k)\big(\min\{\Delta_2+1, \Delta_{k,\max}\} + \gamma v^*_k(b, \min\{\Delta_2+1, \Delta_{k,\max}\})\big) \\
&\qquad+ \lambda_k \big(\min\{\Delta_1+1, \Delta_{k,\max}\} + \gamma v^*_k(b+1, \min\{\Delta_1+1, \Delta_{k,\max}\})\big) \\
&\qquad+ (1-\lambda_k)\big(\min\{\Delta_1+1, \Delta_{k,\max}\} + \gamma v^*_k(b, \min\{\Delta_1+1, \Delta_{k,\max}\})\big) \\
&\quad= \underbrace{\big(\min\{\Delta_1+1, \Delta_{k,\max}\} - \min\{\Delta_2+1, \Delta_{k,\max}\}\big)}_{(a)\,\ge\,0} \\
&\qquad+ \underbrace{\gamma \lambda_k \big(v^*_k(b+1, \min\{\Delta_1+1, \Delta_{k,\max}\}) - v^*_k(b+1, \min\{\Delta_2+1, \Delta_{k,\max}\})\big)}_{(b)\,\ge\,0} \\
&\qquad+ \underbrace{\gamma (1-\lambda_k)\big(v^*_k(b, \min\{\Delta_1+1, \Delta_{k,\max}\}) - v^*_k(b, \min\{\Delta_2+1, \Delta_{k,\max}\})\big)}_{(c)\,\ge\,0} \\
&\quad\ge 0,
\end{aligned}
$$
where step (a) follows from the assumption $\Delta_2 \le \Delta_1$, and steps (b) and (c) follow from Proposition 1.

REFERENCES

[1] L. D. Xu, W. He, and S. Li, "Internet of things in industries: A survey," IEEE Trans. Ind. Informat., vol. 10, no. 4, pp. 2233–2243, Nov. 2014.
[2] S. Sudevalayam and P. Kulkarni, "Energy harvesting sensor nodes: Survey and implications," IEEE Commun. Surveys Tuts., vol. 13, no. 3, pp. 443–461, Jul. 2011.
[3] S. Kim, R. Vyas, J. Bito, K. Niotaki, A. Collado, A. Georgiadis, and M. M. Tentzeris, "Ambient RF energy-harvesting technologies for self-sustainable standalone wireless sensor platforms," Proc. IEEE, vol. 102, no. 11, pp. 1649–1666, Nov. 2014.
[4] M. Piñuela, P. D. Mitcheson, and S. Lucyszyn, "Ambient RF energy harvesting in urban and semi-urban environments," IEEE Trans. Microw. Theory Techn., vol. 61, no. 7, pp. 2715–2726, May 2013.
[5] S. Kaul, R. Yates, and M. Gruteser, "Real-time status: How often should one update?" in Proc. IEEE Int. Conf. Computer Commun. (INFOCOM), Orlando, FL, USA, Mar. 25–30, 2012, pp. 2731–2735.
[6] R. D. Yates and S. K. Kaul, "The age of information: Real-time status updating by multiple sources," IEEE Trans. Inf. Theory, vol. 65, no. 3, pp. 1807–1827, Mar. 2019.
[7] M. Costa, M. Codreanu, and A. Ephremides, "On the age of information in status update systems with packet management," IEEE Trans. Inf. Theory, vol. 62, no. 4, pp. 1897–1910, Apr. 2016.
[8] Y. Sun, I. Kadota, R. Talak, E. Modiano, and R. Srikant, "Age of information: A new metric for information freshness," Synthesis Lectures on Communication Networks, vol. 12, no. 2, pp. 1–224, 2019.
[9] A. Kosta, N. Pappas, and V. Angelakis, "Age of information: A new concept, metric, and tool," Foundations and Trends in Netw., vol. 12, no. 3, pp. 162–259, 2017.
[10] H. Chen, Y. Gu, and S. C. Liew, "Age-of-information dependent random access for massive IoT networks," in Proc. IEEE INFOCOM Workshop, Toronto, Canada, Jul. 6–9, 2020, pp. 177–182.
[11] P. D. Mankar, Z. Chen, M. A. Abd-Elmagid, N. Pappas, and H. S. Dhillon, "Throughput and age of information in a cellular-based IoT network," 2020, [Online]. Available: https://arxiv.org/abs/2005.09547.
[12] D. Niyato, D. I. Kim, P. Wang, and L. Song, "A novel caching mechanism for internet of things (IoT) sensing service with energy harvesting," in Proc. IEEE Int. Conf. Commun., Kuala Lumpur, Malaysia, May 22–27, 2016, pp. 1–6.
[13] N. Pappas, Z. Chen, and M. Hatami, "Average AoI of cached status updates for a process monitored by an energy harvesting sensor," in Proc. Conf. Inform. Sciences Syst. (CISS), Princeton, NJ, USA, Mar. 18–20, 2020, pp. 1–5.
[14] I. Krikidis, "Average age of information in wireless powered sensor networks," IEEE Wireless Commun. Lett., vol. 8, no. 2, pp. 628–631, Jan. 2019.
[15] R. D. Yates, P. Ciblat, A. Yener, and M. Wigger, "Age-optimal constrained cache updating," in Proc. IEEE Int. Symp. Inform. Theory, Aachen, Germany, Jun. 25–30, 2017, pp. 141–145.
[16] H. Tang, P. Ciblat, J. Wang, M. Wigger, and R. Yates, "Age of information aware cache updating with file- and age-dependent update durations," in Proc. Int. Symp. Model. and Opt. Mobile, Ad Hoc, Wireless Netw., Volos, Greece, Jun. 15–19, 2020, pp. 1–6.
[17] M. Bastopcu and S. Ulukus, "Information freshness in cache updating systems," 2020, [Online]. Available: https://arxiv.org/abs/2004.09475.
[18] A. Arafa and S. Ulukus, "Timely updates in energy harvesting two-hop networks: Offline and online policies," IEEE Trans. Wireless Commun., vol. 18, no. 8, pp. 4017–4030, Jun. 2019.
[19] A. Arafa, J. Yang, S. Ulukus, and H. V. Poor, "Using erasure feedback for online timely updating with an energy harvesting sensor," in Proc. IEEE Int. Symp. Inform. Theory, Orlando, FL, USA, Jul. 7–12, 2019, pp. 607–611.
[20] R. D. Yates, "Lazy is timely: Status updates by an energy harvesting source," in Proc. IEEE Int. Symp. Inform. Theory, Orlando, FL, USA, Jun. 14–19, 2015, pp. 3008–3012.
[21] H. Zhu, Y. Cao, X. Wei, W. Wang, T. Jiang, and S. Jin, "Caching transient data for internet of things: A deep reinforcement learning approach," IEEE Internet Things J., vol. 6, no. 2, pp. 2074–2083, Apr. 2019.
[22] S. Leng and A. Yener, "Age of information minimization for an energy harvesting cognitive radio," IEEE Trans. Cogn. Commun. Netw., vol. 5, no. 2, pp. 427–439, May 2019.
[23] M. A. Abd-Elmagid, H. S. Dhillon, and N. Pappas, "A reinforcement learning framework for optimizing age of information in RF-powered communication systems," IEEE Trans. Commun., vol. 68, no. 8, pp. 4747–4760, Aug. 2020.
[24] B. Zhou and W. Saad, "Joint status sampling and updating for minimizing age of information in the internet of things," IEEE Trans. Commun., vol. 67, no. 11, pp. 7468–7482, Nov. 2019.
[25] ——, "Minimum age of information in the internet of things with non-uniform status packet sizes," IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 1933–1947, Dec. 2020.
[26] S. Leng and A. Yener, "Age of information minimization for wireless ad hoc networks: A deep reinforcement learning approach," in Proc. IEEE Global Telecommun. Conf., Waikoloa, HI, USA, Dec. 9–13, 2019, pp. 1–6.
[27] L. Lei, Y. Tan, K. Zheng, S. Liu, K. Zhang, and X. Shen, "Deep reinforcement learning for autonomous internet of things: Model, applications and challenges," IEEE Commun. Surveys Tuts., pp. 1–1, Apr. 2020.
[28] M. Hatami, M. Jahandideh, M. Leinonen, and M. Codreanu, "Age-aware status update control for energy harvesting IoT sensors via reinforcement learning," in Proc. IEEE Int. Symp. Pers., Indoor, Mobile Radio Commun., London, UK, Aug. 31–Sep. 3, 2020, available at: https://arxiv.org/pdf/2004.12684.pdf.
[29] B. T. Bacinoglu, E. T. Ceran, and E. Uysal-Biyikoglu, "Age of information under energy replenishment constraints," in Proc. Inform. Theory and Appl. Workshop, San Diego, CA, USA, Feb. 1–6, 2015, pp. 25–31.
[30] X. Wu, J. Yang, and J. Wu, "Optimal status update for age of information minimization with an energy harvesting source," IEEE Trans. Green Commun. Netw., vol. 2, no. 1, pp. 193–204, Mar. 2018.
[31] A. Arafa, J. Yang, S. Ulukus, and H. V. Poor, "Age-minimal transmission for energy harvesting sensors with finite batteries: Online policies," IEEE Trans. Inf. Theory, vol. 66, no. 1, pp. 534–556, Jan. 2020.
[32] N. Michelusi, K. Stamatiou, and M. Zorzi, "Transmission policies for energy harvesting sensors with time-correlated energy supply," IEEE Trans. Commun., vol. 61, no. 7, pp. 2988–3001, Jul. 2013.
[33] E. T. Ceran, D. Gündüz, and A. György, "Reinforcement learning to minimize age of information with an energy harvesting sensor with HARQ and sensing cost," in
Proc. IEEE INFOCOM Workshop, Paris, France, Apr. 29–May 2, 2019, pp. 656–661.
[34] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 2018.