Distributed Power Control for Delay Optimization in Energy Harvesting Cooperative Relay Networks

V. Hakami and M. Dehghan, Member, IEEE

IEEE Transactions on Vehicular Technology

Abstract: We consider cooperative communications with energy harvesting (EH) relays, and develop a distributed power control mechanism for the relaying terminals. Unlike prior art, which mainly deals with single-relay systems with saturated traffic flow, we address the case of bursty data arrivals at the source that are cooperatively forwarded by multiple half-duplex EH relays. We aim at optimizing the long-run average delay of the source packets under the energy neutrality constraint on the power consumption of each relay. While EH relay systems have been predominantly optimized using either offline or online methodologies, we take a more realistic learning-theoretic approach. Hence, our scheme can be deployed for real-time operation without assuming acausal information on channel realizations and data/energy arrivals, as required by offline optimization, and without relying on precise statistics of the system processes, as is the case with online optimization. We formulate the problem as a partially observable identical payoff stochastic game (PO-IPSG) with factored controllers, in which the power control policy of each relay is adaptive to its channel and energy states as well as to the state of the source buffer. We equip each relay with a reinforcement learning procedure, and prove that the parallel execution of this procedure is convergent to (at least) a locally optimal solution of the formulated PO-IPSG. The proposed algorithm operates without explicit message exchange between the relays, while inducing only little source-relay signaling overhead. By simulation, we contrast the delay performance of the proposed method against existing heuristics for throughput maximization. It is shown that, compared with these heuristics, the systematic approach adopted in this paper has a smaller sub-optimality gap when evaluated against a centralized optimal policy armed with perfect statistics.

Index Terms — bursty traffic, cooperative relaying, energy harvesting, power control, reinforcement learning, stochastic game, wireless communication.

Copyright (c) 2016 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

V. Hakami is with the Department of Computer Engineering, Iran University of Science and Technology (IUST), Tehran 16846-13114, Iran (e-mail: [email protected]). M. Dehghan is with the Department of Computer Engineering and Information Technology, Amirkabir University of Technology (AUT), Tehran 15916-34311, Iran (e-mail: [email protected]).

I. INTRODUCTION

COOPERATIVE relaying is a promising paradigm that broadens coverage and helps combat wireless channel impairments. Relay-assisted transmission mitigates the need to use high power at the transmitter, leading to prolonged battery life and a lower level of interference [1]. Relays in wireless networks can be classified as decode-and-forward (DaF) relays, which decode and possibly re-encode the information before forwarding it, and amplify-and-forward (AaF) relays, which forward an amplified version of the signal without hard decoding. Compared with relay types that require signal detection, AaF relays are less complicated, have a lower implementation cost, and are thus widely applicable [4]. While cooperative relaying results in higher network capacity, a relay consumes its own energy in forwarding to the destination a representation of the signal it has received from the source. Since replacing batteries for such devices is either impracticable or costly in several scenarios, recent advances in energy harvesting devices [5] have paved the way for self-sustainable relays [6] that power themselves from theoretically unlimited energy sources present in their surrounding environment (e.g., solar, vibration, or thermoelectric energy). However, harvested energy rates are typically quite low, with sporadic arrivals in random, limited amounts, and it is thus desirable to accumulate the harvested energy in a buffer such as a rechargeable battery for subsequent use. In practice, the energy buffer is restricted in size, and thus EH relays may face power outage whenever the energy consumption rate is higher than the harvesting rate. Hence, there is a need for novel power-use policies which exploit available information on the energy, channel, and data arrival processes to efficiently utilize the harvested power for meeting application-specific demands.

A. Literature Review

Exploiting both energy harvesting and cooperative communications has received considerable interest recently [7-20]. The use of EH relays in cooperative communication was first introduced in [8], where a comprehensive performance analysis was conducted for relay selection and transmission power setting in an AaF network in terms of symbol error probability by using a probabilistic energy model. However, the results in [8] are mostly of analytical interest rather than proposing a practical optimization scheme. More recently, several studies have come up with transmission control strategies (e.g., power allocation, relay selection, etc.) to optimize different network utility functions in EH relay systems [7,9,10,11,13,14,15,17,18,19,20,35]. These schemes can be categorized based on two main distinguishing features:

• Optimization method (offline/online/learning-theoretic):

In offline optimization, it is assumed that all the future realizations of data/energy arrivals as well as the channel variations are known acausally before the system starts. In general, offline optimization problems are modeled as mathematical programs, and the solution obtained can be considered an upper bound on the performance of the actual stochastic system. In contrast, online optimization is much more realistic in the sense that only statistical knowledge and causal information on the realizations of the system states are assumed. A systematic way to approach online optimization is to formulate the problem as a stochastic dynamic program (DP) [21], and optimize the expected value of the long-run system performance. Nonetheless, in many practical scenarios, either the characteristics of the channel variations and energy/data arrival processes change over time, or it is not possible to have reliable statistical information about these processes before node deployment. For example, in a sensor field with solar EH nodes distributed over a forest, each node’s solar EH profile will depend on its location, and is subject to change based on the time of the day or the day of the week. To adapt the transmission scheme in real time, one should resort to learning-theoretic schemes, as they are capable of converging to an optimal transmission policy over time in the absence of prior knowledge on the statistics of the processes governing the communication system.

• Traffic type assumption (saturated/bursty):

Under the saturated traffic assumption, there are infinite data backlogs at the source, and the optimization objective is to improve the physical-layer performance (e.g., throughput, outage probability, or symbol error rate) by accounting only for the channel and energy state processes. When traffic is bursty, however, there is a need for a buffer where packets can be queued. The "emptying" rate of the buffer then becomes the "service" rate. A physical-layer model that only captures the variation of the channel and energy completely disregards this issue, and it can result in an arbitrarily long average waiting time of the packets at the source buffer. When the end-to-end delay is of interest, we need to track the source queue size that develops under bursty traffic generation, and the allocation of power at the relays should control the service rate to achieve delay optimization at the source data link layer.

The majority of the studies on EH relay systems lie within the offline optimization framework and assume a non-bursty source traffic type [7,9,10,13,14,15,18,19,20]. In [10], the problem of optimal power control for throughput maximization in an SRD network (one source-destination pair and one relay) is formulated as a nonlinear program in an offline setting. Both source and relay are harvesting entities, and the relay operates in half-duplex mode using the AaF protocol. A similar setup is considered in [7], but for the case that both source and relay nodes have their own data to transmit to the destination, and the optimization objective is to maximize the total throughput. Also, in [9], the transmit power is jointly optimized with relay selection to handle the case of multiple relays. In [13], source and relay power allocation is optimized for an SRD system with a full-duplex relay using the DaF protocol. Half-duplex DaF relaying is considered in [14], where it is assumed that only the source node can harvest energy. The case where both source and relay are EH nodes is handled in [15,18], while [20] considers two parallel EH relays (the so-called diamond relay channel [22]). It is also worth noting that, technically, the multi-relay case can be deemed equivalent to the OFDM relay with an individual power constraint in each subcarrier. Accordingly, the studies in [38] and [39] have proposed optimization schemes for data and energy cooperation in relay-enhanced OFDM systems.

Some studies [9,10,19] propose online throughput maximization for the case of saturated source traffic. In [19], for instance, a stochastic DP formulation is given for optimal online power allocation in the case of DaF relaying. In [10], the online power allocation problem is formulated as a Markov decision process (MDP) [23], and a computationally simple scheme is provided for the special case where power control at the nodes is limited to on-off switching. Again within the context of saturated source traffic, a recent study utilizes a solar-data-driven stochastic energy harvesting model in an MDP-based design, and obtains the optimal DaF relay power control policy to minimize the long-term average symbol error rate [35].

Under a bursty on-off Markovian traffic assumption, the study in [11] addresses online relay scheduling for EH wireless sensor networks. The problem is formulated as a partially observable MDP (POMDP) [24] in which the source node has to choose between direct and cooperative transmission modes depending on its own available energy and the states of its energy harvesting and event generation processes, using only partial knowledge of the relay’s state. Finally, in [17], a multi-source, single-relay cooperative network is considered where the traffic at the source nodes is assumed to be bursty and the forwarding protocol used by the relay is DaF. The transmit power of all nodes is assumed to be supplied by both conventional AC utility power and renewable energy. A distributed learning algorithm is proposed to minimize the sum of the average delays of the data flows by dynamic power, rate, and link selection control.

B. Motivation, Contributions and Outline

Most prior art on optimizing the performance of EH relay systems belongs to the realm of offline optimization, and primarily deals with the didactic single-relay scenario [7,10,11,13,14,15,18,19]. Also, the existing online schemes require explicit knowledge of the statistics of the system processes [9,10,11,19] and do not, in general, address the case of bursty traffic, where the optimization of the queueing delay is necessary. Unlike [17], in this paper, we consider an EH cooperative relay system consisting of multiple AaF relays which are powered solely by an energy harvesting storage with limited capacity. The source node, on the other hand, has a continuous power supply and maintains a data buffer for the bursty traffic flow towards the destination. We aim at proposing a learning-theoretic scheme to control the relays’ power consumption for optimizing the long-run average delay experienced by the source packets. Ideally, the learning mechanism should be able to dynamically control the transmit power at the relays in adaptation to the source buffer state information (SBSI) as well as the global channel state information (CSI) and energy state information (ESI) of the relays. This calls for a principled design based on a centralized stochastic DP formulation. However, such a scheme is doomed by the curse of dimensionality due to the huge space of global CSI and global ESI, as well as the exponential growth of the number of joint action combinations with the number of relays involved. Moreover, to gain access to the global state of the system, a centralized controller would induce heavy signaling overhead. Hence, it is far more practical to empower the relays with decentralized autonomy to make their own decisions based on immediate local feedback and partial observability of the system state (i.e., local CSI (LCSI) and local ESI (LESI)). These decisions are not trivial, since each relay faces uncertainty about the system state (channel, buffer, energy) and about the other relays’ actions and observations. To tackle these complications, we come up with a decentralized low-overhead solution by making the following contributions:

• We rigorously formulate the delay-optimal multi-relay power control problem as a partially observable identical payoff stochastic game (PO-IPSG) [25] that considers the abovementioned properties of the EH relay system. A PO-IPSG is a stochastic process that is collectively controlled by a group of independent agents who lack a central view of the global system state. Nevertheless, these agents have a shared objective; i.e., they are all interested in optimizing the utility of the team as a whole. The process is decentralized because none of the agents can control the whole process, and none of the agents has a full view of the global state. This readily corresponds to our setting in that we also assume all relays in the network collectively aim at minimizing the average number of packets waiting in the source buffer. Also, by making each relay’s power control policy adaptive to a partial view of the system consisting of SBSI, its LCSI, and its LESI, the formulated PO-IPSG can systematically trade off long-term energy efficiency and delay performance.

• Given our PO-IPSG formulation, we propose a distributed learning-theoretic power control (DLTPC) algorithm that can be used by the relays to learn their power control strategies in the absence of statistical knowledge regarding the dynamics of the channel, traffic, and energy processes. We construct DLTPC by building on and extending the classical results for gradient-based optimization of MDPs [27,28] and PO-IPSGs [25]. We show that our algorithm harmonizes the relays’ policies so that their collective behavior is provably convergent to (at least) a locally optimal solution of the PO-IPSG. As it turns out, DLTPC is a particularly lightweight algorithm, and its updates on the control policy induce only little source-relay signaling overhead with no explicit message exchange between the relays.

• By simulation, we show the sub-optimality gap between DLTPC and an MDP-based optimal policy that is armed with perfect statistics. The results show that DLTPC has a smaller performance gap to the centralized controller than existing suboptimal throughput maximizers for EH AaF multi-relay systems (e.g., [9]).

The rest of the paper is organized as follows. In Section II, we present the system model along with the general characteristics of the channel, traffic, and energy harvesting processes we assume in this paper. In Section III, we give our PO-IPSG-based formulation of the multi-relay delay optimization problem. In Section IV, the DLTPC algorithm is proposed for convergence to a locally optimal solution of the formulated PO-IPSG. Section V is dedicated to the comparative evaluation of the DLTPC algorithm. The paper ends with a concluding epilogue.

II. SYSTEM MODEL

In this section, we describe the two-hop relay communication system, as well as the channel, traffic, and energy harvesting models. As a notational convention, the time index appears as a subscript, while a relay’s index is always a superscript. Bold symbols are used for non-scalars (i.e., vectors or sets) at the social level, collecting quantities across all relays. A symbol associated with an individual relay (be it a scalar, a vector, or a set) is never in bold.

Fig. 1. A two-hop energy-harvesting cooperative relaying network.

A. Energy-Harvesting Relay Communication System

The system under consideration is a two-hop relay network with one source node $s$, $K$ energy-harvesting relay terminals (each denoted by $R^k$, $k \in \mathcal{K} \triangleq \{1, \ldots, K\}$), and one destination node $d$, as illustrated in Fig. 1. It is assumed that the source node’s signal cannot reach the destination directly due to its limited transmission radius; the source instead relies on the relays’ assistance to transmit to $d$. We assume that all relays operate in half-duplex mode. A two-phase AaF protocol is used for $s$-to-$d$ packet delivery; more specifically, each time slot $n$ is split into two sub-slots, each with duration $\tau/2$. In the first sub-slot, the source broadcasts its own data with full transmission power $a^s$ to the relay nodes. In the second sub-slot, according to the power control policy (defined in Section III.A and calculated by Algorithm 1), each relay decides whether to remain silent or to amplify the signal it has received from the source and forward it to $d$. It is further assumed that the second-hop transmissions by the relays are over orthogonal channels (e.g., using frequency division multiple access).

B. Channel and Physical Layer Model

We consider a frequency non-selective block fading model, where $c^{s,k} \in \mathcal{C}^{s,k}$ denotes the channel fading gain from node $s$ to relay $R^k$. We use $\mathcal{C}^{s,k}$ to refer to the local source-to-relay channel state information (LSR-CSI) space; similarly, $c^{k,d} \in \mathcal{C}^{k,d}$ denotes the channel gain on the $R^k$-$d$ link, and $\mathcal{C}^{k,d}$ represents the local relay-to-destination CSI (LRD-CSI) space. We define the local CSI (LCSI) space for the $k$-th relay as $\mathcal{C}^k = \mathcal{C}^{s,k} \times \mathcal{C}^{k,d}$, where $c_n^k = \langle c_n^{s,k}, c_n^{k,d} \rangle \in \mathcal{C}^k$ is referred to as relay $R^k$’s LCSI at the $n$-th time slot. Also, we use $\boldsymbol{\mathcal{C}} = \times_{k=1}^{K} \mathcal{C}^k$ to denote the space of the global CSI, collecting the channel gains across all the relays $R^k$, $k \in \mathcal{K}$.

Assumption 1. The global CSI $\boldsymbol{c}_n = \langle c_n^k \rangle_{k \in \mathcal{K}} \in \boldsymbol{\mathcal{C}}$ is quasi-static in each time slot. Furthermore, the process $\{\boldsymbol{c}_n\}_{n \in \mathbb{N}}$ is i.i.d. between slots with distribution $\mathbb{P}\{\boldsymbol{c}\}$. It is assumed that $\mathbb{P}\{\boldsymbol{c}\}$ is unknown and that each relay $R^k$ is only aware of its local CSI $c_n^k$ at time $n$, which can be estimated using channel reciprocity, assuming a time-division duplexing (TDD) system.

Let $x$ represent the broadcast information symbol with unit energy from node $s$. The signal received by $R^k$ is given by

$$y_n^{s,k} = \sqrt{a^s c_n^{s,k}}\, x + \eta,$$

where $\eta$ is the additive white Gaussian noise (AWGN). Without loss of generality, we assume that the noise power is the same over all links, denoted by $\sigma^2$. In phase 2, relay $R^k$ amplifies $y_n^{s,k}$ and forwards it to node $d$ with the chosen power $a_n^k \in \mathcal{A}^k$. The received signal $y_n^{k,d}$ at $d$ is as follows:

$$y_n^{k,d} = \sqrt{a_n^k c_n^{k,d}}\, x_n^{k,d} + \eta,$$

where $x_n^{k,d}$ is the signal sent from $R^k$ to $d$, normalized to have unit energy; i.e., $x_n^{k,d} = y_n^{s,k} / |y_n^{s,k}|$. Given the power profile $\boldsymbol{a}_n = \langle a_n^k \rangle_{k \in \mathcal{K}}$, the end-to-end AaF cooperative service rate is [34]:

$$r_n^{s,\mathcal{K},d} = \gamma_L W \log\left(1 + \frac{\sum_{k \in \mathcal{K}} \Gamma_n^{s,k,d}}{\Upsilon}\right), \tag{1}$$

where $W$ is the bandwidth for transmission, $\gamma_L$ denotes a bandwidth factor which is set to 1 for energy-constrained settings, $\Upsilon$ is a constant denoting the capacity gap, and

$$\Gamma_n^{s,k,d} = \frac{a_n^k\, a^s\, c_n^{s,k}\, c_n^{k,d}}{\sigma^2 \left(a^s c_n^{s,k} + a_n^k c_n^{k,d} + \sigma^2\right)} \tag{2}$$

is the relayed signal-to-noise ratio (SNR) for source node $s$, which is helped by relay node $R^k$.

C. Traffic Model and Source Buffer Dynamics
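Since the service rate in (1)-(2) of Section II.B acts as the emptying rate of the source buffer discussed in this subsection, a minimal numerical sketch of it may be helpful. The function and argument names below are our own illustrative choices, the channel gains are treated as power gains, the logarithm is assumed to be base 2 (the source writes only "log"), and the noise power is passed as a single number.

```python
import math

def relayed_snr(a_k, a_s, c_sk, c_kd, noise_power):
    """Per-relay AaF SNR following (2); gains are power gains (assumption)."""
    numerator = a_k * a_s * c_sk * c_kd
    denominator = noise_power * (a_s * c_sk + a_k * c_kd + noise_power)
    return numerator / denominator

def service_rate(powers, a_s, gains_sr, gains_rd, noise_power,
                 bandwidth, gamma_l=1.0, gap=1.0):
    """Cooperative service rate following (1); log base 2 is an assumption."""
    total_snr = sum(
        relayed_snr(a_k, a_s, c_sk, c_kd, noise_power)
        for a_k, c_sk, c_kd in zip(powers, gains_sr, gains_rd)
    )
    return gamma_l * bandwidth * math.log2(1.0 + total_snr / gap)
```

A relay that stays silent (power 0) contributes zero SNR, so the rate collapses to zero when every relay is inactive, which is what makes the power decisions delay-relevant.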

We assume there is one buffer at the source for the storage of packets. Let $l$ be the size of each packet and $A_n$ be the random number of new packets arriving in the $n$-th slot.

Assumption 2. The arrival process $\{A_n\}_{n \in \mathbb{N}}$ is i.i.d. with distribution $\mathbb{P}\{A\}$ and mean $\lambda = \mathbb{E}[A]$. Also, packet arrivals occur at the end of each time slot. It is further assumed that the specific form of $\mathbb{P}\{A\}$ is unknown a priori.

We use $b_n \in \mathcal{B}$ to denote the source buffer state information (SBSI), which is the number of packets in the source buffer at the beginning of the $n$-th time slot. $N_B$ denotes the maximum buffer size. When the buffer is full ($b_n = N_B$), new arrivals are dropped. Finally, the buffer dynamics follow Lindley’s equation:

$$b_{n+1} = \min\left(\left(b_n - \frac{\tau\, r_n^{s,\mathcal{K},d}}{2l}\right)^{+} + A_n,\; N_B\right), \tag{3}$$

where $(\cdot)^{+}$ stands for $\max(\cdot, 0)$.

D. Energy Harvesting and Relay Energy Storage Dynamics
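Before turning to the energy model, the buffer recursion (3) of Section II.C can be illustrated with a short sketch. The function name and argument names are hypothetical; the code simply transcribes (3) and does not round the served amount, since the source does not state a rounding rule.

```python
def next_buffer(b, rate, arrivals, tau, packet_size, n_b):
    """One step of the source buffer recursion (3).

    b           -- packets in the buffer at the start of the slot
    rate        -- cooperative service rate for the slot, as in (1)
    arrivals    -- packets arriving at the end of the slot
    tau         -- slot duration (the relays transmit for tau / 2)
    packet_size -- size l of one packet
    n_b         -- maximum buffer size N_B
    """
    served = tau * rate / (2.0 * packet_size)  # packets drained in the second sub-slot
    remaining = max(b - served, 0.0)           # the (.)^+ operator
    return min(remaining + arrivals, n_b)      # overflowing arrivals are dropped
```

For example, with `b = 5`, a rate that drains one packet per slot, and one arrival, the buffer stays at 5; with a full buffer and no service, further arrivals are dropped and the state remains at `n_b`.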

The energy harvesting process at each relay is modeled as a packet arrival process (e.g., see [37]) such that each energy packet is an integer multiple of a fundamental energy unit (EU). The relay $R^k$ is capable of harvesting a random number $H_n^k$ of energy packets from the environment at each time slot. The relay stores its harvested energy in its battery or a super-capacitor [26] with a finite capacity denoted by $N_E^k$ (energy packets), and all the energy harvested when the battery is full is lost. Also, the leakage within the battery or super-capacitor and the inefficiency in storing harvested energy are assumed to be negligible. Let $e_n^k \in \mathcal{E}^k$ be the amount of renewable energy in relay $R^k$’s energy storage at the beginning of the $n$-th time slot. We refer to $e_n^k$ as local energy state information (LESI). Also, we use $\boldsymbol{\mathcal{E}} = \times_{k=1}^{K} \mathcal{E}^k$ to denote the space of the global ESI, collecting all possible LESI combinations across all the relays. Similarly, $\boldsymbol{e}_n = \langle e_n^k \rangle_{k \in \mathcal{K}} \in \boldsymbol{\mathcal{E}}$ is referred to as the system’s global ESI at the $n$-th time slot.

Assumption 3. The arrival process $\{H_n^k\}_{n \in \mathbb{N}}$, $\forall k \in \mathcal{K}$, is i.i.d. with respect to $n$, and has distribution $\mathbb{P}\{H^k\}$ and mean $\mu^k = \mathbb{E}[H^k]$. We assume that the new energy arrivals are observed after the control actions are performed at each slot. It is assumed that $\mathbb{P}\{H^k\}$ and $\mathbb{E}[H^k]$ are unknown and each relay $R^k$ is only aware of its LESI $e_n^k$ at each time slot.

Let $a_n^k$ denote the power level chosen by relay $R^k$ at time $n$. The LESI dynamics for each relay $R^k$ are as follows:

$$e_{n+1}^k = \min\left(e_n^k - \frac{a_n^k \tau}{2} + H_n^k,\; N_E^k\right), \tag{4}$$

where $a_n^k$ must satisfy the following energy availability constraint:

$$\frac{a_n^k \tau}{2} \le e_n^k, \quad \forall k \in \mathcal{K}. \tag{5}$$

Finally, it is implicitly assumed that $a_n^k = 0$ means that relay $R^k$ remains inactive at time $n$.

III. PROBLEM FORMULATION
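As a recap of the energy model in Section II.D, which the formulation in this section builds on, the storage recursion (4) and the availability constraint (5) can be sketched as follows. The helper names and the representation of the power levels as a plain list are illustrative assumptions, not part of the original model description.

```python
def feasible_powers(power_levels, e, tau):
    """Power levels that satisfy the energy availability constraint (5)."""
    return [a for a in power_levels if a * tau / 2.0 <= e]

def next_energy(e, a, harvested, tau, n_e):
    """One step of the energy storage recursion (4) for a single relay."""
    spent = a * tau / 2.0  # the relay transmits only during the second sub-slot
    if spent > e:
        raise ValueError("chosen power violates the energy availability constraint (5)")
    return min(e - spent + harvested, n_e)  # energy beyond capacity N_E is lost
```

Because power level 0 always passes the check in `feasible_powers`, a relay with an empty storage can still take the "remain silent" action.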

In this section, we formulate a decentralized power control policy for the relays to cooperatively optimize the average delay incurred by the source packets. In our system model, the dynamics of the source buffer depend, in part, on the packet arrival intensity $\lambda$, but they also depend on the cooperative service rate $r^{s,\mathcal{K},d}$ it receives from the relays, which is affected by their channel states as well as their energy harvesting profiles. Accordingly, we define the power control policy at each relay to be adaptive to SBSI, as well as its LCSI and LESI. In particular, adaptation to LCSI is needed to opportunistically exploit the channel dynamics and gain more value for the power invested. SBSI-adaptability is needed to make the policy delay-aware under the conditions of unsaturated traffic and a finite-length buffer at the source. Finally, given that the relays rely on energy harvesting for their operation, their control policies are subject to instantaneous energy availability constraints. An LESI-adaptive policy avoids inadvertent consumption of the harvested energy, and increases the odds that on urgent occasions a larger number of relays are available for rendering their service (i.e., higher diversity order) and have more feasible power options at their disposal. Our formulation is founded on the assumption that the relays work towards a common goal, i.e., the optimization of the delay incurred by the source packets. Altogether, our setup comes down to the coupled interaction of a number of agents with identical interest in a Markovian environment, based on partial knowledge of the system state information and without explicit awareness of the action choices of the other agents. A systematic way to formulate this problem is to cast the system as a partially observable identical payoff stochastic game (PO-IPSG) [25].

We denote the PO-IPSG as a quintuple $\mathcal{G} = \langle \mathcal{K}, \boldsymbol{\mathcal{S}}, \boldsymbol{\mathcal{A}}, T, r \rangle$. $\boldsymbol{\mathcal{S}} = \mathcal{B} \times \boldsymbol{\mathcal{C}} \times \boldsymbol{\mathcal{E}}$ is the global system state space, where each $\boldsymbol{s}_n \in \boldsymbol{\mathcal{S}}$ denotes the global system state at the $n$-th time slot, i.e., $\boldsymbol{s}_n = \langle b_n, \boldsymbol{c}_n, \boldsymbol{e}_n \rangle$ consists of the SBSI, global CSI, and global ESI; likewise, we use $\mathcal{S}^k = \mathcal{B} \times \mathcal{C}^k \times \mathcal{E}^k$ to represent the space of partially observed system states from the viewpoint of relay $R^k$, $k \in \mathcal{K}$. Similarly, $s_n^k = \langle b_n, c_n^k, e_n^k \rangle$ denotes the $k$-th relay’s observed state at the $n$-th time slot. $\boldsymbol{\mathcal{A}}(\boldsymbol{e}) = \times_{k=1}^{K} \mathcal{A}^k(e^k)$, $\forall \boldsymbol{e} \in \boldsymbol{\mathcal{E}}$, is the battery state-dependent joint action space, i.e., the different combinations of feasible power levels which can be chosen by the relays (see (5)). The mapping $T: \boldsymbol{\mathcal{S}} \times \boldsymbol{\mathcal{A}} \times \boldsymbol{\mathcal{S}} \to [0,1]$ denotes the global state transition probabilities, and is discussed in more detail in Section III.B. Finally, $r: \boldsymbol{\mathcal{S}} \times \boldsymbol{\mathcal{A}} \times \boldsymbol{\mathcal{S}} \to \mathbb{R}$ is the instantaneous reward function, which is defined to be identical across all relays. More specifically, we define $r$ as a function of the number of vacant places in the source buffer; i.e.,

$$r(\boldsymbol{s}_n, \boldsymbol{a}_n, \boldsymbol{s}_{n+1}) = \nu (N_B - b_{n+1}), \tag{6}$$

where $\nu$ is a positive constant.

The dynamics of the game $\mathcal{G}$ proceed as follows: at each time slot $n$, each relay $R^k$ observes its local state $s_n^k$ and selects an action $a_n^k$ according to its power control policy $u^k$ (to be specified in Section III.A). A composite action profile $\boldsymbol{a}_n = \langle a_n^k \rangle_{k \in \mathcal{K}}$ from the joint action space $\boldsymbol{\mathcal{A}}$ is executed, the system probabilistically transitions to the next state $\boldsymbol{s}_{n+1}$ according to the law $T(\boldsymbol{s}_{n+1} | \boldsymbol{s}_n, \boldsymbol{a}_n)$, and all relays receive the identical reward $r(\boldsymbol{s}_n, \boldsymbol{a}_n, \boldsymbol{s}_{n+1})$. The system-wide objective is to maximize the value of the game, i.e., the long-run average of the received rewards.

A. Factored Control Policy
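Since each factor of the control policy discussed in this subsection acts on a relay’s local observation and is trained on the shared reward (6), a small sketch of these two ingredients may help fix ideas. The `GlobalState` container and the function names are hypothetical, and relays are indexed from 0 here purely for programming convenience.

```python
from typing import NamedTuple, Tuple

class GlobalState(NamedTuple):
    """Global state <b, c, e>: SBSI, per-relay LCSI pairs, per-relay LESI."""
    b: int
    c: Tuple[Tuple[float, float], ...]  # (source-to-relay, relay-to-destination) gains
    e: Tuple[int, ...]

def local_observation(state: GlobalState, k: int):
    """Partial view <b, c^k, e^k> available to relay k (0-based index)."""
    return (state.b, state.c[k], state.e[k])

def identical_reward(next_state: GlobalState, n_b: int, nu: float = 1.0) -> float:
    """Shared reward (6): nu times the number of vacant buffer places."""
    return nu * (n_b - next_state.b)
```

Every relay receives the same value from `identical_reward`, while `local_observation` differs per relay; this is the sense in which the payoff is identical but the observability is partial.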

We assume that the system is controlled by stationary policies. The stationarity of a policy implies that it depends on the history of the game only through the current state. Moreover, we parameterize the policy space by a set of continuous parameters $\boldsymbol{\Theta} \in \mathbb{R}^{\mathcal{D}}$ of some dimension $\mathcal{D}$. In particular, as we are interested in decentralized optimization with partial state observability by the relays, we restrict ourselves to the space of factored joint controllers $\boldsymbol{\mathcal{U}}_{\boldsymbol{\Theta}}$, where each $\boldsymbol{u}_{\boldsymbol{\Theta}} \in \boldsymbol{\mathcal{U}}_{\boldsymbol{\Theta}}$ is a probabilistic mapping of the form $\boldsymbol{u}_{\boldsymbol{\Theta}}: \boldsymbol{\mathcal{S}} \times \boldsymbol{\mathcal{A}} \to [0,1]$ and it holds that $\boldsymbol{u}_{\boldsymbol{\Theta}} = \prod_{k=1}^{K} u_{\theta^k}$. Basically, $\boldsymbol{\Theta}$ is defined to be the concatenation of individual relay policy parameters, i.e., $\boldsymbol{\Theta} = \langle \theta^1, \ldots, \theta^K \rangle$, and $u_{\theta^k}: \mathcal{S}^k \times \mathcal{A}^k \to [0,1]$ is relay $R^k$’s individual power control policy. $\theta^k$ is taken to be a $\mathcal{D}^k \triangleq |\mathcal{S}^k \times \mathcal{A}^k|$-dimensional vector of the form $\theta^k = \langle \theta_{s,a}^k \rangle_{s \in \mathcal{S}^k, a \in \mathcal{A}^k}$; i.e., the joint policy space is of dimension $\mathcal{D} = \sum_{k=1}^{K} \mathcal{D}^k$.

Remark 1: The factorization of action choice allows for parallel computation of the control policy by the relays as stated in Theorem 2 (Section IV). It also helps overcome the curse of dimensionality associated with the huge size of the joint state-action space $\boldsymbol{\mathcal{S}} \times \boldsymbol{\mathcal{A}}$; however, as argued in [25], a side effect is that only a subset of policies from the full space of joint policies (corresponding to, e.g., a central non-factored controller) can be represented. Hence, we can at best obtain the best policies from within the restricted space $\boldsymbol{\mathcal{U}}_{\boldsymbol{\Theta}}$.

A common way to express parametric policies in the literature (e.g., see [27]) is to assume a Gibbs-like distribution for the shape of $u_{\theta^k}(\cdot)$; more precisely, the probability of choosing power level $a \in \mathcal{A}^k(e)$ by relay $R^k$ in state $s = \langle b, c, e \rangle \in \mathcal{S}^k$ is expressed as follows:

$$u_{\theta^k}(a | s) = \frac{\exp(\theta_{s,a}^k)}{\sum_{a' \in \mathcal{A}^k(e)} \exp(\theta_{s,a'}^k)}. \tag{7}$$

Note that the denominator in (7) is ensured to be non-zero by always having $a = 0$ as a feasible choice.

B. State Transition Laws
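The policy that enters the transition law (8) of this subsection is the Gibbs distribution (7) of Section III.A, restricted to the feasible power levels. A minimal sketch of one relay’s policy for a single observed state is given below. Storing the parameters in a dictionary keyed by action is an illustrative assumption, and subtracting the maximum parameter before exponentiating is a standard numerical safeguard that we add here; it leaves the probabilities in (7) unchanged.

```python
import math

def gibbs_policy(theta_s, feasible_actions):
    """Action probabilities (7) for one observed state.

    theta_s          -- dict mapping each power level a to theta_{s,a}
    feasible_actions -- power levels allowed by the energy constraint (5)
    """
    shift = max(theta_s[a] for a in feasible_actions)  # numerical stabilization only
    weights = {a: math.exp(theta_s[a] - shift) for a in feasible_actions}
    normalizer = sum(weights.values())
    return {a: w / normalizer for a, w in weights.items()}
```

Infeasible power levels receive no probability mass at all, because they never enter the normalizing sum.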

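Assume a joint parametric control policy $\boldsymbol{u}_{\boldsymbol{\Theta}} \in \boldsymbol{\mathcal{U}}_{\boldsymbol{\Theta}}$ is given. The probabilistic dynamics of the system state can be characterized in terms of $\boldsymbol{u}_{\boldsymbol{\Theta}}$ and the mapping $T$, which denotes the controlled transition probabilities; more specifically, we have:

$$\mathbb{P}\{\boldsymbol{s}_{n+1} | \boldsymbol{s}_n, \boldsymbol{u}_{\boldsymbol{\Theta}}(\boldsymbol{a}_n | \boldsymbol{s}_n)\} = T(\boldsymbol{s}_{n+1} | \boldsymbol{s}_n, \boldsymbol{a}_n)\, \boldsymbol{u}_{\boldsymbol{\Theta}}(\boldsymbol{a}_n | \boldsymbol{s}_n), \tag{8}$$

where (recalling Assumption 1 on i.i.d. channels) we have:

$$T(\boldsymbol{s}_{n+1} | \boldsymbol{s}_n, \boldsymbol{a}_n) = \mathbb{P}\{\boldsymbol{c}_{n+1}\}\, T(b_{n+1} | \boldsymbol{s}_n, \boldsymbol{a}_n)\, T(\boldsymbol{e}_{n+1} | \boldsymbol{e}_n, \boldsymbol{a}_n), \tag{9}$$

and the source buffer state transition is as follows:

$$T(b_{n+1} | \boldsymbol{s}_n, \boldsymbol{a}_n) = \begin{cases} \mathbb{P}\left\{A_n = b_{n+1} - \left(b_n - \dfrac{\tau\, r_n^{s,\mathcal{K},d}}{2l}\right)^{+}\right\}, & b_{n+1} < N_B \\[2ex] \displaystyle\sum_{A = N_B - \left(b_n - \frac{\tau r_n^{s,\mathcal{K},d}}{2l}\right)^{+}}^{\infty} \mathbb{P}\{A_n = A\}, & b_{n+1} = N_B \end{cases} \tag{10}$$

For the probabilistic transition of the global ESI, we have $T(\boldsymbol{e}_{n+1} | \boldsymbol{e}_n, \boldsymbol{a}_n) = \prod_{k=1}^{K} T^k(e_{n+1}^k | e_n^k, a_n^k)$, where, in line with (4),

$$T^k(e_{n+1}^k | e_n^k, a_n^k) = \begin{cases} \mathbb{P}\left\{H_n^k = e_{n+1}^k - \left(e_n^k - \dfrac{\tau a_n^k}{2}\right)\right\}, & e_{n+1}^k < N_E^k \\[2ex] \displaystyle\sum_{H = N_E^k - \left(e_n^k - \frac{\tau a_n^k}{2}\right)}^{\infty} \mathbb{P}\{H_n^k = H\}, & e_{n+1}^k = N_E^k \end{cases} \tag{11}$$

C. System-Wide Objective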
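The steady-state objective considered in this subsection rests on the transition laws of Section III.B. As an illustration, the per-relay energy transition probability (11) can be evaluated as sketched below for a known harvesting distribution. Note that this is for illustration only: the learning scheme in the paper assumes the distribution is unknown, and the dictionary-based probability mass function `harvest_pmf` and the function name are hypothetical.

```python
def energy_transition_prob(e_next, e, a, tau, n_e, harvest_pmf):
    """Probability of moving from energy level e to e_next under power a, as in (11).

    harvest_pmf -- dict mapping a number of harvested energy packets to its
                   probability (assumed known here purely for illustration)
    """
    residual = e - a * tau / 2.0  # energy left after transmitting
    if residual < 0:
        raise ValueError("chosen power violates the energy availability constraint (5)")
    if e_next > n_e:
        return 0.0  # the storage cannot exceed its capacity N_E
    if e_next < n_e:
        return harvest_pmf.get(e_next - residual, 0.0)
    # e_next == n_e: every harvest that fills (or overfills) the storage
    return sum(p for h, p in harvest_pmf.items() if h >= n_e - residual)
```

Summing the returned values over all reachable `e_next` gives 1, since the overflow branch absorbs the tail of the harvesting distribution.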

As is common in infinite-horizon stochastic DP problems [21], we may seek policies that choose actions to optimize either the expected total discounted reward or the expected average reward per step. In this work, we opt for the time-averaged metric for the following reasons:

• The average reward criterion puts more emphasis on the long-run performance of the system and does not discount its future behavior; without prior knowledge, each byte of a file or voice packet is of equal significance, and it is hardly justified to discount later packets as inherently less important.

• Moreover, even if a formulation based on discounted-reward maximization is employed to trade off the delay experienced by recent and later packets, the discount factor needs to be chosen heuristically, which affects the performance of the derived power control policy.

• Finally, we set the goal in the PO-IPSG $\mathcal{G}$ to be the maximization of the long-run average number of empty slots in the source buffer. As we clarify in the sequel (see Remark 3), this time-averaged metric in our problem is naturally related to the mean waiting time in the source buffer, and correlates well with an objective judgment of the system performance.

Now that we have stated our rationale for choosing a time-averaged criterion, in Remark 2, we impose a mild assumption on the set of admissible policies in order to ensure that the time-average criterion is well-defined:

Remark 2:

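Similar to other literature in MDP [12][28], we restrict our consideration to unichain policies in this paper. The stationary policy $\mathbf{u}_{\mathbf{\Theta}}$ is said to be unichain if the controlled Markov chain $\{\mathbf{s}_n\}_{n \in \mathbb{N}}$ under $\mathbf{u}_{\mathbf{\Theta}}$ is ergodic [33]. In this case, $\{\mathbf{s}_n\}_{n \in \mathbb{N}}$ has a unique steady state probability distribution $\boldsymbol{\pi}$, where for all $\mathbf{s} \in \boldsymbol{\mathcal{S}}$, $\pi(\mathbf{s}) = \lim_{n \to \infty} \mathbb{P}(\mathbf{s}_n = \mathbf{s})$ [28]. Now, we may define the optimization objective as (12):

$$\max_{\mathbf{\Theta}} \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}}) \triangleq \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \mathbb{E}_{\mathbf{u}_{\mathbf{\Theta}}}\{r_n\} = \mathbb{E}_{\boldsymbol{\pi}}\{\nu(N_B - b)\}, \tag{12}$$

where $\mathbb{E}_{\boldsymbol{\pi}}$ denotes expectation w.r.t. the underlying probability $\boldsymbol{\pi}$. ∎

Remark 3: We have from the extended Little's law (cf. Lemma 1, [30]) that the long-run average delay $\bar{\mathcal{D}}(\mathbf{u}_{\mathbf{\Theta}})$ of the source packets under the (unichain) policy $\mathbf{u}_{\mathbf{\Theta}}$ verifies the following inequality:

$$\bar{\mathcal{D}}(\mathbf{u}_{\mathbf{\Theta}}) \le \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \frac{\mathbb{E}_{\mathbf{u}_{\mathbf{\Theta}}}\{b_n\}}{(1 - \mathbb{P}_{drop})\lambda},$$

where $\mathbb{E}_{\mathbf{u}_{\mathbf{\Theta}}}$ is the expectation under the stationary policy $\mathbf{u}_{\mathbf{\Theta}}$ and $\mathbb{P}_{drop}$ is the packet drop rate due to source buffer overflow. Here, we argue that since in practice we target reasonable (e.g., 0.1%) drop rates, it holds that $\mathbb{P}_{drop} \ll 1$, and therefore the following is a good approximation for the average delay:

$$\bar{\mathcal{D}}(\mathbf{u}_{\mathbf{\Theta}}) \approx \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \frac{\mathbb{E}_{\mathbf{u}_{\mathbf{\Theta}}}\{b_n\}}{\lambda}.$$

Furthermore, this approximation is asymptotically tight as the data buffer size increases. Therefore, for sufficiently large buffer size and in the low load regime, maximizing $\bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}})$ is a valid alternative to minimizing the average delay. ∎

Definition 1 (Local Optimum of PO-IPSG $\mathcal{G}$). A profile of power control policies $\mathbf{u}_{\mathbf{\Theta}^*} = \langle u_{\theta^{1*}}, \ldots, u_{\theta^{K*}} \rangle \in \boldsymbol{\mathcal{U}}_{\mathbf{\Theta}}$ is a local optimum of the game $\mathcal{G}$ if it satisfies the following condition: $\nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}^*}) = \vec{\mathbf{0}}$. ∎

Theorem 1. The gradient in Definition 1 can be computed as (13):

$$\nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}}) = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \frac{\nabla_{\mathbf{\Theta}} \mathbb{P}\{\mathbf{s}_{n+1} \mid \mathbf{s}_n, \mathbf{u}_{\mathbf{\Theta}}(\mathbf{a}_n \mid \mathbf{s}_n)\}}{\mathbb{P}\{\mathbf{s}_{n+1} \mid \mathbf{s}_n, \mathbf{u}_{\mathbf{\Theta}}(\mathbf{a}_n \mid \mathbf{s}_n)\}}\, Q(\mathbf{s}_n, \mathbf{a}_n), \tag{13}$$

where the function $Q(\cdot, \cdot)$ is the so-called differential reward function defined as follows:

$$Q(\mathbf{x}, \mathbf{y}) = \lim_{N \to \infty} \mathbb{E}_{\mathbf{u}_{\mathbf{\Theta}}}\left\{ \sum_{n=0}^{N-1} \left(r_n - \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}})\right) \,\middle|\, \mathbf{s}_0 = \mathbf{x}, \mathbf{a}_0 = \mathbf{y} \right\}. \tag{14}$$

Proof. The proof follows immediately from the derivation in [28, Section 3.2]. ∎

Note that (13) can be written in a more convenient form by realizing that, since the transition law $T$ in (8) does not depend on $\mathbf{\Theta}$:

$$\frac{\nabla_{\mathbf{\Theta}} \mathbb{P}\{\mathbf{s}_{n+1} \mid \mathbf{s}_n, \mathbf{u}_{\mathbf{\Theta}}(\mathbf{a}_n \mid \mathbf{s}_n)\}}{\mathbb{P}\{\mathbf{s}_{n+1} \mid \mathbf{s}_n, \mathbf{u}_{\mathbf{\Theta}}(\mathbf{a}_n \mid \mathbf{s}_n)\}} = \nabla_{\mathbf{\Theta}} \ln\left[\mathbb{P}\{\mathbf{s}_{n+1} \mid \mathbf{s}_n, \mathbf{u}_{\mathbf{\Theta}}(\mathbf{a}_n \mid \mathbf{s}_n)\}\right] = \nabla_{\mathbf{\Theta}} \ln\left[\mathbf{u}_{\mathbf{\Theta}}(\mathbf{a}_n \mid \mathbf{s}_n)\right]. \tag{15}$$

It is worth noting that a function such as $\nabla_{\mathbf{\Theta}} \ln[\mathbf{u}_{\mathbf{\Theta}}(\mathbf{a}_n \mid \mathbf{s}_n)]$, which is the gradient of a log-likelihood, is also known as a score function in classical statistics [31]. Finally,

$$\nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}}) = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} \nabla_{\mathbf{\Theta}} \ln\left[\mathbf{u}_{\mathbf{\Theta}}(\mathbf{a}_n \mid \mathbf{s}_n)\right] Q(\mathbf{s}_n, \mathbf{a}_n). \tag{16}$$

In what follows, we present a distributed learning-theoretic procedure to steer the relays' behavior towards a delay-optimal power control policy $\mathbf{u}_{\mathbf{\Theta}^*}$ in the sense of Definition 1.

IV. A MULTI-AGENT REINFORCEMENT LEARNING SOLUTION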

In our PO-IPSG formulation, it is desired that the relays make coordinated decisions despite their independence of one another and despite their lack of omniscience (i.e., each single relay is unaware of the other relays' local states and of the policies they are pursuing). In order to harmonize the relays' behavior, in this section, we present a distributed learning-theoretic power control (DLTPC) algorithm to be executed in parallel by each relay involved. In fully observable IPSGs, value function-based learning methods (e.g., [32]) have been proposed for discounted reward problems, which are convergent to the optimal Nash equilibrium. As for our PO-IPSG problem, however, we resort to policy search methods, which have been shown to be a reasonable alternative to value-based methods for partially observable environments [36]. In particular, we follow the lead of Peshkin et al. [25], who introduce a general method for using gradient ascent in multi-agent policy spaces to guarantee convergence to local optima (i.e., gradient-zero operating points) of the game. Through an informal analysis, it has been shown in [25] that when the search space is restricted to factored social policies $\boldsymbol{\mathcal{U}}_{\mathbf{\Theta}}$, joint gradient ascent performed by a central controller (with access to observation histories of the whole system) is equivalent to parallel gradient ascent performed by individual agents (with access only to their own partial view of the system history). Key to the argument in [25] is to show that: I) the parallel algorithm samples gradients $\nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}$ from the correct distribution, and II) the update increments used in gradient ascent are the same in the parallel algorithm as in the joint one. Moreover, to satisfy these two conditions, an underlying requirement is that the agents perform synchronized updates on the estimates of their own components of the global gradient vector. Although the study in [25] is conducted in the context of discounted reward PO-IPSGs, as we show in this paper, their line of argument can be extended to average-reward settings as well.
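The equivalence between joint and parallel gradient ascent rests on one property of factored policies: the log of a product policy is a sum of per-agent terms, so its gradient with respect to one agent's parameters involves that agent's factor only. The following Python sketch checks this numerically. It is only an illustration under stated assumptions: each agent uses a stateless softmax policy (the state dependence is dropped for brevity), and the helper names `softmax`, `log_joint`, `local_score` and `joint_score_fd` are hypothetical, not part of [25] or of this paper.

```python
import math

def softmax(prefs):
    """Softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def log_joint(thetas, actions):
    """Log-probability of a joint action under a factored policy:
    ln u(a) = sum_k ln u_k(a_k)."""
    return sum(math.log(softmax(th)[a]) for th, a in zip(thetas, actions))

def local_score(theta_k, a_k):
    """Analytic score of one softmax agent: d ln u_k(a_k) / d theta_k."""
    probs = softmax(theta_k)
    return [(1.0 if i == a_k else 0.0) - p for i, p in enumerate(probs)]

def joint_score_fd(thetas, actions, k, eps=1e-6):
    """Central finite-difference gradient of the joint log-policy
    with respect to agent k's parameters only."""
    grad = []
    for i in range(len(thetas[k])):
        plus = [list(th) for th in thetas]
        minus = [list(th) for th in thetas]
        plus[k][i] += eps
        minus[k][i] -= eps
        diff = log_joint(plus, actions) - log_joint(minus, actions)
        grad.append(diff / (2 * eps))
    return grad

# Two hypothetical agents with three power levels each (illustrative values).
thetas = [[0.2, -0.5, 1.0], [0.0, 0.3, -0.7]]
actions = [2, 0]
for k in range(len(thetas)):
    print(k, joint_score_fd(thetas, actions, k), local_score(thetas[k], actions[k]))
```

For each agent, the finite-difference gradient of the joint log-policy agrees with the score computed from that agent's own factor, which is why each agent can form its component of the gradient from local information.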
However, the discussion in [25] is more of an outline, lacking most details on the machinery of gradient estimation. We thus turn to standard techniques for estimating the gradient of the average reward in the MDP literature [27][28]. These algorithms typically exploit the regenerative structure of the system's underlying Markov process to obtain unbiased gradient estimates based on the observations made in between regeneration times (i.e., between visits to a certain recurrent state). Applied to our PO-IPSG formulation, corresponding to every global regenerative cycle, we may define a local cycle for each relay during which it collects local observations to form an estimate of its own component of the global gradient vector. We show that, at the expense of a very low signaling overhead, it can be arranged for the relays to agree on the termination of global regenerative cycles, thus satisfying the underlying requirement of synchronized updates in [25]. We then rigorously apply the line of argument in [25] to show that conditions I and II are satisfied by our derivation (see Theorem 2 in Section IV.A). Based on this result, in Section IV.B, we discuss the update rules to be executed iteratively by each relay, and present DLTPC's pseudo-code.

A. Decentralized Computation of the Performance Gradient

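Assume that the relay communication system is controlled via some factored joint parametric control policy $\mathbf{u}_{\mathbf{\Theta}} \in \boldsymbol{\mathcal{U}}_{\mathbf{\Theta}}$ (cf. Section III.A). The global system history is realized as an infinite-length trajectory of the form:

$$\mathbf{h}_\infty = [\mathbf{s}_0, \mathbf{a}_0, r_0, \mathbf{s}_1, \ldots, \mathbf{s}_{n-1}, \mathbf{a}_{n-1}, r_{n-1}, \mathbf{s}_n, \ldots] \in \boldsymbol{\mathcal{H}}_\infty \triangleq (\boldsymbol{\mathcal{S}} \times \boldsymbol{\mathcal{A}} \times \mathbb{R})^\infty.$$

Now, fix some $e^* \in \mathcal{E}^k, \forall k$, and let $\mathbf{e}^* \in \boldsymbol{\mathcal{E}}$ be the global ESI where $e_n^k = e^*, \forall k$; likewise, fix some $b^* \in \mathcal{B}$. Finally, let $\boldsymbol{\mathcal{S}}^* \triangleq \{\langle b^*, \mathbf{c}, \mathbf{e}^* \rangle, \forall \mathbf{c} \in \boldsymbol{\mathcal{C}}\}$. With $\{\mathbf{s}_n\}_{n \in \mathbb{N}}$ being ergodic, elements of $\boldsymbol{\mathcal{S}}^*$ recur infinitely often within any realization of the global system history. Let $t_m$ be the time of the $m$-th visit to $\boldsymbol{\mathcal{S}}^*$. We refer to the following portion of history:

$$\mathbf{h}_m^* = [\mathbf{s}_{t_m}, \mathbf{a}_{t_m}, r_{t_m}, \mathbf{s}_{t_m+1}, \ldots, \mathbf{s}_{t_{m+1}-1}, \mathbf{a}_{t_{m+1}-1}, r_{t_{m+1}-1}, \mathbf{s}_{t_{m+1}}]$$

as the $m$-th global renewal cycle ($m \ge 1$). Under Assumption 1 for CSI and by the regenerative property (e.g., see [29]), these pieces of system trajectory are i.i.d. We denote by $\ell(\mathbf{h}_m^*)$ the length of $\mathbf{h}_m^*$, which is equal to $\Delta t_m = t_{m+1} - t_m$. It is also convenient to introduce local versions of a renewal cycle observed through the prism of each relay $R_k$. In fact, corresponding to the $m$-th global renewal cycle $\mathbf{h}_m^*$, the relay $R_k$'s local renewal cycle is realized as follows:

$$h_m^{*,k} = [s_{t_m}^k, a_{t_m}^k, r_{t_m}, s_{t_m+1}^k, \ldots, s_{t_{m+1}-1}^k, a_{t_{m+1}-1}^k, r_{t_{m+1}-1}, s_{t_{m+1}}^k],$$

where, by definition of $t_m$, it holds that for all $k \in \mathcal{K}$: $s_{t_m}^k, s_{t_{m+1}}^k \in \mathcal{S}^{k*} \triangleq \{\langle b^*, c^k, e^* \rangle, \forall c^k \in \mathcal{C}^k\}$; i.e., $h_m^{*,k}$ is of the same length as $\mathbf{h}_m^*$. Now, more generally, define $\boldsymbol{\mathcal{H}}^*$ to be the space of all global renewal cycles; accordingly, $\mathcal{H}^{*,k}$ is used to refer to the space of all local renewal cycles for relay $R_k$. For $\mathbf{h}^* \in \boldsymbol{\mathcal{H}}^*$, it holds that:

$$\mathbb{P}(\mathbf{h}^* \mid \mathbf{\Theta}) = \prod_{n=0}^{\ell(\mathbf{h}^*)-1} T(\mathbf{s}_{[n+1,\mathbf{h}^*]} \mid \mathbf{s}_{[n,\mathbf{h}^*]}, \mathbf{a}_{[n,\mathbf{h}^*]})\, \mathbf{u}_{\mathbf{\Theta}}(\mathbf{a}_{[n,\mathbf{h}^*]} \mid \mathbf{s}_{[n,\mathbf{h}^*]}), \tag{17}$$

where the notation $x_{[n,\mathbf{h}^*]}$ is used to refer to the component of $x$ realized at time $n$ within $\mathbf{h}^*$.

Now, by the renewal-reward theorem (e.g., see [29]), the performance gradient $\nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}})$ defined in (16) can be calculated as follows:

$$\nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}}) = \frac{\mathbb{E}_{\mathbf{u}_{\mathbf{\Theta}}}\left\{ \sum_{n=0}^{\ell(\mathbf{h}^*)-1} \nabla_{\mathbf{\Theta}} \ln\left[\mathbf{u}_{\mathbf{\Theta}}(\mathbf{a}_{[n,\mathbf{h}^*]} \mid \mathbf{s}_{[n,\mathbf{h}^*]})\right] Q(\mathbf{s}_{[n,\mathbf{h}^*]}, \mathbf{a}_{[n,\mathbf{h}^*]}) \right\}}{\mathbb{E}_{\mathbf{u}_{\mathbf{\Theta}}}\{\ell(\mathbf{h}^*)\}}, \tag{18}$$

i.e., the expected total quantity earned during one cycle, normalized by the expected cycle duration. Similarly, the differential reward at time $n$ within $\mathbf{h}^*$ can be written as (19):

$$Q(\mathbf{x}, \mathbf{y}) = \mathbb{E}_{\mathbf{u}_{\mathbf{\Theta}}}\left\{ \sum_{j=n}^{\ell(\mathbf{h}^*)-1} \left(r_{[j,\mathbf{h}^*]} - \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}})\right) \,\middle|\, \mathbf{s}_{[n,\mathbf{h}^*]} = \mathbf{x}, \mathbf{a}_{[n,\mathbf{h}^*]} = \mathbf{y} \right\}. \tag{19}$$

Replacing $Q$ with its estimate $\hat{Q}(\mathbf{s}_{[n,\mathbf{h}^*]}, \mathbf{a}_{[n,\mathbf{h}^*]}) \triangleq \sum_{j=n}^{\ell(\mathbf{h}^*)-1} (r_{[j,\mathbf{h}^*]} - \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}}))$ in (18), we have:

$$\overrightarrow{\nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}} \triangleq \mathbb{E}_{\mathbf{u}_{\mathbf{\Theta}}}[\ell(\mathbf{h}^*)]\, \nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}}) = \sum_{\mathbf{h}^* \in \boldsymbol{\mathcal{H}}^*} \mathbb{P}(\mathbf{h}^* \mid \mathbf{\Theta}) \times \left\{ \sum_{n=0}^{\ell(\mathbf{h}^*)-1} \nabla_{\mathbf{\Theta}} \ln\left[\mathbf{u}_{\mathbf{\Theta}}(\mathbf{a}_{[n,\mathbf{h}^*]} \mid \mathbf{s}_{[n,\mathbf{h}^*]})\right] \hat{Q}(\mathbf{s}_{[n,\mathbf{h}^*]}, \mathbf{a}_{[n,\mathbf{h}^*]}) \right\}, \tag{20}$$

where, given that $\mathbb{E}_{\mathbf{u}_{\mathbf{\Theta}}}[\ell(\mathbf{h}^*)]$ is a positive number, $\mathbb{E}_{\mathbf{u}_{\mathbf{\Theta}}}[\ell(\mathbf{h}^*)]\, \nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}})$ can be viewed as the expected gradient direction, and the zeroes of $\overrightarrow{\nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}}$ are the same as those of $\nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}})$. Theorem 2 in the sequel establishes that the calculation of the direction of the performance gradient $\overrightarrow{\nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}}$ can be done in a decentralized manner across the relays; i.e., each relay can independently calculate its individual gradient direction $\overrightarrow{\nabla_{\theta^k} \bar{\mathcal{R}}}$ based on local information contained within its local renewal cycles $h^{*,k} \in \mathcal{H}^{*,k}$, and yet the ensemble of individual gradient directions recovers the whole vector $\overrightarrow{\nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}}$.

Theorem 2. Assume $\mathbf{u}_{\mathbf{\Theta}} \in \boldsymbol{\mathcal{U}}_{\mathbf{\Theta}}$. The gradient direction $\overrightarrow{\nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}}$ can be expressed as the vector:

$$\overrightarrow{\nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}} = \left\langle \overrightarrow{\nabla_{\theta^1} \bar{\mathcal{R}}}, \ldots, \overrightarrow{\nabla_{\theta^K} \bar{\mathcal{R}}} \right\rangle,$$

in which each component $\overrightarrow{\nabla_{\theta^k} \bar{\mathcal{R}}}$, $k \in \mathcal{K}$, is calculated as:

$$\overrightarrow{\nabla_{\theta^k} \bar{\mathcal{R}}} \triangleq \mathbb{E}_{\mathbf{u}_{\mathbf{\Theta}}}[\ell(\mathbf{h}^*)]\, \nabla_{\theta^k} \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}}) = \sum_{h^{*,k} \in \mathcal{H}^{*,k}} \mathbb{P}(h^{*,k} \mid \mathbf{\Theta}) \times \left\{ \sum_{n=0}^{\ell(h^{*,k})-1} \nabla_{\theta^k} \ln\left[u_{\theta^k}(a_{[n,h^{*,k}]} \mid s_{[n,h^{*,k}]})\right] \hat{Q}(s_{[n,h^{*,k}]}, a_{[n,h^{*,k}]}) \right\}, \tag{21}$$

and,

$$\hat{Q}(s_{[n,h^{*,k}]}, a_{[n,h^{*,k}]}) \triangleq \sum_{j=n}^{\ell(h^{*,k})-1} \left(r_{[j,h^{*,k}]} - \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}})\right). \tag{22}$$

Proof. Please see Appendix A. ∎

In essence, Theorem 2 states that if, at each renewal cycle, all relays $R_k$, $k \in \mathcal{K}$, update their policy parameters $\theta^k$ in parallel along the gradient direction sampled from their distribution $\mathbb{P}(h^{*,k} \mid \mathbf{\Theta})$, then the parameter vector $\mathbf{\Theta}$ gets updated along the gradient direction sampled from $\mathbb{P}(\mathbf{h}^* = \langle h^{*,1}, \ldots, h^{*,K} \rangle \mid \mathbf{\Theta})$; i.e., the distributed algorithm is sampling from the correct distribution. Also, due to factorization, the update increments $\overrightarrow{\nabla_{\theta^k} \bar{\mathcal{R}}}$ to be used in relay $R_k$'s gradient ascent are independent of the parameters in other relays' policies. Hence, the policy learning and control can be distributed among relays without requiring that they be informed of each other's states and choices of actions.

B. Distributed Learning-Theoretic Power Control (DLTPC)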

In this section, we present DLTPC (Algorithm 1), our distributed learning-theoretic power control scheme, which can lead the relays' collective behavior to a locally optimal delay performance. DLTPC relies on sample estimates of the performance gradient obtained during the actual system run-time to perform gradient ascent in policy space. Hence, our algorithm does not need explicit knowledge of the CSI, SBSI, and ESI statistics, and is an instance of model-free learning. This is as opposed to exact gradient ascent, which requires explicit knowledge of the transition laws $T$ to analytically compute the gradient direction. In DLTPC, each relay updates its policy parameter $\theta_m^k$ at the end of each renewal cycle, i.e., between visits to $\boldsymbol{\mathcal{S}}^*$ (see (27) in Algorithm 1). To understand (27), note that according to (21) and (22), we can use:

$$F_m^k \triangleq \sum_{n=t_m}^{t_{m+1}-1} \left. \frac{\partial \ln\left[u_{\theta^k}(a_n^k \mid s_n^k)\right]}{\partial \theta^k} \right|_{\theta^k = \theta_m^k} \sum_{j=n}^{t_{m+1}-1} \left(r_j - \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}})\right), \tag{23}$$

as the $m$-th cycle estimate of $\overrightarrow{\nabla_{\theta^k} \bar{\mathcal{R}}}$, which is obtained by each relay $R_k$ from the sample renewal cycle $h_m^{*,k}$. Now, to allow for a more efficient recursive implementation of the summation (23) in Algorithm 1, we exchange the order of summation and rewrite $F_m^k$ as follows:

$$F_m^k = \sum_{n=t_m}^{t_{m+1}-1} \left(r_n - \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}})\right) \sum_{j=t_m}^{n} \left. \frac{\partial \ln\left[u_{\theta^k}(a_j^k \mid s_j^k)\right]}{\partial \theta^k} \right|_{\theta^k = \theta_m^k}, \tag{24}$$

which makes it possible to incrementally construct $F_m^k$ using the transient quantities $z_n^k$ and $g_n^k$ before reaching the end of each cycle. Accordingly, equation (27) in the pseudo-code is basically the standard rule for stochastic gradient ascent, in which the parameter $\alpha_m \in \mathbb{R}^+$ denotes a learning rate. Also, similarly to [27], $\bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}})$ in (24) is replaced by its estimate $\hat{\mathcal{R}}_m$, which is also updated at each renewal cycle via the recursion (25):

$$\hat{\mathcal{R}}_{m+1} := \hat{\mathcal{R}}_m + \alpha_m \sum_{n=t_m}^{t_{m+1}-1} \left(r_n - \hat{\mathcal{R}}_m\right). \tag{25}$$

Equation (25) is a stochastic approximation of the average reward $\bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}})$, and is consistent with the observation that for the $m$-th cycle, it holds:

$$\bar{\mathcal{R}}_{\mathbf{\Theta}_m} (\approx \hat{\mathcal{R}}_m) = \frac{\mathbb{E}_{\mathbf{u}_{\mathbf{\Theta}}}\left\{ \sum_{n=t_m}^{t_{m+1}-1} r_n \right\}}{\mathbb{E}_{\mathbf{u}_{\mathbf{\Theta}}}\{\Delta t_m\}}. \tag{26}$$

Theorem 3.

Choose $\alpha_m$ such that the sequence $\{\alpha_m\}$ is diminishing (i.e., $\alpha_m \to 0$ as $m \to \infty$), non-summable (i.e., $\sum_m \alpha_m = \infty$), but square-summable (i.e., $\sum_m \alpha_m^2 < \infty$). Also, consider the sequence of parameters $\{\mathbf{\Theta}_m\}$ generated by Algorithm 1. Then, $\{\hat{\mathcal{R}}_m\}$ converges (with probability 1), and the profile of power control policies $\{\mathbf{u}_{\mathbf{\Theta}_m}\}$ converges to a local optimum of PO-IPSG $\mathcal{G}$; i.e., $\nabla_{\mathbf{\Theta}} \bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}_m}) \to 0$ as $m \to \infty$ (w.p. 1).

Proof. With this setup, DLTPC's update equations in (27) and (28) are exactly along the lines of the single-agent iterates in ([27], Eqs. (15) and (16)); hence, the convergence of the gradient components (with respect to $\theta^k, \forall k$) of the performance measure $\bar{\mathcal{R}}(\mathbf{u}_{\mathbf{\Theta}_m})$ to zero can be established via the same arguments made in ([27], Proposition 3). Combine this with Theorem 2 to conclude. ∎

Algorithm 1.

Distributed Learning-Theoretic Power Control

Initialization: Set iteration index $n := 0$, renewal cycle index $m := 0$, initial transient differential reward $\hat{Q}_0 := 0$, initial estimate for the average reward $\hat{\mathcal{R}}_0 := 0$;
Initialize parameter vector $\theta_0^k$ randomly and set $z_0^k := \vec{\mathbf{0}}$, $g_0^k := \vec{\mathbf{0}}$, $\forall k \in \mathcal{K}$;
Source $s$ broadcasts data and its buffer state $b_0$;
while (TRUE)
    for each relay $k \in \mathcal{K}$ do
        1) Choose power $a_n^k \sim u_{\theta_m^k}(\cdot \mid s_n^k)$;
        2) Transmit data to destination $d$ with power $a_n^k$;
        3) Inform $s$ only if battery level $e_{n+1}^k$ has reached $e^*$;
        4) Receive data from $s$ along with the next buffer state $b_{n+1}$, and the cycle termination signal
           $\sigma_n \triangleq 1$ if $\mathbf{e}_{n+1} = \mathbf{e}^*$ and $b_{n+1} = b^*$, and $\sigma_n \triangleq 0$ otherwise;
        5) Update transient quantities for gradient and differential reward:
           // Calculate immediate reward:
           $r_n := \nu(N_B - b_{n+1})$;
           // Update the transient differential reward estimate:
           $\hat{Q}_{n+1} := \hat{Q}_n + (r_n - \hat{\mathcal{R}}_m)$;
           // Update the transient gradient estimate:
           $z_{n+1}^k := z_n^k + \left. \partial \ln[u_{\theta^k}(a_n^k \mid s_n^k)] / \partial \theta^k \right|_{\theta^k = \theta_m^k}$;
           $g_{n+1}^k := g_n^k + (r_n - \hat{\mathcal{R}}_m)\, z_{n+1}^k$;
           if ($\sigma_n == 1$) // The end of the $m$-th renewal cycle
               // Update policy parameter:
               $\theta_{m+1}^k := \theta_m^k + \alpha_m g_{n+1}^k$; (27)
               // Update the average reward estimate:
               $\hat{\mathcal{R}}_{m+1} := \hat{\mathcal{R}}_m + \alpha_m \hat{Q}_{n+1}$; (28)
               // Reset transient quantities:
               $g_{n+1}^k := \vec{\mathbf{0}}$, $\hat{Q}_{n+1} := 0$, $z_{n+1}^k := \vec{\mathbf{0}}$;
               // Update the cycle index:
               $m := m + 1$;
           end if
    end for
    $n := n + 1$; // Update the time index.
end while

C. Discussion and Directions for Future Research
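As a concrete companion to Algorithm 1, the per-relay bookkeeping of steps 1 and 5 can be sketched in Python. This is a minimal sketch under assumptions, not the authors' implementation: the policy is taken to be a tabular softmax over discrete power levels (consistent with the score function quoted for the form in (7), but (7) itself is not reproduced here), local states and actions are plain integer indices, parameters start at zero rather than at random values, and the class name `RelayLearner` and its methods are hypothetical. The signaling of steps 2 to 4 is abstracted into the `reward` and `cycle_end` arguments.

```python
import math
import random

class RelayLearner:
    """Per-relay bookkeeping of Algorithm 1 (steps 1 and 5), assuming a
    tabular softmax policy over discrete power levels."""

    def __init__(self, n_states, n_actions, alpha=2.5e-4, seed=0):
        self.rng = random.Random(seed)
        self.n_actions = n_actions
        # Assumption: zero initialization (the paper initializes randomly).
        self.theta = [[0.0] * n_actions for _ in range(n_states)]
        self.z = [[0.0] * n_actions for _ in range(n_states)]  # accumulated score
        self.g = [[0.0] * n_actions for _ in range(n_states)]  # gradient estimate
        self.q_hat = 0.0   # transient differential reward estimate
        self.r_hat = 0.0   # average-reward estimate
        self.alpha = alpha

    def policy(self, s):
        """Softmax distribution over power levels in local state s."""
        prefs = self.theta[s]
        m = max(prefs)
        exps = [math.exp(p - m) for p in prefs]
        total = sum(exps)
        return [e / total for e in exps]

    def act(self, s):
        """Step 1: sample a power level from the current policy."""
        return self.rng.choices(range(self.n_actions), weights=self.policy(s))[0]

    def update(self, s, a, reward, cycle_end):
        """Step 5: update Q-hat, z and g; apply (27)-(28) when a cycle ends."""
        probs = self.policy(s)              # evaluated at the current theta_m
        delta = reward - self.r_hat
        self.q_hat += delta
        for i, p in enumerate(probs):       # score is non-zero only in row s
            self.z[s][i] += (1.0 if i == a else 0.0) - p
        for row_g, row_z in zip(self.g, self.z):
            for i in range(self.n_actions):
                row_g[i] += delta * row_z[i]
        if cycle_end:
            for row_t, row_g in zip(self.theta, self.g):
                for i in range(self.n_actions):
                    row_t[i] += self.alpha * row_g[i]      # update (27)
            self.r_hat += self.alpha * self.q_hat          # update (28)
            self.q_hat = 0.0
            for row in self.z + self.g:                    # reset transients
                for i in range(self.n_actions):
                    row[i] = 0.0
```

A relay would call `act` once per slot, and then `update` with the common reward and the termination signal received from the source. Note that the policy parameters stay fixed within a cycle, so every score in a cycle is evaluated at the same parameter value, as in (23) and (24).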

In this section, we give a few remarks about the underlying assumptions in this paper, and discuss how relaxing these assumptions can serve as a basis for future research. The first issue has to do with our assumption on altruistic participation of the relays in forwarding the source signal. In fact, a relay's willingness to cooperate is taken for granted, and our game-theoretic formulation is only a means to perform decentralized coordination and control and not a means of cooperation stimulation. A potential future direction, thus, includes extensions to systems with self-interested relaying terminals, where acquiring service from the relays requires an incentive mechanism.

The second issue concerns the extension of our system model to the case where the source node also uses a state-dependent law to control its transmit power for minimizing the delay at its queue. While ideally the source power should be treated as yet another “degree of freedom”, we argue that such an extension is non-trivial, as an adaptive source would induce non-stationary dynamics on the power adjustment procedure performed by the relays. In fact, proposing a systematic mechanism for jointly controlling the source and relays' power is beyond the scope of this paper, since we cannot naively consider the source node as another player in our PO-IPSG formulation. Therefore, in Section II, we have explicitly restricted our system model to the case where the source transmits with a constant power (e.g., the maximum allowed power). That being said, there are some fair justifications in support of our simplifying assumption: the source node in our system model does not rely on harvested energy but is instead connected to a fixed power supply. Also, no direct communication link is assumed between the source and the destination node. As such, it is fairly reasonable that the source can tap into its energy supply to power its transmission with little concern for replenishment of its energy budget. Several works in the context of EH relay systems likewise assume a fixed source power when the source node is a non-harvesting entity [8].

Finally, we need to discuss the case of buffer-aided relaying where the relay nodes have data queues as well. Cooperative networks with buffer-aided relays have the advantage that their achievable diversity is not bottlenecked by transmission order (unlike the stream-like communication in the conventional case where at each time slot, signal transmission starts from source and is then relayed to the destination) [41]. However, these relays may also incur larger packet delays which can be quite diverse for different packets. Hence, from the application point of view, the lack of a data buffer at the relays in our work can be justified by arguing that it is to advocate a simple relay design while also minimizing packet delay which is desirable in certain applications. There are also some technical complications in the way of extending the proposed approach to the case of relays with buffers: Reasonably enough, in buffer-aided relaying, it is typically the case that at each slot, only one relay is selected for either transmission or reception. This necessitates an explicit link selection mechanism which does not fit well with the collaborative all-playing nature of our PO-IPSG formulation and its identical-payoff structure. The systematic way to account for buffer-aided relaying is again a formulation based on stochastic dynamic programming; however, in order to come up with a realistic scalable solution, we need to take on a different approach for problem decomposition. There are some studies along this line (e.g., see [17]) which address delay optimization in the context of buffer-aided relaying by exploiting the structural properties inherent to the problem.

The setup considered in [17], however, only consists of a single relay, which gives the problem a nice weakly coupled structure amenable to decomposition into sub-problems.

V. PERFORMANCE EVALUATION

In this section, we evaluate the performance of our proposed DLTPC algorithm for decentralized power control in EH multi-relay systems. We compare DLTPC's performance with three other power control schemes:

(a) Centralized MDP with perfect statistics: We assume that an MDP controller exists which is aware of the probability distributions of the channel fading $\mathbb{P}\{\mathbf{c}\}$, traffic arrival $\mathbb{P}\{A\}$, as well as the energy arrival processes $\mathbb{P}\{H_k\}$ for all relays $k \in \mathcal{K}$. Armed with this knowledge, one can use standard solution methods (e.g., relative value iteration [23]) to solve for an optimal joint power control policy $\mathbf{u}: \boldsymbol{\mathcal{S}} \to \boldsymbol{\mathcal{A}}$, which maximizes an average reward measure defined similarly to (12). While in principle this method can obtain superior performance compared to DLTPC, it suffers from both the curse of dimensionality and the curse of modeling, and therefore has no practical relevance. However, the reward measure obtained using this procedure can serve as an upper bound against which to compare DLTPC's performance.

(b) Harvesting rate (HR) assisted scheme [9]: The online-HR scheme proposed in [9] is a centralized online (suboptimal) algorithm for joint relay selection and power allocation in multi-relay AaF EH cooperative communication systems. However, unlike DLTPC, online-HR assumes infinite backlog at the source (saturated traffic assumption), and aims at maximizing the throughput. In order to make online decisions, the approach in [9] uses the causal information of ESI and CSI, but also needs the statistics of the harvesting and channel processes. The setup in [9] considers the case where the source node is also an EH entity; therefore, in our simulation, we remove this restriction and assume a continuous power supply for the source to make it comparable with DLTPC. At each slot, using the knowledge of the mean harvesting rate and average channel SNRs, online-HR first determines the transmit power of the relays via a closed-form formula, and then a simple (centralized) optimization is solved to determine the relay with the maximum throughput.

(c) Naive scheme [9]: This algorithm is also centralized and online; however, it does not require the statistics of the harvesting and channel processes. At each time slot, the relays use their stored energies as their transmit powers. Using these transmit powers, the equivalent SNRs for all links are calculated. Then, the relay with the maximum equivalent SNR among all is selected to forward the signal to the destination.

In what follows, we first compare the computational complexity of DLTPC with Online-HR and Online-Naive, and then present our numerical results in Section V.B.

A. Comparison of Computational Complexity

At each time step, the Online-HR algorithm [9] has to compute the maximum system throughput achievable by every relay and then select the relay with the best value. Hence, its complexity is $O(K)$ in each time step (i.e., linear in the number of relays). The Online-Naive algorithm also has a complexity of $O(K)$ per time step, as it needs to select the relay which provides the maximum equivalent SNR among all the relays. Both of these algorithms are centralized and need to gather global information from the whole network for their operation. On the other hand, our DLTPC is a particularly lightweight algorithm, working with minimal signaling overhead between source and relays (see steps 3 and 4 in Algorithm 1). The algorithm's update rules are written in terms of efficient recursive formulae, which lead to negligible complexity. Also, if the policy function for each relay is chosen to have the convenient form in (7), the score function at step 5 can simply be calculated (component-wise, for each pair $(s, a)$) as:

$$\left. \frac{\partial \ln\left[u_{\theta^k}(a_n^k \mid s_n^k)\right]}{\partial \theta^k} \right|_{\theta^k = \theta_m^k} = \begin{cases} 1 - \left. u_{\theta^k}(a \mid s) \right|_{\theta^k = \theta_m^k}, & a = a_n^k,\ s = s_n^k \\ -\left. u_{\theta^k}(a \mid s) \right|_{\theta^k = \theta_m^k}, & a \ne a_n^k,\ s = s_n^k \\ 0, & s \ne s_n^k. \end{cases}$$

Therefore, at each time step, DLTPC needs just a few standard algebraic operations, along with one random number generation to calculate the next action.

B. Numerical Evaluation

We consider a setup with a total of $K$ relays, where the slot duration is $\tau = 2$ ms. We assume Poisson packet arrival with mean rate $\lambda$ pkt/ms, and the packet size is 1024 bytes. The total bandwidth is $W = 2.5$ MHz. The source buffer is quantized to have 10 states (i.e., $N_B = 9$ pkts). Moreover, we assume that all relays harvest energy according to a Poisson energy arrival process with mean rate $\mu_k$ (in energy pkt/ms), $\forall k$, and the renewable energy is stored in a battery with maximum capacity $N_E^k = 4$ (energy pkts). The source transmission power is fixed at 5 (energy pkt/ms). Although our algorithm does not use the knowledge of the channel model, for the purpose of experiments, we simulate Rayleigh fading for each link. In this model, the channel states $c^{s,k}$ and $c^{k,d}$ ($\forall k$) are exponentially distributed random variables. However, as we consider a finite number of possible states, quantization is used to discretize the channel states. In particular, all the channel states are quantized into six probability bins with the boundaries specified as: {(-∞, -5.41 dB), [-5.41 dB, -1.59 dB), [-1.59 dB, -0.08 dB), [-0.08 dB, 1.42 dB), [1.42 dB, 3.18 dB), [3.18 dB, ∞)}. Over these bins, the stochastic evolution of channel states is i.i.d. across time and independent across users. This discretization of channel states has been justified in [40]. We choose $\langle b^*, \mathbf{c}, \mathbf{e}^* \rangle = \langle N_B, \cdot, (N_E^k)_k \rangle$ as the recurrent state marking the renewal cycles for DLTPC. Also, the initial learning rate is taken to be $\alpha_0 = 2.5 \times 10^{-4}$, and is diminished every 100 renewal cycles by a factor of 0.9. Fig. 2 plots the progression of the average source buffer length over time under DLTPC along with the two other suboptimal policies. The mean data arrival rate is fixed at 2.0 pkt/ms. As can be seen, both the online-HR and online-naive schemes converge much more quickly, but are outperformed by DLTPC in the limit. In Fig. 3, we plot the policy of all relays (for one particular state-action pair) as the joint policy is driven towards a local optimum of the PO-IPSG.
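The channel quantization described in the setup above can be illustrated with a short Python sketch. It maps a linear-scale channel gain to one of the six bins and computes the probability mass of each bin. The unit-mean exponential distribution for the channel gain is an assumption made here for illustration (the setup states only that the channel states are exponentially distributed), and the function names are hypothetical.

```python
import math

# Bin boundaries (in dB) quoted in the simulation setup.
BOUNDARIES_DB = [-5.41, -1.59, -0.08, 1.42, 3.18]

def db_to_linear(x_db):
    """Convert a power ratio from dB to linear scale."""
    return 10.0 ** (x_db / 10.0)

def exp_cdf(x, mean=1.0):
    """CDF of an exponential random variable with the given mean."""
    return 1.0 - math.exp(-x / mean)

def bin_probabilities(boundaries_db, mean=1.0):
    """Probability mass of each quantization bin for an exponential gain."""
    cdf = [0.0] + [exp_cdf(db_to_linear(b), mean) for b in boundaries_db] + [1.0]
    return [hi - lo for lo, hi in zip(cdf[:-1], cdf[1:])]

def quantize(gain_linear, boundaries_db):
    """Index (0..5) of the bin containing a linear-scale channel gain."""
    gain_db = 10.0 * math.log10(gain_linear)
    return sum(gain_db >= b for b in boundaries_db)

probs = bin_probabilities(BOUNDARIES_DB)
print([round(p, 3) for p in probs])
```

Under the unit-mean assumption, the quoted boundaries give bin probabilities of roughly 1/4, 1/4, 1/8, 1/8, 1/8 and 1/8, which suggests that the bins were chosen as quantiles of the fading distribution.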

Fig. 2. Progression of average source buffer length.

Fig. 3. Progression of power control policies.

Fig. 4 illustrates the average number of occupied slots in the source buffer under various traffic intensities ($\lambda$ is varied from 1 pkt/ms to 2 pkt/ms). As a general trend, the source buffer gets more occupied as the packet arrival rate increases. As expected, the MDP controller has the best performance among the four schemes. However, compared to the other two suboptimal policies, our SBSI-adaptive DLTPC algorithm maintains a smaller sub-optimality gap.

Fig. 4. The impact of input traffic intensity on delay performance.

Next, we investigate the impact of the relays' harvesting rate $\mu_k$ and battery capacity $N_E^k$ on delay performance. The mean Poisson data packet arrival rate is assumed to be pkt/ms. In Fig. 5, we assume that the mean Poisson energy arrival rate for all relays is 0.25 energy pkt/ms, and plot the average number of occupied slots in the source buffer for different values of battery size $N_E^k$ (from 4 to 8 energy pkts). The delay performance generally improves as battery capacity increases. However, DLTPC and online-HR can better exploit the enlarged energy storage than the naive policy.

Fig. 5. The impact of energy storage capacity on delay performance.

Fig. 6. The impact of energy harvesting rate on delay performance.

In Fig. 6, we fix the battery size $N_E^k$ to 4 energy packets, and instead vary the mean Poisson energy arrival rate for all relays from 0.25 to 0.45 energy pkt/ms. As expected, the source buffer receives a higher service rate as the relays' harvesting rate increases. In both plots, it is observed that our DLTPC algorithm maintains a smaller performance gap with respect to the centralized MDP controller than the two heuristics.

VI. CONCLUSION

The design of new protocols for cooperative networks with energy harvesting (EH) nodes is a promising research direction that incorporates cooperative benefits (diversity, capacity, etc.) with the energy harvesting concept. In pure EH relay systems, the nodes run on the energy harvested from the environment, and so are limited by their generation and storage capacities. This together with the stochastic nature of the profile of the harvested energy calls for the design of novel control policies which optimally utilize the power for meeting

IEEE Transactions on Vehicular Technology 12 the application demands. However, the majority of the existing schemes have considered the case of single-relay SRD systems, and have focused on the optimization of the physical layer throughput by assuming non-bursty traffic arrival at the source. Also, the dominant methodologies for the optimization of these systems have been either offline optimizations assuming the availability of acausal information on the exact energy arrival instants and amounts, or online optimizations which rely on precise statistical knowledge of the system. In this paper, we considered an EH relaying system consisting of a bursty source with finite data buffer size whose transmission is cooperatively assisted by multiple EH relays. In order to optimize the average delay experienced by the source packets, we proposed a learning-theoretic solution which operates in the absence of prior knowledge of the statistics of the channel variation, traffic arrival and energy harvesting processes. The proposed method is highly decentralized and induces very low control overhead. Numerical evaluations demonstrated the superior delay performance of our solution compared to existing heuristics. A

PPENDIX

A P

ROOF OF T HEOREM

$$\hat{Q}\big(\boldsymbol{s}_{[n,\boldsymbol{h}^{*}]},\boldsymbol{a}_{[n,\boldsymbol{h}^{*}]}\big)=\hat{Q}\big(s_{[n,h^{*,k}]},a_{[n,h^{*,k}]}\big). \tag{29}$$

Now, by substituting $\boldsymbol{u}_{\boldsymbol{\Theta}}=\prod_{i=1}^{K}u_{\theta_i}$ in (20), it holds that:

\begin{align}
\mathbb{E}_{\boldsymbol{u}_{\boldsymbol{\Theta}}}\big[\ell(\boldsymbol{h}^{*})\big]\,\nabla_{\theta_k}\bar{\mathcal{R}}(\boldsymbol{u}_{\boldsymbol{\Theta}})
&=\sum_{\boldsymbol{h}^{*}\in\boldsymbol{\mathcal{H}}^{*}}\mathbb{P}(\boldsymbol{h}^{*}|\boldsymbol{\Theta})\times\Bigg\{\sum_{n=0}^{\ell(h^{*,k})-1}\nabla_{\theta_k}\ln\Bigg[\prod_{i=1}^{K}u_{\theta_i}\big(a_{[n,h^{*,i}]}\big|s_{[n,h^{*,i}]}\big)\Bigg]\hat{Q}\big(s_{[n,h^{*,k}]},a_{[n,h^{*,k}]}\big)\Bigg\} \tag{30}\\
&=\sum_{\boldsymbol{h}^{*}\in\boldsymbol{\mathcal{H}}^{*}}\mathbb{P}(\boldsymbol{h}^{*}|\boldsymbol{\Theta})\times\Bigg\{\sum_{n=0}^{\ell(h^{*,k})-1}\Bigg[\sum_{i=1}^{K}\nabla_{\theta_k}\ln\Big[u_{\theta_i}\big(a_{[n,h^{*,i}]}\big|s_{[n,h^{*,i}]}\big)\Big]\Bigg]\hat{Q}\big(s_{[n,h^{*,k}]},a_{[n,h^{*,k}]}\big)\Bigg\} \tag{31}\\
&=\sum_{\boldsymbol{h}^{*}\in\boldsymbol{\mathcal{H}}^{*}}\mathbb{P}(\boldsymbol{h}^{*}|\boldsymbol{\Theta})\times\Bigg\{\sum_{n=0}^{\ell(h^{*,k})-1}\nabla_{\theta_k}\ln\Big[u_{\theta_k}\big(a_{[n,h^{*,k}]}\big|s_{[n,h^{*,k}]}\big)\Big]\hat{Q}\big(s_{[n,h^{*,k}]},a_{[n,h^{*,k}]}\big)\Bigg\}, \tag{32}
\end{align}

where the last equality is due to $\nabla_{\theta_k}\ln\big[u_{\theta_i}\big(a_{[n,h^{*,i}]}\big|s_{[n,h^{*,i}]}\big)\big]=0$ for all $i\neq k$. Now, the entire term within the curly brackets in (32) can be written as a function $\phi(\cdot)$ of relay $k$'s local renewal cycle $h^{*,k}$; i.e.,

$$\phi(h^{*,k})\triangleq\sum_{n=0}^{\ell(h^{*,k})-1}\nabla_{\theta_k}\ln\Big[u_{\theta_k}\big(a_{[n,h^{*,k}]}\big|s_{[n,h^{*,k}]}\big)\Big]\hat{Q}\big(s_{[n,h^{*,k}]},a_{[n,h^{*,k}]}\big).$$

Also, given that the global renewal cycle $\boldsymbol{h}^{*}$ can be described as the collection $\langle h^{*,1},\ldots,h^{*,K}\rangle$ of local renewal cycles across all relays, we have:

\begin{align}
\sum_{\boldsymbol{h}^{*}\in\boldsymbol{\mathcal{H}}^{*}}\mathbb{P}(\boldsymbol{h}^{*}|\boldsymbol{\Theta})\,\phi(h^{*,k})
&=\sum_{\langle h^{*,1},\ldots,h^{*,K}\rangle\in\boldsymbol{\mathcal{H}}^{*}}\mathbb{P}\big(\langle h^{*,1},\ldots,h^{*,K}\rangle\big|\boldsymbol{\Theta}\big)\,\phi(h^{*,k}) \notag\\
&=\sum_{h^{*,k}\in\mathcal{H}^{*,k}}\Bigg[\sum_{\langle h^{*,1},\ldots,h^{*,k-1},h^{*,k+1},\ldots,h^{*,K}\rangle}\mathbb{P}\big(\langle h^{*,1},\ldots,h^{*,K}\rangle\big|\boldsymbol{\Theta}\big)\Bigg]\phi(h^{*,k}) \notag\\
&=\sum_{h^{*,k}\in\mathcal{H}^{*,k}}\mathbb{P}(h^{*,k}|\boldsymbol{\Theta})\,\phi(h^{*,k}). \tag{33}
\end{align}

Hence, it follows that:

$$\overrightarrow{\nabla_{\theta_k}\bar{\mathcal{R}}}=\sum_{h^{*,k}\in\mathcal{H}^{*,k}}\mathbb{P}(h^{*,k}|\boldsymbol{\Theta})\times\Bigg\{\sum_{n=0}^{\ell(h^{*,k})-1}\nabla_{\theta_k}\ln\Big[u_{\theta_k}\big(a_{[n,h^{*,k}]}\big|s_{[n,h^{*,k}]}\big)\Big]\hat{Q}\big(s_{[n,h^{*,k}]},a_{[n,h^{*,k}]}\big)\Bigg\}. \tag{34}$$

REFERENCES

[1] J. N. Laneman and G. W. Wornell, “Energy Efficient Antenna Sharing and Relaying for Wireless Networks,” in Proc. IEEE WCNC, pp. 7–12, Chicago, IL, Oct. 2000.

[2] M. Dai, P. Wang, S. Zhang, B. Chen, H. Wang, X. Lin, and C. Sun, “Survey on Cooperative Strategies for Wireless Relay Channels,” Trans. Emerging Tel. Tech., vol. 25, no. 9, pp. 926–942, 2014.

[3] A. Ikhlef, D. S. Michalopoulos, and R. Schober, “Buffers Improve the Performance of Relay Selection,” in Proc. IEEE GLOBECOM, pp. 1–6, Houston, TX, Dec. 2011.

[4] K. J. R. Liu, A. K. Sadek, W. Su, and A. Kwasinski, Cooperative Communications and Networking. Cambridge, U.K.: Cambridge Univ. Press, 2009.

[5] A. Kansal, J. Hsu, S. Zahedi, and M. B. Srivastava, “Power Management in Energy Harvesting Sensor Networks,” ACM Trans. Embedd. Comput. Syst., vol. 6, no. 4, pp. 1–35, Sep. 2007.

[6] K-H. Liu and P. Lin, “Toward Self-Sustainable Cooperative Relays: State of the Art and the Future,” IEEE Comms. Mag., vol. 53, no. 6, pp. 56–62, June 2015.

[7] Y. Xia, H. Chen, L. Fan, and F. Dai, “Optimal Power Control for Source and Relay in Energy Harvesting Relay Networks,” in Proc. CHINACOM’13, pp. 942–947, Guilin, Aug. 2013.

[8] B. Medepally and N. B. Mehta, “Voluntary Energy Harvesting Relays and Selection in Cooperative Wireless Networks,” IEEE Trans. Wireless Commun., vol. 9, no. 11, pp. 3543–3553, Nov. 2010.

[9] I. Ahmed, A. Ikhlef, R. Schober, and R. K. Mallik, “Joint Power Allocation and Relay Selection in Energy Harvesting AaF Relay Systems,” IEEE Wireless Commun. Lett., vol. 2, no. 2, Apr. 2013.

[10] A. Minasian, S. ShahbazPanahi, and R. S. Adve, “Energy Harvesting Cooperative Communication Systems,” IEEE Trans. Wireless Commun., vol. 13, no. 11, pp. 6118–6131, Nov. 2014.

[11] H. Li, N. Jaggi, and B. Sikdar, “Relay Scheduling for Cooperative Communications in Sensor Networks with Energy Harvesting,” IEEE Trans. Wireless Commun., vol. 10, no. 9, pp. 2918–2928, Sep. 2011.

[12] I. Krikidis, T. Charalambous, and J. S. Thompson, “Stability Analysis and Power Optimization for Energy Harvesting Cooperative Networks,” IEEE Signal Proc. Lett., vol. 19, no. 1, pp. 20–23, 2012.

[13] D. Gunduz and B. Devillers, “Two-Hop Communication with Energy Harvesting,” in Proc. IEEE CAMSAP, pp. 201–204, Dec. 2011.

[14] Y. Luo, J. Zhang, and K. B. Letaief, “Optimal Scheduling and Power Allocation for Two-Hop Energy Harvesting Communication Systems,” IEEE Trans. Wireless Commun., vol. 12, no. 9, pp. 4729–4741, Sep. 2013.

[15] O. Orhan and E. Erkip, “Energy Harvesting Two-Hop Networks: Optimal Policies for the Multi-Energy Arrival Case,” in Proc. SARNOFF, pp. 1–6, Newark, NJ, May 2012.

[16] Z. Ding, S. M. Perlaza, I. Esnaola, and H. V. Poor, “Power Allocation Strategies in Energy Harvesting Wireless Cooperative Networks,” IEEE Trans. Wireless Commun., vol. 13, no. 2, Feb. 2014.

[17] F. Zhang and V. K. N. Lau, “Delay-Optimal Multi-Flow Buffered Decode-and-Forward Relay Communications with Limited Renewable Energy Storage,” in Proc. ASILOMAR, pp. 1351–1355, Pacific Grove, CA, 2012.

[18] C. Huang, R. Zhang, and S. Cui, “Throughput Maximization for the Gaussian Relay Channel with Energy Harvesting Constraints,” IEEE J. Sel. Areas Commun., vol. 31, no. 8, pp. 1469–1479, Aug. 2013.

[19] I. Ahmed, A. Ikhlef, R. Schober, and R. K. Mallik, “Power Allocation for Conventional and Buffer-Aided Link Adaptive Relaying Systems with Energy Harvesting Nodes,” IEEE Trans. Wireless Commun., vol. 13, no. 3, Mar. 2014.

[20] O. Orhan and E. Erkip, “Energy Harvesting Two-Hop Communication Networks,” IEEE J. Sel. Areas Commun., vol. 33, no. 12, pp. 2658–2670, Dec. 2015.

[21] D. P. Bertsekas, Dynamic Programming and Optimal Control, vol. II, 4th ed. Massachusetts: Athena Scientific, 2012.

[22] B. Schein and R. G. Gallager, “The Gaussian parallel relay network,” in Proc. Intl. Symp. Inf. Theory, Sorrento, Italy, Jun. 2000.

[23] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York: Wiley-Interscience, 2005.

[24] A. Cassandra, L. Kaelbling, and M. Littman, “Acting Optimally in Partially Observable Stochastic Domains,” in Proc. Natl. Conf. Artif. Intell., vol. 2, pp. 1023–1028, 1994.

[25] L. Peshkin, K-E. Kim, N. Meuleau, and L. P. Kaelbling, “Learning to Cooperate via Policy Search,” in Proc. Conf. Uncertain. Artif. Intell., pp. 489–496, 2000.

[26] F. Simjee and P. H. Chou, “Everlast: Long-Life, Supercapacitor-Operated Wireless Sensor Node,” in Proc. ISLPED, pp. 197–202, 2006.

[27] P. Marbach and J. N. Tsitsiklis, “Simulation-Based Optimization of Markov Reward Processes,” IEEE Trans. Aut. Contr., vol. 46, no. 2, pp. 191–209, 2001.

[28] X. R. Cao, Stochastic Learning and Optimization: A Sensitivity-Based Approach. New York: Springer, 2007.

[29] P. Bremaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. New York: Springer, 1999.

[30] Y. Cui, V. K. N. Lau, R. Wang, H. Huang, and S. Zhang, “A Survey on Delay-Aware Resource Control for Wireless Systems—Large Deviation Theory, Stochastic Lyapunov Drift, and Distributed Stochastic Learning,” IEEE Trans. Inf. Theory, vol. 58, no. 3, 2012.

[31] R. Y. Rubinstein and A. Shapiro, Discrete Event Systems: Sensitivity Analysis and Stochastic Optimization via the Score Function Method. New York: Wiley, 1993.

[32] X. Wang and T. Sandholm, “Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games,” Adv. Neural Inf. Process. Syst. (NIPS), vol. 15, pp. 1571–1578, Dec. 9–14, 2002.

[33] J. N. Tsitsiklis, “NP-Hardness of Checking the Unichain Condition in Average Cost MDPs,” Oper. Res. Lett., vol. 35, no. 3, pp. 319–323, May 2007.

[34] J. N. Laneman, D. N. C. Tse, and G. W. Wornell, “Cooperative Diversity in Wireless Networks: Efficient Protocols and Outage Behavior,” IEEE Trans. Inf. Theory, vol. 50, no. 12, pp. 3062–3080, Dec. 2004.

[35] M-L. Ku, W. Li, Y. Chen, and K. J. R. Liu, “On Energy Harvesting Gain and Diversity Analysis in Cooperative Communications,” IEEE J. Sel. Areas Commun., vol. 33, no. 12, pp. 2641–2657, 2015.

[36] P. L. Bartlett and J. Baxter, “Stochastic optimization of controlled partially observable Markov decision processes,” in Proc. 39th IEEE CDC, pp. 124–129, 2000.

[37] O. Ozel, K. Tutuncuoglu, J. Yang, S. Ulukus, and A. Yener, “Transmission with energy harvesting nodes in fading wireless channels: optimal policies,” IEEE J. Sel. Areas Commun., vol. 29, no. 8, pp. 1732–1743, 2011.

[38] X. Huang, T. Han, and N. Ansari, “On green-energy-powered cognitive radio networks,” IEEE Commun. Surv. Tutor., vol. 17, no. 2, pp. 827–842, 2015.

[39] X. Huang and N. Ansari, “Data and energy cooperation in relay-enhanced OFDM systems,” in Proc. IEEE ICC, 2016.

[40] H. Wang and N. Mandayam, “A Simple Packet Transmission Scheme for Wireless Data over Fading Channels,” IEEE Trans. Commun., vol. 52, no. 7, pp. 1055–1059, 2004.

[41] G-X. Li, C. Dong, D. Liu, G. Li, and Y. Zhang, “Outage Analysis of Dual-Hop Transmission with Buffer Aided Amplify-and-Forward Relay,” in Proc. 80th IEEE VTC, pp. 1–5, 2014.

V. Hakami received his B.S. degree in computer engineering (software) and his M.S. and Ph.D. degrees in information technology (computer networking), all from Amirkabir University of Technology (AUT), Tehran, Iran, in 2004, 2008, and 2015, respectively. In 2016, he joined the Department of Computer Engineering, Iran University of Science and Technology (IUST), Tehran, Iran, as an assistant professor. His current research mainly focuses on cognitive control of computer networks using stochastic control theory and game-theoretic learning.

M. Dehghan (M’10) received his B.S. degree in computer engineering from Iran University of Science and Technology (IUST), Tehran, Iran, in 1992, and M.S. and Ph.D. degrees from Amirkabir University of Technology (AUT), Tehran, Iran, in 1995 and 2001, respectively. He is an associate professor of computer engineering and information technology at Amirkabir University of Technology (AUT). Before joining AUT in 2004, he was a research scientist at Iran Telecommunication Research Center (ITRC) working in the area of network quality-of-service and management. His research interests are in wireless networks, pattern recognition, fault-tolerant computing, and distributed systems.