Scheduling to Minimize Age of Synchronization in Wireless Broadcast Networks with Random Updates
Haoyue Tang, Student Member, IEEE, Jintao Wang, Senior Member, IEEE, Zihan Tang, Jian Song, Fellow, IEEE
Abstract: In this work, we consider a wireless broadcast network in which a base station (BS) sends random time-sensitive information updates to multiple users under a bandwidth constraint. To measure data desynchronization between the BS and the users, the metric Age of Synchronization (AoS) is adopted. It measures the time elapsed since the freshest information at the receiver became desynchronized. The AoS minimization scheduling problem is formulated as a discrete-time Markov decision process (MDP), and the optimal solution is approximated through structural finite-state policy iteration. An index-based heuristic scheduling policy based on the restless multi-arm bandit (RMAB) is provided to further reduce the computational complexity. Simulation results show that the proposed index policy achieves performance comparable to the MDP solution and close to the AoS lower bound. Our work indicates that, to obtain a small AoS over the entire network, users with larger transmission success probability and smaller random update probability are more likely to be scheduled at smaller AoS.
Index Terms: Age of information, age of synchronization, Markov decision processes, Whittle’s index.
I. INTRODUCTION
The design of next-generation mobile and wireless communication networks is driven partly by the need for mission-critical services such as real-time control and the Internet of Things (IoT). Moreover, the proliferation of mobile devices has boosted the need to enhance the timeliness of services such as instant chatting, mobile ads, and social update notifications. These applications require that each user possess fresh data about the information they are interested in.

To measure data freshness from the perspective of the receiver when update packets are generated randomly by external actions or environments, e.g., in databases and error alarm systems, the metric called Age of Synchronization (AoS) was proposed [3]. By definition, AoS measures the time elapsed since the freshest information at the receiver became desynchronized. Compared with the metric called Age of Information (AoI), which measures the time elapsed since the freshest information at the receiver was generated, the AoS accounts for whether the source being tracked has actually changed, while AoI measures the combination of the content update inter-generation duration and content desynchronization. To better explain their differences, consider a database synchronization problem as an example. Suppose the file stored at the receiver is the same as the one in the remote database, which has not been changed or updated by external users or the environment for a long time. By definition, AoS equals 0 because the files at the receiver and at the remote database are synchronized. However, AoI can be very large due to the long interval between two remote database updates, and thus the file desynchronization status cannot be inferred directly from AoI. AoS is therefore a more appropriate metric for studying database desynchronization. Similar scenarios can be found in monitoring systems [4] and web crawling problems [5]. Due to the aforementioned differences between AoI and AoS, scheduling strategies that aim at minimizing AoI may not guarantee good AoS performance. Thus, it is important to study scheduling strategies that achieve good AoS performance.

Manuscript received March 21, 2019, revised September 1, 2019 and November 24, 2019, accepted February 20, 2020. The authors are with the Department of Electronic Engineering, Tsinghua University, Beijing 100084, China and Beijing National Research Center for Information Science and Technology (BNRist). J. Wang and J. Song are also with Research Institute of Tsinghua University in Shenzhen, Shenzhen, 518057. (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). This work has been presented in part at the 2019 IEEE International Symposium on Information Theory [1]. (Corresponding author: Jintao Wang)

This work was supported in part by the National Key R&D Program of China under Grant 2017YFE0112300, Beijing National Research Center for Information Science and Technology under Grant BNR2019RC01014 and BNR2019TD01001 and the Tsinghua University Tutor Research Fund.

Data freshness optimization has received a lot of attention in communication system design. The problem of minimizing AoI has been investigated in coding [6]–[8], physical layer design [9], [10] and network optimization [11]–[20]. When the source keeps changing all the time and the update packets carrying those updates can be generated at will, a centralized scheduling algorithm to optimize AoI performance in networks with interference constraints was first studied in [11].
A theoretical lower bound on AoI performance is derived and various scheduling policies are proposed to approach the bound in [12]. When the generation of update packets cannot be controlled at will and updates appear in a stochastic manner, theoretical performance and scheduling algorithms have been studied in [17]–[21]. However, the scheduling policies provided in [17]–[20] are based on error-free transmission, while in practical wireless communication scenarios packet loss may happen due to channel fading and decoding errors. Moreover, those studies all focus on minimizing AoI. The metric AoS, which measures content desynchronization alone, is more appropriate for measuring the desynchronization status of caches, databases and error alarm systems [22], [23]. Although AoS has been used to measure the staleness of replicated databases and its performance has been studied under several updating strategies in [22], [23], the effect of packet loss has not been taken into account. Besides, the update strategies in the above works are unaware of database changes, i.e., a scheduling decision may take place even if there is no change in the database content.

To fill this gap, we aim at designing scheduling policies to minimize the expected AoS of an unreliable wireless broadcast network, when the generation of update packets cannot be controlled at will and updates arrive stochastically because of the external environment. Unlike previous work that considers update randomness at the transmitter but no transmission randomness at the receiver [17]–[19], we consider double randomness at both the transmitter and the receiver. Our contributions are summarized as follows:
• We derive a theoretical lower bound on the AoS performance in error-prone wireless networks when the updates of each source follow an i.i.d. Bernoulli distribution.
• The AoS scheduling problem is reformulated as a Markov decision process (MDP). We exploit the switching structure of the optimum policy by analyzing the monotonicity and submodularity of the value function. The optimum solution is approximated through finite-state policy iteration.
• To overcome the computational complexity of the MDP solution, we propose a heuristic index-based algorithm by reformulating the scheduling problem as a restless multi-arm bandit (RMAB). We prove that each bandit is indexable and derive the closed-form expression of the Whittle’s index. Simulation results show that the Whittle’s index policy can achieve AoS performance close to the MDP solution and the AoS lower bound.

The remainder of this paper is organized as follows. The network model and the two metrics, AoI and AoS, are introduced and compared in Section II, where the overall scheduling problem is formulated and the AoS lower bound is derived. In Section III, we reformulate the problem as a Markov decision process and propose a structural policy iteration to approximate the MDP solution. In Section IV, we propose an index-based algorithm based on the restless multi-arm bandit. Simulations are provided in Section V, and Section VI draws the conclusion.
Notations: Vectors are written in boldface letters. The probability of event $\mathcal{A}$ conditioned on $\mathcal{B}$ is denoted by $\Pr(\mathcal{A}|\mathcal{B})$, and the expectation with regard to random variable $X$ conditioned on random variable $Y$ is denoted by $\mathbb{E}_X[f(X)|Y]$. Vector $\mathbf{e}_i$ denotes the vector whose $i$-th element is 1 and whose remaining elements are 0.

II. SYSTEM OVERVIEW
In this section, we introduce the system model, formulate the overall scheduling problem and derive the lower bound of the scheduling problem.
A. Network Model
We consider a wireless broadcast network with a base station (BS) holding update information of $N$ randomly changing sources and broadcasting it to $N$ users. Consider a discrete-time scenario, where $t \in \{1, \cdots, T\}$ denotes the index of the current slot. An update of source $n$ appears independently and identically with probability $\lambda_n \in (0, 1]$ in each time slot. Let the indicator $\Lambda_n(t) \in \{0, 1\}$ denote whether an update of source $n$ happens during slot $t$. If $\Lambda_n(t) = 1$, an update occurs during slot $t$, and it can be broadcast by the BS at the beginning of slot $t+1$. We assume each user is interested in the information from the corresponding source, i.e., user $n$ is only interested in information about source $n$, $n \in \{1, \cdots, N\}$.

At the beginning of each slot, the BS schedules the transmission of information updates over error-prone wireless links. Since each user is only interested in the freshest information about the corresponding source, we assume the BS only keeps the latest update of each source, i.e., new update packets replace the older update packets of the same source. We use the indicator $u_n(t) \in \{0, 1\}$ to denote scheduling actions. If user $n$ is not scheduled, then $u_n(t) = 0$. If $u_n(t) = 1$, the newest update from source $n$ is transmitted, and user $n$ receives the packet by the end of slot $t$ if the transmission succeeds. We assume that packet erasure is a memoryless Bernoulli process and that user $n$ has a fixed channel characterized by the Bernoulli packet success probability $p_n$. An error-free acknowledgment sent from user $n$ reaches the BS instantaneously if the transmission succeeds. In each slot, the amount of data to be broadcast must be smaller than the channel capacity and the available bandwidth (otherwise all transmissions fail). Similar to [12], [13], [19], in this work we assume the BS attempts to broadcast at most one update in each time slot:
$$\sum_{n=1}^{N} u_n(t) \le 1. \tag{1}$$

B. Age of Information and Age of Synchronization
To introduce the concepts of AoI and AoS, we consider a single-source discrete-time scenario as an example. We first review their definitions and then discuss how the AoS evolves depending on the scheduling decisions $\{u_n(t)\}$ and the update randomness $\{\Lambda_n(t)\}$. Suppose the $i$-th update packet of the source is generated during slot $g_i$. Let $r_i$ be the receiving time-stamp of the corresponding packet. If the packet is not received by the user, then $r_i = +\infty$.

The AoI measures the time elapsed since the latest update at the receiver was generated [2]. Let $q(t) = \max_{i \in \mathbb{N}^+} \{ i \mid r_i < t \}$ be the index of the latest update at the receiver at the beginning of slot $t$, which was generated in slot $g_{q(t)}$. By definition, the AoI $h(t)$ at the beginning of slot $t$ can be computed by:
$$h(t) = t - g_{q(t)}. \tag{2}$$
The AoI evolution for stochastic arrivals transmitted through an unreliable communication channel can be found in [21].

When $\lambda_n = 1$, new update packets of source $n$ appear in every slot and thus the source keeps changing. AoS and AoI are the same metric in this scenario. In the following analysis, we will compare our derivations with results on AoI in this special case to show the relationship between the two metrics.

The AoS describes how long the information at the receiver has been desynchronized from the source [3]. Notice that $q(t)+1$ is the index of the earliest source update generated after the freshest information stored at the receiver at the beginning of slot $t$. Let $s(t)$ be the AoS at the beginning of slot $t$. By definition:
$$s(t) = (t - g_{q(t)+1})^+, \tag{3}$$
where $(\cdot)^+ = \max\{0, \cdot\}$. According to Eq. (3), if no new update arrives after the generation time-stamp of the latest refresh of the user, i.e., $g_{q(t)+1} \ge t$, then $s(t) = 0$. The sample paths of AoI and AoS of a source are depicted in Fig. 1.
As the figure shows, the AoS remains zero until a fresh update arrives at the source and then increases linearly while the content stored at the receiver is desynchronized from the source, whereas the AoI keeps increasing as long as no update has been received. The difference between AoI and AoS is the reference object. The AoS measures data freshness relative to the content of the random update source and accounts for whether the process being tracked has actually changed, while the AoI measures the time difference between the present and the generation time-stamp of the receiver’s current content.
Fig. 1. On the top, sample sequences representing time-stamps of update arrivals (upward magenta arrows), update sending decisions (green circles) and update received time-stamps (downward brown arrows). On the bottom, sample paths of AoI (blue) and AoS (red).
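The definitions in Eqs. (2) and (3) can be illustrated with a minimal Python sketch. The function name `aoi_aos`, the list-based encoding of $g_i$ and $r_i$, and the sample values are assumptions made for illustration only; the sketch also assumes that update 0 was delivered before slot 1, so that $q(t)$ is always defined.

```python
import math


def aoi_aos(t, gen, recv):
    """Return (AoI, AoS) at the beginning of slot t for a single source.

    gen[i]  : slot g_i in which update i was generated (increasing in i).
    recv[i] : slot r_i by whose end update i was received, or math.inf if
              the update was never received (hypothetical encoding).
    """
    # q(t): index of the latest update received before slot t.
    received = [i for i, r in enumerate(recv) if r < t]
    if not received:
        raise ValueError("no update has been received before slot t")
    q = max(received)

    # Eq. (2): time elapsed since the freshest received update was generated.
    aoi = t - gen[q]

    # Eq. (3): time elapsed since the next source update, i.e. since the
    # receiver became desynchronized; zero if no newer update exists yet.
    next_gen = gen[q + 1] if q + 1 < len(gen) else math.inf
    aos = max(0, t - next_gen)
    return aoi, aos


# Illustrative sample path: updates generated in slots 0, 3, 5, 9;
# only updates 0 and 2 are ever delivered.
gen = [0, 3, 5, 9]
recv = [0, math.inf, 7, math.inf]
for t in (2, 5, 8, 11):
    print(t, aoi_aos(t, gen, recv))
```

In this sample path the AoS stays at zero while the receiver holds the newest content, whereas the AoI keeps growing between deliveries, which mirrors the qualitative behavior described for Fig. 1.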
Now we return to the multiple-user scenario and introduce the evolution of the AoS. Let $s_n(t)$ be the AoS of user $n$ at the beginning of slot $t$. The analysis is divided into two cases based on the desynchronization status:
• First, suppose that the information at user $n$ is synchronized with source $n$ at the beginning of slot $t$, i.e., $s_n(t) = 0$. Then $s_n(t+1)$ depends on whether an update occurs during slot $t$.
  * When there is no update, i.e., $\Lambda_n(t) = 0$, then $s_n(t+1) = 0$, indicating that user $n$ is still synchronized with source $n$ at the beginning of the next slot.
  * When $\Lambda_n(t) = 1$, i.e., an update of source $n$ occurs in slot $t$, the information at user $n$ becomes desynchronized at the beginning of the next slot, i.e., $s_n(t+1) = 1$.
• If $s_n(t) \neq 0$, then user $n$ is desynchronized with source $n$ at the beginning of slot $t$.
  + If $u_n(t) = 1$ and the transmission succeeds, then the latest information of source $n$ by the end of slot $t-1$ will be received by the end of slot $t$. In this case:
    * If $\Lambda_n(t) = 0$, there is no update during slot $t$, so the received information is synchronized with the source at the beginning of the next slot, i.e., $s_n(t+1) = 0$.
    * If $\Lambda_n(t) = 1$, the received information is out-of-date immediately at the beginning of the next slot, so $s_n(t+1) = 1$.
  + If the update is not transmitted, i.e., $u_n(t) = 0$, or the transmission fails, the update packet is not received by user $n$ and the AoS increases linearly, i.e., $s_n(t+1) = s_n(t) + 1$.

Based on the above analysis, the dynamics of the AoS for user $n$ are:
$$s_n(t+1) = \begin{cases} 0, & s_n(t) = 0,\ \Lambda_n(t) = 0; \\ 1, & s_n(t) = 0,\ \Lambda_n(t) = 1; \\ 0, & \Lambda_n(t) = 0,\ u_n(t) = 1,\ \text{succeeds}; \\ 1, & \Lambda_n(t) = 1,\ u_n(t) = 1,\ \text{succeeds}; \\ s_n(t) + 1, & \text{otherwise}. \end{cases} \tag{4}$$

C. Problem Formulation
The expected average AoS of all users following policy $\pi$ over $T$ consecutive slots can be computed as follows:
$$J_T(\pi) = \mathbb{E}_\pi\left[\frac{1}{NT}\sum_{t=1}^{T}\sum_{n=1}^{N} s_n(t) \,\Big|\, \mathbf{s}(0)\right],$$
where the vector $\mathbf{s}(t) = [s_1(t), s_2(t), \cdots, s_N(t)]^T \in \mathbb{N}^N$ denotes the AoS of all users at the beginning of slot $t$. In this work, we assume that all the sources are synchronized initially, i.e., $\mathbf{s}(0) = \mathbf{0}$, and omit $\mathbf{s}(0)$.

Let $\Pi_{NA}$ denote the class of non-anticipated policies, i.e., scheduling decisions $\{u_n(t)\}$ in slot $t$ are made based on channel statistics $\{p_n\}$, the past and current AoS of all users $\{s_n(\tau)\}_{\tau}$ [...]

In this part, a lower bound on the expected average AoS performance of the above optimization problem is derived. A sample path argument is used to characterize the AoS evolution of each user. We then establish the expected average AoS over the entire network when the number of time slots $T \to \infty$. The lower bound is then established using Fatou’s lemma.

Theorem 1. For a given network setup, the average AoS over the entire network is lower bounded by:
$$\mathrm{AoS}_{\mathrm{LB}} = \frac{1}{N}\sum_{n=1}^{N} \gamma_n^* \left[ \frac{1}{2}\left(\frac{1}{\gamma_n^*} - \frac{1-\lambda_n}{\lambda_n}\right)^2 + \frac{1}{2}\left(\frac{1}{\gamma_n^*} - \frac{1-\lambda_n}{\lambda_n}\right) \right], \tag{6}$$
where $\gamma_n^* = \max\left\{ 1 \Big/ \sqrt{\left(\frac{1-\lambda_n}{\lambda_n}\right)^2 - \left(\frac{1-\lambda_n}{\lambda_n}\right) + \frac{\mu^* N}{p_n}},\ \lambda_n \right\}$, and $\mu^*$ is the coefficient that keeps $\sum_{n=1}^{N} \frac{\gamma_n^*}{p_n} = 1$.

Proof Sketch: The lower bound is obtained by solving the AoS minimization problem under a relaxed bandwidth constraint, i.e., multiple users can be scheduled at the same time, but the time-average number of users scheduled in each slot is still no larger than 1. Because the original problem imposes a hard bandwidth constraint in every slot, the derived AoS lower bound can be loose. The bound is used to evaluate the performance of our proposed algorithms. The details of the derivation are provided in Appendix A.

III. MARKOV DECISION PROCESS

In this section, we design a scheduling strategy based on Markov decision process (MDP) techniques. The optimization problem Eq. (5a-5c) can be formulated as an MDP with the following elements:
• State space: The state at time slot $t$ is defined to be the AoS of all the users over the entire network, $\mathbf{s}(t)$. The state space is countable but infinite because of possible transmission failures.
• Action space: We define the action $a(t)$ at time $t$ to be the index of the selected user. The corresponding scheduling decision can be computed by $u_n(t) = \mathbb{1}(n = a(t)), \forall n$, where $\mathbb{1}(\cdot)$ is the indicator function. Denote $a(t) = 0$ if the BS chooses to be idle. The action space $\{0, 1, 2, \cdots, N\}$ is hence countable and finite.
• Transition probability: Let $\Pr(\mathbf{s}'|\mathbf{s}, a)$ be the transition probability from state $\mathbf{s}(t) = \mathbf{s} = [s_1, s_2, \cdots, s_N]^T$ to state $\mathbf{s}(t+1) = \mathbf{s}' = [s_1', s_2', \cdots, s_N']^T$ at the next slot by taking action $a$ at slot $t$. Since the new update packet arrivals and channel states are independent among the users, the transition probability can be decomposed into:
$$\Pr(\mathbf{s}'|\mathbf{s}, a) = \prod_{n=1}^{N} \Pr(s_n'|s_n, a), \tag{7}$$
where $\Pr(s_n'|s_n, a)$ denotes the one-step transition probability of user $n$ given action $a$ and has the following expression according to Eq. (4):
$$\Pr(s_n'|s_n, a) = \begin{cases} 1, & s_n' = s_n + 1,\ s_n \neq 0,\ a \neq n; \\ 1 - p_n, & s_n' = s_n + 1,\ s_n \neq 0,\ a = n; \\ \lambda_n p_n, & s_n' = 1,\ s_n \neq 0,\ a = n; \\ (1 - \lambda_n) p_n, & s_n' = 0,\ s_n \neq 0,\ a = n; \\ \lambda_n, & s_n' = 1,\ s_n = 0,\ \forall a; \\ 1 - \lambda_n, & s_n' = 0,\ s_n = 0,\ \forall a; \\ 0, & \text{otherwise}. \end{cases}$$
• One-step cost: Let $C(\mathbf{s}(t), a(t))$ be the one-step cost at state $\mathbf{s}(t)$ given action $a(t)$, which is the average AoS of the entire network at time $t$:
$$C(\mathbf{s}(t), a(t)) = \frac{1}{N}\sum_{n=1}^{N} s_n(t).$$

The solution $\pi^*$ that minimizes the average AoS in Eq. (5b) can be found by solving the MDP. Let $J_\alpha(\mathbf{s}, \pi)$ be the $\alpha$-discounted cost of following policy $\pi$ starting from state $\mathbf{s}$, i.e.,
$$J_\alpha(\mathbf{s}, \pi) = \lim_{T \to \infty} \mathbb{E}_\pi\left[\sum_{t=1}^{T} \alpha^{t-1} C(\mathbf{s}(t), a(t))\right]. \tag{8}$$
In this section, we approximate $\pi^*$ by solving the $\alpha$-discounted cost problem when $\alpha \to 1$. Define $\pi_\alpha^*$ to be the optimum policy that minimizes the $\alpha$-discounted cost starting from any state $\mathbf{s}$ and satisfies the bandwidth constraint, i.e.,
$$\pi_\alpha^* = \arg\min_{\pi \in \Pi_{NA}} J_\alpha(\mathbf{s}, \pi), \quad \forall \mathbf{s}, \tag{9a}$$
$$\text{s.t.} \quad \mathbb{E}_\pi\left[\sum_{n=1}^{N} u_n(t)\right] \le 1, \quad \forall t. \tag{9b}$$
Policy $\pi_\alpha^*$ is obtained through a modified policy iteration that utilizes its structure. To analyze the structure of $\pi_\alpha^*$, let us first provide the formal definition of stationary deterministic policies:

Definition 1. Let $\Pi_{SD}$ denote the class of stationary deterministic policies. Given state $\mathbf{s}(t) = \mathbf{s}$, a stationary deterministic policy $\pi_{SD} \in \Pi_{SD}$ selects action $a(t) = f(\mathbf{s})$, where the function $f(\cdot): \mathbf{s} \to \{0, \cdots, N\}$ is a deterministic mapping from the state space to the action space.

According to [24], $\pi_\alpha^*$ can be chosen to be a stationary deterministic policy; let $\pi_\alpha(\mathbf{s})$ be the action it takes in state $\mathbf{s}$. Let $V_\alpha(\mathbf{s})$ denote the $\alpha$-discounted cost of following policy $\pi_\alpha^*$ starting from state $\mathbf{s}$, i.e.,
$$V_\alpha(\mathbf{s}) = \lim_{T \to \infty} \mathbb{E}_{\pi_\alpha^*}\left[\sum_{t=1}^{T} \alpha^{t-1} C(\mathbf{s}(t), a(t))\right] = \min_\pi J_\alpha(\mathbf{s}, \pi). \tag{10}$$

Lemma 1. The $\alpha$-discounted value function satisfies the following Bellman equation:
$$V_\alpha(\mathbf{s}) = \min_{a \in \mathcal{A}} \left\{ C(\mathbf{s}, a) + \alpha \sum_{\mathbf{s}'} V_\alpha(\mathbf{s}') \Pr(\mathbf{s}'|\mathbf{s}, a) \right\}. \tag{11}$$

Proof Sketch: The main idea is to show that there is a weight function $w(\mathbf{s}): \mathcal{S} \to [1, \infty)$ such that the $w$-norm of the value function, $\|V_\alpha\|_w = \sup_{\mathbf{s} \in \mathcal{S}} \frac{V_\alpha(\mathbf{s})}{w(\mathbf{s})}$, is bounded. The proof is similar to [17] and is provided in our online report due to page limitations.

Based on the Bellman equation, we can apply policy iteration to obtain the value function $V_\alpha(\mathbf{s})$. Its computational complexity can be reduced by utilizing the structure of the optimum policy. Next, we first exploit the switching structure of the optimal policy $\pi_\alpha^*$ in Section III-A, and then propose a structural finite-state policy iteration to approximate the optimal policy in Section III-B.

A. Characterization of the Optimal Structure

First we study the structure of the optimum policy $\pi_\alpha^*$ that minimizes the discounted cost of the MDP. We first present two lemmas; the proof of Lemma 3 can be found in the appendices.

Lemma 2. For fixed $\alpha$ and any starting state $\mathbf{s}$, the discounted value function $V_\alpha(\mathbf{s} + z\mathbf{e}_n)$ is a non-decreasing function of $z$, regardless of $n$.

Lemma 3. For any fixed $\alpha$, the discounted value function possesses a submodularity property. That is, for state $\mathbf{s}$ and $\forall i \neq j$, $z_i \ge 0$, $0 \le z_j \le s_j$:
$$V_\alpha(\mathbf{s} + z_i \mathbf{e}_i - z_j \mathbf{e}_j) - V_\alpha(\mathbf{s} - z_j \mathbf{e}_j) \ge V_\alpha(\mathbf{s} + z_i \mathbf{e}_i) - V_\alpha(\mathbf{s}). \tag{12}$$

Based on the two lemmas, we obtain the following theorem on the structure of $\pi_\alpha^*$:

Theorem 2. The optimum policy $\pi_\alpha^*$ possesses a switching structure. That is, if policy $\pi_\alpha^*$ chooses action $n$ at state $\mathbf{s}$, then policy $\pi_\alpha^*$ chooses action $n$ at state $\mathbf{s} + z\mathbf{e}_n$, $\forall z \in \mathbb{N}$, where $\mathbf{e}_n$ is the unit vector with the $n$-th component being 1 and the remaining elements being 0.

Proof: The proof is given for $N = 2$ for notational simplicity and can be generalized easily to $N > 2$.
Suppose it is optimal to schedule user 1 at state $\mathbf{s} = [s_1, s_2]$ with discount factor $\alpha$. Then we have $\mathbb{E}_{s_1', s_2'}[V_\alpha([s_1', s_2']) \mid [s_1, s_2], 1] \le \mathbb{E}_{s_1', s_2'}[V_\alpha([s_1', s_2']) \mid [s_1, s_2], 2]$. Let us compute and compare the expected value function when taking action $a = 1$ and $a = 2$ at state $[s_1 + z, s_2]$:
$$\begin{aligned}
& \mathbb{E}_{s_1', s_2'}[V_\alpha([s_1', s_2']) \mid [s_1 + z, s_2], 1] - \mathbb{E}_{s_1', s_2'}[V_\alpha([s_1', s_2']) \mid [s_1 + z, s_2], 2] \\
={}& p_1\big((1-\lambda_1) V_\alpha([0, s_2+1]) + \lambda_1 V_\alpha([1, s_2+1])\big) + (1-p_1) V_\alpha([s_1+z+1, s_2+1]) \\
& - p_2\big((1-\lambda_2) V_\alpha([s_1+z+1, 0]) + \lambda_2 V_\alpha([s_1+z+1, 1])\big) - (1-p_2) V_\alpha([s_1+z+1, s_2+1]) \\
={}& p_1\big((1-\lambda_1) V_\alpha([0, s_2+1]) + \lambda_1 V_\alpha([1, s_2+1])\big) + (1-p_1) V_\alpha([s_1+1, s_2+1]) \\
& - (1-p_1) V_\alpha([s_1+1, s_2+1]) + (1-p_1) V_\alpha([s_1+z+1, s_2+1]) \\
& - p_2\big((1-\lambda_2) V_\alpha([s_1+z+1, 0]) + \lambda_2 V_\alpha([s_1+z+1, 1])\big) - (1-p_2) V_\alpha([s_1+z+1, s_2+1]) \\
\overset{(a)}{\le}{}& p_2\big((1-\lambda_2) V_\alpha([s_1+1, 0]) + \lambda_2 V_\alpha([s_1+1, 1])\big) + (1-p_2) V_\alpha([s_1+1, s_2+1]) \\
& + (1-p_1)\big(V_\alpha([s_1+z+1, s_2+1]) - V_\alpha([s_1+1, s_2+1])\big) \\
& - p_2\big((1-\lambda_2) V_\alpha([s_1+z+1, 0]) + \lambda_2 V_\alpha([s_1+z+1, 1])\big) - (1-p_2) V_\alpha([s_1+z+1, s_2+1]) \\
={}& p_2(1-\lambda_2)\big(V_\alpha([s_1+1, 0]) - V_\alpha([s_1+z+1, 0]) - V_\alpha([s_1+1, s_2+1]) + V_\alpha([s_1+z+1, s_2+1])\big) \\
& + p_2\lambda_2\big(V_\alpha([s_1+1, 1]) - V_\alpha([s_1+z+1, 1]) - V_\alpha([s_1+1, s_2+1]) + V_\alpha([s_1+z+1, s_2+1])\big) \\
& - p_1\big(V_\alpha([s_1+z+1, s_2+1]) - V_\alpha([s_1+1, s_2+1])\big) \\
\overset{(b)}{\le}{}& 0,
\end{aligned} \tag{13}$$
where inequality (a) is obtained because it is optimum to broadcast source 1 at state $[s_1, s_2]$, which implies
$$\mathbb{E}_{s_1', s_2'}[V_\alpha([s_1', s_2']) \mid [s_1, s_2], 1] \le \mathbb{E}_{s_1', s_2'}[V_\alpha([s_1', s_2']) \mid [s_1, s_2], 2],$$
and (b) follows from the submodularity and monotonicity of the value function. Inequality (13) implies that the optimum choice at state $[s_1 + z, s_2]$ is to select source 1, which is the optimum action at state $[s_1, s_2]$. The switching structure is hence verified.

B. Relative Policy Iteration through Finite-State Approximation

MDP problems with finitely many states can be solved by policy iteration or value iteration. To deal with the infinite state space in our problem, we approximate the whole countable space, i.e., the AoS of each source, by setting an upper bound $S_n^{\max}$ on the AoS of each of them. This approximation is reasonable since the probability of consecutive packet loss vanishes exponentially with the number of consecutive transmission slots. As $S_n^{\max}$ goes to infinity for all $n$, the optimal policy of the truncated problem converges to that of the original problem.

Let $x_n^{(m)}(t)$ be the truncated AoS of source $n$ when $S_n^{\max} = m$, i.e., $x_n^{(m)}(t) = \min\{s_n(t), m\}$. With this approximation, by choosing different upper bounds $m$, we obtain a class of approximate MDP problems, each of which differs from the original problem in:
• State space: We substitute the state $\mathbf{s}(t)$ by the truncated AoS $\mathbf{x}^{(m)}(t) = [x_1^{(m)}(t), x_2^{(m)}(t), \cdots, x_N^{(m)}(t)]^T$.
• Transition probabilities: The transition probability changes in accordance with the truncated state space. Let $\Pr(\mathbf{x}^{(m)\prime}|\mathbf{x}^{(m)}, a)$ be the transition probability from state $\mathbf{x}(t) = \mathbf{x}^{(m)}$ to $\mathbf{x}(t+1) = \mathbf{x}^{(m)\prime}$, with the dynamics being
$$\Pr(\mathbf{x}^{(m)\prime}|\mathbf{x}^{(m)}, a) = \prod_{n=1}^{N} \Pr(x_n^{(m)\prime}|x_n^{(m)}, a). \tag{14}$$
It should be noted that $\Pr(x_n^{(m)\prime}|x_n^{(m)}, a)$ is the same as $\Pr(s_n'|s_n, a)$ except:
$$\Pr(x_n^{(m)\prime}|x_n^{(m)}, a) = \begin{cases} 1, & x_n^{(m)\prime} = x_n^{(m)} = m,\ a \neq n; \\ 1 - p_n, & x_n^{(m)\prime} = x_n^{(m)} = m,\ a = n. \end{cases} \tag{15}$$

Then, for a given upper bound $m$, we can obtain an optimal deterministic policy by relative policy iteration. We choose the initial policy $\pi^{(0)}(\mathbf{x}) = \arg\max_n x_n$, i.e., the greedy policy that schedules the user with the largest AoS.
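The truncated transition model of Eqs. (14) and (15) can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names `user_transition` and `joint_transition`, the tuple encoding of the state, and the numeric parameters are all assumptions, and the joint state space is enumerated naively, so the sketch is only practical for small $N$ and $m$.

```python
from itertools import product


def user_transition(x, scheduled, p, lam, m):
    """Next-state distribution of one user's truncated AoS.

    x: current truncated AoS in {0, ..., m}; scheduled: True if a = n;
    p: packet success probability; lam: update probability; m: AoS cap.
    Returns a dict {next_state: probability}.
    """
    if x == 0:
        # Synchronized user: only a new source update changes the state.
        return {1: lam, 0: 1.0 - lam}
    grow = min(x + 1, m)  # linear growth, capped at m as in Eq. (15)
    if not scheduled:
        return {grow: 1.0}
    # Scheduled and desynchronized: success w.p. p, then a fresh update
    # during the slot (w.p. lam) leaves the user at AoS 1 instead of 0.
    dist = {0: p * (1.0 - lam), 1: p * lam}
    dist[grow] = dist.get(grow, 0.0) + (1.0 - p)
    return dist


def joint_transition(x, a, p, lam, m):
    """Joint distribution over next states as the product in Eq. (14).

    x: tuple of truncated AoS values; a: scheduled user in {1, ..., N},
    or 0 if the BS idles; p, lam: per-user parameter lists.
    """
    per_user = [
        user_transition(x[n], a == n + 1, p[n], lam[n], m)
        for n in range(len(x))
    ]
    joint = {}
    for combo in product(*(d.items() for d in per_user)):
        state = tuple(s for s, _ in combo)
        prob = 1.0
        for _, q in combo:
            prob *= q
        joint[state] = joint.get(state, 0.0) + prob
    return joint


# Two users, user 1 scheduled, AoS capped at m = 3 (illustrative values).
dist = joint_transition((2, 0), 1, p=[0.8, 0.5], lam=[0.3, 0.4], m=3)
print(sorted(dist.items()))
```

Because the joint kernel is built as a product of per-user kernels, the number of states grows as $(m+1)^N$, which is the complexity issue that motivates exploiting the switching structure.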
Then, given policy $\pi^{(k)}(\mathbf{x})$ and value function $V_\alpha^{(k)}(\mathbf{x})$, the policy $\pi^{(k+1)}(\mathbf{x})$ and the value function $V_\alpha^{(k+1)}(\mathbf{x})$ in the $(k+1)$-th iteration can be obtained through iteration. Considering the switching structure, once $\pi^{(k+1)}(\mathbf{x}) = a$ is obtained, it can be concluded that for any $z \in \mathbb{N}$, $\pi^{(k+1)}(\mathbf{x} + z\mathbf{e}_a) = a$. The policy $\pi^{(k)}(\mathbf{x})$ and value $V_\alpha^{(k)}(\mathbf{x})$ finally converge as $k$ increases. The algorithm is summarized below.

Algorithm 1 Relative policy iteration based on switching structure
  initialization: for each state $\mathbf{x}$, assign action $\pi^{(0)}(\mathbf{x}) = \arg\max_n x_n$ and initial value $V_\alpha^{(0)}(\mathbf{x}) = \sum_{n=1}^{N} x_n$.
  repeat
    $\pi^{(k+1)}(\mathbf{x}) \leftarrow 0$, $\forall \mathbf{x}$.
    for $\mathbf{x} \in$ state space $\mathcal{X}$ with $\pi^{(k+1)}(\mathbf{x}) = 0$ do
      Policy selection: $\pi^{(k+1)}(\mathbf{x}) \leftarrow a^*$, where $a^* = \arg\min_{a \in \mathcal{A}} \{ C(\mathbf{x}, a) + \alpha \mathbb{E}_{\mathbf{x}'}[V_\alpha^{(k)}(\mathbf{x}') \mid \mathbf{x}, a] \}$.
      Policy evaluation: $V_\alpha^{tmp}(\mathbf{x}) \leftarrow C(\mathbf{x}, a^*) + \alpha \mathbb{E}_{\mathbf{x}'}[V_\alpha^{(k)}(\mathbf{x}') \mid \mathbf{x}, a^*]$.
      Assign $\pi^{(k+1)}(\mathbf{x} + z\mathbf{e}_{a^*}) \leftarrow a^*$ and $V_\alpha^{tmp}(\mathbf{x} + z\mathbf{e}_{a^*}) \leftarrow C(\mathbf{x} + z\mathbf{e}_{a^*}, a^*) + \alpha \mathbb{E}_{\mathbf{x}'}[V_\alpha^{(k)}(\mathbf{x}') \mid \mathbf{x} + z\mathbf{e}_{a^*}, a^*]$.
    end for
    $V_\alpha^{(k+1)}(\mathbf{x}) \leftarrow V_\alpha^{tmp}(\mathbf{x}) - V_\alpha^{tmp}(\mathbf{0})$, for all $\mathbf{x}$.
    $k \leftarrow k + 1$
  until $\pi^{(k)}(\mathbf{x}) = \pi^{(k-1)}(\mathbf{x})$, for all $\mathbf{x}$.

After the iteration, we obtain a stationary deterministic policy $\pi$. The MDP scheduling policy is then as follows: at each slot with state $\mathbf{s}(t)$, compute the corresponding virtual age $\mathbf{x}^{(m)}(t)$ and choose the corresponding action $a(t) = \pi(\mathbf{x}^{(m)}(t))$.

Notice that there are a total of $X_{\max}^N$ states. The computational complexity $O(X_{\max}^N)$ thus grows exponentially in $N$, which makes the optimum policy impossible to obtain for large $N$.

IV. INDEX-BASED HEURISTIC

The MDP solution is computationally demanding for a large number of access users, a problem known as the curse of dimensionality. To reduce the computational complexity, we propose a simple index-based heuristic policy based on the restless multi-arm bandit (RMAB) [25].

The $N$ users can be viewed as arms, and the state of user $n$ at the beginning of slot $t$ is the corresponding AoS $s_n(t)$. For simplicity, we say bandit $n$ is active if source $n$ is broadcast and passive otherwise. In each slot, the BS activates one arm and sends update information, while the remaining arms remain passive. The AoS of each user $n$ depends only on its past AoS $s_n(t-1)$ and the scheduling decision $u_n(t-1)$. Hence the AoS evolves as a restless bandit based on the current action and the current AoS.

To solve the RMAB problem, a low-complexity heuristic index policy was proposed by Whittle [26]. It can approach asymptotically optimal performance compared with the MDP solution under certain scenarios. In this section, we first prove the indexability of the problem. Then, we obtain the closed-form solution of the Whittle’s index and provide the scheduling policy.

A. Decoupled Sub-Problem

To adopt the Whittle’s index, let us first relax the transmission constraint in each time slot into a time-average transmission constraint,
$$\mathbb{E}_\pi\left[\frac{1}{T}\sum_{t=1}^{T}\sum_{n=1}^{N} u_n(t)\right] \le 1.$$
Let $W \ge 0$ be the Lagrange multiplier. Placing the relaxed bandwidth constraint into the objective function, we have the following minimization problem:
$$\text{minimize} \quad \lim_{T \to \infty} \mathbb{E}_\pi\left[\frac{1}{NT}\sum_{t=1}^{T}\sum_{n=1}^{N}\left(s_n(t) + W u_n(t) - \frac{W}{N}\right)\right]. \tag{16}$$
For fixed $W$, the above relaxed optimization problem can be decoupled into $N$ subproblems and solved separately. Thus, we omit the subscript $n$ of each user henceforth.
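The decoupling can be checked directly from Eq. (16). Writing the subscripts out once more and exchanging the order of the two sums, the objective inside the expectation separates into per-user terms plus a constant that does not depend on the policy:

$$\frac{1}{NT}\sum_{t=1}^{T}\sum_{n=1}^{N}\left(s_n(t) + W u_n(t) - \frac{W}{N}\right) = \frac{1}{N}\sum_{n=1}^{N}\left[\frac{1}{T}\sum_{t=1}^{T}\big(s_n(t) + W u_n(t)\big)\right] - \frac{W}{N}.$$

Since the $n$-th bracket involves only the AoS and the activation decisions of user $n$, and the relaxed constraint has already been absorbed into the multiplier $W$, each bracket can be minimized on its own.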
Given $W$, the goal of each decoupled optimization problem is to derive an optimum activation strategy $\mu_W$ such that the following total cost is minimized:
$$\mu_W = \arg\min_{\mu \in \Pi_{NA}} \lim_{T \to \infty} \frac{1}{T} \mathbb{E}_\mu\left[\sum_{t=1}^{T} \big(s(t) + W u(t)\big)\right]. \tag{17}$$
The non-negative multiplier $W \ge 0$ can be viewed as an extra cost of being active, and the optimum strategy should achieve a balance between the activation cost and the cost incurred by the AoS. Each of the $N$ subproblems can be formulated as an MDP with the following elements:
• State space: The state at time $t$ is the current AoS of the corresponding user, $s(t) \in \mathbb{N}$, which is countably infinite because of possible transmission failures.
• Action space: There are two possible actions at each time slot: either choose the bandit to send updates, $a(t) = 1$, or remain idle, $a(t) = 0$. It should be noted that the action $a(t)$ here is different from the scheduling action $u(t)$, which is subject to a strict interference constraint.
• Transition probability: The state evolves with the action following Eq. (4). Let $\Pr(s'|s, a)$ be the transition probability from state $s(t) = s$ to $s(t+1) = s'$ by taking action $a(t) = a$ at slot $t$. Then:
$$\Pr(s'|s, a) = \begin{cases} \lambda, & s' = 1,\ s = 0,\ a \in \{0, 1\}; \\ 1 - \lambda, & s' = 0,\ s = 0,\ a \in \{0, 1\}; \\ p\lambda, & s' = 1,\ s \neq 0,\ a = 1; \\ p(1 - \lambda), & s' = 0,\ s \neq 0,\ a = 1; \\ 1 - p, & s' = s + 1,\ s \neq 0,\ a = 1; \\ 1, & s' = s + 1,\ s \neq 0,\ a = 0; \\ 0, & \text{otherwise}. \end{cases} \tag{18}$$
• One-step cost: For fixed $W$, the one-step cost of state $s(t)$ under action $a(t)$ is defined as the total cost increment at slot $t$, which consists of both the current AoS and the extra cost of being active:
$$C(s(t), a(t)) = s(t) + W a(t). \tag{19}$$

According to [24], a stationary deterministic policy exists that minimizes the average cost over the infinite horizon.
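The elements of the decoupled MDP can be written down as a small Python sketch of Eqs. (18) and (19). The function names `bandit_transition` and `one_step_cost` and the numeric values are hypothetical and chosen only for illustration.

```python
def bandit_transition(s, a, p, lam):
    """Next-state distribution of a single bandit, following Eq. (18).

    s: current AoS (non-negative integer); a: 1 if active, 0 if passive;
    p: packet success probability; lam: update probability of the source.
    Returns a dict {next_state: probability}.
    """
    if s == 0:
        # A synchronized bandit is unaffected by the action.
        return {1: lam, 0: 1.0 - lam}
    if a == 0:
        # Passive and desynchronized: the AoS grows by one.
        return {s + 1: 1.0}
    # Active and desynchronized: success w.p. p, failure w.p. 1 - p.
    return {1: p * lam, 0: p * (1.0 - lam), s + 1: 1.0 - p}


def one_step_cost(s, a, W):
    """One-step cost of Eq. (19): current AoS plus activation charge W."""
    return s + W * a


# Illustrative values: AoS 4, success probability 0.8, update probability 0.3.
print(bandit_transition(4, 1, p=0.8, lam=0.3))
print(one_step_cost(4, 1, W=2.5))
```

Returning the distribution as a dictionary keeps the countably infinite state space implicit: only the at most three reachable next states are stored.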
Next we study the structure of such a policy $\mu_W: s(t) \to a(t)$ to prove the indexability of the bandit.

B. Proof of Indexability

First, let us provide the formal definition of indexability.

Definition 2. According to [25], let $\Omega_W$ be the set of states where the optimum strategy $\mu_W$ takes a passive action. A bandit is indexable if the passive set $\Omega_W$ increases monotonically with the multiplier $W$, i.e., $\Omega_W \subseteq \Omega_{W'}$, $\forall W \le W'$.

The proof of indexability is divided into two parts. First we show that the optimum policy of the above MDP possesses a threshold structure. Next, we derive the optimum threshold for a fixed Lagrange multiplier $W$ and prove that it is monotonically increasing in $W$.

The threshold structure of the optimum stationary deterministic policy is obtained by investigating policies that minimize the $\alpha$-discounted cost over the infinite horizon. For a fixed Lagrange multiplier $W$, let $J_{\alpha,W}(s, \mu)$ be the $\alpha$-discounted cost over the infinite horizon starting from initial state $s(1) = s$:
$$J_{\alpha,W}(s, \mu) = \limsup_{T \to \infty} \mathbb{E}_\mu\left[\sum_{t=1}^{T} \alpha^{t-1} C(s(t), a(t))\right].$$
Let $\mu_{\alpha,W} = \arg\min_{\mu \in \Pi_{NA}} J_{\alpha,W}(s, \mu)$ be the optimum policy that minimizes the $\alpha$-discounted cost, and let $V_{\alpha,W}(s) = \min_\mu J_{\alpha,W}(s, \mu) = J_{\alpha,W}(s, \mu_{\alpha,W})$. The value function $V_{\alpha,W}(s)$ satisfies the following Bellman equation:
$$V_{\alpha,W}(s) = \min_{a \in \mathcal{A}} \left\{ C(s, a) + \alpha \sum_{s'} V_{\alpha,W}(s') \Pr(s'|s, a) \right\}. \tag{20}$$

Lemma 4. For $W > 0$, the value function $V_{\alpha,W}(\cdot)$ increases monotonically.

The proof of Lemma 4 is provided in the appendix. Recall that $\mu_W$ is the optimum policy that minimizes the average cost in Eq. (17). Lemma 4 implies the following theorem about $\mu_W$:

Theorem 3. The optimal policy $\mu_W$ that minimizes the average cost over the infinite horizon (17) has a threshold structure. Let $\mu_W(s)$ be the action that policy $\mu_W$ takes when the AoS equals $s$.
If at state s , it is optimal to keep the bandit idle, then forall s (cid:48) < s it is optimal to keep the bandit idle, i.e., µ W ( s (cid:48) ) =0 , ∀ s (cid:48) < s ; otherwise, if it is optimal to activate the bandit atstate s , then for states s + 1 , s + 2 , · · · , the optimal strategy µ W is to activate the bandit, i.e., µ W ( s (cid:48) ) = 1 , ∀ s (cid:48) > s .Proof: We will prove by investigating the optimal policy µ α,W that minimizes the α -discounted cost. The decision a ( t ) is chosen according to the Bellman equation (20), thecondition for bandit to be active is: W + αp ( λV α,W (1)+(1 − λ ) V α,W (0)) ≤ αpV α,W ( s +1) . (21)Suppose at state s , the above inequality is satisfied and theoptimum strategy µ α to minimize the α -discounted cost is tochoose the bandit to be active, i.e., a = 1 . According to themonotonic characteristic, for all states s (cid:48) satisfy s (cid:48) > s , W + αp ( λV α,W (1) + (1 − λ ) V α,W (0)) ≤ αpV α,W ( s ) ≤ αpV α,W ( s (cid:48) ) . Hence for state s (cid:48) ≥ s , the optimum policy µ α,W willchoose the bandit to be active, i.e., µ α,W ( s (cid:48) ) = 1 , ∀ s (cid:48) > s .By taking the conversion of inequality, in the same way, wecan obtain that the if the optimum policy µ α,W is to remainpassive at state s , then for all states satisfy s (cid:48) ≤ s , the optimum policy to minimize α -discounted cost is to remain passive. Thethreshold policy holds for all α ∈ (0 , . By taking α → , thisprovides insight that the optimum policy µ W has a thresholdstructure.The indexability is then proved by showing the activationthreshold increases with W . To compute the activation thresh-old, first we compute the average cost for fixed Lagrange costis threshold policy τ is employed, the proof can be found inAppendix D: Corollary 1. Denote F ( τ, W ) to be the average cost for fixed W if threshold policy τ is employed, i.e., the bandit will beactive for state s ≥ τ and be passive for state s < τ . 
Then,

$$F(\tau, W) = \frac{\tau(\tau - 1)}{2}\, \xi_1^{(\tau)} + \frac{\xi_1^{(\tau)}}{p} \left( \frac{1}{p} - 1 \right) + \frac{\xi_1^{(\tau)}}{p} (\tau + W), \tag{22}$$

where $\xi_s^{(\tau)}$ denotes the steady-state probability that the bandit is in state $s$ when threshold policy $\tau$ is applied; in particular,

$$\xi_1^{(\tau)} = 1 \Big/ \left( \frac{1 - \lambda}{\lambda} + \tau + \frac{1}{p} - 1 \right).$$

Next, we derive the optimum activation threshold $\tau_{\text{opt}}(W)$ for given $W$ by examining the value of $F(\tau, W)$. The optimum value should satisfy $F(\tau_{\text{opt}} + 1, W) \ge F(\tau_{\text{opt}}, W)$ and $F(\tau_{\text{opt}} - 1, W) \ge F(\tau_{\text{opt}}, W)$. The following corollary provides the closed-form expression of the threshold; further derivations are provided in Appendix E:

Corollary 2. For a given Lagrange multiplier $W$, the optimum activation threshold can be computed as follows:

$$\tau_{\text{opt}} = \left\lfloor \left( \frac{5}{2} - \frac{1}{p} - \frac{1}{\lambda} \right) + \sqrt{ \left( \frac{5}{2} - \frac{1}{p} - \frac{1}{\lambda} \right)^2 + 2 \left( \frac{W}{p} - \frac{1 - \lambda}{\lambda} \cdot \frac{1 - p}{p} \right) + 2\, \frac{1 - p}{p} } \right\rfloor. \tag{23}$$

Notice that the scheduling threshold $\tau_{\text{opt}}$ is an increasing function of $W$, suggesting that the passive set increases monotonically with $W$. In particular, when $W = 0$ the passive set is $\emptyset$. Thus the indexability of the bandit is proved.

C. Derivation of the Whittle's Index

The Whittle's index $I(s)$ measures how rewarding it is to activate the bandit in state $s$. By definition, it is the extra cost that makes actions $a = 1$ and $a = 0$ equally desirable in state $s$ [26]. Denote by $\phi(\tau)$ the average activation probability over the infinite horizon if threshold policy $\tau$ is applied. According to the threshold structure, $\phi(\tau)$ equals the proportion of time spent on updating the bandit, which equals the total probability that the bandit is in a state $s \ge \tau$ and can be computed by:

$$\phi(\tau) = \sum_{s = \tau}^{\infty} \xi_s^{(\tau)} = \sum_{s = \tau}^{\infty} \xi_1^{(\tau)} (1 - p)^{s - \tau} = \frac{\xi_1^{(\tau)}}{p}. \tag{24}$$

According to [25, Eq. 6.11], the Whittle's index can be computed as follows:

$$I(s) = \frac{F(s + 1, 0) - F(s, 0)}{\phi(s) - \phi(s + 1)} = \frac{p \big( F(s + 1, 0) - F(s, 0) \big)}{\xi_1^{(s)} - \xi_1^{(s + 1)}}. \tag{25}$$

D. Index-Based Scheduling Algorithm

In this part, we provide a low-complexity scheduling algorithm based on the derived index. At the beginning of each time slot, the BS observes the current AoS of each source, $s_n(t)$, and computes the Whittle's index $I_n(s_n(t))$ for each user. It then broadcasts the message of the user $n$ with the highest $I_n(s_n(t))$, with ties broken arbitrarily. Since the index is computed separately for each user, the computational complexity is $O(N)$ and is almost negligible.

The scheduling policy based on the Whittle's index can easily be generalized to the constraint that no more than $M$ users can be scheduled in each time slot. That is, after computing the Whittle's index of each user at time slot $t$, the BS selects the $M$ users with the largest indices and broadcasts the corresponding update packets. Moreover, according to [26], the Whittle's index policy is asymptotically optimal in most scenarios. That is, as the number of users $N \to \infty$ while $M/N$ remains constant, the performance gap between the Whittle's index policy and the optimum policy vanishes.

V. NUMERICAL SIMULATIONS

In this section, the performance of the proposed scheduling strategies is evaluated in terms of the expected average AoS over the entire network. We compare four scheduling strategies:
1) The greedy arrival-aware policy, which transmits the undelivered packet to the user with the largest AoS.
2) The AoI minimization policy proposed in [21].
3) The Markov decision process in Section III.
4) The Whittle's index policy in Section IV.

Define the total packet arrival rate over the entire network to be $\lambda_{\text{total}} = \sum_{n=1}^{N} \lambda_n$. The expected average AoS is computed by averaging the AoS evolution over $T$ time slots, where $T$ is chosen large enough that each user is selected for transmission in a large number of slots.

In Fig. 2, we consider a three-user broadcast network with arrival rates λ = [0. , . , . ] $\lambda_{\text{total}}$ and transmission success probabilities p = [0. , . , . ].
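The index policy of Section IV-D can be sketched as follows. The Python fragment below is a hedged illustration rather than the simulation code behind the figures: it evaluates $\xi_1^{(\tau)}$, the average cost $F(\tau, W)$ of Eq. (22) and the index of Eq. (25) numerically, then schedules the desynchronized user with the largest index. The names `xi1`, `avg_cost`, `whittle_index`, `pick_user` and `simulate_index_policy` are hypothetical, and two modelling choices are assumptions: users whose AoS is 0 are never scheduled, and the per-slot AoS dynamics follow the single-user transition rule of Eq. (18).

```python
import random


def xi1(tau, p, lam):
    """Steady-state probability of each state 1..tau under threshold tau."""
    return 1.0 / ((1.0 - lam) / lam + tau + 1.0 / p - 1.0)


def avg_cost(tau, W, p, lam):
    """Average cost F(tau, W) of Eq. (22)."""
    x = xi1(tau, p, lam)
    return (tau * (tau - 1) / 2.0) * x + (x / p) * (1.0 / p - 1.0) + (x / p) * (tau + W)


def whittle_index(s, p, lam):
    """Whittle's index I(s) of Eq. (25), for a desynchronized state s >= 1."""
    num = p * (avg_cost(s + 1, 0.0, p, lam) - avg_cost(s, 0.0, p, lam))
    return num / (xi1(s, p, lam) - xi1(s + 1, p, lam))


def pick_user(aos, p, lam):
    """Desynchronized user with the largest index; None if all AoS are 0."""
    candidates = [n for n, s in enumerate(aos) if s > 0]  # ASSUMPTION: skip AoS 0
    if not candidates:
        return None
    # max() keeps the first maximizer, i.e. ties are broken by user order.
    return max(candidates, key=lambda n: whittle_index(aos[n], p[n], lam[n]))


def simulate_index_policy(p, lam, T, seed=0):
    """Empirical average AoS per user and slot over T slots (one user per slot)."""
    rng = random.Random(seed)
    aos = [0] * len(p)
    total = 0
    for _ in range(T):
        total += sum(aos)                      # AoS observed at the slot start
        chosen = pick_user(aos, p, lam)
        for n in range(len(p)):
            arrival = rng.random() < lam[n]    # new update at the BS this slot
            if aos[n] == 0:
                aos[n] = 1 if arrival else 0
            elif n == chosen and rng.random() < p[n]:
                aos[n] = 1 if arrival else 0   # delivered; resync unless new arrival
            else:
                aos[n] += 1                    # idle or failed transmission
    return total / (T * len(p))
```

For instance, `simulate_index_policy([0.9, 0.6, 0.3], [0.2, 0.3, 0.5], 100000)` returns an empirical average AoS for one hypothetical three-user configuration; these parameter values are placeholders, not the ones used in Fig. 2. Extending `pick_user` to return the `M` largest indices would cover the multi-user constraint described above.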
The threshold $m$ for computing the truncated MDP solution is set to $m = 20$. Fig. 3 studies the AoS performance for networks with more users. The parameter $\lambda_{\text{total}} = 2$, the packet arrival probability for each user is $\lambda_n = \frac{n}{N(N+1)} \lambda_{\text{total}}$, and $p_n = \frac{n}{N}$. Due to the computational complexity caused by the curse of dimensionality, we display the derived lower bound instead of the MDP policy in Fig. 3. In Fig. 2, the proposed index-based scheduling algorithm achieves performance comparable to that of the MDP policy. In Fig. 3, the performance of the proposed index policy is close to the theoretical lower bound. As the number of users $N$ increases, the arrival-aware strategy falls far behind the proposed index policy.

Footnote: When $\lambda = 1$, we have $\xi_1^{(s)} = \frac{1}{s + \frac{1}{p} - 1}$ and $F(s + 1, 0) - F(s, 0) = \left( \frac{s(s - 1)}{2} + \frac{1 - p}{p^2} + \frac{s}{p} \right) \left( \xi_1^{(s+1)} - \xi_1^{(s)} \right) + \left( s + \frac{1}{p} \right) \xi_1^{(s+1)}$. The Whittle's index according to our derivations is $I(s) = -p \left( \frac{s(s - 1)}{2} + \frac{1 - p}{p^2} + \frac{s}{p} \right) + p \left( s + \frac{1}{p} \right) \left( s + \frac{1}{p} - 1 \right) = \frac{p s}{2} \left( s + \frac{2 - p}{p} \right)$, which is exactly equivalent to [12, Eq. (56)] with $T = 1$ and $\alpha = 1$.

Fig. 2. Simulation results of expected AoS for a three-user broadcast network with λ = [0. , . , . ] $\lambda_{\text{total}}$ and p = [0. , . , . ].

Fig. 3. Simulation results of expected AoS versus the number of users. The total packet arrival rate over the entire network is $\lambda_{\text{total}} = 2$, and the packet arrival probability for each user is $\lambda_n = \frac{n}{N(N+1)} \lambda_{\text{total}}$, with $p_n = n/N$.

From our analysis in Section II, AoI and AoS are equivalent when $\lambda = 1$. In Fig. 2, when $\lambda_{\text{total}}$ → . , the arrival rate $\lambda_n$ for each user is close to 1, indicating that update packets arrive in nearly every time slot. The AoI minimization policy then tends to show performance similar to that of the AoS minimization policies.
When the update packet arrival rate $\lambda_n$ is much less than 1, or the number of users in the network increases, the AoI minimization policy leads to a significantly higher AoS than the proposed index policy and may even be worse than the greedy arrival-aware policy. This phenomenon verifies our analysis that AoS and AoI are metrics with different physical meanings: a good AoI performance cannot guarantee a good AoS performance.

To understand the design insight of how to minimize the average AoS over the entire network, we plot the proportion of time spent on scheduling each user at different AoS values, i.e., the number of times the user is scheduled divided by the total number of slots $T$.

Fig. 4. Simulations of the proportion of time spent on scheduling each user at different AoS values, i.e., the number of times the user is scheduled divided by the total number of slots $T$, for a network with $N = 10$ users and $\lambda_n = 0.\,n/N$. The transmission success probability is $p_n = 0.$, $\forall n$, on the left and $p_n = 0.$ on the right.

Fig. 5. Simulations of the proportion of time spent on scheduling each user at different AoS values, i.e., the number of times the user is scheduled divided by the total number of slots $T$, for a network with $N = 10$ users and $p_n = n/N$. The random update probability in each slot is $\lambda_n = 0.$, $\forall n$, on the left and $\lambda_n = 0.$ on the right.

In Fig. 4, we study a network with $N = 10$ users, with packet arrival rate $\lambda_n = 0.\,n/N$ for each user. The transmission success probability is $p_n = 0.$ on the left and $p_n = 0.$ on the right for all $n$. In Fig. 5, we set $p_n = n/N$ for each user, and the update arrival probability in each slot is $\lambda_n = 0.$ on the left and $\lambda_n = 0.$ on the right. The simulations show that users with smaller update arrival probability $\lambda_n$ and larger packet transmission success probability are more likely to be updated at smaller AoS.

VI. CONCLUSIONS

In this paper, we treated a broadcast network with a BS sending random updates to interested users over unreliable wireless channels.
We measure data freshness by the Age of Synchronization (AoS). We propose two scheduling algorithms, based on the Markov decision process (MDP) and the restless multi-arm bandit (RMAB), to minimize AoS. Simulation results show that the proposed index policy achieves performance comparable to that of the MDP and approaches the theoretical lower bound. Moreover, our work verifies that AoS and AoI are different concepts, and that a policy mismatch leads to poor AoS performance. To guarantee a good AoS performance, scheduling policies should ensure that users with smaller random update probabilities and larger success probabilities are updated at smaller AoS.

The wireless broadcast model in our work is a simplified one. In the future, we will study scheduling policies that minimize AoS in broadcast networks with time-varying channels like [27]. Scheduling and interference alignment co-design in a network with multiple base stations will also be an interesting problem.

ACKNOWLEDGMENT

The authors would like to thank Mr. Jingzhou Sun, Prof. Zhiyuan Jiang, Prof. Sheng Zhou and Prof. Zhisheng Niu for a preprint of the manuscript [21] and fruitful discussions. The authors would like to thank Prof. Roy Yates for suggestions during the ISIT and are also grateful to the anonymous reviewers for valuable suggestions that improved the presentation of the manuscript.

APPENDIX A
PROOF OF THEOREM

Proof: Let $\pi \in \Pi_{NA}$ be a feasible scheduling policy satisfying the interference constraint. Since broadcasting source $n$ cannot decrease the AoS when the AoS of user $n$ equals 0, to formulate the AoS lower bound we assume that policy $\pi$ broadcasts source $n$ at time $t$ only if $s_n(t) > 0$. Following policy $\pi$, a sample path of AoS, denoted by $\omega$, is obtained. Suppose that up to slot $T$ the BS broadcasts update packets of source $n$ a total of $L_n^T$ times, and user $n$ receives $K_n^T$ packets successfully.
Similar to the analysis in [12], when $T \to \infty$, any strategy that transmits updates to user $n$ fewer than a fixed constant number of times leads to an infinite average AoS and is thus far from optimum. We call such strategies "starving strategies". In the following analysis, we focus only on policies $\pi$ that are "non-starving strategies", which implies that the number of broadcasts of source $n$ goes to infinity as $T$ goes to infinity, i.e.,

$$\lim_{T \to \infty} L_n^T = \infty, \quad \text{w.p.1}. \tag{26}$$

Since each broadcast arrives at user $n$ with probability $p_n$, by the law of large numbers we have:

$$\lim_{T \to \infty} \frac{K_n^T}{L_n^T} = p_n, \quad \text{w.p.1}. \tag{27}$$

The above equations imply that the number of packets about source $n$ received by user $n$ goes to infinity with probability 1, i.e.,

$$\lim_{T \to \infty} K_n^T = \infty, \quad \text{w.p.1}. \tag{28}$$

Suppose user $n$ receives the $i$th update packet about source $n$ at the end of slot $t_{n,i}$. Denote by $\tau_{n,i}$ the inter-update interval of user $n$ between the receiving time-stamps of the $(i-1)$th and the $i$th update, which can be computed as follows:

$$\tau_{n,i} = t_{n,i} - t_{n,i-1}, \quad \forall i \in \{1, 2, \cdots, K_n^T\}. \tag{29}$$

Since all users are assumed to be synchronized initially, let $t_{n,0} = 0, \forall n$. To facilitate the AoS computation during slots $[t_{n,K_n^T}, T]$, we define $\tau_{n,K_n^T+1}$ to be:

$$\tau_{n,K_n^T+1} = T - t_{n,K_n^T}. \tag{30}$$

Then the sum of the sequence $\{\tau_{n,i}\}$ satisfies:

$$\sum_{i=1}^{K_n^T+1} \tau_{n,i} = T. \tag{31}$$

According to the AoS evolution, if source $n$ has no updates after the latest update packet has been received by user $n$, the AoS $s_n(t)$ stays at zero. Denote by $v_{n,i}$ the maximum number of consecutive slots during which the AoS of user $n$ remains 0 after the $i$th update packet has been received by user $n$, i.e., $s_n(t_{n,i} + j) = 0, \forall j \in [1, v_{n,i}]$ and $s_n(t_{n,i} + v_{n,i} + 1) = 1$. The AoS of user $n$ starts from 1 at the beginning of slot $t_{n,i} + v_{n,i} + 1$ and increases linearly until the next update packet has been received by the end of slot $t_{n,i+1}$. For simplicity, we denote $w_{n,i} = s_n(t_{n,i+1}) = t_{n,i+1} - t_{n,i} - v_{n,i}$ afterwards.
The total AoS of user $n$ at the beginning of each slot in $[t_{n,i} + 1, t_{n,i+1}]$ can be computed as follows:

$$\sum_{j = t_{n,i} + 1}^{t_{n,i+1}} s_n(j) = \frac{s_n(t_{n,i+1}) \big( s_n(t_{n,i+1}) + 1 \big)}{2} = \frac{w_{n,i}^2 + w_{n,i}}{2}. \tag{32}$$

Let $S_T(\omega)$ be the average AoS over slots $[1, T]$ of sample path $\omega$, which can be computed by:

$$S_T(\omega) = \frac{1}{NT} \sum_{n=1}^{N} \sum_{i=1}^{K_n^T+1} \sum_{j = t_{n,i-1} + 1}^{t_{n,i}} s_n(j) = \frac{1}{N} \sum_{n=1}^{N} \frac{K_n^T + 1}{T} \cdot \frac{\sum_{i=1}^{K_n^T+1} \frac{1}{2} \left( w_{n,i}^2 + w_{n,i} \right)}{K_n^T + 1}. \tag{33}$$

Let $M[\cdot]$ denote the sample mean of a set of variables and denote $\gamma_n = \frac{K_n^T + 1}{T}$. Then $S_T(\omega)$ can be simplified and lower bounded as follows:

$$S_T(\omega) = \frac{1}{N} \sum_{n=1}^{N} \gamma_n \left( \frac{1}{2} M[w_n^2] + \frac{1}{2} M[w_n] \right) \overset{(a)}{\ge} \frac{1}{N} \sum_{n=1}^{N} \gamma_n \left( \frac{1}{2} M[w_n]^2 + \frac{1}{2} M[w_n] \right), \tag{34}$$

where inequality (a) is obtained by the generalized mean inequality $M[w_n^2] \ge M[w_n]^2$. To further lower bound $S_T(\omega)$, in the following analysis we first compute $M[w_n]$ for each fixed $\gamma_n$, and then determine the constraints on the values that the sequence $\{\gamma_n\}$ can take. Finally, searching for the lower bound of AoS can be formulated as an optimization problem, whose objective is to minimize the lower bound of $\lim_{T \to \infty} \mathbb{E}[S_T(\omega)]$ under the constraints on $\{\gamma_n\}$.

Given $\gamma_n$, the computation of $M[w_n]$ is divided into two steps. First, we obtain $M[w_n] + M[v_n]$, and then we compute $M[v_n]$. The sum of $M[w_n]$ and $M[v_n]$ is obtained as follows:

$$M[w_n] + M[v_n] \overset{(a)}{=} \frac{\sum_{i=1}^{K_n^T+1} w_{n,i}}{K_n^T + 1} + \frac{\sum_{i=1}^{K_n^T+1} v_{n,i}}{K_n^T + 1} \overset{(b)}{=} \frac{T}{K_n^T + 1} = \frac{1}{\gamma_n}, \tag{35}$$

where equality (a) follows from the definition of $M[\cdot]$ and equality (b) holds because $\tau_{n,i+1} = w_{n,i} + v_{n,i}$ and $T = \sum_{i=1}^{K_n^T+1} \tau_{n,i}$. Since update packets arrive independently in each slot with probability $\lambda_n$, the numbers of consecutive slots in which no update packet appears, $v_{n,i}$, are i.i.d. random variables following a geometric distribution with parameter $\lambda_n$.
By the law of large numbers,

$$M[v_n] = \mathbb{E}[v_{n,i}] = \frac{1 - \lambda_n}{\lambda_n}, \quad \text{w.p.1}. \tag{36}$$

Plugging Eq. (36) into Eq. (35), the mean value $M[w_n]$ can be written as a function of $\gamma_n$:

$$M[w_n] = \frac{1}{\gamma_n} - \frac{1}{\lambda_n} + 1, \quad \text{w.p.1}. \tag{37}$$

The first constraint on $\gamma_n$ is obtained through the lower bound of $M[w_n]$. Recall that policy $\pi$ only broadcasts updates of source $n$ when the AoS of user $n$ is no longer 0. Thus, $s_n(t_{n,i}) \ge 1$, which implies $M[w_n] \ge 1$. According to Eq. (37), we have the following restriction on $\gamma_n$:

$$\frac{1}{\gamma_n} - \frac{1}{\lambda_n} \ge 0 \ \Rightarrow\ \gamma_n \le \lambda_n, \quad \text{w.p.1}. \tag{38}$$

The second constraint on $\{\gamma_n\}$ is inherited from the interference constraint that no more than one source can be broadcast in each slot. By summing up the interference constraint in each slot, i.e., Eq. (5c), from $t = 1$ to $T$, we obtain the following inequality on the sequence $\{L_n^T\}$:

$$\sum_{n=1}^{N} L_n^T \le T. \tag{39}$$

Recall that when $T \to \infty$, we have $\gamma_n = \lim_{T \to \infty} \frac{K_n^T + 1}{T} = \lim_{T \to \infty} \frac{K_n^T}{T}$. By plugging Eq. (27), i.e., the relationship between $K_n^T$ and $L_n^T$, into Eq. (39), we have:

$$\sum_{n=1}^{N} \frac{\gamma_n}{p_n} \le 1. \tag{40}$$

Thus, for any non-starving policy $\pi$ that satisfies the bandwidth constraint, a lower bound on its AoS performance can be obtained by computing $\{\gamma_n\}$ and then substituting Eq. (37) into Eq. (34). Based on the above analysis, let $\text{AoS}_{\text{LB}}$ be the average AoS lower bound over the entire network. Searching for $\text{AoS}_{\text{LB}}$ can be formulated as the following optimization problem:

$$\text{AoS}_{\text{LB}} = \min_{\boldsymbol{\gamma} \ge 0} \frac{1}{N} \sum_{n=1}^{N} \gamma_n \left[ \frac{1}{2} \left( \frac{1}{\gamma_n} - \frac{1}{\lambda_n} + 1 \right)^2 + \frac{1}{2} \left( \frac{1}{\gamma_n} - \frac{1}{\lambda_n} + 1 \right) \right], \tag{41a}$$

$$\text{s.t.} \quad \sum_{n=1}^{N} \frac{\gamma_n}{p_n} \le 1, \tag{41b}$$

$$\gamma_n \le \lambda_n, \quad \text{for all } n = 1, 2, \cdots, N. \tag{41c}$$

For any policy $\pi$, we can conclude that the average AoS obtained over $T$ consecutive slots is no smaller than $\text{AoS}_{\text{LB}}$ with probability 1 when $T \to \infty$. We then proceed to discuss how to solve the above optimization problem.

Notice that the objective function is convex and the constraint set is a closed polygon.
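As a numerical companion to problem (41), the following Python sketch evaluates the lower bound by bisecting on the multiplier of constraint (41b). It is an illustration under assumptions, not the authors' procedure: the function name `aos_lower_bound` is hypothetical, and the per-user minimizer it uses comes from setting the derivative of the Lagrangian of (41) with respect to $\gamma_n$ to zero and capping the result at $\lambda_n$ so that (41c) holds.

```python
import math


def aos_lower_bound(p, lam, iters=200):
    """Return (AoS_LB, gammas) for problem (41), by bisection on the multiplier mu."""
    N = len(p)
    c = [(1.0 - l) / l for l in lam]  # c_n = (1 - lambda_n) / lambda_n

    def gammas(mu):
        out = []
        for n in range(N):
            # Stationarity gives 1 / gamma_n^2 = c_n^2 - c_n + 2 N mu / p_n.
            d = c[n] ** 2 - c[n] + 2.0 * N * mu / p[n]
            # d <= 0: the Lagrangian keeps decreasing in gamma_n, so the cap
            # gamma_n <= lambda_n of (41c) is what binds.
            out.append(lam[n] if d <= 0.0 else min(1.0 / math.sqrt(d), lam[n]))
        return out

    def load(mu):
        # Left-hand side of (41b); non-increasing in mu.
        return sum(g / pn for g, pn in zip(gammas(mu), p))

    mu = 0.0
    if load(0.0) > 1.0:
        lo, hi = 0.0, 1.0
        while load(hi) > 1.0:          # grow the bracket until (41b) is met
            hi *= 2.0
        for _ in range(iters):         # plain bisection on mu
            mid = 0.5 * (lo + hi)
            if load(mid) > 1.0:
                lo = mid
            else:
                hi = mid
        mu = hi                        # feasible end of the bracket

    g = gammas(mu)
    value = 0.0
    for n in range(N):
        w = 1.0 / g[n] - 1.0 / lam[n] + 1.0   # M[w_n] from Eq. (37)
        value += g[n] * (0.5 * w * w + 0.5 * w)
    return value / N, g
```

As a sanity check, a single user with $p = \lambda = 1$ gives a bound of 1: an update arrives in every slot and is delivered in every slot, so the AoS stays at 1.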
For simplicity, denote $\boldsymbol{\gamma} = [\gamma_1, \cdots, \gamma_N]$ and $\boldsymbol{\nu} = [\nu_1, \cdots, \nu_N]$. We can write the Lagrange function as follows:

$$L(\boldsymbol{\gamma}, \mu, \boldsymbol{\nu}) = \frac{1}{N} \sum_{n=1}^{N} \gamma_n \left( \frac{1}{2} \left( \frac{1}{\gamma_n} - \frac{1 - \lambda_n}{\lambda_n} \right)^2 + \frac{1}{2} \left( \frac{1}{\gamma_n} - \frac{1 - \lambda_n}{\lambda_n} \right) \right) + \mu \left( \sum_{n=1}^{N} \frac{\gamma_n}{p_n} - 1 \right) + \sum_{n=1}^{N} \nu_n (\gamma_n - \lambda_n), \tag{42}$$

where $\mu \ge 0$ and $\boldsymbol{\nu} \ge 0$ are the Lagrange multipliers. According to the Karush-Kuhn-Tucker (KKT) conditions, when the function reaches its minimum, the condition $\nabla_{\boldsymbol{\gamma}} L(\boldsymbol{\gamma}, \mu, \boldsymbol{\nu}) = 0$ holds for every $n$, i.e.,

$$\frac{1}{2N} \left[ \left( \frac{1 - \lambda_n}{\lambda_n} \right)^2 - \left( \frac{1 - \lambda_n}{\lambda_n} \right) \right] - \frac{1}{2 N \gamma_n^2} + \frac{\mu}{p_n} + \nu_n = 0. \tag{43}$$

Hence, the optimum $\gamma_n$ can be expressed as a function of $\mu$ and $\nu_n$:

$$\gamma_n = 1 \Big/ \sqrt{ \left( \frac{1 - \lambda_n}{\lambda_n} \right)^2 - \frac{1 - \lambda_n}{\lambda_n} + N \left( \frac{2\mu}{p_n} + 2 \nu_n \right) }. \tag{44}$$

Notice that the primal constraint $\gamma_n \le \lambda_n$ implies that $\nu_n = 0$ if $\gamma_n < \lambda_n$. Next, consider the Complementary Slackness (CS) conditions:

$$\mu \left( \sum_{n=1}^{N} \frac{\gamma_n}{p_n} - 1 \right) = 0, \tag{45a}$$

$$\nu_n (\gamma_n - \lambda_n) = 0, \quad \forall n. \tag{45b}$$

Since $\nu_n \ge 0$ can only decrease $\gamma_n$ in Eq. (44) and is nonzero only when $\gamma_n = \lambda_n$, the optimum $\gamma_n^*$ can be computed as follows:

$$\gamma_n^* = \min \left\{ 1 \Big/ \sqrt{ \left( \frac{1 - \lambda_n}{\lambda_n} \right)^2 - \left( \frac{1 - \lambda_n}{\lambda_n} \right) + \frac{2 \mu^* N}{p_n} },\ \lambda_n \right\}, \tag{46}$$

where $\mu^*$ is the Lagrange multiplier that keeps $\sum_{n=1}^{N} \frac{\gamma_n^*}{p_n} = 1$.

APPENDIX B
PROOF OF LEMMA

Consider $N = 2$ and denote $\mathbf{s} = [s_1, s_2]$ in the following discussion. The analysis can be generalized to $N > 2$. Similarly, we prove the submodularity by investigating the Bellman operator. Suppose $V_\alpha^{(k)}(\cdot)$ has the submodularity characteristic; we then show that $V_\alpha^{(k+1)}(\cdot)$, obtained after the $(k+1)$th iteration, possesses the same characteristic.

With no loss of generality, assume $i = 1$, $j = 2$. By the submodularity of $V_\alpha^{(k)}(\cdot)$, we have

$$V_\alpha^{(k)}([s_1 + z, s_2 - z]) + V_\alpha^{(k)}([s_1, s_2]) \ge V_\alpha^{(k)}([s_1, s_2 - z]) + V_\alpha^{(k)}([s_1 + z, s_2]).$$
Notice that for any actions $a_1, a_2, a_3, a_4$, we have $C([s_1 + z, s_2 - z], a_1) - C([s_1, s_2 - z], a_3) = z$ and $C([s_1 + z, s_2], a_4) - C([s_1, s_2], a_2) = z$. Define $\Delta$ as:

$$\begin{aligned} \Delta \triangleq{} & \min_{a_1} \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1 + z, s_2 - z], a_1 \big] + \min_{a_2} \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1, s_2], a_2 \big] \\ & - \min_{a_3} \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1, s_2 - z], a_3 \big] - \min_{a_4} \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1 + z, s_2], a_4 \big]. \end{aligned}$$

To show that the submodularity holds for $k + 1$, it suffices to prove $\Delta \ge 0$. Let $\pi^{(k+1)}(\mathbf{s}) = \arg\min_a \mathbb{E}_{\mathbf{s}'} [ V_\alpha^{(k)}(\mathbf{s}') \mid \mathbf{s}, a ]$. For $s_1 \neq 0$ and $s_2 - z \neq 0$, the proof is divided into two cases:

1) If $\pi^{(k+1)}([s_1 + z, s_2 - z]) = \pi^{(k+1)}([s_1, s_2]) = \tilde{a}$. With no loss of generality, assume $\tilde{a} = 1$. Since $\tilde{a}$ may not be the optimum strategy for states $[s_1, s_2 - z]$ and $[s_1 + z, s_2]$, we have

$$\min_{a_3} \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1, s_2 - z], a_3 \big] \le \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1, s_2 - z], \tilde{a} \big],$$

and

$$\min_{a_4} \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1 + z, s_2], a_4 \big] \le \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1 + z, s_2], \tilde{a} \big].$$

By plugging them into $\Delta$ we have:

$$\begin{aligned} \Delta \ge{} & \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1 + z, s_2 - z], \tilde{a} \big] + \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1, s_2], \tilde{a} \big] \\ & - \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1, s_2 - z], \tilde{a} \big] - \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1 + z, s_2], \tilde{a} \big] \\ ={} & (1 - p_1) \Big( V_\alpha^{(k)}([s_1 + z + 1, s_2 - z + 1]) + V_\alpha^{(k)}([s_1 + 1, s_2 + 1]) \\ & \qquad - V_\alpha^{(k)}([s_1 + 1, s_2 - z + 1]) - V_\alpha^{(k)}([s_1 + z + 1, s_2 + 1]) \Big). \end{aligned} \tag{47}$$

Then, according to the submodularity characteristic, we have $\Delta \ge 0$. The case $\tilde{a} = 2$ can be verified similarly.

2) If $\pi^{(k+1)}([s_1 + z, s_2 - z]) = a_1$, $\pi^{(k+1)}([s_1, s_2]) = a_2$ and $a_1 \neq a_2$. With no loss of generality, suppose $a_1 = 1$ and $a_2 = 2$. If $p_1 \le p_2$, similar to the previous analysis, we have

$$\min_{a_3} \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1, s_2 - z], a_3 \big] \le \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1, s_2 - z], a_1 \big],$$

and

$$\min_{a_4} \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1 + z, s_2], a_4 \big] \le \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1 + z, s_2], a_2 \big].$$

Then, $\Delta$ can be lower bounded by:

$$\begin{aligned} \Delta \ge{} & \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1 + z, s_2 - z], a_1 \big] + \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1, s_2], a_2 \big] \\ & - \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1, s_2 - z], a_1 \big] - \mathbb{E}_{s_1', s_2'} \big[ V_\alpha^{(k)}([s_1', s_2']) \,\big|\, [s_1 + z, s_2], a_2 \big] \\ ={} & (1 - p_1) \Big( V_\alpha^{(k)}([s_1 + z + 1, s_2 - z + 1]) - V_\alpha^{(k)}([s_1 + 1, s_2 - z + 1]) \Big) \\ & + (1 - p_2) \Big( V_\alpha^{(k)}([s_1 + 1, s_2 + 1]) - V_\alpha^{(k)}([s_1 + z + 1, s_2 + 1]) \Big) \\ ={} & (p_2 - p_1) \Big( V_\alpha^{(k)}([s_1 + z + 1, s_2 - z + 1]) - V_\alpha^{(k)}([s_1 + 1, s_2 - z + 1]) \Big) \\ & + (1 - p_2) \Big( V_\alpha^{(k)}([s_1 + z + 1, s_2 - z + 1]) + V_\alpha^{(k)}([s_1 + 1, s_2 + 1]) \\ & \qquad - V_\alpha^{(k)}([s_1 + 1, s_2 - z + 1]) - V_\alpha^{(k)}([s_1 + z + 1, s_2 + 1]) \Big). \end{aligned} \tag{48}$$

By monotonicity, we have $V_\alpha^{(k)}([s_1 + z + 1, s_2 - z + 1]) - V_\alpha^{(k)}([s_1 + 1, s_2 - z + 1]) \ge 0$. Combining this with the submodularity of $V_\alpha^{(k)}$, $\Delta \ge 0$ can be verified.

The case $p_1 \ge p_2$ can be verified in the same way and is hence omitted.

For the case $s_1 = 0$ or $s_2 - z = 0$, the proof needs some modification similar to the proof of Lemma 1, which is omitted here.
Based on the above analysis, we have $\Delta \ge 0$ and the submodularity of $V_\alpha^{(k+1)}$ can be verified.

APPENDIX C
PROOF OF LEMMA 4

$V_\alpha(s)$ is obtained by taking the minimum over all possible action sequences; hence, choosing $a(t) = 0$ all the time yields an upper bound on the $\alpha$-discounted problem. In this case, starting from any state $s$, according to the transition probabilities, the state of the decoupled bandit at time $t$ satisfies $s(t) < s + t$. We can obtain an upper bound on $V_\alpha(s)$ by computing the total cost of applying this naive strategy:

$$V_\alpha(s) \le \sum_{t=1}^{\infty} \alpha^{t-1} (s + t - 1) = \frac{s}{1 - \alpha} + \frac{\alpha}{(1 - \alpha)^2}.$$

Hence $V_\alpha(s) < \infty$ for every state. Based on this characteristic, we can use value iteration to approach the $\alpha$-discounted value function. Fixing $V_\alpha(0) = 0$, the discounted value function in the $(k+1)$th iteration is obtained by:

$$V_\alpha^{(k+1)}(s) = \min_{a \in \{0, 1\}} \big\{ C(s, a) + \alpha\, \mathbb{E}_{s'} [ V_\alpha^{(k)}(s') \mid s, a ] \big\},$$

where $\mathbb{E}_{s'} [ V_\alpha^{(k)}(s') \mid s, a ] = \sum_{s'} V_\alpha^{(k)}(s') \Pr(s' \mid s, a)$ denotes the expected $\alpha$-discounted value in the next time slot. We then prove the monotonicity of the value function by induction. Suppose $V_\alpha^{(k)}(s)$ is a monotonically increasing function of $s$, and assume that $0 \le s_1 < s_2$. If $a = 0$, according to the cost function $C(s, a)$ and the monotonicity of $V_\alpha^{(k)}(s)$, we have the following inequality:

$$C(s_1, 0) + \alpha V_\alpha^{(k)}(s_1 + 1) \le C(s_2, 0) + \alpha V_\alpha^{(k)}(s_2 + 1).$$

When $a = 1$, starting from state $s_1 \neq 0$, the bandit evolves into states $0$, $1$ and $s_1 + 1$ with probability $(1 - \lambda) p$, $\lambda p$ and $1 - p$, respectively; starting from state $s_2$, the bandit evolves into states $0$, $1$ and $s_2 + 1$ with probability $(1 - \lambda) p$, $\lambda p$ and $1 - p$. Then, according to the monotonicity of $V_\alpha^{(k)}(s)$, we have

$$C(s_1, 1) + \alpha\, \mathbb{E}_{s'} [ V_\alpha^{(k)}(s') \mid s_1, 1 ] \le C(s_2, 1) + \alpha\, \mathbb{E}_{s'} [ V_\alpha^{(k)}(s') \mid s_2, 1 ].$$

By taking the minimum over the action set $A$, the value of $V_\alpha^{(k+1)}(s)$ is obtained and the following inequality holds:

$$V_\alpha^{(k+1)}(s_1) \le V_\alpha^{(k+1)}(s_2).$$

Notice that when $k \to \infty$, we have $V_\alpha^{(k)}(s) \to V_\alpha(s)$. Hence the monotonicity of the value function is proved.

APPENDIX D
DERIVATIONS OF COROLLARY 1

The probability transfer graph for threshold policy $\tau$ can be plotted as follows:

Fig. 6. Probability transfer graph for threshold policy $\tau$. For states below $\tau$, the bandit remains passive; for states equal to or larger than the threshold $\tau$, the bandit becomes active. The transition probabilities are denoted below the arrows.

Denote by $\xi_s^{(\tau)}$ the steady-state probability that the bandit is in state $s$ under a policy that activates when $s \ge \tau$ and idles when $s < \tau$. Then, according to the transition rule (18) and Fig. 6, the steady-state distribution must satisfy the following relationships:

1) For $s < \tau$, the bandit remains passive, hence:

$$\xi_1^{(\tau)} = \cdots = \xi_\tau^{(\tau)}. \tag{49a}$$

2) For $s \ge \tau$, the bandit is chosen to be active. With probability $1 - p$ the transmission fails and the AoS grows to $s + 1$:

$$\xi_{s+1}^{(\tau)} = (1 - p)\, \xi_s^{(\tau)}, \quad \forall s \ge \tau. \tag{49b}$$

The transmission succeeds with probability $p$; if a new update arrives at the BS, which happens with probability $\lambda$, then the bandit goes to state $s = 1$:

$$\xi_1^{(\tau)} = \sum_{s = \tau}^{\infty} \lambda p\, \xi_s^{(\tau)} + \lambda\, \xi_0^{(\tau)}. \tag{49c}$$

If no update arrives and the transmission succeeds, the AoS in the next slot goes down to zero:

$$\xi_0^{(\tau)} = \sum_{s = \tau}^{\infty} (1 - \lambda) p\, \xi_s^{(\tau)} + (1 - \lambda)\, \xi_0^{(\tau)}. \tag{49d}$$

All these state probabilities sum up to 1, hence:

$$\sum_{s = 0}^{\infty} \xi_s^{(\tau)} = 1. \tag{49e}$$

Based on Eq.
(49b)-(49e), the steady-state distribution $\xi_s^{(\tau)}$ can be computed as follows:

$$\xi_s^{(\tau)} = \begin{cases} \dfrac{1 - \lambda}{\lambda} \Big/ \left( \dfrac{1 - \lambda}{\lambda} + \tau + \dfrac{1}{p} - 1 \right), & s = 0; \\[2ex] 1 \Big/ \left( \dfrac{1 - \lambda}{\lambda} + \tau + \dfrac{1}{p} - 1 \right), & 1 \le s \le \tau; \\[2ex] (1 - p)^{s - \tau} \Big/ \left( \dfrac{1 - \lambda}{\lambda} + \tau + \dfrac{1}{p} - 1 \right), & s > \tau. \end{cases} \tag{50}$$

Therefore, the average total cost $F(\tau, W)$ for given $W$ when threshold policy $\tau$ is employed can be computed as:

$$\begin{aligned} F(\tau, W) &= \sum_{s = 0}^{\tau - 1} s\, \xi_s^{(\tau)} + \sum_{s = \tau}^{\infty} (s + W)\, \xi_s^{(\tau)} \\ &= \sum_{s = 1}^{\tau - 1} s\, \xi_1^{(\tau)} + \sum_{s = \tau}^{\infty} (s + W)\, \xi_1^{(\tau)} (1 - p)^{s - \tau} \\ &= \frac{\tau(\tau - 1)}{2}\, \xi_1^{(\tau)} + \frac{\xi_1^{(\tau)}}{p} \left( \frac{1}{p} - 1 \right) + \frac{\xi_1^{(\tau)}}{p} (\tau + W). \end{aligned} \tag{51}$$

APPENDIX E
DERIVATION OF COROLLARY 2

Suppose $\tau_{\text{opt}}$ is the optimum threshold, i.e., the bandit idles when the AoS satisfies $s < \tau_{\text{opt}}$ and is activated when $s \ge \tau_{\text{opt}}$. Denote by $V(s)$ the differential cost-to-go function at state $s$, and let $\beta$ be the optimum average cost. Then the Bellman equation can be written as follows:

$$V(s) + \beta = \min \big\{ (W + s) + (1 - p) V(s + 1) + p \big( \lambda V(1) + (1 - \lambda) V(0) \big),\ s + V(s + 1) \big\}. \tag{52}$$

The above Bellman equation implies that, since it is optimum to activate the bandit in state $\tau_{\text{opt}}$, we have:

$$W + p \big( (1 - \lambda) V(0) + \lambda V(1) \big) \le p V(\tau_{\text{opt}} + 1), \tag{53a}$$

and since it is optimum to idle in state $\tau_{\text{opt}} - 1$:

$$W + p \big( (1 - \lambda) V(0) + \lambda V(1) \big) \ge p V(\tau_{\text{opt}}). \tag{53b}$$

The optimum threshold $\tau_{\text{opt}}$ is obtained by first writing $V(\tau_{\text{opt}})$ and $\beta$ as functions of the threshold $\tau_{\text{opt}}$, and then establishing an equation that determines $\tau_{\text{opt}}$. With no loss of generality, let us assume $\lambda V(1) + (1 - \lambda) V(0) = 0$.

To obtain $V(\tau_{\text{opt}})$, we first consider states satisfying $s \ge \tau_{\text{opt}}$. Since scheduling is the optimal action by assumption, according to the Bellman equation (52), the relationship between $V(s)$ and $V(s + 1)$ is as follows:

$$V(s) = (-\beta + W + s) + (1 - p) V(s + 1). \tag{54}$$

Then, substituting $V(s + 1) = (-\beta + W + s + 1) + (1 - p) V(s + 2)$ into the above equation, we have:

$$V(s) = (-\beta + W + s) + (1 - p)(-\beta + W + s + 1) + (1 - p)^2 V(s + 2). \tag{55}$$

Repeating this procedure $K$ times, we then have:

$$V(s) = \sum_{k = 0}^{K - 1} (1 - p)^k (-\beta + W + s + k) + (1 - p)^K V(s + K). \tag{56}$$

Consider $K \to \infty$. Since $\lim_{K \to \infty} (1 - p)^K V(s + K) = 0$, the differential cost-to-go function for $s \ge \tau_{\text{opt}}$ can be obtained as follows:

$$V(s) = \frac{1}{p} (-\beta + W + s) + \frac{1 - p}{p^2}. \tag{57}$$

Recall Eq. (53a), (53b) and the assumption that $(1 - \lambda) V(0) + \lambda V(1) = 0$. Since $\tau_{\text{opt}}$ is the optimum activation threshold, we have:

$$V(\tau_{\text{opt}}) \le \frac{W}{p} < V(\tau_{\text{opt}} + 1). \tag{58}$$

Recall that $V(s)$ is monotonically increasing. The above inequality implies that there exists a $\gamma \in [0, 1)$ such that $V(\tau_{\text{opt}} + \gamma) = \frac{W}{p}$. Plugging Eq. (57) into $V(\tau_{\text{opt}} + \gamma)$, we can compute $\beta$ as follows:

$$\beta = \tau_{\text{opt}} + \gamma + \frac{1 - p}{p}. \tag{59}$$

Next, we consider states $s < \tau_{\text{opt}}$ and write $(1 - \lambda) V(0) + \lambda V(1)$ as a function of $\tau_{\text{opt}}$. Since the optimum action is to idle when $s < \tau_{\text{opt}}$, according to the Bellman equation, we have:

$$V(s) = (-\beta + s) + V(s + 1). \tag{60}$$

Substituting $V(s - 1) = (-\beta + s - 1) + V(s)$ into the above equation and repeating this procedure down to state 1, we obtain the following equation relating $V(1)$ and $V(\tau_{\text{opt}})$:

$$V(1) = \frac{(\tau_{\text{opt}} - 1)(-2\beta + \tau_{\text{opt}})}{2} + V(\tau_{\text{opt}}). \tag{61}$$

The Bellman equation for $V(0)$ can be written as $V(0) = -\beta + (1 - \lambda) V(0) + \lambda V(1)$, which implies

$$V(0) = -\frac{\beta}{\lambda} + V(1). \tag{62}$$

Plugging the above equality into the assumption $(1 - \lambda) V(0) + \lambda V(1) = 0$, the differential cost-to-go function $V(1)$ can be obtained:

$$V(1) = \frac{1 - \lambda}{\lambda}\, \beta. \tag{63}$$

Plugging Eq. (62) and Eq. (63) into Eq. (61), we can write $V(\tau_{\text{opt}})$ as follows:

$$V(\tau_{\text{opt}}) = -\frac{(\tau_{\text{opt}} - 1)(-2\beta + \tau_{\text{opt}})}{2} + \frac{1 - \lambda}{\lambda}\, \beta. \tag{64}$$

Next, we establish an equation for $\tau_{\text{opt}}$. Recalling Eq. (57), when $s = \tau_{\text{opt}}$ we have another expression for $V(\tau_{\text{opt}})$:

$$V(\tau_{\text{opt}}) = \frac{1}{p} (-\beta + W + \tau_{\text{opt}}) + \frac{1 - p}{p^2}. \tag{65}$$

Since the above two expressions must be equal, we can establish the following equation:

$$-\frac{1 - \lambda}{\lambda}\, \beta + \frac{(\tau_{\text{opt}} - 1)(-2\beta + \tau_{\text{opt}})}{2} + \frac{1}{p} (-\beta + W + \tau_{\text{opt}}) + \frac{1 - p}{p^2} = 0. \tag{66}$$

By substituting $\beta = \tau_{\text{opt}} + \gamma + \frac{1 - p}{p}$ into the above equation, we have:

$$-\frac{1 - \lambda}{\lambda} \left( \tau_{\text{opt}} + \gamma + \frac{1 - p}{p} \right) + \frac{(\tau_{\text{opt}} - 1) \left( -\tau_{\text{opt}} - 2\gamma - 2\, \frac{1 - p}{p} \right)}{2} + \frac{1}{p} (W - \gamma) = 0. \tag{67}$$

The above equation is a quadratic equation in the variable $\tau_{\text{opt}}$:

$$\frac{1}{2} \tau_{\text{opt}}^2 + \left( \gamma + \frac{1}{p} + \frac{1}{\lambda} - \frac{5}{2} \right) \tau_{\text{opt}} - \frac{1}{p} (W - \gamma) + \frac{1 - \lambda}{\lambda} \left( \gamma + \frac{1 - p}{p} \right) - \frac{1 - p}{p} - \gamma = 0. \tag{68}$$

Since $\tau_{\text{opt}}$ is an integer and $\gamma \in [0, 1)$, we can finally obtain the threshold for fixed $W$, i.e.,

$$\tau_{\text{opt}} = \left\lfloor \left( \frac{5}{2} - \frac{1}{p} - \frac{1}{\lambda} \right) + \sqrt{ \left( \frac{5}{2} - \frac{1}{p} - \frac{1}{\lambda} \right)^2 + 2 \left( \frac{W}{p} - \frac{1 - \lambda}{\lambda} \cdot \frac{1 - p}{p} \right) + 2\, \frac{1 - p}{p} } \right\rfloor. \tag{69}$$

REFERENCES

[1] H. Tang, J. Wang, Z. Tang, and J. Song, “Scheduling to minimize age of synchronization in wireless broadcast networks with random updates,” in , July 2019, pp. 1027–1031.
[2] S. Kaul, R. Yates, and M. Gruteser, “Real-time status: How often should one update?” in , March 2012, pp. 2731–2735.
[3] J. Zhong, R. D. Yates, and E. Soljanin, “Two Freshness Metrics for Local Cache Refresh,” in , Jun. 2018, pp. 1924–1928.
[4] K. C. Sia, J. Cho, and H. Cho, “Efficient Monitoring Algorithm for Fast News Alerts,” in IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 7, pp. 950-961, July 2007.
[5] J. Cho and H. G. Molina. 2003. Effective page refresh policies for Web crawlers. ACM Trans. Database Syst. 28, 4 (December 2003), 390-426.
[6] P. Mayekar, P. Parag, and H. Tyagi, “Optimal lossless source codes for timely updates,” in , June 2018, pp. 1246–1250.
[7] P. Parag, A. Taghavi, and J. Chamberland, “On real-time status updates over symbol erasure channels,” in , March 2017, pp. 1–6.
[8] J. Zhong and R. D. Yates, “Timeliness in lossless block coding,” in , March 2016, pp. 339–348.
[9] A. G. Klein, S. Farazi, W. He, and D. R. Brown, “Staleness bounds and efficient protocols for dissemination of global channel state information,” IEEE Transactions on Wireless Communications, vol. 16, no. 9, pp. 5732–5746, Sep. 2017.
[10] M. Costa, S. Valentin, and A.
Ephremides, “On the age of channelinformation for a finite-state markov model,” in , June 2015, pp. 4101–4106.[11] I. Kadota, E. Uysal-Biyikoglu, R. Singh, and E. Modiano, “Minimizingthe age of information in broadcast wireless networks,” in , Sept 2016, pp. 844–851.[12] I. Kadota, A. Sinha, E. Uysal-Biyikoglu, R. Singh, and E. Modiano,“Scheduling policies for minimizing age of information in broadcastwireless networks,” IEEE/ACM Transactions on Networking , vol. 26,no. 6, pp. 2637–2650, Dec 2018.[13] R. D. Yates, P. Ciblat, A. Yener and M. Wigger, “Age-optimal con-strained cache updating” , Aachen, 2017, pp. 141-145.[14] I. Kadota, A. Sinha, and E. Modiano, “Optimizing age of informationin wireless networks with throughput constraints,” in IEEE INFOCOM2018 - IEEE Conference on Computer Communications , April 2018, pp.1–9.[15] R. Talak, S. Karaman, and E. Modiano, “Minimizing age-of-informationin multi-hop wireless networks,” in , Oct 2017,pp. 486–493.[16] ——, “Distributed scheduling algorithms for optimizing informationfreshness in wireless networks,” in , June 2018, pp. 1–5.[17] Y. Hsu, E. Modiano, and L. Duan, “Age of information: Design andanalysis of optimal scheduling algorithms,” in , June 2017, pp. 561–565.[18] Y. Hsu, “Age of information: Whittle index for scheduling stochasticarrivals,” in , June 2018, pp. 2634–2638.[19] Z. Jiang, B. Krishnamachari, X. Zheng, S. Zhou, and Z. Niu, “Decen-tralized status update for age-of-information optimization in wirelessmultiaccess channels,” in , June 2018, pp. 2276–2280. [20] N. Lu, B. Ji, and B. Li, “Age-based scheduling: Improving data freshnessfor wireless real-time traffic,” in Proceedings of the Eighteenth ACMInternational Symposium on Mobile Ad Hoc Networking and Computing ,Mobihoc ’18, New York, NY, USA: ACM, 2018, pp. 191–200.[21] J. Sun, Z. Jiang, S. Zhou, and Z. 
Niu, “Optimizing information freshnessin broadcast network with unreliable links and random arrivals: Anapproximate index policy,” in to appear IEEE INFOCOM 2019 -IEEE Conference on Computer Communications Workshops (INFOCOMWKSHPS) , April 2019.[22] X. Li, D. B. H. Cline and D. Loguinov, “On Sample-Path Stalenessin Lazy Data Replication,” in IEEE/ACM Transactions on Networking ,vol. 24, no. 5, pp. 2858-2871, October 2016.[23] J. Cho and H. G.-Molina, “Synchronizing a database to improvefreshness,” in Proceedings of the 2000 ACM SIGMOD internationalconference on Management of data (SIGMOD ’00), ACM, New York,NY, USA, 117-128.[24] D. P. Bertsekas, Dynamic Programming and Optimal Control , 2nd ed.Athena Scientific, 2000.[25] J. Gittins, K. Glazebrook, and R. Weber, Multi-armed bandit allocationindices . John Wiley & Sons, 2011.[26] R. R. Weber and G. Weiss, “On an index policy for restless bandits,” Journal of Applied Probability , vol. 27, no. 3, pp. 637–648, 1990.[27] H. Tang, J. Wang, L. Song, and J. Song, “Scheduling to minimize ageof information in multi-state time-varying networks with power con-straints,” in2019 57th Annual Allerton Conference on Communication,Control, and Computing (Allerton)
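As a numerical sanity check on the threshold in Eq. (69), the following minimal Python sketch evaluates the closed form and compares it with a brute-force search over integer thresholds. The average cost used in the search is obtained by equating Eq. (64) and Eq. (65) and solving for $\beta$, with the constant term taken as $(1-p)/p^2$ from the geometric sum. The function names, the restriction to thresholds $\tau \ge 1$, and the search bound `tau_max` are illustrative assumptions and are not part of the original derivation.

```python
import math


def closed_form_threshold(W, p, lam):
    """Threshold of Eq. (69) for activation cost W, success probability p,
    and update probability lam. Assumes the term under the root is >= 0."""
    b = 2.5 - 1.0 / p - 1.0 / lam          # 5/2 - 1/p - 1/lambda
    a = (1.0 - lam) / lam                  # (1 - lambda) / lambda
    c = (1.0 - p) / p                      # (1 - p) / p
    disc = b * b + 2.0 * (W / p - a * c) + 2.0 * c
    return math.floor(b + math.sqrt(disc))


def average_cost(tau, W, p, lam):
    """Average cost beta of the policy 'idle below tau, transmit from tau on',
    obtained by equating Eq. (64) and Eq. (65) and solving for beta."""
    numer = tau * (tau - 1) / 2.0 + (W + tau) / p + (1.0 - p) / p**2
    denom = (tau - 1) + 1.0 / p + (1.0 - lam) / lam
    return numer / denom


def brute_force_threshold(W, p, lam, tau_max=1000):
    """Integer threshold in [1, tau_max] with the smallest average cost.
    The range is a hypothetical search bound chosen for illustration."""
    return min(range(1, tau_max + 1),
               key=lambda t: average_cost(t, W, p, lam))


if __name__ == "__main__":
    for W, p, lam in [(5.0, 0.8, 0.3), (20.0, 0.5, 0.2)]:
        print(W, p, lam,
              closed_form_threshold(W, p, lam),
              brute_force_threshold(W, p, lam))
```

For the two parameter sets in the loop, both functions return the same threshold (1 and 5, respectively). When the expression inside the floor of Eq. (69) is exactly an integer, two neighbouring thresholds give the same average cost, so the brute-force search may return the smaller of the two.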