Massive Random Access with Sporadic Short Packets: Joint Active User Detection and Channel Estimation via Sequential Message Passing
Jia-Cheng Jiang and Hui-Ming Wang, Senior Member, IEEE
Abstract
This paper considers an uplink massive machine-type communication (mMTC) scenario, where a large number of user devices are connected to a base station (BS). A novel grant-free massive random access (MRA) strategy is proposed, considering both the sporadic user traffic and short packet features. Specifically, the notions of active detection time (ADT) and active detection period (ADP) are introduced so that active user detection can be performed multiple times within one coherence time. By taking the sporadic user traffic and short packet features into consideration, we model the joint active user detection and channel estimation issue as a dynamic compressive sensing (CS) problem whose underlying sparse signals exhibit substantial temporal correlation. This paper builds a probabilistic model to capture the temporal structure and establishes a corresponding factor graph. A novel sequential approximate message passing (S-AMP) algorithm is designed to sequentially perform inference and recover the sparse signal from one ADT to the next. The Bayes active user detector and the corresponding channel estimator are then derived. Numerical results show that the proposed S-AMP algorithm enhances the active user detection and channel estimation performance over competing algorithms in our scenario.
I. INTRODUCTION
Recently, the development of 5G cellular communication systems has driven a number of newly emerging use cases. This leads to the key requirement to support massive machine-type communications (mMTC), providing connectivity for millions of devices that perform machine-centric tasks such as environment sensing, surveillance, control, and event detection [1]. Different from conventional human-centric communication networks, such as 4G LTE, which aim for high data rates with large packet sizes, the core mission in the mMTC scenario shifts to uplink access at a low transmission rate. The common features of typical mMTC application scenarios are: a massive number of user devices, sporadic user activity, and small data packets [2]. Such completely different assumptions compared with those in human-centric communication systems trigger a completely different set of technologies.
Jia-Cheng Jiang and Hui-Ming Wang are with the School of Electronics and Information Engineering, and also with the Ministryof Education Key Lab for Intelligent Networks and Network Security, Xi’an Jiaotong University, Xi’an, 710049, Shaanxi, P. R. China.Email: [email protected]; [email protected].
To be more specific, non-orthogonal medium access and grant-free access control have been considered to provide access for a massive number of devices in the uplink [3], [4]. Typically, the massive number of user devices makes it impossible to assign orthogonal pilot sequences to all potential user devices. Hence, non-orthogonal sequences are considered in the preamble design to enable a certain degree of temporal resource overloading. For access control, the traditional strategies are grant-based, where devices access the network with a prior scheduling assignment, which requires good predictions of the uplink requests, as well as additional control signaling or message exchanges to facilitate the granting of resources [2]. When the base station (BS) detects multiple user access requests simultaneously, the affected user devices are arranged to restart the access procedure after a timer expires [5]. However, in the case of mMTC, grant-based access design is difficult and potentially inefficient for supporting massive connectivity. Therefore, the promising access control pattern for mMTC applications is the grant-free random access scheme, where each active device directly transmits its unique preamble sequence to the BS without waiting for any permission, resulting in a low control overhead.

However, non-orthogonal-sequence-based grant-free access often suffers from collisions, namely, multiple users accessing concurrently, which cannot be successfully detected and decoded [2]. Therefore, joint active user detection and channel estimation becomes a critical issue for mMTC applications. To address this problem, the sparse feature of mMTC can be taken into consideration. Due to the low-rate feature of the devices in the mMTC scenario, most devices sleep most of the time for energy efficiency and are only activated when triggered by external events.
As a result, the traffic pattern of each user device is sporadic and partly unpredictable, with only a small subset of users being active concurrently. From a physical layer perspective, this situation leads to a sparse recovery problem in mMTC. Compressive sensing (CS) technology is therefore promising to provide advanced collision resolution for serving massive user devices, where randomly generated non-orthogonal pilot sequences are assigned to user devices [3], [6]–[8]. In particular, the approximate message passing (AMP) algorithm [9] has attracted considerable attention for efficiently coping with the challenge of joint active user detection and channel estimation in the massive random access (MRA) scenario [3], [5], [10], [11]. The AMP framework achieves high efficiency, and the performance of AMP can be evaluated via the so-called state evolution equation. The authors in [3], [10] have shown that prior statistical knowledge of the wireless channel can be exploited by modifying the denoiser function in the AMP framework to enhance the detection performance.

These works exhibit the efficiency of CS technology in providing joint active user detection and channel estimation strategies with a large number of devices and sporadic user traffic, which are the two primary standard literature assumptions of the mMTC scenario. However, the designs of the existing algorithms do not further take the short packet feature into account. Typically, the average packet length potentially goes down to a few bytes in the mMTC scenario [2], so that the transmission duration of one user device is far shorter than that in traditional human-centric communications [2], [12].
Moreover, it is often the case that the geographical locations of devices change negligibly, which results in insignificant fluctuation of the user channel coefficients and therefore a long coherence time. Further, as discussed, the event-driven traffic makes the access patterns unpredictable and sporadic. These features give rise to the fact that user devices can potentially undergo multiple state switches, i.e., on-off, within one coherence time duration, and the access times are totally random and unpredictable.

However, the existing algorithms, such as [3], [4], assume that all the user devices remain active or inactive throughout the whole coherence time duration, which does not fully conform to the above features of mMTC. Specifically, user devices are considered synchronized, the joint activity detection and channel estimation operations are implemented once within a coherence block, and data transmissions of the active user devices continue until the start of the next coherence block. Further, the independent block-fading channel model is considered, which assumes that all the channels follow independent quasi-static flat fading within a block of coherence time. Such conditions are too restrictive for mMTC with short packets and unpredictable traffic.

In this work, we design a new scheme to achieve grant-free MRA considering the features of both sporadic traffic and short packets. To be more specific, we define the active detection time (ADT) as the time point for user activity detection and channel estimation. The interval between two adjacent ADTs is denoted as the active detection period (ADP), which can be divided into two phases. In the first phase, all active user devices transmit unique non-orthogonal pilot sequences to enable active user detection and channel estimation. In the second phase, data sequences with low rates are transmitted.
Note that although we still assume synchronous user access, the duration of an ADP is much shorter than the coherence time, so that user requests can be responded to in a timely manner. In addition, user devices are permitted to access multiple times within one coherence time. These features fit the sporadic traffic and short packet features well compared with the traditional synchronous access schemes mentioned above, such as [3], [4].

Taking all the above features into consideration, in this paper we model the joint user detection and channel estimation issue as a dynamic CS problem. Specifically, we denote the underlying time-varying sparse signal as the access state sparse vector, which exhibits substantial temporal correlation in two aspects. First, the active user indicator, which is the support vector of the sparse signal, varies in a correlated manner across ADTs. Second, the channel coefficients of the user devices, which are the amplitudes of the sparse signal, change smoothly across ADTs.

In mMTC applications, dynamic CS algorithms have been considered for multi-user detection [13], [14], where the temporal correlation of the user support between adjacent time steps is exploited. In particular, the authors in [14] further utilized the quality of the prior-information support set. Some related works in the signal processing literature have considered solving the dynamic CS problem. The algorithms in [15], [16] are inspired by convex relaxation. On the other hand, the authors in [17], [18] consider a Bayesian framework. Specifically, [17] blends elements of Bayesian models with more traditional CS through convex relaxation and greedy methods, while in [18], the authors consider a full Bayesian framework, but the algorithm is heuristic and lacks a theoretical measure rule. To the best of our knowledge, our paper is the first to consider the temporal correlations of both the channel coefficients and the active user indicators in the massive connectivity literature.
In this paper, we utilize the temporal correlations of both the user support and the user channel, and design a novel AMP-based method for solving our dynamic CS problem under the Bayesian framework. The performance of our proposed algorithm can be predicted by a state evolution equation. To pass the messages from one ADT to the next, we design a distribution approximation strategy based on moment matching, which is optimal under a typical measure rule. Our contributions can be summarized as follows.

• We build a probabilistic model to reflect the temporal structure under the proposed MRA strategy. We establish a specific factor graph model for our scenario and provide the corresponding message passing schedule to implement the message passing algorithm under our graph model.

• We propose a novel sequential message passing algorithm to recursively recover the access state sparse vector. Specifically, we utilize the AMP framework based on the historical knowledge-aided prior. We derive the historical knowledge-aided prior based on moment matching equations, which is optimal from the perspective of the Kullback-Leibler (KL) divergence. A state evolution analysis is provided, indicating that the historical knowledge benefits the AMP framework in our scenario.

• We derive the active user detection and channel estimation strategy, which can be performed in each ADT after executing the proposed message passing algorithm. Specifically, an LLR test with the Bayes criterion is considered, and the channel estimate can be derived directly from the recovered sparse vector.
Notation: Throughout this paper, scalars are denoted by lower-case letters, vectors by bold-face lower-case letters, and matrices by bold-face upper-case letters. For a matrix $\mathbf{A}$, $\mathbf{A}^T$ and $\mathbf{A}^H$ denote its transpose and conjugate transpose, respectively. $\Pr\{\cdot\}$ returns the probability mass and $p(\cdot)$ returns the probability density. $\{\cdot\}_{a}^{b}$ returns the collection of variables from index $a$ to $b$. The distribution of a circularly symmetric complex Gaussian random vector $\mathbf{x}$ with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$ is denoted by $\mathcal{CN}(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma})$. Finally, $\Re(\cdot)$ returns the real part of the variable.

II. SYSTEM MODEL
In this paper, we consider the uplink of an mMTC scenario with one BS located at the center of the cell and $N$ devices located randomly in a coverage area. For simplicity, we assume that the BS as well as each device is equipped with a single antenna. User device $n$ is assigned a unique pilot sequence of length $L$, denoted as $\mathbf{s}_n \in \mathbb{C}^{L} \triangleq [s_{1n}, s_{2n}, \ldots, s_{Ln}]^T$. Since we are interested in the scenario where the number of potential user devices is much larger than the length of the pilot sequence, i.e., $N \gg L$, non-orthogonal pilot sequences are assigned to the user devices. We further assume that the pilot sequences are generated according to an i.i.d. complex Gaussian distribution with zero mean and variance $1/L$ such that each sequence has unit power [3], [10].

A. Access Strategy
In this work, we define the concepts of ADT and ADP, which can be seen in Fig. 1.

Fig. 1. Sporadic and short packet transmission pattern for mMTC.

We consider a synchronous access strategy, and each user device is permitted to choose whether or not to access the network in each ADT. Due to the sporadic nature of user traffic, in the $t$th ADT only a subset of the users are active, and the other users are idle. We denote $a_n^{(t)}$ as the active user indicator in the $t$th ADT, with $a_n^{(t)} = 0$ or $1$ indicating that the user is idle or active, respectively. The received signal in the $t$th ADT at the BS can be modeled as
$$\mathbf{y}^{(t)} = \sum_{n=1}^{N} a_n^{(t)} \mathbf{s}_n h_n^{(t)} + \mathbf{w}^{(t)}, \qquad (1)$$
where $h_n^{(t)} \in \mathbb{C}$ is the channel coefficient between user $n$ and the BS, and $\mathbf{w}^{(t)} \in \mathbb{C}^L$ is the corresponding complex Gaussian noise vector with each element $w_l^{(t)} \sim \mathcal{CN}(w_l^{(t)}; 0, \sigma_w^2)$. Note that the user transmit power $p_n$ is absorbed into the channel coefficient for conciseness. We define $x_n^{(t)} \triangleq a_n^{(t)} h_n^{(t)}$, and the vector $\mathbf{x}^{(t)} \triangleq [x_1^{(t)}, x_2^{(t)}, \ldots, x_N^{(t)}]^T \in \mathbb{C}^N$ forms a sparse vector in each ADT, which is denoted as the access state sparse vector. As a consequence, the system model in (1) can be restated as
$$\mathbf{y}^{(t)} = \mathbf{S}\mathbf{x}^{(t)} + \mathbf{w}^{(t)}, \qquad (2)$$
where $\mathbf{S} \triangleq [\mathbf{s}_1, \ldots, \mathbf{s}_N] \in \mathbb{C}^{L \times N}$. For the sake of presentation, according to (2), we define $\mathbf{a}^{(t)} \triangleq [a_1^{(t)}, a_2^{(t)}, \ldots, a_N^{(t)}]^T$ and $\mathbf{h}^{(t)} \triangleq [h_1^{(t)}, h_2^{(t)}, \ldots, h_N^{(t)}]^T$.
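As a concrete sanity check, the measurement model (1)-(2) can be simulated directly. The sketch below uses illustrative sizes, active ratio, and noise level (these numbers are assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 200, 50            # illustrative sizes: N potential users, pilot length L (N >> L)
lam = 0.1                 # assumed active-user ratio
sigma_w2 = 0.01           # assumed noise variance

# Non-orthogonal pilots: i.i.d. CN(0, 1/L), so each sequence has unit power on average.
S = (rng.standard_normal((L, N)) + 1j * rng.standard_normal((L, N))) / np.sqrt(2 * L)

# Access state sparse vector x = a * h for one ADT.
a = (rng.random(N) < lam).astype(float)        # active-user indicators a_n
h = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)  # CN(0, 1) channels
x = a * h

w = np.sqrt(sigma_w2 / 2) * (rng.standard_normal(L) + 1j * rng.standard_normal(L))
y = S @ x + w             # model (2): y = S x + w
```

The support of `x` coincides with the set of active users, which is what the BS must recover from the underdetermined observation `y`.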
Moreover, $\mathbf{Y} \triangleq \{\mathbf{y}^{(t)}\}_{t=1}^{T} \in \mathbb{C}^{L \times T}$, $\mathbf{X} \triangleq \{\mathbf{x}^{(t)}\}_{t=1}^{T} \in \mathbb{C}^{N \times T}$, $\mathbf{A} \triangleq \{\mathbf{a}^{(t)}\}_{t=1}^{T} \in \mathbb{C}^{N \times T}$, and $\mathbf{H} \triangleq \{\mathbf{h}^{(t)}\}_{t=1}^{T} \in \mathbb{C}^{N \times T}$ are denoted as the collections of $\mathbf{y}^{(t)}$, $\mathbf{x}^{(t)}$, $\mathbf{a}^{(t)}$, $\mathbf{h}^{(t)}$ over $T$ consecutive ADTs, respectively.

Note that for the typical mMTC scenario, where the mobility of the user devices is negligible, the user channel coherence time can be extremely long, so that the user devices may undergo multiple state switches, i.e., on-off, with unpredictable access times within one coherence time duration. Although we still assume a synchronous user access strategy, the duration of an ADP is designed to be much shorter than the length of a coherence block, so that user requests can be responded to in a timely manner. Moreover, user devices are permitted to access multiple times within one coherence block. Such a design fits the sporadic traffic and short packet features well compared with the access schemes in [3], [4], which consider only one ADT in each period of coherence time. Such a design is also compatible with the case where different user devices follow different terminal patterns. For example, when the packet length and the transmission duration of a user device are relatively short, the device may access at one ADT and immediately disconnect from the network at the next. On the contrary, a user with a long transmission duration typically covers several consecutive ADPs. As shown in Fig. 1, there can exist i) multiple consecutive ADTs within one user device transmission, and ii) multiple consecutive ADTs within one coherence time. Obviously, the access state sparse vector $\mathbf{x}^{(t)}$ often exhibits a high degree of correlation from one ADT to the next, which is reflected in two aspects.
First, the active user indicator $\mathbf{a}^{(t)}$, which can be regarded as the support vector of the corresponding sparse vector $\mathbf{x}^{(t)}$, is highly correlated across adjacent ADTs. Second, the channel coefficients $\mathbf{h}^{(t)}$, which are the amplitudes of $\mathbf{x}^{(t)}$, change smoothly across ADTs.

B. Probabilistic Model
To characterize the time variation of the active user indicator $\mathbf{a}^{(t)}$ and the smooth evolution of the channel vector $\mathbf{h}^{(t)}$, we consider the following probabilistic model. We model the change of the $n$th element of the support vector, $a_n^{(t)}$, across time as a Markov chain characterized by a pair of transition probabilities, i.e., $p_n^{(10)} \triangleq \Pr\{a_n^{(t)} = 1 \mid a_n^{(t-1)} = 0\}$ and $p_n^{(01)} \triangleq \Pr\{a_n^{(t)} = 0 \mid a_n^{(t-1)} = 1\}$, and the $N$ users are supposed to form independent Markov chains. We further assume that each chain is in a steady state with $\Pr\{a_n^{(t)} = 1\} = \lambda_n$, which indicates the activation probability of user $n$. Under this condition, the Markov chain of each user $n$ can be specified by the parameters $p_n^{(01)}$ and $\lambda_n$, with the transition probability $p_n^{(10)}$ formulated as $p_n^{(10)} = \lambda_n p_n^{(01)} / (1 - \lambda_n)$. In particular, we assume that in each ADT the active user ratio is $\lambda$ and the ratio of users switching from active to idle is $p_{01}$, and that the transition and activation probabilities are independent of $n$, i.e., $p_n^{(10)} = p_{10}$, $p_n^{(01)} = p_{01}$, $\lambda_n = \lambda$, $\forall n$. Note that such an assumption captures the transition ratio and the active user ratio in each ADT, and has shown its efficiency in a similar application [18]. The probabilistic distribution that specifies the Markov chains can be given as
$$p(a_n^{(t)} \mid a_n^{(t-1)}) = (1 - p_{10})^{(1 - a_n^{(t)})(1 - a_n^{(t-1)})} \, p_{10}^{\,a_n^{(t)}(1 - a_n^{(t-1)})} \, (1 - p_{01})^{a_n^{(t)} a_n^{(t-1)}} \, p_{01}^{\,(1 - a_n^{(t)}) a_n^{(t-1)}}, \qquad (3)$$
where we define $p(a_n^{(1)} \mid a_n^{(0)}) \triangleq p(a_n^{(1)}) = (1 - \lambda)^{1 - a_n^{(1)}} \lambda^{a_n^{(1)}}$.

On the other hand, the smooth evolution of the channel coefficients of all the users can be characterized by a set of independent Gaussian-Markov state-space models.
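The activity chain above can be sketched and checked numerically. The values of $\lambda$ and $p_{01}$ below are assumed for illustration; the empirical active fraction of a long simulated chain should stay close to $\lambda$, consistent with the steady-state relation $p_{10} = \lambda p_{01} / (1-\lambda)$:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, p01 = 0.1, 0.2                  # assumed activation ratio and active->idle probability
p10 = lam * p01 / (1 - lam)          # steady-state relation for idle->active

T = 20000
a = rng.random() < lam               # a^(1) ~ Bernoulli(lambda)
count_active = 0
for _ in range(T):
    count_active += a
    # transition (3): leave the active state w.p. p01, enter it w.p. p10
    a = (rng.random() >= p01) if a else (rng.random() < p10)

frac = count_active / T              # empirical active fraction, close to lam
```

Because the chain is started from its stationary distribution, the long-run fraction of active slots matches the per-slot activation probability.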
Since for each user the probabilistic distribution of the channel coefficient depends highly on the propagation environment, and the geographical location changes negligibly over several channel coherence blocks, the statistical characteristics of the channel coefficient stay unchanged. Hence, we assume a steady-state Gaussian-Markov process for each user, which can be characterized by a first-order autoregressive (AR-1) model [19], [20] as
$$h_n^{(t)} = \eta_n h_n^{(t-1)} + u_n^{(t)}, \qquad (4)$$
where $\eta_n = J_0(2\pi D_n T_b)$ is the AR coefficient that controls the temporal correlation, $J_0(\cdot)$ is the zeroth-order Bessel function of the first kind, $D_n$ is the Doppler frequency of user $n$, and $T_b$ is the time duration of an ADP. We further suppose that the Gaussian-Markov process of user $n$ is in steady state with zero mean and variance $\rho_n$, and the evolution noise $u_n^{(t)}$ is therefore distributed as $u_n^{(t)} \sim \mathcal{CN}(u_n^{(t)}; 0, (1 - \eta_n^2)\rho_n)$. Thus, the Gaussian-Markov process can be specified by the following probabilistic distribution
$$p(h_n^{(t)} \mid h_n^{(t-1)}) = \mathcal{CN}(h_n^{(t)}; \eta_n h_n^{(t-1)}, (1 - \eta_n^2)\rho_n), \qquad (5)$$
where we define $p(h_n^{(1)} \mid h_n^{(0)}) \triangleq p(h_n^{(1)}) = \mathcal{CN}(h_n^{(1)}; 0, \rho_n)$. Further, according to the definition of the sparse vector $\mathbf{x}^{(t)}$, we can infer that the probabilistic distribution of $x_n^{(t)}$ conditioned on $h_n^{(t)}$ and $a_n^{(t)}$ is
$$p(x_n^{(t)} \mid h_n^{(t)}, a_n^{(t)}) = \delta(x_n^{(t)} - h_n^{(t)} a_n^{(t)}), \qquad (6)$$
where $\delta(\cdot)$ is the Dirac delta function. We assume that the channel coefficient $h_n^{(t)}$ evolves independently of the support element $a_n^{(t)}$. Marginalizing out $h_n^{(t)}$ and $a_n^{(t)}$ via (3) and (5), we can obtain the marginal distribution of $x_n^{(t)}$ as [18]
$$p(x_n^{(t)}) = (1 - \lambda)\delta(x_n^{(t)}) + \lambda\,\mathcal{CN}(x_n^{(t)}; 0, \rho_n). \qquad (7)$$
Although this model ignores some user-specific information, e.g., the average transmission duration of each user device may differ, it has shown its efficiency in [18].
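The AR-1 evolution (4) can be simulated to confirm stationarity. Here $\eta$ and $\rho$ are assumed values (in the model $\eta_n = J_0(2\pi D_n T_b)$ would be computed from the Doppler frequency and ADP duration); the time-averaged channel power should remain at $\rho$:

```python
import numpy as np

rng = np.random.default_rng(2)
eta = 0.99        # assumed AR coefficient; in the model, eta_n = J0(2*pi*D_n*T_b)
rho = 1.0         # assumed steady-state channel variance rho_n
T = 50000

h = np.sqrt(rho / 2) * (rng.standard_normal() + 1j * rng.standard_normal())  # h^(1) ~ CN(0, rho)
acc = 0.0
for _ in range(T):
    acc += abs(h)**2
    # evolution noise u ~ CN(0, (1 - eta^2) * rho) keeps the variance at rho
    u = np.sqrt((1 - eta**2) * rho / 2) * (rng.standard_normal() + 1j * rng.standard_normal())
    h = eta * h + u   # model (4)

avg_power = acc / T   # stays near rho: the AR-1 process is stationary
```

The noise variance $(1-\eta^2)\rho$ is exactly what makes $\eta^2\rho + (1-\eta^2)\rho = \rho$, i.e., the steady state asserted below (5).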
The distribution (7) is a Gaussian-Bernoulli distribution, also known as the "spike-and-slab" prior, which is an efficient sparsity-promoting prior with a point mass at $x_n^{(t)} = 0$ [18], [21]. The parameter $\lambda$ controls the fraction of the entries $x_n^{(t)}$ that are expected to be nonzero.

Note that the parameters of the probabilistic model can be specified from specific information about the user devices. The active ratio $\lambda$, the transition probability $p_{01}$, the channel correlation coefficient $\eta_n$, and the channel variance $\rho_n$ can be specified from the user access frequency [3], the user transmission duration [18], the user speed [20], and the distance between the user device and the BS, respectively, all of which are considered known to the BS in this paper.

The goal of the BS is to first detect the user activities and then estimate the corresponding channel coefficient of each active user. This can be done by recovering the access state sparse vector $\mathbf{x}^{(t)}$ in each ADT. Since we have introduced the temporal correlation between the user indicators in (3) and between the channel coefficients in (5), we consider recursively recovering the sparse vector $\mathbf{x}^{(t)}$. This forms a dynamic CS problem, which is different from the sparse recovery in traditional MRA models [3], [10], where the recovery is performed independently with the invariant prior distribution (7). The proposed algorithm to recursively recover the sparse signal is called sequential approximate message passing (S-AMP). The algorithm mainly focuses on the following issues: how to recover the sparse signal in the current ADT with historical knowledge, and how to deliver that knowledge from the current ADT to the next.

III. S-AMP: GRAPH REPRESENTATION AND SCHEDULE
Our inference is based on the message passing framework over a specific factor graph of our system model. In this section, we specify the factor graph and design the message passing schedule for our model.
A. Graph Representation
The factor graph is derived based on the decomposition of a joint distribution. Exploiting the inherent statistical structure of our model (1), the corresponding joint distribution of the sparse signals, active indicators, and channel coefficients can be decomposed as
$$p(\mathbf{X}, \mathbf{A}, \mathbf{H} \mid \mathbf{Y}) = Z^{-1} \prod_{t=1}^{T} \prod_{l=1}^{L} p(y_l^{(t)} \mid \mathbf{x}^{(t)}) \prod_{n=1}^{N} p(x_n^{(t)} \mid a_n^{(t)}, h_n^{(t)}) \, p(a_n^{(t)} \mid a_n^{(t-1)}) \, p(h_n^{(t)} \mid h_n^{(t-1)}), \qquad (8)$$
where $Z$ is a normalization constant and $p(y_l^{(t)} \mid \mathbf{x}^{(t)}) = \mathcal{CN}(y_l^{(t)}; \mathbf{s}_l^T \mathbf{x}^{(t)}, \sigma_w^2)$, with $y_l^{(t)}$ denoting the $l$th element of $\mathbf{y}^{(t)}$ and $\mathbf{s}_l^T$ denoting the $l$th row of the pilot sequence matrix $\mathbf{S}$. We then denote the factor nodes within (8) as $g_l^{(t)}(\mathbf{x}^{(t)}) \triangleq p(y_l^{(t)} \mid \mathbf{x}^{(t)})$, $f_n^{(t)}(x_n^{(t)}, a_n^{(t)}, h_n^{(t)}) \triangleq p(x_n^{(t)} \mid a_n^{(t)}, h_n^{(t)})$, $q_n^{(t)}(a_n^{(t)}, a_n^{(t-1)}) \triangleq p(a_n^{(t)} \mid a_n^{(t-1)})$, and $d_n^{(t)}(h_n^{(t)}, h_n^{(t-1)}) \triangleq p(h_n^{(t)} \mid h_n^{(t-1)})$. Note that the variable node for each observed received signal $y_l^{(t)}$ is absorbed into the factor node $g_l^{(t)}(\mathbf{x}^{(t)})$. The associated factor graph is shown in Fig. 2.

Fig. 2. Factor graph representation of the proposed model.

B. Scheduling the Message Passing
From Fig. 2, we observe that all the variables related to the $t$th ADT can be arranged on a plane, which is referred to as a "frame". The connections between neighboring frames are established by the temporally correlated variables $a_n^{(t)}$, $h_n^{(t)}$ and their corresponding factor nodes $q_n^{(t)}$, $d_n^{(t)}$ for all $n$. We can also observe in Fig. 2 that the temporal correlation between the variable nodes $a_n^{(t)}$ and $h_n^{(t)}$ in each ADT brings additional loops compared with the generic AMP algorithms in traditional MRA models [3], [10]. Therefore, a specific schedule for implementing the message passing within our proposed graph model has to be designed.

The designed message passing schedule from the $t$th frame to the $(t+1)$th frame can be divided into two distinct parts. In the first part, the algorithm focuses on passing the messages within the $t$th frame given the input messages from the $(t-1)$th frame. In the second part, the algorithm focuses on passing the messages from the $t$th frame into the next. Specifically, in the first part of the schedule, the input messages that provide the current beliefs of the temporally correlated variables $a_n^{(t)}$ and $h_n^{(t)}$ are delivered to $x_n^{(t)}$. Then, the node $x_n^{(t)}$ updates the messages with the received signal $\mathbf{y}^{(t)}$ available in the $t$th frame. Finally, the updated messages are output from the node $x_n^{(t)}$. In the second part, the node $x_n^{(t)}$ propagates the messages carrying the updated beliefs to $a_n^{(t)}$ and $h_n^{(t)}$. These messages are then further propagated to the next, $(t+1)$th, frame as the input messages that carry the beliefs of $a_n^{(t+1)}$ and $h_n^{(t+1)}$.

We note that in the first part of the schedule, the algorithm involves message exchanges between the nodes $g_l^{(t)}$ and $x_n^{(t)}$. The matrix $\mathbf{S}$ couples the sequence of nodes $\{x_n^{(t)}\}_{n=1}^{N}$ into each $g_l^{(t)}$, leading to several loops between the nodes $g_l^{(t)}$ and $x_n^{(t)}$.
Hence, the corresponding message passing algorithm in the first part of the schedule has to be executed iteratively. On the contrary, we observe that in the second part of the schedule, the messages are propagated in only one direction. Namely, once the messages are passed from the current frame to the next, they are not fed back to re-update the beliefs in the current frame. As a consequence, the convergence condition of the proposed algorithm is the same as that of the AMP algorithms [9]. For a large but finite-sized i.i.d. Gaussian matrix $\mathbf{S}$, the AMP performance is shown to be close to Bayes-optimal [22]. Moreover, the AMP framework has been shown to perform extremely well in a number of applications in the mMTC literature, such as [3], [10]. In the following sections, we consider the design of the concrete S-AMP algorithm based on this schedule.

IV. S-AMP: AMP BASED ON HISTORICAL KNOWLEDGE-AIDED PRIOR
In this section, we focus on the message passing algorithm in the first part of the schedule, which absorbs the messages from the $(t-1)$th frame and updates the messages with the received signal available in the $t$th frame. We assume that the messages from the $(t-1)$th ADT to the $t$th are known, and we will show that the proposed algorithm in this part, based on such messages, is equivalent to the generic AMP algorithm with a historical knowledge-aided prior.

A. AMP Based on Historical Knowledge-Aided Prior
The first part of the schedule can be further partitioned into three distinct steps, denoted as the "into", "within", and "out" steps, respectively. Specifically, in each ADT, the "into" step passes the messages that provide the current beliefs of the temporally correlated variables $a_n^{(t)}$ and $h_n^{(t)}$, and forms the historical knowledge-aided prior of $x_n^{(t)}$ for each user device $n$. The "within" step utilizes this prior, together with the observations in the $t$th ADT, to generate the posterior estimate of the sparse signal $\mathbf{x}^{(t)}$, aiming to achieve the minimum mean square error (MMSE). The "out" step feeds back the updated messages to every node $x_n^{(t)}$, which will be utilized in the second part of the schedule.
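The overall two-part schedule can be sketched as a forward-only loop over frames. This is a structural sketch only: `amp_frame` is a placeholder (a plain matched filter stands in for the within-frame AMP of Section IV), and `forward_messages` is a plausible instance of propagating beliefs through the $q$- and $d$-nodes under models (3) and (5), not the paper's moment-matched update; all function names are hypothetical:

```python
import numpy as np

def amp_frame(y, S, pi, xi, psi):
    """Placeholder for the within-frame AMP (here a plain matched filter,
    so that the schedule skeleton runs end to end)."""
    return S.conj().T @ y

def forward_messages(pi, xi, psi, eta, rho, p01, lam):
    """One-way belief propagation to the next frame through the q- and d-nodes
    under the Markov model (3) and the AR-1 model (5). In the paper these
    inputs would be moment-matched posteriors, not the raw priors."""
    p10 = lam * p01 / (1 - lam)
    pi_next = pi * (1 - p01) + (1 - pi) * p10        # support belief
    xi_next = eta * xi                               # channel mean decays by eta
    psi_next = eta**2 * psi + (1 - eta**2) * rho     # channel variance
    return pi_next, xi_next, psi_next

def s_amp_schedule(Y, S, lam, rho, eta, p01):
    """Forward-only loop: part 1 within each frame, part 2 into the next."""
    L, T = Y.shape
    N = S.shape[1]
    pi = np.full(N, lam)                 # first-frame prior: pi = lambda
    xi = np.zeros(N, dtype=complex)      # xi = 0
    psi = np.full(N, rho)                # psi = rho
    X_hat = np.zeros((N, T), dtype=complex)
    for t in range(T):
        X_hat[:, t] = amp_frame(Y[:, t], S, pi, xi, psi)                 # part 1
        pi, xi, psi = forward_messages(pi, xi, psi, eta, rho, p01, lam)  # part 2
    return X_hat
```

Note that the forward rule preserves the steady state: starting from $\pi = \lambda$ and $\psi = \rho$, the beliefs remain fixed, matching the stationarity assumptions of Section II-B.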
1) "into" step: Since we adopt a filtering-like framework, no information is conveyed from the not-yet-observed frame $(t+1)$ to the current $t$th frame. Hence, it is equivalent to disconnecting the links between the factor nodes $q_n^{(t+1)}$, $d_n^{(t+1)}$ and the variable nodes $a_n^{(t)}$, $h_n^{(t)}$ in this step. Therefore, the message $\nu_{f_n^{(t)} \to x_n^{(t)}}$ can be derived as
$$\nu_{f_n^{(t)} \to x_n^{(t)}}(x_n^{(t)}) \propto \sum_{a_n^{(t)} \in \{0,1\}} \int_{h_n^{(t)}} f_n^{(t)}(x_n^{(t)}, a_n^{(t)}, h_n^{(t)}) \, \nu_{a_n^{(t)} \to f_n^{(t)}}(a_n^{(t)}) \, \nu_{h_n^{(t)} \to f_n^{(t)}}(h_n^{(t)}) \, \mathrm{d}h_n^{(t)}, \qquad (9)$$
where $\nu_{a_n^{(t)} \to f_n^{(t)}}(a_n^{(t)}) = \nu_{q_n^{(t)} \to a_n^{(t)}}(a_n^{(t)})$ and $\nu_{h_n^{(t)} \to f_n^{(t)}}(h_n^{(t)}) = \nu_{d_n^{(t)} \to h_n^{(t)}}(h_n^{(t)})$, because there is only one edge between $a_n^{(t)}$ and $q_n^{(t)}$ as well as between $h_n^{(t)}$ and $d_n^{(t)}$. The messages $\nu_{q_n^{(t)} \to a_n^{(t)}}(a_n^{(t)})$ and $\nu_{d_n^{(t)} \to h_n^{(t)}}(h_n^{(t)})$ provide the current beliefs of the variables $a_n^{(t)}$ and $h_n^{(t)}$, which absorb the messages from the $(t-1)$th frame carrying historical knowledge. Since the initial forms of the messages $\nu_{q_n^{(1)} \to a_n^{(1)}}(a_n^{(1)})$ and $\nu_{d_n^{(1)} \to h_n^{(1)}}(h_n^{(1)})$ are Bernoulli and Gaussian distributions, respectively, we assume that
$$\nu_{q_n^{(t)} \to a_n^{(t)}}(a_n^{(t)} = 1) \triangleq \hat{\pi}_{n,t}, \qquad \nu_{d_n^{(t)} \to h_n^{(t)}}(h_n^{(t)}) \triangleq \mathcal{CN}(h_n^{(t)}; \hat{\xi}_{n,t}, \hat{\psi}_{n,t}), \qquad (10)$$
where $\hat{\pi}_{n,t}$, $\hat{\xi}_{n,t}$, and $\hat{\psi}_{n,t}$ are the parameters of the prior and are assumed known in this part. In particular, in the first frame we have $\hat{\pi}_{n,1} = \lambda$, $\hat{\xi}_{n,1} = 0$, and $\hat{\psi}_{n,1} = \rho_n$ for each $n$. As a consequence, the prior of the node $x_n^{(t)}$ in (9) can be formulated as
$$\nu_{f_n^{(t)} \to x_n^{(t)}}(x_n^{(t)}) = (1 - \hat{\pi}_{n,t}) \delta(x_n^{(t)}) + \hat{\pi}_{n,t} \, \mathcal{CN}(x_n^{(t)}; \hat{\xi}_{n,t}, \hat{\psi}_{n,t}). \qquad (11)$$
Comparing this historical knowledge-aided prior (11) of $x_n^{(t)}$ with (7), we find that both are "spike-and-slab" prior distributions, which are sparsity-promoting.
The difference is that the updated parameters $\hat{\pi}_{n,t}$, $\hat{\xi}_{n,t}$, and $\hat{\psi}_{n,t}$ in (11) contain historical knowledge about the received signals $\{\mathbf{y}^{(\tau)}\}_{\tau=1}^{t-1}$.
2) "within" step: In this step, our task is to utilize the historical knowledge-aided prior (11), together with the observation $\mathbf{y}^{(t)}$ in the current $t$th ADT, to generate the posterior estimate of the sparse signal $\mathbf{x}^{(t)}$, aiming to achieve the MMSE. The main difficulty of this problem is that the matrix $\mathbf{S}$ mixes the coefficients of $\mathbf{x}^{(t)}$ into $\mathbf{y}^{(t)}$. Fortunately, in the large system limit, i.e., $L, N \to \infty$ with $L/N$ fixed, such a vector-valued estimation problem can be efficiently solved via the generic AMP framework [23], and an estimate of $\mathbf{x}^{(t)}$ based on $\mathbf{y}^{(t)}$ that minimizes the mean-squared error (MSE) can be obtained. For conciseness, we drop the superscript index $t$. The generic AMP initializes $\mu_n^1 = 0$, $z_l^1 = y_l$, and $c^1 \gg \sigma_w^2$ for all $n$ and $l$, and then iterates the following equations in the $i$th iteration:
$$\phi_n^i = \sum_{l=1}^{L} S_{ln}^* z_l^i + \mu_n^i, \qquad (12)$$
$$\mu_n^{i+1} = F_n(\phi_n^i, c^i), \qquad v_n^{i+1} = G_n(\phi_n^i, c^i), \qquad (13)$$
$$z_l^{i+1} = y_l - \sum_{n=1}^{N} S_{ln} \mu_n^{i+1} + \frac{z_l^i}{L} \sum_{n=1}^{N} F_n'(\phi_n^i, c^i), \qquad (14)$$
$$c^{i+1} = \sigma_w^2 + \frac{1}{L} \sum_{n=1}^{N} v_n^{i+1}, \qquad (15)$$
where $F_n'(\phi_n^i, c^i) \triangleq \partial F_n(\phi_n^i, c^i) / \partial \phi_n^i$ is the first derivative of the function $F_n(\phi_n^i, c^i)$ with respect to $\phi_n^i$. Using (11) and (19), together with the definitions, the specific expressions of the above functions are given as
$$F_n(\phi_n^i, c^i) = \left(1 + \gamma_n(\phi_n^i, c^i)\right)^{-1} \frac{\hat{\psi}_n \phi_n^i + \hat{\xi}_n c^i}{\hat{\psi}_n + c^i}, \qquad F_n'(\phi_n^i, c^i) = \frac{1}{c^i} G_n(\phi_n^i, c^i), \qquad (16)$$
$$G_n(\phi_n^i, c^i) = \left(1 + \gamma_n(\phi_n^i, c^i)\right)^{-1} \frac{\hat{\psi}_n c^i}{\hat{\psi}_n + c^i} + \gamma_n(\phi_n^i, c^i) \left|F_n(\phi_n^i, c^i)\right|^2, \qquad (17)$$
where
$$\gamma_n(\phi_n^i, c^i) \triangleq \left(\frac{1 - \hat{\pi}_n}{\hat{\pi}_n}\right) \left(\frac{\hat{\psi}_n + c^i}{c^i}\right) \exp\left(-\frac{\hat{\psi}_n |\phi_n^i|^2 + 2\Re(\hat{\xi}_n^* c^i \phi_n^i) - c^i |\hat{\xi}_n|^2}{c^i (\hat{\psi}_n + c^i)}\right). \qquad (18)$$
After the convergence of the generic AMP iterations (12)-(15), the posterior estimate of the sparse vector $\mathbf{x}$ is given by $\hat{\mathbf{x}} = \boldsymbol{\mu}^I$, where we denote $\boldsymbol{\mu}^I \triangleq [\mu_1^I, \mu_2^I, \ldots, \mu_N^I]^T \in \mathbb{C}^N$ and the index $I$ indicates the maximal number of AMP iterations.
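The iterations (12)-(18) admit a compact implementation. The following is a minimal sketch; the initialization of $c$ to the received-signal power (which satisfies $c^1 \gg \sigma_w^2$ in the sparse regime) and the iteration count are assumed choices, not values from the paper:

```python
import numpy as np

def denoiser(phi, c, pi_h, xi_h, psi_h):
    """Posterior mean F_n and variance G_n under prior (11), eqs. (16)-(18)."""
    # gamma_n, eq. (18)
    expo = -(psi_h * np.abs(phi)**2 + 2 * np.real(np.conj(xi_h) * c * phi)
             - c * np.abs(xi_h)**2) / (c * (psi_h + c))
    gamma = ((1 - pi_h) / pi_h) * ((psi_h + c) / c) * np.exp(expo)
    m = (psi_h * phi + xi_h * c) / (psi_h + c)        # Gaussian-component mean
    F = m / (1 + gamma)                               # eq. (16)
    G = (psi_h * c / (psi_h + c)) / (1 + gamma) + gamma * np.abs(F)**2  # eq. (17)
    return F, G

def amp(y, S, pi_h, xi_h, psi_h, sigma_w2, n_iter=50):
    """Generic AMP iterations (12)-(15) with the knowledge-aided prior."""
    L, N = S.shape
    mu = np.zeros(N, dtype=complex)
    z = y.copy()
    c = np.sum(np.abs(y)**2) / L        # initial c >> sigma_w^2 (assumed choice)
    for _ in range(n_iter):
        phi = S.conj().T @ z + mu                      # eq. (12)
        mu, v = denoiser(phi, c, pi_h, xi_h, psi_h)    # eq. (13)
        z = y - S @ mu + (z / L) * np.sum(v / c)       # eq. (14), using F' = G/c
        c = sigma_w2 + np.sum(v) / L                   # eq. (15)
    return mu
```

On a synthetic instance of model (2) with a first-frame prior ($\hat{\pi}_n = \lambda$, $\hat{\xi}_n = 0$, $\hat{\psi}_n = \rho_n$), this sketch recovers the access state sparse vector with small error.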
3) “out” step:
So far, we have considered the message exchange between nodes $\{x_n\}_{n=1}^N$ and $\{g_l\}_{l=1}^L$ under the historical knowledge-aided prior (11). Next, we derive the output message from $x_n$, i.e., the message passed from $x_n$ to $f_n$, which will be used in the second part of our schedule. In the large system limit, it is reasonable to regard the message $\nu_{g_l \to x_n}^i(x_n)$ as Gaussian because of the Berry-Esseen central limit theorem [23]. This Gaussian quantity in each $i$th iteration can be parameterized by the mean $\mu_{nl}^i$ and variance $v_{nl}^i$ of the message $\nu_{x_n \to g_l}^i(x_n)$. According to [24], the message $\nu_{g_l \to x_n}^i(x_n)$ takes the form $\nu_{g_l \to x_n}^i(x_n) = \mathcal{CN}(S_{ln}x_n; z_{ln}^i, c_{ln}^i)$, where we define $z_{ln}^i \triangleq y_l - \sum_{q \neq n} S_{lq}\mu_{ql}^i$ and $c_{ln}^i \triangleq \sigma_w^2 + \sum_{q \neq n} |S_{lq}|^2 v_{ql}^i$. By applying the fact
$$\prod_q \mathcal{CN}(x; \mu_q, v_q) \propto \mathcal{CN}\!\left(x; \frac{\sum_q \mu_q / v_q}{\sum_q v_q^{-1}}, \Big(\sum_q v_q^{-1}\Big)^{-1}\right), \quad (19)$$
together with the sum-product algorithm, we can obtain
$$\nu_{x_n \to f_n}(x_n) = \prod_l \nu_{g_l \to x_n}^i(x_n) = \mathcal{CN}\!\left(x_n; \sum_l S_{ln}^* z_{ln}^i, c_n^i\right). \quad (20)$$
For further derivation, we utilize the general assumptions of the generic AMP framework that $z_{ln}^i = z_l^i + \delta z_{ln}^i + O(1/N)$ and $\mu_{nl}^i = \mu_n^i + \delta\mu_{nl}^i + O(1/N)$; the mean $z_{ln}^i$ can then be rewritten as
$$z_{ln}^i = y_l - \sum_n S_{ln}\mu_{nl}^i + S_{ln}\mu_{nl}^i + O(1/N) = z_l^i + S_{ln}\mu_n^i + O(1/N). \quad (21)$$
We note that the term $\delta\mu_{nl}^i$ is absorbed into $O(1/N)$, since $S_{ln}$ is itself an $O(1/\sqrt{N})$ term. Then, substituting (21) into (20), after convergence, the message $\nu_{x_n \to f_n}(x_n)$ yields
$$\nu_{x_n \to f_n}(x_n) = \mathcal{CN}\!\left(x_n; \sum_l S_{ln}^* z_l^I + \mu_n^I, c^I\right) = \mathcal{CN}(x_n; \phi_n^I, c^I), \quad (22)$$
where we have utilized the approximation $c_n^i \approx c^i$ [23]. By utilizing the sum-product algorithm [24], the posterior distribution of variable $x_n$ can be obtained by multiplying the messages arriving at node $x_n$ from all directions, i.e., $\nu_{x_n \to f_n}(x_n)$ and $\nu_{f_n \to x_n}(x_n)$.
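As a quick sanity check of the Gaussian product rule (19), its real-valued analogue can be verified numerically; the parameter values below are illustrative only:

```python
import numpy as np

# Real-valued analogue of the Gaussian product rule (19): a normalized product
# of Gaussian densities is again Gaussian, with precision-weighted mean and
# harmonic-sum variance.
mu_q = np.array([1.0, -0.5, 2.0])
v_q = np.array([0.5, 1.0, 2.0])

mean = np.sum(mu_q / v_q) / np.sum(1.0 / v_q)   # precision-weighted mean
var = 1.0 / np.sum(1.0 / v_q)                   # combined variance

x = np.linspace(-5.0, 5.0, 100_001)
prod = np.prod([np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
                for m, v in zip(mu_q, v_q)], axis=0)
prod /= np.sum(prod) * (x[1] - x[0])            # renormalize the product
ref = np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
print(np.max(np.abs(prod - ref)))               # close to zero
```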
Since the message $\nu_{f_n \to x_n}(x_n)$ is considered as the local prior for $x_n$, it follows that $\nu_{x_n \to f_n}(x_n)$ can be regarded as the approximate likelihood function of $x_n$.

B. State Evolution Analysis
One remarkable property of the AMP framework is that the performance of sparse vector recovery can be characterized by the state evolution function when the entries of the sensing matrix are generated from an i.i.d. Gaussian distribution [25]. In the case that the Bernoulli-Gaussian prior of the form (11) is the exact prior distribution and the MMSE denoiser (16) is utilized, equation (15) is exactly the state evolution function in the large system limit. For further analysis, we consider a more general form of the state evolution that applies to an arbitrary denoiser $f_{\theta_n}(\cdot, \theta_n)$, with $\theta_n$ denoting the corresponding parameter set for each user device $n$. The general state evolution function of the AMP framework is given by
$$c^{i+1} = \sigma_w^2 + \frac{N}{L}\,\mathbb{E}\big[\big|f_\Theta(X + \sqrt{c^i}\,V, \Theta) - X\big|^2\big], \quad (23)$$
where $X$, $V$, and $\Theta$ are random variables with $X$ following $p_{X|\Theta}$, $V \sim \mathcal{CN}(0, 1)$, and the probability distribution of $\Theta$ denoted as $p_\Theta$. The expectation is taken over $X$, $V$ and $\Theta$. The state evolution equation (23) predicts the state of AMP accurately in the large system limit under the condition that the empirical distributions of $\{\theta_n\}_{n=1}^N$ and $\{x_n\}_{n=1}^N$ converge to the probability measure $p_{\Theta,X}$. Further, we introduce a random variable $\Phi$, and we can observe from (22) that after the convergence of AMP, the messages passed from each node $x_n$ to $f_n$ take the form of $N$ independent Gaussian distributions. Thus, the sequence of signals $\{\phi_n\}_{n=1}^N$ can be regarded as $N$ independent samples of the random variable $\Phi = X + \sqrt{c^I}\,V$, which can be interpreted as a Gaussian noise-corrupted version of $X$ with noise level $c^I$.
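The scalar recursion (23) can be estimated by Monte Carlo sampling. The sketch below is our own illustration, using a zero-mean Bernoulli-Gaussian prior and hypothetical parameter values (`N_over_L`, `pi0`, `psi0`, `sigma_w2` are assumptions, not values from the paper); it iterates the recursion toward its fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)
N_over_L = 4.0            # measurement ratio N / L (hypothetical)
sigma_w2 = 0.1            # noise variance (hypothetical)
pi0, psi0 = 0.1, 1.0      # zero-mean Bernoulli-Gaussian prior (xi = 0)
M = 200_000               # Monte Carlo sample size

def mmse_denoiser(phi, c):
    # Posterior mean of X given Phi = X + sqrt(c) V; zero-mean special case of (16)
    gamma = ((1 - pi0) / pi0) * ((psi0 + c) / c) * np.exp(
        -psi0 * np.abs(phi) ** 2 / (c * (psi0 + c)))
    return (psi0 * phi / (psi0 + c)) / (1 + gamma)

c = 10.0                  # large initialization, c^0 >> sigma_w^2
for _ in range(50):
    a = rng.random(M) < pi0                                   # activity indicators
    h = np.sqrt(psi0 / 2) * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
    x = a * h                                                 # sparse signal samples
    v = np.sqrt(c / 2) * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
    mse = np.mean(np.abs(mmse_denoiser(x + v, c) - x) ** 2)   # denoiser MSE at level c
    c = sigma_w2 + N_over_L * mse                             # eq. (23)
print(c)                  # approximate fixed point of the state evolution
```

The fixed point of this iteration predicts the per-entry effective noise level that the AMP recursion (12)-(15) reaches at convergence.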
Therefore, in the $t$th ADT, the set of random variables $\Theta^{(t)}$ in (23) can be defined as $\Theta^{(t)} \triangleq \{\Gamma, \{\Phi^{(t')}\}_{t'=1}^{t-1}\}$ with a sequence of realizations $\{\theta_n^{(t)}\}_{n=1}^N$, where $\theta_n^{(t)} \triangleq \{\eta_n, \rho_n, \{\phi_n^{(t')}\}_{t'=1}^{t-1}\}$, and we assume that the empirical distributions of $\{\eta_n\}_{n=1}^N$ and $\{\rho_n\}_{n=1}^N$ converge to the probability measure $p_\Gamma$. As a consequence, in the $t$th ADT, the denoiser $f_{\theta_n}(\cdot, \theta_n)$ for each user device $n$ is designed to recover the random variable $X^{(t)}$ from its noisy version $\Phi^{(t)} = X^{(t)} + \sqrt{c_t}\,V$, with $X^{(t)} \sim p_{X^{(t)}|\Theta^{(t)}}(x_n^{(t)}|\theta_n^{(t)})$, and the term $\mathbb{E}[|f_{\Theta^{(t)}}(X^{(t)} + \sqrt{c_t}\,V, \Theta^{(t)}) - X^{(t)}|^2]$ in the state evolution equation (23) can be interpreted as the MSE of the denoiser in each iteration for a given $c_t$. Note that for conciseness, we have omitted the superscript $i$ that serves to keep track of the iterations. Obviously, the optimal denoiser that achieves the MMSE for each user is given by the expectation of the corresponding posterior distribution $p(x_n^{(t)}|\phi_n^{(t)}, \theta_n^{(t)}; c_t)$. In this condition, the MSE with respect to $c_t$ can be formulated as
$$M(c_t) = \mathbb{E}\big[\mathrm{Var}(X^{(t)}\,|\,\Phi^{(t)}, \Theta^{(t)})\big], \quad (24)$$
where $\mathrm{Var}(X^{(t)}|\Phi^{(t)}, \Theta^{(t)})$ is the conditional variance of the posterior distribution $p_{X^{(t)}|\Phi^{(t)}, \Theta^{(t)}}$ with given $\Phi^{(t)}$ and $\Theta^{(t)}$, and the expectation is taken over both $\Phi^{(t)}$ and $\Theta^{(t)}$. Note that (24) is the achievable MSE in the $t$th ADT with given $c_t$, when the BS knows exactly the historical information $\{\phi_n^{(t')}\}_{t'=1}^{t-1}$ and the model parameters $\eta_n, \rho_n$. To characterize the features of (24), we utilize the law of total variance and reformulate (24) as
$$M(c_t) = \mathbb{E}\big[\mathrm{Var}(X^{(t)}\,|\,\Phi^{(t)}, \Gamma)\big] - \mathbb{E}\big[\mathrm{Var}\big(\mathbb{E}[X^{(t)}\,|\,\Phi^{(t)}, \Gamma, \{\Phi^{(t')}\}_{t'=1}^{t-1}]\;\big|\;\Phi^{(t)}, \Gamma\big)\big], \quad (25)$$
where the term $\mathbb{E}[\mathrm{Var}(X^{(t)}|\Phi^{(t)}, \Gamma)]$ is exactly the achievable MSE without the historical knowledge [10].
To examine the MSE relationship in (25), we expand the second term as
$$\mathbb{E}\big[\mathrm{Var}\big(\mathbb{E}[X^{(t)}|\Phi^{(t)}, \Gamma, \{\Phi^{(t')}\}_{t'=1}^{t-1}]\;\big|\;\Phi^{(t)}, \Gamma\big)\big] = \mathbb{E}_{\Phi^{(t)}, \Gamma}\Big[\mathbb{E}_{\{\Phi^{(t')}\}_{t'=1}^{t-1}}\big[\big|\mathbb{E}[X^{(t)}|\Phi^{(t)}, \Gamma, \{\Phi^{(t')}\}_{t'=1}^{t-1}]\big|^2\big] - \big|\mathbb{E}_{\{\Phi^{(t')}\}_{t'=1}^{t-1}}\big[\mathbb{E}[X^{(t)}|\Phi^{(t)}, \Gamma, \{\Phi^{(t')}\}_{t'=1}^{t-1}]\big]\big|^2\Big]. \quad (26)$$
Note that we have $\mathbb{E}[\mathrm{Var}(\mathbb{E}[X^{(t)}|\Phi^{(t)}, \Gamma, \{\Phi^{(t')}\}_{t'=1}^{t-1}]\,|\,\Phi^{(t)}, \Gamma)] \geq 0$, with equality if $p_{X^{(t)}|\Gamma, \{\Phi^{(t')}\}_{t'=1}^{t-1}}(\cdot) = p_{X^{(t)}|\Gamma}(\cdot)$. Hence, given the historical knowledge, the achievable MSE of the denoiser is reduced by $\mathbb{E}[\mathrm{Var}(\mathbb{E}[X^{(t)}|\Phi^{(t)}, \Gamma, \{\Phi^{(t')}\}_{t'=1}^{t-1}]\,|\,\Phi^{(t)}, \Gamma)]$. From the factor graph perspective, the equality condition holds only in the case that the links between nodes $h_n^{(t-1)}, h_n^{(t)}$ and between nodes $a_n^{(t-1)}, a_n^{(t)}$ vanish. In other words, the achievable MSE with historical knowledge degrades to that without historical knowledge under the same noise level only when there is no temporal correlation between adjacent ADTs. Note that the MSE is a monotonically increasing function of the noise level, so that with the inherent temporal correlations, the historical knowledge benefits the detection performance of AMP in each iteration under the same initialization. To achieve this optimal MSE, one is required to derive the exact posterior distribution, or equivalently to track the exact prior distribution $p_{X^{(t)}|\Gamma, \{\Phi^{(t')}\}_{t'=1}^{t-1}}(\cdot)$, in every ADT. However, as we shall see in the next section, such a requirement is impractical, and we instead seek the optimal tractable approximation under certain constraints to take advantage of the historical knowledge.

V. S-AMP: Derivation of Historical Knowledge-Aided Prior
In this section, we focus on the second part of our schedule, which aims to propagate beliefs from one ADT to the next, deriving the historical knowledge-aided prior of the sparse vector in every ADT. We have claimed that the achievable MSE in each ADT benefits from the historical knowledge, and that the optimal performance is achieved by cycling (12)-(15) until convergence when the prior distribution of the sparse signal is of the form (11). However, in this section, we will observe that the prior distribution does not maintain a consistent form across ADTs. Worse, the number of mixture components grows exponentially, making it impractical to accurately track the prior distribution of the sparse signal. Thus, in this section, we propose an approximation, restricting the forms of messages in each ADT to make the prior distribution tractable. We then find the corresponding approximate historical knowledge-aided prior, which is optimal from the perspective of KL divergence. We have claimed in Section IV-A2 that the generic AMP algorithm in the "within" step of the first part of our schedule decouples the vector-valued estimation problem into a sequence of scalar problems, with the MMSE achieved by tracking the prior distribution $p_{X^{(t)}|\Gamma, \{\Phi^{(t')}\}_{t'=1}^{t-1}}(\cdot)$ in each ADT. This is equivalent to independently tracking the prior distributions $p(x_n^{(t)}|\eta_n, \rho_n, \{\phi_n^{(t')}\}_{t'=1}^{t-1})$ of $N$ switching state-space models (SSSMs) in each ADT with given historical evidence $\{\phi_n^{(t')}\}_{t'=1}^{t-1}$ and model parameters $\eta_n, \rho_n$.
Specifically, for each SSSM related to a particular user device, the evidence $\phi_n^{(t)}$ of the Gauss-Markov state-space model (5) is controlled by a Markov chain (3) and the probability measure $p(\phi_n^{(t)}|h_n^{(t)}, a_n^{(t)}) = \mathcal{CN}(\phi_n^{(t)}; h_n^{(t)}a_n^{(t)}, c_t)$, so that the filter density $p(h_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})$ in the $t$th ADT is specified as $p(h_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) = \sum_{a_n^{(t)} \in \{0,1\}} p(a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})\, p(h_n^{(t)}|a_n^{(t)}, \{\phi_n^{(t')}\}_{t'=1}^{t})$, where
$$p(h_n^{(t)}|a_n^{(t)}, \{\phi_n^{(t')}\}_{t'=1}^{t}) \propto p(\phi_n^{(t)}|h_n^{(t)}, a_n^{(t)}) \int_{h_n^{(t-1)}} p(h_n^{(t)}|h_n^{(t-1)}, a_n^{(t)})\, p(h_n^{(t-1)}|\{\phi_n^{(t')}\}_{t'=1}^{t-1}, a_n^{(t)}), \quad (27)$$
$$p(h_n^{(t-1)}|\{\phi_n^{(t')}\}_{t'=1}^{t-1}, a_n^{(t)} = k) = \sum_{j \in \{0,1\}} \omega_{jk}\, p(h_n^{(t-1)}|a_n^{(t-1)} = j, \{\phi_n^{(t')}\}_{t'=1}^{t-1}), \qquad k \in \{0,1\}, \quad (28)$$
with the weights given as $\omega_{jk} \propto p_{jk}\, p(a_n^{(t-1)} = j|\{\phi_n^{(t')}\}_{t'=1}^{t-1})$. Note that we have omitted the conditioning on $\eta_n$ and $\rho_n$ in the probability expressions, since these parameters remain fixed for each user. It is obvious that the number of components of the filter density $p(h_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})$ grows exponentially, leading to an intractable form of the sparse signal prior distribution after several ADTs.

A. Approximation Strategy
One approach to solving this problem is to restrict the complexity of the prior distribution representation in every ADT, and to find the optimal approximate prior under this restriction, allowing the AMP algorithm to operate on it effectively. Specifically, to maintain the sparsity-promoting feature of the prior, we choose to restrict the approximate prior to the Bernoulli-Gaussian form (11), with the historical knowledge of $\{\phi_n^{(t')}\}_{t'=1}^{t}$ contained in $\hat{\pi}_{n,t}$, $\hat{\xi}_{n,t}$ and $\hat{\psi}_{n,t}$ in (11). We note that the prior distribution newly derived from the previous approximate prior will typically not lie in the restricted family, so the approximation must be performed in every ADT. One may be concerned that the errors will grow out of control over extended periods of time due to the accumulation of repeated approximations. Fortunately, the authors in [26] have shown that this problem does not occur, because the mere stochasticity of the process serves to attenuate the effects of errors over time, fast enough to prevent the accumulated error from growing unboundedly. To measure the similarity between a distribution and an approximation to it, we introduce the KL divergence, defined as
$$D[p(x)\,\|\,q(x)] = \mathbb{E}_p\left[\ln\frac{p(x)}{q(x)}\right] = \int_x p(x)\ln\frac{p(x)}{q(x)}. \quad (29)$$
The integral is replaced by a summation when a discrete random variable is considered. The KL divergence, or relative entropy in the information theory literature [27], is a very natural measure to quantify the information loss or inefficiency incurred by using distribution $q(x)$ when the true distribution is $p(x)$ [28].
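For intuition, (29) can be checked numerically against the closed-form Gaussian KL divergence; the parameter values below are illustrative only:

```python
import numpy as np

# Numerical evaluation of D[p || q] in (29) for p = N(0,1) and q = N(mu, s2),
# compared against the closed-form KL divergence between two real Gaussians.
mu, s2 = 0.5, 2.0
x = np.linspace(-12.0, 12.0, 200_001)
dx = x[1] - x[0]
p = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
q = np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
kl_num = np.sum(p * np.log(p / q)) * dx                 # Riemann sum of (29)
kl_closed = 0.5 * (np.log(s2) + (1 + mu ** 2) / s2 - 1) # closed form
print(kl_num, kl_closed)                                # the two agree; both > 0
```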
One important feature of the KL divergence is that it satisfies $D[p(x)\,\|\,q(x)] \geq 0$, with equality if, and only if, $p(x) = q(x)$, so minimizing (29) can be regarded as minimizing the loss of information incurred by the approximation. Hence, one intuition is to minimize, in each ADT, the KL divergence between the newly derived prior distribution and the one restricted to the Bernoulli-Gaussian family. However, handling a Bernoulli-Gaussian distribution from the KL divergence perspective is not straightforward. The following proposition provides a reasonable alternative for this problem.

Proposition 1: We define $q(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})$ as the approximate posterior distribution of user device $n$ in the $t$th ADT, and $\tilde{p}(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})$ as the posterior distribution derived from $q(h_n^{(t-1)}, a_n^{(t-1)}|\{\phi_n^{(t')}\}_{t'=1}^{t-1})$. Then, the following inequalities hold:
$$D\big[\tilde{p}(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})\,\big\|\,q(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})\big] \geq D\big[\tilde{p}(h_n^{(t+1)}, a_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t})\,\big\|\,q(h_n^{(t+1)}, a_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t})\big] \geq D\big[\tilde{p}(x_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t})\,\big\|\,q(x_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t})\big], \quad (30)$$
where we have
$$q(h_n^{(t+1)}, a_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) \triangleq \int_{h_n^{(t)}} \sum_{a_n^{(t)}} q(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})\, p(a_n^{(t+1)}|a_n^{(t)})\, p(h_n^{(t+1)}|h_n^{(t)}), \quad (31)$$
$$q(x_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) \triangleq \int_{h_n^{(t+1)}} \sum_{a_n^{(t+1)}} q(h_n^{(t+1)}, a_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t})\, \delta(x_n^{(t+1)} - h_n^{(t+1)}a_n^{(t+1)}), \quad (32)$$
$$\tilde{p}(h_n^{(t+1)}, a_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) \triangleq \int_{h_n^{(t)}} \sum_{a_n^{(t)}} \tilde{p}(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})\, p(a_n^{(t+1)}|a_n^{(t)})\, p(h_n^{(t+1)}|h_n^{(t)}), \quad (33)$$
$$\tilde{p}(x_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) \triangleq \int_{h_n^{(t+1)}} \sum_{a_n^{(t+1)}} \tilde{p}(h_n^{(t+1)}, a_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t})\, \delta(x_n^{(t+1)} - h_n^{(t+1)}a_n^{(t+1)}). \quad (34)$$

Proof 1:
Please see Appendix A. We can observe that the KL divergence between the posterior distribution and its approximation never increases under transitions through the stochastic processes. This provides us an alternative: dealing with the posterior distribution of $h_n^{(t)}$ and $a_n^{(t)}$ in each ADT instead of the prior of $x_n^{(t)}$. Thus, in each ADT, after running the generic AMP framework, we derive the corresponding posterior distribution $\tilde{p}(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})$ and find its optimal approximation from a restricted family in the KL divergence sense. In this paper, we choose to represent the approximate posterior $q(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})$ in the $t$th ADT using a parametric family expressed as a product of Gaussian and Bernoulli distributions. We will observe that such a representation makes the approximate prior $q(x_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t})$ of $x_n^{(t+1)}$ in the $(t+1)$th ADT take the Bernoulli-Gaussian form (11). Specifically, the parametric family can be expressed as
$$q(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) = q(h_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})\, q(a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t}), \quad (35)$$
where $q(h_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) = \mathcal{CN}(h_n^{(t)}; \bar{\xi}_{n,t}, \bar{\psi}_{n,t})$ and $q(a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) = \bar{\pi}_{n,t}^{a_n^{(t)}}(1-\bar{\pi}_{n,t})^{1-a_n^{(t)}}$, respectively. Note that the parametric family (35) is an exponential family, so that minimizing the KL divergence $D[\tilde{p}(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})\,\|\,q(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})]$ is equivalent to adjusting the parameters $\bar{\xi}_{n,t}$, $\bar{\psi}_{n,t}$ and $\bar{\pi}_{n,t}$ such that the moments $\mathbb{E}[h_n^{(t)}]$, $\mathbb{E}[|h_n^{(t)}|^2]$ and $\mathbb{E}[a_n^{(t)}]$ match for both distributions $q(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})$ and $\tilde{p}(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})$ [29].
The moment matching equations can be given as
$$\bar{\pi}_{n,t} = \tilde{p}(a_n^{(t)} = 1|\{\phi_n^{(t')}\}_{t'=1}^{t}), \quad (36)$$
$$\bar{\xi}_{n,t} = \int_{h_n^{(t)}} h_n^{(t)}\, \tilde{p}(h_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t}), \quad (37)$$
$$\bar{\psi}_{n,t} = \int_{h_n^{(t)}} |h_n^{(t)}|^2\, \tilde{p}(h_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) - |\bar{\xi}_{n,t}|^2, \quad (38)$$
where $\tilde{p}(h_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})$ and $\tilde{p}(a_n^{(t)} = 1|\{\phi_n^{(t')}\}_{t'=1}^{t})$ are the marginals of $\tilde{p}(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})$.

B. Derivation of Historical Knowledge-Aided Prior
We can observe from (31) and (32) that deriving the historical knowledge-aided prior $q(x_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t})$ in the $(t+1)$th ADT is equivalent to deriving the approximate posterior distribution $q(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})$, which requires specifying the corresponding marginals in (36)-(38). Recursively, we derive such marginals from the given approximation $q(h_n^{(t-1)}, a_n^{(t-1)}|\{\phi_n^{(t')}\}_{t'=1}^{t-1})$ of the previous ADT. Recalling the factor graph representation of our model in Fig. 2, approximating the posterior distribution $\tilde{p}(h_n^{(t-1)}, a_n^{(t-1)}|\{\phi_n^{(t')}\}_{t'=1}^{t-1})$ by a family $q(h_n^{(t-1)}, a_n^{(t-1)}|\{\phi_n^{(t')}\}_{t'=1}^{t-1})$ represented as a product of factors is equivalent to disconnecting the link between nodes $a_n^{(t-1)}$ and $h_n^{(t-1)}$. Moreover, restricting the forms of the distributions $q(h_n^{(t-1)}|\{\phi_n^{(t')}\}_{t'=1}^{t-1})$ and $q(a_n^{(t-1)}|\{\phi_n^{(t')}\}_{t'=1}^{t-1})$ is equivalent to restricting the forms of the messages $\nu_{a_n^{(t-1)} \to q_n^{(t)}}(a_n^{(t-1)})$ and $\nu_{h_n^{(t-1)} \to d_n^{(t)}}(h_n^{(t-1)})$ that pass from the $(t-1)$th ADT to the next. Specifically, we have
$$\nu_{a_n^{(t-1)} \to q_n^{(t)}}(a_n^{(t-1)}) = q(a_n^{(t-1)}|\{\phi_n^{(t')}\}_{t'=1}^{t-1}) = \bar{\pi}_{n,t-1}^{a_n^{(t-1)}}(1-\bar{\pi}_{n,t-1})^{1-a_n^{(t-1)}}, \quad (39)$$
$$\nu_{h_n^{(t-1)} \to d_n^{(t)}}(h_n^{(t-1)}) = q(h_n^{(t-1)}|\{\phi_n^{(t')}\}_{t'=1}^{t-1}) = \mathcal{CN}(h_n^{(t-1)}; \bar{\xi}_{n,t-1}, \bar{\psi}_{n,t-1}). \quad (40)$$
According to the structure of Fig. 2, the following two messages can then be specified via
$$\nu_{q_n^{(t)} \to a_n^{(t)}}(a_n^{(t)}) = \sum_{a_n^{(t-1)}} q_n^{(t)}(a_n^{(t)}, a_n^{(t-1)})\, \nu_{a_n^{(t-1)} \to q_n^{(t)}}(a_n^{(t-1)}), \quad (41)$$
$$\nu_{d_n^{(t)} \to h_n^{(t)}}(h_n^{(t)}) = \int_{h_n^{(t-1)}} d_n^{(t)}(h_n^{(t)}, h_n^{(t-1)})\, \nu_{h_n^{(t-1)} \to d_n^{(t)}}(h_n^{(t-1)}). \quad (42)$$
We denote $\nu_{q_n^{(t)} \to a_n^{(t)}}(a_n^{(t)} = 1) \triangleq \hat{\pi}_{n,t}$ and $\nu_{d_n^{(t)} \to h_n^{(t)}}(h_n^{(t)}) \triangleq \mathcal{CN}(h_n^{(t)}; \hat{\xi}_{n,t}, \hat{\psi}_{n,t})$.
Plugging (3) and (39) into (41), and substituting (5) and (40) into (42), we obtain
$$\hat{\pi}_{n,t} = p_{01}(1-\bar{\pi}_{n,t-1}) + (1-p_{10})\bar{\pi}_{n,t-1}, \qquad \hat{\xi}_{n,t} = \eta_n \bar{\xi}_{n,t-1}, \qquad \hat{\psi}_{n,t} = \eta_n^2 \bar{\psi}_{n,t-1} + (1-\eta_n^2)\rho_n. \quad (43)$$
We emphasize again that we focus on a filtering-like problem, where only the knowledge of the previous ADTs can be utilized in the current ADT, so we can omit the links between nodes $a_n^{(t)}, a_n^{(t+1)}$ as well as $h_n^{(t)}, h_n^{(t+1)}$ for each $n$ when performing inference in the $t$th ADT. Then the corresponding marginals in (36)-(38) can be specified via
$$\tilde{p}(a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) \propto \nu_{f_n^{(t)} \to a_n^{(t)}}(a_n^{(t)})\, \nu_{q_n^{(t)} \to a_n^{(t)}}(a_n^{(t)}), \quad (44)$$
$$\tilde{p}(h_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) \propto \nu_{f_n^{(t)} \to h_n^{(t)}}(h_n^{(t)})\, \nu_{d_n^{(t)} \to h_n^{(t)}}(h_n^{(t)}). \quad (45)$$
Specifically, the explicit expressions of the following messages can be obtained as
$$\nu_{f_n^{(t)} \to a_n^{(t)}}(a_n^{(t)}) \propto \int_{h_n^{(t)}} \int_{x_n^{(t)}} f_n^{(t)}(x_n^{(t)}, a_n^{(t)}, h_n^{(t)}) \cdot \nu_{x_n^{(t)} \to f_n^{(t)}}(x_n^{(t)}) \cdot \nu_{h_n^{(t)} \to f_n^{(t)}}(h_n^{(t)}), \quad (46)$$
$$\nu_{f_n^{(t)} \to h_n^{(t)}}(h_n^{(t)}) \propto \sum_{a_n^{(t)} \in \{0,1\}} \int_{x_n^{(t)}} f_n^{(t)}(x_n^{(t)}, a_n^{(t)}, h_n^{(t)}) \cdot \nu_{x_n^{(t)} \to f_n^{(t)}}(x_n^{(t)}) \cdot \nu_{a_n^{(t)} \to f_n^{(t)}}(a_n^{(t)}). \quad (47)$$
Under the structure of our factor graph, the following two relationships always hold: $\nu_{a_n^{(t)} \to f_n^{(t)}}(a_n^{(t)}) = \nu_{q_n^{(t)} \to a_n^{(t)}}(a_n^{(t)})$ and $\nu_{h_n^{(t)} \to f_n^{(t)}}(h_n^{(t)}) = \nu_{d_n^{(t)} \to h_n^{(t)}}(h_n^{(t)})$.
As a consequence, we specify the posterior distributions (44) and (45) as
$$\tilde{p}(a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) \propto (\hat{\pi}_{n,t}\tilde{\pi}_{n,t})^{a_n^{(t)}}\big[(1-\hat{\pi}_{n,t})(1-\tilde{\pi}_{n,t})\big]^{1-a_n^{(t)}}, \quad (48)$$
$$\tilde{p}(h_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) \propto \hat{\pi}_{n,t}\tilde{\pi}_{n,t}\,\mathcal{CN}(h_n^{(t)}; \tilde{\tau}_{n,t}, \tilde{\kappa}_{n,t}) + (1-\hat{\pi}_{n,t})(1-\tilde{\pi}_{n,t})\,\mathcal{CN}(h_n^{(t)}; \hat{\xi}_{n,t}, \hat{\psi}_{n,t}), \quad (49)$$
where we define
$$\tilde{\pi}_{n,t} \triangleq \left(1 + \left(\frac{\hat{\pi}_{n,t}}{1-\hat{\pi}_{n,t}}\right)\gamma_{n,t}(\phi_n^{(t)}, c_t)\right)^{-1}, \quad (50)$$
$$\tilde{\kappa}_{n,t} \triangleq \frac{c_t\hat{\psi}_{n,t}}{c_t + \hat{\psi}_{n,t}}, \qquad \tilde{\tau}_{n,t} \triangleq \tilde{\kappa}_{n,t}\left(\frac{\phi_n^{(t)}}{c_t} + \frac{\hat{\xi}_{n,t}}{\hat{\psi}_{n,t}}\right). \quad (51)$$
After that, we can obtain the explicit expression of the approximate distribution $q(h_n^{(t)}, a_n^{(t)}|\{\phi_n^{(t')}\}_{t'=1}^{t})$ in the form (35) by combining the moment matching equations (36)-(38) with the marginal expressions (48) and (49). As a consequence, we have
$$\bar{\pi}_{n,t} = \frac{\hat{\pi}_{n,t}\tilde{\pi}_{n,t}}{\hat{\pi}_{n,t}\tilde{\pi}_{n,t} + (1-\hat{\pi}_{n,t})(1-\tilde{\pi}_{n,t})}, \qquad \bar{\xi}_{n,t} = \bar{\pi}_{n,t}\tilde{\tau}_{n,t} + (1-\bar{\pi}_{n,t})\hat{\xi}_{n,t}, \quad (52)$$
$$\bar{\psi}_{n,t} = \bar{\pi}_{n,t}\big(|\tilde{\tau}_{n,t}|^2 + \tilde{\kappa}_{n,t}\big) + (1-\bar{\pi}_{n,t})\big(|\hat{\xi}_{n,t}|^2 + \hat{\psi}_{n,t}\big) - |\bar{\xi}_{n,t}|^2. \quad (53)$$
The historical knowledge-aided prior can then be formulated via (31) and (32). According to (39) and (40), executing (31) is equivalent to implementing (41), (42) and further implementing
$$q(h_n^{(t+1)}, a_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) = q(a_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t})\, q(h_n^{(t+1)}|\{\phi_n^{(t')}\}_{t'=1}^{t}) = \nu_{q_n^{(t+1)} \to a_n^{(t+1)}}(a_n^{(t+1)})\, \nu_{d_n^{(t+1)} \to h_n^{(t+1)}}(h_n^{(t+1)}). \quad (54)$$
In addition, we notice that equation (32) is equivalent to (9) with index $(t+1)$. Observing (41) and (42), the messages $\nu_{q_n^{(t)} \to a_n^{(t)}}(a_n^{(t)} = 1)$ and $\nu_{d_n^{(t)} \to h_n^{(t)}}(h_n^{(t)})$ have the same forms as (10) assumed in Section IV-A1. Hence, the historical knowledge-aided prior of the sparse vector $\mathbf{x}^{(t)}$ in each $t$th ADT maintains the consistent Bernoulli-Gaussian form (11), with the corresponding parameters specified by (43).
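The per-user belief propagation between ADTs, i.e., the moment matching (50)-(53) followed by the forward propagation (43), can be sketched as one scalar update. This is our own illustration (`s_amp_prior_update` is a hypothetical helper name); the exponents on $\eta_n$ and the roles of $p_{01}$, $p_{10}$ in (43) follow the Gauss-Markov/Markov model as reconstructed here:

```python
import numpy as np

def s_amp_prior_update(phi, c_t, pi_h, xi_h, psi_h, eta, rho, p01, p10):
    """Collapse the two-component posterior (49) to Bernoulli-Gaussian form by
    moment matching (50)-(53), then propagate through the Markov activity chain
    and Gauss-Markov channel dynamics to get the next-ADT prior parameters (43)."""
    # likelihood-ratio quantity gamma of eq. (18), then pi-tilde of eq. (50)
    expo = -(psi_h * abs(phi) ** 2 + 2 * (np.conj(xi_h) * c_t * phi).real
             - c_t * abs(xi_h) ** 2) / (c_t * (psi_h + c_t))
    gamma = ((1 - pi_h) / pi_h) * ((psi_h + c_t) / c_t) * np.exp(expo)
    pi_t = 1.0 / (1.0 + (pi_h / (1 - pi_h)) * gamma)
    # active-component posterior moments, eq. (51)
    kappa = c_t * psi_h / (c_t + psi_h)
    tau = kappa * (phi / c_t + xi_h / psi_h)
    # moment matching, eqs. (52)-(53)
    pi_b = pi_h * pi_t / (pi_h * pi_t + (1 - pi_h) * (1 - pi_t))
    xi_b = pi_b * tau + (1 - pi_b) * xi_h
    psi_b = (pi_b * (abs(tau) ** 2 + kappa)
             + (1 - pi_b) * (abs(xi_h) ** 2 + psi_h) - abs(xi_b) ** 2)
    # propagate one ADT ahead, eq. (43)
    pi_next = p01 * (1 - pi_b) + (1 - p10) * pi_b
    xi_next = eta * xi_b
    psi_next = eta ** 2 * psi_b + (1 - eta ** 2) * rho
    return pi_next, xi_next, psi_next
```

Applied entrywise after each ADT's AMP run, this update yields the Bernoulli-Gaussian prior parameters used in the next ADT.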
This implies that the generic AMP update equations in each ADT are (12)-(15). Somewhat differently, since equation (15) is equivalent to the state evolution equation (23) only when the exact prior distribution of $\mathbf{x}^{(t)}$ is of the form (11), the approximations in our proposed method will cause a loss of accuracy. To be more accurate, we consider an empirical alternative to (15), given as $c^{i+1} = \frac{1}{L}\|\mathbf{z}^{i+1}\|_2^2$, where $\mathbf{z}^{i+1} \triangleq [z_1^{i+1}, z_2^{i+1}, \ldots, z_L^{i+1}]^T \in \mathbb{C}^L$, which can be used as an efficient approximation of the state evolution [30]. We summarize our proposed S-AMP in Algorithm 1. The main computational burden of the proposed method lies in the generic AMP equations of Section IV-A2, resulting in a per-iteration complexity of $O(LN)$, which is equivalent to that of the algorithms proposed in [3], [10].

Algorithm 1 Proposed S-AMP algorithm
Input: received signals $\{\mathbf{y}^{(t)}\}_{t=1}^T$; pilot matrix $\mathbf{S}$.
Output: recovered sparse vectors $\{\hat{\mathbf{x}}^{(t)}\}_{t=1}^T$.
for $t \leq T$ do
  Initialization: $\mu_n^0 = 0$, $z_l^0 = y_l$, and $c^0 \gg \sigma_w^2$ for all $n$ and $l$ (we drop the index $t$).
  for $i \leq I$ do
    $\forall n$, calculate the AMP equations with $\hat{\pi}_{n,t}$, $\hat{\xi}_{n,t}$ and $\hat{\psi}_{n,t}$:
    $\phi_n^i = \sum_{l=1}^{L} S_{ln}^* z_l^i + \mu_n^i$,
    $\mu_n^{i+1} = F_n(\phi_n^i, c^i)$, $\quad v_n^{i+1} = G_n(\phi_n^i, c^i)$,
    $z_l^{i+1} = y_l - \sum_{n=1}^{N} S_{ln}\mu_n^{i+1} + \frac{z_l^i}{L}\sum_{n=1}^{N} F_n'(\phi_n^i, c^i)$,
    $c^{i+1} = \frac{1}{L}\|\mathbf{z}^{i+1}\|_2^2$.
  end for
  return $\hat{\mathbf{x}}^{(t)} = \boldsymbol{\mu}^I$.
  $\forall n$, calculate the following parameters:
  $\tilde{\pi}_{n,t} = \Big(1 + \Big(\frac{\hat{\pi}_{n,t}}{1-\hat{\pi}_{n,t}}\Big)\gamma_{n,t}(\phi_n^{(t)}, c_t)\Big)^{-1}$, $\quad \tilde{\kappa}_{n,t} = \frac{c_t\hat{\psi}_{n,t}}{c_t + \hat{\psi}_{n,t}}$, $\quad \tilde{\tau}_{n,t} = \tilde{\kappa}_{n,t}\Big(\frac{\phi_n^{(t)}}{c_t} + \frac{\hat{\xi}_{n,t}}{\hat{\psi}_{n,t}}\Big)$,
  $\bar{\pi}_{n,t} = \frac{\hat{\pi}_{n,t}\tilde{\pi}_{n,t}}{\hat{\pi}_{n,t}\tilde{\pi}_{n,t} + (1-\hat{\pi}_{n,t})(1-\tilde{\pi}_{n,t})}$, $\quad \bar{\xi}_{n,t} = \bar{\pi}_{n,t}\tilde{\tau}_{n,t} + (1-\bar{\pi}_{n,t})\hat{\xi}_{n,t}$,
  $\bar{\psi}_{n,t} = \bar{\pi}_{n,t}(|\tilde{\tau}_{n,t}|^2 + \tilde{\kappa}_{n,t}) + (1-\bar{\pi}_{n,t})(|\hat{\xi}_{n,t}|^2 + \hat{\psi}_{n,t}) - |\bar{\xi}_{n,t}|^2$,
  $\hat{\pi}_{n,t+1} = p_{01}(1-\bar{\pi}_{n,t}) + (1-p_{10})\bar{\pi}_{n,t}$, $\quad \hat{\xi}_{n,t+1} = \eta_n\bar{\xi}_{n,t}$, $\quad \hat{\psi}_{n,t+1} = \eta_n^2\bar{\psi}_{n,t} + (1-\eta_n^2)\rho_n$.
end for

So far, we have given all the derivations of our proposed S-AMP algorithm. In the following, we consider the concrete active user detection and channel estimation strategy. As we shall see, based on our proposed S-AMP algorithm, the user detector and channel estimator can be designed directly.

VI. Active User Detection and Channel Estimation
After the generic AMP equations in the proposed S-AMP algorithm converge in each ADT, we focus on the joint active user detection and channel estimation. The hypothesis test to identify the active devices is given by
$$\mathcal{H}_1: a_n^{(t)} = 1, \text{ active device}; \qquad \mathcal{H}_0: a_n^{(t)} = 0, \text{ inactive device}. \quad (55)$$
Applying (42), the LLR test rule with decision threshold $l_{n,t}$ in the $t$th ADT is given by
$$\mathrm{LLR} = \log\left(\frac{\tilde{p}(\phi_n^{(t)}|a_n^{(t)} = 1, \{\phi_n^{(t')}\}_{t'=1}^{t-1})}{\tilde{p}(\phi_n^{(t)}|a_n^{(t)} = 0, \{\phi_n^{(t')}\}_{t'=1}^{t-1})}\right) \underset{\mathcal{H}_0}{\overset{\mathcal{H}_1}{\gtrless}} l_{n,t}, \quad (56)$$
where we define the approximate likelihood function as $\tilde{p}(\phi_n^{(t)}|a_n^{(t)}, \{\phi_n^{(t')}\}_{t'=1}^{t-1}) \triangleq \int_{h_n^{(t)}} p(\phi_n^{(t)}|h_n^{(t)}, a_n^{(t)})\, \nu_{d_n^{(t)} \to h_n^{(t)}}(h_n^{(t)})$. Specifically, we have
$$\tilde{p}(\phi_n^{(t)}|a_n^{(t)} = 0, \{\phi_n^{(t')}\}_{t'=1}^{t-1}) = \mathcal{CN}(\phi_n^{(t)}; 0, c_t), \quad (57)$$
$$\tilde{p}(\phi_n^{(t)}|a_n^{(t)} = 1, \{\phi_n^{(t')}\}_{t'=1}^{t-1}) = \mathcal{CN}(\phi_n^{(t)}; \hat{\xi}_{n,t}, \hat{\psi}_{n,t} + c_t). \quad (58)$$
We notice that this is a general Gaussian hypothesis testing problem [31]. As a consequence, the sufficient statistic of the LLR detector in the $t$th ADT for each user $n$ is given as
$$T(\phi_n^{(t)}) = |\phi_n^{(t)}|^2 + 2c_t\hat{\psi}_{n,t}^{-1}\Re(\hat{\xi}_{n,t}^*\phi_n^{(t)}) + c_t^2\hat{\psi}_{n,t}^{-2}|\hat{\xi}_{n,t}|^2 = \big|\phi_n^{(t)} + c_t\hat{\psi}_{n,t}^{-1}\hat{\xi}_{n,t}\big|^2, \quad (59)$$
which contains both quadratic and linear terms in the variable $\phi_n^{(t)}$. Examining the functional form of the sufficient statistic $T(\phi_n^{(t)})$, we see that if we assume a zero-mean local prior, i.e., $\hat{\xi}_{n,t} = 0$, the proposed detector reduces to the traditional energy detector derived in [3], [10]. Another crucial issue for the detector is the design of the decision criterion. In the detection literature, the basic criteria are the Neyman-Pearson and the Bayes criterion [31]. In this paper, we consider the Bayes criterion for the following reasons. First, the Bayes criterion exploits the prior knowledge, which is neglected by the Neyman-Pearson criterion.
Second, we are mainly concerned with the detection error probability in our scenario, defined as the sum of the false alarm and missed detection probabilities, and the Bayes criterion achieves the minimum detection error probability given the exact prior distribution. Further, we can observe that the sufficient statistic $T(\phi_n^{(t)})$ in (59) is derived from the parameters $\hat{\xi}_{n,t}$, $\hat{\psi}_{n,t}$, $c_t$, which incorporate the historical knowledge of $\{\phi_n^{(t')}\}_{t'=1}^{t-1}$ and cannot be predicted beforehand. Since the detection threshold under the Neyman-Pearson criterion is designed based on the map between performance metrics, which is not explicit because of the linear term in $T(\phi_n^{(t)})$ [31], it is impractical to design the threshold instantaneously. On the contrary, under the Bayes criterion, the LLR test is equivalent to computing the posterior probabilities of the two hypotheses and choosing the larger one. Note that under the S-AMP framework, the acquisition of the approximate
Fig. 3. Normalized fixed point of the state evolution function.

posterior probabilities is very straightforward, as we have derived in (48). Specifically, the detector under the Bayes criterion can be written as
$$\text{decide } \mathcal{H}_1, \text{ if } \frac{\tilde{p}(a_n^{(t)} = 1|\{\phi_n^{(t')}\}_{t'=1}^{t})}{\tilde{p}(a_n^{(t)} = 0|\{\phi_n^{(t')}\}_{t'=1}^{t})} \geq 1; \qquad \mathcal{H}_0, \text{ otherwise}. \quad (60)$$
After active user detection, the channel estimation for the active users can be implemented directly by using the recovered sparse vector as the channel estimate, i.e., we take $\hat{h}_n^{(t)} = \hat{x}_n^{(t)}$ for each user device $n$, where $\hat{x}_n^{(t)}$ is the corresponding element of $\hat{\mathbf{x}}^{(t)}$.

VII. Numerical Results
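The Bayes detector (60), built from the Gaussian likelihoods (57)-(58) and the prior $\hat{\pi}_{n,t}$, can be sketched per user as follows (`bayes_detect` is our own illustrative helper; the input values in the usage line are hypothetical):

```python
import numpy as np

def bayes_detect(phi, c_t, pi_h, xi_h, psi_h):
    """Bayes activity detector (60): declare 'active' when the posterior
    probability of a = 1 is at least that of a = 0."""
    def cn_pdf(x, mean, var):
        # density of a circularly-symmetric complex Gaussian CN(mean, var)
        return np.exp(-abs(x - mean) ** 2 / var) / (np.pi * var)
    num = pi_h * cn_pdf(phi, xi_h, psi_h + c_t)       # active hypothesis, eq. (58)
    den = (1 - pi_h) * cn_pdf(phi, 0.0, c_t)          # inactive hypothesis, eq. (57)
    return num >= den

# a pseudo-observation far from zero is flagged active
print(bayes_detect(2.0 + 0.0j, 0.05, 0.1, 0.0 + 0.0j, 1.0))
```

The corresponding channel estimate for a user declared active is simply the recovered sparse-vector entry, as stated above.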
In this section, numerical examples are provided to verify our theoretical results. We simulate an mMTC system with $N = 2000$ devices. The user devices are randomly located in a cell, with distance $d_n$ between the $n$th user and the BS; the distances $d_n$, $n = 1, \ldots, N$, are randomly distributed over the cell region. The large-scale fading between each user and the BS is modeled as a log-distance path loss $\beta_n$ in dB. The power spectral density of the AWGN at the BS and the wireless channel bandwidth are fixed throughout. The duration of an ADP is denoted as $T_s$. For simplicity, we consider the same transmission power for each user, i.e., $P_n = P$. The average speed of each user is randomly distributed, and the carrier frequency is fixed. Since the temporal correlation of the user activity indicator is specified by $\lambda$ and the transition probabilities, we assume $p_{10} = r(1-\lambda)$ and $p_{01} = r\lambda$, where $r$ denotes a
Fig. 4. NMSE performance of sparse vector recovery.

scale factor that controls the specific value of the transition probability. For the performance metrics, we utilize the normalized mean squared error (NMSE) to illustrate the sparse vector recovery and channel estimation performance, while the active user detection performance is measured by the detection error probability (DEP). We first fix the access probability $\lambda$ of users in each ADT, the scale $r$, and the duration of an ADP as $T_s = 100\,\mu$s. To show the advantage of the proposed S-AMP algorithm, we compare our results with conventional CS-based algorithms. In the non-Bayesian framework, we consider the classical OMP algorithm [32] and R-PIA-ASP [14], the most efficient existing method utilizing the temporal correlation of the user support. Further, we consider the oracle LS algorithm, which assumes the true active user support set is exactly known at the BS. In the Bayesian literature, we consider the classical AMP algorithm [33] and the AMP-MMSE algorithm [3], [10] as the counterparts of our proposed S-AMP. Fig. 3 depicts the normalized fixed point of the state evolution equation from $t = 1$ to $t = 10$ for both the S-AMP and AMP-MMSE algorithms under different settings. The normalized fixed point of the state evolution equation in the $t$th ADT is defined as $\mathrm{nor}(c_t) \triangleq (P/P_0)\,c_t$, where $P_0 = 13$ dBm is the reference power level. We can observe that in the first ADT, where neither of the two algorithms has historical knowledge, the values of $\mathrm{nor}(c_t)$ of the two algorithms are almost equal. On the contrary, once historical knowledge is available, $\mathrm{nor}(c_t)$ of S-AMP is lower than that of AMP-MMSE. In addition, under a low power level, the performance gain provided by the pilot length shrinks compared with that under a high power level, since the AWGN term in the state evolution function is dominant in this case. Further, as the pilot length increases, $\mathrm{nor}(c_t)$ of both algorithms converges to the normalized AWGN level, defined as $(P/P_0)\,\sigma_w^2$, which is the lower bound of the state evolution fixed point. Hence, as $L$ increases, the performance gain of S-AMP over AMP-MMSE comes increasingly from the prior distribution. Fig. 4 depicts the effect of the pilot length and power level on the sparse vector recovery performance of the AMP-MMSE and S-AMP algorithms. We find that S-AMP shows a lower NMSE than AMP-MMSE in all settings, verifying that the historical knowledge benefits the AMP framework. We also notice that at every power level, once the pilot length is relatively large, further increasing $L$ does not distinctly improve the NMSE; comparing Fig. 4 with Fig. 3, this is because $\mathrm{nor}(c_t)$ of both algorithms has converged to the normalized AWGN level, so the performance gain of S-AMP over AMP-MMSE comes increasingly from the prior distribution. In the region of a larger number of pilots, the S-AMP algorithm achieves a distinct NMSE performance gain compared with its counterpart. This result indicates the clear advantage of the prior aided by historical knowledge. Next, we investigate the active user detection and channel estimation performance of S-AMP.
Fig. 5. Performance comparisons with respect to the transmission power: (a) detection performance; (b) channel estimation performance.
5 provides the user detection and channel estimation performances comparisons of the consideredbaseline algorithms versus the power level of the user devices. The pilot length is set as L = 400 .
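Throughout these comparisons, performance is reported in terms of the detection error probability (DEP) and the NMSE of the channel estimate. As a reference point, here is a minimal sketch of how such metrics are typically computed; the function names and the toy data are illustrative only, not taken from the paper:

```python
import numpy as np

def detection_error_probability(a_true, a_hat):
    # Fraction of users whose active/inactive state is decided wrongly
    return float(np.mean(a_true != a_hat))

def nmse_db(h_true, h_hat):
    # Normalized mean square error of the channel estimate, in dB
    err = np.sum(np.abs(h_hat - h_true) ** 2)
    return float(10.0 * np.log10(err / np.sum(np.abs(h_true) ** 2)))

# Deterministic toy example: 1000 users, 50 active, detector misses 2
a_true = np.zeros(1000, dtype=int)
a_true[:50] = 1
a_hat = a_true.copy()
a_hat[:2] = 0  # two missed detections
print(detection_error_probability(a_true, a_hat))  # 0.002

# Channel estimate off by 10% in amplitude: NMSE = 10*log10(0.01), about -20 dB
h_true = np.ones(50, dtype=complex)
h_hat = 1.1 * h_true
print(nmse_db(h_true, h_hat))
```

The Oracle LS baseline in the figures reports the same NMSE metric, but with the estimate computed under perfect knowledge of the active support.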
[Fig. 6. Performance comparisons with respect to the pilot length (x-axis: pilot length L, 100–800): (a) detection error probability and (b) NMSE, for AMP, AMP-MMSE, R-PIA-ASP, OMP, and S-AMP; panel (b) also includes Oracle LS.]

[Fig. 7. Performance of S-AMP and AMP-MMSE versus the user state transition parameter r under different ADP durations: (a) 10 ms and (b) 100 us. Top: channel estimation performance (NMSE); bottom: active user detection performance (detection error probability).]

We adopt the detector in [11] for the AMP algorithm, while the detections for OMP and R-PIA-ASP are based on their estimated supports. For AMP-MMSE and S-AMP, we consider the Bayesian detector. We see that our proposed S-AMP outperforms its counterparts in both detection and channel estimation performance under all considered settings. With prior information, the Bayesian CS algorithms, such as AMP-MMSE and S-AMP, obtain a distinct performance gain compared with the non-Bayesian methods. Moreover, with the historical knowledge-aided prior, our S-AMP algorithm further improves upon the AMP-MMSE algorithm: for detection, the DEP is reduced by half, and for channel estimation, about dB NMSE gain is achieved. In particular, the S-AMP even outperforms the Oracle LS method in channel estimation, which provides the lower bound for any non-Bayesian method.

Fig. 6 investigates the detection and channel estimation performance with respect to the pilot length; the transmit power is set as dBm. We find that both the user detection and channel estimation performances improve as the pilot length increases. In the region of shorter pilots, the DEP and NMSE decrease very fast, so the proposed S-AMP algorithm achieves significant performance even with a low pilot overhead.
For channel estimation, there is a slight performance loss of S-AMP compared with the Oracle LS when L = 200, since the Oracle LS knows the support set exactly. However, over the whole region, the proposed S-AMP algorithm achieves a lower NMSE than any of its realizable counterparts.

Fig. 7 demonstrates the performance of S-AMP versus r under different settings of the ADP duration, where the transition probability of the user state is specified as 1/r; hence, the probability of a user switching its state decreases as r increases. We observe from Fig. 7 that, as r increases, both the channel estimation error and the detection error of S-AMP decrease. In contrast, the variation of r has no impact on AMP-MMSE, since it does not utilize the temporal structure. On the other hand, we note that the ADP duration controls the temporal correlation of the user channels, and a larger duration is associated with a weaker temporal correlation. We can see from Fig. 7 that a small ADP duration enhances the channel estimation performance while having little impact on the detection performance.

Fig. 8 investigates the channel estimation and detection performance of S-AMP under different settings of the access probability λ. We can see that, under the same settings of pilot length and transmit
power, increasing the sparsity level λ degrades both the channel estimation and the detection performance. However, our proposed S-AMP algorithm outperforms the AMP-MMSE in all settings, indicating the scalability of the S-AMP across different sparsity levels. In addition, we can intuitively see that the impact of a higher sparsity level can be compensated by increasing the pilot length.

[Fig. 8. Performance of S-AMP in different cases of access probabilities: (a) channel estimation performance (NMSE vs. pilot length L for AMP-MMSE and S-AMP with λ = 0.1 and λ = 0.2); (b) detection performance (detection error probability vs. pilot length L for P = 23 dBm and P = 33 dBm, with λ = 0.1 and λ = 0.2).]

VIII. CONCLUSION
In this paper, we proposed a novel grant-free MRA strategy that accounts for both the sporadic traffic and the short packet features of the mMTC scenario. Such a strategy results in temporal correlation of the access state sparse vector, so that joint user detection and channel estimation form a dynamic CS problem. We therefore proposed a novel S-AMP algorithm to sequentially recover the sparse vector. Further, we derived the Bayes detector for active user detection and the corresponding channel estimator based on the S-AMP. We verified that the S-AMP outperforms the traditional AMP algorithms and other non-Bayes methods in both active user detection and channel estimation under our scenario, indicating the clear advantage of accounting for the temporal correlation of the access state sparse vector in the design of the user activity detector and channel estimator.

APPENDIX
A. Proof of Proposition 1
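The key step in the proof is that pushing both $\tilde{p}$ and $q$ through the same transition kernel $p(\chi_n^{(t+1)} \mid \chi_n^{(t)})$ cannot increase their KL divergence, which is a data-processing property. A quick numerical sanity check of that property on discrete toy distributions (illustrative only, not the paper's actual model):

```python
import numpy as np

def kl(p, q):
    # KL divergence D(p || q) between discrete distributions (natural log)
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(42)

# Toy 4-state distributions standing in for p~(chi^(t)|phi) and q(chi^(t)|phi)
p_t = rng.dirichlet(np.ones(4))
q_t = rng.dirichlet(np.ones(4))

# Common row-stochastic transition kernel, standing in for p(chi^(t+1)|chi^(t))
K = rng.dirichlet(np.ones(4), size=4)

# Marginals at time t+1 after pushing both distributions through the same kernel
p_next = p_t @ K
q_next = q_t @ K

# Data-processing property: the divergence cannot grow, matching (61)
print(kl(p_t, q_t) >= kl(p_next, q_next))  # True
```

The inequality holds for any pair of distributions and any common kernel, which is exactly what (61) exploits with the user state transition probabilities as the kernel.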
Before proceeding, we define $\chi_n^{(t)} \triangleq \{h_n^{(t)}, a_n^{(t)}\}$ and $\boldsymbol{\phi}_n^{(t)} \triangleq \{\phi_n^{(t')}\}_{t'=1}^{t}$. Utilizing the definition of the KL-divergence, we have
\begin{align}
& D[\tilde{p}(\chi_n^{(t)} \mid \boldsymbol{\phi}_n^{(t)}) \,\|\, q(\chi_n^{(t)} \mid \boldsymbol{\phi}_n^{(t)})] \nonumber \\
&= \int_{\chi_n^{(t)}} \tilde{p}(\chi_n^{(t)} \mid \boldsymbol{\phi}_n^{(t)}) \log \frac{\tilde{p}(\chi_n^{(t)} \mid \boldsymbol{\phi}_n^{(t)})}{q(\chi_n^{(t)} \mid \boldsymbol{\phi}_n^{(t)})} \nonumber \\
&= \iint_{\chi_n^{(t)}, \chi_n^{(t+1)}} \tilde{p}(\chi_n^{(t)}, \chi_n^{(t+1)} \mid \boldsymbol{\phi}_n^{(t)}) \log \frac{\tilde{p}(\chi_n^{(t)} \mid \boldsymbol{\phi}_n^{(t)})\, p(\chi_n^{(t+1)} \mid \chi_n^{(t)})}{q(\chi_n^{(t)} \mid \boldsymbol{\phi}_n^{(t)})\, p(\chi_n^{(t+1)} \mid \chi_n^{(t)})} \nonumber \\
&= \iint_{\chi_n^{(t)}, \chi_n^{(t+1)}} \tilde{p}(\chi_n^{(t)}, \chi_n^{(t+1)} \mid \boldsymbol{\phi}_n^{(t)}) \left( \log \frac{\tilde{p}(\chi_n^{(t+1)} \mid \boldsymbol{\phi}_n^{(t)})}{q(\chi_n^{(t+1)} \mid \boldsymbol{\phi}_n^{(t)})} + \log \frac{\tilde{p}(\chi_n^{(t)} \mid \chi_n^{(t+1)}, \boldsymbol{\phi}_n^{(t)})}{q(\chi_n^{(t)} \mid \chi_n^{(t+1)}, \boldsymbol{\phi}_n^{(t)})} \right) \nonumber \\
&= D[\tilde{p}(\chi_n^{(t+1)} \mid \boldsymbol{\phi}_n^{(t)}) \,\|\, q(\chi_n^{(t+1)} \mid \boldsymbol{\phi}_n^{(t)})] + \int_{\chi_n^{(t+1)}} \tilde{p}(\chi_n^{(t+1)} \mid \boldsymbol{\phi}_n^{(t)})\, D[\tilde{p}(\chi_n^{(t)} \mid \chi_n^{(t+1)}, \boldsymbol{\phi}_n^{(t)}) \,\|\, q(\chi_n^{(t)} \mid \chi_n^{(t+1)}, \boldsymbol{\phi}_n^{(t)})] \nonumber \\
&\geq D[\tilde{p}(\chi_n^{(t+1)} \mid \boldsymbol{\phi}_n^{(t)}) \,\|\, q(\chi_n^{(t+1)} \mid \boldsymbol{\phi}_n^{(t)})], \tag{61}
\end{align}
where the second equality holds since $\tilde{p}(\chi_n^{(t)}, \chi_n^{(t+1)} \mid \boldsymbol{\phi}_n^{(t)}) = \tilde{p}(\chi_n^{(t)} \mid \boldsymbol{\phi}_n^{(t)})\, p(\chi_n^{(t+1)} \mid \chi_n^{(t)})$ (and similarly for $q$), and the last inequality follows from the non-negativity of the KL-divergence. Note that the integration is replaced by summation when dealing with $a_n^{(t)}$. For conciseness, we omit the proof of the second inequality, which can be proved in the same way. As a consequence, we obtain (30).

REFERENCES

[1] W. Yu, "On the fundamental limits of massive connectivity," in
Proc. Inf. Theory Appl. Workshop, Feb. 2017, pp. 1–6.
[2] C. Bockelmann, N. Pratas, and H. Nikopour, "Massive machine-type communications in 5G: Physical and MAC-layer solutions," IEEE Commun. Mag., vol. 54, no. 9, pp. 59–65, Sep. 2016.
[3] L. Liu and W. Yu, "Massive connectivity with massive MIMO-Part I: Device activity detection and channel estimation," IEEE Trans. Signal Process., vol. 66, no. 11, pp. 2933–2946, Jun. 2018.
[4] L. Liu, E. G. Larsson, W. Yu, P. Popovski, et al., "Sparse signal processing for grant-free massive connectivity: A future paradigm for random access protocols in the internet of things," IEEE Signal Process. Mag., vol. 35, no. 5, pp. 88–99, Sep. 2018.
[5] K. Senel and E. G. Larsson, "Grant-free massive MTC-enabled massive MIMO: A compressive sensing approach," IEEE Trans. Commun., vol. 66, no. 12, pp. 6164–6175, Dec. 2018.
[6] H. F. Schepker, C. Bockelmann, and A. Dekorsy, "Exploiting sparsity in channel and data estimation for sporadic multi-user communication," in Proc. Int. Symp. Wireless Commun. Syst., Aug. 2013, pp. 1–5.
[7] G. Wunder, P. Jung, and M. Ramadan, "Compressive random access using a common overloaded control channel," in Proc. IEEE Globecom Workshops, Dec. 2015, pp. 1–6.
[8] X. Xu, X. Rao, and V. K. N. Lau, "Active user detection and channel estimation in uplink CRAN systems," in Proc. IEEE Int. Conf. Commun. (ICC), Jun. 2015, pp. 2727–2732.
[9] D. L. Donoho, A. Maleki, and A. Montanari, "Message-passing algorithms for compressed sensing," Proc. Nat. Acad. Sci. USA, vol. 106, no. 45, pp. 18914–18919, Nov. 2009.
[10] Z. Chen, F. Sohrabi, and W. Yu, "Sparse activity detection for massive connectivity," IEEE Trans. Signal Process., vol. 66, no. 7, pp. 1890–1904, Apr. 2018.
[11] G. Hannak, M. Mayer, A. Jung, G. Matz, and N. Goertz, "Joint channel estimation and activity detection for multiuser communication systems," in Proc. IEEE Int. Conf. Commun. Workshop (ICCW), Jun. 2015, pp. 2086–2091.
[12] Q. Yang, H. M. Wang, T. X. Zheng, Z. Han, and M. H. Lee, "Wireless powered asynchronous backscatter networks with sporadic short packets: Performance analysis and optimization," IEEE Internet Things J., vol. 5, no. 2, pp. 984–997, Apr. 2018.
[13] B. Wang, L. Dai, Y. Zhang, T. Mir, and J. Li, "Dynamic compressive sensing-based multi-user detection for uplink grant-free NOMA," IEEE Commun. Lett., vol. 20, no. 11, pp. 2320–2323, 2016.
[14] Y. Du, B. Dong, Z. Chen, X. Wang, Z. Liu, P. Gao, and S. Li, "Efficient multi-user detection for uplink grant-free NOMA: Prior-information aided adaptive compressive sensing perspective," IEEE J. Sel. Areas Commun., vol. 35, no. 12, pp. 2812–2828, 2017.
[15] D. Angelosante, G. B. Giannakis, and E. Grossi, "Compressed sensing of time-varying signals," in Proc. Int. Conf. Digit. Signal Process., 2009, pp. 1–8.
[16] N. Vaswani and W. Lu, "Modified-CS: Modifying compressive sensing for problems with partially known support," IEEE Trans. Signal Process., vol. 58, no. 9, pp. 4595–4607, 2010.
[17] D. Angelosante, S. I. Roumeliotis, and G. B. Giannakis, "Lasso-Kalman smoother for tracking sparse signals," in Proc. Asilomar Conf. Signals, Syst. Comput., 2009, pp. 181–185.
[18] J. Ziniel and P. Schniter, "Dynamic compressive sensing of time-varying signals via approximate message passing," IEEE Trans. Signal Process., vol. 61, no. 21, pp. 5270–5284, Nov. 2013.
[19] J. Ma, S. Zhang, H. Li, F. Gao, and S. Jin, "Sparse Bayesian learning for the time-varying massive MIMO channels: Acquisition and tracking," IEEE Trans. Commun., vol. 67, no. 3, pp. 1925–1938, Mar. 2019.
[20] R. Prasad, C. R. Murthy, and B. D. Rao, "Joint channel estimation and data detection in MIMO-OFDM systems: A sparse Bayesian learning approach," IEEE Trans. Signal Process., vol. 63, no. 20, pp. 5369–5382, Oct. 2015.
[21] J. Ziniel and P. Schniter, "Efficient high-dimensional inference in the multiple measurement vector problem," IEEE Trans. Signal Process., vol. 61, no. 2, pp. 340–354, Jan. 2013.
[22] C. Rush and R. Venkataramanan, "Finite-sample analysis of approximate message passing," in Proc. IEEE ISIT, 2016, pp. 755–759.
[23] P. Schniter, "Turbo reconstruction of structured sparse signals," in Proc. Conf. Inf. Sci. and Syst. (CISS), Mar. 2010, pp. 1–6.
[24] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 498–519, Feb. 2001.
[25] M. Bayati and A. Montanari, "The dynamics of message passing on dense graphs, with applications to compressed sensing," IEEE Trans. Inf. Theory, vol. 57, no. 2, pp. 764–785, Feb. 2011.
[26] X. Boyen and D. Koller, "Tractable inference for complex stochastic processes," 2013, [Online]. Available: https://arxiv.org/abs/1301.7362, preprint.
[27] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley-Interscience, 2005.
[28] K. P. Burnham and D. R. Anderson, "Model selection and inference: A practical information-theoretic approach," Technometrics, vol. 45, no. 2, pp. 181–181, 1998.
[29] M. Opper and O. Winther, A Bayesian Approach to On-line Learning, Jan. 1999.
[30] A. Montanari, "Graphical models concepts in compressed sensing," 2011, [Online]. Available: https://arxiv.org/abs/1011.4328, preprint.
[31] H. L. Van Trees, Detection, Estimation, and Modulation Theory, Part I: Detection, Estimation, and Linear Modulation Theory. Wiley-Interscience, 2001.
[32] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Trans. Inf. Theory, vol. 53, no. 12, pp. 4655–4666, 2007.
[33] D. L. Donoho, A. Maleki, and A. Montanari, "Message passing algorithms for compressed sensing: I. Motivation and construction," in