A General Deep Reinforcement Learning Framework for Grant-Free NOMA Optimization in mURLLC
Yan Liu, Student Member, IEEE, Yansha Deng, Member, IEEE, Hui Zhou, Student Member, IEEE, Maged Elkashlan, Senior Member, IEEE, and Arumugam Nallanathan, Fellow, IEEE
Abstract
Grant-free non-orthogonal multiple access (GF-NOMA) is a potential technique to support the massive Ultra-Reliable and Low-Latency Communication (mURLLC) service. However, dynamic resource configuration in GF-NOMA systems is challenging due to random traffic and collisions that are unknown at the base station (BS). Meanwhile, the joint consideration of latency and reliability requirements makes the resource configuration of GF-NOMA for mURLLC more complex. To address this problem, we develop a general learning framework for signature-based GF-NOMA in the mURLLC service, taking into account the MA signature collision, the UE detection, and the data decoding procedures for the K-repetition GF scheme and the Proactive GF scheme. The goal of our learning framework is to maximize the long-term average number of successfully served users (UEs) under the latency constraint. We first perform a real-time repetition value configuration based on a double deep Q-Network (DDQN) and then propose a Cooperative Multi-Agent (CMA) learning technique based on the DQN to optimize the configuration of both the repetition values and the contention-transmission unit (CTU) numbers. Our results show that the number of successfully served UEs achieved under the same latency constraint in our proposed learning framework is up to ten times for the K-repetition scheme, and two times for the Proactive scheme, more than those achieved in the system with fixed repetition values and CTU numbers, respectively. Importantly, our general learning framework can be used to optimize the resource configuration problems in all signature-based GF-NOMA schemes.

Index Terms: mURLLC, NOMA, grant free, deep reinforcement learning, resource configuration.

Y. Liu, M. Elkashlan, and A. Nallanathan are with the School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK (e-mail: {yan.liu, maged.elkashlan, a.nallanathan}@qmul.ac.uk). Y. Deng and H. Zhou are with the Department of Engineering, King's College London, London, UK (e-mail: {yansha.deng, hui.zhou}@kcl.ac.uk). (Corresponding author: Yansha Deng, e-mail: yansha.deng@kcl.ac.uk.)
I. INTRODUCTION
As a new and dominating service class in 6th Generation (6G) networks, massive Ultra-Reliable and Low Latency Communications (mURLLC) integrates URLLC with massive access to support massive short-packet data communications in time-sensitive wireless networks with high reliability and low access latency [1], which requires a reliability-latency-scalability trade-off and mandates a principled and scalable framework accounting for delay, reliability, and decision-making under uncertainty [2]. Concretely speaking, the Third Generation Partnership Project (3GPP) standard [3] has defined a general URLLC requirement as $1-10^{-5}$ reliability within 1 ms user plane latency for 32 bytes. More details on the requirements of various URLLC use cases, including smart grids, intelligent transportation systems, and process automation, with reliability requirements of $10^{-3}$ to $10^{-6}$ packet error rate at latency requirements between 1 ms and 100 ms, can be found in [4]. In addition, in the 6G white paper [5], it is anticipated that the device density may grow to hundred(s) of devices per cubic meter. Current cellular networks can hardly fulfill the joint massive connectivity, ultra-reliability, and low latency requirements of the mURLLC service. To achieve low latency, grant-free (GF) access has been proposed [6], [7] as an alternative to traditional grant-based (GB) access, which suffers from high latency and heavy signaling overhead [8]. Different from GB access, GF access allows a User Equipment (UE) to transmit its data to the Base Station (BS) in an arrive-and-go manner, without sending a scheduling request (SR) and obtaining a resource grant (RG) from the network [9]. To achieve high reliability, several GF schemes, including the K-repetition scheme and the
Proactive scheme, have been proposed, where a pre-defined number ($K$) of consecutive replicas of the same packet are transmitted [10]–[13]. To achieve massive connectivity, non-orthogonal multiple access (NOMA) has been proposed to synergize with GF in order to deal with the MA physical resource collision shown in Fig. 1, which occurs in contention-based GF access on orthogonal multi-access (OMA) physical resources when two or more UEs transmit their data using the same MA physical resource [14], [15].

Footnote 1: User plane latency is defined as the one-way latency from the processing of the packet at the transmitter to the successful reception of the packet, including the transmission processing time, the transmission time, and the reception processing time.

Fig. 1: Collision of MA physical resource.

Here, we focus on signature-based GF-NOMA, including sparse code multiple access (SCMA), multiuser shared access (MUSA), pattern division multiple access (PDMA), etc., where the NOMA technique allows multiple users to transmit over the same MA physical resource by employing user-specific signature patterns (e.g., codebook, pilot sequence, interleaver/mapping pattern, demodulation reference signal, power, etc.) [16]. However, when two or more UEs transmit their data using the same MA physical resource and the same MA signature, an MA signature collision occurs, and the BS cannot differentiate among the UEs and therefore cannot decode their data [14], [15]. It is important to note that the research challenges in GF-NOMA are fundamentally different from those in GB-NOMA [17], [18]. In the GB scheme, the four-step random access (RA) procedure shown in Fig. 2 is executed by the UE to request the BS to schedule dedicated resources for data transmission, where the data transmission is likely to be successful once the random access succeeds. In the GF scheme, by contrast, the data is transmitted along with the pilot on a randomly chosen MA resource, which is unknown at the BS; this leads to new research problems including, but not limited to: 1) identifying MA signature collisions due to UEs transmitting data over the same channel resource with the same MA signature; 2) blind detection of active UEs, since the set of active users is unknown to the BS; and 3) blind decoding of UEs' data with no knowledge of channels and codebooks. These new research problems make the dynamic resource configuration of GF-NOMA an important but challenging problem to solve.

Footnote 2: Unless otherwise stated, the GF and GB access described in this work are both contention based.

Fig. 2: Uplink transmissions for grant-based and grant-free random access.

The main challenges of the dynamic resource configuration optimization of GF-NOMA include: 1) the set of active users and their respective channel conditions are unknown to the BS, which prohibits the pre-configuration and pre-assignment of resources, including pilots/preambles, power, codebooks, repetition values, HARQ related parameters, etc.; 2) to satisfy the reliability and latency requirements simultaneously under random traffic, the optimal parameter configurations vary over different time slots, which is hard to describe by a tractable mathematical model; 3) the MA signature collision detection, the blind UE activity detection, and the data decoding need to be considered, all of which largely impact the resource configuration in each time slot; and 4) for the various signature-based NOMA schemes, a general optimization framework for GF-NOMA systems has never been established.

The above challenges can hardly be solved via traditional convex optimization methods, due to the complex communication environment and the lack of tractable mathematical formulations, whereas Machine Learning (ML) is a potential alternative approach. In the GF-NOMA system, the BS can only observe the results of collision detection (e.g., the number of non-collision UEs and collision MA signatures) and data decoding (e.g., the number of successfully decoded UEs and failed decoding UEs) in each round trip time (RTT). This historical information can be used to facilitate the long-term optimization of future configurations. Even if one knew all the relevant statistics, tackling this problem in an exact manner would result in a Partially Observable Markov Decision Process (POMDP) with large state and action spaces, which is generally intractable. Reinforcement Learning (RL) is a promising tool to deal with this complex POMDP problem of GF-NOMA resource configuration optimization, as it relies solely on self-learning from environment interaction without deriving explicit optimization solutions based on a complex mathematical model.

In this paper, we aim to develop a general learning framework for GF-NOMA systems with mURLLC services. Our contributions can be summarized as follows:
• We develop a general learning framework for the dynamic resource configuration optimization in signature-based GF-NOMA systems, including SCMA, MUSA, PDMA, etc., for mURLLC services, which takes into account the MA signature collision, the UE detection, and the data decoding procedure.
• In this framework, we aim to optimize the number of successfully transmitted UEs under the latency constraint via adaptively configuring the uplink resources for the K-repetition GF scheme and the Proactive GF scheme. The uplink GF-NOMA procedure is simulated by taking into account the random traffic, the resource selection and configuration, the transmission latency check, the collision detection, the data decoding, and the HARQ retransmission. This simulation environment is used for training the RL agents before deployment, and these agents are then updated according to the real traffic in practical networks in an online manner.
• We first perform dynamic optimization of the repetition values, where a double Deep Q-Network (DDQN) is developed for the two GF schemes. We then extend to a more practical scenario to dynamically optimize both the repetition values and the MA resource configuration, where Cooperative Multi-Agent learning based on DQN (CMA-DQN) is developed to break down the selection in the high-dimensional parameter space into multiple parallel sub-tasks, with a number of DQN agents cooperatively trained to produce each parameter.
• Our results show that the number of successfully served UEs achieved under the same latency constraint in our proposed learning framework is up to ten times for the K-repetition scheme, and two times for the Proactive scheme, more than those achieved in the system with fixed repetition values and CTU numbers, respectively. In addition, our results show that the Proactive scheme outperforms the K-repetition scheme in terms of the number of successfully served UEs, especially under the long latency constraint of 8 ms, which differs from the analytical results without optimization in our previous work [19] with only a single packet transmission.

II. RELATED WORKS
The potential of GF-NOMA for different services is still an open research area, as the research challenges of GF-NOMA have not been solved. In [20]–[22], GF-NOMA is designed empirically by directly incorporating the GF mechanism into state-of-the-art NOMA schemes, including SCMA, MUSA, and PDMA, which are categorized according to their specially designed spreading signatures. Most existing GF-NOMA works focused on receiver design using the compressive sensing (CS) technique. The authors in [23] proposed a message passing algorithm to solve the GF-NOMA problem using CS-based approaches, which improves the BER performance in comparison to [24]. The authors in [25] considered a comprehensive design in which synchronization, channel estimation, user detection, and data decoding are performed in one shot. The proposed CS-based algorithm exploits the sparsity of the system in terms of user activity and multi-path fading. To the best of our knowledge, no works have focused on the optimization of the general GF-NOMA system including the MA collision detection, the UE detection, and the data decoding procedures. The major challenge comes from the fact that random user activation and non-orthogonal transmissions make GF-NOMA hard to model mathematically and optimize using traditional optimization methods.

ML has been introduced to improve GF-NOMA systems in [26]–[28]. In [26], deep learning was used to solve a variational optimization problem for GF-NOMA. The neural network model includes encoding, user activity, signature sequence generation, and decoding. The authors then extended their work to design a generalized/unified framework for NOMA using deep multi-task learning in [27]. A deep learning-based active user detection scheme was proposed for GF-NOMA in [28]. By feeding training data into the designed deep neural network, the proposed active user detection scheme learns the nonlinear mapping between the received NOMA signal and the indices of active devices. Their results show that the trained network can handle the whole active user detection process and achieve accurate detection of the active users. These works assumed that each UE is pre-allocated a unique sequence, so collisions are not an issue. However, this assumption does not hold in the massive-UE setting of mURLLC, where collision is the bottleneck of GF-NOMA performance. Different from [26]–[28], we develop a general learning framework to optimize GF-NOMA systems for the mURLLC service taking into account the latency constraint, the MA signature collision, the UE detection, and the data decoding procedures.

III. SYSTEM MODEL
We consider a single-cell network consisting of a BS located at the center and a set of $N$ UEs randomly located in an area of the plane $\mathbb{R}^2$, where the UEs are synchronized but unaware of the status of each other. Once deployed, the UEs remain spatially static. Time is divided into short TTIs, and the small packets for each UE are generated according to random inter-arrival processes over the short TTIs, which are Markovian as defined in [29], [30] and unknown to the BS.

Fig. 3: Uplink GF-NOMA transmission procedure.
Footnote 3: 5G NR introduces the concept of 'mini-slots' and supports a scalable numerology allowing the sub-carrier spacing (SCS) to be expanded up to 240 kHz. In contrast with the LTE slot consisting of 14 OFDM symbols per TTI, the number of OFDM symbols in a 5G NR mini-slot ranges from 1 to 13, and the larger SCS further decreases the length of each symbol. Collectively, mini-slots and flexible numerology allow shorter transmission slots to meet the stringent latency requirement. In this paper, the TTI refers to a mini-slot.
A. GF-NOMA Network Model
We consider uplink contention-based GF-NOMA over a set of preconfigured MA resources for UEs with latency constraint $T_{\text{cons}}$. To capture the effects of the physical radio, we consider the standard power-law path-loss model with path-loss attenuation $r^{-\eta}$, where $r$ is the Euclidean distance between the UE and the BS and $\eta$ is the path-loss exponent. In addition, we consider a Rayleigh flat-fading environment, where the channel power gains $h$ are independent and identically distributed (i.i.d.) exponential random variables with unit mean. Fig. 3 presents the uplink GF-NOMA procedure following the 3GPP standard [9], [12], [13], [31], [32], which includes: 1) traffic inter-arrival; 2) resource and parameter configuration; 3) latency check; 4) collision detection; 5) data decoding; and 6) Hybrid Automatic Repeat reQuest (HARQ) retransmissions. These six stages are explained in the following six subsections.
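To make the channel model concrete, a received-power sample $P h r^{-\eta}$ can be drawn as in the following minimal sketch (our own illustration: the transmit power and cell radius follow Table I and Section V, while the path-loss exponent value is an assumption, since it is not recoverable from the parameter table):

```python
# Minimal sketch (our own illustration) of the stated channel model: received
# power P * h * r^{-eta} with Rayleigh fading (exponential power gain, unit
# mean) and a UE dropped uniformly in a disc around the BS. eta = 4 is assumed.
import math
import random

def received_power_w(p_tx_dbm=23.0, eta=4.0, cell_radius_m=10_000.0):
    p_tx_w = 10 ** (p_tx_dbm / 10) / 1000           # 23 dBm -> watts
    r = cell_radius_m * math.sqrt(random.random())  # uniform point in the disc
    h = random.expovariate(1.0)                     # |h|^2 ~ Exp(1), unit mean
    return p_tx_w * h * r ** (-eta)

print(received_power_w())
```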
1) Traffic Inter-Arrival:
We consider a bursty traffic scenario, where massive UEs are activated due to an emergency event, e.g., an earthquake alarm or a fire alarm [30], [33]. Each UE can be activated at any time $\tau$ according to a time-limited Beta probability density function given as [30, Section 6.1.1]

$$p(\tau) = \frac{\tau^{\alpha-1}(T-\tau)^{\beta-1}}{T^{\alpha+\beta-1}\,\mathrm{Beta}(\alpha,\beta)}, \qquad (1)$$

where $T$ is the total time of the bursty traffic and $\mathrm{Beta}(\alpha,\beta)$ is the Beta function with constant parameters $\alpha$ and $\beta$ [34]. Due to the nature of slotted Aloha, a UE can only transmit at the beginning of an RTT, as shown in Fig. 4 and Fig. 5, which means that the newly activated UEs executing transmission are those who received a packet within the interval spanned by the last RTT period $(\tau_{i-1}, \tau_i)$. The instantaneous traffic rate in packets is described by the function $p(\tau)$, so the packet arrival rate in the $i$th RTT is given by

$$\mu_i = \int_{\tau_{i-1}}^{\tau_i} p(\tau)\, d\tau. \qquad (2)$$
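Since $p(\tau)$ is a rescaled Beta density, the integral in (2) reduces to a difference of regularized incomplete Beta functions; the following minimal sketch evaluates $\mu_i$ this way (our own illustration: $(\alpha, \beta) = (2, 4)$ follows Table I, while the traffic duration and RTT grid are example assumptions):

```python
# Minimal sketch (our own illustration): per-RTT packet arrival rate from the
# time-limited Beta traffic profile in Eqs. (1)-(2).
from scipy.special import betainc  # regularized incomplete Beta function I_x(a, b)

def arrival_rate(tau_start, tau_end, T, alpha=2, beta=4):
    """mu_i = integral of p(tau) over [tau_start, tau_end], via the Beta CDF."""
    # p(tau) is the Beta(alpha, beta) pdf rescaled to [0, T], so its integral
    # is the regularized incomplete Beta function evaluated at tau / T.
    return betainc(alpha, beta, tau_end / T) - betainc(alpha, beta, tau_start / T)

T = 10.0                                   # total bursty-traffic duration (s), assumed
rtt = 0.5                                  # one RTT duration (s), assumed
edges = [i * rtt for i in range(int(T / rtt) + 1)]
mu = [arrival_rate(a, b, T) for a, b in zip(edges[:-1], edges[1:])]
print(sum(mu))                             # sums to 1: each UE is activated exactly once
```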
2) Resources and Parameters Configuration:
The UEs are configured via radio resource control (RRC) signaling and L1 signaling prior to GF access (as in Type 2 GF [35]) with MA resources, repetition values, HARQ related parameters, etc.

a) Repetition values: We consider two GF schemes in this work, the K-repetition scheme and the Proactive scheme, as shown in Fig. 4 and Fig. 5, respectively. The repetition values $K^t_{\text{Krep}}$ for the K-repetition scheme and $K^t_{\text{Proa}}$ for the Proactive scheme are configured at the beginning of each RTT, so as to adapt to the random traffic and guarantee the reliability and latency requirements.

Fig. 4: K-repetition GF transmission.

• K-repetition scheme:
The K-repetition scheme is illustrated in Fig. 4, where the UE is configured to autonomously transmit the same packet for $K^t_{\text{Krep}}$ repetitions in consecutive TTIs. The BS decodes each repetition independently, and the transmission in one RTT is successful when at least one repetition succeeds. After processing all the received $K^t_{\text{Krep}}$ repetitions, the BS transmits the ACK/NACK back to the UE.

Fig. 5: Proactive GF transmission.

• Proactive scheme:
The Proactive scheme is illustrated in Fig. 5. Similar to the K-repetition scheme, the UE is configured to repeat the transmission for a maximum number of $K^t_{\text{Proa}}$ repetitions, but it can receive feedback after each repetition. This allows the UE to terminate repetitions early once it receives an ACK.

Considering the small packets of mURLLC traffic, we set the packet transmission time to one TTI. The BS feedback time and the BS (UE) processing time are also assumed to be one TTI, following our previous work [19]. Once the repetition value is configured, the duration of one RTT is known to the UEs and the BS, which is given as

$$T^t_{\text{RTT}} = (K^t + 3)\ \text{TTIs}, \qquad (3)$$

with $K^t = K^t_{\text{Krep}}$ or $K^t = K^t_{\text{Proa}}$ for the K-repetition scheme and the Proactive scheme, respectively.

b) MA resources: A contention-transmission unit (CTU), shown in Fig. 6, is defined as the basic MA resource, where each CTU may comprise an MA physical resource and an MA signature [7], [15], [36]. The MA physical resources represent a set of time-frequency resource blocks (RBs). The MA signatures represent a set of pilot sequences for channel estimation and/or UE activity detection, a set of codebooks for robust data transmission and interference whitening, etc. Without loss of generality, in one TTI, we consider $F$ orthogonal RBs, and each RB is overlaid with $L$ unique codebook-pilot pairs [14], [38]. Thus, at the beginning of each RTT, the BS configures a resource pool of $C^t = F \times L$ unique CTUs, and each UE randomly chooses one CTU from the pool to transmit in this RTT.

Fig. 6: GF-NOMA resource.

Footnote 4: A one-to-one mapping or a many-to-one mapping between the pilot sequences and codebooks can be predefined. Since it has been verified in [20] that the performance loss due to codebook collision is negligible for a real system, we focus on the pilot sequence collision and consider the one-to-one mapping in our work, like [14], [37].
3) Latency Check:
The HARQ index $H_{\text{HARQ}}$ is included in the pilot sequence and can be detected by the BS. At the beginning of each RTT, the HARQ index and the transmission latency $T_{\text{late}}$ are updated as shown in Fig. 3. For example, for the initial RTT with initial $K$, $H_{\text{HARQ}} = 1$ and $T_{\text{late}} = RTT_{H_{\text{HARQ}}=1}$, where $RTT_{H_{\text{HARQ}}}$ is calculated using (3). After this round trip transmission, the BS optimizes a new $K$ based on the observation of the reception and configures it to the UE for the next RTT. The UE then updates its HARQ index to $H_{\text{HARQ}} = 2$, calculates $RTT_{H_{\text{HARQ}}=2}$ using (3) with the new $K$, and consequently updates the transmission latency as $T_{\text{late}} = RTT_{H_{\text{HARQ}}=1} + RTT_{H_{\text{HARQ}}=2}$. When $T_{\text{late}} > T_{\text{cons}}$, the UE fails to be served and the packets are dropped. Note that the HARQ index, as well as the transmission latency, is updated at the beginning of each RTT instead of at the end, because we consider the user plane latency in this work, as explained in Footnote 1. That is to say, from the UE perspective, when the UE executes this RTT, it checks the transmission results after finishing the RTT; thus, the duration of this RTT should be included when calculating the UE transmission latency.
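As a concrete illustration of this bookkeeping, the sketch below accumulates the RTT duration from (3) at the start of each RTT and drops the packet once $T_{\text{late}}$ exceeds $T_{\text{cons}}$ (our own simplified illustration; the class name and the TTI length are assumptions):

```python
# Minimal sketch (our own illustration) of the per-RTT latency check: the RTT
# duration (K + 3) TTIs from Eq. (3) is accumulated at the start of each RTT,
# and the packet is dropped once the accumulated latency exceeds T_cons.
TTI_MS = 0.125          # mini-slot length in ms, an assumption of this sketch
T_CONS_MS = 2.0         # latency constraint, the 2 ms case from Section V

class UeHarqState:
    def __init__(self):
        self.h_harq = 0         # HARQ index H_HARQ
        self.t_late_ms = 0.0    # accumulated transmission latency T_late

    def start_rtt(self, k_rep):
        """Return True if this RTT may proceed under the latency constraint."""
        self.h_harq += 1
        self.t_late_ms += (k_rep + 3) * TTI_MS  # Eq. (3): one RTT = (K + 3) TTIs
        return self.t_late_ms <= T_CONS_MS      # otherwise the packet is dropped

ue = UeHarqState()
print(ue.start_rtt(k_rep=4))   # 7 TTIs = 0.875 ms <= 2 ms -> True
print(ue.start_rtt(k_rep=8))   # cumulative 2.25 ms > 2 ms -> False (dropped)
```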
4) Collision Detection:
At each RTT, each active UE transmits its packets to the BS by randomly choosing a CTU. The BS can detect the UEs that have chosen different CTUs. However, if multiple UEs choose the same CTU, the BS cannot differentiate these UEs and therefore cannot decode the data. We categorize the CTUs into three types. An idle CTU is a CTU which has not been chosen by any UE. A singleton CTU is a CTU chosen by only one UE, and a collision CTU is a CTU chosen by two or more UEs [14]. One example is illustrated in Fig. 7, where UE 1 and UE 5 performed GF-NOMA transmissions by choosing the unique CTU 6 and CTU 5, respectively, so CTU 6 and CTU 5 are singleton CTUs. CTU 3 is an idle CTU. UE 4 and UE 7 have chosen CTU 1, UE 2 and UE 3 have chosen CTU 2, and UE 6 and UE 8 have chosen CTU 4, so CTU 1, 2, and 4 are collision CTUs. At the $t$th RTT, we denote the set of singleton CTUs as $\mathcal{C}^t_{sc}$, the set of idle CTUs as $\mathcal{C}^t_{ic}$, and the set of collision CTUs as $\mathcal{C}^t_{cc}$.
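This classification is straightforward to simulate; the following minimal sketch (our own illustration) draws a uniform CTU choice for each active UE and partitions the CTU pool into the three types:

```python
# Minimal sketch (our own illustration): random CTU selection and the
# idle/singleton/collision classification used in the collision-detection stage.
import random
from collections import Counter

def classify_ctus(n_active_ues, n_ctus, rng=random):
    """Each active UE picks one CTU uniformly at random; classify the CTUs."""
    choices = [rng.randrange(n_ctus) for _ in range(n_active_ues)]
    counts = Counter(choices)
    singleton = {c for c, n in counts.items() if n == 1}   # C^t_sc: detectable UEs
    collision = {c for c, n in counts.items() if n >= 2}   # C^t_cc: undetectable UEs
    idle = set(range(n_ctus)) - singleton - collision      # C^t_ic: unused CTUs
    return idle, singleton, collision

# Matches the flavor of Fig. 7: with 8 UEs and 6 CTUs, several CTUs collide
# and only a few UEs remain detectable.
idle, single, coll = classify_ctus(n_active_ues=8, n_ctus=6)
print(len(idle), len(single), len(coll))
```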
5) GF-NOMA Data Decoding:
After detecting the UEs that have chosen singleton CTUs (e.g., UE 1 and UE 5 in Fig. 7), the BS applies the successive interference cancellation (SIC) technique to decode the data of these UEs. When decoding, the BS treats the UEs that transmit in the same RB as interference, as shown in Fig. 7, while UEs that transmit in different RBs do not interfere with each other due to orthogonality. In each iterative stage of SIC decoding, the CTU with the strongest received power is decoded by treating the received powers of the other CTUs over the same RB as interference. Each iterative stage of SIC decoding is successful when the signal-to-interference-plus-noise ratio (SINR) in that stage is larger than the SINR threshold. If the received signal is decoded successfully, the decoded signal is subtracted from the received signal. Thus, in the $k$th repetition of the $t$th RTT, the $s$th SIC decoding is successful if

$$\mathrm{SINR}^t_{f,s}(k) = \frac{P h_{s,k} r_s^{-\eta}}{\sum_{m=s+1}^{|\mathcal{N}^t_{f,sc}(k)|} P h_{m,k} r_m^{-\eta} + \sum_{n' \in \mathcal{N}^t_{f,cc}(k)} P h_{n',k} r_{n'}^{-\eta} + \sigma^2} \geq \gamma_{th}, \qquad (4)$$

where $P$ is the transmission power, $\mathcal{N}^t_{f,sc}$ is the set of devices that have chosen singleton CTUs over the $f$th RB, $\mathcal{N}^t_{f,cc}$ is the set of devices that have chosen collision CTUs over the $f$th RB, $\sigma^2$ is the noise power, and $\gamma_{th}$ is the SINR threshold.

Footnote 5: We assume perfect SIC, the same as [14], with no error propagation between iterations.

Fig. 8: SIC decoding procedure for each GF scheme: (a) K-repetition scheme; (b) Proactive scheme.

The SIC procedure stops when one stage of the SIC fails or when there are no more signals to decode. The SIC decoding procedure for each GF scheme is given in Fig. 8, with the details described in the following.

i) K-repetition scheme: For the K-repetition scheme, as shown in Fig. 4, the successful decoding event occurs when at least one repetition decoding succeeds. Thus, the SIC decoding procedure follows:
• Step 1: Start the $k$th repetition with the initial $k = 1$, $\mathcal{N}^t_{f,sc}$, and $\mathcal{N}^t_{f,cc}$;
• Step 2: Decode the $s$th CTU with the initial $s = 1$ using (4);
• Step 3: If the $s$th CTU is successfully decoded, put the decoded UE in set $\mathcal{N}^t_{f,sd}(k)$ and go to Step 4, otherwise go to Step 5;
• Step 4: If $s \leq |\mathcal{N}^t_{f,sc}|$, do $s = s + 1$ and go to Step 2, otherwise go to Step 5;
• Step 5: SIC for the $k$th repetition stops;
• Step 6: If $k \leq K_{\text{Krep}}$, do $k = k + 1$ and go to Step 1, otherwise end.

ii) Proactive scheme: For the Proactive scheme, as shown in Fig. 5, the successful decoding event occurs once a repetition decoding succeeds. The successfully decoded UEs will not transmit in the remaining repetitions, to reduce interference to other UEs. It should be noted that the ACK/NACK feedback can only be received after 3 TTIs, which means the ACK feedback of the $k$th successful repetition is received by the UE in the $(k+3)$th repetition and the UE stops transmission from the $(k+4)$th repetition. In addition, the BS does not send any ACK/NACK feedback to the collision UEs. The collision UEs in the $k$th repetition that do not receive feedback at the pre-defined timing after sending the packet (e.g., after 3 TTIs) will not transmit in the remaining repetitions, to reduce interference to other UEs.
• Step 1: Initialize $k = 1$, $\mathcal{N}^t_{f,sc}$, and $\mathcal{N}^t_{f,cc}$. If $k < 2$, go to Step 3, otherwise go to Step 2;
• Step 2: Update $\mathcal{N}^t_{f,sc}(k) = \mathcal{N}^t_{f,sc}(k-1) \setminus \mathcal{N}^t_{f,sd}(k-1)$ and $\mathcal{N}^t_{f,cc}(k) = \emptyset$;
• Step 3: Start the $k$th repetition with $k$, $\mathcal{N}^t_{f,sc}(k)$, and $\mathcal{N}^t_{f,cc}(k)$;
• Step 4: Decode the $s$th CTU with initial $s = 1$ using (4);
• Step 5: If the $s$th CTU is successfully decoded, put the decoded UE in set $\mathcal{N}^t_{f,sd}(k)$ and go to Step 6, otherwise go to Step 7;
• Step 6: If $s \leq |\mathcal{N}^t_{f,sc}|$, do $s = s + 1$ and go to Step 4, otherwise go to Step 7;
• Step 7: SIC for the $k$th repetition stops;
• Step 8: If $k \leq K_{\text{Proa}}$, do $k = k + 1$ and go to Step 1, otherwise end.

Finally, $\mathcal{N}^t_{f,sd} = \bigcup_{k=1}^{K_{\text{Krep}}} \mathcal{N}^t_{f,sd}(k)$ is the set of successfully decoded UEs over the $f$th RB and $\mathcal{N}^t_{sd} = \bigcup_{f=1}^{F^t} \mathcal{N}^t_{f,sd}$ is the set of successfully decoded UEs in the $t$th RTT.
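To make the decoding rule in (4) concrete, the following is a minimal sketch of one SIC pass over a single RB (our own illustration under the stated perfect-SIC assumption; the received powers and threshold are example values):

```python
# Minimal sketch (our own illustration) of one SIC pass over a single RB,
# per Eq. (4): decode singleton-CTU UEs in descending received-power order,
# treating weaker singleton UEs and all collision UEs on the RB as interference.
import random

def sic_pass(p_singleton, p_collision, noise, gamma_th):
    """p_singleton/p_collision: received powers P*h*r^-eta; returns decoded powers."""
    remaining = sorted(p_singleton, reverse=True)  # strongest decoded first
    interference_cc = sum(p_collision)             # collision UEs are never decoded
    decoded = []
    while remaining:
        signal = remaining[0]
        interference = sum(remaining[1:]) + interference_cc
        if signal / (interference + noise) < gamma_th:
            break                                  # one failed stage stops the SIC
        decoded.append(remaining.pop(0))           # perfect SIC: subtract, continue
    return decoded

# Example: 3 singleton UEs and 2 collision UEs on one RB, gamma_th = 0.1 (-10 dB).
rx = lambda: random.expovariate(1.0) * random.uniform(0.1, 1.0)  # h * r^-eta proxy
print(sic_pass([rx() for _ in range(3)], [rx() for _ in range(2)], 1e-3, 0.1))
```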
6) GF-NOMA Retransmissions (HARQ):
We introduce GF-NOMA HARQ retransmissions to achieve high reliability. However, due to the latency constraint, the number of HARQ retransmissions is limited, as shown in Fig. 3. The UE determines whether to retransmit based on the following scenarios:

i) when the UE receives an ACK from the BS, the BS has successfully detected the UE (i.e., the UE chose a singleton CTU) and decoded the UE's data (i.e., SIC succeeded), so no further retransmission is needed;

ii) when the UE receives a NACK from the BS, the BS has successfully detected the UE but failed to decode the UE's data (i.e., SIC failed). Otherwise, when the UE does not receive any feedback at the pre-defined timing after sending the packet (e.g., at the end of one RTT), the BS failed to identify the UE. In both of the latter cases, the UE determines whether to retransmit based on the transmission latency check shown in Fig. 3.
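These three feedback cases can be summarized in a small decision routine, sketched below (our own illustration, reusing the hypothetical UeHarqState class from the latency-check sketch in Section III-A3):

```python
# Minimal sketch (our own illustration) of the UE-side retransmission decision.
# feedback is 'ACK', 'NACK', or None (no feedback at the pre-defined timing,
# i.e., the BS failed to identify the UE due to a CTU collision).
def decide_retransmission(feedback, ue, next_k):
    if feedback == "ACK":
        return False               # detected and decoded: HARQ process ends
    # NACK (SIC failed) or silence (collision): retransmit only if the
    # latency check still admits another RTT. ue is a UeHarqState instance
    # from the earlier sketch (its name is our own).
    return ue.start_rtt(next_k)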
B. Problem Formulation

We focus on the uplink contention-based GF-NOMA procedure over a set of preconfigured MA resources for UEs with latency constraint $T_{\text{cons}}$ under the two GF schemes. Each UE has only two possible states, inactive or active, where a UE with small data packets to be transmitted is in the active state. Once activated in a given RTT $t$, a UE executes the GF-NOMA procedure, where the UE randomly chooses one of the preconfigured $C^t$ CTUs to transmit its packets $K^t_{\text{Krep}}$ times or $k^t_{\text{Proa}} \leq K^t_{\text{Proa}}$ times under the K-repetition scheme and the Proactive scheme, respectively. During this RTT, the GF-NOMA transmission fails if: (i) a CTU collision occurs when two or more UEs choose the same CTU (i.e., UE detection fails); or (ii) the SIC decoding fails (i.e., data decoding fails). Once failed, the UE decides whether to retransmit in the following RTT based on the transmission latency check. When $T_{\text{late}} > T_{\text{cons}}$, the UE fails to be served and its packets are dropped. It is obvious that 1) increasing the repetition value $K^t$ could improve the GF-NOMA success probability, but results in increased latency; and 2) increasing the CTU number $C^t$ could improve the UE detection success probability, but results in low resource utilization efficiency.

Thus, it is necessary to tackle the problem of optimizing the GF-NOMA configuration defined by the parameters $A^t = \{K^t, C^t\}$ for each RTT $t$ under both the K-repetition scheme and the Proactive scheme, where $K^t$ is the repetition value and $C^t$ is the number of CTUs. At the beginning of each RTT $t$, the decision is made by the BS according to the transmission receptions $U^{t'}$ for all prior RTTs $t' = 1, ..., t-1$, consisting of the following variables: the number of collision CTUs $V^{t'}_{cc}$, the number of idle CTUs $V^{t'}_{ic}$, the number of singleton CTUs $V^{t'}_{sc}$, the number of UEs that have been successfully detected and decoded under the latency constraint $V^{t'}_{sd}$, and the number of UEs that have been successfully detected but not successfully decoded $V^{t'}_{ud}$. We denote $H^t = \{O^1, O^2, ..., O^{t-1}\}$ with $O^{t-1} = \{U^{t-1}, A^{t-1}\}$ as the observation at each RTT $t$, including the histories of all such measurements and past actions.

Footnote 6: According to the UE detection and data decoding procedure described in Section III-A, for the same CTU number $C^t$, a larger RB number $F^t$ leads to fewer UEs in each RB, which increases the data decoding success probability. That is to say, the larger the RB number, the better. Thus, we fix the RB number $F = 4$ in this work to optimize the CTU number.

At each RTT $t$, the BS aims at maximizing a long-term objective $R^t$ (reward) related to the average number of UEs that successfully send data under the latency constraint, $V^t_{sd}$, with respect to the stochastic policy $\pi$ that maps the current observation history $O^t$ to the probabilities of selecting each possible parameter set $A^t$. This optimization problem (P1) can be formulated as:

$$\text{(P1)}: \max_{\pi(A^t|O^t)} \sum_{k=t}^{\infty} \gamma^{k-t}\, \mathbb{E}_\pi\!\left[V^k_{sd}\right] \qquad (5)$$

s.t.
$$T_{\text{late}} \leq T_{\text{cons}}, \qquad (6)$$

where $\gamma \in [0, 1)$ is the discount factor for the performance accrued in future RTTs, and $\gamma = 0$ means that the agent only considers the immediate reward.

Since the dynamics of the GF-NOMA system are Markovian over consecutive RTTs, this is a Partially Observable Markov Decision Process (POMDP) problem, which is generally intractable. Here, partial observability refers to the fact that the BS cannot fully know all the information of the communication environment, including, but not limited to, the channel conditions, the UE transmission latency, the random collision process, and the traffic statistics. Furthermore, traditional optimization methods may need global information to achieve the optimal solution, which not only increases the signaling overhead but also increases the computational complexity, and may even be intractable. Approximate solutions are discussed in Section IV.

IV. DEEP REINFORCEMENT LEARNING-BASED GF-NOMA RESOURCE CONFIGURATION
Deep reinforcement learning (DRL) has been regarded as a powerful tool to address complex dynamic control problems in POMDPs with large state spaces [39], [40], since it has the potential to accurately approximate the desired value function via the combination of DNN structures. In this section, we propose a Deep Q-Network (DQN) based algorithm for tackling the formulated problem P1. To evaluate the capability of DQN in GF-NOMA, we first consider the dynamic configuration of the repetition value $K^t$ with a fixed CTU number $C^t$, where the DQN agent dynamically configures $K^t$ at the beginning of each RTT for the K-repetition and Proactive GF schemes. We then propose a cooperative multi-agent learning technique based on the DQN to optimize the configuration of both the repetition value $K^t$ and the CTU number $C^t$ simultaneously, which breaks down the selection in the high-dimensional action space into multiple parallel sub-tasks.

A. Deep Reinforcement Learning-Based Single-Parameter Configuration

1) Reinforcement learning framework:
To optimize the number of successfully served UEs under the latency constraint in GF-NOMA schemes, we consider an RL agent deployed at the BS that interacts with the environment in order to choose appropriate actions progressively leading to the optimization goal. We define $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $r \in \mathcal{R}$ as any state, action, and reward from their corresponding sets, respectively. The RL agent first observes the current state $S_t$, corresponding to a set of previous observations ($O_t = \{U_{t-1}, U_{t-2}, ..., U_1\}$), in order to select a specific action $A_t \in \mathcal{A}(S_t)$. Here, the action $A_t$ represents the repetition value $K^t$ in the $t$th RTT ($A_t = K^t$) in this single-parameter configuration scenario, and $S_t$ is a set of indices mapping to the currently observed information $U_{t-1} = [V^{t-1}_{cc}, V^{t-1}_{ic}, V^{t-1}_{sc}, V^{t-1}_{sd}, V^{t-1}_{ud}]$. With the knowledge of the state $S_t$, the RL agent chooses an action $A_t$ from the set $\mathcal{A}$. Once an action $A_t$ is performed, the RL agent transits to a new observed state $S_{t+1}$ and receives a corresponding reward $R_{t+1}$ as feedback from the environment, which is designed based on the new observed state $S_{t+1}$ and guides the agent towards the optimization goal. As the optimization goal is to maximize the number of successfully served UEs under the latency constraint, we define the reward as

$$R_{t+1} = V^t_{sd}, \qquad (7)$$

where $V^t_{sd}$ is the observed number of successfully served UEs under the latency constraint $T_{\text{cons}}$.

To select an action $A_t$ based on the current state $S_t$, a mapping policy $\pi(a|s)$ learned from a state-action value function $Q(s, a)$ is needed to facilitate the action selection process, which indicates the probability distribution of actions for given states. Accordingly, our objective is to find the optimal value function $Q^*(s, a)$ with optimal policy $\pi^*(a|s)$. At each RTT, $Q(s, a)$ is updated based on the received reward by following

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \lambda \left[R_{t+1} + \gamma \max_{a \in \mathcal{A}} Q(S_{t+1}, a) - Q(S_t, A_t)\right], \qquad (8)$$

where $\lambda$ is a constant learning rate reflecting how fast the model adapts to the problem, and $\gamma \in [0, 1)$ is the discount rate that determines how future rewards affect the value function update. After enough iterations, the BS can learn the optimal policy that maximizes the long-term reward.
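As a reference point before introducing the DQN, the tabular update in (8) can be coded directly; the following is a minimal sketch (our own illustration, with example hyperparameters and a hand-picked action set):

```python
# Minimal sketch (our own illustration) of the tabular Q-learning update in
# Eq. (8); states are assumed discretized, hyperparameter values are examples.
from collections import defaultdict

LAMBDA, GAMMA = 0.1, 0.5
ACTIONS = [1, 2, 4, 8]                 # candidate repetition values K^t (example)
Q = defaultdict(float)                 # Q[(state, action)] -> value

def q_update(s_t, a_t, r_next, s_next):
    """One step of Eq. (8): move Q(S_t, A_t) toward the bootstrapped target."""
    best_next = max(Q[(s_next, a)] for a in ACTIONS)
    td_target = r_next + GAMMA * best_next
    Q[(s_t, a_t)] += LAMBDA * (td_target - Q[(s_t, a_t)])

# Example transition: the observed receptions U_{t-1} summarized as a hashable
# tuple state, reward = number of successfully served UEs V^t_sd in that RTT.
q_update(s_t=(3, 5, 40, 30, 2), a_t=4, r_next=30, s_next=(2, 6, 40, 32, 1))
```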
2) Deep Q-network:
When the state and action spaces are large, the RL algorithm becomes expensive in terms of memory and computational complexity, and it is difficult to converge to the optimal solution. To overcome this problem, the DQN was proposed in [40], where Q-learning is combined with a Deep Neural Network (DNN) to train a sufficiently accurate state-action value function for problems with high-dimensional state spaces. Furthermore, the DQN algorithm utilizes the experience replay technique to enhance the convergence performance of RL. When updating the DQN, mini-batch samples are selected randomly from the experience memory as the input of the neural network, which breaks the correlation among the training samples. In addition, by averaging over the selected samples, the distribution of training samples can be smoothed, which avoids training divergence.

In the DQN algorithm, the action-state value function $Q(s, a)$ is parameterized via a function $Q(s, a, \theta)$, where $\theta$ represents the weight matrix of a multi-layer DNN. We consider the conventional fully-connected DNN, where the neurons between two adjacent layers are fully pairwise connected. The variables in the state $S_t$ are fed into the DNN as the input; Rectified Linear Units (ReLUs) are adopted as the intermediate hidden layers, using the function $f(x) = \max(0, x)$; and the output layer consists of linear units, which are in one-to-one correspondence with all available actions in $\mathcal{A}$.

To achieve exploitation, the forward propagation of the Q-function $Q(s, a, \theta)$ is performed according to the observed state $S_t$. The online update of the weight matrix $\theta$ is carried out along each training episode to avoid the complexities of eligibility traces, where the double deep Q-learning (DDQN) training principle [41] is applied to reduce the overestimation of the value function (i.e., sub-optimal actions obtaining higher values than the optimal action). Accordingly, learning takes place over multiple training episodes, where each episode consists of several RTT periods. In each RTT, the parameter $\theta$ of the Q-function approximator $Q(s, a, \theta)$ is updated using the RMSProp optimizer [42] as

$$\theta_{t+1} = \theta_t - \lambda_{\text{RMS}} \nabla L_{\text{DDQN}}(\theta_t), \qquad (9)$$

where $\lambda_{\text{RMS}} \in (0, 1)$ is the RMSProp learning rate and $\nabla L_{\text{DDQN}}(\theta_t)$ is the gradient of the loss function $L_{\text{DDQN}}(\theta_t)$ used to train the state-action value function. The gradient of the loss function is defined as

$$\nabla L_{\text{DDQN}}(\theta_t) = \mathbb{E}_{S_i, A_i, R_{i+1}, S_{i+1}}\!\left[\left(R_{i+1} + \gamma \max_{a \in \mathcal{A}} Q(S_{i+1}, a, \bar{\theta}_t) - Q(S_i, A_i, \theta_t)\right) \nabla_\theta Q(S_i, A_i, \theta_t)\right]. \qquad (10)$$

We consider the application of minibatch training, instead of a single sample, to update the value function $Q(s, a, \theta)$, which improves the convergence reliability of the value function. Therefore, the expectation is taken over the minibatch, which is randomly selected from previous samples $(S_i, A_i, S_{i+1}, R_{i+1})$ for $i \in \{t - M_r, ..., t\}$, with $M_r$ being the replay memory size [39]. When $t - M_r$ is negative, samples from the previous episode are included. Furthermore, $\bar{\theta}_t$ is the target Q-network in DDQN that is used to estimate the future value of the Q-function in the update rule; $\bar{\theta}_t$ is periodically copied from the current value $\theta_t$ and kept unchanged for several episodes.

By computing the expectation over the selected previous samples in the minibatch and updating $\theta_t$ by (9), the DQN value function $Q(s, a, \theta)$ can be obtained. The detailed DQN algorithm is presented in Algorithm 1.
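A minimal PyTorch sketch of the value network and one DDQN-style update following (9) and (10) is given below (our own illustration: the two 128-unit ReLU hidden layers follow Section V, while the remaining dimensions and hyperparameters are example assumptions). Note that (10) writes the target with a max over the target network; the sketch uses the action-selection/evaluation split of [41], which is the double-Q form the text invokes.

```python
# Minimal sketch (our own illustration) of the Q-network and one DDQN update.
# The target uses the double-Q split of [41]: the online network selects the
# action, the target network theta_bar evaluates it.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 5, 8              # e.g., U_{t-1} has 5 counters; 8 K values

def make_qnet():
    # Two 128-unit ReLU hidden layers, as in the Section V setup.
    return nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, N_ACTIONS))

q_net, target_net = make_qnet(), make_qnet()
target_net.load_state_dict(q_net.state_dict())          # theta_bar <- theta
opt = torch.optim.RMSprop(q_net.parameters(), lr=1e-4)  # lambda_RMS, example value

def ddqn_step(s, a, r, s_next, gamma=0.5):
    """One minibatch update; s, s_next: (B, STATE_DIM); a, r: (B,)."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)               # select online
        y = r + gamma * target_net(s_next).gather(1, a_star).squeeze(1)  # eval target
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)   # gradient proportional to Eq. (10)
    opt.zero_grad(); loss.backward(); opt.step()                         # Eq. (9)

# Example minibatch of 32 replayed transitions (random placeholders).
B = 32
ddqn_step(torch.randn(B, STATE_DIM), torch.randint(0, N_ACTIONS, (B,)),
          torch.rand(B), torch.randn(B, STATE_DIM))
```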
B. Cooperative Multi-Agent Learning-Based Multi-Parameter Optimization
In practice, not only the repetition values but also the CTU numbers influence the reliability-latency performance of GF-NOMA. Fixed CTU numbers cannot adapt to the dynamics of random traffic, which may violate the stringent latency requirement or lead to low resource efficiency. Thus, we study the problem (P1) of jointly optimizing the resource configuration with parameters $A^t = \{K^t, C^t\}$ to improve the network performance. The learning algorithm provided in Section IV-A is model-free, so its learning structure can be extended to this multi-parameter scenario.

Due to the high capability of DQN to handle problems with massive state spaces, we enrich the state space with more observed information to support the optimization of the RL agent. Therefore, we define the current state $S_t$ to include information about the last $M_o$ RTTs $(U_{t-1}, U_{t-2}, U_{t-3}, ..., U_{t-M_o})$, which enables the RL agent to estimate the trend of the traffic. Similar to the state space, the available action space also grows exponentially with the number of adjustable parameter configurations in GF-NOMA, since the total number of available actions corresponds to the possible combinations of all parameter configurations.

Although the GF-NOMA configuration is managed by a central BS, breaking down the control of multiple parameters into multiple sub-tasks, which are cooperatively handled by independent Q-agents, is sufficient to deal with the otherwise intractable action space. As shown in Fig. 9, we consider multiple DQN agents that are centralized at the BS and follow the same structure of value function approximator as in Section IV-A. Each DQN agent controls its own action variable, namely $K^t$ or $C^t$, and receives a common reward so as to cooperatively pursue the objective in P1.
DQN Based GF-NOMA Uplink Resource Configuration
Input:
Input: The set of repetition values in each RTT $\mathcal{K}$ and the number of training iterations $I$. Algorithm hyperparameters: learning rate $\lambda_{\text{RMS}} \in (0, 1)$, discount rate $\gamma \in [0, 1)$, $\epsilon$-greedy rate $\epsilon \in (0, 1)$, target network update frequency $J$;
Initialization: replay memory $M$ to capacity $D$, the state-action value function $Q(S, A, \theta)$, the parameters of the primary Q-network $\theta$, and the target Q-network $\bar{\theta}$;
1: for iteration $\leftarrow 1$ to $I$ do
2:   Initialize $S_1$ by executing a random action $A_0$ and set the bursty traffic arrival rate $\mu_1 = 0$;
3:   for $t \leftarrow 1$ to $T$ do
4:     Update $\mu_t$ using Eq. (2);
5:     if $p_\epsilon < \epsilon$ then
6:       select a random action $A_t$ from $\mathcal{A}$;
7:     else select $A_t = \arg\max_{a \in \mathcal{A}} Q(S_t, a, \theta)$;
8:     The BS broadcasts $K(A_t)$ and backlogged UEs attempt communication in the $t$th RTT;
9:     The BS observes state $S_{t+1}$ and calculates the related reward $R_{t+1}$ using Eq. (7);
10:    Store transition $(S_t, A_t, R_{t+1}, S_{t+1})$ in replay memory $M$;
11:    Sample a random minibatch of transitions $(S_i, A_i, R_{i+1}, S_{i+1})$ from replay memory $M$;
12:    Perform a gradient descent step and update the parameters of $Q(s, a, \theta)$ using Eq. (10);
13:    Update the target Q-network parameter $\bar{\theta} = \theta$ every $J$ steps;
14:   end for
15: end for

Fig. 9: The CMA-DQN agents and environment interaction in the POMDP.

However, the common reward design also poses a challenge for the evaluation of each action, because the individual effect of a specific action is deeply hidden in the effects of the actions taken by all other DQN agents. For instance, a positive action taken by one agent can receive a misleadingly low reward due to other DQN agents' negative actions. Fortunately, in the GF-NOMA scenario, all DQN agents are centralized at the BS and share full information with each other. Accordingly, we include the action selection histories of each DQN agent as part of the state, so the agents are able to learn the relationship between the common reward and the different combinations of actions. To do so, we define the state variable $S_t$ as

$$S_t = [A_{t-1}, U_{t-1}, A_{t-2}, U_{t-2}, ..., A_{t-M_o}, U_{t-M_o}], \qquad (11)$$

where $M_o$ is the number of stored observations, $A_{t-1}$ is the set of actions selected by each DQN agent in the $(t-1)$th RTT, corresponding to $K^{t-1}$ and $C^{t-1}$, and $U_{t-1}$ is the set of observed transmission receptions.

In each RTT, the $k$th agent updates the parameters $\theta_k$ of the value function $Q(s, a_k, \theta_k)$ using the RMSProp optimizer following Eq. (9). The learning algorithm can be implemented following Algorithm 1. Different from the GF-NOMA single-parameter configuration scenario in Section IV-A, it is required to initialize primary networks $\theta_k$, target networks $\bar{\theta}_k$, and replay memories $M_k$, one set for each DQN agent. In step 10 of Algorithm 1, each agent stores its own current transitions in its memory separately. In steps 11 and 12 of Algorithm 1, the minibatch of transitions should be sampled separately from the individual memory to train the corresponding DQN agent.
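The cooperative two-agent arrangement can be sketched as follows (our own illustration: each agent owns a network and a replay memory, the state dimension follows (11), and the action sets and hyperparameters are example assumptions):

```python
# Minimal sketch (our own illustration) of CMA-DQN: two DQN agents, one for
# K^t and one for C^t, each with its own network and replay memory, all
# trained with the common reward R_{t+1} = V^t_sd.
import random
from collections import deque
import torch
import torch.nn as nn

M_O = 3                                  # stored observations in Eq. (11)
STATE_DIM = M_O * (2 + 5)                # per RTT: 2 past actions + 5 counters
K_CHOICES, C_CHOICES = [1, 2, 4, 8], [12, 24, 36, 48]   # example action sets

class Agent:
    def __init__(self, n_actions):
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))
        self.memory = deque(maxlen=10_000)       # individual replay memory M_k

    def act(self, state, eps=0.1):
        if random.random() < eps:                # epsilon-greedy exploration
            return random.randrange(self.net[-1].out_features)
        return int(self.net(state).argmax())

k_agent, c_agent = Agent(len(K_CHOICES)), Agent(len(C_CHOICES))
state = torch.zeros(STATE_DIM)                   # flattened Eq. (11) history
k_t = K_CHOICES[k_agent.act(state)]              # each agent picks its own
c_t = C_CHOICES[c_agent.act(state)]              # parameter; the reward is shared
```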
V. SIMULATION RESULTS
In this section, we examine the effectiveness of our proposed GF-NOMA schemes with the DQN algorithm via simulation. We adopt the standard network parameters listed in Table I following [43], and the hyperparameters for the DQN learning algorithm are listed in Table II.

TABLE I: Simulation Parameters
Path-loss exponent $\eta$ |
Noise power $\sigma^2$ | -132 dBm
Transmission power $P$ | 23 dBm
Received SINR threshold $\gamma_{th}$ | -10 dB
Duration of traffic $T$ | { , , , , }
Set of CTU numbers | { , , , }
Latency constraint | 2 ms and 8 ms
Bursty traffic parameter Beta($\alpha$, $\beta$) | (2, 4)
Number of bursty UEs $N$ | 20000

TABLE II: Learning Hyperparameters
Learning rate $\lambda_{\text{RMS}}$ |
$\epsilon$-greedy rate $\epsilon$ |
Discount rate $\gamma$ |

All testing performance results are obtained by averaging over 1000 episodes. The BS is located at the center of a circular area with a 10 km radius, and the UEs are randomly located within the cell. Unless otherwise stated, we consider the number of bursty UEs to be $N = 20000$. The DQN is set with two hidden layers, each with 128 ReLU units. In the following, we present our simulation results for the single-repetition configuration and the multi-parameter configuration in Section V-A and Section V-B, respectively. The single-repetition configuration is optimized under the latency constraint $T_{\text{cons}} = 2$ ms and the multi-parameter configuration under $T_{\text{cons}} = 8$ ms.

A. Single-Repetition Configuration
In the single-repetition configuration scenario, we set the number of CTUs as $C = 48$. Throughout each epoch, each UE has a periodic bursty traffic profile (i.e., the time-limited Beta profile defined in (1) with parameters (2, 4)) that has a peak around the 4000th TTI.
Fig. 10: Backlog traffic in each TTI under latency constraints $T_{\text{cons}} = 2$ ms and $T_{\text{cons}} = 8$ ms, respectively.

Fig. 10 plots the backlog traffic in each TTI under latency constraints $T_{\text{cons}} = 2$ ms and $T_{\text{cons}} = 8$ ms, respectively. It should be noted that the backlog traffic in each TTI includes not only the newly generated traffic but also the retransmission traffic, since UEs are allowed to retransmit in the next RTT under the latency constraint. The results show that when the latency constraint increases, the backlog traffic in each TTI increases, as the retransmission traffic increases. The backlog traffic in each TTI for the Proactive scheme is smaller than that of the K-repetition scheme due to the efficiency of the Proactive scheme, which is analyzed in detail in the following.

Fig. 11: Average received reward for each GF scheme: (a) K-repetition scheme; (b) Proactive scheme.

Fig. 11 plots the average received reward for the K-repetition scheme and the Proactive scheme, respectively. It can be seen that the average rewards of both the K-repetition and Proactive schemes converge to the optimal value after training. We can also observe that the average received reward of the Proactive scheme in Fig. 11(b) is higher than that of the K-repetition scheme in Fig. 11(a). This is because the Proactive scheme can terminate repetitions early and start a new packet transmission upon timely ACK feedback, which allows it to deal with the traffic more effectively.
Fig. 12: The transmission results for each GF scheme: (a) K-repetition scheme; (b) Proactive scheme.

Fig. 12 plots the number of successfully served UEs, non-collision UEs, collision UEs, and decoding failure UEs for the K-repetition scheme and the Proactive scheme, respectively, under the latency constraint $T_{\text{cons}} = 2$ ms. It is shown that the number of successfully served UEs achieved under the latency constraint for the Proactive scheme is almost up to 1.5 times more than that for the K-repetition scheme. This is because the UEs in the Proactive scheme can terminate their repetitions early to reduce the interference to other UEs, which leads to an increase in the number of successfully decoded UEs. In both Fig. 12(a) and Fig. 12(b), the number of collision UEs has a peak at around the 4000th TTI, coinciding with the peak traffic shown in Fig. 10. Due to the fact that only the non-collision UEs can be detected by the BS for data decoding, the number of successful UEs depends on the number of non-collision UEs. In addition, the number of decoding failure UEs in the K-repetition scheme reaches a peak due to the peak traffic at the 4000th TTI, which leads to the decrease in the number of successful UEs at that time.

B. Multi-Parameter Configuration Including the Repetition Values and the CTU Number
Fig. 13: Average received reward for each GF scheme with multi-parameter configuration: (a) K-repetition scheme; (b) Proactive scheme.

Fig. 13 plots the average received reward for the K-repetition scheme and the Proactive scheme, respectively, with multi-parameter configuration, including the repetition values and the CTU number. It can be seen that the average received rewards of both the K-repetition and the Proactive scheme converge to an optimal value after training. Compared to Fig. 11 under the latency constraint $T_{\text{cons}} = 2$ ms, the longer latency constraint $T_{\text{cons}} = 8$ ms leads to more retransmission packets, which results in serious traffic congestion. It should be noted that the performance degradation of the K-repetition scheme is much larger than that of the Proactive scheme, which shows the potential of the Proactive scheme in heavy traffic and long latency constraint situations due to its timely termination.

Fig. 14: The transmission results for each GF scheme with multi-parameter configuration: (a) K-repetition scheme; (b) Proactive scheme.

Fig. 14 plots the number of successful UEs, non-collision UEs, and decoding failure UEs for the K-repetition scheme and the Proactive scheme, respectively, with multi-parameter configuration, including the repetition values and the CTU number, under the latency constraint $T_{\text{cons}} = 8$ ms. We observe that the number of non-collision transmission UEs is similar for both schemes. However, the number of decoding failure UEs of the K-repetition scheme is almost up to five times more than that of the Proactive scheme at the peak traffic, due to the interference caused by multiple repetitions from collision UEs, which is consistent with Fig. 15. In addition, it is noted that in both the K-repetition scheme and the Proactive scheme, the number of successful UEs is lower under high traffic, especially at the peak traffic around the 4000th TTI.
TTIs C T U n m b e r ProactiveK-repetition (b) Number of CTU
Fig. 15: Actions for each GF scheme.Fig. 15 plots the actions of each scheme including the repetition value and the number ofCTUs for the K-repetition and the Proactive scheme, respectively. In Fig. 15 (a), we can see thatthe Proactive scheme adopts a higher and more stable repetition value due to its capability todeal with the traffic congestion. However, the repetition value of K-repetition scheme decreases first and then increases back to a higher value. This is because the agent in K-repetition schemelearns to sacrifice the current successful transmission to alleviate the traffic congestion to obtaina long-term reward. In Fig. 15 (b), It can be seen that the number of CTUs has a similar trend asthe repetition value in Fig. 15 (a), which may be caused by the sharing of actions as observationsamong agents. TTIs N u m b e r o f s u cce ss u s e r s Proactive-CMA-DQNK-repetition-CMA-DQNProactive-FixK-repetition-Fix
Fig. 16: The average number of success UEs for each scheme with learning framework andfixed parameters.Fig. 16 plots the average number of success UEs for the K-repetition scheme and the Proactivescheme by comparing that with learning framework and that with fixed parameters, respectively.Here, we set the fixed repetition value K = 8 and the CTU number C = 48 . Our resultsshown that the number of successfully served UEs achieved under the same latency constraintin our proposed learning framework is up to ten times for the K-repetition scheme, and twotimes for the Proactive scheme, more than those achieved in the system with fixed repetitionvalues and CTU numbers, respectively. This is because, in the learning framework, the agentlearns to dynamically configure lower repetition values and CTU numbers to alleviate the trafficcongestion to obtain a long-term reward.VI. C ONCLUSION
In this paper, we developed a general learning framework for dynamic resource configurationoptimization in signature-based GF-NOMA systems, including the sparse code multiple access(SCMA), multiuser shared access (MUSA), pattern division multiple access (PDMA), and etc, for mURLLC services under the K-repetition GF scheme and the Proactive GF scheme. This generallearning framework was designed to optimize the number of successfully served UEs under thelatency constraint via adaptively configuring the uplink resources, including the repetition valuesand the contention-transmission unit (CTU) numbers. We first performed a real-time repetitionvalue configuration for the two schemes, where a double Deep Q-Network (DDQN) was de-veloped. We then studied a Cooperative Multi-Agent (CMA) learning technique based on theDQN to optimize the configuration of both the repetition values and the contention-transmissionunit (CTU) numbers for these two schemes, by dividing high-dimensional configurations intomultiple parallel sub-tasks Our results have shown that 1) the number of successfully served UEsachieved under the same latency constraint in our proposed learning framework is up to ten timesfor the K-repetition scheme, and two times for the Proactive scheme, more than those achievedin the system with fixed repetition values and CTU numbers, respectively; 2) with learningoptimization, the Proactive scheme always outperforms the K-repetition scheme in terms of thenumber of successfully served UEs, especially under the long latency constraint; 3) our proposedgeneral learning framework can be used to optimize the resource configuration problems in allthe signature-based GF-NOMA schemes; and 4) determining the retransmission or not can beoptimized in the future by considering not only the long latency constraint but also the futuretraffic congestion, due to the fact that long latency constraint will lead to high future trafficcongestion. R EFERENCES [1] X. Zhang, J. Wang, and H. V. Poor, “Statistical delay and error-rate bounded QoS provisioning for mURLLC over 6G CFM-MIMO mobile networks in the finite blocklength regime,”
[1] X. Zhang, J. Wang, and H. V. Poor, "Statistical delay and error-rate bounded QoS provisioning for mURLLC over 6G CF M-MIMO mobile networks in the finite blocklength regime," IEEE J. Sel. Areas Commun., pp. 1–1, Sep. 2020.
[2] W. Saad, M. Bennis, and M. Chen, "A vision of 6G wireless systems: Applications, trends, technologies, and open research problems," IEEE Network, vol. 34, no. 3, pp. 134–142, May 2020.
[3] "Study on scenarios and requirements for next generation access technologies," 3GPP, Jul. 2020.
[4] "5G; service requirements for the 5G system," 3GPP, Mar. 2020.
[5] M. Latva-aho, K. Leppänen, F. Clazzer, and A. Munari, "Key drivers and research challenges for 6G ubiquitous wireless intelligence," 2020.
[6] "UL grant-free transmission for URLLC," R1-1705654, 3GPP TSG-RAN WG1, Apr. 2017.
[7] N. Ye, H. Han, L. Zhao, and A.-H. Wang, "Uplink nonorthogonal multiple access technologies toward 5G: A survey," Wireless Commun. Mobile Comput., vol. 2018, Jun. 2018.
[8] T. Taleb and A. Kunz, "Machine type communications in 3GPP networks: potential, challenges, and solutions," IEEE Commun. Mag., vol. 50, no. 3, pp. 178–184, Mar. 2012.
[9] "Considerations on random resource selection," R1-1608917, 3GPP TSG RAN WG1, Oct. 2016.
[10] "5G; NR; physical layer procedures for data," 3GPP, Mar. 2020.
[11] "UL grant-free transmission for URLLC," R1-1705246, Apr. 2017.
[12] "Discussion on HARQ support for URLLC," R1-1612246, 3GPP TR-RAN1, Nov. 2016.
[13] "Discussion on explicit HARQ-ACK feedback for configured grant transmission," R1-1903079, 3GPP TSG RAN WG1, Mar. 2019.
[14] R. Abbas, M. Shirvanimoghaddam, Y. Li, and B. Vucetic, "A novel analytical framework for massive grant-free NOMA," IEEE Trans. Commun., vol. 67, no. 3, pp. 2436–2449, Mar. 2019.
[15] M. B. Shahab, R. Abbas, M. Shirvanimoghaddam, and S. J. Johnson, "Grant-free non-orthogonal multiple access for IoT: A survey," IEEE Commun. Surveys Tutorials, pp. 1–1, May 2020.
[16] M. Shirvanimoghaddam, M. Condoluci, M. Dohler, and S. J. Johnson, "On the fundamental limits of random non-orthogonal multiple access in cellular massive IoT," IEEE J. Sel. Areas Commun., vol. 35, no. 10, pp. 2238–2252, Jul. 2017.
[17] Y. Liu, Y. Deng, N. Jiang, M. Elkashlan, and A. Nallanathan, "Analysis of random access in NB-IoT networks with three coverage enhancement groups: A stochastic geometry approach," IEEE Trans. Wireless Commun., pp. 1–1, Oct. 2020.
[18] N. Jiang, Y. Deng, X. Kang, and A. Nallanathan, "Random access analysis for massive IoT networks under a new spatio-temporal model: A stochastic geometry approach," IEEE Trans. Commun., vol. 66, no. 11, pp. 5788–5803, Jul. 2018.
[19] Y. Liu, Y. Deng, M. Elkashlan, A. Nallanathan, and G. K. Karagiannidis, "Analyzing grant-free access for URLLC service," IEEE J. Sel. Areas Commun., pp. 1–1, Aug. 2020.
[20] J. Zhang, L. Lu, Y. Sun, Y. Chen, J. Liang, J. Liu, H. Yang, S. Xing, Y. Wu, J. Ma, I. B. F. Murias, and F. J. L. Hernando, "PoC of SCMA-based uplink grant-free transmission in UCNC for 5G," IEEE J. Sel. Areas Commun., vol. 35, no. 6, pp. 1353–1362, Jun. 2017.
[21] X. Dai, Z. Zhang, B. Bai, S. Chen, and S. Sun, "Pattern division multiple access: A new multiple access technology for 5G," IEEE Wireless Commun., vol. 25, no. 2, pp. 54–60, Apr. 2018.
[22] Z. Yuan, C. Yan, Y. Yuan, and W. Li, "Blind multiple user detection for grant-free MUSA without reference signal," Sep. 2017, pp. 1–5.
[23] C. Wei, H. Liu, Z. Zhang, J. Dang, and L. Wu, "Approximate message passing-based joint user activity and data detection for NOMA," IEEE Commun. Lett., vol. 21, no. 3, pp. 640–643, Mar. 2017.
[24] B. Wang, L. Dai, T. Mir, and Z. Wang, "Joint user activity and data detection based on structured compressive sensing for NOMA," IEEE Commun. Lett., vol. 20, no. 7, pp. 1473–1476, Jul. 2016.
[25] A. T. Abebe and C. G. Kang, "Comprehensive grant-free random access for massive low latency communication," Jul. 2017, pp. 1–6.
[26] N. Ye, X. Li, H. Yu, A. Wang, W. Liu, and X. Hou, "Deep learning aided grant-free NOMA toward reliable low-latency access in tactile internet of things," IEEE Trans. Ind. Informat., vol. 15, no. 5, pp. 2995–3005, Jan. 2019.
[27] N. Ye, X. Li, H. Yu, L. Zhao, W. Liu, and X. Hou, "DeepNOMA: A unified framework for NOMA using deep multi-task learning," IEEE Trans. Wireless Commun., vol. 19, no. 4, pp. 2208–2225, Jan. 2020.
[28] W. Kim, Y. Ahn, and B. Shim, "Deep neural network-based active user detection for grant-free NOMA systems," IEEE Trans. Commun., vol. 68, no. 4, pp. 2143–2155, Jan. 2020.
[29] "Cellular system support for ultra-low complexity and low throughput Internet of Things (CIoT)," 3GPP, Nov. 2015.
[30] "Study on RAN improvements for machine-type communications," 3GPP, Sep. 2011.
[31] "Solutions for collisions of MA signatures," R1-1608860, 3GPP TSG-RAN WG1, Oct. 2016.
[32] "On MA resource and MA signature configurations," R1-1609227, 3GPP TSG-RAN WG1, Oct. 2016.
[33] J. Navarro-Ortiz, P. Romero-Diaz, S. Sendra, P. Ameigeiras, J. J. Ramos-Munoz, and J. M. Lopez-Soler, "A survey on 5G usage scenarios and traffic models," IEEE Commun. Surveys Tutorials, vol. 22, no. 2, pp. 905–929, Feb. 2020.
[34] A. K. Gupta and S. Nadarajah, Handbook of Beta Distribution and Its Applications. CRC Press, 2004.
[35] "Study on physical layer enhancements for NR ultra-reliable and low latency case (URLLC)," 3GPP, Mar. 2019.
[36] K. Au, L. Zhang, H. Nikopour, E. Yi, A. Bayesteh, U. Vilaipornsawai, J. Ma, and P. Zhu, "Uplink contention based SCMA for 5G radio access," Dec. 2014, pp. 900–905.
[37] A. C. Cirik, N. M. Balasubramanya, L. Lampe, G. Vos, and S. Bennett, "Toward the standardization of grant-free operation and the associated NOMA strategies in 3GPP," IEEE Commun. Standards Mag., vol. 3, no. 4, pp. 60–66, Dec. 2019.
[38] Y. Chen, A. Bayesteh, Y. Wu, B. Ren, S. Kang, S. Sun, Q. Xiong, C. Qian, B. Yu, Z. Ding, S. Wang, S. Han, X. Hou, H. Lin, R. Visoz, and R. Razavi, "Toward the standardization of non-orthogonal multiple access for next generation wireless networks," IEEE Commun. Mag., vol. 56, no. 3, pp. 19–27, Mar. 2018.
[39] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[40] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[41] H. Van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," arXiv preprint arXiv:1509.06461, Dec. 2015.
[42] T. Tieleman and G. Hinton, "Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Netw. Mach. Learn., vol. 4, no. 2, pp. 26–31, Oct. 2012.
[43] "Study on new radio access technology - physical layer aspects," 3GPP, TR 38.802 v14.0.0.