Massive Access in Beyond 5G IoT Networks with NOMA: NP-hardness, Competitiveness and Learning

Abstract

This paper studies the problem of online user grouping, scheduling and power allocation in beyond 5G cellular-based Internet of things networks. Due to the massive number of devices requesting access to the network, the non-orthogonal multiple access method is adopted in order to accommodate multiple devices in the same radio resource block. Different from most previous works, the objective is to maximize the number of served devices while allocating their transmission powers such that their real-time requirements as well as their limited operating energy are respected. First, we formulate the general problem as a mixed integer non-linear program (MINLP) that can be easily transformed into an MILP for some special cases. Second, we study its computational complexity by characterizing the NP-hardness of different special cases. Then, by dividing the problem into multiple NOMA grouping and scheduling subproblems, efficient online competitive algorithms are proposed. Further, we show how to use these online algorithms and combine their solutions in a reinforcement learning setting to obtain the power allocation and hence the global solution to the problem. Our analysis is supplemented by simulation results that illustrate the performance of the proposed algorithms in comparison with optimal and state-of-the-art methods.

Zoubeir Mlika and Soumaya Cherkaoui, Senior Member, IEEE

Index Terms —Internet of things, machine-to-machine, non-orthogonal multiple access, scheduling, power allocation, NP-hardness, online competitive algorithms, learning algorithms.

I. Introduction

Tens of billions of objects will be connected to the Internet in the near future [1]. These objects form the well-known Internet of things (IoT) [2], which is one of the promising applications in future wireless networks, including the fifth generation (5G) standard and beyond 5G (B5G). To realize IoT, machine-to-machine (M2M) communication is proposed, where objects communicate with each other with little or no human interaction [3]. The applications of M2M in IoT include smart cities, smart grids, industrial automation, healthcare, and intelligent transportation systems, to name only a few. Due to the maturity of cellular networks and their ability to provide wide-area coverage, the integration of M2M communication with cellular networks can be viewed as a viable solution to the realization of such applications. For example, narrow-band IoT (NB-IoT) [4] is proposed by the 3rd generation partnership project as a cellular-based M2M communication network using the long term evolution (LTE) standard.

The main requirement of cellular-based M2M communication is massive connectivity, i.e., a massive number of objects (interchangeably called devices) communicating with each other through cellular networks needs to be supported in the future [5, 6]. For example, in [7], it was expected that more than two billion objects will be directly connected to cellular networks. Besides massive connectivity, M2M traffic is generally different from traditional cellular traffic. It is sporadic and characterized by small-sized packets. Thus, maximizing the sum-rate is no longer the first priority in cellular-based M2M networks. Further, cellular-based M2M networks have more stringent requirements, including latency and energy-efficiency requirements. Consequently, new resource allocation methods that take these requirements into account are of great importance in such networks.

In this paper, we study an online resource allocation problem in cellular-based M2M networks. We deviate from most previous works, which (i) study sum-rate-related objectives and (ii) use the stochastic Lyapunov framework to solve the problem. In particular, we consider an objective that is better suited to the massive access problem in such networks: we maximize the number of served devices (NSD). To accommodate a large number of devices, the non-orthogonal multiple access (NOMA) technique is used. The problem is therefore to maximize the NSD while (i) grouping the devices into the available resource blocks, (ii) scheduling their transmissions to respect their real-time requirements, and (iii) allocating their transmission powers to respect their limited operating energy levels. This problem, called Grouping & Power Allocation (GPA), is solved in the online computation [8] and learning [9] frameworks in order to provide competitive and learning algorithms.

Remark 1 (Maximizing the NSD is Better). The following simple example shows that maximizing the sum-rate can be achieved while serving only a few devices. Assume that time is divided into two slots and there are two devices. The channel gain of the first device over the two slots is $[\ldots]^{\top}$, whereas the channel gain of the second device over the two slots is $[\ldots]^{\top}$. The maximum transmission powers of the two devices are $[\ldots]^{\top}$. Maximizing the sum-rate subject to power constraints over the two slots would serve only the first device in both slots (with an allocated transmission power of [...] in each slot), achieving a sum-rate of [...] bps/Hz. However, serving the second device in the first slot with an allocated transmission power of [...] and the first device in the second slot with an allocated transmission power of [...] would achieve a sum-rate of $[\ldots] + \lg 7$ bps/Hz ($< [\ldots]$ bps/Hz). Therefore, a larger sum-rate is achieved with fewer served devices.

Copyright (c) 2019 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE.

A. Related Work

Most previous works considered the problem of resource allocation in M2M networks from the perspective of either maximizing the sum-rate or minimizing the sum-power (or the energy consumption) or other related objectives. Further, few research papers focus on online competitive analysis and learning frameworks. A detailed survey on radio resource management in M2M networks is given in [3].

In [10], the authors give a short survey of the uplink grant problem in M2M networks and compare coordinated and uncoordinated access. The problem is formulated as a prediction problem: in order to reduce delay, devices do not have to send random access requests to the base station (BS); instead, the BS allocates its resources to the devices by predicting which device has packets to send. Further, the authors develop a two-stage solution based on machine learning. In [11], a multi-armed bandit approach is proposed to solve the problem of fast uplink grant access in M2M networks. The objective is to maximize a utility function that is a combination of data rate, access delay, and value of data packets. Since the set of possible actions is not known in advance, a sleeping multi-armed bandit technique is used. In [12], the authors study the resource management problem in green IoT multihop networks. IoT devices obtain energy through either grid power or harvesting. The problem is formulated as a stochastic optimization problem where the objective is to maximize a time-average network utility combined with energy purchasing costs. Lyapunov optimization techniques are used to obtain a stable solution in large and small time scales. In [13], the authors formulate a dynamic scheduling and power allocation problem in IoT networks using the NOMA technique. A stochastic optimization problem is formulated. The objective is to minimize the long-term average power consumption subject to maximum transmission power, scheduling, and long-term average rate constraints.
Well-known techniques are used to derive the Lyapunov function and the upper bound on the drift-plus-penalty. The problem is transformed into a set of static optimization problems that are solved iteratively. Then, the branch and bound technique is used to solve the problem. In [14], the authors study clustering and power allocation in NOMA systems to answer the following question: how to group the users into the resource blocks and how to allocate transmission powers in order to maximize the system throughput while guaranteeing the minimum rate requirements of the users and without exceeding maximum transmission powers. The solution is divided into (1) user clustering, by developing an algorithm that satisfies the successive interference cancellation (SIC) constraints, and (2) power allocation using Lagrangian methods. In [15], a dynamic power control and user pairing problem is solved in delay-constrained multiple access networks with hybrid orthogonal multiple access (OMA) and NOMA techniques. The objective is to minimize the long-term time-average transmission power while guaranteeing stable queues and a minimum time-average rate. By fixing the user pairing, the power allocation is derived. Further, for a given power allocation, the user pairing problem is solved using matching techniques. The authors of [16] consider hybrid OMA and NOMA techniques to solve the problem of maximizing the energy efficiency while guaranteeing minimum rate requirements and maximum transmission power. A swap matching algorithm is proposed to solve the user pairing problem under fixed power allocation. Once the user pairing is found, the power allocation is solved while maximizing the ratio of rate to total power consumption. The authors of [17] study the problem of maximizing the NSD to solve the uplink access problem in NOMA systems. Specifically, the objective is to maximize the NSD while allocating the channel to the devices and further guaranteeing the minimum rate requirements and the maximum transmission power.
The authors first find the power allocation by solving the feasibility problem of minimum rate requirements. The NOMA channel assignment problem is solved by reducing it to a maximum independent set problem. The analysis is only given for device pairing (two devices per NOMA group), and there is neither an NP-hardness proof nor real-time requirements for the devices. In [18], the authors study the problem of minimizing the maximum access delay under minimum rate requirements and prove its NP-hardness. They divide it into user scheduling and power control subproblems. The user scheduling is solved using a graph cutting method, while the power control is solved using an iterative algorithm. The authors of [19] study the problem of maximizing the NSD in NOMA systems. They propose a mathematical programming formulation to solve the problem. No NP-hardness result is provided. Further, there are no real-time requirements for the devices.

In the previously cited papers, there are some important research gaps that we fill in this work. First, the majority of the works focus on sum-rate-related objectives. Second, even when the objective is the NSD, the previously studied problems and our problem are fundamentally different, and there is a lack of studies on (1) the computational complexity of the problem, (2) competitive and learning algorithms, and (3) stringent requirements including real-time, rate, and power requirements.

B. Contributions

The main contributions of this work are summarized in the following list.
• We use mathematical programming techniques to model GPA as a mixed integer non-linear program, which can be easily transformed into an integer linear program for some special cases of interest.
• We characterize the computational complexity of GPA by studying its NP-hardness in different cases. We give a complete analysis of each case by presenting either a formal proof of NP-hardness or a polynomial-time algorithm.
• We start by analyzing GPA in some important special cases, where the problem involves NOMA grouping and scheduling, and we derive online competitive algorithms to solve it.
• We propose to combine our online competitive algorithms in a machine learning setting in order to solve GPA in the general case. That is, we propose learning algorithms that find the power allocation solution using the NOMA grouping and scheduling solutions.

[Figure 1: diagram of a BS and the slots of a frame, showing uplink transmissions from a NOMA group of size 4 and a NOMA group of size 3, and an empty slot.]

Fig. 1: System model with a single frame. Power allocation is shown below each device. An empty slot represents the case of constraint violations (e.g., real-time or power constraints).

C. Organization

The paper is organized as follows. Section II presents the system model. Section III formulates GPA as a mathematical program. Section IV studies its NP-hardness in different cases and discusses its offline solutions. Section V presents the proposed online competitive solutions, whereas Section VI presents the proposed learning algorithms. Section VII shows some simulation results, and finally, Section VIII draws some conclusions.

D. Notations

Lowercase boldface letters denote vectors, whereas uppercase boldface letters denote matrices and three-dimensional arrays. A set of $n$ elements is denoted by $[n] := \{1, 2, \ldots, n\}$ and its cardinality is denoted by $|[n]| := n$. An interval of integers is denoted by $\{a..b\}$, including $a$ and $b$. The interval $[0, x]$ is sampled using the power level $\tau \geq 1$ as $[x]_{\tau} := \{0, x/\tau, 2x/\tau, \ldots, x\}$, with cardinality $|[x]_{\tau}| = \tau + 1$. $O(\cdot)$ denotes the big-O notation.

II. System Model

We consider a cellular-based M2M network composed of one base station (BS) and $m$ devices. Time is divided into $k$ frames, where each frame is composed of $n$ time-slots of unit length each. In each frame, device $i$ may have a packet to send. The length (in bits) of device $i$'s packet in frame $t$ is $L_i^t \geq 0$ ($L_i^t = 0$ means that $i$ has no packet to send in frame $t$). The arrival time and the deadline of device $i$'s packet in frame $t$ are denoted by $a_i^t$ and $d_i^t$, respectively. The considered traffic pattern is similar to but more general than the well-known frame-synchronized traffic pattern [20, 21]. Device $i$ has $\bar{e}_i$ units of energy stored in its battery. For simplicity, a resource block (RB) is represented by a time-slot (but time/frequency RBs can also be used). Every frame has $n$ RBs and we denote an RB by the pair $(j, t)$ for time-slot $j$ of frame $t$. An RB has a bandwidth of $W$ Hz. An example of the system model is given in Fig. 1.

The wireless channel between device $i$ and the BS using RB $(j, t)$ is given by $h_{ij}^t$, which may include fast and slow fading. Let $x_{ij}^t$ be a binary variable that is 1 if and only if device $i$ is served using RB $(j, t)$. Also, let $p_{ij}^t$ denote the transmission power of device $i$ using RB $(j, t)$.
Finally, let $z_i^t$ be a binary variable that is 1 if and only if device $i$ is served in frame $t$. We use $\mathbf{X}$, $\mathbf{P}$, and $\mathbf{Z}$ to denote the multidimensional variables corresponding to $x_{ij}^t$, $p_{ij}^t$, and $z_i^t$, respectively.

The signal-to-interference-plus-noise ratio (SINR) achieved by device $i$ when served by the BS using RB $(j, t)$ is given by:
$$\mathrm{SINR}_{ij}^t(\mathbf{X}, \mathbf{P}) = \frac{x_{ij}^t\, p_{ij}^t\, g_{ij}^t}{1 + I_{ij}^t(\mathbf{X}, \mathbf{P})}, \tag{1}$$
where $g_{ij}^t = |h_{ij}^t|^2$ is the channel power gain and $I_{ij}^t(\mathbf{X}, \mathbf{P})$ is the power of the interference coming from other devices transmitting using RB $(j, t)$.

To serve a large number of devices, the power-domain NOMA technique is used [14], where a group of devices transmits to the BS using the same RB. Successive interference cancellation (SIC) is used at the BS for decoding. Let $A_j^t$ be the set of devices that are transmitting on RB $(j, t)$. It is common in uplink NOMA to decode the devices in decreasing order of channel gain [14, 16]. In other words, the interference received by device $i$ comes from all devices that have lower channel gains. We order the devices in $A_j^t$ with respect to $g_{ij}^t$ to obtain a new set $B_{ij}^t := \{i' \in A_j^t : g_{i'j}^t < g_{ij}^t\}$. With that said, the interference received by device $i$ using RB $(j, t)$ can be calculated as follows:
$$I_{ij}^t(\mathbf{X}, \mathbf{P}) = \sum_{i' \in B_{ij}^t} x_{i'j}^t\, p_{i'j}^t\, g_{i'j}^t. \tag{2}$$
The achievable rate between device $i$ and the BS using RB $(j, t)$ is given by:
$$R_{ij}^t(\mathbf{X}, \mathbf{P}) = W \lg\bigl(1 + \mathrm{SINR}_{ij}^t(\mathbf{X}, \mathbf{P})\bigr) \quad \text{[in bits/s]}. \tag{3}$$

The objective of GPA is to maximize the number of times the devices are served during the time horizon of $k$ frames. This has to be done while grouping the devices in each RB and allocating their transmission powers.
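Equations (1) to (3) can be sketched numerically for a single RB as follows. This is a minimal illustration rather than the authors' implementation: the function names `noma_sinr` and `noma_rates` are hypothetical, `lg` is assumed to be the base-2 logarithm, the noise power is taken as 1, and a device with zero power is treated as not scheduled ($x = 0$).

```python
import math


def noma_sinr(powers, gains):
    """SINR of each device on one RB under SIC, with noise power normalized to 1.

    powers[i] and gains[i] are the transmit power and channel power gain of
    device i (assumed inputs; a power of 0 means the device is silent).
    """
    sinrs = []
    for p_i, g_i in zip(powers, gains):
        # Eq. (2): interference comes from devices with a strictly lower gain.
        interference = sum(p * g for p, g in zip(powers, gains) if g < g_i)
        # Eq. (1): received power over (1 + interference).
        sinrs.append(p_i * g_i / (1.0 + interference))
    return sinrs


def noma_rates(powers, gains, bandwidth_hz=1.0):
    """Eq. (3): achievable rate W * log2(1 + SINR) of each device, in bits/s."""
    return [bandwidth_hz * math.log2(1.0 + s) for s in noma_sinr(powers, gains)]


# Three devices with unit power sharing one RB (illustrative values).
print(noma_sinr([1.0, 1.0, 1.0], [3.0, 1.0, 0.5]))
print(noma_rates([1.0, 1.0, 1.0], [3.0, 1.0, 0.5]))
```

The strongest device sees the other two as interference, while the weakest device is decoded last and sees none, which is the decoding order assumed in (2).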
GPA guarantees that the served devices respect (i) their data requirements, by sending all of their bits in each frame in which they are served, (ii) their maximum transmission powers, and (iii) their arrival times and deadlines.

GPA is studied in the online scenario under the full information assumption [22]. That is, each device knows only the current and previous information of all devices (including itself), i.e., at time-slot $j$ of frame $t$, the devices get to know the channel gains $g_{ij}^t$, the arrival times $a_i^t$, the deadlines $d_i^t$, and the data requirements $L_i^t$ for all $i$. The local information assumption [22], where each device knows only its own information in an online manner, is left for future work.

To be able to solve GPA optimally (using off-the-shelf solvers) in the offline scenario, we formulate it as a mathematical program in the next section.

Footnote: We normalize the channel power gain to get a noise power of 1.

III. Problem Formulation

GPA is formulated as follows:
$$
\begin{aligned}
\underset{\mathbf{X}, \mathbf{P}, \mathbf{Z}}{\text{maximize}} \quad & \sum_{i=1}^{m} \sum_{t=1}^{k} z_i^t && && \text{(P1a)}\\
\text{subject to} \quad & x_{ij}^t, z_i^t \in \{0, 1\},\; p_{ij}^t \geq 0, && \forall i, j, t, && \text{(P1b)}\\
& \sum_{j=1}^{n} R_{ij}^t(\mathbf{X}, \mathbf{P}) \geq L_i^t z_i^t, && \forall i, t, && \text{(P1c)}\\
& p_{ij}^t \leq \bar{e}_i x_{ij}^t, && \forall i, j, t, && \text{(P1d)}\\
& \sum_{j=1}^{n} \sum_{t=1}^{k} p_{ij}^t \leq \bar{e}_i, && \forall i, && \text{(P1e)}\\
& x_{ij}^t = 0, && \forall i,\; j \notin \{a_i^t .. d_i^t - 1\},\; t, && \text{(P1f)}\\
& x_{ij}^t \leq z_i^t, && \forall i, j, t, && \text{(P1g)}\\
& \sum_{i=1}^{m} x_{ij}^t \leq M, && \forall j, t, && \text{(P1h)}\\
& z_i^t \leq L_i^t, && \forall i, t. && \text{(P1i)}
\end{aligned}
$$

The objective function in (P1a) maximizes the number of times the $m$ devices are served during the time horizon of $k$ frames. Constraints (P1b) list the optimization variables. Constraints (P1c) guarantee a minimum of $L_i^t$ bits for device $i$ when served in frame $t$. Constraints (P1d) relate the variables $x_{ij}^t$ and $p_{ij}^t$: if $x_{ij}^t = 0$, then so is $p_{ij}^t$, and if $x_{ij}^t = 1$, then device $i$ can use any $p_{ij}^t$ that does not exceed the maximum transmission power $\bar{e}_i$ (if device $i$ is not served in slot $j$ of frame $t$, then it is not using any power, and it is using at most the maximum otherwise). Constraints (P1e) limit the total transmission power used by device $i$ over all RBs $(j, t)$. Constraints (P1f) force $x_{ij}^t = 0$ (and hence $p_{ij}^t = 0$ by (P1d)) whenever device $i$'s packet has not yet arrived or its deadline has passed. Constraints (P1g) relate the variables $x_{ij}^t$ and $z_i^t$: if $z_i^t = 0$, then so is $x_{ij}^t$ (if device $i$ is served in frame $t$, then there must exist a slot $j$ in which it is served). Constraints (P1h) limit the number of devices served in RB $(j, t)$ to a positive number $M \leq m$. Finally, constraints (P1i) mark device $i$ as not-yet-served in frame $t$ if it has no packet to send.

We can see that (P1) is non-linear and non-convex due to the multiplication of $\mathbf{X}$ and $\mathbf{P}$ in constraints (P1c). Note that the variables $\mathbf{X}$ and $\mathbf{P}$ are equivalent and can be related, i.e., $p_{ij}^t > 0 \Leftrightarrow x_{ij}^t = 1$. Thus, we can get rid of $\mathbf{X}$ from constraints (P1c) by writing $R_{ij}^t(\mathbf{X}, \mathbf{P}) = R_{ij}^t(\mathbf{1}, \mathbf{P})$. Despite this fact, (P1) is still a mixed integer non-linear program, which is very hard to solve in general.

In the sequel, the transmission power of each device is assumed to belong to some discrete set. This is a realistic assumption in many real systems [23–25]. Although the continuous power assumption can ease mathematical derivations through mathematical programming, GPA is still NP-hard even under the discrete power assumption (as will be shown shortly). Under the discrete power assumption, every device $i$ can choose its transmission power $p_{ij}^t$ from the set $[\bar{e}_i]_{\tau_i} := \{0, \bar{e}_i/\tau_i, 2\bar{e}_i/\tau_i, \ldots, \bar{e}_i\}$, where $\tau_i \geq 1$ for all $i$. An important case, called the binary power (BP) case, is when $\tau_i = 1$ for all $i$, and thus $[\bar{e}_i]_1 = \{0, \bar{e}_i\}$. We call the general case of $[\bar{e}_i]_{\tau_i}$ the general power (GP) case. The BP case is worth studying because it helps understand the intrinsic difficulty of the problem and helps in characterizing the structure of the solution in more general cases.

Remark 2 (Reduction from Multiple Frames to Single Frame). Since the devices have to be served during $k$ frames while respecting their limited operating energy levels, we observe that, in any optimal solution to GPA, each frame is associated with an amount of allocated power for each device. Thus, if we could find how much power to allocate to the devices in each frame, we could reduce the problem to $k$ single-frame problems and solve each one separately. We start by analyzing the problem in the case of a single frame (the superscript $t$ is dropped from all the notations). Then, we solve the problem in the more general case of multiple frames by applying machine learning techniques.

In the next section, we study GPA in the offline scenario with a single frame. We give some insights into the offline solutions.
Further, we study its computational complexity by characterizing its NP-hardness in different cases.

IV. NP-hardness and the Offline Scenario

A. The Offline Problem in the BP Case

We consider the offline version of GPA for $M = 1$. In this case, GPA is equivalent to the following: maximize the NSD during $n$ slots subject to the constraints of arrival times, deadlines, data requirements, and matching capacity (i.e., no more than one device in the same slot and no more than one slot for each device). We can solve this problem by reducing it to a maximum matching problem in a bipartite graph. First, we create a bipartite graph where the slots represent the left vertexes and the devices represent the right vertexes. An edge exists between slot $j$ and device $i$ if and only if $\bar{e}_i g_{ij} \geq 2^{L_i/W} - 1$ and $j \in \{a_i .. d_i - 1\}$. Every edge in the graph has capacity 1. By introducing a source vertex and a sink vertex, we can find the optimal solution to this maximum matching problem in polynomial time by applying a known maximum flow algorithm.

Solving GPA for general $M > 1$ is more difficult. In fact, GPA is NP-hard for $M \geq 3$, as we prove next. Nonetheless, it remains open whether or not GPA in the BP case is NP-hard for $M = 2$.

Theorem 1.

GPA is NP-hard in the BP case for $M \geq 3$.

Proof: We reduce 3-bounded 3-dimensional matching (3DM3) [26] to GPA. In 3DM3, we are given a set $T \subseteq W \times X \times Y$, where $W$, $X$, and $Y$ are disjoint sets having the same number $\ell$ of elements and $|T| = r$. Also, in 3DM3, no element of $W \cup X \cup Y$ occurs in more than three triples of $T$. We may assume without loss of generality that $\ell < r < [\ldots]\,\ell$. The goal is to find in $T$ a matching of maximum size, i.e., a subset $\mathcal{M} \subseteq T$ where no two elements of $\mathcal{M}$ agree in any coordinate.

Given an instance of 3DM3, an instance of GPA is obtained as follows. Let $M = 3$. We create $n = r$ slots; slot $j$ corresponds to the 3-element set $t_j$ from $T$. We also create $m = 2\ell + r$ devices. Specifically, for each $i \in W$, we have a device $w_i$; for each $i \in X$, we have a device $x_i$; and for each $i \in Y$, we have a device $y_i$. Also, there are $r - \ell$ additional devices $\{z_1, z_2, \ldots, z_{r-\ell}\}$. Let $a_i = [\ldots]$ and $d_i = n$, that is, device $i$'s packet arrives at the beginning of the frame and is due at its end. For all devices $i$, set $\bar{e}_i = 1$. Let $\Delta$ be a large number. For each device $w_i$ and slot $j$ corresponding to the 3-element set $t_j$, let $b_{w_i} = 1$ and
$$g_{w_i j} := \begin{cases} 3, & \text{if } i \in t_j,\\ [\ldots] + w_i/\Delta, & \text{otherwise.} \end{cases}$$
For each device $x_i$ and slot $j$ corresponding to the 3-element set $t_j$, let $b_{x_i} = 1/2$ and
$$g_{x_i j} := \begin{cases} 1, & \text{if } i \in t_j,\\ [\ldots] + x_i/\Delta, & \text{otherwise.} \end{cases}$$
For each device $y_i$ and slot $j$ corresponding to the 3-element set $t_j$, let $b_{y_i} = 1/2$ and
$$g_{y_i j} := \begin{cases} 1/2, & \text{if } i \in t_j,\\ [\ldots] + y_i/\Delta, & \text{otherwise.} \end{cases}$$
And for each additional device $z_i$ and slot $j$, let $b_{z_i} = [\ldots]$ and $g_{z_i j} = [\ldots] + z_i/\Delta$. This instance is clearly created in polynomial time. We prove that 3DM3 is solved with a matching of size $\ell$ if and only if GPA is solved with $2\ell + r$ served devices and each slot serves at most 3 devices.

On the one hand, if 3DM3 is solved, then for each 3-element set $m_j$ of the matching $\mathcal{M}$, we can serve three devices in slot $j$, one for each element of $m_j$. This is valid because the elements of $m_j$ come, respectively, from $W$, $X$, and $Y$, and thus the corresponding three devices have channel gains equal, respectively, to 3, 1, and 1/2. This means that, for device $w_i$ we have $3/(1 + 1 + 1/2) = 6/5 \geq 1$, for device $x_i$ we have $1/(1 + 1/2) = 2/3 \geq 1/2$, and for device $y_i$ we have $1/2 \geq 1/2$. We conclude that the three devices meet their data requirements. Since $\mathcal{M}$ is a matching in $T$ of size $\ell$, the slots corresponding to $\mathcal{M}$ serve a total of $3\ell$ devices, three devices in each slot. The remaining slots can serve at most $r - \ell$ devices, at most one device per slot. To maximize the number of devices served, each remaining slot can serve exactly one device. Thus, there are a total of $3\ell + r - \ell = 2\ell + r$ devices served, where each slot serves at most three devices.

On the other hand, assume that GPA is solved where each slot serves at most three devices, with a total of $2\ell + r$ served devices. We argue that if a slot serves three devices, then these devices correspond to some triple in $T$ (they have channel gains 3, 1, and 1/2). Note that, in the solution to GPA, we cannot serve two devices in each slot (for a total of $2r$ devices) because $2\ell + r > 2r$. Thus, to maximize the number of devices served, $\ell$ slots need to serve three devices each (and $r - \ell$ slots serve one device each), which corresponds to a matching in $T$.

The reduction is clearly done in polynomial time. Finally, since 3DM3 is a well-known NP-hard problem [26], the theorem follows.

B. The Offline Problem in the GP Case

In the following, we consider the GP case and we prove that GPA is NP-hard even when $M = 1$, i.e., when only the OMA technique is used.

Theorem 2.

GPA is NP-hard in the GP case even for $M = 1$.

Proof: We show that a special case of GPA is NP-hard. Let $M = 1$. Also, let $a_i = [\ldots]$ and $d_i = n$, that is, device $i$'s packet arrives at the beginning of the frame and is due at its end. Let us assume that $\bar{e}_i$ is large enough so that device $i$ cannot deplete its energy. Device $i$ transmits with fixed power $p_{ij}$ such that $\sum_{j=1}^{n} p_{ij} \leq \bar{e}_i$. Denote $G_{ij} := \lg(1 + p_{ij} g_{ij})$.

Under this restriction, we prove that GPA is NP-hard by reduction from the maximum independent set (MIS) problem in graph theory. MIS is defined [26] as follows: given a graph and a positive integer $\zeta$, is there an independent set in the graph of size $\zeta$ or more? An independent set is a set of vertexes no two of which share an edge.

On the other hand, the restricted version of GPA can be recast as: given the coefficients $G_{ij}$ and the packet sizes $L_i$, is there a schedule that serves more than $\sigma$ devices, denoted by the set $O$, such that a slot is used by at most one device and $\sum_{j \in O} G_{ij} \geq L_i$?

Given an instance of MIS, we create an instance of GPA in polynomial time as follows: the vertexes are the devices and the edges are the slots. The edges are numbered $1, 2, \ldots, n$. There is an edge between two devices if and only if one of them can be served at that slot. Let $O_i$ be the set of slots at which device $i$ can be served. Let $L_i := |O_i|$ for each device $i$ and set $\zeta = \sigma$. The coefficient $G_{ie}$ for device $i$ and slot $e$ is given by:
$$G_{ie} := \begin{cases} 0, & \text{if } e \notin O_i,\\ 1, & \text{if } e \in O_i. \end{cases} \tag{4}$$
This reduction is clearly done in polynomial time. Now it remains to prove that "there are more than $\zeta$ served devices, denoted by the set $O$, such that a slot is used by at most one device and $\sum_{j \in O} G_{ij} \geq L_i$" if and only if "there is an independent set in the graph of size $\zeta$ or more".

On the one hand, assume that we have an independent set in the graph of size $\zeta$ or more. By setting $x_{ie} = 1$ for every device $i$ in the independent set and every $e \in O_i$, we have more than $\zeta$ devices served. Further, since we have an independent set, the served devices do not overlap with one another. On the other hand, assume that we have a solution to the restricted version of GPA. Then we can see, by construction, that for device $i$ to be satisfied, it must be scheduled in all slots in $O_i$ (since $L_i = |O_i|$ and the $G_{ie}$ are binary). Since we have more than $\zeta$ non-overlapping served devices, these devices form, in the corresponding graph, an independent set of size more than $\zeta$.

Summarizing, we have reduced MIS to the restricted version of GPA in polynomial time such that MIS is solved if and only if GPA is solved. Since MIS is a well-known NP-hard problem [26], so is GPA. This proves the theorem.
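The correspondence used in this proof can be checked by brute force on small graphs. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, vertices are numbered from 0, the graph is assumed to have no isolated vertices, and both problems are solved by exhaustive search, which is practical only for toy instances.

```python
from itertools import combinations


def build_gpa_instance(num_vertices, edges):
    """Map a graph to the restricted GPA instance used in the proof.

    Vertices become devices and edge number e becomes slot e. slots_of[i] is
    O_i, the set of edges incident to vertex i, and demand[i] is L_i = |O_i|.
    """
    slots_of = {i: set() for i in range(num_vertices)}
    for e, (u, v) in enumerate(edges):
        slots_of[u].add(e)
        slots_of[v].add(e)
    demand = {i: len(slots) for i, slots in slots_of.items()}
    return slots_of, demand


def max_served(slots_of):
    """Brute-force the largest number of devices that can all be served.

    G_ie is 1 on O_i and 0 elsewhere, and L_i = |O_i|, so device i is satisfied
    only if it gets every slot of O_i. With one device per slot, the served
    devices must therefore have pairwise disjoint sets O_i.
    """
    devices = list(slots_of)
    for size in range(len(devices), 0, -1):
        for subset in combinations(devices, size):
            used = set()
            feasible = True
            for i in subset:
                if used & slots_of[i]:
                    feasible = False
                    break
                used |= slots_of[i]
            if feasible:
                return size
    return 0


def max_independent_set(num_vertices, edges):
    """Brute-force the size of a maximum independent set."""
    for size in range(num_vertices, 0, -1):
        for subset in combinations(range(num_vertices), size):
            chosen = set(subset)
            if not any(u in chosen and v in chosen for u, v in edges):
                return size
    return 0


# Toy check on a 5-cycle (illustrative graph, not from the paper).
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
slots_of, demand = build_gpa_instance(5, cycle)
print(max_served(slots_of), max_independent_set(5, cycle))
```

On the 5-cycle both values are 2: two devices can be served together exactly when their vertices are not adjacent, since adjacent vertices compete for the slot of their common edge.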
All of our NP-hardness results are presented in Table I. We use "Poly" to denote the polynomial-time complexity class and "Open?" to denote that, to the best of our knowledge, the problem is still open.

V. Competitive Algorithms

As previously discussed in Remark 2, we start by solving GPA in the single-frame case. Then, we solve it in the more general case of multiple frames. Before going into the details, we give the following definition.

Definition 1 ($c$-competitive algorithm [8]). An online algorithm alg is $c$-competitive if there is a constant $\alpha$ such that, for all finite input sequences,

$$O \le cA + \alpha, \tag{5}$$

where $A$ (resp. $O$) is the value returned by alg (resp. by an optimal algorithm opt) for a given input. alg is strictly $c$-competitive if it is $c$-competitive and $\alpha \le 0$.

We assume that device $i$ can choose its transmission power from $[\bar e_i]_1 = \{0, \bar e_i\}$. That is, device $i$ can either transmit in a slot or stay silent once and forever during the frame. This assumption implies that a device can use at most one RB. This is realistic in massive IoT networks, where devices normally have short data packets to send [6, 28].

Remark 3 (The Selfish Algorithm at Slot $j$). Assume, without loss of generality, that $g_{1j} < g_{2j} < \cdots < g_{mj}$. According to the decoding order of largest channel gains in uplink NOMA, device 1 knows that if $\bar e_1 g_{1j} < 2^{L_1/W} - 1$, then $p_{1j} = 0$. Now, if $\bar e_1 g_{1j} \ge 2^{L_1/W} - 1$, then setting $p_{1j} = \bar e_1$ would satisfy device 1 but may interfere with other devices. The following is called the selfish algorithm for device $i$: whenever $\bar e_i g_{ij} \ge 2^{L_i/W} - 1$, set $p_{ij} = \bar e_i$. We can prove that the selfish algorithm can perform very badly. Say device 1 acts selfishly. Then, we can find an instance in which only device 1 is served in slot $j$ (due to the severe interference that it generates), whereas removing device 1 from slot $j$ would satisfy all other devices in that slot. Assume we have L i = lg ( + i ) for all i and g j = but for all i ≠ , g ij = √ ( i − )( i + ) ( i − ) . In this case, if device 1 transmits in slot $j$, then no other device can transmit in that slot. However, if it does not, then every device $i \ne 1$ can transmit. This shows that the selfish algorithm has a competitive ratio of at least $m - 1$, which is very large.

The previous remark proves that acting selfishly is not very good in terms of maximizing the NSD. To provide better results, we first study the case of $M = 1$ and then the case of general $M$.

For $M = 1$, we can reduce GPA to an online matching problem in a bipartite graph as follows. The devices represent the right vertices of the bipartite graph. The slots represent the left vertices, which appear online one by one. An edge exists between slot $j$ and device $i$ if and only if $\bar e_i g_{ij} \ge 2^{L_i/W} - 1$ and $j \in \{a_i .. d_i - 1\}$. When slot $j$ appears, the channel gain $g_{ij}$ is revealed for all devices $i$, and thus the edges incident to it are also revealed. Once they are revealed, an online algorithm must make an irrevocable decision about which device to serve at slot $j$ (i.e., match the corresponding edge). This online matching problem can be solved using the well-known ranking algorithm, which has a competitive ratio of $e/(e-1) \approx 1.58$ [27]. The ranking algorithm chooses a random permutation $\rho$ of the devices. For each slot $j$, it finds the set $Y_j$ of not-yet-served devices that can transmit in this slot, i.e., those devices $i$ that have $\bar e_i g_{ij} \ge 2^{L_i/W} - 1$ and $j \in \{a_i .. d_i - 1\}$. If $Y_j$ is not empty, then the ranking algorithm chooses a device $i$ from $Y_j$ that minimizes $\rho(i)$. The ranking algorithm is equivalent to assigning priorities to the devices and choosing the not-yet-served device that has the highest priority.

To solve the problem for general $M \ge 1$, we transform it into a many-to-one matching problem and adopt a greedy approach to solve it. We create the same bipartite graph as before. Now, contrary to the case of $M = 1$, each slot can be matched to at most $M$ devices from those connected to it by an edge. For each slot $j \in [n]$, let $N_j$ denote the set of neighbors of $j$ (i.e., $N_j := \{i \in [m] : \{i, j\} \text{ is an edge}\}$). Once slot $j$ is revealed, the problem is reduced to finding a set of (at most $M$) devices $D_j \subseteq N_j$ of maximum cardinality such that

$$\bar e_i g_{ij} \ge \left(2^{L_i/W} - 1\right)\Bigl(1 + \sum_{i' \in D'_j} \bar e_{i'} g_{i'j}\Bigr) \tag{6}$$

holds for each $i \in D_j$, where $D'_j := \{i' \in D_j : g_{ij} > g_{i'j}\}$.

It is known that the complexity of SIC decoding increases as the number of users transmitting on the same RB increases [29]. Thus, in general, $M$ is chosen small in order to keep the complexity of SIC decoding low. In fact, multiple research papers consider the case of user pairing, i.e., $M = 2$. For general $M$, one could generate, in slot $j$, all combinations of at most $M$ devices and match the maximum-cardinality set $D_j$ that respects (6). This leads to a worst-case complexity of $O(m^M)$, which is polynomial only for fixed $M$. However, by analyzing the problem structure based on (6), we provide an optimal way of finding a maximum-cardinality set that satisfies (6) in $O(m)$ worst-case time.

Lemma 1.

Once slot $j$ is revealed, finding a maximum-cardinality set $D_j$ that satisfies (6) can be done in $O(m)$ worst-case time.

Proof: We prove that the greedy algorithm given in Algorithm 1, called binary-matching-j (bm$_j$) and applied at slot $j$, returns a maximum-cardinality set that satisfies (6) in $O(m)$ worst-case time, assuming that the channel gains $g_j$ are sorted; otherwise the complexity is $O(m \lg m)$.

Algorithm 1 The bm$_j$ algorithm
Input: $M$, $m$, $N_j$, $g_j$, $L$, $\bar e$
Output: $D_j$
  $\mathcal{X} \leftarrow \emptyset$
  $X \leftarrow 0$
  for $i \leftarrow 1$ to $m$ do
    if $i \in N_j$ then
      if $\bar e_i g_{ij} \ge (2^{L_i/W} - 1)(1 + X)$ then
        $\mathcal{X} \leftarrow \mathcal{X} \cup \{i\}$
        $X \leftarrow X + \bar e_i g_{ij}$
  if $|\mathcal{X}| \le M$ then
    $D_j \leftarrow \mathcal{X}$
  else
    Let $D_j \subset \mathcal{X}$ with $|D_j| = M$
  return $D_j$

The worst-case time complexity of bm$_j$ is clearly $O(m)$. It remains to show that the algorithm returns a feasible solution of maximum cardinality that satisfies (6).

First, it is clear that $|D_j| \le M$. Using mathematical induction on the iterations of the algorithm, we prove that the set $D_j$ is a feasible solution that satisfies (6). Let $D^p_j$ be the set of devices held by bm$_j$ after iteration $p$, and write $b_i := 2^{L_i/W} - 1$. For $p = 1$, device 1 is added to $D_j$ only if $\bar e_1 g_{1j} \ge b_1$, and thus $D^1_j$ is feasible. Assume that $D^p_j$ is feasible. Is $D^{p+1}_j$ feasible? At iteration $p + 1$, the algorithm adds device $p + 1$ to $D^p_j$ only if $\bar e_{p+1} g_{p+1,j} \ge b_{p+1}(1 + X)$, where $X = \sum_{i \in D^p_j} \bar e_i g_{ij}$. If this condition is not met, then $D^{p+1}_j = D^p_j$ and we are done. Otherwise, $D^{p+1}_j = D^p_j \cup \{p + 1\}$. Since the channel gains are sorted in increasing order, $g_{p+1,j} \ge g_{ij}$ for all $i \in D^p_j$. According to the largest-channel-gain decoding order of SIC, the devices already in $D^p_j$ are not affected by the transmission of device $p + 1$. Because device $p + 1$ is added to $D^p_j$ only if (6) is respected, we conclude that $D^{p+1}_j$ is feasible. Combining the base case and the inductive step, the returned set $D_j = D^m_j$ is feasible.

To prove optimality, let $\{i_1, i_2, \ldots, i_\ell\}$ be the set of served devices in the order they were added to $D_j$, and let $\{i^*_1, i^*_2, \ldots, i^*_{\ell^*}\}$ be the set of devices in the order they were added to the set $O_j$ returned by some optimal algorithm. We assume without loss of generality that $g_{i_1 j} < g_{i_2 j} < \cdots < g_{i_\ell j}$ and $g_{i^*_1 j} < g_{i^*_2 j} < \cdots < g_{i^*_{\ell^*} j}$. Optimality amounts to proving that $\ell = \ell^*$.

First, we prove by induction on $\ell' \le \ell$ that $g_{i^*_{\ell'} j} \ge g_{i_{\ell'} j}$. The base case, $\ell' = 1$, is clearly true: the first device served by bm$_j$ has the smallest channel gain. Assume now that, for $\ell' > 1$, the claim holds for $1, 2, \ldots, \ell' - 1$, i.e., $g_{i^*_{\ell'-1} j} \ge g_{i_{\ell'-1} j}$. Is it true for $\ell'$? If $g_{i^*_{\ell'} j} < g_{i_{\ell'} j}$, then bm$_j$ would have chosen $i^*_{\ell'}$ instead of $i_{\ell'}$ because, using the inductive hypothesis,

$$g_{i_{\ell'} j} > g_{i^*_{\ell'} j} \ge b_{i^*_{\ell'}}\Bigl(1 + \sum_{i = i^*_1}^{i^*_{\ell'-1}} g_{ij}\Bigr) \ge b_{i^*_{\ell'}}\Bigl(1 + \sum_{i = i_1}^{i_{\ell'-1}} g_{ij}\Bigr).$$

Thus, for all $\ell' \le \ell$, it is true that $g_{i^*_{\ell'} j} \ge g_{i_{\ell'} j}$.

Now, we use the previous fact to prove, by contradiction, that $\ell = \ell^*$. Assume that $\ell^* > \ell$. In other words, there exists a device $i^*_{\ell+1} \in O_j$, or equivalently, the optimal algorithm chooses $i^*_{\ell+1}$ in iteration $\ell + 1$. Thus,

$$g_{i^*_{\ell+1} j} \ge b_{i^*_{\ell+1}}\Bigl(1 + \sum_{i = i^*_1}^{i^*_{\ell}} g_{ij}\Bigr) \ge b_{i^*_{\ell+1}}\Bigl(1 + \sum_{i = i_1}^{i_{\ell}} g_{ij}\Bigr),$$

where the last inequality follows from the previous fact. Since the optimal algorithm chooses $i^*_{\ell+1}$ in iteration $\ell + 1$, we have $g_{i^*_{\ell+1} j} > g_{i^*_{\ell} j} \ge g_{i_\ell j}$. We can see that, in iteration $\ell + 1$, device $i^*_{\ell+1}$ has a larger channel gain than device $i_\ell$ and can be added to $D_j$. Since bm$_j$ stopped adding devices at iteration $\ell$, we reach a contradiction and conclude that $\ell = \ell^*$.

Finally, the set $D_j$ returned by bm$_j$ is of maximum cardinality and is obtained in $O(m)$ worst-case time. This proves the lemma.

TABLE I: Complexity Classification (GPA with GP and GPA with BP, by group size $M$)

The proposed algorithm to solve GPA is called the binary-matching-slots (bms) algorithm and its pseudo-code is given in Algorithm 2. For each arriving slot, bms calls bm$_j$ and serves the maximum possible number of devices in that slot. Then, it updates the set of not-yet-served devices and continues in this way. We prove that this algorithm is 2-competitive.

Algorithm 2

The bms algorithm
Input: Bipartite graph, $M$, $m$, $n$, $g$, $L$, $\bar e$
Output: $\{D_1, D_2, \ldots, D_n\}$
  $X \leftarrow [m]$
  for each slot $j$ do
    $D_j \leftarrow$ bm$_j$($M$, $m$, $N_j$, $g_j$, $L$, $\bar e$)
    $X \leftarrow X \setminus D_j$
  return $\{D_1, D_2, \ldots, D_n\}$

Theorem 3. bms is 2-competitive.

Proof: Let $D_M := \{D_1, D_2, \ldots, D_n\}$ be the set of devices served by bms and let $O_M = \{O_1, O_2, \ldots, O_n\}$ be the set of devices served by some optimal algorithm opt, where $D_j$ (resp. $O_j$) are the devices served by bms (resp. by opt) at slot $j$. Based on Lemma 1, the number of devices in $O_j \setminus D_M$ is at most the number of devices in $D_j$, since otherwise bms would have chosen $O_j \setminus D_M$ instead of $D_j$. Thus, by summing over $j$, the number of devices in $O_M \setminus D_M$ is at most the number of devices in $D_M$. Since $O_M \subseteq D_M \cup (O_M \setminus D_M)$, we obtain

$$|O_M| \le |D_M| + |O_M \setminus D_M| \le |D_M| + |D_M| \le 2|D_M|.$$

Theorem 4.

There is no deterministic online algorithm with a better competitive ratio than bms.

Proof: Assume $m = n = 2$. Let $\bar e_1 = \bar e_2 = L_1 = L_2 = 1$. The channel gains in slot 1 are $g_1 = [1, 1]$. Now, if an online algorithm decides to serve device 1 (resp. device 2) in slot 1, then we can choose the channel gains in slot 2 as $g_2 = [1, 0]$ (resp. $g_2 = [0, 1]$). An optimal offline algorithm can serve device 2 in slot 1 and device 1 in slot 2 if $g_2 = [1, 0]$, or it can serve device 1 in slot 1 and device 2 in slot 2 if $g_2 = [0, 1]$. In either case, the offline-to-online ratio is 2.

A. Benchmark Algorithms

For comparison purposes, in this section, we present an adapted version of a clustering algorithm, hereinafter called ath (after M. S. Ali, H. Tabassum, and E. Hossain), proposed in [14]. The original algorithm is offline and works with channel gains that are independent of the RBs. We transform it into an online algorithm as follows. First, for simplicity, we assume here that $m$ is a multiple of $M$ and denote $\kappa := m/M$. For each new slot $j$, ath creates $\kappa$ clusters, each containing exactly $M$ devices, by sorting the channel gains in descending order. That is, if $g_{1j} > g_{2j} > \cdots > g_{mj}$, then cluster $l$ contains the devices $\{l, \kappa + l, 2\kappa + l, \ldots, (M-1)\kappa + l\}$. Now, for each slot $j$, ath iterates over the clusters and checks whether the arrival times, deadlines and data rate requirements of the devices in cluster $l$ are respected. If not, devices are removed iteratively from cluster $l$ until no constraint is violated. Once all clusters are checked, ath picks the cluster with the maximum NSD. For subsequent slots, ath proceeds similarly, except that already served devices are removed from the clusters.

We present another adapted version of a benchmark algorithm, called zz (after D. Zhai and R. Zhang), proposed in [17]. The original algorithm is offline, is based on solving independent sets in graphs, and was proposed only for $M = 2$. For $M = 2$, it generates all pairs of devices in each slot. By checking the constraints on arrival times, deadlines, and data rate requirements, the pairs are updated in each slot, meaning that a pair can be reduced to a single element or to the empty set if necessary. zz constructs an undirected graph $G$ whose nodes are the possible sets of devices (paired or not) in each slot. So, a node $v$ is given by the tuple $(c, j)$, where $c$ is either a single device or a pair of devices served in slot $j$. An edge between node $(c, j)$ and node $(c', j')$ exists if and only if $j = j'$ or $c \cap c'$ is not empty. Once the graph is constructed, zz creates a new graph $H$ by splitting every node $(c, j)$ in $G$ with $|c| = 2$ (see [17] for the construction). Device pairing is then obtained by solving the maximum independent set problem in the new graph $H$ using a greedy approach. zz is still offline since it must know all the upcoming slots to construct the graph $G$.

B. Running Time Complexity
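To make the costs discussed in this subsection concrete, the following Python sketch implements the per-slot greedy rule of Algorithm 1 and the slot loop of Algorithm 2. It is an illustration under stated assumptions, not the authors' implementation: the function names `bm_j` and `bms`, the 0-based indexing of devices and slots, the half-open activity windows `(a_i, d_i)` meaning slots `a_i, ..., d_i - 1`, and the noise power normalized to one (as in (6)) are all choices made here. The sort inside `bm_j` accounts for the $O(m \lg m)$ term when the gains are not pre-sorted.

```python
from typing import List, Sequence, Tuple


def bm_j(M: int, neighbors: Sequence[int], gains: Sequence[float],
         L: Sequence[float], e_bar: Sequence[float], W: float = 1.0) -> List[int]:
    """Greedy per-slot selection in the spirit of Algorithm 1 (hypothetical helper)."""
    selected: List[int] = []
    interference = 0.0  # running sum of e_bar[i] * gains[i] over selected devices
    # Visit candidates by increasing channel gain (stable sort, O(m lg m)).
    for i in sorted(neighbors, key=lambda d: gains[d]):
        threshold = (2.0 ** (L[i] / W) - 1.0) * (1.0 + interference)
        if e_bar[i] * gains[i] >= threshold:
            selected.append(i)
            interference += e_bar[i] * gains[i]
    # Keep at most M devices; dropping devices only lowers interference.
    return selected[:M]


def bms(M: int, gains_per_slot: Sequence[Sequence[float]], L: Sequence[float],
        e_bar: Sequence[float], windows: Sequence[Tuple[int, int]],
        W: float = 1.0) -> List[List[int]]:
    """Slot-by-slot loop in the spirit of Algorithm 2 (hypothetical helper)."""
    unserved = set(range(len(L)))
    schedule: List[List[int]] = []
    for j, gains in enumerate(gains_per_slot):
        # Edges of the bipartite graph incident to slot j, restricted to unserved devices.
        neighbors = [
            i for i in sorted(unserved)
            if windows[i][0] <= j < windows[i][1]
            and e_bar[i] * gains[i] >= 2.0 ** (L[i] / W) - 1.0
        ]
        served = bm_j(M, neighbors, gains, L, e_bar, W)
        unserved.difference_update(served)
        schedule.append(served)
    return schedule


if __name__ == "__main__":
    # Three devices, one slot, group size 2, unit energies and unit rate requirements.
    print(bms(2, [[1.0, 2.0, 3.0]], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [(0, 1)] * 3))
```

Each call to `bm_j` touches every candidate once after sorting, so the loop in `bms` costs $n$ times the per-slot cost, matching the $O(n f(m))$ form used in this subsection.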

Here, we analyze the worst-case time complexities of the different algorithms. We summarize the results in Table II. The complexity of bms is clearly $O(n f(m))$, where $O(f(m))$ is the complexity of bm$_j$, which is equal to $O(m)$ for sorted channel gains $g_j$ or $O(m \lg m)$ otherwise. A similar analysis gives an $O(n f(m))$ complexity for ath. As for zz, the generation of all pairs is done in $O(m^2)$. To construct the graph, one has to iterate over the slots, which gives a complexity of $O(nm^2)$. Once the graph is constructed, finding a maximal independent set using the classical greedy approach requires $O(m^4)$ complexity since the constructed graph has $O(m^2)$ nodes.

VI. Learning Algorithms

When there are multiple frames, the problem involves power allocation as well as NOMA grouping and scheduling. Note that, without further assumptions, one cannot hope to obtain good performance in terms of competitiveness. Specifically, say there are two frames and a single device. If an online algorithm decides to allocate some transmission power $p > 0$ in the first frame, then an adversary can choose the channel gains such that $p \max_j \{g_j\} < 2^{L/W} - 1$ but $p' g_j \ge 2^{L/W} - 1$ for some slot $j$ with $p' > p$. Next, the adversary can also choose the channel gains in the second frame such that $\bar e \max_j \{g_j\} < 2^{L/W} -$

1. In this manner, the adversary can serve the device once in frame one with $p'$, whereas the online algorithm never serves the device. For this reason, we are motivated to consider a relative performance measure, and thus we adopt the learning framework.

In order to obtain a global solution (for multiple frames) to GPA, we use machine learning techniques. More specifically, we combine our proposed online competitive algorithms with reinforcement learning techniques to obtain the power allocation solution to GPA.

We model GPA with multiple frames as an online deterministic Markov decision process (MDP). This modeling is important for applying reinforcement learning techniques and helps us transform the problem into an online (stochastic) shortest path problem. The MDP is deterministic because the transition probabilities are known. The corresponding transition graph (TG) is constructed as follows. A state (or a node) in the TG is a tuple ( e t , t ) , for t = , . . . , k +

1, where e t = [ e t , e t , . . . , e tm ] (cid:62) represents the remaining battery level ofthe devices in frame t . When t =

1, the node s (cid:66) ( e , ) iscalled the starting node, where e i = ¯ e i for all i . There is aterminal node denoted by t (cid:66) ( e k + , k + ) , where e k + i = i . For t = , , . . . , k +

1, a transition from ( e t , t ) to ( e t + , t + ) happens with probability one if and only if e t − e t + (cid:60) . Noother transition is allowed. For t = , , . . . , k +

2, the action set corresponding to state $(\mathbf{e}^t, t)$ is given by the Cartesian product $[e^t_1]_{\tau_1} \times [e^t_2]_{\tau_2} \times \cdots \times [e^t_m]_{\tau_m}$; that is, an action taken in state $(\mathbf{e}^t, t)$ that leads to state $(\mathbf{e}^{t+1}, t+1)$ is a transmission power vector $\mathbf{p}^t = [p^t_1, p^t_2, \ldots, p^t_m]^\top$. In other words, the possible actions in state $(\mathbf{e}^t, t)$ are given by the outgoing edges of node $(\mathbf{e}^t, t)$. Denote by $m_i := |[\bar e_i]_{\tau_i}| = \tau_i + 1$ for all $i$ and by $m_x := \prod_{i=1}^{m} m_i$. The TG contains $2 + k m_x$ states and $m_x(m_x(k-1) + k + 3)/2$ edges. The reward of taking action $\mathbf{p}^t = [p^t_1, p^t_2, \ldots, p^t_m]^\top$ in state $(\mathbf{e}^t, t)$ is the NSD in frame $t$, which can be obtained by applying the previously proposed online competitive algorithm bms.

Under this modeling, GPA can be seen as an $s$-$t$ shortest path problem in the corresponding TG or, equivalently, as finding the $s$-$t$ path with the highest reward (by transforming rewards into losses, we can move between the longest path and the shortest path formulations). Nonetheless, finding such an $s$-$t$ path is too complex because the numbers of nodes and edges in the TG are exponentially large. For example, for three frames, fifteen devices and a power level of two ($m_i = 2$ for all $i$), the TG contains approximately one hundred thousand nodes and one billion edges. Due to the curse of dimensionality, we follow a distributed approach to solve the power allocation learning problem. The approach is distributed in the sense that each device learns its own $s$-$t$ shortest path by exchanging information with the other devices through the BS. (Of course, there is a trade-off here between complexity and information exchange, whose analysis we leave for future work.) We use a modified version of exp3 [31], a popular reinforcement learning algorithm for the adversarial multi-armed bandit problem. For comparison purposes, we also adopt the classical tabular Q-learning algorithm [32].

TABLE II: Worst-Case Time Complexities
Algorithm | Complexity
bms | $O(nm \lg m)$
ath | $O(nm \lg m)$
zz | $O(nm^2 + m^4)$

Fig. 2: An instance of the TG of one device with $[2]_2 = \{0, 1, 2\}$ and three frames. The rounded squares in the middle of the edges represent the actions.

A. EXP3-based Distributed Learning

exp3 is based on exponential weighting for exploration and exploitation and was proposed to solve the non-stochastic (adversarial) multi-armed bandit problem [31]. Each device learns its own $s$-$t$ path by applying a modified version of exp3. Each device has its own TG. As before, a state in each TG$_i$ is given by $(e^t_i, t)$, where $e^t_i$ represents the remaining energy level at frame $t$ in device $i$'s battery, with $e^1_i = \bar e_i$. For any state $(e^t_i, t)$ of TG$_i$, an action is given by the transmission power $p^t_i \in [e^t_i]_{\tau_i}$. See Fig. 2. Normally, when device $i$, in state $(e^t_i, t)$, chooses action $p^t_i \in [e^t_i]_{\tau_i}$, its reward is a binary number that represents whether or not it is served. Designing the rewards in this way teaches the devices to act selfishly and thus does not necessarily give a good outcome; i.e., the total NSD could be very low because each device will learn to use its transmission power to get served regardless of the others (see Remark 3). It is thus necessary to redesign the rewards to improve the learning outcome. Instead of the binary rewards, each device receives its reward as the NSD in each frame.
This can be acquired by information feedback between the devices and the BS. The main lines of the learning algorithm, called the path-learning (pl) algorithm, are given for each round as follows.

• Device $i$ chooses an action $p^t_i$ for each frame $t$ according to some probability, i.e., it chooses an $s_i$-$t_i$ path in TG$_i$. We denote this path by the vector $\mathbf{p}_i = [p^1_i, p^2_i, \ldots, p^k_i]^\top$.
• Device $i$ sends its chosen $s_i$-$t_i$ path to the BS.
• The BS runs the online per-frame competitive algorithm bms at frame $t$ with power allocation $\mathbf{p}^t = [p^t_1, p^t_2, \ldots, p^t_m]^\top$ and calculates the NSD.
• The BS broadcasts the rewards to each device (the reward received by device $i$ is the NSD in frame $t$). That is, device $i$ knows not only the reward of its chosen $s_i$-$t_i$ path, but also the reward on each edge of that path.
• Device $i$ updates the probabilities.

pl operates in rounds. In each round, it is applied at device $i$, which chooses an $s_i$-$t_i$ path according to some probability (proportional to the path weight). This probability follows a distribution over the set of all $s_i$-$t_i$ paths that mixes exponential weighting of biased estimates of the rewards with a uniform distribution, the latter ensuring sufficiently large exploration of each edge of any $s_i$-$t_i$ path. After choosing an $s_i$-$t_i$ path, device $i$ gets to know the rewards on each edge of that path, i.e., it gets to know the NSD in each frame. Then, pl updates the probability distribution (by updating the path weights) and continues similarly.

We notice that every TG$_i$ has $2 + k m_i$ nodes and $m_i(m_i(k-1) + k + 3)/2$ edges, and that every $s_i$-$t_i$ path in TG$_i$ has length $k + 1$. Let $P_i$ be the set of all $s_i$-$t_i$ paths in TG$_i$ and let $\sigma_i := |P_i|$ denote the number of such paths. We can prove that $\sigma_i = \binom{k + m_i - 1}{k}$, which is exponentially large, and thus choosing the paths in this way according to their weights is not efficient. However, a simple modification can improve the algorithm enormously [33]. First, instead of assigning weights to paths, they are assigned to edges. Second, we construct a set of edge-covering $s_i$-$t_i$ paths $C_i$, which is defined as a set of paths in TG$_i$ such that, for any edge $e$ in TG$_i$, there is a path $\mathbf{p}_i$ in $C_i$ with $e \in \mathbf{p}_i$. Such an edge-covering path set $C_i$ can be obtained in $O(k m_i^2 + k m_i \lg(k m_i))$ time using Dijkstra's algorithm, where $|C_i| = O(k m_i^2)$. Now, instead of each path, each edge $e$ of TG$_i$ is assigned a weight $w(e)$ (initialized to one for each edge at the beginning of the rounds), and the weight of an $s_i$-$t_i$ path is given by the product of the weights of its edges. For each round, pl, applied at device $i$, chooses an $s_i$-$t_i$ path (1) uniformly from $C_i$ with probability $\gamma$ or (2) according to the path weights with probability $1 - \gamma$. In the latter case, the $s_i$-$t_i$ path can be chosen by adding its vertices one by one according to the edges' weights (and not to the paths' weights) [33]. Next, pl finds the probability of choosing each edge in TG$_i$, which can also be done using the edges' weights only. (It can be proven that choosing paths and updating the edges' probabilities can be done efficiently based on path kernels and dynamic programming [34].) Then, for each frame (or equivalently for each edge), the rewards are obtained using bms, where the reward $r$ at any edge $e$ is normalized by the probability $q(e)$ of that edge, i.e., the normalized reward is $(\beta + r\,\mathbb{1}\{e \in \mathbf{p}_i\})/q(e)$, where $\mathbb{1}_A$ denotes the indicator function and $\beta \in (0, 1]$.
Finally, the edges' weights are updated as $w(e) \leftarrow w(e)\, e^{\eta r}$, where $\eta > 0$. The per-round complexity of pl is $O(k m m_i + k n m \lg m)$, where $O(k n m \lg m)$ is the complexity of applying bms in all frames and $O(k m m_i)$ is the complexity of choosing the paths according to the edges' weights and updating the probability of each edge.

B. Tabular-based Distributed Learning
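As a minimal sketch of the tabular scheme this subsection describes, the Python fragment below shows one device's epsilon-greedy action choice and the undiscounted update of equation (7). Several things are assumptions made for illustration only: the state encoding `(remaining energy level, frame index)`, integer power levels `0..energy`, the helper names `feasible_actions`, `epsilon_greedy` and `q_update`, and the default step size. The reward is passed in by the caller; in the text it is the NSD fed back by the BS after running bms.

```python
import random
from collections import defaultdict
from typing import DefaultDict, List, Tuple

State = Tuple[int, int]  # (remaining energy level, frame index): an assumed encoding
QTable = DefaultDict[Tuple[State, int], float]


def feasible_actions(state: State) -> List[int]:
    """Power levels the device may spend: 0 up to its remaining energy (assumption)."""
    energy, _frame = state
    return list(range(energy + 1))


def epsilon_greedy(Q: QTable, state: State, eps: float, rng: random.Random) -> int:
    """Explore with probability eps, otherwise pick the action with the largest Q-value."""
    actions = feasible_actions(state)
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def q_update(Q: QTable, state: State, action: int, reward: float,
             next_state: State, alpha: float = 0.5) -> None:
    """Undiscounted tabular update: Q(s,a) += alpha * (r + max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in feasible_actions(next_state))
    Q[(state, action)] += alpha * (reward + best_next - Q[(state, action)])


if __name__ == "__main__":
    Q: QTable = defaultdict(float)
    rng = random.Random(0)
    state = (2, 1)                       # two energy units left, first frame
    action = epsilon_greedy(Q, state, eps=0.1, rng=rng)
    next_state = (state[0] - action, state[1] + 1)
    q_update(Q, state, action, reward=3.0, next_state=next_state)  # reward: NSD stand-in
    print(Q[(state, action)])
```

Unvisited entries default to zero, so the state reached after the last frame contributes nothing to the update, which is one simple way to treat the terminal node.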

We use the tabular Q-learning algorithm [32]. The Q-learning algorithm is called ql and it proceeds in episodes. In each episode, each device $i$ chooses an $s_i$-$t_i$ path, receives a reward and updates its Q-table. Precisely, each device $i$ has a Q-table $Q_i(s, a)$ that measures the quality of a state-action combination $(s, a)$, where $s$ represents a node in TG$_i$ and $a$ represents a chosen transmission power in state $s$. For each episode, each device $i$ starts the learning in the initial state $s_i$. For each step in that episode, that is, for each frame $t$, device $i$ chooses a possible action (according to its state) using the epsilon-greedy approach and moves to the next state $s'$. Once all devices have chosen their actions, the BS runs the online competitive algorithm bms in frame $t$ and feeds the rewards back to each device (device $i$ receives the NSD in frame $t$). Next, each device moves to the next state and updates its Q-table. As soon as the last frame is reached and the Q-table is updated, the devices move to the next episode and the Q-learning algorithm continues. Updating the Q-table is done as follows:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \bigl(r + \max_{a'} Q(s', a') - Q(s, a)\bigr). \tag{7}$$

The per-episode complexity of ql is given by $O(k n m \lg m)$, where $O(n m \lg m)$ is the complexity of applying bms in each frame and updating the Q-table.

VII. Simulation Results

This section illustrates the performance of the proposed algorithms through computer simulations. We consider a geographical zone modeled by a square of side 1000 meters. The BS is located at the center of this zone, whereas the devices are randomly and uniformly distributed inside the square. The simulation parameters are based on 3GPP specifications [35, p. 481] as in [6, 19]. The carrier frequency is $f_c =$

900 MHz and the path-loss (in dB) at $f_c$ is given by $120.9 + 37.6 \log_{10}(\mathrm{dist}^t_i) + \alpha_G + \alpha_L$ [35, p. 481], where $\mathrm{dist}^t_i$ is the distance (in km) between device $i$ and the BS at frame $t$, α G = − α L =

10 dB is the penetration loss. Flat Rayleigh fading is also considered, and thus $g^t_{ij}$ includes the previous path-loss model as well as an exponential random variable with unit parameter. The power spectral density of the noise is −174 dBm/Hz and the noise figure is 5 dB. Unless specified otherwise, the next parameters are fixed as follows. Each device $i$ has a maximum transmission power of $\bar e_i = 23$ dBm [35, p. 481]. The group size is $M = 2$. The bandwidth is 200 kHz and the bandwidth of a single RB is $200/n$ kHz, where $n$ is the total number of RBs. The data requirements of the devices follow a uniform distribution as L ti ∼ unif { , L max } with $L_{\max} = 100$ kbits. The arrival times are given by $a^t_i \sim \mathrm{unif}\{1, n\}$ and the deadlines are given by $d^t_i \sim \mathrm{unif}\{a^t_i + 1, n + 1\}$. The optimization problem (P1) is modeled in AMPL [36] and solved using CPLEX. All simulations are performed for independent random realizations and averaged.

The next figures show the results for the case of a single frame.

Fig. 3a shows the performance of the proposed online competitive algorithm bms for different values of $M$ and $n$ against two benchmark algorithms proposed for NOMA grouping, ath and zz, and also against the optimal offline algorithm opt, obtained by solving (P1) through AMPL modeling and the CPLEX solver. When the number of RBs $n$ or the group size $M$ increases, the number of served devices increases faster with $m$. When the number of devices is small and the number of RBs is large, the offline algorithm zz achieves only slightly better performance than our proposed algorithm bms, even though bms is online. This might be due to the heuristic approach used in zz to find a maximal independent set in the graph $H$, as discussed in Section V-A. Another point is that bms achieves much better performance than ath because (1) the latter mainly optimizes the sum-rate objective and not the NSD, and (2) the pairs of devices in the latter are fixed a priori (thus, because of the stringent constraints in GPA, very few pairs can satisfy these constraints under NOMA). Lastly, we see that the performance of bms, despite its being online and despite the proven 50% theoretical gap in Theorem 3, is not very far from opt even for large values of $n$ and $M$. In the worst case, bms achieves about 87% of opt, which is much better than the proven 50% theoretical gap.

Fig. 3b presents the performance of bms for $m = 60$ against ath, zz and opt. When the number of devices and the group size are fixed, there is an optimal value of $n$ at which the performance is maximized. When $n$ increases above this optimal value, the NSD starts to decrease since the bandwidth of each RB becomes small. In other words, when $n$ continues to grow, which decreases the bandwidth of each RB, the interference inside each NOMA group becomes intolerable and the devices cannot meet their requirements. Lastly, the performance of bms is still the best among the non-optimal algorithms, except for large $n$, where the offline zz becomes better (due to its offline nature and probably to exploring more nodes in the graph $H$ and thus selecting larger maximal independent sets) at the expense of higher running time complexity. Finally, bms achieves close-to-optimal performance for different $n$; in the worst case it achieves about 78% of opt, which is much better than the proven 50% theoretical gap.

Fig. 4 presents the performance of bms against ath, zz, and opt when the battery capacity of the devices $\bar e$ changes. Increasing the battery capacity increases the NSD quickly (the increase rate is faster when the number of RBs is larger). However, since the number of RBs and the group size

are limited, the increase rate slows down as the battery capacity increases, and thus the curves start to converge. Despite being online and much simpler, bms is very close to opt and, as discussed previously, zz, due to its offline nature, is better than bms only for large $n$, at the expense of higher running time complexity.

Fig. 3: Impact of $m$ and $n$. (a) Impact of $m$: average NSD versus the number of devices ($m$). (b) Impact of $n$: average NSD versus the number of RBs ($n$).

Fig. 4: Impact of $\bar e$: average NSD versus the battery capacities ($\bar e$ in dBm).

Fig. 5a shows the performance of bms for large $m$ and $M$. (Comparison with other algorithms is missing due to running time issues and incompatibility with large $M$.) When the minimum rate requirement $L_{\max}$ is large, increasing $M$ does not help improve the performance, even for different $n$, because the interference inside a NOMA group becomes large and thus no more devices can be admitted. However, when $L_{\max}$ is not very large, increasing $n$ and $M$ can help improve the performance. When the network is really dense, it is not beneficial to increase $n$ or $M$ by much. Due to the increased SIC complexity and the limited performance improvements, it is better to keep the values of $n$ and $M$ moderate, e.g., when $L_{\max} =$

100 kbits, n = M =

20 serves about 13 . n = M =

40 serves about 14 .

25% of the devices. It is thus better to choose n = M =

20 rather than n = M =

40 (the latter gives a gain of less than one percentage point). Fig. 5b shows the impact of $L_{\max}$ on the performance of the algorithms for $m =$

60. When the number of RBs $n$ is small, the NSD decreases slightly for large values of $L_{\max}$. However, when $n$ is large, the NSD decreases faster with $L_{\max}$. This is because, when $n$ is large and $L_{\max}$ is small, a significant number of devices can be grouped using NOMA (almost 80% of the devices are served). As soon as $L_{\max}$ increases, a significant number of devices become unsatisfied and thus the performance drops. Nonetheless, the drop is much milder when $n$ is small, since very few devices are grouped using NOMA (almost 30% of the devices are served) for small $L_{\max}$. Thus, unless $L_{\max}$ is large, NOMA can help serve the same small number of devices with different rate requirements. On the other hand, NOMA can serve a larger number of devices but is more influenced by their stringent rate requirements. This remark was derived in [19] when comparing NOMA and OMA.

In the next simulations (Fig. 6), we show the performance of the proposed learning algorithms, which learn the power allocation across the frames. The number of rounds (or episodes) is denoted by $T$. Unless stated otherwise, k = n = m = T = α = .

5, the pl’s parameters are γ = . β = .

01, and η = γ /( ( k + ) σ i ) .

Fig. 5: Impact of $m$, $M$, and $L_{\max}$. (a) Impact of large $m$ and $M$: average NSD of bms versus the number of devices ($m$). (b) Impact of $L_{\max}$: average NSD versus the minimum rate requirements ($L_{\max}$ in kbits).

Fig. 6: Impact of $m$ and the power consumption across the frames. (a) Impact of $m$: average NSD versus the number of devices ($m$) for pl, ql and rl. (b) Power consumption: average PC (in watt) versus the frames ($k$) for pl, ql and rl.

In Fig. 6a, we compare pl against ql and a very simple algorithm, called random learning (rl), which assigns a random amount of power in each frame (from the amount of power left). Learning the transmission powers using pl achieves the best performance, whereas the worst performance is achieved, as expected, by rl, whose random choices deplete the energy quickly as the number of frames increases. Comparing pl and ql, the performance of the latter degrades as the number of devices increases. The performance gap between pl and ql is about 1 .

13 for m =

50 whereas it becomes about 1 .

26 for m = /√ T even for adversarial inputs.

In Fig. 6b, we plot the average power consumption (PC) of all devices across the frames. The PC is averaged over all devices and over all random realizations, and it measures the average power allocation of all devices in each frame. We see that rl depletes its transmission powers in the first few frames, ends up without energy at later times, and thus serves few devices. This is because, as the number of frames increases, the random choices available to rl decrease since the sampled power set $[\bar e_i]_{\tau_i}$ shrinks. The average PC of ql is much better than that of rl, but ql still allocates more transmission power to the first frames. In contrast, pl allocates the transmission powers well enough to give a good learning outcome. Indeed, pl has an almost uniform average PC across the frames and thus saves more energy for future frames. Consequently, the power allocation obtained by pl helps improve the performance by serving the largest number of devices compared to ql and rl.

VIII. Conclusion

In this paper, we studied the online grouping and power allocation problem in beyond 5G cellular-based IoT NOMA networks, where we introduced stringent requirements including real-time, rate, and energy constraints. First, we formulated the problem as a mathematical optimization model using integer programming techniques. Then, we studied the complexity of the problem by characterizing its NP-hardness in different and important cases. To solve the problem in a practical way, we divided it into subproblems of NOMA grouping and scheduling. Then, we proposed online competitive algorithms to group the devices using NOMA.
To obtain the transmission power allocation solution, we proposed to use machine learning techniques and combined the NOMA grouping and scheduling solutions to obtain a global solution. Specifically, by modeling the power allocation problem as an online (stochastic) shortest path problem in directed graphs, we proposed two reinforcement learning algorithms: (1) one based on exponential weighting for exploration and exploitation and (2) Q-learning. We showed that our proposed solutions can help solve, in an online fashion, the massive access problem efficiently in future networks.