Reducing the Computational Complexity of Multicasting in Large-Scale Antenna Systems

Abstract

In this paper, we study physical layer multicasting to multiple co-channel groups in large-scale antenna systems. The users within each group are interested in a common message, and different groups have distinct messages. In particular, we aim at designing the precoding vectors solving the so-called quality of service (QoS) and weighted max-min fairness (MMF) problems, assuming that the channel state information is available at the base station (BS). To solve both problems, the baseline approach exploits the semidefinite relaxation (SDR) technique. Considering a BS with N antennas, the SDR complexity is more than \mathcal{O}(N^{6}), which prevents its application in large-scale antenna systems. To overcome this issue, we present two new classes of algorithms that not only have significantly lower computational complexity than existing solutions but also largely outperform the SDR-based methods. Moreover, we present a novel duality between transformed versions of the QoS and the weighted MMF problems. The duality explicitly determines the solution to the weighted MMF problem given the solution to the QoS problem, and vice versa. Numerical results are used to validate the effectiveness of the proposed solutions and to make comparisons with existing alternatives under different operating conditions.



Meysam Sadeghi, Student Member, IEEE, Luca Sanguinetti, Senior Member, IEEE, Romain Couillet, Senior Member, IEEE, and Chau Yuen, Senior Member, IEEE

Index Terms: Physical layer multicasting, large-scale antenna systems, massive MIMO multicasting, computational complexity.

M. Sadeghi and C. Yuen are with Singapore University of Technology and Design (SUTD), Singapore. L. Sanguinetti is with the University of Pisa, Dipartimento di Ingegneria dell'Informazione, Italy, and also with the Large Systems and Networks Group (LANEAS), CentraleSupélec, Université Paris-Saclay, France. R. Couillet is with the Signals and Statistics Group at CentraleSupélec, Université Paris-Saclay, France. Part of the work is supported by A*Star SERC project number 142-02-00043. L. Sanguinetti and R. Couillet have been supported by the ERC Starting Grant 305123 MORE. L. Sanguinetti is also funded by the 5GIOTTO project from University of Pisa.

I. INTRODUCTION

The advent of data-hungry services and applications has significantly increased the amount of data traffic of wireless networks [1]. A considerable amount of this traffic belongs to services that are of interest to one or several groups of subscribers, such as news headlines, financial data, regular system updates, and video broadcasting [1], [2]. The traditional unicast technology is highly inefficient for these services, as it ignores the nature of such a traffic demand [2]–[4]. To address this issue, multicasting technology has been included in different releases of the third generation partnership project (3GPP) [2].

Physical layer multicasting is an efficient multicasting technique designed for wireless networks [3], [4]. It has been widely studied in the literature for either single or multiple groups of users [3]–[14]. In single-group multicasting, a transmitter exploits channel state information (CSI) to send out a common stream of data to one group of users, whereas in multigroup multicasting multiple independent streams of common data are sent to multiple distinct groups of users. In this context, two classes of problems have received particular attention: the so-called quality-of-service (QoS) problem and the weighted max-min-fairness (MMF) problem. The former aims to minimize the total transmit power while satisfying target signal-to-interference-plus-noise ratios (SINRs) at the active user equipments (UEs). The latter seeks to maximize the minimum weighted SINR among all the UEs in the system, subject to a total transmit power constraint.

A seminal treatment of single-group multicasting for both the QoS and MMF problems was first presented in [3]. Therein, it was proved that both problems are NP-hard, and an approximate solution was then presented employing the semidefinite relaxation (SDR) technique [15]. This work was then extended to a multigroup single-cell scenario in [4]. 
It should be noted that, since in the multigroup case the SINR of every UE depends on the precoding vectors of all the other groups, even finding a feasible solution for the QoS and MMF problems might be a challenging task [4], [11]. Therefore, in [4] the SDR technique is followed by a randomization and a multigroup multicast power control phase. In [5], the MMF problem is studied under per-antenna power constraints for multigroup single-cell systems. Coordinated physical layer multicasting for the single-group multicell scenario is investigated in [6], and its application to coordinated physical layer multicasting in the multigroup multicell scenario is studied in [14].

The aforementioned works (among many others) are based on the SDR technique, which is characterized by high computational complexity when the system dimensions grow large, especially for large antenna arrays. More precisely, consider a single-cell network wherein a BS with N antennas serves K UEs in G multicasting groups. Then, solving the QoS problem via SDR requires $\mathcal{O}(\sqrt{GN})$ iterations of an interior point method, with each iteration requiring $\mathcal{O}(G^3N^6 + KGN^2)$ arithmetic operations [4]. The computational cost of finding an approximate solution for the MMF problem is even higher, as its solution is achieved by iteratively solving different instances of the QoS problem. (For brevity, hereafter we refer to the weighted MMF problem simply as the MMF problem.) Therefore, SDR-based solutions are not suitable for practical implementation when N, G, or K grow large, as envisioned in large-scale antenna systems (commonly known as Massive MIMO systems), wherein N can be of the order of hundreds [16]–[23].

In the context of Massive MIMO multicasting, two possible approaches have been recently proposed, namely, the asymptotic approach and the successive convex approximation (SCA) approach [11], [23]–[28]. 
The former exploits the asymptotic orthogonality of the channel vectors, when N grows very large and K is kept fixed, to simplify the SINR expression of each UE and facilitate the design of asymptotically optimal beamforming schemes [23]–[25]. In particular, [25] investigates the MMF problem for multigroup single-cell multicasting, whereas the single-group multicell case is studied in [24]. The extension to a multigroup multicell network is considered in [23]. The main problem with the asymptotic approach is that an extremely large number of antennas is required to reach the asymptotic orthogonality condition. As a consequence, the performance of the asymptotically optimal precoders is poor when the system does not have an extremely large number of antennas (on the order of thousands) [23].

The SCA approach aims at iteratively solving the non-convex QoS and MMF problems by means of successive convex approximations of the original problems around a feasible point [29], [30]. More specifically, the algorithm starts from an initial feasible point, the non-convex constraints are approximated by convex functions around this point, and the resulting convex problem is solved before proceeding to the next iteration. This procedure is repeated until convergence to a stationary point. In [26], the SCA technique has been applied to reduce the computational complexity of beamforming design in single-group multicasting for large-scale antenna arrays. However, the SCA method is not suitable for multigroup multicasting communications, as it requires an initial feasible point, which is hard to compute in these scenarios [11]. To handle this issue, a feasible point pursuit SCA (FPP-SCA) algorithm is proposed in [27] and applied to multigroup multicasting in [11]. Therein, in order to guarantee the feasibility of the problem, slack variables are added to relax the constraints, and a penalty is used to ensure that the slacks are sparingly used. 
The solution of the resulting optimization problem is then used for another round of approximation, and the procedure is repeated until convergence. However, the method itself has two drawbacks. First, although the solution of the approximated problem is always feasible, it might not be a feasible solution of the original multicasting problem, and it is sensitive to the initial point of the algorithm, as detailed in [27]. Second, it is still computationally demanding when the number of antennas grows.

In this paper, we address all the aforementioned drawbacks for a multigroup single-cell system by introducing a two-layer precoding scheme, which is tailored for large-scale antenna systems. Our main contributions are summarized as follows:
1) We present two algorithms for the QoS and MMF problems that outperform most of the aforementioned solutions while guaranteeing a low computational complexity.
2) We reveal new duality results that allow solving both the QoS and MMF problems simultaneously. This is in sharp contrast with the existing algorithms, for which the MMF problem is solved by iteratively solving different instances of the QoS problem.
3) We introduce a heuristic algorithm that significantly reduces the computational complexity while only slightly degrading the performance of both the QoS and MMF solutions.

The remainder of the paper is organized as follows. Section II introduces the system model for a multigroup single-cell large-scale antenna array system and formulates the corresponding QoS and MMF problems. Section III introduces the proposed two-layer precoder, provides a duality result between transformed versions of the QoS and MMF problems, and then proposes two algorithms for solving both. Section IV introduces a heuristic solution to further reduce the computational complexity. Section V presents the numerical results, whereas conclusions are drawn in Section VI.

Notations:

Scalars are denoted by lower case letters, whereas boldface lower (upper) case letters are used for vectors (matrices). We denote by $\mathbf{0}$ a matrix of appropriate size whose elements are all zero. The transpose, conjugate transpose, real part, absolute value, and Euclidean norm operators are denoted by $(\cdot)^T$, $(\cdot)^H$, $\mathrm{Re}(\cdot)$, $|\cdot|$, and $\|\cdot\|$, respectively. The set of all positive real numbers is denoted by $\mathbb{R}_+$. A circularly symmetric complex Gaussian random vector $\mathbf{x}$ is denoted by $\mathbf{x}\sim\mathcal{CN}(\boldsymbol{\mu},\mathbf{C})$, where $\boldsymbol{\mu}$ and $\mathbf{C}$ are its mean and covariance matrix, respectively. The inverse of an invertible function $f(\cdot)$ is denoted by $f^{-1}(\cdot)$.

II. SYSTEM MODEL AND PROBLEM FORMULATION

Consider a single-cell large-scale antenna array system in which a BS equipped with N antennas serves G multicasting groups. Denote by $\mathcal{G} = \{1,\dots,G\}$ the set of indices of all groups, and call $\mathcal{K}_j$ the set of UE indices associated with group j, with cardinality $K_j = |\mathcal{K}_j|$ and such that $\mathcal{K}_j\cap\mathcal{K}_i = \emptyset$, $\forall j\neq i$, i.e., each UE is associated with a single group. Within this setting, we assume that $N > K - \min_j K_j$, where $K = \sum_{j=1}^{G}K_j$ is the total number of UEs in the network. Since we consider large antenna systems, this technical assumption is naturally in place. A double index notation is used to refer to each UE as, e.g., "user k in group j". Under this convention, let $\mathbf{g}_{jk}\in\mathbb{C}^N$ be the channel between UE k in group j and the BS, and assume that $\mathbf{g}_{jk} = \sqrt{\beta_{jk}}\,\mathbf{h}_{jk}$, where $\mathbf{h}_{jk}\sim\mathcal{CN}(\mathbf{0}_N,\mathbf{I}_N)$ is the small-scale fading channel and $\beta_{jk}$ accounts for the large-scale channel attenuation (or path loss). We assume that the BS has perfect knowledge of the channel vectors $\{\mathbf{g}_{jk}\}$. Denoting by $\mathbf{w}_j\in\mathbb{C}^N$ the precoding vector associated with group j, the signal $y_{jk}$ received at UE k in group j can be written as

$$y_{jk} = \mathbf{g}_{jk}^H\mathbf{w}_j s_j + \sum_{i=1,i\neq j}^{G}\mathbf{g}_{jk}^H\mathbf{w}_i s_i + n_{jk} \tag{1}$$

where $s_i\sim\mathcal{CN}(0,1)$ is the signal intended for group i, assumed independent across i, and $n_{jk}\sim\mathcal{CN}(0,\sigma_{jk}^2)$ accounts for the additive Gaussian noise. The SINR $\gamma_{jk}$ of UE k in group j can be written as

$$\gamma_{jk} = \frac{|\mathbf{g}_{jk}^H\mathbf{w}_j|^2}{\sum_{i=1,i\neq j}^{G}|\mathbf{g}_{jk}^H\mathbf{w}_i|^2 + \sigma_{jk}^2} \tag{2}$$

and the total average transmit power is $\sum_{j=1}^{G}\|\mathbf{w}_j\|^2$. Under the above assumptions, an instance of the QoS problem can be formulated as follows [4]:

$$\mathcal{Q}(\boldsymbol{\eta}):\ \min_{\{\mathbf{w}_j\}}\ \sum_{j=1}^{G}\|\mathbf{w}_j\|^2 \tag{3}$$
$$\text{s.t.}\ \ \gamma_{jk}\ge\eta_{jk}\quad\forall j,k \tag{4}$$

where $\eta_{jk}$ is the prescribed SINR of UE k in group j and $\boldsymbol{\eta}\in\mathbb{R}_+^{K}$ is the vector collecting all the $\{\eta_{jk}\}$. Accordingly, an instance of the MMF problem is [4]:

$$\mathcal{F}(\boldsymbol{\eta},P):\ \max_{\{\mathbf{w}_j\}}\ \min_j\min_k\ \frac{\gamma_{jk}}{\eta_{jk}} \tag{5}$$
$$\text{s.t.}\ \ \sum_{j=1}^{G}\|\mathbf{w}_j\|^2\le P \tag{6}$$

where P accounts for the power constraint at the BS, and $\eta_{jk}$ sets the weight $1/\eta_{jk}$ of $\gamma_{jk}$.
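To make (2) and the objectives of (3) and (5) concrete, a minimal NumPy sketch is given below. It assumes, for illustration only, that the channels of group j are stacked as the columns of an N × K_j array and that the precoders $\mathbf{w}_j$ are the columns of an N × G array W; the function names `sinr`, `total_power`, and `mmf_objective` are hypothetical and not part of the paper.

```python
import numpy as np

def sinr(G_list, W, noise_var):
    """SINR gamma_jk of (2) for every UE (hypothetical helper).

    G_list[j]: N x K_j array whose columns are the channels g_jk of group j.
    W: N x G array whose column i is the precoder w_i.
    noise_var[j]: length-K_j array of noise variances sigma_jk^2.
    """
    out = []
    for j, Gj in enumerate(G_list):
        gains = np.abs(Gj.conj().T @ W) ** 2          # K_j x G: |g_jk^H w_i|^2
        signal = gains[:, j]                          # intended group j
        interference = gains.sum(axis=1) - signal     # all groups i != j
        out.append(signal / (interference + noise_var[j]))
    return out

def total_power(W):
    """Objective of (3): sum_j ||w_j||^2."""
    return float(np.sum(np.abs(W) ** 2))

def mmf_objective(gammas, eta):
    """Objective of (5): min over j, k of gamma_jk / eta_jk."""
    return min(float(np.min(g / e)) for g, e in zip(gammas, eta))
```

For instance, a set of precoders is feasible for (4) when every entry of `sinr(...)[j] / eta[j]` is at least one, and `mmf_objective` evaluates the cost in (5) for a given weight vector.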
As mentioned before, $\mathcal{Q}(\boldsymbol{\eta})$ and $\mathcal{F}(\boldsymbol{\eta},P)$ are NP-hard, and the existing algorithms for computing their approximate solutions have either high computational complexity [3], [5], [11] or poor performance [23], [24], [28]. A two-layer architecture is proposed next to overcome these drawbacks.

III. THE PROPOSED TWO-LAYER PRECODING SCHEME

In this section, we propose a simple and computationally efficient method to compute approximate solutions to the QoS and MMF problems. The method is based on a two-layer precoding scheme: (i) the outer layer restricts the space of valid precoders to those cancelling the inter-group interference, thereby approximating the QoS and MMF problems by simpler (still non-convex) problems, denoted by QoS$_\text{dec}$ and MMF$_\text{dec}$, for which trivial feasible points can be found; (ii) starting from these feasible points, the inner layer is designed to reach a suboptimal solution to the QoS$_\text{dec}$ and MMF$_\text{dec}$ problems, which are also feasible solutions of the original QoS and MMF problems. Section III-A presents the outer layer. Section III-B reveals an explicit duality between the QoS$_\text{dec}$ and MMF$_\text{dec}$ problems. Section III-C presents the inner layer and the algorithms developed. Section III-D evaluates the complexity of the proposed algorithms.

A. Outer Layer – Removing Multigroup Interference

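Denote by $\mathbf{G}_i\in\mathbb{C}^{N\times K_i}$ the matrix collecting the channel vectors of all the $K_i$ UEs in group i. The complete elimination of the multigroup interference $\sum_{i=1,i\neq j}^{G}\mathbf{g}_{jk}^H\mathbf{w}_i s_i$ in (1) is achieved by using the block-diagonalization zero-forcing (BDZF) technique [31], [32]. Consider a two-layer precoding vector for group j as follows:

$$\mathbf{w}_j = \mathbf{F}_j\mathbf{c}_j\quad\forall j \tag{7}$$

where $\mathbf{c}_j\in\mathbb{C}^{N-\tau_j}$ with $\tau_j = K - K_j$ is the inner layer, the design of which is discussed later, and $\mathbf{F}_j\in\mathbb{C}^{N\times(N-\tau_j)}$ is the outer layer. We design $\mathbf{F}_j$ as an isometric matrix whose columns form a basis for the null space of $\mathbf{G}_{-j}^H$, where $\mathbf{G}_{-j} = [\mathbf{G}_1,\dots,\mathbf{G}_{j-1},\mathbf{G}_{j+1},\dots,\mathbf{G}_G]\in\mathbb{C}^{N\times\tau_j}$, i.e., $\mathbf{G}_{-j}^H\mathbf{F}_j = \mathbf{0}_{\tau_j\times(N-\tau_j)}$. As proposed in [31], [32], $\mathbf{F}_j$ can be obtained through the singular value decomposition (SVD) of $\mathbf{G}_{-j}$. This requires a number of floating point operations (flops) that grows quadratically with N, on the order of $N^2\sum_{j=1}^{G}(K-K_j)$ [33]. The same goal can be achieved with lower complexity (linear in the number of BS antennas N) using the QR-based decomposition approach shown in [34]. This produces

$$\mathbf{G}_{-j} = \mathbf{Q}_j\mathbf{R}_j = \big[\bar{\mathbf{Q}}_j\ \ \tilde{\mathbf{Q}}_j\big]\begin{bmatrix}\bar{\mathbf{R}}_j\\ \mathbf{0}\end{bmatrix} = \bar{\mathbf{Q}}_j\bar{\mathbf{R}}_j \tag{8}$$

where $\tilde{\mathbf{Q}}_j\in\mathbb{C}^{N\times(N-\tau_j)}$ spans the null space of $\mathbf{G}_{-j}^H$, such that $\mathbf{G}_{-j}^H\tilde{\mathbf{Q}}_j = \mathbf{0}_{\tau_j\times(N-\tau_j)}$. Therefore, we can use $\tilde{\mathbf{Q}}_j$ as the outer layer of $\mathbf{w}_j$, i.e., $\mathbf{F}_j = \tilde{\mathbf{Q}}_j$. Since the QR decomposition of an $m\times n$ matrix can be computed with $2mn^2 - \frac{2}{3}n^3$ flops [33], the total number of flops required to perform the BDZF technique reduces to $2N\sum_{j=1}^{G}(K-K_j)^2 - \frac{2}{3}\sum_{j=1}^{G}(K-K_j)^3$, which increases linearly with N. Plugging $\mathbf{F}_j = \tilde{\mathbf{Q}}_j$ into (2) yields

$$\gamma_{jk} = |\bar{\mathbf{g}}_{jk}^H\mathbf{c}_j|^2 \tag{9}$$

where $\bar{\mathbf{g}}_{jk} = \sigma_{jk}^{-1}\tilde{\mathbf{Q}}_j^H\mathbf{g}_{jk}\in\mathbb{C}^{N-\tau_j}$ denotes the equivalent channel vector of UE k in group j. As $\mathbf{F}_j^H\mathbf{F}_j = \mathbf{I}_{N-\tau_j}$ $\forall j$, the proposed outer layer does not change the transmit power, i.e., $\sum_{j=1}^{G}\|\mathbf{w}_j\|^2 = \sum_{j=1}^{G}\|\mathbf{c}_j\|^2$. Therefore, using the BDZF technique, the QoS problem reduces to QoS$_\text{dec}$.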
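The QR-based outer layer and the equivalent channels of (9) can be sketched as follows. This is a minimal NumPy illustration under our own assumptions: G ≥ 2, channels stored per group as N × K_j arrays, and NumPy's Householder QR `numpy.linalg.qr` with `mode="complete"` standing in for the approach of [34]. The names `outer_layer` and `equivalent_channels` are hypothetical. The trailing $N-\tau_j$ columns of the complete Q factor play the role of $\tilde{\mathbf{Q}}_j$ in (8).

```python
import numpy as np

def outer_layer(G_list):
    """BDZF outer layers F_j from a complete QR of G_{-j} (sketch, G >= 2).

    G_list[j]: N x K_j channel matrix of group j.
    Returns [F_1, ..., F_G], with F_j of size N x (N - tau_j).
    """
    F = []
    for j in range(len(G_list)):
        # G_{-j}: channels of all UEs outside group j, size N x tau_j
        G_minus_j = np.hstack([Gi for i, Gi in enumerate(G_list) if i != j])
        tau_j = G_minus_j.shape[1]
        Q, _ = np.linalg.qr(G_minus_j, mode="complete")   # Q is N x N unitary
        F.append(Q[:, tau_j:])                            # null space of G_{-j}^H
    return F

def equivalent_channels(F_j, G_j, sigma_j):
    """Columns gbar_jk = sigma_jk^{-1} F_j^H g_jk, as in (9)."""
    return (F_j.conj().T @ G_j) / sigma_j
```

Taking the trailing columns of Q guarantees $\mathbf{G}_{-j}^H\mathbf{F}_j = \mathbf{0}$ without inspecting R, because the corresponding rows of the triangular factor are zero.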
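Note that we define QoS$_\text{dec}$ as the transformation of the QoS problem into G single-group multicasting QoS problems, $\{\mathcal{Q}_j(\boldsymbol{\eta}_j)\}_{j=1}^{G}$, where the j-th problem is given by

$$\mathcal{Q}_j(\boldsymbol{\eta}_j):\ \min_{\mathbf{c}_j}\ \|\mathbf{c}_j\|^2 \tag{10}$$
$$\text{s.t.}\ \ |\bar{\mathbf{g}}_{jk}^H\mathbf{c}_j|^2\ge\eta_{jk}\quad\forall k \tag{11}$$

where $\boldsymbol{\eta}_j\in\mathbb{R}_+^{K_j}$ is the vector collecting all the quantities $\{\eta_{jk}\}$ in group j. To make the dependence of QoS$_\text{dec}$ on the prescribed SINRs $\{\boldsymbol{\eta}_j\}_{j=1}^{G}$ explicit, we denote an instance of the QoS$_\text{dec}$ problem by $\bar{\mathcal{Q}}(\boldsymbol{\eta}) = \{\mathcal{Q}_j(\boldsymbol{\eta}_j)\}_{j=1}^{G}$. Accordingly, using the BDZF technique, the MMF problem reduces to MMF$_\text{dec}$, an instance of which is denoted by $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ and given by

$$\bar{\mathcal{F}}(\boldsymbol{\eta},P):\ \max_{\{\mathbf{c}_j\}}\ \min_j\min_k\ \frac{|\bar{\mathbf{g}}_{jk}^H\mathbf{c}_j|^2}{\eta_{jk}} \tag{12}$$
$$\text{s.t.}\ \ \sum_{j=1}^{G}\|\mathbf{c}_j\|^2\le P. \tag{13}$$

As mentioned in the introduction, finding a feasible point for $\mathcal{Q}(\boldsymbol{\eta})$ is hard [4]. The same holds for $\mathcal{F}(\boldsymbol{\eta},P)$, since the common approach to solve $\mathcal{F}(\boldsymbol{\eta},P)$ relies on iteratively solving $\mathcal{Q}(\boldsymbol{\eta})$. On the contrary, finding a feasible point for $\bar{\mathcal{Q}}(\boldsymbol{\eta}) = \{\mathcal{Q}_j(\boldsymbol{\eta}_j)\}$ and, thus, for $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ is a straightforward task, thanks to the single-group nature of each subproblem $\mathcal{Q}_j(\boldsymbol{\eta}_j)$. More specifically, for any given group j an initial feasible point can be computed by first choosing an arbitrary beamforming vector $\mathbf{c}_j$ and then rescaling it so as to meet the most violated SINR constraint with equality. Despite being simpler to solve than $\mathcal{Q}(\boldsymbol{\eta})$ and $\mathcal{F}(\boldsymbol{\eta},P)$, both $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ and $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ are still NP-hard, as easily follows by observing that for G = 1 they reduce to the single-group problems studied in [3]. The main difficulty in solving $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ and $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ lies in the non-convexity of the SINR constraints. (For reference, computing the SVD of an $m\times n$ matrix, $m\ge n$, together with its full set of left singular vectors requires on the order of $m^2n$ flops [33].)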
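The rescaling that yields an initial feasible point of $\mathcal{Q}_j(\boldsymbol{\eta}_j)$ takes only a few lines. In this sketch the helper name `feasible_point` is hypothetical, the columns of `Gbar_j` are assumed to be the equivalent channels $\bar{\mathbf{g}}_{jk}$, and the starting vector must not be orthogonal to any of them.

```python
import numpy as np

def feasible_point(Gbar_j, eta_j, c0):
    """Rescale an arbitrary c0 so that the most violated constraint (11),
    |gbar_jk^H c|^2 >= eta_jk, holds with equality (hypothetical helper)."""
    snr = np.abs(Gbar_j.conj().T @ c0) ** 2        # |gbar_jk^H c0|^2 for all k
    if np.any(snr == 0):
        raise ValueError("c0 is orthogonal to an equivalent channel")
    # the worst ratio eta_jk / snr_k sets the (squared) scaling factor
    return np.sqrt(np.max(eta_j / snr)) * c0
```

After rescaling, the smallest ratio $|\bar{\mathbf{g}}_{jk}^H\mathbf{c}_j|^2/\eta_{jk}$ equals one and all others are larger, so every constraint in (11) holds.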
In Section III-C, the SCA technique is used to develop a possible solution capable of overcoming this issue. Before delving into this, we further detail the characteristics of $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ and $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ and establish a duality and direct relations between the two problems: the solution of $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ can be obtained from that of $\bar{\mathcal{Q}}(\boldsymbol{\eta})$, and vice versa. These results will be used in Section III-C and Section IV to compute an approximate solution to $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ by means of $\bar{\mathcal{Q}}(\boldsymbol{\eta})$, without the need to iteratively solve instances of $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ as in existing alternatives [4]–[6].

B. On the duality between $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ and $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$

Let $\{\mathbf{c}_j^\star(\boldsymbol{\eta})\}$ and $P^\star(\boldsymbol{\eta})$ denote the set of optimal precoding vectors and the optimal objective value of $\bar{\mathcal{Q}}(\boldsymbol{\eta})$, respectively. Similarly, let $\{\mathbf{c}_j^\circ(\boldsymbol{\eta},P)\}$ and $t^\circ(\boldsymbol{\eta},P)$ denote the set of optimal precoding vectors and the optimal objective value of $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$. We start by providing the following result:

Lemma 1.

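For $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ and $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ we have:

$$\mathbf{c}_j^\star(\alpha\boldsymbol{\eta}) = \mathbf{c}_j^\circ\big(\boldsymbol{\eta},P^\star(\alpha\boldsymbol{\eta})\big)\quad\forall j \tag{14}$$

with $\alpha = t^\circ(\boldsymbol{\eta},P^\star(\alpha\boldsymbol{\eta}))$. Also, we have that

$$\mathbf{c}_j^\circ(\boldsymbol{\eta},P) = \mathbf{c}_j^\star\big(t^\circ(\boldsymbol{\eta},P)\,\boldsymbol{\eta}\big)\quad\forall j \tag{15}$$

with $P = P^\star(t^\circ(\boldsymbol{\eta},P)\,\boldsymbol{\eta})$.

Proof. The proof proceeds by contradiction. First, notice that, by definition, $\{\mathbf{c}_j^\star(\alpha\boldsymbol{\eta})\}$ is a feasible solution of $\bar{\mathcal{F}}(\boldsymbol{\eta},P^\star(\alpha\boldsymbol{\eta}))$ with an objective value equal to α. Now, let us assume there exists a set of precoding vectors $\{\mathbf{c}_j^\circ(\boldsymbol{\eta},P^\star(\alpha\boldsymbol{\eta}))\}$ for which $t^\circ(\boldsymbol{\eta},P^\star(\alpha\boldsymbol{\eta}))>\alpha$. Clearly, $\{\mathbf{c}_j^\circ(\boldsymbol{\eta},P^\star(\alpha\boldsymbol{\eta}))\}$ is also a feasible solution of $\bar{\mathcal{Q}}(\alpha\boldsymbol{\eta})$ for which all the SINR constraints are strictly satisfied. Hence, there exists a constant $\nu<1$ such that $\{\nu\,\mathbf{c}_j^\circ(\boldsymbol{\eta},P^\star(\alpha\boldsymbol{\eta}))\}$ still meets all the SINR constraints of $\bar{\mathcal{Q}}(\alpha\boldsymbol{\eta})$ (the tightest one with equality) while providing a smaller objective value than $P^\star(\alpha\boldsymbol{\eta})$. This, however, contradicts the optimality of $P^\star(\alpha\boldsymbol{\eta})$ and proves that (14) is valid with $\alpha = t^\circ(\boldsymbol{\eta},P^\star(\alpha\boldsymbol{\eta}))$. A similar line of reasoning can be used to prove (15). By definition, the set of precoding vectors $\{\mathbf{c}_j^\circ(\boldsymbol{\eta},P)\}$ is a feasible solution of $\bar{\mathcal{Q}}(t^\circ(\boldsymbol{\eta},P)\,\boldsymbol{\eta})$ with an objective value equal to P. Let us assume there exists $\{\mathbf{c}_j^\star(t^\circ(\boldsymbol{\eta},P)\,\boldsymbol{\eta})\}$ with $P^\star(t^\circ(\boldsymbol{\eta},P)\,\boldsymbol{\eta}) < P$. Then, one can use the remaining power, $P - P^\star(t^\circ(\boldsymbol{\eta},P)\,\boldsymbol{\eta})$, to rescale $\{\mathbf{c}_j^\star(t^\circ(\boldsymbol{\eta},P)\,\boldsymbol{\eta})\}$ and improve the objective of $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ beyond $t^\circ(\boldsymbol{\eta},P)$. This contradicts the optimality of $t^\circ(\boldsymbol{\eta},P)$ and completes the proof.

Also, the following lemma can be easily proved from the definition of $\bar{\mathcal{Q}}(\boldsymbol{\eta})$:

Lemma 2.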

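For $\bar{\mathcal{Q}}(\alpha\boldsymbol{\eta})$ with $\alpha\in\mathbb{R}_+$ we have

$$P^\star(\alpha\boldsymbol{\eta}) = \alpha P^\star(\boldsymbol{\eta}) \tag{16}$$

and

$$\mathbf{c}_j^\star(\alpha\boldsymbol{\eta}) = \sqrt{\alpha}\,\mathbf{c}_j^\star(\boldsymbol{\eta})\quad\forall j. \tag{17}$$

Both identities follow because $\mathbf{c}_j$ satisfies (11) for $\boldsymbol{\eta}_j$ if and only if $\sqrt{\alpha}\,\mathbf{c}_j$ satisfies it for $\alpha\boldsymbol{\eta}_j$, while the objective (10) scales by α. We are now ready to state the following explicit duality between $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ and $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$:

Theorem 1.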

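Given the set of optimal precoding vectors and the optimal objective value of $\bar{\mathcal{Q}}(\boldsymbol{\eta})$, i.e., $\{\mathbf{c}_j^\star(\boldsymbol{\eta})\}$ and $P^\star(\boldsymbol{\eta})$, the set of optimal precoding vectors and the optimal objective value of $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$, i.e., $\{\mathbf{c}_j^\circ(\boldsymbol{\eta},P)\}$ and $t^\circ(\boldsymbol{\eta},P)$, are determined as

$$\mathbf{c}_j^\circ(\boldsymbol{\eta},P) = \sqrt{\frac{P}{P^\star(\boldsymbol{\eta})}}\,\mathbf{c}_j^\star(\boldsymbol{\eta})\quad\forall j \tag{18}$$
$$t^\circ(\boldsymbol{\eta},P) = \frac{P}{P^\star(\boldsymbol{\eta})} \tag{19}$$

and vice versa as

$$\mathbf{c}_j^\star(\boldsymbol{\eta}) = \frac{1}{\sqrt{t^\circ(\boldsymbol{\eta},P)}}\,\mathbf{c}_j^\circ(\boldsymbol{\eta},P)\quad\forall j \tag{20}$$
$$P^\star(\boldsymbol{\eta}) = \frac{P}{t^\circ(\boldsymbol{\eta},P)}. \tag{21}$$

Proof.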

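Starting with (18), we have that

$$\sqrt{\frac{P}{P^\star(\boldsymbol{\eta})}}\,\mathbf{c}_j^\star(\boldsymbol{\eta}) \overset{(a)}{=} \mathbf{c}_j^\star\Big(\frac{P}{P^\star(\boldsymbol{\eta})}\boldsymbol{\eta}\Big) \overset{(b)}{=} \mathbf{c}_j^\circ\Big(\boldsymbol{\eta},P^\star\Big(\frac{P}{P^\star(\boldsymbol{\eta})}\boldsymbol{\eta}\Big)\Big) \tag{22}$$
$$\overset{(c)}{=} \mathbf{c}_j^\circ\Big(\boldsymbol{\eta},\frac{P}{P^\star(\boldsymbol{\eta})}P^\star(\boldsymbol{\eta})\Big) = \mathbf{c}_j^\circ(\boldsymbol{\eta},P) \tag{23}$$

where (a) follows from (17), (b) holds because of (14), and (c) is obtained using (16). The equality (19) follows from

$$P \overset{(a)}{=} P^\star\big(t^\circ(\boldsymbol{\eta},P)\,\boldsymbol{\eta}\big) \overset{(b)}{=} t^\circ(\boldsymbol{\eta},P)\,P^\star(\boldsymbol{\eta}) \tag{24}$$

where (a) exploits $P = P^\star(t^\circ(\boldsymbol{\eta},P)\,\boldsymbol{\eta})$ (see Lemma 1), and (b) is due to (16). The equality in (20) follows from replacing (19) in (18), and (21) is a rearrangement of (19).

Theorem 1 reveals the relation between the optimal precoding vectors and the optimal objective values of $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ and $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$. However, as both problems are NP-hard, an algorithm with polynomial complexity can in general only provide an approximate set of precoding vectors rather than the optimal one. Hence, it is of interest to establish a relation between the precoding vectors and the objective values of $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ and $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ when they are obtained from an arbitrary suboptimal algorithm. This relation is given in Propositions 1 and 2.

Proposition 1.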

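Assume $\{\mathbf{c}_{j,\text{app}}^\star(\boldsymbol{\eta})\}$ is a set of precoding vectors feasible for $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ and $P_\text{app}^\star(\boldsymbol{\eta})$ is its associated objective value, achieved by an arbitrary algorithm. Then, the set of precoding vectors $\{\sqrt{P/P_\text{app}^\star(\boldsymbol{\eta})}\,\mathbf{c}_{j,\text{app}}^\star(\boldsymbol{\eta})\}$ (or $\{\sqrt{P/P_\text{app}^\star(\boldsymbol{\eta})}\,\mathbf{F}_j\mathbf{c}_{j,\text{app}}^\star(\boldsymbol{\eta})\}$) is a feasible solution of $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ (or $\mathcal{F}(\boldsymbol{\eta},P)$) and provides an objective value $t_\text{app}(\boldsymbol{\eta},P)$ such that $t_\text{app}(\boldsymbol{\eta},P)\in\big[\frac{P}{P_\text{app}^\star(\boldsymbol{\eta})},\frac{P}{P^\star(\boldsymbol{\eta})}\big]$.

Proof.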

Please refer to Appendix A.
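Since Theorem 1 and Proposition 1 reduce the passage between the two problems to a single rescaling, it can be illustrated with a short sketch. The helpers below (`mmf_objective`, `qos_to_mmf`, `mmf_to_qos`) are hypothetical names operating on the inner-layer vectors $\mathbf{c}_j$; `qos_to_mmf` implements the scaling of Proposition 1, and `mmf_to_qos` the reverse scaling by $1/\sqrt{t}$ appearing in (20).

```python
import numpy as np

def mmf_objective(Gbar, eta, C):
    """Objective of (12): min over groups j and UEs k of |gbar_jk^H c_j|^2 / eta_jk."""
    return min(float(np.min(np.abs(Gb.conj().T @ c) ** 2 / e))
               for Gb, e, c in zip(Gbar, eta, C))

def qos_to_mmf(C, P_app, P):
    """Scale a QoS-feasible solution of total power P_app to power P.
    Returns the new precoders and the guaranteed lower bound P / P_app."""
    s = np.sqrt(P / P_app)
    return [s * c for c in C], P / P_app

def mmf_to_qos(C, t_app):
    """Divide by sqrt(t_app) so that every target eta_jk is met."""
    return [c / np.sqrt(t_app) for c in C]
```

For example, two single-UE groups that meet their targets with equality at total power 2 reach an MMF objective of 4 once rescaled to P = 8, i.e., exactly the lower bound $P/P^\star_\text{app}$.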

Proposition 2.

Assume $\{\mathbf{c}_{j,\text{app}}^\circ(\boldsymbol{\eta},P)\}$ is a set of precoding vectors feasible for $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ and $t_\text{app}(\boldsymbol{\eta},P)$ is its associated objective value, achieved by an arbitrary algorithm. Then, the set of precoding vectors $\{\mathbf{c}_{j,\text{app}}^\circ(\boldsymbol{\eta},P)/\sqrt{t_\text{app}(\boldsymbol{\eta},P)}\}$ (or $\{\mathbf{F}_j\mathbf{c}_{j,\text{app}}^\circ(\boldsymbol{\eta},P)/\sqrt{t_\text{app}(\boldsymbol{\eta},P)}\}$) is a feasible solution of $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ (or $\mathcal{Q}(\boldsymbol{\eta})$) and provides an objective value $P_\text{app}^\star(\boldsymbol{\eta})$ such that $P_\text{app}^\star(\boldsymbol{\eta})\in\big[\frac{P}{t^\circ(\boldsymbol{\eta},P)},\frac{P}{t_\text{app}(\boldsymbol{\eta},P)}\big]$.

Proof. Please refer to Appendix B.

Note that the relation between the QoS and MMF problems was first discovered in [4], but it was not given in an explicit form. Therefore, existing works in the literature, such as [4]–[6], [11], solve the MMF problem by iteratively solving specific instances of the QoS problem. By virtue of the large number of antennas available in large-scale antenna systems and the BDZF technique, Theorem 1, Proposition 1, and Proposition 2 state that $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ and $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ (and, through the outer layer, $\mathcal{F}(\boldsymbol{\eta},P)$ and $\mathcal{Q}(\boldsymbol{\eta})$) can be solved simultaneously. It is also interesting to observe that the upper bound on the objective value of $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ given by Proposition 1 is equal to (19), and the lower bound on the objective value of $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ given by Proposition 2 is equal to (21).

C. Inner Layer – Successive Convex Approximation

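In the sequel, the SCA technique is applied to solve $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ and $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$. We begin with $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ and rewrite $|\bar{\mathbf{g}}_{jk}^H\mathbf{c}_j|^2$ as

$$|\bar{\mathbf{g}}_{jk}^H\mathbf{c}_j|^2 = \mathbf{c}_j^H\mathbf{X}_{jk}\mathbf{c}_j \tag{25}$$

where $\mathbf{X}_{jk} = \bar{\mathbf{g}}_{jk}\bar{\mathbf{g}}_{jk}^H$ is a rank-one positive semidefinite matrix. Thus, for any arbitrary vector $\mathbf{z}_j\in\mathbb{C}^{N-\tau_j}$ we have that $(\mathbf{c}_j-\mathbf{z}_j)^H\mathbf{X}_{jk}(\mathbf{c}_j-\mathbf{z}_j)\ge 0$, from which it follows that

$$\mathbf{c}_j^H\mathbf{X}_{jk}\mathbf{c}_j \ge 2\,\mathrm{Re}\big(\mathbf{z}_j^H\mathbf{X}_{jk}\mathbf{c}_j\big) - \mathbf{z}_j^H\mathbf{X}_{jk}\mathbf{z}_j. \tag{26}$$

Now, for any $\mathbf{z}_j$, the non-convex SINR constraint $\mathbf{c}_j^H\mathbf{X}_{jk}\mathbf{c}_j\ge\eta_{jk}$ can be replaced with a more restrictive convex (in fact, affine) constraint given by

$$2\,\mathrm{Re}\big(\mathbf{z}_j^H\mathbf{X}_{jk}\mathbf{c}_j\big) - \mathbf{z}_j^H\mathbf{X}_{jk}\mathbf{z}_j \ge \eta_{jk}. \tag{27}$$

By replacing (11) with (27), we obtain

$$\tilde{\mathcal{Q}}_j(\boldsymbol{\eta}_j,\mathbf{z}_j):\ \min_{\mathbf{c}_j}\ \|\mathbf{c}_j\|^2 \tag{28}$$
$$\text{s.t.}\ \ 2\,\mathrm{Re}\big(\mathbf{z}_j^H\mathbf{X}_{jk}\mathbf{c}_j\big) - \mathbf{z}_j^H\mathbf{X}_{jk}\mathbf{z}_j \ge \eta_{jk}\quad\forall k \tag{29}$$

which represents a convex approximation of $\mathcal{Q}_j(\boldsymbol{\eta}_j)$ for a specific instance of $\mathbf{z}_j$. Since (26) holds with equality at $\mathbf{c}_j = \mathbf{z}_j$, any $\mathbf{z}_j$ feasible for (11) is also feasible for (29). We can now introduce Algorithm 1 and the subsequent proposition for the QoS problem.

Algorithm 1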

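The QoS BDZF-SCA Algorithm
1: Compute $\mathbf{F}_j$ $\forall j$.
2: for j = 1 to G do
3:   Select an arbitrary $\mathbf{z}_j^{(1)}$ and rescale it such that $\mathbf{z}_j^{(1)H}\mathbf{X}_{jk}\mathbf{z}_j^{(1)}\ge\eta_{jk}$ $\forall k$; set i = 1.
4:   repeat
5:     Solve $\tilde{\mathcal{Q}}_j(\boldsymbol{\eta}_j,\mathbf{z}_j^{(i)})$: $\min_{\mathbf{c}_j^{(i)}}\|\mathbf{c}_j^{(i)}\|^2$ s.t. $2\,\mathrm{Re}(\mathbf{z}_j^{(i)H}\mathbf{X}_{jk}\mathbf{c}_j^{(i)}) - \mathbf{z}_j^{(i)H}\mathbf{X}_{jk}\mathbf{z}_j^{(i)}\ge\eta_{jk}$ $\forall k$.
6:     Let $\mathbf{c}_j^{(i)}$ denote the optimal solution of $\tilde{\mathcal{Q}}_j(\boldsymbol{\eta}_j,\mathbf{z}_j^{(i)})$; set $\mathbf{z}_j^{(i+1)}\leftarrow\mathbf{c}_j^{(i)}$ and $i\leftarrow i+1$.
7:   until convergence
8: end for
9: Compute the precoding vectors $\mathbf{w}_j = \mathbf{F}_j\mathbf{c}_j$ $\forall j$.

Proposition 3.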

Algorithm 1 converges to a point satisfying the KKT conditions of $\bar{\mathcal{Q}}(\boldsymbol{\eta})$, while providing a feasible solution for $\mathcal{Q}(\boldsymbol{\eta})$.

Proof. Please refer to Appendix C.

Now let us consider $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$ and $\mathcal{F}(\boldsymbol{\eta},P)$. A solution to these two problems can be obtained by first applying Algorithm 1 and then using Proposition 1. Alternatively, we can directly apply the SCA technique to $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$, similarly to Algorithm 1, and find a solution to both problems. The latter approach is presented in Algorithm 2, for which the following proposition holds.

Proposition 4.

Algorithm 2 converges to a KKT point of $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$, while providing a feasible solution to $\mathcal{F}(\boldsymbol{\eta},P)$.

Proof. The proof follows the same lines as the proof of Proposition 3.
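Lines 3 to 7 of Algorithm 1, for a single group, can be sketched as follows. The paper relies on a generic convex solver for (28)–(29); here, as an assumption of ours, the subproblem is solved with Hildreth's dual coordinate-ascent method, which applies because (29) is affine in $\mathbf{c}_j$, so (28)–(29) asks for the minimum-norm point of a polyhedron. The function names, the number of sweeps, and the fixed number of SCA iterations (used in place of a convergence test) are illustrative choices.

```python
import numpy as np

def solve_linearized_qos(Gbar_j, eta_j, z, sweeps=1000):
    """Approximately solve (28)-(29) around z with Hildreth's method (our choice)."""
    n = Gbar_j.shape[0]
    proj = Gbar_j.conj().T @ z                  # gbar_jk^H z for every k
    A = Gbar_j * proj                           # column k: X_jk z = gbar_jk (gbar_jk^H z)
    # Re(a^H c) = [Re a; Im a]^T [Re c; Im c], so (29) reads r_k^T x >= s_k
    R = 2.0 * np.vstack([A.real, A.imag])       # 2n x K_j, columns r_k
    s = eta_j + np.abs(proj) ** 2               # eta_jk + z^H X_jk z
    sq = np.sum(R ** 2, axis=0)                 # ||r_k||^2
    lam = np.zeros(R.shape[1])                  # dual variables, one per UE
    x = np.zeros(2 * n)                         # primal point, kept equal to R @ lam
    for _ in range(sweeps):
        for k in range(R.shape[1]):
            # exact maximization of the dual over lam_k >= 0
            new = max(0.0, lam[k] + (s[k] - R[:, k] @ x) / sq[k])
            x += (new - lam[k]) * R[:, k]
            lam[k] = new
    return x[:n] + 1j * x[n:]

def qos_sca(Gbar_j, eta_j, z, n_iter=20):
    """Algorithm 1, lines 3-7, for one group j (sketch)."""
    snr = np.abs(Gbar_j.conj().T @ z) ** 2
    z = np.sqrt(np.max(eta_j / snr)) * z        # line 3: feasible z^(1)
    for _ in range(n_iter):                     # fixed count replaces "until convergence"
        z = solve_linearized_qos(Gbar_j, eta_j, z)   # lines 5-6
    return z
```

Because each iterate is feasible for the next convex approximation, the transmit power is non-increasing across SCA iterations, up to the accuracy of the inner solver.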

Algorithm 2

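The MMF BDZF-SCA Algorithm
1: Compute $\mathbf{F}_j$ $\forall j$.
2: Select an arbitrary set $\mathbf{z}^{(1)} := \{\mathbf{z}_j^{(1)}\}_{j=1}^{G}$ such that $\sum_{j=1}^{G}\|\mathbf{z}_j^{(1)}\|^2\le P$; set i = 1.
3: repeat
4:   Solve $\tilde{\mathcal{F}}(\boldsymbol{\eta},P,\mathbf{z}^{(i)})$: $\max_{\{\mathbf{c}_j^{(i)}\}}\min_j\min_k\frac{1}{\eta_{jk}}\big[2\,\mathrm{Re}(\mathbf{z}_j^{(i)H}\mathbf{X}_{jk}\mathbf{c}_j^{(i)}) - \mathbf{z}_j^{(i)H}\mathbf{X}_{jk}\mathbf{z}_j^{(i)}\big]$ s.t. $\sum_{j=1}^{G}\|\mathbf{c}_j^{(i)}\|^2\le P$.
5:   Let $\{\mathbf{c}_j^{(i)}\}_{j=1}^{G}$ denote the optimal solution of $\tilde{\mathcal{F}}(\boldsymbol{\eta},P,\mathbf{z}^{(i)})$; $\forall j$ set $\mathbf{z}_j^{(i+1)}\leftarrow\mathbf{c}_j^{(i)}$, and set $i\leftarrow i+1$.
6: until convergence
7: Generate the precoding vectors $\mathbf{w}_j = \mathbf{F}_j\mathbf{c}_j$ $\forall j$.

D. Computational Complexity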

The computational load of Algorithms 1 and 2 is now assessed in terms of the number of required flops. Both algorithms consist of three steps. The first step computes $\{\mathbf{F}_j\ \forall j\}$ and requires $2N\sum_{j=1}^{G}(K-K_j)^2 - \frac{2}{3}\sum_{j=1}^{G}(K-K_j)^3$ flops using the QR decomposition [33], [34]. The second step designs the inner-layer precoding vectors $\{\mathbf{c}_j\ \forall j\}$, as detailed in lines 2–8 (2–6) of Algorithm 1 (Algorithm 2). Since $\tilde{\mathcal{Q}}_j(\boldsymbol{\eta}_j,\mathbf{z}_j)$ and $\tilde{\mathcal{F}}(\boldsymbol{\eta},P,\mathbf{z})$ are both convex, they can be solved at each iteration using standard techniques with a worst-case complexity of $\mathcal{O}(N^{3.5})$ [27]. Therefore, the number of flops required by the second step is $\mathcal{O}(IN^{3.5})$, with I being the number of iterations required to converge. As will be observed in Section V, only a few iterations are needed to reach a satisfactory solution, even for large N. The third step calculates the composite precoding vectors $\mathbf{w}_j = \mathbf{F}_j\mathbf{c}_j$ and requires $\sum_{j=1}^{G}2N(N-\tau_j) = 2GN^2 - 2(G-1)KN$ flops. In large-scale antenna array systems, i.e., where $N\gg K$, the overall complexity of the proposed algorithms is dominated by the second step and is of order $\mathcal{O}(N^{3.5})$. Taking into account that the complexity of SDR-based techniques is greater than $\mathcal{O}(N^6)$ [4], a reduction by a factor of about $N^{2.5}$ is achieved through Algorithms 1 and 2.

IV. A HEURISTIC INNER LAYER OF ORDER $\mathcal{O}(N)$

In the previous section, it was shown that the complexity of the proposed algorithms is of order $\mathcal{O}(N^{3.5})$, which is due to the application of the SCA technique to find the inner-layer precoding vectors $\{\mathbf{c}_j\}_{j=1}^{G}$. Therefore, retrieving the inner layer may still be computationally expensive when N is relatively large. Moreover, it requires optimization packages for solving the convex problems $\tilde{\mathcal{Q}}_j(\boldsymbol{\eta}_j,\mathbf{z}_j)$ and $\tilde{\mathcal{F}}(\boldsymbol{\eta},P,\mathbf{z})$, which may not be available on every hardware platform. Therefore, in what follows, we present a simple yet effective heuristic algorithm for computing the inner-layer precoding vectors of $\bar{\mathcal{Q}}(\boldsymbol{\eta})$ with a complexity of $\mathcal{O}(N)$.
Then, by employing Proposition 1 and the solution obtained for $\bar{\mathcal{Q}}(\boldsymbol{\eta})$, we compute an approximate solution for $\bar{\mathcal{F}}(\boldsymbol{\eta},P)$. Therefore, the complexity of simultaneously finding an inner-layer precoder for both problems becomes $\mathcal{O}(N)$.

The proposed heuristic algorithm computes the precoding vector $\mathbf{c}_j$ $\forall j$ in $K_j$ sequential steps. The algorithm has two main parts: the ordering part and the successive precoder design part. Assuming that the $K_j$ UEs in group j are labeled from 1 to $K_j$, the ordering part re-labels them by a bijective function $f_j:\{1,\dots,K_j\}\to\{\mu_{j1},\dots,\mu_{jK_j}\}$, where $\mu_{jk} = f_j(i)$ means that the UE originally labeled i is now re-labeled as $\mu_{jk}$ and will be served in the k-th step of the algorithm, $k\in\{1,\dots,K_j\}$. Therefore, the new labels $\{\mu_{jk}\}_{k=1}^{K_j}$ determine the order in which the UEs in group j are served. The successive precoder design part designs the precoding vector of group j in $K_j$ steps such that, in the k-th step, the requested SNR of UE $\mu_{jk}$ is met with minimum power consumption while the SNRs of the previous $k-1$ ordered UEs, i.e., $\{\mu_{jt}\}_{t=1}^{k-1}$, are not violated. We detail the successive precoder design and the user ordering in the following two subsections.

A. The Successive Precoder

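Assume that, $\forall j$, the UE ordering is given, i.e., $\{\mu_{jk}\}_{k=1}^{K_j}$ is known. Denote by $\mathbf{c}_j^{(k)}$ the precoding vector $\mathbf{c}_j$ at the k-th step; it is computed as follows:

$$\mathbf{c}_j^{(k)} = \mathbf{c}_j^{(k-1)} + \alpha_j^{(k)}\mathbf{d}_j^{(k)}\quad k\in\{2,\dots,K_j\} \tag{30}$$

where $\mathbf{d}_j^{(k)}\in\mathbb{C}^{N-\tau_j}$ is a unit-norm vector and $\alpha_j^{(k)}\in\mathbb{C}$. In what follows, we explain how $\mathbf{d}_j^{(k)}$ and $\alpha_j^{(k)}$ should be designed such that the SNR constraint of $\mu_{jk}$ is met with minimum power consumption while the SNRs of $\{\mu_{jt}\}_{t=1}^{k-1}$ are not violated.

We start by initializing the precoding vector $\mathbf{c}_j^{(1)}$ of UE $\mu_{j1}$ such that its own SNR constraint, i.e., $|\bar{\mathbf{g}}_{j\mu_{j1}}^H\mathbf{c}_j^{(1)}|^2\ge\eta_{j\mu_{j1}}$, is met with equality. This yields $\mathbf{c}_j^{(1)} = (\sqrt{\eta_{j\mu_{j1}}}/\|\bar{\mathbf{g}}_{j\mu_{j1}}\|^2)\,\bar{\mathbf{g}}_{j\mu_{j1}}$. For $k\in\{2,\dots,K_j\}$, the vectors $\mathbf{d}_j^{(k)}$ must be chosen such that the previously satisfied $k-1$ SNR constraints are not violated. This is achieved by selecting $\mathbf{d}_j^{(k)}$ orthogonal to $\{\bar{\mathbf{g}}_{j\mu_{ji}}\}_{i=1}^{k-1}$, i.e., $\bar{\mathbf{g}}_{j\mu_{ji}}^H\mathbf{d}_j^{(k)} = 0$ for $i = 1,\dots,k-1$. To this end, $\{\mathbf{d}_j^{(k)}\}_{k=2}^{K_j}$ are computed using the Gram–Schmidt procedure, which produces $\mathbf{d}_j^{(k)} = \mathbf{u}_j^{(k)}/\|\mathbf{u}_j^{(k)}\|$ with $\mathbf{u}_j^{(1)} = \bar{\mathbf{g}}_{j\mu_{j1}}$ and

$$\mathbf{u}_j^{(k)} = \bar{\mathbf{g}}_{j\mu_{jk}} - \sum_{i=1}^{k-1}\frac{\mathbf{u}_j^{(i)H}\bar{\mathbf{g}}_{j\mu_{jk}}}{\|\mathbf{u}_j^{(i)}\|^2}\,\mathbf{u}_j^{(i)} \tag{31}$$

being the component of $\bar{\mathbf{g}}_{j\mu_{jk}}$ orthogonal to the subspace spanned by $\{\mathbf{u}_j^{(i)}\}_{i=1}^{k-1}$. Once the unit-norm vectors $\mathbf{d}_j^{(k)}$ are computed, we proceed with the design of the coefficients $\{\alpha_j^{(k)}\}_{k=2}^{K_j}$. Since $\mathbf{c}_j^{(k-1)}$ lies in the span of $\{\bar{\mathbf{g}}_{j\mu_{ji}}\}_{i=1}^{k-1}$, to which $\mathbf{d}_j^{(k)}$ is orthogonal, the power consumption in step k is $\|\mathbf{c}_j^{(k)}\|^2 = \|\mathbf{c}_j^{(k-1)}\|^2 + |\alpha_j^{(k)}|^2$. Each $\alpha_j^{(k)}$ is thus chosen to minimize this quantity while satisfying the k-th SNR constraint; more precisely, $\alpha_j^{(k)}$ is computed as the solution of the following problem:

$$\min_{\alpha_j^{(k)}}\ |\alpha_j^{(k)}|^2\quad\text{s.t.}\ \ |\bar{\mathbf{g}}_{j\mu_{jk}}^H\mathbf{c}_j^{(k)}|^2\ge\eta_{j\mu_{jk}}. \tag{32}$$

As shown in Appendix D (see also [12]), the optimal $\alpha_j^{(k)} = |\alpha_j^{(k)}|\exp(\mathrm{i}\angle\alpha_j^{(k)})$ is computed as

$$\angle\alpha_j^{(k)} = -\angle\rho_j^{(k)} \tag{33}$$
$$|\alpha_j^{(k)}| = \frac{-|\rho_j^{(k)}| + \sqrt{|\rho_j^{(k)}|^2 - |\bar{\mathbf{g}}_{j\mu_{jk}}^H\mathbf{d}_j^{(k)}|^2\big(|\bar{\mathbf{g}}_{j\mu_{jk}}^H\mathbf{c}_j^{(k-1)}|^2 - \eta_{j\mu_{jk}}\big)}}{|\bar{\mathbf{g}}_{j\mu_{jk}}^H\mathbf{d}_j^{(k)}|^2} \tag{34}$$

with $\rho_j^{(k)} = \bar{\mathbf{g}}_{j\mu_{jk}}^H\mathbf{d}_j^{(k)}\,\mathbf{c}_j^{(k-1)H}\bar{\mathbf{g}}_{j\mu_{jk}}$. This choice aligns the phase of $\alpha_j^{(k)}\bar{\mathbf{g}}_{j\mu_{jk}}^H\mathbf{d}_j^{(k)}$ with that of $\bar{\mathbf{g}}_{j\mu_{jk}}^H\mathbf{c}_j^{(k-1)}$, so that (34) simplifies to $|\alpha_j^{(k)}| = (\sqrt{\eta_{j\mu_{jk}}} - |\bar{\mathbf{g}}_{j\mu_{jk}}^H\mathbf{c}_j^{(k-1)}|)/|\bar{\mathbf{g}}_{j\mu_{jk}}^H\mathbf{d}_j^{(k)}|$; if the constraint is already met by $\mathbf{c}_j^{(k-1)}$, then $\alpha_j^{(k)} = 0$. In the sequel, the above results are used to sort the UEs according to a worst-first policy, which the numerical results in Section V show to achieve close-to-optimal performance.

B. User Ordering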

At this stage, we are only left with the computation of the UE ordering indices $\{\mu_{jk}\}$. A possible solution is illustrated in [12] for the QoS problem in single-group multicasting systems. More specifically, denote by $\mathcal{S}_j^{(k-1)} = \{\mu_{j1},\ldots,\mu_{j(k-1)}\}$ the set of indices of the UEs ordered up to step $k-1$, and by $\mathcal{R}_j^{(k-1)}$ the set of indices of the remaining ones, i.e., $\mathcal{R}_j^{(k-1)} = \{1,\ldots,K_j\} \setminus \{f_j^{-1}(\mu_{jt})\}_{t=1}^{k-1}$. Then, in [12] the set $\mathcal{S}_j^{(k)}$ is computed as $\mathcal{S}_j^{(k)} = \mathcal{S}_j^{(k-1)} \cup \{\mu_{jk}\}$ with

$$\mu_{jk} = \arg\min_{i \in \mathcal{R}_j^{(k-1)}} \frac{|\mathbf{g}_{ji}^H\mathbf{c}_j^{(k-1)}|^2}{\eta_{ji}} \tag{35}$$

corresponding to the UE index in $\mathcal{R}_j^{(k-1)}$ with the weakest ratio (i.e., the most violated constraint) between the SNR provided up to step $k-1$, given by $|\mathbf{g}_{ji}^H\mathbf{c}_j^{(k-1)}|^2$, and the requested one, given by $\eta_{ji}$. The above procedure has two drawbacks. First, at each stage $k$ it needs to calculate the quantities $|\mathbf{g}_{ji}^H\mathbf{c}_j^{(k-1)}|^2$ for all $i \in \mathcal{R}_j^{(k-1)}$, which requires $\mathcal{O}(K_j N)$ flops per stage, i.e., $\mathcal{O}(K_j^2 N)$ flops in total for group $j$. This is costly if $N$ and $K_j$ are large. Second, it does not take into account the extra amount of power $|\alpha_j^{(k)}|^2$ required at stage $k$ to meet the SNR constraint of the selected UE. To see how this comes about, consider a generic UE $i \in \mathcal{R}_j^{(k-1)}$ such that at stage $k$ the ratio $|\mathbf{g}_{ji}^H\mathbf{c}_j^{(k-1)}|^2/\eta_{ji}$ takes a very high value. This might happen, for example, because its own channel vector $\mathbf{g}_{ji}$ is almost collinear with the channel vectors of the UEs selected in the previous $k-1$ stages. According to (35), such a UE will be selected at the very end of the procedure. This, however, would result in a huge power consumption, because at that point the Gram–Schmidt procedure has only a restricted number of degrees of freedom to build a direction that is orthogonal to the channels of all previously served UEs and, at the same time, meets the requested SNR.
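To make the successive design of Section IV-A and the best-first rule (35) concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a single group whose channels are already the effective (post-outer-layer) vectors, at most as many UEs as antennas, and noise-normalized SNR targets. All function names (`best_first_precoder`, `successive_step`, `orthogonal_direction`, `random_channel`) are hypothetical, and the tie-break at the first step (where $\mathbf{c}_j^{(0)} = \mathbf{0}$ makes every ratio in (35) zero) is our own assumption rather than something specified in [12].

```python
import cmath
import math
import random


def inner(x, y):
    """Return x^H y for two complex vectors stored as Python lists."""
    return sum(a.conjugate() * b for a, b in zip(x, y))


def norm2(x):
    """Return the squared Euclidean norm ||x||^2."""
    return sum(abs(a) ** 2 for a in x)


def random_channel(n, rng):
    """Draw h ~ CN(0, I_n): real and imaginary parts each have variance 1/2."""
    s = math.sqrt(0.5)
    return [complex(rng.gauss(0.0, s), rng.gauss(0.0, s)) for _ in range(n)]


def orthogonal_direction(g, basis):
    """Gram-Schmidt step (31): unit-norm part of g orthogonal to `basis`.

    `basis` holds the unit-norm directions already used; the modified
    Gram-Schmidt variant is used for numerical stability.
    """
    u = list(g)
    for q in basis:
        p = inner(q, u)
        u = [ui - p * qi for ui, qi in zip(u, q)]
    n = math.sqrt(norm2(u))  # assumes g is not in span(basis), e.g. K <= N
    return [ui / n for ui in u]


def successive_step(g, eta, c_prev, d):
    """One update (30) using the closed-form alpha of (33)-(34)."""
    a = inner(g, d)          # g^H d
    b = inner(g, c_prev)     # g^H c^(k-1)
    if abs(b) ** 2 >= eta:   # target already met: alpha = 0
        return list(c_prev)
    rho = a * b.conjugate()  # rho = (g^H d)(c^(k-1)H g)
    mag = (-abs(rho) + math.sqrt(abs(rho) ** 2
                                 - abs(a) ** 2 * (abs(b) ** 2 - eta))) / abs(a) ** 2
    alpha = mag * cmath.exp(-1j * cmath.phase(rho))  # angle(alpha) = -angle(rho)
    return [ci + alpha * di for ci, di in zip(c_prev, d)]


def best_first_precoder(G, eta):
    """Successive precoder driven by the best-first rule (35).

    G: list of K channel vectors; eta: list of K SNR targets.
    Returns (service order, precoding vector c).
    """
    remaining = list(range(len(G)))
    basis, order = [], []
    c = [0j] * len(G[0])
    while remaining:
        # (35): smallest provided-to-requested SNR ratio. Ties (e.g. at the
        # first step, where c = 0) go to the lowest index (assumed tie-break).
        i = min(remaining, key=lambda m: abs(inner(G[m], c)) ** 2 / eta[m])
        remaining.remove(i)
        d = orthogonal_direction(G[i], basis)
        basis.append(d)
        c = successive_step(G[i], eta[i], c, d)
        order.append(i)
    return order, c


if __name__ == "__main__":
    rng = random.Random(0)
    G = [random_channel(8, rng) for _ in range(5)]
    order, c = best_first_precoder(G, [2.0] * 5)
    print("service order:", order, "power:", round(norm2(c), 4))
```

Because every new direction is orthogonal to the channels of the UEs already served, an update never changes their received signal, which is why a single pass over the UEs suffices; the ordering only affects how much power that pass consumes.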
The procedure in (35) thus sorts the UEs according to a best-first criterion, giving higher priority to the UEs that require low power to meet their SNR constraints.

Unlike [12], we use the power increase in (34) at each stage $k$ to order the UEs within each group according to a worst-first criterion. As mentioned before, this choice is motivated by the fact that the Gram–Schmidt procedure in (31) progressively reduces the available degrees of freedom as $k$ tends to $K_j$. Therefore, since the power consumption is dominated by the UEs with the worst conditions (according to some criterion), higher priority should be given to these UEs. Mathematically, we propose to compute the index $\mu_{jk}$ at step $k$ as follows:

$$\mu_{jk} = \arg\max_{i \in \mathcal{R}_j^{(k-1)}} |\alpha_{ji}^{(k)}|^2 \tag{36}$$

with

$$|\alpha_{ji}^{(k)}|^2 = \left(\frac{-|\rho_{ji}^{(k)}| + \sqrt{|\rho_{ji}^{(k)}|^2 - |\mathbf{g}_{ji}^H\mathbf{d}_{ji}^{(k)}|^2\left(|\mathbf{g}_{ji}^H\mathbf{c}_j^{(k-1)}|^2 - \eta_{ji}\right)}}{|\mathbf{g}_{ji}^H\mathbf{d}_{ji}^{(k)}|^2}\right)^2 \tag{37}$$

and $\rho_{ji}^{(k)} = \mathbf{g}_{ji}^H\mathbf{d}_{ji}^{(k)}\,\mathbf{c}_j^{(k-1)H}\mathbf{g}_{ji}$, where $\mathbf{d}_{ji}^{(k)}$ is the unit-norm vector obtained from (31) with $\mathbf{g}_{j\mu_{jk}}$ replaced by $\mathbf{g}_{ji}$. As seen, $\mu_{jk}$ corresponds to the UE index in $\mathcal{R}_j^{(k-1)}$ for which the incremental power $|\alpha_{ji}^{(k)}|^2$ at stage $k$ takes the maximum value. Note that the computational cost of this operation is still of the same order as for [12]. To further reduce the computational burden, we propose an alternative approach that exploits the inherent characteristics of large-scale antenna systems. As $N$ is large, each user $i \in \mathcal{R}_j^{(k-1)}$ can use the excess degrees of freedom provided by the large number of antennas to choose $\mathbf{d}_{ji}^{(k)}$ as collinear as possible with $\mathbf{g}_{ji}$, while almost nulling the interference generated to the other UEs, i.e., $|\mathbf{g}_{ji}^H\mathbf{c}_j^{(k-1)}|^2 \approx 0$. Therefore, by replacing $|\mathbf{g}_{ji}^H\mathbf{d}_{ji}^{(k)}|^2$ with $\|\mathbf{g}_{ji}\|^2$ and neglecting the term $|\mathbf{g}_{ji}^H\mathbf{c}_j^{(k-1)}|^2$ (which also makes $\rho_{ji}^{(k)} \approx 0$), the right-hand side of (37) reduces to $\eta_{ji}/\|\mathbf{g}_{ji}\|^2$. This means that the UEs in group $j$ can be ordered by simply sorting the following ratios in descending order:

$$\left(\frac{\eta_{j1}}{\|\mathbf{g}_{j1}\|^2},\ldots,\frac{\eta_{jK_j}}{\|\mathbf{g}_{jK_j}\|^2}\right), \quad \forall j. \tag{38}$$

In other words, higher priority should be given to the UEs whose channel conditions are poor compared to their target SNRs. In doing so, no greedy search is required for UE ordering, which reduces the total number of flops to $\mathcal{O}(K_j N)$ (plus $\mathcal{O}(K_j\log K_j)$ for sorting). Based on the above discussion, a heuristic solution for the inner layer is proposed in Algorithm 3. Numerical results are used in Section V to compare the above ordering policies in different settings. As will be seen, the ordering policy of (38) largely outperforms the strategy of [12].

C. The Proposed Heuristic Inner Layer Precoder

Collecting the results of Sections IV-A and IV-B, we present the following heuristic algorithm to design the inner layer precoder of $\mathcal{Q}(\boldsymbol{\eta})$. To emphasize the simplicity of Algorithm 3 and to enable the reproducibility of our results, its MATLAB code is provided in [35].

Algorithm 3

A heuristic algorithm of the inner layer for solving $\mathcal{Q}(\boldsymbol{\eta})$
1: for $j = 1$ to $G$ do
2:   Sort the UEs in group $j$ in descending order based on $\{\eta_{ji}/\|\mathbf{g}_{ji}\|^2\}$ and label the list as $\{\mu_{j1},\ldots,\mu_{jK_j}\}$.
3:   Compute $\{\mathbf{d}_j^{(k)}\}_{k=1}^{K_j}$ using the Gram–Schmidt procedure in (31).
4:   Set $\mathbf{c}_j^{(1)} = (\sqrt{\eta_{j\mu_{j1}}}/\|\mathbf{g}_{j\mu_{j1}}\|^2)\,\mathbf{g}_{j\mu_{j1}}$.
5:   for $k = 2$ to $K_j$ do
6:     if $|\mathbf{g}_{j\mu_{jk}}^H\mathbf{c}_j^{(k-1)}|^2 < \eta_{j\mu_{jk}}$ then
7:       Compute $\alpha_j^{(k)}$ through (33) and (34).
8:       Update $\mathbf{c}_j^{(k)} = \mathbf{c}_j^{(k-1)} + \alpha_j^{(k)}\mathbf{d}_j^{(k)}$.
9:     else set $\mathbf{c}_j^{(k)} = \mathbf{c}_j^{(k-1)}$.
10:    end if
11:  end for
12: end for

The complexity of Algorithm 3 can be evaluated as follows. Evaluating the terms $\{\eta_{ji}/\|\mathbf{g}_{ji}\|^2\}$ for group $j$ requires $\mathcal{O}(K_j(N-\tau_j))$ flops, whereas sorting a list of size $K_j$ needs $\mathcal{O}(K_j\log K_j)$ flops. Therefore, the flop count for the UE ordering in line 2 is $\mathcal{O}(K_j(N-\tau_j)) + \mathcal{O}(K_j\log K_j)$. The Gram–Schmidt procedure of line 3 can be performed through the QR decomposition, which requires $\mathcal{O}((N-\tau_j)K_j^2)$ flops [33]. The computation of $\mathbf{c}_j^{(1)}$ in line 4 needs $\mathcal{O}(N-\tau_j)$ flops. The condition in line 6 avoids wasting power on UEs whose requested SNR constraints are already met (more details are given in Appendix D). Each iteration of lines 6 to 10 requires $\mathcal{O}(N-\tau_j)$ flops and the loop is executed $K_j - 1$ times, so lines 5 to 11 require $\mathcal{O}(K_j(N-\tau_j))$ flops. Summing all the above terms, the complexity of Algorithm 3 for group $j$ is dominated by the QR decomposition, i.e., it is $\mathcal{O}((N-\tau_j)K_j^2)$, thereby reducing the complexity of the inner layer precoder by a factor of $N$.

Note that, by jointly employing Proposition 1 and Algorithm 3, the approximate precoding vectors for $\mathcal{F}(\boldsymbol{\eta}, P)$ can be computed as

$$\mathbf{c}_{j,\mathrm{BDZF\text{-}HEU}}(\boldsymbol{\eta}, P) = \sqrt{\frac{P}{P^{\star}_{\mathrm{BDZF\text{-}HEU}}(\boldsymbol{\eta})}}\;\mathbf{c}^{\star}_{j,\mathrm{BDZF\text{-}HEU}}(\boldsymbol{\eta}) \tag{39}$$

where $\{\mathbf{c}^{\star}_{j,\mathrm{BDZF\text{-}HEU}}(\boldsymbol{\eta})\}$ and $P^{\star}_{\mathrm{BDZF\text{-}HEU}}(\boldsymbol{\eta})$ denote the precoding vectors and the total power consumption obtained with Algorithm 3. Therefore, the precoding vectors for $\mathcal{F}(\boldsymbol{\eta}, P)$ are given by $\{\mathbf{F}_j\mathbf{c}_{j,\mathrm{BDZF\text{-}HEU}}(\boldsymbol{\eta}, P)\}$.

Fig. 1: Average power consumption of the QoS problem, comparing different ordering policies for $G = 1$ and $K = 20$. [Plot: average power consumption (Watt) versus number of antennas N; orderings based on (38), on (36), and on [12], for $\eta = 63$, $127$, and $255$.]

V. NUMERICAL RESULTS

Monte Carlo simulations are used to assess the performance of the proposed algorithms and to make comparisons with existing alternatives. In particular, we consider the algorithm presented in [4], which employs the SDR technique followed by a randomization and multicast multigroup power control (MMPC) policy (for the randomization phase, samples are generated using the Gaussian randomization method [4]). Comparisons are also made with the asymptotic results of [23], the FPP-based algorithm presented in [11], and the heuristic algorithms developed in [12]. A single-cell system is considered, with the UEs distributed uniformly at random in the cell, excluding an inner circular area around the BS. For each value of $N$, the average power consumption or minimum SINR of the system is obtained by averaging over different channel realizations and UE distributions. We assume (unless otherwise specified) that there are $G = 3$ multicasting groups, each containing $K_j = 10$ UEs (such that $K = 30$). The channel vector $\mathbf{g}_{jk}$ between UE $k$ in group $j$ and the BS is modeled as $\mathbf{g}_{jk} = \sqrt{\beta_{jk}}\,\mathbf{h}_{jk}$, where $\mathbf{h}_{jk} \sim \mathcal{CN}(\mathbf{0}, \mathbf{I}_N)$ represents the small-scale fading and $\beta_{jk}$ accounts for the large-scale attenuation, given in dB by the path-loss model of [36] as a function of $d_{jk}$, the distance between the UE and the BS expressed in kilometers. The noise power spectral density and the channel bandwidth are set as in [28]. All the simulations are performed on a 64-bit Linux operating system with an Intel Xeon E5-1680 v3 processor.

Fig. 2: Average power consumption of the QoS problem with $\eta_{jk} = 255$, $\forall j,k$. [Plot: average power consumption (Watt) versus number of antennas N; curves: SDR of [4], QoS Alg. of [12], BDZF + Alg. 3 (proposed), Alg. 1 (proposed), FPP of [11], lower bound.]

Fig. 1 compares the average power consumption of the ordering policy proposed in [12] with those given by (36) and (38), for $G = 1$, $K = 20$, and $\eta = 63$, $127$, and $255$ (which correspond to a target rate for each UE of 6, 7, and 8 bit/s/Hz, respectively). The proposed ordering policies are seen to outperform the ordering of [12]. Moreover, the simple ordering policy of (38) performs even slightly better than (36). Note that, as the ordering belongs to the heuristic inner layer of the proposed precoder and the outer layer removes the effect of inter-group interference, the same conclusion holds for $G > 1$. Based on the above results, the simpler ordering policy of (38) is used in the remainder of this section.

Fig. 2 depicts the average power consumption of the QoS problem versus the number of antennas $N$ at the BS. We assume that $\eta_{jk} = 255$ for all UEs (corresponding to 8 bit/s/Hz/UE), a value chosen in agreement with 5G rate requirements [37]; the conclusions, however, hold generically for other values of $\eta$. The performance of Algorithm 1 and of the combination of BDZF and Algorithm 3 is compared with other existing algorithms. As the QoS problem is NP-hard, a lower bound of the problem is also presented as a benchmark [4]. Observe that the proposed algorithms outperform the SDR-based solution of [4] and the heuristic one of [12], while they have nearly the same performance as [11]. However, this is achieved at a much lower computational cost, as detailed next.
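The single-group setting of Fig. 1 is small enough to reproduce in a few lines. With $G = 1$ there is no inter-group interference, so the sketch below simply takes the outer layer as $\mathbf{F}_1 = \mathbf{I}_N$ ($\tau_1 = 0$); this is a simplifying assumption for a self-contained example, not the authors' MATLAB code in [35]. Under that assumption it runs Algorithm 3 with the ordering (38), applies the scaling (39) of Proposition 1 to reuse the QoS precoder for a power budget $P$, and uses $\eta = 2^R - 1$ to map the rates 6, 7, 8 bit/s/Hz to $\eta = 63, 127, 255$. The large-scale gains in the demo are hypothetical placeholders, not the path-loss model of [36], and all function names are illustrative.

```python
import cmath
import math
import random


def inner(x, y):
    """Return x^H y for complex vectors stored as Python lists."""
    return sum(a.conjugate() * b for a, b in zip(x, y))


def norm2(x):
    """Return the squared Euclidean norm ||x||^2."""
    return sum(abs(a) ** 2 for a in x)


def eta_from_rate(rate):
    """SNR target for a rate of log2(1 + eta) bit/s/Hz, i.e. eta = 2^R - 1."""
    return 2 ** rate - 1


def order_38(G, eta):
    """Ordering (38): UEs sorted by eta_i / ||g_i||^2 in descending order."""
    return sorted(range(len(G)), key=lambda i: eta[i] / norm2(G[i]), reverse=True)


def algorithm3_single_group(G, eta):
    """Inner layer of Algorithm 3 for one group (assumes F_1 = I_N, tau_1 = 0)."""
    order = order_38(G, eta)                       # line 2
    basis = []                                     # line 3: Gram-Schmidt, in order
    for i in order:
        u = list(G[i])
        for q in basis:
            p = inner(q, u)
            u = [ui - p * qi for ui, qi in zip(u, q)]
        n = math.sqrt(norm2(u))
        basis.append([ui / n for ui in u])
    g1 = G[order[0]]                               # line 4
    c = [math.sqrt(eta[order[0]]) / norm2(g1) * gi for gi in g1]
    for k in range(1, len(order)):                 # lines 5-11
        g, e, d = G[order[k]], eta[order[k]], basis[k]
        a, b = inner(g, d), inner(g, c)
        if abs(b) ** 2 < e:                        # line 6; otherwise c is kept (line 9)
            rho = a * b.conjugate()
            mag = (-abs(rho) + math.sqrt(abs(rho) ** 2
                                         - abs(a) ** 2 * (abs(b) ** 2 - e))) / abs(a) ** 2
            alpha = mag * cmath.exp(-1j * cmath.phase(rho))   # (33)-(34)
            c = [ci + alpha * di for ci, di in zip(c, d)]      # line 8
    return order, c


def proposition1_scaling(c, P):
    """Scaling (39): reuse the QoS precoder with total power exactly P."""
    p_star = norm2(c)
    return [math.sqrt(P / p_star) * ci for ci in c], p_star


if __name__ == "__main__":
    rng = random.Random(1)
    N, K, P = 32, 20, 10.0
    beta = [rng.uniform(0.5, 1.5) for _ in range(K)]   # hypothetical large-scale gains
    G = [[complex(rng.gauss(0, math.sqrt(b / 2)), rng.gauss(0, math.sqrt(b / 2)))
          for _ in range(N)] for b in beta]            # g = sqrt(beta) h, h ~ CN(0, I)
    eta = [eta_from_rate(6)] * K                       # 6 bit/s/Hz -> eta = 63
    order, c = algorithm3_single_group(G, eta)
    c_mmf, p_star = proposition1_scaling(c, P)
    t_app = min(abs(inner(g, c_mmf)) ** 2 / e for g, e in zip(G, eta))
    print(f"P* = {p_star:.2f}, t_app = {t_app:.4f} >= P/P* = {P / p_star:.4f}")
```

The printed check mirrors the lower bound $t_{\mathrm{app}}(\boldsymbol{\eta}, P) \ge P/P^{\star}_{\mathrm{app}}(\boldsymbol{\eta})$ proved in Appendix A: scaling a feasible QoS precoder to the budget $P$ multiplies every SNR by the same factor $P/P^{\star}$.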
Note that, beyond a moderate number of antennas, both proposed algorithms stay within a small gap of the lower bound, and this gap shrinks as $N$ grows large, whereas for the SDR technique the gap is larger and shrinks only slowly as more antennas are added.

Fig. 3: Average minimum SINR of the MMF problem for $P = 10$ Watt. [Plot: average minimum SINR versus number of antennas N; curves: upper bound, FPP of [11], Alg. 2 (proposed), BDZF + Alg. 3 + Prop. 1 (proposed), SDR of [4], MMF Alg. of [12].]

TABLE I: The average time (in seconds) required to solve QoS, MMF, or both of them.

N | QoS: SDR [4] | QoS: FPP [11] | MMF: SDR [4] | MMF: FPP [11] | Both: Alg. 1 + Prop. 1 | Both: BDZF + Alg. 3 + Prop. 1
40 | 55 | 41 | 419 | 356 | 11.3 | 2.5 × 10^-3
50 | 67 | 51 | 579 | 450 | 11.6 | 2.8 × 10^-3
60 | 84 | 61 | 798 | 507 | 11.7 | 3.1 × 10^-3
70 | 110 | 75 | 1151 | 617 | 11.9 | 3.5 × 10^-3
80 | 146 | 87 | 1549 | 727 | 12.2 | 4.0 × 10^-3
90 | 182 | 107 | 2050 | 865 | 12.5 | 4.5 × 10^-3

Fig. 3 illustrates the average minimum SINR of the MMF problem versus $N$. The available power at the BS is set to 10 Watt. In this figure, the performance achieved by Algorithm 2 and by the combination of BDZF, Algorithm 3, and Proposition 1 is compared with other existing algorithms; the upper bound of the problem is also depicted as a benchmark. Similar to the results of Fig. 2, the proposed algorithms largely outperform [4] and [12], while nearly matching the performance of [11]. However, this is achieved at a computational cost that is significantly smaller than that of the other algorithms, as detailed next. Observe that Algorithm 2 comes within a small margin of the upper bound with just $N = 50$ antennas, and that Algorithm 3, jointly with BDZF and Proposition 1, reaches the same margin with $N = 70$ antennas.

To assess the computational complexity of the investigated algorithms more intuitively, besides the complexity analysis of Sections III-D and IV-C, Table I reports the computation time required to approximately solve $\mathcal{Q}(\boldsymbol{\eta})$ and $\mathcal{F}(\boldsymbol{\eta}, P)$ versus $N$, i.e., the average time (in seconds) required to solve the QoS and MMF problems. The second and third columns report the average time required by the SDR and FPP algorithms to solve an instance of the QoS problem. The fourth and fifth columns present the average time required by the same algorithms to solve an instance of the MMF problem. Note the increase in time from the QoS problem to the MMF problem: the SDR and FPP algorithms solve the MMF problem by iteratively applying their QoS algorithm. The sixth and seventh columns present the average time required by the proposed algorithms to solve both the QoS and MMF problems simultaneously.

Fig. 4: CDF of the minimum power consumption (QoS problem) of the system: (a) $N = 40$; (b) $N = 80$. [Curves: SDR of [4], QoS Alg. of [12], BDZF + Alg. 3 (proposed), Alg. 1 (proposed), FPP of [11], lower bound.]

Note that we not only solve both problems at the same time with good performance, but also significantly reduce the required time. As an example, for $N = 90$, the combination of BDZF, Algorithm 3, and Proposition 1 solves both problems in less than 5 milliseconds, while the SDR and FPP algorithms require 2050 and 865 seconds, respectively, just to solve the MMF problem. At the same time, as shown in Figs. 2 and 3, the solution provided by the joint application of BDZF, Algorithm 3, and Proposition 1 is nearly as good as the one achieved by the FPP algorithm, and significantly outperforms the SDR algorithm.

Fig. 5: CDF of the minimum SINR (MMF problem) of the system with $P = 10$ Watt: (a) $N = 40$; (b) $N = 80$. [Curves: upper bound, FPP of [11], Alg. 2 (proposed), BDZF + Alg. 3 + Prop. 1 (proposed), SDR of [4], MMF Alg. of [12], asymptotic Alg. of [23].]

Figs. 4 and 5 present the cumulative distribution function (CDF) of the precoder power consumption and of the minimum SINR of the system for the QoS and MMF problems, respectively. For the QoS problem, all UEs request the same SINR, and for the MMF problem the available power at the BS is 10 Watt. Unlike Figs. 2 and 3, which report the average minimum power consumption and the average minimum SINR of the system, these figures show the whole distribution of these quantities for the existing and proposed algorithms. It is seen that, as $N$ increases from 40 to 80, the CDF curves of our proposed algorithms move closer to the optimal bound and improve significantly, thanks to the large number of antennas. As an example, for the QoS problem, with $N = 80$ our proposed algorithms always meet the requested SINRs at a power level that is exceeded in a nonnegligible fraction of realizations with $N = 40$. Also, for the MMF problem, the minimum SINR levels that our proposed algorithms can exceed with $N = 80$ are beyond what any of the algorithms can provide with $N = 40$. Fig. 5 also contains the CDF of the asymptotic approach of [23]. Notice that the minimum SINR provided by the asymptotic approach never exceeds a low ceiling, with either $N = 40$ or $N = 80$ antennas; its shortcomings are detailed in [23].

In Section III, we have elaborated the BDZF-SCA approach and proved the convergence of Algorithms 1 and 2, but we have not specified the number of iterations each algorithm requires to converge. Table II presents the average number of iterations required by Algorithms 1 and 2 to achieve convergence for different values of $G$, $K$, and $N$. Denoting by $\varepsilon^{(k)}$ the objective value achieved at the $k$th iteration of either Algorithm 1 or 2, the convergence condition used in Table II is that the relative change $|\varepsilon^{(k+1)} - \varepsilon^{(k)}|/\varepsilon^{(k)}$ falls below a small tolerance. Also, Fig. 6 illustrates $\varepsilon^{(k)}$ for both algorithms at each iteration index $k$ for different numbers of antennas $N$. As can be seen, both algorithms converge in a few iterations for any value of $N$.

TABLE II: Average number of iterations required by Algorithms 1 and 2. [Column groups: $G = 2, K = 10$; $G = 3, K = 10$; $G = 3, K = 15$, each reported for Alg. 2 and Alg. 1. Rows: $N = 40, 50, 60, 70, 80, 90$.]

Fig. 6: The convergence behavior of Algorithms 1 and 2 for different numbers of antennas $N$ (from 40 to 90): (a) Algorithm 1, power consumption (Watt) versus iteration index $k$; (b) Algorithm 2, minimum SINR versus iteration index $k$.

So far, we have assumed that the BS has perfect knowledge of the channel vectors $\{\mathbf{g}_{jk}\}$. Although some of the analysis can in principle be extended (to some extent) to the scenario with imperfect CSI, we believe that this is out of the scope of this work and leave it for future work. To partially fill this gap, we now investigate the impact of imperfect CSI on the performance of the proposed algorithms. A time-division duplex (TDD) protocol is employed, such that channel estimation is performed in the uplink on the basis of UE pilot signals and the estimates are then used in the downlink. We assume that pilots of length $\tau_p = K$ are used, with a fixed transmit power. The estimates of the channel vectors $\{\mathbf{g}_{jk}\}$ are computed at the BS using an MMSE estimator. This yields [38]

$$\hat{\mathbf{g}}_{jk} = \frac{\sqrt{\tau_p}\,\beta_{jk}}{1 + \tau_p\beta_{jk}}\left(\sqrt{\tau_p}\,\mathbf{g}_{jk} + \mathbf{n}_{jk}\right) \tag{40}$$

where $\mathbf{n}_{jk} \sim \mathcal{CN}(\mathbf{0}, \mathbf{I}_N)$ is the additive Gaussian noise.

Fig. 7: Evaluating the impact of imperfect CSI at the BS for the MMF problem with $G = 4$, $K_j = 15$ $\forall j$, and $P = 10$ Watt. [Plot: average minimum SINR versus number of antennas N; curves: BDZF + Alg. 3 + Prop. 1 and Alg. 2, each with perfect and imperfect CSI; annotations 110/90 and 150/120.]

Fig. 7 reports the performance of the proposed algorithms for the MMF problem when perfect and imperfect CSI (as given in (40)) is available at the BS. We assume that $G = 4$ and $K_j = 15$ for $j = 1,\ldots,4$. As expected, imperfect CSI degrades the performance of the proposed algorithms. However, such a performance loss can be compensated by using more antennas at the BS. Quantitatively speaking, with imperfect CSI, $N$ must be increased by up to roughly 30% compared to the perfect CSI case, and the higher $N$, the larger the required increase. For example, to achieve the same performance as the perfect CSI case with $N = 90$ and $N = 120$, 110 and 150 antennas are respectively needed with imperfect CSI, corresponding to increases of about 22% and 25%.

VI. CONCLUSIONS

Multicasting is an efficient technology to transmit distinct common data streams to multiple groups of users. The existing multicasting algorithms are either computationally expensive or exhibit poor performance when applied to large-scale systems with hundreds of antennas, as envisioned in the next generation of wireless systems. In this paper, we designed new algorithms tailored to physical layer multicasting in large-scale antenna systems. The proposed algorithms achieve good performance and are characterized by affordable computational complexity. This was achieved by using the large number of antennas to first cancel the inter-group interference and then reformulate both the QoS and MMF problems in simple forms. Two efficient algorithms for solving the simplified problems were presented. Unlike baseline methods that solve the MMF problem by iteratively solving the QoS problem, we showed how to solve both simultaneously with no extra cost.

APPENDIX A - PROOF OF PROPOSITION 1

By replacing $\mathbf{c}_j$ with $\sqrt{P/P^{\star}_{\mathrm{app}}(\boldsymbol{\eta})}\,\mathbf{c}^{\star}_{j,\mathrm{app}}(\boldsymbol{\eta})$, and since $\sum_{j=1}^{G}\|\mathbf{c}^{\star}_{j,\mathrm{app}}(\boldsymbol{\eta})\|^2 = P^{\star}_{\mathrm{app}}(\boldsymbol{\eta})$, we have

$$\sum_{j=1}^{G}\left\|\sqrt{\frac{P}{P^{\star}_{\mathrm{app}}(\boldsymbol{\eta})}}\,\mathbf{c}^{\star}_{j,\mathrm{app}}(\boldsymbol{\eta})\right\|^2 = \frac{P}{P^{\star}_{\mathrm{app}}(\boldsymbol{\eta})}\sum_{j=1}^{G}\|\mathbf{c}^{\star}_{j,\mathrm{app}}(\boldsymbol{\eta})\|^2 = P$$

which proves the feasibility of the proposed solution. For the objective value of $\mathcal{F}(\boldsymbol{\eta}, P)$ (or $\widetilde{\mathcal{F}}(\boldsymbol{\eta}, P)$) achieved by $\{\sqrt{P/P^{\star}_{\mathrm{app}}(\boldsymbol{\eta})}\,\mathbf{c}^{\star}_{j,\mathrm{app}}(\boldsymbol{\eta})\}$, we have

$$t_{\mathrm{app}}(\boldsymbol{\eta}, P) = \min_j\min_k\frac{\left|\mathbf{g}_{jk}^H\sqrt{\frac{P}{P^{\star}_{\mathrm{app}}(\boldsymbol{\eta})}}\,\mathbf{c}^{\star}_{j,\mathrm{app}}(\boldsymbol{\eta})\right|^2}{\eta_{jk}} = \frac{P}{P^{\star}_{\mathrm{app}}(\boldsymbol{\eta})}\min_j\min_k\frac{|\mathbf{g}_{jk}^H\mathbf{c}^{\star}_{j,\mathrm{app}}(\boldsymbol{\eta})|^2}{\eta_{jk}}.$$

Denote $\nu := \min_j\min_k |\mathbf{g}_{jk}^H\mathbf{c}^{\star}_{j,\mathrm{app}}(\boldsymbol{\eta})|^2/\eta_{jk}$. As $\{\mathbf{c}^{\star}_{j,\mathrm{app}}(\boldsymbol{\eta})\}$ is a feasible set of precoding vectors of $\mathcal{Q}(\boldsymbol{\eta})$, $\nu \ge 1$. Therefore, $t_{\mathrm{app}}(\boldsymbol{\eta}, P) = \nu P/P^{\star}_{\mathrm{app}}(\boldsymbol{\eta}) \ge P/P^{\star}_{\mathrm{app}}(\boldsymbol{\eta})$. As $P^{\star}_{\mathrm{app}}(\boldsymbol{\eta})$ is an objective value of $\mathcal{Q}(\boldsymbol{\eta})$ achieved by $\{\mathbf{c}^{\star}_{j,\mathrm{app}}(\boldsymbol{\eta})\}$, it is greater than or equal to the optimal objective value of $\mathcal{Q}(\boldsymbol{\eta})$, i.e., $P^{\star}_{\mathrm{app}}(\boldsymbol{\eta}) \ge P^{\star}(\boldsymbol{\eta})$, and we have $P/P^{\star}_{\mathrm{app}}(\boldsymbol{\eta}) \le t_{\mathrm{app}}(\boldsymbol{\eta}, P) \le P/P^{\star}(\boldsymbol{\eta})$.

APPENDIX B - PROOF OF PROPOSITION

By replacing $\mathbf{c}_j$ with $\mathbf{c}_{j,\mathrm{app}}(\boldsymbol{\eta}, P)/\sqrt{t_{\mathrm{app}}(\boldsymbol{\eta}, P)}$, we have

$$\left|\mathbf{g}_{jk}^H\frac{\mathbf{c}_{j,\mathrm{app}}(\boldsymbol{\eta}, P)}{\sqrt{t_{\mathrm{app}}(\boldsymbol{\eta}, P)}}\right|^2 = \frac{\eta_{jk}}{t_{\mathrm{app}}(\boldsymbol{\eta}, P)}\,\frac{|\mathbf{g}_{jk}^H\mathbf{c}_{j,\mathrm{app}}(\boldsymbol{\eta}, P)|^2}{\eta_{jk}} \overset{(a)}{\ge} \eta_{jk}$$

where (a) is due to the fact that $t_{\mathrm{app}}(\boldsymbol{\eta}, P)$ is the minimum weighted SINR among all UEs. Therefore, $\{\mathbf{c}_{j,\mathrm{app}}(\boldsymbol{\eta}, P)/\sqrt{t_{\mathrm{app}}(\boldsymbol{\eta}, P)}\}$ is a feasible solution of $\mathcal{Q}(\boldsymbol{\eta})$ and $\widetilde{\mathcal{Q}}(\boldsymbol{\eta})$. For the objective value, we have

$$P^{\star}_{\mathrm{app}}(\boldsymbol{\eta}) = \sum_{j=1}^{G}\left\|\frac{\mathbf{c}_{j,\mathrm{app}}(\boldsymbol{\eta}, P)}{\sqrt{t_{\mathrm{app}}(\boldsymbol{\eta}, P)}}\right\|^2 = \frac{\sum_{j=1}^{G}\|\mathbf{c}_{j,\mathrm{app}}(\boldsymbol{\eta}, P)\|^2}{t_{\mathrm{app}}(\boldsymbol{\eta}, P)} \le \frac{P}{t_{\mathrm{app}}(\boldsymbol{\eta}, P)}.$$

To see the last step, denote $\gamma := P/\sum_{j=1}^{G}\|\mathbf{c}_{j,\mathrm{app}}(\boldsymbol{\eta}, P)\|^2$. As $\{\mathbf{c}_{j,\mathrm{app}}(\boldsymbol{\eta}, P)\}$ is a set of precoding vectors of $\mathcal{F}(\boldsymbol{\eta}, P)$, it satisfies the power constraint and hence $\gamma \ge 1$. Therefore, $P^{\star}_{\mathrm{app}}(\boldsymbol{\eta}) = P/(\gamma\,t_{\mathrm{app}}(\boldsymbol{\eta}, P)) \le P/t_{\mathrm{app}}(\boldsymbol{\eta}, P)$. Since $t_{\mathrm{app}}(\boldsymbol{\eta}, P)$ is an objective value of $\mathcal{F}(\boldsymbol{\eta}, P)$ achieved by $\{\mathbf{c}_{j,\mathrm{app}}(\boldsymbol{\eta}, P)\}$, it is less than or equal to the optimal objective value of $\mathcal{F}(\boldsymbol{\eta}, P)$, i.e., $t_{\mathrm{app}}(\boldsymbol{\eta}, P) \le t(\boldsymbol{\eta}, P)$. Hence, we have $P/t(\boldsymbol{\eta}, P) \le P^{\star}_{\mathrm{app}}(\boldsymbol{\eta}) \le P/t_{\mathrm{app}}(\boldsymbol{\eta}, P)$.

APPENDIX C - PROOF OF PROPOSITION

Since $\mathbf{z}_j^{(1)H}\mathbf{X}_{jk}\mathbf{z}_j^{(1)} \ge \eta_{jk}$, $\forall k, \forall j$, $\mathbf{z}_j^{(1)}$ is a feasible solution of $\widetilde{\mathcal{Q}}_j^{(1)}(\boldsymbol{\eta}_j, \mathbf{z}_j^{(1)})$. Now consider the $(i+1)$th iteration of the problem, $i \in \{1, 2, \ldots\}$. For all $k$ and $j$, we have

$$2\,\mathrm{Re}\left(\mathbf{z}_j^{(i+1)H}\mathbf{X}_{jk}\mathbf{c}_j^{(i+1)}\right) - \mathbf{z}_j^{(i+1)H}\mathbf{X}_{jk}\mathbf{z}_j^{(i+1)} \overset{(a)}{=} 2\,\mathrm{Re}\left(\mathbf{c}_j^{(i)H}\mathbf{X}_{jk}\mathbf{c}_j^{(i+1)}\right) - \mathbf{c}_j^{(i)H}\mathbf{X}_{jk}\mathbf{c}_j^{(i)} \tag{41}$$

where (a) is due to our update rule, $\mathbf{z}_j^{(i+1)} = \mathbf{c}_j^{(i)}$. Now, if we set $\mathbf{c}_j^{(i+1)} = \mathbf{c}_j^{(i)}$, (41) reduces to $\mathbf{c}_j^{(i)H}\mathbf{X}_{jk}\mathbf{c}_j^{(i)}$, which is greater than or equal to $\eta_{jk}$ due to (26) and (27). Therefore, $\mathbf{c}_j^{(i)}$ is a feasible solution of $\widetilde{\mathcal{Q}}_j^{(i+1)}(\boldsymbol{\eta}_j, \mathbf{z}_j^{(i+1)})$. Hence, the objective function at the $(i+1)$th iteration is less than or equal to that at the $i$th iteration. As the objective function is bounded from below, by successively solving the problem we obtain a non-increasing bounded sequence. Therefore, the algorithm converges. Due to (26), any internal precoding vector $\mathbf{c}_j$ that satisfies (27) also satisfies (11); as a result, any solution to $\widetilde{\mathcal{Q}}(\boldsymbol{\eta}, \mathbf{z})$ is also a feasible solution to $\mathcal{Q}(\boldsymbol{\eta})$. Due to the update rule and the inner approximation in (27), the convergence point satisfies the KKT conditions of $\mathcal{Q}(\boldsymbol{\eta})$, as detailed in [30].

APPENDIX D - SOLUTION TO (32)

Here we prove that the solution of (32), i.e., $\alpha_j^{(k)}$, is given by (33) and (34). We start with the SNR constraint $|\mathbf{g}_{j\mu_{jk}}^H\mathbf{c}_j^{(k)}|^2 \ge \eta_{j\mu_{jk}}$ and replace $\mathbf{c}_j^{(k)}$ with $\mathbf{c}_j^{(k-1)} + \alpha_j^{(k)}\mathbf{d}_j^{(k)}$ using (30). Denote

$$A = |\mathbf{g}_{j\mu_{jk}}^H\mathbf{d}_j^{(k)}|^2,\quad B = 2\,\mathrm{Re}\left(e^{\mathrm{j}\angle\alpha_j^{(k)}}\,\mathbf{g}_{j\mu_{jk}}^H\mathbf{d}_j^{(k)}\,\mathbf{c}_j^{(k-1)H}\mathbf{g}_{j\mu_{jk}}\right),\quad C = |\mathbf{g}_{j\mu_{jk}}^H\mathbf{c}_j^{(k-1)}|^2 - \eta_{j\mu_{jk}}.$$

The SNR constraint can then be written as

$$|\mathbf{g}_{j\mu_{jk}}^H\mathbf{c}_j^{(k)}|^2 - \eta_{j\mu_{jk}} = A|\alpha_j^{(k)}|^2 + B|\alpha_j^{(k)}| + C \ge 0. \tag{42}$$

Notice that if $|\mathbf{g}_{j\mu_{jk}}^H\mathbf{c}_j^{(k-1)}|^2 \ge \eta_{j\mu_{jk}}$, then, to minimize the power, no additional power is allocated to user $\mu_{jk}$, i.e., $\alpha_j^{(k)} = 0$, and the next user is served. Otherwise, $C < 0$. Now we transform (42) into an equality by introducing a slack variable $\delta \ge 0$ as follows: $A|\alpha_j^{(k)}|^2 + B|\alpha_j^{(k)}| + C - \delta = 0$. Hence,

$$|\alpha_j^{(k)}| = \frac{-B + \sqrt{B^2 - 4A(C - \delta)}}{2A}$$

since $(-B - \sqrt{B^2 - 4A(C - \delta)})/(2A) < 0$ is not a valid value for $|\alpha_j^{(k)}|$. Moreover, as $A > 0$, to minimize the power $\delta$ should be equal to zero, i.e., the power should be used to meet the SNR constraint with equality. Hence, $|\alpha_j^{(k)}| = (-B + \sqrt{B^2 - 4AC})/(2A)$. Now, as $A$ is fixed, to minimize $|\alpha_j^{(k)}|$ we should minimize $-B + \sqrt{B^2 - 4AC}$. Note that $-B + \sqrt{B^2 - 4AC}$ always has a negative derivative with respect to $B$ (because $C < 0$ implies $\sqrt{B^2 - 4AC} > |B|$); hence, its minimum is achieved for the maximum value of $B$. Denoting $\psi_{jk}e^{\mathrm{j}\angle\theta_{jk}} = \mathbf{g}_{j\mu_{jk}}^H\mathbf{d}_j^{(k)}\,\mathbf{c}_j^{(k-1)H}\mathbf{g}_{j\mu_{jk}}$, we have $B = 2\psi_{jk}\,\mathrm{Re}(e^{\mathrm{j}(\angle\theta_{jk} + \angle\alpha_j^{(k)})})$, whose maximum $B = 2\psi_{jk}$ is achieved for $\angle\alpha_j^{(k)} = -\angle\theta_{jk}$. Since $\psi_{jk}e^{\mathrm{j}\angle\theta_{jk}} = \rho_j^{(k)}$, this gives (33), and substituting $B = 2|\rho_j^{(k)}|$ into the expression of $|\alpha_j^{(k)}|$ gives (34).

REFERENCES
[1] Cisco Visual Networking Index, "Global mobile data traffic forecast update, 2015–2020," San Jose, CA, Tech. Rep., Feb. 2016.
[2] D. Lecompte and F. Gabin, "Evolved multimedia broadcast/multicast service (eMBMS) in LTE-advanced: Overview and Rel-11 enhancements," IEEE Commun. Mag., vol. 50, no. 11, pp. 68–74, Nov. 2012.
[3] N. D. Sidiropoulos, T. N. Davidson, and Z.-Q. Luo, "Transmit beamforming for physical-layer multicasting," IEEE Trans. Signal Process., vol. 54, no. 6, pp. 2239–2251, Jun. 2006.
[4] E. Karipidis, N. D. Sidiropoulos, and Z.-Q. Luo, "Quality of service and max-min fair transmit beamforming to multiple cochannel multicast groups," IEEE Trans. Signal Process., vol. 56, no. 3, pp. 1268–1279, Mar. 2008.
[5] D. Christopoulos, S. Chatzinotas, and B. Ottersten, "Weighted fair multicast multigroup beamforming under per-antenna power constraints," IEEE Trans. Signal Process., vol. 62, no. 19, pp. 5132–5142, Oct. 2014.
[6] Z. Xiang, M. Tao, and X. Wang, "Coordinated multicast beamforming in multicell networks," IEEE Trans. Wireless Commun., vol. 12, no. 1, pp. 12–21, Jan. 2013.
[7] S. X. Wu, W. K. Ma, and A. M. C. So, "Physical-layer multicasting by stochastic transmit beamforming and Alamouti space-time coding," IEEE Trans. Signal Process., vol. 61, no. 17, pp. 4230–4245, Sep. 2013.
[8] I. H. Kim, D. J. Love, and S. Y. Park, "Optimal and successive approaches to signal design for multiple antenna physical layer multicasting," IEEE Trans. Commun., vol. 59, pp. 2316–2327, Aug. 2011.
[9] M. Li, S. Kundu, D. A. Pados, and S. N. Batalama, "Waveform design for secure SISO transmissions and multicasting," IEEE J. Sel. Areas Commun., vol. 31, no. 9, pp. 1864–1874, Sep. 2013.
[10] B. Zhu, J. Ge, Y. Huang, Y. Yang, and M. Lin, "Rank-two beamformed secure multicasting for wireless information and power transfer," IEEE Signal Process. Lett., vol. 21, no. 2, pp. 199–203, Feb. 2014.
[11] D. Christopoulos, S. Chatzinotas, and B. Ottersten, "Multicast multigroup beamforming for per-antenna power constrained large-scale arrays," in Proc. IEEE Int. Workshop Signal Process. Adv. Wireless Commun. (SPAWC), Jun. 2015, pp. 271–275.
[12] R. Hunger, D. A. Schmidt, M. Joham, A. Schwing, and W. Utschick, "Design of single-group multicasting-beamformers," in Proc. IEEE Int. Conf. Commun. (ICC), Glasgow, Jun. 2007, pp. 2499–2505.
[13] M. C. Yue, S. X. Wu, and A. M. C. So, "A robust design for MISO physical-layer multicasting over line-of-sight channels," IEEE Signal Process. Lett., vol. 23, no. 7, pp. 939–943, Jul. 2016.
[14] D. Chen and V. Kuehn, "Weighted max-min fairness oriented load-balancing and clustering for multicast cache-enabled F-RAN," in Int. Symp. Turbo Codes and Iterative Inform. Process. (ISTC), Brest, France, Sep. 2016, pp. 395–399.
[15] Z.-Q. Luo, W. K. Ma, A. M. C. So, Y. Ye, and S. Zhang, "Semidefinite relaxation of quadratic optimization problems," IEEE Signal Process. Mag., vol. 27, no. 3, pp. 20–34, May 2010.
[16] T. L. Marzetta, "Noncooperative cellular wireless with unlimited numbers of base station antennas," IEEE Trans. Wireless Commun., vol. 9, no. 11, pp. 3590–3600, Nov. 2010.
[17] H. Q. Ngo, E. G. Larsson, and T. L. Marzetta, "Energy and spectral efficiency of very large multiuser MIMO systems," IEEE Trans. Commun., vol. 61, no. 4, pp. 1436–1449, Apr. 2013.
[18] J. Hoydis, S. ten Brink, and M. Debbah, "Massive MIMO in the UL/DL of cellular networks: How many antennas do we need?" IEEE J. Sel. Areas Commun., vol. 31, no. 2, pp. 160–171, Feb. 2013.
[19] J. Zuo, J. Zhang, C. Yuen, W. Jiang, and W. Luo, "Multicell multiuser massive MIMO transmission with downlink training and pilot contamination precoding," IEEE Trans. Veh. Technol., vol. 65, no. 8, pp. 6301–6314, Aug. 2016.
[20] T. L. Marzetta, "How much training is required for multiuser MIMO?" in Proc. IEEE Asilomar Conference on Signals, Systems and Computers (ACSSC'06), Pacific Grove, CA, USA, Oct. 2006, pp. 359–363.
[21] A. Arvola, A. Tolli, and D. Gesbert, "Two-layer precoding for dimensionality reduction in massive MIMO," Budapest, Hungary, Aug. 2016, pp. 2000–2004.
[22] J. Zuo, J. Zhang, C. Yuen, W. Jiang, and W. Luo, "Energy efficient downlink transmission for multi-cell massive DAS with pilot contamination," IEEE Trans. Veh. Technol., vol. 66, pp. 1209–1221, Apr. 2016.
[23] M. Sadeghi and C. Yuen, "Multi-cell multi-group massive MIMO multicasting: An asymptotic analysis," in Proc. IEEE Global Commun. Conf. (Globecom), San Diego, CA, Dec. 2015, pp. 1–6.
[24] Z. Xiang, M. Tao, and X. Wang, "Massive MIMO multicasting in noncooperative cellular networks," IEEE J. Sel. Areas Commun., vol. 32, no. 6, pp. 1180–1193, Jun. 2014.
[25] H. Zhou and M. Tao, "Joint multicast beamforming and user grouping in massive MIMO systems," in Proc. IEEE Int. Conf. Commun. (ICC), London, Jun. 2015, pp. 1770–1775.
[26] L. N. Tran, M. F. Hanif, and M. Juntti, "A conic quadratic programming approach to physical layer multicasting for large-scale antenna arrays," IEEE Signal Process. Lett., vol. 21, no. 1, pp. 114–117, Jan. 2014.
[27] O. Mehanna, K. Huang, B. Gopalakrishnan, A. Konar, and N. D. Sidiropoulos, "Feasible point pursuit and successive approximation of non-convex QCQPs," IEEE Signal Process. Lett., vol. 22, no. 7, pp. 804–808, Jul. 2015.
[28] Z. Xiang, M. Tao, and X. Wang, "Massive MIMO multicasting in noncooperative multicell networks," in Proc. IEEE Int. Conf. Commun. (ICC), Sydney, Jun. 2014, pp. 4777–4782.
[29] B. R. Marks and G. P. Wright, "A general inner approximation algorithm for nonconvex mathematical programs," Oper. Res., vol. 26, no. 4, pp. 681–683, 1978.
[30] A. Beck, A. Ben-Tal, and L. Tetruashvili, "A sequential parametric convex approximation method with applications to nonconvex truss topology design problems," J. Global Optim., vol. 47, pp. 29–51, 2010.
[31] Q. H. Spencer, A. L. Swindlehurst, and M. Haardt, "Zero-forcing methods for downlink spatial multiplexing in multiuser MIMO channels," IEEE Trans. Signal Process., vol. 52, no. 2, pp. 461–471, Feb. 2004.
[32] L.-U. Choi and R. D. Murch, "A transmit preprocessing technique for multiuser MIMO systems using a decomposition approach," IEEE Trans. Wireless Commun., vol. 3, no. 1, pp. 20–24, Jan. 2004.
[33] M. Arakawa, "Computational workloads for commonly used signal processing kernels," MIT Lincoln Lab., Tech. Rep., 2006.
[34] R. Chen, R. W. Heath, and J. G. Andrews, "Transmit selection diversity for unitary precoded multiuser spatial multiplexing systems with linear receivers," IEEE Trans. Signal Process., vol. 55, no. 3, pp. 1159–1171, Mar. 2007.
[35] M. Sadeghi, "Reducing the complexity of multicasting." [Online]. Available: https://github.com/meysamsadeghi/Reducing-the-Computational-Complexity-of-Multicasting-in-Large-Scale-Antenna-Systems
[36] Technical Specification Group RAN, "Physical layer aspects for E-UTRA," 3rd Generation Partnership Project (3GPP), Tech. Rep., TS 25.814, 2006.
[37] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. K. Soong, and J. C. Zhang, "What will 5G be?" IEEE J. Sel. Areas Commun., vol. 32, no. 6, pp. 1065–1082, Jun. 2014.
[38] T. L. Marzetta, E. G. Larsson, H. Yang, and H. Q. Ngo,