Nonconvex Regularized Gradient Projection Sparse Reconstruction for Massive MIMO Channel Estimation
Pengxia Wu and Julian Cheng,
Senior Member, IEEE
Abstract
Novel sparse reconstruction algorithms are proposed for beamspace channel estimation in massive multiple-input multiple-output systems. The proposed algorithms minimize a least-squares objective with a nonconvex regularizer. This regularizer removes the penalties on a few large-magnitude elements from the conventional $\ell_1$-norm regularizer, and thus it only forces penalties on the remaining elements that are expected to be zeros. Accurate and fast reconstructions can be achieved by performing gradient projection updates within a difference-of-convex-functions (DC) programming framework. A double-loop algorithm and a single-loop algorithm are derived from different DC decompositions, and they have distinct computation complexities and convergence rates. An extension algorithm is further proposed to generalize the step size of the single-loop algorithm. This extension algorithm has a faster convergence rate and can achieve the same level of accuracy as the proposed double-loop algorithm. Numerical results show significant advantages of the proposed algorithms over the existing reconstruction algorithms in terms of reconstruction accuracy and runtime.

Index Terms
DC programming, gradient projection, massive MIMO, nonconvex optimization, sparse channel reconstruction
I. INTRODUCTION
As a key technology for the fifth-generation and beyond communication systems, massive multiple-input multiple-output (MIMO) relies heavily on accurate knowledge of channel state information (CSI) to reap the potential performance benefits of a large number of antennas. However, acquiring downlink CSI in a resource-efficient manner is problematic for frequency division duplex (FDD) systems, which occupy different spectrum bands in the uplink and downlink channels. For such a system, downlink channels are estimated at the user equipments (UEs), and then the UEs feed back the estimated CSI to the base station (BS); both the overhead of downlink pilots and that of CSI feedback are proportional to the number of BS antennas, which is large for massive MIMO. To reduce the overhead of downlink CSI acquisition, one can develop sparse channel estimation methods by exploiting channel sparsity in certain domains, since the unknown massive MIMO channels can be represented as high-dimensional sparse vectors in a certain basis [1]. Based on a valid sparse representation of the massive MIMO channel, transmitting pilots through the channel can be regarded as a linear mapping of sparse channel vectors onto a compact subspace to obtain lower-dimensional observations. The full-dimensional channel vectors are then reconstructed at the BS. In such sparse channel estimation methods, the length of the pilot sequence for each antenna is no longer proportional to the number of BS antennas, but only depends on the sparsity level of the channel.

P. Wu and J. Cheng are with the School of Engineering, The University of British Columbia, Kelowna, BC V1X 1V7, Canada (e-mail: [email protected]; [email protected]).
Therefore, the pilot sequence length is allowed to be much less than the number of BS antennas, and the downlink CSI acquisition overhead for FDD massive MIMO can be substantially reduced.

Sparse channel estimation [2] has been proposed for many traditional communication applications such as orthogonal frequency-division multiplexing systems [3], ultra-wideband communications [4], pulse-shaping multicarrier systems [5], and underwater acoustic communications [6]. Channels in these applications have sparse impulse responses because a multipath channel has a large delay spread and only a small number of nonzero taps [2]. Nowadays, sparse channel estimation has been increasingly investigated for massive MIMO systems, and various beamspace channel estimation schemes have been developed for massive MIMO systems [7]-[16]. Due to a limited number of scattering clusters and the small angular spread of each scattering cluster, the massive MIMO angular-domain (beamspace) channel exhibits a sparse characteristic [17], i.e., a majority of the channel energy is occupied by a few dimensions, and most elements of the beamspace channel vectors are either zero or nearly zero. More recently, owing to the strong sparsity of millimeter-wave (mmWave) channels, sparse channel estimation has become popular in mmWave massive MIMO systems with a hybrid analog-and-digital (AD) architecture [18], where full-dimensional channel vectors have to be estimated from a small number of observations due to a limited number of radio-frequency chains [18]-[21].

Sparse channel estimation requires reconstructing high-dimensional sparse channel vectors from lower-dimensional observations. An efficient sparse reconstruction algorithm is essential to achieve accurate channel estimation. The ultimate sparse reconstruction problem is an $\ell_0$-minimization problem, which is NP-hard [22]. Alternatively, one can replace the $\ell_0$-norm by other simple functions to seek approximate solutions.
The most popular method is to use the $\ell_1$-norm to approximate the $\ell_0$-norm, and this method is known as $\ell_1$-relaxation or convex relaxation. Several well-known $\ell_1$-relaxation algorithms have been proposed, such as l1-ls [23], Bregman iterative regularization [24], gradient projection for sparse reconstruction (GPSR) [25], sparse reconstruction by separable approximation (SpaRSA) [26], the iterative shrinkage-thresholding algorithm (ISTA) [27], and the fast iterative shrinkage-thresholding algorithm (FISTA) [28]. These $\ell_1$-relaxation algorithms have theoretical convergence guarantees, but the $\ell_1$-norm is a loose approximation of the $\ell_0$-norm and can lead to biased estimates. To achieve tighter approximations, several nonconvex functions have been proposed, including the $\ell_q$-norm for $0 \le q \le 1$ [29] and the difference of two continuous functions [30]. Another popular method is the greedy approach, and representative algorithms include orthogonal matching pursuit (OMP) [31], [32], CoSaMP [33], and least angle regression (LARS) [34]. Greedy methods work well for sufficiently sparse vectors, but their reconstruction accuracies and speeds degrade severely when the sparsity reduces.

In this paper, we propose to use difference-of-convex-functions (DC) programming and a gradient projection descent algorithm to solve sparse channel reconstruction in massive MIMO systems. Instead of approximating the $\ell_0$-norm, we exactly represent the $\ell_0$-norm constraint by introducing the top-$(K,1)$ norm [35], which is simply the sum of the $K$ largest elements of a vector in terms of absolute values. Thus, the original $\ell_0$-minimization sparse reconstruction problem can be equivalently transformed into a least-squares optimization penalized by a nonconvex regularizer. This regularizer is more favorable than the $\ell_1$-regularizer because it removes the penalties on large-magnitude elements, but it leads to a nonconvex optimization problem.
We adopt the DC programming framework to transform the nonconvex problem into a sequence of convex subproblems, and apply the gradient projection descent method to solve these convex subproblems. We propose three different DC gradient projection sparse reconstruction (DC-GPSR) algorithms. (Although the top-$(K,1)$ norm and a DC programming algorithm employing the soft-thresholding operation were proposed to solve several $\ell_0$-constrained optimization problems [35], the sparse reconstruction problem has not been investigated therein.) The proposed algorithms are distinct in terms of time complexity, but they can achieve the same level of reconstruction accuracy. The contributions of this paper are summarized as follows:

- We propose a double-loop DC-GPSR algorithm (Algorithm 1), which applies gradient projection descent in a DC programming framework. Our preliminary study has reported this highly accurate, simple and robust algorithm [36]. However, the double-loop mechanism of Algorithm 1 can be criticized for its inefficiency; therefore, in this paper, we further propose single-loop algorithms.
- We apply a special DC decomposition to the objective function such that the convex subproblem in the DC programming has a closed-form solution. Thus, we propose a single-loop DC-GPSR algorithm (Algorithm 2), which avoids inner loops and solves the DC programming subproblem in a one-step update.
- We find that an update of the single-loop DC-GPSR algorithm can be interpreted as a gradient projection descent update with a required step size. Thus, we adopt the Barzilai-Borwein (BB) step size to generalize the basic single-loop DC-GPSR algorithm and propose Algorithm 3, which improves the convergence significantly.
II. SYSTEM MODEL
We consider a downlink massive MIMO system having $N_t$ transmit antennas at the BS and $N_r$ receive antennas at the UE. Assuming a slowly time-varying narrowband multipath channel, the physical channel can be represented by [37]
$$\mathbf{H}_s = \sqrt{\frac{N_r N_t}{N_p}} \sum_{p=1}^{N_p} \alpha_p \boldsymbol{\alpha}_r(\theta_{r,p}) \boldsymbol{\alpha}_t^H(\theta_{t,p}) \qquad (1)$$
where $\mathbf{H}_s \in \mathbb{C}^{N_r \times N_t}$ is the channel matrix, $p$ denotes the multipath index, $N_p$ is the number of paths, $\alpha_p$ is the complex channel gain of the $p$th path, $\boldsymbol{\alpha}_r(\theta_{r,p})$ and $\boldsymbol{\alpha}_t(\theta_{t,p})$ are respectively the array steering vectors of the receiver and transmitter arrays for the $p$th path, and the corresponding angle of arrival (AoA) and angle of departure (AoD) are $\theta_{r,p}$ and $\theta_{t,p}$. For a one-dimensional uniform linear array (ULA) consisting of $N$ elements, the steering vector can be expressed as $\boldsymbol{\alpha}(\theta) = [1, e^{-j\pi\vartheta}, e^{-j2\pi\vartheta}, \ldots, e^{-j(N-1)\pi\vartheta}]^T$, where $\vartheta \in [-1, 1]$ is the normalized spatial angle, which is related to the physical angle $\theta \in [-\pi/2, \pi/2]$ by $\vartheta = \frac{2d}{\lambda}\sin(\theta)$, and where $d = \lambda/2$ is the antenna spacing and $\lambda$ is the wavelength. The array steering vector represents the array phase profile as a function of the physical AoA or AoD.
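As a quick numerical illustration, the ULA steering vector above can be generated as follows (a minimal sketch; the function name and the half-wavelength spacing default are our assumptions, consistent with the text):

```python
import numpy as np

def ula_steering(theta, N, d_over_lambda=0.5):
    """Steering vector [1, e^{-j*pi*v}, ..., e^{-j*(N-1)*pi*v}]^T,
    with normalized spatial angle v = (2d/lambda) * sin(theta)."""
    v = 2.0 * d_over_lambda * np.sin(theta)   # v in [-1, 1] for d = lambda/2
    n = np.arange(N)                          # element indices 0, ..., N-1
    return np.exp(-1j * np.pi * v * n)

a = ula_steering(np.pi / 6, N=8)              # AoA of 30 degrees, 8 elements
print(a.shape, a[0], np.allclose(np.abs(a), 1.0))
```

Every entry has unit modulus, reflecting that the steering vector encodes only the array phase profile.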
The spatial-domain channel matrix $\mathbf{H}_s$ can be transformed into the beamspace channel matrix $\mathbf{H}_a$ via a two-dimensional Fourier transform [17]
$$\mathbf{H}_a = \mathbf{U}_r^H \mathbf{H}_s \mathbf{U}_t \qquad (2)$$
where $\mathbf{U}_t \in \mathbb{C}^{N_t \times N_t}$ and $\mathbf{U}_r \in \mathbb{C}^{N_r \times N_r}$ are unitary discrete Fourier transform (DFT) matrices, which can be expressed using the array steering vectors as $\mathbf{U}_t = \frac{1}{\sqrt{N_t}}[\boldsymbol{\alpha}_t(\theta_{t,0}), \boldsymbol{\alpha}_t(\theta_{t,1}), \ldots, \boldsymbol{\alpha}_t(\theta_{t,N_t-1})]^T$ and $\mathbf{U}_r = \frac{1}{\sqrt{N_r}}[\boldsymbol{\alpha}_r(\theta_{r,0}), \boldsymbol{\alpha}_r(\theta_{r,1}), \ldots, \boldsymbol{\alpha}_r(\theta_{r,N_r-1})]^T$, where $\{\theta_{t,0}, \ldots, \theta_{t,N_t-1}\}$ and $\{\theta_{r,0}, \ldots, \theta_{r,N_r-1}\}$ are the virtual AoDs and AoAs defined by the array elements of the transmitter and receiver. The beamspace channel $\mathbf{H}_a$ is sparse due to the limited number of scattering clusters and the high dimensionality of massive MIMO channels [37]. In practice, owing to the limited resolution of the virtual angular grid $(\theta_{r,i}, \theta_{t,j})$ for $0 \le i \le N_r - 1$ and $0 \le j \le N_t - 1$, mismatches often occur between the predefined virtual angular grid and the real-path AoA and AoD $(\theta_{r,l}, \theta_{t,l})$ for $l \in \{1, \ldots, N_p\}$. This phenomenon is known as power leakage, and it can cause beamspace channels to be not exactly sparse. However, for a massive MIMO system having hundreds of antennas, power leakage can be neglected, since the energy (squared sum) of a few largest-magnitude elements captures most of the channel power $\|\mathbf{H}_a\|_F^2$ [38]. In this case, the beamspace channel is approximately sparse, or compressible.

By transmitting known pilots $\mathbf{P}$ from the BS through the channel, the UE receives the pilot observations
$$\mathbf{R} = \mathbf{H}_s \mathbf{P} + \mathbf{W} \qquad (3)$$
where $\mathbf{R} \in \mathbb{C}^{N_r \times L}$ is the received pilot observation matrix, and where $L$ is the length of the pilots for each antenna; $\mathbf{H}_s \in \mathbb{C}^{N_r \times N_t}$ is the spatial-domain channel matrix; $\mathbf{P} \in \mathbb{C}^{N_t \times L}$ is the transmitted pilot matrix; and $\mathbf{W} \in \mathbb{C}^{N_r \times L}$ is the additive white Gaussian noise (AWGN) matrix whose elements are independent and identically distributed (i.i.d.) complex Gaussian random variables having mean zero and variance $\sigma_n^2$.
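To make the beamspace transformation (2) concrete, the following sketch builds unitary DFT matrices and verifies that the transform preserves the total channel energy (a property of any unitary transform; the explicit matrix construction below is a standard assumption, not taken verbatim from the paper):

```python
import numpy as np

def unitary_dft(N):
    """Unitary N x N DFT matrix, satisfying U^H U = I."""
    n = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(n, n) / N) / np.sqrt(N)

rng = np.random.default_rng(0)
Nr, Nt = 4, 16
Hs = rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))

Ur, Ut = unitary_dft(Nr), unitary_dft(Nt)
Ha = Ur.conj().T @ Hs @ Ut                 # beamspace channel, eq. (2)

# Unitary transforms preserve the Frobenius norm (total channel energy).
print(np.allclose(np.linalg.norm(Ha, 'fro'), np.linalg.norm(Hs, 'fro')))
```

In a physical channel with few scattering clusters, most of this energy concentrates in a few entries of $\mathbf{H}_a$, which is exactly the sparsity the estimators below exploit.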
Conventional linear reconstruction methods such as linear minimum mean square error (LMMSE) or least squares (LS) require $L \ge N_t$ to estimate the channel accurately. When the number of antenna elements $N_t$ is large, this pilot-length requirement implies unbearable spectral occupancy and computational complexity. In contrast, beamspace channel estimation is attractive because the sparse beamspace channel can be accurately reconstructed even when $L \ll N_t$.

From the transformation between the spatial-domain channel $\mathbf{H}_s$ and the beamspace channel $\mathbf{H}_a$ in (2), we have $\mathbf{H}_s = \mathbf{U}_r \mathbf{H}_a \mathbf{U}_t^H$. Thus, (3) can be expressed as
$$\mathbf{R} = \mathbf{U}_r \mathbf{H}_a \mathbf{U}_t^H \mathbf{P} + \mathbf{W}. \qquad (4)$$
By taking conjugate transposes and right-multiplying both sides of (4) by $\mathbf{U}_r$, we have
$$\mathbf{R}^H \mathbf{U}_r = \mathbf{P}^H \mathbf{U}_t \mathbf{H}_a^H \mathbf{U}_r^H \mathbf{U}_r + \mathbf{W}^H \mathbf{U}_r = \mathbf{P}^H \mathbf{U}_t \mathbf{H}_a^H + \mathbf{W}^H \mathbf{U}_r. \qquad (5)$$
By denoting $\mathbf{R}' = \mathbf{R}^H \mathbf{U}_r$, $\mathbf{A} = \mathbf{P}^H \mathbf{U}_t$, $\mathbf{H} = \mathbf{H}_a^H$ and $\mathbf{W}' = \mathbf{W}^H \mathbf{U}_r$, we can simplify (5) as
$$\mathbf{R}' = \mathbf{A}\mathbf{H} + \mathbf{W}' \qquad (6)$$
where $\mathbf{R}' \in \mathbb{C}^{L \times N_r}$, $\mathbf{A} \in \mathbb{C}^{L \times N_t}$, $\mathbf{H} \in \mathbb{C}^{N_t \times N_r}$ and $\mathbf{W}' \in \mathbb{C}^{L \times N_r}$. For the $i$th ($1 \le i \le N_r$) columns of $\mathbf{R}'$, $\mathbf{H}$ and $\mathbf{W}'$ in (6), we have
$$\mathbf{r}'_i = \mathbf{A}\mathbf{h}_i + \mathbf{w}'_i. \qquad (7)$$
We adopt a real-form matrix $\mathbf{A}$ by forcing the imaginary part of $\mathbf{A}$ to be zero, i.e., $\Im(\mathbf{A}) = \mathbf{0}$ and $\mathbf{A} = \Re(\mathbf{A})$, where $\Re(\cdot)$ and $\Im(\cdot)$ respectively denote the real and imaginary parts of a complex quantity. Thus, the computational complexity can be reduced to half that of the original complex-valued matrix-vector multiplication $\mathbf{A}\mathbf{h}_i$. We stack the real part and the imaginary part of the complex channel vector $\mathbf{h}_i$, and express (7) as
$$\begin{bmatrix} \Re(\mathbf{r}'_i) \\ \Im(\mathbf{r}'_i) \end{bmatrix} = \begin{bmatrix} \Re(\mathbf{A}) & \mathbf{0} \\ \mathbf{0} & \Re(\mathbf{A}) \end{bmatrix} \begin{bmatrix} \Re(\mathbf{h}_i) \\ \Im(\mathbf{h}_i) \end{bmatrix} + \begin{bmatrix} \Re(\mathbf{w}'_i) \\ \Im(\mathbf{w}'_i) \end{bmatrix}. \qquad (8)$$
We write the stacked channel vector as $\mathbf{x} = [\Re(\mathbf{h}_i)^T, \Im(\mathbf{h}_i)^T]^T$, and express (8) as
$$\mathbf{y} = \mathbf{\Phi}\mathbf{x} + \mathbf{n} \qquad (9)$$
where $\mathbf{y} = [\Re(\mathbf{r}'_i)^T, \Im(\mathbf{r}'_i)^T]^T$, $\mathbf{\Phi} = \begin{bmatrix} \Re(\mathbf{A}) & \mathbf{0} \\ \mathbf{0} & \Re(\mathbf{A}) \end{bmatrix}$, and $\mathbf{n} = [\Re(\mathbf{w}'_i)^T, \Im(\mathbf{w}'_i)^T]^T$. In the remainder of this paper, we uniquely refer to the real-form vector $\mathbf{x}$ as the sparse channel vector. The goal of beamspace channel estimation is to reconstruct the sparse channel vector $\hat{\mathbf{x}}$, given a known measurement matrix $\mathbf{\Phi}$ and measurements $\mathbf{y}$, such that $\hat{\mathbf{x}} \approx \mathbf{x}$, by solving the underdetermined linear system (9).
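The real-valued stacking in (8)-(9) can be sanity-checked numerically; this is a minimal sketch assuming a real measurement matrix $\mathbf{A}$, as stated above:

```python
import numpy as np

rng = np.random.default_rng(1)
L_len, Nt = 8, 32
A = rng.standard_normal((L_len, Nt))      # real-form A, i.e., Im(A) = 0
h = rng.standard_normal(Nt) + 1j * rng.standard_normal(Nt)   # complex channel

r = A @ h                                 # complex model (7), noiseless

# Stacked real form (8)-(9): y = Phi x with block-diagonal Phi.
Phi = np.block([[A, np.zeros((L_len, Nt))],
                [np.zeros((L_len, Nt)), A]])
x = np.concatenate([h.real, h.imag])
y = Phi @ x

print(np.allclose(y, np.concatenate([r.real, r.imag])))
```

Because $\mathbf{A}$ is real, the real and imaginary parts of $\mathbf{A}\mathbf{h}_i$ decouple, which is why the stacked system uses a block-diagonal $\mathbf{\Phi}$ rather than the full $2\times 2$ real/imaginary block structure of a general complex matrix.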
III. DC REPRESENTATION FOR SPARSE CONSTRAINT
In this section, we first briefly review the conventional $\ell_1$-norm regularized least squares for solving sparse reconstructions. Then, we introduce the top-$(K,1)$ norm [35] to represent the $\ell_0$-norm constraint exactly in a sparse reconstruction problem. We also provide a threshold to determine the penalty parameter.

According to (9), reconstructing $\mathbf{x}$ from $\mathbf{y}$ and $\mathbf{\Phi}$ using sparsity as a prior is an NP-hard $\ell_0$-minimization problem, which is defined as [39]
$$\min_{\mathbf{x}} \|\mathbf{x}\|_0 \quad \text{s.t.} \quad \|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2 \le \tau \qquad (10)$$
where $\tau$ is a nonnegative real parameter. Problem (10) can be rewritten in an equivalent form as [39]
$$\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2 \quad \text{s.t.} \quad \|\mathbf{x}\|_0 \le K \qquad (11)$$
where $K$ is the bound on the number of nonzero elements of the vector $\mathbf{x}$, and it is uniquely determined by the parameter $\tau$ in problem (10).

A more common method to solve problem (11) is the $\ell_1$-relaxation, which replaces the nonconvex $\ell_0$-norm $\|\mathbf{x}\|_0$ by the convex $\ell_1$-norm $\|\mathbf{x}\|_1$:
$$\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2 \quad \text{s.t.} \quad \|\mathbf{x}\|_1 \le t \qquad (12)$$
where $t$ is a nonnegative real parameter that controls the sparsity level of the reconstructed vector. Using an appropriate multiplier $\lambda \ge 0$, problem (12) can be rewritten in an equivalent form as
$$\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2 + \lambda \|\mathbf{x}\|_1 \qquad (13)$$
where $\lambda \ge 0$ is a penalty parameter that balances data fidelity and regularization. The penalty parameter $\lambda$ is difficult to tune, since the $\ell_1$-regularizer $\lambda\|\mathbf{x}\|_1$ forces equal penalties on each element of $\mathbf{x}$. A larger $\lambda$ forces a sparser solution $\hat{\mathbf{x}}$ but can increase the residual error $\|\mathbf{y} - \mathbf{\Phi}\hat{\mathbf{x}}\|_2^2$, whereas a small $\lambda$ leads to a solution that is insufficiently sparse. Cross-validation is a common approach to find a proper value of $\lambda$, but it is computationally intensive for large-scale problems.
Sometimes, it is impractical to select a proper parameter by cross-validation owing to a large number of distinct instances.

In contrast to the $\ell_1$-relaxation, we introduce the top-$(K,1)$ norm so that the constraint $\|\mathbf{x}\|_0 \le K$ in the original problem (11) can be expressed exactly. The top-$(K,1)$ norm $\|\mathbf{x}\|_{K,1}$ is defined as the sum of the largest $K$ elements of the vector $\mathbf{x}$ in terms of absolute value, namely
$$\|\mathbf{x}\|_{K,1} := |x_{(1)}| + |x_{(2)}| + \cdots + |x_{(K)}| \qquad (14)$$
where $|x_{(i)}|$ denotes the element whose absolute value is the $i$th largest among the $N$ elements of the vector $\mathbf{x}$, i.e., $|x_{(1)}| \ge |x_{(2)}| \ge \cdots \ge |x_{(N)}|$. The constraint $\|\mathbf{x}\|_0 \le K$ is equivalent to the statement that the $(K+1)$th largest element of the vector $\mathbf{x}$ is zero, i.e., $\|\mathbf{x}\|_{K+1,1} - \|\mathbf{x}\|_{K,1} = 0$. Thus, we have an equivalent relationship between the following two statements [35]:
$$\|\mathbf{x}\|_0 \le K \;\Leftrightarrow\; \|\mathbf{x}\|_1 - \|\mathbf{x}\|_{K,1} = 0. \qquad (15)$$
Since both $\|\mathbf{x}\|_1$ and $\|\mathbf{x}\|_{K,1}$ are convex, we say the equality $\|\mathbf{x}\|_1 - \|\mathbf{x}\|_{K,1} = 0$ is an exact DC representation of the sparsity constraint. By replacing the sparsity constraint $\|\mathbf{x}\|_0 \le K$ in (11) with the exact DC constraint $\|\mathbf{x}\|_1 - \|\mathbf{x}\|_{K,1} = 0$, the sparse reconstruction problem can be rewritten as
$$\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2 \quad \text{s.t.} \quad \|\mathbf{x}\|_1 - \|\mathbf{x}\|_{K,1} = 0. \qquad (16)$$
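A minimal numerical sketch of the top-$(K,1)$ norm (14) and the equivalence (15) (the function name is ours, not from the paper):

```python
import numpy as np

def top_k1_norm(x, K):
    """Sum of the K largest-magnitude entries of x, i.e., ||x||_{K,1} in (14)."""
    return np.sort(np.abs(x))[::-1][:K].sum()

x = np.array([3.0, -1.0, 0.0, 2.0])            # ||x||_0 = 3 nonzero entries
print(top_k1_norm(x, 2))                        # |3| + |2| = 5.0
# Equivalence (15): ||x||_1 - ||x||_{K,1} = 0 iff ||x||_0 <= K.
print(np.abs(x).sum() - top_k1_norm(x, 3))      # 0.0, since ||x||_0 = 3 <= 3
print(np.abs(x).sum() - top_k1_norm(x, 2))      # 1.0 > 0, since ||x||_0 = 3 > 2
```

The gap $\|\mathbf{x}\|_1 - \|\mathbf{x}\|_{K,1}$ is simply the total magnitude of everything outside the $K$ largest entries, which is why driving it to zero enforces $K$-sparsity exactly.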
Using an appropriate Lagrange multiplier $\rho$, we obtain the following unconstrained optimization problem from problem (16):
$$\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2 + \rho \left(\|\mathbf{x}\|_1 - \|\mathbf{x}\|_{K,1}\right) := F(\mathbf{x}) \qquad (17)$$
where $\rho$ is the regularization parameter that balances the data consistency and the penalty term. Our formulated optimization problem (17) differs from the conventional $\ell_1$-regularized sparse reconstruction (13) only in the subtracted top-$(K,1)$ norm $\|\mathbf{x}\|_{K,1}$ in the penalty term. The regularizer $\rho(\|\mathbf{x}\|_1 - \|\mathbf{x}\|_{K,1})$ is better than an $\ell_1$-norm regularizer because it removes the penalties on the $K$ largest-magnitude elements. (In practice, if the sparsity level of the vector to be reconstructed is already known, we set $K$ to the number of nonzero elements of the reconstruction vector; if the sparsity level is unknown, we can use cross-validation to determine $K$.) To ensure the equivalence between the unconstrained problem (17) and the constrained problem (16), we provide a threshold $\rho^*$ for the penalty parameter selection in Theorem 1.

Theorem 1:
Let $\mathbf{x}_{\rho^*}$ be an optimal solution of (17) for a given $\rho^*$. Suppose there exists a constant $q > 0$ such that $\|\mathbf{x}_{\rho^*}\| \le q$ for any $\rho^* > 0$. Then $\mathbf{x}_{\rho^*}$ is also optimal to (16) if
$$\rho^* \ge \max_i \left\{ q\left(\|\mathbf{\Phi}^T\mathbf{\Phi}\mathbf{e}_i\| + |(\mathbf{\Phi}^T\mathbf{\Phi})_{ii}|/2\right) + |(\mathbf{\Phi}^T\mathbf{y})_i| \right\}$$
where $1 \le i \le N_t$; $\mathbf{e}_i$ represents the unit vector whose $i$th element is one while the other elements are zeros; $(\mathbf{\Phi}^T\mathbf{\Phi})_{ii}$ represents the $i$th diagonal element of the matrix $\mathbf{\Phi}^T\mathbf{\Phi}$; and $(\mathbf{\Phi}^T\mathbf{y})_i$ denotes the $i$th element of the vector $\mathbf{\Phi}^T\mathbf{y}$.

Proof:
See Appendix A.
IV. DOUBLE-LOOP DC GRADIENT PROJECTION DESCENT FOR SPARSE RECONSTRUCTION
We have formulated the sparse reconstruction as the nonconvex optimization problem (17). In this section, we use DC programming and gradient projection descent to solve problem (17), and we propose a double-loop DC-GPSR algorithm.
A. DC Programming Framework
DC programming is an iterative algorithmic framework that can ensure global convergence [22]. Consider a nonconvex unconstrained optimization problem
$$\min_{\mathbf{x}} f(\mathbf{x}) - g(\mathbf{x}) \qquad (18)$$
where $f(\mathbf{x})$ and $g(\mathbf{x})$ are two convex functions. The DC programming method solves the following convex subproblem at the $t$th iteration:
$$\min_{\mathbf{x}} f(\mathbf{x}) - \mathbf{x}^T \partial g(\mathbf{x}^{t-1}) \qquad (19)$$
where the second convex function $g(\mathbf{x})$ in (18) is linearized by $\mathbf{x}^T \partial g(\mathbf{x}^{t-1})$ in (19), and where $\partial g(\mathbf{x}^{t-1})$ represents the gradient (or a subgradient) of $g$ at $\mathbf{x}^{t-1}$. The DC algorithm framework can be outlined as follows:

1. Start: Given a starting point $\mathbf{x}^0$ and a small threshold parameter $\epsilon > 0$.
2. Repeat: For $t = 1, 2, \ldots$: compute the gradient (or a subgradient) $\partial g(\mathbf{x}^{t-1})$; solve the convex subproblem (19) to obtain $\mathbf{x}^t$.
3. End: Until a termination condition is satisfied.

B. A Double-Loop DC-GPSR Algorithm
Following the DC programming framework, we decompose our objective function in (17) as the difference of two convex functions $f(\mathbf{x})$ and $g(\mathbf{x})$:
$$\min_{\mathbf{x}} \underbrace{\frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2 + \rho\|\mathbf{x}\|_1}_{f(\mathbf{x})} - \underbrace{\rho\|\mathbf{x}\|_{K,1}}_{g(\mathbf{x})}. \qquad (20)$$
At the $t$th iteration, we solve the following convex subproblem:
$$\min_{\mathbf{x}} f(\mathbf{x}) - \rho\, \mathbf{x}^T \partial\|\mathbf{x}^{t-1}\|_{K,1} \qquad (21)$$
where $\partial\|\mathbf{x}^{t-1}\|_{K,1}$ denotes a subgradient of the top-$(K,1)$ norm at $\mathbf{x}^{t-1}$, and where the superscript $t-1$ indicates the $(t-1)$th update. The subgradient of the top-$(K,1)$ norm at $\mathbf{x}$ is defined as [35]
$$\partial\|\mathbf{x}\|_{K,1} := \arg\max_{\mathbf{w}} \left\{ \sum_{i=1}^N x_i w_i \,\Big|\, \sum_{i=1}^N |w_i| = K,\; w_i \in [-1, 1] \right\}. \qquad (22)$$
By substituting $f(\mathbf{x})$ and using $\mathbf{w}_x^{t-1}$ to denote a feasible value of the subgradient $\partial\|\mathbf{x}^{t-1}\|_{K,1}$, we write the subproblem (21) as
$$\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2 + \rho\|\mathbf{x}\|_1 - \rho\, \mathbf{x}^T \mathbf{w}_x^{t-1} \qquad (23)$$
where $\mathbf{w}_x^{t-1} \in \partial\|\mathbf{x}^{t-1}\|_{K,1}$. A feasible subgradient $\mathbf{w}_x^{t-1}$ can be obtained simply by setting the entries of $\mathbf{w}_x^{t-1}$ corresponding to the $K$ largest elements of $|\mathbf{x}^{t-1}|$ to the signs of those elements, i.e., $(w_x^{t-1})_{(i)} = \mathrm{sign}(x^{t-1}_{(i)})$ for $1 \le i \le K$, and setting the other elements of $\mathbf{w}_x^{t-1}$ to zero, where the subscript $(i)$ indicates the element whose magnitude is the $i$th largest.

We have obtained the convex subproblem (23). We will turn (23) into a constrained quadratic program so that we can solve it using the gradient projection descent method. Specifically, we split $\mathbf{x}$ into its positive and negative parts, and represent $\mathbf{x}$ as the difference of its positive part $\mathbf{u}$ and its negative part $\mathbf{v}$, that is,
$$\mathbf{x} = \mathbf{u} - \mathbf{v}, \quad \mathbf{u} \ge \mathbf{0}, \quad \mathbf{v} \ge \mathbf{0} \qquad (24)$$
where $\mathbf{u} = (\mathbf{x})_+$ and $\mathbf{v} = (-\mathbf{x})_+$, and where $(\cdot)_+$ is the positive-taking operation that retains the positive elements and sets the other elements to zero.
More precisely, $(\mathbf{x})_+$ takes $\max\{0, x\}$ for each element $x$ of $\mathbf{x}$, and $(-\mathbf{x})_+$ takes $\max\{0, -x\}$ for each element $-x$ of $-\mathbf{x}$. Noticing that $\|\mathbf{x}\|_1 = \mathbf{1}^T\mathbf{u} + \mathbf{1}^T\mathbf{v}$, the subproblem (23) can be written as a bound-constrained quadratic program (BCQP):
$$\min_{\mathbf{u}, \mathbf{v}} \frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}(\mathbf{u} - \mathbf{v})\|_2^2 + \rho\mathbf{1}^T\mathbf{u} + \rho\mathbf{1}^T\mathbf{v} - \rho\,\mathbf{u}^T\mathbf{w}_u^{t-1} - \rho\,\mathbf{v}^T\mathbf{w}_v^{t-1} \quad \text{s.t.}\; \mathbf{u} \ge \mathbf{0},\; \mathbf{v} \ge \mathbf{0} \qquad (25)$$
where $\mathbf{w}_u^{t-1}$ and $\mathbf{w}_v^{t-1}$ respectively represent the positive and negative parts of $\mathbf{w}_x^{t-1}$, i.e., $\mathbf{w}_u^{t-1} = (\mathbf{w}_x^{t-1})_+$ and $\mathbf{w}_v^{t-1} = (-\mathbf{w}_x^{t-1})_+$. Letting $\mathbf{z}$ denote the concatenation of $\mathbf{u}$ and $\mathbf{v}$, i.e., $\mathbf{z} = [\mathbf{u}^T, \mathbf{v}^T]^T$, we rewrite (25) in the compact form
$$\min_{\mathbf{z}} \frac{1}{2}\mathbf{z}^T\mathbf{B}\mathbf{z} + \mathbf{c}^T\mathbf{z} := G(\mathbf{z}), \quad \text{s.t.}\; \mathbf{z} \ge \mathbf{0} \qquad (26)$$
where
$$\mathbf{z} = \begin{bmatrix}\mathbf{u}\\ \mathbf{v}\end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \mathbf{\Phi}^T\mathbf{\Phi} & -\mathbf{\Phi}^T\mathbf{\Phi} \\ -\mathbf{\Phi}^T\mathbf{\Phi} & \mathbf{\Phi}^T\mathbf{\Phi} \end{bmatrix}, \quad \mathbf{c} = \begin{bmatrix} -\mathbf{\Phi}^T\mathbf{y} \\ \mathbf{\Phi}^T\mathbf{y} \end{bmatrix} + \rho\mathbf{1} - \rho\,\mathbf{w}_z^{t-1}$$
where $\mathbf{1}$ represents an all-ones column vector having the same dimension as $\mathbf{z}$, and $\mathbf{w}_z^{t-1} = [(\mathbf{w}_u^{t-1})^T, (\mathbf{w}_v^{t-1})^T]^T$. Note that $\mathbf{w}_z^{t-1}$ is a subgradient of $\|\mathbf{z}^{t-1}\|_{K,1}$, i.e., $\mathbf{w}_z^{t-1} \in \partial\|\mathbf{z}^{t-1}\|_{K,1}$. Since $\mathbf{z}^{t-1} \ge \mathbf{0}$, a feasible subgradient $\mathbf{w}_z^{t-1}$ can be an indicator vector having either one-valued or zero-valued elements, where the indices of the one-valued elements of $\mathbf{w}_z^{t-1}$ correspond to the indices of the $K$ largest elements of $\mathbf{z}^{t-1}$. Problem (26) is equivalent to the subproblem (23).
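A minimal sketch of picking the feasible subgradients described above (the helper names are ours, not from the paper):

```python
import numpy as np

def subgrad_topk_x(x, K):
    """Feasible w_x in the subgradient of ||x||_{K,1}: signs on the K
    largest-magnitude entries of x, zeros elsewhere."""
    w = np.zeros_like(x)
    idx = np.argsort(-np.abs(x))[:K]
    w[idx] = np.sign(x[idx])
    return w

def subgrad_topk_z(z, K):
    """Feasible w_z for nonnegative z: indicator of the K largest entries."""
    w = np.zeros_like(z)
    w[np.argsort(-z)[:K]] = 1.0
    return w

x = np.array([3.0, -1.0, 0.0, 2.0])
w = subgrad_topk_x(x, 2)
print(w)            # [1. 0. 0. 1.]
print(x @ w)        # 5.0, which equals ||x||_{K,1} for K = 2
```

The inner product $\mathbf{x}^T\mathbf{w}$ attains $\|\mathbf{x}\|_{K,1}$, confirming that $\mathbf{w}$ achieves the maximum in (22).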
We apply the gradient projection descent method to solve (26); the $k$th update is
$$\mathbf{z}^{(k+\frac{1}{2})} = \mathrm{Proj}\big(\mathbf{z}^{(k)} - \alpha_k \nabla G(\mathbf{z}^{(k)})\big), \qquad \mathbf{z}^{(k+1)} = \mathbf{z}^{(k)} + \beta_k \big(\mathbf{z}^{(k+\frac{1}{2})} - \mathbf{z}^{(k)}\big) \qquad (27)$$
where $\alpha_k > 0$ is a step size, which can be determined by the BB rule as $\alpha_k = \frac{\|\mathbf{z}^{(k)} - \mathbf{z}^{(k-1)}\|_2^2}{(\mathbf{z}^{(k)} - \mathbf{z}^{(k-1)})^T\left(\nabla G(\mathbf{z}^{(k)}) - \nabla G(\mathbf{z}^{(k-1)})\right)}$; $\beta_k \in (0, 1]$ is another step size that ensures the monotonic decrease of the objective, and it can be calculated in closed form as $\beta_k = -\frac{(\boldsymbol{\delta}^{(k)})^T \nabla G(\mathbf{z}^{(k)})}{(\boldsymbol{\delta}^{(k)})^T \mathbf{B} \boldsymbol{\delta}^{(k)}}$ (clipped to $(0,1]$), where $\boldsymbol{\delta}^{(k)} = \mathbf{z}^{(k+\frac{1}{2})} - \mathbf{z}^{(k)}$ [25]; $\mathrm{Proj}(\cdot)$ represents the orthogonal projection onto the nonnegative orthant; and $\nabla G(\mathbf{z}^{(k)})$ represents the gradient of $G(\mathbf{z})$ at $\mathbf{z}^{(k)}$. We have $\nabla G(\mathbf{z}) = \mathbf{B}\mathbf{z} + \mathbf{c}$, which can be calculated as
$$\nabla G(\mathbf{z}^{(k)}) = \begin{bmatrix} \mathbf{\Phi}^T\mathbf{\Phi}(\mathbf{u}^{(k)} - \mathbf{v}^{(k)}) \\ -\mathbf{\Phi}^T\mathbf{\Phi}(\mathbf{u}^{(k)} - \mathbf{v}^{(k)}) \end{bmatrix} + \begin{bmatrix} -\mathbf{\Phi}^T\mathbf{y} \\ \mathbf{\Phi}^T\mathbf{y} \end{bmatrix} - \rho\,\mathbf{w}_z^{t-1} + \rho\mathbf{1}. \qquad (28)$$
In a nutshell, the proposed algorithm computes the following two steps iteratively until convergence:
$$(a)\;\; \mathbf{w}_z^{t-1} \in \partial\|\mathbf{z}^{t-1}\|_{K,1}, \qquad (b)\;\; \mathbf{z}^t = \arg\min_{\mathbf{z} \ge \mathbf{0}} \Big\{ \frac{1}{2}\mathbf{z}^T\mathbf{B}\mathbf{z} + \mathbf{c}^T\mathbf{z} \Big\} \qquad (29)$$
where $\mathbf{z}$, $\mathbf{B}$ and $\mathbf{c}$ are defined in (26). The subproblem (b) of (29) is solved by applying the gradient projection descent updates (27). We summarize this double-loop DC-GPSR algorithm in Algorithm 1.

Algorithm 1
Double-loop DC-GPSR (DlDC-GPSR)
Input: measurements $\mathbf{y}$, measurement matrix $\mathbf{\Phi}$, and a small number $\epsilon$
Output: reconstructed $\hat{\mathbf{x}}$
Initialization: $\mathbf{u}^0$, $\mathbf{v}^0$, $\mathbf{z}^0 \leftarrow [(\mathbf{u}^0)^T, (\mathbf{v}^0)^T]^T$
1: for $t = 1, 2, \ldots$ do
2:   Compute a subgradient $\mathbf{w}_z^{t-1} \in \partial\|\mathbf{z}^{t-1}\|_{K,1}$
3:   for $k = 1, 2, \ldots$ do
4:     Compute the gradient $\nabla G(\mathbf{z}^{(k)})$ by (28)
5:     Perform the gradient projection descent update (27) to obtain $\mathbf{z}^{(k+1)}$
6:     Check convergence; set $\mathbf{z}^* \leftarrow \mathbf{z}^{(k+1)}$ and proceed to step 7 if convergence is satisfied, otherwise return to step 3
7:   end for
8:   $\mathbf{z}^t \leftarrow \mathbf{z}^*$
9:   Check the termination condition $\|\mathbf{z}^t - \mathbf{z}^{t-1}\|_2 \le \epsilon$ and return to step 1 if not satisfied; otherwise terminate with the solution $\mathbf{z}^t = [(\mathbf{u}^t)^T, (\mathbf{v}^t)^T]^T$ and obtain the reconstruction $\hat{\mathbf{x}} = \mathbf{u}^t - \mathbf{v}^t$
10: end for

(Here $\mathbb{R}^n_+ = \{\mathbf{x} = (x_1, x_2, \ldots, x_n) \mid x_1 \ge 0, x_2 \ge 0, \ldots, x_n \ge 0\}$ denotes the nonnegative orthant of $\mathbb{R}^n$.)
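The inner loop of Algorithm 1, i.e., the gradient projection updates (27) for a BCQP of the form (26), can be sketched on a tiny quadratic program whose solution is known from the KKT conditions (this is an illustrative sketch with a hand-built toy instance, not the authors' code):

```python
import numpy as np

# Tiny BCQP: minimize G(z) = 0.5 z^T B z + c^T z  s.t. z >= 0.
# The KKT conditions give the optimum z* = [1, 0] for this instance.
B = np.diag([2.0, 2.0])
c = np.array([-2.0, 1.0])
grad = lambda z: B @ z + c

z = np.zeros(2)
z_prev, g_prev = None, None
for k in range(50):
    g = grad(z)
    if z_prev is None:
        alpha = 0.1                                   # fallback before BB is available
    else:
        s, yv = z - z_prev, g - g_prev
        alpha = (s @ s) / (s @ yv) if s @ yv > 0 else 0.1   # BB step size
    z_half = np.maximum(z - alpha * g, 0.0)           # projected step, Proj = (.)_+
    delta = z_half - z
    denom = delta @ B @ delta
    beta = min(1.0, max(0.0, -(delta @ g) / denom)) if denom > 0 else 1.0
    z_prev, g_prev = z, g
    z = z + beta * delta                               # monotone safeguard, eq. (27)

print(np.round(z, 6))
```

On this instance the iterates reach the vertex solution in a few steps; the safeguard $\beta_k$ keeps the quadratic objective nonincreasing even when the raw BB step overshoots.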
V. SINGLE-LOOP DC GRADIENT PROJECTION DESCENT FOR SPARSE RECONSTRUCTION
In the proposed DlDC-GPSR algorithm, at each iteration we solve a nonsmooth convex subproblem (23) using another iterative algorithm, i.e., the gradient projection descent updates. This double-loop computation has high computational complexity. In this section, we derive a closed-form solution for the convex subproblem by performing a special DC decomposition that eliminates the inner iterations. Thus, we propose a basic single-loop DC-GPSR algorithm. Interestingly, we show that the proposed basic single-loop DC-GPSR algorithm can be interpreted as simple gradient projection descent updates. Furthermore, we accelerate the single-loop DC-GPSR algorithm using the BB step size and propose an extension of the single-loop DC-GPSR algorithm.
A. A Basic Single-Loop DC-GPSR Algorithm
We first rewrite the least-squares objective of problem (17) in an equivalent form as
$$\frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2 = \frac{l}{2}\|\mathbf{x}\|_2^2 - \Big(\frac{l}{2}\|\mathbf{x}\|_2^2 - \frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2\Big) \qquad (30)$$
where $l > 0$ is a Lipschitz constant of the gradient of the least-squares objective. By substituting (30) into problem (17), the unconstrained sparse reconstruction problem (17) becomes
$$\min_{\mathbf{x}} \frac{l}{2}\|\mathbf{x}\|_2^2 - \Big(\frac{l}{2}\|\mathbf{x}\|_2^2 - \frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2\Big) + \rho\left(\|\mathbf{x}\|_1 - \|\mathbf{x}\|_{K,1}\right) := F(\mathbf{x}). \qquad (31)$$
Then, we perform the following DC decomposition on the objective in problem (31):
$$\min_{\mathbf{x}} \underbrace{\frac{l}{2}\|\mathbf{x}\|_2^2 + \rho\|\mathbf{x}\|_1}_{f(\mathbf{x})} - \underbrace{\Big(\frac{l}{2}\|\mathbf{x}\|_2^2 - \frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2 + \rho\|\mathbf{x}\|_{K,1}\Big)}_{g(\mathbf{x})} := F(\mathbf{x}) \qquad (32)$$
where $f(\mathbf{x})$ and $g(\mathbf{x})$ are two convex functions. The convexity of $g(\mathbf{x})$ can be ensured by confirming the convexity of $\frac{l}{2}\|\mathbf{x}\|_2^2 - \frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2$, which is established in Theorem 2.

Theorem 2:
The least-squares objective $\frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2$ is smooth and its gradient is Lipschitz continuous with Lipschitz constant $l = \lambda_{\max}(\mathbf{\Phi}^T\mathbf{\Phi})$, where $\lambda_{\max}(\cdot)$ denotes the maximum eigenvalue of a matrix. Thus, the function $h(\mathbf{x}) = \frac{l}{2}\|\mathbf{x}\|_2^2 - \frac{1}{2}\|\mathbf{y} - \mathbf{\Phi}\mathbf{x}\|_2^2$ with $l = \lambda_{\max}(\mathbf{\Phi}^T\mathbf{\Phi})$ is convex.

Proof:
See Appendix B.

By writing the squared-$\ell_2$ terms in (32) in standard quadratic forms, we have
$$\min_{\mathbf{x}} \underbrace{\frac{l}{2}\mathbf{x}^T\mathbf{x} + \rho\|\mathbf{x}\|_1}_{f(\mathbf{x})} - \underbrace{\Big(\frac{l}{2}\mathbf{x}^T\mathbf{x} - \frac{1}{2}\mathbf{x}^T\mathbf{\Phi}^T\mathbf{\Phi}\mathbf{x} + (\mathbf{\Phi}^T\mathbf{y})^T\mathbf{x} + \rho\|\mathbf{x}\|_{K,1}\Big)}_{g(\mathbf{x})} := F(\mathbf{x}). \qquad (33)$$
We split $\mathbf{x}$ into its positive and negative parts by letting $\mathbf{u} = (\mathbf{x})_+$ and $\mathbf{v} = (-\mathbf{x})_+$. Denoting $\mathbf{z} = [\mathbf{u}^T, \mathbf{v}^T]^T$, we express (33) as
$$\min_{\mathbf{z}} \underbrace{\frac{l}{2}\mathbf{z}^T\mathbf{z} + \rho\mathbf{1}^T\mathbf{z}}_{f(\mathbf{z})} - \underbrace{\Big(\frac{l}{2}\mathbf{z}^T\mathbf{z} - \frac{1}{2}\mathbf{z}^T\mathbf{B}\mathbf{z} + \mathbf{q}^T\mathbf{z} + \rho\mathbf{1}_K^T\mathbf{z}\Big)}_{g(\mathbf{z})} := F(\mathbf{z}) \quad \text{s.t.}\; \mathbf{z} \ge \mathbf{0} \qquad (34)$$
where
$$\mathbf{z} = \begin{bmatrix}\mathbf{u}\\ \mathbf{v}\end{bmatrix}, \quad \mathbf{B} = \begin{bmatrix} \mathbf{\Phi}^T\mathbf{\Phi} & -\mathbf{\Phi}^T\mathbf{\Phi} \\ -\mathbf{\Phi}^T\mathbf{\Phi} & \mathbf{\Phi}^T\mathbf{\Phi} \end{bmatrix}, \quad \mathbf{q} = \begin{bmatrix} \mathbf{\Phi}^T\mathbf{y} \\ -\mathbf{\Phi}^T\mathbf{y} \end{bmatrix}$$
and $\mathbf{1}_K$ is the indicator vector having either one-valued or zero-valued elements, where the one-valued elements of $\mathbf{1}_K$ indicate the $K$ largest elements of $\mathbf{z}$. Following the DC algorithm framework to solve problem (34), we perform the following two steps repeatedly until convergence:
$$(a)\;\; \partial g(\mathbf{z}^{t-1}) = l\,\mathbf{z}^{t-1} - \mathbf{B}\mathbf{z}^{t-1} + \mathbf{q} + \rho\mathbf{1}_K^{t-1}, \qquad (b)\;\; \mathbf{z}^t = \arg\min_{\mathbf{z} \ge \mathbf{0}} \Big\{ \frac{l}{2}\mathbf{z}^T\mathbf{z} + \rho\mathbf{1}^T\mathbf{z} - \mathbf{z}^T \partial g(\mathbf{z}^{t-1}) \Big\} \qquad (35)$$
where the superscripts $t-1$ and $t$ respectively indicate the $(t-1)$th and $t$th updates. The subproblem (b) of (35) has a closed-form optimal solution. To derive it, we rewrite the subproblem (b) of (35) as
$$\min_{\mathbf{z}} \frac{l}{2}\mathbf{z}^T\mathbf{z} + \rho\mathbf{1}^T\mathbf{z} - \mathbf{z}^T \partial g(\mathbf{z}^{t-1}) \quad \text{s.t.}\; \mathbf{z} \ge \mathbf{0} \qquad (36)$$
where $\partial g(\mathbf{z}^{t-1}) = l\,\mathbf{z}^{t-1} - \mathbf{B}\mathbf{z}^{t-1} + \mathbf{q} + \rho\mathbf{1}_K^{t-1}$. Problem (36) can be expressed as
$$\min_{\mathbf{z}} \frac{l}{2}\Big\|\mathbf{z} - \frac{1}{l}\big(\partial g(\mathbf{z}^{t-1}) - \rho\mathbf{1}\big)\Big\|_2^2 \quad \text{s.t.}\; \mathbf{z} \ge \mathbf{0}. \qquad (37)$$
Problem (37) minimizes $\|\mathbf{z} - \frac{1}{l}(\partial g(\mathbf{z}^{t-1}) - \rho\mathbf{1})\|_2^2$ over $\mathbf{z} \ge \mathbf{0}$, and its solution is simply the Euclidean projection of $\frac{1}{l}(\partial g(\mathbf{z}^{t-1}) - \rho\mathbf{1})$ onto the nonnegative orthant.
Thus, the gradient projection descent update has the closed-form optimal solution
$$\mathbf{z}^* = \mathrm{Proj}\Big(\frac{1}{l}\big(\partial g(\mathbf{z}^{t-1}) - \rho\mathbf{1}\big)\Big) = \Big(\frac{1}{l}\big(\partial g(\mathbf{z}^{t-1}) - \rho\mathbf{1}\big)\Big)_+ \qquad (38)$$
where $\mathrm{Proj}(\cdot)$ represents the Euclidean projection onto the feasible set $\mathbf{z} \ge \mathbf{0}$, which is simply the positive-taking operation denoted by $(\cdot)_+$. Using the closed-form solution (38) for the subproblem (b) of (35), the DC programming procedure (35) simplifies to performing the following two single-step computations repeatedly until convergence:
$$(a)\;\; \partial g(\mathbf{z}^{t-1}) = l\,\mathbf{z}^{t-1} - \mathbf{B}\mathbf{z}^{t-1} + \mathbf{q} + \rho\mathbf{1}_K^{t-1}, \qquad (b)\;\; \mathbf{z}^t = \Big(\frac{1}{l}\big(\partial g(\mathbf{z}^{t-1}) - \rho\mathbf{1}\big)\Big)_+. \qquad (39)$$
We summarize the updates (39) as a single-loop DC gradient projection algorithm in Algorithm 2. By avoiding the inner-loop iterations, the per-iteration computational complexity of Algorithm 2 is largely reduced compared with Algorithm 1. However, we note that if the parameter $l$ is large, Algorithm 2 can be slow to converge. Therefore, we will further propose an accelerated algorithm in the following section.
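As a concrete sketch of the single-loop updates (39) on a toy compressed-sensing instance (an illustrative implementation with assumed problem sizes and a hand-picked $\rho$, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 15, 20, 3
Phi = rng.standard_normal((M, N))
x_true = np.zeros(N)
x_true[rng.choice(N, K, replace=False)] = rng.standard_normal(K) * 3.0
y = Phi @ x_true                                  # noiseless measurements

l = np.linalg.eigvalsh(Phi.T @ Phi).max()         # Lipschitz constant lambda_max
rho = 0.5
G = Phi.T @ Phi
q_top = Phi.T @ y

z = np.zeros(2 * N)                               # z = [u; v], so x = u - v
for t in range(3000):
    x_hat = z[:N] - z[N:]
    g = G @ x_hat - q_top                         # gradient of 0.5 ||y - Phi x||^2
    Bz_minus_q = np.concatenate([g, -g])          # B z - q for the stacked variable
    ones_K = np.zeros(2 * N)
    ones_K[np.argsort(-z)[:K]] = 1.0              # indicator of the K largest of z
    # update (39): z <- (z - (1/l)(B z - q - rho*1_K + rho*1))_+
    z = np.maximum(z - (Bz_minus_q - rho * ones_K + rho) / l, 0.0)

x_hat = z[:N] - z[N:]
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))  # small relative error
```

Note how the top-$K$ exemption ($\rho\mathbf{1}_K$ canceling part of $\rho\mathbf{1}$) leaves the largest entries unshrunk, which is the practical payoff of the nonconvex regularizer over plain $\ell_1$ soft-thresholding.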
Algorithm 2 Single-loop DC-GPSR-Basic (SlDC-GPSR-Basic)
Input: measurements $\mathbf{y}$, measurement matrix $\mathbf{\Phi}$, and a small number $\epsilon$
Output: reconstructed $\hat{\mathbf{x}}$
Initialization: $\mathbf{u}^0$, $\mathbf{v}^0$, $\mathbf{z}^0 \leftarrow [(\mathbf{u}^0)^T, (\mathbf{v}^0)^T]^T$
1: for $t = 1, 2, \ldots$ do
2:   Compute the gradient $\partial g(\mathbf{z}^{t-1}) = l\,\mathbf{z}^{t-1} - \mathbf{B}\mathbf{z}^{t-1} + \mathbf{q} + \rho\mathbf{1}_K^{t-1}$ as in (a) of (39)
3:   Perform the optimal projection $\mathbf{z}^t = \big(\frac{1}{l}(\partial g(\mathbf{z}^{t-1}) - \rho\mathbf{1})\big)_+$ as in (b) of (39)
4:   Check the termination condition $\|\mathbf{z}^t - \mathbf{z}^{t-1}\|_2 \le \epsilon$; return to step 1 if not satisfied, otherwise terminate with the solution $\mathbf{z}^t = [(\mathbf{u}^t)^T, (\mathbf{v}^t)^T]^T$ and return the reconstruction $\hat{\mathbf{x}} = \mathbf{u}^t - \mathbf{v}^t$
5: end for

B. An Extension Algorithm for Single-Loop DC-GPSR Using a Monotone BB Step Size
A crucial observation about Algorithm 2 (the SlDC-GPSR-Basic algorithm) is that it can be interpreted as a simple projected gradient descent method for a nonconvex optimization problem with a global convergence guarantee. This observation is obtained by substituting $\partial g(z^{t-1})$ from (a) into (b) of (39), so that Steps 1 and 2 in Algorithm 2 combine into the $t$th update of $z^t$:

$$z^{t} = \mathrm{Proj}\left(z^{t-1} - \frac{1}{l}\left(Bz^{t-1} - q - \rho\mathbf{1}_K^{t-1} + \rho\mathbf{1}\right)\right) = \left(z^{t-1} - \frac{1}{l}\left(Bz^{t-1} - q - \rho\mathbf{1}_K^{t-1} + \rho\mathbf{1}\right)\right)_+ \quad (40)$$

where the projection operation is simply the nonnegative-clipping operation $(\cdot)_+$. Thus, we can clearly see that SlDC-GPSR-Basic is a pure gradient projection descent method, with step size $1/l$ for $l = \lambda_{\max}(\Phi^T\Phi)$, applied to the following nonconvex problem:

$$\min_{z}\ \frac{1}{2}z^TBz - q^Tz - \rho\mathbf{1}_K^Tz + \rho\mathbf{1}^Tz := F(z) \quad \text{s.t.}\ z \geq 0 \quad (41)$$

where

$$z = \begin{bmatrix} u \\ v \end{bmatrix}, \quad B = \begin{bmatrix} \Phi^T\Phi & -\Phi^T\Phi \\ -\Phi^T\Phi & \Phi^T\Phi \end{bmatrix}, \quad q = \begin{bmatrix} \Phi^Ty \\ -\Phi^Ty \end{bmatrix}.$$

Problem (41) is an equivalent form of the sparse reconstruction problem (17), but it expresses the sparse vector $x$ through the nonnegative double-sized variable $z = [(x)_+^T, (-x)_+^T]^T$. As is well known, gradient projection is a mature method for convex optimization; however, it generally cannot be applied directly to nonconvex problems with any convergence guarantee. Interestingly, as a gradient projection descent for nonconvex sparse reconstruction, the proposed SlDC-GPSR-Basic algorithm is guaranteed to converge, because it is derived from the perspective of DC programming and shares the global convergence property of general DC programming algorithms [22]. Although the step size $1/l$ provides a global convergence guarantee for SlDC-GPSR-Basic in Algorithm 2, this step size can be too small in practical implementations. We therefore consider generalizing the step size to accelerate the SlDC-GPSR-Basic algorithm.
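Read as the single update (40), the whole algorithm fits in a few lines. The following Python sketch is illustrative only (the random test problem, parameter values, and termination settings are assumptions, not the paper's configuration); it applies $B$ matrix-free through $\Phi^T\Phi$ and recovers $\hat{x} = u - v$:

```python
import numpy as np

def sldc_gpsr_basic(Phi, y, rho, K, eps=1e-9, max_iter=5000):
    """Sketch of update (40): z <- (z - (1/l)(Bz - q - rho*1_K + rho*1))_+ ."""
    n = Phi.shape[1]
    G = Phi.T @ Phi
    l = np.linalg.eigvalsh(G).max()                  # l = lambda_max(Phi^T Phi)
    q = np.concatenate([Phi.T @ y, -(Phi.T @ y)])
    z = np.zeros(2 * n)                              # z = [u; v], with x = u - v
    for _ in range(max_iter):
        w = G @ (z[:n] - z[n:])                      # Bz = [w; -w], matrix-free
        Bz = np.concatenate([w, -w])
        ones_K = np.zeros(2 * n)
        ones_K[np.argpartition(z, -K)[-K:]] = 1.0    # indicator of K largest entries of z
        z_new = np.maximum(z - (Bz - q - rho * ones_K + rho) / l, 0.0)
        if np.linalg.norm(z_new - z) <= eps:
            return z_new[:n] - z_new[n:]
        z = z_new
    return z[:n] - z[n:]
```

As the surrounding text notes, the fixed step $1/l$ can make this basic variant slow; the sketch is meant to show the structure of (40), not a tuned solver.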
We use $\alpha^t$ to represent a general step size for the $(t+1)$th update, and extend the update (40) to a generic gradient projection descent update for problem (41):

$$\tilde{z}^{t+1} = \mathrm{Proj}\left(z^t - \alpha^t\nabla F(z^t)\right) = \left(z^t - \alpha^t\left(Bz^t - q - \rho\mathbf{1}_K^t + \rho\mathbf{1}\right)\right)_+ \quad (42)$$

where $\alpha^t$ can be explicitly calculated by the BB step size method as

$$\alpha^t = \frac{\|z^t - z^{t-1}\|_2^2}{(z^t - z^{t-1})^T\left(\nabla F(z^t) - \nabla F(z^{t-1})\right)}. \quad (43)$$

To prevent the step size from being overly large, we employ a scalar $\beta^t \in (0, 1]$ to limit the update so that the objective descends monotonically:

$$z^{t+1} = z^t + \beta^t\left(\tilde{z}^{t+1} - z^t\right). \quad (44)$$

Writing $\tilde{z}^{t+1} - z^t$ as $\delta^t$, we can calculate a $\beta^t$ that minimizes the objective $F(z^{t+1})$ by

$$\beta^t = -\frac{(\delta^t)^T\nabla F(z^t)}{(\delta^t)^TB\delta^t}. \quad (45)$$

We summarize the updates (42) and (44) in Algorithm 3 as an extended single-loop DC-GPSR algorithm, which extends Algorithm 2 with a monotonic BB step-size implementation.

Algorithm 3
Single-loop DC-GPSR with monotonic BB step size (SlDC-GPSR-BB)
Input: measurements $y$, measurement matrix $\Phi$, and a small number $\epsilon$
Output: reconstructed $\hat{x}$
Initialization: $u^0$, $v^0$, $z^0 \leftarrow [(u^0)^T, (v^0)^T]^T$, $\alpha^0$
for $t = 0, 1, 2, \ldots$ do
  1) Compute the update $\tilde{z}^{t+1}$ using (42)
  2) Compute the step size $\beta^t$ using (45)
  3) Compute the update $z^{t+1}$ using (44)
  4) Compute the step size $\alpha^{t+1}$ using (43)
  5) Check the termination condition $\|z^{t+1} - z^t\|_2 \leq \epsilon$; return to Step 1 if it is not satisfied; otherwise terminate with the solution $z^{t+1} = [(u^{t+1})^T, (v^{t+1})^T]^T$ and return the reconstruction $\hat{x} = u^{t+1} - v^{t+1}$
end for
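The BB-accelerated updates (42)–(45) admit a similarly compact implementation. The Python sketch below is a hedged illustration: the safeguards (clipping $\beta^t$ to $(0, 1]$ and guarding both step-size denominators) are common practical choices and are assumptions beyond the equations above, as is the test problem itself.

```python
import numpy as np

def grad_F(z, G, q, rho, K, n):
    """Gradient of (41): Bz - q - rho*1_K + rho*1, with B applied matrix-free."""
    w = G @ (z[:n] - z[n:])
    g = np.concatenate([w, -w]) - q + rho            # Bz - q + rho*1
    g[np.argpartition(z, -K)[-K:]] -= rho            # subtract rho on the K largest entries
    return g

def B_mul(d, G, n):
    """Matrix-free product Bd for B = [[G, -G], [-G, G]]."""
    w = G @ (d[:n] - d[n:])
    return np.concatenate([w, -w])

def sldc_gpsr_bb(Phi, y, rho, K, eps=1e-9, max_iter=1000):
    """Sketch of SlDC-GPSR-BB: updates (42)-(45) with guarded step sizes."""
    n = Phi.shape[1]
    G = Phi.T @ Phi
    q = np.concatenate([Phi.T @ y, -(Phi.T @ y)])
    z, alpha = np.zeros(2 * n), 1.0
    g = grad_F(z, G, q, rho, K, n)
    for _ in range(max_iter):
        z_tilde = np.maximum(z - alpha * g, 0.0)                       # (42)
        delta = z_tilde - z
        denom = delta @ B_mul(delta, G, n)
        beta = 1.0 if denom <= 0 else float(np.clip(-(delta @ g) / denom, 0.0, 1.0))  # (45)
        z_new = z + beta * delta                                       # (44)
        g_new = grad_F(z_new, G, q, rho, K, n)
        s, dg = z_new - z, g_new - g
        alpha = (s @ s) / (s @ dg) if s @ dg > 0 else 1.0              # (43), guarded
        z, g = z_new, g_new
        if np.linalg.norm(s) <= eps:
            break
    return z[:n] - z[n:]
```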
VI. COMPLEXITY ANALYSIS
The time complexity of an algorithm is determined by both its convergence rate and the computation cost of each iteration. A theoretical convergence-rate analysis is too complicated to obtain for our proposed algorithms due to the nonconvexity of the optimization objective. Nevertheless, numerical results in the next section will sufficiently demonstrate the convergence rates of the proposed algorithms. Before presenting numerical results, we analyze the computation cost of each iteration of the proposed algorithms. At the $t$th iteration, a common operation of the proposed algorithms is to calculate a subgradient of the top-$(K,1)$ norm of the nonnegative vector $z^{t-1}$, i.e., $\partial\|z^{t-1}\|_{K,1}$. A feasible subgradient computation of the top-$(K,1)$ norm of a nonnegative vector requires sorting the elements and taking the indices of the $K$ largest ones, which can be done with time complexity $O(2n\log(2n))$, where $2n$ is the length of the vector $z^{t-1}$. Apart from this subgradient computation, the remaining operations are mainly matrix-vector multiplications, vector inner products, vector sums, and scalar-vector multiplications, whose computation costs can be estimated by their floating-point operations (flops). Given vectors $u, v \in \mathbb{R}^n$, a scalar $\rho$, and a dense matrix $\Phi \in \mathbb{R}^{m\times n}$: the inner product $u^Tv$ needs $2n-1$ flops; the matrix-vector multiplication $\Phi v$ needs $2mn-m$ flops; the scalar-vector multiplication $\rho u$ needs $n$ flops; and the vector sum $u+v$ needs $n$ flops. The most computation-intensive term is $Bz$, which requires $O(mn)$ flops when evaluated through products with $\Phi$ and $\Phi^T$.

In the proposed algorithms, DlDC-GPSR is a double-loop algorithm, while SlDC-GPSR-Basic and SlDC-GPSR-BB are single-loop algorithms. For fair comparisons, the computation costs per iteration are estimated for each overall iteration, that is, Steps 2 to 5 for DlDC-GPSR, Steps 2 and 3 for SlDC-GPSR-Basic, and Steps 2 to 5 for SlDC-GPSR-BB. We assume that an arbitrary DlDC-GPSR iteration has $I$ inner loops.
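Both of the costs discussed above can be kept at $O(mn)$ by never forming $B$ explicitly and by avoiding a full sort of $z$. A minimal Python sketch (illustrative, not the paper's code): `apply_B` evaluates $Bz$ through products with $\Phi$ and $\Phi^T$, and `topk_indicator` uses `np.argpartition`, a selection routine whose average cost is linear rather than the $O(2n\log(2n))$ of a full sort.

```python
import numpy as np

def apply_B(Phi, z):
    """Matrix-free Bz for B = [[G, -G], [-G, G]] with G = Phi^T Phi: O(mn) flops."""
    n = Phi.shape[1]
    w = Phi.T @ (Phi @ (z[:n] - z[n:]))
    return np.concatenate([w, -w])

def topk_indicator(z, K):
    """Indicator vector of the K largest entries of z via linear-time selection."""
    ones_K = np.zeros_like(z)
    ones_K[np.argpartition(z, -K)[-K:]] = 1.0
    return ones_K
```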
Thus, the computation cost of DlDC-GPSR per iteration is $I \times O(mn)$, while the computation costs of SlDC-GPSR-Basic and SlDC-GPSR-BB are both $O(mn)$, where $n$ is the dimension of the sparse vector $x$ and $m$ is the dimension of the measurements $y$. Assuming the DlDC-GPSR algorithm also adopts BB step sizes, its computation cost per iteration is $I$ times that of the SlDC-GPSR-BB algorithm. In other words, SlDC-GPSR-BB can be treated as a special case of DlDC-GPSR in which the number of inner loops of every outer iteration is forced to be one, i.e., $I = 1$. Although both SlDC-GPSR-Basic and SlDC-GPSR-BB have per-iteration computation costs of order $O(mn)$, SlDC-GPSR-BB has a higher cost because it computes the step sizes $\alpha^t$ and $\beta^t$ in each iteration, as shown in (43) and (45). In contrast, the SlDC-GPSR-Basic algorithm does not calculate any step size and can be treated as pure gradient projection descent with the fixed step size $1/l$, as shown in (40).

Footnotes: Sorting is a general approach to selecting the $K$ largest elements of a list; more efficient methods exist. The number $I$ can vary across iterations of DlDC-GPSR.
VII. NUMERICAL RESULTS
In this section, we evaluate the performance of the proposed DlDC-GPSR (Algorithm 1) and SlDC-GPSR-BB (Algorithm 3). For a fair comparison with SlDC-GPSR-BB, we also adopt BB step sizes for the inner loops of DlDC-GPSR. We evaluate the performance of sparse beamspace channel reconstruction by the proposed algorithms and compare it with that of the conventional ℓ₁-regularized GPSR algorithm [25]. Then, we evaluate the performance of the proposed algorithms in reconstructing nearly sparse and noisy beamspace channel vectors. Moreover, we reconstruct some MNIST digits to visualize reconstruction quality and compare the proposed algorithms with several existing popular sparse reconstruction algorithms, including ℓ₁-regularized GPSR, ISTA, and OMP. All simulations were implemented on a desktop computer equipped with a 3.2 GHz Intel Core i7-8700 CPU and 8 GB of physical memory, using Python 3.

A. Sparse Beamspace Channel Reconstruction
In this subsection, we perform sparse beamspace channel reconstructions using the proposed DlDC-GPSR and SlDC-GPSR-BB algorithms, as well as the conventional ℓ₁-regularized GPSR algorithm for comparison. We randomly generate a channel vector for a single-antenna user according to the channel model (1), where the number of multipath components is set as three. A sparse beamspace channel vector is obtained by applying a DFT transformation to the generated spatial-domain channel vector. The obtained beamspace channel vector is only nearly sparse due to power leakage; to obtain an exactly sparse beamspace channel vector, we keep the largest-magnitude elements and set the rest of the elements to zero. Since we stack the real and imaginary parts of the beamspace channel vector, the manipulated vector is a real-valued sparse vector whose dimension is twice the number of BS antennas. The measurement matrix is drawn randomly from a standard Gaussian distribution. We set $K = 32$ in the DlDC-GPSR and SlDC-GPSR-BB implementations. (As discussed, SlDC-GPSR-Basic (Algorithm 2) has an interesting theoretical interpretation as a pure gradient projection algorithm, but its required step size $1/l$ is too small to converge within a reasonable period in our simulations; therefore, we do not use Algorithm 2 in the simulations.)

Figure 1: Illustrations of the optimal channel vector and reconstructions

The original sparse beamspace channel and the reconstructions are shown in Fig. 1. We can see that the DlDC-GPSR and SlDC-GPSR-BB algorithms achieve near-perfect reconstructions with very small NMSEs, whereas the ℓ₁-regularized GPSR reconstruction has noticeable errors with a visibly larger NMSE. Under the termination condition $\|\hat{x}^t - \hat{x}^{t-1}\|_2 \leq \epsilon$, we also record the runtimes of the ℓ₁-regularized GPSR, DlDC-GPSR, and SlDC-GPSR-BB algorithms.
To show their convergence properties during the runs, we plot objective values versus iterations in Fig. 2, where the objective of the proposed DlDC-GPSR and SlDC-GPSR-BB algorithms is $\frac{1}{2}\|y - \Phi x\|_2^2 + \rho(\|x\|_1 - \|x\|_{K,1})$, while the objective of the conventional ℓ₁-regularized GPSR is $\frac{1}{2}\|y - \Phi x\|_2^2 + \rho\|x\|_1$. For the DlDC-GPSR algorithm, the red dots indicate objective values at outer iterations, and the blue solid line shows the objective values along all inner iterations. We can see that the proposed DlDC-GPSR algorithm decreases its objective significantly after the seventh outer step and approaches the optimal value at the eighth step, whereas the objective of ℓ₁-regularized GPSR is stuck at a relatively large value. The SlDC-GPSR-BB algorithm has the fastest convergence and achieves almost the same objective value as the DlDC-GPSR algorithm.

Figure 2: Objective values versus iterations for a beamspace channel reconstruction. Objective evaluated for DlDC-GPSR and SlDC-GPSR-BB: $\frac{1}{2}\|y - \Phi x\|_2^2 + \rho(\|x\|_1 - \|x\|_{K,1})$; objective evaluated for ℓ₁-regularized GPSR: $\frac{1}{2}\|y - \Phi x\|_2^2 + \rho\|x\|_1$

Then, we investigate the values of the least-squares objective with the ℓ₁-norm penalty, i.e., $\frac{1}{2}\|y - \Phi\hat{x}\|_2^2 + \rho\|\hat{x}\|_1$, and show these values versus iterations in Fig. 3, where we also plot the optimal value of the penalty term $\rho\|x_{\mathrm{opt}}\|_1$. We can see that the DlDC-GPSR and SlDC-GPSR-BB algorithms arrive and stay at the value $\rho\|x_{\mathrm{opt}}\|_1$, which is also the optimal value that the objective $\frac{1}{2}\|y - \Phi\hat{x}\|_2^2 + \rho\|\hat{x}\|_1$ can achieve when $\|y - \Phi\hat{x}\|_2 = 0$ and $\rho\|\hat{x}\|_1 = \rho\|x_{\mathrm{opt}}\|_1$. However, the ℓ₁-regularized GPSR cannot achieve this optimal value, leaving a noticeable gap.
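The plotted objective values can be computed directly from their definitions. A small Python sketch (symbols follow the text; the data are left to the caller):

```python
import numpy as np

def topk_l1(x, K):
    """Top-(K,1) norm: the sum of the K largest magnitudes of x."""
    a = np.abs(x)
    return a[np.argpartition(a, -K)[-K:]].sum()

def obj_nonconvex(x, Phi, y, rho, K):
    """Objective of DlDC-GPSR / SlDC-GPSR-BB: 0.5||y - Phi x||^2 + rho(||x||_1 - ||x||_{K,1})."""
    return 0.5 * np.sum((y - Phi @ x) ** 2) + rho * (np.abs(x).sum() - topk_l1(x, K))

def obj_l1(x, Phi, y, rho):
    """Objective of the conventional l1-regularized GPSR: 0.5||y - Phi x||^2 + rho||x||_1."""
    return 0.5 * np.sum((y - Phi @ x) ** 2) + rho * np.abs(x).sum()
```

For a $K$-sparse $x$ that fits the measurements exactly, `obj_nonconvex` vanishes while `obj_l1` retains the residual penalty $\rho\|x\|_1$, which is precisely the gap discussed above.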
It is meaningful to observe this gap between the minimal objective values attained by our proposed algorithms and by the conventional ℓ₁-regularized GPSR algorithm, because the gap gives an important insight into the approximation error introduced by relaxing the ℓ₀-norm to the ℓ₁-norm. Figure 4 shows the normalized squared-ℓ₂ errors ($\|x - \hat{x}\|_2^2/\|x\|_2^2$) versus iterations for the reconstructions. We can see that the DlDC-GPSR and SlDC-GPSR-BB algorithms achieve accurate reconstructions with errors several orders of magnitude smaller than that of the conventional ℓ₁-regularized GPSR algorithm.

B. Nearly Sparse and Noisy Beamspace Channel Reconstruction
Beamspace channels can be nearly sparse due to power leakage and can also be noisy. In this subsection, we evaluate the performance of the proposed algorithms in reconstructing nearly sparse channels and noisy channels. We choose a noisy channel sample $x_{\mathrm{noisy}}$ and a nearly sparse channel sample $x_{\mathrm{approx}}$. Then, we perform reconstructions using the proposed algorithms to obtain $\hat{x}$ from the measurements $y_{\mathrm{noisy}} = \Phi x_{\mathrm{noisy}}$ and $y_{\mathrm{approx}} = \Phi x_{\mathrm{approx}}$, and evaluate the reconstruction performance for different penalty parameters $\rho$.

Figure 3: Objective values ($\frac{1}{2}\|y - \Phi x\|_2^2 + \rho\|x\|_1$) versus iterations for a beamspace channel reconstruction

Figure 4: Reconstruction error ($\|\hat{x} - x\|_2^2/\|x\|_2^2$) versus iterations in a beamspace channel reconstruction

Figure 5: Nearly sparse beamspace channel reconstructions by DlDC-GPSR and SlDC-GPSR-BB for different penalty parameter values

We plot the original channel magnitudes and illustrate gray images of the magnitudes of the original channel and the reconstructions. In Fig. 5(a), we first plot the magnitude of a nearly sparse channel, then reshape the magnitude vector into a matrix and draw its gray image. As shown in Fig. 5(a), a nearly sparse channel has a few large magnitudes while most of the magnitudes are small. Figures 5(b) and 5(c) show the reconstruction results with the sparsity level $K = 10$ for the DlDC-GPSR and SlDC-GPSR-BB algorithms. We can see that as the penalty parameter $\rho$ becomes larger, the reconstructions become sparser. The important large-magnitude elements are successfully reconstructed by both the DlDC-GPSR and SlDC-GPSR-BB algorithms.

Figure 6: Noisy sparse beamspace channel reconstructions by DlDC-GPSR and SlDC-GPSR-BB for different penalty parameter values

Next, we investigate the reconstruction performance of the proposed algorithms on noisy channels.
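A noisy test channel at a prescribed SNR, with the SNR defined as $\|x\|_2^2/\|n\|_2^2$, can be synthesized by scaling a white Gaussian noise vector. A minimal sketch (the SNR value used in the check and the vector length are illustrative assumptions):

```python
import numpy as np

def add_awgn(x, snr_db, rng):
    """Add white Gaussian noise scaled so that ||x||^2 / ||n||^2 equals the target SNR."""
    n = rng.standard_normal(x.shape)
    # Scale the noise norm to ||x|| / 10^(snr_db/20), giving the desired power ratio.
    n *= np.linalg.norm(x) / (np.linalg.norm(n) * 10 ** (snr_db / 20))
    return x + n, n
```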
We randomly add AWGN to the approximately sparse channels, then perform reconstructions from the corresponding measurements. A noisy channel is shown in Fig. 6(a). The SNR is defined as $\|x_{\mathrm{noisy}}\|_2^2/\|n\|_2^2$, where $n$ is the added noise vector. The reconstruction results are shown in Fig. 6(b) and Fig. 6(c), where the sparsity level is set as $K = 3$ for DlDC-GPSR and SlDC-GPSR-BB. We can see that although the reconstruction accuracies degrade due to noise, the important large-magnitude elements are always successfully recovered.

Figure 7: MNIST digit reconstructions by the DlDC-GPSR, SlDC-GPSR-BB, ℓ₁-regularized GPSR, ISTA, and OMP algorithms

C. Reconstruction Performance on MNIST Dataset
We have seen the numerical results of sparse reconstructions for beamspace channel vectors, which are generated from a given channel model based on an i.i.d. probabilistic assumption. In this subsection, we investigate the performance of the proposed algorithms on real-world data. We randomly choose handwritten digit images from the MNIST dataset to perform reconstructions. Since most of the pixels in an MNIST digit image are zeros, we can treat each image as a sparse signal. Each image has dimension 28 × 28, and we unfold it into a sparse vector. When performing reconstructions with the proposed DlDC-GPSR and SlDC-GPSR-BB algorithms, we set the top-$(K,1)$ norm parameter $K = 180$. We choose several commonly used algorithms for comparison, including ℓ₁-regularized GPSR, ISTA, and OMP. The original images and the reconstructions are shown in Fig. 7. We can see that the proposed DlDC-GPSR and SlDC-GPSR-BB algorithms achieve higher-quality image reconstructions with small NMSEs, while the benchmark algorithms have higher reconstruction NMSEs. We can visually see that ℓ₁-regularized GPSR reconstructs several digits inaccurately; the reconstructions by ISTA are highly noisy; and the OMP algorithm reconstructs a few digits clearly, but some digits are seriously blurred. The high reconstruction accuracies of the proposed DlDC-GPSR and SlDC-GPSR-BB verify the robustness of the proposed algorithms on real-world data, which may not follow any i.i.d. probabilistic assumptions. Algorithm runtimes are shown in Table I. We can see that SlDC-GPSR-BB and ℓ₁-GPSR are faster than the other three algorithms. Considering that SlDC-GPSR-BB is more accurate than ℓ₁-GPSR by orders of magnitude, and that it achieves accuracy comparable to DlDC-GPSR with a more than seven times faster runtime, SlDC-GPSR-BB offers a good tradeoff between reconstruction accuracy and algorithm runtime.
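As a worked example of this measurement-and-scoring pipeline, the sketch below unfolds a 28 × 28 image into a length-784 vector, compresses it with a random Gaussian matrix, and evaluates the normalized squared error used as the NMSE metric. The measurement dimension m = 300 and the synthetic stand-in digit are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def nmse(x_hat, x):
    """Normalized squared error ||x_hat - x||^2 / ||x||^2 used to score reconstructions."""
    return np.sum((x_hat - x) ** 2) / np.sum(x ** 2)

rng = np.random.default_rng(0)
img = np.zeros((28, 28))
img[10:18, 12:16] = 1.0            # synthetic stand-in for an MNIST digit (assumption)
x = img.reshape(-1)                # unfold the 28 x 28 image into a length-784 sparse vector
m = 300                            # assumed measurement dimension (not the paper's value)
Phi = rng.standard_normal((m, x.size)) / np.sqrt(m)
y = Phi @ x                        # compressed measurements
```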
Table I: Runtimes of MNIST digit reconstructions for different algorithms

Algorithm         | DlDC-GPSR | SlDC-GPSR-BB | ℓ₁-GPSR | ISTA  | OMP
Runtime (seconds) | 9.59      | 1.29         | 1.40    | 12.55 | 11.89
VIII. CONCLUSION
We proposed three DC programming gradient projection sparse reconstruction algorithms for massive MIMO beamspace channel estimation: DlDC-GPSR (Algorithm 1), SlDC-GPSR-Basic (Algorithm 2), and SlDC-GPSR-BB (Algorithm 3). We derived these algorithms by solving a least-squares problem with a nonconvex regularizer. The regularizer is simply the difference between an ℓ₁-norm and a top-$(K,1)$ norm, which is introduced to remove the penalties on the $K$ largest-magnitude elements. We employed DC programming and the gradient projection method to solve the nonconvex sparse reconstruction problem, and derived different algorithms from different DC decompositions. The proposed double-loop algorithm DlDC-GPSR is straightforward; the single-loop algorithm SlDC-GPSR-Basic has simple updates and shares the theoretical convergence property of general DC programming methods. We observed that the SlDC-GPSR-Basic updates can be perfectly interpreted as a simple gradient projection application. Based on this observation, we proposed an extension algorithm that adopts the BB step size. The proposed SlDC-GPSR-BB algorithm has a significantly improved convergence rate and achieves the same level of accuracy as the double-loop algorithm DlDC-GPSR. We provided sufficient numerical demonstrations to show the advantages of the proposed algorithms, such as high accuracy and fast computation. For future study, we will investigate the theoretical convergence rates of the proposed algorithms.

IX. ACKNOWLEDGEMENT
The authors thank Dr. Hui Ma for his early contribution to Algorithm 1.
APPENDIX
A. Proof of Theorem 1:
Suppose that $x_{\rho^*}$ is an optimal solution of (17) for a given $\rho^*$; then $x_{\rho^*}$ is also optimal for (16) as long as $\|x_{\rho^*}\|_0 \leq K$ (or, equivalently, $\|x_{\rho^*}\|_1 - \|x_{\rho^*}\|_{K,1} = 0$) is satisfied. Thus, we only need to prove that $\|x_{\rho^*}\|_0 > K$ is not feasible. Assume $x_{\rho^*}$ is an optimal solution to (17) with $\rho^* > \max_i\{|(\Phi^Ty)_i| + q(\|\Phi^T\Phi e_i\|_2 + |(\Phi^T\Phi)_{ii}|/2)\}$, where $q$ is an upper bound on $\|x_{\rho^*}\|_2$. For $\|x_{\rho^*}\|_0 > K$, we construct a feasible solution to (16) as $x' = x_{\rho^*} - x_i e_i$, where $i$ represents the index of the $(K+1)$th largest-magnitude element of the vector $x_{\rho^*}$; $e_i$ represents the unit vector whose $i$th element is one while the other elements are zeros; and $x_i$ represents the $i$th element of the vector $x_{\rho^*}$. We write the objective of (17) as

$$F(x) = \frac{1}{2}x^T\Phi^T\Phi x - (\Phi^Ty)^Tx + \rho^*\|x\|_1 - \rho^*\|x\|_{K,1}. \quad (46)$$

Noting that removing the $(K+1)$th largest-magnitude element decreases $\|x\|_1$ by $|x_i|$ while leaving $\|x\|_{K,1}$ unchanged, we have

$$\begin{aligned}
F(x_{\rho^*}) - F(x') &= F(x_{\rho^*}) - F(x_{\rho^*} - x_i e_i) \\
&= \frac{1}{2}x_{\rho^*}^T\Phi^T\Phi x_{\rho^*} - (\Phi^Ty)^Tx_{\rho^*} - \frac{1}{2}(x_{\rho^*} - x_i e_i)^T\Phi^T\Phi(x_{\rho^*} - x_i e_i) + (\Phi^Ty)^T(x_{\rho^*} - x_i e_i) + \rho^*|x_i| \\
&= x_i e_i^T\Phi^T\Phi x_{\rho^*} - \frac{1}{2}x_i^2(\Phi^T\Phi)_{i,i} - x_i(\Phi^Ty)_i + \rho^*|x_i| \\
&\geq -|x_i|\,\|\Phi^T\Phi e_i\|_2\,\|x_{\rho^*}\|_2 - \frac{1}{2}|x_i|\,\|x_{\rho^*}\|_2\,|(\Phi^T\Phi)_{i,i}| - |x_i|\,|(\Phi^Ty)_i| + \rho^*|x_i| \\
&= |x_i|\left(\rho^* - \|\Phi^T\Phi e_i\|_2\|x_{\rho^*}\|_2 - \frac{1}{2}\|x_{\rho^*}\|_2|(\Phi^T\Phi)_{i,i}| - |(\Phi^Ty)_i|\right) > 0, \quad (47)
\end{aligned}$$

which contradicts the assumption that $x_{\rho^*}$ is an optimal solution to (17).

B. Proof of Theorem 2:
We first derive the Lipschitz constant $l = \lambda_{\max}(\Phi^T\Phi)$, and then we prove the convexity of $h(x)$. Denote the least-squares objective by $w(x) = \frac{1}{2}\|y - \Phi x\|_2^2$. The function $w(x)$ is smooth with a Lipschitz-continuous gradient; that is, there exists $l < \infty$, named a Lipschitz constant, such that

$$\|\nabla w(x) - \nabla w(z)\|_2 \leq l\|x - z\|_2. \quad (48)$$

For $w(x) = \frac{1}{2}\|y - \Phi x\|_2^2$, we have

$$\|\nabla w(x) - \nabla w(z)\|_2 = \|\Phi^T(\Phi x - y) - \Phi^T(\Phi z - y)\|_2 = \|\Phi^T\Phi(x - z)\|_2 \leq |||\Phi^T\Phi|||\,\|x - z\|_2 = \lambda_{\max}(\Phi^T\Phi)\|x - z\|_2 \quad (49)$$

where $|||\cdot|||$ represents the spectral norm of a matrix, and $\lambda_{\max}(\cdot)$ represents the largest eigenvalue of a matrix. Thus, we obtain the Lipschitz constant $l = \lambda_{\max}(\Phi^T\Phi)$.

For $h(x) = \frac{l}{2}\|x\|_2^2 - \frac{1}{2}\|y - \Phi x\|_2^2$, the Hessian matrix $\nabla^2 h(x)$ is

$$\nabla^2 h(x) = \frac{\partial(\nabla h(x))}{\partial x} = \frac{\partial\left(l x - \Phi^T(\Phi x - y)\right)}{\partial x} = lI - \Phi^T\Phi \quad (50)$$

where $I$ is the identity matrix. For $l = \lambda_{\max}(\Phi^T\Phi)$, we have $lI - \Phi^T\Phi \succeq 0$, which means that the Hessian matrix $lI - \Phi^T\Phi$ of $h(x)$ is positive semidefinite. Thus, $h(x)$ is a convex function of the vector $x$.

REFERENCES

[1] W. U. Bajwa, J. Haupt, A. M. Sayeed, and R. Nowak, "Compressed channel sensing: A new approach to estimating sparse multipath channels,"
Proc. IEEE, vol. 98, no. 6, pp. 1058–1076, June 2010.
[2] S. F. Cotter and B. D. Rao, "Sparse channel estimation via matching pursuit with application to equalization," IEEE Trans. Commun., no. 3, pp. 374–377, Mar. 2002.
[3] M. R. Raghavendra and K. Giridhar, "Improving channel estimation in OFDM systems for sparse multipath channels," IEEE Signal Process. Lett., no. 1, pp. 52–55, Jan. 2005.
[4] J. L. Paredes, G. R. Arce, and Z. Wang, "Ultra-wideband compressed sensing: Channel estimation," IEEE J. Sel. Topics Signal Process., no. 3, pp. 383–395, Oct. 2007.
[5] G. Tauböck, F. Hlawatsch, D. Eiwen, and H. Rauhut, "Compressive estimation of doubly selective channels in multicarrier systems: Leakage effects and sparsity-enhancing processing," IEEE J. Sel. Topics Signal Process., no. 2, pp. 255–271, Apr. 2010.
[6] C. R. Berger, S. Zhou, J. C. Preisig, and P. Willett, "Sparse channel estimation for multicarrier underwater acoustic communication: From subspace methods to compressed sensing," IEEE Trans. Signal Process., no. 3, pp. 1708–1721, Mar. 2010.
[7] Z. Gao, L. Dai, Z. Wang, and S. Chen, "Spatially common sparsity based adaptive channel estimation and feedback for FDD massive MIMO," IEEE Trans. Signal Process., vol. 63, no. 23, pp. 6169–6183, Dec. 2015.
[8] Z. Gao, L. Dai, W. Dai, B. Shim, and Z. Wang, "Structured compressive sensing-based spatio-temporal joint channel estimation for FDD massive MIMO," IEEE Trans. Commun., vol. 64, no. 2, pp. 601–617, Feb. 2016.
[9] C. R. Berger, Z. Wang, J. Huang, and S. Zhou, "Application of compressive sensing to sparse channel estimation," IEEE Commun. Mag., vol. 48, no. 11, pp. 164–174, Nov. 2010.
[10] M. E. Eltayeb, T. Y. Al-Naffouri, and H. R. Bahrami, "Compressive sensing for feedback reduction in MIMO broadcast channels," IEEE Trans. Commun., vol. 62, no. 9, pp. 3209–3222, Sep. 2014.
[11] J. W. Choi, B. Shim, and S. Chang, "Downlink pilot reduction for massive MIMO systems via compressed sensing," IEEE Commun. Lett., vol. 19, no. 11, pp. 1889–1892, Nov. 2015.
[12] C.-C. Tseng, J.-Y. Wu, and T.-S. Lee, "Enhanced compressive downlink CSI recovery for FDD massive MIMO systems using weighted block ℓ₁-minimization," IEEE Trans. Commun., vol. 64, no. 3, pp. 1055–1067, Mar. 2016.
[13] H. Almosa, S. Mosleh, E. Perrins, and L. Liu, "Downlink channel estimation with limited feedback for FDD multi-user massive MIMO with spatial channel correlation," in Proc. IEEE ICC, Kansas City, MO, 2018, pp. 1–6.
[14] W. Shen, L. Dai, Y. Shi, B. Shim, and Z. Wang, "Joint channel training and feedback for FDD massive MIMO systems," IEEE Trans. Veh. Technol., vol. 65, no. 10, pp. 8762–8767, Oct. 2016.
[15] X. Rao and V. K. N. Lau, "Distributed compressive CSIT estimation and feedback for FDD multi-user massive MIMO systems," IEEE Trans. Signal Process., vol. 62, no. 12, pp. 3261–3271, June 2014.
[16] A. Liu, F. Zhu, and V. K. N. Lau, "Closed-loop autonomous pilot and compressive CSIT feedback resource adaptation in multi-user FDD massive MIMO systems," IEEE Trans. Signal Process., vol. 65, no. 1, pp. 173–183, Jan. 2017.
[17] J. Brady, N. Behdad, and A. M. Sayeed, "Beamspace MIMO for millimeter-wave communications: System architecture, modeling, analysis, and measurements," IEEE Trans. Antennas Propag., vol. 61, no. 7, pp. 3814–3827, July 2013.
[18] Z. Gao, C. Hu, L. Dai, and Z. Wang, "Channel estimation for millimeter-wave massive MIMO with hybrid precoding over frequency-selective fading channels," IEEE Commun. Lett., vol. 20, no. 6, pp. 1259–1262, June 2016.
[19] X. Gao, L. Dai, S. Han, C. I, and X. Wang, "Reliable beamspace channel estimation for millimeter-wave massive MIMO systems with lens antenna array," IEEE Trans. Wireless Commun., vol. 16, no. 9, pp. 6010–6021, Sep. 2017.
[20] J. Rodríguez-Fernández, N. González-Prelcic, K. Venugopal, and R. W. Heath, "Frequency-domain compressive channel estimation for frequency-selective hybrid millimeter wave MIMO systems," IEEE Trans. Wireless Commun., vol. 17, no. 5, pp. 2946–2960, May 2018.
[21] J. P. González-Coma, J. Rodríguez-Fernández, N. González-Prelcic, L. Castedo, and R. W. Heath, "Channel estimation and hybrid precoding for frequency selective multiuser mmWave MIMO systems," IEEE J. Sel. Topics Signal Process., vol. 12, no. 2, pp. 353–367, May 2018.
[22] H. A. L. Thi and T. P. Dinh, "DC programming and DCA: Thirty years of developments," Math. Program., pp. 5–68, 2018.
[23] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky, "An interior-point method for large-scale ℓ₁-regularized least squares," IEEE J. Sel. Topics Signal Process., no. 4, pp. 606–617, Dec. 2007.
[24] W. Yin, S. Osher, D. Goldfarb, and J. Darbon, "Bregman iterative algorithms for ℓ₁-minimization with applications to compressed sensing," SIAM Journal on Imaging Sciences, no. 1, pp. 143–168, Mar. 2008.
[25] M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, "Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems," IEEE J. Sel. Topics Signal Process., vol. 1, no. 4, pp. 586–597, Dec. 2007.
[26] S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo, "Sparse reconstruction by separable approximation," IEEE Trans. Signal Process., vol. 57, no. 7, pp. 2479–2493, July 2009.
[27] T. Blumensath and M. E. Davies, "Iterative thresholding for sparse approximations," Journal of Fourier Analysis and Applications, vol. 14, pp. 629–654, 2008.
[28] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009.
[29] S. Foucart and M.-J. Lai, "Sparsest solutions of underdetermined linear systems via ℓq-minimization for 0 < q ≤ 1," Appl. Comput. Harmonic Anal., no. 3, pp. 395–407, 2009.
[30] G. Gasso, A. Rakotomamonjy, and S. Canu, "Recovering sparse signals with a certain family of nonconvex penalties and DC programming," IEEE Trans. Signal Process., no. 12, pp. 4686–4698, Dec. 2009.
[31] Y. C. Pati, R. Rezaiifar, and P. S. Krishnaprasad, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in Conf. Rec. 27th Asilomar Conf. Signals, Syst. Comput., vol. 1, 1993, pp. 40–44.
[32] J. A. Tropp and A. C. Gilbert, "Signal recovery from random measurements via orthogonal matching pursuit," IEEE Trans. Inform. Theory, vol. 53, no. 12, pp. 4655–4666, Dec. 2007.
[33] D. Needell and J. A. Tropp, "CoSaMP: Iterative signal recovery from incomplete and inaccurate samples," Applied and Computational Harmonic Analysis, vol. 26, no. 3, pp. 301–321, 2008.
[34] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.
[35] J. Gotoh, A. Takeda, and K. Tono, "DC formulations and algorithms for sparse optimization problems," Mathematical Programming, vol. 169, no. 1, pp. 141–176, 2018.
[36] P. Wu, H. Ma, and J. Cheng, "Sparse channel reconstruction with nonconvex regularizer via DC programming for massive MIMO systems," Taipei, Taiwan, Dec. 2020.
[37] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge, U.K.: Cambridge Univ. Press, 2005.
[38] R. W. Heath, N. González-Prelcic, S. Rangan, W. Roh, and A. M. Sayeed, "An overview of signal processing techniques for millimeter wave MIMO systems," IEEE J. Sel. Topics Signal Process., vol. 10, no. 3, pp. 436–453, Apr. 2016.
[39] I. Rish and G. Y. Grabarnik,