LoRD-Net: Unfolded Deep Detection Network with Low-Resolution Receivers
Shahin Khobahi, Nir Shlezinger, Mojtaba Soltanalian, Yonina C. Eldar
Abstract—The need to recover high-dimensional signals from their noisy low-resolution quantized measurements is widely encountered in communications and sensing. In this paper, we focus on the extreme case of one-bit quantizers, and propose a deep detector entitled LoRD-Net for recovering information symbols from one-bit measurements. Our method is a model-aware data-driven architecture based on deep unfolding of first-order optimization iterations. LoRD-Net has a task-based architecture dedicated to recovering the underlying signal of interest from the one-bit noisy measurements without requiring prior knowledge of the channel matrix through which the one-bit measurements are obtained. The proposed deep detector has far fewer parameters than black-box deep networks, due to the incorporation of domain knowledge in the design of its architecture, allowing it to operate in a data-driven fashion while benefiting from the flexibility, versatility, and reliability of model-based optimization methods. LoRD-Net operates in a blind fashion, which requires addressing both the non-linear nature of the data-acquisition system as well as identifying a proper optimization objective for signal recovery. Accordingly, we propose a two-stage training method for LoRD-Net, in which the first stage is dedicated to identifying the proper form of the optimization process to unfold, while the second trains the resulting model in an end-to-end manner. We numerically evaluate the proposed receiver architecture for one-bit signal recovery in wireless communications and demonstrate that the proposed hybrid methodology outperforms both data-driven and model-based state-of-the-art methods, while utilizing small datasets, on the order of merely a few hundred samples, for training.

Parts of this work were presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, United Kingdom, 2019 [1]. This work was supported by the Benoziyo Endowment Fund for the Advancement of Science, the Estate of Olga Klein-Astrachan, the European Union's Horizon 2020 research and innovation program under grant No. 646804-ERC-COG-BNYQ, the Israel Science Foundation under grant No. 0100101, the U.S. National Science Foundation under grant No. CCF-1704401, and an Illinois Discovery Partners Institute (DPI) Seed Award. S. Khobahi and M. Soltanalian are with the ECE Dept., University of Illinois at Chicago, Chicago, IL (e-mail: {skhoba2, msol}@uic.edu). N. Shlezinger is with the School of ECE, Ben-Gurion University of the Negev, Be'er-Sheva, Israel (e-mail: [email protected]). Y. C. Eldar is with the Faculty of Math and CS, Weizmann Institute of Science, Rehovot, Israel (e-mail: [email protected]).

I. INTRODUCTION
Analog-to-digital conversion plays an important role in digital signal processing systems. While physical signals take values in continuous time over continuous sets, they must be represented using a finite number of bits in order to be processed in digital hardware [2]. This operation is carried out using analog-to-digital converters (ADCs), which typically perform uniform sampling followed by uniform quantization of the discrete-time samples. When using high-resolution ADCs, this conversion induces minimal distortion, allowing the signal to be effectively processed using methods derived assuming access to the continuous-amplitude samples. However, the cost, power consumption, and memory requirements of ADCs grow with the sampling rate and the number of bits assigned to
each sample [3]. Consequently, recent years have witnessed an increasing interest in digital signal processing systems operating with low-resolution ADCs. In particular, in multiple-input multiple-output (MIMO) communication receivers, which are required to simultaneously capture multiple analog signals with high bandwidth, there is a growing need to operate reliably with low-resolution ADCs [4]. The coarsest form of quantization is the reduction of the signal to a single bit per sample, which may be accomplished by comparing the sample to some reference level and recording whether the signal is above or below the reference. One-bit acquisition allows the use of high sampling rates at a low cost and low energy consumption. Due to such favorable properties of one-bit ADCs, they have been employed in a wide array of applications, including wireless communications [1], [5], [6], radar signal processing [7]–[9], and sparse signal recovery [10], [11].

The non-linear nature of low-resolution quantization makes symbol detection a challenging task. This situation is significantly exacerbated in practical one-bit communication and sensing, where the channel is to be estimated in conjunction with symbol detection. A coherent symbol detection task is concerned with recovering the underlying signal of interest from the one-bit measurements assuming the channel state information (CSI) is known at the receiver. On the other hand, the more difficult task of blind symbol detection, which is the focus here, carries out recovery of the underlying transmitted symbols when CSI is not available.

Two main strategies have been proposed in the literature to facilitate operation with low-resolution ADCs: The first designs the overall acquisition system in light of the task for which the signals are acquired. For instance, MIMO communication receivers acquire their channel output in order to extract some underlying information, e.g., for symbol detection. As the analog signals are not required to be recovered from their digital representation, one can design the acquisition system to reliably infer the desired information while operating with low-resolution ADCs [12]–[16].
Such task-based quantization systems rely on pre-quantization processing, which requires dedicated hardware in the form of hybrid receiver architectures [17], [18] or unique antenna structures [19], [20], which are configured along with the quantization rule.

An alternative approach to task-based quantization, which does not require additional configurable analog hardware and is the focus of the current work, is to recover the desired information from the distorted, coarsely discretized representation of the signal in the digital domain. The main benefit of schemes carried out only in the digital domain is their simplicity of implementation, as they do not require modifications to the quantization system and circumvent the need for adding pre-quantization analog processing hardware. In the context of MIMO systems, various methods have been proposed in the literature for channel estimation and signal decoding from quantized outputs, including model-based signal processing methods, as surveyed in [21], as well as model-agnostic systems based on machine learning and data-driven techniques [22]–[29].

Most existing model-based detection algorithms require coherent operation, i.e., they rely on prior knowledge of the CSI and other system parameters. Among these works are the near-maximum-likelihood (nML) detector proposed for one-bit MIMO receivers in [30], the linear receivers studied in [31], [32], and the message-passing-based detectors considered in [33], [34]. The fact that such approaches require accurate CSI led to several works specifically dedicated to CSI estimation in the presence of low-resolution ADCs. These include [30], [35], which studied maximum-likelihood estimation for recovering the CSI in the presence of one-bit data; the works in [36], [37], which developed linear estimators for CSI estimation in one-bit MIMO systems; and [38], which focuses on sparse channels and utilizes one-bit sparse recovery methods for CSI estimation. However, all these strategies inevitably induce a non-negligible CSI estimation error, which may notably degrade the accuracy of signal detection based on the estimated CSI.

Over the past several years, data-driven methods, and specifically deep neural networks (DNNs), have attracted unprecedented attention from research communities across the board. The advent of low-cost, specialized, powerful computing resources and the continually increasing amount of massive data generated by humans and machines, along with new optimization and learning methods, have paved the way for DNNs and machine-learning-based models to prove their effectiveness in many engineering areas, such as computer vision and natural language processing [39]. DNNs learn their mapping from data in a model-agnostic manner, and can thus facilitate non-coherent (blind) detection.

Previously proposed DNN-aided symbol detection techniques for communication receivers can be divided based on their receiver architectures; namely, those that utilize conventional machine learning architectures for detection, including [40]–[42], and schemes combining DNNs with model-based detection methods, such as the blind DNN-aided receivers proposed in [43]–[46] and the coherent detectors of [47], [48]; see also the surveys in [49], [50]. In the context of one-bit DNN-aided receivers, previous works to date focus mainly on the first approach, i.e., applying conventional DNNs for the overall detection task.
Among these works are [22], [25] and [23], which applied generic DNNs for channel estimation in one-bit MIMO receivers. The application of conventional architectures for symbol detection was studied in [24], [27] and [28], while [26] showed that autoencoders can facilitate the design of error correction codes for communications with one-bit receivers. Recently, the authors in [29] considered the problem of symbol detection for a one-bit massive MIMO system and proposed a linear estimator module based on the Bussgang decomposition technique combined with a model-driven neural network.

The vast majority of the aforementioned works on learning-aided one-bit receivers rely on conventional DNN architectures. Such DNNs require a massive amount of training samples and must be trained on data from the same (or a similar) statistical model as the one under which they are required to operate, imposing a major challenge in dynamic wireless communications. In fact, the use of generic black-box DNNs is mostly justified in applications where a satisfactory description of the underlying dynamics of the system is not achievable, as is the case in computer vision and natural language processing. As surveyed above, this is not the case in the field of one-bit MIMO systems. This gives rise to the need to bridge the gap between data-driven and model-based approaches in this context, and to move towards specialized deep learning models for signal processing in one-bit MIMO systems, which is the aim of this work.

In this paper, we develop a hybrid model-based and data-driven system which learns to carry out blind symbol detection from one-bit measurements. The proposed architecture, referred to as LoRD-Net (Low Resolution Detection Network), combines the well-established model-based maximum-likelihood estimator (MLE) with machine learning tools through the deep unfolding method [51]–[56] for designing DNNs based on model-based optimization algorithms. To derive LoRD-Net, we first formulate the MLE for the task of symbol detection from one-bit samples. Next, we resort to first-order gradient-based methods for the MLE computation, and unfold the iterations onto the layers of a DNN. The resulting LoRD-Net learns to carry out MLE-approaching symbol detection without requiring CSI.

Applying conventional gradient-based optimization methods requires knowledge of the underlying system parameters, i.e., full CSI. Hence, a typical approach to unfolding such a symbol detection algorithm would be to estimate the unknown parameters from training and substitute them into the unfolded network [46]. We show that instead of estimating the unknown system parameters, it is preferable to learn an alternative channel which allows the receiver to detect the symbols reliably. Surprisingly, we demonstrate that the alternative channel learned by LoRD-Net is in general not the true channel. Based on this observation, we propose a two-stage training procedure, comprised of learning the proper optimization process to unfold, followed by an end-to-end training of the unfolded DNN.

The proposed LoRD-Net thus has the following properties:
i) Compared to the vanilla MLE symbol detector, our model does not need to estimate the channel separately.
ii) Owing to its hybrid nature, it has a low computational cost in operation and is highly scalable, facilitating much faster inference compared to its black-box data-driven and model-based counterparts.
iii) The proposed deep architecture is interpretable and has far fewer parameters than existing black-box deep learning solutions. This follows from the incorporation of domain knowledge in the design of the network architecture (i.e., being model-based), allowing LoRD-Net to train with far fewer labeled samples than existing data-driven one-bit receivers.

We verify the above characteristics of LoRD-Net in an experimental study, where we show that training of the proposed LoRD-Net architecture can be performed with far fewer samples than required by its data-driven counterparts, and demonstrate substantially superior performance compared to existing model-based and data-driven algorithms for symbol detection in massive MIMO channels with one-bit ADCs.

The rest of the paper is organized as follows. In Section II, we present the considered system model and the corresponding MLE formulation. In Section III, we derive LoRD-Net by unfolding the first-order gradient iterations associated with the MLE computation, and present its two-stage training procedure. Section IV provides a detailed numerical analysis of LoRD-Net applied to MIMO communications. Finally, Section V concludes the paper.

Throughout the paper, we use the following notation. Bold lowercase and bold uppercase letters denote vectors and matrices, respectively. We use (·)^T, Diag(·), sign(·), and log(·) to denote the transpose operator, the diagonal matrix formed by the entries of the vector argument, the sign operator, and the natural logarithm, respectively. The symbol ⊙ represents the Hadamard product, while 1 and 0 are the all-one and all-zero vectors/matrices. The i-th entry of the vector x is x_i, and ‖x‖_p is the ℓ_p-norm of x; M^n is the n-ary Cartesian product of a set M, and S_+ denotes the cone of symmetric positive-definite matrices.

II. SYSTEM MODEL AND PRELIMINARIES
In this section, we discuss the considered system model, focusing on one-bit data acquisition and blind signal recovery. We then formulate the MLE for this problem, which is used in designing the LoRD-Net architecture in Section III.
A. Problem Formulation
We consider a low-resolution data-acquisition system which utilizes m one-bit ADCs. Letting y ∈ R^m denote the received signal, the discrete output of the ADCs can be written as r = sign(y − b), where b ∈ R^m denotes the vector of quantization thresholds, and sign(·) is the sign function, i.e., sign(x) = +1 if x ≥ 0 and sign(x) = −1 otherwise. The received vector y is statistically related to the unknown vector of interest x ∈ M^n ⊆ R^n according to the linear relationship

y = Hx + n,   (1)

where n ∼ N(0, C) denotes additive Gaussian noise with a covariance matrix of the form C = Diag(σ_0^2, σ_1^2, ..., σ_{m−1}^2), with diagonal entries {σ_i^2}_{i=0}^{m−1} representing the noise variance in each respective dimension, and H ∈ R^{m×n} is the channel matrix. We assume that the elements of the unknown vector x are chosen independently from a finite alphabet M = {s_1, s_2, ..., s_{|M|}}. This setup represents low-resolution receivers in uplink multi-user MIMO systems, where x is the vector of symbols transmitted by the users, and y is the corresponding channel output, as illustrated in Fig. 1.

The overall dynamics of the system are thus compactly expressed as

r = sign(Hx + n − b).   (2)

Fig. 1. System model illustration.
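To make the acquisition model concrete, the following minimal sketch generates one-bit measurements according to (1)-(2). It is illustrative only (not the authors' code); the dimensions, noise levels, and BPSK alphabet are assumed values for demonstration.

```python
# Generate r = sign(Hx + n - b) for a small synthetic one-bit setup.
import numpy as np

rng = np.random.default_rng(0)
m, n = 16, 4                              # number of ADCs / users (assumed)
H = rng.standard_normal((m, n))           # channel matrix
sigma = 0.5 * np.ones(m)                  # per-dimension noise std (assumed)
b = np.zeros(m)                           # quantization thresholds

x = rng.choice([-1.0, +1.0], size=n)      # symbols from M = {-1, +1}
noise = sigma * rng.standard_normal(m)    # n ~ N(0, Diag(sigma^2))
y = H @ x + noise                         # unquantized received signal, Eq. (1)
r = np.where(y - b >= 0, 1.0, -1.0)       # one-bit ADC output, Eq. (2)
print(r)
```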
In the sequel, we refer to
Θ = {H, C} as the system parameters. Note that the above system model can be modified using conventional transformations to accommodate a complex-valued system model.

Our main goal is to perform the task of symbol detection, i.e., to recover x from the collected one-bit measurements r. We focus on blind (non-coherent) recovery, namely, the system parameters Θ = {H, C}, i.e., the channel matrix and the covariance of the noise, are not available to the receiver. Nonetheless, the receiver has access to a limited set of B labeled samples {x_b^p, r_b^p}_{b=0}^{B−1}, representing, e.g., pilot transmissions. The quantization thresholds of the ADCs, i.e., the vector b, are assumed to be fixed and known. While we do not consider the selection of b in the following, we discuss in the sequel how its optimization can be incorporated into the detection method.

B. Maximum Likelihood Recovery
To understand the challenges associated with blind low-resolution detection, we next discuss the MLE for recovering x from r. In particular, the intuitive model-based approach is to utilize the labeled data to estimate the system parameters Θ, and then to use this estimate to compute the coherent (non-blind) MLE. Therefore, to highlight the limitations of this strategy, we assume here that the system parameters Θ = {H, C} are fully known at the receiver. Let

F_Θ(x; r) ≜ log Pr(r | x, Θ) = ∑_{i=0}^{m−1} log{ Q( (r_i/σ_i)(b_i − h_i^T x) ) },   (3)

denote the log-likelihood objective for a given vector of one-bit observations r, where Q(·) is the Gaussian tail function and the equality is proven in [1]. The coherent MLE is then given by

x̂_ML(r) = argmax_{x ∈ M^n} F_Θ(x; r).   (4)

Although the MLE in (4) has full accurate knowledge of the parameters Θ, its computation is still challenging. The main difficulty emanates from solving the underlying optimization problem in the discrete domain, implying that the MLE requires an exhaustive search over the discrete set M^n, whose computational complexity grows exponentially with n. A common strategy to tackle the discrete optimization problem in (4) is to relax the search space to be continuous. This results in the following relaxed unconstrained MLE rule:

x̄_Θ(r) = argmax_{x ∈ R^n} F_Θ(x; r).   (5)

The optimization problem in (5) is convex due to the log-concavity of Q(·), and can thus be solved using first-order gradient optimization. In particular, the gradient of the negative log-likelihood function with respect to the unknown vector x can be compactly expressed as [1]

−∇_x F_Θ(x; r) = H^T R̃ η( R̃ (b − Hx) ),   (6)

where η is a non-linear function defined as η(x) ≜ Q′(x) ⊘ Q(x), in which the operator ⊘ denotes element-wise division, Q′(x) is the derivative of Q(x), given by the negative probability density function of a standard normal distribution, and R̃ = R C^{−1/2} is the semi-whitened version of the one-bit matrix R = Diag(r_0, ..., r_{m−1}).

As x̄_Θ(r) obtained via (5) is not guaranteed to take values in M^n, the final estimate of the symbols is obtained by applying a projection operator P_{M^n}: R^n ↦ M^n to x̄_Θ(r). This operator maps the continuous input vector onto its closest lattice point in the discrete set M^n, i.e.,

P_{M^n}(x) = argmin_{z ∈ M^n} ‖z − x‖_2.   (7)

Tackling a discrete program via continuous relaxation, as done in (5), is subject to an inherent drawback. As a case in point, one can only expect x̄_Θ(r) to provide an accurate approximation of the true MLE if the real-valued vector x̄_Θ(r) is very close to the discrete-valued MLE x̂_ML(r). In such a case, the MLE is obtained by projecting onto the lattice points in M^n. However, this is not the case in many scenarios, and specifically when the noise variance in each respective dimension is high. In other words, it is not necessarily the case that the maximizer of the objective function over the continuous domain (5) is close to the MLE, which takes values in the discrete set M^n. Note that utilizing the true system parameters leads to optimal estimates only when considering the original discrete problem (4). In fact, one can no longer argue that the true system parameters are necessarily the optimal choice for Θ in the relaxed MLE. This insight, which is obtained from the computation of the coherent MLE, is used in our derivation of the blind unfolded detector in the following section.
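To illustrate the quantities in (3), (6), and (7), the following sketch evaluates the log-likelihood, the negative log-likelihood gradient, and the BPSK projection using SciPy's Gaussian tail function. It is a minimal illustration under the notation above, not the authors' implementation.

```python
# Relaxed-MLE building blocks: objective (3), gradient (6), projection (7).
import numpy as np
from scipy.stats import norm

def log_likelihood(x, r, H, b, sigma):
    """F_Theta(x; r) = sum_i log Q((r_i / sigma_i) * (b_i - h_i^T x))."""
    u = (r / sigma) * (b - H @ x)
    return np.sum(norm.logsf(u))          # log Q(u), numerically stable

def neg_loglik_grad(x, r, H, b, sigma):
    """Gradient of the negative log-likelihood, Eq. (6): H^T R~ eta(R~(b - Hx))."""
    Rt = r / sigma                        # diagonal of R~ = R C^(-1/2)
    u = Rt * (b - H @ x)
    eta = -norm.pdf(u) / norm.sf(u)       # eta(u) = Q'(u) / Q(u), with Q' = -pdf
    return H.T @ (Rt * eta)

def project_bpsk(x):
    """Projection (7) onto M^n for the BPSK alphabet M = {-1, +1}."""
    return np.sign(x)
```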
III. PROPOSED METHODOLOGY
In this section, we present the proposed Low Resolution Detection Network, abbreviated as LoRD-Net. We begin with a high-level description of LoRD-Net in Subsection III-A. Then, we present the unfolded architecture in Subsection III-B and discuss the training procedure in Subsection III-C. Finally, we provide a discussion in Subsection III-D.
A. High-Level Description
As noted in the previous section, the intuitive approach to blind symbol detection is to utilize the labeled data {x_b^p, r_b^p}_{b=0}^{B−1} to estimate the true system model Θ, and then to recover the symbol vector x from r using the MLE. Nonetheless, the coherent MLE (4) is computationally prohibitive, while its relaxed version in (5) may be inaccurate. Alternatively, one can seek a purely data-driven strategy, using the data to train a black-box highly-parameterized DNN for detection, requiring a massive amount of labeled samples. Consequently, to facilitate accurate detection at affordable complexity and with limited data, we design LoRD-Net via model-based deep learning [57], combining the learning of a competitive objective with deep unfolding of the relaxed MLE.

Learning a competitive objective refers to the setting of the unknown system parameters Θ. However, the goal here is not to estimate the true system parameters, but rather the ones for which the solution to the relaxed MLE coincides with the true value of x. This system identification problem can be written as

Θ* = argmin_Θ (1/B) ∑_{b=0}^{B−1} ‖ x̄_Θ(r_b^p) − x_b^p ‖_2^2,   (8)

where x̄_Θ is the relaxed MLE (5). The optimization problem (8) yields a surrogate objective function F_{Θ*}, or equivalently, a set of system parameters Θ*, referred to as a competitive objective to the true F_Θ. An illustration of such a competitive objective obtained for the case of n = 1 is depicted in Fig. 2.

Fig. 2. An illustration of the relation between the optimal point of a competitive objective function (dashed blue line) and the true MLE x̂_ML obtained by an exact maximization of the log-likelihood objective function (solid black line) over the discrete set M, as well as an approximation of the MLE x̄_Θ obtained by a maximization of the log-likelihood objective function over the continuous space R, when the true transmitted symbol is s ∈ M.

The main difficulty in solving (8) stems from the fact that x̄_Θ(r) = argmax_{x ∈ R^n} F_Θ(x; r) is not differentiable with respect to the system parameters Θ. We overcome this obstacle by applying a differentiable approximation of x̄_Θ(r), or equivalently, an algorithm that approximates the argmax operator specific to our problem. Since x̄_Θ(r) can be computed by first-order gradient methods, we design a deep unfolded network [52] to compute the relaxed MLE in a manner which is differentiable with respect to Θ. The usage of deep unfolding allows not only to learn a competitive objective via (8), but also results in accurate inference with a reduced number of iterations compared to model-based first-order gradient optimization. Furthermore, the unfolded network utilizes a relatively small number of trainable parameters, thus enabling learning from small amounts of labeled samples.

B. LoRD-Net Architecture
We now present the architecture of LoRD-Net, which maps the low-resolution observations r into an estimate x̂. For given system parameters Θ, whose learning is detailed in Subsection III-C based on the competitive objective rationale described above, LoRD-Net is obtained by unfolding the iterations of a first-order optimization of the relaxed MLE (5). Our derivation thus begins by formulating the first-order method that iteratively solves (5) for a given Θ.

Let g_{φ_i}: R^n ↦ R^n be a parameterized operator defined as g_{φ_i}(x; Θ, r) = x − G_i z_Θ(x; r), where z_Θ(x; r) ≜ H^T R̃ η(R̃(b − Hx)) is the negative log-likelihood gradient from (6), G_i ∈ R^{n×n} is a positive-definite weight matrix, and φ_i = {G_i} denotes the set of parameters of the operator g_{φ_i}. Such an operator models one first-order (gradient-descent) update over the negative log-likelihood, and a first-order solver of (5) is obtained by considering a composition of such mappings of the form

x_{t+1} = G_φ^{t+1}(x_0; Θ, r) ≜ g_{φ_t} ∘ ··· ∘ g_{φ_1} ∘ g_{φ_0}(x_0; Θ, r),   (9)

where x_0 is an initial point and φ = {φ_0, ..., φ_t} is the set of parameters of the overall mapping. The mapping (9) is differentiable with respect to the system parameters Θ and its local weights φ. For a fixed number of iterations L, the resulting function G_φ^L(x_0; Θ, r) is thus differentiable with respect to the set of parameters {φ, Θ} and its input (unlike the original argmax operator). Therefore, it can be used as a differentiable approximation of x̄_Θ(r), which allows training (optimizing) over the set of its parameters using gradient-based training algorithms and the backpropagation technique.

Following the deep unfolding framework [52], the function G_φ^L(x_0; Θ, r) can be implemented as an L-layer feed-forward neural network, where the initial point x_0 and the one-bit samples r constitute the input to the network, and whose trainable parameters are given by {Θ, φ}. By (6), the i-th layer computes

g_{φ_i}(x_i; Θ, r) = x_i − G_i z_i,   with   (10)
z_i = H^T R̃ η( R̃ (b − Hx_i) ),   (11)

where the overall dynamics of LoRD-Net are given by

G_φ^L(x_0; Θ, r) = g_{φ_{L−1}} ∘ g_{φ_{L−2}} ∘ ··· ∘ g_{φ_0}(x_0; Θ, r).   (12)

Each vector x_i in (10) represents the input to the i-th layer (or equivalently, the output of the previous iteration), with x_0 being the input of the entire network (which represents the initial point for the optimization task). Upon the arrival of any new one-bit measurement r, the recovered symbols x̂ are obtained by feed-forwarding r through the L layers of LoRD-Net. In order to obtain discrete samples, the output of LoRD-Net is projected onto the feasible discrete set M^n, viz.

x̂ = P_{M^n}( G_φ^L(x_0; Θ, r) ).   (13)

An illustration of LoRD-Net is depicted in Fig. 3; a code-level sketch of the unfolded forward pass is given below.
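The following PyTorch sketch illustrates the forward dynamics (10)-(12). It assumes diagonal pre-conditioning matrices G_i (as used in the numerical study of Section IV) and a learned per-dimension noise scale; the class structure and initialization values are illustrative assumptions rather than the authors' released code.

```python
# Sketch of the unfolded LoRD-Net forward pass, Eqs. (10)-(12).
import torch

class LoRDNet(torch.nn.Module):
    def __init__(self, m, n, L, delta=0.1):
        super().__init__()
        self.H = torch.nn.Parameter(torch.randn(m, n) / m ** 0.5)  # learned "channel"
        self.log_sig = torch.nn.Parameter(torch.zeros(m))          # learned noise scale
        self.G = torch.nn.Parameter(delta * torch.ones(L, n))      # diagonal G_i per layer
        self.register_buffer("b", torch.zeros(m))                  # known thresholds

    def forward(self, r):                    # r: (batch, m) with entries in {-1, +1}
        normal = torch.distributions.Normal(0.0, 1.0)
        x = torch.zeros(r.shape[0], self.H.shape[1])               # initial point x_0
        Rt = r / torch.exp(self.log_sig)                           # rows of R~ = R C^(-1/2)
        for Gi in self.G:                                          # layers i = 0, ..., L-1
            u = Rt * (self.b - x @ self.H.T)                       # R~(b - H x_i)
            Q = (1.0 - normal.cdf(u)).clamp_min(1e-12)             # Gaussian tail Q(u)
            eta = -torch.exp(normal.log_prob(u)) / Q               # eta(u) = Q'(u)/Q(u)
            z = (Rt * eta) @ self.H                                # z_i of Eq. (11)
            x = x - Gi * z                                         # layer update, Eq. (10)
        return x                              # projected onto M^n only at inference, Eq. (13)
```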
We note that one can also propose an alternative architecture, derived by applying the projection operator P_{M^n} at the output of each layer, i.e., by defining g_{φ_i}(x_i; Θ, r) = P_{M^n}(x_i − G_i z_i). Such a setting corresponds to the unfolding of a projected gradient method. However, our numerical investigations have consistently shown that such an architecture suffers from the vanishing gradient problem during training and a significant degradation in performance. As a result, we implement LoRD-Net while applying the projection operator once, on the output of the network, and only during inference, as discussed above.

In principle, one can fix G_i = δI for some δ > 0, for which (12) represents L steps of gradient descent with step size δ. In the unfolded implementation, the weights {G_i} are tuned from data, allowing detection with fewer iterations, i.e., layers. As a result, once LoRD-Net is trained, i.e., its weight matrices φ = {G_i} and the unknown system parameters Θ are learned from data, it is capable of carrying out fast inference, owing to its hybrid model-based/data-driven structure. Furthermore, the number of iterations L is optimized to facilitate fast inference in the training procedure, as detailed in the following.

C. Training Procedure
Herein, we present the training procedure for LoRD-Net. In particular, our main goal is to perform inference of the unknown system parameters Θ based on the rationale detailed in Subsection III-A, i.e., to obtain a competitive objective. The learned competitive objective is then used to tune the weights of the unfolded network φ. Accordingly, we present a two-stage training procedure for LoRD-Net (12). Once the training of LoRD-Net is completed, it carries out symbol detection from one-bit measurements without requiring knowledge of the system parameters Θ.
1) Training Stage 1 - Learning a Competitive Objective:
The first stage corresponds to learning the unknown system parameters Θ. However, as formulated in (8), we do not seek to estimate the true values of the channel matrix H and noise covariance C, but rather to learn the surrogate values which will facilitate accurate detection using the relaxed MLE formulation. We do this by taking advantage of two properties of LoRD-Net: The first is the differentiability of the unfolded architecture with respect to Θ, which facilitates gradient-based optimization; the second is the fact that for G_i = δI, LoRD-Net essentially implements L steps of gradient descent with step size δ over the convex objective (5), and is thus expected to reach its optimum.

Based on the above properties, we fix a relatively large number of layers/iterations L for this training stage, and fix the weights φ to G_i = δI. Under this setting, the output of LoRD-Net, G^L_{φ={δI}}(x_0; Θ, r), represents an approximation of the relaxed MLE for a given parameter Θ, denoted x̄_Θ(r), i.e., we have that

x̄_Θ(r) ≈ G^L_{φ={δI}}(x_0; Θ, r).   (14)

We refer to the setting φ = {δI} used in this stage as the basic optimization policy. Note that as the number of layers grows large, the above approximation becomes more accurate. Hence, by substituting (14) into (8) and replacing x̄_Θ(r_i^p) with the corresponding outputs of LoRD-Net, we formulate the loss measure of the first training stage of LoRD-Net as

Θ* = argmin_Θ (1/B) ∑_{i=0}^{B−1} ‖ G^L_{φ={δI}}(x_0; Θ, r_i^p) − x_i^p ‖_2^2.   (15)

Owing to the differentiable nature of G^L_φ(x_0; Θ, r) with respect to Θ, we recover Θ* based on (15) using conventional gradient-based training, e.g., stochastic gradient descent with backpropagation, as detailed in our numerical evaluations description in Section IV.

Fig. 3. An illustration of LoRD-Net, where trainable system parameters and unfolded weights are highlighted in red and green colors, respectively.
2) Training Stage 2 - Learning the Unfolded Weights:
Having learned the unknown system parameters Θ in Stage 1, we turn to tuning the parameters of LoRD-Net, i.e., the set φ = {G_i}. We note that in Stage 1, the rationale was to use the basic optimization policy φ = {G_i = δI}_{i=0}^{L−1} with a large number of layers L, exploiting the insight that under this setting, LoRD-Net effectively implements conventional gradient descent. However, once Stage 1 is concluded and Θ* is learned, it is preferable to reduce the number of layers L compared to that used in Stage 1, thus exploiting the ability of unfolded networks to carry out faster inference compared to their model-based iterative counterparts by learning the weights applied in each iteration [52], [58]. Consequently, the first step in this stage is to set the number of layers to a value which can potentially be smaller than that used in the first training stage, and then to optimize the weights according to the following criterion:

φ* = argmin_φ (1/B) ∑_{i=0}^{B−1} ‖ G^L_{φ={G_l}_{l=1}^{L}}(x_0; Θ*, r_i^p) − x_i^p ‖_2^2.   (16)

Generally speaking, in order for a first-order optimizer (LoRD-Net in this case) to provide a descent direction at each iteration (layer), the pre-conditioning matrices must be positive-semidefinite, so that no iteration reverses the gradient direction. To incorporate this requirement into LoRD-Net training, we re-parameterize the pre-conditioning matrices by writing G_i = W_i W_i^T and performing the training over the matrices {W_i}. The resulting two-stage training algorithm is summarized in Algorithm 1.
Algorithm 1: Training LoRD-Net
Input: Labeled data {x_b^p, r_b^p}_{b=0}^{B−1}
Stage 1 Init: Fix (large) L, step-size δ ∈ (0, 1], and weights G_l = δI. Initialize x_0;
  Optimize Θ* via (15);  // Stage 1
Stage 2 Init: Fix (small) L. Initialize x_0;
  Set the trainable parameters to {G_i = W_i W_i^T};
  Optimize φ* according to (16);  // Stage 2
Output: LoRD-Net parameters {Θ*, φ*}

When the network is properly trained, LoRD-Net is expected to carry out learned and accelerated first-order optimization, tuned to operate even in channel conditions for which such an approach does not yield the MLE for the true channel.
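A condensed code sketch of Algorithm 1, for the LoRDNet module sketched in Subsection III-B, is given below. For clarity it keeps the same L in both stages and trains the diagonal weights directly, omitting the G_i = W_i W_i^T re-parameterization; learning rates and epoch counts are placeholders, not the paper's settings.

```python
# Two-stage training sketch: Stage 1 learns Theta = {H, C} under the basic
# policy G_i = delta*I; Stage 2 freezes Theta* and learns phi = {G_i}.
import torch

def train_lord_net(net, x_pilot, r_pilot, epochs1=100, epochs2=100, lr=1e-3):
    mse = torch.nn.MSELoss()
    # Stage 1: freeze the unfolded weights at the basic policy, Eq. (15).
    net.G.requires_grad_(False)
    opt1 = torch.optim.Adam([net.H, net.log_sig], lr=lr)
    for _ in range(epochs1):
        opt1.zero_grad()
        loss = mse(net(r_pilot), x_pilot)
        loss.backward()
        opt1.step()
    # Stage 2: freeze the learned Theta*, tune the per-layer weights, Eq. (16).
    net.H.requires_grad_(False); net.log_sig.requires_grad_(False)
    net.G.requires_grad_(True)
    opt2 = torch.optim.Adam([net.G], lr=lr)
    for _ in range(epochs2):
        opt2.zero_grad()
        loss = mse(net(r_pilot), x_pilot)
        loss.backward()
        opt2.step()
    return net
```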
D. Discussion
LoRD-Net is a data-driven acquisition system based on unfolding first-order gradient optimization methods, designed for low-resolution MIMO receivers operating without analog processing. Its model-awareness enables the receiver to learn to accurately infer from smaller training sets compared to conventional DNN architectures applied to such setups, as suggested, e.g., in [24], giving rise to the possibility of tracking block-fading channel conditions via online training, as in [43]. Furthermore, LoRD-Net differs from previously proposed deep unfolded MIMO receivers, as surveyed in [49], in two key aspects: First, LoRD-Net is particularly designed for one-bit observations, being derived from the iterative optimization formulation which arises from such setups. Second, previous unfolded MIMO receivers either assumed prior knowledge of the channel parameters, as in [47], or alternatively utilized external modules to directly estimate the CSI, as in [46]. LoRD-Net exploits the fact that, for its unfolded relaxed convex optimization algorithm to yield the desired MLE, alternative channel parameters, which differ from the true Θ, should be estimated. Consequently, the training procedure of LoRD-Net does not aim to recover the true CSI, but rather the CSI which yields a competitive objective that facilitates symbol detection, thus accounting for the overall system task.

The proposed training procedure detailed in Algorithm 1 carries out each training stage once, in a sequential manner. This strategy can be extended to optimizing the hyperparameters and the weights in an alternating fashion, i.e., repeating the stages multiple times, while using the learned φ from Stage 2 in the Stage 1 that follows. Alternatively, the hyperparameters and the weights can be learned jointly in an end-to-end manner, by optimizing (16) with respect to both Θ and φ simultaneously. The main requirement for carrying out these training strategies, compared to that detailed in Subsection III-C, is that the same number of layers L should be used when learning both Θ and φ, while when these stages are carried out once sequentially, it is preferable to use a large L in Stage 1 and a smaller value, which dictates the number of learned weights, in Stage 2. Furthermore, our numerical evaluations show that training once in a two-stage fashion via Algorithm 1 yields similar and sometimes even improved performance compared to learning both Θ and φ simultaneously in a one-stage manner, as well as compared to alternating between these two stages, as demonstrated in Section IV.

A possible extension of the training procedure is to account for ADCs with more than one bit, as well as to allow LoRD-Net to optimize the quantization thresholds b in light of the overall symbol recovery task. While accounting for multi-level ADCs is a rather simple extension achieved by reformulating the objective function (3), optimizing the quantization thresholds requires modifying the overall training strategy. The challenge here is that modifying b results in different one-bit measurements r. In a communication setup, in which periodic pilots are transmitted, one can envision gradual optimization of b between consecutive pilot sequences, using their corresponding one-bit observations to further optimize LoRD-Net. The study of LoRD-Net with multi-level ADCs and optimized thresholds is left for future work.

IV. NUMERICAL STUDY
In this section, we numerically evaluate LoRD-Net and compare its performance with state-of-the-art model-based and data-driven methodologies. As a motivating application, we focus on evaluating LoRD-Net for the blind symbol detection task in one-bit MIMO wireless communications. In the following, we first detail the considered one-bit MIMO simulation settings in Subsection IV-A, after which we evaluate the receiver performance, compare LoRD-Net to alternative unfolded architectures, and numerically investigate its training procedure in Subsections IV-B, IV-C, and IV-D, respectively.

A. Simulation Setting
We consider an uplink one-bit multi-user MIMO scenario as in (2). We focus on a single cell in which a base station (BS) equipped with m antenna elements serves n single-antenna users. Specifically, we consider two configurations: (m, n) = (128, 16), i.e., a 128 × 16 MIMO channel, and a smaller configuration with m = 64 receive antennas. The transmitted symbols of the users, represented by the unknown vector x, are randomized in an independent and identically distributed (i.i.d.) fashion from a BPSK constellation set M = {−1, +1}. The projection mapping is thus P_{M^n}(x) = sign(x), where the sign function is applied element-wise on the vector argument. In the sequel, we assume that while the channel matrix H, representing the CSI, is not available at the BS, the noise statistics C are known and fixed to C = I. Accordingly, our goal is to utilize LoRD-Net to recover the transmitted symbols from the one-bit measurements. Note that the proposed methodology can carry out the task of symbol detection even when the noise statistics C are unknown.

Channel Models:
We evaluate LoRD-Net under two channel models: (i) an i.i.d. Rayleigh fading channel, where the entries of H are i.i.d. standard normal, H ∼ N(0, I); and (ii) the COST-2100 massive MIMO channel [59]. The source code is available at: https://github.com/skhobahi/LoRD-Net.
The COST-2100 channel model is a realistic geometry-based stochastic model which accounts for prominent characteristics of massive MIMO channels, and is considered an established benchmark for evaluating MIMO communication systems. We generate the channel matrices for the COST-2100 model for a narrow-band indoor scenario with closely-spaced users, where the BS is equipped with a uniform linear array (ULA) of m omni-directional receive antenna elements. The one-bit ADC operation uses zero thresholds, i.e., b = 0. We define the signal-to-noise ratio (SNR) as

SNR = E{‖Hx‖_2^2} / E{‖n‖_2^2}.   (17)
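As an illustration of how (17) fixes the noise level under the Rayleigh model: for H with i.i.d. N(0,1) entries and BPSK symbols, E{‖Hx‖_2^2} = mn and, with C = σ²I, E{‖n‖_2^2} = mσ², so σ² = n / SNR. The sketch below (illustrative only; the SNR value is an arbitrary choice) sets the variance accordingly and checks the definition empirically.

```python
# Empirical check of the SNR definition (17) for the 128 x 16 Rayleigh setup.
import numpy as np

rng = np.random.default_rng(0)
m, n = 128, 16
snr_db = 4.0
sigma2 = n / (10 ** (snr_db / 10))        # from SNR = E||Hx||^2 / E||n||^2 = n / sigma^2

H = rng.standard_normal((m, n))           # i.i.d. Rayleigh-fading channel draw
x = rng.choice([-1.0, 1.0], size=n)       # BPSK symbols
noise = np.sqrt(sigma2) * rng.standard_normal(m)
est_snr = np.sum((H @ x) ** 2) / np.sum(noise ** 2)
print(10 * np.log10(est_snr))             # concentrates near snr_db for large m
```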
Benchmark Algorithms:
As LoRD-Net combines both model-based and data-driven inference, we compare its performance with state-of-the-art model-based and data-driven methodologies in a one-bit MIMO receiver scenario. In particular, we use the following benchmark detection algorithms:

• The model-based nML detector proposed in [30]. The nML algorithm is based on a convex relaxation of the conventional ML estimator, and requires exact knowledge of the channel parameters
Θ = {H, C}. We run the nML algorithm for a large number of iterations, with the step-size chosen using a grid search method to further improve the performance of nML, while the remaining parameters are those reported in [30].

• The data-driven Deep Soft Interference Cancellation (DeepSIC) methodology proposed in [44], with five learned interference cancellation iterations. DeepSIC is channel-model-agnostic and can be utilized for symbol detection in non-linear settings such as low-resolution quantization setups. Unlike LoRD-Net, which is designed particularly for observations of the form (2) where
Θ = {H, C} is unknown, DeepSIC has prior knowledge of neither the channel model nor its parameters.

LoRD-Net Setting:
The LoRD-Net receiver is implemented with L = 30 layers. Recall that the first training stage of LoRD-Net is concerned with finding a competitive objective by training the network over the unknown set of channel parameters Θ = {H, C}. Unless otherwise specified, we focus on the case where only H is unknown, while the correlation matrix of the noise C is available.

During the first training stage, we fix the step-size δ and recover Θ* based on the objective (15) using the Adam stochastic optimizer [60] with a constant learning rate. Next, we carry out the training of LoRD-Net during the second stage according to the objective function defined in (16), over the set of trainable parameters φ, using the Adam optimizer with mini-batch training. We consider the learning of diagonal pre-conditioning matrices (unfolded weights) during the second training stage. The network is trained for a fixed number of epochs in each of the two training stages, with the same value of L = 30 used in both stages.

B. Receiver Performance
Here, we evaluate the performance of the proposed LoRD-Net, comparing it to the aforementioned benchmarks as well as examining its dependence on the number of training samples B. In particular, we numerically evaluate the bit-error-rate (BER) performance versus SNR using different training sizes B ∈ {1024, 2048}, for both channel configurations. For DeepSIC, we use only B = 2048, while the nML receiver of [30] operates with perfect CSI, i.e., with full accurate knowledge of Θ. All data-driven receivers are trained for each SNR separately, using a dataset corresponding to that specific SNR value.

The results are depicted in Figs. 4(a) and 4(b) for the 128 × 16 channel configuration under the Rayleigh fading and COST-2100 channel models, respectively. Furthermore, the BER performance for the 64-antenna configuration under both channel models is illustrated in Fig. 5(a) for the Rayleigh fading channel, and in Fig. 5(b) for the COST-2100 channel model. Based on the results presented in Figs. 4 and 5, one can observe that LoRD-Net significantly outperforms the competing model-based and data-driven algorithms and achieves improved detection performance under both simulated channels, as well as both MIMO configurations.

In particular, the nML algorithm, which is designed to iteratively approach the MLE using ideal CSI (prior knowledge of the channel matrix), is notably outperformed by LoRD-Net. Such gains by LoRD-Net, which learns to compute the MLE from data without requiring CSI, compared to the model-based nML algorithm, demonstrate the benefits of learning a competitive objective function combined with a relaxed deep unfolded optimization process. Specifically, the results depicted in Figs. 4-5 illustrate that one can significantly improve the receiver performance by learning a new channel matrix H upon which the learned competitive objective function admits optimal points near the true symbols. The learning of the competitive objective function is possible due to the hybrid model-based/data-driven nature of LoRD-Net, and the fact that it is derived based on the unfolding of first-order optimization techniques. From a computational complexity point of view, the depicted performance of the nML algorithm in Figs. 4-5 is achieved by employing a far larger number of first-order optimization iterations, while LoRD-Net uses only L = 30 layers/iterations, exhibiting a significant reduction in the computational cost during inference as compared to the nML algorithm.

Comparing LoRD-Net to DeepSIC illustrates that LoRD-Net benefits considerably from its model-aware architecture. The fact that LoRD-Net is particularly tailored to the one-bit system model of (2) allows it to achieve improved accuracy, even when training with small amounts of data. For instance, for the 128 × 16 MIMO Rayleigh fading channel (see Fig. 4(a)), LoRD-Net trained with B = 2048 samples achieves a given BER at a notably lower SNR than DeepSIC trained with the same dataset. Considering Fig. 4(b), a similar behavior is observed in the COST-2100 channel. A similar performance gain for LoRD-Net can be observed in the 64-antenna configuration; see Fig. 5. Furthermore, it can be observed that LoRD-Net still outperforms the DeepSIC methodology even when trained on half as many training samples. In particular, for the 128 × 16 channel setup considered in this part, the total number of trainable parameters of LoRD-Net is merely |Θ = {H}| + |φ| = n(L + m) = 2528.
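This count follows directly from the architecture: the shared matrix H contributes mn weights, and the L diagonal pre-conditioners contribute n weights each, as the following worked check shows.

```latex
% Worked parameter count for m = 128, n = 16, L = 30:
|\Theta| + |\phi|
  = \underbrace{mn}_{\mathbf{H}} + \underbrace{nL}_{\text{diagonal } \{\mathbf{G}_i\}}
  = 16 \cdot 128 + 16 \cdot 30 = 2048 + 480 = 2528 .
```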
For comparison, DeepSIC, which uses and trains a multi-layer fully-connected network for each user at each interference cancellation iteration, consists here of a number of trainable parameters that is orders of magnitude larger. Such a reduction in the number of parameters allows LoRD-Net to achieve substantially improved performance with much smaller training sets, as observed in Figs. 4-5. Finally, we note that the small number of trainable parameters of LoRD-Net shows its potential for online or real-time training, as proposed in [43]. This can be achieved by using periodic pilots with minimal overhead on the communication, while inducing a relatively low computational burden in its periodic retraining.

So far, we have investigated the performance of the proposed LoRD-Net for scenarios with known noise statistics and unknown H (i.e., Θ = {H}). Next, we investigate the detection performance of LoRD-Net when both the channel and noise covariance matrices are not available, i.e., we set Θ = {H, C} and carry out the training according to the proposed two-stage methodology. Specifically, we consider the learning of a diagonally structured C in addition to the channel matrix H for this scenario. Fig. 6 demonstrates the BER versus SNR performance of LoRD-Net under both channel models, when trained using a dataset of size B = 1024. The performance of LoRD-Net for the case of Θ = {H} is further provided for comparison purposes. Observing Fig. 6, one can readily conclude that the proposed network can successfully perform the task of symbol detection also when C is unknown. Furthermore, it can be observed that a small gain in performance is achieved for both channel models when Θ = {H, C} as compared to the case of Θ = {H}, which is presumably due to the careful addition of more degrees of freedom in learning a competitive surrogate model.

C. Performance of Competing Deep Unfolded Architectures
In this part, we compare the performance of the proposed LoRD-Net with alternative deep unfolding-based architectures tailored to the problem at hand. Recall that the architecture of LoRD-Net uses trainable parameters which are shared among the different layers, as illustrated in Fig. 3. Thus, LoRD-Net is comprised of a relatively small number of trainable parameters, and uses a two-stage learning method to train the shared parameters, representing the competitive model, and the iteration-specific weights, encapsulating the first-order optimization coefficients. Nonetheless, the conventional approach for unfolding first-order optimization techniques is to over-parameterize the iterations, and then train in an end-to-end manner using the one-stage training procedure discussed earlier. Therefore, to numerically evaluate the proposed unfolding mechanism of LoRD-Net, we next compare it to two conventional unfolding-based benchmarks derived from the relaxed MLE:

• Benchmark 1: An over-parameterized deep unfolded architecture obtained by setting the computational dynamics of the i-th layer as

ḡ_{φ_i}(x_i; r) = x_i − A_i R η( R (b − B_i x_i) ).   (18)
Fig. 4. BER performance versus SNR over the 128 × 16 channel configuration: (a) Rayleigh fading; (b) COST-2100 massive MIMO. Curves: nML with perfect CSI; DeepSIC with B = 2048; LoRD-Net with B = 1024 and B = 2048.
Fig. 5. BER performance versus SNR over the smaller (m = 64) channel configuration: (a) Rayleigh fading; (b) COST-2100 massive MIMO. Curves: nML with perfect CSI; DeepSIC with B = 2048; LoRD-Net with B = 1024 and B = 2048.

Here, φ_i = {A_i ∈ R^{n×m}, B_i ∈ R^{m×n}} are the trainable parameters of the i-th layer, and R = Diag(r).

• Benchmark 2: Here, we again use the unfolded architecture given in (18), while limiting the number of trainable parameters by constraining the rank of the learned matrices. In particular, we set A_i = P_i Q_i and B_i = R_i S_i, where φ_i = {P_i ∈ R^{n×r}, Q_i ∈ R^{r×m}, R_i ∈ R^{m×r}, S_i ∈ R^{r×n}} denotes the set of trainable parameters of the i-th layer of the unfolded network. The dimension r < min(m, n) controls the rank of the resulting weight matrices {A_i, B_i}, and thus the number of trainable parameters. A code-level sketch of such a rank-constrained layer is given below.

Comparing (18) with the corresponding dynamics of LoRD-Net in (10), we note that the channel matrix H, the pre-conditioning matrices G_i, and the noise covariance matrix C are now absorbed into the per-layer trainable matrices A_i and B_i. Accordingly, these unfolded benchmarks, which follow the conventional approach for unfolding optimization algorithms, are less faithful to the underlying model. These benchmarks also differ from LoRD-Net in their number of trainable parameters. In particular, Benchmark 1 with L layers has Lnm trainable parameters, while Benchmark 2 has Lr(m + n) weights, which can be controlled by the setting of the hyperparameter r. For comparison, LoRD-Net has n(L + m) trainable parameters for the case of Θ = {H} and diagonally structured pre-conditioning matrices, while for the case of Θ = {H, C} with a diagonally structured pre-conditioning matrix and noise covariance matrix it has n(L + m) + m trainable parameters.

We evaluate the performance of the proposed LoRD-Net compared to the unfolded benchmarks in the following simulation setup. We train all the considered networks using a dataset of size B = 1024, while the highly-parameterized Benchmark 1 is also trained using B = 2048 samples. For Benchmark 2, we set r = 1. All architectures have L = 30 layers, and their performance is evaluated on the same testing dataset of size B = 2048. The unfolded benchmarks are trained in the conventional end-to-end fashion. The channel model is the 128 × 16 Rayleigh fading channel. For the considered scenario, LoRD-Net admits a total of 2528 trainable parameters, while Benchmark 1, by the counts above, has Lnm = 61440 trainable parameters (approximately 24 times more than LoRD-Net), and Benchmark 2 has Lr(m + n) = 4320 trainable parameters.
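The rank-constrained layer of Benchmark 2 can be sketched in PyTorch as follows; the factor matrices mirror (18), while the initialization scale is an illustrative assumption.

```python
# One layer of Benchmark 2: A_i = P_i Q_i and B_i = R_i S_i with inner rank r.
import torch

class LowRankLayer(torch.nn.Module):
    def __init__(self, m, n, rank):
        super().__init__()
        s = 0.01                                       # assumed init scale
        self.P = torch.nn.Parameter(s * torch.randn(n, rank))
        self.Q = torch.nn.Parameter(s * torch.randn(rank, m))
        self.Rm = torch.nn.Parameter(s * torch.randn(m, rank))
        self.S = torch.nn.Parameter(s * torch.randn(rank, n))
        self.register_buffer("b", torch.zeros(m))

    def forward(self, x, r):                 # x: (batch, n), r: (batch, m)
        normal = torch.distributions.Normal(0.0, 1.0)
        B = self.Rm @ self.S                 # B_i = R_i S_i, shape (m, n)
        u = r * (self.b - x @ B.T)           # R(b - B_i x_i), with R = Diag(r)
        Q_tail = (1.0 - normal.cdf(u)).clamp_min(1e-12)
        eta = -torch.exp(normal.log_prob(u)) / Q_tail
        A = self.P @ self.Q                  # A_i = P_i Q_i, shape (n, m)
        return x - (r * eta) @ A.T           # layer update of Eq. (18)
```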
Fig. 6. BER versus
SNR for both channel models and a training size of B = 1024 . The performance of the proposed LoRD-Net is provided for bothscenarios of training over Θ = { H } (i.e., known noise statistics C ), and over Θ = { H , C } corresponding to unknown channel matrix and noise statistics. SNR [dB]10 − − B E R LoRD-Net - Training Size B = 1024Benchmark 1 - Training Size B = 1024Benchmark 1 - Training Size B = 2048Benchmark 2 - r = 1, Training Size B = 1024
Fig. 7. BER versus
SNR of LoRD-Net compared to the unfolded benchmarkfor a (128 × Rayleigh fading channel model.
Net), while Benchmark 2 has trainable parameters.Fig 7 depicts the BER versus
SNR of LoRD-Net comparedto the unfolded benchmarks. We observe in Fig. 7 that theproposed LoRD-Net significantly outperforms the conven-tional unfolding based benchmarks,indicating the gains of theincreased level of domain knowledge Incorporated in to thearchitecture of LoRD-Net and its two stage training procedure.It is also observed that the performance of Benchmark 1increases with more training samples. Interestingly, for a smalltraining set of B = 1024 samples, Benchmark 2, whichis obtained by imposing a rank constraint on the trainableparameters of Benchmark 1, achieves improved performanceover Benchmark 1, due to its notable reduction in the numberof trainable parameters. D. Training Analysis
In this part, we analyze the performance of the proposed two-stage training procedure described in Subsection III-C.
Fig. 8. BER versus training size B for the Rayleigh fading channel, for SNR ∈ {0, 2, 4, 6, 8, 10} dB.

The training aspects of LoRD-Net are numerically evaluated for the 128 × 16 Rayleigh channel model detailed above. Following our insight on the ability of LoRD-Net to accurately train with small datasets, we begin by evaluating the performance of LoRD-Net versus the training data size B. For this study, we generate training datasets of size B ∈ {32, 64, 128, 256, 512, 1024, 2048} and evaluate the performance of LoRD-Net on a separate test set. Fig. 8 depicts the BER achieved for each training size B, for SNR ∈ {0, 2, 4, 6, 8, 10} dB. We can observe from Fig. 8 that the performance of LoRD-Net improves with B across all SNR values, where the improvements are most notable for small training sizes. Interestingly, it may be concluded from Fig. 8 that LoRD-Net is capable of accurately and reliably performing the task of symbol detection without CSI with as few as B = 512 samples. The ability of LoRD-Net to train with very few samples (compared to the black-box DNN models for one-bit MIMO receivers [22], [24], as well as the DeepSIC architecture) stems from its incorporation of domain knowledge in the design of the LoRD-Net architecture. This in turn leads to far fewer trainable parameters, requiring much fewer training samples for optimizing the network.

Next, we analyze the effect of the two-stage training methodology detailed in Algorithm 1 on the detection performance of the LoRD-Net architecture. Recall that the first training stage is concerned with finding a competitive objective function through an optimization of LoRD-Net over the unknown system parameters Θ, while the second training stage tunes the positive-definite pre-conditioning matrices φ = {G_i} to accelerate the convergence of LoRD-Net to the optimal points of the obtained competitive objective. To numerically evaluate the training methodology, we set SNR = 8 dB, and generate a training dataset of size B = 512 along with a larger testing dataset. Then, we compare the performance of Algorithm 1 with two other competing training procedures:

• One-Stage Training: Here, the weights φ and the unknown system parameters Θ are jointly learned in a single stage.
Fig. 9. Training and testing BER versus the training epoch number of LoRD-Net, Rayleigh fading channel, SNR = 8 dB, for the one-stage, two-stage, and alternating training procedures.
The objective of this one-stage training procedure for LoRD-Net is

min_{φ={G_l}, Θ} (1/B) ∑_{i=0}^{B−1} ‖ G^L_φ(x_0; Θ, r_i^p) − x_i^p ‖_2^2.   (19)

• Alternating Training: This procedure trains the network by alternating between the two optimization problems (15) and (16) consecutively with respect to each training epoch. Here, the network is trained over multiple such alternations. Namely, at each epoch index i, we update the variables Θ for odd i and update φ for even i.

Before we proceed with the evaluation results, we provide some useful connections to notions widely used in the deep learning literature. Generally speaking, the performance of a statistical learning framework and its training procedure is evaluated using its generalization gap and testing error. The generalization gap of a model can be defined as the difference between the training and testing errors. Specifically, a model with a smaller generalization gap and a smaller testing error is highly favourable. Furthermore, a higher generalization gap may indicate that the network has over-fitted to the data, and hence does not generalize well. For two models with the same generalization gap, the one with the lower testing error is favourable. Fig. 9 depicts the BER versus the training epoch for both the training and testing datasets. We first note that the proposed two-stage training method outperforms all other competing procedures and attains a significantly lower testing error. Interestingly, one can observe that the proposed methodology has successfully closed the generalization gap, as the testing and training errors are very close to each other. On the other hand, the other two training procedures admit a very large generalization gap, indicating that their utilization has resulted in an over-fitting of the network to the data. Furthermore, it can be observed from Fig. 9 that the major improvement in the detection accuracy of LoRD-Net takes place during the first training stage, when finding a competitive objective function, with a slight additional improvement in the BER achieved during the second stage.

The success of the proposed two-stage training procedure in closing the generalization gap compared to the one-stage training procedure is presumably due to the fact that the two-stage approach leads to an implicit regularization on the model capacity, limiting the total number of parameters used during the entire training procedure. On the contrary, the one-stage training procedure allows the neural network to use its full capacity, leading to over-fitting and a larger generalization gap, as observed in Fig. 9.

As discussed in Subsection III-C, the second training stage allows LoRD-Net to achieve fast inference, i.e., accelerated convergence to the optimal points of the competitive objective function. To illustrate this behavior, we perform a per-layer BER evaluation of LoRD-Net, exploiting its interpretable model-based nature, in which each layer represents an unfolded first-order optimization iteration, and thus its output can be used as an estimate of the transmitted symbols. Figs. 10(a) and 10(b) depict the BER versus the layer/iteration number of LoRD-Net at the completion of training stages 1 and 2, for the Rayleigh fading channel and the COST-2100 channel model, respectively.
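The per-layer evaluation exploits the fact that every intermediate x_i is a valid symbol estimate. A sketch of this procedure for the LoRDNet module of Subsection III-B (a hypothetical helper, not from the released code) is:

```python
# Per-layer BER: rerun the forward recursion, recording the projected
# estimate after every layer. Assumes the LoRDNet module defined earlier.
import torch

def per_layer_ber(net, r, x_true):
    normal = torch.distributions.Normal(0.0, 1.0)
    x = torch.zeros(r.shape[0], net.H.shape[1])
    Rt = r / torch.exp(net.log_sig)
    bers = []
    with torch.no_grad():
        for Gi in net.G:
            u = Rt * (net.b - x @ net.H.T)
            eta = -torch.exp(normal.log_prob(u)) / (1.0 - normal.cdf(u)).clamp_min(1e-12)
            x = x - Gi * ((Rt * eta) @ net.H)
            bers.append((torch.sign(x) != x_true).float().mean().item())
    return bers  # BER after each of the L layers
```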
As discussed in Subsection III-C, the second training stage allows LoRD-Net to achieve fast inference, i.e., accelerated convergence to the optimal points of the competitive objective function. To illustrate this behavior, we perform a per-layer BER evaluation of LoRD-Net, exploiting its interpretable model-based nature: each layer represents an unfolded first-order optimization iteration, and thus its output can be used as an estimate of the transmitted symbols. Figs. 10(a) and 10(b) depict the BER versus the layer/iteration number of LoRD-Net at the completion of training stages 1 and 2, for the Rayleigh fading channel and the COST-2100 channel model, respectively.

We observe in Fig. 10 that the convergence of LoRD-Net after the completion of the first training stage is slow, requiring at least L = 30 layers/iterations. Interestingly, we note from Fig. 10 that the second training stage indeed accelerates the convergence of LoRD-Net by learning, in an end-to-end manner, the best set of pre-conditioning matrices for the problem at hand. In particular, after the completion of the second training stage, LoRD-Net can accurately and reliably recover the symbols with only a few layers. This observation suggests that LoRD-Net can be further truncated after training so as to reduce the computational complexity while maintaining its superior performance.

To quantify the quality of the learned competitive objective in closing the gap between the discrete optimization problem and its continuous version, we further provide the per-iteration performance of the nML algorithm and of LoRD-Net when both operate with perfect CSI. For this scenario, LoRD-Net utilizes the true Θ, and is thus optimized only over the weights φ while employing the exact channel model H. It is observed from Figs. 10(a)-10(b) that learning a new surrogate model for the continuous optimization problem at hand is indeed highly beneficial and yields far superior performance in recovering the transmitted symbols. The analysis provided in Fig. 10 further supports the rationale behind the proposed two-stage training methodology, and the fact that the second training stage accelerates the underlying first-order optimization solver upon which the layers of LoRD-Net are based (i.e., it achieves a much faster descent per step).

Fig. 10. BER performance of LoRD-Net after completing training stages 1 and 2 versus the layer/iteration number for (a) the Rayleigh fading channel, and (b) the COST-2100 massive MIMO channel, with SNR = 8 dB. [Curves: nML with perfect CSI, LoRD-Net with perfect CSI, and LoRD-Net after training stages 1 and 2.]
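Since each layer's output is itself a symbol estimate, the per-layer evaluation is straightforward to reproduce on the toy model above. The sketch below, again an illustrative assumption reusing LoRDNetSketch and the synthetic pilots from the previous sketch rather than the paper's code, computes a hard-decision BER after every unfolded layer and selects the earliest layer at which the network could be truncated.

```python
import torch

@torch.no_grad()
def per_layer_ber(net, r, x_true):
    """Hard-decision BER after each unfolded layer; meaningful because every
    layer is one first-order iteration whose output estimates the symbols."""
    x = torch.zeros_like(r)
    bers = []
    for G_l in net.G:
        x = x + torch.tanh(r - x @ net.theta.T) @ G_l.T   # same layer update as in training
        bers.append((torch.sign(x) != x_true).float().mean().item())
    return bers

bers = per_layer_ber(net, r_pilot, x_pilot)
# Truncate to the earliest layer whose BER is within 5% of the final layer's BER;
# the generator below always terminates, since the last layer trivially qualifies.
k = 1 + next(i for i, b in enumerate(bers) if b <= 1.05 * bers[-1])
print("per-layer BER:", ["%.3f" % b for b in bers])
print(f"network can be truncated to the first {k} layers")
```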
V. CONCLUSION

In this work, we introduced LoRD-Net, a hybrid data-driven and model-based deep architecture for blind symbol detection from one-bit observations. The proposed methodology is based on the unfolding of first-order optimization iterations for the recovery of the MLE. We proposed a two-stage training procedure that incorporates the learning of a competitive objective function, for which the unfolded network yields an accurate recovery of the transmitted symbols from one-bit noisy measurements. In particular, owing to its model-based
nature, LoRD-Net has far fewer trainable parameters compared to its data-driven counterparts, and can be trained with very few training samples. Our numerical results demonstrate that the proposed LoRD-Net architecture outperforms state-of-the-art model-based and data-driven symbol detectors in multi-user one-bit MIMO systems. We also numerically illustrate the benefits of the proposed two-stage training procedure, which allows training with small training sets and fast inference, owing to the interpretable model-aware nature of the network.
REFERENCES
[1] S. Khobahi, N. Naimipour, M. Soltanalian, and Y. C. Eldar, “Deep signal recovery with one-bit quantization,” in Proc. IEEE ICASSP, May 2019, pp. 2987–2991.
[2] Y. C. Eldar, Sampling Theory: Beyond Bandlimited Systems. Cambridge University Press, 2015.
[3] R. H. Walden, “Analog-to-digital converter survey and analysis,” IEEE J. Sel. Areas Commun., vol. 17, no. 4, pp. 539–550, 1999.
[4] J. G. Andrews, S. Buzzi, W. Choi, S. V. Hanly, A. Lozano, A. C. Soong, and J. C. Zhang, “What will 5G be?” IEEE J. Sel. Areas Commun., vol. 32, no. 6, pp. 1065–1082, June 2014.
[5] Y.-S. Jeon, N. Lee, S.-N. Hong, and R. W. Heath, “One-bit sphere decoding for uplink massive MIMO systems with one-bit ADCs,” IEEE Trans. Wireless Commun., vol. 17, no. 7, pp. 4509–4521, 2018.
[6] S. Rao, G. Seco-Granados, H. Pirzadeh, and A. L. Swindlehurst, “Massive MIMO channel estimation with low-resolution spatial sigma-delta ADCs,” arXiv preprint arXiv:2005.07752, 2020.
[7] A. Ameri, A. Bose, J. Li, and M. Soltanalian, “One-bit radar processing with time-varying sampling thresholds,” IEEE Trans. Signal Process., vol. 67, no. 20, pp. 5297–5308, 2019.
[8] B. Jin, J. Zhu, Q. Wu, Y. Zhang, and Z. Xu, “One-bit LFMCW radar: Spectrum analysis and target detection,” IEEE Trans. Aerosp. Electron. Syst., 2020.
[9] F. Xi, N. Shlezinger, and Y. C. Eldar, “BiLiMO: Bit-limited MIMO radar via task-based quantization,” arXiv preprint arXiv:2010.00195, 2020.
[10] P. Xiao, B. Liao, and J. Li, “One-bit compressive sensing via Schur-concave function minimization,” IEEE Trans. Signal Process., vol. 67, no. 16, pp. 4139–4151, 2019.
[11] S. Khobahi and M. Soltanalian, “Model-based deep learning for one-bit compressive sensing,” IEEE Trans. Signal Process., vol. 68, pp. 5292–5307, 2020.
[12] N. Shlezinger, Y. C. Eldar, and M. R. Rodrigues, “Hardware-limited task-based quantization,” IEEE Trans. Signal Process., vol. 67, no. 20, pp. 5223–5238, 2019.
[13] S. Salamatian, N. Shlezinger, Y. C. Eldar, and M. Médard, “Task-based quantization for recovering quadratic functions using principal inertia components,” in Proc. IEEE ISIT, 2019.
[14] N. Shlezinger and Y. C. Eldar, “Deep task-based quantization,” Entropy, vol. 23, no. 1, 2021.
[15] N. Shlezinger, R. J. G. van Sloun, I. A. M. Hujiben, G. Tsintsadze, and Y. C. Eldar, “Learning task-based analog-to-digital conversion for MIMO receivers,” in Proc. IEEE ICASSP, 2020.
[16] P. Neuhaus, N. Shlezinger, M. Dörpinghaus, Y. C. Eldar, and G. Fettweis, “Task-based analog-to-digital converters,” arXiv preprint arXiv:2009.14088, 2020.
[17] T. Gong, N. Shlezinger, S. S. Ioushua, M. Namer, Z. Yang, and Y. C. Eldar, “RF chain reduction for MIMO systems: A hardware prototype,” IEEE Syst. J., 2020.
[18] S. S. Ioushua and Y. C. Eldar, “A family of hybrid analog–digital beamforming methods for massive MIMO systems,” IEEE Trans. Signal Process., vol. 67, no. 12, pp. 3243–3257, 2019.
[19] H. Wang, N. Shlezinger, Y. C. Eldar, S. Jin, M. F. Imani, I. Yoo, and D. R. Smith, “Dynamic metasurface antennas for MIMO-OFDM receivers with bit-limited ADCs,” IEEE Trans. Commun., 2020.
[20] N. Shlezinger, G. C. Alexandropoulos, M. F. Imani, Y. C. Eldar, and D. R. Smith, “Dynamic metasurface antennas for 6G extreme massive MIMO communications,” IEEE Wireless Commun. Mag., 2021.
[21] J. Liu, Z. Luo, and X. Xiong, “Low-resolution ADCs for wireless communication: A comprehensive survey,” IEEE Access, vol. 7, pp. 91291–91324, 2019.
[22] Y. Zhang, M. Alrabeiah, and A. Alkhateeb, “Deep learning for massive MIMO with 1-bit ADCs: When more antennas need fewer pilots,” 2020.
[23] A. Klautau, N. González-Prelcic, A. Mezghani, and R. W. Heath, “Detection and channel equalization with deep learning for low resolution MIMO systems,” IEEE, 2018, pp. 1836–1840.
[24] E. Balevi and J. G. Andrews, “One-bit OFDM receivers via deep learning,” IEEE Trans. Commun., vol. 67, no. 6, pp. 4326–4336, 2019.
[25] ——, “Two-stage learning for uplink channel estimation in one-bit massive MIMO,” arXiv preprint arXiv:1911.12461, 2019.
[26] ——, “Autoencoder-based error correction coding for one-bit quantization,” IEEE Trans. Commun., 2020.
[27] D. Kim and N. Lee, “Machine learning based detections for mmWave two-hop MIMO systems using one-bit transceivers,” in Proc. IEEE SPAWC, 2019.
[28] L. V. Nguyen, A. L. Swindlehurst, and D. H. Nguyen, “SVM-based channel estimation and data detection for one-bit massive MIMO systems,” arXiv preprint arXiv:2003.10678, 2020.
[29] ——, “Linear and deep neural network-based receivers for massive MIMO systems with one-bit ADCs,” arXiv preprint arXiv:2008.03757, 2020.
[30] J. Choi, J. Mo, and R. W. Heath, “Near maximum-likelihood detector and channel estimator for uplink multiuser massive MIMO systems with one-bit ADCs,” IEEE Trans. Commun., vol. 64, no. 5, pp. 2005–2018, 2016.
[31] C. Risi, D. Persson, and E. G. Larsson, “Massive MIMO with 1-bit ADC,” arXiv preprint arXiv:1404.7736, 2014.
[32] S. Jacobsson, G. Durisi, M. Coldrey, U. Gustavsson, and C. Studer, “One-bit massive MIMO: Channel estimation and high-order modulations,” in Proc. IEEE ICCW, 2015, pp. 1304–1309.
[33] M. T. Ivrlac and J. A. Nossek, “On MIMO channel estimation with single-bit signal-quantization,” in ITG Smart Antenna Workshop, 2007.
[34] J. Mo, P. Schniter, N. G. Prelcic, and R. W. Heath, “Channel estimation in millimeter wave MIMO systems with one-bit quantization,” IEEE, 2014, pp. 957–961.
[35] A. Mezghani and A. L. Swindlehurst, “Blind estimation of sparse broadband massive MIMO channels with ideal and one-bit ADCs,” IEEE Trans. Signal Process., vol. 66, no. 11, pp. 2972–2983, 2018.
[36] Y. Li, C. Tao, G. Seco-Granados, A. Mezghani, A. L. Swindlehurst, and L. Liu, “Channel estimation and performance analysis of one-bit massive MIMO systems,” IEEE Trans. Signal Process., vol. 65, no. 15, pp. 4075–4089, 2017.
[37] S. Jacobsson, G. Durisi, M. Coldrey, U. Gustavsson, and C. Studer, “Throughput analysis of massive MIMO uplink with low-resolution ADCs,” IEEE Trans. Wireless Commun., vol. 16, no. 6, pp. 4038–4051, 2017.
[38] J. Mo, P. Schniter, and R. W. Heath, “Channel estimation in broadband millimeter wave MIMO systems with few-bit ADCs,” IEEE Trans. Signal Process., vol. 66, no. 5, pp. 1141–1154, 2017.
[39] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
[40] N. Farsad and A. Goldsmith, “Detection algorithms for communication systems using deep learning,” arXiv preprint arXiv:1705.08044, 2017.
[41] V. Corlay, J. J. Boutros, P. Ciblat, and L. Brunel, “Multilevel MIMO detection with deep learning,” IEEE, 2018, pp. 1805–1809.
[42] Y. Liao, N. Farsad, N. Shlezinger, Y. C. Eldar, and A. J. Goldsmith, “Deep neural network symbol detection for millimeter wave communications,” arXiv preprint arXiv:1907.11294, 2019.
[43] N. Shlezinger, N. Farsad, Y. C. Eldar, and A. J. Goldsmith, “ViterbiNet: A deep learning based Viterbi algorithm for symbol detection,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3319–3331, 2020.
[44] N. Shlezinger, R. Fu, and Y. C. Eldar, “DeepSIC: Deep soft interference cancellation for multiuser MIMO detection,” IEEE Trans. Wireless Commun., 2020.
[45] N. Shlezinger, N. Farsad, Y. C. Eldar, and A. J. Goldsmith, “Data-driven factor graphs for deep symbol detection,” in Proc. IEEE ISIT, 2020.
[46] H. He, C.-K. Wen, S. Jin, and G. Y. Li, “Model-driven deep learning for joint MIMO channel estimation and signal detection,” arXiv preprint arXiv:1907.09439, 2019.
[47] N. Samuel, T. Diskin, and A. Wiesel, “Learning to detect,” IEEE Trans. Signal Process., vol. 67, no. 10, pp. 2554–2564, 2019.
[48] S. Takabe, M. Imanishi, T. Wadayama, R. Hayakawa, and K. Hayashi, “Trainable projected gradient detector for massive overloaded MIMO channels: Data-driven tuning approach,” IEEE Access, vol. 7, pp. 93326–93338, 2019.
[49] A. Balatsoukas-Stimming and C. Studer, “Deep unfolding for communications systems: A survey and some new directions,” arXiv preprint arXiv:1906.05774, 2019.
[50] N. Farsad, N. Shlezinger, A. J. Goldsmith, and Y. C. Eldar, “Data-driven symbol detection via model-based machine learning,” arXiv preprint arXiv:2002.07806, 2020.
[51] J. R. Hershey, J. L. Roux, and F. Weninger, “Deep unfolding: Model-based inspiration of novel deep architectures,” arXiv preprint arXiv:1409.2574, 2014.
[52] V. Monga, Y. Li, and Y. C. Eldar, “Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing,” IEEE Signal Process. Mag., 2020.
[53] N. Naimipour, S. Khobahi, and M. Soltanalian, “Unfolded algorithms for deep phase retrieval,” arXiv preprint arXiv:2012.11102, 2020.
[54] C. Agarwal, S. Khobahi, A. Bose, M. Soltanalian, and D. Schonfeld, “Deep-URL: A model-aware approach to blind deconvolution based on deep unfolded Richardson-Lucy network,” IEEE, 2020, pp. 3299–3303.
[55] S. Khobahi, A. Bose, and M. Soltanalian, “Deep radar waveform design for efficient automotive radar sensing,” IEEE, 2020, pp. 1–5.
[56] N. Naimipour, S. Khobahi, and M. Soltanalian, “UPR: A model-driven architecture for deep phase retrieval,” arXiv preprint arXiv:2003.04396, 2020.
[57] N. Shlezinger, J. Whang, Y. C. Eldar, and A. G. Dimakis, “Model-based deep learning,” arXiv preprint arXiv:2012.08405, 2020.
[58] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proc. 27th International Conference on Machine Learning. Omnipress, 2010, pp. 399–406.
[59] J. Flordelis, X. Li, O. Edfors, and F. Tufvesson, “Massive MIMO extensions to the COST 2100 channel model: Modeling and validation,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 380–394, 2019.
[60] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.