Learning a Probabilistic Relaxation of Discrete Variables for Soft Detection with Low Complexity: CMDNet
Edgar Beck, Student Member, IEEE, Carsten Bockelmann, Member, IEEE, and Armin Dekorsy, Senior Member, IEEE
Abstract—Following the great success of Machine Learning (ML), especially Deep Neural Networks (DNNs), in many research domains in the 2010s, several learning-based approaches were proposed for detection in large inverse linear problems, e.g., massive MIMO systems. The main motivation is that the complexity of Maximum A-Posteriori (MAP) detection grows exponentially with system dimensions. Instead of using DNNs, which are essentially black boxes in their most basic form, we take a slightly different approach and introduce a probabilistic Continuous relaxation of disCrete variables to MAP detection. Enabling close approximation and continuous optimization, we derive an iterative detection algorithm: ConCrete MAP Detection (CMD). Furthermore, by extending CMD with the idea of deep unfolding, we allow for (online) optimization of a small number of parameters to different working points while limiting complexity. In contrast to recent DNN-based approaches, we select the optimization criterion and output of CMD based on information theory and are thus able to learn approximate probabilities of the individual optimal detector. This is crucial for soft decoding in today's communication systems. Numerical simulation results in MIMO systems reveal CMD to feature a promising performance-complexity trade-off compared to SotA. Notably, we demonstrate CMD's soft outputs to be reliable for decoders.
Index Terms—Maximum a-posteriori (MAP), individual optimal, massive MIMO, concrete distribution, Gumbel-softmax, machine learning, neural networks
I. INTRODUCTION

COMMUNICATIONS is a long-standing engineering discipline whose theoretical foundation was laid by Claude Shannon with his landmark paper "A Mathematical Theory of Communication" [1]. Since then, the theory has evolved into its own field, known today as information theory, and has found its way into many other research areas where data or information is processed, including artificial intelligence and especially its subdomain Machine Learning (ML). Information theory relies heavily on description with probabilistic models, which play a significant role in the design of new generations of cellular communication systems from 2G to 6G with respective increases in data rate. Probabilistic models have also shown to be advantageous in the ML research domain. Accordingly, both fields, communications and ML, have touched repeatedly in the past, e.g., [2], [3], [4].
This work was partly funded by the German Ministry of Education and Research (BMBF) under grant 16KIS1028 (MOMENTUM). The authors are with the Department of Communications Engineering, University of Bremen, 28359 Bremen, Germany (e-mail: {beck, bockelmann, dekorsy}@ant.uni-bremen.de).
In the early 2010s, a special class of these models gave rise to several breakthroughs in data-driven ML research: Deep Neural Networks (DNNs). Inspired by the brain, several layers of artificial neurons are stacked on top of each other to create an expressive feed-forward DNN able to approximate arbitrarily well [5] and thus to learn higher levels of abstraction, i.e., features, present in data [6]. This is of crucial importance for tasks where there are no well-established models but data to be collected. Previously considered intractable to optimize, dedicated hardware and software, i.e., Graphics Processing Units (GPUs) and automatic differentiation frameworks [7], innovations in DNN models [8], [9] and advancements in training [8] have made it possible to build algorithms that equal or even surpass human performance in specific tasks such as pattern recognition [10] and playing games [11]. The impact included all ML subdomains, e.g., classification [9], [10] in supervised learning, generative modeling in unsupervised learning [12] and Q-learning in reinforcement learning [11].
A. ML in Communications
The great success of DNNs in many domains has stimulated a large amount of work in communications in recent years [6]. Especially in problems with a model deficit, e.g., detection in molecular and fiber-optical channels [13], [14], or without any known analytical solution, e.g., finding codes for AWGN channels with feedback [15], DNNs have already proven to allow for promising application. Notably, the authors of the early work [16] demonstrate a complete communication system design by interpreting transmitter, channel and receiver as an autoencoder which is trained end-to-end similar to one DNN. The resulting encodings are shown to reach the BER performance of handcrafted systems in a simple AWGN scenario. A model-free approach based on reinforcement learning is proposed in [17]. Using advances in unsupervised learning, also blind channel equalization can be improved [18].

In contrast to typical ML research areas, a model deficit does not apply to wireless communications. The models, e.g., AWGN, describe reality well and enable the development of optimized algorithms. However, this approach has its limits, and the algorithms may be too complex to be implemented. This algorithm deficit applies to the core problem typical for communications: classification in large inverse problems. Therefore, it is crucial to find an approximate solution with an excellent trade-off between performance and complexity.

B. Related Work
A prominent example of large inverse problems under current deep investigation, and a key enabler for better spectral efficiency in 5G/6G, are massive Multiple Input Multiple Output (MIMO) systems [19]. In an uplink scenario, a Base Station (BS) is equipped with a very large number of antennas and simultaneously serves multiple single-antenna User Equipments (UEs) on the same time-frequency resource. As a first step in receiver design, different tasks such as channel equalization/estimation and decoding are typically separated to lower complexity. But still, an algorithm deficit applies to both MIMO detection and decoding of large block-length codes, e.g., LDPC and Polar codes, since Maximum A-Posteriori (MAP) detection has high computational complexity growing exponentially with system or code dimensions. Even its efficient implementation, the Sphere Decoder (SD), remains too complex in such a scenario [20].

Hence, in communications history, many suboptimal solutions have been proposed to overcome the complexity bottleneck of the optimal detectors. One key approach is to relax the discrete Random Variables (RVs) to be continuous: Remarkable examples include Matched Filter (MF), Zero Forcing and MMSE equalization. But linear equalization with subsequent detection shows bad performance, especially in large symmetric systems. A heuristic based on the latter is the V-BLAST algorithm, which first equalizes and then successively detects the layer with the largest Signal-to-Noise Ratio (SNR) to reduce interference iteratively.
A more efficient and sophisticated implementation, MMSE Ordered Successive Interference Cancellation (MOSIC), is based on a sorted QR decomposition of an MMSE-extended system matrix with post sorting and offers a good trade-off between complexity and performance [21].

Pursuing another philosophy of mathematical optimization, the SemiDefinite Relaxation (SDR) technique [22] treats MIMO detection as a non-convex homogeneous quadratically constrained quadratic problem and relaxes it to be convex by dropping the only non-convex requirement. Proving to be a close approximation, SDR is more complex than MOSIC and is solved by interior point methods from convex optimization.

Furthermore, probabilistic model-based ML techniques were also introduced to improve the trade-off and to integrate detection seamlessly with decoding: Mean Field Variational Inference (MFVI) provides a theoretical derivation of soft Successive Interference Cancellation (SIC), and the Bethe approach lays the foundation for loopy belief propagation [23]. Simplifying the latter, Approximate Message Passing (AMP) is derived, known to be optimal for large system dimensions in i.i.d. Gaussian channels and computationally cheap [24]. As a further benefit, soft outputs are computed, today a strict requirement to account for subsequent soft decoding. But in practice, the performance of probabilistic approximations like MFVI and AMP suffers if the approximating conditions are not met, i.e., from the fully-connected graph structure and finite dimensions in MIMO systems, respectively.

More recent work considers DNNs for application in MIMO systems and focuses on the idea of deep unfolding [25], [26]. In deep unfolding, the number of iterations of a model-based iterative algorithm is fixed and its parameters untied.
Further, it is enriched with additional weights and non-linearities to create a computationally efficient DNN optimized for performance improvements in MIMO detection [27], [28], belief propagation decoding [29], [30], [31] and MMSE channel estimation [32]. The former approach, DetNet, a generic DNN model with a large number of trainable parameters based on an unfolded projected gradient descent, proves DNNs to allow for a promising trade-off between performance and complexity. In [33], unfolding of an extension of AMP to unitarily-invariant channels, the Orthogonal AMP (OAMP), into OAMPNet is proposed, adding only a few trainable parameters per layer. Although offering promising performance, the complexity bottleneck of one matrix inversion per iteration makes this model-driven approach rather unattractive compared to DetNet. Another DNN-like network, MMNet, is inspired by iterative soft thresholding algorithms and AMP [34]: Striking the balance between expressiveness and complexity, and exploiting spectral and temporal locality, MMNet can be trained online for any realistic channel realization if the coherence time is large enough. Since implementation of online training is not trivial, we focus in this work on the offline learned version MMNet (iid). One major drawback of the latter approaches is that they focus on MIMO detection and do not provide soft outputs.

C. Main Contributions
The main contributions of this article are manifold: Inspired by recent ML research, we first introduce a CONtinuous relaxation of the probability mass function (pmf) of the disCRETE RVs by a probability density function (pdf) from [35], [36] to the MAP detection problem. The proposed CONCRETE relaxation offers many favorable properties: On the one hand, the pdf of continuous RVs converges to the exact pmf in the parameter limit. On the other hand, we notice good algorithmic properties like avoiding marginalization and allowing for differentiation instead. By this means, we replace exhaustive search by computationally cheaper continuous optimization to approximately solve the MAP problem in any probabilistic non-linear model. We name our approach Concrete MAP Detection (CMD).

Second, following the idea of deep unfolding, we unfold the gradient descent algorithm into a DNN-like model with a fixed number of iterations to allow for parameter optimization and to further improve detection performance while limiting detection complexity. By this means, we are able to combine the advantages of DNNs and model-based approaches. As the number of parameters is small, we are able to dynamically adapt them to easily adjust CMD to different working points. Further, the resulting structure allows for fast online training of CMD.

Third, we derive the optimization criterion from an information theoretic perspective and are hence able to provide probabilities of detection, i.e., reliable soft outputs. We show that optimization is then equivalent to learning an approximation of the Individual Optimal (IO) detector. This allows us to account for subsequent decoding, e.g., in MIMO systems, in contrast to the literature [28], [34].
Finally, we provide numerical simulation results for the use of CMD in MIMO systems including a variety of simulation setups, e.g., correlated channels, revealing CMD to be a generic and promising approach competitive to the State of the Art (SotA). Notably, we show superiority to other recently proposed ML-based approaches and demonstrate with simulations in coded systems that CMD's soft outputs are reliable for decoders, as opposed to [28]. Furthermore, by estimating the computational complexity, we prove CMD to feature an excellent trade-off between performance and complexity. Notably, only the Matched Filter has lower complexity.

In the following, we first introduce the concrete relaxation to MAP detection in Section II using the example of an inverse linear problem. In Section III, we follow a different route and explain how to learn the posterior, i.e., replacing it by some tractable approximation. To yield a suitable model for this approximation, we propose to unfold CMD, which we are then able to train by variants of Stochastic Gradient Descent (SGD). Finally, in Sections IV and V, we provide numerical results of the bit error performance in comparison to other SotA approaches using the example of uncoded and coded MIMO systems and summarize the main results, respectively.

II. CONCRETE RELAXATION OF MAP PROBLEM
A. System Model and Problem Statement
To motivate the concrete relaxation, we consider a linear complex-valued observation model typically encountered in communications, e.g., MIMO systems, first excluding coding:

y = Hx + n.   (1)

Here, we assume x to be a normalized multivariate discrete RV, i.e., x = \{x_n\}_{n=1}^{N_T} with E[|x_n|^2] = 1, whose i.i.d. elements are from a set \mathcal{M}, e.g., 16-QAM or 8-PSK. A linear channel H \in \mathbb{C}^{N_R \times N_T} with i.i.d. Gaussian taps h_{mn} \sim \mathcal{CN}(0, 1/N_R) introduces correlation. Then, the resulting RV is superimposed by Gaussian noise n \sim \mathcal{CN}(0, \sigma_n^2 I_{N_R}) with variance \sigma_n^2. The matrix I_{N_R} denotes the identity matrix of dimension N_R \times N_R.

To detect the discrete multivariate RV x given the linear observation y \in \mathbb{C}^{N_R \times 1}, there exist two optimal detectors from a probabilistic Bayesian viewpoint: First, we can use Bayes' rule to formulate the MAP problem

\hat{x} = \arg\max_{x \in \mathcal{M}^{N_T \times 1}} p(x | y)   (2a)
        = \arg\max_{x \in \mathcal{M}^{N_T \times 1}} p(y | x) \cdot p(x)   (2b)
        = \arg\min_{x \in \mathcal{M}^{N_T \times 1}} -\ln p(y | x) - \ln p(x)   (2c)

with

p(y | x) = \frac{1}{\pi^{N_R} \sigma_n^{2 N_R}} \, e^{-\frac{1}{\sigma_n^2}(y - Hx)^H (y - Hx)}   (3)

as the likelihood function and p(x) as the a-priori pmf. Since the RV is discrete, i.e., x_n \in \mathcal{M}, an exhaustive search over all element combinations is required to solve the MAP problem, which becomes computationally intractable for large system dimensions. Note that the Sphere Detector (SD) provides an efficient implementation [20]. Second, we notice that the MAP detector only delivers the most likely transmitted vector x and hence hard decisions. In coded systems with soft decoders usually employed today, delivering soft information is a strict requirement. This brings us to the Individual Optimal (IO) detector, obtained by evaluating the marginal posterior distribution w.r.t. every single x_n:

\hat{x}_n = \arg\max_{x_n \in \mathcal{M}} p(x_n | y) = \arg\max_{x_n \in \mathcal{M}} \frac{\sum_{x \setminus x_n} p(y | x) \cdot p(x)}{\sum_{x_n} \sum_{x \setminus x_n} p(y | x) \cdot p(x)}.   (4)

This detector is optimal in terms of minimizing the Symbol Error Rate (SER) per individual symbol without coding, and it further delivers probabilities as soft output, in contrast to MAP detection. However, it has higher complexity due to the required marginalization w.r.t. x. Since the MAP detector performance coincides with the IO detector in the high SNR regime and is of lower complexity, we restrict to the MAP detector as a benchmark in simulations without coding.

B. Concrete Distribution
We now focus on the following question to improve the performance-complexity trade-off: How to model the prior information p(x) accurately by some approximation p(\tilde{x})? In [37], we proposed to use ML tricks from [35], [36] to achieve this and to make inference computationally tractable. The idea was recently discovered in the ML community in the context of unsupervised learning of generative models [35], [36]. There, marginalization to compute the objective function, the evidence, becomes intractable. Therefore, the Evidence is replaced by its Lower BOund (ELBO) by means of an auxiliary posterior function. But optimizing w.r.t. the ELBO results in high variance of the gradient estimators. For variance reduction, the so-called reparametrization trick is used and leads to an optimization structure similar to an autoencoder, known as the variational autoencoder [23]. There, the stochastic node is reparametrized by a continuous stochastic variable, e.g., a Gaussian, and its parameters, e.g., mean and variance. In contrast to continuous variables, reparametrization of discrete RVs is not possible. Hence, a CONtinuous relaxation of disCRETE variables, the CONCRETE distribution, was proposed in [35], [36] independently.

To explain the introduction of this relaxation to the MAP problem, let us assume that we have the discrete binary RV x \in \mathcal{M} with \mathcal{M} = \{-1, +1\}. Further, we define the discrete RV z as a one-hot vector where all elements are zero except for one element, i.e., z \in \{0, 1\}^{2 \times 1} with the two possible realizations z = [1, 0]^T and z = [0, 1]^T. In addition, we describe the values of \mathcal{M} by the representer vector m = [-1, +1]^T. That way, we can write x = z^T m, e.g., x = [1, 0] \cdot [-1, +1]^T = -1. Now, the one-hot vector z \in \{0, 1\}^{M \times 1} represents a categorical RV with M = |\mathcal{M}| classes. Connecting Monte Carlo methods to optimization [35], the Gumbel-Max trick states that we are able to generate samples, i.e., classes, of such a categorical RV or pmf p(x) by sampling an index i^* from M continuous i.i.d. Gumbel RVs g_i known from extreme value theory:

i^* = \arg\max_{i = 1, \ldots, M} \; \ln p(x = m_i) + g_i.   (5)

Defining the function \mathrm{onehot}(i^*), which sets the i^*-th element of the one-hot vector to z_{i^*} = 1 and z_{l \neq i^*} = 0, the Gumbel-Max trick hence allows to sample one-hot vectors z. Thus, we are able to reparametrize z through a continuous multivariate Gumbel RV g \in \mathbb{R}^{M \times 1} and a vector \boldsymbol{\alpha} \in [0, 1]^{M \times 1} of class probabilities p(x = m_k) with \sum_{k=1}^{M} \alpha_k = 1:

z = \mathrm{onehot}\left( \arg\max_{i = 1, \ldots, M} \; [\ln(\boldsymbol{\alpha}) + g]_i \right).   (6)

Note that (6), and equally x, are still discrete RVs, i.e., p(z) \hat{=} p(x), but represented in a probabilistic sense by continuous RVs g. To achieve a continuous RV, we now replace the one-hot and arg max computation in (6) by the softmax function [35], [36]:

\tilde{z} = \sigma_\tau(g) = \frac{e^{(\ln(\boldsymbol{\alpha}) + g)/\tau}}{\sum_{i=1}^{M} e^{(\ln \alpha_i + g_i)/\tau}}.   (7)

The resulting RV \tilde{z} \in [0, 1]^{M \times 1} is the so-called concrete or Gumbel-softmax RV and is now continuous, e.g., \tilde{z} = [0.1, 0.9]^T. It is controlled by one parameter, the softmax temperature \tau. The distribution of \tilde{z} in (7) was found to have a closed-form density, which gives the definition of the concrete distribution:

p(\tilde{z} | \boldsymbol{\alpha}, \tau) = (M-1)! \, \tau^{M-1} \prod_{k=1}^{M} \left( \frac{\alpha_k \tilde{z}_k^{-\tau - 1}}{\sum_{i=1}^{M} \alpha_i \tilde{z}_i^{-\tau}} \right).   (8)

With \tilde{z}, we are finally able to relax the discrete RV x into a continuous RV \tilde{x} by defining \tilde{x} = \tilde{z}^T m. Now, our derivation of the relaxation is complete. In Fig. 1, we illustrate the distribution p(\tilde{x}) for the special case M = 2 in comparison to the original categorical pmf p(x), i.e., a Bernoulli pmf. It has the following properties, reflecting the correctness of the relaxation [35]: First, we are able to reparametrize the concrete RV \tilde{z} and hence the RV \tilde{x} by Gumbel variables g, a direct result from the initial idea (7).
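For illustration, the following minimal numpy sketch (our illustration, not the implementation of Section IV; the values of `alpha` and `g_fixed` are arbitrary) first draws exact categorical samples via the Gumbel-Max trick (5) and then relaxes the arg max to the softmax (7) to obtain continuous concrete samples:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([0.2, 0.5, 0.3])   # class probabilities, illustrative values
M = len(alpha)

# Gumbel-Max trick (5): adding i.i.d. Gumbel noise to the log-probabilities
# and taking the arg max yields exact samples from the categorical pmf.
n_samples = 20_000
g = rng.gumbel(size=(n_samples, M))
idx = np.argmax(np.log(alpha) + g, axis=1)
freq = np.bincount(idx, minlength=M) / n_samples   # empirical pmf, close to alpha

def concrete_sample(alpha, g, tau):
    """Softmax relaxation (7): a continuous sample on the probability simplex."""
    logits = (np.log(alpha) + g) / tau
    e = np.exp(logits - logits.max())
    return e / e.sum()

# One fixed Gumbel draw, relaxed at two temperatures: for small tau the
# concrete sample approaches a one-hot vector (zero-temperature limit).
g_fixed = np.array([0.3, -0.2, 0.1])
z_soft = concrete_sample(alpha, g_fixed, tau=1.0)
z_hard = concrete_sample(alpha, g_fixed, tau=0.01)
```

Both `z_soft` and `z_hard` sum to one; `z_hard` is nearly one-hot at exactly the index the Gumbel-Max trick (5) would select for `g_fixed`.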
Moreover, the zero temperature limit \tau \to 0 restores a categorical variable: The smaller \tau, the more \tilde{z} approaches a categorical distribution and the more accurate the approximation becomes. Thus, the statistics of x and \tilde{x} coincide for \tau \to 0.

C. Reparametrization
In [37], the idea is to use the concrete distribution in order to relax the MAP problem (2c) to

\hat{x} = \arg\min_{\tilde{x} \in [x_{\min}, x_{\max}]^{N_T \times 1}} -\ln p(y | \tilde{x}) - \ln p(\tilde{x}).   (9)

The reparametrization of \tilde{z} by g helps to rewrite (9) by expressing each \tilde{x}_n in \tilde{x} with (7) by the RVs g_n, n = 1, \ldots, N_T, of i.i.d. Gumbel RVs g_{kn}:

\tilde{x}(G) = \begin{bmatrix} \tilde{x}_1 \\ \vdots \\ \tilde{x}_{N_T} \end{bmatrix} = \begin{bmatrix} \tilde{z}_1^T \\ \vdots \\ \tilde{z}_{N_T}^T \end{bmatrix} m = \begin{bmatrix} \sigma_\tau(g_1)^T \\ \vdots \\ \sigma_\tau(g_{N_T})^T \end{bmatrix} m   (10)

with

G = [g_1 \cdots g_{N_T}] \in \mathbb{R}^{M \times N_T}.   (11)

By doing so, we will obtain an unconstrained optimization problem w.r.t. the matrix G.

Fig. 1. The concrete pdf p(\tilde{x} | \boldsymbol{\alpha}, \tau) shown for different parameter sets and M = 2. It relaxes the Bernoulli pmf p(x | \boldsymbol{\alpha}) into the interior. Notably, it is log-convex for \tau \leq (M-1)^{-1} and log-concave otherwise. Symmetry results if \alpha_1 = \ldots = \alpha_M.

Fig. 2. Exemplary plot of the concrete binary MAP cost function (green) and the contributions of the conditional (black) and prior pdf (red) to it. The original binary MAP cost function (blue) is shown for comparison.

Now, we reformulate the relaxed
MAP problem (9): This means, we replace the Gaussian model p(y | \tilde{x}) by p(y | G) and introduce the Gumbel distribution p(g_{kn}) = \exp(-g_{kn} - \exp(-g_{kn})) as the new prior distribution:

\hat{G} = \arg\min_{G \in \mathbb{R}^{M \times N_T}} -\ln p(y | G) - \ln p(G)   (12a)
        = \arg\min_{G \in \mathbb{R}^{M \times N_T}} -\ln p(y | G) - \sum_{n=1}^{N_T} \sum_{k=1}^{M} \ln p(g_{kn})   (12b)
        = \arg\min_{G \in \mathbb{R}^{M \times N_T}} \frac{1}{\sigma_n^2} (y - H\tilde{x}(G))^H (y - H\tilde{x}(G)) + N_R \ln(\pi \sigma_n^2) + \mathbf{1}^T G \mathbf{1} + \mathbf{1}^T e^{-G} \mathbf{1}   (12c)
        = \arg\min_{G \in \mathbb{R}^{M \times N_T}} L(G, \tau).   (12d)

However, owing to the softmax and exponential terms in L(G, \tau), (12d) has no analytical solution. Furthermore, L(G, \tau) describes a non-convex objective function, which is illustrated in Fig. 2 for the binary case M = 2. This results from log-convexity of the concrete distribution for \tau \leq (M-1)^{-1} [35]. The conditional pdf p(y | \tilde{x}) is log-concave and the prior pdf p(\tilde{x}) log-convex, so the negative log joint distribution forms a non-convex objective function (12d).

D. Gradient Descent Optimization
One common strategy for solving the non-linear and non-analytical problem (12d) is to use a variant of gradient descent. Since we aim to reduce complexity, we choose its most basic form, steepest descent. The minimum is approached iteratively by taking gradient descent steps until the necessary condition

\frac{\partial L(G, \tau)}{\partial G} = \mathbf{0}   (13)

is fulfilled. We point out that convergence to the global solution depends heavily on the initialization of the starting point, since the objective function is non-convex. A reasonable choice of starting point is \tilde{x}^{(0)} = \mathrm{E}[x] = \boldsymbol{\alpha}^T m, i.e., the expected value of the true discrete RV x. We achieve this by setting G^{(0)} = \mathbf{0} and \tau^{(0)} = 1. After some tensor/matrix calculus, and by noting that every \tilde{x}_n only depends on one g_n, the gradient descent step for (12d) in iteration j is:

G^{(j+1)} = G^{(j)} - \delta^{(j)} \cdot \frac{\partial L(G, \tau)}{\partial G} \Big|_{G = G^{(j)}}   (14a)

\frac{\partial L(G, \tau)}{\partial G} = \frac{2}{\sigma_n^2} \cdot \left[ \frac{\partial \tilde{x}_1(g_1)}{\partial g_1} \cdots \frac{\partial \tilde{x}_{N_T}(g_{N_T})}{\partial g_{N_T}} \right] \cdot \mathrm{diag}\left\{ H^H H \tilde{x}(G) - H^H y \right\} + \mathbf{1} - e^{-G}   (14b)

\frac{\partial \tilde{x}_n(g_n)}{\partial g_n} = \frac{1}{\tau^{(j)}} \cdot \left[ \mathrm{diag}\{\sigma_\tau(g_n)\} \cdot m - \sigma_\tau(g_n) \cdot \tilde{x}_n(g_n) \right].   (14c)

The operator \mathrm{diag}\{a\} creates a diagonal matrix with the vector a on its main diagonal. The step size \delta^{(j)} can be chosen adaptively in every iteration j, just like the parameter \tau^{(j)}. For example, we can follow a heuristic schedule as in simulated annealing: We start with a large \tau^{(j)} and decrease it until we approach the true prior pdf for \tau^{(j)} \to 0. Finally, after the last iteration N_{it}, we obtain the continuous estimate G^{(N_{it})}. For approximate detection of x in (2c), the estimate has to be transformed back to the discrete domain by quantizing \tilde{x} onto the discrete set \mathcal{M}:

\hat{x} = \arg\min_{x \in \mathcal{M}^{N_T \times 1}} \left\| x - \tilde{x}\left(G^{(N_{it})}\right) \right\|^2.   (15)

In the following, we name this detection approach Concrete MAP Detection (CMD).
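The chain rule behind (14) can be verified numerically. The following numpy sketch (an illustration only, not the implementation of Section IV; it assumes a real-valued analogue of (12c) with representer vector m = [-1, +1]^T, a uniform prior, and arbitrary dimensions) evaluates the relaxed objective up to its constant term, checks the analytical gradient of the form (14b)-(14c) against central finite differences, and confirms the initialization \tilde{x}(\mathbf{0}) = E[x]:

```python
import numpy as np

rng = np.random.default_rng(2)
m_vec = np.array([-1.0, 1.0])                 # representer vector m
Mcls, N_T, N_R = 2, 3, 4                      # classes, transmit/receive dims
alpha = np.full(Mcls, 0.5)                    # uniform prior
H = rng.standard_normal((N_R, N_T))
y = rng.standard_normal(N_R)
sigma2, tau = 1.0, 0.7

def softmax(U):
    E = np.exp(U - U.max(axis=0))
    return E / E.sum(axis=0)

def x_tilde(G):
    """Relaxed symbols (10): columnwise softmax (7), then projection onto m."""
    Z = softmax((np.log(alpha)[:, None] + G) / tau)
    return Z, m_vec @ Z

def L_obj(G):
    """Relaxed objective in the spirit of (12c), constant term omitted."""
    _, xt = x_tilde(G)
    r = y - H @ xt
    return (r @ r) / sigma2 + np.sum(G + np.exp(-G))   # likelihood + Gumbel prior

def grad_L(G):
    """Analytical gradient, mirroring (14b)-(14c) for this real-valued model."""
    Z, xt = x_tilde(G)
    dLdx = (2.0 / sigma2) * (H.T @ (H @ xt - y))
    dxdg = Z * (m_vec[:, None] - xt[None, :]) / tau    # column n: d x_n / d g_n
    return dxdg * dLdx[None, :] + 1.0 - np.exp(-G)

# Initialization check: G = 0 yields x~ = E[x] = 0 under the uniform prior.
_, x0 = x_tilde(np.zeros((Mcls, N_T)))

# Central finite differences vs. the analytical gradient.
G = rng.standard_normal((Mcls, N_T))
eps = 1e-6
num = np.zeros_like(G)
for k in range(Mcls):
    for n in range(N_T):
        E = np.zeros_like(G)
        E[k, n] = eps
        num[k, n] = (L_obj(G + E) - L_obj(G - E)) / (2 * eps)
```

The finite-difference matrix `num` agrees with `grad_L(G)` elementwise, which is exactly the property gradient descent on (12d) relies on.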
CMD is a generic approach applicable to any differentiable probabilistic non-linear model. Furthermore, only elementwise nonlinearities and matrix-vector multiplications are present.

As a final remark, we note that our implementation of Section IV relies on scaling the objective function by the noise variance parameter, i.e., \sigma_n^2 \cdot L(G, \tau). Although scaling does not change the optimization problem, we observed that this slightly modified version of (14) is numerically more stable.
Noting that the softmax function (7) is normalized, we are able to eliminate one degree of freedom in the matrix G \in \mathbb{R}^{M \times N_T} along dimension M. For the special case of binary RVs, i.e., M = 2 classes, this means that the matrix G can be reduced to a vector s \in \mathbb{R}^{N_T \times 1} of logistic RVs to derive a different algorithm of low complexity. Here, we only briefly summarize the result of binary CMD in a real-valued system model and refer the reader to [37] for the complete derivation:

s^{(j+1)} = s^{(j)} - \delta^{(j)} \cdot \frac{\partial L(s)}{\partial s} \Big|_{s = s^{(j)}}   (16a)

\frac{\partial L(s)}{\partial s} = \frac{2}{\sigma_n^2} \cdot \frac{\partial \tilde{x}(s)}{\partial s} \cdot \left[ H^T H \tilde{x}(s) - H^T y \right] + \tanh\left(\frac{s}{2}\right)   (16b)

\frac{\partial \tilde{x}(s)}{\partial s} = \frac{1}{2\tau^{(j)}} \cdot \mathrm{diag}\left\{ \mathbf{1} - \tilde{x}^2(s) \right\}   (16c)

\tilde{x}(s) = \tanh\left( \frac{\ln(1/\boldsymbol{\alpha} - 1) + s}{2\tau^{(j)}} \right).   (16d)

The final step again consists of quantization, which in this case simplifies to the sign function: \hat{x} = \mathrm{sign}(\tilde{x}(s^{(N_{it})})).

III. LEARNING TO RELAX
Although simple and computationally efficient, using a gradient descent approach like (14) and (16) leads to several inconveniences. Regarding theoretical properties, a major drawback becomes apparent: Convergence of the gradient descent steps to an optimum is slow since consecutive gradients are perpendicular. Also, practical questions arise: How to choose the parameters \tau^{(j)} and \delta^{(j)} and the number of iterations N_{it} for a good complexity-performance trade-off? And how are we able to deliver soft information, e.g., probabilities, to a soft decoder, which is standard in today's communication systems?

Our idea is to improve CMD by learning, and in particular by the idea of deep unfolding, to address these questions. This means we have to deal with

A. how learning is defined,
B. the application of deep unfolding to CMD.

A. Basic Problem of Learning
To introduce our notion of learning, we revisit our basic task of MAP detection. Ideally, we would like to infer the most likely transmit signal x based on the a-posteriori pdf p(x | y). But as pointed out earlier, evaluation of p(x | y) has intractable complexity. For this reason, we proposed to relax the MAP problem into CMD.

Another idea to tackle this problem is to approximate this pdf p(x | y) by another, computationally tractable pdf q(x | y), e.g., by calculating q(x | y) from few samples/observations x, and to use this pdf for inference. Note that this approach includes cases where we do not know the pdf p(x | y) completely. The quality of the approximation can be quantified by the information theoretic measure of Kullback-Leibler (KL) divergence:

D_{KL}(p \| q) = \sum_{x \in \mathcal{M}^{N_T \times 1}} p(x | y) \ln \frac{p(x | y)}{q(x | y)}   (17)
             = \mathrm{E}_{x \sim p(x | y)} \left[ \ln \frac{p(x | y)}{q(x | y)} \right].   (18)

Just like the Mean Square Error (MSE), the KL divergence can be used to define an optimization problem targeting a tight q(x | y) as a solution. This brings us to a crucial viewpoint of this article: Learning is defined to be the optimization process aiming to derive a good approximation q(x | y) of p(x | y), i.e.,

q^*(x | y) = \arg\min_q D_{KL}(p \| q).   (19)

This kind of problem is also referred to as Variational Inference (VI). We can rewrite the KL divergence as the difference of cross entropy H(p, q) and entropy H(p):

D_{KL}(p \| q) = \mathrm{E}_{x \sim p(x | y)}[-\ln q(x | y)] - \mathrm{E}_{x \sim p(x | y)}[-\ln p(x | y)]   (20)
             = H(p, q) - H(p).   (21)

Since we defined the basic learning problem (19) w.r.t. the approximation q, we can neglect the entropy term H(p), which is independent of q, and use cross entropy as the learning criterion. If we further restrict q to a model q(x | y, \boldsymbol{\theta}) with parameters \boldsymbol{\theta}, the optimization problem now reads:

\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} H(p, q).   (22)

We note that problem (22) is solved separately for each y and is thus not computationally efficient. Therefore, we define one inference distribution q(x | y, \boldsymbol{\theta}) for any value y, which is known as Amortized Inference [23]:

\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \mathrm{E}_{y \sim p(y)} \left[ H(p(x | y), q(x | y, \boldsymbol{\theta})) \right]   (23)
                    = \arg\min_{\boldsymbol{\theta}} \mathrm{E}_{y \sim p(y)} \left[ \mathrm{E}_{x \sim p(x | y)} [ -\ln q(x | y, \boldsymbol{\theta}) ] \right]   (24)
                    \approx \arg\min_{\boldsymbol{\theta}} -\frac{1}{N} \sum_{i=1}^{N} \ln q(x_i | y_i, \boldsymbol{\theta}), \quad N \to \infty.   (25)

The final result (25) equals the maximum likelihood problem in supervised learning. We make use of it in the following since it allows for numerical optimization based on data points \{x_i, y_i\}. Furthermore, it proves to be a Monte Carlo approximation of (23) and is hence well motivated from information theory [23].

B. Idea of Unfolding and Application to CMD
Learning gives us the ability to obtain a tractable approximation q(x | y, \boldsymbol{\theta}). But one question remains: How to choose a suitable functional form for q(x | y, \boldsymbol{\theta}) of low complexity and good performance? We follow the idea of deep unfolding from [26] and apply it to our model-based approach CMD with parameters \boldsymbol{\theta} = \{\tau^{(1)}, \ldots, \tau^{(N_{it})}, \delta^{(0)}, \ldots, \delta^{(N_{it}-1)}\}, which is able to relax tightly. Thereby, we combine the strengths of DNNs and of the latter: DNNs are known to be universal approximators [5], and their fixed structure of parallel computations layer per layer allows to define a good performance-complexity trade-off at run time. But if the model is dynamic and changes, e.g., the channel or noise over time, reiterated optimization of (22) is required and the benefit disappears. Fortunately, we know our model (3), a MIMO channel, well and are able to use generative model-based approaches, which mostly rely on a suitable approximation of (19) for computational tractability. For example, MFVI and AMP belong to this algorithm family.

This means we unfold the iterations (14) of CMD into a DNN by untying the parameters \tau^{(j)} and \delta^{(j)}. Furthermore, we fix the complexity by setting the number of iterations N_{it}. The resulting graph, illustrated in Fig. 3 for binary CMD, has a DNN-like structure which should be able to generalize and approximate well at the same time. Owing to the skip connection from s^{(j)} to s^{(j+1)} on the right hand side, the structure resembles a Residual Network (ResNet) layer, which is SotA in image processing [9]. It is a result of the gradient descent approach, which allows to interpret optimization of ResNets as learning gradient descent steps. The reason for the success of ResNets lies in the skip connection: The training error is able to backpropagate through it to early layers, which allows for fast adaptation of early weights and hence fast training of DNNs.
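To make the unfolded structure concrete, the following numpy sketch (an illustration only, not the paper's implementation) runs a fixed number of layers of the binary CMD update (16) with untied per-layer parameters \tau^{(j)}, \delta^{(j)}. Here the parameters are simply hand-set to an annealing-like schedule, whereas in the proposed approach they are learned; the soft-output mapping in the last step is an assumption of this sketch, chosen to be consistent with (16d) under a uniform prior:

```python
import numpy as np

rng = np.random.default_rng(9)
N_T, N_R, N_L = 4, 8, 20
H = rng.standard_normal((N_R, N_T)) / np.sqrt(N_R)
x = rng.choice([-1.0, 1.0], size=N_T)           # true BPSK symbols
sigma2 = 0.1
y = H @ x + np.sqrt(sigma2) * rng.standard_normal(N_R)

# Untied per-layer parameters (hand-set here; learned in the unfolded model).
taus = np.linspace(1.0, 0.3, N_L)
deltas = np.full(N_L, 0.01)

def cmd_unfolded(y, H, taus, deltas, sigma2, alpha=0.5):
    """Fixed number of layers of the binary CMD update (16a)-(16d)."""
    s = np.zeros(H.shape[1])                     # s^(0) = 0, i.e. x~ = E[x]
    for tau, delta in zip(taus, deltas):
        x_t = np.tanh((np.log(1 / alpha - 1) + s) / (2 * tau))       # (16d)
        dxds = (1 - x_t ** 2) / (2 * tau)                            # (16c)
        grad = (2 / sigma2) * dxds * (H.T @ (H @ x_t - y)) \
               + np.tanh(s / 2)                                      # (16b)
        s = s - delta * grad                                         # (16a)
    # Per-symbol class probabilities from the last layer (sketch assumption):
    z1 = 1 / (1 + np.exp(-(np.log(1 / alpha - 1) + s) / taus[-1]))
    return s, np.stack([1 - z1, z1], axis=1)     # columns: P(x_n=-1), P(x_n=+1)

s_out, probs = cmd_unfolded(y, H, taus, deltas, sigma2)
x_hat = np.where(probs[:, 1] > 0.5, 1.0, -1.0)   # hard decision
```

Each row of `probs` is a valid per-symbol distribution, which is the kind of soft output a subsequent decoder consumes; the hard decision recovers the sign quantization of binary CMD.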
This residual structure makes CMD especially suitable for the online training proposed in [34] and allows for refinement in application.

As before, we have to define a final layer, which is now also used for optimization. Usually, its output is chosen to be a continuous estimate of x and optimized w.r.t. the MSE criterion, see [28], [34]. This viewpoint relaxes the estimate \hat{x} into \mathbb{R}^{N_T \times 1} and assumes a Gaussian distribution for the errors at the output. In our case, the output would correspond to \tilde{x}(G^{(N_{it})}) from (15). But this is in contrast to our information theoretic viewpoint on learning, which states that we want to approximate an output of the true pmf p(x | y). As in MFVI, we assume a factorization of the approximating posterior to make it computationally tractable and derive our learning criterion:

H(p, q) = -\sum_{x \in \mathcal{M}^{N_T}} p(x | y) \cdot \ln q(x | y, \boldsymbol{\theta})   (26)
\overset{\text{MFVI}}{=} -\sum_{x \in \mathcal{M}^{N_T}} p(x | y) \cdot \ln \prod_{n=1}^{N_T} q_n(x_n | y, \boldsymbol{\theta})   (27)
        = -\sum_{n=1}^{N_T} \sum_{x_n \in \mathcal{M}} \ln q_n(x_n | y, \boldsymbol{\theta}) \cdot \sum_{x_{/n} \in \mathcal{M}^{N_T - 1}} p(x | y)   (28)
        = -\sum_{n=1}^{N_T} \sum_{x_n \in \mathcal{M}} p(x_n | y) \cdot \ln q_n(x_n | y, \boldsymbol{\theta})   (29)
        = \sum_{n=1}^{N_T} H(p(x_n | y), q_n(x_n | y, \boldsymbol{\theta})).   (30)

This interesting result shows that assuming the MFVI factorization leads to an optimization criterion w.r.t. the soft output p(x_n | y) of the IO detector (4). This soft output is required for subsequent decoding and is thus exactly what we need.

The last step of our idea consists of inserting our unfolded CMD structure into q_n(x_n | y, \boldsymbol{\theta}). Hence, we propose to use a softmax function for the last layer, a typical choice for classification in discriminative probabilistic models.
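The factorization steps (26)-(30), as well as the decomposition (21), can be checked numerically. The following sketch (our illustration, with an arbitrary random posterior p(x | y) and random factors q_n) verifies that the joint cross entropy of a factorized q equals the sum of the per-symbol cross entropies against the marginals p(x_n | y):

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
N_T, M = 3, 2                                     # three binary symbols
states = np.array(list(itertools.product(range(M), repeat=N_T)))

p = rng.random(M ** N_T)                          # arbitrary posterior p(x|y)
p /= p.sum()

q_n = rng.random((N_T, M))                        # per-symbol factors q_n(x_n|y)
q_n /= q_n.sum(axis=1, keepdims=True)
q = np.prod(q_n[np.arange(N_T), states], axis=1)  # factorized q(x|y), cf. (27)

# (20)-(21): KL divergence = cross entropy minus entropy.
kl = np.sum(p * np.log(p / q))
cross_ent = -np.sum(p * np.log(q))
ent = -np.sum(p * np.log(p))

# (26)-(30): joint cross entropy = sum of marginal cross entropies.
per_symbol = 0.0
for n in range(N_T):
    p_n = np.array([p[states[:, n] == k].sum() for k in range(M)])   # p(x_n|y)
    per_symbol += -np.sum(p_n * np.log(q_n[n]))
```

Here `cross_ent` and `per_symbol` coincide exactly, i.e., minimizing the factorized cross entropy indeed targets the marginal posteriors of the IO detector.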
Fortunately, CMD already includes this softmax function as part of its structure, so we rewrite

q_n(x_n|y, 𝜽) = Π_{k=1}^{M} q_{n,k}(x_n|y, 𝜽)^𝟙(x_n = m_k) = Π_{k=1}^{M} z̃_{n,k}^𝟙(x_n = m_k)   (31)

with z̃_n = σ_{τ^(N_it)}(g_n^(N_it)) from the last iteration N_it of (14).

Fig. 3. One layer of the unfolded binary CMD algorithm. In red: trainable parameters.

TABLE I: SIMULATION SCENARIOS

Scenario               | Sys. Dim. | Mod.  | Corr. | Coding
Large MIMO             | ×         | QPSK  | no    | no
MIMO                   | ×         | QPSK  | no    | no
Multi-class            | ×         | QAM16 | no    | no
Massive MIMO One-Ring  | ×         | QPSK  | ◦     | no
Soft Output            | ×         | QPSK  | no    | LDPC

To summarize, we optimize the parameter set 𝜽 of our approximating pdf q(x|y, 𝜽) based on CMD:

𝜽* = arg min_𝜽 E_{y∼p(y)} [H(p(x|y), q(x|y, 𝜽))]   (32)
   ≈ arg min_𝜽 − (1/N) Σ_{i=1}^{N} Σ_{n=1}^{N_T} [𝟙(x_n = m_1), …, 𝟙(x_n = m_M)]^T ln(σ_{τ^(N_it)}(g_n^(N_it))).   (33)

As a side effect, we also learn how to relax with CMD via τ^(j). The optimization problem (33) can be efficiently solved by variants of Stochastic Gradient Descent (SGD). Thanks to having a model, we are able to create unlimited training and test data, so (33) is approximated reasonably well in every iteration of SGD. We note that this is in contrast to classic data sets from the machine learning community.

IV. NUMERICAL RESULTS
A. Implementation Details / Settings
In order to evaluate the performance of the proposed approach CMD, we present numerical simulation results for application in different MIMO systems with N_T transmit and N_R receive antennas, given in Tab. I. We assume Rayleigh block fading and an uplink scenario where several UEs transmit to one BS. As an example, we assume the number of iterations or layers to be N_it = N_L = N_T.

TABLE II: ALGORITHMS

Abbrev.       | Complexity                                  | Literature
MAP / SD      | O(M^(γN_T)), γ ∈ (0, 1]                     | [20]
SDR           | O(max(N_R, N_T)^4 N_T^(1/2) log(1/ε))       | [22]
OAMPNet       | O(N_L N_T³)                                 | [33]
MMSE / MOSIC  | O(N_T³)                                     | [34], [21]
DetNet        | O(N_L (N_T N_R + N_T M))                    | [27], [28]
MMNet (iid)   | O(N_L N_T (N_T + N_R + M))                  | [34]
AMP           | O(N_it N_T (N_R + M))                       | [24]
CMD           | O(N_L N_T (N_R + M))                        | [37]
MF            | O(N_T N_R)                                  |

For the numerical optimization of the parameters δ^(j) and τ^(j) of the unfolded CMD structure in (33), we employ the Tensorflow framework in Python [7]. Here, we use Adam (Adaptive Moment Estimation), a popular variant of SGD, with a default batch size N_b and N_epoch training iterations. Although providing fast convergence and requiring little hyperparameter tuning, Adam is known to generalize poorly [38]. Since we are able to generate a sufficient amount of training data, i.e., N = N_b · N_epoch, to fulfill (33) approximately, we make sure that generalization to unseen data points is possible. To allow for training in Tensorflow and for comparison to DNN-based approaches, we restrict ourselves to QAM constellations with Gray encoding and transform the complex-valued system model (1) into its real-valued equivalent so that we have x ∈ M^(N_T×1). The training noise variance σ_n² is by default distributed uniformly such that E_b/N_0 = 10 log₁₀(1/σ_n²) − 10 log₁₀(log₂(M)) lies in a fixed dB interval. The default parameter starting point is set to 𝜽_0 with constant δ^(j) and a heuristically motivated, linearly decreasing

τ^(j) = τ_max − (τ_max − τ_min)/N_it · j   (34)

with τ_max = 1/(M−1), a small end temperature τ_min, and j ∈ [0, N_it]. With this choice, p(x̃) is always log-convex and hence reasonably approximates p(x) (see Fig. 1). For the training of the DNN-based approaches DetNet and MMNet, we used the original implementations uploaded to GitHub (see [28], [34]) with only minor modifications to the parametrization where beneficial. Consequently, we trained MMNet with the CMD training SNR and layer number. Since we focus on offline derived or trained algorithms which are used for inference at run time, we used its i.i.d. variant. We always used the soft output version of DetNet with output normalization, since we noted that its performance is close to or better than that of the hard decision version. Furthermore, we compare CMD to several SotA approaches for MIMO detection (see Tab. II), choosing the number of Monte Carlo runs so that a fixed minimum number of errors is always detected (a smaller number for SD and SDR).

B. Symmetric Channel
First, we test the application of unfolded CMD in a large symmetric MIMO system with an i.i.d. Gaussian channel H and QPSK/BPSK modulation.

Fig. 4. BER curves of several detection methods in a MIMO system with QPSK modulation; for iterative algorithms N_it = N_L.

Fig. 4 shows the results in terms of Bit Error Rate (BER) as a function of E_b/N_0. Owing to its near-optimal performance, the SD is always provided as a benchmark in the following.

Linear detectors perform badly in this setup: since the curve of the MF remains almost constant at a high BER and the Zero Forcer performs even worse, both are not shown in the following. At least the MMSE equalizer shows acceptable behavior, but it is still separated from the SD by a noticeable gap at high E_b/N_0. In contrast, nonlinear SotA detectors like MOSIC from [21], AMP and the SDR technique show good performance. Whereas AMP runs into an error floor at high SNR, since the message statistics are no longer Gaussian in finite small-dimensional MIMO systems, SDR proves to be a close relaxation, only dropping the non-convex requirement rank(xxᵀ) = 1.

Notably, our approach CMD in its binary version CMD_bin performs very well, comparable to DetNet and OAMPNet. Further, CMD_bin does not run into an error floor in the simulated SNR range like AMP and DetNet do. Finally, we note that our approach is similar in complexity to AMP, whereas DetNet and OAMPNet are very complex DNN architectures (see Tab. II). The other DNN-based approach, MMNet, has comparably low complexity but fails to beat CMD_bin. Since we observed an early error floor similar to AMP in all settings, we omit further MMNet results. We conjecture that its denoising layers are insufficiently expressive in the interference-limited high-SNR region.

Results in a smaller MIMO system, plotted in Fig. 5, show that all soft non-linear approaches except for SDR and MOSIC run into an error floor at lower SNR. Thus, we conjecture that they share the same suboptimality at finite system dimensions.
They may rely on the statistics of the interference terms being Gaussian, which is only approximately true for large system dimensions. Apart from SDR and MOSIC, CMD_bin still offers the best overall performance and is close in performance to SDR at low E_b/N_0.

Fig. 5. BER curves of several detection methods in a MIMO system with QPSK modulation; for iterative algorithms N_it = N_L.

Fig. 6. Parameters 𝜽 of CMD_bin in a MIMO system with QPSK modulation: (a) step size δ^(j), (b) softmax temperature τ^(j), each for two values of N_epoch.

C. Algorithm and Parametrization
To investigate the influence of learning on unfolded CMD_bin and the values of its parameters 𝜽, we visualize them per layer j in Fig. 6 for the MIMO system considered before. We cannot observe any pattern after parameter optimization, and interpretation seems very difficult.

Furthermore, we notice from Fig. 7 that the starting point initialization 𝜽_0 has a large impact on the optimum 𝜽* found by SGD. If we use a starting point 𝜽_{0,splin} with

τ^(j)_{0,splin} = δ^(j)_{0,splin} decreasing linearly in j ∈ [0, N_it] analogously to (34),   (35)

a solution 𝜽*_{splin} is learned that allows CMD to perform better in the low E_b/N_0 region. CMD then even reaches the performance of OAMPNet, the best suboptimal algorithm considered in this setup, but runs into an error floor, in contrast to default CMD. To explain this behavior in the interference-limited higher E_b/N_0 region, we conjecture that a higher starting step size and a correlated end step size (see Fig. 6) allow CMD to leave a local optimum with higher probability and to find a better one, whereas a small step size enforces convergence to a local solution. In the noise-limited E_b/N_0 region, noise removal, and hence convergence, is crucial. This means CMD can be optimized for different working points and is sensitive to the starting point initialization. Since CMD only has a small parameter set, we are able to load parameters dynamically and achieve very good performance in all E_b/N_0 regions. In particular, we are able to further decrease the number of parameters with negligible performance loss.

Fig. 7. BER curves of CMD with different parametrizations or algorithms in a MIMO system with QPSK modulation.
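The heuristic starting points (34) and (35) can be generated as below. The maximum temperature τ_max = 1/(M−1) is the log-convexity threshold of the Concrete density; the end value `tau_min` and the unit starting step size are placeholders, since the exact constants are not reproduced here.

```python
import numpy as np

def cmd_start_point(N_it, M, tau_min=0.5):
    """Heuristic parameter initialization: linearly decreasing temperature as in
    (34), from tau_max = 1/(M-1) down to tau_min, and a matching linear step-size
    ramp in the spirit of (35). End/start constants are assumptions."""
    tau_max = 1.0 / (M - 1)
    j = np.arange(N_it + 1)
    tau = tau_max - (tau_max - tau_min) / N_it * j    # eq. (34)
    delta = 1.0 - (1.0 - tau_min) / N_it * j          # (35)-style ramp, assumed start 1.0
    return tau, delta
```

For the binary case (M = 2) this ramps τ from 1 downwards; for 4-ASK (M = 4) it starts at 1/3, matching the convexity argument used later for multi-class training.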
Unfolded CMD_bin with only a fraction of the layers performs equally well compared to default CMD at low E_b/N_0 and only slightly worse at high E_b/N_0.

Without unfolding, heuristics for parameter selection are required, similar to the starting point initialization. The detection performance with such heuristic parameters 𝜽_{0,splin} is quite impressive, since the BER curve is close to that of learned CMD with 𝜽*_{splin}. Therefore, we are able to use the plain algorithm for detection. We note that this is not true with the default parameters 𝜽_0 and that performance can be quite different after optimization (𝜽*).

Finally, we compare the performance of the algorithm CMD_bin for the special case of binary RVs with that of the generic multi-class algorithm CMD, since both differ. From Fig. 7, we observe that the performance is very similar, and we conjecture that CMD is capable of achieving the same performance if training is parameterized correctly.

D. Multi-class Detection
So far, only BPSK modulation and hence two classes have been considered. To test multi-class detection, we show numerical results in a MIMO system with 16-QAM modulation, being equivalent to a 4-ASK MIMO system of doubled dimension after transformation into the equivalent real-valued problem. Owing to the now larger number of degrees of freedom in the softmax function and the denser symbol packing, we changed our batch size N_b and shifted the training SNR to a higher E_b/N_0 range, respectively. Setting the default starting point with τ_max = 1/(M−1) = 1/3, so that the MAP criterion ln p(x̃, y) becomes convex for a couple of iterations, proves to be crucial for the successful training of CMD with multiple classes.

Fig. 8 shows BER curves in this system. Clearly, we can now observe a large gap between the BER curve of the SD and that of all other suboptimal approaches. Comparing suboptimal algorithms, OAMPNet is superior over the whole SNR region. Observing a limited maximum curve shift, we note that CMD is competitive with OAMPNet and SDR in the mid-SNR region, at BER values which are a typical working point of decoders. Without training parameter tuning, CMD performs even worse than the MMSE equalizer. At higher SNR, an error floor follows. Although using a more expressive DNN model, DetNet, now trained over the same higher E_b/N_0 range, fails to beat CMD especially in this region.

Fig. 8. BER curves of several detection methods in a MIMO system with 16-QAM modulation. The effective real-valued system uses 4-ASK modulation; for iterative algorithms N_it = N_L.

E. Massive MIMO
Investigation in large symmetric MIMO systems reveals the potential and shortcomings of the algorithms. In 5G, however, massive MIMO systems with N_R > N_T are employed [19]. Assuming i.i.d. Gaussian channels, we briefly report the results of such an asymmetric MIMO system with QPSK modulation: the BER curves of the learning-based approaches and SDR almost follow that of the SD and thus suggest that they fit perfectly for application in massive MIMO.

In practice, however, channels are spatially correlated at the receiver side due to the good spatial resolution of the BS's large arrays compared to the number of scattering clusters [19]. Hence, the results for the i.i.d. Gaussian channel are less meaningful, as noted in [34]. As a first and quick attempt towards a realistic channel model which captures its key characteristics, we test performance in the so-called One-Ring model, assuming a BS equipped with a uniform linear antenna array [28], [19]. We parameterize the correlation matrices of every column in H with reasonable values: assuming an urban cellular network, we set the angular spread to a typical value and sample the nominal angle uniformly over the cell sector.

Fig. 9. BER curves of several equalization methods in a correlated MIMO system with QPSK modulation. The correlation matrices were generated according to a One-Ring model; for iterative algorithms N_it = N_L.

From Fig. 9, it becomes evident that the performance loss of the learning-based approaches compared to the SD in such a One-Ring model is similar to the symmetric setting. Surprisingly, MOSIC and SDR now prove to be comparable, whereas the BER of AMP degrades. A sequential schedule with SNR ordering seems to be beneficial in the One-Ring model. CMD performs very close to DetNet and OAMPNet at low E_b/N_0 and even outperforms the latter above. Hence, it proves to be a generic and therefore promising detection approach.

F. Soft Output (Coded System)
After investigating the detection performance in uncoded systems, we turn to an interleaved and horizontally coded MIMO system with Rayleigh block fading, reflecting our uplink model. We aim to verify whether not only the hard decisions but also the soft outputs generated by unfolded CMD and by the soft output version of DetNet are of high quality. This is especially important in practice, since coding is an essential component besides equalization in today's communication systems. Therefore, we use an LDPC code with rate R_C from [39] and, at the receiver side, a belief propagation decoder. The results in terms of Coded Frame Error Rate (CFER) as a function of E_b/N_0/R_C are shown in Fig. 10. Owing to the overwhelming computational complexity, we refrained from using the MAP solution with coding as a benchmark and instead show uncoded CMD and SD curves for reference. Strikingly, CMD with coding beats the latter and allows for a coding gain. In contrast, AMP with coding runs into an error floor: its output statistics become unreliable at high SNR in finite-dimensional systems. Surprisingly, although being one of the best detection methods in the uncoded setting, DetNet with coding performs close to MMSE equalization with soft outputs and thus worse than expected. Actually, the soft output version of DetNet should deliver accurate probabilities or Log-Likelihood Ratios (LLRs) after optimization, according to [28].

Fig. 10. CFER curves of a horizontally coded MIMO system with QPSK modulation. An LDPC code with a belief propagation decoder was used; for iterative algorithms N_it = N_L.

Indeed, we visualize with an exemplary histogram of LLRs that this is not the case. In Fig. 11, we show the relative frequencies of the LLRs of one symbol x_n in one random channel realization H. First, we note that the histograms for x_n = −1 and x_n = 1 are symmetric, meaning that both algorithms fulfill a basic quality criterion. Furthermore, it can be clearly seen that DetNet mostly provides hard decisions, with most LLRs being −∞ and ∞, respectively. Only a few values are close to 0. In contrast, CMD provides meaningful soft information resembling a mixture of Gaussians, as expected from the literature [40]. These results strongly indicate that the difference in soft output quality originates from the different underlying optimization strategies: as pointed out in Section III-B, unfolded CMD relies on minimization of the KL divergence between the IO a-posteriori and the approximating pdf, whereas the one-hot representation in DetNet is optimized w.r.t. the MSE. We conclude that our approach yields the better optimization strategy.

G. Complexity Analysis
Since complexity is the main driver for the development of suboptimal algorithms like CMD instead of relying on MAP detection, we complete our numerical study by relating detection performance to the computational complexity results given in Tab. II. With regard to CMD, the per-iteration asymptotic complexity of O(N_T (N_R + M)), or O(N_T N_R) for binary RVs, is dominated by the matrix-vector multiplications in HᵀH x̃, i.e., CMD scales linearly with the input and output dimensions as well as the number of classes. Clearly, CMD has very low complexity, comparable to AMP and MMNet, but with a remarkably higher detection rate.
Fig. 11. Exemplary histogram showing the relative frequencies of the LLRs of one symbol x_n in one random channel realization H.

Fig. 12. Complexity of detection algorithms in terms of the number of multiplicative operations in a MIMO system: light-colored bars indicate a realistic low-complexity implementation with BPSK and dark-colored bars the worst-case complexity with 16-QAM modulation.
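As a rough cross-check of the O(N_L N_T (N_R + M)) entry in Tab. II, the per-layer multiplications of CMD can be tallied as below. The constant factors (two matrix-vector products plus an M-ary softmax scaling per symbol) are our own accounting assumptions, not the exact counts behind Fig. 12.

```python
def cmd_mops(N_T, N_R, M, N_L):
    """Approximate multiplicative operations (MOPs) of unfolded CMD:
    per layer, H @ x and H.T @ (.) cost N_T * N_R multiplications each,
    plus roughly N_T * M for the per-symbol M-class softmax update."""
    per_layer = 2 * N_T * N_R + N_T * M
    return N_L * per_layer

def mf_mops(N_T, N_R):
    """Matched filter: a single correlation H.T @ y."""
    return N_T * N_R
```

The count scales linearly in every dimension, so doubling the number of layers or antennas doubles the cost, in line with the linear scaling claimed for CMD.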
Besides the qualitative O(·) analysis, we capture complexity quantitatively by counting the number of Multiplicative OPerations (MOPs) for one iteration and channel realization, these being the most common and costly floating point operations. Fig. 12 shows the respective bar chart, assuming a realistic low-complexity implementation with QPSK and a worst-case complexity implementation with 16-QAM modulation, respectively. For BPSK and the lower bar of MMSE equalization, we assumed Gaussian elimination to solve the linear equation system and, for higher-order QAM and the higher bar, LU decomposition. We estimate the upper bound on the SDR MOP count by the unadapted O(max(N_R, N_T)^4 N_T^(1/2) log(1/ε)) and the lower bound to account for half of the FLOPs from [28] with inaccurate ε. The expected number of visited nodes O(M^(γN_T)) of the SD is SNR dependent, with γ ∈ (0, 1], and was extracted from [20].

Apparently, only the very basic MF beats CMD, at considerably worse detection performance. Finally, we note that AMP, MMNet, DetNet and CMD further come with the benefit of already delivering soft outputs. In contrast, however, MMSE and MOSIC are able to reuse their computations, with only one matrix-vector multiplication remaining for any further detection inside the channel coherence time interval.

V. CONCLUSION
In this article, we introduced the continuous relaxation of discrete RVs to the MAP detection problem. Allowing exhaustive search to be replaced by continuous optimization, e.g., based on gradient descent, we defined our classification approach Concrete MAP Detection (CMD). By unfolding CMD into a DNN, we were further able to optimize its low number of parameters and hence to improve detection performance while limiting it to low complexity. As a side effect, the resulting structure allows for fast online training. Using the example of MIMO detection, simulations reveal CMD to be a generic detection method competitive with the SotA, outperforming the recently proposed ML-based approaches DetNet and MMNet in every considered scenario as well as in terms of complexity. Notably, we selected an optimization criterion grounded in information theory, i.e., cross entropy, and showed that it aims at learning an approximation of the individual optimal detector. By simulations in coded systems, we demonstrated its ability to provide reliable soft outputs, as opposed to [28], this being a requirement for soft decoding, a crucial component in today's communication systems.

All these findings prove CMD to be a promising detection approach for application in future massive MIMO systems. Further research is required to improve the performance with higher-order modulations and to demonstrate its applicability to non-linear scenarios in other research domains.

REFERENCES

[1] C. E. Shannon, "A Mathematical Theory of Communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, Jul. 1948.
[2] A. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. Inf. Theory, vol. 13, no. 2, pp. 260–269, Apr. 1967.
[3] D. D. Lin and T. J. Lim, "A Variational Inference Framework for Soft-In Soft-Out Detection in Multiple-Access Channels," IEEE Trans. Inf. Theory, vol. 55, no. 5, pp. 2345–2364, May 2009.
[4] E. Riegler, G. E. Kirkelund, C. N. Manchon, M. Badiu, and B. H. Fleury, "Merging Belief Propagation and the Mean Field Approximation: A Free Energy Approach," IEEE Trans. Inf. Theory, vol. 59, no. 1, pp. 588–602, Jan. 2013.
[5] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, Jan. 1989.
[6] O. Simeone, "A Very Brief Introduction to Machine Learning with Applications to Communication Systems," IEEE Trans. on Cogn. Commun. Netw.
[9] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, Jun. 2016, pp. 770–778.
[10] D. Cireşan, U. Meier, and J. Schmidhuber, "Multi-column Deep Neural Networks for Image Classification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), Providence, RI, USA, 2012, pp. 3642–3649.
[11] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative Adversarial Nets," in Advances in Neural Information Processing Systems (NIPS 2014), Montreal, Canada, 2014, pp. 2672–2680.
[13] N. Farsad and A. Goldsmith, "Neural Network Detection of Data Sequences in Communication Systems," IEEE Trans. Signal Process., vol. 66, no. 21, pp. 5663–5678, Nov. 2018.
[14] B. Karanov, M. Chagnon, F. Thouin, T. A. Eriksson, H. Bülow, D. Lavery, P. Bayvel, and L. Schmalen, "End-to-End Deep Learning of Optical Fiber Communications," J. Lightw. Technol., vol. 36, no. 20, pp. 4843–4855, Oct. 2018.
[15] H. Kim, Y. Jiang, S. Kannan, S. Oh, and P. Viswanath, "Deepcode: Feedback Codes via Deep Learning," IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 194–206, May 2020.
[16] T. O'Shea and J. Hoydis, "An Introduction to Deep Learning for the Physical Layer," IEEE Trans. on Cogn. Commun. Netw., vol. 3, no. 4, pp. 563–575, Dec. 2017.
[17] F. A. Aoudia and J. Hoydis, "Model-Free Training of End-to-End Communication Systems," IEEE J. Sel. Areas Commun., vol. 37, no. 11, pp. 2503–2516, Nov. 2019.
[18] A. Caciularu and D. Burshtein, "Unsupervised Linear and Nonlinear Channel Equalization and Decoding Using Variational Autoencoders," IEEE Trans. on Cogn. Commun. Netw., vol. 6, no. 3, pp. 1003–1018, Sep. 2020.
[19] E. Björnson, J. Hoydis, and L. Sanguinetti, "Massive MIMO Networks: Spectral, Energy, and Hardware Efficiency," Foundations and Trends® in Signal Processing, vol. 11, no. 3-4, pp. 154–655, 2017.
[20] J. Jalden and B. Ottersten, "On the complexity of sphere decoding in digital communications," IEEE Trans. Signal Process., vol. 53, no. 4, pp. 1474–1484, Apr. 2005.
[21] D. Wübben, R. Böhnke, V. Kühn, and K.-D. Kammeyer, "MMSE Extension of V-BLAST based on Sorted QR Decomposition," in IEEE Vehicular Technology Conference (VTC 2003-Fall), vol. 1, Orlando, USA, Oct. 2003, pp. 508–512.
[22] Z.-Q. Luo, W.-K. Ma, A. M.-C. So, Y. Ye, and S. Zhang, "Semidefinite Relaxation of Quadratic Optimization Problems," IEEE Signal Process. Mag., vol. 27, no. 3, pp. 20–34, May 2010.
[23] O. Simeone, "A Brief Introduction to Machine Learning for Engineers," Foundations and Trends® in Signal Processing, vol. 12, no. 3-4, pp. 200–431, Aug. 2018.
[24] C. Jeon, R. Ghods, A. Maleki, and C. Studer, "Optimality of Large MIMO Detection via Approximate Message Passing," in IEEE International Symposium on Information Theory (ISIT 2015), Hong Kong, Jun. 2015, pp. 1227–1231.
[25] K. Gregor and Y. LeCun, "Learning fast approximations of sparse coding," in International Conference on Machine Learning (ICML 2010), Madison, WI, USA, Jun. 2010, pp. 399–406.
[26] J. R. Hershey, J. L. Roux, and F. Weninger, "Deep Unfolding: Model-based Inspiration of Novel Deep Architectures," arXiv preprint arXiv:1409.2574, Sep. 2014.
[27] N. Samuel, T. Diskin, and A. Wiesel, "Deep MIMO Detection," in IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC 2017), Sapporo, Japan, Jul. 2017, pp. 1–5.
[28] ——, "Learning to Detect," IEEE Trans. Signal Process., vol. 67, no. 10, pp. 2554–2564, May 2019.
[29] E. Nachmani, Y. Be'ery, and D. Burshtein, "Learning to decode linear codes using deep learning," in Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, Sep. 2016, pp. 341–346.
[30] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein, and Y. Be'ery, "Deep Learning Methods for Improved Decoding of Linear Codes," IEEE J. Sel. Topics Signal Process., vol. 12, no. 1, pp. 119–131, Feb. 2018.
[31] T. Gruber, S. Cammerer, J. Hoydis, and S. t. Brink, "On deep learning-based channel decoding," in Annual Conference on Information Sciences and Systems (CISS 2017), Baltimore, MD, USA, Mar. 2017, pp. 1–6.
[32] D. Neumann, T. Wiese, and W. Utschick, "Learning the MMSE Channel Estimator," IEEE Trans. Signal Process., vol. 66, no. 11, pp. 2905–2917, Jun. 2018.
[33] H. He, C.-K. Wen, S. Jin, and G. Y. Li, "A Model-Driven Deep Learning Network for MIMO Detection," in IEEE Global Conference on Signal and Information Processing (GlobalSIP 2018), Anaheim, CA, USA, Nov. 2018, pp. 584–588.
[34] M. Khani, M. Alizadeh, J. Hoydis, and P. Fleming, "Adaptive Neural Signal Detection for Massive MIMO," IEEE Trans. Wireless Commun., vol. 19, no. 8, pp. 5635–5648, Aug. 2020.
[35] C. J. Maddison, A. Mnih, and Y. W. Teh, "The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables," in International Conference on Learning Representations (ICLR 2017), Toulon, France, Apr. 2017, pp. 1–20.
[36] E. Jang, S. Gu, and B. Poole, "Categorical Reparameterization with Gumbel-Softmax," in International Conference on Learning Representations (ICLR 2017), Toulon, France, Apr. 2017, pp. 1–13.
[37] E. Beck, C. Bockelmann, and A. Dekorsy, "Concrete MAP Detection: A Machine Learning Inspired Relaxation," in International ITG Workshop on Smart Antennas (WSA 2020), vol. 24, Hamburg, Germany, Feb. 2020, pp. 1–5.
[38] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, "The Marginal Value of Adaptive Gradient Methods in Machine Learning," in Advances in Neural Information Processing Systems (NIPS 2017), 2017.
IEEE Trans. Signal Process., vol. 60, no. 12, pp. 6421–6434, Dec. 2012.
Edgar Beck received both his B.Sc. and M.Sc. in electrical engineering (with honors) from the University of Bremen, Germany, in 2017, where he is currently pursuing his Ph.D. degree in electrical engineering at the Department of Communications Engineering (ANT). His research interests include several aspects of future 5G/6G systems: cognitive radio, compressive sensing, massive MIMO systems and, in particular, the fertile application of machine learning in wireless communications.
Carsten Bockelmann received his Dipl.-Ing. and Ph.D. degrees in electrical engineering from the University of Bremen, Germany, in 2006 and 2012, respectively. Since 2012, he has been a Senior Research Group Leader with the University of Bremen, coordinating research activities regarding the application of compressive sensing/sampling to communication problems. His research interests include communications in massive machine communication, ultra-reliable low-latency communications (5G) and Industry 4.0, compressive sensing, channel coding, and transceiver design.