Learning to Detect
Neev Samuel, Member, IEEE, Tzvi Diskin, Member, IEEE, and Ami Wiesel, Member, IEEE
Abstract—In this paper we consider Multiple-Input-Multiple-Output (MIMO) detection using deep neural networks. We introduce two different deep architectures: a standard fully connected multi-layer network, and a Detection Network (DetNet) which is specifically designed for the task. The structure of DetNet is obtained by unfolding the iterations of a projected gradient descent algorithm into a network. We compare the accuracy and runtime complexity of the proposed approaches and achieve state-of-the-art performance while maintaining low computational requirements. Furthermore, we manage to train a single network to detect over an entire distribution of channels. Finally, we consider detection with soft outputs and show that the networks can easily be modified to produce soft decisions.
Index Terms—MIMO detection, deep learning, neural networks.
I. INTRODUCTION

Multiple input multiple output (MIMO) systems enable enhanced performance in communication systems by using many dimensions that account for time and frequency resources, multiple users, multiple antennas and other resources. While improving performance, these systems present difficult computational challenges when it comes to detection, since the detection problem is NP-Complete and there is a growing need for sub-optimal solutions with polynomial complexity.

Recent advances in the field of machine learning, specifically the success of deep neural networks in solving many problems in almost any field of engineering, suggest that a data driven approach for detection using machine learning may present a computationally efficient way to achieve near optimal detection accuracy.
A. MIMO detection
MIMO detection is a classical problem in simple hypothesis testing [1]. The maximum likelihood (ML) detector involves an exhaustive search and is the optimal detector in the sense of minimum joint probability of error for detecting all the symbols simultaneously. Unfortunately, it has an exponential runtime complexity which makes it impractical in large real time systems.

In order to overcome the computational cost of the maximum likelihood decoder there is considerable interest in the implementation of suboptimal detection algorithms which provide a better and more flexible accuracy vs complexity tradeoff. In the high accuracy regime, sphere decoding algorithms [2], [3], [4] were proposed, based on lattice search, offering better computational complexity with a rather low accuracy degradation relative to the full search. In the other regime, the most common suboptimal detectors are the linear receivers, i.e., the matched filter (MF), the decorrelator or zero forcing (ZF) detector, and the minimum mean squared error (MMSE) detector. More advanced detectors are based on decision feedback equalization (DFE), approximate message passing (AMP) [5] and semidefinite relaxation (SDR) [6], [7]. Currently, both AMP and SDR provide near optimal accuracy under many practical scenarios. AMP is simple and cheap to implement in practice, but is an iterative method that may diverge in challenging settings. SDR is more robust and has polynomial complexity, but is limited in the settings it addresses and is much slower in practice.

This research was partly supported by the Heron Consortium and by ISF grant 1339/15.

B. Background on Machine Learning
Machine learning is the ability to solve statistical problems using examples of inputs and their desired outputs. Unlike classical hypothesis testing, it is typically used when the underlying distributions are unknown and are characterized via sample examples. It has a long history but was previously limited to simple and small problems. Fast forwarding to recent years, the field witnessed the deep revolution. The "deep" adjective is associated with the use of complicated and expressive classes of algorithms, also known as architectures. These are typically neural networks with many non-linear operations and layers. Deep architectures are more expressive than shallow ones and can theoretically solve much harder and larger problems [8], but were previously considered impossible to optimize. With the advances in big data, optimization algorithms and stronger computing resources, such networks are currently state of the art in different problems from speech processing [9], [10] and computer vision [11], [12] to online gaming [13]. Typical solutions involve dozens and even hundreds of layers which are slowly optimized off-line over clusters of computers, to provide accurate and cheap decision rules which can be applied in real-time. In particular, one promising approach to designing deep architectures is by unfolding an existing iterative algorithm [14]. Each iteration is considered a layer and the algorithm is called a network. The learning begins with the existing algorithm as an initial starting point and uses optimization methods to improve the algorithm. For example, this strategy has been shown successful in the context of sparse reconstruction [15], [16]. Leading algorithms such as Iterative Shrinkage and Thresholding and a sparse version of AMP have both been improved by unfolding their iterations into a network and learning their optimal parameters.

Following this revolution, there is a growing body of works on deep learning methods for communication systems.
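To make the unfolding idea above concrete, the following sketch (our own illustration, not code from any cited work) unfolds projected gradient descent for a box-constrained least squares problem: each loop iteration plays the role of a layer, and the per-layer step sizes in `deltas` are exactly the quantities a learned variant would optimize from data.

```python
import numpy as np

def unfolded_pgd(H, y, deltas, x0=None):
    """Unfolded projected gradient descent for min ||y - Hx||^2 over x in [-1, 1]^K.
    Each entry of `deltas` is one layer's step size; in a learned (unfolded)
    version these scalars would be trained instead of fixed."""
    x = np.zeros(H.shape[1]) if x0 is None else x0
    for d in deltas:                        # one loop iteration = one "layer"
        x = x + d * H.T @ (y - H @ x)       # gradient step on the residual
        x = np.clip(x, -1.0, 1.0)           # projection onto the box
    return x

rng = np.random.default_rng(0)
H = rng.standard_normal((8, 4))
x_true = np.array([1.0, -1.0, 1.0, 1.0])
y = H @ x_true                              # noiseless observation for the demo
step = 1.0 / np.linalg.eigvalsh(H.T @ H).max()   # stable step size for this H
x_hat = unfolded_pgd(H, y, deltas=[step] * 100)
print(np.round(x_hat, 2))
```

Learning would then replace the fixed `deltas` (and possibly the linear maps applied at each layer) with trainable parameters, which is the construction used by the unfolded sparse solvers cited above.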
Exciting contributions in the context of error correcting codes include [17]–[21]. In [22] a machine learning approach is considered in order to decode over molecular communication systems where chemical signals are used for transfer of information. In these systems an accurate model of the channel is impossible to find. This approach of decoding without CSI (channel state information) is further developed in [23]. Machine learning for channel estimation is considered in [24], [25]. End-to-end detection over continuous signals is addressed in [26]. In [27] deep neural networks are used for the task of MIMO detection using an end-to-end approach, where learning is deployed both in the transmitter, in order to encode the transmitted signal, and in the receiver, where unsupervised deep learning is deployed using an autoencoder. Parts of our work on MIMO detection using deep learning have already appeared in [28]; see also [29]. Similar ideas were discussed in [30] in the context of robust regression.
C. Main contributions
The main contribution of this paper is the introduction of two deep learning networks for MIMO detection. We show that, under a wide range of scenarios including different channel models and various digital constellations, our networks achieve near optimal detection performance with low computational complexity.

Another important result we show is their ability to easily provide soft outputs as required by modern communication systems. We show that for different constellations the soft outputs of our networks achieve accuracy comparable to that of the M-Best sphere decoder with low computational complexity.

In a more general learning perspective, an important contribution is DetNet's ability to perform on multiple models with a single training. Recently, there were works on learning to invert linear channels and reconstruct signals [15], [16], [31]. To the best of our knowledge, these were developed and trained to address a single fixed channel. In contrast, DetNet is designed for handling multiple channels simultaneously with a single training phase.

The paper is organized in the following order: in Section II we present the MIMO detection problem and how it is formulated as a learning problem, including the use of one-hot representations. In Section III we present two types of neural network based detectors, FullyCon and DetNet. In Section IV we consider soft decisions. In Section V we compare the accuracy and the runtime of the proposed learning based detectors against traditional detection methods, both in the hard decision and the soft decision cases. Finally, Section VI provides concluding remarks.
D. Notation
In this paper, we define the normal distribution with mean $\mu$ and variance $\sigma^2$ as $\mathcal{N}(\mu, \sigma^2)$. The uniform distribution with minimum value $a$ and maximum value $b$ will be $U(a, b)$. Boldface uppercase letters denote matrices. Boldface lowercase letters denote vectors. The superscript $(\cdot)^T$ denotes the transpose. The $i$'th element of the vector $\mathbf{x}$ will be denoted as $x_i$. Unless stated otherwise, the term independent and identically distributed (i.i.d.) Gaussian matrix refers to a matrix where each of its elements is i.i.d. sampled from the normal distribution $\mathcal{N}(0, 1)$. The rectified linear unit is defined as $\rho(x) = \max\{0, x\}$. When considering a complex matrix or vector, its real and imaginary parts are defined as $\Re(\cdot)$ and $\Im(\cdot)$ respectively. An $\alpha$-Toeplitz matrix $\mathbf{M}$ will be defined as a matrix such that $\mathbf{M}^T\mathbf{M}$ is a square Toeplitz matrix where the value of each element on the $i$'th diagonal is $\alpha^{i-1}$.

II. PROBLEM FORMULATION
A. MIMO detection
We consider the standard linear MIMO model:
$$\bar{\mathbf{y}} = \bar{\mathbf{H}}\bar{\mathbf{x}} + \bar{\mathbf{w}}, \qquad (1)$$
where $\bar{\mathbf{y}} \in \mathbb{C}^N$ is the received vector, $\bar{\mathbf{H}} \in \mathbb{C}^{N \times K}$ is the channel matrix, $\bar{\mathbf{x}} \in \bar{\mathcal{S}}^K$ is an unknown vector of independent and equal probability symbols from some finite constellation $\bar{\mathcal{S}}$ (e.g. PSK or QAM), and $\bar{\mathbf{w}}$ is a noise vector of size $N$ with independent, zero mean Gaussian variables of variance $\sigma^2$.

Our detectors do not assume knowledge of the noise variance $\sigma^2$. Hypothesis testing theory guarantees that it is unnecessary for optimal detection. Indeed, the ML rule does not depend on it. This is in contrast to the MMSE and AMP decoders that exploit this parameter and are therefore less robust in cases where the noise variance is not known exactly.

B. Reparameterization
A main challenge in MIMO detection is the use of complex valued signals and various digital constellations $\bar{\mathcal{S}}$ which are less common in machine learning. In order to use standard tools and provide a unified framework, we re-parameterize the problem using real valued vectors and one-hot mappings as described below.

First, throughout this work, we avoid handling complex valued variables, and use the following convention:
$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{w}, \qquad (2)$$
where
$$\mathbf{y} = \begin{bmatrix} \Re(\bar{\mathbf{y}}) \\ \Im(\bar{\mathbf{y}}) \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} \Re(\bar{\mathbf{w}}) \\ \Im(\bar{\mathbf{w}}) \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} \Re(\bar{\mathbf{x}}) \\ \Im(\bar{\mathbf{x}}) \end{bmatrix}, \quad \mathbf{H} = \begin{bmatrix} \Re(\bar{\mathbf{H}}) & -\Im(\bar{\mathbf{H}}) \\ \Im(\bar{\mathbf{H}}) & \Re(\bar{\mathbf{H}}) \end{bmatrix}, \qquad (3)$$
and $\mathbf{y} \in \mathbb{R}^{2N}$ is the received vector, $\mathbf{H} \in \mathbb{R}^{2N \times 2K}$ is the channel matrix and $\mathbf{x} \in \mathcal{S}^{2K}$ where $\mathcal{S} = \Re\{\bar{\mathcal{S}}\}$ (which is also equal to $\Im\{\bar{\mathcal{S}}\}$ in the complex valued constellations we tested).

A second convention concerns the re-parameterization of the discrete constellation $\mathcal{S} = \{s_1, \cdots, s_{|\mathcal{S}|}\}$ using a one-hot mapping. With each possible $s_i$ we associate a unit vector $\mathbf{u}_i \in \mathbb{R}^{|\mathcal{S}|}$. For example, the 4 dimensional one-hot mapping of the real part of the 16-QAM constellation is defined as
$$s = -3 \leftrightarrow \mathbf{u} = [1, 0, 0, 0], \quad s = -1 \leftrightarrow \mathbf{u} = [0, 1, 0, 0], \quad s = 1 \leftrightarrow \mathbf{u} = [0, 0, 1, 0], \quad s = 3 \leftrightarrow \mathbf{u} = [0, 0, 0, 1]. \qquad (4)$$
We denote this mapping via the function $s = f_{oh}(\mathbf{u})$ so that $s_i = f_{oh}(\mathbf{u}_i)$ for $i = 1, \cdots, |\mathcal{S}|$. More generally, for approximate inputs which are not unit vectors, the function is defined as
$$x = f_{oh}(\mathbf{x}_{oh}) = \sum_{i=1}^{|\mathcal{S}|} s_i [\mathbf{x}_{oh}]_i. \qquad (5)$$
The description above holds for a scalar symbol. The MIMO model involves a vector of $K$ symbols which is handled by stacking the one-hot mappings of each of its elements. Altogether, a vector $\mathbf{x}_{oh} \in \{0, 1\}^{|\mathcal{S}| \cdot K}$ is mapped to $\mathbf{x} \in \mathcal{S}^K$.

C. Learning to detect
We end this section by formulating the MIMO detection problem as a machine learning task. The first step in machine learning is choosing a class of possible detectors, also known as an architecture. A network architecture is a function $\hat{\mathbf{x}}_{oh}(\mathbf{H}, \mathbf{y}; \theta)$ parameterized by $\theta$ that detects the unknown $\mathbf{x}_{oh}$ given $\mathbf{y}$ and $\mathbf{H}$. Learning is the problem of finding the $\theta$ within some feasible set that will lead to strong detectors $\hat{\mathbf{x}}_{oh}(\mathbf{H}, \mathbf{y}; \theta)$. For this purpose, we fix a loss function $l(\mathbf{x}_{oh}; \hat{\mathbf{x}}_{oh}(\mathbf{H}, \mathbf{y}; \theta))$ that measures the distance between the true vectors and their estimates. Then, we find the network's parameter $\theta$ by minimizing the loss function over the MIMO model distribution:
$$\min_\theta \mathbb{E}\left\{ l(\mathbf{x}_{oh}; \hat{\mathbf{x}}_{oh}(\mathbf{H}, \mathbf{y}; \theta)) \right\}, \qquad (6)$$
where the expectation is with respect to all the random variables in (2), i.e., $\mathbf{x}$, $\mathbf{w}$, and $\mathbf{H}$. Learning to detect is defined as finding the best parameters $\theta$ of the network's architecture that minimize the expected loss $l(\cdot;\cdot)$ over the distribution in (2).

We always assume perfect channel state information (CSI), which means that the channel $\mathbf{H}$ is exactly known during detection time. However, we differentiate between two possible cases:

- Fixed Channel (FC): In the FC scenario, $\mathbf{H}$ is deterministic and constant (or a realization of a degenerate distribution which only takes a single value). This means that during the training phase we know over which channel the detector will detect.
- Varying Channel (VC): In the VC scenario, we assume $\mathbf{H}$ random with a known continuous distribution. It is still completely known but changes in each realization, and a single detection algorithm must be designed for all its possible realizations. When detecting, the channel is randomly chosen, and the network must be able to generalize over the entire distribution of possible channels.

Altogether, our goal is to detect $\mathbf{x}$, using a neural network that receives $\mathbf{y}$ and $\mathbf{H}$ as inputs and provides an estimate $\hat{\mathbf{x}}$.
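As a minimal illustration of this formulation, the following sketch (ours; the SNR normalization is an assumption, not taken from the paper) draws one training sample for the VC scenario: a fresh channel, BPSK symbols with their stacked one-hot targets, and a noisy observation from model (2).

```python
import numpy as np

rng = np.random.default_rng(0)
SYMBOLS = np.array([-1.0, 1.0])                     # real BPSK constellation S

def sample(K=4, N=8, snr_db=10.0):
    """One (x_oh, y, H) training sample from y = Hx + w (VC scenario)."""
    H = rng.standard_normal((N, K))                 # fresh i.i.d. Gaussian channel
    idx = rng.integers(0, len(SYMBOLS), size=K)     # equiprobable symbol indices
    x = SYMBOLS[idx]
    x_oh = np.eye(len(SYMBOLS))[idx].reshape(-1)    # stacked one-hot targets
    sigma = np.sqrt(K / 10 ** (snr_db / 10))        # assumed SNR convention
    y = H @ x + sigma * rng.standard_normal(N)
    return x_oh, y, H

x_oh, y, H = sample()
print(x_oh.shape, y.shape, H.shape)                 # (8,) (8,) (8, 4)
```

During training, a detector would then be fit by stochastic gradient descent on the empirical version of (6) over a stream of such samples.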
In the next section, we will introduce two competing architectures that tradeoff accuracy and complexity.

III. DEEP MIMO DETECTORS
A. FullyCon
The fully connected multi-layer network is a well known architecture which is considered to be the basic deep neural network architecture, and from now on will be named simply 'FullyCon'. It is composed of $L$ layers, where the output of each layer is the input of the next layer. Each layer can be described by the following equations:
$$\mathbf{q}_1 = \mathbf{y}, \qquad \mathbf{q}_{k+1} = \rho(\mathbf{W}_k\mathbf{q}_k + \mathbf{b}_k), \qquad \hat{\mathbf{x}}_{oh} = \mathbf{W}_L\mathbf{q}_L + \mathbf{b}_L, \qquad \hat{\mathbf{x}} = f_{oh}(\hat{\mathbf{x}}_{oh}). \qquad (7)$$
An illustration of a single layer of FullyCon can be seen in Fig. 1.

Fig. 1. A flowchart representing a single layer of the fully connected network.

The parameters of the network that are optimized during the learning phase are:
$$\theta = \{\mathbf{W}_k, \mathbf{b}_k\}_{k=1}^{L}. \qquad (8)$$
The loss function used is a simple $l_2$ distance between the estimated signal and the true signal:
$$l(\mathbf{x}_{oh}; \hat{\mathbf{x}}_{oh}(\mathbf{H}, \mathbf{y}; \theta)) = \|\mathbf{x}_{oh} - \hat{\mathbf{x}}_{oh}\|^2. \qquad (9)$$
FullyCon is simple and general purpose. It has a relatively small number of parameters to optimize. It only uses the input $\mathbf{y}$, and does not exploit the channel $\mathbf{H}$ within (7). The dependence on the channel is indirect via the expectation in (6), which depends on $\mathbf{H}$ and leads to parameters that depend on its moments. The result is a simple and straightforward structure which is ideal for detection over the FC model. As will be detailed in the simulations section, it manages to achieve almost optimal accuracy with low complexity. On the other hand, our experience with FullyCon for the VC model led to disappointing results. It was not expressive enough to capture the dependencies of changing channels. We also tried to add the channel matrix $\mathbf{H}$ as an input, and this attempt failed too. In the next subsection, we propose a more expressive architecture specifically designed for addressing this challenge.

B. DetNet
In this section we present an architecture designed specifically for MIMO detection that will be named from now on 'DetNet' (an abbreviation of 'detection network'). The derivation begins by noting that an efficient MIMO detector should not work with $\mathbf{y}$ directly, but use the compressed sufficient statistic:
$$\mathbf{H}^T\mathbf{y} = \mathbf{H}^T\mathbf{H}\mathbf{x} + \mathbf{H}^T\mathbf{w}. \qquad (10)$$
This hints that two main ingredients in the architecture should be $\mathbf{H}^T\mathbf{y}$ and $\mathbf{H}^T\mathbf{H}\mathbf{x}$. Second, our construction is based on mimicking a projected gradient descent like solution for the maximum likelihood optimization. Such an algorithm would lead to iterations of the form
$$\hat{\mathbf{x}}_{k+1} = \Pi\left[\hat{\mathbf{x}}_k - \delta_k \left.\frac{\partial \|\mathbf{y} - \mathbf{H}\mathbf{x}\|^2}{\partial \mathbf{x}}\right|_{\mathbf{x}=\hat{\mathbf{x}}_k}\right] = \Pi\left[\hat{\mathbf{x}}_k - \delta_k \mathbf{H}^T\mathbf{y} + \delta_k \mathbf{H}^T\mathbf{H}\hat{\mathbf{x}}_k\right], \qquad (11)$$
where $\hat{\mathbf{x}}_k$ is the estimate in the $k$'th iteration, $\Pi[\cdot]$ is a nonlinear projection operator, and $\delta_k$ is a step size. Intuitively, each iteration is a linear combination of $\hat{\mathbf{x}}_k$, $\mathbf{H}^T\mathbf{y}$, and $\mathbf{H}^T\mathbf{H}\hat{\mathbf{x}}_k$ followed by a non-linear projection. We enrich these iterations by lifting the input to a higher dimension in each iteration and applying standard non-linearities which are common in deep neural networks. In order to further improve the performance we treat the gradient step sizes $\delta_k$ at each step as learned parameters and optimize them during the training phase. This yields the following architecture:
$$\begin{aligned}
\mathbf{q}_k &= \hat{\mathbf{x}}_{k-1} - \delta_{1k}\mathbf{H}^T\mathbf{y} + \delta_{2k}\mathbf{H}^T\mathbf{H}\hat{\mathbf{x}}_{k-1} \\
\mathbf{z}_k &= \rho\left(\mathbf{W}_{1k}\begin{bmatrix}\mathbf{q}_k \\ \mathbf{v}_{k-1}\end{bmatrix} + \mathbf{b}_{1k}\right) \\
\hat{\mathbf{x}}_{oh,k} &= \mathbf{W}_{2k}\mathbf{z}_k + \mathbf{b}_{2k} \\
\hat{\mathbf{x}}_k &= f_{oh}(\hat{\mathbf{x}}_{oh,k}) \\
\hat{\mathbf{v}}_k &= \mathbf{W}_{3k}\mathbf{z}_k + \mathbf{b}_{3k} \\
\hat{\mathbf{x}}_0 &= \mathbf{0}, \qquad \hat{\mathbf{v}}_0 = \mathbf{0},
\end{aligned} \qquad (12)$$
with the trainable parameters
$$\theta = \{\mathbf{W}_{1k}, \mathbf{b}_{1k}, \mathbf{W}_{2k}, \mathbf{b}_{2k}, \mathbf{W}_{3k}, \mathbf{b}_{3k}, \delta_{1k}, \delta_{2k}\}_{k=1}^{L}. \qquad (13)$$
To enjoy the lifting and non-linearities, the parameters $\mathbf{W}_{1k}$ are defined as tall and skinny matrices. The final estimate is defined as $\hat{\mathbf{x}}_L$. For convenience, the structure of each DetNet layer is illustrated in Fig. 2.

Training deep networks is a difficult task due to vanishing gradients, saturation of the activation functions, sensitivity to initialization and more [32]. To address these challenges, and following the notion of the auxiliary classifiers featured in GoogLeNet [12], we adopted a loss function that takes into account the outputs of all of the layers:
$$l(\mathbf{x}_{oh}; \hat{\mathbf{x}}_{oh}(\mathbf{H}, \mathbf{y}; \theta)) = \sum_{l=1}^{L} \log(l)\,\|\mathbf{x}_{oh} - \hat{\mathbf{x}}_{oh,l}\|^2. \qquad (14)$$
In our final implementation, in order to further enhance the performance of DetNet, we added a residual feature from ResNet [11] where the output of each layer is a weighted average with the output of the previous layer.

IV. SOFT DECISION OUTPUT
In this section, we consider a more general setting in which the MIMO detector needs to provide soft outputs. High end communication systems typically resort to iterative decoding where the MIMO detector and the error correcting decoder iteratively exchange information on the unknowns until convergence. For this purpose, the MIMO detector must replace its hard estimates with soft posterior distributions $\mathrm{Prob}(x_j = s_i \mid \mathbf{y})$ for each unknown $j = 1, \cdots, K$ and each possible symbol $i = 1, \cdots, |\mathcal{S}|$. More precisely, it also needs to allow additional soft inputs, but we leave this for future work.

Computation of the posteriors is straightforward based on Bayes law, but its complexity is exponential in the size of the signal and constellation. Similarly to the maximum likelihood algorithm in the hard decision case, this computation yields optimal accuracy yet is intractable. Thus, the goal in this section is to design networks that output approximations of the posteriors. At first glance, this seems difficult to learn as we have no training set of posteriors and cannot define a loss function. Remarkably, this is not a problem and the probabilities of arbitrary constellations can be easily recovered using the standard $l_2$ loss function with respect to the one-hot representation $\mathbf{x}_{oh}$. Indeed, consider a scalar $x$ and a single $s \in \mathcal{S}$ associated with its one-hot bit $x_{oh,s}$; then it is well known that
$$\arg\min_{\hat{x}_{oh}} \mathbb{E}\left[\|x_{oh} - \hat{x}_{oh}\|^2 \mid \mathbf{y}\right] = \mathbb{E}[x_{oh} \mid \mathbf{y}] = \mathrm{Prob}(x_{oh,s} = 1 \mid \mathbf{y}) = \mathrm{Prob}(x = s \mid \mathbf{y}). \qquad (15)$$
Thus, assuming that our network is sufficiently expressive and globally optimized, the one-hot output $\hat{\mathbf{x}}_{oh}$ will provide the exact posterior probabilities.

V. NUMERICAL RESULTS
In this section, we provide numerical results on the accuracy and complexity of the proposed networks in comparison to competing methods. In the FC case, the results are over the 0.55-Toeplitz channel. In the VC case, and when testing the soft output performance, the results presented are over random channels, where each element is sampled i.i.d. from the normal distribution $\mathcal{N}(0, 1)$.

A. Implementation details
We train both networks using a variant of the stochastic gradient descent method [33], [34] for optimizing deep networks, named the Adam optimizer [35]. All networks were implemented using the Python based TensorFlow library [36]. To give a rough idea of the computation needed during the learning phase, optimizing the detectors in our numerical results in both architectures took around 3 days on a standard Intel i7-6700 processor. Each sample was independently generated from (2) according to the statistics of $\mathbf{x}$, $\mathbf{H}$ (either in the FC or VC model) and $\mathbf{w}$. During training, the noise variance was randomly generated so that the SNR will be uniformly distributed on $U(\mathrm{SNR}_{\min}, \mathrm{SNR}_{\max})$.

Fig. 2. A flowchart representing a single layer of DetNet. The network is composed out of L such layers, where each layer's output is the next layer's input.

B. Competing algorithms
When presenting our network performance we shall use the following naming conventions:

FullyCon: The basic fully-connected deep architecture.
DetNet: The DetNet deep architecture.

In the hard decision scenarios, we tested our deep networks against the following detection algorithms:

ZF: This is the classical decorrelator, also known as the least squares or zero forcing (ZF) detector [1].
AMP: Approximate message passing algorithm from [5].
SDR: A decoder based on semidefinite relaxation implemented using an efficient interior point solver [6], [7]. For the 8-PSK constellation we implemented the SDR variation suggested in [37].
SD: An implementation of the sphere decoding algorithm as presented in [38].

In the soft output case, we tested our networks against the M-Best sphere decoding algorithm as presented in [3] (originally named K-Best, but changed here to avoid confusion with K, the transmitted signal size):

M-Best SD M=5: The M-Best sphere decoding algorithm, where the number of candidates we keep is 5.
M-Best SD M=7: Same as M-Best SD M=5 with 7 candidates.
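As a concrete reference point for the simplest baseline above, a zero forcing detector amounts to a pseudo-inverse followed by per-coordinate slicing. This is a generic textbook sketch (ours), not the implementation used in the experiments:

```python
import numpy as np

def zero_forcing_detect(H, y, constellation):
    """Decorrelator / zero forcing: invert the channel with the pseudo-inverse
    (H^T H)^{-1} H^T, then slice each coordinate to the nearest symbol."""
    x_zf = np.linalg.pinv(H) @ y
    dists = np.abs(x_zf[:, None] - constellation[None, :])
    return constellation[np.argmin(dists, axis=1)]

rng = np.random.default_rng(0)
H = rng.standard_normal((6, 4))                 # 6x4 real channel
x = np.array([1.0, -1.0, -1.0, 1.0])            # true BPSK symbols
y = H @ x + 0.01 * rng.standard_normal(6)       # high-SNR observation
print(zero_forcing_detect(H, y, np.array([-1.0, 1.0])))
```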
C. Accuracy results

1) Fixed Channel (FC):
In the case of the FC scenario, where we know during the learning phase over what realization of the channel we need to detect, the performance of both our networks was comparable to most of the competitors except SD. Both DetNet and FullyCon managed to achieve accuracy results comparable to SDR and AMP. This result emphasizes the notion that when learning to detect over simple scenarios such as FC, a simple network is expressive enough. And since a simple network is easier to optimize and has lower complexity, it is preferable. In Fig. 3 we present the accuracy rates over a range of SNR values in the FC model. This is a rather difficult setting and algorithms such as AMP did not succeed to converge.

Fig. 3. Comparison of the detection algorithms' BER performance in the fixed channel case over a BPSK modulated signal.
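The FC experiments above use the 0.55-Toeplitz channel from the notation section. One hypothetical way to realize such a channel (a sketch under our reading of that definition, with $\mathbf{M}^T\mathbf{M}$ having entries $\alpha^{|i-j|}$) is via a Cholesky factor of the Toeplitz Gram matrix:

```python
import numpy as np

def alpha_toeplitz(alpha, k):
    """Channel M such that M.T @ M equals the Toeplitz matrix G with
    G[i, j] = alpha ** |i - j| (ones on the main diagonal)."""
    idx = np.arange(k)
    gram = alpha ** np.abs(idx[:, None] - idx[None, :])
    # the lower Cholesky factor L satisfies L @ L.T = gram, so M = L.T works
    return np.linalg.cholesky(gram).T

M = alpha_toeplitz(0.55, 4)
print(np.round(M.T @ M, 3))   # recovers the 0.55 ** |i - j| Toeplitz structure
```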
2) Varying channel:
In the VC case, the accuracy results of FullyCon were poor and the network did not manage to learn how to detect properly. DetNet managed to achieve accuracy rates comparable to those of SDR and AMP, and almost comparable to those of SD, while being computationally cheaper (see the next section regarding computational resources). In Fig. 4 we compare the accuracy results over a 30x60 real valued channel with BPSK signals, and in Fig. 5 we compare the accuracy of a 20x30 complex channel with QPSK symbols. In both cases DetNet achieves accuracy rates comparable to SDR and AMP and near SD, and accuracy much better than ZF and DF. Results over larger constellations are presented in Fig. 6 and 7, where we compare the accuracy rates over complex channels of size 15x25 for the 16-QAM and 8-PSK constellations respectively. We can see that in those larger constellations DetNet performs better than AMP and SDR. For both constellations we can observe that DetNet reaches accuracy levels topped only by SD.

Fig. 4. Comparison of the detection algorithms' BER performance in the varying channel case over a BPSK modulated signal. All algorithms were tested on channels of size 30x60.

Fig. 5. Comparison of the detection algorithms' BER performance in the varying channel case over a QPSK modulated signal. All algorithms were tested on channels of size 20x30.
3) Soft Outputs:
We also experimented with soft decoding. Implementing a full iterative decoding scheme is outside the scope of this paper, and we only provide initial results on the accuracy of our posterior estimates. For this purpose, we examined smaller models where the posteriors can be computed exactly and measured their statistical distance to our estimates. We shall define the following statistical distance function: given two probability distributions $P$ and $Q$ over the symbol set $\mathcal{S}$ (that is, the probability of each symbol to be the true symbol), the distance $\delta(P, Q)$ shall be:
$$\delta(P, Q) = \sum_{s \in \mathcal{S}} |P(s) - Q(s)|. \qquad (16)$$
As reference, we compare our results to the M-Best detectors [3]. In Fig. 8 we present accuracy in the case of a BPSK signal over a 10x20 real channel. In this setting we reach accuracy levels better than those achieved by the M-Best algorithm. As seen in Fig. 8, adding additional layers improves the accuracy of the soft output. In Fig. 9 we present the results over a 4x8 complex channel with the 16-QAM constellation. We can see that the performance of DetNet is comparable to the M-Best sphere decoding algorithm. For completeness, in Fig. 10 we added the 8-PSK constellation soft output, where DetNet is comparable to the M-Best algorithms only in the high SNR region.

Fig. 6. Comparison of the detection algorithms' SER performance in the varying channel case over a 16-QAM modulated signal. All algorithms were tested on channels of size 15x25.

Fig. 7. Comparison of the detection algorithms' SER performance in the varying channel case over an 8-PSK modulated signal. All algorithms were tested on channels of size 15x25.
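To illustrate how the $l_2$-trained soft outputs of (15) relate to the distance (16), the following toy sketch (ours) estimates the posterior of a scalar BPSK symbol by a conditional mean, which is exactly what an $l_2$-optimal one-hot predictor computes, and measures its distance to the exact Bayes posterior:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0
n = 1_000_000
x = rng.choice([-1.0, 1.0], size=n)            # equiprobable BPSK symbols
y = x + sigma * rng.standard_normal(n)         # scalar AWGN channel

def exact_posterior(y0):
    """Bayes posterior [P(x=-1|y0), P(x=+1|y0)] for scalar BPSK in AWGN."""
    p1 = 1.0 / (1.0 + np.exp(-2.0 * y0 / sigma**2))
    return np.array([1.0 - p1, p1])

# l2-optimal one-hot output near y0 = 0.3: the conditional mean of the
# one-hot bits over samples with y close to y0 (cf. eq. (15))
mask = np.abs(y - 0.3) < 0.05
est = np.array([(x[mask] == -1.0).mean(), (x[mask] == 1.0).mean()])

# statistical distance of eq. (16) between the estimate and the truth
delta = np.sum(np.abs(est - exact_posterior(0.3)))
print(delta)   # small: the conditional mean recovers the posterior
```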
D. Computational Resources

1) FullyCon and DetNet run time:
In order to estimate the computational complexity of the different detectors we compared their run times. Comparing complexity is non-trivial due to many complicating factors such as implementation details and platforms. To ensure fairness, all the algorithms were tested on the same machine via a Python 2.7 environment using the Numpy package. The networks were converted from TensorFlow objects to Numpy objects. We note that the run-time of SD depends on the SNR, and we therefore report a range of times.

An important factor when considering the run time of the neural networks is the effect of the batch size. Unlike classical detectors such as SDR and SD, neural networks can detect over entire batches of data which speeds up the detection process. This is true also for the AMP algorithm, where computation can be made on an entire batch of signals at once. However, the improvement introduced by using batches is highly dependent on the platform used (CPU/GPU/FPGA etc.). Therefore, for completeness, we present the run time for several batch sizes, including batch size equal to one.

In Table I the run times are presented for hard decision detection in the FC case. We can see that FullyCon is faster than all other detection algorithms, even without using batches. DetNet is slightly faster than traditional detection algorithms without using batches, yet when using batches, the run time improves significantly compared to other detection methods.

Fig. 8. Comparison of the accuracy of the soft output relative to the posterior probability in the case of a BPSK signal over a 10x20 real valued channel. We present the results for 2 types of DetNet, one with 30 layers and the second one with 50 layers.

Fig. 9. Comparison of the accuracy of the soft output relative to the posterior probability for a 16-QAM signal over a 4x8 complex valued channel.

TABLE I: FIXED CHANNEL RUNTIME COMPARISON

Channel        Batch size  FullyCon  DetNet   SDR    AMP     SD
Top055 30x60   1           0.0004    0.0045   0.009  0.005   0.001-0.01
Top055 30x60   10          6.6E-05   0.0007   0.009  0.001   0.001-0.01
Top055 30x60   100         2.4E-05   1.6E-04  0.009  0.0003  0.001-0.01
Top055 30x60   1000        1.6E-05   1.1E-04  0.009  0.0003  0.001-0.01
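One reason the learned detectors benefit so strongly from batching in the run time tables is that a whole batch passes through each layer as a single matrix product. A minimal sketch (ours) of batched layer evaluation:

```python
import numpy as np

rng = np.random.default_rng(3)
W, b = rng.standard_normal((32, 60)), rng.standard_normal(32)

Y = rng.standard_normal((1000, 60))        # batch of 1000 received vectors
Z = np.maximum(Y @ W.T + b, 0.0)           # one ReLU layer for the whole batch
print(Z.shape)                             # (1000, 32)
```

The same matmul executes once per batch instead of once per vector, which is why the per-sample run times of FullyCon and DetNet shrink as the batch grows, while SDR and SD must process vectors one at a time.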
In Table II we present the results for the VC setting. In the BPSK case the relative time difference between the different detection algorithms is similar to the FC case, with the exception of SD being relatively slower. In larger constellations (8-PSK/16-QAM), DetNet's relative advantage when comparing against AMP/SDR is smaller than in the BPSK case (and in the 16-QAM constellation AMP was slightly faster without using batches). The reason is that accurate detection with these constellations requires larger networks. On the other hand, the relative performance vs SD improved.

Fig. 10. Comparison of the accuracy of the soft output relative to the posterior probability for an 8-PSK signal over a 4x8 complex valued channel.

TABLE II: RUNTIME COMPARISON IN VC. DETNET IS COMPARED WITH THE SDR, AMP AND SPHERE DECODING ALGORITHMS

Constellation, channel size  Batch size  DetNet  SDR    AMP      SD
BPSK 30x60                   1           0.0066  0.024  0.0093   0.008-0.1
BPSK 30x60                   10          0.0011  0.024  0.0016   0.008-0.1
BPSK 30x60                   100         0.0005  0.024  0.00086  0.008-0.1
16-QAM 15x25                 1           0.006   -      0.01     0.01-0.4
16-QAM 15x25                 10          0.0014  -      0.002    0.01-0.4
16-QAM 15x25                 100         0.0003  -      0.001    0.01-0.4
8-PSK 15x25                  1           0.019   0.021  -        0.004-0.06
8-PSK 15x25                  10          0.0029  0.021  -        0.004-0.06
8-PSK 15x25                  100         0.0005  0.021  -        0.004-0.06
In Table III we compare the run time of the detection algorithms in the soft-output case. As we can see, in the BPSK case without using batches the performance of DetNet is comparable to the performance of the M-Best sphere decoders, and using batches improves the performance significantly. In the 16-QAM/8-PSK cases DetNet is slightly faster than the M-Best decoders even without using batches.
2) Accuracy-Complexity Trade-Off:
An interesting feature of DetNet is that the complexity-accuracy trade-off can be decided during run-time. Each of the network's layers outputs an estimated signal, and our loss optimizes all of them. We usually use the output of the last layer as the result since it is the most accurate, but it is possible to take the estimated output $\hat{\mathbf{x}}_i$ of previous layers to allow faster detection. In Fig. 11 we present the accuracy as a function of the number of layers.

TABLE III: RUNTIME COMPARISON OF SOFT OUTPUT IN VC. DETNET IS COMPARED WITH THE M-BEST SPHERE DECODING ALGORITHM

Constellation, channel size  Batch size  DetNet   M-Best (M=5)  M-Best (M=7)
BPSK 10x20                   1           0.0075   0.006         0.008
BPSK 10x20                   10          0.00092  0.006         0.008
BPSK 10x20                   100         0.00029  0.006         0.008
16-QAM 4x8                   1           0.006    0.008         0.01
16-QAM 4x8                   10          0.0008   0.008         0.01
16-QAM 4x8                   100         0.0001   0.008         0.01
8-PSK 4x8                    1           0.02     0.05          0.07
8-PSK 4x8                    10          0.003    0.05          0.07
8-PSK 4x8                    100         0.0012   0.05          0.07

Fig. 11. Comparison of the average BER as a function of the layer chosen to be the output layer.

VI. CONCLUSION
In this paper we investigated the ability of deep neural networks to serve as MIMO detectors. We introduced two deep learning architectures that provide promising accuracy with low and flexible computational complexity. We demonstrated their application to various digital constellations, and their ability to provide accurate soft posterior outputs. An important feature of one of our networks is its ability to detect over multiple channel realizations with a single training.

Using neural networks as a general scheme in MIMO detection still has a long way to go and there are many open questions. These include their hardware complexity, robustness, and integration into full communication systems. Nonetheless, we believe this approach is promising and has the potential to impact future communication systems. Neural networks can be trained on realistic channel models and tune their performance for specific environments. Their architectures and batch operation are more natural to hardware implementation than algorithms such as SDR and SD. Finally, their multi-layer structure allows a flexible accuracy vs complexity nature as required by many modern applications.

ACKNOWLEDGMENTS
We would like to thank Shai Shalev-Shwartz for many discussions throughout this research. In addition, we thank Amir Globerson and Yoav Wald for their ideas and help with the soft output networks.

REFERENCES

[1] S. Verdu, Multiuser Detection. Cambridge University Press, 1998.
[2] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, "Closest point search in lattices," IEEE Transactions on Information Theory, vol. 48, no. 8, pp. 2201–2214, 2002.
[3] Z. Guo and P. Nilsson, "Algorithm and implementation of the K-best sphere decoding for MIMO detection," IEEE Journal on Selected Areas in Communications, vol. 24, no. 3, pp. 491–503, 2006.
[4] S. Suh and J. R. Barry, "Reduced-complexity MIMO detection via a slicing breadth-first tree search," IEEE Transactions on Wireless Communications, vol. 16, no. 3, pp. 1782–1790, 2017.
[5] C. Jeon, R. Ghods, A. Maleki, and C. Studer, "Optimality of large MIMO detection via approximate message passing," in IEEE International Symposium on Information Theory (ISIT), 2015, pp. 1227–1231.
[6] Z. Q. Luo, W. K. Ma, A. M. So, Y. Ye, and S. Zhang, "Semidefinite relaxation of quadratic optimization problems," IEEE Signal Processing Magazine, vol. 27, no. 3, pp. 20–34, 2010.
[7] J. Jaldén and B. Ottersten, "The diversity order of the semidefinite relaxation detector," IEEE Transactions on Information Theory, vol. 54, no. 4, pp. 1406–1422, 2008.
[8] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[9] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[10] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in International Conference on Machine Learning, 2014, pp. 1764–1772.
[11] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[12] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[13] D. Silver, A. Huang, C. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[14] J. R. Hershey, J. L. Roux, and F. Weninger, "Deep unfolding: Model-based inspiration of novel deep architectures," arXiv preprint arXiv:1409.2574, 2014.
[15] K. Gregor and Y. LeCun, "Learning fast approximations of sparse coding," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 399–406.
[16] M. Borgerding and P. Schniter, "Onsager-corrected deep learning for sparse linear inverse problems," in IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2016, pp. 227–231.
[17] E. Nachmani, Y. Be'ery, and D. Burshtein, "Learning to decode linear codes using deep learning," in 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2016, pp. 341–346.
[18] E. Nachmani, E. Marciano, D. Burshtein, and Y. Be'ery, "RNN decoding of linear block codes," arXiv preprint arXiv:1702.07560, 2017.
[19] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein, and Y. Be'ery, "Deep learning methods for improved decoding of linear codes," IEEE Journal of Selected Topics in Signal Processing, 2018.
[20] T. J. O'Shea and J. Hoydis, "An introduction to machine learning communications systems," arXiv preprint arXiv:1702.00832, 2017.
[21] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, "On deep learning-based channel decoding," in 51st Annual Conference on Information Sciences and Systems (CISS), 2017, pp. 1–6.
[22] N. Farsad and A. Goldsmith, "Detection algorithms for communication systems using deep learning," arXiv preprint arXiv:1705.08044, 2017.
[23] ——, "Neural network detection of data sequences in communication systems," arXiv preprint arXiv:1802.02046, 2018.
[24] H. Ye, G. Li, and B. Juang, "Power of deep learning for channel estimation and signal detection in OFDM systems," IEEE Wireless Communications Letters, 2017.
[25] T. O'Shea, K. Karra, and T. Clancy, "Learning approximate neural estimators for wireless channel state information," arXiv preprint arXiv:1707.06260, 2017.
[26] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, "Deep learning based communication over the air," IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 132–143, 2018.
[27] T. O'Shea, T. Erpek, and T. Clancy, "Deep learning based MIMO communications," arXiv preprint arXiv:1707.07980, 2017.
[28] N. Samuel, T. Diskin, and A. Wiesel, "Deep MIMO detection," arXiv preprint arXiv:1706.01151, 2017.
[29] T. Wang, C. Wen, H. Wang, F. Gao, T. Jiang, and S. Jin, "Deep learning for wireless physical layer: Opportunities and challenges," China Communications, vol. 14, no. 11, pp. 92–111, 2017.
[30] T. Diskin, G. Draskovic, F. Pascal, and A. Wiesel, "Deep robust regression," in IEEE 7th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), 2017, pp. 1–5.
[31] A. Mousavi and R. G. Baraniuk, "Learning to invert: Signal recovery via deep convolutional networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2272–2276.
[32] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in AISTATS, vol. 9, 2010, pp. 249–256.
[33] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Cognitive Modeling, vol. 5, no. 3, p. 1, 1988.
[34] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of COMPSTAT'2010. Springer, 2010, pp. 177–186.
[35] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[36] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.
[37] W. Ma, P. Ching, and Z. Ding, "Semidefinite relaxation based multiuser detection for M-ary PSK multiuser systems," IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 2862–2872, 2004.
[38] A. Ghasemmehdi and E. Agrell, "Faster recursions in sphere decoding,"