Supervised-Learning for Multi-Hop MU-MIMO Communications with One-Bit Transceivers
Daeun Kim, Song-Nam Hong, and Namyoon Lee
Abstract
This paper considers a nonlinear multi-hop multi-user multiple-input multiple-output (MU-MIMO) relay channel, in which multiple users send information symbols to a multi-antenna base station (BS) with one-bit analog-to-digital converters via intermediate relays, each with a one-bit transceiver. To understand the fundamental limit of the detection performance, the optimal maximum-likelihood (ML) detector is derived under the assumption of perfect and global channel state information (CSI) at the BS. This multi-user detector, however, is not practical due to the unrealistic CSI assumption and the overwhelming detection complexity. These limitations are addressed by a novel detection framework inspired by supervised learning. The key idea is to model the complicated multi-hop MU-MIMO channel as a simplified channel with far fewer, learnable parameters. One major finding is that, even with the simplified channel model, near-ML detection performance is achievable with a reasonable amount of pilot overhead under a certain condition. In addition, an online supervised-learning detector is proposed, which adaptively tracks channel variations; the idea is to update the model parameters by treating each reliably detected data symbol as a new training (labelled) example. Lastly, a multi-user detector using a deep neural network is proposed. Unlike the model-based approaches, this model-free approach removes the errors of the simplified channel model at the cost of a higher computational complexity for parameter learning. Via simulations, the detection performances of classical, model-based, and model-free detectors are thoroughly compared to demonstrate the effectiveness of the supervised-learning approaches in this channel.
D. Kim and N. Lee are with the Department of Electrical Engineering, POSTECH, Pohang, Gyeongbuk 37673, South Korea (e-mail: {daeun.kim, nylee}@postech.ac.kr). S.-N. Hong is with the Department of Electrical and Computer Engineering, Ajou University, Suwon, Gyeonggi 16499, South Korea (e-mail: [email protected]).
I. INTRODUCTION
Wireless relaying is an effective solution to expand network coverage and to enhance system reliability [1]. The use of multiple-input multiple-output (MIMO) systems is also a key technology for providing considerable gains in both spectral and energy efficiency [2]. Motivated by these advantages, multi-user MIMO (MU-MIMO) relay networks have been considered a promising cellular network architecture. There have been extensive studies over the past decade to characterize the capacity of MU-MIMO relay networks and to devise effective communication schemes for them [3]–[9]. In [3], [4], the information-theoretic limits of MIMO relay channels were characterized. In [9], analytical expressions for outage probabilities were derived under a general channel fading distribution. The underlying assumption of the aforementioned works, however, is that the relay and the BS, each equipped with multiple antennas, use perfect hardware including infinite-precision analog-to-digital converters (ADCs) and digital-to-analog converters (DACs). When using a large number of antennas at the relay and the BS, the fabrication cost and the power consumption increase significantly. To reduce the power consumption and the cost, the use of cheaper and more energy-efficient hardware components, including low-precision ADCs and DACs, has been considered a promising approach [10]–[17]. Motivated by this approach, this paper focuses on a multi-hop MU-MIMO relay channel, in which information symbols of the users are delivered to a BS with one-bit ADCs via layered and distributed relays, each with a one-bit transceiver. In this channel, it is very challenging to estimate the multi-hop channel accurately and to detect information bits reliably, because the information symbols sent by the users are severely distorted both by the multi-hop relays using one-bit transceivers and by the BS using one-bit ADCs.
In this paper, we present novel supervised-learning approaches to reliably detect information symbols with a reasonable amount of pilots for channel training.
A. Related Works
Despite the benefits of using low-precision ADCs and DACs at the relay and the BS, doing so changes not only the fundamental limits but also the required communication schemes, including channel estimation and data detection. For single-hop communication networks in which the multi-antenna BS employs low-precision ADCs, channel estimation and data detection algorithms have been proposed in [12]–[17]. Recently, asymptotic achievable rates have been characterized for dual-hop MU-MIMO systems when low-precision ADCs are used at either the relay [18], [19] or the BS [20]. The key tool for the analysis of the achievable rates in [18]–[20] is the additive quantization noise model (AQNM), obtained by leveraging Bussgang's decomposition [21]. In [22], channel estimation methods using support vector machines and neural networks were proposed for a one-bit relay cluster. The common limitation of the prior works is that they consider one-bit quantization at either the relay or the BS; therefore, the joint impact of low-resolution ADCs in a two-hop relaying system is still unknown. In addition, the existing works focused on single-hop relay networks; thereby, the effects of channel estimation and data detection when scaling the number of hops also remain unexplored.

There has been increasing research interest in exploiting machine learning tools to address the nonlinearity of a MIMO system with low-resolution ADCs. By treating an end-to-end nonlinear MIMO system with low-resolution ADCs as an autoencoder, a supervised-learning aided communication framework was proposed in [17]. Specifically, it empirically learns the nonlinear channel (i.e., the conditional probability mass functions (PMFs)) by sending pilot symbols (or known data symbols) repeatedly. Leveraging the learned channel, novel empirical ML-like and minimum-center-distance detectors were proposed.
Following this work, a reinforcement-learning aided detector was presented in [23], in which a cyclic redundancy check (CRC) code is used to obtain a new labelled data set that further improves the estimation accuracy of the PMFs. Recently, in [24], a deep-learning detector was also proposed for an orthogonal frequency division multiplexing (OFDM) system using one-bit ADCs, which can address the nonlinear distortion caused by one-bit quantization. Beyond the nonlinearity induced by one-bit ADCs, numerous deep-learning based joint detection and decoding methods were proposed in [25]–[28] for linear/nonlinear channels by treating an end-to-end communication system as an autoencoder. To the best of our knowledge, however, none of the aforementioned machine-learning based channel-training and data detection methods has been considered for a nonlinear multi-hop MU-MIMO relay channel, which involves multiple nonlinear one-bit quantization effects in a cascaded manner.
B. Contributions
In this paper, we focus on a nonlinear multi-hop MU-MIMO relay channel, in which $K$ single-antenna users transmit data symbols to a BS equipped with $N$ antennas with the help of $M-1$ layers of distributed relays. The nonlinearity of this channel comes from the assumption that both the relays and the BS use one-bit ADCs, and the relays also use one-bit DACs for transmissions. The major contributions of this paper are summarized as follows.

Classical communication approach:
Inspired by the classical design of a communication system, we first derive the optimal maximum-likelihood (ML) detector for the nonlinear multi-hop MU-MIMO relay channel, assuming that the BS has global and perfect knowledge of channel state information (CSI). This ML detector provides the fundamental limit of the detection performance in the channel. Toward this, we characterize the end-to-end transition probability distribution of the multi-hop channel as a function of the per-hop signal-to-noise ratios (SNRs) and the CSI. In practice, however, the use of the derived ML detector is impossible, because acquiring global and perfect CSI at the BS is infeasible even with an infinite number of pilots. Moreover, the computational complexity of the ML detector increases exponentially with both the number of hops and the number of relays per hop. Because of these limitations, it is impractical to apply the classical communication approach to the nonlinear multi-hop MIMO channel.
Model-based supervised-learning approach:
To overcome the limitations of the classical approach, we propose a novel communication framework using a model-based supervised-learning approach. Unlike the model-free approach via deep learning in [25]–[28], the proposed framework models the end-to-end multi-hop MU-MIMO channel as $N$ parallel binary symmetric channels (BSCs), which can be characterized by far fewer learnable model parameters than the original channel. The parameters of the effective BSCs include 1) a set of $N$-dimensional binary vectors (i.e., codewords) and 2) the crossover probabilities of the BSCs. During a training phase, the parameters of the effective channel model are jointly trained with a reasonable number of pilots. Subsequently, during the data transmission phase, the BS performs weighted minimum Hamming distance (wMHD) detection to recover the information symbols using the estimated model parameters. We call this an approximate ML (A-ML) detector. To verify the effectiveness of the proposed framework, we show that the A-ML detector achieves near-ML performance even with a reasonable amount of pilots, provided that the SNRs of the first $M-1$ hops are sufficiently high. One major observation is that this model-based approach reduces the number of parameters to learn in the complex nonlinear MIMO channel compared to the classical approach. As a result, for a given pilot overhead, the detection performance of the model-based approach significantly outperforms that of the classical approach.

Model-based online-learning approach:
Despite its attractive performance, the proposed supervised-learning framework lacks flexibility in adapting to environmental changes. For instance, when the channel of a certain hop is time-varying, the model parameters, including the codebook and the crossover probabilities, should be updated accordingly, since they are functions of the multi-hop channels. We further improve the proposed supervised-learning framework so that it is robust to time-varying channel environments. The proposed framework jointly performs the model-parameter update and the data detection, similar to an expectation-maximization (EM) algorithm. Specifically, during a data transmission phase, the BS first assigns a label to the received signal vector by computing its a posteriori probability (APP). This APP information is then exploited to estimate and update the model parameters. Subsequently, using the updated model parameters, the BS performs data detection. Our key finding is that the proposed online-learning approach can outperform the conventional linear detectors using genie-aided (perfect and global) CSI at the BS under some time-varying channel conditions.
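The online update described above can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the APP is computed from BSC likelihoods as in the model-based framework, while the confidence threshold `tau`, the step size `gamma`, and the moving-average update rule are assumptions introduced here only for concreteness.

```python
import numpy as np

def online_update(y, C_hat, p_hat, prior=None, tau=0.99, gamma=0.05, eps=1e-12):
    """One online-learning step (sketch):
    1) compute the APP of each candidate codeword for the received vector y,
    2) if the best APP exceeds the (assumed) threshold tau, treat y as a new
       labelled example and refresh that codeword's crossover probabilities
       with an (assumed) exponential moving average of step gamma."""
    N, num_cw = C_hat.shape
    p = np.clip(p_hat, eps, 1.0 - eps)
    mismatch = (C_hat != y[:, None])                       # N x num_cw flip indicators
    loglik = np.sum(np.where(mismatch, np.log(p), np.log1p(-p)), axis=0)
    if prior is None:
        prior = np.full(num_cw, 1.0 / num_cw)
    logpost = loglik + np.log(prior)
    post = np.exp(logpost - logpost.max())
    post /= post.sum()                                     # a posteriori probabilities
    i_hat = int(np.argmax(post))
    if post[i_hat] > tau:                                  # reliable detection only
        flip = 0.5 * np.abs(C_hat[:, i_hat] - y)           # 0/1 per-antenna flip
        p_hat[:, i_hat] = (1 - gamma) * p_hat[:, i_hat] + gamma * flip
    return i_hat, post

# Illustrative parameters (assumed): 4 binary outputs, two candidate codewords.
C_hat = np.array([[1., -1.], [1., 1.], [-1., 1.], [1., -1.]])
p_hat = np.full((4, 2), 0.05)
i_hat, post = online_update(C_hat[:, 0].copy(), C_hat, p_hat)
```

Detecting a reliable symbol thus pulls the corresponding crossover probabilities toward the most recent flip pattern, which is what lets the detector track slow channel variations.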
Model-free deep learning approach:
The proposed model-based supervised-learning approaches are very effective for training and tracking the channel model parameters. This is because the number of model parameters for channel training is much smaller than that of the original channel model. In addition, the parameters of the proposed model can be accurately estimated with a simpler training strategy than those of the original model. These approaches, however, are fundamentally limited when the model contains a modeling error. Since
the effective BSC channel model is a good approximation of the original channel model when the first $M-1$ hop SNRs are high enough, the model-based learning approaches can degrade the performance when the $M-1$ hop SNRs are low. To resolve this model-error problem, we also propose a multi-user detector via a deep neural network (DNN) using a model-free approach. To be specific, we construct a DNN comprised of multiple layers, and optimize the parameters of the DNN by sending a few repetitions of data symbols as pilots. One noticeable observation is that this approach can improve the detection performance by eliminating the model error in some cases. Nevertheless, this approach is not suitable for scenarios where the channel changes relatively fast, because the computational complexity of training the DNN parameters is very high compared to the model-based approaches.

Fig. 1: An illustration of the nonlinear $M$-hop MU-MIMO relay channel.

II. SYSTEM MODEL
In this section, we consider a nonlinear $M$-hop MU-MIMO relay channel. As illustrated in Fig. 1, $K$ users send information symbols to a BS with the aid of $M-1$ layers of distributed relays, each with a one-bit transceiver. We assume that the BS is equipped with $N$ antennas, each with one-bit ADCs.

First-hop transmission:
Let $\tilde{x}_k$ be the information symbol of the $k$th user, chosen from a constellation set $\mathcal{M}_t$. In addition, let $\tilde{\mathbf{x}} = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_K]^{\top} \in \mathcal{M}_t^{K \times 1}$ denote the aggregated data symbol vector sent by all $K$ users. We denote the complex channel from the $k$th uplink user to the $\ell$th relay at the first hop by $\tilde{h}_{1,\ell,k}$. Then, the received signal of the $\ell$th relay with one-bit ADCs at the first hop is
$$\tilde{r}_{1,\ell} = \operatorname{sign}\Bigg(\sum_{k=1}^{K} \tilde{h}_{1,\ell,k}\,\tilde{x}_k + \tilde{v}_{1,\ell}\Bigg), \tag{1}$$
where $\tilde{v}_{1,\ell}$ is a circularly-symmetric complex Gaussian random variable with zero mean and variance $\sigma_1^2$, i.e., $\tilde{v}_{1,\ell} \sim \mathcal{CN}(0, \sigma_1^2)$, and $\operatorname{sign}(\cdot): \mathbb{R} \rightarrow \{-1, +1\}$ denotes the one-bit quantization function, which is applied independently to the real and imaginary components.

Relay operation:
Since each relay is assumed to be equipped with one-bit DACs, it transmits a quadrature phase shift keying (QPSK) symbol for the second-hop transmission. Let $\tilde{s}_{m,\ell} \in \mathcal{M}_r$ be the transmit signal of the $\ell$th relay at the $m$th hop, where $\mathcal{M}_r = \frac{1}{\sqrt{2}}\{1+j,\, 1-j,\, -1+j,\, -1-j\}$ with $|\mathcal{M}_r| = 4$. In particular, the relay transmission symbol is constructed as
$$\tilde{s}_{m,\ell} = f_r(\tilde{r}_{m,\ell}) \in \mathcal{M}_r, \tag{2}$$
where $f_r(\cdot): \mathcal{M}_r \rightarrow \mathcal{M}_r$ denotes a relay operation function that uniquely maps a received signal of the relay to its transmit signal. For simplicity, we assume a relay operation function that simply forwards the binary received signal to the next hop, i.e., $f_r(\tilde{r}_{m,\ell}) = \tilde{r}_{m,\ell}$.

Multi-hop transmission:
We denote the number of relays in the $m$th layer by $L_m$ for $m \in \{1, 2, \ldots, M-1\}$. We also denote the channel from the $\ell$th relay transmission of the $m$th hop by $\tilde{\mathbf{h}}_{m,\ell} \in \mathbb{C}^{L_m \times 1}$. Then, the received signal of the relays with one-bit ADCs at the $m$th hop is
$$\tilde{\mathbf{r}}_m = \operatorname{sign}\Bigg(\sum_{\ell=1}^{L_{m-1}} \tilde{\mathbf{h}}_{m,\ell}\,\tilde{r}_{m-1,\ell} + \tilde{\mathbf{v}}_m\Bigg), \tag{3}$$
where $\tilde{\mathbf{v}}_m = [\tilde{v}_{m,1}, \ldots, \tilde{v}_{m,L_m}]^{\top} \in \mathbb{C}^{L_m \times 1}$ is the noise vector at the $m$th hop. The elements of $\tilde{\mathbf{v}}_m$ are independent and identically distributed (IID) complex Gaussian random variables, i.e., $\tilde{v}_{m,\ell} \sim \mathcal{CN}(0, \sigma_m^2)$. Considering the $M$-hop relaying system, the received signal at the BS is given by
$$\tilde{\mathbf{y}} = \operatorname{sign}\Bigg(\sum_{\ell=1}^{L_{M-1}} \tilde{\mathbf{h}}_{M,\ell}\,\tilde{r}_{M-1,\ell} + \tilde{\mathbf{v}}_M\Bigg). \tag{4}$$
Let $\tilde{\mathbf{H}}_m = [\tilde{\mathbf{h}}_{m,1}, \ldots, \tilde{\mathbf{h}}_{m,L_{m-1}}] \in \mathbb{C}^{L_m \times L_{m-1}}$ be the channel matrix of the $m$th hop. Then, the received signal of the BS in (4) can be written in matrix form as
$$\tilde{\mathbf{y}} = \operatorname{sign}\big(\tilde{\mathbf{H}}_M \tilde{\mathbf{r}}_{M-1} + \tilde{\mathbf{v}}_M\big). \tag{5}$$
For notational simplicity, we rewrite the input-output relationship in (5) in a real-valued representation as
$$\mathbf{y} = \operatorname{sign}\left(\mathbf{H}_M \mathbf{r}_{M-1} + \mathbf{v}_M\right), \tag{6}$$
where $\mathbf{y} = [\operatorname{Re}(\tilde{\mathbf{y}})^{\top}, \operatorname{Im}(\tilde{\mathbf{y}})^{\top}]^{\top}$, $\mathbf{r}_{M-1} = [\operatorname{Re}(\tilde{\mathbf{r}}_{M-1})^{\top}, \operatorname{Im}(\tilde{\mathbf{r}}_{M-1})^{\top}]^{\top}$, $\mathbf{v}_M = [\operatorname{Re}(\tilde{\mathbf{v}}_M)^{\top}, \operatorname{Im}(\tilde{\mathbf{v}}_M)^{\top}]^{\top}$, and
$$\mathbf{H}_M = \begin{bmatrix} \operatorname{Re}(\tilde{\mathbf{H}}_M) & -\operatorname{Im}(\tilde{\mathbf{H}}_M) \\ \operatorname{Im}(\tilde{\mathbf{H}}_M) & \operatorname{Re}(\tilde{\mathbf{H}}_M) \end{bmatrix}.$$
This real-valued representation applies to each hop in the same manner and will be used in the sequel.

III. ML DETECTION USING CLASSICAL APPROACH
In this section, from a classical communication system design point of view, we propose an ML detector when the global CSI is perfectly known to the BS. Although this CSI assumption is unrealistic, the ML detector provides the fundamental limit of the detection performance in this network.

To derive the ML detector, we need to characterize the effective channel transition probabilities for a given channel input vector. To accomplish this, we define $M+1$ channel input and output sets in the relay network. Let $\mathcal{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{|\mathcal{M}_t|^K}\}$ denote the channel input set containing all possible transmitted vectors of the $K$ users, i.e., $\mathbf{x} \in \mathcal{X}$ and $|\mathcal{X}| = |\mathcal{M}_t|^K$, where $|\mathcal{M}_t|$ is the constellation size. We also define the channel output sets of the $L_m$ distributed relays with one-bit ADCs by $\mathcal{R}_m = \{\mathbf{r}_{m,1}, \mathbf{r}_{m,2}, \ldots, \mathbf{r}_{m,2^{L_m}}\}$ for $m \in \{1, 2, \ldots, M-1\}$. Similarly, the channel output set of the BS is defined by $\mathcal{Y} = \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_{2^N}\}$. Using these sets, we compute the channel transition probabilities of the $M$ hops. We first consider the first-hop channel transition probability. The probability that the received signal vector of the $L_1$ relays is $\mathbf{r}_{1,u}$ when the $K$ uplink users transmit $\mathbf{x}_i$ is computed as
$$\mathbb{P}[\mathbf{r}_1 = \mathbf{r}_{1,u} \mid \mathbf{x} = \mathbf{x}_i] = \prod_{\ell=1}^{L_1} \mathbb{P}[r_{1,\ell} = r_{1,u,\ell} \mid \mathbf{x} = \mathbf{x}_i] = \prod_{\ell \in \mathcal{Z}_u^+} \mathbb{P}\big[\mathbf{h}_{1,\ell}^{\top}\mathbf{x}_i + v_{1,\ell} > 0\big] \prod_{\ell \in \mathcal{Z}_u^-} \mathbb{P}\big[\mathbf{h}_{1,\ell}^{\top}\mathbf{x}_i + v_{1,\ell} < 0\big] = \prod_{\ell=1}^{L_1} Q\left(\frac{-r_{1,u,\ell}\,\mathbf{h}_{1,\ell}^{\top}\mathbf{x}_i}{\sigma_1/\sqrt{2}}\right), \tag{7}$$
where $\mathcal{Z}_u^+ = \{\ell \mid r_{1,u,\ell} = 1\}$ and $\mathcal{Z}_u^- = \{\ell \mid r_{1,u,\ell} = -1\}$ are the sets indicating the sign of the $\ell$th element of $\mathbf{r}_{1,u}$. Here, $Q(x) = \int_x^{\infty} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,\mathrm{d}u$ is the standard Q-function. For the $m$th-hop channel with $m \in \{2, \ldots, M-1\}$, the probability that the received signal vector of the relays at the $m$th hop is $\mathbf{r}_{m,w}$ when the previous relays transmit $\mathbf{r}_{m-1,v}$ is computed as
$$\mathbb{P}[\mathbf{r}_m = \mathbf{r}_{m,w} \mid \mathbf{r}_{m-1} = \mathbf{r}_{m-1,v}] = \prod_{j=1}^{L_m} \mathbb{P}[r_{m,j} = r_{m,w,j} \mid \mathbf{r}_{m-1} = \mathbf{r}_{m-1,v}] = \prod_{j=1}^{L_m} Q\left(\frac{-r_{m,w,j}\,\mathbf{h}_{m,j}^{\top}\mathbf{r}_{m-1,v}}{\sigma_m/\sqrt{2}}\right). \tag{8}$$
For the $M$th-hop channel, the probability that the received signal vector of the BS is $\mathbf{y}_j$ when the relays transmit $\mathbf{r}_{M-1,u}$ is computed as
$$\mathbb{P}[\mathbf{y} = \mathbf{y}_j \mid \mathbf{r}_{M-1} = \mathbf{r}_{M-1,u}] = \prod_{n=1}^{N} \mathbb{P}[y_n = y_{j,n} \mid \mathbf{r}_{M-1} = \mathbf{r}_{M-1,u}] = \prod_{n=1}^{N} Q\left(\frac{-y_{j,n}\,\mathbf{h}_{M,n}^{\top}\mathbf{r}_{M-1,u}}{\sigma_M/\sqrt{2}}\right), \tag{9}$$
where $\mathcal{S}_j^+ = \{n \mid y_{j,n} = 1\}$ and $\mathcal{S}_j^- = \{n \mid y_{j,n} = -1\}$ are the corresponding index sets. From (7) to (9), the probability that the received signal of the BS is $\mathbf{y}_j$ when the uplink transmission vector is $\mathbf{x}_i$ is computed in a sum-product form of the standard Q-function, namely,
$$\mathbb{P}[\mathbf{y} = \mathbf{y}_j \mid \mathbf{x} = \mathbf{x}_i] = \sum_{\mathbf{r}_{M-1,u} \in \mathcal{R}_{M-1}} \cdots \sum_{\mathbf{r}_{1,u} \in \mathcal{R}_1} \mathbb{P}[\mathbf{y} = \mathbf{y}_j \mid \mathbf{r}_{M-1} = \mathbf{r}_{M-1,u}] \times \cdots \times \mathbb{P}[\mathbf{r}_1 = \mathbf{r}_{1,u} \mid \mathbf{x} = \mathbf{x}_i]. \tag{10}$$
From the effective channel transition probabilities between $\mathbf{x}_i \in \mathcal{X}$ and $\mathbf{y}_j \in \mathcal{Y}$, the optimal ML detector is
$$g_{\mathrm{ML}}(\mathbf{y}_j) = \arg\max_{\mathbf{x}_i \in \mathcal{X}} \mathbb{P}[\mathbf{y} = \mathbf{y}_j \mid \mathbf{x} = \mathbf{x}_i]. \tag{11}$$
We next give some remarks on the ML detector in (11).
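As a numerical sanity check of (7)–(11), the end-to-end transition probability and the brute-force ML detector can be sketched for a two-hop toy example. The dimensions, channel draws, and noise levels below are illustrative assumptions (real-valued model), and the explicit enumeration over all relay output vectors is exactly the marginalization in (10) that makes the detector intractable at scale.

```python
import numpy as np
from itertools import product
from math import erfc, sqrt

def Q(x):
    """Standard Q-function: Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * erfc(x / sqrt(2.0))

def hop_prob(out, H, inp, sigma):
    """Per-hop transition probability, cf. (7)-(9):
    P[out | inp] = prod_j Q(-out_j * h_j^T inp / (sigma / sqrt(2)))."""
    z = H @ inp
    return float(np.prod([Q(-o * zj / (sigma / sqrt(2.0))) for o, zj in zip(out, z)]))

def ml_detect(y, X, H1, H2, sigma1, sigma2):
    """Brute-force ML detector (11) for a two-hop toy example, marginalizing
    over all 2^L relay output vectors as in (10)."""
    L = H1.shape[0]
    relay_set = [np.array(r) for r in product([-1.0, 1.0], repeat=L)]
    liks = [sum(hop_prob(r, H1, x, sigma1) * hop_prob(y, H2, r, sigma2)
                for r in relay_set) for x in X]
    return X[int(np.argmax(liks))]

rng = np.random.default_rng(0)
K, L, N = 2, 3, 4                      # toy sizes (assumed)
H1 = rng.standard_normal((L, K))
H2 = rng.standard_normal((N, L))
X = [np.array(b) for b in product([-1.0, 1.0], repeat=K)]
x_true = X[2]
# Noise-free forward pass through the one-bit cascade, cf. (12) later.
y = np.where(H2 @ np.where(H1 @ x_true >= 0, 1.0, -1.0) >= 0, 1.0, -1.0)
x_hat = ml_detect(y, X, H1, H2, 0.2, 0.2)
```

Even in this tiny example, each likelihood evaluation already sums over $2^L$ relay states, which illustrates the exponential complexity noted in Remark 2.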
Remark 1 (Need for global and perfect CSI):
As derived in (11), to perform the optimal ML detection, the BS needs to know 1) global and perfect CSI of the network, i.e., $\{\mathbf{H}_m\}_{m=1}^{M}$, and 2) all possible realizations of the received signals at the relays, $\mathbf{r}_m \in \mathcal{R}_m$. Specifically, the number of parameters (unknowns) for the channel estimation increases with the number of hops as $\mathcal{O}(2KNML)$, where $L_m = L$ for all $m$. Since the relays and the BS are equipped with one-bit ADCs, it is infeasible to obtain accurate CSI from conventional pilot transmission methods.

Remark 2 (Computational complexity):
For the single-hop multi-user MIMO system with one-bit ADCs, it was shown in [12] that the ML detection problem can be solved in a computationally efficient manner by convex relaxation techniques, owing to the logarithmic concavity of the likelihood function. In contrast, such convex optimization algorithms cannot be applied in the multi-hop relaying system, because the likelihood function in (10) is neither concave nor logarithmically concave. The computational complexity of the detection increases exponentially with the number of uplink users, the number of relays per layer, and the number of hops, i.e., $\mathcal{O}\big(|\mathcal{M}_t|^K\, 2^{\sum_{m=1}^{M-1} L_m}\big)$. This computational complexity hinders the use of the ML detector in practice.

IV. MODEL-BASED SUPERVISED-LEARNING APPROACH
In this section, we propose a novel communication framework for the nonlinear multi-hop MU-MIMO relay channel by harnessing an end-to-end supervised-learning technique. We first present a simple model that can be a good approximation of the complicated nonlinear multi-hop MU-MIMO relay channel, by exploiting a coding-theoretic framework developed in our prior works [14], [17]. Then, we explain how to learn the model parameters using a simple training strategy and how to detect the data symbols using the trained model parameters. In addition, we prove that the proposed channel training and data detection framework can achieve the optimal ML detection performance under certain conditions.

A. The Proposed End-to-End Network Model
The data detection problem for the nonlinear multi-hop MU-MIMO relay channel can be reformulated as a decoding problem of channel-dependent nonlinear codes from a coding-theoretic perspective [14]. To explain this method, we first introduce the notions of the codebook construction and the BSC model for the corresponding decoding problem.
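Before formalizing these notions, a small numerical sketch may help: the channel-dependent codewords are obtained by propagating each input through the noise-free cascade of one-bit hops, cf. (6). The sizes and channel draws below are illustrative assumptions.

```python
import numpy as np
from itertools import product

def sign(z):
    """One-bit quantizer mapping reals to {-1, +1} (ties broken to +1)."""
    return np.where(z >= 0, 1.0, -1.0)

def encode(x, H_list):
    """Noise-free codeword of the one-bit cascade:
    c = sign(H_M sign(H_{M-1} ... sign(H_1 x)))."""
    c = np.asarray(x, dtype=float)
    for H in H_list:
        c = sign(H @ c)
    return c

rng = np.random.default_rng(1)
K, L, N = 2, 4, 6                      # illustrative sizes (assumed)
H_list = [rng.standard_normal((L, K)), rng.standard_normal((N, L))]
inputs = [np.array(b) for b in product([-1.0, 1.0], repeat=K)]
codebook = [encode(x, H_list) for x in inputs]
distinct = {tuple(c) for c in codebook}
# Distinct inputs may collapse onto the same codeword, so |C| <= |X|.
```

Note that the resulting "code" is fully determined by the channel realizations, which is why the codebook must be learned rather than designed.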
Codebook construction:
Let us define a codeword vector $\mathbf{c}_i$, which is generated by an encoding function $f(\mathbf{x}_i, \{\mathbf{H}_m\}_{m=1}^{M})$ that maps an information vector $\mathbf{x}_i \in \{-1, +1\}^K$ into the $N$-dimensional binary space $\{-1, +1\}^N$. In particular, the encoding function is given by
$$\mathbf{c}_i = \operatorname{sign}\big(\mathbf{H}_M \operatorname{sign}\big(\mathbf{H}_{M-1} \cdots \operatorname{sign}(\mathbf{H}_1 \mathbf{x}_i)\big)\big). \tag{12}$$
As can be seen from (6), this codeword vector is the noise-free binary representation of the received signal when the uplink users transmit $\mathbf{x}_i$. We also define a codebook as the collection of all possible codeword vectors, $\mathcal{C} = \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_{|\mathcal{M}_t|^K}\}$. The cardinality of $\mathcal{C}$ is less than or equal to that of $\mathcal{X}$, i.e., $|\mathcal{C}| \leq |\mathcal{X}|$. This is because, for a certain realization of $\{\mathbf{H}_m\}$, it is possible that $\mathbf{c}_i = \mathbf{c}_j$ for two distinct information vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ with $i \neq j$. In addition, since $K$ binary information bits are sent over $N$ channel uses, the code rate of this encoding function is $r = K/N$, which is typically less than 1/2 in a massive MIMO setting. Furthermore, the code is not linear, i.e., a linear combination of two codewords is not necessarily in $\mathcal{C}$.

Channel parameters: When a codeword $\mathbf{c}_i \in \{-1, +1\}^N$ is generated by the encoding function in (12), it is sent over noisy channels, and the BS receives $\mathbf{y} \in \{-1, +1\}^N$. Since the noise signals over different BS antennas are independent, the channel between $\mathbf{c}_i$ and $\mathbf{y}$ can be modeled by $N$ parallel BSCs, each with a different crossover probability, i.e.,
$$\mathbb{P}(\mathbf{y} \mid \mathbf{c}_i) = \prod_{n=1}^{N} \mathbb{P}(y_n \mid c_{i,n}), \tag{13}$$
where
$$\mathbb{P}(y_n \mid c_{i,n}) = \begin{cases} p_{i,n}, & y_n \neq c_{i,n}, \\ 1 - p_{i,n}, & y_n = c_{i,n}. \end{cases} \tag{14}$$
Notice that the effective channel crossover probabilities $\{p_{i,n}\}_{n=1}^{N}$ vary over both the channel uses and the channel input. This is a major difference from the classical coding problem setting.

Remark 3 (Our modeling and limitation):
We converted the nonlinear multi-hop MU-MIMO relay channel into a simple channel model comprised of $N$ parallel BSCs. This simplified model is parameterized by the codebook $\mathcal{C}$ and the set of transition probabilities $\{p_{i,n}\}$. Therefore, during the training phase, these parameters should be trained to fit our model to the labeled training data set, i.e., a sequence of pilot symbols. Although our parallel-BSC model is simple, it cannot capture the propagation effects of the correlated noise signals in the multi-hop channels. Nevertheless, it turns out that our simple model can be optimal in the sense of minimizing the detection error when the first $M-1$ hop SNRs are sufficiently high. This optimality result will be provided in Section IV-C.

B. Parameter Learning and Detection Algorithm
Each transmission frame containing $T_B$ time slots consists of two phases: 1) a channel training phase with $T_t$ time slots and 2) a data transmission phase with $T_d$ time slots, i.e., $T_B = T_t + T_d$.

Acquisition of training examples:
Let $T$ be the number of training repetitions for each $\mathbf{x}_i \in \mathcal{X}$, where $|\mathcal{X}| = |\mathcal{M}_t|^K$. During the channel training phase, the $K$ uplink users repeatedly send each information vector $\mathbf{x}_i \in \mathcal{X}$, i.e., $\mathbf{x}[t] = \mathbf{x}_i$ for $t \in \mathcal{T}_i = \{t : (i-1)T + 1 \leq t \leq iT\}$. As a result, a total of $T|\mathcal{M}_t|^K$ time slots is required for the training phase, i.e., $T_t = T|\mathcal{M}_t|^K$. Let $\mathbf{X}_t = [\mathbf{x}[1], \mathbf{x}[2], \ldots, \mathbf{x}[T_t]] \in \{-1, +1\}^{K \times T_t}$ and $\mathbf{Y}_t = [\mathbf{y}[1], \mathbf{y}[2], \ldots, \mathbf{y}[T_t]] \in \{-1, +1\}^{N \times T_t}$ collect the transmitted and received signal vectors during the training phase. We define $\mathcal{S} = \{(\mathbf{x}[1], \mathbf{y}[1]), \ldots, (\mathbf{x}[T_t], \mathbf{y}[T_t])\}$ as the set of labeled training examples.

Codebook learning:
Using $\mathcal{S}$, the BS first estimates the codebook $\hat{\mathcal{C}} = \{\hat{\mathbf{c}}_1, \ldots, \hat{\mathbf{c}}_{|\mathcal{M}_t|^K}\}$. To accomplish this, during the training phase, the BS computes the centroid vector associated with $\mathbf{x}_i$ from the received samples $\{\mathbf{y}[(i-1)T+1], \mathbf{y}[(i-1)T+2], \ldots, \mathbf{y}[iT]\}$. Taking the sample average gives
$$\bar{\mathbf{y}}_i = \frac{1}{T} \sum_{t=(i-1)T+1}^{iT} \mathbf{y}[t]. \tag{15}$$
Since the received signal vectors $\{\mathbf{y}[(i-1)T+1], \ldots, \mathbf{y}[iT]\}$ are IID, each with a finite mean, this centroid vector (the sample average) almost surely converges to the true mean $\mathbb{E}[\mathbf{y}[t] \mid \mathbf{x}_i] = \sum_{\mathbf{y}_j \in \mathcal{Y}} \mathbf{y}_j\, \mathbb{P}[\mathbf{y}[t] = \mathbf{y}_j \mid \mathbf{x} = \mathbf{x}_i]$ as $T \rightarrow \infty$, by the strong law of large numbers [32, Theorem 4.3.1]. Using $\bar{\mathbf{y}}_i$, the $i$th codeword is estimated as
$$\hat{\mathbf{c}}_i = \operatorname{sign}(\bar{\mathbf{y}}_i), \tag{16}$$
for $i \in \{1, 2, \ldots, |\mathcal{M}_t|^K\}$.

Channel parameter learning:
Once the codebook is constructed, we need to estimate the crossover probabilities $\hat{p}_{i,n}$ for $i \in \{1, 2, \ldots, |\mathcal{M}_t|^K\}$ and $n \in \{1, 2, \ldots, N\}$ using both $\mathbf{Y}_t$ and $\hat{\mathcal{C}} = \{\hat{\mathbf{c}}_1, \hat{\mathbf{c}}_2, \ldots, \hat{\mathbf{c}}_{|\mathcal{M}_t|^K}\}$. These transition probabilities are empirically estimated as
$$\hat{p}_{i,n} = \frac{1}{2T} \sum_{t=(i-1)T+1}^{iT} \big| \hat{c}_{i,n} - y_n[t] \big|, \tag{17}$$
for $n \in \{1, \ldots, N\}$. Similarly, these estimated channel transition probabilities converge in probability to the true ones when the number of training samples is sufficiently large. In practical communication systems, however, the number of training samples should be kept small to preserve transmission efficiency; thereby, the estimated transition probabilities and the codebook can be erroneous when using a limited number of pilots, i.e., training samples.

Detection:
Under our model assumption and the estimated model parameters, we explain the approximate-ML (A-ML) detector via weighted minimum Hamming distance (wMHD) decoding. The A-ML detector differs from the optimal ML detector derived in (11), because it relies on our simplified $2N$-parallel BSC model to reduce the computational complexity. When $\hat{\mathbf{c}}_i \in \hat{\mathcal{C}}$ and $\hat{p}_{i,n}$ are obtained during the training phase, the log-likelihood function is given by
$$\ln \mathbb{P}(\mathbf{y}[t] \mid \hat{\mathbf{c}}_i) = \sum_{n=1}^{N} \ln \mathbb{P}(y_n[t] \mid \hat{c}_{i,n}) = \sum_{n=1}^{N} \big( \ln \hat{p}_{i,n}\, \mathbb{1}\{\hat{c}_{i,n} \neq y_n[t]\} + \ln(1 - \hat{p}_{i,n})\, \mathbb{1}\{\hat{c}_{i,n} = y_n[t]\} \big). \tag{18}$$
Therefore, the A-ML detector is equivalent to wMHD decoding:
$$\hat{\mathbf{x}}_i^{\star} = \arg\max_{\hat{\mathbf{c}}_i \in \hat{\mathcal{C}}} \ln \mathbb{P}(\mathbf{y}[t] \mid \hat{\mathbf{c}}_i) = \arg\min_{i \in \{1, \ldots, |\mathcal{M}_t|^K\}} d_{\mathrm{wH}}\big(\hat{\mathbf{c}}_i, \mathbf{y}[t]; \{\hat{w}_{i,n}\}_{n=1}^{N}, \{\hat{w}_{i,n}^c\}_{n=1}^{N}\big), \tag{19}$$
where the last equality follows from the definition of the weighted Hamming distance with the weights $\hat{w}_{i,n} = -\ln(\hat{p}_{i,n})$ and $\hat{w}_{i,n}^c = -\ln(1 - \hat{p}_{i,n})$ in [14]. The proposed supervised-learning communication framework is summarized in Algorithm 1.

Algorithm 1
The proposed end-to-end supervised-learning framework.
for $t = 1, \ldots, T_t$ do {Training for parameter learning}
  Centroid update: $\bar{\mathbf{y}}_i = \frac{1}{T} \sum_{t=(i-1)T+1}^{iT} \mathbf{y}[t]$.
end for
Compute the codeword vectors $\hat{\mathbf{c}}_i = \operatorname{sign}(\bar{\mathbf{y}}_i)$ for $i \in \{1, 2, \ldots, |\mathcal{M}_t|^K\}$.
Compute the transition probabilities $\hat{p}_{i,n} = \frac{1}{2T} \sum_{t=(i-1)T+1}^{iT} |\hat{c}_{i,n} - y_n[t]|$.
for $t = T_t + 1, \ldots, T_B$ do {Data detection with the trained parameters}
  Perform the A-ML detection: $\hat{\mathbf{x}}_i^{\star} = \arg\min_i d_{\mathrm{wH}}\big(\hat{\mathbf{c}}_i, \mathbf{y}[t]; \{\hat{w}_{i,n}\}_{n=1}^{N}, \{\hat{w}_{i,n}^c\}_{n=1}^{N}\big)$ with $\hat{w}_{i,n} = -\ln(\hat{p}_{i,n})$ and $\hat{w}_{i,n}^c = -\ln(1 - \hat{p}_{i,n})$.
end for

Remark 4 (Model parameter reduction):
The most advantageous feature of the proposed framework is the huge reduction in the number of model parameters needed to perform data detection. To make this claim quantitative, we compare the number of parameters required for data detection in the two channel models. In the original channel, the total number of (real-valued) model parameters needed for the classical ML detector comprises the multi-hop channel entries and the per-hop SNRs, and scales as $\mathcal{O}\big(KL + (M-2)L^2 + LN + M\big)$, where $L_m = L$ for all $m$. As can be seen, the number of model parameters scales linearly with $K$, $N$, and $M$, while it increases quadratically with $L$. In contrast, the proposed model for supervised learning requires far fewer model parameters. Since it only considers the transition probabilities for each channel input, a total of $|\mathcal{M}_t|^K N$ model parameters is required to perform the A-ML detection. Although the number of model parameters scales exponentially with $K$, it does not scale with the number of hops $M$ or the number of distributed relays per layer $L$. Therefore, when the number of co-scheduled uplink users $K$ is small, the proposed model reduces the number of model parameters significantly.

Remark 5 (Detection complexity reduction):
The proposed A-ML detector does not require perfect and global CSI knowledge at the BS. Instead, the empirically estimated codebook $\hat{\mathcal{C}}$ and the channel weights $\{\hat{w}_{i,n}\}_{n=1}^{N}$ suffice, and they can be estimated accurately with the simple repetition training strategy, even using a reasonable amount of pilots. In addition, the computational complexity of this method is of order $\mathcal{O}(|\mathcal{M}_t|^K)$, which does not scale with the number of relays or hops. This is a huge complexity reduction compared to the original ML detector in Section III. The required training length, however, increases exponentially with the number of uplink users, as in the single-hop multi-user MIMO system with one-bit ADCs. Therefore, for implementation, the number of co-scheduled uplink users should be chosen to meet the pilot-overhead constraint, or semi-supervised-learning and reinforcement-learning methods can be used as in [29] and [30]. In addition, the complexity of the A-ML detector can be further reduced using the one-bit sphere decoding of [15].

C. Optimality of the Proposed Model
The proposed end-to-end network model simplifies the classical multi-hop channel model by significantly reducing the number of model parameters. This simplification can cause model errors in general. In this subsection, we show that the proposed end-to-end network model can still be optimal in terms of the detection performance in a certain scenario.
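Before turning to the optimality result, it may help to see how little machinery the simplified model needs at detection time: given a learned codebook and per-antenna transition probabilities, A-ML detection is a weighted minimum-Hamming-distance search over the $|\mathcal{M}_t|^K$ codewords. The following is a minimal NumPy sketch with illustrative dimensions and hypothetical parameter values, not the paper's simulation setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: |M_t|^K candidate symbol vectors, N one-bit BS outputs,
# e.g. |M_t| = 4 (QPSK) with K = 2 users.
num_codewords, N = 16, 8

# Hypothetical learned parameters: a distinct codeword c_i in {-1,+1}^N
# (built here from the binary expansion of i so rows are distinct) and a
# transition probability p_{i,n} per (codeword, antenna) pair.
C = 1 - 2 * ((np.arange(num_codewords)[:, None] >> np.arange(N)) & 1)
P = rng.uniform(0.05, 0.1, size=(num_codewords, N))

def aml_detect(y, C, P):
    """A-ML detection: pick the codeword index minimizing the weighted
    Hamming distance
    d_wH = sum_n [-ln p_{i,n}] 1{c_{i,n} != y_n} + [-ln(1-p_{i,n})] 1{c_{i,n} == y_n}."""
    flips = C != y[None, :]                               # 1{c_{i,n} != y_n}
    d = np.where(flips, -np.log(P), -np.log1p(-P)).sum(axis=1)
    return int(np.argmin(d))

# A noiseless observation of codeword 3 is recovered exactly, and the
# decision survives a single flipped antenna output.
y = C[3].astype(float)
assert aml_detect(y, C, P) == 3
y[7] = -y[7]
assert aml_detect(y, C, P) == 3
```

The search touches only the $|\mathcal{M}_t|^K \times N$ learned parameters; no channel matrix per hop appears anywhere in the decision rule.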
Theorem 1:
The proposed parameter learning and detection method is optimal in the sense of minimizing the detection error when i) $\sigma_1 = \cdots = \sigma_{M-1} \rightarrow 0$ and ii) the training length per channel input, $T$, is large enough.

To prove Theorem 1, we need the following lemma, which elucidates that the proposed simple training method guarantees the optimality of the parameter learning, provided that the number of training samples is sufficiently large and the SNRs of the first $M-1$ hops are infinite.

Lemma 1:
Let $\frac{1}{T}\sum_{t=(i-1)T+1}^{iT}\mathbf{y}[t]$ be the received sample average when $\mathbf{x}_i \in \mathcal{X}$ was sent during the training phase. Under the infinite SNR assumption for the first $M-1$ hops, i.e., $\sigma_i = 0$ for $i = 1, 2, \ldots, M-1$, the sign of the sample average of the received vectors almost surely converges to the codeword vector $\mathbf{c}_i = \mathrm{sign}(\mathbf{H}_M \cdots \mathrm{sign}(\mathbf{H}_1\mathbf{x}_i))$ as $T \rightarrow \infty$, i.e.,
$$\mathrm{sign}\!\left(\lim_{T\rightarrow\infty}\frac{1}{T}\sum_{t=(i-1)T+1}^{iT}\mathbf{y}[t]\right) = \mathbf{c}_i. \quad (20)$$
In addition, the empirical transition probability converges to the true one as $T \rightarrow \infty$, namely,
$$\lim_{T\rightarrow\infty}\frac{1}{2T}\sum_{t=(i-1)T+1}^{iT}\left\| c_{i,n} - y_n[t] \right\| = Q\!\left(\frac{|g_{i,n}|}{\sigma_M/\sqrt{2}}\right), \quad (21)$$
where $g_{i,n}$ is the $n$th element of $\mathbf{g}_i = \mathbf{H}_M \cdots \mathrm{sign}(\mathbf{H}_1\mathbf{x}_i) \in \mathbb{R}^N$.

Proof.
Under the premise that $\sigma_i = 0$ for $i = 1, 2, \ldots, M-1$, the received signal at time slot $t$ can be rewritten as
$$\mathbf{y}[t] = \mathrm{sign}(\mathbf{H}_M \cdots \mathrm{sign}(\mathbf{H}_1\mathbf{x}_i) + \mathbf{v}_M[t]) = \mathrm{sign}(\mathbf{g}_i + \mathbf{v}_M[t]). \quad (22)$$
Since $\mathbf{v}_M[t]$ is IID over $t$, by the law of large numbers, the sample average converges to its mean, namely,
$$\lim_{T\rightarrow\infty}\frac{1}{T}\sum_{t=(i-1)T+1}^{iT}\mathbf{y}[t] = \mathbb{E}[\mathbf{y}[t]\,|\,\mathbf{g}_i]. \quad (23)$$
Let $\bar{\mathbf{y}}_i = \mathbb{E}[\mathbf{y}[t]\,|\,\mathbf{g}_i]$. Then, the $n$th component of $\bar{\mathbf{y}}_i$ is computed as
$$\bar{y}_{i,n} = P[\mathrm{sign}(g_{i,n}+v_{M,n}[t]) = \mathrm{sign}(g_{i,n})]\,\mathrm{sign}(g_{i,n}) + P[\mathrm{sign}(g_{i,n}+v_{M,n}[t]) \neq \mathrm{sign}(g_{i,n})]\,(-\mathrm{sign}(g_{i,n})) = (1 - 2P[\mathrm{sign}(g_{i,n}+v_{M,n}[t]) \neq \mathrm{sign}(g_{i,n})])\,\mathrm{sign}(g_{i,n}) = \left(1 - 2Q\!\left(\tfrac{|g_{i,n}|}{\sigma_M/\sqrt{2}}\right)\right)\mathrm{sign}(g_{i,n}). \quad (24)$$
Since $Q\!\left(\tfrac{|g_{i,n}|}{\sigma_M/\sqrt{2}}\right) < \tfrac{1}{2}$ for $\tfrac{|g_{i,n}|}{\sigma_M/\sqrt{2}} > 0$, the factor $\left(1 - 2Q\!\left(\tfrac{|g_{i,n}|}{\sigma_M/\sqrt{2}}\right)\right)$ does not change the sign. Therefore, we conclude that
$$\mathrm{sign}(\bar{y}_{i,n}) = \mathrm{sign}(g_{i,n}) = c_{i,n}, \quad (25)$$
for all $n \in \{1, 2, \ldots, N\}$. Further, when the number of training samples is large enough, $T \rightarrow \infty$, the empirical transition probability $\hat{p}_{i,n}$ converges to
$$\lim_{T\rightarrow\infty}\hat{p}_{i,n} = \lim_{T\rightarrow\infty}\frac{1}{2T}\sum_{t=(i-1)T+1}^{iT}\left\|\hat{c}_{i,n} - y_n[t]\right\| \overset{(a)}{=} \lim_{T\rightarrow\infty}\frac{1}{2T}\sum_{t=(i-1)T+1}^{iT}\left\|c_{i,n} - y_n[t]\right\| \overset{(b)}{=} \lim_{T\rightarrow\infty}\frac{1}{T}\sum_{t=(i-1)T+1}^{iT}\mathbb{1}\{\mathrm{sign}(g_{i,n}) \neq \mathrm{sign}(g_{i,n}+v_{M,n}[t])\} \overset{(c)}{=} \mathbb{E}\!\left[\mathbb{1}\{\mathrm{sign}(g_{i,n}) \neq \mathrm{sign}(g_{i,n}+v_{M,n}[t])\}\right] = P[\mathrm{sign}(g_{i,n}) \neq \mathrm{sign}(g_{i,n}+v_{M,n}[t])] = Q\!\left(\frac{|g_{i,n}|}{\sigma_M/\sqrt{2}}\right), \quad (26)$$
where (a) follows from (25), (b) holds by the definition of the indicator function together with $c_{i,n} = \mathrm{sign}(g_{i,n})$ and $y_n[t] = \mathrm{sign}(g_{i,n}+v_{M,n}[t])$, and (c) is by the law of large numbers.

Now, we are ready to prove Theorem 1.

Proof.
From Lemma 1, when $T$ is sufficiently large, the model parameters, including the codebook and the effective transition probabilities, can be perfectly estimated. Therefore, to prove Theorem 1, it suffices to show that the transition probability in (10) is equivalent to $P(\mathbf{y}_j\,|\,\mathbf{c}_i)$ under the assumption $\sigma_1 = \cdots = \sigma_{M-1} \rightarrow 0$. To do this, from the assumption, we first simplify the probability that the BS receives $\mathbf{y}_j \in \mathcal{Y}$ when $\mathbf{x}_i \in \mathcal{X}$ was sent in (10) as
$$\lim_{\sigma_1,\ldots,\sigma_{M-1}\rightarrow 0} P(\mathbf{y}=\mathbf{y}_j\,|\,\mathbf{x}=\mathbf{x}_i) = P(\mathbf{y}_j = \mathrm{sign}(\mathbf{H}_M\cdots\mathrm{sign}(\mathbf{H}_1\mathbf{x}_i)+\mathbf{v}_M)) = P(\mathbf{y}_j = \mathrm{sign}(\mathbf{g}_i+\mathbf{v}_M)) = \prod_{n=1}^{N} P(y_{j,n} = \mathrm{sign}(g_{i,n}+v_{M,n})) = \prod_{n=1}^{N} Q\!\left(\tfrac{|g_{i,n}|}{\sigma_M/\sqrt{2}}\right)^{\mathbb{1}\{y_{j,n}\neq\mathrm{sign}(g_{i,n})\}} Q\!\left(\tfrac{-|g_{i,n}|}{\sigma_M/\sqrt{2}}\right)^{\mathbb{1}\{y_{j,n}=\mathrm{sign}(g_{i,n})\}}. \quad (27)$$
Taking the logarithm, the corresponding log-likelihood function of the nonlinear multi-hop channel becomes
$$\lim_{\sigma_1,\ldots,\sigma_{M-1}\rightarrow 0} \ln P(\mathbf{y}=\mathbf{y}_j\,|\,\mathbf{x}=\mathbf{x}_i) = \sum_{n=1}^{N}\left(\ln p_{i,n}\,\mathbb{1}\{c_{i,n}\neq y_{j,n}\} + \ln(1-p_{i,n})\,\mathbb{1}\{c_{i,n}= y_{j,n}\}\right), \quad (28)$$
where $p_{i,n} = Q\!\left(\tfrac{|g_{i,n}|}{\sigma_M/\sqrt{2}}\right)$, which completes the proof.

V. MODEL-BASED ONLINE SUPERVISED-LEARNING APPROACH
In this section, we propose an end-to-end online supervised-learning detector that is robust to time-varying channel environments. We explain the proposed online-learning detector using the expectation-maximization (EM) framework. The proposed detector iteratively finds maximum-likelihood estimates of the model parameters from the received signals during the data transmission phase by treating them as new training (labelled) data samples.
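The core idea, assigning each new unlabeled observation a soft label (its posterior over the learned codewords) and letting that soft label weight the running parameter updates, can be sketched compactly. The following NumPy sketch uses hypothetical toy dimensions and parameter values; the exact responsibilities and update rules are derived in the subsections that follow:

```python
import numpy as np

num_codewords, N = 4, 6                  # hypothetical toy sizes
# Current model estimates: distinct codewords (binary expansion of i)
# and small, illustrative transition probabilities.
C_hat = 1 - 2 * ((np.arange(num_codewords)[:, None] >> np.arange(N)) & 1)
p_hat = np.full((num_codewords, N), 0.05)
centroids = C_hat.astype(float)          # running soft-weighted sums of y[t]

def soft_label(y, C_hat, p_hat):
    """Responsibility gamma_i of each codeword for observation y: softmax
    of negative weighted Hamming distances (uniform prior over codewords)."""
    flips = C_hat != y[None, :]
    d = np.where(flips, -np.log(p_hat), -np.log1p(-p_hat)).sum(axis=1)
    g = np.exp(-(d - d.min()))           # numerically stabilized
    return g / g.sum()

# One online step: a noisy one-bit observation of codeword 2.
y = C_hat[2].astype(float)
y[5] = -y[5]                             # one flipped antenna output
gamma = soft_label(y, C_hat, p_hat)
assert gamma.argmax() == 2               # mass concentrates on the true label

# The soft label weights the centroid update; the refreshed codeword is
# the sign of the weighted running sum (blind: it needs no p_hat).
centroids += gamma[:, None] * y[None, :]
assert (np.sign(centroids) == C_hat).all()   # one noisy sample does not flip it
```

Because the update is a rank-one accumulation per sample, each observation refines the parameters immediately rather than after a full EM pass over a batch.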
A. Joint Probability Distribution for Labeled and Unlabelled Training Samples
We denote the set of model parameters of the end-to-end network by $\Theta = \left\{\{p_{i,n}, c_{i,n}\}_{n=1}^{N}\right\}_{i=1}^{|\mathcal{M}_t|^K}$. We also define a binary vector $\mathbf{z}[t] = [z_1[t], z_2[t], \ldots, z_{|\mathcal{M}_t|^K}[t]]$, where $z_i[t] \in \{0,1\}$ indicates whether the $i$th information vector was sent at time $t$; therefore, $\sum_{i=1}^{|\mathcal{M}_t|^K} z_i[t] = 1$. During the training phase, i.e., $t \in \{1, 2, \ldots, T_t\}$, $\mathbf{z}[t]$ is a deterministic vector because the training samples are labeled; for instance, for $t \in \mathcal{T}_i$, $z_i[t] = 1$ and $z_k[t] = 0$ for $k \neq i$. During the data transmission phase $t \in \{T_t+1, \ldots, T_B\}$, in contrast, $\mathbf{z}[t]$ is a hidden variable, i.e., a random vector, because the received signal vector $\mathbf{y}[t]$ is unlabeled. Our goal is to design an algorithm that jointly performs the data detection (i.e., the assignment of a label) and the update of the model parameters by harnessing both the labeled received signals $\mathbf{y}[t]$ for $t \in \cup_{i=1}^{|\mathcal{M}_t|^K}\mathcal{T}_i$ and the unlabeled received signal vectors $\mathbf{y}[t]$ for $t \in \mathcal{T}_\tau = \{T_t+1, \ldots, \tau\}$, where $T_t+1 \leq \tau \leq T_B$. To accomplish this, we propose an online learning algorithm inspired by the expectation-maximization framework. Using the model developed in Section IV, we define the joint probability distribution of $\{\mathbf{y}[t]\}_{t=1}^{\tau}$, given the model parameter set $\Theta$ and the labels $\mathbf{z}[t]$, as
$$p(\{\mathbf{y}[t]\}_{t=1}^{\tau}\,|\,\{\mathbf{z}[t]\}_{t=1}^{\tau}, \Theta) = \prod_{n=1}^{N}\prod_{i=1}^{|\mathcal{M}_t|^K}\prod_{t=1}^{\tau}\left[p_{i,n}^{\mathbb{1}\{c_{i,n}\neq y_n[t]\}}(1-p_{i,n})^{\mathbb{1}\{c_{i,n}= y_n[t]\}}\right]^{z_i[t]}. \quad (29)$$
We also define the joint probability of the labels as
$$p(\{\mathbf{z}[t]\}_{t=1}^{\tau}) = \prod_{i=1}^{|\mathcal{M}_t|^K}\prod_{t=1}^{\tau}\pi_i^{z_i[t]}, \quad (30)$$
where $\pi_i = P[\mathbf{x}=\mathbf{x}_i] = |\mathcal{M}_t|^{-K}$.
Then, by Bayes' rule, the joint probability of $\{\mathbf{y}[t]\}_{t=1}^{\tau}$ and $\{\mathbf{z}[t]\}_{t=1}^{\tau}$, given the model parameter set $\Theta$, can be written as
$$p(\{\mathbf{y}[t]\}_{t=1}^{\tau}, \{\mathbf{z}[t]\}_{t=1}^{\tau}\,|\,\Theta) = p(\{\mathbf{y}[t]\}_{t=1}^{\tau}\,|\,\{\mathbf{z}[t]\}_{t=1}^{\tau},\Theta)\,p(\{\mathbf{z}[t]\}_{t=1}^{\tau}) = \prod_{n=1}^{N}\prod_{i=1}^{|\mathcal{M}_t|^K}\prod_{t=1}^{\tau}\left[p_{i,n}^{\mathbb{1}\{c_{i,n}\neq y_n[t]\}}(1-p_{i,n})^{\mathbb{1}\{c_{i,n}= y_n[t]\}}\right]^{z_i[t]}\prod_{t=1}^{\tau}\prod_{i=1}^{|\mathcal{M}_t|^K}\pi_i^{z_i[t]}. \quad (31)$$
The log-likelihood function of $(\{\mathbf{y}[t]\}_{t=1}^{\tau}, \{\mathbf{z}[t]\}_{t=1}^{\tau})$ conditioned on $\Theta$ is then
$$\ln p(\{\mathbf{y}[t]\}_{t=1}^{\tau}, \{\mathbf{z}[t]\}_{t=1}^{\tau}\,|\,\Theta) = \sum_{n=1}^{N}\sum_{i=1}^{|\mathcal{M}_t|^K}\sum_{t=1}^{\tau} z_i[t]\left[\ln p_{i,n}\,\mathbb{1}\{c_{i,n}\neq y_n[t]\} + \ln(1-p_{i,n})\,\mathbb{1}\{c_{i,n}= y_n[t]\}\right] + \sum_{t=1}^{\tau}\sum_{i=1}^{|\mathcal{M}_t|^K} z_i[t]\ln\pi_i. \quad (32)$$

B. Expectation Step
In the expectation step, we compute the probability that the $k$th label is assigned to a newly received signal vector. Recall that $\mathbf{z}[t]$ is fixed for $t \in \{1, 2, \ldots, T_t\}$. Therefore, in this step, we estimate the hidden variable $\mathbf{z}[t]$ for $t \in \mathcal{T}_\tau = \{T_t+1, \ldots, \tau\}$ by taking the expectation of the log-likelihood function in (32) with respect to the conditional distribution $p(\mathbf{z}[t]\,|\,\{\mathbf{y}[t]\}_{t=1}^{\tau}, \Theta)$, namely,
$$\mathbb{E}[\ln p(\{\mathbf{y}[t]\}_{t=1}^{\tau}, \{\mathbf{z}[t]\}_{t=1}^{\tau}\,|\,\Theta)] = \sum_{n=1}^{N}\sum_{i=1}^{|\mathcal{M}_t|^K}\sum_{t=1}^{\tau}\mathbb{E}[z_i[t]\,|\,\{\mathbf{y}[t]\}_{t=1}^{\tau},\Theta]\left[\ln p_{i,n}\,\mathbb{1}\{c_{i,n}\neq y_n[t]\} + \ln(1-p_{i,n})\,\mathbb{1}\{c_{i,n}= y_n[t]\}\right] + \sum_{i=1}^{|\mathcal{M}_t|^K}\sum_{t=1}^{\tau}\mathbb{E}[z_i[t]\,|\,\{\mathbf{y}[t]\}_{t=1}^{\tau},\Theta]\ln\pi_i$$
$$= \sum_{n=1}^{N}\sum_{i=1}^{|\mathcal{M}_t|^K}\sum_{t=1}^{\tau} P(z_i[t]=1\,|\,\{\mathbf{y}[t]\}_{t=1}^{\tau},\Theta)\left[\ln p_{i,n}\,\mathbb{1}\{c_{i,n}\neq y_n[t]\} + \ln(1-p_{i,n})\,\mathbb{1}\{c_{i,n}= y_n[t]\}\right] + \sum_{i=1}^{|\mathcal{M}_t|^K}\sum_{t=1}^{\tau} P(z_i[t]=1\,|\,\{\mathbf{y}[t]\}_{t=1}^{\tau},\Theta)\ln\pi_i, \quad (33)$$
where the last equality follows from the fact that $z_i[t]$ is an indicator, i.e., $\mathbb{E}[z_i[t]\,|\,\{\mathbf{y}[t]\}_{t=1}^{\tau},\Theta] = P(z_i[t]=1\,|\,\{\mathbf{y}[t]\}_{t=1}^{\tau},\Theta)$. Recall that $P[z_i[t]=1\,|\,\{\mathbf{y}[t]\}_{t=1}^{\tau},\Theta] = 1$ for $t \in \mathcal{T}_i$ with $i \in \{1, 2, \ldots, |\mathcal{M}_t|^K\}$, thanks to the training phase. The probability that the $k$th label is assigned to the new observation $\mathbf{y}[\tau]$ for $\tau \in \mathcal{T}_\tau = \{T_t+1, \ldots, T_B\}$ is obtained by computing the a posteriori probability (APP) as
$$P[z_k[\tau]=1\,|\,\{\mathbf{y}[t]\}_{t=1}^{\tau},\Theta] = \frac{P[\mathbf{y}[\tau]\,|\,z_k[\tau]=1,\Theta]\,P[z_k[\tau]=1]}{P[\mathbf{y}[\tau]\,|\,\Theta]} = \frac{\prod_{n=1}^{N}\left[p_{k,n}^{\mathbb{1}\{c_{k,n}\neq y_n[\tau]\}}(1-p_{k,n})^{\mathbb{1}\{c_{k,n}= y_n[\tau]\}}\right]\pi_k}{\sum_{i=1}^{|\mathcal{M}_t|^K}\pi_i\prod_{n=1}^{N}\left[p_{i,n}^{\mathbb{1}\{c_{i,n}\neq y_n[\tau]\}}(1-p_{i,n})^{\mathbb{1}\{c_{i,n}= y_n[\tau]\}}\right]} = \frac{\prod_{n=1}^{N}\left[p_{k,n}^{\mathbb{1}\{c_{k,n}\neq y_n[\tau]\}}(1-p_{k,n})^{\mathbb{1}\{c_{k,n}= y_n[\tau]\}}\right]}{\sum_{i=1}^{|\mathcal{M}_t|^K}\prod_{n=1}^{N}\left[p_{i,n}^{\mathbb{1}\{c_{i,n}\neq y_n[\tau]\}}(1-p_{i,n})^{\mathbb{1}\{c_{i,n}= y_n[\tau]\}}\right]}, \quad (34)$$
where the last equality follows from $\pi_i = \pi_k$ for all $i$ and $k$. From (18), since $p_{i,n}^{\mathbb{1}\{c_{i,n}\neq y_n[\tau]\}}(1-p_{i,n})^{\mathbb{1}\{c_{i,n}= y_n[\tau]\}} = e^{-d_{wH}(\mathbf{c}_i, \mathbf{y}[\tau]; \{\ln p_{i,n}\}_{n=1}^{N}, \{\ln(1-p_{i,n})\}_{n=1}^{N})}$, we rewrite the APP in (34) as
$$\gamma_k[\tau] = P[z_k[\tau]=1\,|\,\mathbf{y}[\tau],\Theta] = \frac{\exp\!\left(-d_{wH}\!\left(\mathbf{c}_k, \mathbf{y}[\tau]; \{w_{k,n}\}_{n=1}^{N}, \{w^c_{k,n}\}_{n=1}^{N}\right)\right)}{\sum_{j=1}^{|\mathcal{M}_t|^K}\exp\!\left(-d_{wH}\!\left(\mathbf{c}_j, \mathbf{y}[\tau]; \{w_{j,n}\}_{n=1}^{N}, \{w^c_{j,n}\}_{n=1}^{N}\right)\right)}. \quad (35)$$
Therefore, to compute the reliability of the label $\gamma_k[\tau]$ for $\mathbf{y}[\tau]$, we need the model parameters estimated from the previously received signal vectors $\{\mathbf{y}[t]\}_{t=1}^{\tau-1}$. Let $\hat{\Theta}[\tau-1] = \left\{\{\hat{p}_{i,n}[\tau-1], \hat{c}_{i,n}[\tau-1]\}_{n=1}^{N}\right\}$ be the model parameter set estimated from the received signal vectors $\{\mathbf{y}[t]\}_{t=1}^{\tau-1}$. Then, the APP in (35) can be rewritten in terms of $\hat{\Theta}[\tau-1]$ as
$$\hat{\gamma}_k[\tau] = \frac{\exp\!\left(-d_{wH}\!\left(\hat{\mathbf{c}}_k[\tau-1], \mathbf{y}[\tau]; \{\hat{w}_{k,n}[\tau-1]\}_{n=1}^{N}, \{\hat{w}^c_{k,n}[\tau-1]\}_{n=1}^{N}\right)\right)}{\sum_{j=1}^{|\mathcal{M}_t|^K}\exp\!\left(-d_{wH}\!\left(\hat{\mathbf{c}}_j[\tau-1], \mathbf{y}[\tau]; \{\hat{w}_{j,n}[\tau-1]\}_{n=1}^{N}, \{\hat{w}^c_{j,n}[\tau-1]\}_{n=1}^{N}\right)\right)}. \quad (36)$$

C. Maximization Step
In the maximization step, we find the parameters that maximize the expected log-likelihood function. Since the term $\sum_{i=1}^{|\mathcal{M}_t|^K}\sum_{t=1}^{\tau}\hat{\gamma}_i[t]\ln\pi_i$ is irrelevant to the parameter estimation, an effective log-likelihood function for the parameter optimization is
$$\mathcal{L}(\Theta) = \mathbb{E}[\ln p(\{\mathbf{y}[t]\}_{t=1}^{\tau}, \{\mathbf{z}[t]\}_{t=1}^{\tau}\,|\,\Theta)] - \sum_{i=1}^{|\mathcal{M}_t|^K}\sum_{t=1}^{\tau}\hat{\gamma}_i[t]\ln\pi_i = \sum_{n=1}^{N}\sum_{i=1}^{|\mathcal{M}_t|^K}\sum_{t=1}^{\tau}\hat{\gamma}_i[t]\left[\ln p_{i,n}\,\mathbb{1}\{c_{i,n}\neq y_n[t]\} + \ln(1-p_{i,n})\,\mathbb{1}\{c_{i,n}= y_n[t]\}\right]. \quad (37)$$
Unfortunately, maximizing $\mathcal{L}(\Theta)$ with respect to $\Theta$ is a mixed-integer optimization problem, because $\mathbf{c}_i$ is a discrete vector. Instead of jointly finding the parameters, we solve this optimization problem in two steps: 1) the codeword estimation and 2) the transition probability estimation.

We first present the codeword estimation method. With the knowledge of $p_{i,n}$, the $n$th element of the $i$th codeword vector, $c_{i,n} \in \{-1, 1\}$, is obtained by solving the following optimization problem:
$$c^{\star}_{i,n} = \arg\max_{c_{i,n}\in\{1,-1\}}\sum_{t=1}^{\tau}\hat{\gamma}_i[t]\left[\ln p_{i,n}\,\mathbb{1}\{c_{i,n}\neq y_n[t]\} + \ln(1-p_{i,n})\,\mathbb{1}\{c_{i,n}= y_n[t]\}\right]. \quad (38)$$
For given $p_{i,n}$, the optimal solution $c^{\star}_{i,n}$ can be expressed using the sign function as
$$c^{\star}_{i,n} = \mathrm{sign}(\kappa_1 - \kappa_{-1}), \quad (39)$$
where $\kappa_1 = \sum_{t=1}^{\tau}\hat{\gamma}_i[t]\left[\ln p_{i,n}\,\mathbb{1}\{y_n[t]\neq 1\} + \ln(1-p_{i,n})\,\mathbb{1}\{y_n[t]= 1\}\right]$ and $\kappa_{-1} = \sum_{t=1}^{\tau}\hat{\gamma}_i[t]\left[\ln p_{i,n}\,\mathbb{1}\{y_n[t]\neq -1\} + \ln(1-p_{i,n})\,\mathbb{1}\{y_n[t]= -1\}\right]$. The estimator in (39) cannot be used in practice when the knowledge of $p_{i,n}$ is absent. To resolve this problem, we propose a simple blind estimator of $c_{i,n}$ using $\{y_n[t]\}_{t=1}^{\tau}$.
Assuming that the SNR per hop is sufficiently large, i.e., $p_{i,n}\rightarrow 0$, the optimal estimator can be approximated as
$$\hat{c}_{i,n}[\tau] = \lim_{p_{i,n}\rightarrow 0}\mathrm{sign}(\kappa_1-\kappa_{-1}) = \mathrm{sign}\!\left(\ln p_{i,n}\left\{\sum_{t=1}^{\tau}\gamma_i[t]\left(\mathbb{1}\{y_n[t]=-1\}-\mathbb{1}\{y_n[t]=1\}\right)\right\}\right) = \mathrm{sign}\!\left(\sum_{t=1}^{\tau}\gamma_i[t]\left(\mathbb{1}\{y_n[t]=1\}-\mathbb{1}\{y_n[t]=-1\}\right)\right) = \mathrm{sign}\!\left(\sum_{t=1}^{\tau}\gamma_i[t]\,y_n[t]\right), \quad (40)$$
where the second equality follows from the fact that $\ln p_{i,n} < 0$ for any $0 < p_{i,n} < 1$, and the last equality holds because $\mathbb{1}\{y_n[t]=1\}-\mathbb{1}\{y_n[t]=-1\} = y_n[t]$. As can be seen, this estimator is not only simple to compute, but it also does not require knowledge of $p_{i,n}$. Nevertheless, this simple weighted sample-average estimator guarantees optimality in a certain condition, as shown in Lemma 1.

For given $\hat{c}_{i,n}[\tau]$, the expected log-likelihood function in (37) is a concave function of $p_{i,n}$. Therefore, the optimal $p_{i,n}$ is obtained from the first-order derivative of $\mathcal{L}(p_{i,n}, \hat{c}_{i,n}[\tau])$ with respect to $p_{i,n}$, which is
$$\frac{\partial\mathcal{L}(p_{i,n}, \hat{c}_{i,n}[\tau])}{\partial p_{i,n}} = \sum_{t=1}^{\tau}\hat{\gamma}_i[t]\left\{\frac{1}{p_{i,n}}\mathbb{1}\{\hat{c}_{i,n}[\tau]\neq y_n[t]\} - \frac{1}{1-p_{i,n}}\mathbb{1}\{\hat{c}_{i,n}[\tau]= y_n[t]\}\right\}. \quad (41)$$
By solving $\frac{\partial\mathcal{L}(p_{i,n}, \hat{\mathbf{c}}_i[\tau])}{\partial p_{i,n}} = 0$, we obtain the optimal estimate of the $n$th transition probability when the $i$th codeword was sent as
$$\hat{p}^{\star}_{i,n}[\tau] = \frac{\sum_{t=1}^{\tau}\hat{\gamma}_i[t]\,\mathbb{1}\{\hat{c}_{i,n}[\tau]\neq y_n[t]\}}{\sum_{t=1}^{\tau}\hat{\gamma}_i[t]}. \quad (42)$$
Therefore, the estimated model parameter set $\hat{\Theta}[\tau]$ based on the received signal vectors $\{\mathbf{y}[t]\}_{t=1}^{\tau}$ is
$$\hat{\Theta}[\tau] = \left(\hat{p}^{\star}_{i,n}[\tau], \hat{c}_{i,n}[\tau]\right), \quad (43)$$
for $i \in \{1, 2, \ldots, |\mathcal{M}_t|^K\}$ and $n \in \{1, \ldots, N\}$.

Detection:
Once the model parameter set $\hat{\Theta}[\tau]$ has been updated using a new observation $\mathbf{y}[\tau]$, the BS performs the wMHD method on $\mathbf{y}[\tau]$ as
$$\hat{\mathbf{x}}_i[\tau] = \arg\min_{i} d_{wH}\!\left(\hat{\mathbf{c}}_i[\tau], \mathbf{y}[\tau]; \{\hat{w}_{i,n}[\tau]\}_{n=1}^{N}, \{\hat{w}^c_{i,n}[\tau]\}_{n=1}^{N}\right). \quad (44)$$
The proposed online supervised-learning detection method is summarized in Algorithm 2.

Algorithm 2
The proposed online supervised-learning detector.

Parameter learning during the training phase: $\hat{\Theta}[T_t] = \left(\hat{p}^{\star}_{i,n}[T_t], \hat{c}_{i,n}[T_t]\right)$ for $i \in \{1, 2, \ldots, |\mathcal{M}_t|^K\}$ and $n \in \{1, \ldots, N\}$.
Centroid: $\bar{\mathbf{y}}_i[T_t] = \sum_{t\in\mathcal{T}_i}\mathbf{y}[t]$.
for $t = T_t+1, \ldots, T_B$ do
  {Data detection with the updated parameters}
  Compute the reliability of each codeword vector using the new observation $\mathbf{y}[t]$:
  $\hat{\gamma}_i[t] = \dfrac{\exp\!\left(-d_{wH}\!\left(\hat{\mathbf{c}}_i[t-1], \mathbf{y}[t]; \{\hat{w}_{i,n}[t-1]\}_{n=1}^{N}, \{\hat{w}^c_{i,n}[t-1]\}_{n=1}^{N}\right)\right)}{\sum_{j=1}^{|\mathcal{M}_t|^K}\exp\!\left(-d_{wH}\!\left(\hat{\mathbf{c}}_j[t-1], \mathbf{y}[t]; \{\hat{w}_{j,n}[t-1]\}_{n=1}^{N}, \{\hat{w}^c_{j,n}[t-1]\}_{n=1}^{N}\right)\right)}$ for $i \in \{1, 2, \ldots, |\mathcal{M}_t|^K\}$.
  Centroid update: $\bar{\mathbf{y}}_i[t] = \bar{\mathbf{y}}_i[t-1] + \hat{\gamma}_i[t]\,\mathbf{y}[t]$ for $i \in \{1, 2, \ldots, |\mathcal{M}_t|^K\}$.
  Codeword update: $\hat{\mathbf{c}}_i[t] = \mathrm{sign}(\bar{\mathbf{y}}_i[t])$ for $i \in \{1, 2, \ldots, |\mathcal{M}_t|^K\}$.
  Transition probability update: $\hat{p}_{i,n}[t] = \dfrac{\sum_{\tau=1}^{t}\hat{\gamma}_i[\tau]\,\mathbb{1}\{\hat{c}_{i,n}[t]\neq y_n[\tau]\}}{\sum_{\tau=1}^{t}\hat{\gamma}_i[\tau]}$.
  Detection: $\hat{\mathbf{x}}_i[t] = \arg\min_i d_{wH}\!\left(\hat{\mathbf{c}}_i[t], \mathbf{y}[t]; \{\hat{w}_{i,n}[t]\}_{n=1}^{N}, \{\hat{w}^c_{i,n}[t]\}_{n=1}^{N}\right)$.
end for

Remark 6 (Differences from existing algorithms):
Although the proposed online parameter learning and detection method resembles the conventional EM algorithm, there are important differences. Unlike the conventional EM algorithm for data clustering applications, the proposed method does not iteratively alternate between parameter estimation and detection until convergence. Instead, the proposed algorithm moves forward in a sample-by-sample fashion, which makes it possible to track channel variations. In addition, the proposed online learning algorithm differs from our prior work in [23]. The key difference is that in [23], the new training example set is acquired using a cyclic redundancy check (CRC) code. That method, however, incurs a high computational complexity due to its iterative detection and decoding procedures, and it cannot track channel variations within a coding block. The proposed method, in contrast, is able to track channel variations instantaneously.

VI. MODEL-FREE SUPERVISED-LEARNING VIA DEEP NEURAL NETWORKS
In this section, we present a multi-user detector based on a deep neural network (DNN) trained by model-free supervised learning. This approach differs from the model-based supervised-learning approaches explained in Sections IV and V in that it does not use any specific end-to-end network model. Specifically, the model-based approaches learn the parameters, including the codebook and the set of transition probabilities, that accurately match the likelihood function $p(\mathbf{y}_j\,|\,\mathbf{x}_i; \Theta)$ using training examples; the A-ML detector then performs the data detection using this likelihood function. The deep-learning approach, in contrast, directly learns the a posteriori distribution $p(\mathbf{x}_i\,|\,\mathbf{y}[t]; \Theta_{\rm DNN})$ from training examples by optimizing the DNN parameters $\Theta_{\rm DNN}$. One noticeable point is that the deep-learning approach essentially does not require knowledge of the model parameters $\Theta$ in the likelihood function.

The proposed DNN architecture:
We construct a DNN with multiple layers to capture the multi-hop channels, the one-bit quantization, and the noise effects. As illustrated in Table I, the proposed DNN architecture is composed of an input layer, two hidden layers, and an output layer. The input layer takes the $N$-dimensional binary received signal $\mathbf{y}[t] \in \{-1, 1\}^N$. Since this input signal of the DNN is spatially correlated due to the multi-hop MIMO channels, a long short-term memory (LSTM) network is used for the first hidden layer to exploit the correlation structure in the received signal. We also employ a fully connected network for the second hidden layer. The output layer then computes the a posteriori distribution using the softmax function.

TABLE I: The DNN structure used. Here, $\rho$ is the number of output parameters of the LSTM, and the DNN parameter set is denoted by $\Theta_{\rm DNN} = \{\Theta^{(1)}_{\rm DNN}, \Theta^{(2)}_{\rm DNN}, \Theta^{(3)}_{\rm DNN}\}$.

Layer           Type                        Size                          Activation   Weight parameters
Input Layer     Received signals (labeled)  $N$                           --           --
Hidden Layer-1  LSTM                        $4((N+1)\rho + \rho^2)$       ReLU         $\Theta^{(1)}_{\rm DNN}$
Hidden Layer-2  Fully connected             $|\mathcal{M}_t|^K(\rho+1)$   --           $\Theta^{(2)}_{\rm DNN}$
Output Layer    Fully connected             $|\mathcal{M}_t|^K$           Softmax      $\Theta^{(3)}_{\rm DNN}$

Let $u_k[t](\Theta_{\rm DNN})$ be the $k$th output of the output layer at the $t$th training sample, which is a function of the DNN parameters $\Theta_{\rm DNN}$. Then, the $k$th output of the softmax function is given by
$$p_k[t](\Theta_{\rm DNN}) = \frac{e^{u_k[t](\Theta_{\rm DNN})}}{\sum_{i=1}^{|\mathcal{M}_t|^K} e^{u_i[t](\Theta_{\rm DNN})}}. \quad (45)$$
As can be seen, the output of the softmax function is the probability that the received signal belongs to the $k$th information vector.

Training the DNN detector:
During the training phase, the DNN detector is trained to classify the received signal as one of the possible transmit symbol vectors in $\mathcal{X}$. Specifically, by sending pilot symbols, the BS obtains $T_t = T|\mathcal{M}_t|^K$ labeled training examples $\mathcal{S} = \{(\mathbf{x}[1], \mathbf{y}[1]), \ldots, (\mathbf{x}[T_t], \mathbf{y}[T_t])\}$. The DNN is trained to minimize a loss function between the outputs of the DNN and the labels. We use the cross-entropy loss for the parameter optimization:
$$\mathcal{L}(\Theta_{\rm DNN}) = -\sum_{t=1}^{T_t}\sum_{i=1}^{|\mathcal{M}_t|^K}\mathbb{1}\{\mathbf{x}[t]=\mathbf{x}_i\}\ln p_i[t](\Theta_{\rm DNN}). \quad (46)$$
The well-known stochastic gradient method is used to estimate the parameters of the DNN.

Detection:
Detecting the transmit symbol from the received signal is a classification problem when the number of possible transmit symbols is finite. In our problem, there are $|\mathcal{M}_t|^K$ classes, indexed from $1$ to $|\mathcal{M}_t|^K$, corresponding to the $|\mathcal{M}_t|^K$ possible transmit symbol vectors. Once the network parameters have been learned from the training examples, the DNN detector performs the detection on the received signal vectors during the data detection phase, i.e., $\mathbf{y}[t]$ for $t \in \{T_t+1, \ldots, T_B\}$.

Fig. 2: The SER comparison for $[K, L, N] = [2, \cdot, \cdot]$ (left) and $[K, L, N] = [2, \cdot, N]$ (right). The QPSK modulation is used per user.

Remark 7 (Practical challenges of the DNN detector):
This model-free approach is useful for improving the detection performance by eliminating possible model errors. Nevertheless, the DNN detector is not suitable for scenarios in which the channel changes relatively fast, because the computational complexity of learning the weight parameters is very high compared to the model-based supervised-learning approach.

VII. SIMULATION RESULTS
In this section, we evaluate the symbol-error-rate (SER) and symbol-vector-error-probability (SVEP) performances of the classical, model-based, and model-free approaches. In our simulations, we consider Rayleigh-fading channels, in which each element of the channel matrix per hop is drawn IID from complex Gaussian random variables, i.e., $\mathcal{CN}(0, 1)$. In addition, we consider a two-hop MU-MIMO system in which the SNR of the first hop is fixed while the second-hop SNR is varied. For notation, we denote the two-hop MU-MIMO channel with $K$ uplink users, $L$ distributed relays, and $N$ BS antennas by $[K, L, N]$.

Fig. 3: The SER comparison of the A-ML detector for $[K, L, N] = [2, \cdot, \cdot]$ (left) and $[K, L, N] = [2, \cdot, N]$ (right). The QPSK and 8PSK modulations are considered.

Fig. 2 compares the SER performances of conventional detectors based on the classical approach with the A-ML detector based on the model-based supervised-learning approach. In addition to the ML detector, we consider three linear detectors: a zero-forcing detector, a linear minimum mean square error (LMMSE) detector, and a successive Bussgang linear MMSE (S-BLMMSE) detector. In particular, the S-BLMMSE detector is designed by successively applying the Bussgang decomposition in [31]. Unlike the conventional detectors, for which perfect and global CSI is available at the BS, the proposed A-ML detector uses model parameters estimated by sending $T = 15$ pilots per information symbol vector. The first-hop SNR is fixed at 20 dB. As can be seen in the left figure, the proposed A-ML detector significantly outperforms the linear detectors, even with imperfect knowledge of the model parameters. A similar performance tendency is observed when the number of antennas at the BS increases, with the second-hop SNR fixed at 20 dB.

Fig. 3 shows the SER performance of the A-ML detector as $T$ increases. In particular, to verify Theorem 1, we compare the SER performances of the ML and A-ML detectors for increasing $T$, assuming the first-hop SNR is infinite, i.e., $\sigma_1 = 0$. As can be seen in Fig. 3, the SER gap between the proposed A-ML and ML detectors diminishes as $T$ increases. This result agrees with Theorem 1.
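The convergence behavior behind this agreement is easy to reproduce in miniature. The sketch below uses a real-valued toy with a $\pm 1$ second-hop channel so that the effective signal is bounded away from zero; it is not the paper's complex-valued Rayleigh setup, where the $\sigma_M/\sqrt{2}$ in the $Q$-function argument stems from complex noise. It repeatedly sends one training symbol through a two-hop channel with a noiseless first hop and checks that the sign of the received sample average recovers the codeword $\mathbf{c}_i = \mathrm{sign}(\mathbf{H}_2\,\mathrm{sign}(\mathbf{H}_1\mathbf{x}_i))$, as in Lemma 1:

```python
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(2)
K, L, N = 2, 3, 8        # users, relays, BS antennas (real-valued toy)
T = 20000                # repetitions of one training symbol vector
sigma = 1.0              # second-hop noise std; first hop is noiseless

H1 = rng.standard_normal((L, K))
H2 = rng.choice([-1.0, 1.0], size=(N, L))   # +/-1 entries keep |g_n| >= 1
x = np.array([1.0, -1.0])                   # one fixed training symbol

g = H2 @ np.sign(H1 @ x)                    # noiseless effective signal at the BS
c = np.sign(g)                              # the codeword of Lemma 1

# Repeated one-bit observations y[t] = sign(g + v[t]).
Y = np.sign(g[None, :] + rng.normal(0.0, sigma, size=(T, N)))

c_hat = np.sign(Y.mean(axis=0))             # sign of the sample average
p_hat = (Y != c[None, :]).mean(axis=0)      # empirical flip probabilities

assert (c_hat == c).all()                   # the codeword is recovered
# ... and the flip rates approach Q(|g_n|/sigma) = 0.5*erfc(|g_n|/(sigma*sqrt(2))).
Q = np.array([0.5 * erfc(abs(v) / (sigma * sqrt(2.0))) for v in g])
assert np.max(np.abs(p_hat - Q)) < 0.02
```

Shrinking $T$ widens the gap between the empirical and true transition probabilities, mirroring the SER gap observed for small $T$ in Fig. 3.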
One remarkable observation is that the A-ML detector achieves near-ML performance with a reasonable amount of pilots once the second-hop SNR is beyond 15 dB.

We next evaluate the SVEP performance of several DNN detectors with the configurations and parameter settings listed in Table II. As shown in Fig. 4, increasing the number of layers degrades the performance because of overfitting. On the other hand, when the number of layers is fixed, using a sufficient number of hyper-parameters per layer performs better. Therefore, we use the DNN4 detector for comparison with the other schemes, including the A-ML and ML detectors.

TABLE II: Different configurations of the DNN detectors.

               DNN1                  DNN2                  DNN3                  DNN4
Input Layer    Received signals      Received signals      Received signals      Received signals
Hidden Layers  LSTM (50)             LSTM (100)            LSTM (50)             LSTM (100)
               Fully Connected (30)  Fully Connected (16)  ReLU                  ReLU
               ReLU                  ReLU                  Fully Connected (16)  Fully Connected (16)
               Fully Connected (16)  Fully Connected (16)
Output Layer   Softmax               Softmax               Softmax               Softmax

Fig. 4: The SVEP comparison for the different DNN configurations with QPSK modulation, where $[K, L, N] = [2, \cdot, \cdot]$ and SNR = 20 dB.

Fig. 5 compares the SVEP performances of the three detection approaches: the classical, the model-based, and the model-free. In this simulation, we evaluate the SVEP at two different first-hop SNRs. In addition, we train the parameters of the DNN and A-ML detectors per SNR by sending pilots of length $T = 15$; in particular, for the DNN detector we chose $\rho = 100$ for the LSTM layer. (We simulated different values of $\rho$ to train the DNN detector and observed similar detection performances over an intermediate range of $\rho$; for larger $\rho$, the performance degrades due to overfitting.) One interesting observation is that the proposed DNN detector slightly outperforms the A-ML detector, because it is capable of removing the model errors. The same behavior is observed when the number of antennas at the BS increases, with the second-hop SNR fixed at 20 dB. Nevertheless, the computational complexity of the DNN detector is much higher than that of the A-ML detector. For instance, for a given channel realization and one SNR point, the runtimes of the DNN, ML, and A-ML detection algorithms are measured as 11.383 s, 2.052 s, and 0.0931 s, respectively, under the same simulation conditions. This is because the DNN detector uses more hyper-parameters than the A-ML detector, as shown in Table I.

Fig. 5: The SVEP comparison of the proposed detectors for $[K, L, N] = [2, \cdot, \cdot]$ (left) and $[K, L, N] = [2, \cdot, N]$ (right). The QPSK modulation is used.

Fig. 6 compares the SER performances of various detectors in time-varying channels. In this simulation, we assume that the second-hop channel is time-varying while the first-hop channel is time-invariant. To model the time variation of the second-hop channel, we use an order-one auto-regressive process, $\mathbf{H}[t] = \eta\,\mathbf{H}[t-1] + \mathbf{W}[t]$, where $\eta$ is the temporal correlation coefficient of the second-hop channel fading and $\mathbf{W}[t]$ is a process noise matrix whose $(i,j)$th element is drawn from a complex Gaussian random variable, i.e., $W_{i,j}[t] \sim \mathcal{CN}(0, 1-\eta^2)$. Following Jakes's model, the temporal correlation coefficient is chosen as $\eta = J_0(2\pi f_d T_s)$, where $J_0(\cdot)$ denotes the zeroth-order Bessel function of the first kind, $f_d$ is the maximum Doppler frequency, and $T_s$ is the sampling time. In our simulation, we assume that the channel is invariant during the training phase and that the first-hop SNR is 30 dB. As can be seen in Fig. 6, for the smallest normalized Doppler considered, the online-learning-based A-ML detector significantly outperforms the existing linear detectors, which use global and perfect CSI knowledge at the BS. In addition, the proposed online supervised-learning detector provides a considerable SER gain over the A-ML detector that does not update the model parameters in a symbol-by-symbol fashion; for example, the online supervised-learning detector yields about an 8 dB SNR gain over the supervised-learning detector at a target SER of 0.03. As the normalized Doppler $f_d T_s$ increases, however, the SER performance degrades, because a larger $f_d T_s$ causes more channel variation and thereby makes the channel harder to track. Nevertheless, the performance of the online supervised-learning detector remains similar to that of the LMMSE detector that uses global and perfect CSI knowledge at the BS.

Fig. 6: The SER comparison of the detectors under time-varying channels. The channel configuration is $[K, L, N] = [2, \cdot, \cdot]$ and the QPSK modulation is used.

VIII. CONCLUSION
In this paper, we introduced a new nonlinear MU-MIMO relay channel, in which distributed relays use one-bit DACs and ADCs, motivated by low-power hardware constraints. In this channel, to understand the limit of the multi-user detection performance, we first proposed the ML detector, which requires global and perfect CSI at the BS. Inspired by end-to-end supervised learning, we then presented a novel data communication framework by developing a simple yet effective channel model. The proposed model facilitated parameter learning with a simple pilot transmission strategy while ensuring the optimality of the detection performance under certain conditions. In addition, we extended the proposed communication framework to a time-varying channel environment. The proposed online supervised-learning detector jointly performed the model parameter updates and the data detection using unlabeled received signals via an EM-like algorithm. Lastly, we also presented a detector based on a DNN that does not rely on any specific network model. Via simulations, we compared the SER performances of the different detection approaches in order to provide a complete view of the effectiveness of supervised learning in the considered MIMO channel.

REFERENCES

[1] J. Boyer, D. D. Falconer, and H. Yanikomeroglu, "Multihop diversity in wireless relaying channels," IEEE Trans. Commun., vol. 52, no. 10, pp. 1820-1830, Oct. 2004.
[2] E. G. Larsson, O. Edfors, F. Tufvesson, and T. L. Marzetta, "Massive MIMO for next generation wireless systems," IEEE Commun. Mag., vol. 52, no. 2, pp. 186-195, Feb. 2014.
[3] B. Wang, J. Zhang, and A. Host-Madsen, "On the capacity of MIMO relay channels," IEEE Trans. Inf. Theory, vol. 51, no. 1, pp. 29-43, Jan. 2005.
[4] H. Bolcskei, R. U. Nabar, O. Oyman, and A. J. Paulraj, "Capacity scaling laws in MIMO relay networks," IEEE Trans. Wireless Commun., vol. 5, no. 6, pp. 1433-1444, Jun. 2006.
[5] X. Tang and Y. Hua, "Optimal design of non-regenerative MIMO wireless relays," IEEE Trans. Wireless Commun., vol. 6, no. 4, pp. 1398-1407, Apr. 2007.
[6] N. Fawaz, K. Zarifi, M. Debbah, and D. Gesbert, "Asymptotic capacity and optimal precoding in MIMO multi-hop relay networks," IEEE Trans. Inf. Theory, vol. 57, no. 4, pp. 2050-2069, Apr. 2011.
[7] Y. Jing and X. Yu, "ML-based channel estimations for non-regenerative relay networks with multiple transmit and receive antennas," IEEE J. Sel. Areas Commun., vol. 30, no. 8, pp. 1428-1439, Sep. 2012.
[8] Y. Rong, M. R. A. Khandaker, and Y. Xiang, "Channel estimation of dual-hop MIMO relay system via parallel factor analysis," IEEE Trans. Wireless Commun., vol. 11, no. 6, pp. 2224-2233, Jun. 2012.
[9] N. Yang, M. Elkashlan, P. L. Yeoh, and J. Yuan, "Multiuser MIMO relay networks in Nakagami-m fading channels," IEEE Trans. Commun., vol. 60, no. 11, pp. 3298-3310, Nov. 2012.
[10] A. Mezghani and J. Nossek, "On ultra-wideband MIMO systems with 1-bit quantized outputs: Performance analysis and input optimization," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Nice, France, Jun. 2007.
[11] J. Singh, O. Dabeer, and U. Madhow, "On the limits of communication with low-precision analog-to-digital conversion at the receiver," IEEE Trans. Commun., vol. 57, no. 12, pp. 3629-3639, Dec. 2009.
[12] S. Wang, Y. Li, and J. Wang, "Convex optimization based multiuser detection for uplink large-scale MIMO under low-resolution quantization," in Proc. IEEE Int. Conf. Commun., Jun. 2014.
[13] C. Studer and G. Durisi, "Quantized massive MU-MIMO-OFDM uplink," IEEE Trans. Commun., vol. 64, no. 6, pp. 2387-2399, Jun. 2016.
[14] S.-N. Hong, S. Kim, and N. Lee, "A weighted minimum distance decoding for uplink multiuser MIMO systems with low-resolution ADCs," IEEE Trans. Commun., vol. 66, no. 5, pp. 1912-1924, May 2018.
[15] Y.-S. Jeon, N. Lee, S.-N. Hong, and R. W. Heath, Jr., "One-bit sphere decoding for uplink massive MIMO systems with one-bit ADCs," IEEE Trans. Wireless Commun., vol. 17, no. 7, pp. 4509-4521, Jul. 2018.
[16] J. Mo, P. Schniter, and R. W. Heath, Jr., "Channel estimation in broadband millimeter wave MIMO systems with few-bit ADCs," IEEE Trans. Signal Process., vol. 66, no. 5, pp. 1141-1154, Mar. 2018.
[17] Y.-S. Jeon, S.-N. Hong, and N. Lee, "Supervised-learning-aided communication framework for MIMO systems with low-resolution ADCs," IEEE Trans. Veh. Technol., vol. 67, no. 8, pp. 7299-7313, Aug. 2018.
[18] P. Dong, H. Zhang, W. Xu, and X. You, "Efficient low-resolution ADC relaying for multiuser massive MIMO system," IEEE Trans. Veh. Technol., vol. 66, no. 12, pp. 11039-11056, Dec. 2017.
[19] C. Kong, A. Mezghani, C. Zhong, A. L. Swindlehurst, and Z. Zhang, "Multipair massive MIMO relaying systems with one-bit ADCs and DACs," IEEE Trans. Signal Process., vol. 66, no. 11, pp. 2984-2997, Jun. 2018.
[20] J. Liu, J. Xu, W. Xu, S. Jin, and X. Dong, "Multiuser massive MIMO relaying with mixed-ADC receiver," IEEE Signal Process. Lett., vol. 24, no. 1, pp. 76-80, Jan. 2017.
[21] J. J. Bussgang, "Crosscorrelation functions of amplitude-distorted Gaussian signals," Res. Lab. Electron., Massachusetts Inst. Technol., Cambridge, MA, USA, Tech. Rep. 216, Mar. 1952.
[22] C. Cao, H. Li, Z. Hu, and H. Zeng, "One-bit transceiver cluster for relay transmission," IEEE Commun. Lett., vol. 21, no. 4, pp. 925-928, Apr. 2017.
[23] Y.-S. Jeon, M. So, and N. Lee, "Reinforcement-learning-aided ML detector for uplink massive MIMO systems with low-precision ADCs," in
IEEE Commun. Lett., vol. 21, no.4, pp. 925–928, Apr.. 2017.[23] Y.-S. Jeon, M. So, and N. Lee, “Reinforcement-learning-aided ML detector for uplink massive MIMO systems withlow-precision ADCs.” in
Proc. IEEE Wireless Commun. Netw. Conf. (WCNC),
Barcelona, Spain, Apr. 2018.[24] E. Balevi and J. G. Andrews, “One-bit OFDM receivers via deep learning.” arXiv:1811.00971, [25] N. Farsad and A. Goldsmith, “Detection algorithms for communication systems using deep learning,” arXiv: 1705.08044, May 2017.[26] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,”
IEEE Trans. on Cogn. Commun. Netw., vol. 3, no. 4, pp. 563-575, Dec. 2017.[27] S. Dorner, S. Cammerer, J. Hoydis, and S. ten Brink, “Deep learning based communication over the air,”
IEEE J. Sel.Topics Signal Process., vol. 12, no. 1, pp. 132-143, Feb. 2018.[28] A. Felix, S. Cammerer, S. Dorner, J. Hoydis, and S. ten Brink, “OFDM Autoencoder for end-to-end learning ofcommunications systems,” in Proc. IEEE Int. Workshop Signal Proc. Adv. Wireless Commun. (SPAWC),
June 2018.[29] S. Kim, M. So, N. Lee, and S.-N. Hong, “Semi-supervised learning detector for MU-MIMO systems with one-bit ADCs,”in
Proc. IEEE Int. Conf. Commun. Workshop ML4COM,
May 2019.[30] Y.-S. Jeon, N. Lee, and H. V. Poor, “Robust data detection for MIMO systems with one-bit ADCs: a reinforcement learningapproach,” submitted to
IEEE Trans. Wireless Commun.,
Mar. 2019. (Available at: https://128.84.21.199/abs/1903.12546)[31] Y. Li, C. Tao, G. Seco-Granados, A. Mezghani, A. L. Swindlehurst, and L. Liu, “Channel estimation and performanceanalysis of one-bit massive MIMO systems,”
IEEE Trans. Signal Process., vol. 65, no. 15, pp. 4075-4089, Aug., 2017.[32] M. J. Evans and J. S. Rosenthal, “Probability and Statistics, the Science of Uncertainty,”