Information Bottleneck for a Rayleigh Fading MIMO Channel with an Oblivious Relay
Hao Xu, Member, IEEE, Tianyu Yang, Giuseppe Caire, Fellow, IEEE, and Shlomo Shamai (Shitz), Life Fellow, IEEE
Abstract
This paper considers the information bottleneck (IB) problem of a Rayleigh fading multiple-input multiple-output (MIMO) channel with an oblivious relay. The relay is constrained to operate without knowledge of the codebooks, i.e., it performs oblivious processing. Moreover, due to the bottleneck constraint, it is impossible for the relay to inform the destination node of the perfect channel state information (CSI) in each channel realization. To evaluate the bottleneck rate, we first provide an upper bound by assuming that the destination node can get the perfect CSI at no cost. Then, we provide four achievable schemes, each of which satisfies the bottleneck constraint and gives a lower bound to the bottleneck rate. In the first and second schemes, the relay splits the capacity of the relay-destination link into two parts and conveys both the CSI and its observation to the destination node. Due to the CSI transmission, the performance of these two schemes is sensitive to the MIMO channel dimension, especially the channel input dimension. To ensure good performance when the channel dimension grows large, in the third and fourth achievable schemes, the relay only transmits a compressed observation to the destination node. Numerical results show that with simple symbol-by-symbol oblivious relay processing and compression, the proposed achievable schemes work well, yielding lower bounds that come quite close to the upper bound over a wide range of relevant system parameters.
This work was supported by the Alexander von Humboldt Foundation and the European Union's Horizon 2020 Research and Innovation Programme with grant agreement No. 694630.
H. Xu, T. Yang, and G. Caire are with the Faculty of Electrical Engineering and Computer Science at the Technical University of Berlin, 10587 Berlin, Germany (e-mail: [email protected]; [email protected]; [email protected]).
S. Shamai (Shitz) is with the Viterbi Electrical Engineering Department, Technion–Israel Institute of Technology, Haifa 32000, Israel (e-mail: [email protected]).

I. INTRODUCTION
For a Markov chain X → Y → Z and an assigned joint probability distribution p_{X,Y}, consider the following information bottleneck (IB) problem

max_{p_{Z|Y}} I(X; Z)    (1a)
s.t. I(Y; Z) ≤ C,        (1b)

where C is the bottleneck constraint parameter and the optimization is with respect to the conditional probability distribution p_{Z|Y} of Z given Y. Formulation (1) was introduced by Tishby in [1], and has found remarkable applications in supervised and unsupervised learning problems such as classification, clustering, prediction, etc. [2]–[7]. From a more fundamental information theoretic viewpoint, the IB arises from the classical remote source coding problem [8]–[10] under logarithmic distortion [11].

An interesting application of the IB problem in communications consists of a source node, an oblivious relay, and a destination node, which is connected to the relay via an error-free link with capacity C. The source node sends codewords over a communication channel and an observation is made at the relay. X and Y are respectively the channel input from the source node and the output at the relay. The relay is oblivious in the sense that it cannot itself decode the information message of the source node. This feature can be modeled rigorously by assuming that the source and destination nodes make use of a codebook selected at random over a library, while the relay is unaware of such random selection. For example, in a cloud radio access network (C-RAN), each remote radio head acts as a relay and is usually constrained to implement only radio functionalities, while the baseband functionalities are migrated to the central processor [12]. Considering the relatively simple structure of the remote radio heads, it is usually prohibitive to let them know the codebooks and random encoding operations, particularly as the network size gets large.
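For a discrete joint pmf, problem (1) can be solved numerically with the self-consistent alternating equations of [1]. The sketch below is a minimal illustration in our own (hypothetical) code, not taken from the paper: it uses a Lagrange multiplier beta that trades compression against relevance, works in bits (logs base 2), and also exhibits the data-processing relation I(X; Z) ≤ I(Y; Z) that motivates constraint (1b).

```python
import numpy as np

def ib_encoder(p_xy, beta=5.0, n_z=2, n_iter=300, seed=0):
    """Self-consistent IB iteration for a discrete joint pmf p_xy of shape
    (|X|, |Y|); returns (I(X;Z), I(Y;Z)) for the converged encoder p(z|y)."""
    rng = np.random.default_rng(seed)
    eps = 1e-12
    p_y = p_xy.sum(axis=0)                        # marginal p(y)
    p_x_y = p_xy / p_y                            # columns are p(x|y)
    q = rng.random((n_z, p_xy.shape[1]))
    p_z_y = q / q.sum(axis=0)                     # random soft encoder p(z|y)
    for _ in range(n_iter):
        p_z = p_z_y @ p_y                         # marginal p(z)
        p_x_z = ((p_z_y * p_y) @ p_x_y.T).T / (p_z + eps)   # Bayes decoder p(x|z)
        # KL(p(x|y) || p(x|z)) for every pair (z, y)
        t1 = (p_x_y * np.log(p_x_y + eps)).sum(axis=0)
        kl = t1[None, :] - np.log(p_x_z.T + eps) @ p_x_y
        w = p_z[:, None] * np.exp(-beta * kl)     # encoder update of [1]
        p_z_y = w / w.sum(axis=0)

    def mi(pj):                                   # mutual information of a joint pmf
        pa, pb = pj.sum(axis=1, keepdims=True), pj.sum(axis=0, keepdims=True)
        return float((pj * np.log2((pj + eps) / (pa @ pb + eps))).sum())

    p_z = p_z_y @ p_y
    p_x_z = ((p_z_y * p_y) @ p_x_y.T).T / (p_z + eps)
    return mi(p_x_z * p_z), mi((p_z_y * p_y).T)
```

Since Z is generated from Y alone, any encoder found this way automatically respects the Markov chain X → Y → Z, so I(X; Z) can never exceed I(Y; Z) or I(X; Y).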
The fact that the relay cannot decode is also supported by secrecy demands, which means that the codebooks known to the source and destination nodes are to be considered absolutely random, as done here. Due to the oblivious feature, relaying strategies which require the codebooks to be known at the relay, e.g., decode-and-forward, compute-and-forward, etc. [13]–[15], cannot be applied. Instead, the relay has to perform oblivious processing, i.e., employ strategies in the form of compress-and-forward [16]–[19]. In particular, the relay must treat X as a random process with a distribution induced by the random selection over the codebook library (see [12] and references therein), and has to produce some useful representation Z by simple signal processing and convey it to the destination node subject to the link constraint C. Then, it makes sense to find Z such that I(X; Z) is maximized.

The IB problem for this kind of communication scenario has been studied in [20]–[26], [12]. In [20], the IB method was applied to reduce the fronthaul data rate of a C-RAN network. References [21] and [22] respectively considered Gaussian scalar and vector channels with an IB constraint, and investigated the optimal trade-off between the compression rate and the relevant information. In [23], the bottleneck rate of a frequency-selective scalar Gaussian primitive diamond relay channel was examined. In [24] and [25], the rate-distortion region of a vector Gaussian system with multiple relays was characterized under the logarithmic loss distortion measure. Reference [12] further extended the work in [25] to a C-RAN network with multiple transmitters and multiple relays, and studied the capacity region of this network. However, all of references [20]–[25] and [12] considered block fading channels. Hence, the perfect channel state information (CSI) was known at both the relay and the destination node. In [26], we studied the IB problem of a scalar Rayleigh fading channel.
Due to the bottleneck constraint, it is impossible to inform the destination node of the perfect CSI in each channel realization. An upper bound and two achievable schemes were provided in [26] to investigate the bottleneck rate.

In this paper, we extend the work in [26] to the multiple-input multiple-output (MIMO) channel with independent and identically distributed (i.i.d.) Rayleigh fading. To evaluate the bottleneck rate, we first obtain an upper bound by assuming that the perfect CSI is known at both the relay and the destination node at no cost. Then, we provide four achievable schemes, each of which satisfies the bottleneck constraint and gives a lower bound to the bottleneck rate. Intuitively, the relay can split the capacity of the relay-destination link into two parts and convey both the CSI and its observation to the destination node. Hence, in the first and second achievable schemes, the relay transmits compressed CSI and observation to the destination node. Specifically, in the first scheme, the relay simply compresses the channel matrix as well as its observation and then forwards them to the destination node. Roughly speaking, this is what happens today in a 'naive' implementation of remote radio head systems. Therefore, this scheme can be seen as a baseline scheme. However, the capacity allocated for conveying the CSI to the destination in this scheme is proportional to both the channel input dimension and the number of antennas at the relay. To reduce the channel use required for CSI transmission, in the second achievable scheme, the relay first gets an estimate of the channel input using channel inversion and then transmits the quantized noise levels as well as the compressed noisy signal to the destination node.
In contrast to the first scheme, the capacity allocated to CSI transmission in this scheme is only proportional to the channel input dimension.

Due to the explicit CSI transmission through the bottleneck, the performance of the first and second achievable schemes is sensitive to the MIMO channel dimension, especially the channel input dimension. To ensure good performance when the channel dimension grows large, in the third and fourth achievable schemes, the relay does not convey any CSI to the destination node. In the third scheme, the relay first estimates the channel input using channel inversion and then transmits a truncated representation of the estimate to the destination node. In the fourth scheme, the relay first produces the MMSE estimate of the channel input, and then source-encodes this estimate. Numerical results show that with simple symbol-by-symbol oblivious relay processing and compression, the lower bounds obtained by the proposed achievable schemes can come close to the upper bound over a wide range of relevant system parameters.

The rest of this paper is organized as follows. In Section II, a MIMO channel with Rayleigh fading is presented and the IB problem for this system is formulated. Section III provides an upper bound to the bottleneck rate. In Section IV, four achievable schemes are proposed, where each scheme satisfies the bottleneck constraint and gives a lower bound to the bottleneck rate. Numerical results are presented in Section V before conclusions in Section VI.

Throughout this paper, we use the following notations. R and C denote the real space and the complex space, respectively. Boldface upper (lower) case letters are used to denote matrices (vectors). I_K stands for the K × K identity matrix and 0 denotes the all-zero vector or matrix. Superscript (·)^H denotes the conjugate-transpose operation, E[·] denotes the expectation operation, and [·]⁺ ≜ max(·, 0).
⊗ and ⊙ respectively denote the Kronecker product and the Hadamard product.

II. PROBLEM FORMULATION
We consider a system with a source node, an oblivious relay, and a destination node as shown in Fig. 1. For convenience, we call the source-relay channel 'Channel 1' and the relay-destination channel 'Channel 2'. For Channel 1, we consider the following Gaussian MIMO channel with i.i.d. Rayleigh fading

y = Hx + n,    (2)
Fig. 1. Block diagram of the considered IB problem.

where x ∈ C^{K×1} and n ∈ C^{M×1} are respectively the zero-mean circularly symmetric complex Gaussian input and noise with covariance matrices I_K and σ² I_M, i.e., x ∼ CN(0, I_K) and n ∼ CN(0, σ² I_M). H ∈ C^{M×K} is a random matrix independent of both x and n, and the elements of H are i.i.d. zero-mean unit-variance complex Gaussian random variables, i.e., H ∼ CN(0, I_K ⊗ I_M). Let z denote a useful representation of y produced by the relay for the destination node. x → (y, H) → z thus forms a Markov chain. We assume that the relay node has a direct observation of the channel matrix H, while the destination node does not, since H is not subject to any distortion constraint, i.e., it is not needed at all (explicitly) at the destination. Then, we consider the following IB problem

max_{p(z | y, H)} I(x; z)    (3a)
s.t. I(y, H; z) ≤ C,         (3b)

where C is the bottleneck constraint, i.e., the link capacity of Channel 2. In this paper, we call I(x; z) the bottleneck rate and I(y, H; z) the compression rate. Obviously, for a joint probability distribution p(x, y, H) determined by (2), problem (3) is a slightly augmented version of IB problem (1). In our problem, we aim to find a conditional distribution p(z | y, H) such that bottleneck constraint (3b) is satisfied and the bottleneck rate is maximized, i.e., such that as much information about x as possible can be extracted from representation z.

III. INFORMED RECEIVER UPPER BOUND
As stated in [26], an obvious upper bound to problem (3) can be obtained by letting both the relay and the destination node know the channel matrix H. We call the bound in this case the informed receiver upper bound. The IB problem in this case takes on the following form

max_{p(z | y)} I(x; z | H)    (4a)
s.t. I(y; z | H) ≤ C.         (4b)

In reference [21], the IB problem for a scalar Gaussian channel with block fading has been studied. In the following theorem, we show that for the considered MIMO channel with Rayleigh fading, (4) can be decomposed into a set of parallel scalar IB problems, and the informed receiver upper bound can be obtained based on the result in [21].

Theorem 1.
For the considered MIMO channel with Rayleigh fading, the informed receiver upper bound, i.e., the optimal objective function of IB problem (4), is

R^ub = T ∫_{ν/ρ}^{∞} [log(1 + ρλ) − log(1 + ν)] f_λ(λ) dλ,    (5)

where T = min{K, M}, ρ = 1/σ², λ is distributed as the unordered positive eigenvalues of HH^H, its probability density function (pdf) f_λ(λ) is given in (103), and ν is chosen such that the following bottleneck constraint is met

∫_{ν/ρ}^{∞} log(ρλ/ν) f_λ(λ) dλ = C/T.    (6)
Proof:
See Appendix A. ∎
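Theorem 1 can be evaluated without the closed-form pdf in (103): a minimal Monte Carlo sketch (our own illustrative code, logs base 2, all function names hypothetical) replaces f_λ with the empirical distribution of the unordered positive eigenvalues of HH^H and finds the water level ν by bisection on constraint (6).

```python
import numpy as np

def informed_upper_bound(K, M, rho, C, n_samples=2000, seed=0):
    """Monte Carlo sketch of (5)-(6): f_lambda is replaced by the empirical
    distribution of the T = min(K, M) positive eigenvalues of H H^H, and the
    water level nu is found by bisection on constraint (6). Logs are base 2."""
    rng = np.random.default_rng(seed)
    T = min(K, M)
    lam = []
    for _ in range(n_samples):
        H = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
        lam.append(np.linalg.eigvalsh(H @ H.conj().T)[-T:])   # T positive eigenvalues
    lam = np.concatenate(lam)

    def lhs(nu):                       # left-hand side of constraint (6)
        return np.mean(np.maximum(np.log2(rho * lam / nu), 0.0))

    lo, hi = 1e-9, 1e9                 # lhs is decreasing in nu: geometric bisection
    for _ in range(200):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if lhs(mid) > C / T else (lo, mid)
    nu = np.sqrt(lo * hi)
    active = lam > nu / rho
    r_ub = T * np.mean((np.log2(1 + rho * lam) - np.log2(1 + nu)) * active)
    return r_ub, nu
```

Since log((1 + ρλ)/(1 + ν)) ≤ log(ρλ/ν) on the active set, the value returned never exceeds the budget C, matching the first statement of Lemma 1.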
Lemma 1.
When M → +∞ or ρ → +∞, upper bound R^ub tends asymptotically to C. When C → +∞, R^ub approaches the capacity of Channel 1, i.e.,

R^ub → I(x; y, H) = T ∫_0^∞ log(1 + ρλ) f_λ(λ) dλ.    (7)

Proof:
See Appendix B. ∎
IV. ACHIEVABLE SCHEMES
In this section, we provide four achievable schemes, where each scheme satisfies the bottleneck constraint and gives a lower bound to the bottleneck rate. In the first and second schemes, the relay transmits both its observation and partial CSI to the destination node. In the third and fourth schemes, to avoid transmitting CSI, the relay first estimates x and then sends a representation of the estimate to the destination node.

A. Non-decoding transmission (NDT) scheme
Our first achievable scheme assumes that, without decoding x, the relay simply source-encodes both y and H and then sends the encoded representations to the destination node. It should be noticed that this scheme is actually reminiscent of the current state of the art in remote antenna head technology, where both the pilot field (corresponding to H) and the data field (corresponding to y) are quantized and sent to the central processing unit.

Let h denote the vectorization of matrix H, and let z_1 and z_2 denote the representations of h and y, respectively. From the definition of H in (2), it is known that h ∼ CN(0, I_{KM}). Since the elements in h are i.i.d., in the best case, where I(h; z_1) is minimized for a given total distortion, representation z_1 introduces the same distortion to each element of h. Denote the distortion of each element quantization by D. It can then be readily verified by using [27, Theorem 10.3.3] that the rate distortion function of source h with total squared-error distortion KMD is given by

R(D) = min_{f(z_1 | h): E[d(h, z_1)] ≤ KMD} I(h; z_1) = KM log(1/D),    (8)

where 0 < D ≤ 1 and d(h, z_1) = (h − z_1)^H (h − z_1) is the squared-error distortion measure. Let e denote the error vector of quantizing h, i.e., e = h − z_1. z_1 and e are the vectorizations of Z and E. Hence, H = Z + E. Note that z_1 ∼ CN(0, (1 − D) I_{KM}), e ∼ CN(0, D I_{KM}), and z_1 is independent of e. Hence,

E[ZZ^H] = K(1 − D) I_M,  E[EE^H] = KD I_M.    (9)

In [27, Theorem 10.3.3], the achievability of an information rate for a given distortion, e.g., (8), is proven by considering a backward Gaussian test channel. However, the backward Gaussian test channel does not provide an expression of z_1 or e.
Though the specific formulations of z_1 and e are not necessary for the analysis in this section, since we are providing an achievable scheme, we still give a feasible z_1 which satisfies (8) here to make the content more complete. By adding an independent Gaussian noise vector r ∼ CN(0, ε I_{KM}), with ε = D/(1 − D), to h, we get

˜h = h + r.    (10)

Obviously, ˜h ∼ CN(0, (1/(1 − D)) I_{KM}). A representation of h can then be obtained as follows

z_1 = ˜h/(1 + ε) = h/(1 + ε) + r/(1 + ε) = (1 − D) h + (1 − D) r,    (11)

which is actually the MMSE estimate of h obtained from (10). The error vector is then given by

e = h − z_1 = D h − (1 − D) r.    (12)

It can be readily verified that z_1 provided in (11) satisfies (8), z_1 ∼ CN(0, (1 − D) I_{KM}), e ∼ CN(0, D I_{KM}), and z_1 is independent of e.

To meet the bottleneck constraint, we have to ensure that

I(h, y; z_1, z_2) ≤ C.    (13)

Using the chain rule of mutual information,

I(h, y; z_1, z_2) = I(h, y; z_1) + I(h, y; z_2 | z_1) = I(h; z_1) + I(y; z_1 | h) + I(y; z_2 | z_1) + I(h; z_2 | z_1, y).    (14)

Since z_1 is a representation of h, y and z_1 are conditionally independent given h. Similarly, since z_2 is a representation of y, h and z_2 are conditionally independent given y. Hence,

I(y; z_1 | h) = 0,  I(h; z_2 | z_1, y) = 0.    (15)

From (8), (14), and (15), it is known that to have constraint (13) guaranteed, I(y; z_2 | z_1), which is the information rate at which the relay quantizes y (given z_1), should satisfy

I(y; z_2 | z_1) ≤ C − R(D).    (16)

Obviously, C − R(D) > 0 has to be guaranteed, which yields D > 2^{−C/(KM)}. Hence, in this section, we always assume 2^{−C/(KM)} < D ≤ 1. We then evaluate I(y; z_2 | z_1). Since H = Z + E, y in (2) can be rewritten as

y = Hx + n = Zx + Ex + n.    (17)

For a given Z, the second moment of y is E[yy^H | Z] = ZZ^H + (KD + σ²) I_M. Denote the eigendecomposition of ZZ^H by ŨΩŨ^H and

˜y = Ũ^H y = Ũ^H Z x + Ũ^H E x + Ũ^H n.
(18)

The second moment of ˜y is E[˜y˜y^H | Z] = Ω + (KD + σ²) I_M. Since E is unknown, ˜y is not a Gaussian vector. To evaluate I(y; z_2 | z_1), we define a new Gaussian vector

y_g = Ũ^H Z x + n_g,    (19)

where n_g ∼ CN(0, (KD + σ²) I_M). For a given Z, y_g ∼ CN(0, Ω + (KD + σ²) I_M). The channel in (19) can thus be seen as a set of parallel sub-channels. Let z_g denote a representation of y_g and consider the following IB problem

max_{p(z_g | y_g)} I(x; z_g | Z)    (20a)
s.t. I(y_g; z_g | Z) ≤ C − R(D),    (20b)
     2^{−C/(KM)} < D ≤ 1.           (20c)

Obviously, for a given feasible D, problem (20) can be solved similarly to (4) by following the steps in Appendix A. We thus have the following theorem.

Theorem 2.
For a given feasible D, the optimal objective function of IB problem (20) is

R^lb_1 = T ∫_{ν/γ}^{∞} [log(1 + γλ) − log(1 + ν)] f_λ(λ) dλ,    (21)

where γ = (1 − D)/(KD + σ²), the pdf of λ, i.e., f_λ(λ), is given by (103), and ν is chosen such that the following bottleneck constraint is met

∫_{ν/γ}^{∞} log(γλ/ν) f_λ(λ) dλ = (C − R(D))/T.    (22)
Proof:
See Appendix C. ∎

Since for a given Z, (19) can be seen as a set of parallel scalar Gaussian sub-channels, according to [21, (16)], the representation of y_g, i.e., z_g, can be constructed by adding independent fading and Gaussian noise to each element of y_g. Denote

z_g = Ψ y_g + n'_g = Ψ Ũ^H Z x + Ψ n_g + n'_g,    (23)

where Ψ is a diagonal matrix with non-negative and real diagonal entries, and n'_g ∼ CN(0, I_M). Note that y_g in (19) and its representation z_g in (23) are only auxiliary variables. What we are really interested in is the representation of y and the corresponding bottleneck rate. Hence, we also add fading Ψ and Gaussian noise n'_g to ˜y in (18) and get the following representation

z_2 = Ψ ˜y + n'_g = Ψ Ũ^H Z x + Ψ Ũ^H E x + Ψ Ũ^H n + n'_g.    (24)

In the following lemma we show that by transmitting representations z_1 and z_2 to the destination node, R^lb_1 is an achievable lower bound to the bottleneck rate and the bottleneck constraint is satisfied.

Lemma 2.
If the representation of h, i.e., z_1, resulting from (8), is forwarded to the destination node for each channel realization, then with observations y and y_g in (17) and (19), and representations z_2 and z_g in (24) and (23), we have

I(y; z_2 | Z) ≤ I(y_g; z_g | Z),    (25)
I(x; z_2 | Z) ≥ I(x; z_g | Z),      (26)

where (25) indicates that I(y; z_2 | Z) ≤ C − R(D) and (26) gives I(x; z_2 | Z) ≥ R^lb_1.

Proof:
See Appendix D. ∎
Lemma 3.
When M → +∞,

R^lb_1 → T[log(1 + γM) − log(1 + γM · 2^{−(C−R(D))/T})].    (27)

When ρ → +∞, R^lb_1 tends to a constant which can be obtained by letting γ = (1 − D)/(KD) and using (21). In addition, when C → +∞, there exists a small D such that R^lb_1 approaches the capacity of Channel 1, i.e.,

R^lb_1 → I(x; y, H) = T ∫_0^∞ log(1 + ρλ) f_λ(λ) dλ.    (28)

Proof:
See Appendix E. ∎
B. Quantized channel inversion (QCI) scheme when K ≤ M

In our second scheme, the relay first gets an estimate of the channel input using channel inversion and then transmits the quantized noise levels as well as the compressed noisy signal to the destination node.

In particular, we apply the pseudo-inverse matrix of H, i.e., (H^H H)^{−1} H^H, to y, and get the zero-forcing estimate of x as follows

˜x = (H^H H)^{−1} H^H y = x + (H^H H)^{−1} H^H n ≜ x + ˜n.    (29)

For a given channel matrix H, ˜n ∼ CN(0, A), where A = σ²(H^H H)^{−1}. Let A = A_1 + A_2, where A_1 and A_2 respectively consist of the diagonal and off-diagonal elements of A, i.e., A_1 = A ⊙ I_K and A_2 = A − A_1. If H could be perfectly transmitted to the destination node, the bottleneck rate could be obtained by following steps similar to those in Appendix A. However, since H follows a non-degenerate continuous distribution and the bottleneck constraint is finite, as shown in the previous subsection, this is not possible. To reduce the number of bits per channel use required for informing the destination node of the channel information, we only convey a compressed version of A_1 and consider a set of independent scalar Gaussian sub-channels.

Specifically, we force each diagonal entry of A_1 to belong to a finite set of quantized levels by adding artificial noise, i.e., by introducing physical degradation. We fix a finite grid of J positive quantization points B = {b_1, ..., b_J}, where b_1 ≤ b_2 ≤ ... ≤ b_{J−1} < b_J = +∞, and define the following ceiling operation

⌈a⌉_B = arg min_{b ∈ B} {a ≤ b}.
(30)

Then, by adding a Gaussian noise vector ˜n' ∼ CN(0, diag{⌈a_1⌉_B − a_1, ..., ⌈a_K⌉_B − a_K}), which is independent of everything else, to (29), a degraded version of ˜x can be obtained as follows

ˆx = ˜x + ˜n' = x + ˜n + ˜n' ≜ x + ˆn,    (31)

where ˆn ∼ CN(0, A' + A_2) for a given H and A' ≜ diag{⌈a_1⌉_B, ..., ⌈a_K⌉_B}. Obviously, due to A_2, the elements in noise vector ˆn are correlated.

To evaluate the bottleneck rate, we consider a new variable

ˆx_g = x + ˆn_g,    (32)

where ˆn_g ∼ CN(0, A'). Obviously, (32) can be seen as K parallel scalar Gaussian sub-channels with noise power ⌈a_k⌉_B for each sub-channel. Since each quantized noise level ⌈a_k⌉_B only has J possible values, it is possible for the relay to inform the destination node of the channel information via the constrained link. Note that from the definition of A in (29), it is known that a_k, ∀ k ∈ K ≜ {1, ..., K}, are correlated. The quantized noise levels ⌈a_k⌉_B, ∀ k ∈ K, are thus also correlated. Hence, we can jointly source-encode ⌈a_k⌉_B, ∀ k ∈ K, to further reduce the number of bits used for CSI transmission. For convenience, we define a space Ξ = {(j_1, ..., j_K) | j_k ∈ J, ∀ k ∈ K}, where J = {1, ..., J}. It is obvious that there are a total of J^K points in this space. Let ξ = (j_1, ..., j_K) denote a point in space Ξ and define the following probability mass function (pmf)

P_ξ = Pr{⌈a_1⌉_B = b_{j_1}, ..., ⌈a_K⌉_B = b_{j_K}}.    (33)

The joint entropy of ⌈a_k⌉_B, ∀ k ∈ K, i.e., the number of bits used for jointly source-encoding ⌈a_k⌉_B, ∀ k ∈ K, is thus given by

H_joint = −Σ_{ξ ∈ Ξ} P_ξ log P_ξ.
(34)

Then, the IB problem for (32) takes on the following form

max_{p(ˆz_g | ˆx_g)} I(x; ˆz_g | A')        (35a)
s.t. I(ˆx_g; ˆz_g | A') ≤ C − H_joint,      (35b)

where ˆz_g is a representation of ˆx_g.

Note that, as stated above, there are a total of J^K points in space Ξ. The pmf P_ξ thus has J^K possible values, and it becomes difficult to obtain the joint entropy H_joint from (34) (even numerically) when J or K is large. To reduce the computational complexity, we consider the (slightly) suboptimal, but far more practical, entropy coding of each noise level ⌈a_k⌉_B separately, and get the following sum of individual entropies

H_sum^indi = Σ_{k=1}^{K} H_k,    (36)

where H_k denotes the entropy of ⌈a_k⌉_B, i.e., the number of bits used for informing the destination node of noise level ⌈a_k⌉_B. In Appendix F, we show that a_k, ∀ k ∈ K, are marginally identically inverse chi-squared distributed with M − K + 1 degrees of freedom, and their pdf is given in (130). Hence,

H_sum^indi = K H_1 = −K Σ_{j=1}^{J} P_j log P_j,    (37)

where P_j = Pr{⌈a_1⌉_B = b_j} can be obtained from (131) and a_1 follows the same distribution as a_k. Since P_j only has J possible values, the computational complexity of calculating H_sum^indi is proportional to J. Using the chain rule of entropy and the fact that conditioning reduces entropy, we know that H_joint ≤ H_sum^indi. In Section V, the gap between H_joint and H_sum^indi is investigated by simulation. Replacing H_joint in (35b) with H_sum^indi, we get the following IB problem

max_{p(ˆz_g | ˆx_g)} I(x; ˆz_g | A')      (38a)
s.t. I(ˆx_g; ˆz_g | A') ≤ C − K H_1.      (38b)

The optimal solution of this problem is given in the following theorem.

Theorem 3.
If A' is conveyed to the destination node for each channel realization, the optimal objective function of IB problem (38) is

R^lb_2 = Σ_{j=1}^{J−1} K P_j [log(1 + ρ_j) − log(1 + ρ_j 2^{−c_j})],    (39)

where ρ_j = 1/b_j, c_j = [log(ρ_j/ν)]⁺, and ν is chosen such that the following bottleneck constraint is met

Σ_{j=1}^{J−1} K P_j c_j = C − K H_1.    (40)

Proof:
See Appendix F. ∎
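Evaluating Theorem 3 only requires finding the water level ν in (40) once the level pmf P_j and ρ_j = 1/b_j are known; a small sketch in our own (hypothetical) code, with logs base 2 and the entropy budget C − K H_1 passed in as a single number:

```python
import numpy as np

def qci_rate(rho, P, budget, K):
    """Evaluate Theorem 3: find nu such that sum_j K P_j [log2(rho_j/nu)]^+
    meets the budget C - K*H_1 (constraint (40), by bisection), then return
    R_lb2 from (39). rho: inverse levels 1/b_j; P: pmf of the levels."""
    rho, P = np.asarray(rho, float), np.asarray(P, float)

    def used(nu):                        # left-hand side of (40)
        return float(np.sum(K * P * np.maximum(np.log2(rho / nu), 0.0)))

    lo, hi = 1e-12, 1e12                 # used(nu) is decreasing in nu: bisect
    for _ in range(200):
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if used(mid) > budget else (lo, mid)
    nu = np.sqrt(lo * hi)
    c = np.maximum(np.log2(rho / nu), 0.0)          # c_j = [log2(rho_j/nu)]^+
    r_lb2 = float(np.sum(K * P * (np.log2(1 + rho)
                                  - np.log2(1 + rho * 2.0 ** (-c)))))
    return r_lb2, nu
```

For active levels (ρ_j > ν) one has ρ_j 2^{−c_j} = ν, so each active term contributes log((1 + ρ_j)/(1 + ν)), mirroring the water-filling form of (5).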
Since (32) can be seen as K parallel scalar Gaussian sub-channels, according to [21, (16)], the representation of ˆx_g, i.e., ˆz_g, can be constructed by adding independent fading and Gaussian noise to each element of ˆx_g. Denote

ˆz_g = Φ ˆx_g + ˆn'_g = Φx + Φˆn_g + ˆn'_g,    (41)

where Φ is a diagonal matrix with positive and real diagonal entries, and ˆn'_g ∼ CN(0, I_K). Note that, similar to y_g and z_g in the previous subsection, ˆx_g in (32) and its representation ˆz_g in (41) are also auxiliary variables. What we are really interested in is the representation of ˆx and the corresponding bottleneck rate. Hence, we also add fading Φ and Gaussian noise ˆn'_g to ˆx in (31) and get its representation as follows

z = Φ ˆx + ˆn'_g = Φx + Φˆn + ˆn'_g.    (42)

In the following lemma we show that by transmitting quantized noise levels ⌈a_k⌉_B, ∀ k ∈ K, and representation z to the destination node, R^lb_2 is an achievable lower bound to the bottleneck rate and the bottleneck constraint is satisfied.

Lemma 4. If A' is forwarded to the destination node for each channel realization, with signal vectors ˆx and ˆx_g in (31) and (32), and their representations z and ˆz_g in (42) and (41), we have

I(ˆx; z | A') ≤ I(ˆx_g; ˆz_g | A'),    (43)
I(x; z | A') ≥ I(x; ˆz_g | A'),        (44)

where (43) indicates that I(ˆx; z | A') ≤ C − K H_1 and (44) gives I(x; z | A') ≥ R^lb_2.

Proof:
See Appendix G. ∎
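The gap between H_joint in (34) and H_sum^indi in (37), which is investigated by simulation in Section V, can be estimated with a few lines of Monte Carlo. The sketch below is our own (hypothetical) code: entropies are in bits, and since the a_k are identically distributed, the samples are pooled across k to estimate the common marginal entropy H_1.

```python
import numpy as np
from collections import Counter

def level_entropies(K, M, sigma2, grid, n_samples=4000, seed=0):
    """Estimate H_joint (34) and K*H_1 (37) for the QCI scheme: each diagonal
    entry a_k of sigma2*(H^H H)^{-1} is mapped by the ceiling (30) onto the
    grid b_1 <= ... <= b_{J-1} < b_J = +inf (index J-1 is the +inf level)."""
    rng = np.random.default_rng(seed)
    joint, single = Counter(), Counter()
    for _ in range(n_samples):
        H = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
        a = sigma2 * np.real(np.diag(np.linalg.inv(H.conj().T @ H)))
        idx = tuple(int(np.searchsorted(grid, v)) for v in a)  # ceiling indices
        joint[idx] += 1
        for j in idx:
            single[j] += 1

    def entropy(counts, total):
        p = np.array(list(counts.values())) / total
        return float(-(p * np.log2(p)).sum())

    return entropy(joint, n_samples), K * entropy(single, n_samples * K)
```

By the chain rule of entropy, the joint estimate can never exceed the sum of the individual ones, and the sum itself is at most K log2 J bits.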
Lemma 5.
When M → +∞ or ρ → +∞, we can always find suitable quantization points B = {b_1, ..., b_J} such that R^lb_2 → C. When C → +∞,

R^lb_2 → K E[log(1 + 1/a_1)] ≤ I(x; y, H),    (45)

where the expectation can be calculated by using the pdf of a_1 in (130) and I(x; y, H) is the capacity of Channel 1.

Proof:
See Appendix H. ∎
For the sake of simplicity, we may choose the quantization levels as quantiles such that we obtain the uniform pmf P_j = 1/J. The lower bound (39) can thus be simplified as

R^lb_2 = Σ_{j=1}^{J−1} (K/J) [log(1 + ρ_j) − log(1 + ρ_j 2^{−c_j})],    (46)

and the bottleneck constraint (40) becomes

Σ_{j=1}^{J−1} [log(ρ_j/ν)]⁺ = JC/K − JB,    (47)

where B = log J can be seen as the number of bits required for quantizing each diagonal entry of A_1. Since ρ_1 ≥ ... ≥ ρ_{J−1}, from the strict convexity of the problem, we know that there must exist a unique integer 1 ≤ l ≤ J − 1 such that [28]

Σ_{j=1}^{l} log(ρ_j/ν) = JC/K − JB,  ρ_j ≤ ν, ∀ l + 1 ≤ j ≤ J − 1.    (48)

Hence, ν can be obtained from

log ν = (1/l) Σ_{j=1}^{l} log ρ_j − JC/(lK) + JB/l,    (49)

and R^lb_2 can be calculated as follows

R^lb_2 = Σ_{j=1}^{l} (K/J) [log(1 + ρ_j) − log(1 + ν)].    (50)

Then, we only need to test the above condition for l = 1, 2, 3, ... till (48) is satisfied. Note that to ensure R^lb_2 > 0, JC/K − JB in (47) has to be positive, i.e., B < C/K. Moreover, though choosing the quantization levels as quantiles makes it easier to calculate R^lb_2, the results in Lemma 5 may not hold in this case since the choice of quantization points B = {b_1, ..., b_J} is restricted.

C. Truncated channel inversion (TCI) scheme when K ≤ M

Both the NDT and QCI schemes proposed in the preceding two subsections require that the relay transmit partial CSI to the destination node. Specifically, in the NDT scheme, channel matrix H is compressed and conveyed to the destination node. Hence, the channel use required for transmitting compressed H is proportional to K and M. In contrast, the number of bits required for transmitting quantized noise levels in the QCI scheme is proportional to K and B. Due to the bottleneck constraint, the performance of the NDT and QCI schemes is thus sensitive to the MIMO channel dimension, especially K. To ensure good performance when the channel dimension is large, in this subsection, the relay first estimates x using channel inversion and then transmits a truncated representation of the estimate to the destination node.

In particular, as in the previous subsection, we first get the zero-forcing estimate of x using channel inversion, i.e.,

˜x = (H^H H)^{−1} H^H y = x + (H^H H)^{−1} H^H n.    (51)

As given in Appendix A, the unordered eigenvalues of H^H H are λ_k, ∀ k ∈ K. Let λ_min = min{λ_k, ∀ k ∈ K}. Note that though the interfering terms can be nulled out by the zero-forcing equalizer, the noise may be greatly amplified when the channel is noisy. Therefore, we put a threshold λ_th on λ_min such that zero capacity is allocated for states with λ_min < λ_th. Specifically, when λ_min < λ_th, the relay does not transmit the observation, while when λ_min ≥ λ_th, the relay takes ˜x as the new observation and transmits a compressed version of ˜x to the destination node.
The information about whether the observation is transmitted or not is encoded into a 0–1 sequence and is also sent to the destination node. Then, we need to solve the source coding problem at the relay, i.e., encoding blocks of ˜x when λ_min ≥ λ_th. For convenience, we use ∆ to denote the event 'λ_min ≥ λ_th'. Here we choose p(z | ˜x, ∆) to be a conditionally Gaussian distribution, i.e.,

z = ˜x + q, if ∆;  ∅, otherwise,    (52)

where q ∼ CN(0, D I_K) is independent of the other variables. It can be easily found from (52) that I(x; z | λ_min < λ_th) = 0 and I(˜x; z | λ_min < λ_th) = 0. Hence, we consider the following modified IB problem

max_D P_th I(x; z | ∆)             (53a)
s.t. P_th I(˜x; z | ∆) ≤ C − H_th,  (53b)

where P_th = Pr{∆} and H_th is the binary entropy function with parameter P_th. Since we assume K ≤ M in this subsection, as stated in Appendix A, H^H H ∼ CW_K(M, I_K). Then, according to [29, Proposition 2.6] and [29, Proposition 4.7], P_th is given by

P_th = det(ψ) / (∏_{k=1}^{K} (M − k)! ∏_{k=1}^{K} (K − k)!),    (54)

where ψ is the K × K Hankel matrix ψ = [(ψ_{i+j−1})] with entries

ψ_{i+j−1} = ∫_{λ_th}^{∞} µ^{M−K+i+j−2} e^{−µ} dµ.    (55)

When K = M, using [30, Theorem 3.2], a more concise expression of P_th can be obtained as follows

P_th = ∫_{2λ_th}^{∞} (K/2) e^{−µK/2} dµ = e^{−λ_th K}.    (56)

Note that in (56), the lower bound of the integral is 2λ_th rather than λ_th. This is because in this paper, the elements of H are assumed to be i.i.d. zero-mean unit-variance complex Gaussian random variables, while in [30], the real and imaginary parts of the elements in H are independent standard normal variables. Given condition ∆, let ˜x_g denote a zero-mean circularly symmetric complex Gaussian random vector with the same second moment as ˜x, i.e., ˜x_g ∼ CN(0, E[˜x˜x^H | ∆]), and ˜z_g = ˜x_g + q. The squared-error distortion between ˜z_g and ˜x_g is E[q^H q] = KD.
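The closed form (56) for the square case K = M says that λ_min of HH^H is exponentially distributed with rate K, which is easy to check against simulation; a minimal Monte Carlo sketch in our own (hypothetical) code:

```python
import numpy as np

def min_eig_tail(K, lam_th, n_samples=4000, seed=0):
    """Monte Carlo estimate of P_th = Pr{lambda_min >= lam_th} for K = M,
    where lambda_min is the smallest eigenvalue of H H^H and H has i.i.d.
    CN(0,1) entries; (56) predicts exp(-K * lam_th)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_samples):
        H = (rng.standard_normal((K, K)) + 1j * rng.standard_normal((K, K))) / np.sqrt(2)
        lam_min = np.linalg.eigvalsh(H @ H.conj().T)[0]   # eigvalsh is ascending
        hits += int(lam_min >= lam_th)
    return hits / n_samples
```

For K ≠ M the same estimator can be checked against the determinantal expression (54)-(55) instead.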
The rate-distortion pair $(P_{th} I(\tilde{x}_g; \tilde{z}_g|\Delta), KD)$ is then achievable if $P_{th} I(\tilde{x}_g; \tilde{z}_g|\Delta) \leq C - H_{th}$ [27, Theorem 10.4.1]. Hence, let

$P_{th} I(\tilde{x}_g; \tilde{z}_g|\Delta) = P_{th} \log\det\left(I_K + \frac{1}{D} \mathbb{E}[\tilde{x}\tilde{x}^H|\Delta]\right) = C - H_{th}$. (57)

To calculate $D$ from (57), we denote the eigendecomposition of $H^H H$ by $V\tilde{\Lambda}V^H$, where $V$ is a unitary matrix whose columns are the eigenvectors of $H^H H$, $\tilde{\Lambda}$ is a diagonal matrix whose diagonal elements are the unordered eigenvalues $\lambda_k, \forall k \in \mathcal{K}$, and $V$ and $\tilde{\Lambda}$ are independent. Then, from (51),

$\mathbb{E}[\tilde{x}\tilde{x}^H|\Delta] = I_K + \sigma^2 \mathbb{E}[(H^H H)^{-1}|\Delta] = I_K + \sigma^2 \mathbb{E}[V\tilde{\Lambda}^{-1}V^H|\Delta] = I_K + \sigma^2 \mathbb{E}\left[\tfrac{1}{\lambda}\,\big|\,\Delta\right] I_K$. (58)

Based on [31], the joint pdf of the unordered eigenvalues $\lambda_k, \forall k \in \mathcal{K}$ under condition $\Delta$ is given by

$f(\lambda_1, \cdots, \lambda_K|\Delta) = \dfrac{1}{P_{th} K!} \prod_{i=1}^K \dfrac{e^{-\lambda_i} \lambda_i^{M-K}}{(K-i)!(M-i)!} \prod_{i<j} (\lambda_i - \lambda_j)^2, \quad \lambda_k \geq \lambda_{th}, \forall k \in \mathcal{K}$, (59)

from which the conditional marginal pdf of $\lambda$ in (60), the expectation $\mathbb{E}[1/\lambda|\Delta]$ in (61), and then $D$ in (62) follow. Note that we show in Appendix I that when $K = M$ and $\lambda_{th} = 0$, the integral in (61) diverges; $\mathbb{E}[1/\lambda|\Delta]$ thus does not exist in this case. Therefore, unless otherwise stated, the results derived in this subsection are for the cases with $K = M$ and $\lambda_{th} > 0$, or with $K < M$ and $\lambda_{th} \geq 0$.

With (57), the rate-distortion pair $(P_{th} I(\tilde{x}_g; \tilde{z}_g|\Delta), KD)$ is achievable. Since a Gaussian input maximizes the mutual information of a Gaussian additive noise channel, we have $I(\tilde{x}; z|\Delta) \leq I(\tilde{x}_g; \tilde{z}_g|\Delta)$. Note that the distortion associated with $z$ and $\tilde{x}$ under condition $\Delta$ is also $KD$. The rate-distortion pair $(P_{th} I(\tilde{x}; z|\Delta), KD)$ is thus achievable.

The next step is to evaluate the resulting achievable bottleneck rate, i.e., $I(x; z)$. To this end, we first obtain the following lower bound to $I(x; z|\Delta)$ from the fact that conditioning reduces differential entropy:

$I(x; z|\Delta) = h(z|\Delta) - h(z|x, \Delta) \geq h(z|H, \Delta) - h(z|x, \Delta)$.
(63)

Then, we evaluate the differential entropies $h(z|H, \Delta)$ and $h(z|x, \Delta)$, respectively. From (51) and (52), it is known that $z$ is conditionally Gaussian given $H$ and $\Delta$. Hence,

$h(z|H, \Delta) = \mathbb{E}[\log (\pi e)^K \det(I_K + \sigma^2 (H^H H)^{-1} + D I_K)|\Delta] = \mathbb{E}[\log (\pi e)^K \det(I_K + \sigma^2 \tilde{\Lambda}^{-1} + D I_K)|\Delta] = K \, \mathbb{E}\left[\log \pi e \left(1 + D + \tfrac{\sigma^2}{\lambda}\right)\big|\Delta\right]$. (64)

On the other hand, using the fact that the Gaussian distribution maximizes the entropy over all distributions with the same variance [27, Theorem 8.6.5], we have

$h(z|x, \Delta) = h(z - x|\Delta) = h((H^H H)^{-1} H^H n + q|\Delta) \leq \log (\pi e)^K \det(\sigma^2 \mathbb{E}[(H^H H)^{-1}|\Delta] + D I_K) = K \log \pi e \left(D + \sigma^2 \mathbb{E}\left[\tfrac{1}{\lambda}\big|\Delta\right]\right)$. (65)

Substituting (64) and (65) into (63), we get a lower bound to $I(x; z)$ as shown in the following theorem.

Theorem 4. When $K \leq M$, with truncated channel inversion, a lower bound to $I(x; z)$ can be obtained as follows

$R_{lb3} = P_{th} K \, \mathbb{E}\left[\log\left(1 + D + \tfrac{\sigma^2}{\lambda}\right)\big|\Delta\right] - P_{th} K \log\left(D + \sigma^2 \mathbb{E}\left[\tfrac{1}{\lambda}\big|\Delta\right]\right)$, (66)

where $P_{th}$ and $D$ are respectively given in (54) and (62), and the expectations can be calculated using pdf (60).

Lemma 6. Using Jensen's inequality on the convex function $\log(1 + 1/x)$ and the concave function $\log x$, we can get a lower bound to $R_{lb3}$, i.e.,

$\check{R}_{lb3} = P_{th} K \log\left(1 + D + \tfrac{\sigma^2}{\mathbb{E}[\lambda|\Delta]}\right) - P_{th} K \log\left(D + \sigma^2 \mathbb{E}\left[\tfrac{1}{\lambda}\big|\Delta\right]\right)$, (67)

and an upper bound to $R_{lb3}$, i.e.,

$\hat{R}_{lb3} = P_{th} K \log\left(1 + D + \sigma^2 \mathbb{E}\left[\tfrac{1}{\lambda}\big|\Delta\right]\right) - P_{th} K \log\left(D + \sigma^2 \mathbb{E}\left[\tfrac{1}{\lambda}\big|\Delta\right]\right)$. (68)

Remark 2. Obviously, $\check{R}_{lb3}$ is also a lower bound to $I(x; z)$. As for $\hat{R}_{lb3}$, it is not an upper bound to $I(x; z)$, since it is derived from the lower bound $R_{lb3}$. However, we can assess how tight the lower bounds $R_{lb3}$ and $\check{R}_{lb3}$ are by comparing them with $\hat{R}_{lb3}$.

Lemma 7.
When $M \to +\infty$, $R_{lb3}$, $\check{R}_{lb3}$, and $\hat{R}_{lb3}$ all tend asymptotically to $C$. When $\rho \to +\infty$, $R_{lb3}$, $\check{R}_{lb3}$, and $\hat{R}_{lb3}$ all tend asymptotically to $C - H_{th}$. In addition, when $C \to +\infty$, $R_{lb3}$, $\check{R}_{lb3}$, and $\hat{R}_{lb3}$ all approach constants, which can be respectively obtained by setting $D = 0$ in (66), (67), and (68).

Proof: See Appendix J.

When $K < M$ and $\lambda_{th} = 0$, it is obvious that $P_{th} = 1$, $H_{th} = 0$, and $\mathbb{E}[\lambda] = M$. Since $H^H H \sim \mathcal{CW}_K(M, I_K)$, $(H^H H)^{-1}$ follows a complex inverse Wishart distribution. Hence, $\mathbb{E}[1/\lambda] = \frac{1}{M-K}$. Then, from Theorem 4 and Lemma 6, we have the following lemma.

Lemma 8. When $K < M$ and $\lambda_{th} = 0$,

$R_{lb3} = K \, \mathbb{E}\left[\log\left(1 + D + \tfrac{\sigma^2}{\lambda}\right)\right] - K \log\left(D + \tfrac{\sigma^2}{M-K}\right)$, (69)

$\check{R}_{lb3} = K \log\left(1 + D + \tfrac{\sigma^2}{M}\right) - K \log\left(D + \tfrac{\sigma^2}{M-K}\right)$, (70)

and

$\hat{R}_{lb3} = K \log\left(1 + D + \tfrac{\sigma^2}{M-K}\right) - K \log\left(D + \tfrac{\sigma^2}{M-K}\right)$, (71)

where

$D = \dfrac{1 + \frac{\sigma^2}{M-K}}{2^{C/K} - 1}$. (72)

Remark 3. When $K < M$, $\lambda_{th} = 0$, and $\frac{\sigma^2}{M-K}$ is small (e.g., when $\rho$ is large, i.e., $\sigma^2$ is small, or when $M - K$ is large), $\hat{R}_{lb3} - \check{R}_{lb3} \approx 0$. In this case, $\check{R}_{lb3}$ is close to $\hat{R}_{lb3}$, and thus also close to $R_{lb3}$. We can then use $\check{R}_{lb3}$ instead of $R_{lb3}$ to lower bound $I(x; z)$, since it has a more concise expression.

D. MMSE estimate at the relay

In this subsection, we assume that the relay first produces the MMSE estimate of $x$ given $(y, H)$, and then source-encodes this estimate. Denote

$F = (HH^H + \sigma^2 I_M)^{-1} H$. (73)

The MMSE estimate of $x$ is thus given by

$\bar{x} = F^H y = F^H H x + F^H n$. (74)

Then, we consider the following modified IB problem

$\max_{p(z|\bar{x})} \ I(x; z)$ (75a)
s.t. $I(\bar{x}; z) \leq C$.
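As a quick numerical check of the estimator in (73)-(74), the sketch below (illustrative dimensions only) verifies the standard "push-through" identity $(HH^H + \sigma^2 I_M)^{-1}H = H(H^H H + \sigma^2 I_K)^{-1}$, a linear-algebra fact we use here merely as a sanity test, and confirms that the MMSE estimate attains a smaller average squared error than the zero-forcing estimate of (51):

```python
import numpy as np

def mmse_matrix(H, sigma2):
    """F = (H H^H + sigma2 I_M)^{-1} H, as in (73)."""
    M = H.shape[0]
    return np.linalg.solve(H @ H.conj().T + sigma2 * np.eye(M), H)

rng = np.random.default_rng(0)
K, M, sigma2, trials = 4, 8, 1.0, 500
mse_mmse = mse_zf = 0.0
for _ in range(trials):
    H = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
    x = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
    n = np.sqrt(sigma2) * (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)
    y = H @ x + n

    F = mmse_matrix(H, sigma2)
    # push-through identity: (HH^H + s I_M)^{-1} H == H (H^H H + s I_K)^{-1}
    assert np.allclose(F, H @ np.linalg.inv(H.conj().T @ H + sigma2 * np.eye(K)))

    x_bar = F.conj().T @ y                                  # MMSE estimate (74)
    x_zf = np.linalg.solve(H.conj().T @ H, H.conj().T @ y)  # ZF estimate (51)
    mse_mmse += np.sum(np.abs(x_bar - x) ** 2) / trials
    mse_zf += np.sum(np.abs(x_zf - x) ** 2) / trials

print(mse_mmse, mse_zf)  # MMSE error should be the smaller of the two
```

The gap between the two errors is largest at low SNR, which is consistent with the numerical comparison of the TCI and MMSE schemes reported in Section V.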
(75b)

Note that since the matrix $HH^H + \sigma^2 I_M$ in (73) is always invertible, the results obtained in this subsection hold no matter whether $K \leq M$ or $K > M$. Analogous to the previous subsection, we define

$z = \bar{x} + q$, $\quad \bar{x}_g \sim \mathcal{CN}(0, \mathbb{E}[\bar{x}\bar{x}^H])$, $\quad \bar{z}_g = \bar{x}_g + q$, (76)

where $q$ has the same definition as in (52), and

$\mathbb{E}[\bar{x}\bar{x}^H] = \mathbb{E}[F^H H H^H F + \sigma^2 F^H F]$. (77)

Let

$I(\bar{x}_g; \bar{z}_g) = \log\det\left(I_K + \dfrac{\mathbb{E}[\bar{x}\bar{x}^H]}{D}\right) = C$. (78)

Then, $(I(\bar{x}_g; \bar{z}_g), KD)$ is achievable and $D$ can be calculated from (78). Since $I(\bar{x}; z) \leq I(\bar{x}_g; \bar{z}_g)$, the rate-distortion pair $(I(\bar{x}; z), KD)$ is achievable.

In the following, we obtain a lower bound to $I(x; z)$ by evaluating $h(z|H)$ and $h(z|x)$ separately, and then using

$I(x; z) = h(z) - h(z|x) \geq h(z|H) - h(z|x)$. (79)

First, since $z$ is conditionally Gaussian given $H$, we have

$h(z|H) = \mathbb{E}[\log (\pi e)^K \det(F^H H H^H F + \sigma^2 F^H F + D I_K)]$. (80)

Next, based on the facts that conditioning reduces differential entropy and that the Gaussian distribution maximizes the entropy over all distributions with the same variance [32], we have

$h(z|x) = h(z - \mathbb{E}(z|x)|x) = h((F^H H - \mathbb{E}[F^H H])x + F^H n + q \,|\, x) \leq h((F^H H - \mathbb{E}[F^H H])x + F^H n + q) \leq \log (\pi e)^K \det(G)$, (81)

where

$G = \mathbb{E}[(F^H H - \mathbb{E}[F^H H])(H^H F - \mathbb{E}[H^H F]) + \sigma^2 F^H F] + D I_K = \mathbb{E}[F^H H H^H F] - \mathbb{E}[F^H H]\mathbb{E}[H^H F] + \sigma^2 \mathbb{E}[F^H F] + D I_K$. (82)

Combining (79), (80), and (81), we get a lower bound to $I(x; z)$ as shown in the following theorem.

Theorem 5.
With the MMSE estimate at the relay, a lower bound to $I(x; z)$ can be obtained as follows

$R_{lb4} = T \, \mathbb{E}\left[\log\left(\tfrac{\lambda}{\lambda + \sigma^2} + D\right)\right] + (K - T) \log D - K \log\left\{\tfrac{T}{K} \mathbb{E}\left[\tfrac{\lambda}{\lambda + \sigma^2}\right] - \tfrac{T^2}{K^2}\left(\mathbb{E}\left[\tfrac{\lambda}{\lambda + \sigma^2}\right]\right)^2 + D\right\}$, (83)

where

$D = \dfrac{\tfrac{T}{K} \mathbb{E}\left[\tfrac{\lambda}{\lambda + \sigma^2}\right]}{2^{C/K} - 1}$, (84)

and the expectations can be calculated using the pdf of $\lambda$ in (103).

Proof: See Appendix K.

Lemma 9. When $M \to +\infty$, or when $K \leq M$ and $\rho \to +\infty$, the lower bound $R_{lb4}$ tends asymptotically to $C$. When $K \leq M$ and $C \to +\infty$,

$R_{lb4} \to K \, \mathbb{E}\left[\log\left(\tfrac{\lambda}{\lambda + \sigma^2}\right)\right] - K \log\left\{\mathbb{E}\left[\tfrac{\lambda}{\lambda + \sigma^2}\right] - \left(\mathbb{E}\left[\tfrac{\lambda}{\lambda + \sigma^2}\right]\right)^2\right\}$. (85)

Proof: See Appendix L.

Lemma 10. When $C \to +\infty$, the NDT scheme outperforms the other three schemes, i.e.,

$R_{lb1} \geq \max\{R_{lb2}, R_{lb3}, R_{lb4}\}$. (86)

Proof: See Appendix M.

Remark 4. Besides the proof in Appendix M, we can also explain Lemma 10 from a more intuitive perspective. When $C \to +\infty$, the destination node can get perfect $y$ and $H$ from the relay by using the NDT scheme; the bottleneck rate is thus determined by the capacity of Channel 1. In the QCI scheme, though the destination node can get the perfect signal vector and noise power of each channel, the correlation between the elements of the noise vector is neglected since the off-diagonal entries of $A$ are not considered. The bottleneck rate obtained by the QCI scheme is thus upper bounded by the capacity of Channel 1. As for the TCI and MMSE schemes, the destination node can get perfect $\tilde{x}$ or $\bar{x}$ from the relay. However, the bottleneck rate in these two cases is not only affected by the capacity of Channel 1, but is also limited by the performance of the zero-forcing or MMSE estimation, since the estimation inevitably incurs a loss of information. Hence, the NDT scheme has better performance when $C \to +\infty$.

V.
NUMERICAL RESULTS

In this section, we evaluate the lower bounds obtained by the different achievable schemes proposed in Section IV and compare them with the upper bound derived in Section III. When performing the QCI scheme, we choose the quantization levels as quantiles for the sake of convenience.

Fig. 2 depicts $R_{lb1}$ versus the distortion $D$ under different configurations of the SNR $\rho$. It can be found from this figure that $R_{lb1}$ first increases and then decreases with $D$. It is thus important to find a good $D$ that maximizes $R_{lb1}$. Since it is difficult to get an explicit expression of (21), it is not easy to strictly analyze the relationship between $R_{lb1}$ and $D$. However, we can intuitively explain Fig. 2 as follows. When using the NDT scheme, the relay quantizes both $h$ and $y$. Due to the bottleneck constraint $C$, there exists a trade-off. When $D$ is small, the estimation error of $h$ is small; the destination node can get more CSI, and $R_{lb1}$ thus increases with $D$. When $D$ grows large, though more capacity in $C$ is allocated to quantizing $y$, the estimation error of $h$ is large. Hence, $R_{lb1}$ decreases with $D$. In the following simulations, when implementing the NDT scheme, we vary $D$, calculate $R_{lb1}$ using (21), and then let $R_{lb1}$ be the maximum value.

Fig. 2. Lower bound $R_{lb1}$ versus $D$ with $K = M = 4$ and $C = 40$ bits/complex dimension ($\rho = 0$, $10$, $40$ dB).

In Fig. 3 and Fig. 4, we perform Monte Carlo simulation to get the joint entropy $H_{joint}$ in (34) and the sum of individual entropies $H_{sum\_indi}$ in (37).

Fig. 3. $H_{joint}$ and $H_{sum\_indi}$ versus $M$ with $K = 2$ and $B = 2, 3, 4$ bits.

Fig. 4. $H_{joint}$ and $H_{sum\_indi}$ versus $K$ with $B = 2$ bits and different values of $M$.
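The Monte Carlo comparison of $H_{joint}$ and $H_{sum\_indi}$ can be sketched as follows. This is a simplified stand-in for the paper's simulation, with our own variable names: we quantize the noise levels $a_k = \sigma^2 [(H^H H)^{-1}]_{kk}$ using quantile levels estimated from the same sample and compute both entropies empirically. By subadditivity of entropy, $H_{joint} \leq H_{sum\_indi}$ must always hold, which the sketch illustrates.

```python
import numpy as np

def entropies(K, M, sigma2=1.0, B=2, n=10000, seed=0):
    """Empirical joint entropy and sum of marginal entropies (in bits) of the
    B-bit quantile-quantized noise levels a_k = sigma2 * [(H^H H)^{-1}]_{kk}."""
    rng = np.random.default_rng(seed)
    a = np.empty((n, K))
    for i in range(n):
        H = (rng.standard_normal((M, K)) + 1j * rng.standard_normal((M, K))) / np.sqrt(2)
        a[i] = sigma2 * np.real(np.diag(np.linalg.inv(H.conj().T @ H)))
    # quantile cell edges, shared by all k since the a_k are identically distributed
    edges = np.quantile(a, np.linspace(0, 1, 2 ** B + 1)[1:-1])
    idx = np.searchsorted(edges, a)        # quantization cell index per entry
    ent = lambda c: -sum(p * np.log2(p) for p in c / c.sum() if p > 0)
    H_sum = sum(ent(np.bincount(idx[:, k])) for k in range(K))
    H_joint = ent(np.unique(idx, axis=0, return_counts=True)[1])
    return H_joint, H_sum

hj, hs = entropies(K=2, M=3)
print(hj, hs)  # H_joint never exceeds H_sum_indi
```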
Note that, as stated in Subsection IV-B, the complexities of calculating $H_{joint}$ and $H_{sum\_indi}$ are respectively proportional to $J^K$ and $J$. Hence, when $J$ or $K$ is large, it becomes quite difficult to get $H_{joint}$. For example, when $B = 4$ and $K = 4$, we have $J = 16$ and $J^K = 65536$, i.e., there are $65536$ points in the space $\Xi$. To get a reliable pmf $P_\xi$ for each point, the number of channel realizations has to be much greater than $65536$.

Fig. 5. Lower bound $R_{lb3}$ versus $\lambda_{th}$ for the $K = M$ case with $C = 40$ bits/complex dimension ($K = M \in \{2, 4, 6, 8\}$, $\rho \in \{10, 40\}$ dB).

Fig. 6. Lower bound $R_{lb3}$ versus $\lambda_{th}$ for the $K < M$ case with $K = 4$ and $C = 40$ bits/complex dimension ($M \in \{5, 6, 7, 8\}$, $\rho \in \{10, 40\}$ dB).

Fig. 3 shows that the gap between $H_{joint}$ and $H_{sum\_indi}$ is small. In addition, as $M$ increases, $H_{joint}$ approaches $H_{sum\_indi}$ quickly, indicating that the dependence between $\lceil a_k \rceil_B, \forall k \in \mathcal{K}$, becomes weak. This can be explained by considering the extreme case where $M \to +\infty$. Based on the definition of $H$ and the law of large numbers, we have $H^H H \to M I_K$. Hence, $A \to \frac{\sigma^2}{M} I_K$, and $\lceil a_k \rceil_B, \forall k \in \mathcal{K}$, are thus almost independent.

When $M = K$ and $K$ increases, Fig. 4 shows an obvious increase in the gap between $H_{joint}$ and $H_{sum\_indi}$. Hence, when $M = K$ and $K$ increases, the correlation between $\lceil a_k \rceil_B, \forall k \in \mathcal{K}$, is enhanced, and we would gain in $R_{lb2}$ by using $H_{joint}$ instead of $H_{sum\_indi}$. However, we would like to point out the following. First, it can be found from Fig. 4 that when $M > K$, this trend becomes less evident.
Second, as shown in the following results, when $K$ is large, the QCI scheme uses a large share of the capacity $C$ to quantize $\lceil a_k \rceil_B, \forall k \in \mathcal{K}$, so its performance is not as good as that of the TCI or MMSE scheme. Third, when $K$ or $B$ is large, it becomes difficult to get $H_{joint}$. Therefore, when implementing the QCI scheme in the following, we obtain $R_{lb2}$ by using $H_{sum\_indi}$, i.e., by quantizing $\lceil a_k \rceil_B, \forall k \in \mathcal{K}$, separately.

In Fig. 5 and Fig. 6, we investigate the effect of the threshold $\lambda_{th}$ on $R_{lb3}$ for the cases with $K = M$ and $K < M$, respectively. From these two figures, several observations can be made. First, when $K = M$ and $\rho$ or $K$ is small, $R_{lb3}$ first increases greatly and then decreases with $\lambda_{th}$, indicating that the choice of $\lambda_{th}$ has a significant impact on $R_{lb3}$. It is thus important to look for a good $\lambda_{th}$ that maximizes $R_{lb3}$ in these cases. Second, when $K = M$ and both $K$ and $\rho$ are large, or when $K < M$, $R_{lb3}$ first remains unchanged and then monotonically decreases with $\lambda_{th}$. In these cases, a small $\lambda_{th}$ is good enough to guarantee a large $R_{lb3}$, and the search for $\lambda_{th}$ can thus be avoided. For example, when $K < M$, we can set $\lambda_{th} = 0$, based on which a simpler expression of $R_{lb3}$ is given in (69). As for the case with $K = M$, since $\mathbb{E}[1/\lambda]$ does not exist when $\lambda_{th} = 0$, we can set $\lambda_{th}$ to be a fixed small number.

Fig. 7. $R_{lb3}$, $\check{R}_{lb3}$, and $\hat{R}_{lb3}$ versus $M$ with $K = 4$ and $C = 40$ bits/complex dimension ($\rho \in \{0, 10, 40\}$ dB).

Fig. 8. $R_{lb3}$, $\check{R}_{lb3}$, and $\hat{R}_{lb3}$ versus $\rho$ with $K = 4$ and $C = 40$ bits/complex dimension ($M \in \{4, 6, 8\}$).

In Fig. 7 and Fig. 8, we compare $R_{lb3}$ with its upper bound $\hat{R}_{lb3}$ and lower bound $\check{R}_{lb3}$. As expected, $R_{lb3}$, $\hat{R}_{lb3}$, and $\check{R}_{lb3}$ all increase with $M$ and $\rho$.
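For the $K < M$, $\lambda_{th} = 0$ case, the closed forms of Lemma 8 make the gap between $\check{R}_{lb3}$ and $\hat{R}_{lb3}$ easy to evaluate directly. The sketch below follows our reading of (70)-(72) (treat the exact constants as assumptions); consistent with Remark 3, at high SNR the two bounds nearly coincide, and both stay below $C$:

```python
import numpy as np

def bounds_lemma8(K, M, C, rho):
    """Closed-form lower/upper proxies for R_lb3 when K < M and lambda_th = 0,
    following (70)-(72): D = (1 + s2/(M-K)) / (2^(C/K) - 1), with s2 = 1/rho."""
    assert K < M
    s2 = 1.0 / rho
    D = (1 + s2 / (M - K)) / (2 ** (C / K) - 1)
    R_check = K * np.log2(1 + D + s2 / M) - K * np.log2(D + s2 / (M - K))
    R_hat = K * np.log2(1 + D + s2 / (M - K)) - K * np.log2(D + s2 / (M - K))
    return R_check, R_hat

lo, hi = bounds_lemma8(K=4, M=8, C=40, rho=10 ** (40 / 10))  # rho = 40 dB
print(lo, hi)  # nearly equal at high SNR, both below C = 40
```

As a design sanity check, $\hat{R}_{lb3} < C$ always holds here: substituting (72) gives $\hat{R}_{lb3} = K \log_2\!\big(2^{C/K}(1+a)/(1 + a\,2^{C/K})\big)$ with $a = \sigma^2/(M-K) > 0$.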
When $M$ or $\rho$ is small, there is a small gap between $R_{lb3}$ and $\hat{R}_{lb3}$, and a small gap between $R_{lb3}$ and $\check{R}_{lb3}$. As $M$ and $\rho$ increase, these gaps narrow rapidly and the curves almost coincide, which verifies Remark 2. As a result, when $M - K$ or $\rho$ is large, we can set $\lambda_{th} = 0$ and use $\check{R}_{lb3}$ in (70) to lower bound $I(x; z)$, since it has a more concise expression.

Fig. 9. Upper and lower bounds to the bottleneck rate versus $\rho$ with $K = M = 2$ and $C = 40$ bits/complex dimension.

Fig. 10. Upper and lower bounds to the bottleneck rate versus $\rho$ with $K = M = 4$ and $C = 40$ bits/complex dimension.

In Fig. 9 and Fig. 10, the upper bound $R_{ub}$ and the lower bounds obtained by the different schemes are depicted versus the SNR $\rho$. Several observations can be made from these two figures. First, as expected, all bounds increase with $\rho$. Second, when $K$, $M$, and $\rho$ are small, the NDT scheme outperforms the other achievable schemes. However, as these parameters increase, the performance of the NDT scheme deteriorates rapidly. This is because when $K$, $M$, and $\rho$ are small, the performance of the considered system is mainly limited by the capacity of Channel 1, and the NDT scheme works well since the destination node can extract more information from the compressed observation of the relay and the CSI. However, when $K$ and $M$ increase, the NDT
scheme requires too many channel uses for CSI transmission. Third, the QCI scheme can achieve good performance when $K$ is small. Of course, as stated at the beginning of Subsection IV-C, the number of bits required for transmitting the quantized noise levels in the QCI scheme is proportional to $K$ and $B$; hence, the performance of the QCI scheme varies significantly when $K$ and $B$ change. Moreover, it is also shown that the performance of the TCI scheme is worse than that of the MMSE scheme in the low SNR regime, while it gets quite close to that of the MMSE scheme in the high SNR regime. When $\rho$ grows large, the lower bounds obtained by the TCI and MMSE schemes both approach $C$ and are larger than those obtained by the NDT and QCI schemes.

Fig. 11. Upper and lower bounds to the bottleneck rate versus $C$ with $K = M = 2$ and $\rho = 40$ dB.

Fig. 12. Upper and lower bounds to the bottleneck rate versus $C$ with $K = M = 4$ and $\rho = 40$ dB.

Fig. 13. Upper and lower bounds to the bottleneck rate versus $M$ with $K = 2$, $\rho = 10$ dB, and $C = 40$ bits/complex dimension.

Fig. 14. Upper and lower bounds to the bottleneck rate versus $M$ with $K = 2$, $\rho = 40$ dB, and $C = 40$ bits/complex dimension.

In Fig. 11 and Fig. 12, the effect of the bottleneck constraint $C$ is investigated. From Fig.
11, it can be found that as $C$ increases, all bounds grow and converge to different constants, which can be calculated based on Lemma 1, Lemma 3, Lemma 5, Lemma 7, and Lemma 9, respectively. Fig. 11 also shows that, thanks to CSI transmission, the NDT and QCI schemes outperform the TCI and MMSE schemes when $C$ is large. By comparing these two figures, it can be found that in Fig. 11 no bound approaches $C$, even for the case with $C = 20$, while in Fig. 12 it is possible for $R_{ub}$, $R_{lb3}$, and $R_{lb4}$ to approach $C$, e.g., when $K = M = 4$ and $C$ is small. This is because the bottleneck rate is limited by both the capacity of Channel 1 and $C$. In Fig. 11, since $K$ and $M$ are small, the capacity of Channel 1 is smaller than $C$; hence, the bounds of course do not approach $C$. In Fig. 12, more multi-antenna gains can be obtained due to the larger $K$ and $M$. The capacity of Channel 1 is thus larger than $C$ in some cases, and $R_{ub}$, $R_{lb3}$, and $R_{lb4}$ may approach $C$ in these cases. Note that, as shown in Fig. 11, $R_{lb2} = 0$ for small values of $C$, since $B < C/K$ is then not satisfied.

In Fig. 13 and Fig. 14, the effect of $M$ is investigated for different configurations of $\rho$. These two figures show that $R_{ub}$, $R_{lb2}$, $R_{lb3}$, and $R_{lb4}$ all increase monotonically with $M$, and as $M$ grows, $R_{lb3}$ as well as $R_{lb4}$ gets very close to $R_{ub}$. As for $R_{lb1}$, except for the $M = 3$ case in Fig. 13, $R_{lb1}$ monotonically decreases with $M$, since the relay has to transmit more channel information to the destination node.

In Fig. 15 and Fig. 16, we set $K = M$ and depict the upper and lower bounds versus $K$ or

Fig. 15. Upper and lower bounds to the bottleneck rate versus $K$ or $M$ with $K = M$, $\rho = 40$ dB, and $C = 50$ bits/complex dimension.
Fig. 16. Upper and lower bounds to the bottleneck rate versus $K$ or $M$ with $K = M$, $\rho = 40$ dB, and $C = 8K$ bits/complex dimension.

$M$. In Fig. 15, we fix $C$ to $50$ bits/complex dimension, while in Fig. 16 we set $C = 8K$, which makes sense since the bottleneck constraint should scale with the number of degrees of freedom of the input signal $x$. Since we choose the quantization levels as quantiles when performing the QCI scheme, as stated at the end of Subsection IV-B, $B < C/K$ should be satisfied. Hence, in Fig. 15 and Fig. 16, we only consider $B = 1, 2, 4$ bits when performing the QCI scheme. When $K = M$ and they grow simultaneously, the capacity of Channel 1 increases due to the multi-antenna gains. Hence, for a fixed $C$, Fig. 15 shows that all bounds increase first. When $K$ or $M$ grows large, $R_{lb3}$ and $R_{lb4}$ approach the bottleneck constraint $C$, while $R_{lb2}$ decreases for all values of $B$. This is because the number of bits per channel use required for informing the destination node of $A'$ in the QCI scheme is proportional to $K$, while CSI transmission is unnecessary for the TCI and MMSE schemes. As for the NDT scheme, since the number of bits required for quantizing $H$ is proportional to both $K$ and $M$, there is only an initial increase at very small $K$; after that, $R_{lb1}$ decreases monotonically and has the worst performance. In contrast, when $C = 8K$, the bottleneck rate of the system is mainly limited by $C$. Hence, Fig. 16 shows that all bounds except $R_{lb1}$ increase almost linearly with $K$, and $R_{ub}$, $R_{lb3}$, and $R_{lb4}$ are quite close to $C$.

VI. CONCLUSIONS

This work extends the IB problem from the scalar case in [26] to the case of MIMO Rayleigh fading channels. Due to the information bottleneck constraint, the destination node cannot get the perfect CSI from the relay.
Hence, we provide an upper bound to the bottleneck rate by assuming that the destination node can get the perfect CSI at no cost. Besides, we also provide four achievable schemes, where each scheme satisfies the bottleneck constraint and gives a lower bound to the bottleneck rate. Our results show that with simple symbol-by-symbol relay processing and compression, we can obtain bottleneck rates close to the upper bound over a wide range of relevant system parameters. Although we have focused on a MIMO channel with one relay, we plan to extend the problem to the case of multiple parallel relays, which is particularly relevant to the centralized processing of multiple remote antennas, as in the so-called C-RAN architectures.

APPENDIX A
PROOF OF THEOREM 1

Consider first the scalar Gaussian channel

$y = sx + n$, (87)

where $x \sim \mathcal{CN}(0, 1)$, $n \sim \mathcal{CN}(0, \sigma^2)$, and $s \in \mathbb{C}$ is the deterministic channel gain. With bottleneck constraint $C$, the IB problem for (87) has been studied in [21] and the optimal bottleneck rate is given by

$R = \log(1 + \rho|s|^2) - \log(1 + \rho|s|^2 2^{-C})$. (88)

In the following, we show that (4) can be decomposed into a set of parallel scalar IB problems, and (88) can then be applied to get the upper bound $R_{ub}$ in Theorem 1. According to the definition of conditional mutual information, problem (4) can be rewritten as

$\max_{p(z|y, H)} \int I(x; z \,|\, \mathbf{H} = H) \, p_{\mathbf{H}}(H) \, dH$ (89a)
s.t. $\int I(y; z \,|\, \mathbf{H} = H) \, p_{\mathbf{H}}(H) \, dH \leq C$, (89b)

where $H$ is a realization of the random channel matrix $\mathbf{H}$. Let $U\Lambda U^H$ denote the eigendecomposition of $HH^H$, where $U$ is a unitary matrix whose columns are the eigenvectors of $HH^H$, and $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues of $HH^H$. Since the rank of $HH^H$ is no greater than $T = \min\{K, M\}$, there are at most $T$ positive diagonal entries in $\Lambda$. Denote them by $\lambda_t$, where $t \in \mathcal{T}$ and $\mathcal{T} = \{1, \cdots, T\}$. Let

$\hat{y} = U^H y = U^H H x + U^H n$. (90)

Then, for a given channel realization $\mathbf{H} = H$, $\hat{y}$ is conditionally Gaussian, i.e., $\hat{y} \,|\, \mathbf{H} = H \sim \mathcal{CN}(0, \Lambda + \sigma^2 I_M)$.
(91)

Since

$I(x; y \,|\, \mathbf{H} = H) = I(x; \hat{y} \,|\, \mathbf{H} = H)$, (92)

we work with $\hat{y}$ instead of $y$ in the following. Based on (89) and (91), the MIMO channel $p(\hat{y}|x, H)$ can first be divided into a set of parallel channels for the different realizations of $\mathbf{H}$, and each channel $p(\hat{y}|x, \mathbf{H} = H)$ can be further divided into $T$ independent scalar Gaussian channels with SNRs $\rho\lambda_t, \forall t \in \mathcal{T}$. Accordingly, problem (4) can be decomposed into a set of parallel IB problems. For a scalar Gaussian channel with SNR $\rho\lambda_t$, let $c^{ub}_t$ denote the allocation of the bottleneck constraint $C$ and $R^{ub}_t$ denote the corresponding rate. According to (88), we have

$R^{ub}_t = \log(1 + \rho\lambda_t) - \log(1 + \rho\lambda_t 2^{-c^{ub}_t})$. (93)

Then, the solution of problem (4) can be obtained by solving the following problem

$\max_{\{c^{ub}_t\}} \sum_{t=1}^T \mathbb{E}[R^{ub}_t]$ (94a)
s.t. $\sum_{t=1}^T \mathbb{E}[c^{ub}_t] \leq C$. (94b)

Assume that $\lambda_t, \forall t \in \mathcal{T}$ are the unordered positive eigenvalues of $HH^H$. Then, they are identically distributed. For convenience, define a new variable $\lambda$ which follows the same distribution as $\lambda_t$; the subscript '$t$' in $c^{ub}_t$ and $R^{ub}_t$ can thus be omitted. In order to distinguish it from $R_{ub}$ in (5), we use $R^{ub}$ to denote the bottleneck rate corresponding to $c^{ub}$, i.e.,

$R^{ub} = \log(1 + \rho\lambda) - \log(1 + \rho\lambda 2^{-c^{ub}})$. (95)

Then, we have

$\sum_{t=1}^T \mathbb{E}[R^{ub}_t] = T\,\mathbb{E}[R^{ub}]$, $\quad \sum_{t=1}^T \mathbb{E}[c^{ub}_t] = T\,\mathbb{E}[c^{ub}]$. (96)

Note that when deriving the upper and lower bounds in this paper, we consider the unordered positive eigenvalues of $HH^H$ or $H^H H$, since this simplifies the analysis. If the ordered positive eigenvalues of $HH^H$ or $H^H H$ are considered, it can be readily proven, by following steps similar to those in [31, Subsection 4.2], that we arrive at problems equivalent to those in this paper. Problem (94) is thus equivalent to

$\max_{c^{ub}} \mathbb{E}[R^{ub}]$ (97a)
s.t. $\mathbb{E}[c^{ub}] \leq \frac{C}{T}$.
(97b)

This problem can be solved by the water-filling method. Consider the Lagrangian

$\mathcal{L} = \mathbb{E}[-R^{ub} + \alpha c^{ub}] - \frac{\alpha C}{T}$, (98)

where $\alpha$ is the Lagrange multiplier. The KKT condition for optimality is

$\frac{\partial \mathcal{L}}{\partial c^{ub}} = 0$ if $c^{ub} > 0$, and $\geq 0$ if $c^{ub} = 0$. (99)

Then,

$c^{ub} = \log\frac{\rho\lambda}{\nu}$ if $\lambda > \frac{\nu}{\rho}$, and $c^{ub} = 0$ if $\lambda \leq \frac{\nu}{\rho}$, (100)

where $\nu = \alpha/(1-\alpha)$ and it is chosen such that the following bottleneck constraint is met

$\mathbb{E}\left[\log\frac{\rho\lambda}{\nu} \,\Big|\, \lambda > \frac{\nu}{\rho}\right] \Pr\left\{\lambda > \frac{\nu}{\rho}\right\} = \frac{C}{T}$. (101)

The informed receiver upper bound is thus given by

$R_{ub} = T \, \mathbb{E}\left[\log(1 + \rho\lambda) - \log(1 + \nu) \,\Big|\, \lambda > \frac{\nu}{\rho}\right] \Pr\left\{\lambda > \frac{\nu}{\rho}\right\}$. (102)

From the definition of $H$ in (2), it is known that when $K \leq M$ (resp., when $K > M$), $H^H H$ (resp., $HH^H$) is a central complex Wishart matrix with $M$ (resp., $K$) degrees of freedom and covariance matrix $I_K$ (resp., $I_M$), i.e., $H^H H \sim \mathcal{CW}_K(M, I_K)$ (resp., $HH^H \sim \mathcal{CW}_M(K, I_M)$) [33]. Since $\lambda$ can be seen as one of the unordered positive eigenvalues of $H^H H$ or $HH^H$, its pdf is given by [33, Theorem 2.17], [31]

$f_\lambda(\lambda) = \frac{1}{T} \sum_{i=0}^{T-1} \frac{i!}{(i+S-T)!} \left[L^{S-T}_i(\lambda)\right]^2 \lambda^{S-T} e^{-\lambda}$, (103)

where $S = \max\{K, M\}$ and the Laguerre polynomials are

$L^{S-T}_i(\lambda) = \frac{e^\lambda}{i! \, \lambda^{S-T}} \frac{d^i}{d\lambda^i}\left(e^{-\lambda} \lambda^{S-T+i}\right)$. (104)

Substituting (103) and (104) into (102) and (101), (5) and (6) can be obtained. Theorem 1 is thus proven.

APPENDIX B
PROOF OF LEMMA 1

To show that $R_{ub}$ approaches $C$ as $M \to +\infty$, we first look at the special case with $K = 1$. In this case, $S = M$ and $T = 1$. From (104) and (103), we have $L^{S-T}_0 = 1$ and the pdf of $\lambda$ is

$f_\lambda(\lambda) = \frac{\lambda^{M-1} e^{-\lambda}}{(M-1)!}$, (105)

which shows that $\lambda$ follows an Erlang distribution with shape parameter $M$ and rate parameter $1$, i.e., $\lambda \sim \mathrm{Erlang}(M, 1)$. The expectation of $\lambda$ is thus $M$. As $M \to +\infty$, $f_\lambda(\lambda)$ becomes a delta function [34], and

$\Pr\{\lambda = M\} \to 1$, $\quad \Pr\{\lambda \neq M\} \to 0$.
(106)

Hence, the bottleneck constraint (6) becomes

$\int_{\nu/\rho}^{\infty} \left(\log\frac{\rho\lambda}{\nu}\right) f_\lambda(\lambda) \, d\lambda = C \to \log\frac{\rho M}{\nu}$, (107)

based on which we get

$\nu \to M\rho \, 2^{-C}$. (108)

Then, the informed receiver upper bound

$R_{ub} \to \log(1 + M\rho) - \log(1 + M\rho \, 2^{-C}) \to C$. (109)

Next, we consider the general case. For any positive integer $K$, when $M \to +\infty$, based on the definition of $H$ and the law of large numbers, we have $H^H H \to M I_K$. Since $H^H H$ and $HH^H$ have the same positive eigenvalues, we have $\lambda \to M$, and (106) also holds in this general case. Then,

$\int_{\nu/\rho}^{\infty} \left(\log\frac{\rho\lambda}{\nu}\right) f_\lambda(\lambda) \, d\lambda = \frac{C}{T} \to \log\frac{\rho M}{\nu}$, (110)

based on which we get

$\nu \to M\rho \, 2^{-C/T}$. (111)

We thus have

$R_{ub} \to T\left[\log(1 + M\rho) - \log(1 + M\rho \, 2^{-C/T})\right] \to C$. (112)

Now we prove that $R_{ub}$ approaches $C$ as $\rho \to +\infty$. From (6), it can be seen that $\int_{\nu/\rho}^{\infty} (\log\frac{\rho\lambda}{\nu}) f_\lambda(\lambda) \, d\lambda$ decreases with $\nu$. Therefore, when $\rho \to +\infty$, to ensure that constraint (6) holds, $\nu$ becomes large. Then, we have

$R_{ub} = T \int_{\nu/\rho}^{\infty} [\log(1 + \rho\lambda) - \log(1 + \nu)] f_\lambda(\lambda) \, d\lambda \to T \int_{\nu/\rho}^{\infty} [\log(\rho\lambda) - \log\nu] f_\lambda(\lambda) \, d\lambda = C$. (113)

In addition, when $C \to +\infty$, it can be found from (6) that $\nu \to 0$. Using (5), we can then get (7), which is the capacity of Channel 1. This completes the proof.

APPENDIX C
PROOF OF THEOREM 2

Given $Z$, $y_g \sim \mathcal{CN}(0, \Omega + (KD + \sigma^2) I_M)$. Let $\omega$ denote an unordered positive eigenvalue of $ZZ^H$. Since the elements in $Z$ and $H$ respectively follow i.i.d. $\mathcal{CN}(0, 1-D)$ and $\mathcal{CN}(0, 1)$ distributions, and $\lambda$ is the unordered positive eigenvalue of $HH^H$ as defined in Appendix A, $\omega$ is identically distributed as $(1-D)\lambda$.
Then, the pdf of $\omega$ is

$f_\omega(\omega) = \frac{1}{1-D} f_\lambda\left(\frac{\omega}{1-D}\right)$, (114)

where $f_\lambda$ is the pdf of $\lambda$ given in (103). For a given feasible $D$, problem (20) can be solved similarly to (4) by following the steps in Appendix A, and the optimal solution is

$R_{lb1} = T \int_{\nu(KD+\sigma^2)}^{\infty} \left[\log\left(1 + \frac{\omega}{KD+\sigma^2}\right) - \log(1+\nu)\right] f_\omega(\omega) \, d\omega$, (115)

where $\nu$ is chosen such that the following bottleneck constraint is met

$\int_{\nu(KD+\sigma^2)}^{\infty} \left[\log\frac{\omega}{\nu(KD+\sigma^2)}\right] f_\omega(\omega) \, d\omega = \frac{C - R(D)}{T}$. (116)

Using (114), (115) can be reformulated, via the change of variable $\lambda = \frac{\omega}{1-D}$, as

$R_{lb1} = T \int_{\nu(KD+\sigma^2)}^{\infty} \left[\log\left(1 + \frac{\omega}{KD+\sigma^2}\right) - \log(1+\nu)\right] \frac{1}{1-D} f_\lambda\left(\frac{\omega}{1-D}\right) d\omega = T \int_{\nu/\gamma}^{\infty} [\log(1 + \gamma\lambda) - \log(1+\nu)] f_\lambda(\lambda) \, d\lambda$, (117)

where $\gamma = \frac{1-D}{KD+\sigma^2}$. Analogously, the bottleneck constraint (116) can be transformed into

$\int_{\nu/\gamma}^{\infty} \left(\log\frac{\gamma\lambda}{\nu}\right) f_\lambda(\lambda) \, d\lambda = \frac{C - R(D)}{T}$. (118)

Theorem 2 is thus proven.

APPENDIX D
PROOF OF LEMMA 2

First,

$I(y; z|Z) = I(\tilde{y}; z|Z) = h(z|Z) - h(z|Z, \tilde{y}) \overset{(a)}{\leq} \mathbb{E}[\log\det(\Omega\Psi + (KD+\sigma^2)\Psi + I_M)] = I(y_g; z_g|Z)$, (119)

where $(a)$ holds since the Gaussian distribution maximizes the entropy over all distributions with the same variance. Then, we prove the inequality (26):

$I(x; z|Z) = h(z|Z) - h(z|Z, x) \overset{(a)}{\geq} \mathbb{E}[\log\det(\Omega\Psi + (KD+\sigma^2)\Psi + I_M)] - \mathbb{E}[\log\det((KD+\sigma^2)\Psi + I_M)] = I(x; z_g|Z)$, (120)

where $(a)$ holds since, for a Gaussian input, Gaussian noise minimizes the mutual information [27, (9.178)]. Since $\Psi$ is optimally obtained when solving the IB problem (20), the bottleneck constraint (20b) is satisfied and $I(x; z_g|Z) = R_{lb1}$. Then, from (119) and (120), we have

$I(y; z|Z) \leq C - R(D)$, $\quad I(x; z|Z) \geq R_{lb1}$.
(121)

This completes the proof.

APPENDIX E: PROOF OF LEMMA 3

When $M \to +\infty$, as stated in Appendix B, $\lambda \to M$. Then,
$$\int_{\nu/\gamma}^{\infty} \left(\log \frac{\gamma\lambda}{\nu}\right) f_\lambda(\lambda)\, d\lambda = \frac{C - R(D)}{T} \to \log \frac{\gamma M}{\nu}, \tag{122}$$
based on which we get
$$\nu \to \gamma M\, 2^{-\frac{C - R(D)}{T}}. \tag{123}$$
From (21),
$$R_{\rm lb1} \to T\left[\log\left(1 + \gamma M\right) - \log\left(1 + \gamma M\, 2^{-\frac{C - R(D)}{T}}\right)\right]. \tag{124}$$
When $\rho \to +\infty$, $\sigma^2 \to 0$ and thus $\gamma \to \frac{1-D}{KD}$. $R_{\rm lb1}$ thus tends to a constant and can be obtained from (21). When $C \to +\infty$, it is possible for the relay to transmit $\mathbf h$ almost perfectly to the destination node, i.e., $D \to 0$. Hence $\gamma = \frac{1-D}{KD+\sigma^2} \to \rho$. In addition, it can be found from (22) that $\nu \to 0$. Then, from (21),
$$R_{\rm lb1} \to T \int_{0}^{\infty} \log(1 + \rho\lambda)\, f_\lambda(\lambda)\, d\lambda = I(\mathbf x; \mathbf y, \mathbf H). \tag{125}$$
Lemma 3 is thus proven.

APPENDIX F: PROOF OF THEOREM 3

Since $\hat{\mathbf n}_g \sim \mathcal{CN}(\mathbf 0, \mathbf A')$ and $\lceil a_k\rceil_{\mathcal B}$ has $J$ possible values, i.e., $b_1, \cdots, b_J$, the channel in (32) can be divided into $KJ$ independent scalar Gaussian sub-channels with noise power $\lceil a_k\rceil_{\mathcal B} = b_j$ for each sub-channel. For the sub-channel with noise power $\lceil a_k\rceil_{\mathcal B} = b_j$, let $c_{k,j}$ denote the allocation of the bottleneck constraint $C$ and $R_{k,j}$ denote the corresponding rate. According to (88), we have
$$R_{k,j} = \log(1 + \rho_j) - \log\left(1 + \rho_j\, 2^{-c_{k,j}}\right), \tag{126}$$
where $\rho_j = 1/b_j$. Since $b_J = +\infty$, we let $R_{k,J} = 0$ and $c_{k,J} = 0$. Note that based on [21, (16)], the representation of $\hat{\mathbf x}_g$, i.e., $\hat{\mathbf z}_g$, can be constructed by adding independent fading and Gaussian noise to each element of $\hat{\mathbf x}_g$ in (32). Denote
$$P_{k,j} = \Pr\left\{\lceil a_k\rceil_{\mathcal B} = b_j\right\}. \tag{127}$$
Then, the optimal $I(\mathbf x; \hat{\mathbf z}_g|\mathbf A')$ is equal to the objective function of the following problem
$$\max_{\{c_{k,j}\}}\ \sum_{k=1}^{K}\sum_{j=1}^{J-1} P_{k,j} R_{k,j} \tag{128a}$$
$$\text{s.t.}\ \sum_{k=1}^{K}\sum_{j=1}^{J-1} P_{k,j} c_{k,j} \le C - \sum_{k=1}^{K} H_k, \tag{128b}$$
where $H_k = -\sum_{j=1}^{J} P_{k,j} \log P_{k,j}$. Since $K \le M$, as stated in Appendix A, $\mathbf H^H\mathbf H \sim \mathcal{CW}_K(M, \mathbf I_K)$.
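The per-sub-channel rate in (126) has simple sanity properties that are easy to confirm numerically: a zero bit allocation gives zero rate, and the rate can exceed neither the allocated bits nor the sub-channel capacity $\log(1+\rho_j)$. A short sketch with hypothetical SNR values (not taken from the paper):

```python
import numpy as np

def scalar_ib_rate(snr, c):
    # R = log2(1 + snr) - log2(1 + snr * 2^{-c}): rate of a scalar Gaussian
    # sub-channel with SNR `snr` under a bottleneck allocation of c bits,
    # mirroring the R_{k,j} expression in (126).
    return np.log2(1.0 + snr) - np.log2(1.0 + snr * 2.0 ** (-c))

snrs = np.array([0.5, 2.0, 10.0, 100.0])   # hypothetical sub-channel SNRs
rates = scalar_ib_rate(snrs, 3.0)          # 3 bits allocated to each
```

The rate is strictly increasing in the allocation $c$, which is what makes the water-filling argument below well behaved.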
Matrix $(\mathbf H^H\mathbf H)^{-1}$ thus follows the complex inverse Wishart distribution and its diagonal elements are identically inverse chi-squared distributed with $M - K + 1$ degrees of freedom [35]. Let $\eta$ denote one of the diagonal elements of $(\mathbf H^H\mathbf H)^{-1}$. The pdf of $\eta$ is thus given by
$$f_\eta(\eta) = \frac{2^{-(M-K+1)/2}}{\Gamma\left(\frac{M-K+1}{2}\right)}\, \eta^{-(M-K+1)/2 - 1}\, e^{-1/(2\eta)}. \tag{129}$$
Since $\mathbf A = \sigma^2 (\mathbf H^H\mathbf H)^{-1}$, the diagonal entries of $\mathbf A$, i.e., $a_k$, $\forall k \in \mathcal K$, are marginally identically distributed. Let $a$ denote a new variable with the same distribution as $a_k$. $a$ thus follows the same distribution as $\sigma^2\eta$ and its pdf is given by
$$f_a(a) = \frac{1}{\sigma^2}\, f_\eta\!\left(\frac{a}{\sigma^2}\right) = \frac{(2/\sigma^2)^{-(M-K+1)/2}}{\Gamma\left(\frac{M-K+1}{2}\right)}\, a^{-(M-K+1)/2 - 1}\, e^{-\sigma^2/(2a)}. \tag{130}$$
In addition, $P_{k,j}$, $R_{k,j}$, and $c_{k,j}$ can be simplified to $P_j$, $R_j$, and $c_j$ by dropping subscript `$k$'. Using (130), $P_j$ can be calculated as follows
$$P_j = \Pr\left\{\lceil a\rceil_{\mathcal B} = b_j\right\} = \Pr\left\{b_{j-1} < a \le b_j\right\} = \int_{b_{j-1}}^{b_j} f_a(a)\, da. \tag{131}$$
Problem (128) thus becomes
$$\max_{\{c_j\}}\ \sum_{j=1}^{J-1} K P_j R_j \tag{132a}$$
$$\text{s.t.}\ \sum_{j=1}^{J-1} K P_j c_j \le C - K H, \tag{132b}$$
where
$$R_j = \log(1+\rho_j) - \log\left(1 + \rho_j\, 2^{-c_j}\right), \quad H = -\sum_{j=1}^{J} P_j \log P_j. \tag{133}$$
Analogous to problem (97), (132) can be optimally solved by the water-filling method. The optimal $I(\mathbf x; \hat{\mathbf z}_g|\mathbf A')$ is given by
$$R_{\rm lb2} = \sum_{j=1}^{J-1} K P_j \left[\log(1+\rho_j) - \log\left(1 + \rho_j\, 2^{-c_j}\right)\right], \tag{134}$$
where $c_j = \left[\log \frac{\rho_j}{\nu}\right]^+$ and $\nu$ is chosen such that the bottleneck constraint
$$\sum_{j=1}^{J-1} K P_j c_j = C - K H \tag{135}$$
is met. Theorem 3 is then proven.

APPENDIX G: PROOF OF LEMMA 4

Since $\mathbf\Phi$ is a diagonal matrix with positive and real diagonal entries, it is invertible. Denote
$$\mathbf z' = \mathbf\Phi^{-1}\mathbf z = \mathbf x + \hat{\mathbf n} + \mathbf\Phi^{-1}\hat{\mathbf n}'_g, \quad \hat{\mathbf z}'_g = \mathbf\Phi^{-1}\hat{\mathbf z}_g = \mathbf x + \hat{\mathbf n}_g + \mathbf\Phi^{-1}\hat{\mathbf n}'_g.$$
(136)

For a given $\mathbf A'$, each element in $\hat{\mathbf n}$ is Gaussian distributed with zero mean and variance $\lceil a_k\rceil_{\mathcal B}$. However, $\hat{\mathbf n}$ is not a Gaussian vector since $\mathbf H$ is unknown. Hence, $\mathbf z'$ is not a Gaussian vector. As for $\hat{\mathbf z}'_g$, from (32) and (41), it is known that $\hat{\mathbf z}'_g \sim \mathcal{CN}(\mathbf 0, \mathbf I_K + \mathbf A' + \mathbf\Phi^{-2})$. We first prove inequation (43).
$$I(\hat{\mathbf x}; \mathbf z|\mathbf A') = I(\hat{\mathbf x}; \mathbf z'|\mathbf A') = h(\mathbf z'|\mathbf A') - h(\mathbf z'|\hat{\mathbf x}, \mathbf A') \overset{(a)}{\le} \mathbb E\left[\log\det\left(\mathbf I_K + \mathbb E[\hat{\mathbf n}\hat{\mathbf n}^H] + \mathbf\Phi^{-2}\right) - \log\det\left(\mathbf\Phi^{-2}\right)\right] \overset{(b)}{\le} \mathbb E\left[\log\det\left(\mathbf I_K + \mathbf A' + \mathbf\Phi^{-2}\right) - \log\det\left(\mathbf\Phi^{-2}\right)\right] = I(\hat{\mathbf x}_g; \hat{\mathbf z}'_g|\mathbf A') = I(\hat{\mathbf x}_g; \hat{\mathbf z}_g|\mathbf A'), \tag{137}$$
where $(a)$ holds since the Gaussian distribution maximizes the entropy over all distributions with the same variance, and $(b)$ follows by using Hadamard's inequality. Denote $\mathbf x = (x_1, \cdots, x_K)^T$, $\mathbf z' = (z'_1, \cdots, z'_K)^T$, $\hat{\mathbf z}'_g = (\hat z'_{g,1}, \cdots, \hat z'_{g,K})^T$, and $\mathbf\Phi = {\rm diag}\{\phi_1, \cdots, \phi_K\}$. Then, we prove inequation (44). Using the chain rule of mutual information,
$$I(\mathbf x; \mathbf z|\mathbf A') = I(\mathbf x; \mathbf z'|\mathbf A') = \sum_{k=1}^{K} I(x_k; z'_k|\mathbf A') + Q \ge \sum_{k=1}^{K} I(x_k; z'_k|\mathbf A') \overset{(a)}{=} \sum_{k=1}^{K} I(x_k; \hat z'_{g,k}|\mathbf A') \overset{(b)}{=} I(\mathbf x; \hat{\mathbf z}'_g|\mathbf A') = I(\mathbf x; \hat{\mathbf z}_g|\mathbf A'), \tag{138}$$
where $Q$ is a non-negative constant, $(a)$ holds since for a given $\mathbf A'$, both $z'_k$ and $\hat z'_{g,k}$ follow $\mathcal{CN}\left(0, 1 + \lceil a_k\rceil_{\mathcal B} + \phi_k^{-2}\right)$, and $(b)$ follows since the elements in $\mathbf x$ and $\hat{\mathbf z}'_g$ are independent. Since $\mathbf\Phi$ is optimally obtained when solving IB problem (38), bottleneck constraint (38b) is thus satisfied and $I(\mathbf x; \hat{\mathbf z}_g|\mathbf A') = R_{\rm lb2}$.
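Looking back at Appendix F for a moment, the water-filling solution (134)–(135) can be sketched numerically; the level SNRs $\rho_j$ and probabilities $P_j$ below are made up for illustration, and the water level $\nu$ is found here by simple bisection:

```python
import numpy as np

def water_fill(snr, P, K, budget):
    """Water-filling sketch for problem (132): maximize sum_j K P_j R_j
    subject to sum_j K P_j c_j <= budget, with c_j = [log2(snr_j / nu)]^+.
    `snr` and `P` are made-up illustrative arrays (snr_j = 1/b_j)."""
    def spent(nu):
        return float(np.sum(K * P * np.maximum(0.0, np.log2(snr / nu))))
    lo, hi = 1e-12, float(np.max(snr))
    for _ in range(200):                      # spent() decreases in nu -> bisect
        mid = np.sqrt(lo * hi)
        lo, hi = (mid, hi) if spent(mid) > budget else (lo, mid)
    nu = np.sqrt(lo * hi)
    c = np.maximum(0.0, np.log2(snr / nu))
    rate = float(np.sum(K * P * (np.log2(1 + snr) - np.log2(1 + snr * 2.0 ** (-c)))))
    return c, rate

snr = np.array([20.0, 5.0, 1.0])              # hypothetical per-level SNRs 1/b_j
P = np.array([0.5, 0.3, 0.2])                 # hypothetical level probabilities
c, rate = water_fill(snr, P, K=2, budget=4.0)
```

The common water level across levels follows from the per-level KKT conditions of (132), which are independent of the weights $KP_j$.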
Then, from (137) and (138), we have
$$I(\hat{\mathbf x}; \mathbf z|\mathbf A') \le C - KH, \quad I(\mathbf x; \mathbf z|\mathbf A') \ge R_{\rm lb2}. \tag{139}$$
This completes the proof.

APPENDIX H: PROOF OF LEMMA 5

When $M \to +\infty$, $\mathbf H^H\mathbf H \to M\mathbf I_K$. Hence, $\mathbf A \to \frac{\sigma^2}{M}\mathbf I_K$. Let $J = 2$, $b_1 = \frac{\sigma^2}{M} + \epsilon$, and $b_2 = +\infty$, where $\epsilon$ is a sufficiently small positive real number. Since $\mathbf A \to \frac{\sigma^2}{M}\mathbf I_K$, we have $P_1 \to 1$ and $H \to 0$. Then, from (39) and (40),
$$c_1 \to \frac{C}{K}, \quad R_{\rm lb2} \to K\left[\log\left(1 + \frac{M}{\sigma^2}\right) - \log\left(1 + \frac{M}{\sigma^2}\, 2^{-C/K}\right)\right] \to C. \tag{140}$$
When $\rho \to +\infty$, $\sigma^2 \to 0$ and $\mathbf A \to \mathbf 0$. By setting $J = 2$ and $b_1$ small enough, it can be proven as above that $R_{\rm lb2} \to C$. When $C \to +\infty$, it is possible to choose suitable quantization points $\mathcal B = \{b_1, \cdots, b_J\}$ such that $\hat{\mathbf x}_g$ in (32) and $\mathbf A$ can be almost perfectly quantized. Hence,
$$R_{\rm lb2} = I(\mathbf x; \hat{\mathbf z}_g|\mathbf A') \to I(\mathbf x; \hat{\mathbf x}_g|\mathbf A') = \mathbb E\left[\log\det\left(\mathbf I_K + \mathbf A'\right) - \log\det\left(\mathbf A'\right)\right] \to \mathbb E\left[\log\det\left(\mathbf I_K + \mathbf A \odot \mathbf I_K\right) - \log\det\left(\mathbf A \odot \mathbf I_K\right)\right]. \tag{141}$$
On the other hand, the capacity of Channel 1 is given by
$$I(\mathbf x; \mathbf y, \mathbf H) = I(\mathbf x; \mathbf y|\mathbf H) = \mathbb E\left[\log\det\left(\mathbf H\mathbf H^H + \sigma^2\mathbf I_M\right) - \log\det\left(\sigma^2\mathbf I_M\right)\right] = \mathbb E\left[\log\det\left(\mathbf H^H\mathbf H + \sigma^2\mathbf I_K\right) - \log\det\left(\sigma^2\mathbf I_K\right)\right] = \mathbb E\left[\log\det\left(\mathbf I_K + \mathbf A\right) - \log\det\left(\mathbf A\right)\right]. \tag{142}$$
To prove that (141) is upper bounded by (142), we first give and prove the following lemma.

Lemma 11. For any $K$-dimensional positive definite matrix $\mathbf N$, let $\mathbf N_1 = \mathbf N \odot \mathbf I_K$, i.e., $\mathbf N_1$ consists of the diagonal elements of $\mathbf N$. Then,
$$\log\det\left(\mathbf I_K + \mathbf N\right) - \log\det\left(\mathbf N\right) \ge \log\det\left(\mathbf I_K + \mathbf N_1\right) - \log\det\left(\mathbf N_1\right). \tag{143}$$

Proof: Obviously, (143) is equivalent to
$$\log\det\left(\mathbf N_1\right) - \log\det\left(\mathbf N\right) \ge \log\det\left(\mathbf I_K + \mathbf N_1\right) - \log\det\left(\mathbf I_K + \mathbf N\right). \tag{144}$$
To prove (144), we introduce an auxiliary function $g(x) = \log\det\left(x\mathbf I_K + \mathbf N_1\right) - \log\det\left(x\mathbf I_K + \mathbf N\right)$ and show that $g(x)$ decreases monotonically w.r.t. $x$ when $x \ge 0$.
By taking the first-order derivative of $g(x)$, we have
$$g'(x) = {\rm tr}\left[\left(x\mathbf I_K + \mathbf N_1\right)^{-1}\right] - {\rm tr}\left[\left(x\mathbf I_K + \mathbf N\right)^{-1}\right]. \tag{145}$$
To prove $g'(x) \le 0$, we show in the following that for any positive definite matrix $\mathbf O$, we always have
$${\rm tr}\left(\mathbf O_1^{-1}\right) \le {\rm tr}\left(\mathbf O^{-1}\right), \tag{146}$$
where $\mathbf O_1$ consists of the diagonal elements of $\mathbf O$, i.e., $\mathbf O_1 = \mathbf O \odot \mathbf I_K$. Denote the diagonal entries of $\mathbf O$ (or $\mathbf O_1$) by $\mathbf o = (o_1, \cdots, o_K)^T$ and the eigenvalues of $\mathbf O$ by $\boldsymbol\theta = (\theta_1, \cdots, \theta_K)^T$. Since $\mathbf O$ is a positive definite matrix, the entries of $\mathbf o$ and $\boldsymbol\theta$ are real and positive. In addition, according to the Schur–Horn theorem, $\mathbf o$ is majorized by $\boldsymbol\theta$, i.e.,
$$\mathbf o \prec \boldsymbol\theta. \tag{147}$$
Define a real vector $\mathbf u = (u_1, \cdots, u_K)^T$ with $u_k > 0$, $\forall k \in \mathcal K$, and function $g_0(\mathbf u) = \sum_{k=1}^{K} \frac{1}{u_k}$. It is obvious that $g_0(\mathbf u)$ is convex and symmetric. Hence, $g_0(\mathbf u)$ is a Schur-convex function. Therefore,
$$g_0(\mathbf o) \le g_0(\boldsymbol\theta). \tag{148}$$
Using (148), we have
$${\rm tr}\left(\mathbf O_1^{-1}\right) = \sum_{k=1}^{K} \frac{1}{o_k} = g_0(\mathbf o) \le g_0(\boldsymbol\theta) = \sum_{k=1}^{K} \frac{1}{\theta_k} = {\rm tr}\left(\mathbf O^{-1}\right), \tag{149}$$
based on which we get $g'(x) \le 0$ and (143) can then be proven. $\square$

Then, from (141), (142), and Lemma 11, it is known that when $C \to +\infty$,
$$R_{\rm lb2} \to \mathbb E\left[\log\det\left(\mathbf I_K + \mathbf A \odot \mathbf I_K\right) - \log\det\left(\mathbf A \odot \mathbf I_K\right)\right] = K\, \mathbb E\left[\log\left(1 + \frac{1}{a}\right)\right] \le I(\mathbf x; \mathbf y, \mathbf H), \tag{150}$$
where the expectation can be calculated by using the pdf of $a$ in (130). Lemma 5 is thus proven.

APPENDIX I: PROOF OF REMARK

We show that when $K = M$ and $\lambda_{\rm th} = 0$, $\mathbb E\left[\frac{1}{\lambda}\right]$ does not exist. When $K = M$, $f_\lambda(\lambda)$ is given in (103). From (104), it is known that for any $0 \le i \le K-1$, $L_i$ can always be expressed as follows
$$L_i(\lambda) = \frac{e^\lambda}{i!} \frac{d^i}{d\lambda^i}\left(e^{-\lambda}\lambda^i\right) = \sum_{j=1}^{i} \varsigma_{i,j}\lambda^j + 1, \tag{151}$$
where $\varsigma_{i,j}$ is a constant. Accordingly, from (103),
$$f_\lambda(\lambda) = \frac{1}{K}\sum_{i=0}^{K-1}\left[L_i(\lambda)\right]^2 e^{-\lambda} = \frac{e^{-\lambda}}{K}\sum_{j=1}^{2(K-1)} \tau_j\lambda^j + e^{-\lambda}, \tag{152}$$
where $\tau_j$ is a constant.
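Lemma 11 and the trace inequality (146) from Appendix H admit a quick numerical spot-check on random positive definite matrices (a sanity sketch, not a substitute for the majorization argument):

```python
import numpy as np

# Spot-check of Lemma 11 (143) and the trace inequality (146) on random
# positive definite matrices; the base of the log is irrelevant here.
rng = np.random.default_rng(3)
K = 4
ok_lemma = ok_trace = True
for _ in range(200):
    B = rng.standard_normal((K, K))
    N = B @ B.T + 0.1 * np.eye(K)        # random positive definite N
    N1 = np.diag(np.diag(N))             # N1 = N ⊙ I_K, its diagonal part
    full = np.linalg.slogdet(np.eye(K) + N)[1] - np.linalg.slogdet(N)[1]
    diag = np.linalg.slogdet(np.eye(K) + N1)[1] - np.linalg.slogdet(N1)[1]
    ok_lemma &= full >= diag - 1e-9                                            # (143)
    ok_trace &= np.trace(np.linalg.inv(N1)) <= np.trace(np.linalg.inv(N)) + 1e-9  # (146)
```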
Let $\epsilon$ denote a sufficiently small positive real number. Then, when $\lambda_{\rm th} = 0$,
$$\mathbb E\left[\frac{1}{\lambda}\right] = \int_{0}^{\infty} \frac{1}{\lambda} f_\lambda(\lambda)\, d\lambda = \int_{0}^{\infty} \frac{e^{-\lambda}}{K}\sum_{j=1}^{2(K-1)} \tau_j\lambda^{j-1}\, d\lambda + \int_{0}^{\infty} \frac{1}{\lambda} e^{-\lambda}\, d\lambda = \frac{1}{K}\sum_{j=1}^{2(K-1)} \tau_j (j-1)! + \int_{0}^{\epsilon} \frac{1}{\lambda} e^{-\lambda}\, d\lambda + \int_{\epsilon}^{\infty} \frac{1}{\lambda} e^{-\lambda}\, d\lambda, \tag{153}$$
where we used $\int_{0}^{\infty} e^{-\lambda}\lambda^{j-1}\, d\lambda = (j-1)!$. Since $\epsilon$ is sufficiently small, $\int_{0}^{\epsilon} \frac{1}{\lambda} e^{-\lambda}\, d\lambda \approx \int_{0}^{\epsilon} \frac{1}{\lambda}\, d\lambda$, which diverges for any $\epsilon > 0$. Therefore, $\mathbb E\left[\frac{1}{\lambda}\right]$ does not exist.

APPENDIX J: PROOF OF LEMMA 7

When $M \to +\infty$, $\mathbf H^H\mathbf H \to M\mathbf I_K$. Hence,
$$\lambda \to +\infty, \quad P_{\rm th} = \Pr\{\lambda_{\min} \ge \lambda_{\rm th}\} \to 1, \quad H_{\rm th} \to 0, \quad D \to 2^{-C/K}, \quad \mathbb E[\lambda|\Delta] \to +\infty, \quad \mathbb E\left[\frac{1}{\lambda}\,\Big|\,\Delta\right] \to 0. \tag{154}$$
Combining (154) with (66), (67), and (68), we have
$$R_{\rm lb3},\ \check R_{\rm lb3},\ \hat R_{\rm lb3} \to K\log\left(\frac{1}{D}\right) \to C. \tag{155}$$
When $\rho \to +\infty$, $\sigma^2 \to 0$. Hence,
$$D \to 2^{-\frac{C - H_{\rm th}}{P_{\rm th} K}}, \quad R_{\rm lb3},\ \check R_{\rm lb3},\ \hat R_{\rm lb3} \to P_{\rm th} K\log\left(\frac{1}{D}\right) \to C - H_{\rm th}. \tag{156}$$
When $C \to +\infty$, it can be found from (62) that $D \to 0$. Then, from (66), (67), and (68), it is known that $R_{\rm lb3}$, $\check R_{\rm lb3}$, and $\hat R_{\rm lb3}$ all approach constants, which can be respectively obtained by setting $D = 0$ in (66), (67), and (68). Lemma 7 is thus proven.

APPENDIX K: PROOF OF THEOREM 5

Recall that $\mathbf U\mathbf\Lambda\mathbf U^H$ is the eigendecomposition of $\mathbf H\mathbf H^H$ and $\lambda_t$, $\forall t \in \mathcal T$, are the unordered positive eigenvalues of $\mathbf H\mathbf H^H$. To derive $R_{\rm lb4}$, we further denote the singular value decomposition of $\mathbf H$ by $\mathbf U\mathbf L\mathbf V^H$, where $\mathbf V \in \mathbb C^{K\times K}$ is a unitary matrix and $\mathbf L \in \mathbb R^{M\times K}$ is a rectangular diagonal matrix. In fact, the diagonal entries of $\mathbf L$ are the non-negative square roots of the positive eigenvalues of $\mathbf H\mathbf H^H$.
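As a numerical aside to Appendix I, the divergence of $\mathbb E[1/\lambda]$ at $\lambda_{\rm th} = 0$ can be seen by truncating the problematic integral at $\epsilon$: each 100-fold reduction of $\epsilon$ adds roughly $\ln 100 \approx 4.6$ to $\int_{\epsilon}^{1} \frac{1}{\lambda} e^{-\lambda}\, d\lambda$, since $e^{-\lambda} \approx 1$ near the origin. A sketch using a trapezoidal rule on a log-spaced grid:

```python
import numpy as np

def tail_integral(eps, n=200_000):
    # Trapezoidal estimate of \int_eps^1 (1/lam) e^{-lam} d lam on a log grid.
    lam = np.logspace(np.log10(eps), 0.0, n)
    f = np.exp(-lam) / lam
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(lam)))

vals = [tail_integral(10.0 ** -p) for p in (2, 4, 6)]
gaps = np.diff(vals)   # each gap ~ ln(100), so the integral grows without bound
```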
Then, from (73), we have
$$\mathbf F^H\mathbf H = \mathbf H^H\left(\mathbf H\mathbf H^H + \sigma^2\mathbf I_M\right)^{-1}\mathbf H = \mathbf V\mathbf L^H\left(\mathbf\Lambda + \sigma^2\mathbf I_M\right)^{-1}\mathbf L\mathbf V^H = \mathbf V\,{\rm diag}\left\{\frac{\lambda_1}{\lambda_1+\sigma^2}, \cdots, \frac{\lambda_T}{\lambda_T+\sigma^2}, \mathbf 0_{K-T}^H\right\}\mathbf V^H,$$
$$\mathbf F^H\mathbf H\mathbf H^H\mathbf F = \mathbf V\mathbf L^H\left(\mathbf\Lambda + \sigma^2\mathbf I_M\right)^{-1}\mathbf\Lambda\left(\mathbf\Lambda + \sigma^2\mathbf I_M\right)^{-1}\mathbf L\mathbf V^H = \mathbf V\,{\rm diag}\left\{\frac{\lambda_1^2}{(\lambda_1+\sigma^2)^2}, \cdots, \frac{\lambda_T^2}{(\lambda_T+\sigma^2)^2}, \mathbf 0_{K-T}^H\right\}\mathbf V^H,$$
$$\mathbf F^H\mathbf F = \mathbf V\mathbf L^H\left(\mathbf\Lambda + \sigma^2\mathbf I_M\right)^{-2}\mathbf L\mathbf V^H = \mathbf V\,{\rm diag}\left\{\frac{\lambda_1}{(\lambda_1+\sigma^2)^2}, \cdots, \frac{\lambda_T}{(\lambda_T+\sigma^2)^2}, \mathbf 0_{K-T}^H\right\}\mathbf V^H, \tag{157}$$
where $\mathbf 0_{K-T}$ is a $(K-T)$-dimensional all-`0' column vector. Based on (157),
$$\mathbf F^H\mathbf H\mathbf H^H\mathbf F + \sigma^2\mathbf F^H\mathbf F + D\mathbf I_K = \mathbf V\,{\rm diag}\left\{\frac{\lambda_1}{\lambda_1+\sigma^2} + D, \cdots, \frac{\lambda_T}{\lambda_T+\sigma^2} + D, D\cdot\mathbf 1_{K-T}^H\right\}\mathbf V^H, \tag{158}$$
where $\mathbf 1_{K-T}$ is a $(K-T)$-dimensional all-`1' column vector. Since $\mathbf\Lambda$ is independent of $\mathbf U$, $\mathbf L$ is independent of $\mathbf U$ as well as $\mathbf V$, and $\lambda_t$, $\forall t \in \mathcal T$, are unordered, we have
$$\mathbb E\left[\log\det\left(\mathbf F^H\mathbf H\mathbf H^H\mathbf F + \sigma^2\mathbf F^H\mathbf F + D\mathbf I_K\right)\right] = T\,\mathbb E\left[\log\left(\frac{\lambda}{\lambda+\sigma^2} + D\right)\right] + (K-T)\log D. \tag{159}$$
Then, we calculate $\mathbf G$ in (82). For this purpose, we have to calculate $\mathbb E[\mathbf F^H\mathbf H]$, $\mathbb E[\mathbf F^H\mathbf H\mathbf H^H\mathbf F]$, and $\mathbb E[\mathbf F^H\mathbf F]$. To get these expectations, we consider two different cases, i.e., the case with $K \le M$ and the case with $K > M$. When $K \le M$, from (157), we have
$$\mathbb E[\mathbf F^H\mathbf H] = \mathbb E\left[\frac{\lambda}{\lambda+\sigma^2}\right]\mathbf I_K, \quad \mathbb E[\mathbf F^H\mathbf H\mathbf H^H\mathbf F] = \mathbb E\left[\frac{\lambda^2}{(\lambda+\sigma^2)^2}\right]\mathbf I_K, \quad \mathbb E[\mathbf F^H\mathbf F] = \mathbb E\left[\frac{\lambda}{(\lambda+\sigma^2)^2}\right]\mathbf I_K. \tag{160}$$
When $K > M$, denote $\mathbf V = (\mathbf v_1, \cdots, \mathbf v_K)$. Then, from (157),
$$\mathbf F^H\mathbf H = \mathbf V\,{\rm diag}\left\{\frac{\lambda_1}{\lambda_1+\sigma^2}, \cdots, \frac{\lambda_M}{\lambda_M+\sigma^2}, \mathbf 0_{K-M}^H\right\}\mathbf V^H = \left(\frac{\lambda_1}{\lambda_1+\sigma^2}\mathbf v_1, \cdots, \frac{\lambda_M}{\lambda_M+\sigma^2}\mathbf v_M, \mathbf 0_K, \cdots, \mathbf 0_K\right)\begin{pmatrix}\mathbf v_1^H \\ \vdots \\ \mathbf v_K^H\end{pmatrix} = \sum_{m=1}^{M} \frac{\lambda_m}{\lambda_m+\sigma^2}\mathbf v_m\mathbf v_m^H. \tag{161}$$
Since $\mathbf v_m$ is the eigenvector of matrix $\mathbf H^H\mathbf H$ and is independent of the unordered eigenvalue $\lambda_m$, we have
$$\mathbb E[\mathbf F^H\mathbf H] = \sum_{m=1}^{M} \mathbb E\left[\frac{\lambda_m}{\lambda_m+\sigma^2}\right]\frac{\mathbf I_K}{K} = \frac{M}{K}\,\mathbb E\left[\frac{\lambda}{\lambda+\sigma^2}\right]\mathbf I_K.$$
(162)

Similarly, we also have
$$\mathbb E[\mathbf F^H\mathbf H\mathbf H^H\mathbf F] = \frac{M}{K}\,\mathbb E\left[\frac{\lambda^2}{(\lambda+\sigma^2)^2}\right]\mathbf I_K, \quad \mathbb E[\mathbf F^H\mathbf F] = \frac{M}{K}\,\mathbb E\left[\frac{\lambda}{(\lambda+\sigma^2)^2}\right]\mathbf I_K. \tag{163}$$
Using (160), (162), (163), and (82), $\mathbf G$ can be calculated as
$$\mathbf G = \mathbb E[\mathbf F^H\mathbf H\mathbf H^H\mathbf F] - \mathbb E[\mathbf F^H\mathbf H]\,\mathbb E[\mathbf H^H\mathbf F] + \sigma^2\,\mathbb E[\mathbf F^H\mathbf F] + D\,\mathbf I_K = \left\{\frac{T}{K}\,\mathbb E\left[\frac{\lambda}{\lambda+\sigma^2}\right] - \frac{T^2}{K^2}\left(\mathbb E\left[\frac{\lambda}{\lambda+\sigma^2}\right]\right)^2 + D\right\}\mathbf I_K. \tag{164}$$
Hence,
$$\log\det(\mathbf G) = K\log\left\{\frac{T}{K}\,\mathbb E\left[\frac{\lambda}{\lambda+\sigma^2}\right] - \frac{T^2}{K^2}\left(\mathbb E\left[\frac{\lambda}{\lambda+\sigma^2}\right]\right)^2 + D\right\}. \tag{165}$$
Substituting (159) and (165) into (80) and (81), respectively, and using (79), we can get (83). We then calculate $D$ in (84). From (77), (160), and (163),
$$\mathbb E[\bar{\mathbf x}\bar{\mathbf x}^H] = \mathbb E\left[\mathbf F^H\mathbf H\mathbf H^H\mathbf F + \sigma^2\mathbf F^H\mathbf F\right] = \frac{T}{K}\,\mathbb E\left[\frac{\lambda}{\lambda+\sigma^2}\right]\mathbf I_K. \tag{166}$$
$I(\bar{\mathbf x}_g; \bar{\mathbf z}_g)$ in (78) can thus be calculated as follows
$$I(\bar{\mathbf x}_g; \bar{\mathbf z}_g) = \log\det\left(\mathbf I_K + \frac{\mathbb E[\bar{\mathbf x}\bar{\mathbf x}^H]}{D}\right) = K\log\left(1 + \frac{T}{DK}\,\mathbb E\left[\frac{\lambda}{\lambda+\sigma^2}\right]\right) = C, \tag{167}$$
based on which (84) can be obtained. Theorem 5 is then proven.

APPENDIX L: PROOF OF LEMMA

When $M \to +\infty$, $T = K$. As stated in Appendix B, $\mathbf H^H\mathbf H \to M\mathbf I_K$. Hence, $\lambda \to M$. From (167),
$$I(\bar{\mathbf x}_g; \bar{\mathbf z}_g) = K\log\left(1 + \frac{1}{D}\,\mathbb E\left[\frac{\lambda}{\lambda+\sigma^2}\right]\right) = C \to K\log\left(1 + \frac{1}{D}\right). \tag{168}$$
Combining (83) and (168), we have
$$R_{\rm lb4} \to K\log(1+D) - K\log D = K\log\left(1 + \frac{1}{D}\right) \to C. \tag{169}$$
When $K \le M$ and $\rho \to +\infty$, $T = K$ and $\sigma^2 \to 0$. Using (167) and (83), we can also get (168) and (169). When $K \le M$ and $C \to +\infty$, it can be found from (84) that $D \to 0$. Then, using (83), we can get (85). This finishes the proof.

APPENDIX M: PROOF OF LEMMA

As shown above, when $C \to +\infty$, $R_{\rm lb1}$ approaches the capacity of Channel 1, while $R_{\rm lb2}$ is upper bounded by the capacity of Channel 1. Hence,
$$R_{\rm lb1} \ge R_{\rm lb2}.$$
(170)

Moreover, when $C \to +\infty$, it is possible for the relay to transmit the zero-forcing estimate of $\mathbf x$, i.e., $\tilde{\mathbf x}$ in (51), almost perfectly to the destination node. As a result,
$$R_{\rm lb3} \le I(\mathbf x; \mathbf z|\Delta) \to I(\mathbf x; \tilde{\mathbf x}|\Delta). \tag{171}$$
Using (125) and (171), we have
$$R_{\rm lb1} \to I(\mathbf x; \mathbf y, \mathbf H) \overset{(a)}{=} I(\mathbf x; \mathbf y, \mathbf H|\Delta) = h(\mathbf x|\Delta) - h(\mathbf x|\mathbf y, \mathbf H, \Delta) = h(\mathbf x|\Delta) - h(\mathbf x|\mathbf y, \mathbf H, \tilde{\mathbf x}, \Delta) \ge h(\mathbf x|\Delta) - h(\mathbf x|\tilde{\mathbf x}, \Delta) = I(\mathbf x; \tilde{\mathbf x}|\Delta) \to I(\mathbf x; \mathbf z|\Delta) \ge R_{\rm lb3}. \tag{172}$$
Note that $(a)$ holds since event $\Delta$ works only when the relay transmits its observation to the destination node. Analogously, we also have
$$R_{\rm lb1} \to I(\mathbf x; \mathbf y, \mathbf H) = h(\mathbf x) - h(\mathbf x|\mathbf y, \mathbf H) = h(\mathbf x) - h(\mathbf x|\mathbf y, \mathbf H, \bar{\mathbf x}) \ge h(\mathbf x) - h(\mathbf x|\bar{\mathbf x}) = I(\mathbf x; \bar{\mathbf x}) \to I(\mathbf x; \mathbf z) \ge R_{\rm lb4}, \tag{173}$$
where $\bar{\mathbf x}$ is the MMSE estimate of $\mathbf x$ at the relay, i.e., (74). This completes the proof.

REFERENCES

[1] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," arXiv preprint physics/0004057, 2000.
[2] R. Shwartz-Ziv and N. Tishby, "Opening the black box of deep neural networks via information," arXiv preprint arXiv:1703.00810, 2017.
[3] A. A. Alemi, "Variational predictive information bottleneck," in Symposium on Advances in Approximate Bayesian Inference. PMLR, 2020, pp. 1–6.
[4] S. Mukherjee, "Machine learning using the variational predictive information bottleneck with a validation set," arXiv preprint arXiv:1911.02210, 2019.
[5] ——, "General information bottleneck objectives and their applications to machine learning," arXiv preprint arXiv:1912.06248, 2019.
[6] D. Strouse and D. J. Schwab, "The information bottleneck and geometric clustering," Neural Computation, vol. 31, no. 3, pp. 596–612, 2019.
[7] A. Painsky and N. Tishby, "Gaussian lower bound for the information bottleneck limit," J. Mach. Learn. Res. (JMLR), vol. 18, no. 1, pp. 7908–7936, 2018.
[8] R. Dobrushin and B. Tsybakov, "Information transmission with additional noise," IRE Trans. Inf. Theory, vol. 8, no. 5, pp. 293–304, Sep. 1962.
[9] H.
Witsenhausen and A. Wyner, "A conditional entropy bound for a pair of discrete random variables," IEEE Trans. Inf. Theory, vol. 21, no. 5, pp. 493–501, Sep. 1975.
[10] H. Witsenhausen, "Indirect rate distortion problems," IEEE Trans. Inf. Theory, vol. 26, no. 5, pp. 518–521, Sep. 1980.
[11] T. A. Courtade and T. Weissman, "Multiterminal source coding under logarithmic loss," IEEE Trans. Inf. Theory, vol. 60, no. 1, pp. 740–761, Jan. 2014.
[12] I. E. Aguerri, A. Zaidi, G. Caire, and S. S. Shitz, "On the capacity of cloud radio access networks with oblivious relaying," IEEE Trans. Inf. Theory, vol. 65, no. 7, pp. 4575–4596, July 2019.
[13] B. Nazer and M. Gastpar, "Compute-and-forward: Harnessing interference through structured codes," IEEE Trans. Inf. Theory, vol. 57, no. 10, pp. 6463–6486, Oct. 2011.
[14] S.-N. Hong and G. Caire, "Compute-and-forward strategies for cooperative distributed antenna systems," IEEE Trans. Inf. Theory, vol. 59, no. 9, pp. 5227–5243, Sep. 2013.
[15] B. Nazer, A. Sanderovich, M. Gastpar, and S. Shamai, "Structured superposition for backhaul constrained cellular uplink," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), Seoul, South Korea, June 2009, pp. 1530–1534.
[16] O. Simeone, E. Erkip, and S. Shamai, "On codebook information for interference relay channels with out-of-band relaying," IEEE Trans. Inf. Theory, vol. 57, no. 5, pp. 2880–2888, May 2011.
[17] S.-H. Park, O. Simeone, O. Sahin, and S. Shamai, "Robust and efficient distributed compression for cloud radio access networks," IEEE Trans. Veh. Technol., vol. 62, no. 2, pp. 692–703, Feb. 2013.
[18] Y. Zhou, Y. Xu, W. Yu, and J. Chen, "On the optimal fronthaul compression and decoding strategies for uplink cloud radio access networks," IEEE Trans. Inf. Theory, vol. 62, no. 12, pp. 7402–7418, Dec. 2016.
[19] I. E. Aguerri and A. Zaidi, "Lossy compression for compute-and-forward in limited backhaul uplink multicell processing," IEEE Trans. Commun., vol. 64, no. 12, pp. 5227–5238, Dec.
2016.
[20] J. Demel, T. Monsees, C. Bockelmann, D. Wuebben, and A. Dekorsy, "Cloud-RAN fronthaul rate reduction via IBM-based quantization for multicarrier systems," in Proc. 24th International ITG Workshop on Smart Antennas, Hamburg, Germany, Feb. 2020, pp. 1–6.
[21] A. Winkelbauer and G. Matz, "Rate-information-optimal Gaussian channel output compression," in Proc. 48th Annu. Conf. Inf. Sci. Syst. (CISS), Princeton, NJ, USA, Mar. 2014, pp. 1–5.
[22] A. Winkelbauer, S. Farthofer, and G. Matz, "The rate-information trade-off for Gaussian vector channels," in Proc. IEEE Int. Symp. Inf. Theory, Honolulu, USA, June 2014, pp. 2849–2853.
[23] A. Katz, M. Peleg, and S. Shamai, "Gaussian diamond primitive relay with oblivious processing," in Proc. IEEE Int. Conf. Microwaves, Antennas, Communications and Electronic Systems (COMCAS), Tel-Aviv, Israel, Nov. 2019, pp. 1–6.
[24] I. Estella Aguerri and A. Zaidi, "Distributed information bottleneck method for discrete and Gaussian sources," in Proc. Int. Zurich Seminar Inf. Commun. (IZS), Zurich, Switzerland, Feb. 2018, pp. 35–39.
[25] Y. Uğur, I. E. Aguerri, and A. Zaidi, "Vector Gaussian CEO problem under logarithmic loss and applications," IEEE Trans. Inf. Theory, vol. 66, no. 7, pp. 4183–4202, July 2020.
[26] G. Caire, S. Shamai, A. Tulino, S. Verdú, and C. Yapar, "Information bottleneck for an oblivious relay with channel state information: the scalar case," in Proc. IEEE Int. Conf. Science of Electrical Engineering in Israel (ICSEE), Eilat, Israel, Dec. 2018, pp. 1–5.
[27] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[28] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[29] T. Ratnarajah, "Topics in complex random matrices and information theory," Ph.D. dissertation, University of Ottawa (Canada), 2003.
[30] A. Edelman, "Eigenvalues and condition numbers of random matrices."
[31] E. Telatar, "Capacity of multi-antenna Gaussian channels," Europ. Trans.
Telecommun., vol. 10, no. 6, pp. 585–595, Nov.–Dec. 1999.
[32] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge University Press, 2011.
[33] A. M. Tulino, S. Verdú et al., Random Matrix Theory and Wireless Communications. Now Publishers, 2004.
[34] W. C. Lee, "Estimate of channel capacity in Rayleigh fading environment,"