A Tight Upper Bound on the Second-Order Coding Rate of the Parallel Gaussian Channel with Feedback
Silas L. Fong and Vincent Y. F. Tan
Abstract
This paper investigates the asymptotic expansion for the maximum rate of fixed-length codes over a parallel Gaussian channel with feedback under the following setting: A peak power constraint is imposed on every transmitted codeword, and the average error probabilities of decoding the transmitted message are non-vanishing as the blocklength increases. The main contribution of this paper is a self-contained proof of an upper bound on the first- and second-order asymptotics of the parallel Gaussian channel with feedback. The proof techniques involve developing an information spectrum bound followed by using Curtiss' theorem to show that a sum of dependent random variables associated with the information spectrum bound converges in distribution to a sum of independent random variables, thus facilitating the use of the usual central limit theorem. Combined with existing achievability results, our result implies that the presence of feedback does not improve the first- and second-order asymptotics.
Index Terms
Curtiss’ theorem, feedback, fixed-length codes, parallel Gaussian channel, second-order asymptotics
I. INTRODUCTION
This paper considers a point-to-point communication scenario where a source wants to transmit a message to a destination through a set of independent additive white Gaussian noise (AWGN) channels. The set of independent AWGN channels is referred to as the parallel Gaussian channel [1, Sec. 9.4] (also called the Gaussian product channel in [2, Sec. 3.4.3]). The parallel Gaussian channel has been used to model the multiple-input multiple-output (MIMO) channel [3, Sec. 7.1], an essential channel model in wireless communications. Suppose the parallel Gaussian channel consists of L independent AWGN channels, and let ℒ ≝ {1, 2, . . . , L} be the index set of the L channels. For the k-th channel use, the relation for the ℓ-th channel between the input signal X_{ℓ,k} and the output signal Y_{ℓ,k} is

Y_{ℓ,k} = X_{ℓ,k} + Z_{ℓ,k},   (1)

where {Z_{ℓ,k}}_{ℓ∈ℒ} are independent Gaussian noises. For each ℓ ∈ ℒ, the variance of the noise induced by the ℓ-th channel is assumed to be some positive number N_ℓ > 0 for all channel uses, i.e., Var[Z_{ℓ,k}] = N_ℓ for all k ∈ ℕ. To keep notation compact, let X_k, Y_k and Z_k denote the random column vectors [X_{1,k} X_{2,k} . . . X_{L,k}]^t, [Y_{1,k} Y_{2,k} . . . Y_{L,k}]^t and [Z_{1,k} Z_{2,k} . . . Z_{L,k}]^t respectively. Then, the channel law (1) can be written as

Y_k = X_k + Z_k.   (2)

Throughout this paper, we consider fixed-length codes over the parallel Gaussian channel, where the block length is denoted by n unless specified otherwise. Every codeword X^n transmitted by the source over n channel uses is subject to the following peak power constraint, where P > 0 denotes the permissible power for X^n:

P{ (1/n) Σ_{ℓ=1}^{L} Σ_{k=1}^{n} X²_{ℓ,k} ≤ P } = 1.   (3)

S. L. Fong and V. Y. F. Tan were supported by NUS Young Investigator Award under Grant R-263-000-B37-133. S. L. Fong is with the Department of Electrical and Computer Engineering, NUS, Singapore 117583 (e-mail: [email protected]). V. Y. F. Tan is with the Department of Electrical and Computer Engineering, NUS, Singapore 117583, and also with the Department of Mathematics, NUS, Singapore 119076 (e-mail: [email protected]). October 14, 2018 DRAFT

If we would like to transmit a uniformly distributed message W ∈ {1, 2, . . .
, ⌈e^{nR}⌉} over this channel, where the error probabilities are required to vanish as the blocklength n approaches infinity, it was shown by Shannon [4] that the maximum rate of communication R converges to a certain limit called the capacity. The closed-form expression of the capacity can be obtained by finding the optimal power allocation among the L channels, which is described as follows. Define the mapping C : ℝ^L_+ → ℝ_+ as

C(s) = Σ_{ℓ=1}^{L} (1/2) log(1 + s_ℓ/N_ℓ),   (4)

where s_ℓ can be viewed as the power allocated to channel ℓ. If we let Λ, P_1, P_2, . . . , P_L denote the L + 1 real numbers yielded by the water-filling algorithm [1, Ch. 9.4], where

Σ_{ℓ=1}^{L} P_ℓ = P   (5)

and

P_ℓ = max{0, Λ − N_ℓ}   (6)

for each ℓ ∈ ℒ, and let

P* ≝ [P_1 P_2 . . . P_L]^t   (7)

be the optimal power allocation vector, then the capacity of the parallel Gaussian channel was shown in [4] to be C(P*) nats per channel use. More specifically, if M*(n, ε, P) denotes the maximum number of messages that can be transmitted over n channel uses with permissible power P and average error probability ε, one has

lim_{ε→0} liminf_{n→∞} (1/n) log M*(n, ε, P) = C(P*).   (8)
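Water-filling in (5)-(6) determines Λ only implicitly, so a small numerical sketch may help. The following illustrative snippet (our own, not part of the paper; the noise variances and power budget are hypothetical) solves for the water level by bisection, evaluates C(P*) of (4) in nats, and also evaluates the Gaussian dispersion V(P*) of (10) below so that the normal approximation (9) can be computed.

```python
import math

def water_filling(P, N):
    """Find the water level Lam solving sum_l max(0, Lam - N_l) = P by
    bisection, then return the allocation P_l = max(0, Lam - N_l) of (5)-(6)."""
    lo, hi = min(N), max(N) + P
    for _ in range(200):
        lam = 0.5 * (lo + hi)
        if sum(max(0.0, lam - n) for n in N) < P:
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return [max(0.0, lam - n) for n in N]

def C(s, N):
    # C(s) of (4), in nats per channel use (all logarithms are base e)
    return sum(0.5 * math.log(1.0 + sl / nl) for sl, nl in zip(s, N))

def V(s, N):
    # Gaussian dispersion V(s) of (10) below, summed over the L channels
    total = 0.0
    for sl, nl in zip(s, N):
        snr = sl / nl
        total += snr * (snr + 2.0) / (2.0 * (snr + 1.0) ** 2)
    return total

N = [1.0, 2.0, 4.0]   # hypothetical noise variances N_1, ..., N_L
P = 10.0              # hypothetical total power budget
Pstar = water_filling(P, N)
assert abs(sum(Pstar) - P) < 1e-6   # (5): the budget is exhausted
# normal approximation (9): (1/n) log M* ~ C + sqrt(V/n) * Phi^{-1}(eps),
# here with n = 1000 and Phi^{-1}(0.001) ~ -3.0902
rate = C(Pstar, N) + math.sqrt(V(Pstar, N) / 1000.0) * (-3.0902)
```

Because Φ^{−1}(ε) < 0 for ε < 1/2, the second-order term is a backoff from capacity at practical blocklengths.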
The capacity result (8) has been strengthened by Polyanskiy-Poor-Verdú [5, Th. 78] and Tan-Tomamichel [6, Appendix A] for each ε ∈ (0, 1) as

(1/n) log M*(n, ε, P) = C(P*) + √(V(P*)/n) Φ^{−1}(ε) + Θ(log n / n),   (9)

where V : ℝ^L_+ → ℝ_+ is the Gaussian dispersion function defined as

V(s) = Σ_{ℓ=1}^{L} (s_ℓ/N_ℓ)(s_ℓ/N_ℓ + 2) / (2(s_ℓ/N_ℓ + 1)²)   (10)

and Φ is the cumulative distribution function (cdf) of the standard normal distribution.

Feedback, which is the focus of the current paper, can simplify coding schemes and improve the performance of communication systems in many scenarios. See [2, Ch. 17] for a thorough discussion of the benefits of feedback in single- and multi-user information theory. When feedback is allowed, each input symbol X_k depends not only on the transmitted message W but also on all the previous channel outputs up to the (k−1)-th channel use, i.e., the symbols (Y_1, Y_2, . . . , Y_{k−1}). In the presence of noiseless feedback, let M*_fb(n, ε, P) denote the maximum number of messages that can be transmitted over n channel uses with permissible power P and average error probability ε. It was shown by Shannon [7] that the presence of noiseless feedback does not increase the capacity of point-to-point memoryless channels, which together with (8) implies that

lim_{ε→0} liminf_{n→∞} (1/n) log M*_fb(n, ε, P) = C(P*).   (11)

In view of (9), we conclude that

(1/n) log M*_fb(n, ε, P) ≥ C(P*) + √(V(P*)/n) Φ^{−1}(ε) + Θ(log n / n).   (12)

In this paper, the main contribution is a conceptually simple, concise and self-contained proof that in the presence of feedback, the first- and second-order terms in the asymptotic expansion in (9) remain unchanged, i.e.,

(1/n) log M*_fb(n, ε, P) = C(P*) + √(V(P*)/n) Φ^{−1}(ε) + o(1/√n).   (13)

A. Related Work
Our work is inspired by the recent study of the fundamental limits of communication over discrete memoryless channels (DMCs) with feedback [8]. It was shown by Altuğ and Wagner [8, Th. 1] that for some classes of DMCs whose capacity-achieving input distributions are not unique (in particular, the minimum and maximum conditional information variances differ), coding schemes with feedback achieve better second-order asymptotics compared to those without feedback. They also showed [8, Th. 2] that feedback does not improve the second-order asymptotics of a DMC q_{Y|X} if the conditional variance of the log-likelihood ratio log(q_{Y|X}(Y|x)/p*(Y)), where p* is the unique capacity-achieving output distribution, does not depend on the input x. Such DMCs include the class of weakly-input symmetric DMCs initially studied by Polyanskiy-Poor-Verdú [9].
However, we note that the proof technique used by Altuğ and Wagner requires the use of a Berry-Esseen-type result for bounded martingale difference sequences [10], and their technique cannot be extended to the parallel Gaussian channel with feedback because each input symbol X_{ℓ,k} belongs to an interval [−√(nP), √(nP)] that grows without bound as n increases. Instead, our proof uses Curtiss' theorem to show that a sum of dependent random variables that naturally appears in the non-asymptotic analysis converges in distribution to a sum of independent random variables, thus facilitating the use of the usual central limit theorem [11].

For L = 1, the parallel Gaussian channel with feedback reduces to the AWGN channel with feedback, whose second-order coding rate is identical to that of the same channel without feedback by the following symmetry argument: the log-likelihood ratios log(q_{Y|X}(Y|x)/p*(Y)) for all x on the power sphere with radius √(nP) are the same. See [12] for a rigorous but simple proof. In contrast, for L > 1, this symmetry argument no longer holds due to the flexible power allocation among the L channels, and hence the simple proof suggested in [12] cannot be extended to the parallel Gaussian channel with feedback.

If the peak power constraint in (3) is replaced with the expected power constraint E[(1/n) Σ_{ℓ=1}^{L} Σ_{k=1}^{n} X²_{ℓ,k}] ≤ P, the first-order coding rate of the AWGN channel with feedback improves from C(P) to C(P/(1−ε)) [13, Sec. II] (the same improvement holds for the non-feedback case as well [5, Sec. 4.3.3]), where ε denotes the tolerable error probability. For the general case L > 1, the proof in [13, Sec. II] can be easily extended to show that the first-order coding rate of the parallel Gaussian channel with feedback improves from C(P*) to the water-filling capacity evaluated at total power P/(1−ε), and hence (13) no longer holds.

B. Paper Outline
This paper is organized as follows. The next subsection summarizes the notation used in this paper. Section II provides the problem setup of the parallel Gaussian channel with feedback under the peak power constraint and presents our main theorem. Section III contains the preliminaries required for the proof of our main theorem, which include the following: (i) important properties of non-asymptotic binary hypothesis testing quantities; (ii) modification of power allocation among the parallel channels; (iii) Curtiss' theorem. Section IV presents the proof of our main theorem. Section V concludes this paper by explaining the novel ingredients in the proof of the main theorem and the major difficulty in strengthening the main theorem.
C. Notation
The sets of natural numbers, non-negative integers, real numbers and non-negative real numbers are denoted by ℕ, ℤ_+, ℝ and ℝ_+ respectively. An L-dimensional column vector is denoted by a ≝ [a_1 a_2 . . . a_L]^t, where a_ℓ denotes the ℓ-th element of a. The Euclidean norm of a vector a ∈ ℝ^L is denoted by ‖a‖ ≝ √(Σ_{ℓ=1}^{L} a²_ℓ). We take all logarithms to base e throughout this paper.

We use P{E} to represent the probability of an event E, and we let 1{E} be the indicator function of E. Every random variable is denoted by a capital letter (e.g., X), and the realization and the alphabet of the random variable are denoted by the corresponding small letter (e.g., x) and calligraphic letter (e.g., 𝒳) respectively. We use X^n to denote a random tuple (X_1, X_2, . . . , X_n), where all the elements X_k have the same alphabet 𝒳. We let p_X be the probability distribution of a random variable X. More specifically, p_X is the Radon-Nikodym derivative of a measure with respect to the Lebesgue measure in an appropriate Euclidean space. We let p_{Y|X} denote the conditional probability distribution of Y given X for any random variables X and Y. We let p_X p_{Y|X} denote the joint distribution of (X, Y), i.e., p_X p_{Y|X}(x, y) = p_X(x) p_{Y|X}(y|x) for all x and y. For any random variable X ∼ p_X and any real-valued function g whose domain includes 𝒳, we let P_{p_X}{g(X) ≥ ξ} denote ∫_𝒳 p_X(x) 1{g(x) ≥ ξ} dx for any real constant ξ. The expectation and the variance of g(X) are denoted by E_{p_X}[g(X)] and Var_{p_X}[g(X)] respectively. For simplicity, we drop the subscript of a notation if there is no ambiguity. For any real-valued Gaussian random variable Z whose mean and variance are µ and σ² respectively, we let

N(z; µ, σ²) ≝ (1/√(2πσ²)) e^{−(z−µ)²/(2σ²)}   (14)

be the corresponding probability density function.

II. PARALLEL GAUSSIAN CHANNEL WITH FEEDBACK
Let s and d denote the source and the destination respectively. Suppose node s transmits a message to node d over n channel uses through the L independent AWGN channels. Before any transmission begins, node s chooses a message W destined for node d, where W is uniformly distributed on the message alphabet

𝒲 ≝ {1, 2, . . . , M}   (15)

whose size is denoted by M. For the k-th channel use, node s transmits X_k and the corresponding channel output Y_k satisfies (2). We assume that a noiseless feedback link from the destination node d to the source node s exists so that (W, Y^{k−1}) is available for encoding X_k for each k ∈ {1, 2, . . . , n}. In addition, the codeword X^n transmitted by s is subject to the peak power constraint (3). Upon receiving Y^n, node d declares Ŵ to be the transmitted message.

Definition 1: An (n, M, P)-feedback code consists of the following:

1) A message set 𝒲 at node s as defined in (15). Message W is uniform on 𝒲.

2) An encoding function f_{ℓ,k} : 𝒲 × ℝ^{L×(k−1)} → ℝ for each ℓ ∈ ℒ and each k ∈ {1, 2, . . . , n}, where f_{ℓ,k} is the encoding function at node s for encoding X_{ℓ,k} such that X_{ℓ,k} = f_{ℓ,k}(W, Y^{k−1}) and the peak power constraint (3) holds.

3) A decoding function ϕ : ℝ^{L×n} → 𝒲, where ϕ is the decoding function for W at node d such that Ŵ = ϕ(Y^n).

Definition 2:
Let X and Y denote the random vectors [X_1 X_2 . . . X_L]^t and [Y_1 Y_2 . . . Y_L]^t respectively, and let x and y be their realizations respectively. The parallel Gaussian channel with feedback is characterized by the conditional probability density function q_{Y|X} satisfying

q_{Y|X}(y|x) = Π_{ℓ=1}^{L} N(y_ℓ; x_ℓ, N_ℓ)   (16)

such that the following holds for any (n, M, P)-feedback code: For each k ∈ {1, 2, . . . , n},

p_{W, X^k, Y^k} = p_{W, X^k, Y^{k−1}} p_{Y_k|X_k}   (17)

where

p_{Y_k|X_k}(y_k|x_k) = q_{Y|X}(y_k|x_k)   (18)

for all (x^n, y^n) ∈ ℝ^{L×n} × ℝ^{L×n}.

For any (n, M, P)-feedback code, let p_{W, X^n, Y^n, Ŵ} be the joint distribution induced by the code. We can use Definition 1, (17) and (18) to factorize p_{W, X^n, Y^n, Ŵ} as follows:

p_{W, X^n, Y^n, Ŵ} = p_W ( Π_{k=1}^{n} p_{X_k|W, Y^{k−1}} p_{Y_k|X_k} ) p_{Ŵ|Y^n}.   (19)

Definition 3:
For an (n, M, P)-feedback code, we can calculate according to (19) the average probability of decoding error, defined as P{Ŵ ≠ W}. We call an (n, M, P)-feedback code with average probability of decoding error no larger than ε an (n, M, P, ε)-feedback code. Define

M*_fb(n, ε, P) ≝ max{M ∈ ℕ | There exists an (n, M, P, ε)-feedback code}.

Definition 4:
Let ε ∈ (0, 1). The ε-capacity of the parallel Gaussian channel with feedback, denoted by C^fb_ε, is defined to be

C^fb_ε ≝ liminf_{n→∞} (1/n) log M*_fb(n, ε, P).

The capacity is defined to be C^fb ≝ inf_{ε>0} C^fb_ε.

Definition 5:
Let ε ∈ (0, 1). The ε-second-order coding rate of the parallel Gaussian channel with feedback, denoted by L^fb_ε, is defined to be

L^fb_ε ≝ liminf_{n→∞} (1/√n) ( log M*_fb(n, ε, P) − n C^fb_ε ).

Recall how C(P*) and V(P*) are determined through (4), (5), (6), (7) and (10). Since the capacity of the parallel Gaussian channel without feedback is C(P*) (see, e.g., [4] and [2, Sec. 3.4.3]) and the introduction of an extra noiseless feedback link does not increase the capacity (see, e.g., [7] and [1, Sec. 9.6]), it follows that

C^fb = C(P*).   (20)

Before stating our main result, recall that Φ : (−∞, ∞) → (0, 1) is the cdf of the standard normal distribution. Since Φ is strictly increasing on (−∞, ∞), the inverse of Φ is well-defined and is denoted by Φ^{−1}. The following theorem is the main result in this paper.

Theorem 1:
Fix an ε ∈ (0, 1). Then,

C^fb_ε = C(P*)   (21)

and the ε-second-order coding rate satisfies

L^fb_ε ≤ √(V(P*)) Φ^{−1}(ε).   (22)

Combining (9) and Theorem 1, we complete the characterization of the first- and second-order asymptotics of the parallel Gaussian channel with feedback as shown in (13).

III. PRELIMINARIES FOR THE PROOF OF THEOREM 1

A. Binary Hypothesis Testing
The following definition concerning the non-asymptotic fundamental limits of a simple binary hypothesis test is standard. See, for example, [5, Sec. 2.3].
Definition 6:
Let p_X and q_X be two probability distributions on some common alphabet 𝒳. Let

𝒜({0, 1}|𝒳) ≝ { r_{Z|X} : Z and X assume values in {0, 1} and 𝒳 respectively }

be the set of randomized binary hypothesis tests between p_X and q_X, where {Z = 0} indicates the test chooses q_X, and let δ ∈ [0, 1] be a real number. The minimum type-II error in a simple binary hypothesis test between p_X and q_X with type-I error less than 1 − δ is defined as

β_δ(p_X ‖ q_X) ≝ inf_{ r_{Z|X} ∈ 𝒜({0,1}|𝒳) : ∫_𝒳 r_{Z|X}(1|x) p_X(x) dx ≥ δ } ∫_𝒳 r_{Z|X}(1|x) q_X(x) dx.   (23)

The existence of a minimizing test r_{Z|X} is guaranteed by the Neyman-Pearson lemma. We state in the following lemma and proposition some important properties of β_δ(p_X ‖ q_X), which are crucial for the proof of Theorem 1. The proof of the following lemma can be found in, for example, [14, Lemma 1].
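Before turning to those properties, it may help to see Definition 6 in a concrete case. The sketch below is our own illustration, not part of the paper: for p_X = N(0, 1) and q_X = N(µ, 1) with µ > 0, the likelihood ratio p_X/q_X is decreasing in x, so by the Neyman-Pearson lemma the optimal test accepts p_X on a half-line {x ≤ c}, giving β_δ = Φ(Φ^{−1}(δ) − µ); a Monte Carlo run cross-checks the resulting type-I and type-II errors.

```python
import math
import random

def Phi(x):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p, tol=1e-12):
    # inverse cdf by bisection (adequate for an illustration)
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def beta(delta, mu):
    """beta_delta(N(0,1) || N(mu,1)) for mu > 0: accept p on {x <= c},
    with c chosen so that the acceptance probability under p equals delta."""
    c = Phi_inv(delta)
    return Phi(c - mu)

# Monte Carlo cross-check of the type-I/type-II trade-off
random.seed(0)
delta, mu = 0.99, 1.0
c = Phi_inv(delta)
trials = 200_000
type_I_accept = sum(1 for _ in range(trials) if random.gauss(0, 1) <= c) / trials
type_II = sum(1 for _ in range(trials) if random.gauss(mu, 1) <= c) / trials
```

The quantity `type_I_accept` estimates the acceptance probability δ under p_X and `type_II` estimates β_δ under q_X; both should match the closed form up to sampling noise.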
Lemma 1:
Let p_X and q_X be two probability distributions on some 𝒳, and let g be a function whose domain contains 𝒳. Then, the following two statements hold:

1. (Data processing inequality (DPI)) β_δ(p_X ‖ q_X) ≤ β_δ(p_{g(X)} ‖ q_{g(X)}).

2. For all ξ > 0, β_δ(p_X ‖ q_X) ≥ (1/ξ) ( δ − ∫_𝒳 p_X(x) 1{ p_X(x)/q_X(x) ≥ ξ } dx ).

The proof of the following proposition can be found in [14, Lemma 3] (see also [15, Th. 27]).

Proposition 2:
Let p U,V be a probability distribution defined on
𝒲 × 𝒲 for some finite alphabet 𝒲, and let p_U be the marginal distribution of p_{U,V}. In addition, let q_V be a distribution defined on 𝒲. Suppose p_U is the uniform distribution, and let

α = P{U ≠ V}   (24)

be a real number in [0, 1], where (U, V) is distributed according to p_{U,V}. Then,

|𝒲| ≤ 1/β_{1−α}(p_{U,V} ‖ p_U q_V).   (25)

B. Modification of Power Allocation among the Parallel Channels
For each transmitted codeword x^n ∈ ℝ^{L×n}, we can view Σ_{k=1}^{n} x²_{ℓ,k} as the power allocated to the ℓ-th channel for each ℓ ∈ ℒ. In the proof of Theorem 1, an early step is to discretize the power allocated to the L channels. To this end, we need the following definition, which defines the power allocation vector of a sequence x^n ∈ ℝ^{L×n}.

Definition 7:
The power allocation mapping φ : ℝ^{L×n} → ℝ^L_+ is defined as

φ(x^n) = (1/n) [ Σ_{k=1}^{n} x²_{1,k}  Σ_{k=1}^{n} x²_{2,k}  . . .  Σ_{k=1}^{n} x²_{L,k} ]^t.   (26)

We call φ(x^n) the power type of x^n.

The proof of Theorem 1 involves modifying a given length-n code so that useful bounds on the performance of the given code can be obtained by analyzing the modified code. More specifically, the encoding functions of the given code are modified so that the power type of the random codeword generated by the modified code always falls into some small bounding box. The specific modification of the encoding functions is described in the following definition.

Definition 8:
Given an (n, M, P)-feedback code, let 𝒲, {f_{ℓ,k} | 1 ≤ ℓ ≤ L, 1 ≤ k ≤ n} and ϕ be the corresponding message alphabet, encoding functions and decoding function respectively. In addition, let γ ≥ 0 and s = [s_1 s_2 . . . s_L]^t ∈ ℝ^L_+ be such that Σ_{ℓ=1}^{L} s_ℓ = P. Then, the (γ, s)-modified code based on the (n, M, P)-feedback code consists of the following message alphabet, encoding functions and decoding function, denoted by 𝒲̃, {f̃_{ℓ,k} | 1 ≤ ℓ ≤ L, 1 ≤ k ≤ n} and ϕ̃ respectively:

• A message set 𝒲̃ = 𝒲 at node s. Message W is uniform on 𝒲̃.

• An encoding function f̃_{ℓ,k} : 𝒲 × ℝ^{L×(k−1)} → ℝ for each ℓ ∈ ℒ and each k ∈ {1, 2, . . . , n}, defined as follows. For each w ∈ 𝒲 and each y^{k−1} ∈ ℝ^{L×(k−1)}, define f̃_{ℓ,k} in a recursive manner in the order f̃_{1,1}, f̃_{2,1}, . . . , f̃_{L,1}, . . . , f̃_{1,n}, f̃_{2,n}, . . . , f̃_{L,n} as follows: For each k = 1, 2, . . . , n − 1, define f̃_{ℓ,k} recursively for ℓ = 1, 2, . . . , L as

f̃_{ℓ,k}(w, y^{k−1}) =
  f_{ℓ,k}(w, y^{k−1})   if f²_{ℓ,k}(w, y^{k−1}) + Σ_{i=1}^{k−1} f̃²_{ℓ,i}(w, y^{i−1}) ≤ n(s_ℓ + γ),
  0                     if f²_{ℓ,k}(w, y^{k−1}) + Σ_{i=1}^{k−1} f̃²_{ℓ,i}(w, y^{i−1}) > n(s_ℓ + γ).   (27)

It follows from (27) that

P{ ∩_{ℓ=1}^{L} { Σ_{i=1}^{n−1} f̃²_{ℓ,i}(W, Y^{i−1}) ≤ n(s_ℓ + γ) } } = 1   (28)

and

P{ Σ_{ℓ=1}^{L} Σ_{i=1}^{n−1} f̃²_{ℓ,i}(W, Y^{i−1}) ≤ Σ_{ℓ=1}^{L} Σ_{i=1}^{n−1} f²_{ℓ,i}(W, Y^{i−1}) } = 1.   (29)

In addition, in view of (28), we define f̃_{ℓ,n} recursively for ℓ = 1, 2, . . . , L − 1 as follows:

f̃_{ℓ,n}(w, y^{n−1}) =
  f_{ℓ,n}(w, y^{n−1})                                           if f²_{ℓ,n}(w, y^{n−1}) + Σ_{i=1}^{n−1} f̃²_{ℓ,i}(w, y^{i−1}) ∈ [n(s_ℓ − Lγ), n(s_ℓ + γ)],
  √( n(s_ℓ − Lγ) − Σ_{i=1}^{n−1} f̃²_{ℓ,i}(w, y^{i−1}) )         if f²_{ℓ,n}(w, y^{n−1}) + Σ_{i=1}^{n−1} f̃²_{ℓ,i}(w, y^{i−1}) < n(s_ℓ − Lγ),
  √( n(s_ℓ + γ) − Σ_{i=1}^{n−1} f̃²_{ℓ,i}(w, y^{i−1}) )          if f²_{ℓ,n}(w, y^{n−1}) + Σ_{i=1}^{n−1} f̃²_{ℓ,i}(w, y^{i−1}) > n(s_ℓ + γ).   (30)

Combining (27) and (30), we conclude that

P{ ∩_{ℓ=1}^{L−1} { Σ_{i=1}^{n} f̃²_{ℓ,i}(W, Y^{i−1}) ∈ [n(s_ℓ − Lγ), n(s_ℓ + γ)] } } = 1.   (31)

On the other hand, it follows from (29), (30), the fact P{ Σ_{ℓ=1}^{L} Σ_{k=1}^{n} f²_{ℓ,k}(W, Y^{k−1}) ≤ nP } = 1 and the assumption Σ_{ℓ=1}^{L} s_ℓ = P that

P{ Σ_{(ℓ,i) ∈ ℒ×{1,2,...,n}∖{(L,n)}} f̃²_{ℓ,i}(W, Y^{i−1}) ≤ nP } = 1.   (32)

Finally, in view of (32), we define f̃_{L,n} as

f̃_{L,n}(w, y^{n−1}) =
  f_{L,n}(w, y^{n−1})                                            if f²_{L,n}(w, y^{n−1}) + Σ_{(ℓ,i) ∈ ℒ×{1,2,...,n}∖{(L,n)}} f̃²_{ℓ,i}(w, y^{i−1}) = nP,
  √( nP − Σ_{(ℓ,i) ∈ ℒ×{1,2,...,n}∖{(L,n)}} f̃²_{ℓ,i}(w, y^{i−1}) )  otherwise,   (33)

where the square root in (33) is well-defined because of (32). Combining (31), (33) and the assumption that Σ_{ℓ=1}^{L} s_ℓ = P, we have

P{ ∩_{ℓ=1}^{L} { Σ_{i=1}^{n} f̃²_{ℓ,i}(W, Y^{i−1}) ∈ [n(s_ℓ − L²γ), n(s_ℓ + L²γ)] } ∩ { Σ_{ℓ=1}^{L} Σ_{i=1}^{n} f̃²_{ℓ,i}(W, Y^{i−1}) = nP } } = 1.   (34)

• A decoding function ϕ̃ = ϕ for providing an estimate of W at node d. □

Remark 1:
The basic idea behind transforming a code in Definition 8 is simple. Suppose we are given an (n, M, P)-feedback code, a γ ≥ 0 and an s = [s_1 s_2 . . . s_L]^t ∈ ℝ^L_+ such that Σ_{ℓ=1}^{L} s_ℓ = P. Then, the (γ, s)-modified code is formed by

(i) truncating a transmitted codeword if the power transmitted over the ℓ-th channel exceeds n(s_ℓ + γ), which can be seen from (27) and the third clause of (30);

(ii) boosting the power of the transmitted codeword if the power transmitted over the ℓ-th channel falls below n(s_ℓ − Lγ), which can be seen from the second clause of (30);

(iii) adjusting the last symbol transmitted over the L-th channel (i.e., X_{L,n}) so that the total transmitted power is exactly equal to nP, which can be seen from the second clause of (33).

Given an (n, M, P)-feedback code, we consider the corresponding (γ, s)-modified code constructed in Definition 8 and let p̃_{W, X^n, Y^n, Ŵ} be the distribution induced by the modified code. By (34), we see that

P_{p̃_{X^n}}{ ∩_{ℓ=1}^{L} { Σ_{k=1}^{n} X²_{ℓ,k} ∈ [n(s_ℓ − L²γ), n(s_ℓ + L²γ)] } ∩ { Σ_{ℓ=1}^{L} Σ_{k=1}^{n} X²_{ℓ,k} = nP } } = 1.   (35)

Define the ∆-bounding box

Γ^{(∆)}(s) ≝ [s_1 − ∆, s_1 + ∆] × [s_2 − ∆, s_2 + ∆] × . . . × [s_L − ∆, s_L + ∆]   (36)

for each ∆ ≥ 0 and each s ∈ ℝ^L_+. It then follows from (35) that

P_{p̃_{X^n}}{ { φ(X^n) ∈ Γ^{(L²γ)}(s) } ∩ { Σ_{ℓ=1}^{L} Σ_{k=1}^{n} X²_{ℓ,k} = nP } } = 1.   (37)

The following lemma is a natural consequence of Definition 8, and the proof is deferred to Appendix A.
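The truncation step (i) can be sketched in a few lines. The snippet below is our own sanity check with hypothetical numbers, not part of the paper; it treats a single channel ℓ and suppresses the dependence of each encoding function on the feedback outputs y^{k−1}, which does not affect the clipping logic. It applies the per-symbol rule of (27) and confirms that the running power never exceeds n(s_ℓ + γ) and never exceeds the unmodified power, cf. (28) and (29).

```python
import random

def truncate_channel(symbols, n, s_ell, gamma):
    """Per-symbol truncation as in (27): keep a symbol only if its square,
    added to the running power of the already-modified symbols, stays within
    the budget n*(s_ell + gamma); otherwise transmit 0."""
    budget = n * (s_ell + gamma)
    modified, running = [], 0.0
    for x in symbols:
        if running + x * x <= budget:
            modified.append(x)
            running += x * x
        else:
            modified.append(0.0)
    return modified

random.seed(1)
n, s_ell, gamma = 100, 2.0, 0.1
# hypothetical raw encoder outputs that (deliberately) overspend the budget
raw = [random.gauss(0, 2.0) for _ in range(n)]
mod = truncate_channel(raw, n, s_ell, gamma)
power = sum(x * x for x in mod)
assert power <= n * (s_ell + gamma) + 1e-9      # cf. (28) for this channel
assert power <= sum(x * x for x in raw) + 1e-9  # truncation never adds power
```

The boosting and total-power steps (ii) and (iii) only touch the final symbols f̃_{ℓ,n}, so they do not change this running-power invariant.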
Lemma 3:
Given an (n, M, P)-feedback code, let p_{X^n, Y^n} be the distribution induced by the code. Fix any γ ≥ 0 and any s ∈ ℝ^L_+ such that Σ_{ℓ=1}^{L} s_ℓ = P, and let p̃_{X^n, Y^n} be the distribution induced by the (γ, s)-modified code based on the (n, M, P)-feedback code. Then, we have

∫_A p_{X^n, Y^n}(x^n, y^n) 1{ φ(x^n) ∈ Γ^{(γ)}(s) } 1{ Σ_{ℓ=1}^{L} Σ_{k=1}^{n} x²_{ℓ,k} = nP } dx^n dy^n ≤ ∫_A p̃_{X^n, Y^n}(x^n, y^n) dx^n dy^n   (38)

for all Borel measurable A ⊆ ℝ^{L×n} × ℝ^{L×n}.

C. Curtiss' Theorem
Curtiss' theorem [16, Th. 3] states that convergence of moment generating functions implies convergence in distribution. The formal statement is reproduced below.
Theorem 2 (Curtiss’ theorem):
Let {U^{(n)}} be a sequence of real-valued random variables. Suppose there exists a random variable V such that

lim_{n→∞} E[ e^{tU^{(n)}} ] = E[ e^{tV} ]   (39)

for all t ∈ ℝ. Then,

lim_{n→∞} P{ U^{(n)} ≤ a } = P{ V ≤ a }   (40)

for every a ∈ ℝ at which a ↦ P{V ≤ a} is continuous.

In contrast to the more well-known Lévy's continuity theorem [17, Sec. 18.1], (39) of Theorem 2 is required to be true for all real rather than purely imaginary t.

IV. PROOF OF THEOREM 1

Fix an ε ∈ (0, 1) and choose an arbitrary sequence of (n̄, M*_fb(n̄, ε, P), P, ε)-feedback codes. Since

C^fb_ε ≥ C(P*)   (41)

by (20), it suffices to show that

liminf_{n→∞} (1/√n) ( log M*_fb(n, ε, P) − n C(P*) ) ≤ √(V(P*)) Φ^{−1}(ε + τ)   (42)

for all τ > 0. To this end, fix an arbitrary τ > 0.

A. Discretizing the Power Allocation Vectors by Appending Symbols
Using Definition 1, we have

P{ Σ_{ℓ=1}^{L} Σ_{k=1}^{n̄} X²_{ℓ,k} ≤ n̄P } = 1

for the chosen (n̄, M*_fb(n̄, ε, P), P, ε)-feedback code for each n̄ ∈ ℕ. Given the chosen (n̄, M*_fb(n̄, ε, P), P, ε)-feedback code, we can always construct an (n̄ + L, M*_fb(n̄, ε, P), P, ε)-feedback code by appending a carefully chosen tuple (X_{n̄+1}, X_{n̄+2}, . . . , X_{n̄+L}) to each transmitted codeword X^{n̄} generated by the (n̄, M*_fb(n̄, ε, P), P, ε)-feedback code such that

P{ Σ_{k=1}^{n̄+L} X²_{ℓ,k} = P ⌈ (1/P) Σ_{k=1}^{n̄} X²_{ℓ,k} ⌉ for all ℓ ∈ ℒ } = 1,   (43)

which implies that

P{ Σ_{k=1}^{n̄+L} X²_{ℓ,k} is a multiple of P for all ℓ ∈ ℒ and Σ_{ℓ=1}^{L} Σ_{k=1}^{n̄+L} X²_{ℓ,k} ≤ (n̄ + L)P } = 1.   (44)

In addition, given the (n̄ + L, M*_fb(n̄, ε, P), P, ε)-feedback code, we can always construct an (n̄ + L + 1, M*_fb(n̄, ε, P), P, ε)-feedback code by appending a carefully chosen X_{n̄+L+1} to each transmitted codeword X^{n̄+L} generated by the (n̄ + L, M*_fb(n̄, ε, P), P, ε)-feedback code such that

P{ Σ_{k=1}^{n̄+L+1} X²_{ℓ,k} is a multiple of P for all ℓ ∈ ℒ and Σ_{ℓ=1}^{L} Σ_{k=1}^{n̄+L+1} X²_{ℓ,k} = (n̄ + L + 1)P } = 1.   (45)

To simplify notation, we let

n ≝ n̄ + L + 1.   (46)

Construct the set of power allocation vectors

S^(n) ≝ { (P/n) · a | a ∈ ℤ^L_+, Σ_{ℓ=1}^{L} a_ℓ = n },   (47)

which can be viewed as a set of quantized power allocation vectors s with quantization level P/n that satisfy the equality power constraint Σ_{ℓ=1}^{L} s_ℓ = P. It follows from (47), (45) and Definition 7 that

|S^(n)| ≤ n^L   (48)

and

P{ φ(X^n) ∈ S^(n) } = 1.   (49)

B. Obtaining a Lower Bound on the Error Probability in Terms of the Type-II Error of a Hypothesis Test
Let p_{W, X^n, Y^n, Ŵ} be the probability distribution induced by the (n, M*_fb(n̄, ε, P), P, ε)-feedback code constructed above for each n ∈ {L + 2, L + 3, . . .}, where p_{W, X^n, Y^n, Ŵ} is obtained according to (19). Fix an n ∈ {L + 2, L + 3, . . .} and the corresponding (n, M*_fb(n̄, ε, P), P, ε)-feedback code. Recall the definition of P_ℓ for each ℓ ∈ ℒ in (6) and define the distribution

r_{Y^n, Ŵ} ≝ p_{Ŵ|Y^n} r_{Y^n}   (50)

where

r_{Y^n}(y^n) ≝ (1/(2|S^(n)|)) Σ_{s ∈ S^(n)} Π_{ℓ=1}^{L} Π_{k=1}^{n} N(y_{ℓ,k}; 0, s_ℓ + N_ℓ) + (1/2) Π_{ℓ=1}^{L} Π_{k=1}^{n} N(y_{ℓ,k}; 0, P_ℓ + N_ℓ).   (51)

The choice of r_{Y^n} in (51) is motivated by the choice of the auxiliary output distribution in [18, Sec. X-A], where DMCs are considered. Then, it follows from Proposition 2 and Definition 1 with the identifications U ≡ W, V ≡ Ŵ, p_{U,V} ≡ p_{W,Ŵ}, q_V ≡ r_Ŵ, |𝒲| ≡ M*_fb(n̄, ε, P) and α ≡ P{Ŵ ≠ W} ≤ ε that

β_{1−ε}(p_{W,Ŵ} ‖ p_W r_Ŵ) ≤ β_{1−α}(p_{W,Ŵ} ‖ p_W r_Ŵ) ≤ 1/M*_fb(n̄, ε, P).   (52)

C. Obtaining a Non-Asymptotic Bound from Simplifying the Type-II Error of the Binary Hypothesis Test
Using the DPI of β_{1−ε} by introducing X^n and Y^n, we have

β_{1−ε}(p_{W,Ŵ} ‖ p_W r_Ŵ) ≥ β_{1−ε}( p_{W, X^n, Y^n, Ŵ} ‖ p_W r_{Y^n, Ŵ} Π_{k=1}^{n} p_{X_k|W, Y^{k−1}} )   (53)

where

p_{W, X^n, Y^n, Ŵ} = p_W p_{Ŵ|Y^n} Π_{k=1}^{n} ( p_{X_k|W, Y^{k−1}} p_{Y_k|X_k} )   (54)

by (19). Combining (53), (54) and (50), we have

β_{1−ε}(p_{W,Ŵ} ‖ p_W r_Ŵ) ≥ β_{1−ε}( p_W p_{Ŵ|Y^n} Π_{k=1}^{n} ( p_{X_k|W, Y^{k−1}} p_{Y_k|X_k} ) ‖ p_W r_{Y^n} p_{Ŵ|Y^n} Π_{k=1}^{n} p_{X_k|W, Y^{k−1}} ).   (55)

Fix any constant ξ_n > 0 to be specified later. Using Lemma 1, (55) and (18), we have

β_{1−ε}(p_{W,Ŵ} ‖ p_W r_Ŵ) ≥ (1/ξ_n) ( 1 − ε − P_{p_{X^n, Y^n}}{ Π_{k=1}^{n} q_{Y|X}(Y_k|X_k) / r_{Y^n}(Y^n) ≥ ξ_n } ),   (56)

which together with (52) implies that

log M*_fb(n̄, ε, P) ≤ log ξ_n − log( 1 − ε − P_{p_{X^n, Y^n}}{ log( Π_{k=1}^{n} q_{Y|X}(Y_k|X_k) / r_{Y^n}(Y^n) ) ≥ log ξ_n } ).   (57)

D. Splitting the Probability Term into Multiple Terms Corresponding to Different Power Types of X^n

(Footnote: We note that even if we exclude the set of power types in the set Π^(n), which is defined later in (58), this leads to another valid definition of r_{Y^n}(y^n). The conclusion of this proof remains unchanged if the n^{−1/6} term in (58) is replaced by n^{−a} for any a ∈ (0, 1/2).)

Define

Π^(n) ≝ { s ∈ S^(n) | ‖s − P*‖ ≤ n^{−1/6} }   (58)
(7)).Following (57), we use (49) to obtain P p X n, Y n (cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) = P p X n, Y n (cid:26)(cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) ∩ n φ ( X n ) ∈ Π ( n ) o(cid:27) + X s ∈S ( n ) \ Π ( n ) P p X n, Y n (cid:26)(cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) ∩ { φ ( X n ) = s } (cid:27) . (59)In order to bound the first term in (59), we let γ def = 1 n / (60)and define p ∗ X n , Y n be the distribution induced by the ( γ, P ∗ ) -modified code based on the ( n, M ∗ fb (¯ n, ε, P ) , P, ε ) -feedback code defined in Definition 8. Then, consider the following chain of inequalities: P p X n, Y n (cid:26)(cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) ∩ n φ ( X n ) ∈ Π ( n ) o(cid:27) ≤ P p ∗ X n, Y n (cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) (61) ≤ P p ∗ X n, Y n ( n X k =1 log (cid:18) q Y | X ( Y k | X k ) Q Lℓ =1 N ( Y ℓ,k ; 0 , P ℓ + N ℓ ) (cid:19) ≥ log ξ n − log 2 ) (62)where • (61) is due to Lemma 3 and the fact that P p X n, Y n nP Lℓ =1 P nk =1 X ℓ,k = nP o = 1 (cf. (47) and (49)). 
• (62) is due to the definition of r Y n in (51).Similarly, in order to bound the second term in (59), we let p ( s ) X n , Y n be the distribution induced by the (0 , s ) -modifiedcode and consider the following chain of inequalities for each s ∈ S ( n ) \ Π ( n ) : P p X n, Y n (cid:26)(cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) ∩ { φ ( X n ) = s } (cid:27) ≤ P p ( s ) X n, Y n (cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) (63) ≤ P p ( s ) X n, Y n ( n X k =1 log (cid:18) q Y | X ( Y k | X k ) Q Lℓ =1 N ( Y ℓ,k ; 0 , s ℓ + N ℓ ) (cid:19) ≥ log ξ n − log (cid:0) |S ( n ) | (cid:1)) (64)where • (63) is due to Lemma 3. • (64) is due to the definition of r Y n in (51).Combining (59), (62), (64) and the definition of q Y | X in (16) followed by letting Z ℓ,k def = Y ℓ,k − X ℓ,k (65) October 14, 2018 DRAFT5 for each ℓ ∈ L and each k ∈ { , , . . . , n } , we obtain P p X n, Y n (cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) ≤ P p ∗ X n, Y n ( n C( P ∗ ) + n X k =1 L X ℓ =1 − (cid:0) P ℓ N ℓ (cid:1) Z ℓ,k + 2 X ℓ,k Z ℓ,k + X ℓ,k P ℓ + N ℓ ) ≥ log ξ n − log 2 ) + X s ∈S ( n ) \ Π ( n ) P p ( s ) X n, Y n ( n C( s ) + n X k =1 L X ℓ =1 − (cid:0) s ℓ N ℓ (cid:1) Z ℓ,k + 2 X ℓ,k Z ℓ,k + X ℓ,k s ℓ + N ℓ ) ≥ log ξ n − log (cid:0) |S ( n ) | (cid:1)) (66)where C( · ) is as defined in (4). In order to simplify the RHS of (66), we define ξ n > such that log ξ n def = n C( P ∗ ) + √ n (cid:16)p V( P ∗ ) Φ − ( ε + τ ) (cid:17) + log (cid:0) |S ( n ) | (cid:1) . (67)In addition, for each d ∈ R L + , let U ( d ) k def = L X ℓ =1 − (cid:0) d ℓ N ℓ (cid:1) Z ℓ,k + 2 X ℓ,k Z ℓ,k + d ℓ d ℓ + N ℓ ) (68)for each k ∈ { , , . . . , n } . 
By using (66), (67) and (68), together with the facts implied by (37) that
$$\mathrm{P}_{p^*_{X^n,Y^n}}\left\{\left\{\phi(X^n)\in\Gamma^{(L\gamma)}(P^*)\right\}\cap\left\{\sum_{\ell=1}^L\sum_{k=1}^n X_{\ell,k}^2 = nP\right\}\right\} = 1 \quad (69)$$
and
$$\mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\phi(X^n)\in\Gamma^{(0)}(s)\right\} = 1 \quad (70)$$
for each $s\in\mathcal{S}^{(n)}$, we can express (66) as
$$
\mathrm{P}_{p_{X^n,Y^n}}\left\{\log\left(\frac{\prod_{k=1}^n q_{Y|X}(Y_k|X_k)}{r_{Y^n}(Y^n)}\right)\ge \log\xi_n\right\}
\le \mathrm{P}_{p^*_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n\mathrm{V}(P^*)}}\sum_{k=1}^n U_k^{(P^*)} \ge \Phi^{-1}(\varepsilon+\tau)\right\}
+ \sum_{s\in\mathcal{S}^{(n)}\setminus\Pi^{(n)}}\mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)} \ge \sqrt{n}\big(\mathrm{C}(P^*)-\mathrm{C}(s)\big) + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)\right\}. \quad (71)
$$

E. Applying Curtiss' Theorem When $\phi(X^n)$ is Close to $P^*$

In order to simplify the first term in (71), we define
$$V_k^{(P^*)} \overset{\mathrm{def}}{=} \sum_{\ell=1}^L \frac{-\big(\frac{P_\ell}{N_\ell}\big)Z_{\ell,k}^2 + 2\sqrt{P_\ell}\,Z_{\ell,k} + P_\ell}{2(P_\ell+N_\ell)} \quad (72)$$
for each $k\in\{1,2,\ldots,n\}$ and want to show that
$$\lim_{n\to\infty} \mathrm{E}_{p^*_{X^n,Y^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n U_k^{(P^*)}}\right] = \lim_{n\to\infty} \mathrm{E}_{p^*_{Z^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n V_k^{(P^*)}}\right] \quad (73)$$
for all $t\in\mathbb{R}$, where
$$p^*_{Z^n}(z^n) \overset{\mathrm{def}}{=} \prod_{k=1}^n\prod_{\ell=1}^L \mathcal{N}(z_{\ell,k};0,N_\ell). \quad (74)$$
To this end, recall the following statements due to the channel law:
(i) $Z_{\ell,k}\sim\mathcal{N}(z_{\ell,k};0,N_\ell)$ for all $\ell\in\mathcal{L}$ and all $k\in\{1,2,\ldots,n\}$;
(ii) $\{Z_{\ell,k}\,|\,\ell\in\mathcal{L},\, k\in\{1,2,\ldots,n\}\}$ are independent;
(iii) $Z_k$ and $(X^k, Y^{k-1}, Z^{k-1})$ are independent for all $k\in\{1,2,\ldots,n\}$.
For any $t\in\mathbb{R}$ and any $n\in\{L+2, L+3, \ldots\}$ such that $n\ge t^2$, since
$$P_\ell - L\gamma \le \frac{1}{n}\sum_{k=1}^n X_{\ell,k}^2 \le P_\ell + L\gamma \quad (75)$$
by (69) and $P_\ell + N_\ell + \frac{tP_\ell}{\sqrt{n}} > 0$ for all $\ell\in\mathcal{L}$, we have
$$
\mathrm{E}_{p^*_{X^n,Y^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n U_k^{(P^*)}} \cdot e^{t^2\sum_{\ell=1}^L \frac{N_\ell\left(P_\ell - L\gamma - \frac{1}{n}\sum_{k=1}^n X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left(P_\ell+N_\ell+\frac{tP_\ell}{\sqrt{n}}\right)}}\right]
\le \mathrm{E}_{p^*_{X^n,Y^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n U_k^{(P^*)}}\right] \quad (76)
$$
$$
\le \mathrm{E}_{p^*_{X^n,Y^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n U_k^{(P^*)}} \cdot e^{t^2\sum_{\ell=1}^L \frac{N_\ell\left(P_\ell + L\gamma - \frac{1}{n}\sum_{k=1}^n X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left(P_\ell+N_\ell+\frac{tP_\ell}{\sqrt{n}}\right)}}\right]. \quad (77)
$$
In order to simplify the above chain of inequalities, we need the following lemma, whose proof is deferred to Appendix B because it involves straightforward calculations based on (68), (72) and the channel law.

Lemma 4:
For any $\lambda\in\mathbb{R}$, we have
$$
\mathrm{E}_{p^*_{X^n,Y^n}}\left[e^{\lambda\sum_{k=1}^n U_k^{(P^*)}} \cdot e^{\lambda^2\sum_{\ell=1}^L \frac{N_\ell\left(nP_\ell - \sum_{k=1}^n X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}}\right]
= \mathrm{E}_{p^*_{Z^n}}\left[e^{\lambda\sum_{k=1}^n V_k^{(P^*)}}\right]. \quad (78)
$$
Lemma 4, which forms the crux of the proof of Theorem 1, is important because it establishes the equivalence in distribution between the sum $\sum_{k=1}^n U_k^{(P^*)}$, which contains dependent random variables, and the sum $\sum_{k=1}^n V_k^{(P^*)}$, which contains independent random variables. The former is intractable to analyze, while the latter can be analyzed in a straightforward manner by invoking the central limit theorem.
Using Lemma 4, we can simplify (77) through the identification $\lambda \equiv \frac{t}{\sqrt{n}}$ and obtain
$$
\mathrm{E}_{p^*_{Z^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n V_k^{(P^*)}}\right] \cdot e^{-t^2 L\gamma \sum_{\ell=1}^L \frac{N_\ell}{2(P_\ell+N_\ell)\left(P_\ell+N_\ell+\frac{tP_\ell}{\sqrt{n}}\right)}}
\le \mathrm{E}_{p^*_{X^n,Y^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n U_k^{(P^*)}}\right] \quad (79)
$$
$$
\le \mathrm{E}_{p^*_{Z^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n V_k^{(P^*)}}\right] \cdot e^{t^2 L\gamma \sum_{\ell=1}^L \frac{N_\ell}{2(P_\ell+N_\ell)\left(P_\ell+N_\ell+\frac{tP_\ell}{\sqrt{n}}\right)}}. \quad (80)
$$
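A Lemma-4-style identity can be probed numerically. The sketch below is not the paper's proof: it checks the single-channel ($L = 1$) analogue of (78) by Monte Carlo, using an arbitrary hypothetical feedback policy $X_k = \sqrt{P}\cos(Z_{k-1})$ as a stand-in for the modified code, since the telescoping argument relies only on property (iii) above, namely that $Z_k$ is independent of $(X^k, Z^{k-1})$. It also checks that the summands $V_k$ of (72) are zero-mean with variance $P(P+2N)/(2(P+N)^2)$, the familiar per-channel Gaussian dispersion (in nats$^2$):

```python
import math
import random

def mgf_check(P=1.0, N=1.0, lam=0.3, n=4, trials=200_000, seed=1):
    """Monte Carlo estimate of both sides of the L = 1 analogue of (78):
    E[exp(lam*sum U_k) * exp(lam^2 * N * (n*P - sum X_k^2) / (2(P+N)((1+lam)P+N)))]
      vs  E[exp(lam*sum V_k)],
    where X_k is an adapted (feedback-dependent) input."""
    random.seed(seed)
    c = N / (2 * (P + N) * ((1 + lam) * P + N))
    lhs = rhs = 0.0
    for _ in range(trials):
        z_prev, su, sv, sx2 = 0.0, 0.0, 0.0, 0.0
        for _k in range(n):
            x = math.sqrt(P) * math.cos(z_prev)          # hypothetical feedback input
            z = random.gauss(0.0, math.sqrt(N))          # fresh channel noise
            su += (-(P / N) * z * z + 2 * x * z + P) / (2 * (P + N))              # U_k, cf. (68)
            sv += (-(P / N) * z * z + 2 * math.sqrt(P) * z + P) / (2 * (P + N))   # V_k, cf. (72)
            sx2 += x * x
            z_prev = z
        lhs += math.exp(lam * su + lam * lam * c * (n * P - sx2))
        rhs += math.exp(lam * sv)
    return lhs / trials, rhs / trials

P, N, lam, n = 1.0, 1.0, 0.3, 4
lhs, rhs = mgf_check(P, N, lam, n)
# closed form of the common value, cf. (113)/(114):
D = (1 + lam) * P + N
closed = ((P + N) / D) ** (n / 2) * math.exp(n * lam * (1 + lam) * P / (2 * D))
assert abs(lhs - rhs) < 0.02 and abs(rhs - closed) < 0.02

# moments of V_k: E[Z^2] = N kills the first and last terms, so E[V_k] = 0,
# and Var[V_k] = P(P + 2N) / (2 (P + N)^2)
random.seed(3)
vs = [(-z * z + 2 * z + 1) / 4 for z in (random.gauss(0.0, 1.0) for _ in range(200_000))]
mv = sum(vs) / len(vs)
var = sum(v * v for v in vs) / len(vs) - mv * mv
assert abs(mv) < 0.01 and abs(var - 0.375) < 0.01   # 0.375 = 1*(1+2)/(2*2^2)
```

The closed form used in the assertion is exactly the common value that the recursion of Appendix B produces, so the check exercises both the identity and the telescoping constant.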
Combining (80) and (60), we conclude that (73) holds for each $t\in\mathbb{R}$, since the factors $e^{\pm t^2 L\gamma(\cdots)}$ in (79) and (80) converge to 1 as $n\to\infty$ (because $\gamma\to 0$). Since the moment generating functions of $\frac{1}{\sqrt{n}}\sum_{k=1}^n V_k^{(P^*)}$ and $\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(P^*)}$ converge to the same function, it follows from Curtiss' theorem [16, Th. 3] (as stated in Theorem 2) that
$$\lim_{n\to\infty}\mathrm{P}_{p^*_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n\mathrm{V}(P^*)}}\sum_{k=1}^n U_k^{(P^*)} \ge \Phi^{-1}(\varepsilon+\tau)\right\} = \lim_{n\to\infty}\mathrm{P}_{p^*_{Z^n}}\left\{\frac{1}{\sqrt{n\mathrm{V}(P^*)}}\sum_{k=1}^n V_k^{(P^*)} \ge \Phi^{-1}(\varepsilon+\tau)\right\}. \quad (81)$$
Recognizing that $\{V_k^{(P^*)}\}_{k=1}^\infty$ are i.i.d. zero-mean random variables with variance $\mathrm{V}(P^*)$ by the definition of $V_k^{(P^*)}$ in (72) and the definition of $\mathrm{V}(P^*)$ in (10), we apply the central limit theorem [11] and obtain
$$\lim_{n\to\infty}\mathrm{P}_{p^*_{Z^n}}\left\{\frac{1}{\sqrt{n\mathrm{V}(P^*)}}\sum_{k=1}^n V_k^{(P^*)} \le \Phi^{-1}(\varepsilon+\tau)\right\} = \varepsilon+\tau, \quad (82)$$
which together with (81) implies that
$$\lim_{n\to\infty}\mathrm{P}_{p^*_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n\mathrm{V}(P^*)}}\sum_{k=1}^n U_k^{(P^*)} \ge \Phi^{-1}(\varepsilon+\tau)\right\} = 1-\varepsilon-\tau. \quad (83)$$

F. Applying Large Deviation Bounds When $\phi(X^n)$ is Far from $P^*$

In order to bound the second term in (71), we consider a fixed $n\in\{L+2, L+3, \ldots\}$ and want to show that there exists some $\kappa>0$ such that
$$\mathrm{C}(P^*)-\mathrm{C}(s) \ge \kappa\,\|P^*-s\|^2 \quad (84)$$
for all $s\in\mathcal{S}^{(n)}$. To this end, we first define the Lagrangian function $f:\mathbb{R}_+^L\to\mathbb{R}$ as
$$f(d) \overset{\mathrm{def}}{=} \mathrm{C}(d) - \Lambda\left(\sum_{\ell=1}^L d_\ell - P\right) + \sum_{\ell=1}^L \mu_\ell d_\ell \quad (85)$$
where $\Lambda\ge 0$ is the unique number that satisfies (5) and (6), and $\mu_\ell\ge 0$ is defined for each $\ell\in\mathcal{L}$ as
$$\mu_\ell \overset{\mathrm{def}}{=} \begin{cases} 0 & \text{if } P_\ell>0,\\ \Lambda - \frac{1}{2N_\ell} & \text{if } P_\ell=0. \end{cases} \quad (86)$$
Define $N_{\max} \overset{\mathrm{def}}{=} \max_{\ell\in\mathcal{L}} N_\ell$. Then for all $s\in\mathcal{S}^{(n)}$, we use Taylor's theorem to obtain
$$f(s) = f(P^*) + (s-P^*)^t\,\nabla f(P^*) + \frac{1}{2}(s-P^*)^t\,\nabla^2 f(\bar s)\,(s-P^*) \quad (87)$$
for some $\bar s$ that lies on the line segment connecting $s$ and $P^*$, where $\nabla f(P^*)$ denotes the gradient, which satisfies
$$\nabla f(P^*) = \mathbf{0}, \quad (88)$$
and $\nabla^2 f(\bar s)$ denotes the Hessian matrix, which satisfies
$$(s-P^*)^t\,\nabla^2 f(\bar s)\,(s-P^*) \le -\frac{\|s-P^*\|^2}{2(N_{\max}+P)^2}. \quad (89)$$
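The quadratic gap that the curvature bound (89) ultimately delivers, i.e., $\mathrm{C}(P^*)-\mathrm{C}(s) \ge \kappa\|P^*-s\|^2$ as in (84), can be checked numerically in a toy setting. The sketch below takes a single channel ($L=1$, so that $P^* = P$ and the per-channel capacity is $\frac{1}{2}\log(1+s/N)$ in nats) with hypothetical values $P=2$, $N=1$:

```python
import math

def C(s: float, N: float) -> float:
    """Capacity of one AWGN channel with power s and noise variance N, in nats."""
    return 0.5 * math.log(1 + s / N)

P, N = 2.0, 1.0                      # hypothetical power and noise variance
kappa = 1 / (4 * (N + P) ** 2)       # cf. (93), with N_max = N for L = 1

# quadratic gap (84): C(P) - C(s) >= kappa * (P - s)^2 on the whole feasible range
for i in range(1001):
    s = P * i / 1000
    assert C(P, N) - C(s, N) >= kappa * (P - s) ** 2 - 1e-12
```

The assertion holds because the derivative of the gap $C(P,N)-C(s,N)-\kappa(P-s)^2$ in $s$ is nonpositive on $[0,P]$ and the gap vanishes at $s=P$, mirroring the Taylor-plus-Hessian argument of (87)-(92).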
For the sake of completeness, the derivations of (88) and (89) are contained in Appendix C. Combining (87), (88) and (89), we have for all $s\in\mathcal{S}^{(n)}$
$$f(P^*)-f(s) \ge \frac{\|s-P^*\|^2}{4(N_{\max}+P)^2}, \quad (90)$$
which together with the definitions of $f$ and $\mu_\ell$ in (85) and (86) respectively implies that
$$\mathrm{C}(P^*)-\mathrm{C}(s) \ge \frac{\|s-P^*\|^2}{4(N_{\max}+P)^2} + \sum_{\ell=1}^L \mu_\ell(s_\ell-P_\ell) \quad (91)$$
$$\ge \frac{\|s-P^*\|^2}{4(N_{\max}+P)^2}. \quad (92)$$
Consequently, (84) holds by setting
$$\kappa \overset{\mathrm{def}}{=} \frac{1}{4(N_{\max}+P)^2}. \quad (93)$$
Following (71), we consider for each $s\in\mathcal{S}^{(n)}\setminus\Pi^{(n)}$
$$\mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)} \ge \sqrt{n}\big(\mathrm{C}(P^*)-\mathrm{C}(s)\big) + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)\right\}
\le \mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)} \ge \kappa\sqrt{n}\,\|P^*-s\|^2 + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)\right\} \quad (94)$$
$$\le \mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)} \ge \kappa\, n^{1/6} + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)\right\} \quad (95)$$
where
• (94) is due to (84).
• (95) follows from the definition of $\Pi^{(n)}$ in (58).
Following the standard approach for obtaining large deviation bounds, we apply Markov's inequality to the RHS of (95) and obtain for each $s\in\mathcal{S}^{(n)}\setminus\Pi^{(n)}$
$$\mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)} \ge \sqrt{n}\big(\mathrm{C}(P^*)-\mathrm{C}(s)\big) + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)\right\}
\le \frac{\mathrm{E}_{p^{(s)}_{X^n,Y^n}}\left[e^{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)}}\right]}{e^{\kappa n^{1/6} + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)}}. \quad (96)$$
In order to bound the RHS of (96), consider the following chain of inequalities for each $s\in\mathcal{S}^{(n)}\setminus\Pi^{(n)}$:
$$\mathrm{E}_{p^{(s)}_{X^n,Y^n}}\left[e^{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)}}\right]
= \prod_{\ell=1}^L\left(\frac{s_\ell+N_\ell}{(1+n^{-1/2})s_\ell+N_\ell}\right)^{n/2} e^{\sum_{\ell=1}^L\left(\frac{\sqrt{n}\,s_\ell}{2(s_\ell+N_\ell)} + \frac{N_\ell s_\ell}{2(s_\ell+N_\ell)\left((1+n^{-1/2})s_\ell+N_\ell\right)}\right)} \quad (97)$$
$$\le \prod_{\ell=1}^L\left[\left(1 - \frac{n^{-1/2}s_\ell}{(1+n^{-1/2})s_\ell+N_\ell}\right)^{n/2} e^{\frac{\sqrt{n}\,s_\ell}{2(s_\ell+N_\ell)}}\right] \cdot e^{\sum_{\ell=1}^L \frac{N_\ell s_\ell}{2(s_\ell+N_\ell)^2}} \quad (98)$$
$$\le e^{\sum_{\ell=1}^L\left(\frac{s_\ell^2}{2\left((1+n^{-1/2})s_\ell+N_\ell\right)(s_\ell+N_\ell)} + \frac{N_\ell s_\ell}{2(s_\ell+N_\ell)^2}\right)} \quad (99)$$
$$\le e^{\sum_{\ell=1}^L \frac{s_\ell}{2(s_\ell+N_\ell)}} \quad (100)$$
$$\le e^{L/2}, \quad (101)$$
where
• (97) follows from straightforward calculations based on the definition of $U_k^{(s)}$ in (68), the property of $p^{(s)}_{X^n,Y^n}$ in (70) and the channel law, which are elaborated in Appendix D for the sake of completeness.
• (99) is due to the fact that $(1-x)e^x \le 1$ for all $x>0$.
Combining (96) and (101), we have the following large deviation bound for each $s\in\mathcal{S}^{(n)}\setminus\Pi^{(n)}$:
$$\mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)} \ge \sqrt{n}\big(\mathrm{C}(P^*)-\mathrm{C}(s)\big) + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)\right\} \le \frac{e^{L/2}}{e^{\kappa n^{1/6} + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)}}. \quad (102)$$
Following (71), we use (102) and (48) to obtain
$$\lim_{n\to\infty}\sum_{s\in\mathcal{S}^{(n)}\setminus\Pi^{(n)}}\mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)} \ge \sqrt{n}\big(\mathrm{C}(P^*)-\mathrm{C}(s)\big) + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)\right\} = 0. \quad (103)$$

G. Combining Earlier Bounds to Obtain an Upper Bound on $L_\varepsilon^{\mathrm{fb}}$

Combining (57), (67), (71), (83), (103) and (48), we have
$$\log M^*_{\mathrm{fb}}(\bar n,\varepsilon,P) \le n\mathrm{C}(P^*) + \sqrt{n}\,\sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau) + \log\big(n^L\big) - \log(\tau/2) \quad (104)$$
for all sufficiently large $n$, which together with (46) implies that
$$\liminf_{\bar n\to\infty} \frac{1}{\sqrt{\bar n}}\Big(\log M^*_{\mathrm{fb}}(\bar n,\varepsilon,P) - \bar n\,\mathrm{C}(P^*)\Big) \le \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau). \quad (105)$$
Since $\tau>0$ is arbitrary, it follows from (105) and Definition 5 that
$$L_\varepsilon^{\mathrm{fb}} \le \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon). \quad (106)$$

V. CONCLUDING REMARKS
A. Novel Ingredients in Proof of Theorem 1
As mentioned in Section I-A, the proof of [8, Th. 2], which obtains upper bounds on the second-order asymptotics of DMCs with feedback, cannot be generalized to the parallel Gaussian channel with feedback. Indeed, the proof of Theorem 1 follows the standard procedures for obtaining the second-order asymptotics of DMCs without feedback (see, e.g., [5, proof of Th. 50] and [19, Sec. III]) except for the following three novel ingredients:
1) Instead of classifying transmitted codewords into polynomially many type classes based on their empirical distributions, which is generally not possible for channels with a continuous input alphabet, we discretize the transmitted power and classify the codewords into polynomially many type classes based on their discretized power types. In particular, the collection of power type classes $\mathcal{S}^{(n)}$ in (47) plays a key role in our analysis, and there are polynomially many power type classes by (48). The details can be found in Section IV-A in the proof.
2) Curtiss' theorem rather than the Berry-Esseen theorem is invoked to bound the information spectrum term (the first term in (71)) related to transmitted codewords whose types are close to the optimal power allocation. In particular, the Berry-Esseen theorem for bounded martingale difference sequences cannot be used to bound the information spectrum term in the presence of feedback, because each input symbol $X_{\ell,k}$ belongs to an interval $[-\sqrt{nP}, \sqrt{nP}]$ that grows unbounded as $n$ increases. Instead, we apply Curtiss' theorem to show that the distribution of the sum of random variables in the information spectrum term converges to some distribution generated from a sum of i.i.d. random variables (i.e., (73)), thus facilitating the use of the usual central limit theorem [11]. The details can be found in Section IV-E.
3) In order to bound the information spectrum term related to transmitted codewords whose types are far from the optimal power allocation (the second term in (71)), the usual approach is to bound the information spectrum term by an average of exponentially many upper bounds, each of which is then further simplified by invoking Chebyshev's inequality [18, Sec. X-A]. However, due to the presence of feedback, the information spectrum term can be expressed only as a sum (instead of an average) of polynomially many upper bounds, as shown in the second term in (71). In order to control this sum of polynomially many upper bounds, we have to resort to the large deviation bounds shown in (102) rather than the weaker Chebyshev's inequality. The details can be found in Section IV-F.

B. Major Difficulties in Tightening the Third-Order Term
If the feedback link is absent, the third-order term of the optimal finite blocklength rate $\frac{1}{n}\log M^*(n,\varepsilon,P)$ is $\Theta\big(\frac{\log n}{n}\big)$, as shown in (9) in Section I. That third-order term can be obtained by applying the Berry-Esseen theorem to bound an information spectrum term (analogous to the first term in (71)).
In the presence of feedback, Theorem 1 shows that the third-order term is $o\big(\frac{1}{\sqrt{n}}\big)$. If we want to compute an explicit upper bound on the third-order term using the current proof technique, an intuitive way is to invoke a non-asymptotic version of Curtiss' theorem that can measure the proximity between two distributions based on the proximity between their moment generating functions. However, such a non-asymptotic version of Curtiss' theorem does not exist to the best of our knowledge, which prohibits us from strengthening the current $o\big(\frac{1}{\sqrt{n}}\big)$ bound on the third-order term. It is worth noting that (77) and (80) in our proof break down if the moment generating functions are replaced with characteristic functions. If one can find a way to make characteristic functions amenable to our proof approach, then a non-asymptotic version of Lévy's continuity theorem known as Esseen's smoothing lemma (see, e.g., [20, Th. 1.5.2]) may be invoked to tighten the third-order term herein.
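Curtiss' theorem is the step that converts the pointwise MGF convergence (73) into the convergence in distribution used in (81). A toy, self-contained illustration of the kind of convergence the theorem consumes (standardized sums of i.i.d. Exp(1) variables, a textbook example unrelated to the paper's channel) is sketched below; the MGF of the standardized sum is available in closed form, so no sampling is needed:

```python
import math

def mgf_std_exp_sum(t: float, n: int) -> float:
    """Exact MGF of (sum_{i<=n} (E_i - 1)) / sqrt(n) for i.i.d. E_i ~ Exp(1).
    Valid for t < sqrt(n); equals (e^{-u} / (1 - u))^n with u = t / sqrt(n)."""
    u = t / math.sqrt(n)
    return (math.exp(-u) / (1 - u)) ** n

# pointwise convergence to the standard Gaussian MGF exp(t^2 / 2)
for t in (-1.0, 0.5, 1.5):
    gap_small = abs(mgf_std_exp_sum(t, 100) - math.exp(t * t / 2))
    gap_large = abs(mgf_std_exp_sum(t, 1_000_000) - math.exp(t * t / 2))
    assert gap_large < gap_small   # the gap shrinks as n grows
    assert gap_large < 1e-2
```

The leading error term here is of order $t^3/\sqrt{n}$ in the exponent, which hints at why a quantitative (non-asymptotic) version of Curtiss' theorem, if one existed, could plausibly yield an explicit third-order bound.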
C. Future Work
The techniques presented herein may be used to analyze the fixed-error asymptotics of fixed-length codes over parallel DMCs with cost constraints and with feedback. We envision that there will be an added layer of complexity, as the method of types [21] is typically used to analyze the fixed-error asymptotics of DMCs with and without cost constraints [22]. Hence, we anticipate that two applications of the method of types will be needed: one for handling the power types that specify the power allocation among the parallel channels (as was done in Section IV-A), and another for handling codewords of the same power type but of different empirical distributions (the usual types). In the present work, the latter complication is ameliorated by the fact that the maximum rate achievable by a coding scheme over an AWGN channel is completely determined by the power allocated to the coding scheme.

APPENDIX A
PROOF OF LEMMA 3

Proof:
Let $f_{\ell,k}$ and $\tilde f_{\ell,k}$ be the encoding functions of the $(n,M,P)$-feedback code and the $(\gamma,s)$-modified code respectively for each $\ell\in\mathcal{L}$ and each $k\in\{1,2,\ldots,n\}$. For any $w\in\mathcal{W}$ and any $y^{n-1}\in\mathbb{R}^{L\times(n-1)}$ such that
$$\phi\Big(\big[f_{1,1}(w)\;\cdots\;f_{L,1}(w)\big]^t, \ldots, \big[f_{1,n}(w,y^{n-1})\;\cdots\;f_{L,n}(w,y^{n-1})\big]^t\Big) \in \Gamma^{(\gamma)}(s) \quad (107)$$
and
$$\sum_{\ell=1}^L\sum_{k=1}^n f_{\ell,k}(w,y^{k-1})^2 = nP, \quad (108)$$
it follows from (27), (30) and (33) in Definition 8 that
$$\Big(\big[f_{1,1}(w)\;\cdots\;f_{L,1}(w)\big]^t, \ldots, \big[f_{1,n}(w,y^{n-1})\;\cdots\;f_{L,n}(w,y^{n-1})\big]^t\Big)
= \Big(\big[\tilde f_{1,1}(w)\;\cdots\;\tilde f_{L,1}(w)\big]^t, \ldots, \big[\tilde f_{1,n}(w,y^{n-1})\;\cdots\;\tilde f_{L,n}(w,y^{n-1})\big]^t\Big). \quad (109)$$
Since (109) holds for any $w\in\mathcal{W}$ and $y^{n-1}\in\mathbb{R}^{L\times(n-1)}$ that satisfy (107) and (108), it follows that (38) holds for all Borel measurable $A\subseteq\mathbb{R}^{L\times n}\times\mathbb{R}^{L\times n}$.

APPENDIX B
PROOF OF LEMMA 4

The proof for the special case $L=1$ is contained in [12, Sec. E]. For a general $L\in\mathbb{N}$, consider the following chain of equalities for each $m = n, n-1, \ldots, 1$:
$$
\mathrm{E}\left[e^{\lambda\sum_{k=1}^m U_k^{(P^*)}} \cdot e^{\lambda^2\sum_{\ell=1}^L \frac{N_\ell\left(nP_\ell - \sum_{k=1}^m X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}}\right]
= \mathrm{E}\left[\mathrm{E}\left[e^{\lambda\sum_{k=1}^m U_k^{(P^*)}} \cdot e^{\lambda^2\sum_{\ell=1}^L \frac{N_\ell\left(nP_\ell - \sum_{k=1}^m X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}}\,\Bigg|\, X^m, Z^{m-1}\right]\right] \quad (110)
$$
$$
= \mathrm{E}\left[e^{\lambda\sum_{k=1}^{m-1} U_k^{(P^*)}} \cdot e^{\lambda^2\sum_{\ell=1}^L \frac{N_\ell\left(nP_\ell - \sum_{k=1}^{m-1} X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}}
\cdot \int_{\mathbb{R}^L} p^*_{Z_m}(z_m)\, e^{\lambda U_m^{(P^*)}} \cdot e^{-\lambda^2\sum_{\ell=1}^L \frac{N_\ell X_{\ell,m}^2}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}}\, \mathrm{d}z_m\right] \quad (111)
$$
$$
= e^{\lambda\sum_{\ell=1}^L \frac{P_\ell}{2(P_\ell+N_\ell)}} \sqrt{\prod_{\ell=1}^L \frac{P_\ell+N_\ell}{(1+\lambda)P_\ell+N_\ell}}
\cdot \mathrm{E}\left[e^{\lambda\sum_{k=1}^{m-1} U_k^{(P^*)}} \cdot e^{\lambda^2\sum_{\ell=1}^L \frac{N_\ell\left(nP_\ell - \sum_{k=1}^{m-1} X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}}\right] \quad (112)
$$
where
• (111) is due to the fact that $Z_m$ and $(X^m, Z^{m-1})$ are independent.
• (112) is due to the definition of $U_k^{(P^*)}$ in (68).
Applying (112) recursively from $m=n$ to $m=1$, we have
$$
\mathrm{E}\left[e^{\lambda\sum_{k=1}^n U_k^{(P^*)}} \cdot e^{\lambda^2\sum_{\ell=1}^L \frac{N_\ell\left(nP_\ell - \sum_{k=1}^n X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}}\right]
= \left(\prod_{\ell=1}^L \frac{P_\ell+N_\ell}{(1+\lambda)P_\ell+N_\ell}\right)^{n/2} e^{n\sum_{\ell=1}^L\left(\frac{\lambda P_\ell}{2(P_\ell+N_\ell)} + \frac{\lambda^2 N_\ell P_\ell}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}\right)}. \quad (113)
$$
On the other hand, straightforward calculations based on the definition of $V_k^{(P^*)}$ in (72) and the fact that $\{Z_k\}_{k=1}^n$ are independent imply that
$$
\mathrm{E}\left[e^{\lambda\sum_{k=1}^n V_k^{(P^*)}}\right]
= \left(\prod_{\ell=1}^L \frac{P_\ell+N_\ell}{(1+\lambda)P_\ell+N_\ell}\right)^{n/2} e^{n\sum_{\ell=1}^L\left(\frac{\lambda P_\ell}{2(P_\ell+N_\ell)} + \frac{\lambda^2 N_\ell P_\ell}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}\right)}. \quad (114)
$$
Combining (113) and (114), we obtain (78).

APPENDIX C
DERIVATIONS OF (88)
AND (89)

Straightforward calculations based on (85) reveal that for all $s\in\mathbb{R}_+^L$,
$$\nabla f(s) = \begin{bmatrix} \frac{1}{2(N_1+s_1)} - \Lambda + \mu_1 \\ \vdots \\ \frac{1}{2(N_L+s_L)} - \Lambda + \mu_L \end{bmatrix} \quad (115)$$
and $\nabla^2 f(s)$ is the diagonal matrix that satisfies
$$\nabla^2 f(s) = \mathrm{diag}\left(-\frac{1}{2(N_1+s_1)^2}, \ldots, -\frac{1}{2(N_L+s_L)^2}\right). \quad (116)$$
Combining (115), (5), (6) and (86), we have $\nabla f(P^*) = \mathbf{0}$. In addition, for any $s$ such that $\sum_{\ell=1}^L s_\ell \le P$, it follows from (116) that $N_\ell + s_\ell \le N_{\max} + P$ for all $\ell\in\mathcal{L}$, which then implies that (89) holds for all $s\in\mathcal{S}^{(n)}$.

APPENDIX D
DERIVATION OF (97)

Let $t = n^{-1/2}$ and fix any $s\in\mathcal{S}^{(n)}\setminus\Pi^{(n)}$. Due to (70), it suffices to show that
$$
\mathrm{E}_{p^{(s)}_{X^n,Y^n}}\left[e^{t\sum_{k=1}^n U_k^{(s)}} \cdot e^{t^2\sum_{\ell=1}^L \frac{N_\ell\left(ns_\ell - \sum_{k=1}^n X_{\ell,k}^2\right)}{2(s_\ell+N_\ell)\left((1+t)s_\ell+N_\ell\right)}}\right]
= \left(\prod_{\ell=1}^L \frac{s_\ell+N_\ell}{(1+t)s_\ell+N_\ell}\right)^{n/2} e^{n\sum_{\ell=1}^L\left(\frac{t s_\ell}{2(s_\ell+N_\ell)} + \frac{t^2 N_\ell s_\ell}{2(s_\ell+N_\ell)\left((1+t)s_\ell+N_\ell\right)}\right)}. \quad (117)
$$
Replacing $P^*$ with $s$ in the steps leading to (112) and (113), we obtain (117).

ACKNOWLEDGMENTS
The authors would like to thank the Associate Editor, Prof. Shun Watanabe, and the two anonymous reviewers for their useful comments, which improved the presentation of this paper.

REFERENCES

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: John Wiley and Sons, 2006.
[2] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge, U.K.: Cambridge University Press, 2012.
[3] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge, U.K.: Cambridge University Press, 2005.
[4] C. E. Shannon, "Communication in the presence of noise," Proceedings of the IRE, vol. 37, no. 1, pp. 10-21, 1949.
[5] Y. Polyanskiy, "Channel coding: Non-asymptotic fundamental limits," Ph.D. dissertation, Princeton University, 2010.
[6] V. Y. F. Tan and M. Tomamichel, "The third-order term in the normal approximation for the AWGN channel," IEEE Trans. Inf. Theory, vol. 61, no. 5, pp. 2430-2438, 2015.
[7] C. E. Shannon, "The zero error capacity of a noisy channel," IRE Transactions on Information Theory, vol. 2, no. 3, pp. 8-19, 1956.
[8] Y. Altuğ and A. B. Wagner, "Feedback can improve the second-order coding performance in discrete memoryless channels," in Proc. IEEE Intl. Symp. Inf. Theory, Honolulu, HI, Jul. 2014, pp. 2361-2365.
[9] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Feedback in the non-asymptotic regime," IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 4903-4925, 2011.
[10] M. El Machkouri and L. Ouchti, "Exact convergence rates in the central limit theorem for a class of martingales," Bernoulli, vol. 13, no. 4, pp. 981-999, Nov. 2007.
[11] W. Feller, An Introduction to Probability Theory and Its Applications, vol. 2, 2nd ed. Hoboken, NJ: John Wiley and Sons, 1971.
[12] S. L. Fong and V. Y. F. Tan, "Asymptotic expansions for the AWGN channel with feedback under a peak power constraint," in Proc. IEEE Intl. Symp. Inf. Theory, Hong Kong, Jun. 2015, pp. 311-315.
[13] L. V. Truong, S. L. Fong, and V. Y. F. Tan, "On Gaussian channels with feedback under expected power constraints and with non-vanishing error probabilities," IEEE Trans. Inf. Theory, vol. 63, no. 3, pp. 1746-1765, 2017.
[14] L. Wang, R. Colbeck, and R. Renner, "Simple channel coding bounds," in Proc. IEEE Intl. Symp. Inf. Theory, Seoul, South Korea, Jun./Jul. 2009, pp. 1804-1808.
[15] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Channel coding rate in the finite blocklength regime," IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307-2359, 2010.
[16] J. H. Curtiss, "A note on the theory of moment generating functions," Ann. Math. Statist., vol. 13, no. 4, pp. 430-433, 1942.
[17] D. Williams, Probability with Martingales. Cambridge, U.K.: Cambridge University Press, 1991.
[18] M. Hayashi, "Information spectrum approach to second-order coding rate in channel coding," IEEE Trans. Inf. Theory, vol. 55, no. 11, pp. 4947-4966, 2009.
[19] M. Tomamichel and V. Y. F. Tan, "A tight upper bound for the third-order asymptotics for most discrete memoryless channels," IEEE Trans. Inf. Theory, vol. 59, no. 11, pp. 7041-7051, 2013.
[20] I. A. Ibragimov and Yu. V. Linnik, Independent and Stationary Sequences of Random Variables, J. F. C. Kingman, Ed. Groningen, Netherlands: Wolters-Noordhoff Publishing, 1971.
[21] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge, U.K.: Cambridge University Press, 2011.
[22] V. Strassen, "Asymptotische Abschätzungen in Shannons Informationstheorie," in Trans. Third Prague Conf. Inf. Theory, 1962.