A Tight Upper Bound on the Second-Order Coding Rate of the Parallel Gaussian Channel with Feedback
Silas L. Fong and Vincent Y. F. Tan
Abstract
This paper investigates the asymptotic expansion for the maximum rate of fixed-length codes over a parallel Gaussian channel with feedback under the following setting: A peak power constraint is imposed on every transmitted codeword, and the average error probabilities of decoding the transmitted message are non-vanishing as the blocklength increases. The main contribution of this paper is a self-contained proof of an upper bound on the first- and second-order asymptotics of the parallel Gaussian channel with feedback. The proof techniques involve developing an information spectrum bound followed by using Curtiss' theorem to show that a sum of dependent random variables associated with the information spectrum bound converges in distribution to a sum of independent random variables, thus facilitating the use of the usual central limit theorem. Combined with existing achievability results, our result implies that the presence of feedback does not improve the first- and second-order asymptotics.
Index Terms
Curtiss’ theorem, feedback, fixed-length codes, parallel Gaussian channel, second-order asymptotics
I. INTRODUCTION
This paper considers a point-to-point communication scenario where a source wants to transmit a message to a destination through a set of independent additive white Gaussian noise (AWGN) channels. The set of independent AWGN channels is referred to as the parallel Gaussian channel [1, Sec. 9.4] (also called the Gaussian product channel in [2, Sec. 3.4.3]). The parallel Gaussian channel has been used to model the multiple-input multiple-output (MIMO) channel [3, Sec. 7.1], an essential channel model in wireless communications. Suppose the parallel Gaussian channel consists of L independent AWGN channels, and let ℒ ≝ {1, 2, . . . , L} be the index set of the L channels. For the k-th channel use, the relation for the ℓ-th channel between the input signal X_{ℓ,k} and the output signal Y_{ℓ,k} is

Y_{ℓ,k} = X_{ℓ,k} + Z_{ℓ,k},   (1)

where {Z_{ℓ,k}}_{ℓ∈ℒ} are independent Gaussian noises. For each ℓ ∈ ℒ, the variance of the noise induced by the ℓ-th channel is assumed to be some positive number N_ℓ > 0 for all channel uses, i.e., Var[Z_{ℓ,k}] = N_ℓ for all k ∈ ℕ. To keep notation compact, let X_k, Y_k and Z_k denote the random column vectors [X_{1,k} X_{2,k} . . . X_{L,k}]^t, [Y_{1,k} Y_{2,k} . . . Y_{L,k}]^t and [Z_{1,k} Z_{2,k} . . . Z_{L,k}]^t respectively. Then, the channel law (1) can be written as

Y_k = X_k + Z_k.   (2)

Throughout this paper, we consider fixed-length codes over the parallel Gaussian channel, where the block length is denoted by n unless specified otherwise. Every codeword X^n transmitted by the source over n channel uses is subject to the following peak power constraint, where P > 0 denotes the permissible power for X^n:

P{ (1/n) Σ_{ℓ=1}^{L} Σ_{k=1}^{n} X²_{ℓ,k} ≤ P } = 1.   (3)

S. L. Fong and V. Y. F. Tan were supported by NUS Young Investigator Award under Grant R-263-000-B37-133. S. L. Fong is with the Department of Electrical and Computer Engineering, NUS, Singapore 117583 (e-mail: [email protected]). V. Y. F. Tan is with the Department of Electrical and Computer Engineering, NUS, Singapore 117583, and also with the Department of Mathematics, NUS, Singapore 119076 (e-mail: [email protected]). October 14, 2018 DRAFT

If we would like to transmit a uniformly distributed message W ∈ {1, 2, . . .
, ⌈e^{nR}⌉} over this channel, where the error probabilities are required to vanish as the blocklength n approaches infinity, it was shown by Shannon [4] that the maximum rate of communication R converges to a certain limit called the capacity. The closed-form expression of the capacity can be obtained by finding the optimal power allocation among the L channels, which is described as follows. Define the mapping C : ℝ^L_+ → ℝ_+ as

C(s) = Σ_{ℓ=1}^{L} (1/2) log(1 + s_ℓ/N_ℓ),   (4)

where s_ℓ can be viewed as the power allocated to channel ℓ. If we let Λ, P_1, P_2, . . . , P_L denote the L + 1 real numbers yielded by the water-filling algorithm [1, Ch. 9.4], where

Σ_{ℓ=1}^{L} P_ℓ = P   (5)

and

P_ℓ = max{0, Λ − N_ℓ}   (6)

for each ℓ ∈ ℒ, and let

P* ≝ [P_1 P_2 . . . P_L]^t   (7)

be the optimal power allocation vector, then the capacity of the parallel Gaussian channel was shown in [4] to be C(P*) nats per channel use. More specifically, if M*(n, ε, P) denotes the maximum number of messages that can be transmitted over n channel uses with permissible power P and average error probability ε, one has

lim_{ε→0} liminf_{n→∞} (1/n) log M*(n, ε, P) = C(P*).   (8)
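Water-filling in (5)-(6) determines Λ only implicitly, so a small numerical sketch may help. The following illustrative snippet (our own, not part of the paper; the noise variances and power budget are hypothetical) solves for the water level by bisection, evaluates C(P*) of (4) in nats, and also evaluates the Gaussian dispersion V(P*) of (10) below so that the normal approximation (9) can be computed.

```python
import math

def water_filling(P, N):
    """Find the water level Lam solving sum_l max(0, Lam - N_l) = P by
    bisection, then return the allocation P_l = max(0, Lam - N_l) of (5)-(6)."""
    lo, hi = min(N), max(N) + P
    for _ in range(200):
        lam = 0.5 * (lo + hi)
        if sum(max(0.0, lam - n) for n in N) < P:
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return [max(0.0, lam - n) for n in N]

def C(s, N):
    # C(s) of (4), in nats per channel use (all logarithms are base e)
    return sum(0.5 * math.log(1.0 + sl / nl) for sl, nl in zip(s, N))

def V(s, N):
    # Gaussian dispersion V(s) of (10) below, summed over the L channels
    total = 0.0
    for sl, nl in zip(s, N):
        snr = sl / nl
        total += snr * (snr + 2.0) / (2.0 * (snr + 1.0) ** 2)
    return total

N = [1.0, 2.0, 4.0]   # hypothetical noise variances N_1, ..., N_L
P = 10.0              # hypothetical total power budget
Pstar = water_filling(P, N)
assert abs(sum(Pstar) - P) < 1e-6   # (5): the budget is exhausted
# normal approximation (9): (1/n) log M* ~ C + sqrt(V/n) * Phi^{-1}(eps),
# here with n = 1000 and Phi^{-1}(0.001) ~ -3.0902
rate = C(Pstar, N) + math.sqrt(V(Pstar, N) / 1000.0) * (-3.0902)
```

Because Φ^{−1}(ε) < 0 for ε < 1/2, the second-order term is a backoff from capacity at practical blocklengths.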
The capacity result (8) has been strengthened by Polyanskiy-Poor-Verdú [5, Th. 78] and Tan-Tomamichel [6, Appendix A] for each ε ∈ (0, 1) as

(1/n) log M*(n, ε, P) = C(P*) + √(V(P*)/n) Φ^{−1}(ε) + Θ(log n / n),   (9)

where V : ℝ^L_+ → ℝ_+ is the Gaussian dispersion function defined as

V(s) = Σ_{ℓ=1}^{L} (s_ℓ/N_ℓ)(s_ℓ/N_ℓ + 2) / (2(s_ℓ/N_ℓ + 1)²)   (10)

and Φ is the cumulative distribution function (cdf) of the standard normal distribution.

Feedback, which is the focus of the current paper, can simplify coding schemes and improve the performance of communication systems in many scenarios. See [2, Ch. 17] for a thorough discussion of the benefits of feedback in single- and multi-user information theory. When feedback is allowed, each input symbol X_k depends not only on the transmitted message W but also on all the previous channel outputs up to the (k−1)-th channel use, i.e., the symbols (Y_1, Y_2, . . . , Y_{k−1}). In the presence of noiseless feedback, let M*_fb(n, ε, P) denote the maximum number of messages that can be transmitted over n channel uses with permissible power P and average error probability ε. It was shown by Shannon [7] that the presence of noiseless feedback does not increase the capacity of point-to-point memoryless channels, which together with (8) implies that

lim_{ε→0} liminf_{n→∞} (1/n) log M*_fb(n, ε, P) = C(P*).   (11)

In view of (9), we conclude that

(1/n) log M*_fb(n, ε, P) ≥ C(P*) + √(V(P*)/n) Φ^{−1}(ε) + Θ(log n / n).   (12)

In this paper, the main contribution is a conceptually simple, concise and self-contained proof that in the presence of feedback, the first- and second-order terms in the asymptotic expansion in (9) remain unchanged, i.e.,

(1/n) log M*_fb(n, ε, P) = C(P*) + √(V(P*)/n) Φ^{−1}(ε) + o(1/√n).   (13)

A. Related Work
Our work is inspired by the recent study of the fundamental limits of communication over discrete memoryless channels (DMCs) with feedback [8]. It was shown by Altuğ and Wagner [8, Th. 1] that for some classes of DMCs whose capacity-achieving input distributions are not unique (in particular, the minimum and maximum conditional information variances differ), coding schemes with feedback achieve better second-order asymptotics compared to those without feedback. They also showed [8, Th. 2] that feedback does not improve the second-order asymptotics of a DMC q_{Y|X} if the conditional variance of the log-likelihood ratio log(q_{Y|X}(Y|x)/p*(Y)), where p* is the unique capacity-achieving output distribution, does not depend on the input x. Such DMCs include the class of weakly-input symmetric DMCs initially studied by Polyanskiy-Poor-Verdú [9].
However, we note that the proof technique used by Altuğ and Wagner requires the use of a Berry-Esseen-type result for bounded martingale difference sequences [10], and their technique cannot be extended to the parallel Gaussian channel with feedback because each input symbol X_{ℓ,k} belongs to an interval [−√(nP), √(nP)] that grows without bound as n increases. Instead, our proof uses Curtiss' theorem to show that a sum of dependent random variables that naturally appears in the non-asymptotic analysis converges in distribution to a sum of independent random variables, thus facilitating the use of the usual central limit theorem [11].

For L = 1, the parallel Gaussian channel with feedback reduces to the AWGN channel with feedback, whose second-order coding rate is identical to that of the same channel without feedback by the following symmetry argument: the log-likelihood ratios log(q_{Y|X}(Y|x)/p*(Y)) for all x on the power sphere with radius √(nP) are the same. See [12] for a rigorous but simple proof. In contrast, for L > 1, this symmetry argument no longer holds due to the flexible power allocation among the L channels, and hence the simple proof suggested in [12] cannot be extended to the parallel Gaussian channel with feedback.

If the peak power constraint in (3) is replaced with the expected power constraint E[(1/n) Σ_{ℓ=1}^{L} Σ_{k=1}^{n} X²_{ℓ,k}] ≤ P, the first-order coding rate of the AWGN channel with feedback improves from C(P) to C(P/(1−ε)) [13, Sec. II] (the same improvement holds for the non-feedback case as well [5, Sec. 4.3.3]), where ε denotes the tolerable error probability. For the general case L > 1, the proof in [13, Sec. II] can be easily extended to show that the first-order coding rate of the parallel Gaussian channel with feedback improves from C(P*) to the water-filling capacity evaluated at total power P/(1−ε), and hence (13) no longer holds.

B. Paper Outline
This paper is organized as follows. The next subsection summarizes the notation used in this paper. Section II provides the problem setup of the parallel Gaussian channel with feedback under the peak power constraint and presents our main theorem. Section III contains the preliminaries required for the proof of our main theorem, which include the following: (i) important properties of non-asymptotic binary hypothesis testing quantities; (ii) modification of power allocation among the parallel channels; (iii) Curtiss' theorem. Section IV presents the proof of our main theorem. Section V concludes this paper by explaining the novel ingredients in the proof of the main theorem and the major difficulty in strengthening the main theorem.
C. Notation
The sets of natural numbers, non-negative integers, real numbers and non-negative real numbers are denoted by ℕ, ℤ_+, ℝ and ℝ_+ respectively. An L-dimensional column vector is denoted by a ≝ [a_1 a_2 . . . a_L]^t, where a_ℓ denotes the ℓ-th element of a. The Euclidean norm of a vector a ∈ ℝ^L is denoted by ‖a‖ ≝ √(Σ_{ℓ=1}^{L} a²_ℓ). We take all logarithms to base e throughout this paper.

We use P{E} to represent the probability of an event E, and we let 1{E} be the indicator function of E. Every random variable is denoted by a capital letter (e.g., X), and the realization and the alphabet of the random variable are denoted by the corresponding small letter (e.g., x) and calligraphic letter (e.g., 𝒳) respectively. We use X^n to denote a random tuple (X_1, X_2, . . . , X_n), where all the elements X_k have the same alphabet 𝒳. We let p_X be the probability distribution of a random variable X. More specifically, p_X is the Radon-Nikodym derivative of a measure with respect to the Lebesgue measure in an appropriate Euclidean space. We let p_{Y|X} denote the conditional probability distribution of Y given X for any random variables X and Y. We let p_X p_{Y|X} denote the joint distribution of (X, Y), i.e., p_X p_{Y|X}(x, y) = p_X(x) p_{Y|X}(y|x) for all x and y. For any random variable X ∼ p_X and any real-valued function g whose domain includes 𝒳, we let P_{p_X}{g(X) ≥ ξ} denote ∫_𝒳 p_X(x) 1{g(x) ≥ ξ} dx for any real constant ξ. The expectation and the variance of g(X) are denoted by E_{p_X}[g(X)] and Var_{p_X}[g(X)] respectively. For simplicity, we drop the subscript of a notation if there is no ambiguity. For any real-valued Gaussian random variable Z whose mean and variance are µ and σ² respectively, we let

N(z; µ, σ²) ≝ (1/√(2πσ²)) e^{−(z−µ)²/(2σ²)}   (14)

be the corresponding probability density function.

II. PARALLEL GAUSSIAN CHANNEL WITH FEEDBACK
Let s and d denote the source and the destination respectively. Suppose node s transmits a message to node d over n channel uses through the L independent AWGN channels. Before any transmission begins, node s chooses a message W destined for node d, where W is uniformly distributed on the message alphabet

𝒲 ≝ {1, 2, . . . , M}   (15)

whose size is denoted by M. For the k-th channel use, node s transmits X_k and the corresponding channel output Y_k satisfies (2). We assume that a noiseless feedback link from the destination node d to the source node s exists so that (W, Y^{k−1}) is available for encoding X_k for each k ∈ {1, 2, . . . , n}. In addition, the codeword X^n transmitted by s is subject to the peak power constraint (3). Upon receiving Y^n, node d declares Ŵ to be the transmitted message.

Definition 1: An (n, M, P)-feedback code consists of the following:

1) A message set 𝒲 at node s as defined in (15). Message W is uniform on 𝒲.

2) An encoding function f_{ℓ,k} : 𝒲 × ℝ^{L×(k−1)} → ℝ for each ℓ ∈ ℒ and each k ∈ {1, 2, . . . , n}, where f_{ℓ,k} is the encoding function at node s for encoding X_{ℓ,k} such that X_{ℓ,k} = f_{ℓ,k}(W, Y^{k−1}) and the peak power constraint (3) holds.

3) A decoding function ϕ : ℝ^{L×n} → 𝒲, where ϕ is the decoding function for W at node d such that Ŵ = ϕ(Y^n).

Definition 2:
Let X and Y denote the random vectors [X_1 X_2 . . . X_L]^t and [Y_1 Y_2 . . . Y_L]^t respectively, and let x and y be their realizations respectively. The parallel Gaussian channel with feedback is characterized by the conditional probability density function q_{Y|X} satisfying

q_{Y|X}(y|x) = Π_{ℓ=1}^{L} N(y_ℓ; x_ℓ, N_ℓ)   (16)

such that the following holds for any (n, M, P)-feedback code: For each k ∈ {1, 2, . . . , n},

p_{W, X^k, Y^k} = p_{W, X^k, Y^{k−1}} p_{Y_k|X_k}   (17)

where

p_{Y_k|X_k}(y_k|x_k) = q_{Y|X}(y_k|x_k)   (18)

for all (x^n, y^n) ∈ ℝ^{L×n} × ℝ^{L×n}.

For any (n, M, P)-feedback code, let p_{W, X^n, Y^n, Ŵ} be the joint distribution induced by the code. We can use Definition 1, (17) and (18) to factorize p_{W, X^n, Y^n, Ŵ} as follows:

p_{W, X^n, Y^n, Ŵ} = p_W ( Π_{k=1}^{n} p_{X_k|W, Y^{k−1}} p_{Y_k|X_k} ) p_{Ŵ|Y^n}.   (19)

Definition 3:
For an (n, M, P)-feedback code, we can calculate according to (19) the average probability of decoding error, defined as P{Ŵ ≠ W}. We call an (n, M, P)-feedback code with average probability of decoding error no larger than ε an (n, M, P, ε)-feedback code. Define

M*_fb(n, ε, P) ≝ max{M ∈ ℕ | There exists an (n, M, P, ε)-feedback code}.

Definition 4:
Let ε ∈ (0, 1). The ε-capacity of the parallel Gaussian channel with feedback, denoted by C^fb_ε, is defined to be

C^fb_ε ≝ liminf_{n→∞} (1/n) log M*_fb(n, ε, P).

The capacity is defined to be C^fb ≝ inf_{ε>0} C^fb_ε.

Definition 5:
Let ε ∈ (0, 1). The ε-second-order coding rate of the parallel Gaussian channel with feedback, denoted by L^fb_ε, is defined to be

L^fb_ε ≝ liminf_{n→∞} (1/√n) ( log M*_fb(n, ε, P) − n C^fb_ε ).

Recall how C(P*) and V(P*) are determined through (4), (5), (6), (7) and (10). Since the capacity of the parallel Gaussian channel without feedback is C(P*) (see, e.g., [4] and [2, Sec. 3.4.3]) and the introduction of an extra noiseless feedback link does not increase the capacity (see, e.g., [7] and [1, Sec. 9.6]), it follows that

C^fb = C(P*).   (20)

Before stating our main result, recall that Φ : (−∞, ∞) → (0, 1) is the cdf of the standard normal distribution. Since Φ is strictly increasing on (−∞, ∞), the inverse of Φ is well-defined and is denoted by Φ^{−1}. The following theorem is the main result in this paper.

Theorem 1:
Fix an ε ∈ (0, 1). Then,

C^fb_ε = C(P*)   (21)

and the ε-second-order coding rate satisfies

L^fb_ε ≤ √(V(P*)) Φ^{−1}(ε).   (22)

Combining (9) and Theorem 1, we complete the characterization of the first- and second-order asymptotics of the parallel Gaussian channel with feedback as shown in (13).

III. PRELIMINARIES FOR THE PROOF OF THEOREM 1

A. Binary Hypothesis Testing
The following definition concerning the non-asymptotic fundamental limits of a simple binary hypothesis test is standard. See, for example, [5, Sec. 2.3].
Definition 6:
Let p_X and q_X be two probability distributions on some common alphabet 𝒳. Let

𝒜({0, 1}|𝒳) ≝ { r_{Z|X} : Z and X assume values in {0, 1} and 𝒳 respectively }

be the set of randomized binary hypothesis tests between p_X and q_X, where {Z = 0} indicates the test chooses q_X, and let δ ∈ [0, 1] be a real number. The minimum type-II error in a simple binary hypothesis test between p_X and q_X with type-I error less than 1 − δ is defined as

β_δ(p_X ‖ q_X) ≝ inf_{ r_{Z|X} ∈ 𝒜({0,1}|𝒳) : ∫_𝒳 r_{Z|X}(1|x) p_X(x) dx ≥ δ } ∫_𝒳 r_{Z|X}(1|x) q_X(x) dx.   (23)

The existence of a minimizing test r_{Z|X} is guaranteed by the Neyman-Pearson lemma. We state in the following lemma and proposition some important properties of β_δ(p_X ‖ q_X), which are crucial for the proof of Theorem 1. The proof of the following lemma can be found in, for example, [14, Lemma 1].
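Before turning to those properties, it may help to see Definition 6 in a concrete case. The sketch below is our own illustration, not part of the paper: for p_X = N(0, 1) and q_X = N(µ, 1) with µ > 0, the likelihood ratio p_X/q_X is decreasing in x, so by the Neyman-Pearson lemma the optimal test accepts p_X on a half-line {x ≤ c}, giving β_δ = Φ(Φ^{−1}(δ) − µ); a Monte Carlo run cross-checks the resulting type-I and type-II errors.

```python
import math
import random

def Phi(x):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p, tol=1e-12):
    # inverse cdf by bisection (adequate for an illustration)
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def beta(delta, mu):
    """beta_delta(N(0,1) || N(mu,1)) for mu > 0: accept p on {x <= c},
    with c chosen so that the acceptance probability under p equals delta."""
    c = Phi_inv(delta)
    return Phi(c - mu)

# Monte Carlo cross-check of the type-I/type-II trade-off
random.seed(0)
delta, mu = 0.99, 1.0
c = Phi_inv(delta)
trials = 200_000
type_I_accept = sum(1 for _ in range(trials) if random.gauss(0, 1) <= c) / trials
type_II = sum(1 for _ in range(trials) if random.gauss(mu, 1) <= c) / trials
```

The quantity `type_I_accept` estimates the acceptance probability δ under p_X and `type_II` estimates β_δ under q_X; both should match the closed form up to sampling noise.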
Lemma 1:
Let p_X and q_X be two probability distributions on some 𝒳, and let g be a function whose domain contains 𝒳. Then, the following two statements hold:

1. (Data processing inequality (DPI)) β_δ(p_X ‖ q_X) ≤ β_δ(p_{g(X)} ‖ q_{g(X)}).

2. For all ξ > 0, β_δ(p_X ‖ q_X) ≥ (1/ξ) ( δ − ∫_𝒳 p_X(x) 1{ p_X(x)/q_X(x) ≥ ξ } dx ).

The proof of the following proposition can be found in [14, Lemma 3] (see also [15, Th. 27]).

Proposition 2:
Let p U,V be a probability distribution defined on
𝒲 × 𝒲 for some finite alphabet 𝒲, and let p_U be the marginal distribution of p_{U,V}. In addition, let q_V be a distribution defined on 𝒲. Suppose p_U is the uniform distribution, and let

α = P{U ≠ V}   (24)

be a real number in [0, 1], where (U, V) is distributed according to p_{U,V}. Then,

|𝒲| ≤ 1/β_{1−α}(p_{U,V} ‖ p_U q_V).   (25)

B. Modification of Power Allocation among the Parallel Channels
For each transmitted codeword x^n ∈ ℝ^{L×n}, we can view Σ_{k=1}^{n} x²_{ℓ,k} as the power allocated to the ℓ-th channel for each ℓ ∈ ℒ. In the proof of Theorem 1, an early step is to discretize the power allocated to the L channels. To this end, we need the following definition, which defines the power allocation vector of a sequence x^n ∈ ℝ^{L×n}.

Definition 7:
The power allocation mapping φ : ℝ^{L×n} → ℝ^L_+ is defined as

φ(x^n) = (1/n) [ Σ_{k=1}^{n} x²_{1,k}  Σ_{k=1}^{n} x²_{2,k}  . . .  Σ_{k=1}^{n} x²_{L,k} ]^t.   (26)

We call φ(x^n) the power type of x^n.

The proof of Theorem 1 involves modifying a given length-n code so that useful bounds on the performance of the given code can be obtained by analyzing the modified code. More specifically, the encoding functions of the given code are modified so that the power type of the random codeword generated by the modified code always falls into some small bounding box. The specific modification of the encoding functions is described in the following definition.

Definition 8:
Given an (n, M, P)-feedback code, let 𝒲, {f_{ℓ,k} | 1 ≤ ℓ ≤ L, 1 ≤ k ≤ n} and ϕ be the corresponding message alphabet, encoding functions and decoding function respectively. In addition, let γ ≥ 0 and s = [s_1 s_2 . . . s_L]^t ∈ ℝ^L_+ be such that Σ_{ℓ=1}^{L} s_ℓ = P. Then, the (γ, s)-modified code based on the (n, M, P)-feedback code consists of the following message alphabet, encoding functions and decoding function, denoted by 𝒲̃, {f̃_{ℓ,k} | 1 ≤ ℓ ≤ L, 1 ≤ k ≤ n} and ϕ̃ respectively:

• A message set 𝒲̃ = 𝒲 at node s. Message W is uniform on 𝒲̃.

• An encoding function f̃_{ℓ,k} : 𝒲 × ℝ^{L×(k−1)} → ℝ for each ℓ ∈ ℒ and each k ∈ {1, 2, . . . , n}, defined as follows. For each w ∈ 𝒲 and each y^{k−1} ∈ ℝ^{L×(k−1)}, define f̃_{ℓ,k} in a recursive manner in the order f̃_{1,1}, f̃_{2,1}, . . . , f̃_{L,1}, . . . , f̃_{1,n}, f̃_{2,n}, . . . , f̃_{L,n} as follows: For each k = 1, 2, . . . , n − 1, define f̃_{ℓ,k} recursively for ℓ = 1, 2, . . . , L as

f̃_{ℓ,k}(w, y^{k−1}) =
  f_{ℓ,k}(w, y^{k−1})   if f²_{ℓ,k}(w, y^{k−1}) + Σ_{i=1}^{k−1} f̃²_{ℓ,i}(w, y^{i−1}) ≤ n(s_ℓ + γ),
  0                     if f²_{ℓ,k}(w, y^{k−1}) + Σ_{i=1}^{k−1} f̃²_{ℓ,i}(w, y^{i−1}) > n(s_ℓ + γ).   (27)

It follows from (27) that

P{ ∩_{ℓ=1}^{L} { Σ_{i=1}^{n−1} f̃²_{ℓ,i}(W, Y^{i−1}) ≤ n(s_ℓ + γ) } } = 1   (28)

and

P{ Σ_{ℓ=1}^{L} Σ_{i=1}^{n−1} f̃²_{ℓ,i}(W, Y^{i−1}) ≤ Σ_{ℓ=1}^{L} Σ_{i=1}^{n−1} f²_{ℓ,i}(W, Y^{i−1}) } = 1.   (29)

In addition, in view of (28), we define f̃_{ℓ,n} recursively for ℓ = 1, 2, . . . , L − 1 as follows:

f̃_{ℓ,n}(w, y^{n−1}) =
  f_{ℓ,n}(w, y^{n−1})                                           if f²_{ℓ,n}(w, y^{n−1}) + Σ_{i=1}^{n−1} f̃²_{ℓ,i}(w, y^{i−1}) ∈ [n(s_ℓ − Lγ), n(s_ℓ + γ)],
  √( n(s_ℓ − Lγ) − Σ_{i=1}^{n−1} f̃²_{ℓ,i}(w, y^{i−1}) )         if f²_{ℓ,n}(w, y^{n−1}) + Σ_{i=1}^{n−1} f̃²_{ℓ,i}(w, y^{i−1}) < n(s_ℓ − Lγ),
  √( n(s_ℓ + γ) − Σ_{i=1}^{n−1} f̃²_{ℓ,i}(w, y^{i−1}) )          if f²_{ℓ,n}(w, y^{n−1}) + Σ_{i=1}^{n−1} f̃²_{ℓ,i}(w, y^{i−1}) > n(s_ℓ + γ).   (30)

Combining (27) and (30), we conclude that

P{ ∩_{ℓ=1}^{L−1} { Σ_{i=1}^{n} f̃²_{ℓ,i}(W, Y^{i−1}) ∈ [n(s_ℓ − Lγ), n(s_ℓ + γ)] } } = 1.   (31)

On the other hand, it follows from (29), (30), the fact P{ Σ_{ℓ=1}^{L} Σ_{k=1}^{n} f²_{ℓ,k}(W, Y^{k−1}) ≤ nP } = 1 and the assumption Σ_{ℓ=1}^{L} s_ℓ = P that

P{ Σ_{(ℓ,i) ∈ ℒ×{1,2,...,n}∖{(L,n)}} f̃²_{ℓ,i}(W, Y^{i−1}) ≤ nP } = 1.   (32)

Finally, in view of (32), we define f̃_{L,n} as

f̃_{L,n}(w, y^{n−1}) =
  f_{L,n}(w, y^{n−1})                                            if f²_{L,n}(w, y^{n−1}) + Σ_{(ℓ,i) ∈ ℒ×{1,2,...,n}∖{(L,n)}} f̃²_{ℓ,i}(w, y^{i−1}) = nP,
  √( nP − Σ_{(ℓ,i) ∈ ℒ×{1,2,...,n}∖{(L,n)}} f̃²_{ℓ,i}(w, y^{i−1}) )  otherwise,   (33)

where the square root in (33) is well-defined because of (32). Combining (31), (33) and the assumption that Σ_{ℓ=1}^{L} s_ℓ = P, we have

P{ ∩_{ℓ=1}^{L} { Σ_{i=1}^{n} f̃²_{ℓ,i}(W, Y^{i−1}) ∈ [n(s_ℓ − L²γ), n(s_ℓ + L²γ)] } ∩ { Σ_{ℓ=1}^{L} Σ_{i=1}^{n} f̃²_{ℓ,i}(W, Y^{i−1}) = nP } } = 1.   (34)

• A decoding function ϕ̃ = ϕ for providing an estimate of W at node d. □

Remark 1:
The basic idea behind transforming a code in Definition 8 is simple. Suppose we are given an (n, M, P)-feedback code, a γ ≥ 0 and an s = [s_1 s_2 . . . s_L]^t ∈ ℝ^L_+ such that Σ_{ℓ=1}^{L} s_ℓ = P. Then, the (γ, s)-modified code is formed by

(i) truncating a transmitted codeword if the power transmitted over the ℓ-th channel exceeds n(s_ℓ + γ), which can be seen from (27) and the third clause of (30);

(ii) boosting the power of the transmitted codeword if the power transmitted over the ℓ-th channel falls below n(s_ℓ − Lγ), which can be seen from the second clause of (30);

(iii) adjusting the last symbol transmitted over the L-th channel (i.e., X_{L,n}) so that the total transmitted power is exactly equal to nP, which can be seen from the second clause of (33).

Given an (n, M, P)-feedback code, we consider the corresponding (γ, s)-modified code constructed in Definition 8 and let p̃_{W, X^n, Y^n, Ŵ} be the distribution induced by the modified code. By (34), we see that

P_{p̃_{X^n}}{ ∩_{ℓ=1}^{L} { Σ_{k=1}^{n} X²_{ℓ,k} ∈ [n(s_ℓ − L²γ), n(s_ℓ + L²γ)] } ∩ { Σ_{ℓ=1}^{L} Σ_{k=1}^{n} X²_{ℓ,k} = nP } } = 1.   (35)

Define the ∆-bounding box

Γ^{(∆)}(s) ≝ [s_1 − ∆, s_1 + ∆] × [s_2 − ∆, s_2 + ∆] × . . . × [s_L − ∆, s_L + ∆]   (36)

for each ∆ ≥ 0 and each s ∈ ℝ^L_+. It then follows from (35) that

P_{p̃_{X^n}}{ { φ(X^n) ∈ Γ^{(L²γ)}(s) } ∩ { Σ_{ℓ=1}^{L} Σ_{k=1}^{n} X²_{ℓ,k} = nP } } = 1.   (37)

The following lemma is a natural consequence of Definition 8, and the proof is deferred to Appendix A.
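The truncation step (i) can be sketched in a few lines. The snippet below is our own sanity check with hypothetical numbers, not part of the paper; it treats a single channel ℓ and suppresses the dependence of each encoding function on the feedback outputs y^{k−1}, which does not affect the clipping logic. It applies the per-symbol rule of (27) and confirms that the running power never exceeds n(s_ℓ + γ) and never exceeds the unmodified power, cf. (28) and (29).

```python
import random

def truncate_channel(symbols, n, s_ell, gamma):
    """Per-symbol truncation as in (27): keep a symbol only if its square,
    added to the running power of the already-modified symbols, stays within
    the budget n*(s_ell + gamma); otherwise transmit 0."""
    budget = n * (s_ell + gamma)
    modified, running = [], 0.0
    for x in symbols:
        if running + x * x <= budget:
            modified.append(x)
            running += x * x
        else:
            modified.append(0.0)
    return modified

random.seed(1)
n, s_ell, gamma = 100, 2.0, 0.1
# hypothetical raw encoder outputs that (deliberately) overspend the budget
raw = [random.gauss(0, 2.0) for _ in range(n)]
mod = truncate_channel(raw, n, s_ell, gamma)
power = sum(x * x for x in mod)
assert power <= n * (s_ell + gamma) + 1e-9      # cf. (28) for this channel
assert power <= sum(x * x for x in raw) + 1e-9  # truncation never adds power
```

The boosting and total-power steps (ii) and (iii) only touch the final symbols f̃_{ℓ,n}, so they do not change this running-power invariant.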
Lemma 3:
Given an (n, M, P)-feedback code, let p_{X^n, Y^n} be the distribution induced by the code. Fix any γ ≥ 0 and any s ∈ ℝ^L_+ such that Σ_{ℓ=1}^{L} s_ℓ = P, and let p̃_{X^n, Y^n} be the distribution induced by the (γ, s)-modified code based on the (n, M, P)-feedback code. Then, we have

∫_A p_{X^n, Y^n}(x^n, y^n) 1{ φ(x^n) ∈ Γ^{(γ)}(s) } 1{ Σ_{ℓ=1}^{L} Σ_{k=1}^{n} x²_{ℓ,k} = nP } dx^n dy^n ≤ ∫_A p̃_{X^n, Y^n}(x^n, y^n) dx^n dy^n   (38)

for all Borel measurable A ⊆ ℝ^{L×n} × ℝ^{L×n}.

C. Curtiss' Theorem
Curtiss' theorem [16, Th. 3] states that convergence of moment generating functions implies convergence in distribution. The formal statement is reproduced below.
Theorem 2 (Curtiss’ theorem):
Let {U^{(n)}} be a sequence of real-valued random variables. Suppose there exists a random variable V such that

lim_{n→∞} E[ e^{tU^{(n)}} ] = E[ e^{tV} ]   (39)

for all t ∈ ℝ. Then,

lim_{n→∞} P{ U^{(n)} ≤ a } = P{ V ≤ a }   (40)

for every a ∈ ℝ at which a ↦ P{V ≤ a} is continuous.

In contrast to the more well-known Lévy's continuity theorem [17, Sec. 18.1], (39) of Theorem 2 is required to be true for all real rather than purely imaginary t.

IV. PROOF OF THEOREM 1

Fix an ε ∈ (0, 1) and choose an arbitrary sequence of (n̄, M*_fb(n̄, ε, P), P, ε)-feedback codes. Since

C^fb_ε ≥ C(P*)   (41)

by (20), it suffices to show that

liminf_{n→∞} (1/√n) ( log M*_fb(n, ε, P) − n C(P*) ) ≤ √(V(P*)) Φ^{−1}(ε + τ)   (42)

for all τ > 0. To this end, fix an arbitrary τ > 0.

A. Discretizing the Power Allocation Vectors by Appending Symbols
Using Definition 1, we have

P{ Σ_{ℓ=1}^{L} Σ_{k=1}^{n̄} X²_{ℓ,k} ≤ n̄P } = 1

for the chosen (n̄, M*_fb(n̄, ε, P), P, ε)-feedback code for each n̄ ∈ ℕ. Given the chosen (n̄, M*_fb(n̄, ε, P), P, ε)-feedback code, we can always construct an (n̄ + L, M*_fb(n̄, ε, P), P, ε)-feedback code by appending a carefully chosen tuple (X_{n̄+1}, X_{n̄+2}, . . . , X_{n̄+L}) to each transmitted codeword X^{n̄} generated by the (n̄, M*_fb(n̄, ε, P), P, ε)-feedback code such that

P{ Σ_{k=1}^{n̄+L} X²_{ℓ,k} = P ⌈ (1/P) Σ_{k=1}^{n̄} X²_{ℓ,k} ⌉ for all ℓ ∈ ℒ } = 1,   (43)

which implies that

P{ Σ_{k=1}^{n̄+L} X²_{ℓ,k} is a multiple of P for all ℓ ∈ ℒ and Σ_{ℓ=1}^{L} Σ_{k=1}^{n̄+L} X²_{ℓ,k} ≤ (n̄ + L)P } = 1.   (44)

In addition, given the (n̄ + L, M*_fb(n̄, ε, P), P, ε)-feedback code, we can always construct an (n̄ + L + 1, M*_fb(n̄, ε, P), P, ε)-feedback code by appending a carefully chosen X_{n̄+L+1} to each transmitted codeword X^{n̄+L} generated by the (n̄ + L, M*_fb(n̄, ε, P), P, ε)-feedback code such that

P{ Σ_{k=1}^{n̄+L+1} X²_{ℓ,k} is a multiple of P for all ℓ ∈ ℒ and Σ_{ℓ=1}^{L} Σ_{k=1}^{n̄+L+1} X²_{ℓ,k} = (n̄ + L + 1)P } = 1.   (45)

To simplify notation, we let

n ≝ n̄ + L + 1.   (46)

Construct the set of power allocation vectors

S^(n) ≝ { (P/n) · a | a ∈ ℤ^L_+, Σ_{ℓ=1}^{L} a_ℓ = n },   (47)

which can be viewed as a set of quantized power allocation vectors s with quantization level P/n that satisfy the equality power constraint Σ_{ℓ=1}^{L} s_ℓ = P. It follows from (47), (45) and Definition 7 that

|S^(n)| ≤ n^L   (48)

and

P{ φ(X^n) ∈ S^(n) } = 1.   (49)

B. Obtaining a Lower Bound on the Error Probability in Terms of the Type-II Error of a Hypothesis Test
Let p_{W, X^n, Y^n, Ŵ} be the probability distribution induced by the (n, M*_fb(n̄, ε, P), P, ε)-feedback code constructed above for each n ∈ {L + 2, L + 3, . . .}, where p_{W, X^n, Y^n, Ŵ} is obtained according to (19). Fix an n ∈ {L + 2, L + 3, . . .} and the corresponding (n, M*_fb(n̄, ε, P), P, ε)-feedback code. Recall the definition of P_ℓ for each ℓ ∈ ℒ in (6) and define the distribution

r_{Y^n, Ŵ} ≝ p_{Ŵ|Y^n} r_{Y^n}   (50)

where

r_{Y^n}(y^n) ≝ (1/(2|S^(n)|)) Σ_{s ∈ S^(n)} Π_{ℓ=1}^{L} Π_{k=1}^{n} N(y_{ℓ,k}; 0, s_ℓ + N_ℓ) + (1/2) Π_{ℓ=1}^{L} Π_{k=1}^{n} N(y_{ℓ,k}; 0, P_ℓ + N_ℓ).   (51)

The choice of r_{Y^n} in (51) is motivated by the choice of the auxiliary output distribution in [18, Sec. X-A], where DMCs are considered. Then, it follows from Proposition 2 and Definition 1 with the identifications U ≡ W, V ≡ Ŵ, p_{U,V} ≡ p_{W,Ŵ}, q_V ≡ r_Ŵ, |𝒲| ≡ M*_fb(n̄, ε, P) and α ≡ P{Ŵ ≠ W} ≤ ε that

β_{1−ε}(p_{W,Ŵ} ‖ p_W r_Ŵ) ≤ β_{1−α}(p_{W,Ŵ} ‖ p_W r_Ŵ) ≤ 1/M*_fb(n̄, ε, P).   (52)

C. Obtaining a Non-Asymptotic Bound from Simplifying the Type-II Error of the Binary Hypothesis Test
Using the DPI of β_{1−ε} by introducing X^n and Y^n, we have

β_{1−ε}(p_{W,Ŵ} ‖ p_W r_Ŵ) ≥ β_{1−ε}( p_{W, X^n, Y^n, Ŵ} ‖ p_W r_{Y^n, Ŵ} Π_{k=1}^{n} p_{X_k|W, Y^{k−1}} )   (53)

where

p_{W, X^n, Y^n, Ŵ} = p_W p_{Ŵ|Y^n} Π_{k=1}^{n} ( p_{X_k|W, Y^{k−1}} p_{Y_k|X_k} )   (54)

by (19). Combining (53), (54) and (50), we have

β_{1−ε}(p_{W,Ŵ} ‖ p_W r_Ŵ) ≥ β_{1−ε}( p_W p_{Ŵ|Y^n} Π_{k=1}^{n} ( p_{X_k|W, Y^{k−1}} p_{Y_k|X_k} ) ‖ p_W r_{Y^n} p_{Ŵ|Y^n} Π_{k=1}^{n} p_{X_k|W, Y^{k−1}} ).   (55)

Fix any constant ξ_n > 0 to be specified later. Using Lemma 1, (55) and (18), we have

β_{1−ε}(p_{W,Ŵ} ‖ p_W r_Ŵ) ≥ (1/ξ_n) ( 1 − ε − P_{p_{X^n, Y^n}}{ Π_{k=1}^{n} q_{Y|X}(Y_k|X_k) / r_{Y^n}(Y^n) ≥ ξ_n } ),   (56)

which together with (52) implies that

log M*_fb(n̄, ε, P) ≤ log ξ_n − log( 1 − ε − P_{p_{X^n, Y^n}}{ log( Π_{k=1}^{n} q_{Y|X}(Y_k|X_k) / r_{Y^n}(Y^n) ) ≥ log ξ_n } ).   (57)

D. Splitting the Probability Term into Multiple Terms Corresponding to Different Power Types of X^n

(Footnote: We note that even if we exclude the set of power types in the set Π^(n), which is defined later in (58), this leads to another valid definition of r_{Y^n}(y^n). The conclusion of this proof remains unchanged if the n^{−1/6} term in (58) is replaced by n^{−a} for any a ∈ (0, 1/2).)

Define

Π^(n) ≝ { s ∈ S^(n) | ‖s − P*‖ ≤ n^{−1/6} }   (58)
(7)).Following (57), we use (49) to obtain P p X n, Y n (cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) = P p X n, Y n (cid:26)(cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) ∩ n φ ( X n ) ∈ Π ( n ) o(cid:27) + X s ∈S ( n ) \ Π ( n ) P p X n, Y n (cid:26)(cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) ∩ { φ ( X n ) = s } (cid:27) . (59)In order to bound the first term in (59), we let γ def = 1 n / (60)and define p ∗ X n , Y n be the distribution induced by the ( γ, P ∗ ) -modified code based on the ( n, M ∗ fb (¯ n, ε, P ) , P, ε ) -feedback code defined in Definition 8. Then, consider the following chain of inequalities: P p X n, Y n (cid:26)(cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) ∩ n φ ( X n ) ∈ Π ( n ) o(cid:27) ≤ P p ∗ X n, Y n (cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) (61) ≤ P p ∗ X n, Y n ( n X k =1 log (cid:18) q Y | X ( Y k | X k ) Q Lℓ =1 N ( Y ℓ,k ; 0 , P ℓ + N ℓ ) (cid:19) ≥ log ξ n − log 2 ) (62)where • (61) is due to Lemma 3 and the fact that P p X n, Y n nP Lℓ =1 P nk =1 X ℓ,k = nP o = 1 (cf. (47) and (49)). 
• (62) is due to the definition of r Y n in (51).Similarly, in order to bound the second term in (59), we let p ( s ) X n , Y n be the distribution induced by the (0 , s ) -modifiedcode and consider the following chain of inequalities for each s ∈ S ( n ) \ Π ( n ) : P p X n, Y n (cid:26)(cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) ∩ { φ ( X n ) = s } (cid:27) ≤ P p ( s ) X n, Y n (cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) (63) ≤ P p ( s ) X n, Y n ( n X k =1 log (cid:18) q Y | X ( Y k | X k ) Q Lℓ =1 N ( Y ℓ,k ; 0 , s ℓ + N ℓ ) (cid:19) ≥ log ξ n − log (cid:0) |S ( n ) | (cid:1)) (64)where • (63) is due to Lemma 3. • (64) is due to the definition of r Y n in (51).Combining (59), (62), (64) and the definition of q Y | X in (16) followed by letting Z ℓ,k def = Y ℓ,k − X ℓ,k (65) October 14, 2018 DRAFT5 for each ℓ ∈ L and each k ∈ { , , . . . , n } , we obtain P p X n, Y n (cid:26) log (cid:18) Q nk =1 q Y | X ( Y k | X k ) r Y n ( Y n ) (cid:19) ≥ log ξ n (cid:27) ≤ P p ∗ X n, Y n ( n C( P ∗ ) + n X k =1 L X ℓ =1 − (cid:0) P ℓ N ℓ (cid:1) Z ℓ,k + 2 X ℓ,k Z ℓ,k + X ℓ,k P ℓ + N ℓ ) ≥ log ξ n − log 2 ) + X s ∈S ( n ) \ Π ( n ) P p ( s ) X n, Y n ( n C( s ) + n X k =1 L X ℓ =1 − (cid:0) s ℓ N ℓ (cid:1) Z ℓ,k + 2 X ℓ,k Z ℓ,k + X ℓ,k s ℓ + N ℓ ) ≥ log ξ n − log (cid:0) |S ( n ) | (cid:1)) (66)where C( · ) is as defined in (4). In order to simplify the RHS of (66), we define ξ n > such that log ξ n def = n C( P ∗ ) + √ n (cid:16)p V( P ∗ ) Φ − ( ε + τ ) (cid:17) + log (cid:0) |S ( n ) | (cid:1) . (67)In addition, for each d ∈ R L + , let U ( d ) k def = L X ℓ =1 − (cid:0) d ℓ N ℓ (cid:1) Z ℓ,k + 2 X ℓ,k Z ℓ,k + d ℓ d ℓ + N ℓ ) (68)for each k ∈ { , , . . . , n } . 
By using (66), (67) and (68), together with the facts implied by (37) that
$$\mathrm{P}_{p^*_{X^n,Y^n}}\left\{\left\{\phi(X^n)\in\Gamma^{(L\gamma)}(P^*)\right\}\cap\left\{\sum_{\ell=1}^L\sum_{k=1}^n X_{\ell,k}^2 = nP\right\}\right\} = 1 \quad (69)$$
and
$$\mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\phi(X^n)\in\Gamma^{(0)}(s)\right\} = 1 \quad (70)$$
for each $s\in\mathcal{S}^{(n)}$, we can express (66) as
$$
\mathrm{P}_{p_{X^n,Y^n}}\left\{\log\left(\frac{\prod_{k=1}^n q_{Y|X}(Y_k|X_k)}{r_{Y^n}(Y^n)}\right)\ge \log\xi_n\right\}
\le \mathrm{P}_{p^*_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n\mathrm{V}(P^*)}}\sum_{k=1}^n U_k^{(P^*)} \ge \Phi^{-1}(\varepsilon+\tau)\right\}
+ \sum_{s\in\mathcal{S}^{(n)}\setminus\Pi^{(n)}}\mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)} \ge \sqrt{n}\big(\mathrm{C}(P^*)-\mathrm{C}(s)\big) + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)\right\}. \quad (71)
$$

E. Applying Curtiss' Theorem When $\phi(X^n)$ is Close to $P^*$

In order to simplify the first term in (71), we define
$$V_k^{(P^*)} \overset{\mathrm{def}}{=} \sum_{\ell=1}^L \frac{-\big(\frac{P_\ell}{N_\ell}\big)Z_{\ell,k}^2 + 2\sqrt{P_\ell}\,Z_{\ell,k} + P_\ell}{2(P_\ell+N_\ell)} \quad (72)$$
for each $k\in\{1,2,\ldots,n\}$ and want to show that
$$\lim_{n\to\infty} \mathrm{E}_{p^*_{X^n,Y^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n U_k^{(P^*)}}\right] = \lim_{n\to\infty} \mathrm{E}_{p^*_{Z^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n V_k^{(P^*)}}\right] \quad (73)$$
for all $t\in\mathbb{R}$, where
$$p^*_{Z^n}(z^n) \overset{\mathrm{def}}{=} \prod_{k=1}^n\prod_{\ell=1}^L \mathcal{N}(z_{\ell,k};0,N_\ell). \quad (74)$$
To this end, recall the following statements due to the channel law:
(i) $Z_{\ell,k}\sim\mathcal{N}(z_{\ell,k};0,N_\ell)$ for all $\ell\in\mathcal{L}$ and all $k\in\{1,2,\ldots,n\}$;
(ii) $\{Z_{\ell,k}\,|\,\ell\in\mathcal{L},\, k\in\{1,2,\ldots,n\}\}$ are independent;
(iii) $Z_k$ and $(X^k, Y^{k-1}, Z^{k-1})$ are independent for all $k\in\{1,2,\ldots,n\}$.
For any $t\in\mathbb{R}$ and any $n\in\{L+2, L+3, \ldots\}$ such that $n\ge t^2$, since
$$P_\ell - L\gamma \le \frac{1}{n}\sum_{k=1}^n X_{\ell,k}^2 \le P_\ell + L\gamma \quad (75)$$
by (69) and $P_\ell + N_\ell + \frac{tP_\ell}{\sqrt{n}} > 0$ for all $\ell\in\mathcal{L}$, we have
$$
\mathrm{E}_{p^*_{X^n,Y^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n U_k^{(P^*)}} \cdot e^{t^2\sum_{\ell=1}^L \frac{N_\ell\left(P_\ell - L\gamma - \frac{1}{n}\sum_{k=1}^n X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left(P_\ell+N_\ell+\frac{tP_\ell}{\sqrt{n}}\right)}}\right]
\le \mathrm{E}_{p^*_{X^n,Y^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n U_k^{(P^*)}}\right] \quad (76)
$$
$$
\le \mathrm{E}_{p^*_{X^n,Y^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n U_k^{(P^*)}} \cdot e^{t^2\sum_{\ell=1}^L \frac{N_\ell\left(P_\ell + L\gamma - \frac{1}{n}\sum_{k=1}^n X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left(P_\ell+N_\ell+\frac{tP_\ell}{\sqrt{n}}\right)}}\right]. \quad (77)
$$
In order to simplify the above chain of inequalities, we need the following lemma, whose proof is deferred to Appendix B because it involves straightforward calculations based on (68), (72) and the channel law.

Lemma 4:
For any $\lambda\in\mathbb{R}$, we have
$$
\mathrm{E}_{p^*_{X^n,Y^n}}\left[e^{\lambda\sum_{k=1}^n U_k^{(P^*)}} \cdot e^{\lambda^2\sum_{\ell=1}^L \frac{N_\ell\left(nP_\ell - \sum_{k=1}^n X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}}\right]
= \mathrm{E}_{p^*_{Z^n}}\left[e^{\lambda\sum_{k=1}^n V_k^{(P^*)}}\right]. \quad (78)
$$
Lemma 4, which forms the crux of the proof of Theorem 1, is important because it establishes the equivalence in distribution between the sum $\sum_{k=1}^n U_k^{(P^*)}$, which contains dependent random variables, and the sum $\sum_{k=1}^n V_k^{(P^*)}$, which contains independent random variables. The former is intractable to analyze, while the latter can be analyzed in a straightforward manner by invoking the central limit theorem.
Using Lemma 4, we can simplify (77) through the identification $\lambda \equiv \frac{t}{\sqrt{n}}$ and obtain
$$
\mathrm{E}_{p^*_{Z^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n V_k^{(P^*)}}\right] \cdot e^{-t^2 L\gamma \sum_{\ell=1}^L \frac{N_\ell}{2(P_\ell+N_\ell)\left(P_\ell+N_\ell+\frac{tP_\ell}{\sqrt{n}}\right)}}
\le \mathrm{E}_{p^*_{X^n,Y^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n U_k^{(P^*)}}\right] \quad (79)
$$
$$
\le \mathrm{E}_{p^*_{Z^n}}\left[e^{\frac{t}{\sqrt{n}}\sum_{k=1}^n V_k^{(P^*)}}\right] \cdot e^{t^2 L\gamma \sum_{\ell=1}^L \frac{N_\ell}{2(P_\ell+N_\ell)\left(P_\ell+N_\ell+\frac{tP_\ell}{\sqrt{n}}\right)}}. \quad (80)
$$
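A Lemma-4-style identity can be probed numerically. The sketch below is not the paper's proof: it checks the single-channel ($L = 1$) analogue of (78) by Monte Carlo, using an arbitrary hypothetical feedback policy $X_k = \sqrt{P}\cos(Z_{k-1})$ as a stand-in for the modified code, since the telescoping argument relies only on property (iii) above, namely that $Z_k$ is independent of $(X^k, Z^{k-1})$. It also checks that the summands $V_k$ of (72) are zero-mean with variance $P(P+2N)/(2(P+N)^2)$, the familiar per-channel Gaussian dispersion (in nats$^2$):

```python
import math
import random

def mgf_check(P=1.0, N=1.0, lam=0.3, n=4, trials=200_000, seed=1):
    """Monte Carlo estimate of both sides of the L = 1 analogue of (78):
    E[exp(lam*sum U_k) * exp(lam^2 * N * (n*P - sum X_k^2) / (2(P+N)((1+lam)P+N)))]
      vs  E[exp(lam*sum V_k)],
    where X_k is an adapted (feedback-dependent) input."""
    random.seed(seed)
    c = N / (2 * (P + N) * ((1 + lam) * P + N))
    lhs = rhs = 0.0
    for _ in range(trials):
        z_prev, su, sv, sx2 = 0.0, 0.0, 0.0, 0.0
        for _k in range(n):
            x = math.sqrt(P) * math.cos(z_prev)          # hypothetical feedback input
            z = random.gauss(0.0, math.sqrt(N))          # fresh channel noise
            su += (-(P / N) * z * z + 2 * x * z + P) / (2 * (P + N))              # U_k, cf. (68)
            sv += (-(P / N) * z * z + 2 * math.sqrt(P) * z + P) / (2 * (P + N))   # V_k, cf. (72)
            sx2 += x * x
            z_prev = z
        lhs += math.exp(lam * su + lam * lam * c * (n * P - sx2))
        rhs += math.exp(lam * sv)
    return lhs / trials, rhs / trials

P, N, lam, n = 1.0, 1.0, 0.3, 4
lhs, rhs = mgf_check(P, N, lam, n)
# closed form of the common value, cf. (113)/(114):
D = (1 + lam) * P + N
closed = ((P + N) / D) ** (n / 2) * math.exp(n * lam * (1 + lam) * P / (2 * D))
assert abs(lhs - rhs) < 0.02 and abs(rhs - closed) < 0.02

# moments of V_k: E[Z^2] = N kills the first and last terms, so E[V_k] = 0,
# and Var[V_k] = P(P + 2N) / (2 (P + N)^2)
random.seed(3)
vs = [(-z * z + 2 * z + 1) / 4 for z in (random.gauss(0.0, 1.0) for _ in range(200_000))]
mv = sum(vs) / len(vs)
var = sum(v * v for v in vs) / len(vs) - mv * mv
assert abs(mv) < 0.01 and abs(var - 0.375) < 0.01   # 0.375 = 1*(1+2)/(2*2^2)
```

The closed form used in the assertion is exactly the common value that the recursion of Appendix B produces, so the check exercises both the identity and the telescoping constant.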
Combining (80) and (60), we conclude that (73) holds for each $t\in\mathbb{R}$, since the factors $e^{\pm t^2 L\gamma(\cdots)}$ in (79) and (80) converge to 1 as $n\to\infty$ (because $\gamma\to 0$). Since the moment generating functions of $\frac{1}{\sqrt{n}}\sum_{k=1}^n V_k^{(P^*)}$ and $\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(P^*)}$ converge to the same function, it follows from Curtiss' theorem [16, Th. 3] (as stated in Theorem 2) that
$$\lim_{n\to\infty}\mathrm{P}_{p^*_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n\mathrm{V}(P^*)}}\sum_{k=1}^n U_k^{(P^*)} \ge \Phi^{-1}(\varepsilon+\tau)\right\} = \lim_{n\to\infty}\mathrm{P}_{p^*_{Z^n}}\left\{\frac{1}{\sqrt{n\mathrm{V}(P^*)}}\sum_{k=1}^n V_k^{(P^*)} \ge \Phi^{-1}(\varepsilon+\tau)\right\}. \quad (81)$$
Recognizing that $\{V_k^{(P^*)}\}_{k=1}^\infty$ are i.i.d. zero-mean random variables with variance $\mathrm{V}(P^*)$ by the definition of $V_k^{(P^*)}$ in (72) and the definition of $\mathrm{V}(P^*)$ in (10), we apply the central limit theorem [11] and obtain
$$\lim_{n\to\infty}\mathrm{P}_{p^*_{Z^n}}\left\{\frac{1}{\sqrt{n\mathrm{V}(P^*)}}\sum_{k=1}^n V_k^{(P^*)} \le \Phi^{-1}(\varepsilon+\tau)\right\} = \varepsilon+\tau, \quad (82)$$
which together with (81) implies that
$$\lim_{n\to\infty}\mathrm{P}_{p^*_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n\mathrm{V}(P^*)}}\sum_{k=1}^n U_k^{(P^*)} \ge \Phi^{-1}(\varepsilon+\tau)\right\} = 1-\varepsilon-\tau. \quad (83)$$

F. Applying Large Deviation Bounds When $\phi(X^n)$ is Far from $P^*$

In order to bound the second term in (71), we consider a fixed $n\in\{L+2, L+3, \ldots\}$ and want to show that there exists some $\kappa>0$ such that
$$\mathrm{C}(P^*)-\mathrm{C}(s) \ge \kappa\,\|P^*-s\|^2 \quad (84)$$
for all $s\in\mathcal{S}^{(n)}$. To this end, we first define the Lagrangian function $f:\mathbb{R}_+^L\to\mathbb{R}$ as
$$f(d) \overset{\mathrm{def}}{=} \mathrm{C}(d) - \Lambda\left(\sum_{\ell=1}^L d_\ell - P\right) + \sum_{\ell=1}^L \mu_\ell d_\ell \quad (85)$$
where $\Lambda\ge 0$ is the unique number that satisfies (5) and (6), and $\mu_\ell\ge 0$ is defined for each $\ell\in\mathcal{L}$ as
$$\mu_\ell \overset{\mathrm{def}}{=} \begin{cases} 0 & \text{if } P_\ell>0,\\ \Lambda - \frac{1}{2N_\ell} & \text{if } P_\ell=0. \end{cases} \quad (86)$$
Define $N_{\max} \overset{\mathrm{def}}{=} \max_{\ell\in\mathcal{L}} N_\ell$. Then for all $s\in\mathcal{S}^{(n)}$, we use Taylor's theorem to obtain
$$f(s) = f(P^*) + (s-P^*)^t\,\nabla f(P^*) + \frac{1}{2}(s-P^*)^t\,\nabla^2 f(\bar s)\,(s-P^*) \quad (87)$$
for some $\bar s$ that lies on the line segment connecting $s$ and $P^*$, where $\nabla f(P^*)$ denotes the gradient, which satisfies
$$\nabla f(P^*) = \mathbf{0}, \quad (88)$$
and $\nabla^2 f(\bar s)$ denotes the Hessian matrix, which satisfies
$$(s-P^*)^t\,\nabla^2 f(\bar s)\,(s-P^*) \le -\frac{\|s-P^*\|^2}{2(N_{\max}+P)^2}. \quad (89)$$
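The quadratic gap that the curvature bound (89) ultimately delivers, i.e., $\mathrm{C}(P^*)-\mathrm{C}(s) \ge \kappa\|P^*-s\|^2$ as in (84), can be checked numerically in a toy setting. The sketch below takes a single channel ($L=1$, so that $P^* = P$ and the per-channel capacity is $\frac{1}{2}\log(1+s/N)$ in nats) with hypothetical values $P=2$, $N=1$:

```python
import math

def C(s: float, N: float) -> float:
    """Capacity of one AWGN channel with power s and noise variance N, in nats."""
    return 0.5 * math.log(1 + s / N)

P, N = 2.0, 1.0                      # hypothetical power and noise variance
kappa = 1 / (4 * (N + P) ** 2)       # cf. (93), with N_max = N for L = 1

# quadratic gap (84): C(P) - C(s) >= kappa * (P - s)^2 on the whole feasible range
for i in range(1001):
    s = P * i / 1000
    assert C(P, N) - C(s, N) >= kappa * (P - s) ** 2 - 1e-12
```

The assertion holds because the derivative of the gap $C(P,N)-C(s,N)-\kappa(P-s)^2$ in $s$ is nonpositive on $[0,P]$ and the gap vanishes at $s=P$, mirroring the Taylor-plus-Hessian argument of (87)-(92).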
For the sake of completeness, the derivations of (88) and (89) are contained in Appendix C. Combining (87), (88) and (89), we have for all $s\in\mathcal{S}^{(n)}$
$$f(P^*)-f(s) \ge \frac{\|s-P^*\|^2}{4(N_{\max}+P)^2}, \quad (90)$$
which together with the definitions of $f$ and $\mu_\ell$ in (85) and (86) respectively implies that
$$\mathrm{C}(P^*)-\mathrm{C}(s) \ge \frac{\|s-P^*\|^2}{4(N_{\max}+P)^2} + \sum_{\ell=1}^L \mu_\ell(s_\ell-P_\ell) \quad (91)$$
$$\ge \frac{\|s-P^*\|^2}{4(N_{\max}+P)^2}. \quad (92)$$
Consequently, (84) holds by setting
$$\kappa \overset{\mathrm{def}}{=} \frac{1}{4(N_{\max}+P)^2}. \quad (93)$$
Following (71), we consider for each $s\in\mathcal{S}^{(n)}\setminus\Pi^{(n)}$
$$\mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)} \ge \sqrt{n}\big(\mathrm{C}(P^*)-\mathrm{C}(s)\big) + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)\right\}
\le \mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)} \ge \kappa\sqrt{n}\,\|P^*-s\|^2 + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)\right\} \quad (94)$$
$$\le \mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)} \ge \kappa\, n^{1/6} + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)\right\} \quad (95)$$
where
• (94) is due to (84).
• (95) follows from the definition of $\Pi^{(n)}$ in (58).
Following the standard approach for obtaining large deviation bounds, we apply Markov's inequality to the RHS of (95) and obtain for each $s\in\mathcal{S}^{(n)}\setminus\Pi^{(n)}$
$$\mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)} \ge \sqrt{n}\big(\mathrm{C}(P^*)-\mathrm{C}(s)\big) + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)\right\}
\le \frac{\mathrm{E}_{p^{(s)}_{X^n,Y^n}}\left[e^{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)}}\right]}{e^{\kappa n^{1/6} + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)}}. \quad (96)$$
In order to bound the RHS of (96), consider the following chain of inequalities for each $s\in\mathcal{S}^{(n)}\setminus\Pi^{(n)}$:
$$\mathrm{E}_{p^{(s)}_{X^n,Y^n}}\left[e^{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)}}\right]
= \prod_{\ell=1}^L\left(\frac{s_\ell+N_\ell}{(1+n^{-1/2})s_\ell+N_\ell}\right)^{n/2} e^{\sum_{\ell=1}^L\left(\frac{\sqrt{n}\,s_\ell}{2(s_\ell+N_\ell)} + \frac{N_\ell s_\ell}{2(s_\ell+N_\ell)\left((1+n^{-1/2})s_\ell+N_\ell\right)}\right)} \quad (97)$$
$$\le \prod_{\ell=1}^L\left[\left(1 - \frac{n^{-1/2}s_\ell}{(1+n^{-1/2})s_\ell+N_\ell}\right)^{n/2} e^{\frac{\sqrt{n}\,s_\ell}{2(s_\ell+N_\ell)}}\right] \cdot e^{\sum_{\ell=1}^L \frac{N_\ell s_\ell}{2(s_\ell+N_\ell)^2}} \quad (98)$$
$$\le e^{\sum_{\ell=1}^L\left(\frac{s_\ell^2}{2\left((1+n^{-1/2})s_\ell+N_\ell\right)(s_\ell+N_\ell)} + \frac{N_\ell s_\ell}{2(s_\ell+N_\ell)^2}\right)} \quad (99)$$
$$\le e^{\sum_{\ell=1}^L \frac{s_\ell}{2(s_\ell+N_\ell)}} \quad (100)$$
$$\le e^{L/2}, \quad (101)$$
where
• (97) follows from straightforward calculations based on the definition of $U_k^{(s)}$ in (68), the property of $p^{(s)}_{X^n,Y^n}$ in (70) and the channel law, which are elaborated in Appendix D for the sake of completeness.
• (99) is due to the fact that $(1-x)e^x \le 1$ for all $x>0$.
Combining (96) and (101), we have the following large deviation bound for each $s\in\mathcal{S}^{(n)}\setminus\Pi^{(n)}$:
$$\mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)} \ge \sqrt{n}\big(\mathrm{C}(P^*)-\mathrm{C}(s)\big) + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)\right\} \le \frac{e^{L/2}}{e^{\kappa n^{1/6} + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)}}. \quad (102)$$
Following (71), we use (102) and (48) to obtain
$$\lim_{n\to\infty}\sum_{s\in\mathcal{S}^{(n)}\setminus\Pi^{(n)}}\mathrm{P}_{p^{(s)}_{X^n,Y^n}}\left\{\frac{1}{\sqrt{n}}\sum_{k=1}^n U_k^{(s)} \ge \sqrt{n}\big(\mathrm{C}(P^*)-\mathrm{C}(s)\big) + \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau)\right\} = 0. \quad (103)$$

G. Combining Earlier Bounds to Obtain an Upper Bound on $L_\varepsilon^{\mathrm{fb}}$

Combining (57), (67), (71), (83), (103) and (48), we have
$$\log M^*_{\mathrm{fb}}(\bar n,\varepsilon,P) \le n\mathrm{C}(P^*) + \sqrt{n}\,\sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau) + \log\big(n^L\big) - \log(\tau/2) \quad (104)$$
for all sufficiently large $n$, which together with (46) implies that
$$\liminf_{\bar n\to\infty} \frac{1}{\sqrt{\bar n}}\Big(\log M^*_{\mathrm{fb}}(\bar n,\varepsilon,P) - \bar n\,\mathrm{C}(P^*)\Big) \le \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon+\tau). \quad (105)$$
Since $\tau>0$ is arbitrary, it follows from (105) and Definition 5 that
$$L_\varepsilon^{\mathrm{fb}} \le \sqrt{\mathrm{V}(P^*)}\,\Phi^{-1}(\varepsilon). \quad (106)$$

V. CONCLUDING REMARKS
A. Novel Ingredients in Proof of Theorem 1
As mentioned in Section I-A, the proof of [8, Th. 2], which obtains upper bounds on the second-order asymptotics of DMCs with feedback, cannot be generalized to the parallel Gaussian channel with feedback. Indeed, the proof of Theorem 1 follows the standard procedures for obtaining the second-order asymptotics of DMCs without feedback (see, e.g., [5, proof of Th. 50] and [19, Sec. III]) except for the following three novel ingredients:
1) Instead of classifying transmitted codewords into polynomially many type classes based on their empirical distributions, which is generally not possible for channels with a continuous input alphabet, we discretize the transmitted power and classify the codewords into polynomially many type classes based on their discretized power types. In particular, the collection of power type classes $\mathcal{S}^{(n)}$ in (47) plays a key role in our analysis, and there are polynomially many power type classes by (48). The details can be found in Section IV-A in the proof.
2) Curtiss' theorem rather than the Berry-Esseen theorem is invoked to bound the information spectrum term (the first term in (71)) related to transmitted codewords whose types are close to the optimal power allocation. In particular, the Berry-Esseen theorem for bounded martingale difference sequences cannot be used to bound the information spectrum term in the presence of feedback, because each input symbol $X_{\ell,k}$ belongs to an interval $[-\sqrt{nP}, \sqrt{nP}]$ that grows unbounded as $n$ increases. Instead, we apply Curtiss' theorem to show that the distribution of the sum of random variables in the information spectrum term converges to some distribution generated from a sum of i.i.d. random variables (i.e., (73)), thus facilitating the use of the usual central limit theorem [11]. The details can be found in Section IV-E.
3) In order to bound the information spectrum term related to transmitted codewords whose types are far from the optimal power allocation (the second term in (71)), the usual approach is to bound the information spectrum term by an average of exponentially many upper bounds, each of which is then further simplified by invoking Chebyshev's inequality [18, Sec. X-A]. However, due to the presence of feedback, the information spectrum term can be expressed only as a sum (instead of an average) of polynomially many upper bounds, as shown in the second term in (71). In order to control this sum of polynomially many upper bounds, we have to resort to the large deviation bounds shown in (102) rather than the weaker Chebyshev's inequality. The details can be found in Section IV-F.

B. Major Difficulties in Tightening the Third-Order Term
If the feedback link is absent, the third-order term of the optimal finite blocklength rate $\frac{1}{n}\log M^*(n,\varepsilon,P)$ is $\Theta\big(\frac{\log n}{n}\big)$, as shown in (9) in Section I. That third-order term can be obtained by applying the Berry-Esseen theorem to bound an information spectrum term (analogous to the first term in (71)).
In the presence of feedback, Theorem 1 shows that the third-order term is $o\big(\frac{1}{\sqrt{n}}\big)$. If we want to compute an explicit upper bound on the third-order term using the current proof technique, an intuitive way is to invoke a non-asymptotic version of Curtiss' theorem that can measure the proximity between two distributions based on the proximity between their moment generating functions. However, such a non-asymptotic version of Curtiss' theorem does not exist to the best of our knowledge, which prohibits us from strengthening the current $o\big(\frac{1}{\sqrt{n}}\big)$ bound on the third-order term. It is worth noting that (77) and (80) in our proof break down if the moment generating functions are replaced with characteristic functions. If one can find a way to make characteristic functions amenable to our proof approach, then a non-asymptotic version of Lévy's continuity theorem known as Esseen's smoothing lemma (see, e.g., [20, Th. 1.5.2]) may be invoked to tighten the third-order term herein.
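Curtiss' theorem is the step that converts the pointwise MGF convergence (73) into the convergence in distribution used in (81). A toy, self-contained illustration of the kind of convergence the theorem consumes (standardized sums of i.i.d. Exp(1) variables, a textbook example unrelated to the paper's channel) is sketched below; the MGF of the standardized sum is available in closed form, so no sampling is needed:

```python
import math

def mgf_std_exp_sum(t: float, n: int) -> float:
    """Exact MGF of (sum_{i<=n} (E_i - 1)) / sqrt(n) for i.i.d. E_i ~ Exp(1).
    Valid for t < sqrt(n); equals (e^{-u} / (1 - u))^n with u = t / sqrt(n)."""
    u = t / math.sqrt(n)
    return (math.exp(-u) / (1 - u)) ** n

# pointwise convergence to the standard Gaussian MGF exp(t^2 / 2)
for t in (-1.0, 0.5, 1.5):
    gap_small = abs(mgf_std_exp_sum(t, 100) - math.exp(t * t / 2))
    gap_large = abs(mgf_std_exp_sum(t, 1_000_000) - math.exp(t * t / 2))
    assert gap_large < gap_small   # the gap shrinks as n grows
    assert gap_large < 1e-2
```

The leading error term here is of order $t^3/\sqrt{n}$ in the exponent, which hints at why a quantitative (non-asymptotic) version of Curtiss' theorem, if one existed, could plausibly yield an explicit third-order bound.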
C. Future Work
The techniques presented herein may be used to analyze the fixed-error asymptotics of fixed-length codes over parallel DMCs with cost constraints and with feedback. We envision that there will be an added layer of complexity, as the method of types [21] is typically used to analyze the fixed-error asymptotics of DMCs with and without cost constraints [22]. Hence, we anticipate that two applications of the method of types will be needed: one for handling the power types that specify the power allocation among the parallel channels (as was done in Section IV-A), and another for handling codewords of the same power type but of different empirical distributions (the usual types). In the present work, the latter complication is ameliorated by the fact that the maximum rate achievable by a coding scheme over an AWGN channel is completely determined by the power allocated to the coding scheme.

APPENDIX A
PROOF OF LEMMA 3

Proof:
Let $f_{\ell,k}$ and $\tilde f_{\ell,k}$ be the encoding functions of the $(n,M,P)$-feedback code and the $(\gamma,s)$-modified code respectively for each $\ell\in\mathcal{L}$ and each $k\in\{1,2,\ldots,n\}$. For any $w\in\mathcal{W}$ and any $y^{n-1}\in\mathbb{R}^{L\times(n-1)}$ such that
$$\phi\Big(\big[f_{1,1}(w)\;\cdots\;f_{L,1}(w)\big]^t, \ldots, \big[f_{1,n}(w,y^{n-1})\;\cdots\;f_{L,n}(w,y^{n-1})\big]^t\Big) \in \Gamma^{(\gamma)}(s) \quad (107)$$
and
$$\sum_{\ell=1}^L\sum_{k=1}^n f_{\ell,k}(w,y^{k-1})^2 = nP, \quad (108)$$
it follows from (27), (30) and (33) in Definition 8 that
$$\Big(\big[f_{1,1}(w)\;\cdots\;f_{L,1}(w)\big]^t, \ldots, \big[f_{1,n}(w,y^{n-1})\;\cdots\;f_{L,n}(w,y^{n-1})\big]^t\Big)
= \Big(\big[\tilde f_{1,1}(w)\;\cdots\;\tilde f_{L,1}(w)\big]^t, \ldots, \big[\tilde f_{1,n}(w,y^{n-1})\;\cdots\;\tilde f_{L,n}(w,y^{n-1})\big]^t\Big). \quad (109)$$
Since (109) holds for any $w\in\mathcal{W}$ and $y^{n-1}\in\mathbb{R}^{L\times(n-1)}$ that satisfy (107) and (108), it follows that (38) holds for all Borel measurable $A\subseteq\mathbb{R}^{L\times n}\times\mathbb{R}^{L\times n}$.

APPENDIX B
PROOF OF LEMMA 4

The proof for the special case $L=1$ is contained in [12, Sec. E]. For a general $L\in\mathbb{N}$, consider the following chain of equalities for each $m = n, n-1, \ldots, 1$:
$$
\mathrm{E}\left[e^{\lambda\sum_{k=1}^m U_k^{(P^*)}} \cdot e^{\lambda^2\sum_{\ell=1}^L \frac{N_\ell\left(nP_\ell - \sum_{k=1}^m X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}}\right]
= \mathrm{E}\left[\mathrm{E}\left[e^{\lambda\sum_{k=1}^m U_k^{(P^*)}} \cdot e^{\lambda^2\sum_{\ell=1}^L \frac{N_\ell\left(nP_\ell - \sum_{k=1}^m X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}}\,\Bigg|\, X^m, Z^{m-1}\right]\right] \quad (110)
$$
$$
= \mathrm{E}\left[e^{\lambda\sum_{k=1}^{m-1} U_k^{(P^*)}} \cdot e^{\lambda^2\sum_{\ell=1}^L \frac{N_\ell\left(nP_\ell - \sum_{k=1}^{m-1} X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}}
\cdot \int_{\mathbb{R}^L} p^*_{Z_m}(z_m)\, e^{\lambda U_m^{(P^*)}} \cdot e^{-\lambda^2\sum_{\ell=1}^L \frac{N_\ell X_{\ell,m}^2}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}}\, \mathrm{d}z_m\right] \quad (111)
$$
$$
= e^{\lambda\sum_{\ell=1}^L \frac{P_\ell}{2(P_\ell+N_\ell)}} \sqrt{\prod_{\ell=1}^L \frac{P_\ell+N_\ell}{(1+\lambda)P_\ell+N_\ell}}
\cdot \mathrm{E}\left[e^{\lambda\sum_{k=1}^{m-1} U_k^{(P^*)}} \cdot e^{\lambda^2\sum_{\ell=1}^L \frac{N_\ell\left(nP_\ell - \sum_{k=1}^{m-1} X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}}\right] \quad (112)
$$
where
• (111) is due to the fact that $Z_m$ and $(X^m, Z^{m-1})$ are independent.
• (112) is due to the definition of $U_k^{(P^*)}$ in (68).
Applying (112) recursively from $m=n$ to $m=1$, we have
$$
\mathrm{E}\left[e^{\lambda\sum_{k=1}^n U_k^{(P^*)}} \cdot e^{\lambda^2\sum_{\ell=1}^L \frac{N_\ell\left(nP_\ell - \sum_{k=1}^n X_{\ell,k}^2\right)}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}}\right]
= \left(\prod_{\ell=1}^L \frac{P_\ell+N_\ell}{(1+\lambda)P_\ell+N_\ell}\right)^{n/2} e^{n\sum_{\ell=1}^L\left(\frac{\lambda P_\ell}{2(P_\ell+N_\ell)} + \frac{\lambda^2 N_\ell P_\ell}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}\right)}. \quad (113)
$$
On the other hand, straightforward calculations based on the definition of $V_k^{(P^*)}$ in (72) and the fact that $\{Z_k\}_{k=1}^n$ are independent imply that
$$
\mathrm{E}\left[e^{\lambda\sum_{k=1}^n V_k^{(P^*)}}\right]
= \left(\prod_{\ell=1}^L \frac{P_\ell+N_\ell}{(1+\lambda)P_\ell+N_\ell}\right)^{n/2} e^{n\sum_{\ell=1}^L\left(\frac{\lambda P_\ell}{2(P_\ell+N_\ell)} + \frac{\lambda^2 N_\ell P_\ell}{2(P_\ell+N_\ell)\left((1+\lambda)P_\ell+N_\ell\right)}\right)}. \quad (114)
$$
Combining (113) and (114), we obtain (78).

APPENDIX C
DERIVATIONS OF (88)
AND (89)

Straightforward calculations based on (85) reveal that for all $s\in\mathbb{R}_+^L$,
$$\nabla f(s) = \begin{bmatrix} \frac{1}{2(N_1+s_1)} - \Lambda + \mu_1 \\ \vdots \\ \frac{1}{2(N_L+s_L)} - \Lambda + \mu_L \end{bmatrix} \quad (115)$$
and $\nabla^2 f(s)$ is the diagonal matrix that satisfies
$$\nabla^2 f(s) = \mathrm{diag}\left(-\frac{1}{2(N_1+s_1)^2}, \ldots, -\frac{1}{2(N_L+s_L)^2}\right). \quad (116)$$
Combining (115), (5), (6) and (86), we have $\nabla f(P^*) = \mathbf{0}$. In addition, for any $s$ such that $\sum_{\ell=1}^L s_\ell \le P$, it follows from (116) that $N_\ell + s_\ell \le N_{\max} + P$ for all $\ell\in\mathcal{L}$, which then implies that (89) holds for all $s\in\mathcal{S}^{(n)}$.

APPENDIX D
DERIVATION OF (97)

Let $t = n^{-1/2}$ and fix any $s\in\mathcal{S}^{(n)}\setminus\Pi^{(n)}$. Due to (70), it suffices to show that
$$
\mathrm{E}_{p^{(s)}_{X^n,Y^n}}\left[e^{t\sum_{k=1}^n U_k^{(s)}} \cdot e^{t^2\sum_{\ell=1}^L \frac{N_\ell\left(ns_\ell - \sum_{k=1}^n X_{\ell,k}^2\right)}{2(s_\ell+N_\ell)\left((1+t)s_\ell+N_\ell\right)}}\right]
= \left(\prod_{\ell=1}^L \frac{s_\ell+N_\ell}{(1+t)s_\ell+N_\ell}\right)^{n/2} e^{n\sum_{\ell=1}^L\left(\frac{t s_\ell}{2(s_\ell+N_\ell)} + \frac{t^2 N_\ell s_\ell}{2(s_\ell+N_\ell)\left((1+t)s_\ell+N_\ell\right)}\right)}. \quad (117)
$$
Replacing $P^*$ with $s$ in the steps leading to (112) and (113), we obtain (117).

ACKNOWLEDGMENTS
The authors would like to thank the Associate Editor, Prof. Shun Watanabe, and the two anonymous reviewers for their useful comments, which improved the presentation of this paper.

REFERENCES

[1] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: John Wiley and Sons, 2006.
[2] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge, U.K.: Cambridge University Press, 2012.
[3] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge, U.K.: Cambridge University Press, 2005.
[4] C. E. Shannon, "Communication in the presence of noise," Proceedings of the IRE, vol. 37, no. 1, pp. 10-21, 1949.
[5] Y. Polyanskiy, "Channel coding: Non-asymptotic fundamental limits," Ph.D. dissertation, Princeton University, 2010.
[6] V. Y. F. Tan and M. Tomamichel, "The third-order term in the normal approximation for the AWGN channel," IEEE Trans. Inf. Theory, vol. 61, no. 5, pp. 2430-2438, 2015.
[7] C. E. Shannon, "The zero error capacity of a noisy channel," IRE Transactions on Information Theory, vol. 2, no. 3, pp. 8-19, 1956.
[8] Y. Altuğ and A. B. Wagner, "Feedback can improve the second-order coding performance in discrete memoryless channels," in Proc. IEEE Intl. Symp. Inf. Theory, Honolulu, HI, Jul. 2014, pp. 2361-2365.
[9] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Feedback in the non-asymptotic regime," IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 4903-4925, 2011.
[10] M. El Machkouri and L. Ouchti, "Exact convergence rates in the central limit theorem for a class of martingales," Bernoulli, vol. 13, no. 4, pp. 981-999, Nov. 2007.
[11] W. Feller, An Introduction to Probability Theory and Its Applications, vol. 2, 2nd ed. Hoboken, NJ: John Wiley and Sons, 1971.
[12] S. L. Fong and V. Y. F. Tan, "Asymptotic expansions for the AWGN channel with feedback under a peak power constraint," in Proc. IEEE Intl. Symp. Inf. Theory, Hong Kong, Jun. 2015, pp. 311-315.
[13] L. V. Truong, S. L. Fong, and V. Y. F. Tan, "On Gaussian channels with feedback under expected power constraints and with non-vanishing error probabilities," IEEE Trans. Inf. Theory, vol. 63, no. 3, pp. 1746-1765, 2017.
[14] L. Wang, R. Colbeck, and R. Renner, "Simple channel coding bounds," in Proc. IEEE Intl. Symp. Inf. Theory, Seoul, South Korea, Jun./Jul. 2009, pp. 1804-1808.
[15] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Channel coding rate in the finite blocklength regime," IEEE Trans. Inf. Theory, vol. 56, no. 5, pp. 2307-2359, 2010.
[16] J. H. Curtiss, "A note on the theory of moment generating functions," Ann. Math. Statist., vol. 13, no. 4, pp. 430-433, 1942.
[17] D. Williams, Probability with Martingales. Cambridge, U.K.: Cambridge University Press, 1991.
[18] M. Hayashi, "Information spectrum approach to second-order coding rate in channel coding," IEEE Trans. Inf. Theory, vol. 55, no. 11, pp. 4947-4966, 2009.
[19] M. Tomamichel and V. Y. F. Tan, "A tight upper bound for the third-order asymptotics for most discrete memoryless channels," IEEE Trans. Inf. Theory, vol. 59, no. 11, pp. 7041-7051, 2013.
[20] I. A. Ibragimov and Yu. V. Linnik, Independent and Stationary Sequences of Random Variables, J. F. C. Kingman, Ed. Groningen, Netherlands: Wolters-Noordhoff Publishing, 1971.
[21] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge, U.K.: Cambridge University Press, 2011.
[22] V. Strassen, "Asymptotische Abschätzungen in Shannons Informationstheorie," in Trans. Third Prague Conf. Inf. Theory, 1962.