A Formula for the Capacity of the General Gel'fand-Pinsker Channel
Vincent Y. F. Tan, Member, IEEE
Abstract—We consider the Gel'fand-Pinsker problem in which the channel and state are general, i.e., possibly non-stationary, non-memoryless and non-ergodic. Using the information spectrum method and a non-trivial modification of the piggyback coding lemma by Wyner, we prove that the capacity can be expressed as an optimization over the difference of a spectral inf- and a spectral sup-mutual information rate. We consider various specializations including the case where the channel and state are memoryless but not necessarily stationary.
Index Terms—Gel'fand-Pinsker, Information spectrum, General channels, General sources
I. INTRODUCTION
In this paper, we consider the classical problem of channel coding with noncausal state information at the encoder, also known as the Gel'fand-Pinsker problem [1]. In this problem, we would like to send a uniformly distributed message over a state-dependent channel $W^n : \mathcal{X}^n \times \mathcal{S}^n \to \mathcal{Y}^n$, where $\mathcal{S}$, $\mathcal{X}$ and $\mathcal{Y}$ are the state, input and output alphabets respectively. The random state sequence $S^n \sim P_{S^n}$ is available noncausally at the encoder but not at the decoder. See Fig. 1. The Gel'fand-Pinsker problem consists in finding the maximum rate for which there exists a reliable code. Assuming that the channel and state sequence are stationary and memoryless, Gel'fand and Pinsker [1] showed that this maximum message rate or capacity $C = C(W, P_S)$ is given by

$$C = \max_{\substack{P_{U|S},\; g:\,\mathcal{U}\times\mathcal{S}\to\mathcal{X} \\ |\mathcal{U}| \le |\mathcal{X}||\mathcal{S}|+1}} I(U;Y) - I(U;S). \qquad (1)$$

The coding scheme involves a covering step at the encoder to reduce the uncertainty due to the random state sequence and a packing step to decode the message [2, Chapter 7]. Thus, we observe the covering rate $I(U;S)$ and the packing rate $I(U;Y)$ in (1). A weak converse can be proved by using the Csiszár-sum-identity [2, Chapter 7]. A strong converse was proved by Tyagi and Narayan [3] using entropy and image-size characterizations [4, Chapter 15]. The Gel'fand-Pinsker problem has numerous applications, particularly in information hiding and watermarking [5].

In this paper, we revisit the Gel'fand-Pinsker problem. Instead of assuming stationarity and memorylessness on the channel and state sequence, we let the channel $W^n$ be a general one in the sense of Verdú and Han [6], [7]. That is, $\mathbf{W} = \{W^n\}_{n=1}^\infty$ is an arbitrary sequence of stochastic mappings from $\mathcal{X}^n \times \mathcal{S}^n$ to $\mathcal{Y}^n$. We also model the state distribution as a general one $\mathbf{S} \sim \{P_{S^n} \in \mathcal{P}(\mathcal{S}^n)\}_{n=1}^\infty$. Such generality allows us to understand optimal coding schemes for general systems. We prove an analogue of the Gel'fand-Pinsker capacity in (1) by using information spectrum analysis [7]. Our result is expressed in terms of the limit superior and limit inferior in probability operations [7]. For the direct part, we leverage on a technique used by Iwata and Muramatsu [8] for the general Wyner-Ziv problem [2, Chapter 11]. Our proof technique involves a non-trivial modification of Wyner's piggyback coding lemma (PBL) [9, Lemma 4.3]. We also find the capacity region for the case where rate-limited coded state information is available at the decoder. This setting, shown in Fig. 1, was studied by Steinberg [10] but we consider the scenario in which the state and channel are general.

(The author is with the Department of Electrical and Computer Engineering as well as the Department of Mathematics, National University of Singapore (email: [email protected]). This paper was presented in part at the 2013 International Symposium on Information Theory in Istanbul, Turkey.)
Fig. 1. The Gel’fand-Pinsker problem with rate-limited coded state infor-mation available at the decoder [10], [11]. The message M is uniformlydistributed in [1 : exp( nR )] and independent of S n . The compression index M d is rate-limited and takes values in [1 : exp( nR d )] . The canonicalGel’fand-Pinsker problem [1] is a special case in which the output of thestate encoder is a deterministic quantity. distribution as a general one S ∼ { P S n ∈ P ( S n ) } ∞ n =1 . Suchgenerality allows us to understand optimal coding schemesfor general systems. We prove an analogue of the Gel’fand-Pinsker capacity in (1) by using information spectrum analy-sis [7]. Our result is expressed in terms of the limit superiorand limit inferior in probability operations [7]. For the directpart, we leverage on a technique used by Iwata and Mura-matsu [8] for the general Wyner-Ziv problem [2, Chapter 11].Our proof technique involves a non-trivial modification ofWyner’s piggyback coding lemma (PBL) [9, Lemma 4.3].We also find the capacity region for the case where rate-limited coded state information is available at the decoder.This setting, shown in Fig. 1, was studied by Steinberg [10]but we consider the scenario in which the state and channelare general. A. Main Contributions
There are two main contributions in this work.

First, by developing a non-asymptotic upper bound on the average probability of error for any Gel'fand-Pinsker problem (Lemma 8), we prove that the general capacity is

$$C = \sup\; \underline{I}(\mathbf{U};\mathbf{Y}) - \overline{I}(\mathbf{U};\mathbf{S}), \qquad (2)$$

where the supremum is over all conditional probability laws $\{P_{X^n,U^n|S^n}\}_{n=1}^\infty$ (or equivalently over Markov chains $U^n - (X^n, S^n) - Y^n$ given the channel law $W^n$) and $\underline{I}(\mathbf{U};\mathbf{Y})$ (resp. $\overline{I}(\mathbf{U};\mathbf{S})$) is the spectral inf-mutual information rate (resp. the spectral sup-mutual information rate) [7]. The expression in (2) reflects the fact that there are two distinct steps: a covering step and a packing step. To cover successfully, we need to expend a rate of $\overline{I}(\mathbf{U};\mathbf{S}) + \gamma$ (for any $\gamma > 0$) as stipulated by general fixed-length rate-distortion theory [12, Section VI]. Thus, each subcodebook has to have at least $\approx \exp(n(\overline{I}(\mathbf{U};\mathbf{S}) + \gamma))$ sequences. To decode the codeword's subcodebook index correctly, we can have at most $\approx \exp(n(\underline{I}(\mathbf{U};\mathbf{Y}) - \gamma))$ codewords by the general channel coding result of Verdú and Han [6]. We can specialize the general result in (2) to the following scenarios: (i) no state information, thus recovering the result by Verdú and Han [6], (ii) common state information available at the encoder and the decoder, (iii) memoryless (but not necessarily stationary) channels and states, and (iv) mixed channels and states [7].

Second, we extend the above result to the case where coded state information is available at the decoder. This problem was first studied by Heegard and El Gamal [11] and later by Steinberg [10]. In this case, we combine our coding scheme with that of Iwata and Muramatsu for the general Wyner-Ziv problem [8] to obtain the tradeoff between $R_d$, the rate of the compressed state information that is available at the decoder, and $R$, the message rate. We show that the tradeoff (or capacity region) is the set of rate pairs $(R, R_d)$ satisfying

$$R_d \ge \overline{I}(\mathbf{V};\mathbf{S}) - \underline{I}(\mathbf{V};\mathbf{Y}), \qquad (3)$$
$$R \le \underline{I}(\mathbf{U};\mathbf{Y}|\mathbf{V}) - \overline{I}(\mathbf{U};\mathbf{S}|\mathbf{V}), \qquad (4)$$

for some $(\mathbf{U},\mathbf{V}) - (\mathbf{X},\mathbf{S}) - \mathbf{Y}$. This general result can be specialized to the stationary, memoryless setting [10].
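To make the covering and packing arithmetic in (1) and (2) concrete, the following self-contained Python sketch evaluates the stationary, memoryless expression (1) by exhaustive search on a toy binary example. It is a minimal illustration under assumptions of our own choosing (binary alphabets, $|\mathcal{U}| = 2$ instead of $|\mathcal{X}||\mathcal{S}|+1$, and a finite grid over $P_{U|S}$), so it returns a lower bound to (1); the channel, state distribution and all identifiers are hypothetical and not from the paper.

```python
import itertools
import numpy as np

def mi(p):
    """Mutual information (in nats) of a two-dimensional joint pmf."""
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    m = p > 0
    return float((p[m] * np.log(p[m] / (pa * pb)[m])).sum())

def gp_lower_bound(W, P_S, grid=21):
    """Brute-force the objective of (1) with |U| = 2 over a grid of P_{U|S}
    and all deterministic maps g : U x S -> X.  W[s, x, y] is the channel,
    P_S the state pmf.  Returns a lower bound to C(W, P_S)."""
    nS, nX, _ = W.shape
    best = -np.inf
    for g in itertools.product(range(nX), repeat=2 * nS):
        g = np.reshape(g, (2, nS))                      # g[u, s] = x
        for ps in itertools.product(np.linspace(0, 1, grid), repeat=nS):
            P_US = np.array([[q, 1 - q] for q in ps])   # P_{U|S}[s, u]
            joint_su = P_S[:, None] * P_US              # covering side: P(s, u)
            # packing side: P(s, u, y) = P(s, u) * W(y | g(u, s), s)
            joint_suy = joint_su[:, :, None] * W[np.arange(nS)[:, None], g.T, :]
            best = max(best, mi(joint_suy.sum(axis=0)) - mi(joint_su))
    return best

# Hypothetical toy channel: the state flips the role of the channel inputs.
W = np.array([[[0.9, 0.1], [0.1, 0.9]],    # s = 0: BSC(0.1)
              [[0.1, 0.9], [0.9, 0.1]]])   # s = 1: same BSC, inputs swapped
P_S = np.array([0.5, 0.5])
print(gp_lower_bound(W, P_S))  # ~ log 2 - H_b(0.1) ~ 0.368 nats
```

For this particular channel the search recovers the intuitive scheme $X = U \oplus S$ with $U$ uniform and independent of $S$: the covering rate $I(U;S)$ is zero and the packing rate equals the capacity of a BSC with crossover probability $0.1$.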
B. Related Work

The study of general channels started with the seminal work by Verdú and Han [6] in which the authors characterized the capacity in terms of the limit inferior in probability of a sequence of information densities. See Han's book [7] for a comprehensive exposition on the information spectrum method. This line of analysis provides deep insights into the fundamental limits of the transmission of information over general channels and the compressibility of general sources that may not be stationary, memoryless or ergodic. Information spectrum analysis has been used for rate-distortion [12], the Wyner-Ziv problem [8], the Heegard-Berger problem [13], the Wyner-Ahlswede-Körner (WAK) problem [14] and the wiretap channel [15], [16]. The Wyner-Ziv and wiretap problems are the most closely related to the problem we solve in this paper. In particular, they involve differences of mutual informations akin to the Gel'fand-Pinsker problem. Even though it is not in the spirit of information spectrum methods, the work of Yu et al. [17] deals with the Gel'fand-Pinsker problem for non-stationary, non-ergodic Gaussian noise and state (also called "writing on colored paper"). We contrast our work to [17] in Section III-E.

It is also worth mentioning that bounds on the reliability function (error exponent) for the Gel'fand-Pinsker problem have been derived by Tyagi and Narayan [3] (upper bounds) and Moulin and Wang [18] (lower bounds).
C. Paper Organization
The rest of this paper is structured as follows. In Section II, we state our notation and various other definitions. In Section III, we state all information spectrum-based results and their specializations. We conclude our discussion in Section IV. The proofs of all results are provided in the Appendices.

II. SYSTEM MODEL AND MAIN DEFINITIONS
In this section, we state our notation and the definitions of the various problems that we consider in this paper.
A. Notation
Random variables (e.g., $X$) and their realizations (e.g., $x$) are denoted by upper case and lower case serif font respectively. Sets are denoted in calligraphic font (e.g., the alphabet of $X$ is $\mathcal{X}$). We use the notation $X^n$ to mean a vector of random variables $(X_1, \ldots, X_n)$. In addition, $\mathbf{X} = \{X^n\}_{n=1}^\infty$ denotes a general source in the sense that each member of the sequence $X^n = (X_1^{(n)}, \ldots, X_n^{(n)})$ is a random vector. The same holds for a general channel $\mathbf{W} = \{W^n : \mathcal{X}^n \to \mathcal{Y}^n\}_{n=1}^\infty$, which is simply a sequence of stochastic mappings from $\mathcal{X}^n$ to $\mathcal{Y}^n$. The set of all probability distributions with support on an alphabet $\mathcal{X}$ is denoted as $\mathcal{P}(\mathcal{X})$. We use the notation $X \sim P_X$ to mean that the distribution of $X$ is $P_X$. The joint distribution induced by the marginal $P_X$ and the conditional $P_{Y|X}$ is denoted as $P_X \circ P_{Y|X}$. Information-theoretic quantities are denoted using the usual notations [4], [7] (e.g., $I(X;Y)$ for mutual information, or $I(P_X, P_{Y|X})$ if we want to make the dependence on distributions explicit). All logarithms are to the base $e$. We also use the discrete interval [2] notation $[i:j] := \{i, \ldots, j\}$ and, for the most part, ignore integer requirements.

We recall the following probabilistic limit operations [7].
Definition 1. Let $\mathbf{U} := \{U_n\}_{n=1}^\infty$ be a sequence of real-valued random variables. The limsup in probability of $\mathbf{U}$ is an extended real number defined as

$$\operatorname{p-limsup}_{n\to\infty} U_n := \inf\left\{\alpha : \lim_{n\to\infty} \mathbb{P}(U_n > \alpha) = 0\right\}. \qquad (5)$$

The liminf in probability of $\mathbf{U}$ is defined as

$$\operatorname{p-liminf}_{n\to\infty} U_n := -\operatorname{p-limsup}_{n\to\infty}\, (-U_n). \qquad (6)$$
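To make the operations in (5) and (6) concrete, the following Monte Carlo sketch estimates the p-limsup and p-liminf of a sequence $U_n$ with two accumulation points. The two-mode Gaussian example, the sample sizes and the tail threshold are hypothetical choices of ours; finite-$n$ frequencies only approximate the defining limits.

```python
import numpy as np

rng = np.random.default_rng(0)

# U_n = sample mean of n i.i.d. N(mu, 1) variables, where mu is drawn once
# per realization as 0 or 1 (a two-mode mixture).  U_n concentrates on the
# set {0, 1}, so p-liminf U_n = 0 and p-limsup U_n = 1: the operations in
# (5)-(6) pick out the extreme accumulation points, not the average 1/2.
n, trials = 10_000, 50_000
mu = rng.choice([0.0, 1.0], size=trials)
Un = mu + rng.standard_normal(trials) / np.sqrt(n)     # U_n ~ N(mu, 1/n)

alphas = np.linspace(-0.5, 1.5, 401)
upper = np.array([(Un > a).mean() for a in alphas])    # estimates P(U_n > alpha)
lower = np.array([(Un < a).mean() for a in alphas])    # estimates P(U_n < alpha)
print("p-limsup ~", alphas[upper < 1e-3].min())        # smallest alpha with vanishing upper tail, ~ 1
print("p-liminf ~", alphas[lower < 1e-3].max())        # largest alpha with vanishing lower tail, ~ 0
```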
We also recall the following definitions from Han [7]. These definitions play a prominent role in the sequel.

Definition 2. Given a pair of stochastic processes $(\mathbf{X}, \mathbf{Y}) = \{X^n, Y^n\}_{n=1}^\infty$ with joint distributions $\{P_{X^n,Y^n}\}_{n=1}^\infty$, the spectral sup-mutual information rate is defined as

$$\overline{I}(\mathbf{X};\mathbf{Y}) := \operatorname{p-limsup}_{n\to\infty} \frac{1}{n} \log \frac{P_{Y^n|X^n}(Y^n|X^n)}{P_{Y^n}(Y^n)}. \qquad (7)$$

The spectral inf-mutual information rate $\underline{I}(\mathbf{X};\mathbf{Y})$ is defined as in (7) with $\operatorname{p-liminf}$ in place of $\operatorname{p-limsup}$. The spectral sup- and inf-conditional mutual information rates are defined similarly.

B. The Gel'fand-Pinsker Problem

We now recall the definition of the Gel'fand-Pinsker problem. The setup is shown in Fig. 1 with $M_d = \emptyset$.

Definition 3. An $(n, M_n, \epsilon)$ code for the Gel'fand-Pinsker problem with channel $W^n : \mathcal{X}^n \times \mathcal{S}^n \to \mathcal{Y}^n$ and state distribution $P_{S^n} \in \mathcal{P}(\mathcal{S}^n)$ consists of

• An encoder $f_n : [1 : M_n] \times \mathcal{S}^n \to \mathcal{X}^n$ (possibly randomized)
• A decoder $\varphi_n : \mathcal{Y}^n \to [1 : M_n]$

such that the average error probability in decoding the message does not exceed $\epsilon$, i.e.,

$$\frac{1}{M_n} \sum_{s^n \in \mathcal{S}^n} P_{S^n}(s^n) \sum_{m=1}^{M_n} W^n(\mathcal{B}_m^c \,|\, f_n(m, s^n), s^n) \le \epsilon, \qquad (8)$$

where $\mathcal{B}_m := \{y^n \in \mathcal{Y}^n : \varphi_n(y^n) = m\}$ and $\mathcal{B}_m^c := \mathcal{Y}^n \setminus \mathcal{B}_m$.

We assume that the message is uniformly distributed in the message set $[1 : M_n]$ and that it is independent of the state sequence $S^n \sim P_{S^n}$. Here we remark that in (8) (and everywhere else in the paper) we use the notation $\sum_{s^n \in \mathcal{S}^n}$ even though $\mathcal{S}^n$ may not be countable. This is the convention we adopt throughout the paper even though integrating against the measure $P_{S^n}$, as in $\int_{\mathcal{S}^n} \cdot \; \mathrm{d}P_{S^n}$, would be more precise.
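The next sketch illustrates Definition 2 numerically for a stationary, memoryless pair, where the normalized information density concentrates at $I(X;Y)$ so that the spectral inf- and sup-mutual information rates coincide (cf. the point-mass picture discussed around Fig. 2). The BSC, its crossover probability and the sample sizes are our own hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Samples of the normalized information density
#   (1/n) log [ P_{Y^n|X^n}(Y^n|X^n) / P_{Y^n}(Y^n) ]
# for a BSC(eps) with uniform i.i.d. inputs (so Y^n is also uniform).  For a
# stationary, memoryless pair the spectrum is a point mass at I(X;Y), hence
# the inf- and sup-rates of Definition 2 coincide.
n, trials, eps = 2_000, 10_000, 0.1
flip = rng.random((trials, n)) < eps                  # channel flip pattern
dens = np.where(flip, np.log(eps / 0.5), np.log((1 - eps) / 0.5))
i_n = dens.mean(axis=1)                               # one density sample per block
IXY = np.log(2) + eps * np.log(eps) + (1 - eps) * np.log(1 - eps)
print(i_n.mean(), i_n.std(), "vs I(X;Y) =", IXY)      # mean ~ IXY, std ~ O(1/sqrt(n))
```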
Definition 4. We say that the nonnegative number $R$ is an $\epsilon$-achievable rate if there exists a sequence of $(n, M_n, \epsilon_n)$ codes for which

$$\liminf_{n\to\infty} \frac{1}{n} \log M_n \ge R, \qquad \limsup_{n\to\infty} \epsilon_n \le \epsilon. \qquad (9)$$

The $\epsilon$-capacity $C_\epsilon$ is the supremum of all $\epsilon$-achievable rates. The capacity is $C := \lim_{\epsilon \to 0} C_\epsilon$.

The capacity of the general Gel'fand-Pinsker channel is stated in Section III-A. This generalizes the result in (1), which is the capacity for the Gel'fand-Pinsker channel when the channel and state are stationary and memoryless.
C. The Gel’fand-Pinsker Problem With Coded State Informa-tion at the Decoder
In fact, the general information spectrum techniques allow us to solve a related problem which was first considered by Heegard and El Gamal [11] and subsequently solved by Steinberg [10]. The setup is shown in Fig. 1.
Definition 5. An $(n, M_n, M_{d,n}, \epsilon)$ code for the Gel'fand-Pinsker problem with channel $W^n : \mathcal{X}^n \times \mathcal{S}^n \to \mathcal{Y}^n$ and state distribution $P_{S^n} \in \mathcal{P}(\mathcal{S}^n)$ and with coded state information at the decoder consists of

• A state encoder: $f_{d,n} : \mathcal{S}^n \to [1 : M_{d,n}]$
• An encoder: $f_n : [1 : M_n] \times \mathcal{S}^n \to \mathcal{X}^n$ (possibly randomized)
• A decoder: $\varphi_n : \mathcal{Y}^n \times [1 : M_{d,n}] \to [1 : M_n]$

such that the average error probability in decoding the message is no larger than $\epsilon$, i.e.,

$$\frac{1}{M_n} \sum_{s^n \in \mathcal{S}^n} P_{S^n}(s^n) \sum_{m=1}^{M_n} W^n(\mathcal{B}_{m,s^n}^c \,|\, f_n(m, s^n), s^n) \le \epsilon, \qquad (10)$$

where $\mathcal{B}_{m,s^n} := \{y^n \in \mathcal{Y}^n : \varphi_n(y^n, f_{d,n}(s^n)) = m\}$.
Definition 6. We say that the pair of nonnegative numbers $(R, R_d)$ is an achievable rate pair if there exists a sequence of $(n, M_n, M_{d,n}, \epsilon_n)$ codes such that, in addition to (9), the following holds:

$$\limsup_{n\to\infty} \frac{1}{n} \log M_{d,n} \le R_d. \qquad (11)$$

The set of all achievable rate pairs is known as the capacity region $\mathcal{C}$.

Fig. 2. Illustration of Theorem 1, where $i(U^n; S^n) := n^{-1} \log[P_{U^n|S^n}(U^n|S^n)/P_{U^n}(U^n)]$ and similarly for $i(U^n; Y^n)$. The capacity is the difference between $\underline{I}(\mathbf{U};\mathbf{Y})$ and $\overline{I}(\mathbf{U};\mathbf{S})$ evaluated at the optimal processes. The stationary, memoryless case (Corollary 4) corresponds to the situation in which $\underline{I}(\mathbf{U};\mathbf{S}) = \overline{I}(\mathbf{U};\mathbf{S}) = I(U;S)$ and $\underline{I}(\mathbf{U};\mathbf{Y}) = \overline{I}(\mathbf{U};\mathbf{Y}) = I(U;Y)$, so the information spectra are point masses at the mutual informations.

Heegard and El Gamal [11] (achievability) and Steinberg [10] (converse) showed for the discrete memoryless channel and discrete memoryless state that the capacity region $\mathcal{C}$ is the set of rate pairs $(R, R_d)$ such that

$$R_d \ge I(V;S) - I(V;Y), \qquad (12)$$
$$R \le I(U;Y|V) - I(U;S|V), \qquad (13)$$

for some Markov chain $(U,V) - (X,S) - Y$. Furthermore, $X$ can be taken to be a deterministic function of $(U,V,S)$, $|\mathcal{V}| \le |\mathcal{X}||\mathcal{S}| + 1$ and $|\mathcal{U}| \le |\mathcal{X}||\mathcal{S}|(|\mathcal{X}||\mathcal{S}| + 1)$. The constraint in (12) is obtained using Wyner-Ziv coding with "source" $S$ and "side-information" $Y$. The constraint in (13) is analogous to the Gel'fand-Pinsker capacity where $V$ is common to both encoder and decoder. A weak converse was proven using repeated applications of the Csiszár-sum-identity. We generalize Steinberg's region for the general source $\mathbf{S}$ and general channel $\mathbf{W}$ using information spectrum techniques in Section III-F.

III. INFORMATION SPECTRUM CHARACTERIZATION OF THE GENERAL GEL'FAND-PINSKER PROBLEM
In this section, we first present the main result concerning the capacity of the general Gel'fand-Pinsker problem (Definition 3) in Section III-A. These results are derived using the information spectrum method. We then derive the capacity for various special cases of the Gel'fand-Pinsker problem in Section III-B (two-sided common state information) and Section III-C (memoryless channels and states). We consider mixed states and mixed channels in Section III-D. The main ideas in the proof are discussed in Section III-E. Finally, in Section III-F, we extend our result to the general Gel'fand-Pinsker problem with coded state information at the decoder (Definition 5).
A. Main Result and Remarks
We now state our main result followed by some simple remarks. The proof can be found in Appendices A and B.
Theorem 1 (General Gel'fand-Pinsker Capacity). The capacity of the general Gel'fand-Pinsker channel with general states $(\mathbf{W}, \mathbf{S})$ (see Definition 4) is

$$C = \sup_{\mathbf{U} - (\mathbf{X},\mathbf{S}) - \mathbf{Y}} \underline{I}(\mathbf{U};\mathbf{Y}) - \overline{I}(\mathbf{U};\mathbf{S}), \qquad (14)$$

where the maximization is over all sequences of random variables $(\mathbf{U}, \mathbf{X}, \mathbf{S}, \mathbf{Y}) = \{U^n, X^n, S^n, Y^n\}_{n=1}^\infty$ forming the requisite Markov chain, having the state distribution coinciding with $\mathbf{S}$ and having conditional distribution of $\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S})$ equal to the general channel $\mathbf{W}$. (For three processes $(\mathbf{X}, \mathbf{Y}, \mathbf{Z}) = \{X^n, Y^n, Z^n\}_{n=1}^\infty$, we say that $\mathbf{X} - \mathbf{Y} - \mathbf{Z}$ forms a Markov chain if $X^n - Y^n - Z^n$ for all $n \in \mathbb{N}$.)

See Fig. 2 for an illustration of Theorem 1.
Remark 1.
The general formula in (14) is the dual of that in the Wyner-Ziv case [8]. However, the proofs, and in particular the constructions of the codebooks, the notions of typicality and the application of Wyner's PBL, are subtly different from [8], [14], [19]. We discuss these issues in Section III-E. Another problem which involves a difference of mutual information quantities is the wiretap channel [2, Chapter 22]. General formulas for the secrecy capacity using channel resolvability theory [7, Chapter 5] were provided by Hayashi [15] and by Bloch and Laneman [16]. They also involve the difference between a spectral inf-mutual information rate (of the input and the legitimate receiver) and a sup-mutual information rate (of the input and the eavesdropper).
Remark 2.
Unlike the usual Gel'fand-Pinsker formula for stationary and memoryless channels and states in (1), we cannot conclude that the conditional distribution $P_{X^n,U^n|S^n}$ that we are optimizing over in (14) can be decomposed into the conditional distribution of $U^n$ given $S^n$ (i.e., $P_{U^n|S^n}$) and a deterministic function (i.e., $\{x^n = g_n(u^n, s^n)\}$).
Remark 3.

If there is no state, i.e., $\mathbf{S} = \emptyset$ in (14), then we recover the general formula for channel capacity by Verdú and Han (VH) [6]:

$$C_{\mathrm{VH}} = \sup_{\mathbf{X}} \underline{I}(\mathbf{X};\mathbf{Y}). \qquad (15)$$

The achievability follows by setting $\mathbf{U} = \mathbf{X}$. The converse follows by noting that $\underline{I}(\mathbf{U};\mathbf{Y}) \le \underline{I}(\mathbf{X};\mathbf{Y})$ because $\mathbf{U} - \mathbf{X} - \mathbf{Y}$ [6, Theorem 9]. This is the analogue of the data processing inequality for the spectral inf-mutual information rate.
Remark 4.

The general formula in (14) can be slightly generalized to the Cover-Chiang (CC) setting [20] in which (i) the channel $W^n : \mathcal{X}^n \times \mathcal{S}_e^n \times \mathcal{S}_d^n \to \mathcal{Y}^n$ depends on two state sequences $(S_e^n, S_d^n) \sim P_{S_e^n, S_d^n}$ (in addition to $X^n$), (ii) partial channel state information $S_e^n$ is available noncausally at the encoder and (iii) partial channel state information $S_d^n$ is available at the decoder. In this case, replacing $\mathbf{Y}$ with $(\mathbf{Y}, \mathbf{S}_d)$ and $\mathbf{S}$ with $\mathbf{S}_e$ in (14) yields

$$C_{\mathrm{CC}} = \sup_{\mathbf{U} - (\mathbf{X}, \mathbf{S}_e, \mathbf{S}_d) - (\mathbf{Y}, \mathbf{S}_d)} \underline{I}(\mathbf{U};\mathbf{Y},\mathbf{S}_d) - \overline{I}(\mathbf{U};\mathbf{S}_e), \qquad (16)$$

where the supremum is over all processes $(\mathbf{U}, \mathbf{X}, \mathbf{S}_e, \mathbf{S}_d, \mathbf{Y})$ such that $(\mathbf{S}_e, \mathbf{S}_d)$ coincides with the state distributions $\{P_{S_e^n, S_d^n}\}_{n=1}^\infty$ and $\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S}_e, \mathbf{S}_d)$ coincides with the sequence of channels $\{W^n\}_{n=1}^\infty$. Hence the optimization in (16) is over the conditionals $\{P_{X^n, U^n | S_e^n}\}_{n=1}^\infty$.

B. Two-sided Common State Information
Specializing (16) to the case where $\mathbf{S}_e = \mathbf{S}_d = \mathbf{S}$, i.e., the same side information is available to both encoder and decoder (ED), does not appear to be straightforward without further assumptions. Recall that in the usual scenario [20, Case 4 in Corollary 1], we use the identification $U = X$ and the chain rule for mutual information to assert that $I(X; Y, S) - I(X; S) = I(X; Y | S)$, evaluated at the optimal $P_{X|S}$, is the capacity. However, in information spectrum analysis the chain rule does not hold for the $\operatorname{p-liminf}$ operation. In fact, $\operatorname{p-liminf}$ is superadditive [7]. Nevertheless, under the assumption that a sequence of information densities converges in probability, we can derive the capacity of the general channel with general common state available at both terminals using Theorem 1.

Corollary 2 (General Channel Capacity with State at ED). Consider the problem

$$C_{\mathrm{ED}} = \sup_{\mathbf{X}} \underline{I}(\mathbf{X};\mathbf{Y}|\mathbf{S}), \qquad (17)$$

where the supremum is over all $(\mathbf{X}, \mathbf{S}, \mathbf{Y})$ such that $\mathbf{S}$ coincides with the given state distributions $\{P_{S^n}\}_{n=1}^\infty$ and $\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S})$ coincides with the given channels $\{W^n\}_{n=1}^\infty$. Assume that the maximizer of (17) exists and denote it by $\{P^*_{X^n|S^n}\}_{n=1}^\infty$. Let the distribution of $\mathbf{X}^*$ given $\mathbf{S}$ be $\{P^*_{X^n|S^n}\}_{n=1}^\infty$. If

$$\underline{I}(\mathbf{X}^*;\mathbf{S}) = \overline{I}(\mathbf{X}^*;\mathbf{S}), \qquad (18)$$

then the capacity of the state-dependent channel with state $\mathbf{S}$ available at both encoder and decoder is $C_{\mathrm{ED}}$ in (17).

The proof is provided in Appendix C. If the joint process $(\mathbf{X}^*, \mathbf{S})$ satisfies (18) (and $\mathcal{X}$ and $\mathcal{S}$ are finite sets), it is called information stable [21]. In other words, the limit distribution of $n^{-1} \log[P_{U^n|S^n}(U^n|S^n)/P_{U^n}(U^n)]$ (where $P_{U^n|S^n} = P^*_{X^n|S^n}$) in Fig. 2 concentrates at a single point. We remark that a different achievability proof technique (that does not use Theorem 1) would allow us to dispense with the information stability assumption: we could simply develop a conditional version of Feinstein's lemma [7, Lemma 3.4.1] to prove the direct part of (17). However, we choose to start from Theorem 1, which is the most general capacity result for the Gel'fand-Pinsker problem. Note that the converse of Corollary 2 does not require (18).
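In the stationary, memoryless specialization of Corollary 2 (where (18) holds automatically), $I(X;Y|S) = \sum_s P_S(s)\, I(X;Y|S=s)$ and the maximization over $P_{X|S}$ decouples across states, so $C_{\mathrm{ED}} = \sum_s P_S(s)\, C(W_s)$. The sketch below computes this for a hypothetical two-state example using the standard Blahut-Arimoto algorithm; the channels and state distribution are assumptions made purely for illustration.

```python
import numpy as np

def blahut_arimoto(W, iters=300):
    """Capacity (nats) of a DMC with transition matrix W[x, y]."""
    nX = W.shape[0]
    p = np.full(nX, 1.0 / nX)
    for _ in range(iters):
        q = p[:, None] * W                        # joint P(x, y)
        post = q / q.sum(axis=0, keepdims=True)   # posterior P(x | y)
        r = np.exp((W * np.log(np.maximum(post, 1e-300))).sum(axis=1))
        p = r / r.sum()                           # updated input distribution
    py = (p[:, None] * W).sum(axis=0)
    ratio = np.where(W > 0, W / py, 1.0)
    return float((p[:, None] * W * np.log(ratio)).sum())

def bsc(e):
    return np.array([[1 - e, e], [e, 1 - e]])

# Hypothetical example: the state s picks one of two BSCs; with the state
# known at both sides, capacity is the P_S-average of per-state capacities.
P_S = np.array([0.5, 0.5])
caps = [blahut_arimoto(bsc(e)) for e in (0.05, 0.4)]
print(sum(P_S[s] * c for s, c in enumerate(caps)))    # ~ 0.5*0.495 + 0.5*0.020 nats
```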
C. Memoryless Channels and Memoryless States

To see how we can use Theorem 1 concretely, we specialize it to the memoryless (but not necessarily stationary) setting and we provide some interesting examples. In the memoryless setting, the sequence of channels $\mathbf{W} = \{W^n\}_{n=1}^\infty$ and the sequence of state distributions $P_{\mathbf{S}} = \{P_{S^n}\}_{n=1}^\infty$ are such that for every $(x^n, y^n, s^n) \in \mathcal{X}^n \times \mathcal{Y}^n \times \mathcal{S}^n$ we have $W^n(y^n | x^n, s^n) = \prod_{i=1}^n W_i(y_i | x_i, s_i)$ and $P_{S^n}(s^n) = \prod_{i=1}^n P_{S_i}(s_i)$ for some $\{W_i : \mathcal{X} \times \mathcal{S} \to \mathcal{Y}\}_{i=1}^\infty$ and some $\{P_{S_i} \in \mathcal{P}(\mathcal{S})\}_{i=1}^\infty$.

Corollary 3 (Memoryless Gel'fand-Pinsker Channel Capacity). Assume that $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{S}$ are finite sets and the Gel'fand-Pinsker channel is memoryless and characterized by $\{W_i\}_{i=1}^\infty$ and $\{P_{S_i}\}_{i=1}^\infty$. Define $\phi(P_{X,U|S}; W, P_S) := I(U;Y) - I(U;S)$. Let the maximizers of the optimization problems indexed by $i \in \mathbb{N}$,

$$C(W_i, P_{S_i}) = \max_{P_{X,U|S} :\, |\mathcal{U}| \le |\mathcal{X}||\mathcal{S}|+1} \phi(P_{X,U|S}; W_i, P_{S_i}), \qquad (19)$$

be denoted as $P^*_{X_i, U_i | S_i} : \mathcal{S} \to \mathcal{X} \times \mathcal{U}$. Let $P^*_{S_i, X_i, U_i, Y_i} = P_{S_i} \circ P^*_{X_i, U_i | S_i} \circ W_i \in \mathcal{P}(\mathcal{S} \times \mathcal{X} \times \mathcal{U} \times \mathcal{Y})$ be the joint distribution induced by $P^*_{X_i, U_i | S_i}$. Assume that either of the two limits

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n I(P_{S_i}, P^*_{U_i|S_i}), \qquad \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n I(P^*_{U_i}, P^*_{Y_i|U_i}) \qquad (20)$$

exists. Then the capacity of the memoryless Gel'fand-Pinsker channel is

$$C_{\mathrm{M'less}} = \liminf_{n\to\infty} \frac{1}{n}\sum_{i=1}^n C(W_i, P_{S_i}). \qquad (21)$$

The proof of Corollary 3 is detailed in Appendix D. The Cesàro summability assumption in (20) is only required for achievability. We illustrate the assumption in (20) with two examples in the sequel. The proof of the direct part of Corollary 3 follows by taking the optimization in the general result (14) to be over memoryless conditional distributions. The converse follows by repeated applications of the Csiszár-sum-identity [2, Chapter 2]. If, in addition to being memoryless, the channels and states are stationary (i.e., each $W_i$ and each $P_{S_i}$ is equal to $W$ and $P_S$ respectively), then both limits in (20) exist since $P^*_{X_i, U_i | S_i}$ is the same for each $i \in \mathbb{N}$.

Corollary 4 (Stationary, Memoryless Gel'fand-Pinsker Channel Capacity). Assume that $\mathcal{S}$ is a finite set. In the stationary, memoryless case, the capacity of the Gel'fand-Pinsker channel is given by $C(W, P_S)$ in (1).

We omit the proof because it is a straightforward consequence of Corollary 3. Close examination of the proof of Corollary 3 shows that only the converse of Corollary 4 requires the assumption that $|\mathcal{S}| < \infty$. The achievability of Corollary 4 follows easily from Khintchine's law of large numbers [7, Lemma 1.3.2] (for abstract alphabets).

To gain a better understanding of the assumption in (20) in Corollary 3, we now present a couple of (pathological) examples which are inspired by [7, Remark 3.2.3].
Example 1.

Let $\mathcal{J} := \{i \in \mathbb{N} : 2^{2k-1} \le i < 2^{2k},\, k \in \mathbb{N}\} = [2:3] \cup [8:15] \cup [32:63] \cup \ldots$. Consider a discrete, nonstationary, memoryless channel $\mathbf{W}$ satisfying

$$W_i = \begin{cases} \tilde{W}_{\mathrm{a}} & i \in \mathcal{J} \\ \tilde{W}_{\mathrm{b}} & i \in \mathcal{J}^c, \end{cases} \qquad (22)$$

where $\tilde{W}_{\mathrm{a}}, \tilde{W}_{\mathrm{b}} : \mathcal{X} \times \mathcal{S} \to \mathcal{Y}$ are two distinct channels. Let $P_{S^n} = Q^n$ be the $n$-fold extension of some $Q \in \mathcal{P}(\mathcal{S})$. Let $V^*_m : \mathcal{S} \to \mathcal{U}$ be the $\mathcal{U}$-marginal of the maximizer of (19) when the channel is $\tilde{W}_m$, $m \in \{\mathrm{a}, \mathrm{b}\}$. In general, $I(Q, V^*_{\mathrm{a}}) \ne I(Q, V^*_{\mathrm{b}})$. Because

$$\liminf_{n\to\infty} \frac{1}{n}\,|\mathcal{J} \cap [1:n]| = \frac{1}{3} \ne \frac{2}{3} = \limsup_{n\to\infty} \frac{1}{n}\,|\mathcal{J} \cap [1:n]|,$$

the first limit in (20) does not exist. Similarly, the second limit does not exist in general and Corollary 3 cannot be applied.
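The claimed density of $\mathcal{J}$ is easy to check numerically. The following short sketch (illustrative only; the probe points are our own choice) evaluates $|\mathcal{J} \cap [1:n]|/n$ at block boundaries, where the extreme values are attained.

```python
def count_J(n):
    """|J intersect [1:n]| for J = union over k of [2^(2k-1), 2^(2k))."""
    # i lies in J iff its binary length is even, since bit_length(i) = m
    # exactly when 2^(m-1) <= i < 2^m.
    return sum(1 for i in range(1, n + 1) if i.bit_length() % 2 == 0)

for j in (10, 11, 16, 17):
    n = 2**j - 1
    print(n, round(count_J(n) / n, 4))
# Density ~ 2/3 just after a J-block ends (even j) and ~ 1/3 just after a
# gap ends (odd j); hence liminf = 1/3 and limsup = 2/3 as claimed.
```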
Example 2.

Let $\mathcal{J}$ be as in Example 1 and let the sets of even and odd positive integers be $\mathcal{E}$ and $\mathcal{O}$ respectively. Let $\mathcal{S} = \mathcal{X} = \mathcal{Y} = \{0, 1\}$. Consider a binary, nonstationary, memoryless channel $\mathbf{W}$ satisfying

$$W_i = \begin{cases} \tilde{W}_{\mathrm{a}} & i \in \mathcal{O} \cap \mathcal{J} \\ \tilde{W}_{\mathrm{b}} & i \in \mathcal{O} \cap \mathcal{J}^c \\ \tilde{W}_{\mathrm{c}} & i \in \mathcal{E}, \end{cases} \qquad (23)$$

where $\tilde{W}_{\mathrm{a}}, \tilde{W}_{\mathrm{b}}, \tilde{W}_{\mathrm{c}} : \mathcal{X} \times \mathcal{S} \to \mathcal{Y}$. Also consider a binary, nonstationary, memoryless state $\mathbf{S}$ satisfying

$$P_{S_i} = \begin{cases} Q_{\mathrm{a}} & i \in \mathcal{O} \\ Q_{\mathrm{b}} & i \in \mathcal{E}, \end{cases} \qquad (24)$$

where $Q_{\mathrm{a}}, Q_{\mathrm{b}} \in \mathcal{P}(\{0,1\})$. In addition, assume that $\tilde{W}_m(\cdot\,|\,\cdot, s)$ for $(m, s) \in \{\mathrm{a}, \mathrm{b}\} \times \{0, 1\}$ are binary symmetric channels with arbitrary crossover probabilities $q_{ms} \in (0, 1/2)$. Let $V^*_{m,l} : \mathcal{S} \to \mathcal{U}$ be the $\mathcal{U}$-marginal of the maximizer in (19) when the channel is $\tilde{W}_m$ and the state distribution is $Q_l$, $l \in \{\mathrm{a}, \mathrm{b}\}$. For $m \in \{\mathrm{a}, \mathrm{b}\}$ (odd blocklengths), due to the symmetry of the channels the optimal $V^*_{m,\mathrm{a}}(u|s)$ is Bernoulli$(1/2)$ and independent of $s$ [2, Problem 7.12(c)]. Thus, for all odd blocklengths, the mutual informations in the first limit in (20) are equal to zero. Clearly, the first limit in (20) exists, equalling $\frac{1}{2} I(Q_{\mathrm{b}}, V^*_{\mathrm{c},\mathrm{b}})$ (contributed by the even blocklengths). Therefore, Corollary 3 applies and we can show that the Gel'fand-Pinsker capacity is

$$C = \frac{1}{2}\left[G(Q_{\mathrm{a}}) + C(\tilde{W}_{\mathrm{c}}, Q_{\mathrm{b}})\right], \qquad (25)$$

where $C(W, P_S)$ in (19) is given explicitly in [2, Problem 7.12(c)] and $G : \mathcal{P}(\{0,1\}) \to \mathbb{R}$ is defined as

$$G(Q) := \frac{2}{3}\min\left\{C(\tilde{W}_{\mathrm{a}}, Q), C(\tilde{W}_{\mathrm{b}}, Q)\right\} + \frac{1}{3}\max\left\{C(\tilde{W}_{\mathrm{a}}, Q), C(\tilde{W}_{\mathrm{b}}, Q)\right\}. \qquad (26)$$

See Appendix E for the derivation of (25). The expression in (25) implies that the capacity consists of two parts: $C(\tilde{W}_{\mathrm{c}}, Q_{\mathrm{b}})$ represents the performance of the system $(\tilde{W}_{\mathrm{c}}, Q_{\mathrm{b}})$ at even blocklengths, while $G(Q_{\mathrm{a}})$ represents the non-ergodic behavior of the channel at odd blocklengths with state distribution $Q_{\mathrm{a}}$; cf. [7, Remark 3.2.3]. In the special case that $C(\tilde{W}_{\mathrm{a}}, Q_{\mathrm{a}}) = C(\tilde{W}_{\mathrm{b}}, Q_{\mathrm{a}})$ (e.g., $\tilde{W}_{\mathrm{a}} = \tilde{W}_{\mathrm{b}}$), the capacity is the average of $G(Q_{\mathrm{a}}) = C(\tilde{W}_{\mathrm{a}}, Q_{\mathrm{a}})$ and $C(\tilde{W}_{\mathrm{c}}, Q_{\mathrm{b}})$.

D. Mixed Channels and Mixed States
Now we use Theorem 1 to compute the capacity of the Gel'fand-Pinsker channel when the channel and state sequence are mixed. More precisely, we assume that

$$W^n(y^n | x^n, s^n) = \sum_{k=1}^\infty \alpha_k\, W_k^n(y^n | x^n, s^n), \qquad (27)$$
$$P_{S^n}(s^n) = \sum_{l=1}^\infty \beta_l\, P_{S_l^n}(s^n). \qquad (28)$$

Note that we require $\sum_{k=1}^\infty \alpha_k = \sum_{l=1}^\infty \beta_l = 1$. Let $\mathcal{K} := \{k \in \mathbb{N} : \alpha_k > 0\}$ and $\mathcal{L} := \{l \in \mathbb{N} : \beta_l > 0\}$. Note that if each $S_l^n$ is a stationary and memoryless source, then $S^n$, the composite source given by (28), is a canonical example of a non-ergodic, stationary source. By (27), the channel $W^n$ can be regarded as an average channel given by the convex combination of $|\mathcal{K}|$ constituent channels. It is stationary but non-ergodic and non-memoryless. Given $P_{X^n,U^n|S^n}$, define the following random variables, indexed by $k$ and $l$:

$$(S_l^n, X_l^n, U_l^n, Y_{kl}^n) \sim P_{S_l^n} \circ P_{X^n,U^n|S^n} \circ W_k^n. \qquad (29)$$

Corollary 5 (Mixed Channels and Mixed States). The capacity of the general mixed Gel'fand-Pinsker channel with general mixed state as in (27)-(29) is

$$C = \sup_{\mathbf{U} - (\mathbf{X},\mathbf{S}) - \mathbf{Y}} \left\{ \inf_{(k,l) \in \mathcal{K} \times \mathcal{L}} \underline{I}(\mathbf{U}_l; \mathbf{Y}_{kl}) - \sup_{l \in \mathcal{L}} \overline{I}(\mathbf{U}_l; \mathbf{S}_l) \right\}, \qquad (30)$$

where the maximization is over all sequences of random variables $(\mathbf{U}, \mathbf{X}, \mathbf{S}, \mathbf{Y}) = \{U^n, X^n, S^n, Y^n\}_{n=1}^\infty$ with state distribution coinciding with $\mathbf{S}$ in (28) and having conditional distribution of $\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S})$ equal to the general channel $\mathbf{W}$ in (27). Furthermore, if each general state sequence $S_l^n$ and each general channel $W_k^n$ is stationary and memoryless, the capacity is lower bounded as

$$C \ge \max_{U - (X,S) - Y} \left\{ \inf_{(k,l) \in \mathcal{K} \times \mathcal{L}} I(U_l; Y_{kl}) - \sup_{l \in \mathcal{L}} I(U_l; S_l) \right\}, \qquad (31)$$

where $(S_l, X_l, U_l, Y_{kl}) \sim P_{S_l} \circ P_{X,U|S} \circ W_k$ and the maximization is over all joint distributions $P_{U,X,S,Y}$ satisfying $P_{U,X,S,Y} = \sum_{k,l} \alpha_k \beta_l\, P_{S_l} P_{X,U|S} W_k$ for some $P_{X,U|S}$.

Corollary 5 is proved in Appendix F; it basically applies [7, Lemma 3.3.2] to the mixture with components in (29). Different from existing analyses for mixed channels and sources [7], [8], here there are two independent mixtures: that of the channel and that of the state. Hence, we need to minimize over two indices for the first term in (30). If, instead of the countable number of terms in the sums in (27) and (28), the number of mixture components (of either the source or the channel) is uncountable, Corollary 5 no longer applies and a corresponding result has to involve the assumptions that the alphabets are finite and the constituent channels are memoryless. See [7, Theorem 3.3.6].

The corollary says that the Gel'fand-Pinsker capacity is governed by two elements: (i) the "worst-case" virtual channel (from $U^n$ to $Y^n$), i.e., the one with the smallest packing rate $\underline{I}(\mathbf{U}_l; \mathbf{Y}_{kl})$, and (ii) the "worst-case" state distribution, i.e., the one that results in the largest covering rate $\overline{I}(\mathbf{U}_l; \mathbf{S}_l)$. Unfortunately, obtaining a converse result for the stationary, memoryless case from (30) does not appear to be straightforward. The same issue was also encountered for the mixed wiretap channel [16].
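The role of the infimum over the channel components in (30) can be visualized through the information spectrum of a mixed channel. The sketch below is a hypothetical illustration of the channel-mixture part only (no state): it samples the normalized information density of a 50/50 mixture of two BSCs under uniform i.i.d. inputs; the spectrum is bimodal and its p-liminf is the smaller mode, consistent with [7, Lemma 3.3.2]. All parameters are our own choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Mixed channel: an entire block passes through BSC(0.05) or BSC(0.25), each
# with probability 1/2.  With uniform i.i.d. inputs, Y^n remains uniform and
# W^n = 0.5*W_a^n + 0.5*W_b^n, so the normalized information density can be
# computed from the number of flips d alone.
n, trials = 2_000, 8_000
eps = np.where(rng.random(trials) < 0.5, 0.05, 0.25)   # component per block
d = rng.binomial(n, eps)                               # flips in each block

def logWn(e, d):
    """log W_e^n(y^n | x^n) for a BSC(e) block with d flips."""
    return d * np.log(e) + (n - d) * np.log(1 - e)

log_mix = np.logaddexp(logWn(0.05, d), logWn(0.25, d)) + np.log(0.5)
i_n = (log_mix + n * np.log(2)) / n                    # density samples, nats/symbol
Hb = lambda e: -e * np.log(e) - (1 - e) * np.log(1 - e)
print("spectrum support ~", np.quantile(i_n, [0.001, 0.999]))
print("modes:", np.log(2) - Hb(0.25), np.log(2) - Hb(0.05))  # p-liminf = smaller mode
```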
E. Proof Idea of Theorem 1

1) Direct part: The high-level idea in the achievability proof is similar to the usual Gel'fand-Pinsker coding scheme [1], which involves a covering step to reduce the uncertainty due to the random state sequence and a packing step to decode the transmitted codeword. However, to use the information spectrum method on the general channel and general state, the definitions of "typicality" have to be restated in terms of information densities. See the definitions in Appendix A. The main burden of the proof is to show that the probability that the transmitted codeword $U^n$ is not "typical" with the channel output $Y^n$ vanishes. In regular Gel'fand-Pinsker coding, one appeals to the conditional typicality lemma [1, Lemma 2] [2, Chapter 2] (which holds for "strongly typical sets") to assert that this error probability is small. But the "typical sets" used in information spectrum analysis do not allow us to apply the conditional typicality lemma in a straightforward manner. For example, our decoder is a threshold test involving the information density statistic $n^{-1}\log(P_{Y^n|U^n}/P_{Y^n})$. It is not clear, in the event that there is no covering error, that the transmitted $U^n$ codeword passes the threshold test (i.e., that $n^{-1}\log(P_{Y^n|U^n}/P_{Y^n})$ exceeds a certain threshold) with high probability.

To get around this problem, we modify Wyner's PBL [9, Lemma 4.3] [22, Lemma A.1] accordingly. Wyner essentially derived an analog of the Markov lemma [2, Chapter 12] without strong typicality by introducing a new "typical set" defined in terms of conditional probabilities. This new definition is particularly useful for problems that involve covering and packing as well as having some Markov structure. (One version of the PBL, given in [22, Lemma A.1], can be stated as follows: if $U - V - W$ form a Markov chain, $(U^n, V^n, W^n) \sim \prod_{i=1}^n P_{U,V}(u_i, v_i) P_{W|V}(w_i|v_i)$, and $\psi_n : \mathcal{U}^n \times \mathcal{W}^n \to [0,1]$ is a function satisfying $\mathbb{E}\,\psi_n(U^n, W^n) \to 0$, then for any given $\varepsilon > 0$ and all sufficiently large $n$ there exists a mapping $g_n : \mathcal{V}^n \to \mathcal{W}^n$ such that (i) $\frac{1}{n}\log\|g_n\| \le I(V;W) + \varepsilon$ and (ii) $\mathbb{E}\,\psi_n(U^n, g_n(V^n)) < \varepsilon$. The function $\psi_n(u^n, w^n)$ is usually taken to be the indicator that $(u^n, w^n)$ are jointly typical.) Our analysis is somewhat similar to the analyses of the general Wyner-Ziv problem in [8] and the WAK problem in [14], [19]. This is unsurprising given that the Wyner-Ziv and Gel'fand-Pinsker problems are duals [20]. However, unlike in [8], we construct random subcodebooks and use them in subsequent steps, rather than asserting the existence of a single codebook via random selection and subsequently regarding it as deterministic. This is because, unlike Wyner-Ziv, we need to construct exponentially many subcodebooks, each of size $\approx \exp(n\overline{I}(\mathbf{U};\mathbf{S}))$ and indexing a message in $[1 : M_n]$. We also require each of these subcodebooks to be different and identifiable based on the channel output. Also, our analogue of Wyner's "typical set" is different from previous works.

We also point out that Yu et al. [17] considered the Gaussian Gel'fand-Pinsker problem for non-stationary and non-ergodic channel and state. However, the notion of typicality used there is "weak typicality", which means that the sample entropy is close to the entropy rate. This notion does not generalize well for obtaining the capacity expression in (14), which involves limits in probability of information densities.
Furthermore, Gaussianity is a crucial hypothesis in the proof of the asymptotic equipartition property in [17].
2) Converse part:
For the converse, we use the Verdú-Han converse [7, Lemma 3.2.2] and the fact that the message is independent of the state sequence. Essentially, we emulate the steps for the converse of the general wiretap channel presented by Bloch and Laneman [16, Lemma 3].
F. Coded State Information at Decoder
We now state the capacity region of the coded state information problem (Definition 5).
Theorem 6 (Coded State Information at the Decoder). The capacity region $\mathcal{C}$ of the Gel'fand-Pinsker problem with coded state information at the decoder (see Definition 6) is given by the set of pairs $(R, R_d)$ satisfying

$$R \le \underline{I}(\mathbf{U};\mathbf{Y}|\mathbf{V}) - \overline{I}(\mathbf{U};\mathbf{S}|\mathbf{V}), \qquad (32)$$
$$R_d \ge \overline{I}(\mathbf{V};\mathbf{S}) - \underline{I}(\mathbf{V};\mathbf{Y}), \qquad (33)$$

for $(\mathbf{U}, \mathbf{V}, \mathbf{X}, \mathbf{S}, \mathbf{Y}) = \{U^n, V^n, X^n, S^n, Y^n\}_{n=1}^\infty$ satisfying $(\mathbf{U}, \mathbf{V}) - (\mathbf{X}, \mathbf{S}) - \mathbf{Y}$, having the state distribution coinciding with $\mathbf{S}$ and having conditional distribution of $\mathbf{Y}$ given $(\mathbf{X}, \mathbf{S})$ equal to the general channel $\mathbf{W}$.

A proof sketch is provided in Appendix G. For the direct part, we combine Wyner-Ziv and Gel'fand-Pinsker coding to obtain the two constraints in Theorem 6. To prove the converse, we exploit the independence of the message and the state, the Verdú-Han lemma [7, Lemma 3.2.2] and the proof technique for the converse of the general rate-distortion problem [7, Section 5.4]. Because the proof of Theorem 6 is very similar to that of Theorem 1, we only provide a sketch. We note that similar ideas can easily be employed to find the general capacity region for the problem of coded state information at the encoder (and full state information at the decoder) [23]. In analogy to Corollary 4, we can use Theorem 6 to recover Steinberg's result [10] for the stationary, memoryless case. See Appendix H for the proof.
Corollary 7 (Coded State Information at the Decoder for Stationary, Memoryless Channels and States). Assume that $\mathcal{S}$ is a finite set. The capacity region of the Gel'fand-Pinsker channel with coded state information at the decoder in the stationary, memoryless case is given by (12) and (13).
IV. CONCLUSION
In this work, we derived the capacity of the general Gel'fand-Pinsker channel with general state distribution using the information spectrum method. We also extended the analysis to the case where coded state information is available at the decoder.
APPENDIX
A. Proof of Theorem 1

Basic definitions: Fix $\gamma_1, \gamma_2 > 0$ and some conditional distribution $P_{X^n,U^n|S^n}$. Define the sets

$$\mathcal{T}_1 := \left\{(u^n, y^n) \in \mathcal{U}^n \times \mathcal{Y}^n : \frac{1}{n}\log\frac{P_{Y^n|U^n}(y^n|u^n)}{P_{Y^n}(y^n)} \ge \underline{I}(\mathbf{U};\mathbf{Y}) - \gamma_1\right\}, \qquad (34)$$

$$\mathcal{T}_2 := \left\{(u^n, s^n) \in \mathcal{U}^n \times \mathcal{S}^n : \frac{1}{n}\log\frac{P_{U^n|S^n}(u^n|s^n)}{P_{U^n}(u^n)} \le \overline{I}(\mathbf{U};\mathbf{S}) + \gamma_2\right\}, \qquad (35)$$

where the random variables $(S^n, X^n, U^n, Y^n) \sim P_{S^n} \circ P_{X^n,U^n|S^n} \circ W^n$. We define the probabilities

$$\pi_1 := \mathbb{P}((U^n, Y^n) \notin \mathcal{T}_1), \qquad (36)$$
$$\pi_2 := \mathbb{P}((U^n, S^n) \notin \mathcal{T}_2), \qquad (37)$$

where $(U^n, Y^n) \sim P_{U^n,Y^n}$ and $(U^n, S^n) \sim P_{U^n,S^n}$, and where these joint distributions are computed with respect to $P_{S^n} \circ P_{X^n,U^n|S^n} \circ W^n$. Note that $\pi_1$ and $\pi_2$ are information spectrum [7] quantities.
Proof: We begin with achievability. We show that the rate

$$R := \underline{I}(\mathbf{U};\mathbf{Y}) - 2\gamma_1 - (\overline{I}(\mathbf{U};\mathbf{S}) + 2\gamma_2) \qquad (38)$$

is achievable. The next lemma provides an upper bound on the error probability in terms of the above quantities.

Lemma 8 (Nonasymptotic upper bound on error probability for Gel'fand-Pinsker). Fix a sequence of conditional distributions $\{P_{X^n,U^n|S^n}\}_{n=1}^\infty$. This specifies $\underline{I}(\mathbf{U};\mathbf{Y})$ and $\overline{I}(\mathbf{U};\mathbf{S})$. For every positive integer $n$, there exists an $(n, \exp(nR), \rho_n)$ code for the general Gel'fand-Pinsker channel, where $R$ is defined in (38) and

$$\rho_n := 2\pi_1^{1/2} + \pi_2 + \exp(-\exp(n\gamma_2)) + \exp(-n\gamma_1). \qquad (39)$$

The proof of Lemma 8 is provided in Appendix B. We note that stronger nonasymptotic bounds for the Gel'fand-Pinsker problem have been developed recently [24]-[26], but Lemma 8 suffices for our purposes because the main theorem is an asymptotic one. The argument is classical and follows the original idea of Wyner's piggyback coding lemma [9, Lemma 4.3] with a few modifications, as discussed in Section III-E1.

Now, for any fixed $\gamma_1, \gamma_2 > 0$, the last two terms in (39) tend to zero. By the definition of the spectral inf-mutual information rate,

$$\pi_1 = \mathbb{P}\left(\frac{1}{n}\log\frac{P_{Y^n|U^n}(Y^n|U^n)}{P_{Y^n}(Y^n)} < \underline{I}(\mathbf{U};\mathbf{Y}) - \gamma_1\right) \qquad (40)$$

goes to zero. By the definition of the spectral sup-mutual information rate,

$$\pi_2 = \mathbb{P}\left(\frac{1}{n}\log\frac{P_{U^n|S^n}(U^n|S^n)}{P_{U^n}(U^n)} > \overline{I}(\mathbf{U};\mathbf{S}) + \gamma_2\right) \qquad (41)$$

also goes to zero. Hence, in view of (39), the error probability vanishes with increasing blocklength. This proves that the rate $R$ in (38) is achievable. Taking $\gamma_1, \gamma_2 \to 0$ and maximizing over all chains $\mathbf{U} - (\mathbf{X},\mathbf{S}) - \mathbf{Y}$ proves the direct part of Theorem 1.

For the converse, we follow the strategy for proving the converse for the general wiretap channel as done by Bloch and Laneman [16, Lemma 3]. Consider a sequence of $(n, M_n, \epsilon_n)$ codes (Definition 3) achieving a rate $R_n = \frac{1}{n}\log M_n$. Let $U^n \in \mathcal{U}^n$ denote an arbitrary random variable representing the uniform choice of a message in $[1 : M_n]$. Because the message is independent of the state (cf. the discussion after Definition 3), this induces the joint distribution $P_{S^n} \circ P_{U^n} \circ P_{X^n|U^n,S^n} \circ W^n$, where $P_{X^n|U^n,S^n}$ models possible stochastic encoding. Clearly, by the independence,

$$\overline{I}(\mathbf{U};\mathbf{S}) = 0. \qquad (42)$$

Let the set of processes $(\mathbf{S}, \mathbf{U}, \mathbf{X}, \mathbf{Y})$ in which each collection of random variables $(S^n, U^n, X^n, Y^n)$ is distributed as $P_{S^n} \circ P_{U^n} \circ P_{X^n|U^n,S^n} \circ W^n$ (resp. $P_{S^n} \circ P_{U^n|S^n} \circ P_{X^n|U^n,S^n} \circ W^n$) be $\mathcal{I}_{\mathbf{W},\mathbf{S}}$ for "independent" (resp. $\mathcal{D}_{\mathbf{W},\mathbf{S}}$ for "dependent"). Fix $\gamma > 0$. The Verdú-Han converse theorem [7, Lemma 3.2.2] states that for any $(n, M_n, \epsilon_n)$ code for the general virtual channel $P_{Y^n|U^n}(y^n|u^n) := \sum_{x^n,s^n} W^n(y^n|x^n,s^n) P_{X^n|U^n,S^n}(x^n|u^n,s^n) P_{S^n}(s^n)$,

$$\epsilon_n \ge \mathbb{P}\left(\frac{1}{n}\log\frac{P_{Y^n|U^n}(Y^n|U^n)}{P_{Y^n}(Y^n)} \le \frac{1}{n}\log M_n - \gamma\right) - \exp(-n\gamma), \qquad (43)$$

where $U^n$ is uniform over the message set $[1 : M_n]$. Suppose now that $M_n = \lceil\exp[n(\underline{I}(\mathbf{U};\mathbf{Y}) + 2\gamma)]\rceil$. Then,

$$\epsilon_n \ge \mathbb{P}\left(\frac{1}{n}\log\frac{P_{Y^n|U^n}(Y^n|U^n)}{P_{Y^n}(Y^n)} \le \underline{I}(\mathbf{U};\mathbf{Y}) + \gamma\right) - \exp(-n\gamma). \qquad (44)$$

By the definition of the spectral inf-mutual information rate, the first term on the right-hand side of (44) converges to $1$. Since $\exp(-n\gamma) \to 0$, $\epsilon_n \to 1$ if $M_n = \lceil\exp[n(\underline{I}(\mathbf{U};\mathbf{Y}) + 2\gamma)]\rceil$.
This means that a necessary condition for the code to have vanishing error probability is for $R = \lim_{n\to\infty} R_n$ to satisfy

$$R \le \underline{I}(\mathbf{U};\mathbf{Y}) + 2\gamma \qquad (45)$$
$$= \underline{I}(\mathbf{U};\mathbf{Y}) - \overline{I}(\mathbf{U};\mathbf{S}) + 2\gamma \qquad (46)$$
$$\le \sup_{(\mathbf{S},\mathbf{U},\mathbf{X},\mathbf{Y}) \in \mathcal{I}_{\mathbf{W},\mathbf{S}}} \left\{\underline{I}(\mathbf{U};\mathbf{Y}) - \overline{I}(\mathbf{U};\mathbf{S})\right\} + 2\gamma \qquad (47)$$
$$\le \sup_{(\mathbf{S},\mathbf{U},\mathbf{X},\mathbf{Y}) \in \mathcal{D}_{\mathbf{W},\mathbf{S}}} \left\{\underline{I}(\mathbf{U};\mathbf{Y}) - \overline{I}(\mathbf{U};\mathbf{S})\right\} + 2\gamma, \qquad (48)$$

where (46) follows from (42), and (48) follows because $\mathcal{I}_{\mathbf{W},\mathbf{S}} \subset \mathcal{D}_{\mathbf{W},\mathbf{S}}$, the set of dependent processes including the independent processes as a special case. Since $\gamma > 0$ is arbitrary, we have proven the upper bound of (14) and this completes the proof of Theorem 1.

B. Proof of Lemma 8

Proof:
Refer to the definitions of the sets $\mathcal{T}_1$ and $\mathcal{T}_2$ and the probabilities $\pi_1$ and $\pi_2$ in Appendix A, in which the random variables $(S^n, X^n, U^n, Y^n) \sim P_{S^n} \circ P_{X^n,U^n|S^n} \circ W^n$. Define the mapping $\eta : \mathcal{U}^n \times \mathcal{S}^n \to \mathbb{R}_+$ as

$$\eta(u^n, s^n) := \sum_{x^n}\, \sum_{y^n : (u^n, y^n) \notin \mathcal{T}_1} W^n(y^n|x^n, s^n)\, P_{X^n|U^n,S^n}(x^n|u^n, s^n), \qquad (49)$$

where $P_{X^n|U^n,S^n}$ is the conditional induced by $P_{X^n,U^n|S^n}$. Analogous to [9, Lemma 4.3], define the set

$$\mathcal{A} := \left\{(u^n, s^n) \in \mathcal{U}^n \times \mathcal{S}^n : \eta(u^n, s^n) \le \pi_1^{1/2}\right\}. \qquad (50)$$

Note the differences between the definition of $\eta$ vis-à-vis that in [9, Lemma 4.3]. In particular, the summand in the definition of $\eta$ in (49) depends on $P_{X^n|U^n,S^n}$ and the channel $W^n$. Now, for brevity, define the "inflated rate"

$$\tilde{R} := \underline{I}(\mathbf{U};\mathbf{Y}) - 2\gamma_1, \qquad (51)$$

so the rate of each subcodebook is $\tilde{R} - R = \overline{I}(\mathbf{U};\mathbf{S}) + 2\gamma_2$.

Random code generation: Randomly and independently generate $\lceil\exp(n\tilde{R})\rceil$ codewords $\{u^n(l) : l \in [1 : \exp(n\tilde{R})]\}$, each drawn from $P_{U^n}$. Denote the set of random codewords and a specific realization as $\mathcal{C}$ and $c$ respectively. Deterministically partition the codewords in $\mathcal{C}$ into $\lceil\exp(nR)\rceil$ subcodebooks $\mathcal{C}(m) = \{u^n(l) : l \in [(m-1)\exp(n(\tilde{R}-R)) + 1 : m\exp(n(\tilde{R}-R))]\}$, where $m \in [1 : \exp(nR)]$. Note that each subcodebook contains $\lceil\exp(n(\overline{I}(\mathbf{U};\mathbf{S}) + 2\gamma_2))\rceil$ codewords. Here is where our coding scheme differs from those for the general Wyner-Ziv problem [8] and the general WAK problem [14], [19]: we randomly generate exponentially many subcodebooks instead of asserting the existence of one by random selection via Wyner's PBL [9, Lemma 4.3]. By retaining the randomness in the $U^n$ codewords, it is easier to bound the probability of the decoding error $\mathcal{E}_3$ defined in (56).

Encoding: The encoder, given $m \in [1 : \exp(nR)]$ and the state sequence $s^n \in \mathcal{S}^n$ (noncausally), finds the sequence $u^n(\hat{l}) \in \mathcal{C}(m)$ with the smallest index $\hat{l}$ satisfying

$$(u^n(\hat{l}), s^n) \in \mathcal{A}. \qquad (52)$$

If no such $\hat{l}$ exists, set $\hat{l} = 1$. Randomly generate a sequence $x^n \sim P_{X^n|U^n,S^n}(\cdot\,|\,u^n(\hat{l}), s^n)$ and transmit it as the channel input (the channel is also fed the state $s^n$). Note that the rate of the code is given by $R$ in (38) since there are $\lceil\exp(nR)\rceil$ subcodebooks, each representing one message.

Decoding: Given $y^n \in \mathcal{Y}^n$, the decoder declares that $\hat{m} \in [1 : \exp(nR)]$ is the message sent if it is the unique message such that

$$(u^n(l), y^n) \in \mathcal{T}_1 \qquad (53)$$

for some $u^n(l) \in \mathcal{C}(\hat{m})$. If there is no such unique $\hat{m}$, declare an error.

Analysis of error probability: Assume that $m = 1$ and let $L$ denote the random index chosen by the encoder. Note that $L = L(S^n)$ is a random function of the random state sequence $S^n$. We will denote the chosen codeword interchangeably by $U^n(L)$ or $F_{\mathcal{C}}(S^n) \in \mathcal{U}^n$. The latter notation makes it clear that the chosen codeword is a random function of the state. The randomness of $F_{\mathcal{C}}(S^n)$ comes not only from the random state $S^n$ but also from the random codewords in $\mathcal{C}$. There are three sources of error:

$$\mathcal{E}_1 := \{\forall\, U^n(l) \in \mathcal{C}(1) : (U^n(l), S^n) \notin \mathcal{A}\} \qquad (54)$$
$$\mathcal{E}_2 := \{(U^n(L), Y^n) \notin \mathcal{T}_1\} \qquad (55)$$
$$\mathcal{E}_3 := \{\exists\, U^n(\tilde{l}) \notin \mathcal{C}(1) : (U^n(\tilde{l}), Y^n) \in \mathcal{T}_1\} \qquad (56)$$

In terms of $\mathcal{E}_1$, $\mathcal{E}_2$ and $\mathcal{E}_3$, the probability of error $\mathbb{P}(\mathcal{E})$ defined in (8) can be bounded as

$$\mathbb{P}(\mathcal{E}) \le \mathbb{P}(\mathcal{E}_1) + \mathbb{P}(\mathcal{E}_2 \cap \mathcal{E}_1^c) + \mathbb{P}(\mathcal{E}_3). \qquad (57)$$

We bound each of the probabilities above. First consider $\mathbb{P}(\mathcal{E}_1)$:

$$\mathbb{P}(\mathcal{E}_1) = \mathbb{P}(\forall\, U^n(l) \in \mathcal{C}(1) : (U^n(l), S^n) \notin \mathcal{A}) \qquad (58)$$
$$= \sum_{s^n} P_{S^n}(s^n)\Bigg[\sum_{u^n : (u^n, s^n) \notin \mathcal{A}} P_{U^n}(u^n)\Bigg]^{|\mathcal{C}(1)|}, \qquad (59)$$

where (59) holds because the codewords $U^n(l)$ are generated independently of each other and are independent of the state sequence $S^n$. We now upper bound (59) as follows:

$$\mathbb{P}(\mathcal{E}_1) \le \sum_{s^n} P_{S^n}(s^n)\Bigg[1 - \sum_{u^n} P_{U^n}(u^n)\,\chi(u^n, s^n)\Bigg]^{|\mathcal{C}(1)|}, \qquad (60)$$

where $\chi(u^n, s^n)$ is the indicator of the set $\mathcal{A} \cap \mathcal{T}_2$, i.e.,

$$\chi(u^n, s^n) := \mathbb{1}\{(u^n, s^n) \in \mathcal{A} \cap \mathcal{T}_2\}. \qquad (61)$$

Clearly, by using the definition of $\mathcal{T}_2$ in (35), we have that for all $(u^n, s^n) \in \mathcal{A} \cap \mathcal{T}_2$,

$$P_{U^n}(u^n) \ge P_{U^n|S^n}(u^n|s^n)\exp(-n(\overline{I}(\mathbf{U};\mathbf{S}) + \gamma_2)). \qquad (62)$$

Thus, substituting the bound in (62) into (60), we have

$$\mathbb{P}(\mathcal{E}_1) \le \sum_{s^n} P_{S^n}(s^n)\Bigg[1 - \exp(-n(\overline{I}(\mathbf{U};\mathbf{S}) + \gamma_2))\sum_{u^n} P_{U^n|S^n}(u^n|s^n)\,\chi(u^n, s^n)\Bigg]^{|\mathcal{C}(1)|} \qquad (63)$$
$$\le \sum_{s^n} P_{S^n}(s^n)\Bigg[1 - \sum_{u^n} P_{U^n|S^n}(u^n|s^n)\,\chi(u^n, s^n) + \exp\big[-\exp(-n(\overline{I}(\mathbf{U};\mathbf{S}) + \gamma_2))\,|\mathcal{C}(1)|\big]\Bigg], \qquad (64)$$

where (64) comes from the inequality $(1 - xy)^k \le 1 - x + \exp(-yk)$. Recall that the size of the subcodebook $\mathcal{C}(1)$ is $|\mathcal{C}(1)| = \lceil\exp(n(\overline{I}(\mathbf{U};\mathbf{S}) + 2\gamma_2))\rceil$.
Thus,

$$\mathbb{P}(\mathcal{E}_1) \le \sum_{s^n} P_{S^n}(s^n)\Bigg[1 - \sum_{u^n} P_{U^n|S^n}(u^n|s^n)\,\chi(u^n, s^n) + \exp(-\exp(n\gamma_2))\Bigg] \qquad (65)$$
$$= \mathbb{P}((U^n, S^n) \in \mathcal{A}^c \cup \mathcal{T}_2^c) + \exp(-\exp(n\gamma_2)) \qquad (66)$$
$$\le \mathbb{P}((U^n, S^n) \in \mathcal{A}^c) + \mathbb{P}((U^n, S^n) \in \mathcal{T}_2^c) + \exp(-\exp(n\gamma_2)) \qquad (67)$$
$$= \mathbb{P}((U^n, S^n) \in \mathcal{A}^c) + \pi_2 + \exp(-\exp(n\gamma_2)), \qquad (68)$$

where (66) follows from the definition of $\chi(u^n, s^n)$ in (61), and (68) follows from the definition of $\pi_2$ in (37). We now bound the first term in (68). We have

$$\mathbb{P}((U^n, S^n) \in \mathcal{A}^c) = \mathbb{P}\big(\eta(U^n, S^n) > \pi_1^{1/2}\big) \qquad (69)$$
$$\le \pi_1^{-1/2}\,\mathbb{E}[\eta(U^n, S^n)] \qquad (70)$$
$$= \pi_1^{-1/2}\sum_{u^n, s^n} P_{U^n,S^n}(u^n, s^n)\,\eta(u^n, s^n) \qquad (71)$$
$$= \pi_1^{-1/2}\sum_{u^n, s^n} P_{U^n,S^n}(u^n, s^n)\sum_{x^n}\,\sum_{y^n : (u^n, y^n) \notin \mathcal{T}_1} W^n(y^n|x^n, s^n)\,P_{X^n|U^n,S^n}(x^n|u^n, s^n) \qquad (72)$$
$$= \pi_1^{-1/2}\sum_{u^n, s^n, x^n, y^n : (u^n, y^n) \notin \mathcal{T}_1} P_{U^n,S^n,X^n,Y^n}(u^n, s^n, x^n, y^n) \qquad (73)$$
$$= \pi_1^{-1/2}\,\mathbb{P}((U^n, Y^n) \notin \mathcal{T}_1) \qquad (74)$$
$$\le \pi_1^{1/2}, \qquad (75)$$

where (70) is by Markov's inequality and (72) is due to the definition of $\eta(u^n, s^n)$ in (49). Equality (73) follows by the Markov chain $U^n - (X^n, S^n) - Y^n$ and (75) is by the definition of $\pi_1$ in (36). Combining the bounds in (68) and (75) yields

$$\mathbb{P}(\mathcal{E}_1) \le \pi_1^{1/2} + \pi_2 + \exp(-\exp(n\gamma_2)). \qquad (76)$$

Now we bound $\mathbb{P}(\mathcal{E}_2 \cap \mathcal{E}_1^c)$. Recall that the mapping from the noncausal state sequence $S^n$ to the chosen codeword $U^n(L)$ is denoted as $F_{\mathcal{C}}(S^n)$. Define the event

$$\mathcal{F} := \{(F_{\mathcal{C}}(S^n), S^n) \in \mathcal{A}\}. \qquad (77)$$

Then, $\mathbb{P}(\mathcal{E}_2 \cap \mathcal{E}_1^c)$ can be bounded as

$$\mathbb{P}(\mathcal{E}_2 \cap \mathcal{E}_1^c) = \mathbb{P}(\mathcal{E}_2 \cap \mathcal{E}_1^c \cap \mathcal{F}) + \mathbb{P}(\mathcal{E}_2 \cap \mathcal{E}_1^c \cap \mathcal{F}^c) \qquad (78)$$
$$\le \mathbb{P}(\mathcal{E}_2 \cap \mathcal{F}) + \mathbb{P}(\mathcal{E}_1^c \cap \mathcal{F}^c). \qquad (79)$$

The second term in (79) is identically zero because, given that $\mathcal{E}_1^c$ occurs, we can successfully find a $u^n \in \mathcal{C}(1)$ such that $(u^n, s^n) \in \mathcal{A}$; hence $\mathcal{E}_1^c \cap \mathcal{F}^c = \emptyset$ (refer to the encoding step in (52)). So we consider the first term in (79). Let $\mathbb{E}_{\mathcal{C}}[\cdot]$ be the expectation over the random codebook $\mathcal{C}$, i.e., $\mathbb{E}_{\mathcal{C}}[\phi(\mathcal{C})] = \sum_c \mathbb{P}(\mathcal{C} = c)\,\phi(c)$, where $c$ runs over all possible sets of sequences $\{u^n(l) \in \mathcal{U}^n : l \in [1 : \exp(n\tilde{R})]\}$. Consider,

$$\mathbb{P}(\{(F_{\mathcal{C}}(S^n), Y^n) \notin \mathcal{T}_1\} \cap \{(F_{\mathcal{C}}(S^n), S^n) \in \mathcal{A}\})$$
$$= \mathbb{E}_{\mathcal{C}}\Bigg[\sum_{\substack{(s^n, y^n) :\, (F_{\mathcal{C}}(s^n), y^n) \notin \mathcal{T}_1 \\ (F_{\mathcal{C}}(s^n), s^n) \in \mathcal{A}}} P_{S^n}(s^n)\sum_{x^n} W^n(y^n|x^n, s^n)\,P_{X^n|U^n,S^n}(x^n|F_{\mathcal{C}}(s^n), s^n)\Bigg] \qquad (80)$$
$$= \mathbb{E}_{\mathcal{C}}\Bigg[\sum_{s^n :\, (F_{\mathcal{C}}(s^n), s^n) \in \mathcal{A}} P_{S^n}(s^n)\Bigg(\sum_{x^n}\,\sum_{y^n :\, (F_{\mathcal{C}}(s^n), y^n) \notin \mathcal{T}_1} W^n(y^n|x^n, s^n)\,P_{X^n|U^n,S^n}(x^n|F_{\mathcal{C}}(s^n), s^n)\Bigg)\Bigg]. \qquad (81)$$

Equality (80) can be explained as follows. Conditioned on $\{\mathcal{C} = c\}$ for some (deterministic) codebook $c$, the mapping $F_c : \mathcal{S}^n \to \mathcal{U}^n$ is deterministic (cf. the encoding step in (52)) and thus $\{U^n = F_c(s^n)\}$ holds if $\{S^n = s^n\}$ holds. Therefore, the conditional distribution of $Y^n$ given $\{S^n = s^n\}$, and hence also $\{U^n = F_c(s^n)\}$, is $\sum_{x^n} W^n(\cdot\,|\,x^n, s^n)\,P_{X^n|U^n,S^n}(x^n|F_c(s^n), s^n)$ for fixed $c$. The step (80) differs subtly from the proofs of the general Wyner-Ziv problem [8] and the general lossless source coding with coded side information problem [14], [19]. In [8], [14] and [19], the equivalent of $Y^n$ depends only implicitly on the auxiliary $U^n$ through another variable, say $\tilde{X}^n$ (i.e., $Y^n - \tilde{X}^n - U^n$ forms a Markov chain). In the Gel'fand-Pinsker problem, $Y^n$ depends on $X^n$ and $S^n$, the former being a (stochastic) function of both $U^n$ and $S^n$.
Thus, given $\{S^n = s^n\}$, $Y^n$ also depends on the state $S^n$ and the auxiliary $U^n$ through the codebook $\mathcal{C}$ and the covering procedure specified by $F_{\mathcal{C}}$. Now, using the definitions of $\eta(u^n, s^n)$ and $\mathcal{A}$ in (49) and (50) respectively, we can bound (81) as follows:

$$\mathbb{P}(\{(F_{\mathcal{C}}(S^n), Y^n) \notin \mathcal{T}_1\} \cap \{(F_{\mathcal{C}}(S^n), S^n) \in \mathcal{A}\})$$
$$= \mathbb{E}_{\mathcal{C}}\sum_{s^n :\, (F_{\mathcal{C}}(s^n), s^n) \in \mathcal{A}} P_{S^n}(s^n)\,\eta(F_{\mathcal{C}}(s^n), s^n) \qquad (82)$$
$$\le \mathbb{E}_{\mathcal{C}}\sum_{s^n :\, (F_{\mathcal{C}}(s^n), s^n) \in \mathcal{A}} P_{S^n}(s^n)\,\pi_1^{1/2} \qquad (83)$$
$$\le \pi_1^{1/2}. \qquad (84)$$

Uniting (79) and (84) yields

$$\mathbb{P}(\mathcal{E}_2 \cap \mathcal{E}_1^c) \le \pi_1^{1/2}. \qquad (85)$$

(In the Wyner-Ziv problem [8], $\tilde{X}^n$ is the source to be reconstructed to within some distortion with the help of $Y^n$, and in the WAK problem [14], [19], $Y^n$ is the source to be almost-losslessly transmitted with the help of a coded version of $\tilde{X}^n$.)

Finally, we consider the probability $\mathbb{P}(\mathcal{E}_3)$:

$$\mathbb{P}(\mathcal{E}_3) = \mathbb{P}\big((U^n(\tilde{l}), Y^n) \in \mathcal{T}_1 \text{ for some } U^n(\tilde{l}) \notin \mathcal{C}(1)\big) \qquad (86)$$
$$\le \sum_{\tilde{l} = \lceil\exp(n(\tilde{R}-R))\rceil + 1}^{\lceil\exp(n\tilde{R})\rceil} \mathbb{P}\big((U^n(\tilde{l}), Y^n) \in \mathcal{T}_1\big), \qquad (87)$$

where (87) follows from the union bound and the fact that the indices of the confounding codewords belong to the set $[\exp(n(\tilde{R}-R)) + 1 : \exp(n\tilde{R})]$. Now we upper bound the probability in (87). Note that if $\tilde{l} \in [\exp(n(\tilde{R}-R)) + 1 : \exp(n\tilde{R})]$, then $U^n(\tilde{l})$ is independent of $Y^n$. Thus,

$$\mathbb{P}\big((U^n(\tilde{l}), Y^n) \in \mathcal{T}_1\big) = \sum_{(u^n, y^n) \in \mathcal{T}_1} P_{U^n}(u^n)\,P_{Y^n}(y^n) \qquad (88)$$
$$\le \sum_{(u^n, y^n) \in \mathcal{T}_1} P_{U^n}(u^n)\,P_{Y^n|U^n}(y^n|u^n)\exp(-n(\underline{I}(\mathbf{U};\mathbf{Y}) - \gamma_1)) \qquad (89)$$
$$\le \exp(-n(\underline{I}(\mathbf{U};\mathbf{Y}) - \gamma_1)), \qquad (90)$$

where (89) follows from the definition of $\mathcal{T}_1$ in (34). Now, substituting (90) into (87) yields

$$\mathbb{P}(\mathcal{E}_3) \le \exp(n\tilde{R})\exp(-n(\underline{I}(\mathbf{U};\mathbf{Y}) - \gamma_1)) \qquad (91)$$
$$= \exp(n(\underline{I}(\mathbf{U};\mathbf{Y}) - 2\gamma_1))\exp(-n(\underline{I}(\mathbf{U};\mathbf{Y}) - \gamma_1)) \qquad (92)$$
$$= \exp(-n\gamma_1), \qquad (93)$$

where (92) follows from the definition of $\tilde{R}$ in (51). Uniting (57), (76), (85) and (93) shows that $\mathbb{P}(\mathcal{E}) \le \rho_n$, where $\rho_n$ is defined in (39). By the selection lemma, there exists a (deterministic) code whose average error probability is no larger than $\rho_n$.

C. Proof of Corollary 2

Proof:
In this problem, where the state information is available at the encoder and the decoder, we can regard the output $\mathbf{Y}$ of the Gel'fand-Pinsker problem as $(\mathbf{Y}, \mathbf{S})$. For achievability, we lower bound the generalization of the Cover-Chiang [20] result in (16) with $\mathbf{S}_d = \mathbf{S}_e = \mathbf{S}$ as follows:

$$C_{\mathrm{CC}} = \sup_{\mathbf{U} - (\mathbf{X},\mathbf{S}) - \mathbf{Y}} \underline{I}(\mathbf{U};\mathbf{Y},\mathbf{S}) - \overline{I}(\mathbf{U};\mathbf{S}) \qquad (94)$$
$$\ge \underline{I}(\mathbf{X}^*;\mathbf{Y}^*,\mathbf{S}) - \overline{I}(\mathbf{X}^*;\mathbf{S}) \qquad (95)$$
$$= \underline{I}(\mathbf{X}^*;\mathbf{Y}^*,\mathbf{S}) - \underline{I}(\mathbf{X}^*;\mathbf{S}) \qquad (96)$$
$$\ge \underline{I}(\mathbf{X}^*;\mathbf{Y}^*|\mathbf{S}), \qquad (97)$$

where (95) follows because the choice $(\{P^*_{X^n|S^n}\}_{n=1}^\infty, \mathbf{U} = \mathbf{X}^*)$ belongs to the constraint set in (94), (96) uses the assumption in (18), and (97) follows from the basic information spectrum inequality $\operatorname{p-liminf}_{n\to\infty}(A_n + B_n) \le \operatorname{p-liminf}_{n\to\infty} A_n + \operatorname{p-limsup}_{n\to\infty} B_n$ [7, pp. 15]. This shows that $C_{\mathrm{ED}} = \underline{I}(\mathbf{X}^*;\mathbf{Y}^*|\mathbf{S})$ is an achievable rate.

For the converse, we upper bound (94) as follows:

$$C_{\mathrm{CC}} \le \sup_{\mathbf{U} - (\mathbf{X},\mathbf{S}) - \mathbf{Y}} \underline{I}(\mathbf{U};\mathbf{Y}|\mathbf{S}) \qquad (98)$$
$$\le \sup_{\mathbf{U} - (\mathbf{X},\mathbf{S}) - \mathbf{Y}} \underline{I}(\mathbf{X};\mathbf{Y}|\mathbf{S}) \qquad (99)$$
$$= \sup_{\mathbf{X}} \underline{I}(\mathbf{X};\mathbf{Y}|\mathbf{S}), \qquad (100)$$

where (98) follows from the superadditivity of $\operatorname{p-liminf}$, (99) follows from the $\operatorname{p-liminf}$ version of the (conditional) data processing inequality [6, Theorem 9], and (100) follows because there is no $\mathbf{U}$ in the objective function of (99), so dropping the auxiliary process does not violate optimality in (99). Since (100) implies that all achievable rates are bounded above by $C_{\mathrm{ED}}$, this proves the converse.

D. Proof of Corollary 3

Proof:
For achievability, note that since the channel and state are memoryless, we may lower bound (14) by replacing the constraint set with the set of conditional distributions of product form, i.e., $P_{X^n,U^n|S^n} = \prod_{i=1}^n P_{X_i,U_i|S_i}$ for some $\{P_{X_i,U_i|S_i} : \mathcal{S} \to \mathcal{X} \times \mathcal{U}\}_{i=1}^n$. Let $\mathcal{M}_{\mathbf{W},\mathbf{S}}$ (which stands for "memoryless") be the set of all $(\mathbf{U}, \mathbf{X}, \mathbf{S}, \mathbf{Y})$ such that $P_{X^n,U^n|S^n} = \prod_{i=1}^n P_{X_i,U_i|S_i}$ for some $\{P_{X_i,U_i|S_i}\}_{i=1}^n$, the channels coincide with $\{\prod_{i=1}^n W_i\}_{n=1}^\infty$ and the states coincide with $\{\prod_{i=1}^n P_{S_i}\}_{n=1}^\infty$. Let $(S_i, U_i^*, X_i^*, Y_i^*)$ be distributed according to $P_{S_i} \circ P^*_{X_i,U_i|S_i} \circ W_i$, the optimal distribution in (19). Consider,

$$C \ge \sup_{(\mathbf{U},\mathbf{X},\mathbf{S},\mathbf{Y}) \in \mathcal{M}_{\mathbf{W},\mathbf{S}}} \left\{\operatorname{p-liminf}_{n\to\infty} \frac{1}{n}\log\frac{P_{Y^n|U^n}(Y^n|U^n)}{P_{Y^n}(Y^n)} - \operatorname{p-limsup}_{n\to\infty} \frac{1}{n}\log\frac{P_{U^n|S^n}(U^n|S^n)}{P_{U^n}(U^n)}\right\} \qquad (101)$$
$$\ge \operatorname{p-liminf}_{n\to\infty} \frac{1}{n}\log\frac{P_{(Y^n)^*|(U^n)^*}((Y^n)^*|(U^n)^*)}{P_{(Y^n)^*}((Y^n)^*)} - \operatorname{p-limsup}_{n\to\infty} \frac{1}{n}\log\frac{P_{(U^n)^*|(S^n)^*}((U^n)^*|(S^n)^*)}{P_{(U^n)^*}((U^n)^*)} \qquad (102)$$
$$= \liminf_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\log\frac{P_{Y_i^*|U_i^*}(Y_i^*|U_i^*)}{P_{Y_i^*}(Y_i^*)}\right] - \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \mathbb{E}\left[\log\frac{P_{U_i^*|S_i}(U_i^*|S_i)}{P_{U_i^*}(U_i^*)}\right] \qquad (103)$$
$$= \liminf_{n\to\infty} \frac{1}{n}\sum_{i=1}^n I(U_i^*; Y_i^*) - \limsup_{n\to\infty} \frac{1}{n}\sum_{i=1}^n I(U_i^*; S_i), \qquad (104)$$

where (102) follows by substituting the optimal $(S_i, U_i^*, X_i^*, Y_i^*)$ into (101). Equality (103) follows from Chebyshev's inequality, where we used the memoryless assumption and the assumption that the alphabets are finite, so the variances of the information densities are uniformly bounded [7, Remark 3.1.1]. Essentially, the limit inferior (resp. superior) in probability of the normalized information densities becomes the regular limit inferior (resp. superior) of averages of mutual informations under the memoryless assumption. Now, assume that the first limit in (20) exists. Then the $\limsup$ in (104) is in fact a limit and we have

$$C \ge \liminf_{n\to\infty} \frac{1}{n}\sum_{i=1}^n I(U_i^*; Y_i^*) - \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^n I(U_i^*; S_i) \qquad (105)$$
$$= \liminf_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \left[I(U_i^*; Y_i^*) - I(U_i^*; S_i)\right], \qquad (106)$$

where (106) follows from the fact that $\liminf_{n\to\infty}(a_n + b_n) = \liminf_{n\to\infty} a_n + \lim_{n\to\infty} b_n$ if $b_n$ converges. If instead the second limit in (20) exists, the argument from (104) to (106) proceeds along exactly the same lines. The proof of the direct part is completed by invoking the definition of $C(W_i, P_{S_i})$ in (19).

For the converse, we only assume that $|\mathcal{S}| < \infty$. Let $(\hat{S}^n, \hat{U}^n, \hat{X}^n, \hat{Y}^n)$ be dummy variables distributed as $P_{S^n} \circ P_{U^n|S^n} \circ P_{X^n|U^n,S^n} \circ W^n$. Note that $P_{S^n}$ and $W^n$ are assumed to be memoryless. From [7, Theorem 3.5.2],

$$\underline{I}(\hat{\mathbf{U}}; \hat{\mathbf{Y}}) \le \liminf_{n\to\infty} \frac{1}{n} I(\hat{U}^n; \hat{Y}^n) \qquad (107)$$

and, if $|\mathcal{S}| < \infty$, we also have

$$\overline{I}(\hat{\mathbf{U}}; \hat{\mathbf{S}}) \ge \limsup_{n\to\infty} \frac{1}{n} I(\hat{U}^n; \hat{S}^n). \qquad (108)$$

Now we can upper bound the objective function in (14) as follows:

$$\underline{I}(\hat{\mathbf{U}}; \hat{\mathbf{Y}}) - \overline{I}(\hat{\mathbf{U}}; \hat{\mathbf{S}}) \le \liminf_{n\to\infty} \frac{1}{n} I(\hat{U}^n; \hat{Y}^n) + \liminf_{n\to\infty}\left(-\frac{1}{n} I(\hat{U}^n; \hat{S}^n)\right) \qquad (109)$$
$$\le \liminf_{n\to\infty} \frac{1}{n}\left[I(\hat{U}^n; \hat{Y}^n) - I(\hat{U}^n; \hat{S}^n)\right], \qquad (110)$$

where (110) follows from the superadditivity of the limit inferior. Hence, it suffices to single-letterize the expression in $[\cdot]$ in (110).
To start, consider

$$I(\hat{U}^n; \hat{Y}^n) - I(\hat{U}^n; \hat{S}^n) = \sum_{i=1}^n I(\hat{U}^n; \hat{Y}_i | \hat{Y}^{i-1}, \hat{S}_{i+1}^n) - I(\hat{U}^n; \hat{S}_i | \hat{Y}^{i-1}, \hat{S}_{i+1}^n) \qquad (111)$$
$$= \sum_{i=1}^n I(\hat{U}^n, \hat{Y}^{i-1}, \hat{S}_{i+1}^n; \hat{Y}_i) - I(\hat{Y}_i; \hat{Y}^{i-1}, \hat{S}_{i+1}^n) - I(\hat{U}^n, \hat{Y}^{i-1}, \hat{S}_{i+1}^n; \hat{S}_i) + I(\hat{S}_i; \hat{Y}^{i-1}, \hat{S}_{i+1}^n), \qquad (112)$$

where (111) follows from the key identity in Csiszár and Körner [4, Lemma 17.12] and (112) follows from the chain rule for mutual information. Now, we relate the (sum of the) second term to the (sum of the) fourth term:

$$\sum_{i=1}^n I(\hat{Y}_i; \hat{Y}^{i-1}, \hat{S}_{i+1}^n) \ge \sum_{i=1}^n I(\hat{Y}_i; \hat{S}_{i+1}^n | \hat{Y}^{i-1}) \qquad (113)$$
$$= \sum_{i=1}^n I(\hat{S}_i; \hat{Y}^{i-1} | \hat{S}_{i+1}^n) \qquad (114)$$
$$= \sum_{i=1}^n I(\hat{S}_i; \hat{Y}^{i-1}, \hat{S}_{i+1}^n), \qquad (115)$$

where (114) follows from the Csiszár-sum-identity [2, Chapter 2] and (115) follows because $\hat{S}^n = (\hat{S}_1, \ldots, \hat{S}_n)$ is a memoryless process. Putting together (110), (112) and (115), we have the upper bound

$$C \le \liminf_{n\to\infty} \frac{1}{n}\sum_{i=1}^n I(\hat{U}^n, \hat{Y}^{i-1}, \hat{S}_{i+1}^n; \hat{Y}_i) - I(\hat{U}^n, \hat{Y}^{i-1}, \hat{S}_{i+1}^n; \hat{S}_i). \qquad (116)$$

Note that (116) holds for all $P_{X^n,U^n|S^n}$ of product form (i.e., $(\hat{\mathbf{U}}, \hat{\mathbf{X}}, \hat{\mathbf{S}}, \hat{\mathbf{Y}}) \in \mathcal{M}_{\mathbf{W},\mathbf{S}}$) because the state and channel are memoryless. Now define $U_i := (\hat{U}^n, \hat{Y}^{i-1}, \hat{S}_{i+1}^n)$, $Y_i := \hat{Y}_i$, $X_i := \hat{X}_i$ and $S_i := \hat{S}_i$. Clearly, the Markov chain $U_i - (X_i, S_i) - Y_i$ is satisfied for all $i \in [1:n]$. Substituting these identifications into (116) yields

$$C \le \liminf_{n\to\infty} \frac{1}{n}\sum_{i=1}^n \max_{U_i - (X_i,S_i) - Y_i} I(U_i; Y_i) - I(U_i; S_i), \qquad (117)$$

which, upon invoking the definition of $C(W_i, P_{S_i})$ in (19), completes the converse proof of Corollary 3.

We remark that in the single-letterization procedure for the converse, it seems as if we did not use the assumption that the transmitted message is independent of the state. This is not true, because in the converse proof of Theorem 1 we did in fact use this key assumption. See (42) and (46), where the message is represented by $\mathbf{U}$.

E. Verification of (25) in Example 2
E. Verification of (25) in Example 2

Let $E_n := E\cap[1:n]$ and $O_n := O\cap[1:n]$ be, respectively, the sets of even and odd integers up to some $n\in\mathbb{N}$. To verify (25), we use the result in (21) and the definitions of the channel and state in (23) and (24) respectively. We first split the sum into odd and even parts as follows:
\begin{equation}
C_{\mathrm{M'less}} = \frac{1}{2}\liminf_{n\to\infty}\bigg[ \frac{2}{n}\sum_{i\in O_n} C(W_i,P_{S_i}) + \frac{2}{n}\sum_{i\in E_n} C(W_i,P_{S_i}) \bigg]. \tag{118}
\end{equation}
For the even part, each of the summands is $C(\tilde{W}_c, Q_b)$, so the sequence $b_n := \frac{2}{n}\sum_{i\in E_n} C(W_i,P_{S_i})$ converges and its limit is $C(\tilde{W}_c,Q_b)$. Let $a_n := \frac{2}{n}\sum_{i\in O_n} C(W_i,P_{S_i})$ be the odd part in (118). A basic fact in analysis states that $\liminf_{n\to\infty}(a_n+b_n) = \liminf_{n\to\infty} a_n + \lim_{n\to\infty} b_n$ if $b_n$ has a limit. Hence, the above $\liminf$ is
\begin{align}
C_{\mathrm{M'less}} &= \frac{1}{2}\liminf_{n\to\infty}\bigg[ \frac{2}{n}\sum_{i\in O_n} C(W_i,P_{S_i}) \bigg] + \frac{1}{2} C(\tilde{W}_c,Q_b) \tag{119}\\
&= \frac{1}{2}\liminf_{n\to\infty}\bigg[ \frac{2}{n}|O_n\cap J|\, C(\tilde{W}_a,Q_a) + \frac{2}{n}|O_n\cap J^c|\, C(\tilde{W}_b,Q_a) \bigg] + \frac{1}{2} C(\tilde{W}_c,Q_b). \tag{120}
\end{align}
It can be verified that $\liminf_{n\to\infty}\frac{2}{n}|O_n\cap J| = \frac{1}{3}$ and $\limsup_{n\to\infty}\frac{2}{n}|O_n\cap J| = \frac{2}{3}$. Thus, the $\liminf$ in (120) is
\begin{equation}
\liminf_{n\to\infty} a_n = \frac{2}{3}\min\big\{ C(\tilde{W}_a,Q_a),\, C(\tilde{W}_b,Q_a) \big\} + \frac{1}{3}\max\big\{ C(\tilde{W}_a,Q_a),\, C(\tilde{W}_b,Q_a) \big\}, \tag{121}
\end{equation}
which, by definition, is equal to $G(Q_a)$ in (26). This completes the verification of (25).
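The oscillation of $\frac{2}{n}|O_n\cap J|$ between $\frac{1}{3}$ and $\frac{2}{3}$ is easy to reproduce numerically. In the sketch below, $J$ is taken to be the union of dyadic blocks $\bigcup_{k\ge 0}[2^{2k}, 2^{2k+1})$, a stand-in with the required oscillating density (the actual $J$ of Example 2 is defined earlier in the paper), and the values of $C(\tilde{W}_a,Q_a)$ and $C(\tilde{W}_b,Q_a)$ are invented placeholders used only to check the combination in (121).

```python
import numpy as np

N = 2 ** 22
i = np.arange(1, N + 1)
# Stand-in J (assumption): union of dyadic blocks [2^{2k}, 2^{2k+1}).
in_J = (np.floor(np.log2(i)).astype(int) % 2) == 0
odd = (i % 2) == 1

lam = 2.0 * np.cumsum(odd & in_J) / i    # (2/n) |O_n ∩ J|
mu = 2.0 * np.cumsum(odd & ~in_J) / i    # (2/n) |O_n ∩ J^c|
tail = slice(N // 4, None)               # discard small-n transients
print(lam[tail].min(), lam[tail].max())  # ≈ 1/3 and ≈ 2/3

Ca, Cb = 0.25, 0.55                      # placeholder capacities (nats)
a = lam * Ca + mu * Cb                   # the sequence a_n from (120)
print(a[tail].min())                     # ≈ (2/3)*min + (1/3)*max = 0.35
```

Since $a_n$ is linear in the oscillating weight, its limit inferior is attained at an endpoint of the oscillation, which is exactly how (121) arises.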
F. Proof of Corollary 5

Proof: Fix $P_{X^n,U^n|S^n}$. The key observation is that the joint distribution of $(S^n,U^n,X^n,Y^n)$ can be written as a convex combination of the distributions in (29):
\begin{align}
P_{S^n,U^n,X^n,Y^n}(s^n,u^n,x^n,y^n) &= \bigg[\sum_{l=1}^\infty \beta_l P_{S_l^n}(s^n)\bigg] P_{X^n,U^n|S^n}(u^n,x^n|s^n) \bigg[\sum_{k=1}^\infty \alpha_k W_k^n(y^n|x^n,s^n)\bigg] \tag{122}\\
&= \sum_{k=1}^\infty \sum_{l=1}^\infty \alpha_k\beta_l \Big\{ P_{S_l^n}(s^n)\, P_{X^n,U^n|S^n}(u^n,x^n|s^n)\, W_k^n(y^n|x^n,s^n) \Big\} \tag{123}\\
&= \sum_{k=1}^\infty \sum_{l=1}^\infty \alpha_k\beta_l\, P_{S_l^n,U_l^n,X_l^n,Y_{kl}^n}(s^n,u^n,x^n,y^n), \tag{124}
\end{align}
where (124) follows from the definition of $P_{S_l^n,U_l^n,X_l^n,Y_{kl}^n}$ in (29). By Tonelli's theorem, the marginals of $(U^n,Y^n)$ and $(U^n,S^n)$ are given respectively by
\begin{align}
P_{U^n,Y^n}(u^n,y^n) &= \sum_{k=1}^\infty \sum_{l=1}^\infty \alpha_k\beta_l\, P_{U_l^n,Y_{kl}^n}(u^n,y^n), \tag{125}\\
P_{U^n,S^n}(u^n,s^n) &= \sum_{l=1}^\infty \beta_l\, P_{U_l^n,S_l^n}(u^n,s^n), \tag{126}
\end{align}
where in (126) we used the fact that $\sum_{k=1}^\infty \alpha_k = 1$. The number of terms in the sums in (125) and (126) is countable. By Lemma 3.3.2 in [7] (the spectral inf-mutual information rate of a mixed source is the infimum of the constituent spectral inf-mutual information rates of those processes with positive weights), we know that
\begin{equation}
\underline{I}(\mathbf{U};\mathbf{Y}) = \inf_{(k,l)\in\mathcal{K}\times\mathcal{L}} \underline{I}(\mathbf{U}_l;\mathbf{Y}_{kl}), \tag{127}
\end{equation}
where we recall that $\mathcal{K} = \{k\in\mathbb{N} : \alpha_k > 0\}$ and $\mathcal{L} = \{l\in\mathbb{N} : \beta_l > 0\}$. Analogously,
\begin{equation}
\overline{I}(\mathbf{U};\mathbf{S}) = \sup_{l\in\mathcal{L}} \overline{I}(\mathbf{U}_l;\mathbf{S}_l). \tag{128}
\end{equation}
This completes the proof of (30).

The achievability statement in (31) follows by considering the optimal $P_{X,U|S}$ (in the chain $U-(X,S)-Y$) and defining the i.i.d. random variables $(S_l^n,X_l^n,U_l^n,Y_{kl}^n) \sim \prod_{i=1}^n P_{S_l}(s_i) P_{X,U|S}(x_i,u_i|s_i) W_k(y_i|x_i,s_i)$. Khintchine's law of large numbers [7, Lemma 1.3.2] then asserts that for every $(k,l)\in\mathcal{K}\times\mathcal{L}$ we have $\underline{I}(\mathbf{U}_l;\mathbf{Y}_{kl}) = I(U_l;Y_{kl})$ and $\overline{I}(\mathbf{U}_l;\mathbf{S}_l) = I(U_l;S_l)$, completing the proof of the lower bound in (31).
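Lemma 3.3.2 of [7], which drives (127), has a simple Monte Carlo illustration: for a sequence-level mixture of two i.i.d. processes, the normalized information density $\frac{1}{n}\log\frac{P_{U^nY^n}}{P_{U^n}P_{Y^n}}$ clusters around the two component mutual informations, so its p-liminf is the minimum of the two. The two joint pmfs and the mixture weights below are invented for illustration, and `scipy` is assumed available for `logsumexp`.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)
n, trials = 400, 4000
alpha = np.array([0.5, 0.5])                   # mixture weights (illustrative)
P = [np.array([[0.45, 0.05], [0.05, 0.45]]),   # component 1: I_1 ≈ 0.368 nats
     np.array([[0.30, 0.20], [0.20, 0.30]])]   # component 2: I_2 ≈ 0.020 nats

def mi(p):
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    return float((p * np.log(p / (px * py))).sum())

# Draw (U^n, Y^n) from the sequence-level mixture: pick a component per
# trial, then draw n i.i.d. symbol pairs (coded as cell = 2u + y) from it.
c = rng.choice(2, size=trials, p=alpha)
cells = np.stack([rng.choice(4, size=(trials, n), p=P[k].ravel())
                  for k in range(2)])[c, np.arange(trials)]
u, y = cells // 2, cells % 2

def log_mix(logprobs):
    # logprobs: list (over components) of per-trial log-probabilities
    return logsumexp(np.stack([np.log(a) + lp
                               for a, lp in zip(alpha, logprobs)]), axis=0)

lp_uy = log_mix([np.log(P[k])[u, y].sum(1) for k in range(2)])
lp_u = log_mix([np.log(P[k].sum(1))[u].sum(1) for k in range(2)])
lp_y = log_mix([np.log(P[k].sum(0))[y].sum(1) for k in range(2)])
density = (lp_uy - lp_u - lp_y) / n            # (1/n) i(U^n; Y^n)

print([mi(p) for p in P])                      # component mutual informations
print(np.quantile(density, [0.01, 0.50, 0.99]))
```

The printed quantiles should show the spectrum sitting on two atoms near $I_1$ and $I_2$ (up to $O(1/\sqrt{n})$ fluctuations); the infimum in (127) picks out the smaller one, while a sup as in (128) would pick the larger.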
G. Proof of Theorem 6

Proof: We will only sketch the proof of the direct part since it combines the proof of Theorem 1 and the main result of Iwata and Muramatsu [8] in a straightforward manner. In fact, this proof technique originated from Heegard and El Gamal [11]. Perform Wyner-Ziv coding for the "source" $S^n$ at the state encoder with correlated "side information" $Y^n$. This generates a rate-limited description of $S^n$ at the decoder, where the rate is approximately $\frac{1}{n}\log M_{d,n}$. Call this description $V^n\in\mathcal{V}^n$. The rate constraint is given in (33) after going through the same steps as in [8]. The description $V^n$ is then used as common side information of the state (at the encoder and decoder) for the usual Gel'fand-Pinsker coding problem, resulting in (32). This part is simply a conditional version of Theorem 1.

For the converse part, fix $\gamma > 0$ and note that
\begin{equation}
M_{d,n} \le \exp(n(R_d+\gamma)) \tag{129}
\end{equation}
for $n$ large enough by the second inequality in (11). Thus, if $V^n$ denotes an arbitrary random variable whose support has cardinality no larger than $M_{d,n}$, we can check that [7, Theorem 5.4.1]
\begin{equation}
\Pr\bigg( \frac{1}{n}\log\frac{1}{P_{V^n}(V^n)} \ge \frac{1}{n}\log M_{d,n} + \gamma \bigg) \le \exp(-n\gamma). \tag{130}
\end{equation}
We can further lower bound the left-hand side of (130) using (129), which yields
\begin{equation}
\Pr\bigg( \frac{1}{n}\log\frac{1}{P_{V^n}(V^n)} \ge R_d + 2\gamma \bigg) \le \exp(-n\gamma). \tag{131}
\end{equation}
Now consider
\begin{align}
R_d &\ge \overline{H}(\mathbf{V}) - 2\gamma \tag{132}\\
&\ge \overline{H}(\mathbf{V}) - \underline{H}(\mathbf{V}|\mathbf{S}) - 2\gamma \tag{133}\\
&\ge \overline{I}(\mathbf{V};\mathbf{S}) - 2\gamma \tag{134}\\
&\ge \overline{I}(\mathbf{V};\mathbf{S}) - \underline{I}(\mathbf{V};\mathbf{Y}) - 2\gamma. \tag{135}
\end{align}
Inequality (132) follows from (131) and the definition of the spectral sup-entropy rate [7, Chapter 1]. Inequality (133) holds because, by (129), $V^n$ is supported on finitely many points in $\mathcal{V}^n$, and so the spectral inf-conditional entropy rate $\underline{H}(\mathbf{V}|\mathbf{S})$ is non-negative [12, Lemma 2(a)]. Inequality (134) follows because the limsup in probability is superadditive [6, Theorem 8(d)]. Finally, inequality (135) follows because the spectral inf-mutual information rate is non-negative [6, Theorem 8(c)].

The upper bound on the transmission rate in (32) follows by considering a conditional version of the Verdú-Han converse. As in (45), this yields the constraint
\begin{equation}
R \le \underline{I}(\mathbf{U};\mathbf{Y}|\mathbf{V}) + 2\gamma, \tag{136}
\end{equation}
where $\mathbf{U} = \{U^n\}_{n=1}^\infty$ denotes a random variable representing the uniform choice of a message in $[1:M_n]$. Note that $\mathbf{V} = \{V^n\}_{n=1}^\infty$ is recoverable (available) at both encoder and decoder, hence the conditioning on $\mathbf{V}$ in (136). Also note that $(\mathbf{S},\mathbf{V})$ is independent of $\mathbf{U}$ since $V^n$ is a function of $S^n$ and $S^n$ is independent of $U^n$; hence $\overline{I}(\mathbf{U};\mathbf{S},\mathbf{V}) = 0$. Because $\overline{I}(\mathbf{U};\mathbf{S}|\mathbf{V}) \le \overline{I}(\mathbf{U};\mathbf{S},\mathbf{V})$ and the spectral sup-conditional mutual information rate is non-negative [7, Lemma 3.2.1],
\begin{equation}
\overline{I}(\mathbf{U};\mathbf{S}|\mathbf{V}) = 0. \tag{137}
\end{equation}
In view of (136), we have
\begin{equation}
R \le \underline{I}(\mathbf{U};\mathbf{Y}|\mathbf{V}) - \overline{I}(\mathbf{U};\mathbf{S}|\mathbf{V}) + 2\gamma. \tag{138}
\end{equation}
Since we have proved (135) and (138), we can now proceed in the same way as for the unconditional case in the converse proof of Theorem 1; refer to steps (46) to (48). Finally, taking $\gamma\to 0$ gives (32) and (33).
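The step (130) is, at bottom, a counting bound: a support of size $M_{d,n}$ contains at most $M_{d,n}$ atoms of probability below $M_{d,n}^{-1}e^{-n\gamma}$, so the tail event has probability at most $e^{-n\gamma}$. A quick numerical check on random pmfs follows; the alphabet size and parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

def tail_vs_bound(M, n, gamma, trials=5):
    """Check P[(1/n) log 1/P(V) >= (1/n) log M + gamma] <= exp(-n*gamma)
    for random pmfs supported on M points, as in (130)."""
    for _ in range(trials):
        p = rng.dirichlet(np.ones(M))              # random pmf, support size M
        thresh = np.log(M) / n + gamma
        event = (-np.log(p) / n) >= thresh         # atoms in the tail event
        print(f"P(event) = {p[event].sum():.3e} <= "
              f"exp(-n*gamma) = {np.exp(-n * gamma):.3e}")

tail_vs_bound(M=4096, n=12, gamma=0.5)
```

Each printed probability sits below $e^{-n\gamma}$, typically far below it, since the bound simply union-bounds all $M_{d,n}$ atoms at the threshold probability.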
H. Proof of Corollary 7

Proof: We only prove the converse since achievability follows easily using i.i.d. random codes and Khintchine's law of large numbers [7, Lemma 1.3.2]. Essentially, all four spectral inf- and spectral sup-mutual information rates in (32) and (33) become mutual informations. For the converse, we fix $(\hat{\mathbf{U}},\hat{\mathbf{V}}) - (\hat{\mathbf{X}},\hat{\mathbf{S}}) - \hat{\mathbf{Y}}$ and lower bound $R_d$ in (33) as follows:
\begin{align}
R_d &\ge \overline{I}(\hat{\mathbf{V}};\hat{\mathbf{S}}) - \underline{I}(\hat{\mathbf{V}};\hat{\mathbf{Y}}) \tag{139}\\
&\ge \limsup_{n\to\infty}\frac{1}{n} I(\hat{V}^n;\hat{S}^n) - \liminf_{n\to\infty}\frac{1}{n} I(\hat{V}^n;\hat{Y}^n) \tag{140}\\
&\ge \limsup_{n\to\infty}\frac{1}{n}\Big[ I(\hat{V}^n;\hat{S}^n) - I(\hat{V}^n;\hat{Y}^n) \Big] \tag{141}\\
&\ge \limsup_{n\to\infty}\frac{1}{n}\bigg[ \sum_{i=1}^n I(\hat{V}^n,\hat{Y}^{i-1},\hat{S}_{i+1}^n;\hat{S}_i) - I(\hat{V}^n,\hat{Y}^{i-1},\hat{S}_{i+1}^n;\hat{Y}_i) \bigg], \tag{142}
\end{align}
where (140) follows from the same reasoning as in (107) and (108) (using the fact that $|\mathcal{S}| < \infty$) and (142) follows the same steps as in the proof of Corollary 4. In the same way, the condition on $R$ in (32) can be further upper bounded using the Csiszár-sum-identity [2, Chapter 2] and the memorylessness of $\hat{S}^n$ as follows:
\begin{equation}
R \le \liminf_{n\to\infty}\frac{1}{n}\bigg[ \sum_{i=1}^n I(\hat{U}^n;\hat{Y}_i|\hat{V}^n,\hat{Y}^{i-1},\hat{S}_{i+1}^n) - I(\hat{U}^n;\hat{S}_i|\hat{V}^n,\hat{Y}^{i-1},\hat{S}_{i+1}^n) \bigg]. \tag{143}
\end{equation}
From this point on, the rest of the proof is standard. We let $V_i := (\hat{V}^n,\hat{Y}^{i-1},\hat{S}_{i+1}^n)$, $U_i := (\hat{U}^n,V_i)$, and define $Y_i$, $X_i$ and $S_i$ as in the proof of Corollary 3. These identifications satisfy $(U_i,V_i) - (X_i,S_i) - Y_i$, and the proof can be completed as per the usual steps.
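For i.i.d. test channels, the right-hand side of (142) reduces to the single-letter difference $I(V;S) - I(V;Y)$, a Wyner-Ziv-type description rate. As a worked toy example (all distributions invented for illustration): let $S\sim\mathrm{Bern}(1/2)$, let the description be $V = S\oplus N_q$ with $N_q\sim\mathrm{Bern}(q)$, and let the decoder side information be $Y = S\oplus N_p$ with independent $N_p\sim\mathrm{Bern}(p)$.

```python
import numpy as np

def Hb(p):
    """Binary entropy in nats."""
    return 0.0 if p in (0.0, 1.0) else float(-p*np.log(p) - (1-p)*np.log(1-p))

p, q = 0.10, 0.20
# I(V;S) = log 2 - h(q): V is S observed through a BSC(q).
# I(V;Y) = log 2 - h(p*q): V is Y through the cascade BSC(q) o BSC(p),
# whose crossover is the binary convolution p*q = p(1-q) + q(1-p).
cascade = p * (1 - q) + q * (1 - p)
rd = (np.log(2) - Hb(q)) - (np.log(2) - Hb(cascade))
print(f"I(V;S) - I(V;Y) = h(p*q) - h(q) = {rd:.4f} nats")  # ≈ 0.0726
```

This $h(p*q) - h(q)$ form is the expression that appears in Wyner-Ziv coding of a doubly symmetric binary source, consistent with the role of [8] in the achievability sketch above.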
Acknowledgements

I am very grateful to Stark C. Draper for several discussions that led to the current work. This work is supported by NUS startup grants WBS R-263-000-A98-750 (FoE) and WBS R-263-000-A98-133 (ODPRT).

References

[1] S. Gelfand and M. Pinsker, "Coding for channel with random parameters," Prob. of Control and Inf. Th., vol. 9, no. 1, pp. 19–31, 1980.
[2] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge, U.K.: Cambridge University Press, 2012.
[3] H. Tyagi and P. Narayan, "The Gelfand-Pinsker channel: Strong converse and upper bound for the reliability function," in Proc. of IEEE Intl. Symp. on Info. Theory, Seoul, Korea, 2009.
[4] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, 2011.
[5] P. Moulin and J. A. O'Sullivan, "Information-theoretic analysis of information hiding," IEEE Trans. on Inf. Th., vol. 49, no. 3, pp. 563–593, Mar 2003.
[6] S. Verdú and T. S. Han, "A general formula for channel capacity," IEEE Trans. on Inf. Th., vol. 40, no. 4, pp. 1147–1157, Jul 1994.
[7] T. S. Han, Information-Spectrum Methods in Information Theory. Springer Berlin Heidelberg, Feb 2003.
[8] K.-I. Iwata and J. Muramatsu, "An information-spectrum approach to rate-distortion function with side information," IEICE Trans. on Fundamentals of Electronics, Communications and Computer, vol. E85-A, no. 6, pp. 1387–1395, 2002.
[9] A. D. Wyner, "On source coding with side information at the decoder," IEEE Trans. on Inf. Th., vol. 21, no. 3, pp. 294–300, 1975.
[10] Y. Steinberg, "Coding for channels with rate-limited side information at the decoder, with applications," IEEE Trans. on Inf. Th., vol. 54, no. 9, pp. 4283–4295, Sep 2008.
[11] C. Heegard and A. El Gamal, "On the capacity of computer memory with defects," IEEE Trans. on Inf. Th., vol. 29, no. 5, pp. 731–739, Sep 1983.
[12] Y. Steinberg and S. Verdú, "Simulation of random processes and rate-distortion theory," IEEE Trans. on Inf. Th., vol. 42, no. 1, pp. 63–86, Jan 1996.
[13] T. Matsuta and T. Uyematsu, "A general formula of rate-distortion functions for source coding with side information at many decoders," in Int. Symp. Inf. Th., Boston, MA, 2012.
[14] S. Miyake and F. Kanaya, "Coding theorems on correlated general sources," IEICE Trans. on Fundamentals of Electronics, Communications and Computer, vol. E78-A, no. 9, pp. 1063–1070, 1995.
[15] M. Hayashi, "General nonasymptotic and asymptotic formulas in channel resolvability and identification capacity and their application to the wiretap channel," IEEE Trans. on Inf. Th., vol. 52, no. 4, pp. 1562–1575, Apr 2006.
[16] M. Bloch and J. N. Laneman, "Strong secrecy from channel resolvability," IEEE Trans. on Inf. Th., vol. 59, no. 12, pp. 8077–8098, Dec 2013.
[17] W. Yu, A. Sutivong, D. Julian, T. Cover, and M. Chiang, "Writing on colored paper," in Int. Symp. Inf. Th., Washington, DC, 2001.
[18] P. Moulin and Y. Wang, "Capacity and random-coding exponents for channel coding with side information," IEEE Trans. on Inf. Th., vol. 53, no. 4, pp. 1326–1347, Apr 2007.
[19] S. Kuzuoka, "A simple technique for bounding the redundancy of source coding with side information," in Int. Symp. Inf. Th., Boston, MA, 2012.
[20] T. M. Cover and M. Chiang, "Duality between channel capacity and rate distortion with two-sided state information," IEEE Trans. on Inf. Th., vol. 48, no. 6, pp. 1629–1638, Jun 2002.
[21] T. S. Han and S. Verdú, "Approximation theory of output statistics," IEEE Trans. on Inf. Th., vol. 39, no. 3, pp. 752–772, May 1993.
[22] M. Raginsky, "Empirical processes, typical sequences and coordinated actions in standard Borel spaces," IEEE Trans. on Inf. Th., vol. 59, no. 3, pp. 1288–1301, Mar 2013.
[23] A. Rosenzweig, Y. Steinberg, and S. Shamai, "On channels with partial channel state information at the transmitter," IEEE Trans. on Inf. Th., vol. 51, no. 5, pp. 1817–1830, 2005.
[24] S. Verdú, "Non-asymptotic achievability bounds in multiuser information theory," in Allerton Conference, 2012.
[25] S. Watanabe, S. Kuzuoka, and V. Y. F. Tan, "Non-asymptotic and second-order achievability bounds for coding with side-information," arXiv:1301.6467, Jan 2013.
[26] M. H. Yassaee, M. R. Aref, and A. Gohari, "A technique for deriving one-shot achievability results in network information theory," arXiv:1303.0696, Mar 2013.
Vincent Y. F. Tan (S'07-M'11) is an Assistant Professor in the Departments of Electrical and Computer Engineering (ECE) and Mathematics at the National University of Singapore. He received the B.A. and M.Eng. degrees in Electrical and Information Sciences from Cambridge University in 2005, and the Ph.D. degree in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology in 2011. He was a postdoctoral researcher in the Department of ECE at the University of Wisconsin-Madison and, following that, a scientist at the Institute for Infocomm Research (I2R), A*STAR, Singapore.