Single Letter Expression of Capacity for a Class of Channels with Memory
Christos K. Kourtellaris, Charalambos D. Charalambous, Ioannis Tzortzis
Abstract—We study finite alphabet channels with Unit Memory on the previous Channel Outputs, called UMCO channels. We identify necessary and sufficient conditions to test whether the capacity achieving channel input distributions with feedback are time-invariant, and whether feedback capacity is characterized by single letter expressions, similar to that of memoryless channels. The method is based on showing that a certain dynamic programming equation, which in general is a nested optimization problem over the sequence of channel input distributions, reduces to a non-nested optimization problem. Moreover, for UMCO channels, we give a simple expression for the maximum likelihood (ML) error exponent, and we identify sufficient conditions to test whether feedback does not increase capacity. We derive similar results when transmission cost constraints are imposed. We apply the results to a special class of UMCO channels, the Binary State Symmetric Channel (BSSC), with and without transmission cost constraints, to show that the optimization problem of feedback capacity is non-nested, that the capacity achieving channel input distribution and the corresponding channel output transition probability distribution are time-invariant, and that feedback capacity is characterized by a single letter formula, precisely as Shannon's single letter characterization of the capacity of memoryless channels. We then derive closed form expressions for the capacity achieving channel input distribution and feedback capacity, and use them to evaluate an error exponent for ML decoding.
I. INTRODUCTION
Shannon, in his landmark paper [1], showed that the capacity of Discrete Memoryless Channels (DMCs) $\{\mathcal{A}, \mathcal{B}, \{P_{B|A}(b|a) : (a,b) \in \mathcal{A} \times \mathcal{B}\}\}$ is characterized by the celebrated single letter formula
$$C \triangleq \max_{P_A} I(A;B). \qquad \text{(I.1)}$$
This is often shown by using the converse to the channel coding theorem to obtain the upper bounds [2]
$$C^{noFB}_{A^n;B^n} \triangleq \max_{P_{A^n}} I(A^n;B^n) \le \max_{P_{A_i},\, i=0,\ldots,n} \sum_{i=0}^{n} I(A_i;B_i) \le (n+1)\, C \qquad \text{(I.2)}$$
which are achievable if and only if the channel input distribution satisfies the conditional independence $P_{A_i|A^{i-1}} = P_{A_i}$, $i = 0,1,\ldots,n$, and $\{A_i : i = 0,1,\ldots\}$ is identically distributed, which implies that the joint process $\{(A_i,B_i) : i = 0,1,\ldots\}$ is independent and identically distributed, and hence stationary ergodic.
For DMCs, it is shown by Shannon [3] and Dobrushin [4] that feedback codes do not incur a higher capacity compared to that of codes without feedback, that is, $C^{FB} = C$. This is often shown by first applying the converse to the coding theorem to deduce that feedback does not increase capacity [5], that is, $C^{FB} \le C^{noFB} = C$, which then implies that any candidate optimal channel input distribution with feedback $\{P_{A_i|A^{i-1},B^{i-1}} : i = 0,\ldots,n\}$ satisfies the conditional independence
$$P_{A_i|A^{i-1},B^{i-1}}(da_i|a^{i-1},b^{i-1}) = P_{A_i}(da_i), \quad i = 0,1,\ldots,n \qquad \text{(I.3)}$$
and hence the identity $C^{FB} = C^{noFB} = C$ holds if $\{A_i : i = 0,1,\ldots\}$ is identically distributed.

C. K. Kourtellaris, C. D. Charalambous and I. Tzortzis are with the Department of Electrical and Computer Engineering, University of Cyprus, Nicosia, Cyprus. Email: [email protected], [email protected], [email protected]. This work was financially supported by a medium size University of Cyprus grant entitled "DIMITRIS" and by QNRF, a member of Qatar Foundation, under the project NPRP 6-784-2-329.

For general channels with memory defined by $\{P_{B_i|B^{i-1},A^i} : i = 0,1,\ldots,n\}$, $P_{B_0|B^{-1},A^0} = P_{B_0|B_{-1},A_0}$, where $B_{-1}$ is the initial state, feedback codes in general incur a higher capacity than codes without feedback [2], [6]. The information measure often employed to characterize the feedback capacity of such channels is Marko's directed information [7], put forward by Massey [8], and defined by
$$I(A^n \to B^n) = \sum_{i=0}^{n} I(A^i;B_i|B^{i-1}) \triangleq \sum_{i=0}^{n} \int \log\Big(\frac{dP_{B_i|B^{i-1},A^i}(\cdot|b^{i-1},a^i)}{dP_{B_i|B^{i-1}}(\cdot|b^{i-1})}(b_i)\Big)\, P_{A^i,B^i}(da^i,db^i). \qquad \text{(I.4)}$$
Indeed, Massey [8] showed that the per unit time limit of the supremum of directed information over channel input distributions $\mathcal{P}^{FB}_{[0,n]} \triangleq \{P_{A_i|A^{i-1},B^{i-1}} : i = 0,\ldots,n\}$, defined by
$$C^{FB}_{A^\infty \to B^\infty} = \lim_{n\to\infty} \frac{1}{n+1}\, C^{FB}_{A^n \to B^n}, \qquad C^{FB}_{A^n \to B^n} \triangleq \sup_{\mathcal{P}^{FB}_{[0,n]}} I(A^n \to B^n) \qquad \text{(I.5)}$$
gives a tight bound on any achievable rate of feedback codes, and hence $C^{FB}_{A^\infty \to B^\infty}$ is a candidate for the capacity of feedback codes. However, for channels with memory, it is generally not known whether the multi-letter expression of capacity, (I.5), can be reduced to a single letter expression analogous to (I.1).
Our main objective is to provide a framework for a single letter characterization of feedback capacity for a general class of channels with memory. Towards this direction, we provide conditions on channels with memory such that
$$C^{FB}_{A^n \to B^n} = (n+1)\, C^{FB} \qquad \text{(I.6)}$$
where $C^{FB}$ is a single letter expression similar to that of DMCs. Specifically, for channels of the form $\{P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\}$, where $B_{-1} = b_{-1} \in \mathcal{B}_{-1}$ is the initial state, we give necessary and sufficient conditions such that the following equality holds.
$$C^{FB}_{A^n \to B^n} = (n+1) \sup_{P_{A_0|B_{-1}}(\cdot|b_{-1})} I(A_0;B_0|b_{-1}), \quad \forall b_{-1} \in \mathcal{B}_{-1}. \qquad \text{(I.7)}$$
That is, the single letter expression is $C^{FB} \triangleq \sup_{P_{A_0|B_{-1}}(\cdot|b_{-1})} I(A_0;B_0|b_{-1})$, and it is independent of the initial state $b_{-1} \in \mathcal{B}_{-1}$.
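For the memoryless baseline (I.1), the maximization over $P_A$ is a finite-dimensional convex problem that can be carried out numerically. The following minimal sketch (ours, not part of the paper's development; the channel matrix, tolerance, and iteration cap are illustrative) computes $C = \max_{P_A} I(A;B)$ for a DMC with the standard Blahut-Arimoto iteration.

```python
import numpy as np

def blahut_arimoto(P, tol=1e-12, max_iter=10_000):
    """C = max_{P_A} I(A;B) in bits, cf. (I.1), for a DMC with
    row-stochastic P[a, b] = P_{B|A}(b|a)."""
    m = P.shape[0]
    p = np.full(m, 1.0 / m)                  # start from the uniform input
    for _ in range(max_iter):
        q = p @ P                            # induced output distribution
        with np.errstate(divide="ignore", invalid="ignore"):
            D = np.nansum(P * np.log2(P / q), axis=1)  # D(P(.|a) || q)
        p_new = p * np.exp2(D)
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    q = p @ P
    with np.errstate(divide="ignore", invalid="ignore"):
        C = float(np.nansum(p[:, None] * P * np.log2(P / q)))
    return C, p

# Binary symmetric channel, crossover 0.1: C = 1 - H(0.1) ~ 0.5310 bits
print(blahut_arimoto(np.array([[0.9, 0.1], [0.1, 0.9]])))
```

For the binary symmetric channel with crossover $0.1$ this returns approximately $0.5310$ bits at the uniform input, matching $1 - H(0.1)$. The results of this paper identify when a channel with memory admits the same kind of single letter, per-state optimization.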
A. Main Results and Methodology

First, we consider channels with Unit Memory on the previous Channel Output (UMCO), defined by
$$P_{B_i|B^{i-1},A^i} = P_{B_i|B_{i-1},A_i}, \quad i = 0,1,\ldots,n \qquad \text{(I.8)}$$
with and without a transmission cost constraint defined by
$$\frac{1}{n+1}\,\mathbf{E}\Big\{\sum_{i=0}^{n} \gamma^{UM}_i(A_i,B_{i-1})\Big\} \qquad \text{(I.9)}$$
where $\gamma^{UM}_i : \mathcal{A}_i \times \mathcal{B}_{i-1} \longmapsto [0,\infty)$. We identify necessary and sufficient conditions on the channel so that the optimization problem $C^{FB}_{A^n\to B^n}$, which is generally a nested optimization problem, often dealt with via dynamic programming, reduces to a non-nested optimization problem. These conditions give rise to a single letter characterization of feedback capacity. Among other results, we derive sufficient conditions for feedback not to increase capacity, and identify sufficient conditions for asymptotic stationarity of the optimal channel input distribution and ergodicity of the joint process $\{(A_i,B_i) : i = 0,1,\ldots\}$. Moreover, we give an upper bound on the error probability of maximum likelihood decoding. We also treat problems with transmission cost constraints.
Second, we apply the framework of the UMCO channel to the Binary State Symmetric Channel (BSSC), defined by
$$P_{B_i|A_i,B_{i-1}}(b_i|a_i,b_{i-1}) = \begin{array}{c|cccc} b_i \backslash (a_i,b_{i-1}) & (0,0) & (0,1) & (1,0) & (1,1) \\ \hline 0 & \alpha & \beta & 1-\beta & 1-\alpha \\ 1 & 1-\alpha & 1-\beta & \beta & \alpha \end{array}, \quad i = 0,1,\ldots,n, \quad (\alpha,\beta)\in[0,1]\times[0,1] \qquad \text{(I.10)}$$
with and without a transmission cost constraint defined by
$$\frac{1}{n+1}\,\mathbf{E}\Big\{\sum_{i=0}^{n} \gamma(A_i,B_{i-1})\Big\} \le \kappa, \qquad \gamma(a_i,b_{i-1}) = \overline{a_i \oplus b_{i-1}}, \quad \kappa \in [0,\kappa_{max}] \qquad \text{(I.11)}$$
where $\overline{x \oplus y}$ denotes the complement of the modulo-2 addition of $x$ and $y$. We calculate the capacity achieving channel input distribution with feedback without cost constraint and show that it is time-invariant. This illustrates that feedback capacity satisfies (I.6), is independent of the initial state $B_{-1} = b_{-1}$, and is characterized by
$$C^{FB}_{A^\infty\to B^\infty} = \sup_{P_{A_0|B_{-1}}} I(A_0;B_0|b_{-1}), \quad \forall b_{-1} \in \mathcal{B}_{-1} \qquad \text{(I.12)}$$
$$= H(\lambda) - \nu H(\alpha) - (1-\nu) H(\beta) \qquad \text{(I.13)}$$
where $\lambda, \nu$ are functions of the channel parameters $\alpha, \beta$ (see Theorem IV.1). The characterization (I.12) is precisely analogous to the single letter characterizations (I.1) and (I.2) of the capacity of DMCs. Additionally, we provide the error exponent evaluated at the capacity achieving channel input distribution with feedback, and we derive an upper bound on the error probability of maximum likelihood decoding which is easy to compute (see Section IV-A3). Finally, we show that a time-invariant first order Markov channel input distribution without feedback achieves the feedback capacity (I.13), and we give closed form expressions both for the capacity achieving channel input distribution and the corresponding channel output distribution. We also treat the case with cost constraint.
The main mathematical concept we invoke to obtain the above results is the structural properties of the optimal channel input distributions, [9], [10]. Specifically, the following.
(a) For channels with infinite memory on the previous channel outputs, defined by $P_{B_i|B^{i-1},A^i} = P_{B_i|B^{i-1},A_i}$, the maximization of directed information $I(A^n\to B^n)$ occurs in the subset satisfying the conditional independence $\{P_{A_i|A^{i-1},B^{i-1}} = P_{A_i|B^{i-1}} : i = 0,\ldots,n\}$.
(b) For channels with limited memory of order $M$, defined by $P_{B_i|B^{i-1},A^i} = P_{B_i|B^{i-1}_{i-M},A_i}$, the maximization of directed information $I(A^n\to B^n)$ occurs in the subset satisfying the conditional independence $\{P_{A_i|A^{i-1},B^{i-1}} = P_{A_i|B^{i-1}_{i-M}} : i = 0,\ldots,n\}$.
(c) For the UMCO channel, the maximization of directed information $I(A^n\to B^n)$ occurs in the subset satisfying the conditional independence $\{P_{A_i|A^{i-1},B^{i-1}} = P_{A_i|B_{i-1}} : i = 0,\ldots,n\}$.
The structural properties (a), (b) and (c), along with the fact that $C^{FB}_{A^n\to B^n} \ge C^{noFB}_{A^n;B^n}$, are employed in Section II to provide sufficient conditions for feedback not to increase capacity. Moreover, the structural property of the UMCO channel, (c), is applied in Section III to construct the finite horizon dynamic programming, the necessary and sufficient conditions on the capacity achieving input distribution, and the necessary and sufficient conditions for the non-nested optimization of feedback capacity. The methodology and the corresponding theorems of Section III can be easily extended to channels with finite memory on previous channel outputs by invoking the structural properties of the capacity achieving distributions for these channels.
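To make the single letter objective in (I.12) concrete, the following sketch (ours, not from the paper; the parameter values are illustrative) encodes the BSSC kernel as arranged from the matrix in (I.10) and maximizes $I(A_0;B_0|B_{-1}=b_{-1})$ over $\pi(\cdot|b_{-1})$ by a grid search, which is adequate here because the objective is concave in $\pi$.

```python
import numpy as np

def bssc_kernel(alpha, beta):
    """P[s, a, b] = P_{B_i|A_i,B_{i-1}}(b | a, b_prev = s) for the
    BSSC(alpha, beta), arranged from the matrix in (I.10)."""
    P = np.empty((2, 2, 2))
    P[0, 0] = [alpha, 1 - alpha]          # (a, b_prev) = (0, 0)
    P[1, 0] = [beta, 1 - beta]            # (0, 1)
    P[0, 1] = [1 - beta, beta]            # (1, 0)
    P[1, 1] = [1 - alpha, alpha]          # (1, 1)
    return P

def single_letter_I(P, s, p0):
    """I(A_0; B_0 | B_{-1} = s) in bits for pi(0|s) = p0."""
    pi = np.array([p0, 1.0 - p0])
    W = P[s]                              # W[a, b]
    q = pi @ W                            # output distribution given s
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.nansum(pi[:, None] * W * np.log2(W / q))

P = bssc_kernel(alpha=0.9, beta=0.2)      # illustrative parameters
grid = np.linspace(0.0, 1.0, 2001)
for s in (0, 1):
    vals = [single_letter_I(P, s, p0) for p0 in grid]
    print(s, grid[int(np.argmax(vals))], max(vals))
```

The two printed maxima should coincide (the channels seen from the two states are input/output relabelings of each other), consistent with the claim in (I.12) that the single letter value does not depend on the initial state; the common value can be compared with the closed form (I.13).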
B. Relation to the Literature

Although for several years significant effort has been devoted to the study of channels with memory, with or without feedback, explicit or closed form expressions for the capacity of such channels are limited to a few cases. For non-stationary non-ergodic Additive Gaussian Noise (AGN) channels with memory, Cover and Pombra [5] showed that feedback codes can increase capacity by at most half a bit. On the other hand, for a finite alphabet version of the Cover and Pombra channel with certain symmetry, Alajaji [11] showed that feedback does not increase capacity. Moreover, Permuter, Cuff, Van Roy and Weissman [12] derived the feedback capacity of the trapdoor channel, while Elishco and Permuter [13] employed dynamic programming to evaluate the feedback capacity of the Ising channel.
The capacity of channels $\{P_{B_i|B_{i-1},A_i} : i = 0,\ldots,n\}$ for feedback codes is analyzed by Berger [14] and Chen and Berger [15], under the assumption that the capacity achieving distribution satisfies the conditional independence property $P_{A_i|A^{i-1},B^{i-1}} = P_{A_i|B_{i-1}}(a_i|b_{i-1})$, $i = 0,1,\ldots,n$. A derivation of this structural property of the capacity achieving distribution is given in [9], [10].
Recently, Permuter, Asnani and Weissman [16], [17] derived the feedback capacity of a Binary-Input Binary-Output (BIBO) channel, called the Previous Output STate (POST) channel, where the current state of the channel is the previously received symbol. The authors in [17] showed, among other results, that feedback does not increase capacity. It can be shown that the POST channel is, within a transformation, equivalent to the Binary State Symmetric Channel (BSSC) [18], in which the state of the channel is defined as the modulo-2 addition of the current input symbol and the previous output symbol. When there are no transmission cost constraints, our results for the BSSC complement existing results obtained in [16], [17] regarding the POST channel, in the sense that we show the time-invariant properties of the capacity achieving distributions, which implies the single letter characterization of feedback capacity, we derive closed form expressions for these distributions, provide an upper bound on the error probability of maximum likelihood decoding, and we show that a first-order Markov channel input distribution without feedback achieves feedback capacity. Moreover, we derive similar closed form expressions when average transmission cost constraints are imposed.
A portion of the results established in this paper was utilized to construct a Joint Source Channel Coding (JSCC) scheme for the BSSC with a cost constraint and the Binary Symmetric Markov Source (BSMS) with single letter Hamming distortion measure [19]. The scheme is a natural generalization of the JSCC design (uncoded transmission) of an Independent and Identically Distributed (IID) Bernoulli source over a Binary Symmetric Channel (BSC) [20], [21].
The remainder of the paper is organized as follows. In Section II, we introduce the mathematical formulation and identify sufficient conditions for feedback not to increase capacity. In Section III, we identify sufficient conditions to test whether the capacity achieving input distribution is time-invariant. The results are then extended to the infinite horizon case. In Section IV, we apply the main theorems of Section III to the BSSC, with and without feedback and with and without cost constraint, to prove, among other results, that capacity is given by a single letter characterization.
Finally, Section V delivers our concluding remarks.

II. FORMULATION & PRELIMINARY RESULTS
In this section we introduce the definitions of feedback capacity and of capacity without feedback, and we identify necessary and sufficient conditions for feedback not to increase capacity.
A. Notation and Definitions
The probability distribution of a Random Variable (RV) defined on a probability space $(\Omega,\mathcal{F},\mathbb{P})$ by the mapping $X : (\Omega,\mathcal{F}) \longmapsto (\mathcal{X},\mathcal{B}(\mathcal{X}))$ is denoted by $\mathbf{P}(\cdot) \equiv P_X(\cdot)$. The space of probability distributions on $\mathcal{X}$ is denoted by $\mathcal{M}(\mathcal{X})$. A RV is called discrete if there exists a countable set $\mathcal{S}$ such that $\sum_{x_i \in \mathcal{S}} \mathbb{P}\{\omega \in \Omega : X(\omega) = x_i\} = 1$. The probability distribution $P_X(\cdot)$ is then concentrated on points in $\mathcal{S}$, and it is defined by
$$P_X(A) \triangleq \sum_{x_i \in \mathcal{S} \cap A} \mathbb{P}\{\omega \in \Omega : X(\omega) = x_i\}, \quad \forall A \in \mathcal{B}(\mathcal{X}). \qquad \text{(II.14)}$$
Given another RV $Y : (\Omega,\mathcal{F}) \mapsto (\mathcal{Y},\mathcal{B}(\mathcal{Y}))$, $P_{Y|X}(dy|x)(\omega)$ is the conditional distribution of the RV $Y$ given $X$. For a fixed $X = x$ we denote the conditional distribution by $P_{Y|X}(dy|X=x) = P_{Y|X}(dy|x)$. Let $\mathbb{Z}$ denote the set of integers, $\mathbb{N} \triangleq \{0,1,2,\ldots\}$, and $\mathbb{N}^n \triangleq \{0,1,2,\ldots,n\}$. The channel input and channel output spaces are sequences of measurable spaces $\{(\mathcal{A}_i,\mathcal{B}(\mathcal{A}_i)) : i \in \mathbb{Z}\}$ and $\{(\mathcal{B}_i,\mathcal{B}(\mathcal{B}_i)) : i \in \mathbb{Z}\}$, respectively, while their product spaces are $\mathcal{A}^{\mathbb{Z}} \triangleq \times_{i\in\mathbb{Z}} \mathcal{A}_i$, $\mathcal{B}^{\mathbb{Z}} \triangleq \times_{i\in\mathbb{Z}} \mathcal{B}_i$, $\mathcal{B}(\mathcal{A}^{\mathbb{Z}}) \triangleq \otimes_{i\in\mathbb{Z}} \mathcal{B}(\mathcal{A}_i)$, $\mathcal{B}(\mathcal{B}^{\mathbb{Z}}) \triangleq \otimes_{i\in\mathbb{Z}} \mathcal{B}(\mathcal{B}_i)$. Points in the product spaces are denoted by $a^n \triangleq \{\ldots,a_{-1},a_0,a_1,\ldots,a_n\} \in \mathcal{A}^n$ and $b^n \triangleq \{\ldots,b_{-1},b_0,b_1,\ldots,b_n\} \in \mathcal{B}^n$, $n \in \mathbb{Z}$.

B. Capacity with Feedback & Properties
Next, we provide the precise formulation of information capacity and some preliminary results. We begin by introducing the definitions of the channel distribution, channel input distribution, transmission cost constraint, and feedback code.
Definition II.1. (Channel distribution with memory) A sequence of conditional distributions defined by
$$\mathcal{C}_{[0,n]} \triangleq \big\{P_{B_i|B^{i-1},A^i}(db_i|b^{i-1},a^i) = P_{B_i|B_{i-1},A_i}(db_i|b_{i-1},a_i) : i = 0,\ldots,n\big\}. \qquad \text{(II.15)}$$
At time $i = 0$ the conditional distribution is $P_{B_0|B_{-1},A_0}(db_0|b_{-1},a_0)$, where $B_{-1} = b_{-1} \in \mathcal{B}_{-1}$ is the initial data. The initial data $b_{-1} \in \mathcal{B}_{-1}$ denote the initial state of the channel, and this should not be misinterpreted as feedback information. In this work we assume that the initial data are known both to the encoder and the decoder, unless we state otherwise.

Definition II.2. (Channel input distribution with feedback) A sequence of conditional distributions defined by
$$\mathcal{P}^{FB}_{[0,n]} \triangleq \big\{P_{A_i|A^{i-1},B^{i-1}}(da_i|a^{i-1},b^{i-1}) : i = 0,\ldots,n\big\}. \qquad \text{(II.16)}$$
At time $i = 0$ the conditional distribution is $P_{A_0|A^{-1},B^{-1}}(da_0|a^{-1},b^{-1}) = P_{A_0|B_{-1}}(da_0|b_{-1})$. That is, the information structure of the channel input distribution is $\mathcal{I}^{FB}_i \triangleq \{b_{-1},a_0,b_0,a_1,b_1,\ldots,a_{i-1},b_{i-1}\}$, for $i = 1,\ldots,n$. For $i = 0$ the convention is $\mathcal{I}^{FB}_0 \triangleq \{a^{-1},b^{-1}\} = \{b_{-1}\}$, which states that the channel input distribution depends only on the initial data.

Definition II.3. (Transmission cost constraints) The cost of transmitting symbols over the channel (II.15) is a measurable function $c_{0,n} : \mathcal{A}^n \times \mathcal{B}^{n-1} \longmapsto [0,\infty)$ defined by
$$c_{0,n}(a^n,b^{n-1}) \triangleq \sum_{i=0}^{n} \gamma_i(a_i,b_{i-1}). \qquad \text{(II.17)}$$
The transmission cost constraint is defined by
$$\mathcal{P}^{FB}_{[0,n]}(\kappa) \triangleq \Big\{P_{A_i|A^{i-1},B^{i-1}},\ i = 0,\ldots,n : \frac{1}{n+1}\,\mathbf{E}_\mu\big\{c_{0,n}(A^n,B^{n-1})\big\} \le \kappa\Big\}, \quad \kappa \in [0,\infty] \qquad \text{(II.18)}$$
where the subscript notation $\mathbf{E}_\mu$ indicates that the joint distribution over which the expectation is taken is parametrized by the initial distribution $P_{B_{-1}}(db_{-1}) = \mu(db_{-1})$ (and of course the channel input distribution).

Definition II.4. (Feedback code) A feedback code for the channel defined by (II.15) with transmission cost constraint $\mathcal{P}^{FB}_{[0,n]}(\kappa)$ is a sequence $\{(n,M_n,\varepsilon_n) : n = 0,1,\ldots\}$, which consists of the following elements.
(a) A set of uniformly distributed messages $\mathcal{M}_n \triangleq \{1,\ldots,M_n\}$ and a set of encoding strategies, mapping messages into channel inputs of block length $(n+1)$, defined by
$$\mathcal{E}^{FB}_{[0,n]}(\kappa) \triangleq \Big\{g_i : \mathcal{M}_n \times \mathcal{A}^{i-1} \times \mathcal{B}^{i-1} \longmapsto \mathcal{A}_i,\ a_0 = g_0(w,b_{-1}),\ a_1 = g_1(w,b_{-1},a_0,b_0),\ \ldots,\ a_n = g_n(w,b_{-1},a_0,b_0,\ldots,a_{n-1},b_{n-1}),\ w \in \mathcal{M}_n : \frac{1}{n+1}\,\mathbf{E}^g\big(c_{0,n}(A^n,B^{n-1})\big) \le \kappa\Big\}, \quad n = 0,1,\ldots. \qquad \text{(II.19)}$$
The codeword for any $w \in \mathcal{M}_n$ is $u_w \in \mathcal{A}^n$, $u_w = (g_0(w,b_{-1}), g_1(w,b_{-1},a_0,b_0), \ldots, g_n(w,b_{-1},a_0,b_0,\ldots,a_{n-1},b_{n-1}))$, and $\mathcal{C}_n = (u_1,u_2,\ldots,u_{M_n})$ is the code for the message set $\mathcal{M}_n$, with $\{A^{-1},B^{-1}\} = \{b_{-1}\}$. In general, the code depends on the initial data $B_{-1} = b_{-1}$ (depending on the convention), which are known to the encoder and the decoder (unless specified otherwise). Alternatively, we can take $\{A^{-1},B^{-1}\} = \{\emptyset\}$.
(b) Decoder measurable mappings $d_{0,n} : \mathcal{B}^n \longmapsto \mathcal{M}_n$, such that the average probability of decoding error satisfies
$$P^{(n)}_e \triangleq \frac{1}{M_n} \sum_{w\in\mathcal{M}_n} \mathbf{P}^g\big\{d_{0,n}(B^n) \neq w \,|\, W = w\big\} \equiv \mathbf{P}^g\big\{d_{0,n}(B^n) \neq W\big\} \le \varepsilon_n$$
and the decoder may also assume knowledge of the initial data.
(The superscript on the expectation, i.e., $\mathbf{E}^g$, indicates the dependence of the distribution on the encoding strategies. If $B_{-1} = b_{-1}$ is fixed, then $\mu(\cdot) = \delta_{b_{-1}}(\cdot)$ is a Dirac (delta) measure concentrated at $B_{-1} = b_{-1}$.)
The coding rate or transmission rate over the channel is defined by $r_n \triangleq \frac{1}{n+1}\log M_n$. A rate $R$ is said to be an achievable rate if there exists a code sequence satisfying $\lim_{n\to\infty} \varepsilon_n = 0$ and $\liminf_{n\to\infty} \frac{1}{n+1}\log M_n \ge R$. The operational definition of feedback capacity of the channel is the supremum of all achievable rates, i.e., $C \triangleq \sup\{R : R \text{ is achievable}\}$.
Given any channel input distribution $\{P_{A_i|A^{i-1},B^{i-1}} : i = 0,1,\ldots,n\} \in \mathcal{P}^{FB}_{[0,n]}$, a channel distribution $\{P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\}$, and a fixed initial distribution $\mu(db_{-1})$, the induced joint distribution $P_{A^n,B^n}$ parametrized by $\mu(\cdot)$ is uniquely defined, and a probability space $(\Omega,\mathcal{F},\mathbb{P})$ carrying the sequence of RVs $(A^n,B^n) \triangleq \{B_{-1},A_0,B_0,A_1,B_1,\ldots,A_n,B_n\}$ is constructed as follows.
$$\mathbb{P}\{A^n \in da^n, B^n \in db^n\} \triangleq P_{A^n,B^n}(da^n,db^n) = \otimes_{j=0}^{n}\big(P_{B_j|B_{j-1},A_j}(db_j|b_{j-1},a_j) \otimes P_{A_j|A^{j-1},B^{j-1}}(da_j|a^{j-1},b^{j-1})\big) \otimes \mu(db_{-1}). \qquad \text{(II.20)}$$
$$\mathbb{P}\{B^n \in db^n\} \triangleq P_{B^n}(db^n) = \int_{\mathcal{A}^n} P_{A^n,B^n}(da^n,db^n). \qquad \text{(II.21)}$$
$$P_{B_i|B^{i-1}}(db_i|b^{i-1}) = \int_{\mathcal{A}^i} P_{B_i|B_{i-1},A_i}(db_i|b_{i-1},a_i) \otimes P_{A_i|A^{i-1},B^{i-1}}(da_i|a^{i-1},b^{i-1}) \otimes P_{A^{i-1}|B^{i-1}}(da^{i-1}|b^{i-1}), \quad i = 1,\ldots,n. \qquad \text{(II.22)}$$
$$P_{B_0|B_{-1}}(db_0|b_{-1}) = \int_{\mathcal{A}_0} P_{B_0|B_{-1},A_0}(db_0|b_{-1},a_0) \otimes P_{A_0|B_{-1}}(da_0|b_{-1}). \qquad \text{(II.23)}$$
The Directed Information from $A^n \triangleq \{A_0,A_1,\ldots,A_n\}$ to $B^n \triangleq \{B_0,B_1,\ldots,B_n\}$, conditioned on $B_{-1}$, is defined by [7], [8]
$$I(A^n \to B^n) \triangleq \sum_{i=0}^{n} I(A^i;B_i|B^{i-1}) = \sum_{i=0}^{n} I(A_i;B_i|B_{i-1}) = \sum_{i=0}^{n} \int \log\Big(\frac{dP_{B_i|B_{i-1},A_i}(\cdot|b_{i-1},a_i)}{dP_{B_i|B_{i-1}}(\cdot|b_{i-1})}(b_i)\Big)\, P_{A_i,B_i}(da_i,db_i) \qquad \text{(II.24)}$$
$$\equiv I^{FB}_{A^n\to B^n}\big(P_{A_i|A^{i-1},B^{i-1}}, P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\big) \qquad \text{(II.25)}$$
where (II.24) follows from the channel definition, and the notation $I^{FB}_{A^n\to B^n}(\cdot,\cdot)$ indicates that $I(A^n\to B^n)$ is a functional of the sequences of channel input and channel distributions; its dependence on the initial distribution $\mu(\cdot)$ is suppressed. Define the information quantities
$$C^{FB}_{A^n\to B^n} \triangleq \sup_{\mathcal{P}^{FB}_{[0,n]}} I(A^n\to B^n), \qquad C^{FB}_{A^n\to B^n}(\kappa) \triangleq \sup_{\mathcal{P}^{FB}_{[0,n]}(\kappa)} I(A^n\to B^n). \qquad \text{(II.26)}$$
Under the assumption that $\{B_{-1},A_0,B_0,A_1,B_1,\ldots\}$ is jointly ergodic, or that $\frac{1}{n+1}\sum_{i=0}^{n}\log\frac{dP_{B_i|B_{i-1},A_i}(\cdot|B_{i-1},A_i)}{dP_{B_i|B_{i-1}}(\cdot|B_{i-1})}(B_i)$ is information stable [4], [22] and $\frac{1}{n+1}\sum_{i=0}^{n}\gamma_i(A_i,B_{i-1})$ is stable, the capacity of the channel with feedback, with and without transmission cost, is given by
$$C^{FB}_{A^\infty\to B^\infty} \triangleq \lim_{n\to\infty}\frac{1}{n+1}\, C^{FB}_{A^n\to B^n}, \qquad C^{FB}_{A^\infty\to B^\infty}(\kappa) \triangleq \lim_{n\to\infty}\frac{1}{n+1}\, C^{FB}_{A^n\to B^n}(\kappa). \qquad \text{(II.27)}$$
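For finite alphabets, the sum in (II.24) is directly computable once the channel kernel and a sequence of Markov input distributions $\pi_i(a_i|b_{i-1})$ are fixed. A minimal sketch (ours, not from the paper; the example kernel and the uniform policies are illustrative) that evaluates $I(A^n\to B^n) = \sum_i I(A_i;B_i|B_{i-1})$ by propagating the output distribution forward:

```python
import numpy as np

def directed_information(P, policies, mu):
    """I(A^n -> B^n) = sum_i I(A_i; B_i | B_{i-1}) in bits, cf. (II.24).
    P[s, a, b] = P(b | b_prev = s, a): time-invariant UMCO kernel;
    policies[i][s, a] = pi_i(a | b_prev = s); mu[s] = P(B_{-1} = s)."""
    p_prev = np.asarray(mu, dtype=float)       # distribution of B_{i-1}
    total = 0.0
    for pi in policies:
        q = np.einsum('sa,sab->sb', pi, P)     # P^pi(b | s), cf. (II.23)
        joint = p_prev[:, None, None] * pi[:, :, None] * P   # P(s, a, b)
        with np.errstate(divide="ignore", invalid="ignore"):
            total += np.nansum(joint * np.log2(P / q[:, None, :]))
        p_prev = joint.sum(axis=(0, 1))        # distribution of B_i
    return total

# Example: a 2-state binary kernel with uniform inputs at every stage.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],       # b_prev = 0: rows a = 0, 1
              [[0.8, 0.2], [0.1, 0.9]]])      # b_prev = 1
policies = [np.full((2, 2), 0.5)] * 5          # pi_i(a|s) = 1/2, i = 0..4
print(directed_information(P, policies, mu=[0.5, 0.5]))
```

The restriction to inputs of the form $\pi_i(a_i|b_{i-1})$ anticipates the structural property (II.34) below; the same forward propagation underlies the dynamic programming recursions of Section III.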
1) Convexity Properties:
Next, we recall the convexity properties of directed information with respect to a specific definition of channel input distributions, which is equivalent to the above definition. Any sequence of channel input distributions $\{P_{A_i|A^{i-1},B^{i-1}} : i = 0,1,\ldots,n\} \in \mathcal{P}^{FB}_{[0,n]}$ and channel distributions $\{P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\}$ uniquely defines the causal conditioned distributions
$$\overleftarrow{P}(da^n|b^{n-1}) \triangleq \otimes_{i=0}^{n} P_{A_i|A^{i-1},B^{i-1}}(da_i|a^{i-1},b^{i-1}), \qquad \text{(II.28)}$$
$$\overrightarrow{P}(db^n|a^n,b_{-1}) \triangleq \otimes_{i=0}^{n} P_{B_i|B_{i-1},A_i}(db_i|b_{i-1},a_i) \qquad \text{(II.29)}$$
and vice-versa, and these are parametrized by the initial data $b_{-1}$. Moreover, for a fixed $B_{-1} = b_{-1}$ we can formally define the joint distribution of $\{A_0,B_0,A_1,B_1,\ldots,A_n,B_n\}$ and the joint distribution of $\{B_0,B_1,\ldots,B_n\}$ conditioned on $B_{-1} = b_{-1}$ by
$$P^{\overleftarrow{P}}(da^n,db^n|b_{-1}) \triangleq (\overleftarrow{P} \otimes \overrightarrow{P})(da^n,db^n|b_{-1}), \qquad \text{(II.30)}$$
$$P^{\overleftarrow{P}}(db^n|b_{-1}) \triangleq \int_{\mathcal{A}^n} (\overleftarrow{P} \otimes \overrightarrow{P})(da^n,db^n|b_{-1}). \qquad \text{(II.31)}$$
Both distributions are parametrized by the initial data $b_{-1}$. Then, from [23], we have the following convexity properties of directed information.
(a) The set of conditional distributions defined by (II.28), $\overleftarrow{P}_{A^n|B^{n-1}}(\cdot|b^{n-1}) \in \mathcal{M}(\mathcal{A}^n)$, is convex.
(b) Directed information is equivalently expressed as follows.
$$I(A^n\to B^n) = \int \log\Big(\frac{d\overrightarrow{P}(\cdot|a^n,b_{-1})}{dP^{\overleftarrow{P}}(\cdot|b_{-1})}(b^n)\Big)\, P^{\overleftarrow{P}}(da^n,db^n|b_{-1}) \otimes \mu(db_{-1}) \equiv I^{FB}_{A^n\to B^n}(\overleftarrow{P},\overrightarrow{P}). \qquad \text{(II.32)}$$
(c) Directed information, $I^{FB}_{A^n\to B^n}(\overleftarrow{P},\overrightarrow{P})$, is concave with respect to $\overleftarrow{P}(\cdot|b^{n-1}) \in \mathcal{M}(\mathcal{A}^n)$ for a fixed $\overrightarrow{P}(\cdot|a^n,b_{-1}) \in \mathcal{M}(\mathcal{B}^n)$.
Since the set of conditional distributions with or without transmission cost constraints is convex, and directed information is a concave functional, the optimization problems (II.26) are convex, and we have the following theorem.

Theorem II.1. (Convexity properties) Assume the set $\mathcal{P}^{FB}_{[0,n]}(\kappa)$ is non-empty and the supremum of $I(A^n\to B^n)$ over the set of distributions $\mathcal{P}^{FB}_{[0,n]}(\kappa)$ is achieved (i.e., it exists). Then the following hold.
(a) $C^{FB}_{A^n\to B^n}(\kappa)$ is a non-decreasing, concave function of $\kappa \in [0,\infty]$.
(b) An alternative characterization of $C^{FB}_{A^n\to B^n}(\kappa)$ is given by
$$C^{FB}_{A^n\to B^n}(\kappa) = \sup_{\overleftarrow{P} :\ \frac{1}{n+1}\mathbf{E}_\mu\{c_{0,n}(a^n,b^{n-1})\} = \kappa} I^{FB}_{A^n\to B^n}(\overleftarrow{P},\overrightarrow{P}), \quad \text{for } \kappa \le \kappa_{max} \qquad \text{(II.33)}$$
where $\kappa_{max}$ is the smallest number belonging to $[0,\infty]$ such that $C^{FB}_{A^n\to B^n}(\kappa)$ is constant in $[\kappa_{max},\infty]$, and $\mathbf{E}_\mu\{\cdot\}$ denotes expectation with respect to the joint distribution $(\overleftarrow{P} \otimes \overrightarrow{P}) \otimes \mu$.
Proof: Since the set $\mathcal{P}^{FB}_{[0,n]}(\kappa)$ is convex with respect to $\overleftarrow{P}(\cdot|b^{n-1}) \in \mathcal{M}(\mathcal{A}^n)$, the statements follow from the convexity and non-decreasing properties [23].
The above theorem states that the extremum problem of feedback capacity is a convex optimization problem over appropriate sets of distributions.
2) Information Structures of Optimal Channel Input Distributions:
Consider the extremum problem $C^{FB}_{A^n\to B^n}(\kappa)$, given by (II.33). In [9], [10], it is shown that the optimal channel input distribution satisfies the following conditional independence.
$$P_{A_i|A^{i-1},B^{i-1}}(da_i|a^{i-1},b^{i-1}) = P_{A_i|B_{i-1}}(da_i|b_{i-1}) \equiv \pi_i(da_i|b_{i-1}), \quad i = 0,\ldots,n. \qquad \text{(II.34)}$$
Moreover, in view of the information structure of the optimal channel input distribution, $C^{FB}_{A^n\to B^n}(\kappa)$ reduces to the following optimization problem.
$$C^{FB}_{A^n\to B^n}(\kappa) = \sup_{\mathcal{P}^{FB}_{[0,n]}(\kappa)} \sum_{i=0}^{n} \int \log\Big(\frac{dP_{B_i|B_{i-1},A_i}(\cdot|b_{i-1},a_i)}{dP^{\pi}_{B_i|B_{i-1}}(\cdot|b_{i-1})}(b_i)\Big)\, P^{\pi}_{A_i,B_i}(da_i,db_i) \qquad \text{(II.35)}$$
$$\equiv \sup_{\mathcal{P}^{FB}_{[0,n]}(\kappa)} I^{FB}_{A^n\to B^n}\big(\pi_i, P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\big) \qquad \text{(II.36)}$$
where the transmission cost constraint is defined by
$$\mathcal{P}^{FB}_{[0,n]}(\kappa) \triangleq \Big\{\pi_i(da_i|b_{i-1}),\ i = 0,\ldots,n : \frac{1}{n+1}\,\mathbf{E}^{\pi}_{\mu}\big\{c_{0,n}(A^n,B^{n-1})\big\} \le \kappa\Big\} \qquad \text{(II.37)}$$
and the induced joint and transition probability distributions are given by
$$P^{\pi}_{A_i,B_i}(da_i,db_i) = P_{B_i|B_{i-1},A_i}(db_i|b_{i-1},a_i) \otimes \pi_i(da_i|b_{i-1}) \otimes P^{\pi}_{B_{i-1}}(db_{i-1}) \qquad \text{(II.38)}$$
$$P^{\pi}_{B_i|B_{i-1}}(db_i|b_{i-1}) = \int_{\mathcal{A}_i} P_{B_i|B_{i-1},A_i}(db_i|b_{i-1},a_i) \otimes \pi_i(da_i|b_{i-1}), \quad i = 0,\ldots,n. \qquad \text{(II.39)}$$
The superscript indicates the dependence of these distributions on $\{\pi_i(da_i|b_{i-1}) : i = 0,\ldots,n\}$. The information feedback capacity rate is then given by
$$C^{FB}_{A^\infty\to B^\infty}(\kappa) \triangleq \lim_{n\to\infty}\frac{1}{n+1}\sup_{\mathcal{P}^{FB}_{[0,n]}(\kappa)} I^{FB}_{A^n\to B^n}\big(\pi_i, P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\big). \qquad \text{(II.40)}$$

C. Feedback Versus No Feedback
Here, we address the question of whether feedback increases capacity, via the optimization problem (II.35). First, we recall the definition of channel input distributions without feedback.
Definition II.5. (Channel input distribution without feedback) A sequence of conditional distributions defined by
$$\mathcal{P}^{noFB}_{[0,n]} \triangleq \big\{P_{A_i|A^{i-1},B_{-1}}(da_i|a^{i-1},b_{-1}) \equiv \pi^{noFB}_i(da_i|a^{i-1},b_{-1}) : i = 0,\ldots,n\big\}. \qquad \text{(II.41)}$$
The information structure of the channel input distribution without feedback is $\mathcal{I}^{noFB}_i \triangleq \{a^{i-1},b_{-1}\}$. For time $i = 0$, the distribution is $P_{A_0|A^{-1},B_{-1}}(da_0|a^{-1},b_{-1}) \equiv \pi^{noFB}_0(da_0|b_{-1})$, hence the information structure is $\mathcal{I}^{noFB}_0 \triangleq \{a^{-1},b_{-1}\} = \{b_{-1}\}$, which states that the channel input distribution depends only on the initial data.

Similarly to the feedback case (Section II-B), the initial state of the channel, $b_{-1}$, is assumed to be known at the encoder. The transmission cost constraint without feedback is defined by
$$\mathcal{P}^{noFB}_{[0,n]}(\kappa) \triangleq \Big\{\pi^{noFB}_i(da_i|a^{i-1},b_{-1}),\ i = 0,\ldots,n : \frac{1}{n+1}\,\mathbf{E}_{\mu}\big\{c_{0,n}(A^n,B^{n-1})\big\} \le \kappa\Big\}, \quad \kappa \in [0,\infty]. \qquad \text{(II.42)}$$
Moreover, the set of encoding strategies without feedback, mapping messages into channel inputs of block length $(n+1)$, is defined by
$$\mathcal{E}^{noFB}_{[0,n]}(\kappa) \triangleq \Big\{g^{noFB}_i : \mathcal{M}_n \times \mathcal{A}^{i-1} \times \mathcal{B}_{-1} \longmapsto \mathcal{A}_i,\ a_0 = g^{noFB}_0(w,b_{-1}),\ a_1 = g^{noFB}_1(w,b_{-1},a_0),\ \ldots,\ a_n = g^{noFB}_n(w,b_{-1},a^{n-1}),\ w \in \mathcal{M}_n : \frac{1}{n+1}\,\mathbf{E}^{g^{noFB}}\big(c_{0,n}(A^n,B^{n-1})\big) \le \kappa\Big\}, \quad n = 0,1,\ldots. \qquad \text{(II.43)}$$
By employing (II.42) and (II.43), a code without feedback is defined similarly to Definition II.4.
Given any channel input distribution without feedback $\{\pi^{noFB}_i(da_i|a^{i-1},b_{-1}) : i = 0,1,\ldots,n\} \in \mathcal{P}^{noFB}_{[0,n]}(\kappa)$, a channel distribution $\{P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\}$, and a fixed initial distribution $P_{B_{-1}}(db_{-1}) = \mu(db_{-1})$, the induced joint distribution $P_{A^n,B^n}$ parametrized by $\mu(\cdot)$ is uniquely defined. The mutual information from $A^n \triangleq \{A_0,A_1,\ldots,A_n\}$ to $B^n \triangleq \{B_0,B_1,\ldots,B_n\}$, conditioned on $B_{-1}$, is defined by
$$I(A^n;B^n) \triangleq \mathbf{E}^{\pi^{noFB}}_{\mu}\Big\{\log\Big(\frac{dP_{B^n|A^n,B_{-1}}(\cdot|A^n,B_{-1})}{dP^{\pi^{noFB}}_{B^n|B_{-1}}(\cdot|B_{-1})}(B^n)\Big)\Big\} = \sum_{i=0}^{n}\int\log\Big(\frac{dP_{B_i|B_{i-1},A_i}(\cdot|b_{i-1},a_i)}{dP^{\pi^{noFB}}_{B_i|B_{i-1}}(\cdot|b_{i-1})}(b_i)\Big)\, P^{\pi^{noFB}}_{A_i,B_i}(da_i,db_i) \equiv I^{noFB}_{A^n\to B^n}\big(\pi^{noFB}_i, P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\big) \qquad \text{(II.44)}$$
where the joint distribution and the transition probability distribution are induced by $\{\pi^{noFB}_i(da_i|a^{i-1},b_{-1}) : i = 0,\ldots,n\} \in \mathcal{P}^{noFB}_{[0,n]}$ as follows.
$$P^{\pi^{noFB}}_{A_i,B_i}(da_i,db_i) = P_{B_i|B_{i-1},A_i}(db_i|b_{i-1},a_i) \otimes P^{\pi^{noFB}}_{A_i|B_{i-1}}(da_i|b_{i-1}) \otimes P^{\pi^{noFB}}_{B_{i-1}}(db_{i-1}) \qquad \text{(II.45)}$$
$$P^{\pi^{noFB}}_{B_i|B_{i-1}}(db_i|b_{i-1}) = \int_{\mathcal{A}_i} P_{B_i|B_{i-1},A_i}(db_i|b_{i-1},a_i) \otimes P^{\pi^{noFB}}_{A_i|B_{i-1}}(da_i|b_{i-1}), \quad i = 0,\ldots,n \qquad \text{(II.46)}$$
$$P^{\pi^{noFB}}_{A_i|B^{i-1}}(da_i|b^{i-1}) = \int_{\mathcal{A}^{i-1}} \pi^{noFB}_i(da_i|a^{i-1},b_{-1}) \otimes P^{\pi^{noFB}}_{A^{i-1}|B^{i-1}}(da^{i-1}|b^{i-1}) \qquad \text{(II.47)}$$
$$P^{\pi^{noFB}}_{A^{i-1}|B^{i-1}}(da^{i-1}|b^{i-1}) = \frac{\otimes_{j=0}^{i-1}\, P_{B_j|B_{j-1},A_j}(db_j|b_{j-1},a_j) \otimes \pi^{noFB}_j(da_j|a^{j-1},b_{-1})}{\int_{\mathcal{A}^{i-1}}\otimes_{j=0}^{i-1}\, P_{B_j|B_{j-1},A_j}(db_j|b_{j-1},a_j) \otimes \pi^{noFB}_j(da_j|a^{j-1},b_{-1})}. \qquad \text{(II.48)}$$
The superscripts in the above distributions are important, to distinguish that these are generated by the channel and the channel input distributions without feedback, while the functional in (II.44) is fundamentally different from the one in (II.36). Compared to the channel with feedback, in which the corresponding distributions are (II.38) and (II.39) and are induced by $\{\pi_i(da_i|b_{i-1}) : i = 0,\ldots,n\} \in \mathcal{P}^{FB}_{[0,n]}(\kappa)$, when the channel is used without feedback the distributions (II.45) and (II.46) are induced by $\{\pi^{noFB}_i(da_i|a^{i-1},b_{-1}) : i = 0,\ldots,n\} \in \mathcal{P}^{noFB}_{[0,n]}(\kappa)$.
Define the information quantity
$$C^{noFB}_{A^n;B^n}(\kappa) = \sup_{\mathcal{P}^{noFB}_{[0,n]}(\kappa)} \sum_{i=0}^{n}\int\log\Big(\frac{dP_{B_i|B_{i-1},A_i}(\cdot|b_{i-1},a_i)}{dP^{\pi^{noFB}}_{B_i|B_{i-1}}(\cdot|b_{i-1})}(b_i)\Big)\, P^{\pi^{noFB}}_{A_i,B_i}(da_i,db_i) \qquad \text{(II.49)}$$
$$\equiv \sup_{\mathcal{P}^{noFB}_{[0,n]}(\kappa)} I^{noFB}_{A^n\to B^n}\big(\pi^{noFB}_i, P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\big). \qquad \text{(II.50)}$$
Then the information capacity without feedback subject to a transmission cost constraint is defined by
$$C^{noFB}_{A^\infty;B^\infty}(\kappa) \triangleq \lim_{n\to\infty}\frac{1}{n+1}\sup_{\mathcal{P}^{noFB}_{[0,n]}(\kappa)} I^{noFB}_{A^n\to B^n}\big(\pi^{noFB}_i, P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\big). \qquad \text{(II.51)}$$
Next, we note the following. Let $\{\pi^*_i(da_i|b_{i-1}) : i = 0,\ldots,n\} \in \mathcal{P}^{FB}_{[0,n]}(\kappa)$ denote the maximizing distribution in $C^{FB}_{A^n\to B^n}(\kappa)$ defined by (II.35). Suppose there exists a sequence of channel input distributions without feedback $\{P^*(da_i|\mathcal{I}^{noFB}_i) \equiv \pi^{*,noFB}_i(da_i|\mathcal{I}^{noFB}_i) : \mathcal{I}^{noFB}_i \subseteq \{b_{-1},a_0,\ldots,a_{i-1}\},\ i = 0,\ldots,n\} \in \mathcal{P}^{noFB}_{[0,n]}(\kappa)$ which induces the maximizing channel input distribution with feedback $\{\pi^*_i(da_i|b_{i-1}) : i = 0,1,\ldots,n\}$. That is, $P^{\pi^{noFB}}_{A_i|B_{i-1}}(da_i|b_{i-1})$ obtained from (II.47) is equal to $\pi^*_i(da_i|b_{i-1})$, $\forall i = 0,1,\ldots,n$. Then it is clear that this sequence also induces the optimal joint distribution and conditional distribution defined by (II.38), (II.39), and consequently $C^{FB}_{A^n\to B^n}(\kappa)$ and $C^{FB}_{A^\infty\to B^\infty}(\kappa)$ are achieved without using feedback. In the following theorem, we prove that this condition is not only sufficient but also necessary for any channel input distribution without feedback to achieve the finite time feedback information capacity $C^{FB}_{A^n\to B^n}(\kappa)$.
Theorem II.2. (Necessary and sufficient conditions for $C^{FB}_{A^n\to B^n}(\kappa) = C^{noFB}_{A^n;B^n}(\kappa)$) Consider the channel (II.15) and let $\{\pi^*_i(da_i|b_{i-1}) : i = 0,\ldots,n\} \in \mathcal{P}^{FB}_{[0,n]}(\kappa)$ denote the maximizing distribution in $C^{FB}_{A^n\to B^n}(\kappa)$ defined by (II.35), and let $\{P^{\pi^*}_{A_i,B_i}(da_i,db_i), P^{\pi^*}_{B_i|B_{i-1}}(db_i|b_{i-1}) : i = 0,\ldots,n\}$ denote the corresponding joint and transition distributions as defined by (II.38), (II.39). Then
$$C^{FB}_{A^n\to B^n}(\kappa) = C^{noFB}_{A^n;B^n}(\kappa) \qquad \text{(II.52)}$$
if and only if there exists a sequence of channel input distributions $\{P^*(da_i|\mathcal{I}^{noFB}_i) \equiv \pi^{noFB,*}_i(da_i|\mathcal{I}^{noFB}_i) : \mathcal{I}^{noFB}_i \subseteq \{b_{-1},a_0,\ldots,a_{i-1}\},\ i = 0,\ldots,n\} \in \mathcal{P}^{noFB}_{[0,n]}(\kappa)$ which induces the maximizing channel input distribution with feedback $\{\pi^*_i(da_i|b_{i-1}) : i = 0,1,\ldots,n\}$.
Proof: In general, the inequality $C^{FB}_{A^n\to B^n}(\kappa) \ge C^{noFB}_{A^n;B^n}(\kappa)$ holds. Moreover, by Section II-B the distributions $\{P^{\pi^*}_{A_i,B_i}(da_i,db_i), P^{\pi^*}_{B_i|B_{i-1}}(db_i|b_{i-1}) : i = 0,\ldots,n\}$ are induced by the channel, which is fixed, and the optimal conditional distribution $\{\pi^*_i(da_i|b_{i-1}) : i = 0,\ldots,n\} \in \mathcal{P}^{FB}_{[0,n]}(\kappa)$. Then equality holds if and only if there exists a distribution without feedback $\{\pi^{noFB,*}_i(da_i|\mathcal{I}^{noFB}_i) : i = 0,\ldots,n\} \in \mathcal{P}^{noFB}_{[0,n]}(\kappa)$ which induces $\{\pi^*_i(da_i|b_{i-1}) : i = 0,\ldots,n\} \in \mathcal{P}^{FB}_{[0,n]}(\kappa)$. This follows from the fact that the distributions $\{P^{\pi^*}_{A_i,B_i}(da_i,db_i), P^{\pi^*}_{B_i|B_{i-1}}(db_i|b_{i-1}) : i = 0,\ldots,n\}$ are induced by the feedback distribution $\{\pi^*_i(da_i|b_{i-1}) : i = 0,\ldots,n\}$ and the channel distribution. This completes the proof.
Theorem II.2 provides a sufficient condition for feedback not to increase capacity, i.e., $C^{FB}_{A^\infty\to B^\infty}(\kappa) = C^{noFB}_{A^\infty;B^\infty}(\kappa)$, since if (II.52) holds, then $\lim_{n\to\infty}\frac{1}{n+1}C^{FB}_{A^n\to B^n}(\kappa) = \lim_{n\to\infty}\frac{1}{n+1}C^{noFB}_{A^n;B^n}(\kappa)$. In Section IV-B we demonstrate an application of Theorem II.2 to a specific channel with memory, where we show that an input distribution without feedback induces $\{\pi^*_i(da_i|b_{i-1}) : i = 0,\ldots,n\}$, hence feedback does not increase capacity.
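Theorem II.2 also suggests a direct numerical test: fix a candidate no-feedback input and check whether the conditional distribution it induces through (II.47)-(II.48) reproduces the feedback-optimal $\{\pi^*_i(da_i|b_{i-1})\}$. For a first-order Markov candidate over finite alphabets this only requires propagating the joint distribution of $(A_i,B_{i-1})$ forward. A sketch (ours, not from the paper; pi_star and pi_markov are placeholders for the closed form distributions derived in Section IV for the BSSC):

```python
import numpy as np

def induced_gap(P, pi_star, pi_markov, mu, n=20):
    """Test the condition of Theorem II.2 for a first-order Markov
    no-feedback input: propagate J[a, s] = P(A_i = a, B_{i-1} = s) forward
    (cf. (II.45)-(II.48)) and report max_i |P(a_i|b_{i-1}) - pi_star|.
    P[s, a, b]: UMCO kernel; pi_star[s, a]: feedback-optimal pi*(a|s);
    pi_markov[a, a']: candidate P(A_{i+1} = a' | A_i = a); mu[s]: P(B_{-1}=s).
    Assumes every state s keeps positive probability along the way."""
    # A_0 may depend on b_{-1} (Definition II.5), so take A_0 ~ pi*(.|b_{-1}).
    J = np.asarray(mu, float)[None, :] * pi_star.T
    gap = 0.0
    for _ in range(n):
        cond = J / J.sum(axis=0, keepdims=True)   # induced P(a_i | b_{i-1})
        gap = max(gap, float(np.max(np.abs(cond - pi_star.T))))
        JB = np.einsum('as,sab->ab', J, P)        # P(A_i = a, B_i = b)
        J = np.einsum('aA,ab->Ab', pi_markov, JB) # P(A_{i+1} = a', B_i = b)
    return gap   # ~0 iff the Markov input induces pi* at every stage
```

A returned gap near zero certifies, numerically, the necessary and sufficient condition of Theorem II.2 for the chosen kernels; Section IV exhibits a specific Markov kernel achieving this for the BSSC.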
YNAMIC P ROGRAMMING AND N ECESSARY S UFFICIENT C ONDITIONS FOR N ON - NESTED O PTIMIZATION
In this section we employ the structural properties of capacity achieving channel input distributionswith feedback to derive dynamic programming recursions and necessary and sufficient conditions for thesingle letter characterization (I.6), to hold. Specifically, we provide the following results for the UMCOchannel.(a) Necessary and sufficient conditions to determine when dynamic programming recursions, whichare nested optimization problems, reduce to non-nested optimization problems.(b) Repeat (a) for the per unit time infinite horizon.(c) Upper bounds on the probability of maximum likelihood decoding.The time-varying UMCO channel is defined by P i ( b i | b i − , a i ) (cid:52) = P i ( b i | b i − , a i ) , i = , . . . , n (III.53)and the transmission cost constraint is defined by1 n + E (cid:40) n ∑ i = γ UMi ( A i , B i − ) (cid:41) ≤ κ (III.54)where γ UMi : A i × B i − (cid:55)−→ [ , ∞ ) . At i = { b − , a } , where b − ∈ B − is the initial data which are either known to the encoder and the decoder or, b − = { /0 } .For simplicity, of presentation and technical assumptions needed, we consider a channel model withtransmission cost function, defined on finite alphabet spaces. However, all main results extend to abstractalphabet spaces and channel distributions, which depend on finite memory on past channel output.Moreover, our analysis and the corresponding theorems can be extended to channels with finite memoryon the previous channel outputs by exploiting the structural form of the capacity achieving distributionsgiven in [9], [10].For the above model, it is shown in [9], [10] that maximizing directed information, I ( A n → B n ) , over P FB [ , n ] or P FB [ , n ] ( κ ) occurs in the subset of conditional distributions that satisfy the following conditionalindependence. P i ( a i | a i − , b i − ) = P i ( a i | b i − ) ≡ π i ( a i | b i − ) , i = , , . . . , n . (III.55)Consequently, we have the following Markovian properties. P i ( a i , b i | a i − , b i − ) = P π i ( a i , b i | a i − , b i − ) , i = , , . . . , n , (III.56) P i ( b i | b i − ) = P π i ( b i | b i − ) , i = , , . . . , n , (III.57) P π i ( b i | b i − ) = ∑ a i ∈ A i P i ( b i | b i − , a i ) π i ( a i | b i − ) , i = , , . . . , n . (III.58)where the superscript indicates the dependence on the channel input distribution (III.55). In view of theseMarkov properties, the characterization of the FTFI capacity (i.e., (II.26)) is given by C FB , UMCOA n → B n (cid:52) = sup ◦ P FB [ , n ] E πµ (cid:110) n ∑ i = log (cid:16) P i ( B i | B i − , A i ) P π i ( B i | B i − ) (cid:17)(cid:111) (III.59) = sup ◦ P FB [ , n ] n ∑ i = I ( A i ; B i | B i − ) (III.60)where ◦ P FB [ , n ] (cid:52) = (cid:8) π i ( a i | b i − ) : i = , , . . . , n (cid:9) ⊂ P FB [ , n ] . (III.61) When clear from the context, the subscript notation of the distributions is omitted, i.e., P π B i | B i − ( b i | b i − ) ≡ P π i ( b i | b i − ) . Similarly, for conditional distributions with transmission cost the characterization of FTFI capacity isgiven by C FB , UMCOA n → B n ( κ ) (cid:52) = sup ◦ P FB [ , n ] ( κ ) E πµ (cid:110) n ∑ i = log (cid:16) P i ( B i | B i − , A i ) P π i ( B i | B i − ) (cid:17)(cid:111) (III.62) = sup ◦ P FB [ , n ] ( κ ) n ∑ i = I ( A i ; B i | B i − ) (III.63)where ◦ P FB [ , n ] ( κ ) (cid:52) = (cid:110) π i ( a i | b i − ) , i = , , . . . , n : 1 n + n ∑ i = E πµ (cid:16) γ UMi ( A i , B i − ) (cid:17) ≤ κ (cid:111) . (III.64)Since the joint process { B − , A , B , . . . 
Since the joint process $\{B_{-1},A_0,B_0,\ldots,A_n,B_n\}$ and the channel output process $\{B_{-1},B_0,\ldots,B_n\}$ are Markov, we explore the connection of the above optimization problems to Markov Decision theory to derive the results listed in (a)-(c). We do this in the next sections.

A. Necessary and Sufficient Conditions via Dynamic Programming: The Finite Horizon case
To derive the necessary and sufficient conditions for any channel input distribution to maximize directed information, i.e., item (a), we first apply dynamic programming on a finite horizon.
1) Without Transmission Cost Constraint:
The dynamic programming recursion for $C^{FB,UMCO}_{A^n\to B^n}$ is obtained as follows. Let $V_t(b_{t-1})$ represent the value function, that is, the maximum expected total reward on the future time horizon $\{t,t+1,\ldots,n\}$ given the output $B_{t-1} = b_{t-1}$ at time $t-1$, defined by
$$V_t(b_{t-1}) = \sup_{\pi_i(a_i|b_{i-1}) :\ i = t,t+1,\ldots,n} \mathbf{E}^{\pi}\Big\{\sum_{i=t}^{n}\log\Big(\frac{P_i(B_i|B_{i-1},A_i)}{P^{\pi}_i(B_i|B_{i-1})}\Big) \,\Big|\, B_{t-1} = b_{t-1}\Big\} \qquad \text{(III.65)}$$
where the transition probability of the channel output process is
$$P^{\pi}_t(b_t|b_{t-1}) = \sum_{a_t\in\mathcal{A}_t} P_t(b_t|b_{t-1},a_t)\,\pi_t(a_t|b_{t-1}). \qquad \text{(III.66)}$$
Then (III.65) satisfies the following dynamic programming recursions.
$$V_n(b_{n-1}) = \sup_{\pi_n(a_n|b_{n-1})} \sum_{(a_n,b_n)\in\mathcal{A}_n\times\mathcal{B}_n}\log\Big(\frac{P_n(b_n|b_{n-1},a_n)}{P^{\pi}_n(b_n|b_{n-1})}\Big) P_n(b_n|b_{n-1},a_n)\,\pi_n(a_n|b_{n-1}), \qquad \text{(III.67)}$$
$$V_t(b_{t-1}) = \sup_{\pi_t(a_t|b_{t-1})}\Big\{\sum_{a_t\in\mathcal{A}_t}\Big[\sum_{b_t\in\mathcal{B}_t}\log\Big(\frac{P_t(b_t|b_{t-1},a_t)}{P^{\pi}_t(b_t|b_{t-1})}\Big) P_t(b_t|b_{t-1},a_t) + \sum_{b_t\in\mathcal{B}_t} V_{t+1}(b_t)\, P_t(b_t|b_{t-1},a_t)\Big]\pi_t(a_t|b_{t-1})\Big\}, \quad t = 0,1,\ldots,n-1. \qquad \text{(III.68)}$$
For a fixed initial distribution $P_{B_{-1}}(b_{-1}) = \mu(b_{-1})$ we have
$$C^{FB,UMCO}_{A^n\to B^n} = \sum_{b_{-1}\in\mathcal{B}_{-1}} V_0(b_{-1})\,\mu(b_{-1}). \qquad \text{(III.69)}$$
By using the properties of relative entropy, we can show that the right hand side of the dynamic programming recursion (III.67) is a concave function of the input distribution $\pi_n(a_n|b_{n-1})$. Similarly, at each step of the recursion, the right hand side of the dynamic programming recursion (III.68) is a concave function of the input distribution $\pi_t(a_t|b_{t-1})$, since the future channel input distributions $\{\pi_{t+1}(a_{t+1}|b_t),\ldots,\pi_n(a_n|b_{n-1})\}$ are fixed to their optimal strategies. Utilizing this observation, we have the following necessary and sufficient conditions for any channel input distribution to maximize the right hand side of the dynamic programming recursions (III.67) and (III.68).

Theorem III.1. (Necessary and sufficient conditions) The necessary and sufficient conditions for any input distribution $\{\pi_t(a_t|b_{t-1}) : t = 0,1,\ldots,n\}$ to achieve the supremum in the dynamic programming recursions (III.67) and (III.68) are the following. For each $b_{n-1}\in\mathcal{B}_{n-1}$, there exists $V_n(b_{n-1})$ such that
$$V_n(b_{n-1}) = \sum_{b_n\in\mathcal{B}_n}\log\Big(\frac{P_n(b_n|a_n,b_{n-1})}{P^{\pi}_n(b_n|b_{n-1})}\Big) P_n(b_n|a_n,b_{n-1}), \quad \forall a_n\in\mathcal{A}_n \text{ if } \pi_n(a_n|b_{n-1}) \neq 0, \qquad \text{(III.70)}$$
$$V_n(b_{n-1}) \le \sum_{b_n\in\mathcal{B}_n}\log\Big(\frac{P_n(b_n|a_n,b_{n-1})}{P^{\pi}_n(b_n|b_{n-1})}\Big) P_n(b_n|a_n,b_{n-1}), \quad \forall a_n\in\mathcal{A}_n \text{ if } \pi_n(a_n|b_{n-1}) = 0 \qquad \text{(III.71)}$$
and for each $t = n-1, n-2, \ldots, 0$, there exists $V_t(b_{t-1})$ such that
$$V_t(b_{t-1}) = \sum_{b_t\in\mathcal{B}_t}\Big\{\log\Big(\frac{P_t(b_t|a_t,b_{t-1})}{P^{\pi}_t(b_t|b_{t-1})}\Big) + V_{t+1}(b_t)\Big\} P_t(b_t|a_t,b_{t-1}), \quad \forall a_t\in\mathcal{A}_t, \text{ if } \pi_t(a_t|b_{t-1}) \neq 0, \qquad \text{(III.72)}$$
$$V_t(b_{t-1}) \le \sum_{b_t\in\mathcal{B}_t}\Big\{\log\Big(\frac{P_t(b_t|a_t,b_{t-1})}{P^{\pi}_t(b_t|b_{t-1})}\Big) + V_{t+1}(b_t)\Big\} P_t(b_t|a_t,b_{t-1}), \quad \forall a_t\in\mathcal{A}_t, \text{ if } \pi_t(a_t|b_{t-1}) = 0. \qquad \text{(III.73)}$$
Moreover, $\{V_t(b_{t-1}) : (t,b_{t-1})\in\{0,\ldots,n\}\times\mathcal{B}_{t-1}\}$ is the value function defined by (III.65).
Proof: The derivation is given in [24].
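Theorem III.1 turns the nested problem (III.65) into a per-stage concave program that can be solved backwards in time. A small numerical sketch (ours, not from the paper; binary input assumed, grid resolution illustrative) of the recursions (III.67)-(III.68):

```python
import numpy as np

def backward_dp(P, n, grid_pts=501):
    """Backward recursions (III.67)-(III.68) for a binary-input UMCO
    channel, P[s, a, b] = P(b | b_prev = s, a).  Each stage objective is
    concave in pi_t(.|s), so a grid search over p0 = pi_t(0|s) is adequate
    for this illustration.  Returns V_0(.) and the optimal p0 per stage."""
    nB = P.shape[0]
    grid = np.linspace(0.0, 1.0, grid_pts)
    V_next = np.zeros(nB)                       # terminal condition V_{n+1} = 0
    pi_path = []
    for t in range(n, -1, -1):                  # t = n, n-1, ..., 0
        V, p_opt = np.empty(nB), np.empty(nB)
        for s in range(nB):
            W = P[s]                            # W[a, b]
            vals = []
            for p0 in grid:
                pi = np.array([p0, 1.0 - p0])
                q = pi @ W                      # P^pi(b | s), cf. (III.66)
                with np.errstate(divide="ignore", invalid="ignore"):
                    mi = np.nansum(pi[:, None] * W * np.log2(W / q))
                vals.append(mi + pi @ (W @ V_next))  # reward + cost-to-go
            k = int(np.argmax(vals))
            V[s], p_opt[s] = vals[k], grid[k]
        pi_path.append(p_opt)
        V_next = V
    return V_next, pi_path[::-1]                # V_0(.), [pi_0, ..., pi_n]
```

If the returned $V_0(\cdot)$ grows affinely in $n$ with a slope independent of the state, and the per-stage maximizers coincide across $t$, the recursion already exhibits the non-nested, time-invariant behavior formalized below.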
Before we proceed further, in the next remark we relate Theorem III.1 to the necessary and sufficient conditions for DMCs derived in [25].

Remark III.1. (Relation to necessary and sufficient conditions of DMCs)
(a) Suppose the channel is a time-varying DMC, i.e.,
$$P_t(b_t|b_{t-1},a_t) = P_t(b_t|a_t), \quad t = 0,\ldots,n. \qquad \text{(III.74)}$$
Since the optimal distribution of DMCs which maximizes the directed information $I(A^n\to B^n)$ is memoryless, i.e., $P_{A_t|A^{t-1},B^{t-1}}(a_t|a^{t-1},b^{t-1}) = P_{A_t}(a_t) \equiv \pi_t(a_t)$, $t = 0,\ldots,n$, then (III.66) reduces to $P^{\pi}_t(b_t) = \sum_{a_t\in\mathcal{A}_t} P_t(b_t|a_t)\pi_t(a_t)$, $t = 0,\ldots,n$. By replacing in (III.70)-(III.73) the quantities
$$P_t(b_t|b_{t-1},a_t) \longmapsto P_t(b_t|a_t), \qquad P^{\pi}_t(b_t|b_{t-1}) \longmapsto P^{\pi}_t(b_t) = \sum_{\alpha_t\in\mathcal{A}_t} P_t(b_t|\alpha_t)\,\pi_t(\alpha_t), \quad t = 0,\ldots,n \qquad \text{(III.75)}$$
we obtain
$$V_n(b_{n-1}) \equiv V_n = \sum_{b_n}\log\Big(\frac{P_n(b_n|a_n)}{P^{\pi}_n(b_n)}\Big) P_n(b_n|a_n), \quad \forall a_n\in\mathcal{A}_n \text{ if } \pi_n(a_n) \neq 0, \qquad \text{(III.76)}$$
$$V_n(b_{n-1}) \equiv V_n \le \sum_{b_n}\log\Big(\frac{P_n(b_n|a_n)}{P^{\pi}_n(b_n)}\Big) P_n(b_n|a_n), \quad \forall a_n\in\mathcal{A}_n \text{ if } \pi_n(a_n) = 0 \qquad \text{(III.77)}$$
where $V_n(b_{n-1}) = V_n$ is a constant, independent of $b_{n-1}$. Moreover, for each $t$, from (III.72) and (III.73) we obtain
$$V_t(b_{t-1}) \equiv V_t = \sum_{b_t}\log\Big(\frac{P_t(b_t|a_t)}{P^{\pi}_t(b_t)}\Big) P_t(b_t|a_t) + V_{t+1}, \quad \forall a_t\in\mathcal{A}_t \text{ if } \pi_t(a_t) \neq 0, \qquad \text{(III.78)}$$
$$V_t(b_{t-1}) \equiv V_t \le \sum_{b_t}\log\Big(\frac{P_t(b_t|a_t)}{P^{\pi}_t(b_t)}\Big) P_t(b_t|a_t) + V_{t+1}, \quad \forall a_t\in\mathcal{A}_t \text{ if } \pi_t(a_t) = 0 \qquad \text{(III.79)}$$
where $V_t(b_{t-1}) \equiv V_t$ is a constant, independent of $b_{t-1}$, for $t = n-1, n-2, \ldots, 1, 0$. Consequently, by evaluating $V_t(b_{t-1}) = V_t$ at $t = 0$, we obtain the following identities.
$$V_0 = \max_{\pi_t(a_t) :\ t = 0,\ldots,n} \mathbf{E}^{\pi}\Big\{\sum_{t=0}^{n}\log\Big(\frac{P_t(B_t|A_t)}{P^{\pi}_t(B_t)}\Big)\Big\} = \sum_{t=0}^{n}\max_{\pi_t(a_t)}\mathbf{E}^{\pi}\Big\{\log\Big(\frac{P_t(B_t|A_t)}{P^{\pi}_t(B_t)}\Big)\Big\}. \qquad \text{(III.80)}$$
As expected, (III.80) shows that under (III.74) the sequence of nested optimization problems reduces to a sequence of non-nested optimization problems.
(b) Suppose the channel is a time-invariant (homogeneous) DMC. In this case, $P_t(b_t|b_{t-1},a_t) = P(b_t|a_t)$, $t = 0,\ldots,n$, and the equations in (a) reduce to the single set of necessary and sufficient conditions obtained in [25], that is, letting $V = C \triangleq \max_{P_A} I(A;B)$,
$$V = \sum_{b}\log\Big(\frac{P(b|a)}{P^{\pi}(b)}\Big) P(b|a), \quad \forall a\in\mathcal{A} \text{ if } \pi(a) \neq 0, \qquad \text{(III.81)}$$
$$V \le \sum_{b}\log\Big(\frac{P(b|a)}{P^{\pi}(b)}\Big) P(b|a), \quad \forall a\in\mathcal{A} \text{ if } \pi(a) = 0. \qquad \text{(III.82)}$$

In view of Remark III.1, next we identify necessary and sufficient conditions for any optimal channel input conditional distribution which is a solution of the dynamic programming recursions to be time-invariant, and to exhibit a non-nested property reminiscent of that of DMCs. We derive such conditions based on the following definition.

Definition III.1. (Non-nested optimization) Given a channel distribution $\{P_t(b_t|b_{t-1},a_t) : t = 0,\ldots,n\}$, the optimization problem $C^{FB,UMCO}_{A^n\to B^n}$ defined by (III.69) is called
(a) non-nested if and only if the value function (III.65) satisfies the following non-nested identity,
$$V_t(b_{t-1}) = \sum_{i=t}^{n}\sup_{\pi_i(a_i|b_{i-1})}\mathbf{E}^{\pi}\Big\{\log\Big(\frac{P_i(B_i|B_{i-1},A_i)}{P^{\pi}_i(B_i|B_{i-1})}\Big)\,\Big|\, B_{t-1} = b_{t-1}\Big\} \qquad \text{(III.83)}$$
for all $(t,b_{t-1})\in\{0,1,\ldots,n\}\times\mathcal{B}_{t-1}$;
(b) non-nested and time-invariant if and only if the value function satisfies the following identity,
$$V_t(b_{t-1}) = (n-t+1)\sup_{\pi_t(a_t|b_{t-1})}\mathbf{E}^{\pi}\Big\{\log\Big(\frac{P_t(B_t|B_{t-1},A_t)}{P^{\pi}_t(B_t|B_{t-1})}\Big)\,\Big|\, B_{t-1} = b_{t-1}\Big\} \qquad \text{(III.84)}$$
for all $(t,b_{t-1})\in\{0,1,\ldots,n\}\times\mathcal{B}_{t-1}$.

Clearly, if we can identify conditions so that the optimization problem defined by (III.69) is non-nested, then by evaluating the value function (III.83) at time $t = 0$, $C^{FB,UMCO}_{A^n\to B^n}$ is obtained from a sequence of non-nested problems, each maximizing $I(A_i;B_i|B_{i-1} = b_{i-1})$ over $\pi_i(a_i|b_{i-1})$, for which $b_{i-1}$ is fixed. Moreover, if the optimization problem is non-nested and time-invariant, then by evaluating (III.84) at $t = 0$, we obtain
$$C^{FB,UMCO}_{A^n\to B^n} = (n+1)\sum_{b_{-1}\in\mathcal{B}_{-1}}\sup_{\pi_0(a_0|b_{-1})} I(A_0;B_0|B_{-1} = b_{-1})\,\mu(b_{-1}). \qquad \text{(III.85)}$$
Next, we state the main theorem, which generalizes the non-nested and time-invariant properties of memoryless channels given in Remark III.1 to channels with memory.

Theorem III.2. (Necessary and sufficient conditions for non-nested optimization)
(a) Consider any channel distribution $\{P_i(b_i|b_{i-1},a_i) : i = 0,\ldots,n\}$. The optimization problem $C^{FB,UMCO}_{A^n\to B^n}$ defined by (III.69) is non-nested, and the value function is characterized by (III.83), if and only if
$$\text{there exist constants } \{V_t : t = 0,\ldots,n\} \text{ such that } V_t(b_{t-1}) = V_t,\ \forall (t,b_{t-1})\in\{0,1,\ldots,n\}\times\mathcal{B}_{t-1}, \text{ which satisfy (III.70)-(III.73).} \qquad \text{(III.86)}$$
(b) Consider any time-invariant channel distribution $\{P(b_i|b_{i-1},a_i) : i = 0,\ldots,n\}$. The optimization problem $C^{FB,UMCO}_{A^n\to B^n}$ defined by (III.69) is non-nested and time-invariant, and the value function is characterized by
$$V_t(b_{t-1}) = V_t \triangleq (n-t+1)\sup_{\pi^{TI}(a_i|b_{i-1})}\mathbf{E}^{\pi}\Big\{\log\Big(\frac{P(B_i|B_{i-1},A_i)}{P^{\pi^{TI}}(B_i|B_{i-1})}\Big)\,\Big|\, B_{i-1} = b_{i-1}\Big\}, \quad \forall (t,b_{t-1})\in\{0,1,\ldots,n\}\times\mathcal{B}_{t-1} \qquad \text{(III.87)}$$
where $\{\pi_i(a_i|b_{i-1}) = \pi^{TI}(a_i|b_{i-1}) : i = 0,\ldots,n\}$ and $\{P^{\pi}_i(b_i|b_{i-1}) = P^{\pi^{TI}}(b_i|b_{i-1}) : i = 0,\ldots,n\}$ are time-invariant, if and only if
$$\text{there exists a constant } V_n \text{ such that } V_n(b_{n-1}) = V_n,\ \forall b_{n-1}\in\mathcal{B}_{n-1}, \text{ which satisfies (III.70), (III.71).} \qquad \text{(III.88)}$$
Proof: (a) Suppose (III.86) holds. Then, by Theorem III.1, for any $t$ the term $\sum_{b_t} V_{t+1}(b_t)\,P_t(b_t|b_{t-1},a_t) = V_{t+1}$ does not depend on $(a_t,b_{t-1})$, so the optimal strategy $\pi_t(a_t|b_{t-1})$ is not affected by the future strategies $\{\pi_i(a_i|b_{i-1}) : i = t+1,t+2,\ldots,n\}$, for all $t = 0,1,\ldots,n-1$. Hence, the optimization problem $C^{FB,UMCO}_{A^n\to B^n}$ is non-nested. Conversely, if (III.83) holds, since its left hand side is the value function defined by (III.65), then necessarily for each $t$ the value function is a constant, i.e., $\{V_i(b_{i-1}) = V_i : i = t+1,\ldots,n\}$, for $t = 0,1,\ldots,n-1$. In view of Theorem III.1, (III.86) then holds.
(b) This is a degenerate case of part (a). Suppose (III.88) holds and consider the necessary and sufficient conditions given in Theorem III.1 at time $t = n-1$. Since $V_n(b_{n-1}) = V_n$, $\forall b_{n-1}$, then by (III.72) (and similarly for (III.73)) we have $V_{n-1}(b_{n-2}) = \{\cdot\} + V_n$, $\forall a_{n-1}, \pi_{n-1}(a_{n-1}|b_{n-2})$, where the term $\{\cdot\}$ is the first right hand side term in (III.72). Since the channel is time-invariant, subtracting the term $V_n$ from both sides of equation (III.72) (i.e., corresponding to $t = n-1$) reduces the conditions at $t = n-1$ to (III.70), (III.71); hence the optimal distribution at $t = n-1$ coincides with the time-invariant solution at $t = n$, and $V_{n-1}(b_{n-2}) = 2V_n$, which verifies (III.84) at $t = n-1$. To complete the derivation we use induction, that is, we assume the validity of (III.84) for $t \in \{n, n-1, \ldots, i+1\}$ and show that it also holds for $t = i$. This is similar to the case $t = n-1$.
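Condition (III.88) can be tested numerically: solve the terminal-stage problem for each $b_{n-1}$, check the Kuhn-Tucker-type equalities (III.70)-(III.71) on and off the support of $\pi$, and then check that the resulting $V_n(b_{n-1})$ does not vary with $b_{n-1}$. A sketch (ours; tolerances illustrative), taking a candidate terminal distribution pi[s, a] as input:

```python
import numpy as np

def check_terminal(P, pi, tol=1e-9):
    """Verify (III.70)-(III.71) for pi[s, a] at the terminal stage, and the
    constancy of V_n(.) across states required by (III.88).
    D[a] = sum_b P(b|a,s) log2( P(b|a,s) / P^pi(b|s) )."""
    V = []
    for s in range(P.shape[0]):
        W, p = P[s], pi[s]
        q = p @ W
        with np.errstate(divide="ignore", invalid="ignore"):
            D = np.nansum(W * np.log2(W / q), axis=1)
        Vs = D[p > tol].max()
        on_support = np.all(np.abs(D[p > tol] - Vs) < 1e-6)   # (III.70)
        off_support = np.all(D[p <= tol] <= Vs + 1e-6)        # (III.71)
        V.append(Vs)
        print(f"state {s}: D = {D}, V_n = {Vs:.6f}, "
              f"KKT satisfied: {on_support and off_support}")
    print("V_n constant across states, (III.88):", np.ptp(V) < 1e-6)
```

When the printed values pass for a time-invariant channel kernel, Theorem III.2(b) guarantees that the entire finite horizon problem is non-nested and time-invariant, with $V_t = (n-t+1)V_n$.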
2) With Transmission Cost Constraints:
All statements of the previous section generalize to $C^{FB,UMCO}_{A^n\to B^n}(\kappa)$ defined by (III.63), where the transmission cost constraint is given by (III.64). In view of the convexity of the optimization problem and the existence of an interior point of the constraint set $\overset{\circ}{\mathcal{P}}{}^{FB}_{[0,n]}(\kappa)$ (i.e., Slater's condition), by the Lagrange duality theorem [26] the constrained and unconstrained problems are equivalent, that is,
$$C^{FB,UMCO}_{A^n\to B^n}(\kappa) = \inf_{s\ge 0}\ \sup_{\pi_i(a_i|b_{i-1}) :\ i = 0,1,\ldots,n} \mathbf{E}^{\pi}_{\mu}\Big\{\sum_{i=0}^{n}\Big[\log\Big(\frac{P_i(B_i|B_{i-1},A_i)}{P^{\pi}_i(B_i|B_{i-1})}\Big) - s\Big(\gamma^{UM}(A_i,B_{i-1}) - (n+1)\,\kappa\Big)\Big]\Big\} \qquad \text{(III.89)}$$
where $s\in[0,\infty)$ is the Lagrange multiplier associated with the constraint.
The dynamic programming recursions are obtained as follows. Let $V^s_t(b_{t-1})$ represent the value function on the future time horizon $\{t,t+1,\ldots,n\}$ given the output $B_{t-1} = b_{t-1}$ at time $t-1$, defined by
$$V^s_t(b_{t-1}) = \sup_{\pi_i(a_i|b_{i-1}) :\ i = t,t+1,\ldots,n} \mathbf{E}^{\pi}\Big\{\sum_{i=t}^{n}\Big[\log\Big(\frac{P_i(B_i|B_{i-1},A_i)}{P^{\pi}_i(B_i|B_{i-1})}\Big) - s\,\gamma^{UM}(A_i,B_{i-1})\Big] \,\Big|\, B_{t-1} = b_{t-1}\Big\}. \qquad \text{(III.90)}$$
The corresponding dynamic programming recursions are the following.
$$V^s_n(b_{n-1}) = \sup_{\pi_n(a_n|b_{n-1})}\Big\{\sum_{a_n\in\mathcal{A}_n}\Big[\sum_{b_n\in\mathcal{B}_n}\log\Big(\frac{P_n(b_n|b_{n-1},a_n)}{P^{\pi}_n(b_n|b_{n-1})}\Big) P_n(b_n|b_{n-1},a_n) - s\,\gamma^{UM}_n(a_n,b_{n-1})\Big]\pi_n(a_n|b_{n-1})\Big\} \qquad \text{(III.91)}$$
$$V^s_t(b_{t-1}) = \sup_{\pi_t(a_t|b_{t-1})}\Big\{\sum_{a_t\in\mathcal{A}_t}\Big[\sum_{b_t\in\mathcal{B}_t}\Big(\log\Big(\frac{P_t(b_t|b_{t-1},a_t)}{P^{\pi}_t(b_t|b_{t-1})}\Big) + V^s_{t+1}(b_t)\Big) P_t(b_t|b_{t-1},a_t) - s\,\gamma^{UM}_t(a_t,b_{t-1})\Big]\pi_t(a_t|b_{t-1})\Big\}. \qquad \text{(III.92)}$$
Moreover, for a fixed initial distribution $P_{B_{-1}}(b_{-1}) = \mu(b_{-1})$,
$$C^{FB,UMCO}_{A^n\to B^n}(\kappa) = \inf_{s\ge 0}\Big\{\sum_{b_{-1}} V^s_0(b_{-1})\,\mu(b_{-1}) + s\,(n+1)\,\kappa\Big\}. \qquad \text{(III.93)}$$
The analogues of Theorem III.1 and Theorem III.2 are stated as a corollary.

Corollary III.1. (Necessary and sufficient conditions)
(a) The necessary and sufficient conditions for any input distribution $\{\pi_t(a_t|b_{t-1}) : t = 0,1,\ldots,n\}$ to achieve the supremum in the dynamic programming recursions (III.91) and (III.92) are the following. For each $b_{n-1}\in\mathcal{B}_{n-1}$, there exists $V^s_n(b_{n-1})$ such that
$$V^s_n(b_{n-1}) = \sum_{b_n\in\mathcal{B}_n}\log\Big(\frac{P_n(b_n|a_n,b_{n-1})}{P^{\pi}_n(b_n|b_{n-1})}\Big) P_n(b_n|a_n,b_{n-1}) - s\,\gamma^{UM}_n(a_n,b_{n-1}), \quad \forall a_n\in\mathcal{A}_n \text{ if } \pi_n(a_n|b_{n-1}) \neq 0, \qquad \text{(III.94)}$$
$$V^s_n(b_{n-1}) \le \sum_{b_n\in\mathcal{B}_n}\log\Big(\frac{P_n(b_n|a_n,b_{n-1})}{P^{\pi}_n(b_n|b_{n-1})}\Big) P_n(b_n|a_n,b_{n-1}) - s\,\gamma^{UM}_n(a_n,b_{n-1}), \quad \forall a_n\in\mathcal{A}_n \text{ if } \pi_n(a_n|b_{n-1}) = 0 \qquad \text{(III.95)}$$
and for each $t = n-1,\ldots,1,0$, there exists $V^s_t(b_{t-1})$ such that
$$V^s_t(b_{t-1}) = \sum_{b_t\in\mathcal{B}_t}\Big\{\log\Big(\frac{P_t(b_t|a_t,b_{t-1})}{P^{\pi}_t(b_t|b_{t-1})}\Big) + V^s_{t+1}(b_t)\Big\} P_t(b_t|a_t,b_{t-1}) - s\,\gamma^{UM}_t(a_t,b_{t-1}), \quad \forall a_t\in\mathcal{A}_t, \text{ if } \pi_t(a_t|b_{t-1}) \neq 0, \qquad \text{(III.96)}$$
$$V^s_t(b_{t-1}) \le \sum_{b_t\in\mathcal{B}_t}\Big\{\log\Big(\frac{P_t(b_t|a_t,b_{t-1})}{P^{\pi}_t(b_t|b_{t-1})}\Big) + V^s_{t+1}(b_t)\Big\} P_t(b_t|a_t,b_{t-1}) - s\,\gamma^{UM}_t(a_t,b_{t-1}), \quad \forall a_t\in\mathcal{A}_t, \text{ if } \pi_t(a_t|b_{t-1}) = 0. \qquad \text{(III.97)}$$
Moreover, $\{V^s_t(b_{t-1}) : (t,b_{t-1})\in\{0,\ldots,n\}\times\mathcal{B}_{t-1}\}$ is the value function defined by (III.90).
(b) The optimization problem $C^{FB,UMCO}_{A^n\to B^n}(\kappa)$ is non-nested, and the value function is characterized by
$$V^s_t(b_{t-1}) = \sum_{i=t}^{n}\sup_{\pi_i(a_i|b_{i-1})}\mathbf{E}^{\pi}\Big\{\log\Big(\frac{P_i(B_i|B_{i-1},A_i)}{P^{\pi}_i(B_i|B_{i-1})}\Big) - s\,\gamma^{UM}(A_i,B_{i-1}) \,\Big|\, B_{i-1} = b_{i-1}\Big\} \qquad \text{(III.98)}$$
for all $(t,b_{t-1})\in\{0,1,\ldots,n\}\times\mathcal{B}_{t-1}$, if and only if
$$\text{there exist constants } \{V^s_t : t = 0,\ldots,n\} \text{ such that } V^s_t(b_{t-1}) = V^s_t,\ \forall (t,b_{t-1})\in\{0,1,\ldots,n\}\times\mathcal{B}_{t-1}, \text{ which satisfy (III.94)-(III.97).} \qquad \text{(III.99)}$$
(c) If the channel distribution is time-invariant, $\{P(b_i|b_{i-1},a_i) : i = 0,\ldots,n\}$, and $\gamma^{UM}_i(\cdot,\cdot) = \gamma^{UM}(\cdot,\cdot)$, $i = 0,1,\ldots,n$, then the optimization problem $C^{FB,UMCO}_{A^n\to B^n}(\kappa)$ is non-nested and time-invariant, and the value function is characterized by
$$V^s_t(b_{t-1}) \equiv V^s_t = (n-t+1)\sup_{\pi^{TI}(a_i|b_{i-1})}\mathbf{E}^{\pi^{TI}}\Big\{\log\Big(\frac{P(B_i|B_{i-1},A_i)}{P^{\pi^{TI}}(B_i|B_{i-1})}\Big) - s\,\gamma^{UM}(A_i,B_{i-1}) \,\Big|\, B_{i-1} = b_{i-1}\Big\} \qquad \text{(III.100)}$$
where $\{\pi_i(a_i|b_{i-1}) = \pi^{TI}(a_i|b_{i-1}) : i = 0,\ldots,n\}$ and $\{P^{\pi}_i(b_i|b_{i-1}) = P^{\pi^{TI}}(b_i|b_{i-1}) : i = 0,\ldots,n\}$ are time-invariant, if and only if
$$\text{there exists a constant } V^s_n \text{ such that } V^s_n(b_{n-1}) = V^s_n,\ \forall b_{n-1}\in\mathcal{B}_{n-1}, \text{ which satisfies (III.94), (III.95).} \qquad \text{(III.101)}$$
Proof:
The derivation is precisely as in Theorem III.1.
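Numerically, Corollary III.1 is exercised exactly as in the unconstrained case: for each fixed multiplier $s \ge 0$ run the backward recursions (III.91)-(III.92) with the stage reward reduced by $s\,\gamma^{UM}$, then take the outer infimum in (III.93) over $s$ with a scalar minimizer. A sketch (ours, not from the paper; binary input and grid resolution illustrative; gamma[s, a] is the cost $\gamma^{UM}(a, b_{-}=s)$):

```python
import numpy as np

def backward_dp_lagrangian(P, gamma, s, n, grid_pts=501):
    """Recursions (III.91)-(III.92) for a fixed Lagrange multiplier s >= 0;
    returns V^s_0(.).  The outer problem (III.93) is then
    inf_{s >= 0} [ sum_b V^s_0(b) mu(b) + s (n+1) kappa ]."""
    nB = P.shape[0]
    grid = np.linspace(0.0, 1.0, grid_pts)
    V_next = np.zeros(nB)
    for _ in range(n + 1):                      # stages t = n down to 0
        V = np.empty(nB)
        for st in range(nB):
            W = P[st]
            best = -np.inf
            for p0 in grid:
                pi = np.array([p0, 1.0 - p0])
                q = pi @ W
                with np.errstate(divide="ignore", invalid="ignore"):
                    mi = np.nansum(pi[:, None] * W * np.log2(W / q))
                best = max(best, mi - s * (pi @ gamma[st]) + pi @ (W @ V_next))
            V[st] = best
        V_next = V
    return V_next
```

Reading the BSSC cost in (I.11) as $\gamma(a,b_{-}) = \overline{a\oplus b_{-}}$, the cost matrix would be gamma = np.eye(2), i.e., cost 1 whenever the input repeats the previous output. A scalar routine such as scipy.optimize.minimize_scalar can then handle the infimum over $s$.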
B. Necessary and Sufficient Conditions via Dynamic Programming: The Infinite Horizon case
In this section, we first identify sufficient conditions for the convergence of the per unit time limit of the characterization of the FTFI capacity, using the ergodic theory of Markov decision processes with randomized strategies and infinite horizon dynamic programming. Then we apply these to derive necessary and sufficient conditions for any channel input distribution to maximize the infinite horizon extremum problems $C^{FB,UMCO}_{A^\infty\to B^\infty}$ and $C^{FB,UMCO}_{A^\infty\to B^\infty}(\kappa)$.
For the material of this section we make the following assumption.

Assumptions III.1. (Time-invariant or homogeneous) The channel distribution and the transmission cost function are time-invariant, and the optimal strategies are restricted to time-invariant strategies, i.e.,
$$P_i(b_i|b_{i-1},a_i) = P(b_i|b_{i-1},a_i), \qquad \gamma^{UM}_i(a_i,b_{i-1}) \equiv \gamma^{UM}(a_i,b_{i-1}), \quad i = 0,\ldots,n, \qquad \text{(III.102)}$$
$$\pi_i(a_i|b_{i-1}) = \pi^{\infty}(a_i|b_{i-1}), \quad i = 0,\ldots,n \qquad \text{(III.103)}$$
and $\mathcal{A}_i = \mathcal{A}$, $\mathcal{B}_i = \mathcal{B}$, $i = 0,\ldots,n$. Moreover, the initial distribution $P_{B_{-1}} = \mu(b_{-1})$ is assumed fixed.

By invoking Assumptions III.1, we can introduce the corresponding extremum problem as follows. For a fixed initial distribution $\mu(db_{-1})\in\mathcal{M}(\mathcal{B})$, we define
$$J(\pi^{\infty},\mu) \triangleq \liminf_{n\to\infty}\frac{1}{n}\,\mathbf{E}^{\pi^{\infty}}_{\mu}\Big\{\sum_{i=0}^{n-1}\log\Big(\frac{P(B_i|B_{i-1},A_i)}{P^{\pi^{\infty}}(B_i|B_{i-1})}\Big)\Big\} \equiv \liminf_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} I(A_i;B_i|B_{i-1}). \qquad \text{(III.104)}$$
By taking the supremum over all channel input distributions [27] and by using the fact that the alphabet spaces are of finite cardinality, we have the following identity.
$$J(\pi^{\infty,*},\mu) \triangleq \sup_{\pi^{\infty}(a_i|b_{i-1}) :\ i = 0,1,\ldots} J(\pi^{\infty},\mu) \qquad \text{(III.105)}$$
$$= \liminf_{n\to\infty}\ \sup_{\pi^{\infty}(a_i|b_{i-1}) :\ i = 0,\ldots,n}\frac{1}{n}\,\mathbf{E}^{\pi^{\infty}}_{\mu}\Big\{\sum_{i=0}^{n-1}\log\Big(\frac{P(B_i|B_{i-1},A_i)}{P^{\pi^{\infty}}(B_i|B_{i-1})}\Big)\Big\} \equiv C^{FB,UMCO}_{A^\infty\to B^\infty}. \qquad \text{(III.106)}$$
For abstract alphabet spaces the exchange of $\liminf$ and $\sup$ requires strong conditions [27]. Clearly, the above quantity $J(\pi^{\infty,*},\mu)$ depends on the initial distribution $\mu(db_{-1})$. Similarly, for a fixed initial state $B_{-1} = b_{-1}$ we also have the identity
$$J(\pi^{\infty,*},b_{-1}) \triangleq \sup_{\pi^{\infty}(a_i|b_{i-1}) :\ i = 0,1,\ldots}\ \liminf_{n\to\infty}\frac{1}{n}\,\mathbf{E}^{\pi^{\infty}}_{b_{-1}}\Big\{\sum_{i=0}^{n-1}\log\Big(\frac{P(B_i|B_{i-1},A_i)}{P^{\pi^{\infty}}(B_i|B_{i-1})}\Big)\Big\} \qquad \text{(III.107)}$$
$$= \liminf_{n\to\infty}\ \sup_{\pi^{\infty}(a_i|b_{i-1}) :\ i = 0,\ldots,n}\frac{1}{n}\,\mathbf{E}^{\pi^{\infty}}_{b_{-1}}\Big\{\sum_{i=0}^{n-1}\log\Big(\frac{P(B_i|B_{i-1},A_i)}{P^{\pi^{\infty}}(B_i|B_{i-1})}\Big)\Big\} \qquad \text{(III.108)}$$
which depends on the initial state $b_{-1}$. Note that Assumptions III.1 do not imply that the joint distribution of the process $\{A_0,B_0,A_1,B_1,\ldots,A_n,B_n\}$ is stationary, or that the marginal distribution of the output process $\{B_i : i = 0,\ldots,n\}$ is stationary, because stationarity depends on the distribution of the initial state $B_{-1}$. However, they imply that the transition probabilities are time-invariant (i.e., homogeneous), hence $P^{\pi^{\infty}}_i(a_i,b_i|a_{i-1},b_{i-1}) \equiv P^{\pi^{\infty}}(a_i,b_i|a_{i-1},b_{i-1})$ and $P^{\pi^{\infty}}_i(b_i|b_{i-1}) \equiv P^{\pi^{\infty}}(b_i|b_{i-1})$, $i = 0,\ldots,n$.
Next, we develop the material without imposing transmission cost constraints, because extensions to problems with transmission cost are easily obtained by using the material of the previous section.
1) Sufficient Condition for Asymptotic Stationarity and Ergodicity from Finite-Time Dynamic Programming Recursions:
Consider the problem of maximizing the per unit time limiting version of $C^{FB,UMCO}_{A^n\to B^n}$, when the strategies are restricted to $\{\pi^\infty(a_i|b_{i-1}) : i = 0,\ldots,n\}$. From the previous section, the finite horizon value function satisfies the dynamic programming equation
\[
V_t(b_{t-1}) = \sup_{\pi^\infty(\cdot|b_{t-1})} \Big\{ \sum_{a_t} \Big\{ \sum_{b_t} \log\Big( \frac{P(b_t|b_{t-1},a_t)}{P^{\pi^\infty}(b_t|b_{t-1})} \Big) P(b_t|b_{t-1},a_t) + \sum_{b_t} V_{t+1}(b_t)\, P(b_t|b_{t-1},a_t) \Big\} \pi^\infty(a_t|b_{t-1}) \Big\}. \quad (III.109)
\]
Since $b_{t-1}$ is always fixed, we let $V_t(b_{t-1}) = V_t(b_{-1})$, $t = 0,\ldots,n-1$. Since the transition probabilities are time-invariant, we can define, for simplicity, the variables $\widetilde{V}_t(b_{-1}) = V_{n-t}(b_{-1})$, $t = 0,\ldots,n-1$. Then $\{\widetilde{V}_t(\cdot) : t = 0,\ldots,n\}$ satisfy the following equation.
\[
\widetilde{V}_t(b_{-1}) = \sup_{\pi^\infty(\cdot|b_{-1})} \Big\{ \sum_{a\in\mathcal{A}} \Big\{ \sum_{b\in\mathcal{B}} \log\Big( \frac{P(b|b_{-1},a)}{P^{\pi^\infty}(b|b_{-1})} \Big) P(b|b_{-1},a) + \sum_{b\in\mathcal{B}} \widetilde{V}_{t-1}(b)\, P(b|b_{-1},a) \Big\} \pi^\infty(a|b_{-1}) \Big\}, \quad t \in \{0,\ldots,n\}. \quad (III.110)
\]
Next, we introduce a sufficient condition to test whether the per unit time limit of the solution to the dynamic programming recursions exists and is independent of the initial state $B_{-1} = b_{-1} \in \mathcal{B}$.

Assumptions III.2. (Sufficient condition for convergence of dynamic programming recursions) Assume that there exist a $V : \mathcal{B} \mapsto \mathbb{R}$ and a $J^* \in \mathbb{R}$ such that for all $b_{-1} \in \mathcal{B}$
\[
\lim_{t\to\infty} \big( \widetilde{V}_t(b_{-1}) - t J^* \big) = V(b_{-1}). \quad (III.111)
\]

Clearly, if Assumptions III.2 hold, then $\lim_{t\to\infty} \frac{1}{t}\widetilde{V}_t(b_{-1}) = J^*$, $\forall b_{-1} \in \mathcal{B}$ (because for finite alphabet spaces, the dynamic programming operator maps bounded continuous functions to bounded continuous functions) [28]. This means the per unit time limit of the dynamic programming recursion is independent of the initial state $b_{-1}$, which then implies that the per unit time limit of $C^{FB,UMCO}_{A^n\to B^n}$ is independent of the choice of the initial distribution $\mu(b_{-1})$.

Remark III.2. (Test of asymptotic stationarity and ergodicity) Given any channel, we can verify that the optimal channel input distribution induces asymptotic stationarity and ergodicity of the corresponding joint process $\{(A_i,B_i) : i = 0,1,\ldots\}$ by solving the dynamic programming recursions analytically for finite $n$ via (III.109), and then identifying conditions on the channel parameters so that Assumptions III.2 hold.
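Remark III.2 suggests a numerical route: iterate (III.110) and check whether the per-step gains settle to a common constant, as required by (III.111). The following minimal Python sketch does this for binary alphabets; the BSSC of Section IV is plugged in purely as an illustrative kernel, and the grid search over $\pi(\cdot|b_{-1})$ is our crude stand-in for the exact inner supremum.

```python
import numpy as np

# Illustrative kernel: the BSSC(alpha, beta) of Section IV, stored as
# P[b, b_prev, a] = P(b | b_prev, a); P(b = a) equals alpha when a = b_prev
# (state 0) and beta otherwise (state 1), matching (IV.144).
alpha, beta = 0.9, 0.8
P = np.zeros((2, 2, 2))
for bp in (0, 1):
    for a in (0, 1):
        pk = alpha if a == bp else beta
        P[a, bp, a], P[1 - a, bp, a] = pk, 1.0 - pk

def bellman(V, grid=1001):
    """One step of (III.110); grid search replaces the exact sup over pi."""
    Vnew = np.zeros(2)
    for bp in (0, 1):
        best = -np.inf
        for p0 in np.linspace(1e-6, 1 - 1e-6, grid):
            pi = np.array([p0, 1 - p0])
            Pout = P[:, bp, :] @ pi                   # P^pi(b | b_prev)
            val = sum(pi[a] * P[b, bp, a] * (np.log2(P[b, bp, a] / Pout[b]) + V[b])
                      for a in (0, 1) for b in (0, 1))
            best = max(best, val)
        Vnew[bp] = best
    return Vnew

V = np.zeros(2)
for t in range(100):
    Vnew = bellman(V)
    gain, V = Vnew - V, Vnew

print("per-state gain (J* if equal):", gain)   # (III.111) => equal entries
print("V_t(0) - V_t(1):", V[0] - V[1])         # converges under Assumptions III.2
```

In view of Assumptions III.2, we have the following lemma.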
Lemma III.1.
If Assumptions III.2 hold and there exist a $\{\pi^{\infty,*}(a|b_{-1}) \in \mathcal{M}(\mathcal{A}) : b_{-1} \in \mathcal{B}\}$ and a corresponding pair $\{(V(b_{-1}), J^*) : b_{-1} \in \mathcal{B},\ J^* \in \mathbb{R}\}$ which solve
\[
J^* + V(b_{-1}) = \sup_{\pi^\infty(a|b_{-1})} \Big\{ \sum_{a} \Big\{ \sum_{b} \log\Big( \frac{P(b|b_{-1},a)}{P^{\pi^\infty}(b|b_{-1})} \Big) P(b|b_{-1},a) + \sum_{b} V(b)\, P(b|b_{-1},a) \Big\} \pi^\infty(a|b_{-1}) \Big\}, \quad (III.112)
\]
then feedback capacity is given by
\[
J^* = C^{FB,UMCO}_{A^\infty\to B^\infty} \triangleq \liminf_{n\to\infty} \frac{1}{n} \sup_{\pi^\infty(\cdot|b_{i-1})} \mathbf{E}^{\pi^\infty} \Big\{ \sum_{i=0}^{n-1} \log\Big( \frac{P(B_i|B_{i-1},A_i)}{P^{\pi^\infty}(B_i|B_{i-1})} \Big) \Big\}, \quad \forall \mu(db_{-1}) \in \mathcal{M}(\mathcal{B}), \quad (III.113)
\]
and moreover the value $C^{FB,UMCO}_{A^\infty\to B^\infty}$ does not depend on the choice of the initial distribution $\mu(db_{-1}) \in \mathcal{M}(\mathcal{B})$.

Proof: See Appendix A.

Thus, we have two different ways to determine sufficient conditions for $J^*$ to correspond to feedback capacity: one based on Remark III.2, and one based on Lemma III.1, i.e., by solving the infinite horizon dynamic programming equation (III.112).

Next, we state the necessary and sufficient conditions for any $\{\pi^\infty(a|b_{-1}) \in \mathcal{M}(\mathcal{A}) : b_{-1} \in \mathcal{B}\}$ to be a solution of the dynamic programming equation (III.112).

Theorem III.3. (Infinite horizon necessary and sufficient conditions) Suppose Assumptions III.2 hold and there exist a $\{\pi^{\infty,*}(a|b_{-1}) \in \mathcal{M}(\mathcal{A}) : b_{-1} \in \mathcal{B}\}$, a $J^* \in \mathbb{R}$, and a corresponding $\{V(b_{-1}) : b_{-1} \in \mathcal{B}\}$, which solve (III.112). The necessary and sufficient conditions for any input distribution $\{\pi^\infty(a|b_{-1}) \in \mathcal{M}(\mathcal{A}) : b_{-1} \in \mathcal{B}\}$ to achieve the supremum of the dynamic programming equation (III.112) are the following. There exist $\{V(b_{-1}) : b_{-1} \in \mathcal{B}\}$ such that
\[
J^* + V(b_{-1}) = \sum_{b} \Big( \log\Big( \frac{P(b|a,b_{-1})}{P^{\pi^\infty}(b|b_{-1})} \Big) + V(b) \Big) P(b|a,b_{-1}), \quad \forall a \in \mathcal{A} \ \text{if}\ \pi^\infty(a|b_{-1}) \neq 0, \quad (III.114)
\]
\[
J^* + V(b_{-1}) \leq \sum_{b} \Big( \log\Big( \frac{P(b|a,b_{-1})}{P^{\pi^\infty}(b|b_{-1})} \Big) + V(b) \Big) P(b|a,b_{-1}), \quad \forall a \in \mathcal{A} \ \text{if}\ \pi^\infty(a|b_{-1}) = 0. \quad (III.115)
\]
Moreover, $\{V(b_{-1}) : b_{-1} \in \mathcal{B}\}$ is the value function defined by (III.112).

Proof: Consider the dynamic programming equation (III.112) and repeat the necessary steps of the derivation of Theorem III.1. A more direct approach is to use the necessary and sufficient conditions of Theorem III.1, as follows: re-write the necessary and sufficient conditions (III.72), (III.73), as done in (A.186), using Assumptions III.2, to verify that (III.114), (III.115) are the resulting equations.
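Conditions (III.114)-(III.115) are straightforward to test numerically once a candidate triple $(\pi^\infty, V, J^*)$ is available. A minimal sketch, assuming finite alphabets and a kernel stored as $P[b, b_{-1}, a]$ (our layout convention):

```python
import numpy as np

def check_conditions(P, pi, V, J, tol=1e-9):
    """Test (III.114)-(III.115) for a candidate pi[a, b_prev], value function
    V[b] and constant J; P[b, b_prev, a] is the channel kernel."""
    nB, nA = P.shape[0], P.shape[2]
    Pout = np.einsum('bpa,ap->bp', P, pi)          # P^pi(b | b_prev)
    for bp in range(nB):
        for a in range(nA):
            rhs = sum(P[b, bp, a] * (np.log2(P[b, bp, a] / Pout[b, bp]) + V[b])
                      for b in range(nB) if P[b, bp, a] > 0)
            if pi[a, bp] > 0 and abs(J + V[bp] - rhs) > tol:
                return False                       # (III.114) violated on the support
            if pi[a, bp] == 0 and J + V[bp] > rhs + tol:
                return False                       # (III.115) violated off the support
    return True
```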
C. Sufficient Conditions for Asymptotic Stationarity and Ergodicity based on Irreducibility
In this section we give another set of assumptions, based on irreducibility of the channel output transition probability for each channel input conditional distribution. Define
\[
\ell(b_{i-1},a_i) \triangleq \mathbf{E}^{\pi^\infty} \Big\{ \log\Big( \frac{P(B_i|B_{i-1},A_i)}{P^{\pi^\infty}(B_i|B_{i-1})} \Big) \,\Big|\, B_{i-1} = b_{i-1},\ A_i = a_i \Big\} \quad (III.116)
\]
\[
\equiv \sum_{b_i\in\mathcal{B}} \log\Big( \frac{P(b_i|b_{i-1},a_i)}{P^{\pi^\infty}(b_i|b_{i-1})} \Big) P(b_i|b_{i-1},a_i), \quad (III.117)
\]
\[
\ell(b_{i-1},\pi^\infty(b_{i-1})) \triangleq \mathbf{E}^{\pi^\infty} \Big\{ \log\Big( \frac{P(B_i|B_{i-1},A_i)}{P^{\pi^\infty}(B_i|B_{i-1})} \Big) \,\Big|\, B_{i-1} = b_{i-1} \Big\} \equiv \sum_{a_i\in\mathcal{A}} \ell(b_{i-1},a_i)\, \pi^\infty(a_i|b_{i-1}). \quad (III.118)
\]
To apply standard results of Markov Decision (MD) theory from [27], [28], we introduce the following notation. Each element of the alphabet space $\mathcal{B}$ is identified by the vector $\mathcal{B} = \{b(1),\ldots,b(|\mathcal{B}|)\}$, where $|\mathcal{B}|$ is the cardinality of the set $\mathcal{B}$. Then we can identify any $V : \mathcal{B} \mapsto \mathbb{R}$ with a vector in $\mathbb{R}^{|\mathcal{B}|}$. Similarly, any channel input distribution is identified with
\[
\pi^\infty \triangleq \{\pi^\infty(b(1)), \pi^\infty(b(2)), \ldots, \pi^\infty(b(|\mathcal{B}|))\} \triangleq \{\pi^\infty(\cdot|b(i)) \in \mathcal{M}(\mathcal{A}) : b(i) \in \mathcal{B}\}. \quad (III.119)
\]
Next, we define the vector pay-off and the channel output transition probability matrix as follows.
\[
\ell(\pi^\infty) \triangleq \big( \ell(b(1),\pi^\infty(b(1)))\ \ldots\ \ell(b(|\mathcal{B}|),\pi^\infty(b(|\mathcal{B}|))) \big)^T \in \mathbb{R}^{|\mathcal{B}|}, \quad (III.120)
\]
\[
P(\pi^\infty) = \big\{ P^{\pi^\infty}(b_i|b_{i-1}) : (b_i,b_{i-1}) \in \mathcal{B}\times\mathcal{B} \big\} \in \mathbb{R}^{|\mathcal{B}|\times|\mathcal{B}|}. \quad (III.121)
\]
Let $\{\mu(b(i)) : i = 1,\ldots,|\mathcal{B}|\} \in \mathbb{R}^{|\mathcal{B}|}$ be defined by $\mu(b(i)) = P(B_{-1} = b(i))$, $i = 1,\ldots,|\mathcal{B}|$. Using the above notation, we have the following main theorem.

Theorem III.4. (Dynamic programming equation under irreducibility) Suppose Assumptions III.1 hold and, for each channel input distribution $\pi^\infty$, the transition probability matrix of the output process $P(\pi^\infty) \equiv \{P^{\pi^\infty}(b|b_{-1}) : (b,b_{-1}) \in \mathcal{B}\times\mathcal{B}\}$ is irreducible. Then for any channel input distribution $\pi^\infty$ the expression (III.104) is given by
\[
J(\pi^\infty,\mu) = \nu(\pi^\infty)^T \ell(\pi^\infty) \equiv J(\pi^\infty), \quad (III.122)
\]
i.e., it is independent of $\mu(\cdot)$, where $\nu(\pi^\infty)$ is the unique invariant probability distribution of the channel output process $\{B_0,B_1,\ldots\}$, which satisfies
\[
P(\pi^\infty)\, \nu(\pi^\infty) = \nu(\pi^\infty). \quad (III.123)
\]
If there exists a time-invariant Markov channel input distribution $\pi^\infty(\cdot|\cdot)$ such that $J(\pi^{\infty,*}) = \max_{\pi^\infty} J(\pi^\infty)$, then there exists a pair $(V(\pi^{\infty,*},\cdot), J(\pi^{\infty,*}))$, with $V(\pi^{\infty,*},\cdot) : \mathcal{B} \mapsto \mathbb{R}$ identified with a vector in $\mathbb{R}^{|\mathcal{B}|}$ and $J(\pi^{\infty,*}) \in \mathbb{R}$, that is a solution of the dynamic programming equation
\[
J(\pi^{\infty,*}) + V(\pi^{\infty,*},b_{-1}) = \sup_{\pi^\infty(\cdot|b_{-1})} \Big\{ \ell(b_{-1},\pi^\infty(b_{-1})) + \sum_{z\in\mathcal{B}} V(\pi^{\infty,*},z)\, P^{\pi^\infty}(z|b_{-1}) \Big\}. \quad (III.124)
\]
Moreover, $J^* \equiv J(\pi^{\infty,*}) = C^{FB,UMCO}_{A^\infty\to B^\infty}$ satisfies (III.113) and corresponds to feedback capacity.

Proof: This is shown in Appendix B.
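Under the irreducibility hypothesis, (III.122)-(III.123) reduce the evaluation of $J(\pi^\infty)$ to linear algebra. A minimal sketch, assuming the column-stochastic convention $P^{\pi}[b, b_{-1}]$ and a user-supplied pay-off vector $\ell(\pi^\infty)$ of (III.120):

```python
import numpy as np

def invariant_distribution(Ppi):
    """Unique invariant distribution of an irreducible column-stochastic
    matrix Ppi[b, b_prev], i.e. the solution of (III.123)."""
    n = Ppi.shape[0]
    A = np.vstack([Ppi - np.eye(n), np.ones((1, n))])   # P nu = nu and sum(nu) = 1
    rhs = np.concatenate([np.zeros(n), [1.0]])
    nu, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return nu

def J_of_pi(Ppi, ell):
    """(III.122): J(pi) = nu(pi)^T l(pi), independent of mu."""
    return invariant_distribution(Ppi) @ ell
```

We make the following comments.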
Remark III.3. (Comments on Theorem III.4)
(a) Theorem III.4 gives sufficient conditions, in terms of irreducibility of the channel output transition probability matrix $P(\pi^\infty)$, to test whether the per unit time limit of the FTFI capacity corresponds to feedback capacity. Unfortunately, it is not possible to know, prior to solving the dynamic programming equation (III.124), whether the irreducibility condition holds, because the transition probability $P(\pi^\infty)$ is a functional of the optimal channel input distribution. A similar issue occurs in the analysis provided by Chen and Berger [15, Lemma 2, Theorem 3]. In view of this technicality, it is more appropriate to apply the necessary and sufficient conditions of Theorem III.1 to determine the optimal channel input distribution and the corresponding characterization of FTFI capacity, and then follow the suggestion given under Remark III.2.
(b) The solution $V$ obtained from (III.124) is unique up to an additive constant, and if $\pi^*(\cdot|b_{-1})$ attains the maximum in (III.124) for every $b_{-1}$, then $\pi^*(\cdot|\cdot)$ is an optimal channel input distribution, and the maximum cost is $J^*$.
(c) In specific application examples it may happen that the optimal channel input probability distribution $\pi^{\infty,*}(\cdot|\cdot)$ induces a transition probability matrix $P(\pi^{\infty,*})$ which is reducible, i.e., not irreducible. For completeness, this specific case is addressed in Remark III.4.

Next, we provide an iterative algorithm to compute the optimal channel input distribution and the feedback capacity. In Section IV-D1, we illustrate how Algorithm 1 is implemented through an example.

Algorithm 1
1) Let $m = 0$ and pick an initial distribution $\pi^0$.
2) Solve the equation
\[
J(\pi^m)\, e + V(\pi^m) = \ell(\pi^m) + V^T(\pi^m)\, P(\pi^m), \quad e \triangleq (1,\ldots,1) \in \mathbb{R}^{|\mathcal{B}|}, \quad (III.125)
\]
for $J(\pi^m) \in \mathbb{R}$ and $V(\pi^m) \in \mathbb{R}^{|\mathcal{B}|}$.
3) Let
\[
\pi^{m+1} = \arg\max_{\pi} \big\{ \ell(\pi) + V^T(\pi^m)\, P(\pi) \big\}. \quad (III.126)
\]
4) If $\pi^{m+1} = \pi^m$, let $\pi^* = \pi^m$; else let $m = m+1$ and return to step 2.
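A minimal Python sketch of Algorithm 1 for binary alphabets follows. The evaluation step solves (III.125) with the normalization $V(b(1)) = 0$; in the improvement step we use the evaluated $V(\pi^m)$ and replace the exact maximization of (III.126) by a grid search. The kernel layout $P[b, b_{-1}, a]$, the initial $\pi^0$, and the tolerances are our own choices.

```python
import numpy as np

def ell_state(P, pi_bp, bp):
    """l(b_prev, pi(b_prev)) of (III.118), in bits."""
    Pout = P[:, bp, :] @ pi_bp
    return sum(pi_bp[a] * P[b, bp, a] * np.log2(P[b, bp, a] / Pout[b])
               for a in range(P.shape[2]) for b in range(P.shape[0]) if P[b, bp, a] > 0)

def policy_evaluation(P, pi):
    """Step 2: solve J e + V = l(pi) + V^T P(pi) of (III.125), with V[0] = 0."""
    Ppi = np.einsum('bpa,ap->bp', P, pi)
    l = np.array([ell_state(P, pi[:, bp], bp) for bp in range(2)])
    # Per state j: J + V[j] = l[j] + sum_b V[b] Ppi[b, j]; unknowns (J, V[1]).
    A = np.array([[1.0, -Ppi[1, 0]],
                  [1.0, 1.0 - Ppi[1, 1]]])
    J, V1 = np.linalg.solve(A, l)
    return J, np.array([0.0, V1])

def policy_improvement(P, V, grid=2001):
    """Step 3: argmax of (III.126) per state, by grid search over pi(0|b_prev)."""
    pi = np.zeros((2, 2))
    for bp in range(2):
        best, best_p = -np.inf, 0.5
        for p0 in np.linspace(1e-6, 1 - 1e-6, grid):
            cand = np.array([p0, 1 - p0])
            val = ell_state(P, cand, bp) + V @ (P[:, bp, :] @ cand)
            if val > best:
                best, best_p = val, p0
        pi[:, bp] = (best_p, 1 - best_p)
    return pi

def algorithm1(P, iters=100, tol=1e-4):
    pi, J = np.full((2, 2), 0.5), 0.0             # step 1: initial pi^0
    for _ in range(iters):
        J, V = policy_evaluation(P, pi)           # step 2
        pi_new = policy_improvement(P, V)         # step 3
        if np.allclose(pi_new, pi, atol=tol):     # step 4: fixed point reached
            break
        pi = pi_new
    return pi, J                                  # J approximates feedback capacity
```

When the iterates converge and the induced $P(\pi^\infty)$ is irreducible, the returned gain $J$ approximates $C^{FB,UMCO}_{A^\infty\to B^\infty}$.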
Remark III.4. Theorem III.4 and Algorithm 1 presuppose that we know in advance that the transition probability matrix $P(\pi^\infty)$ of the channel output process, evaluated at the optimal strategy $\pi^{\infty,*}(\cdot|\cdot)$, is irreducible. If irreducibility does not hold, then the dynamic programming equation (III.124) may not be sufficient to give the optimal channel input distribution and the feedback capacity. In particular, if $P(\pi^\infty)$ is reducible, then (III.124) need not have a solution. To overcome this limitation, an additional equation is added to (III.124), giving the following pair of equations.
\[
J^*(b_{-1}) = \sup_{\pi^\infty(\cdot|b_{-1})} \Big\{ \sum_{b\in\mathcal{B}} J^*(b)\, P^{\pi^\infty}(b|b_{-1}) \Big\}, \quad (III.127)
\]
\[
J^*(b_{-1}) + V(b_{-1}) = \sup_{\pi^\infty(\cdot|b_{-1})} \Big\{ \sum_{a\in\mathcal{A}} \sum_{b\in\mathcal{B}} \Big\{ \log\Big( \frac{P(b|a,b_{-1})}{P^{\pi^\infty}(b|b_{-1})} \Big) + V(b) \Big\} P(b|a,b_{-1})\, \pi^\infty(a|b_{-1}) \Big\}. \quad (III.128)
\]
We refer to the pair (III.127) and (III.128) as the generalized dynamic programming equations. The proposed pair of dynamic programming equations completely characterizes feedback capacity.

D. Error exponents for the UMCO channel with feedback
In this section, we provide bounds on the probability of error of maximum likelihood decoding, by utilizing the results in [25] and [29]. However, we go one step further and show how to compute this bound, taking advantage of the structure of the capacity achieving distribution.

Consider the channel $\{P_i(b_i|b_{i-1},a_i) : i = 0,\ldots,n\}$, where $\mathcal{B}_i = \mathcal{B}$, $\mathcal{A}_i = \mathcal{A}$, $i = 0,\ldots,n$. Let $P^{(n)}_{e,m}(b_{-1})$ denote the probability of error for an arbitrary message $m \in \mathcal{M}_n \triangleq \{1,2,\ldots,M_n = \lfloor 2^{nR} \rfloor\}$, given the initial state $b_{-1} \in \mathcal{B}$. From [29], there exists a feedback code for which the probability of error is bounded above as follows.
\[
P^{(n)}_{e,m}(b_{-1}) \leq |\mathcal{B}|\, 2^{-n[-\rho R + F_n(\rho)]}, \quad \forall m \in \mathcal{M}_n,\ b_{-1} \in \mathcal{B},\ 0 \leq \rho \leq 1, \quad (III.129)
\]
\[
F_n(\rho) \triangleq -\frac{\rho \log|\mathcal{B}|}{n} + \max_{P_i(a_i|a^{i-1},b^{i-1}):\, i = 0,1,\ldots,n} \Big[ \min_{b_{-1}\in\mathcal{B}} E_{P,n}(\rho,b_{-1}) \Big], \quad (III.130)
\]
\[
E_{P,n}(\rho,b_{-1}) = -\frac{1}{n} \log \sum_{(b_0,\ldots,b_{n-1})} \Big[ \sum_{(a_0,\ldots,a_{n-1})} \prod_{i=0}^{n-1} P_i(a_i|a^{i-1},b^{i-1})\, P_i(b_i|a_i,b_{i-1})^{\frac{1}{1+\rho}} \Big]^{1+\rho}. \quad (III.131)
\]
(If the initial state is known both to the encoder and the decoder, then the cardinality of the state alphabet, $|\mathcal{B}|$, in (III.129) and (III.130) is removed [25, Problem 5.37].) However, by restricting the channel input distribution in (III.130), (III.131) to the set $\{\pi_i(da_i|b_{i-1}) : i = 0,\ldots,n\} \in \overset{\circ}{\mathcal{P}}^{FB}_{[0,n]}$, the following upper bound is obtained.
\[
P^{(n)}_{e,m}(b_{-1}) \leq |\mathcal{B}|\, 2^{-n[-\rho R + F_n(\rho)]}, \quad \forall m \in \mathcal{M}_n,\ b_{-1} \in \mathcal{B},\ 0 \leq \rho \leq 1, \quad (III.132)
\]
\[
F_n(\rho) \triangleq -\frac{\rho \log|\mathcal{B}|}{n} + \min_{b_{-1}\in\mathcal{B}} E_{\pi,n}(\rho,b_{-1}), \quad \forall \{\pi_i(da_i|b_{i-1}) : i = 0,\ldots,n\} \in \overset{\circ}{\mathcal{P}}^{FB}_{[0,n]}, \quad (III.133)
\]
\[
E_{\pi,n}(\rho,b_{-1}) = -\frac{1}{n} \log \sum_{(b_0,\ldots,b_{n-1})} \Big[ \sum_{(a_0,\ldots,a_{n-1})} \prod_{i=0}^{n-1} \pi_i(a_i|b_{i-1})\, P_i(b_i|a_i,b_{i-1})^{\frac{1}{1+\rho}} \Big]^{1+\rho}. \quad (III.134)
\]
Next, we derive simplified expressions for (III.132)-(III.134), in order to compute the bound on the probability of error. For the rest of the analysis, we view the memory of the channel on the previous output symbol as the state of the channel, defined by $s_{i-1} \triangleq b_{i-1}$, $i = 0,1,\ldots,n-1$. Then we transform the channel into an equivalent channel of the form $P_i(b_i|a_i,b_{i-1}) = P_i(b_i|a_i,s_{i-1})$, $i = 0,\ldots,n$. Since the state of the channel is known at the decoder, we apply the methodology used to derive Theorem 5.9.3 of [25] and (III.134), to obtain an upper bound on the probability of error which is computationally less intensive than (III.132), as follows. At each time $i$, the channel distribution is further transformed to
\[
P_i(b_i,s_i|a_i,s_{i-1}) = P_i(b_i|a_i,s_{i-1}) \ \text{if}\ s_i = b_i, \quad i = 0,\ldots,n. \quad (III.135)
\]
Substituting (III.135) into (III.134) gives the following equivalent expression.
\[
E_{\pi,n}(\rho,b_{-1}) = -\frac{1}{n} \log \sum_{(b_0,\ldots,b_{n-1})} \Big[ \sum_{(a_0,\ldots,a_{n-1})} \prod_{i=0}^{n-1} \pi_i(a_i|b_{i-1})\, P_i(b_i|a_i,s_{i-1})^{\frac{1}{1+\rho}} \Big]^{1+\rho} \quad (III.136)
\]
\[
= -\frac{1}{n} \log \sum_{(s_0,\ldots,s_{n-1})} \sum_{(b_0,\ldots,b_{n-1})} \Big[ \sum_{(a_0,\ldots,a_{n-1})} \prod_{i=0}^{n-1} \pi_i(a_i|s_{i-1})\, P_i(b_i,s_i|a_i,s_{i-1})^{\frac{1}{1+\rho}} \Big]^{1+\rho} \quad (III.137)
\]
\[
= -\frac{1}{n} \log \sum_{(s_0,\ldots,s_{n-1})} \prod_{i=0}^{n-1} \sum_{b_i} \Big[ \sum_{a_i} \pi_i(a_i|s_{i-1})\, P_i(b_i,s_i|a_i,s_{i-1})^{\frac{1}{1+\rho}} \Big]^{1+\rho}. \quad (III.138)
\]
Define the inner summations in (III.138) by
\[
\Lambda^{\pi}_i(s_i,s_{i-1}) \triangleq \sum_{b_i} \Big[ \sum_{a_i} \pi_i(a_i|s_{i-1})\, P_i(b_i,s_i|a_i,s_{i-1})^{\frac{1}{1+\rho}} \Big]^{1+\rho}, \quad i = 0,\ldots,n. \quad (III.139)
\]
Then, by substituting (III.139) into (III.138), we obtain
\[
E_{\pi,n}(\rho,b_{-1}) = -\frac{1}{n} \log \sum_{(s_0,\ldots,s_{n-1})} \prod_{i=0}^{n-1} \Lambda^{\pi}_i(s_i,s_{i-1}). \quad (III.140)
\]
Let $\{\Lambda^{\pi}_i(s_i,s_{i-1}) : (s_i,s_{i-1}) \in \mathcal{B}\times\mathcal{B}\}$ denote the matrix with elements $\Lambda^{\pi}_i(s_i,s_{i-1})$, $s_i = 1,\ldots,|\mathcal{B}|$, $s_{i-1} = 1,\ldots,|\mathcal{B}|$, that is,
\[
[\Lambda^{\pi}_i(s_i,s_{i-1})] \triangleq \begin{pmatrix} \Lambda^{\pi}_i(1,1) & \ldots & \Lambda^{\pi}_i(1,|\mathcal{B}|) \\ \vdots & \ddots & \vdots \\ \Lambda^{\pi}_i(|\mathcal{B}|,1) & \ldots & \Lambda^{\pi}_i(|\mathcal{B}|,|\mathcal{B}|) \end{pmatrix}, \quad i = 0,\ldots,n. \quad (III.141)
\]
In general, the computation of the error probability is difficult, in view of the time-varying properties of the channel distribution and the channel input distribution, which imply that the matrix $[\Lambda^{\pi}_i(s_i,s_{i-1})]$ is also time-varying. However, by following the derivation of equation (5.9.45) in [25], we obtain the following bound on the probability of error for the UMCO channel with feedback.

Theorem III.5. (Error probability bound for maximum likelihood decoding) Suppose the channel distribution is time-invariant, given by $\{P(b_i|b_{i-1},a_i) : i = 0,\ldots,n\}$, and the probability of error defined by (III.132)-(III.134) is evaluated at a time-invariant channel input distribution $\{\pi^{TI}(a_i|b_{i-1}) : i = 0,\ldots,n\}$. Then the following hold.
(i) The matrix $[\Lambda^{\pi}_i(s_i,s_{i-1})] = [\Lambda^{\pi^{TI}}(s_i,s_{i-1})]$, $i = 0,\ldots,n$, is time-invariant.
(ii) If the time-invariant matrix $[\Lambda^{\pi^{TI}}(s_i,s_{i-1})]$ is irreducible, then there exists a feedback code for which the probability of error is bounded above as follows.
\[
P^{(n)}_{e,m} \leq |\mathcal{B}|\, \frac{v_{\max}}{v_{\min}}\, 2^{-n\big[-\rho R - \log \lambda^{\pi^{TI}}_{\max}(\rho)\big]}, \quad \forall m \in \mathcal{M}_n,\ 0 \leq \rho \leq 1, \quad (III.142)
\]
where $\lambda^{\pi^{TI}}_{\max}(\rho)$ is the largest eigenvalue of the matrix $[\Lambda^{\pi^{TI}}(s_i,s_{i-1})]$, and $v_{\max}$ and $v_{\min}$ are the maximum and minimum components, respectively, of the positive eigenvector corresponding to the largest eigenvalue.

Proof: (i) The first statement follows directly from the assumptions and the fact that $E_{\pi,n}(\rho,b_{-1}) = E_{\pi^{TI},n}(\rho,b_{-1}) \triangleq -\frac{1}{n}\log \sum_{(s_0,\ldots,s_{n-1})} \prod_{i=0}^{n-1} \Lambda^{\pi^{TI}}(s_i,s_{i-1})$.
(ii) For an irreducible matrix $[\Lambda(s_i,s_{i-1})]$ with non-negative components, we can apply the Frobenius theorem to show that the following inequality holds [25].
\[
\Big| E_{\pi^{TI},n}(\rho,b_{-1}) + \log \lambda^{\pi^{TI}}_{\max}(\rho) \Big| \leq \frac{1}{n} \log \frac{v_{\max}}{v_{\min}}. \quad (III.143)
\]
The upper bound (III.142) follows from the last expression.

Note that the probability of error bound in Theorem III.5 is independent of the initial state $b_{-1} \in \mathcal{B}$. In Section IV-A3 we evaluate (III.142) of Theorem III.5 for a specific channel with memory.
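Theorem III.5 is also simple to evaluate numerically. The sketch below, assuming the kernel layout $P[b, b_{-1}, a]$ and a time-invariant $\pi[a, s_{-1}]$ (our conventions), builds $\Lambda^{\pi}$ of (III.139) and evaluates the bound (III.142) via its Perron root:

```python
import numpy as np

def error_bound(P, pi, rho, R, n):
    """Bound (III.142) for a time-invariant pi[a, s_prev]:
    build Lambda of (III.139), then use its Perron root and eigenvector ratio."""
    nB, nA = P.shape[0], P.shape[2]
    Lam = np.zeros((nB, nB))
    for sp in range(nB):                    # s_{i-1} = b_{i-1}
        for b in range(nB):                 # s_i = b_i, per (III.135)
            inner = sum(pi[a, sp] * P[b, sp, a] ** (1 / (1 + rho)) for a in range(nA))
            Lam[b, sp] = inner ** (1 + rho)
    w, Vec = np.linalg.eig(Lam)
    k = np.argmax(w.real)
    lam_max = w.real[k]
    v = np.abs(Vec[:, k].real)              # positive Perron eigenvector
    exponent = -rho * R - np.log2(lam_max)  # must be positive for the bound to decay
    return nB * (v.max() / v.min()) * 2.0 ** (-n * exponent)
```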
IV. THE BSSC WITH AND WITHOUT FEEDBACK AND WITH AND WITHOUT TRANSMISSION COST
In this section, we apply the main results of the previous section to the unit memory channel called the Binary State Symmetric Channel (BSSC), defined by
\[
P(b_i|a_i,b_{i-1}) =
\begin{array}{c|cc}
(a_i,b_{i-1}) \setminus b_i & 0 & 1 \\ \hline
00 & \alpha & 1-\alpha \\
01 & \beta & 1-\beta \\
10 & 1-\beta & \beta \\
11 & 1-\alpha & \alpha
\end{array}, \quad i = 0,1,\ldots,n, \quad (\alpha,\beta) \in [0,1]\times[0,1]. \quad (IV.144)
\]
We show, using Theorem III.2, that the feedback capacity optimization problem is non-nested and the optimal channel input distribution is time-invariant. Further, we derive explicit expressions for feedback capacity and capacity without feedback, and we show that the capacity achieving distribution and the corresponding transition probability of the channel output process are characterized by doubly stochastic matrices. Moreover, we show that feedback does not increase capacity, and that capacity without feedback is achieved by a first order Markov channel input distribution, which is also doubly stochastic.

First, we show that the BSSC is equivalent to a channel with state information $s_i \triangleq a_i \oplus b_{i-1}$, $i = 0,1,\ldots,n$, where $\oplus$ denotes modulo-2 addition, as depicted in Fig. IV.1. Clearly, this transformation is one-to-one and onto, i.e., for a fixed channel input symbol value $a_i$ (respectively, channel output symbol value $b_{i-1}$), $s_i$ is uniquely determined by the value of $b_{i-1}$ (respectively, $a_i$), and vice-versa. Hence, we obtain the following equivalent representation of the BSSC.

[Fig. IV.1: An equivalent model. (a) Previous Output STate (POST) channel. (b) Binary State Symmetric Channel (BSSC).]

\[
P(b_i|a_i,s_i=0) = \begin{pmatrix} \alpha & 1-\alpha \\ 1-\alpha & \alpha \end{pmatrix}, \quad i = 0,1,\ldots,n, \quad (IV.145)
\]
\[
P(b_i|a_i,s_i=1) = \begin{pmatrix} \beta & 1-\beta \\ 1-\beta & \beta \end{pmatrix}, \quad i = 0,1,\ldots,n. \quad (IV.146)
\]
The above transformation highlights the symmetric form of the BSSC: for a fixed state $s_i \in \{0,1\}$, the channel (IV.144) decomposes into two Binary Symmetric Channels (BSCs), with transition probabilities given by (IV.145) and (IV.146), respectively. Therefore, for a fixed value of the previous output symbol, $b_{i-1}$, the encoder, by choosing the current input symbol, $a_i$, knows which of the two BSCs is applied at each transmission time. This decomposition motivates the name state-symmetric channel. The following notation is used in the rest of the paper.
• BSSC($\alpha,\beta$) denotes the BSSC with transition probabilities defined by (IV.144);
• BSC($1-\alpha$) denotes the "state zero" channel defined by (IV.145);
• BSC($1-\beta$) denotes the "state one" channel defined by (IV.146).

The necessity of imposing a transmission cost constraint on the channel is discussed by Shannon in [30, pp. 162-163], and is encapsulated in the following statement: "There is a curious and provocative duality between the properties of a source with a distortion measure and those of a channel. This duality is enhanced if we consider channels in which there is a 'cost' associated with the different input letters, and it is desired to find the capacity subject to the constraint that the expected cost not exceed a certain quantity...". In [19], it is shown that the BSSC is in perfect duality with the Binary Symmetric Markov Source (BSMS), with respect to a transmission cost function for the channel and a fidelity constraint for the source. This generalizes the joint source-channel matching of the discrete memoryless Bernoulli source with single letter Hamming distortion transmitted over a memoryless BSC.

Next, we illustrate that the cost constraint is natural when imposed on the BSSC. The memory on the previous output symbols and the decomposable nature of the channel allow us to impose a cost function related to the state of the channel. The physical interpretation of the transmission cost is the following. The two states of the BSSC are
• $s_i = 0$: the channel in use is BSC($1-\alpha$);
• $s_i = 1$: the channel in use is BSC($1-\beta$).
Suppose $\alpha > \beta \geq 0.5$. Then the capacity of the state zero channel is greater than the capacity of the state one channel. With "abuse" of terminology, the state zero channel is interpreted as the "good channel" and the state one channel is interpreted as the "bad channel". With this interpretation, it is reasonable to impose a higher cost when employing the "good channel", and a lower cost when employing the "bad channel". This policy is quantified by assigning a binary pay-off equal to "1" when the good channel is used, and a pay-off equal to "0" when the bad channel is used.

Definition IV.1. (Binary cost function for the BSSC) The cost function of the BSSC is
\[
\gamma(a_i,b_{i-1}) \triangleq \begin{cases} 1 & \text{if}\ a_i = b_{i-1}\ (s_i = 0) \\ 0 & \text{if}\ a_i \neq b_{i-1}\ (s_i = 1) \end{cases}, \quad i = 0,\ldots,n. \quad (IV.147)
\]
The average transmission cost constraint is defined by
\[
\frac{1}{n+1}\, \mathbf{E}\Big\{ \sum_{i=0}^{n} \gamma(A_i,B_{i-1}) \Big\} \leq \kappa, \quad \kappa \in [0,\kappa_{\max}], \quad (IV.148)
\]
where the letter-by-letter average transmission cost is given by
\[
\mathbf{E}\{\gamma(A_i,B_{i-1})\} = P_i(a_i=0,b_{i-1}=0) + P_i(a_i=1,b_{i-1}=1) = P_i(s_i=0). \quad (IV.149)
\]
This cost function may differ according to one's preferences. For example, if we wish to penalize the use of the "bad" channel, we may employ the complement of the cost function (IV.147). A more general cost function is
\[
\gamma(a_i,b_{i-1}) \triangleq \begin{cases} \gamma_0 & \text{if}\ a_i = b_{i-1}\ (s_i = 0) \\ 1-\gamma_0 & \text{if}\ a_i \neq b_{i-1}\ (s_i = 1) \end{cases}, \quad i = 0,\ldots,n, \quad (IV.150)
\]
where $\gamma_0 \in [0,1]$. However, the binary form of the transmission cost does not restrict the problem, since the average cost is a linear functional, and it can easily be upgraded to more complex forms without affecting the proposed methodology.

Additional observations regarding the above formulation are given in the following remark.

Remark IV.1. (Cost function)
(1) If only the good channel is used, that is, $P_i(s_i=0) = 1$, then the capacity of the BSSC is equal to zero, because $P(s_i=0) = 1$ corresponds to the channel input $a_i = b_{i-1}$, a deterministic function, for $i = 0,1,\ldots,n$ (this also follows from $I(A_i;B_i|B_{i-1})\big|_{A_i=B_{i-1}} = 0$, $i = 0,1,\ldots,n$). The capacity of the BSSC is also equal to zero if only the bad channel is used, $P_i(s_i=1) = 1$, for $i = 0,1,\ldots,n$.
(2) It is shown shortly that the optimal channel input distribution that achieves the unconstrained capacity of the BSSC corresponds to a fixed occupation of the two states. Upon introducing the transmission cost constraint, one is not allowed to use the state corresponding to the good channel beyond a certain threshold, because the overall cost of transmission needs to be satisfied.
(3) If $\beta > \alpha \geq 0.5$, then we reverse the transmission cost, while if $\alpha$ and $\beta$ are both less than $0.5$, then we flip the corresponding channel input probabilities.

A. Capacity of the BSSC with feedback
In this section, we apply Theorem III.1 and Theorem III.2 to calculate closed form expressions for the capacity achieving channel input distribution and the corresponding channel output distribution, and to show that these are time-invariant. Further, we employ these theorems to calculate the feedback capacity with and without cost constraints.
1) Feedback capacity of the BSSC without transmission cost:
In the next theorem, we show that the feedback capacity of the BSSC without cost constraint is given by a single letter expression, and that the optimal input distribution is time-invariant.
Theorem IV.1. (Feedback capacity and time-invariant property of the optimal distributions) Consider the BSSC defined by (IV.144) with feedback and without transmission cost. Then the following hold.
(a)
The capacity achieving channel input distribution and the corresponding channel output distribution which maximize the FTFI capacity, $C^{FB,BSSC}_{A^n\to B^n}$, are time-invariant and given by the following expressions.
\[
\pi^*_i(a_i|b_{i-1}) = \pi^{TI}(a_i|b_{i-1}) = \begin{pmatrix} \nu & 1-\nu \\ 1-\nu & \nu \end{pmatrix}, \quad \forall i \in \{0,1,\ldots,n\}, \quad (IV.151)
\]
\[
P^{\pi^*}_i(b_i|b_{i-1}) = P^{TI}(b_i|b_{i-1}) = \begin{pmatrix} \lambda & 1-\lambda \\ 1-\lambda & \lambda \end{pmatrix}, \quad \forall i \in \{0,1,\ldots,n\}, \quad (IV.152)
\]
where
\[
\lambda = \frac{1}{1+2^{\mu}}, \quad \mu = \frac{H(\beta)-H(\alpha)}{1-\alpha-\beta}, \quad \nu = \frac{1-(1-\beta)(1+2^{\mu})}{(\alpha+\beta-1)(1+2^{\mu})}. \quad (IV.153)
\]
Moreover,
\[
C^{FB,BSSC}_{A^n\to B^n} = (n+1) \max_{\pi^{TI}(a_0|b_{-1})} I(A_0;B_0|B_{-1}=b_{-1}), \quad \forall b_{-1} \in \{0,1\}, \quad (IV.154)
\]
\[
= (n+1)\,[H(\lambda) - \nu H(\alpha) - (1-\nu)H(\beta)]. \quad (IV.155)
\]
(b) The feedback capacity is given by
\[
C^{FB,BSSC}_{A^\infty\to B^\infty} = \max_{\pi^{TI}(a_0|b_{-1})} I(A_0;B_0|B_{-1}=b_{-1}), \quad \forall b_{-1} \in \{0,1\}, \quad (IV.156)
\]
\[
= H(\lambda) - \nu H(\alpha) - (1-\nu)H(\beta), \quad (IV.157)
\]
and it is independent of the initial state.

Proof: The proof of Theorem IV.1 is given in Appendix C.

Theorem IV.1, specifically (IV.154), illustrates the non-nested and time-invariant property, which gives a direct connection between the BSSC and memoryless channels. Note that these properties hold due to the "symmetric" form of the BSSC. As we show at the end of the current section via simulations, the time-invariant property does not hold for a general Binary Unit Memory Channel (BUMC).

The BSSC without cost constraint is equivalent to the POST channel investigated in [17]. The authors in [17] derived an expression for feedback capacity, which is equivalent to (IV.156), by using the convex hull theorem. Theorem IV.1 complements the results in [17] in the sense that it provides closed form expressions for the capacity achieving distribution and the corresponding optimal channel output conditional distribution. More importantly, it shows that these distributions are time-invariant and correspond to the non-nested optimization problem (IV.154), which is directly analogous to Shannon's single letter capacity formulae of memoryless channels.

The structure of the expression (IV.157) provides insight on how the occupancy of the two states affects the capacity. Recall that the state of the channel defines which of the two binary symmetric channels is in use at each time instant. Since $P_{S_i}(0) = P_{A_i,B_{i-1}}(0,0) + P_{A_i,B_{i-1}}(1,1)$, by substituting the capacity achieving input distribution we have $P_{S_i}(0) = \nu$. Thus, the optimal occupancy, or equivalently the optimal time sharing, among the two binary symmetric channels BSC($1-\alpha$) and BSC($1-\beta$) is given by $\nu$, which is a function of the channel parameters $\alpha$ and $\beta$. This interpretation is evident in the feedback capacity expression (IV.157), which is similar to the capacity of the memoryless binary symmetric channel. However, for the BSSC the maximizing output process corresponds to a time-invariant, first order, doubly stochastic Markov process.
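The closed form (IV.153), (IV.157) is immediate to evaluate; the following minimal sketch uses the algebraically equivalent form $\nu = (\lambda - (1-\beta))/(\alpha+\beta-1)$ and assumes $\alpha + \beta \neq 1$:

```python
import numpy as np

def H(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bssc_feedback_capacity(alpha, beta):
    """(IV.153) and (IV.157); assumes alpha + beta != 1."""
    mu = (H(beta) - H(alpha)) / (1 - alpha - beta)
    lam = 1.0 / (1.0 + 2.0 ** mu)
    nu = (lam - (1 - beta)) / (alpha + beta - 1)   # equivalent to nu of (IV.153)
    return H(lam) - nu * H(alpha) - (1 - nu) * H(beta), lam, nu

# Sanity check against Section IV-C: alpha = 1, beta = 0.5 gives
# lam = 0.8, nu = 0.6, C = log2(5/4) ~ 0.3219 bits.
print(bssc_feedback_capacity(1.0, 0.5))
```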
2) Feedback capacity of the BSSC with transmission cost:
Next, we consider the BSSC with the transmission cost constraint defined by (IV.148). Since $C^{FB,BSSC}_{A^n\to B^n}(\kappa)$ is a convex optimization problem, the optimal channel input conditional distribution occurs on the boundary of the constraint; i.e., for $\kappa \geq \kappa_{\max}$, $C^{FB,BSSC}_{A^n\to B^n}(\kappa)$ is constant and equal to the unconstrained capacity given in Theorem IV.1.

Theorem IV.2. Consider the BSSC defined by (IV.144) with feedback and the transmission cost constraint defined by (IV.148). Then the following hold.
(a) The optimal channel input distribution which corresponds to $C^{FB,BSSC}_{A^n\to B^n}(\kappa)$ and the corresponding optimal output distribution are time-invariant and given by
\[
\pi^*_i(a_i|b_{i-1}) = \pi^{TI}(a_i|b_{i-1}) = \begin{pmatrix} \kappa & 1-\kappa \\ 1-\kappa & \kappa \end{pmatrix}, \quad \forall i = 0,1,\ldots,n, \quad (IV.158)
\]
\[
P^{\pi^*}_i(b_i|b_{i-1}) = P^{TI}(b_i|b_{i-1}) = \begin{pmatrix} \bar\lambda & 1-\bar\lambda \\ 1-\bar\lambda & \bar\lambda \end{pmatrix}, \quad \forall i = 0,1,\ldots,n, \quad (IV.159)
\]
where
\[
\bar\lambda = \alpha\kappa + (1-\kappa)(1-\beta). \quad (IV.160)
\]
Moreover,
\[
C^{FB,BSSC}_{A^n\to B^n}(\kappa) = (n+1) \max_{\pi^{TI}(a_0|b_{-1}):\ \mathbf{E}\{\gamma(A_i,B_{i-1})\}\leq\kappa} I(A_0;B_0|B_{-1}=b_{-1}), \quad \forall b_{-1} \in \{0,1\}. \quad (IV.161)
\]
(b) The feedback capacity is given by
\[
C^{FB,BSSC}_{A^\infty\to B^\infty}(\kappa) = \begin{cases} H(\bar\lambda) - \kappa H(\alpha) - (1-\kappa)H(\beta) & \text{if}\ \kappa \leq \kappa_{\max} \\ H(\lambda) - \kappa_{\max} H(\alpha) - (1-\kappa_{\max})H(\beta) & \text{if}\ \kappa > \kappa_{\max} \end{cases} \quad (IV.162)
\]
where $\kappa_{\max}$ is equal to $\nu$ defined by (IV.153).

Proof: The proof is similar to the proof of Theorem IV.1 and is given in Appendix D.
The unconstrained and constrained feedback capacities of the BSSC are depicted in Figure IV.2. In particular, Figure IV.2a depicts the unconstrained capacity of the BSSC for all values of the parameters $\alpha, \beta \in [0,1]$. Figure IV.2b depicts how the transmission cost affects the capacity of the BSSC for all values of the parameters $\alpha, \beta \in [0,1]$, and for three different choices of $\kappa$. The inner plot corresponds to the unconstrained case ($\kappa = \kappa_{\max}$).

[Fig. IV.2: Capacity of BSSC with feedback. (a) Unconstrained capacity. (b) Constrained capacity.]
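The two-branch formula (IV.162) evaluates directly; a minimal sketch (same binary entropy as above, with $\kappa$ clipped at $\kappa_{\max} = \nu$):

```python
import numpy as np

def H(p):
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bssc_feedback_capacity_cost(alpha, beta, kappa):
    """Constrained feedback capacity (IV.162), with kappa_max = nu of (IV.153)."""
    mu = (H(beta) - H(alpha)) / (1 - alpha - beta)
    lam = 1.0 / (1.0 + 2.0 ** mu)
    kappa_max = (lam - (1 - beta)) / (alpha + beta - 1)
    k = min(kappa, kappa_max)                      # constraint inactive above kappa_max
    lam_bar = alpha * k + (1 - k) * (1 - beta)     # (IV.160)
    return H(lam_bar) - k * H(alpha) - (1 - k) * H(beta)
```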
3) Error exponents for the BSSC with feedback:
In this section we apply the results of Section III-D to the BSSC, and we evaluate the error exponent and the probability of error for the capacity achieving input distribution with feedback, denoted by $\pi^{TI}$ and defined by (IV.151).

It is straightforward to verify that, evaluating (III.133) at the capacity achieving input distribution defined by (IV.151), this term is independent of the initial state of the channel, and is given by
\[
E_{\pi^{TI},n}(\rho,b_{-1}) \equiv E_{\pi^{TI},n}(\rho). \quad (IV.163)
\]
Consequently, the upper bound on the probability of error is also independent of the initial state, and is given by
\[
P^{(n)}_{e,m} \leq |\mathcal{B}|\, 2^{-n[-\rho R + F_n(\rho)]}, \quad \forall m \in \mathcal{M}_n,\ 0 \leq \rho \leq 1. \quad (IV.164)
\]
Moreover, since the capacity achieving distribution is time-invariant, $\Lambda^{\pi}_i(s_i,s_{i-1}) = \Lambda^{\pi^{TI}}(s_i,s_{i-1})$, $i = 0,1,\ldots,n$. Then, by substituting the time-invariant capacity achieving distribution and the channel distribution in (III.139), we obtain
\[
\Lambda^{\pi^{TI}}(0,0) = \Lambda^{\pi^{TI}}(1,1) = \Big[ \nu\,\alpha^{\frac{1}{1+\rho}} + (1-\nu)(1-\beta)^{\frac{1}{1+\rho}} \Big]^{1+\rho}, \quad (IV.165)
\]
\[
\Lambda^{\pi^{TI}}(1,0) = \Lambda^{\pi^{TI}}(0,1) = \Big[ \nu\,(1-\alpha)^{\frac{1}{1+\rho}} + (1-\nu)\beta^{\frac{1}{1+\rho}} \Big]^{1+\rho}. \quad (IV.166)
\]
The largest eigenvalue of the resulting $2\times 2$ symmetric matrix is
\[
\lambda^{\pi^{TI}}_{\max}(\rho) = \Lambda^{\pi^{TI}}(0,0) + \Lambda^{\pi^{TI}}(0,1) = \Big[ \nu\,\alpha^{\frac{1}{1+\rho}} + (1-\nu)(1-\beta)^{\frac{1}{1+\rho}} \Big]^{1+\rho} + \Big[ \nu\,(1-\alpha)^{\frac{1}{1+\rho}} + (1-\nu)\beta^{\frac{1}{1+\rho}} \Big]^{1+\rho}, \quad (IV.167)
\]
and the corresponding positive eigenvector has equal components, so that
\[
\frac{v_{\max}}{v_{\min}} = 1. \quad (IV.168)
\]
Substituting (IV.167) and (IV.168) in (III.143), we obtain
\[
E_{\pi^{TI}}(\rho) = -\log \lambda^{\pi^{TI}}_{\max}(\rho), \quad (IV.169)
\]
\[
F_\infty(\rho) \triangleq \lim_{n\to\infty} F_n(\rho) = -\log \lambda^{\pi^{TI}}_{\max}(\rho), \quad (IV.170)
\]
and the error exponent
\[
E^{\pi^{TI}}_r(R) \triangleq \max_{0\leq\rho\leq 1} \{ F_\infty(\rho) - \rho R \}. \quad (IV.171)
\]
Hence, the probability of error is bounded by
\[
P^{(n)}_{e,m} \leq |\mathcal{B}|\, 2^{-n\big[-\rho R - \log \lambda^{\pi^{TI}}_{\max}(\rho)\big]}. \quad (IV.172)
\]
Better bounds can be obtained if both the encoder and the decoder know the initial state of the channel; in this case the cardinality of the state, $|\mathcal{B}|$, is omitted from (IV.172) [25, Problem 5.37]. The error exponent and the probability of error, optimized with respect to $\rho$, are given in Fig. IV.3. Obviously, even better bounds can be obtained by optimizing with respect to the channel input distribution. However, even for DMCs, the error exponent analogous to (IV.171) is often evaluated at the capacity achieving distribution of the ergodic capacity.

[Fig. IV.3: Error exponent and probability of error for the BSSC; the probability of error is shown for rates $R = 0.5C$, $0.9C$, $0.99C$.]
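Combining (IV.165)-(IV.171), the error exponent of the BSSC reduces to a one-dimensional optimization over $\rho$, which a short sketch can carry out by grid search (our discretization choice):

```python
import numpy as np

def H(p):
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def lam_max(rho, alpha, beta, nu):
    """Largest eigenvalue (IV.167) of the symmetric matrix (IV.165)-(IV.166)."""
    e = 1.0 / (1.0 + rho)
    same = (nu * alpha ** e + (1 - nu) * (1 - beta) ** e) ** (1 + rho)
    diff = (nu * (1 - alpha) ** e + (1 - nu) * beta ** e) ** (1 + rho)
    return same + diff

def error_exponent(R, alpha, beta):
    """E_r(R) of (IV.171), with nu from (IV.153); grid search over rho."""
    mu = (H(beta) - H(alpha)) / (1 - alpha - beta)
    nu = (1.0 / (1.0 + 2.0 ** mu) - (1 - beta)) / (alpha + beta - 1)
    return max(-np.log2(lam_max(r, alpha, beta, nu)) - r * R
               for r in np.linspace(0.0, 1.0, 1001))
```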
B. Capacity without feedback of the BSSC

In this section we apply Theorem II.2 to show that the feedback capacity of the BSSC is achieved by a time-invariant first order Markov channel input distribution without feedback.
Theorem IV.3. (Capacity of BSSC without feedback, with and without transmission cost) Consider the BSSC defined by (IV.144) without feedback. Then the following hold.
(a)
For a channel with the transmission cost constraint defined by (IV.148), the optimal channel input distribution which corresponds to $C^{noFB,BSSC}_{A^n;B^n}(\kappa)$ is time-invariant first-order Markov, and is given by
\[
P^{noFB,*}_i(a_i|a_{i-1}) = \pi^{noFB,TI}(a_i|a_{i-1}) = \begin{pmatrix} \dfrac{1-\kappa-\sigma}{1-2\sigma} & \dfrac{\kappa-\sigma}{1-2\sigma} \\[2mm] \dfrac{\kappa-\sigma}{1-2\sigma} & \dfrac{1-\kappa-\sigma}{1-2\sigma} \end{pmatrix}, \quad i = 0,1,\ldots,n, \quad (IV.173)
\]
where $\sigma = \alpha\kappa + \beta(1-\kappa)$. Moreover, (IV.173) induces the optimal channel input and channel output distributions $\pi^{TI}(a_i|b_{i-1})$ and $P^{TI}(b_i|b_{i-1})$ of the BSSC with feedback and transmission cost.
(b) For a channel without transmission cost, (a) holds with $\kappa = \kappa^*$ and $\sigma = \sigma^* = \alpha\kappa^* + \beta(1-\kappa^*)$.
(c) The capacity of the BSSC without feedback and transmission cost is given by
\[
C^{noFB,BSSC}_{A^n;B^n} = (n+1) \max_{\pi^{noFB,TI}(a_1|a_0)} I(A_1;B_1|B_0) = (n+1)\, C^{FB,BSSC}_{A^\infty\to B^\infty}, \quad (IV.174)
\]
and similarly if there is a transmission cost.

Proof: (a) By applying Theorem II.2, it suffices to show that there exists an input distribution without feedback which induces the capacity achieving channel input distribution with feedback, $\pi^*_i(a_i|b_{i-1})$. For the BSSC, it is clear that if any input distribution without feedback induces $\pi^*_i(a_i|b_{i-1}) = \pi^{TI}(a_i|b_{i-1})$ given by (IV.158), then it also induces the optimal output process $P^{\pi^{noFB,TI},*}_i(b_i|b_{i-1}) = P^{TI}(b_i|b_{i-1})$ given by (IV.159), since
\[
P^{TI}(b_i|b_{i-1}) = \sum_{a_i\in\{0,1\}} P(b_i|a_i,b_{i-1})\, \pi^{TI}(a_i|b_{i-1}). \quad (IV.175)
\]
Suppose the distribution of the initial state $b_{-1}$ is given by the stationary distribution of the output process, that is, $P_{b_{-1}}(0) = P_{b_{-1}}(1) = 0.5$. Then, we show by induction that there exists a time-invariant, first order Markov channel input distribution without feedback that induces the time-invariant channel input distribution with feedback.

For $i = 0$, the optimal channel input distribution without feedback is equal to the optimal channel input distribution with feedback, that is, $\pi^{noFB}_0(a_0|b_{-1}) = \pi^{TI}(a_0|b_{-1})$, given by (IV.158), since $b_{-1}$ is the initial state known at the encoder. Therefore, the corresponding channel output distribution with feedback, $P^{\pi^*}(b_0|b_{-1})$, is induced, and since $P^{\pi^*}(b_0|b_{-1}) = P^{TI}(b_0|b_{-1})$ is doubly stochastic, $P^*(b_0=0) = P^*(b_0=1) = 0.5$.

For $i = 1$, the following identities hold in general.
\[
P(a_1|b_0) = \sum_{a_0\in\{0,1\}} P(a_1|a_0,b_0)\, P(a_0|b_0) = \sum_{a_0\in\{0,1\}} P(a_1|a_0,b_0) \frac{P(b_0,a_0)}{P(b_0)} = \sum_{a_0\in\{0,1\}} P(a_1|a_0,b_0) \frac{1}{P(b_0)} \sum_{b_{-1}\in\{0,1\}} P(b_0|a_0,b_{-1})\, P(a_0|b_{-1})\, P(b_{-1}). \quad (IV.176)
\]
Next, using (IV.176), we investigate whether there exists a first order Markov channel input distribution without feedback, $P(a_1|a_0,b_0) = \pi^{noFB}(a_1|a_0)$, which induces the time-invariant capacity achieving input distribution with feedback, $\pi^{TI}(a_1|b_0)$, given by (IV.158). Therefore, we need to determine whether the following identity holds for some $\pi^{noFB}(a_1|a_0)$. From (IV.176),
\[
\pi^{TI}(a_1|b_0) \overset{?}{=} \sum_{a_0\in\{0,1\}} \pi^{noFB}(a_1|a_0) \frac{1}{P^*(b_0)} \sum_{b_{-1}\in\{0,1\}} P(b_0|a_0,b_{-1})\, \pi^{TI}(a_0|b_{-1})\, P(b_{-1}). \quad (IV.177)
\]
Note that $P(a_0|b_{-1}) = \pi^{TI}(a_0|b_{-1})$ and $P(b_0) = P^*(b_0)$ hold due to step $i = 0$. Solving the system of resulting equations yields that there exists a channel input distribution without feedback, defined by (IV.173), that induces $\pi^{TI}(a_1|b_0)$. Therefore, it also induces the time-invariant optimal output distribution, $P^{\pi^*}_1(b_1|b_0) = P^{TI}(b_1|b_0)$ given by (IV.159), and its corresponding optimal marginal distribution $P^*(b_1)$.

Next, suppose that up to time $i = j-1$, the first order Markov input distribution defined by (IV.173) induces the time-invariant capacity achieving distribution with feedback, $\{\pi^{TI}(a_i|b_{i-1}) : i = 0,1,\ldots,j-1\}$, given by (IV.158), and therefore it induces $\{P^{TI}(b_i|b_{i-1}) : i = 0,1,\ldots,j-1\}$ given by (IV.159) and its corresponding optimal marginal distribution $\{P^*(b_i) : i = 0,1,\ldots,j-1\}$. Then, at time $i = j$, the following identity holds.
\[
P_j(a_j|b_{j-1}) = \sum_{a_{j-1},b_{j-2}} P_j(a_j|a_{j-1},b_{j-1})\, P_{j-1}(a_{j-1},b_{j-2}|b_{j-1}) = \sum_{a_{j-1},b_{j-2}} P_j(a_j|a_{j-1},b_{j-1}) \frac{1}{P^*(b_{j-1})}\, P(b_{j-1}|a_{j-1},b_{j-2})\, \pi^{TI}(a_{j-1}|b_{j-2})\, P^*(b_{j-2}). \quad (IV.178)
\]
The last equality holds since the distributions $P^*(b_{j-2})$ and $\pi^{TI}(a_{j-1}|b_{j-2})$ were induced from the previous steps $i = 0,1,\ldots,j-1$. Subsequently, we investigate whether there exists a first order Markov channel input distribution, $P_j(a_j|a_{j-1},b_{j-1}) = \pi^{noFB}_j(a_j|a_{j-1})$, that satisfies (IV.178). That is,
\[
\pi^{TI}(a_j|b_{j-1}) \overset{?}{=} \sum_{a_{j-1}} \pi^{noFB}_j(a_j|a_{j-1}) \frac{1}{P^*(b_{j-1})} \sum_{b_{j-2}} P(b_{j-1}|a_{j-1},b_{j-2})\, \pi^{TI}(a_{j-1}|b_{j-2})\, P^*(b_{j-2}). \quad (IV.179)
\]
Solving the system of equations resulting from (IV.179) yields the time-invariant first order Markov input distribution defined by (IV.173). Since the time-invariant first order Markov channel input distribution without feedback defined by (IV.173) induces the optimal channel input distribution with feedback for all $i = 0,1,\ldots,j$, it is the time-invariant capacity achieving input distribution without feedback.

(b) Holds since for the BSSC without transmission cost $\kappa = \kappa^*$, and therefore $\sigma = \sigma^* = \alpha\kappa^* + \beta(1-\kappa^*)$.

(c) Since $\{\pi^{noFB,TI}(a_i|a_{i-1}) \equiv \pi^{noFB,TI}(a_1|a_0) : i = 0,1,\ldots,n\}$ induces $\{\pi^{TI}(a_i|b_{i-1}) : i = 0,1,\ldots,n\}$ given by (IV.158) and $\{P^{TI}(b_i|b_{i-1}) : i = 0,1,\ldots,n\}$ given by (IV.159), the channel capacity without feedback and transmission cost is given by (IV.174). Similarly, for the constrained capacity we have $C^{noFB,BSSC}_{A^n;B^n}(\kappa) = C^{FB,BSSC}_{A^\infty\to B^\infty}(\kappa)$.
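The induction step (IV.177) can also be checked numerically: the no-feedback distribution (IV.173) must reproduce $\pi^{TI}$ of (IV.158) exactly when the initial state is stationary. A sketch with illustrative parameter values of our own choosing ($\alpha = 0.9$, $\beta = 0.8$, $\kappa = 0.4$):

```python
import numpy as np

alpha, beta, kappa = 0.9, 0.8, 0.4          # illustrative parameter choices
sigma = alpha * kappa + beta * (1 - kappa)

piTI = np.array([[kappa, 1 - kappa],        # pi^TI[a, b_prev], (IV.158)
                 [1 - kappa, kappa]])
piNF = np.array([[1 - kappa - sigma, kappa - sigma],
                 [kappa - sigma, 1 - kappa - sigma]]) / (1 - 2 * sigma)   # (IV.173)

P = np.zeros((2, 2, 2))                     # P[b, b_prev, a] of (IV.144)
for bp in (0, 1):
    for a in (0, 1):
        pk = alpha if a == bp else beta
        P[a, bp, a], P[1 - a, bp, a] = pk, 1.0 - pk

# Stationary initial state: P(b_{-1}) = 1/2, hence P*(b_0) = 1/2 (doubly stochastic).
induced = np.zeros((2, 2))
for b0 in (0, 1):
    for a1 in (0, 1):
        induced[a1, b0] = sum(piNF[a1, a0] * 2.0 *
                              sum(P[b0, bm, a0] * piTI[a0, bm] * 0.5 for bm in (0, 1))
                              for a0 in (0, 1))

print(np.allclose(induced, piTI))           # True: (IV.173) induces (IV.158)
```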
C. Special cases of the BSSC

1) Memoryless BSSC ($\alpha = \beta = 1-\varepsilon$, $\varepsilon \neq 0.5$): Consider the trivial case where $\alpha = \beta \triangleq 1-\varepsilon$. Then, given the state $s_i = a_i \oplus b_{i-1}$, the BSSC degenerates to the Discrete Memoryless Binary Symmetric Channel (DM-BSC) with crossover probability $\varepsilon$. By employing (IV.151)-(IV.156) and (IV.173), we obtain $\mu = 0$ and $\nu = \lambda = 0.5$; the capacity achieving input distribution and the corresponding output distribution are memoryless and uniformly distributed, and the capacity expression reduces to
\[
C^{DM\text{-}BSC} = H\big((1-\varepsilon)\,0.5 + \varepsilon\,0.5\big) - 0.5\,H(\varepsilon) - 0.5\,H(\varepsilon) = 1 - H(\varepsilon).
\]
These are the known results for the memoryless BSC.

2) Best and Worst BSSC ($\alpha = 1$, $\beta = 0.5$): Consider the case $\alpha = 1$, $\beta = 0.5$. This channel decomposes into a noiseless BSC with crossover probability $0$ if $s_i = a_i \oplus b_{i-1} = 0$, and a noisy BSC with crossover probability $0.5$ if $s_i = a_i \oplus b_{i-1} = 1$. By invoking (IV.151)-(IV.156), we obtain $\nu = 0.6$ and $\lambda = 0.8$, and the channel capacity is equal to
\[
C^{FB,BSSC}\big|_{\alpha=1,\beta=0.5} = C^{noFB,BSSC}\big|_{\alpha=1,\beta=0.5} = H(0.8) - 0.6\,H(1) - 0.4\,H(0.5) = \log_2(5/4) \approx 0.3219.
\]
The optimal input distributions with and without feedback are
\[
\pi^{TI}(a_i|b_{i-1}) = \begin{pmatrix} 0.6 & 0.4 \\ 0.4 & 0.6 \end{pmatrix}, \quad \pi^{noFB,TI}(a_i|a_{i-1}) = \begin{pmatrix} 2/3 & 1/3 \\ 1/3 & 2/3 \end{pmatrix},
\]
and the optimal channel output distribution for both is given by
\[
P^{TI}(b_i|b_{i-1}) = \begin{pmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{pmatrix}.
\]
This completes the analysis of the degenerate BSSC.

[Fig. IV.4: Unconstrained BIBO-UMCO channel. (a) Value function $V(b_{i-1})$, $b_{i-1} \in \{0,1\}$, versus time horizon. (b) Optimal input and output distributions versus time horizon.]
D. Capacity of the Binary Input Binary Output Unit Memory Channel Output (BIBO-UMCO) channel with feedback

[Fig. IV.5: Constrained BIBO-UMCO channel. (a) Value function $V(b_{i-1})$, $b_{i-1} \in \{0,1\}$, versus time horizon. (b) Optimal input and output distributions versus time horizon.]

In this section, we evaluate the feedback capacity of the BIBO-UMCO channel, defined by
\[
P(b_i|b_{i-1},a_i) =
\begin{array}{c|cccc}
b_i \setminus (a_i,b_{i-1}) & 00 & 01 & 10 & 11 \\ \hline
0 & \alpha_{00} & \alpha_{01} & \alpha_{10} & \alpha_{11} \\
1 & 1-\alpha_{00} & 1-\alpha_{01} & 1-\alpha_{10} & 1-\alpha_{11}
\end{array}, \quad i = 0,\ldots,n, \quad (IV.180)
\]
with and without transmission cost. In addition, we calculate the capacity achieving input distributions with feedback and the respective optimal output distributions.
1) Without cost constraint:
Consider the BIBO-UMCO channel (IV.180) with the fixed parameters $\alpha_{00}, \alpha_{01}, \alpha_{10}, \alpha_{11}$ indicated in Fig. IV.4. By employing the dynamic programming equations (III.67)-(III.68), the convergence of the value functions without transmission cost, and the convergence of the optimal input distributions with feedback and of the corresponding output distributions, are depicted in Figures IV.4a and IV.4b, respectively. To characterize the feedback capacity and the capacity achieving input distribution of the BIBO-UMCO channel, we employ Algorithm 1, which yields $\pi^\infty(a_i = 0|b_{i-1} = 0) = 0.626$, $\pi^\infty(a_i = 1|b_{i-1} = 0) = 0.374$, and
\[
C^{FB,BIBO\text{-}UMCO} = 0.215 \ \text{bits per channel use}.
\]
2) With cost constraint: Consider the BIBO-UMCO channel (IV.180) with the fixed parameters $\alpha_{00}, \alpha_{01}, \alpha_{10}, \alpha_{11}$ and transmission cost $\kappa$ indicated in Fig. IV.5. The convergence of the value functions and of the optimal input and output distributions with feedback is depicted in Figures IV.5a and IV.5b, respectively.

V. CONCLUSIONS
We apply the dynamic programming recursions, and the necessary and sufficient conditions for any channel input conditional distribution to achieve capacity, to identify necessary and sufficient conditions such that the nested optimization problem $C^{FB}_{A^n\to B^n}$ reduces to a non-nested optimization problem. This gives rise to the single letter characterization of feedback capacity. The methodology can be easily generalized to channels that have finite memory on the previous outputs.

These results are applied to the BSSC with feedback, with and without cost constraint, to calculate the feedback capacity, the capacity achieving input distribution, and the corresponding output distribution. One of the fascinating results is that feedback capacity is characterized by a single letter expression that is precisely analogous to the single letter characterization of capacity of DMCs. Additionally, we show that a first order Markov channel input distribution without feedback achieves feedback capacity. We also derive an upper bound on the error probability of maximum likelihood decoding.

APPENDIX A
PROOF OF LEMMA III.1

We can re-write (III.110), by adding and subtracting multiples of $J^*$, as follows.
\[
\big(\widetilde{V}_t(b_{-1}) - tJ^*\big) + tJ^* = \sup_{\pi^\infty(\cdot|b_{-1})} \Big\{ \sum_a \Big\{ \sum_b \log\Big( \frac{P(b|b_{-1},a)}{P^{\pi^\infty}(b|b_{-1})} \Big) P(b|b_{-1},a) + \sum_b \Big( \big(\widetilde{V}_{t-1}(b) - (t-1)J^*\big) + (t-1)J^* \Big) P(b|b_{-1},a) \Big\} \pi^\infty(a|b_{-1}) \Big\}. \quad (A.181)\text{-}(A.182)
\]
Assumptions III.2 imply that
\[
\lim_{t\to\infty} \frac{1}{t}\,\widetilde{V}_t(b_{-1}) = J^*, \quad \forall b_{-1} \in \mathcal{B}, \quad (A.183)
\]
and that the limit does not depend on $b_{-1} \in \mathcal{B}$. Moreover, under Assumptions III.2 and (A.183), moving the constant $(t-1)J^*$ out of the inner sum in (A.181)-(A.182), subtracting $tJ^*$ from both sides, and taking the limit of both sides, the following dynamic programming equation is obtained.
\[
J^* + V(b_{-1}) \overset{(a)}{=} \lim_{t\to\infty} \sup_{\pi^\infty(\cdot|b_{-1})} \Big\{ \sum_a \Big\{ \sum_b \log\Big( \frac{P(b|b_{-1},a)}{P^{\pi^\infty}(b|b_{-1})} \Big) P(b|b_{-1},a) + \sum_b \big( \widetilde{V}_{t-1}(b) - (t-1)J^* \big) P(b|b_{-1},a) \Big\} \pi^\infty(a|b_{-1}) \Big\}, \quad (A.184)\text{-}(A.186)
\]
where (a) is due to (A.181)-(A.182). Since the channel input and output alphabet spaces are at most countable, we can interchange the limit and the maximization operations, to obtain the dynamic programming equation (III.112).

APPENDIX B
PROOF OF THEOREM III.4

For any $\{\pi^\infty(a_i|b_{i-1}) : i = 0,\ldots,n\}$, (III.104) is expressed as follows.
\[
J(\pi^\infty,\mu) = \liminf_{n\to\infty} \frac{1}{n}\, \mathbf{E}^{\pi^\infty}_{\mu} \Big\{ \sum_{i=0}^{n-1} \ell(b_{i-1},a_i) \Big\} = \liminf_{n\to\infty} \frac{1}{n}\, \mathbf{E}^{\pi^\infty}_{\mu} \Big\{ \sum_{i=0}^{n-1} \ell(b_{i-1},\pi(b_{i-1})) \Big\}, \quad \forall \mu(b_{-1}) \in \mathcal{M}(\mathcal{B}), \quad (B.187)
\]
\[
= \liminf_{n\to\infty} \frac{1}{n}\, \mu^T \Big( \sum_{i=0}^{n-1} P(\pi^\infty)^i \Big) \ell(\pi^\infty). \quad (B.188)
\]
Following [31], it can be shown that the above limit exists, but it may depend on the distribution $\mu(\cdot)$ of $B_{-1}$. However, if $P(\pi^\infty)$ is irreducible, then
\[
J(\pi^\infty,\mu) = \mu^T P^\infty(\pi^\infty)\, \ell(\pi^\infty) = \nu(\pi^\infty)^T \ell(\pi^\infty), \quad (B.189)
\]
where $P^\infty(\pi^\infty)$ is the limiting matrix (this follows from the Cesaro limit), and $\nu(\pi^\infty)$ is the unique invariant probability distribution, which satisfies $P(\pi^\infty)\nu(\pi^\infty) = \nu(\pi^\infty)$. From (B.189), it follows that $J(\pi^\infty,\mu) \equiv J(\pi^\infty)$, that is, it does not depend on the initial distribution $\mu$ of $B_{-1}$.
It can be shown that, if for all stationary Markov channel input distributions $\pi^\infty$ the transition matrix $P(\pi^\infty)$ is irreducible, there exists a solution $V : \mathcal{B} \mapsto \mathbb{R}$, identified with a vector in $\mathbb{R}^{|\mathcal{B}|}$, and $J \in \mathbb{R}$, which satisfies (III.124).

APPENDIX C
PROOF OF THEOREM IV.1

(a) First, we employ the necessary and sufficient conditions of Theorem III.1 to calculate the optimal input and output distributions and the value function at the terminal time. To show the time-invariant property, it is sufficient to prove that the value function at the terminal time, $V_n(b_{n-1})$, is independent of $b_{n-1}$ (part (b) of Theorem III.2). By Theorem III.1, we have
\[
V_n(b_{n-1}) = \sum_{b_n} \log\Big( \frac{P_n(b_n|a_n,b_{n-1})}{P^{\pi}_n(b_n|b_{n-1})} \Big) P(b_n|a_n,b_{n-1}), \quad \forall a_n \in \mathcal{A} \ \text{if}\ \pi_n(a_n|b_{n-1}) \neq 0. \quad (C.190)
\]
For $b_{n-1} = 0$, $a_n = 0$, we obtain
\[
V_n(b_{n-1}{=}0) = \sum_{b_n} \log\Big( \frac{P_n(b_n|a_n{=}0,b_{n-1}{=}0)}{P^{\pi}_n(b_n|b_{n-1}{=}0)} \Big) P(b_n|a_n{=}0,b_{n-1}{=}0) = \alpha \log \frac{1 - P^{\pi}_n(b_n{=}0|b_{n-1}{=}0)}{P^{\pi}_n(b_n{=}0|b_{n-1}{=}0)} + \log \frac{1}{1 - P^{\pi}_n(b_n{=}0|b_{n-1}{=}0)} - H(\alpha). \quad (C.191)
\]
For $b_{n-1} = 0$, $a_n = 1$, we obtain
\[
V_n(b_{n-1}{=}0) = \sum_{b_n} \log\Big( \frac{P_n(b_n|a_n{=}1,b_{n-1}{=}0)}{P^{\pi}_n(b_n|b_{n-1}{=}0)} \Big) P(b_n|a_n{=}1,b_{n-1}{=}0) = (1-\beta) \log \frac{1 - P^{\pi}_n(b_n{=}0|b_{n-1}{=}0)}{P^{\pi}_n(b_n{=}0|b_{n-1}{=}0)} + \log \frac{1}{1 - P^{\pi}_n(b_n{=}0|b_{n-1}{=}0)} - H(\beta). \quad (C.192)
\]
By (C.190), we equate (C.191) and (C.192) to deduce
\[
P^{\pi}_n(b_n{=}0|b_{n-1}{=}0) = \lambda = \frac{1}{1+2^{\mu}}, \quad (C.193)
\]
where $\lambda$ and $\mu$ are given in (IV.153). We repeat the above procedure for the pairs $b_{n-1}=1$, $a_n=0$ and $b_{n-1}=1$, $a_n=1$, to deduce
\[
P^{\pi}_n(b_n{=}1|b_{n-1}{=}1) = \frac{1}{1+2^{\mu}} \equiv \lambda. \quad (C.194)
\]
Therefore the optimal transition probability of the output process at time $n$ is given by the doubly stochastic matrix (IV.152). Next, we show that the value function, $V_n(b_{n-1})$, is independent of $b_{n-1}$. The value function for $b_{n-1}=1$, $a_n=1$ is
\[
V_n(b_{n-1}{=}1) = \sum_{b_n} \log\Big( \frac{P_n(b_n|a_n{=}1,b_{n-1}{=}1)}{P^{\pi}_n(b_n|b_{n-1}{=}1)} \Big) P(b_n|a_n{=}1,b_{n-1}{=}1) = \alpha \log \frac{1 - P^{\pi}_n(b_n{=}1|b_{n-1}{=}1)}{P^{\pi}_n(b_n{=}1|b_{n-1}{=}1)} + \log \frac{1}{1 - P^{\pi}_n(b_n{=}1|b_{n-1}{=}1)} - H(\alpha) = V_n(b_{n-1}{=}0), \quad (C.195)
\]
where the last equality follows from (C.193) and (C.194). Since the value function $V_n(b_{n-1})$ is independent of $b_{n-1}$, we apply Theorem III.2.(b) to deduce that the optimal channel input and channel output conditional distributions are time-invariant. The optimal channel input conditional distribution is calculated via the expression $P^{\pi}_n(b_n|b_{n-1}) = \sum_{a_n\in\mathcal{A}} P_n(b_n|a_n,b_{n-1})\, \pi(a_n|b_{n-1})$. For $b_n = b_{n-1} = 0$, we have
\[
P^{\pi}_n(b_n{=}0|b_{n-1}{=}0) = \sum_{a_n} P_n(b_n{=}0|a_n,b_{n-1}{=}0)\, \pi(a_n|b_{n-1}{=}0) = \alpha\,\pi(a_n{=}0|b_{n-1}{=}0) + (1-\beta)\big(1 - \pi(a_n{=}0|b_{n-1}{=}0)\big). \quad (C.196)
\]
Solving (C.196) with respect to the input distribution yields
\[
\pi(a_n{=}0|b_{n-1}{=}0) = \frac{1-(1-\beta)(1+2^{\mu})}{(\alpha+\beta-1)(1+2^{\mu})} \equiv \nu. \quad (C.197)
\]
Similarly,
\[
P^{\pi}_n(b_n{=}1|b_{n-1}{=}1) = \sum_{a_n} P_n(b_n{=}1|a_n,b_{n-1}{=}1)\, \pi(a_n|b_{n-1}{=}1) = \alpha\,\pi(a_n{=}1|b_{n-1}{=}1) + (1-\beta)\big(1 - \pi(a_n{=}1|b_{n-1}{=}1)\big), \quad (C.198)
\]
which yields $\pi(a_n{=}1|b_{n-1}{=}1) = \nu$. The above shows (IV.151). By Theorem III.2.(b), specifically (III.87) evaluated at $t = 0$, we obtain the following expression for the FTFI capacity.
\[
C^{FB,BSSC}_{A^n\to B^n} \overset{(\alpha)}{=} \sum_{b_{-1}} V_0(b_{-1})\, \mu(b_{-1}) \overset{(\beta)}{=} (n+1) \max_{\pi_0(a_0|b_{-1})} \sum_{b_0,a_0} \log\Big( \frac{P(b_0|a_0,b_{-1})}{P^{\pi}(b_0|b_{-1})} \Big) P(b_0|a_0,b_{-1})\, \pi(a_0|b_{-1}), \quad b_{-1} \in \{0,1\}, \quad \overset{(\gamma)}{=} (n+1)\big[ H(\lambda) - \nu H(\alpha) - (1-\nu)H(\beta) \big], \quad (C.199)
\]
where $(\alpha)$ holds by definition (equation (III.69)), $(\beta)$ holds due to (III.87) evaluated at $t = 0$, and $(\gamma)$ by substituting the time-invariant capacity achieving input distribution (IV.151), the corresponding optimal output distribution (IV.152), and any value of $b_{-1} \in \{0,1\}$.
(b) Holds by definition (equation (II.27)).

APPENDIX D
PROOF OF THEOREM IV.2

(a) By employing the dynamic programming recursion for the constrained problem (III.90), we can show that the value function at the terminal time is independent of $b_{n-1}$. Therefore, by Theorem III.2, the optimization problem is non-nested, and the dynamic programming recursion for the constrained capacity is given by
\[
V_i(b_{i-1}) = \sup_{\pi(a_i|b_{i-1}),\ s\leq 0} \Big\{ \sum_{a_i\in\mathcal{A}} \sum_{b_i\in\mathcal{B}} \log\Big( \frac{P(b_i|b_{i-1},a_i)}{P^{\pi}(b_i|b_{i-1})} \Big) P(b_i|b_{i-1},a_i)\, \pi(a_i|b_{i-1}) + s\Big( \sum_{a_i\in\mathcal{A}} \gamma(a_i,b_{i-1})\, \pi(a_i|b_{i-1}) - \kappa \Big) \Big\}, \quad \forall i = 0,1,\ldots,n. \quad (D.200)
\]
Differentiating (D.200) with respect to the Lagrange multiplier $s$, we obtain the optimal input distribution (IV.158). The optimal output distribution is then calculated from $P^{\pi}_n(b_n|b_{n-1}) = \sum_{a_n\in\mathcal{A}} P_n(b_n|a_n,b_{n-1})\, \pi(a_n|b_{n-1})$, to obtain (IV.159).
(b) Since (i) the optimal channel input conditional distribution and the channel output conditional distribution are time-invariant, and (ii) the value function $V_i(b_{i-1})$ is independent of $b_{i-1}$, $\forall i = 0,1,\ldots,n$, the proof is identical to the proof of Theorem IV.1.(b). The value $\kappa_{\max}$ is obtained when the Lagrange multiplier $s = 0$, i.e., when the constrained optimization problem is equivalent to the unconstrained optimization problem. In this case, the optimal channel input conditional distribution for the constrained case is equal to that for the unconstrained case; thus $\kappa|_{s=0} = \kappa_{\max} = \nu$.

REFERENCES

[1] C. E. Shannon, "A mathematical theory of communication,"
Bell System Technical Journal, vol. 27, pp. 379-423, October 1948.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley-Interscience, 2006.
[3] C. E. Shannon, "The zero error capacity of a noisy channel," IRE Transactions on Information Theory, vol. 2, no. 3, pp. 112-124, 1956.
[4] R. L. Dobrushin, "Information transmission in a channel with feedback," Theory of Probability and its Applications, vol. 3, no. 4, pp. 367-383, 1958.
[5] T. M. Cover and S. Pombra, "Gaussian feedback capacity," IEEE Transactions on Information Theory, vol. 35, no. 1, pp. 37-43, 1989.
[6] S. Ihara, Information Theory for Continuous Systems. World Scientific, 1993.
[7] H. Marko, "The bidirectional communication theory: a generalization of information theory," IEEE Transactions on Communications, vol. 21, no. 12, pp. 1345-1351, December 1973.
[8] J. L. Massey, "Causality, feedback and directed information," in Proceedings of the International Symposium on Information Theory and its Applications (ISITA), November 1990, pp. 303-305.
[9] C. K. Kourtellaris and C. D. Charalambous, "Information structures of capacity achieving distributions for feedback channels with memory and transmission cost: stochastic optimal control & variational equalities - part I," arXiv preprint arXiv:1512.04514, 2015.
[10] C. K. Kourtellaris and C. D. Charalambous, "Information structures of capacity achieving distribution for channels with memory and feedback," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), June 2016.
[11] N. Sen, F. Alajaji, and S. Yuksel, "Feedback capacity of a class of symmetric finite-state Markov channels," IEEE Transactions on Information Theory, vol. 57, no. 7, pp. 4110-4122, July 2011.
[12] H. Permuter, P. Cuff, B. Van Roy, and T. Weissman, "Capacity of the trapdoor channel with feedback," IEEE Transactions on Information Theory, vol. 54, no. 7, pp. 3150-3165, July 2008.
[13] O. Elishco and H. Permuter, "Capacity and coding for the Ising channel with feedback," IEEE Transactions on Information Theory, vol. 60, no. 9, pp. 5138-5149, September 2014.
[14] T. Berger, "Living information theory," IEEE Information Theory Society Newsletter, vol. 53, no. 1, March 2003.
[15] J. Chen and T. Berger, "The capacity of finite-state Markov channels with feedback," IEEE Transactions on Information Theory, vol. 51, no. 3, pp. 780-798, March 2005.
[16] H. Asnani, H. Permuter, and T. Weissman, "Capacity of a POST channel with and without feedback," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), 2013, pp. 2538-2542.
[17] H. Permuter, H. Asnani, and T. Weissman, "Capacity of a POST channel with and without feedback," IEEE Transactions on Information Theory, vol. 60, no. 10, pp. 6041-6057, October 2014.
[18] C. K. Kourtellaris and C. D. Charalambous, "Capacity of binary state symmetric channel with and without feedback and transmission cost," in Proceedings of the IEEE Information Theory Workshop (ITW), April 2015, pp. 1-5.
[19] C. K. Kourtellaris, C. D. Charalambous, and J. J. Boutros, "Nonanticipative transmission for sources and channels with memory," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), June 2015, pp. 521-525.
[20] F. Jelínek, Probabilistic Information Theory: Discrete and Memoryless Models. McGraw-Hill, 1968.
[21] M. Gastpar, "To code or not to code," Ph.D. dissertation, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, 2002.
[22] S. Tatikonda, "Control under communication constraints," Ph.D. dissertation, M.I.T., Cambridge, MA, 2000.
[23] C. D. Charalambous and P. A. Stavrou, "Directed information on abstract spaces: properties and variational equalities," IEEE Transactions on Information Theory, vol. PP, no. 99, pp. 1-1, 2016.
[24] P. A. Stavrou, C. D. Charalambous, and C. K. Kourtellaris, "Sequential necessary and sufficient conditions for optimal channel input distributions of channels with memory and feedback," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), 2016, pp. 300-304.
[25] R. G. Gallager, Information Theory and Reliable Communication. New York, NY: John Wiley & Sons, 1968.
[26] D. Luenberger, Optimization by Vector Space Methods. Wiley, 1969.
[27] O. Hernández-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer-Verlag, 1996.
[28] P. R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control. Prentice Hall, 1986.
[29] H. Permuter, T. Weissman, and A. Goldsmith, "Capacity of finite-state channels with time-invariant deterministic feedback," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), July 2006, pp. 64-68.
[30] C. E. Shannon, "Coding theorems for a discrete source with a fidelity criterion," in IRE National Convention Record, Part 4, 1959, pp. 142-163.
[31] D. Bertsekas,