Single Letter Expression of Capacity for a Class of Channels with Memory
Christos K. Kourtellaris, Charalambos D. Charalambous, Ioannis Tzortzis
Abstract—We study finite alphabet channels with Unit Memory on the previous Channel Outputs, called UMCO channels. We identify necessary and sufficient conditions to test whether the capacity achieving channel input distributions with feedback are time-invariant, and whether feedback capacity is characterized by single letter expressions, similar to that of memoryless channels. The method is based on showing that a certain dynamic programming equation, which in general is a nested optimization problem over the sequence of channel input distributions, reduces to a non-nested optimization problem. Moreover, for UMCO channels, we give a simple expression for the maximum likelihood (ML) error exponent, and we identify sufficient conditions to test whether feedback does not increase capacity. We derive similar results when transmission cost constraints are imposed. We apply the results to a special class of UMCO channels, the Binary State Symmetric Channel (BSSC), with and without transmission cost constraints, to show that the optimization problem of feedback capacity is non-nested, that the capacity achieving channel input distribution and the corresponding channel output transition probability distribution are time-invariant, and that feedback capacity is characterized by a single letter formula, precisely as Shannon's single letter characterization of the capacity of memoryless channels. We then derive closed form expressions for the capacity achieving channel input distribution and feedback capacity, and use them to evaluate an error exponent for ML decoding.
I. INTRODUCTION
Shannon, in his landmark paper [1], showed that the capacity of Discrete Memoryless Channels (DMCs) $\{\mathcal{A}, \mathcal{B}, \{P_{B|A}(b|a) : (a,b) \in \mathcal{A} \times \mathcal{B}\}\}$ is characterized by the celebrated single letter formula
$$C \triangleq \max_{P_A} I(A;B). \qquad \text{(I.1)}$$
This is often shown by using the converse to the channel coding theorem to obtain the upper bounds [2]
$$C^{noFB}_{A^n;B^n} \triangleq \max_{P_{A^n}} I(A^n;B^n) \le \max_{P_{A_i},\, i=0,\ldots,n} \sum_{i=0}^{n} I(A_i;B_i) \le (n+1)\, C \qquad \text{(I.2)}$$
which are achievable if and only if the channel input distribution satisfies the conditional independence $P_{A_i|A^{i-1}} = P_{A_i}$, $i = 0,1,\ldots,n$, and $\{A_i : i = 0,1,\ldots\}$ is identically distributed, which implies that the joint process $\{(A_i,B_i) : i = 0,1,\ldots\}$ is independent and identically distributed, and hence stationary ergodic.
For DMCs, it is shown by Shannon [3] and Dobrushin [4] that feedback codes do not incur a higher capacity compared to that of codes without feedback, that is, $C^{FB} = C$. This is often shown by first applying the converse to the coding theorem to deduce that feedback does not increase capacity [5], that is, $C^{FB} \le C^{noFB} = C$, which then implies that any candidate optimal channel input distribution with feedback $\{P_{A_i|A^{i-1},B^{i-1}} : i = 0,\ldots,n\}$ satisfies the conditional independence
$$P_{A_i|A^{i-1},B^{i-1}}(da_i|a^{i-1},b^{i-1}) = P_{A_i}(da_i), \quad i = 0,1,\ldots,n \qquad \text{(I.3)}$$
and hence the identity $C^{FB} = C^{noFB} = C$ holds if $\{A_i : i = 0,1,\ldots\}$ is identically distributed.

C. K. Kourtellaris, C. D. Charalambous and I. Tzortzis are with the Department of Electrical and Computer Engineering, University of Cyprus, Nicosia, Cyprus. Email: [email protected], [email protected], [email protected]. This work was financially supported by a medium size University of Cyprus grant entitled "DIMITRIS" and by QNRF, a member of Qatar Foundation, under the project NPRP 6-784-2-329.

For general channels with memory defined by $\{P_{B_i|B^{i-1},A^i} : i = 0,1,\ldots,n\}$, $P_{B_0|B^{-1},A^0} = P_{B_0|B_{-1},A_0}$, where $B_{-1}$ is the initial state, feedback codes in general incur a higher capacity than codes without feedback [2], [6]. The information measure often employed to characterize the feedback capacity of such channels is Marko's directed information [7], put forward by Massey [8], and defined by
$$I(A^n \to B^n) = \sum_{i=0}^{n} I(A^i;B_i|B^{i-1}) \triangleq \sum_{i=0}^{n} \int \log\Big(\frac{dP_{B_i|B^{i-1},A^i}(\cdot|b^{i-1},a^i)}{dP_{B_i|B^{i-1}}(\cdot|b^{i-1})}(b_i)\Big)\, P_{A^i,B^i}(da^i,db^i). \qquad \text{(I.4)}$$
Indeed, Massey [8] showed that the per unit time limit of the supremum of directed information over channel input distributions $\mathcal{P}^{FB}_{[0,n]} \triangleq \{P_{A_i|A^{i-1},B^{i-1}} : i = 0,\ldots,n\}$, defined by
$$C^{FB}_{A^\infty \to B^\infty} = \lim_{n\to\infty} \frac{1}{n+1}\, C^{FB}_{A^n \to B^n}, \qquad C^{FB}_{A^n \to B^n} \triangleq \sup_{\mathcal{P}^{FB}_{[0,n]}} I(A^n \to B^n) \qquad \text{(I.5)}$$
gives a tight bound on any achievable rate of feedback codes, and hence $C^{FB}_{A^\infty \to B^\infty}$ is a candidate for the capacity of feedback codes. However, for channels with memory, it is generally not known whether the multi-letter expression of capacity, (I.5), can be reduced to a single letter expression analogous to (I.1).
Our main objective is to provide a framework for a single letter characterization of feedback capacity for a general class of channels with memory. Towards this direction, we provide conditions on channels with memory such that
$$C^{FB}_{A^n \to B^n} = (n+1)\, C^{FB} \qquad \text{(I.6)}$$
where $C^{FB}$ is a single letter expression similar to that of DMCs. Specifically, for channels of the form $\{P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\}$, where $B_{-1} = b_{-1} \in \mathcal{B}_{-1}$ is the initial state, we give necessary and sufficient conditions such that the following equality holds.
$$C^{FB}_{A^n \to B^n} = (n+1) \sup_{P_{A_0|B_{-1}}(\cdot|b_{-1})} I(A_0;B_0|b_{-1}), \quad \forall b_{-1} \in \mathcal{B}_{-1}. \qquad \text{(I.7)}$$
That is, the single letter expression is $C^{FB} \triangleq \sup_{P_{A_0|B_{-1}}(\cdot|b_{-1})} I(A_0;B_0|b_{-1})$, and it is independent of the initial state $b_{-1} \in \mathcal{B}_{-1}$.
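For the memoryless baseline (I.1), the maximization over $P_A$ is a finite-dimensional convex problem that can be carried out numerically. The following minimal sketch (ours, not part of the paper's development; the channel matrix, tolerance, and iteration cap are illustrative) computes $C = \max_{P_A} I(A;B)$ for a DMC with the standard Blahut-Arimoto iteration.

```python
import numpy as np

def blahut_arimoto(P, tol=1e-12, max_iter=10_000):
    """C = max_{P_A} I(A;B) in bits, cf. (I.1), for a DMC with
    row-stochastic P[a, b] = P_{B|A}(b|a)."""
    m = P.shape[0]
    p = np.full(m, 1.0 / m)                  # start from the uniform input
    for _ in range(max_iter):
        q = p @ P                            # induced output distribution
        with np.errstate(divide="ignore", invalid="ignore"):
            D = np.nansum(P * np.log2(P / q), axis=1)  # D(P(.|a) || q)
        p_new = p * np.exp2(D)
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    q = p @ P
    with np.errstate(divide="ignore", invalid="ignore"):
        C = float(np.nansum(p[:, None] * P * np.log2(P / q)))
    return C, p

# Binary symmetric channel, crossover 0.1: C = 1 - H(0.1) ~ 0.5310 bits
print(blahut_arimoto(np.array([[0.9, 0.1], [0.1, 0.9]])))
```

For the binary symmetric channel with crossover $0.1$ this returns approximately $0.5310$ bits at the uniform input, matching $1 - H(0.1)$. The results of this paper identify when a channel with memory admits the same kind of single letter, per-state optimization.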
A. Main Results and Methodology

First, we consider channels with Unit Memory on the previous Channel Output (UMCO), defined by
$$P_{B_i|B^{i-1},A^i} = P_{B_i|B_{i-1},A_i}, \quad i = 0,1,\ldots,n \qquad \text{(I.8)}$$
with and without a transmission cost constraint defined by
$$\frac{1}{n+1}\,\mathbf{E}\Big\{\sum_{i=0}^{n} \gamma^{UM}_i(A_i,B_{i-1})\Big\} \qquad \text{(I.9)}$$
where $\gamma^{UM}_i : \mathcal{A}_i \times \mathcal{B}_{i-1} \longmapsto [0,\infty)$. We identify necessary and sufficient conditions on the channel so that the optimization problem $C^{FB}_{A^n\to B^n}$, which is generally a nested optimization problem, often dealt with via dynamic programming, reduces to a non-nested optimization problem. These conditions give rise to a single letter characterization of feedback capacity. Among other results, we derive sufficient conditions for feedback not to increase capacity, and identify sufficient conditions for asymptotic stationarity of the optimal channel input distribution and ergodicity of the joint process $\{(A_i,B_i) : i = 0,1,\ldots\}$. Moreover, we give an upper bound on the error probability of maximum likelihood decoding. We also treat problems with transmission cost constraints.
Second, we apply the framework of the UMCO channel to the Binary State Symmetric Channel (BSSC), defined by
$$P_{B_i|A_i,B_{i-1}}(b_i|a_i,b_{i-1}) = \begin{array}{c|cccc} b_i \backslash (a_i,b_{i-1}) & (0,0) & (0,1) & (1,0) & (1,1) \\ \hline 0 & \alpha & \beta & 1-\beta & 1-\alpha \\ 1 & 1-\alpha & 1-\beta & \beta & \alpha \end{array}, \quad i = 0,1,\ldots,n, \quad (\alpha,\beta)\in[0,1]\times[0,1] \qquad \text{(I.10)}$$
with and without a transmission cost constraint defined by
$$\frac{1}{n+1}\,\mathbf{E}\Big\{\sum_{i=0}^{n} \gamma(A_i,B_{i-1})\Big\} \le \kappa, \qquad \gamma(a_i,b_{i-1}) = \overline{a_i \oplus b_{i-1}}, \quad \kappa \in [0,\kappa_{max}] \qquad \text{(I.11)}$$
where $\overline{x \oplus y}$ denotes the complement of the modulo-2 addition of $x$ and $y$. We calculate the capacity achieving channel input distribution with feedback without cost constraint and show that it is time-invariant. This illustrates that feedback capacity satisfies (I.6), is independent of the initial state $B_{-1} = b_{-1}$, and is characterized by
$$C^{FB}_{A^\infty\to B^\infty} = \sup_{P_{A_0|B_{-1}}} I(A_0;B_0|b_{-1}), \quad \forall b_{-1} \in \mathcal{B}_{-1} \qquad \text{(I.12)}$$
$$= H(\lambda) - \nu H(\alpha) - (1-\nu) H(\beta) \qquad \text{(I.13)}$$
where $\lambda, \nu$ are functions of the channel parameters $\alpha, \beta$ (see Theorem IV.1). The characterization (I.12) is precisely analogous to the single letter characterizations (I.1) and (I.2) of the capacity of DMCs. Additionally, we provide the error exponent evaluated at the capacity achieving channel input distribution with feedback, and we derive an upper bound on the error probability of maximum likelihood decoding which is easy to compute (see Section IV-A3). Finally, we show that a time-invariant first order Markov channel input distribution without feedback achieves the feedback capacity (I.13), and we give closed form expressions both for the capacity achieving channel input distribution and the corresponding channel output distribution. We also treat the case with cost constraint.
The main mathematical concept we invoke to obtain the above results is the structural properties of the optimal channel input distributions, [9], [10]. Specifically, the following.
(a) For channels with infinite memory on the previous channel outputs, defined by $P_{B_i|B^{i-1},A^i} = P_{B_i|B^{i-1},A_i}$, the maximization of directed information $I(A^n\to B^n)$ occurs in the subset satisfying the conditional independence $\{P_{A_i|A^{i-1},B^{i-1}} = P_{A_i|B^{i-1}} : i = 0,\ldots,n\}$.
(b) For channels with limited memory of order $M$, defined by $P_{B_i|B^{i-1},A^i} = P_{B_i|B^{i-1}_{i-M},A_i}$, the maximization of directed information $I(A^n\to B^n)$ occurs in the subset satisfying the conditional independence $\{P_{A_i|A^{i-1},B^{i-1}} = P_{A_i|B^{i-1}_{i-M}} : i = 0,\ldots,n\}$.
(c) For the UMCO channel, the maximization of directed information $I(A^n\to B^n)$ occurs in the subset satisfying the conditional independence $\{P_{A_i|A^{i-1},B^{i-1}} = P_{A_i|B_{i-1}} : i = 0,\ldots,n\}$.
The structural properties (a), (b) and (c), along with the fact that $C^{FB}_{A^n\to B^n} \ge C^{noFB}_{A^n;B^n}$, are employed in Section II to provide sufficient conditions for feedback not to increase capacity. Moreover, the structural property of the UMCO channel, (c), is applied in Section III to construct the finite horizon dynamic programming, the necessary and sufficient conditions on the capacity achieving input distribution, and the necessary and sufficient conditions for the non-nested optimization of feedback capacity. The methodology and the corresponding theorems of Section III can be easily extended to channels with finite memory on previous channel outputs by invoking the structural properties of the capacity achieving distributions for these channels.
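To make the single letter objective in (I.12) concrete, the following sketch (ours, not from the paper; the parameter values are illustrative) encodes the BSSC kernel as arranged from the matrix in (I.10) and maximizes $I(A_0;B_0|B_{-1}=b_{-1})$ over $\pi(\cdot|b_{-1})$ by a grid search, which is adequate here because the objective is concave in $\pi$.

```python
import numpy as np

def bssc_kernel(alpha, beta):
    """P[s, a, b] = P_{B_i|A_i,B_{i-1}}(b | a, b_prev = s) for the
    BSSC(alpha, beta), arranged from the matrix in (I.10)."""
    P = np.empty((2, 2, 2))
    P[0, 0] = [alpha, 1 - alpha]          # (a, b_prev) = (0, 0)
    P[1, 0] = [beta, 1 - beta]            # (0, 1)
    P[0, 1] = [1 - beta, beta]            # (1, 0)
    P[1, 1] = [1 - alpha, alpha]          # (1, 1)
    return P

def single_letter_I(P, s, p0):
    """I(A_0; B_0 | B_{-1} = s) in bits for pi(0|s) = p0."""
    pi = np.array([p0, 1.0 - p0])
    W = P[s]                              # W[a, b]
    q = pi @ W                            # output distribution given s
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.nansum(pi[:, None] * W * np.log2(W / q))

P = bssc_kernel(alpha=0.9, beta=0.2)      # illustrative parameters
grid = np.linspace(0.0, 1.0, 2001)
for s in (0, 1):
    vals = [single_letter_I(P, s, p0) for p0 in grid]
    print(s, grid[int(np.argmax(vals))], max(vals))
```

The two printed maxima should coincide (the channels seen from the two states are input/output relabelings of each other), consistent with the claim in (I.12) that the single letter value does not depend on the initial state; the common value can be compared with the closed form (I.13).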
B. Relation to the Literature

Although for several years significant effort has been devoted to the study of channels with memory, with or without feedback, explicit or closed form expressions for the capacity of such channels are limited to a few cases. For non-stationary non-ergodic Additive Gaussian Noise (AGN) channels with memory, Cover and Pombra [5] showed that feedback codes can increase capacity by at most half a bit. On the other hand, for a finite alphabet version of the Cover and Pombra channel with certain symmetry, Alajaji [11] showed that feedback does not increase capacity. Moreover, Permuter, Cuff, Van Roy and Weissman [12] derived the feedback capacity of the trapdoor channel, while Elishco and Permuter [13] employed dynamic programming to evaluate the feedback capacity of the Ising channel.
The capacity of channels $\{P_{B_i|B_{i-1},A_i} : i = 0,\ldots,n\}$ for feedback codes is analyzed by Berger [14] and Chen and Berger [15], under the assumption that the capacity achieving distribution satisfies the conditional independence property $P_{A_i|A^{i-1},B^{i-1}} = P_{A_i|B_{i-1}}(a_i|b_{i-1})$, $i = 0,1,\ldots,n$. A derivation of this structural property of the capacity achieving distribution is given in [9], [10].
Recently, Permuter, Asnani and Weissman [16], [17] derived the feedback capacity of a Binary-Input Binary-Output (BIBO) channel, called the Previous Output STate (POST) channel, where the current state of the channel is the previously received symbol. The authors in [17] showed, among other results, that feedback does not increase capacity. It can be shown that the POST channel is, within a transformation, equivalent to the Binary State Symmetric Channel (BSSC) [18], in which the state of the channel is defined as the modulo-2 addition of the current input symbol and the previous output symbol. When there are no transmission cost constraints, our results for the BSSC complement existing results obtained in [16], [17] regarding the POST channel, in the sense that we show the time-invariant properties of the capacity achieving distributions, which implies the single letter characterization of feedback capacity, we derive closed form expressions for these distributions, provide an upper bound on the error probability of maximum likelihood decoding, and we show that a first-order Markov channel input distribution without feedback achieves feedback capacity. Moreover, we derive similar closed form expressions when average transmission cost constraints are imposed.
A portion of the results established in this paper was utilized to construct a Joint Source Channel Coding (JSCC) scheme for the BSSC with a cost constraint and the Binary Symmetric Markov Source (BSMS) with single letter Hamming distortion measure [19]. The scheme is a natural generalization of the JSCC design (uncoded transmission) of an Independent and Identically Distributed (IID) Bernoulli source over a Binary Symmetric Channel (BSC) [20], [21].
The remainder of the paper is organized as follows. In Section II, we introduce the mathematical formulation and identify sufficient conditions for feedback not to increase capacity. In Section III, we identify sufficient conditions to test whether the capacity achieving input distribution is time-invariant. The results are then extended to the infinite horizon case. In Section IV, we apply the main theorems of Section III to the BSSC, with and without feedback and with and without cost constraint, to prove, among other results, that capacity is given by a single letter characterization.
Finally, Section V delivers our concluding remarks.

II. FORMULATION & PRELIMINARY RESULTS
In this section we introduce the definitions of feedback capacity and of capacity without feedback, and we identify necessary and sufficient conditions for feedback not to increase capacity.
A. Notation and Definitions
The probability distribution of a Random Variable (RV) defined on a probability space $(\Omega,\mathcal{F},\mathbb{P})$ by the mapping $X : (\Omega,\mathcal{F}) \longmapsto (\mathcal{X},\mathcal{B}(\mathcal{X}))$ is denoted by $\mathbf{P}(\cdot) \equiv P_X(\cdot)$. The space of probability distributions on $\mathcal{X}$ is denoted by $\mathcal{M}(\mathcal{X})$. A RV is called discrete if there exists a countable set $\mathcal{S}$ such that $\sum_{x_i \in \mathcal{S}} \mathbb{P}\{\omega \in \Omega : X(\omega) = x_i\} = 1$. The probability distribution $P_X(\cdot)$ is then concentrated on points in $\mathcal{S}$, and it is defined by
$$P_X(A) \triangleq \sum_{x_i \in \mathcal{S} \cap A} \mathbb{P}\{\omega \in \Omega : X(\omega) = x_i\}, \quad \forall A \in \mathcal{B}(\mathcal{X}). \qquad \text{(II.14)}$$
Given another RV $Y : (\Omega,\mathcal{F}) \mapsto (\mathcal{Y},\mathcal{B}(\mathcal{Y}))$, $P_{Y|X}(dy|x)(\omega)$ is the conditional distribution of the RV $Y$ given $X$. For a fixed $X = x$ we denote the conditional distribution by $P_{Y|X}(dy|X=x) = P_{Y|X}(dy|x)$. Let $\mathbb{Z}$ denote the set of integers, $\mathbb{N} \triangleq \{0,1,2,\ldots\}$, and $\mathbb{N}^n \triangleq \{0,1,2,\ldots,n\}$. The channel input and channel output spaces are sequences of measurable spaces $\{(\mathcal{A}_i,\mathcal{B}(\mathcal{A}_i)) : i \in \mathbb{Z}\}$ and $\{(\mathcal{B}_i,\mathcal{B}(\mathcal{B}_i)) : i \in \mathbb{Z}\}$, respectively, while their product spaces are $\mathcal{A}^{\mathbb{Z}} \triangleq \times_{i\in\mathbb{Z}} \mathcal{A}_i$, $\mathcal{B}^{\mathbb{Z}} \triangleq \times_{i\in\mathbb{Z}} \mathcal{B}_i$, $\mathcal{B}(\mathcal{A}^{\mathbb{Z}}) \triangleq \otimes_{i\in\mathbb{Z}} \mathcal{B}(\mathcal{A}_i)$, $\mathcal{B}(\mathcal{B}^{\mathbb{Z}}) \triangleq \otimes_{i\in\mathbb{Z}} \mathcal{B}(\mathcal{B}_i)$. Points in the product spaces are denoted by $a^n \triangleq \{\ldots,a_{-1},a_0,a_1,\ldots,a_n\} \in \mathcal{A}^n$ and $b^n \triangleq \{\ldots,b_{-1},b_0,b_1,\ldots,b_n\} \in \mathcal{B}^n$, $n \in \mathbb{Z}$.

B. Capacity with Feedback & Properties
Next, we provide the precise formulation of information capacity and some preliminary results. We begin by introducing the definitions of the channel distribution, channel input distribution, transmission cost constraint, and feedback code.
Definition II.1. (Channel distribution with memory) A sequence of conditional distributions defined by
$$\mathcal{C}_{[0,n]} \triangleq \big\{P_{B_i|B^{i-1},A^i}(db_i|b^{i-1},a^i) = P_{B_i|B_{i-1},A_i}(db_i|b_{i-1},a_i) : i = 0,\ldots,n\big\}. \qquad \text{(II.15)}$$
At time $i = 0$ the conditional distribution is $P_{B_0|B_{-1},A_0}(db_0|b_{-1},a_0)$, where $B_{-1} = b_{-1} \in \mathcal{B}_{-1}$ is the initial data. The initial data $b_{-1} \in \mathcal{B}_{-1}$ denote the initial state of the channel, and this should not be misinterpreted as feedback information. In this work we assume that the initial data are known both to the encoder and the decoder, unless we state otherwise.

Definition II.2. (Channel input distribution with feedback) A sequence of conditional distributions defined by
$$\mathcal{P}^{FB}_{[0,n]} \triangleq \big\{P_{A_i|A^{i-1},B^{i-1}}(da_i|a^{i-1},b^{i-1}) : i = 0,\ldots,n\big\}. \qquad \text{(II.16)}$$
At time $i = 0$ the conditional distribution is $P_{A_0|A^{-1},B^{-1}}(da_0|a^{-1},b^{-1}) = P_{A_0|B_{-1}}(da_0|b_{-1})$. That is, the information structure of the channel input distribution is $\mathcal{I}^{FB}_i \triangleq \{b_{-1},a_0,b_0,a_1,b_1,\ldots,a_{i-1},b_{i-1}\}$, for $i = 1,\ldots,n$. For $i = 0$ the convention is $\mathcal{I}^{FB}_0 \triangleq \{a^{-1},b^{-1}\} = \{b_{-1}\}$, which states that the channel input distribution depends only on the initial data.

Definition II.3. (Transmission cost constraints) The cost of transmitting symbols over the channel (II.15) is a measurable function $c_{0,n} : \mathcal{A}^n \times \mathcal{B}^{n-1} \longmapsto [0,\infty)$ defined by
$$c_{0,n}(a^n,b^{n-1}) \triangleq \sum_{i=0}^{n} \gamma_i(a_i,b_{i-1}). \qquad \text{(II.17)}$$
The transmission cost constraint is defined by
$$\mathcal{P}^{FB}_{[0,n]}(\kappa) \triangleq \Big\{P_{A_i|A^{i-1},B^{i-1}},\ i = 0,\ldots,n : \frac{1}{n+1}\,\mathbf{E}_\mu\big\{c_{0,n}(A^n,B^{n-1})\big\} \le \kappa\Big\}, \quad \kappa \in [0,\infty] \qquad \text{(II.18)}$$
where the subscript notation $\mathbf{E}_\mu$ indicates that the joint distribution over which the expectation is taken is parametrized by the initial distribution $P_{B_{-1}}(db_{-1}) = \mu(db_{-1})$ (and of course the channel input distribution).

Definition II.4. (Feedback code) A feedback code for the channel defined by (II.15) with transmission cost constraint $\mathcal{P}^{FB}_{[0,n]}(\kappa)$ is a sequence $\{(n,M_n,\varepsilon_n) : n = 0,1,\ldots\}$, which consists of the following elements.
(a) A set of uniformly distributed messages $\mathcal{M}_n \triangleq \{1,\ldots,M_n\}$ and a set of encoding strategies, mapping messages into channel inputs of block length $(n+1)$, defined by
$$\mathcal{E}^{FB}_{[0,n]}(\kappa) \triangleq \Big\{g_i : \mathcal{M}_n \times \mathcal{A}^{i-1} \times \mathcal{B}^{i-1} \longmapsto \mathcal{A}_i,\ a_0 = g_0(w,b_{-1}),\ a_1 = g_1(w,b_{-1},a_0,b_0),\ \ldots,\ a_n = g_n(w,b_{-1},a_0,b_0,\ldots,a_{n-1},b_{n-1}),\ w \in \mathcal{M}_n : \frac{1}{n+1}\,\mathbf{E}^g\big(c_{0,n}(A^n,B^{n-1})\big) \le \kappa\Big\}, \quad n = 0,1,\ldots. \qquad \text{(II.19)}$$
The codeword for any $w \in \mathcal{M}_n$ is $u_w \in \mathcal{A}^n$, $u_w = (g_0(w,b_{-1}), g_1(w,b_{-1},a_0,b_0), \ldots, g_n(w,b_{-1},a_0,b_0,\ldots,a_{n-1},b_{n-1}))$, and $\mathcal{C}_n = (u_1,u_2,\ldots,u_{M_n})$ is the code for the message set $\mathcal{M}_n$, with $\{A^{-1},B^{-1}\} = \{b_{-1}\}$. In general, the code depends on the initial data $B_{-1} = b_{-1}$ (depending on the convention), which are known to the encoder and the decoder (unless specified otherwise). Alternatively, we can take $\{A^{-1},B^{-1}\} = \{\emptyset\}$.
(b) Decoder measurable mappings $d_{0,n} : \mathcal{B}^n \longmapsto \mathcal{M}_n$, such that the average probability of decoding error satisfies
$$P^{(n)}_e \triangleq \frac{1}{M_n} \sum_{w\in\mathcal{M}_n} \mathbf{P}^g\big\{d_{0,n}(B^n) \neq w \,|\, W = w\big\} \equiv \mathbf{P}^g\big\{d_{0,n}(B^n) \neq W\big\} \le \varepsilon_n$$
and the decoder may also assume knowledge of the initial data.
(The superscript on the expectation, i.e., $\mathbf{E}^g$, indicates the dependence of the distribution on the encoding strategies. If $B_{-1} = b_{-1}$ is fixed, then $\mu(\cdot) = \delta_{b_{-1}}(\cdot)$ is a Dirac (delta) measure concentrated at $B_{-1} = b_{-1}$.)
The coding rate or transmission rate over the channel is defined by $r_n \triangleq \frac{1}{n+1}\log M_n$. A rate $R$ is said to be an achievable rate if there exists a code sequence satisfying $\lim_{n\to\infty} \varepsilon_n = 0$ and $\liminf_{n\to\infty} \frac{1}{n+1}\log M_n \ge R$. The operational definition of feedback capacity of the channel is the supremum of all achievable rates, i.e., $C \triangleq \sup\{R : R \text{ is achievable}\}$.
Given any channel input distribution $\{P_{A_i|A^{i-1},B^{i-1}} : i = 0,1,\ldots,n\} \in \mathcal{P}^{FB}_{[0,n]}$, a channel distribution $\{P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\}$, and a fixed initial distribution $\mu(db_{-1})$, the induced joint distribution $P_{A^n,B^n}$ parametrized by $\mu(\cdot)$ is uniquely defined, and a probability space $(\Omega,\mathcal{F},\mathbb{P})$ carrying the sequence of RVs $(A^n,B^n) \triangleq \{B_{-1},A_0,B_0,A_1,B_1,\ldots,A_n,B_n\}$ is constructed as follows.
$$\mathbb{P}\{A^n \in da^n, B^n \in db^n\} \triangleq P_{A^n,B^n}(da^n,db^n) = \otimes_{j=0}^{n}\big(P_{B_j|B_{j-1},A_j}(db_j|b_{j-1},a_j) \otimes P_{A_j|A^{j-1},B^{j-1}}(da_j|a^{j-1},b^{j-1})\big) \otimes \mu(db_{-1}). \qquad \text{(II.20)}$$
$$\mathbb{P}\{B^n \in db^n\} \triangleq P_{B^n}(db^n) = \int_{\mathcal{A}^n} P_{A^n,B^n}(da^n,db^n). \qquad \text{(II.21)}$$
$$P_{B_i|B^{i-1}}(db_i|b^{i-1}) = \int_{\mathcal{A}^i} P_{B_i|B_{i-1},A_i}(db_i|b_{i-1},a_i) \otimes P_{A_i|A^{i-1},B^{i-1}}(da_i|a^{i-1},b^{i-1}) \otimes P_{A^{i-1}|B^{i-1}}(da^{i-1}|b^{i-1}), \quad i = 1,\ldots,n. \qquad \text{(II.22)}$$
$$P_{B_0|B_{-1}}(db_0|b_{-1}) = \int_{\mathcal{A}_0} P_{B_0|B_{-1},A_0}(db_0|b_{-1},a_0) \otimes P_{A_0|B_{-1}}(da_0|b_{-1}). \qquad \text{(II.23)}$$
The Directed Information from $A^n \triangleq \{A_0,A_1,\ldots,A_n\}$ to $B^n \triangleq \{B_0,B_1,\ldots,B_n\}$, conditioned on $B_{-1}$, is defined by [7], [8]
$$I(A^n \to B^n) \triangleq \sum_{i=0}^{n} I(A^i;B_i|B^{i-1}) = \sum_{i=0}^{n} I(A_i;B_i|B_{i-1}) = \sum_{i=0}^{n} \int \log\Big(\frac{dP_{B_i|B_{i-1},A_i}(\cdot|b_{i-1},a_i)}{dP_{B_i|B_{i-1}}(\cdot|b_{i-1})}(b_i)\Big)\, P_{A_i,B_i}(da_i,db_i) \qquad \text{(II.24)}$$
$$\equiv I^{FB}_{A^n\to B^n}\big(P_{A_i|A^{i-1},B^{i-1}}, P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\big) \qquad \text{(II.25)}$$
where (II.24) follows from the channel definition, and the notation $I^{FB}_{A^n\to B^n}(\cdot,\cdot)$ indicates that $I(A^n\to B^n)$ is a functional of the sequences of channel input and channel distributions; its dependence on the initial distribution $\mu(\cdot)$ is suppressed. Define the information quantities
$$C^{FB}_{A^n\to B^n} \triangleq \sup_{\mathcal{P}^{FB}_{[0,n]}} I(A^n\to B^n), \qquad C^{FB}_{A^n\to B^n}(\kappa) \triangleq \sup_{\mathcal{P}^{FB}_{[0,n]}(\kappa)} I(A^n\to B^n). \qquad \text{(II.26)}$$
Under the assumption that $\{B_{-1},A_0,B_0,A_1,B_1,\ldots\}$ is jointly ergodic, or that $\frac{1}{n+1}\sum_{i=0}^{n}\log\frac{dP_{B_i|B_{i-1},A_i}(\cdot|B_{i-1},A_i)}{dP_{B_i|B_{i-1}}(\cdot|B_{i-1})}(B_i)$ is information stable [4], [22] and $\frac{1}{n+1}\sum_{i=0}^{n}\gamma_i(A_i,B_{i-1})$ is stable, the capacity of the channel with feedback, with and without transmission cost, is given by
$$C^{FB}_{A^\infty\to B^\infty} \triangleq \lim_{n\to\infty}\frac{1}{n+1}\, C^{FB}_{A^n\to B^n}, \qquad C^{FB}_{A^\infty\to B^\infty}(\kappa) \triangleq \lim_{n\to\infty}\frac{1}{n+1}\, C^{FB}_{A^n\to B^n}(\kappa). \qquad \text{(II.27)}$$
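For finite alphabets, the sum in (II.24) is directly computable once the channel kernel and a sequence of Markov input distributions $\pi_i(a_i|b_{i-1})$ are fixed. A minimal sketch (ours, not from the paper; the example kernel and the uniform policies are illustrative) that evaluates $I(A^n\to B^n) = \sum_i I(A_i;B_i|B_{i-1})$ by propagating the output distribution forward:

```python
import numpy as np

def directed_information(P, policies, mu):
    """I(A^n -> B^n) = sum_i I(A_i; B_i | B_{i-1}) in bits, cf. (II.24).
    P[s, a, b] = P(b | b_prev = s, a): time-invariant UMCO kernel;
    policies[i][s, a] = pi_i(a | b_prev = s); mu[s] = P(B_{-1} = s)."""
    p_prev = np.asarray(mu, dtype=float)       # distribution of B_{i-1}
    total = 0.0
    for pi in policies:
        q = np.einsum('sa,sab->sb', pi, P)     # P^pi(b | s), cf. (II.23)
        joint = p_prev[:, None, None] * pi[:, :, None] * P   # P(s, a, b)
        with np.errstate(divide="ignore", invalid="ignore"):
            total += np.nansum(joint * np.log2(P / q[:, None, :]))
        p_prev = joint.sum(axis=(0, 1))        # distribution of B_i
    return total

# Example: a 2-state binary kernel with uniform inputs at every stage.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],       # b_prev = 0: rows a = 0, 1
              [[0.8, 0.2], [0.1, 0.9]]])      # b_prev = 1
policies = [np.full((2, 2), 0.5)] * 5          # pi_i(a|s) = 1/2, i = 0..4
print(directed_information(P, policies, mu=[0.5, 0.5]))
```

The restriction to inputs of the form $\pi_i(a_i|b_{i-1})$ anticipates the structural property (II.34) below; the same forward propagation underlies the dynamic programming recursions of Section III.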
1) Convexity Properties:
Next, we recall the convexity properties of directed information with respect to a specific definition of channel input distributions, which is equivalent to the above definition. Any sequence of channel input distributions $\{P_{A_i|A^{i-1},B^{i-1}} : i = 0,1,\ldots,n\} \in \mathcal{P}^{FB}_{[0,n]}$ and channel distributions $\{P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\}$ uniquely defines the causal conditioned distributions
$$\overleftarrow{P}(da^n|b^{n-1}) \triangleq \otimes_{i=0}^{n} P_{A_i|A^{i-1},B^{i-1}}(da_i|a^{i-1},b^{i-1}), \qquad \text{(II.28)}$$
$$\overrightarrow{P}(db^n|a^n,b_{-1}) \triangleq \otimes_{i=0}^{n} P_{B_i|B_{i-1},A_i}(db_i|b_{i-1},a_i) \qquad \text{(II.29)}$$
and vice-versa, and these are parametrized by the initial data $b_{-1}$. Moreover, for a fixed $B_{-1} = b_{-1}$ we can formally define the joint distribution of $\{A_0,B_0,A_1,B_1,\ldots,A_n,B_n\}$ and the joint distribution of $\{B_0,B_1,\ldots,B_n\}$ conditioned on $B_{-1} = b_{-1}$ by
$$P^{\overleftarrow{P}}(da^n,db^n|b_{-1}) \triangleq (\overleftarrow{P} \otimes \overrightarrow{P})(da^n,db^n|b_{-1}), \qquad \text{(II.30)}$$
$$P^{\overleftarrow{P}}(db^n|b_{-1}) \triangleq \int_{\mathcal{A}^n} (\overleftarrow{P} \otimes \overrightarrow{P})(da^n,db^n|b_{-1}). \qquad \text{(II.31)}$$
Both distributions are parametrized by the initial data $b_{-1}$. Then, from [23], we have the following convexity properties of directed information.
(a) The set of conditional distributions defined by (II.28), $\overleftarrow{P}_{A^n|B^{n-1}}(\cdot|b^{n-1}) \in \mathcal{M}(\mathcal{A}^n)$, is convex.
(b) Directed information is equivalently expressed as follows.
$$I(A^n\to B^n) = \int \log\Big(\frac{d\overrightarrow{P}(\cdot|a^n,b_{-1})}{dP^{\overleftarrow{P}}(\cdot|b_{-1})}(b^n)\Big)\, P^{\overleftarrow{P}}(da^n,db^n|b_{-1}) \otimes \mu(db_{-1}) \equiv I^{FB}_{A^n\to B^n}(\overleftarrow{P},\overrightarrow{P}). \qquad \text{(II.32)}$$
(c) Directed information, $I^{FB}_{A^n\to B^n}(\overleftarrow{P},\overrightarrow{P})$, is concave with respect to $\overleftarrow{P}(\cdot|b^{n-1}) \in \mathcal{M}(\mathcal{A}^n)$ for a fixed $\overrightarrow{P}(\cdot|a^n,b_{-1}) \in \mathcal{M}(\mathcal{B}^n)$.
Since the set of conditional distributions with or without transmission cost constraints is convex, and directed information is a concave functional, the optimization problems (II.26) are convex, and we have the following theorem.

Theorem II.1. (Convexity properties) Assume the set $\mathcal{P}^{FB}_{[0,n]}(\kappa)$ is non-empty and the supremum of $I(A^n\to B^n)$ over the set of distributions $\mathcal{P}^{FB}_{[0,n]}(\kappa)$ is achieved (i.e., it exists). Then the following hold.
(a) $C^{FB}_{A^n\to B^n}(\kappa)$ is a non-decreasing, concave function of $\kappa \in [0,\infty]$.
(b) An alternative characterization of $C^{FB}_{A^n\to B^n}(\kappa)$ is given by
$$C^{FB}_{A^n\to B^n}(\kappa) = \sup_{\overleftarrow{P} :\ \frac{1}{n+1}\mathbf{E}_\mu\{c_{0,n}(a^n,b^{n-1})\} = \kappa} I^{FB}_{A^n\to B^n}(\overleftarrow{P},\overrightarrow{P}), \quad \text{for } \kappa \le \kappa_{max} \qquad \text{(II.33)}$$
where $\kappa_{max}$ is the smallest number belonging to $[0,\infty]$ such that $C^{FB}_{A^n\to B^n}(\kappa)$ is constant in $[\kappa_{max},\infty]$, and $\mathbf{E}_\mu\{\cdot\}$ denotes expectation with respect to the joint distribution $(\overleftarrow{P} \otimes \overrightarrow{P}) \otimes \mu$.
Proof: Since the set $\mathcal{P}^{FB}_{[0,n]}(\kappa)$ is convex with respect to $\overleftarrow{P}(\cdot|b^{n-1}) \in \mathcal{M}(\mathcal{A}^n)$, the statements follow from the convexity and non-decreasing properties [23].
The above theorem states that the extremum problem of feedback capacity is a convex optimization problem over appropriate sets of distributions.
2) Information Structures of Optimal Channel Input Distributions:
Consider the extremum problem $C^{FB}_{A^n\to B^n}(\kappa)$, given by (II.33). In [9], [10], it is shown that the optimal channel input distribution satisfies the following conditional independence.
$$P_{A_i|A^{i-1},B^{i-1}}(da_i|a^{i-1},b^{i-1}) = P_{A_i|B_{i-1}}(da_i|b_{i-1}) \equiv \pi_i(da_i|b_{i-1}), \quad i = 0,\ldots,n. \qquad \text{(II.34)}$$
Moreover, in view of the information structure of the optimal channel input distribution, $C^{FB}_{A^n\to B^n}(\kappa)$ reduces to the following optimization problem.
$$C^{FB}_{A^n\to B^n}(\kappa) = \sup_{\mathcal{P}^{FB}_{[0,n]}(\kappa)} \sum_{i=0}^{n} \int \log\Big(\frac{dP_{B_i|B_{i-1},A_i}(\cdot|b_{i-1},a_i)}{dP^{\pi}_{B_i|B_{i-1}}(\cdot|b_{i-1})}(b_i)\Big)\, P^{\pi}_{A_i,B_i}(da_i,db_i) \qquad \text{(II.35)}$$
$$\equiv \sup_{\mathcal{P}^{FB}_{[0,n]}(\kappa)} I^{FB}_{A^n\to B^n}\big(\pi_i, P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\big) \qquad \text{(II.36)}$$
where the transmission cost constraint is defined by
$$\mathcal{P}^{FB}_{[0,n]}(\kappa) \triangleq \Big\{\pi_i(da_i|b_{i-1}),\ i = 0,\ldots,n : \frac{1}{n+1}\,\mathbf{E}^{\pi}_{\mu}\big\{c_{0,n}(A^n,B^{n-1})\big\} \le \kappa\Big\} \qquad \text{(II.37)}$$
and the induced joint and transition probability distributions are given by
$$P^{\pi}_{A_i,B_i}(da_i,db_i) = P_{B_i|B_{i-1},A_i}(db_i|b_{i-1},a_i) \otimes \pi_i(da_i|b_{i-1}) \otimes P^{\pi}_{B_{i-1}}(db_{i-1}) \qquad \text{(II.38)}$$
$$P^{\pi}_{B_i|B_{i-1}}(db_i|b_{i-1}) = \int_{\mathcal{A}_i} P_{B_i|B_{i-1},A_i}(db_i|b_{i-1},a_i) \otimes \pi_i(da_i|b_{i-1}), \quad i = 0,\ldots,n. \qquad \text{(II.39)}$$
The superscript indicates the dependence of these distributions on $\{\pi_i(da_i|b_{i-1}) : i = 0,\ldots,n\}$. The information feedback capacity rate is then given by
$$C^{FB}_{A^\infty\to B^\infty}(\kappa) \triangleq \lim_{n\to\infty}\frac{1}{n+1}\sup_{\mathcal{P}^{FB}_{[0,n]}(\kappa)} I^{FB}_{A^n\to B^n}\big(\pi_i, P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\big). \qquad \text{(II.40)}$$

C. Feedback Versus No Feedback
Here, we address the question of whether feedback increases capacity, via the optimization problem (II.35). First, we recall the definition of channel input distributions without feedback.
Definition II.5. (Channel input distribution without feedback) A sequence of conditional distributions defined by
$$\mathcal{P}^{noFB}_{[0,n]} \triangleq \big\{P_{A_i|A^{i-1},B_{-1}}(da_i|a^{i-1},b_{-1}) \equiv \pi^{noFB}_i(da_i|a^{i-1},b_{-1}) : i = 0,\ldots,n\big\}. \qquad \text{(II.41)}$$
The information structure of the channel input distribution without feedback is $\mathcal{I}^{noFB}_i \triangleq \{a^{i-1},b_{-1}\}$. For time $i = 0$, the distribution is $P_{A_0|A^{-1},B_{-1}}(da_0|a^{-1},b_{-1}) \equiv \pi^{noFB}_0(da_0|b_{-1})$, hence the information structure is $\mathcal{I}^{noFB}_0 \triangleq \{a^{-1},b_{-1}\} = \{b_{-1}\}$, which states that the channel input distribution depends only on the initial data.

Similarly to the feedback case (Section II-B), the initial state of the channel, $b_{-1}$, is assumed to be known at the encoder. The transmission cost constraint without feedback is defined by
$$\mathcal{P}^{noFB}_{[0,n]}(\kappa) \triangleq \Big\{\pi^{noFB}_i(da_i|a^{i-1},b_{-1}),\ i = 0,\ldots,n : \frac{1}{n+1}\,\mathbf{E}_{\mu}\big\{c_{0,n}(A^n,B^{n-1})\big\} \le \kappa\Big\}, \quad \kappa \in [0,\infty]. \qquad \text{(II.42)}$$
Moreover, the set of encoding strategies without feedback, mapping messages into channel inputs of block length $(n+1)$, is defined by
$$\mathcal{E}^{noFB}_{[0,n]}(\kappa) \triangleq \Big\{g^{noFB}_i : \mathcal{M}_n \times \mathcal{A}^{i-1} \times \mathcal{B}_{-1} \longmapsto \mathcal{A}_i,\ a_0 = g^{noFB}_0(w,b_{-1}),\ a_1 = g^{noFB}_1(w,b_{-1},a_0),\ \ldots,\ a_n = g^{noFB}_n(w,b_{-1},a^{n-1}),\ w \in \mathcal{M}_n : \frac{1}{n+1}\,\mathbf{E}^{g^{noFB}}\big(c_{0,n}(A^n,B^{n-1})\big) \le \kappa\Big\}, \quad n = 0,1,\ldots. \qquad \text{(II.43)}$$
By employing (II.42) and (II.43), a code without feedback is defined similarly to Definition II.4.
Given any channel input distribution without feedback $\{\pi^{noFB}_i(da_i|a^{i-1},b_{-1}) : i = 0,1,\ldots,n\} \in \mathcal{P}^{noFB}_{[0,n]}(\kappa)$, a channel distribution $\{P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\}$, and a fixed initial distribution $P_{B_{-1}}(db_{-1}) = \mu(db_{-1})$, the induced joint distribution $P_{A^n,B^n}$ parametrized by $\mu(\cdot)$ is uniquely defined. The mutual information from $A^n \triangleq \{A_0,A_1,\ldots,A_n\}$ to $B^n \triangleq \{B_0,B_1,\ldots,B_n\}$, conditioned on $B_{-1}$, is defined by
$$I(A^n;B^n) \triangleq \mathbf{E}^{\pi^{noFB}}_{\mu}\Big\{\log\Big(\frac{dP_{B^n|A^n,B_{-1}}(\cdot|A^n,B_{-1})}{dP^{\pi^{noFB}}_{B^n|B_{-1}}(\cdot|B_{-1})}(B^n)\Big)\Big\} = \sum_{i=0}^{n}\int\log\Big(\frac{dP_{B_i|B_{i-1},A_i}(\cdot|b_{i-1},a_i)}{dP^{\pi^{noFB}}_{B_i|B_{i-1}}(\cdot|b_{i-1})}(b_i)\Big)\, P^{\pi^{noFB}}_{A_i,B_i}(da_i,db_i) \equiv I^{noFB}_{A^n\to B^n}\big(\pi^{noFB}_i, P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\big) \qquad \text{(II.44)}$$
where the joint distribution and the transition probability distribution are induced by $\{\pi^{noFB}_i(da_i|a^{i-1},b_{-1}) : i = 0,\ldots,n\} \in \mathcal{P}^{noFB}_{[0,n]}$ as follows.
$$P^{\pi^{noFB}}_{A_i,B_i}(da_i,db_i) = P_{B_i|B_{i-1},A_i}(db_i|b_{i-1},a_i) \otimes P^{\pi^{noFB}}_{A_i|B_{i-1}}(da_i|b_{i-1}) \otimes P^{\pi^{noFB}}_{B_{i-1}}(db_{i-1}) \qquad \text{(II.45)}$$
$$P^{\pi^{noFB}}_{B_i|B_{i-1}}(db_i|b_{i-1}) = \int_{\mathcal{A}_i} P_{B_i|B_{i-1},A_i}(db_i|b_{i-1},a_i) \otimes P^{\pi^{noFB}}_{A_i|B_{i-1}}(da_i|b_{i-1}), \quad i = 0,\ldots,n \qquad \text{(II.46)}$$
$$P^{\pi^{noFB}}_{A_i|B^{i-1}}(da_i|b^{i-1}) = \int_{\mathcal{A}^{i-1}} \pi^{noFB}_i(da_i|a^{i-1},b_{-1}) \otimes P^{\pi^{noFB}}_{A^{i-1}|B^{i-1}}(da^{i-1}|b^{i-1}) \qquad \text{(II.47)}$$
$$P^{\pi^{noFB}}_{A^{i-1}|B^{i-1}}(da^{i-1}|b^{i-1}) = \frac{\otimes_{j=0}^{i-1}\, P_{B_j|B_{j-1},A_j}(db_j|b_{j-1},a_j) \otimes \pi^{noFB}_j(da_j|a^{j-1},b_{-1})}{\int_{\mathcal{A}^{i-1}}\otimes_{j=0}^{i-1}\, P_{B_j|B_{j-1},A_j}(db_j|b_{j-1},a_j) \otimes \pi^{noFB}_j(da_j|a^{j-1},b_{-1})}. \qquad \text{(II.48)}$$
The superscripts in the above distributions are important, to distinguish that these are generated by the channel and the channel input distributions without feedback, while the functional in (II.44) is fundamentally different from the one in (II.36). Compared to the channel with feedback, in which the corresponding distributions are (II.38) and (II.39) and are induced by $\{\pi_i(da_i|b_{i-1}) : i = 0,\ldots,n\} \in \mathcal{P}^{FB}_{[0,n]}(\kappa)$, when the channel is used without feedback the distributions (II.45) and (II.46) are induced by $\{\pi^{noFB}_i(da_i|a^{i-1},b_{-1}) : i = 0,\ldots,n\} \in \mathcal{P}^{noFB}_{[0,n]}(\kappa)$.
Define the information quantity
$$C^{noFB}_{A^n;B^n}(\kappa) = \sup_{\mathcal{P}^{noFB}_{[0,n]}(\kappa)} \sum_{i=0}^{n}\int\log\Big(\frac{dP_{B_i|B_{i-1},A_i}(\cdot|b_{i-1},a_i)}{dP^{\pi^{noFB}}_{B_i|B_{i-1}}(\cdot|b_{i-1})}(b_i)\Big)\, P^{\pi^{noFB}}_{A_i,B_i}(da_i,db_i) \qquad \text{(II.49)}$$
$$\equiv \sup_{\mathcal{P}^{noFB}_{[0,n]}(\kappa)} I^{noFB}_{A^n\to B^n}\big(\pi^{noFB}_i, P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\big). \qquad \text{(II.50)}$$
Then the information capacity without feedback subject to a transmission cost constraint is defined by
$$C^{noFB}_{A^\infty;B^\infty}(\kappa) \triangleq \lim_{n\to\infty}\frac{1}{n+1}\sup_{\mathcal{P}^{noFB}_{[0,n]}(\kappa)} I^{noFB}_{A^n\to B^n}\big(\pi^{noFB}_i, P_{B_i|B_{i-1},A_i} : i = 0,1,\ldots,n\big). \qquad \text{(II.51)}$$
Next, we note the following. Let $\{\pi^*_i(da_i|b_{i-1}) : i = 0,\ldots,n\} \in \mathcal{P}^{FB}_{[0,n]}(\kappa)$ denote the maximizing distribution in $C^{FB}_{A^n\to B^n}(\kappa)$ defined by (II.35). Suppose there exists a sequence of channel input distributions without feedback $\{P^*(da_i|\mathcal{I}^{noFB}_i) \equiv \pi^{*,noFB}_i(da_i|\mathcal{I}^{noFB}_i) : \mathcal{I}^{noFB}_i \subseteq \{b_{-1},a_0,\ldots,a_{i-1}\},\ i = 0,\ldots,n\} \in \mathcal{P}^{noFB}_{[0,n]}(\kappa)$ which induces the maximizing channel input distribution with feedback $\{\pi^*_i(da_i|b_{i-1}) : i = 0,1,\ldots,n\}$. That is, $P^{\pi^{noFB}}_{A_i|B_{i-1}}(da_i|b_{i-1})$ obtained from (II.47) is equal to $\pi^*_i(da_i|b_{i-1})$, $\forall i = 0,1,\ldots,n$. Then it is clear that this sequence also induces the optimal joint distribution and conditional distribution defined by (II.38), (II.39), and consequently $C^{FB}_{A^n\to B^n}(\kappa)$ and $C^{FB}_{A^\infty\to B^\infty}(\kappa)$ are achieved without using feedback. In the following theorem, we prove that this condition is not only sufficient but also necessary for any channel input distribution without feedback to achieve the finite time feedback information capacity $C^{FB}_{A^n\to B^n}(\kappa)$.
Theorem II.2. (Necessary and sufficient conditions for $C^{FB}_{A^n\to B^n}(\kappa) = C^{noFB}_{A^n;B^n}(\kappa)$) Consider the channel (II.15) and let $\{\pi^*_i(da_i|b_{i-1}) : i = 0,\ldots,n\} \in \mathcal{P}^{FB}_{[0,n]}(\kappa)$ denote the maximizing distribution in $C^{FB}_{A^n\to B^n}(\kappa)$ defined by (II.35), and let $\{P^{\pi^*}_{A_i,B_i}(da_i,db_i), P^{\pi^*}_{B_i|B_{i-1}}(db_i|b_{i-1}) : i = 0,\ldots,n\}$ denote the corresponding joint and transition distributions as defined by (II.38), (II.39). Then
$$C^{FB}_{A^n\to B^n}(\kappa) = C^{noFB}_{A^n;B^n}(\kappa) \qquad \text{(II.52)}$$
if and only if there exists a sequence of channel input distributions $\{P^*(da_i|\mathcal{I}^{noFB}_i) \equiv \pi^{noFB,*}_i(da_i|\mathcal{I}^{noFB}_i) : \mathcal{I}^{noFB}_i \subseteq \{b_{-1},a_0,\ldots,a_{i-1}\},\ i = 0,\ldots,n\} \in \mathcal{P}^{noFB}_{[0,n]}(\kappa)$ which induces the maximizing channel input distribution with feedback $\{\pi^*_i(da_i|b_{i-1}) : i = 0,1,\ldots,n\}$.
Proof: In general, the inequality $C^{FB}_{A^n\to B^n}(\kappa) \ge C^{noFB}_{A^n;B^n}(\kappa)$ holds. Moreover, by Section II-B the distributions $\{P^{\pi^*}_{A_i,B_i}(da_i,db_i), P^{\pi^*}_{B_i|B_{i-1}}(db_i|b_{i-1}) : i = 0,\ldots,n\}$ are induced by the channel, which is fixed, and the optimal conditional distribution $\{\pi^*_i(da_i|b_{i-1}) : i = 0,\ldots,n\} \in \mathcal{P}^{FB}_{[0,n]}(\kappa)$. Then equality holds if and only if there exists a distribution without feedback $\{\pi^{noFB,*}_i(da_i|\mathcal{I}^{noFB}_i) : i = 0,\ldots,n\} \in \mathcal{P}^{noFB}_{[0,n]}(\kappa)$ which induces $\{\pi^*_i(da_i|b_{i-1}) : i = 0,\ldots,n\} \in \mathcal{P}^{FB}_{[0,n]}(\kappa)$. This follows from the fact that the distributions $\{P^{\pi^*}_{A_i,B_i}(da_i,db_i), P^{\pi^*}_{B_i|B_{i-1}}(db_i|b_{i-1}) : i = 0,\ldots,n\}$ are induced by the feedback distribution $\{\pi^*_i(da_i|b_{i-1}) : i = 0,\ldots,n\}$ and the channel distribution. This completes the proof.
Theorem II.2 provides a sufficient condition for feedback not to increase capacity, i.e., $C^{FB}_{A^\infty\to B^\infty}(\kappa) = C^{noFB}_{A^\infty;B^\infty}(\kappa)$, since if (II.52) holds, then $\lim_{n\to\infty}\frac{1}{n+1}C^{FB}_{A^n\to B^n}(\kappa) = \lim_{n\to\infty}\frac{1}{n+1}C^{noFB}_{A^n;B^n}(\kappa)$. In Section IV-B we demonstrate an application of Theorem II.2 to a specific channel with memory, where we show that an input distribution without feedback induces $\{\pi^*_i(da_i|b_{i-1}) : i = 0,\ldots,n\}$, hence feedback does not increase capacity.
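Theorem II.2 also suggests a direct numerical test: fix a candidate no-feedback input and check whether the conditional distribution it induces through (II.47)-(II.48) reproduces the feedback-optimal $\{\pi^*_i(da_i|b_{i-1})\}$. For a first-order Markov candidate over finite alphabets this only requires propagating the joint distribution of $(A_i,B_{i-1})$ forward. A sketch (ours, not from the paper; pi_star and pi_markov are placeholders for the closed form distributions derived in Section IV for the BSSC):

```python
import numpy as np

def induced_gap(P, pi_star, pi_markov, mu, n=20):
    """Test the condition of Theorem II.2 for a first-order Markov
    no-feedback input: propagate J[a, s] = P(A_i = a, B_{i-1} = s) forward
    (cf. (II.45)-(II.48)) and report max_i |P(a_i|b_{i-1}) - pi_star|.
    P[s, a, b]: UMCO kernel; pi_star[s, a]: feedback-optimal pi*(a|s);
    pi_markov[a, a']: candidate P(A_{i+1} = a' | A_i = a); mu[s]: P(B_{-1}=s).
    Assumes every state s keeps positive probability along the way."""
    # A_0 may depend on b_{-1} (Definition II.5), so take A_0 ~ pi*(.|b_{-1}).
    J = np.asarray(mu, float)[None, :] * pi_star.T
    gap = 0.0
    for _ in range(n):
        cond = J / J.sum(axis=0, keepdims=True)   # induced P(a_i | b_{i-1})
        gap = max(gap, float(np.max(np.abs(cond - pi_star.T))))
        JB = np.einsum('as,sab->ab', J, P)        # P(A_i = a, B_i = b)
        J = np.einsum('aA,ab->Ab', pi_markov, JB) # P(A_{i+1} = a', B_i = b)
    return gap   # ~0 iff the Markov input induces pi* at every stage
```

A returned gap near zero certifies, numerically, the necessary and sufficient condition of Theorem II.2 for the chosen kernels; Section IV exhibits a specific Markov kernel achieving this for the BSSC.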
YNAMIC P ROGRAMMING AND N ECESSARY S UFFICIENT C ONDITIONS FOR N ON - NESTED O PTIMIZATION
In this section we employ the structural properties of capacity achieving channel input distributionswith feedback to derive dynamic programming recursions and necessary and sufficient conditions for thesingle letter characterization (I.6), to hold. Specifically, we provide the following results for the UMCOchannel.(a) Necessary and sufficient conditions to determine when dynamic programming recursions, whichare nested optimization problems, reduce to non-nested optimization problems.(b) Repeat (a) for the per unit time infinite horizon.(c) Upper bounds on the probability of maximum likelihood decoding.The time-varying UMCO channel is defined by P i ( b i | b i − , a i ) (cid:52) = P i ( b i | b i − , a i ) , i = , . . . , n (III.53)and the transmission cost constraint is defined by1 n + E (cid:40) n ∑ i = γ UMi ( A i , B i − ) (cid:41) ≤ κ (III.54)where γ UMi : A i × B i − (cid:55)−→ [ , ∞ ) . At i = { b − , a } , where b − ∈ B − is the initial data which are either known to the encoder and the decoder or, b − = { /0 } .For simplicity, of presentation and technical assumptions needed, we consider a channel model withtransmission cost function, defined on finite alphabet spaces. However, all main results extend to abstractalphabet spaces and channel distributions, which depend on finite memory on past channel output.Moreover, our analysis and the corresponding theorems can be extended to channels with finite memoryon the previous channel outputs by exploiting the structural form of the capacity achieving distributionsgiven in [9], [10].For the above model, it is shown in [9], [10] that maximizing directed information, I ( A n → B n ) , over P FB [ , n ] or P FB [ , n ] ( κ ) occurs in the subset of conditional distributions that satisfy the following conditionalindependence. P i ( a i | a i − , b i − ) = P i ( a i | b i − ) ≡ π i ( a i | b i − ) , i = , , . . . , n . (III.55)Consequently, we have the following Markovian properties. P i ( a i , b i | a i − , b i − ) = P π i ( a i , b i | a i − , b i − ) , i = , , . . . , n , (III.56) P i ( b i | b i − ) = P π i ( b i | b i − ) , i = , , . . . , n , (III.57) P π i ( b i | b i − ) = ∑ a i ∈ A i P i ( b i | b i − , a i ) π i ( a i | b i − ) , i = , , . . . , n . (III.58)where the superscript indicates the dependence on the channel input distribution (III.55). In view of theseMarkov properties, the characterization of the FTFI capacity (i.e., (II.26)) is given by C FB , UMCOA n → B n (cid:52) = sup ◦ P FB [ , n ] E πµ (cid:110) n ∑ i = log (cid:16) P i ( B i | B i − , A i ) P π i ( B i | B i − ) (cid:17)(cid:111) (III.59) = sup ◦ P FB [ , n ] n ∑ i = I ( A i ; B i | B i − ) (III.60)where ◦ P FB [ , n ] (cid:52) = (cid:8) π i ( a i | b i − ) : i = , , . . . , n (cid:9) ⊂ P FB [ , n ] . (III.61) When clear from the context, the subscript notation of the distributions is omitted, i.e., P π B i | B i − ( b i | b i − ) ≡ P π i ( b i | b i − ) . Similarly, for conditional distributions with transmission cost the characterization of FTFI capacity isgiven by C FB , UMCOA n → B n ( κ ) (cid:52) = sup ◦ P FB [ , n ] ( κ ) E πµ (cid:110) n ∑ i = log (cid:16) P i ( B i | B i − , A i ) P π i ( B i | B i − ) (cid:17)(cid:111) (III.62) = sup ◦ P FB [ , n ] ( κ ) n ∑ i = I ( A i ; B i | B i − ) (III.63)where ◦ P FB [ , n ] ( κ ) (cid:52) = (cid:110) π i ( a i | b i − ) , i = , , . . . , n : 1 n + n ∑ i = E πµ (cid:16) γ UMi ( A i , B i − ) (cid:17) ≤ κ (cid:111) . (III.64)Since the joint process { B − , A , B , . . . 
Since the joint process $\{B_{-1},A_0,B_0,\ldots,A_n,B_n\}$ and the channel output process $\{B_{-1},B_0,\ldots,B_n\}$ are Markov, we explore the connection of the above optimization problems to Markov Decision theory to derive the results listed in (a)-(c). We do this in the next sections.

A. Necessary and Sufficient Conditions via Dynamic Programming: The Finite Horizon case
To derive the necessary and sufficient conditions for any channel input distribution to maximize directed information, i.e., item (a), we first apply dynamic programming on a finite horizon.
1) Without Transmission Cost Constraint:
The dynamic programming recursion for $C^{FB,UMCO}_{A^n\to B^n}$ is obtained as follows. Let $V_t(b_{t-1})$ represent the value function, that is, the maximum expected total reward on the future time horizon $\{t,t+1,\ldots,n\}$ given the output $B_{t-1} = b_{t-1}$ at time $t-1$, defined by
$$V_t(b_{t-1}) = \sup_{\pi_i(a_i|b_{i-1}) :\ i = t,t+1,\ldots,n} \mathbf{E}^{\pi}\Big\{\sum_{i=t}^{n}\log\Big(\frac{P_i(B_i|B_{i-1},A_i)}{P^{\pi}_i(B_i|B_{i-1})}\Big) \,\Big|\, B_{t-1} = b_{t-1}\Big\} \qquad \text{(III.65)}$$
where the transition probability of the channel output process is
$$P^{\pi}_t(b_t|b_{t-1}) = \sum_{a_t\in\mathcal{A}_t} P_t(b_t|b_{t-1},a_t)\,\pi_t(a_t|b_{t-1}). \qquad \text{(III.66)}$$
Then (III.65) satisfies the following dynamic programming recursions.
$$V_n(b_{n-1}) = \sup_{\pi_n(a_n|b_{n-1})} \sum_{(a_n,b_n)\in\mathcal{A}_n\times\mathcal{B}_n}\log\Big(\frac{P_n(b_n|b_{n-1},a_n)}{P^{\pi}_n(b_n|b_{n-1})}\Big) P_n(b_n|b_{n-1},a_n)\,\pi_n(a_n|b_{n-1}), \qquad \text{(III.67)}$$
$$V_t(b_{t-1}) = \sup_{\pi_t(a_t|b_{t-1})}\Big\{\sum_{a_t\in\mathcal{A}_t}\Big[\sum_{b_t\in\mathcal{B}_t}\log\Big(\frac{P_t(b_t|b_{t-1},a_t)}{P^{\pi}_t(b_t|b_{t-1})}\Big) P_t(b_t|b_{t-1},a_t) + \sum_{b_t\in\mathcal{B}_t} V_{t+1}(b_t)\, P_t(b_t|b_{t-1},a_t)\Big]\pi_t(a_t|b_{t-1})\Big\}, \quad t = 0,1,\ldots,n-1. \qquad \text{(III.68)}$$
For a fixed initial distribution $P_{B_{-1}}(b_{-1}) = \mu(b_{-1})$ we have
$$C^{FB,UMCO}_{A^n\to B^n} = \sum_{b_{-1}\in\mathcal{B}_{-1}} V_0(b_{-1})\,\mu(b_{-1}). \qquad \text{(III.69)}$$
By using the properties of relative entropy, we can show that the right hand side of the dynamic programming recursion (III.67) is a concave function of the input distribution $\pi_n(a_n|b_{n-1})$. Similarly, at each step of the recursion, the right hand side of the dynamic programming recursion (III.68) is a concave function of the input distribution $\pi_t(a_t|b_{t-1})$, since the future channel input distributions $\{\pi_{t+1}(a_{t+1}|b_t),\ldots,\pi_n(a_n|b_{n-1})\}$ are fixed to their optimal strategies. Utilizing this observation, we have the following necessary and sufficient conditions for any channel input distribution to maximize the right hand side of the dynamic programming recursions (III.67) and (III.68).

Theorem III.1. (Necessary and sufficient conditions) The necessary and sufficient conditions for any input distribution $\{\pi_t(a_t|b_{t-1}) : t = 0,1,\ldots,n\}$ to achieve the supremum in the dynamic programming recursions (III.67) and (III.68) are the following. For each $b_{n-1}\in\mathcal{B}_{n-1}$, there exists $V_n(b_{n-1})$ such that
$$V_n(b_{n-1}) = \sum_{b_n\in\mathcal{B}_n}\log\Big(\frac{P_n(b_n|a_n,b_{n-1})}{P^{\pi}_n(b_n|b_{n-1})}\Big) P_n(b_n|a_n,b_{n-1}), \quad \forall a_n\in\mathcal{A}_n \text{ if } \pi_n(a_n|b_{n-1}) \neq 0, \qquad \text{(III.70)}$$
$$V_n(b_{n-1}) \le \sum_{b_n\in\mathcal{B}_n}\log\Big(\frac{P_n(b_n|a_n,b_{n-1})}{P^{\pi}_n(b_n|b_{n-1})}\Big) P_n(b_n|a_n,b_{n-1}), \quad \forall a_n\in\mathcal{A}_n \text{ if } \pi_n(a_n|b_{n-1}) = 0 \qquad \text{(III.71)}$$
and for each $t = n-1, n-2, \ldots, 0$, there exists $V_t(b_{t-1})$ such that
$$V_t(b_{t-1}) = \sum_{b_t\in\mathcal{B}_t}\Big\{\log\Big(\frac{P_t(b_t|a_t,b_{t-1})}{P^{\pi}_t(b_t|b_{t-1})}\Big) + V_{t+1}(b_t)\Big\} P_t(b_t|a_t,b_{t-1}), \quad \forall a_t\in\mathcal{A}_t, \text{ if } \pi_t(a_t|b_{t-1}) \neq 0, \qquad \text{(III.72)}$$
$$V_t(b_{t-1}) \le \sum_{b_t\in\mathcal{B}_t}\Big\{\log\Big(\frac{P_t(b_t|a_t,b_{t-1})}{P^{\pi}_t(b_t|b_{t-1})}\Big) + V_{t+1}(b_t)\Big\} P_t(b_t|a_t,b_{t-1}), \quad \forall a_t\in\mathcal{A}_t, \text{ if } \pi_t(a_t|b_{t-1}) = 0. \qquad \text{(III.73)}$$
Moreover, $\{V_t(b_{t-1}) : (t,b_{t-1})\in\{0,\ldots,n\}\times\mathcal{B}_{t-1}\}$ is the value function defined by (III.65).
Proof: The derivation is given in [24].
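Theorem III.1 turns the nested problem (III.65) into a per-stage concave program that can be solved backwards in time. A small numerical sketch (ours, not from the paper; binary input assumed, grid resolution illustrative) of the recursions (III.67)-(III.68):

```python
import numpy as np

def backward_dp(P, n, grid_pts=501):
    """Backward recursions (III.67)-(III.68) for a binary-input UMCO
    channel, P[s, a, b] = P(b | b_prev = s, a).  Each stage objective is
    concave in pi_t(.|s), so a grid search over p0 = pi_t(0|s) is adequate
    for this illustration.  Returns V_0(.) and the optimal p0 per stage."""
    nB = P.shape[0]
    grid = np.linspace(0.0, 1.0, grid_pts)
    V_next = np.zeros(nB)                       # terminal condition V_{n+1} = 0
    pi_path = []
    for t in range(n, -1, -1):                  # t = n, n-1, ..., 0
        V, p_opt = np.empty(nB), np.empty(nB)
        for s in range(nB):
            W = P[s]                            # W[a, b]
            vals = []
            for p0 in grid:
                pi = np.array([p0, 1.0 - p0])
                q = pi @ W                      # P^pi(b | s), cf. (III.66)
                with np.errstate(divide="ignore", invalid="ignore"):
                    mi = np.nansum(pi[:, None] * W * np.log2(W / q))
                vals.append(mi + pi @ (W @ V_next))  # reward + cost-to-go
            k = int(np.argmax(vals))
            V[s], p_opt[s] = vals[k], grid[k]
        pi_path.append(p_opt)
        V_next = V
    return V_next, pi_path[::-1]                # V_0(.), [pi_0, ..., pi_n]
```

If the returned $V_0(\cdot)$ grows affinely in $n$ with a slope independent of the state, and the per-stage maximizers coincide across $t$, the recursion already exhibits the non-nested, time-invariant behavior formalized below.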
Before we proceed further, in the next remark we relate Theorem III.1 to the necessary and sufficient conditions for DMCs derived in [25].

Remark III.1. (Relation to necessary and sufficient conditions of DMCs)
(a) Suppose the channel is a time-varying DMC, i.e.,
$$P_t(b_t|b_{t-1},a_t) = P_t(b_t|a_t), \quad t = 0,\ldots,n. \qquad \text{(III.74)}$$
Since the optimal distribution of DMCs which maximizes the directed information $I(A^n\to B^n)$ is memoryless, i.e., $P_{A_t|A^{t-1},B^{t-1}}(a_t|a^{t-1},b^{t-1}) = P_{A_t}(a_t) \equiv \pi_t(a_t)$, $t = 0,\ldots,n$, then (III.66) reduces to $P^{\pi}_t(b_t) = \sum_{a_t\in\mathcal{A}_t} P_t(b_t|a_t)\pi_t(a_t)$, $t = 0,\ldots,n$. By replacing in (III.70)-(III.73) the quantities
$$P_t(b_t|b_{t-1},a_t) \longmapsto P_t(b_t|a_t), \qquad P^{\pi}_t(b_t|b_{t-1}) \longmapsto P^{\pi}_t(b_t) = \sum_{\alpha_t\in\mathcal{A}_t} P_t(b_t|\alpha_t)\,\pi_t(\alpha_t), \quad t = 0,\ldots,n \qquad \text{(III.75)}$$
we obtain
$$V_n(b_{n-1}) \equiv V_n = \sum_{b_n}\log\Big(\frac{P_n(b_n|a_n)}{P^{\pi}_n(b_n)}\Big) P_n(b_n|a_n), \quad \forall a_n\in\mathcal{A}_n \text{ if } \pi_n(a_n) \neq 0, \qquad \text{(III.76)}$$
$$V_n(b_{n-1}) \equiv V_n \le \sum_{b_n}\log\Big(\frac{P_n(b_n|a_n)}{P^{\pi}_n(b_n)}\Big) P_n(b_n|a_n), \quad \forall a_n\in\mathcal{A}_n \text{ if } \pi_n(a_n) = 0 \qquad \text{(III.77)}$$
where $V_n(b_{n-1}) = V_n$ is a constant, independent of $b_{n-1}$. Moreover, for each $t$, from (III.72) and (III.73) we obtain
$$V_t(b_{t-1}) \equiv V_t = \sum_{b_t}\log\Big(\frac{P_t(b_t|a_t)}{P^{\pi}_t(b_t)}\Big) P_t(b_t|a_t) + V_{t+1}, \quad \forall a_t\in\mathcal{A}_t \text{ if } \pi_t(a_t) \neq 0, \qquad \text{(III.78)}$$
$$V_t(b_{t-1}) \equiv V_t \le \sum_{b_t}\log\Big(\frac{P_t(b_t|a_t)}{P^{\pi}_t(b_t)}\Big) P_t(b_t|a_t) + V_{t+1}, \quad \forall a_t\in\mathcal{A}_t \text{ if } \pi_t(a_t) = 0 \qquad \text{(III.79)}$$
where $V_t(b_{t-1}) \equiv V_t$ is a constant, independent of $b_{t-1}$, for $t = n-1, n-2, \ldots, 1, 0$. Consequently, by evaluating $V_t(b_{t-1}) = V_t$ at $t = 0$, we obtain the following identities.
$$V_0 = \max_{\pi_t(a_t) :\ t = 0,\ldots,n} \mathbf{E}^{\pi}\Big\{\sum_{t=0}^{n}\log\Big(\frac{P_t(B_t|A_t)}{P^{\pi}_t(B_t)}\Big)\Big\} = \sum_{t=0}^{n}\max_{\pi_t(a_t)}\mathbf{E}^{\pi}\Big\{\log\Big(\frac{P_t(B_t|A_t)}{P^{\pi}_t(B_t)}\Big)\Big\}. \qquad \text{(III.80)}$$
As expected, (III.80) shows that under (III.74) the sequence of nested optimization problems reduces to a sequence of non-nested optimization problems.
(b) Suppose the channel is a time-invariant (homogeneous) DMC. In this case, $P_t(b_t|b_{t-1},a_t) = P(b_t|a_t)$, $t = 0,\ldots,n$, and the equations in (a) reduce to the single set of necessary and sufficient conditions obtained in [25], that is, letting $V = C \triangleq \max_{P_A} I(A;B)$,
$$V = \sum_{b}\log\Big(\frac{P(b|a)}{P^{\pi}(b)}\Big) P(b|a), \quad \forall a\in\mathcal{A} \text{ if } \pi(a) \neq 0, \qquad \text{(III.81)}$$
$$V \le \sum_{b}\log\Big(\frac{P(b|a)}{P^{\pi}(b)}\Big) P(b|a), \quad \forall a\in\mathcal{A} \text{ if } \pi(a) = 0. \qquad \text{(III.82)}$$

In view of Remark III.1, next we identify necessary and sufficient conditions for any optimal channel input conditional distribution which is a solution of the dynamic programming recursions to be time-invariant, and to exhibit a non-nested property reminiscent of that of DMCs. We derive such conditions based on the following definition.

Definition III.1. (Non-nested optimization) Given a channel distribution $\{P_t(b_t|b_{t-1},a_t) : t = 0,\ldots,n\}$, the optimization problem $C^{FB,UMCO}_{A^n\to B^n}$ defined by (III.69) is called
(a) non-nested if and only if the value function (III.65) satisfies the following non-nested identity,
$$V_t(b_{t-1}) = \sum_{i=t}^{n}\sup_{\pi_i(a_i|b_{i-1})}\mathbf{E}^{\pi}\Big\{\log\Big(\frac{P_i(B_i|B_{i-1},A_i)}{P^{\pi}_i(B_i|B_{i-1})}\Big)\,\Big|\, B_{t-1} = b_{t-1}\Big\} \qquad \text{(III.83)}$$
for all $(t,b_{t-1})\in\{0,1,\ldots,n\}\times\mathcal{B}_{t-1}$;
(b) non-nested and time-invariant if and only if the value function satisfies the following identity,
$$V_t(b_{t-1}) = (n-t+1)\sup_{\pi_t(a_t|b_{t-1})}\mathbf{E}^{\pi}\Big\{\log\Big(\frac{P_t(B_t|B_{t-1},A_t)}{P^{\pi}_t(B_t|B_{t-1})}\Big)\,\Big|\, B_{t-1} = b_{t-1}\Big\} \qquad \text{(III.84)}$$
for all $(t,b_{t-1})\in\{0,1,\ldots,n\}\times\mathcal{B}_{t-1}$.

Clearly, if we can identify conditions so that the optimization problem defined by (III.69) is non-nested, then by evaluating the value function (III.83) at time $t = 0$, $C^{FB,UMCO}_{A^n\to B^n}$ is obtained from a sequence of non-nested problems, each maximizing $I(A_i;B_i|B_{i-1} = b_{i-1})$ over $\pi_i(a_i|b_{i-1})$, for which $b_{i-1}$ is fixed. Moreover, if the optimization problem is non-nested and time-invariant, then by evaluating (III.84) at $t = 0$, we obtain
$$C^{FB,UMCO}_{A^n\to B^n} = (n+1)\sum_{b_{-1}\in\mathcal{B}_{-1}}\sup_{\pi_0(a_0|b_{-1})} I(A_0;B_0|B_{-1} = b_{-1})\,\mu(b_{-1}). \qquad \text{(III.85)}$$
Next, we state the main theorem, which generalizes the non-nested and time-invariant properties of memoryless channels given in Remark III.1 to channels with memory.

Theorem III.2. (Necessary and sufficient conditions for non-nested optimization)
(a) Consider any channel distribution $\{P_i(b_i|b_{i-1},a_i) : i = 0,\ldots,n\}$. The optimization problem $C^{FB,UMCO}_{A^n\to B^n}$ defined by (III.69) is non-nested, and the value function is characterized by (III.83), if and only if
$$\text{there exist constants } \{V_t : t = 0,\ldots,n\} \text{ such that } V_t(b_{t-1}) = V_t,\ \forall (t,b_{t-1})\in\{0,1,\ldots,n\}\times\mathcal{B}_{t-1}, \text{ which satisfy (III.70)-(III.73).} \qquad \text{(III.86)}$$
(b) Consider any time-invariant channel distribution $\{P(b_i|b_{i-1},a_i) : i = 0,\ldots,n\}$. The optimization problem $C^{FB,UMCO}_{A^n\to B^n}$ defined by (III.69) is non-nested and time-invariant, and the value function is characterized by
$$V_t(b_{t-1}) = V_t \triangleq (n-t+1)\sup_{\pi^{TI}(a_i|b_{i-1})}\mathbf{E}^{\pi}\Big\{\log\Big(\frac{P(B_i|B_{i-1},A_i)}{P^{\pi^{TI}}(B_i|B_{i-1})}\Big)\,\Big|\, B_{i-1} = b_{i-1}\Big\}, \quad \forall (t,b_{t-1})\in\{0,1,\ldots,n\}\times\mathcal{B}_{t-1} \qquad \text{(III.87)}$$
where $\{\pi_i(a_i|b_{i-1}) = \pi^{TI}(a_i|b_{i-1}) : i = 0,\ldots,n\}$ and $\{P^{\pi}_i(b_i|b_{i-1}) = P^{\pi^{TI}}(b_i|b_{i-1}) : i = 0,\ldots,n\}$ are time-invariant, if and only if
$$\text{there exists a constant } V_n \text{ such that } V_n(b_{n-1}) = V_n,\ \forall b_{n-1}\in\mathcal{B}_{n-1}, \text{ which satisfies (III.70), (III.71).} \qquad \text{(III.88)}$$
Proof: (a) Suppose (III.86) holds. Then, by Theorem III.1, for any $t$ the term $\sum_{b_t} V_{t+1}(b_t)\,P_t(b_t|b_{t-1},a_t) = V_{t+1}$ does not depend on $(a_t,b_{t-1})$, so the optimal strategy $\pi_t(a_t|b_{t-1})$ is not affected by the future strategies $\{\pi_i(a_i|b_{i-1}) : i = t+1,t+2,\ldots,n\}$, for all $t = 0,1,\ldots,n-1$. Hence, the optimization problem $C^{FB,UMCO}_{A^n\to B^n}$ is non-nested. Conversely, if (III.83) holds, since its left hand side is the value function defined by (III.65), then necessarily for each $t$ the value function is a constant, i.e., $\{V_i(b_{i-1}) = V_i : i = t+1,\ldots,n\}$, for $t = 0,1,\ldots,n-1$. In view of Theorem III.1, (III.86) then holds.
(b) This is a degenerate case of part (a). Suppose (III.88) holds and consider the necessary and sufficient conditions given in Theorem III.1 at time $t = n-1$. Since $V_n(b_{n-1}) = V_n$, $\forall b_{n-1}$, then by (III.72) (and similarly for (III.73)) we have $V_{n-1}(b_{n-2}) = \{\cdot\} + V_n$, $\forall a_{n-1}, \pi_{n-1}(a_{n-1}|b_{n-2})$, where the term $\{\cdot\}$ is the first right hand side term in (III.72). Since the channel is time-invariant, subtracting the term $V_n$ from both sides of equation (III.72) (i.e., corresponding to $t = n-1$) reduces the conditions at $t = n-1$ to (III.70), (III.71); hence the optimal distribution at $t = n-1$ coincides with the time-invariant solution at $t = n$, and $V_{n-1}(b_{n-2}) = 2V_n$, which verifies (III.84) at $t = n-1$. To complete the derivation we use induction, that is, we assume the validity of (III.84) for $t \in \{n, n-1, \ldots, i+1\}$ and show that it also holds for $t = i$. This is similar to the case $t = n-1$.
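Condition (III.88) can be tested numerically: solve the terminal-stage problem for each $b_{n-1}$, check the Kuhn-Tucker-type equalities (III.70)-(III.71) on and off the support of $\pi$, and then check that the resulting $V_n(b_{n-1})$ does not vary with $b_{n-1}$. A sketch (ours; tolerances illustrative), taking a candidate terminal distribution pi[s, a] as input:

```python
import numpy as np

def check_terminal(P, pi, tol=1e-9):
    """Verify (III.70)-(III.71) for pi[s, a] at the terminal stage, and the
    constancy of V_n(.) across states required by (III.88).
    D[a] = sum_b P(b|a,s) log2( P(b|a,s) / P^pi(b|s) )."""
    V = []
    for s in range(P.shape[0]):
        W, p = P[s], pi[s]
        q = p @ W
        with np.errstate(divide="ignore", invalid="ignore"):
            D = np.nansum(W * np.log2(W / q), axis=1)
        Vs = D[p > tol].max()
        on_support = np.all(np.abs(D[p > tol] - Vs) < 1e-6)   # (III.70)
        off_support = np.all(D[p <= tol] <= Vs + 1e-6)        # (III.71)
        V.append(Vs)
        print(f"state {s}: D = {D}, V_n = {Vs:.6f}, "
              f"KKT satisfied: {on_support and off_support}")
    print("V_n constant across states, (III.88):", np.ptp(V) < 1e-6)
```

When the printed values pass for a time-invariant channel kernel, Theorem III.2(b) guarantees that the entire finite horizon problem is non-nested and time-invariant, with $V_t = (n-t+1)V_n$.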
2) With Transmission Cost Constraints:
All statements of the previous section generalize to $C^{FB,UMCO}_{A^n\to B^n}(\kappa)$ defined by (III.63), where the transmission cost constraint is given by (III.64). In view of the convexity of the optimization problem and the existence of an interior point of the constraint set $\overset{\circ}{\mathcal{P}}{}^{FB}_{[0,n]}(\kappa)$ (i.e., Slater's condition), by the Lagrange duality theorem [26] the constrained and unconstrained problems are equivalent, that is,
$$C^{FB,UMCO}_{A^n\to B^n}(\kappa) = \inf_{s\ge 0}\ \sup_{\pi_i(a_i|b_{i-1}) :\ i = 0,1,\ldots,n} \mathbf{E}^{\pi}_{\mu}\Big\{\sum_{i=0}^{n}\Big[\log\Big(\frac{P_i(B_i|B_{i-1},A_i)}{P^{\pi}_i(B_i|B_{i-1})}\Big) - s\Big(\gamma^{UM}(A_i,B_{i-1}) - (n+1)\,\kappa\Big)\Big]\Big\} \qquad \text{(III.89)}$$
where $s\in[0,\infty)$ is the Lagrange multiplier associated with the constraint.
The dynamic programming recursions are obtained as follows. Let $V^s_t(b_{t-1})$ represent the value function on the future time horizon $\{t,t+1,\ldots,n\}$ given the output $B_{t-1} = b_{t-1}$ at time $t-1$, defined by
$$V^s_t(b_{t-1}) = \sup_{\pi_i(a_i|b_{i-1}) :\ i = t,t+1,\ldots,n} \mathbf{E}^{\pi}\Big\{\sum_{i=t}^{n}\Big[\log\Big(\frac{P_i(B_i|B_{i-1},A_i)}{P^{\pi}_i(B_i|B_{i-1})}\Big) - s\,\gamma^{UM}(A_i,B_{i-1})\Big] \,\Big|\, B_{t-1} = b_{t-1}\Big\}. \qquad \text{(III.90)}$$
The corresponding dynamic programming recursions are the following.
$$V^s_n(b_{n-1}) = \sup_{\pi_n(a_n|b_{n-1})}\Big\{\sum_{a_n\in\mathcal{A}_n}\Big[\sum_{b_n\in\mathcal{B}_n}\log\Big(\frac{P_n(b_n|b_{n-1},a_n)}{P^{\pi}_n(b_n|b_{n-1})}\Big) P_n(b_n|b_{n-1},a_n) - s\,\gamma^{UM}_n(a_n,b_{n-1})\Big]\pi_n(a_n|b_{n-1})\Big\} \qquad \text{(III.91)}$$
$$V^s_t(b_{t-1}) = \sup_{\pi_t(a_t|b_{t-1})}\Big\{\sum_{a_t\in\mathcal{A}_t}\Big[\sum_{b_t\in\mathcal{B}_t}\Big(\log\Big(\frac{P_t(b_t|b_{t-1},a_t)}{P^{\pi}_t(b_t|b_{t-1})}\Big) + V^s_{t+1}(b_t)\Big) P_t(b_t|b_{t-1},a_t) - s\,\gamma^{UM}_t(a_t,b_{t-1})\Big]\pi_t(a_t|b_{t-1})\Big\}. \qquad \text{(III.92)}$$
Moreover, for a fixed initial distribution $P_{B_{-1}}(b_{-1}) = \mu(b_{-1})$,
$$C^{FB,UMCO}_{A^n\to B^n}(\kappa) = \inf_{s\ge 0}\Big\{\sum_{b_{-1}} V^s_0(b_{-1})\,\mu(b_{-1}) + s\,(n+1)\,\kappa\Big\}. \qquad \text{(III.93)}$$
The analogues of Theorem III.1 and Theorem III.2 are stated as a corollary.

Corollary III.1. (Necessary and sufficient conditions)
(a) The necessary and sufficient conditions for any input distribution $\{\pi_t(a_t|b_{t-1}) : t = 0,1,\ldots,n\}$ to achieve the supremum in the dynamic programming recursions (III.91) and (III.92) are the following. For each $b_{n-1}\in\mathcal{B}_{n-1}$, there exists $V^s_n(b_{n-1})$ such that
$$V^s_n(b_{n-1}) = \sum_{b_n\in\mathcal{B}_n}\log\Big(\frac{P_n(b_n|a_n,b_{n-1})}{P^{\pi}_n(b_n|b_{n-1})}\Big) P_n(b_n|a_n,b_{n-1}) - s\,\gamma^{UM}_n(a_n,b_{n-1}), \quad \forall a_n\in\mathcal{A}_n \text{ if } \pi_n(a_n|b_{n-1}) \neq 0, \qquad \text{(III.94)}$$
$$V^s_n(b_{n-1}) \le \sum_{b_n\in\mathcal{B}_n}\log\Big(\frac{P_n(b_n|a_n,b_{n-1})}{P^{\pi}_n(b_n|b_{n-1})}\Big) P_n(b_n|a_n,b_{n-1}) - s\,\gamma^{UM}_n(a_n,b_{n-1}), \quad \forall a_n\in\mathcal{A}_n \text{ if } \pi_n(a_n|b_{n-1}) = 0 \qquad \text{(III.95)}$$
and for each $t = n-1,\ldots,1,0$, there exists $V^s_t(b_{t-1})$ such that
$$V^s_t(b_{t-1}) = \sum_{b_t\in\mathcal{B}_t}\Big\{\log\Big(\frac{P_t(b_t|a_t,b_{t-1})}{P^{\pi}_t(b_t|b_{t-1})}\Big) + V^s_{t+1}(b_t)\Big\} P_t(b_t|a_t,b_{t-1}) - s\,\gamma^{UM}_t(a_t,b_{t-1}), \quad \forall a_t\in\mathcal{A}_t, \text{ if } \pi_t(a_t|b_{t-1}) \neq 0, \qquad \text{(III.96)}$$
$$V^s_t(b_{t-1}) \le \sum_{b_t\in\mathcal{B}_t}\Big\{\log\Big(\frac{P_t(b_t|a_t,b_{t-1})}{P^{\pi}_t(b_t|b_{t-1})}\Big) + V^s_{t+1}(b_t)\Big\} P_t(b_t|a_t,b_{t-1}) - s\,\gamma^{UM}_t(a_t,b_{t-1}), \quad \forall a_t\in\mathcal{A}_t, \text{ if } \pi_t(a_t|b_{t-1}) = 0. \qquad \text{(III.97)}$$
Moreover, $\{V^s_t(b_{t-1}) : (t,b_{t-1})\in\{0,\ldots,n\}\times\mathcal{B}_{t-1}\}$ is the value function defined by (III.90).
(b) The optimization problem $C^{FB,UMCO}_{A^n\to B^n}(\kappa)$ is non-nested, and the value function is characterized by
$$V^s_t(b_{t-1}) = \sum_{i=t}^{n}\sup_{\pi_i(a_i|b_{i-1})}\mathbf{E}^{\pi}\Big\{\log\Big(\frac{P_i(B_i|B_{i-1},A_i)}{P^{\pi}_i(B_i|B_{i-1})}\Big) - s\,\gamma^{UM}(A_i,B_{i-1}) \,\Big|\, B_{i-1} = b_{i-1}\Big\} \qquad \text{(III.98)}$$
for all $(t,b_{t-1})\in\{0,1,\ldots,n\}\times\mathcal{B}_{t-1}$, if and only if
$$\text{there exist constants } \{V^s_t : t = 0,\ldots,n\} \text{ such that } V^s_t(b_{t-1}) = V^s_t,\ \forall (t,b_{t-1})\in\{0,1,\ldots,n\}\times\mathcal{B}_{t-1}, \text{ which satisfy (III.94)-(III.97).} \qquad \text{(III.99)}$$
(c) If the channel distribution is time-invariant, $\{P(b_i|b_{i-1},a_i) : i = 0,\ldots,n\}$, and $\gamma^{UM}_i(\cdot,\cdot) = \gamma^{UM}(\cdot,\cdot)$, $i = 0,1,\ldots,n$, then the optimization problem $C^{FB,UMCO}_{A^n\to B^n}(\kappa)$ is non-nested and time-invariant, and the value function is characterized by
$$V^s_t(b_{t-1}) \equiv V^s_t = (n-t+1)\sup_{\pi^{TI}(a_i|b_{i-1})}\mathbf{E}^{\pi^{TI}}\Big\{\log\Big(\frac{P(B_i|B_{i-1},A_i)}{P^{\pi^{TI}}(B_i|B_{i-1})}\Big) - s\,\gamma^{UM}(A_i,B_{i-1}) \,\Big|\, B_{i-1} = b_{i-1}\Big\} \qquad \text{(III.100)}$$
where $\{\pi_i(a_i|b_{i-1}) = \pi^{TI}(a_i|b_{i-1}) : i = 0,\ldots,n\}$ and $\{P^{\pi}_i(b_i|b_{i-1}) = P^{\pi^{TI}}(b_i|b_{i-1}) : i = 0,\ldots,n\}$ are time-invariant, if and only if
$$\text{there exists a constant } V^s_n \text{ such that } V^s_n(b_{n-1}) = V^s_n,\ \forall b_{n-1}\in\mathcal{B}_{n-1}, \text{ which satisfies (III.94), (III.95).} \qquad \text{(III.101)}$$
Proof:
The derivation is precisely as in Theorem III.1.
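Numerically, Corollary III.1 is exercised exactly as in the unconstrained case: for each fixed multiplier $s \ge 0$ run the backward recursions (III.91)-(III.92) with the stage reward reduced by $s\,\gamma^{UM}$, then take the outer infimum in (III.93) over $s$ with a scalar minimizer. A sketch (ours, not from the paper; binary input and grid resolution illustrative; gamma[s, a] is the cost $\gamma^{UM}(a, b_{-}=s)$):

```python
import numpy as np

def backward_dp_lagrangian(P, gamma, s, n, grid_pts=501):
    """Recursions (III.91)-(III.92) for a fixed Lagrange multiplier s >= 0;
    returns V^s_0(.).  The outer problem (III.93) is then
    inf_{s >= 0} [ sum_b V^s_0(b) mu(b) + s (n+1) kappa ]."""
    nB = P.shape[0]
    grid = np.linspace(0.0, 1.0, grid_pts)
    V_next = np.zeros(nB)
    for _ in range(n + 1):                      # stages t = n down to 0
        V = np.empty(nB)
        for st in range(nB):
            W = P[st]
            best = -np.inf
            for p0 in grid:
                pi = np.array([p0, 1.0 - p0])
                q = pi @ W
                with np.errstate(divide="ignore", invalid="ignore"):
                    mi = np.nansum(pi[:, None] * W * np.log2(W / q))
                best = max(best, mi - s * (pi @ gamma[st]) + pi @ (W @ V_next))
            V[st] = best
        V_next = V
    return V_next
```

Reading the BSSC cost in (I.11) as $\gamma(a,b_{-}) = \overline{a\oplus b_{-}}$, the cost matrix would be gamma = np.eye(2), i.e., cost 1 whenever the input repeats the previous output. A scalar routine such as scipy.optimize.minimize_scalar can then handle the infimum over $s$.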
B. Necessary and Sufficient Conditions via Dynamic Programming: The Infinite Horizon case
In this section, we first identify sufficient conditions for the convergence of the per unit time limit of the characterization of the FTFI capacity, using the ergodic theory of Markov decision processes with randomized strategies and infinite horizon dynamic programming. Then we apply these to derive necessary and sufficient conditions for any channel input distribution to maximize the infinite horizon extremum problems $C^{FB,UMCO}_{A^\infty\to B^\infty}$ and $C^{FB,UMCO}_{A^\infty\to B^\infty}(\kappa)$.
For the material of this section we make the following assumption.

Assumptions III.1. (Time-invariant or homogeneous) The channel distribution and the transmission cost function are time-invariant, and the optimal strategies are restricted to time-invariant strategies, i.e.,
$$P_i(b_i|b_{i-1},a_i) = P(b_i|b_{i-1},a_i), \qquad \gamma^{UM}_i(a_i,b_{i-1}) \equiv \gamma^{UM}(a_i,b_{i-1}), \quad i = 0,\ldots,n, \qquad \text{(III.102)}$$
$$\pi_i(a_i|b_{i-1}) = \pi^{\infty}(a_i|b_{i-1}), \quad i = 0,\ldots,n \qquad \text{(III.103)}$$
and $\mathcal{A}_i = \mathcal{A}$, $\mathcal{B}_i = \mathcal{B}$, $i = 0,\ldots,n$. Moreover, the initial distribution $P_{B_{-1}} = \mu(b_{-1})$ is assumed fixed.

By invoking Assumptions III.1, we can introduce the corresponding extremum problem as follows. For a fixed initial distribution $\mu(db_{-1})\in\mathcal{M}(\mathcal{B})$, we define
$$J(\pi^{\infty},\mu) \triangleq \liminf_{n\to\infty}\frac{1}{n}\,\mathbf{E}^{\pi^{\infty}}_{\mu}\Big\{\sum_{i=0}^{n-1}\log\Big(\frac{P(B_i|B_{i-1},A_i)}{P^{\pi^{\infty}}(B_i|B_{i-1})}\Big)\Big\} \equiv \liminf_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} I(A_i;B_i|B_{i-1}). \qquad \text{(III.104)}$$
By taking the supremum over all channel input distributions [27] and by using the fact that the alphabet spaces are of finite cardinality, we have the following identity.
$$J(\pi^{\infty,*},\mu) \triangleq \sup_{\pi^{\infty}(a_i|b_{i-1}) :\ i = 0,1,\ldots} J(\pi^{\infty},\mu) \qquad \text{(III.105)}$$
$$= \liminf_{n\to\infty}\ \sup_{\pi^{\infty}(a_i|b_{i-1}) :\ i = 0,\ldots,n}\frac{1}{n}\,\mathbf{E}^{\pi^{\infty}}_{\mu}\Big\{\sum_{i=0}^{n-1}\log\Big(\frac{P(B_i|B_{i-1},A_i)}{P^{\pi^{\infty}}(B_i|B_{i-1})}\Big)\Big\} \equiv C^{FB,UMCO}_{A^\infty\to B^\infty}. \qquad \text{(III.106)}$$
For abstract alphabet spaces the exchange of $\liminf$ and $\sup$ requires strong conditions [27]. Clearly, the above quantity $J(\pi^{\infty,*},\mu)$ depends on the initial distribution $\mu(db_{-1})$. Similarly, for a fixed initial state $B_{-1} = b_{-1}$ we also have the identity
$$J(\pi^{\infty,*},b_{-1}) \triangleq \sup_{\pi^{\infty}(a_i|b_{i-1}) :\ i = 0,1,\ldots}\ \liminf_{n\to\infty}\frac{1}{n}\,\mathbf{E}^{\pi^{\infty}}_{b_{-1}}\Big\{\sum_{i=0}^{n-1}\log\Big(\frac{P(B_i|B_{i-1},A_i)}{P^{\pi^{\infty}}(B_i|B_{i-1})}\Big)\Big\} \qquad \text{(III.107)}$$
$$= \liminf_{n\to\infty}\ \sup_{\pi^{\infty}(a_i|b_{i-1}) :\ i = 0,\ldots,n}\frac{1}{n}\,\mathbf{E}^{\pi^{\infty}}_{b_{-1}}\Big\{\sum_{i=0}^{n-1}\log\Big(\frac{P(B_i|B_{i-1},A_i)}{P^{\pi^{\infty}}(B_i|B_{i-1})}\Big)\Big\} \qquad \text{(III.108)}$$
which depends on the initial state $b_{-1}$. Note that Assumptions III.1 do not imply that the joint distribution of the process $\{A_0,B_0,A_1,B_1,\ldots,A_n,B_n\}$ is stationary, or that the marginal distribution of the output process $\{B_i : i = 0,\ldots,n\}$ is stationary, because stationarity depends on the distribution of the initial state $B_{-1}$. However, they imply that the transition probabilities are time-invariant (i.e., homogeneous), hence $P^{\pi^{\infty}}_i(a_i,b_i|a_{i-1},b_{i-1}) \equiv P^{\pi^{\infty}}(a_i,b_i|a_{i-1},b_{i-1})$ and $P^{\pi^{\infty}}_i(b_i|b_{i-1}) \equiv P^{\pi^{\infty}}(b_i|b_{i-1})$, $i = 0,\ldots,n$.
Next, we develop the material without imposing transmission cost constraints, because extensions to problems with transmission cost are easily obtained by using the material of the previous section.
1) Sufficient Condition for Asymptotic Stationarity and Ergodicity from Finite-Time Dynamic Programming Recursions:
Consider the problem of maximizing the per unit time limiting version of $C^{FB,UMCO}_{A^n\to B^n}$, when the strategies are restricted to $\{\pi^\infty(a_i|b_{i-1}) : i = 0,\ldots,n\}$. From the previous section, the finite horizon value function satisfies the dynamic programming equation
\[
V_t(b_{t-1}) = \sup_{\pi^\infty(\cdot|b_{t-1})} \Big\{ \sum_{a_t} \Big\{ \sum_{b_t} \log\Big( \frac{P(b_t|b_{t-1},a_t)}{P^{\pi^\infty}(b_t|b_{t-1})} \Big) P(b_t|b_{t-1},a_t) + \sum_{b_t} V_{t+1}(b_t)\, P(b_t|b_{t-1},a_t) \Big\} \pi^\infty(a_t|b_{t-1}) \Big\}. \quad (III.109)
\]
Since $b_{t-1}$ is always fixed, we let $V_t(b_{t-1}) = V_t(b_{-1})$, $t = 0,\ldots,n-1$. Since the transition probabilities are time-invariant, we can define, for simplicity, the variables $\widetilde{V}_t(b_{-1}) = V_{n-t}(b_{-1})$, $t = 0,\ldots,n-1$. Then $\{\widetilde{V}_t(\cdot) : t = 0,\ldots,n\}$ satisfy the following equation.
\[
\widetilde{V}_t(b_{-1}) = \sup_{\pi^\infty(\cdot|b_{-1})} \Big\{ \sum_{a\in\mathcal{A}} \Big\{ \sum_{b\in\mathcal{B}} \log\Big( \frac{P(b|b_{-1},a)}{P^{\pi^\infty}(b|b_{-1})} \Big) P(b|b_{-1},a) + \sum_{b\in\mathcal{B}} \widetilde{V}_{t-1}(b)\, P(b|b_{-1},a) \Big\} \pi^\infty(a|b_{-1}) \Big\}, \quad t \in \{0,\ldots,n\}. \quad (III.110)
\]
Next, we introduce a sufficient condition to test whether the per unit time limit of the solution to the dynamic programming recursions exists and is independent of the initial state $B_{-1} = b_{-1} \in \mathcal{B}$.

Assumptions III.2. (Sufficient condition for convergence of dynamic programming recursions) Assume that there exist a $V : \mathcal{B} \mapsto \mathbb{R}$ and a $J^* \in \mathbb{R}$ such that for all $b_{-1} \in \mathcal{B}$
\[
\lim_{t\to\infty} \big( \widetilde{V}_t(b_{-1}) - t J^* \big) = V(b_{-1}). \quad (III.111)
\]

Clearly, if Assumptions III.2 hold, then $\lim_{t\to\infty} \frac{1}{t}\widetilde{V}_t(b_{-1}) = J^*$, $\forall b_{-1} \in \mathcal{B}$ (because for finite alphabet spaces, the dynamic programming operator maps bounded continuous functions to bounded continuous functions) [28]. This means the per unit time limit of the dynamic programming recursion is independent of the initial state $b_{-1}$, which then implies that the per unit time limit of $C^{FB,UMCO}_{A^n\to B^n}$ is independent of the choice of the initial distribution $\mu(b_{-1})$.

Remark III.2. (Test of asymptotic stationarity and ergodicity) Given any channel, we can verify that the optimal channel input distribution induces asymptotic stationarity and ergodicity of the corresponding joint process $\{(A_i,B_i) : i = 0,1,\ldots\}$ by solving the dynamic programming recursions analytically for finite $n$ via (III.109), and then identifying conditions on the channel parameters so that Assumptions III.2 hold.
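Remark III.2 suggests a numerical route: iterate (III.110) and check whether the per-step gains settle to a common constant, as required by (III.111). The following minimal Python sketch does this for binary alphabets; the BSSC of Section IV is plugged in purely as an illustrative kernel, and the grid search over $\pi(\cdot|b_{-1})$ is our crude stand-in for the exact inner supremum.

```python
import numpy as np

# Illustrative kernel: the BSSC(alpha, beta) of Section IV, stored as
# P[b, b_prev, a] = P(b | b_prev, a); P(b = a) equals alpha when a = b_prev
# (state 0) and beta otherwise (state 1), matching (IV.144).
alpha, beta = 0.9, 0.8
P = np.zeros((2, 2, 2))
for bp in (0, 1):
    for a in (0, 1):
        pk = alpha if a == bp else beta
        P[a, bp, a], P[1 - a, bp, a] = pk, 1.0 - pk

def bellman(V, grid=1001):
    """One step of (III.110); grid search replaces the exact sup over pi."""
    Vnew = np.zeros(2)
    for bp in (0, 1):
        best = -np.inf
        for p0 in np.linspace(1e-6, 1 - 1e-6, grid):
            pi = np.array([p0, 1 - p0])
            Pout = P[:, bp, :] @ pi                   # P^pi(b | b_prev)
            val = sum(pi[a] * P[b, bp, a] * (np.log2(P[b, bp, a] / Pout[b]) + V[b])
                      for a in (0, 1) for b in (0, 1))
            best = max(best, val)
        Vnew[bp] = best
    return Vnew

V = np.zeros(2)
for t in range(100):
    Vnew = bellman(V)
    gain, V = Vnew - V, Vnew

print("per-state gain (J* if equal):", gain)   # (III.111) => equal entries
print("V_t(0) - V_t(1):", V[0] - V[1])         # converges under Assumptions III.2
```

In view of Assumptions III.2, we have the following lemma.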
Lemma III.1.
If Assumptions III.2 hold and there exist a $\{\pi^{\infty,*}(a|b_{-1}) \in \mathcal{M}(\mathcal{A}) : b_{-1} \in \mathcal{B}\}$ and a corresponding pair $\{(V(b_{-1}), J^*) : b_{-1} \in \mathcal{B},\ J^* \in \mathbb{R}\}$ which solve
\[
J^* + V(b_{-1}) = \sup_{\pi^\infty(a|b_{-1})} \Big\{ \sum_{a} \Big\{ \sum_{b} \log\Big( \frac{P(b|b_{-1},a)}{P^{\pi^\infty}(b|b_{-1})} \Big) P(b|b_{-1},a) + \sum_{b} V(b)\, P(b|b_{-1},a) \Big\} \pi^\infty(a|b_{-1}) \Big\}, \quad (III.112)
\]
then feedback capacity is given by
\[
J^* = C^{FB,UMCO}_{A^\infty\to B^\infty} \triangleq \liminf_{n\to\infty} \frac{1}{n} \sup_{\pi^\infty(\cdot|b_{i-1})} \mathbf{E}^{\pi^\infty} \Big\{ \sum_{i=0}^{n-1} \log\Big( \frac{P(B_i|B_{i-1},A_i)}{P^{\pi^\infty}(B_i|B_{i-1})} \Big) \Big\}, \quad \forall \mu(db_{-1}) \in \mathcal{M}(\mathcal{B}), \quad (III.113)
\]
and moreover the value $C^{FB,UMCO}_{A^\infty\to B^\infty}$ does not depend on the choice of the initial distribution $\mu(db_{-1}) \in \mathcal{M}(\mathcal{B})$.

Proof: See Appendix A.

Thus, we have two different ways to determine sufficient conditions for $J^*$ to correspond to feedback capacity: one based on Remark III.2, and one based on Lemma III.1, i.e., by solving the infinite horizon dynamic programming equation (III.112).

Next, we state the necessary and sufficient conditions for any $\{\pi^\infty(a|b_{-1}) \in \mathcal{M}(\mathcal{A}) : b_{-1} \in \mathcal{B}\}$ to be a solution of the dynamic programming equation (III.112).

Theorem III.3. (Infinite horizon necessary and sufficient conditions) Suppose Assumptions III.2 hold and there exist a $\{\pi^{\infty,*}(a|b_{-1}) \in \mathcal{M}(\mathcal{A}) : b_{-1} \in \mathcal{B}\}$, a $J^* \in \mathbb{R}$, and a corresponding $\{V(b_{-1}) : b_{-1} \in \mathcal{B}\}$, which solve (III.112). The necessary and sufficient conditions for any input distribution $\{\pi^\infty(a|b_{-1}) \in \mathcal{M}(\mathcal{A}) : b_{-1} \in \mathcal{B}\}$ to achieve the supremum of the dynamic programming equation (III.112) are the following. There exist $\{V(b_{-1}) : b_{-1} \in \mathcal{B}\}$ such that
\[
J^* + V(b_{-1}) = \sum_{b} \Big( \log\Big( \frac{P(b|a,b_{-1})}{P^{\pi^\infty}(b|b_{-1})} \Big) + V(b) \Big) P(b|a,b_{-1}), \quad \forall a \in \mathcal{A} \ \text{if}\ \pi^\infty(a|b_{-1}) \neq 0, \quad (III.114)
\]
\[
J^* + V(b_{-1}) \leq \sum_{b} \Big( \log\Big( \frac{P(b|a,b_{-1})}{P^{\pi^\infty}(b|b_{-1})} \Big) + V(b) \Big) P(b|a,b_{-1}), \quad \forall a \in \mathcal{A} \ \text{if}\ \pi^\infty(a|b_{-1}) = 0. \quad (III.115)
\]
Moreover, $\{V(b_{-1}) : b_{-1} \in \mathcal{B}\}$ is the value function defined by (III.112).

Proof: Consider the dynamic programming equation (III.112) and repeat the necessary steps of the derivation of Theorem III.1. A more direct approach is to use the necessary and sufficient conditions of Theorem III.1, as follows: re-write the necessary and sufficient conditions (III.72), (III.73), as done in (A.186), using Assumptions III.2, to verify that (III.114), (III.115) are the resulting equations.
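Conditions (III.114)-(III.115) are straightforward to test numerically once a candidate triple $(\pi^\infty, V, J^*)$ is available. A minimal sketch, assuming finite alphabets and a kernel stored as $P[b, b_{-1}, a]$ (our layout convention):

```python
import numpy as np

def check_conditions(P, pi, V, J, tol=1e-9):
    """Test (III.114)-(III.115) for a candidate pi[a, b_prev], value function
    V[b] and constant J; P[b, b_prev, a] is the channel kernel."""
    nB, nA = P.shape[0], P.shape[2]
    Pout = np.einsum('bpa,ap->bp', P, pi)          # P^pi(b | b_prev)
    for bp in range(nB):
        for a in range(nA):
            rhs = sum(P[b, bp, a] * (np.log2(P[b, bp, a] / Pout[b, bp]) + V[b])
                      for b in range(nB) if P[b, bp, a] > 0)
            if pi[a, bp] > 0 and abs(J + V[bp] - rhs) > tol:
                return False                       # (III.114) violated on the support
            if pi[a, bp] == 0 and J + V[bp] > rhs + tol:
                return False                       # (III.115) violated off the support
    return True
```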
C. Sufficient Conditions for Asymptotic Stationarity and Ergodicity based on Irreducibility
In this section we give another set of assumptions, based on irreducibility of the channel output transition probability for each channel input conditional distribution. Define
\[
\ell(b_{i-1},a_i) \triangleq \mathbf{E}^{\pi^\infty} \Big\{ \log\Big( \frac{P(B_i|B_{i-1},A_i)}{P^{\pi^\infty}(B_i|B_{i-1})} \Big) \,\Big|\, B_{i-1} = b_{i-1},\ A_i = a_i \Big\} \quad (III.116)
\]
\[
\equiv \sum_{b_i\in\mathcal{B}} \log\Big( \frac{P(b_i|b_{i-1},a_i)}{P^{\pi^\infty}(b_i|b_{i-1})} \Big) P(b_i|b_{i-1},a_i), \quad (III.117)
\]
\[
\ell(b_{i-1},\pi^\infty(b_{i-1})) \triangleq \mathbf{E}^{\pi^\infty} \Big\{ \log\Big( \frac{P(B_i|B_{i-1},A_i)}{P^{\pi^\infty}(B_i|B_{i-1})} \Big) \,\Big|\, B_{i-1} = b_{i-1} \Big\} \equiv \sum_{a_i\in\mathcal{A}} \ell(b_{i-1},a_i)\, \pi^\infty(a_i|b_{i-1}). \quad (III.118)
\]
To apply standard results of Markov Decision (MD) theory from [27], [28], we introduce the following notation. Each element of the alphabet space $\mathcal{B}$ is identified by the vector $\mathcal{B} = \{b(1),\ldots,b(|\mathcal{B}|)\}$, where $|\mathcal{B}|$ is the cardinality of the set $\mathcal{B}$. Then we can identify any $V : \mathcal{B} \mapsto \mathbb{R}$ with a vector in $\mathbb{R}^{|\mathcal{B}|}$. Similarly, any channel input distribution is identified with
\[
\pi^\infty \triangleq \{\pi^\infty(b(1)), \pi^\infty(b(2)), \ldots, \pi^\infty(b(|\mathcal{B}|))\} \triangleq \{\pi^\infty(\cdot|b(i)) \in \mathcal{M}(\mathcal{A}) : b(i) \in \mathcal{B}\}. \quad (III.119)
\]
Next, we define the vector pay-off and the channel output transition probability matrix as follows.
\[
\ell(\pi^\infty) \triangleq \big( \ell(b(1),\pi^\infty(b(1)))\ \ldots\ \ell(b(|\mathcal{B}|),\pi^\infty(b(|\mathcal{B}|))) \big)^T \in \mathbb{R}^{|\mathcal{B}|}, \quad (III.120)
\]
\[
P(\pi^\infty) = \big\{ P^{\pi^\infty}(b_i|b_{i-1}) : (b_i,b_{i-1}) \in \mathcal{B}\times\mathcal{B} \big\} \in \mathbb{R}^{|\mathcal{B}|\times|\mathcal{B}|}. \quad (III.121)
\]
Let $\{\mu(b(i)) : i = 1,\ldots,|\mathcal{B}|\} \in \mathbb{R}^{|\mathcal{B}|}$ be defined by $\mu(b(i)) = P(B_{-1} = b(i))$, $i = 1,\ldots,|\mathcal{B}|$. Using the above notation, we have the following main theorem.

Theorem III.4. (Dynamic programming equation under irreducibility) Suppose Assumptions III.1 hold and, for each channel input distribution $\pi^\infty$, the transition probability matrix of the output process $P(\pi^\infty) \equiv \{P^{\pi^\infty}(b|b_{-1}) : (b,b_{-1}) \in \mathcal{B}\times\mathcal{B}\}$ is irreducible. Then for any channel input distribution $\pi^\infty$ the expression (III.104) is given by
\[
J(\pi^\infty,\mu) = \nu(\pi^\infty)^T \ell(\pi^\infty) \equiv J(\pi^\infty), \quad (III.122)
\]
i.e., it is independent of $\mu(\cdot)$, where $\nu(\pi^\infty)$ is the unique invariant probability distribution of the channel output process $\{B_0,B_1,\ldots\}$, which satisfies
\[
P(\pi^\infty)\, \nu(\pi^\infty) = \nu(\pi^\infty). \quad (III.123)
\]
If there exists a time-invariant Markov channel input distribution $\pi^\infty(\cdot|\cdot)$ such that $J(\pi^{\infty,*}) = \max_{\pi^\infty} J(\pi^\infty)$, then there exists a pair $(V(\pi^{\infty,*},\cdot), J(\pi^{\infty,*}))$, with $V(\pi^{\infty,*},\cdot) : \mathcal{B} \mapsto \mathbb{R}$ identified with a vector in $\mathbb{R}^{|\mathcal{B}|}$ and $J(\pi^{\infty,*}) \in \mathbb{R}$, that is a solution of the dynamic programming equation
\[
J(\pi^{\infty,*}) + V(\pi^{\infty,*},b_{-1}) = \sup_{\pi^\infty(\cdot|b_{-1})} \Big\{ \ell(b_{-1},\pi^\infty(b_{-1})) + \sum_{z\in\mathcal{B}} V(\pi^{\infty,*},z)\, P^{\pi^\infty}(z|b_{-1}) \Big\}. \quad (III.124)
\]
Moreover, $J^* \equiv J(\pi^{\infty,*}) = C^{FB,UMCO}_{A^\infty\to B^\infty}$ satisfies (III.113) and corresponds to feedback capacity.

Proof: This is shown in Appendix B.
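Under the irreducibility hypothesis, (III.122)-(III.123) reduce the evaluation of $J(\pi^\infty)$ to linear algebra. A minimal sketch, assuming the column-stochastic convention $P^{\pi}[b, b_{-1}]$ and a user-supplied pay-off vector $\ell(\pi^\infty)$ of (III.120):

```python
import numpy as np

def invariant_distribution(Ppi):
    """Unique invariant distribution of an irreducible column-stochastic
    matrix Ppi[b, b_prev], i.e. the solution of (III.123)."""
    n = Ppi.shape[0]
    A = np.vstack([Ppi - np.eye(n), np.ones((1, n))])   # P nu = nu and sum(nu) = 1
    rhs = np.concatenate([np.zeros(n), [1.0]])
    nu, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return nu

def J_of_pi(Ppi, ell):
    """(III.122): J(pi) = nu(pi)^T l(pi), independent of mu."""
    return invariant_distribution(Ppi) @ ell
```

We make the following comments.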
Remark III.3. (Comments on Theorem III.4)
(a) Theorem III.4 gives sufficient conditions, in terms of irreducibility of the channel output transition probability matrix $P(\pi^\infty)$, to test whether the per unit time limit of the FTFI capacity corresponds to feedback capacity. Unfortunately, it is not possible to know, prior to solving the dynamic programming equation (III.124), whether the irreducibility condition holds, because the transition probability $P(\pi^\infty)$ is a functional of the optimal channel input distribution. A similar issue occurs in the analysis provided by Chen and Berger [15, Lemma 2, Theorem 3]. In view of this technicality, it is more appropriate to apply the necessary and sufficient conditions of Theorem III.1 to determine the optimal channel input distribution and the corresponding characterization of FTFI capacity, and then follow the suggestion given under Remark III.2.
(b) The solution $V$ obtained from (III.124) is unique up to an additive constant, and if $\pi^*(\cdot|b_{-1})$ attains the maximum in (III.124) for every $b_{-1}$, then $\pi^*(\cdot|\cdot)$ is an optimal channel input distribution, and the maximum cost is $J^*$.
(c) In specific application examples it may happen that the optimal channel input probability distribution $\pi^{\infty,*}(\cdot|\cdot)$ induces a transition probability matrix $P(\pi^{\infty,*})$ which is reducible, i.e., not irreducible. For completeness, this specific case is addressed in Remark III.4.

Next, we provide an iterative algorithm to compute the optimal channel input distribution and the feedback capacity. In Section IV-D1, we illustrate how Algorithm 1 is implemented through an example.

Algorithm 1
1) Let $m = 0$ and pick an initial distribution $\pi^0$.
2) Solve the equation
\[
J(\pi^m)\, e + V(\pi^m) = \ell(\pi^m) + V^T(\pi^m)\, P(\pi^m), \quad e \triangleq (1,\ldots,1) \in \mathbb{R}^{|\mathcal{B}|}, \quad (III.125)
\]
for $J(\pi^m) \in \mathbb{R}$ and $V(\pi^m) \in \mathbb{R}^{|\mathcal{B}|}$.
3) Let
\[
\pi^{m+1} = \arg\max_{\pi} \big\{ \ell(\pi) + V^T(\pi^m)\, P(\pi) \big\}. \quad (III.126)
\]
4) If $\pi^{m+1} = \pi^m$, let $\pi^* = \pi^m$; else let $m = m+1$ and return to step 2.
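A minimal Python sketch of Algorithm 1 for binary alphabets follows. The evaluation step solves (III.125) with the normalization $V(b(1)) = 0$; in the improvement step we use the evaluated $V(\pi^m)$ and replace the exact maximization of (III.126) by a grid search. The kernel layout $P[b, b_{-1}, a]$, the initial $\pi^0$, and the tolerances are our own choices.

```python
import numpy as np

def ell_state(P, pi_bp, bp):
    """l(b_prev, pi(b_prev)) of (III.118), in bits."""
    Pout = P[:, bp, :] @ pi_bp
    return sum(pi_bp[a] * P[b, bp, a] * np.log2(P[b, bp, a] / Pout[b])
               for a in range(P.shape[2]) for b in range(P.shape[0]) if P[b, bp, a] > 0)

def policy_evaluation(P, pi):
    """Step 2: solve J e + V = l(pi) + V^T P(pi) of (III.125), with V[0] = 0."""
    Ppi = np.einsum('bpa,ap->bp', P, pi)
    l = np.array([ell_state(P, pi[:, bp], bp) for bp in range(2)])
    # Per state j: J + V[j] = l[j] + sum_b V[b] Ppi[b, j]; unknowns (J, V[1]).
    A = np.array([[1.0, -Ppi[1, 0]],
                  [1.0, 1.0 - Ppi[1, 1]]])
    J, V1 = np.linalg.solve(A, l)
    return J, np.array([0.0, V1])

def policy_improvement(P, V, grid=2001):
    """Step 3: argmax of (III.126) per state, by grid search over pi(0|b_prev)."""
    pi = np.zeros((2, 2))
    for bp in range(2):
        best, best_p = -np.inf, 0.5
        for p0 in np.linspace(1e-6, 1 - 1e-6, grid):
            cand = np.array([p0, 1 - p0])
            val = ell_state(P, cand, bp) + V @ (P[:, bp, :] @ cand)
            if val > best:
                best, best_p = val, p0
        pi[:, bp] = (best_p, 1 - best_p)
    return pi

def algorithm1(P, iters=100, tol=1e-4):
    pi, J = np.full((2, 2), 0.5), 0.0             # step 1: initial pi^0
    for _ in range(iters):
        J, V = policy_evaluation(P, pi)           # step 2
        pi_new = policy_improvement(P, V)         # step 3
        if np.allclose(pi_new, pi, atol=tol):     # step 4: fixed point reached
            break
        pi = pi_new
    return pi, J                                  # J approximates feedback capacity
```

When the iterates converge and the induced $P(\pi^\infty)$ is irreducible, the returned gain $J$ approximates $C^{FB,UMCO}_{A^\infty\to B^\infty}$.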
Remark III.4. Theorem III.4 and Algorithm 1 presuppose that we know in advance that the transition probability matrix $P(\pi^\infty)$ of the channel output process, evaluated at the optimal strategy $\pi^{\infty,*}(\cdot|\cdot)$, is irreducible. If irreducibility does not hold, then the dynamic programming equation (III.124) may not be sufficient to give the optimal channel input distribution and the feedback capacity. In particular, if $P(\pi^\infty)$ is reducible, then (III.124) need not have a solution. To overcome this limitation, an additional equation is added to (III.124), giving the following pair of equations.
\[
J^*(b_{-1}) = \sup_{\pi^\infty(\cdot|b_{-1})} \Big\{ \sum_{b\in\mathcal{B}} J^*(b)\, P^{\pi^\infty}(b|b_{-1}) \Big\}, \quad (III.127)
\]
\[
J^*(b_{-1}) + V(b_{-1}) = \sup_{\pi^\infty(\cdot|b_{-1})} \Big\{ \sum_{a\in\mathcal{A}} \sum_{b\in\mathcal{B}} \Big\{ \log\Big( \frac{P(b|a,b_{-1})}{P^{\pi^\infty}(b|b_{-1})} \Big) + V(b) \Big\} P(b|a,b_{-1})\, \pi^\infty(a|b_{-1}) \Big\}. \quad (III.128)
\]
We refer to the pair (III.127) and (III.128) as the generalized dynamic programming equations. The proposed pair of dynamic programming equations completely characterizes feedback capacity.

D. Error exponents for the UMCO channel with feedback
In this section, we provide bounds on the probability of error of maximum likelihood decoding, by utilizing the results in [25] and [29]. However, we go one step further and show how to compute this bound, taking advantage of the structure of the capacity achieving distribution.

Consider the channel $\{P_i(b_i|b_{i-1},a_i) : i = 0,\ldots,n\}$, where $\mathcal{B}_i = \mathcal{B}$, $\mathcal{A}_i = \mathcal{A}$, $i = 0,\ldots,n$. Let $P^{(n)}_{e,m}(b_{-1})$ denote the probability of error for an arbitrary message $m \in \mathcal{M}_n \triangleq \{1,2,\ldots,M_n = \lfloor 2^{nR} \rfloor\}$, given the initial state $b_{-1} \in \mathcal{B}$. From [29], there exists a feedback code for which the probability of error is bounded above as follows.
\[
P^{(n)}_{e,m}(b_{-1}) \leq |\mathcal{B}|\, 2^{-n[-\rho R + F_n(\rho)]}, \quad \forall m \in \mathcal{M}_n,\ b_{-1} \in \mathcal{B},\ 0 \leq \rho \leq 1, \quad (III.129)
\]
\[
F_n(\rho) \triangleq -\frac{\rho \log|\mathcal{B}|}{n} + \max_{P_i(a_i|a^{i-1},b^{i-1}):\, i = 0,1,\ldots,n} \Big[ \min_{b_{-1}\in\mathcal{B}} E_{P,n}(\rho,b_{-1}) \Big], \quad (III.130)
\]
\[
E_{P,n}(\rho,b_{-1}) = -\frac{1}{n} \log \sum_{(b_0,\ldots,b_{n-1})} \Big[ \sum_{(a_0,\ldots,a_{n-1})} \prod_{i=0}^{n-1} P_i(a_i|a^{i-1},b^{i-1})\, P_i(b_i|a_i,b_{i-1})^{\frac{1}{1+\rho}} \Big]^{1+\rho}. \quad (III.131)
\]
(If the initial state is known both to the encoder and the decoder, then the cardinality of the state alphabet, $|\mathcal{B}|$, in (III.129) and (III.130) is removed [25, Problem 5.37].) However, by restricting the channel input distribution in (III.130), (III.131) to the set $\{\pi_i(da_i|b_{i-1}) : i = 0,\ldots,n\} \in \overset{\circ}{\mathcal{P}}^{FB}_{[0,n]}$, the following upper bound is obtained.
\[
P^{(n)}_{e,m}(b_{-1}) \leq |\mathcal{B}|\, 2^{-n[-\rho R + F_n(\rho)]}, \quad \forall m \in \mathcal{M}_n,\ b_{-1} \in \mathcal{B},\ 0 \leq \rho \leq 1, \quad (III.132)
\]
\[
F_n(\rho) \triangleq -\frac{\rho \log|\mathcal{B}|}{n} + \min_{b_{-1}\in\mathcal{B}} E_{\pi,n}(\rho,b_{-1}), \quad \forall \{\pi_i(da_i|b_{i-1}) : i = 0,\ldots,n\} \in \overset{\circ}{\mathcal{P}}^{FB}_{[0,n]}, \quad (III.133)
\]
\[
E_{\pi,n}(\rho,b_{-1}) = -\frac{1}{n} \log \sum_{(b_0,\ldots,b_{n-1})} \Big[ \sum_{(a_0,\ldots,a_{n-1})} \prod_{i=0}^{n-1} \pi_i(a_i|b_{i-1})\, P_i(b_i|a_i,b_{i-1})^{\frac{1}{1+\rho}} \Big]^{1+\rho}. \quad (III.134)
\]
Next, we derive simplified expressions for (III.132)-(III.134), in order to compute the bound on the probability of error. For the rest of the analysis, we view the memory of the channel on the previous output symbol as the state of the channel, defined by $s_{i-1} \triangleq b_{i-1}$, $i = 0,1,\ldots,n-1$. Then we transform the channel into an equivalent channel of the form $P_i(b_i|a_i,b_{i-1}) = P_i(b_i|a_i,s_{i-1})$, $i = 0,\ldots,n$. Since the state of the channel is known at the decoder, we apply the methodology used to derive Theorem 5.9.3 of [25] and (III.134), to obtain an upper bound on the probability of error which is computationally less intensive than (III.132), as follows. At each time $i$, the channel distribution is further transformed to
\[
P_i(b_i,s_i|a_i,s_{i-1}) = P_i(b_i|a_i,s_{i-1}) \ \text{if}\ s_i = b_i, \quad i = 0,\ldots,n. \quad (III.135)
\]
Substituting (III.135) into (III.134) gives the following equivalent expression.
\[
E_{\pi,n}(\rho,b_{-1}) = -\frac{1}{n} \log \sum_{(b_0,\ldots,b_{n-1})} \Big[ \sum_{(a_0,\ldots,a_{n-1})} \prod_{i=0}^{n-1} \pi_i(a_i|b_{i-1})\, P_i(b_i|a_i,s_{i-1})^{\frac{1}{1+\rho}} \Big]^{1+\rho} \quad (III.136)
\]
\[
= -\frac{1}{n} \log \sum_{(s_0,\ldots,s_{n-1})} \sum_{(b_0,\ldots,b_{n-1})} \Big[ \sum_{(a_0,\ldots,a_{n-1})} \prod_{i=0}^{n-1} \pi_i(a_i|s_{i-1})\, P_i(b_i,s_i|a_i,s_{i-1})^{\frac{1}{1+\rho}} \Big]^{1+\rho} \quad (III.137)
\]
\[
= -\frac{1}{n} \log \sum_{(s_0,\ldots,s_{n-1})} \prod_{i=0}^{n-1} \sum_{b_i} \Big[ \sum_{a_i} \pi_i(a_i|s_{i-1})\, P_i(b_i,s_i|a_i,s_{i-1})^{\frac{1}{1+\rho}} \Big]^{1+\rho}. \quad (III.138)
\]
Define the inner summations in (III.138) by
\[
\Lambda^{\pi}_i(s_i,s_{i-1}) \triangleq \sum_{b_i} \Big[ \sum_{a_i} \pi_i(a_i|s_{i-1})\, P_i(b_i,s_i|a_i,s_{i-1})^{\frac{1}{1+\rho}} \Big]^{1+\rho}, \quad i = 0,\ldots,n. \quad (III.139)
\]
Then, by substituting (III.139) into (III.138), we obtain
\[
E_{\pi,n}(\rho,b_{-1}) = -\frac{1}{n} \log \sum_{(s_0,\ldots,s_{n-1})} \prod_{i=0}^{n-1} \Lambda^{\pi}_i(s_i,s_{i-1}). \quad (III.140)
\]
Let $\{\Lambda^{\pi}_i(s_i,s_{i-1}) : (s_i,s_{i-1}) \in \mathcal{B}\times\mathcal{B}\}$ denote the matrix with elements $\Lambda^{\pi}_i(s_i,s_{i-1})$, $s_i = 1,\ldots,|\mathcal{B}|$, $s_{i-1} = 1,\ldots,|\mathcal{B}|$, that is,
\[
[\Lambda^{\pi}_i(s_i,s_{i-1})] \triangleq \begin{pmatrix} \Lambda^{\pi}_i(1,1) & \ldots & \Lambda^{\pi}_i(1,|\mathcal{B}|) \\ \vdots & \ddots & \vdots \\ \Lambda^{\pi}_i(|\mathcal{B}|,1) & \ldots & \Lambda^{\pi}_i(|\mathcal{B}|,|\mathcal{B}|) \end{pmatrix}, \quad i = 0,\ldots,n. \quad (III.141)
\]
In general, the computation of the error probability is difficult, in view of the time-varying properties of the channel distribution and the channel input distribution, which imply that the matrix $[\Lambda^{\pi}_i(s_i,s_{i-1})]$ is also time-varying. However, by following the derivation of equation (5.9.45) in [25], we obtain the following bound on the probability of error for the UMCO channel with feedback.

Theorem III.5. (Error probability bound for maximum likelihood decoding) Suppose the channel distribution is time-invariant, given by $\{P(b_i|b_{i-1},a_i) : i = 0,\ldots,n\}$, and the probability of error defined by (III.132)-(III.134) is evaluated at a time-invariant channel input distribution $\{\pi^{TI}(a_i|b_{i-1}) : i = 0,\ldots,n\}$. Then the following hold.
(i) The matrix $[\Lambda^{\pi}_i(s_i,s_{i-1})] = [\Lambda^{\pi^{TI}}(s_i,s_{i-1})]$, $i = 0,\ldots,n$, is time-invariant.
(ii) If the time-invariant matrix $[\Lambda^{\pi^{TI}}(s_i,s_{i-1})]$ is irreducible, then there exists a feedback code for which the probability of error is bounded above as follows.
\[
P^{(n)}_{e,m} \leq |\mathcal{B}|\, \frac{v_{\max}}{v_{\min}}\, 2^{-n\big[-\rho R - \log \lambda^{\pi^{TI}}_{\max}(\rho)\big]}, \quad \forall m \in \mathcal{M}_n,\ 0 \leq \rho \leq 1, \quad (III.142)
\]
where $\lambda^{\pi^{TI}}_{\max}(\rho)$ is the largest eigenvalue of the matrix $[\Lambda^{\pi^{TI}}(s_i,s_{i-1})]$, and $v_{\max}$ and $v_{\min}$ are the maximum and minimum components, respectively, of the positive eigenvector corresponding to the largest eigenvalue.

Proof: (i) The first statement follows directly from the assumptions and the fact that $E_{\pi,n}(\rho,b_{-1}) = E_{\pi^{TI},n}(\rho,b_{-1}) \triangleq -\frac{1}{n}\log \sum_{(s_0,\ldots,s_{n-1})} \prod_{i=0}^{n-1} \Lambda^{\pi^{TI}}(s_i,s_{i-1})$.
(ii) For an irreducible matrix $[\Lambda(s_i,s_{i-1})]$ with non-negative components, we can apply the Frobenius theorem to show that the following inequality holds [25].
\[
\Big| E_{\pi^{TI},n}(\rho,b_{-1}) + \log \lambda^{\pi^{TI}}_{\max}(\rho) \Big| \leq \frac{1}{n} \log \frac{v_{\max}}{v_{\min}}. \quad (III.143)
\]
The upper bound (III.142) follows from the last expression.

Note that the probability of error bound in Theorem III.5 is independent of the initial state $b_{-1} \in \mathcal{B}$. In Section IV-A3 we evaluate (III.142) of Theorem III.5 for a specific channel with memory.
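Theorem III.5 is also simple to evaluate numerically. The sketch below, assuming the kernel layout $P[b, b_{-1}, a]$ and a time-invariant $\pi[a, s_{-1}]$ (our conventions), builds $\Lambda^{\pi}$ of (III.139) and evaluates the bound (III.142) via its Perron root:

```python
import numpy as np

def error_bound(P, pi, rho, R, n):
    """Bound (III.142) for a time-invariant pi[a, s_prev]:
    build Lambda of (III.139), then use its Perron root and eigenvector ratio."""
    nB, nA = P.shape[0], P.shape[2]
    Lam = np.zeros((nB, nB))
    for sp in range(nB):                    # s_{i-1} = b_{i-1}
        for b in range(nB):                 # s_i = b_i, per (III.135)
            inner = sum(pi[a, sp] * P[b, sp, a] ** (1 / (1 + rho)) for a in range(nA))
            Lam[b, sp] = inner ** (1 + rho)
    w, Vec = np.linalg.eig(Lam)
    k = np.argmax(w.real)
    lam_max = w.real[k]
    v = np.abs(Vec[:, k].real)              # positive Perron eigenvector
    exponent = -rho * R - np.log2(lam_max)  # must be positive for the bound to decay
    return nB * (v.max() / v.min()) * 2.0 ** (-n * exponent)
```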
IV. THE BSSC WITH AND WITHOUT FEEDBACK AND WITH AND WITHOUT TRANSMISSION COST
In this section, we apply the main results of the previous section to the unit memory channel called the Binary State Symmetric Channel (BSSC), defined by
\[
P(b_i|a_i,b_{i-1}) =
\begin{array}{c|cc}
(a_i,b_{i-1}) \setminus b_i & 0 & 1 \\ \hline
00 & \alpha & 1-\alpha \\
01 & \beta & 1-\beta \\
10 & 1-\beta & \beta \\
11 & 1-\alpha & \alpha
\end{array}, \quad i = 0,1,\ldots,n, \quad (\alpha,\beta) \in [0,1]\times[0,1]. \quad (IV.144)
\]
We show, using Theorem III.2, that the feedback capacity optimization problem is non-nested and the optimal channel input distribution is time-invariant. Further, we derive explicit expressions for feedback capacity and capacity without feedback, and we show that the capacity achieving distribution and the corresponding transition probability of the channel output process are characterized by doubly stochastic matrices. Moreover, we show that feedback does not increase capacity, and that capacity without feedback is achieved by a first order Markov channel input distribution, which is also doubly stochastic.

First, we show that the BSSC is equivalent to a channel with state information $s_i \triangleq a_i \oplus b_{i-1}$, $i = 0,1,\ldots,n$, where $\oplus$ denotes modulo-2 addition, as depicted in Fig. IV.1. Clearly, this transformation is one-to-one and onto, i.e., for a fixed channel input symbol value $a_i$ (respectively, channel output symbol value $b_{i-1}$), $s_i$ is uniquely determined by the value of $b_{i-1}$ (respectively, $a_i$), and vice-versa. Hence, we obtain the following equivalent representation of the BSSC.

[Fig. IV.1: An equivalent model. (a) Previous Output STate (POST) channel. (b) Binary State Symmetric Channel (BSSC).]

\[
P(b_i|a_i,s_i=0) = \begin{pmatrix} \alpha & 1-\alpha \\ 1-\alpha & \alpha \end{pmatrix}, \quad i = 0,1,\ldots,n, \quad (IV.145)
\]
\[
P(b_i|a_i,s_i=1) = \begin{pmatrix} \beta & 1-\beta \\ 1-\beta & \beta \end{pmatrix}, \quad i = 0,1,\ldots,n. \quad (IV.146)
\]
The above transformation highlights the symmetric form of the BSSC: for a fixed state $s_i \in \{0,1\}$, the channel (IV.144) decomposes into two Binary Symmetric Channels (BSCs), with transition probabilities given by (IV.145) and (IV.146), respectively. Therefore, for a fixed value of the previous output symbol, $b_{i-1}$, the encoder, by choosing the current input symbol, $a_i$, knows which of the two BSCs is applied at each transmission time. This decomposition motivates the name state-symmetric channel. The following notation is used in the rest of the paper.
• BSSC($\alpha,\beta$) denotes the BSSC with transition probabilities defined by (IV.144);
• BSC($1-\alpha$) denotes the "state zero" channel defined by (IV.145);
• BSC($1-\beta$) denotes the "state one" channel defined by (IV.146).

The necessity of imposing a transmission cost constraint on the channel is discussed by Shannon in [30, pp. 162-163], and is encapsulated in the following statement: "There is a curious and provocative duality between the properties of a source with a distortion measure and those of a channel. This duality is enhanced if we consider channels in which there is a 'cost' associated with the different input letters, and it is desired to find the capacity subject to the constraint that the expected cost not exceed a certain quantity...". In [19], it is shown that the BSSC is in perfect duality with the Binary Symmetric Markov Source (BSMS), with respect to a transmission cost function for the channel and a fidelity constraint for the source. This generalizes the joint source-channel matching of the discrete memoryless Bernoulli source with single letter Hamming distortion transmitted over a memoryless BSC.

Next, we illustrate that the cost constraint is natural when imposed on the BSSC. The memory on the previous output symbols and the decomposable nature of the channel allow us to impose a cost function related to the state of the channel. The physical interpretation of the transmission cost is the following. The two states of the BSSC are
• $s_i = 0$: the channel in use is BSC($1-\alpha$);
• $s_i = 1$: the channel in use is BSC($1-\beta$).
Suppose $\alpha > \beta \geq 0.5$. Then the capacity of the state zero channel is greater than the capacity of the state one channel. With "abuse" of terminology, the state zero channel is interpreted as the "good channel" and the state one channel is interpreted as the "bad channel". With this interpretation, it is reasonable to impose a higher cost when employing the "good channel", and a lower cost when employing the "bad channel". This policy is quantified by assigning a binary pay-off equal to "1" when the good channel is used, and a pay-off equal to "0" when the bad channel is used.

Definition IV.1. (Binary cost function for the BSSC) The cost function of the BSSC is
\[
\gamma(a_i,b_{i-1}) \triangleq \begin{cases} 1 & \text{if}\ a_i = b_{i-1}\ (s_i = 0) \\ 0 & \text{if}\ a_i \neq b_{i-1}\ (s_i = 1) \end{cases}, \quad i = 0,\ldots,n. \quad (IV.147)
\]
The average transmission cost constraint is defined by
\[
\frac{1}{n+1}\, \mathbf{E}\Big\{ \sum_{i=0}^{n} \gamma(A_i,B_{i-1}) \Big\} \leq \kappa, \quad \kappa \in [0,\kappa_{\max}], \quad (IV.148)
\]
where the letter-by-letter average transmission cost is given by
\[
\mathbf{E}\{\gamma(A_i,B_{i-1})\} = P_i(a_i=0,b_{i-1}=0) + P_i(a_i=1,b_{i-1}=1) = P_i(s_i=0). \quad (IV.149)
\]
This cost function may differ according to one's preferences. For example, if we wish to penalize the use of the "bad" channel, we may employ the complement of the cost function (IV.147). A more general cost function is
\[
\gamma(a_i,b_{i-1}) \triangleq \begin{cases} \gamma_0 & \text{if}\ a_i = b_{i-1}\ (s_i = 0) \\ 1-\gamma_0 & \text{if}\ a_i \neq b_{i-1}\ (s_i = 1) \end{cases}, \quad i = 0,\ldots,n, \quad (IV.150)
\]
where $\gamma_0 \in [0,1]$. However, the binary form of the transmission cost does not restrict the problem, since the average cost is a linear functional, and it can easily be upgraded to more complex forms without affecting the proposed methodology.

Additional observations regarding the above formulation are given in the following remark.

Remark IV.1. (Cost function)
(1) If only the good channel is used, that is, $P_i(s_i=0) = 1$, then the capacity of the BSSC is equal to zero, because $P(s_i=0) = 1$ corresponds to the channel input $a_i = b_{i-1}$, a deterministic function, for $i = 0,1,\ldots,n$ (this also follows from $I(A_i;B_i|B_{i-1})\big|_{A_i=B_{i-1}} = 0$, $i = 0,1,\ldots,n$). The capacity of the BSSC is also equal to zero if only the bad channel is used, $P_i(s_i=1) = 1$, for $i = 0,1,\ldots,n$.
(2) It is shown shortly that the optimal channel input distribution that achieves the unconstrained capacity of the BSSC corresponds to a fixed occupation of the two states. Upon introducing the transmission cost constraint, one is not allowed to use the state corresponding to the good channel beyond a certain threshold, because the overall cost of transmission needs to be satisfied.
(3) If $\beta > \alpha \geq 0.5$, then we reverse the transmission cost, while if $\alpha$ and $\beta$ are both less than $0.5$, then we flip the corresponding channel input probabilities.

A. Capacity of the BSSC with feedback
In this section, we apply Theorem III.1 and Theorem III.2 to calculate closed form expressions for the capacity achieving channel input distribution and the corresponding channel output distribution, and to show that these are time-invariant. Further, we employ these theorems to calculate the feedback capacity with and without cost constraints.
1) Feedback capacity of the BSSC without transmission cost:
In the next theorem, we show that the feedback capacity of the BSSC without cost constraint is given by a single letter expression, and that the optimal input distribution is time-invariant.
Theorem IV.1. (Feedback capacity and time-invariant property of the optimal distributions) Consider the BSSC defined by (IV.144) with feedback and without transmission cost. Then the following hold.
(a)
The capacity achieving channel input distribution and the corresponding channel output distribution which maximize the FTFI capacity, $C^{FB,BSSC}_{A^n\to B^n}$, are time-invariant and given by the following expressions.
\[
\pi^*_i(a_i|b_{i-1}) = \pi^{TI}(a_i|b_{i-1}) = \begin{pmatrix} \nu & 1-\nu \\ 1-\nu & \nu \end{pmatrix}, \quad \forall i \in \{0,1,\ldots,n\}, \quad (IV.151)
\]
\[
P^{\pi^*}_i(b_i|b_{i-1}) = P^{TI}(b_i|b_{i-1}) = \begin{pmatrix} \lambda & 1-\lambda \\ 1-\lambda & \lambda \end{pmatrix}, \quad \forall i \in \{0,1,\ldots,n\}, \quad (IV.152)
\]
where
\[
\lambda = \frac{1}{1+2^{\mu}}, \quad \mu = \frac{H(\beta)-H(\alpha)}{1-\alpha-\beta}, \quad \nu = \frac{1-(1-\beta)(1+2^{\mu})}{(\alpha+\beta-1)(1+2^{\mu})}. \quad (IV.153)
\]
Moreover,
\[
C^{FB,BSSC}_{A^n\to B^n} = (n+1) \max_{\pi^{TI}(a_0|b_{-1})} I(A_0;B_0|B_{-1}=b_{-1}), \quad \forall b_{-1} \in \{0,1\}, \quad (IV.154)
\]
\[
= (n+1)\,[H(\lambda) - \nu H(\alpha) - (1-\nu)H(\beta)]. \quad (IV.155)
\]
(b) The feedback capacity is given by
\[
C^{FB,BSSC}_{A^\infty\to B^\infty} = \max_{\pi^{TI}(a_0|b_{-1})} I(A_0;B_0|B_{-1}=b_{-1}), \quad \forall b_{-1} \in \{0,1\}, \quad (IV.156)
\]
\[
= H(\lambda) - \nu H(\alpha) - (1-\nu)H(\beta), \quad (IV.157)
\]
and it is independent of the initial state.

Proof: The proof of Theorem IV.1 is given in Appendix C.

Theorem IV.1, specifically (IV.154), illustrates the non-nested and time-invariant property, which gives a direct connection between the BSSC and memoryless channels. Note that these properties hold due to the "symmetric" form of the BSSC. As we show at the end of the current section via simulations, the time-invariant property does not hold for a general Binary Unit Memory Channel (BUMC).

The BSSC without cost constraint is equivalent to the POST channel investigated in [17]. The authors in [17] derived an expression for feedback capacity, which is equivalent to (IV.156), by using the convex hull theorem. Theorem IV.1 complements the results in [17] in the sense that it provides closed form expressions for the capacity achieving distribution and the corresponding optimal channel output conditional distribution. More importantly, it shows that these distributions are time-invariant and correspond to the non-nested optimization problem (IV.154), which is directly analogous to Shannon's single letter capacity formulae of memoryless channels.

The structure of the expression (IV.157) provides insight on how the occupancy of the two states affects the capacity. Recall that the state of the channel defines which of the two binary symmetric channels is in use at each time instant. Since $P_{S_i}(0) = P_{A_i,B_{i-1}}(0,0) + P_{A_i,B_{i-1}}(1,1)$, by substituting the capacity achieving input distribution we have $P_{S_i}(0) = \nu$. Thus, the optimal occupancy, or equivalently the optimal time sharing, among the two binary symmetric channels BSC($1-\alpha$) and BSC($1-\beta$) is given by $\nu$, which is a function of the channel parameters $\alpha$ and $\beta$. This interpretation is evident in the feedback capacity expression (IV.157), which is similar to the capacity of the memoryless binary symmetric channel. However, for the BSSC the maximizing output process corresponds to a time-invariant, first order, doubly stochastic Markov process.
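The closed form (IV.153), (IV.157) is immediate to evaluate; the following minimal sketch uses the algebraically equivalent form $\nu = (\lambda - (1-\beta))/(\alpha+\beta-1)$ and assumes $\alpha + \beta \neq 1$:

```python
import numpy as np

def H(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bssc_feedback_capacity(alpha, beta):
    """(IV.153) and (IV.157); assumes alpha + beta != 1."""
    mu = (H(beta) - H(alpha)) / (1 - alpha - beta)
    lam = 1.0 / (1.0 + 2.0 ** mu)
    nu = (lam - (1 - beta)) / (alpha + beta - 1)   # equivalent to nu of (IV.153)
    return H(lam) - nu * H(alpha) - (1 - nu) * H(beta), lam, nu

# Sanity check against Section IV-C: alpha = 1, beta = 0.5 gives
# lam = 0.8, nu = 0.6, C = log2(5/4) ~ 0.3219 bits.
print(bssc_feedback_capacity(1.0, 0.5))
```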
2) Feedback capacity of the BSSC with transmission cost:
Next, we consider the BSSC with the transmission cost constraint defined by (IV.148). Since $C^{FB,BSSC}_{A^n\to B^n}(\kappa)$ is a convex optimization problem, the optimal channel input conditional distribution occurs on the boundary of the constraint; i.e., for $\kappa \geq \kappa_{\max}$, $C^{FB,BSSC}_{A^n\to B^n}(\kappa)$ is constant and equal to the unconstrained capacity given in Theorem IV.1.

Theorem IV.2. Consider the BSSC defined by (IV.144) with feedback and the transmission cost constraint defined by (IV.148). Then the following hold.
(a) The optimal channel input distribution which corresponds to $C^{FB,BSSC}_{A^n\to B^n}(\kappa)$ and the corresponding optimal output distribution are time-invariant and given by
\[
\pi^*_i(a_i|b_{i-1}) = \pi^{TI}(a_i|b_{i-1}) = \begin{pmatrix} \kappa & 1-\kappa \\ 1-\kappa & \kappa \end{pmatrix}, \quad \forall i = 0,1,\ldots,n, \quad (IV.158)
\]
\[
P^{\pi^*}_i(b_i|b_{i-1}) = P^{TI}(b_i|b_{i-1}) = \begin{pmatrix} \bar\lambda & 1-\bar\lambda \\ 1-\bar\lambda & \bar\lambda \end{pmatrix}, \quad \forall i = 0,1,\ldots,n, \quad (IV.159)
\]
where
\[
\bar\lambda = \alpha\kappa + (1-\kappa)(1-\beta). \quad (IV.160)
\]
Moreover,
\[
C^{FB,BSSC}_{A^n\to B^n}(\kappa) = (n+1) \max_{\pi^{TI}(a_0|b_{-1}):\ \mathbf{E}\{\gamma(A_i,B_{i-1})\}\leq\kappa} I(A_0;B_0|B_{-1}=b_{-1}), \quad \forall b_{-1} \in \{0,1\}. \quad (IV.161)
\]
(b) The feedback capacity is given by
\[
C^{FB,BSSC}_{A^\infty\to B^\infty}(\kappa) = \begin{cases} H(\bar\lambda) - \kappa H(\alpha) - (1-\kappa)H(\beta) & \text{if}\ \kappa \leq \kappa_{\max} \\ H(\lambda) - \kappa_{\max} H(\alpha) - (1-\kappa_{\max})H(\beta) & \text{if}\ \kappa > \kappa_{\max} \end{cases} \quad (IV.162)
\]
where $\kappa_{\max}$ is equal to $\nu$ defined by (IV.153).

Proof: The proof is similar to the proof of Theorem IV.1 and is given in Appendix D.
The unconstrained and constrained feedback capacities of the BSSC are depicted in Figure IV.2. In particular, Figure IV.2a depicts the unconstrained capacity of the BSSC for all values of the parameters $\alpha, \beta \in [0,1]$. Figure IV.2b depicts how the transmission cost affects the capacity of the BSSC for all values of the parameters $\alpha, \beta \in [0,1]$, and for three different choices of $\kappa$. The inner plot corresponds to the unconstrained case ($\kappa = \kappa_{\max}$).

[Fig. IV.2: Capacity of BSSC with feedback. (a) Unconstrained capacity. (b) Constrained capacity.]
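The two-branch formula (IV.162) evaluates directly; a minimal sketch (same binary entropy as above, with $\kappa$ clipped at $\kappa_{\max} = \nu$):

```python
import numpy as np

def H(p):
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bssc_feedback_capacity_cost(alpha, beta, kappa):
    """Constrained feedback capacity (IV.162), with kappa_max = nu of (IV.153)."""
    mu = (H(beta) - H(alpha)) / (1 - alpha - beta)
    lam = 1.0 / (1.0 + 2.0 ** mu)
    kappa_max = (lam - (1 - beta)) / (alpha + beta - 1)
    k = min(kappa, kappa_max)                      # constraint inactive above kappa_max
    lam_bar = alpha * k + (1 - k) * (1 - beta)     # (IV.160)
    return H(lam_bar) - k * H(alpha) - (1 - k) * H(beta)
```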
3) Error exponents for the BSSC with feedback:
In this section we apply the results of Section III-D to the BSSC, and we evaluate the error exponent and the probability of error for the capacity achieving input distribution with feedback, denoted by $\pi^{TI}$ and defined by (IV.151).

It is straightforward to verify that, evaluating (III.133) at the capacity achieving input distribution defined by (IV.151), this term is independent of the initial state of the channel, and is given by
\[
E_{\pi^{TI},n}(\rho,b_{-1}) \equiv E_{\pi^{TI},n}(\rho). \quad (IV.163)
\]
Consequently, the upper bound on the probability of error is also independent of the initial state, and is given by
\[
P^{(n)}_{e,m} \leq |\mathcal{B}|\, 2^{-n[-\rho R + F_n(\rho)]}, \quad \forall m \in \mathcal{M}_n,\ 0 \leq \rho \leq 1. \quad (IV.164)
\]
Moreover, since the capacity achieving distribution is time-invariant, $\Lambda^{\pi}_i(s_i,s_{i-1}) = \Lambda^{\pi^{TI}}(s_i,s_{i-1})$, $i = 0,1,\ldots,n$. Then, by substituting the time-invariant capacity achieving distribution and the channel distribution in (III.139), we obtain
\[
\Lambda^{\pi^{TI}}(0,0) = \Lambda^{\pi^{TI}}(1,1) = \Big[ \nu\,\alpha^{\frac{1}{1+\rho}} + (1-\nu)(1-\beta)^{\frac{1}{1+\rho}} \Big]^{1+\rho}, \quad (IV.165)
\]
\[
\Lambda^{\pi^{TI}}(1,0) = \Lambda^{\pi^{TI}}(0,1) = \Big[ \nu\,(1-\alpha)^{\frac{1}{1+\rho}} + (1-\nu)\beta^{\frac{1}{1+\rho}} \Big]^{1+\rho}. \quad (IV.166)
\]
The largest eigenvalue of the resulting $2\times 2$ symmetric matrix is
\[
\lambda^{\pi^{TI}}_{\max}(\rho) = \Lambda^{\pi^{TI}}(0,0) + \Lambda^{\pi^{TI}}(0,1) = \Big[ \nu\,\alpha^{\frac{1}{1+\rho}} + (1-\nu)(1-\beta)^{\frac{1}{1+\rho}} \Big]^{1+\rho} + \Big[ \nu\,(1-\alpha)^{\frac{1}{1+\rho}} + (1-\nu)\beta^{\frac{1}{1+\rho}} \Big]^{1+\rho}, \quad (IV.167)
\]
and the corresponding positive eigenvector has equal components, so that
\[
\frac{v_{\max}}{v_{\min}} = 1. \quad (IV.168)
\]
Substituting (IV.167) and (IV.168) in (III.143), we obtain
\[
E_{\pi^{TI}}(\rho) = -\log \lambda^{\pi^{TI}}_{\max}(\rho), \quad (IV.169)
\]
\[
F_\infty(\rho) \triangleq \lim_{n\to\infty} F_n(\rho) = -\log \lambda^{\pi^{TI}}_{\max}(\rho), \quad (IV.170)
\]
and the error exponent
\[
E^{\pi^{TI}}_r(R) \triangleq \max_{0\leq\rho\leq 1} \{ F_\infty(\rho) - \rho R \}. \quad (IV.171)
\]
Hence, the probability of error is bounded by
\[
P^{(n)}_{e,m} \leq |\mathcal{B}|\, 2^{-n\big[-\rho R - \log \lambda^{\pi^{TI}}_{\max}(\rho)\big]}. \quad (IV.172)
\]
Better bounds can be obtained if both the encoder and the decoder know the initial state of the channel; in this case the cardinality of the state, $|\mathcal{B}|$, is omitted from (IV.172) [25, Problem 5.37]. The error exponent and the probability of error, optimized with respect to $\rho$, are given in Fig. IV.3. Obviously, even better bounds can be obtained by optimizing with respect to the channel input distribution. However, even for DMCs, the error exponent analogous to (IV.171) is often evaluated at the capacity achieving distribution of the ergodic capacity.

[Fig. IV.3: Error exponent and probability of error for the BSSC; the probability of error is shown for rates $R = 0.5C$, $0.9C$, $0.99C$.]
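Combining (IV.165)-(IV.171), the error exponent of the BSSC reduces to a one-dimensional optimization over $\rho$, which a short sketch can carry out by grid search (our discretization choice):

```python
import numpy as np

def H(p):
    return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def lam_max(rho, alpha, beta, nu):
    """Largest eigenvalue (IV.167) of the symmetric matrix (IV.165)-(IV.166)."""
    e = 1.0 / (1.0 + rho)
    same = (nu * alpha ** e + (1 - nu) * (1 - beta) ** e) ** (1 + rho)
    diff = (nu * (1 - alpha) ** e + (1 - nu) * beta ** e) ** (1 + rho)
    return same + diff

def error_exponent(R, alpha, beta):
    """E_r(R) of (IV.171), with nu from (IV.153); grid search over rho."""
    mu = (H(beta) - H(alpha)) / (1 - alpha - beta)
    nu = (1.0 / (1.0 + 2.0 ** mu) - (1 - beta)) / (alpha + beta - 1)
    return max(-np.log2(lam_max(r, alpha, beta, nu)) - r * R
               for r in np.linspace(0.0, 1.0, 1001))
```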
B. Capacity without feedback of the BSSC

In this section we apply Theorem II.2 to show that the feedback capacity of the BSSC is achieved by a time-invariant first order Markov channel input distribution without feedback.
Theorem IV.3. (Capacity of BSSC without feedback, with and without transmission cost) Consider the BSSC defined by (IV.144) without feedback. Then the following hold.
(a)
For a channel with the transmission cost constraint defined by (IV.148), the optimal channel input distribution which corresponds to $C^{noFB,BSSC}_{A^n;B^n}(\kappa)$ is time-invariant first-order Markov, and is given by
\[
P^{noFB,*}_i(a_i|a_{i-1}) = \pi^{noFB,TI}(a_i|a_{i-1}) = \begin{pmatrix} \dfrac{1-\kappa-\sigma}{1-2\sigma} & \dfrac{\kappa-\sigma}{1-2\sigma} \\[2mm] \dfrac{\kappa-\sigma}{1-2\sigma} & \dfrac{1-\kappa-\sigma}{1-2\sigma} \end{pmatrix}, \quad i = 0,1,\ldots,n, \quad (IV.173)
\]
where $\sigma = \alpha\kappa + \beta(1-\kappa)$. Moreover, (IV.173) induces the optimal channel input and channel output distributions $\pi^{TI}(a_i|b_{i-1})$ and $P^{TI}(b_i|b_{i-1})$ of the BSSC with feedback and transmission cost.
(b) For a channel without transmission cost, (a) holds with $\kappa = \kappa^*$ and $\sigma = \sigma^* = \alpha\kappa^* + \beta(1-\kappa^*)$.
(c) The capacity of the BSSC without feedback and transmission cost is given by
\[
C^{noFB,BSSC}_{A^n;B^n} = (n+1) \max_{\pi^{noFB,TI}(a_1|a_0)} I(A_1;B_1|B_0) = (n+1)\, C^{FB,BSSC}_{A^\infty\to B^\infty}, \quad (IV.174)
\]
and similarly if there is a transmission cost.

Proof: (a) By applying Theorem II.2, it suffices to show that there exists an input distribution without feedback which induces the capacity achieving channel input distribution with feedback, $\pi^*_i(a_i|b_{i-1})$. For the BSSC, it is clear that if any input distribution without feedback induces $\pi^*_i(a_i|b_{i-1}) = \pi^{TI}(a_i|b_{i-1})$ given by (IV.158), then it also induces the optimal output process $P^{\pi^{noFB,TI},*}_i(b_i|b_{i-1}) = P^{TI}(b_i|b_{i-1})$ given by (IV.159), since
\[
P^{TI}(b_i|b_{i-1}) = \sum_{a_i\in\{0,1\}} P(b_i|a_i,b_{i-1})\, \pi^{TI}(a_i|b_{i-1}). \quad (IV.175)
\]
Suppose the distribution of the initial state $b_{-1}$ is given by the stationary distribution of the output process, that is, $P_{b_{-1}}(0) = P_{b_{-1}}(1) = 0.5$. Then, we show by induction that there exists a time-invariant, first order Markov channel input distribution without feedback that induces the time-invariant channel input distribution with feedback.

For $i = 0$, the optimal channel input distribution without feedback is equal to the optimal channel input distribution with feedback, that is, $\pi^{noFB}_0(a_0|b_{-1}) = \pi^{TI}(a_0|b_{-1})$, given by (IV.158), since $b_{-1}$ is the initial state known at the encoder. Therefore, the corresponding channel output distribution with feedback, $P^{\pi^*}(b_0|b_{-1})$, is induced, and since $P^{\pi^*}(b_0|b_{-1}) = P^{TI}(b_0|b_{-1})$ is doubly stochastic, $P^*(b_0=0) = P^*(b_0=1) = 0.5$.

For $i = 1$, the following identities hold in general.
\[
P(a_1|b_0) = \sum_{a_0\in\{0,1\}} P(a_1|a_0,b_0)\, P(a_0|b_0) = \sum_{a_0\in\{0,1\}} P(a_1|a_0,b_0) \frac{P(b_0,a_0)}{P(b_0)} = \sum_{a_0\in\{0,1\}} P(a_1|a_0,b_0) \frac{1}{P(b_0)} \sum_{b_{-1}\in\{0,1\}} P(b_0|a_0,b_{-1})\, P(a_0|b_{-1})\, P(b_{-1}). \quad (IV.176)
\]
Next, using (IV.176), we investigate whether there exists a first order Markov channel input distribution without feedback, $P(a_1|a_0,b_0) = \pi^{noFB}(a_1|a_0)$, which induces the time-invariant capacity achieving input distribution with feedback, $\pi^{TI}(a_1|b_0)$, given by (IV.158). Therefore, we need to determine whether the following identity holds for some $\pi^{noFB}(a_1|a_0)$. From (IV.176),
\[
\pi^{TI}(a_1|b_0) \overset{?}{=} \sum_{a_0\in\{0,1\}} \pi^{noFB}(a_1|a_0) \frac{1}{P^*(b_0)} \sum_{b_{-1}\in\{0,1\}} P(b_0|a_0,b_{-1})\, \pi^{TI}(a_0|b_{-1})\, P(b_{-1}). \quad (IV.177)
\]
Note that $P(a_0|b_{-1}) = \pi^{TI}(a_0|b_{-1})$ and $P(b_0) = P^*(b_0)$ hold due to step $i = 0$. Solving the system of resulting equations yields that there exists a channel input distribution without feedback, defined by (IV.173), that induces $\pi^{TI}(a_1|b_0)$. Therefore, it also induces the time-invariant optimal output distribution, $P^{\pi^*}_1(b_1|b_0) = P^{TI}(b_1|b_0)$ given by (IV.159), and its corresponding optimal marginal distribution $P^*(b_1)$.

Next, suppose that up to time $i = j-1$, the first order Markov input distribution defined by (IV.173) induces the time-invariant capacity achieving distribution with feedback, $\{\pi^{TI}(a_i|b_{i-1}) : i = 0,1,\ldots,j-1\}$, given by (IV.158), and therefore it induces $\{P^{TI}(b_i|b_{i-1}) : i = 0,1,\ldots,j-1\}$ given by (IV.159) and its corresponding optimal marginal distribution $\{P^*(b_i) : i = 0,1,\ldots,j-1\}$. Then, at time $i = j$, the following identity holds.
\[
P_j(a_j|b_{j-1}) = \sum_{a_{j-1},b_{j-2}} P_j(a_j|a_{j-1},b_{j-1})\, P_{j-1}(a_{j-1},b_{j-2}|b_{j-1}) = \sum_{a_{j-1},b_{j-2}} P_j(a_j|a_{j-1},b_{j-1}) \frac{1}{P^*(b_{j-1})}\, P(b_{j-1}|a_{j-1},b_{j-2})\, \pi^{TI}(a_{j-1}|b_{j-2})\, P^*(b_{j-2}). \quad (IV.178)
\]
The last equality holds since the distributions $P^*(b_{j-2})$ and $\pi^{TI}(a_{j-1}|b_{j-2})$ were induced from the previous steps $i = 0,1,\ldots,j-1$. Subsequently, we investigate whether there exists a first order Markov channel input distribution, $P_j(a_j|a_{j-1},b_{j-1}) = \pi^{noFB}_j(a_j|a_{j-1})$, that satisfies (IV.178). That is,
\[
\pi^{TI}(a_j|b_{j-1}) \overset{?}{=} \sum_{a_{j-1}} \pi^{noFB}_j(a_j|a_{j-1}) \frac{1}{P^*(b_{j-1})} \sum_{b_{j-2}} P(b_{j-1}|a_{j-1},b_{j-2})\, \pi^{TI}(a_{j-1}|b_{j-2})\, P^*(b_{j-2}). \quad (IV.179)
\]
Solving the system of equations resulting from (IV.179) yields the time-invariant first order Markov input distribution defined by (IV.173). Since the time-invariant first order Markov channel input distribution without feedback defined by (IV.173) induces the optimal channel input distribution with feedback for all $i = 0,1,\ldots,j$, it is the time-invariant capacity achieving input distribution without feedback.

(b) Holds since for the BSSC without transmission cost $\kappa = \kappa^*$, and therefore $\sigma = \sigma^* = \alpha\kappa^* + \beta(1-\kappa^*)$.

(c) Since $\{\pi^{noFB,TI}(a_i|a_{i-1}) \equiv \pi^{noFB,TI}(a_1|a_0) : i = 0,1,\ldots,n\}$ induces $\{\pi^{TI}(a_i|b_{i-1}) : i = 0,1,\ldots,n\}$ given by (IV.158) and $\{P^{TI}(b_i|b_{i-1}) : i = 0,1,\ldots,n\}$ given by (IV.159), the channel capacity without feedback and transmission cost is given by (IV.174). Similarly, for the constrained capacity we have $C^{noFB,BSSC}_{A^n;B^n}(\kappa) = C^{FB,BSSC}_{A^\infty\to B^\infty}(\kappa)$.
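The induction step (IV.177) can also be checked numerically: the no-feedback distribution (IV.173) must reproduce $\pi^{TI}$ of (IV.158) exactly when the initial state is stationary. A sketch with illustrative parameter values of our own choosing ($\alpha = 0.9$, $\beta = 0.8$, $\kappa = 0.4$):

```python
import numpy as np

alpha, beta, kappa = 0.9, 0.8, 0.4          # illustrative parameter choices
sigma = alpha * kappa + beta * (1 - kappa)

piTI = np.array([[kappa, 1 - kappa],        # pi^TI[a, b_prev], (IV.158)
                 [1 - kappa, kappa]])
piNF = np.array([[1 - kappa - sigma, kappa - sigma],
                 [kappa - sigma, 1 - kappa - sigma]]) / (1 - 2 * sigma)   # (IV.173)

P = np.zeros((2, 2, 2))                     # P[b, b_prev, a] of (IV.144)
for bp in (0, 1):
    for a in (0, 1):
        pk = alpha if a == bp else beta
        P[a, bp, a], P[1 - a, bp, a] = pk, 1.0 - pk

# Stationary initial state: P(b_{-1}) = 1/2, hence P*(b_0) = 1/2 (doubly stochastic).
induced = np.zeros((2, 2))
for b0 in (0, 1):
    for a1 in (0, 1):
        induced[a1, b0] = sum(piNF[a1, a0] * 2.0 *
                              sum(P[b0, bm, a0] * piTI[a0, bm] * 0.5 for bm in (0, 1))
                              for a0 in (0, 1))

print(np.allclose(induced, piTI))           # True: (IV.173) induces (IV.158)
```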
C. Special cases of the BSSC

1) Memoryless BSSC ($\alpha = \beta = 1-\varepsilon$, $\varepsilon \neq 0.5$): Consider the trivial case where $\alpha = \beta \triangleq 1-\varepsilon$. Then, given the state $s_i = a_i \oplus b_{i-1}$, the BSSC degenerates to the Discrete Memoryless Binary Symmetric Channel (DM-BSC) with crossover probability $\varepsilon$. By employing (IV.151)-(IV.156) and (IV.173), we obtain $\mu = 0$ and $\nu = \lambda = 0.5$; the capacity achieving input distribution and the corresponding output distribution are memoryless and uniformly distributed, and the capacity expression reduces to
\[
C^{DM\text{-}BSC} = H\big((1-\varepsilon)\,0.5 + \varepsilon\,0.5\big) - 0.5\,H(\varepsilon) - 0.5\,H(\varepsilon) = 1 - H(\varepsilon).
\]
These are the known results for the memoryless BSC.

2) Best and Worst BSSC ($\alpha = 1$, $\beta = 0.5$): Consider the case $\alpha = 1$, $\beta = 0.5$. This channel decomposes into a noiseless BSC with crossover probability $0$ if $s_i = a_i \oplus b_{i-1} = 0$, and a noisy BSC with crossover probability $0.5$ if $s_i = a_i \oplus b_{i-1} = 1$. By invoking (IV.151)-(IV.156), we obtain $\nu = 0.6$ and $\lambda = 0.8$, and the channel capacity is equal to
\[
C^{FB,BSSC}\big|_{\alpha=1,\beta=0.5} = C^{noFB,BSSC}\big|_{\alpha=1,\beta=0.5} = H(0.8) - 0.6\,H(1) - 0.4\,H(0.5) = \log_2(5/4) \approx 0.3219.
\]
The optimal input distributions with and without feedback are
\[
\pi^{TI}(a_i|b_{i-1}) = \begin{pmatrix} 0.6 & 0.4 \\ 0.4 & 0.6 \end{pmatrix}, \quad \pi^{noFB,TI}(a_i|a_{i-1}) = \begin{pmatrix} 2/3 & 1/3 \\ 1/3 & 2/3 \end{pmatrix},
\]
and the optimal channel output distribution for both is given by
\[
P^{TI}(b_i|b_{i-1}) = \begin{pmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{pmatrix}.
\]
This completes the analysis of the degenerate BSSC.

[Fig. IV.4: Unconstrained BIBO-UMCO channel. (a) Value function $V(b_{i-1})$, $b_{i-1} \in \{0,1\}$, versus time horizon. (b) Optimal input and output distributions versus time horizon.]
D. Capacity of the Binary Input Binary Output Unit Memory Channel Output (BIBO-UMCO) channel with feedback

[Fig. IV.5: Constrained BIBO-UMCO channel. (a) Value function $V(b_{i-1})$, $b_{i-1} \in \{0,1\}$, versus time horizon. (b) Optimal input and output distributions versus time horizon.]

In this section, we evaluate the feedback capacity of the BIBO-UMCO channel, defined by
\[
P(b_i|b_{i-1},a_i) =
\begin{array}{c|cccc}
b_i \setminus (a_i,b_{i-1}) & 00 & 01 & 10 & 11 \\ \hline
0 & \alpha_{00} & \alpha_{01} & \alpha_{10} & \alpha_{11} \\
1 & 1-\alpha_{00} & 1-\alpha_{01} & 1-\alpha_{10} & 1-\alpha_{11}
\end{array}, \quad i = 0,\ldots,n, \quad (IV.180)
\]
with and without transmission cost. In addition, we calculate the capacity achieving input distributions with feedback and the respective optimal output distributions.
1) Without cost constraint:
Consider the BIBO-UMCO channel (IV.180) with the fixed parameters $\alpha_{00}, \alpha_{01}, \alpha_{10}, \alpha_{11}$ indicated in Fig. IV.4. By employing the dynamic programming equations (III.67)-(III.68), the convergence of the value functions without transmission cost, and the convergence of the optimal input distributions with feedback and of the corresponding output distributions, are depicted in Figures IV.4a and IV.4b, respectively. To characterize the feedback capacity and the capacity achieving input distribution of the BIBO-UMCO channel, we employ Algorithm 1, which yields $\pi^\infty(a_i = 0|b_{i-1} = 0) = 0.626$, $\pi^\infty(a_i = 1|b_{i-1} = 0) = 0.374$, and
\[
C^{FB,BIBO\text{-}UMCO} = 0.215 \ \text{bits per channel use}.
\]
2) With cost constraint: Consider the BIBO-UMCO channel (IV.180) with the fixed parameters $\alpha_{00}, \alpha_{01}, \alpha_{10}, \alpha_{11}$ and transmission cost $\kappa$ indicated in Fig. IV.5. The convergence of the value functions and of the optimal input and output distributions with feedback is depicted in Figures IV.5a and IV.5b, respectively.

V. CONCLUSIONS
We apply the dynamic programming recursions, and the necessary and sufficient conditions for any channel input conditional distribution to achieve capacity, to identify necessary and sufficient conditions such that the nested optimization problem $C^{FB}_{A^n\to B^n}$ reduces to a non-nested optimization problem. This gives rise to the single letter characterization of feedback capacity. The methodology can be easily generalized to channels that have finite memory on the previous outputs.

These results are applied to the BSSC with feedback, with and without cost constraint, to calculate the feedback capacity, the capacity achieving input distribution, and the corresponding output distribution. One of the fascinating results is that feedback capacity is characterized by a single letter expression that is precisely analogous to the single letter characterization of capacity of DMCs. Additionally, we show that a first order Markov channel input distribution without feedback achieves feedback capacity. We also derive an upper bound on the error probability of maximum likelihood decoding.

APPENDIX A
PROOF OF LEMMA III.1

We can re-write (III.110), by adding and subtracting multiples of $J^*$, as follows.
\[
\big(\widetilde{V}_t(b_{-1}) - tJ^*\big) + tJ^* = \sup_{\pi^\infty(\cdot|b_{-1})} \Big\{ \sum_a \Big\{ \sum_b \log\Big( \frac{P(b|b_{-1},a)}{P^{\pi^\infty}(b|b_{-1})} \Big) P(b|b_{-1},a) + \sum_b \Big( \big(\widetilde{V}_{t-1}(b) - (t-1)J^*\big) + (t-1)J^* \Big) P(b|b_{-1},a) \Big\} \pi^\infty(a|b_{-1}) \Big\}. \quad (A.181)\text{-}(A.182)
\]
Assumptions III.2 imply that
\[
\lim_{t\to\infty} \frac{1}{t}\,\widetilde{V}_t(b_{-1}) = J^*, \quad \forall b_{-1} \in \mathcal{B}, \quad (A.183)
\]
and that the limit does not depend on $b_{-1} \in \mathcal{B}$. Moreover, under Assumptions III.2 and (A.183), moving the constant $(t-1)J^*$ out of the inner sum in (A.181)-(A.182), subtracting $tJ^*$ from both sides, and taking the limit of both sides, the following dynamic programming equation is obtained.
\[
J^* + V(b_{-1}) \overset{(a)}{=} \lim_{t\to\infty} \sup_{\pi^\infty(\cdot|b_{-1})} \Big\{ \sum_a \Big\{ \sum_b \log\Big( \frac{P(b|b_{-1},a)}{P^{\pi^\infty}(b|b_{-1})} \Big) P(b|b_{-1},a) + \sum_b \big( \widetilde{V}_{t-1}(b) - (t-1)J^* \big) P(b|b_{-1},a) \Big\} \pi^\infty(a|b_{-1}) \Big\}, \quad (A.184)\text{-}(A.186)
\]
where (a) is due to (A.181)-(A.182). Since the channel input and output alphabet spaces are at most countable, we can interchange the limit and the maximization operations, to obtain the dynamic programming equation (III.112).

APPENDIX B
PROOF OF THEOREM III.4

For any $\{\pi^\infty(a_i|b_{i-1}) : i = 0,\ldots,n\}$, (III.104) is expressed as follows.
\[
J(\pi^\infty,\mu) = \liminf_{n\to\infty} \frac{1}{n}\, \mathbf{E}^{\pi^\infty}_{\mu} \Big\{ \sum_{i=0}^{n-1} \ell(b_{i-1},a_i) \Big\} = \liminf_{n\to\infty} \frac{1}{n}\, \mathbf{E}^{\pi^\infty}_{\mu} \Big\{ \sum_{i=0}^{n-1} \ell(b_{i-1},\pi(b_{i-1})) \Big\}, \quad \forall \mu(b_{-1}) \in \mathcal{M}(\mathcal{B}), \quad (B.187)
\]
\[
= \liminf_{n\to\infty} \frac{1}{n}\, \mu^T \Big( \sum_{i=0}^{n-1} P(\pi^\infty)^i \Big) \ell(\pi^\infty). \quad (B.188)
\]
Following [31], it can be shown that the above limit exists, but it may depend on the distribution $\mu(\cdot)$ of $B_{-1}$. However, if $P(\pi^\infty)$ is irreducible, then
\[
J(\pi^\infty,\mu) = \mu^T P^\infty(\pi^\infty)\, \ell(\pi^\infty) = \nu(\pi^\infty)^T \ell(\pi^\infty), \quad (B.189)
\]
where $P^\infty(\pi^\infty)$ is the limiting matrix (this follows from the Cesaro limit), and $\nu(\pi^\infty)$ is the unique invariant probability distribution, which satisfies $P(\pi^\infty)\nu(\pi^\infty) = \nu(\pi^\infty)$. From (B.189), it follows that $J(\pi^\infty,\mu) \equiv J(\pi^\infty)$, that is, it does not depend on the initial distribution $\mu$ of $B_{-1}$.
It can be shown that, if for all stationary Markov channel input distributions $\pi^\infty$ the transition matrix $P(\pi^\infty)$ is irreducible, there exists a solution $V : \mathcal{B} \mapsto \mathbb{R}$, identified with a vector in $\mathbb{R}^{|\mathcal{B}|}$, and $J \in \mathbb{R}$, which satisfies (III.124).

APPENDIX C
PROOF OF THEOREM IV.1

(a) First, we employ the necessary and sufficient conditions of Theorem III.1 to calculate the optimal input and output distributions and the value function at the terminal time. To show the time-invariant property, it is sufficient to prove that the value function at the terminal time, $V_n(b_{n-1})$, is independent of $b_{n-1}$ (part (b) of Theorem III.2). By Theorem III.1, we have
\[
V_n(b_{n-1}) = \sum_{b_n} \log\Big( \frac{P_n(b_n|a_n,b_{n-1})}{P^{\pi}_n(b_n|b_{n-1})} \Big) P(b_n|a_n,b_{n-1}), \quad \forall a_n \in \mathcal{A} \ \text{if}\ \pi_n(a_n|b_{n-1}) \neq 0. \quad (C.190)
\]
For $b_{n-1} = 0$, $a_n = 0$, we obtain
\[
V_n(b_{n-1}{=}0) = \sum_{b_n} \log\Big( \frac{P_n(b_n|a_n{=}0,b_{n-1}{=}0)}{P^{\pi}_n(b_n|b_{n-1}{=}0)} \Big) P(b_n|a_n{=}0,b_{n-1}{=}0) = \alpha \log \frac{1 - P^{\pi}_n(b_n{=}0|b_{n-1}{=}0)}{P^{\pi}_n(b_n{=}0|b_{n-1}{=}0)} + \log \frac{1}{1 - P^{\pi}_n(b_n{=}0|b_{n-1}{=}0)} - H(\alpha). \quad (C.191)
\]
For $b_{n-1} = 0$, $a_n = 1$, we obtain
\[
V_n(b_{n-1}{=}0) = \sum_{b_n} \log\Big( \frac{P_n(b_n|a_n{=}1,b_{n-1}{=}0)}{P^{\pi}_n(b_n|b_{n-1}{=}0)} \Big) P(b_n|a_n{=}1,b_{n-1}{=}0) = (1-\beta) \log \frac{1 - P^{\pi}_n(b_n{=}0|b_{n-1}{=}0)}{P^{\pi}_n(b_n{=}0|b_{n-1}{=}0)} + \log \frac{1}{1 - P^{\pi}_n(b_n{=}0|b_{n-1}{=}0)} - H(\beta). \quad (C.192)
\]
By (C.190), we equate (C.191) and (C.192) to deduce
\[
P^{\pi}_n(b_n{=}0|b_{n-1}{=}0) = \lambda = \frac{1}{1+2^{\mu}}, \quad (C.193)
\]
where $\lambda$ and $\mu$ are given in (IV.153). We repeat the above procedure for the pairs $b_{n-1}=1$, $a_n=0$ and $b_{n-1}=1$, $a_n=1$, to deduce
\[
P^{\pi}_n(b_n{=}1|b_{n-1}{=}1) = \frac{1}{1+2^{\mu}} \equiv \lambda. \quad (C.194)
\]
Therefore the optimal transition probability of the output process at time $n$ is given by the doubly stochastic matrix (IV.152). Next, we show that the value function, $V_n(b_{n-1})$, is independent of $b_{n-1}$. The value function for $b_{n-1}=1$, $a_n=1$ is
\[
V_n(b_{n-1}{=}1) = \sum_{b_n} \log\Big( \frac{P_n(b_n|a_n{=}1,b_{n-1}{=}1)}{P^{\pi}_n(b_n|b_{n-1}{=}1)} \Big) P(b_n|a_n{=}1,b_{n-1}{=}1) = \alpha \log \frac{1 - P^{\pi}_n(b_n{=}1|b_{n-1}{=}1)}{P^{\pi}_n(b_n{=}1|b_{n-1}{=}1)} + \log \frac{1}{1 - P^{\pi}_n(b_n{=}1|b_{n-1}{=}1)} - H(\alpha) = V_n(b_{n-1}{=}0), \quad (C.195)
\]
where the last equality follows from (C.193) and (C.194). Since the value function $V_n(b_{n-1})$ is independent of $b_{n-1}$, we apply Theorem III.2.(b) to deduce that the optimal channel input and channel output conditional distributions are time-invariant. The optimal channel input conditional distribution is calculated via the expression $P^{\pi}_n(b_n|b_{n-1}) = \sum_{a_n\in\mathcal{A}} P_n(b_n|a_n,b_{n-1})\, \pi(a_n|b_{n-1})$. For $b_n = b_{n-1} = 0$, we have
\[
P^{\pi}_n(b_n{=}0|b_{n-1}{=}0) = \sum_{a_n} P_n(b_n{=}0|a_n,b_{n-1}{=}0)\, \pi(a_n|b_{n-1}{=}0) = \alpha\,\pi(a_n{=}0|b_{n-1}{=}0) + (1-\beta)\big(1 - \pi(a_n{=}0|b_{n-1}{=}0)\big). \quad (C.196)
\]
Solving (C.196) with respect to the input distribution yields
\[
\pi(a_n{=}0|b_{n-1}{=}0) = \frac{1-(1-\beta)(1+2^{\mu})}{(\alpha+\beta-1)(1+2^{\mu})} \equiv \nu. \quad (C.197)
\]
Similarly,
\[
P^{\pi}_n(b_n{=}1|b_{n-1}{=}1) = \sum_{a_n} P_n(b_n{=}1|a_n,b_{n-1}{=}1)\, \pi(a_n|b_{n-1}{=}1) = \alpha\,\pi(a_n{=}1|b_{n-1}{=}1) + (1-\beta)\big(1 - \pi(a_n{=}1|b_{n-1}{=}1)\big), \quad (C.198)
\]
which yields $\pi(a_n{=}1|b_{n-1}{=}1) = \nu$. The above shows (IV.151). By Theorem III.2.(b), specifically (III.87) evaluated at $t = 0$, we obtain the following expression for the FTFI capacity.
\[
C^{FB,BSSC}_{A^n\to B^n} \overset{(\alpha)}{=} \sum_{b_{-1}} V_0(b_{-1})\, \mu(b_{-1}) \overset{(\beta)}{=} (n+1) \max_{\pi_0(a_0|b_{-1})} \sum_{b_0,a_0} \log\Big( \frac{P(b_0|a_0,b_{-1})}{P^{\pi}(b_0|b_{-1})} \Big) P(b_0|a_0,b_{-1})\, \pi(a_0|b_{-1}), \quad b_{-1} \in \{0,1\}, \quad \overset{(\gamma)}{=} (n+1)\big[ H(\lambda) - \nu H(\alpha) - (1-\nu)H(\beta) \big], \quad (C.199)
\]
where $(\alpha)$ holds by definition (equation (III.69)), $(\beta)$ holds due to (III.87) evaluated at $t = 0$, and $(\gamma)$ by substituting the time-invariant capacity achieving input distribution (IV.151), the corresponding optimal output distribution (IV.152), and any value of $b_{-1} \in \{0,1\}$.
(b) Holds by definition (equation (II.27)).

APPENDIX D
PROOF OF THEOREM IV.2

(a) By employing the dynamic programming recursion for the constrained problem (III.90), we can show that the value function at the terminal time is independent of $b_{n-1}$. Therefore, by Theorem III.2, the optimization problem is non-nested, and the dynamic programming recursion for the constrained capacity is given by
\[
V_i(b_{i-1}) = \sup_{\pi(a_i|b_{i-1}),\ s\leq 0} \Big\{ \sum_{a_i\in\mathcal{A}} \sum_{b_i\in\mathcal{B}} \log\Big( \frac{P(b_i|b_{i-1},a_i)}{P^{\pi}(b_i|b_{i-1})} \Big) P(b_i|b_{i-1},a_i)\, \pi(a_i|b_{i-1}) + s\Big( \sum_{a_i\in\mathcal{A}} \gamma(a_i,b_{i-1})\, \pi(a_i|b_{i-1}) - \kappa \Big) \Big\}, \quad \forall i = 0,1,\ldots,n. \quad (D.200)
\]
Differentiating (D.200) with respect to the Lagrange multiplier $s$, we obtain the optimal input distribution (IV.158). The optimal output distribution is then calculated from $P^{\pi}_n(b_n|b_{n-1}) = \sum_{a_n\in\mathcal{A}} P_n(b_n|a_n,b_{n-1})\, \pi(a_n|b_{n-1})$, to obtain (IV.159).
(b) Since (i) the optimal channel input conditional distribution and the channel output conditional distribution are time-invariant, and (ii) the value function $V_i(b_{i-1})$ is independent of $b_{i-1}$, $\forall i = 0,1,\ldots,n$, the proof is identical to the proof of Theorem IV.1.(b). The value $\kappa_{\max}$ is obtained when the Lagrange multiplier $s = 0$, i.e., when the constrained optimization problem is equivalent to the unconstrained optimization problem. In this case, the optimal channel input conditional distribution for the constrained case is equal to that for the unconstrained case; thus $\kappa|_{s=0} = \kappa_{\max} = \nu$.

REFERENCES

[1] C. E. Shannon, "A mathematical theory of communication,"
Bell System Technical Journal, vol. 27, pp. 379-423, October 1948.
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley-Interscience, 2006.
[3] C. E. Shannon, "The zero error capacity of a noisy channel," IRE Transactions on Information Theory, vol. 2, no. 3, pp. 112-124, 1956.
[4] R. L. Dobrushin, "Information transmission in a channel with feedback," Theory of Probability and its Applications, vol. 3, no. 4, pp. 367-383, 1958.
[5] T. M. Cover and S. Pombra, "Gaussian feedback capacity," IEEE Transactions on Information Theory, vol. 35, no. 1, pp. 37-43, 1989.
[6] S. Ihara, Information Theory for Continuous Systems. World Scientific, 1993.
[7] H. Marko, "The bidirectional communication theory: a generalization of information theory," IEEE Transactions on Communications, vol. 21, no. 12, pp. 1345-1351, December 1973.
[8] J. L. Massey, "Causality, feedback and directed information," in Proceedings of the International Symposium on Information Theory and its Applications (ISITA), November 1990, pp. 303-305.
[9] C. K. Kourtellaris and C. D. Charalambous, "Information structures of capacity achieving distributions for feedback channels with memory and transmission cost: stochastic optimal control & variational equalities - part I," arXiv preprint arXiv:1512.04514, 2015.
[10] C. K. Kourtellaris and C. D. Charalambous, "Information structures of capacity achieving distribution for channels with memory and feedback," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), June 2016.
[11] N. Sen, F. Alajaji, and S. Yuksel, "Feedback capacity of a class of symmetric finite-state Markov channels," IEEE Transactions on Information Theory, vol. 57, no. 7, pp. 4110-4122, July 2011.
[12] H. Permuter, P. Cuff, B. Van Roy, and T. Weissman, "Capacity of the trapdoor channel with feedback," IEEE Transactions on Information Theory, vol. 54, no. 7, pp. 3150-3165, July 2008.
[13] O. Elishco and H. Permuter, "Capacity and coding for the Ising channel with feedback," IEEE Transactions on Information Theory, vol. 60, no. 9, pp. 5138-5149, September 2014.
[14] T. Berger, "Living information theory," IEEE Information Theory Society Newsletter, vol. 53, no. 1, March 2003.
[15] J. Chen and T. Berger, "The capacity of finite-state Markov channels with feedback," IEEE Transactions on Information Theory, vol. 51, no. 3, pp. 780-798, March 2005.
[16] H. Asnani, H. Permuter, and T. Weissman, "Capacity of a POST channel with and without feedback," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), 2013, pp. 2538-2542.
[17] H. Permuter, H. Asnani, and T. Weissman, "Capacity of a POST channel with and without feedback," IEEE Transactions on Information Theory, vol. 60, no. 10, pp. 6041-6057, October 2014.
[18] C. K. Kourtellaris and C. D. Charalambous, "Capacity of binary state symmetric channel with and without feedback and transmission cost," in Proceedings of the IEEE Information Theory Workshop (ITW), April 2015, pp. 1-5.
[19] C. K. Kourtellaris, C. D. Charalambous, and J. J. Boutros, "Nonanticipative transmission for sources and channels with memory," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), June 2015, pp. 521-525.
[20] F. Jelínek, Probabilistic Information Theory: Discrete and Memoryless Models. McGraw-Hill, 1968.
[21] M. Gastpar, "To code or not to code," Ph.D. dissertation, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, 2002.
[22] S. Tatikonda, "Control under communication constraints," Ph.D. dissertation, M.I.T., Cambridge, MA, 2000.
[23] C. D. Charalambous and P. A. Stavrou, "Directed information on abstract spaces: properties and variational equalities," IEEE Transactions on Information Theory, vol. PP, no. 99, pp. 1-1, 2016.
[24] P. A. Stavrou, C. D. Charalambous, and C. K. Kourtellaris, "Sequential necessary and sufficient conditions for optimal channel input distributions of channels with memory and feedback," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), 2016, pp. 300-304.
[25] R. G. Gallager, Information Theory and Reliable Communication. New York, NY: John Wiley & Sons, 1968.
[26] D. Luenberger, Optimization by Vector Space Methods. Wiley, 1969.
[27] O. Hernández-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer-Verlag, 1996.
[28] P. R. Kumar and P. Varaiya, Stochastic Systems: Estimation, Identification, and Adaptive Control. Prentice Hall, 1986.
[29] H. Permuter, T. Weissman, and A. Goldsmith, "Capacity of finite-state channels with time-invariant deterministic feedback," in Proceedings of the IEEE International Symposium on Information Theory (ISIT), July 2006, pp. 64-68.
[30] C. E. Shannon, "Coding theorems for a discrete source with a fidelity criterion," in IRE National Convention Record, Part 4, 1959, pp. 142-163.
[31] D. Bertsekas,