Minimax Filtering Regret via Relations Between Information and Estimation
Albert No, Student Member, IEEE, and Tsachy Weissman, Fellow, IEEE
Abstract—We investigate the problem of continuous-time causal estimation under a minimax criterion. Let $X^T = \{X_t, 0 \le t \le T\}$ be governed by the probability law $P_\theta$ from a class of possible laws indexed by $\theta \in \Lambda$, and let $Y^T$ be the noise-corrupted observations of $X^T$ available to the estimator. We characterize the estimator minimizing the worst-case regret, where regret is the difference between the causal estimation loss of the estimator and that of the optimum estimator.

One of the main contributions of this paper is characterizing the minimax estimator, showing that it is in fact a Bayesian estimator. We then relate minimax regret to the channel capacity when the channel is either Gaussian or Poisson. In this case, we characterize the minimax regret and the minimax estimator more explicitly. If we further assume that the uncertainty set consists of deterministic signals, the worst-case regret is exactly equal to the corresponding channel capacity, namely the maximal mutual information attainable across the channel among all possible distributions on the uncertainty set of signals. The corresponding minimax estimator is the Bayesian estimator assuming the capacity-achieving prior. Using this relation, we also show that the capacity-achieving prior coincides with the least favorable input. Moreover, we show that this minimax estimator not only minimizes the worst-case regret but also essentially minimizes regret for "most" of the other sources in the uncertainty set. We present a couple of examples for the construction of a minimax filter via an approximation of the associated capacity-achieving distribution.

Index Terms—Mismatched estimation, minimax regret, regret-capacity, strong regret-capacity, directed information, sparse signal estimation, AWGN channel, Poisson channel, least favorable input.
I. INTRODUCTION
Recent work on relations between information and estimation has shown fundamental links between the causal estimation error and information theoretic quantities. In [1], Duncan showed that the causal estimation error of an additive white Gaussian noise (AWGN) corrupted signal is equal to the mutual information between the input and output processes divided by the signal-to-noise ratio. In [2], Weissman extended the result to the case of mismatched estimation, where the estimator assumes that the input signal is governed by a law $Q$ while its true law is $P$. In this case, the cost of mismatch, which is half the difference between the mismatched causal estimation error and the optimum (non-mismatched) causal estimation error, is given by the relative entropy between the laws of the output processes when the input processes have laws $P$ and $Q$, respectively. In [3], Atar and Weissman showed that parallel information-estimation relations exist in the Poisson channel for both mismatched and non-mismatched settings.

In this paper, we investigate the continuous-time causal estimation problem. We assume that the input process is governed by a probability law from a known uncertainty class $\mathcal{P}$, although the estimator does not know the true law. In particular, suppose that the input process is governed by a law $P_\theta \in \mathcal{P}$, where $\theta \in \Lambda$ and $\Lambda$ is the uncertainty set known to the decoder. In this setting, it is natural to consider the minimax estimator which minimizes the worst-case regret, where regret is defined as the difference between the causal estimation error of the estimator and that of the optimum estimator. If there is a minimum-achieving estimator, we will call it a minimax estimator or minimax filter. One of the main contributions of this paper is characterizing the minimax estimator, showing that it is in fact a Bayesian estimator under the distribution which is the capacity-achieving mixture of distributions associated with the channel whose input is a source in the uncertainty set.

We can find similar arguments in classical universal source coding theory. In universal source coding, the encoder only knows that the source is governed by some law from an uncertainty set. The goal is to construct the universal code that minimizes the gap between its expected code length and that under the optimum encoding strategy for the true law. Redundancy-capacity theory [4] tells us that the minimum of the worst-case redundancy (minimax redundancy) coincides with the maximum mutual information between input and output of the channel whose input is a choice of a law from the uncertainty set and whose output is a realization of that law.

Using these ideas, we show similar results for our causal estimation problem. If the channel is either Gaussian or Poisson, we can combine the results of mismatched estimation and the above redundancy-capacity theorem in order to relate the minimax regret to the corresponding channel capacity.
Indeed, the minimax regret turns out to equal the maximum mutual information between the input index and the corresponding output, which we shall refer to as regret-capacity. Moreover, the minimax filter is Bayesian with respect to the same prior that achieves maximum mutual information. Therefore, if we know the distribution that maximizes mutual information, we can induce the minimax estimator. Further, we shall see that if the class of measures $\mathcal{P}$ is a set of deterministic signals, this mutual information reduces to the mutual information between the input and output processes $X^T$ and $Y^T$. This allows us to harness well-known results from channel coding to characterize and construct the minimax filter.

The relation between the capacity-achieving prior and the minimax filter gives us a new link between estimation and information. In particular, using the regret-capacity theorem, we show that the capacity-achieving prior coincides with the least favorable input, which is the probability law over input signals that results in the worst causal mean loss.

Since the goal in minimax estimation is to minimize the worst-case regret, one may argue that the minimax estimator might not be a good estimator for many other sources in the class. However, in universal source coding theory, Merhav and Feder [5] showed that the minimax encoder works well for "most" distributions in the uncertainty set, where "most" is measured with respect to the capacity-achieving prior, which is argued to be the "right" prior. Indeed, the framework of [5] strengthened and generalized results of this nature that were established for parametric uncertainty sets by Rissanen in [6]. We apply this idea to our minimax estimation setting. These results imply that the minimax estimator not only minimizes the worst-case error, but does essentially at least as well as any other estimator for most sources.

Our results for the Gaussian and the Poisson channel carry over to accommodate the presence of feedback. In this paper, feedback means that the input process at time $t$, $X_t$, is also affected by previous outputs $\{Y_s : 0 \le s < t\}$. We show that all the theorems are still valid in this case by substituting mutual information with directed information.

The rest of the paper is organized as follows. Section II describes the concrete problem setting. In Section III, we present and discuss the main results. The relation between the capacity-achieving prior and the least favorable input is presented in Section IV. Section V provides proofs of the theorems. In Sections VI and VII, we provide examples of experiments with simulated signals. We conclude with a summary in Section VIII.

II. PROBLEM SETTING
Let the right-continuous input process $X^T = \{X_t, 0 \le t \le T\}$ be governed by the probability law $P_\theta$ from some class of possible laws indexed by $\theta \in \Lambda$. Throughout the paper, we will assume that the collection of laws $\mathcal{P} = \{P_\theta : \theta \in \Lambda\}$ is tight. $\mathcal{P}$ and $\Lambda$ are uncertainty sets known to the estimator. Let $Y^T$ be the noise-corrupted observations of $X^T$; therefore the probability law of $Y^T$ also depends on the specific $\theta \in \Lambda$. However, we assume that the noise corruption mechanism $P_{Y^T | X^T}$ is fixed and known to the decoder.

Denote the input and reconstruction alphabets by $\mathcal{X}$ and $\hat{\mathcal{X}}$, respectively. In other words, $X_t \in \mathcal{X}$ and $\hat{X}_t \in \hat{\mathcal{X}}$, where both $\mathcal{X}$ and $\hat{\mathcal{X}}$ are closed subsets of $\mathbb{R}$. Let the measurable $l(\cdot,\cdot) : \mathcal{X} \times \hat{\mathcal{X}} \mapsto [0, \infty)$ be a given loss function. (From this point on we tacitly assume measurability of all functions introduced.) For simplicity and transparency of our arguments, we assume that $l(\cdot,\cdot)$ satisfies the following properties:

(P1) $l(x, \hat{x})$ is a lower semi-continuous convex function in $\hat{x}$;
(P2) $\min_{\hat{x} \in \hat{\mathcal{X}}} E[l(X, \hat{x})] = E[l(X, E[X])]$ for all random variables $X$ on $\mathcal{X}$.

The squared error loss function and the natural loss function $l(x, \hat{x}) = x \log(x/\hat{x}) - x + \hat{x}$, introduced in [3], are examples of loss functions satisfying these properties. Note that all Bregman loss functions satisfy (P2). Moreover, if $E[X]$ is a unique minimizer of $E[l(X, \hat{x})]$ for all random variables $X$ (i.e., (P2) with uniqueness), then $l(\cdot,\cdot)$ is a Bregman loss function (up to an additive constant) [7]. However, Bregman loss functions are not convex in the second argument in general.

Define the causal estimator $\hat{X}_t(\cdot)$ as a function of the output process up to time $t$, i.e., $Y^t = \{Y_s, 0 \le s \le t\}$, and also define the causal mean loss associated with the filter $\hat{X} = \{\hat{X}_t(\cdot), 0 \le t \le T\}$ by

$$\mathrm{cml}(\theta, \hat{X}) = E_{P_\theta}\left[\int_0^T l(X_t, \hat{X}_t(Y^t))\, dt\right]$$

where $E_{P_\theta}[\cdot]$ denotes expectation under $P_\theta \times P_{Y^T|X^T}$. We will use $E_{P_\theta}[\cdot \mid Y^t]$ in the rest of the paper to denote conditional expectation under $P_\theta \times P_{Y^T|X^T}$.

III. MAIN RESULTS
A. Minimax Causal Estimation Criterion
If the estimator knows the true law $P_\theta$, property (P2) implies that the optimum filter will be the Bayesian estimator with respect to the law $P_\theta$, i.e., the estimate at time $t$ will be $E_{P_\theta}[X_t \mid Y^t]$. However, since the estimator does not know the true law $P_\theta$, the estimator can be optimized for a law $Q$ (while the active law remains $P_\theta$). Then the estimator is the Bayesian estimator $\hat{X}^{\mathrm{Bayes}}_Q$, where $\hat{X}^{\mathrm{Bayes}}_Q = \{E_Q[X_t \mid \cdot] : 0 \le t \le T\}$ denotes the collection of Bayesian filters under prior $Q$, i.e., the estimate at time $t$ will be $\hat{X}^{\mathrm{Bayes}}_Q(Y^t) = E_Q[X_t \mid Y^t]$. The corresponding mismatched causal mean loss will be

$$\mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_Q) = E_{P_\theta}\left[\int_0^T l(X_t, E_Q[X_t \mid Y^t])\, dt\right].$$

We can treat $\mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta})$ as our benchmark since it minimizes the causal mean loss when $P_\theta$ is exactly known. Therefore, we define the regret of the filter $\hat{X}$ when the active source is $P_\theta$ by

$$R(\theta, \hat{X}) = \mathrm{cml}(\theta, \hat{X}) - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}).$$

Since we do not have a prior on $\theta$, it is natural to seek to minimize the worst-case regret over all possible $\theta \in \Lambda$. Specifically, define $\mathrm{minimax}(\Lambda)$ as

$$\mathrm{minimax}(\Lambda) = \inf_{\hat{X}} \sup_{\theta \in \Lambda} R(\theta, \hat{X}),$$

where the infimum is over all possible filters. If there exists an infimum-achieving $\hat{X}$, we will say $\hat{X}$ is the minimax filter.

B. Statement of Results
Theorem 1.
Suppose there exists some reference symbol $\hat{x} \in \hat{\mathcal{X}}$ such that $E_{P_\theta}[\int_0^T l(X_t, \hat{x})\, dt] < \infty$ for all $\theta \in \Lambda$. Let $\mathcal{Q}$ denote the convex hull of the closure of the uncertainty set $\mathcal{P}$, i.e., $\mathcal{Q} = \mathrm{conv}(\mathrm{cl}(\{P_\theta : \theta \in \Lambda\}))$. Let $l(\cdot,\cdot)$ be a loss function with properties (P1) and (P2). Then, the minimax estimator is a Bayesian estimator, i.e.,

$$\mathrm{minimax}(\Lambda) = \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q) = \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} \left\{ \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_Q) - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) \right\}. \quad (1)$$

Consider the following two canonical continuous-time channel models which define the conditional law $P_{Y^T|X^T}$.
1) Gaussian Channel:
Suppose that under all $P_\theta \in \mathcal{P}$, the output process $Y^T$ is the AWGN-corrupted version of $X^T$, i.e.,

$$dY_t = X_t\, dt + dW_t$$

where $W^T$ is a standard Brownian motion independent of $X^T$. We consider half the squared loss function, $l(x, \hat{x}) = \frac{1}{2}(x - \hat{x})^2$, where we introduce the factor $1/2$ to streamline the exposition that follows.
2) Poisson Channel:
Suppose that under all $P_\theta \in \mathcal{P}$, the output $Y^T$ is a non-homogeneous Poisson process with intensity $X^T$, where $X^T$ is a nonnegative stochastic process. As in [3], we employ the natural loss function $l(x, \hat{x}) = x \log(x/\hat{x}) - x + \hat{x}$. This loss function is a natural choice for the Poisson channel, cf. [3, Lemma 2.1].

Let us define a virtual channel which takes $\theta \in \Lambda$ as an input. The corresponding output of the virtual channel is $Y^T$, which is a realization of the output process when the input has law $P_\theta$. Then the capacity of the virtual channel is $\sup_{w \in \mu(\Lambda)} I(\Theta; Y^T)$, where $\Theta$ is a random variable that takes a value from $\Lambda$ and $\mu(\Lambda)$ denotes the class of all probability measures on the set $\Lambda$. We are now ready to state our main results.

Theorem 2 (Regret-Capacity). Let the setting be either that of the Gaussian channel or the Poisson channel. Then $\mathrm{minimax}(\Lambda)$ is equal to the capacity of the virtual channel, i.e.,

$$\mathrm{minimax}(\Lambda) = \sup_{w \in \mu(\Lambda)} I(\Theta; Y^T). \quad (2)$$

Theorem 3 (Minimax Filter). Suppose the supremum in (2) is achieved by $w^* \in \mu(\Lambda)$. Then the minimum in (1) is achieved by the Bayesian optimum filter with respect to $Q^*$, where $Q^*$ is the mixture of the $P_\theta$'s with respect to $w^*$, i.e.,

$$Q^* = \int_{\theta \in \Lambda} P_\theta\, w^*(d\theta).$$

Moreover, the minimax filter is $\hat{X}^{\mathrm{Bayes}}_{Q^*}$.

Theorem 4 (Strong Regret-Capacity). Suppose the supremum in (2) is achieved by $w^* \in \mu(\Lambda)$. For any filter $\hat{X}$ and every $\epsilon > 0$,

$$R(\theta, \hat{X}) > (1 - \epsilon) \cdot \mathrm{minimax}(\Lambda)$$

for all $\theta \in \Lambda$ with the possible exception of points in a subset $B \subset \Lambda$, where $w^*(B) \le e \cdot 2^{-\epsilon \cdot \mathrm{minimax}(\Lambda)}$.

Consider the case of the presence of feedback, where $X_t$ is also affected by the previous outputs $\{Y_s : 0 \le s < t\}$. More precisely, $X_t$ can be viewed as a function of $Y^{t-\delta}$ and $U$ for some $\delta > 0$, where $U$ is an additional randomness independent of all other processes. Let $\mathcal{P}$ be a class of joint laws of $(X^T, Y^T)$ and $\Lambda$ be a set of indices of laws. Let the definitions of $\mathrm{minimax}(\Lambda)$ and $R(\theta, \hat{X}^{\mathrm{Bayes}}_Q)$ remain the same. Then, the following theorem tells us that all the above results hold essentially verbatim:

Theorem 5 (Presence of Feedback).

$$\mathrm{minimax}(\Lambda) = \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q).$$

Moreover, if the setting is either Gaussian or Poisson, then

$$\mathrm{minimax}(\Lambda) = \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q) = \sup_{w \in \mu(\Lambda)} I(\Theta; Y^T) = \sup_{w \in \mu(\Lambda)} I(X^T \to Y^T) - I(X^T \to Y^T \mid \Theta)$$

where $I(X^T \to Y^T)$ is the directed information from $X^T$ to $Y^T$, as introduced in [8]. Directed information in continuous time is precisely defined in Section V-A2.
Theorem 1 implies that the minimax filter is a Bayesian filter under some law $Q$. Furthermore, this minimax-optimal $Q$ is a mixture of the $P_\theta$'s. Therefore, in order to find the minimax filter, it is enough to restrict the search space to that of Bayesian filters. This is equivalent to finding an optimum prior $Q^*$, or optimum weights $w^*$ over the laws $\{P_\theta\}$. Note that we have not assumed anything about the statistics of the input and output processes, but only the aforementioned properties of the loss function $l(\cdot,\cdot)$.

If we assume that the noise corruption mechanism is either Gaussian or Poisson, Theorem 2 implies that the minimax regret coincides with the capacity of the virtual channel. We present the parallel results from universal coding in Section V-A1. Furthermore, Theorem 3 provides a prescription for such a filter in these cases. Note that the mutual information $I(\Theta; Y^T)$ is equal to $I(X^T; Y^T) - I(X^T; Y^T \mid \Theta)$ (since $\Theta - X^T - Y^T$ forms a Markov chain), where the first term is the mutual information between input and output when the input distribution is $Q = \int_\theta P_\theta\, w(d\theta)$. If the uncertainty set is a class of deterministic laws (e.g., each $\theta$ corresponds to a Dirac measure concentrated at some signal $x^T$ that satisfies the input constraints of the channel), then the right-hand side of (2) boils down to a supremum over all distributions on the set of allowable channel inputs, i.e.,

$$\mathrm{minimax}(\Lambda) = \sup_{w \in \mu(\Lambda)} I(\Theta; Y^T) = \sup_{w \in \mu(\Lambda)} I(X^T; Y^T) - I(X^T; Y^T \mid \Theta) = \sup_{P_{X^T} \in \mathcal{Q}} I(X^T; Y^T), \quad (3)$$

where $\mathcal{Q} = \mathrm{conv}(\mathrm{cl}(\mathcal{P}))$. (3) follows because $X^T$ is deterministic given $\Theta$, and therefore $I(X^T; Y^T \mid \Theta) = 0$. Note that the right-hand side of (3) is the capacity of the channel whose input is constrained to lie in the uncertainty set of signals. Moreover, letting $Q^*$ denote the capacity-achieving distribution, the minimax estimator is the Bayesian estimator with respect to the law $Q^*$. More interestingly, $Q^*$ turns out to coincide with the classical notion of the least favorable prior from estimation theory. We establish this connection in Section IV. These results show the strong relation between minimax estimation and channel coding problems.

In Theorem 4, we can see that the minimax estimator minimizes not only the worst-case regret, but also the regret for most $\theta \in \Lambda$, under the distribution $w^*$. Cf. [4] for a discussion of the significance and implications of this result. For example, it implies that when $\Lambda$ is a compact subset of $\mathbb{R}^k$ and the parametrization of the input distributions $P_\theta$ is sufficiently smooth, the minimax filter is essentially optimum not only in the worst-case sense for which it was optimized, but in fact on "most" of the sources, over all possible filters. Note that we are not restricting filters to be Bayesian. "Most" here means that the Lebesgue measure of the set of parameters indexing sources on which the filter fails to be essentially optimum vanishes as the value of $\mathrm{minimax}(\Lambda)$ grows without bound. It is often the case that $\mathrm{minimax}(\Lambda)$ grows without bound as $T$ increases. For example, if the uncertainty set consists of a set that constrains the possible underlying signals rather than their laws, we have seen that $\mathrm{minimax}(\Lambda)$ is equal to $T$ times the channel capacity, which grows linearly with $T$.

Theorem 5 implies that the above results can be extended to the case where feedback exists. Similar to (3), if $\mathcal{P}$ is a class of deterministic laws, i.e., $X_t$ is a function of $\theta$ and previous outputs, then

$$\mathrm{minimax}(\Lambda) = \sup_w I(X^T \to Y^T).$$

Recall, this is $T$ times the channel capacity in the presence of feedback. Again, if we can find the capacity-achieving scheme, it will give us the minimax filter.
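To make the regret-capacity recipe concrete, the following minimal sketch computes the capacity-achieving weights $w^*$ of a virtual channel with a finite uncertainty set and a finite output alphabet via the Blahut-Arimoto iteration. This is our illustration, not a construction from the paper: the continuous-time channels above would first have to be discretized, and the function name and tolerances are our own choices.

```python
import numpy as np

def blahut_arimoto(P, tol=1e-10, max_iter=10000):
    """Capacity-achieving prior of a discrete 'virtual channel'.

    P: (m, k) array whose row theta is the output law P_theta(y) on a
    finite output alphabet. Returns (w, C): the capacity-achieving
    prior w* over the m indices and the capacity C in nats.
    """
    m, _ = P.shape
    w = np.full(m, 1.0 / m)              # start from the uniform prior
    for _ in range(max_iter):
        q = w @ P                        # mixture output Q(y) under current w
        with np.errstate(divide="ignore", invalid="ignore"):
            logratio = np.where(P > 0, np.log(P / q), 0.0)
        d = (P * logratio).sum(axis=1)   # D(P_theta || Q) for each theta
        w_new = w * np.exp(d)
        w_new /= w_new.sum()
        if np.abs(w_new - w).max() < tol:
            w = w_new
            break
        w = w_new
    return w, float(w @ d)               # at the fixed point, C = sum_theta w_theta D(P_theta || Q*)
```

By Theorem 3, the returned mixture $Q^* = \sum_\theta w^*_\theta P_\theta$ is then the prior under which the Bayesian filter is minimax (in the discretized surrogate problem).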
IV. LEAST FAVORABLE INPUT
In Section III, we saw a relation between the capacity-achieving prior for a virtual channel and the minimax estimator. More precisely, the minimax estimator is the Bayesian estimator with respect to the law $Q^*$, where $Q^*$ is the capacity-achieving prior. In this section, we will show that $Q^*$ coincides with the "least favorable prior" from estimation theory. This is another interesting relation between information and estimation theory.
Suppose $\mathcal{S}$ is a class of possible input signals with corresponding index class $\Lambda$, i.e., $\mathcal{S} = \{f_\theta\}_{\theta \in \Lambda}$. The input process $X_t$ is equal to $f_\theta(t)$ for some $\theta \in \Lambda$ which is unknown to the filter. Instead of the minimax criterion that we have discussed thus far, we can consider the same problem in a Bayesian setting, namely where the input signal $\{X_t, 0 \le t \le T\}$ is governed by a probability law defined on $\mathcal{S}$ and the estimator knows the true distribution of the source. We also assume that the channel is either Gaussian or Poisson. Define the average loss, where the input prior is $Q$ and the estimator employs the optimum Bayesian filter $E_Q[X_t \mid Y^t]$, as

$$r_Q \triangleq E_Q\left[\int_0^T l(X_t, E_Q[X_t \mid Y^t])\, dt\right].$$

The goal is to find the least favorable input distribution $Q \in \mu(\mathcal{S})$ which causes the greatest average loss (rather than regret). We refer to [9, Chapter 5] for a similar concept in point estimation theory. More formally, we define the least favorable prior as follows.
A prior distribution $Q^*$ is least favorable if $r_{Q^*} \ge r_Q$ for all prior distributions $Q$.

We define $P_\theta$ to be a deterministic measure such that $P_\theta(X_t = f_\theta(t) \text{ for all } 0 \le t \le T) = 1$ and consider the corresponding minimax estimation problem. Note that $\mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) = 0$, since the input process is deterministic under $P_\theta$, and therefore

$$R(\theta, \hat{X}) = \mathrm{cml}(\theta, \hat{X}) - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) = \mathrm{cml}(\theta, \hat{X}).$$

In this setting, the minimax estimator can be viewed as an achiever of $\min_{\hat{X}} \sup_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X})$. We already showed in (3) that the minimax estimator is the Bayesian estimator with respect to $Q^*$, where $Q^*$ is a capacity-achieving prior.
The relation between the minimax estimator and the least favorable input is characterized in the following theorem.
Theorem 6.
Suppose that $Q^*$ is a distribution on $\mathcal{S}$ such that

$$r_{Q^*} = \sup_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{Q^*}).$$

Then:

1) $\hat{X}^{\mathrm{Bayes}}_{Q^*}$ is a minimax estimator.
2) If $\hat{X}^{\mathrm{Bayes}}_{Q^*}$ is the unique minimizer of $E_{Q^*}\left[\int_0^T l(X_t, \hat{X}_t(Y^t))\, dt\right]$, then it is the unique minimax estimator.
3) $Q^*$ is least favorable.

Proof:
1) For any filter $\hat{X}$,

$$\sup_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}) \ge \int \mathrm{cml}(\theta, \hat{X})\, dQ^*(\theta) = E_{Q^*}\left[\int_0^T l(X_t, \hat{X}_t(Y^t))\, dt\right] \ge E_{Q^*}\left[\int_0^T l(X_t, E_{Q^*}[X_t \mid Y^t])\, dt\right] \quad (4)$$
$$= r_{Q^*} = \sup_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{Q^*}).$$

This implies $\inf_{\hat{X}} \sup_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}) = \sup_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{Q^*})$. Therefore, $\hat{X}^{\mathrm{Bayes}}_{Q^*}$ is a minimax estimator.

2) By assumption, (4) holds with equality only if $\hat{X}_t(Y^t) = E_{Q^*}[X_t \mid Y^t]$. This implies the uniqueness of the minimax estimator.

3) For any prior $Q$,

$$r_Q = E_Q\left[\int_0^T l(X_t, E_Q[X_t \mid Y^t])\, dt\right] \le E_Q\left[\int_0^T l(X_t, E_{Q^*}[X_t \mid Y^t])\, dt\right] \le \int \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{Q^*})\, dQ(\theta) \le \sup_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{Q^*}) = r_{Q^*}.$$

This implies $Q^*$ is least favorable.

When $l(\cdot,\cdot)$ is a Bregman divergence, the minimizer of $\min_{\hat{x}} E[l(X, \hat{x})]$ is unique, and therefore $\hat{X}^{\mathrm{Bayes}}_{Q^*}$ is the unique minimizer of $E_{Q^*}\left[\int_0^T l(X_t, \hat{X}_t(Y^t))\, dt\right]$. Furthermore, if $r_{Q^*} = \sup_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{Q^*})$, then $\hat{X}^{\mathrm{Bayes}}_{Q^*}$ is the unique minimax filter.

Theorem 6 provides a sufficient condition for $Q^*$ to be least favorable. Using this theorem, we can show that the least favorable input is equal to the capacity-achieving prior.

Theorem 7. If $Q^*$ is a capacity-achieving prior of the channel when the input is restricted to the set $\mathcal{S}$, then $Q^*$ is a least favorable input.

Proof: Since our uncertainty set is a collection of deterministic measures, we can apply (3):

$$\min_{Q \in \mu(\mathcal{S})} \sup_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_Q) = \sup_{Q \in \mu(\mathcal{S})} I(X^T; Y^T).$$

Since $Q^*$ achieves both the minimum of $\min_{Q \in \mu(\mathcal{S})} \sup_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_Q)$ and the supremum of $\sup_{Q \in \mu(\mathcal{S})} I(X^T; Y^T)$, we can write

$$\sup_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{Q^*}) = I(X^T; Y^T) \quad (5)$$
$$= E_{Q^*}\left[\int_0^T l(X_t, E_{Q^*}[X_t \mid Y^t])\, dt\right] \quad (6)$$
$$= r_{Q^*},$$

where the probability law of $X^T$ in (5) is $Q^*$. Line (6) is due to the relation between mutual information and the causal estimation loss; cf. [1] and [3] for the Gaussian and Poisson cases, respectively. This result tells us that $Q^*$ satisfies the condition of Theorem 6, and therefore the capacity-achieving prior $Q^*$ is least favorable.
We have shown that the least-favorable prior and the capacity-achieving prior always coincide in continuous-time causal estimation. However, this may not be true in general estimation problems. Consider the problem of minimax estimation of a bounded normal mean. We have a noisy observation

$$Y = x + Z$$

where $x \in [-a, a]$ is a bounded scalar parameter and $Z$ is a standard normal random variable. We can consider the least favorable input in this setting. The least favorable input is simply defined by $\arg\max_Q E_Q[(X - E_Q[X \mid Y])^2]$ where the maximum is over all probability laws of $X$ on $[-a, a]$. For $a > 1.05$, the unique least favorable prior is supported on at least 3 discrete points in $[-a, a]$ [10].

On the other hand, consider the corresponding peak-power-constrained Gaussian channel capacity problem: $\sup_{P_X \in \mu([-a,a])} I(X; X + Z)$. Sharma and Shamai showed that $P^*_X = \frac{1}{2}\delta_{-a} + \frac{1}{2}\delta_a$ achieves capacity for all $a \le 1.665$ ([11], [12]). Therefore the least favorable prior and the capacity-achieving distribution do not coincide when $1.05 < a < 1.665$. This example shows that the least favorable prior and the capacity-achieving distribution do not coincide in general.

Now let us examine an analogous but contrasting continuous-time causal estimation problem. Consider the input process $X_t \equiv x$ for all $0 \le t \le T = 1$, where $x \in [-a, a]$ is a bounded scalar parameter and $a > 0$. We observe $Y^T$, the output of the AWGN channel $dY_t = X_t\, dt + dW_t$. In this setting, the least favorable input can be defined by $\arg\max_Q E_Q\left[\int_0^T (X - E_Q[X_t \mid Y^t])^2\, dt\right]$ where the maximum is over all probability laws of $X$ on $[-a, a]$. On the other hand, the corresponding channel capacity problem remains the same, i.e., $\sup_{Q \in \mu([-a,a])} I(X^T; Y^T) = \sup_{Q \in \mu([-a,a])} I(X; Y^T)$. Theorem 7 tells us that the least favorable prior coincides with the capacity-achieving prior. Therefore, both the capacity-achieving prior and the least favorable prior are $Q^* = \frac{1}{2}\delta_{-a} + \frac{1}{2}\delta_a$ if $a \le 1.665$.
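The support of the capacity-achieving prior in this example is easy to probe numerically. The sketch below is our illustration, not part of the paper: it reuses the `blahut_arimoto` function from the earlier sketch, and the grid sizes and threshold are arbitrary choices. It discretizes the amplitude-constrained channel $Y = X + Z$ and inspects where the capacity-achieving prior puts its mass; for small $a$ it concentrates near $\{-a, +a\}$.

```python
import numpy as np
from scipy.stats import norm

a = 1.0
xs = np.linspace(-a, a, 81)                # candidate input mass points on [-a, a]
ys = np.linspace(-a - 5.0, a + 5.0, 600)   # discretized output alphabet
P = norm.pdf(ys[None, :] - xs[:, None])    # Gaussian channel law p(y | x) on the grid
P /= P.sum(axis=1, keepdims=True)          # normalize rows into pmfs
w, C = blahut_arimoto(P)                   # from the earlier sketch
print("capacity (nats):", C)
print("support of w*:", xs[w > 1e-4])      # mass near {-a, +a} for small a
```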
V. PROOFS

A. Preliminaries

1) Redundancy-Capacity Theory:
It is worth reviewing some results from universal source coding theory, since the techniques will be useful in proving some of our results. In the context of universal source coding, let $x^n = (x_1, \cdots, x_n)$ be a sequence of symbols. Let $\{P_\theta : \theta \in \Lambda\}$ be a set of probability laws of sequences. Define the redundancy by

$$R_n(L, \theta) = E_{P_\theta}[L(X^n)] - H_\theta(X^n)$$

where $L(X^n)$ is the length of the codeword of a given uniquely decodable (UD) code and $H_\theta(X^n)$ is the entropy of the sequence with respect to $P_\theta$. Then, define the minimax redundancy as

$$R_n = \min_L \sup_{\theta \in \Lambda} R_n(L, \theta).$$

In [4], Gallager showed that the minimax redundancy is equal to the capacity of the virtual channel whose input is $\theta \in \Lambda$ and whose output is drawn by the probability measure $P_\theta(x^n)$, i.e., $R_n = C_n$ where

$$C_n = \sup_w I(\Theta; X^n)$$

and the supremum is over all priors of the random variable $\Theta$ on $\Lambda$. Furthermore, the minimum-achieving length function $L^*$ is related to the supremum-achieving weights $w^*$ in the following manner:

$$L^*(x^n) = -\log Q^*(x^n)$$

where $Q^* = \int_{\theta \in \Lambda} P_\theta\, w^*(d\theta)$. Merhav and Feder [5] proved the strong version of the redundancy-capacity theorem, which states that for any length function $L$ of a UD code and every $\epsilon > 0$,

$$R_n(L, \theta) > (1 - \epsilon) C_n$$

for all $\theta \in \Lambda$ except for points in a subset $B \subset \Lambda$ where

$$w^*(B) \le e \cdot 2^{-\epsilon C_n}. \quad (7)$$

In (7), the choice of the probability measure $w^*$ is reasonable because it captures variety in sets (cf. Merhav and Feder [5]). This theorem implies that $L^*$ not only achieves the minimum worst-case redundancy, but is also close to the minimum redundancy for most other sources.
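As a small worked instance (our illustration, again reusing the `blahut_arimoto` sketch from Section III; the block length and parameter values are arbitrary), the minimax redundancy of a two-element Bernoulli class can be computed directly from the virtual channel whose rows are $P_\theta(x^n)$:

```python
import numpy as np
from itertools import product

n, thetas = 4, [0.2, 0.8]                 # block length and Bernoulli class (arbitrary)
seqs = list(product([0, 1], repeat=n))
P = np.array([[th ** sum(s) * (1 - th) ** (n - sum(s)) for s in seqs]
              for th in thetas])          # row theta: P_theta(x^n) over all 2^n sequences
w, C = blahut_arimoto(P)                  # from the earlier sketch
Qstar = w @ P                             # capacity-achieving mixture Q*
L = -np.log2(Qstar)                       # ideal code lengths L*(x^n) = -log Q*(x^n)
print("minimax redundancy R_n = C_n =", C / np.log(2), "bits")
```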
2) Directed Information:
Given two random vectors $X^n$ and $Y^n$, we can define the directed information.

Definition 2 (Discrete-time Directed Information).

$$I(X^n \to Y^n) \triangleq \sum_{i=1}^n I(X^i; Y_i \mid Y^{i-1}).$$

In [8], Weissman et al. extended this definition to the continuous-time setting, i.e., the directed information between two random processes $X^T$ and $Y^T$. For a given vector $\mathbf{t} = (t_1, \cdots, t_n)$ where $0 < t_1 < t_2 < \cdots < t_n = T$, define $X^{T,\mathbf{t}} \triangleq (X_0^{t_1}, X_{t_1}^{t_2}, \cdots, X_{t_{n-1}}^T)$ and treat $X^{T,\mathbf{t}}$ as an $n$-dimensional vector. Using this notation, we can define the directed information between two random processes.

Definition 3 (Continuous-time Directed Information).

$$I(X^T \to Y^T) \triangleq \inf_{\mathbf{t}} I(X^{T,\mathbf{t}} \to Y^{T,\mathbf{t}})$$

where the infimum is over all finite-dimensional vectors $\mathbf{t}$.

We refer to [8] for more on the properties of directed information and its significance in communication and estimation.
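Definition 2 can be evaluated by brute force for short discrete sequences. The following sketch is our illustration (the pmf representation and helper names are our own); it computes $I(X^n \to Y^n) = \sum_i I(X^i; Y_i \mid Y^{i-1})$ from a joint pmf over sequence pairs:

```python
from math import log

def directed_information(p):
    """I(X^n -> Y^n) = sum_i I(X^i; Y_i | Y^{i-1}), in nats (Definition 2).

    p: dict mapping (x_tuple, y_tuple) -> probability, over sequence
    pairs of a common length n.
    """
    n = len(next(iter(p))[0])

    def marg(key_fn):
        out = {}
        for (x, y), v in p.items():
            k = key_fn(x, y)
            out[k] = out.get(k, 0.0) + v
        return out

    di = 0.0
    for i in range(1, n + 1):
        pxy = marg(lambda x, y: (x[:i], y[:i]))       # P(x^i, y^i)
        pxyp = marg(lambda x, y: (x[:i], y[:i - 1]))  # P(x^i, y^{i-1})
        py = marg(lambda x, y: y[:i])                 # P(y^i)
        pyp = marg(lambda x, y: y[:i - 1])            # P(y^{i-1})
        for (x, y), v in pxy.items():
            if v > 0:
                di += v * log(v * pyp[y[:i - 1]]
                              / (pxyp[(x, y[:i - 1])] * py[y]))
    return di

# Example: Y_i = X_i for two i.i.d. fair bits gives 2 log 2 ~ 1.386 nats
p = {((x1, x2), (x1, x2)): 0.25 for x1 in (0, 1) for x2 in (0, 1)}
print(directed_information(p))
```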
B. Proof of Theorem 1Proof:
We denote the class of measures on $\Lambda$ by $\mu(\Lambda)$; i.e., $w \in \mu(\Lambda)$ can be viewed as a weight function over the probability distributions $P_\theta$, $\theta \in \Lambda$. Then we have

$$\mathrm{minimax}(\Lambda) = \inf_{\hat{X}} \sup_{\theta \in \Lambda} R(\theta, \hat{X}) = \inf_{\hat{X}} \sup_{\theta \in \Lambda} \left\{ \mathrm{cml}(\theta, \hat{X}) - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) \right\} = \inf_{\hat{X}} \sup_{w \in \mu(\Lambda)} \left\{ \int_{\theta \in \Lambda} \left( \mathrm{cml}(\theta, \hat{X}) - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) \right) w(d\theta) \right\}.$$

Let $P_{\mathrm{av}} = \int P_\theta\, w(d\theta)$. We use Fubini's theorem: since there exists some reference symbol $\hat{x} \in \hat{\mathcal{X}}$ such that $E_{P_\theta}[\int_0^T l(X_t, \hat{x})\, dt] < \infty$ for all $\theta \in \Lambda$, there exists a filter $\hat{X}$ such that $\int_0^T l(X_t, \hat{X}_t(Y^t))\, dt$ is $L^1$ with respect to all $P_\theta$. Therefore,

$$\int_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X})\, w(d\theta) = E_{P_{\mathrm{av}}}\left[\int_0^T l(X_t, \hat{X}_t(Y^t))\, dt\right].$$

The remaining proof of $\mathrm{minimax}(\Lambda) \ge \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q)$ proceeds via the following chain:

$$\mathrm{minimax}(\Lambda) = \inf_{\hat{X}} \sup_{w \in \mu(\Lambda)} \left\{ E_{P_{\mathrm{av}}}\left[\int_0^T l(X_t, \hat{X}_t(Y^t))\, dt\right] - \int_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta})\, w(d\theta) \right\} \quad (8)$$
$$\ge \sup_{w \in \mu(\Lambda)} \inf_{\hat{X}} \left\{ E_{P_{\mathrm{av}}}\left[\int_0^T l(X_t, \hat{X}_t(Y^t))\, dt\right] - \int_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta})\, w(d\theta) \right\} \quad (9)$$
$$= \sup_{w \in \mu(\Lambda)} \left\{ E_{P_{\mathrm{av}}}\left[\int_0^T l(X_t, E_{P_{\mathrm{av}}}[X_t \mid Y^t])\, dt\right] - \int_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta})\, w(d\theta) \right\} \quad (10)$$
$$= \sup_{w \in \mu(\Lambda)} \min_{Q \in \mathcal{Q}} \left\{ E_{P_{\mathrm{av}}}\left[\int_0^T l(X_t, E_Q[X_t \mid Y^t])\, dt\right] - \int_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta})\, w(d\theta) \right\}$$
$$= \min_{Q \in \mathcal{Q}} \sup_{w \in \mu(\Lambda)} \left\{ E_{P_{\mathrm{av}}}\left[\int_0^T l(X_t, E_Q[X_t \mid Y^t])\, dt\right] - \int_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta})\, w(d\theta) \right\} \quad (11)$$
$$= \min_{Q \in \mathcal{Q}} \sup_{w \in \mu(\Lambda)} \left\{ \int_{\theta \in \Lambda} \left( E_{P_\theta}\left[\int_0^T l(X_t, E_Q[X_t \mid Y^t])\, dt\right] - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) \right) w(d\theta) \right\} \quad (12)$$
$$= \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} \left\{ E_{P_\theta}\left[\int_0^T l(X_t, E_Q[X_t \mid Y^t])\, dt\right] - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) \right\} = \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} \left\{ \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_Q) - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) \right\} = \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q),$$

where:
• (9) is because for any real-valued function $f(x, y)$ on $\mathcal{X} \times \mathcal{Y}$, we have $\inf_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} f(x, y) \ge \sup_{y \in \mathcal{Y}} \inf_{x \in \mathcal{X}} f(x, y)$;
• (10) is because the loss function $l$ satisfies property (P2) (the conditional expectation minimizes the expected loss);
• (11) is because of Sion's minimax theorem. In order to apply Sion's minimax theorem, we have to verify the following four conditions:
– $\mathcal{Q}$ has to be a compact convex subset of a linear topological space;
– $\mu(\Lambda)$ has to be a convex subset of a linear topological space;
– $E_{P_{\mathrm{av}}}\left[\int_0^T l(X_t, E_Q[X_t \mid Y^t])\, dt\right] - \int_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta})\, w(d\theta)$ has to be upper semi-continuous and quasiconcave on $\mu(\Lambda)$ for all $Q \in \mathcal{Q}$;
– the same quantity has to be lower semi-continuous and quasiconvex on $\mathcal{Q}$ for all $w \in \mu(\Lambda)$.

Consider the topology of weak convergence of probability laws. Since $\mathcal{P} = \{P_\theta : \theta \in \Lambda\}$ is tight and $\mathcal{X}$ is a Polish space, we can apply Prohorov's theorem, which implies that the closure of $\mathcal{P}$ is compact. The convex hull of a compact set is compact, and therefore $\mathcal{Q}$ is compact. Convexity of $\mu(\Lambda)$ and upper semi-continuity are clear.
Lower semi-continuity holds since we assumed that $l(\cdot,\cdot)$ is lower semi-continuous in the second argument. This guarantees that

$$E_{P_{\mathrm{av}}}\left[\int_0^T l(X_t, E_Q[X_t \mid Y^t])\, dt\right] - \int_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta})\, w(d\theta)$$

is lower semi-continuous in $Q \in \mathcal{Q}$.
• Note that (12) also holds due to an argument similar to that for (8).

The opposite direction is trivial, that is,

$$\inf_{\hat{X}} \sup_{\theta \in \Lambda} \left\{ \mathrm{cml}(\theta, \hat{X}) - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) \right\} \le \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} \left\{ \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_Q) - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) \right\}.$$

Therefore,

$$\mathrm{minimax}(\Lambda) = \inf_{\hat{X}} \sup_{\theta \in \Lambda} R(\theta, \hat{X}) = \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q).$$

C. Proof of Theorems 2 and 3

Proof:
For both the Gaussian and Poisson settings, the cost of mismatch is related to the relative entropy between the outputs corresponding to input laws $P_\theta$ and $Q$, respectively [2], [3]. In other words, if $(P_\theta)_{Y^T}$ is the distribution of $Y^T$ when the law of the input process is $P_\theta$, and if $Q_{Y^T}$ is defined similarly, we have

$$\mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_Q) - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) = D((P_\theta)_{Y^T} \| Q_{Y^T}). \quad (13)$$

Using an argument similar to that of classical minimax redundancy theory, we get

$$\mathrm{minimax}(\Lambda) = \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} \left\{ \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_Q) - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) \right\} = \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} D((P_\theta)_{Y^T} \| Q_{Y^T})$$
$$= \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} \int d(P_\theta)_{Y^T} \log\left(\frac{d(P_\theta)_{Y^T}}{dQ_{Y^T}}\right) = \min_{Q \in \mathcal{Q}} \sup_{w \in \mu(\Lambda)} \int\!\!\int d(P_\theta)_{Y^T} \log\left(\frac{d(P_\theta)_{Y^T}}{dQ_{Y^T}}\right) w(d\theta)$$
$$= \sup_{w \in \mu(\Lambda)} \min_{Q \in \mathcal{Q}} \int\!\!\int d(P_\theta)_{Y^T} \log\left(\frac{d(P_\theta)_{Y^T}}{dQ_{Y^T}}\right) w(d\theta) \quad (14)$$
$$= \sup_{w \in \mu(\Lambda)} \min_{Q \in \mathcal{Q}} \left\{ \int\!\!\int d(P_\theta)_{Y^T} \log\left(\frac{d(P_\theta)_{Y^T}}{d(P_{\mathrm{av}})_{Y^T}}\right) w(d\theta) + \int\!\!\int d(P_\theta)_{Y^T} \log\left(\frac{d(P_{\mathrm{av}})_{Y^T}}{dQ_{Y^T}}\right) w(d\theta) \right\}$$
$$= \sup_{w \in \mu(\Lambda)} \min_{Q \in \mathcal{Q}} \left\{ \int D((P_\theta)_{Y^T} \| (P_{\mathrm{av}})_{Y^T})\, w(d\theta) + D((P_{\mathrm{av}})_{Y^T} \| Q_{Y^T}) \right\}$$
$$= \sup_{w \in \mu(\Lambda)} \int D((P_\theta)_{Y^T} \| (P_{\mathrm{av}})_{Y^T})\, w(d\theta) \quad (15)$$
$$= \sup_{w \in \mu(\Lambda)} I(\Theta; Y^T).$$

In (14), we applied the minimax theorem again, where weak lower semi-continuity in $Q$ follows from the properties of relative entropy. All other conditions for the minimax theorem are the same as in the proof in the previous section. This completes the proof of Theorem 2.

In (15), if a supremum-achieving $w^*$ exists, the minimum-achieving $Q^*$ is $P_{\mathrm{av}}$, i.e.,

$$Q^* = \int_{\theta \in \Lambda} P_\theta\, w^*(d\theta).$$

Therefore,

$$\mathrm{minimax}(\Lambda) = \sup_{\theta \in \Lambda} \left\{ \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{Q^*}) - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) \right\},$$

which implies that the minimax estimator is the Bayesian estimator based on the law $Q^*$, i.e., $\hat{X}_t(Y^t) = E_{Q^*}[X_t \mid Y^t]$.

D. Proof of Theorem 4

Proof:
The idea of the proof is similar to that in [5]. For a given estimator $\hat{X}^*$ and $\epsilon > 0$, define the set

$$B = \{\theta : R(\theta, \hat{X}^*) \le (1 - \epsilon) \cdot \mathrm{minimax}(\Lambda)\}.$$

Then, by definition of $B$, we have

$$\mathrm{minimax}(B) = \inf_{\hat{X}} \sup_{\theta \in B} R(\theta, \hat{X}) \le \sup_{\theta \in B} R(\theta, \hat{X}^*) \le (1 - \epsilon) \cdot \mathrm{minimax}(\Lambda).$$

Consider $\Theta$ as a random variable with measure $w^*$, where $w^*$ achieves the supremum in (2). Let $Z = \mathbf{1}\{\Theta \in B\}$ be a binary random variable. Clearly we have $P(Z = 1) = w^*(B)$. Since $Z - \Theta - Y^T$ is a Markov chain, we have

$$\mathrm{minimax}(\Lambda) = I(\Theta; Y^T) = I(Z; Y^T) + I(\Theta; Y^T \mid Z)$$
$$= I(Z; Y^T) + P(Z = 1) I(\Theta; Y^T \mid Z = 1) + P(Z = 0) I(\Theta; Y^T \mid Z = 0)$$
$$\le I(Z; Y^T) + w^*(B) \cdot \mathrm{minimax}(B) + (1 - w^*(B)) \cdot \mathrm{minimax}(\Lambda) \quad (16)$$
$$\le H(Z) + (1 - \epsilon \cdot w^*(B)) \cdot \mathrm{minimax}(\Lambda),$$

where (16) is because $I(\Theta; Y^T \mid Z = 1) \le \mathrm{minimax}(B)$ and $I(\Theta; Y^T \mid Z = 0) \le \mathrm{minimax}(\Lambda)$. Finally, we get

$$-\log w^*(B) - \frac{1 - w^*(B)}{w^*(B)} \log(1 - w^*(B)) \ge \epsilon \cdot \mathrm{minimax}(\Lambda),$$

which implies $w^*(B) \le e \cdot 2^{-\epsilon \cdot \mathrm{minimax}(\Lambda)}$.

E. Proof of Theorem 5

Proof:
The proofs of Theorems 1 and 4 are still valid even in the presence of feedback. Moreover, since the cost-of-mismatch result is also valid with feedback [3], the only non-trivial part is to show that $I(\Theta; Y^T) = I(X^T \to Y^T) - I(X^T \to Y^T \mid \Theta)$.

Recall the definition of directed information in the continuous-time setting. For fixed time instants $0 < t_1 < t_2 < \cdots < t_n = T$,

$$I(\Theta; Y^T) = \sum_{i=1}^n I(\Theta; Y_{t_{i-1}}^{t_i} \mid Y^{t_{i-1}}) = \sum_{i=1}^n \int \log \frac{dP_{Y_{t_{i-1}}^{t_i} \mid Y^{t_{i-1}}, \Theta}}{dP_{Y_{t_{i-1}}^{t_i} \mid Y^{t_{i-1}}}}\, dP_{Y^{t_i}, \Theta}$$
$$= \sum_{i=1}^n \int \left( \log \frac{dP_{Y_{t_{i-1}}^{t_i} \mid X^{t_i}, Y^{t_{i-1}}, \Theta}}{dP_{Y_{t_{i-1}}^{t_i} \mid Y^{t_{i-1}}}} - \log \frac{dP_{Y_{t_{i-1}}^{t_i} \mid X^{t_i}, Y^{t_{i-1}}, \Theta}}{dP_{Y_{t_{i-1}}^{t_i} \mid Y^{t_{i-1}}, \Theta}} \right) dP_{X^{t_i}, Y^{t_i}, \Theta}$$
$$= \sum_{i=1}^n \int \log \frac{dP_{Y_{t_{i-1}}^{t_i} \mid X^{t_i}, Y^{t_{i-1}}}}{dP_{Y_{t_{i-1}}^{t_i} \mid Y^{t_{i-1}}}}\, dP_{X^{t_i}, Y^{t_i}} - \int \log \frac{dP_{Y_{t_{i-1}}^{t_i} \mid X^{t_i}, Y^{t_{i-1}}, \Theta}}{dP_{Y_{t_{i-1}}^{t_i} \mid Y^{t_{i-1}}, \Theta}}\, dP_{X^{t_i}, Y^{t_i}, \Theta} \quad (17)$$
$$= \sum_{i=1}^n I(Y_{t_{i-1}}^{t_i}; X^{t_i} \mid Y^{t_{i-1}}) - I(Y_{t_{i-1}}^{t_i}; X^{t_i} \mid Y^{t_{i-1}}, \Theta),$$

where (17) is because $\Theta - (X^{t_i}, Y^{t_{i-1}}) - Y_{t_{i-1}}^{t_i}$ forms a Markov chain. Since the equality holds for any choice of time instants, we take $\mathbf{t}$ such that $\sup_i \|t_i - t_{i-1}\| \to 0$ and conclude

$$\mathrm{minimax}(\Lambda) = \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q) = \min_{Q \in \mathcal{Q}} \sup_{\theta \in \Lambda} D((P_\theta)_{Y^T} \| Q_{Y^T}) = \sup_w I(\Theta; Y^T) = \sup_w I(X^T \to Y^T) - I(X^T \to Y^T \mid \Theta).$$
VI. EXAMPLES
A. Gaussian Channel and Sparse Signal
We first apply our theorems to the problem of sparse signal estimation under Gaussian noise.
1) Setting:
We assume the output process $Y^T$ is an AWGN-corrupted version of $X^T$, as discussed in Section III-B1. The input process $X^T$ is sparse (the meaning will be explained). Recall that we are using half the squared error as a distortion measure, $l(x, \hat{x}) = \frac{1}{2}(x - \hat{x})^2$.

Let $\{\phi_i(t), 0 \le t \le T\}_{i=1}^n$ be a given orthonormal signal set which is known to the estimator. Suppose $X^T$ is a linear combination of the $\phi_i(t)$'s, i.e., $X_t = \sum_{i=1}^n A_i \phi_i(t)$, where $\{A_i\}_{i=1}^n$ are random variables with unknown distribution. However, we assume that the estimator knows that the signal $X^T$ is power constrained and is sparse, by which we mean that the fraction of nonzero elements in $\{A_i\}_{i=1}^n$ should be smaller than $q$ (i.e., at most $nq$ of the $A_i$'s can be nonzero). Let $\mathcal{P}$ be the class of all probability measures $P_\theta$ of the vector $A = (A_1, \cdots, A_n)$, indexed by $\theta$, which satisfy these two constraints almost surely, i.e.,

$$\mathcal{P} = \left\{ P_\theta : \frac{1}{n} \sum_{i=1}^n A_i^2 \le P, \ \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{A_i \ne 0\} \le q \ \text{a.s.} \right\}. \quad (18)$$

Note that $\int_0^T X_t^2\, dt = \sum_{i=1}^n A_i^2$ because of orthonormality, and therefore it is equivalent to consider $\frac{1}{n} \sum_{i=1}^n A_i^2 \le P$ as the power constraint. Define the uncertainty set $\Lambda$ to be the set of such indices. It is clear that $\mathcal{P} = \{P_\theta : \theta \in \Lambda\}$ is a convex set.

We further define $\mathcal{P}_D$ as the class of deterministic measures $P_\theta \in \mathcal{P}$ (i.e., $P_\theta(\{a^n\}) = 1$ for some $a^n \in \mathbb{R}^n$), with corresponding set of indices $\Lambda_D$. Note that $\mathrm{conv}(\mathcal{P}_D) = \mathcal{P}$. We also define the class of sparse signals with average constraints

$$\mathcal{P}_{\mathrm{av}} = \left\{ P_\theta : E\left[\frac{1}{n} \sum_{i=1}^n A_i^2\right] \le P, \ E\left[\frac{1}{n} \sum_{i=1}^n \mathbf{1}\{A_i \ne 0\}\right] \le q \right\}$$

with corresponding index set $\Lambda_{\mathrm{av}}$.

We can understand $\mathcal{P}_D$ as a class of Dirac measures at some $a^n$, and $\mathcal{P}_{\mathrm{av}}$ as a class of measures that satisfy the average power and sparsity constraints in expectation, while measures in $\mathcal{P}$ satisfy the constraints with probability 1. In classical minimax statistical theory, $\mathcal{P}_D$ is often called the set of point uncertainty, and $\mathcal{P}_{\mathrm{av}}$ is called the minimax Bayes relaxation. There are some simple relations among these sets:

• $\mathcal{P}_D \subset \mathcal{P} \subset \mathcal{P}_{\mathrm{av}}$ and $\Lambda_D \subset \Lambda \subset \Lambda_{\mathrm{av}}$;
• $\mathcal{P}$ is the convex closure of $\mathcal{P}_D$, i.e., $\mathcal{P} = \mathrm{conv}(\mathcal{P}_D)$.

The goal is to find $\mathrm{minimax}(\Lambda)$ and the minimax filter that achieves it. A similar non-causal minimax problem was studied by Pinsker [13], who considered the non-causal estimation problem with only the power constraint. Although Pinsker's approach does not directly apply to our setting because of the difference between non-causal and causal estimation, we will use a similar idea to argue that the approximated version of the minimax filter works well.
2) Application of the Theorem:
It is easy to show that $\mathcal{P}$, $\mathcal{P}_D$, and $\mathcal{P}_{\mathrm{av}}$ are tight, and therefore we can apply the theorems. Theorem 2 implies that

$$\mathrm{minimax}(\Lambda) = \sup_{w(\cdot) \in \mu(\Lambda)} I(X^T; Y^T) - I(X^T; Y^T \mid \Theta).$$

Since our optimum causal minimax estimator is the Bayesian estimator under the distribution $Q^* = \int P_\theta\, w^*(d\theta)$, where $w^*$ achieves the supremum, we are interested in $w^*$. Rather than maximizing the difference between mutual informations, we can find an equivalent problem which is much easier to handle by exploiting the relation between $\mathrm{minimax}(\Lambda)$ and $\mathrm{minimax}(\Lambda_D)$.

Lemma 8. $\mathrm{minimax}(\Lambda_D) = \mathrm{minimax}(\Lambda)$.

Appendix I is dedicated to the proof of Lemma 8. Since $\mathcal{P}_D$ is a set of deterministic measures, we can get a more explicit formula for $\mathrm{minimax}(\Lambda_D)$, as we showed in Section III-C:

$$\mathrm{minimax}(\Lambda) = \mathrm{minimax}(\Lambda_D) = \sup_{w(\cdot) \in \mu(\Lambda_D)} I(X^T; Y^T) \quad (19)$$
$$= \sup_{P_\theta \in \mathcal{P}} I(X^T; Y^T). \quad (20)$$

In (19), $X^T$ is governed by the law $\int P_\theta\, w(d\theta)$, which is an element of $\mathcal{P}$. Therefore, finding a supremum-achieving $w^*$ in (19) is equivalent to finding the maximizing prior $P^*_\theta$ in $\mathcal{P}$; thus, (20) holds. Moreover, the minimum achiever $Q^*$ of $\mathrm{minimax}(\Lambda_D)$ coincides with that of $\mathrm{minimax}(\Lambda)$. Thus, it is enough to consider $\mathrm{minimax}(\Lambda_D)$, which is much simpler to solve.

Now, consider $\mathrm{minimax}(\Lambda_{\mathrm{av}})$:

$$\mathrm{minimax}(\Lambda) = \min_{Q \in \mathcal{P}} \sup_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_Q) - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) = \min_{Q \in \mathcal{P}_{\mathrm{av}}} \sup_{\theta \in \Lambda} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_Q) - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) \quad (21)$$
$$\le \min_{Q \in \mathcal{P}_{\mathrm{av}}} \sup_{\theta \in \Lambda_{\mathrm{av}}} \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_Q) - \mathrm{cml}(\theta, \hat{X}^{\mathrm{Bayes}}_{P_\theta}) = \mathrm{minimax}(\Lambda_{\mathrm{av}}) = \sup_{w(\cdot) \in \mu(\mathcal{P}_{\mathrm{av}})} I(X^T; Y^T) - I(X^T; Y^T \mid \Theta),$$

where (21) is because the Bayesian estimator with prior $Q^* \in \mathcal{P}$ is optimum over all possible filters and we can always extend the search space. We will use this relation between $\mathrm{minimax}(\Lambda)$ and $\mathrm{minimax}(\Lambda_{\mathrm{av}})$ to approximate the minimax filter.
3) Sufficient Statistics:
Since the channel input signal is a linear combination of orthonormal signals, sufficient statistics of the channel output signal at time $t = T$ are the projections on the $\phi_i$'s, i.e., $\{\int_0^T \phi_i(t)\, dY_t\}_{i=1}^n$. Therefore, the above mutual information $I(X^T; Y^T)$ can be further simplified:

$$\mathrm{minimax}(\Lambda) = \sup_{P_\theta \in \mathcal{P}} I(A^n; B^n)$$

where $B_i = \int_0^T \phi_i(t)\, dY_t$ for $1 \le i \le n$. Since we assumed an orthonormal basis, $B^n$ can be viewed as the output of a discrete-time additive white Gaussian noise channel, i.e., $B_i = A_i + W_i$, where the $W_i$ are i.i.d. standard Gaussian noises independent of $A^n$. This implies that our problem of maximizing the mutual information over the continuous-time channel is equivalent to maximizing the mutual information between $n$ channel inputs and $n$ channel outputs over the discrete AWGN channel, with the input distribution constrained as in (18).

The above result shows that sufficient statistics for estimating $X^T$ given $Y^T$ are the projections $\{\int_0^T \phi_i(s)\, dY_s\}_{i=1}^n$; in other words, the following Markov relation holds:

$$X^T - \left\{\int_0^T \phi_i(s)\, dY_s\right\}_{i=1}^n - Y^T.$$

Since we are looking for a causal estimator, we need a similar result for time $t < T$. The following lemma shows that $\{\int_0^t \phi_i(s)\, dY_s\}_{i=1}^n$ are sufficient statistics for estimating $X_t$ given $Y^t$.

Lemma 9.
The following Markov relation holds for all $t \in [0, T]$:

$$X_t - \left\{\int_0^t \phi_i(s)\, dY_s\right\}_{i=1}^n - Y^t.$$

The proof of Lemma 9 is given in Appendix II. Using this lemma, we will show that we can compute $E[X_t \mid Y^t]$ easily.
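As a quick numerical sanity check of the discrete reduction $B_i = A_i + W_i$ (our illustration; the indicator-type orthonormal basis, grid size, and random seed are arbitrary choices, not from the paper):

```python
import numpy as np

# Monte Carlo check (sketch) that B_i = int_0^T phi_i dY = A_i + W_i
# with W_i approximately i.i.d. N(0,1), for an orthonormal set on [0,T].
rng = np.random.default_rng(0)
T, n, N = 1.0, 4, 100_000
dt = T / N
t = (np.arange(N) + 0.5) * dt
# n disjoint indicator functions, scaled to be orthonormal on [0,T]
phi = np.zeros((n, N))
for i in range(n):
    phi[i, (t >= i * T / n) & (t < (i + 1) * T / n)] = np.sqrt(n / T)
A = rng.normal(size=n)
X = A @ phi                                       # X_t = sum_i A_i phi_i(t)
dY = X * dt + rng.normal(scale=np.sqrt(dt), size=N)
B = phi @ dY                                      # B_i = int phi_i dY  ~  A_i + N(0,1)
print(B - A)                                      # roughly standard-normal deviations
```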
4) Bayesian Estimator:
Let $Q^*$ be the minimum-achieving law of $\mathrm{minimax}(\Lambda)$, so that the optimum causal minimax estimator is the Bayesian estimator assuming the prior $Q^*$, i.e.,

$$\hat{X}_t(Y^t) = E_{Q^*}[X_t \mid Y^t].$$

This conditional expectation is hard to compute in general. However, the sufficient statistics provide a practical implementation of the estimator. Let us first define a projection vector

$$\tilde{\mathbf{Y}}(t) = [\tilde{Y}_1(t), \tilde{Y}_2(t), \cdots, \tilde{Y}_n(t)]^T$$

where $\tilde{Y}_i(t) = \int_0^t \phi_i(s)\, dY_s$. The vector $\tilde{\mathbf{Y}}(t)$ is the projection of $Y^t$ on the basis space. Similarly, define

$$\tilde{\mathbf{W}}(t) = (\tilde{W}_1(t), \tilde{W}_2(t), \cdots, \tilde{W}_n(t))^T, \quad \tilde{\mathbf{X}}(t) = (\tilde{X}_1(t), \tilde{X}_2(t), \cdots, \tilde{X}_n(t))^T$$

where $\tilde{W}_i(t) = \int_0^t \phi_i(s)\, dW_s$ and

$$\tilde{X}_i(t) = \int_0^t \phi_i(s) X_s\, ds = \sum_{j=1}^n a_j \left(\int_0^t \phi_i(s) \phi_j(s)\, ds\right).$$

Further, define the $n \times n$ matrix $\Gamma(t)$ where $[\Gamma(t)]_{i,j} = \int_0^t \phi_i(s) \phi_j(s)\, ds$. Note that $\tilde{\mathbf{W}}(t)$ is Gaussian with zero mean and covariance matrix $\Gamma(t)$, since

$$E[\tilde{W}_i(t) \tilde{W}_j(t)] = E\left[\int_0^t \int_0^t \phi_i(s) \phi_j(u)\, dW_s\, dW_u\right] = \int_0^t \phi_i(s) \phi_j(s)\, ds.$$

From Lemma 9, for fixed $t$, the causal estimation problem is reduced to the following vector estimation problem:

$$\tilde{\mathbf{Y}}(t) = \tilde{\mathbf{X}}(t) + \tilde{\mathbf{W}}(t) = \Gamma(t) A + \tilde{\mathbf{W}}(t)$$

where $A = A^n = (A_1, \cdots, A_n)^T$ and $\tilde{\mathbf{W}}(t) \sim \mathcal{N}(\mathbf{0}, \Gamma(t))$, and the corresponding Bayesian estimator will be

$$\hat{X}_t(Y^t) = E_{Q^*}[X_t \mid Y^t] = \sum_{i=1}^n E_{Q^*}[A_i \mid \tilde{\mathbf{Y}}(t)]\, \phi_i(t).$$

This implies that it is enough to find $E_{Q^*}[A_i \mid \tilde{\mathbf{Y}}]$. If $\Gamma(t)$ is invertible, this problem is simple. If $\Gamma(t)$ is not invertible, we can use the following trick. Suppose the eigenvalue decomposition of the matrix $\Gamma(t)$ is $\Gamma(t) = V(t) \Lambda(t) V(t)^T$, where $V(t) = [v_1(t), \cdots, v_n(t)]$ is an orthonormal matrix and $\Lambda(t) = \mathrm{diag}(\lambda_1(t), \lambda_2(t), \cdots, \lambda_n(t))$ with $0 \le \lambda_1(t) \le \lambda_2(t) \le \cdots \le \lambda_n(t)$. We can rewrite the problem as

$$V(t)^T \tilde{\mathbf{Y}}(t) = \Lambda(t) V(t)^T A + V(t)^T \tilde{\mathbf{W}}(t).$$

Clearly we have $V(t)^T \tilde{\mathbf{W}}(t) \sim \mathcal{N}(\mathbf{0}, \Lambda(t))$. Let $m$ be the number of zero eigenvalues, i.e., $\lambda_1(t) = \cdots = \lambda_m(t) = 0 < \lambda_{m+1}(t)$. As the first $m$ elements can be removed, we can define the effective quantities

$$V_{\mathrm{eff}}(t) = [v_{m+1}(t) \cdots v_n(t)], \quad \Lambda_{\mathrm{eff}}(t) = \mathrm{diag}(\lambda_{m+1}(t), \cdots, \lambda_n(t)).$$

Therefore, the above vector estimation problem can be further simplified to

$$V_{\mathrm{eff}}(t)^T \tilde{\mathbf{Y}}(t) = \Lambda_{\mathrm{eff}}(t) V_{\mathrm{eff}}(t)^T A + V_{\mathrm{eff}}(t)^T \tilde{\mathbf{W}}(t),$$

which is equivalent to

$$\Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \tilde{\mathbf{Y}}(t) = \Lambda_{\mathrm{eff}}(t)^{1/2} V_{\mathrm{eff}}(t)^T A + \Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \tilde{\mathbf{W}}(t). \quad (22)$$

Note that $\Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \tilde{\mathbf{W}}(t) \sim \mathcal{N}(0, I_{n-m})$. Using equation (22), we can easily find $E[A \mid Y^t] = E[A \mid \tilde{\mathbf{Y}}]$.
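For a prior with finitely many atoms (e.g., a discretized approximation of $Q^*$), the whitened model (22) makes the posterior mean a short computation. The following is a minimal sketch under that assumption; the function name and eigenvalue tolerance are our own choices:

```python
import numpy as np

def posterior_mean_A(atoms, weights, Gamma_t, Ytilde_t, eps=1e-12):
    """Posterior mean E[A | Ytilde(t)] under a discrete prior (sketch).

    atoms: (K, n) candidate coefficient vectors a, with prior `weights`;
    Gamma_t: n x n Gram matrix [Gamma(t)]_ij = int_0^t phi_i phi_j ds;
    Ytilde_t: length-n projection vector of Y^t on the basis.
    Uses the whitened model (22): Z = Lam_eff^{1/2} V_eff^T a + N(0, I).
    """
    lam, V = np.linalg.eigh(Gamma_t)
    keep = lam > eps                           # drop zero-eigenvalue directions
    Veff, lameff = V[:, keep], lam[keep]
    Z = (Veff.T @ Ytilde_t) / np.sqrt(lameff)  # Lam_eff^{-1/2} V_eff^T Ytilde
    M = (Veff.T @ atoms.T) * np.sqrt(lameff)[:, None]  # per-atom means of Z
    ll = -0.5 * np.sum((Z[:, None] - M) ** 2, axis=0)  # Gaussian log-likelihoods
    w = np.log(weights) + ll
    w = np.exp(w - w.max())
    w /= w.sum()                               # posterior weights over atoms
    return w @ atoms
```

The causal estimate at time $t$ is then $\hat{X}_t = \sum_i \hat{A}_i \phi_i(t)$, with $\hat{A}$ the returned vector.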
5) Almost Optimal Causal Minimax Estimator:
In Section VI-A4, we showed how to find $E_{Q^*}[A \mid Y^t]$ if we know $Q^*$. However, it is often hard to find a capacity-achieving distribution $Q^*$. Indeed, most problems of finding capacity-achieving distributions are still open, including our sparse signal estimation problem $\sup_{P_\theta \in \mathcal{P}} I(A^n; B^n)$. Instead, we can use an approximated version of the prior, $\tilde{Q}$. One natural choice of $\tilde{Q}$ is the capacity-achieving distribution of $\sup_{P_\theta \in \mathcal{P}_{\mathrm{av}}} I(A^n; B^n)$. This problem was recently considered by Zhang and Guo in [14], where they referred to it as "Gaussian channels with duty cycle and power constraints". They showed that the distribution on $A^n$ that maximizes this mutual information is i.i.d. and discrete. In other words, letting $P_d$ denote the supremum-achieving distribution of

$$\sup_{P_A : E[A^2] \le P,\ P(A \ne 0) \le q} I(A; B)$$

where $B = A + W$ and $W$ is a standard Gaussian noise, we have

$$\sup_{P_\theta \in \mathcal{P}_{\mathrm{av}}} I(A^n; B^n) = n \left[I(A; B)\right]_{P_A = P_d}$$

where $[I(A; B)]_{P_A = P_d}$ denotes the mutual information between $A$ and $B$ when the probability law of $A$ is $P_d$. Then, our choice of $\tilde{Q}$ will be $P_d^n$. The authors of [14] also showed that $P_d$ is discrete with an infinite number of mass points, and that it can be easily approximated with arbitrary precision.

The next question concerns the performance of this alternative filter compared to that of the minimax filter. More specifically, define $L(\Lambda, \tilde{Q})$ by

$$L(\Lambda, \tilde{Q}) \triangleq \sup_{\theta \in \Lambda} R(\theta, \hat{X}^{\mathrm{Bayes}}_{\tilde{Q}}) - \min_{Q \in \mathcal{P}} \sup_{\theta \in \Lambda} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q).$$

The following lemma gives an upper bound on $L(\Lambda, P_d^n)$.

Lemma 10.

$$L(\Lambda, P_d^n) \le \left[I(A^n; B^n)\right]_{P_{A^n} = P_d^n} - \left[I(A^n; B^n)\right]_{P_{A^n} = Q^*}.$$

The proof of Lemma 10 is given in Appendix III. This result implies that if these two mutual informations are close enough, then we are not losing much by using the approximated version of the optimum filter. Since $[I(A^n; B^n)]_{P_{A^n} = P_d^n} = n[I(A; B)]_{P_A = P_d}$, it is enough to argue that $n[I(A; B)]_{P_A = P_d} - [I(A^n; B^n)]_{P_{A^n} = Q^*}$ is small. The following lemma suggests that the above two mutual informations are close for large $n$.

Lemma 11.

$$\lim_{n \to \infty} \frac{1}{n}\left( n\left[I(A; B)\right]_{P_A = P_d} - \sup_{P_{A^n} \in \mathcal{P}} I(A^n; B^n) \right) = 0.$$

The proof of Lemma 11 is given in Appendix IV. Thus, if the number of basis functions is large enough, the performance of the Bayesian filter $\hat{X}^{\mathrm{Bayes}}_{P_d^n}$ is close to the optimum.
Consider direct current (DC) signal estimation over the Poisson channel. The input process is $X_t \equiv X$ for all $0 \le t \le T$, where $X$ is a random variable bounded as $a \le X \le A$ for some positive constants $a$ and $A$. We can define the uncertainty set $\Lambda$ such that $\{P_\theta : \theta \in \Lambda\}$ is the set of all possible probability measures on $X$ under which $a \le X \le A$ holds almost surely. The estimator observes a Poisson process with rate $X_t$, and performance is measured under the natural loss function $l(x, \hat{x}) = x \log(x/\hat{x}) - x + \hat{x}$.

Similar to the previous section, we can define $\Lambda_D$ and prove $\mathrm{minimax}(\Lambda) = \mathrm{minimax}(\Lambda_D)$. It is clear that $\{P_\theta : \theta \in \Lambda\}$ is convex and tight. Since $Y_T$ is a sufficient statistic of $Y^T$ for $X^T$ (which is constant at $X$), we have

$$\mathrm{minimax}(\Lambda) = \mathrm{minimax}(\Lambda_D) = \sup_{w \in \mu(\Lambda_D)} I(X^T; Y^T) = \sup_{P_X \in \mu([a,A])} I(X; Y_T),$$

where the maximization is over all distributions of $X$ supported on $[a, A]$. The corresponding communication problem is the capacity problem of the discrete-time Poisson channel. The discrete-time Poisson channel takes a nonnegative, real-valued $X$ as input, and outputs a Poisson random variable with parameter $TX$. Note that we have the additional input constraint that $a \le X \le A$ almost surely. In this scenario, Shamai [15] showed that the capacity-achieving distribution is discrete with a finite number of mass points. Let $P_s$ be this capacity-achieving distribution. Using Theorem 3, we can conclude that the minimax causal estimator is the conditional expectation of $X$ given $Y^t$ with respect to the distribution $P_s$, i.e.,

$$\hat{X}_t(Y^t) = E_{P_s}[X \mid Y^t].$$

Although an analytic expression for $P_s$ and the capacity of the channel has yet to be found, we can approximate the distribution numerically to arbitrary precision.

VII. EXPERIMENTS
A. Gaussian Channel and Sparse Signal
Consider the setting of Section VI-A. As described in [14], we approximate $P_d$ with a finite number of mass points: initially, find the maximum mutual information for three mass points, then increase the number of mass points until the increment of the maximum mutual information falls below a small fixed threshold. Using the approximated version of $P_d$, we can construct the Bayesian filter that is close to the optimum, as described in Section VI-A5.

In order to assess the performance of the suggested minimax filter, we introduce some alternative estimators. One naive choice of estimator is the maximum likelihood (ML) estimator. For equation (22), the ML estimate of the vector $A$ is given by

$$\hat{A} = \left(\Lambda_{\mathrm{eff}}(t)^{1/2} V_{\mathrm{eff}}(t)^T\right)^\dagger \Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \tilde{\mathbf{Y}}(t)$$

where $X^\dagger$ is the Moore-Penrose pseudo-inverse of the matrix $X$. Since $A$ is sparse, we can further improve the estimator with thresholding. For example, the estimator can perform ML estimation and then keep only the largest $nq$ elements of $\hat{A}$.

Another possible estimator is the minimax estimator that lacks the sparsity information. Since this estimator does not know that the signal is sparse, it assumes the uncertainty set $\mathcal{P}_{\mathrm{LS}} = \{P_\theta : P_\theta(\frac{1}{n} \|A\|^2 \le P) = 1\}$. Using ideas similar to those in the previous section, we can relate this minimax optimization problem to the channel coding problem of the Gaussian channel with an average power constraint. Moreover, we can find the almost-minimax filter, which is Bayesian with an i.i.d. Gaussian prior, i.e., $A \sim \mathcal{N}(\mathbf{0}, P I_n)$. Note that this filter turns out to be linear, which is easy to implement. Using the result of the previous section, we have

$$\Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \tilde{\mathbf{Y}}(t) = \Lambda_{\mathrm{eff}}(t)^{1/2} V_{\mathrm{eff}}(t)^T A + \Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \tilde{\mathbf{W}}(t).$$

Since all components are Gaussian, we can easily compute the conditional expectation. Recall that $A \sim \mathcal{N}(\mathbf{0}, P I_n)$, so $\Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \tilde{\mathbf{Y}}(t) \sim \mathcal{N}(\mathbf{0}, P \Lambda_{\mathrm{eff}}(t) + I_{n-m})$. Therefore,

$$E[A \mid \Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \tilde{\mathbf{Y}}(t)] = P\, V_{\mathrm{eff}}(t) \left(P \Lambda_{\mathrm{eff}}(t) + I_{n-m}\right)^{-1} V_{\mathrm{eff}}(t)^T \tilde{\mathbf{Y}}(t).$$
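A minimal sketch of this linear baseline (our illustration; the function name and eigenvalue tolerance are our own choices):

```python
import numpy as np

def linear_filter_estimate(P, Gamma_t, Ytilde_t, eps=1e-12):
    """E[A | Ytilde(t)] under the non-sparse prior A ~ N(0, P I_n) (sketch).

    Evaluates P V_eff (P Lam_eff + I)^{-1} V_eff^T Ytilde(t), using the
    effective (nonzero-eigenvalue) part of Gamma(t) = V Lam V^T.
    """
    lam, V = np.linalg.eigh(Gamma_t)      # eigendecomposition of the Gram matrix
    keep = lam > eps                      # drop zero-eigenvalue directions
    Veff, lameff = V[:, keep], lam[keep]
    # (P Lam_eff + I)^{-1} is diagonal, so apply it elementwise
    return P * (Veff @ ((Veff.T @ Ytilde_t) / (P * lameff + 1.0)))
```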
We can also consider a genie-aided scheme which is allowed additional information about the source. Suppose the decoder knows the positions of the nonzeros, $i_1, \cdots, i_k$; i.e., the estimator knows that $A_{i_1}, \cdots, A_{i_k}$ are nonzero and all others are zero. Clearly, this scheme should outperform all the other schemes. Let $A_{\mathrm{nonzero}}$ be the $k$-dimensional vector that consists of the nonzero elements of $A$. Since the decoder has this additional information, it is enough to estimate $A_{\mathrm{nonzero}}$. Using an argument similar to that for the minimax estimator that lacks the sparsity information, we can show that the optimum minimax estimator is a Bayesian estimator with prior $\mathcal{N}(\mathbf{0}, \frac{nP}{k} I_k)$. Recall equation (22) and let $U_{\mathrm{eff}}$ be the matrix consisting of the columns of $\Lambda_{\mathrm{eff}}(t)^{1/2} V_{\mathrm{eff}}(t)^T$ that correspond to the nonzero positions of $A$. Then we can rewrite equation (22) as

$$\Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \tilde{\mathbf{Y}}(t) = U_{\mathrm{eff}} A_{\mathrm{nonzero}} + \Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \tilde{\mathbf{W}}(t).$$

It is clear that, in this case, $\Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \tilde{\mathbf{Y}}(t) \sim \mathcal{N}(\mathbf{0}, \frac{nP}{k} U_{\mathrm{eff}} U_{\mathrm{eff}}^T + I_{n-m})$. Therefore,

$$E[A_{\mathrm{nonzero}} \mid \Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \tilde{\mathbf{Y}}(t)] = \frac{nP}{k} U_{\mathrm{eff}}^T \left(\frac{nP}{k} U_{\mathrm{eff}} U_{\mathrm{eff}}^T + I_{n-m}\right)^{-1} \Lambda_{\mathrm{eff}}(t)^{-1/2} V_{\mathrm{eff}}(t)^T \tilde{\mathbf{Y}}(t).$$

We compare the performance of the estimators in Figure 1. We choose $n = 7$, $k = 2$, $P = 10^{0.4}$ (4 dB), and the Haar basis as the orthonormal signal set. We generate the random sparse coefficients by drawing the $k$ nonzero coefficients according to a Gaussian distribution. For each realization of the coefficients, we generate 100 output signals and take the average of the causal loss. Finally, we take the maximum causal mean loss for each estimator among the 100 simulations in order to check the worst-case performance. We can see that the minimax estimator outperforms the maximum likelihood estimators and the minimax estimator without sparsity knowledge. Note that the performance of the minimax estimator is comparable to that of the genie-aided estimator, even though the genie-aided estimator used additional information.

Fig. 1: Plots of cml for the experiment of Section VII-A, for the ML, ML-with-thresholding, minimax-without-sparsity-knowledge, minimax, and genie-aided estimators. We set $T = 10$. The $x$-axis shows time and the $y$-axis represents the worst causal mean loss for each estimator.
Fig. 2: Plots of cml for the experiment of Section VII-B, for the uniform-prior, minimax, and ML estimators. Here we set $T = 10$. The $x$-axis shows time and the $y$-axis represents the worst causal mean loss for each estimator.

B. Poisson Channel and DC Signal
The optimum filter can be approximated using techniques similar to those of Section VII-A. For comparison, we present some other natural estimators. The first is the ML estimator,

$$\hat{X}_{\mathrm{ML}}(Y^t) = \min\left\{\max\left\{a, \frac{Y_t}{t}\right\}, A\right\}.$$

Another possible estimator is a Bayesian estimator which assumes $X$ has the uniform distribution, i.e., $X \sim U[a, A]$. In this case, the optimum Bayesian estimator is

$$\hat{X}_{\mathrm{unif}}(Y^t) = \frac{Y_t + 1}{t} + \frac{e^{-at} a^{Y_t + 1} - e^{-At} A^{Y_t + 1}}{t \int_a^A e^{-xt} x^{Y_t}\, dx}.$$

Figure 2 shows numerical results for a representative case with $A = 2$ and a small positive $a$. We take the average of the causal mean loss over 100 runs for several values of $X$ in $[a, A]$ and find the worst-case error. The minimax estimator outperforms the other estimators, as expected.
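A minimal sketch of these two baseline estimators (our illustration; `scipy.integrate.quad` and the function names are our own choices, and the Bayes formula can overflow for very large counts):

```python
import numpy as np
from scipy.integrate import quad

def xhat_ml(Yt, t, a, A):
    """ML estimate of a DC intensity from the Poisson count Y_t, clipped to [a, A]."""
    return min(max(a, Yt / t), A)

def xhat_uniform(Yt, t, a, A):
    """Bayes estimate E[X | Y_t] under X ~ Uniform[a, A] (sketch).

    The posterior density is proportional to e^{-xt} x^{Y_t} on [a, A];
    this evaluates the closed form from the text, computing the
    normalizing integral numerically.
    """
    denom, _ = quad(lambda x: np.exp(-x * t) * x ** Yt, a, A)
    return (Yt + 1) / t + (np.exp(-a * t) * a ** (Yt + 1)
                           - np.exp(-A * t) * A ** (Yt + 1)) / (t * denom)
```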
VIII. CONCLUSIONS

We considered minimax estimation, focusing on the case of causal estimation when the noise-free object is a continuous-time signal governed by a law from a given uncertainty set. We showed that the minimax filter is a Bayesian filter if the distortion criterion satisfies certain properties. We also characterized the worst-case regret and the minimax estimator in the case of Gaussian and Poisson channels by relating them to the familiar communication problem of maximizing mutual information. We further showed that the capacity-achieving prior coincides with the least favorable input. Using the idea of the strong redundancy/regret-capacity theorem, we showed that our minimax estimator is optimum in a sense much stronger than the one it was designed to optimize for. Using these results, we presented two examples, sparse signal estimation in the Gaussian setting and DC signal estimation in the Poisson setting, for which we have used our results to derive and implement the minimax filter and exhibit its favorable performance in practice.

Our estimation framework can be applied to many other estimation problems. One possible extension is to apply Theorem 5 to stochastic learning problems of the type considered by Bento et al. in [16]. In this setting, the process $Y^T$ is defined by the stochastic equation $dY_t = F(Y^t; A)\, dt + dW_t$, where $A$ is an unknown random parameter and $W^T$ is a standard Brownian motion. We can set $X_t = F(Y^t; A)$ and consider our estimation framework with feedback. We can apply our framework to estimate $X^T$ in the minimax sense and learn $A$. It will be interesting to investigate how an estimator guided by this approach would compare to that in [16].

ACKNOWLEDGMENT
ACKNOWLEDGMENT

The authors would like to thank Ernest Ryu and Kartik Venkat for valuable discussions. The authors would also like to thank the anonymous reviewers and the associate editor for their thorough and constructive feedback, which resulted in an improved manuscript.
APPENDIX I
PROOF OF LEMMA

Proof: Since \Lambda_D \subset \Lambda, we have

\mathrm{minimax}(\Lambda_D) = \min_{Q \in \mathrm{conv}(\mathcal{P}_D)} \sup_{\theta \in \Lambda_D} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q) = \min_{Q \in \mathcal{P}} \sup_{\theta \in \Lambda_D} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q) \le \min_{Q \in \mathcal{P}} \sup_{\theta \in \Lambda} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q) = \mathrm{minimax}(\Lambda).

On the other hand,

\mathrm{minimax}(\Lambda) = \min_{Q \in \mathcal{P}} \sup_{\theta \in \Lambda} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q) \le \min_{Q \in \mathcal{P}} \sup_{\theta \in \Lambda} E_{P_\theta}\left[ \int_0^T l(X_t, E_Q[X_t \mid Y^t]) \, dt \right].

It is clear that

E_{P_\theta}\left[ \int_0^T l(X_t, E_Q[X_t \mid Y^t]) \, dt \right] = \int E\left[ \int_0^T l(X_t, E_Q[X_t \mid Y^t]) \, dt \,\middle|\, A^n = a^n \right] dP_\theta(a^n),

and therefore

\sup_{\theta \in \Lambda} E_{P_\theta}\left[ \int_0^T l(X_t, E_Q[X_t \mid Y^t]) \, dt \right] \le \sup_{a^n \in \mathcal{T}(n)} E\left[ \int_0^T l(X_t, E_Q[X_t \mid Y^t]) \, dt \,\middle|\, A^n = a^n \right] = \sup_{\theta \in \Lambda_D} E_{P_\theta}\left[ \int_0^T l(X_t, E_Q[X_t \mid Y^t]) \, dt \right],

where \mathcal{T}(n) = \left\{ a^n \in \mathbb{R}^n : \frac{1}{n} \sum_{i=1}^n a_i^2 \le P, \ \frac{1}{n} \sum_{i=1}^n 1\{a_i \ne 0\} \le q \right\} is the set of vectors a^n satisfying the constraints. Since the signal is deterministic under each \theta \in \Lambda_D, the optimum filter incurs zero loss and the regret equals the causal loss itself. This implies that

\mathrm{minimax}(\Lambda) \le \min_{Q \in \mathcal{P}} \sup_{\theta \in \Lambda} E_{P_\theta}\left[ \int_0^T l(X_t, E_Q[X_t \mid Y^t]) \, dt \right] \le \min_{Q \in \mathcal{P}} \sup_{\theta \in \Lambda_D} E_{P_\theta}\left[ \int_0^T l(X_t, E_Q[X_t \mid Y^t]) \, dt \right] = \min_{Q \in \mathcal{P}} \sup_{\theta \in \Lambda_D} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q) = \mathrm{minimax}(\Lambda_D).

Finally, these two inequalities imply \mathrm{minimax}(\Lambda) = \mathrm{minimax}(\Lambda_D). Indeed, \sup_{\theta \in \Lambda} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q) = \sup_{\theta \in \Lambda_D} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q) holds for any Q \in \mathcal{P} in general.
APPENDIX II
PROOF OF LEMMA
Proof: At time t, the output process Y^t can be discretized as

\bar{Y} = \left[ Y_{t/N}, \ \left( Y_{2t/N} - Y_{t/N} \right), \ \cdots, \ \left( Y_{Nt/N} - Y_{(N-1)t/N} \right) \right]^T.

This \bar{Y} can be approximated as \bar{Y} \approx \frac{1}{N} \bar{\Phi} A + \bar{W}, where

\bar{\Phi} = \begin{bmatrix} \phi_1(0) & \phi_2(0) & \cdots & \phi_n(0) \\ \phi_1(t/N) & \phi_2(t/N) & \cdots & \phi_n(t/N) \\ \vdots & & & \vdots \\ \phi_1((N-1)t/N) & \phi_2((N-1)t/N) & \cdots & \phi_n((N-1)t/N) \end{bmatrix},

A = \left[ a_1, a_2, \cdots, a_n \right]^T,

\bar{W} = \left[ W_{t/N}, \ \left( W_{2t/N} - W_{t/N} \right), \ \cdots, \ \left( W_{Nt/N} - W_{(N-1)t/N} \right) \right]^T.

It is easy to see that \bar{W} \sim \mathcal{N}\left(0, \frac{1}{N} I_N\right). Furthermore, \int_0^t \phi_i(s) \, dY_s can be approximated as

\sum_{k=1}^N \phi_i\left( \frac{(k-1)t}{N} \right) \left( Y_{kt/N} - Y_{(k-1)t/N} \right).

This approximation is similar in spirit to the Itô integral, and it is enough to prove the lemma based on this approximation. The lemma holds if and only if p(A \mid \bar{Y}) = p(A \mid \bar{\Phi}^T \bar{Y}) for all \bar{Y}, for which it is enough to show that the ratio p(\bar{Y} \mid A) / p(\bar{\Phi}^T \bar{Y} \mid A) is constant (independent of the choice of A) for all \bar{Y}. Throughout the proof, we assume that \bar{\Phi}^T \bar{\Phi} is invertible; it is not difficult to derive a similar result when \bar{\Phi}^T \bar{\Phi} is not invertible. It is easy to check that

\log p(\bar{Y} \mid A) = \log p\left(\bar{W} = \bar{Y} - \tfrac{1}{N} \bar{\Phi} A\right)
= -\log \left( 2\pi / N \right)^{N/2} - \frac{N}{2} \left( \bar{Y} - \tfrac{1}{N} \bar{\Phi} A \right)^T \left( \bar{Y} - \tfrac{1}{N} \bar{\Phi} A \right)
= -\log \left( 2\pi / N \right)^{N/2} - \frac{N}{2} \left( \bar{Y}^T \bar{Y} - \tfrac{2}{N} A^T \bar{\Phi}^T \bar{Y} + \tfrac{1}{N^2} A^T \bar{\Phi}^T \bar{\Phi} A \right).

On the other hand,

\log p(\bar{\Phi}^T \bar{Y} \mid A) = \log p\left(\bar{\Phi}^T \bar{W} = \bar{\Phi}^T \bar{Y} - \tfrac{1}{N} \bar{\Phi}^T \bar{\Phi} A\right)
= -\frac{1}{2} \log \left( (2\pi)^n \det\left( \tfrac{1}{N} \bar{\Phi}^T \bar{\Phi} \right) \right) - \frac{N}{2} \left( \bar{\Phi}^T \bar{Y} - \tfrac{1}{N} \bar{\Phi}^T \bar{\Phi} A \right)^T (\bar{\Phi}^T \bar{\Phi})^{-1} \left( \bar{\Phi}^T \bar{Y} - \tfrac{1}{N} \bar{\Phi}^T \bar{\Phi} A \right)
= -\frac{1}{2} \log \left( (2\pi)^n \det\left( \tfrac{1}{N} \bar{\Phi}^T \bar{\Phi} \right) \right) - \frac{N}{2} \left( \bar{Y}^T \bar{Y} - \tfrac{2}{N} A^T \bar{\Phi}^T \bar{Y} + \tfrac{1}{N^2} A^T \bar{\Phi}^T \bar{\Phi} A \right) - \frac{N}{2} \left( \bar{Y}^T \bar{\Phi} (\bar{\Phi}^T \bar{\Phi})^{-1} \bar{\Phi}^T \bar{Y} - \bar{Y}^T \bar{Y} \right),

where \det(\cdot) denotes the determinant of the matrix. Thus,

\log \frac{p(\bar{Y} \mid A)}{p(\bar{\Phi}^T \bar{Y} \mid A)} = \log \frac{ \left( (2\pi)^n \det\left( \tfrac{1}{N} \bar{\Phi}^T \bar{\Phi} \right) \right)^{1/2} }{ \left( 2\pi / N \right)^{N/2} } + \frac{N}{2} \left( \bar{Y}^T \bar{\Phi} (\bar{\Phi}^T \bar{\Phi})^{-1} \bar{\Phi}^T \bar{Y} - \bar{Y}^T \bar{Y} \right).

Therefore, the ratio p(\bar{Y} \mid A) / p(\bar{\Phi}^T \bar{Y} \mid A) is independent of the choice of A. This completes the proof of the lemma.
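The sufficiency argument above is easy to check numerically. The following sketch (ours; the matrix Φ̄ is a random stand-in for samples of the basis functions) evaluates log p(Ȳ | A) − log p(Φ̄^T Ȳ | A) under the discretized model and verifies that it does not depend on A.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
N, n = 50, 3
Phi = rng.standard_normal((N, n))   # stand-in for the sampled basis matrix
Y = rng.standard_normal(N)          # an arbitrary discretized observation

def log_ratio(A):
    # Ybar = (1/N) Phi A + Wbar with Wbar ~ N(0, (1/N) I_N),
    # hence Phi^T Wbar ~ N(0, (1/N) Phi^T Phi).
    lp_y = multivariate_normal.logpdf(Y, mean=Phi @ A / N, cov=np.eye(N) / N)
    lp_s = multivariate_normal.logpdf(Phi.T @ Y, mean=Phi.T @ Phi @ A / N,
                                      cov=Phi.T @ Phi / N)
    return lp_y - lp_s

for _ in range(3):
    print(log_ratio(rng.standard_normal(n)))  # identical up to round-off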
APPENDIX III
PROOF OF LEMMA
Proof: Define the class of all deterministic laws \mathcal{P}_{D,\mathrm{all}} = \{ P_\theta : P_\theta(a^n) = 1 \text{ for some } a^n \in \mathbb{R}^n \}, with corresponding index set \Lambda_{D,\mathrm{all}}. Define \mu_{D,\mathrm{av}} = \{ w \in \mu(\Lambda_{D,\mathrm{all}}) : \int P_\theta \, w(d\theta) \in \mathcal{P}_{\mathrm{av}} \}, the class of measures on \Lambda_{D,\mathrm{all}} whose mixtures \int P_\theta \, w(d\theta) lie in \mathcal{P}_{\mathrm{av}}. Then,

\min_{Q \in \mathcal{P}_{\mathrm{av}}} \sup_{w \in \mu_{D,\mathrm{av}}} \int D(P_\theta \| Q) \, w(d\theta)
= \min_{Q \in \mathcal{P}_{\mathrm{av}}} \sup_{w \in \mu_{D,\mathrm{av}}} \left[ \int D(P_\theta \| Q_w) \, w(d\theta) + D(Q_w \| Q) \right] \quad (23)
= \sup_{w \in \mu_{D,\mathrm{av}}} \min_{Q \in \mathcal{P}_{\mathrm{av}}} \left[ \int D(P_\theta \| Q_w) \, w(d\theta) + D(Q_w \| Q) \right] \quad (24)
= \sup_{w \in \mu_{D,\mathrm{av}}} \int D(P_\theta \| Q_w) \, w(d\theta)
= \sup_{w \in \mu_{D,\mathrm{av}}} I(\Theta; B^n)
= \sup_{w \in \mu_{D,\mathrm{av}}} I(A^n; B^n)
= \sup_{P_{A^n} \in \mathcal{P}_{\mathrm{av}}} I(A^n; B^n)
= [I(A^n; B^n)]_{P_{A^n} = P_{nd}},

where we used the minimax theorem in (24). Therefore, P_{nd} achieves the minimum of (23), i.e.,

\sup_{w \in \mu_{D,\mathrm{av}}} \int D(P_\theta \| P_{nd}) \, w(d\theta) = [I(A^n; B^n)]_{P_{A^n} = P_{nd}}.

On the other hand, we have

\sup_{\theta \in \Lambda} D(P_\theta \| P_{nd}) = \sup_{\theta \in \Lambda_D} D(P_\theta \| P_{nd}) = \sup_{w \in \mu(\Lambda_D)} \int D(P_\theta \| P_{nd}) \, w(d\theta) \le \sup_{w \in \mu_{D,\mathrm{av}}} \int D(P_\theta \| P_{nd}) \, w(d\theta) = [I(A^n; B^n)]_{P_{A^n} = P_{nd}}.

Therefore, we can bound L(\Lambda, P_{nd}):

L(\Lambda, P_{nd}) \triangleq \sup_{\theta \in \Lambda} R(\theta, \hat{X}^{\mathrm{Bayes}}_{P_{nd}}) - \min_{Q \in \mathcal{P}} \sup_{\theta \in \Lambda} R(\theta, \hat{X}^{\mathrm{Bayes}}_Q) \le [I(A^n; B^n)]_{P_{A^n} = P_{nd}} - [I(A^n; B^n)]_{P_{A^n} = Q^*}.
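For completeness, the decomposition used in (23) is the standard compensation identity, where Q_w = \int P_\theta \, w(d\theta) denotes the mixture induced by w:

\begin{align*}
\int D(P_\theta \| Q) \, w(d\theta)
&= \int E_{P_\theta}\!\left[ \log \frac{dP_\theta}{dQ_w} + \log \frac{dQ_w}{dQ} \right] w(d\theta) \\
&= \int D(P_\theta \| Q_w) \, w(d\theta) + E_{Q_w}\!\left[ \log \frac{dQ_w}{dQ} \right] \\
&= \int D(P_\theta \| Q_w) \, w(d\theta) + D(Q_w \| Q),
\end{align*}

where the middle step holds because \log (dQ_w/dQ) depends only on the observation, so averaging E_{P_\theta}[\cdot] over w gives E_{Q_w}[\cdot].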
APPENDIX IV
PROOF OF LEMMA
Proof: It is trivial that \sup_{w \in \mu(\Lambda)} I(A^n; B^n) \le n [I(A; B)]_{P_A = P_d} for all n. Therefore, it is enough to find an upper bound on n [I(A; B)]_{P_A = P_d} - \sup_{w \in \mu(\Lambda)} I(A^n; B^n) that converges to 0 as n grows. Recall that \sup_{w \in \mu(\Lambda)} I(A^n; B^n) is equal to \sup_{P_\theta \in \mathcal{P}} I(A^n; B^n). Let the probability law P_{d,\epsilon} be a capacity-achieving distribution of the Gaussian channel with power constraint P - \epsilon and duty-cycle constraint q - \epsilon. In other words, P_{d,\epsilon} achieves the supremum in

\sup_{E[A^2] \le P - \epsilon, \ P(A \ne 0) \le q - \epsilon} I(A; B).

REFERENCES

[1] T. E. Duncan, "On the Calculation of Mutual Information," SIAM Journal on Applied Mathematics, vol. 19, no. 1, pp. 215-220, 1970.
[2] T. Weissman, "The Relationship Between Causal and Noncausal Mismatched Estimation in Continuous-Time AWGN Channels," IEEE Trans. Inf. Theory, vol. 56, no. 9, pp. 4256-4273, Sep. 2010.
[3] R. Atar and T. Weissman, "Mutual Information, Relative Entropy, and Estimation in the Poisson Channel," IEEE Trans. Inf. Theory, vol. 58, no. 3, pp. 1302-1318, Mar. 2012.
[4] R. G. Gallager, "Source Coding with Side Information and Universal Coding," Tech. Rep. LIDS-P-937, Lab. Inform. Decision Syst., 1979.
[5] N. Merhav and M. Feder, "A Strong Version of the Redundancy-Capacity Theorem of Universal Coding," IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 714-722, May 1995.
[6] J. Rissanen, "Universal Coding, Information, Prediction, and Estimation," IEEE Trans. Inf. Theory, vol. 30, no. 4, pp. 629-636, Jul. 1984.
[7] A. Banerjee, X. Guo, and H. Wang, "On the Optimality of Conditional Expectation as a Bregman Predictor," IEEE Trans. Inf. Theory, vol. 51, no. 7, pp. 2664-2669, Jul. 2005.
[8] T. Weissman, Y.-H. Kim, and H. Permuter, "Directed Information, Causal Estimation, and Communication in Continuous Time," IEEE Trans. Inf. Theory, vol. 59, no. 3, pp. 1271-1287, Mar. 2013.
[9] E. L. Lehmann and G. Casella, Theory of Point Estimation. New York: Springer, 1998.
[10] G. Casella and W. E. Strawderman, "Estimating a Bounded Normal Mean," The Annals of Statistics, vol. 9, no. 4, pp. 870-878, 1981.
[11] N. Sharma and S. Shamai (Shitz), "Characterization of the Discrete Capacity-Achieving Distribution when the Number of Mass Points Increases," in Proc. Int. Symp. on Information Theory and Its Applications (ISITA), Auckland, New Zealand, Dec. 2008.
[12] N. Sharma and S. Shamai (Shitz), "Transition Points in the Capacity-Achieving Distribution for the Peak-Power Limited AWGN and Free-Space Optical Intensity Channels," Problems of Information Transmission, vol. 46, no. 4, pp. 283-299, 2010.
[13] M. S. Pinsker, "Optimal Filtering of Square Integrable Signals in Gaussian White Noise," Problems of Information Transmission, vol. 16, pp. 120-133, 1980.
[14] L. Zhang, H. Li, and D. Guo, "Capacity of Gaussian Channels with Duty Cycle and Power Constraints," IEEE Trans. Inf. Theory, vol. 60, pp. 1615-1629, Mar. 2014.
[15] S. Shamai, "On the Capacity of a Direct-Detection Photon Channel with Intertransition-Constrained Binary Input," IEEE Trans. Inf. Theory, vol. 37, no. 6, pp. 1540-1550, Nov. 1991.
[16] J. Bento, M. Ibrahimi, and A. Montanari, "Information Theoretic Limits on Learning Stochastic Differential Equations," in Proc. IEEE Int. Symp. Inform. Theory, St. Petersburg, Russia, 2011.

Albert No (S'12) is currently a Ph.D. candidate in the Department of Electrical Engineering at Stanford University, under the supervision of Prof. Tsachy Weissman.
His research interests include relations between information and estimation theory, lossy compression, and joint source-channel coding. He received Bachelor's degrees in both Electrical Engineering and Mathematics from Seoul National University in 2009, and a Master's degree in Electrical Engineering from Stanford University in 2012.