Estimation in Poisson Noise: Properties of the Conditional Mean Estimator
Alex Dytso, Member, IEEE, and H. Vincent Poor, Fellow, IEEE
Abstract—This paper considers estimation of a random variable in Poisson noise with the signal scaling coefficient and the dark current as explicit parameters of the noise model. Specifically, the paper focuses on properties of the conditional mean estimator as a function of the scaling coefficient, the dark current parameter, the distribution of the input random variable, and channel realizations. With respect to the scaling coefficient and the dark current, several identities in terms of derivatives are established. For example, it is shown that the gradient of the conditional mean estimator with respect to the scaling coefficient and the dark current parameter is proportional to the conditional variance. Moreover, a score function is proposed and a Tweedie-like formula for the conditional expectation is recovered. With respect to the distribution, several regularity conditions are shown. For instance, it is shown that the conditional mean estimator uniquely determines the input distribution. Moreover, it is shown that if the conditional expectation is close to a linear function in terms of mean squared error, then the input distribution is approximately gamma in the Lévy distance. Furthermore, sufficient and necessary conditions for linearity are found. Interestingly, it is shown that the conditional mean estimator cannot be linear when the dark current parameter of the Poisson noise is non-zero.
I. INTRODUCTION
Poisson noise models comprise an important set of models with a wide range of applications. This paper considers denoising of a non-negative random variable in Poisson noise, with a specific focus on properties of the conditional mean estimator. Concretely, for an input random variable $X \ge 0$, the Poisson noise channel is specified by the following conditional probability mass function (pmf) of the output random variable $Y$:

$$P_{Y|X}(y|x) = \frac{1}{y!}(ax+\lambda)^y e^{-(ax+\lambda)}, \quad x \ge 0, \; y = 0, 1, \ldots \tag{1}$$

where $a > 0$ is a scaling factor and $\lambda \ge 0$ is a non-negative constant called the dark current parameter. In words, conditioned on a non-negative input $X = x$, the output of the Poisson channel is a non-negative integer-valued random variable $Y$ that is distributed according to (1). In (1) we use the convention that $0^0 = 1$. The random transformation of the input random variable $X$ to an output random variable $Y$ by the channel in (1) will be denoted by

$$Y = \mathcal{P}(aX + \lambda). \tag{2}$$

This work was supported in part by the U.S. National Science Foundation under Grants CCF-0939370, CCF-1513915 and CCF-1908308. This paper was presented in part at the 2019 IEEE Information Theory Workshop [1]. A. Dytso and H. V. Poor are with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA. E-mail: [email protected], [email protected].
The transformation in (2) is depicted in Fig. 1a. It is important to note that the operator $\mathcal{P}(\cdot)$ is not linear, and it is not true that $\mathcal{P}(aX + \lambda) = a\mathcal{P}(X) + \lambda$. Using the language of laser communications, the $aX$ term represents the intensity of a laser beam at the transmitter and $Y$ represents the number of photons that arrive at a receiver equipped with a particle counter (i.e., a photodetector). The dark current parameter $\lambda$ represents the intensity of an additional source of noise or interference, which produces an extra $\mathcal{P}(\lambda)$ photons at the particle counter [2]–[5].

In this work, we are interested in the properties of the conditional mean estimator of the input $X$ given the output of the Poisson channel $Y$, that is,

$$\mathbb{E}[X|Y=y] = \int x \, \mathrm{d}P_{X|Y=y}(x), \quad y = 0, 1, \ldots \tag{3}$$

Specifically, we are interested in how $\mathbb{E}[X|Y]$ behaves as a function of the channel parameters $(a, \lambda)$ and the distribution of $X$. We will also study the conditional mean estimator as a function of channel realizations, that is, $y \mapsto \mathbb{E}[X|Y=y]$. Properties of the conditional expectation are important in view of the fact that it is the unique optimal estimator under a very large family of loss functions, namely Bregman divergences [6]. For example, an important member of the Bregman family is the ubiquitous squared error loss.

Throughout the paper, we will contrast our results with those for the Gaussian noise channel given by

$$Y_G = aV + N, \tag{4}$$

where $N$ is a normal random variable with zero mean and variance $\sigma^2$, $a$ is a fixed scalar, the input $V$ is a real-valued random variable, and $N$ and $V$ are independent. The transformation in (4) is shown in Fig. 1b. (Unlike the Poisson channel, the Gaussian channel can be parameterized by only one parameter, either $a$ or $\sigma$; however, we keep both $a$ and $\sigma$ for comparison.) Since the behavior of the conditional expectation $\mathbb{E}[V|Y_G]$ is well understood, the comparison between these two channels can be very illuminating. Also, somewhat surprisingly, we use the insights developed for the Poisson noise channel to derive a new identity for the Gaussian noise case.

Fig. 1: Noise models considered in this work. (a) The Poisson noise channel with the input $X$, the scaling factor $a$ and the dark current parameter $\lambda$, producing $Y = \mathcal{P}(aX+\lambda)$. (b) The Gaussian noise channel with the input $V$, the scaling factor $a$ and the noise variance $\sigma^2$, producing $Y_G = aV + N$.

The literature on the Poisson distribution is vast, and the interested reader is referred to [7] and [8] for a summary of communication theoretic applications; [9], [10] and [11] for applications of the Poisson model in compressed sensing; and [12] and [13] for applications of Poisson distributions in signal processing and other fields. Our interest in studying properties of the conditional expectation is motivated by the bridge that the Poisson noise model offers between estimation theory and information theory. In [14] and [15] the authors have shown that information measures such as mutual information and relative entropy can be expressed as integrals over the dark current $\lambda$ and/or the scaling parameter $a$ of Bayesian risks that use loss functions natural (e.g., Bregman divergences) for the Poisson channel. The results in [14] and [15] are counterparts of the Gaussian noise identities between information and estimation measures shown in [16] and [17]. In [18], the authors have generalized the results of [14] and [15] to the vector Poisson model and have introduced a notion of matrix Bregman divergence.
For a unifying treatment of such identities, which generalizes these results beyond the Poisson and Gaussian models, the interested reader is referred to [19]. Finally, for the point-wise generalizations, we refer the reader to [20].
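As a concrete illustration of the channel model in (1)-(2), the following minimal Python sketch (our own illustrative code, not from the paper; the mixing weights of the binary input are assumed for illustration) draws samples of $Y = \mathcal{P}(aX+\lambda)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_channel(x, a=1.0, lam=0.0):
    """Draw Y = P(a*x + lam) elementwise for non-negative inputs x, per Eqs. (1)-(2)."""
    return rng.poisson(a * np.asarray(x) + lam)

# Binary input supported on {6, 16}, as used in Fig. 2 (weights here are illustrative).
x = rng.choice([6.0, 16.0], size=100_000, p=[0.5, 0.5])
y = poisson_channel(x, a=1.0, lam=3.0)
print(np.bincount(y)[:10] / y.size)  # empirical output pmf P_Y, cf. Fig. 2
```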
A. Notation

Throughout the paper, deterministic quantities are denoted by lowercase letters and random variables are denoted by uppercase letters. We denote the distribution of a random variable $X$ by $P_X$. The expected value and variance of $X$ are denoted by $\mathbb{E}[X]$ and $\mathbb{V}(X)$, respectively.

The binomial coefficient is denoted by

$$\binom{x}{y} = \frac{\Gamma(x+1)}{\Gamma(y+1)\Gamma(x-y+1)}, \quad x \in \mathbb{R}, \; y \in \mathbb{R}, \tag{5}$$

where $\Gamma(\cdot)$ is the gamma function. In this paper, the gamma distribution has a probability density function (pdf) given by

$$f(x) = \frac{\alpha^{\theta}}{\Gamma(\theta)} x^{\theta-1} e^{-\alpha x}, \quad x \ge 0, \tag{6}$$

where $\theta > 0$ is the shape parameter and $\alpha > 0$ is the rate parameter. Moreover, the mean and variance of this distribution are given by

$$\mathbb{E}[X] = \frac{\theta}{\alpha}, \quad \mathbb{V}(X) = \frac{\theta}{\alpha^2}. \tag{7}$$

We denote the distribution with the pdf in (6) by $\mathrm{Gam}(\alpha, \theta)$.

B. Contributions and Outline
The paper outline and contributions are as follows:

1) In Section II we study properties of the output distribution $P_Y$ and show:
• In Section II-A, Theorem 1 connects the distribution of the output $Y$ to the Laplace transform of the input $X$;
• In Section II-B, Theorem 2 shows that $P_Y$ is a continuous and bijective operator of the input distribution $P_X$; and
• Section II-C studies analytical properties of $P_Y$. In particular, Lemma 1 characterizes derivatives of $P_Y$ with respect to the scaling parameter $a$ and the dark current parameter $\lambda$, and connects these derivatives to the forward and backward difference operators of $P_Y(y)$ in terms of $y$. Moreover, Lemma 2 establishes lower and upper bounds on the tail of $P_Y$.

2) In Section III we study properties of the conditional expectation of the input $X$ given the output $Y$ and show:
• Section III-A, Lemma 3, re-derives the known Turing-Good-Robbins formula for the conditional expectation with explicit emphasis on the parameters $a$ and $\lambda$. Generalizations of this formula to higher conditional moments are discussed. Moreover, in Lemma 4, it is shown that any higher order conditional moment can be completely determined by the first order conditional moment. Furthermore, in Lemma 5 it is shown that the conditional expectation can be represented as a ratio of derivatives of the Laplace transform of $X$. Finally, the Laplace representation is used to compute conditional expectations for a few important distributions;
• Section III-B discusses connections between the conditional expectation, the likelihood function, and discrete versions of the score function. In particular, Theorem 3 proposes a version of the score function where, instead of taking the derivative with respect to the output $y$, the gradient is taken with respect to the channel parameters $(a, \lambda)$, and the proposed score function is shown to be related to the conditional expectation via a Tweedie-like formula. Moreover, this new version of the score function is compared with other definitions of score functions known in the literature. Section III-B is concluded by proposing a version of the Fisher information, and a Brown-like identity is established between the Fisher information and the minimum mean squared error;
• Section III-C studies analytical properties of the conditional expectation. In particular, in Theorem 5, it is shown that the conditional expectation is a non-decreasing function of channel realizations. Theorem 6 finds the derivative of the conditional expectation with respect to the dark current parameter $\lambda$, which is given by the negative conditional variance of $X$ given $Y$. This incidentally shows that the conditional expectation is a monotonically decreasing function of $\lambda$. Moreover, Theorem 6 finds derivatives of higher order conditional moments. Furthermore, Corollary 2 finds the limiting behavior of the conditional expectation as $\lambda \to \infty$. Finally, Lemma 6 concludes Section III-C by presenting an inequality on the conditional expectation that has the flavor of a reverse Jensen's inequality;
• Section III-D studies whether the conditional expectation is uniquely determined by the input distribution of $X$. First, in Theorem 7, as an ancillary result, it is shown that the conditional distribution of the input $X$ given the output $Y$ is completely determined by its moments. Second, in Theorem 8, it is shown that the conditional expectation uniquely determines the distribution of $X$.
The section is concluded by discussing some consequences of the uniqueness result; and
• Section III-E studies upper bounds on the conditional expectation. In particular, in Theorem 9, it is shown that the conditional expectation is either a linear or a sub-linear function of the output realization $y$.

3) In Section IV we study connections between the conditional mean estimator and linear estimators and show:
• Section IV-A, Theorem 11, shows that the conditional expectation is linear if and only if the input $X$ is distributed according to a gamma distribution and the dark current parameter is equal to zero; and
• Section IV-B, Theorem 12, provides a quantitative refinement of the linearity condition in Theorem 11 and shows that if the conditional expectation is close to a linear function in the $L^2$ metric, then the input distribution is close to a gamma distribution in the Lévy metric.

II. PROPERTIES OF THE POISSON TRANSFORMATION
The distribution of the output random variable $Y = \mathcal{P}(aX+\lambda)$ induced by the input $X \sim P_X$ will be denoted by

$$P_Y(y; P_X) = \mathbb{E}\left[P_{Y|X}(y|X)\right], \quad y = 0, 1, \ldots \tag{8}$$

Also, for simplicity, we will use $P_Y(y)$ instead of $P_Y(y; P_X)$ whenever the nature of the underlying input probability distribution is clear or nonessential. Examples of the output distribution induced by a binary input are shown in Fig. 2.

A. Connections to the Laplace Transform
This section shows an alternative representation of $P_Y$ in terms of the Laplace transform of $X$.

Theorem 1.
Denote the Laplace transform of a non-negative random variable $W$ by

$$\mathcal{L}_W(t) = \mathbb{E}\left[e^{-tW}\right], \quad t \ge 0, \tag{9}$$

and its $n$-th derivative by $\mathcal{L}^{(n)}_W(t)$. Then, for every $a > 0$, $\lambda \ge 0$ and $y = 0, 1, \ldots$

$$P_Y(y; P_X) = \frac{(-1)^y e^{-\lambda}}{y!} \sum_{i=0}^{y} \binom{y}{i} a^{y-i} (-\lambda)^i \mathcal{L}^{(y-i)}_X(a), \tag{10}$$

and

$$\mathcal{L}_{aX+\lambda}(t) = \sum_{y=0}^{\infty} (-1)^y P_Y(y; P_X)(t-1)^y, \quad \forall \, |t-1| < 1. \tag{11}$$

Fig. 2: Examples of the output distribution $P_Y$ for a binary input supported on $\{6, 16\}$ with $a = 1$ and dark current values $\lambda \in \{0, 3\}$. The circles are the actual values of $P_Y$.

Proof:
See Appendix A.

For $\lambda = 0$ the expression in (10) takes a simple form given by

$$P_Y(y; P_X) = \frac{(-a)^y}{y!} \mathcal{L}^{(y)}_X(a), \quad y = 0, 1, \ldots \tag{12}$$

To show the utility of the expression in (10), consider an input $X \sim \mathrm{Gam}(\alpha, \theta)$ with the Laplace transform

$$\mathcal{L}_X(t) = \frac{1}{\left(1 + \frac{t}{\alpha}\right)^{\theta}}, \quad t > -\alpha, \tag{13}$$

and the channel output given by $Y = \mathcal{P}(aX)$. Using (10) we arrive at the following output distribution:

$$P_Y(y) = \frac{(-a)^y \alpha^{\theta}}{(\alpha+a)^{\theta+y}} \binom{-\theta}{y} \tag{14}$$
$$= \frac{a^y \alpha^{\theta}}{(\alpha+a)^{\theta+y}} \binom{\theta+y-1}{y}, \quad y = 0, 1, \ldots \tag{15}$$

The distribution in (15) is known as the negative binomial distribution with failure parameter $\theta$ and success probability $\frac{a}{\alpha+a}$.
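The negative binomial form in (15) is easy to check by simulation. A minimal sketch follows, with assumed example parameters; note that SciPy's parameterization takes the complementary probability.

```python
import numpy as np
from scipy import stats

# Monte Carlo check of Eq. (15): for X ~ Gam(alpha, theta) and Y = P(aX),
# Y is negative binomial with failure parameter theta and success prob a/(alpha+a).
rng = np.random.default_rng(1)
alpha, theta, a = 2.0, 3.0, 1.5  # assumed example parameters

x = rng.gamma(shape=theta, scale=1.0 / alpha, size=500_000)
y = rng.poisson(a * x)

for k in range(5):
    empirical = np.mean(y == k)
    # scipy's nbinom uses p = alpha/(alpha+a), the complement of the paper's success prob
    exact = stats.nbinom.pmf(k, theta, alpha / (alpha + a))
    print(f"P_Y({k}): empirical {empirical:.4f}, negative binomial {exact:.4f}")
```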
Remark 1. There exists a result similar to Theorem 1 for the Gaussian noise channel in (4). Specifically, the probability density function (pdf) of the output $Y_G$ is a Weierstrass transform of the input $V$ [21].

B. Properties of the Output Distribution as a Function of the Input Distribution
In this section, we are interested in how $P_Y$ behaves as a function of $P_X$. We will need the following definition.

Definition 1.
A sequence of probability measures $\{P_{X_n}\}_{n \in \mathbb{N}}$ on $\mathbb{R}$ is said to converge weakly to the probability measure $P_X$ (i.e., $X_n \to X$ in distribution) if and only if

$$\lim_{n \to \infty} \mathbb{E}[\phi(X_n)] = \mathbb{E}[\phi(X)], \tag{16}$$

for all bounded and continuous functions $\phi$.

The next result presents two important properties of the output distribution.

Theorem 2.
Let $Y = \mathcal{P}(aX+\lambda)$. Then, for all $a > 0$ and $\lambda \ge 0$, $P_Y$ satisfies the following properties:
• Let $P_{X_n} \to P_X$ weakly. Then, $P_{Y_n}(\cdot\,; P_{X_n}) \to P_Y(\cdot\,; P_X)$ weakly. In other words, the mapping $X \to \mathcal{P}(aX+\lambda)$ is continuous in distribution; and
• $P_Y$ is a bijective operator of $P_X$, that is,

$$P_Y(\cdot\,; P_{X_1}) = P_Y(\cdot\,; P_{X_2}) \iff P_{X_1} = P_{X_2}. \tag{17}$$

Proof:
To establish continuity, observe that for every $k$ the Poisson probability $P_{Y|X}(k|x)$ is a bounded and continuous function of $x$ and, therefore, by the definition of convergence in distribution, for every non-negative integer $k$

$$\lim_{n \to \infty} P_{Y_n}(k; P_{X_n}) = \lim_{n \to \infty} \mathbb{E}[P_{Y|X}(k|X_n)] = \mathbb{E}[P_{Y|X}(k|X)] = P_Y(k; P_X). \tag{18}$$

Thus, if $P_{X_n} \to P_X$ weakly, then $P_{Y_n} \to P_Y$ weakly. (Note that in (18) we have established point-wise convergence of the pmf. However, for integer-valued random variables, and our output is integer-valued, point-wise convergence of the pmf is equivalent to weak convergence [22, Ch. 8.8, Exercise 4].) This concludes the proof of the continuity of the mapping $P_X \to P_Y(\cdot\,; P_X)$.

Next we show that the output distribution is a bijective operator of $P_X$. The implication

$$P_{X_1} = P_{X_2} \implies P_Y(\cdot\,; P_{X_1}) = P_Y(\cdot\,; P_{X_2}) \tag{19}$$

is immediate. Therefore, it remains to show that $P_Y(\cdot\,; P_X)$ is an injective operator. The injectivity follows from the fact that, in view of (11), the output pmf $P_Y(\cdot\,; P_X)$ completely determines the Laplace transform of $P_X$ on the interval $t \in (0, 2)$, and since the Laplace transform of $P_X$ is unique on any given interval [23, Sec. 30], we have that $P_Y(\cdot\,; P_X)$ completely determines $P_X$. This concludes the proof of the theorem.

C. Analytical Properties of the Output Distribution
Another useful identity of the Poisson transformation relates the derivatives of $P_Y$ with respect to the channel parameters to the forward and backward differences of $yP_Y(y)$ and $P_Y(y)$.

Lemma 1.
Let $Y = \mathcal{P}(aX+\lambda)$. Then, for every $a > 0$, $\lambda \ge 0$ and $y = 0, 1, \ldots$

$$\frac{\mathrm{d}}{\mathrm{d}a} P_{Y|X}(y|x) = x \frac{\mathrm{d}}{\mathrm{d}\lambda} P_{Y|X}(y|x) \tag{20}$$
$$= x \left( P_{Y|X}(y-1|x) - P_{Y|X}(y|x) \right), \tag{21}$$

and

$$a \frac{\mathrm{d}}{\mathrm{d}a} P_Y(y) + \lambda \frac{\mathrm{d}}{\mathrm{d}\lambda} P_Y(y) = y P_Y(y) - (y+1) P_Y(y+1), \tag{22}$$
$$\frac{\mathrm{d}}{\mathrm{d}\lambda} P_Y(y) = P_Y(y-1) - P_Y(y), \tag{23}$$

where $P_{Y|X}(-1|x) = P_Y(-1) = 0$.

Proof:
See Appendix B.

Special cases of Lemma 1 have been shown in the past; see for example [24, Lemma 8.5]. Lemma 1 will be used in later sections to study properties of the conditional expectation. For example, to study monotonicity properties of the mapping $y \to \mathbb{E}[X|Y=y]$, it will be convenient to translate the differences with respect to discrete points in $y$ into derivatives with respect to the continuous parameter $\lambda$.

Remark 2.
The identity in (22) can be expressed as an equality between the inner product of the vector $\mathbf{v} = [a, \lambda]$ with the gradient of $P_Y$ with respect to $\mathbf{v}$, and the forward difference operator, as follows:

$$\mathbf{v} \cdot \nabla_{\mathbf{v}} P_Y(y) = -D^{\mathrm{fw}}_1[y P_Y(y)], \tag{24}$$

where '$\cdot$' denotes the inner product, and the forward difference operator of a function $f$ is given by

$$D^{\mathrm{fw}}_k[f(y)] = f(y+k) - f(y). \tag{25}$$

Similarly, the expression in (23) can be expressed in terms of the backward difference operator as

$$\frac{\mathrm{d}}{\mathrm{d}\lambda} P_Y(y) = -D^{\mathrm{bw}}_1[P_Y(y)], \tag{26}$$

where the backward difference operator of a function $f$ is given by

$$D^{\mathrm{bw}}_k[f(y)] = f(y) - f(y-k). \tag{27}$$

Remark 3.
Consider the Gaussian noise channel in (4). Let $f_{Y_G|V}$ be the conditional pdf for this channel. It is not difficult to check that for this channel we have the following identity between derivatives with respect to the realization of the output $y$ and the scaling parameter $a$:

$$\frac{\mathrm{d}}{\mathrm{d}a} f_{Y_G|V}(y|v) = -v \frac{\mathrm{d}}{\mathrm{d}y} f_{Y_G|V}(y|v). \tag{28}$$

Continuing with our parallelism between derivatives and difference operators, the identity in (21) can be written as follows:

$$\frac{\mathrm{d}}{\mathrm{d}a} P_{Y|X}(y|x) = -x\, D^{\mathrm{bw}}_1[P_{Y|X}(y|x)]. \tag{29}$$
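The derivative identities (21) and (28) are easy to sanity-check numerically. The following minimal sketch compares central finite differences in $a$ with the claimed right-hand sides; all parameter values are assumed for illustration.

```python
import numpy as np
from scipy import stats

a, lam, x, y, sigma, v = 1.3, 0.7, 2.0, 4, 1.1, 0.8  # assumed example values
h = 1e-6

# Poisson identity (21): d/da P_{Y|X}(y|x) = x (P_{Y|X}(y-1|x) - P_{Y|X}(y|x))
lhs = (stats.poisson.pmf(y, (a + h)*x + lam) - stats.poisson.pmf(y, (a - h)*x + lam)) / (2*h)
rhs = x * (stats.poisson.pmf(y - 1, a*x + lam) - stats.poisson.pmf(y, a*x + lam))
print(lhs, rhs)

# Gaussian identity (28): d/da f(y|v) = -v d/dy f(y|v), with f = N(a v, sigma^2)
yg = 0.6
lhs = (stats.norm.pdf(yg, (a + h)*v, sigma) - stats.norm.pdf(yg, (a - h)*v, sigma)) / (2*h)
rhs = -v * (stats.norm.pdf(yg + h, a*v, sigma) - stats.norm.pdf(yg - h, a*v, sigma)) / (2*h)
print(lhs, rhs)
```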
Another useful result concerns the following lower and upper bounds on the tail of $P_Y$.

Lemma 2. Let $Y = \mathcal{P}(aX+\lambda)$. Then, for every $a > 0$, $\lambda \ge 0$ and $y = 0, 1, \ldots$

$$\frac{1}{y!}\, e^{y\,\mathbb{E}[\log(aX+\lambda)] - (a\mathbb{E}[X]+\lambda)} \le P_Y(y) \le \frac{1}{y!}\, y^y e^{-y} \le \frac{1}{\sqrt{2\pi y}}. \tag{30}$$

Proof:
See Appendix C.

Properties of the output distribution $P_Y$ derived in this section will be useful in the next section, where we study properties of the conditional expectation.

III. PROPERTIES OF THE CONDITIONAL EXPECTATION
In this section, we study the properties of the conditional expectation. Specifically, we focus on how $\mathbb{E}[X|Y=y]$ behaves as a function of the channel parameters $(a, \lambda)$, the channel realization $y$, and the distribution of $X$. Examples of conditional expectations for the binary distribution used in Fig. 2 are shown in Fig. 3.

Fig. 3: Examples of conditional expectations for a binary input supported on $\{6, 16\}$ with $a = 1$ and dark current values $\lambda \in \{0, 3\}$. The circles are the actual values of $\mathbb{E}[X|Y]$.

Next, we present a formula that will form a basis for much of our analysis of the conditional mean estimator.

A. The Turing-Good-Robbins Formula
An interesting property of the conditional expectation for a Poisson noise channel is its dependence only on the marginal distribution of $Y$. This identity was first demonstrated by Good in [25] and is credited to Alan Turing. Moreover, it was also independently derived by Robbins in [26] in the context of empirical Bayes estimation. For completeness, we derive it next.

Lemma 3.
Let $Y = \mathcal{P}(aX+\lambda)$. Then, for every $a > 0$ and $\lambda \ge 0$,

$$\mathbb{E}[X|Y=y] = \frac{1}{a} \frac{(y+1) P_Y(y+1)}{P_Y(y)} - \frac{\lambda}{a}, \quad y = 0, 1, \ldots \tag{31}$$

Proof:
The proof follows via the following sequence of manipulations:

$$\mathbb{E}[X|Y=y] = \frac{\mathbb{E}\left[X P_{Y|X}(y|X)\right]}{P_Y(y)} \tag{32}$$
$$= \frac{\mathbb{E}\left[\frac{X}{y!}(aX+\lambda)^y e^{-(aX+\lambda)}\right]}{P_Y(y)} \tag{33}$$
$$= \frac{1}{a} \frac{\mathbb{E}\left[\frac{1}{y!}(aX+\lambda)^{y+1} e^{-(aX+\lambda)}\right] - \lambda P_Y(y)}{P_Y(y)} \tag{34}$$
$$= \frac{1}{a} \frac{(y+1)\, \mathbb{E}\left[\frac{1}{(y+1)!}(aX+\lambda)^{y+1} e^{-(aX+\lambda)}\right] - \lambda P_Y(y)}{P_Y(y)} \tag{35}$$
$$= \frac{1}{a} \frac{(y+1) P_Y(y+1) - \lambda P_Y(y)}{P_Y(y)}. \tag{36}$$

This completes the proof of (31).

The key advantage of the expression in (31) is that it depends only on the marginal $P_Y$. This, in turn, avoids computation of the often complicated conditional distribution $P_{X|Y}$. The Turing-Good-Robbins (TGR) formula played an important role in the development of empirical Bayes estimation; the interested reader is referred to [27, Chapter 6.1] for a historical account and the impact of the TGR formula. Vector versions of Lemma 3 were shown in [19, Lemma 3] and [18, Lemma 3].
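A minimal numeric sketch of the TGR formula (31) follows, for the binary input of Figs. 2 and 3 (the mixing weights are illustrative placeholders; only the output pmf is used by the estimator).

```python
import numpy as np
from scipy import stats

def tgr_conditional_mean(p_y, a, lam, y):
    """E[X | Y = y] via the TGR formula (31): only the output pmf p_y is needed."""
    return (y + 1) * p_y(y + 1) / (a * p_y(y)) - lam / a

support, probs = np.array([6.0, 16.0]), np.array([0.5, 0.5])  # illustrative weights
a, lam = 1.0, 3.0

def p_y(y):
    # P_Y(y) = E[ P_{Y|X}(y|X) ], Eq. (8)
    return np.sum(probs * stats.poisson.pmf(y, a * support + lam))

for y in range(5):
    print(y, tgr_conditional_mean(p_y, a, lam, y))
```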
Remark 4.
It is not difficult to see that the expression in (31) can be generalized to higher order moments as follows: for every non-negative integers $k$ and $y$,

$$\mathbb{E}\left[(aX+\lambda)^k \,\big|\, Y=y\right] = \frac{(y+k)!}{y!} \frac{P_Y(y+k)}{P_Y(y)}. \tag{37}$$

This, for example, leads to the following interesting expression for the conditional variance of $U = aX+\lambda$:

$$\mathbb{V}(U|Y=y) = \mathbb{E}[U|Y=y] \left( \mathbb{E}[U|Y=y+1] - \mathbb{E}[U|Y=y] \right). \tag{38}$$

Combining the TGR formula with the expression in (37), we arrive at the following lemma that relates all the conditional moments to the first conditional moment.

Lemma 4.
Let $Y = \mathcal{P}(aX+\lambda)$. Then, for every positive integer $k$ and every non-negative integer $y$,

$$\mathbb{E}\left[(aX+\lambda)^k \,\big|\, Y=y\right] = \prod_{i=0}^{k-1} \mathbb{E}[aX+\lambda \,|\, Y=y+i]. \tag{39}$$

Proof:
The proof follows by combining (31) and (37).

In probability theory, lower order moments are typically controlled by higher order moments (e.g., via Jensen's inequality). The identity in (39) is very strong and implies that all of the higher conditional moments can be recovered from the first conditional moment. This observation will be used in Section III-D to show that the conditional expectation is unique for every input distribution.
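The product identity (39) can be checked directly against the moment formula (37). A minimal sketch follows, with assumed example parameters and a gamma input.

```python
import numpy as np
from math import factorial
from scipy import stats, integrate

a, lam, alpha, theta = 1.0, 0.5, 2.0, 3.0  # assumed example parameters

def p_y(y):
    # P_Y(y) = ∫ Poisson(y; a x + lam) Gam(x; alpha, theta) dx, Eq. (8)
    f = lambda x: stats.poisson.pmf(y, a*x + lam) * stats.gamma.pdf(x, theta, scale=1/alpha)
    return integrate.quad(f, 0, np.inf)[0]

def cond_moment(k, y):
    # E[(aX + lam)^k | Y = y], Eq. (37)
    return factorial(y + k) / factorial(y) * p_y(y + k) / p_y(y)

y, k = 2, 3
lhs = cond_moment(k, y)
rhs = np.prod([cond_moment(1, y + i) for i in range(k)])  # Eq. (39)
print(lhs, rhs)  # the two should agree up to quadrature error
```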
Remark 5.
To the best of our knowledge, no Gaussian counterpart of the identity in (39) has been presented in the past. In fact, inspired by the result in Lemma 4, the authors of this paper have recently derived in [28] the following Gaussian analog of the identity in (39):

$$\mathbb{E}[(aV)^k \,|\, Y_G = y] = \sigma^{2k}\, e^{-\frac{1}{\sigma^2}\int_0^y \mathbb{E}[aV|Y_G=t]\,\mathrm{d}t}\, \frac{\mathrm{d}^k}{\mathrm{d}y^k}\, e^{\frac{1}{\sigma^2}\int_0^y \mathbb{E}[aV|Y_G=t]\,\mathrm{d}t}, \tag{40}$$

where $V$ and $Y_G$ are related through (4). Note that (40) is in the same spirit as (39) in the sense that all the higher conditional moments can be characterized by the first conditional moment.

It is also interesting to combine the TGR formula with the Laplace transform representation of the output pmf in (10), which leads to the following representation of the conditional expectation.

Lemma 5.
Let $Y = \mathcal{P}(aX+\lambda)$. Then, for every $a > 0$ and $\lambda \ge 0$,

$$\mathbb{E}[X|Y=y] = -\frac{\sum_{i=0}^{y+1} \binom{y+1}{i} a^{y+1-i} (-\lambda)^i \mathcal{L}^{(y+1-i)}_X(a)}{a \sum_{i=0}^{y} \binom{y}{i} a^{y-i} (-\lambda)^i \mathcal{L}^{(y-i)}_X(a)} - \frac{\lambda}{a}, \quad y = 0, 1, \ldots \tag{41}$$

In particular, for $\lambda = 0$,

$$\mathbb{E}[X|Y=y] = -\frac{\mathcal{L}^{(y+1)}_X(a)}{\mathcal{L}^{(y)}_X(a)}, \quad y = 0, 1, \ldots \tag{42}$$

The expression in (42) enables computation of the conditional expectation whenever the Laplace transform of the input random variable is known and can be easily differentiated. Using (42), Table I collects expressions for the conditional expectations of a few important distributions.
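The Laplace representation (42) lends itself to symbolic computation. The sketch below evaluates (42) with SymPy for an assumed $\mathrm{Gam}(\alpha, \theta)$ input; it should reproduce the linear Table I entry $(y+\theta)/(\alpha+a)$.

```python
import sympy as sp

t, a, alpha, theta = sp.symbols('t a alpha theta', positive=True)

# Laplace transform of X ~ Gam(alpha, theta), Eq. (13)
L = 1 / (1 + t / alpha) ** theta

def cond_mean(y_val, L_X):
    # Eq. (42): E[X | Y = y] = -L^{(y+1)}(a) / L^{(y)}(a), valid for lambda = 0
    num = sp.diff(L_X, t, y_val + 1).subs(t, a)
    den = sp.diff(L_X, t, y_val).subs(t, a)
    return sp.simplify(-num / den)

for k in range(4):
    print(k, cond_mean(k, L))  # expect (k + theta) / (alpha + a), i.e., linear in k
```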
B. On Tweedie's Formula, the Score Function, the Fisher Information and Brown's Identity

Consider the Gaussian channel given in (4), for which the classical Tweedie's formula for the conditional expectation is given by

$$a\,\mathbb{E}[V|Y_G=y] = y + \sigma^2 \frac{\mathrm{d}}{\mathrm{d}y} \log(f_{Y_G}(y)) \tag{43}$$
$$= y + \sigma^2 \rho_{Y_G}(y), \tag{44}$$

where the quantity

$$\rho_{Y_G}(y) = \frac{\mathrm{d}}{\mathrm{d}y} \log f_{Y_G}(y) = \frac{f'_{Y_G}(y)}{f_{Y_G}(y)} \tag{45}$$

is known as the score function, and the logarithm of the pdf is known as the log-likelihood function. The identity in (44) was derived by Robbins in [26], where he credits Maurice Tweedie for the derivation. The version of (44) for the multivariate normal case was derived by Esposito in [29]. The identity in (44) has the advantage that it depends only on the marginal distribution of the output; see [30] for an application example. In this section, we propose an analog of Tweedie's formula for the Poisson case.

Note that in the Poisson case it is not possible to obtain a logarithmic derivative form similar to the one in (44), in view of the fact that the output space is discrete. Nonetheless, it is interesting to observe the following forward difference property of the logarithm of $P_Y$, which is just a restatement of the TGR formula in (31).

Corollary 1.
Let $Y = \mathcal{P}(aX+\lambda)$. Then,

$$D^{\mathrm{fw}}_1[\log P_Y(y)] = \log\left(\mathbb{E}[aX+\lambda \,|\, Y=y]\right) - \log(y+1). \tag{46}$$

The identity in (46) has been previously demonstrated in [31]. Although for the Poisson case there is no Tweedie's formula that differentiates with respect to $y$, using the result in Lemma 1, which translates the differences with respect to discrete points into the gradient with respect to the continuous parameters $a$ and $\lambda$, we propose the following version of Tweedie's formula for the Poisson noise channel. The proposed formula does have a logarithmic derivative form.

Theorem 3.
For $Y = \mathcal{P}(aX+\lambda)$, let $\mathbf{v} = [a, \lambda]$ and define a Poisson score function as

$$\rho^{\mathrm{Po}}_Y(y) = -\mathbf{v} \cdot \nabla_{\mathbf{v}} \log P_Y(y), \tag{47}$$

and its scaling and dark current components, respectively, by

$$\rho^{\mathrm{Po,scale}}_Y(y) = \frac{\mathrm{d}}{\mathrm{d}a} \log(P_Y(y)) = \frac{\frac{\mathrm{d}}{\mathrm{d}a} P_Y(y)}{P_Y(y)}, \tag{48}$$
$$\rho^{\mathrm{Po,d.c.}}_Y(y) = \frac{\mathrm{d}}{\mathrm{d}\lambda} \log(P_Y(y)) = \frac{\frac{\mathrm{d}}{\mathrm{d}\lambda} P_Y(y)}{P_Y(y)}, \tag{49}$$

and define discrete forward and backward score functions, respectively, by

$$\rho^{\mathrm{Po,fw.d}}_Y(y) = \frac{D^{\mathrm{fw}}_1[y P_Y(y)]}{P_Y(y)}, \tag{50}$$
$$\rho^{\mathrm{Po,bw.d}}_Y(y) = \frac{D^{\mathrm{bw}}_1[P_Y(y)]}{P_Y(y)}. \tag{51}$$

Then,

$$\mathbb{E}[aX+\lambda \,|\, Y=y] = y + \rho^{\mathrm{Po}}_Y(y), \quad y = 0, 1, \ldots \tag{52}$$

Moreover,

$$\rho^{\mathrm{Po}}_Y(y) = -\mathbf{v} \cdot \left[\rho^{\mathrm{Po,scale}}_Y(y),\ \rho^{\mathrm{Po,d.c.}}_Y(y)\right] \tag{53}$$
$$= -\mathbf{v} \cdot \left[\rho^{\mathrm{Po,scale}}_Y(y),\ -\rho^{\mathrm{Po,bw.d}}_Y(y)\right] \tag{54}$$
$$= \rho^{\mathrm{Po,fw.d}}_Y(y). \tag{55}$$

Proof:
We start by rewriting the conditional expectation using the TGR formula (31):

$$\mathbb{E}[aX+\lambda \,|\, Y=y] = y + \frac{(y+1)P_Y(y+1) - yP_Y(y)}{P_Y(y)}, \tag{56}$$

and work with the second term in (56). To show (55), observe that by using the definition of the forward difference we have

$$\frac{(y+1)P_Y(y+1) - yP_Y(y)}{P_Y(y)} = \rho^{\mathrm{Po,fw.d}}_Y(y). \tag{57}$$

To show (52) and (53), use the identity in (22):

$$\frac{(y+1)P_Y(y+1) - yP_Y(y)}{P_Y(y)} = -\frac{a\frac{\mathrm{d}}{\mathrm{d}a} P_Y(y) + \lambda\frac{\mathrm{d}}{\mathrm{d}\lambda} P_Y(y)}{P_Y(y)} \tag{58}$$
$$= -\mathbf{v} \cdot \left[\rho^{\mathrm{Po,scale}}_Y(y),\ \rho^{\mathrm{Po,d.c.}}_Y(y)\right] \tag{59}$$
$$= -\mathbf{v} \cdot \nabla_{\mathbf{v}} \log P_Y(y) \tag{60}$$
$$= \rho^{\mathrm{Po}}_Y(y). \tag{61}$$

Finally, to show (54), use the identity in (23) to see that

$$\rho^{\mathrm{Po,d.c.}}_Y(y) = \frac{P_Y(y-1) - P_Y(y)}{P_Y(y)} = -\frac{D^{\mathrm{bw}}_1[P_Y(y)]}{P_Y(y)}. \tag{62}$$

This concludes the proof.

TABLE I: Examples of Conditional Expectations for $\lambda = 0$.
• Gamma (defined in (6)): $\mathcal{L}_X(t) = \frac{1}{(1+t/\alpha)^{\theta}}$, $t > -\alpha$; $\mathcal{L}^{(n)}_X(t) = \frac{(-1)^n}{\alpha^n (1+t/\alpha)^{n+\theta}} \frac{\Gamma(\theta+n)}{\Gamma(\theta)}$, $t > -\alpha$; $\mathbb{E}[X|Y=y] = \frac{y}{\alpha+a} + \frac{\theta}{\alpha+a}$.
• Inverse-gamma, $f_X(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \frac{1}{x^{\alpha+1}} e^{-\beta/x}$, $x > 0$, $\alpha > 0$, $\beta > 0$: $\mathcal{L}_X(t) = \frac{2(\beta t)^{\alpha/2}}{\Gamma(\alpha)} K_{\alpha}(2\sqrt{\beta t})$, $t > 0$, where $K_{\alpha}(\cdot)$ is a modified Bessel function of the second kind; $\mathcal{L}^{(n)}_X(t) = \frac{(-1)^n 2\beta^n}{\Gamma(\alpha)} (\beta t)^{\frac{\alpha-n}{2}} K_{\alpha-n}(2\sqrt{\beta t})$, $t > 0$; $\mathbb{E}[X|Y=y] = \sqrt{\frac{\beta}{a}}\, \frac{K_{\alpha-(y+1)}(2\sqrt{\beta a})}{K_{\alpha-y}(2\sqrt{\beta a})}$.
• Uniform, $f_X(x) = \frac{1}{b-c}$, $0 \le c \le x \le b$: $\mathcal{L}_X(t) = \frac{e^{-tc} - e^{-tb}}{t(b-c)}$; $\mathcal{L}^{(n)}_X(t) = \frac{(-1)^n}{b-c} \frac{\Gamma(n+1, ct) - \Gamma(n+1, bt)}{t^{n+1}}$, where $\Gamma(\cdot, \cdot)$ is the upper incomplete gamma function; $\mathbb{E}[X|Y=y] = \frac{\Gamma(y+2, ca) - \Gamma(y+2, ba)}{a\left(\Gamma(y+1, ca) - \Gamma(y+1, ba)\right)}$.
• Bernoulli, $P[X=1] = p = 1 - P[X=0] = 1 - q$: $\mathcal{L}_X(t) = q + p e^{-t}$; $\mathcal{L}^{(n)}_X(t) = (-1)^n p e^{-t}$; $\mathbb{E}[X|Y=y] = \frac{p e^{-a}}{q + p e^{-a}}$ for $y = 0$, and $1$ for $y \ge 1$.
• Poisson, $P[X=k] = \frac{\gamma^k e^{-\gamma}}{k!}$, $k = 0, 1, \ldots$: $\mathcal{L}_X(t) = e^{\gamma(e^{-t}-1)}$; $\mathcal{L}^{(n)}_X(t) = (-1)^n e^{\gamma(e^{-t}-1)} T_n(\gamma e^{-t})$, where $T_n(\cdot)$ is the Touchard polynomial; $\mathbb{E}[X|Y=y] = \frac{T_{y+1}(\gamma e^{-a})}{T_y(\gamma e^{-a})}$.
• Exponential family, $\frac{\mathrm{d}P_X}{\mathrm{d}\mu}(x; \theta) = e^{\theta x - \psi(\theta)}$, $x \ge 0$, $\theta \in \mathcal{N}$, where $\mu$ is the base measure, $\theta$ the natural parameter, $\mathcal{N} \subseteq \mathbb{R}$ the natural parameter space, and $\psi$ the log-partition function: $\mathcal{L}_X(t) = e^{\psi(\theta-t) - \psi(\theta)}$, $\theta - t \in \mathcal{N}$; $\mathcal{L}^{(n)}_X(t) = e^{-\psi(\theta)} \frac{\mathrm{d}^n}{\mathrm{d}t^n} e^{\psi(\theta-t)}$; $\mathbb{E}[X|Y=y] = -\frac{\frac{\mathrm{d}^{y+1}}{\mathrm{d}t^{y+1}} e^{\psi(\theta-t)}}{\frac{\mathrm{d}^{y}}{\mathrm{d}t^{y}} e^{\psi(\theta-t)}}\Big|_{t=a}$.

Unlike for continuous random variables, there are several definitions of a score function for discrete random variables. Specifically, the following definitions of score functions for a random variable $W$ supported on the non-negative integers have been proposed in [32], [33] and [34], respectively: for $w = 0, 1, 2, \ldots$

$$\rho^{\mathrm{K}}_W(w) = \frac{P_W(w-1) - P_W(w)}{P_W(w)}, \tag{63}$$
$$\rho^{\mathrm{KHJ}}_W(w) = \frac{(w+1) P_W(w+1)}{\mathbb{E}[W]\, P_W(w)} - 1, \tag{64}$$
$$\rho^{\mathrm{JG}}_W(w) = \frac{w}{\mathbb{E}[W]} \frac{P_W(w-1)}{P_W(w)} - 1. \tag{65}$$

By letting $U = aX+\lambda$ and $\mathbb{E}[U] = a\mathbb{E}[X]+\lambda$, we observe the following relationships between the score functions proposed in Theorem 3 and the score functions in (63), (64) and (65):

$$\rho^{\mathrm{Po,d.c.}}_Y(y) = \rho^{\mathrm{K}}_Y(y) = \left(\rho^{\mathrm{JG}}_Y(y) + 1\right)\frac{\mathbb{E}[U]}{y} - 1, \tag{66}$$
$$\rho^{\mathrm{Po}}_Y(y) = \mathbb{E}[U]\, \rho^{\mathrm{KHJ}}_Y(y) + \mathbb{E}[U] - y. \tag{67}$$

The score function has an intimate relationship with the Fisher information. Motivated by our gradient definition of the score function in (47), we define the following version of the Fisher information of $Y$ for the Poisson noise case:

$$J^{\mathrm{Po}}(Y) = \mathbb{E}\left[\left(\rho^{\mathrm{Po}}_Y(Y)\right)^2\right]. \tag{68}$$
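The Tweedie-like formula (52) can be verified numerically by estimating $\rho^{\mathrm{Po}}$ with central finite differences of $\log P_Y$ in $(a, \lambda)$. A minimal sketch follows, with assumed example parameters and a gamma input.

```python
import numpy as np
from scipy import stats, integrate

alpha, theta, a, lam, y = 2.0, 3.0, 1.2, 0.8, 3  # assumed example parameters
prior = stats.gamma(theta, scale=1/alpha)

def p_y(y, a, lam):
    f = lambda x: stats.poisson.pmf(y, a*x + lam) * prior.pdf(x)
    return integrate.quad(f, 0, np.inf)[0]

h = 1e-4
# rho^Po(y) = -(a d/da + lam d/dlam) log P_Y(y), Eq. (47)
d_a = (np.log(p_y(y, a + h, lam)) - np.log(p_y(y, a - h, lam))) / (2*h)
d_l = (np.log(p_y(y, a, lam + h)) - np.log(p_y(y, a, lam - h))) / (2*h)
rho = -(a * d_a + lam * d_l)

num = integrate.quad(lambda x: (a*x + lam) * stats.poisson.pmf(y, a*x + lam) * prior.pdf(x), 0, np.inf)[0]
cond_mean_u = num / p_y(y, a, lam)
print(cond_mean_u, y + rho)  # Eq. (52): E[aX + lam | Y = y] = y + rho^Po(y)
```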
We can now show the following result relating the Fisher information and the minimum mean squared error (MMSE), where the latter is defined as

$$\mathrm{mmse}(X|Y) = \mathbb{E}\left[(X - \mathbb{E}[X|Y])^2\right]. \tag{69}$$

Theorem 4. Let $Y = \mathcal{P}(aX+\lambda)$. Then, for every $a > 0$, $\lambda \ge 0$,

$$\mathrm{mmse}(X|Y) = \frac{a\mathbb{E}[X] + \lambda - J^{\mathrm{Po}}(Y)}{a^2}. \tag{70}$$

Proof: We start with the definition of the MMSE:

$$\mathrm{mmse}(X|Y) = \mathbb{E}\left[(X - \mathbb{E}[X|Y])^2\right] \tag{71}$$
$$= \mathbb{E}\left[\left(X - \frac{Y + \rho^{\mathrm{Po}}_Y(Y) - \lambda}{a}\right)^2\right] \tag{72}$$
$$= \mathbb{E}\left[\left(X - \frac{Y-\lambda}{a}\right)^2\right] - 2\,\mathbb{E}\left[\left(X - \frac{Y-\lambda}{a}\right)\frac{\rho^{\mathrm{Po}}_Y(Y)}{a}\right] + \frac{1}{a^2} J^{\mathrm{Po}}(Y) \tag{73}$$
$$= \mathbb{E}\left[\left(X - \frac{Y-\lambda}{a}\right)^2\right] - 2\,\mathbb{E}\left[\left(\mathbb{E}[X|Y] - \frac{Y-\lambda}{a}\right)\frac{\rho^{\mathrm{Po}}_Y(Y)}{a}\right] + \frac{1}{a^2} J^{\mathrm{Po}}(Y) \tag{74}$$
$$= \mathbb{E}\left[\left(X - \frac{Y-\lambda}{a}\right)^2\right] - \frac{2}{a^2} J^{\mathrm{Po}}(Y) + \frac{1}{a^2} J^{\mathrm{Po}}(Y) \tag{75}$$
$$= \frac{a\mathbb{E}[X] + \lambda}{a^2} - \frac{1}{a^2} J^{\mathrm{Po}}(Y), \tag{76}$$

where (72) follows by using (52); (74) follows from the law of total expectation; (75) follows by using the identity in (52); and (76) follows by using

$$\mathbb{E}\left[\left(X - \frac{Y-\lambda}{a}\right)^2\right] = \mathbb{E}\left[\mathbb{V}\left(\frac{Y-\lambda}{a} \,\Big|\, X\right)\right] = \frac{1}{a^2}\mathbb{E}[\mathbb{V}(Y|X)] = \frac{1}{a^2}\mathbb{E}[aX+\lambda]. \tag{77}$$

This concludes the proof.
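The identity (70) can be checked numerically by computing the conditional moments of a gamma input on a truncated output range. A minimal sketch follows; the parameters and the truncation level are assumptions chosen so that the neglected tail mass is negligible.

```python
import numpy as np
from scipy import stats, integrate

a, lam, alpha, theta = 1.5, 0.7, 2.0, 3.0  # assumed example parameters
prior = stats.gamma(theta, scale=1/alpha)

def joint(y, x, k=0):  # x^k * P_{Y|X}(y|x) * pdf of X
    return x**k * stats.poisson.pmf(y, a*x + lam) * prior.pdf(x)

ys = np.arange(60)  # truncation: essentially all the mass for these parameters
p_y = np.array([integrate.quad(lambda x: joint(y, x), 0, np.inf)[0] for y in ys])
ex = np.array([integrate.quad(lambda x: joint(y, x, 1), 0, np.inf)[0] for y in ys]) / p_y
ex2 = np.array([integrate.quad(lambda x: joint(y, x, 2), 0, np.inf)[0] for y in ys]) / p_y

mmse = np.sum(p_y * (ex2 - ex**2))                    # E[V(X|Y)]
score = a * ex + lam - ys                             # rho^Po(y) = E[aX+lam|Y=y] - y, Eq. (52)
J_po = np.sum(p_y * score**2)                         # Eq. (68)
print(mmse, (a * prior.mean() + lam - J_po) / a**2)   # Eq. (70): both sides should match
```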
Remark 6. A Gaussian analog of the identity in (70) was shown by Brown in [35]:

$$\mathrm{mmse}(V|Y_G) = \frac{\sigma^2 - \sigma^4 J(Y_G)}{a^2}, \tag{78}$$

where the Fisher information is given by

$$J(Y_G) = \mathbb{E}\left[\left(\frac{\mathrm{d}}{\mathrm{d}y} \log(f_{Y_G}(Y_G))\right)^2\right]. \tag{79}$$

As an example, by using that $J^{\mathrm{Po}}(Y) \ge 0$, the expression in (70) leads to the bound

$$\mathrm{mmse}(X|Y) \le \frac{a\mathbb{E}[X] + \lambda}{a^2}. \tag{80}$$

For the case of $\lambda = 0$ and under some mild conditions on the pdf of $X$, the bound in (80) was shown to be order tight in [36, Theorem 4]. For upper and lower bounds and some other results on the MMSE in Poisson noise, the interested reader is referred to [36].

C. Analytical Properties of the Conditional Expectation

In this section, we study how the conditional expectation behaves as a function of $y$ and as a function of the channel parameters $(a, \lambda)$.
Theorem 5. Let $Y = \mathcal{P}(aX+\lambda)$. Then, for every fixed $a > 0$, $\lambda \ge 0$, the mapping $y \to \mathbb{E}[X|Y=y]$ is non-decreasing.

Proof: To show that the expected value is non-decreasing, let $U = aX+\lambda$ and consider

$$P_Y(k) = \frac{1}{k!}\mathbb{E}\left[U^k e^{-U}\right] \tag{81}$$
$$\le \frac{1}{k!}\sqrt{\mathbb{E}[U^{k+1} e^{-U}]\, \mathbb{E}[U^{k-1} e^{-U}]} \tag{82}$$
$$= \frac{1}{k!}\sqrt{(k+1)!\, P_Y(k+1)\, (k-1)!\, P_Y(k-1)} \tag{83}$$
$$= \sqrt{\frac{k+1}{k}\, P_Y(k+1)\, P_Y(k-1)}, \tag{84}$$

where in (82) we have used the Cauchy-Schwarz inequality. Now, applying the bound in (84) to the TGR formula in (31), we have that

$$\mathbb{E}[U|Y=y] = \frac{(y+1) P_Y(y+1)}{P_Y(y)} \tag{85}$$
$$\ge (y+1)\, \frac{\frac{y}{y+1} P^2_Y(y)}{P_Y(y)\, P_Y(y-1)} \tag{86}$$
$$= \frac{y\, P_Y(y)}{P_Y(y-1)} \tag{87}$$
$$= \mathbb{E}[U|Y=y-1]. \tag{88}$$

This shows that the conditional expectation is a non-decreasing function of $y$.

We note that the conditional expectation of a non-degenerate random variable $X$ may fail to be strictly increasing; it can, for example, increase only once. Consider the example of a Bernoulli random variable and $\lambda = 0$, for which

$$\mathbb{E}[X|Y=y] = \begin{cases} \frac{p e^{-a}}{q + p e^{-a}}, & y = 0, \\ 1, & y \ge 1. \end{cases} \tag{89}$$

For the Gaussian noise setting this, however, is not the case, and the conditional expectation is a strictly increasing function for non-degenerate random variables (see Remark 7 below).

From Fig. 3 and Fig. 5 we observe that the conditional expectation with a lower dark current dominates the conditional expectation with a larger dark current. This observation holds in general and is formally shown next.
Theorem 6. For $Y = \mathcal{P}(aX+\lambda)$ and $U = aX+\lambda$, let $\mathbf{v} = [a, \lambda]$. Then, for every $a > 0$, $\lambda > 0$ and $y = 0, 1, \ldots$

$$\mathbf{v} \cdot \nabla_{\mathbf{v}} \mathbb{E}[X|Y=y] = a\frac{\mathrm{d}}{\mathrm{d}a}\mathbb{E}[X|Y=y] + \lambda\frac{\mathrm{d}}{\mathrm{d}\lambda}\mathbb{E}[X|Y=y] = -a\,\mathbb{V}(X|Y=y), \tag{90}$$

where

$$a\frac{\mathrm{d}}{\mathrm{d}\lambda}\mathbb{E}[X|Y=y] = \begin{cases} 0, & y = 0, \\ -\dfrac{y\,\mathbb{V}(U|Y=y-1)}{\mathbb{E}^2[U|Y=y-1]}, & y \ge 1. \end{cases} \tag{91}$$

More generally, for every positive integer $k$,

$$\mathbf{v} \cdot \nabla_{\mathbf{v}} \mathbb{E}\left[U^k|Y=y\right] = a\frac{\mathrm{d}}{\mathrm{d}a}\mathbb{E}[U^k|Y=y] + \lambda\frac{\mathrm{d}}{\mathrm{d}\lambda}\mathbb{E}[U^k|Y=y] \tag{92}$$
$$= k\,\mathbb{E}[U^k|Y=y] - \mathbb{E}[U^{k+1}|Y=y] + \mathbb{E}[U^k|Y=y]\,\mathbb{E}[U|Y=y], \tag{93}$$

where

$$\frac{\mathrm{d}}{\mathrm{d}\lambda}\mathbb{E}[U^k|Y=y] = \begin{cases} k\,\mathbb{E}[U^{k-1}|Y=0], & y = 0, \\ (y+k)\,\mathbb{E}[U^{k-1}|Y=y] - \dfrac{y\,\mathbb{E}[U^k|Y=y]}{\mathbb{E}[U|Y=y-1]}, & y \ge 1. \end{cases} \tag{94}$$

Proof:
See Appendix D.

As a consequence of Theorem 6, we have the following corollary.
Corollary 2.
Let $Y_\lambda = \mathcal{P}(aX+\lambda)$, where $X$ is a non-degenerate random variable. Then, for a fixed $y \ge 1$, the mapping $\lambda \to \mathbb{E}[X|Y_\lambda=y]$ is strictly decreasing. Moreover,

$$\lim_{\lambda \to \infty} \mathbb{E}[X|Y_\lambda=y] = \mathbb{E}[X|Y_\kappa=0], \quad y = 0, 1, \ldots \tag{95}$$

where $\kappa \ge 0$ can be arbitrary.

Proof: The fact that $\lambda \to \mathbb{E}[X|Y_\lambda=y]$ is strictly decreasing for $y \ge 1$ follows from (91) by using the fact that the variance of a non-degenerate random variable is positive. The proof of (95) proceeds as follows:

$$\lim_{\lambda \to \infty} \mathbb{E}[X|Y_\lambda=y] = \lim_{\lambda \to \infty} \frac{\mathbb{E}\left[X(aX+\lambda)^y e^{-(aX+\lambda)}\right]}{\mathbb{E}\left[(aX+\lambda)^y e^{-(aX+\lambda)}\right]} \tag{96}$$
$$= \lim_{\lambda \to \infty} \frac{\mathbb{E}\left[X\left(\frac{aX}{\lambda}+1\right)^y e^{-aX}\right]}{\mathbb{E}\left[\left(\frac{aX}{\lambda}+1\right)^y e^{-aX}\right]} \tag{97}$$
$$= \frac{\mathbb{E}\left[X e^{-aX}\right]}{\mathbb{E}\left[e^{-aX}\right]} \tag{98}$$
$$= \mathbb{E}[X|Y_0 = 0] \tag{99}$$
$$= \mathbb{E}[X|Y_\kappa = 0], \tag{100}$$

where the exchange of the limit and expectation in (97) follows by using the dominated convergence theorem together with the following bound, which holds for $\lambda \ge 1$:

$$X\left(\frac{aX}{\lambda}+1\right)^y e^{-aX} \le \frac{1}{a}(y+1)^{y+1} e^{-y}; \tag{101}$$

the proof of the above bound follows along the same lines as that of (30); and $\kappa \ge 0$ in (100) can be taken arbitrarily since the conditional expectation at $y = 0$ does not depend on the dark current parameter. This concludes the proof.
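Both the identity (90) and the monotonicity in $\lambda$ of Corollary 2 can be checked with finite differences. A minimal sketch follows, with assumed example parameters and a gamma input.

```python
import numpy as np
from scipy import stats, integrate

alpha, theta, y = 2.0, 3.0, 4  # assumed example parameters
prior = stats.gamma(theta, scale=1/alpha)

def cond_moment(lam, a, k):
    # E[X^k | Y = y] by direct Bayes integration
    num = integrate.quad(lambda x: x**k * stats.poisson.pmf(y, a*x + lam) * prior.pdf(x), 0, np.inf)[0]
    den = integrate.quad(lambda x: stats.poisson.pmf(y, a*x + lam) * prior.pdf(x), 0, np.inf)[0]
    return num / den

lam, a, h = 2.0, 1.0, 1e-4
dda = (cond_moment(lam, a + h, 1) - cond_moment(lam, a - h, 1)) / (2*h)
ddl = (cond_moment(lam + h, a, 1) - cond_moment(lam - h, a, 1)) / (2*h)
var = cond_moment(lam, a, 2) - cond_moment(lam, a, 1)**2
print(a*dda + lam*ddl, -a*var)  # Eq. (90): the two should agree
print(ddl)                      # negative, as claimed by Corollary 2
```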
Remark 7. For the Gaussian noise channel in (4), the analog of the identity in (90) is the following formula shown in [37]:

$$\sigma^2 \frac{\mathrm{d}}{\mathrm{d}y}\mathbb{E}[aV|Y_G=y] = \mathbb{V}(aV|Y_G=y). \tag{102}$$

Moreover, the Gaussian analog of the identity in (93) is

$$\sigma^2 \frac{\mathrm{d}}{\mathrm{d}y}\mathbb{E}[(aV)^{n-1}|Y_G=y] = \mathbb{E}[(aV)^n|Y_G=y] - \mathbb{E}[(aV)^{n-1}|Y_G=y]\,\mathbb{E}[aV|Y_G=y], \tag{103}$$

shown in [38]. These two observations further support the view that differentiation with respect to $a$ and $\lambda$ in the Poisson channel is the analog of differentiation with respect to $y$ for the Gaussian channel.

We conclude this section by using the monotonicity of the conditional expectation to show an inequality that has the flavor of a reverse Jensen's inequality.

Lemma 6.
Let $Y = \mathcal{P}(X)$. Then, for every positive integer $k$ and non-negative integer $y$,

$$\mathbb{E}[X|Y=y] \le \mathbb{E}^{\frac{1}{k}}\left[X^k \,\big|\, Y=y\right] \le \mathbb{E}[X|Y=y+k-1]. \tag{104}$$

Proof:
Using the monotonicity of the conditional expectation, observe the following inequalities:

$$\left(\mathbb{E}[X|Y=y]\right)^k \le \prod_{i=0}^{k-1} \mathbb{E}[X|Y=y+i] \le \left(\mathbb{E}[X|Y=y+k-1]\right)^k. \tag{105}$$

The proof is now concluded by using the identity in (39). Observe that the first inequality in (104) could also have been shown by using Jensen's inequality.

D. Uniqueness of the Conditional Expectation with Respect to the Input Distribution
In this section, we are interested in whether the knowledge of $\mathbb{E}[U|Y=y]$ for all $y$ uniquely determines the input distribution $P_U$. We begin by showing an auxiliary result about the conditional distribution, which is of independent interest and will be used to show the uniqueness of the conditional expectation with respect to the input distribution.

Theorem 7.
Fix some non-negative integer $y$ and let $U_y$ be distributed according to $P_{U|Y=y}$, where $Y = \mathcal{P}(U)$. Denote by $\{m_k\}_{k=1}^{\infty}$ the sequence of integer moments of $U_y$ (i.e., $m_k = \mathbb{E}[U_y^k]$). Then, the distribution of $U_y$ is uniquely determined by the sequence $\{m_k\}_{k=1}^{\infty}$.

Proof: Clearly, we have that $m_k < \infty$ for all $k > 0$, which follows from the inequality in Lemma 6. In the rest of the proof, we seek to determine whether the moments of $U_y$ are unique. Since $U_y \ge 0$, this is a classical Stieltjes moment problem [39]. The following sufficient condition for the uniqueness of moments was given by Carleman [40]: the moments of $U_y$ are unique if

$$\sum_{k=1}^{\infty} \left(\mathbb{E}[U_y^k]\right)^{-\frac{1}{2k}} = \infty. \tag{106}$$

Next, using the upper bound in (30), observe the following inequality:

$$\mathbb{E}[U_y^k] = \frac{(y+k)!}{y!} \frac{P_Y(y+k)}{P_Y(y)} \tag{107}$$
$$\le \frac{(y+k)!}{y!} \frac{\frac{1}{(y+k)!}(y+k)^{y+k} e^{-(y+k)}}{P_Y(y)} \tag{108}$$
$$= c_y\, (y+k)^{y+k} e^{-k}, \tag{109}$$

where in the last step we have defined

$$c_y = \frac{e^{-y}}{y!\, P_Y(y)}. \tag{110}$$

Now, applying the bound in (109) to the summation in (106),

$$\sum_{k=1}^{\infty} \left(\mathbb{E}[U_y^k]\right)^{-\frac{1}{2k}} \ge \sqrt{e}\, \sum_{k=1}^{\infty} \frac{1}{c_y^{\frac{1}{2k}}\, (y+k)^{\frac{y+k}{2k}}}, \tag{111}$$

and since $y$ and $c_y$ are fixed, the above sum diverges by the comparison test. Therefore, the Carleman condition in (106) is satisfied and the moments determine the distribution. This concludes the proof.

Remark 8.
To the best of our knowledge, no Gaussian analog of Theorem 7 has been presented before. Nonetheless, for the Gaussian noise case, a version of Theorem 7 can be shown by using [41, Proposition 6], where it was shown that $P_{V|Y_G=y}$ is sub-Gaussian for every fixed $y$ regardless of the distribution of the input $V$, together with Carleman's condition for real-valued random variables. We note, however, that in general for the Poisson case the conditional distribution $P_{U|Y}$ does not enjoy the sub-Gaussianity property. To see this, recall that sub-Gaussian random variables have moments that grow at a rate of at most $C^k \sqrt{k!}$ for some fixed constant $C > 0$ [42]. Now, consider $U \sim \mathrm{Gam}(1, 1)$, in which case using Table I we have that

$$\mathbb{E}[U|Y=y] = \frac{y+1}{2}, \quad y = 0, 1, \ldots \tag{112}$$

and using (39) the higher order moments of $U|Y=y$ are given by

$$m_k = \frac{(y+k)!}{2^k\, y!}. \tag{113}$$

This, however, for every fixed $y$, grows faster than $C^k \sqrt{k!}$, and so $P_{U|Y=y}$ is not sub-Gaussian.

The next result establishes the uniqueness of the conditional expectation.

Theorem 8.
Let $Y_1 = \mathcal{P}(U_1)$ and $Y_2 = \mathcal{P}(U_2)$. Then, the conditional expectation is a bijective operator of the input distribution; that is,

$$\mathbb{E}[U_1|Y_1=y] = \mathbb{E}[U_2|Y_2=y], \; \forall y \iff P_{U_1} = P_{U_2}. \tag{114}$$

Proof:
Suppose that $P_{U_1} = P_{U_2}$; then it is immediate that

$$\mathbb{E}[U_1|Y_1=y] = \mathbb{E}[U_2|Y_2=y], \quad \forall y. \tag{115}$$

Therefore, it remains to show the other direction. Next, suppose that $\mathbb{E}[U_1|Y_1=y] = \mathbb{E}[U_2|Y_2=y]$ for all $y$. Then, using the identity in (39), we have that for all integers $k \ge 1$

$$\mathbb{E}[U_1^k|Y_1=y] = \mathbb{E}[U_2^k|Y_2=y], \quad \forall y. \tag{116}$$

Using Theorem 7, the expression in (116) implies that

$$P_{U_1|Y_1=y} = P_{U_2|Y_2=y}, \quad \forall y. \tag{117}$$

Note also that, by the TGR formula (31), the conditional expectation determines all the ratios $P_Y(y+1)/P_Y(y)$ and hence the output pmf itself, so that $P_{Y_1} = P_{Y_2}$. Now the equality in (117) implies that $P_{U_1} = P_{U_2}$. To see this, choose some measurable set $\mathcal{A} \subset \mathbb{R}$ and observe that

$$P_{U_1}(\mathcal{A}) = \mathbb{E}[\mathbf{1}_{\mathcal{A}}(U_1)] \tag{118}$$
$$= \mathbb{E}[\mathbb{E}[\mathbf{1}_{\mathcal{A}}(U_1)|Y_1]] \tag{119}$$
$$= \mathbb{E}[\mathbb{E}[\mathbf{1}_{\mathcal{A}}(U_2)|Y_2]] \tag{120}$$
$$= P_{U_2}(\mathcal{A}). \tag{121}$$

This concludes the proof.

Remark 9.
For the Gaussian noise channel, the uniqueness of the conditional expectation has been established in [43, Appendix B]. We note, however, that our proof and the proof in [43] are very different. Moreover, to the best of our knowledge, there is no single unifying approach for demonstrating uniqueness results for the conditional expectation with respect to the input distribution.

There are several interesting consequences of Theorem 8:

• (Strict Concavity of the MMSE). Consider $Y = \mathcal{P}(aX+\lambda)$ for some $X$ and the corresponding MMSE of estimating $X$ from $Y$:

$$\mathrm{mmse}(X|Y) = \mathbb{E}\left[(X - \mathbb{E}[X|Y])^2\right]. \tag{122}$$

The uniqueness of the conditional expectation implies that $P_X \to \mathrm{mmse}(X|Y)$ is a strictly concave mapping. This follows by applying [43, Theorem 1], where it was established that the MMSE is strictly concave provided that the conditional expectation uniquely determines the input distribution;

• (Least Favorable Distributions). A distribution $P_{X^{\star}}$ is said to be least favorable with respect to the MMSE and some parameter space $\Omega$ if

$$P_{X^{\star}} \in \arg\max_{P_X : X \in \Omega} \mathrm{mmse}(X|Y). \tag{123}$$

The uniqueness of the conditional expectation, which implies that the objective function in (123) is strictly concave, guarantees that the least favorable prior distribution is unique. Moreover, this also implies uniqueness of the minimax estimator of a deterministic parameter $\theta \in \Omega$ under the risk

$$R(\theta, \hat{\theta}) = \mathbb{E}\left[\left(\theta - \hat{\theta}(Y)\right)^2\right], \quad \text{where } Y = \mathcal{P}(a\theta + \lambda). \tag{124}$$

The interested reader is also referred to [44], where conditions for the least favorable prior to be binary have been shown;

• (Empirical Bayes). Consider an independent and identically distributed (i.i.d.) sequence of input random variables $\{X_i\}$ and a corresponding output sequence $\{Y_i\}$ where $Y_i = \mathcal{P}(X_i)$. Now consider the expression for the TGR formula given by

$$\mathbb{E}[X|Y=y] = \frac{(y+1) P_Y(y+1)}{P_Y(y)}. \tag{125}$$

Because the conditional estimator in the TGR formula depends only on the marginal of the output $Y$, from the $Y_i$ observations we can build an empirical distribution $\widehat{P}_Y$ and construct an empirical version of the conditional expectation

$$\widehat{\mathbb{E}}[X|Y=y] = \frac{(y+1) \widehat{P}_Y(y+1)}{\widehat{P}_Y(y)}. \tag{126}$$

In other words, we are able to approximate the optimal estimator without knowledge of the prior distribution on $X$ (a minimal sketch of this procedure is given after this list). This remarkable procedure was first developed by Robbins in [26]. The uniqueness result of Theorem 8 implies that the empirical Bayes procedure is not only producing an estimate of the conditional expectation but is also (for free!) producing an estimate of the distribution of $X$. Therefore, it would be interesting to characterize the exact inverse transform relationship between $P_X$ and $\mathbb{E}[X|Y=y]$; and

• (Uniqueness of the Ratio of Derivatives of the Laplace Transform). Another interesting byproduct of Theorem 8 is that the sequence of ratios of derivatives of the Laplace transform evaluated at one, that is,

$$\frac{\mathcal{L}^{(k+1)}_U(1)}{\mathcal{L}^{(k)}_U(1)}, \quad k = 0, 1, \ldots, \tag{127}$$

completely determines the distribution $P_U$. To see that the sequence in (127) completely determines $P_U$, use the Laplace identity in (42).
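A minimal sketch of Robbins's empirical Bayes estimator (126) follows; the exponential prior here is an assumption used only to generate data, and the estimator itself never sees it. For $X \sim \mathrm{Exp}(1) = \mathrm{Gam}(1,1)$ the exact answer is $(y+1)/2$, cf. (112).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
x = rng.exponential(scale=1.0, size=n)  # hidden X_i (prior unknown to the estimator)
y = rng.poisson(x)                      # Y_i = P(X_i)

counts = np.bincount(y)
p_hat = counts / n                      # empirical pmf of Y

def robbins(y_val):
    # hat{E}[X | Y = y] = (y + 1) hat{P}_Y(y + 1) / hat{P}_Y(y), Eq. (126)
    if y_val + 1 >= len(p_hat) or p_hat[y_val] == 0:
        return np.nan
    return (y_val + 1) * p_hat[y_val + 1] / p_hat[y_val]

for k in range(5):
    print(k, robbins(k), (k + 1) / 2)  # empirical vs. exact
```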
E. Bounds on the Conditional Expectation

In this section, we are interested in the rate of growth of the conditional expectation as a function of $y$. Specifically, we are concerned with the question of whether the conditional expectation can have super-linear growth.

Remark 10.
Upper bounds on the conditional expectation for the Gaussian noise channel have been previously reported in [43, Lemma 4] and [45, Proposition 1.2], where it was demonstrated that $|\mathbb{E}[V|Y_G=y]| = O(|y|)$. (For two non-negative functions $f$ and $g$ over $\mathbb{R}$, we say that $f(x) = O(g(x))$ if and only if $\limsup_{x \to \infty} \frac{f(x)}{g(x)} < \infty$.) That is, in the Gaussian noise case the conditional expectation can have at most linear growth as a function of $y$.

The question of whether the conditional expectation can grow faster than a linear function can be reduced to answering whether the following supremum is finite:

$$\sup_{y \ge 0} \frac{\mathbb{E}[U|Y=y]}{y+1} = \sup_{y \ge 0} \frac{P_Y(y+1)}{P_Y(y)}, \tag{128}$$

where we have used the TGR formula.

Remark 11.
We note that it is possible to construct an example of a discrete random variable $W$ supported on the non-negative integers such that

$$\sup_{n \ge 0} \frac{P_W(n+1)}{P_W(n)} = \infty. \tag{129}$$

Take an arbitrary discrete distribution $Q$ supported on the non-negative integers and define

$$P_W(n) = c\, \frac{Q(n)}{n} \quad \text{for } n \text{ even; and} \tag{130}$$
$$P_W(n) = c\, Q(n-1) \quad \text{for } n \text{ odd}, \tag{131}$$

where $c$ is the normalization constant. Then, for even $n$,

$$\frac{P_W(n+1)}{P_W(n)} = n, \tag{132}$$

and $\sup_{n \ge 0} \frac{P_W(n+1)}{P_W(n)} = \infty$. This construction suggests that we cannot rule out the possibility of (128) being infinite.

The next result provides a sufficient condition for the finiteness of the supremum in (128) in terms of properties of the Laplace transform of the input random variable.

Theorem 9.
Let $Y = \mathcal{P}(U)$. Then, the following hold:

1) (Lower Envelope). For every $U$,

$$\liminf_{y \to \infty} \frac{\mathbb{E}[U|Y=y]}{y+1} \le 1. \tag{133}$$

2) (Upper Envelope).

$$\sup_{y \ge 0} \frac{\mathbb{E}[U|Y=y]}{y+1} < \infty \tag{134}$$

if and only if

$$\limsup_{n \to \infty} \left|\frac{\mathcal{L}^{(n+1)}_U(1)}{(n+1)\,\mathcal{L}^{(n)}_U(1)}\right| < \infty. \tag{135}$$

3) Let $U$ be such that the following limit exists:

$$\lim_{n \to \infty} \left|\frac{\mathcal{L}^{(n+1)}_U(1)}{(n+1)\,\mathcal{L}^{(n)}_U(1)}\right| = L. \tag{136}$$

Then,

$$L < \infty, \tag{137}$$

and

$$\sup_{y \ge 0} \frac{\mathbb{E}[U|Y=y]}{y+1} < \infty. \tag{138}$$

Moreover,

$$\lim_{y \to \infty} \frac{\mathbb{E}[U|Y=y]}{y+1} = L = \lim_{y \to \infty} \left|\frac{\mathcal{L}^{(y)}_U(1)}{y!}\right|^{\frac{1}{y}} \le 1. \tag{139}$$

Proof: To show that (134) is equivalent to (135), note that

$$\sup_{y \ge 0} \frac{\mathbb{E}[U|Y=y]}{y+1} = \sup_{y \ge 0} \left|\frac{\mathcal{L}^{(y+1)}_U(1)}{(y+1)\,\mathcal{L}^{(y)}_U(1)}\right|, \tag{140}$$

and

$$\sup_{y \ge 0} \left|\frac{\mathcal{L}^{(y+1)}_U(1)}{(y+1)\,\mathcal{L}^{(y)}_U(1)}\right| < \infty \iff \limsup_{n \to \infty} \left|\frac{\mathcal{L}^{(n+1)}_U(1)}{(n+1)\,\mathcal{L}^{(n)}_U(1)}\right| < \infty. \tag{141}$$

To show the first and third parts of the theorem, recall the following inequalities for a sequence of real numbers $\{a_n\}_{n=1}^{\infty}$:

$$\liminf_{n \to \infty} \frac{|a_{n+1}|}{|a_n|} \le \liminf_{n \to \infty} |a_n|^{\frac{1}{n}} \le \limsup_{n \to \infty} |a_n|^{\frac{1}{n}} \le \limsup_{n \to \infty} \frac{|a_{n+1}|}{|a_n|}. \tag{142}$$

Now let $|a_n| = \left|\frac{1}{n!}\mathcal{L}^{(n)}_U(1)\right|$. Therefore, using the sequence of inequalities in (142), we have that

$$\liminf_{n \to \infty} \frac{\mathbb{E}[U|Y=n]}{n+1} = \liminf_{n \to \infty} \left|\frac{\mathcal{L}^{(n+1)}_U(1)}{(n+1)\,\mathcal{L}^{(n)}_U(1)}\right| \tag{143}$$
$$\le \liminf_{n \to \infty} \left|\frac{1}{n!}\mathcal{L}^{(n)}_U(1)\right|^{\frac{1}{n}} \tag{144}$$
$$\le \liminf_{n \to \infty} \left(\frac{1}{n!}\, n^n e^{-n}\right)^{\frac{1}{n}} \tag{145}$$
$$= 1, \tag{146}$$

where in (145) we have used that $|\mathcal{L}^{(n)}_U(1)| = |\mathbb{E}[U^n e^{-U}]| \le n^n e^{-n}$. Now, if the limit in (136) exists, then the inequality in (144) becomes an equality, which leads to the proof of part 3). This concludes the proof.

Remark 12. An interesting future direction would be to investigate whether the condition in (136) is also necessary for finiteness, or to find an input $U$ that provides a counterexample.

Regrettably, the bound in (138) is not explicit. The next theorem provides an explicit upper bound and shows that the rate of growth of the conditional expectation cannot exceed $O(y \log y)$ for inputs with finite mean.

Theorem 10. Let $Y = \mathcal{P}(U)$. Then, for $y = 0, 1, \ldots$

$$\frac{\mathbb{E}[U|Y=y]}{2} \le (y+1)\log(y+1) - y\,\mathbb{E}[\log(U)] + \mathbb{E}[U] + \frac{1}{2e} + 1. \tag{147}$$

Proof: See Appendix E.

Fig. 4 compares the bound in (147) to $\mathbb{E}[U|Y=y]$, where $U$ is distributed according to a gamma distribution.

Fig. 4: Comparison of the bound in (147) (clear circles) to $\mathbb{E}[U|Y=y]$ (filled circles) for a gamma-distributed $U$.

IV. LINEAR ESTIMATION AND THE CONDITIONAL EXPECTATION
In this section, we study the properties of linear estimators. More specifically, our interest lies in various questions of optimality of linear estimators, such as:

1) Under what input distributions are linear estimators optimal for squared error loss and Bregman divergence loss? Since the conditional expectation is an optimal estimator for the aforementioned loss functions, this is equivalent to asking when the conditional expectation is a linear function of $y$; and

2) If linear estimators are approximately optimal, can we say something about the input distribution? In other words, we are looking for a quantitative refinement of 1).

A. When is the Conditional Expectation Linear?
Linear estimators are important in estimation theory due to the simplicity of their implementation and analysis. In the Gaussian noise case linear estimators are induced by Gaussian inputs. In Poisson noise the same role is played by the gamma distribution.

The next result provides sufficient and necessary conditions on the distribution of $X$ and the parameters of the Poisson transformation $(a, \lambda)$ that guarantee the linearity of $\mathbb{E}[X|Y]$.

Theorem 11.
Let $Y = \mathcal{P}(aX+\lambda)$. Then,

$$\mathbb{E}[X|Y=y] = b_1 y + b_0, \quad \forall y = 0, 1, \ldots \tag{148}$$

if and only if the following two conditions hold:
• $\lambda = 0$; and
• $X \sim \mathrm{Gam}\left(\frac{1 - ab_1}{b_1}, \frac{b_0}{b_1}\right)$ for some $0 < b_1 < \frac{1}{a}$ and $b_0 > 0$.

Proof: See Appendix F.

One of the ramifications of Theorem 11 is that $\mathbb{E}[X|Y]$ is linear only if $\lambda = 0$. In other words, in the presence of dark current, somewhat disappointingly, linear estimators are not optimal for a large class of loss functions.

The existence of an input distribution that results in linear estimators plays an important role in estimation theory. For example, to provide performance guarantees, the minimum mean squared error is often upper bounded by analyzing a sub-optimal linear estimator. Hence, the existence of input distributions with linear conditional expectations shows that such bounds can be attained.

Remark 13.
For $\lambda = 0$, Theorem 11 could also have been derived by using the fact that the gamma distribution is a conjugate prior for the Poisson distribution [46], [47]. This implies that for $Y = \mathcal{P}(aX)$, if $X \sim \mathrm{Gam}(\alpha, \theta)$, then $aX \,|\, Y=y \sim \mathrm{Gam}\left(\frac{\alpha}{a} + 1, \theta + y\right)$.

Furthermore, the fact that the posterior distribution is gamma allows one to show that the conditional variance of $X$ given $Y$ is also linear in $y$ and is given by

$$\mathbb{V}(X|Y=y) = \frac{\theta + y}{(\alpha + a)^2}. \tag{149}$$

In contrast, for the Gaussian noise channel in (4) the linear estimator is obtained if and only if the input is Gaussian (the Gaussian is also a conjugate prior of the Gaussian), but the conditional variance is constant and independent of the observation $y$. Specifically, for $V_G \sim \mathcal{N}(0, \sigma_V^2)$ related to $Y_G$ through (4),

$$\mathbb{V}(V_G|Y_G=y) = \frac{\sigma_V^2\, \sigma^2}{\sigma^2 + a^2 \sigma_V^2}, \quad \forall y \in \mathbb{R}. \tag{150}$$
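The conjugate-prior fact and the variance formula (149) admit a quick Monte Carlo check via crude rejection sampling; all parameter values below are assumed for illustration.

```python
import numpy as np

# For Y = P(aX) with X ~ Gam(alpha, theta), X | Y = y is Gam(alpha + a, theta + y).
rng = np.random.default_rng(4)
alpha, theta, a, y_obs = 2.0, 3.0, 1.5, 4  # assumed example parameters

x = rng.gamma(shape=theta, scale=1/alpha, size=2_000_000)
y = rng.poisson(a * x)
x_post = x[y == y_obs]  # crude rejection sampling of X | Y = y_obs

print(x_post.mean(), (theta + y_obs) / (alpha + a))       # posterior mean, cf. Table I
print(x_post.var(), (theta + y_obs) / (alpha + a) ** 2)   # posterior variance, Eq. (149)
```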
To present an illustrative example of the effect of dark current on the conditional expectation, we compute the conditional expectation for $\lambda \ge 0$ of an input with an exponential distribution, which is the gamma distribution with unit shape parameter.

Lemma 7. Let $Y = \mathcal{P}(aX+\lambda)$ and take $X$ to be an exponential random variable of rate $\alpha$. Then, for every $a > 0$ and $\lambda \ge 0$,

$$P_Y(0) = \frac{\alpha\, e^{-\lambda}}{\alpha + a}, \tag{151a}$$
$$P_Y(k) = \frac{\alpha}{a} \left(1 + \frac{\alpha}{a}\right)^{-(k+1)} e^{\frac{\alpha\lambda}{a}}\, \frac{\Gamma\!\left(k+1,\, \lambda\left(\frac{\alpha}{a}+1\right)\right)}{\Gamma(k+1)}, \quad k = 0, 1, \ldots \tag{151b}$$

where $\Gamma(\cdot, \cdot)$ is the upper incomplete gamma function.

Proof:
The proof follows via tedious but otherwise elementary integration.

By using the expression for the output distribution in Lemma 7 and the TGR formula in (31), Fig. 5 shows examples of conditional mean estimators for various values of the dark current parameter $\lambda$. Observe that in Fig. 5 the estimator is linear for $\lambda = 0$ and non-linear for $\lambda > 0$.
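The behavior shown in Fig. 5 is easy to reproduce from Lemma 7 and the TGR formula (31). A minimal sketch follows, including a Monte Carlo sanity check of (151b); the value of $\lambda$ is an assumed example.

```python
import numpy as np
from scipy import special

alpha, a = 3.0, 1.0  # alpha = 3 as in Fig. 5

def p_y(k, lam):
    r = alpha / a
    # Eq. (151b); special.gammaincc is the regularized upper incomplete gamma
    return r * (1 + r) ** (-(k + 1)) * np.exp(r * lam) * special.gammaincc(k + 1, lam * (1 + r))

rng = np.random.default_rng(5)
lam = 2.0
y = rng.poisson(a * rng.exponential(1 / alpha, size=1_000_000) + lam)
print(p_y(3, lam), np.mean(y == 3))  # closed form vs. Monte Carlo

for k in range(6):  # conditional means via TGR (31): linear in k iff lam == 0
    print(k, (k + 1) * p_y(k + 1, lam) / (a * p_y(k, lam)) - lam / a)
```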
B. Quantitative Refinement of the Linearity Condition in Theorem 11

In this section, we refine the statement of Theorem 11. Specifically, we show that if the conditional expectation is close to a linear function in a mean squared error sense, then the input distribution must be close in the Lévy metric to the gamma distribution, where the Lévy metric is defined next.

Fig. 5: Examples of conditional expectations $\mathbb{E}[X|Y=k]$ for $X$ distributed according to an exponential distribution with rate parameter $\alpha = 3$ and dark current values $\lambda \in \{0, 1, 2, 5\}$.

Definition 2.
Let $P$ and $Q$ be two cumulative distribution functions. Then, the Lévy metric between $P$ and $Q$ is defined as

$$L(P, Q) = \inf\{h \ge 0 : Q(x-h) - h \le P(x) \le Q(x+h) + h, \; \forall x \in \mathbb{R}\}. \tag{152}$$

An important property of the Lévy metric is that convergence of distributions in the Lévy metric is equivalent to weak convergence of distributions as defined in Definition 1 [48].

Theorem 12.
Let $Y = \mathcal{P}(U)$ where $U \sim P_U$, and suppose that

$$\mathbb{E}\left[\left|\mathbb{E}[U|Y] - (c_1 Y + c_0)\right|^2\right] \le \epsilon, \tag{153}$$

for some $0 < c_1 < 1$ and $c_0 > 0$. Then, for $U_\gamma \sim \mathrm{Gam}\left(\frac{1-c_1}{c_1}, \frac{c_0}{c_1}\right)$,

$$\sup_{s \ge 0} \left|\frac{\phi_U(s) - \phi_{U_\gamma}(s)}{s}\right| \le \frac{\sqrt{\epsilon}}{1-c_1}, \tag{154}$$

where $\phi_U(s)$ and $\phi_{U_\gamma}(s)$ are the characteristic functions of $U$ and $U_\gamma$, respectively. Consequently,

$$L\left(P_U, \mathrm{Gam}\left(\frac{1-c_1}{c_1}, \frac{c_0}{c_1}\right)\right) \le \frac{\sqrt{\epsilon}}{1-c_1}. \tag{155}$$

Proof:
See Appendix G.
Remark 14.
The proof of Theorem 12 is inspired by its Gaussian analog shown in [49, Lemma 4]. The bound in (155) can also be related to the Kolmogorov metric [50].

V. CONCLUSION AND OUTLOOK
This work has focused on studying properties of the conditional mean estimator of a random variable in Poisson noise. Specific emphasis has been placed on how the conditional expectation behaves as a function of the scaling parameter $a$ and the dark current parameter $\lambda$, and as a function of the input distribution (prior distribution).

With respect to the channel parameters $(a, \lambda)$, several identities in terms of derivatives have been established. These derivative identities have also been used to show that the conditional expectation is a monotone function of both the dark current parameter $\lambda$ and the channel observation $y$. Another such identity has proposed a notion of a score function of the output pmf where the gradient is taken with respect to the channel parameters $(a, \lambda)$ (note that derivatives with respect to the output space are not defined in the Poisson case), and it has been shown that this score function has a natural connection to the conditional expectation via a Tweedie-like formula. In fact, by contrasting with the Gaussian case, it has been argued that differentiating with respect to the channel parameters $(a, \lambda)$ is a natural substitute for differentiation with respect to the output space, as the latter cannot be performed in view of the discreteness of the output space. Moreover, in the process of cataloging Poisson identities that have Gaussian counterparts, new identities for higher moments have been found for the Gaussian noise case. We refer the reader to Table II for a summary of all the identities and their Gaussian counterparts.

With respect to the input distribution, several new results have been shown. For instance, it has been shown that the conditional expectation can be written as a ratio of derivatives of the Laplace transform of the input distribution. This identity has been used to compute conditional expectations for a few commonly occurring input distributions; see Table I. Importantly, it has been shown that the conditional expectation uniquely determines the input distribution (i.e., the conditional expectation is a bijective operator of the input distribution). Several consequences of this uniqueness have been discussed, including the uniqueness of least favorable distributions and of minimax estimators. Moreover, it has been shown that the conditional mean estimator is a linear function if and only if the dark current parameter is zero and the input distribution is a gamma distribution. Furthermore, a quantitative refinement of this equality condition has been given by showing that if the conditional expectation is close to a linear function in $L^2$ distance, then the input distribution must be close to a gamma distribution in the Lévy metric.

We conclude the paper by mentioning a few interesting future directions:

• It would be interesting to see to what extent the results of this paper can be generalized to the vector Poisson model. On the one hand, results such as Lemma 4, Theorem 6 and Theorem 9 appear to have immediate extensions. On the other hand, the results in Theorem 1 and Theorem 12 might require more work. The interested reader is referred to [51] for preliminary results on these extensions.
TABLE II: Summary of Identities and Results (Poisson channel $Y = \mathcal{P}(aX+\lambda)$ vs. Gaussian channel $Y_G = aV + N$, $N \sim \mathcal{N}(0, \sigma^2)$).
• Natural transform and the output distribution: Laplace (see Theorem 1); Weierstrass (see Remark 1).
• Derivative and difference: $\frac{\mathrm{d}}{\mathrm{d}a} P_{Y|X}(y|x) = x\frac{\mathrm{d}}{\mathrm{d}\lambda} P_{Y|X}(y|x) = x\left(P_{Y|X}(y-1|x) - P_{Y|X}(y|x)\right)$ (see Lemma 1); $\frac{\mathrm{d}}{\mathrm{d}a} f_{Y_G|V}(y|v) = -v\frac{\mathrm{d}}{\mathrm{d}y} f_{Y_G|V}(y|v)$ (see Remark 3).
• Conditional expectation and output distribution: $\mathbb{E}[X|Y=y] = \frac{1}{a}\frac{(y+1)P_Y(y+1)}{P_Y(y)} - \frac{\lambda}{a}$ (see Lemma 3); $a\mathbb{E}[V|Y_G=y] = y + \sigma^2\frac{f'_{Y_G}(y)}{f_{Y_G}(y)}$ (see (44)).
• Higher order conditional moments: $\mathbb{E}[(aX+\lambda)^k|Y=y] = \prod_{i=0}^{k-1}\mathbb{E}[aX+\lambda|Y=y+i]$ (see Lemma 4); identity (40) (see Remark 5).
• MMSE and Fisher information: $\mathrm{mmse}(X|Y) = \frac{a\mathbb{E}[X]+\lambda - J^{\mathrm{Po}}(Y)}{a^2}$ (see Theorem 4); $\mathrm{mmse}(V|Y_G) = \frac{\sigma^2 - \sigma^4 J(Y_G)}{a^2}$ (see Remark 6).
• Conditional expectation and conditional variance: $\mathbf{v}\cdot\nabla_{\mathbf{v}}\mathbb{E}[X|Y=y] = -a\,\mathbb{V}(X|Y=y)$ (see Theorem 6); $\sigma^2\frac{\mathrm{d}}{\mathrm{d}y}\mathbb{E}[aV|Y_G=y] = \mathbb{V}(aV|Y_G=y)$ (see Remark 7).
• Uniqueness of the conditional expectation: see Theorem 8; see Remark 9.
• Bounds on the conditional expectation: $\mathbb{E}[U|Y=y] = O(y)$ (see Theorem 9) and $\mathbb{E}[U|Y=y] = O(y\log y)$ (see Theorem 10); $|\mathbb{E}[V|Y_G=y]| = O(|y|)$ (see Remark 10).
• Linearity of the conditional expectation: iff $X$ is gamma and $\lambda = 0$ (see Theorem 11); iff $V$ is Gaussian.
• Stability of linear estimators: see Theorem 12; see Remark 14.

• Building on the uniqueness property of the conditional expectation, it is likely possible to show that the Bayesian risk defined through the Bregman divergence natural to the Poisson channel [6], that is,

$$R = \mathbb{E}[\ell_P(X; \mathbb{E}[X|Y])], \tag{156}$$

where

$$\ell_P(u; v) = u\log\frac{u}{v} - (u - v), \quad v, u \ge 0, \tag{157}$$

is a strictly concave function of the input distribution. (Every distribution that is a member of an exponential family has a corresponding Bregman divergence, which is termed natural [52].) This, together with the identities in [14], [15], [18], where it was shown that the mutual information between the input $X$ and the output $Y = \mathcal{P}(X)$ can be written as an integral of (156), would also imply that the mutual information is a strictly concave function of the input distribution; and

• It would be interesting to understand how the optimal risk in (156) compares to the linear risk, that is,

$$R_L = \mathbb{E}[\ell_P(X; c_1 Y + c_0)], \tag{158}$$

where $c_0$ and $c_1$ are some constants. Specifically, it would be interesting to study the following limits:

$$\lim_{a \to \infty} \frac{R}{R_L}, \tag{159}$$
$$\lim_{a \to 0} \frac{R}{R_L}, \tag{160}$$
$$\lim_{\lambda \to \infty} \frac{R}{R_L}; \tag{161}$$

the limit in (159) focuses on the optimality of linear estimators in a low noise regime (a similar study for Gaussian noise was undertaken in [53]), and the limits in (160) and (161) focus on the optimality of the linear estimator in a low signal regime (a similar study for Gaussian noise was undertaken in [41]).

APPENDIX A
PROOF OF THEOREM 1

Let $U = aX+\lambda$. Then, using the definition of the Laplace transform, the output distribution can be written as

$$P_Y(y; P_X) = \frac{1}{y!}\mathbb{E}\left[U^y e^{-U}\right] \tag{162}$$
$$= \frac{(-1)^y}{y!}\, \mathcal{L}^{(y)}_U(t)\Big|_{t=1}. \tag{163}$$
APPENDIX A
PROOF OF THEOREM 1

Let U = aX + λ. Then, using the definition of the Laplace transform, the output distribution can be written as

P_Y(y; P_X) = (1/y!) E[ U^y e^{−U} ] (162)
= ((−1)^y / y!) L_U^{(y)}(t) |_{t=1}. (163)

Next, using the scaling and shifting properties of the Laplace transform, we have that

L_U^{(y)}(t) |_{t=1} = L_{aX+λ}^{(y)}(t) |_{t=1} (164)
= (d^y/dt^y) L_X(at) e^{−tλ} |_{t=1} (165)
= Σ_{i=0}^{y} (y choose i) a^{y−i} L_X^{(y−i)}(at) (−λ)^i e^{−λt} |_{t=1} (166)
= Σ_{i=0}^{y} (y choose i) a^{y−i} L_X^{(y−i)}(a) (−λ)^i e^{−λ}, (167)

where in (166) we have used the generalized product rule (the Leibniz rule). This concludes the proof of (10).

To show (11), observe that

E[ U^y e^{−U} ] = (−1)^y (d^y/dt^y) L_U(t) |_{t=1}. (168)

Therefore, we can write L_U(t) as a Taylor series around t = 1 as follows:

L_U(t) = Σ_{y=0}^{∞} (1/y!) (d^y/du^y) L_U(u) |_{u=1} (t − 1)^y (169)
= Σ_{y=0}^{∞} (−1)^y P_Y(y; P_X) (t − 1)^y, (170)

where in the last step we have used (163). Therefore, it remains to find the region of convergence of (170). By the root test for the convergence of power series, we have that

r = limsup_{k→∞} |P_Y(k; P_X)|^{1/k} ≤ 1, (171)

where in the last step we have used that P_Y(·; P_X) ≤ 1. From (171), the series in (170) converges on the interval |t − 1| < 1. This concludes the proof.
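As a concrete sanity check of (162) and (163) (a worked example of our own): for X ~ Exp(1), a = 1 and λ = 0 we have L_U(t) = 1/(1+t), so L_U^{(y)}(1) = (−1)^y y! 2^{−(y+1)} and (163) gives P_Y(y) = 2^{−(y+1)}, i.e., Y is geometric. The sketch below compares this closed form with Monte Carlo.

import numpy as np

rng = np.random.default_rng(0)
u = rng.exponential(size=1_000_000)   # U = aX + lam with X ~ Exp(1), a = 1, lam = 0
y = rng.poisson(u)

# (163): P_Y(y) = ((-1)^y / y!) L_U^{(y)}(1) = 2^{-(y+1)} for L_U(t) = 1/(1+t)
for k in range(6):
    mc = (y == k).mean()              # P_Y(k) estimated by Monte Carlo
    print(k, round(mc, 4), round(2.0 ** -(k + 1), 4))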
APPENDIX B
PROOF OF LEMMA 2

Throughout the proof we use the conventions that P_{Y|X}(−1|x) = P_Y(−1) = 0. The proof of the expression in (21) follows by inspection.

To show (22), observe that

a (d/da) P_Y(y) = y E[ aX (aX+λ)^{y−1} e^{−(aX+λ)} / y! ] − E[ aX (aX+λ)^y e^{−(aX+λ)} / y! ] (172)
= y E[ (aX+λ)^y e^{−(aX+λ)} / y! ] − λ y E[ (aX+λ)^{y−1} e^{−(aX+λ)} / y! ] − E[ (aX+λ)^{y+1} e^{−(aX+λ)} / y! ] + λ E[ (aX+λ)^y e^{−(aX+λ)} / y! ] (173)
= y P_Y(y) − (y+1) P_Y(y+1) + λ ( P_Y(y) − P_Y(y−1) ); (174)

the exchange of differentiation and expectation in (172) follows from a simple application of the dominated convergence theorem.

Next, we find the derivative with respect to λ:

(d/dλ) P_Y(y) = −E[ (aX+λ)^y e^{−(aX+λ)} / y! ] + y E[ (aX+λ)^{y−1} e^{−(aX+λ)} / y! ] (175)
= −P_Y(y) + P_Y(y−1). (176)

Finally, combining (174) and (176) concludes the proof.

APPENDIX C
PROOF OF THE BOUNDS IN (30)

The maximum of x → (ax+λ)^y e^{−(ax+λ)} occurs at x = (y − λ)/a. Therefore,

P_Y(y) = E[ P_{Y|X}(y|X) ] ≤ (1/y!) y^y e^{−y} (177)
≤ 1/√(2πy). (178)

The second bound in (178) follows from Stirling's lower bound y! ≥ √(2π) y^{y+1/2} e^{−y}.

We now show the lower bound in (30). Let U = aX + λ. Then,

P_Y(y) = (1/y!) E[ U^y e^{−U} ] (179)
= (1/y!) E[ e^{y log(U) − U} ] (180)
≥ (1/y!) e^{y E[log(U)] − E[U]}, (181)

where in the last step we applied Jensen's inequality. This concludes the proof of the lower bound in (30).
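For the same exponential example used above (where P_Y(y) = 2^{−(y+1)}, E[U] = 1 and E[log U] = −γ with γ the Euler–Mascheroni constant), the two bounds in (30) can be checked directly; the following sketch is our own illustration.

from math import exp, factorial, pi, sqrt

gamma_e = 0.5772156649015329          # Euler-Mascheroni constant; E[log U] = -gamma_e for U ~ Exp(1)
for y in range(1, 8):
    p = 2.0 ** -(y + 1)               # exact P_Y(y) for U ~ Exp(1)
    upper = 1.0 / sqrt(2 * pi * y)    # upper bound (178)
    lower = exp(-y * gamma_e - 1.0) / factorial(y)   # lower bound (181): e^{y E[log U] - E[U]} / y!
    print(y, lower <= p <= upper, round(lower, 5), round(p, 5), round(upper, 5))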
APPENDIX D
PROOF OF THEOREM 6

Throughout the proof we use the conventions that P_Y(−1) = 0 and E[U^k | Y = −1] = ∞. Fix some y ≥ 0. Then, using the generalized TGR formula in (37), that is, E[U^k | Y = y] = ((y+k)!/y!) P_Y(y+k)/P_Y(y),

(y!/(y+k)!) (d/dλ) E[U^k | Y = y] = (d/dλ) ( P_Y(y+k)/P_Y(y) ) (182)
= ( P_Y(y) (d/dλ)P_Y(y+k) − P_Y(y+k) (d/dλ)P_Y(y) ) / P_Y(y)² (183)
= ( P_Y(y) (P_Y(y+k−1) − P_Y(y+k)) − P_Y(y+k) (P_Y(y−1) − P_Y(y)) ) / P_Y(y)² (184)
= P_Y(y+k−1)/P_Y(y) − P_Y(y+k) P_Y(y−1) / P_Y(y)² (185)
= (y!/(y+k−1)!) E[U^{k−1} | Y = y] − (y!/(y+k)!) ( y E[U^k | Y = y] / E[U | Y = y−1] ), (186)

where (182) follows from the generalized TGR formula in (37); (184) follows from the derivative identity in (23), where we use the convention that P_Y(−1) = 0; and (186) follows from the generalized TGR formula in (37) and the convention E[U | Y = −1] = ∞. Dividing (186) by y!/(y+k)! leads to (94).

For the case k = 1, (186) reduces to

(d/dλ) E[U | Y = y] = (y+1) − y E[U | Y = y] / E[U | Y = y−1]. (187)

Moreover, since a E[X | Y = y] = E[U | Y = y] − λ,

a (d/dλ) E[X | Y = y] = −y V(U | Y = y−1) / E[U | Y = y−1]², (188)

where in the last step we have used the variance expression in (38).

Next, we compute the derivative with respect to a:

(y!/(y+k)!) (d/da) E[U^k | Y = y] = (d/da) ( P_Y(y+k)/P_Y(y) ) (189)
= ( P_Y(y) (d/da)P_Y(y+k) − P_Y(y+k) (d/da)P_Y(y) ) / P_Y(y)² (190)
= ( ((y+k)/a) P_Y(y+k) − ((y+k+1)/a) P_Y(y+k+1) + (λ/a)( P_Y(y+k) − P_Y(y+k−1) ) ) / P_Y(y)
 − P_Y(y+k) ( (y/a) P_Y(y) − ((y+1)/a) P_Y(y+1) + (λ/a)( P_Y(y) − P_Y(y−1) ) ) / P_Y(y)² (191)
= ((y+k)/a) ( y! E[U^k | Y = y] / (y+k)! ) − ((y+k+1)/a) ( y! E[U^{k+1} | Y = y] / (y+k+1)! )
 + (λ/a) ( y! E[U^k | Y = y] / (y+k)! − y! E[U^{k−1} | Y = y] / (y+k−1)! )
 − ( y! E[U^k | Y = y] / (y+k)! ) ( y/a − E[U | Y = y]/a + (λ/a)( 1 − y / E[U | Y = y−1] ) ), (192)

where (189) follows from the generalized TGR formula in (37), and (191) follows from the identity in (174). Next, dividing (192) by (1/a)(y!/(y+k)!), we have that

a (d/da) E[U^k | Y = y] = k E[U^k | Y = y] − E[U^{k+1} | Y = y] + E[U^k | Y = y] E[U | Y = y] − λ (y+k) E[U^{k−1} | Y = y] + λ y E[U^k | Y = y] / E[U | Y = y−1]. (193)

Now, combining (186) and (193), we have that

a (d/da) E[U^k | Y = y] + λ (d/dλ) E[U^k | Y = y] = k E[U^k | Y = y] − E[U^{k+1} | Y = y] + E[U^k | Y = y] E[U | Y = y], (194)

which concludes the proof of (93).

Setting k = 1, the left side of (194) reduces to

a (d/da) E[U | Y = y] + λ (d/dλ) E[U | Y = y] = a E[X | Y = y] + a² (d/da) E[X | Y = y] + aλ (d/dλ) E[X | Y = y] + λ, (195)

and the right side of (194) reduces to

E[U | Y = y] − E[U² | Y = y] + E[U | Y = y] E[U | Y = y] = E[U | Y = y] − V(U | Y = y) (196)
= a E[X | Y = y] + λ − a² V(X | Y = y). (197)

Combining (195) and (197), we have that

a (d/da) E[X | Y = y] + λ (d/dλ) E[X | Y = y] = −a V(X | Y = y). (198)

This concludes the proof.
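The identity (198) is straightforward to test numerically. The sketch below (our own check; the exponential prior and the values of a, λ and y are arbitrary) computes the conditional mean and variance by quadrature and compares a finite-difference evaluation of the left side of (198) with its right side.

import numpy as np

xg = np.linspace(1e-6, 50, 200_000)    # uniform integration grid
log_px = -xg                           # log-density of X ~ Exp(1), an arbitrary test prior

def moments(a, lam, y):
    # E[X | Y = y] and V(X | Y = y) by quadrature; log-domain weights for stability
    logw = y * np.log(a * xg + lam) - (a * xg + lam) + log_px
    w = np.exp(logw - logw.max())
    m1 = (xg * w).sum() / w.sum()
    m2 = (xg**2 * w).sum() / w.sum()
    return m1, m2 - m1**2

a, lam, y, h = 2.0, 0.7, 3, 1e-5
_, v = moments(a, lam, y)
dda = (moments(a + h, lam, y)[0] - moments(a - h, lam, y)[0]) / (2 * h)
ddl = (moments(a, lam + h, y)[0] - moments(a, lam - h, y)[0]) / (2 * h)
print(a * dda + lam * ddl, -a * v)     # the two sides of (198); they should agree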
APPENDIX E
PROOF OF THEOREM 10

We will need the following definition.

Definition 3. Consider the function

w e^w = f(w) = x, (199)

where x and w are real-valued. The inverse of the function in (199) is known as the Lambert W function.

The Lambert W function and the solutions to (199) have the following properties:
• (199) has a real-valued solution only if x ≥ −1/e; hence, the Lambert W function is real-valued only for x ≥ −1/e; and
• (199) has two solutions if −1/e < x < 0 and a single solution for x ≥ 0; hence, the Lambert W function has two real branches on the interval −1/e < x < 0. The two branches are denoted by W₀ (the principal branch) and W₋₁ (the negative branch).

We are now in a position to prove the upper bound on the conditional expectation. Choose some A > 0 and observe that

E[ U P_{Y|U}(y|U) / P_Y(y) ]
= E[ U (P_{Y|U}(y|U)/P_Y(y)) 1{U P_{Y|U}(y|U) ≤ A P_Y(y)} ] + E[ U (P_{Y|U}(y|U)/P_Y(y)) 1{U P_{Y|U}(y|U) > A P_Y(y)} ]
≤ A + E[ U (P_{Y|U}(y|U)/P_Y(y)) 1{U P_{Y|U}(y|U) > A P_Y(y)} ]. (200)

Now note that the set in (200) can be rewritten as follows:

{ x : x P_{Y|U}(y|x) > A P_Y(y) } = { x : x^{y+1} e^{−x}/y! > A P_Y(y) } (201)
= { x : g_l(y) < x < g_u(y) }, (202)

where

g_u(y) = −(y+1) W₋₁( −(A P_Y(y) y!)^{1/(y+1)} / (y+1) ), (203)
g_l(y) = −(y+1) W₀( −(A P_Y(y) y!)^{1/(y+1)} / (y+1) ). (204)

Moreover, the functions in (203) and (204) are real-valued provided that

A ≤ (1/(P_Y(y) y!)) ((y+1)/e)^{y+1}. (205)

By combining (200) and (202), we have that

E[U | Y = y] ≤ A + g_u(y) (206)
= A − (y+1) W₋₁( −(A P_Y(y) y!)^{1/(y+1)} / (y+1) ). (207)

Next, we use the following bound on W₋₁, shown in [54]: for x ∈ (0, e^{−1}],

−W₋₁(−x) ≤ √(2 (log(1/x) − 1)) + log(1/x) (208)
≤ 2 log(1/x). (209)

Fig. 6 compares the bounds in (208) and (209).
Fig. 6: Bounds on the Lambert W function.

Collecting the bounds in (207) and (209), we arrive at the following bound:

E[U | Y = y] ≤ A + 2(y+1) log( (y+1) / (A P_Y(y) y!)^{1/(y+1)} ). (210)

Finally, by using the upper bound in (30), it is not difficult to check that the choice A = e^{−1} satisfies (205), and we arrive at

E[U | Y = y] / 2 ≤ 1/(2e) + 1 + log( 1/(P_Y(y) y!) ) + (y+1) log(y+1) (211)
≤ 1/(2e) + 1 + log( e^{E[U] − y E[log(U)]} ) + (y+1) log(y+1) (212)
= 1/(2e) + 1 − y E[log(U)] + E[U] + (y+1) log(y+1), (213)

where the inequality in (212) follows by applying the lower bound in (30). This concludes the proof.
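The comparison shown in Fig. 6 can be reproduced with SciPy's lambertw; the following sketch is our own illustration of the bounds (208) and (209) as stated above.

import numpy as np
from scipy.special import lambertw

x = np.linspace(1e-3, np.exp(-1.0), 200)           # W_{-1} is real on (0, 1/e]
w = -lambertw(-x, k=-1).real                       # -W_{-1}(-x)
b208 = np.sqrt(np.maximum(2 * (np.log(1 / x) - 1), 0)) + np.log(1 / x)   # bound (208)
b209 = 2 * np.log(1 / x)                           # bound (209)
print(np.all(w <= b208 + 1e-9), np.all(b208 <= b209 + 1e-9))  # both should print True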
APPENDIX F
PROOF OF THEOREM 11

We will need the following auxiliary result.

Lemma 8. Let Y = P(U). Then, for any t > 0,

E[ (U − (c₁ Y + c₀)) e^{−tY} ] = −(c₁ (s − 1) + 1) L'_U(s) − c₀ L_U(s), (214)

where s = 1 − e^{−t}.

Proof: To compute (214) we have to compute the following terms:

E[U e^{−tY}], E[Y e^{−tY}], and E[e^{−tY}]. (215)

We rewrite each term in (215) in terms of U only. To that end, let

v(t) = e^{−t} − 1 = −s, (216)

in which case e^{−t} = 1 − s and v'(t) = s − 1. Recall that the Laplace transform of a Poisson random variable W with parameter β is given by

E[e^{−tW}] = e^{β v(t)}. (217)

Now,

E[e^{−tY}] = E[ E[e^{−tY} | U] ] (218)
= E[ e^{U v(t)} ] (219)
= L_U(s), (220)

where we have used the fact that Y given U = u has a Poisson distribution with parameter u, together with (217). Moreover, using similar steps,

E[U e^{−tY}] = E[ U E[e^{−tY} | U] ] (221)
= E[ U e^{U v(t)} ] (222)
= E[ U e^{−sU} ] (223)
= −L'_U(s). (224)

Finally,

E[Y e^{−tY}] = −(d/dt) E[e^{−tY}] (225)
= −(d/dt) E[ e^{U v(t)} ] (226)
= −E[ e^{U v(t)} U v'(t) ] (227)
= (s − 1) L'_U(s), (228)

where in (225) we have used (220). Combining (215), (220), (224) and (228) concludes the proof.

We now proceed with the proof of the theorem. Let U = aX + λ and suppose that E[X | Y] = b₁ Y + b₀ for some b₁ and b₀. Then, from (31) we have that

E[U | Y = y] = c₁ y + c₀, (229)

with

c₁ = a b₁, (230)
c₀ = a b₀ + λ. (231)

Then, by the orthogonality principle,

E[ (U − (c₁ Y + c₀)) e^{−tY} ] = 0, (232)

which in view of (214) is equivalent to

−L'_U(s) = c₁ (s − 1) L'_U(s) + c₀ L_U(s). (233)

Therefore, the final differential equation is given by

−(c₁ s − c₁ + 1) L'_U(s) = c₀ L_U(s), (234)

where the boundary condition is given by

L_U(0) = 1. (235)

The solution to this first-order linear ordinary differential equation is unique and is given by

L_U(s) = 1 / ( 1 + (c₁/(1 − c₁)) s )^{c₀/c₁}. (236)

The function in (236) is the Laplace transform of U ~ Gam(c₀/c₁, c₁/(1 − c₁)), where Gam(k, θ) denotes the gamma distribution with shape k and scale θ.

Next, observe that λ = 0. This follows from the definition U = aX + λ and the assumption that X ≥ 0: since U = aX + λ ≥ λ for a non-negative X, a strictly positive λ would be incompatible with the gamma distribution of U, whose support includes values arbitrarily close to zero.

Therefore, using (230) and (231), we have that aX = U ~ Gam(b₀/b₁, a b₁/(1 − a b₁)) and, consequently, X ~ Gam(b₀/b₁, b₁/(1 − a b₁)). This concludes the proof.
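Lemma 8 lends itself to a direct Monte Carlo check, since the Laplace transform of a gamma distribution is available in closed form. In the sketch below (our own; the gamma parameters and the constants c₁, c₀ and t are arbitrary choices), both sides of (214) agree up to sampling error.

import numpy as np

rng = np.random.default_rng(2)
k, th = 3.0, 0.8                       # gamma shape and scale (arbitrary)
c1, c0, t = 0.4, 0.6, 0.9              # arbitrary constants and t > 0
u = rng.gamma(k, th, size=2_000_000)
y = rng.poisson(u)                     # Y = P(U)

s = 1 - np.exp(-t)
L = (1 + th * s) ** -k                 # L_U(s) for Gam(k, th)
dL = -k * th * (1 + th * s) ** (-k - 1)   # L_U'(s)

lhs = np.mean((u - (c1 * y + c0)) * np.exp(-t * y))
rhs = -(c1 * (s - 1) + 1) * dL - c0 * L   # right side of (214)
print(lhs, rhs)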
APPENDIX G
PROOF OF THEOREM 12

We will need two auxiliary results.

Lemma 9. Let φ_U be the characteristic function of some non-negative random variable U. Then, for any α, θ > 0 and τ > 0,

|φ_U(τ) − (1 − iτ/α)^{−θ}| / τ ≤ sup_{t ∈ [0,τ]} | ((t + iα)/α) φ'_U(t) + (θ/α) φ_U(t) |. (237)

Proof: First observe that

φ_U(τ) (1 − iτ/α)^θ − 1 = ∫₀^τ [ (1 − it/α)^θ φ'_U(t) − (iθ/α) (1 − it/α)^{θ−1} φ_U(t) ] dt (238)
= −∫₀^τ i (1 − it/α)^{θ−1} [ ((t + iα)/α) φ'_U(t) + (θ/α) φ_U(t) ] dt. (239)

Next, using the integral representation in (239) and the modulus inequality, we have the following bound:

| φ_U(τ) (1 − iτ/α)^θ − 1 | ≤ ∫₀^τ | (1 − it/α)^{θ−1} | | ((t + iα)/α) φ'_U(t) + (θ/α) φ_U(t) | dt (240)
= ∫₀^τ (1 + t²/α²)^{(θ−1)/2} | ((t + iα)/α) φ'_U(t) + (θ/α) φ_U(t) | dt (241)
≤ sup_{t ∈ [0,τ]} | ((t + iα)/α) φ'_U(t) + (θ/α) φ_U(t) | ∫₀^τ (1 + t²/α²)^{(θ−1)/2} dt. (242)

To conclude the proof, observe that the difference between the characteristic functions is given by

| φ_U(τ) − (1 − iτ/α)^{−θ} | = | (1 − iτ/α)^{−θ} | | φ_U(τ) (1 − iτ/α)^θ − 1 | (243)
= (1 + τ²/α²)^{−θ/2} | φ_U(τ) (1 − iτ/α)^θ − 1 | (244)
≤ sup_{t ∈ [0,τ]} | ((t + iα)/α) φ'_U(t) + (θ/α) φ_U(t) | ( ∫₀^τ (1 + t²/α²)^{(θ−1)/2} dt ) / (1 + τ²/α²)^{θ/2} (245)
≤ sup_{t ∈ [0,τ]} | ((t + iα)/α) φ'_U(t) + (θ/α) φ_U(t) | τ, (246)

where in (245) we have used the bound in (242), and in (246) we have used the fact that for every θ > 0

( ∫₀^τ (1 + t²/α²)^{(θ−1)/2} dt ) / (1 + τ²/α²)^{θ/2} ≤ τ. (247)

This concludes the proof.

Another useful result is the following bound on the Lévy distance; see [55] and [56].

Lemma 10. Let P and Q be two distribution functions with characteristic functions φ_P and φ_Q, respectively. Then,

L(P, Q) / 2 ≤ sup_{t ≥ 0} | (φ_P(t) − φ_Q(t)) / t |. (248)

We now proceed with the proof of the bound. Our starting place is the following consequence of the orthogonality principle: fix some t ∈ ℝ; then

0 = E[ (U − E[U|Y]) e^{itY} ] = E[ ( U − (c₁ Y + c₀) + (c₁ Y + c₀) − E[U|Y] ) e^{itY} ]. (249)

The identity in (249) implies that

E[ (c₁ Y + c₀ − E[U|Y]) e^{itY} ] = −E[ (U − (c₁ Y + c₀)) e^{itY} ] (250)
= −( (1/i) (d/ds) φ_U(s) − c₁ (s − i) (d/ds) φ_U(s) − c₀ φ_U(s) ) (251)
= ( i(1 − c₁) + c₁ s ) (d/ds) φ_U(s) + c₀ φ_U(s), (252)

where in (251) s and t are related via is = e^{it} − 1, and (251) follows from the identity in (214) upon replacing the Laplace transform with the characteristic function.

Now, applying the Cauchy–Schwarz inequality to (252), we have that

| ( i(1 − c₁) + c₁ s ) (d/ds) φ_U(s) + c₀ φ_U(s) | = | E[ (c₁ Y + c₀ − E[U|Y]) e^{itY} ] | (253)
≤ E[ |c₁ Y + c₀ − E[U|Y]| |e^{itY}| ] (254)
≤ √( E[ (c₁ Y + c₀ − E[U|Y])² ] E[ |e^{itY}|² ] ) (255)
= √( E[ (c₁ Y + c₀ − E[U|Y])² ] ). (256)

Setting α = (1 − c₁)/c₁ and θ = c₀/c₁, and combining the bounds in (237) and (256), we have that for all s > 0

| φ_U(s) − (1 − is/α)^{−θ} | / s ≤ √( E[ (c₁ Y + c₀ − E[U|Y])² ] ) / (1 − c₁). (257)

Now, applying (257) and the bound in (248), we have that

L( P_U, Gam(c₀/c₁, c₁/(1 − c₁)) ) ≤ 2 √( E[ (c₁ Y + c₀ − E[U|Y])² ] ) / (1 − c₁) (258)
≤ 2 √ε / (1 − c₁). (259)

This concludes the proof.
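The supremum in Lemma 10 can also be approximated empirically. The sketch below (our own illustration; truncating the supremum to a finite range of t and the gamma parameters are assumptions of the demo, and small t is excluded because the Monte Carlo error in the empirical characteristic function is amplified by the division by t) bounds the Lévy distance between the empirical distribution of gamma samples and the gamma law itself.

import numpy as np

rng = np.random.default_rng(3)
k, th = 2.0, 1.0                        # gamma shape and scale (arbitrary)
u = rng.gamma(k, th, size=50_000)

ts = np.linspace(0.5, 50, 500)          # truncated range for the supremum in (248)
phi_emp = np.array([np.mean(np.exp(1j * t * u)) for t in ts])  # empirical characteristic function
phi_gam = (1 - 1j * ts * th) ** -k      # characteristic function of Gam(k, th)
sup_term = np.max(np.abs(phi_emp - phi_gam) / ts)
print("Levy distance bound via (248):", 2 * sup_term)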
REFERENCES

[1] A. Dytso and H. V. Poor, "Properties of the conditional mean estimator in Poisson noise," in Proc. IEEE Inf. Theory Workshop, Visby, Sweden, Aug. 2019.
[2] J. P. Gordon, "Quantum effects in communications systems," Proc. IRE, vol. 50, no. 9, pp. 1898–1908, 1962.
[3] J. Pierce, E. Posner, and E. Rodemich, "The capacity of the photon counting channel," IEEE Trans. Inf. Theory, vol. 27, no. 1, pp. 61–77, 1981.
[4] D. Brady and S. Verdú, "The asymptotic capacity of the direct detection photon channel with a bandwidth constraint," in Proc. 28th Allerton Conf. Commun., Control and Comp., Monticello, IL, USA, pp. 691–700.
[5] S. Shamai (Shitz), "Capacity of a pulse amplitude modulated direct detection photon channel," IEE Proc. I (Communications, Speech and Vision), vol. 137, no. 6, pp. 424–430, 1990.
[6] A. Banerjee, X. Guo, and H. Wang, "On the optimality of conditional expectation as a Bregman predictor," IEEE Trans. Inf. Theory, vol. 51, no. 7, pp. 2664–2669, 2005.
[7] S. Verdú, "Poisson communication theory," Haifa, Israel, Mar. 1999, invited talk at the International Technion Communication Day in Honor of Israel Bar-David.
[8] A. Lapidoth and S. M. Moser, "On the capacity of the discrete-time Poisson channel," IEEE Trans. Inf. Theory, vol. 55, no. 1, pp. 303–322, 2009.
[9] M. Raginsky, R. M. Willett, Z. T. Harmany, and R. F. Marcia, "Compressed sensing performance bounds under Poisson noise," IEEE Trans. Signal Process., vol. 58, no. 8, pp. 3990–4002, Aug. 2010.
[10] L. Wang, J. Huang, X. Yuan, K. Krishnamurthy, J. Greenberg, V. Cevher, M. R. Rodrigues, D. Brady, R. Calderbank, and L. Carin, "Signal recovery and system calibration from multiple compressive Poisson measurements," SIAM Journal on Imaging Sciences, vol. 8, no. 3, pp. 1923–1954, 2015.
[11] L. Wang, D. E. Carlson, M. Rodrigues, D. Wilcox, R. Calderbank, and L. Carin, "Designed measurements for vector count data," in Advances in Neural Information Processing Systems, 2013, pp. 1142–1150.
[12] J. Grandell, Mixed Poisson Processes. CRC Press, 1997, vol. 77.
[13] L. Wang and Y. Chi, "Stochastic approximation and memory-limited subspace tracking for Poisson streaming data," IEEE Trans. Signal Process., vol. 66, no. 4, pp. 1051–1064, Feb. 2018.
[14] D. Guo, S. Shamai, and S. Verdú, "Mutual information and conditional mean estimation in Poisson channels," IEEE Trans. Inf. Theory, vol. 54, no. 5, pp. 1837–1849, 2008.
[15] R. Atar and T. Weissman, "Mutual information, relative entropy, and estimation in the Poisson channel," IEEE Trans. Inf. Theory, vol. 58, no. 3, pp. 1302–1318, 2012.
[16] D. Guo, S. Shamai, and S. Verdú, "Mutual information and minimum mean-square error in Gaussian channels," IEEE Trans. Inf. Theory, vol. 51, no. 4, pp. 1261–1282, 2005.
[17] S. Verdú, "Mismatched estimation and relative entropy," IEEE Trans. Inf. Theory, vol. 56, no. 8, pp. 3712–3720, Aug. 2010.
[18] L. Wang, D. E. Carlson, M. R. Rodrigues, R. Calderbank, and L. Carin, "A Bregman matrix and the gradient of mutual information for vector Poisson and Gaussian channels," IEEE Trans. Inf. Theory, vol. 60, no. 5, pp. 2611–2629, 2014.
[19] D. P. Palomar and S. Verdú, "Representation of mutual information via input estimates," IEEE Trans. Inf. Theory, vol. 53, no. 2, pp. 453–470, 2007.
[20] J. Jiao, K. Venkat, and T. Weissman, "Mutual information, relative entropy and estimation error in semi-martingale channels," IEEE Trans. Inf. Theory, vol. 64, no. 10, pp. 6662–6671, 2018.
[21] A. I. Zayed, Handbook of Function and Generalized Function Transforms. CRC Press, 1996.
[22] S. I. Resnick, A Probability Path. Springer Science & Business Media, 2013.
[23] P. Billingsley, Probability and Measure. John Wiley & Sons, 2008.
[24] D. Guo, S. Shamai, and S. Verdú, The Interplay Between Information and Estimation Measures. now Publishers Incorporated, 2013.
[25] I. J. Good, "The population frequencies of species and the estimation of population parameters," Biometrika, vol. 40, no. 3-4, pp. 237–264, 1953.
[26] H. Robbins, "An empirical Bayes approach to statistics," in Proc. Third Berkeley Symp. Math. Statist. Probab., 1956.
[27] B. Efron and T. Hastie, Computer Age Statistical Inference. Cambridge University Press, 2016, vol. 5.
[28] A. Dytso, H. V. Poor, and S. Shamai, "A general derivative identity for the conditional mean estimator in Gaussian noise," in Submitted to the IEEE Int. Symp. Inf. Theory, Los Angeles, CA, June 2020.
[29] R. Esposito, "On a relation between detection and estimation in decision theory," Inf. Control, vol. 12, no. 2, pp. 116–120, Feb. 1968.
[30] A. Dytso, M. Al, H. V. Poor, and S. Shamai (Shitz), "On the capacity of the peak power constrained vector Gaussian channel: An estimation theoretic perspective," IEEE Trans. Inf. Theory, vol. 65, no. 6, pp. 3907–3921, 2019.
[31] A. Réveillac, "Likelihood ratios and inference for Poisson channels," IEEE Trans. Inf. Theory, vol. 59, no. 10, pp. 6261–6272, 2013.
[32] A. Kagan, "A discrete version of the Stam inequality and a characterization of the Poisson distribution," Journal of Statistical Planning and Inference, vol. 92, no. 1-2, pp. 7–12, 2001.
[33] I. Kontoyiannis, P. Harremoës, and O. Johnson, "Entropy and the law of small numbers," IEEE Trans. Inf. Theory, vol. 51, no. 2, pp. 466–472, 2005.
[34] O. Johnson and S. Guha, "A de Bruijn identity for discrete random variables," in Proc. IEEE Int. Symp. Inf. Theory, Aachen, Germany, June 2017, pp. 898–902.
[35] L. D. Brown, "Admissible estimators, recurrent diffusions, and insoluble boundary value problems," Ann. Math. Statist., vol. 42, no. 3, pp. 855–903, 1971. [Online]. Available: https://doi.org/10.1214/aoms/1177693318
[36] A. Dytso, M. Fauß, and H. V. Poor, "A class of lower bounds for Bayesian risk with a Bregman loss," arXiv preprint arXiv:2001.10982, 2020.
[37] C. Hatsell and L. Nolte, "Some geometric properties of the likelihood ratio (corresp.)," IEEE Trans. Inf. Theory, vol. 17, no. 5, pp. 616–618, 1971.
[38] A. Jaffer, "A note on conditional moments of random signals in Gaussian noise (corresp.)," IEEE Trans. Inf. Theory, vol. 18, no. 4, pp. 513–514, 1972.
[39] J. A. Shohat and J. D. Tamarkin, The Problem of Moments. American Mathematical Soc., 1943, no. 1.
[40] J. M. Stoyanov, Counterexamples in Probability. Courier Corporation, 2013.
[41] D. Guo, Y. Wu, S. Shamai, and S. Verdú, "Estimation in Gaussian noise: Properties of the minimum mean-square error," IEEE Trans. Inf. Theory, vol. 57, no. 4, pp. 2371–2385, 2011.
[42] V. V. Buldygin and Y. V. Kozachenko, "Sub-Gaussian random variables," Ukrainian Mathematical Journal, vol. 32, no. 6, pp. 483–489, 1980.
[43] Y. Wu and S. Verdú, "Functional properties of minimum mean-square error and mutual information," IEEE Trans. Inf. Theory, vol. 58, no. 3, pp. 1289–1301, 2012.
[44] A. Dytso, H. V. Poor, R. Bustin, and S. Shamai, "On the structure of the least favorable prior distributions," in Proc. IEEE Int. Symp. Inf. Theory, Aachen, Germany, June 2018, pp. 1081–1085.
[45] M. Fozunbal, "On regret of parametric mismatch in minimum mean square error estimation," in Proc. IEEE Int. Symp. Inf. Theory, Austin, TX, USA, June 2010, pp. 1408–1412.
[46] P. Diaconis and D. Ylvisaker, "Conjugate priors for exponential families," The Annals of Statistics, pp. 269–281, 1979.
[47] N. Johnson, "Uniqueness of a result in the theory of accident proneness," Biometrika, vol. 44, no. 3-4, pp. 530–531, 1957.
[48] R. M. Dudley, Real Analysis and Probability. Cambridge University Press, 2002, vol. 74.
[49] F. du Pin Calmon, Y. Polyanskiy, and Y. Wu, "Strong data processing inequalities for input constrained additive noise channels," IEEE Trans. Inf. Theory, vol. 64, no. 3, pp. 1879–1892, 2018.
[50] A. Dytso and H. V. Poor, "On stability of linear estimators in Poisson noise," 2019.
[51] A. Dytso, M. Fauß, and H. V. Poor, "Vector Poisson channel: On the linearity of the conditional mean estimator," 2020, in preparation.
[52] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, no. Oct, pp. 1705–1749, 2005.
[53] Y. Wu and S. Verdú, "MMSE dimension," IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 4857–4879, Aug. 2011.
[54] I. Chatzigeorgiou, "Bounds on the Lambert function and their application to the outage analysis of user cooperation," IEEE Commun. Lett., vol. 17, no. 8, pp. 1505–1508, Aug. 2013.
[55] H. Bohman, "Approximate Fourier analysis of distribution functions," Ark. Mat., vol. 4, no. 2-3, pp. 99–157, 1961. [Online]. Available: https://doi.org/10.1007/BF02592003
[56] S. G. Bobkov, "Proximity of probability distributions in terms of Fourier–Stieltjes transforms,"