Fisher Information and Mutual Information Constraints
Leighton Pate Barnes and Ayfer Özgür
Stanford University, Stanford, CA 94305
Email: {lpb, aozgur}@stanford.edu

Abstract: We consider the processing of statistical samples $X \sim P_\theta$ by a channel $p(y|x)$, and characterize how the statistical information from the samples for estimating the parameter $\theta \in \mathbb{R}^d$ can scale with the mutual information or capacity of the channel. We show that if the statistical model has a sub-Gaussian score function, then the trace of the Fisher information matrix for estimating $\theta$ from $Y$ can scale at most linearly with the mutual information between $X$ and $Y$. We apply this result to obtain minimax lower bounds in distributed statistical estimation problems, and obtain a tight preconstant for Gaussian mean estimation. We then show how our Fisher information bound can also imply mutual information or Jensen-Shannon divergence based distributed strong data processing inequalities.

I. INTRODUCTION
In this work, we consider the processing of statistical samples $X \sim P_\theta$ by a channel $p(y|x)$, and try to understand how the statistical information from the samples for estimating the parameter $\theta$ can scale with the mutual information or capacity of the channel. In particular, we begin by looking at the Fisher information for estimating $\theta$ from the processed data $Y$. Fisher information describes the curvature of the statistical model as one moves around the parameter space, and immediately implies lower bounds for the error in estimating $\theta$ from the statistical samples via the well-known Cramér-Rao lower bound [1], [2] for unbiased estimators. It can also imply lower bounds for arbitrarily biased estimators in an asymptotic sense [3], or in a Bayesian setting via the van Trees inequality [4].

Fisher information satisfies a data processing inequality in the sense that it must decrease during processing [5]. In our main result, we develop a strong data processing inequality that more precisely quantifies the maximum possible Fisher information from $Y$ after processing the data, in terms of the mutual information between $X$ and $Y$. This is a generalization of recent results on Fisher information due to the authors in both the communication constrained [6] and privacy constrained [7] settings. Other works such as [8] have also considered statistical inference under communication and privacy constraints, but have not considered the more general class of mutual information constrained channels. One application where general mutual information constrained channels are needed is when communication of statistical samples is done over an analog channel such as an additive white Gaussian noise channel or Gaussian multiple access channel, such as in recent works [9] and [10] which seek to jointly study the communication and estimation problem.

In our main result, we show that if the score of the statistical model is sub-Gaussian, an assumption that is also used in previous works [6], [7], then the trace of the Fisher information matrix for estimating $\theta$ from $Y$ can scale at most linearly with the mutual information between $X$ and $Y$. We also show by the example of the Gaussian location model that the preconstant we obtain is optimal and cannot be improved. We then apply this result in a distributed setting where there are $n$ nodes each with a sample $X_i$ taken independently from $P_\theta$, and where the nodes can communicate in multiple rounds of communication via a public blackboard. We show how Fisher information from the total blackboard transcript can be similarly bounded, and develop minimax lower bounds in this distributed estimation setting. Finally, in the last section, we show how our Fisher information upper bound can also imply mutual information or Jensen-Shannon divergence based distributed strong data processing inequalities similar to those from [11].

This work was supported in part by NSF award CCF-1704624 and by a Google Faculty Research Award.

II. PRELIMINARIES
Suppose that $\{P_\theta\}_{\theta \in \Theta}$ is a family of probability distributions parametrized by $\theta \in \Theta \subseteq \mathbb{R}^d$ that is dominated by some sigma-finite measure $\mu$. Let $p_\theta$ be the density of $P_\theta$ with respect to $\mu$. Let $X$ be a statistical sample drawn from $P_\theta$, and let $Y$ be the output of a channel with transition probability density $p(y|x)$ (with respect to some dominating measure $\nu$ on the sample space $\mathcal{Y}$) when $x$ is the input. In this paper we analyze the Fisher information for estimating $\theta$ from the processed sample $Y$ and show how it can scale with the mutual information $I_\theta(X;Y)$.

Recall some Fisher information basics. The score for $X$ is defined as $S_\theta(X) = \nabla_\theta \log p_\theta(X)$. The Fisher information matrix for the processed samples $Y$ is
$$ I_Y(\theta) = \mathbb{E}\left[ (\nabla_\theta \log p_\theta(Y)) (\nabla_\theta \log p_\theta(Y))^T \right] $$
and we can characterize the trace of this matrix by
$$ \mathrm{Tr}(I_Y(\theta)) = \mathbb{E}_Y\left[ \left\| \mathbb{E}_X[S_\theta(X) \,|\, Y] \right\|^2 \right]. \quad (1) $$
We use $I_\theta(X;Y)$ to denote the mutual information between $X$ and $Y$ when $X$ is drawn from $P_\theta$. In the case that there is a prior distribution on $\theta$ this is the same as $I(X;Y|\theta)$.

The decomposition in (1) is one of the main tools used by the authors in [6], [7], and we defer its proof to those references. The proof is a straightforward computation, but requires the interchange of limiting operations, specifically the interchange between integration over the sample space $\mathcal{X}$ and differentiation with respect to the parameter components $\theta_j$. This interchange can be justified under the following regularity conditions (see [12] §26 Lemma 1):

(i) The square-root density $\sqrt{p_\theta(x)}$ is continuously differentiable with respect to each component $\theta_j$ at $\mu$-almost all $x$.

(ii) The Fisher information for each component $\theta_j$, $\mathbb{E}\left[ \left( \frac{\partial}{\partial \theta_j} \log p_\theta(X) \right)^2 \right]$, exists and is a continuous function of $\theta_j$.

One consequence of these two conditions is that the score random vector has mean zero, i.e., $\mathbb{E}[S_\theta(X)] = 0$. One additional regularity condition is needed in the present paper to interchange limits because we are not assuming that the channel $p(y|x)$ has a finite output alphabet $\mathcal{Y}$ (such as in [6]), or a point-wise local differential privacy condition (such as in [7]):

(iii) The channel $p(y|x)$ is square integrable in the sense that $\int p(y|x)^2 p_\theta(x) \, d\mu(x) < \infty$ at $\nu$-almost all $y$ for each $\theta$.

Finally, recall some basic facts about sub-Gaussian random variables. We say that a mean-zero random variable $X$ is sub-Gaussian with parameter $N$ if $\mathbb{E}\left[ e^{\lambda X} \right] \le e^{\lambda^2 N / 2}$ for all $\lambda \in \mathbb{R}$. Furthermore, such random variables $X$ satisfy
$$ \mathbb{E}\left[ e^{\lambda X^2 / N} \right] \le \frac{1}{\sqrt{1 - 2\lambda}} $$
for each $\lambda \in [0, 1/2)$ (see, for example, [13] §2.1 and §2.4).
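As a quick numerical illustration of the decomposition in (1) (our own sketch, not part of the original development), the following Python snippet takes the scalar Gaussian model $P_\theta = N(\theta, 1)$ with the one-bit channel $Y = \mathbf{1}\{X > 0\}$, estimates $\mathbb{E}_Y[\|\mathbb{E}_X[S_\theta(X)\,|\,Y]\|^2]$ by Monte Carlo, and compares it with the closed-form Fisher information of $Y$ for this toy model, $\varphi(\theta)^2/(\Phi(\theta)(1-\Phi(\theta)))$. The model, channel, and sample size are illustrative choices.

```python
# Monte Carlo check of the decomposition (1): Tr(I_Y(theta)) = E_Y[ ||E[S_theta(X)|Y]||^2 ]
# for the toy model X ~ N(theta, 1) and the one-bit channel Y = 1{X > 0}.
import numpy as np
from scipy.stats import norm

theta, n = 0.3, 1_000_000
rng = np.random.default_rng(0)

x = rng.normal(theta, 1.0, size=n)
s = x - theta                       # score of N(theta, 1): d/dtheta log p_theta(x) = x - theta
y = (x > 0).astype(int)             # one-bit channel output

# E_Y[ (E[S|Y])^2 ]: average the squared conditional mean of the score over the two outputs.
rhs = sum((y == v).mean() * s[y == v].mean() ** 2 for v in (0, 1))

# Closed-form Fisher information of Y for this model.
phi, Phi = norm.pdf(theta), norm.cdf(theta)
exact = phi ** 2 / (Phi * (1 - Phi))

print(f"Monte Carlo E_Y[(E[S|Y])^2] = {rhs:.4f}, closed form I_Y(theta) = {exact:.4f}")
```

The two printed quantities agree up to Monte Carlo error, which is exactly the content of (1) in this example.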
III. FISHER INFORMATION BOUND

In this section we state and prove the following main result concerning how Fisher information can scale with the mutual information between $X$ and $Y$.
Theorem 1. Suppose that $\langle u, S_\theta(X) \rangle$ is sub-Gaussian with parameter $N$ for any unit vector $u \in \mathbb{R}^d$. Under regularity conditions (i)-(iii) above,
$$ \mathrm{Tr}(I_Y(\theta)) \le 2 N \, I_\theta(X;Y). $$

Note that if $Y \in [1 : 2^k]$, i.e., $Y$ is constrained to being a $k$-bit message, then $I_\theta(X;Y) \le H(Y) \le k$ and we recover the sub-Gaussian case from Theorem 2 of [6] as a special case. Similarly, if $p(y|x)$ is a locally differentially private mechanism in the sense that $\frac{p(y|x)}{p(y|x')} \le e^{\varepsilon}$ for any $x, x', y$, then $I_\theta(X;Y) = O(\min\{\varepsilon, \varepsilon^2\})$, which recovers Propositions 2 and 4 from [7], again as a special case.

Note also that the mutual information $I_\theta(X;Y)$ is upper bounded by the capacity of the channel $p(y|x)$, since the capacity is the maximum mutual information over all possible input distributions. In this way we can also think of the above theorem as an upper bound on Fisher information in terms of the capacity of the channel that processes the samples.
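Before turning to the proof, here is a small numerical check of the inequality in Theorem 1 (an illustrative sketch of ours rather than anything from the paper): for $X \sim N(\theta, \sigma^2)$ quantized to $Y = \mathbf{1}\{X > 0\}$, the score is sub-Gaussian with parameter $N = 1/\sigma^2$, $\mathrm{Tr}(I_Y(\theta)) = \varphi(\theta/\sigma)^2/(\sigma^2 \Phi(\theta/\sigma)(1-\Phi(\theta/\sigma)))$, and $I_\theta(X;Y) = h_b(\Phi(\theta/\sigma))$ in nats since $Y$ is a deterministic function of $X$.

```python
# Numerical check of Theorem 1 for X ~ N(theta, sigma^2) and Y = 1{X > 0}:
#   Tr(I_Y(theta)) <= 2 * N * I_theta(X;Y),   with N = 1/sigma^2.
import numpy as np
from scipy.stats import norm

def binary_entropy_nats(p):
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

sigma = 1.0
N = 1.0 / sigma**2
for theta in [0.0, 0.5, 1.0, 2.0, 3.0]:
    t = theta / sigma
    fisher_Y = norm.pdf(t)**2 / (sigma**2 * norm.cdf(t) * norm.cdf(-t))
    mi = binary_entropy_nats(norm.cdf(t))      # I(X;Y) = H(Y) since Y is a function of X
    print(f"theta={theta:.1f}:  Tr(I_Y)={fisher_Y:.4f}  <=  2*N*I(X;Y)={2*N*mi:.4f}")
```

Running it confirms the bound at each of the listed parameter values for this example.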
Proof. We begin by "lifting" the problem to higher dimensions by considering a new $dB$-dimensional statistical model
$$ \mathbf{X} \sim P_{\boldsymbol{\theta}} = \prod_{i=1}^{B} P_{\theta_i} $$
where $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_B)$, $\mathbf{X} = (X_1, \ldots, X_B)$, and $\mathbf{Y} = (Y_1, \ldots, Y_B)$. Each $X_i$ is drawn independently according to $P_{\theta_i}$ from the original $d$-dimensional model, and each $Y_i$ is the corresponding output of the channel. Note that
$$ \mathrm{Tr}(I_{\mathbf{Y}}(\boldsymbol{\theta})) = \sum_{i=1}^{B} \mathrm{Tr}(I_{Y_i}(\theta_i)) \quad (2) $$
and that when $\theta_1 = \theta_2 = \ldots = \theta_B = \theta$ we have
$$ \mathrm{Tr}(I_{\mathbf{Y}}(\boldsymbol{\theta})) = B \, \mathrm{Tr}(I_Y(\theta)). \quad (3) $$
We will therefore proceed by analyzing $\mathrm{Tr}(I_{\mathbf{Y}}(\boldsymbol{\theta}))$ evaluated at the specific $\boldsymbol{\theta}$ values with $\theta_1 = \theta_2 = \ldots = \theta_B$. Note that by taking scaled sums of independent sub-Gaussian random variables, the new $dB$-dimensional model has a score function that is sub-Gaussian with the same constant $N$ as that of the original model. We use the decomposition shown in (1) for the new $dB$-dimensional model:
$$ \mathrm{Tr}(I_{\mathbf{Y}}(\boldsymbol{\theta})) = \mathbb{E}_{\mathbf{Y}}\left[ \left\| \mathbb{E}_{\mathbf{X}}[S_{\boldsymbol{\theta}}(\mathbf{X}) \,|\, \mathbf{Y}] \right\|^2 \right] = \mathbb{E}_{\mathbf{Y}}\left[ \left( \mathbb{E}_{\mathbf{X}}\left[ \frac{p(\mathbf{Y}|\mathbf{X})}{p_{\boldsymbol{\theta}}(\mathbf{Y})} \langle u_{\mathbf{Y}}, S_{\boldsymbol{\theta}}(\mathbf{X}) \rangle \right] \right)^2 \right] \quad (4) $$
where
$$ u_{\mathbf{Y}} = \frac{\mathbb{E}_{\mathbf{X}}[S_{\boldsymbol{\theta}}(\mathbf{X}) \,|\, \mathbf{Y}]}{\left\| \mathbb{E}_{\mathbf{X}}[S_{\boldsymbol{\theta}}(\mathbf{X}) \,|\, \mathbf{Y}] \right\|}. $$
The key point, and the reason for doing the lifting step, is that the ratio inside the expectation in (4) concentrates around $e^{B I_\theta(X;Y)}$ as $B$ gets large. More concretely, by the strong law of large numbers,
$$ \frac{1}{B} \log \frac{p(\mathbf{y}|\mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{y})} = \frac{1}{B} \log \prod_{i=1}^{B} \frac{p(y_i|x_i)}{p_\theta(y_i)} = \frac{1}{B} \sum_{i=1}^{B} \log \frac{p(y_i|x_i)}{p_\theta(y_i)} $$
converges almost surely to
$$ \mathbb{E}_{X,Y}\left[ \log \frac{p(Y|X)}{p_\theta(Y)} \right] = I_\theta(X;Y) $$
as $B \to \infty$. Therefore
$$ \frac{p(\mathbf{Y}|\mathbf{X})}{p_{\boldsymbol{\theta}}(\mathbf{Y})} \le e^{B (I_\theta(X;Y) + \varepsilon_B)} $$
with probability at least $1 - \varepsilon_B$ for some $\varepsilon_B$ that converges to zero as $B \to \infty$. Let $A_B$ be the event
$$ \left\{ (\mathbf{x}, \mathbf{y}) \in \mathcal{X}^B \times \mathcal{Y}^B : \frac{p(\mathbf{y}|\mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{y})} \le e^{B (I_\theta(X;Y) + \varepsilon_B)} \right\} $$
and let $A_B^C$ be its complement. Following from (4),
$$ \mathbb{E}_{\mathbf{Y}}\left[ \left( \mathbb{E}_{\mathbf{X}}\left[ \frac{p(\mathbf{Y}|\mathbf{X})}{p_{\boldsymbol{\theta}}(\mathbf{Y})} \langle u_{\mathbf{Y}}, S_{\boldsymbol{\theta}}(\mathbf{X}) \rangle \right] \right)^2 \right] \le \mathbb{E}_{\mathbf{Y}}\left[ \mathbb{E}_{\mathbf{X}}\left[ \frac{p(\mathbf{Y}|\mathbf{X})}{p_{\boldsymbol{\theta}}(\mathbf{Y})} \langle u_{\mathbf{Y}}, S_{\boldsymbol{\theta}}(\mathbf{X}) \rangle^2 \right] \right] $$
$$ = \iint_{A_B} \frac{p(\mathbf{y}|\mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{y})} \langle u_{\mathbf{y}}, S_{\boldsymbol{\theta}}(\mathbf{x}) \rangle^2 \, p_{\boldsymbol{\theta}}(\mathbf{x}) p_{\boldsymbol{\theta}}(\mathbf{y}) \, d\mu(\mathbf{x}) \, d\nu(\mathbf{y}) \quad (5) $$
$$ + \iint_{A_B^C} \frac{p(\mathbf{y}|\mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{y})} \langle u_{\mathbf{y}}, S_{\boldsymbol{\theta}}(\mathbf{x}) \rangle^2 \, p_{\boldsymbol{\theta}}(\mathbf{x}) p_{\boldsymbol{\theta}}(\mathbf{y}) \, d\mu(\mathbf{x}) \, d\nu(\mathbf{y}). \quad (6) $$
We bound the term (5) as follows:
$$ \exp\left( \iint \mathbf{1}_{A_B}(\mathbf{x},\mathbf{y}) \frac{p(\mathbf{y}|\mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{y})} \, \lambda \, \frac{\langle u_{\mathbf{y}}, S_{\boldsymbol{\theta}}(\mathbf{x}) \rangle^2}{N} \, p_{\boldsymbol{\theta}}(\mathbf{x}) p_{\boldsymbol{\theta}}(\mathbf{y}) \, d\mu(\mathbf{x}) \, d\nu(\mathbf{y}) \right) $$
$$ \le \iint \frac{p(\mathbf{y}|\mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{y})} \exp\left( \mathbf{1}_{A_B}(\mathbf{x},\mathbf{y}) \, \lambda \, \frac{\langle u_{\mathbf{y}}, S_{\boldsymbol{\theta}}(\mathbf{x}) \rangle^2}{N} \right) p_{\boldsymbol{\theta}}(\mathbf{x}) p_{\boldsymbol{\theta}}(\mathbf{y}) \, d\mu(\mathbf{x}) \, d\nu(\mathbf{y}) $$
$$ \le \varepsilon_B + \iint_{A_B} \frac{p(\mathbf{y}|\mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{y})} \exp\left( \lambda \, \frac{\langle u_{\mathbf{y}}, S_{\boldsymbol{\theta}}(\mathbf{x}) \rangle^2}{N} \right) p_{\boldsymbol{\theta}}(\mathbf{x}) p_{\boldsymbol{\theta}}(\mathbf{y}) \, d\mu(\mathbf{x}) \, d\nu(\mathbf{y}) $$
$$ \le \varepsilon_B + e^{B (I_\theta(X;Y) + \varepsilon_B)} \iint_{A_B} \exp\left( \lambda \, \frac{\langle u_{\mathbf{y}}, S_{\boldsymbol{\theta}}(\mathbf{x}) \rangle^2}{N} \right) p_{\boldsymbol{\theta}}(\mathbf{x}) p_{\boldsymbol{\theta}}(\mathbf{y}) \, d\mu(\mathbf{x}) \, d\nu(\mathbf{y}) $$
$$ \le \varepsilon_B + \frac{e^{B (I_\theta(X;Y) + \varepsilon_B)}}{\sqrt{1 - 2\lambda}} $$
for $0 \le \lambda < 1/2$. Taking logs,
$$ \iint_{A_B} \frac{p(\mathbf{y}|\mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{y})} \langle u_{\mathbf{y}}, S_{\boldsymbol{\theta}}(\mathbf{x}) \rangle^2 \, p_{\boldsymbol{\theta}}(\mathbf{x}) p_{\boldsymbol{\theta}}(\mathbf{y}) \, d\mu(\mathbf{x}) \, d\nu(\mathbf{y}) \le \frac{N}{\lambda} \log\left( \varepsilon_B + \frac{e^{B (I_\theta(X;Y) + \varepsilon_B)}}{\sqrt{1 - 2\lambda}} \right). \quad (7) $$
For the other term (6),
$$ \iint_{A_B^C} \frac{p(\mathbf{y}|\mathbf{x})}{p_{\boldsymbol{\theta}}(\mathbf{y})} \langle u_{\mathbf{y}}, S_{\boldsymbol{\theta}}(\mathbf{x}) \rangle^2 \, p_{\boldsymbol{\theta}}(\mathbf{x}) p_{\boldsymbol{\theta}}(\mathbf{y}) \, d\mu(\mathbf{x}) \, d\nu(\mathbf{y}) \le \iint_{A_B^C} p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{y}) \left\| S_{\boldsymbol{\theta}}(\mathbf{x}) \right\|^2 d\mu(\mathbf{x}) \, d\nu(\mathbf{y}) $$
$$ \le \left( \varepsilon_B \iint p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{y}) \left\| S_{\boldsymbol{\theta}}(\mathbf{x}) \right\|^4 d\mu(\mathbf{x}) \, d\nu(\mathbf{y}) \right)^{1/2} \quad (8) $$
$$ = \left( \varepsilon_B \int p_{\boldsymbol{\theta}}(\mathbf{x}) \left\| S_{\boldsymbol{\theta}}(\mathbf{x}) \right\|^4 d\mu(\mathbf{x}) \right)^{1/2} \le \left( c \, \varepsilon_B (dB)^2 N^2 \right)^{1/2} \quad (9) $$
for some absolute constant $c$. To get (8) we have used the Cauchy-Schwarz inequality, and (9) follows by using the sub-Gaussianity of each of the $dB$ different terms in $\| S_{\boldsymbol{\theta}}(\mathbf{X}) \|^2$ to bound their fourth moments.
Combining (7) and (9),
$$ \mathrm{Tr}(I_Y(\theta)) = \frac{1}{B} \mathrm{Tr}(I_{\mathbf{Y}}(\boldsymbol{\theta})) \le \frac{N}{\lambda B} \log\left( \varepsilon_B + \frac{e^{B (I_\theta(X;Y) + \varepsilon_B)}}{\sqrt{1 - 2\lambda}} \right) + \left( c \, \varepsilon_B d^2 N^2 \right)^{1/2} $$
and taking $B \to \infty$ and then $\lambda \to 1/2$,
$$ \mathrm{Tr}(I_Y(\theta)) \le 2 N \, I_\theta(X;Y). $$
IV. DISTRIBUTED STATISTICAL ESTIMATION
The upper bound on Fisher information from Section III can be of particular interest in a distributed setting, where there are $n$ distinct nodes each with a sample $X_i$ taken i.i.d. from $P_\theta$. In this setting, the nodes communicate information about their samples via a "public blackboard" in multiple rounds of communication, and then a centralized estimator uses the blackboard transcript after communication to construct an estimate $\hat{\theta}$ of the parameter $\theta$.

More formally, on round $t = 1, \ldots, T$ of communication, each node $i = 1, \ldots, n$ communicates a random variable $Y_{i,t}$ according to its local data $X_i$ and the previous information written on the transcript $Y_{i-1,t}, Y_{i-2,t}, \ldots, Y_{1,t}, \Pi_{t-1}, \ldots, \Pi_1$ via a channel $p(y_{i,t} \,|\, x_i, y_{i-1,t}, \ldots, y_{1,t}, \pi_{t-1}, \ldots, \pi_1)$. Here we define $\pi_t = (y_{1,t}, \ldots, y_{n,t})$ to be the information written to the public blackboard after communication on round $t$. The estimator $\hat{\theta}$ is then a function of the total blackboard transcript $\Pi = (\Pi_1, \ldots, \Pi_T)$.
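To make the protocol concrete, the following sketch simulates a hypothetical special case with $T = 1$ and a non-interactive one-bit channel at each node; the specific channel and the plug-in estimator are our own illustrative choices and are not prescribed by the paper.

```python
# A minimal, non-interactive (T = 1) instance of the blackboard protocol: each node i
# writes Y_i = channel(X_i) to the blackboard, and the estimator only sees the
# transcript Pi = (Y_1, ..., Y_n).  The one-bit channel and the estimator below are
# illustrative choices.
import numpy as np
from scipy.stats import norm

def run_protocol(theta, n, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(theta, sigma, size=n)        # local samples X_i ~ N(theta, sigma^2)
    transcript = (x > 0).astype(int)            # round t = 1: each node writes one bit
    # Estimator from the transcript: invert P(Y_i = 1) = Phi(theta/sigma) using the
    # empirical frequency of ones on the blackboard.
    p_hat = np.clip(transcript.mean(), 1e-6, 1 - 1e-6)
    return sigma * norm.ppf(p_hat)

print(run_protocol(theta=0.3, n=100_000))       # close to the true theta = 0.3
```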
By using Theorem 1, we have the following upper bound on Fisher information from the transcript $\Pi$.

Corollary 1. Suppose that $\langle u, S_\theta(X) \rangle$ is sub-Gaussian with parameter $N$ for any unit vector $u \in \mathbb{R}^d$. Under regularity conditions (i)-(iii) above,
$$ \mathrm{Tr}(I_\Pi(\theta)) \le 2 N \, I_\theta(X_1, \ldots, X_n; \Pi). $$
Proof. Using the chain rule for Fisher information,
$$ \mathrm{Tr}(I_\Pi(\theta)) = \sum_{t=1}^{T} \sum_{i=1}^{n} \mathrm{Tr}\left( I_{Y_{i,t} \,|\, Y_{i-1,t}, Y_{i-2,t}, \ldots, Y_{1,t}, \Pi_{t-1}, \ldots, \Pi_1}(\theta) \right) $$
$$ = \sum_{t,i} \int \mathrm{Tr}\left( I_{Y_{i,t} \,|\, y_{i-1,t}, y_{i-2,t}, \ldots, y_{1,t}, \pi_{t-1}, \ldots, \pi_1}(\theta) \right) dP(y_{i-1,t}, y_{i-2,t}, \ldots, y_{1,t}, \pi_{t-1}, \ldots, \pi_1). \quad (10) $$
Then by applying Theorem 1 to each term inside the integral in (10),
$$ \mathrm{Tr}(I_\Pi(\theta)) \le 2 N \sum_{t,i} I_\theta\left( X_i ; Y_{i,t} \,|\, Y_{i-1,t}, Y_{i-2,t}, \ldots, Y_{1,t}, \Pi_{t-1}, \ldots, \Pi_1 \right). $$
Since $Y_{i,t}$ is independent of $X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n$ when conditioned on $X_i$ and the past data $Y_{i-1,t}, Y_{i-2,t}, \ldots, Y_{1,t}, \Pi_{t-1}, \ldots, \Pi_1$, the chain rule for mutual information gives
$$ \mathrm{Tr}(I_\Pi(\theta)) \le 2 N \, I_\theta(X_1, \ldots, X_n; \Pi) $$
as desired.

We now consider the special case of the Gaussian location model where $P_\theta = N(\theta, \sigma^2 I_d)$ and $\Theta = [-1, 1]^d$. It can be readily checked that in this case $S_\theta(X) = (X - \theta)/\sigma^2 \sim N(0, \sigma^{-2} I_d)$, so that $\langle u, S_\theta(X) \rangle$ is sub-Gaussian with parameter $N = 1/\sigma^2$ for any unit vector $u$. This leads to the following upper bound on Fisher information, where we will see shortly that the constant on the right-hand side is optimal.
Corollary 2.
Suppose that $P_\theta = N(\theta, \sigma^2 I_d)$. Then
$$ \mathrm{Tr}(I_\Pi(\theta)) \le \frac{2}{\sigma^2} \, I_\theta(X_1, \ldots, X_n; \Pi). $$

By using Corollary 2 along with the multivariate van Trees inequality from [4], we immediately get the following lower bound on the minimax risk in estimating $\theta$ from the transcript $\Pi$.
Corollary 3. Suppose that $P_\theta = N(\theta, \sigma^2 I_d)$ and $\Theta = [-1, 1]^d$. Then
$$ \sup_{\theta \in \Theta} \mathbb{E}\left\| \hat{\theta}(\Pi) - \theta \right\|^2 \ge \frac{d^2}{\frac{2}{\sigma^2} \sup_{\theta \in \Theta} I_\theta(X_1, \ldots, X_n; \Pi) + \pi^2 d}. $$

In order to see that the constant 2 in Corollaries 2 and 3 is optimal, consider the following example. Let $d = 1$ and $T = 1$ and suppose that communication is done independently over additive white Gaussian noise channels so that $Y_i = X_i + W_i$ where $W_i \sim N(0, \sigma_{\mathrm{noise}}^2)$ and $\Pi = (Y_1, \ldots, Y_n)$. In this case
$$ I_\theta(X_1, \ldots, X_n; \Pi) = \frac{n}{2} \log\left( 1 + \frac{\sigma^2}{\sigma_{\mathrm{noise}}^2} \right) $$
and Corollary 3 gives a lower bound of
$$ \sup_{\theta \in \Theta} \mathbb{E}\left[ (\hat{\theta}(\Pi) - \theta)^2 \right] \ge \frac{\sigma^2}{n \log\left( 1 + \frac{\sigma^2}{\sigma_{\mathrm{noise}}^2} \right)} $$
assuming that $n$ is large enough so that the second term in the denominator is negligible. The simple averaging estimator
$$ \hat{\theta}(\Pi) = \frac{1}{n} \sum_{i=1}^{n} Y_i $$
is unbiased and has variance $\frac{\sigma^2}{n}\left( 1 + \frac{\sigma_{\mathrm{noise}}^2}{\sigma^2} \right)$. In the regime where $\sigma_{\mathrm{noise}}^2 \gg \sigma^2$, both the lower bound and the expected squared error become $\frac{\sigma_{\mathrm{noise}}^2}{n}$, and thus any constant less than 2 in Corollary 2 would lead to a contradiction.

The distributed estimation setting considered here is the same setting that is considered in past works [6], [7], except that here we do not assume the channels $p(y_{i,t} \,|\, x_i, y_{i-1,t}, \ldots, y_{1,t}, \pi_{t-1}, \ldots, \pi_1)$ have a particular communication or privacy-constrained structure, and instead leave them general and have the constraint be the total mutual information $I_\theta(X_1, \ldots, X_n; \Pi)$. The distributed data-processing inequality from [11] can also apply to general mutual information constrained settings, but requires a bounded likelihood ratio assumption that we do not make. One of the benefits of our Fisher information bound is that it can be applied to the Gaussian location model directly, without the need to truncate the Gaussians so that they satisfy this bounded likelihood ratio assumption. In the subsequent section we show how our Fisher information bounds can also imply Jensen-Shannon divergence based strong data processing inequalities that are very similar to those from [11].
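Returning to the AWGN example above, a quick numerical companion (our own sketch, with arbitrary parameter values) evaluates the lower bound $\sigma^2/(n \log(1 + \sigma^2/\sigma_{\mathrm{noise}}^2))$ and the variance $(\sigma^2 + \sigma_{\mathrm{noise}}^2)/n$ of the averaging estimator, showing both approaching $\sigma_{\mathrm{noise}}^2/n$ as the channel noise grows.

```python
# The Gaussian-over-AWGN example: compare the Corollary 3 lower bound with the
# variance of the simple averaging estimator as the channel noise grows.
import numpy as np

sigma2, n = 1.0, 10_000
for sigma2_noise in [1.0, 10.0, 100.0, 1000.0]:
    lower_bound = sigma2 / (n * np.log1p(sigma2 / sigma2_noise))   # Corollary 3 bound (large n)
    averaging_mse = (sigma2 + sigma2_noise) / n                    # variance of (1/n) * sum Y_i
    reference = sigma2_noise / n                                   # common high-noise limit
    print(f"noise={sigma2_noise:7.1f}  bound={lower_bound:.5f}  "
          f"avg-estimator MSE={averaging_mse:.5f}  sigma2_noise/n={reference:.5f}")
```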
V. RELATION TO DIVERGENCE BOUNDS

Because of Fisher information's interpretation as a second-order approximation of KL divergence (or Jensen-Shannon divergence) as one moves around the parameter space $\Theta$, the above upper bounds on Fisher information can also imply similar upper bounds on the divergence between two distributions from $\{Q_\theta\}_{\theta \in \Theta}$, where $Q_\theta = P_\theta^n \circ P_{\Pi \,|\, X_1, \ldots, X_n}$ is the induced distribution for $\Pi$.

Instead of working with KL divergence directly, we will instead analyze the related Jensen-Shannon divergence because of its nicer properties. In particular, its square root satisfies the triangle inequality, which we describe and use below. Let
$$ \mathrm{JS}(P \,\|\, Q) = \mathrm{KL}\left( P \,\Big\|\, \frac{P + Q}{2} \right) + \mathrm{KL}\left( Q \,\Big\|\, \frac{P + Q}{2} \right) $$
be the Jensen-Shannon divergence between distributions $P$ and $Q$, where $\mathrm{KL}(P \,\|\, Q)$ is the usual KL divergence. The square root of the Jensen-Shannon divergence satisfies the triangle inequality [14] in that
$$ \sqrt{\mathrm{JS}(P \,\|\, Q)} \le \sqrt{\mathrm{JS}(P \,\|\, R)} + \sqrt{\mathrm{JS}(Q \,\|\, R)} $$
for any distributions $P$, $Q$, and $R$.

In this section we will need the following additional regularity condition.

(iv) Suppose that $\mathrm{JS}(Q_\theta \,\|\, Q_{\theta + \Delta\theta})$ can be represented by its Taylor expansion
$$ \mathrm{JS}(Q_\theta \,\|\, Q_{\theta + \Delta\theta}) = \frac{1}{4} \Delta\theta^T I_\Pi(\theta) \Delta\theta + O(\| \Delta\theta \|^3) $$
in such a way that the constants in the big-O term can be made independent of the choice of $\theta \in \Theta$.
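As a small self-contained check of the properties just described (our own illustration, using the normalization of JS without the usual $1/2$ factors, as defined above), the snippet below computes JS for random discrete distributions and verifies the triangle inequality for its square root.

```python
# Jensen-Shannon divergence in the normalization used here:
#   JS(P||Q) = KL(P || (P+Q)/2) + KL(Q || (P+Q)/2),
# and a numerical check that sqrt(JS) obeys the triangle inequality.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    m = (p + q) / 2
    return kl(p, m) + kl(q, m)

rng = np.random.default_rng(1)
for _ in range(1000):
    p, q, r = rng.dirichlet(np.ones(5), size=3)
    assert np.sqrt(js(p, q)) <= np.sqrt(js(p, r)) + np.sqrt(js(q, r)) + 1e-12
print("sqrt(JS) triangle inequality held on 1000 random triples")
```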
For the Jensen-Shannon divergence we will prove the following data processing bound.

Theorem 2. Suppose $\theta_1, \theta_2 \in \Theta$ are such that $\theta_\lambda = \lambda \theta_2 + (1 - \lambda) \theta_1$ for $\lambda \in [0, 1]$ are contained in $\Theta$. Under the assumptions in Theorem 1 above and regularity condition (iv),
$$ \mathrm{JS}(Q_{\theta_1} \,\|\, Q_{\theta_2}) \le \frac{\| \theta_1 - \theta_2 \|^2 N}{2} \int_0^1 I_{\theta_\lambda}(X_1, \ldots, X_n; \Pi) \, d\lambda. $$

Note that if we consider the random variable $V$ to represent a prior that chooses the parameter $\theta_1$ or $\theta_2$ each with probability $1/2$, then
$$ I(V; \Pi) = \frac{1}{2} \mathrm{JS}(Q_{\theta_1} \,\|\, Q_{\theta_2}) $$
and we can write the result from Theorem 2 as
$$ I(V; \Pi) \le \frac{\| \theta_1 - \theta_2 \|^2 N}{4} \int_0^1 I_{\theta_\lambda}(X_1, \ldots, X_n; \Pi) \, d\lambda. \quad (11) $$
We compare this to the following theorem. Let $\beta(P_{\theta_1}, P_{\theta_2})$ be the strong data processing inequality (SDPI) constant, defined to be the minimum value $\beta$ such that $I(V; \Pi) \le \beta I(X; \Pi)$, where $V \to X \to \Pi$ is a Markov chain and $V \sim \mathrm{Bern}(1/2)$ picks whether $X$ is drawn from $P_{\theta_1}$ or $P_{\theta_2}$ as above.

Theorem 3 ([11] Theorem 1.1). Suppose $P_{\theta_1}, P_{\theta_2}$ are two probability distributions with $\frac{1}{c} P_{\theta_2} \le P_{\theta_1} \le c P_{\theta_2}$ for some constant $c \ge 1$. Let $\beta(P_{\theta_1}, P_{\theta_2})$ be the SDPI constant defined above. Then
$$ I(V; \Pi) \le K c \, \beta(P_{\theta_1}, P_{\theta_2}) \min_{v \in \{\theta_1, \theta_2\}} I_v(X_1, \ldots, X_n; \Pi) \quad (12) $$
for a universal constant $K$.

Note that in Theorem 3 we have only presented the special case $V \sim \mathrm{Bern}(1/2)$ so that the left-hand side can be written as the mutual information, whereas the general theorem does not assume even probabilities and instead the left-hand side is written as a Hellinger distance term.

To compare (11) and (12), consider what happens in the Gaussian case where $P_\theta \sim N(\theta, \sigma^2 I_d)$. In this case the SDPI constant $\beta(P_{\theta_1}, P_{\theta_2}) = O\left( \frac{\| \theta_1 - \theta_2 \|^2}{\sigma^2} \right)$ as used in [11] and shown in [15]. In this case the right-hand sides of the two bounds are very similar with only some minor differences. In particular, (12) has the minimum mutual information over the two parameter values, but has an extra factor of $c$ to compensate. In contrast, (11) is written in terms of the average mutual information along the linear path from $\theta_1$ to $\theta_2$ in the parameter space. One benefit of (11) over (12) is that it does not need this bounded likelihood ratio assumption, instead assuming the sub-Gaussian score function property from Theorem 1, and therefore it can be applied to the Gaussian case directly without any need for truncation.
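The reduction $I(V;\Pi) = \frac{1}{2}\mathrm{JS}(Q_{\theta_1}\,\|\,Q_{\theta_2})$ behind (11) is straightforward to confirm numerically; the sketch below (with arbitrary toy distributions standing in for $Q_{\theta_1}$ and $Q_{\theta_2}$) computes the mutual information directly from the joint law of $(V, \Pi)$ and compares it with half the Jensen-Shannon divergence.

```python
# Check that I(V; Pi) = (1/2) * JS(Q1 || Q2) when V ~ Bern(1/2) selects which of
# Q1, Q2 the transcript Pi is drawn from (JS in the normalization used above).
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

q1 = np.array([0.7, 0.2, 0.1])        # toy stand-in for Q_{theta_1}
q2 = np.array([0.3, 0.3, 0.4])        # toy stand-in for Q_{theta_2}
m = (q1 + q2) / 2                     # marginal law of Pi

# I(V; Pi) computed directly from the joint distribution of (V, Pi).
joint = np.vstack([0.5 * q1, 0.5 * q2])
mi = float(np.sum(joint * np.log(joint / (0.5 * m))))

# (1/2) * JS(Q1 || Q2).
half_js = 0.5 * (kl(q1, m) + kl(q2, m))
print(mi, half_js)                    # equal up to floating point
```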
A. Proof of Theorem 2

Using a multivariate Taylor expansion of $\mathrm{JS}(Q_\theta \,\|\, Q_{\theta + \Delta\theta})$ at $\Delta\theta = 0$,
$$ \mathrm{JS}(Q_\theta \,\|\, Q_{\theta + \Delta\theta}) = \frac{1}{4} \Delta\theta^T I_\Pi(\theta) \Delta\theta + O(\| \Delta\theta \|^3). \quad (13) $$
Given two parameter values $\theta_1, \theta_2 \in \Theta \subseteq \mathbb{R}^d$ we can upper bound the Jensen-Shannon divergence between $Q_{\theta_1}$ and $Q_{\theta_2}$ as follows. For any real number $0 \le \lambda \le 1$ we define $\theta_\lambda = \lambda \theta_2 + (1 - \lambda) \theta_1$. By the triangle inequality,
$$ \sqrt{\mathrm{JS}(Q_{\theta_1} \,\|\, Q_{\theta_2})} \le \sum_{i=1}^{M} \sqrt{ \mathrm{JS}\left( Q_{\theta_{(i-1)/M}} \,\Big\|\, Q_{\theta_{i/M}} \right) } $$
and then by squaring and using Jensen's inequality,
$$ \mathrm{JS}(Q_{\theta_1} \,\|\, Q_{\theta_2}) \le \left( \sum_{i=1}^{M} \sqrt{ \mathrm{JS}\left( Q_{\theta_{(i-1)/M}} \,\Big\|\, Q_{\theta_{i/M}} \right) } \right)^2 \le M \sum_{i=1}^{M} \mathrm{JS}\left( Q_{\theta_{(i-1)/M}} \,\Big\|\, Q_{\theta_{i/M}} \right) $$
$$ = M \sum_{i=1}^{M} \left( \frac{1}{4} \left( \frac{\theta_2 - \theta_1}{M} \right)^T I_\Pi\left( \theta_{(i-1)/M} \right) \left( \frac{\theta_2 - \theta_1}{M} \right) + O\left( \left\| \frac{\theta_2 - \theta_1}{M} \right\|^3 \right) \right) \quad (14) $$
where the last line (14) follows from (13). Continuing from (14),
$$ \mathrm{JS}(Q_{\theta_1} \,\|\, Q_{\theta_2}) \le \frac{\| \theta_1 - \theta_2 \|^2}{4M} \sum_{i=1}^{M} \mathrm{Tr}\left( I_\Pi\left( \theta_{(i-1)/M} \right) \right) + M^2 \, O\left( \left\| \frac{\theta_2 - \theta_1}{M} \right\|^3 \right). $$
Taking $M \to \infty$ we get that
$$ \mathrm{JS}(Q_{\theta_1} \,\|\, Q_{\theta_2}) \le \frac{\| \theta_1 - \theta_2 \|^2}{4} \int_0^1 \mathrm{Tr}(I_\Pi(\theta_\lambda)) \, d\lambda, $$
where we have used the Riemann integrability of the entries of $I_\Pi(\theta_\lambda)$, which can be shown using regularity conditions (i) and (ii). The result then follows from Corollary 1.
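The quadratic approximation (13) driving this argument can also be checked numerically; in the sketch below (our own illustration) we take $Q_\theta = N(\theta, 1)$, for which the Fisher information is $1$, and compare $\mathrm{JS}(Q_\theta\,\|\,Q_{\theta+\Delta})$ with $\Delta^2/4$ for shrinking $\Delta$.

```python
# Numerical check of the expansion JS(Q_theta || Q_{theta+Delta}) ~ (1/4) Delta^2 I(theta)
# for Q_theta = N(theta, 1), where I(theta) = 1 (JS in the normalization used above).
import numpy as np
from scipy.stats import norm

def js_gaussian(theta, delta):
    grid = np.linspace(-12, 12, 200001)
    dx = grid[1] - grid[0]
    p = norm.pdf(grid, loc=theta)
    q = norm.pdf(grid, loc=theta + delta)
    m = (p + q) / 2
    return float(np.sum(p * np.log(p / m) + q * np.log(q / m)) * dx)

for delta in [0.4, 0.2, 0.1, 0.05]:
    print(f"delta={delta:4.2f}:  JS={js_gaussian(0.0, delta):.6f}   delta^2/4={delta**2 / 4:.6f}")
```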
B. One-Parameter Families of Distributions

In this section we apply Theorem 2 to the case where we are given two distributions $\mu_0$ and $\mu_1$ that do satisfy the bounded likelihood ratio assumption, even if they are not part of a parametric family. Suppose the two distributions have densities $f_0(x)$ and $f_1(x)$, respectively. In this case we can define a one-parameter family of distributions between the two using an exponential twist such as in [16]. For $\theta \in [0, 1]$ we define a new density $f_\theta$ by
$$ f_\theta(x) = \frac{1}{C_\theta} f_1^\theta(x) f_0^{1-\theta}(x) $$
where $C_\theta = \int f_1^\theta(x) f_0^{1-\theta}(x) \, dx$ in order to normalize the density. The score function for this one-parameter family is
$$ S_\theta(x) = \log \frac{f_1(x)}{f_0(x)} - \frac{C'_\theta}{C_\theta}. $$
If, like in [11], we make the bounded likelihood ratio assumption
$$ \frac{1}{c} f_0(x) \le f_1(x) \le c f_0(x), $$
then $|S_\theta(x)| \le 2 \log c$ for all $x$, since both $\log \frac{f_1(x)}{f_0(x)}$ and $\frac{C'_\theta}{C_\theta} = \mathbb{E}_{f_\theta}\left[ \log \frac{f_1(X)}{f_0(X)} \right]$ are bounded by $\log c$ in absolute value. This implies the score function is sub-Gaussian with a parameter that is $O(\log^2 c)$, and yields
$$ I(V; \Pi) \le K (\log c)^2 \int_0^1 I_\lambda(X_1, \ldots, X_n; \Pi) \, d\lambda $$
for an absolute constant $K$.
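For intuition, the snippet below (an illustrative discrete example of our own) builds the exponentially twisted family $f_\theta \propto f_1^\theta f_0^{1-\theta}$ for two distributions with likelihood ratio bounded by $c$ and confirms the score bound $|S_\theta(x)| \le 2\log c$ on a grid of $\theta$ values.

```python
# Exponential-twist family f_theta \propto f1^theta * f0^(1-theta) for two discrete
# distributions with (1/c) f0 <= f1 <= c f0, and a check that |S_theta(x)| <= 2 log c.
import numpy as np

f0 = np.array([0.5, 0.3, 0.2])
f1 = np.array([0.3, 0.4, 0.3])
c = max(np.max(f1 / f0), np.max(f0 / f1))           # likelihood-ratio bound

log_ratio = np.log(f1 / f0)
for theta in np.linspace(0, 1, 11):
    w = f1**theta * f0**(1 - theta)
    f_theta = w / w.sum()                            # normalized twisted density
    # d/dtheta log C_theta = E_{f_theta}[ log(f1/f0) ]
    score = log_ratio - np.sum(f_theta * log_ratio)
    assert np.all(np.abs(score) <= 2 * np.log(c) + 1e-12)
print("score bound |S_theta(x)| <= 2*log(c) verified on a grid of theta")
```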
REFERENCES

[1] H. Cramér, Mathematical Methods of Statistics. Princeton Univ. Press, 1946.
[2] C. R. Rao, "Information and the accuracy attainable in the estimation of statistical parameters," Bulletin of the Calcutta Mathematical Society, vol. 37, 1945.
[3] A. W. van der Vaart, Asymptotic Statistics. Cambridge University Press, 2000, vol. 3.
[4] R. D. Gill and B. Y. Levit, "Applications of the van Trees inequality: a Bayesian Cramér-Rao bound," Bernoulli, vol. 1, no. 1/2, pp. 59–79, 1995.
[5] R. Zamir, "A proof of the Fisher information inequality via a data processing argument," IEEE Transactions on Information Theory, vol. 44, no. 3, pp. 1246–1250, 1998.
[6] L. P. Barnes, Y. Han, and A. Özgür, "Lower bounds for learning distributions under communication constraints via Fisher information," Journal of Machine Learning Research, vol. 21, no. 236, pp. 1–30, 2020.
[7] L. P. Barnes, W.-N. Chen, and A. Özgür, "Fisher information under local differential privacy," IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 3, pp. 645–659, 2020.
[8] J. Acharya, C. L. Canonne, and H. Tyagi, "Inference under information constraints: Lower bounds from chi-square contraction," in Proceedings of the 32nd Conference on Learning Theory, vol. 99. Phoenix, USA: PMLR, 25–28 Jun 2019, pp. 3–17.
[9] C.-Z. Lee, L. P. Barnes, and A. Özgür, "Over-the-air statistical estimation," Proceedings of the 2020 IEEE Global Communications Conference.
[10] ——, "Lower bounds for over-the-air statistical estimation," Submitted to the 2021 IEEE International Symposium on Information Theory.
[11] M. Braverman, A. Garg, T. Ma, H. L. Nguyen, and D. P. Woodruff, "Communication lower bounds for statistical estimation problems via a distributed data processing inequality," in Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing. ACM, 2016, pp. 1011–1020.
[12] A. A. Borovkov, Mathematical Statistics. Gordon and Breach Science Publishers, 1998.
[13] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics, 2019.
[14] D. M. Endres and J. E. Schindelin, "A new metric for probability distributions," IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1858–1860, 2003.
[15] M. Raginsky, "Strong data processing inequalities and φ-Sobolev inequalities for discrete channels," IEEE Transactions on Information Theory, vol. 62, no. 6, pp. 3355–3389, 2016.
[16] A. G. Dabak and D. H. Johnson, "Relations between Kullback-Leibler distance and Fisher information," available at http://ece.rice.edu/~dhj/distance.pdf.