Empirical Bayes improvement of Kalman filter type of estimators
Eitan Greenshtein
Israel Census Bureau of Statistics; e-mail: [email protected]
Ariel Mansura
Bank of Israel; e-mail: [email protected]
Ya’acov Ritov
The Hebrew University of Jerusalem; e-mail: [email protected]
Abstract:
We consider the problem of estimating the means µ_i of n random variables Y_i ∼ N(µ_i, 1), i = 1, ..., n. Assuming some structure on the µ process, e.g., a state-space model, one may use a summary statistic for the contribution of the rest of the observations to the estimation of µ_i. The most important example for this is the Kalman filter. We introduce a non-linear improvement of the standard weighted average of the given summary statistic and Y_i itself, using empirical Bayes methods. The improvement is obtained under mild assumptions. It is strict when the process that governs the states µ_1, ..., µ_n is not a linear Gaussian state-space model. We consider both the sequential and the retrospective estimation problems.
1. Introduction and Preliminaries
We consider the estimation, under squared error loss, of a vector µ_1, ..., µ_n observed with Gaussian additive error: Y_i = µ_i + ε_i, i = 1, ..., n, where ε_1, ..., ε_n are i.i.d. N(0, 1). We think of the index i as denoting time, and regard µ_1, ..., µ_n as a realization of a stochastic process. We analyze, accordingly, two main setups. In the first, the estimation is done retrospectively, after all of Y_1, ..., Y_n are observed. The second case is of sequential estimation, where µ_i should be estimated at time i, after observing Y_1, ..., Y_i.

Let D_i be the data set based on which µ_i is estimated, excluding the i-th observation itself. That is, D_i = {1, ..., i − 1} in the sequential case, and D_i = {j : 1 ≤ j ≤ n, j ≠ i} when the estimation is retrospective. We could consider a more general situation in which the observations are (Y_i, X_i), i = 1, ..., n, where the X_i's are observed covariates and

µ̂_i = Σ_{j ∈ D_i ∪ {i}} ( β_ij Y_j + β^x_ij X_j ).

However, to simplify the presentation, we discuss only the situation without observed covariates:

µ̂_i = Σ_{j ∈ D_i ∪ {i}} β_ij Y_j.   (1)

When µ_1, µ_2, ... are a realization of a Gaussian process, the optimal estimator for µ_i based on the data set D_i ∪ {i} is indeed linear, and is given by the Kalman filter (KF). However, in more general state-space models, and certainly when the model is misspecified, the Kalman filter, or any other linear scheme, is not optimal. Yet, it may be taken as a reasonable starting point for the construction of a better estimator. We consider in this paper an empirical Bayes improvement of a given linear filter which is nonparametric and does not depend on structural assumptions.

The linear estimator µ̂_i in (1) can be considered as a weighted average of two components: Y_i, and an estimator µ̃_i based on all the observations available at time i excluding the i-th observation itself:

µ̃_i = Σ_{j ∈ D_i} β̃_ij Y_j.
In the Gaussian case, µ̃_i and µ̂_i are typically the sufficient statistics for µ_i given the data in D_i and D_i ∪ {i}, respectively. In the sequential Gaussian case the estimator µ̃_i is called the optimal one-step-ahead predictor of µ_i, while µ̂_i is the KF estimator of µ_i, i = 1, ..., n. For background about the KF, state-space models, and general facts about time series see, e.g., Brockwell and Davis (1991). We will hardly use that theory in the following development, since we aim for results that are true regardless of whether various common state-space assumptions hold. In the sequel, when we want to emphasize that µ̃_i and µ̂_i are the standard KF estimators we will write µ̃^K_i and µ̂^K_i, but the following derivation is for a general pair µ̃_i and µ̂_i.

Our goal in this paper is to use µ̃_i as a basis for the construction of an estimator which improves upon µ̂_i. In fact, we try to find the best estimator of the form:

µ̂_{i,g} = µ̃_i + g(Y_i − µ̃_i),  i = 1, ..., n.   (2)

Let

δ ≡ arg min_g E Σ_{i=1}^n (µ̂_{i,g} − µ_i)².   (3)

Thus, we use a simple coordinate-wise function, as was introduced by Robbins (1951) in the context of compound decision:

Definition 1
A function f : R^n → R^n is called a simple coordinate-wise function if it has the representation f(X_1, ..., X_n) = (f_0(X_1), ..., f_0(X_n)) for some f_0 : R → R.

Our improvement, denoted δ(·), is a simple coordinate-wise function of (Y − µ̃). In the theory of compound decision and empirical Bayes, the search for an optimal simple coordinate-wise decision function is central; we elaborate in the next section. The improved estimator µ̃_i + δ(Z_i) is denoted µ^I_i, and in vector notation we write in short µ^I = µ̃ + δ.

The ideas of empirical Bayes (EB) and compound decision (CD) procedures were developed by Robbins (1951, 1955, 1964); see the review papers of Copas (1969) and Zhang (2003), and the paper of Greenshtein and Ritov (2008) for results relating compound decision, simple coordinate-wise decision functions and permutation invariant decision functions.

The classical EB/CD theory is restricted to independent exchangeable observations and to permutation invariant procedures, and in particular it excludes the utilization of explanatory variables. Fay and Herriot (1979) suggested a way to extend the ideas of parametric EB (i.e., linear decision functions corresponding to Gaussian measurement and prior) to handle covariates. Recently, there has been an effort to extend the EB ideas, so that they may be incorporated in the presence of covariates also in the context of non-parametric EB; see Jiang and Zhang (2010), Cohen et al. (2013), and Koenker and Mizera (2012). Our paper may be viewed as a continuation of this effort.

The above papers extended the discussion to the situation where the observations, due to the covariates, are not exchangeable. However, the estimated parameters themselves, µ_1, ..., µ_n, are permutation invariant. Thus, in all these problems, centering each response by a linear transformation of the covariates transforms the problem into a classical EB problem.
In our setup of a time series, the estimated variables are not permutation invariant, and the explanatory variables of Y_i are the available observations Y_j, j ≠ i, so there is an obvious strong dependence between the response variables and the covariates, and the response is degenerate conditional on the covariates.

Furthermore, in all the above mentioned papers the extension of EB ideas to handle covariates is done in a retrospective setup, where all the observations are given in advance. Under the time series structure that we study, it is natural to consider real-time sequential estimation of the µ's. In Section 3 we consider the sequential case, where at stage i the decision function should be approximated based on the currently available data. Our analysis is based on an extension of Samuel (1965). The retrospective case is simpler and is treated first, in Section 2. A small simulation study is presented in Section 4, and a real data example is discussed in Section 5.

Most EB/CD solutions involve simple coordinate-wise functions. By the nature of the problem, these functions are estimated from the data, which is used symmetrically:

f̂(X_1, ..., X_n) = ( f̂(X_1; X_(1), ..., X_(n)), ..., f̂(X_n; X_(1), ..., X_(n)) ),   (4)

where X_(1) ≤ ... ≤ X_(n) are the order statistics.

Unfortunately, any permutation invariant function can be written in this way. Suppose for simplicity that X_1, ..., X_n are real. Let ψ(X_1, ..., X_n) : R^n → R^n be a permutation invariant function. Let 1(·) be the indicator function. It is possible to write ψ = f̂ as in (4), with

f̂(x; X_(1), ..., X_(n)) = Σ_i ψ_i(X_(1), ..., X_(n)) 1(x = X_(i)),

or a smooth version of this function. Actually, any function that is estimated from the data and is used only on that data can be written as a simple coordinate-wise function. Intuitively, however, the set of simple coordinate-wise functions is a strict subset of the set of permutation invariant functions. We therefore consider a function f̂ as a simple coordinate-wise function if it approximates a function f that is simple coordinate-wise by Definition 1. This latter function may be random (i.e., a stochastic process), with a non-degenerate asymptotic distribution.

The performance of our estimators will be measured by their mean squared error loss, in vector notation: E||µ^I − µ||², E||µ̃ − µ||², and E||µ̂ − µ||². Let F_i be the smallest σ-field under which Y_j, j ∈ D_i, are measurable. The dependency of different objects on n will be suppressed when there is no danger of confusion.

Assumption 1
For every i = 1, ..., n, the estimator µ̃_i is F_i measurable. It is Lipschitz in Y_j with a constant ρ_{|i−j|}, where lim sup_{M→∞} M ρ_M < 1. That is: for j ∈ D_i, let µ̃′_i be µ̃_i, but computed with Y_j replaced by Y_j + d. Then |µ̃′_i − µ̃_i| ≤ ρ_{|i−j|} d.

This condition is natural when µ̃ is the KF for a stationary Gaussian process, where typically β_ij decreases exponentially with |i − j|. The main need for generalizing the KF is to include filters which are based on estimated parameters.

The Kalman filter for an ergodic process also satisfies the following condition. It has no real importance for our results, except giving a standard benchmark.

Assumption 2
Suppose that there is an α_n ∈ F_n, α_n < 1, such that

µ̂_i = α_n µ̃_i + (1 − α_n) Y_i + ζ_i, where Eζ_i² → 0,   (5)

as n → ∞, 0 < lim inf i/n ≤ lim sup i/n < 1.

Remark:
Our major example is the ergodic normal state-space model. If the assumed model is correct, and µ̃_i and µ̂_i are the optimal estimators, then µ̃_i is a sufficient statistic for µ_i given D_i. The estimators satisfy (5) with α_n ≡ (1 + τ²)^{−1}, where τ² is the asymptotic variance of µ_i given µ̃_i. In the iterative Kalman filter method for computing µ̂_i, with some abuse of notation, the values α_i = (1 + τ_i²)^{−1} are computed, with τ_i² the variance of µ_i given µ̃_i, and we have µ̂_i = α_i µ̃_i + (1 − α_i) Y_i.

By considering the functions g(z) ≡ 0 and g(z) = (1 − α_n)z in (2), it is easy to see that µ^I has asymptotically mean squared error not larger than that of µ̃ and µ̂, respectively. In fact, we argue that unless the process is asymptotically Gaussian, there is a strict improvement.

The derivation of the Kalman filter is based on an assumed stochastic model for the sequence µ_1, ..., µ_n. Very few properties of the process are relevant, and it is irrelevant to our discussion whether the model is true or not. However, we do need some tightness. We expect that typically |µ_i − µ_{i−1}| is not larger than log n, and µ̃_i is sensible at least as much as µ̃_i ≡ Y_{i−1}. Since max_i |Y_i − µ_i| = O_p(√log n), the next condition is natural:

Assumption 3
It holds:

n^{−1} Σ_{i=1}^n P( |Y_i − µ̃_i| > log n ) ≤ log^{−2}(n).
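To make the Remark above concrete, here is a minimal sketch (ours, not the authors' code) of the scalar Kalman recursion that produces the one-step predictor µ̃_i, the update µ̂_i, and the weights α_i = (1 + τ_i²)^{−1}, for the state-space model Y_i = µ_i + ε_i, µ_i = φµ_{i−1} + U_i with ε_i ∼ N(0, 1) and Var(U_i) = q. The function name and interface are our own.

```python
import numpy as np

def kalman_ar1(y, phi, q, m0=0.0, p0=1.0):
    """Scalar Kalman recursion for Y_i = mu_i + eps_i, eps_i ~ N(0, 1),
    mu_i = phi * mu_{i-1} + U_i, Var(U_i) = q.
    Returns the one-step predictors mu_tilde, the filtered estimates mu_hat,
    and the weights alpha_i = (1 + tau_i^2)^{-1} appearing in the Remark."""
    n = len(y)
    mu_tilde, mu_hat, alpha = np.empty(n), np.empty(n), np.empty(n)
    m, p = m0, p0                       # mean and variance of mu_{i-1} given the past
    for i in range(n):
        m_pred = phi * m                # one-step predictor  ~mu_i
        tau2 = phi ** 2 * p + q         # tau_i^2 = Var(mu_i | D_i)
        a = 1.0 / (1.0 + tau2)          # alpha_i
        m = a * m_pred + (1.0 - a) * y[i]   # ^mu_i = alpha_i ~mu_i + (1 - alpha_i) Y_i
        p = a * tau2                    # Var(mu_i | D_i and Y_i)
        mu_tilde[i], mu_hat[i], alpha[i] = m_pred, m, a
    return mu_tilde, mu_hat, alpha
```

Along the recursion the identity µ̂_i = α_i µ̃_i + (1 − α_i)Y_i of the Remark holds exactly.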
2. Retrospective estimation
Denote

Z_i = Y_i − µ̃_i;  ν_i = µ_i − µ̃_i,  i = 1, ..., n.   (6)

Clearly, Z_i = ν_i + ε_i. Since ε_i is independent both of µ_1, ..., µ_n and of ε_j, j ≠ i, it is independent of ν_i. Thus, the conditional distribution of Z_i given ν_i is N(ν_i, 1). This is true, however, only marginally. Indeed, write µ̃ = BY = Bµ + Bε. Then ν = (I − B)µ − Bε. It is true that Z = ν + ε, but the vectors ν and ε are not independent. Hence, in general, Z | ν is not distributed N_n(ν, I_n). Yet, we rely only on the marginal distributions of Z_i | ν_i, i = 1, ..., n.

To elaborate,

( Y, µ )ᵀ = [ (I − B)^{−1}   0 ;  B(I − B)^{−1}   I ] ( Z, ν )ᵀ.

Therefore, the joint density of Z and ν is proportional to

f_µ( ν + B(I − B)^{−1} Z ) exp( −||Z − ν||² / 2 ),

where f_µ is the joint density of the vector µ. Clearly, unless f_µ is multivariate normal, the conditional density of Z given ν is not multivariate standard normal.

Example 2.1
Suppose n = 2, we observe Y_0, Y_1, and use µ̃_i = γY_{1−i}, i = 0, 1. Then

Z_i = Y_i − γY_{1−i}  ⇒  Y_i = (Z_i + γZ_{1−i}) / (1 − γ²),
ν_i = µ_i − γY_{1−i}  ⇒  Y_i = (µ_{1−i} − ν_{1−i}) / γ,
⇒  Z_i = (1/γ)( µ_{1−i} − ν_{1−i} − γµ_i + γν_i ).

Suppose further that µ_i is finitely supported. It follows from the above calculations that the distribution of the vector Z given the vector ν is finitely supported as well.

The estimator in vector notation is µ^I = µ̃ + δ, where δ = (δ(Z_1), ..., δ(Z_n))ᵀ. As discussed in the introduction, simple coordinate-wise functions like δ are central in EB and CD models. However, our decision function µ^I is not a simple coordinate-wise function of the observations. It is a hybrid of a non-coordinate-wise function µ̃ and a simple coordinate-wise one, δ. The µ̃ component accounts for the non-coordinate-wise information from the covariates, while δ aims to improve it in a coordinate-wise way after the information from all other observations was accounted for by µ̃.

By Assumption 1, the dependency between the Z_i's conditioned on ν is only local, and hence, if we consider a permutation invariant procedure, which treats neighboring observations and far away ones the same, the dependency disappears asymptotically, and we may consider only the marginal normality of the ν. The basic ideas of EB are helpful, and we get the representation (7) of δ as given below.

Let

f_Z(z) = (1/n) Σ_{i=1}^n φ(z − ν_i),

where φ is the standard normal density. Note that this is not a kernel estimator: the kernel has a fixed bandwidth, and ν_1, ..., ν_n are unobserved. Let I be uniformly distributed over 1, ..., n. Denote by F^n the distribution of the random pairs (ν_I, ν_I + η_I), where η_1, ..., η_n are i.i.d. standard normal, independent of the other random variables mentioned so far; the randomness is induced by the random index I and the η's. One marginal distribution of F^n is the empirical distribution of ν_1, ..., ν_n, while the density of the other is given by f_Z. We denote the marginals by F^n_ν and F^n_Z. Finally, note that Z_I given ν_I has the distribution of F^n_{Z|ν}, i.e., the conditional distribution of F^n.

It is well known that asymptotically, the Bayes procedure for estimating ν_i given Z_i is approximated by the Bayes procedure with F^n_ν as prior, and it is determined by f_Z. The optimal simple coordinate-wise function δ = δ_n depends only on the marginal joint distribution of (ν_I, Z_I); in fact, it depends only on f_Z. As in Brown (1971) we have:

δ_n(z) = E_{F^n}( ν_I | ν_I + η_I = z ) = z + f′_Z(z) / f_Z(z),   (7)

where f′_Z is the derivative of f_Z. The dependency on n is suppressed in the notations.

Note that δ_n is a random function, and in fact, if µ_1, ..., µ_n is not an ergodic process, it may not have an asymptotic deterministic limit. Yet, it is the object we estimate in (8) below.

It is of special interest to characterize when δ = δ_n is asymptotically linear, in which case the improved estimator µ^I_i = µ̃_i + δ_n(Z_i) is asymptotically a linear combination of µ̃_i and Y_i. Only in such a case may the difference between the loss of the improved estimator µ^I = µ̃ + δ and that of the estimator µ̂ be asymptotically of order o(n). It follows from (7) that the optimal decision δ(Z) is approximately (1 − α)Z if and only if f′_Z / f_Z = (log f_Z)′ is approximately proportional to z. This happens only if f_Z converges to a Gaussian density. Since f_Z is a convolution of a Gaussian kernel with the prior, this can happen only if the prior is asymptotically Gaussian. In our setup, where F^n_ν plays the role of a prior, in order to have an asymptotically linear improved estimator we need that F^n_ν converges weakly to a normal distribution G.

The above is formally stated in the following Proposition 2.1.

Proposition 2.1
Under Assumptions 1-3, n^{−1} E||µ^I − µ̂||² → 0 implies that the sequence F^n_ν − N(0, (1 − α_n)/α_n) converges weakly to the zero measure.

Given the observations Z_1, ..., Z_n, let F̂^n_Z be the empirical distribution of Z_1, ..., Z_n. We will show that as n → ∞ the 'distance' between F̂^n_Z and F^n_Z gets smaller, so that f_Z and its derivative may be replaced in (7) by appropriate kernel estimates based on F̂^n_Z, and yield a good enough estimator δ̂_n of δ_n. In the sequel we will occasionally drop the superscript n. Note that we do not assume that F^n approaches some limit F as n → ∞, although this is the situation if we assume that Z_1, Z_2, ... is an ergodic stationary process; our assumptions on that process are milder.

We now state the above formally. Consider the two kernel estimators f̂_Z(z) = n^{−1} Σ_j K_σ(Z_j − z) and f̂′_Z(z) = n^{−1} Σ_j K′_σ(Z_j − z), where K_σ(z) = σ^{−1} K(z/σ), σ = σ_n. For simplicity, we use the same bandwidth to estimate both the density and its derivative. We define the following estimator δ̂ ≡ δ̂_σ for δ:

δ̂(z) = δ̂_{nσ}(z) ≡ z + f̂′_Z(z) / f̂_Z(z).   (8)

Brown and Greenshtein (2009) used the normal kernel; we prefer to use the logistic kernel K(x) = 2(e^x + e^{−x})^{−2} (it is the derivative of the logistic cdf (1 + e^{−2x})^{−1}, and hence its integral is 1). We suggest this kernel since it ensures that |δ̂(z) − z| < 2σ^{−1}, see the Appendix. However, we do adopt the recommendation of Brown and Greenshtein (2009) for a very slowly converging sequence, σ_n = 1/log(n).

Denote µ̂^I = µ̃ + δ̂.

Theorem 2.2
Under Assumptions 1 and 3:

i) E||µ̂^I − µ||² ≤ E||µ^I − µ||² + o(n) ≤ E||µ̂ − µ||² + o(n).

ii) Under Assumptions 1-3, if α_n →_p α ∈ (0, 1) and F^n_ν converges weakly to a distribution different from N(0, (1 − α)/α), then there is c < 1 such that for large enough n:

E||µ̂^I − µ||² ≤ c E||µ̂ − µ||².

Proof. i) The proof is given in the appendix.
ii) The same arguments used to prove part i) may be used to prove a modification of Proposition 2.1, in which µ^I_i is replaced by µ̂^I_i, when assuming in addition that i/n ∈ (ζ, 1 − ζ) for any ζ ∈ (0, 1). □

Part i) of the above theorem assures us that asymptotically the improved estimator does as well as µ̂; part ii) implies that in general the improved estimator does asymptotically strictly better. Obviously the asymptotic improvement is not always strict, since the Kalman filter is optimal under a Gaussian state-space model.
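The estimator (8), with the logistic kernel and a user-supplied bandwidth (e.g. σ_n = 1/log n), can be sketched as follows. This is an illustrative implementation under our own naming, not the authors' code.

```python
import numpy as np

def delta_hat(z_eval, z_obs, sigma):
    """Estimator (8): delta_hat(z) = z + f'(z)/f(z), with f and f' estimated
    by kernel smoothing of the residuals Z_1, ..., Z_n using the logistic
    kernel K(x) = 2 / (e^x + e^{-x})^2 and bandwidth sigma (e.g. 1/log n)."""
    z_eval = np.asarray(z_eval, dtype=float)
    x = (z_eval[:, None] - np.asarray(z_obs)[None, :]) / sigma
    t = np.exp(-2.0 * np.abs(x))              # numerically stable rewriting
    K = 2.0 * t / (1.0 + t) ** 2              # K(x) = 2 (e^x + e^{-x})^{-2}
    Kp = -2.0 * np.tanh(x) * K                # K'(x) = -2 tanh(x) K(x)
    f = K.mean(axis=1) / sigma                # f_hat(z)
    fp = Kp.mean(axis=1) / sigma ** 2         # f'_hat(z)
    return z_eval + fp / f
```

By construction |δ̂(z) − z| < 2σ^{−1}, since K′/K ∈ (−2, 2) for this kernel; for an (approximately) Gaussian sample of Z's the correction is close to the linear shrinkage of the Kalman filter.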
3. Sequential estimation
We consider now the case F_i = σ(Y_1, ..., Y_{i−1}), i ≤ n. The definition of the different estimators is the same as in the previous section, with the necessary adaptation to the different information set. Our aim is to find a sequential estimator, denoted µ̂^{IS}, that satisfies E||µ̂^{IS} − µ||² + o(n) < E||µ̂ − µ||². By a sequential estimator µ̂^{IS} = (ψ_1, ..., ψ_n) we mean that ψ_i ∈ F_i, i = 1, ..., n.

A natural approach, which indeed works, is to let ψ_i = µ̃_i + δ̂_i, where δ̂_i is defined as in (8), but with f̂ = f̂_i restricted to the available data Z_1, ..., Z_{i−1}, i = 1, ..., n. Let δ̂^S = (δ̂_1, ..., δ̂_n). We define: µ̂^{IS} = µ̃ + δ̂^S.

Our main result in this section:
Theorem 3.1
Theorem 2.2 holds with µ̂^{IS} and µ^{IS} replacing µ̂^I and µ^I, respectively.

In order to prove Theorem 3.1 we adapt Lemma 1 of Samuel (1965). Samuel's result is stated for a compound decision problem, i.e., the parameters are fixed, and the observations are independent. The result compares the performance of the optimal estimators in the sequential and retrospective procedures. It is not clear a priori whether retrospective estimation is easier or more difficult than the sequential. On the one hand, the retrospective procedure is using more information when dealing with the i-th parameter. On the other hand, the sequential estimator can adapt better to non-stationarity in the parameter sequence. Samuel proved that the latter is more important. There is no paradox here, since the retrospective procedure is optimal only under the assumption of permutation invariance, and under permutation invariance, the weak inequality in Lemma 3.2 below is, in fact, an equality.

Our approach is to rephrase and generalize Samuel's lemma. Let η_1, ..., η_n be N(0, 1) i.i.d. random variables independent of (µ_i, ε_i), i = 1, ..., n. Let L(ν_i, ν̂_i) be the loss for estimating ν_i by ν̂_i. For every i ≤ n let δ_i be the decision function that satisfies:

δ_i = arg min_δ E_η Σ_{j=1}^i L( ν_j, δ(ν_j + η_j) )
    = arg min_δ Σ_{j=1}^i E_η L( ν_j, δ(ν_j + η_j) )
    ≡ arg min_δ Σ_{j=1}^i R(δ, ν_j), say,

where E_η is the expectation over η. That is, δ_i is the functional that minimizes the sum of risks for estimating the components ν_1, ..., ν_i, but it is applied only for estimating ν_i. The quantity R(δ_j, ν_j) is the analog of R(φ_{F_j}, θ_j) in Samuel's formulation. In analogy to Samuel (1965) we define R_n ≡ n^{−1} Σ_{j=1}^n R(δ_n, ν_j), the empirical Bayes risk of the non-sequential problem.

Lemma 3.2

n^{−1} Σ_{j=1}^n R(δ_j, ν_j) ≤ R_n.

The proof of the lemma is formally similar to the proof of Lemma 1 of Samuel (1965).
Proof of Theorem 3.1. From Theorem 2.2, E( δ̂_i(Z_i) − δ_i(Z_i) )² → 0. This fact, coupled with Lemma 3.2, implies part i) of Theorem 3.1. Part ii) is shown similarly to part ii) of Theorem 2.2. □
4. Simulations
We present now simulation results for the following state-space model:

Y_i = µ_i + ε_i,
µ_i = φµ_{i−1} + U_i,  i = 1, ..., n,   (9)

where ε_i ∼ N(0, 1), i = 1, ..., n, are independent of each other and of U_i, i = 1, ..., n. The variables U_i, i = 1, ..., n, are independent, U_i = X_i I_i, where the X_i ∼ N(0, v) are independent, while I_1, ..., I_n are i.i.d. Bernoulli with mean 0.1, independent of each other and of X_i, i = 1, ..., n. We study the twelve cases that are determined by φ = 0.25, 0.75 and v = 0, 1, ..., 5. In each case we investigate both the sequential and the retrospective setups.
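The experiment behind Table 1 can be sketched roughly as follows (our own code: a single replication of one (φ, v) case, retrospective, no warm-up; we interpret v as the variance of X_i, which is an assumption):

```python
import numpy as np

rng = np.random.default_rng(7)

n, phi, v = 500, 0.75, 4.0              # one of the twelve (phi, v) cases
q = 0.1 * v                             # Var(U_i) with X_i ~ N(0, v), I_i ~ Ber(0.1)

u = rng.normal(0.0, np.sqrt(v), n) * (rng.random(n) < 0.1)   # U_i = X_i * I_i
mu = np.zeros(n)
for i in range(1, n):
    mu[i] = phi * mu[i - 1] + u[i]      # state equation of (9)
y = mu + rng.normal(size=n)             # Y_i = mu_i + eps_i

# Kalman filter with the true phi and Var(U_i): one-step predictor and update
mu_tilde = np.zeros(n)
mu_hat = np.zeros(n)
m, p = 0.0, 1.0
for i in range(n):
    m_pred = phi * m
    tau2 = phi ** 2 * p + q
    a = 1.0 / (1.0 + tau2)              # alpha_i of the Remark in Section 1
    m = a * m_pred + (1.0 - a) * y[i]
    p = a * tau2
    mu_tilde[i], mu_hat[i] = m_pred, m

# Retrospective EB improvement: mu_I = mu_tilde + delta_hat(Z), Z = Y - mu_tilde
z = y - mu_tilde
sig = 1.0 / np.log(n)                   # bandwidth sigma_n = 1/log n
x = (z[:, None] - z[None, :]) / sig
t = np.exp(-2.0 * np.abs(x))
K = 2.0 * t / (1.0 + t) ** 2            # logistic kernel
f = K.mean(axis=1) / sig
fp = (-2.0 * np.tanh(x) * K).mean(axis=1) / sig ** 2
mu_imp = mu_tilde + z + fp / f          # = y + f'/f

mse_kf = float(np.mean((mu_hat - mu) ** 2))
mse_imp = float(np.mean((mu_imp - mu) ** 2))
```

With a non-normal U (here a Bernoulli-normal product) the improved estimator typically attains a smaller mean squared error than the Kalman filter, but averages over many replications, as in Table 1, are needed for a stable comparison.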
If U_1, ..., U_n were i.i.d. normal, the data would follow a Gaussian state-space model, and the corresponding Kalman filter estimator would be optimal. Since the U_i's are not normal, the corresponding AR(1) Kalman filter is not optimal (except in the degenerate case v = 0), though it is optimal among linear filters. This is reflected in our simulation results, where for the cases with v = 0, µ^I performs slightly worse than µ̂ = µ̂^K. It improves in all the rest.

The above is stated and proved formally in the following proposition. It could also be shown indirectly by applying part ii) of Theorem 2.2.

Proposition 4.1

Consider the state-space model as defined by (9). If the U_i are not normally distributed, then E||µ̂^I − µ||² ≤ c E||µ̂^K − µ||², for a constant c ∈ (0, 1) and large enough n.

Proof. Given the estimators µ̃^K_i and µ̂^K_i, i = 1, ..., n, let Z_i = Y_i − µ̃^K_i. Then Z_i = φµ_{i−1} + U_i − µ̃^K_i + ε_i = ν_i + ε_i. The distribution G_i of ν_i may be normal only if U_i is normal, since U_i is independent of µ_{i−1} and µ̃^K_i. The distributions G_i converge to a distribution G as i and n − i approach infinity. As before, G is normal only if the U_i are normal. Now, the asymptotically optimal estimator for µ_i under squared loss, given the observation Y_i, is µ̃^K_i + ν̂_i, where ν̂_i is the Bayes estimator under a prior G on ν_i and an observation Z_i ∼ N(ν_i, 1). This Bayes estimator is linear, and thus agrees asymptotically with µ̂^K_i, only if G is normal. □

An analogous discussion and situation are valid also in the sequential case. In our simulations the parameters φ and Var(U_i) are treated as known. Alternatively, maximum likelihood estimation assuming (wrongly) normal innovations yields results similar to those reported in Table 1.

The simulation results in Table 1 are for the case n = 500. Each entry is based on 100 simulations. In each realization we recorded ||µ̂ − µ||² and ||µ̂^I − µ||², and each entry is based on the corresponding average. In order to speed the asymptotics we allowed a 'warm up' of 100 observations prior to the n = 500 in the sequential case; we also allowed a 'warm up' of 50 on both sides of the n = 500 observations in the retrospective case.

It may be seen that when the best linear filter is optimal or nearly optimal (when v = 0 or approximately so), our improved method is slightly worse than the Kalman filter estimator; however, as v increases, the advantage of the improved method may become significant.

It seems that in the case φ = 0.25 the future observations are not very helpful, and in our simulations there are cases where the simulated risk of µ̂ in the sequential case is even smaller than the corresponding simulated risk of the retrospective case. This could be an artifact of the simulations, but also a result of noisy estimation of the coefficients β_ij of the mildly informative future observations.

Table 1
Mean squared error of the two estimators (µ̂†: Kalman filter, µ^I‡: improved) for an autoregressive process with aperiodic normal shocks, by φ and v

Retrospective filter (µ̂†, µ^I‡):
23  66 125 148 160 177
24  91 166 215 253 271

Sequential filter (µ̂†, µ^I‡):
39  81 129 147 159 158
34 112 184 216 239 253

† Kalman filter, ‡ Improved.
Table 2
The retrospective case: cross-validation estimation of the average squared risk

p = 0.95                 AR(1)   AR(2)   ARIMA(1,1,0)
Kalman filter λ̂^K_i       ·       ·       ·
Improved method λ̂^I_i     ·       ·       ·
Naive method λ̂^N_i        ·       ·       ·
5. Real Data Example
In this section we demonstrate the performance of our method on real data taken from the FX (foreign exchange) market in Israel. The data consist of the daily number of swaps (a purchase of one currency for another with a given value date, done simultaneously with the selling back of the same amount with a different value date; this way there is no foreign exchange risk) in the OTC (over-the-counter) Shekel/Dollar market. We consider only the buys of magnitude 5 to 20 million dollars. The time period is January 2nd, 2009 to December 31st, 2013, a total of n = 989 business days. The number of buys in each day is 24 on the average, with a range of 2-71. In our analysis we used the first 100 observations as a 'warm up', similarly to the way it was done in our simulations section.

We denote by X_i, i = 1, ..., n, the number of buys on day i and assume that X_i ∼ Po(λ_i). We transform the data by Y_i = 2√(X_i + 0.25), as in Brown et al. (2010) and Brown et al. (2013), in order to get an (approximately) normal variable with variance σ² = 1.

The assumed model in this section is the following state-space system of equations:

Y_i = µ_i + ε_i,
µ_i ∼ ARIMA(p, d, q),  i = 1, ..., n,

where µ_i = 2√λ_i and the ε_i ∼ N(0, 1) are independent of each other and of the ARIMA(p, d, q) process. We consider the following three special cases of ARIMA(p, d, q): AR(1), AR(2), and ARIMA(1, 1, 0).

Under each model there are induced Kalman filter estimators µ̃^K and µ̂^K that correspond to the one-step prediction and to the update. Similarly, the improved estimator µ̂^I is defined. We denote the sequential and retrospective estimators similarly, with no danger of confusion.

After estimating µ_i, we transform the result back to get the estimator λ̂^J_i of λ_i: λ̂^J_i = 0.25 (µ̂^J_i)², i = 1, ..., n, J ∈ {'I', 'K'}, where µ̂^J_i is the estimator of µ_i by method J. We evaluate the performance of both estimation methods by the following non-standard cross-validation method, described in Brown et al. (2013) and briefly explained in the following.

Let p ∈ (0, 1), p ≈ 1, and let U_1, ..., U_n be independent given X_1, ..., X_n, where U_i ∼ B(X_i, p) are Binomial variables. It is known that U_i ∼ Po(pλ_i) and V_i = X_i − U_i ∼ Po((1 − p)λ_i), and that they are independent given λ_1, ..., λ_n. We use the 'main' sub-sample U_1, ..., U_n for the construction of both estimators (Kalman filter and improvement), while the 'auxiliary' sub-sample V_1, ..., V_n is used for validation. Consider the following loss function:

ρ(J; U, V) = (1/n) Σ_{i=1}^n ( λ̂^J_i / p − V_i / (1 − p) )²
 = (1/(np²)) Σ_{i=1}^n ( λ̂^J_i − pλ_i )² + (1/(n(1 − p)²)) Σ_{i=1}^n ( V_i − (1 − p)λ_i )² − (2/n) Σ_{i=1}^n ( λ̂^J_i / p − λ_i )( V_i / (1 − p) − λ_i )
 = (1/(np²)) Σ_{i=1}^n ( λ̂^J_i − pλ_i )² + A_n + R_n(J),  J ∈ {'K', 'I'}.

The term R_n(J) = O_p(n^{−1/2}) and will be ignored. We estimate A_n by the method of moments:

Â_n = (1/(n(1 − p)²)) Σ_{i=1}^n V_i.

We repeat the cross-validation process 500 times and average the computed values of ρ(J; U, V) − Â_n. When p is close to 1, the average obtained is a plausible approximation of the average squared risk in estimating λ_i, i = 101, ..., n. As a benchmark we also report the naive estimator λ̂^N_i = X_i, i = 101, ..., n.
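One round of this thinning-based cross-validation can be sketched as follows (our own code; `lam_hat_fn` is a placeholder for any estimator built from the U-sample, such as the Kalman filter or the improved method):

```python
import numpy as np

def thin_and_score(x, lam_hat_fn, rng, p=0.95):
    """One round of the binomial-thinning cross-validation of Section 5.
    X_i ~ Po(lam_i) is split into U_i ~ Po(p lam_i) and V_i ~ Po((1-p) lam_i);
    rho(J; U, V) - A_hat_n is returned, which up to O_p(n^{-1/2}) terms is an
    unbiased proxy for (1/(n p^2)) sum_i (lam_hat_i - p lam_i)^2."""
    u = rng.binomial(x, p)                    # U_i | X_i ~ Bin(X_i, p)
    v = x - u                                 # V_i, independent of U given lam
    lam_hat = lam_hat_fn(u)                   # estimator built from U alone
    n = len(x)
    rho = np.mean((lam_hat / p - v / (1.0 - p)) ** 2)
    a_hat = v.sum() / (n * (1.0 - p) ** 2)    # moment estimate of A_n
    return rho - a_hat
```

Averaging the returned score over repeated splits, as done 500 times above, stabilizes the noisy V-based validation term.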
The retrospective case: Parameter estimation p = 0 .
95 AR(1) AR(2) ARIMA(1,1,0) α .
38 14 .
218 0 . φ − . − . − . φ − . σ . . . Table 4
The sequential case: Cross-validation approximation of the average squared risk p = 0 .
95 AR(1) AR(2) ARIMA(1,1,0)
Kalman filter - ˆ λ Ki . . . Improved method - ˆ λ Ii . . . Naive method - ˆ λ Ni . . . the sequential case. The reason is that the AR(1) model does not fit the data.When it is enforced on the data, the Kalman filter gives too much weight tothe surrounding data, and too little to the “model free” naive estimator. Thisresult show the robustness of our estimator.In fact, we did a small simulation, where the process was AR(2), with theparameters as estimated for the data. When an AR(1) was fitted to the data,the retrospective Kalman filter was strictly inferior to the sequential one.In the sequential case, Table 4, the improved method does better than thenaive method, but contrary to the non-sequential case, it improves upon theKalman filter only in the AR(1) and AR(2) models, while in the ARIMA(1,1,0)model the Kalman filter does slightly better.
6. Appendix: Proof of Theorem 2.2
Note that by (3), obviously, E||µ^I − µ||² < E||µ̂ − µ||² + o(n). Thus, in order to obtain E||µ̂^I − µ||² < E||µ̂ − µ||² + o(n) it is enough to show that E||δ̂ − δ||² = o(n).

First recall:

K(x) = 2(e^x + e^{−x})^{−2},
K′(x) = −4(e^x − e^{−x})(e^x + e^{−x})^{−3},
K′(x)/K(x) = −2(e^x − e^{−x})/(e^x + e^{−x}) ∈ (−2, 2).

Thus

sup_z | f̂′_Z(z) / f̂_Z(z) | < 2σ_n^{−1}.   (10)

By Assumption 1, if we replace µ̃_i by a similar (unobserved) function µ̃*_i, where Y_j, j ∈ D_i, |j − i| > n^γ, is replaced by µ_j, then max_i |µ̃*_i − µ̃_i| = O_p(n^{−γ}) max_i |ε_i| = O_p(n^{−γ/2}), for any γ > 0. Define ν*_i and Z*_i as in (6), but with µ̃_i replaced by µ̃*_i, i = 1, ..., n. Since |f̂′′_Z| = O(σ_n^{−3}), max_i |Z_i − Z*_i| is ignorable for our approximations. In the following, all variables are replaced by their * version, but we drop the * for simplicity.

Let L_n = log n. By Assumption 3, with probability greater than 1 − L_n^{−1}, all but n/L_n of the Z_i's are in (−L_n, L_n); hence, their density is mostly not too small:

∫ 1{ f_Z(z) < L_n^{−2} } f_Z(z) dz < L_n^{−1}.   (11)

Let

φ̄(z) = φ ∗ K_{σ_n}(z) = ∫ σ_n^{−1} K( (z − e)/σ_n ) φ(e) de,
f̄_Z(z) = f_Z ∗ K_{σ_n}(z) = (1/n) Σ_{j=1}^n φ̄(z − ν_j).

Then

f̂_Z(z) − f̄_Z(z) = (1/(σ_n n)) Σ_{j=1}^n K( (z − Z_j)/σ_n ) − (1/n) Σ_{j=1}^n φ̄(z − ν_j)
 = (1/(σ_n n)) Σ_{j=1}^n { K( (z − ν_j − ε_j)/σ_n ) − ∫ K( (z − ν_j − e)/σ_n ) φ(e) de }.

Since ε_j and ν_j are independent, this difference has mean 0. Moreover, since (ν*_j, ε_j) and (ν*_k, ε_k) are independent for |j − k| > M, the above expression has variance of order M(nσ_n)^{−1}. The difference ||f̄_Z − f_Z||_∞ is of order σ_n.

A similar expansion works for f̂′_Z, except that the variance now is of order M(nσ_n³)^{−1}. A regular large deviation argument shows that when f_Z(Z_i) > L_n^{−2}, then f̂_Z(Z_i) > f_Z(Z_i)/2 and f̂′_Z(Z_i)/f̂_Z(Z_i) = f′_Z(Z_i)/f_Z(Z_i) + o_p(1) hold not only in the mean but also in the mean square, since the exception probabilities are smaller than σ_n.

References
Brockwell, P.J. and Davis, R.A. (1991). Time Series: Theory and Methods. Second edition. Springer, New York.

Brown, L.D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Ann. Math. Statist. 42, 855-904.

Brown, L.D., Cai, T., Zhang, R., Zhao, L. and Zhou, H. (2010). The root-unroot algorithm for density estimation as implemented via wavelet block thresholding. Probability Theory and Related Fields 146, 401-433.

Brown, L.D. and Greenshtein, E. (2009). Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. Ann. Statist. 37, No. 4, 1685-1704.

Brown, L.D., Greenshtein, E. and Ritov, Y. (2013). The Poisson compound decision problem revisited. JASA 108, 741-749.

Cohen, N., Greenshtein, E. and Ritov, Y. (2013). Empirical Bayes in the presence of explanatory variables. Statistica Sinica 23, No. 1, 333-357.

Copas, J.B. (1969). Compound decisions and empirical Bayes (with discussion). JRSSB 31, 397-425.

Fay, R.E. and Herriot, R.A. (1979). Estimates of income for small places: An application of James-Stein procedures to census data. JASA 74, No. 366, 269-277.

Greenshtein, E. and Ritov, Y. (2008). Asymptotic efficiency of simple decisions for the compound decision problem. The 3rd Lehmann Symposium, IMS Lecture Notes Monograph Series, J. Rojo, editor, 266-275.

Jiang, W. and Zhang, C.-H. (2010). Empirical Bayes in-season prediction of baseball batting averages. Borrowing Strength: Theory Powering Applications - A Festschrift for L.D. Brown, J.O. Berger, T.T. Cai, I.M. Johnstone, eds., IMS Collections 6, 263-273.

Koenker, R. and Mizera, I. (2012). Shape constraints, compound decisions and empirical Bayes rules. Manuscript.

Robbins, H. (1951). Asymptotically subminimax solutions of compound statistical decision problems. Proc. Second Berkeley Symp. Math. Statist. Probab., 131-148.

Robbins, H. (1955). An empirical Bayes approach to statistics. Proc. Third Berkeley Symp. Math. Statist. Probab. 1, 157-163.

Robbins, H. (1964). The empirical Bayes approach to statistical decision problems. Ann. Math. Statist. 35, 1-20.

Samuel, E. (1965). Sequential compound estimators. Ann. Math. Statist. 36, No. 3, 879-889.

Zhang, C.-H. (2003). Compound decision theory and empirical Bayes methods (invited paper). Ann. Statist. 31, 379-390.