Efficiency requires innovation
Abram Kagan
Department of Mathematics, University of Maryland
College Park, MD 20742, USA
Abstract
In estimating a parameter $\theta \in \mathbb{R}$ from a sample $(x_1, \ldots, x_n)$ from a population $P_\theta$, a simple way of incorporating a new observation $x_{n+1}$ into an estimator $\tilde{\theta}_n = \tilde{\theta}_n(x_1, \ldots, x_n)$ is transforming $\tilde{\theta}_n$ to what we call the jackknife extension $\tilde{\theta}^{(e)}_{n+1} = \tilde{\theta}^{(e)}_{n+1}(x_1, \ldots, x_n, x_{n+1})$,
$$
\tilde{\theta}^{(e)}_{n+1} = \{\tilde{\theta}_n(x_1, \ldots, x_n) + \tilde{\theta}_n(x_{n+1}, x_2, \ldots, x_n) + \ldots + \tilde{\theta}_n(x_1, \ldots, x_{n-1}, x_{n+1})\}/(n+1).
$$
Though $\tilde{\theta}^{(e)}_{n+1}$ lacks the innovation the statistician could expect from a larger data set, it is still better than $\tilde{\theta}_n$:
$$
\mathrm{var}(\tilde{\theta}^{(e)}_{n+1}) \le \frac{n}{n+1}\,\mathrm{var}(\tilde{\theta}_n).
$$
However, an estimator obtained by jackknife extension for all $n$ is asymptotically efficient only for samples from exponential families. For a general $P_\theta$, asymptotically efficient estimators require innovation when a new observation is added to the data. Some examples illustrate the concept.

1 Introduction
Let $\tilde{\theta}_n = \tilde{\theta}_n(x_1, \ldots, x_n)$ be an estimator of $\theta$ based on a sample of size $n$ from a population $P_\theta$ with $\theta \in \mathbb{R}$ as a parameter. If another observation $x_{n+1}$ is added to the data, a simple way of incorporating it in the existing estimator is by what we call the jackknife extension,
$$
\tilde{\theta}^{(e)}_{n+1} = \tilde{\theta}^{(e)}_{n+1}(x_1, \ldots, x_n, x_{n+1}) = (\tilde{\theta}_{n,1} + \ldots + \tilde{\theta}_{n,n+1})/(n+1) \tag{1}
$$
where
$$
\tilde{\theta}_{n,i} = \tilde{\theta}_n(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{n+1}), \quad i = 1, \ldots, n+1,
$$
so that $\tilde{\theta}_{n,n+1} = \tilde{\theta}_n(x_1, \ldots, x_n)$. Plainly, $E_\theta(\tilde{\theta}^{(e)}_{n+1}) = E_\theta(\tilde{\theta}_n)$, and if $\tilde{\theta}_n$ is symmetric in its arguments (as is usually the case) the jackknife extension is symmetric in $x_1, \ldots, x_n, x_{n+1}$.
If $\mathrm{var}_\theta(\tilde{\theta}_n) < \infty$, then not only $\mathrm{var}_\theta(\tilde{\theta}^{(e)}_{n+1}) < \mathrm{var}_\theta(\tilde{\theta}_n)$ but a stronger inequality holds:
$$
(n+1)\,\mathrm{var}_\theta(\tilde{\theta}^{(e)}_{n+1}) \le n\,\mathrm{var}_\theta(\tilde{\theta}_n). \tag{2}
$$
The inequality (2) is a direct corollary of a special case of the so-called variance drop lemma due to Artstein et al. (2004).
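As a quick numerical check of (1) and (2), the following Python sketch forms the jackknife extension of an estimator by averaging over the leave-one-out subsamples and compares the two sides of (2) by simulation; the sample median as $\tilde{\theta}_n$ and the normal population are arbitrary choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def jackknife_extension(theta_n, sample):
    """Average theta_n over the leave-one-out subsamples of `sample`, as in (1)."""
    return np.mean([theta_n(np.delete(sample, i)) for i in range(len(sample))])

theta_n = np.median          # illustrative estimator: the sample median

n, reps = 5, 20000
ext_vals, base_vals = [], []
for _ in range(reps):
    x = rng.normal(size=n + 1)   # illustrative population
    ext_vals.append(jackknife_extension(theta_n, x))
    base_vals.append(theta_n(x[:n]))

# Inequality (2): (n+1) var(extension) <= n var(theta_n)
print((n + 1) * np.var(ext_vals), "<=", n * np.var(base_vals))
```

With these choices the left-hand side comes out visibly smaller than the right-hand side, in line with (2).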
Lemma 1. Let $X_1, \ldots, X_n, X_{n+1}$ be independent identically distributed random variables and $\psi(X_1, \ldots, X_n)$ a function with $E\{\psi(X_1, \ldots, X_n)^2\} < \infty$. Set
$$
\psi_1 = \psi(X_2, \ldots, X_{n+1}), \quad \psi_i = \psi(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_{n+1}), \ i = 2, \ldots, n+1.
$$
Then
$$
\mathrm{var}\Bigl(\sum_{i=1}^{n+1} \psi_i\Bigr) \le n \sum_{i=1}^{n+1} \mathrm{var}(\psi_i). \tag{3}
$$
Note that with $n+1$ instead of $n$ on the right hand side of (3), the inequality becomes a trivial corollary of
$$
\Bigl(\sum_{i=1}^{n+1} a_i\Bigr)^2 \le (n+1) \sum_{i=1}^{n+1} a_i^2
$$
holding for any numbers $a_1, \ldots, a_{n+1}$. For an extension of the variance drop lemma see Madiman and Barron (2007).

Suppose that starting with $n = m$ and $\tilde{\theta}_m(x_1, \ldots, x_m)$, the statistician constructs the jackknife extension $\tilde{\theta}^{(e)}_{m+1}$ of $\tilde{\theta}_m(x_1, \ldots, x_m)$, then the jackknife extension $\tilde{\theta}^{(e)}_{m+2}$ of $\tilde{\theta}^{(e)}_{m+1}$, and so on. One can easily see that for $n \ge m$ the estimator $\tilde{\theta}^{(e)}_n(x_1, \ldots, x_n)$ thus obtained is a classical $U$-statistic with the kernel $\tilde{\theta}_m(x_1, \ldots, x_m)$:
$$
\tilde{\theta}^{(e)}_n(x_1, \ldots, x_n) = \frac{1}{\binom{n}{m}} \sum_{1 \le i_1 < \ldots < i_m \le n} \tilde{\theta}_m(x_{i_1}, \ldots, x_{i_m}). \tag{4}
$$
Hoeffding initiated the study of $U$-statistics back in 1948. The variance of $\tilde{\theta}^{(e)}_n$ can be explicitly expressed in terms of $\tilde{\theta}_m$. Set
$$
\tilde{\theta}_{m|k}(x_1, \ldots, x_k) = E_\theta\{\tilde{\theta}_m(X_1, \ldots, X_m) \mid X_1 = x_1, \ldots, X_k = x_k\}.
$$
The following formula due to Hoeffding (1948) expresses $\mathrm{var}(\tilde{\theta}^{(e)}_n)$ via $v_k(\theta) = \mathrm{var}(\tilde{\theta}_{m|k}(X_1, \ldots, X_k))$, $k = 1, \ldots, m$:
$$
\mathrm{var}_\theta(\tilde{\theta}^{(e)}_n) = \frac{1}{\binom{n}{m}} \sum_{k=1}^{m} \binom{m}{k} \binom{n-m}{m-k} v_k(\theta). \tag{5}
$$
Assume that the distributions $P_\theta$ are given by a density $p(x;\theta)$ (with respect to a measure $\mu$), differentiable in $\theta$, with the Fisher information
$$
I(\theta) = \int \Bigl(\frac{\partial \log p(x;\theta)}{\partial \theta}\Bigr)^2 p(x;\theta)\, d\mu(x)
$$
well defined and finite. If $E_\theta(\tilde{\theta}_m) = \gamma(\theta)$, by the Cramér-Rao inequality
$$
\mathrm{var}_\theta(\tilde{\theta}_m) \ge \frac{|\gamma'(\theta)|^2}{m I(\theta)}.
$$
In particular, if $\tilde{\theta}_m$ is an unbiased estimator of $\theta$, then $\mathrm{var}_\theta(\tilde{\theta}_m) \ge 1/(m I(\theta))$. Furthermore, if $\mathrm{var}_\theta(\tilde{\theta}_m) < \infty$, then $\mathrm{var}_\theta(\tilde{\theta}^{(e)}_n) < \infty$ for all $n > m$ and the following lemma holds.

Lemma 2 (Hoeffding 1948). As $n \to \infty$, $\sqrt{n}\,(\tilde{\theta}^{(e)}_n - \gamma(\theta))$ is asymptotically normal $N(0, m^2 v_1(\theta))$.

On the other hand, by the Cramér-Rao inequality, for any unbiased estimator $\gamma_n$ of $\gamma(\theta)$ based on a sample $(x_1, \ldots, x_n)$ from a population with Fisher information $I(\theta)$,
$$
\mathrm{var}_\theta(\gamma_n) \ge \frac{|\gamma'(\theta)|^2}{n I(\theta)}. \tag{6}
$$
Combining (6) with Lemma 2 leads to a formula for the asymptotic efficiency of $\tilde{\theta}^{(e)}_n$:
$$
\mathrm{aseff}(\tilde{\theta}^{(e)}_n) = \frac{|\gamma'(\theta)|^2 / I(\theta)}{m^2 v_1(\theta)}. \tag{7}
$$
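The identity between the iterated jackknife extension and the $U$-statistic (4) is easy to verify numerically. In the Python sketch below the kernel (the variance kernel $\tilde{\theta}_2(x_1, x_2) = (x_1 - x_2)^2/2$ with $m = 2$) and the normal sample are assumptions made only for the illustration; the extension is iterated from $m$ up to $n$ and compared with the direct average over all $m$-subsets.

```python
import numpy as np
from itertools import combinations

def jackknife_extension_step(est):
    """Turn an estimator for samples of size k into its jackknife extension for size k+1."""
    return lambda s: np.mean([est(np.delete(s, i)) for i in range(len(s))])

def u_statistic(kernel, m, sample):
    """Classical U-statistic (4): average of the kernel over all m-subsets of the sample."""
    return np.mean([kernel(np.array(c)) for c in combinations(sample, m)])

m = 2
kernel = lambda x: 0.5 * (x[0] - x[1]) ** 2   # illustrative symmetric kernel (variance kernel)

rng = np.random.default_rng(1)
x = rng.normal(size=6)

est = kernel
for _ in range(len(x) - m):            # iterate the extension: m -> m+1 -> ... -> n
    est = jackknife_extension_step(est)

print(est(x), u_statistic(kernel, m, x))   # the two values agree, as claimed in (4)
```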
Lemma 3. Let $X$ be a random element, $X \sim p(x;\theta)$, with finite Fisher information $I(\theta)$. If $h(X)$ is a (scalar valued) function with $E_\theta(h(X)) = \mu(\theta)$ differentiable and $\mathrm{var}_\theta(h(X)) = \sigma^2(\theta) < \infty$, then
$$
I(\theta) \ge \frac{|\mu'(\theta)|^2}{\sigma^2(\theta)}. \tag{8}
$$
Proof. Take the projection of the Fisher score $J(X;\theta) = p'(x;\theta)/p(x;\theta)$ onto the subspace $\mathrm{span}\{1, h(X)\}$ of the Hilbert space of functions $g(X)$ with $E_\theta(|g(X)|^2) < \infty$:
$$
\hat{E}_\theta\{J(X;\theta) \mid 1, h(X)\} = \hat{J}(X;\theta) = a(\theta)(h(X) - \mu(\theta)). \tag{9}
$$
Multiplying both sides by $h(X) - \mu(\theta)$ and taking expectations results in $a(\theta) = \mu'(\theta)/\sigma^2(\theta)$, due to the property $E_\theta(J(X;\theta)h(X)) = \mu'(\theta)$ of the Fisher score. Hence
$$
I(\theta) = \mathrm{var}_\theta(J(X;\theta)) \ge \mathrm{var}_\theta(\hat{J}(X;\theta)) = \frac{|\mu'(\theta)|^2}{\sigma^2(\theta)},
$$
which is exactly (8). The equality sign in (8) is attained if and only if, with $P_\theta$-probability one, the relation
$$
\frac{p'(x;\theta)}{p(x;\theta)} = a(\theta)(h(x) - \gamma(\theta)) \tag{10}
$$
holds for some $a(\theta)$.

From $E_\theta(\tilde{\theta}_{m|1}) = \gamma(\theta)$, $v_1 = \mathrm{var}(\tilde{\theta}_{m|1})$ and (8) one gets
$$
\mathrm{aseff}(\tilde{\theta}^{(e)}_n) \le 1/m^2. \tag{11}
$$
Thus, a necessary condition for the asymptotic efficiency of $\tilde{\theta}^{(e)}_n$ is $m = 1$ and, by virtue of (4),
$$
\tilde{\theta}^{(e)}_n(x_1, \ldots, x_n) = (h(x_1) + \ldots + h(x_n))/n \tag{12}
$$
for some $h(x)$ with $E_\theta\{h(X)\} = \gamma(\theta)$.

By Lemma 3, the estimator (12) is an asymptotically efficient estimator of $\gamma(\theta)$ if and only if the relation (10) holds, implying that the family is exponential,
$$
p(x;\theta) = \exp\{A(\theta)h(x) + B(\theta) + g(x)\} \tag{13}
$$
where the functions in the exponent are such that $E_\theta(h(X)) = \gamma(\theta)$.

From (10) one can see that the maximum likelihood equation for $\theta$ based on a sample $(x_1, \ldots, x_n)$ from population (13) is
$$
(h(x_1) + \ldots + h(x_n))/n = \gamma(\theta) \tag{14}
$$
and $\tilde{\theta}^{(e)}_n(x_1, \ldots, x_n) = (h(x_1) + \ldots + h(x_n))/n$, as the maximum likelihood estimator of $\gamma(\theta)$, is asymptotically efficient.
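For a concrete standard instance of (13) and (14), one can take the Poisson family with mean $\theta$:
$$
p(x;\theta) = e^{-\theta}\frac{\theta^x}{x!} = \exp\{x \log\theta - \theta - \log x!\},
$$
so that $h(x) = x$, $A(\theta) = \log\theta$, $B(\theta) = -\theta$, $g(x) = -\log x!$ and $\gamma(\theta) = E_\theta(h(X)) = \theta$. Here the score is
$$
\frac{p'(x;\theta)}{p(x;\theta)} = \frac{x}{\theta} - 1 = \frac{1}{\theta}\,(h(x) - \gamma(\theta)),
$$
exactly of the form (10) with $a(\theta) = 1/\theta$, and the likelihood equation (14) becomes $\bar{x} = \theta$: the sample mean, i.e. the jackknife extension of $\tilde{\theta}_1(x_1) = x_1$, is the efficient (here even exactly, not only asymptotically) estimator of $\theta$.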
We summarize the above as a theorem.

Theorem 1. Under the regularity-type conditions of the theory of maximum likelihood estimators, the jackknife extension estimators are asymptotically efficient if and only if they are arithmetic means based on samples from exponential families.
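A small simulation illustrates the "only if" part; the Laplace location family, the sample size and the seed below are arbitrary choices for the sketch. The standard double exponential (Laplace) density $\frac{1}{2}e^{-|x-\theta|}$ is not of the form (13) with $h(x) = x$, its Fisher information per observation equals 1, and the maximum likelihood estimator of $\theta$ is the sample median. The sample mean, i.e. the jackknife extension of $\tilde{\theta}_1(x_1) = x_1$, then has asymptotic efficiency $1/2$, while the median, an estimator that is recomputed afresh for every $n$, attains the Cramér-Rao benchmark.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 400, 5000      # illustrative sample size and number of replications

means, medians = [], []
for _ in range(reps):
    x = rng.laplace(loc=0.0, scale=1.0, size=n)   # standard Laplace location family, theta = 0
    means.append(x.mean())        # jackknife extension of theta_1(x_1) = x_1
    medians.append(np.median(x))  # MLE of the location parameter

print("n * var(mean)   ~", n * np.var(means))    # ~ 2, efficiency ~ 1/2
print("n * var(median) ~", n * np.var(medians))  # ~ 1, the Cramer-Rao benchmark 1/I(theta)
```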
The jackknife extension lacks innovation. A jackknife extension estimator based on $(x_1, \ldots, x_{n+1})$ differs from the estimator based on $(x_1, \ldots, x_n)$ only by the sample size. In a sense, it is an extensive vs. intensive use of the data, when the main factor is quantity vs. quality.

Nonparametric estimators of population characteristics such as the empirical distribution function, the sample mean and the sample variance are jackknife extensions. Their main goal is to be universal rather than optimal for individual populations.

An interesting statistic is the sample median $\tilde{\mu}_n = \tilde{\mu}_n(x_1, \ldots, x_n)$ constructed from a sample from a continuous population. Without loss of generality, one may assume $x_1 < \ldots < x_n$. For $n = 2m+1$, $\tilde{\mu}_n = x_{m+1}$. If $x_{n+1} < x_1$ or $x_{n+1} > x_n$, one can easily see that
$$
\tilde{\mu}^{(e)}_{n+1} = (x'_{m+1} + x'_{m+2})/2
$$
where $x'_{m+1}$ and $x'_{m+2}$ are the $(m+1)$st and $(m+2)$nd elements of the ordered sample $(x_1, \ldots, x_{n+1})$. Thus, the median of a sample of an even size is a jackknife extension, though one should keep in mind that the definitions of the median in samples of even and odd size are different and it is not clear if the inequality (2) holds.

For $n = 2m$ the jackknife extension of $\tilde{\mu}_n = (x_m + x_{m+1})/2$ is not simply $x'_{m+1}$. Let us start with the simple cases of $m = 2$ and $m = 3$. In the first case,
$$
\tilde{\mu}^{(e)}_5 = (1.5\,x'_2 + 2\,x'_3 + 1.5\,x'_4)/5,
$$
a weighted average of $x'_3$ and its nearest neighbors. The same holds in the second case, with $x'_4$ instead of $x'_3$ and different weights:
$$
\tilde{\mu}^{(e)}_7 = (2\,x'_3 + 3\,x'_4 + 2\,x'_5)/7.
$$
It seems likely that the extrapolation to an arbitrary $n = 2m$ will result in
$$
\tilde{\mu}^{(e)}_{n+1} = \frac{((m+1)/2)\,x'_m + m\,x'_{m+1} + ((m+1)/2)\,x'_{m+2}}{n+1}. \tag{16}
$$
Though (16) is a reasonable estimator of the median, it is not clear how it behaves for $n = 2m$ in small and large samples compared to the standard $\tilde{\mu}_{n+1} = x'_{m+1}$.
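The weights in (16) can be checked numerically. The short Python sketch below (the normal sample and the value of $m$ are arbitrary choices for the illustration) computes the jackknife extension of the even-sample median directly and compares it with the closed form (16).

```python
import numpy as np

rng = np.random.default_rng(3)

def jackknife_extension(est, sample):
    """Average `est` over the leave-one-out subsamples, as in (1)."""
    return np.mean([est(np.delete(sample, i)) for i in range(len(sample))])

m = 5
n = 2 * m                              # even size of the base sample
x = np.sort(rng.normal(size=n + 1))    # x[0] < ... < x[n] play the role of x'_1 < ... < x'_{n+1}

direct = jackknife_extension(np.median, x)
closed_form = ((m + 1) / 2 * x[m - 1] + m * x[m] + (m + 1) / 2 * x[m + 1]) / (n + 1)

print(direct, closed_form)             # the two values coincide, in agreement with (16)
```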
References

Artstein, S., Ball, K. M., Barthe, F., Naor, A. (2004). Solution of Shannon's problem on the monotonicity of entropy. J. Amer. Math. Soc., 17, 975-982.

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Stat., 19, 293-325.

J. Math. Sci., 2, 202-214.

Madiman, M., Barron, A. (2007). Generalized entropy power inequalities and monotonicity properties of information. IEEE Transactions on Information Theory, 53, 2317-2329.