The RKHS Approach to Minimum Variance Estimation Revisited: Variance Bounds, Sufficient Statistics, and Exponential Families
Alexander Jung (a) (corresponding author), Sebastian Schmutzhard (b), and Franz Hlawatsch (a)
(a) Institute of Telecommunications, Vienna University of Technology; {ajung, fhlawats}@nt.tuwien.ac.at
(b) NuHAG, Faculty of Mathematics, University of Vienna; [email protected]
Abstract
The mathematical theory of reproducing kernel Hilbert spaces (RKHS) provides powerful tools for minimum variance estimation (MVE) problems. Here, we extend the classical RKHS-based analysis of MVE in several directions. We develop a geometric formulation of five known lower bounds on the estimator variance (Barankin bound, Cramér–Rao bound, constrained Cramér–Rao bound, Bhattacharyya bound, and Hammersley–Chapman–Robbins bound) in terms of orthogonal projections onto a subspace of the RKHS associated with a given MVE problem. We show that, under mild conditions, the Barankin bound (the tightest possible lower bound on the estimator variance) is a lower semi-continuous function of the parameter vector. We also show that the RKHS associated with an MVE problem remains unchanged if the observation is replaced by a sufficient statistic. Finally, for MVE problems conforming to an exponential family of distributions, we derive novel closed-form lower bounds on the estimator variance and show that a reduction of the parameter set leaves the minimum achievable variance unchanged.
Index Terms
Minimum variance estimation, exponential families, RKHS, Cramér–Rao bound, Barankin bound, Hammersley–Chapman–Robbins bound, Bhattacharyya bound, locally minimum variance unbiased estimator.
I. INTRODUCTION
We consider the problem of estimating the value $g(\mathbf{x})$ of a known deterministic function $g(\cdot)$ evaluated at an unknown nonrandom parameter vector $\mathbf{x} \in \mathcal{X}$, where the parameter set $\mathcal{X}$ is known. The estimation of $g(\mathbf{x})$ is based on an observed vector $\mathbf{y}$, which is modeled as a random vector with an associated probability measure [1] $\mu^{\mathbf{y}}_{\mathbf{x}}$ or, as a special case, an associated probability density function (pdf) $f(\mathbf{y};\mathbf{x})$, both parametrized by $\mathbf{x} \in \mathcal{X}$. More specifically, we study the problem of minimum variance estimation (MVE), where one aims at finding estimators with minimum variance under the constraint of a prescribed bias. Our treatment of MVE will be based on the mathematical framework and methodology of reproducing kernel Hilbert spaces (RKHS).

This work was supported by the FWF under Grants S10602-N13 (Signal and Information Representation) and S10603-N13 (Statistical Inference) within the National Research Network SISE and by the WWTF under Grant MA 07-004 (SPORTS). First revision; submitted to the IEEE Transactions on Information Theory, June 7, 2018.
A. State of the Art and Motivation
The RKHS approach to MVE was introduced in the seminal papers [2] and [3]. On a general level, the theory of RKHS yields efficient methods for high-dimensional optimization problems. These methods are popular, e.g., in machine learning [4], [5]. For the MVE problem considered here, the optimization problem is the minimization of the estimator variance subject to a bias constraint. The RKHS approach to MVE enables a consistent and intuitive geometric treatment of the MVE problem. In particular, the determination of the minimum achievable variance (or Barankin bound) and of the locally minimum variance estimator reduces to the computation of the squared norm and isometric image of a specific vector—representing the prescribed estimator bias—that belongs to the RKHS associated with the estimation problem. This reformulation is interesting from a theoretical perspective; in addition, it may also be the basis for an efficient computational evaluation. Furthermore, a wide class of lower bounds on the minimum achievable variance (and, in turn, on the variance of any estimator) is obtained by performing projections onto subspaces of the RKHS. Again, this enables an efficient computational evaluation of these bounds.

A specialization to estimation problems involving sparsity constraints was presented in [6]–[8]. For certain special cases of these sparse estimation problems, the RKHS approach allows the derivation of closed-form expressions of the minimum achievable variance and the corresponding locally minimum variance estimators. The RKHS approach has also proven to be a valuable tool for the analysis of estimation problems involving continuous-time random processes [2], [3], [9].
B. Contribution and Outline
The main contributions of this paper concern an RKHS-theoretic analysis of the performance of MVE, with a focus on questions related to lower variance bounds, sufficient statistics, and observations conforming to an exponential family of distributions. First, we give a geometric interpretation of some well-known lower bounds on the estimator variance. The tightest of these bounds, i.e., the Barankin bound, is proven to be a lower semi-continuous function of the parameter vector $\mathbf{x}$ under mild conditions. We then analyze the role of a sufficient statistic from the RKHS viewpoint. In particular, we prove that the RKHS associated with an estimation problem remains unchanged if the observation $\mathbf{y}$ is replaced by any sufficient statistic. Furthermore, we characterize the RKHS for estimation problems with observations conforming to an exponential family of distributions. It is found that this RKHS has a strong structural property, and that it is explicitly related to the moment-generating function of the exponential family. Inspired by this relation, we derive novel lower bounds on the estimator variance, and we analyze the effect of parameter set reductions. The lower bounds have a particularly simple form.

The remainder of this paper is organized as follows. In Section II, basic elements of MVE are reviewed and the RKHS approach to MVE is summarized. In Section III, we present an RKHS-based geometric interpretation of known variance bounds and demonstrate the lower semi-continuity of the Barankin bound. The effect of replacing the observation by a sufficient statistic is studied in Section IV. In Section V, the RKHS for exponential-family-based estimation problems is investigated, novel lower bounds on the estimator variance are derived, and the effect of a parameter set reduction is analyzed. We note that the proofs of most of the new results presented can be found in the doctoral dissertation [10] and will be referenced in each case.
C. Notation and Basic Definitions
We will use the shorthand notations $\mathbb{N} \triangleq \{1, 2, 3, \ldots\}$, $\mathbb{Z}_+ \triangleq \{0, 1, 2, \ldots\}$, and $[N] \triangleq \{1, 2, \ldots, N\}$. The open ball in $\mathbb{R}^N$ with radius $r > 0$ and centered at $\mathbf{x}_c$ is defined as $\mathcal{B}(\mathbf{x}_c, r) \triangleq \{\mathbf{x} \in \mathbb{R}^N \,|\, \|\mathbf{x} - \mathbf{x}_c\|_2 < r\}$. We call $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^N$ an interior point if $\mathcal{B}(\mathbf{x}, r) \subseteq \mathcal{X}$ for some $r > 0$. The set of all interior points of $\mathcal{X}$ is called the interior of $\mathcal{X}$ and denoted $\mathcal{X}^o$. A set $\mathcal{X}$ is called open if $\mathcal{X} = \mathcal{X}^o$.

Boldface lowercase (uppercase) letters denote vectors (matrices). The superscript $^T$ stands for transposition. The $k$th entry of a vector $\mathbf{x}$ and the entry in the $k$th row and $l$th column of a matrix $\mathbf{A}$ are denoted by $(\mathbf{x})_k = x_k$ and $(\mathbf{A})_{k,l} = A_{k,l}$, respectively. The $k$th unit vector is denoted by $\mathbf{e}_k$, and the identity matrix of size $N \times N$ by $\mathbf{I}_N$. The Moore–Penrose pseudoinverse [11] of a rectangular matrix $\mathbf{F} \in \mathbb{R}^{M \times N}$ is denoted by $\mathbf{F}^{\dagger}$.

A function $f(\cdot): \mathcal{D} \to \mathbb{R}$, with $\mathcal{D} \subseteq \mathbb{R}^N$, is said to be lower semi-continuous at $\mathbf{x}_0 \in \mathcal{D}$ if for every $\varepsilon > 0$ there is a radius $r > 0$ such that $f(\mathbf{x}) \geq f(\mathbf{x}_0) - \varepsilon$ for all $\mathbf{x} \in \mathcal{B}(\mathbf{x}_0, r)$. (This definition is equivalent to $\liminf_{\mathbf{x} \to \mathbf{x}_0} f(\mathbf{x}) \geq f(\mathbf{x}_0)$, where $\liminf_{\mathbf{x} \to \mathbf{x}_0} f(\mathbf{x}) \triangleq \sup_{r > 0} \big\{ \inf_{\mathbf{x} \in \mathcal{D} \cap [\mathcal{B}(\mathbf{x}_0, r) \setminus \{\mathbf{x}_0\}]} f(\mathbf{x}) \big\}$ [12], [13].) The restriction of a function $f(\cdot): \mathcal{D} \to \mathbb{R}$ to a subdomain $\mathcal{D}' \subseteq \mathcal{D}$ is denoted by $f(\cdot)|_{\mathcal{D}'}$. Given a multi-index $\mathbf{p} = (p_1 \cdots p_N)^T \in \mathbb{Z}_+^N$, we define the partial derivative of order $\mathbf{p}$ of a real-valued function $f(\cdot): \mathcal{D} \to \mathbb{R}$, with $\mathcal{D} \subseteq \mathbb{R}^N$, as $\frac{\partial^{\mathbf{p}} f(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}} \triangleq \frac{\partial^{p_1}}{\partial x_1^{p_1}} \cdots \frac{\partial^{p_N}}{\partial x_N^{p_N}} f(\mathbf{x})$ (if it exists) [13], [14]. Similarly, for a function $f(\cdot,\cdot): \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ and two multi-indices $\mathbf{p}_1, \mathbf{p}_2 \in \mathbb{Z}_+^N$, we denote by $\frac{\partial^{\mathbf{p}_1} \partial^{\mathbf{p}_2} f(\mathbf{x}_1, \mathbf{x}_2)}{\partial \mathbf{x}_1^{\mathbf{p}_1} \partial \mathbf{x}_2^{\mathbf{p}_2}}$ the partial derivative of order $(\mathbf{p}_1, \mathbf{p}_2)$, where $f(\mathbf{x}_1, \mathbf{x}_2)$ is considered as a function of the "super-vector" $(\mathbf{x}_1^T \, \mathbf{x}_2^T)^T$ of length $2N$. Given a vector-valued function $\boldsymbol{\phi}(\cdot): \mathbb{R}^M \to \mathbb{R}^N$ and $\mathbf{p} \in \mathbb{Z}_+^N$, we denote the product $\prod_{k=1}^N (\phi_k(\mathbf{y}))^{p_k}$ by $\boldsymbol{\phi}^{\mathbf{p}}(\mathbf{y})$.

The probability measure of a random vector $\mathbf{y}$ taking on values in $\mathbb{R}^M$ is denoted by $\mu^{\mathbf{y}}$ [1], [15]–[17]. We consider probability measures that are defined on the measure space given by all $M$-dimensional Borel sets on $\mathbb{R}^M$ [1, Sec. 10]. The probability measure assigns to a measurable set $\mathcal{A} \subseteq \mathbb{R}^M$ the probability
$$P\{\mathbf{y} \in \mathcal{A}\} \triangleq \int_{\mathbb{R}^M} \mathcal{I}_{\mathcal{A}}(\mathbf{y}')\, d\mu^{\mathbf{y}}(\mathbf{y}') = \int_{\mathcal{A}} d\mu^{\mathbf{y}}(\mathbf{y}'),$$
where $\mathcal{I}_{\mathcal{A}}(\cdot): \mathbb{R}^M \to \{0, 1\}$ denotes the indicator function of the set $\mathcal{A}$. We will also consider a family of probability measures $\{\mu^{\mathbf{y}}_{\mathbf{x}}\}_{\mathbf{x} \in \mathcal{X}}$ parametrized by a nonrandom parameter vector $\mathbf{x} \in \mathcal{X}$. We assume that there exists a dominating measure $\mu_{\mathcal{E}}$, so that we can define the pdf $f(\mathbf{y};\mathbf{x})$ (again parametrized by $\mathbf{x}$) as the Radon–Nikodym derivative of the measure $\mu^{\mathbf{y}}_{\mathbf{x}}$ with respect to the measure $\mu_{\mathcal{E}}$ [1], [15]–[17]. (In general, we will choose for $\mu_{\mathcal{E}}$ the Lebesgue measure on $\mathbb{R}^M$.) We refer to both the set of measures $\{\mu^{\mathbf{y}}_{\mathbf{x}}\}_{\mathbf{x} \in \mathcal{X}}$ and the set of pdfs $\{f(\mathbf{y};\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$ as the statistical model. Given a (possibly vector-valued) deterministic function $\mathbf{t}(\mathbf{y})$, the expectation operation is defined by [1]
$$E_{\mathbf{x}}\{\mathbf{t}(\mathbf{y})\} \triangleq \int_{\mathbb{R}^M} \mathbf{t}(\mathbf{y}')\, d\mu^{\mathbf{y}}_{\mathbf{x}}(\mathbf{y}') = \int_{\mathbb{R}^M} \mathbf{t}(\mathbf{y}')\, f(\mathbf{y}';\mathbf{x})\, d\mathbf{y}', \qquad (1)$$
where the subscript in $E_{\mathbf{x}}$ indicates the dependence on the parameter vector $\mathbf{x}$ parametrizing $\mu^{\mathbf{y}}_{\mathbf{x}}$ and $f(\mathbf{y};\mathbf{x})$.

II. FUNDAMENTALS
A. Review of MVE
It will be convenient to denote a classical (frequentist) estimation problem by the triple $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$, consisting of the parameter set $\mathcal{X}$, the statistical model $\{f(\mathbf{y};\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$, and the parameter function $g(\cdot): \mathcal{X} \to \mathbb{R}^P$. Note that our setting includes estimation of the parameter vector $\mathbf{x}$ itself, which is obtained when $g(\mathbf{x}) = \mathbf{x}$. The result of estimating $g(\mathbf{x})$ from $\mathbf{y}$ is an estimate $\hat{\mathbf{g}} \in \mathbb{R}^P$, which is derived from $\mathbf{y}$ via a deterministic estimator $\hat{\mathbf{g}}(\cdot): \mathbb{R}^M \to \mathbb{R}^P$, i.e., $\hat{\mathbf{g}} = \hat{\mathbf{g}}(\mathbf{y})$. We assume that any estimator is a measurable mapping from $\mathbb{R}^M$ to $\mathbb{R}^P$ [1, Sec. 13]. A convenient characterization of the performance of an estimator $\hat{\mathbf{g}}(\cdot)$ is the mean squared error (MSE) defined as
$$\varepsilon \triangleq E_{\mathbf{x}}\{\|\hat{\mathbf{g}}(\mathbf{y}) - g(\mathbf{x})\|_2^2\} = \int_{\mathbb{R}^M} \|\hat{\mathbf{g}}(\mathbf{y}) - g(\mathbf{x})\|_2^2\, f(\mathbf{y};\mathbf{x})\, d\mathbf{y}.$$
We will write $\varepsilon(\hat{\mathbf{g}}(\cdot); \mathbf{x})$ to explicitly indicate the dependence of the MSE on the estimator $\hat{\mathbf{g}}(\cdot)$ and the parameter vector $\mathbf{x}$. Unfortunately, for a general estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$, there does not exist an estimator $\hat{\mathbf{g}}(\cdot)$ that minimizes the MSE simultaneously for all parameter vectors $\mathbf{x} \in \mathcal{X}$ [18], [19]. This follows from the fact that minimizing the MSE at a given parameter vector $\mathbf{x}_0$ always yields zero MSE; this is achieved by the estimator $\hat{\mathbf{g}}(\mathbf{y}) \equiv g(\mathbf{x}_0)$, which completely ignores the observation $\mathbf{y}$.

A popular rationale for the design of good estimators is MVE. This approach is based on the MSE decomposition
$$\varepsilon(\hat{\mathbf{g}}(\cdot); \mathbf{x}) = \|\mathbf{b}(\hat{\mathbf{g}}(\cdot); \mathbf{x})\|_2^2 + v(\hat{\mathbf{g}}(\cdot); \mathbf{x}), \qquad (2)$$
with the estimator bias $\mathbf{b}(\hat{\mathbf{g}}(\cdot); \mathbf{x}) \triangleq E_{\mathbf{x}}\{\hat{\mathbf{g}}(\mathbf{y})\} - g(\mathbf{x})$ and the estimator variance $v(\hat{\mathbf{g}}(\cdot); \mathbf{x}) \triangleq E_{\mathbf{x}}\{\|\hat{\mathbf{g}}(\mathbf{y}) - E_{\mathbf{x}}\{\hat{\mathbf{g}}(\mathbf{y})\}\|_2^2\}$. In MVE, one fixes the bias for all parameter vectors, i.e., $\mathbf{b}(\hat{\mathbf{g}}(\cdot); \mathbf{x}) \overset{!}{=} \mathbf{c}(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$, with a prescribed bias function $\mathbf{c}(\cdot): \mathcal{X} \to \mathbb{R}^P$, and considers only estimators with the given bias. Note that fixing the estimator bias is equivalent to fixing the estimator mean, i.e., $E_{\mathbf{x}}\{\hat{\mathbf{g}}(\mathbf{y})\} \overset{!}{=} \boldsymbol{\gamma}(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$, with the prescribed mean function $\boldsymbol{\gamma}(\mathbf{x}) \triangleq \mathbf{c}(\mathbf{x}) + g(\mathbf{x})$. The important special case of unbiased estimation is obtained for $\mathbf{c}(\mathbf{x}) \equiv \mathbf{0}$ or, equivalently, $\boldsymbol{\gamma}(\mathbf{x}) \equiv g(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$. Fixing the bias can be viewed as a kind of regularization of the set of considered estimators [15], [19], because useless estimators like the estimator $\hat{\mathbf{g}}(\mathbf{y}) \equiv g(\mathbf{x}_0)$ are excluded. Another justification for fixing the bias is the fact that, if a large number of independent and identically distributed (i.i.d.) realizations $\{\mathbf{y}_i\}_{i=1}^{L}$ of the vector $\mathbf{y}$ are observed, then, under certain technical conditions, the bias term dominates in the decomposition (2). Thus, in that case, the MSE is small if and only if the bias is small; this means that the estimator has to be effectively unbiased, i.e., $\mathbf{b}(\hat{\mathbf{g}}(\cdot); \mathbf{x}) \approx \mathbf{0}$ for all $\mathbf{x} \in \mathcal{X}$.

For a fixed "reference" parameter vector $\mathbf{x}_0 \in \mathcal{X}$ and a prescribed bias function $\mathbf{c}(\cdot)$, we define the set of allowed estimators by
$$\mathcal{A}(\mathbf{c}(\cdot), \mathbf{x}_0) \triangleq \big\{\hat{\mathbf{g}}(\cdot) \,\big|\, v(\hat{\mathbf{g}}(\cdot); \mathbf{x}_0) < \infty,\; \mathbf{b}(\hat{\mathbf{g}}(\cdot); \mathbf{x}) = \mathbf{c}(\mathbf{x}) \;\forall \mathbf{x} \in \mathcal{X}\big\}.$$
We call a bias function $\mathbf{c}(\cdot)$ valid for the estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ at $\mathbf{x}_0 \in \mathcal{X}$ if the set $\mathcal{A}(\mathbf{c}(\cdot), \mathbf{x}_0)$ is nonempty. This means that there is at least one estimator $\hat{\mathbf{g}}(\cdot)$ with finite variance at $\mathbf{x}_0$ and whose bias equals $\mathbf{c}(\cdot)$, i.e., $\mathbf{b}(\hat{\mathbf{g}}(\cdot); \mathbf{x}) = \mathbf{c}(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$.
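The decomposition (2) is easy to verify numerically. The following minimal sketch (not part of the original paper) assumes a toy model of $L$ i.i.d. samples $y_i \sim \mathcal{N}(x, \sigma^2)$ and the biased shrinkage estimator $\hat{g}(\mathbf{y}) = a\,\bar{y}$; both the model and the factor $a$ are illustrative choices. The Monte Carlo estimates of MSE, squared bias, and variance satisfy (2) up to simulation noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumption, not from the paper): estimate g(x) = x from
# L i.i.d. samples y_i ~ N(x, sigma^2) with the shrinkage estimator a * mean(y).
x_true, sigma, L, a = 2.0, 1.0, 5, 0.8
n_trials = 200_000

y = rng.normal(x_true, sigma, size=(n_trials, L))
g_hat = a * y.mean(axis=1)

mse = np.mean((g_hat - x_true) ** 2)      # epsilon(g_hat(.); x)
bias = np.mean(g_hat) - x_true            # b(g_hat(.); x)
var = np.var(g_hat)                       # v(g_hat(.); x)

print(f"MSE           : {mse:.4f}")
print(f"bias^2 + var  : {bias**2 + var:.4f}")   # matches the MSE, cf. (2)
print(f"closed form   : {(a - 1)**2 * x_true**2 + a**2 * sigma**2 / L:.4f}")
```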
From (2), it follows that for a fixed bias $\mathbf{c}(\cdot)$, minimizing the MSE $\varepsilon(\hat{\mathbf{g}}(\cdot); \mathbf{x}_0)$ is equivalent to minimizing the variance $v(\hat{\mathbf{g}}(\cdot); \mathbf{x}_0)$. Therefore, in MVE, one attempts to find estimators that minimize the variance under the constraint of a prescribed bias function $\mathbf{c}(\cdot)$. Let
$$M(\mathbf{c}(\cdot), \mathbf{x}_0) \triangleq \inf_{\hat{\mathbf{g}}(\cdot) \in \mathcal{A}(\mathbf{c}(\cdot), \mathbf{x}_0)} v(\hat{\mathbf{g}}(\cdot); \mathbf{x}_0) \qquad (3)$$
denote the minimum (strictly speaking, infimum) variance at $\mathbf{x}_0$ for bias function $\mathbf{c}(\cdot)$. If $\mathcal{A}(\mathbf{c}(\cdot), \mathbf{x}_0)$ is empty, i.e., if $\mathbf{c}(\cdot)$ is not valid, we set $M(\mathbf{c}(\cdot), \mathbf{x}_0) \triangleq \infty$. Any estimator $\hat{\mathbf{g}}^{(\mathbf{x}_0)}(\cdot) \in \mathcal{A}(\mathbf{c}(\cdot), \mathbf{x}_0)$ that achieves the infimum in (3), i.e., for which $v(\hat{\mathbf{g}}^{(\mathbf{x}_0)}(\cdot); \mathbf{x}_0) = M(\mathbf{c}(\cdot), \mathbf{x}_0)$, is called a locally minimum variance (LMV) estimator at $\mathbf{x}_0$ for bias function $\mathbf{c}(\cdot)$ [2], [3], [15]. The corresponding minimum variance $M(\mathbf{c}(\cdot), \mathbf{x}_0)$ is called the minimum achievable variance at $\mathbf{x}_0$ for bias function $\mathbf{c}(\cdot)$. The minimization problem (3) is referred to as a minimum variance problem (MVP). By its definition in (3), $M(\mathbf{c}(\cdot), \mathbf{x}_0)$ is a lower bound on the variance at $\mathbf{x}_0$ of any estimator with bias function $\mathbf{c}(\cdot)$, i.e.,
$$\hat{\mathbf{g}}(\cdot) \in \mathcal{A}(\mathbf{c}(\cdot), \mathbf{x}_0) \;\Rightarrow\; v(\hat{\mathbf{g}}(\cdot); \mathbf{x}_0) \geq M(\mathbf{c}(\cdot), \mathbf{x}_0). \qquad (4)$$
In fact, $M(\mathbf{c}(\cdot), \mathbf{x}_0)$ is the tightest lower bound, which is sometimes referred to as the Barankin bound.

If, for a prescribed bias function $\mathbf{c}(\cdot)$, there exists an estimator that is the LMV estimator simultaneously at all $\mathbf{x}_0 \in \mathcal{X}$, then that estimator is called the uniformly minimum variance (UMV) estimator for bias function $\mathbf{c}(\cdot)$ [2], [3], [15]. For many estimation problems, a UMV estimator does not exist. However, it always exists if there exists a complete sufficient statistic [15, Theorem 1.11 and Corollary 1.12], [20, Theorem 6.2.25]. Under mild conditions, this includes the case where the statistical model corresponds to an exponential family.

The variance to be minimized can be decomposed as $v(\hat{\mathbf{g}}(\cdot); \mathbf{x}_0) = \sum_{l \in [P]} v(\hat{g}_l(\cdot); \mathbf{x}_0)$, where $\hat{g}_l(\cdot) \triangleq (\hat{\mathbf{g}}(\cdot))_l$ and $v(\hat{g}_l(\cdot); \mathbf{x}_0) \triangleq E_{\mathbf{x}_0}\{[\hat{g}_l(\mathbf{y}) - E_{\mathbf{x}_0}\{\hat{g}_l(\mathbf{y})\}]^2\}$ for $l \in [P]$. Moreover, $\hat{\mathbf{g}}(\cdot) \in \mathcal{A}(\mathbf{c}(\cdot), \mathbf{x}_0)$ if and only if $\hat{g}_l(\cdot) \in \mathcal{A}(c_l(\cdot), \mathbf{x}_0)$ for all $l \in [P]$, where $c_l(\cdot) \triangleq (\mathbf{c}(\cdot))_l$. It follows that the minimization of $v(\hat{\mathbf{g}}(\cdot); \mathbf{x}_0)$ can be reduced to $P$ separate problems of minimizing the component variances $v(\hat{g}_l(\cdot); \mathbf{x}_0)$, each involving the optimization of a single scalar component $\hat{g}_l(\cdot)$ of $\hat{\mathbf{g}}(\cdot)$ subject to the scalar bias constraint $b(\hat{g}_l(\cdot); \mathbf{x}) = c_l(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$. Therefore, without loss of generality, we will hereafter assume that the parameter function $g(\mathbf{x})$ is scalar-valued, i.e., $P = 1$.

B. Review of the RKHS Approach to MVE
A powerful mathematical toolbox for MVE is provided by RKHS theory [2], [3], [21]. In this subsection, we review basic definitions and results of RKHS theory and its application to MVE, and we discuss a differentiability property that will be relevant to the variance bounds considered in Section III.

An RKHS is associated with a kernel function, which is a function $R(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the following two properties [21]:
• It is symmetric, i.e., $R(\mathbf{x}_1, \mathbf{x}_2) = R(\mathbf{x}_2, \mathbf{x}_1)$ for all $\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X}$.
• For every finite set $\{\mathbf{x}_1, \ldots, \mathbf{x}_D\} \subseteq \mathcal{X}$, the matrix $\mathbf{R} \in \mathbb{R}^{D \times D}$ with entries $R_{m,n} = R(\mathbf{x}_m, \mathbf{x}_n)$ is positive semidefinite.
There exists an RKHS for any kernel function $R(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ [21]. This RKHS, denoted $\mathcal{H}(R)$, is a Hilbert space equipped with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}(R)}$ such that, for any $\mathbf{x}_1 \in \mathcal{X}$,
• $R(\cdot, \mathbf{x}_1) \in \mathcal{H}(R)$ (here, $R(\cdot, \mathbf{x}_1)$ denotes the function $f_{\mathbf{x}_1}(\mathbf{x}') = R(\mathbf{x}', \mathbf{x}_1)$ with a fixed $\mathbf{x}_1 \in \mathcal{X}$);
• for any function $f(\cdot) \in \mathcal{H}(R)$,
$$\langle f(\cdot), R(\cdot, \mathbf{x}_1) \rangle_{\mathcal{H}(R)} = f(\mathbf{x}_1). \qquad (5)$$
Relation (5), which is known as the reproducing property, defines the inner product $\langle f, g \rangle_{\mathcal{H}(R)}$ for all $f(\cdot), g(\cdot) \in \mathcal{H}(R)$ because (in a certain sense) any $f(\cdot) \in \mathcal{H}(R)$ can be expanded into the set of functions $\{R(\cdot, \mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$. In particular, consider two functions $f(\cdot), g(\cdot) \in \mathcal{H}(R)$ that are given as $f(\cdot) = \sum_{\mathbf{x}_k \in \mathcal{D}} a_k R(\cdot, \mathbf{x}_k)$ and $g(\cdot) = \sum_{\mathbf{x}'_l \in \mathcal{D}'} b_l R(\cdot, \mathbf{x}'_l)$ with coefficients $a_k, b_l \in \mathbb{R}$ and (possibly infinite) sets $\mathcal{D}, \mathcal{D}' \subseteq \mathcal{X}$. Then, by the linearity of inner products and (5),
$$\langle f(\cdot), g(\cdot) \rangle_{\mathcal{H}(R)} = \sum_{\mathbf{x}_k \in \mathcal{D}} \sum_{\mathbf{x}'_l \in \mathcal{D}'} a_k b_l\, R(\mathbf{x}_k, \mathbf{x}'_l).$$
1) The RKHS Associated with an MVP:
Consider the class of MVPs that is defined by an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$, a reference parameter vector $\mathbf{x}_0 \in \mathcal{X}$, and all possible prescribed bias functions $c(\cdot): \mathcal{X} \to \mathbb{R}$. With this class of MVPs, we can associate a kernel function $R_{\mathcal{E},\mathbf{x}_0}(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and, in turn, an RKHS $\mathcal{H}(R_{\mathcal{E},\mathbf{x}_0})$ [2], [3]. (Note that, as our notation indicates, $R_{\mathcal{E},\mathbf{x}_0}(\cdot,\cdot)$ and $\mathcal{H}(R_{\mathcal{E},\mathbf{x}_0})$ depend on $\mathcal{E}$ and $\mathbf{x}_0$ but not on $c(\cdot)$.) We assume that
$$P\{f(\mathbf{y};\mathbf{x}_0) \neq 0\} = 1, \qquad (6)$$
where the probability is evaluated for the underlying dominating measure $\mu_{\mathcal{E}}$. We can then define the likelihood ratio as
$$\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}) \triangleq \begin{cases} \dfrac{f(\mathbf{y};\mathbf{x})}{f(\mathbf{y};\mathbf{x}_0)}, & \text{if } f(\mathbf{y};\mathbf{x}_0) \neq 0, \\[1mm] 0, & \text{else.} \end{cases} \qquad (7)$$
We consider $\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x})$ as a random variable (since it is a function of the random vector $\mathbf{y}$) that is parametrized by $\mathbf{x} \in \mathcal{X}$. Furthermore, we define the Hilbert space $\mathcal{L}_{\mathcal{E},\mathbf{x}_0}$ as the closure of the linear span of the set of random variables $\{\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$. (A detailed discussion of the concepts of closure, inner product, orthonormal basis, and linear span in the context of abstract Hilbert space theory can be found in [2], [22].) The topology of $\mathcal{L}_{\mathcal{E},\mathbf{x}_0}$ is determined by the inner product $\langle \cdot, \cdot \rangle_{\mathrm{RV}}: \mathcal{L}_{\mathcal{E},\mathbf{x}_0} \times \mathcal{L}_{\mathcal{E},\mathbf{x}_0} \to \mathbb{R}$ defined by
$$\langle \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_1), \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_2) \rangle_{\mathrm{RV}} \triangleq E_{\mathbf{x}_0}\{\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_1)\, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_2)\}. \qquad (8)$$
It can be shown that it is sufficient to define the inner product only for the random variables $\{\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$ [2]. We will assume that
$$\langle \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_1), \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_2) \rangle_{\mathrm{RV}} < \infty, \qquad \text{for all } \mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X}. \qquad (9)$$
The assumptions (6) and (9) (or variants thereof) are standard in the literature on MVE [2], [3], [23], [24]. They are typically satisfied for the important and large class of estimation problems arising from exponential families (cf. Section V).

The inner product $\langle \cdot, \cdot \rangle_{\mathrm{RV}}$ can now be interpreted as a kernel function $R_{\mathcal{E},\mathbf{x}_0}(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$:
$$R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_1, \mathbf{x}_2) \triangleq \langle \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_1), \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_2) \rangle_{\mathrm{RV}}. \qquad (10)$$
The RKHS induced by $R_{\mathcal{E},\mathbf{x}_0}(\cdot,\cdot)$ will be denoted by $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$, i.e., $\mathcal{H}_{\mathcal{E},\mathbf{x}_0} \triangleq \mathcal{H}(R_{\mathcal{E},\mathbf{x}_0})$. This is the RKHS associated with the estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ and the corresponding class of MVPs at $\mathbf{x}_0 \in \mathcal{X}$.

We note that assumption (6) implies that the likelihood ratio $\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x})$ is measurable with respect to the underlying dominating measure $\mu_{\mathcal{E}}$. Furthermore, the likelihood ratio $\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x})$ is the Radon–Nikodym derivative [1], [16] of the probability measure $\mu^{\mathbf{y}}_{\mathbf{x}}$ induced by $f(\mathbf{y};\mathbf{x})$ with respect to the probability measure $\mu^{\mathbf{y}}_{\mathbf{x}_0}$ induced by $f(\mathbf{y};\mathbf{x}_0)$ (cf. [1], [22], [25]). It is also important to observe that $\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x})$ does not depend on the dominating measure $\mu_{\mathcal{E}}$ underlying the definition of the pdfs $f(\mathbf{y};\mathbf{x})$. Thus, the kernel $R_{\mathcal{E},\mathbf{x}_0}(\cdot,\cdot)$ given by (10) does not depend on $\mu_{\mathcal{E}}$ either. Moreover, under assumption (6), we can always use the measure $\mu^{\mathbf{y}}_{\mathbf{x}_0}$ as the base measure $\mu_{\mathcal{E}}$ for the estimation problem $\mathcal{E}$, since the Radon–Nikodym derivative $\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x})$ is well defined. Note that, trivially, this also implies that the measure $\mu^{\mathbf{y}}_{\mathbf{x}_0}$ dominates the measures $\{\mu^{\mathbf{y}}_{\mathbf{x}}\}_{\mathbf{x} \in \mathcal{X}}$ [1, p. 443].

The two Hilbert spaces $\mathcal{L}_{\mathcal{E},\mathbf{x}_0}$ and $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ are isometric. In fact, as proven in [2], a specific congruence (i.e., isometric mapping of functions in $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ to functions in $\mathcal{L}_{\mathcal{E},\mathbf{x}_0}$) $\mathsf{J}[\cdot]: \mathcal{H}_{\mathcal{E},\mathbf{x}_0} \to \mathcal{L}_{\mathcal{E},\mathbf{x}_0}$ is given by $\mathsf{J}[R_{\mathcal{E},\mathbf{x}_0}(\cdot, \mathbf{x})] = \rho_{\mathcal{E},\mathbf{x}_0}(\cdot, \mathbf{x})$.
The isometry $\mathsf{J}[f(\cdot)]$ can be evaluated for an arbitrary function $f(\cdot) \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ by expanding $f(\cdot)$ into the elementary functions $\{R_{\mathcal{E},\mathbf{x}_0}(\cdot, \mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$ (cf. [2]). Given the expansion $f(\cdot) = \sum_{\mathbf{x}_k \in \mathcal{D}} a_k R_{\mathcal{E},\mathbf{x}_0}(\cdot, \mathbf{x}_k)$ with coefficients $a_k \in \mathbb{R}$ and a (possibly infinite) set $\mathcal{D} \subseteq \mathcal{X}$, the isometric image of $f(\cdot)$ is obtained as $\mathsf{J}[f(\cdot)] = \sum_{\mathbf{x}_k \in \mathcal{D}} a_k\, \rho_{\mathcal{E},\mathbf{x}_0}(\cdot, \mathbf{x}_k)$.
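As an illustration of the kernel construction (6)–(10), consider the toy model of a single observation $y \sim \mathcal{N}(x, \sigma^2)$ (an assumption made only for this sketch). The likelihood ratio (7) is available in closed form, and the kernel (10) can be estimated by Monte Carlo simulation and compared with its closed-form value $\exp\big((x_1 - x_0)(x_2 - x_0)/\sigma^2\big)$ for this model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy model (not from the paper): a single observation y ~ N(x, sigma^2),
# reference parameter x0.  Likelihood ratio (7): rho(y, x) = f(y; x) / f(y; x0).
sigma, x0 = 1.0, 0.0

def rho(y, x):
    return np.exp((x - x0) * y / sigma**2 - (x**2 - x0**2) / (2 * sigma**2))

# Kernel (10): R_{E,x0}(x1, x2) = E_{x0}{ rho(y, x1) * rho(y, x2) }
x1, x2 = 0.7, -0.4
y = rng.normal(x0, sigma, size=2_000_000)
R_mc = np.mean(rho(y, x1) * rho(y, x2))

# Closed form for this Gaussian location model:
R_exact = np.exp((x1 - x0) * (x2 - x0) / sigma**2)
print(R_mc, "approx.", R_exact)
```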
2) RKHS-based Analysis of MVE:
An RKHS-based analysis of MVE is enabled by the following central result. Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$, a fixed reference parameter vector $\mathbf{x}_0 \in \mathcal{X}$, and a prescribed bias function $c(\cdot): \mathcal{X} \to \mathbb{R}$, corresponding to the prescribed mean function $\gamma(\cdot) \triangleq c(\cdot) + g(\cdot)$. Then, as shown in [2], [3], the following holds:
• The bias function $c(\cdot)$ is valid for $\mathcal{E}$ at $\mathbf{x}_0$ if and only if $\gamma(\cdot)$ belongs to the RKHS $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$, i.e.,
$$\mathcal{A}(c(\cdot), \mathbf{x}_0) \neq \emptyset \;\Longleftrightarrow\; \gamma(\cdot) \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0}. \qquad (11)$$
• If the bias function $c(\cdot)$ is valid, the corresponding minimum achievable variance at $\mathbf{x}_0$ is given by
$$M(c(\cdot), \mathbf{x}_0) = \|\gamma(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} - \gamma^2(\mathbf{x}_0), \qquad (12)$$
and the LMV estimator at $\mathbf{x}_0$ is given by $\hat{g}^{(\mathbf{x}_0)}(\cdot) = \mathsf{J}[\gamma(\cdot)]$.
This result shows that the RKHS $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ is equal to the set of the mean functions $\gamma(\mathbf{x}) = E_{\mathbf{x}}\{\hat{g}(\mathbf{y})\}$ of all estimators $\hat{g}(\cdot)$ with a finite variance at $\mathbf{x}_0$, i.e., $v(\hat{g}(\cdot); \mathbf{x}_0) < \infty$. Furthermore, the problem of solving the MVP (3) can be reduced to the computation of the squared norm $\|\gamma(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}$ and the isometric image $\mathsf{J}[\gamma(\cdot)]$ of the prescribed mean function $\gamma(\cdot)$, viewed as an element of the RKHS $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$. This is especially helpful if a simple characterization of $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ is available. Here, following the terminology of [3], what is meant by "simple characterization" is the availability of an orthonormal basis (ONB) for $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ such that the inner products of $\gamma(\cdot)$ with the ONB functions can be computed easily.

If such an ONB of $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ cannot be found, the relation (12) can still be used to derive lower bounds on the minimum achievable variance $M(c(\cdot), \mathbf{x}_0)$. Indeed, because of (12), any lower bound on $\|\gamma(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}$ induces a lower bound on $M(c(\cdot), \mathbf{x}_0)$. A large class of lower bounds on $\|\gamma(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}$ can be obtained via projections of $\gamma(\cdot)$ onto a subspace $\mathcal{U} \subseteq \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$. Denoting the orthogonal projection of $\gamma(\cdot)$ onto $\mathcal{U}$ by $\gamma_{\mathcal{U}}(\cdot)$, we have $\|\gamma_{\mathcal{U}}(\cdot)\|_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} \leq \|\gamma(\cdot)\|_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}$ [22, Chapter 4] and thus, from (12),
$$M(c(\cdot), \mathbf{x}_0) \geq \|\gamma_{\mathcal{U}}(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} - \gamma^2(\mathbf{x}_0), \qquad (13)$$
for an arbitrary subspace $\mathcal{U} \subseteq \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$. In particular, let us consider the special case of a finite-dimensional subspace $\mathcal{U} \subseteq \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ that is spanned by a given set of functions $u_l(\cdot) \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$, i.e.,
$$\mathcal{U} = \mathrm{span}\{u_l(\cdot)\}_{l \in [L]} \triangleq \bigg\{ f(\cdot) = \sum_{l \in [L]} a_l u_l(\cdot) \,\bigg|\, a_l \in \mathbb{R} \bigg\}. \qquad (14)$$
Here, $\|\gamma_{\mathcal{U}}(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}$ can be evaluated very easily due to the following expression [10, Theorem 3.1.8]:
$$\|\gamma_{\mathcal{U}}(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} = \boldsymbol{\gamma}^T \mathbf{G}^{\dagger} \boldsymbol{\gamma}, \qquad (15)$$
where the vector $\boldsymbol{\gamma} \in \mathbb{R}^L$ and the matrix $\mathbf{G} \in \mathbb{R}^{L \times L}$ are given elementwise by
$$\gamma_l = \langle \gamma(\cdot), u_l(\cdot) \rangle_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}, \qquad G_{l,l'} = \langle u_l(\cdot), u_{l'}(\cdot) \rangle_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}. \qquad (16)$$
If all $u_l(\cdot)$ are linearly independent, then a larger number $L$ of basis functions $u_l(\cdot)$ entails a higher dimension of $\mathcal{U}$ and, thus, a larger $\|\gamma_{\mathcal{U}}(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}$; this implies that the lower bound (13) will be higher (i.e., tighter). In Section III, we will show that some well-known lower bounds on the estimator variance are obtained from (13) and (15), using a subspace $\mathcal{U}$ of the form (14) and specific choices for the functions $u_l(\cdot)$ spanning $\mathcal{U}$.
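The following sketch evaluates the projection bound (13) via the finite-dimensional expression (15) for the Gaussian location model of the previous sketch, with the illustrative choice $u_l(\cdot) = R_{\mathcal{E},x_0}(\cdot, x_l)$ for a handful of points $x_l$; the entries (16) then follow directly from the reproducing property (5). The choice of points and of the (unbiased) prescribed mean function are assumptions made only for this example.

```python
import numpy as np

# Projection bound (13)/(15) for the Gaussian location model (x0 = 0, sigma = 1),
# with the illustrative choice u_l(.) = R_{E,x0}(., x_l).
sigma, x0 = 1.0, 0.0
R = lambda xa, xb: np.exp((xa - x0) * (xb - x0) / sigma**2)

gamma = lambda x: x                      # prescribed mean (unbiased estimation of g(x) = x)
xl = np.array([-0.1, 0.05, 0.2, 0.8])    # points defining the subspace U

# (16): gamma_l = <gamma, R(., x_l)> = gamma(x_l) by the reproducing property (5),
#       G_{l,l'} = <R(., x_l), R(., x_l')> = R(x_l, x_l').
g_vec = gamma(xl)
G = R(xl[:, None], xl[None, :])

bound = g_vec @ np.linalg.pinv(G) @ g_vec - gamma(x0) ** 2   # (15) used in (13)
print("projection lower bound on the variance:", bound)      # <= sigma^2 = 1 here
```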
3) Regular Estimation Problems and Differentiable RKHS:
Some of the lower bounds to be considered in Section III require the estimation problem to satisfy certain regularity conditions.
Definition II.1.
An estimation problem E = (cid:0) X , f ( y ; x ) , g ( · ) (cid:1) satisfying (9) is said to be regular up to order m ∈ N at an interior point x ∈ X o if the following holds: • For every multi-index p ∈ Z N + with entries p k ≤ m , the partial derivatives ∂ p f ( y ; x ) ∂ x p exist and satisfy E x (cid:26)(cid:18) f ( y ; x ) ∂ p f ( y ; x ) ∂ x p (cid:19) (cid:27) < ∞ , for all x ∈ B ( x , r ) , (17) where r > is a suitably chosen radius such that B ( x , r ) ⊆ X . • For any function h ( · ) : R M → R such that E x { h ( y ) } exists, the expectation operation commutes withpartial differentiation in the sense that, for every multi-index p ∈ Z N + with p k ≤ m , ∂ p ∂ x p Z R M h ( y ) f ( y ; x ) d y = Z R M h ( y ) ∂ p f ( y ; x ) ∂ x p d y , for all x ∈ B ( x , r ) , (18) or equivalently ∂ p E x { h ( y ) } ∂ x p = E x (cid:26) h ( y ) 1 f ( y ; x ) ∂ p f ( y ; x ) ∂ x p (cid:27) , for all x ∈ B ( x , r ) , (19) provided that the right hand side of (18) and (19) is finite. • For every pair of multi-indices p , p ∈ Z N + with p ,k ≤ m and p ,k ≤ m , the expectation E x (cid:26) f ( y ; x ) ∂ p f ( y ; x ) ∂ x p ∂ p f ( y ; x ) ∂ x p (cid:27) (20) depends continuously on the parameter vectors x , x ∈ B ( x , r ) . We remark that the notion of a regular estimation problem according to Definition II.1 is somewhat similarto the notion of a regular statistical experiment introduced in [17, Section I.7].As shown in [10, Thm. 4.4.3.], the RKHS associated with a regular estimation problem has an importantstructural property, which we will term differentiable . More precisely, we call an RKHS H ( R ) differentiable up to order m if it is associated with a kernel R ( · , · ) : X × X → R that is differentiable up to a given order m . The properties of differentiable RKHSs have been previously studied, e.g., in [26]–[28].It will be seen that, under certain conditions, the functions belonging to an RKHS H ( R ) that is differentiableare characterized completely by their partial derivatives at any point x ∈ X o . This implies via (11) togetherwith identity (22) below that, for a regular estimation problem, the mean function γ ( x ) = E x { ˆ g ( y ) } of anyestimator ˆ g ( · ) with finite variance at x is completely specified by the partial derivatives (cid:8) ∂ p γ ( x ) ∂ x p (cid:12)(cid:12) x = x (cid:9) p ∈ Z N + (cf. Lemma V.3 in Section V-D).Further important properties of a differentiable RKHS have been reported in [9], [27]. In particular, for anRKHS H ( R ) that is differentiable up to order m , and for any x ∈ X o and any p ∈ Z N + with p k ≤ m , thefollowing holds: • The function r ( p ) x ( · ) : X → R defined by r ( p ) x ( x ) , ∂ p R ( x , x ) ∂ x p (cid:12)(cid:12)(cid:12)(cid:12) x = x (21)is an element of H ( R ) , i.e., r ( p ) x ( · ) ∈ H ( R ) . • For any function f ( · ) ∈ H ( R ) , the partial derivative ∂ p f ( x ) ∂ x p (cid:12)(cid:12) x = x exists. • The inner product of r ( p ) x ( · ) with an arbitrary function f ( · ) ∈ H ( R ) is given by (cid:10) r ( p ) x ( · ) , f ( · ) (cid:11) H ( R ) = ∂ p f ( x ) ∂ x p (cid:12)(cid:12)(cid:12)(cid:12) x = x . (22)Thus, an RKHS H ( R ) that is differentiable up to order m contains the functions (cid:8) r ( p ) x ( x ) (cid:9) p k ≤ m , and theinner products of any function f ( · ) ∈ H ( R ) with the r ( p ) x ( x ) can be computed easily via differentiation of f ( · ) .This makes function sets (cid:8) r ( p ) x ( x ) (cid:9) appear as interesting candidates for a simple characterization of the RKHS H ( R ) . 
However, in general, these function sets are not guaranteed to be complete or orthonormal, i.e., they do not constitute an ONB. An important exception is constituted by certain estimation problems $\mathcal{E}$ involving an exponential family of distributions, which will be studied in Section V.

Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ that is regular up to order $m \in \mathbb{N}$ at $\mathbf{x}_0 \in \mathcal{X}^o$. According to (11), the mean function $\gamma(\cdot)$ of any estimator with finite variance at $\mathbf{x}_0$ belongs to the RKHS $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$. Since $\mathcal{E}$ is assumed regular up to order $m$, $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ is differentiable up to order $m$. This, in turn, implies via (11) and (22) that the partial derivatives of $\gamma(\cdot)$ at $\mathbf{x}_0$ exist up to order $m$. Therefore, for the derivation of lower bounds on the minimum achievable variance at $\mathbf{x}_0$ in the case of an estimation problem that is regular up to order $m$ at $\mathbf{x}_0$, we can always tacitly assume that the partial derivatives of $\gamma(\cdot)$ at $\mathbf{x}_0$ exist up to order $m$; otherwise the corresponding bias function $c(\cdot) = \gamma(\cdot) - g(\cdot)$ cannot be valid, i.e., there would not exist any estimator with mean function $\gamma(\cdot)$ (or, equivalently, bias function $c(\cdot)$) and finite variance at $\mathbf{x}_0$. (Indeed, it follows from (11) that the mean function $\gamma(\cdot)$ belongs to the RKHS $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$; therefore, by (22), the partial derivatives of $\gamma(\cdot)$ at $\mathbf{x}_0$ coincide with well-defined inner products of functions in $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$.)
III. RKHS FORMULATION OF KNOWN VARIANCE BOUNDS
Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ and an estimator $\hat{g}(\cdot)$ with mean function $\gamma(\mathbf{x}) = E_{\mathbf{x}}\{\hat{g}(\mathbf{y})\}$ and bias function $c(\mathbf{x}) = \gamma(\mathbf{x}) - g(\mathbf{x})$. We assume that $\hat{g}(\cdot)$ has a finite variance at $\mathbf{x}_0$, which implies that the bias function $c(\cdot)$ is valid and $\hat{g}(\cdot)$ is an element of $\mathcal{A}(c(\cdot), \mathbf{x}_0)$, the set of allowed estimators at $\mathbf{x}_0$ for prescribed bias function $c(\cdot)$, which therefore is nonempty. Then, $\gamma(\cdot) \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ according to (11). We also recall from our discussion further above that if the estimation problem $\mathcal{E}$ is regular at $\mathbf{x}_0$ up to order $m$, then the partial derivatives $\frac{\partial^{\mathbf{p}} \gamma(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}}\big|_{\mathbf{x}=\mathbf{x}_0}$ exist for all $\mathbf{p} \in \mathbb{Z}_+^N$ with $p_k \leq m$.

In this section, we will demonstrate how five known lower bounds on the variance—Barankin bound, Cramér–Rao bound, constrained Cramér–Rao bound, Bhattacharyya bound, and Hammersley–Chapman–Robbins bound—can be formulated in a unified manner within the RKHS framework. More specifically, by combining (4) with (13), it follows that the variance of $\hat{g}(\cdot)$ at $\mathbf{x}_0$ is lower bounded as
$$v(\hat{g}(\cdot); \mathbf{x}_0) \geq \|\gamma_{\mathcal{U}}(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} - \gamma^2(\mathbf{x}_0), \qquad (23)$$
where $\mathcal{U}$ is any subspace of $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$. The five variance bounds to be considered are obtained via specific choices of $\mathcal{U}$.

A. Barankin Bound
For a (valid) prescribed bias function $c(\cdot)$, the Barankin bound [23], [29] is the minimum achievable variance at $\mathbf{x}_0$, i.e., the variance of the LMV estimator at $\mathbf{x}_0$, which we denoted $M(c(\cdot), \mathbf{x}_0)$. This is the tightest lower bound on the variance, cf. (4). Using the RKHS expression of $M(c(\cdot), \mathbf{x}_0)$ in (12), the Barankin bound can be written as
$$v(\hat{g}(\cdot); \mathbf{x}_0) \geq M(c(\cdot), \mathbf{x}_0) = \|\gamma(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}} - \gamma^2(\mathbf{x}_0), \qquad (24)$$
with $\gamma(\cdot) = c(\cdot) + g(\cdot)$, for any estimator $\hat{g}(\cdot)$ with bias function $c(\cdot)$. Comparing with (23), we see that the Barankin bound is obtained for the special choice $\mathcal{U} = \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$, in which case $\gamma_{\mathcal{U}}(\cdot) = \gamma(\cdot)$ and (23) reduces to (24).

In the literature [23], [29], the following special expression of the Barankin bound is usually considered. Let $\mathcal{D} \triangleq \{\mathbf{x}_1, \ldots, \mathbf{x}_L\} \subseteq \mathcal{X}$ be a subset of $\mathcal{X}$, with finite size $L = |\mathcal{D}| \in \mathbb{N}$ and elements $\mathbf{x}_l \in \mathcal{X}$, and let $\mathbf{a} \triangleq (a_1 \cdots a_L)^T$ with $a_l \in \mathbb{R}$. Then the Barankin bound can be written as [23, Theorem 4]
$$v(\hat{g}(\cdot); \mathbf{x}_0) \geq M(c(\cdot), \mathbf{x}_0) = \sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}_{\mathcal{D}}} \frac{\big(\sum_{l \in [L]} a_l [\gamma(\mathbf{x}_l) - \gamma(\mathbf{x}_0)]\big)^2}{E_{\mathbf{x}_0}\big\{\big(\sum_{l \in [L]} a_l\, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_l)\big)^2\big\}}, \qquad (25)$$
where $\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_l)$ is the likelihood ratio as defined in (7) and $\mathcal{A}_{\mathcal{D}}$ is defined as the set of all $\mathbf{a} \in \mathbb{R}^L$ for which the denominator $E_{\mathbf{x}_0}\big\{\big(\sum_{l \in [L]} a_l\, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_l)\big)^2\big\}$ does not vanish. Note that our notation $\sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}_{\mathcal{D}}}$ is intended to indicate that the supremum is taken not only with respect to the elements $\mathbf{x}_l$ of $\mathcal{D}$ but also with respect to the size of $\mathcal{D}$ (number of elements), $L$. We will now verify that the bound in (25) can be obtained from our RKHS expression in (24). We will use the following result that we reported in [10, Theorem 3.1.2].

Lemma III.1.
Consider an RKHS $\mathcal{H}(R)$ with kernel $R(\cdot,\cdot): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Let $\mathcal{D} \triangleq \{\mathbf{x}_1, \ldots, \mathbf{x}_L\} \subseteq \mathcal{X}$ with some $L = |\mathcal{D}| \in \mathbb{N}$ and $\mathbf{x}_l \in \mathcal{X}$, and let $\mathbf{a} \triangleq (a_1 \cdots a_L)^T$ with $a_l \in \mathbb{R}$. Then the norm $\|f(\cdot)\|_{\mathcal{H}(R)}$ of any function $f(\cdot) \in \mathcal{H}(R)$ can be expressed as
$$\|f(\cdot)\|_{\mathcal{H}(R)} = \sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}'_{\mathcal{D}}} \frac{\sum_{l \in [L]} a_l f(\mathbf{x}_l)}{\sqrt{\sum_{l,l' \in [L]} a_l a_{l'} R(\mathbf{x}_l, \mathbf{x}_{l'})}}, \qquad (26)$$
where $\mathcal{A}'_{\mathcal{D}}$ is the set of all $\mathbf{a} \in \mathbb{R}^L$ for which $\sum_{l,l' \in [L]} a_l a_{l'} R(\mathbf{x}_l, \mathbf{x}_{l'})$ does not vanish.

We will furthermore use the fact—shown in [10, Section 2.3.5]—that the minimum achievable variance at $\mathbf{x}_0$, $M(c(\cdot), \mathbf{x}_0)$ (i.e., the Barankin bound), remains unchanged when the prescribed mean function $\gamma(\mathbf{x})$ is replaced by $\tilde{\gamma}(\mathbf{x}) \triangleq \gamma(\mathbf{x}) + c_0$ with an arbitrary constant $c_0$. Setting in particular $c_0 = -\gamma(\mathbf{x}_0)$, we have $\tilde{\gamma}(\mathbf{x}) = \gamma(\mathbf{x}) - \gamma(\mathbf{x}_0)$ and $\tilde{\gamma}(\mathbf{x}_0) = 0$, and thus (24) simplifies to
$$v(\hat{g}(\cdot); \mathbf{x}_0) \geq M(c(\cdot), \mathbf{x}_0) = \|\tilde{\gamma}(\cdot)\|^2_{\mathcal{H}_{\mathcal{E},\mathbf{x}_0}}. \qquad (27)$$
Using (26) in (27), we obtain
$$M(c(\cdot), \mathbf{x}_0) = \sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}'_{\mathcal{D}}} \frac{\big(\sum_{l \in [L]} a_l \tilde{\gamma}(\mathbf{x}_l)\big)^2}{\sum_{l,l' \in [L]} a_l a_{l'} R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_l, \mathbf{x}_{l'})} = \sup_{\mathcal{D} \subseteq \mathcal{X},\, L \in \mathbb{N},\, \mathbf{a} \in \mathcal{A}'_{\mathcal{D}}} \frac{\big(\sum_{l \in [L]} a_l [\gamma(\mathbf{x}_l) - \gamma(\mathbf{x}_0)]\big)^2}{\sum_{l,l' \in [L]} a_l a_{l'} R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_l, \mathbf{x}_{l'})}. \qquad (28)$$
From (10) and (8), we have $R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_1, \mathbf{x}_2) = E_{\mathbf{x}_0}\{\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_1)\, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_2)\}$, and thus the denominator in (28) becomes
$$\sum_{l,l' \in [L]} a_l a_{l'} R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_l, \mathbf{x}_{l'}) = E_{\mathbf{x}_0}\bigg\{\sum_{l,l' \in [L]} a_l a_{l'}\, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_l)\, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_{l'})\bigg\} = E_{\mathbf{x}_0}\bigg\{\bigg(\sum_{l \in [L]} a_l\, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_l)\bigg)^2\bigg\},$$
whence it also follows that $\mathcal{A}'_{\mathcal{D}} = \mathcal{A}_{\mathcal{D}}$. Therefore, (28) is equivalent to (25). Hence, we have shown that our RKHS expression (24) is equivalent to (25).
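For a fixed finite set of test points $\mathcal{D}$ and a nonsingular kernel matrix, the supremum over $\mathbf{a}$ in (28) is a generalized Rayleigh quotient and evaluates to the quadratic form $\tilde{\boldsymbol{\gamma}}^T \mathbf{R}^{\dagger} \tilde{\boldsymbol{\gamma}}$ with $(\mathbf{R})_{l,l'} = R_{\mathcal{E},x_0}(x_l, x_{l'})$ and $(\tilde{\boldsymbol{\gamma}})_l = \gamma(x_l) - \gamma(x_0)$. The sketch below evaluates this finite-$\mathcal{D}$ version of the Barankin bound for the assumed Gaussian location model of the earlier sketches and shows how it grows as test points are added.

```python
import numpy as np

# Finite test-point evaluation of (28) for the assumed Gaussian location model
# (x0 = 0, sigma = 1, unbiased estimation of g(x) = x).
sigma, x0 = 1.0, 0.0
R = lambda xa, xb: np.exp((xa - x0) * (xb - x0) / sigma**2)
gamma = lambda x: x                        # prescribed (unbiased) mean function

def barankin_finite(test_points):
    xl = np.asarray(test_points)
    gt = gamma(xl) - gamma(x0)             # gamma_tilde(x_l) = gamma(x_l) - gamma(x0)
    K = R(xl[:, None], xl[None, :])
    return gt @ np.linalg.pinv(K) @ gt     # sup over a of the ratio in (28)

for D in ([0.5], [0.5, -0.3], [0.5, -0.3, 0.1, 1.0, -1.5]):
    print(len(D), "test points:", barankin_finite(D))
# The values increase with |D| and stay below the minimum achievable variance,
# which equals sigma^2 = 1 for this particular model.
```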
B. Cramér–Rao Bound

The Cramér–Rao bound (CRB) [18], [30], [31] is the most popular lower variance bound. Since the CRB applies to any estimator with a prescribed bias function $c(\cdot)$, it also yields a lower bound on the minimum achievable variance $M(c(\cdot), \mathbf{x}_0)$ (cf. (4)).

Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ that is regular up to order 1 at $\mathbf{x}_0 \in \mathcal{X}^o$ in the sense of Definition II.1. Let $\hat{g}(\cdot)$ denote an estimator with mean function $\gamma(\mathbf{x}) = E_{\mathbf{x}}\{\hat{g}(\mathbf{y})\}$ and finite variance at $\mathbf{x}_0$ (i.e., $v(\hat{g}(\cdot); \mathbf{x}_0) < \infty$). Then, this variance is lower bounded by the CRB
$$v(\hat{g}(\cdot); \mathbf{x}_0) \geq \mathbf{b}^T(\mathbf{x}_0)\, \mathbf{J}^{\dagger}(\mathbf{x}_0)\, \mathbf{b}(\mathbf{x}_0), \qquad (29)$$
where $\mathbf{b}(\mathbf{x}_0) \triangleq \frac{\partial \gamma(\mathbf{x})}{\partial \mathbf{x}}\big|_{\mathbf{x}_0}$ and $\mathbf{J}(\mathbf{x}_0) \in \mathbb{R}^{N \times N}$, known as the Fisher information matrix associated with $\mathcal{E}$, is given elementwise by
$$\big(\mathbf{J}(\mathbf{x}_0)\big)_{k,l} \triangleq E_{\mathbf{x}_0}\bigg\{\frac{\partial \log f(\mathbf{y};\mathbf{x})}{\partial x_k}\,\frac{\partial \log f(\mathbf{y};\mathbf{x})}{\partial x_l}\bigg|_{\mathbf{x}=\mathbf{x}_0}\bigg\}. \qquad (30)$$
Since the estimation problem $\mathcal{E}$ is assumed regular up to order 1 at $\mathbf{x}_0$, the associated RKHS $\mathcal{H}_{\mathcal{E},\mathbf{x}_0}$ is differentiable up to order 1. This differentiability is used in the proof of the following result [10, Section 4.4.2].

Theorem III.2.
Consider an estimation problem that is regular up to order 1 in the sense of Definition II.1. Then, for a reference parameter vector $\mathbf{x}_0 \in \mathcal{X}^o$, the CRB in (29) is obtained from (23) by using the subspace $\mathcal{U}_{\mathrm{CR}} \triangleq \mathrm{span}\{\{u_0(\cdot)\} \cup \{u_l(\cdot)\}_{l \in [N]}\}$, with the functions
$$u_0(\cdot) \triangleq R_{\mathcal{E},\mathbf{x}_0}(\cdot, \mathbf{x}_0) \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0}, \qquad u_l(\cdot) \triangleq \frac{\partial R_{\mathcal{E},\mathbf{x}_0}(\cdot, \mathbf{x})}{\partial x_l}\bigg|_{\mathbf{x}=\mathbf{x}_0} \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0}, \quad l \in [N].$$
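As a small sanity check of Theorem III.2, consider again the assumed toy model of $L$ i.i.d. samples $y_i \sim \mathcal{N}(x, \sigma^2)$ with scalar $x$ and unbiased estimation of $g(x) = x$. For this model, (10) evaluates to $R_{\mathcal{E},x_0}(x_1, x_2) = \exp\big(L(x_1 - x_0)(x_2 - x_0)/\sigma^2\big)$, so the Gram matrix (16) of $\mathcal{U}_{\mathrm{CR}}$ can be written down via the reproducing property and (22), and the resulting projection bound coincides with the classical CRB $\sigma^2/L$. The model and its closed-form kernel are assumptions used only for this sketch.

```python
import numpy as np

# Sanity check of Theorem III.2 for L i.i.d. samples y_i ~ N(x, sigma^2)
# (scalar x, unbiased estimation of g(x) = x), with kernel
# R_{E,x0}(x1, x2) = exp(L (x1 - x0)(x2 - x0) / sigma^2).
L, sigma, x0 = 5, 1.0, 0.3

# Classical CRB (29): b(x0) = d gamma / dx = 1, Fisher information J = L / sigma^2.
crb = 1.0 / (L / sigma**2)

# Projection bound (13)/(15) with U_CR = span{ R(., x0), dR(., x)/dx |_{x=x0} }.
# Inner products via the reproducing property (5) and (22):
G = np.array([[1.0, 0.0],                  # <u0,u0> = R(x0,x0) = 1, <u0,u1> = 0
              [0.0, L / sigma**2]])        # <u1,u1> = d^2 R / dx1 dx2 at (x0,x0)
gvec = np.array([x0, 1.0])                 # <gamma,u0> = gamma(x0), <gamma,u1> = gamma'(x0)
proj_bound = gvec @ np.linalg.pinv(G) @ gvec - x0**2

print("CRB             :", crb)            # sigma^2 / L
print("projection bound:", proj_bound)     # identical, as stated by Theorem III.2
```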
C. Constrained Cramér–Rao Bound

The constrained CRB [32]–[34] is an evolution of the CRB in (29) for estimation problems $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ with a parameter set of the form
$$\mathcal{X} = \{\mathbf{x} \in \mathbb{R}^N \,|\, \mathbf{f}(\mathbf{x}) = \mathbf{0}\}, \qquad (31)$$
where $\mathbf{f}(\cdot): \mathbb{R}^N \to \mathbb{R}^Q$ with $Q \leq N$ is a continuously differentiable function. We assume that the set $\mathcal{X}$ has a nonempty interior. Moreover, we require the Jacobian matrix $\mathbf{F}(\mathbf{x}) \triangleq \frac{\partial \mathbf{f}(\mathbf{x})}{\partial \mathbf{x}} \in \mathbb{R}^{Q \times N}$ to have rank $Q$ whenever $\mathbf{f}(\mathbf{x}) = \mathbf{0}$, i.e., for every $\mathbf{x} \in \mathcal{X}$. This full-rank requirement implies that the constraints represented by $\mathbf{f}(\mathbf{x}) = \mathbf{0}$ are nonredundant [33]. Such parameter sets are considered, e.g., in [32]–[34]. Under these conditions, the implicit function theorem [34, Theorem 3.3], [13, Theorem 9.28] states that for any $\mathbf{x}_0 \in \mathcal{X}$, with $\mathcal{X}$ given by (31), there exists a continuously differentiable map $\mathbf{r}(\cdot)$ from an open set $\mathcal{O} \subseteq \mathbb{R}^{N-Q}$ into a set $\mathcal{P} \subseteq \mathcal{X}$ containing $\mathbf{x}_0$, i.e.,
$$\mathbf{r}(\cdot): \mathcal{O} \subseteq \mathbb{R}^{N-Q} \to \mathcal{P} \subseteq \mathcal{X}, \quad \text{with } \mathbf{x}_0 \in \mathcal{P}. \qquad (32)$$
The constrained CRB in the form presented in [33] reads
$$v(\hat{g}(\cdot); \mathbf{x}_0) \geq \mathbf{b}^T(\mathbf{x}_0)\, \mathbf{U}(\mathbf{x}_0)\big(\mathbf{U}^T(\mathbf{x}_0)\, \mathbf{J}(\mathbf{x}_0)\, \mathbf{U}(\mathbf{x}_0)\big)^{\dagger} \mathbf{U}^T(\mathbf{x}_0)\, \mathbf{b}(\mathbf{x}_0), \qquad (33)$$
where $\mathbf{b}(\mathbf{x}_0) = \frac{\partial \gamma(\mathbf{x})}{\partial \mathbf{x}}\big|_{\mathbf{x}_0}$, $\mathbf{J}(\mathbf{x}_0)$ is again the Fisher information matrix defined in (30), and $\mathbf{U}(\mathbf{x}_0) \in \mathbb{R}^{N \times (N-Q)}$ is any matrix whose columns form an ONB for the null space of the Jacobian matrix $\mathbf{F}(\mathbf{x}_0)$, i.e., $\mathbf{F}(\mathbf{x}_0)\,\mathbf{U}(\mathbf{x}_0) = \mathbf{0}$ and $\mathbf{U}^T(\mathbf{x}_0)\,\mathbf{U}(\mathbf{x}_0) = \mathbf{I}_{N-Q}$. The next result is proved in [10, Section 4.4.2].
Theorem III.3.
Consider an estimation problem that is regular up to order 1 in the sense of Definition
II.1 .Then, for a reference parameter vector x ∈ X o , the constrained CRB in (33) is obtained from (23) by usingthe subspace U CCR , span (cid:8) { u ( · ) } ∪ { u l ( · ) } l ∈ [ N − Q ] (cid:9) , with the functions u ( · ) , R E , x ( · , x ) ∈ H E , x , u l ( · ) , ∂R E , x ( · , r ( θ )) ∂θ l (cid:12)(cid:12)(cid:12)(cid:12) θ = r − ( x ) ∈ H E , x , l ∈ [ N − Q ] , where r ( · ) is any continuously differentiable function of the form (32) .D. Bhattacharyya Bound Whereas the CRB depends only on the first-order partial derivatives of f ( y ; x ) with respect to x , the Bhat-tacharyya bound [35], [36] involves also higher-order derivatives. For an estimation problem E = (cid:0) X , f ( y ; x ) , g ( · ) (cid:1) that is regular at x ∈ X o up to order m ∈ N , the Bhattacharyya bound states that v (ˆ g ( · ); x ) ≥ a T ( x ) B † ( x ) a ( x ) , (34)where the vector a ( x ) ∈ R L and the matrix B ( x ) ∈ R L × L are given elementwise by (cid:0) a ( x ) (cid:1) l , ∂ p l γ ( x ) ∂ x p l (cid:12)(cid:12) x and (cid:0) B ( x ) (cid:1) l,l ′ , E x (cid:26) f ( y ; x ) ∂ p l f ( y ; x ) ∂ x p l ∂ p l ′ f ( y ; x ) ∂ x p l ′ (cid:12)(cid:12)(cid:12)(cid:12) x = x (cid:27) , respectively. Here, the p l , l ∈ [ L ] are L distinct multi-indices with ( p l ) k ≤ m .The following result is proved in [10, Section 4.4.3]. Theorem III.4.
Consider an estimation problem that is regular up to order $m$ in the sense of Definition II.1. Then, for a reference parameter vector $\mathbf{x}_0 \in \mathcal{X}^o$, the Bhattacharyya bound in (34) is obtained from (23) by using the subspace $\mathcal{U}_{\mathrm{B}} \triangleq \mathrm{span}\{\{u_0(\cdot)\} \cup \{u_l(\cdot)\}_{l \in [L]}\}$, with the functions
$$u_0(\cdot) \triangleq R_{\mathcal{E},\mathbf{x}_0}(\cdot, \mathbf{x}_0) \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0}, \qquad u_l(\cdot) \triangleq \frac{\partial^{\mathbf{p}_l} R_{\mathcal{E},\mathbf{x}_0}(\cdot, \mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}_l}}\bigg|_{\mathbf{x}=\mathbf{x}_0} \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0}, \quad l \in [L]. \qquad (35)$$
While the RKHS interpretation of the Bhattacharyya bound has been presented previously in [3] for a specific estimation problem, the above result holds for general estimation problems. We note that the bound tends to become higher (tighter) if $L$ is increased in the sense that additional functions $u_l(\cdot)$ are used (i.e., in addition to the functions already used). Finally, we note that the CRB subspace $\mathcal{U}_{\mathrm{CR}}$ in Theorem III.2 is obtained as a special case of the Bhattacharyya bound subspace $\mathcal{U}_{\mathrm{B}}$ by setting $L = N$, $m = 1$, and $\mathbf{p}_l = \mathbf{e}_l$ in (35).
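A numerical illustration of the Bhattacharyya bound (34) under an assumed toy setting (not an example from the paper): for a single sample $y \sim \mathcal{N}(x, \sigma^2)$ and the prescribed mean $\gamma(x) = x^2$, the second-order bound is strictly tighter than its first-order (CRB) part, since it also captures the curvature of $\gamma(\cdot)$. The score expressions in the comments are the standard Gaussian ones; the entries of $\mathbf{B}(\mathbf{x}_0)$ are estimated by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)

# Bhattacharyya bound (34) for a single sample y ~ N(x, sigma^2) and the
# prescribed mean gamma(x) = x^2 (assumed illustrative setting).
sigma, x0, n = 1.0, 0.8, 2_000_000
y = rng.normal(x0, sigma, size=n)

# (1/f) d^p f / dx^p at x = x0 for p = 1, 2 (standard Gaussian expressions):
d1 = (y - x0) / sigma**2
d2 = ((y - x0)**2 - sigma**2) / sigma**4

a = np.array([2 * x0, 2.0])                   # [gamma'(x0), gamma''(x0)]
B = np.array([[np.mean(d1 * d1), np.mean(d1 * d2)],
              [np.mean(d2 * d1), np.mean(d2 * d2)]])

bhat = a @ np.linalg.pinv(B) @ a
crb = a[0]**2 / B[0, 0]                       # first-order (CRB) part only
print("CRB           :", crb)                 # approx. 4 x0^2 sigma^2
print("Bhattacharyya :", bhat)                # approx. 4 x0^2 sigma^2 + 2 sigma^4 (tighter)
```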
E. Hammersley–Chapman–Robbins Bound

A drawback of the CRB and the Bhattacharyya bound is that they exploit only the local structure of an estimation problem $\mathcal{E}$ around a specific point $\mathbf{x}_0 \in \mathcal{X}^o$ [35]. As an illustrative example, consider two different estimation problems $\mathcal{E}_1 = (\mathcal{X}_1, f(\mathbf{y};\mathbf{x}), g(\cdot))$ and $\mathcal{E}_2 = (\mathcal{X}_2, f(\mathbf{y};\mathbf{x}), g(\cdot))$ with the same statistical model $f(\mathbf{y};\mathbf{x})$ and parameter function $g(\cdot)$ but different parameter sets $\mathcal{X}_1$ and $\mathcal{X}_2$. These parameter sets are assumed to be open balls centered at $\mathbf{x}_0$ with different radii $r_1$ and $r_2$, i.e., $\mathcal{X}_1 = \mathcal{B}(\mathbf{x}_0, r_1)$ and $\mathcal{X}_2 = \mathcal{B}(\mathbf{x}_0, r_2)$ with $r_1 \neq r_2$. Then the CRB at $\mathbf{x}_0$ for both estimation problems will be identical, irrespective of the values of $r_1$ and $r_2$, and similarly for the Bhattacharyya bound. Thus, these bounds do not take into account a part of the information contained in the parameter set $\mathcal{X}$. The Barankin bound, on the other hand, exploits the full information carried by the parameter set $\mathcal{X}$ since it is the tightest possible lower bound on the estimator variance. However, the Barankin bound is difficult to evaluate in general.

The Hammersley–Chapman–Robbins bound (HCRB) [37]–[39] is a lower bound on the estimator variance that takes into account the global structure of the estimation problem associated with the entire parameter set $\mathcal{X}$. It can be evaluated much more easily than the Barankin bound, and it does not require the estimation problem to be regular. Based on a suitably chosen set of "test points" $\{\mathbf{x}_1, \ldots, \mathbf{x}_L\} \subseteq \mathcal{X}$, the HCRB states that [37]
$$v(\hat{g}(\cdot); \mathbf{x}_0) \geq \mathbf{m}^T(\mathbf{x}_0)\, \mathbf{V}^{\dagger}(\mathbf{x}_0)\, \mathbf{m}(\mathbf{x}_0), \qquad (36)$$
where the vector $\mathbf{m}(\mathbf{x}_0) \in \mathbb{R}^L$ and the matrix $\mathbf{V}(\mathbf{x}_0) \in \mathbb{R}^{L \times L}$ are given elementwise by $\big(\mathbf{m}(\mathbf{x}_0)\big)_l \triangleq \gamma(\mathbf{x}_l) - \gamma(\mathbf{x}_0)$ and
$$\big(\mathbf{V}(\mathbf{x}_0)\big)_{l,l'} \triangleq E_{\mathbf{x}_0}\bigg\{\frac{[f(\mathbf{y};\mathbf{x}_l) - f(\mathbf{y};\mathbf{x}_0)]\,[f(\mathbf{y};\mathbf{x}_{l'}) - f(\mathbf{y};\mathbf{x}_0)]}{f^2(\mathbf{y};\mathbf{x}_0)}\bigg\},$$
respectively. The following result is proved in [10, Section 4.4.4].

Theorem III.5.
The HCRB in (36), with test points $\{\mathbf{x}_l\}_{l \in [L]} \subseteq \mathcal{X}$, is obtained from (23) by using the subspace $\mathcal{U}_{\mathrm{HCR}} \triangleq \mathrm{span}\{\{u_0(\cdot)\} \cup \{u_l(\cdot)\}_{l \in [L]}\}$, with the functions
$$u_0(\cdot) \triangleq R_{\mathcal{E},\mathbf{x}_0}(\cdot, \mathbf{x}_0) \in \mathcal{H}_{\mathcal{E},\mathbf{x}_0}, \qquad u_l(\cdot) \triangleq R_{\mathcal{E},\mathbf{x}_0}(\cdot, \mathbf{x}_l) - R_{\mathcal{E},\mathbf{x}_0}(\cdot, \mathbf{x}_0), \quad l \in [L].$$
The HCRB tends to become higher (tighter) if $L$ is increased in the sense that test points $\mathbf{x}_l$ or, equivalently, functions $u_l(\cdot)$ are added to those already used.
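For the assumed Gaussian location model used in the earlier sketches, the HCRB (36) can be evaluated in closed form, since $\big(\mathbf{V}(x_0)\big)_{l,l'} = R_{\mathcal{E},x_0}(x_l, x_{l'}) - 1$ (the likelihood ratio has unit mean under $x_0$). The sketch below shows the bound tightening as test points are added and approaching the CRB as they approach $x_0$.

```python
import numpy as np

# HCRB (36) for the assumed Gaussian location model y ~ N(x, sigma^2),
# unbiased estimation of g(x) = x.  Here (V)_{l,l'} = R_{E,x0}(x_l, x_l') - 1
# with R_{E,x0}(x1, x2) = exp((x1 - x0)(x2 - x0) / sigma^2), since E_x0{rho} = 1.
sigma, x0 = 1.0, 0.0
R = lambda xa, xb: np.exp((xa - x0) * (xb - x0) / sigma**2)
gamma = lambda x: x

def hcrb(test_points):
    xl = np.asarray(test_points)
    m = gamma(xl) - gamma(x0)                      # (m)_l = gamma(x_l) - gamma(x0)
    V = R(xl[:, None], xl[None, :]) - 1.0
    return m @ np.linalg.pinv(V) @ m

print(hcrb([0.5]))                # single test point
print(hcrb([0.5, -0.3, 0.05]))    # more test points -> higher (tighter) bound
# As the test points approach x0, the HCRB approaches the CRB sigma^2 = 1.
```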
F. Lower Semi-continuity of the Barankin Bound

For a given estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ and a prescribed bias function $c(\cdot)$, it is sometimes of interest to characterize not only the minimum achievable variance $M(c(\cdot), \mathbf{x}_0)$ at a single parameter vector $\mathbf{x}_0 \in \mathcal{X}$ but also how $M(c(\cdot), \mathbf{x}_0)$ changes if $\mathbf{x}_0$ is varied. The following result is proved in Appendix A.

Theorem III.6.
Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ with parameter set $\mathcal{X} \subseteq \mathbb{R}^N$ and a prescribed bias function $c(\cdot): \mathcal{X} \to \mathbb{R}$ that is valid at all $\mathbf{x}_0 \in \mathcal{C}$ for some open set $\mathcal{C} \subseteq \mathcal{X}$ and for which the associated prescribed mean function $\gamma(\cdot) = c(\cdot) + g(\cdot)$ is a continuous function on $\mathcal{C}$. Furthermore assume that, for any fixed $\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X}$, $R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_1, \mathbf{x}_2)$ is continuous with respect to $\mathbf{x}_0$ on $\mathcal{C}$, i.e.,
$$\lim_{\mathbf{x}' \to \mathbf{x}_0} R_{\mathcal{E},\mathbf{x}'}(\mathbf{x}_1, \mathbf{x}_2) = R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_1, \mathbf{x}_2), \qquad \forall\, \mathbf{x}_0 \in \mathcal{C},\; \forall\, \mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X}. \qquad (37)$$
Then, the minimum achievable variance $M(c(\cdot), \mathbf{x}_0)$, viewed as a function of $\mathbf{x}_0$, is lower semi-continuous on $\mathcal{C}$.

[Fig. 1: Graph of a function that is lower semi-continuous at $x_0$. The solid dot indicates the function value $f(x_0)$.]

A schematic illustration of a lower semi-continuous function is given in Fig. 1. The application of Theorem III.6 to the estimation problems considered in [40]—corresponding to the linear/Gaussian model with a sparse parameter vector—allows us to conclude that the "sparse CRB" introduced in [40] cannot be maximally tight, i.e., it is not equal to the minimum achievable variance. Indeed, the sparse CRB derived in [40] is in general a strictly upper semi-continuous function of the parameter vector $\mathbf{x}_0$ (a function is said to be strictly upper semi-continuous if it is upper semi-continuous but not continuous), whereas the minimum achievable variance $M(c(\cdot), \mathbf{x}_0)$ is lower semi-continuous according to Theorem III.6. Since a function cannot be simultaneously strictly upper semi-continuous and lower semi-continuous, the sparse CRB cannot be equal to $M(c(\cdot), \mathbf{x}_0)$.

IV. SUFFICIENT STATISTICS
For some estimation problems $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$, the observation $\mathbf{y} \in \mathbb{R}^M$ contains information that is irrelevant to $\mathcal{E}$, and thus $\mathbf{y}$ can be compressed in some sense. Accordingly, let us replace $\mathbf{y}$ by a transformed observation $\mathbf{z} = \mathbf{t}(\mathbf{y}) \in \mathbb{R}^K$, with a deterministic mapping $\mathbf{t}(\cdot): \mathbb{R}^M \to \mathbb{R}^K$. A compression is achieved if $K < M$. Any transformed observation $\mathbf{z} = \mathbf{t}(\mathbf{y})$ is termed a statistic, and in particular it is said to be a sufficient statistic if it preserves all the information that is relevant to $\mathcal{E}$ [1], [15]–[18], [41]. In particular, a sufficient statistic preserves the minimum achievable variance (Barankin bound) $M(c(\cdot), \mathbf{x}_0)$. In the following, the mapping $\mathbf{t}(\cdot)$ will be assumed to be measurable.

For a given reference parameter vector $\mathbf{x}_0 \in \mathcal{X}$, we consider estimation problems $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ for which there exists a dominating measure $\mu_{\mathcal{E}}$ such that the pdfs $\{f(\mathbf{y};\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$ are well defined with respect to $\mu_{\mathcal{E}}$ and condition (6) is satisfied. The Neyman–Fisher factorization theorem [15]–[18] then states that the statistic $\mathbf{z} = \mathbf{t}(\mathbf{y})$ is sufficient for $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ if and only if $f(\mathbf{y};\mathbf{x})$ can be factored as
$$f(\mathbf{y};\mathbf{x}) = h(\mathbf{t}(\mathbf{y});\mathbf{x})\, k(\mathbf{y}), \qquad (38)$$
where $h(\cdot;\mathbf{x})$ and $k(\cdot)$ are nonnegative functions and the function $k(\cdot)$ does not depend on $\mathbf{x}$. Relation (38) has to be satisfied for every $\mathbf{y} \in \mathbb{R}^M$ except for a set of measure zero with respect to the dominating measure $\mu_{\mathcal{E}}$.

The probability measure on $\mathbb{R}^K$ (equipped with the system of $K$-dimensional Borel sets, cf. [1, Section 10]) that is induced by the random vector $\mathbf{z} = \mathbf{t}(\mathbf{y})$ is obtained as $\mu^{\mathbf{z}}_{\mathbf{x}} = \mu^{\mathbf{y}}_{\mathbf{x}} \circ \mathbf{t}^{-1}$ [16], [17]. According to Section II-B1, under condition (6), the measure $\mu^{\mathbf{y}}_{\mathbf{x}_0}$ dominates the measures $\{\mu^{\mathbf{y}}_{\mathbf{x}}\}_{\mathbf{x} \in \mathcal{X}}$. This, in turn, implies via [16, Lemma 4] that the measure $\mu^{\mathbf{z}}_{\mathbf{x}_0}$ dominates the measures $\{\mu^{\mathbf{z}}_{\mathbf{x}}\}_{\mathbf{x} \in \mathcal{X}}$, and therefore that, for each $\mathbf{x} \in \mathcal{X}$, there exists a pdf $f(\mathbf{z};\mathbf{x})$ with respect to the measure $\mu^{\mathbf{z}}_{\mathbf{x}_0}$. This pdf is given by the following result. (Note that we do not assume condition (9).)

Lemma IV.1.
Consider an estimation problem E = (cid:0) X , f ( y ; x ) , g ( · ) (cid:1) satisfying (6) , i.e., which is such thatthe Radon-Nikodym derivative of µ yx with respect to µ yx is well defined and given by the likelihood ratio ρ E , x ( y , x ) . Furthermore consider a sufficient statistic z = t ( y ) for E . Then, the pdf of z with respect to thedominating measure µ zx is given by f ( z ; x ) = h ( z ; x ) h ( z ; x ) , (39) where the function h ( z ; x ) is obtained from the factorization (38) .Proof : The pdf f ( z ; x ) of z with respect to µ zx is defined by the relation E x (cid:8) I A ( z ) f ( z ; x ) (cid:9) = P x { z ∈ A} , (40)which has to be satisfied for every measurable set A ⊆ R K [1]. Denoting the pre-image of A under the mapping t ( · ) by t − ( A ) , (cid:8) y (cid:12)(cid:12) t ( y ) ∈ A (cid:9) ⊆ R M , we have E x (cid:26) I A ( z ) h ( z ; x ) h ( z ; x ) (cid:27) ( a ) = E x (cid:26) I A ( t ( y )) h ( t ( y ); x ) h ( t ( y ); x ) (cid:27) = E x (cid:26) I t − ( A ) ( y ) h ( t ( y ); x ) h ( t ( y ); x ) (cid:27) (38) , (7) = E x (cid:26) I t − ( A ) ( y ) ρ E , x ( y , x ) (cid:27) ( b ) = P x { y ∈ t − ( A ) } = P x { z ∈ A} , (41)where step ( a ) follows from [1, Theorem 16.12] and ( b ) is due to the fact that the Radon-Nikodym derivativeof µ yx with respect to µ yx is given by ρ E , x ( y , x ) (cf. (7)), as explained in Section II-B1. Comparing (41) with(40), we conclude that h ( z ; x ) h ( z ; x ) = f ( z ; x ) up to differences on a set of measure zero (with respect to µ zx ). Notethat because we require t ( · ) to be a measurable mapping, it is guaranteed that the set t − ( A ) = (cid:8) y (cid:12)(cid:12) t ( y ) ∈ A (cid:9) is measurable for any measurable set A ⊆ R K . (cid:3) Consider next an estimation problem E = (cid:0) X , f ( y ; x ) , g ( · ) (cid:1) satisfying (9), so that the kernel R E , x ( · , · ) exists according to (10). Let z = t ( y ) be a sufficient statistic. We can then define the modified estimationproblem E ′ , (cid:0) X , f ( z ; x ) , g ( · ) (cid:1) , which is based on the observation z and whose statistical model is given bythe pdf f ( z ; x ) (cf. (39)). The following theorem states that the RKHS associated with E ′ equals the RKHSassociated with E . Theorem IV.2.
Consider an estimation problem $\mathcal{E} = (\mathcal{X}, f(\mathbf{y};\mathbf{x}), g(\cdot))$ satisfying (9) and a reference parameter vector $\mathbf{x}_0 \in \mathcal{X}$. For a sufficient statistic $\mathbf{z} = \mathbf{t}(\mathbf{y})$, consider the modified estimation problem $\mathcal{E}' = (\mathcal{X}, f(\mathbf{z};\mathbf{x}), g(\cdot))$. Then, $\mathcal{E}'$ also satisfies (9) and furthermore $R_{\mathcal{E}',\mathbf{x}_0}(\cdot,\cdot) = R_{\mathcal{E},\mathbf{x}_0}(\cdot,\cdot)$ and $\mathcal{H}_{\mathcal{E}',\mathbf{x}_0} = \mathcal{H}_{\mathcal{E},\mathbf{x}_0}$.

Proof: We have
$$R_{\mathcal{E},\mathbf{x}_0}(\mathbf{x}_1, \mathbf{x}_2) \overset{(10)}{=} E_{\mathbf{x}_0}\{\rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_1)\, \rho_{\mathcal{E},\mathbf{x}_0}(\mathbf{y}, \mathbf{x}_2)\} \overset{(38),(7)}{=} E_{\mathbf{x}_0}\bigg\{\frac{h(\mathbf{t}(\mathbf{y});\mathbf{x}_1)\, h(\mathbf{t}(\mathbf{y});\mathbf{x}_2)}{h^2(\mathbf{t}(\mathbf{y});\mathbf{x}_0)}\bigg\} \overset{(a)}{=} E_{\mathbf{x}_0}\bigg\{\frac{h(\mathbf{z};\mathbf{x}_1)\, h(\mathbf{z};\mathbf{x}_2)}{h^2(\mathbf{z};\mathbf{x}_0)}\bigg\} \overset{(39)}{=} E_{\mathbf{x}_0}\bigg\{\frac{f(\mathbf{z};\mathbf{x}_1)\, f(\mathbf{z};\mathbf{x}_2)}{f^2(\mathbf{z};\mathbf{x}_0)}\bigg\} \overset{(10)}{=} R_{\mathcal{E}',\mathbf{x}_0}(\mathbf{x}_1, \mathbf{x}_2), \qquad (42)$$
where, as before, step (a) follows from [1, Theorem 16.12]. From (42), we conclude that if $\mathcal{E}$ satisfies (9) then so does $\mathcal{E}'$. Moreover, from $R_{\mathcal{E}',\mathbf{x}_0}(\cdot,\cdot) = R_{\mathcal{E},\mathbf{x}_0}(\cdot,\cdot)$ in (42), it follows that $\mathcal{H}_{\mathcal{E}',\mathbf{x}_0} = \mathcal{H}(R_{\mathcal{E}',\mathbf{x}_0})$ equals $\mathcal{H}_{\mathcal{E},\mathbf{x}_0} = \mathcal{H}(R_{\mathcal{E},\mathbf{x}_0})$. □

Intuitively, one might expect that the RKHS associated with a sufficient statistic should typically be "smaller" or "simpler" than the RKHS associated with the original observation, since in general the sufficient statistic is a compressed and "more concise" version of the observation. However, Theorem IV.2 states that the RKHS remains unchanged by this compression. One possible interpretation of this fact is that the RKHS description of an estimation problem is already "maximally efficient" in the sense that it cannot be reduced or simplified by using a compressed (yet sufficiently informative) observation.
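Theorem IV.2 can also be checked numerically. The sketch below assumes $L$ i.i.d. samples $y_i \sim \mathcal{N}(x, \sigma^2)$ and the sufficient statistic $z = t(\mathbf{y}) = \bar{y} \sim \mathcal{N}(x, \sigma^2/L)$ (an illustrative model, not one treated at this point in the paper); Monte Carlo estimates of the kernel (10) computed from $\mathbf{y}$ and from $z$ agree with each other and with the closed form $\exp\big(L(x_1 - x_0)(x_2 - x_0)/\sigma^2\big)$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustration of Theorem IV.2: L i.i.d. samples y_i ~ N(x, sigma^2) and the
# sufficient statistic z = mean(y) ~ N(x, sigma^2 / L).  The kernels (10)
# computed from y and from z coincide.
L, sigma, x0 = 4, 1.0, 0.0
x1, x2, n = 0.4, -0.3, 1_000_000

def log_rho_gauss(obs, x, var):
    # log f(obs; x) - log f(obs; x0), summed over the components of obs
    return np.sum(((x - x0) * obs - (x**2 - x0**2) / 2) / var, axis=-1)

y = rng.normal(x0, sigma, size=(n, L))
z = y.mean(axis=1, keepdims=True)

R_from_y = np.mean(np.exp(log_rho_gauss(y, x1, sigma**2) + log_rho_gauss(y, x2, sigma**2)))
R_from_z = np.mean(np.exp(log_rho_gauss(z, x1, sigma**2 / L) + log_rho_gauss(z, x2, sigma**2 / L)))
R_exact = np.exp(L * (x1 - x0) * (x2 - x0) / sigma**2)
print(R_from_y, "approx.", R_from_z, "approx.", R_exact)
```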
V. MVE FOR THE EXPONENTIAL FAMILY
An important class of estimation problems is defined by statistical models belonging to an exponential family. Such models are of considerable interest in the context of MVE because, under mild conditions, the existence of a UMV estimator is guaranteed. Furthermore, any estimation problem that admits the existence of an efficient estimator, i.e., an estimator whose variance achieves the CRB, must necessarily be based on an exponential family [15, Theorem 5.12]. In this section, we will characterize the RKHS for this class and use it to derive lower variance bounds.

A. Review of the Exponential Family
An exponential family is defined as the following parametrized set of pdfs $\{f(\mathbf{y};\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$ (with respect to the Lebesgue measure on $\mathbb{R}^M$) [15], [42], [43]:
$$f(\mathbf{y};\mathbf{x}) = \exp\big(\boldsymbol{\phi}^T(\mathbf{y})\,\mathbf{u}(\mathbf{x}) - A(\mathbf{x})\big)\, h(\mathbf{y}),$$
with the sufficient statistic $\boldsymbol{\phi}(\cdot): \mathbb{R}^M \to \mathbb{R}^P$, the parameter function $\mathbf{u}(\cdot): \mathbb{R}^N \to \mathbb{R}^P$, the cumulant function $A(\cdot): \mathbb{R}^N \to \mathbb{R}$, and the weight function $h(\cdot): \mathbb{R}^M \to \mathbb{R}$. Many well-known statistical models are special instances of an exponential family [43]. Without loss of generality, we can restrict ourselves to an exponential family in canonical form [15], for which $P = N$ and $\mathbf{u}(\mathbf{x}) = \mathbf{x}$, i.e.,
$$f^{(A)}(\mathbf{y};\mathbf{x}) = \exp\big(\boldsymbol{\phi}^T(\mathbf{y})\,\mathbf{x} - A(\mathbf{x})\big)\, h(\mathbf{y}). \qquad (43)$$
Here, the superscript $(A)$ emphasizes the importance of the cumulant function $A(\cdot)$ in the characterization of an exponential family. In what follows, we assume that the parameter space is chosen as $\mathcal{X} \subseteq \mathcal{N}$, where
$\mathcal{N} \subseteq \mathbb{R}^N$ is the natural parameter space defined as
$$\mathcal{N} \triangleq \bigg\{\mathbf{x} \in \mathbb{R}^N \,\bigg|\, \int_{\mathbb{R}^M} \exp\big(\boldsymbol{\phi}^T(\mathbf{y})\,\mathbf{x}\big)\, h(\mathbf{y})\, d\mathbf{y} < \infty\bigg\}.$$
From the normalization constraint $\int_{\mathbb{R}^M} f^{(A)}(\mathbf{y};\mathbf{x})\, d\mathbf{y} = 1$, it follows that the cumulant function $A(\cdot)$ is determined by the sufficient statistic $\boldsymbol{\phi}(\cdot)$ and the weight function $h(\cdot)$ as
$$A(\mathbf{x}) = \log\bigg(\int_{\mathbb{R}^M} \exp\big(\boldsymbol{\phi}^T(\mathbf{y})\,\mathbf{x}\big)\, h(\mathbf{y})\, d\mathbf{y}\bigg), \qquad \mathbf{x} \in \mathcal{N}.$$
The moment-generating function of $f^{(A)}(\mathbf{y};\mathbf{x})$ is defined as
$$\lambda(\mathbf{x}) \triangleq \exp(A(\mathbf{x})) = \int_{\mathbb{R}^M} \exp\big(\boldsymbol{\phi}^T(\mathbf{y})\,\mathbf{x}\big)\, h(\mathbf{y})\, d\mathbf{y}, \qquad \mathbf{x} \in \mathcal{N}. \qquad (44)$$
Note that
$$\mathcal{N} = \{\mathbf{x} \in \mathbb{R}^N \,|\, \lambda(\mathbf{x}) < \infty\}. \qquad (45)$$
Assuming a random vector $\mathbf{y} \sim f^{(A)}(\mathbf{y};\mathbf{x})$, it is known [42, Theorem 2.2], [43, Proposition 3.1] that for any $\mathbf{x} \in \mathcal{X}^o$ and $\mathbf{p} \in \mathbb{Z}_+^N$, the moments $E_{\mathbf{x}}\{\boldsymbol{\phi}^{\mathbf{p}}(\mathbf{y})\}$ exist, i.e., $E_{\mathbf{x}}\{\boldsymbol{\phi}^{\mathbf{p}}(\mathbf{y})\} < \infty$, and they can be calculated from the partial derivatives of $\lambda(\mathbf{x})$ according to
$$E_{\mathbf{x}}\{\boldsymbol{\phi}^{\mathbf{p}}(\mathbf{y})\} = \frac{1}{\lambda(\mathbf{x})}\,\frac{\partial^{\mathbf{p}} \lambda(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}}. \qquad (46)$$
Thus, the partial derivatives $\frac{\partial^{\mathbf{p}} \lambda(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}}$ exist for any $\mathbf{x} \in \mathcal{X}^o$ and $\mathbf{p} \in \mathbb{Z}_+^N$, and for any choice of the sufficient statistic $\boldsymbol{\phi}(\cdot)$ and the weight function $h(\cdot)$. Moreover, they depend continuously on $\mathbf{x} \in \mathcal{X}^o$ [42], [43].
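Relation (46) can be illustrated with the scalar Gaussian example in canonical form (an assumption made for this sketch): $f^{(A)}(y;x) = \exp(yx - x^2/2)\,h(y)$ with $\phi(y) = y$, $A(x) = x^2/2$, and hence $\lambda(x) = \exp(x^2/2)$. Finite-difference derivatives of $\lambda$ reproduce the first two moments of $\phi(y) = y$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Moments from the moment-generating function, cf. (46), for the assumed
# canonical Gaussian example: phi(y) = y, A(x) = x^2/2, lambda(x) = exp(x^2/2).
lam = lambda x: np.exp(x**2 / 2)
x0, h = 0.7, 1e-4

# First and second derivatives of lambda via central finite differences:
d1 = (lam(x0 + h) - lam(x0 - h)) / (2 * h)
d2 = (lam(x0 + h) - 2 * lam(x0) + lam(x0 - h)) / h**2

y = rng.normal(x0, 1.0, size=1_000_000)
print("E{y}   :", y.mean(), " vs ", d1 / lam(x0))        # = x0
print("E{y^2} :", np.mean(y**2), " vs ", d2 / lam(x0))   # = 1 + x0^2
```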
B. RKHS Associated with an Exponential Family Based MVP

Consider an estimation problem $\mathcal{E}^{(A)} \triangleq (\mathcal{X}, f^{(A)}(\mathbf{y};\mathbf{x}), g(\cdot))$ with an exponential family statistical model $\{f^{(A)}(\mathbf{y};\mathbf{x})\}_{\mathbf{x} \in \mathcal{X}}$ as defined in (43), and a fixed $\mathbf{x}_0 \in \mathcal{X}$. Consider further the RKHS $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$. Its kernel is obtained as
$$R_{\mathcal{E}^{(A)},\mathbf{x}_0}(\mathbf{x}_1, \mathbf{x}_2) \overset{(10)}{=} E_{\mathbf{x}_0}\bigg\{\frac{f^{(A)}(\mathbf{y};\mathbf{x}_1)\, f^{(A)}(\mathbf{y};\mathbf{x}_2)}{\big(f^{(A)}(\mathbf{y};\mathbf{x}_0)\big)^2}\bigg\} \qquad (47)$$
$$\overset{(43)}{=} E_{\mathbf{x}_0}\big\{\exp\big(\boldsymbol{\phi}^T(\mathbf{y})(\mathbf{x}_1 + \mathbf{x}_2 - 2\mathbf{x}_0) - A(\mathbf{x}_1) - A(\mathbf{x}_2) + 2A(\mathbf{x}_0)\big)\big\}$$
$$= \exp\big(-A(\mathbf{x}_1) - A(\mathbf{x}_2) + A(\mathbf{x}_0)\big) \int_{\mathbb{R}^M} \exp\big(\boldsymbol{\phi}^T(\mathbf{y})(\mathbf{x}_1 + \mathbf{x}_2 - \mathbf{x}_0)\big)\, h(\mathbf{y})\, d\mathbf{y} \overset{(44)}{=} \frac{\lambda(\mathbf{x}_1 + \mathbf{x}_2 - \mathbf{x}_0)\, \lambda(\mathbf{x}_0)}{\lambda(\mathbf{x}_1)\, \lambda(\mathbf{x}_2)}. \qquad (48)$$
Because (47) and (48) are equal, we see that condition (9) is satisfied, i.e., $E_{\mathbf{x}_0}\Big\{\frac{f^{(A)}(\mathbf{y};\mathbf{x}_1) f^{(A)}(\mathbf{y};\mathbf{x}_2)}{(f^{(A)}(\mathbf{y};\mathbf{x}_0))^2}\Big\} < \infty$ for all $\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X}$, if and only if $\frac{\lambda(\mathbf{x}_1 + \mathbf{x}_2 - \mathbf{x}_0)\lambda(\mathbf{x}_0)}{\lambda(\mathbf{x}_1)\lambda(\mathbf{x}_2)} < \infty$ for all $\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X}$. Since $\mathbf{x}_0 \in \mathcal{X} \subseteq \mathcal{N}$, we have $\lambda(\mathbf{x}_0) < \infty$. Furthermore, $\lambda(\mathbf{x}) \neq 0$ for all $\mathbf{x} \in \mathcal{X}$. Therefore, (9) is satisfied if and only if $\lambda(\mathbf{x}_1 + \mathbf{x}_2 - \mathbf{x}_0) < \infty$. We conclude that for an estimation problem whose statistical model belongs to an exponential family, condition (9) is equivalent to
$$\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{X} \;\Rightarrow\; \mathbf{x}_1 + \mathbf{x}_2 - \mathbf{x}_0 \in \mathcal{N}. \qquad (49)$$
Furthermore, from (48) and the fact that the partial derivatives $\frac{\partial^{\mathbf{p}} \lambda(\mathbf{x})}{\partial \mathbf{x}^{\mathbf{p}}}$ exist for any $\mathbf{x} \in \mathcal{X}^o$ and $\mathbf{p} \in \mathbb{Z}_+^N$ and depend continuously on $\mathbf{x} \in \mathcal{X}^o$, we can conclude that the RKHS $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ is differentiable up to any order. We summarize this finding in the following lemma.

Lemma V.1.
Consider an estimation problem $\mathcal{E}^{(A)} = (\mathcal{X}, f^{(A)}(\mathbf{y};\mathbf{x}), g(\cdot))$ associated with an exponential family (cf. (43)) with natural parameter space $\mathcal{N}$. The parameter set $\mathcal{X}$ is assumed to satisfy condition (49) for some reference parameter vector $\mathbf{x}_0 \in \mathcal{X}$. Then, the kernel $R_{\mathcal{E}^{(A)},\mathbf{x}_0}(\mathbf{x}_1, \mathbf{x}_2)$ and the RKHS $\mathcal{H}_{\mathcal{E}^{(A)},\mathbf{x}_0}$ are differentiable up to any order $m$.

Next, by combining Lemma V.1 with (22), we will derive simple lower bounds on the variance of estimators with a prescribed bias function.
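Before proceeding, a short check (for the assumed scalar Gaussian example with $\lambda(x) = \exp(x^2/2)$) confirms that the closed form (48) reproduces the kernel obtained directly from (10) in Section II-B1.

```python
import numpy as np

# Check of the kernel formula (48) for the assumed scalar Gaussian example:
# lambda(x) = exp(x^2/2), so (48) gives
#   R(x1, x2) = lambda(x1 + x2 - x0) lambda(x0) / (lambda(x1) lambda(x2))
#             = exp((x1 - x0)(x2 - x0)),
# which agrees with the kernel obtained directly from (10).
lam = lambda x: np.exp(x**2 / 2)
x0, x1, x2 = 0.2, 0.9, -0.5

R_48 = lam(x1 + x2 - x0) * lam(x0) / (lam(x1) * lam(x2))
R_direct = np.exp((x1 - x0) * (x2 - x0))
print(R_48, "==", R_direct)
```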
C. Variance Bounds for the Exponential Family If X o is nonempty, the sufficient statistic φ ( · ) is a complete sufficient statistic for the estimation problem E ( A ) , and thus there exists a UMV estimator ˆ g UMV ( · ) for any valid bias function c ( · ) [15, p. 42]. This UMV estimator is given by the conditional expectation ˆ g UMV ( y ) = E x { ˆ g ( y ) | φ ( y ) } , (50)where ˆ g ( · ) is any estimator with bias function c ( · ) , i.e., b (ˆ g ( · ); x ) = c ( x ) for all x ∈ X . The minimumachievable variance M ( c ( · ) , x ) is then equal to the variance of ˆ g UMV ( · ) at x , i.e., M ( c ( · ) , x ) = v (ˆ g UMV ( · ); x ) [15, p. 89]. However, it may be difficult to actually construct the UMV estimator via (50) and to calculate itsvariance. In fact, it may be already a difficult task to find an estimator ˆ g ( · ) whose bias function equals c ( · ) .Therefore, it is still of interest to find simple closed-form lower bounds on the variance of any estimator withbias c ( · ) . Theorem V.2.
Consider an estimation problem $\mathcal{E}^{(A)} = \big(\mathcal{X}, f^{(A)}(y;x), g(\cdot)\big)$ with parameter set $\mathcal{X}$ satisfying (49) and a finite set of multi-indices $\{p_l\}_{l\in[L]} \subseteq \mathbb{Z}_+^N$. Then, at any $x_0 \in \mathcal{X}^{o}$, the variance of any estimator $\hat{g}(\cdot)$ with mean function $\gamma(x) = E_x\{\hat{g}(y)\}$ and finite variance at $x_0$ is lower bounded as
\[
v(\hat{g}(\cdot); x_0) \,\geq\, n^T(x_0)\, S^{\dagger}(x_0)\, n(x_0) - \gamma^2(x_0), \tag{51}
\]
where the vector $n(x_0) \in \mathbb{R}^L$ and the matrix $S(x_0) \in \mathbb{R}^{L\times L}$ are given elementwise by
\[
\big(n(x_0)\big)_l \,\triangleq\, \sum_{p \leq p_l} \binom{p_l}{p}\, E_{x_0}\big\{\phi^{p_l - p}(y)\big\}\, \frac{\partial^p \gamma(x)}{\partial x^p}\bigg|_{x = x_0}, \tag{52}
\]
\[
\big(S(x_0)\big)_{l,l'} \,\triangleq\, E_{x_0}\big\{\phi^{p_l + p_{l'}}(y)\big\}, \tag{53}
\]
respectively. Here, $\sum_{p \leq p_l}$ denotes the sum over all multi-indices $p \in \mathbb{Z}_+^N$ such that $p_k \leq (p_l)_k$ for $k \in [N]$, and $\binom{p_l}{p} \triangleq \prod_{k=1}^{N} \binom{(p_l)_k}{p_k}$.

A proof of this result is provided in Appendix B. This proof shows that the bound (51) is obtained by projecting an appropriately transformed version of the mean function $\gamma(\cdot)$ onto the finite-dimensional subspace $\mathcal{U} = \operatorname{span}\{\tilde{r}_{x_0}^{(p_l)}(\cdot)\}_{l\in[L]}$ of an appropriately defined RKHS $\mathcal{H}(\tilde{R})$, with the functions $\tilde{r}_{x_0}^{(p_l)}(\cdot)$ given by (21). If we enlarge the set $\{\tilde{r}_{x_0}^{(p_l)}(\cdot)\}_{l\in[L]}$ by adding further functions $\tilde{r}_{x_0}^{(p')}(\cdot)$ with multi-indices $p' \notin \{p_l\}_{l\in[L]}$, the subspace tends to become higher-dimensional and, in turn, the lower bound (51) becomes higher, i.e., tighter.

The requirement of a finite variance $v(\hat{g}(\cdot); x_0)$ in Theorem V.2 implies via (11) that $\gamma(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},x_0}$. This, in turn, guarantees via (22) (which can be invoked since, due to Lemma V.1, the RKHS $\mathcal{H}_{\mathcal{E}^{(A)},x_0}$ is differentiable up to any order at $x_0$) the existence of the partial derivatives $\frac{\partial^p \gamma(x)}{\partial x^p}\big|_{x=x_0}$. Note also that the bound (51) depends on the mean function $\gamma(\cdot)$ only via its local behavior as given by the partial derivatives of $\gamma(\cdot)$ at $x_0$ up to a suitable order.

Evaluating the bound (51) requires computation of the moments $E_{x_0}\{\phi^p(y)\}$. This can be done by means of message passing algorithms [43].

For the choice $L = N$ and $p_l = e_l$, the bound (51) is closely related to the CRB obtained for the estimation problem $\mathcal{E}^{(A)}$. In fact, the CRB for $\mathcal{E}^{(A)}$ is obtained as [15, Thm. 2.6.2]
\[
v(\hat{g}(\cdot); x_0) \,\geq\, n^T(x_0)\, J^{\dagger}(x_0)\, n(x_0), \tag{54}
\]
with $\big(n(x_0)\big)_l = \frac{\partial \gamma(x)}{\partial x_l}\big|_{x = x_0}$ and the Fisher information matrix given by
\[
J(x_0) = E_{x_0}\Big\{ \big(\phi(y) - E_{x_0}\{\phi(y)\}\big)\big(\phi(y) - E_{x_0}\{\phi(y)\}\big)^T \Big\},
\]
i.e., the covariance matrix of the sufficient statistic vector $\phi(y)$. On the other hand, evaluating the bound (51) for $L = N$ and $p_l = e_l$ and assuming without loss of generality that $\gamma(x_0) = 0$, we obtain
\[
v(\hat{g}(\cdot); x_0) \,\geq\, n^T(x_0)\, S^{\dagger}(x_0)\, n(x_0), \tag{55}
\]
with $n(x_0)$ as before and $S(x_0) = E_{x_0}\{\phi(y)\, \phi^T(y)\}$. Thus, the only difference is that the CRB in (54) involves the covariance matrix of the sufficient statistic $\phi(y)$, whereas the bound in (55) involves the correlation matrix of $\phi(y)$.
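As a minimal numerical sketch of how (51)-(53) can be evaluated in practice (an example we add here, using the illustrative scalar Gaussian family with $\phi(y) = y$ and $y \sim \mathcal{N}(x_0, 1)$, the assumed mean function $\gamma(x) = x$ corresponding to unbiased estimation of $g(x) = x$, and the multi-index set $\{p_l\} = \{1, 2\}$), the following Python snippet computes the bound (51) from the Gaussian raw moments and compares it with the CRB (54). In this simple example both bounds equal 1, which is also the variance of the estimator $\hat{g}(y) = y$.

```python
import numpy as np
from math import comb

# Illustrative scalar Gaussian setup: phi(y) = y, y ~ N(x0, 1),
# mean function gamma(x) = x (unbiased estimation of g(x) = x).
x0 = 1.0
raw = {0: 1.0,                      # raw moments E_x0{ y**k } of N(x0, 1)
       1: x0,
       2: x0 ** 2 + 1,
       3: x0 ** 3 + 3 * x0,
       4: x0 ** 4 + 6 * x0 ** 2 + 3}
dgamma = {0: x0, 1: 1.0, 2: 0.0}    # derivatives of gamma(x) = x at x0

p_list = [1, 2]                     # chosen multi-indices p_l (scalar case)
n = np.array([sum(comb(pl, p) * raw[pl - p] * dgamma[p] for p in range(pl + 1))
              for pl in p_list])                                    # cf. (52)
S = np.array([[raw[pl + plp] for plp in p_list] for pl in p_list])  # cf. (53)

bound_51 = n @ np.linalg.pinv(S) @ n - dgamma[0] ** 2               # cf. (51)
crb_54 = dgamma[1] ** 2 / 1.0       # Fisher information J(x0) = var(y) = 1
print(bound_51, crb_54)             # both equal 1.0; v(y; x0) = 1 achieves it
```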
D. Reducing the Parameter Set

Using the RKHS framework, we will now show that, under mild conditions, the minimum achievable variance $M(c(\cdot), x_0)$ for an exponential family type estimation problem $\mathcal{E}^{(A)} = \big(\mathcal{X}, f^{(A)}(y;x), g(\cdot)\big)$ is invariant to reductions of the parameter set $\mathcal{X}$. Consider two estimation problems $\mathcal{E} = \big(\mathcal{X}, f(y;x), g(\cdot)\big)$ and $\mathcal{E}' = \big(\mathcal{X}', f(y;x), g(\cdot)\big|_{\mathcal{X}'}\big)$ (for now, not necessarily of the exponential family type) that differ only in their parameter sets $\mathcal{X}$ and $\mathcal{X}'$. More specifically, $\mathcal{E}'$ is obtained from $\mathcal{E}$ by reducing the parameter set, i.e., $\mathcal{X}' \subseteq \mathcal{X}$. For these two estimation problems, we consider corresponding MVPs at a specific parameter vector $x_0 \in \mathcal{X}'$ and for a certain prescribed bias $c(\cdot)$. More precisely, $c(\cdot)$ is the prescribed bias for $\mathcal{E}$ on the set $\mathcal{X}$, while the prescribed bias for $\mathcal{E}'$ is the restriction of $c(\cdot)$ to $\mathcal{X}'$, i.e., $c(\cdot)\big|_{\mathcal{X}'}$. We will denote the minimum achievable variances of the MVPs corresponding to $\mathcal{E}$ and $\mathcal{E}'$ by $M(c(\cdot), x_0)$ and $M'\big(c(\cdot)\big|_{\mathcal{X}'}, x_0\big)$, respectively. From (25), it follows that $M'\big(c(\cdot)\big|_{\mathcal{X}'}, x_0\big) \leq M(c(\cdot), x_0)$, since taking the supremum over a reduced set can never result in an increase of the supremum.

The effect that a reduction of the parameter set $\mathcal{X}$ has on the minimum achievable variance can be analyzed conveniently within the RKHS framework. This is based on the following result [21]: Consider an RKHS $\mathcal{H}(R)$ of functions $f(\cdot): \mathcal{D} \to \mathbb{R}$, with kernel $R(\cdot,\cdot): \mathcal{D}\times\mathcal{D} \to \mathbb{R}$. Let $\tilde{\mathcal{D}} \subseteq \mathcal{D}$. Then, the set of functions $\big\{\tilde{f}(\cdot) \triangleq f(\cdot)\big|_{\tilde{\mathcal{D}}} \,\big|\, f(\cdot) \in \mathcal{H}(R)\big\}$ that is obtained by restricting each function $f(\cdot) \in \mathcal{H}(R)$ to the subdomain $\tilde{\mathcal{D}}$ coincides with the RKHS $\mathcal{H}(\tilde{R})$ whose kernel $\tilde{R}(\cdot,\cdot): \tilde{\mathcal{D}}\times\tilde{\mathcal{D}} \to \mathbb{R}$ is the restriction of the kernel $R(\cdot,\cdot): \mathcal{D}\times\mathcal{D} \to \mathbb{R}$ to the subdomain $\tilde{\mathcal{D}}\times\tilde{\mathcal{D}}$, i.e.,
\[
\mathcal{H}(\tilde{R}) = \big\{ \tilde{f}(\cdot) \triangleq f(\cdot)\big|_{\tilde{\mathcal{D}}} \,\big|\, f(\cdot) \in \mathcal{H}(R) \big\}, \quad \text{with } \tilde{R}(\cdot,\cdot) \triangleq R(\cdot,\cdot)\big|_{\tilde{\mathcal{D}}\times\tilde{\mathcal{D}}}. \tag{56}
\]
Furthermore, the norm of an element $\tilde{f}(\cdot) \in \mathcal{H}(\tilde{R})$ is equal to the minimum of the norms of all functions $f(\cdot) \in \mathcal{H}(R)$ that coincide with $\tilde{f}(\cdot)$ on $\tilde{\mathcal{D}}$, i.e.,
\[
\big\|\tilde{f}(\cdot)\big\|_{\mathcal{H}(\tilde{R})} = \min_{\substack{f(\cdot)\in\mathcal{H}(R) \\ f(\cdot)|_{\tilde{\mathcal{D}}} = \tilde{f}(\cdot)}} \|f(\cdot)\|_{\mathcal{H}(R)}. \tag{57}
\]
Consider an arbitrary but fixed $f(\cdot) \in \mathcal{H}(R)$, and let $\tilde{f}(\cdot) \triangleq f(\cdot)\big|_{\tilde{\mathcal{D}}}$. Because $\tilde{f}(\cdot) \in \mathcal{H}(\tilde{R})$, we can calculate $\|\tilde{f}(\cdot)\|_{\mathcal{H}(\tilde{R})}$. From (57), we obtain for $\|\tilde{f}(\cdot)\|_{\mathcal{H}(\tilde{R})} = \big\|f(\cdot)\big|_{\tilde{\mathcal{D}}}\big\|_{\mathcal{H}(\tilde{R})}$ the inequality
\[
\big\|f(\cdot)\big|_{\tilde{\mathcal{D}}}\big\|_{\mathcal{H}(\tilde{R})} \leq \|f(\cdot)\|_{\mathcal{H}(R)}. \tag{58}
\]
This inequality holds for all $f(\cdot) \in \mathcal{H}(R)$.

Let us now return to the MVPs corresponding to $\mathcal{E}$ and $\mathcal{E}'$. From (58) with $\mathcal{D} = \mathcal{X}$, $\tilde{\mathcal{D}} = \mathcal{X}'$, $\mathcal{H}(R) = \mathcal{H}_{\mathcal{E},x_0}$, and $\mathcal{H}(\tilde{R}) = \mathcal{H}_{\mathcal{E}',x_0}$, we can conclude that, for any $x_0 \in \mathcal{X}'$,
\[
M'\big(c(\cdot)\big|_{\mathcal{X}'}, x_0\big) \overset{(12)}{=} \big\|\gamma(\cdot)\big|_{\mathcal{X}'}\big\|_{\mathcal{H}_{\mathcal{E}',x_0}}^2 - \gamma^2(x_0) \overset{(58)}{\leq} \|\gamma(\cdot)\|_{\mathcal{H}_{\mathcal{E},x_0}}^2 - \gamma^2(x_0) = M(c(\cdot), x_0). \tag{59}
\]
Here, we also used the fact that $\gamma(\cdot)\big|_{\mathcal{X}'} = c(\cdot)\big|_{\mathcal{X}'} + g(\cdot)\big|_{\mathcal{X}'}$.
The inequality in (59) means that a reduction of the parameter set $\mathcal{X}$ can never result in a deterioration of the achievable performance, i.e., in a higher minimum achievable variance. Besides this rather intuitive fact, the result (56) has the following consequence: Consider an estimation problem $\mathcal{E} = \big(\mathcal{X}, f(y;x), g(\cdot)\big)$ whose statistical model $\{f(y;x)\}_{x\in\mathcal{X}}$ satisfies (9) at some $x_0 \in \mathcal{X}$ and, moreover, is contained in a "larger" model $\{f(y;x)\}_{x\in\tilde{\mathcal{X}}}$ with $\tilde{\mathcal{X}} \supseteq \mathcal{X}$. If the larger model $\{f(y;x)\}_{x\in\tilde{\mathcal{X}}}$ also satisfies (9), it follows from (56) that a prescribed bias function $c(\cdot): \mathcal{X} \to \mathbb{R}$ can only be valid for $\mathcal{E}$ at $x_0$ if it is the restriction of a function $c'(\cdot): \tilde{\mathcal{X}} \to \mathbb{R}$ that is a valid bias function for the estimation problem $\tilde{\mathcal{E}} = \big(\tilde{\mathcal{X}}, f(y;x), g(\cdot)\big)$ at $x_0$. This holds true since every valid bias function for $\mathcal{E}$ at $x_0$ corresponds to a mean function that is an element of the RKHS $\mathcal{H}_{\mathcal{E},x_0}$, which by (56) consists precisely of the restrictions of the elements of the RKHS $\mathcal{H}_{\tilde{\mathcal{E}},x_0}$, which in turn, by (11), consists precisely of the mean functions that are valid for $\tilde{\mathcal{E}}$ at $x_0$.

For the remainder of this section, we restrict our discussion to estimation problems $\mathcal{E}^{(A)} = \big(\mathcal{X}, f^{(A)}(y;x), g(\cdot)\big)$ whose statistical model is an exponential family model. The next result characterizes the analytic properties of the mean functions $\gamma(\cdot)$ that belong to an RKHS $\mathcal{H}_{\mathcal{E}^{(A)},x_0}$. A proof is provided in Appendix C.

Lemma V.3.
Consider an estimation problem $\mathcal{E}^{(A)} = \big(\mathcal{X}, f^{(A)}(y;x), g(\cdot)\big)$ with an open parameter set $\mathcal{X} \subseteq \mathcal{N}$ satisfying (49) for some $x_0 \in \mathcal{X}$. Let $\gamma(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},x_0}$ be such that the partial derivatives $\frac{\partial^p \gamma(x)}{\partial x^p}\big|_{x=x_0}$ vanish for every multi-index $p \in \mathbb{Z}_+^N$. Then $\gamma(x) = 0$ for all $x \in \mathcal{X}$.

Note that since $\mathcal{H}_{\mathcal{E}^{(A)},x_0}$ is differentiable at $x_0$ up to any order (see Lemma V.1), it contains the function set $\{r_{x_0}^{(p)}(\cdot)\}_{p\in\mathbb{Z}_+^N}$ defined in (21). Moreover, by (22), for any $f(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},x_0}$ and any $p \in \mathbb{Z}_+^N$, there is $\big\langle r_{x_0}^{(p)}(\cdot), f(\cdot) \big\rangle_{\mathcal{H}_{\mathcal{E}^{(A)},x_0}} = \frac{\partial^p f(x)}{\partial x^p}\big|_{x=x_0}$. Hence, under the assumptions of Lemma V.3, we have that if a function $f(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},x_0}$ satisfies $\big\langle r_{x_0}^{(p)}(\cdot), f(\cdot) \big\rangle_{\mathcal{H}_{\mathcal{E}^{(A)},x_0}} = 0$ for all $p \in \mathbb{Z}_+^N$, then $f(\cdot) \equiv 0$. Thus, in this case, the set $\{r_{x_0}^{(p)}(\cdot)\}_{p\in\mathbb{Z}_+^N}$ is complete for the RKHS $\mathcal{H}_{\mathcal{E}^{(A)},x_0}$.

Upon combining (56) and (57) with Lemma V.3, we arrive at the second main result of this section:

Theorem V.4.
Consider an estimation problem $\mathcal{E}^{(A)} = \big(\mathcal{X}, f^{(A)}(y;x), g(\cdot)\big)$ with an open parameter set $\mathcal{X} \subseteq \mathcal{N}$ satisfying (49) for some $x_0 \in \mathcal{X}$, and a prescribed bias function $c(\cdot)$ that is valid for $\mathcal{E}^{(A)}$ at $x_0$. Furthermore, consider a reduced parameter set $\mathcal{X}_1 \subseteq \mathcal{X}$ such that $x_0 \in \mathcal{X}_1^{o}$. Let $\mathcal{E}_1^{(A)} \triangleq \big(\mathcal{X}_1, f^{(A)}(y;x), g(\cdot)\big)$ denote the estimation problem that is obtained from $\mathcal{E}^{(A)}$ by reducing the parameter set to $\mathcal{X}_1$, and let $c_1(\cdot) \triangleq c(\cdot)\big|_{\mathcal{X}_1}$. Then, the minimum achievable variance for the restricted estimation problem $\mathcal{E}_1^{(A)}$ and the restricted bias function $c_1(\cdot)$, denoted by $M_1(c_1(\cdot), x_0)$, is equal to the minimum achievable variance for the original estimation problem $\mathcal{E}^{(A)}$ and the original bias function $c(\cdot)$, i.e., $M_1(c_1(\cdot), x_0) = M(c(\cdot), x_0)$.

A proof of this theorem is provided in Appendix D. Note that the requirement $x_0 \in \mathcal{X}_1^{o}$ of the theorem implies that the reduced parameter set $\mathcal{X}_1$ must contain a neighborhood of $x_0$, i.e., an open ball $\mathcal{B}(x_0, r)$ with some radius $r > 0$. The main message of the theorem is that, for an estimation problem based on an exponential family, parameter set reductions have no effect on the minimum achievable variance at $x_0$ as long as the reduced parameter set contains a neighborhood of $x_0$.
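As a concrete illustration of Theorem V.4 (an example we add here, not taken from the original text), consider again the scalar Gaussian family with $\phi(y) = y$, $y \sim \mathcal{N}(x, 1)$, $\mathcal{X} = \mathcal{N} = \mathbb{R}$, $g(x) = x$, and the unbiased case $c(\cdot) \equiv 0$. The estimator $\hat{g}(y) = y$ is unbiased with variance $1$ at every $x$, and $M(c(\cdot), x_0) = 1$ since the CRB equals $1$ and is achieved. By Theorem V.4, restricting the parameter set to any open interval $\mathcal{X}_1 = (x_0 - r, x_0 + r)$ with $r > 0$ leaves the minimum achievable variance at $x_0$ unchanged at $1$: although the bias constraint now has to hold only on the smaller set, no estimator can exploit this to achieve a smaller variance at $x_0$. In contrast, if the reduced set does not contain a neighborhood of $x_0$, the theorem no longer applies and the minimum achievable variance can drop; for the single-point set $\mathcal{X}_1 = \{x_0\}$, for instance, the constant estimator $\hat{g}(y) \equiv x_0$ is unbiased on $\mathcal{X}_1$ and has zero variance.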
VI. CONCLUSION

The mathematical framework of reproducing kernel Hilbert spaces (RKHS) provides powerful tools for the analysis of minimum variance estimation (MVE) problems. Building upon the theoretical foundation developed in the seminal papers [2] and [3], we derived novel results concerning the RKHS-based analysis of lower variance bounds for MVE, of sufficient statistics, and of MVE problems conforming to an exponential family of distributions. More specifically, we presented an RKHS-based geometric interpretation of several well-known lower bounds on the estimator variance. We showed that each of these bounds is related to the orthogonal projection onto an associated subspace of the RKHS. In particular, the subspace associated with the Cramér–Rao bound is based on the strong structural properties of a differentiable RKHS. For a wide class of estimation problems, we proved that the minimum achievable variance, which is the tightest possible lower bound on the estimator variance (Barankin bound), is a lower semi-continuous function of the parameter vector. In some cases, this fact can be used to show that a given lower bound on the estimator variance is not maximally tight. Furthermore, we proved that the RKHS associated with an estimation problem remains unchanged if the observation is replaced by a sufficient statistic.

Finally, we specialized the RKHS description to estimation problems whose observation conforms to an exponential family of distributions. We showed that the kernel of the RKHS has a particularly simple expression in terms of the moment-generating function of the exponential family, and that the RKHS itself is differentiable up to any order. Using this differentiability, we derived novel closed-form lower bounds on the estimator variance. We also showed that reducing the parameter set has no effect on the minimum achievable variance at a given reference parameter vector $x_0$ if the reduced parameter set contains a neighborhood of $x_0$.

Promising directions for future work include the practical implementation of message passing algorithms for the efficient computation of the lower variance bounds for exponential families derived in Section V-C. Furthermore, in view of the close relations between exponential families and probabilistic graphical models [43], it would be interesting to explore the relations between the graph-theoretic properties of the graph associated with an exponential family and the properties of the RKHS associated with that exponential family.

APPENDIX A
PROOF OF THEOREM III.6
We first note that our assumption that the prescribed bias function $c(\cdot)$ is valid for $\mathcal{E}$ at every $x \in \mathcal{C}$ has two consequences. First, $M(c(\cdot), x) < \infty$ for every $x \in \mathcal{C}$ (cf. our definition of the validity of a bias function in Section II); second, due to (11), the prescribed mean function $\gamma(\cdot) = c(\cdot) + g(\cdot)$ belongs to $\mathcal{H}_{\mathcal{E},x}$ for every $x \in \mathcal{C}$.

Following [2], we define the linear span of a kernel function $R(\cdot,\cdot): \mathcal{X}\times\mathcal{X} \to \mathbb{R}$, denoted by $\mathcal{L}(R)$, as the set of all functions $f(\cdot): \mathcal{X} \to \mathbb{R}$ that are finite linear combinations of the form
\[
f(\cdot) = \sum_{l\in[L]} a_l\, R(\cdot, x_l), \quad \text{with } x_l \in \mathcal{X},\; a_l \in \mathbb{R},\; L \in \mathbb{N}. \tag{60}
\]
The linear span $\mathcal{L}(R)$ can be used to express the norm of any function $h(\cdot) \in \mathcal{H}(R)$ according to
\[
\|h(\cdot)\|_{\mathcal{H}(R)}^2 = \sup_{\substack{f(\cdot)\in\mathcal{L}(R) \\ \|f(\cdot)\|_{\mathcal{H}(R)} > 0}} \frac{\big\langle h(\cdot), f(\cdot)\big\rangle_{\mathcal{H}(R)}^2}{\|f(\cdot)\|_{\mathcal{H}(R)}^2}. \tag{61}
\]
This expression can be shown by combining [10, Theorem 3.1.2] and [10, Theorem 3.2.2]. We can now develop the minimum achievable variance $M(c(\cdot), x)$ as follows:
\[
M(c(\cdot), x) \overset{(12)}{=} \|\gamma(\cdot)\|_{\mathcal{H}_{\mathcal{E},x}}^2 - \gamma^2(x) \overset{(61)}{=} \sup_{\substack{f(\cdot)\in\mathcal{L}(R_{\mathcal{E},x}) \\ \|f(\cdot)\|_{\mathcal{H}_{\mathcal{E},x}} > 0}} \frac{\big\langle \gamma(\cdot), f(\cdot)\big\rangle_{\mathcal{H}_{\mathcal{E},x}}^2}{\|f(\cdot)\|_{\mathcal{H}_{\mathcal{E},x}}^2} - \gamma^2(x).
\]
Using (60) and letting $\mathcal{D} \triangleq \{x_1, \ldots, x_L\}$, $a \triangleq (a_1 \cdots a_L)^T$, and $\mathcal{A}_{\mathcal{D}} \triangleq \big\{ a \in \mathbb{R}^L \,\big|\, \sum_{l,l'\in[L]} a_l a_{l'}\, R_{\mathcal{E},x}(x_l, x_{l'}) > 0 \big\}$, we obtain further
\[
M(c(\cdot), x) = \sup_{\mathcal{D}\subseteq\mathcal{X},\, L\in\mathbb{N},\, a\in\mathcal{A}_{\mathcal{D}}} h_{\mathcal{D},a}(x). \tag{62}
\]
Here, our notation $\sup_{\mathcal{D}\subseteq\mathcal{X},\, L\in\mathbb{N},\, a\in\mathcal{A}_{\mathcal{D}}}$ indicates that the supremum is taken not only with respect to the elements $x_l$ of $\mathcal{D}$ but also with respect to the size of $\mathcal{D}$, $L = |\mathcal{D}|$, and the function $h_{\mathcal{D},a}(\cdot): \mathcal{X} \to \mathbb{R}$ is given by
\[
h_{\mathcal{D},a}(x) \,\triangleq\, \frac{\big\langle \gamma(\cdot),\, \sum_{l\in[L]} a_l R_{\mathcal{E},x}(\cdot, x_l) \big\rangle_{\mathcal{H}_{\mathcal{E},x}}^2}{\big\| \sum_{l\in[L]} a_l R_{\mathcal{E},x}(\cdot, x_l) \big\|_{\mathcal{H}_{\mathcal{E},x}}^2} - \gamma^2(x)
= \frac{\big( \sum_{l\in[L]} a_l \big\langle \gamma(\cdot), R_{\mathcal{E},x}(\cdot, x_l) \big\rangle_{\mathcal{H}_{\mathcal{E},x}} \big)^2}{\sum_{l,l'\in[L]} a_l a_{l'} \big\langle R_{\mathcal{E},x}(\cdot, x_l), R_{\mathcal{E},x}(\cdot, x_{l'}) \big\rangle_{\mathcal{H}_{\mathcal{E},x}}} - \gamma^2(x)
\overset{(5)}{=} \frac{\big( \sum_{l\in[L]} a_l\, \gamma(x_l) \big)^2}{\sum_{l,l'\in[L]} a_l a_{l'}\, R_{\mathcal{E},x}(x_l, x_{l'})} - \gamma^2(x).
\]
For any finite set $\mathcal{D} = \{x_1, \ldots, x_L\} \subseteq \mathcal{X}$ and any $a \in \mathcal{A}_{\mathcal{D}}$, it follows from our assumptions of continuity of $R_{\mathcal{E},x}(\cdot,\cdot)$ with respect to $x$ on $\mathcal{C}$ (see (37)) and continuity of $\gamma(x)$ on $\mathcal{C}$ that the function $h_{\mathcal{D},a}(x)$ is continuous in a neighborhood around any point $x_0 \in \mathcal{C}$. Thus, for any $x_0 \in \mathcal{C}$, there exists a radius $\delta > 0$ such that $h_{\mathcal{D},a}(x)$ is continuous on $\mathcal{B}(x_0, \delta) \subseteq \mathcal{C}$.

We will now show that the function $M(c(\cdot), x)$ given by (62) is lower semi-continuous at every $x_0 \in \mathcal{C}$, i.e., for any $x_0 \in \mathcal{C}$ and $\varepsilon > 0$, we can find a radius $r > 0$ such that
\[
M(c(\cdot), x) \geq M(c(\cdot), x_0) - \varepsilon, \quad \text{for all } x \in \mathcal{B}(x_0, r). \tag{63}
\]
Due to (62), there must be a finite subset $\mathcal{D}_0 \subseteq \mathcal{X}$ and a vector $a_0 \in \mathcal{A}_{\mathcal{D}_0}$ such that
\[
h_{\mathcal{D}_0,a_0}(x_0) \geq M(c(\cdot), x_0) - \varepsilon/2, \tag{64}
\]
for any given $\varepsilon > 0$. (Indeed, if (64) were not true, we would have $h_{\mathcal{D},a}(x_0) < M(c(\cdot), x_0) - \varepsilon/2$ for every choice of $\mathcal{D}$ and $a$. This, in turn, would imply that $\sup_{\mathcal{D}\subseteq\mathcal{X},\, L\in\mathbb{N},\, a\in\mathcal{A}_{\mathcal{D}}} h_{\mathcal{D},a}(x_0) \leq M(c(\cdot), x_0) - \varepsilon/2 < M(c(\cdot), x_0)$, yielding the contradiction $M(c(\cdot), x_0) \overset{(62)}{=} \sup_{\mathcal{D}\subseteq\mathcal{X},\, L\in\mathbb{N},\, a\in\mathcal{A}_{\mathcal{D}}} h_{\mathcal{D},a}(x_0) < M(c(\cdot), x_0)$.) Furthermore, since $h_{\mathcal{D}_0,a_0}(x)$ is continuous on $\mathcal{B}(x_0, \delta)$ as shown above, there is a radius $r > 0$ (with $r < \delta$) such that
\[
h_{\mathcal{D}_0,a_0}(x) \geq h_{\mathcal{D}_0,a_0}(x_0) - \varepsilon/2, \quad \text{for all } x \in \mathcal{B}(x_0, r). \tag{65}
\]
By combining this inequality with (64), it follows that there is a radius $r > 0$ (with $r < \delta$) such that for any $x \in \mathcal{B}(x_0, r)$ we have
\[
h_{\mathcal{D}_0,a_0}(x) \overset{(65)}{\geq} h_{\mathcal{D}_0,a_0}(x_0) - \varepsilon/2 \overset{(64)}{\geq} M(c(\cdot), x_0) - \varepsilon, \tag{66}
\]
and further
\[
M(c(\cdot), x) \overset{(62)}{=} \sup_{\mathcal{D}\subseteq\mathcal{X},\, L\in\mathbb{N},\, a\in\mathcal{A}_{\mathcal{D}}} h_{\mathcal{D},a}(x) \geq h_{\mathcal{D}_0,a_0}(x) \overset{(66)}{\geq} M(c(\cdot), x_0) - \varepsilon.
\]
Thus, for any given $\varepsilon > 0$, there is a radius $r > 0$ (with $r < \delta$) such that $M(c(\cdot), x) \geq M(c(\cdot), x_0) - \varepsilon$ for all $x \in \mathcal{B}(x_0, r)$, i.e., (63) has been proved.

APPENDIX B
PROOF OF THEOREM V.2
The bound (51) in Theorem V.2 is derived by using an isometry between the RKHS $\mathcal{H}_{\mathcal{E}^{(A)},x_0}$ and the RKHS $\mathcal{H}(\tilde{R})$ that is defined by the kernel $\tilde{R}(\cdot,\cdot): \mathcal{X}\times\mathcal{X} \to \mathbb{R}$,
\[
\tilde{R}(x_1, x_2) = \frac{\lambda(x_1 + x_2 - x_0)}{\lambda(x_0)}. \tag{67}
\]
It is easily verified that $\tilde{R}(\cdot,\cdot)$ and, thus, $\mathcal{H}(\tilde{R})$ are differentiable up to any order. Invoking [10, Theorem 3.3.4], it can be verified that the two RKHSs $\mathcal{H}_{\mathcal{E}^{(A)},x_0}$ and $\mathcal{H}(\tilde{R})$ are isometric and a specific congruence $\tilde{\mathsf{J}}: \mathcal{H}_{\mathcal{E}^{(A)},x_0} \to \mathcal{H}(\tilde{R})$ is given by
\[
\big(\tilde{\mathsf{J}}[f(\cdot)]\big)(x) = \frac{\lambda(x)}{\lambda(x_0)}\, f(x). \tag{68}
\]
Similarly to the bound (23), we can then obtain a lower bound on $v(\hat{g}(\cdot); x_0)$ via an orthogonal projection onto a subspace of $\mathcal{H}(\tilde{R})$. Indeed, with $c(\cdot) = \gamma(\cdot) - g(\cdot)$ denoting the bias function of the estimator $\hat{g}(\cdot)$, we have
\[
v(\hat{g}(\cdot); x_0) \overset{(4)}{\geq} M(c(\cdot), x_0) \overset{(12)}{=} \|\gamma(\cdot)\|_{\mathcal{H}_{\mathcal{E}^{(A)},x_0}}^2 - \gamma^2(x_0) \overset{(a)}{=} \big\|\tilde{\mathsf{J}}[\gamma(\cdot)]\big\|_{\mathcal{H}(\tilde{R})}^2 - \gamma^2(x_0) \geq \big\|\big(\tilde{\mathsf{J}}[\gamma(\cdot)]\big)_{\mathcal{U}}\big\|_{\mathcal{H}(\tilde{R})}^2 - \gamma^2(x_0), \tag{69}
\]
for an arbitrary subspace $\mathcal{U} \subseteq \mathcal{H}(\tilde{R})$. Here, step $(a)$ is due to the fact that $\tilde{\mathsf{J}}$ is a congruence, and $(\cdot)_{\mathcal{U}}$ denotes orthogonal projection onto $\mathcal{U}$. The bound (51) is obtained from (69) by choosing the subspace as $\mathcal{U} \triangleq \operatorname{span}\{\tilde{r}_{x_0}^{(p_l)}(\cdot)\}_{l\in[L]}$, with the functions $\tilde{r}_{x_0}^{(p_l)}(\cdot) \in \mathcal{H}(\tilde{R})$ as defined in (21), i.e., $\tilde{r}_{x_0}^{(p_l)}(x_1) = \frac{\partial^{p_l} \tilde{R}(x_1, x_2)}{\partial x_2^{p_l}}\big|_{x_2 = x_0}$.

Let us denote the image of $\gamma(\cdot)$ under the isometry $\tilde{\mathsf{J}}$ by $\tilde{\gamma}(\cdot) \triangleq \tilde{\mathsf{J}}[\gamma(\cdot)]$. According to (68),
\[
\tilde{\gamma}(x) = \frac{\lambda(x)}{\lambda(x_0)}\, \gamma(x). \tag{70}
\]
Furthermore, the variance bound (69) reads $v(\hat{g}(\cdot); x_0) \geq \|\tilde{\gamma}_{\mathcal{U}}(\cdot)\|_{\mathcal{H}(\tilde{R})}^2 - \gamma^2(x_0)$. Using (15), we obtain further
\[
v(\hat{g}(\cdot); x_0) \geq n^T(x_0)\, S^{\dagger}(x_0)\, n(x_0) - \gamma^2(x_0), \tag{71}
\]
where, according to (16), the entries of $n(x_0)$ and $S(x_0)$ are calculated as follows:
\[
\big(n(x_0)\big)_l \overset{(16)}{=} \big\langle \tilde{\gamma}(\cdot), \tilde{r}_{x_0}^{(p_l)}(\cdot) \big\rangle_{\mathcal{H}(\tilde{R})} \overset{(22)}{=} \frac{\partial^{p_l} \tilde{\gamma}(x)}{\partial x^{p_l}}\bigg|_{x=x_0} \overset{(70)}{=} \frac{1}{\lambda(x_0)}\, \frac{\partial^{p_l} \big[\lambda(x)\,\gamma(x)\big]}{\partial x^{p_l}}\bigg|_{x=x_0} \overset{(a)}{=} \frac{1}{\lambda(x_0)} \sum_{p\leq p_l} \binom{p_l}{p}\, \frac{\partial^{p_l - p}\lambda(x)}{\partial x^{p_l - p}}\, \frac{\partial^{p}\gamma(x)}{\partial x^{p}}\bigg|_{x=x_0} \overset{(46)}{=} \sum_{p\leq p_l} \binom{p_l}{p}\, E_{x_0}\big\{\phi^{p_l-p}(y)\big\}\, \frac{\partial^{p}\gamma(x)}{\partial x^{p}}\bigg|_{x=x_0} \tag{72}
\]
(here, $(a)$ is due to the generalized Leibniz rule for differentiation of a product of two functions [13, p. 104]), and
\[
\big(S(x_0)\big)_{l,l'} \overset{(16)}{=} \big\langle \tilde{r}_{x_0}^{(p_l)}(\cdot), \tilde{r}_{x_0}^{(p_{l'})}(\cdot) \big\rangle_{\mathcal{H}(\tilde{R})} \overset{(22)}{=} \frac{\partial^{p_l} \tilde{r}_{x_0}^{(p_{l'})}(x)}{\partial x^{p_l}}\bigg|_{x=x_0} \overset{(21)}{=} \frac{\partial^{p_l}}{\partial x_1^{p_l}} \bigg\{ \frac{\partial^{p_{l'}} \tilde{R}(x_1, x_2)}{\partial x_2^{p_{l'}}}\bigg|_{x_2=x_0} \bigg\}\bigg|_{x_1=x_0} \overset{(67)}{=} \frac{1}{\lambda(x_0)}\, \frac{\partial^{p_l + p_{l'}} \lambda(x)}{\partial x^{p_l + p_{l'}}}\bigg|_{x=x_0} \overset{(46)}{=} E_{x_0}\big\{\phi^{p_l + p_{l'}}(y)\big\}. \tag{73}
\]
Note that the application of (22) was based on the differentiability of $\mathcal{H}(\tilde{R})$. Comparing (71), (72), and (73) with (51), (52), and (53), respectively, we conclude that the theorem is proved.

APPENDIX C
PROOF OF LEMMA V.3
For $\mathcal{E}^{(A)} = \big(\mathcal{X}, f^{(A)}(y;x), g(\cdot)\big)$ and $x_0 \in \mathcal{X}$, consider a function $\gamma(\cdot): \mathcal{X} \to \mathbb{R}$ belonging to the RKHS $\mathcal{H}_{\mathcal{E}^{(A)},x_0}$. By (11), the function $c(\cdot) = \gamma(\cdot) - g(\cdot)$ is a valid bias function for $\mathcal{E}^{(A)} = \big(\mathcal{X}, f^{(A)}(y;x), g(\cdot)\big)$ at $x_0$; furthermore, the LMV estimator at $x_0$ exists and is given by $\hat{g}^{(x_0)}(\cdot) = \mathsf{J}[\gamma(\cdot)]$. Trivially, this estimator has the finite variance $v\big(\hat{g}^{(x_0)}(\cdot); x_0\big) = M(c(\cdot), x_0)$ at $x_0$, and its mean function equals $\gamma(\cdot)$, i.e., $E_x\{\hat{g}^{(x_0)}(y)\} = \gamma(x)$ for all $x \in \mathcal{X}$. Hence, the mean power $E_{x_0}\{(\hat{g}^{(x_0)}(y))^2\}$ is finite at $x_0$, since
\[
E_{x_0}\big\{\big(\hat{g}^{(x_0)}(y)\big)^2\big\} = v\big(\hat{g}^{(x_0)}(\cdot); x_0\big) + \big(E_{x_0}\{\hat{g}^{(x_0)}(y)\}\big)^2 = M(c(\cdot), x_0) + \gamma^2(x_0) < \infty. \tag{74}
\]
Now, for any exponential family based estimation problem $\mathcal{E}^{(A)} = \big(\mathcal{X}, f^{(A)}(y;x), g(\cdot)\big)$, it follows from [42, Theorem 2.7] that the mean function $E_x\{\hat{g}(y)\}$ of any estimator $\hat{g}(\cdot)$ is analytic on the interior $\mathcal{T}^{o}$ of the set $\mathcal{T} \triangleq \{x \in \mathcal{N} \,|\, E_x\{|\hat{g}(y)|\} < \infty\}$. (Following [14, Definition 2.2.1], we call a real-valued function $f(\cdot): \mathcal{U} \to \mathbb{R}$ defined on some open domain $\mathcal{U} \subseteq \mathbb{R}^N$ analytic if for every point $x_c \in \mathcal{U}$ there exists a power series $\sum_{p\in\mathbb{Z}_+^N} a_p (x - x_c)^p$ converging to $f(x)$ for every $x$ in some neighborhood of $x_c$; note that the coefficients $a_p$ may vary with $x_c$.) Furthermore, $\mathcal{T}$ can be shown to be a convex set [42, Corollary 2.6]. In particular, the mean function $\gamma(x)$ of the LMV estimator $\hat{g}^{(x_0)}(\cdot)$ is analytic on the interior $\mathcal{T}^{o}$ of the convex set $\mathcal{T} \triangleq \{x \in \mathcal{N} \,|\, E_x\{|\hat{g}^{(x_0)}(y)|\} < \infty\}$.

We will now verify that $\mathcal{X} \subseteq \mathcal{T}$. Using the Hilbert space $\mathcal{H}_{x_0} \triangleq \{t(y) \,|\, E_{x_0}\{t^2(y)\} < \infty\}$ and the associated inner product $\langle t_1(y), t_2(y)\rangle_{\mathrm{RV}} = E_{x_0}\{t_1(y)\, t_2(y)\}$, we obtain for an arbitrary $x_1 \in \mathcal{X} \subseteq \mathcal{N}$
\[
E_{x_1}\big\{\big|\hat{g}^{(x_0)}(y)\big|\big\} = E_{x_0}\bigg\{ \big|\hat{g}^{(x_0)}(y)\big|\, \frac{f^{(A)}(y;x_1)}{f^{(A)}(y;x_0)} \bigg\} = \Big\langle \big|\hat{g}^{(x_0)}(y)\big|,\, \rho(y, x_1) \Big\rangle_{\mathrm{RV}} \overset{(a)}{\leq} \sqrt{ \Big\langle \big|\hat{g}^{(x_0)}(y)\big|, \big|\hat{g}^{(x_0)}(y)\big| \Big\rangle_{\mathrm{RV}}\, \Big\langle \rho(y,x_1), \rho(y,x_1) \Big\rangle_{\mathrm{RV}} } = \sqrt{ E_{x_0}\big\{\big(\hat{g}^{(x_0)}(y)\big)^2\big\}\, E_{x_0}\bigg\{\bigg(\frac{f^{(A)}(y;x_1)}{f^{(A)}(y;x_0)}\bigg)^{\!2}\bigg\} } \overset{(74),\,(9)}{<} \infty,
\]
where $\rho(y, x_1) \triangleq f^{(A)}(y;x_1)/f^{(A)}(y;x_0)$ and $(a)$ follows from the Cauchy–Schwarz inequality in the Hilbert space $\mathcal{H}_{x_0}$. Thus, we have verified that $\mathcal{X} \subseteq \mathcal{T}$. Moreover, we have
\[
\mathcal{X} \subseteq \mathcal{T}^{o}. \tag{75}
\]
This is implied by $\mathcal{X} \subseteq \mathcal{T}$ together with the fact that (by assumption) $\mathcal{X}$ is an open set. (Indeed, assume that the open set $\mathcal{X} \subseteq \mathcal{T}$ contains a vector $x' \in \mathcal{X}$ that does not belong to the interior $\mathcal{T}^{o}$. It follows that no single neighborhood of $x'$ can be contained in $\mathcal{T}$ and, thus, no single neighborhood of $x'$ can be contained in $\mathcal{X}$, since $\mathcal{X} \subseteq \mathcal{T}$. However, because $x'$ belongs to the open set $\mathcal{X} = \mathcal{X}^{o}$, there must be at least one neighborhood of $x'$ that is contained in $\mathcal{X}$. Thus, we arrive at a contradiction, which implies that every vector $x' \in \mathcal{X}$ must belong to $\mathcal{T}^{o}$, or, equivalently, that $\mathcal{X} \subseteq \mathcal{T}^{o}$.)

Let us now consider the restrictions
\[
\gamma_{\mathcal{R}_{x_1}}(a) \,\triangleq\, \gamma\big(a\, x_1 + (1-a)\, x_0\big), \quad a \in (-\varepsilon,\, 1+\varepsilon), \tag{76}
\]
of $\gamma(\cdot)$ on line segments of the form $\mathcal{R}_{x_1} \triangleq \{a\, x_1 + (1-a)\, x_0 \,|\, a \in (-\varepsilon,\, 1+\varepsilon)\}$, where $x_1 \in \mathcal{T}^{o}$ and $\varepsilon > 0$. Here, $\varepsilon$ is chosen sufficiently small such that the vectors $x_a \triangleq x_0 - \varepsilon(x_1 - x_0)$ and $x_b \triangleq x_1 + \varepsilon(x_1 - x_0)$ belong to $\mathcal{T}^{o}$, i.e., $x_a, x_b \in \mathcal{T}^{o}$. Such an $\varepsilon$ can always be found, since, due to (75), we have $x_0 \in \mathcal{T}^{o}$. As can be verified easily, any vector in $\mathcal{R}_{x_1}$ is a convex combination of the vectors $x_a$ and $x_b$, which both belong to the interior $\mathcal{T}^{o}$ of the convex set $\mathcal{T}$. Therefore, we have $\mathcal{R}_{x_1} \subseteq \mathcal{T}^{o}$ for any $x_1 \in \mathcal{T}^{o}$, as the interior $\mathcal{T}^{o}$ of the convex set $\mathcal{T}$ is itself a convex set [44, Theorem 6.2], i.e., the interior $\mathcal{T}^{o}$ contains any convex combination of its elements. (Strictly speaking, [44, Theorem 6.2] states that the relative interior of a convex set is a convex set. However, since we assume that $\mathcal{X}$ is open with nonempty interior and therefore, by (75), also $\mathcal{T}$ has a nonempty interior, the relative interior of $\mathcal{T}$ coincides with the interior of $\mathcal{T}$.)

The function $\gamma_{\mathcal{R}_{x_1}}(\cdot): (-\varepsilon,\, 1+\varepsilon) \to \mathbb{R}$ in (76) is the composition of the mean function $\gamma(\cdot)$, which is analytic on $\mathcal{T}^{o}$ (a set that, by (75), contains $\mathcal{X}$), with the vector-valued function $b(\cdot): (-\varepsilon,\, 1+\varepsilon) \to \mathcal{T}^{o}$ given by $b(a) = a\, x_1 + (1-a)\, x_0$. Since each component $b_l(\cdot)$ of the function $b(\cdot)$, whose domain is the open interval $(-\varepsilon,\, 1+\varepsilon)$, is an analytic function, the function $\gamma_{\mathcal{R}_{x_1}}(\cdot)$ is itself analytic [14, Proposition 2.2.8].

Since the partial derivatives of $\gamma(\cdot)$ at $x_0$, $\frac{\partial^p \gamma(x)}{\partial x^p}\big|_{x=x_0}$, are assumed to vanish for every $p \in \mathbb{Z}_+^N$, the (ordinary) derivatives of arbitrary order of the scalar function $\gamma_{\mathcal{R}_{x_1}}(a)$ vanish at $a = 0$ (cf. [13, Theorem 9.15]). According to [14, Corollary 1.2.5], since $\gamma_{\mathcal{R}_{x_1}}(a)$ is an analytic function, this implies that $\gamma_{\mathcal{R}_{x_1}}(a)$ vanishes everywhere on its open domain $(-\varepsilon,\, 1+\varepsilon)$. This, in turn, implies that $\gamma(\cdot)$ vanishes on every line segment $\mathcal{R}_{x_1}$ with some $x_1 \in \mathcal{T}^{o}$ and, thus, $\gamma(\cdot)$ vanishes everywhere on $\mathcal{T}^{o}$. By (75), we finally conclude that $\gamma(\cdot)$ vanishes everywhere on $\mathcal{X}$.

APPENDIX D
PROOF OF THEOREM V.4
Because $c(\cdot)$ was assumed valid at $x_0$, the corresponding mean function $\gamma(\cdot) = c(\cdot) + g(\cdot)$ is an element of $\mathcal{H}_{\mathcal{E}^{(A)},x_0}$ (see (11)). Let $\gamma_1(\cdot) \triangleq \gamma(\cdot)\big|_{\mathcal{X}_1}$, and note that $\gamma_1(\cdot)$ is the mean function corresponding to the restricted bias function $c_1(\cdot)$, i.e., $\gamma_1(\cdot) = c_1(\cdot) + g(\cdot)\big|_{\mathcal{X}_1}$. We have $\gamma_1(\cdot) \in \mathcal{H}_{\mathcal{E}_1^{(A)},x_0}$ due to (11), because $\gamma_1(x)$ is the mean function (evaluated for $x \in \mathcal{X}_1$) of an estimator $\hat{g}(\cdot)$ that has finite variance at $x_0$ and whose bias function on $\mathcal{X}_1$ equals $c_1(x)$. (The existence of such an estimator $\hat{g}(\cdot)$ is guaranteed since $c(\cdot)$ was assumed valid at $x_0$.) For the minimum achievable variance for the restricted estimation problem, we obtain
\[
M_1(c_1(\cdot), x_0) \overset{(12)}{=} \|\gamma_1(\cdot)\|_{\mathcal{H}_{\mathcal{E}_1^{(A)},x_0}}^2 - \gamma_1^2(x_0) \overset{(57)}{=} \min_{\substack{\gamma'(\cdot)\in\mathcal{H}_{\mathcal{E}^{(A)},x_0} \\ \gamma'(\cdot)|_{\mathcal{X}_1} = \gamma_1(\cdot)}} \|\gamma'(\cdot)\|_{\mathcal{H}_{\mathcal{E}^{(A)},x_0}}^2 - \gamma_1^2(x_0). \tag{77}
\]
However, the only function $\gamma'(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},x_0}$ that satisfies $\gamma'(\cdot)\big|_{\mathcal{X}_1} = \gamma_1(\cdot)$ is the mean function $\gamma(\cdot)$. This is a consequence of Lemma V.3 and can be verified as follows. Consider a function $\gamma'(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},x_0}$ that satisfies $\gamma'(\cdot)\big|_{\mathcal{X}_1} = \gamma_1(\cdot)$. By the definition of $\gamma_1(\cdot)$, we also have $\gamma(\cdot)\big|_{\mathcal{X}_1} = \gamma_1(\cdot)$. Therefore, the difference $\gamma''(\cdot) \triangleq \gamma'(\cdot) - \gamma(\cdot) \in \mathcal{H}_{\mathcal{E}^{(A)},x_0}$ satisfies $\gamma''(\cdot)\big|_{\mathcal{X}_1} = \gamma'(\cdot)\big|_{\mathcal{X}_1} - \gamma(\cdot)\big|_{\mathcal{X}_1} = \gamma_1(\cdot) - \gamma_1(\cdot) = 0$, i.e., $\gamma''(x) = 0$ for all $x \in \mathcal{X}_1$. Since $x_0 \in \mathcal{X}_1^{o}$, this implies that $\frac{\partial^p \gamma''(x)}{\partial x^p}\big|_{x=x_0} = 0$ for all $p \in \mathbb{Z}_+^N$. It then follows from Lemma V.3 that $\gamma''(x) = 0$ for all $x \in \mathcal{X}$ and, thus, $\gamma'(x) = \gamma(x)$ for all $x \in \mathcal{X}$. This shows that $\gamma(\cdot)$ is the unique function satisfying $\gamma(\cdot)\big|_{\mathcal{X}_1} = \gamma_1(\cdot)$. Therefore, we have
\[
\min_{\substack{\gamma'(\cdot)\in\mathcal{H}_{\mathcal{E}^{(A)},x_0} \\ \gamma'(\cdot)|_{\mathcal{X}_1} = \gamma_1(\cdot)}} \|\gamma'(\cdot)\|_{\mathcal{H}_{\mathcal{E}^{(A)},x_0}}^2 = \|\gamma(\cdot)\|_{\mathcal{H}_{\mathcal{E}^{(A)},x_0}}^2,
\]
and thus (77) becomes
\[
M_1(c_1(\cdot), x_0) = \|\gamma(\cdot)\|_{\mathcal{H}_{\mathcal{E}^{(A)},x_0}}^2 - \gamma_1^2(x_0) = \|\gamma(\cdot)\|_{\mathcal{H}_{\mathcal{E}^{(A)},x_0}}^2 - \gamma^2(x_0) \overset{(12)}{=} M(c(\cdot), x_0).
\]
Here, the second equality is due to the fact that $\gamma_1(x_0) = \gamma(x_0)$ (because $x_0 \in \mathcal{X}_1^{o}$).

REFERENCES
[1] P. Billingsley, Probability and Measure, 3rd ed. New York: Wiley, 1995.
[2] E. Parzen, "Statistical inference on time series by Hilbert space methods, I," Appl. Math. Stat. Lab., Stanford University, Stanford, CA, Tech. Rep. 23, Jan. 1959.
[3] D. D. Duttweiler and T. Kailath, "RKHS approach to detection and estimation problems – Part V: Parameter estimation," IEEE Trans. Inf. Theory, vol. 19, no. 1, pp. 29–37, Jan. 1973.
[4] S. Smale and D. X. Zhou, "Learning theory estimates via integral operators and their approximations," Constr. Approx., vol. 26, pp. 153–172, 2007.
[5] F. Cucker and S. Smale, "On the mathematical foundations of learning," Bulletin of the American Mathematical Society, vol. 39, pp. 1–49, 2002.
[6] S. Schmutzhard, A. Jung, F. Hlawatsch, Z. Ben-Haim, and Y. C. Eldar, "A lower bound on the estimator variance for the sparse linear model," in Proc. 44th Asilomar Conf. Signals, Systems, Computers, Pacific Grove, CA, Nov. 2010, pp. 1976–1980.
[7] S. Schmutzhard, A. Jung, and F. Hlawatsch, "Minimum variance estimation for the sparse signal in noise model," in Proc. IEEE ISIT 2011, St. Petersburg, Russia, Jul.–Aug. 2011, pp. 124–128.
[8] A. Jung, S. Schmutzhard, F. Hlawatsch, and A. O. Hero III, "Performance bounds for sparse parametric covariance estimation in Gaussian models," in Proc. IEEE ICASSP 2011, Prague, Czech Republic, May 2011, pp. 4156–4159.
[9] T. Kailath, "RKHS approach to detection and estimation problems – Part I: Deterministic signals in Gaussian noise," IEEE Trans. Inf. Theory, vol. 17, no. 5, pp. 530–549, Sep. 1971.
[10] A. Jung, "An RKHS Approach to Estimation with Sparsity Constraints," Ph.D. dissertation, Vienna University of Technology, 2011.
[11] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins University Press, 1996.
[12] B. R. Gelbaum and J. M. Olmsted, Counterexamples in Analysis. Mineola, NY: Dover Publications, 2003.
[13] W. Rudin, Principles of Mathematical Analysis, 3rd ed. New York: McGraw-Hill, 1976.
[14] S. G. Krantz and H. R. Parks, A Primer of Real Analytic Functions, 2nd ed. Boston, MA: Birkhäuser, 2002.
[15] E. L. Lehmann and G. Casella, Theory of Point Estimation, 2nd ed. New York: Springer, 1998.
[16] P. R. Halmos and L. J. Savage, "Application of the Radon–Nikodym theorem to the theory of sufficient statistics," Ann. Math. Statist., vol. 20, no. 2, pp. 225–241, 1949.
[17] I. A. Ibragimov and R. Z. Has'minskii, Statistical Estimation: Asymptotic Theory. New York: Springer, 1981.
[18] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice Hall, 1993.
[19] Y. C. Eldar, Rethinking Biased Estimation: Improving Maximum Likelihood and the Cramér–Rao Bound, ser. Foundations and Trends in Signal Processing. Hanover, MA: Now Publishers, 2007, vol. 1, no. 4.
[20] G. Casella and R. L. Berger, Statistical Inference, 2nd ed. Pacific Grove, CA: Duxbury, 2002.
[21] N. Aronszajn, "Theory of reproducing kernels," Trans. Am. Math. Soc., vol. 68, no. 3, pp. 337–404, May 1950.
[22] W. Rudin, Real and Complex Analysis, 3rd ed. New York: McGraw-Hill, 1987.
[23] E. W. Barankin, "Locally best unbiased estimates," Ann. Math. Statist., vol. 20, no. 4, pp. 477–501, 1949.
[24] C. Stein, "Unbiased estimates with minimum variance," Ann. Math. Statist., vol. 21, no. 3, pp. 406–415, 1950.
[25] P. R. Halmos, Measure Theory. New York: Springer, 1974.
[26] H.-W. Sun and D.-X. Zhou, "Reproducing kernel Hilbert spaces associated with analytic translation-invariant Mercer kernels," J. Fourier Anal. Appl., vol. 14, no. 1, pp. 89–101, Feb. 2008.
[27] D.-X. Zhou, "Derivative reproducing properties for kernel methods in learning theory," J. Comput. Appl. Math., vol. 220, no. 1–2, pp. 456–463, Oct. 2008.
[28] D.-X. Zhou, "Capacity of reproducing kernel spaces in learning theory," IEEE Trans. Inf. Theory, vol. 49, pp. 1743–1752, 2003.
[29] R. McAulay and E. Hofstetter, "Barankin bounds on parameter estimation," IEEE Trans. Inf. Theory, vol. 17, no. 6, pp. 669–676, Nov. 1971.
[30] H. Cramér, "A contribution to the theory of statistical estimation," Skand. Akt. Tidskr., vol. 29, pp. 85–94, 1946.
[31] C. R. Rao, "Information and the accuracy attainable in the estimation of statistical parameters," Bull. Calcutta Math. Soc., vol. 37, pp. 81–91, 1945.
[32] P. Stoica and B. C. Ng, "On the Cramér–Rao bound under parametric constraints," IEEE Signal Processing Letters, vol. 5, no. 7, pp. 177–179, Jul. 1998.
[33] Z. Ben-Haim and Y. Eldar, "On the constrained Cramér–Rao bound with a singular Fisher information matrix," IEEE Signal Processing Letters, vol. 16, no. 6, pp. 453–456, June 2009.
[34] T. J. Moore, "A theory of Cramér–Rao bounds for constrained parametric models," Ph.D. dissertation, University of Maryland, 2010.
[35] J. S. Abel, "A bound on mean-square-estimate error," IEEE Trans. Inf. Theory, vol. 39, no. 5, pp. 1675–1680, Sep. 1993.
[36] A. Bhattacharyya, "On some analogues of the amount of information and their use in statistical estimation," Sankhyā: The Indian Journal of Statistics (1933–1960), vol. 8, no. 1, pp. 1–14, Nov. 1946.
[37] J. D. Gorman and A. O. Hero, "Lower bounds for parametric estimation with constraints," IEEE Trans. Inf. Theory, vol. 36, no. 6, pp. 1285–1301, Nov. 1990.
[38] D. G. Chapman and H. Robbins, "Minimum variance estimation without regularity assumptions," Ann. Math. Statist., vol. 22, no. 4, pp. 581–586, Dec. 1951.
[39] J. M. Hammersley, "On estimating restricted parameters," J. Roy. Statist. Soc. B, vol. 12, no. 2, pp. 192–240, 1950.
[40] Z. Ben-Haim and Y. C. Eldar, "The Cramér–Rao bound for estimating a sparse parameter vector," IEEE Trans. Signal Processing, vol. 58, pp. 3384–3389, June 2010.
[41] S. Kullback, Information Theory and Statistics. Mineola, NY: Dover Publications, 1968.
[42] L. D. Brown, Fundamentals of Statistical Exponential Families, ser. Lecture Notes – Monograph Series. Hayward, CA: Institute of Mathematical Statistics, 1986.
[43] M. J. Wainwright and M. I. Jordan, Graphical Models, Exponential Families, and Variational Inference, ser. Foundations and Trends in Machine Learning. Hanover, MA: Now Publishers, 2008, vol. 1, no. 1–2.
[44] R. T. Rockafellar, Convex Analysis. Princeton, NJ: Princeton University Press, 1970.