Minimum Variance Estimation of a Sparse Vector within the Linear Gaussian Model: An RKHS Approach
Alexander Jung, Sebastian Schmutzhard, Franz Hlawatsch, Zvika Ben-Haim, Yonina C. Eldar
Alexander Jung^a (corresponding author), Sebastian Schmutzhard^b, Franz Hlawatsch^a, Zvika Ben-Haim^c, and Yonina C. Eldar^d
^a Institute of Telecommunications, Vienna University of Technology; {ajung, fhlawats}@nt.tuwien.ac.at
^b NuHAG, Faculty of Mathematics, University of Vienna; [email protected]
^c Google, Inc., Israel; [email protected]
^d Technion—Israel Institute of Technology; [email protected]
Abstract
We consider minimum variance estimation within the sparse linear Gaussian model (SLGM). A sparse vector is to be estimated from a linearly transformed version embedded in Gaussian noise. Our analysis is based on the theory of reproducing kernel Hilbert spaces (RKHS). After a characterization of the RKHS associated with the SLGM, we derive novel lower bounds on the minimum variance achievable by estimators with a prescribed bias function. This includes the important case of unbiased estimation. The variance bounds are obtained via an orthogonal projection of the prescribed mean function onto a subspace of the RKHS associated with the SLGM. Furthermore, we specialize our bounds to compressed sensing measurement matrices and express them in terms of the restricted isometry and coherence parameters. For the special case of the SLGM given by the sparse signal in noise model (SSNM), we derive closed-form expressions of the minimum achievable variance (Barankin bound) and the corresponding locally minimum variance estimator. We also analyze the effects of exact and approximate sparsity information and show that the minimum achievable variance for exact sparsity is not a limiting case of that for approximate sparsity. Finally, we compare our bounds with the variance of three well-known estimators, namely, the maximum-likelihood estimator, the hard-thresholding estimator, and compressive reconstruction using the orthogonal matching pursuit.
Index Terms
Sparsity, compressed sensing, unbiased estimation, denoising, RKHS, Cramér–Rao bound, Barankin bound, Hammersley–Chapman–Robbins bound, locally minimum variance unbiased estimator.
This work was supported by the FWF under Grants S10602-N13 (Signal and Information Representation) and S10603-N13 (Statistical Inference) within the National Research Network SISE, by the WWTF under Grant MA 07-004 (SPORTS), by the Israel Science Foundation under Grant 1081/07, and by the European Commission under the FP7 Network of Excellence in Wireless Communications NEWCOM++ (contract no. 216715). Parts of this work were previously presented at the 44th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, Nov. 2010, and at the 2011 IEEE International Symposium on Information Theory (ISIT 2011), Saint Petersburg, Russia, July/Aug. 2011.
Submitted to the IEEE Transactions on Information Theory, February 24, 2018
I. INTRODUCTION
We study the problem of estimating the value g(x) of a known vector-valued function g(·) evaluated at the unknown parameter vector x ∈ R^N. It is known that x is S-sparse, i.e., at most S of its entries are nonzero, where S ∈ [N] ≜ {1, ..., N} (typically S ≪ N). While the sparsity degree S is known, the set of positions of the nonzero entries of x, i.e., the support supp(x) ⊆ [N], is unknown. The estimation of g(x) is based on an observed random vector y = Hx + n ∈ R^M, with a known system matrix H ∈ R^{M×N} and independent and identically distributed (i.i.d.) Gaussian noise n ∼ N(0, σ²I) with known noise variance σ² > 0. We assume that the minimum number of linearly dependent columns of H is larger than S.

The data model described above will be termed the sparse linear Gaussian model (SLGM). The SLGM is relevant, e.g., to sparse channel estimation [1], where the sparse parameter vector x represents the tap coefficients of a linear time-invariant channel and the system matrix H represents the training signal. More generally, the SLGM can be used for any type of sparse deconvolution [2]. The special case of the SLGM obtained for H = I (so that M = N and y = x + n) will be referred to as the sparse signal in noise model (SSNM). The SSNM can be used, e.g., for sparse channel estimation [1] employing an orthogonal training signal [3] and for image denoising employing an orthonormal wavelet basis [4].

A fundamental question, to be considered in this work, is how to exploit the knowledge of the sparsity degree S. In contrast to compressed sensing (CS), where the sparsity is exploited for compression [5]–[7], here we investigate how much the sparsity assumption helps us improve the accuracy of estimating g(x). Related questions have been previously addressed for the SLGM in [4] and [8]–[13]. In [8] and [9], bounds on the minimax risk and approximate minimax estimators whose worst-case risk is close to these bounds have been derived for the SLGM. An asymptotic analysis of minimax estimation for the SSNM has been given in the seminal work [4], [10]. In the context of minimum variance estimation (MVE), which is relevant to our present work, lower bounds on the minimum achievable variance for the SLGM have been derived recently. In particular, the Cramér–Rao bound (CRB) for the SLGM has been derived and analyzed in [11] and [12]. Furthermore, in our previous work [13], we derived lower and upper bounds on the minimum achievable variance of unbiased estimators for the SSNM.

The contributions of the present paper can be summarized as follows. First, we present novel CRB-type lower bounds on the variance of estimators for the SLGM. These bounds are derived by an application of the mathematical framework of reproducing kernel Hilbert spaces (RKHS) [14]–[16]. Since they hold for any estimator with a prescribed mean function, they are also lower bounds on the minimum achievable variance (also known as the Barankin bound) for the SLGM. The bounds are tighter than those presented in [11], [12], and they have an appealing form in that they are scaled versions of the conventional CRB obtained for the nonsparse case [17], [18]. We note that our RKHS approach is quite different from the technique used in [13]. Also, a shortcoming of the lower bounds presented in [11] and [13] is the fact that they exhibit a discontinuity when passing from the case ‖x‖₀ = S (i.e., x has exactly S nonzero values) to the case ‖x‖₀ < S (i.e., x has fewer than S nonzero values).
For unbiased estimation, we derive a lower bound that is tighter than the bounds in [11]–[13] and, moreover, a continuous function of x. In particular, this bound exhibits a smooth transition between the two regimes given by ‖x‖₀ = S and ‖x‖₀ < S. Based on the fact that the linear CS recovery problem is an instance of the SLGM, we specialize our lower bounds to system matrices that are CS measurement matrices, and we express them in terms of the restricted isometry and coherence parameters of these matrices.

Furthermore, for the SSNM, we derive expressions of the minimum achievable variance at a given parameter vector x = x₀ and of the locally minimum variance (LMV) estimator, i.e., the estimator achieving the minimum variance at x₀. Simplified expressions of the minimum achievable variance and the LMV estimator are obtained for a certain subclass of “diagonal” bias functions (which includes the unbiased case).

Finally, we consider the SLGM with an approximate sparsity constraint and show that the minimum achievable variance under an exact sparsity constraint is not a limiting case of the minimum achievable variance under an approximate sparsity constraint.

A central aspect of this paper is the application of the mathematical framework of RKHS [14] to the SLGM. The RKHS framework has been previously applied to classical estimation in the seminal work reported in [15] and [16], and our present treatment is substantially based on that work. However, to the best of our knowledge, the RKHS framework has not been applied to the SLGM or, more generally, to the estimation of (functions of) sparse vectors. The sparse case is specific in that we are considering functions whose domain is the set of S-sparse vectors. For S < N, the interior of this set is empty, and thus there do not exist derivatives in every possible direction. This lack of a differentiable structure makes the characterization of the RKHS a somewhat delicate matter.

The remainder of this paper is organized as follows. We begin in Section II with formal statements of the SLGM and SSNM and continue in Section III with a review of basic elements of MVE. In Section IV, we review some fundamentals of RKHSs and the application of RKHSs to MVE. In Section V, we characterize and discuss the RKHS associated with the SLGM. For the SLGM, we then use the RKHS framework to present formal characterizations of the class of bias functions allowing for finite-variance estimators, of the minimum achievable variance (Barankin bound), and of the LMV estimator. We also present a result on the shape of the Barankin bound. In Section VI, we reinterpret the sparse CRB of [11] from the RKHS perspective, and we present two novel lower variance bounds for the SLGM. In Section VII, we specialize the bounds of Section VI to system matrices that are CS measurement matrices. The important special case given by the SSNM is discussed in Section VIII, where we derive closed-form expressions of the minimum achievable variance (Barankin bound) and of the corresponding LMV estimator. A discussion of the effects of exact and approximate sparsity information from the MVE perspective is presented in Section IX. Finally, in Section X, we present numerical results comparing our theoretical bounds with the actual variance of some popular estimation schemes.
Notation and basic definitions. The sets of real, nonnegative real, natural, and nonnegative integer numbers are denoted by R, R₊, N ≜ {1, 2, ...}, and Z₊ ≜ {0, 1, ...}, respectively. For L ∈ N, we define [L] ≜ {1, ..., L}. The space of all discrete-argument functions f[·]: T → R (with T ⊆ Z^D for some D ∈ N) for which Σ_{l∈T} f²[l] < ∞ is denoted by ℓ²(T), with associated norm ‖f[·]‖_T ≜ (Σ_{l∈T} f²[l])^{1/2}. The Kronecker delta δ_{k,l} is 1 if k = l and 0 otherwise. Given an N-tuple of nonnegative integers (a “multi-index”) p = (p₁ ··· p_N)^T ∈ Z₊^N [19], we define p! ≜ Π_{l∈[N]} p_l!, |p| ≜ Σ_{l∈[N]} p_l, and x^p ≜ Π_{l∈[N]} (x_l)^{p_l} (for x ∈ R^N). Given two multi-indices p₁, p₂ ∈ Z₊^N, the inequality p₁ ≤ p₂ is understood to hold elementwise, i.e., p₁,l ≤ p₂,l for all l ∈ [N].

Lowercase (uppercase) boldface letters denote column vectors (matrices). The superscript ^T stands for transposition. The kth unit vector is denoted by e_k, and the identity matrix by I. For a rectangular matrix H ∈ R^{M×N}, we denote by H† its Moore–Penrose pseudoinverse [20], by ker(H) ≜ {x ∈ R^N | Hx = 0} its kernel (or null space), by span(H) ≜ {y ∈ R^M | ∃ x ∈ R^N : y = Hx} its column span, and by rank(H) its rank. For a square matrix H ∈ R^{N×N}, we denote by tr(H), det(H), and H⁻¹ its trace, determinant, and inverse (if it exists), respectively. The kth entry of a vector x is denoted by (x)_k = x_k, and the entry in the kth row and lth column of a matrix H by (H)_{k,l} = H_{k,l}. The support (i.e., set of indices of all nonzero entries) and the number of nonzero entries of a vector x are denoted by supp(x) and ‖x‖₀ = |supp(x)|, respectively. Given an index set I ⊆ [N], we denote by x_I ∈ R^N the vector obtained from x ∈ R^N by zeroing all entries except those indexed by I, and by H_I ∈ R^{M×|I|} the matrix formed by those columns of H ∈ R^{M×N} that are indexed by I. The ℓ_p-norm of a vector x ∈ R^N is defined as ‖x‖_p ≜ (Σ_{k∈[N]} |x_k|^p)^{1/p}.
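The multi-index conventions above reappear in later expansions such as (22), (26), and (29). A minimal sketch (plain Python with NumPy; the helper names are our own) of how p!, |p|, and x^p act on a concrete multi-index may help fix the notation:

```python
import math
import numpy as np

def multi_factorial(p):
    """p! = prod_l p_l! for a multi-index p in Z_+^N."""
    return math.prod(math.factorial(int(pl)) for pl in p)

def multi_abs(p):
    """|p| = sum_l p_l."""
    return int(np.sum(p))

def multi_power(x, p):
    """x^p = prod_l x_l^{p_l} for x in R^N."""
    return float(np.prod(np.asarray(x, dtype=float) ** np.asarray(p)))

# Example: N = 3, p = (2, 0, 1), x = (0.5, -1.0, 2.0)
p = np.array([2, 0, 1])
x = np.array([0.5, -1.0, 2.0])
print(multi_factorial(p))   # 2! * 0! * 1! = 2
print(multi_abs(p))         # |p| = 3
print(multi_power(x, p))    # 0.5^2 * (-1)^0 * 2^1 = 0.5
```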
II. THE SPARSE LINEAR GAUSSIAN MODEL

We will first present a more detailed statement of the SLGM. Let x ∈ R^N be an unknown parameter vector that is known to be S-sparse in the sense that at most S of its entries are nonzero, i.e., ‖x‖₀ ≤ S, with a known sparsity degree S ∈ [N] (typically S ≪ N). We will express this S-sparsity in terms of a parameter set X_S, i.e., x ∈ X_S, with

X_S ≜ {x′ ∈ R^N | ‖x′‖₀ ≤ S} ⊆ R^N.   (1)

In the limiting case where S is equal to the dimension of x, i.e., S = N, we have X_S = R^N. Note that the support supp(x) ⊆ [N] is unknown. We observe a linearly transformed and noisy version of x,

y = Hx + n ∈ R^M,   (2)

where H ∈ R^{M×N} is a known matrix and n ∈ R^M is i.i.d. Gaussian noise, i.e., n ∼ N(0, σ²I), with a known noise variance σ² > 0. It follows that the probability density function (pdf) of the observation y for a specific value of x is given by

f_H(y; x) = (2πσ²)^{−M/2} exp( −(1/(2σ²)) ‖y − Hx‖₂² ).   (3)

We assume that

spark(H) > S,   (4)

where spark(H) denotes the minimum number of linearly dependent columns of H [21], [22]. Note that we also allow M < N (this case is relevant to CS methods as discussed in Section VII); however, condition (4) implies that M ≥ S. Condition (4) is weaker than the standard condition spark(H) > 2S [11]. Still, the standard condition is reasonable since otherwise one can find two different parameter vectors x₁, x₂ ∈ X_S for which f_H(y; x₁) = f_H(y; x₂) for all y, which implies that one cannot distinguish between x₁ and x₂ based on knowledge of y. Finally, we note that the assumption of i.i.d. noise in (2) does not imply a loss of generality. Indeed, consider an SLGM y = Hx + n where n is not i.i.d., having some positive definite (hence, nonsingular) covariance matrix C. Then, the “whitened observation” ỹ ≜ C^{−1/2} y [23], where C^{−1/2} is the inverse of the matrix square root C^{1/2} [24], can be written as ỹ = H̃x + ñ, with H̃ ≜ C^{−1/2} H and ñ ≜ C^{−1/2} n. It can be verified that H̃ also satisfies (4) and ñ is i.i.d. with variance σ² = 1, i.e., ñ ∼ N(0, I).

The task considered in this paper is estimation of the function value g(x) from the observation y = Hx + n, where the parameter function g(·): X_S → R^P is a known deterministic function. The estimate ĝ = ĝ(y) ∈ R^P is derived from y via a deterministic estimator ĝ(·): R^M → R^P. We allow ĝ ∈ R^P without constraining ĝ to be in g(X_S) ≜ {g(x) | x ∈ X_S}, even though it is known that x ∈ X_S. The reason for not enforcing the sparsity constraint ĝ ∈ g(X_S) is twofold: first, it would complicate the analysis; second, it would typically result in a worse achievable estimator performance (in terms of mean squared error) since it restricts the class of allowed estimators. In particular, it has been shown that a sparsity constraint can increase the worst-case risk of the resulting estimators significantly [25].

Estimation of the parameter vector x itself is a special case obtained by choosing the parameter function as the identity mapping, i.e., g(x) = x, which implies P = N. Again, we allow x̂ ∈ R^N and do not constrain x̂ to be in X_S.

In what follows, it will be convenient to denote the SLGM-based estimation problem by the triple E_SLGM ≜ (X_S, f_H(y; x), g(·)), where f_H(y; x) is given by (3) and will be referred to as the statistical model.
A related estimation problem is based on the linear Gaussian model (LGM) [17], [26]–[28], for which x ∈ R^N rather than x ∈ X_S; this problem will be denoted by E_LGM ≜ (R^N, f_H(y; x), g(·)). The SLGM shares with the LGM the observation model (2) and the statistical model (3); it is obtained from the LGM by restricting the parameter set R^N to the set of S-sparse vectors, X_S. For S = N, the SLGM reduces to the LGM. Another important special case of the SLGM is given by the SSNM, for which H = I, M = N, and

y = x + n,

where x ∈ X_S and n ∼ N(0, σ²I) with known variance σ² > 0. The SSNM-based estimation problem will be denoted as E_SSNM ≜ (X_S, f_I(y; x), g(·)).
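As a concrete illustration of the observation model (2) and the whitening argument above, the following sketch (NumPy; dimensions, matrices, and helper names chosen only for illustration) draws an S-sparse parameter vector, generates y = Hx + n, and whitens correlated noise so that the transformed model again has i.i.d. unit-variance noise. Note that the Cholesky factor is used here in place of the symmetric square root C^{1/2}; both choices whiten the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, S, sigma = 20, 50, 3, 0.5

# S-sparse parameter vector x in X_S
x = np.zeros(N)
support = rng.choice(N, size=S, replace=False)
x[support] = rng.normal(size=S)

# SLGM observation y = Hx + n with i.i.d. Gaussian noise, cf. (2)
H = rng.normal(size=(M, N)) / np.sqrt(M)
y = H @ x + sigma * rng.normal(size=M)

# Whitening: if n ~ N(0, C) with C positive definite, then applying C^{-1/2} (here: the
# inverse Cholesky factor) to y = Hx + n yields a model with i.i.d. N(0, 1) noise.
A = rng.normal(size=(M, M))
C = A @ A.T + 0.1 * np.eye(M)            # positive definite noise covariance
L = np.linalg.cholesky(C)                 # C = L L^T
noise = L @ rng.normal(size=(M, 5000))    # samples of correlated noise with covariance C
whitened = np.linalg.solve(L, noise)      # whitened noise samples
print(np.allclose(np.cov(whitened), np.eye(M), atol=0.2))  # empirical covariance close to I
```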
III. BASIC ELEMENTS OF MINIMUM VARIANCE ESTIMATION
Let us consider a general estimation problem E = (X, f(y; x), g(·)) based on an arbitrary parameter set X ⊆ R^N and an arbitrary statistical model f(y; x).¹ The general goal in the design of an estimator ĝ(·) is that ĝ(y) should be close to the true value g(x). A frequently used criterion for assessing the quality of an estimator ĝ(y) is the mean squared error (MSE) defined as

ε ≜ E_x{‖ĝ(y) − g(x)‖₂²} = ∫_{R^M} ‖ĝ(y) − g(x)‖₂² f(y; x) dy.

Here, E_x{·} denotes the expectation operation with respect to the pdf f(y; x); the subscript in E_x indicates the dependence on the parameter vector x parametrizing f(y; x). We will write ε(ĝ(·); x) to indicate the dependence of the MSE on the estimator ĝ(·) and the parameter vector x. In general, there does not exist an estimator ĝ(·) that minimizes the MSE simultaneously for all x ∈ X [30]. This follows from the fact that minimizing the MSE at a given parameter vector x₀ always yields zero MSE; this is achieved by the trivial estimator ĝ(y) ≡ g(x₀), which ignores the observation y.

A popular rationale for the design of good estimators is MVE. The MSE can be decomposed as

ε(ĝ(·); x) = ‖b(ĝ(·); x)‖₂² + v(ĝ(·); x),   (5)

with the bias b(ĝ(·); x) ≜ E_x{ĝ(y)} − g(x) and the variance v(ĝ(·); x) ≜ E_x{‖ĝ(y) − E_x{ĝ(y)}‖₂²}. In MVE, one fixes the bias on the entire parameter set X, i.e., one requires that

b(ĝ(·); x) = c(x), for all x ∈ X,   (6)

with a prescribed bias function c(·): X → R^P, and attempts to minimize the variance v(ĝ(·); x) among all estimators with the given bias function c(·). Fixing the bias function is equivalent to fixing the estimator’s mean function, i.e., E_x{ĝ(y)} = γ(x) for all x ∈ X, with the prescribed mean function γ(x) ≜ c(x) + g(x). Unbiased estimation is obtained as a special case for c(x) ≡ 0 or, equivalently, γ(x) ≡ g(x). Fixing the bias can be viewed as a kind of “regularization” of the set of considered estimators [18], [30], since it excludes useless estimators such as ĝ(y) ≡ g(x₀). Another justification for considering a fixed bias function is that under mild conditions, for a large number of i.i.d. observations {y_i}_{i∈[L]}, the bias term dominates in the decomposition (5). Thus, in order to achieve a small MSE in that case, an estimator has to be at least asymptotically unbiased, i.e., one has to require that, for a large number of observations, b(ĝ(·); x) ≈ 0 for all x ∈ X.

¹ This introductory section closely parallels [29, Section II]. We include it nevertheless because it constitutes an important basis for our subsequent discussion.
For an estimation problem E = (X, f(y; x), g(·)), a fixed parameter vector x₀ ∈ X, and a prescribed bias function c(·): X → R^P, we define the set of allowed estimators by

A(c(·), x₀) ≜ {ĝ(·) | v(ĝ(·); x₀) < ∞, b(ĝ(·); x) = c(x) ∀ x ∈ X}.

We call a bias function c(·) valid for the estimation problem E at x₀ ∈ X if the set A(c(·), x₀) is nonempty, which means that there is at least one estimator ĝ(·) that has finite variance at x₀ and whose bias function equals c(·), i.e., b(ĝ(·); x) = c(x) for all x ∈ X. For the SLGM, in particular, this definition trivially entails the following fact: If a bias function c(·): X_S → R^P is valid for S = N, it is also valid for S < N.

It follows from (5) that, for a fixed bias function c(·), minimizing the MSE ε(ĝ(·); x₀) is equivalent to minimizing the variance v(ĝ(·); x₀). Let us denote the minimum (strictly speaking, infimum) variance at x₀ for bias function c(·) by

M(c(·), x₀) ≜ inf_{ĝ(·)∈A(c(·),x₀)} v(ĝ(·); x₀).   (7)

If A(c(·), x₀) is empty, i.e., if c(·) is not valid, we set M(c(·), x₀) ≜ ∞. Any estimator ĝ^{(c(·),x₀)}(·) ∈ A(c(·), x₀) that achieves the infimum in (7), i.e., for which

v(ĝ^{(c(·),x₀)}(·); x₀) = M(c(·), x₀),   (8)

is called an LMV estimator at x₀ for bias function c(·) [15], [16], [18]. The corresponding minimum variance M(c(·), x₀) is called the minimum achievable variance at x₀ for bias function c(·). The minimization problem defined by (7) is referred to as a minimum variance problem (MVP). From its definition in (7), it follows that M(c(·), x₀) is a lower bound on the variance at x₀ of any estimator with bias function c(·), i.e.,

ĝ(·) ∈ A(c(·), x₀) ⇒ v(ĝ(·); x₀) ≥ M(c(·), x₀).

This is sometimes referred to as the Barankin bound; it is the tightest possible lower bound on the variance at x₀ of estimators with bias function c(·).

If, for a prescribed bias function c(·), there exists an estimator that is the LMV estimator simultaneously at all x₀ ∈ X, then that estimator is termed the uniformly minimum variance (UMV) estimator for bias function c(·) [15], [16], [18]. For the SLGM, a UMV estimator does not exist in general [13], [31]. A noteworthy exception is the SLGM where H has full column rank, g(x) = x, S = N, and c(·) ≡ 0; here, it is well known [18], [17, Thm. 4.1] that the least squares estimator, x̂ = H†y, is the UMV estimator.

Finally, let ĝ_k(·) ≜ (ĝ(·))_k and c_k(·) ≜ (c(·))_k. The variance of the vector estimator ĝ(·) can be decomposed as

v(ĝ(·); x₀) = Σ_{k∈[P]} v(ĝ_k(·); x₀),   (9)

where v(ĝ_k(·); x₀) ≜ E_{x₀}{[ĝ_k(y) − E_{x₀}{ĝ_k(y)}]²} is the variance of the kth estimator component ĝ_k(·). Furthermore, ĝ(·) ∈ A(c(·), x₀) if and only if ĝ_k(·) ∈ A(c_k(·), x₀) for all k ∈ [P]. This shows that the MVP (7) can be reduced to P separate scalar MVPs

M(c_k(·), x₀) ≜ inf_{ĝ_k(·)∈A(c_k(·),x₀)} v(ĝ_k(·); x₀), k ∈ [P],

each requiring the optimization of a single scalar component ĝ_k(·) of ĝ(·). Therefore, without loss of generality, we will hereafter assume that the parameter function is scalar-valued, i.e., P = 1, and write it simply as g(x).
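The decomposition (5) and the componentwise variance sum (9) are easy to verify numerically. A minimal Monte Carlo check (NumPy; the hard-thresholding estimator and all dimensions are chosen only for illustration, under the SSNM H = I):

```python
import numpy as np

rng = np.random.default_rng(1)
N, S, sigma, n_mc = 8, 2, 1.0, 200_000

x = np.zeros(N); x[:S] = [1.5, -0.8]          # true S-sparse parameter (SSNM: H = I, g(x) = x)
T = 1.0                                        # hard-thresholding level, chosen arbitrarily

y = x[None, :] + sigma * rng.normal(size=(n_mc, N))   # n_mc observations y = x + n
x_hat = np.where(np.abs(y) >= T, y, 0.0)               # estimator: keep entries above the threshold

mean_est = x_hat.mean(axis=0)                 # Monte Carlo estimate of E_x{ x_hat(y) }
bias_sq = np.sum((mean_est - x) ** 2)         # ||b(x_hat(.); x)||^2
variance = np.sum(x_hat.var(axis=0))          # v(x_hat(.); x), summed over components as in (9)
mse = np.mean(np.sum((x_hat - x) ** 2, axis=1))

print(mse, bias_sq + variance)                # agree up to Monte Carlo error, cf. (5)
```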
IV. RKHS FUNDAMENTALS

As mentioned in Section I, the existing variance bounds for the SLGM are not maximally tight. Using the theory of RKHSs will allow us to derive variance bounds which are tighter than the existing bounds. For the SSNM (see Section VIII), the RKHS approach even yields a precise characterization of the minimum achievable variance (Barankin bound) and of the accompanying LMV estimator. In this section, we present a review (similar in part to [29, Section III]) of some fundamentals of the theory of RKHSs and of the application of RKHSs to MVE. These fundamentals will provide a framework for our analysis of the SLGM in later sections.
A. Basic Facts
An RKHS is associated with a kernel function R(·,·): X × X → R, where X is an arbitrary set. The defining properties of a kernel function are (i) symmetry, i.e., R(x₁, x₂) = R(x₂, x₁) for all x₁, x₂ ∈ X, and (ii) positive semidefiniteness in the sense that, for every finite set {x₁, ..., x_D} ⊆ X, the matrix R ∈ R^{D×D} with entries R_{m,n} = R(x_m, x_n) is positive semidefinite. A fundamental result [14, p. 344] states that for any such kernel function R, there exists an RKHS H(R), which is a Hilbert space equipped with an inner product ⟨·,·⟩_{H(R)} and satisfying the following two properties:

• For any x ∈ X, R(·, x) ∈ H(R) (here, R(·, x) denotes the function f_x(x′) = R(x′, x) for fixed x ∈ X).
• For any function f(·) ∈ H(R) and any x ∈ X,

⟨f(·), R(·, x)⟩_{H(R)} = f(x).   (10)

The “reproducing property” (10) defines the inner product ⟨f₁, f₂⟩_{H(R)} for all f₁(·), f₂(·) ∈ H(R), because any f(·) ∈ H(R) can be expanded into the set of functions {R(·, x)}_{x∈X}. The induced norm is ‖f‖_{H(R)} = (⟨f, f⟩_{H(R)})^{1/2}.

For later use, we mention the following result [14, p. 351]. Consider a kernel function R(·,·): X × X → R, its restriction R′(·,·): X′ × X′ → R to a given subdomain X′ × X′ with X′ ⊆ X, and the corresponding RKHSs H(R) and H(R′). Then, a function f′(·): X′ → R belongs to H(R′) if and only if there exists a function f(·): X → R belonging to H(R) whose restriction to X′, denoted f(·)|_{X′}, equals f′(·). Thus, H(R′) equals the set of functions that is obtained by restricting each function f(·) ∈ H(R) to the subdomain X′, i.e.,

H(R′) = {f′(·) = f(·)|_{X′} | f(·) ∈ H(R)}.   (11)

Furthermore [14, p. 351], the norm of a function f′(·) ∈ H(R′) is equal to the minimum of the norms of all functions f(·) ∈ H(R) whose restriction to X′ equals f′(·), i.e.,

‖f′(·)‖_{H(R′)} = min_{f(·)∈H(R): f(·)|_{X′}=f′(·)} ‖f(·)‖_{H(R)}.   (12)
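The two defining kernel properties are easy to illustrate numerically. The following sketch (NumPy; matrix, noise level, and evaluation points chosen arbitrarily) evaluates a Gaussian-model kernel of the form later given in (18)/(24) on a finite set of parameter vectors and checks that the resulting kernel matrix is symmetric and positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, sigma = 10, 6, 0.7
H = rng.normal(size=(M, N))
x0 = np.zeros(N)

def kernel(x1, x2):
    # R_{x0}(x1, x2) = exp( (1/sigma^2) (x1 - x0)^T H^T H (x2 - x0) ), cf. (18), (24)
    return np.exp(((x1 - x0) @ H.T @ H @ (x2 - x0)) / sigma**2)

# finite set {x_1, ..., x_D} of parameter vectors
points = [0.1 * rng.normal(size=N) for _ in range(6)]
R = np.array([[kernel(a, b) for b in points] for a in points])

print(np.allclose(R, R.T))                      # symmetry
print(np.linalg.eigvalsh(R).min() >= -1e-8)     # positive semidefiniteness (up to numerical tolerance)
```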
B. The RKHS Approach to MVE

RKHS theory provides a powerful mathematical framework for MVE [15]. Given an arbitrary estimation problem E = (X, f(y; x), g(·)) and a parameter vector x₀ ∈ X for which f(y; x₀) ≠ 0, a kernel function R_{E,x₀}(·,·) and, in turn, an RKHS H_{E,x₀} can be defined as follows. We first define the likelihood ratio

ρ_{x₀}(y, x) ≜ f(y; x) / f(y; x₀),   (13)

which is considered as a random variable (since it is a function of the random vector y) that is parametrized by x ∈ X. Next, we define the Hilbert space L_{E,x₀} as the closure of the linear span of the set of random variables {ρ_{x₀}(y, x)}_{x∈X}.² The inner product in L_{E,x₀} is defined by

⟨ρ_{x₀}(y, x₁), ρ_{x₀}(y, x₂)⟩_RV ≜ E_{x₀}{ρ_{x₀}(y, x₁) ρ_{x₀}(y, x₂)} = E_{x₀}{ f(y; x₁) f(y; x₂) / f²(y; x₀) }.

(It can be shown that it is sufficient to define ⟨·,·⟩_RV for the random variables {ρ_{x₀}(y, x)}_{x∈X} [15].) From now on, we consider only estimation problems E = (X, f(y; x), g(·)) such that ⟨ρ_{x₀}(y, x₁), ρ_{x₀}(y, x₂)⟩_RV < ∞ for all x₁, x₂ ∈ X, or, equivalently,

E_{x₀}{ f(y; x₁) f(y; x₂) / f²(y; x₀) } < ∞, for all x₁, x₂ ∈ X.

Thus, ⟨·,·⟩_RV is well defined. We can interpret the inner product ⟨·,·⟩_RV: L_{E,x₀} × L_{E,x₀} → R as a kernel function R_{E,x₀}(·,·): X × X → R:

R_{E,x₀}(x₁, x₂) ≜ ⟨ρ_{x₀}(y, x₁), ρ_{x₀}(y, x₂)⟩_RV = E_{x₀}{ f(y; x₁) f(y; x₂) / f²(y; x₀) }.   (14)

The RKHS associated with the estimation problem E = (X, f(y; x), g(·)) and the parameter vector x₀ ∈ X is then defined to be the RKHS induced by the kernel function R_{E,x₀}(·,·). We will denote this RKHS as H_{E,x₀}, i.e., H_{E,x₀} ≜ H(R_{E,x₀}). As shown in [15], the two Hilbert spaces L_{E,x₀} and H_{E,x₀} are isometric, and a specific congruence, i.e., isometric mapping J[·]: H_{E,x₀} → L_{E,x₀}, is given by J[R_{E,x₀}(·, x)] = ρ_{x₀}(·, x).

A fundamental relation of the RKHS H_{E,x₀} with MVE is established by the following central result:

Theorem IV.1 ([15], [16]). Consider an estimation problem E = (X, f(y; x), g(·)), a fixed parameter vector x₀ ∈ X, and a prescribed bias function c(·): X → R, corresponding to the prescribed mean function γ(·) = c(·) + g(·). Then, the following holds:

• The bias function c(·) is valid for E at x₀ if and only if γ(·) belongs to the RKHS H_{E,x₀}.
• If the bias function c(·) is valid for E at x₀, the minimum achievable variance at x₀ (Barankin bound) is given by

M(c(·), x₀) = ‖γ(·)‖²_{H_{E,x₀}} − γ²(x₀),   (15)

and the LMV estimator at x₀ is given by ĝ^{(c(·),x₀)}(·) = J[γ(·)].

Based on Theorem IV.1, the following remarks can be made:

• The RKHS H_{E,x₀} can be interpreted as the set of the mean functions γ(x) = E_x{ĝ(y)} of all estimators ĝ(·) with a finite variance at x₀, i.e., v(ĝ(·); x₀) < ∞.
• The MVP (7) can be reduced to the computation of the squared norm ‖γ(·)‖²_{H_{E,x₀}} and isometric image J[γ(·)] of the prescribed mean function γ(·), viewed as an element of the RKHS H_{E,x₀}. This theoretical result is especially helpful if a simple characterization of H_{E,x₀} is available. A simple characterization in the sense of [16] is given by an orthonormal basis for H_{E,x₀} such that the inner products of γ(·) with the basis functions can be computed easily.
• If a simple characterization of H_{E,x₀} is not available, we can still use (15) to establish a large class of lower bounds on the minimum achievable variance M(c(·), x₀). Indeed, let U ⊆ H_{E,x₀} be an arbitrary subspace of H_{E,x₀} and let P_U γ(·) denote the orthogonal projection of γ(·) onto U. We then have ‖γ(·)‖²_{H_{E,x₀}} ≥ ‖P_U γ(·)‖²_{H_{E,x₀}} [32, Chapter 4] and thus, from (15),

M(c(·), x₀) ≥ ‖P_U γ(·)‖²_{H_{E,x₀}} − γ²(x₀).   (16)

Some well-known lower bounds on the estimator variance, such as the Cramér–Rao and Bhattacharya bounds, are obtained from (16) by specific choices of the subspace U [29].

² For a detailed discussion of the concepts of closure, inner product, orthonormal basis, and linear span in the context of abstract Hilbert spaces, see [15] and [32].
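The kernel definition (14) can be checked directly by Monte Carlo simulation: for the Gaussian model (3), the expectation E_{x₀}{ρ_{x₀}(y, x₁) ρ_{x₀}(y, x₂)} should reproduce the closed form derived in the next subsection as (18). A small sketch under these assumptions (NumPy; all sizes chosen arbitrarily, with x₁, x₂ kept close to x₀ so that the Monte Carlo average is stable):

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, sigma, n_mc = 5, 4, 1.0, 400_000
H = rng.normal(size=(M, N))

x0 = np.zeros(N)
x1 = 0.1 * rng.normal(size=N)
x2 = 0.1 * rng.normal(size=N)

def pdf(y, x):
    # Gaussian pdf f_H(y; x) of the LGM/SLGM, cf. (3)
    r = y - H @ x
    return np.exp(-np.sum(r**2, axis=-1) / (2 * sigma**2)) / (2 * np.pi * sigma**2) ** (M / 2)

y = H @ x0 + sigma * rng.normal(size=(n_mc, M))          # y ~ f_H(. ; x0)
rho1 = pdf(y, x1) / pdf(y, x0)                            # likelihood ratios rho_{x0}(y, x), cf. (13)
rho2 = pdf(y, x2) / pdf(y, x0)

mc_kernel = np.mean(rho1 * rho2)                          # E_{x0}{rho1 * rho2}, cf. (14)
closed_form = np.exp((x1 - x0) @ H.T @ H @ (x2 - x0) / sigma**2)   # cf. (18)
print(mc_kernel, closed_form)                             # agree up to Monte Carlo error
```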
C. The RKHS Associated with the LGM

In our analysis of the SLGM, the RKHS associated with the LGM will play an important role. Consider X = R^N and f(y; x) = f_H(y; x) as defined in (3), where the system matrix H ∈ R^{M×N} is not required to satisfy condition (4). The likelihood ratio (13) for f(y; x) = f_H(y; x) is obtained as

ρ_{LGM,x₀}(y, x) = f_H(y; x) / f_H(y; x₀) = exp( (1/σ²) y^T H(x − x₀) − (1/(2σ²)) (‖Hx‖₂² − ‖Hx₀‖₂²) ).   (17)

Furthermore, from (14), the kernel associated with the LGM follows as R_{LGM,x₀}(·,·): R^N × R^N → R;

R_{LGM,x₀}(x₁, x₂) = exp( (1/σ²) (x₁ − x₀)^T H^T H (x₂ − x₀) ).   (18)

Let D ≜ rank(H). We will use the thin singular value decomposition (SVD) of H, i.e., H = UΣV^T, where U ∈ R^{M×D} with U^T U = I, V ∈ R^{N×D} with V^T V = I, and Σ ∈ R^{D×D} is a diagonal matrix with positive diagonal entries (Σ)_{k,k} > 0 [20]. The next theorem has been shown in [31, Sec. 5.2].

Theorem IV.2.
Let H_{LGM,x₀} denote the RKHS associated with the LGM-based estimation problem E_LGM = (R^N, f_H(y; x), g(·)) and the parameter vector x₀ ∈ R^N, and let H̃ ≜ VΣ^{−1} ∈ R^{N×D}. Then, the following holds:

• Any function f(·) ∈ H_{LGM,x₀} is invariant to translations by vectors x′ ∈ R^N belonging to the null space of H, i.e., f(x) = f(x + x′) for all x′ ∈ ker(H) and x ∈ R^N.
• The RKHS H_{LGM,x₀} is isometric to the RKHS H(R_G) whose kernel R_G(·,·): R^D × R^D → R is given by R_G(z₁, z₂) = exp(z₁^T z₂), z₁, z₂ ∈ R^D. A congruence from H(R_G) to H_{LGM,x₀} is constituted by the mapping K_G[·]: H(R_G) → H_{LGM,x₀} given by

K_G[f(·)] = f̃(x) ≜ f( (1/σ) H̃†x ) exp( (1/(2σ²)) ‖Hx₀‖₂² − (1/σ²) x^T H^T H x₀ ), x ∈ R^N, for all f(·) ∈ H(R_G),   (19)

and a congruence from H_{LGM,x₀} to H(R_G) is constituted by the inverse mapping K_G^{−1}[·]: H_{LGM,x₀} → H(R_G) given by

K_G^{−1}[f̃(·)] = f(z) ≜ f̃(σH̃z) exp( −(1/(2σ²)) ‖Hx₀‖₂² + (1/σ) z^T H̃†x₀ ), z ∈ R^D, for all f̃(·) ∈ H_{LGM,x₀}.   (20)

The congruence K_G reduces the characterization of the RKHS H_{LGM,x₀} to that of the RKHS H(R_G). A simple characterization (in the sense of an orthonormal basis) of the RKHS H(R_G) can be obtained by noting that the kernel R_G(·,·) is infinitely often differentiable and applying the results for RKHSs with differentiable kernels presented in [33]. This leads to the following theorem [31], [33].

Theorem IV.3.

1) For any p ∈ Z₊^D, the RKHS H(R_G) contains the function r^{(p)}(·): R^D → R given by

r^{(p)}(z) ≜ (1/√p!) ∂^p R_G(z, z₂)/∂z₂^p |_{z₂=0} = (1/√p!) z^p.

2) The inner product of an arbitrary function f(·) ∈ H(R_G) with r^{(p)}(·) is given by

⟨f(·), r^{(p)}(·)⟩_{H(R_G)} = (1/√p!) ∂^p f(z)/∂z^p |_{z=0}.   (21)

3) The set of functions {r^{(p)}(·)}_{p∈Z₊^D} is an orthonormal basis for H(R_G).

In particular, because of result 3), a function f(·): R^D → R belongs to H(R_G) if and only if it can be written pointwise as

f(z) = Σ_{p∈Z₊^D} a[p] r^{(p)}(z) = Σ_{p∈Z₊^D} (a[p]/√p!) z^p,   (22)

with a unique coefficient sequence a[p] ∈ ℓ²(Z₊^D). The coefficient a[p] is given by (21), i.e.,

a[p] = (1/√p!) ∂^p f(z)/∂z^p |_{z=0}.   (23)

Expression (22) implies that any f(z) ∈ H(R_G) is infinitely often differentiable and, because of (23), fully determined by its partial derivatives at z = 0, i.e., ∂^p f(z)/∂z^p |_{z=0} for p ∈ Z₊^D. Furthermore, since according to (19) any function f̃(·) ∈ H_{LGM,x₀} is the image of a function f(·) ∈ H(R_G) under the congruence K_G[·], it follows that also any f̃(·) ∈ H_{LGM,x₀} is infinitely often differentiable and fully determined by its partial derivatives at x = 0, i.e., ∂^p f̃(x)/∂x^p |_{x=0} for p ∈ Z₊^N. (The latter fact holds because the partial derivatives of f̃(·) uniquely determine the partial derivatives of f(·) = K_G^{−1}[f̃(·)] via (20) and the generalized Leibniz rule for the differentiation of a product of functions.) This agrees with the well-known result [34, Lemma 2.8] that for a statistical model of the exponential family type, the mean function of any finite-variance estimator is analytic, and thus fully determined by its partial derivatives at zero.
(To appreciate the connection with the mean function of finite-variance estimators, recall from the discussion following Theorem IV.1 that the elements of H_{LGM,x₀} are the mean functions of all finite-variance estimators for the LGM, which is a special case of an exponential family.)
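Theorem IV.3 states that the monomials r^{(p)}(z) = z^p/√(p!) form an orthonormal basis of H(R_G); equivalently, the kernel expands as R_G(z₁, z₂) = Σ_p r^{(p)}(z₁) r^{(p)}(z₂). The following sketch (NumPy; dimension and truncation order chosen arbitrarily) checks this expansion numerically by truncating the sum over multi-indices:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(4)
D, max_order = 3, 12                      # truncate multi-indices to p_l <= max_order per component

z1 = 0.6 * rng.normal(size=D)
z2 = 0.6 * rng.normal(size=D)

def r_p(z, p):
    # orthonormal basis function r^{(p)}(z) = z^p / sqrt(p!), cf. Theorem IV.3
    return np.prod(z ** np.array(p)) / math.sqrt(math.prod(math.factorial(pl) for pl in p))

truncated = sum(r_p(z1, p) * r_p(z2, p)
                for p in itertools.product(range(max_order + 1), repeat=D))
print(truncated, np.exp(z1 @ z2))         # truncated ONB expansion vs. R_G(z1, z2) = exp(z1^T z2)
```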
V. RKHS-BASED ANALYSIS OF MINIMUM VARIANCE ESTIMATION FOR THE SLGM

In this section, we apply the RKHS framework to the SLGM-based estimation problem E_SLGM = (X_S, f_H(y; x), g(·)). Thus, the parameter set is the set of S-sparse vectors, X = X_S ⊆ R^N in (1), and the statistical model is given by f(y; x) = f_H(y; x) in (3). More specifically, we consider SLGM-based MVE at a given parameter vector x₀ ∈ X_S, for a prescribed bias function c(·): X_S → R. We recall that the set of allowed estimators, A(c(·), x₀), consists of all estimators ĝ(·) with finite variance at x₀, i.e., v(ĝ(·); x₀) < ∞, whose bias function equals c(·), i.e., b(ĝ(·); x) = c(x) for all x ∈ X_S.

Our results can be summarized as follows. We characterize the RKHS associated with the SLGM and employ it to analyze SLGM-based MVE. Using this characterization together with Theorem IV.1, we provide conditions on the prescribed bias function c(·) such that the minimum achievable variance is finite, i.e., we characterize the set of valid bias functions (cf. Section III). Furthermore, we present expressions of the minimum achievable variance (Barankin bound) M_SLGM(c(·), x₀) and of the associated LMV estimator ĝ^{(c(·),x₀)}(·) for an arbitrary valid bias function c(·). Since these expressions are difficult to evaluate in general, we finally derive lower bounds on the minimum achievable variance. These lower bounds are also lower bounds on the variance of any estimator with the prescribed bias function.
A. The RKHS Associated with the SLGM

Let us consider the SLGM-based estimation problem E_SLGM = (X_S, f_H(y; x), g(·)) and the corresponding LGM-based estimation problem E_LGM = (R^N, f_H(y; x), g(·)) with the same system matrix H ∈ R^{M×N} satisfying condition (4) and with the same noise variance σ². For an S-sparse parameter vector x₀ ∈ X_S, let H_{SLGM,x₀} and H_{LGM,x₀} denote the RKHSs associated with the estimation problems E_SLGM and E_LGM, respectively. Using (14) and (3), the kernel underlying H_{SLGM,x₀} is obtained as R_{SLGM,x₀}(·,·): X_S × X_S → R;

R_{SLGM,x₀}(x₁, x₂) = exp( (1/σ²) (x₁ − x₀)^T H^T H (x₂ − x₀) ).   (24)

Comparing with the kernel R_{LGM,x₀}(·,·) underlying H_{LGM,x₀}, which was presented in (18), we conclude that R_{SLGM,x₀}(·,·) is the restriction of R_{LGM,x₀}(·,·) to the subdomain X_S × X_S ⊆ R^N × R^N.

The characterization of H_{LGM,x₀} provided by Theorems IV.2 and IV.3 is also relevant to H_{SLGM,x₀}. This is due to the following application of the “RKHS restriction result” in Section IV-A (see (11) and (12)):

Corollary V.1.
The RKHS H_{SLGM,x₀} consists of the restrictions of all functions f(·): R^N → R contained in H_{LGM,x₀} to the subdomain X_S ⊆ R^N, i.e.,

H_{SLGM,x₀} = {f′(·) = f(·)|_{X_S} | f(·) ∈ H_{LGM,x₀}}.

Furthermore, the norm of a function f′(·) ∈ H_{SLGM,x₀} is equal to the minimum of the norms of all functions f(·) ∈ H_{LGM,x₀} whose restriction to X_S equals f′(·), i.e.,

‖f′(·)‖_{H_{SLGM,x₀}} = min_{f(·)∈H_{LGM,x₀}: f(·)|_{X_S}=f′(·)} ‖f(·)‖_{H_{LGM,x₀}}.   (25)

An immediate consequence of Corollary V.1 is the obvious fact that the minimum achievable variance for the SLGM can never exceed that for the LGM (if the prescribed bias function for the SLGM is the restriction of the prescribed bias function for the LGM).³ Indeed, letting c(·): R^N → R be the prescribed bias function for the LGM and γ(·) = c(·) + g(·) the corresponding mean function, and recalling that x₀ ∈ X_S, we have

M_SLGM(c(·)|_{X_S}, x₀) (15)= ‖γ(·)|_{X_S}‖²_{H_{SLGM,x₀}} − γ²(x₀) (25)≤ ‖γ(·)‖²_{H_{LGM,x₀}} − γ²(x₀) (15)= M_LGM(c(·), x₀).

Thus, in the precise sense of Corollary V.1, H_{SLGM,x₀} is the restriction of H_{LGM,x₀} to the set X_S of S-sparse parameter vectors, and the characterization of H_{LGM,x₀} provided by Theorems IV.2 and IV.3 can also be used for a characterization of H_{SLGM,x₀}. In what follows, we will employ this principle for developing an RKHS-based analysis of MVE for the SLGM. Proofs of the presented results can be found in [31]. As before, we will use the thin SVD of the system matrix H, i.e., H = UΣV^T, as well as the shorthand notations H̃ = VΣ^{−1} and D = rank(H).
B. The Class of Valid Bias Functions

The class of valid bias functions for the SLGM-based estimation problem E_SLGM = (X_S, f_H(y; x), g(·)) at x₀ ∈ X_S is characterized by the following result [31, Thm. 5.3.1]:

³ Indeed, prescribing the bias for all x ∈ R^N (as is done within the LGM), instead of prescribing it only for the sparse vectors x ∈ X_S (as is done within the SLGM), can only result in a higher (or equal) minimum achievable variance.

Theorem V.2.
A bias function c(·): X_S → R is valid for E_SLGM = (X_S, f_H(y; x), g(·)) at x₀ ∈ X_S if and only if it can be expressed as

c(x) = exp( (1/(2σ²)) ‖Hx₀‖₂² − (1/σ²) x^T H^T H x₀ ) Σ_{p∈Z₊^D} (a[p]/√p!) ( (1/σ) H̃†x )^p − g(x), x ∈ X_S,   (26)

with some coefficient sequence a[p] ∈ ℓ²(Z₊^D).

Theorem V.2 implies that the mean function γ(·) = c(·) + g(·) corresponding to a bias function c(·) that is valid for E_SLGM at x₀ ∈ X_S is of the form

γ(x) = exp( (1/(2σ²)) ‖Hx₀‖₂² − (1/σ²) x^T H^T H x₀ ) Σ_{p∈Z₊^D} (a[p]/√p!) ( (1/σ) H̃†x )^p, x ∈ X_S,   (27)

with some coefficient sequence a[p] ∈ ℓ²(Z₊^D). The function on the right-hand side in (27) is analytic on the domain X_S in the sense that it can be locally represented at any point x ∈ X_S by a convergent power series.⁴ Thus, in particular, the mean function γ(x) = E_x{ĝ(y)} of any finite-variance estimator ĝ(y) is necessarily an “analytic” function. Again, this agrees with the general result about the mean function of estimators for exponential families presented in [34, Lemma 2.8]. (Note that the statistical model of the SLGM is a special case of an exponential family.)

In the special case where g(x) = x_k for some k ∈ [N], a sufficient condition on a bias function to be valid is stated as follows [31, Thm. 5.3.4]:

Theorem V.3.
The function

c(x) = exp( (1/σ) z_c^T H̃†x ) Σ_{p∈Z₊^D} (a[p]/p!) ( (1/σ) H̃†x )^p − x_k, x ∈ X_S,   (28)

with an arbitrary vector z_c ∈ R^D and coefficients a[p] satisfying |a[p]| ≤ C^{|p|} with an arbitrary constant C ∈ R₊, is a valid bias function for E_SLGM = (X_S, f_H(y; x), g(x) = x_k) at any x₀ ∈ X_S. In particular, for H = I, the unbiased case (i.e., c(x) ≡ 0) is obtained for z_c = 0, a[e_k] = σ, and a[p] = 0 for all other p ∈ Z₊^D.

Note that the difference of the factors in (28) compared to the factors in (26) (i.e., a[p]/p! instead of a[p]/√p!) is in accordance with the different condition on the coefficient sequence a[p] (i.e., |a[p]| ≤ C^{|p|} instead of a[p] ∈ ℓ²(Z₊^D)).
C. Minimum Achievable Variance (Barankin Bound) and LMV Estimator

Let us consider the MVP (7) at a given parameter vector x₀ ∈ X_S for an SLGM-based estimation problem E_SLGM ≜ (X_S, f_H(y; x), g(·)) and for a prescribed bias function c(·): X_S → R, which is known to be valid. Then, the minimum achievable variance (Barankin bound) at x₀, denoted M_SLGM(c(·), x₀) (cf. (7)), and the corresponding LMV estimator ĝ^{(c(·),x₀)}(·) (cf. (8)) are characterized by the following theorem [31, Thm. 5.3.1].

⁴ Note that a function with domain X_S, with S < N, cannot be analytic in the conventional sense since the domain of an analytic function has to be open by definition [19, Definition 2.2.1].

Theorem V.4.
Consider an SLGM-based estimation problem E_SLGM = (X_S, f_H(y; x), g(·)) and a valid prescribed bias function c(·): X_S → R. Then:

1) The minimum achievable variance at x₀ ∈ X_S is given by

M_SLGM(c(·), x₀) = min_{a[·]∈C(c)} ‖a[·]‖²_{ℓ²(Z₊^D)} − γ²(x₀),   (29)

where γ(·) = c(·) + g(·), ‖a[·]‖²_{ℓ²(Z₊^D)} ≜ Σ_{p∈Z₊^D} a²[p], and C(c) ⊆ ℓ²(Z₊^D) denotes the set of coefficient sequences a[p] ∈ ℓ²(Z₊^D) that are consistent with (26).

2) The function ĝ(·): R^M → R given by

ĝ(y) = exp( −(1/(2σ²)) ‖Hx₀‖₂² ) Σ_{p∈Z₊^D} (a[p]/√p!) χ_p(y),   (30)

with an arbitrary coefficient sequence a[·] ∈ C(c) and

χ_p(y) ≜ ∂^p [ ρ_{LGM,x₀}(y, σH̃z) exp( (1/σ) x₀^T H^T H H̃z ) ] / ∂z^p |_{z=0},

where ρ_{LGM,x₀}(y, x) is given by (17), is an allowed estimator at x₀ for c(·), i.e., ĝ(·) ∈ A(c(·), x₀).

3) The LMV estimator at x₀, ĝ^{(c(·),x₀)}(·), is given by (30) using the specific coefficient sequence a[p] = argmin_{a[·]∈C(c)} ‖a[·]‖²_{ℓ²(Z₊^D)}.

The kernel R_{SLGM,x₀}(·,·) given by (24) is pointwise continuous with respect to the parameter x₀, i.e., lim_{x₀′→x₀} R_{SLGM,x₀′}(x₁, x₂) = R_{SLGM,x₀}(x₁, x₂) for all x₁, x₂, x₀ ∈ X_S. Therefore, applying [31, Thm. 4.3.6] or [29, Thm. IV.6] to the SLGM yields the following result.

Corollary V.5.
Consider the SLGM with parameter function g(x) = x_k and a prescribed bias function c(·): X_S → R that is valid for E_SLGM = (X_S, f_H(y; x), g(x) = x_k) at each parameter vector x₀ ∈ X_S. Then, if c(·) is continuous, the minimum achievable variance M_SLGM(c(·), x₀) is a lower semi-continuous function of x₀.⁵

From Corollary V.5, we can conclude that the sparse CRB derived in [11] is not tight, i.e., it is not equal to the minimum achievable variance M_SLGM(c(·), x₀). Indeed, the sparse CRB is in general a strictly upper semi-continuous function of the parameter vector x₀, whereas the minimum achievable variance M_SLGM(c(·), x₀) is lower semi-continuous according to Corollary V.5. Since a function cannot be simultaneously strictly upper semi-continuous and lower semi-continuous, the sparse CRB cannot be equal to M_SLGM(c(·), x₀) in general.
VI. LOWER VARIANCE BOUNDS FOR THE SLGM

While Theorem V.4 provides a mathematically complete characterization of the minimum achievable variance and the LMV estimator, the corresponding expressions are somewhat difficult to evaluate in general. Therefore, we will next derive lower bounds on the minimum achievable variance M_SLGM(c(·), x₀) for the estimation problem E_SLGM = (X_S, f_H(y; x), g(x) = x_k) with some k ∈ [N] and for a prescribed bias function c(·). These bounds are easier to evaluate. As mentioned before, they are also lower bounds on the variance of any estimator having the prescribed bias function. Our assumption that g(x) = x_k is no restriction because, according to [31, Thm. 2.3.1], the MVP for a given parameter function g(x) and prescribed bias function c(x) is equivalent to the MVP for parameter function g′(x) = x_k and prescribed bias function c′(x) = c(x) + g(x) − x_k. In particular, if c′(x) is valid for the MVP with parameter function g′(x) = x_k, then c(x) = c′(x) − g(x) + x_k is valid for the MVP with parameter function g(x).⁶ Therefore, any MVP can be reduced to an equivalent MVP with g(x) = x_k and an appropriately modified prescribed bias function.

We assume that the prescribed bias function c(·) is valid for E_SLGM = (X_S, f_H(y; x), g(x) = x_k). This validity assumption is no real restriction either, since our lower bounds are finite and therefore are lower bounds also if M_SLGM(c(·), x₀) = ∞, which, by our definition in Section III, is the case if c(·) is not valid.

The lower bounds to be presented are based on the generic lower bound (16), i.e., they are of the form

M_SLGM(c(·), x₀) ≥ ‖P_U γ(·)‖²_{H_{SLGM,x₀}} − γ²(x₀),   (31)

for some subspace U ⊆ H_{SLGM,x₀}. Here, the prescribed mean function γ(·): X_S → R, given by γ(x) = c(x) + x_k, is an element of H_{SLGM,x₀} since c(·) is assumed valid (recall Theorem IV.1).

⁵ A definition of lower semi-continuity can be found in [35].

A. The Sparse CRB
The first bound is an adaptation of the CRB [17], [18], [27], [29] to the sparse setting and has been previously derived in a slightly different form in [11].
Theorem VI.1.
Consider the estimation problem E_SLGM = (X_S, f_H(y; x), g(x) = x_k) with a system matrix H ∈ R^{M×N} satisfying (4). Let x₀ ∈ X_S. If the prescribed bias function c(·): X_S → R is such that the partial derivatives ∂c(x)/∂x_l |_{x=x₀} exist for all l ∈ [N], then

M_SLGM(c(·), x₀) ≥ σ² b^T (H^T H)† b,   if ‖x₀‖₀ ≤ S − 1,
M_SLGM(c(·), x₀) ≥ σ² b_{x₀}^T (H_{x₀}^T H_{x₀})† b_{x₀},   if ‖x₀‖₀ = S.   (32)
Here, in the case ‖x₀‖₀ < S, b ∈ R^N is given by b_l ≜ δ_{k,l} + ∂c(x)/∂x_l |_{x=x₀}, l ∈ [N], and in the case ‖x₀‖₀ = S, b_{x₀} ∈ R^S and H_{x₀} ∈ R^{M×S} consist of those entries of b and columns of H, respectively, that are indexed by supp(x₀) ≡ {k₁, ..., k_S}, i.e., (b_{x₀})_i = b_{k_i} and (H_{x₀})_{m,i} = (H)_{m,k_i}, i ∈ [S].

⁶ Indeed, if c′(x) is valid at x₀ for the MVP with parameter function x_k, there exists a finite-variance estimator ĝ(·) with mean function E_x{ĝ(y)} = c′(x) + x_k. For the MVP with parameter function g(·), that estimator ĝ(·) has the bias function b(ĝ(·), x) = E_x{ĝ(y)} − g(x) = c′(x) + x_k − g(x) = c(x). Thus, there exists a finite-variance estimator with bias function c(x) = c′(x) − g(x) + x_k, which implies that the bias function c(·) is valid for the MVP with parameter function g(·).

A proof of this theorem is given in [31, Thm. 5.4.1]. There, it is shown that the bound (32) for ‖x₀‖₀ < S is obtained from the generic bound (31) using the subspace U = span{u₀(·), {u_l(·)}_{l∈[N]}}, where

u₀(·) ≜ R_{SLGM,x₀}(·, x₀),   u_l(·) ≜ ∂R_{SLGM,x₀}(·, x)/∂(x)_l |_{x=x₀}, l ∈ [N],

with R_{SLGM,x₀}(·,·) given by (24), and the bound (32) for ‖x₀‖₀ = S is obtained from (31) using the subspace U = span{u₀(·), {u_l(·)}_{l∈supp(x₀)}}. This establishes a new, RKHS-based interpretation of the bound in [11] in terms of the projection of the prescribed mean function γ(x) = c(x) + x_k onto an RKHS-related subspace U. We note that the bound in [11] was formulated as a bound on the variance v(x̂(·); x₀) of a vector-valued estimator x̂(·) of x (and not only of the kth entry x_k). Consistent with (9), that bound can be reobtained by summing our bound in (32) (with c(·) = c_k(·)) over all k ∈ [N]. Thus, the two bounds are equivalent.

An important aspect of Theorem VI.1 is that the lower variance bound in (32) is not a continuous function of x₀ on X_S in general. Indeed, for the case H = I and c(·) ≡ 0, which has been considered in [13], it can be verified that the bound is a strictly upper semi-continuous function of x₀: for example, for M = N = 2, H = I, c(·) ≡ 0, S = 1, k = 2, and x₀ = a · (1, 0)^T with a ∈ R₊, the bound is equal to σ² for a = 0 (case of ‖x₀‖₀ < S) but equal to 0 for all a > 0 (case of ‖x₀‖₀ = S). However, by Corollary V.5, the minimum achievable variance M_SLGM(c(·), x₀) is a lower semi-continuous function of x₀. It thus follows that the bound in (32) cannot be tight, i.e., it cannot be equal to M_SLGM(c(·), x₀) for all x₀ ∈ X_S, which means that we have a strict inequality in (32) at least for some x₀ ∈ X_S.

Let us finally consider the special case where M ≥ N and H ∈ R^{M×N} has full rank, i.e., rank(H) = N. The least-squares (LS) estimator [17], [27] of x_k is given by x̂_{LS,k}(y) = e_k^T H†y; it is unbiased and its variance is

v(x̂_{LS,k}(·); x₀) = σ² e_k^T (H^T H)^{−1} e_k.   (33)

On the other hand, for unbiased estimation, i.e., c(·) ≡ 0, our lower bound for ‖x₀‖₀ < S in (32) becomes M_SLGM(c(·) ≡ 0, x₀) ≥ σ² b^T (H^T H)† b = σ² e_k^T (H^T H)^{−1} e_k. Comparing with (33), we conclude that our bound is tight and the minimum achievable variance is in fact M_SLGM(c(·) ≡ 0, x₀) = σ² e_k^T (H^T H)^{−1} e_k, which is achieved by the LS estimator. Thus, for M ≥ N and rank(H) = N, the LS estimator is the LMV unbiased estimator for the SLGM at each parameter vector x₀ ∈ X_S with ‖x₀‖₀ < S. It is interesting to note that the LS estimator does not exploit the sparsity information expressed by the parameter set X_S, i.e., the knowledge that ‖x‖₀ ≤ S, and that it has the constant variance (33) for each x₀ ∈ X_S (in fact, even for x₀ ∈ R^N). We also note that the LS estimator is not an LMV unbiased estimator for the case ‖x₀‖₀ = S; therefore, it is not a UMV unbiased estimator on X_S (i.e., an unbiased estimator with minimum variance at each x₀ ∈ X_S). In fact, as shown in [13] and [31], there does not exist a UMV unbiased estimator for the SLGM in general.⁷

⁷ If an LMV estimator exists, it is unique [18].
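For the unbiased case c(·) ≡ 0, the two branches of the sparse CRB (32) and the LS variance (33) are straightforward to evaluate. The sketch below (NumPy; the matrix, sparsity level, and parameter vector are chosen only for illustration) computes both branches for a full-column-rank H and confirms that the ‖x₀‖₀ < S branch coincides with the LS variance (33):

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, S, sigma, k = 12, 6, 3, 0.8, 2          # estimate g(x) = x_k (0-based index k = 2)
H = rng.normal(size=(M, N))                    # full column rank with probability 1

b = np.zeros(N); b[k] = 1.0                    # unbiased case: b_l = delta_{k,l}

# Branch ||x0||_0 <= S - 1 of (32): sigma^2 b^T (H^T H)^dagger b
crb_nonmax = sigma**2 * b @ np.linalg.pinv(H.T @ H) @ b

# LS variance (33): sigma^2 e_k^T (H^T H)^{-1} e_k; equals the bound above since rank(H) = N
ls_var = sigma**2 * np.linalg.inv(H.T @ H)[k, k]

# Branch ||x0||_0 = S of (32): restrict b and H to supp(x0)
x0 = np.zeros(N); x0[[0, 1, 4]] = [1.0, -2.0, 0.5]      # some x0 with ||x0||_0 = S, k not in supp(x0)
supp = np.flatnonzero(x0)
H_s, b_s = H[:, supp], b[supp]
crb_max = sigma**2 * b_s @ np.linalg.pinv(H_s.T @ H_s) @ b_s

print(crb_nonmax, ls_var)   # identical for full-rank H
print(crb_max)              # 0 here, since k is not in supp(x0) and c(.) = 0
```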
B. A Novel CRB-Type Lower Variance Bound

A novel lower bound on M_SLGM(c(·), x₀) is stated in the following theorem [36].

Theorem VI.2.
Consider the estimation problem E_SLGM = (X_S, f_H(y; x), g(x) = x_k) with a system matrix H ∈ R^{M×N} satisfying (4). Let x₀ ∈ X_S, and consider an arbitrary index set K = {k₁, ..., k_{|K|}} ⊆ [N] consisting of no more than S indices, i.e., |K| ≤ S. If the prescribed bias function c(·): X_S → R is such that the partial derivatives ∂c(x)/∂x_{k_i} |_{x=x₀} exist for all k_i ∈ K, then

M_SLGM(c(·), x₀) ≥ exp( −(1/σ²) ‖(I − P)Hx₀‖₂² ) [ σ² b_{x₀}^T (H_K^T H_K)^{−1} b_{x₀} + γ²(x̃₀) ] − γ²(x₀).   (34)

Here, P ≜ H_K (H_K)† ∈ R^{M×M}, b_{x₀} ∈ R^{|K|} is defined elementwise as (b_{x₀})_i ≜ δ_{k,k_i} + ∂c(x)/∂x_{k_i} |_{x=x̃₀} for i ∈ [|K|], x̃₀ ∈ R^N is defined as the unique (due to (4)) vector with supp(x̃₀) ⊆ K solving Hx̃₀ = PHx₀, and γ(x) = c(x) + x_k.

According to [31, Thm. 5.4.3], the bound in (34) follows from the generic bound (31) by using the subspace U = span{ũ₀(·), {ũ_l(·)}_{l∈K}}, where

ũ₀(·) ≜ R_{SLGM,x₀}(·, x̃₀),   ũ_l(·) ≜ ∂R_{SLGM,x₀}(·, x)/∂(x)_l |_{x=x̃₀}, l ∈ K.

We note that the bound presented in [36] is obtained by maximizing (34) with respect to the index set K; this gives the tightest possible bound of the type (34).

For the special case given by the SSNM, i.e., H = I, and unbiased estimation, i.e., c(·) ≡ 0, the bound (34) is a continuous function of x₀ on X_S. This is an important difference from the bound given in Theorem VI.1 and, also, from the bound to be given in Theorem VIII.8. Furthermore, still for H = I and c(·) ≡ 0, the bound (34) can be shown [36], [31, p. 106] to be tighter (higher) than the bounds in Theorem VI.1 and Theorem VIII.8.

The matrix P appearing in (34) is the orthogonal projection matrix [20] on the subspace H_K ≜ span(H_K) ⊆ R^M, i.e., the subspace spanned by those columns of H whose indices are in K. Consequently, I − P is the orthogonal projection matrix on the orthogonal complement of H_K, and the norm ‖(I − P)Hx₀‖₂ thus represents the distance between the point Hx₀ and the subspace H_K [32]. Therefore, the factor exp(−(1/σ²)‖(I − P)Hx₀‖₂²) appearing in the bound (34) can be interpreted as a measure of the distance between Hx₀ and H_K. In general, the bound (34) is tighter (i.e., higher) if K is chosen such that the distance ‖(I − P)Hx₀‖₂ is smaller.

A slight modification in the derivation of (34) yields the following alternative bound:

M_SLGM(c(·), x₀) ≥ exp( −(1/σ²) ‖(I − P)Hx₀‖₂² ) σ² b_{x₀}^T (H_K^T H_K)^{−1} b_{x₀}.   (35)

As shown in [31, Thm. 5.4.4], this bound follows from the generic lower bound (31) by using the subspace U = span{u₀(·), {ũ_l(·)}_{l∈K}}, with u₀(·) = R_{SLGM,x₀}(·, x₀) and ũ_l(·) = ∂R_{SLGM,x₀}(·, x)/∂(x)_l |_{x=x̃₀} as defined previously.⁸ Note that this subspace deviates from the subspace underlying the bound (34) only by the use of u₀(·) instead of ũ₀(·). The difference of the bounds (35) and (34) is

∆_{(35)−(34)} = γ²(x₀) − exp( −(1/σ²) ‖(I − P)Hx₀‖₂² ) γ²(x̃₀).   (36)

This depends on the choice of the index set K (via P and x̃₀). If, for some K and c(·), γ(x̃₀) ≈ γ(x₀), then ∆_{(35)−(34)} is approximately nonnegative since exp(−(1/σ²)‖(I − P)Hx₀‖₂²) ≤ 1. Hence, in that case, the bound (35) is tighter (higher) than the bound (34). We note that one sufficient condition for γ(x̃₀) ≈ γ(x₀) is that the columns of H_K are nearly orthonormal and c(·) ≡ 0, i.e., unbiased estimation.

The bounds (34) and (35) have an intuitively appealing interpretation in terms of a scaled CRB for an LGM. Indeed, the quantity σ² b_{x₀}^T (H_K^T H_K)^{−1} b_{x₀} appearing in (34) and (35) can be interpreted as the CRB [17] for the LGM with parameter dimension N = |K|, parameter function g(x) = x_k, and prescribed bias function c(·). For a discussion of the scaling factor exp(−(1/σ²)‖(I − P)Hx₀‖₂²), we will consider the following two complementary cases:

1) For the case where either k ∈ supp(x₀) or ‖x₀‖₀ < S (or both), the factor exp(−(1/σ²)‖(I − P)Hx₀‖₂²) can be made equal to 1 by choosing K = supp(x₀) ∪ {k}.

2) On the other hand, consider the complementary case where k ∉ supp(x₀) and ‖x₀‖₀ = S. Choosing K = L ∪ {k}, where L comprises the indices of the S − 1 largest (in magnitude) entries of x₀, we obtain ‖(I − P)Hx₀‖₂ = |ξ| ‖(I − P)He_j‖₂, where ξ and j denote the value and index, respectively, of the smallest (in magnitude) nonzero entry of x₀. Typically, ‖(I − P)He_j‖₂ > 0⁹ and therefore, as ξ becomes larger (in magnitude), the bound (35) transitions from a “low signal-to-noise ratio (SNR)” regime, where exp(−(1/σ²)‖(I − P)Hx₀‖₂²) ≈ 1, to a “high-SNR” regime, where exp(−(1/σ²)‖(I − P)Hx₀‖₂²) ≈ 0. In the low-SNR regime, the bound (35) is approximately equal to σ² b_{x₀}^T (H_K^T H_K)^{−1} b_{x₀}, i.e., to the CRB for the LGM with N = |K|. In the high-SNR regime, the bound becomes approximately equal to 0; this suggests that the zero entries x_k with k ∉ supp(x₀) can be estimated with small variance. Note that for increasing ξ, the transition from the low-SNR regime to the high-SNR regime exhibits an exponential decay.

⁸ Note that (H_K^T H_K)^{−1} exists because of (4).
VII. THE SLGM VIEW OF COMPRESSED SENSING
The lower bounds of Section VI are also relevant to the linear CS recovery problem, which can be viewed as an instance of the SLGM-based estimation problem. In this section, we express one of these lower bounds in terms of the restricted isometry constant of the system matrix (here, the CS measurement matrix) H.
A. CS Fundamentals

The compressive measurement process within a CS problem is often modeled as [2], [7], [21], [37], [38]

y = Hx + n.   (37)

Here, y ∈ R^M denotes the compressive measurements; H ∈ R^{M×N}, where M ≤ N and typically M ≪ N, denotes the CS measurement matrix; x ∈ X_S ⊆ R^N is an unknown S-sparse signal or parameter vector, with known sparsity degree S (typically S ≪ N); and n represents additive measurement noise. We assume that n ∼ N(0, σ²I) and that the columns {h_j}_{j∈[N]} of H are normalized, i.e., ‖h_j‖₂ = 1 for all j ∈ [N]. The CS measurement model (37) is then identical to the SLGM observation model (2). Any CS recovery method, such as the basis pursuit (BP) [37], [39] or the orthogonal matching pursuit (OMP) [21], [40], can be interpreted as an estimator x̂(y) that estimates the sparse vector x from the observation y. (A comprehensive overview of CS is provided at http://dsp.rice.edu/cs.)

Due to the typically large dimension of the measurement matrix H, a complete characterization of the properties of H (e.g., via its SVD) is often infeasible. Useful incomplete characterizations are provided by the (mutual) coherence and the restricted isometry property [7], [21], [37], [38]. The coherence of a matrix H ∈ R^{M×N} with unit-norm columns is defined as

µ(H) ≜ max_{i ≠ j} |h_jᵀ h_i|.

Furthermore, a matrix H ∈ R^{M×N} is said to satisfy the restricted isometry property (RIP) of order K if there is a constant δ′_K ∈ R₊ such that, for every index set I ⊆ [N] of size |I| = K,

(1 − δ′_K) ‖z‖₂² ≤ ‖H_I z‖₂² ≤ (1 + δ′_K) ‖z‖₂²,   for all z ∈ R^K.   (38)

The smallest δ′_K for which (38) holds—hereafter denoted δ_K—is called the RIP constant of H. Condition (4) is necessary for a matrix H to have the RIP of order S with an RIP constant δ_S < 1. (Indeed, assume that spark(H) ≤ S. Then there exists an index set I ⊆ [N] consisting of S indices such that the columns of H_I are linearly dependent. This, in turn, implies that there is a nonzero coefficient vector z ∈ R^S such that H_I z = 0 and consequently ‖H_I z‖₂² = 0. Therefore, there cannot exist a constant δ′_S < 1 satisfying (38) for all z ∈ R^S.) It can easily be verified that δ_{K′} ≥ δ_K for K′ ≥ K. The coherence µ(H) provides a coarser description of the matrix H than the RIP constant δ_K but can be calculated more easily. The two parameters are related according to δ_K ≤ (K − 1)µ(H) [38].
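The following short sketch (Python, illustrative values) makes the two characterizations above concrete: it computes the coherence µ(H) of a column-normalized matrix, the exact RIP constant δ_K by exhaustive enumeration of size-K column subsets (feasible only for small N), and the coherence-based upper bound (K − 1)µ(H).

```python
import itertools
import numpy as np

def coherence(H):
    """Mutual coherence mu(H) = max_{i != j} |h_i^T h_j| for unit-norm columns."""
    G = H.T @ H                    # Gram matrix; diagonal is 1 for normalized columns
    np.fill_diagonal(G, 0.0)
    return np.max(np.abs(G))

def rip_constant(H, K):
    """Exact RIP constant delta_K by enumerating all size-K column subsets
    (feasible only for small N): delta_K = max_I  max |eig(H_I^T H_I) - 1|."""
    N = H.shape[1]
    delta = 0.0
    for I in itertools.combinations(range(N), K):
        eigs = np.linalg.eigvalsh(H[:, I].T @ H[:, I])
        delta = max(delta, np.max(np.abs(eigs - 1.0)))
    return delta

rng = np.random.default_rng(1)
M, N, K = 12, 18, 3
H = rng.standard_normal((M, N))
H /= np.linalg.norm(H, axis=0)
mu = coherence(H)
print("mu(H)          =", mu)
print("delta_K exact  =", rip_constant(H, K))
print("(K-1)*mu bound =", (K - 1) * mu)   # coherence-based upper bound on delta_K
```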
B. A Lower Variance Bound

We now specialize the bound (35) on the minimum achievable variance for E_SLGM to the CS scenario, i.e., to the SLGM with sparsity degree S and a system matrix H that is a CS measurement matrix (i.e., M ≤ N) with known RIP constant δ_S < 1. Note that δ_S < 1 implies that condition (4) is satisfied. The following result was presented in [31, Thm. 5.7.2].

Theorem VII.1.
Consider the SLGM-based estimation problem E_SLGM = (X_S, f_H(y; x), g(x) = x_k), where H ∈ R^{M×N} with M ≤ N satisfies the RIP of order S with RIP constant δ_S < 1. Let x₀ ∈ X_S, and consider an arbitrary index set K ⊆ [N] consisting of no more than S indices, i.e., |K| ≤ S. If the first-order partial derivatives ∂c(x)/∂x_l of the prescribed bias function c(·) : X_S → R exist for all l ∈ K, then

M_SLGM(c(·), x₀) ≥ exp(−(1 + δ_S) ‖x₀,supp(x₀)∖K‖₂²/σ²) σ² b_{x₀}ᵀ (H_Kᵀ H_K)⁻¹ b_{x₀},   (39)

with b_{x₀} ∈ R^{|K|} as defined in Theorem VI.2.

Using the inequality δ_S ≤ (S − 1)µ(H), we obtain from (39) the coherence-based bound

M_SLGM(c(·), x₀) ≥ exp(−(1 + (S − 1)µ(H)) ‖x₀,supp(x₀)∖K‖₂²/σ²) σ² b_{x₀}ᵀ (H_Kᵀ H_K)⁻¹ b_{x₀}.

If we want to compare the actual variance behavior of a given CS recovery scheme (or estimator) x̂_k(·) with the bound on the minimum achievable variance in (39), we have to ensure that the first-order partial derivatives of the estimator’s bias function E_x{x̂_k(y)} − x_k exist. The following lemma states that this is indeed the case under mild conditions. Moreover, the lemma gives an explicit expression of these partial derivatives.

Lemma VII.2 ([34, Cor. 2.6]). Consider the SLGM-based estimation problem E_SLGM = (X_S, f_H(y; x), g(x) = x_k) and an estimator x̂_k(·) : R^M → R. If the mean function γ(x) = E_x{x̂_k(y)} exists for all x ∈ X_S, then the partial derivatives ∂c(x)/∂x_l, l ∈ [N] of the bias function also exist for all x ∈ X_S and are given by

∂c(x)/∂x_l = −δ_{k,l} + (1/σ²) E_x{ x̂_k(y) (y − Hx)ᵀ H e_l }.   (40)

C. The Case δ_S ≈ 0

For CS applications, measurement matrices H with RIP constant close to zero, i.e., δ_S ≈ 0, are generally preferable [7], [38], [41]–[43]. For δ_S = 0, the bound in (39) becomes

M_SLGM(c(·), x₀) ≥ exp(−‖x₀,supp(x₀)∖K‖₂²/σ²) σ² b_{x₀}ᵀ (H_Kᵀ H_K)⁻¹ b_{x₀}.   (41)

This is equal to the bound (57) for the SSNM (i.e., H = I) except that the factor b_{x₀}ᵀ (H_Kᵀ H_K)⁻¹ b_{x₀} in (41) is replaced by ‖b_{x₀}‖₂² in (57). For a “good” CS measurement matrix, i.e., with δ_S ≈ 0, we have b_{x₀}ᵀ (H_Kᵀ H_K)⁻¹ b_{x₀} ≈ ‖b_{x₀}‖₂² for any index set K ⊆ [N] of size |K| ≤ S. Thus, the bound in (41) is very close to (57). This means that, in terms of a lower bound on the achievable estimation accuracy, no loss of information relative to the SSNM (case H = I) is incurred by multiplying x by the CS measurement matrix H ∈ R^{M×N} and thereby reducing the signal dimension from N to M, where typically M ≪ N. This agrees with the fact that if δ_S ≈ 0, one can recover—e.g., by using the BP—the sparse parameter vector x ∈ X_S from the compressed observation y = Hx + n up to an error that is typically very small (and whose norm is almost independent of H and essentially determined by the measurement noise n [7], [44]).
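As an illustration of how the derivative expression in Lemma VII.2 can be used in practice (as is done in Section X), the following sketch estimates ∂E_x{x̂_k(y)}/∂x_l by Monte Carlo via the Gaussian score identity and then subtracts δ_{k,l} to obtain the bias derivative, consistent with the reading of (40) adopted above. The stand-in thresholding estimator and all numerical values are illustrative only.

```python
import numpy as np

def mean_derivative_mc(estimator, H, x, k, l, sigma2, n_mc=100_000, rng=None):
    """Monte Carlo estimate of d/dx_l E_x{ xhat_k(y) } via the Gaussian score identity
    E_x{ xhat_k(y) (y - Hx)^T H e_l } / sigma^2, as in the lemma above.  The derivative
    of the bias c(x) = E_x{xhat_k(y)} - x_k then follows by subtracting delta_{k,l}
    (the reading of (40) assumed here)."""
    rng = rng or np.random.default_rng()
    M = H.shape[0]
    y = H @ x + np.sqrt(sigma2) * rng.standard_normal((n_mc, M))
    xhat_k = np.array([estimator(yi)[k] for yi in y])
    score = (y - H @ x) @ H[:, l] / sigma2            # (y - Hx)^T H e_l / sigma^2
    d_mean = np.mean(xhat_k * score)
    d_bias = d_mean - (1.0 if k == l else 0.0)
    return d_mean, d_bias

# toy usage with a simple (stand-in) thresholding estimator; H = I gives the SSNM case
def ht_estimator(y, T=1.5):
    return np.where(np.abs(y) >= T, y, 0.0)

rng = np.random.default_rng(2)
N = 8
H = np.eye(N)
x = np.zeros(N)
x[0] = 2.0
print(mean_derivative_mc(ht_estimator, H, x, k=0, l=0, sigma2=1.0, rng=rng))
```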
VIII. RKHS-BASED ANALYSIS OF MINIMUM VARIANCE ESTIMATION FOR THE SSNM

Next, we specialize our RKHS-based MVE analysis to the SSNM, i.e., to the special case given by H = I (which implies M = N and y = x + n). For the SSNM-based estimation problem E_SSNM = (X_S, f_I(y; x), g(·)) with k ∈ [N], we will analyze the minimum achievable variance M_SSNM(c(·), x₀) and the corresponding LMV estimator. We note that the SLGM with a system matrix H ∈ R^{M×N} having orthonormal columns, i.e., satisfying HᵀH = I, is equivalent to the SSNM [13].

Specializing the kernel R_SLGM,x₀(·, ·) (see (24)) to the system matrix H = I, we obtain

R_SSNM,x₀(x₁, x₂) = exp((x₁ − x₀)ᵀ(x₂ − x₀)/σ²),   x₁, x₂ ∈ X_S.   (42)

The corresponding RKHS, H(R_SSNM,x₀), will be briefly denoted by H_SSNM,x₀.

A. Valid Bias Functions, Minimum Achievable Variance, and LMV Estimator
Since the SSNM is a special case of the SLGM, we can characterize the class of valid bias functions, theminimum achievable variance (Barankin bound), and the corresponding LMV estimator by Theorems V.2 andV.4 specialized to H = I , as stated in the following corollary. Corollary VIII.1.
Consider the SSNM-based estimation problem E SSNM = (cid:0) X S , f I ( y ; x ) , g ( · ) (cid:1) with k ∈ [ N ] . A bias function c ( · ) : X S → R is valid for E SSNM at x ∈ X S if and only if it can be expressed as c ( x ) = exp (cid:18) σ k x k − σ x T x (cid:19) X p ∈ Z D + a [ p ] √ p ! (cid:18) σ x (cid:19) p − g ( x ) , x ∈ X S , (43) with some coefficient sequence a [ p ] ∈ ℓ ( Z D + ) . Let c ( · ) : X S → R be a valid prescribed bias function. Then: a) The minimum achievable variance at x ∈ X S , M SSNM ( c ( · ) , x ) , is given by (29) , in which C ( c ) ⊆ ℓ ( Z D + ) denotes the set of coefficient sequences a [ p ] ∈ ℓ ( Z D + ) that are consistent with (43) . b) The function ˆ g ( · ) : R M → R given by ˆ g ( y ) = exp (cid:18) − σ k x k (cid:19) X p ∈ Z D + a [ p ] √ p ! χ p ( y ) , (44) with an arbitrary coefficient sequence a [ · ] ∈ C ( c ) and χ p ( y ) , ∂ p (cid:2) ρ LGM , x ( y , σ x ) exp (cid:0) σ x T x (cid:1)(cid:3) ∂ x p (cid:12)(cid:12)(cid:12)(cid:12) x = , is an allowed estimator at x for c ( · ) , i.e., ˆ g ( · ) ∈ A ( c ( · ) , x ) . c) The LMV estimator at x , ˆ g ( c ( · ) , x ) ( · ) , is given by (44) using the specific coefficient sequence a [ p ] =argmin a [ · ] ∈C ( c ) k a [ · ] k ℓ ( Z D + ) . However, a more convenient characterization can be obtained by exploiting the specific structure of H SSNM , x that is induced by the choice H = I . We omit the technical details, which can be found in [31, Sec. 5.5], andjust present the main results regarding MVE [31, Thm. 5.5.2]. Theorem VIII.2.
Consider the SSNM-based estimation problem E SSNM = (cid:0) X S , f I ( y ; x ) , g ( · ) (cid:1) with k ∈ [ N ] . A prescribed bias function c ( · ) : X S → R is valid for E SSNM at x ∈ X S if and only if the associatedprescribed mean function γ ( · ) = c ( · ) + g ( · ) can be expressed as γ ( x ) = 1 ν x ( x ) X p ∈ Z N + ∩X S a [ p ] √ p ! (cid:18) x σ (cid:19) p , x ∈ X S , with ν x ( x ) , exp (cid:18) − σ k x k + 1 σ x T x (cid:19) and with a coefficient sequence a [ p ] ∈ ℓ ( Z N + ∩ X S ) . This coefficient sequence is unique for a given c ( · ) . Let c ( · ) : X S → R be a valid prescribed bias function. Then: a) The minimum achievable variance at x ∈ X S is given by M SSNM ( c ( · ) , x ) = X p ∈ Z N + ∩X S a x [ p ] − γ ( x ) , (45) with a x [ p ] , √ p ! ∂ p (cid:0) γ ( σ x ) ν x ( σ x ) (cid:1) ∂ x p (cid:12)(cid:12)(cid:12)(cid:12) x = . b) The LMV estimator at x is given by ˆ g ( c ( · ) , x ) ( y ) = X p ∈ Z N + ∩X S a x [ p ] √ p ! ∂ p ψ x ( x , y ) ∂ x p (cid:12)(cid:12)(cid:12)(cid:12) x = , (46) with ψ x ( x , y ) , exp (cid:18) y T ( σ x − x ) σ + x T x σ − k x k (cid:19) . Note that the statement of Theorem VIII.2 is stronger than that of Corollary VIII.1, because it containsexplicit expressions of the minimum achievable variance M SSNM ( c ( · ) , x ) and the corresponding LMV estimator ˆ g ( c ( · ) , x ) ( y ) .The expression (45) nicely shows the influence of the sparsity constraints on the minimum achievablevariance. Indeed, consider a prescribed bias c ( · ) : R N → R that is valid for the SSNM with S = N , and thereforealso for the SSNM with S < N . Let us denote by M N and M S the minimum achievable variance M ( c ( · ) , x ) for the degenerate SSNM without sparsity ( S = N ) and for the SSNM with sparsity ( S < N ), respectively.Note that in the nonsparse case S = N , the SSNM coincides with the LGM with system matrix H = I . It thenfollows from (45) that M N = P p ∈ Z N + a x [ p ] − γ ( x ) and M N − M S = X p ∈ Z N + \X S a x [ p ] . (47) Clearly, if x is more sparse, i.e., if the sparsity degree S is smaller, the number of (nonnegative) terms in theabove sum is larger. This implies a larger difference M N − M S and, thus, a stronger reduction of the minimumachievable variance due to the sparsity information.We mention the obvious fact that a UMV estimator for E SSNM = (cid:0) X S , f I ( y ; x ) , g ( · ) (cid:1) and prescribed biasfunction c ( · ) exists if and only if the LMV estimator ˆ g ( c ( · ) , x ) ( · ) given by (46) does not depend on x .Finally, consider the SSNM with parameter function g ( x ) = x k , i.e., E SSNM = (cid:0) X S , f I ( y ; x ) , g ( x ) = x k (cid:1) , forsome k ∈ [ N ] . Because the specific estimator ˆ g ( y ) = y k has finite variance and zero bias at each x ∈ X S , thebias function c u ( x ) ≡ must be valid for E SSNM at each x ∈ X S . Therefore, according to Corollary V.5, theminimum achievable variance for unbiased estimation within the SSNM with parameter function g ( x ) = x k , M SSNM ( c u ( · ) , x ) , is a lower semi-continuous function of x on its domain, i.e., on X S . (Note that this remarkis not related to Theorem VIII.2.) B. Diagonal Bias Functions
In this subsection, we consider the SSNM-based estimation problem E_SSNM = (X_S, f_I(y; x), g(x) = x_k), for some k ∈ [N], and we study a specific class of bias functions. Let us call a bias function c(·) : X_S → R diagonal if c(x) depends only on the k-th entry of the parameter vector x, i.e., on the specific scalar parameter x_k to be estimated. That is, c(x) = c̃(x_k), with some function c̃(·) : R → R that may depend on k. Similarly, we say that an estimator x̂_k(y) is diagonal if it depends only on the k-th entry of y, i.e., x̂_k(y) = x̂_k(y_k) (with an abuse of notation). Clearly, the bias function b(x̂_k(·); x) of a diagonal estimator x̂_k(·) is diagonal, i.e., b(x̂_k(·); x) = b(x̂_k(·); x_k). Well-known examples of diagonal estimators are the hard- and soft-thresholding estimators described in [2], [45], and [10] and the LS estimator x̂_LS,k(y) = y_k. The maximum-likelihood estimator for the SSNM is not diagonal, and its bias function is not diagonal either [13].

The following theorem [31, Thm. 5.5.4], which can be regarded as a specialization of Theorem VIII.2 to the case of diagonal bias functions, provides a characterization of the class of valid diagonal bias functions, as well as of the minimum achievable variance and LMV estimator for a prescribed diagonal bias function. In the theorem, we will use the l-th order (probabilists’) Hermite polynomial H_l(·) : R → R defined as [46]

H_l(x) ≜ (−1)^l e^{x²/2} (d^l/dx^l) e^{−x²/2}.

Furthermore, in the case ‖x₀‖₀ = S, the support of x₀ will be denoted as supp(x₀) = {k₁, …, k_S}.

Theorem VIII.3.
Consider the SSNM-based estimation problem E SSNM = (cid:0) X S , f I ( y ; x ) , g ( x ) = x k (cid:1) , k ∈ [ N ] ,at x ∈ X S . Furthermore consider a prescribed bias function c ( · ) : X S → R that is diagonal and such that theprescribed mean function γ ( x ) = c ( x ) + x k can be written as a convergent power series centered at x , i.e., γ ( x ) = X l ∈ Z + m l l ! ( x k − x ,k ) l , (48) We recall that the assumption g ( x ) = x k is no restriction, because the MVP for any given parameter function g ( · ) is equivalent tothe MVP for the parameter function g ′ ( x ) = x k and the modified prescribed bias function c ′ ( x ) = c ( x ) + g ( x ) − x k . with suitable coefficients m l . (Note, in particular, that m = γ ( x ) .) In what follows, let B c , X l ∈ Z + m l σ l l ! . The bias function c ( · ) is valid at x if and only if B c < ∞ . Assume that B c < ∞ , i.e., c ( · ) is valid. Then: a) The minimum achievable variance at x is given by M SSNM ( c ( · ) , x ) = B c φ ( x ) − γ ( x ) , with φ ( x ) , , if | supp( x ) ∪ { k }| ≤ S X i ∈ [ S ] exp (cid:18) − x ,k i σ (cid:19) Y j ∈ [ i − (cid:20) − exp (cid:18) − x ,k j σ (cid:19)(cid:21) < , if | supp( x ) ∪ { k }| = S + 1 . (49) (Recall that supp( x ) = { k i } Si =1 in the case | supp( x ) ∪ { k }| = S + 1 .) b) The LMV estimator at x is given by ˆ x ( c ( · ) , x ) k ( y ) = ψ ( y , x ) X l ∈ Z + m l σ l l ! H l (cid:18) y k − x ,k σ (cid:19) , with ψ ( y , x ) , , if | supp( x ) ∪ { k }| ≤ S X i ∈ [ S ] exp (cid:18) − x ,k i + 2 y k i x ,k i σ (cid:19) × Y j ∈ [ i − (cid:20) − exp (cid:18) − x ,k j + 2 y k j x ,k j σ (cid:19)(cid:21) , if | supp( x ) ∪ { k }| = S + 1 . (50)Regarding the case distinction in Theorem VIII.3, we note that | supp( x ) ∪ { k }| ≤ S either if k x k < S or if both k x k = S and k ∈ supp( x ) , and | supp( x ) ∪ { k }| = S + 1 if both k x k = S and k supp( x ) .If the prescribed bias function c ( · ) is the actual bias function b (ˆ x ′ k ( · ); x ) of some diagonal estimator ˆ x ′ k ( y ) =ˆ x ′ k ( y k ) with finite variance at x , the coefficients m l appearing in Theorem VIII.3 have a particular interpretation.For a discussion of this interpretation, we need the following lemma [47]. Lemma VIII.4.
Consider the SSNM-based estimation problem E SSNM = (cid:0) X S , f I ( y ; x ) , g ( x ) = x k (cid:1) , k ∈ [ N ] , at x ∈ X S . Furthermore consider the Hilbert space P SSNM consisting of all finite-variance estimator functions ˆ g ( · ) : R N → R , i.e., P SSNM , { ˆ g ( · ) | v (ˆ g ( · ); x ) < ∞} , and endowed with the inner product (cid:10) ˆ g ( · ) , ˆ g ( · ) (cid:11) RV = E x (cid:8) ˆ g ( y )ˆ g ( y ) (cid:9) = 1(2 πσ ) N/ Z R N ˆ g ( y ) ˆ g ( y ) exp (cid:18) − σ k y − x k (cid:19) d y . Then, the subset D SSNM ⊆ P
SSNM consisting of all diagonal estimators ˆ g ( y ) = ˆ g ( y k ) is a subspace of P SSNM ,with induced inner product (cid:10) ˆ g ( · ) , ˆ g ( · ) (cid:11) D SSNM = 1 √ πσ Z R ˆ g ( y ) ˆ g ( y ) exp (cid:18) − σ ( y − x ,k ) (cid:19) dy . An orthonormal basis for D SSNM is constituted by { h ( l ) ( · ) } l ∈ Z + , with h ( l ) ( · ) : R N → R given by h ( l ) ( y ) = 1 √ l ! H l (cid:18) y k − x ,k σ (cid:19) . (51)Combining Theorem VIII.3 with Lemma VIII.4 yields the following result [31, Cor. 5.5.7]. Corollary VIII.5.
Consider the SSNM-based estimation problem E SSNM = (cid:0) X S , f I ( y ; x ) , g ( x ) = x k (cid:1) , k ∈ [ N ] ,at x ∈ X S . Furthermore consider a prescribed diagonal bias function c ( · ) : X S → R that is the actual biasfunction of a diagonal estimator ˆ x k ( y ) = ˆ x k ( y k ) , i.e., c ( x ) = b (ˆ x k ( · ); x ) . The estimator ˆ x k ( · ) is assumed tohave finite variance at x , v (ˆ x k ( · ); x ) < ∞ , and hence ˆ x k ( y ) ∈ D SSNM and, also, c ( · ) is valid. The prescribed mean function γ ( x ) = c ( x ) + x k = E x { ˆ x k ( y ) } can be written as a convergent powerseries (48) , with coefficients given by m l = √ l ! σ l (cid:10) ˆ x k ( · ) , h ( l ) ( · ) (cid:11) D SSNM (52) = 1 √ πσ l +1 Z R ˆ x k ( y ) H l (cid:18) y − x ,k σ (cid:19) exp (cid:18) − σ ( y − x ,k ) (cid:19) dy . The minimum achievable variance at x is given by M SSNM ( c ( · ) , x ) = v (ˆ x k ( · ); x ) φ ( x ) + [ φ ( x ) − γ ( x ) , (53) with φ ( x ) as defined in (49) . The LMV estimator at x is given by ˆ x ( c ( · ) , x ) k ( y ) = ˆ x k ( y k ) ψ ( y , x ) , (54) with ψ ( y , x ) as defined in (50) . It follows from (52) and from Lemma VIII.4 that the given diagonal estimator ˆ x k ( · ) can be written as ˆ x k ( y ) = σ X l ∈ Z + m l √ l ! h ( l ) ( y ) . Thus, the coefficients m l appearing in Theorem VIII.3 have the interpretation of being (up to a factor of / √ l ! ) the expansion coefficients of the estimator ˆ x k ( · ) —viewed as an element of D SSNM —with respect to theorthonormal basis (cid:8) h ( l ) ( y ) (cid:9) l ∈ Z + .Remarkably, as shown by (54), the LMV estimator can be obtained by multiplying the diagonal estimator ˆ x k ( y ) —which is arbitrary except for the condition that its variance at x is finite—by the “correction factor” ψ ( y , x ) in (50). It can be easily verified that ψ ( y , x ) does not depend on y k . According to (50), the followingtwo cases have to be distinguished:1) For k ∈ [ N ] such that | supp( x ) ∪ { k }| ≤ S , we have ψ ( y , x ) = 1 , and therefore the LMV estimator isobtained from (54) as ˆ x ( c ( · ) , x ) k ( y ) = ˆ x k ( y k ) = ˆ x k ( y ) . Thus, in that case, it follows from Corollary VIII.5that every diagonal estimator ˆ x k ( · ) : R N → R for the SSNM that has finite variance at x is necessarilyan LMV estimator. In particular, the variance v (ˆ x k ( · ); x ) equals the minimum achievable variance M SSNM ( c ( · ) , x ) , i.e., the Barankin bound. Furthermore, the sparsity information cannot be leveragedfor improved MVE, because the estimator ˆ x k ( · ) is an LMV estimator for the parameter set X S witharbitrary S , including the nonsparse case X = R N .2) For k ∈ [ N ] such that | supp( x ) ∪ { k }| = S + 1 , it follows from Corollary VIII.5 and (49) that thereexist estimators (in particular, the LMV estimator ˆ x ( c ( · ) , x ) k ( y ) ) with the same bias function as ˆ x k ( · ) butwith a smaller variance at x . Indeed, in this case, we have φ ( x ) < in (49), and by (53) it thus followsthat M SSNM ( c ( · ) , x ) < v (ˆ x k ( · ); x ) .Let us for the moment make the (weak) assumption that the given diagonal estimator ˆ x k ( · ) has finite varianceat every parameter vector x ∈ R N . It can then be shown that the LMV estimator ˆ x ( c ( · ) , x ) k ( · ) is robust to deviationsfrom the nominal parameter x in the sense that its bias and variance depend continuously on x . 
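To make the expansion coefficients m_l in (52) concrete, the following sketch computes them for a hard-thresholding rule by numerical quadrature against the probabilists’ Hermite polynomials under the N(x₀,k, σ²) weight; the normalization follows the reconstruction of (52) given above, and the threshold, the point x₀,k, and the grid parameters are illustrative only.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval
from math import sqrt, pi

def hermite_coeffs(est_scalar, x0k, sigma, l_max=8, n_grid=20001, width=12.0):
    """Coefficients m_l of a diagonal estimator xhat_k(y_k), cf. (48)/(52), computed by
    quadrature against the probabilists' Hermite polynomials He_l under the Gaussian
    weight centered at x0k (normalization as reconstructed in the text)."""
    y = np.linspace(x0k - width * sigma, x0k + width * sigma, n_grid)
    dy = y[1] - y[0]
    w = np.exp(-(y - x0k) ** 2 / (2 * sigma ** 2))        # unnormalized Gaussian weight
    vals = est_scalar(y)
    m = []
    for l in range(l_max + 1):
        He_l = hermeval((y - x0k) / sigma, [0.0] * l + [1.0])   # He_l((y - x0k)/sigma)
        integral = np.sum(vals * He_l * w) * dy               # simple quadrature
        m.append(integral / (sqrt(2 * pi) * sigma ** (l + 1)))
    return np.array(m)

# toy usage: hard-thresholding rule as the diagonal estimator (stand-in values)
T, sigma, x0k = 2.0, 1.0, 1.5
ht = lambda y: np.where(np.abs(y) >= T, y, 0.0)
m = hermite_coeffs(ht, x0k, sigma)
print("m_0 (the mean at x0k) ~", m[0])
print("first coefficients:", np.round(m[:5], 4))
```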
Furthermore, ˆ x ( c ( · ) , x ) k ( · ) has finite bias and finite variance at any parameter vector x ∈ R N , i.e., (cid:12)(cid:12) b (cid:0) ˆ x ( c ( · ) , x ) k ( · ); x (cid:1)(cid:12)(cid:12) < ∞ and v (cid:0) ˆ x ( c ( · ) , x ) k ( · ); x (cid:1) < ∞ for all x ∈ R N .We finally note that Corollary VIII.5 also applies to unbiased estimation, i.e., prescribed bias function c ( · ) ≡ (equivalently, γ ( x ) = x k ). This is because c ( · ) ≡ is the actual bias function of the LS estimator ˆ x LS ,k ( y ) = y k . Clearly, the LS estimator is diagonal and has finite variance at x . Thus, it can be used as thegiven diagonal estimator ˆ x k ( y ) in Corollary VIII.5. C. Lower Variance Bounds
Finally, we complement the exact expressions of the minimum achievable variance M_SSNM(c(·), x₀) presented above by simple lower bounds. The following bound is obtained by specializing the sparse CRB in Theorem VI.1 to the SSNM (H = I).

Corollary VIII.6.
Consider the estimation problem E_SSNM = (X_S, f_I(y; x), g(x) = x_k). Let x₀ ∈ X_S. If the prescribed bias function c(·) : X_S → R is such that the partial derivatives ∂c(x)/∂x_l |_{x = x₀} exist for all l ∈ [N], then

M_SSNM(c(·), x₀) ≥ σ² ‖b‖₂²  if ‖x₀‖₀ < S,   and   M_SSNM(c(·), x₀) ≥ σ² ‖b_{x₀}‖₂²  if ‖x₀‖₀ = S.   (55)

Here, in the case ‖x₀‖₀ < S, b ∈ R^N is given by b_l ≜ δ_{k,l} + ∂c(x)/∂x_l |_{x = x₀}, l ∈ [N], and in the case ‖x₀‖₀ = S, b_{x₀} ∈ R^S consists of those entries of b that are indexed by supp(x₀) = {k₁, …, k_S}, i.e., (b_{x₀})_i = b_{k_i}, i ∈ [S].
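A minimal sketch of the case distinction in (55) (Python, illustrative values); the gradient of the prescribed bias function at x₀ is passed in explicitly, and a zero vector corresponds to unbiased estimation.

```python
import numpy as np

def ssnm_crb_55(x0, k, S, sigma2, dcdx):
    """Evaluate the sparse CRB (55) for the SSNM at x0, for the scalar parameter x_k.
    dcdx is the gradient of the prescribed bias function c(.) at x0 (length N);
    pass a zero vector for unbiased estimation.  A sketch of the case distinction only."""
    N = x0.size
    b = np.array([(1.0 if l == k else 0.0) + dcdx[l] for l in range(N)])
    supp = np.flatnonzero(x0)
    if supp.size < S:                       # ||x0||_0 < S: all N components enter
        return sigma2 * np.sum(b ** 2)
    return sigma2 * np.sum(b[supp] ** 2)    # ||x0||_0 = S: only the support entries enter

x0 = np.array([2.0, 0.0, -1.0, 0.0, 0.0])
print(ssnm_crb_55(x0, k=1, S=3, sigma2=1.0, dcdx=np.zeros(5)))  # ||x0||_0 < S -> sigma^2
print(ssnm_crb_55(x0, k=1, S=2, sigma2=1.0, dcdx=np.zeros(5)))  # ||x0||_0 = S, k off-support -> 0
```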
Specializing the alternative bound in Theorem VI.2 to the SSNM yields the following result.

Corollary VIII.7.
Consider the estimation problem E_SSNM = (X_S, f_I(y; x), g(x) = x_k). Let x₀ ∈ X_S, and consider an arbitrary index set K = {k₁, …, k_{|K|}} ⊆ [N] consisting of no more than S indices, i.e., |K| ≤ S. If the prescribed bias function c(·) : X_S → R is such that the partial derivatives ∂c(x)/∂x_{k_i} exist for all k_i ∈ K, then

M_SSNM(c(·), x₀) ≥ exp(−‖x₀,[N]∖K‖₂²/σ²) [ σ² ‖b_{x₀}‖₂² + γ²(x₀,K) ] − γ²(x₀).

Here, x₀,K denotes the vector obtained from x₀ by zeroing all entries outside K; b_{x₀} ∈ R^{|K|} is defined elementwise as (b_{x₀})_i ≜ δ_{k,k_i} + ∂c(x)/∂x_{k_i} |_{x = x₀,K} for i ∈ [|K|]; and γ(x) = c(x) + x_k.

Furthermore, the modified bound in (35) specialized to the SSNM reads as

M_SSNM(c(·), x₀) ≥ exp(−‖(I − P)x₀‖₂²/σ²) σ² ‖b_{x₀}‖₂².   (56)

Because H = I, we have P = H_K(H_K)† = I_K(I_K)† = Σ_{l∈K} e_l e_lᵀ. Therefore, multiplying x₀ by I − P simply zeros all entries of x₀ whose indices belong to K, i.e., (I − P)x₀ = x₀,supp(x₀)∖K, and thus (56) becomes

M_SSNM(c(·), x₀) ≥ exp(−‖x₀,supp(x₀)∖K‖₂²/σ²) σ² ‖b_{x₀}‖₂².   (57)

For unbiased estimation (c(·) ≡ 0), the following lower bound on M_SSNM(c(·) ≡ 0, x₀) is based on the Hammersley–Chapman–Robbins bound (HCRB) [18], [29], [48]. This bound has been previously derived in a slightly different form in [13].

Theorem VIII.8.
Consider the estimation problem E SSNM = (cid:0) X S , f I ( y ; x ) , g ( x ) = x k (cid:1) with k ∈ [ N ] and theprescribed bias function c ( · ) ≡ . Let x ∈ X S . Then, M SSNM ( c ( · ) , x ) ≥ σ , if | supp( x ) ∪ { k }| ≤ Sσ N − S − N − S exp( − ξ /σ ) , if | supp( x ) ∪ { k }| = S + 1 , (58) where ξ denotes the value of the S -largest (in magnitude) entry of x . In [31, Thm. 5.4.2], it is shown that the bound (58) for | supp( x ) ∪ { k }| ≤ S is obtained from thegeneric bound (31) by using for the subspace U the limit of U ( t ) , span (cid:8) u ( · ) , { u ( t ) l ( · ) } l ∈ [ N ] (cid:9) as t → . Here, u ( · ) , R SSNM , x ( · , x ) and u ( t ) l ( · ) , R SSNM , x ( · , x + t e l ) − R SSNM , x ( · , x ) , if l ∈ supp( x ) R SSNM , x ( · , x − ξ e j + t e l ) − R SSNM , x ( · , x ) , if l ∈ [ N ] \ supp( x ) , l ∈ [ N ] , where j denotes the index of the S -largest (in magnitude) entry of x . Similarly, the bound (58) for | supp( x ) ∪{ k }| = S + 1 is obtained from (31) by using for U the limit of e U ( t ) , span (cid:8) u ( · ) , u ( t ) ( · ) (cid:9) as t → , where u ( t ) ( · ) , R SSNM , x ( · , x + t e k ) − R SSNM , x ( · , x ) . (An expression of R SSNM , x ( · , · ) was given in (42).) In[13], an equivalent bound on the MSE (equivalently, on the variance, because c ( · ) ≡ ) was formulated for avector-valued estimator ˆ x ( · ) ; that bound can be obtained by summing (58) over all k ∈ [ N ] . x x (a) x x (b) x x (c) x x (d)Fig. 1. Examples of ℓ q -balls of radius S = 1 , B q (1) , in R : (a) q = 0 , (b) q = 0 . , (c) q = 0 . , (d) q = 1 . It can be shown that the HCRB-type bound (58) is tighter (higher) than the CRB (55) specialized to c ( · ) ≡ .For | supp( x ) ∪ { k }| = S + 1 (which is true if both k x k = S and k supp( x ) ), the HCRB-type bound(58) is a strictly upper semi-continuous function of x , just as the CRB (55). Hence, it again follows fromCorollary V.5 that the bound cannot be tight, i.e., in general, we have a strict inequality in (58). However, for | supp( x ) ∪ { k }| ≤ S (which is true either if k x k < S or if both k x k = S and k ∈ supp( x ) ), the bound(58) is tight since it is achieved by the LS estimator ˆ x LS ,k ( y ) = y k .IX. E XACT VERSUS A PPROXIMATE S PARSITY
So far, the parameter set X has been the set X S of S -sparse vectors. In this section, we consider anapproximate version of S -sparsity, which is modeled by a modified parameter set X . Following [8], [10], and[4], we define this modified parameter set to be the ℓ q -ball of radius S , i.e., X = B q ( S ) , (cid:8) x ′ ∈ R N (cid:12)(cid:12) k x ′ k q ≤ S (cid:9) , with ≤ q ≤ . The parameter set X S of “exactly” S -sparse vectors is a special case obtained for q = 0 , i.e., X S = B ( S ) . InFig. 1, we illustrate B q ( S ) in R for S = 1 and various values of q . In contrast to X S = B ( S ) , the parametersets B q ( S ) with q > are bounded, i.e., for every q > and S ∈ [ N ] , B q ( S ) is contained in a finite ball about . Thus, the set X S of exactly S -sparse vectors is not a subset of B q ( S ) for any q > . For a given system matrix H ∈ R M × N , sparsity degree S ≤ N , and index k ∈ [ N ] , let us consider theestimation problem E ( q ) , (cid:0) B q ( S ) , f H ( y ; x ) , g ( x ) = x k (cid:1) . Note that E ( q ) differs from the SLGM-based estimation problem E SLGM = (cid:0) X S , f H ( y ; x ) , g ( x ) = x k (cid:1) only inthe parameter set X , which is B q ( S ) instead of X S . Because B ( S ) = X S , we have E (0) = E SLGM . Furthermore,we consider a bias function c ( · ) : R N → R that is defined on all of R N , and a parameter vector x ∈ B q ( S ) ∩ X S .For E SLGM , as before, the bias function c ( · ) is prescribed on X S , i.e., we consider estimators ˆ x k ( · ) satisfying(cf. (6)) b (ˆ x k ( · ); x ) = c ( x ) , for all x ∈ X S . Again as before, the minimum achievable variance at x is denoted as M SLGM ( c ( · ) , x ) . On the other hand, for E ( q ) , the bias function c ( · ) is prescribed on B q ( S ) , i.e., we consider estimators ˆ x k ( · ) satisfying b (ˆ x k ( · ); x ) = c ( x ) , for all x ∈ B q ( S ) . Here, the minimum achievable variance at x is denoted as M ( q ) ( c ( · ) , x ) .Evidently, because B ( S ) = X S and E (0) = E SLGM , we have M (0) ( c ( · ) , x ) = M SLGM ( c ( · ) , x ) . It seemstempting to conjecture that M ( q ) ( c ( · ) , x ) ≈ M SLGM ( c ( · ) , x ) for q ≈ , i.e., changing the parameter set X from X S = B ( S ) to B q ( S ) with q > , and hence considering E ( q ) instead of E SLGM , should not result in asignificantly different minimum achievable variance as long as q is sufficiently small. However, the next result[31, Thm. 5.6.1] implies that there is a decisive difference, no matter how small q is. Theorem IX.1.
Consider a subset
X ⊆ R N that contains an open set, and a function c ( · ) : R N → R thatis valid at some x ∈ X for the LGM-based estimation problem E LGM = (cid:0) R N , f H ( y ; x ) , g ( x ) = x k (cid:1) , withsome system matrix H that does not necessarily satisfy condition (4) . Let M LGM ( c ( · ) , x ) denote the minimumachievable variance at x for E LGM with bias function c ( · ) prescribed on R N . Furthermore let M ′ ( c ( · ) , x ) denote the minimum achievable variance at x for the estimation problem E ′ , (cid:0) X , f H ( y ; x ) , g ( x ) = x k (cid:1) withbias function c ( · ) prescribed on X . Then M ′ ( c ( · ) , x ) = M LGM ( c ( · ) , x ) . Moreover, the LMV estimator ˆ g ( c ( · ) , x ) LGM ( · ) for E LGM and bias function c ( · ) is simultaneously the LMV estimatorfor E ′ and bias function c ( · ) (cid:12)(cid:12) X . Since for q > , the parameter set X = B q ( S ) contains an open set, Theorem IX.1 implies that M ( q ) ( c ( · ) , x ) = M LGM ( c ( · ) , x ) , for all q > . Thus, the minimum achievable variance for E ( q ) , q > with bias function c ( · ) prescribed on B q ( S ) is alwaysequal to the minimum achievable variance for E LGM with bias function c ( · ) prescribed on R N . Furthermore, This estimator is given by Part 3 of Theorem V.4 specialized to S = N (in which case the SLGM reduces to the LGM). Theorem IX.1 also implies that the minimum achievable variance for E ( q ) = (cid:0) B q ( S ) , f H ( y ; x ) , g ( x ) = x k (cid:1) , q > is achieved by the LMV estimator for E LGM = (cid:0) R N , f H ( y ; x ) , g ( x ) = x k (cid:1) . But since in general M LGM ( c ( · ) , x ) > M SLGM ( c ( · ) , x ) (see (47) for the special case given by the SSNM), it follows that M ( q ) ( c ( · ) , x ) = M LGM ( c ( · ) , x ) does not generally converge to M SLGM ( c ( · ) , x ) as q approaches .For another interesting consequence of Theorem IX.1, consider an estimation problem E = (cid:0) X , f H ( y ; x ) , g ( x ) = x k (cid:1) whose parameter set X is the union of the set of exactly S -sparse vectors X S and an open ball B ( x c , r ) , { x ∈ R N | k x − x c k < r } ), i.e., X = X S ∪ B ( x c , r ) . Then, it follows from Theorem IX.1 that theminimum achievable variance for E at any sparse x ∈ X S coincides with M LGM ( c ( · ) , x ) . Since in general M LGM ( c ( · ) , x ) > M SLGM ( c ( · ) , x ) this implies that the minimum achievable variance for E is in general strictlylarger than the minimum achievable variance for the SLGM. Thus, no matter how small the radius r is andhow distant x c is from X S , the inclusion of the open ball in X significantly affects the MVE of the S -sparsevectors in X S .The statement of Theorem IX.1 is closely related to the facts that (i) the statistical model of the LGMbelongs to an exponential family, and (ii) the mean function γ ( x ) = E x { ˆ g ( y ) } of any estimator ˆ g ( · ) with finitebias and variance for an estimation problem whose statistical model belongs to an exponential family is ananalytic function [34, Lemma 2.8]. Indeed, any analytic function is completely determined by its values on anarbitrary open set in its domain [19]. Therefore, because the mean function γ ( x ) of any estimator for the LGMis analytic, it is completely specified by its values for all x ∈ B q ( S ) with an arbitrary q > (note that B q ( S ) contains an open set). X. N UMERICAL R ESULTS
In this section, we compare the lower variance bounds presented in Section VI with the actual variance behavior of some well-known estimators. We consider the SLGM-based estimation problem E_SLGM = (X_S, f_H(y; x), g(x) = x_k) for k ∈ [N]. In what follows, we will denote the lower bounds (32), (34), and (35) by B_k^{(1)}(c(·), x), B_k^{(2)}(c(·), x), and B_k^{(3)}(c(·), x), respectively. We recall that the latter two bounds depend on an index set K ⊆ [N] with |K| ≤ S, which can be chosen freely.

Let x̂(·) be an estimator of x with bias function c(·). Because of (9), a lower bound on the estimator variance v(x̂(·); x) can be obtained by summing with respect to k ∈ [N] the “scalar bounds” B_k^{(1)}(c_k(·), x), B_k^{(2)}(c_k(·), x), or B_k^{(3)}(c_k(·), x), where c_k(·) ≜ (c(·))_k, i.e.,

v(x̂(·); x) ≥ B^{(1/2/3)}(c(·), x) ≜ Σ_{k∈[N]} B_k^{(1/2/3)}(c_k(·), x).   (59)

Here, the index sets K_k used in B_k^{(2)}(c_k(·), x) and B_k^{(3)}(c_k(·), x) can be chosen differently for different k.

A. An SLGM View of Fourier Analysis
Our first example is inspired by [17, Example 4.2]. We consider the SLGM with N even, i.e., N = 2L, and σ² = 1. The system matrix H ∈ R^{M×N} is given by H_{m,l} = cos(θ_l(m − 1)) for m ∈ [M] and l ∈ [L], and
H_{m,l} = sin(θ_l(m − 1)) for m ∈ [M] and l ∈ {L + 1, …, 2L}. Here, the normalized angular frequencies θ_l are uniformly spaced according to θ_l = θ₁ + [(l − 1) mod L] ∆θ, l ∈ [N]. The multiplication of x by H then corresponds to an inverse discrete Fourier transform that maps L spectral samples (the entries of x) to M temporal samples (the entries of Hx). In our simulation, we chose M = 128, L = 8 (hence, N = 16), S = 4, and a frequency spacing ∆θ that is about half the nominal DFT frequency resolution.

We consider the OMP estimator x̂_OMP(·) that is obtained by applying the OMP [21], [40] with S = 4 iterations to the observation y. We used Monte Carlo simulation with randomly generated noise n ∼ N(0, I) to estimate the variance v(x̂_OMP(·); x) of x̂_OMP(·). The parameter vector was chosen as x = √SNR · x̃, where x̃ ∈ {0, 1}^N with |supp(x̃)| = 4, and the SNR varies over the range shown in Fig. 2. Thus, the observation y is a noisy superposition of four sinusoidal components with identical amplitudes; two of them are cosine and sine components at the frequency θ₁ + 2∆θ, and two are cosine and sine components at the frequency θ₁ + 5∆θ.

Fig. 2. Variance of the OMP estimator and corresponding lower bounds versus SNR, for the SLGM with N = 16, M = 128, S = 4, and σ² = 1.

In Fig. 2, we plot v(x̂_OMP(·); x) versus the SNR. For comparison, we also plot the lower bounds B^{(1)}(c_OMP(·), x), B^{(2)}(c_OMP(·), x), and B^{(3)}(c_OMP(·), x) in (59), with c_OMP(x) ≜ b(x̂_OMP(·); x) being the actual bias function of the OMP estimator x̂_OMP(·). To evaluate these bounds, we computed the first-order partial derivatives of the bias functions c_OMP,k(x) (see Theorems VI.1 and VI.2) by means of (40) and Monte Carlo simulation (see [28] for details). The index sets K_k in the bounds B^{(2)}(c_OMP(·), x) and B^{(3)}(c_OMP(·), x) were chosen as K_k = supp(x) for k ∈ supp(x) and K_k = {k} for k ∉ supp(x). This is the simplest nontrivial choice of the K_k for which B^{(3)}(c_OMP(·), x) is tighter than the state-of-the-art bound B^{(1)}(c_OMP(·), x) (the sparse CRB, which was originally presented in [11]). Finally, Fig. 2 also shows the “oracle CRB,” which is defined as the CRB for known supp(x). This is simply the CRB for a linear Gaussian model with system matrix H_{supp(x)} and is thus given by tr((H_{supp(x)}ᵀ H_{supp(x)})⁻¹) [17] for all values of the SNR (recall that we set σ² = 1).

As can be seen from Fig. 2, for SNR below 20 dB, v(x̂_OMP(·); x) is significantly higher than the four lower bounds. This suggests that there might exist estimators with the same bias as that of the OMP estimator but a smaller variance; however, a positive statement regarding the existence of such estimators cannot be based on our analysis. For SNR larger than about 15 dB, the four lower bounds coincide. Furthermore, for SNR larger than about 11 dB, v(x̂_OMP(·); x) quickly converges toward the lower bounds. This is because for high SNR, the OMP estimator is able to detect supp(x) with very high probability. Note also that the results in Fig. 2 agree with our observation in Section VI-B, around (36), that the bound B^{(3)}(c(·), x) tends to be higher than B^{(2)}(c(·), x).
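The following sketch reproduces the flavor of this experiment: it builds a cosine/sine dictionary, applies a basic OMP with S iterations, and estimates the estimator variance by Monte Carlo. The OMP variant, the frequency values, and the chosen support are illustrative and not the exact setup used for Fig. 2.

```python
import numpy as np

def omp(H, y, n_iter):
    """Basic orthogonal matching pursuit: greedily select columns of H and
    re-fit by least squares on the selected support (an illustrative variant)."""
    M, N = H.shape
    support, r = [], y.copy()
    for _ in range(n_iter):
        j = int(np.argmax(np.abs(H.T @ r)))
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(H[:, support], y, rcond=None)
        r = y - H[:, support] @ coef
    x_hat = np.zeros(N)
    x_hat[support] = coef
    return x_hat

# illustrative Fourier-type dictionary (values are placeholders, not the paper's)
M, L = 128, 8
N, S = 2 * L, 4
theta1, dtheta = 0.2, 0.02
theta = theta1 + (np.arange(N) % L) * dtheta
m = np.arange(M)[:, None]
H = np.concatenate([np.cos(theta[:L] * m), np.sin(theta[L:] * m)], axis=1)

rng = np.random.default_rng(3)
x = np.zeros(N)
x[[2, 5, 10, 13]] = np.sqrt(10.0)                       # SNR = 10 (illustrative support)
est = np.array([omp(H, H @ x + rng.standard_normal(M), S) for _ in range(2000)])
variance = np.sum(np.var(est, axis=0))                  # summed per-component variances
print("Monte Carlo estimate of v(x_hat_OMP; x):", variance)
```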
B. Minimum Variance Analysis for the SSNM

Next, we consider the maximum-likelihood (ML) estimator and the hard-thresholding (HT) estimator for the SSNM, i.e., for M = N and H = I, with N = 50, S = 5, and σ² = 1. The ML estimator is given by

x̂_ML(y) ≜ argmax_{x′ ∈ X_S} f_I(y; x′) = P_S(y),

where the operator P_S retains the S largest (in magnitude) entries and zeros out all other entries. Closed-form expressions of the mean and variance of the ML estimator were derived in [13]. The HT estimator x̂_HT(·) is given by

x̂_HT,k(y) = x̂_HT,k(y_k) = y_k if |y_k| ≥ T, and 0 else,   k ∈ [N],   (60)

where T is a fixed threshold. Note that in the limiting case T = 0, the HT estimator coincides with the LS estimator x̂_LS(y) = y [17], [18], [27]. The mean and variance of the HT estimator are given by

E_x{x̂_HT,k(y)} = (1/√(2πσ²)) ∫_{R∖[−T,T]} y exp(−(y − x_k)²/(2σ²)) dy   (61)

v(x̂_HT,k(·); x) = (1/√(2πσ²)) ∫_{R∖[−T,T]} y² exp(−(y − x_k)²/(2σ²)) dy − (E_x{x̂_HT,k(y)})².   (62)

We calculated the variances v(x̂_ML(·); x) and v(x̂_HT(·); x) at parameter vectors x = √SNR · x̃, where x̃ ∈ {0, 1}^N, supp(x̃) = [S], and the SNR varies over the range shown in Fig. 3. (The fixed choice supp(x) = [S] is justified by the fact that neither the variances of the ML and HT estimators nor the corresponding variance bounds depend on the location of supp(x).) In particular, v(x̂_HT(·); x) was calculated by numerical evaluation of the integrals (61) and (62). Fig. 3 shows v(x̂_ML(·); x) and v(x̂_HT(·); x)—the latter for four different choices of T in (60)—versus the SNR. Also shown are the lower bounds B^{(2)}(c_ML(·), x) and B^{(3)}(c_ML(·), x) as well as B^{(2)}(c_HT(·), x) and B^{(3)}(c_HT(·), x) (cf. (59)), with c_ML(·) and c_HT(·) being the actual bias functions of x̂_ML(·) and of x̂_HT(·), respectively. The index sets underlying the bounds were chosen as K_k = supp(x) for
k ∈ supp(x) and K_k = {k} ∪ {supp(x) ∖ {j_S}} for k ∉ supp(x), where j_S denotes the index of the S-th largest (in magnitude) entry of x. For this choice of the K_k, the two bounds are equal, i.e., B^{(2)}(c_ML(·), x) = B^{(3)}(c_ML(·), x) and B^{(2)}(c_HT(·), x) = B^{(3)}(c_HT(·), x). The first-order partial derivatives of the bias functions c_ML,k(x) involved in the bounds B^{(2/3)}(c_ML(·), x) were approximated by a finite-difference quotient [28], i.e., ∂c_ML,k(x)/∂x_l = −δ_{k,l} + ∂E_x{x̂_ML,k(y)}/∂x_l with

∂E_x{x̂_ML,k(y)}/∂x_l ≈ ( E_{x+∆e_l}{x̂_ML,k(y)} − E_x{x̂_ML,k(y)} ) / ∆,

where ∆ > 0 is a small stepsize and the expectations were calculated using the closed-form expressions presented in [13, Appendix I]. The first-order partial derivatives of the bias functions c_HT,k(x) involved in the bounds B^{(2/3)}(c_HT(·), x) were calculated by means of (40).

Fig. 3. Variance of the ML and HT estimators and corresponding lower bounds versus SNR, for the SSNM with N = 50, S = 5, and σ² = 1.

It can be seen in Fig. 3 that for SNR larger than about 18 dB, the variances of the ML and HT estimators and the corresponding bounds are effectively equal (for the HT estimator, this is true if T is not too small). Also, all bounds are close to the oracle variance Sσ²; this equals the variance of an oracle estimator that knows supp(x) and is given by x̂_k(y) = y_k for k ∈ supp(x) and x̂_k(y) = 0 otherwise. However, in the medium-SNR range, the variances of the ML and HT estimators are significantly higher than the corresponding lower bounds. We can conclude that there might exist estimators with the same bias as that of the ML or HT estimator but a smaller variance; however, in general, a positive statement regarding the existence of such estimators cannot be based on our analysis.

On the other hand, for the special case of diagonal estimators, such as the HT estimator, Theorem VIII.3 and Corollary VIII.5 make positive statements about the existence of estimators that locally have a smaller variance than the HT estimator. In particular, we can use Corollary VIII.5 to obtain the LMV estimator and corresponding
minimum achievable variance at a parameter vector x ∈ X_S for the given bias function of the HT estimator, c_HT(·). In Fig. 4, we plot the variance v(x̂_HT(·); x) for four different choices of T versus the SNR. We also plot the corresponding minimum achievable variance (Barankin bound) M_HT(x) ≜ Σ_{k∈[N]} M_SSNM(c_HT,k(·), x). Here, M_SSNM(c_HT,k(·), x) was obtained from (53) in Corollary VIII.5. (Note that (53) is applicable because the estimator x̂_HT,k(y) is diagonal and has finite variance at all x ∈ X_S.)

Fig. 4. Variance of the HT estimator, v(x̂_HT(·); x), for different T (solid lines) and corresponding minimum achievable variance (Barankin bound) M_HT(x) (dashed lines) versus SNR, for the SSNM with N = 50, S = 5, and σ² = 1.

It is seen that for small T (including T = 0, where the HT estimator reduces to the LS estimator) and for sufficiently high SNR, v(x̂_HT(·); x) is significantly higher than M_HT(x). However, as T increases, the gap between the v(x̂_HT(·); x) and M_HT(x) curves becomes smaller; in particular, the two curves are almost indistinguishable already for T = 4. For high SNR, M_HT(x) approaches the oracle variance Sσ² for any value of T.
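For completeness, the following sketch evaluates the mean and variance of the scalar HT estimator by numerical quadrature of the integrals (61) and (62) (with the Gaussian weight as reconstructed above) and cross-checks the result by Monte Carlo; the threshold and the value of x_k are illustrative only.

```python
import numpy as np

def ht_mean_var(xk, T, sigma, n_grid=200001, width=12.0):
    """Mean and variance of the scalar HT estimator (60) at x_k, by numerical
    evaluation of the integrals (61) and (62)."""
    y = np.linspace(xk - width * sigma, xk + width * sigma, n_grid)
    dy = y[1] - y[0]
    pdf = np.exp(-(y - xk) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    mask = np.abs(y) >= T                      # integration domain R \ [-T, T]
    mean = np.sum(y * pdf * mask) * dy
    second = np.sum(y ** 2 * pdf * mask) * dy
    return mean, second - mean ** 2

# cross-check against Monte Carlo (illustrative values)
xk, T, sigma = 2.0, 3.0, 1.0
rng = np.random.default_rng(4)
y_mc = xk + sigma * rng.standard_normal(500_000)
est = np.where(np.abs(y_mc) >= T, y_mc, 0.0)
print("quadrature :", ht_mean_var(xk, T, sigma))
print("Monte Carlo:", (est.mean(), est.var()))
```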
XI. CONCLUSION

We used RKHS theory to analyze the MVE problem within the sparse linear Gaussian model (SLGM). In the SLGM, the unknown parameter vector to be estimated is assumed to be sparse with a known sparsity degree, and the observed vector is a linearly transformed version of the parameter vector that is corrupted by i.i.d. Gaussian noise with a known variance. The RKHS framework allowed us to establish a geometric interpretation of existing lower bounds on the estimator variance and to derive novel lower bounds on the estimator variance, in both cases under a bias constraint. These bounds were obtained by an orthogonal projection of the prescribed mean function onto a subspace of the RKHS associated with the SLGM. Viewed as functions of the SNR, the bounds were observed to vary between two extreme regimes. On the one hand, there is a low-SNR regime where the entries of the true parameter vector are small compared with the noise variance. Here, our bounds predict that if the estimator bias is approximately zero, the a priori sparsity information does not help much in the estimation; however, if the bias is allowed to be nonzero, the estimator variance can be reduced by the sparsity information. On the other hand, there is a high-SNR regime where the nonzero entries of the true parameter vector are large compared with the noise variance. Here, our bounds coincide with the Cramér–Rao bound of an associated conventional linear Gaussian model in which the support of the unknown parameter vector is assumed known. Our bounds exhibit a steep transition between these two regimes; in general, this transition has an exponential decay.

For the special case of the SLGM that corresponds to the recovery problem in a linear compressed sensing scheme, we expressed our lower bounds in terms of the restricted isometry and coherence parameters of the measurement matrix. Furthermore, for the special case of the SLGM given by the sparse signal in noise model (SSNM), we derived closed-form expressions of the minimum achievable variance and the corresponding LMV estimator. These results include closed-form expressions of the (unbiased) Barankin bound and of the LMVU estimator for the SSNM. Simplified expressions of the minimum achievable variance and the LMV estimator were presented for the subclass of “diagonal” bias functions.

An analysis of the effects of exact and approximate sparsity information from the MVE perspective showed that the minimum achievable variance under an exact sparsity constraint is not a limiting case of the minimum achievable variance under an approximate sparsity constraint.

Finally, a comparison of our bounds with the actual variance of established estimators for the SLGM and SSNM (maximum-likelihood estimator, hard-thresholding estimator, least-squares estimator, and orthogonal matching pursuit) showed that there might exist estimators with the same bias but a smaller variance.

An interesting direction for future investigations is the search for (classes of) estimators that asymptotically approach our lower variance bounds when the estimation is based on an increasing number of i.i.d. observation vectors y_i. In the unbiased case, the maximum-likelihood estimator can be intuitively expected to achieve the variance bounds asymptotically. However, a rigorous proof of this conjecture seems to be nontrivial. Indeed, most studies of the asymptotic behavior of maximum-likelihood estimators assume that the parameter set is an open subset of R^N [18], [49], [50], which is not the case for the parameter set X_S.
For the popular classof M-estimators or penalized maximum likelihood estimators, a characterization of the asymptotic behavior isavailable [30], [50], [51]. Under mild conditions, M-estimators allow an efficient implementation via convexoptimization techniques.Furthermore, it would be interesting to generalize our results to the case of block or group sparsity [52]–[54].This could be useful, e.g., for sparse channel estimation in the case of clustered scatterers and delay-Dopplerleakage [55] and for the estimation of structured sparse spectra (extending sparsity-exploiting spectral estimationas proposed in [56]–[59]). R EFERENCES [1] C. Carbonelli, S. Vedantam, and U. Mitra, “Sparse channel estimation with zero tap detection,”
IEEE Trans. Wireless Comm. ,vol. 6, no. 5, pp. 1743–1763, May 2007.[2] S. G. Mallat,
A Wavelet Tour of Signal Processing – The Sparse Way , 3rd ed. San Diego, CA: Academic Press, 2009.[3] M. Dong and L. Tong, “Optimal design and placement of pilot symbols for channel estimation,”
IEEE Trans. Signal Processing ,vol. 50, no. 12, pp. 3055–3069, Dec 2002.[4] D. L. Donoho and I. M. Johnstone, “Ideal spatial adaptation by wavelet shrinkage,”
Biometrika , vol. 81, pp. 425–455, 1994.[5] D. L. Donoho, “Compressed sensing,”
IEEE Trans. Inf. Theory , vol. 52, no. 4, pp. 1289–1306, April 2006.[6] E. Candès and M. Wakin, “An introduction to compressive sampling,”
IEEE Signal Processing Magazine , vol. 25, no. 2, pp. 21–30,March 2008.[7] E. J. Candès, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,”
Comm. Pure Appl.Math. , vol. 59, no. 8, pp. 1207–1223, Aug. 2006.[8] G. Raskutti, M. J. Wainwright, and B. Yu, “Minimax rates of estimation for high-dimensional linear regression over ℓ q -balls,” IEEE Trans. Inf. Theory , vol. 57, no. 10, pp. 6976–6994, Oct. 2011.[9] N. Verzelen, “Minimax risks for sparse regressions: Ultra-high-dimensional phenomenons,”
Electron. J. Statist. , vol. 6, pp. 38–90,2012.[10] D. L. Donoho and I. M. Johnstone, “Minimax risk over ℓ p -balls for ℓ q -error,” Probab. Theory Relat. Fields , vol. 99, pp. 277–303,1994.[11] Z. Ben-Haim and Y. C. Eldar, “The Cramér–Rao bound for estimating a sparse parameter vector,”
IEEE Trans. Signal Processing ,vol. 58, pp. 3384–3389, June 2010.[12] ——, “Performance bounds for sparse estimation with random noise,” in
Proc. IEEE-SP Workshop Statist. Signal Process. , Cardiff,Wales, UK, Aug. 2009, pp. 225–228.[13] A. Jung, Z. Ben-Haim, F. Hlawatsch, and Y. C. Eldar, “Unbiased estimation of a sparse vector in white Gaussian noise,”
IEEETrans. Inf. Theory , vol. 57, no. 12, pp. 7856–7876, Dec. 2011.[14] N. Aronszajn, “Theory of reproducing kernels,”
Trans. Am. Math. Soc. , vol. 68, no. 3, pp. 337–404, May 1950.[15] E. Parzen, “Statistical inference on time series by Hilbert space methods, I.” Appl. Math. Stat. Lab., Stanford University, Stanford,CA, Tech. Rep. 23, Jan. 1959.[16] D. D. Duttweiler and T. Kailath, “RKHS approach to detection and estimation problems – Part V: Parameter estimation,”
IEEETrans. Inf. Theory , vol. 19, no. 1, pp. 29–37, Jan. 1973.[17] S. M. Kay,
Fundamentals of Statistical Signal Processing: Estimation Theory . Englewood Cliffs, NJ: Prentice Hall, 1993.[18] E. L. Lehmann and G. Casella,
Theory of Point Estimation , 2nd ed. New York: Springer, 1998.[19] S. G. Krantz and H. R. Parks,
A Primer of Real Analytic Functions , 2nd ed. Boston, MA: Birkhäuser, 2002.[20] G. H. Golub and C. F. Van Loan,
Matrix Computations , 3rd ed. Baltimore, MD: Johns Hopkins University Press, 1996.[21] J. A. Tropp, “Greed is Good: Algorithmic results for sparse approximation,”
IEEE Trans. Inf. Theory , vol. 50, no. 10, pp. 2231–2242, Oct. 2004.[22] D. L. Donoho and M. Elad, “Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ minimization,” Proc.Nat. Acad. Sci. , vol. 100, no. 5, pp. 2197–2202, March 2003.[23] A. Papoulis,
Probability, Random Variables, and Stochastic Processes , 3rd ed. New York: McGraw-Hill, 1991.[24] N. Higham, “Newton’s method for the matrix square root,”
Mathematics of Computation , vol. 46, no. 174, pp. 537–549, Apr.1986.[25] H. Leeb and B. M. Pötscher, “Sparse estimators and the oracle property, or the return of Hodges’ estimator,”
Journal ofEconometrics , vol. 142, no. 1, pp. 201–211, 2008.[26] H. V. Poor,
An Introduction to Signal Detection and Estimation . New York: Springer, 1988. [27] L. L. Scharf, Statistical Signal Processing . Reading (MA): Addison Wesley, 1991.[28] A. O. Hero III, J. Fessler, and M. Usman, “Exploring estimator bias-variance tradeoffs using the uniform CR bound,”
IEEE Trans.Signal Processing , vol. 44, no. 8, pp. 2026–2041, Aug. 1996.[29] A. Jung, S. Schmutzhard, and F. Hlawatsch, “The RKHS approach to minimum variance estimation revisited: Variance bounds,sufficient statistics, and exponential families,” submitted to IEEE Trans. Inf. Theory , Oct. 2012, available online: arXiv:1210.6516.[30] Y. C. Eldar,
Rethinking Biased Estimation: Improving Maximum Likelihood and the Cramér–Rao Bound , ser. Foundations andTrends in Signal Processing. Hanover, MA: Now Publishers, 2007, vol. 1, no. 4.[31] A. Jung, “An RKHS Approach to Estimation with Sparsity Constraints,” Ph.D. dissertation, Vienna University of Technology,2011.[32] W. Rudin,
Real and Complex Analysis , 3rd ed. New York: McGraw-Hill, 1987.[33] D.-X. Zhou, “Derivative reproducing properties for kernel methods in learning theory,”
J. Comput. Appl. Math. , vol. 220, no. 1-2,pp. 456–463, Oct. 2008.[34] L. D. Brown,
Fundamentals of Statistical Exponential Families , ser. Lecture Notes – Monograph Series. Hayward, CA: Instituteof Mathematical Statistics, 1986.[35] W. Rudin,
Principles of Mathematical Analysis , 3rd ed. New York: McGraw-Hill, 1976.[36] S. Schmutzhard, A. Jung, F. Hlawatsch, Z. Ben-Haim, and Y. C. Eldar, “A lower bound on the estimator variance for the sparselinear model,” in
Proc. 44th Asilomar Conf. Signals, Systems, Computers , Pacific Grove, CA, Nov. 2010, pp. 1976–1980.[37] J. A. Tropp, “Just relax: Convex programming methods for identifying sparse signals in noise,”
IEEE Trans. Inf. Theory , vol. 50,no. 3, pp. 1030–1051, March 2004.[38] Z. Ben-Haim, Y. C. Eldar, and M. Elad, “Coherence-based performance guarantees for estimating a sparse vector under randomnoise,”
IEEE Trans. Signal Processing , vol. 58, no. 10, pp. 5030–5043, Oct. 2010.[39] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,”
SIAM J. Scient. Comput. , vol. 20, pp.33–61, 1998.[40] J. A. Tropp and A. C. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,”
IEEE Trans. Inf.Theory , vol. 53, no. 12, pp. 4655–4666, Dec. 2007.[41] D. Needell and R. Vershynin, “Signal recovery from incomplete and inaccurate measurements via regularized orthogonal matchingpursuit,”
IEEE J. Sel. Topics Sig. Proc. , vol. 4, no. 2, pp. 310–316, Apr. 2010.[42] M. Davenport and M. Wakin, “Analysis of orthogonal matching pursuit using the restricted isometry property,”
IEEE Trans. Inf.Theory , vol. 56, no. 9, pp. 4395–4401, Sept. 2010.[43] D. Needell and J. A. Tropp, “CoSaMP: Iterative signal recovery from incomplete and inaccurate samples,”
Appl. Comp. HarmonicAnal. , vol. 26, pp. 301–321, 2008.[44] E. Candès and T. Tao, “The Dantzig selector: Statistical estimation when p is much larger than n ,” Ann. Statist. , vol. 35, no. 6,pp. 2313–2351, 2007.[45] D. L. Donoho and I. M. Johnstone, “Minimax estimation via wavelet shrinkage,”
Ann. Statist. , vol. 26, no. 3, pp. 879–921, 1998.[46] M. Abramowitz and I. A. Stegun, Eds.,
Handbook of Mathematical Functions . New York: Dover, 1965.[47] G. Szegö,
Orthogonal Polynomials . Providence, RI: American Mathematical Society, 1939.[48] J. D. Gorman and A. O. Hero, “Lower bounds for parametric estimation with constraints,”
IEEE Trans. Inf. Theory , vol. 36, no. 6,pp. 1285–1301, Nov. 1990.[49] I. A. Ibragimov and R. Z. Has’minskii,
Statistical Estimation. Asymptotic Theory.
New York: Springer, 1981.[50] A. van der Vaart,
Asymptotic Statistics . Cambridge, UK: Cambridge Univ. Press, 1998.[51] P. J. Huber,
Robust Statistics . New York: Wiley, 1981.[52] Y. C. Eldar, P. Kuppinger, and H. Bölcskei, “Block-sparse signals: Uncertainty relations and efficient recovery,”
IEEE Trans. SignalProcessing , vol. 58, no. 6, pp. 3042–3054, June 2010. [53] M. Mishali and Y. C. Eldar, “Reduce and boost: Recovering arbitrary sets of jointly sparse vectors,” IEEE Trans. Signal Processing ,vol. 56, no. 10, pp. 4692–4702, Oct. 2008.[54] Y. C. Eldar and H. Rauhut, “Average case analysis of multichannel sparse recovery using convex relaxation,”
IEEE Trans. Inf.Theory , vol. 56, no. 1, pp. 505–519, Jan. 2009.[55] D. Eiwen, G. Tauböck, F. Hlawatsch, and H. G. Feichtinger, “Group sparsity methods for compressive channel estimation indoubly dispersive multicarrier systems,” in
Proc. IEEE SPAWC 2010 , Marrakech, Morocco, Jun. 2010, pp. 1–5.[56] A. Jung, G. Tauböck, and F. Hlawatsch, “Compressive spectral estimation for nonstationary random processes,”
IEEE Trans. Inf.Theory , 2013, available online: arXiv:1203.5475.[57] Z. Tian, “Compressed wideband sensing in cooperative cognitive radio networks,” in
Proc. IEEE GLOBECOM 2008 , New Orleans,LA, Dec. 2008, pp. 1–5.[58] Y. Polo, Y. Wang, A. Pandharipande, and G. Leus, “Compressive wide-band spectrum sensing,” in
Proc. IEEE ICASSP-2009 ,Taipei, Taiwan, Apr. 2009, pp. 2337–2340.[59] Z. Tian, Y. Tafesse, and B. Sadler, “Cyclic feature detection with sub-Nyquist sampling for wideband spectrum sensing,”