Nonparametric estimation by convex programming
arXiv [math.ST]
The Annals of Statistics © Institute of Mathematical Statistics, 2009
NONPARAMETRIC ESTIMATION BY CONVEX PROGRAMMING
By Anatoli B. Juditsky and Arkadi S. Nemirovski
Université Grenoble I and Georgia Institute of Technology
The problem we concentrate on is as follows: given (1) a convex compact set $X$ in $\mathbb{R}^n$, an affine mapping $x \mapsto A(x)$, a parametric family $\{p_\mu(\cdot)\}$ of probability densities and (2) $N$ i.i.d. observations of the random variable $\omega$, distributed with the density $p_{A(x)}(\cdot)$ for some (unknown) $x \in X$, estimate the value $g^T x$ of a given linear form at $x$. For several families $\{p_\mu(\cdot)\}$ with no additional assumptions on $X$ and $A$, we develop computationally efficient estimation routines which are minimax optimal, within an absolute constant factor. We then apply these routines to recovering $x$ itself in the Euclidean norm.
1. Introduction.
The problem we are interested in is essentially as follows: suppose that we are given a convex compact set $X$ in $\mathbb{R}^n$, an affine mapping $x \mapsto A(x)$ and a parametric family $\{p_\mu(\cdot)\}$ of probability densities. Suppose that $N$ i.i.d. observations of the random variable $\omega$, distributed with the density $p_{A(x)}(\cdot)$ for some (unknown) $x \in X$, are available. Our objective is to estimate the value $g^T x$ of a given linear form at $x$.

In nonparametric statistics, there exists an immense literature on various versions of this problem (see, e.g., [10, 11, 12, 13, 15, 17, 18, 21, 22, 23, 24, 25, 26, 27, 28] and the references therein). To the best of our knowledge, the majority of papers on the subject focus on specific domains $X$ (e.g., distributions with densities from Sobolev balls), and investigate lower and upper bounds on the worst-case, with regard to $x \in X$, accuracy to which the problem of interest can be solved. These bounds depend on the number of observations $N$, and the question of primary interest is the behavior of those bounds as $N \to \infty$. When the lower and the upper bounds coincide within a constant factor [or, ideally, within factor $(1 + o(1))$ as $N \to \infty$], the estimation problem is considered essentially solved, and the estimation methods underlying the upper bounds are treated as optimal.

Received March 2008; revised July 2008. Supported in part by the NSF Grant 0619977.
AMS 2000 subject classifications. Primary 62G08; secondary 62G15, 62G07.
Key words and phrases. Estimation of linear functional, minimax estimation, oracle inequalities, convex optimization, PET tomography.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2009, Vol. 37, No. 5A, 2278–2300. This reprint differs from the original in pagination and typographic detail.
The approach we adopt in this paper is of a different spirit; we make no "structural assumptions" on $X$, aside from the assumptions of convexity and compactness which are crucial for us, and we make no assumptions on the linear functional $g$. Clearly, with no structural assumptions on $X$ and $g$, explicit bounds on the risks of our estimates, as well as bounds on the minimax optimal risk, are impossible. However, it is possible to show that when estimating linear forms, the worst-case risk of the estimator we propose is within an absolute constant factor of the "ideal" (i.e., the minimax optimal) risk. It should be added that while the optimal, within an absolute constant factor, worst-case risk of our estimates is not available in a closed analytical form, it is "available algorithmically"—it can be efficiently computed, provided that $X$ is computationally tractable.

Note that the estimation problem presented above can be seen as a generalization of the problem of estimation of linear functionals of the central parameter of a normal distribution (see [4, 8, 9, 16]). Namely, suppose that the observation $\omega \in \mathbb{R}^m$, $\omega = Ax + \sigma\xi$, of the unknown signal $x$ is available; here $A$ is a given $m \times n$ matrix, $\xi \sim \mathcal{N}(0, I_m)$ and $\sigma > 0$ is known. In this case (see [5] and Section 4), the minimax optimal affine in $\omega$ estimate is minimax optimal, within an absolute constant factor, among all possible estimates.

Another special case of our setting is the problem of estimating a linear functional $g(p)$ of an unknown distribution $p$, given $N$ i.i.d. observations $\omega_1, \ldots, \omega_N$ drawn from $p$. We suppose that it is known a priori that $p \in X$, where $X$ is a given convex compact set of distributions (here the parameter $x$ is the density $p$ itself). Some important results for this problem have been obtained in [6] and [7]. For instance, in [7] the authors established minimax bounds for the risk of estimation of $g(p)$ and developed an estimation method based on the binary search algorithm.
The estimation procedure uses at each search iteration tests of convex hypotheses, studied in [2, 3]. That estimator of $g(p)$ is shown to be minimax optimal (within an absolute constant factor) if some basic structural assumptions about $X$ hold.

In this paper, we concentrate on the properties of affine estimators. Here, we refer to an estimator $\hat{g}$ as affine when it is of the form $\hat{g}(\omega_1, \ldots, \omega_N) = \sum_{i=1}^N \phi(\omega_i)$ for some given function $\phi$, that is, if $\hat{g}$ is an affine function of the empirical distribution. When $\phi$ itself is an affine function, the estimator is also affine in the observations, as it is in the setting of [5]. Our motivation is to extend the results obtained in [5] to the non-Gaussian situation. In particular, we propose a technique of derivation of affine estimators which are minimax optimal (up to a moderate absolute constant) for a class of "good parametric families of distributions," which is defined in Section 2.1. As the normal family and discrete distributions belong to the class of good parametric families, the minimax optimal estimators for these cases are obtained by direct application of the general construction. In this sense, our results generalize those of [7] and [5] on the estimation of linear functionals. On the other hand, it is clear that the different techniques presented in the current paper inherit from those developed in [3] and [7]. To make a computationally efficient solution of the estimation problem possible, unlike the authors of those papers, we concentrate only on the finite-dimensional situation.

(Footnote: For details on computational tractability and complexity issues in convex optimization, see, for example, [1], Chapter 4. A reader not familiar with this area will not lose much when interpreting a computationally tractable convex set as a set given by a finite system of inequalities $p_i(x) \leq 0$, $i = 1, \ldots, m$, where $p_i(x)$ are convex polynomials.)
As a result, the proposed estimation procedures allow efficient numeric implementation. This also allows us to avoid much of the intricate mathematical details. However, we allow the dimension to be arbitrarily large, thus addressing, essentially, a nonparametric estimation problem.

The rest of this paper is organized as follows. In Section 2, we define the main components of our study—we state the estimation problem and define the corresponding risk measures. Then, in Section 3, we provide the general solution to the estimation problem, which is then applied, in Section 4, to the problems of estimating linear functionals in the normal model and the tomography model. Finally, in Section 5, we present adaptive versions of affine estimators.

Note that when passing from recovering linear forms of the unknown signal to recovering the signal itself, we do impose structural assumptions on $X$, but still make no structural assumptions on the affine mapping $A(x)$. Our "optimality results" become weaker—instead of "optimality within an absolute constant factor" we end up with statements like "the worst-case risk of such-and-such estimate is in between the minimax optimal risk and the latter risk to the power $\chi$," with $\chi$ depending on the geometry of $X$ (and close to 1 when this geometry is "good enough").
2. Problem statement.
2.1. Good parametric families of distributions.
Let $(\Omega, P)$ be a Polish space with Borel $\sigma$-finite measure $P$, and let $M \subset \mathbb{R}^m$. Assume that every $\mu \in M$ is associated with a probability density $p_\mu(\omega)$—a Borel nonnegative function on $\Omega$ such that $\int_\Omega p_\mu(\omega)\,P(d\omega) = 1$; we refer to the mapping $\mu \to p_\mu(\cdot)$ as a parametric density family $\mathcal{D}$. Let also $\mathcal{F}$ be a finite-dimensional linear space of Borel functions on $\Omega$ which contains constants. We call a pair $(\mathcal{D}, \mathcal{F})$ good if it possesses the following properties:

1. $M$ is an open convex set in $\mathbb{R}^m$;
2. whenever $\mu \in M$, we have $p_\mu(\omega) > 0$ for all $\omega \in \Omega$;
3. whenever $\mu, \nu \in M$, we have $\phi(\omega) = \ln(p_\mu(\omega)/p_\nu(\omega)) \in \mathcal{F}$;
4. whenever $\phi(\omega) \in \mathcal{F}$, the function
$$F_\phi(\mu) = \ln\left(\int_\Omega e^{\phi(\omega)}\,p_\mu(\omega)\,P(d\omega)\right)$$
is well defined and concave in $\mu \in M$.

The reader familiar with exponential families will immediately recognize that the above definition implies that $\mathcal{D}$ is such a family. Denoting by $p_\mu(\omega) = \exp\{\theta(\mu)^T\omega - C(\theta(\mu))\}$, $\mu \in M$, its density with regard to $P$, where $\theta$ is the natural parameter and $C(\cdot)$ is the cumulant function, $\mathcal{D}$ is good if:

1. $M$ is an open convex set in $D_P = \{\mu \in \mathbb{R}^m : \int e^{\theta(\mu)^T\omega}\,P(d\omega) < \infty\}$;
2. for any $\phi$ such that the cumulant function $C(\theta(\mu) + \phi)$ is well defined, the function $[C(\theta(\mu) + \phi) - C(\theta(\mu))]$ is concave in $\mu \in M$.

Let us list several examples.

Example 1 (Discrete distributions). Let $\Omega = \{1, 2, \ldots, M\}$ be a finite set, $P$ be the counting measure on $\Omega$, $M = \{\mu \in \mathbb{R}^M : \mu > 0,\ \sum_i \mu_i = 1\}$ and $p_\mu(i) = \mu_i$, $i = 1, \ldots, M$. Let also $\mathcal{F}$ be the set of all functions on $\Omega$. The associated pair $(\mathcal{D}, \mathcal{F})$ clearly is good.

Example 2 (Poisson distributions). Let $\Omega = \{0, 1, 2, \ldots\}$, $P$ be the counting measure on $\Omega$, $M = \{\mu \in \mathbb{R} : \mu > 0\}$ and $p_\mu(i) = \mu^i e^{-\mu}/i!$, $i \in \Omega$, so that $p_\mu$ is the Poisson distribution with the parameter $\mu$. Let also $\mathcal{F}$ be the set of affine functions $\phi(i) = \alpha i + \beta$ on $\Omega$. We claim that the associated pair $(\mathcal{D}, \mathcal{F})$ is good.
Indeed, $\ln(p_\mu(i)/p_\nu(i)) = i[\ln\mu - \ln\nu] + \nu - \mu$ is an affine function of $i$, and
$$\ln\left(\sum_i e^{\alpha i + \beta}\,\frac{\mu^i e^{-\mu}}{i!}\right) = \ln\left(e^{\beta - \mu}\,e^{\mu e^{\alpha}}\right) = \beta - \mu + \mu e^{\alpha}$$
is a concave (in fact, affine) function of $\mu > 0$.

Example 3 (Gaussian distributions with fixed covariance). Let $\Omega = \mathbb{R}^k$, $P$ be the Lebesgue measure on $\Omega$, $\Sigma$ be a positive definite $k \times k$ matrix, $M = \mathbb{R}^k$ and
$$p_\mu(\omega) = (2\pi)^{-k/2}(\mathrm{Det}\,\Sigma)^{-1/2}\exp\{-\tfrac{1}{2}(\omega - \mu)^T\Sigma^{-1}(\omega - \mu)\}$$
be the density of the Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$. Let, further, $\mathcal{F}$ be comprised of affine functions on $\Omega$. We claim that the associated pair $(\mathcal{D}, \mathcal{F})$ is good. Indeed, the function $\ln(p_\mu(\omega)/p_\nu(\omega))$ indeed is affine on $\Omega$, and
$$\ln\left(\int e^{\phi^T\omega + c}\,p_\mu(\omega)\,d\omega\right) = c + \phi^T\mu + \tfrac{1}{2}\phi^T\Sigma\phi$$
is an affine (hence concave) function of $\mu$.

Example 4 (Direct product of good pairs). Let $p^\ell_{\mu_\ell}(\omega_\ell)$ be a probability density, parameterized by $\mu_\ell \in M_\ell \subset \mathbb{R}^{m_\ell}$, on a Polish space $\Omega_\ell$ with Borel $\sigma$-finite measure $P_\ell$, and let $\mathcal{F}_\ell$ be a finite-dimensional linear space of Borel functions on $\Omega_\ell$ such that the associated pairs $(\mathcal{D}_\ell, \mathcal{F}_\ell)$ are good. Let us define the direct product $(\mathcal{D}, \mathcal{F}) = \bigotimes_{\ell=1}^L (\mathcal{D}_\ell, \mathcal{F}_\ell)$ of these pairs as follows:

• The associated space with measure is $(\Omega = \Omega_1 \times \cdots \times \Omega_L,\ P = P_1 \times \cdots \times P_L)$.
• The set of parameters is $M = M_1 \times \cdots \times M_L$, and the density associated with a parameter $\mu = (\mu_1, \ldots, \mu_L)$ from this set is $p_\mu(\omega_1, \ldots, \omega_L) = \prod_{\ell=1}^L p^\ell_{\mu_\ell}(\omega_\ell)$.
• $\mathcal{F}$ is comprised of all functions $\phi(\omega_1, \ldots, \omega_L) = \sum_{\ell=1}^L \phi_\ell(\omega_\ell)$ with $\phi_\ell(\cdot) \in \mathcal{F}_\ell$, $\ell = 1, \ldots, L$.

We claim that the direct product of good pairs is good. Indeed, $M$ is an open convex set; when $\mu = (\mu_1, \ldots, \mu_L)$ and $\nu = (\nu_1, \ldots, \nu_L)$ are in $M$, we have
$$\ln(p_\mu(\omega_1, \ldots, \omega_L)/p_\nu(\omega_1, \ldots, \omega_L)) = \sum_{\ell=1}^L \ln(p^\ell_{\mu_\ell}(\omega_\ell)/p^\ell_{\nu_\ell}(\omega_\ell)) \in \mathcal{F},$$
and when $\phi(\omega_1, \ldots, \omega_L) = \sum_\ell \phi_\ell(\omega_\ell) \in \mathcal{F}$, we have
$$\ln\left(\int_\Omega e^{\phi(\omega)}\,p_\mu(\omega)\,P(d\omega)\right) = \ln\prod_\ell \int_{\Omega_\ell} e^{\phi_\ell(\omega_\ell)}\,p^\ell_{\mu_\ell}(\omega_\ell)\,P_\ell(d\omega_\ell) = \sum_\ell \ln\left(\int_{\Omega_\ell} e^{\phi_\ell(\omega_\ell)}\,p^\ell_{\mu_\ell}(\omega_\ell)\,P_\ell(d\omega_\ell)\right),$$
which is a sum of concave functions of $\mu_\ell$ and thus is concave in $\mu$.

2.2. The problem.
The problem we are interested in is as follows:
Problem I.
We are given the following:

• a convex compact set $X \subset \mathbb{R}^n$;
• a good pair $(\mathcal{D}, \mathcal{F})$ comprised of
  – a parametric family $\{p_\mu(\omega) : \mu \in M \subset \mathbb{R}^m\}$ of probability densities on a Borel space $\Omega$ with $\sigma$-finite Borel measure $P$, and
  – a finite-dimensional linear space $\mathcal{F}$ of Borel functions on $\Omega$;
• an affine mapping $x \mapsto A(x) : X \to M$;
• a linear form $g^T z$ on $\mathbb{R}^n \supset X$.

Aside of this a priori information, we are given a realization $\omega$ of a random variable taking values in $\Omega$ and distributed with the density $p_{A(x)}(\cdot)$ for some unknown in advance $x \in X$. Our goal is to infer from this observation an estimate $\hat{g}(\omega)$ of the value $g^T x$ of the given linear form at $x$.

From now on we refer to an estimate as affine if it is of the form $\hat{g}(\omega) = \phi(\omega)$, with certain $\phi \in \mathcal{F}$.

We quantify the risk of a candidate estimate $\hat{g}(\cdot)$ by its worst-case, over $x \in X$, confidence interval, given the confidence level. Specifically, given a confidence level $\varepsilon \in (0, 1)$, we define the $\varepsilon$-risk of an estimate $\hat{g}$ as
$$\mathrm{Risk}(\hat{g}; \varepsilon) = \inf\left\{\delta : \sup_{x \in X} \mathrm{Prob}_{\omega \sim p_{A(x)}(\cdot)}\{\omega : |\hat{g}(\omega) - g^T x| > \delta\} < \varepsilon\right\}.$$
The corresponding minimax optimal $\varepsilon$-risk is defined as
$$\mathrm{Risk}_*(\varepsilon) = \inf_{\hat{g}(\cdot)} \mathrm{Risk}(\hat{g}; \varepsilon),$$
where the infimum is taken over the space of all Borel functions $\hat{g}$ on $\Omega$. We are interested also in the minimax optimal $\varepsilon$-risk of affine estimates
$$\mathrm{RiskA}(\varepsilon) = \inf_{\phi(\cdot) \in \mathcal{F}} \mathrm{Risk}(\phi; \varepsilon).$$
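The goodness conditions above are easy to sanity-check numerically for the concrete families of Examples 1–3. The sketch below is our own illustration (the parameter values are arbitrary): it verifies the identity behind Example 2, namely that for the Poisson family and $\phi(i) = \alpha i + \beta$ one has $F_\phi(\mu) = \beta - \mu + \mu e^\alpha$, which is affine (hence concave) in $\mu$.

```python
import math

def F_phi(mu, a, b, n_terms=300):
    """ln E[exp(a*i + b)] under Poisson(mu), by direct summation of the series."""
    s = 0.0
    p = math.exp(-mu)          # Poisson pmf at i = 0
    for i in range(n_terms):
        s += math.exp(a * i + b) * p
        p *= mu / (i + 1)      # pmf recursion: p_{i+1} = p_i * mu / (i + 1)
    return math.log(s)

# Closed form from Example 2: F_phi(mu) = b - mu + mu * exp(a), affine in mu.
a, b = 0.4, -0.3
for mu in (0.5, 2.0, 7.0):
    assert abs(F_phi(mu, a, b) - (b - mu + mu * math.exp(a))) < 1e-9
```

Since the closed form is linear in $\mu$, concavity of $F_\phi$ (property 4 of a good pair) is immediate here.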
3. Minimax optimal affine estimators.
3.1. Main result.
Our main result follows.
Theorem 3.1.
Let the pair $(\mathcal{D}, \mathcal{F})$ underlying Problem I be good. Then the minimax optimal risk achievable with affine estimates is, for small $\varepsilon$, within an absolute constant factor of the "true" minimax optimal risk; specifically,
$$0 < \varepsilon < 1/4 \;\Rightarrow\; \mathrm{RiskA}(\varepsilon) \leq \theta(\varepsilon)\,\mathrm{Risk}_*(\varepsilon), \qquad \theta(\varepsilon) = \frac{2\ln(2/\varepsilon)}{\ln(1/(4\varepsilon))}.$$

Proof. For $r \geq 0$, let us set
$$\Phi_r(x, y; \phi, \alpha) = g^T x - g^T y + \alpha\ln\left(\int_\Omega e^{\alpha^{-1}\phi(\omega)}\,p_{A(y)}(\omega)\,P(d\omega)\right) + \alpha\ln\left(\int_\Omega e^{-\alpha^{-1}\phi(\omega)}\,p_{A(x)}(\omega)\,P(d\omega)\right) + 2\alpha r:\ Z \times F_+ \to \mathbb{R},$$
$$Z = X \times X, \qquad F_+ = \mathcal{F} \times \{\alpha > 0\}.$$
We claim that this function is a continuous real-valued function on $Z \times F_+$, which is convex in $(\phi, \alpha) \in F_+$ and concave in $(x, y) \in Z$. Indeed, the function
$$\Psi(\mu, \nu; \phi) = \ln\left(\int_\Omega e^{-\phi(\omega)}\,p_\mu(\omega)\,P(d\omega)\right) + \ln\left(\int_\Omega e^{\phi(\omega)}\,p_\nu(\omega)\,P(d\omega)\right):\ (M \times M) \times \mathcal{F} \to \mathbb{R}$$
is well defined, concave in $(\mu, \nu) \in M \times M$ [since $(\mathcal{D}, \mathcal{F})$ is good] and convex in $\phi \in \mathcal{F}$ (evident). Since $M$ is open and $\mathcal{F}$ is a finite-dimensional linear space, $\Psi$ is continuous on its domain. It remains to note that $\Phi_r$ is the sum of a linear function of $x, y, \alpha$ and the function $\alpha\Psi(A(x), A(y); \alpha^{-1}\phi)$, which clearly is concave in $(x, y)$ [since $\Psi(\mu, \nu; \phi)$ is concave in $(\mu, \nu)$ and $A(\cdot)$ is affine] and convex in $(\phi, \alpha) \in F_+$ [since $\Psi(\mu, \nu; \phi)$ is convex and continuous in $\phi \in \mathcal{F}$, and the transformation $f(u) \mapsto g(u, \alpha) = \alpha f(u/\alpha)$ converts a convex function of $u$ into a convex in $(\alpha > 0, u)$ function of $(u, \alpha)$].

Since $Z$ is a convex finite-dimensional compact set, $F_+$ is a convex finite-dimensional set and $\Phi_r$ is continuous and convex–concave on $Z \times F_+$, we can invoke the Sion–Kakutani theorem (see, e.g., [14]) to infer that
$$\sup_{x,y \in X}\,\inf_{\phi \in \mathcal{F},\,\alpha > 0} \Phi_r(x, y; \phi, \alpha) = \inf_{\phi \in \mathcal{F},\,\alpha > 0}\,\max_{x,y \in X} \Phi_r(x, y; \phi, \alpha) := 2\Phi_*(r). \tag{3.1}$$
Note that $\Phi_*(r) \geq 0$ for $r \geq 0$. Indeed, the functional $f_x[h] = \ln\int_\Omega e^{h(\omega)}\,p_{A(x)}(\omega)\,P(d\omega)$ is well defined and convex on $\mathcal{F}$, with $f_x[0] = 0$, whence
$$\Phi_r(x, x; \phi, \alpha) = 2\alpha r + \alpha(f_x[-\alpha^{-1}\phi] + f_x[\alpha^{-1}\phi]) \geq 2\alpha r + 2\alpha f_x[0] = 2\alpha r \geq 0,$$
whence $\Phi_*(r) \geq \frac{1}{2}\sup_{x \in X}\inf_{\phi \in \mathcal{F},\,\alpha > 0}\Phi_r(x, x; \phi, \alpha) \geq 0$. The concavity of $\Phi_*(r)$ on the nonnegative ray follows immediately from the representation, yielded by (3.1),
$$\Phi_*(r) = \frac{1}{2}\inf_{\phi \in \mathcal{F},\,\alpha > 0}\left[2\alpha r + \sup_{x,y \in X}\Phi_0(x, y; \phi, \alpha)\right]$$
of $\Phi_*(r)$ as the infimum of a family of affine functions of $r$.

Lemma 3.1.
One has
$$\mathrm{RiskA}(\varepsilon) \leq \Phi_*(\ln(2/\varepsilon)).$$
Proof.
Given $\delta > 0$ and $\varepsilon \in (0, 1)$, let us build an affine estimate with $\varepsilon$-risk at most $R \equiv \Phi_*(\ln(2/\varepsilon)) + \delta/2$, as follows. By (3.1), there exist $\phi_* \in \mathcal{F}$ and $\alpha_* > 0$ such that
$$2\Phi_*(\ln(2/\varepsilon)) + \delta/2 \geq \max_{x,y \in X}\Phi_{\ln(2/\varepsilon)}(x, y; \phi_*, \alpha_*) = \underbrace{\max_{x \in X}\left[g^T x + \alpha_*\ln\left(\int_\Omega e^{-\alpha_*^{-1}\phi_*(\omega)}\,p_{A(x)}(\omega)\,P(d\omega)\right) + \alpha_*\ln(2/\varepsilon)\right]}_{U} + \underbrace{\max_{y \in X}\left[-g^T y + \alpha_*\ln\left(\int_\Omega e^{\alpha_*^{-1}\phi_*(\omega)}\,p_{A(y)}(\omega)\,P(d\omega)\right) + \alpha_*\ln(2/\varepsilon)\right]}_{V}.$$
Setting $c = \frac{1}{2}(U - V)$ and replacing $\phi_*$ with $\phi_* + c$ (which shifts the first bracket by $-c$ and the second by $+c$), we have
$$\max_{x \in X}\left[g^T x + \alpha_*\ln\left(\int_\Omega e^{-\alpha_*^{-1}[\phi_*(\omega)+c]}\,p_{A(x)}(\omega)\,P(d\omega)\right) + \alpha_*\ln(2/\varepsilon)\right] = U - c = \frac{U+V}{2} \leq \Phi_*(\ln(2/\varepsilon)) + \delta/4 = R - \delta/4,$$
$$\max_{y \in X}\left[-g^T y + \alpha_*\ln\left(\int_\Omega e^{\alpha_*^{-1}[\phi_*(\omega)+c]}\,p_{A(y)}(\omega)\,P(d\omega)\right) + \alpha_*\ln(2/\varepsilon)\right] = V + c = \frac{U+V}{2} \leq R - \delta/4,$$
whence
$$\max_{x \in X}\ln\left(\int_\Omega e^{\alpha_*^{-1}[g^T x - (\phi_*(\omega)+c) - R]}\,p_{A(x)}(\omega)\,P(d\omega)\right) \leq \ln(\varepsilon/2) - \frac{\delta}{4\alpha_*} \equiv \ln(\varepsilon'/2),$$
$$\max_{y \in X}\ln\left(\int_\Omega e^{\alpha_*^{-1}[(\phi_*(\omega)+c) - R - g^T y]}\,p_{A(y)}(\omega)\,P(d\omega)\right) \leq \ln(\varepsilon'/2),$$
that is,

(a) for all $x \in X$: $\int_\Omega e^{\alpha_*^{-1}[g^T x - (\phi_*(\omega)+c) - R]}\,p_{A(x)}(\omega)\,P(d\omega) \leq \varepsilon'/2$;

(b) for all $y \in X$: $\int_\Omega e^{\alpha_*^{-1}[(\phi_*(\omega)+c) - R - g^T y]}\,p_{A(y)}(\omega)\,P(d\omega) \leq \varepsilon'/2$.

For a given $x \in X$, the integrand in (a) is nonnegative and is $> 1$ for all $\omega$ such that $g^T x - [\phi_*(\omega) + c] > R$; therefore, (a) implies that $\mathrm{Prob}_{\omega \sim p_{A(x)}(\cdot)}\{g^T x > [\phi_*(\omega) + c] + R\} \leq \varepsilon'/2$ for every $x \in X$. By similar reasons, (b) implies that $\mathrm{Prob}_{\omega \sim p_{A(x)}(\cdot)}\{g^T x < [\phi_*(\omega) + c] - R\} \leq \varepsilon'/2$ for all $x \in X$. Since by construction $\varepsilon' < \varepsilon$, we see that the $\varepsilon$-risk of the affine estimate $\hat{g}(\omega) = \phi_*(\omega) + c$ is at most $R$, as claimed. □

Lemma 3.2.
One has
$$\delta \in (0, 1) \;\Rightarrow\; \mathrm{Risk}_*(\delta^2/4) \geq \Phi_*(\ln(1/\delta)), \tag{3.2}$$
whence also
$$\varepsilon \in (0, 1/4) \;\Rightarrow\; \mathrm{Risk}_*(\varepsilon) \geq \frac{\ln(1/(4\varepsilon))}{2\ln(2/\varepsilon)}\,\Phi_*(\ln(2/\varepsilon)). \tag{3.3}$$

Proof.
To prove (3.2), let us set $\rho = \ln(1/\delta)$. The function $\Psi_\rho(x, y) = \inf_{\phi \in \mathcal{F},\,\alpha > 0}\Phi_\rho(x, y; \phi, \alpha)$ takes values in $\{-\infty\} \cup \mathbb{R}$, is upper semicontinuous (since $\Phi_\rho$ is continuous) and is not identically $-\infty$ (in fact, it is even nonnegative when $y = x$). Thus, $\Psi_\rho$ achieves its maximum on $X \times X$ at a certain point $(\bar{x}, \bar{y})$, and for any $\alpha > 0$ and $\phi \in \mathcal{F}$:
$$\Phi_\rho(\bar{x}, \bar{y}; \phi, \alpha) \geq \Psi_\rho(\bar{x}, \bar{y}) = \sup_{x,y \in X}\,\inf_{\phi \in \mathcal{F},\,\alpha > 0}\Phi_\rho(x, y; \phi, \alpha) = 2\Phi_*(\rho), \tag{3.4}$$
where the concluding equality is given by (3.1). Since $(\mathcal{D}, \mathcal{F})$ is a good pair, setting $\mu = A(\bar{x})$, $\nu = A(\bar{y})$ and $\bar{\phi}(\omega) = \ln(p_\mu(\omega)/p_\nu(\omega))$, we get $\bar{\phi} \in \mathcal{F}$, which combines with (3.4) (applied with $\phi = \tfrac{1}{2}\alpha\bar{\phi}$) to imply that for all $\alpha > 0$,
$$2\Phi_*(\rho) \leq g^T\bar{x} - g^T\bar{y} + \alpha\left[\ln\left(\int_\Omega e^{-\bar{\phi}(\omega)/2}\,p_\mu(\omega)\,P(d\omega)\right) + \ln\left(\int_\Omega e^{\bar{\phi}(\omega)/2}\,p_\nu(\omega)\,P(d\omega)\right) + 2\rho\right] = g^T\bar{x} - g^T\bar{y} + 2\alpha\left[\ln\left(\int_\Omega \sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,P(d\omega)\right) + \rho\right].$$
The resulting inequality holds true for all $\alpha > 0$, meaning that

(a) $g^T\bar{x} - g^T\bar{y} \geq 2\Phi_*(\rho) = 2\Phi_*(\ln(1/\delta))$,   (3.5)

(b) $\int_\Omega \sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,P(d\omega) \geq e^{-\rho} = \delta$

[indeed, letting $\alpha \to 0$ gives (a), while letting $\alpha \to \infty$ shows that the bracketed term must be nonnegative, which is (b)].

Now assume, in contrast to what should be proved, that $\mathrm{Risk}_*(\delta^2/4) < \Phi_*(\ln(1/\delta))$. Then there exist $R' < \Phi_*(\ln(1/\delta))$, $\delta' < \delta^2/4$ and an estimate $\hat{g}(\omega)$ such that
$$\mathrm{Prob}_{\omega \sim p_{A(x)}(\cdot)}\{|\hat{g}(\omega) - g^T x| > R'\} \leq \delta' \qquad \forall x \in X.$$
Now, consider two hypotheses $\Pi_1, \Pi_2$ on the distribution of $\omega$, stating that the densities of the distribution with regard to $P$ are $p_\mu$ and $p_\nu$, respectively. Consider a procedure for distinguishing between the hypotheses as follows: after $\omega$ is observed, we compare $\hat{g}(\omega)$ with $\bar{g} = \frac{1}{2}[g^T\bar{x} + g^T\bar{y}]$; if $\hat{g}(\omega) \geq \bar{g}$, we accept $\Pi_1$, otherwise we accept $\Pi_2$. Note that by (3.5)(a) and due to $R' < \Phi_*(\ln(1/\delta))$, the probability to accept $\Pi_2$ when $\Pi_1$ is true is at most the probability for $\hat{g}(\omega)$ to deviate from $g^T\bar{x}$ by more than $R'$, that is, it is $\leq \delta'$. Similarly, the probability to accept $\Pi_1$ when $\Pi_2$ is true is $\leq \delta'$. Now, let $\Omega_1$ be the part of $\Omega$ where our hypotheses testing routine accepts $\Pi_1$, so that in $\Omega_2 = \Omega \setminus \Omega_1$ the routine accepts $\Pi_2$. As we have just seen,
$$\int_{\Omega_1} p_\nu(\omega)\,P(d\omega) \leq \delta', \qquad \int_{\Omega_2} p_\mu(\omega)\,P(d\omega) \leq \delta',$$
whence
$$\int_\Omega \sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,P(d\omega) = \sum_{i=1}^{2}\int_{\Omega_i}\sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,P(d\omega) \leq \sum_{i=1}^{2}\left(\int_{\Omega_i} p_\mu(\omega)\,P(d\omega)\right)^{1/2}\left(\int_{\Omega_i} p_\nu(\omega)\,P(d\omega)\right)^{1/2} \leq 2\sqrt{\delta'} < 2\sqrt{\delta^2/4} = \delta.$$
The resulting inequality $\int_\Omega \sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,P(d\omega) < \delta$ contradicts (3.5)(b); we have arrived at the desired contradiction, and (3.2) is proved.

To prove (3.3), let us set $\delta = 2\sqrt{\varepsilon}$, so that $\mathrm{Risk}_*(\varepsilon) = \mathrm{Risk}_*(\delta^2/4) \geq \Phi_*(\ln(1/\delta)) = \Phi_*(\frac{1}{2}\ln(1/(4\varepsilon)))$, where the concluding $\geq$ is due to (3.2). Now recall that $\Phi_*(r)$ is a nonnegative and concave function of $r \geq 0$, so that $\Phi_*(tr) \geq t\,\Phi_*(r)$ for all $r \geq 0$ and $0 \leq t \leq 1$. Applying this with $r = \ln(2/\varepsilon)$ and $t = \frac{\ln(1/(4\varepsilon))}{2\ln(2/\varepsilon)} \in [0, 1]$, we therefore have
$$\Phi_*\left(\frac{1}{2}\ln\left(\frac{1}{4\varepsilon}\right)\right) \geq \frac{\ln(1/(4\varepsilon))}{2\ln(2/\varepsilon)}\,\Phi_*(\ln(2/\varepsilon)),$$
and we arrive at (3.3). □

Lemmas 3.1 and 3.2 clearly imply Theorem 3.1. □
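The two-point argument in the proof of Lemma 3.2 rests on the standard interplay between the Hellinger affinity and testing errors: if some test distinguishes $p_\mu$ from $p_\nu$ with both error probabilities small, the affinity must be small. The snippet below is our own numerical illustration (not part of the paper) of the classical bounds $\frac{1}{2}\mathrm{Aff}^2 \leq 1 - \mathrm{TV}(p, q) \leq \mathrm{Aff}$, where $1 - \mathrm{TV}$ is exactly the minimal sum of the two error probabilities over all tests.

```python
import numpy as np

rng = np.random.default_rng(0)

def affinity(p, q):
    """Hellinger affinity of two discrete distributions."""
    return float(np.sum(np.sqrt(p * q)))

def min_sum_of_errors(p, q):
    """Minimal (type I + type II) error over all tests = 1 - TV(p, q)."""
    return float(1.0 - 0.5 * np.sum(np.abs(p - q)))

for _ in range(100):
    p = rng.random(20); p /= p.sum()
    q = rng.random(20); q /= q.sum()
    aff, err = affinity(p, q), min_sum_of_errors(p, q)
    # Le Cam-type bounds relating affinity and optimal testing error
    assert 0.5 * aff**2 <= err + 1e-12
    assert err <= aff + 1e-12
```

The lower bound is the quantitative content of the contradiction step above: errors below $\delta^2/4$ on both sides would force the affinity below $\delta$.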
Remark 3.1.
Lemmas 3.1 and 3.2 provide certain information even beyond the case when the pair $(\mathcal{D}, \mathcal{F})$ is good, specifically, that:

(i) the $\varepsilon$-risk of an affine estimate can be made arbitrarily close to the quantity
$$\Phi^+(\varepsilon) = \frac{1}{2}\inf_{\phi \in \mathcal{F},\,\alpha > 0}\,\sup_{x,y \in X}\Phi_{\ln(2/\varepsilon)}(x, y; \phi, \alpha)$$
(cf. Lemma 3.1);

(ii) we have
$$\mathrm{Risk}_*(\varepsilon) \geq \Phi^-(\varepsilon) = \frac{1}{2}\sup_{x,y \in X}\,\inf_{\phi \in \mathcal{F},\,\alpha > 0}\Phi_{\frac{1}{2}\ln(1/(4\varepsilon))}(x, y; \phi, \alpha)$$
(cf. Lemma 3.2).

As is seen from the proofs of Lemmas 3.1 and 3.2, both these statements hold true without the goodness assumption. The role of the latter is in ensuring that $\Phi^+(\varepsilon)$ is within an absolute constant factor of $\Phi^-(\varepsilon)$.

Lemma 3.2 implies the following result.

Proposition 3.1.
Under the premise of Theorem 3.1, the Hellinger affinity
$$\mathrm{Aff_H}(\mu, \nu) = \int_\Omega \sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,P(d\omega)$$
is a continuous and log-concave function on $M \times M$, and the quantity $\Phi_*(r)$, $r \geq 0$, admits the following representation:
$$2\Phi_*(r) = \max_{x,y}\{g^T x - g^T y : \mathrm{Aff_H}(A(x), A(y)) \geq e^{-r},\ x, y \in X\}. \tag{3.6}$$

We see that the upper bound $\Phi_*(\ln(2/\varepsilon))$ on $\mathrm{RiskA}(\varepsilon)$ stated in Theorem 3.1 admits a very transparent interpretation: this bound is half the maximum of the variation $\max_{x,y}[g^T x - g^T y]$ of the estimated functional over the set of pairs $x, y \in X$ with the associated distributions "close" to each other, namely, such that $\mathrm{Aff_H}(A(x), A(y)) \geq \varepsilon/2$. Observe that asymptotically (when $r$ becomes small), $\Phi_*(r)$ is equivalent to the modulus of continuity $\omega(r, X)$ of $g$ with regard to the Hellinger distance introduced in [7] (recall that we consider here the case of one observation).

Proof of Proposition 3.1.
By exactly the same argument as in the proof of Theorem 3.1, the function $\Psi(\mu, \nu; \phi) : (M \times M) \times \mathcal{F} \to \mathbb{R}$,
$$\Psi(\mu, \nu; \phi) = \ln\left(\int e^{-\phi(\omega)}\,p_\mu(\omega)\,P(d\omega)\right) + \ln\left(\int e^{\phi(\omega)}\,p_\nu(\omega)\,P(d\omega)\right),$$
is well defined and continuous on its domain, convex in $\phi$ and concave in $(\mu, \nu)$. We claim that
$$2\ln(\mathrm{Aff_H}(\mu, \nu)) = \min_{\phi \in \mathcal{F}}\Psi(\mu, \nu; \phi), \tag{3.7}$$
which would imply that $\ln(\mathrm{Aff_H}(\cdot))$ is indeed a finite concave function on $M \times M$ and as such is continuous (recall that $M$ is open). To justify our claim, note that, for fixed $\mu, \nu \in M$, setting $\bar{\phi} = \frac{1}{2}\ln(p_\mu/p_\nu)$, we get a function from $\mathcal{F}$ such that $\Psi(\mu, \nu; \bar{\phi}) = 2\ln(\mathrm{Aff_H}(\mu, \nu))$. To complete the verification of (3.7), it suffices to demonstrate that $\Psi(\mu, \nu; \phi) \geq \Psi(\mu, \nu; \bar{\phi})$ whenever $\phi \in \mathcal{F}$, which is immediate: setting $\phi = \bar{\phi} + \Delta$ and applying the Cauchy–Schwarz inequality, we have
$$\exp\{\Psi(\mu, \nu; \bar{\phi})/2\} = \int \sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,P(d\omega) = \int\left[(p_\mu(\omega)p_\nu(\omega))^{1/4}e^{-\Delta(\omega)/2}\right]\left[(p_\mu(\omega)p_\nu(\omega))^{1/4}e^{\Delta(\omega)/2}\right]P(d\omega) \leq \left[\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\,e^{-\Delta(\omega)}\,P(d\omega)\right]^{1/2}\left[\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\,e^{\Delta(\omega)}\,P(d\omega)\right]^{1/2} = \exp\{\Psi(\mu, \nu; \phi)/2\}.$$
Now, note that by (3.1),
$$2\Phi_*(r) = \sup_{x,y \in X}\left\{\inf_{\phi \in \mathcal{F},\,\alpha > 0}\left[g^T x - g^T y + \alpha\Psi(A(x), A(y); \alpha^{-1}\phi) + 2\alpha r\right]\right\} = \sup_{x,y \in X}\left\{g^T x - g^T y + \inf_{\alpha > 0}\alpha\left[\inf_{\psi \equiv \alpha^{-1}\phi \in \mathcal{F}}\Psi(A(x), A(y); \psi) + 2r\right]\right\} = \sup_{x,y \in X}\left\{g^T x - g^T y + \inf_{\alpha > 0}\alpha\left[2\ln(\mathrm{Aff_H}(A(x), A(y))) + 2r\right]\right\}$$
[see (3.7)]. The inner infimum equals $0$ when $\ln(\mathrm{Aff_H}(A(x), A(y))) + r \geq 0$ and $-\infty$ otherwise, whence
$$2\Phi_*(r) = \max_{x,y}\{g^T x - g^T y : \mathrm{Aff_H}(A(x), A(y)) \geq e^{-r},\ x, y \in X\}.$$
□

3.2. The case of multiple observations.
In Problem I, our goal was to estimate $g^T x$ from a single observation $\omega$ of the random variable $\omega \sim p_{A(x)}(\cdot)$ associated with $x$. The result can be immediately extended to the case when we want to recover $g^T x$ from a sample of independent observations $\omega_1, \ldots, \omega_L$ of random variables $\omega_\ell$ with distributions parameterized by $x$. Specifically, let $(\Omega_\ell, P_\ell)$ and $(\mathcal{D}_\ell, \mathcal{F}_\ell)$, $1 \leq \ell \leq L$, be as in Example 4, and let every pair $(\mathcal{D}_\ell, \mathcal{F}_\ell)$ be good. Assume, further, that $X \subset \mathbb{R}^n$ is a convex compact set and $A_\ell(x)$ are affine mappings with $A_\ell(X) \subset M_\ell$. Given a linear form $g^T z$ on $\mathbb{R}^n$ and a sequence of independent realizations $\omega_\ell \sim p^\ell_{A_\ell(x)}(\cdot)$, $\ell = 1, \ldots, L$, we want to recover from these observations the value $g^T x$ of the given linear form at the "signal" $x$ underlying our observations.

In our current situation, we call a candidate estimate $\hat{g}(\omega_1, \ldots, \omega_L)$ affine if it is of the form
$$\hat{g}(\omega_1, \ldots, \omega_L) = \sum_{\ell=1}^L \phi_\ell(\omega_\ell), \tag{3.8}$$
where $\phi_\ell \in \mathcal{F}_\ell$, $\ell = 1, \ldots, L$. Note that setting $(\mathcal{D}, \mathcal{F}) = \bigotimes_{\ell=1}^L(\mathcal{D}_\ell, \mathcal{F}_\ell)$, we reduce the situation to the one we have already considered. In particular, Theorem 3.1 along with the proof of Lemma 3.1 implies the following result (where the $\varepsilon$-risks—of an estimate, the minimax optimal and the affine-minimax optimal—are defined exactly as in the single-observation case).

Theorem 3.2.
In the situation just described, for $r > 0$, let
$$\Phi_r(x, y; \phi, \alpha) = g^T x - g^T y + \alpha\left[\sum_{\ell=1}^L \ln\left(\int_{\Omega_\ell} e^{-\alpha^{-1}\phi_\ell(\omega_\ell)}\,p^\ell_{A_\ell(x)}(\omega_\ell)\,P_\ell(d\omega_\ell)\right) + \sum_{\ell=1}^L \ln\left(\int_{\Omega_\ell} e^{\alpha^{-1}\phi_\ell(\omega_\ell)}\,p^\ell_{A_\ell(y)}(\omega_\ell)\,P_\ell(d\omega_\ell)\right)\right] + 2\alpha r:\ Z \times F_+ \to \mathbb{R},$$
$$Z = X \times X, \qquad F_+ = \mathcal{F}_1 \times \cdots \times \mathcal{F}_L \times \{\alpha > 0\}.$$
The function $\Phi_r$ is continuous on its domain, concave in the $(x, y)$-argument, convex in the $(\phi, \alpha)$-argument and possesses a well-defined saddle point value
$$2\Phi_*(r) = \sup_{x,y \in X}\,\underbrace{\inf_{(\phi,\alpha) \in F_+}\Phi_r(x, y; \phi, \alpha)}_{\underline{\Phi}_r(x,y)} = \inf_{(\phi,\alpha) \in F_+}\,\underbrace{\sup_{x,y \in X}\Phi_r(x, y; \phi, \alpha)}_{\overline{\Phi}_r(\phi,\alpha)},$$
which is a concave and nonnegative function of $r \geq 0$. Moreover:

(i) For all $\varepsilon \in (0, 1/4)$, we have
$$\mathrm{RiskA}(\varepsilon) \leq \Phi_*(\ln(2/\varepsilon)) \leq \theta(\varepsilon)\,\mathrm{Risk}_*(\varepsilon), \qquad \theta(\varepsilon) = \frac{2\ln(2/\varepsilon)}{\ln(1/(4\varepsilon))}.$$

(ii) Given $\varepsilon \in (0, 1/4)$ and $\delta > 0$, in order to build an affine estimate with $\varepsilon$-risk not exceeding $[\Phi_*(\ln(2/\varepsilon)) + \delta]$, it suffices to find $\alpha_* > 0$ and $\phi^*_\ell \in \mathcal{F}_\ell$, $1 \leq \ell \leq L$, such that
$$\overline{\Phi}_{\ln(2/\varepsilon)}(\phi^*, \alpha_*) \leq 2\Phi_*(\ln(2/\varepsilon)) + \delta/2,$$
to compute the quantity
$$c = \frac{1}{2}\max_{x \in X}\left[g^T x + \alpha_*\sum_{\ell=1}^L \ln\left(\int_{\Omega_\ell} e^{-\alpha_*^{-1}\phi^*_\ell(\omega_\ell)}\,p^\ell_{A_\ell(x)}(\omega_\ell)\,P_\ell(d\omega_\ell)\right)\right] - \frac{1}{2}\max_{y \in X}\left[-g^T y + \alpha_*\sum_{\ell=1}^L \ln\left(\int_{\Omega_\ell} e^{\alpha_*^{-1}\phi^*_\ell(\omega_\ell)}\,p^\ell_{A_\ell(y)}(\omega_\ell)\,P_\ell(d\omega_\ell)\right)\right]$$
and to set
$$\hat{g}(\omega_1, \ldots, \omega_L) = \sum_{\ell=1}^L \phi^*_\ell(\omega_\ell) + c. \tag{3.9}$$
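When every $\Omega_\ell$ is finite (as in Example 1), the integrals above are finite sums and the construction in (ii) becomes a fully explicit finite-dimensional convex program. Writing it out (our restatement of the theorem's objects, with $p^\ell_{A_\ell(x)}(\omega)$ the probability of outcome $\omega$ in observation $\ell$):

```latex
\overline{\Phi}_r(\phi,\alpha)
 =\max_{x,y\in X}\Big\{g^Tx-g^Ty
  +\alpha\sum_{\ell=1}^{L}\Big[\ln\!\sum_{\omega\in\Omega_\ell}
     e^{-\phi_\ell(\omega)/\alpha}\,p^\ell_{A_\ell(x)}(\omega)
  +\ln\!\sum_{\omega\in\Omega_\ell}
     e^{\phi_\ell(\omega)/\alpha}\,p^\ell_{A_\ell(y)}(\omega)\Big]\Big\}
  +2\alpha r .
```

The decision variables are the finitely many values $\{\phi_\ell(\omega)\}$ and $\alpha > 0$; minimizing this jointly convex function yields $(\phi^*, \alpha_*)$, after which $c$ is obtained from two more convex maximizations over $X$.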
Computing the “nearly optimal” affine estimate (3.9) re-duces to convex programming and thus can be carried out efficiently, pro-vided that we are given explicit descriptions of: • the linear spaces F ℓ , ℓ = 1 , . . . , L (as it is the case, e.g., in Examples 1–3), • and X (e.g., by a list of efficiently computable convex constraints whichcut X out of R n ) and are capable to compute efficiently the value of Φ r at a point. Remark 3.3.
Assume that the observations ω ℓ , ℓ ≤ ℓ ≤ ℓ , are copiesof the same random variable [i.e., Ω ℓ , P ℓ , D ℓ , F ℓ , A ℓ ( · ) are independent of ℓ for ℓ ≤ ℓ ≤ ℓ ]. Then, the convex function Φ r ( φ , . . . , φ L , α ) is symmetricwith regard to the arguments φ ℓ ∈ F ℓ , ℓ ≤ ℓ ≤ ℓ , and therefore, whenbuilding the estimate (3.9) we lose nothing when restricting ourselves to φ ’ssatisfying φ ℓ = φ ℓ , ℓ ≤ ℓ ≤ ℓ , which allows to reduce the computationaleffort of building α ∗ , φ ∗ ℓ .3.2.1. Illustration.
Consider the toy problem where we want to recover the probability $p$ of getting 1 from a Bernoulli distribution, given $L$ independent realizations $\omega_1, \ldots, \omega_L$ of the associated random variable. To handle the problem, we specialize our general setup as follows:

• $(\Omega_\ell, P_\ell)$, $1 \leq \ell \leq L$, are identical to the two-point set $\{0; 1\}$ with the counting measure;
• $M$ is the interval $(0, 1)$, and $p_\mu(1) = 1 - p_\mu(0) = \mu$, $\mu \in M$;
• $X$ is a compact convex subset in $M$, say, the segment $[10^{-16}, 1 - 10^{-16}]$, and $A(x) = x$.

Table 1
Recovering the parameter of a Bernoulli distribution

  ε       L      γ         δ         Upper risk bound   Lower risk bound   Ratio of bounds   ϑ(ε)
 0.05     10    2.91e–1   4.18e–2    3.61e–1            2.49e–1            1.45              4.58
 0.05    100    4.13e–2   9.17e–3    1.33e–1            8.19e–2            1.63              4.58
 0.05   1000    4.29e–3   9.91e–4    4.29e–2            2.60e–2            1.65              4.58
 0.01     10    3.58e–1   2.83e–2    4.04e–1            3.29e–1            1.23              3.29
 0.01    100    5.83e–2   8.84e–3    1.59e–1            1.15e–1            1.38              3.29
 0.01   1000    6.15e–3   9.88e–4    5.13e–2            3.67e–2            1.40              3.29
 0.001    10    4.19e–1   1.61e–2    4.42e–1            3.98e–1            1.11              2.75
 0.001   100    8.15e–2   8.37e–3    1.88e–1            1.51e–1            1.24              2.75
 0.001  1000    8.79e–3   9.82e–4    6.14e–2            4.88e–2            1.26              2.75
Invoking Remark 3.3, we lose nothing when restricting ourselves to affine estimates of the form (3.8) with mutually identical functions $\phi_\ell(\cdot)$, $1 \leq \ell \leq L$, that is, with the estimates
$$\hat{g}(\omega_1, \ldots, \omega_L) = \gamma + \delta\sum_{\ell=1}^L \omega_\ell.$$
Invoking Theorem 3.2, the coefficients $\gamma$ and $\delta$ are readily given by the $\phi$-component of the saddle point (max in $x, y \in X$, min in $\phi = [\phi_0; \phi_1] \in \mathbb{R}^2$ and $\alpha > 0$) of the convex–concave function
$$x - y + \alpha\left[L\ln\left(e^{-\phi_0/\alpha}(1 - x) + e^{-\phi_1/\alpha}x\right) + L\ln\left(e^{\phi_0/\alpha}(1 - y) + e^{\phi_1/\alpha}y\right) + 2\ln(2/\varepsilon)\right];$$
the (guaranteed upper bound on the) $\varepsilon$-risk of this estimate is half of the corresponding saddle point value. The saddle point (it is easily seen that it does exist) can be computed with high accuracy by standard convex programming techniques. In Table 1, we present the nearly optimal affine estimates along with the corresponding risks. In the table, the upper risk bound is the one guaranteed by Theorem 3.2, and the lower risk bound is the largest $d$ such that the hypotheses "$p = 0.5 + d$" and "$p = 0.5 - d$" cannot be distinguished from $L$ independent observations of a random variable $\sim \mathrm{Bernoulli}(p)$ with the sum of probabilities of errors $< \varepsilon$ [this easily computable quantity is a lower bound on the minimax optimal $\varepsilon$-risk $\mathrm{Risk}_*(\varepsilon)$], and
$$\vartheta(\varepsilon) = \frac{2\ln(2/\varepsilon)}{\ln(1/(4\varepsilon))}$$
is the theoretical upper bound on the "level of nonoptimality" of our estimate. As could be guessed in advance, for large $L$, the near-optimal affine estimate is close to the trivial estimate $\frac{1}{L}\sum_{\ell=1}^L \omega_\ell$.
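Proposition 3.1 makes the upper risk bounds in Table 1 easy to reproduce: for i.i.d. observations the Hellinger affinity of the product distribution is the product of the per-observation affinities, so the bound $\Phi_*(\ln(2/\varepsilon))$ equals half the largest gap $x - y$ over pairs with $\mathrm{Aff_H}(x, y)^L \geq \varepsilon/2$, where $\mathrm{Aff_H}(x, y) = \sqrt{xy} + \sqrt{(1-x)(1-y)}$ for the Bernoulli family. A brute-force sketch (our own check, not from the paper):

```python
import numpy as np

def upper_risk_bound(eps, L, grid=1501):
    """Half the max of x - y over pairs in X = [1e-16, 1 - 1e-16] whose
    per-observation Hellinger affinity, raised to the power L, is >= eps/2."""
    t = np.linspace(1e-16, 1 - 1e-16, grid)
    x, y = np.meshgrid(t, t, indexing="ij")
    aff = np.sqrt(x * y) + np.sqrt((1 - x) * (1 - y))   # Bernoulli affinity
    gap = np.where(aff ** L >= eps / 2, x - y, 0.0)
    return gap.max() / 2

print(upper_risk_bound(0.05, 10))    # ~ 0.361, cf. Table 1
print(upper_risk_bound(0.05, 100))   # ~ 0.133, cf. Table 1
```

One can check that the constrained maximum is attained at pairs symmetric around $1/2$, which gives the closed form $\frac{1}{2}\sqrt{1 - (\varepsilon/2)^{2/L}}$ and matches the grid search.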
4. Applications.
In this section, we present some applications of Theorems 3.1 and 3.2.
4.1. Positron emission tomography.
Positron emission tomography (PET) is a noninvasive diagnostic tool allowing us to visualize not only the anatomy of tissues in a body, but their functioning as well. In PET, a patient is administered a radioactive tracer chosen in such a way that it concentrates in the areas of interest (e.g., those of high metabolic activity in early diagnosis of cancer). The tracer disintegrates, emitting positrons which then annihilate with nearby electrons to produce pairs of photons flying at the speed of light in opposite directions; the orientation of the resulting line of response (LOR) is completely random. The patient is placed in a cylinder with the surface split into small detector cells. When two of the detectors are hit by photons "nearly simultaneously"—within an appropriately chosen short time window—the event indicates that somewhere on the line crossing the detectors a disintegration act took place. Such an event is registered, and the data collected by the PET device form a list of the numbers of events registered in every one of the bins (pairs of detectors) in the course of a given time $t$. The goal of a PET reconstruction algorithm is to recover the density of the tracer from these data.

The standard mathematical model of PET is as follows. After discretization of the field of view, there are $n$ voxels (small 3D cubes) assigned with nonnegative (and unknown) amounts $x_i$ of the tracer, $i = 1, \ldots, n$. The number of LORs emanating from a voxel $i$ is a realization of a Poisson random variable with parameter $x_i$, and these variables for different voxels are independent. Every LOR emanating from a voxel $i$ is subject to a "lottery," which decides in which bin (pair of detectors) it will be registered, or whether it will be registered at all—some LORs can intersect the surface of the cylinder only in one point or not intersect it at all and thus are missed.
The role of the lottery is played by the random orientation of the LOR in question, and the outcomes of different lotteries are independent. The probabilities $q_{i\ell}$ for a LOR emanating from voxel $i$ to be registered in bin $\ell$ are known (they are readily given by the geometry of the device). With this model, the data registered by PET form a realization of a random vector $(\omega_1,\dots,\omega_L)$ ($L$ is the total number of bins) with independent Poisson-distributed coordinates, the parameter of the Poisson distribution associated with $\omega_\ell$ being
$$A_\ell(x)=\sum_{i=1}^n q_{i\ell}x_i.$$
Assume that our a priori information on $x$ allows us to point out a convex compact set $X\subset\{x\in\mathbb{R}^n: x\ge 0\}$ such that $x\in X$. Assuming without loss of generality that $\sum_i q_{i\ell}>0$ for all $\ell$ (indeed, we can eliminate all bins $\ell$ which never register LORs) and invoking Example 2, we find ourselves in the situation of Section 3.2. It follows that in order to evaluate a given linear form $g^Tx$ of the unknown tracer density $x$, we can use the construction from Theorem 3.2 to build a near-optimal affine estimate of $g^Tx$. The recipe suggested to this end by Theorem 3.2 reads as follows: the estimate is of the form
$$\hat g(\omega)=\sum_{\ell=1}^L \gamma^*_\ell y_\ell + c_*,$$
where $y_\ell$ is the number of LORs registered in bin $\ell$ and $\gamma^*=[\gamma^*_1;\dots;\gamma^*_L]$, $c_*$ are given by an optimal solution $(\gamma^*,\alpha^*)$ to the convex optimization problem
$$\min_{\alpha>0,\,\gamma}\Phi_r(\gamma,\alpha),$$
$$\Phi_r(\gamma,\alpha)=\frac12\max_{x,y\in X}\Big\{g^Tx-g^Ty+\alpha\Big[\sum_{\ell=1}^L \big[q_\ell(x)\exp\{-\alpha^{-1}\gamma_\ell\}+q_\ell(y)\exp\{\alpha^{-1}\gamma_\ell\}\big]-q(x)-q(y)+2r\Big]\Big\},$$
$$r=\ln(2/\varepsilon),\qquad q_\ell(z)=\sum_{i=1}^n q_{i\ell}z_i,\qquad q(z)=\sum_{\ell=1}^L q_\ell(z).$$
It is easily seen that the problem is solvable, with
$$c_*=\frac12\Big[\max_{x\in X}\Big(g^Tx+\alpha^*\Big[-q(x)+\sum_{\ell=1}^L q_\ell(x)\exp\{-(\alpha^*)^{-1}\gamma^*_\ell\}\Big]\Big)-\max_{y\in X}\Big(-g^Ty+\alpha^*\Big[-q(y)+\sum_{\ell=1}^L q_\ell(y)\exp\{(\alpha^*)^{-1}\gamma^*_\ell\}\Big]\Big)\Big].$$

Gaussian observations.
Now, consider the standard problem of recovering a linear form $g^Tx$ of a signal $x$ known to belong to a given convex compact set $X\subset\mathbb{R}^n$ via indirect observations of the signal corrupted by Gaussian noise. Without loss of generality, let the model of observations be
$$\omega = Ax+\xi,\qquad \xi\sim\mathcal N(0,I_L).\tag{4.1}$$
The associated pair $(\mathcal D,\mathcal F)$ is comprised of the shifts of the standard Gaussian distribution ($\mathcal D$) and all affine forms on $\mathbb R^L$ ($\mathcal F$), and is good (see Example 3). The affine estimates in the case in question are just the affine functions of $\omega$. The near-optimality of affine estimates in this case was established by Donoho [5], not only for the $\varepsilon$-risk, but for all risks based on the standard loss functions. We have the following direct corollary of Theorem 3.2 (cf. Theorem 2 and Corollary 1 of [5]):
Proposition 4.1.
In the situation in question, the affine estimate $\hat g_\varepsilon(\cdot)$ yielded by Theorem 3.2 is asymptotically ($\varepsilon\to+0$) optimal; specifically,
$$\varepsilon\in(0,1/2)\ \Rightarrow\ \mathrm{Risk}(\hat g_\varepsilon;\varepsilon)\le\psi(\varepsilon)\,\mathrm{Risk}_*(\varepsilon),\qquad \psi(\varepsilon)=\frac{\sqrt{2\ln(2/\varepsilon)}}{\mathrm{ErfInv}(\varepsilon)}=1+o(1)\ \text{as }\varepsilon\to+0$$
[here $x=\mathrm{ErfInv}(y)$ stands for the inverse error function, i.e., $y=\frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-t^2/2}\,dt$].

Proof.
Let $G(\cdot)$ be the density of the $\mathcal N(0,I_L)$ distribution. By Theorem 3.2, we have $\mathrm{Risk}(\hat g_\varepsilon;\varepsilon)\le\Phi_*(\ln(2/\varepsilon))$, where, for $r>0$,
$$\Phi_*(r)=\max_{x,y\in X}\Phi_r(x,y),$$
$$\begin{aligned}
\Phi_r(x,y)&=\inf_{\phi\in\mathbb R^L,\,\alpha>0}\frac12\Big\{g^Tx-g^Ty+\alpha\Big[\ln\Big(\int \exp\{-\alpha^{-1}\phi^T\omega\}\,G(\omega-Ax)\,d\omega\Big)\\
&\qquad\qquad +\ln\Big(\int \exp\{\alpha^{-1}\phi^T\omega\}\,G(\omega-Ay)\,d\omega\Big)+2r\Big]\Big\}\\
&=\inf_{\phi\in\mathbb R^L,\,\alpha>0}\frac12\big\{g^Tx-g^Ty+\phi^TA(y-x)+\alpha^{-1}\phi^T\phi+2\alpha r\big\}\\
&=\inf_{\phi}\frac12\big\{g^Tx-g^Ty+\phi^TA(y-x)+2\sqrt{2r}\,\|\phi\|\big\}\\
&=\begin{cases}\frac12[g^Tx-g^Ty],&\|A(x-y)\|\le 2\sqrt{2r},\\[2pt]-\infty,&\|A(x-y)\|>2\sqrt{2r}.\end{cases}
\end{aligned}$$
Thus,
$$\mathrm{Risk}(\hat g_\varepsilon;\varepsilon)\le\Phi_*(\ln(2/\varepsilon))=\tfrac12[g^T\bar x-g^T\bar y]\tag{4.2}$$
for certain $\bar x,\bar y\in X$ with $\|A(\bar x-\bar y)\|\le 2\sqrt{2\ln(2/\varepsilon)}$. It remains to prove that
$$\mathrm{Risk}_*(\varepsilon)\ge\psi^{-1}(\varepsilon)\,\Phi_*(\ln(2/\varepsilon)).\tag{4.3}$$
To this end, assume, contrary to what should be proved, that
$$\mathrm{Risk}_*(\varepsilon)<\psi^{-1}(\varepsilon)\,\Phi_*(\ln(2/\varepsilon))\ \big(=\tfrac12\psi^{-1}(\varepsilon)[g^T\bar x-g^T\bar y]\big),$$
and let us lead this assumption to a contradiction. Under our assumption, there exist $\rho<\tfrac12\psi^{-1}(\varepsilon)[g^T\bar x-g^T\bar y]$, $\varepsilon'<\varepsilon$ and an estimate $\widetilde g$ such that
$$\forall(x\in X):\quad \mathrm{Prob}\{|\widetilde g(Ax+\xi)-g^Tx|\ge\rho\}\le\varepsilon'.\tag{4.4}$$
1, we see that 2 ρ < [ g T ¯ x − g T ¯ y ]. Let ˆ x = ¯ x and ˆ y bea convex combination of ¯ x and ¯ y such that 2 ρ = [ g T ˆ x − g T ˆ y ]. Note that k A (ˆ x − ˆ y ) k = (cid:20) ρ [ g T ¯ x − g T ¯ y ] (cid:21)| {z } <ψ − ( ε ) k A (¯ x − ¯ y ) k ≤ ψ − ( ε )2 q /ε ) = 2 erfinv( ε ) . Now, let Π be the hypothesis that the distribution of an observation (4.1)comes from x = ˆ x , and let Π be the hypothesis that this distribution comesfrom x = ˆ y . From (4.4) by the same standard argument as in the proof ofLemma 3.2, it follows that there exists a routine, based on a single observa-tion (4.1), for distinguishing between Π and Π , which rejects Π i when thishypothesis is true with probability ≤ ε ′ , i = 1 ,
2. But, it is well known thatthe hypotheses on shifts of the standard Gaussian distribution indeed can bedistinguished with the outlined reliability. This is possible if and only if theEuclidean distance between the corresponding shifts is at least 2 erfinv( ε ′ ).This condition is not satisfied for our Π i , i = 1 ,
2, which correspond to shifts A ˆ x and A ˆ y , since k A ˆ x − A ˆ y k ≤ ε ) < ε ). We have arrived ata desired contradiction. (cid:3) In fact, the reasoning can be slightly simplified and strengthened to yieldthe following result.
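Before stating it, it is instructive to see numerically how slowly the factor $\psi(\varepsilon)=\sqrt{2\ln(2/\varepsilon)}/\mathrm{ErfInv}(\varepsilon)$ of Proposition 4.1 approaches 1. The sketch below uses only the standard library, computing $\mathrm{ErfInv}$ by bisection on the standard Gaussian tail probability:

```python
import math

def gauss_tail(x):
    """Prob{N(0,1) > x}, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def erf_inv_tail(eps, lo=0.0, hi=40.0, iters=200):
    """ErfInv(eps): the x with Prob{N(0,1) > x} = eps, found by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gauss_tail(mid) > eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def psi(eps):
    return math.sqrt(2.0 * math.log(2.0 / eps)) / erf_inv_tail(eps)

print([round(psi(e), 4) for e in (1e-2, 1e-4, 1e-8, 1e-16)])
```

The values decrease monotonically toward 1 but remain noticeably above it even for very small $\varepsilon$, which is one motivation for the sharper bound below.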
Proposition 4.2.
In the situation of Proposition 4.1, one can build efficiently an affine estimate $\hat g_\varepsilon$ such that
$$0<\varepsilon<1/2\ \Rightarrow\ \mathrm{Risk}(\hat g_\varepsilon;\varepsilon)\le\frac{\mathrm{ErfInv}(\varepsilon/2)}{\mathrm{ErfInv}(\varepsilon)}\,\mathrm{Risk}_*(\varepsilon)$$
[cf. Proposition 4.1, and note that $\mathrm{ErfInv}(\varepsilon/2)/\mathrm{ErfInv}(\varepsilon)<\sqrt{2\ln(2/\varepsilon)}/\mathrm{ErfInv}(\varepsilon)=\psi(\varepsilon)$].

Proof.
Let
$$\Psi(x,y;\phi)=g^Tx-g^Ty+\phi^TA(y-x)+2\,\mathrm{ErfInv}(\varepsilon/2)\,\|\phi\|:\ (X\times X)\times\mathbb R^L\to\mathbb R.$$
$\Psi$ clearly is continuous, convex in $\phi$ and concave in $(x,y)$ on its domain; by the same argument as in the proof of Theorem 3.1, $\Psi$ has a well-defined saddle point value
$$2\Psi_*(\varepsilon)=\inf_\phi\overbrace{\max_{x,y\in X}\Psi(x,y;\phi)}^{\overline\Psi(\phi)}=\max_{x,y\in X}\overbrace{\inf_\phi\Psi(x,y;\phi)}^{\underline\Psi(x,y)}.$$
The function
$$\overline\Psi(\phi)=\max_{x,y\in X}[g^Tx-g^Ty+\phi^T(Ay-Ax)]+2\,\mathrm{ErfInv}(\varepsilon/2)\,\|\phi\|\ \ge\ 2\,\mathrm{ErfInv}(\varepsilon/2)\,\|\phi\|$$
is a finite convex function on $\mathbb R^L$ which goes to $\infty$ as $\|\phi\|\to\infty$, and therefore it attains its minimum at a point $\phi_*$, so that
$$2\Psi_*(\varepsilon)=\overline\Psi(\phi_*).$$
Setting
$$c_*=\frac12\Big[\max_{x\in X}[g^Tx-\phi_*^TAx]-\max_{y\in X}[-g^Ty+\phi_*^TAy]\Big],$$
we have, similarly to the proof of Lemma 3.1, the following:
$$\text{(a)}\quad \max_{x\in X}[g^Tx-\phi_*^TAx-c_*]+\mathrm{ErfInv}(\varepsilon/2)\,\|\phi_*\|=\Psi_*(\varepsilon),$$
$$\text{(b)}\quad \max_{y\in X}[-g^Ty+\phi_*^TAy+c_*]+\mathrm{ErfInv}(\varepsilon/2)\,\|\phi_*\|=\Psi_*(\varepsilon).$$
Now, consider the affine estimate
$$\hat g_\varepsilon(\omega)=\phi_*^T\omega+c_*.$$
From (a) it follows that
$$\forall d>\Psi_*(\varepsilon):\quad \sup_{x\in X}\mathrm{Prob}\{g^Tx-\hat g_\varepsilon(Ax+\xi)>d\}\le\varepsilon'<\varepsilon/2,$$
while (b) implies that
$$\forall d>\Psi_*(\varepsilon):\quad \sup_{y\in X}\mathrm{Prob}\{\hat g_\varepsilon(Ay+\xi)-g^Ty>d\}\le\varepsilon'<\varepsilon/2.$$
We conclude that $\mathrm{Risk}(\hat g_\varepsilon;\varepsilon)\le\Psi_*(\varepsilon)$. To complete the proof, it suffices to demonstrate that
$$\mathrm{Risk}_*(\varepsilon)\ge\frac{\mathrm{ErfInv}(\varepsilon)}{\mathrm{ErfInv}(\varepsilon/2)}\,\Psi_*(\varepsilon).\tag{4.5}$$
To this end, observe that
$$\underline\Psi(x,y)=[g^Tx-g^Ty]+\inf_\phi\{\phi^TA(y-x)+2\,\mathrm{ErfInv}(\varepsilon/2)\,\|\phi\|\}=\begin{cases}g^Tx-g^Ty,&\|A(y-x)\|\le 2\,\mathrm{ErfInv}(\varepsilon/2),\\-\infty,&\text{otherwise},\end{cases}$$
whence
$$\Psi_*(\varepsilon)=\tfrac12[g^T\bar x-g^T\bar y]$$
for certain $\bar x,\bar y\in X$ such that $\|A(\bar x-\bar y)\|\le 2\,\mathrm{ErfInv}(\varepsilon/2)$. Relation (4.5) can be derived from this observation by exactly the same argument as used in the proof of Proposition 4.1 to derive (4.3) from (4.2). $\square$
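To make the construction concrete, suppose $X$ is a box $[0,u]$; then the inner maximum in $\overline\Psi(\phi)$ has the closed form $\sum_i u_i|g_i-(A^T\phi)_i|+2\,\mathrm{ErfInv}(\varepsilon/2)\,\|\phi\|$, and $\phi_*$ can be approached by a plain subgradient method. The sketch below uses made-up toy data ($A$, $g$, $u$) and the rounded constant $2\,\mathrm{ErfInv}(0.025)\approx 3.92$ for $\varepsilon=0.05$; the crude solver merely stands in for a genuine convex-programming routine:

```python
import numpy as np

rng = np.random.default_rng(1)
L, n = 6, 4                          # toy sizes (made-up)
A = rng.normal(size=(L, n))          # observation map in omega = A x + xi
g = np.array([1.0, 0.0, -1.0, 2.0])  # linear form to recover
u = np.ones(n)                       # X = box [0, u]
q = 3.92                             # ~ 2*ErfInv(eps/2) for eps = 0.05

def psi_bar(phi):
    """Closed form of max_{x,y in X}[g^T(x-y) + phi^T A(y-x)] + q*||phi||."""
    return float(u @ np.abs(g - A.T @ phi) + q * np.linalg.norm(phi))

phi = np.zeros(L)
best_phi, best_val = phi.copy(), psi_bar(phi)
for t in range(1, 3001):             # subgradient descent, 1/sqrt(t) steps
    v = g - A.T @ phi
    nrm = np.linalg.norm(phi)
    sub = -A @ (u * np.sign(v)) + (q * phi / nrm if nrm > 0 else np.zeros(L))
    phi = phi - (0.5 / np.sqrt(t)) * sub
    if psi_bar(phi) < best_val:
        best_val, best_phi = psi_bar(phi), phi.copy()

# the shift c_* balancing over- and under-estimation over the box
v = g - A.T @ best_phi
c = 0.5 * (u @ np.maximum(v, 0.0) - u @ np.maximum(-v, 0.0))
risk_bound = 0.5 * best_val          # Psi_*(eps) for this toy instance
print(risk_bound)
```

The resulting affine estimate is $\hat g_\varepsilon(\omega)=\phi_*^T\omega+c_*$, with `risk_bound` playing the role of the certified value $\Psi_*(\varepsilon)$.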
5. Adaptive version of the estimate.
In the situation of Problem I, let $X_1\subset X_2\subset\cdots\subset X_K$ be a nested collection of nonempty convex compact sets in $\mathbb R^n$ such that $A(X_K)\subset\mathcal M$. Consider a modification of the problem where the signal $x$ underlying our observation is known to belong to one of the $X_k$, with the value of $k\le K$ unknown in advance. Given a linear form $g^Tz$ on $\mathbb R^n$, let $\mathrm{Risk}^k(\hat g;\varepsilon)$ and $\mathrm{Risk}^k_*(\varepsilon)$ be, respectively, the $\varepsilon$-risk of an estimate $\hat g$ on $X_k$ and the minimax optimal $\varepsilon$-risk of recovering $g^Tx$ on $X_k$. Let also $\Phi^k_*(r)$ be the function associated with $X=X_k$ according to (3.1). As is immediately seen, the functions $\Phi^k_*(r)$ grow with $k$. Our goal is to modify the estimate $\hat g$ yielded by Theorem 3.1 in such a way that the $\varepsilon$-risk of the modified estimate on $X_k$ will be "nearly" $\mathrm{Risk}^k_*(\varepsilon)$ for every $k\le K$. This goal can be achieved by a straightforward application of the well-known Lepskii adaptation scheme [19, 20], as follows.

Given $\delta>0$, let $\delta'\in(0,\delta)$, and let $\hat g_k(\cdot)$ be the affine estimate with the $(\varepsilon/K)$-risk on $X_k$ not exceeding $\Phi^k_*(\ln(2K/\varepsilon))+\delta'$, provided by Theorem 3.1 as applied with $\varepsilon/K$ substituted for $\varepsilon$ and $X_k$ substituted for $X$. Then, for any $k\le K$,
$$\sup_{x\in X_k}\mathrm{Prob}_{\omega\sim p_{A(x)}(\cdot)}\{|\hat g_k(\omega)-g^Tx|>\Phi^k_*(\ln(2K/\varepsilon))+\delta\}\le\varepsilon'/K<\varepsilon/K.\tag{5.1}$$
Given an observation $\omega$, let us say that an index $k\le K$ is $\omega$-good if, for any $k'$ with $k\le k'\le K$,
$$|\hat g_{k'}(\omega)-\hat g_k(\omega)|\le\Phi^k_*(\ln(2K/\varepsilon))+\Phi^{k'}_*(\ln(2K/\varepsilon))+2\delta.$$
Note that $\omega$-good indexes do exist (e.g., $k=K$). Given $\omega$, we can find the smallest $\omega$-good index $k=k(\omega)$; our estimate is nothing but $\hat g(\omega)=\hat g_{k(\omega)}(\omega)$.

Proposition 5.1.
Assume that $\varepsilon\in(0,1/4)$, and let
$$\vartheta=\frac{3\ln(2K/\varepsilon)}{\ln(2/\varepsilon)}.$$
Then, for any $k$, $1\le k\le K$,
$$\sup_{x\in X_k}\mathrm{Prob}_{\omega\sim p_{A(x)}(\cdot)}\{|\hat g(\omega)-g^Tx|>\vartheta\,\Phi^k_*(\ln(2/\varepsilon))+3\delta\}<\varepsilon,\tag{5.2}$$
whence also
$$\forall(k,\ 1\le k\le K):\quad \mathrm{Risk}^k(\hat g;\varepsilon)\le\frac{6\ln(2K/\varepsilon)}{\ln(1/(4\varepsilon))}\,\mathrm{Risk}^k_*(\varepsilon)+3\delta.\tag{5.3}$$
Proof.
Setting $r=\ln(2K/\varepsilon)$, let us fix $\bar k\le K$ and $x\in X_{\bar k}$, and call a realization $\omega$ $x$-good if
$$\forall(k,\ \bar k\le k\le K):\quad |\hat g_k(\omega)-g^Tx|\le\Phi^k_*(r)+\delta.\tag{5.4}$$
Since $X_k\supset X_{\bar k}$ when $k\ge\bar k$, (5.1) implies that
$$\mathrm{Prob}_{\omega\sim p_{A(x)}(\cdot)}\{\omega\text{ is }x\text{-good}\}\ge 1-\varepsilon'.$$
Now, when $x$ is the signal and $\omega$ is $x$-good, relations (5.4) imply that $\bar k$ is an $\omega$-good index, so that $k(\omega)\le\bar k$. Since $k(\omega)$ is an $\omega$-good index, we have
$$|\hat g(\omega)-\hat g_{\bar k}(\omega)|=|\hat g_{k(\omega)}(\omega)-\hat g_{\bar k}(\omega)|\le\Phi^{k(\omega)}_*(r)+\Phi^{\bar k}_*(r)+2\delta,$$
which combines with (5.4) to imply that
$$|\hat g(\omega)-g^Tx|\le 2\Phi^{\bar k}_*(r)+\Phi^{k(\omega)}_*(r)+3\delta\le 3\Phi^{\bar k}_*(r)+3\delta,\tag{5.5}$$
where the concluding inequality is due to $k(\omega)\le\bar k$ and to the fact that $\Phi^k_*$ grows with $k$. The bound (5.5) holds true whenever $\omega$ is $x$-good which, as we have seen, happens with probability $\ge 1-\varepsilon'$. Since $\varepsilon'<\varepsilon$ and $x\in X_{\bar k}$ is arbitrary, we conclude that
$$\mathrm{Risk}^{\bar k}(\hat g;\varepsilon)\le 3\Phi^{\bar k}_*(r)+3\delta.\tag{5.6}$$
Using the nonnegativity and concavity of $\Phi^{\bar k}_*(\cdot)$ on the nonnegative ray and recalling the definition of $r$, we obtain
$$\Phi^{\bar k}_*(r)\le\frac{\ln(2K/\varepsilon)}{\ln(2/\varepsilon)}\,\Phi^{\bar k}_*(\ln(2/\varepsilon))$$
whenever $\varepsilon\le 1/2$ and $K\ge 1$. Recalling the definition of $\vartheta$, the right-hand side in (5.6) therefore does not exceed $\vartheta\,\Phi^{\bar k}_*(\ln(2/\varepsilon))+3\delta$. Since $\bar k\le K$ is arbitrary, we have proved (5.2). This bound, due to Lemma 3.2, implies (5.3). $\square$

Acknowledgments.
The authors would like to acknowledge the valuable suggestions made by L. Birgé, Université Paris 6, and Alexander Goldenshluger, Haifa University.

REFERENCES

[1] Ben-Tal, A. and Nemirovski, A. (2001). Lectures on Modern Convex Optimization: Analysis, Algorithms and Engineering Applications. SIAM, Philadelphia. MR1857264
[2] Birgé, L. (1984). Sur un théorème de minimax et son application aux tests. Probab. Math. Statist.
[3] Birgé, L. (1983). Approximation dans les espaces métriques et théorie de l'estimation. Z. Wahrsch. Verw. Gebiete.
[4] Cai, T. and Low, M. (2003). A note on nonparametric estimation of linear functionals. Ann. Statist.
[5] Donoho, D. (1995). Statistical estimation and optimal recovery. Ann. Statist.
[6] Donoho, D. and Liu, R. (1987). Geometrizing Rates of Convergence. I. Technical Report 137a, Dept. Statistics, Univ. California, Berkeley.
[7] Donoho, D. and Liu, R. (1991). Geometrizing rates of convergence. II. Ann. Statist.
[8] Donoho, D., Liu, R. and MacGibbon, B. (1990). Minimax risk over hyperrectangles, and implications. Ann. Statist.
[9] Donoho, D. and Low, M. (1992). Renormalization exponents and optimal pointwise rates of convergence. Ann. Statist.
[10] Eubank, R. (1988). Spline Smoothing and Nonparametric Regression. Dekker, New York. MR0934016
[11] Goldenshluger, A. and Nemirovski, A. (1997). On spatially adaptive estimation of nonparametric regression. Math. Methods Statist.
[12] Härdle, W. (1990). Applied Nonparametric Regression. ES Monograph Series. Cambridge Univ. Press, Cambridge, UK. MR1161622
[13] Härdle, W., Kerkyacharian, G., Picard, D. and Tsybakov, A. B. (1998). Wavelets, Approximation and Statistical Applications. Lecture Notes in Statistics. Springer, New York. MR1618204
[14] Hiriart-Urruty, J. B. and Lemaréchal, C. (1993). Convex Analysis and Minimization Algorithms I: Fundamentals. Springer, Berlin.
[15] Ibragimov, I. A. and Khasminski, R. Z. (1981). Statistical Estimation: Asymptotic Theory. Springer, New York. MR0620321
[16] Ibragimov, I. A. and Khas'minskij, R. Z. (1984). On the nonparametric estimation of a value of a linear functional in Gaussian white noise. Teor. Veroyatnost. i Primenen.
[17] Klemelä, J. and Tsybakov, A. B. (2001). Sharp adaptive estimation of linear functionals. Ann. Statist.
[18] Korostelev, A. and Tsybakov, A. (1993). Minimax Theory of Image Reconstruction. Lecture Notes in Statistics. Springer, New York. MR1226450
[19] Lepskii, O. (1990). On a problem of adaptive estimation in Gaussian white noise. Teor. Veroyatnost. i Primenen.
[20] Lepskii, O. (1991). Asymptotically minimax adaptive estimation I. Upper bounds. Optimally adaptive estimates. Teor. Veroyatnost. i Primenen.
[21] Nemirovski, A. (2000). Topics in Nonparametric Statistics. In Ecole d'Été de Probabilités de Saint-Flour XXVIII (M. Emery, A. Nemirovski, D. Voiculescu and P. Bernard, eds.). Lecture Notes in Mathematics. Springer, Berlin.
[22] Pinsker, M. (1980). Optimal filtration of square-integrable signals in Gaussian noise. Problemy Peredachi Informatsii.
[23] Prakasa Rao, B. L. S. (1983). Nonparametric Functional Estimation. Academic Press, Orlando. MR0740865
[24] Rosenblatt, M. (1991). Stochastic Curve Estimation. Institute of Mathematical Statistics, Hayward, CA.
[25] Simonoff, J. S. (1996). Smoothing Methods in Statistics. Springer, New York. MR1391963
[26] Takezawa, K. (2005). Introduction to Nonparametric Regression. Wiley, Hoboken, NJ. MR2181216
[27] Tsybakov, A. B. (2004). Introduction à l'Estimation Nonparamétrique. Springer, Berlin. MR2013911
[28] Wasserman, L. (2006). All of Nonparametric Statistics. Springer, New York. MR2172729
Laboratoire Jean Kuntzmann
Université Grenoble I
51 rue des Mathématiques
BP 53
38041 Grenoble Cedex 9
France
E-mail: [email protected]