Nonparametric estimation by convex programming
arXiv [math.ST]
The Annals of Statistics © Institute of Mathematical Statistics, 2009
NONPARAMETRIC ESTIMATION BY CONVEX PROGRAMMING
By Anatoli B. Juditsky and Arkadi S. Nemirovski
Université Grenoble I and Georgia Institute of Technology
The problem we concentrate on is as follows: given (1) a convex compact set $X$ in $\mathbb{R}^n$, an affine mapping $x \mapsto A(x)$, a parametric family $\{p_\mu(\cdot)\}$ of probability densities and (2) $N$ i.i.d. observations of the random variable $\omega$, distributed with the density $p_{A(x)}(\cdot)$ for some (unknown) $x \in X$, estimate the value $g^T x$ of a given linear form at $x$. For several families $\{p_\mu(\cdot)\}$ with no additional assumptions on $X$ and $A$, we develop computationally efficient estimation routines which are minimax optimal, within an absolute constant factor. We then apply these routines to recovering $x$ itself in the Euclidean norm.
1. Introduction.
The problem we are interested in is essentially as follows: suppose that we are given a convex compact set $X$ in $\mathbb{R}^n$, an affine mapping $x \mapsto A(x)$ and a parametric family $\{p_\mu(\cdot)\}$ of probability densities. Suppose that $N$ i.i.d. observations of the random variable $\omega$, distributed with the density $p_{A(x)}(\cdot)$ for some (unknown) $x \in X$, are available. Our objective is to estimate the value $g^T x$ of a given linear form at $x$.

In nonparametric statistics, there exists an immense literature on various versions of this problem (see, e.g., [10, 11, 12, 13, 15, 17, 18, 21, 22, 23, 24, 25, 26, 27, 28] and the references therein). To the best of our knowledge, the majority of papers on the subject focus on specific domains $X$ (e.g., distributions with densities from Sobolev balls), and investigate lower and upper bounds on the worst-case, with regard to $x \in X$, accuracy to which the problem of interest can be solved. These bounds depend on the number of observations $N$, and the question of primary interest is the behavior of those bounds as $N \to \infty$. When the lower and the upper bounds coincide within a constant factor [or, ideally, within factor $(1 + o(1))$ as $N \to \infty$], the estimation problem is considered essentially solved, and the estimation methods underlying the upper bounds are treated as optimal.

Received March 2008; revised July 2008. Supported in part by the NSF Grant 0619977.
AMS 2000 subject classifications. Primary 62G08; secondary 62G15, 62G07.
Key words and phrases. Estimation of linear functional, minimax estimation, oracle inequalities, convex optimization, PET tomography.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2009, Vol. 37, No. 5A, 2278–2300. This reprint differs from the original in pagination and typographic detail.
The approach we adopt in this paper is of a different spirit; we make no "structural assumptions" on $X$, aside from the assumptions of convexity and compactness which are crucial for us, and we make no assumptions on the linear functional $g$. Clearly, with no structural assumptions on $X$ and $g$, explicit bounds on the risks of our estimates, as well as bounds on the minimax optimal risk, are impossible. However, it is possible to show that when estimating linear forms, the worst-case risk of the estimator we propose is within an absolute constant factor of the "ideal" (i.e., the minimax optimal) risk. It should be added that while the optimal, within an absolute constant factor, worst-case risk of our estimates is not available in a closed analytical form, it is "available algorithmically"—it can be efficiently computed, provided that $X$ is computationally tractable.

Note that the estimation problem presented above can be seen as a generalization of the problem of estimation of linear functionals of the central parameter of a normal distribution (see [4, 8, 9, 16]). Namely, suppose that the observation $\omega \in \mathbb{R}^m$, $\omega = Ax + \sigma\xi$, of the unknown signal $x$ is available; here $A$ is a given $m \times n$ matrix, $\xi \sim \mathcal{N}(0, I_m)$ and $\sigma > 0$ is known. In this case (see [5] and Section 4), the minimax optimal affine in $\omega$ estimate is minimax optimal, within an absolute constant factor, among all possible estimates.

Another special case of our setting is the problem of estimating a linear functional $g(p)$ of an unknown distribution $p$, given $N$ i.i.d. observations $\omega_1, \ldots, \omega_N$ drawn from $p$. We suppose that it is known a priori that $p \in X$, where $X$ is a given convex compact set of distributions (here the parameter $x$ is the density $p$ itself). Some important results for this problem have been obtained in [6] and [7]. For instance, in [7] the authors established minimax bounds for the risk of estimation of $g(p)$ and developed an estimation method based on the binary search algorithm.
The estimation procedure uses at each search iteration tests of convex hypotheses, studied in [2, 3]. That estimator of $g(p)$ is shown to be minimax optimal (within an absolute constant factor) if some basic structural assumptions about $X$ hold.

In this paper, we concentrate on the properties of affine estimators. Here, we refer to an estimator $\hat{g}$ as affine when it is of the form $\hat{g}(\omega_1, \ldots, \omega_N) = \sum_{i=1}^N \phi(\omega_i)$ for some given function $\phi$, that is, if $\hat{g}$ is an affine function of the empirical distribution. When $\phi$ itself is an affine function, the estimator is also affine in the observations, as it is in the setting of [5]. Our motivation is to extend the results obtained in [5] to the non-Gaussian situation. In particular, we propose a technique of derivation of affine estimators which are minimax optimal (up to a moderate absolute constant) for a class of "good parametric families of distributions," which is defined in Section 2.1. As the normal family and discrete distributions belong to the class of good parametric families, the minimax optimal estimators for these cases are obtained by direct application of the general construction. In this sense, our results generalize those of [7] and [5] on the estimation of linear functionals. On the other hand, it is clear that the different techniques presented in the current paper inherit from those developed in [3] and [7]. To make a computationally efficient solution of the estimation problem possible, unlike the authors of those papers, we concentrate only on the finite-dimensional situation.

(Footnote: For details on computational tractability and complexity issues in convex optimization, see, for example, [1], Chapter 4. A reader not familiar with this area will not lose much when interpreting a computationally tractable convex set as a set given by a finite system of inequalities $p_i(x) \leq 0$, $i = 1, \ldots, m$, where $p_i(x)$ are convex polynomials.)
As a result, the proposed estimation procedures allow efficient numeric implementation. This also allows us to avoid much of the intricate mathematical details. However, we allow the dimension to be arbitrarily large, thus addressing, essentially, a nonparametric estimation problem.

The rest of this paper is organized as follows. In Section 2, we define the main components of our study—we state the estimation problem and define the corresponding risk measures. Then, in Section 3, we provide the general solution to the estimation problem, which is then applied, in Section 4, to the problems of estimating linear functionals in the normal model and the tomography model. Finally, in Section 5, we present adaptive versions of affine estimators.

Note that when passing from recovering linear forms of the unknown signal to recovering the signal itself, we do impose structural assumptions on $X$, but still make no structural assumptions on the affine mapping $A(x)$. Our "optimality results" become weaker—instead of "optimality within an absolute constant factor" we end up with statements like "the worst-case risk of such-and-such estimate is in between the minimax optimal risk and the latter risk to the power $\chi$," with $\chi$ depending on the geometry of $X$ (and close to 1 when this geometry is "good enough").
2. Problem statement.
2.1. Good parametric families of distributions.
Let $(\Omega, P)$ be a Polish space with Borel $\sigma$-finite measure $P$, and let $M \subset \mathbb{R}^m$. Assume that every $\mu \in M$ is associated with a probability density $p_\mu(\omega)$—a Borel nonnegative function on $\Omega$ such that $\int_\Omega p_\mu(\omega)\,P(d\omega) = 1$; we refer to the mapping $\mu \to p_\mu(\cdot)$ as a parametric density family $\mathcal{D}$. Let also $\mathcal{F}$ be a finite-dimensional linear space of Borel functions on $\Omega$ which contains constants. We call a pair $(\mathcal{D}, \mathcal{F})$ good if it possesses the following properties:

1. $M$ is an open convex set in $\mathbb{R}^m$;
2. whenever $\mu \in M$, we have $p_\mu(\omega) > 0$ for all $\omega \in \Omega$;
3. whenever $\mu, \nu \in M$, we have $\phi(\omega) = \ln(p_\mu(\omega)/p_\nu(\omega)) \in \mathcal{F}$;
4. whenever $\phi(\omega) \in \mathcal{F}$, the function
$$F_\phi(\mu) = \ln\left(\int_\Omega e^{\phi(\omega)}\,p_\mu(\omega)\,P(d\omega)\right)$$
is well defined and concave in $\mu \in M$.

The reader familiar with exponential families will immediately recognize that the above definition implies that $\mathcal{D}$ is such a family. Denoting by $p_\mu(\omega) = \exp\{\theta(\mu)^T\omega - C(\theta(\mu))\}$, $\mu \in M$, its density with regard to $P$, where $\theta$ is the natural parameter and $C(\cdot)$ is the cumulant function, $\mathcal{D}$ is good if:

1. $M$ is an open convex set in $D_P = \{\mu \in \mathbb{R}^m : \int e^{\theta(\mu)^T\omega}\,P(d\omega) < \infty\}$;
2. for any $\phi$ such that the cumulant function $C(\theta(\mu) + \phi)$ is well defined, the function $[C(\theta(\mu) + \phi) - C(\theta(\mu))]$ is concave in $\mu \in M$.

Let us list several examples.

Example 1 (Discrete distributions). Let $\Omega = \{1, 2, \ldots, M\}$ be a finite set, $P$ be the counting measure on $\Omega$, $M = \{\mu \in \mathbb{R}^M : \mu > 0,\ \sum_i \mu_i = 1\}$ and $p_\mu(i) = \mu_i$, $i = 1, \ldots, M$. Let also $\mathcal{F}$ be the set of all functions on $\Omega$. The associated pair $(\mathcal{D}, \mathcal{F})$ clearly is good.

Example 2 (Poisson distributions). Let $\Omega = \{0, 1, 2, \ldots\}$, $P$ be the counting measure on $\Omega$, $M = \{\mu \in \mathbb{R} : \mu > 0\}$ and $p_\mu(i) = \mu^i e^{-\mu}/i!$, $i \in \Omega$, so that $p_\mu$ is the Poisson distribution with the parameter $\mu$. Let also $\mathcal{F}$ be the set of affine functions $\phi(i) = \alpha i + \beta$ on $\Omega$. We claim that the associated pair $(\mathcal{D}, \mathcal{F})$ is good.
Indeed, $\ln(p_\mu(i)/p_\nu(i)) = i[\ln\mu - \ln\nu] + \nu - \mu$ is an affine function of $i$, and
$$\ln\left(\sum_i e^{\alpha i + \beta}\,\frac{\mu^i e^{-\mu}}{i!}\right) = \ln\left(e^{\beta - \mu}\,e^{\mu e^{\alpha}}\right) = \beta - \mu + \mu e^{\alpha}$$
is a concave (in fact, affine) function of $\mu > 0$.

Example 3 (Gaussian distributions with fixed covariance). Let $\Omega = \mathbb{R}^k$, $P$ be the Lebesgue measure on $\Omega$, $\Sigma$ be a positive definite $k \times k$ matrix, $M = \mathbb{R}^k$ and
$$p_\mu(\omega) = (2\pi)^{-k/2}(\mathrm{Det}\,\Sigma)^{-1/2}\exp\{-\tfrac{1}{2}(\omega - \mu)^T\Sigma^{-1}(\omega - \mu)\}$$
be the density of the Gaussian distribution with mean $\mu$ and covariance matrix $\Sigma$. Let, further, $\mathcal{F}$ be comprised of affine functions on $\Omega$. We claim that the associated pair $(\mathcal{D}, \mathcal{F})$ is good. Indeed, the function $\ln(p_\mu(\omega)/p_\nu(\omega))$ indeed is affine on $\Omega$, and
$$\ln\left(\int e^{\phi^T\omega + c}\,p_\mu(\omega)\,d\omega\right) = c + \phi^T\mu + \tfrac{1}{2}\phi^T\Sigma\phi$$
is an affine (hence concave) function of $\mu$.

Example 4 (Direct product of good pairs). Let $p^\ell_{\mu_\ell}(\omega_\ell)$ be a probability density, parameterized by $\mu_\ell \in M_\ell \subset \mathbb{R}^{m_\ell}$, on a Polish space $\Omega_\ell$ with Borel $\sigma$-finite measure $P_\ell$, and let $\mathcal{F}_\ell$ be a finite-dimensional linear space of Borel functions on $\Omega_\ell$ such that the associated pairs $(\mathcal{D}_\ell, \mathcal{F}_\ell)$ are good. Let us define the direct product $(\mathcal{D}, \mathcal{F}) = \bigotimes_{\ell=1}^L (\mathcal{D}_\ell, \mathcal{F}_\ell)$ of these pairs as follows:

• The associated space with measure is $(\Omega = \Omega_1 \times \cdots \times \Omega_L,\ P = P_1 \times \cdots \times P_L)$.
• The set of parameters is $M = M_1 \times \cdots \times M_L$, and the density associated with a parameter $\mu = (\mu_1, \ldots, \mu_L)$ from this set is $p_\mu(\omega_1, \ldots, \omega_L) = \prod_{\ell=1}^L p^\ell_{\mu_\ell}(\omega_\ell)$.
• $\mathcal{F}$ is comprised of all functions $\phi(\omega_1, \ldots, \omega_L) = \sum_{\ell=1}^L \phi_\ell(\omega_\ell)$ with $\phi_\ell(\cdot) \in \mathcal{F}_\ell$, $\ell = 1, \ldots, L$.

We claim that the direct product of good pairs is good. Indeed, $M$ is an open convex set; when $\mu = (\mu_1, \ldots, \mu_L)$ and $\nu = (\nu_1, \ldots, \nu_L)$ are in $M$, we have
$$\ln(p_\mu(\omega_1, \ldots, \omega_L)/p_\nu(\omega_1, \ldots, \omega_L)) = \sum_{\ell=1}^L \ln(p^\ell_{\mu_\ell}(\omega_\ell)/p^\ell_{\nu_\ell}(\omega_\ell)) \in \mathcal{F},$$
and when $\phi(\omega_1, \ldots, \omega_L) = \sum_\ell \phi_\ell(\omega_\ell) \in \mathcal{F}$, we have
$$\ln\left(\int_\Omega e^{\phi(\omega)}\,p_\mu(\omega)\,P(d\omega)\right) = \ln\prod_\ell \int_{\Omega_\ell} e^{\phi_\ell(\omega_\ell)}\,p^\ell_{\mu_\ell}(\omega_\ell)\,P_\ell(d\omega_\ell) = \sum_\ell \ln\left(\int_{\Omega_\ell} e^{\phi_\ell(\omega_\ell)}\,p^\ell_{\mu_\ell}(\omega_\ell)\,P_\ell(d\omega_\ell)\right),$$
which is a sum of concave functions of $\mu_\ell$ and thus is concave in $\mu$.

2.2. The problem.
The problem we are interested in is as follows:
Problem I.
We are given the following:

• a convex compact set $X \subset \mathbb{R}^n$;
• a good pair $(\mathcal{D}, \mathcal{F})$ comprised of
  – a parametric family $\{p_\mu(\omega) : \mu \in M \subset \mathbb{R}^m\}$ of probability densities on a Borel space $\Omega$ with $\sigma$-finite Borel measure $P$, and
  – a finite-dimensional linear space $\mathcal{F}$ of Borel functions on $\Omega$;
• an affine mapping $x \mapsto A(x) : X \to M$;
• a linear form $g^T z$ on $\mathbb{R}^n \supset X$.

Aside of this a priori information, we are given a realization $\omega$ of a random variable taking values in $\Omega$ and distributed with the density $p_{A(x)}(\cdot)$ for some unknown in advance $x \in X$. Our goal is to infer from this observation an estimate $\hat{g}(\omega)$ of the value $g^T x$ of the given linear form at $x$.

From now on we refer to an estimate as affine if it is of the form $\hat{g}(\omega) = \phi(\omega)$, with certain $\phi \in \mathcal{F}$.

We quantify the risk of a candidate estimate $\hat{g}(\cdot)$ by its worst-case, over $x \in X$, confidence interval, given the confidence level. Specifically, given a confidence level $\varepsilon \in (0, 1)$, we define the $\varepsilon$-risk of an estimate $\hat{g}$ as
$$\mathrm{Risk}(\hat{g}; \varepsilon) = \inf\left\{\delta : \sup_{x \in X} \mathrm{Prob}_{\omega \sim p_{A(x)}(\cdot)}\{\omega : |\hat{g}(\omega) - g^T x| > \delta\} < \varepsilon\right\}.$$
The corresponding minimax optimal $\varepsilon$-risk is defined as
$$\mathrm{Risk}_*(\varepsilon) = \inf_{\hat{g}(\cdot)} \mathrm{Risk}(\hat{g}; \varepsilon),$$
where the infimum is taken over the space of all Borel functions $\hat{g}$ on $\Omega$. We are interested also in the minimax optimal $\varepsilon$-risk of affine estimates
$$\mathrm{RiskA}(\varepsilon) = \inf_{\phi(\cdot) \in \mathcal{F}} \mathrm{Risk}(\phi; \varepsilon).$$
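The goodness conditions above are easy to sanity-check numerically for the concrete families of Examples 1–3. The sketch below is our own illustration (the parameter values are arbitrary): it verifies the identity behind Example 2, namely that for the Poisson family and $\phi(i) = \alpha i + \beta$ one has $F_\phi(\mu) = \beta - \mu + \mu e^\alpha$, which is affine (hence concave) in $\mu$.

```python
import math

def F_phi(mu, a, b, n_terms=300):
    """ln E[exp(a*i + b)] under Poisson(mu), by direct summation of the series."""
    s = 0.0
    p = math.exp(-mu)          # Poisson pmf at i = 0
    for i in range(n_terms):
        s += math.exp(a * i + b) * p
        p *= mu / (i + 1)      # pmf recursion: p_{i+1} = p_i * mu / (i + 1)
    return math.log(s)

# Closed form from Example 2: F_phi(mu) = b - mu + mu * exp(a), affine in mu.
a, b = 0.4, -0.3
for mu in (0.5, 2.0, 7.0):
    assert abs(F_phi(mu, a, b) - (b - mu + mu * math.exp(a))) < 1e-9
```

Since the closed form is linear in $\mu$, concavity of $F_\phi$ (property 4 of a good pair) is immediate here.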
3. Minimax optimal affine estimators.
3.1. Main result.
Our main result follows.
Theorem 3.1.
Let the pair $(\mathcal{D}, \mathcal{F})$ underlying Problem I be good. Then the minimax optimal risk achievable with affine estimates is, for small $\varepsilon$, within an absolute constant factor of the "true" minimax optimal risk; specifically,
$$0 < \varepsilon < 1/4 \;\Rightarrow\; \mathrm{RiskA}(\varepsilon) \leq \theta(\varepsilon)\,\mathrm{Risk}_*(\varepsilon), \qquad \theta(\varepsilon) = \frac{2\ln(2/\varepsilon)}{\ln(1/(4\varepsilon))}.$$

Proof. For $r \geq 0$, let us set
$$\Phi_r(x, y; \phi, \alpha) = g^T x - g^T y + \alpha\ln\left(\int_\Omega e^{\alpha^{-1}\phi(\omega)}\,p_{A(y)}(\omega)\,P(d\omega)\right) + \alpha\ln\left(\int_\Omega e^{-\alpha^{-1}\phi(\omega)}\,p_{A(x)}(\omega)\,P(d\omega)\right) + 2\alpha r:\ Z \times F_+ \to \mathbb{R},$$
$$Z = X \times X, \qquad F_+ = \mathcal{F} \times \{\alpha > 0\}.$$
We claim that this function is a continuous real-valued function on $Z \times F_+$, which is convex in $(\phi, \alpha) \in F_+$ and concave in $(x, y) \in Z$. Indeed, the function
$$\Psi(\mu, \nu; \phi) = \ln\left(\int_\Omega e^{-\phi(\omega)}\,p_\mu(\omega)\,P(d\omega)\right) + \ln\left(\int_\Omega e^{\phi(\omega)}\,p_\nu(\omega)\,P(d\omega)\right):\ (M \times M) \times \mathcal{F} \to \mathbb{R}$$
is well defined, concave in $(\mu, \nu) \in M \times M$ [since $(\mathcal{D}, \mathcal{F})$ is good] and convex in $\phi \in \mathcal{F}$ (evident). Since $M$ is open and $\mathcal{F}$ is a finite-dimensional linear space, $\Psi$ is continuous on its domain. It remains to note that $\Phi_r$ is the sum of a linear function of $x, y, \alpha$ and the function $\alpha\Psi(A(x), A(y); \alpha^{-1}\phi)$, which clearly is concave in $(x, y)$ [since $\Psi(\mu, \nu; \phi)$ is concave in $(\mu, \nu)$ and $A(\cdot)$ is affine] and convex in $(\phi, \alpha) \in F_+$ [since $\Psi(\mu, \nu; \phi)$ is convex and continuous in $\phi \in \mathcal{F}$, and the transformation $f(u) \mapsto g(u, \alpha) = \alpha f(u/\alpha)$ converts a convex function of $u$ into a convex in $(\alpha > 0, u)$ function of $(u, \alpha)$].

Since $Z$ is a convex finite-dimensional compact set, $F_+$ is a convex finite-dimensional set and $\Phi_r$ is continuous and convex–concave on $Z \times F_+$, we can invoke the Sion–Kakutani theorem (see, e.g., [14]) to infer that
$$\sup_{x,y \in X}\,\inf_{\phi \in \mathcal{F},\,\alpha > 0} \Phi_r(x, y; \phi, \alpha) = \inf_{\phi \in \mathcal{F},\,\alpha > 0}\,\max_{x,y \in X} \Phi_r(x, y; \phi, \alpha) := 2\Phi_*(r). \tag{3.1}$$
Note that $\Phi_*(r) \geq 0$ for $r \geq 0$. Indeed, the functional $f_x[h] = \ln\int_\Omega e^{h(\omega)}\,p_{A(x)}(\omega)\,P(d\omega)$ is well defined and convex on $\mathcal{F}$, with $f_x[0] = 0$, whence
$$\Phi_r(x, x; \phi, \alpha) = 2\alpha r + \alpha(f_x[-\alpha^{-1}\phi] + f_x[\alpha^{-1}\phi]) \geq 2\alpha r + 2\alpha f_x[0] = 2\alpha r \geq 0,$$
whence $\Phi_*(r) \geq \frac{1}{2}\sup_{x \in X}\inf_{\phi \in \mathcal{F},\,\alpha > 0}\Phi_r(x, x; \phi, \alpha) \geq 0$. The concavity of $\Phi_*(r)$ on the nonnegative ray follows immediately from the representation, yielded by (3.1),
$$\Phi_*(r) = \frac{1}{2}\inf_{\phi \in \mathcal{F},\,\alpha > 0}\left[2\alpha r + \sup_{x,y \in X}\Phi_0(x, y; \phi, \alpha)\right]$$
of $\Phi_*(r)$ as the infimum of a family of affine functions of $r$.

Lemma 3.1.
One has
$$\mathrm{RiskA}(\varepsilon) \leq \Phi_*(\ln(2/\varepsilon)).$$
Proof.
Given $\delta > 0$ and $\varepsilon \in (0, 1)$, let us build an affine estimate with $\varepsilon$-risk at most $R \equiv \Phi_*(\ln(2/\varepsilon)) + \delta/2$, as follows. By (3.1), there exist $\phi_* \in \mathcal{F}$ and $\alpha_* > 0$ such that
$$2\Phi_*(\ln(2/\varepsilon)) + \delta/2 \geq \max_{x,y \in X}\Phi_{\ln(2/\varepsilon)}(x, y; \phi_*, \alpha_*) = \underbrace{\max_{x \in X}\left[g^T x + \alpha_*\ln\left(\int_\Omega e^{-\alpha_*^{-1}\phi_*(\omega)}\,p_{A(x)}(\omega)\,P(d\omega)\right) + \alpha_*\ln(2/\varepsilon)\right]}_{U} + \underbrace{\max_{y \in X}\left[-g^T y + \alpha_*\ln\left(\int_\Omega e^{\alpha_*^{-1}\phi_*(\omega)}\,p_{A(y)}(\omega)\,P(d\omega)\right) + \alpha_*\ln(2/\varepsilon)\right]}_{V}.$$
Setting $c = \frac{1}{2}(U - V)$ and replacing $\phi_*$ with $\phi_* + c$ (which shifts the first bracket by $-c$ and the second by $+c$), we have
$$\max_{x \in X}\left[g^T x + \alpha_*\ln\left(\int_\Omega e^{-\alpha_*^{-1}[\phi_*(\omega)+c]}\,p_{A(x)}(\omega)\,P(d\omega)\right) + \alpha_*\ln(2/\varepsilon)\right] = U - c = \frac{U+V}{2} \leq \Phi_*(\ln(2/\varepsilon)) + \delta/4 = R - \delta/4,$$
$$\max_{y \in X}\left[-g^T y + \alpha_*\ln\left(\int_\Omega e^{\alpha_*^{-1}[\phi_*(\omega)+c]}\,p_{A(y)}(\omega)\,P(d\omega)\right) + \alpha_*\ln(2/\varepsilon)\right] = V + c = \frac{U+V}{2} \leq R - \delta/4,$$
whence
$$\max_{x \in X}\ln\left(\int_\Omega e^{\alpha_*^{-1}[g^T x - (\phi_*(\omega)+c) - R]}\,p_{A(x)}(\omega)\,P(d\omega)\right) \leq \ln(\varepsilon/2) - \frac{\delta}{4\alpha_*} \equiv \ln(\varepsilon'/2),$$
$$\max_{y \in X}\ln\left(\int_\Omega e^{\alpha_*^{-1}[(\phi_*(\omega)+c) - R - g^T y]}\,p_{A(y)}(\omega)\,P(d\omega)\right) \leq \ln(\varepsilon'/2),$$
that is,

(a) for all $x \in X$: $\int_\Omega e^{\alpha_*^{-1}[g^T x - (\phi_*(\omega)+c) - R]}\,p_{A(x)}(\omega)\,P(d\omega) \leq \varepsilon'/2$;

(b) for all $y \in X$: $\int_\Omega e^{\alpha_*^{-1}[(\phi_*(\omega)+c) - R - g^T y]}\,p_{A(y)}(\omega)\,P(d\omega) \leq \varepsilon'/2$.

For a given $x \in X$, the integrand in (a) is nonnegative and is $> 1$ for all $\omega$ such that $g^T x - [\phi_*(\omega) + c] > R$; therefore, (a) implies that $\mathrm{Prob}_{\omega \sim p_{A(x)}(\cdot)}\{g^T x > [\phi_*(\omega) + c] + R\} \leq \varepsilon'/2$ for every $x \in X$. By similar reasons, (b) implies that $\mathrm{Prob}_{\omega \sim p_{A(x)}(\cdot)}\{g^T x < [\phi_*(\omega) + c] - R\} \leq \varepsilon'/2$ for all $x \in X$. Since by construction $\varepsilon' < \varepsilon$, we see that the $\varepsilon$-risk of the affine estimate $\hat{g}(\omega) = \phi_*(\omega) + c$ is at most $R$, as claimed. □

Lemma 3.2.
One has
$$\delta \in (0, 1) \;\Rightarrow\; \mathrm{Risk}_*(\delta^2/4) \geq \Phi_*(\ln(1/\delta)), \tag{3.2}$$
whence also
$$\varepsilon \in (0, 1/4) \;\Rightarrow\; \mathrm{Risk}_*(\varepsilon) \geq \frac{\ln(1/(4\varepsilon))}{2\ln(2/\varepsilon)}\,\Phi_*(\ln(2/\varepsilon)). \tag{3.3}$$

Proof.
To prove (3.2), let us set $\rho = \ln(1/\delta)$. The function $\Psi_\rho(x, y) = \inf_{\phi \in \mathcal{F},\,\alpha > 0}\Phi_\rho(x, y; \phi, \alpha)$ takes values in $\{-\infty\} \cup \mathbb{R}$, is upper semicontinuous (since $\Phi_\rho$ is continuous) and is not identically $-\infty$ (in fact, it is even nonnegative when $y = x$). Thus, $\Psi_\rho$ achieves its maximum on $X \times X$ at a certain point $(\bar{x}, \bar{y})$, and for any $\alpha > 0$ and $\phi \in \mathcal{F}$:
$$\Phi_\rho(\bar{x}, \bar{y}; \phi, \alpha) \geq \Psi_\rho(\bar{x}, \bar{y}) = \sup_{x,y \in X}\,\inf_{\phi \in \mathcal{F},\,\alpha > 0}\Phi_\rho(x, y; \phi, \alpha) = 2\Phi_*(\rho), \tag{3.4}$$
where the concluding equality is given by (3.1). Since $(\mathcal{D}, \mathcal{F})$ is a good pair, setting $\mu = A(\bar{x})$, $\nu = A(\bar{y})$ and $\bar{\phi}(\omega) = \ln(p_\mu(\omega)/p_\nu(\omega))$, we get $\bar{\phi} \in \mathcal{F}$, which combines with (3.4) (applied with $\phi = \tfrac{1}{2}\alpha\bar{\phi}$) to imply that for all $\alpha > 0$,
$$2\Phi_*(\rho) \leq g^T\bar{x} - g^T\bar{y} + \alpha\left[\ln\left(\int_\Omega e^{-\bar{\phi}(\omega)/2}\,p_\mu(\omega)\,P(d\omega)\right) + \ln\left(\int_\Omega e^{\bar{\phi}(\omega)/2}\,p_\nu(\omega)\,P(d\omega)\right) + 2\rho\right] = g^T\bar{x} - g^T\bar{y} + 2\alpha\left[\ln\left(\int_\Omega \sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,P(d\omega)\right) + \rho\right].$$
The resulting inequality holds true for all $\alpha > 0$, meaning that

(a) $g^T\bar{x} - g^T\bar{y} \geq 2\Phi_*(\rho) = 2\Phi_*(\ln(1/\delta))$,   (3.5)

(b) $\int_\Omega \sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,P(d\omega) \geq e^{-\rho} = \delta$

[indeed, letting $\alpha \to 0$ gives (a), while letting $\alpha \to \infty$ shows that the bracketed term must be nonnegative, which is (b)].

Now assume, in contrast to what should be proved, that $\mathrm{Risk}_*(\delta^2/4) < \Phi_*(\ln(1/\delta))$. Then there exist $R' < \Phi_*(\ln(1/\delta))$, $\delta' < \delta^2/4$ and an estimate $\hat{g}(\omega)$ such that
$$\mathrm{Prob}_{\omega \sim p_{A(x)}(\cdot)}\{|\hat{g}(\omega) - g^T x| > R'\} \leq \delta' \qquad \forall x \in X.$$
Now, consider two hypotheses $\Pi_1, \Pi_2$ on the distribution of $\omega$, stating that the densities of the distribution with regard to $P$ are $p_\mu$ and $p_\nu$, respectively. Consider a procedure for distinguishing between the hypotheses as follows: after $\omega$ is observed, we compare $\hat{g}(\omega)$ with $\bar{g} = \frac{1}{2}[g^T\bar{x} + g^T\bar{y}]$; if $\hat{g}(\omega) \geq \bar{g}$, we accept $\Pi_1$, otherwise we accept $\Pi_2$. Note that by (3.5)(a) and due to $R' < \Phi_*(\ln(1/\delta))$, the probability to accept $\Pi_2$ when $\Pi_1$ is true is at most the probability for $\hat{g}(\omega)$ to deviate from $g^T\bar{x}$ by more than $R'$, that is, it is $\leq \delta'$. Similarly, the probability to accept $\Pi_1$ when $\Pi_2$ is true is $\leq \delta'$. Now, let $\Omega_1$ be the part of $\Omega$ where our hypotheses testing routine accepts $\Pi_1$, so that in $\Omega_2 = \Omega \setminus \Omega_1$ the routine accepts $\Pi_2$. As we have just seen,
$$\int_{\Omega_1} p_\nu(\omega)\,P(d\omega) \leq \delta', \qquad \int_{\Omega_2} p_\mu(\omega)\,P(d\omega) \leq \delta',$$
whence
$$\int_\Omega \sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,P(d\omega) = \sum_{i=1}^{2}\int_{\Omega_i}\sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,P(d\omega) \leq \sum_{i=1}^{2}\left(\int_{\Omega_i} p_\mu(\omega)\,P(d\omega)\right)^{1/2}\left(\int_{\Omega_i} p_\nu(\omega)\,P(d\omega)\right)^{1/2} \leq 2\sqrt{\delta'} < 2\sqrt{\delta^2/4} = \delta.$$
The resulting inequality $\int_\Omega \sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,P(d\omega) < \delta$ contradicts (3.5)(b); we have arrived at the desired contradiction, and (3.2) is proved.

To prove (3.3), let us set $\delta = 2\sqrt{\varepsilon}$, so that $\mathrm{Risk}_*(\varepsilon) = \mathrm{Risk}_*(\delta^2/4) \geq \Phi_*(\ln(1/\delta)) = \Phi_*(\frac{1}{2}\ln(1/(4\varepsilon)))$, where the concluding $\geq$ is due to (3.2). Now recall that $\Phi_*(r)$ is a nonnegative and concave function of $r \geq 0$, so that $\Phi_*(tr) \geq t\,\Phi_*(r)$ for all $r \geq 0$ and $0 \leq t \leq 1$. Applying this with $r = \ln(2/\varepsilon)$ and $t = \frac{\ln(1/(4\varepsilon))}{2\ln(2/\varepsilon)} \in [0, 1]$, we therefore have
$$\Phi_*\left(\frac{1}{2}\ln\left(\frac{1}{4\varepsilon}\right)\right) \geq \frac{\ln(1/(4\varepsilon))}{2\ln(2/\varepsilon)}\,\Phi_*(\ln(2/\varepsilon)),$$
and we arrive at (3.3). □

Lemmas 3.1 and 3.2 clearly imply Theorem 3.1. □
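The two-point argument in the proof of Lemma 3.2 rests on the standard interplay between the Hellinger affinity and testing errors: if some test distinguishes $p_\mu$ from $p_\nu$ with both error probabilities small, the affinity must be small. The snippet below is our own numerical illustration (not part of the paper) of the classical bounds $\frac{1}{2}\mathrm{Aff}^2 \leq 1 - \mathrm{TV}(p, q) \leq \mathrm{Aff}$, where $1 - \mathrm{TV}$ is exactly the minimal sum of the two error probabilities over all tests.

```python
import numpy as np

rng = np.random.default_rng(0)

def affinity(p, q):
    """Hellinger affinity of two discrete distributions."""
    return float(np.sum(np.sqrt(p * q)))

def min_sum_of_errors(p, q):
    """Minimal (type I + type II) error over all tests = 1 - TV(p, q)."""
    return float(1.0 - 0.5 * np.sum(np.abs(p - q)))

for _ in range(100):
    p = rng.random(20); p /= p.sum()
    q = rng.random(20); q /= q.sum()
    aff, err = affinity(p, q), min_sum_of_errors(p, q)
    # Le Cam-type bounds relating affinity and optimal testing error
    assert 0.5 * aff**2 <= err + 1e-12
    assert err <= aff + 1e-12
```

The lower bound is the quantitative content of the contradiction step above: errors below $\delta^2/4$ on both sides would force the affinity below $\delta$.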
Remark 3.1.
Lemmas 3.1 and 3.2 provide certain information even beyond the case when the pair $(\mathcal{D}, \mathcal{F})$ is good, specifically, that:

(i) the $\varepsilon$-risk of an affine estimate can be made arbitrarily close to the quantity
$$\Phi^+(\varepsilon) = \frac{1}{2}\inf_{\phi \in \mathcal{F},\,\alpha > 0}\,\sup_{x,y \in X}\Phi_{\ln(2/\varepsilon)}(x, y; \phi, \alpha)$$
(cf. Lemma 3.1);

(ii) we have
$$\mathrm{Risk}_*(\varepsilon) \geq \Phi^-(\varepsilon) = \frac{1}{2}\sup_{x,y \in X}\,\inf_{\phi \in \mathcal{F},\,\alpha > 0}\Phi_{\frac{1}{2}\ln(1/(4\varepsilon))}(x, y; \phi, \alpha)$$
(cf. Lemma 3.2).

As is seen from the proofs of Lemmas 3.1 and 3.2, both these statements hold true without the goodness assumption. The role of the latter is in ensuring that $\Phi^+(\varepsilon)$ is within an absolute constant factor of $\Phi^-(\varepsilon)$.

Lemma 3.2 implies the following result.

Proposition 3.1.
Under the premise of Theorem 3.1, the Hellinger affinity
$$\mathrm{Aff_H}(\mu, \nu) = \int_\Omega \sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,P(d\omega)$$
is a continuous and log-concave function on $M \times M$, and the quantity $\Phi_*(r)$, $r \geq 0$, admits the following representation:
$$2\Phi_*(r) = \max_{x,y}\{g^T x - g^T y : \mathrm{Aff_H}(A(x), A(y)) \geq e^{-r},\ x, y \in X\}. \tag{3.6}$$

We see that the upper bound $\Phi_*(\ln(2/\varepsilon))$ on $\mathrm{RiskA}(\varepsilon)$ stated in Theorem 3.1 admits a very transparent interpretation: this bound is half the maximum of the variation $\max_{x,y}[g^T x - g^T y]$ of the estimated functional over the set of pairs $x, y \in X$ with the associated distributions "close" to each other, namely, such that $\mathrm{Aff_H}(A(x), A(y)) \geq \varepsilon/2$. Observe that asymptotically (when $r$ becomes small), $\Phi_*(r)$ is equivalent to the modulus of continuity $\omega(r, X)$ of $g$ with regard to the Hellinger distance introduced in [7] (recall that we consider here the case of one observation).

Proof of Proposition 3.1.
By exactly the same argument as in the proof of Theorem 3.1, the function $\Psi(\mu, \nu; \phi) : (M \times M) \times \mathcal{F} \to \mathbb{R}$,
$$\Psi(\mu, \nu; \phi) = \ln\left(\int e^{-\phi(\omega)}\,p_\mu(\omega)\,P(d\omega)\right) + \ln\left(\int e^{\phi(\omega)}\,p_\nu(\omega)\,P(d\omega)\right),$$
is well defined and continuous on its domain, convex in $\phi$ and concave in $(\mu, \nu)$. We claim that
$$2\ln(\mathrm{Aff_H}(\mu, \nu)) = \min_{\phi \in \mathcal{F}}\Psi(\mu, \nu; \phi), \tag{3.7}$$
which would imply that $\ln(\mathrm{Aff_H}(\cdot))$ is indeed a finite concave function on $M \times M$ and as such is continuous (recall that $M$ is open). To justify our claim, note that, for fixed $\mu, \nu \in M$, setting $\bar{\phi} = \frac{1}{2}\ln(p_\mu/p_\nu)$, we get a function from $\mathcal{F}$ such that $\Psi(\mu, \nu; \bar{\phi}) = 2\ln(\mathrm{Aff_H}(\mu, \nu))$. To complete the verification of (3.7), it suffices to demonstrate that $\Psi(\mu, \nu; \phi) \geq \Psi(\mu, \nu; \bar{\phi})$ whenever $\phi \in \mathcal{F}$, which is immediate: setting $\phi = \bar{\phi} + \Delta$ and applying the Cauchy–Schwarz inequality, we have
$$\exp\{\Psi(\mu, \nu; \bar{\phi})/2\} = \int \sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,P(d\omega) = \int\left[(p_\mu(\omega)p_\nu(\omega))^{1/4}e^{-\Delta(\omega)/2}\right]\left[(p_\mu(\omega)p_\nu(\omega))^{1/4}e^{\Delta(\omega)/2}\right]P(d\omega) \leq \left[\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\,e^{-\Delta(\omega)}\,P(d\omega)\right]^{1/2}\left[\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\,e^{\Delta(\omega)}\,P(d\omega)\right]^{1/2} = \exp\{\Psi(\mu, \nu; \phi)/2\}.$$
Now, note that by (3.1),
$$2\Phi_*(r) = \sup_{x,y \in X}\left\{\inf_{\phi \in \mathcal{F},\,\alpha > 0}\left[g^T x - g^T y + \alpha\Psi(A(x), A(y); \alpha^{-1}\phi) + 2\alpha r\right]\right\} = \sup_{x,y \in X}\left\{g^T x - g^T y + \inf_{\alpha > 0}\alpha\left[\inf_{\psi \equiv \alpha^{-1}\phi \in \mathcal{F}}\Psi(A(x), A(y); \psi) + 2r\right]\right\} = \sup_{x,y \in X}\left\{g^T x - g^T y + \inf_{\alpha > 0}\alpha\left[2\ln(\mathrm{Aff_H}(A(x), A(y))) + 2r\right]\right\}$$
[see (3.7)]. The inner infimum equals $0$ when $\ln(\mathrm{Aff_H}(A(x), A(y))) + r \geq 0$ and $-\infty$ otherwise, whence
$$2\Phi_*(r) = \max_{x,y}\{g^T x - g^T y : \mathrm{Aff_H}(A(x), A(y)) \geq e^{-r},\ x, y \in X\}.$$
□

3.2. The case of multiple observations.
In Problem I, our goal was to estimate $g^T x$ from a single observation $\omega$ of the random variable $\omega \sim p_{A(x)}(\cdot)$ associated with $x$. The result can be immediately extended to the case when we want to recover $g^T x$ from a sample of independent observations $\omega_1, \ldots, \omega_L$ of random variables $\omega_\ell$ with distributions parameterized by $x$. Specifically, let $(\Omega_\ell, P_\ell)$ and $(\mathcal{D}_\ell, \mathcal{F}_\ell)$, $1 \leq \ell \leq L$, be as in Example 4, and let every pair $(\mathcal{D}_\ell, \mathcal{F}_\ell)$ be good. Assume, further, that $X \subset \mathbb{R}^n$ is a convex compact set and $A_\ell(x)$ are affine mappings with $A_\ell(X) \subset M_\ell$. Given a linear form $g^T z$ on $\mathbb{R}^n$ and a sequence of independent realizations $\omega_\ell \sim p^\ell_{A_\ell(x)}(\cdot)$, $\ell = 1, \ldots, L$, we want to recover from these observations the value $g^T x$ of the given linear form at the "signal" $x$ underlying our observations.

In our current situation, we call a candidate estimate $\hat{g}(\omega_1, \ldots, \omega_L)$ affine if it is of the form
$$\hat{g}(\omega_1, \ldots, \omega_L) = \sum_{\ell=1}^L \phi_\ell(\omega_\ell), \tag{3.8}$$
where $\phi_\ell \in \mathcal{F}_\ell$, $\ell = 1, \ldots, L$. Note that setting $(\mathcal{D}, \mathcal{F}) = \bigotimes_{\ell=1}^L(\mathcal{D}_\ell, \mathcal{F}_\ell)$, we reduce the situation to the one we have already considered. In particular, Theorem 3.1 along with the proof of Lemma 3.1 implies the following result (where the $\varepsilon$-risks—of an estimate, the minimax optimal and the affine-minimax optimal—are defined exactly as in the single-observation case).

Theorem 3.2.
In the situation just described, for $r > 0$, let
$$\Phi_r(x, y; \phi, \alpha) = g^T x - g^T y + \alpha\left[\sum_{\ell=1}^L \ln\left(\int_{\Omega_\ell} e^{-\alpha^{-1}\phi_\ell(\omega_\ell)}\,p^\ell_{A_\ell(x)}(\omega_\ell)\,P_\ell(d\omega_\ell)\right) + \sum_{\ell=1}^L \ln\left(\int_{\Omega_\ell} e^{\alpha^{-1}\phi_\ell(\omega_\ell)}\,p^\ell_{A_\ell(y)}(\omega_\ell)\,P_\ell(d\omega_\ell)\right)\right] + 2\alpha r:\ Z \times F_+ \to \mathbb{R},$$
$$Z = X \times X, \qquad F_+ = \mathcal{F}_1 \times \cdots \times \mathcal{F}_L \times \{\alpha > 0\}.$$
The function $\Phi_r$ is continuous on its domain, concave in the $(x, y)$-argument, convex in the $(\phi, \alpha)$-argument and possesses a well-defined saddle point value
$$2\Phi_*(r) = \sup_{x,y \in X}\,\underbrace{\inf_{(\phi,\alpha) \in F_+}\Phi_r(x, y; \phi, \alpha)}_{\underline{\Phi}_r(x,y)} = \inf_{(\phi,\alpha) \in F_+}\,\underbrace{\sup_{x,y \in X}\Phi_r(x, y; \phi, \alpha)}_{\overline{\Phi}_r(\phi,\alpha)},$$
which is a concave and nonnegative function of $r \geq 0$. Moreover:

(i) For all $\varepsilon \in (0, 1/4)$, we have
$$\mathrm{RiskA}(\varepsilon) \leq \Phi_*(\ln(2/\varepsilon)) \leq \theta(\varepsilon)\,\mathrm{Risk}_*(\varepsilon), \qquad \theta(\varepsilon) = \frac{2\ln(2/\varepsilon)}{\ln(1/(4\varepsilon))}.$$

(ii) Given $\varepsilon \in (0, 1/4)$ and $\delta > 0$, in order to build an affine estimate with $\varepsilon$-risk not exceeding $[\Phi_*(\ln(2/\varepsilon)) + \delta]$, it suffices to find $\alpha_* > 0$ and $\phi^*_\ell \in \mathcal{F}_\ell$, $1 \leq \ell \leq L$, such that
$$\overline{\Phi}_{\ln(2/\varepsilon)}(\phi^*, \alpha_*) \leq 2\Phi_*(\ln(2/\varepsilon)) + \delta/2,$$
to compute the quantity
$$c = \frac{1}{2}\max_{x \in X}\left[g^T x + \alpha_*\sum_{\ell=1}^L \ln\left(\int_{\Omega_\ell} e^{-\alpha_*^{-1}\phi^*_\ell(\omega_\ell)}\,p^\ell_{A_\ell(x)}(\omega_\ell)\,P_\ell(d\omega_\ell)\right)\right] - \frac{1}{2}\max_{y \in X}\left[-g^T y + \alpha_*\sum_{\ell=1}^L \ln\left(\int_{\Omega_\ell} e^{\alpha_*^{-1}\phi^*_\ell(\omega_\ell)}\,p^\ell_{A_\ell(y)}(\omega_\ell)\,P_\ell(d\omega_\ell)\right)\right]$$
and to set
$$\hat{g}(\omega_1, \ldots, \omega_L) = \sum_{\ell=1}^L \phi^*_\ell(\omega_\ell) + c. \tag{3.9}$$
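When every $\Omega_\ell$ is finite (as in Example 1), the integrals above are finite sums and the construction in (ii) becomes a fully explicit finite-dimensional convex program. Writing it out (our restatement of the theorem's objects, with $p^\ell_{A_\ell(x)}(\omega)$ the probability of outcome $\omega$ in observation $\ell$):

```latex
\overline{\Phi}_r(\phi,\alpha)
 =\max_{x,y\in X}\Big\{g^Tx-g^Ty
  +\alpha\sum_{\ell=1}^{L}\Big[\ln\!\sum_{\omega\in\Omega_\ell}
     e^{-\phi_\ell(\omega)/\alpha}\,p^\ell_{A_\ell(x)}(\omega)
  +\ln\!\sum_{\omega\in\Omega_\ell}
     e^{\phi_\ell(\omega)/\alpha}\,p^\ell_{A_\ell(y)}(\omega)\Big]\Big\}
  +2\alpha r .
```

The decision variables are the finitely many values $\{\phi_\ell(\omega)\}$ and $\alpha > 0$; minimizing this jointly convex function yields $(\phi^*, \alpha_*)$, after which $c$ is obtained from two more convex maximizations over $X$.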
Computing the “nearly optimal” affine estimate (3.9) re-duces to convex programming and thus can be carried out efficiently, pro-vided that we are given explicit descriptions of: • the linear spaces F ℓ , ℓ = 1 , . . . , L (as it is the case, e.g., in Examples 1–3), • and X (e.g., by a list of efficiently computable convex constraints whichcut X out of R n ) and are capable to compute efficiently the value of Φ r at a point. Remark 3.3.
Assume that the observations ω ℓ , ℓ ≤ ℓ ≤ ℓ , are copiesof the same random variable [i.e., Ω ℓ , P ℓ , D ℓ , F ℓ , A ℓ ( · ) are independent of ℓ for ℓ ≤ ℓ ≤ ℓ ]. Then, the convex function Φ r ( φ , . . . , φ L , α ) is symmetricwith regard to the arguments φ ℓ ∈ F ℓ , ℓ ≤ ℓ ≤ ℓ , and therefore, whenbuilding the estimate (3.9) we lose nothing when restricting ourselves to φ ’ssatisfying φ ℓ = φ ℓ , ℓ ≤ ℓ ≤ ℓ , which allows to reduce the computationaleffort of building α ∗ , φ ∗ ℓ .3.2.1. Illustration.
Consider the toy problem where we want to recover the probability $p$ of getting 1 from a Bernoulli distribution, given $L$ independent realizations $\omega_1, \ldots, \omega_L$ of the associated random variable. To handle the problem, we specialize our general setup as follows:

• $(\Omega_\ell, P_\ell)$, $1 \leq \ell \leq L$, are identical to the two-point set $\{0; 1\}$ with the counting measure;
• $M$ is the interval $(0, 1)$, and $p_\mu(1) = 1 - p_\mu(0) = \mu$, $\mu \in M$;
• $X$ is a compact convex subset in $M$, say, the segment $[10^{-16}, 1 - 10^{-16}]$, and $A(x) = x$.

Table 1
Recovering the parameter of a Bernoulli distribution

  ε       L      γ         δ         Upper risk bound   Lower risk bound   Ratio of bounds   ϑ(ε)
 0.05     10    2.91e–1   4.18e–2    3.61e–1            2.49e–1            1.45              4.58
 0.05    100    4.13e–2   9.17e–3    1.33e–1            8.19e–2            1.63              4.58
 0.05   1000    4.29e–3   9.91e–4    4.29e–2            2.60e–2            1.65              4.58
 0.01     10    3.58e–1   2.83e–2    4.04e–1            3.29e–1            1.23              3.29
 0.01    100    5.83e–2   8.84e–3    1.59e–1            1.15e–1            1.38              3.29
 0.01   1000    6.15e–3   9.88e–4    5.13e–2            3.67e–2            1.40              3.29
 0.001    10    4.19e–1   1.61e–2    4.42e–1            3.98e–1            1.11              2.75
 0.001   100    8.15e–2   8.37e–3    1.88e–1            1.51e–1            1.24              2.75
 0.001  1000    8.79e–3   9.82e–4    6.14e–2            4.88e–2            1.26              2.75
Invoking Remark 3.3, we lose nothing when restricting ourselves to affine estimates of the form (3.8) with mutually identical functions $\phi_\ell(\cdot)$, $1 \leq \ell \leq L$, that is, with the estimates
$$\hat{g}(\omega_1, \ldots, \omega_L) = \gamma + \delta\sum_{\ell=1}^L \omega_\ell.$$
Invoking Theorem 3.2, the coefficients $\gamma$ and $\delta$ are readily given by the $\phi$-component of the saddle point (max in $x, y \in X$, min in $\phi = [\phi_0; \phi_1] \in \mathbb{R}^2$ and $\alpha > 0$) of the convex–concave function
$$x - y + \alpha\left[L\ln\left(e^{-\phi_0/\alpha}(1 - x) + e^{-\phi_1/\alpha}x\right) + L\ln\left(e^{\phi_0/\alpha}(1 - y) + e^{\phi_1/\alpha}y\right) + 2\ln(2/\varepsilon)\right];$$
the (guaranteed upper bound on the) $\varepsilon$-risk of this estimate is half of the corresponding saddle point value. The saddle point (it is easily seen that it does exist) can be computed with high accuracy by standard convex programming techniques. In Table 1, we present the nearly optimal affine estimates along with the corresponding risks. In the table, the upper risk bound is the one guaranteed by Theorem 3.2, and the lower risk bound is the largest $d$ such that the hypotheses "$p = 0.5 + d$" and "$p = 0.5 - d$" cannot be distinguished from $L$ independent observations of a random variable $\sim \mathrm{Bernoulli}(p)$ with the sum of probabilities of errors $< \varepsilon$ [this easily computable quantity is a lower bound on the minimax optimal $\varepsilon$-risk $\mathrm{Risk}_*(\varepsilon)$], and
$$\vartheta(\varepsilon) = \frac{2\ln(2/\varepsilon)}{\ln(1/(4\varepsilon))}$$
is the theoretical upper bound on the "level of nonoptimality" of our estimate. As could be guessed in advance, for large $L$, the near-optimal affine estimate is close to the trivial estimate $\frac{1}{L}\sum_{\ell=1}^L \omega_\ell$.
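Proposition 3.1 makes the upper risk bounds in Table 1 easy to reproduce: for i.i.d. observations the Hellinger affinity of the product distribution is the product of the per-observation affinities, so the bound $\Phi_*(\ln(2/\varepsilon))$ equals half the largest gap $x - y$ over pairs with $\mathrm{Aff_H}(x, y)^L \geq \varepsilon/2$, where $\mathrm{Aff_H}(x, y) = \sqrt{xy} + \sqrt{(1-x)(1-y)}$ for the Bernoulli family. A brute-force sketch (our own check, not from the paper):

```python
import numpy as np

def upper_risk_bound(eps, L, grid=1501):
    """Half the max of x - y over pairs in X = [1e-16, 1 - 1e-16] whose
    per-observation Hellinger affinity, raised to the power L, is >= eps/2."""
    t = np.linspace(1e-16, 1 - 1e-16, grid)
    x, y = np.meshgrid(t, t, indexing="ij")
    aff = np.sqrt(x * y) + np.sqrt((1 - x) * (1 - y))   # Bernoulli affinity
    gap = np.where(aff ** L >= eps / 2, x - y, 0.0)
    return gap.max() / 2

print(upper_risk_bound(0.05, 10))    # ~ 0.361, cf. Table 1
print(upper_risk_bound(0.05, 100))   # ~ 0.133, cf. Table 1
```

One can check that the constrained maximum is attained at pairs symmetric around $1/2$, which gives the closed form $\frac{1}{2}\sqrt{1 - (\varepsilon/2)^{2/L}}$ and matches the grid search.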
4. Applications.
In this section, we present some applications of Theorems 3.1 and 3.2.
4.1. Positron emission tomography.
Positron emission tomography (PET) is a noninvasive diagnostic tool allowing us to visualize not only the anatomy of tissues in a body, but their functioning as well. In PET, a patient is administered a radioactive tracer chosen in such a way that it concentrates in the areas of interest (e.g., those of high metabolic activity in early diagnosis of cancer). The tracer disintegrates, emitting positrons which then annihilate with nearby electrons to produce pairs of photons flying at the speed of light in opposite directions; the orientation of the resulting line of response (LOR) is completely random. The patient is placed in a cylinder with the surface split into small detector cells. When two of the detectors are hit by photons "nearly simultaneously"—within an appropriately chosen short time window—the event indicates that somewhere on the line crossing the detectors a disintegration act took place. Such an event is registered, and the data collected by the PET device form a list of the numbers of events registered in every one of the bins (pairs of detectors) in the course of a given time $t$. The goal of a PET reconstruction algorithm is to recover the density of the tracer from these data.

The standard mathematical model of PET is as follows. After discretization of the field of view, there are $n$ voxels (small 3D cubes) assigned with nonnegative (and unknown) amounts $x_i$ of the tracer, $i = 1, \ldots, n$. The number of LORs emanating from a voxel $i$ is a realization of a Poisson random variable with parameter $x_i$, and these variables for different voxels are independent. Every LOR emanating from a voxel $i$ is subject to a "lottery," which decides in which bin (pair of detectors) it will be registered, or whether it will be registered at all—some LORs can intersect the surface of the cylinder only in one point or not intersect it at all and thus are missed.
The role of the lottery is played by the random orientation of the LOR in question, and the outcomes of different lotteries are independent. The probabilities $q_{i\ell}$ for a LOR emanating from voxel $i$ to be registered in bin $\ell$ are known (they are readily given by the geometry of the device). With this model, the data registered by PET form a realization of a random vector $(\omega_1,\dots,\omega_L)$ ($L$ is the total number of bins) with independent Poisson-distributed coordinates, the parameter of the Poisson distribution associated with $\omega_\ell$ being
$$A_\ell(x)=\sum_{i=1}^n q_{i\ell}x_i.$$
Assume that our a priori information on $x$ allows us to point out a convex compact set $X\subset\{x\in\mathbb{R}^n: x\ge 0\}$ such that $x\in X$. Assuming without loss of generality that $\sum_i q_{i\ell}>0$ for all $\ell$ (indeed, we can eliminate all bins $\ell$ which never register LORs) and invoking Example 2, we find ourselves in the situation of Section 3.2. It follows that in order to evaluate a given linear form $g^Tx$ of the unknown tracer density $x$, we can use the construction from Theorem 3.2 to build a near-optimal affine estimate of $g^Tx$. The recipe suggested to this end by Theorem 3.2 reads as follows: the estimate is of the form
$$\hat g(\omega)=\sum_{\ell=1}^L \gamma^*_\ell y_\ell + c_*,$$
where $y_\ell$ is the number of LORs registered in bin $\ell$ and $\gamma^*=[\gamma^*_1;\dots;\gamma^*_L]$, $c_*$ are given by an optimal solution $(\gamma^*,\alpha^*)$ to the convex optimization problem
$$\min_{\alpha>0,\,\gamma}\Phi_r(\gamma,\alpha),$$
$$\Phi_r(\gamma,\alpha)=\frac12\max_{x,y\in X}\Big\{g^Tx-g^Ty+\alpha\Big[\sum_{\ell=1}^L \big[q_\ell(x)\exp\{-\alpha^{-1}\gamma_\ell\}+q_\ell(y)\exp\{\alpha^{-1}\gamma_\ell\}\big]-q(x)-q(y)+2r\Big]\Big\},$$
$$r=\ln(2/\varepsilon),\qquad q_\ell(z)=\sum_{i=1}^n q_{i\ell}z_i,\qquad q(z)=\sum_{\ell=1}^L q_\ell(z).$$
It is easily seen that the problem is solvable, with
$$c_*=\frac12\Big[\max_{x\in X}\Big(g^Tx+\alpha^*\Big[-q(x)+\sum_{\ell=1}^L q_\ell(x)\exp\{-(\alpha^*)^{-1}\gamma^*_\ell\}\Big]\Big)-\max_{y\in X}\Big(-g^Ty+\alpha^*\Big[-q(y)+\sum_{\ell=1}^L q_\ell(y)\exp\{(\alpha^*)^{-1}\gamma^*_\ell\}\Big]\Big)\Big].$$

Gaussian observations.
Now, consider the standard problem of recovering a linear form $g^Tx$ of a signal $x$ known to belong to a given convex compact set $X\subset\mathbb{R}^n$ via indirect observations of the signal corrupted by Gaussian noise. Without loss of generality, let the model of observations be
$$\omega = Ax+\xi,\qquad \xi\sim\mathcal N(0,I_L).\tag{4.1}$$
The associated pair $(\mathcal D,\mathcal F)$ is comprised of the shifts of the standard Gaussian distribution ($\mathcal D$) and all affine forms on $\mathbb R^L$ ($\mathcal F$), and is good (see Example 3). The affine estimates in the case in question are just the affine functions of $\omega$. The near-optimality of affine estimates in this case was established by Donoho [5], not only for the $\varepsilon$-risk, but for all risks based on the standard loss functions. We have the following direct corollary of Theorem 3.2 (cf. Theorem 2 and Corollary 1 of [5]):
Proposition 4.1.
In the situation in question, the affine estimate $\hat g_\varepsilon(\cdot)$ yielded by Theorem 3.2 is asymptotically ($\varepsilon\to+0$) optimal; specifically,
$$\varepsilon\in(0,1/2)\ \Rightarrow\ \mathrm{Risk}(\hat g_\varepsilon;\varepsilon)\le\psi(\varepsilon)\,\mathrm{Risk}_*(\varepsilon),\qquad \psi(\varepsilon)=\frac{\sqrt{2\ln(2/\varepsilon)}}{\mathrm{ErfInv}(\varepsilon)}=1+o(1)\ \text{as }\varepsilon\to+0$$
[here $x=\mathrm{ErfInv}(y)$ stands for the inverse error function, i.e., $y=\frac{1}{\sqrt{2\pi}}\int_x^\infty e^{-t^2/2}\,dt$].

Proof.
Let $G(\cdot)$ be the density of the $\mathcal N(0,I_L)$ distribution. By Theorem 3.2, we have $\mathrm{Risk}(\hat g_\varepsilon;\varepsilon)\le\Phi_*(\ln(2/\varepsilon))$, where, for $r>0$,
$$\Phi_*(r)=\max_{x,y\in X}\Phi_r(x,y),$$
$$\begin{aligned}
\Phi_r(x,y)&=\inf_{\phi\in\mathbb R^L,\,\alpha>0}\frac12\Big\{g^Tx-g^Ty+\alpha\Big[\ln\Big(\int \exp\{-\alpha^{-1}\phi^T\omega\}\,G(\omega-Ax)\,d\omega\Big)\\
&\qquad\qquad +\ln\Big(\int \exp\{\alpha^{-1}\phi^T\omega\}\,G(\omega-Ay)\,d\omega\Big)+2r\Big]\Big\}\\
&=\inf_{\phi\in\mathbb R^L,\,\alpha>0}\frac12\big\{g^Tx-g^Ty+\phi^TA(y-x)+\alpha^{-1}\phi^T\phi+2\alpha r\big\}\\
&=\inf_{\phi}\frac12\big\{g^Tx-g^Ty+\phi^TA(y-x)+2\sqrt{2r}\,\|\phi\|\big\}\\
&=\begin{cases}\frac12[g^Tx-g^Ty],&\|A(x-y)\|\le 2\sqrt{2r},\\[2pt]-\infty,&\|A(x-y)\|>2\sqrt{2r}.\end{cases}
\end{aligned}$$
Thus,
$$\mathrm{Risk}(\hat g_\varepsilon;\varepsilon)\le\Phi_*(\ln(2/\varepsilon))=\tfrac12[g^T\bar x-g^T\bar y]\tag{4.2}$$
for certain $\bar x,\bar y\in X$ with $\|A(\bar x-\bar y)\|\le 2\sqrt{2\ln(2/\varepsilon)}$. It remains to prove that
$$\mathrm{Risk}_*(\varepsilon)\ge\psi^{-1}(\varepsilon)\,\Phi_*(\ln(2/\varepsilon)).\tag{4.3}$$
To this end, assume, contrary to what should be proved, that
$$\mathrm{Risk}_*(\varepsilon)<\psi^{-1}(\varepsilon)\,\Phi_*(\ln(2/\varepsilon))\ \big(=\tfrac12\psi^{-1}(\varepsilon)[g^T\bar x-g^T\bar y]\big),$$
and let us lead this assumption to a contradiction. Under our assumption, there exist $\rho<\tfrac12\psi^{-1}(\varepsilon)[g^T\bar x-g^T\bar y]$, $\varepsilon'<\varepsilon$ and an estimate $\widetilde g$ such that
$$\forall(x\in X):\quad \mathrm{Prob}\{|\widetilde g(Ax+\xi)-g^Tx|\ge\rho\}\le\varepsilon'.\tag{4.4}$$
1, we see that 2 ρ < [ g T ¯ x − g T ¯ y ]. Let ˆ x = ¯ x and ˆ y bea convex combination of ¯ x and ¯ y such that 2 ρ = [ g T ˆ x − g T ˆ y ]. Note that k A (ˆ x − ˆ y ) k = (cid:20) ρ [ g T ¯ x − g T ¯ y ] (cid:21)| {z } <ψ − ( ε ) k A (¯ x − ¯ y ) k ≤ ψ − ( ε )2 q /ε ) = 2 erfinv( ε ) . Now, let Π be the hypothesis that the distribution of an observation (4.1)comes from x = ˆ x , and let Π be the hypothesis that this distribution comesfrom x = ˆ y . From (4.4) by the same standard argument as in the proof ofLemma 3.2, it follows that there exists a routine, based on a single observa-tion (4.1), for distinguishing between Π and Π , which rejects Π i when thishypothesis is true with probability ≤ ε ′ , i = 1 ,
2. But, it is well known thatthe hypotheses on shifts of the standard Gaussian distribution indeed can bedistinguished with the outlined reliability. This is possible if and only if theEuclidean distance between the corresponding shifts is at least 2 erfinv( ε ′ ).This condition is not satisfied for our Π i , i = 1 ,
2, which correspond to shifts A ˆ x and A ˆ y , since k A ˆ x − A ˆ y k ≤ ε ) < ε ). We have arrived ata desired contradiction. (cid:3) In fact, the reasoning can be slightly simplified and strengthened to yieldthe following result.
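Before stating it, it is instructive to see numerically how slowly the factor $\psi(\varepsilon)=\sqrt{2\ln(2/\varepsilon)}/\mathrm{ErfInv}(\varepsilon)$ of Proposition 4.1 approaches 1. The sketch below uses only the standard library, computing $\mathrm{ErfInv}$ by bisection on the standard Gaussian tail probability:

```python
import math

def gauss_tail(x):
    """Prob{N(0,1) > x}, via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def erf_inv_tail(eps, lo=0.0, hi=40.0, iters=200):
    """ErfInv(eps): the x with Prob{N(0,1) > x} = eps, found by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gauss_tail(mid) > eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def psi(eps):
    return math.sqrt(2.0 * math.log(2.0 / eps)) / erf_inv_tail(eps)

print([round(psi(e), 4) for e in (1e-2, 1e-4, 1e-8, 1e-16)])
```

The values decrease monotonically toward 1 but remain noticeably above it even for very small $\varepsilon$, which is one motivation for the sharper bound below.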
Proposition 4.2.
In the situation of Proposition 4.1, one can build efficiently an affine estimate $\hat g_\varepsilon$ such that
$$0<\varepsilon<1/2\ \Rightarrow\ \mathrm{Risk}(\hat g_\varepsilon;\varepsilon)\le\frac{\mathrm{ErfInv}(\varepsilon/2)}{\mathrm{ErfInv}(\varepsilon)}\,\mathrm{Risk}_*(\varepsilon)$$
[cf. Proposition 4.1, and note that $\mathrm{ErfInv}(\varepsilon/2)/\mathrm{ErfInv}(\varepsilon)<\sqrt{2\ln(2/\varepsilon)}/\mathrm{ErfInv}(\varepsilon)=\psi(\varepsilon)$].

Proof.
Let
$$\Psi(x,y;\phi)=g^Tx-g^Ty+\phi^TA(y-x)+2\,\mathrm{ErfInv}(\varepsilon/2)\,\|\phi\|:\ (X\times X)\times\mathbb R^L\to\mathbb R.$$
$\Psi$ clearly is continuous, convex in $\phi$ and concave in $(x,y)$ on its domain; by the same argument as in the proof of Theorem 3.1, $\Psi$ has a well-defined saddle point value
$$2\Psi_*(\varepsilon)=\inf_\phi\overbrace{\max_{x,y\in X}\Psi(x,y;\phi)}^{\overline\Psi(\phi)}=\max_{x,y\in X}\overbrace{\inf_\phi\Psi(x,y;\phi)}^{\underline\Psi(x,y)}.$$
The function
$$\overline\Psi(\phi)=\max_{x,y\in X}[g^Tx-g^Ty+\phi^T(Ay-Ax)]+2\,\mathrm{ErfInv}(\varepsilon/2)\,\|\phi\|\ \ge\ 2\,\mathrm{ErfInv}(\varepsilon/2)\,\|\phi\|$$
is a finite convex function on $\mathbb R^L$ which goes to $\infty$ as $\|\phi\|\to\infty$, and therefore it attains its minimum at a point $\phi_*$, so that
$$2\Psi_*(\varepsilon)=\overline\Psi(\phi_*).$$
Setting
$$c_*=\frac12\Big[\max_{x\in X}[g^Tx-\phi_*^TAx]-\max_{y\in X}[-g^Ty+\phi_*^TAy]\Big],$$
we have, similarly to the proof of Lemma 3.1, the following:
$$\text{(a)}\quad \max_{x\in X}[g^Tx-\phi_*^TAx-c_*]+\mathrm{ErfInv}(\varepsilon/2)\,\|\phi_*\|=\Psi_*(\varepsilon),$$
$$\text{(b)}\quad \max_{y\in X}[-g^Ty+\phi_*^TAy+c_*]+\mathrm{ErfInv}(\varepsilon/2)\,\|\phi_*\|=\Psi_*(\varepsilon).$$
Now, consider the affine estimate
$$\hat g_\varepsilon(\omega)=\phi_*^T\omega+c_*.$$
From (a) it follows that
$$\forall d>\Psi_*(\varepsilon):\quad \sup_{x\in X}\mathrm{Prob}\{g^Tx-\hat g_\varepsilon(Ax+\xi)>d\}\le\varepsilon'<\varepsilon/2,$$
while (b) implies that
$$\forall d>\Psi_*(\varepsilon):\quad \sup_{y\in X}\mathrm{Prob}\{\hat g_\varepsilon(Ay+\xi)-g^Ty>d\}\le\varepsilon'<\varepsilon/2.$$
We conclude that $\mathrm{Risk}(\hat g_\varepsilon;\varepsilon)\le\Psi_*(\varepsilon)$. To complete the proof, it suffices to demonstrate that
$$\mathrm{Risk}_*(\varepsilon)\ge\frac{\mathrm{ErfInv}(\varepsilon)}{\mathrm{ErfInv}(\varepsilon/2)}\,\Psi_*(\varepsilon).\tag{4.5}$$
To this end, observe that
$$\underline\Psi(x,y)=[g^Tx-g^Ty]+\inf_\phi\{\phi^TA(y-x)+2\,\mathrm{ErfInv}(\varepsilon/2)\,\|\phi\|\}=\begin{cases}g^Tx-g^Ty,&\|A(y-x)\|\le 2\,\mathrm{ErfInv}(\varepsilon/2),\\-\infty,&\text{otherwise},\end{cases}$$
whence
$$\Psi_*(\varepsilon)=\tfrac12[g^T\bar x-g^T\bar y]$$
for certain $\bar x,\bar y\in X$ such that $\|A(\bar x-\bar y)\|\le 2\,\mathrm{ErfInv}(\varepsilon/2)$. Relation (4.5) can be derived from this observation by exactly the same argument as used in the proof of Proposition 4.1 to derive (4.3) from (4.2). $\square$
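To make the construction concrete, suppose $X$ is a box $[0,u]$; then the inner maximum in $\overline\Psi(\phi)$ has the closed form $\sum_i u_i|g_i-(A^T\phi)_i|+2\,\mathrm{ErfInv}(\varepsilon/2)\,\|\phi\|$, and $\phi_*$ can be approached by a plain subgradient method. The sketch below uses made-up toy data ($A$, $g$, $u$) and the rounded constant $2\,\mathrm{ErfInv}(0.025)\approx 3.92$ for $\varepsilon=0.05$; the crude solver merely stands in for a genuine convex-programming routine:

```python
import numpy as np

rng = np.random.default_rng(1)
L, n = 6, 4                          # toy sizes (made-up)
A = rng.normal(size=(L, n))          # observation map in omega = A x + xi
g = np.array([1.0, 0.0, -1.0, 2.0])  # linear form to recover
u = np.ones(n)                       # X = box [0, u]
q = 3.92                             # ~ 2*ErfInv(eps/2) for eps = 0.05

def psi_bar(phi):
    """Closed form of max_{x,y in X}[g^T(x-y) + phi^T A(y-x)] + q*||phi||."""
    return float(u @ np.abs(g - A.T @ phi) + q * np.linalg.norm(phi))

phi = np.zeros(L)
best_phi, best_val = phi.copy(), psi_bar(phi)
for t in range(1, 3001):             # subgradient descent, 1/sqrt(t) steps
    v = g - A.T @ phi
    nrm = np.linalg.norm(phi)
    sub = -A @ (u * np.sign(v)) + (q * phi / nrm if nrm > 0 else np.zeros(L))
    phi = phi - (0.5 / np.sqrt(t)) * sub
    if psi_bar(phi) < best_val:
        best_val, best_phi = psi_bar(phi), phi.copy()

# the shift c_* balancing over- and under-estimation over the box
v = g - A.T @ best_phi
c = 0.5 * (u @ np.maximum(v, 0.0) - u @ np.maximum(-v, 0.0))
risk_bound = 0.5 * best_val          # Psi_*(eps) for this toy instance
print(risk_bound)
```

The resulting affine estimate is $\hat g_\varepsilon(\omega)=\phi_*^T\omega+c_*$, with `risk_bound` playing the role of the certified value $\Psi_*(\varepsilon)$.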
5. Adaptive version of the estimate.
In the situation of Problem I, let $X_1\subset X_2\subset\cdots\subset X_K$ be a nested collection of nonempty convex compact sets in $\mathbb R^n$ such that $A(X_K)\subset\mathcal M$. Consider a modification of the problem where the signal $x$ underlying our observation is known to belong to one of the $X_k$, with the value of $k\le K$ unknown in advance. Given a linear form $g^Tz$ on $\mathbb R^n$, let $\mathrm{Risk}^k(\hat g;\varepsilon)$ and $\mathrm{Risk}^k_*(\varepsilon)$ be, respectively, the $\varepsilon$-risk of an estimate $\hat g$ on $X_k$ and the minimax optimal $\varepsilon$-risk of recovering $g^Tx$ on $X_k$. Let also $\Phi^k_*(r)$ be the function associated with $X=X_k$ according to (3.1). As is immediately seen, the functions $\Phi^k_*(r)$ grow with $k$. Our goal is to modify the estimate $\hat g$ yielded by Theorem 3.1 in such a way that the $\varepsilon$-risk of the modified estimate on $X_k$ will be "nearly" $\mathrm{Risk}^k_*(\varepsilon)$ for every $k\le K$. This goal can be achieved by a straightforward application of the well-known Lepskii adaptation scheme [19, 20], as follows.

Given $\delta>0$, let $\delta'\in(0,\delta)$, and let $\hat g_k(\cdot)$ be the affine estimate with the $(\varepsilon/K)$-risk on $X_k$ not exceeding $\Phi^k_*(\ln(2K/\varepsilon))+\delta'$, provided by Theorem 3.1 as applied with $\varepsilon/K$ substituted for $\varepsilon$ and $X_k$ substituted for $X$. Then, for any $k\le K$,
$$\sup_{x\in X_k}\mathrm{Prob}_{\omega\sim p_{A(x)}(\cdot)}\{|\hat g_k(\omega)-g^Tx|>\Phi^k_*(\ln(2K/\varepsilon))+\delta\}\le\varepsilon'/K<\varepsilon/K.\tag{5.1}$$
Given an observation $\omega$, let us say that an index $k\le K$ is $\omega$-good if, for any $k'$ with $k\le k'\le K$,
$$|\hat g_{k'}(\omega)-\hat g_k(\omega)|\le\Phi^k_*(\ln(2K/\varepsilon))+\Phi^{k'}_*(\ln(2K/\varepsilon))+2\delta.$$
Note that $\omega$-good indexes do exist (e.g., $k=K$). Given $\omega$, we can find the smallest $\omega$-good index $k=k(\omega)$; our estimate is nothing but $\hat g(\omega)=\hat g_{k(\omega)}(\omega)$.

Proposition 5.1.
Assume that $\varepsilon\in(0,1/4)$, and let
$$\vartheta=\frac{3\ln(2K/\varepsilon)}{\ln(2/\varepsilon)}.$$
Then, for any $k$, $1\le k\le K$,
$$\sup_{x\in X_k}\mathrm{Prob}_{\omega\sim p_{A(x)}(\cdot)}\{|\hat g(\omega)-g^Tx|>\vartheta\,\Phi^k_*(\ln(2/\varepsilon))+3\delta\}<\varepsilon,\tag{5.2}$$
whence also
$$\forall(k,\ 1\le k\le K):\quad \mathrm{Risk}^k(\hat g;\varepsilon)\le\frac{6\ln(2K/\varepsilon)}{\ln(1/(4\varepsilon))}\,\mathrm{Risk}^k_*(\varepsilon)+3\delta.\tag{5.3}$$
Proof.
Setting $r=\ln(2K/\varepsilon)$, let us fix $\bar k\le K$ and $x\in X_{\bar k}$, and call a realization $\omega$ $x$-good if
$$\forall(k,\ \bar k\le k\le K):\quad |\hat g_k(\omega)-g^Tx|\le\Phi^k_*(r)+\delta.\tag{5.4}$$
Since $X_k\supset X_{\bar k}$ when $k\ge\bar k$, (5.1) implies that
$$\mathrm{Prob}_{\omega\sim p_{A(x)}(\cdot)}\{\omega\text{ is }x\text{-good}\}\ge 1-\varepsilon'.$$
Now, when $x$ is the signal and $\omega$ is $x$-good, relations (5.4) imply that $\bar k$ is an $\omega$-good index, so that $k(\omega)\le\bar k$. Since $k(\omega)$ is an $\omega$-good index, we have
$$|\hat g(\omega)-\hat g_{\bar k}(\omega)|=|\hat g_{k(\omega)}(\omega)-\hat g_{\bar k}(\omega)|\le\Phi^{k(\omega)}_*(r)+\Phi^{\bar k}_*(r)+2\delta,$$
which combines with (5.4) to imply that
$$|\hat g(\omega)-g^Tx|\le 2\Phi^{\bar k}_*(r)+\Phi^{k(\omega)}_*(r)+3\delta\le 3\Phi^{\bar k}_*(r)+3\delta,\tag{5.5}$$
where the concluding inequality is due to $k(\omega)\le\bar k$ and to the fact that $\Phi^k_*$ grows with $k$. The bound (5.5) holds true whenever $\omega$ is $x$-good which, as we have seen, happens with probability $\ge 1-\varepsilon'$. Since $\varepsilon'<\varepsilon$ and $x\in X_{\bar k}$ is arbitrary, we conclude that
$$\mathrm{Risk}^{\bar k}(\hat g;\varepsilon)\le 3\Phi^{\bar k}_*(r)+3\delta.\tag{5.6}$$
Using the nonnegativity and concavity of $\Phi^{\bar k}_*(\cdot)$ on the nonnegative ray and recalling the definition of $r$, we obtain
$$\Phi^{\bar k}_*(r)\le\frac{\ln(2K/\varepsilon)}{\ln(2/\varepsilon)}\,\Phi^{\bar k}_*(\ln(2/\varepsilon))$$
whenever $\varepsilon\le 1/2$ and $K\ge 1$. Recalling the definition of $\vartheta$, the right-hand side in (5.6) therefore does not exceed $\vartheta\,\Phi^{\bar k}_*(\ln(2/\varepsilon))+3\delta$. Since $\bar k\le K$ is arbitrary, we have proved (5.2). This bound, due to Lemma 3.2, implies (5.3). $\square$

Acknowledgments.
The authors would like to acknowledge the valuable suggestions made by L. Birgé, Université Paris 6, and Alexander Goldenshluger, Haifa University.

REFERENCES

[1] Ben-Tal, A. and Nemirovski, A. (2001). Lectures on Modern Convex Optimization: Analysis, Algorithms and Engineering Applications. SIAM, Philadelphia. MR1857264
[2] Birgé, L. (1984). Sur un théorème de minimax et son application aux tests. Probab. Math. Statist.
[3] Birgé, L. (1983). Approximation dans les espaces métriques et théorie de l'estimation. Z. Wahrsch. Verw. Gebiete.
[4] Cai, T. and Low, M. (2003). A note on nonparametric estimation of linear functionals. Ann. Statist.
[5] Donoho, D. (1995). Statistical estimation and optimal recovery. Ann. Statist.
[6] Donoho, D. and Liu, R. (1987). Geometrizing Rates of Convergence. I. Technical Report 137a, Dept. Statistics, Univ. California, Berkeley.
[7] Donoho, D. and Liu, R. (1991). Geometrizing rates of convergence. II. Ann. Statist.
[8] Donoho, D., Liu, R. and MacGibbon, B. (1990). Minimax risk over hyperrectangles, and implications. Ann. Statist.
[9] Donoho, D. and Low, M. (1992). Renormalization exponents and optimal pointwise rates of convergence. Ann. Statist.
[10] Eubank, R. (1988). Spline Smoothing and Nonparametric Regression. Dekker, New York. MR0934016
[11] Goldenshluger, A. and Nemirovski, A. (1997). On spatially adaptive estimation of nonparametric regression. Math. Methods Statist.
[12] Härdle, W. (1990). Applied Nonparametric Regression. ES Monograph Series. Cambridge Univ. Press, Cambridge, UK. MR1161622
[13] Härdle, W., Kerkyacharian, G., Picard, D. and Tsybakov, A. B. (1998). Wavelets, Approximation and Statistical Applications. Lecture Notes in Statistics. Springer, New York. MR1618204
[14] Hiriart-Urruty, J. B. and Lemaréchal, C. (1993). Convex Analysis and Minimization Algorithms I: Fundamentals. Springer, Berlin.
[15] Ibragimov, I. A. and Khasminski, R. Z. (1981). Statistical Estimation: Asymptotic Theory. Springer, New York. MR0620321
[16] Ibragimov, I. A. and Khas'minskij, R. Z. (1984). On the nonparametric estimation of a value of a linear functional in Gaussian white noise. Teor. Veroyatnost. i Primenen.
[17] Klemelä, J. and Tsybakov, A. B. (2001). Sharp adaptive estimation of linear functionals. Ann. Statist.
[18] Korostelev, A. and Tsybakov, A. (1993). Minimax Theory of Image Reconstruction. Lecture Notes in Statistics. Springer, New York. MR1226450
[19] Lepskii, O. (1990). On a problem of adaptive estimation in Gaussian white noise. Teor. Veroyatnost. i Primenen.
[20] Lepskii, O. (1991). Asymptotically minimax adaptive estimation I. Upper bounds. Optimally adaptive estimates. Teor. Veroyatnost. i Primenen.
[21] Nemirovski, A. (2000). Topics in Nonparametric Statistics. In Ecole d'Été de Probabilités de Saint-Flour XXVIII (M. Emery, A. Nemirovski, D. Voiculescu and P. Bernard, eds.). Lecture Notes in Mathematics. Springer, Berlin.
[22] Pinsker, M. (1980). Optimal filtration of square-integrable signals in Gaussian noise. Problemy Peredachi Informatsii.
[23] Prakasa Rao, B. L. S. (1983). Nonparametric Functional Estimation. Academic Press, Orlando. MR0740865
[24] Rosenblatt, M. (1991). Stochastic Curve Estimation. Institute of Mathematical Statistics, Hayward, CA.
[25] Simonoff, J. S. (1996). Smoothing Methods in Statistics. Springer, New York. MR1391963
[26] Takezawa, K. (2005). Introduction to Nonparametric Regression. Wiley, Hoboken, NJ. MR2181216
[27] Tsybakov, A. B. (2004). Introduction à l'Estimation Nonparamétrique. Springer, Berlin. MR2013911
[28] Wasserman, L. (2006). All of Nonparametric Statistics. Springer, New York. MR2172729
Laboratoire Jean Kuntzmann
Université Grenoble I
51 rue des Mathématiques
BP 53
38041 Grenoble Cedex 9
France
E-mail: [email protected]