Sharp Oracle Inequalities for Aggregation of Affine Estimators
The Annals of Statistics
© Institute of Mathematical Statistics, 2012
SHARP ORACLE INEQUALITIES FOR AGGREGATION OF AFFINE ESTIMATORS

By Arnak S. Dalalyan and Joseph Salmon
ENSAE-Crest, Université Paris Est and Université Paris Diderot
We consider the problem of combining a (possibly uncountably infinite) set of affine estimators in nonparametric regression model with heteroscedastic Gaussian noise. Focusing on the exponentially weighted aggregate, we prove a PAC-Bayesian type inequality that leads to sharp oracle inequalities in discrete but also in continuous settings. The framework is general enough to cover the combinations of various procedures such as least square regression, kernel ridge regression, shrinking estimators and many other estimators used in the literature on statistical inverse problems. As a consequence, we show that the proposed aggregate provides an adaptive estimator in the exact minimax sense without discretizing the range of tuning parameters or splitting the set of observations. We also illustrate numerically the good performance achieved by the exponentially weighted aggregate.
1. Introduction.
There is growing empirical evidence of superiority of aggregated statistical procedures, also referred to as blending, stacked generalization or ensemble methods, with respect to "pure" ones. Since their introduction in the 1990s, famous aggregation procedures such as Boosting [30], Bagging [7] or Random Forest [2] have been successfully used in practice for a large variety of applications. Moreover, most recent Machine Learning competitions such as the Pascal VOC or Netflix challenge have been won by procedures combining different types of classifiers/predictors/estimators. It is therefore of central interest to understand from a theoretical point of view what kind of aggregation strategies should be used for getting the best possible combination of the available statistical procedures.
Received April 2011; revised June 2012. Supported in part by ANR Parcimonie.
AMS 2000 subject classifications.
Primary 62G08; secondary 62C20, 62G05, 62G20.
Key words and phrases.
Aggregation, regression, oracle inequalities, model selection, minimax risk, exponentially weighted aggregation.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2012, Vol. 40, No. 4, 2327-2355. This reprint differs from the original in pagination and typographic detail.
1.1. Historical remarks and motivation.
In the statistical literature, to the best of our knowledge, theoretical foundations of aggregation procedures were first studied by Nemirovski (Nemirovski [48], Juditsky and Nemirovski [37]) and independently by a series of papers by Catoni (see [11] for an account) and Yang [63-65]. For the regression model, a significant progress was achieved by Tsybakov [60] with introducing the notion of optimal rates of aggregation and proposing aggregation-rate-optimal procedures for the tasks of linear, convex and model selection aggregation. This point was further developed in [9, 46, 53], especially in the context of high dimension with sparsity constraints, and in [51] for Kullback-Leibler aggregation. However, it should be noted that the procedures proposed in [60] that provably achieve the lower bounds in convex and linear aggregation require full knowledge of the design distribution. This limitation was overcome in the recent work [62].

From a practical point of view, an important limitation of the previously cited results on aggregation is that they are valid under the assumption that the aggregated procedures are deterministic (or random, but independent of the data used for aggregation). The generality of those results (almost no restriction on the constituent estimators) compensates for this practical limitation.

In the Gaussian sequence model, a breakthrough was reached by Leung and Barron [45]. Building on very elegant but not very well-known results by George [32], they established sharp oracle inequalities for the exponentially weighted aggregate (EWA) for constituent estimators obtained from the data vector by orthogonally projecting it on some linear subspaces. (Corollary 2 in [32] coincides with Theorem 1 in [45] in the case of exponential weights with temperature β = 2σ²; cf. equation (2.2) below for a precise definition of exponential weights. Furthermore, to the best of our knowledge, [32] is the first reference using the Stein lemma for evaluating the expected risk of the exponentially weighted aggregate.) Dalalyan and Tsybakov [21, 22] showed that the result of [45] remains valid under more general (non-Gaussian) noise distributions and when the constituent estimators are independent of the data used for the aggregation. A natural question arises whether a similar result can be proved for a larger family of constituent estimators containing projection estimators and deterministic ones as specific examples. The main aim of the present paper is to answer this question by considering families of affine estimators.

Our interest in affine estimators is motivated by several reasons. First, affine estimators encompass many popular estimators such as smoothing splines, the Pinsker estimator [28, 49], local polynomial estimators, nonlocal means [8, 56], etc. For instance, it is known that if the underlying (unobserved) signal belongs to a Sobolev ball, then the (linear) Pinsker estimator is asymptotically minimax up to the optimal constant, while the best projection estimator is only rate-minimax. A second motivation is that, as proved by Juditsky and Nemirovski [38], the set of signals that are well estimated by linear estimators is very rich. It contains, for instance, sampled smooth functions, sampled modulated smooth functions and sampled harmonic functions. One can add to this set the family of piecewise constant functions as well, as demonstrated in [50], with natural application in magnetic resonance imaging.
It is worth noting that oracle inequalities for the penalized empirical risk minimizer were also proved by Golubev [36], and for model selection by Arlot and Bach [3] and by Baraud, Giraud and Huet [5].

In the present work, we establish sharp oracle inequalities in the model of heteroscedastic regression, under various conditions on the constituent estimators, which are assumed to be affine functions of the data. Our results provide theoretical guarantees of optimality, in terms of expected loss, for the exponentially weighted aggregate. They have the advantage of covering in a unified fashion the particular cases of frozen estimators considered in [22] and of projection estimators treated in [45].

We focus on the theoretical guarantees expressed in terms of oracle inequalities for the expected squared loss. Interestingly, although several recent papers [3, 5, 35] discuss the paradigm of competing against the best linear procedure from a given family, none of them provides oracle inequalities with leading constant equal to one. Furthermore, most existing results involve some constants depending on different parameters of the setup. In contrast, the oracle inequality that we prove herein has leading constant one and admits a simple formulation. It is established for (suitably symmetrized, if necessary) exponentially weighted aggregates [11, 21, 32] with an arbitrary prior and a temperature parameter which is not too small. The result is nonasymptotic but leads to an asymptotically optimal residual term when the sample size, as well as the cardinality of the family of constituent estimators, tends to infinity. In its general form, the residual term is similar to those obtained in the PAC-Bayes setting [42, 47, 57] in that it is proportional to the Kullback-Leibler divergence between two probability distributions.

The problem of competing against the best procedure in a given family was extensively studied in the context of online learning and prediction with expert advice [16, 39]. A connection between the results on online learning and statistical oracle inequalities was established by Gerchinovitz [33].

1.2. Notation and examples of linear estimators.
Throughout this work, we focus on the heteroscedastic regression model with Gaussian additive noise. We assume we are given a vector Y = (y_1, ..., y_n)^⊤ ∈ R^n obeying the model

y_i = f_i + ξ_i,   i = 1, ..., n,   (1.1)

where ξ = (ξ_1, ..., ξ_n)^⊤ is a centered Gaussian random vector, f_i = f(x_i) where f : X → R is an unknown function and x_1, ..., x_n ∈ X are deterministic points. Here, no assumption is made on the set X. Our objective is to recover the vector f = (f_1, ..., f_n)^⊤, often referred to as signal, based on the data y_1, ..., y_n. In our work, the noise covariance matrix Σ = E[ξξ^⊤] is assumed to be finite with a known upper bound on its spectral norm |||Σ|||. We denote by ⟨·|·⟩_n the empirical inner product in R^n: ⟨u|v⟩_n = (1/n) ∑_{i=1}^n u_i v_i. We measure the performance of an estimator f̂ by its expected empirical quadratic loss: r = E[‖f − f̂‖_n²], where ‖f − f̂‖_n² = (1/n) ∑_{i=1}^n (f_i − f̂_i)².

We only focus on the task of aggregating affine estimators f̂_λ indexed by some parameter λ ∈ Λ. These estimators can be written as affine transforms of the data Y = (y_1, ..., y_n)^⊤ ∈ R^n. Using the convention that all vectors are one-column matrices, we have f̂_λ = A_λ Y + b_λ, where the n × n real matrix A_λ and the vector b_λ ∈ R^n are deterministic. It means that the entries of A_λ and b_λ may depend on the points x_1, ..., x_n but not on the data Y. Let us now describe different families of linear and affine estimators successfully used in the statistical literature. Our results apply to all these families, leading to a procedure that behaves nearly as well as the best (unknown) one of the family.

Ordinary least squares. Let {S_λ : λ ∈ Λ} be a set of linear subspaces of R^n. A well-known family of affine estimators, successfully used in the context of model selection [6], is the set of orthogonal projections onto S_λ. In the case of a family of linear regression models with design matrices X_λ, one has A_λ = X_λ (X_λ^⊤ X_λ)^+ X_λ^⊤, where (X_λ^⊤ X_λ)^+ stands for the Moore-Penrose pseudo-inverse of X_λ^⊤ X_λ.

Diagonal filters. Other common estimators are the so-called diagonal filters corresponding to diagonal matrices A = diag(a_1, ..., a_n). Examples include the following (a code sketch constructing some of these matrices appears at the end of this subsection):

• Ordered projections: a_k = 1(k ≤ λ) for some integer λ [1(·) is the indicator function]. Those weights are also called truncated SVD (singular value decomposition) or spectral cut-off. In this case a natural parametrization is Λ = {1, ..., n}, indexing the number of elements conserved.

• Block projections: a_k = 1(k ≤ w_1) + ∑_{j=1}^{m−1} λ_j 1(w_j ≤ k ≤ w_{j+1}), k = 1, ..., n, where λ_j ∈ {0, 1}. Here the natural parametrization is Λ = {0, 1}^{m−1}, indexing subsets of {1, ..., m − 1}.

• Tikhonov-Philipps filter: a_k = 1/(1 + (k/w)^α), where w, α > 0. In this case, Λ = (R*_+)², indexing the smoothing parameters continuously.

• Pinsker filter: a_k = (1 − k^α/w)_+, where x_+ = max(x, 0) and (w, α) = λ ∈ Λ = (R*_+)².

Kernel ridge regression. Assume that we have a positive definite kernel k : X × X → R and we aim at estimating the true function f in the associated reproducing kernel Hilbert space (H_k, ‖·‖_k). The kernel ridge estimator is obtained by minimizing the criterion ‖Y − f‖_n² + λ‖f‖_k² w.r.t. f ∈ H_k (see [58], page 118). Denoting by K the n × n kernel matrix with elements K_{i,j} = k(x_i, x_j), the unique solution f̂ is a linear estimate of the data, f̂ = A_λ Y, with A_λ = K(K + nλ I_{n×n})^{-1}, where I_{n×n} is the n × n identity matrix.

Multiple kernel learning. As described in [3], it is possible to handle the case of several kernels k_1, ..., k_M, with associated positive definite matrices K_1, ..., K_M. For a parameter λ = (λ_1, ..., λ_M) ∈ Λ = R_+^M, one can define the estimators f̂_λ = A_λ Y with

A_λ = ( ∑_{m=1}^M λ_m K_m ) ( ∑_{m=1}^M λ_m K_m + n I_{n×n} )^{-1}.   (1.2)

It is worth mentioning that the formulation in equation (1.2) can be linked to the group Lasso [66] and to the multiple kernel learning introduced in [41]; see [3] for more details.

Moving averages. If we think of the coordinates of f as some values assigned to the vertices of an undirected graph, satisfying the property that two nodes are connected if the corresponding values of f are close, then it is natural to estimate f_i by averaging out the values Y_j for indices j that are connected to i. The resulting estimator is a linear one with a matrix A = (a_{ij})_{i,j=1}^n such that a_{ij} = 1_{V_i}(j)/n_i, where V_i is the set of neighbors of the node i in the graph and n_i is the cardinality of V_i.
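As an illustration of the families above, the following sketch builds a few of these matrices numerically. It is not part of the original paper: the grid of design points, the Gaussian kernel, its bandwidth and all parameter values are arbitrary choices made only for the example.

```python
import numpy as np

n = 64
x = np.linspace(0.0, 1.0, n)              # design points x_1, ..., x_n
k = np.arange(1, n + 1)                   # frequency index for diagonal filters

def ordered_projection(lam):
    """Ordered projection (spectral cut-off): a_k = 1(k <= lam)."""
    return (k <= lam).astype(float)

def tikhonov_philipps(w, alpha):
    """Tikhonov-Philipps filter: a_k = 1 / (1 + (k / w)^alpha)."""
    return 1.0 / (1.0 + (k / w) ** alpha)

def pinsker(w, alpha):
    """Pinsker filter: a_k = (1 - k^alpha / w)_+."""
    return np.maximum(1.0 - k ** alpha / w, 0.0)

def kernel_ridge_matrix(x, lam, bandwidth=0.1):
    """Kernel ridge matrix A_lambda = K (K + n*lambda*I)^{-1}, Gaussian kernel."""
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * bandwidth ** 2))
    return K @ np.linalg.solve(K + len(x) * lam * np.eye(len(x)), np.eye(len(x)))

A_proj = np.diag(ordered_projection(10))
A_tikh = np.diag(tikhonov_philipps(w=8.0, alpha=2.0))
A_pins = np.diag(pinsker(w=64.0, alpha=2.0))
A_krr = kernel_ridge_matrix(x, lam=1e-2)
print(A_proj.trace(), A_tikh.trace(), A_pins.trace(), A_krr.trace())
```

Each of these matrices can then be plugged in as a constituent estimator f̂_λ = A_λ Y in the aggregation procedures of Section 2.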
1.3. Organization of the paper.

In Section 2 we introduce EWA and state a PAC-Bayes type bound in expectation assessing the optimality properties of EWA in combining affine estimators. The strengths and limitations of the results are discussed in Section 3. The extension of these results to the case of grouped aggregation, in relation with ill-posed inverse problems, is developed in Section 4. As a consequence, we provide in Section 5 sharp oracle inequalities in various setups, ranging from finite to continuous families of constituent estimators and including sparse scenarios. In Section 6 we apply our main results to prove that combining Pinsker's type filters with EWA leads to asymptotically sharp adaptive procedures over Sobolev ellipsoids. Section 7 is devoted to a numerical comparison of EWA with other classical filters (soft thresholding, blockwise shrinking, etc.) and illustrates the potential benefits of aggregating. The conclusion is given in Section 8, while the proofs of some technical results (Propositions 2-6) are provided in the supplementary material [20].
2. Aggregation of estimators: Main results.
In this section we describe the statistical framework for aggregating estimators and we introduce the exponentially weighted aggregate. The task of aggregation consists in estimating f by a suitable combination of the elements of a family of constituent estimators F_Λ = (f̂_λ)_{λ∈Λ}, with f̂_λ ∈ R^n. The target objective of the aggregation is to build an aggregate f̂_aggr that mimics the performance of the best constituent estimator, called oracle (because of its dependence on the unknown function f). In what follows, we assume that Λ is a measurable subset of R^M, for some M ∈ N.

The theoretical tool commonly used for evaluating the quality of an aggregation procedure is the oracle inequality (OI), generally written as

E[‖f̂_aggr − f‖_n²] ≤ C_n inf_{λ∈Λ} E[‖f̂_λ − f‖_n²] + R_n,   (2.1)

with a residual term R_n tending to zero as n → ∞, and a leading constant C_n being bounded. The OIs with leading constant one are of central theoretical interest since they allow one to bound the excess risk and to assess aggregation-rate-optimality. They are often referred to as sharp OIs.

2.1. Exponentially weighted aggregate (EWA).
Let r_λ = E[‖f̂_λ − f‖_n²] denote the risk of the estimator f̂_λ, for any λ ∈ Λ, and let r̂_λ be an estimator of r_λ. The precise form of r̂_λ strongly depends on the nature of the constituent estimators. For any probability distribution π over Λ and for any β > 0, we define the probability measure of exponential weights, π̂, by

π̂(dλ) = θ(λ) π(dλ)   with   θ(λ) = exp(−n r̂_λ/β) / ∫_Λ exp(−n r̂_ω/β) π(dω).   (2.2)

The corresponding exponentially weighted aggregate, henceforth denoted by f̂_EWA, is the expectation of f̂_λ w.r.t. the probability measure π̂:

f̂_EWA = ∫_Λ f̂_λ π̂(dλ).   (2.3)

We will frequently use the terminology of Bayesian statistics: the measure π is called the prior, the measure π̂ is called the posterior and the aggregate f̂_EWA is then the posterior mean. The parameter β will be referred to as the temperature parameter. In the framework of aggregating statistical procedures, the use of such an aggregate can be traced back to George [32].

The interpretation of the weights θ(λ) is simple: they up-weight estimators all the more that their performance, measured in terms of the risk estimate r̂_λ, is good. The temperature parameter reflects the confidence we have in this criterion: if the temperature is small (β ≈ 0), the posterior concentrates on the estimators achieving the smallest value of the risk estimate r̂_λ, assigning almost zero weights to the other estimators. On the other hand, if β → +∞, then the probability distribution over Λ is simply the prior π, and the data do not influence our confidence in the estimators.
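For a finite family, (2.2) and (2.3) reduce to a weighted average with weights proportional to exp(−n r̂_λ/β) π(λ). The sketch below is ours, not the paper's implementation; it assumes the risk estimates r̂_λ are already available and uses a uniform prior and an arbitrary temperature, with the usual log-sum-exp stabilization.

```python
import numpy as np

def ewa(estimates, risk_estimates, beta, n, prior=None):
    """Exponentially weighted aggregate over a finite family of estimates.

    estimates      : (M, n) array; row lambda is the estimate f_hat_lambda
    risk_estimates : (M,) array of r_hat_lambda
    beta           : temperature parameter
    n              : sample size
    prior          : (M,) prior weights pi (uniform if None)
    """
    M = len(risk_estimates)
    prior = np.full(M, 1.0 / M) if prior is None else np.asarray(prior)
    log_w = -n * np.asarray(risk_estimates) / beta + np.log(prior)
    log_w -= log_w.max()                     # stabilize before exponentiating
    weights = np.exp(log_w)
    weights /= weights.sum()                 # posterior weights theta(lambda)
    return weights @ estimates, weights

# Toy usage with arbitrary inputs, for illustration only.
rng = np.random.default_rng(0)
n, M = 100, 5
estimates = rng.normal(size=(M, n))
risks = rng.uniform(0.1, 1.0, size=M)
f_ewa, w = ewa(estimates, risks, beta=4.0, n=n)
print(w)
```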
2.2. Main results.

In this paper we only focus on affine estimators

f̂_λ = A_λ Y + b_λ,   (2.4)

where the n × n real matrix A_λ and the vector b_λ ∈ R^n are deterministic. Furthermore, we will assume that an unbiased estimator Σ̂ of the noise covariance matrix Σ is available. It is well known (cf. the Appendix for details) that the risk of the estimator (2.4) is given by

r_λ = E[‖f̂_λ − f‖_n²] = ‖(A_λ − I_{n×n})f + b_λ‖_n² + Tr(A_λ Σ A_λ^⊤)/n   (2.5)

and that r̂_λ^{unb}, defined by

r̂_λ^{unb} = ‖Y − f̂_λ‖_n² + (2/n) Tr(Σ̂ A_λ) − (1/n) Tr[Σ̂],   (2.6)

is an unbiased estimator of r_λ. Along with r̂_λ^{unb}, we will use another estimator of the risk that we call the adjusted risk estimate, defined by

r̂_λ^{adj} = r̂_λ^{unb} + (1/n) Y^⊤(A_λ − A_λ²)Y.   (2.7)

One can notice that the adjusted risk estimate r̂_λ^{adj} coincides with the unbiased risk estimate r̂_λ^{unb} if and only if the matrix A_λ is an orthogonal projector.

To state our main results, we denote by P_Λ the set of all probability measures on Λ and by K(p, p′) the Kullback-Leibler divergence between two probability measures p, p′ ∈ P_Λ:

K(p, p′) = ∫_Λ log( (dp/dp′)(λ) ) p(dλ) if p is absolutely continuous w.r.t. p′, and K(p, p′) = +∞ otherwise.

We write S_1 ⪯ S_2 (resp., S_1 ⪰ S_2) for two symmetric matrices S_1 and S_2 when S_2 − S_1 (resp., S_1 − S_2) is positive semi-definite.
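The next sketch (ours) computes the unbiased and adjusted risk estimates (2.6) and (2.7) for an affine estimator f̂_λ = A_λ Y + b_λ; the toy matrices and the choice Σ̂ = Σ are illustrative assumptions. It also checks numerically that the two estimates coincide for an orthogonal projector.

```python
import numpy as np

def risk_estimates(Y, A, b, Sigma_hat):
    """Unbiased (2.6) and adjusted (2.7) risk estimates of f_hat = A Y + b."""
    n = Y.shape[0]
    f_hat = A @ Y + b
    resid = np.sum((Y - f_hat) ** 2) / n                      # ||Y - f_hat||_n^2
    r_unb = resid + 2.0 / n * np.trace(Sigma_hat @ A) - np.trace(Sigma_hat) / n
    r_adj = r_unb + Y @ ((A - A @ A) @ Y) / n                 # extra term Y^T (A - A^2) Y / n
    return r_unb, r_adj

rng = np.random.default_rng(1)
n = 50
Sigma = 0.25 * np.eye(n)
Y = rng.normal(size=n)
P = np.zeros((n, n)); P[:10, :10] = np.eye(10)                # orthogonal projector
print(risk_estimates(Y, P, np.zeros(n), Sigma))               # the two estimates coincide
print(risk_estimates(Y, 0.5 * np.eye(n), np.zeros(n), Sigma)) # here they differ
```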
Theorem 1. Let all the matrices A_λ be symmetric and let Σ̂ be unbiased and independent of Y.

(i) Assume that for all λ, λ′ ∈ Λ, it holds that A_λ A_{λ′} = A_{λ′} A_λ, A_λ Σ + Σ A_λ ⪰ 0 and b_λ = 0. If β ≥ |||Σ|||, then the aggregate f̂_EWA defined by equations (2.2), (2.3) and the unbiased risk estimate r̂_λ = r̂_λ^{unb} of (2.6) satisfies

E[‖f̂_EWA − f‖_n²] ≤ inf_{p∈P_Λ} { ∫_Λ E[‖f̂_λ − f‖_n²] p(dλ) + (β/n) K(p, π) }.   (2.8)
(ii) Assume that, for all λ ∈ Λ, A_λ ⪯ I_{n×n} and A_λ b_λ = 0. If β ≥ |||Σ|||, then the aggregate f̂_EWA defined by equations (2.2), (2.3) and the adjusted risk estimate r̂_λ = r̂_λ^{adj} of (2.7) satisfies

E[‖f̂_EWA − f‖_n²] ≤ inf_{p∈P_Λ} { ∫_Λ E[‖f̂_λ − f‖_n²] p(dλ) + (β/n) K(p, π) + (1/n) ∫_Λ ( f^⊤(A_λ − A_λ²)f + Tr[Σ(A_λ − A_λ²)] ) p(dλ) }.

The simplest setting in which all the conditions of part (i) of Theorem 1 are fulfilled is when the matrices A_λ and Σ are all diagonal, or diagonalizable in a common basis. This result, as we will see in Section 6, leads to a new estimator which is adaptive, in the exact minimax sense, over the collection of all Sobolev ellipsoids. It also suggests a new method for efficiently combining varying-block-shrinkage estimators, as described in Section 5.4.

However, part (i) of Theorem 1 leaves open the issue of aggregating affine estimators defined via noncommuting matrices. In particular, it does not allow us to evaluate the MSE of EWA when each A_λ is a convex or linear combination of a fixed family of projection matrices onto nonorthogonal linear subspaces. These kinds of situations may be handled via the result of part (ii) of Theorem 1. One can observe that in the particular case of a finite collection of projection estimators (i.e., A_λ² = A_λ and b_λ = 0 for every λ), the result of part (ii) offers an extension of [45], Corollary 6, to the case of general noise covariances ([45] deals only with i.i.d. noise).

An important situation covered by part (ii) of Theorem 1, but not by part (i), concerns the case when the signals of interest f are smooth or sparse in a basis B_sig which is different from the basis B_noise orthogonalizing the covariance matrix Σ. In such a context, one may be interested in considering matrices A_λ that are diagonalizable in the basis B_sig and which, in general, do not commute with Σ.
Remark 1. While the results in [45] yield a sharp oracle inequality in the case of projection matrices A_λ, they are of no help in the case when the matrices A_λ are only nearly idempotent and not exactly so. Assertion (ii) of Theorem 1 fills this gap by showing that if max_λ Tr[A_λ − A_λ²] ≤ δ, then E[‖f̂_EWA − f‖_n²] is bounded by

inf_{p∈P_Λ} { ∫_Λ E[‖f̂_λ − f‖_n²] p(dλ) + (β/n) K(p, π) } + δ( ‖f‖_n² + n^{-1}|||Σ||| ).
Remark 2. We have focused only on Gaussian errors to emphasize that it is possible to efficiently aggregate almost any family of affine estimators. We believe that, by a suitable adaptation of the approach developed in [22], the claims of Theorem 1 can be generalized, at least when the ξ_i are independent with known variances, to some other common noise distributions.

The results presented so far concern the situation when the matrices A_λ are symmetric. However, using the last part of Theorem 1, it is possible to propose an estimator of f that is almost as accurate as the best affine estimator A_λ Y + b_λ even if the matrices A_λ are not symmetric. Interestingly, the estimator enjoying this property is not obtained by aggregating the original estimators f̂_λ = A_λ Y + b_λ but the "symmetrized" estimators f̃_λ = Ã_λ Y + b_λ, where Ã_λ = A_λ + A_λ^⊤ − A_λ^⊤ A_λ. Besides symmetry, an advantage of the matrices Ã_λ, as compared to the A_λ's, is that they automatically satisfy the contraction condition Ã_λ ⪯ I_{n×n} required by part (ii) of Theorem 1. We will refer to this method as Symmetrized Exponentially Weighted Aggregation (SEWA) [19].
Theorem 2. Assume that the matrices A_λ and the vectors b_λ satisfy A_λ b_λ = A_λ^⊤ b_λ = 0 for every λ ∈ Λ. Assume in addition that Σ̂ is an unbiased estimator of Σ and is independent of Y. Let f̃_SEWA denote the exponentially weighted aggregate of the (symmetrized) estimators f̃_λ = (A_λ + A_λ^⊤ − A_λ^⊤ A_λ)Y + b_λ with the weights (2.2) defined via the risk estimate r̂_λ^{unb}. Then, under the conditions β ≥ |||Σ||| and

π{ λ ∈ Λ : Tr(Σ̂ A_λ) ≤ Tr(Σ̂ A_λ^⊤ A_λ) } = 1   a.s.,   (C)

it holds that

E[‖f̃_SEWA − f‖_n²] ≤ inf_{p∈P_Λ} { ∫_Λ E[‖f̂_λ − f‖_n²] p(dλ) + (β/n) K(p, π) }.   (2.9)

To understand the scope of condition (C), let us present several cases of widely used linear estimators for which this condition is satisfied:

• The simplest class of matrices A_λ for which condition (C) holds true are orthogonal projections. Indeed, if A_λ is a projection matrix, it satisfies A_λ^⊤ A_λ = A_λ and, therefore, Tr(Σ̂ A_λ) = Tr(Σ̂ A_λ^⊤ A_λ).

• When the matrix Σ̂ is diagonal, a sufficient condition for (C) is a_ii ≤ ∑_{j=1}^n a_{ji}² for every i. Consequently, (C) holds true for matrices having only zeros on the main diagonal. For instance, the kNN filter in which the weight of the observation Y_i is replaced by zero, that is, a_ij = 1(j ∈ {j_{i,1}, ..., j_{i,k}})/k, satisfies this condition.

• Under the slightly more stringent assumption of homoscedasticity, that is, when Σ̂ = σ̂² I_{n×n}, if the matrices A_λ are such that all the nonzero elements of each row are equal and sum up to one (or to a quantity larger than one), then Tr(A_λ) = Tr(A_λ^⊤ A_λ) and (C) is fulfilled. Notable examples of linear estimators that satisfy this condition are Nadaraya-Watson estimators with rectangular kernel and nearest neighbor filters.
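Condition (C) and the symmetrization Ã_λ = A_λ + A_λ^⊤ − A_λ^⊤ A_λ are easy to check numerically. The sketch below is a small illustration of ours; the zero-diagonal averaging filter is an arbitrary example of the second bullet point.

```python
import numpy as np

def symmetrize(A):
    """SEWA replaces A by A + A^T - A^T A; the result is symmetric."""
    return A + A.T - A.T @ A

def condition_C(A, Sigma_hat, tol=1e-12):
    """Check Tr(Sigma_hat A) <= Tr(Sigma_hat A^T A), as required by condition (C)."""
    return np.trace(Sigma_hat @ A) <= np.trace(Sigma_hat @ A.T @ A) + tol

n = 20
Sigma_hat = np.eye(n)
A = np.zeros((n, n))         # filter averaging the two right neighbors (zero diagonal)
for i in range(n):
    A[i, (i + 1) % n] = 0.5
    A[i, (i + 2) % n] = 0.5

print(condition_C(A, Sigma_hat))                         # True: zero diagonal entries
A_tilde = symmetrize(A)
print(np.allclose(A_tilde, A_tilde.T))                   # symmetric by construction
print(np.max(np.linalg.eigvalsh(A_tilde)) <= 1 + 1e-12)  # contraction A_tilde <= I
```

The contraction property holds for any A because I − Ã = (I − A)^⊤(I − A) is positive semi-definite.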
3. Discussion.
Before elaborating on the main results stated in the previous section, by extending them to inverse problems and by deriving adaptive procedures, let us discuss some aspects of the presented OIs.
3.1. Assumptions on Σ.

In some rare situations, the matrix Σ is known and it is natural to use Σ̂ = Σ as an unbiased estimator. Besides this not very realistic situation, there are at least two contexts in which it is reasonable to assume that an unbiased estimator of Σ, independent of Y, is available.

The first case corresponds to problems in which a signal can be recorded several times by the same device, or once but by several identical devices. For instance, this is the case when an object is photographed many times by the same digital camera during a short time period. Let Z_1, ..., Z_N be the available signals, which can be considered as i.i.d. copies of an n-dimensional Gaussian vector with mean f and covariance matrix Σ_Z. Then, defining Y = (Z_1 + ··· + Z_N)/N and Σ̂_Z = (N − 1)^{-1}(Z_1 Z_1^⊤ + ··· + Z_N Z_N^⊤ − N Y Y^⊤), we find ourselves within the framework covered by the previous theorems. Indeed, Y ∼ N_n(f, Σ_Y) with Σ_Y = Σ_Z/N, and Σ̂_Y = Σ̂_Z/N is an unbiased estimate of Σ_Y, independent of Y. Note that our theory applies in this setting for every integer N ≥ 2.

The second case corresponds to the situation where the response of the device to a known signal g can be recorded. In digital image processing, g can be a black picture. This provides a noisy signal Z drawn from the Gaussian distribution N_n(g, Σ), independent of Y, which is the signal of interest. Setting Σ̂ = (Z − g)(Z − g)^⊤, one ends up with an unbiased estimator of Σ, which is independent of Y.
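The first context can be emulated numerically as follows; this is our toy illustration (arbitrary dimensions and an arbitrary positive definite Σ_Z), not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 30, 5
f = np.sin(np.linspace(0.0, 3.0, n))                     # unknown signal (toy)
L = np.tril(rng.normal(size=(n, n))) / np.sqrt(n)
Sigma_Z = L @ L.T + 0.1 * np.eye(n)                      # covariance of one recording

Z = rng.multivariate_normal(f, Sigma_Z, size=N)          # N i.i.d. recordings (rows)
Y = Z.mean(axis=0)                                       # Y = (Z_1 + ... + Z_N) / N
Sigma_hat_Z = (Z.T @ Z - N * np.outer(Y, Y)) / (N - 1)   # unbiased estimate of Sigma_Z
Sigma_hat_Y = Sigma_hat_Z / N                            # unbiased estimate of Cov(Y)
print(np.linalg.norm(Sigma_hat_Y - Sigma_Z / N))         # estimation error
```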
3.2. OI in expectation versus OI with high probability.

All the results stated in this work provide sharp nonasymptotic bounds on the expected risk of EWA. It would be insightful to complement this study by risk bounds that hold true with high probability. However, it was recently proved in [17] that EWA is deviation suboptimal: there exist a family of constituent estimators and a constant C > 0 such that the excess risk of EWA is larger than C/√n with probability at least 0.06. Nevertheless, several empirical studies (see, e.g., [18]) demonstrated that EWA often has a smaller risk than some of its competitors, such as the empirical star procedure [4], which are provably optimal in the sense of OIs with high probability. Furthermore, numerical experiments carried out in Section 7 show that the standard deviation of the risk of EWA is of the order of 1/n. This suggests that under some conditions on the constituent estimators it might be possible to establish OIs for EWA that are similar to (2.8) but hold true with high probability. A step toward proving this kind of result was made in [43], Theorem C, for the model of regression with random design.

3.3. Relation to previous work and limits of our results.
The OIs of the previous section require various conditions on the constituent estimators f̂_λ = A_λ Y + b_λ. One may wonder how general these conditions are and whether it is possible to extend these OIs to more general f̂_λ's. Although this work does not answer this question, we can sketch some elements of response.

First of all, we stress that the conditions of the present paper relax significantly those of previous results existing in the statistical literature. For instance, Kneip [40] considered only linear estimators, that is, b_λ ≡ 0, with ordered and commuting matrices A_λ. The ordering assumption is dropped in Leung and Barron [45], in the case of projection matrices. Note that neither of these assumptions is satisfied for the families of Pinsker and Tikhonov-Philipps estimators. The present work strengthens the existing results by considering more general, affine estimators extending both projection matrices and ordered commuting matrices.

Despite the advances achieved in this work, there are still interesting cases that are not covered by our theory. We now introduce a family of estimators commonly used in image processing that do not satisfy our assumptions. In recent years, nonlocal means (NLM) became quite popular in image processing [8]. This method of signal denoising, shown to be tied in with EWA [56], removes noise by exploiting signal self-similarities. We briefly define the NLM procedure in the case of one-dimensional signals.

Assume that a vector Y = (y_1, ..., y_n)^⊤ given by (1.1) is observed with f_i = f(i/n), i = 1, ..., n, for some function f : [0, 1] → R. For a fixed "patch size" k ∈ {1, ..., n}, let us define f_[i] = (f_i, f_{i+1}, ..., f_{i+k−1})^⊤ and Y_[i] = (y_i, y_{i+1}, ..., y_{i+k−1})^⊤ for every i = 1, ..., n − k + 1. The vectors f_[i] and Y_[i] are, respectively, called the true patch and the noisy patch. The NLM consists in regarding the noisy patches Y_[i] as constituent estimators for estimating the true patch f_[i₀] by applying EWA. One easily checks that the constituent estimators Y_[i] are affine in Y_[i₀], that is, Y_[i] = A_i Y_[i₀] + b_i with A_i and b_i independent of Y_[i₀]. Indeed, if the distance between i and i₀ is larger than k, then Y_[i] is independent of Y_[i₀] and, therefore, A_i = 0 and b_i = Y_[i]. If |i − i₀| < k, then the matrix A_i is a suitably chosen shift matrix and b_i is the projection of Y_[i] onto the orthogonal complement of the image of A_i. Unfortunately, these matrices {A_i} and vectors {b_i} do not fit our framework, that is, the assumption A_i b_i = A_i^⊤ b_i = 0 is not satisfied.
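For concreteness, here is a schematic one-dimensional nonlocal-means denoiser in the spirit of the description above. It is our simplified sketch (plain exponential weights on squared patch distances, arbitrary patch size and bandwidth, simple averaging of overlapping patches), not the exact estimator studied in [8, 56].

```python
import numpy as np

def nlm_1d(Y, k=5, h=0.5):
    """Schematic 1D nonlocal means: each noisy patch is replaced by an exponentially
    weighted average of all noisy patches, with weights based on patch distances."""
    n = len(Y)
    m = n - k + 1
    patches = np.stack([Y[i:i + k] for i in range(m)])       # noisy patches Y_[i]
    denoised = np.zeros(n)
    counts = np.zeros(n)
    for i in range(m):
        d2 = np.sum((patches - patches[i]) ** 2, axis=1)     # squared patch distances
        w = np.exp(-d2 / (h ** 2 * k))
        w /= w.sum()
        denoised[i:i + k] += w @ patches                     # aggregated estimate of the patch
        counts[i:i + k] += 1.0
    return denoised / counts                                 # average the overlapping estimates

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 200)
f = np.sign(np.sin(6 * np.pi * t))                           # piecewise-constant toy signal
Y = f + 0.3 * rng.normal(size=t.size)
print(np.mean((nlm_1d(Y) - f) ** 2), np.mean((Y - f) ** 2))  # denoised vs raw squared error
```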
Finally, our proof technique is specific to affine estimators. Its extension to estimators defined as a more complex function of the data will certainly require additional tools and is a challenging problem for future research. Yet, it seems unlikely to get sharp OIs with an optimal remainder term for a fairly general family of constituent estimators (without data splitting), since this generality inherently increases the risk of overfitting.
4. Ill-posed inverse problems and group-weighting.
As explained in [12, 13], the model of heteroscedastic regression is well suited for describing inverse problems. In fact, let T be a known linear operator on some Hilbert space H, with inner product ⟨·|·⟩_H. For some h ∈ H, let Y be the random process indexed by g ∈ H such that

Y = Th + εξ   ⟺   Y(g) = ⟨Th|g⟩_H + εξ(g),  ∀ g ∈ H,   (4.1)

where ε > 0 and ξ is a white Gaussian noise on H, that is, for any g_1, ..., g_k ∈ H the vector (ξ(g_1), ..., ξ(g_k)) is Gaussian with zero mean and covariance matrix {⟨g_i|g_j⟩_H}_{i,j}. The problem is then the following: estimate the element h assuming the value of Y(g) can be measured for any given g. It is customary to use as g the eigenvectors of the adjoint T* of T. Under the condition that the operator T*T is compact, the SVD yields Tφ_k = b_k ψ_k and T*ψ_k = b_k φ_k, for k ∈ N, where the b_k are the singular values, {ψ_k} is an orthonormal basis in Range(T) ⊂ H and {φ_k} is the corresponding orthonormal basis in H. In view of (4.1), it holds that

Y(ψ_k) = ⟨h|φ_k⟩_H b_k + εξ(ψ_k),   k ∈ N.   (4.2)

Since in practice only a finite number of measurements can be computed, it is natural to assume that the values Y(ψ_k) are available only for k smaller than some integer n. Under the assumption that b_k ≠ 0, the last equation is equivalent to (1.1) with f_i = ⟨h|φ_i⟩_H and Σ = diag(σ_i²; i = 1, ..., n) for σ_i = ε b_i^{-1}. Examples of inverse problems to which this statistical model has been successfully applied are derivative estimation, deconvolution with known kernel and computerized tomography; see [12] and the references therein for more applications.

For very mildly ill-posed inverse problems, that is, when the singular values b_k of T tend to zero not faster than any negative power of k, the approach presented in Section 2 will lead to satisfactory results. Indeed, by choosing β = 8|||Σ||| or β = 4|||Σ|||, the remainder term in (2.8) and (2.9) becomes, up to a logarithmic factor, proportional to max_{1≤k≤n} b_k^{-2}/n, which is the optimal rate in the case of very mild ill-posedness.

However, even for mildly ill-posed inverse problems, the approach developed in the previous section becomes obsolete, since the remainder blows up when n increases to infinity. Furthermore, this is not an artifact of our theoretical results, but rather a drawback of the aggregation strategy adopted in the previous section. Indeed, the posterior probability measure π̂ defined by (2.2) can be seen as the solution of the entropy-penalized empirical risk minimization problem

π̂_n = arg inf_p { ∫_Λ r̂_λ p(dλ) + (β/n) K(p, π) },   (4.3)

where the inf is taken over the set of all probability distributions. It means that the same regularization parameter β is employed for estimating both the coefficients f_i = ⟨h|φ_i⟩_H corrupted by noise of small magnitude and those corrupted by large noise. Since we place ourselves in the setting of known operator T and, therefore, known noise levels, such a uniform treatment of all coefficients is unreasonable. It is more natural to upweight the regularization term in the case of large noise, downweighting the data fidelity term, and, conversely, to downweight the regularization in the case of small noise. This motivates our interest in the grouped EWA (GEWA).

Let us consider a partition B_1, ..., B_J of the set {1, ..., n}: B_j = {T_j + 1, ..., T_{j+1}}, for some integers 0 = T_1 < T_2 < ··· < T_{J+1} = n.
To each element B_j of this partition, we associate the data sub-vector Y^j = (Y_i : i ∈ B_j) and the sub-vector of the true function f^j = (f_i : i ∈ B_j). As in the previous sections, we are concerned with the aggregation of affine estimators f̂_λ = A_λ Y + b_λ, but here we will assume that the matrices A_λ are block-diagonal:

A_λ = diag(A_λ^1, ..., A_λ^J)   with   A_λ^j ∈ R^{(T_{j+1} − T_j) × (T_{j+1} − T_j)}.

Similarly, we define f̂_λ^j and b_λ^j as the sub-vectors of f̂_λ and b_λ, respectively, corresponding to the indices belonging to B_j. We will also assume that the noise covariance matrix Σ and its unbiased estimate Σ̂ are block-diagonal with (T_{j+1} − T_j) × (T_{j+1} − T_j) blocks Σ^j and Σ̂^j, respectively. This notation implies, in particular, that f̂_λ^j = A_λ^j Y^j + b_λ^j for every j = 1, ..., J. Moreover, the unbiased risk estimate r̂_λ^{unb} of f̂_λ can be decomposed into the sum of unbiased risk estimates r̂_λ^{j,unb} of f̂_λ^j, namely, r̂_λ^{unb} = ∑_{j=1}^J r̂_λ^{j,unb}, where

r̂_λ^{j,unb} = ‖Y^j − f̂_λ^j‖_n² + (2/n) Tr(Σ̂^j A_λ^j) − (1/n) Tr[Σ̂^j],   j = 1, ..., J.

To state the analogues of Theorems 1 and 2, we introduce the following settings.
Setting 1: For all λ, λ′ ∈ Λ and j ∈ {1, ..., J}, the matrices A_λ^j are symmetric and satisfy A_λ^j A_{λ′}^j = A_{λ′}^j A_λ^j, A_λ^j Σ^j + Σ^j A_λ^j ⪰ 0 and b_λ^j = 0. For a temperature vector β = (β_1, ..., β_J)^⊤ and a prior π, we define GEWA as f̂_GEWA^j = ∫_Λ f̂_λ^j π̂_j(dλ), where

π̂_j(dλ) = θ_j(λ) π(dλ)   with   θ_j(λ) = exp(−n r̂_λ^{j,unb}/β_j) / ∫_Λ exp(−n r̂_ω^{j,unb}/β_j) π(dω).   (4.4)
Setting 2: For every j = 1, ..., J and for every λ belonging to a set of π-measure one, the matrices A_λ satisfy a.s. the inequality Tr(Σ̂^j A_λ^j) ≤ Tr(Σ̂^j (A_λ^j)^⊤ A_λ^j), while the vectors b_λ are such that A_λ^j b_λ^j = (A_λ^j)^⊤ b_λ^j = 0. In this case, for a temperature vector β = (β_1, ..., β_J)^⊤ and a prior π, we define GEWA as f̂_GEWA^j = ∫_Λ f̃_λ^j π̂_j(dλ), where f̃_λ^j = (A_λ^j + (A_λ^j)^⊤ − (A_λ^j)^⊤ A_λ^j) Y^j + b_λ^j and π̂_j is defined by (4.4). Note that this setting is the grouped version of the SEWA.
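A minimal numerical sketch of the grouped aggregation under setting 1 (ours; the diagonal Pinsker-type filters, block sizes and temperatures below are arbitrary toy choices): each group j receives its own exponential weights, driven by the group-wise unbiased risk estimates and its own temperature β_j.

```python
import numpy as np

def gewa(Y, A_list, blocks, Sigma_hat, betas, prior=None):
    """Grouped EWA under setting 1 (block-diagonal A_lambda, b_lambda = 0)."""
    n, M = Y.shape[0], len(A_list)
    prior = np.full(M, 1.0 / M) if prior is None else np.asarray(prior)
    f_gewa = np.zeros(n)
    for j, B in enumerate(blocks):
        r_j = np.empty(M)
        for m, A in enumerate(A_list):
            Aj, Sj, Yj = A[np.ix_(B, B)], Sigma_hat[np.ix_(B, B)], Y[B]
            res = Yj - Aj @ Yj
            # group-wise unbiased risk estimate r_hat^{j,unb}_lambda
            r_j[m] = (res @ res + 2.0 * np.trace(Sj @ Aj) - np.trace(Sj)) / n
        log_w = -n * r_j / betas[j] + np.log(prior)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        f_gewa[B] = sum(w[m] * (A_list[m][np.ix_(B, B)] @ Y[B]) for m in range(M))
    return f_gewa

rng = np.random.default_rng(4)
n = 40
Y = rng.normal(size=n)
k = np.arange(1, n + 1)
A_list = [np.diag(np.maximum(1.0 - k ** 2 / w, 0.0)) for w in (50.0, 200.0, 1000.0)]
blocks = [np.arange(0, 20), np.arange(20, 40)]
print(gewa(Y, A_list, blocks, np.eye(n), betas=[8.0, 8.0]))
```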
Theorem 3. Assume that Σ̂ is unbiased and independent of Y. Under setting 1, if β_j ≥ |||Σ^j||| for all j = 1, ..., J, then

E[‖f̂_GEWA − f‖_n²] ≤ ∑_{j=1}^J inf_{p_j} { ∫_Λ E[‖f̂_λ^j − f^j‖_n²] p_j(dλ) + (β_j/n) K(p_j, π) }.   (4.5)

Under setting 2, this inequality holds true if β_j ≥ |||Σ^j||| for every j = 1, ..., J.

As we shall see in Section 6, this theorem allows us to propose an estimator of the unknown signal which is adaptive w.r.t. the smoothness properties of the underlying signal and achieves the minimax rates and constants over the Sobolev ellipsoids, provided that the operator T is mildly ill-posed, that is, its singular values decrease at most polynomially.
5. Examples of sharp oracle inequalities.
In this section we discuss consequences of the main result for specific choices of prior measures. For conveying the main messages of this section, it is enough to focus on settings 1 and 2 in the case of only one group (J = 1).

5.1. Discrete oracle inequality.
In order to demonstrate that inequality (4.5) can be reformulated in terms of an OI as defined by (2.1), let us consider the case when the prior π is discrete, that is, π(Λ₀) = 1 for a countable set Λ₀ ⊂ Λ, and w.l.o.g. Λ₀ = N. Then, the following result holds true.

Proposition 1.
Let Σ̂ be unbiased, independent of Y, and let π be supported by N. Under setting 1 with J = 1 and β = β_1 ≥ |||Σ|||, the aggregate f̂_GEWA satisfies the inequality

E[‖f̂_GEWA − f‖_n²] ≤ inf_{ℓ ∈ N : π_ℓ > 0} ( E[‖f̂_ℓ − f‖_n²] + β log(1/π_ℓ)/n ).   (5.1)

Furthermore, (5.1) holds true under setting 2 for β ≥ |||Σ|||.
Proof. It suffices to apply Theorem 3 and to upper-bound the right-hand side by the minimum over all Dirac measures p = δ_ℓ such that π_ℓ > 0. □

This inequality can be compared to Corollary 2 in [5], Section 4.3. Our result has the advantage of having factor one in front of the expectation on the left-hand side, while in [5] a constant much larger than 1 appears. However, it should be noted that the assumptions on the (estimated) noise covariance matrix are much weaker in [5].
5.2. Continuous oracle inequality.
It may be useful in practice to combine a family of affine estimators indexed by an open subset of R^M for some M ∈ N (e.g., to build an estimator nearly as accurate as the best kernel estimator with a fixed kernel and varying bandwidth). To state an oracle inequality in such a "continuous" setup, let us denote by d(λ, ∂Λ) the largest real τ > 0 such that the ball centered at λ of radius τ, hereafter denoted by B_λ(τ), is included in Λ. Let Leb(·) be the Lebesgue measure in R^M.

Proposition 2.
Let Σ̂ be unbiased and independent of Y. Let Λ ⊂ R^M be an open and bounded set and let π be the uniform distribution on Λ. Assume that the mapping λ ↦ r_λ is Lipschitz continuous, that is, |r_λ′ − r_λ| ≤ L_r ‖λ′ − λ‖_2 for all λ, λ′ ∈ Λ. Under setting 1 with J = 1 and β = β_1 ≥ |||Σ|||, the aggregate f̂_GEWA satisfies the inequality

E[‖f̂_GEWA − f‖_n²] ≤ inf_{λ∈Λ} { E[‖f̂_λ − f‖_n²] + (βM/n) log( √M / (2 min(n^{-1}, d(λ, ∂Λ))) ) } + (L_r + β log(Leb(Λ)))/n.   (5.2)

Furthermore, (5.2) holds true under setting 2 for every β ≥ |||Σ|||.
Proof. It suffices to apply assertion (i) of Theorem 1 and to upper-bound the right-hand side in inequality (2.8) by the minimum over all measures having as density p_{λ*,τ*}(λ) = 1_{B_{λ*}(τ*)}(λ)/Leb(B_{λ*}(τ*)). Choosing τ* = min(n^{-1}, d(λ*, ∂Λ)), so that B_{λ*}(τ*) ⊂ Λ, the measure p_{λ*,τ*}(λ) dλ is absolutely continuous w.r.t. the uniform prior π, and the Kullback-Leibler divergence between these two measures equals log{Leb(Λ)/Leb(B_{λ*}(τ*))}. Using Leb(B_{λ*}(τ*)) ≥ (2τ*/√M)^M and the Lipschitz condition, we get the desired inequality. □

Note that it is not very stringent to require the risk function r_λ to be Lipschitz continuous, especially since this condition need not be satisfied uniformly in f. Let us consider ridge regression: for a given design matrix X ∈ R^{n×p}, A_λ = X(X^⊤X + γ_n λ I_{p×p})^{-1}X^⊤ and b_λ = 0 with λ ∈ [λ_*, λ^*], γ_n being a given normalization factor typically set to n or √n, λ_* > 0 and λ^* ∈ [λ_*, ∞]. One can easily check the Lipschitz property of the risk function with L_r = L_r(f) = 4λ_*^{-1}‖f‖_n² + (2/n) Tr(Σ).

5.3. Sparsity oracle inequality.
The continuous oracle inequality stated in the previous subsection is well adapted to the problems in which the dimension M of Λ is small w.r.t. the sample size n (or, more precisely, the signal-to-noise ratio n/|||Σ|||). When this is not the case, the choice of the prior should be done more carefully. For instance, consider Λ ⊂ R^M with large M under the sparsity scenario: there is a sparse vector λ* ∈ Λ such that the risk of f̂_{λ*} is small. Then, it is natural to choose a prior that favors sparse λ's. This can be done in the same vein as in [21-24], by means of the heavy-tailed prior

π(dλ) ∝ ∏_{m=1}^M (1 + |λ_m/τ|)^{-4} 1_Λ(λ),   (5.3)

where τ > 0 is a scale parameter.

Proposition 3.
Let Σ̂ be unbiased and independent of Y. Let Λ = R^M and let π be defined by (5.3). Assume that the mapping λ ↦ r_λ is continuously differentiable and, for some M × M matrix M, satisfies

r_λ − r_{λ′} − ∇r_{λ′}^⊤(λ − λ′) ≤ (λ − λ′)^⊤ M (λ − λ′)   ∀ λ, λ′ ∈ Λ.   (5.4)

Under setting 1, if β ≥ |||Σ|||, then the aggregate f̂_EWA = f̂_GEWA satisfies

E[‖f̂_GEWA − f‖_n²] ≤ inf_{λ∈R^M} { E[‖f̂_λ − f‖_n²] + (4β/n) ∑_{m=1}^M log(1 + |λ_m|/τ) } + Tr(M) τ².   (5.5)

Moreover, (5.5) holds true under setting 2 if β ≥ |||Σ|||.

Let us discuss here some consequences of this sparsity oracle inequality. First of all, consider the case of (linearly) combining frozen estimators, that is, when f̂_λ = ∑_{j=1}^M λ_j φ_j with some known functions φ_j. Then, it is clear that r_λ − r_{λ′} − ∇r_{λ′}^⊤(λ − λ′) = 2(λ − λ′)^⊤ Φ (λ − λ′), where Φ is the Gram matrix defined by Φ_{i,j} = ⟨φ_i|φ_j⟩_n. So the condition in Proposition 3 consists in bounding the Gram matrix of the atoms φ_j. Let us remark that in this case (see, for instance, [22, 23]) Tr(M) is of the order of M and the choice τ = √(β/(nM)) ensures that the last term in the right-hand side of equation (5.5) decreases at the parametric rate 1/n. This is the choice we recommend for practical applications.

As a second example, let us consider the case of a large number of linear estimators ĝ_1 = G_1 Y, ..., ĝ_M = G_M Y satisfying the conditions of setting 1 and such that max_{m=1,...,M} |||G_m||| ≤ 1. Assume we aim at proposing an estimator mimicking the behavior of the best possible convex combination of a pair of estimators chosen among ĝ_1, ..., ĝ_M. This task can be accomplished in our framework by setting Λ = R^M and f̂_λ = λ_1 ĝ_1 + ··· + λ_M ĝ_M, where λ = (λ_1, ..., λ_M). Remark that if {ĝ_m} satisfies the conditions of setting 1, so does {f̂_λ}. Moreover, the mapping λ ↦ r_λ is quadratic with Hessian matrix ∇²r_λ given by the entries 2⟨G_m f|G_{m′} f⟩_n + (2/n) Tr(G_{m′} Σ G_m), m, m′ = 1, ..., M. It implies that inequality (5.4) holds with M = ∇²r_λ/2. Therefore, denoting by σ_i² the ith diagonal entry of Σ and setting σ = (σ_1, ..., σ_n), we get Tr(M) ≤ |||∑_{m=1}^M G_m²||| [‖f‖_n² + ‖σ‖_n²] ≤ M[‖f‖_n² + ‖σ‖_n²]. Applying Proposition 3 with τ = √(β/(nM)), we get

E[‖f̂_EWA − f‖_n²] ≤ inf_{α,m,m′} E[‖α ĝ_m + (1 − α) ĝ_{m′} − f‖_n²] + (8β/n) log( 1 + [Mn/β]^{1/2} ) + (β/n)[‖f‖_n² + ‖σ‖_n²],   (5.6)

where the inf is taken over all α ∈ [0, 1] and m, m′ ∈ {1, ..., M}. This inequality is derived from (5.5) by upper-bounding the inf over λ ∈ R^M by the infimum over λ's having at most two nonzero coefficients, λ_m and λ_{m′}, that are nonnegative and sum to one: λ_m + λ_{m′} = 1. To get (5.6), one simply notes that only two terms of the sum ∑_m log(1 + |λ_m|τ^{-1}) are nonzero and each of them is not larger than log(1 + τ^{-1}). Thus, using EWA one can achieve the best possible risk over the convex combinations of a pair of linear estimators, selected from a large (but finite) family, at the price of a residual term that decreases at the parametric rate up to a log factor.

5.4. Oracle inequalities for varying-block-shrinkage estimators.
Let us consider now the problem of aggregating two-block shrinkage estimators. This means that the constituent estimators have the following form: for λ = (a, b, k) ∈ [0, 1]² × {1, ..., n} := Λ, f̂_λ = A_λ Y where A_λ = diag(a·1(i ≤ k) + b·1(i > k), i = 1, ..., n). Let us choose the prior π as uniform on Λ.

Proposition 4.
Let f̂_EWA be the exponentially weighted aggregate having as constituent estimators the two-block shrinkage estimators A_λ Y. If Σ is diagonal, then for any λ ∈ Λ and for any β ≥ |||Σ|||,

E[‖f̂_EWA − f‖_n²] ≤ E[‖f̂_λ − f‖_n²] + (β/n) log( (n‖f‖_n² + n Tr(Σ)) / (12β) ).   (5.7)

In the case Σ = I_{n×n}, this result is comparable to [44], page 20, Theorem 2.49, which states that in the homoscedastic regression model (Σ = I_{n×n}), EWA acting on two-block positive-part James-Stein estimators satisfies, for any λ ∈ Λ such that 3 ≤ k ≤ n − 3 and for β = 8, the oracle inequality

E[‖f̂_Leung − f‖_n²] ≤ E[‖f̂_λ − f‖_n²] + 9/n + (8/n) min_{K>1} { K ∨ log((n − 1)/(K − 1)) }.   (5.8)
6. Application to minimax adaptive estimation.
Pinsker proved in his celebrated paper [49] that in the model (1.1) the minimax risk over ellipsoids can be asymptotically attained by a linear estimator. Let us denote by θ_k(f) = ⟨f|φ_k⟩_n the coefficients of the (orthogonal) discrete cosine transform (DCT) of f, hereafter denoted by Df. Pinsker's result, restricted to the Sobolev ellipsoids F_D(α, R) = {f ∈ R^n : ∑_{k=1}^n k^{2α} θ_k(f)² ≤ R}, states that, as n → ∞, the equivalences

inf_{f̂} sup_{f∈F_D(α,R)} E[‖f̂ − f‖_n²] ∼ inf_A sup_{f∈F_D(α,R)} E[‖AY − f‖_n²]   (6.1)
∼ inf_{w>0} sup_{f∈F_D(α,R)} E[‖A_{α,w}Y − f‖_n²]   (6.2)

hold [61], Theorem 3.2, where the first inf is taken over all possible estimators f̂ and A_{α,w} = D^⊤ diag((1 − k^α/w)_+; k = 1, ..., n) D is the Pinsker filter in the discrete cosine basis. In simple words, this implies that the (asymptotically) minimax estimator can be chosen from the quite narrow class of linear estimators with Pinsker's filter. However, it should be emphasized that the minimax linear estimator depends on the parameters α and R, which are generally unknown. An (adaptive) estimator that does not depend on (α, R) and is asymptotically minimax over a large scale of Sobolev ellipsoids has been proposed by Efromovich and Pinsker [29]. The next result, which is a direct consequence of Theorem 1, shows that EWA with linear constituent estimators is also asymptotically sharp adaptive over Sobolev ellipsoids.

Proposition 5.
Let λ = (α, w) ∈ Λ = (R*_+)² and consider the prior

π(dλ) = [ 2 n_σ^{−α/(2α+1)} / (1 + n_σ^{−α/(2α+1)} w)³ ] e^{−α} dα dw,   (6.3)

where n_σ = n/σ². Then, in model (1.1) with homoscedastic errors, the aggregate f̂_EWA based on the temperature β = 8σ² and the constituent estimators f̂_{α,w} = A_{α,w}Y (with A_{α,w} being the Pinsker filter) is adaptive in the exact minimax sense on the family of classes {F_D(α, R) : α > 0, R > 0}.

(The results of this section hold true not only for the discrete cosine transform, but also for any linear transform D such that DD^⊤ = D^⊤D = n^{-1} I_{n×n}; see [61], Definition 3.8.)

It is worth noting that the exact minimax adaptivity property of our estimator f̂_EWA is achieved without any tuning parameter. All previously proposed methods that are provably adaptive in an exact minimax sense depend on some parameters, such as the lengths of blocks for the blockwise Stein [14] and Efromovich-Pinsker [28] estimators or the step of discretization and the maximal value of the bandwidth in [13]. Another nice property of the estimator f̂_EWA is that it does not require any pilot estimator based on the data splitting device [31].

We now turn to the setup of heteroscedastic regression, which corresponds to ill-posed inverse problems as described in Section 4. To achieve adaptivity in the exact minimax sense, we make use of f̂_GEWA, the grouped version of the exponentially weighted aggregate. We assume hereafter that the matrix Σ is diagonal with diagonal entries σ_1², ..., σ_n² satisfying the following property:

∃ σ_*, γ > 0 such that σ_k = σ_* k^γ (1 + o_k(1)) as k → ∞.   (6.4)

This kind of problem arises when T is a differential operator or the Radon transform ([12], Section 1.3). To handle such situations, we define the groups in the same spirit as the weakly geometrically increasing blocks in [15]. Let ν = ν_n be a positive integer that increases as n → ∞. Set ρ_n = ν_n^{-1/2} and define

T_j = (1 + ν_n)^{j−1} − 1 for j = 1, 2,   and   T_j = T_{j−1} + ⌊ν_n ρ_n (1 + ρ_n)^{j−2}⌋ for j = 3, 4, ...,   (6.5)

where ⌊x⌋ stands for the largest integer strictly smaller than x. Let J be the smallest integer j such that T_j ≥ n. We redefine T_{J+1} = n and set B_j = {T_j + 1, ..., T_{j+1}} for all j = 1, ..., J.
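The block construction can be coded along the following lines. The recursion below is one possible reading of (6.5) (assuming T_1 = 0, T_2 = ν_n, ρ_n = ν_n^{-1/2} and geometric increments afterwards), intended as an illustration rather than the exact definition.

```python
import numpy as np

def weakly_geometric_blocks(n, nu):
    """Blocks B_j = {T_j + 1, ..., T_{j+1}} with weakly geometrically increasing
    sizes, in the spirit of (6.5). Assumed reading: T_1 = 0, T_2 = nu, and for
    j >= 3, T_j = T_{j-1} + floor(nu * rho * (1 + rho)^(j-2)) with rho = nu**(-1/2);
    the last endpoint is reset to n."""
    rho = nu ** (-0.5)
    T = [0, min(nu, n)]
    j = 3
    while T[-1] < n:
        T.append(min(T[-1] + max(int(nu * rho * (1.0 + rho) ** (j - 2)), 1), n))
        j += 1
    blocks = [np.arange(T[i] + 1, T[i + 1] + 1) for i in range(len(T) - 1)]
    return T, blocks

T, blocks = weakly_geometric_blocks(n=1000, nu=10)
print(T[:8], [len(B) for B in blocks][:8])
```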
Proposition 6. Let the groups B_1, ..., B_J be defined as above with ν_n satisfying ν_n → ∞ and log ν_n / log n → 0 as n → ∞. Let λ = (α, w) ∈ Λ = (R*_+)² and consider the prior

π(dλ) = [ 2 n^{−α/(2α+2γ+1)} / (1 + n^{−α/(2α+2γ+1)} w)³ ] e^{−α} dα dw.   (6.6)

Then, in model (1.1) with diagonal covariance matrix Σ = diag(σ_k²; 1 ≤ k ≤ n) satisfying condition (6.4), the aggregate f̂_GEWA (under setting 1) based on the temperatures β_j = 8 max_{i∈B_j} σ_i² and the constituent estimators f̂_{α,w} = A_{α,w}Y (with A_{α,w} being the Pinsker filter) is adaptive in the exact minimax sense on the family of classes {F(α, R) : α > 0, R > 0}.

Note that this result provides an estimator attaining the optimal constant in the minimax sense when the unknown signal lies in an ellipsoid. This property holds because minimax estimators over the ellipsoids are linear. For other subsets of R^n, such as hyper-rectangles, Besov bodies and so on, this is not true anymore. However, as proved by Donoho, Liu and MacGibbon [27], for orthosymmetric quadratically convex sets the minimax linear estimators have a risk which is within 25% of the minimax risk among all estimators. Therefore, following the approach developed here, it is also possible to prove that GEWA can lead to an adaptive estimator whose risk is within 25% of the minimax risk, for a broad class of hyperrectangles.
7. Experiments.
In this section we present some numerical experiments on synthetic data, focusing only on the case of homoscedastic Gaussian noise (Σ = σ²I_{n×n}) with known variance. A toolbox is made freely available for download at http://josephsalmon.eu/code/index_codes.php. Additional details and numerical experiments can be found in [19, 55]. We evaluate different estimation routines on several 1D signals considered as a benchmark in the literature on signal processing [25]. The six signals we retained for our experiments, because of their diversity, are depicted in Figure 1. Since these signals are nonsmooth, we have also carried out experiments on their smoothed versions obtained by taking the antiderivative.
Fig. 1. Test signals used in our experiments: Piece-Regular, Ramp, Piece-Polynomial, HeaviSine, Doppler and Blocks. (a) Nonsmooth (Experiment I) and (b) smooth (Experiment II).
GGREGATION OF AFFINE ESTIMATORS Experiments on nonsmooth (resp., smooth) signals are referred to as Exper-iment I (resp., Experiment II). In both cases, prior to applying estimationroutines, we normalize the (true) sampled signal to have an empirical normequal to one and use the DCT denoted by θ ( Y ) = ( θ ( Y ) , . . . , θ n ( Y )) ⊤ .The four tested estimation routines—including EWA—are detailed below. Soft-Thresholding (ST) [25] : For a given shrinkage parameter t , the soft-thresholding estimator is b θ k = sgn( θ k ( Y ))( | θ k ( Y ) | − σt ) + . We use the data-driven threshold minimizing the Stein unbiased risk estimate [26]. Blockwise James–Stein (BJS) shrinkage [10] : The set { , . . . , n } is par-titioned into N = [ n/ log( n )] blocks B , B , . . . , B N of nearly equal size L .The corresponding blocks of true coefficients θ B k ( f ) = ( θ j ( f )) j ∈ B k are thenestimated by b θ B k = (1 − λLσ S k ( Y ) ) + θ B k ( Y ) , k = 1 , . . . , N , with blocks of noisycoefficients θ B k ( Y ), S k = k θ B k ( Y ) k and λ = 4 . Unbiased risk estimate (URE) minimization with Pinsker’s filters [13] :Pinsker filter with data-driven parameters α and w selected by minimizingan unbiased estimate of the risk over a suitably chosen grid for the valuesof α and w . Here, we use geometric grids ranging from 0 . α andfrom 1 to n for w . EWA on Pinsker’s filters : We consider the same finite family of linearfilters (defined by Pinsker’s filters) as in the URE routine described above.According to Proposition 1, this leads to an estimator nearly as accurate asthe best Pinsker’s estimator in the given family.To report the result of our experiments, we have also computed the bestlinear smoother, hereafter referred to as the oracle, based on a Pinsker filterchosen among the candidates that we used for defining URE and EWA. Bybest smoother we mean the one minimizing the squared error (it can becomputed since we know the ground truth). Results summarized in Table 1for Experiment I and Table 2 for Experiment II correspond to the averageover 1000 trials of the mean squared error (MSE) from which we subtractthe MSE of the oracle and multiply the resulting difference by the samplesize. We report the results for σ = 0 .
33 and for n ∈ { , , , } .Simulations show that EWA and URE have very comparable perfor-mances and are significantly more accurate than soft-thresholding and blockJames–Stein (cf. Table 1) for every size n of signals considered. Improve-ments are particularly important when signals have large peaks or discon-tinuities. In most cases, EWA also outperforms URE, but differences areless pronounced. One can also observe that for smooth signals, the differ-ence of MSEs between EWA and the oracle, multiplied by n , remains nearlyconstant when n varies. This is in agreement with our theoretical results inwhich the residual term decreases to zero inversely proportionally to n .Of course, soft-thresholding and blockwise James–Stein procedures havebeen designed for being applied to the wavelet transform of a Besov smooth A. S. DALALYAN AND J. SALMON
Table 1
Evaluation of 4 adaptive methods on 6 (nonsmooth) signals. For each sample size andeach method, we report the average value of n (MSE − MSE
Oracle ) and the correspondingstandard deviation (in parentheses), for 1000 replications of the experiment n EWA URE BJS ST EWA URE BJS ST
Blocks Doppler256 0 .
051 0 .
245 9 .
617 4 .
846 0 .
062 0 .
212 13 .
233 6 . .
42) (0 .
39) (1 .
78) (1 .
29) (0 .
35) (0 .
31) (2 .
11) (1 . − .
052 0 .
302 13 .
807 9 . − .
100 0 .
205 17 .
080 12 . .
35) (0 .
50) (2 .
16) (1 .
70) (0 .
30) (0 .
39) (2 .
29) (1 . − .
050 0 .
299 19 .
984 17 . − .
107 0 .
270 21 .
862 23 . .
36) (0 .
46) (2 .
68) (2 .
17) (0 .
35) (0 .
41) (2 .
92) (2 . − .
007 0 .
362 28 .
948 30 . − .
150 0 .
234 28 .
733 38 . .
42) (0 .
57) (3 .
31) (2 .
96) (0 .
34) (0 .
42) (3 .
19) (3 . − .
060 0 .
247 1 .
155 3 . − .
069 0 .
248 8 .
883 4 . .
19) (0 .
42) (0 .
57) (1 .
12) (0 .
32) (0 .
40) (1 .
76) (1 . − .
079 0 .
215 2 .
064 5 . − .
105 0 .
237 12 .
147 9 . .
19) (0 .
39) (0 .
86) (1 .
36) (0 .
30) (0 .
37) (2 .
28) (1 . − .
059 0 .
240 3 .
120 8 . − .
092 0 .
291 15 .
207 16 . .
23) (0 .
36) (1 .
20) (1 .
64) (0 .
34) (0 .
46) (2 .
18) (2 . − .
051 0 .
278 4 .
858 12 . − .
059 0 .
283 21 .
543 27 . .
25) (0 .
48) (1 .
42) (2 .
03) (0 .
34) (0 .
54) (2 .
47) (2 . .
038 0 .
294 6 .
933 5 .
644 0 .
017 0 .
203 12 .
201 3 . .
37) (0 .
47) (1 .
54) (1 .
20) (0 .
37) (0 .
37) (1 .
81) (1 . .
010 0 .
293 9 .
712 9 . − .
078 0 .
312 17 .
765 9 . .
36) (0 .
51) (1 .
76) (1 .
67) (0 .
35) (0 .
49) (2 .
72) (1 . − .
002 0 .
300 13 .
656 16 . − .
026 0 .
321 23 .
321 17 . .
30) (0 .
45) (2 .
25) (2 .
06) (0 .
38) (0 .
48) (2 .
96) (2 . .
007 0 .
312 19 .
113 27 . − .
007 0 .
314 31 .
550 29 . .
34) (0 .
50) (2 .
68) (2 .
61) (0 .
41) (0 .
49) (3 .
05) (2 . function, rather than to the Fourier transform of a Sobolev-smooth function.However, the point here is not to demonstrate the superiority of EWA ascompared to ST and BJS procedures. The point is to stress the importance ofhaving sharp adaptivity up to an optimal constant and not simply adaptivityin the sense of rate of convergence. Indeed, the procedures ST and BJS areprovably rate-adaptive when applied to the Fourier transform of a Sobolev-smooth function, but they are not sharp adaptive—they do not attain theoptimal constant—whereas EWA and URE do attain.
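To give a flavor of the EWA-on-Pinsker-filters routine, here is a self-contained sketch. It is our illustration rather than the toolbox implementation: the test signal, the grids for (α, w), the use of SciPy's orthonormal DCT and the temperature β = 8σ² are all assumptions made for the example.

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(5)
n, sigma = 512, 0.33
t = np.linspace(0.0, 1.0, n)
f = np.sin(2 * np.pi * t) + 0.3 * np.sin(12 * np.pi * t)     # smooth toy signal
f /= np.sqrt(np.mean(f ** 2))                                # empirical norm equal to one
Y = f + sigma * rng.normal(size=n)

theta = dct(Y, norm="ortho")                                 # DCT coefficients of the data
k = np.arange(1, n + 1)

# Finite grid of Pinsker filters (1 - k^alpha / w)_+ ; the grid is an arbitrary choice.
filters = [np.maximum(1.0 - k ** a / w, 0.0)
           for a in np.geomspace(0.5, 4.0, 10)
           for w in np.geomspace(1.0, float(n) ** 4, 30)]

def risk_unb(a):
    """Unbiased risk estimate (2.6) for a diagonal filter acting on DCT coefficients."""
    return np.mean(((1.0 - a) * theta) ** 2) + (2.0 * sigma ** 2 * a.sum() - n * sigma ** 2) / n

risks = np.array([risk_unb(a) for a in filters])
beta = 8.0 * sigma ** 2                                      # temperature (assumed value)
log_w = -n * risks / beta
w = np.exp(log_w - log_w.max())
w /= w.sum()

a_ewa = sum(wi * ai for wi, ai in zip(w, filters))           # aggregated filter
f_ewa = idct(a_ewa * theta, norm="ortho")                    # EWA estimate
f_ure = idct(filters[int(risks.argmin())] * theta, norm="ortho")   # URE selection
print("MSE EWA:", np.mean((f_ewa - f) ** 2), "MSE URE:", np.mean((f_ure - f) ** 2))
```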
8. Summary and future work.
In this paper we have addressed the problem of aggregating a set of affine estimators in the context of regression with fixed design and heteroscedastic noise.
Table 2. Evaluation of 4 adaptive methods on 6 smoothed signals. For each sample size and each method, we report the average value of n(MSE − MSE_Oracle) and the corresponding standard deviation (in parentheses), for 1000 replications of the experiment. (For each of the six test signals, the columns give the values for EWA, URE, BJS and ST at each sample size n; the individual numerical entries are not reproduced in this reprint.)
15) (1 . constituent estimators, we have proven that EWA with a suitably chosentemperature parameter satisfies PAC-Bayesian type inequality, from whichdifferent types of oracle inequalities have been deduced. All these inequal-ities are with leading constant one and rate-optimal residual term. As anapplication of our results, we have shown that EWA acting on Pinsker’sestimators produces an adaptive estimator in the exact minimax sense.Next in our agenda is carrying out an experimental evaluation of the pro-posed aggregate using the approximation schemes described by Dalalyan andTsybakov [23], Rigollet and Tsybakov [52, 54] and Alquier and Lounici [1],with a special focus on the problems involving large scale data.Although we do not assume the covariance matrix Σ of the noise to beknown, our approach relies on an unbiased estimator of Σ which is indepen- A. S. DALALYAN AND J. SALMON dent on the observed signal and on an upper bound on the largest singularvalue of Σ. In some applications, such information may be hard to obtainand it can be helpful to relax the assumptions on b Σ. This is another interest-ing avenue for future research for which, we believe, the approach developedby Giraud [34] can be of valuable guidance.APPENDIX: PROOFS OF MAIN THEOREMSWe develop now the detailed proofs of the results stated in the manuscript.
A.1. Stein’s lemma.
The proofs of our main results rely on Stein's lemma [59], recalled below, which provides an unbiased risk estimate for any estimator that depends sufficiently smoothly on the data vector $Y$.

Lemma 1. Let $Y$ be a random vector drawn from the Gaussian distribution $\mathcal{N}_n(f,\Sigma)$. If the estimator $\hat f$ is a.e. differentiable in $Y$ and the elements of the matrix $\nabla\cdot\hat f^\top := (\partial_i \hat f_j)$ have finite first moment, then
\[
\hat r = \|Y-\hat f\|_n^2 + \tfrac{2}{n}\,\mathrm{Tr}\bigl[\Sigma(\nabla\cdot\hat f^\top)\bigr] - \tfrac{1}{n}\,\mathrm{Tr}[\Sigma]
\]
is an unbiased estimate of $r$, that is, $E[\hat r]=r$.

The proof can be found in [61], page 157. We apply Stein's lemma to the affine estimators $\hat f_\lambda = A_\lambda Y + b_\lambda$, with $A_\lambda$ an $n\times n$ deterministic real matrix and $b_\lambda\in\mathbb{R}^n$ a deterministic vector. We get that if $\hat\Sigma$ is an unbiased estimator of $\Sigma$, then
\[
\hat r^{\mathrm{unb}}_\lambda = \|Y-\hat f_\lambda\|_n^2 + \tfrac{2}{n}\,\mathrm{Tr}[\hat\Sigma A_\lambda] - \tfrac{1}{n}\,\mathrm{Tr}[\hat\Sigma]
\]
is an unbiased estimator of the risk
\[
r_\lambda = E\bigl[\|\hat f_\lambda - f\|_n^2\bigr] = \|(A_\lambda - I_{n\times n})f + b_\lambda\|_n^2 + \tfrac{1}{n}\,\mathrm{Tr}[A_\lambda \Sigma A_\lambda^\top].
\]
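To illustrate the last formula, here is a minimal numerical sketch (not code from the paper; the function name sure_affine and the inputs A, b, Sigma_hat are hypothetical) of the unbiased risk estimate for an affine estimator, with the empirical norm $\|v\|_n^2 = \tfrac{1}{n}\sum_i v_i^2$.

    import numpy as np

    def sure_affine(y, A, b, Sigma_hat):
        # Unbiased risk estimate of the affine estimator f_hat = A @ y + b:
        #   r_unb = ||y - A y - b||_n^2 + (2/n) Tr[Sigma_hat A] - (1/n) Tr[Sigma_hat],
        # where ||v||_n^2 = sum(v**2) / n.
        n = y.shape[0]
        residual = y - (A @ y + b)
        return (residual @ residual) / n \
            + 2.0 / n * np.trace(Sigma_hat @ A) \
            - np.trace(Sigma_hat) / n

    # Small sanity check on a toy example (hypothetical choices).
    n = 5
    A = 0.5 * np.eye(n)           # a simple shrinkage matrix
    b = np.zeros(n)
    Sigma_hat = 0.04 * np.eye(n)  # unbiased estimate of the noise covariance
    y = np.ones(n)
    print(sure_affine(y, A, b, Sigma_hat))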
A.2. An auxiliary result.
Prior to proceeding with the proofs of the main theorems, we prove an important auxiliary result, which is the central ingredient of the proofs of our main results.
Lemma 2.
Let the assumptions of Lemma 1 be satisfied. Let $\{\hat f_\lambda : \lambda\in\Lambda\}$ be a family of estimators of $f$ and $\{\hat r_\lambda : \lambda\in\Lambda\}$ a family of risk estimates such that the mapping $Y\mapsto(\hat f_\lambda,\hat r_\lambda)$ is a.e. differentiable for every $\lambda\in\Lambda$. Let $\hat r^{\mathrm{unb}}_\lambda$ be the unbiased risk estimate of $\hat f_\lambda$ given by Stein's lemma.

(1) For every $\pi\in\mathcal{P}_\Lambda$ and for any $\beta>0$, the estimator $\hat f_{\mathrm{EWA}}$ defined as the average of $\hat f_\lambda$ w.r.t. the probability measure $\hat\pi(Y,d\lambda)=\theta(Y,\lambda)\,\pi(d\lambda)$ with $\theta(Y,\lambda)\propto\exp\{-n\hat r_\lambda(Y)/\beta\}$ admits
\[
\hat r_{\mathrm{EWA}} = \int_\Lambda \Bigl( \hat r^{\mathrm{unb}}_\lambda - \|\hat f_\lambda-\hat f_{\mathrm{EWA}}\|_n^2 - \frac{2n}{\beta}\,\bigl\langle \nabla_Y \hat r_\lambda \,\big|\, \Sigma(\hat f_\lambda-\hat f_{\mathrm{EWA}})\bigr\rangle_n \Bigr)\,\hat\pi(d\lambda)
\]
as an unbiased estimator of the risk.

(2) If, furthermore, $\hat r_\lambda \ge \hat r^{\mathrm{unb}}_\lambda$ for all $\lambda\in\Lambda$ and
\[
\int_\Lambda \bigl\langle n\nabla_Y \hat r_\lambda \,\big|\, \Sigma(\hat f_\lambda-\hat f_{\mathrm{EWA}})\bigr\rangle_n \,\hat\pi(d\lambda) \;\ge\; -a\int_\Lambda \|\hat f_\lambda-\hat f_{\mathrm{EWA}}\|_n^2\,\hat\pi(d\lambda)
\]
for some constant $a>0$, then for every $\beta\ge 2a$ it holds that
\[
E\bigl[\|\hat f_{\mathrm{EWA}}-f\|_n^2\bigr] \le \inf_{p\in\mathcal{P}_\Lambda}\Bigl\{ \int_\Lambda E[\hat r_\lambda]\,p(d\lambda) + \frac{\beta\,\mathcal{K}(p,\pi)}{n}\Bigr\}. \qquad (A.1)
\]
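Before turning to the proof, note that for a finite family $\Lambda=\{1,\dots,M\}$ the aggregate defined in the lemma can be computed directly from the risk estimates. The following minimal sketch (hypothetical names ewa_aggregate, f_hats, r_hats; uniform prior by default) is one possible implementation of the exponential weights $\theta(Y,\lambda)\propto\exp\{-n\hat r_\lambda(Y)/\beta\}$.

    import numpy as np

    def ewa_aggregate(f_hats, r_hats, n, beta, prior=None):
        # f_hats : (M, n) array, the constituent estimators evaluated on the data.
        # r_hats : (M,) array, the corresponding risk estimates.
        # prior  : (M,) array of prior weights (uniform if None).
        M = len(r_hats)
        prior = np.full(M, 1.0 / M) if prior is None else prior
        # Exponential weights, computed stably by subtracting the maximum exponent.
        log_w = -n * np.asarray(r_hats) / beta + np.log(prior)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        return w @ np.asarray(f_hats), w

    # Toy usage with three hypothetical constituent estimators.
    f_hats = np.array([[1.0, 1.0, 1.0, 1.0],
                       [0.5, 0.5, 0.5, 0.5],
                       [0.0, 0.0, 0.0, 0.0]])
    r_hats = np.array([0.10, 0.05, 0.30])
    f_ewa, weights = ewa_aggregate(f_hats, r_hats, n=4, beta=0.5)
    print(weights, f_ewa)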
Proof. According to Stein's lemma, the quantity
\[
\hat r_{\mathrm{EWA}} = \|Y-\hat f_{\mathrm{EWA}}\|_n^2 + \tfrac{2}{n}\,\mathrm{Tr}\bigl[\Sigma\bigl(\nabla\cdot\hat f_{\mathrm{EWA}}(Y)^\top\bigr)\bigr] - \tfrac{1}{n}\,\mathrm{Tr}[\Sigma] \qquad (A.2)
\]
is an unbiased estimate of the risk $r_n = E[\|\hat f_{\mathrm{EWA}}-f\|_n^2]$. Using simple algebra, one checks that
\[
\|Y-\hat f_{\mathrm{EWA}}\|_n^2 = \int_\Lambda \bigl(\|Y-\hat f_\lambda\|_n^2 - \|\hat f_\lambda-\hat f_{\mathrm{EWA}}\|_n^2\bigr)\,\hat\pi(d\lambda). \qquad (A.3)
\]
By interchanging the integral and differential operators, we get the relation $\partial_{y_i}\hat f_{\mathrm{EWA},j} = \int_\Lambda \{(\partial_{y_i}\hat f_{\lambda,j}(Y))\,\theta(Y,\lambda) + \hat f_{\lambda,j}(Y)\,(\partial_{y_i}\theta(Y,\lambda))\}\,\pi(d\lambda)$. Combining this equality with equations (A.2) and (A.3) implies that
\[
\hat r_{\mathrm{EWA}} = \int_\Lambda \bigl(\hat r^{\mathrm{unb}}_\lambda - \|\hat f_\lambda-\hat f_{\mathrm{EWA}}\|_n^2\bigr)\,\hat\pi(d\lambda) + \frac{2}{n}\int_\Lambda \mathrm{Tr}\bigl[\Sigma\,\hat f_\lambda\,\nabla_Y\theta(Y,\lambda)^\top\bigr]\,\pi(d\lambda).
\]
After having interchanged differentiation and integration, we obtain that $\int_\Lambda \hat f_{\mathrm{EWA}}\,(\nabla_Y\theta(Y,\lambda))^\top\,\pi(d\lambda) = \hat f_{\mathrm{EWA}}\,\nabla_Y\bigl(\int_\Lambda \theta(Y,\lambda)\,\pi(d\lambda)\bigr)^\top = 0$ and, therefore, we come up with the following expression for $\hat r_{\mathrm{EWA}}$:
\[
\hat r_{\mathrm{EWA}} = \int_\Lambda \Bigl(\hat r^{\mathrm{unb}}_\lambda - \|\hat f_\lambda-\hat f_{\mathrm{EWA}}\|_n^2 + 2\bigl\langle \nabla_Y\log\theta(Y,\lambda) \,\big|\, \Sigma(\hat f_\lambda-\hat f_{\mathrm{EWA}})\bigr\rangle_n\Bigr)\,\hat\pi(d\lambda)
= \int_\Lambda \Bigl(\hat r^{\mathrm{unb}}_\lambda - \|\hat f_\lambda-\hat f_{\mathrm{EWA}}\|_n^2 - \frac{2n}{\beta}\bigl\langle \nabla_Y\hat r_\lambda \,\big|\, \Sigma(\hat f_\lambda-\hat f_{\mathrm{EWA}})\bigr\rangle_n\Bigr)\,\hat\pi(d\lambda).
\]
This completes the proof of the first assertion of the lemma.

To prove the second assertion, let us observe that under the required condition and in view of the first assertion, for every $\beta\ge 2a$ it holds that $\hat r_{\mathrm{EWA}} \le \int_\Lambda \hat r^{\mathrm{unb}}_\lambda\,\hat\pi(d\lambda) \le \int_\Lambda \hat r_\lambda\,\hat\pi(d\lambda) \le \int_\Lambda \hat r_\lambda\,\hat\pi(d\lambda) + \frac{\beta}{n}\,\mathcal{K}(\hat\pi,\pi)$. To conclude, it suffices to remark that $\hat\pi$ is the probability measure minimizing the criterion $\int_\Lambda \hat r_\lambda\,p(d\lambda) + \frac{\beta}{n}\,\mathcal{K}(p,\pi)$ among all $p\in\mathcal{P}_\Lambda$. Thus, for every $p\in\mathcal{P}_\Lambda$, we have
\[
\hat r_{\mathrm{EWA}} \le \int_\Lambda \hat r_\lambda\,p(d\lambda) + \frac{\beta}{n}\,\mathcal{K}(p,\pi).
\]
Taking the expectation of both sides, the desired result follows. □
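The last step of the proof uses the well-known Gibbs variational characterization of exponential weights; for completeness, in the notation of Lemma 2 it reads as follows (no new notation is introduced):
\[
\min_{p\in\mathcal{P}_\Lambda}\Bigl\{ \int_\Lambda \hat r_\lambda\, p(d\lambda) + \frac{\beta}{n}\,\mathcal{K}(p,\pi) \Bigr\}
= -\frac{\beta}{n}\,\log \int_\Lambda e^{-n\hat r_\lambda/\beta}\,\pi(d\lambda),
\]
the minimizer being precisely $\hat\pi(d\lambda)\propto e^{-n\hat r_\lambda/\beta}\,\pi(d\lambda)$, provided the integral on the right-hand side is finite.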
A.3. Proof of Theorem 1.
Assertion (i). In what follows, we use the matrix shorthand $I=I_{n\times n}$ and $A_{\mathrm{EWA}} := \int_\Lambda A_\lambda\,\hat\pi(d\lambda)$. We apply Lemma 2 with $\hat r_\lambda = \hat r^{\mathrm{unb}}_\lambda$. To check the conditions of the second part of Lemma 2, note that in view of equations (2.4) and (2.6), as well as the assumptions $A_\lambda^\top = A_\lambda$ and $A_{\lambda'}b_\lambda = 0$, we get
\[
\nabla_Y \hat r^{\mathrm{unb}}_\lambda = \tfrac{2}{n}(I-A_\lambda)^\top(I-A_\lambda)Y - \tfrac{2}{n}(I-A_\lambda)^\top b_\lambda = \tfrac{2}{n}(I-A_\lambda)^2 Y - \tfrac{2}{n}\, b_\lambda.
\]
Recall now that for any pair of commuting matrices $P$ and $Q$ the identity $(I-P)^2 = (I-Q)^2 + 2\bigl(I-\tfrac{P+Q}{2}\bigr)(Q-P)$ holds true. Applying this identity to $P=A_\lambda$ and $Q=A_{\mathrm{EWA}}$ (in view of the commuting property of the $A_\lambda$'s), we get the following relation:
\[
\bigl\langle (I-A_\lambda)^2 Y \,\big|\, \Sigma(A_\lambda-A_{\mathrm{EWA}})Y\bigr\rangle_n
= \bigl\langle (I-A_{\mathrm{EWA}})^2 Y \,\big|\, \Sigma(A_\lambda-A_{\mathrm{EWA}})Y\bigr\rangle_n
- 2\Bigl\langle \Bigl(I-\tfrac{A_{\mathrm{EWA}}+A_\lambda}{2}\Bigr)(A_{\mathrm{EWA}}-A_\lambda)Y \,\Big|\, \Sigma(A_{\mathrm{EWA}}-A_\lambda)Y\Bigr\rangle_n.
\]
When one integrates over $\Lambda$ with respect to the measure $\hat\pi$, the term involving the first scalar product on the right-hand side of the last equation vanishes. On the other hand,
\[
\bigl\langle A_\lambda(A_{\mathrm{EWA}}-A_\lambda)Y \,\big|\, \Sigma(A_{\mathrm{EWA}}-A_\lambda)Y\bigr\rangle_n
= \bigl\langle A_\lambda(\hat f_{\mathrm{EWA}}-\hat f_\lambda) \,\big|\, \Sigma(\hat f_{\mathrm{EWA}}-\hat f_\lambda)\bigr\rangle_n
= \bigl\langle \hat f_{\mathrm{EWA}}-\hat f_\lambda \,\big|\, A_\lambda\Sigma(\hat f_{\mathrm{EWA}}-\hat f_\lambda)\bigr\rangle_n
= \tfrac{1}{2n}(\hat f_{\mathrm{EWA}}-\hat f_\lambda)^\top(A_\lambda\Sigma+\Sigma A_\lambda)(\hat f_{\mathrm{EWA}}-\hat f_\lambda) \ge 0.
\]
Since positive semidefiniteness of the matrices $\Sigma A_\lambda + A_\lambda\Sigma$ implies that of the matrix $\Sigma A_{\mathrm{EWA}} + A_{\mathrm{EWA}}\Sigma$, we also have $\langle A_{\mathrm{EWA}}(A_{\mathrm{EWA}}-A_\lambda)Y \,|\, \Sigma(A_{\mathrm{EWA}}-A_\lambda)Y\rangle_n \ge 0$. Therefore,
\[
\Bigl\langle \Bigl(I-\tfrac{A_{\mathrm{EWA}}+A_\lambda}{2}\Bigr)(A_{\mathrm{EWA}}-A_\lambda)Y \,\Big|\, \Sigma(A_{\mathrm{EWA}}-A_\lambda)Y\Bigr\rangle_n
\le \bigl\langle \hat f_{\mathrm{EWA}}-\hat f_\lambda \,\big|\, \Sigma(\hat f_{\mathrm{EWA}}-\hat f_\lambda)\bigr\rangle_n
= \|\Sigma^{1/2}(\hat f_{\mathrm{EWA}}-\hat f_\lambda)\|_n^2.
\]
This inequality implies that
\[
\int_\Lambda \bigl\langle n\nabla_Y \hat r^{\mathrm{unb}}_\lambda \,\big|\, \Sigma(\hat f_\lambda-\hat f_{\mathrm{EWA}})\bigr\rangle_n\,\hat\pi(d\lambda)
\ge -4\int_\Lambda \|\Sigma^{1/2}(\hat f_\lambda-\hat f_{\mathrm{EWA}})\|_n^2\,\hat\pi(d\lambda).
\]
Therefore, the claim of Theorem 1 holds true for every $\beta\ge 8\,|||\Sigma|||$.

Assertion (ii). Let now $\hat f_\lambda = A_\lambda Y + b_\lambda$ with symmetric $A_\lambda \preceq I_{n\times n}$ and $b_\lambda\in\mathrm{Ker}(A_\lambda)$. Using the definition $\hat r^{\mathrm{adj}}_\lambda = \hat r^{\mathrm{unb}}_\lambda + \tfrac{2}{n}\,Y^\top(A_\lambda-A_\lambda^2)Y$, one easily checks that $\hat r^{\mathrm{adj}}_\lambda \ge \hat r^{\mathrm{unb}}_\lambda$ for every $\lambda$ and that
\[
\int_\Lambda \bigl\langle n\nabla_Y \hat r^{\mathrm{adj}}_\lambda \,\big|\, \Sigma(\hat f_\lambda-\hat f_{\mathrm{EWA}})\bigr\rangle_n\,\hat\pi(d\lambda)
= \int_\Lambda \bigl\langle 2(Y-\hat f_\lambda) \,\big|\, \Sigma(\hat f_\lambda-\hat f_{\mathrm{EWA}})\bigr\rangle_n\,\hat\pi(d\lambda)
= -2\int_\Lambda \|\Sigma^{1/2}(\hat f_\lambda-\hat f_{\mathrm{EWA}})\|_n^2\,\hat\pi(d\lambda).
\]
Therefore, if $\beta\ge 4\,|||\Sigma|||$, all the conditions required in the second part of Lemma 2 are fulfilled. Applying this lemma, we get the desired result.
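The elementary identity for commuting matrices used above can be checked by direct expansion; no new notation is needed:
\[
(I-Q)^2 + 2\Bigl(I-\frac{P+Q}{2}\Bigr)(Q-P)
= I - 2Q + Q^2 + 2(Q-P) - (P+Q)(Q-P)
= I - 2P + P^2
= (I-P)^2,
\]
where the commutativity $PQ=QP$ is used to write $(P+Q)(Q-P)=Q^2-P^2$.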
A.4. Proof of Theorem 2.
We apply the result of assertion (ii) of Theorem 1 with the prior $\pi(d\lambda)$ replaced by the probability measure proportional to $e^{(2/\beta)\,\mathrm{Tr}[\hat\Sigma(A_\lambda-A_\lambda^\top A_\lambda)]}\,\pi(d\lambda)$. This leads to
\[
E\bigl[\|\tilde f_{\mathrm{SEWA}}-f\|_n^2\bigr] \le \inf_{p\in\mathcal{P}_\Lambda}\Bigl\{\int_\Lambda E\bigl[\|\hat f_\lambda-f\|_n^2\bigr]\,p(d\lambda) + \frac{\beta}{n}\,\mathcal{K}(p,\pi)\Bigr\} + \frac{\beta}{n}\,E\Bigl[\log\int_\Lambda e^{(2/\beta)\,\mathrm{Tr}[\hat\Sigma(A_\lambda-A_\lambda^\top A_\lambda)]}\,\pi(d\lambda)\Bigr].
\]
Condition (C) entails that the last term is nonpositive, and the result follows.
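The extra logarithmic term in the previous display comes from the standard decomposition of the Kullback–Leibler divergence under a reweighting of the prior: for $\pi'(d\lambda)\propto e^{g(\lambda)}\,\pi(d\lambda)$ with $g(\lambda)=(2/\beta)\,\mathrm{Tr}[\hat\Sigma(A_\lambda-A_\lambda^\top A_\lambda)]$,
\[
\mathcal{K}(p,\pi') = \mathcal{K}(p,\pi) - \int_\Lambda g(\lambda)\,p(d\lambda) + \log\int_\Lambda e^{g(\lambda)}\,\pi(d\lambda)
\qquad\text{for every } p\in\mathcal{P}_\Lambda .
\]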
A.5. Proof of Theorem 3.
Let us place ourselves in setting 1. It is clear that $E[\|\hat f_{\mathrm{GEWA}}-f\|_n^2]=\sum_{j=1}^J E[\|\hat f^{\,j}_{\mathrm{GEWA}}-f^{\,j}\|_n^2]$. For each $j\in\{1,\dots,J\}$, since $\beta_j\ge 8\,|||\Sigma_j|||$, one can apply assertion (i) of Theorem 1, which leads to the desired result. The case of setting 2 is handled in the same manner.

Acknowledgment.
The authors thank Pierre Alquier for fruitful discussions.

SUPPLEMENTARY MATERIAL
Proofs of some propositions (DOI: 10.1214/12-AOS1038SUPP; .pdf). In this supplement we present the detailed proofs of Propositions 2–6.

REFERENCES
[1] Alquier, P. and Lounici, K. (2011). PAC-Bayesian bounds for sparse regression estimation with exponential weights. Electron. J. Stat.
[2] Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Comput.
[3] Arlot, S. and Bach, F. (2009). Data-driven calibration of linear estimators with minimal penalties. In NIPS.
[4] Audibert, J.-Y. (2007). Progressive mixture rules are deviation suboptimal. In NIPS.
[5] Baraud, Y., Giraud, C. and Huet, S. (2010). Estimator selection in the Gaussian setting. Unpublished manuscript.
[6] Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields.
[7] Breiman, L. (1996). Bagging predictors. Mach. Learn.
[8] Buades, A., Coll, B. and Morel, J. M. (2005). A review of image denoising algorithms, with a new one. Multiscale Model. Simul.
[9] Bunea, F., Tsybakov, A. B. and Wegkamp, M. H. (2007). Aggregation for Gaussian regression. Ann. Statist.
[10] Cai, T. T. (1999). Adaptive wavelet estimation: A block thresholding and oracle inequality approach. Ann. Statist.
[11] Catoni, O. (2004). Statistical Learning Theory and Stochastic Optimization. Lecture Notes in Math. Springer, Berlin. MR2163920
[12] Cavalier, L. (2008). Nonparametric statistical inverse problems. Inverse Problems. MR2421941
[13] Cavalier, L., Golubev, G. K., Picard, D. and Tsybakov, A. B. (2002). Oracle inequalities for inverse problems. Ann. Statist.
[14] Cavalier, L. and Tsybakov, A. (2002). Sharp adaptation for inverse problems with random noise. Probab. Theory Related Fields.
[15] Cavalier, L. and Tsybakov, A. B. (2001). Penalized blockwise Stein's method, monotone oracles and sharp adaptive estimation. Math. Methods Statist.
[16] Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge Univ. Press, Cambridge. MR2409394
[17] Dai, D., Rigollet, P. and Zhang, T. (2012). Deviation optimal learning using greedy Q-aggregation. Ann. Statist. To appear. Available at arXiv:1203.2507.
[18] Dai, D. and Zhang, T. (2011). Greedy model averaging. In NIPS.
[19] Dalalyan, A. S. and Salmon, J. (2011). Competing against the best nearest neighbor filter in regression. In ALT. Lecture Notes in Computer Science.
[20] Dalalyan, A. S. and Salmon, J. (2012). Supplement to "Sharp oracle inequalities for aggregation of affine estimators." DOI:10.1214/12-AOS1038SUPP.
[21] Dalalyan, A. S. and Tsybakov, A. B. (2007). Aggregation by exponential weighting and sharp oracle inequalities. In Learning Theory. Lecture Notes in Computer Science.
[22] Dalalyan, A. S. and Tsybakov, A. B. (2008). Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Mach. Learn.
[23] Dalalyan, A. S. and Tsybakov, A. B. (2012). Sparse regression learning by aggregation and Langevin Monte-Carlo. J. Comput. System Sci.
[24] Dalalyan, A. S. and Tsybakov, A. B. (2012). Mirror averaging with sparsity priors. Bernoulli.
[25] Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika.
[26] Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc.
[27] Donoho, D. L., Liu, R. C. and MacGibbon, B. (1990). Minimax risk over hyperrectangles, and implications. Ann. Statist.
[28] Efromovich, S. and Pinsker, M. (1996). Sharp-optimal and adaptive estimation for heteroscedastic nonparametric regression. Statist. Sinica.
[29] Efroĭmovich, S. Y. and Pinsker, M. S. (1984). A self-training algorithm for nonparametric filtering. Avtomat. i Telemekh.
[30] Freund, Y. (1990). Boosting a weak learning algorithm by majority. In COLT.
[31] Gaïffas, S. and Lecué, G. (2011). Hyper-sparse optimal aggregation. J. Mach. Learn. Res.
[32] George, E. I. (1986). Minimax multiple shrinkage estimation. Ann. Statist.
[33] Gerchinovitz, S. (2011). Sparsity regret bounds for individual sequences in online linear regression. J. Mach. Learn. Res.
[34] Giraud, C. (2008). Mixing least-squares estimators when the variance is unknown. Bernoulli.
[35] Goldenshluger, A. and Lepski, O. (2008). Universal pointwise selection rule in multivariate function estimation. Bernoulli.
[36] Golubev, Y. (2010). On universal oracle inequalities related to high-dimensional linear models. Ann. Statist.
[37] Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist.
[38] Juditsky, A. and Nemirovski, A. (2009). Nonparametric denoising of signals with unknown local structure. I. Oracle inequalities. Appl. Comput. Harmon. Anal.
[39] Kivinen, J. and Warmuth, M. K. (1999). Averaging expert predictions. In Computational Learning Theory (Nordkirchen, 1999). Lecture Notes in Computer Science.
[40] Kneip, A. (1994). Ordered linear smoothers. Ann. Statist.
[41] Lanckriet, G. R. G., Cristianini, N., Bartlett, P., El Ghaoui, L. and Jordan, M. I. (2003/04). Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res.
[42] Langford, J. and Shawe-Taylor, J. (2002). PAC-Bayes & margins. In NIPS.
[43] Lecué, G. and Mendelson, S. (2012). On the optimality of the aggregate with exponential weights for low temperatures. Bernoulli. To appear.
[44] Leung, G. (2004). Information theory and mixing least squares regression. Ph.D. thesis, Yale Univ.
[45] Leung, G. and Barron, A. R. (2006). Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory.
[46] Lounici, K. (2007). Generalized mirror averaging and D-convex aggregation. Math. Methods Statist.
[47] McAllester, D. A. (1998). Some PAC-Bayesian theorems. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (Madison, WI, 1998).
[48] Nemirovski, A. (2000). Topics in non-parametric statistics. In Lectures on Probability Theory and Statistics (Saint-Flour, 1998). Lecture Notes in Math.
[49] Pinsker, M. S. (1980). Optimal filtration of square-integrable signals in Gaussian noise. Probl. Peredachi Inf.
[50] Polzehl, J. and Spokoiny, V. G. (2000). Adaptive weights smoothing with applications to image restoration. J. R. Stat. Soc. Ser. B Stat. Methodol.
[51] Rigollet, P. (2012). Kullback–Leibler aggregation and misspecified generalized linear models. Ann. Statist.
[52] Rigollet, P. and Tsybakov, A. (2011). Exponential screening and optimal rates of sparse estimation. Ann. Statist.
[53] Rigollet, P. and Tsybakov, A. B. (2007). Linear and convex aggregation of density estimators. Math. Methods Statist.
[54] Rigollet, P. and Tsybakov, A. B. (2011). Sparse estimation by exponential weighting. Unpublished manuscript.
[55] Salmon, J. and Dalalyan, A. S. (2011). Optimal aggregation of affine estimators. J. Mach. Learn. Res.
[56] Salmon, J. and Le Pennec, E. (2009). NL-Means and aggregation procedures. In ICIP.
[57] Seeger, M. (2003). PAC-Bayesian generalisation error bounds for Gaussian process classification. J. Mach. Learn. Res.
[58] Shawe-Taylor, J. and Cristianini, N. (2000). An Introduction to Support Vector Machines: And Other Kernel-Based Learning Methods. Cambridge Univ. Press, Cambridge.
[59] Stein, C. M. (1973). Estimation of the mean of a multivariate distribution. In Proc. Prague Symp. Asymptotic Statist. Charles Univ., Prague.
[60] Tsybakov, A. B. (2003). Optimal rates of aggregation. In COLT.
[61] Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer, New York. MR2724359
[62] Wang, Z., Paterlini, S., Gao, F. and Yang, Y. (2012). Adaptive minimax estimation over sparse l_q-hulls. Technical report. Available at arXiv:1108.1961v4 [math.ST].
[63] Yang, Y. (2000). Combining different procedures for adaptive regression. J. Multivariate Anal.
[64] Yang, Y. (2003). Regression with multiple candidate models: Selecting or mixing? Statist. Sinica.
[65] Yang, Y. (2004). Aggregating regression procedures to improve performance. Bernoulli.
[66] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol.

ENSAE-Crest
3 Avenue Pierre Larousse
92245 Malakoff Cedex
France
E-mail: [email protected]
URL: [email protected]