Data-driven aggregation in circular deconvolution

Jan Johannes (Ruprecht-Karls-Universität Heidelberg) and Xavier Loizeau (National Physical Laboratory (NPL))
Preliminary version: February 2, 2021
Abstract
In a circular deconvolution model we consider the fully data-driven density estimation of a circular random variable where the density of the additive independent measurement error is unknown. We have at hand two independent iid samples, one of the contaminated version of the variable of interest, and the other of the additive noise. We show optimality, in an oracle and minimax sense, of a fully data-driven weighted sum of orthogonal series density estimators. Two shapes of random weights are considered, one motivated by a Bayesian approach and the other by a well-known model selection method. We derive non-asymptotic upper bounds for the quadratic risk and the maximal quadratic risk over Sobolev-like ellipsoids of the fully data-driven estimator. We compute rates which can be obtained in different configurations for the smoothness of the density of interest and the error density. The rates (strictly) match the optimal oracle or minimax rates for a large variety of cases, and feature otherwise at most a deterioration by a logarithmic factor. We illustrate the performance of the fully data-driven weighted sum of orthogonal series estimators by a simulation study.
Keywords:
Circular deconvolution, Orthogonal series estimation, Spectral cut-off, Model selection, Aggregation, Oracle inequality, Adaptation
AMS 2000 subject classifications:
Primary 62G07; secondary 62G20, 42A85.

Institut für Angewandte Mathematik, MΛTHEMΛTIKON, Im Neuenheimer Feld 205, D-69120 Heidelberg, Germany, e-mail: [email protected]

National Centre of Excellence in Mass Spectrometry Imaging (NiCE-MSI), National Physical Laboratory (NPL), Hampton Road, Teddington TW11 0LW, United Kingdom, e-mail: [email protected]

Introduction
In a circular convolution model one objective is to estimate non-parametrically the density of a random variable taking values on the unit circle from observations blurred by an additive noise. Here we show optimality, in an oracle and minimax sense, of a fully data-driven weighted sum of orthogonal series estimators (OSE's). Two shapes of random weights are considered, one motivated by a Bayesian approach and the other by a well-known model selection method. Circular data are met in a variety of applications, such as data representing a direction on a compass in handwriting recognition (Bahlmann [2006]) and in meteorology (Carnicero et al. [2013]), or anything from opinions on a political compass to time reading on a clock face (Gill and Hangartner [2010]) in political sciences. Non-parametric density estimation in a circular deconvolution model has been considered, for example, in Comte and Taupin [2003], Efromovich [1997], Johannes and Schwarz [2013], while Schluttenhofer and Johannes [2020a,b], for example, study minimax testing. For an overview of convolution phenomena met in other models the reader may refer to Meister [2009].

Throughout this work we tacitly identify the circle with the unit interval [0,1), for notational convenience. Let Y := X + ε − ⌊X + ε⌋ = X + ε mod 1 be the observable contaminated random variable and g its density. If we denote by f and ϕ the respective circular densities of the random variable of interest X and of the additive and independent noise ε, then we have

  g(y) = (f ⊛ ϕ)(y) := ∫_{[0,1)} f((y − s) − ⌊y − s⌋) ϕ(s) ds,  y ∈ [0,1),

where ⊛ stands for the circular convolution. Therefore, the estimation of f is called a circular deconvolution problem. We highlight hereafter that, thanks to the convolution theorem, an estimator of the circular density f is usually based on the Fourier transforms of ϕ and of g, where the latter may be estimated from the data.
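As a minimal illustration (not part of the paper), the observation scheme Y = X + ε mod 1 can be simulated directly; the particular choices of the densities of X and ε below are hypothetical and serve only to show the addition modulo one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 1000

# Hypothetical density of interest: a normal wrapped onto [0, 1) via mod 1.
X = rng.normal(loc=0.5, scale=0.1, size=n) % 1.0
# Hypothetical error density: uniform noise on a small arc, wrapped likewise.
eps = rng.uniform(-0.05, 0.05, size=n) % 1.0
# Observable contaminated sample: addition modulo 1 (circular convolution).
Y = (X + eps) % 1.0
# Second, independent sample from the error distribution, as in (1.2) below.
eps_sample = rng.uniform(-0.05, 0.05, size=m) % 1.0
```

The density of Y is then the circular convolution f ⊛ ϕ of the two chosen densities.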
For any complex number z, denote by z̄ its complex conjugate and by |z| its modulus. Let L² := L²([0,1)) be the Hilbert space of square integrable complex-valued functions defined on [0,1), endowed with the usual inner product ⟨h₁, h₂⟩_{L²} = ∫_{[0,1)} h₁(x) h̄₂(x) dx and the associated norm ‖·‖_{L²}. Each h ∈ L² admits a representation as a discrete Fourier series h = Σ_{j∈ℤ} [h]_j e_j with respect to the exponential basis {e_j}_{j∈ℤ}, where [h]_j := ⟨h, e_j⟩_{L²} is the j-th Fourier coefficient of h, e_j(x) := exp(−2ιπjx) for x ∈ [0,1), and ι is a square root of −1. In this work we suppose that f, ϕ, and hence g, belong to the subset 𝒟 of real-valued Lebesgue densities in L². We denote the expectations associated with g and ϕ by E_g and E_ϕ, respectively. We note that [g]₀ = 1 and that E_g[e_j(−Y)] = [g]_j equals the complex conjugate of [g]_{−j} for any j ∈ ℤ, as is the case for any density. The key to our analysis is the convolution theorem, which states that, in a circular model, g = ϕ ⊛ f holds if and only if [g]_j = [ϕ]_j · [f]_j for all j ∈ ℤ. Therefore, and as long as [ϕ]_j ≠ 0 for all j ∈ ℤ, which is assumed from now on, we have

  f = e₀ + Σ_{|j|∈ℕ} [ϕ]_j^{−1} [g]_j e_j  with  [g]_j = E_g[e_j(−Y)] and [ϕ]_j = E_ϕ[e_j(−ε)].  (1.1)

Note that an analogous representation holds in the case of deconvolution on the real line with compactly supported X-density, i.e. when the error term ε, and hence Y, take their values in ℝ. In this situation, the deconvolution density still admits a discrete representation as in (1.1), but involving the characteristic functions of ϕ and g rather than their discrete Fourier coefficients.
For a more detailed study of the Fourier analysis of probability distributions, the reader is referred, for example, to Brémaud [2014], Chapter 2.

In this paper we know neither the density g = f ⊛ ϕ of the contaminated observations nor the error density ϕ, but we have at our disposal two independent samples of independent and identically distributed (iid.) random variables of sizes n ∈ ℕ and m ∈ ℕ, respectively:

  Y_i ∼ g, i ∈ ⟦n⟧ := ⟦1, n⟧ := [1, n] ∩ ℤ,  and  ε_i ∼ ϕ, i ∈ ⟦m⟧.  (1.2)

In this situation, for each dimension parameter k ∈ ℕ an OSE of f is given by

  f̂_k := e₀ + Σ_{|j|∈⟦k⟧} [ϕ̂]⁺_j [ĝ]_j e_j,  with  [ĝ]_j := n^{−1} Σ_{i∈⟦n⟧} e_j(−Y_i),
  [ϕ̂]⁺_j := [ϕ̂]_j^{−1} 𝟙{|[ϕ̂]_j|² ≥ 1/m}  and  [ϕ̂]_j := m^{−1} Σ_{i∈⟦m⟧} e_j(−ε_i).  (1.3)

The threshold using the indicator function 𝟙{|[ϕ̂]_j|² ≥ 1/m} accounts for the uncertainty caused by estimating [ϕ]_j by [ϕ̂]_j. It corresponds to [ϕ̂]_j's noise level as an estimator of [ϕ]_j, which is a natural choice (cf. Neumann [1997], p. 310f.). Thanks to the properties of the sequences ([ĝ]_j)_{j∈ℤ} and ([ϕ̂]⁺_j)_{j∈ℤ}, for any k in ℕ the estimator f̂_k is a real-valued function integrating to one. It is not necessarily positive-valued; however, one might project the estimator on 𝒟, leading to an even smaller quadratic error. Nevertheless, f̂_k depends on a dimension parameter k whose choice essentially determines the estimation accuracy.

In Johannes and Schwarz [2013], a minimax criterion is used to formulate optimality. It is shown that, by choosing the dimension parameter properly, the maximal risk of an OSE as in (1.3) reaches the lower bound over Sobolev-like ellipsoids.
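The estimator (1.3) can be sketched in a few lines of code. This is a minimal, illustrative implementation (the helper names `fourier_coeffs` and `ose_coeffs` are our own, not the paper's); it returns the estimated Fourier coefficients of f for |j| ≤ k rather than the function f̂_k itself.

```python
import numpy as np

def fourier_coeffs(sample, k):
    """Empirical coefficients estimating E[e_j(-Z)] = mean of exp(2*pi*i*j*Z)
    for j = -k..k, under the convention e_j(x) = exp(-2*pi*i*j*x)."""
    js = np.arange(-k, k + 1)
    return np.exp(1j * 2 * np.pi * np.outer(js, sample)).mean(axis=1)

def ose_coeffs(Y, eps, k):
    """Thresholded OSE of the Fourier coefficients [f]_j, j = -k..k, as in
    (1.3): [f]_j is estimated by [g]_j / [phi]_j, with the inverse set to
    zero whenever |[phi]_j|^2 < 1/m (Neumann's truncation)."""
    m = len(eps)
    g_hat = fourier_coeffs(Y, k)
    phi_hat = fourier_coeffs(eps, k)
    phi_plus = np.where(np.abs(phi_hat) ** 2 >= 1.0 / m, 1.0 / phi_hat, 0.0)
    f_hat = phi_plus * g_hat
    f_hat[k] = 1.0  # entry j = 0: every density has [f]_0 = 1
    return f_hat
```

Since the data are real, the returned coefficients are Hermitian symmetric, so the corresponding series estimator is real-valued and integrates to one, as stated above.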
However, the optimal choice of the dimension depends on the unknown ellipsoids. A fully data-driven selection based on a penalised contrast method is proposed and shown to yield minimax optimal rates for a large family of such ellipsoids. This selection procedure is inspired by the work of Barron et al. [1999], which was applied in the case of a known error density by Comte and Taupin [2003]. For an extensive overview of model selection by penalised contrast, the reader may refer to Massart [2007]. More precisely, Johannes and Schwarz [2013] introduce an upper bound M̂ for the dimension parameter, and penalties (pen^Φ̂_k)_{k∈⟦M̂⟧} depending on the samples (Y_i)_{i∈⟦n⟧} and (ε_i)_{i∈⟦m⟧}, but neither on f nor on ϕ. Then, the fully data-driven estimator is defined as

  f̂_k̃ := e₀ + Σ_{|j|∈⟦k̃⟧} [ϕ̂]⁺_j [ĝ]_j e_j  with  k̃ := argmin_{k∈⟦M̂⟧} {−‖f̂_k‖²_{L²} + pen^Φ̂_k}.  (1.4)

The empirical upper bound M̂ proposed in Johannes and Schwarz [2013] is technically rather involved and, more importantly, simulations suggest that it leads to values which are often much too restrictive.

Here, rather than a data-driven selection of a dimension parameter, we propose to sum the OSE's with positive data-driven weights adding up to one. Namely, given for each k ∈ ⟦n⟧ the OSE as in (1.3) and a random weight w_k ∈ [0,1], we consider the convex sum

  f̂_w = Σ_{k∈⟦n⟧} w_k f̂_k,  with  Σ_{k∈⟦n⟧} w_k = 1.  (1.5)

Introducing the model selection weights

  ẘ_k := 𝟙{k = k̂}, k ∈ ⟦n⟧,  with  k̂ := argmin_{k∈⟦n⟧} {−‖f̂_k‖²_{L²} + pen^Φ̂_k}  (1.6)

allows us to consider the model selected estimator f̂_k̂ = f̂_ẘ = Σ_{k∈⟦n⟧} ẘ_k f̂_k as a data-driven weighted sum, avoiding a restrictive empirical upper bound M̂ as in (1.4).

We study a second shape of random weights, motivated by a Bayesian approach in the context of an inverse Gaussian sequence space model and its iterative extension, respectively described in Johannes et al. [2020] and Loizeau [2020]. For some constant η ∈ ℕ we define Bayesian weights

  ŵ_k := exp(−ηn{−‖f̂_k‖²_{L²} + pen^Φ̂_k}) / Σ_{l=1}^{n} exp(−ηn{−‖f̂_l‖²_{L²} + pen^Φ̂_l}),  k ∈ ⟦n⟧.  (1.7)

Note that in (1.6) and (1.7) the quantity ‖f̂_k‖²_{L²} = Σ_{j=−k}^{k} |[ϕ̂]⁺_j|² |[ĝ]_j|² can be calculated from the data without any prior knowledge about the error density ϕ. Thereby, as the sequence of penalties (pen^Φ̂_k)_{k∈⟦n⟧} given below (3.6) does not involve any prior knowledge of f or of ϕ, the weights in (1.6) and (1.7) are fully data-driven.

Let us emphasise the role of the parameter η used in (1.7). If k̂ as in (1.6) uniquely minimises the penalised contrast function, then it is easily seen that, for each k ∈ ⟦n⟧, the Bayesian weight ŵ_k converges to the model selection weight ẘ_k as η → ∞. We shall see that the fully data-driven weighted sum f̂_w with Bayesian weights w = ŵ or model selection weights w = ẘ yields minimax optimal convergence rates over Sobolev-like ellipsoids. Thus, the theory presented here does not give a way to choose the parameter η.
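Both weight shapes can be sketched numerically. The following illustrative function (our own, hypothetical naming) takes the vector of squared norms ‖f̂_k‖² and penalties for k = 1, …, K and returns the model selection weights (1.6) and the Bayesian weights (1.7); the softmax is shifted for numerical stability, which leaves the weights unchanged.

```python
import numpy as np

def aggregation_weights(norms_sq, pen, n, eta=1.0):
    """Model-selection weights (1.6) and Bayesian weights (1.7) built from
    the penalised contrasts -||f_k||^2 + pen_k; `eta` is the temperature,
    and eta -> infinity recovers the model-selection weights."""
    contrast = -np.asarray(norms_sq, dtype=float) + np.asarray(pen, dtype=float)
    # Model selection: all mass on the minimiser of the penalised contrast.
    w_ms = np.zeros_like(contrast)
    w_ms[np.argmin(contrast)] = 1.0
    # Bayesian weights: softmax of -eta * n * contrast (shift-invariant).
    z = -eta * n * contrast
    z -= z.max()
    w_bayes = np.exp(z) / np.exp(z).sum()
    return w_ms, w_bayes
```

For moderate η the Bayesian weights spread mass over several dimensions with nearly minimal contrast, while for large η they concentrate on the minimiser, in line with the limit discussed above.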
However, simulations suggest that the Bayesian weights lead to more stable results, as is often recorded in the field of estimator aggregation.

The shape of the weighted sum f̂_w is similar to the form studied in the estimator aggregation literature. Aggregation in the context of regression problems is considered, for instance, in Bellec and Tsybakov [2015], Dalalyan and Tsybakov [2008, 2012], Rigollet et al. [2012], Tsybakov [2014], while Rigollet and Tsybakov [2007] study density estimation. Traditionally, the aggregation of a family of arbitrary estimators is performed through an optimisation program for the random weights, and the goal is to compare the convergence rate of the aggregation to that of the best estimator in the family. Here, while we restrict ourselves to OSE's, their number is as large as the sample size. The random weights are given explicitly without an optimisation program and do not rely on sample splitting. In addition, we allow for a degenerate case where one OSE receives all the weight of the sum.

This paper is organised as follows. In section 2, assuming that the error density ϕ is known, we introduce a family of OSE's. We briefly recall the oracle and minimax theory before introducing model selection and Bayesian weights respectively similar to (1.6) and (1.7), which still depend on characteristics of the error density. The weighted sum of the OSE's is thus only partially data-driven. We derive non-asymptotic upper bounds for the quadratic risk and the maximal quadratic risk over Sobolev-like ellipsoids of the partially data-driven estimator. In section 3, dismissing the knowledge of the density ϕ, an additional sample of the noise is observed. Choosing the weights in (1.6) and (1.7) fully data-driven, we derive non-asymptotic upper risk bounds for the now fully data-driven weighted sums of OSE's.
In sections 2 and 3 we compute rates which can be obtained in different configurations for the smoothness of the density of interest f and the error density ϕ. The rates (strictly) match the optimal oracle or minimax rates for a large variety of cases, and feature otherwise at most a deterioration by a logarithmic factor. We illustrate in section 4 the reasonable performance of the fully data-driven weighted sum of OSE's by a simulation study. All technical proofs are deferred to the Appendix.

Notations.
Throughout this section the error density ϕ ∈ 𝒟 is known. Therefore, given an iid. n-sample (Y_i)_{i∈⟦n⟧} from g = f ⊛ ϕ, we denote by E^n_{f,ϕ} the expectation with respect to their joint distribution P^n_{f,ϕ}. The estimation of the unknown circular density f is based on a dimension reduction which we briefly elaborate first. Given the exponential basis {e_j, j ∈ ℤ} and a dimension parameter k ∈ ℕ₀ := ℕ ∪ {0}, we have the subspace 𝒰_k spanned by the 2k + 1 basis functions {e_j, j ∈ ⟦−k, k⟧} at our disposal. For abbreviation, we denote by Π_k and Π⊥_k the orthogonal projections on 𝒰_k and on its orthogonal complement 𝒰⊥_k in L², respectively. For each h ∈ L² we consider its orthogonal projection h_k := Π_k h and its associated approximation error ‖h_k − h‖_{L²} = ‖Π⊥_k h‖_{L²}. Note that for any density p ∈ 𝒟 ∩ L² holds Π⊥₀ p = p − e₀, and we define b(p) := (b_k(p))_{k∈ℕ₀} ∈ ℝ^{ℕ₀} with

  b²_k(p) := ‖Π⊥_k p‖²_{L²} / ‖Π⊥₀ p‖²_{L²}  (with the convention 0/0 := 0),  (2.1)

where lim_{k→∞} b_k(p) = 0 due to the dominated convergence theorem.

Risk bound.
Keeping in mind that the error density satisfies |[ϕ]_k| > 0 for all k ∈ ℤ, we define Φ = (Φ_k)_{k∈ℕ₀} ∈ ℝ^{ℕ₀}, and, for any sequence x ∈ ℝ^{ℕ₀}, let us introduce Σ^x = (Σ^x_k)_{k∈ℕ₀} ∈ ℝ^{ℕ₀} with

  Σ^x_0 := 0,  Σ^x_k := k^{−1} Σ_{j∈⟦k⟧} x_j;  and  Φ_k := |[ϕ]_k|^{−2} = |[ϕ]_{−k}|^{−2}.  (2.2)

We define the OSE's in the present case similarly to (1.3) by

  f̃_k = e₀ + Σ_{|j|∈⟦k⟧} [ϕ]_j^{−1} [ĝ]_j e_j.  (2.3)

By elementary calculations, for each k ∈ ℕ₀ the risk of f̃_k in (2.3) satisfies

  E^n_{f,ϕ} ‖f̃_k − f‖²_{L²} + n^{−1} ‖Π⊥₀f‖²_{L²} = 2 n^{−1} k Σ^Φ_k + ((n+1)/n) ‖Π⊥₀f‖²_{L²} b²_k(f).  (2.4)

The quadratic risk in the last display depends on the dimension parameter k, and hence it will be minimised by selecting an optimal value, which we formulate next. For a sequence (a_k)_{k∈ℕ₀} of real numbers with minimal value in a set A ⊂ ℕ₀ we set argmin{a_k, k ∈ A} := min{k ∈ A : a_k ≤ a_j, ∀ j ∈ A}. For any non-negative sequences x := (x_k)_{k∈ℕ₀} and y := (y_k)_{k∈ℕ₀} and each k ∈ ℕ₀ define

  R^k_n(x, y) := [x_k ∨ n^{−1} k y_k] := max(x_k, n^{−1} k y_k),
  k°_n(x, y) := argmin{R^k_n(x, y), k ∈ ℕ}  and  R°_n(x, y) := min{R^k_n(x, y), k ∈ ℕ} = R^{k°_n(x,y)}_n(x, y).  (2.5)

Remark 2.1. Here and subsequently, our upper bounds of the risk derived from (2.4) make use of the definitions (2.5), for example, replacing the sequences x and y by b(f) and Σ^Φ, respectively. However, in what follows the sequences x and y are always monotonically non-increasing and non-decreasing, respectively, with x₀ ≤ 1 ≤ y₁ and lim_{k→∞} x_k = 0 = lim_{k→∞} y_k^{−1}.
In this situation, by construction, k°_n(x, y) ∈ ⟦n⟧ and R°_n(x, y) ≥ n^{−1} for all n ∈ ℕ, and lim_{n→∞} R°_n(x, y) = 0. For the latter, observe that for each δ > 0 there are k_δ ∈ ℕ and n_δ ∈ ℕ such that x_{k_δ} ≤ δ and k_δ y_{k_δ} n^{−1} ≤ δ, whence R°_n(x, y) ≤ R^{k_δ}_n(x, y) ≤ δ for all n ≥ n_δ. We shall use those elementary findings in the sequel without further reference.

We distinguish for the density of interest f, and hence for its associated sequence b := b(f) ∈ ℝ^{ℕ₀}_+ of approximation errors as in (2.1), the two cases: (p) there is K ∈ ℕ₀ with b_K = 0 and b_{K−1} > 0 (with the convention b_{−1} := 1), and (np) for all K ∈ ℕ₀ holds b_K > 0. Let us stress that, for any monotonically non-decreasing sequence y with y₁ ≥ 1, the order of the rate (R°_n(b, y))_{n∈ℕ} defined in (2.5) with x replaced by b is parametric in case (p) and non-parametric in case (np). More precisely, in case (p) it holds k°_n(b, y) = K and R°_n(b, y) = n^{−1} K y_K for all n > K y_K / b²_{K−1}, while in case (np) holds lim_{n→∞} k°_n(b, y) = ∞ and lim_{n→∞} n R°_n(b, y) = ∞.

Oracle optimality.
Coming back to the identity (2.4) and exploiting the definition (2.5) with x and y, respectively, replaced by b := b(f) and Σ^Φ as in (2.2), it follows immediately that

  inf{E^n_{f,ϕ}‖f̃_k − f‖²_{L²}, k ∈ ℕ} ≤ E^n_{f,ϕ}‖f̃_{k°_n(b,Σ^Φ)} − f‖²_{L²} ≤ C [1 ∨ ‖Π⊥₀f‖²_{L²}] R°_n(b, Σ^Φ).  (2.6)

On the other hand, with [a ∧ b] := min(a, b) for a, b ∈ ℝ, from (2.4) we also conclude

  inf{E^n_{f,ϕ}‖f̃_k − f‖²_{L²}, k ∈ ℕ} ≥ ([‖Π⊥₀f‖²_{L²} ∧ 1] − ‖Π⊥₀f‖²_{L²} (n R°_n(b, Σ^Φ))^{−1}) R°_n(b, Σ^Φ).  (2.7)

For each n ∈ ℕ, combining (2.6) and (2.7), R°_n(b, Σ^Φ), k°_n(b, Σ^Φ) and f̃_{k°_n(b,Σ^Φ)}, respectively, is an oracle rate, an oracle dimension and an oracle optimal estimator (up to a constant) if the leading factor on the right hand side in (2.7) is uniformly in n bounded away from zero. Note that R°_n(b, Σ^Φ) is in case (np) always an oracle rate, while in case (p) it is whenever K Σ^Φ_K > [1 ∨ ‖Π⊥₀f‖²_{L²}].

Aggregation.
We call aggregation weights any w := (w_k)_{k∈⟦n⟧} ∈ [0,1]^n defining on the set ⟦n⟧ a discrete probability measure P_w({k}) := w_k, k ∈ ⟦n⟧. We consider here and subsequently a weighted sum f̃_w := Σ_{k∈⟦n⟧} w_k f̃_k of the orthogonal series estimators defined in (2.3). Clearly, the coefficients ([f̃_w]_j)_{j∈ℤ} of f̃_w satisfy [f̃_w]_j = 0 for |j| > n, and for any |j| ∈ ⟦n⟧ holds [f̃_w]_j = Σ_{k∈⟦n⟧} w_k [f̃_k]_j = P_w(⟦|j|, n⟧) [ϕ]_j^{−1} [ĝ]_j. We note that by construction [f̃_w]₀ = 1, [f̃_w]_{−j} equals the complex conjugate of [f̃_w]_j, and 1 ≥ P_w(⟦|j|, n⟧). Hence, f̃_w is real and integrates to one; however, it is not necessarily non-negative. Our aim is to prove an upper bound for its risk E^n_{f,ϕ}‖f̃_w − f‖²_{L²} as well as for its maximal risk over Sobolev-like ellipsoids. For arbitrary aggregation weights and penalty sequence, the next lemma establishes an upper bound for the loss of the aggregated estimator. Selecting suitably the weights and penalties, this bound provides in the sequel our key argument.

Lemma 2.2.
Consider a weighted sum f̃_w with arbitrary aggregation weights w and non-negative penalty terms (pen^n_k)_{k∈⟦n⟧}. For any k₋, k₊ ∈ ⟦n⟧ holds

  ‖f̃_w − f‖²_{L²} ≤ pen^n_{k₊} + 2 ‖Π⊥₀f‖²_{L²} b²_{k₋}(f) + 2 ‖Π⊥₀f‖²_{L²} P_w(⟦0, k₋⟦)
    + Σ_{k∈⟧k₊,n⟧} pen^n_k w_k 𝟙{‖f̃_k − f_k‖²_{L²} < pen^n_k / 2}
    + 2 Σ_{k∈⟦k₊,n⟧} (‖f̃_k − f_k‖²_{L²} − pen^n_k / 2)₊
    + Σ_{k∈⟧k₊,n⟧} pen^n_k 𝟙{‖f̃_k − f_k‖²_{L²} ≥ pen^n_k / 2}.  (2.8)

Remark 2.3. Keeping (2.8) in mind, let us briefly outline the principal arguments of our aggregation strategy. Selecting the values k₊ and k₋ close to the oracle dimension k°_n(b, Σ^Φ), the first two terms in the upper bound of (2.8) are of the order of the oracle rate. On the other hand, the weights are in the sequel selected such that the third and fourth terms are negligible with respect to the oracle rate, while the choice of the penalties allows, as usual, to bound the deviation of the last two terms by concentration inequalities.

For some constant η ∈ ℕ, we consider either Bayesian weights

  w̃_k := exp(−ηn{−‖f̃_k‖²_{L²} + pen^Φ_k}) / Σ_{l∈⟦n⟧} exp(−ηn{−‖f̃_l‖²_{L²} + pen^Φ_l}),  k ∈ ⟦1, n⟧,  (2.9)

or model selection weights

  ẘ_k := 𝟙{k = k̃},  k ∈ ⟦n⟧,  with  k̃ := argmin_{k∈⟦n⟧} {−‖f̃_k‖²_{L²} + pen^Φ_k},  (2.10)

respectively similar to the ones defined in (1.7) and (1.6). Until now we have not specified the sequence of penalty terms.
For a sequence x ∈ ℝ^{ℕ₀} and k ∈ ℕ₀ define

  x_{(k)} := max{x_j : j ∈ ⟦0, k⟧},  λ^x_k := |log(k x_{(k)} ∨ (k+2))| / |log(k+2)|,  and  Λ^x_k := λ^x_k x_{(k)}.  (2.11)

Given Φ ∈ ℝ^{ℕ₀}_+ as in (2.2) and a numerical constant ∆ > 0 we use

  pen^Φ_k := ∆ k Λ^Φ_k n^{−1},  k ∈ ℕ₀,  (2.12)

as penalty terms. For the theoretical results below we need that the numerical constant ∆ is sufficiently large. However, for a practical application this value is generally too large, and a suitable constant might be chosen by preliminary calibration experiments, see Baudry et al. [2012]. We derive bounds for the risk of the weighted sum estimator f̃_w̃ with Bayesian weights and of the model selected estimator f̃_k̃ = f̃_ẘ by applying Lemma 2.2. From definition (2.5), replacing x and y, respectively, by b = b(f) and Λ^Φ, we consider R^k_n(b, Λ^Φ) for each n, k ∈ ℕ₀. Note that by construction (2.2) and (2.11), we have R°_n(b, Λ^Φ) ≥ R°_n(b, Σ^Φ) for all n ∈ ℕ. We denote in the sequel by C a universal finite numerical constant whose value possibly changes from line to line.

Proposition 2.4. Consider an aggregation f̃_w = Σ_{k∈⟦n⟧} w_k f̃_k using either Bayesian weights w := w̃ as in (2.9) or model selection weights w := ẘ as in (2.10) and penalties (pen^Φ_k)_{k∈⟦n⟧} as in (2.12) with sufficiently large numerical constant ∆. Let k_g := ⌊‖[g]‖²_{ℓ²}⌋.

(p) Assume there is K ∈ ℕ₀ with 1 ≥ b²_{K−1}(f) > 0 and b_K(f) = 0. If K = 0 we set c_f := 0, and c_f := ‖Π⊥₀f‖²_{L²} b²_{K−1}(f) otherwise. For n ∈ ℕ let k*_n := max{k ∈ ⟦n⟧ : n > c_f k Λ^Φ_k}, if the defining set is not empty, and k*_n := ⌈k_g log(2 + n)⌉ otherwise.
There is a finite constant C_{f,ϕ} given in (B.24), depending only on f and ϕ, such that for all n ∈ ℕ holds

  E^n_{f,ϕ}‖f̃_w − f‖²_{L²} ≤ C ‖Π⊥₀f‖²_{L²} [n^{−1} ∨ exp(−λ^Φ_{k*_n} k*_n / k_g)] + C_{f,ϕ} n^{−1}.  (2.13)

(np) Assume that b_k(f) > 0 for all k ∈ ℕ₀. There is a finite constant C_{f,ϕ} given in (B.10), depending only on f and ϕ, such that for all n ∈ ℕ holds

  E^n_{f,ϕ}‖f̃_w − f‖²_{L²} ≤ C [1 ∨ ‖Π⊥₀f‖²_{L²}] ρ°_n(b, Λ^Φ) + C_{f,ϕ} n^{−1}
  with  ρ°_n(b, Λ^Φ) := min_{k∈⟦n⟧} {[R^k_n(b, Λ^Φ) ∨ exp(−λ^Φ_k k / k_g)]}.  (2.14)

Corollary 2.5.
Let the assumptions of Proposition 2.4 be satisfied.

(p) If in addition (A1): there is n_{f,ϕ} ∈ ℕ such that λ^Φ_{k*_n} k*_n ≥ k_g log n for all n ≥ n_{f,ϕ}, then there is a constant C_{f,ϕ} depending only on f and ϕ such that E^n_{f,ϕ}‖f̃_w − f‖²_{L²} ≤ C_{f,ϕ} n^{−1} for all n ∈ ℕ.

(np) If in addition (A2): there is n_{f,ϕ} ∈ ℕ such that k°_n := k°_n(b, Λ^Φ) as in (2.5) satisfies k°_n λ^Φ_{k°_n} ≥ k_g |log R°_n(b, Λ^Φ)| for all n ≥ n_{f,ϕ}, then there is a constant C_{f,ϕ} depending only on f and ϕ such that E^n_{f,ϕ}‖f̃_w − f‖²_{L²} ≤ C_{f,ϕ} R°_n(b, Λ^Φ) for all n ∈ ℕ.

Illustration. For two strictly positive sequences (a_n)_{n∈ℕ} and (b_n)_{n∈ℕ} we use the notation a_n ∼ b_n if the sequence (a_n/b_n)_{n∈ℕ} is bounded away both from zero and from infinity. We illustrate the last results considering usual behaviours for the sequences b and Φ. Regarding the error density ϕ we consider, for a > 0, the following two cases: (o) Φ_k ∼ k^{2a} and (s) Φ_k ∼ exp(k^{2a}). The error density ϕ is called ordinary smooth in case (o) and supersmooth in case (s), and it holds, respectively, (o) k*_n ∼ n^{1/(2a+1)} and k*_n λ^Φ_{k*_n} ∼ n^{1/(2a+1)}, and (s) k*_n ∼ (log n)^{1/(2a)} and k*_n λ^Φ_{k*_n} ∼ (log n)^{1/(2a)}. Clearly, in both cases (A1) holds true and hence, employing Corollary 2.5 (p), the aggregated estimator attains the parametric rate.
On the other hand, for (np) we use for the deconvolution density f the particular specifications (o) |[f]_k|² ∼ k^{−2p−1} and (s) |[f]_k|² ∼ k^{2p−1} exp(−k^{2p}) with p > 0.

  Order    b²_k         Φ_k          R°_n(b, Σ^Φ)                   R°_n(b, Λ^Φ)                   ρ°_n(b, Λ^Φ)
  [o-o]    k^{−2p}      k^{2a}       n^{−2p/(2p+2a+1)}              n^{−2p/(2p+2a+1)}              n^{−2p/(2p+2a+1)}
  [o-s]    k^{−2p}      e^{k^{2a}}   (log n)^{−p/a}                 (log n)^{−p/a}                 (log n)^{−p/a}
  [s-o]    e^{−k^{2p}}  k^{2a}       (log n)^{(2a+1)/(2p)} n^{−1}   (log n)^{(2a+1)/(2p)} n^{−1}   (log n)^{(2a+1)/(2p)} n^{−1} if p < 1/2; (log n)^{2a+1} n^{−1} if p ≥ 1/2

To calculate the orders in the last table we used that the dimension parameter k°_n := k°_n(b, Λ^Φ) satisfies [o-o] k°_n ∼ n^{1/(2p+2a+1)} and λ^Φ_{k°_n} k°_n ∼ n^{1/(2p+2a+1)}, [o-s] k°_n ∼ (log n)^{1/(2a)} and λ^Φ_{k°_n} k°_n ∼ (log n)^{1/(2a)}, and [s-o] k°_n ∼ (log n)^{1/(2p)} and λ^Φ_{k°_n} k°_n ∼ (log n)^{1/(2p)}. We note that in each of the three cases the order of R°_n(b, Λ^Φ) and the order of the oracle rate R°_n(b, Σ^Φ) coincide. Moreover, the additional assumption (A2) in Corollary 2.5 (np) is satisfied in cases [o-o] and [o-s], but in [s-o] only with p < 1/2. Consequently, in these situations, due to Corollary 2.5 (np), the partially data-driven aggregation is oracle optimal (up to a constant). Otherwise, the upper bound ρ°_n(b, Λ^Φ) in Proposition 2.4 (2.14) faces a deterioration compared to the rate R°_n(b, Λ^Φ). In case [s-o] with p ≥ 1/2, setting k*_n := k_g |log R°_n(b, Λ^Φ)| ∼ log n, the upper bound ρ°_n(b, Λ^Φ) ≤ R^{k*_n}_n(b, Λ^Φ) ∼ (log n)^{2a+1} n^{−1} features a deterioration at most by a logarithmic factor (log n)^{(2a+1)(1−1/(2p))} compared to the oracle rate (log n)^{(2a+1)/(2p)} n^{−1}.

Minimax optimality.
Rather than considering for each k ∈ ℕ₀ the risk of the OSE f̃_k for given f and ϕ, we shall now measure its accuracy by a maximal risk over pre-specified classes of densities determining a priori conditions on f and ϕ, respectively. For an arbitrary positive sequence x ∈ ℝ^{ℕ₀}_+ and h ∈ L² we write shortly ‖h‖²_x := Σ_{j∈ℤ} x_{|j|} |[h]_j|². Given strictly positive sequences 𝔣 = (𝔣_k)_{k∈ℕ₀} and 𝔰 = (𝔰_k)_{k∈ℕ₀}, and constants r, d ≥ 1, we define

  F^r_𝔣 := {p ∈ 𝒟 : ‖p‖²_{1/𝔣} ≤ r}  and  E^d_𝔰 := {p ∈ 𝒟 : d^{−1} ≤ 𝔰_{|j|} |[p]_j|² ≤ d, ∀ j ∈ ℤ}.

Here and subsequently, we suppose the following minimal regularity conditions are satisfied.

Assumption (A3).
The sequences 𝔣 and 𝔰^{−1} are monotonically non-increasing with 𝔣₀ = 1 = 𝔰₀, lim_{k→∞} 𝔣_k = 0 = lim_{k→∞} 𝔰_k^{−1}, and Σ_{k∈ℕ₀} 𝔣_k/𝔰_k = ‖𝔣/𝔰‖_{ℓ¹} < ∞.

We shall emphasize that for k ∈ ℕ₀, f ∈ F^r_𝔣 and ϕ ∈ E^d_𝔰 hold ‖Π⊥₀f‖²_{L²} b²_k(f) ≤ r 𝔣_k and 1/d ≤ Σ^Φ_k / Σ^𝔰_k ≤ d with Σ^𝔰_k = k^{−1} Σ_{j∈⟦k⟧} 𝔰_j, which we use in the sequel without further reference. Exploiting again the identity (2.4) and the definition (2.5) with x and y, respectively, replaced by 𝔣 and Σ^𝔰, it follows for all k, n ∈ ℕ

  sup{E^n_{f,ϕ}‖f̃_k − f‖²_{L²} : f ∈ F^r_𝔣, ϕ ∈ E^d_𝔰} ≤ (2d + r) R^k_n(𝔣, Σ^𝔰).  (2.15)

The upper bound in the last display depends on the dimension parameter k, and hence by choosing an optimal value k°_n(𝔣, Σ^𝔰) it will be minimised. From (2.15) we deduce that sup{E^n_{f,ϕ}‖f̃_{k°_n(𝔣,Σ^𝔰)} − f‖²_{L²} : f ∈ F^r_𝔣, ϕ ∈ E^d_𝔰} ≤ (2d + r) R°_n(𝔣, Σ^𝔰) for all n ∈ ℕ. On the other hand, Johannes and Schwarz [2013] have shown that for all n ∈ ℕ

  inf_{f̃} sup{E^n_{f,ϕ}‖f̃ − f‖²_{L²} : f ∈ F^r_𝔣, ϕ ∈ E^d_𝔰} ≥ C R°_n(𝔣, Σ^𝔰),  (2.16)

where C > 0 and the infimum is taken over all possible estimators f̃ of f. Consequently, (R°_n(𝔣, Σ^𝔰))_{n∈ℕ}, (k°_n(𝔣, Σ^𝔰))_{n∈ℕ} and (f̃_{k°_n(𝔣,Σ^𝔰)})_{n∈ℕ}, respectively, is a minimax rate, a minimax dimension and a minimax optimal estimator (up to a constant).

Aggregation.
Exploiting Lemma 2.2, we now derive bounds for the maximal risk of the aggregated estimator f̃_w using either Bayesian weights w := w̃ as in (2.9) or model selection weights w := ẘ as in (2.10). Keeping the definition (2.11) in mind, we use in the sequel that for any ϕ ∈ E^d_𝔰 and k ∈ ℕ₀ hold

  (1 + log d)^{−1} ≤ λ^Φ_k / λ^𝔰_k ≤ (1 + log d)  and  ζ_d := d(1 + log d) ≥ λ^Φ_k Φ_{(k)} / (λ^𝔰_k 𝔰_k) ≥ ζ_d^{−1}.  (2.17)

It follows immediately, for all k, n ∈ ℕ, f ∈ F^r_𝔣 and ϕ ∈ E^d_𝔰,

  r R^k_n(𝔣, Λ^𝔰) ≥ ‖Π⊥₀f‖²_{L²} b²_k(f)  and  ∆ ζ_d R^k_n(𝔣, Λ^𝔰) ≥ pen^Φ_k.  (2.18)

Note that by construction R°_n(𝔣, Λ^𝔰) ≥ R°_n(𝔣, Σ^𝔰) for all n ∈ ℕ.

Proposition 2.7.
Consider an aggregation f̃_w using either Bayesian weights w := w̃ as in (2.9) or model selection weights w := w̆ as in (2.10), and penalties (penΦ_k)_{k∈⟦n⟧} as in (2.12) with a sufficiently large numerical constant ∆. Let (A3) be satisfied and set k_fs := ⌊r ζ_d ‖f_•/s_•‖_{ℓ¹}⌋. There is a constant C_rdfs given in (B.31), depending only on F^r_{f_•} and E^d_{s_•}, such that for all n ∈ ℕ

  sup{ E^n_{f,ϕ} ‖f̃_w − f‖²_{L²} : f ∈ F^r_{f_•}, ϕ ∈ E^d_{s_•} } ≤ C(r + ζ_d) ρ◦_n(f_•, Λs_•) + C_rdfs n^{-1}

with ρ◦_n(f_•, Λs_•) := min_{k∈⟦n⟧} { R^k_n(f_•, Λs_•) ∨ exp(−λs_k k/k_fs) }.   (2.19)

Corollary
Let the assumptions of Proposition 2.7 be satisfied. If in addition (A2') there is n_fs ∈ ℕ such that k◦_n := k◦_n(f_•, Λs_•) as in (2.5) satisfies k◦_n λs_{k◦_n} ≥ k_fs |log R◦_n(f_•, Λs_•)| for all n ≥ n_fs, then there is a constant C_rdfs depending only on the classes F^r_{f_•} and E^d_{s_•} such that

  sup{ E^n_{f,ϕ} ‖f̃_w − f‖²_{L²} : f ∈ F^r_{f_•}, ϕ ∈ E^d_{s_•} } ≤ C_rdfs R◦_n(f_•, Λs_•) for all n ∈ ℕ.

Illustration 2.9. We consider the following typical configurations of f_• and s_•:

        f_k       | s_k       | R◦_n(f_•,Σs_•)            | R◦_n(f_•,Λs_•)            | ρ◦_n(f_•,Λs_•)
  [o-o] k^{-2p}   | k^{2a}    | n^{-2p/(2p+2a+1)}         | n^{-2p/(2p+2a+1)}         | n^{-2p/(2p+2a+1)}
  [o-s] k^{-2p}   | e^{k^{2a}}| (log n)^{-p/a}            | (log n)^{-p/a}            | (log n)^{-p/a}
  [s-o] e^{-k^{2p}}| k^{2a}   | (log n)^{(2a+1)/(2p)} n^{-1} | (log n)^{(2a+1)/(2p)} n^{-1} | (log n)^{(2a+1)/(2p)} n^{-1} if p < 1/2; (log n)^{2a+1} n^{-1} if p ≥ 1/2.
We note that in each of the three cases the order of R◦_n(f_•, Λs_•) coincides with the order of the minimax rate R◦_n(f_•, Σs_•). Moreover, the additional assumption (A2') in Corollary 2.8 is satisfied in the cases [o-o] and [o-s], but in [s-o] only for p < 1/2. Consequently, in these situations the partially data-driven aggregation is minimax optimal (up to a constant) due to Corollary 2.8. Otherwise, the upper bound ρ◦_n(f_•, Λs_•) in Proposition 2.7 (2.19) faces a deterioration compared to R◦_n(f_•, Λs_•), e.g. in case [s-o] with p ≥ 1/2 by a logarithmic factor (log n)^{(2a+1)(1−1/(2p))}.

In this section we dispense with any knowledge about the error density ϕ. Instead we assume two independent samples (Y_i)_{i∈⟦n⟧} and (ε_i)_{i∈⟦m⟧} as in (1.2). We denote by E^{n,m}_{f,ϕ}, E^n_{f,ϕ} and E^m_ϕ the expectation with respect to their joint distribution P^{n,m}_{f,ϕ} and the marginals P^n_{f,ϕ} and P^m_ϕ, respectively. Risk bound.
Exploiting the independence assumption, the risk of the orthogonal series estimator f̂_k given in (1.3) can be decomposed for each n, m, k ∈ ℕ as follows:

  E^{n,m}_{f,ϕ} ‖f̂_k − f‖²_{L²} = n^{-1} Σ_{|j|∈⟦k⟧} Φ_j (1 − |[g]_j|²) E^m_ε(|[ϕ̂]^+_j [ϕ]_j|²) + ‖Π^⊥ f‖²_{L²} b_k(f)
   + Σ_{|j|∈⟦k⟧} |[f]_j|² E^m_ε(|[ϕ̂]_j − [ϕ]_j|² |[ϕ̂]^+_j|²) + Σ_{|j|∈⟦k⟧} |[f]_j|² P^m_ε(|[ϕ̂]_j|² < 1/m).   (3.1)

Exploiting Lemma A.6 in the appendix we control the deviations of the additional terms due to estimating the error density. Therewith, setting ‖Π^⊥ f‖²_{∧Φ/m} := 2 Σ_{j∈ℕ} |[f]_j|² [1 ∧ Φ_j/m] and selecting k◦_n := k◦_n(b_•, ΣΦ_•) as in (2.5) with R◦_n(b_•, ΣΦ_•) = R^{k◦_n}_n(b_•, ΣΦ_•), it follows for all n, m ∈ ℕ that

  E^{n,m}_{f,ϕ} ‖f̂_{k◦_n} − f‖²_{L²} ≤ (‖Π^⊥ f‖²_{L²} + 8) R◦_n(b_•, ΣΦ_•) + 8(C + 1) ‖Π^⊥ f‖²_{∧Φ/m}.   (3.2)

Remark 3.1. Note that ‖Π^⊥ f‖²_{L²} = 0 implies ‖Π^⊥ f‖²_{∧Φ/m} = 0, while for ‖Π^⊥ f‖²_{L²} > 0 we have ‖Π^⊥ f‖²_{∧Φ/m} ≥ ‖Π^⊥ f‖²_{L²} m^{-1}, and hence any additional term of order n^{-1} + m^{-1} is negligible with respect to R◦_n(f) + ‖Π^⊥ f‖²_{∧Φ/m}, since R◦_n(f) ≥ n^{-1}. On the other hand, if ‖f‖ < ∞ then ‖Π^⊥ f‖²_{∧Φ/m} ≤ m^{-1}‖f‖². Consequently, in case (p) the order of the upper bound is parametric in both sample sizes, i.e., E^{n,m}_{f,ϕ} ‖f̂_{k◦_n} − f‖²_{L²} ≤ C_{f,ϕ}(n ∧ m)^{-1} for all n, m ∈ ℕ and a finite constant C_{f,ϕ} > 0 depending on f and ϕ only.
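The estimator f̂_k of (1.3) can be sketched numerically. The following Python fragment is an illustration only (the paper's own simulations use R): it computes empirical Fourier coefficients from the two samples and inverts the noise coefficients through a truncated inverse, assuming, as one plausible reading of [ϕ̂]^+_j in the text, that a coefficient is kept only when its modulus is at least m^{-1/2}. All function names are ours.

```python
import numpy as np

def fourier_coeffs(sample, k):
    """Empirical Fourier coefficients [h]_j = n^{-1} sum_l exp(-2*pi*i*j*X_l), |j| <= k."""
    freqs = np.arange(-k, k + 1)
    return np.exp(-2j * np.pi * np.outer(freqs, sample)).mean(axis=1)

def ose_deconvolution(Y, eps, k, x):
    """Spectral cut-off deconvolution estimator f_hat_k evaluated at the points x.

    The empirical noise coefficients are inverted only where their modulus
    exceeds m^{-1/2} (hedged reading of the truncated inverse [phi-hat]_j^+);
    outside this event the coefficient is set to zero.
    """
    m = len(eps)
    g_hat = fourier_coeffs(Y, k)
    phi_hat = fourier_coeffs(eps, k)
    phi_plus = np.zeros_like(phi_hat)
    keep = np.abs(phi_hat) >= 1.0 / np.sqrt(m)
    phi_plus[keep] = 1.0 / phi_hat[keep]
    f_hat = g_hat * phi_plus                       # [f_hat]_j = [g_hat]_j * [phi_hat]_j^+
    freqs = np.arange(-k, k + 1)
    basis = np.exp(2j * np.pi * np.outer(np.atleast_1d(x), freqs))  # e_j(x) on the circle
    return (basis @ f_hat).real                    # the estimator is real-valued
```

With degenerate noise (ε ≡ 0) every noise coefficient equals one and the sketch reduces to the classical orthogonal series density estimator of Y.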
We shall further emphasise that in the case n = m, for any densities f and ϕ it holds

  ‖Π^⊥ f‖²_{∧Φ/n} = Σ_{|j|∈⟦k◦_n(f)⟧} |[f]_j|² [1 ∧ n^{-1}Φ_j] + Σ_{|j|>k◦_n(f)} |[f]_j|² [1 ∧ n^{-1}Φ_j]
   ≤ ‖Π^⊥ f‖²_{L²} k◦_n ΣΦ_{k◦_n}/n + ‖Π^⊥ f‖²_{L²} b_{k◦_n} ≤ ‖Π^⊥ f‖²_{L²} R^{k◦_n}_n(b_•, ΣΦ_•),   (3.3)

which in turn implies E^{n,m}_{f,ϕ} ‖f̂_{k◦_n} − f‖²_{L²} ≤ C(1 ∨ ‖Π^⊥ f‖²_{L²}) R◦_n(b_•, ΣΦ_•). In other words, the estimation of the unknown error density ϕ is negligible whenever n ≤ m. Aggregation.
Introducing aggregation weights w, consider an aggregation f̂_w = Σ_{k∈⟦n⟧} w_k f̂_k of the orthogonal series estimators f̂_k, k ∈ ℕ, defined in (1.3), with coefficients ([f̂_w]_j)_{j∈ℤ} satisfying [f̂_w]_j = 0 for |j| > n, and [f̂_w]_j = P_w(⟦|j|, n⟧) [ϕ̂]^+_j [ĝ]_j for any |j| ∈ ⟦n⟧. We note that again by construction [f̂_w]_0 = 1, [f̂_w]_{−j} is the complex conjugate of [f̂_w]_j, and 1 ≥ |[f̂_w]_j|. Hence f̂_w is real and integrates to one; however, it is not necessarily non-negative. Our aim is to prove an upper bound for its risk E^{n,m}_{f,ϕ} ‖f̂_w − f‖²_{L²} and its maximal risk sup{ E^{n,m}_{f,ϕ} ‖f̂_w − f‖²_{L²} : f ∈ F^r_{f_•}, ϕ ∈ E^d_{s_•} }. Here and subsequently, we denote f̌_k := Σ_{j=−k}^{k} [ϕ̂]^+_j [g]_j e_j = Σ_{j=−k}^{k} [ϕ̂]^+_j [ϕ]_j [f]_j e_j for k ∈ ℕ. For arbitrary aggregation weights and penalties, the next lemma establishes an upper bound for the loss of the aggregated estimator. Selecting the weights and penalties suitably, it provides in the sequel our key argument. Lemma
Consider a weighted sum f̂_w with arbitrary aggregation weights w and non-negative penalty terms (pen^n_k)_{k∈⟦n⟧}. For any k₋, k₊ ∈ ⟦n⟧ it holds

  ‖f̂_w − f‖²_{L²} ≤ ‖f̂_{k₊} − f̌_{k₊}‖²_{L²} + 3‖Π^⊥ f‖²_{L²} b_{k₋}(f) + 3‖Π^⊥ f‖²_{L²} P_w(⟦k₋⟦)
   + Σ_{l∈⟧k₊,n⟧} pen^n_l w_l 1{‖f̂_l − f̌_l‖²_{L²} < pen^n_l} + 3 Σ_{l∈⟧k₊,n⟧} (‖f̂_l − f̌_l‖²_{L²} − pen^n_l/2)₊
   + Σ_{l∈⟧k₊,n⟧} pen^n_l 1{‖f̂_l − f̌_l‖²_{L²} ≥ pen^n_l/2} + 6 Σ_{j∈⟦n⟧} |[ϕ̂]^+_j|² |[ϕ]_j − [ϕ̂]_j|² |[f]_j|² + 2 Σ_{j∈⟦n⟧} 1{|[ϕ̂]_j|² < 1/m} |[f]_j|².   (3.4)

Remark 3.3. The upper bound in (3.4) is similar to (2.8), apart from the last two terms, which are controlled again by Lemma A.6. However, in order to control the third and fourth terms, we replace in both weights and penalties the quantities Φ, which are no longer known in advance, by their natural estimators. We consider, for some constant η ∈ ℕ, either Bayesian weights (ŵ_k)_{k∈⟦n⟧} as in (1.7) or model selection weights (w̆_k)_{k∈⟦n⟧} as in (1.6). Given Φ̂_• = (Φ̂_j)_{j∈ℕ} with

  Φ̂_j := |[ϕ̂]^+_j|² = |[ϕ̂]^+_{−j}|² = |[ϕ̂]_j|^{−2} 1{|[ϕ̂]_j|² ≥ 1/m},   (3.5)

and ΛΦ̂_k as in (2.11) with x_• replaced by Φ̂_•, and a numerical constant ∆ > 0, we use

  penΦ̂_k := ∆ k ΛΦ̂_k n^{-1}, k ∈ ℕ,   (3.6)

as penalty terms. Theorem
Consider a weighted sum f̂_w = Σ_{k∈⟦n⟧} w_k f̂_k using either Bayesian weights w := ŵ as in (1.7) or model selection weights w := w̆ as in (1.6), and penalties (penΦ̂_k)_{k∈⟦n⟧} as in (3.6) with a sufficiently large numerical constant ∆. Let k_g := ⌊‖[g]‖_{ℓ¹}⌋ and for m ∈ ℕ set k*_m := max{ k ∈ ⟦m⟧ : 289 log(k + 2) λΦ_k Φ_(k) ≤ m } if the defining set is not empty, and k*_m := ⌈k_g log(2 + m)⌉ otherwise.

(p) Assume there is K ∈ ℕ with 1 ≥ b_{K−1}(f) > 0 and b_K(f) = 0. If K = 0 we set c_f := 0, and c_f := ‖Π^⊥ f‖²_{L²} b_{K−1}(f) otherwise. For n ∈ ℕ let k*_n := max{ k ∈ ⟦n⟧ : n > c_f k ΛΦ_k } if the defining set is not empty, and k*_n := ⌈k_g log(2 + n)⌉ otherwise. There is a constant C_fϕ given in (C.15), depending only on f and ϕ, such that for all n, m ∈ ℕ holds

  E^{n,m}_{f,ϕ} ‖f̂_w − f‖²_{L²} ≤ C ‖Π^⊥ f‖²_{L²} [ (n ∧ m)^{-1} ∨ exp(−λΦ_{(k*_n∧k*_m)} (k*_n ∧ k*_m)/k_g) ] + C_fϕ (n ∧ m)^{-1}.   (3.7)

(np) Assume b_k(f) > 0 for all k ∈ ℕ and consider ρ◦_n(b_•, ΛΦ_•) as in (2.14). There is a constant C_fϕ given in (C.11), depending only on f and ϕ, such that for all n, m ∈ ℕ holds

  E^{n,m}_{f,ϕ} ‖f̂_w − f‖²_{L²} ≤ C { [1 ∨ ‖Π^⊥ f‖²_{L²}] ρ◦_n(b_•, ΛΦ_•) + ρ◦_m(f, Φ_•) } + C_fϕ (n ∧ m)^{-1}

with ρ◦_m(f, Φ_•) := ‖Π^⊥ f‖²_{∧Φ/m} ∨ ‖Π^⊥ f‖²_{L²} [ b_{k*_m}(f) ∨ exp(−λΦ_{k*_m} k*_m/k_g) ].   (3.8)

Corollary
Let the assumptions of Theorem 3.4 be satisfied and in addition (A4): there is m_fϕ ∈ ℕ such that λΦ_{k*_m} k*_m ≥ k_g log m for all m ≥ m_fϕ. (p) If (A1) as in Corollary 2.5 and (A4) hold true, then there is a constant C_fϕ depending only on f and ϕ such that for all n, m ∈ ℕ holds E^{n,m}_{f,ϕ} ‖f̂_w − f‖²_{L²} ≤ C_fϕ (n ∧ m)^{-1}. (np) If (A2) as in Corollary 2.5 and (A4) hold true, then there is a constant C_fϕ depending only on f and ϕ such that E^{n,m}_{f,ϕ} ‖f̂_w − f‖²_{L²} ≤ C_fϕ ( R◦_n(b_•, ΛΦ_•) + ‖Π^⊥ f‖²_{∧Φ/m} + b_{k*_m}(f) ) holds true for all n, m ∈ ℕ.

Illustration 3.6. We consider the cases (o) and (s) for the error density ϕ as in Illustration 2.6, where in both cases Corollary 2.5 (A1) holds true (cf. Illustration 2.6 (o) and (s)). Moreover, Corollary 3.5 (A4) is satisfied, since (o) k*_m ∼ (m/log m)^{1/(2a)} and k*_m λΦ_{k*_m} ∼ (m/log m)^{1/(2a)}, and (s) k*_m ∼ (log m)^{1/(2a)} and k*_m λΦ_{k*_m} ∼ (log m)^{1/(2a)}. Therefore, employing Corollary 3.5 (p), the fully data-driven aggregation attains the parametric rate. For (np), due to Corollary 3.5 the risk of the fully data-driven aggregated estimator has the order R◦_n(b_•, ΛΦ_•) + ‖Π^⊥ f‖²_{∧Φ/m}, if (A2) and (A4) are satisfied and b_{k*_m}(f) is negligible with respect to ‖Π^⊥ f‖²_{∧Φ/m}. Otherwise the upper bound ρ◦_m(f, Φ_•) in Theorem 3.4 (3.8) faces a deterioration compared to ‖Π^⊥ f‖²_{∧Φ/m}, which we illustrate considering the cases in Illustration 2.6. Note that the other upper bound ρ◦_n(b_•, ΛΦ_•) in Theorem 3.4 (3.8) already appears in Proposition 2.4 and has been discussed in Illustration 2.6.
Therefore, we state below the order of the additional term ρ◦_m(f, Φ_•) only.

  Order  |[f]_k|                 | Φ_k        | ‖Π^⊥f‖²_{∧Φ/m}                                          | ρ◦_m(f, Φ_•)
  [o-o]  k^{-p-1/2}              | k^{2a}     | m^{-p/a} : p < a; (m/log m)^{-1} : p = a; m^{-1} : p > a | (m/log m)^{-p/a} : p < a; (m/log m)^{-1} : p = a; m^{-1} : p > a
  [o-s]  k^{-p-1/2}              | e^{k^{2a}} | |log m|^{-p/a}                                           | |log m|^{-p/a}
  [s-o]  k^{p-1/2} e^{-k^{2p}}   | k^{2a}     | m^{-1}                                                   | m^{-1}

Combining Illustrations 2.6 and 3.6, the fully data-driven aggregation attains the rate R◦_n(b_•, ΣΦ_•) + ‖Π^⊥ f‖²_{∧Φ/m} in the cases [o-s], [o-o] with p ≥ a, and [s-o] with p ≤ 1/2. In the cases [o-o] with p < a and [s-o] with p > 1/2 its rate features a deterioration compared to R◦_n(b_•, ΣΦ_•) + ‖Π^⊥ f‖²_{∧Φ/m} by a logarithmic factor (log m)^{p/a} and (log n)^{(2a+1)(1−1/(2p))}, respectively. Minimax optimality.
For m ∈ ℕ, setting ‖f_•(1 ∧ s_•/m)‖_∞ := sup{ f_j (1 ∧ s_j/m) : j ∈ ℕ }, it holds ‖Π^⊥ f‖²_{∧Φ/m} ≤ dr ‖f_•(1 ∧ s_•/m)‖_∞ for all f ∈ F^r_{f_•} and ϕ ∈ E^d_{s_•}. Exploiting again the upper bound (3.2) and the definition (2.5), for all k, n, m ∈ ℕ it follows immediately that

  sup{ E^{n,m}_{f,ϕ} ‖f̂_k − f‖²_{L²} : f ∈ F^r_{f_•}, ϕ ∈ E^d_{s_•} } ≤ C_rd ( R^k_n(f_•, Σs_•) + ‖f_•(1 ∧ s_•/m)‖_∞ ).

The upper bound in the last display depends on the dimension parameter k, and hence choosing an optimal value k◦_n(f_•, Σs_•) as in (2.5) minimises it, so that

  sup{ E^{n,m}_{f,ϕ} ‖f̂_{k◦_n(f_•,Σs_•)} − f‖²_{L²} : f ∈ F^r_{f_•}, ϕ ∈ E^d_{s_•} } ≤ C_rd ( R◦_n(f_•, Σs_•) + ‖f_•(1 ∧ s_•/m)‖_∞ ).

On the other hand, Johannes and Schwarz [2013] have shown that for all n, m ∈ ℕ

  inf_{f̂} sup{ E^{n,m}_{f,ϕ} ‖f̂ − f‖²_{L²} : f ∈ F^r_{f_•}, ϕ ∈ E^d_{s_•} } ≥ C ‖f_•(1 ∧ s_•/m)‖_∞,

where C > 0 and the infimum is taken over all possible estimators f̂ of f. Consequently, combining (2.16) and the last lower bound, (R◦_n(f_•, Σs_•) + ‖f_•(1 ∧ s_•/m)‖_∞)_{n,m∈ℕ}, (k◦_n(f_•, Σs_•))_{n∈ℕ} and (f̂_{k◦_n(f_•,Σs_•)})_{n∈ℕ} are, respectively, a minimax rate, a minimax dimension and a minimax optimal estimator (up to a constant). Aggregation.
By applying Lemma 3.2 we derive bounds for the maximal risk of the fully data-driven aggregation. Theorem
Consider an aggregation f̂_w using either Bayesian weights w := ŵ as in (1.7) or model selection weights w := w̆ as in (1.6), and penalties (penΦ̂_k)_{k∈⟦n⟧} as in (3.6) with a sufficiently large numerical constant ∆. Let (A3) be satisfied and ρ◦_n(f_•, Λs_•) be as in (2.19). Set k_fs := ⌊r ζ_d ‖f_•/s_•‖_{ℓ¹}⌋ and, for m ∈ ℕ, k*_m := max{ k ∈ ⟦m⟧ : 289 log(k + 2) ζ_d λs_k s_k ≤ m } if the defining set is not empty, and k*_m := ⌈k_fs log(2 + m)⌉ otherwise. Then there is a constant C_rdfs given in (C.23), depending only on the classes F^r_{f_•} and E^d_{s_•}, such that for all n, m ∈ ℕ

  sup{ E^{n,m}_{f,ϕ} ‖f̂_w − f‖²_{L²} : f ∈ F^r_{f_•}, ϕ ∈ E^d_{s_•} } ≤ C(r + ζ_d)( ρ◦_n(f_•, Λs_•) + ρ◦_m(f_•, s_•) ) + C_rdfs (n ∧ m)^{-1}

with ρ◦_m(f_•, s_•) := ‖f_•(1 ∧ s_•/m)‖_∞ ∨ [ f_{k*_m} ∨ exp(−λs_{k*_m} k*_m/k_fs) ].   (3.9)

Corollary
Let the assumptions of Theorem 3.7 be satisfied. If (A2') as in Corollary 2.8 and in addition (A4') there is m_fs ∈ ℕ such that λs_{k*_m} k*_m ≥ k_fs log m for all m ≥ m_fs, then there is a constant C_rdfs depending only on the classes F^r_{f_•} and E^d_{s_•} such that for all n, m ∈ ℕ holds

  sup{ E^{n,m}_{f,ϕ} ‖f̂_w − f‖²_{L²} : f ∈ F^r_{f_•}, ϕ ∈ E^d_{s_•} } ≤ C_rdfs ( R◦_n(f_•, Λs_•) + ‖f_•(1 ∧ s_•/m)‖_∞ + f_{k*_m} ).

Illustration 3.9. The order of ρ◦_n(f_•, Λs_•) appearing in Theorem 3.7 has been discussed for the typical configurations in Illustration 2.9; thus we state below the order of the additional term only.

  Order  f_k          | s_k        | ‖f_•(1∧s_•/m)‖_∞                 | ρ◦_m(f_•, s_•)
  [o-o]  k^{-2p}      | k^{2a}     | m^{-p/a} : p ≤ a; m^{-1} : p > a | (m/log m)^{-p/a} : p ≤ a; m^{-1} : p > a
  [o-s]  k^{-2p}      | e^{k^{2a}} | |log m|^{-p/a}                   | |log m|^{-p/a}
  [s-o]  e^{-k^{2p}}  | k^{2a}     | m^{-1}                           | m^{-1}

Note that in all cases the additional assumption (A4') in Corollary 3.8 is satisfied (as in Illustration 3.6), and hence ρ◦_m(f_•, s_•) is of order ‖f_•(1 ∧ s_•/m)‖_∞ + f_{k*_m}. Moreover, in the cases [o-o], [o-s] and [s-o] we have f_{k*_m} ∼ (m/log m)^{-p/a}, f_{k*_m} ∼ (log m)^{-p/a} and f_{k*_m} ∼ exp(−(m/log m)^{p/a}), respectively. Consequently, f_{k*_m} is negligible compared to ‖f_•(1 ∧ s_•/m)‖_∞ in the cases [o-s] and [s-o], but in [o-o] only for p > a. Combining Illustrations 2.9 and 3.9, the fully data-driven aggregation attains the minimax rate in the cases [o-s], [o-o] with p > a and [s-o] with p ≤ 1/2, while in the cases [o-o] with p ≤ a and [s-o] with p > 1/2 its rate features a deterioration by a logarithmic factor (log m)^{p/a} and (log n)^{(2a+1)(1−1/(2p))}, respectively, compared to the minimax rate.
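The two weighting schemes used throughout can be sketched numerically. The exact weight definitions (1.6) and (1.7) are not reproduced in this excerpt, so the following Python sketch assumes the usual penalised-contrast shape found in this literature: a point mass at the criterion minimiser for model selection, and exponential weights with temperature η for the Bayesian variant. Function and argument names are ours.

```python
import numpy as np

def aggregation_weights(sq_norms, pen, n, eta=1.0):
    """Model-selection and exponential ('Bayesian') aggregation weights.

    sq_norms[k] plays the role of ||f_hat_k||^2 and pen[k] of the penalty
    pen_k; this is a hedged sketch of the shapes (1.6)/(1.7), not the
    paper's exact formulas.
    """
    crit = pen - sq_norms                          # penalised contrast criterion
    w_ms = np.zeros_like(crit)
    w_ms[np.argmin(crit)] = 1.0                    # model selection: point mass at minimiser
    log_w = -0.5 * eta * n * (crit - crit.min())   # shift for numerical stability
    w_bayes = np.exp(log_w)
    w_bayes /= w_bayes.sum()                       # exponential weights sum to one
    return w_ms, w_bayes
```

Both weight vectors concentrate on dimensions with a small penalised contrast; the exponential weights merely smooth the hard selection, which is why both schemes obey the same oracle-type bounds above.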
Let us illustrate the performance of the fully data-driven weighted sum of OSE's, with either model selection (1.6) or Bayesian (1.7) (η = 1) weights, by a simulation study.

[Figure 1: Empirical Bayesian risk for Bayesian (left, ∆_optimal = 0.048) and model selection (right, ∆_optimal = 0.037) weights over the replicates as a function of the constant ∆, with the minimal value marked by a black vertical line.]

As a first step, we calibrate the constant ∆ appearing in the penalty (3.6). Indeed, Theorems 3.4 and 3.7 stipulate that any sufficiently large choice of ∆ ensures optimal rates, but this is not a necessary condition and the constant obtained this way is often too large. Hence, we select a value minimising a Bayesian empirical risk obtained by repeating the procedure described hereafter. We randomly pick a noise density ϕ and a density of interest f, respectively, from a family of wrapped asymmetric Laplace distributions and a family of wrapped normal distributions. For the noise density the location parameter is uniformly distributed in [0, 1), and both the scale and the asymmetry parameter follow a Γ distribution. For the density of interest the mean is again uniformly distributed in [0, 1), while the standard deviation has a Γ distribution. Next we generate a sample (ε_i)_{i∈⟦m⟧} of size m = 5000 from ϕ, and a sample (Y_i)_{i∈⟦n⟧} of size n = 500 from g = f ⊛ ϕ. We use them to construct the estimators f̂_ŵ and f̂_w̆ as in Theorem 3.4 for a range of values of ∆. Finally, we compute and store the L²-loss of each estimator obtained this way. Given the results of the repetitions, for each value of the constant ∆ we use the sample of L²-losses to compute estimators of the mean squared error and the quantiles of the distribution of the L²-loss.
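The calibration loop just described can be sketched as follows. The Monte Carlo step is abstracted into a user-supplied function, since the wrapped Laplace/normal draws and the estimator construction of the paper are not reproduced here; the function and parameter names are ours.

```python
import numpy as np

def calibrate_delta(deltas, n_rep, one_loss, rng):
    """Select the penalty constant Delta minimising an empirical mean of L2-losses.

    one_loss(delta, rng) must return the L2-loss of the estimator for one
    random draw of (f, phi) and fresh samples; it stands in for the full
    simulation procedure of the paper (n = 500, m = 5000 per replication).
    """
    mean_losses = np.array([
        np.mean([one_loss(d, rng) for _ in range(n_rep)]) for d in deltas
    ])
    best = deltas[int(np.argmin(mean_losses))]     # Delta with smallest empirical risk
    return best, mean_losses
```

In the paper's study the same grid of ∆ values is evaluated on every replication, so quantiles of the loss distribution can be read off from the stored losses alongside the mean.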
Finally, we select and fix from now on a value of ∆ that minimises the empirical mean squared error. The results of this procedure are reported in fig. 1. Using the calibrated constants and samples of size n = m = 1000, we depict in fig. 2 a realisation of the weighted sum estimators with Bayesian or model selection weights. In this example, f is a mixture of two von Mises distributions and ϕ is a wrapped asymmetric Laplace distribution. The two estimators recover the true density properly and behave similarly; we next investigate whether there can be a significant performance difference. In the remaining part of this section we illustrate the numerical performance of the weighted sum estimators and their dependence on the sample sizes n and m by reporting the Bayesian empirical risk obtained by repeating 100 times the procedure described next. In contrast to the above, we randomly pick a noise density ϕ and a density of interest f, respectively, from a
[Figure 2: Weighted sum estimators using Bayesian weights or model selection weights (a), and the associated random weights (b), with a sample of size n = m = 1000.]

family of wrapped normal distributions and a family of wrapped asymmetric Laplace distributions. For the noise density the mean and concentration parameters are uniformly distributed in [0, 1). For the density of interest, the location parameter is uniformly distributed in [0, 1), and both the scale and the asymmetry parameters follow a Γ distribution. Note that these families differ from the ones used to calibrate the constant ∆. Next we generate a sample (ε_i)_{i∈⟦m⟧} of size m = 1000 from ϕ, and a sample (Y_i)_{i∈⟦n⟧} of size n = 1000 from g = f ⊛ ϕ. For a range of subsamples with different sample sizes we construct the estimators f̂_ŵ and f̂_w̆ as in Theorem 3.4 and compute their L²-losses. Given the results of the repetitions, for the different values of n and m we use the sample of L²-losses to compute estimators of the mean squared error and the quantiles of the distribution of the L²-loss. The evolution of the L²-loss for the weighted sum estimator with Bayesian weights or model selection weights, and their ratio, is represented in fig. 3 when the sample sizes n and m vary. In figs. 3a and 3b, both empirical errors decrease nicely as n and m increase. In fig. 3c, it seems that on smaller sample sizes the estimator with Bayesian weights performs better than the one with model selection weights, while the opposite happens for larger samples. In fig. 4 more attention is given to the spread of the L²-loss around its empirical mean. The three columns (from left to right) refer to the estimator with Bayesian or model selection weights, and their ratio. In the first row (figs. 4a to 4c) the noise sample size is fixed at m = 500 and in each graph the sample size n increases. In the second row (figs.
4d to 4f) both samples have the same size m = n, which again increases in each graph. In the last row (figs. 4g to 4i) the size of the noisy sample is fixed at n = 500 and in each graph the sample size m of the noise increases. These graphics show that the distribution of the L²-losses is skewed. However, in all cases both estimators behave reasonably. The simulations were performed with the R software, using the libraries 'circular', 'ggplot2', 'reshape2', 'foreach', and 'doParallel' (see Agostinelli and Lund [2017], Corporation

[Figure 3: Empirical Bayesian risk for Bayesian (a) or model selection (b) weights and their ratio (c) over the replicates as a function of the sample size n (abscissa) and m (ordinate).]
[Figure 4: Empirical convergence rate for weighted sum estimators with Bayesian and model selection weights, and their ratio.]

and Weston [2019], Microsoft and Weston [2020], R Core Team [2018], Wickham [2007, 2016]). All the scripts are available upon request to the authors.
Appendix
A Preliminaries
This section gathers technical results. The next result is due to Johannes et al. [2020].
Lemma
A.1.
Given n ∈ ℕ and f̌, f̂ ∈ L², consider the families of orthogonal projections { f̂_k = Π_k f̂, k ∈ ⟦n⟧ } and { f̌_k = Π_k f̌, k ∈ ⟦n⟧ }. If ‖Π^⊥_k f̌‖²_{L²} = ‖Π^⊥ f̌‖²_{L²} b_k(f̌) for all k ∈ ⟦n⟧, then for any l ∈ ⟦n⟧ holds

(i) ‖f̂_k‖²_{L²} − ‖f̂_l‖²_{L²} ≤ ‖f̂_l − f̌_l‖²_{L²} − ‖Π^⊥ f̌‖²_{L²} { b_k(f̌) − b_l(f̌) } for all k ∈ ⟦l⟦;
(ii) ‖f̂_k‖²_{L²} − ‖f̂_l‖²_{L²} ≤ ‖f̂_k − f̌_k‖²_{L²} + ‖Π^⊥ f̌‖²_{L²} { b_l(f̌) − b_k(f̌) } for all k ∈ ⟧l, n⟧.

The next assertion provides our key arguments in order to control the deviations of the remainder terms. Both inequalities are due to Talagrand [1996]; the formulation of the first part, eq. (A.2), can be found for example in Klein and Rio [2005], while the second part, eq. (A.3), is based on equation (5.13) in Corollary 2 in Birgé and Massart [1998] and is stated in this form for example in Comte and Merlevède [2002].
Lemma
A.2. (Talagrand's inequalities) Let (Z_i)_{i∈⟦n⟧} be independent Z-valued random variables and let ν̄_h = n^{-1} Σ_{i∈⟦n⟧} [ν_h(Z_i) − E(ν_h(Z_i))] for ν_h belonging to a countable class {ν_h, h ∈ H} of measurable functions. If the following conditions are satisfied,

  sup_{h∈H} sup_{z∈Z} |ν_h(z)| ≤ ψ,  E(sup_{h∈H} |ν̄_h|) ≤ Ψ,  sup_{h∈H} n^{-1} Σ_{i∈⟦n⟧} Var(ν_h(Z_i)) ≤ τ,   (A.1)

then there is a universal numerical constant C > 0 such that

  E( sup_{h∈H} |ν̄_h|² − 6Ψ² )₊ ≤ C [ (τ/n) exp(−nΨ²/(6τ)) + (ψ²/n²) exp(−nΨ/(100ψ)) ],   (A.2)

  P( sup_{h∈H} |ν̄_h|² ≥ 6Ψ² ) ≤ 3 [ exp(−nΨ²/(6τ)) + exp(−nΨ/(200ψ)) ].   (A.3)

Remark A.3. Introduce the unit ball B_k := { h ∈ U_k : ‖h‖_{L²} ≤ 1 } contained in the linear subspace U_k. Setting ν_h(Y) = Σ_{|j|∈⟦k⟧} [h]_j [ϕ]^{-1}_j e_j(−Y) we have

  ‖f̃_k − f_k‖_{L²} = sup_{h∈B_k} | Σ_{|j|∈⟦k⟧} [ϕ]^{-1}_j { n^{-1} Σ_{i∈⟦n⟧} (e_j(−Y_i) − [g]_j) } [h]_j | = sup_{h∈B_k} |ν̄_h|.

The last identity provides the necessary argument to link the next Lemmata A.4 and A.5 and Talagrand's inequalities in Lemma A.2. Note that the unit ball B_k is not a countable set of functions; however, it contains a countable dense subset, say H, since L² is separable, and it is straightforward to see that sup_{h∈B_k} |ν̄_h| = sup_{h∈H} |ν̄_h|. The proof of Lemma A.4 given in Johannes and Schwarz [2013] makes use of Lemma A.2 by computing the quantities ψ, Ψ, and τ which verify the three inequalities (A.1). We provide in Lemma A.5 a slight modification of this result, following along the lines of the proof of Lemma A.4 in Johannes and Schwarz [2013].
A.4.
Let Φ ( k ) = max { Φ j , j ∈ (cid:74) k (cid:75) } , λ Φ k (cid:62) and k Λ Φ k = λ Φ k k Φ ( k ) , then there is anumerical constant C such that for all n ∈ N and k ∈ (cid:74) n (cid:75) holds (i) E n f,ϕ (cid:0) (cid:107) (cid:101) f k − f k (cid:107) L − k Λ Φ k n − (cid:1) + (cid:54) C (cid:20) (cid:107) [ g ] (cid:107) (cid:96) Φ ( k ) n exp (cid:0) − λ Φ k k (cid:107) [ g ] (cid:107) (cid:96) (cid:1) + k Φ ( k ) n exp (cid:0) − √ nλ Φ k (cid:1)(cid:21) (ii) P n f,ϕ (cid:0) (cid:107) (cid:101) f k − f k (cid:107) L (cid:62) k Λ Φ k n − (cid:1) (cid:54) (cid:20) exp (cid:0) − λ Φ k k (cid:107) [ g ] (cid:107) (cid:96) (cid:1) + exp (cid:0) − √ nλ Φ k (cid:1)(cid:21) Lemma
A.5.
Consider (cid:98) f k − ˇ f k = (cid:80) | j |∈ (cid:74) k (cid:75) (cid:99) [ ϕ ] + j ( (cid:99) [ g ] j − [ g ] j ) e j . Denote by P n Y | ε and E n Y | ε theconditional distribution and expectation, respectively, of ( Y i ) i ∈ (cid:74) n (cid:75) given ( ε i ) i ∈ (cid:74) m (cid:75) . Let (cid:98) Φ j = | (cid:99) [ ϕ ] + j | , (cid:98) Φ k = k (cid:80) j ∈ (cid:74) k (cid:75) (cid:98) Φ j , (cid:98) Φ ( k ) = max j ∈ (cid:74) k (cid:75) (cid:98) Φ j , Λ (cid:98) Φ k = λ (cid:98) Φ k k (cid:98) Φ ( k ) and λ (cid:98) Φ k (cid:62) . Then there is anumerical constant C such that for all n ∈ N and k ∈ (cid:74) n (cid:75) holds (i) E n Y | ε (cid:0) (cid:107) (cid:98) f k − ˇ f k (cid:107) L − (cid:98) Φ k n − (cid:1) + (cid:54) C (cid:20) (cid:107) [ g ] (cid:107) (cid:96) (cid:98) Φ ( k ) n exp (cid:0) − λ (cid:98) Φ k k (cid:107) [ g ] (cid:107) (cid:96) (cid:1) + k (cid:98) Φ ( k ) n exp (cid:0) − (cid:113) nλ (cid:98) Φ k (cid:1)(cid:21) (ii) P n Y | ε (cid:0) (cid:107) (cid:98) f k − ˇ f k (cid:107) L (cid:62) (cid:98) Φ k n − (cid:1) (cid:54) (cid:20) exp (cid:0) − λ (cid:98) Φ k k (cid:107) [ g ] (cid:107) (cid:96) (cid:1) + exp (cid:0) − (cid:113) nλ (cid:98) Φ k (cid:1)(cid:21) Proof of Lemma A.5.
For h ∈ B_k set ν_h(Y) = Σ_{|j|∈⟦k⟧} [h]_j [ϕ̂]^+_j e_j(−Y), where E^n_{Y|ε} ν_h(Y) = Σ_{|j|∈⟦k⟧} [h]_j [ϕ̂]^+_j [g]_j and ‖f̂_k − f̌_k‖_{L²} = sup_{h∈B_k} |ν̄_h| (see Remark A.3). We intend to apply Lemma A.2, and therefore we next compute quantities ψ, Ψ, and τ verifying the three inequalities required there. First, we have sup_{h∈B_k} sup_{y∈[0,1)} |ν_h(y)|² ≤ 2 Σ_{j∈⟦k⟧} Φ̂_j ≤ 2k Φ̂_(k) =: ψ². Next, we find Ψ. Exploiting sup_{h∈B_k} |⟨f̂_k − f̌_k, h⟩_{L²}|² = Σ_{|j|∈⟦k⟧} Φ̂_{|j|} |[ĝ]_j − [g]_j|² and E^n_{Y|ε} |[ĝ]_j − [g]_j|² ≤ n^{-1}, it holds E^n_{Y|ε}( sup_{h∈B_k} |⟨f̂_k − f̌_k, h⟩_{L²}|² ) ≤ Σ_{|j|∈⟦k⟧} Φ̂_{|j|}/n ≤ 2k Φ̂_k/n =: Ψ². Finally, consider τ. Using E_{Y|ε}( e_j(Y) e_{j'}(−Y) ) = [g]_{j'−j}, for each h ∈ B_k holds

  E_{Y|ε} |ν_h(Y)|² = Σ_{|j|,|j'|∈⟦k⟧} [h]_j [ϕ̂]^+_j [g]_{j'−j} conj([ϕ̂]^+_{j'} [h]_{j'}) = ⟨ U_k Â U_k [h], [h] ⟩_{ℓ²},

defining the Hermitian and positive semi-definite matrix Â := ( [ϕ̂]^+_j conj([ϕ̂]^+_{j'}) [g]_{j'−j} )_{j,j'∈ℤ} and the mapping U_k : ℂ^ℤ → ℂ^ℤ with z ↦ U_k z = ( z_l 1{|l| ∈ ⟦k⟧} )_{l∈ℤ}. Obviously, U_k is an orthogonal projection from ℓ² onto the linear subspace spanned by all ℓ²-sequences with support on the index set ⟦−k, −1⟧ ∪ ⟦k⟧.
Straightforward algebra shows sup h ∈ B k n (cid:88) i ∈ (cid:74) n (cid:75) V ar Y | ε ( ν h ( Y i )) (cid:54) sup h ∈ B k (cid:104)U k (cid:98) A U k [ h ] , [ h ] (cid:105) (cid:96) = sup h ∈ B k (cid:107)U k (cid:98) A U k [ h ] (cid:107) (cid:96) (cid:54) (cid:107)U k (cid:98) A U k (cid:107) s . where (cid:107) M (cid:107) s := sup (cid:107) x (cid:107) (cid:96) (cid:54) (cid:107) M x (cid:107) (cid:96) denotes the spectral-norm of a linear M : (cid:96) → (cid:96) . For asequence z ∈ C Z let ∇ z be the multiplication operator given by ∇ z x := ( z j x j ) j ∈ Z . Clearly,we have U k (cid:98) A U k = U k ∇ (cid:99) [ ϕ ] + U k C [ g ] U k ∇ (cid:99) [ ϕ ] + U k , where C [ g ] := ([ g ] j − j (cid:48) ) j,j (cid:48) ∈ Z . Consequently, sup h ∈ B k n (cid:88) i ∈ (cid:74) n (cid:75) V ar Y | ε ( ν h ( Y i )) (cid:54) (cid:107)U k ∇ (cid:99) [ ϕ ] + U k (cid:107) s (cid:107)C [ g ] (cid:107) s (cid:107)U k ∇ (cid:99) [ ϕ ] + U k (cid:107) s = (cid:107)U k ∇ (cid:99) [ ϕ ] + U k (cid:107) s (cid:107)C [ g ] (cid:107) s , where (cid:107)U k ∇ (cid:99) [ ϕ ] + U k (cid:107) s = max { (cid:98) Φ j , j ∈ (cid:74) k (cid:75) } = (cid:98) Φ ( k ) . For ( C [ g ] z ) k := (cid:80) j ∈ Z [ g ] j − k z j , k ∈ Z it iseasily verified that (cid:107)C [ g ] z (cid:107) (cid:96) (cid:54) (cid:107) [ g ] (cid:107) (cid:96) (cid:107) z (cid:107) (cid:96) and hence (cid:107)C [ g ] (cid:107) s (cid:54) (cid:107) [ g ] (cid:107) (cid:96) , which implies sup h ∈ B k n (cid:88) i ∈ (cid:74) n (cid:75) V ar Y | ε ( ν h ( Y i )) (cid:54) (cid:107) [ g ] (cid:107) (cid:96) (cid:98) Φ ( k ) =: τ. Replacing in Remark A.3 (A.1) and (A.2) the quantities ψ, Ψ and τ together with k Λ (cid:98) Φ k = λ (cid:98) Φ k k (cid:98) Φ ( k ) gives the assertion (i) and (ii) , which completes the proof. Lemma
A.6.
There is a finite numerical constant C > 0 such that for all j ∈ Z hold
(i) m E^m_ε |[ϕ]_j − [ϕ̂]_j|² ≤ C and E^m_ε(|[ϕ]_j [ϕ̂]⁺_j|²) ≤ C;
(ii) P^m_ε(|[ϕ̂]_j|² < 1/m) ≤ C(1 ∧ Φ_j/m);
(iii) E^m_ε(|[ϕ]_j − [ϕ̂]_j|² |[ϕ̂]⁺_j|²) ≤ C(1 ∧ Φ_j/m).
Given k ∈ N, for all j ∈ ⟦k⟧ we have
(iv) P^m_ε(|[ϕ̂]_j/[ϕ]_j − 1| > 1/3) ≤ 4 exp(−m|[ϕ]_j|²/18) ≤ 4 exp(−m/(18 Φ_(k))).
Proof of Lemma A.6. The elementary properties (i)–(iii) are shown, for example, in Johannes and Schwarz [2013], and the assertion (iv) follows directly from Hoeffding's inequality.
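The properties above concern the thresholded (spectral cut-off) quantity [ϕ̂]⁺_j, which inverts the empirical Fourier coefficient of the noise only where it is sufficiently large. A minimal simulation sketch of this construction — the Beta noise sample and the threshold |[ϕ̂]_j|² ≥ 1/m are illustrative assumptions — checking the dichotomy Φ̂_j = 0 or 1 ≤ Φ̂_j ≤ m exploited in Lemma A.8:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
eps = rng.beta(2, 5, size=m)          # hypothetical noise sample on [0, 1)
js = np.arange(-10, 11)

# empirical Fourier coefficients of the noise density
hat_phi = np.array([np.exp(-2j * np.pi * k * eps).mean() for k in js])

# spectral cut-off: invert only where |hat_phi_j|^2 >= 1/m, else set to 0
keep = np.abs(hat_phi) ** 2 >= 1.0 / m
hat_phi_plus = np.where(keep, 1.0 / np.where(keep, hat_phi, 1.0), 0.0)
hat_Phi = np.abs(hat_phi_plus) ** 2

# dichotomy: each hat_Phi_j is either exactly 0 or lies in [1, m]
assert np.all((hat_Phi == 0) | ((hat_Phi >= 1 - 1e-9) & (hat_Phi <= m * (1 + 1e-9))))
```

The lower bound holds because |[ϕ̂]_j| ≤ 1 (a mean of unit-modulus terms), the upper bound by the threshold itself.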
Lemma
A.7.
Let m, k ∈ N and set Ω_k := {9/16 ≤ Φ̂_j/Φ_j ≤ 9/4, ∀ j ∈ ⟦k⟧}.
(i) If Φ_(k) ≤ (4/9)m, then P^m_ε(Ω^c_k) ≤ 4k exp(−m/(18 Φ_(k))).
(ii) For m_k := ⌊(9/4)Φ_(k)⌋ holds P^m_ε(Ω^c_k) ≤ (555 k m_k⁶ m⁻⁶) ∧ (12 k m_k² m⁻²) for all m ∈ N.
(iii) If m ≥
289 log(k+2) λ^Φ_k Φ_(k), then P^m_ε(Ω^c_k) ≤ (11226 m⁻⁶) ∧ (53 m⁻²).
Proof of Lemma A.7. We start our proof with the observation that for each j ∈ Z with Φ_j ≤ (4/9)m holds
{|[ϕ̂]_j/[ϕ]_j − 1| ≤ 1/3} ⊆ {3/4 ≤ |[ϕ]_j [ϕ̂]⁺_j| ≤ 3/2} ⊆ {9/16 ≤ Φ̂_j/Φ_j ≤ 9/4}.
Consequently, if Φ_(k) ≤ (4/9)m, then Ω^c_k ⊂ ⋃_{j∈⟦k⟧} {|[ϕ̂]_j/[ϕ]_j − 1| > 1/3}, and hence (i) follows from Lemma A.6 (iv). Consider (ii). Given k ∈ N and m_k := ⌊(9/4)Φ_(k)⌋ ∈ N we distinguish for m ∈ N the cases (a) m > m_k and (b) m ∈ ⟦m_k⟧. In case (a) it holds Φ_(k) ≤ (4/9)m, and hence (i) implies (ii). In case (b), (ii) holds trivially, since P^m_ε(Ω^c_k) ≤ 1 ≤ (m_k⁶ m⁻⁶) ∧ (m_k² m⁻²). Consider (iii). Since m ≥
289 log(k+2) λ^Φ_k Φ_(k) ≥ (9/4) Φ_(k), from (i) follows
m⁶ P^m_ε(Ω^c_k) ≤ 4k m⁶ exp(−m/(18 Φ_(k))) ≤ 4k (216 Φ_(k)/e)⁶ exp(−m/(36 Φ_(k))) ≤ 11226,
and analogously m² P^m_ε(Ω^c_k) ≤ 53, which completes the proof.

Lemma
A.8.
Consider for any l ∈ N the event Ω_l := {9/16 ≤ Φ⁻¹_j Φ̂_j ≤ 9/4, ∀ j ∈ ⟦l⟧}. For each l ∈ ⟦n⟧ and k ∈ ⟦l⟦, setting ‖Π_{kl} f̌_n‖²_{L²} := ∑_{|j|∈⟧k,l⟧} Φ⁻¹_j Φ̂_j |[f]_j|², hold
(i) ‖Π_{kl} f̌_n‖²_{L²} ≤ ‖Π_{kn} f̌_n‖²_{L²} = ‖Π⊥_k f̌_n‖²_{L²} and ‖Π_{kl} f̌_n‖²_{L²} 𝟙_{Ω_l} ≥ (9/16) ‖Π_⊥f‖²_{L²} (b_k(f) − b_l(f)).
Moreover, for any l ∈ N and k ∈ ⟦l⟧ hold
(ii) Φ̂_(l) ≤ m, (9/16) Φ_(l) 𝟙_{Ω_l} ≤ Φ̂_(l) 𝟙_{Ω_l} ≤ (9/4) Φ_(l), λ^Φ̂_l ≥ 1, (9/16) λ^Φ_l 𝟙_{Ω_l} ≤ λ^Φ̂_l 𝟙_{Ω_l} ≤ (9/4) λ^Φ_l, and hence (9/16)² pen^Φ_k 𝟙_{Ω_l} ≤ pen^Φ̂_k 𝟙_{Ω_l} ≤ (9/4)² pen^Φ_k;
(iii) {Φ̂_(l) < 1} = {Φ̂_(l) = 0}, and hence pen^Φ̂_l = pen^Φ̂_l 𝟙{Φ̂_(l) ≥ 1}.
Proof of Lemma A.8. The assertions (i) and (ii) follow by elementary calculations from the definition of the event Ω_l, and we omit the details. Consider (iii). For each j ∈ Z holds Φ̂_j = |[ϕ̂]⁺_j|² = 0 on the event {|[ϕ̂]_j|² < 1/m} and Φ̂_j ≥ 1 on the complement {|[ϕ̂]_j|² ≥ 1/m}, since |[ϕ̂]_j| ≤ 1. Consequently, {Φ̂_j < 1} = {|[ϕ̂]_j|² < 1/m} = {Φ̂_j = 0}, which implies (iii), and completes the proof.

B Proofs of section 2
Proof of Lemma 2.2.
We start the proof with the observation that [f̃_w]_0 − [f]_0 = 0, and for each j ∈ Z holds [f̃_w]_j − [f]_j = [f̃_w]_{−j} − [f]_{−j}, where [f̃_w]_j − [f]_j = −[f]_j for all |j| > n and [f̃_w]_j − [f]_j = [ϕ]⁻¹_j ([ĝ]_j − [g]_j) P_w(⟦|j|, n⟧) − [f]_j P_w(⟦|j|⟦) for all |j| ∈ ⟦n⟧. Consequently (keep in mind that |[ϕ]⁻¹_j|² = Φ_j), we have
‖f̃_w − f‖²_{L²} ≤ ∑_{|j|∈⟦n⟧} {Φ_j |[ĝ]_j − [g]_j|² P_w(⟦|j|, n⟧)} + ∑_{|j|∈⟦n⟧} |[f]_j|² P_w(⟦|j|⟦) + ∑_{|j|>n} |[f]_j|²,   (B.1)
where we consider the first and the two other terms on the right-hand side separately. Considering the first term we split the sum into two parts. Precisely,
∑_{|j|∈⟦n⟧} Φ_j |[ĝ]_j − [g]_j|² P_w(⟦|j|, n⟧) ≤ ‖f̃_{k₊} − f_{k₊}‖²_{L²} + ∑_{l∈⟧k₊,n⟧} w_l ‖f̃_l − f_l‖²_{L²}
≤ pen^n_{k₊} + ∑_{l∈⟦k₊,n⟧} (‖f̃_l − f_l‖²_{L²} − pen^n_l/2)₊ + ∑_{l∈⟧k₊,n⟧} w_l pen^n_l 𝟙{‖f̃_l − f_l‖²_{L²} ≥ pen^n_l/2} + ∑_{l∈⟧k₊,n⟧} pen^n_l w_l 𝟙{‖f̃_l − f_l‖²_{L²} < pen^n_l/2}.   (B.2)
Considering the second and third term we split the first sum into two parts and obtain
∑_{|j|∈⟦n⟧} |[f]_j|² P_w(⟦|j|⟦) + ∑_{|j|>n} |[f]_j|² ≤ ∑_{|j|∈⟦k₋⟧} |[f]_j|² P_w(⟦|j|⟦) + ∑_{|j|∈⟧k₋,n⟧} |[f]_j|² + ∑_{|j|>n} |[f]_j|² ≤ ‖Π_⊥f‖²_{L²} {P_w(⟦k₋⟦) + b_{k₋}(f)}.
(B.3)
Combining (B.1) with (B.2) and (B.3) we obtain the assertion, which completes the proof.

B.1 Proof of Proposition 2.4 and Corollary 2.5
Proof of Proposition 2.4.
We present the main arguments of the proof of Proposition 2.4; the technical details are gathered in Lemmata B.2 to B.5 at the end of this section. Keeping in mind the definitions (2.5) and (2.12), here and subsequently we use that
R^k_n(b•, Λ^Φ•) ≥ b_k(f) and ∆ R^k_n(b•, Λ^Φ•) ≥ pen^Φ_k for all k ∈ ⟦n⟧.   (B.4)
For arbitrary k⋄₊, k⋄₋ ∈ ⟦n⟧ (to be chosen suitably below) let us define
k₋ := min{k ∈ ⟦k⋄₋⟧ : ‖Π_⊥f‖²_{L²} b_k(f) ≤ ‖Π_⊥f‖²_{L²} b_{k⋄₋}(f) + 4 pen^Φ_{k⋄₋}} and
k₊ := max{k ∈ ⟦k⋄₊, n⟧ : pen^Φ_k ≤ ‖Π_⊥f‖²_{L²} b_{k⋄₊}(f) + 4 pen^Φ_{k⋄₊}},   (B.5)
where the defining set obviously contains k⋄₋ and k⋄₊, respectively, and hence is not empty. We intend to combine the upper bound in (2.8) with the bounds for Bayesian weights w = w̃ as in (2.9) and model selection weights w = w̆ as in (2.10) given in Lemma B.2 and Lemma B.3, respectively.
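To fix ideas, the two shapes of random weights can be contrasted numerically: exponential (Bayesian-type) weights concentrate, as the temperature η grows, on the index selected by the penalised-contrast criterion, which is the η → ∞ limit mentioned before Lemma B.3. The following sketch uses hypothetical contrast and penalty values, not the estimator itself:

```python
import numpy as np

def bayes_weights(contrast, pen, eta, n):
    # exponential weights  w_k  proportional to  exp(-eta*n*(-contrast_k + pen_k))
    s = eta * n * (contrast - pen)
    s -= s.max()                     # numerical stabilisation before exponentiating
    w = np.exp(s)
    return w / w.sum()

contrast = np.array([0.10, 0.40, 0.55, 0.57, 0.58])   # hypothetical ||f~_k||^2 values
pen = np.array([0.01, 0.05, 0.12, 0.30, 0.60])        # hypothetical penalties

k_hat = int(np.argmin(pen - contrast))                # model-selection index
w = bayes_weights(contrast, pen, eta=200.0, n=50)
assert int(np.argmax(w)) == k_hat                     # MS weights = eta -> infinity limit
```

For moderate η the weights spread mass over several competing dimensions, which is precisely what the aggregation exploits.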
First note that, due to Lemma B.2 (i), we have
E^n_{f,ϕ} P_w̃(⟦k₋⟦) ≤ 𝟙{k₋ > 1} (η∆)⁻¹ exp(−(η∆/14) k⋄₋ Λ^Φ_{k⋄₋}) + 𝟙{k₋ > 1} P^n_{f,ϕ}(‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} ≥ pen^Φ_{k⋄₋}/2),
and hence for w = w̃ as in (2.9) follows immediately
E^n_{f,ϕ} ‖f̃_w̃ − f‖²_{L²} ≤ pen^Φ_{k₊} + 2 ‖Π_⊥f‖²_{L²} b_{k₋}(f) + η⁻¹ n⁻¹ + (η∆)⁻¹ ‖Π_⊥f‖²_{L²} 𝟙{k₋ > 1} exp(−(η∆/14) k⋄₋ Λ^Φ_{k⋄₋})
+ 2 ∑_{k∈⟦k⋄₊,n⟧} E^n_{f,ϕ}(‖f̃_k − f_k‖²_{L²} − pen^Φ_k/2)₊ + ∑_{k∈⟦k⋄₊,n⟧} pen^Φ_k P^n_{f,ϕ}(‖f̃_k − f_k‖²_{L²} ≥ pen^Φ_k/2)
+ 2 ‖Π_⊥f‖²_{L²} 𝟙{k₋ > 1} P^n_{f,ϕ}(‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} ≥ pen^Φ_{k⋄₋}/2).   (B.6)
On the other hand, for model selection weights w = w̆ we combine again the upper bound in (2.8) and the bounds given in Lemma B.3.
Clearly, due to Lemma B.3 we have E^n_{f,ϕ} P_w̆(⟦k₋⟦) ≤ P^n_{f,ϕ}(‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} ≥ pen^Φ_{k⋄₋}/2), and hence from (2.8) follows immediately
E^n_{f,ϕ} ‖f̃_k̂ − f‖²_{L²} ≤ pen^Φ_{k₊} + 2 ‖Π_⊥f‖²_{L²} b_{k₋}(f) + 2 ∑_{k∈⟦k⋄₊,n⟧} E^n_{f,ϕ}(‖f̃_k − f_k‖²_{L²} − pen^Φ_k/2)₊
+ ∑_{k∈⟦k⋄₊,n⟧} pen^Φ_k P^n_{f,ϕ}(‖f̃_k − f_k‖²_{L²} ≥ pen^Φ_k/2) + 2 ‖Π_⊥f‖²_{L²} 𝟙{k₋ > 1} P^n_{f,ϕ}(‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} ≥ pen^Φ_{k⋄₋}/2).   (B.7)
The deviations of the last three terms in (B.6) and (B.7) we bound in Lemma B.4 by exploiting usual concentration inequalities. Precisely, we obtain
E^n_{f,ϕ} ‖f̃_w − f‖²_{L²} ≤ pen^Φ_{k₊} + 2 ‖Π_⊥f‖²_{L²} b_{k₋} + C ‖Π_⊥f‖²_{L²} 𝟙{k₋ > 1} exp(−λ^Φ_{k⋄₋} k⋄₋/k_g) + C(‖Π_⊥f‖²_{L²} 𝟙{k₋ > 1} + Φ_(k_g) k_g + Φ_(n_o)) n⁻¹.   (B.8)
Indeed, combining Lemma B.4 and (B.6) for Bayesian weights we obtain
E^n_{f,ϕ} ‖f̃_w̃ − f‖²_{L²} ≤ pen^Φ_{k₊} + 2 ‖Π_⊥f‖²_{L²} b_{k₋} + C ‖Π_⊥f‖²_{L²} 𝟙{k₋ > 1} ((η∆)⁻¹ exp(−(η∆/14) k⋄₋ Λ^Φ_{k⋄₋}) + exp(−λ^Φ_{k⋄₋} k⋄₋/(200‖[g]‖²_{ℓ¹})))
+ C(η⁻¹ + ‖Π_⊥f‖²_{L²} 𝟙{k₋ > 1} + Φ_(k_g) k_g + Φ_(n_o)) n⁻¹.   (B.9)
Therewith, by using that Λ^Φ_{k⋄₋} ≥ λ^Φ_{k⋄₋} and η∆/14 > (200‖[g]‖²_{ℓ¹})⁻¹ (since η ≥ 1 and ‖[g]‖_{ℓ¹} ≥ |[g]_0| = 1), from (B.9) follows the upper bound (B.8). Consider secondly model selection weights w = w̆ as in (2.10).
Combining Lemma B.4, ‖[g]‖_{ℓ¹} ≤ k_g and the upper bound given in (B.7) we obtain (B.8).
From the upper bound (B.8), for a suitable choice of the dimension parameters k⋄₋, k⋄₊ ∈ ⟦n⟧, we derive separately the risk bound in the two cases (p) and (np) considered in Proposition 2.4. The tedious case-by-case analysis for (p) is deferred to Lemma B.5 at the end of this section. In case (np) with k°_n := k°_n(b•, Λ^Φ•) ∈ ⟦n⟧ and R^k_n(b•, Λ^Φ•) as in (2.5) we set k⋄₊ := k°_n and let k⋄₋ ∈ ⟦n⟧ be arbitrary. Keeping (B.4) in mind, the definition (B.5) of k₊ and k₋ implies pen^Φ_{k₊} ≤ (‖Π_⊥f‖²_{L²} + 4∆) R^{k°_n}_n(b•, Λ^Φ•) and ‖Π_⊥f‖²_{L²} b_{k₋} ≤ (‖Π_⊥f‖²_{L²} + 4∆) R^{k⋄₋}_n(b•, Λ^Φ•), which together with
R^{k⋄₋}_n(b•, Λ^Φ•) ≥ R^{k°_n}_n(b•, Λ^Φ•) = R°_n(b•, Λ^Φ•) = min{R^k_n(b•, Λ^Φ•), k ∈ N} ≥ n⁻¹
and exploiting (B.8) implies the assertion (2.14), that is, for all k⋄₋ ∈ ⟦n⟧ holds
E^n_{f,ϕ} ‖f̃_w − f‖²_{L²} ≤ C(1 ∨ ‖Π_⊥f‖²_{L²}) [R^{k⋄₋}_n(b•, Λ^Φ•) ∨ exp(−λ^Φ_{k⋄₋} k⋄₋/k_g)] + C[Φ_(k_g) k_g + Φ_(n_o)] n⁻¹,   (B.10)
with n_o = 15(600)⁴, which completes the proof of Proposition 2.4.

Proof of Corollary 2.5.
Consider the case (p). Under (A1) for all n > n_{f,Φ} we have trivially exp(−λ^Φ_{k⋆_n} k⋆_n/k_g) ≤ n⁻¹, while for n ∈ ⟦n_{f,Φ}⟧ holds exp(−λ^Φ_{k⋆_n} k⋆_n/k_g) ≤ 1 ≤ n_{f,Φ} n⁻¹. Thereby, from (2.13) in Proposition 2.4 follows immediately the assertion (p). In case (np), due to (A2), for k°_n := k°_n(b•, Λ^Φ•) as in (2.5) we have trivially exp(−λ^Φ_{k°_n} k°_n/k_g) ≤ R°_n(b•, Λ^Φ•) for all n > n_{f,Φ}, while for n ∈ ⟦n_{f,Φ}⟧ holds exp(−λ^Φ_{k°_n} k°_n/k_g) ≤ 1 ≤ n R°_n(b•, Λ^Φ•) ≤ n_{f,Φ} R°_n(b•, Λ^Φ•). Thereby, from (2.14) in Proposition 2.4 with R°_n(b•, Λ^Φ•) = min_{k∈⟦n⟧} R^k_n(b•, Λ^Φ•) follows (np), which completes the proof of Corollary 2.5.
Below we state and prove the technical Lemmata B.2 to B.5 used in the proof of Proposition 2.4. The proof of Lemma B.2 is based on Lemma B.1, which we state first.

Lemma
B.1.
Consider Bayesian weights w̃ as in (2.9). Let l ∈ ⟦n⟧.
(i) For all k ∈ ⟦l⟦ holds w̃_k 𝟙{‖f̃_l − f_l‖²_{L²} < pen^Φ_l/2} ≤ exp(ηn{(3/2) pen^Φ_l + ‖Π_⊥f‖²_{L²} b_l(f) − ‖Π_⊥f‖²_{L²} b_k(f) − pen^Φ_k}).
(ii) For all k ∈ ⟧l,n⟧ holds w̃_k 𝟙{‖f̃_k − f_k‖²_{L²} < pen^Φ_k/2} ≤ exp(ηn{−pen^Φ_k/2 + ‖Π_⊥f‖²_{L²} b_l(f) + pen^Φ_l}).
Proof of Lemma B.1. Given k, l ∈ ⟦n⟧ and an event Ω_{kl} (to be specified below) it follows
w̃_k 𝟙_{Ω_{kl}} = [exp(−ηn{−‖f̃_k‖²_{L²} + pen^Φ_k}) / ∑_{l′∈⟦n⟧} exp(−ηn{−‖f̃_{l′}‖²_{L²} + pen^Φ_{l′}})] 𝟙_{Ω_{kl}} ≤ exp(ηn{‖f̃_k‖²_{L²} − ‖f̃_l‖²_{L²} + (pen^Φ_l − pen^Φ_k)}) 𝟙_{Ω_{kl}}.   (B.11)
We distinguish the two cases (i) k ∈ ⟦l⟦ and (ii) k ∈ ⟧l,n⟧. Consider first (i) k ∈ ⟦l⟦. Due to Lemma A.1 (i) (with f̂ := f̃_n and f̌ := f) from (B.11) we conclude
w̃_k 𝟙_{Ω_{kl}} ≤ exp(ηn{‖f̃_l − f_l‖²_{L²} − ‖Π_⊥f‖²_{L²}(b_k(f) − b_l(f)) + (pen^Φ_l − pen^Φ_k)}) 𝟙_{Ω_{kl}}.
Setting Ω_{kl} := {‖f̃_l − f_l‖²_{L²} < pen^Φ_l/2}, the last bound implies
w̃_k 𝟙{‖f̃_l − f_l‖²_{L²} < pen^Φ_l/2} ≤ exp(ηn{pen^Φ_l/2 − ‖Π_⊥f‖²_{L²}(b_k(f) − b_l(f)) + (pen^Φ_l − pen^Φ_k)}).
Rearranging the arguments of the last upper bound we obtain the assertion (i). Consider secondly (ii) k ∈ ⟧l,n⟧.
From Lemma A.1 (ii) (with f̂ := f̃_n and f̌ := f) and (B.11) follows
w̃_k 𝟙_{Ω_{lk}} ≤ exp(ηn{‖f̃_k − f_k‖²_{L²} + ‖Π_⊥f‖²_{L²}(b_l(f) − b_k(f)) + (pen^Φ_l − pen^Φ_k)}) 𝟙_{Ω_{lk}}.
Setting Ω_{lk} := {‖f̃_k − f_k‖²_{L²} < pen^Φ_k/2} and exploiting b_k(f) ≥ 0 we obtain (ii), which completes the proof.

Lemma
B.2.
Consider Bayesian weights w̃ as in (2.9) and penalties (pen^Φ_k)_{k∈⟦n⟧} as in (2.12). For any k⋄₋, k⋄₊ ∈ ⟦n⟧ and associated k₊, k₋ ∈ ⟦n⟧ as in (B.5) hold
(i) P_w̃(⟦k₋⟦) ≤ (η∆)⁻¹ 𝟙{k₋ > 1} exp(−(η∆/14) k⋄₋ Λ^Φ_{k⋄₋}) + 𝟙{‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} ≥ pen^Φ_{k⋄₋}/2};
(ii) ∑_{k∈⟧k₊,n⟧} pen^Φ_k w̃_k 𝟙{‖f̃_k − f_k‖²_{L²} < pen^Φ_k/2} ≤ ∆ η⁻¹ n⁻¹.
Proof of Lemma B.2. Consider (i). Let k₋ ∈ ⟦k⋄₋⟧ be as in (B.5). For the non-trivial case k₋ > 1, from Lemma B.1 (i) with l = k⋄₋ follows for all k < k₋ ≤ k⋄₋
w̃_k 𝟙{‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} < pen^Φ_{k⋄₋}/2} ≤ exp(ηn{−‖Π_⊥f‖²_{L²} b_k(f) + ((3/2) pen^Φ_{k⋄₋} + ‖Π_⊥f‖²_{L²} b_{k⋄₋}(f)) − pen^Φ_k}),
and hence, by exploiting the definition (B.5) of k₋, that is, ‖Π_⊥f‖²_{L²} b_k(f) ≥ ‖Π_⊥f‖²_{L²} b_{k₋−1}(f) > ‖Π_⊥f‖²_{L²} b_{k⋄₋}(f) + 4 pen^Φ_{k⋄₋}, we obtain for each k ∈ ⟦k₋⟦
w̃_k 𝟙{‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} < pen^Φ_{k⋄₋}/2} ≤ exp(−(5/2) ηn pen^Φ_{k⋄₋} − ηn pen^Φ_k).
The last upper bound together with pen^Φ_k = ∆ k Λ^Φ_k n⁻¹ ≥ ∆ k n⁻¹, k ∈ ⟦n⟧, as in (2.11) gives
P_w̃(⟦k₋⟦) ≤ P_w̃(⟦k₋⟦) 𝟙{‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} < pen^Φ_{k⋄₋}/2} + 𝟙{‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} ≥ pen^Φ_{k⋄₋}/2}
≤ exp(−(5/2) ηn pen^Φ_{k⋄₋}) ∑_{k∈⟦k₋⟦} exp(−η∆k) + 𝟙{‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} ≥ pen^Φ_{k⋄₋}/2},
which combined with ∑_{k∈N} exp(−µk) ≤ µ⁻¹ for any µ > 0 implies (i). Consider (ii). Let k₊ ∈ ⟦k⋄₊, n⟧ be as in (B.5). For the non-trivial case k₊ < n, from Lemma B.1 (ii) with l = k⋄₊ follows for all k > k₊ ≥ k⋄₊
w̃_k 𝟙{‖f̃_k − f_k‖²_{L²} < pen^Φ_k/2} ≤ exp(ηn{−pen^Φ_k/2 + ‖Π_⊥f‖²_{L²} b_{k⋄₊}(f) + pen^Φ_{k⋄₊}}).
Exploiting the definition (B.5) of k₊, that is, pen^Φ_k ≥ pen^Φ_{k₊+1} > ‖Π_⊥f‖²_{L²} b_{k⋄₊}(f) + 4 pen^Φ_{k⋄₊}, we obtain for each k ∈ ⟧k₊,n⟧
w̃_k 𝟙{‖f̃_k − f_k‖²_{L²} < pen^Φ_k/2} ≤ exp(−ηn pen^Φ_k/4).
Consequently, using pen^Φ_k = ∆ k λ^Φ_k Φ_(k) n⁻¹, k ∈ ⟦n⟧, as in (2.11) implies
∑_{k∈⟧k₊,n⟧} pen^Φ_k w̃_k 𝟙{‖f̃_k − f_k‖²_{L²} < pen^Φ_k/2} ≤ ∆ n⁻¹ ∑_{k∈⟧k₊,n⟧} k λ^Φ_k Φ_(k) exp(−(η∆/4) k λ^Φ_k Φ_(k)).   (B.12)
Exploiting that (λ^Φ_k)^{1/2} = log(kΦ_(k) ∨ (k+2))/log(k+2) ≥ 1 and kΦ_(k) ≤ exp((λ^Φ_k)^{1/2} log(k+2)) for each k ∈ N, together with ∆ ≥ 9 and η ≥ 1, for all k ∈ N holds (η∆/4) k − log(k+2) ≥ 1.
Making furtheruse of the elementary inequality a exp( − ab ) (cid:54) exp( − b ) for a, b (cid:62) it follows λ Φ k k Φ ( k ) exp (cid:0) − η ∆4 λ Φ k k Φ ( k ) (cid:1) (cid:54) λ Φ k exp (cid:0) − η ∆4 λ Φ k k Φ ( k ) + (cid:113) λ Φ k log( k + 2) (cid:1) (cid:54) λ Φ k exp (cid:0) − λ Φ k ( η ∆4 k − log( k + 2)) (cid:1) (cid:54) exp (cid:0) − ( η ∆4 k − log( k + 2)) (cid:1) = ( k + 2) exp (cid:0) − η ∆4 k (cid:1) . which with (cid:80) k ∈ N µk exp( − µk ) (cid:54) and (cid:80) k ∈ N µ exp( − µk ) (cid:54) for any µ > implies (cid:88) k ∈ (cid:75) k + ,n (cid:75) λ Φ k k Φ ( k ) exp (cid:0) − η ∆4 λ Φ k k Φ ( k ) (cid:1) (cid:54) ∞ (cid:88) k = k + +1 ( k + 2) exp (cid:0) − η ∆4 k (cid:1) (cid:54) η . Combining the last bound and (B.12) we obtain assertion (ii), which completes the proof.The next result can be directly deduced from Lemma B.2 by letting η → ∞ . However,we think the direct proof given in Lemma B.3 provides an interesting illustration of the values k + , k − ∈ (cid:74) n (cid:75) as defined in (B.5). Lemma
B.3.
Consider model selection weights w̆ as in (2.10) and penalties (pen^Φ_k)_{k∈⟦n⟧} as in (2.12). For any k⋄₋, k⋄₊ ∈ ⟦n⟧ and associated k₊, k₋ ∈ ⟦n⟧ as in (B.5) hold
(i) P_w̆(⟦k₋⟦) 𝟙{‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} < pen^Φ_{k⋄₋}/2} = 0;
(ii) ∑_{k∈⟧k₊,n⟧} pen^Φ_k w̆_k 𝟙{‖f̃_k − f_k‖²_{L²} < pen^Φ_k/2} = 0.
Proof of Lemma B.3. By definition of k̂ it holds −‖f̃_k̂‖²_{L²} + pen^Φ_k̂ ≤ −‖f̃_k‖²_{L²} + pen^Φ_k for all k ∈ ⟦n⟧, and hence
‖f̃_k̂‖²_{L²} − ‖f̃_k‖²_{L²} ≥ pen^Φ_k̂ − pen^Φ_k for all k ∈ ⟦n⟧.   (B.13)
Consider (i). Let k₋ ∈ ⟦k⋄₋⟧ be as in (B.5). For the non-trivial case k₋ > 1 it is sufficient to show that {k̂ ∈ ⟦k₋⟦} ⊆ {‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} ≥ pen^Φ_{k⋄₋}/2} holds. On the event {k̂ ∈ ⟦k₋⟦} we have 1 ≤ k̂ < k₋ ≤ k⋄₋, and thus the definition (B.5) of k₋ implies
‖Π_⊥f‖²_{L²} b_k̂(f) ≥ ‖Π_⊥f‖²_{L²} b_{k₋−1}(f) > ‖Π_⊥f‖²_{L²} b_{k⋄₋}(f) + 4 pen^Φ_{k⋄₋}.   (B.14)
On the other hand, from Lemma A.1 (i) (with f̂ := f̃_n and f̌ := f) follows
‖f̃_k̂‖²_{L²} − ‖f̃_{k⋄₋}‖²_{L²} ≤ ‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} − ‖Π_⊥f‖²_{L²}{b_k̂(f) − b_{k⋄₋}(f)}.
(B.15)
Combining first (B.13) and (B.15), and secondly (B.14) with pen^Φ_k̂ ≥ 0, we conclude
‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} ≥ pen^Φ_k̂ − pen^Φ_{k⋄₋} + ‖Π_⊥f‖²_{L²}{b_k̂(f) − b_{k⋄₋}(f)} > 3 pen^Φ_{k⋄₋} ≥ pen^Φ_{k⋄₋}/2,
hence {k̂ ∈ ⟦k₋⟦} ⊆ {‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} ≥ pen^Φ_{k⋄₋}/2}, which shows (i). Consider (ii). Let k₊ ∈ ⟦k⋄₊,n⟧ be as in (B.5). For the non-trivial case k₊ < n it is sufficient to show that {k̂ ∈ ⟧k₊,n⟧} ⊆ {‖f̃_k̂ − f_k̂‖²_{L²} ≥ pen^Φ_k̂/2}. On the event {k̂ ∈ ⟧k₊,n⟧} holds k̂ > k₊ ≥ k⋄₊, and thus the definition (B.5) of k₊ implies
pen^Φ_k̂ ≥ pen^Φ_{k₊+1} > ‖Π_⊥f‖²_{L²} b_{k⋄₊}(f) + 4 pen^Φ_{k⋄₊},   (B.16)
and due to Lemma A.1 (ii) (with f̂ := f̃_n and f̌ := f) also
‖f̃_k̂‖²_{L²} − ‖f̃_{k⋄₊}‖²_{L²} ≤ ‖f̃_k̂ − f_k̂‖²_{L²} + ‖Π_⊥f‖²_{L²}{b_{k⋄₊}(f) − b_k̂(f)}.   (B.17)
Combining first (B.13) and (B.17), and secondly (B.16) with b_k̂(f) ≥ 0, it follows that
‖f̃_k̂ − f_k̂‖²_{L²} ≥ pen^Φ_k̂ − pen^Φ_{k⋄₊} − ‖Π_⊥f‖²_{L²}{b_{k⋄₊}(f) − b_k̂(f)} > pen^Φ_k̂/2,
hence {k̂ ∈ ⟧k₊,n⟧} ⊆ {‖f̃_k̂ − f_k̂‖²_{L²} ≥ pen^Φ_k̂/2}, which shows (ii) and completes the proof.

Lemma
B.4.
Consider (pen^Φ_k)_{k∈⟦n⟧} as in (2.12) with ∆ ≥ 1. Let k_g := ⌊200‖[g]‖²_{ℓ¹}⌋ and n_o := 15(600)⁴. There exists a finite numerical constant C > 0 such that for all n ∈ N and all k⋄₋ ∈ ⟦n⟧ hold
(i) ∑_{k∈⟦n⟧} E^n_{f,ϕ}(‖f̃_k − f_k‖²_{L²} − pen^Φ_k/2)₊ ≤ C n⁻¹ (Φ_(k_g) k_g + Φ_(n_o));
(ii) ∑_{k∈⟦n⟧} pen^Φ_k P^n_{f,ϕ}(‖f̃_k − f_k‖²_{L²} ≥ pen^Φ_k/2) ≤ C n⁻¹ (Φ_(k_g) k_g + Φ_(n_o));
(iii) P^n_{f,ϕ}(‖f̃_{k⋄₋} − f_{k⋄₋}‖²_{L²} ≥ pen^Φ_{k⋄₋}/2) ≤ C (exp(−λ^Φ_{k⋄₋} k⋄₋/(200‖[g]‖²_{ℓ¹})) + n⁻¹).
Proof of Lemma B.4. We show below that for kΛ^Φ_k = λ^Φ_k kΦ_(k) with λ^Φ_k ≥ 1 as in (2.11) there is a numerical constant C such that for all n ∈ N and k ∈ ⟦n⟧ hold
(a) ∑_{k∈⟦n⟧} E^n_{f,ϕ}(‖f̃_k − f_k‖²_{L²} − kΛ^Φ_k/n)₊ ≤ C n⁻¹ (Φ_(k_g) k_g + Φ_(n_o));
(b) ∑_{k∈⟦n⟧} kΛ^Φ_k P^n_{f,ϕ}(‖f̃_k − f_k‖²_{L²} ≥ kΛ^Φ_k/n) ≤ C (Φ_(k_g) k_g + Φ_(n_o));
(c) P^n_{f,ϕ}(‖f̃_k − f_k‖²_{L²} ≥ kΛ^Φ_k/n) ≤ C (exp(−λ^Φ_k k/(200‖[g]‖²_{ℓ¹})) + n⁻¹).
Since pen^Φ_k/2 ≥ kΛ^Φ_k n⁻¹ for all k ∈ ⟦n⟧, the bounds (a), (b) and (c), respectively, imply immediately Lemma B.4 (i), (ii) and (iii). In the sequel we use without further reference that kΦ_(k) ≤ exp((λ^Φ_k)^{1/2} log(k+2)) and λ^Φ_k ≥ 1 for each k ∈ N.
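The bounds (a)–(c) below lean on two further elementary estimates, ∑_{k≥1}(k+2)exp(−µk) ≤ exp(µ)µ⁻² + 2µ⁻¹ and n^b exp(−a n^{1/c}) ≤ (cb/(ea))^{cb}; both can be verified numerically (a sketch, with arbitrary test parameters):

```python
import math

# bound 1: sum_{k>=1} (k+2)*exp(-mu*k) <= exp(mu)/mu^2 + 2/mu
def weighted_tail(mu, terms=200000):
    return sum((k + 2) * math.exp(-mu * k) for k in range(1, terms + 1))

for mu in (0.2, 0.7, 1.5, 4.0):
    assert weighted_tail(mu) <= math.exp(mu) / mu**2 + 2.0 / mu

# bound 2: n^b * exp(-a*n^(1/c)) <= (c*b/(e*a))^(c*b), maximised at n = (c*b/a)^c
def poly_exp_bound(a, b, c):
    return (c * b / (math.e * a)) ** (c * b)

for a, b, c in ((1.0, 2.0, 2.0), (0.5, 1.0, 3.0), (2.0, 3.0, 1.0)):
    worst = max(n**b * math.exp(-a * n ** (1.0 / c)) for n in range(1, 5000))
    assert worst <= poly_exp_bound(a, b, c) * (1 + 1e-9)
```

For (a, b, c) = (1, 2, 2) the maximum is attained exactly at n = 16, which is why a small relative slack is allowed.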
Considering (a) we show that
∑_{k∈⟦n⟧} Φ_(k) exp(−λ^Φ_k k/(200‖[g]‖²_{ℓ¹})) ≤ C Φ_(k_g) ‖[g]‖²_{ℓ¹} and ∑_{k∈⟦n⟧} kΦ_(k) n exp(−(nλ^Φ_k)^{1/2}) ≤ C Φ_(n_o) n_o   (B.18)
hold for all n ∈ N, where a combination of the last bounds and Lemma A.4 (i) implies directly (a). We decompose the first sum in (B.18) into two parts which we bound separately. Exploiting that ∑_{k∈N} exp(−µk) ≤ µ⁻¹ for any µ > 0 and setting k̃_g := ⌊100‖[g]‖²_{ℓ¹}⌋ holds
∑_{k∈⟦k̃_g⟧} Φ_(k) exp(−λ^Φ_k k/(200‖[g]‖²_{ℓ¹})) ≤ Φ_(k̃_g) ∑_{k∈⟦k̃_g⟧} exp(−k/(200‖[g]‖²_{ℓ¹})) ≤ 200 Φ_(k̃_g) ‖[g]‖²_{ℓ¹}.   (B.19)
On the other hand, for any k > k̃_g holds λ^Φ_k k/(200‖[g]‖²_{ℓ¹}) ≥ (λ^Φ_k)^{1/2} log(k+2), implying Φ_(k) exp(−λ^Φ_k k/(200‖[g]‖²_{ℓ¹})) ≤ exp(−k/(400‖[g]‖²_{ℓ¹})), and hence
∑_{k∈⟧k̃_g,n⟧} Φ_(k) exp(−λ^Φ_k k/(200‖[g]‖²_{ℓ¹})) ≤ ∑_{k∈⟧k̃_g,n⟧} exp(−k/(400‖[g]‖²_{ℓ¹})) ≤ 400 ‖[g]‖²_{ℓ¹}.
The last bound, (B.19) and k̃_g ≤ k_g imply together the first bound in (B.18). Considering the second bound, for n ∈ N we distinguish the following two cases: (a) n > ñ_o := 15(200)⁴ and (b) n ∈ ⟦ñ_o⟧. Firstly, consider (a), where √n ≥
200 log(n+2), and hence
∑_{k∈⟦n⟧} kΦ_(k) n exp(−(nλ^Φ_k)^{1/2}) ≤ ∑_{k∈⟦n⟧} n exp(−(λ^Φ_k)^{1/2}[√n − log(k+2)]) ≤ ∑_{k∈⟦n⟧} n⁻² ≤ 1.   (B.20)
Secondly, considering (b) n ∈ ⟦ñ_o⟧ holds ∑_{k∈⟦n⟧} kΦ_(k) n exp(−(nλ^Φ_k)^{1/2}) ≤ ñ_o Φ_(ñ_o) ≤ n_o Φ_(n_o), since Φ_(n) ≤ Φ_(ñ_o) ≤ Φ_(n_o). Combining (B.20) and the last bound for the two cases (a) n > ñ_o and (b) n ∈ ⟦ñ_o⟧ we obtain the second bound in (B.18). Consider (b). We show that
∑_{k∈⟦n⟧} kλ^Φ_k Φ_(k) exp(−λ^Φ_k k/(200‖[g]‖²_{ℓ¹})) ≤ C Φ_(k_g) k_g and ∑_{k∈⟦n⟧} kλ^Φ_k Φ_(k) exp(−(nλ^Φ_k)^{1/2}) ≤ C Φ_(n_o) n_o   (B.21)
hold for all n ∈ N. Combining the last bounds and Lemma A.4 (ii) we obtain (b). We decompose the first sum in (B.21) into two parts which we bound separately. Note that log(kΦ_(k)) ≤ e⁻¹ kΦ_(k), and hence λ^Φ_k ≤ kΦ_(k).
Setting k_g = ⌊200‖[g]‖²_{ℓ¹}⌋ holds
∑_{k∈⟦k_g⟧} kλ^Φ_k Φ_(k) exp(−λ^Φ_k k/(200‖[g]‖²_{ℓ¹})) ≤ λ^Φ_{k_g} Φ_(k_g) k_g ∑_{k∈⟦k_g⟧} exp(−k/(200‖[g]‖²_{ℓ¹})) ≤ λ^Φ_{k_g} Φ_(k_g) k_g (200‖[g]‖²_{ℓ¹}).   (B.22)
On the other hand, for any k > k_g holds k ≥ (400‖[g]‖²_{ℓ¹}) log(k+2), and hence k/(200‖[g]‖²_{ℓ¹}) − log(k+2) ≥ log(k+2), which implies
kλ^Φ_k Φ_(k) exp(−λ^Φ_k k/(200‖[g]‖²_{ℓ¹})) ≤ (k+2) exp(−k/(400‖[g]‖²_{ℓ¹})).
Consequently, exploiting that for any µ > 0 holds ∑_{k∈N} (k+2) exp(−µk) ≤ exp(µ)µ⁻² + 2µ⁻¹, we obtain
∑_{k∈⟧k_g,n⟧} kλ^Φ_k Φ_(k) exp(−λ^Φ_k k/(200‖[g]‖²_{ℓ¹})) ≤ exp((400‖[g]‖²_{ℓ¹})⁻¹)(400‖[g]‖²_{ℓ¹})² + 2(400‖[g]‖²_{ℓ¹}).
The last bound and (B.22) imply together the first bound in (B.21). Considering the second bound, for n ∈ N we distinguish the following two cases: (a) n > n_o = 15(600)⁴ and (b) n ∈ ⟦n_o⟧. Firstly, consider (a), where √n ≥
600 log(n+2), and hence together with λ^Φ_k ≤ kΦ_(k) it follows
∑_{k∈⟦n⟧} kλ^Φ_k Φ_(k) exp(−(nλ^Φ_k)^{1/2}) ≤ ∑_{k∈⟦n⟧} k²Φ²_(k) exp(−(nλ^Φ_k)^{1/2}) ≤ ∑_{k∈⟦n⟧} n exp(−(λ^Φ_k)^{1/2}[√n − 2 log(k+2)]) ≤ ∑_{k∈⟦n⟧} n⁻² ≤ 1.   (B.23)
Secondly, consider (b). Since n^b exp(−a n^{1/c}) ≤ (cb/(ea))^{cb} for all c > 0 and a, b > 0 it follows
∑_{k∈⟦n⟧} kλ^Φ_k Φ_(k) exp(−(nλ^Φ_k)^{1/2}) ≤ n λ^Φ_n Φ_(n) exp(−√n) ≤ n² Φ²_(n) exp(−√n) ≤ C Φ_(n_o) ≤ C Φ_(n_o) n_o.
Combining (B.23) and the last bound for the two cases (a) n > n_o and (b) n ∈ ⟦n_o⟧ we obtain the second bound in (B.21). Consider (c). Since (nλ^Φ_k)^{1/2} ≥ √n and n exp(−√n) ≤ 1, from Lemma A.4 (ii) follows immediately (c), which completes the proof.

Lemma
B.5.
Let the assumptions of Proposition 2.4 (p) be satisfied. There is a finite numerical constant C > 0 such that for all n ∈ N with n_o := 15(600)⁴ holds
E^n_{f,ϕ} ‖f̃_w − f‖²_{L²} ≤ C ‖Π_⊥f‖²_{L²} [n⁻¹ ∨ exp(−λ^Φ_{k⋆_n} k⋆_n/k_g)] + C([1 ∨ K ∨ c_f (KΦ_(K))²](Φ_(1) + ‖Π_⊥f‖²_{L²}) + Φ_(k_g) k_g + Φ_(n_o)) n⁻¹.   (B.24)

Proof of Lemma B.5.
The proof is based on the upper bound (B.8), which holds for any k⋄₋, k⋄₊ ∈ ⟦n⟧ and associated k₋, k₊ ∈ ⟦n⟧ as defined in (B.5). Consider first the case K = 0, where b•(f) ≡ 0 and hence ‖Π_⊥f‖²_{L²} = 0. From (B.8) follows
E^n_{f,ϕ} ‖f̃_w − f‖²_{L²} ≤ pen^Φ_{k₊} + C(Φ_(k_g) k_g + Φ_(n_o)) n⁻¹.   (B.25)
Setting k⋄₊ := 1, it follows from the definition (B.5) of k₊ that pen^Φ_{k₊} ≤ 4 pen^Φ_1 = 4∆Λ^Φ_1 n⁻¹ and Λ^Φ_1 = λ^Φ_1 Φ_(1) ≤ C Φ_(1). Thereby (keep in mind ∆ ≥ 1), (B.25) implies
E^n_{f,ϕ} ‖f̃_w − f‖²_{L²} ≤ C(Φ_(1) + Φ_(k_g) k_g + Φ_(n_o)) n⁻¹.   (B.26)
Consider now K ∈ N, and hence ‖Π_⊥f‖²_{L²} b_{K−1}(f) > 0. Setting n_f := [K ∨ ⌊c_f KΛ^Φ_K⌋] ∈ N we distinguish for n ∈ N the following two cases: (a) n ∈ ⟦n_f⟧ and (b) n > n_f. Firstly, consider (a) with n ∈ ⟦n_f⟧; then setting k⋄₋ := 1, k⋄₊ := 1 we have k₋ = 1 and 1 ≥ b_1(f), and from the definition (B.5) of k₊ also pen^Φ_{k₊} ≤ ‖Π_⊥f‖²_{L²} b_1(f) + 4 pen^Φ_1 ≤ ‖Π_⊥f‖²_{L²} + 4∆ C Φ_(1). Thereby, from (B.8) follows
E^n_{f,ϕ} ‖f̃_w − f‖²_{L²} ≤ C(Φ_(1) + ‖Π_⊥f‖²_{L²}) + C(Φ_(k_g) k_g + Φ_(n_o)) n⁻¹ ≤ C(Φ_(1) n + ‖Π_⊥f‖²_{L²} n + Φ_(k_g) k_g + Φ_(n_o)) n⁻¹.
Moreover, for all n ∈ ⟦n_f⟧ with n_f = [K ∨ ⌊c_f KΛ^Φ_K⌋] and KΛ^Φ_K = Kλ^Φ_K Φ_(K) ≤ (KΦ_(K))² holds n ≤ [K ∨ c_f (KΦ_(K))²], and thereby
E^n_{f,ϕ} ‖f̃_w − f‖²_{L²} ≤ C([K ∨ c_f (KΦ_(K))²](Φ_(1) + ‖Π_⊥f‖²_{L²}) + Φ_(k_g) k_g + Φ_(n_o)) n⁻¹.   (B.27)
Secondly, consider (b), i.e., n > n_f.
Setting k (cid:5) + := K (cid:54) [ K ∨ (cid:98) c f K Λ Φ K (cid:99) ] = n f , i.e., k (cid:5) + ∈ (cid:74) n (cid:75) , it follows b k (cid:5) + = 0 and the definition (B.5) of k + implies pen Φ k + (cid:54) Φ k (cid:5) + = 4∆ K Λ Φ K n − (cid:54) K Φ K ) n − . From (B.8) follows for all n > n f thus E n f,ϕ (cid:107) (cid:101) f w − f (cid:107) L (cid:54) (cid:107) Π ⊥ f (cid:107) L b k − + C (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:2) n − ∨ exp (cid:0) − λ Φ k (cid:5)− k (cid:5)− k g (cid:1)(cid:3) + C (cid:0) K Φ K ) + Φ k g ) k g + Φ n o ) (cid:1) n − . (B.28). Note that for all n > n f holds k (cid:63)n = max { k ∈ (cid:74) K, n (cid:75) : n > c f k Λ Φ k } , since the defining set containing K is not empty. Consequently, k (cid:63)n (cid:62) K and, hence, b k (cid:63)n ( f ) = 0 , and k (cid:63)n Λ Φ k (cid:63)n n −
We present first the main arguments to prove Proposition 2.7 which makes use of Corollary B.6 deferred to the end of this section. Considering an aggregation (cid:101) f w = (cid:80) k ∈ (cid:74) n (cid:75) w k (cid:101) f k using either Bayesian weights w := (cid:101) w as in (2.9) or model selection weights w := ˘ w as in (2.10) we make use of the upper bounds (B.6) and (B.7), respectively. In Corollary B.6 we bound the last three terms in (B.6) and (B.7) uniformly over F r f and E d s . Moreover, we note that the definition (B.5) of k + and k − implies pen Φ k + (cid:54) (6 r + 4∆ ζ d ) R k (cid:5) + n ( f (cid:113) , Λ s (cid:113) ) and (cid:107) Π ⊥ f (cid:107) L b k − ( f ) (cid:54) ( r + 4∆ ζ d ) R k (cid:5)− n ( f (cid:113) , Λ s (cid:113) ). Combining (B.6) and (B.7), the last bounds, (cid:107) Π ⊥ f (cid:107) L (cid:54) r , η (cid:62) , η ∆14 k (cid:5)− Λ Φ k (cid:5)− (cid:62) k fs λ s k (cid:5)− k (cid:5)− and Corollary B.6 we obtain for all n ∈ N sup (cid:8) E n f,ϕ (cid:107) (cid:101) f w − f (cid:107) L : f ∈ F r f , ϕ ∈ E d s (cid:9) (cid:54) (6 r +4∆ ζ d ) R k (cid:5) + n ( f (cid:113) , Λ s (cid:113) )+2( r +4∆) R k (cid:5)− n ( f (cid:113) , Λ s (cid:113) )+ C r exp (cid:0) − λ s k (cid:5)− k (cid:5)− k fs (cid:1) + C n − { r + d (cid:0) s k fs k fs + s n o (cid:1) } . (B.30). For k ◦ n := k ◦ n ( f (cid:113) , Λ s (cid:113) ) ∈ (cid:74) n (cid:75) and R kn ( f (cid:113) , Λ s (cid:113) ) as in (2.5) we set k (cid:5) + := k ◦ n , then for all k (cid:5)− ∈ (cid:74) n (cid:75) holds R k (cid:5)− n ( f (cid:113) , Λ s (cid:113) ) (cid:62) R k ◦ n n ( f (cid:113) , Λ s (cid:113) ) = R ◦ n ( f (cid:113) , Λ s (cid:113) ) = min (cid:8) R kn ( f (cid:113) , Λ s (cid:113) ) , k ∈ N (cid:9) (cid:62) n − .
Combining the last bound and (B.30) implies the assertion (2.19), that is, for all k (cid:5)− ∈ (cid:74) n (cid:75) holds sup (cid:8) E n f,ϕ (cid:107) (cid:101) f w − f (cid:107) L : f ∈ F r f , ϕ ∈ E d s (cid:9) (cid:54) C ( r + ζ d ) (cid:2) R k (cid:5)− n ( f (cid:113) , Λ s (cid:113) ) ∨ exp (cid:0) − λ s k (cid:5)− k (cid:5)− k fs (cid:1)(cid:3) + + C n − { r + d (cid:0) s k fs k fs + s n o (cid:1) } (B.31) with n o = 15(600) , which completes the proof of Proposition 2.7. Proof of Corollary 2.8.
Under (A2’) for k ◦ n := k ◦ n ( f (cid:113) , Λ s (cid:113) ) as in (2.5) holds exp (cid:0) − λ s k ◦ n k ◦ n /k fs (cid:1) (cid:54) R ◦ n ( f (cid:113) , Λ s (cid:113) ) while for n ∈ (cid:74) n f , s (cid:75) we have exp (cid:0) − λ s k ◦ n k ◦ n /k fs (cid:1) (cid:54) (cid:54) n R ◦ n ( f (cid:113) , Λ s (cid:113) ) (cid:54) n f , s R ◦ n ( f (cid:113) , Λ s (cid:113) ). Thereby, from (2.19) with R ◦ n ( f (cid:113) , Λ s (cid:113) ) = min k ∈ (cid:74) n (cid:75) R kn ( f (cid:113) , Λ s (cid:113) ) follows immediately the claim, which completes the proof of Corollary 2.8. Corollary
B.6.
Consider (pen Φ k ) k ∈ (cid:74) n (cid:75) as in (2.12) with ∆ (cid:62) . Let n o := 15(600) and k fs := (cid:98) rζ d (cid:107) f (cid:113) / s (cid:113) (cid:107) (cid:96) (cid:99) . There exists a finite numerical constant C > such that for each f ∈ F r f and ϕ ∈ E d s and for all n ∈ N and k ∈ (cid:74) n (cid:75) hold (i) (cid:80) k ∈ (cid:74) n (cid:75) E n f,ϕ (cid:0) (cid:107) (cid:101) f k − f k (cid:107) L − pen Φ k / (cid:1) + (cid:54) C n − d (cid:0) s k fs k fs + s n o (cid:1) ; (ii) (cid:80) k ∈ (cid:74) n (cid:75) pen Φ k P n f,ϕ (cid:0) (cid:107) (cid:101) f k − f k (cid:107) L (cid:62) pen Φ k / (cid:1) (cid:54) C n − d (cid:0) s k fs k fs + s n o (cid:1) ; (iii) P n f,ϕ (cid:0) (cid:107) (cid:101) f k − f k (cid:107) L (cid:62) pen Φ k / (cid:1) (cid:54) C (cid:0) exp (cid:0) − λ s k kk fs (cid:1) + n − (cid:1) . Proof of Corollary B.6. The result follows immediately from (a)-(c) in the proof of Lemma B.4 by using that for all f ∈ F r f , ϕ ∈ E d s and k ∈ N hold d − λ Φ k (cid:62) ζ − d λ s k , Φ ( k ) (cid:54) d s k and (cid:107) [ g ] (cid:107) (cid:96) (cid:54) d (cid:107) f (cid:113) / s (cid:113) (cid:107) (cid:96) (cid:107) f (cid:107) / f (cid:54) rd (cid:107) s (cid:113) f (cid:113) (cid:107) (cid:96) , and we omit the details.

Proofs of Section 3
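Before turning to the proofs of Section 3, where the error density is unknown and its Fourier coefficients are estimated from the noise sample, the following minimal numerical sketch may help fix ideas. It is an illustration only, not the authors' implementation: all names are hypothetical, and the threshold on the empirical error coefficients (of order m^{-1/2}, mimicking the indicator sets used in Lemma 3.2, whose exact constant is as in the main text) is an assumption.

```python
import numpy as np

def deconvolution_ose(Y, eps, k, threshold=None):
    """Spectral cut-off deconvolution estimate of the Fourier coefficients
    [f]_j, |j| <= k, of a circular density on [0, 1).

    Y   : sample of the contaminated variable (size n)
    eps : independent sample of the additive noise (size m)

    Illustrative sketch only; the 1/sqrt(m)-type threshold rule is an
    assumption modelled on the truncation sets of Section 3."""
    n, m = len(Y), len(eps)
    if threshold is None:
        threshold = 1.0 / np.sqrt(m)  # assumed threshold on |phi_hat_j|
    j = np.arange(-k, k + 1)
    # empirical Fourier coefficients of g (observations) and phi (noise)
    g_hat = np.exp(-2j * np.pi * np.outer(j, Y)).mean(axis=1)
    phi_hat = np.exp(-2j * np.pi * np.outer(j, eps)).mean(axis=1)
    # invert only at frequencies where phi_hat is safely away from zero
    keep = np.abs(phi_hat) >= threshold
    f_hat = np.where(keep, g_hat / np.where(keep, phi_hat, 1.0), 0.0)
    return j, f_hat
```

An aggregated estimate would then combine such cut-off estimates over several dimensions k with random weights, as studied in this section.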
Proof of Lemma 3.2.
We start the proof with the observation that [ (cid:98) f ] − [ f ] = 0 and for each j ∈ Z holds [ (cid:98) f w ] j − [ f ] j = [ (cid:98) f w ] − j − [ f ] − j , where [ (cid:98) f w ] j − [ f ] j = − [ f ] j for all | j | > n , and [ (cid:98) f w ] j − [ f ] j = (cid:99) [ ϕ ] + j ( (cid:99) [ g ] j − [ g ] j ) P w ( (cid:74) | j | , n (cid:75) ) + (cid:99) [ ϕ ] + j ([ ϕ ] j − (cid:99) [ ϕ ] j )[ f ] j P w ( (cid:74) | j | , n (cid:75) ) − X j [ f ] j P w ( (cid:74) | j | (cid:74) ) − X cj [ f ] j for all | j | ∈ (cid:74) n (cid:75) with X j := {| (cid:99) [ ϕ ] j | (cid:62) /m } and X cj := {| (cid:99) [ ϕ ] j | < /m } . Consequently, we have (cid:107) (cid:98) f w − f (cid:107) L (cid:54) (cid:88) | j |∈ (cid:74) n (cid:75) | (cid:99) [ ϕ ] + j | | (cid:99) [ g ] j − [ g ] j | P w ( (cid:74) | j | , n (cid:75) )+ 3 (cid:88) | j |∈ (cid:74) n (cid:75) X j | [ f ] j | P w ( (cid:74) | j | (cid:74) ) + (cid:88) | j | >n | [ f ] j | + 3 (cid:88) | j |∈ (cid:74) n (cid:75) | (cid:99) [ ϕ ] + j | | [ ϕ ] j − (cid:99) [ ϕ ] j | | [ f ] j | + (cid:88) | j |∈ (cid:74) n (cid:75) X cj | [ f ] j | . 
(C.1), where we consider the first term and the second and third terms on the right-hand side separately. Considering the first term, from (cid:107) (cid:98) f k − ˇ f k (cid:107) L = (cid:80) | j |∈ (cid:74) k (cid:75) | (cid:99) [ ϕ ] + j | | (cid:99) [ g ] j − [ g ] j | follows (cid:88) | j |∈ (cid:74) n (cid:75) | (cid:99) [ ϕ ] + j | ( (cid:99) [ g ] j − [ g ] j ) P w ( (cid:74) | j | , n (cid:75) ) (cid:54) (cid:107) (cid:98) f k + − ˇ f k + (cid:107) L + (cid:88) l ∈ (cid:75) k + ,n (cid:75) w l (cid:0) (cid:107) (cid:98) f l − ˇ f l (cid:107) L − pen n l / (cid:1) + + (cid:88) l ∈ (cid:75) k + ,n (cid:75) w l pen n l {(cid:107) (cid:98) f l − ˇ f l (cid:107) L (cid:62) pen nl / } + (cid:88) l ∈ (cid:75) k + ,n (cid:75) w l pen n l {(cid:107) (cid:98) f l − ˇ f l (cid:107) L < pen nl / } (C.2). Considering the second and third terms we split the first sum into two parts and obtain (cid:88) | j |∈ (cid:74) n (cid:75) X j | [ f ] j | P w ( (cid:74) | j | (cid:74) ) + (cid:88) | j | >n | [ f ] j | (cid:54) (cid:88) | j |∈ (cid:74) k − (cid:75) | [ f ] j | X j P w ( (cid:74) | j | (cid:74) ) + (cid:88) | j |∈ (cid:75) k − ,n (cid:75) | [ f ] j | + 2 (cid:88) | j | >n | [ f ] j | (cid:54) (cid:107) Π ⊥ f (cid:107) L { P w ( (cid:74) k − (cid:74) ) + b k − ( f ) } (C.3). Combining (C.1) with (C.2) and (C.3) we obtain the assertion, which completes the proof.

C.1 Proof of Theorem 3.4 and Corollary 3.5
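The proof below compares two weighting schemes over the candidate dimensions. As a hedged numerical sketch (assuming the Bayesian weights (1.7) are of exponential, Gibbs-type form w_k ∝ exp(ηn(‖f̂_k‖² − pen_k)), as suggested by the bound (C.12), and that the model selection weights (1.6) put full mass on the penalised-contrast maximiser, i.e. the η → ∞ limit, cf. the proof of Lemma C.3; all names are hypothetical):

```python
import numpy as np

def bayesian_weights(contrast, pen, n, eta=1.0):
    """Gibbs-type (exponential) weights over dimensions k = 1..len(contrast):
    w_k proportional to exp(eta * n * (contrast_k - pen_k)).
    Sketch under the assumption that (1.7) has this exponential form."""
    score = eta * n * (np.asarray(contrast) - np.asarray(pen))
    score -= score.max()            # numerical stabilisation before exp
    w = np.exp(score)
    return w / w.sum()

def selection_weights(contrast, pen):
    """Model-selection weights: all mass on the penalised-contrast
    maximiser, the eta -> infinity limit of the Gibbs weights."""
    score = np.asarray(contrast) - np.asarray(pen)
    w = np.zeros_like(score, dtype=float)
    w[np.argmax(score)] = 1.0
    return w
```

For moderate ηn both schemes concentrate on the same dimension; the Gibbs weights merely spread some mass over neighbouring candidates.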
We present first the main arguments of the proof of Theorem 3.4. More technical details are gathered in Lemmata C.2 to C.5 at the end of this section. Keeping in mind the definitions (2.12) and (3.6) let us for l ∈ (cid:74) n (cid:75) introduce the event (cid:102) l := (cid:8) / (cid:54) Φ − j (cid:98) Φ j (cid:54) / , ∀ j ∈ (cid:74) l (cid:75) (cid:9) and its complement (cid:102) cl , where due to Lemma A.8 holds pen Φ k (cid:102) l (cid:54) pen (cid:98) Φ k (cid:102) l (cid:54) Φ k for all k ∈ (cid:74) l (cid:75) . For any k (cid:5) + , k (cid:5)− ∈ (cid:74) n (cid:75) (to be chosen suitably below) let us define k − := min (cid:110) k ∈ (cid:74) k (cid:5)− (cid:75) : (cid:107) Π ⊥ f (cid:107) L b k ( f ) (cid:54) (cid:107) Π ⊥ f (cid:107) L b k (cid:5)− ( f ) + 104 pen Φ k (cid:5)− (cid:111) and k + := max (cid:110) k ∈ (cid:74) k (cid:5) + , n (cid:75) : pen (cid:98) Φ k (cid:54) (cid:107) Π ⊥ k (cid:5) + ˇ f n (cid:107) L + 4 pen (cid:98) Φ k (cid:5) + (cid:1)(cid:111) (C.4) where (cid:107) Π ⊥ k ˇ f n (cid:107) L = (cid:80) | j |∈ (cid:75) k,n (cid:75) (cid:98) Φ j Φ − j | [ f ] j | and the defining sets obviously contain k (cid:5)− and k (cid:5) + , respectively, and hence are not empty. Note that by construction the random dimension k + is independent of the sample ( Y i ) i ∈ (cid:74) n (cid:75) . We intend to combine the upper bound in (3.4) and the bounds for Bayesian weights w = (cid:98) w as in (1.7) and model selection weights w = ˘ w as in (1.6) given in Lemma C.2 and Lemma C.3, respectively. Conditionally on ( ε i ) i ∈ (cid:74) m (cid:75) the r.v.'s ( Y i ) i ∈ (cid:74) n (cid:75) are iid,
and we denote by P n Y | ε and E n Y | ε their joint conditional distribution and expectation, respectively. Exploiting Lemma C.2 (i) and (ii), where (i) implies E n Y | ε P (cid:98) w ( (cid:74) k − (cid:74) ) (cid:54) { k − > } (cid:0) η ∆ exp (cid:0) − η ∆2 k (cid:5)− Λ Φ k (cid:5)− (cid:1) + P n Y | ε (cid:0) (cid:107) (cid:98) f k (cid:5)− − ˇ f k (cid:5)− (cid:107) L (cid:62) pen (cid:98) Φ k (cid:5)− / (cid:1) (cid:102) k (cid:5)− + (cid:102) ck (cid:5)− (cid:1) , from (3.4) for Bayesian weights w = (cid:98) w as in (1.7) follows immediately E n Y | ε (cid:107) (cid:98) f w − f (cid:107) L (cid:54) E n Y | ε (cid:107) (cid:98) f k + − ˇ f k + (cid:107) L + 3 (cid:107) Π ⊥ f (cid:107) L b k − ( f ) + η ∆ (cid:107) Π ⊥ f (cid:107) L { k − > } exp (cid:0) − η ∆2 k (cid:5)− Λ Φ k (cid:5)− (cid:1) + n − η + 3 (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:0) P n Y | ε (cid:0) (cid:107) (cid:98) f k (cid:5)− − ˇ f k (cid:5)− (cid:107) L (cid:62) pen (cid:98) Φ k (cid:5)− / (cid:1) (cid:102) k (cid:5)− + (cid:102) ck (cid:5)− (cid:1) + 3 (cid:88) l ∈ (cid:75) k + ,n (cid:75) E n Y | ε (cid:0) (cid:107) (cid:98) f l − ˇ f l (cid:107) L − pen (cid:98) Φ l / (cid:1) + + (cid:88) l ∈ (cid:75) k + ,n (cid:75) pen (cid:98) Φ l P n Y | ε (cid:0) (cid:107) (cid:98) f l − ˇ f l (cid:107) L (cid:62) pen (cid:98) Φ l / (cid:1) + 6 (cid:88) j ∈ (cid:74) n (cid:75) | (cid:99) [ ϕ ] + j | | [ ϕ ] j − (cid:99) [ ϕ ] j | | [ f ] j | + 2 (cid:88) j ∈ (cid:74) n (cid:75) {| (cid:99) [ ϕ ] j | < /m } | [ f ] j | .
(C.5). On the other hand, (C.5) also holds true for model selection weights w = ˘ w by a combination of the upper bound in (3.4) and the bounds given in Lemma C.3. The deviations of the last three terms in (C.5) we bound in Lemma C.4, which implies E n Y | ε (cid:107) (cid:98) f w − f (cid:107) L (cid:54) E n Y | ε (cid:107) (cid:98) f k + − ˇ f k + (cid:107) L + 3 (cid:107) Π ⊥ f (cid:107) L b k − ( f ) + C (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:0) η exp (cid:0) − η ∆2 k (cid:5)− Λ Φ k (cid:5)− (cid:1) + exp (cid:0) − λ (cid:98) Φ k (cid:5)− k (cid:5)− (cid:107) [ g ] (cid:107) (cid:96) (cid:1) (cid:102) k (cid:5)− + (cid:102) ck (cid:5)− (cid:1) + 6 (cid:88) j ∈ (cid:74) n (cid:75) | (cid:99) [ ϕ ] + j | | [ ϕ ] j − (cid:99) [ ϕ ] j | | [ f ] j | + 2 (cid:88) j ∈ (cid:74) n (cid:75) {| (cid:99) [ ϕ ] j | < /m } | [ f ] j | + C n − (cid:0) η + [1 ∨ (cid:98) Φ k g ) ] k g + [1 ∨ (cid:98) Φ n o ) ] + (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:1) . (C.6). Keeping (2.11) and pen (cid:98) Φ k / (cid:62) k Λ (cid:98) Φ k n − in mind, on the one hand holds E n Y | ε (cid:107) (cid:98) f k + − ˇ f k + (cid:107) L = 2 (cid:80) k + j =1 (cid:98) Φ j /n (cid:54) k + Λ (cid:98) Φ k + n − (cid:54) pen (cid:98) Φ k + and, on the other hand, E n Y | ε (cid:107) (cid:98) f k + − ˇ f k + (cid:107) L (cid:54) m for k + ∈ (cid:74) n (cid:75) due to Lemma A.8 (ii). Consequently, E n Y | ε (cid:107) (cid:98) f k + − ˇ f k + (cid:107) L (cid:54) m (cid:102) ck (cid:5) + + pen (cid:98) Φ k + (cid:102) k (cid:5) + and hence E n Y | ε (cid:107) (cid:98) f k + − ˇ f k + (cid:107) L (cid:54) m (cid:102) ck (cid:5) + + (6 (cid:107) Π ⊥ k (cid:5) + ˇ f n (cid:107) L + 4 pen (cid:98) Φ k (cid:5) + ) (cid:102) k (cid:5) + exploiting the definition (C.4) of k + .
Thereby, with (cid:98) Φ ( j ) (cid:54) m , j ∈ N , η (cid:62) and ∆ (cid:62) from (C.6) follows E n Y | ε (cid:107) (cid:98) f w − f (cid:107) L (cid:54) pen (cid:98) Φ k (cid:5) + (cid:102) k (cid:5) + + (cid:107) Π ⊥ k (cid:5) + ˇ f n (cid:107) L (cid:102) k (cid:5) + + 6 m (cid:102) ck (cid:5) + + 3 (cid:107) Π ⊥ f (cid:107) L b k − ( f )+ C (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:0) η exp (cid:0) − η ∆2 k (cid:5)− Λ Φ k (cid:5)− (cid:1) + exp (cid:0) − λ (cid:98) Φ k (cid:5)− k (cid:5)− (cid:107) [ g ] (cid:107) (cid:96) (cid:1) (cid:102) k (cid:5)− + (cid:102) ck (cid:5)− (cid:1) + C n − (cid:0) [1 ∨ (cid:98) Φ k g ) ] k g (cid:102) kg + k g m (cid:102) ckg + [1 ∨ (cid:98) Φ n o ) ] (cid:102) no + m (cid:102) cno + (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:1) + 6 (cid:88) j ∈ (cid:74) n (cid:75) | (cid:99) [ ϕ ] + j | | [ ϕ ] j − (cid:99) [ ϕ ] j | | [ f ] j | + 2 (cid:88) j ∈ (cid:74) n (cid:75) {| (cid:99) [ ϕ ] j | < /m } | [ f ] j | . Exploiting Lemma A.8 (ii), Λ Φ k (cid:5)− (cid:62) λ Φ k (cid:5)− and η ∆2 > (cid:107) [ g ] (cid:107) (cid:96) > k g it follows E n Y | ε (cid:107) (cid:98) f w − f (cid:107) L (cid:54) Φ k (cid:5) + + (cid:107) Π ⊥ k (cid:5) + ˇ f n (cid:107) L + 3 (cid:107) Π ⊥ f (cid:107) L b k − ( f )+ C (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:0) exp (cid:0) − λ Φ k (cid:5)− k (cid:5)− k g (cid:1) + (cid:102) ck (cid:5)− (cid:1) + C (cid:0) m (cid:102) ck (cid:5) + + n − (cid:0) k g m (cid:102) ckg + m (cid:102) cno (cid:1)(cid:1) + 6 (cid:88) j ∈ (cid:74) n (cid:75) | (cid:99) [ ϕ ] + j | | [ ϕ ] j − (cid:99) [ ϕ ] j | | [ f ] j | + 2 (cid:88) j ∈ (cid:74) n (cid:75) {| (cid:99) [ ϕ ] j | < /m } | [ f ] j | + C n − (cid:0) Φ k g ) k g + Φ n o ) + (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:1) . 
Bounding the second term and the two sums on the right-hand side due to Lemma A.6 implies E n,m f,ϕ (cid:107) (cid:98) f w − f (cid:107) L (cid:54) Φ k (cid:5) + + (cid:107) Π ⊥ f (cid:107) L b k (cid:5) + ( f ) + 3 (cid:107) Π ⊥ f (cid:107) L b k − ( f ) + C (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:0) exp (cid:0) − λ Φ k (cid:5)− k (cid:5)− k g (cid:1) + P m ϕ ( (cid:102) ck (cid:5)− ) (cid:1) + C (cid:0) m P m ϕ ( (cid:102) ck (cid:5) + ) + n − { k g m P m ϕ ( (cid:102) ck g ) + m P m ϕ ( (cid:102) cn o ) } (cid:1) + C (cid:107) Π ⊥ f (cid:107) ∧ Φ /m + C n − { Φ k g ) k g + Φ n o ) + (cid:107) Π ⊥ f (cid:107) L { k − > } } . Exploiting Lemma A.7 (ii) there is a numerical constant C such that for all m, k ∈ N holds P m ϕ ( (cid:102) ck ) (cid:54) C k Φ k ) m − and hence, m P m ϕ ( (cid:102) ck g ) (cid:54) C k g Φ k g ) and m P m ϕ ( (cid:102) cn o ) (cid:54) C n o Φ n o ). Consequently, there is a numerical constant C > such that for all n, m ∈ N holds E n,m f,ϕ (cid:107) (cid:98) f w − f (cid:107) L (cid:54) Φ k (cid:5) + + (cid:107) Π ⊥ f (cid:107) L b k (cid:5) + ( f ) + 3 (cid:107) Π ⊥ f (cid:107) L b k − ( f ) + C (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:0) exp (cid:0) − λ Φ k (cid:5)− k (cid:5)− k g (cid:1) + P m ϕ ( (cid:102) ck (cid:5)− ) (cid:1) + C m P m ϕ ( (cid:102) ck (cid:5) + ) + C (cid:107) Π ⊥ f (cid:107) ∧ Φ /m + C n − (cid:0) Φ k g ) k g + Φ n o ) + (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:1) (C.7) (keep in mind that n o is a numerical constant). From the upper bound (C.7) for a suitable choice of the dimension parameters k (cid:5)− , k (cid:5) + ∈ (cid:74) n (cid:75) we derive separately the risk bound in the two cases (p) and (np) considered in Theorem 3.4. The tedious case-by-case analysis for (p) is deferred to Lemma C.5 at the end of this section. In case (np) we distinguish for m ∈ N with m Φ := (cid:98) λ Φ Φ (1) (cid:99) the following two cases: (a) m ∈ (cid:74) m Φ (cid:75) and (b) m > m Φ . Consider firstly the case (a) m ∈ (cid:74) m Φ (cid:75) .
We set k (cid:5) + = k (cid:5)− = 1 , and hence k − = 1 , b (cid:54) , pen Φ (cid:54) ∆Φ n − , Φ (cid:54) Φ n o ) , m Φ (cid:54) C Φ and due to Lemma A.7 (ii) P m ε ( (cid:102) c ) (cid:54) C Φ m − . Thereby, from (C.7) for all n ∈ N and m ∈ (cid:74) m Φ (cid:75) follows E n,m f,ϕ (cid:107) (cid:98) f w − f (cid:107) L (cid:54) C (cid:107) Π ⊥ f (cid:107) ∧ Φ /m + C [1 ∨ (cid:107) Π ⊥ f (cid:107) L ]Φ m − + C (cid:0) Φ k g ) k g + Φ n o ) (cid:1) n − . (C.8). Consider secondly (b) m > m Φ with k (cid:63)m := max { k ∈ (cid:74) m (cid:75) : 289 log( k + 2) λ Φ k Φ ( k ) (cid:54) m } . For each k ∈ (cid:74) k (cid:63)m (cid:75) holds m (cid:62)
289 log( k + 2) λ Φ k Φ ( k ) , and thus from Lemma A.7 (iii) follows P m ε ( (cid:102) ck ) (cid:54) m − . For k ◦ n := k ◦ n ( b (cid:113) , Λ Φ (cid:113) ) ∈ (cid:74) n (cid:75) as in (2.5) setting k (cid:5) + := k ◦ n ∧ k (cid:63)m , where m P m ε ( (cid:102) ck (cid:5) + ) (cid:54) C m − , pen Φ k (cid:5) + (cid:54) ∆ R ◦ n ( b (cid:113) , Λ Φ (cid:113) ) and b k (cid:5) + ( f ) (cid:54) R ◦ n ( b (cid:113) , Λ Φ (cid:113) ) + b k (cid:63)m ( f ) , (C.7) implies E n,m f,ϕ (cid:107) (cid:98) f w − f (cid:107) L (cid:54) C [1 ∨ (cid:107) Π ⊥ f (cid:107) L ] R ◦ n ( b (cid:113) , Λ Φ (cid:113) ) + (cid:107) Π ⊥ f (cid:107) L b k (cid:63)m ( f ) + 3 (cid:107) Π ⊥ f (cid:107) L b k − ( f ) + C (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:0) exp (cid:0) − λ Φ k (cid:5)− k (cid:5)− k g (cid:1) + P m ε ( (cid:102) ck (cid:5)− ) (cid:1) + C (cid:107) Π ⊥ f (cid:107) ∧ Φ /m + C m − + C n − (cid:0) Φ k g ) k g + Φ n o ) + (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:1) . (C.9). Let k (cid:63)n := arg min {R kn ( b (cid:113) , Λ Φ (cid:113) ) ∨ exp (cid:0) − λ Φ k kk g (cid:1) : k ∈ (cid:74) n (cid:75) } . Setting k (cid:5)− := k (cid:63)n ∧ k (cid:63)m from Lemma A.7 (iii) follows P m ε ( (cid:102) ck (cid:5)− ) (cid:54) m − , while k − as in definition (C.4) satisfies (cid:107) Π ⊥ f (cid:107) L b k − ( f ) (cid:54) (cid:107) Π ⊥ f (cid:107) L b k (cid:5)− ( f ) + 104 pen Φ k (cid:5)− (cid:54) (cid:107) Π ⊥ f (cid:107) L b k (cid:63)m ( f ) + ( (cid:107) Π ⊥ f (cid:107) L + 104∆) R k (cid:63)n n ( b (cid:113) , Λ Φ (cid:113) ) , n − (cid:54) R ◦ n ( b (cid:113) , Λ Φ (cid:113) ) (cid:54) R k (cid:63)n n ( b (cid:113) , Λ Φ (cid:113) ) by (2.5) and (cid:107) Π ⊥ f (cid:107) ∧ Φ /m (cid:62) (cid:107) Π ⊥ f (cid:107) L m − (see Remark 3.1).
Thereby, we obtain from (C.9) for all n ∈ N and m > m Φ E n,m f,ϕ (cid:107) (cid:98) f w − f (cid:107) L (cid:54) C [1 ∨ (cid:107) Π ⊥ f (cid:107) L ] min k ∈ (cid:74) n (cid:75) { (cid:2) R kn ( b (cid:113) , Λ Φ (cid:113) ) ∨ exp (cid:0) − λ Φ k kk g (cid:1)(cid:3) } + C (cid:107) Π ⊥ f (cid:107) L (cid:2) b k (cid:63)m ( f ) ∨ exp (cid:0) − λ Φ k(cid:63)m k (cid:63)m k g (cid:1)(cid:3) + C (cid:107) Π ⊥ f (cid:107) ∧ Φ /m + C m − + C n − (cid:0) Φ k g ) k g + Φ n o ) (cid:1) . (C.10). Combining (C.8) and (C.10) for the cases (a) and (b) for all n, m ∈ N holds E n,m f,ϕ (cid:107) (cid:98) f w − f (cid:107) L (cid:54) C [1 ∨ (cid:107) Π ⊥ f (cid:107) L ] min k ∈ (cid:74) n (cid:75) { [ R kn ( b (cid:113) , Λ Φ (cid:113) ) ∨ exp (cid:0) − λ Φ k kk g (cid:1) ] } { m>m Φ } + C (cid:107) Π ⊥ f (cid:107) L (cid:2) b k (cid:63)m ( f ) ∨ exp (cid:0) − λ Φ k(cid:63)m k (cid:63)m k g (cid:1)(cid:3) { m>m Φ } + C (cid:107) Π ⊥ f (cid:107) ∧ Φ /m + C [1 ∨ (cid:107) Π ⊥ f (cid:107) L ]Φ m − + C { Φ k g ) k g + Φ n o ) } n − , (C.11), which shows (3.8) and completes the proof of Theorem 3.4. Proof of Corollary 3.5.
Consider the case (p) . In the proof of Corollary 3.5 we have shown that, under the additional assumption (A1), holds E n f,ϕ (cid:107) (cid:101) f w − f (cid:107) L (cid:54) C f, Φ n − for all n ∈ N . If in addition (A4) is satisfied for k (cid:63)m as in Theorem 3.4, then we have for all m > m f,ϕ trivially exp (cid:0) − λ Φ k(cid:63)m k (cid:63)m k g (cid:1) (cid:54) m − while for n ∈ (cid:74) m fϕ (cid:75) we have exp (cid:0) − λ Φ k(cid:63)m k (cid:63)m k g (cid:1) (cid:54) (cid:54) m f,ϕ m − . Combining both bounds we obtain the assertion (p) . On the other hand, in case (np) under the additional assumption (A2) holds min k ∈ (cid:74) n (cid:75) (cid:8)(cid:2) R kn ( f (cid:113) , Λ s (cid:113) ) ∨ exp (cid:0) − λ Φ k kk ϕ (cid:1)(cid:3)(cid:9) (cid:54) n f, Φ R ◦ n ( b (cid:113) , Λ Φ (cid:113) ) (cf. Corollary 2.5 (np) ). A combination of the last bound and exp (cid:0) − λ Φ k(cid:63)m k (cid:63)m k g (cid:1) (cid:54) m f,ϕ m − due to (A4) implies the assertion (np) , which completes the proof of Corollary 3.5. Below we state and prove the technical Lemmata C.2 to C.4 used in the proof of Theorem 3.4. The proof of Lemma C.2 is based on Lemma C.1 given first. Lemma
C.1.
Consider Bayesian weights (cid:98) w as in (1.7) and let l ∈ (cid:74) n (cid:75) . (i) For (cid:102) l := (cid:8) (cid:54) Φ − j (cid:98) Φ j (cid:54) , ∀ j ∈ (cid:74) l (cid:75) (cid:9) and k ∈ (cid:74) l (cid:74) holds (cid:98) w k (cid:8) (cid:107) (cid:98) f l − ˆ f l (cid:107) L < pen (cid:98) Φ l (cid:9) (cid:102) l (cid:54) exp (cid:0) ηn (cid:8) pen Φ l + (cid:107) Π ⊥ f (cid:107) L b l ( f ) − (cid:107) Π ⊥ f (cid:107) L b k ( f ) − pen Φ k (cid:9)(cid:1) ; (ii) For (cid:107) Π ⊥ l ˇ f n (cid:107) L = (cid:80) | j |∈ (cid:75) l,n (cid:75) Φ − j (cid:98) Φ j | [ f ] j | and k ∈ (cid:75) l, n (cid:75) holds (cid:98) w k (cid:8) (cid:107) (cid:98) f k − ˆ f k (cid:107) L < pen (cid:98) Φ k (cid:9) (cid:54) exp (cid:0) ηn (cid:8) − pen (cid:98) Φ k + (cid:107) Π ⊥ l ˇ f n (cid:107) L + pen (cid:98) Φ l (cid:9)(cid:1) . Proof of Lemma C.1. Given k, l ∈ (cid:74) n (cid:75) and an event Ω kl (to be specified below) it follows (cid:98) w k Ω kl (cid:54) exp (cid:0) ηn (cid:8) (cid:107) (cid:98) f k (cid:107) L − (cid:107) (cid:98) f l (cid:107) L + (pen (cid:98) Φ l − pen (cid:98) Φ k ) (cid:9)(cid:1) Ω kl . (C.12). We distinguish the two cases (i) k ∈ (cid:74) , l (cid:74) and (ii) k ∈ (cid:75) l, n (cid:75) . Consider first (i) k ∈ (cid:74) , l (cid:74) . From (i) in Lemma A.1 (with ˆ f := (cid:98) f n and ˇ f := ˇ f n ) follows (cid:98) w k Ω kl (cid:54) exp (cid:0) ηn (cid:8) (cid:107) (cid:98) f l − ˇ f l (cid:107) L − (cid:107) Π kl ˇ f n (cid:107) L + (pen (cid:98) Φ l − pen (cid:98) Φ k ) (cid:9)(cid:1) Ω kl (C.13). Setting Ω kl := {(cid:107) (cid:98) f l − ˇ f l (cid:107) L < pen (cid:98) Φ l / } ∩ (cid:102) l the last bound together with Lemma A.8 (i) and (iii) implies the assertion (i). Consider secondly (ii) k ∈ (cid:75) l, n (cid:75) .
From (ii) in Lemma A.1 (with ˆ f := (cid:98) f n and ˇ f := ˇ f n ) and (C.12) follows (cid:98) w k Ω lk (cid:54) exp (cid:0) ηn (cid:8) (cid:107) (cid:98) f k − ˇ f k (cid:107) L + (cid:107) Π lk ˇ f n (cid:107) L + (pen (cid:98) Φ l − pen (cid:98) Φ k ) (cid:9)(cid:1) Ω lk . Setting Ω lk := {(cid:107) (cid:98) f k − ˇ f k (cid:107) L < pen (cid:98) Φ k / } the last bound together with Lemma A.8 (i) implies (ii), which completes the proof. Lemma
C.2.
Consider Bayesian weights (cid:98) w as in (1.7) and penalties (pen (cid:98) Φ k ) k ∈ (cid:74) n (cid:75) as in (3.6) .For any k (cid:5)− , k (cid:5) + ∈ (cid:74) n (cid:75) and associated k + , k − ∈ (cid:74) n (cid:75) as in (C.4) hold (i) P (cid:101) w ( (cid:74) k − (cid:74) ) (cid:54) η ∆ { k − > } exp (cid:0) − η ∆2 k (cid:5)− Λ Φ k (cid:5)− (cid:1) + {(cid:107) (cid:98) f k (cid:5)− − ˇ f k (cid:5)− (cid:107) L (cid:62) pen (cid:98) Φ k (cid:5)− / }∪ (cid:102) ck (cid:5)− ; (ii) (cid:80) k ∈ (cid:75) k + ,n (cid:75) pen (cid:98) Φ k (cid:98) w k {(cid:107) (cid:98) f k − ˇ f k (cid:107) L < pen (cid:98) Φ k / } (cid:54) η n − .Proof of Lemma C.2. Consider (i). Let k − ∈ (cid:74) k (cid:5)− (cid:75) as in (C.4). For the non trivial case k − > from Lemma C.1 (i) with l = k (cid:5)− follows for all k < k − (cid:54) k (cid:5)− (cid:98) w k (cid:8) (cid:107) (cid:98) f k (cid:5)− − ˇ f k (cid:5)− (cid:107) L < pen (cid:98) Φ k (cid:5)− / (cid:9) ∩ (cid:102) l (cid:54) exp (cid:0) ηn (cid:8) − (cid:107) Π ⊥ f (cid:107) L b k ( f ) + ( pen Φ k (cid:5)− + (cid:107) Π ⊥ f (cid:107) L b k (cid:5)− ( f )) − pen Φ k (cid:9)(cid:1) , and hence by exploiting the definition (C.4) of k − , that is (cid:107) Π ⊥ f (cid:107) L b k (cid:62) (cid:107) Π ⊥ f (cid:107) L b ( k − − > (cid:107) Π ⊥ f (cid:107) L b k (cid:5)− + 104 pen Φ k (cid:5)− , we obtain for each k ∈ (cid:74) k − (cid:74) (cid:98) w k (cid:8) (cid:107) (cid:98) f k (cid:5)− − ˇ f k (cid:5)− (cid:107) L < pen (cid:98) Φ k (cid:5)− / (cid:9) ∩ (cid:102) l (cid:54) exp (cid:0) − ηn pen Φ k (cid:5)− − ηn pen Φ k (cid:1) . 
The last upper bound together with pen Φ k = ∆ k Λ Φ k n − (cid:62) ∆ kn − , k ∈ (cid:74) n (cid:75) , as in (2.11) gives P (cid:101) w ( (cid:74) k − (cid:74) ) (cid:54) exp (cid:0) − η n pen Φ k (cid:5)− (cid:1) (cid:88) k ∈ (cid:74) k − (cid:74) exp( − η ∆50 k ) + (cid:8) (cid:107) (cid:98) f k (cid:5)− − ˇ f k (cid:5)− (cid:107) L (cid:62) pen (cid:98) Φ k (cid:5)− / (cid:9) ∪ (cid:102) ck (cid:5)− which combined with (cid:80) k ∈ N exp( − µk ) (cid:54) µ − for any µ > implies (i). Consider (ii). Let k + ∈ (cid:74) k (cid:5) + , n (cid:75) as in (C.4). For the non trivial case k + < n from Lemma C.1 (ii) with l = k (cid:5) + follows for all k > k + (cid:62) k (cid:5) + (cid:98) w k (cid:8) (cid:107) (cid:98) f k − ˇ f k (cid:107) L < pen (cid:98) Φ k / (cid:9) (cid:54) exp (cid:0) ηn (cid:8) − pen (cid:98) Φ k + (cid:107) Π ⊥ k (cid:5) + ˇ f n (cid:107) L + pen (cid:98) Φ k (cid:5) + (cid:9)(cid:1) , k + , that is, pen (cid:98) Φ k (cid:62) pen (cid:98) Φ ( k + +1) > pen (cid:98) Φ k (cid:5) + + (cid:107) Π ⊥ k (cid:5) + ˇ f n (cid:107) L , we obtain for each k ∈ (cid:75) k + , n (cid:75) (cid:98) w k (cid:8) (cid:107) (cid:101) f k − ˇ f k (cid:107) L < pen (cid:98) Φ k / (cid:9) (cid:54) exp (cid:0) ηn (cid:8) − pen (cid:98) Φ k (cid:9)(cid:1) . 
The last bound together with Lemma A.8 (iii), i.e., pen (cid:98) Φ k = pen (cid:98) Φ k { (cid:98) Φ ( k ) (cid:62) } , implies (cid:88) k ∈ (cid:75) k + ,n (cid:75) pen (cid:98) Φ k (cid:98) w k {(cid:107) (cid:98) f k − ˇ f k (cid:107) L < pen (cid:98) Φ k / } (cid:54) (cid:88) k ∈ (cid:75) k + ,n (cid:75) pen (cid:98) Φ k exp (cid:0) − η n pen (cid:98) Φ k (cid:1) = (cid:88) k ∈ (cid:75) k + ,n (cid:75) pen (cid:98) Φ k exp (cid:0) − η n pen (cid:98) Φ k (cid:1) { (cid:98) Φ ( k ) (cid:62) } = ∆ n − (cid:88) k ∈ (cid:75) k + ,n (cid:75) kλ (cid:98) Φ k (cid:98) Φ ( k ) exp (cid:0) − η ∆4 kλ (cid:98) Φ k (cid:98) Φ ( k ) (cid:1) { (cid:98) Φ ( k ) (cid:62) } (C.14). Comparing the last bound with (B.12), the remainder of the proof of (ii) follows line by line the arguments used in the proof of Lemma B.2 (ii) starting from (B.12), and we omit the details, which completes the proof. Lemma
C.3.
Consider model selection weights ˘ w as in (1.6) and penalties (pen (cid:98) Φ k ) k ∈ (cid:74) n (cid:75) as in (3.6) . For any k (cid:5)− , k (cid:5) + ∈ (cid:74) n (cid:75) and associated k + , k − ∈ (cid:74) n (cid:75) as in (C.4) hold (i) P ˘ w ( (cid:74) k − (cid:74) ) {(cid:107) (cid:98) f k (cid:5)− − ˇ f k (cid:5)− (cid:107) L < pen (cid:98) Φ k (cid:5)− / }∩ (cid:102) k (cid:5)− = 0 ; (ii) (cid:80) k ∈ (cid:75) k + ,n (cid:75) pen (cid:98) Φ k ˘ w k {(cid:107) (cid:98) f k − ˇ f k (cid:107) L < pen (cid:98) Φ k / } = 0 .Proof of Lemma C.3. The assertions can be directly deduced from Lemma C.2 by letting η → ∞ or following line by line the proof of Lemma B.3, and we omit the details. Lemma
C.4.
Consider (pen (cid:98) Φ k ) k ∈ (cid:74) ,n (cid:75) as in (3.6) with ∆ (cid:62) . Let k g := (cid:98) (cid:107) [ g ] (cid:107) (cid:96) ) (cid:99) and n o := 15(600) . There exists a finite numerical constant C > such that for all n ∈ N and all k (cid:5)− ∈ (cid:74) n (cid:75) hold (i) (cid:80) k ∈ (cid:74) n (cid:75) E n Y | ε (cid:0) (cid:107) (cid:98) f k − ˆ f k (cid:107) L − pen (cid:98) Φ k / (cid:1) + (cid:54) C n − (cid:0) [1 ∨ (cid:98) Φ ( k g ) ] k g + [1 ∨ (cid:98) Φ ( n o ) ] (cid:1) ; (ii) (cid:80) k ∈ (cid:74) n (cid:75) pen (cid:98) Φ k P n Y | ε (cid:0) (cid:107) (cid:98) f k − ˆ f k (cid:107) L (cid:62) pen (cid:98) Φ k / (cid:1) (cid:54) C n − (cid:0) [1 ∨ (cid:98) Φ k g ) ] k g + [1 ∨ (cid:98) Φ n o ) ] (cid:1) ; (iii) P n Y | ε (cid:0) (cid:107) (cid:98) f k (cid:5)− − ˆ f k (cid:5)− (cid:107) L (cid:62) pen (cid:98) Φ k (cid:5)− / (cid:1) (cid:54) C (cid:0) exp (cid:0) − λ (cid:98) Φ k (cid:5)− k (cid:5)− (cid:107) [ g ] (cid:107) (cid:96) (cid:1) + n − (cid:1) . Proof of Lemma C.4. By using Lemma A.5 rather than Lemma A.4 together with (cid:8)(cid:98) Φ ( l ) < (cid:9) = (cid:8)(cid:98) Φ ( l ) = 0 (cid:9) for all l ∈ N due to Lemma A.8 (ii) the proof follows line by line the proof of Lemma B.4, and we omit the details. Lemma C.5.
Let the assumptions of Theorem 3.4 (p) be satisfied. There is a numerical constant C such that for all n, m ∈ N with n o := 15(600) holds E n,m f,ϕ (cid:107) (cid:98) f w − f (cid:107) L (cid:54) C (cid:107) Π ⊥ f (cid:107) L (cid:2) n − ∨ m − ∨ exp (cid:0) − λ Φ[ k(cid:63)n ∧ k(cid:63)m ][ k (cid:63)n ∧ k (cid:63)m ] k g (cid:1)(cid:3) + C (cid:0)(cid:2) ∨ K ∨ c f K Φ K ) (cid:3)(cid:0) Φ + (cid:107) Π ⊥ f (cid:107) L (cid:1) + Φ k g ) k g + Φ n o ) (cid:1) n − + C (cid:0) Φ + K Φ K ) + (cid:107) Π ⊥ f (cid:107) L Φ ( K ) (cid:1) m − . (C.15) Proof of Lemma C.5.
The proof follows along the lines of the proof of Lemma B.5 by using the upper bound (C.7) instead of (B.8), which holds for any k (cid:5)− , k (cid:5) + ∈ (cid:74) n (cid:75) and associated k − , k + ∈ (cid:74) n (cid:75) as defined in (C.4) contrarily to (B.5). We present, as an example, the case (b) n > n f := [ K ∨ (cid:98) c f K Λ Φ K (cid:99) ] with K ∈ N and c f := (cid:107) Π ⊥ f (cid:107) L b K − , and omit the details for the others. Setting k (cid:5) + := K (cid:54) n f , i.e., k (cid:5) + ∈ (cid:74) n (cid:75) , it follows b k (cid:5) + = 0 and pen Φ k (cid:5) + = ∆ K Λ Φ K n − (cid:54) ∆ K Φ K ) n − . From (C.7) follows for all n > n f thus E n,m f,ϕ (cid:107) (cid:98) f (cid:98) w − f (cid:107) L (cid:54) (cid:107) Π ⊥ f (cid:107) L b k − ( f ) + C (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:0) exp (cid:0) − λ Φ k (cid:5)− k (cid:5)− k g (cid:1) + P m ϕ ( (cid:102) ck (cid:5)− ) (cid:1) + C m P m ϕ ( (cid:102) cK ) + C (cid:107) Π ⊥ f (cid:107) ∧ Φ /m + C n − (cid:0) K Φ K ) + Φ k g ) k g + Φ n o ) + (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:1) . Exploiting Lemma A.7 (ii) there is a numerical constant C such that for all m ∈ N holds P m ϕ ( (cid:102) cK ) (cid:54) C K Φ K ) m − , which together with (cid:107) Π ⊥ f (cid:107) ∧ Φ /m (cid:54) (cid:107) Π ⊥ f (cid:107) L Φ ( K ) m − implies E n,m f,ϕ (cid:107) (cid:98) f (cid:98) w − f (cid:107) L (cid:54) 3 (cid:107) Π ⊥ f (cid:107) L b k − + C (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:0) exp (cid:0) − λ Φ k (cid:5)− k (cid:5)− k g (cid:1) + P m ϕ ( (cid:102) ck (cid:5)− ) (cid:1) + C m − (cid:0) K Φ K ) + (cid:107) Π ⊥ f (cid:107) L Φ ( K ) (cid:1) + C n − (cid:0) K Φ K ) + Φ k g ) k g + Φ n o ) + (cid:107) Π ⊥ f (cid:107) L { k − > } (cid:1) (C.16). In order to control the terms involving k (cid:5)− and k − we distinguish for m ∈ N with m f, Φ := (cid:98)
289 log( K + 2) λ Φ K Φ ( K ) (cid:99) the following two cases: (b-i) m ∈ (cid:74) m f, Φ (cid:75) and (b-ii) m > m f, Φ . Consider first (b-i) m ∈ (cid:74) m f, Φ (cid:75) . We set k (cid:5)− = 1 and hence k − = 1 . Thereby, with b ( f ) (cid:54) , log( K + 2) (cid:54) K +2 e (cid:54) K , λ Φ k Φ ( k ) (cid:54) K Φ K ) , and hence m f, Φ (cid:54) C K Φ K ) , from (C.16) follows for all m ∈ (cid:74) m f, Φ (cid:75) E n,m f,ϕ (cid:107) (cid:98) f (cid:98) w − f (cid:107) L (cid:54) C n − (cid:0) K Φ K ) + Φ k g ) k g + Φ n o ) (cid:1) + C m − (cid:0) K Φ K ) + (cid:107) Π ⊥ f (cid:107) L ( K Φ K ) + Φ ( K ) ) (cid:1) (C.17). Consider (b-ii) m > m f, Φ ensuring the defining set of k (cid:63)m = max { k ∈ (cid:74) m (cid:75) : 289 log( k +2) λ Φ k Φ ( k ) (cid:54) m } is not empty and k (cid:63)m (cid:62) K . For each k (cid:5)− ∈ (cid:74) K, k (cid:63)m (cid:75) it follows P m ε ( (cid:102) ck (cid:5)− ) (cid:54) m − due to Lemma A.7 (iii). Since n > n f = [ K ∨ (cid:98) c f K Λ Φ K (cid:99) ] with c f := (cid:107) Π ⊥ f (cid:107) L b K − , k (cid:63)n = max { k ∈ (cid:74) n (cid:75) : n > c f k Λ Φ k } is not empty and k (cid:63)n (cid:62) K . For each k (cid:5)− ∈ (cid:74) K, k (cid:63)n (cid:75) we have b k (cid:5)− ( f ) = 0 , and pen Φ k (cid:5)− = k (cid:5)− Λ Φ k (cid:5)− n − < c − f = (cid:107) Π ⊥ f (cid:107) L b K − . It follows (cid:107) Π ⊥ f (cid:107) L b ( K − > (cid:107) Π ⊥ f (cid:107) L b k (cid:5)− + 104 pen Φ k (cid:5)− and trivially (cid:107) Π ⊥ f (cid:107) L b K = 0 < (cid:107) Π ⊥ f (cid:107) L b k (cid:5)− + 104 pen Φ k (cid:5)− . Therefore, k − as in (C.4) satisfies k − = K and hence b k − = 0 . Finally, setting k (cid:5)− := k (cid:63)n ∧ k (cid:63)m it follows P m ϕ ( (cid:102) ck (cid:5)− ) (cid:54) m − , k − = K and b k − = 0 .
From (C.16) it thus follows for all m > m f, Φ and n > n f, Φ that E n,m f,ϕ (cid:107) (cid:98) f (cid:98) w − f (cid:107) L (cid:54) C n − (cid:0) K Φ K ) n − + Φ k g ) k g + Φ n o ) + (cid:107) Π ⊥ f (cid:107) L (cid:1) + C (cid:107) Π ⊥ f (cid:107) L exp (cid:0) − λ Φ k (cid:5)− k (cid:5)− k g (cid:1) + C m − (cid:0) K Φ K ) + (cid:107) Π ⊥ f (cid:107) L Φ ( K ) (cid:1) . (C.18) By combining (C.17) and (C.18) for the cases (b-i) m ∈ (cid:74) m f, Φ (cid:75) and (b-ii) m > m f, Φ , the upper bound (C.15) holds in case (b), i.e., for all m ∈ N and for all n > n f, Φ , which completes the proof of Lemma C.5.

C.2 Proof of Theorem 3.7 and Corollary 3.8
Proof of Theorem 3.7.
Keeping (2.17) in mind, for all f ∈ F r f , ϕ ∈ E d s and k, n, m ∈ N we have (cid:107) [ g ] (cid:107) (cid:96) (cid:54) rd (cid:107) s (cid:113) f (cid:113) (cid:107) (cid:96) , hence k g = (cid:98) (cid:107) [ g ] (cid:107) (cid:96) (cid:99) (cid:54) (cid:98) rζ d (cid:107) s (cid:113) f (cid:113) (cid:107) (cid:96) (cid:99) = k fs and k g λ Φ k (cid:62) k fs λ s k , (cid:107) Π ⊥ f (cid:107) L (cid:54) r , (cid:107) Π ⊥ f (cid:107) L b k ( f ) (cid:54) r f k , pen Φ k (cid:54) ∆ ζ d k Λ s k /n , (cid:107) Π ⊥ f (cid:107) ∧ Φ /m (cid:54) rd (cid:107) f (cid:113) (1 ∧ s (cid:113) /m ) (cid:107) ∞ , and k − as in (C.4) satisfies (cid:107) Π ⊥ f (cid:107) L b k − ( f ) (cid:54) r f k (cid:5)− + 104∆ ζ d k (cid:5)− Λ s k (cid:5)− /n . Combining the last bounds together with the upper bound (C.7), there is a numerical constant C > such that uniformly for all f ∈ F r f , ϕ ∈ E d s , n, m ∈ N and k (cid:5)− , k (cid:5) + ∈ (cid:74) n (cid:75) it holds that E n,m f,ϕ (cid:107) (cid:98) f w − f (cid:107) L (cid:54) ζ d Λ s k (cid:5) + k (cid:5) + /n + r f k (cid:5) + + 3 r f k (cid:5)− + 312∆ ζ d k (cid:5)− Λ s k (cid:5)− /n + C r (cid:0) exp (cid:0) − λ s k (cid:5)− k (cid:5)− k fs (cid:1) + P m ϕ ( (cid:102) ck (cid:5)− ) (cid:1) + C m P m ϕ ( (cid:102) ck (cid:5) + ) + C rd (cid:107) f (cid:113) (1 ∧ s (cid:113) /m ) (cid:107) ∞ + C n − (cid:0) d s k fs k fs + d s n o + r (cid:1) . (C.19) We distinguish for m ∈ N with m s := (cid:98) ζ d λ s s (cid:99) the two cases, (a) m ∈ (cid:74) m s (cid:75) and (b) m > m s . Consider first case (a). We set k (cid:5) + = k (cid:5)− = 1 . Since P m ε ( (cid:102) c ) (cid:54) C Φ m − (cid:54) C d s m − due to Lemma A.7 (ii), (C.19) implies for all n ∈ N and m ∈ (cid:74) m s (cid:75) sup (cid:8) E n,m f,ϕ (cid:107) (cid:98) f w − f (cid:107) L : f ∈ F r f , ϕ ∈ E d s (cid:9) (cid:54) C rd (cid:107) f (cid:113) (1 ∧ s (cid:113) /m ) (cid:107) ∞ + C m − ( rζ d f + d ) s + C n − (cid:0) d s k fs k fs + d s n o + r (cid:1) . (C.20) Consider now case (b).
Since m > m s the defining set of k (cid:63)m := max { k ∈ (cid:74) m (cid:75) : 289 log( k + 2) ζ d λ s k s k (cid:54) m } is not empty. Keeping in mind that, due to (2.17), for all ϕ ∈ E d s and k ∈ (cid:74) k (cid:63)m (cid:75) it holds that ζ d λ s k s k (cid:62) λ Φ k Φ ( k ) , and hence m (cid:62) 289 log( k + 2) λ Φ k Φ ( k ) , we obtain P m ε ( (cid:102) ck ) (cid:54) m − applying Lemma A.7 (iii). For k ◦ n := k ◦ n ( f (cid:113) , Λ s (cid:113) ) ∈ (cid:74) n (cid:75) as in (2.5) let k (cid:5) + := k ◦ n ∧ k (cid:63)m and hence m P m ε ( (cid:102) ck (cid:5) + ) (cid:54) C m − . Since Λ s k (cid:5) + k (cid:5) + /n (cid:54) R k ◦ n n ( f (cid:113) , Λ s (cid:113) ) = R ◦ n ( f (cid:113) , Λ s (cid:113) ) and f k (cid:5) + (cid:54) R ◦ n ( f (cid:113) , Λ s (cid:113) ) + f k (cid:63)m , it follows from (C.19) that sup (cid:8) E n,m f,ϕ (cid:107) (cid:98) f w − f (cid:107) L : f ∈ F r f , ϕ ∈ E d s (cid:9) (cid:54) (2∆ ζ d + 2 r ) R ◦ n ( f (cid:113) , Λ s (cid:113) ) + r f k (cid:63)m + 3 r f k (cid:5)− + 312∆ ζ d k (cid:5)− Λ s k (cid:5)− /n + C r (cid:0) exp (cid:0) − λ s k (cid:5)− k (cid:5)− k fs (cid:1) + P m ϕ ( (cid:102) ck (cid:5)− ) (cid:1) + C m − + C rd (cid:107) f (cid:113) (1 ∧ s (cid:113) /m ) (cid:107) ∞ + C n − (cid:0) d s k fs k fs + d s n o + r (cid:1) . (C.21) For k (cid:63)n := arg min {R kn ( f (cid:113) , Λ s (cid:113) ) ∨ exp (cid:0) − λ s k kk fs (cid:1) : k ∈ (cid:74) n (cid:75) } with R k (cid:63)n n ( f (cid:113) , Λ s (cid:113) ) (cid:54) ρ ◦ n ( f (cid:113) , Λ s (cid:113) ) let k (cid:5)− := k (cid:63)n ∧ k (cid:63)m and hence P m ε ( (cid:102) ck (cid:5)− ) (cid:54) m − .
Since r f k (cid:5)− + ζ d Λ s k (cid:5)− k (cid:5)− n − (cid:54) r f k (cid:63)m + ( r + ζ d ) R k (cid:63)n n ( f (cid:113) , Λ s (cid:113) ) and n − (cid:54) R ◦ n ( f (cid:113) , Λ s (cid:113) ) (cid:54) R k (cid:63)n n ( f (cid:113) , Λ s (cid:113) ) , it follows from (C.21) that for all n ∈ N , m > m s sup (cid:8) E n,m f,ϕ (cid:107) (cid:98) f w − f (cid:107) L : f ∈ F r f , ϕ ∈ E d s (cid:9) (cid:54) C ( r + ζ d ) ρ ◦ n ( f (cid:113) , Λ s (cid:113) ) + C rd (cid:107) f (cid:113) (1 ∧ s (cid:113) /m ) (cid:107) ∞ + 5 r (cid:2) f k (cid:63)m ∨ exp (cid:0) − λ s k(cid:63)m k (cid:63)m k fs (cid:1)(cid:3) + C rm − + C n − (cid:0) d s k fs k fs + d s n o (cid:1) . (C.22) Combining (C.20) and (C.22) for the cases (a) and (b), for all n, m ∈ N it holds that sup (cid:8) E n,m f,ϕ (cid:107) (cid:98) f w − f (cid:107) L : f ∈ F r f , ϕ ∈ E d s (cid:9) (cid:54) C ( r + ζ d ) ρ ◦ n ( f (cid:113) , Λ s (cid:113) ) { m>m ϕ } + C r (cid:2) f k (cid:63)m ∨ exp (cid:0) − λ s k(cid:63)m k (cid:63)m k fs (cid:1)(cid:3) { m>m ϕ } + C rd (cid:107) f (cid:113) (1 ∧ s (cid:113) /m ) (cid:107) ∞ + C m − ( rζ d f + d ) s + C n − (cid:0) d s k fs k fs + d s n o (cid:1) , (C.23) which shows (3.9) and completes the proof of Theorem 3.7.

Proof of Corollary 3.8.
The proof is similar to the proofs of Corollary 2.8 and Corollary 3.5, and we omit the details.
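The dimension choices appearing throughout these proofs (k ◦ n, k ⋆ n, k ⋆ m) all balance a squared-bias-type term against a variance/penalty-type term. The following minimal sketch illustrates this trade-off numerically; the arguments `bias2` and `penalty` are purely hypothetical stand-in sequences (a polynomial bias decay and a linear-in-k penalty), not the quantities of the paper.

```python
def balanced_dim(n, bias2, penalty):
    """Return arg min over k in {1, ..., n} of bias2(k) + penalty(k),
    mimicking the squared-bias / penalty trade-off behind the dimension
    choices in the proofs above. `bias2` and `penalty` are callables
    k -> squared-bias proxy and k -> penalty proxy (hypothetical stand-ins).
    """
    return min(range(1, n + 1), key=lambda k: bias2(k) + penalty(k))
```

With the stand-ins `bias2(k) = k**-2` and `penalty(k) = k / n`, for instance, the balancing dimension grows with the sample size n roughly like n to the power 1/3, the familiar spectral cut-off behaviour in the mildly ill-posed regime.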
References
C. Agostinelli and U. Lund. R package circular: Circular Statistics (version 0.4-93). CA: Department of Environmental Sciences, Informatics and Statistics, Ca' Foscari University, Venice, Italy; UL: Department of Statistics, California Polytechnic State University, San Luis Obispo, California, USA, 2017. URL https://r-forge.r-project.org/projects/circular/.

C. Bahlmann. Directional features in online handwriting recognition. Pattern Recognition, 39(1):115–125, 2006.

A. Barron, L. Birgé, and P. Massart. Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113(3):301–413, 1999.

J.-P. Baudry, C. Maugis, and B. Michel. Slope heuristics: overview and implementation. Statistics and Computing, 22(2):455–470, 2012.

P. C. Bellec and A. B. Tsybakov. Sharp oracle bounds for monotone and convex regression through aggregation. Journal of Machine Learning Research, 16:1879–1892, 2015.

L. Birgé and P. Massart. Minimum contrast estimators on sieves: exponential bounds and rates of convergence. Bernoulli, 4(3):329–375, 1998.

P. Brémaud. Fourier analysis of stochastic processes. In Fourier Analysis and Stochastic Processes, pages 119–179. Springer, 2014.

J. A. Carnicero, M. C. Ausín, and M. P. Wiper. Non-parametric copulas for circular–linear and circular–circular data: an application to wind directions. Stochastic Environmental Research and Risk Assessment, 27(8):1991–2002, 2013.

F. Comte and F. Merlevède. Adaptive estimation of the stationary density of discrete and continuous time mixing processes. ESAIM: Probability and Statistics, 6:211–238, 2002.

F. Comte and M.-L. Taupin. Adaptive density deconvolution for circular data. Prépublication MAP5 2003-10, Université Paris Descartes, 2003.

Microsoft Corporation and S. Weston. doParallel: Foreach Parallel Adaptor for the 'parallel' Package, 2019. URL https://CRAN.R-project.org/package=doParallel. R package version 1.0.15.

A. Dalalyan and A. B. Tsybakov. Aggregation by exponential weighting, sharp PAC-Bayesian bounds and sparsity. Machine Learning, 72(1-2):39–61, 2008.

A. S. Dalalyan and A. B. Tsybakov. Sparse regression learning by aggregation and Langevin Monte-Carlo. Journal of Computer and System Sciences, 78(5):1423–1443, 2012.

S. Efromovich. Density estimation for the case of supersmooth measurement error. Journal of the American Statistical Association, 92:526–535, 1997.

J. Gill and D. Hangartner. Circular data in political science and how to handle it. Political Analysis, pages 316–336, 2010.

J. Johannes and M. Schwarz. Adaptive circular deconvolution by model selection under unknown error distribution. Bernoulli, 19(5A):1576–1611, 2013.

J. Johannes, A. Simoni, and R. Schenk. Adaptive Bayesian estimation in indirect Gaussian sequence space models. Annals of Economics and Statistics, (137):83–116, 2020.

T. Klein and E. Rio. Concentration around the mean for maxima of empirical processes. The Annals of Probability, 33(3):1060–1077, 2005.

X. Loizeau. Hierarchical Bayes and frequentist aggregation in inverse problems. PhD thesis, 2020.

P. Massart. Concentration Inequalities and Model Selection. École d'été de probabilités de Saint-Flour XXXIII – 2003, Lecture Notes in Mathematics 1896. Berlin: Springer, 2007.

A. Meister. Deconvolution Problems in Nonparametric Statistics. Lecture Notes in Statistics 193. Berlin: Springer, 2009.

Microsoft and S. Weston. foreach: Provides Foreach Looping Construct, 2020. URL https://CRAN.R-project.org/package=foreach. R package version 1.5.0.

M. H. Neumann. On the effect of estimating the error density in nonparametric deconvolution. Journal of Nonparametric Statistics, 7:307–330, 1997.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2018.

P. Rigollet and A. B. Tsybakov. Linear and convex aggregation of density estimators. Mathematical Methods of Statistics, 16(3):260–280, 2007.

P. Rigollet. Kullback–Leibler aggregation and misspecified generalized linear models. The Annals of Statistics, 40(2):639–665, 2012.

S. Schluttenhofer and J. Johannes. Adaptive minimax testing for circular convolution. Technical report, arXiv:2007.06388, 2020a.

S. Schluttenhofer and J. Johannes. Minimax testing and quadratic functional estimation for circular convolution. Technical report, arXiv:2004.12714, 2020b.

M. Talagrand. New concentration inequalities in product spaces. Inventiones Mathematicae, 126:505–563, 1996.

A. B. Tsybakov. Aggregation and minimax optimality in high-dimensional estimation. In Proceedings of the International Congress of Mathematicians, volume 3, pages 225–246, 2014.

H. Wickham. Reshaping data with the reshape package. Journal of Statistical Software, 21(12):1–20, 2007.

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer.