Non-asymptotic Bayesian Minimax Adaptation
Keisuke Yano and Fumiyasu Komaki
Department of Mathematical Informatics, Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
e-mail: [email protected]; [email protected]
RIKEN Brain Science Institute, 2-1 Hirosawa, Wako City, Saitama 351-0198, Japan
Abstract:
This paper studies a Bayesian approach to non-asymptotic minimax adaptation in nonparametric estimation. Estimating an input function on the basis of output functions in a Gaussian white-noise model is discussed. The input function is assumed to lie in a Sobolev ellipsoid with an unknown smoothness and an unknown radius. Our purpose is to present a Bayesian approach attaining minimaxity up to a universal constant without any knowledge of the smoothness or the radius. Our Bayesian approach provides not only a rate-exact minimax adaptive estimator in large-sample asymptotics but also a risk bound for the Bayes estimator quantifying the effects of both the smoothness and the ratio of the squared radius to the noise variance; the smoothness and this ratio are the key parameters describing the minimax risk in this model. Application to nonparametric regression models is also discussed.
MSC 2010 subject classifications:
Primary 62G05; secondary 62G20.
Keywords and phrases:
Adaptive posterior contraction, Bayesian nonparametrics, Gaussian infinite sequence model, Nonparametric regression, Pinsker's theorem.
1. Introduction
Consider estimation of the mean in a Gaussian infinite sequence model. Let $x = (x_1, x_2, \ldots)$ be an observation from $P_{\theta,\varepsilon} := \bigotimes_{i=1}^{\infty} \mathrm{N}(\theta_i, \varepsilon^2)$ with an unknown mean $\theta \in \ell^2$ and a known variance $\varepsilon^2$. We assume that $\theta$ is included in a Sobolev ellipsoid
\[
\mathcal{E}(\alpha_0, B) := \Big\{ \theta \in \ell^2 : \sum_{i=1}^{\infty} i^{2\alpha_0} \theta_i^2 \le B^2 \Big\}, \tag{1}
\]
where both the smoothness $\alpha_0$ and the radius $B$ are unknown. We measure the performance of an estimator $\hat\theta$ of $\theta$ by the normalized mean squared risk $R(\theta, \hat\theta) = \mathrm{E}_{\theta,\varepsilon}[\|\hat\theta(X) - \theta\|^2]/B^2$, where $\mathrm{E}_{\theta,\varepsilon}$ is the expectation of $X$ with respect to $P_{\theta,\varepsilon}$ and $\|v\|^2 := \sum_{i=1}^{\infty} v_i^2$ for $v \in \ell^2$.

Estimation in Gaussian infinite sequence models is canonical in the context of nonparametric estimation. Consider the case in which $\alpha_0$ is a positive integer. With the setting $n = \lfloor 1/\varepsilon^2 \rfloor$, this estimation is equivalent to estimation of an input function in a Gaussian white-noise model, that is, estimating an unknown input function $f$ based on independent and identically distributed (i.i.d.) output functions $Y_1(\cdot), \ldots, Y_n(\cdot)$ given by
\[
\mathrm{d}Y_i(t) = f(t)\,\mathrm{d}t + \mathrm{d}W_i(t), \quad t \in [0,1], \quad i = 1, \ldots, n,
\]
where $W_1(\cdot), \ldots, W_n(\cdot)$ are independent standard Brownian motions, and $f$ is an $L^2[0,1]$ function of which the $L^2[0,1]$ norm of the $\alpha_0$-th derivative is bounded by $B/\pi^{\alpha_0}$. The correspondence between the parameters ($f$ and $\theta$) in the Gaussian white-noise and Gaussian infinite sequence models is as follows. Let $\{\phi_i : i = 1, 2, \ldots\}$ be the trigonometric series:
\[
\phi_1(t) := 1, \quad \phi_{2k}(t) := \sqrt{2}\cos(2k\pi t), \quad \phi_{2k+1}(t) := \sqrt{2}\sin(2k\pi t), \quad k = 1, 2, \ldots. \tag{2}
\]
For $i \in \mathbb{N}$, $\theta_i$ corresponds to $\int_0^1 f(t)\phi_i(t)\,\mathrm{d}t$. The equivalence of Gaussian white-noise and Gaussian infinite sequence models can be shown through a sufficiency reduction and through transformation via the trigonometric series; for the proof, including the case in which $\alpha_0$ is not an integer, see Lemma A.3 in [41]. Nonparametric regression models are asymptotically equivalent to Gaussian infinite sequence models; see [9] and Subsection 3.2 in this paper. For comprehensive references, see [13, 43, 41, 20].
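The correspondence can be made concrete numerically. The following minimal sketch (our illustration, not code from the paper; the example function `f` and the grid size are our assumptions) computes the trigonometric coefficients $\theta_i = \int_0^1 f(t)\phi_i(t)\,\mathrm{d}t$ by a Riemann sum and then simulates sequence-model data $x_i \sim \mathrm{N}(\theta_i, \varepsilon^2)$.

```python
# Hedged sketch: theta_i = int_0^1 f(t) phi_i(t) dt with the trigonometric basis (2),
# followed by simulated Gaussian sequence data x_i ~ N(theta_i, eps^2).
import numpy as np

def trig_basis(i, t):
    """phi_1 = 1, phi_{2k}(t) = sqrt(2) cos(2k pi t), phi_{2k+1}(t) = sqrt(2) sin(2k pi t)."""
    if i == 1:
        return np.ones_like(t)
    k = i // 2
    trig = np.cos if i % 2 == 0 else np.sin
    return np.sqrt(2.0) * trig(2.0 * np.pi * k * t)

f = lambda t: np.minimum(t, 1.0 - t)               # an example input function (ours)
t = np.linspace(0.0, 1.0, 20_000, endpoint=False)
dt = t[1] - t[0]
theta = np.array([np.sum(f(t) * trig_basis(i, t)) * dt for i in range(1, 101)])

eps = 0.1                                           # noise level; n = floor(1/eps^2)
rng = np.random.default_rng(0)
x = theta + eps * rng.standard_normal(theta.size)   # sequence-model observation
```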
The aim of the present paper is to develop a Bayesian approach to non-asymptotic minimax adaptation in the Gaussian infinite sequence model. A non-asymptotically minimax adaptive estimator is defined as an estimator $\hat\theta$ for which there exists a positive constant $C$ not depending on $B$ or $\varepsilon$ such that
\[
\sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta, \hat\theta) \le C \inf_{\delta} \sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta, \delta) \quad \text{for any } 0 < \varepsilon \le B. \tag{3}
\]
For this definition of non-asymptotic minimax adaptation, see p. 362 of [4] and p. 212 of [8]. We develop a prior distribution that yields a non-asymptotically minimax adaptive Bayes estimator and present a risk bound for the Bayes estimator.

Non-asymptotic minimax adaptation is important since it gives not only the rate of convergence in $\varepsilon$ but also a risk bound quantifying the influence of the ratio $\varepsilon/B$ and the smoothness $\alpha_0$. The important point here is that the two quantities $\varepsilon/B$ and $\alpha_0$ completely determine the difficulty of estimation in the Gaussian infinite sequence model, that is, the minimax risk over a Sobolev ellipsoid, $\inf_{\hat\theta}\sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta,\hat\theta)$. This is shown by the fact that the minimax risk is invariant whenever the value of $\varepsilon/B$ is unchanged; for example, the minimax risk over $\mathcal{E}(\alpha_0, B/10)$ with noise variance $\varepsilon^2/100$ is identical to the minimax risk over $\mathcal{E}(\alpha_0, B)$ with noise variance $\varepsilon^2$. In fact, if we let $\tilde\theta = \theta/B$, $\tilde\varepsilon = \varepsilon/B$, and $\widetilde{X} = X/B$, then we have
\[
\inf_{\hat\theta} \sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta,\hat\theta)
= \inf_{\hat\theta} \sup_{\tilde\theta \in \mathcal{E}(\alpha_0,1)} \mathrm{E}_{\tilde\theta,\tilde\varepsilon}\Big[ \sum_{i=1}^{\infty} \{\tilde\theta_i - \hat\theta_i(B\widetilde{X})/B\}^2 \Big]
= \inf_{\hat\theta} \sup_{\theta \in \mathcal{E}(\alpha_0,1)} \mathrm{E}_{\theta,\tilde\varepsilon}\Big[ \sum_{i=1}^{\infty} \{\theta_i - \hat\theta_i(\widetilde{X})\}^2 \Big]. \tag{4}
\]

Developing a Bayesian approach to non-asymptotic minimax adaptation is not straightforward. Since, if both $B$ and $\alpha_0$ are known, the truncation estimator $(1_{1\le d}X_1, \ldots, 1_{i\le d}X_i, \ldots)$ with $d = (B/\varepsilon)^{2/(2\alpha_0+1)}$ attains (3), one expects that putting a prior distribution on $d$ would attain non-asymptotic minimax adaptation. However, this idea is not satisfactory, as shown in the following. Consider a prior distribution of the form
\[
\theta \mid D \sim \Big[ \bigotimes_{i=1}^{D} \mathrm{N}(0,1) \Big] \otimes \Big[ \bigotimes_{i=D+1}^{\infty} \mathrm{N}(0,0) \Big], \quad D \sim M,
\]
where $M$ is a distribution on $\mathbb{N}$. For the Bayes estimator $\hat\theta^{*}$ based on this prior, we have
\[
\sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta,\hat\theta^{*})
\ge \sup_{\theta \in \mathcal{E}(\alpha_0,B)} \frac{\mathrm{E}_{\theta,\varepsilon}[(\theta_1 - \hat\theta^{*}_1(X))^2]}{B^2}
\ge \frac{\varepsilon^4}{(\varepsilon^2+1)^2} \sup_{\theta \in \mathcal{E}(\alpha_0,B)} (\theta_1^2/B^2),
\]
where the first inequality follows since the mean squared risk is larger than the coordinate-wise mean squared risk, and the second inequality follows since $\hat\theta^{*}_1(X) = \{1/(\varepsilon^2+1)\}X_1$. The rightmost term in the above inequality is bounded below by $\varepsilon^4/(\varepsilon^2+1)^2$ because $\mathcal{E}(\alpha_0,B)$ contains $(B, 0, 0, \ldots)$. In contrast, the minimax risk goes to $0$ as $B$ grows, since the minimax risk is invariant whenever the value of $B/\varepsilon$ is unchanged and since the minimax risk goes to $0$ as $\varepsilon$ goes to $0$. Thus, $\hat\theta^{*}$ does not attain non-asymptotic minimax adaptation. Other examples that do not attain non-asymptotic minimax adaptation are presented in Section 2.

We work with a simple prior distribution of the form
\[
\theta \mid (D,K) \sim \Big[ \bigotimes_{i=1}^{D} \mathrm{N}\big(0, \varepsilon^2 D^{2K+1} i^{-(2K+1)}\big) \Big] \otimes \Big[ \bigotimes_{i=D+1}^{\infty} \mathrm{N}(0,0) \Big], \quad (D,K) \sim M \otimes F,
\]
where $M$ and $F$ are distributions on $\mathbb{N}$. In Section 3, we show that its Bayes estimator is non-asymptotically adaptive. The prior is a modification of a sieve prior in the literature; see Subsection 1.1 below. The modification comes from two principal ideas. The first idea is to endow $(B/\varepsilon)^2$ with a prior distribution. Starting from the Gaussian prior distribution $\bigotimes_{i=1}^{D} \mathrm{N}(0, V i^{-2K-1}) \otimes \bigotimes_{i=D+1}^{\infty} \mathrm{N}(0,0)$ with hyperparameters $D$, $V$, and $K$, we endow $D$, $V$, and $K$ with prior distributions. Here, the prior distribution on $V/\varepsilon^2$ corresponds to a prior distribution of $(B/\varepsilon)^2$. The second idea is to put a prior distribution simultaneously on $D$ and $V$, focusing on the stochastic behavior of the seminorm $\sum_{i=1}^{D} i^{-2K-1} N_i^2$ with independent Gaussian random variables $\{N_i : i = 1, \ldots, D\}$, as shown in Lemma 4. The second idea also enables us to calculate the posterior distribution easily.

1.1. Literature review

There is an extensive literature on asymptotic minimax adaptation in Gaussian infinite sequence models. Efromovich and Pinsker [14] developed an asymptotically minimax adaptive estimator. Cai, Low, and Zhao [10] and Cavalier and Tsybakov [11] constructed asymptotically minimax adaptive estimators on the basis of the James–Stein estimator. There also exists a literature from a Bayesian perspective.
Belitser and Ghosal [6] showed that putting prior distributions on the hyperparameter $\alpha$ in the Gaussian distribution $\mathrm{G}(\cdot \mid \alpha)$ leads to asymptotic minimax adaptation in the case in which the smoothness is included in a discrete set. Scricciolo [35] obtained the corresponding result in Bayesian nonparametric density estimation. Huang [23] removed the assumption on $\alpha$, at the price that a logarithmic factor is added to the rate. See also [2, 18, 26, 33]. Recently, results on rate-exact Bayesian minimax adaptation without any assumption on $\alpha$ were elegantly established by [16, 22, 24, 25]. However, non-asymptotic minimax adaptation implies asymptotic minimax adaptation, whereas the converse does not hold, as shown in Section 2. To achieve non-asymptotic Bayesian minimax adaptation, further consideration is required in constructing a prior distribution.

Non-asymptotic minimax adaptation has been studied from the viewpoints of model selection and frequentist model averaging. In nonparametric density estimation, Birgé and Massart [7] showed that the estimator based on an analogue of Mallows' $C_p$ [28] (equivalently, the Akaike Information Criterion (AIC) [1] and Stein's Unbiased Risk Estimator (SURE) [38]) attains non-asymptotic minimax adaptation. For the corresponding results in nonparametric regression models and in Gaussian infinite sequence models, see [3, 8]; see also [4] for more general results. Non-asymptotic minimax adaptation is also attained by frequentist model averaging. A slight modification of the arguments in [12, 27] shows that the frequentist model averaging estimator using Mallows' $C_p$ attains non-asymptotic minimax adaptation, as discussed in Section 2. Yet, the question whether there exists a fully Bayesian approach attaining non-asymptotic minimax adaptation has remained unresolved. Although there is a connection between the frequentist model averaging estimator and Bayesian model averaging, as discussed in [21] (see also the appendix in [27]), posterior distributions of $\theta$ are not available in the frequentist model averaging approach. For this reason, we distinguish frequentist model averaging from Bayesian model averaging.

The prior distribution that we use is a modification of a sieve prior discussed in [2, 33, 45, 36, 37]. Originally, the sieve prior was introduced by Zhao [45] to resolve some Bayesian nonparametric problems regarding prior masses on the parameter space. Arbel et al. [2] showed that the Bayes estimator based on the sieve prior is asymptotically minimax adaptive up to a logarithmic factor under the asymptotics as $\varepsilon \to 0$.
A practical advantage of sieve priors in Gaussian sequence models is that exact sampling from the posterior distribution can be performed by a simple acceptance-rejection method. However, the Bayes estimator based on the original sieve prior in [45] fails to attain non-asymptotic minimax adaptation, as discussed in Section 2. Thus, a non-trivial refinement is necessary.
The remainder of the paper is organized as follows. In Section 2, we review several existing estimators from the viewpoint of non-asymptotic adaptation. In Section 3, a non-asymptotically adaptive Bayes estimator is proposed; this is the principal part of this study. Section 4 presents numerical experiments, which indicate that our Bayesian approach is possibly better than the model selection-based estimator and is comparable to the model averaging estimator. Section 5 provides proofs of the theorems in Section 3. Some proofs of lemmas and supplementary numerical experiments are provided in the appendices.
2. Non-asymptotic adaptation and existing estimators
In this section, we review existing estimators from the perspective of non-asymptotic adaptation, with the aim of assisting the reader in understanding non-asymptotic adaptation.

Let us first mention a necessary condition for non-asymptotic minimax adaptation: the rate of convergence of the maximal risk of a non-asymptotically minimax adaptive estimator with respect to $\varepsilon/B$ must be $(\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)}$. In particular, the rate of convergence with respect to $1/B$ is $B^{-4\alpha_0/(2\alpha_0+1)}$. This is because, for each $\alpha_0 > 0$, the asymptotic equality
\[
\lim_{B/\varepsilon \to \infty} \Big[ \inf_{\hat\theta} \sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta,\hat\theta) \Big/ (\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)} \Big] = c_{\mathrm{P}}(\alpha_0) \tag{5}
\]
follows from (4) and from Pinsker's theorem [31] with $B = 1$, where $c_{\mathrm{P}}(\alpha_0) := (2\alpha_0+1)^{1/(2\alpha_0+1)} \{\alpha_0/(\alpha_0+1)\}^{2\alpha_0/(2\alpha_0+1)}$. Pinsker's theorem states that, for each $\alpha_0 > 0$ and each $B > 0$, we have
\[
\lim_{\varepsilon \to 0} \Big[ \inf_{\hat\theta} \sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta,\hat\theta) \Big/ (\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)} \Big] = c_{\mathrm{P}}(\alpha_0). \tag{6}
\]
Non-asymptotic minimax adaptation implies asymptotic minimax adaptation by definition, whereas the converse does not hold. Even when $\alpha_0$ is known, asymptotic minimaxity in small-$\varepsilon$ asymptotics does not necessarily imply (3).

First, consider the Bayes estimator $\hat\theta_{\mathrm{G}(\cdot|\alpha)}$ based on the Gaussian distribution
\[
\mathrm{G}(\cdot \mid \alpha) := \bigotimes_{i=1}^{\infty} \mathrm{N}(0, i^{-2\alpha-1}), \quad \alpha > 0.
\]
This estimator with $\alpha = \alpha_0$ is shown to achieve the minimax rate of convergence as $\varepsilon \to 0$:
\[
\limsup_{\varepsilon \to 0} \Big[ \sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta, \hat\theta_{\mathrm{G}(\cdot|\alpha_0)}) \Big/ \varepsilon^{4\alpha_0/(2\alpha_0+1)} \Big] < \infty;
\]
see [15, 45]. Using the necessary condition mentioned above, we show that this estimator does not attain non-asymptotic minimax adaptation. From the necessary condition, it suffices to show that $\sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta, \hat\theta_{\mathrm{G}(\cdot|\alpha_0)})$ is bounded below by a positive constant not depending on $B$. Let $\bar\theta$ be the $\ell^2$-vector of which the $i$-th coordinate is $B$ if $i = 1$ and $0$ otherwise. Then, the supremum of $R(\theta, \hat\theta_{\mathrm{G}(\cdot|\alpha_0)})$ over $\mathcal{E}(\alpha_0,B)$ satisfies
\[
\sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta, \hat\theta_{\mathrm{G}(\cdot|\alpha_0)}) \ge R(\bar\theta, \hat\theta_{\mathrm{G}(\cdot|\alpha_0)}) \ge \varepsilon^4/(1+\varepsilon^2)^2, \quad \varepsilon > 0,
\]
and thus the proof is completed.

Second, consider the sieve prior introduced by [45]:
\[
\mathrm{C}_M(\cdot \mid \alpha) := \sum_{d=1}^{\infty} M(d) \Big[ \bigotimes_{i=1}^{d} \mathrm{N}(0, i^{-2\alpha-1}) \Big] \otimes \Big[ \bigotimes_{i=d+1}^{\infty} \mathrm{N}(0,0) \Big],
\]
where $M$ is a probability distribution on $\mathbb{N}$ with $M(d) > 0$ for all $d \in \mathbb{N}$. The Bayes estimator with $\alpha = \alpha_0$ is shown to be asymptotically minimax in rate, and the Bayes estimator is asymptotically adaptive up to a logarithmic factor; see [45, 2]. Since the first coordinate of $\hat\theta_{\mathrm{C}_M(\cdot|\alpha)}$ for any $\alpha > 0$ and any $M$ is given by $\{1/(1+\varepsilon^2)\}X_1$, it follows from the same calculation as that for $\hat\theta_{\mathrm{G}(\cdot|\alpha_0)}$ that we have
\[
\sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta, \hat\theta_{\mathrm{C}_M(\cdot|\alpha)}) \ge \varepsilon^4/(1+\varepsilon^2)^2, \quad \varepsilon > 0,
\]
which shows that $\hat\theta_{\mathrm{C}_M(\cdot|\alpha)}$ does not attain non-asymptotic minimax adaptation.

Finally, consider the blockwise James–Stein estimator, which is shown to achieve asymptotic minimax adaptation; see [10, 11]. The construction of the blockwise James–Stein estimator is rather technical and is not presented here. Note that the blockwise James–Stein estimator is a truncation estimator with $d = \lfloor 1/\varepsilon^2 \rfloor$, where a truncation estimator with dimension $d$ is the $d$-dimensional truncation of an estimator in $\ell^2$. No truncation estimator $\hat\theta^{(d)}$ with dimension $d$ (possibly depending on $\varepsilon$) attains non-asymptotic minimax adaptation; the proof of this property follows the same line as those for $\hat\theta_{\mathrm{G}(\cdot|\alpha_0)}$ and $\hat\theta_{\mathrm{C}_M(\cdot|\alpha)}$, since the supremum of $R(\theta, \hat\theta^{(d)})$ over $\mathcal{E}(\alpha_0,B)$ satisfies
\[
\sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta, \hat\theta^{(d)}) \ge \sup_{\theta \in \mathcal{E}(\alpha_0,B)} \sum_{i=d+1}^{\infty} \theta_i^2/B^2 \ge (d+1)^{-2\alpha_0}, \quad \varepsilon > 0.
\]
The above results are summarized in the following proposition.
Proposition 1.
The following three statements hold: (i) the Bayes estimator $\hat\theta_{\mathrm{G}(\cdot|\alpha_0)}$ does not satisfy (3); (ii) for any $\alpha > 0$, the Bayes estimator $\hat\theta_{\mathrm{C}_M(\cdot|\alpha)}$ does not satisfy (3); (iii) the blockwise James–Stein estimator does not satisfy (3).

We next explain that the model selection and model averaging estimators are non-asymptotically adaptive. For $d \in \mathbb{N}$, let $\hat r_d := -\sum_{i=1}^{d} X_i^2 + 2\varepsilon^2 d$. Let $\hat\theta_{\mathrm{MS}}$ be the estimator of which the $i$-th component is given by $X_i 1_{i \le \hat d}$, where $\hat d \in \operatorname{argmin}_{d \in \mathbb{N}} \hat r_d$. Let $\hat\theta_{\mathrm{MA},\beta}$ be the estimator of which the $i$-th component is given by $\sum_{d=1}^{\infty} w_d X_i 1_{i \le d}$, where $w_d \propto \exp\{-\beta \hat r_d/(2\varepsilon^2)\}$ for $d \in \mathbb{N}$ and $\sum_{d=1}^{\infty} w_d = 1$. For simplicity, we assume that $\beta \le 1/2$.

Proposition 2 (Theorem 1 in [8] and Section 7 in [27]). There exist positive constants $C_1$ and $C_2$ for which the inequalities
\[
\sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta, \hat\theta_{\mathrm{MS}}) \le C_1 (\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)}, \quad
\sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta, \hat\theta_{\mathrm{MA},\beta}) \le C_2 (\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)}
\]
hold, provided that $\varepsilon/B$ is smaller than one. Here, $C_1$ is a universal constant and $C_2$ depends only on $\beta$.

The proof for the model selection-based estimator follows immediately from Theorem 1 in [8]; see also [44]. The proof for the model averaging-based estimator is given in Appendix A for the sake of completeness.
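The following is a minimal numerical sketch of $\hat\theta_{\mathrm{MS}}$ and $\hat\theta_{\mathrm{MA},\beta}$ (ours, not the authors' code); it truncates the candidate dimensions at the length of the observed vector for computability, whereas the definitions above range over all of $\mathbb{N}$.

```python
# Hedged sketch of the model selection and model averaging estimators of Proposition 2.
import numpy as np

def risk_estimates(x, eps):
    """hat r_d = -sum_{i<=d} x_i^2 + 2 d eps^2 for d = 1, ..., len(x)."""
    d = np.arange(1, x.size + 1)
    return -np.cumsum(x**2) + 2.0 * d * eps**2

def theta_ms(x, eps):
    """Model selection: keep the first hat-d coordinates, hat-d = argmin_d hat r_d."""
    d_hat = int(np.argmin(risk_estimates(x, eps))) + 1
    out = np.zeros_like(x)
    out[:d_hat] = x[:d_hat]
    return out

def theta_ma(x, eps, beta=0.5):
    """Model averaging with exponential weights w_d ~ exp(-beta * hat r_d / (2 eps^2))."""
    logw = -beta * risk_estimates(x, eps) / (2.0 * eps**2)
    w = np.exp(logw - logw.max())
    w /= w.sum()
    # coordinate i receives total weight sum_{d >= i} w_d
    cum = np.cumsum(w[::-1])[::-1]
    return cum * x
```

The reverse cumulative sum implements $\hat\theta_{\mathrm{MA},\beta,i} = X_i \sum_{d \ge i} w_d$, which is the coordinate-wise form of the model average above.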
3. Non-asymptotic Bayesian adaptation
3.1. Main results

As discussed in the introduction, we work with the prior distribution
\[
\Pi := \sum_{k=1}^{\infty} F(k)\, \mathrm{S}_M(\cdot \mid \alpha = k), \tag{7}
\]
where
\[
\mathrm{S}_M(\cdot \mid \alpha) := \sum_{d=1}^{\infty} M(d)\, \mathrm{S}(\cdot \mid d, \alpha) \tag{8}
\]
and $\mathrm{S}(\cdot \mid d, \alpha)$ is the distribution on $\ell^2$ given by
\[
\mathrm{S}(\cdot \mid d, \alpha) := \Big[ \bigotimes_{i=1}^{d} \mathrm{N}\big(0, \varepsilon^2 d^{2\alpha+1} i^{-(2\alpha+1)}\big) \Big] \otimes \Big[ \bigotimes_{i=d+1}^{\infty} \mathrm{N}(0,0) \Big]. \tag{9}
\]
In the present paper, $M$ is assumed to be of the form $M(d) \propto \exp(-\eta d)$ with $\eta > 0$,
and $F$ is assumed to be of the form $F(k) \propto \exp(-\gamma k)$ with $\gamma > 0$. Theorem 1 presents non-asymptotic adaptive posterior contraction of $\Pi$, and Corollary 2 demonstrates non-asymptotic adaptation of the Bayes estimator based on $\Pi$.
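Before stating the results, we give a minimal sketch (under our reading of (7)–(9); the truncation of $k$ and $d$ at $p$ is our computational assumption) of one draw from $\Pi$: draw $k \sim F$ and $d \sim M$, then draw $\theta_i \sim \mathrm{N}(0, \varepsilon^2 (d/i)^{2k+1})$ for $i \le d$ and set $\theta_i = 0$ otherwise.

```python
# Hedged sketch: one draw from the prior Pi of (7)-(9), truncated at dimension p.
import numpy as np

def sample_prior(eps, eta=2.0, gamma=2.0, p=100, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.arange(1, p + 1)
    F = np.exp(-gamma * grid); F /= F.sum()   # F(k) ~ exp(-gamma k), truncated at p
    M = np.exp(-eta * grid);   M /= M.sum()   # M(d) ~ exp(-eta d),  truncated at p
    k = rng.choice(grid, p=F)                 # smoothness level
    d = rng.choice(grid, p=M)                 # truncation dimension
    theta = np.zeros(p)
    i = np.arange(1, d + 1)
    # std dev of theta_i is eps * (d/i)^{k + 1/2} since Var = eps^2 d^{2k+1} i^{-(2k+1)}
    theta[:d] = eps * (d / i) ** (k + 0.5) * rng.standard_normal(d)
    return theta
```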
Theorem 1.
There exist positive constants $C$ and $c$ depending only on $\alpha_0$, $\eta$ of $M$, and $\gamma$ of $F$ for which the inequality
\[
\mathrm{E}_{\theta_0,\varepsilon}\,\Pi\big( \|\theta - \theta_0\|^2/B^2 \ge C (\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)} \mid X \big) \le \exp\{-c (B/\varepsilon)^{2/(2\alpha_0+1)}\}
\]
holds uniformly in $\theta_0 \in \mathcal{E}(\alpha_0, B)$, provided that $\varepsilon/B$ is smaller than one.

The proof of this theorem is given in Section 5.
Corollary 2.
For every $\alpha_0 > 0$ and every $B > 0$, the Bayes estimator based on $\Pi$ is non-asymptotically adaptive: there exists a positive constant $C$ depending only on $\alpha_0$, $\eta$ of $M$, and $\gamma$ of $F$ for which the inequality
\[
\sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta, \hat\theta_{\Pi}) \le C \inf_{\hat\theta} \sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta, \hat\theta) \quad \text{for any } 0 < \varepsilon \le B
\]
holds.

Proof of Corollary 2. Take $\theta_0$ arbitrarily in $\mathcal{E}(\alpha_0,B)$. It suffices to show that
\[
\sup_{\theta_0 \in \mathcal{E}(\alpha_0,B)} R(\theta_0, \hat\theta_{\Pi}) \le \widetilde{C}_1 (\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)}
\]
for some positive constant $\widetilde{C}_1$ not depending on $\varepsilon$ or $B$, since it follows from Theorem 4.9 in [29] that there exists a universal positive constant $\widetilde{C}_2$ for which we have
\[
\inf_{\hat\theta} \sup_{\theta \in \mathcal{E}(\alpha_0,B)} R(\theta, \hat\theta) \ge \widetilde{C}_2 (\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)} \quad \text{for any } 0 < \varepsilon \le B.
\]
By Jensen's inequality, we have
\[
\mathrm{E}_{\theta_0,\varepsilon} \|\hat\theta_{\Pi} - \theta_0\|^2/B^2 \le \mathrm{E}_{\theta_0,\varepsilon} \int \|\theta - \theta_0\|^2/B^2 \,\mathrm{d}\Pi(\theta \mid X).
\]
By Fubini's theorem, we have
\[
\mathrm{E}_{\theta_0,\varepsilon} \int \|\theta - \theta_0\|^2/B^2 \,\mathrm{d}\Pi(\theta \mid X)
= \mathrm{E}_{\theta_0,\varepsilon} \int_0^{\infty} \Pi\big( \|\theta-\theta_0\|^2/B^2 \ge t \mid X \big)\,\mathrm{d}t
= \int_0^{\infty} \mathrm{E}_{\theta_0,\varepsilon} \Pi\big( \|\theta-\theta_0\|^2/B^2 \ge t \mid X \big)\,\mathrm{d}t.
\]
Taking $C$ sufficiently large, depending only on $\alpha_0$, $\eta$, and $\gamma$, and dividing $[0,\infty)$ into $[0, C(\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)})$ and $[C(\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)}, \infty)$, Theorem 1 yields
\[
\int_0^{\infty} \mathrm{E}_{\theta_0,\varepsilon} \Pi\big( \|\theta-\theta_0\|^2/B^2 \ge t \mid X \big)\,\mathrm{d}t
\le C (\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)} + (C/c) \exp\{-c(B/\varepsilon)^{2/(2\alpha_0+1)}\},
\]
where $c$ is the constant in Theorem 1. Since the constants $C$ and $c$ do not depend on $\theta_0$, the above inequality completes the proof.

Several remarks are in order.

Remark 1 (Posterior contraction of Gaussian prior distributions). In Section 2, we showed that the Bayes estimator based on the Gaussian prior $\mathrm{G}(\cdot \mid \alpha_0)$ does not satisfy (3). This prior also does not possess posterior contraction at the rate $(\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)}$ with respect to $\varepsilon/B$. Consider $\mathrm{G}(\cdot \mid \alpha_0) = \bigotimes_{i=1}^{\infty} \mathrm{N}(0, i^{-2\alpha_0-1})$. Let $\varepsilon = 1$ and let $\bar\theta$ be the $\ell^2$-vector of which the $i$-th coordinate is $B$ if $i = 1$ and $0$ otherwise. Under $\mathrm{G}(\cdot \mid \alpha_0)$, the posterior of $\theta_1$ given $x$ is $\mathrm{N}(x_1/2, 1/2)$. Hence, for any $\delta > 0$, any $C > 0$, and $P_{\bar\theta,1}$-almost all $x$, we have
\begin{align*}
&\mathrm{G}(\|\theta - \bar\theta\|^2/B^2 < CB^{-\delta} \mid X = x, \alpha = \alpha_0) \\
&\quad= \mathrm{G}\Big( \sum_{i=2}^{\infty} \theta_i^2 + (\theta_1 - B)^2 < CB^{2-\delta} \,\Big|\, X = x, \alpha = \alpha_0 \Big) \\
&\quad\le \mathrm{G}\big( (\theta_1 - B)^2 < CB^{2-\delta} \mid X = x, \alpha = \alpha_0 \big) \\
&\quad= \Pr\big[ B(1 - \sqrt{C}B^{-\delta/2}) < x_1/2 + N/\sqrt{2} < B(1 + \sqrt{C}B^{-\delta/2}) \big] \to 0 \quad \text{as } B \to \infty,
\end{align*}
where $N$ is a one-dimensional standard normal random variable. Thus, by the dominated convergence theorem, we have
\[
\lim_{B\to\infty} \sup_{\theta_0 \in \mathcal{E}(\alpha_0,B)} \mathrm{E}_{\theta_0,\varepsilon}\, \mathrm{G}( \|\theta-\theta_0\|^2/B^2 \ge CB^{-\delta} \mid X, \alpha = \alpha_0 ) = 1.
\]

Remark 2 (A further possibility). We mention the possibility that the Bayes estimator based on another prior distribution could attain non-asymptotic adaptation. [40] considered the Bayes estimator $\hat\theta_V$ based on the Gaussian scale mixture prior distribution
\[
\int_0^{\infty}\!\!\int_0^{\infty} \bigotimes_{i=1}^{\infty} \mathrm{N}(0, v\,i^{-2\alpha-1}) \,\mathrm{d}V(v)\,\mathrm{d}A(\alpha),
\]
where $V$ and $A$ are distributions on $[0,\infty)$. When working with Gaussian process prior distributions, mixtures with respect to the prior variance are often used; see also [32], [42], and [39]. Although we conjecture that Bayes estimators based on such scale mixtures would be non-asymptotically adaptive (see Appendix E), proving this appears to be challenging. Our prior distribution $\Pi$ enjoys the discrete structure of a prior distribution on $d$. A computational advantage of $\Pi$ is that the discrete structure simplifies the calculation of the posterior distribution; for the explicit form of the posterior distribution, see Appendix D. A technical advantage of $\Pi$ is that the calculations of the essential support and the small ball probability become easier, as shown in Lemmas 1 and 4.

Remark 3 (Non-asymptotic adaptation of $\mathrm{S}_M(\cdot \mid \alpha)$ in the undersmooth region). The proof of the principal theorem is based on the following non-asymptotic adaptation of $\mathrm{S}_M(\cdot \mid \alpha)$ in the undersmooth case that $\alpha \ge \alpha_0 - 1/2$.

Theorem 3.
Assume that $\alpha \ge \alpha_0 - 1/2$. Then, there exist positive numbers $C$ and $c$ depending only on $\alpha_0$, $\alpha$, and $\eta$ of $M$ for which the inequality
\[
\mathrm{E}_{\theta_0,\varepsilon}\,\mathrm{S}_M\big( \|\theta-\theta_0\|^2/B^2 \ge C(\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)} \mid X, \alpha \big) \le \exp\{-c(B/\varepsilon)^{2/(2\alpha_0+1)}\}
\]
holds uniformly in $\theta_0 \in \mathcal{E}(\alpha_0,B)$, provided that $B/\varepsilon$ is larger than one.
From Theorem 3, the Bayes estimator based on $\mathrm{S}_M(\cdot \mid \alpha)$ is non-asymptotically adaptive at least in the undersmooth region where $\alpha \ge \alpha_0 - 1/2$.

3.2. Application to nonparametric regression models

Our results also yield a non-asymptotically adaptive Bayes estimator in nonparametric regression models. Consider estimating a regression function $f : [0,1] \to \mathbb{R}$ based on observations $\{Y_1, \ldots, Y_n\}$ obeying
\[
Y_i = f(i/n) + W_i, \quad i = 1, \ldots, n,
\]
where the $W_i$'s are i.i.d. error terms from $\mathrm{N}(0,1)$ and $\{\phi_1, \phi_2, \ldots\}$ is the trigonometric series (2). We assume that $f$ belongs to the periodic Sobolev space $\mathcal{W}(\alpha_0, B)$ defined as follows: for $\alpha_0 > 0$ and $B > 0$,
\[
\mathcal{W}(\alpha_0, B) := \Big\{ f = \sum_{i=1}^{\infty} \theta_i \phi_i : \sum_{j=1}^{\infty} a_j^2 \theta_j^2 \le B^2/\pi^{2\alpha_0} \Big\},
\]
where $a_1 = 0$ and, for $k \in \mathbb{N}$, $a_{2k} = (2k)^{\alpha_0}$ and $a_{2k+1} = (2k)^{\alpha_0}$. In the case in which $\alpha_0$ is a positive integer, it follows from the Parseval equality that $\sum_{j=1}^{\infty} a_j^2 \theta_j^2 = \|f^{(\alpha_0)}\|_{L^2}^2/\pi^{2\alpha_0}$, where $\|\cdot\|_{L^2}$ is the $L^2[0,1]$ norm.

For an arbitrary positive integer $p$ such that $p < n$, we work with a prior distribution of $f$ using a prior on $\mathbb{R}^p$. Let $\Pi^{(p)}$ be the prior distribution of the form
\[
\Pi^{(p)} = \sum_{d=1}^{p} M(d) \sum_{k=1}^{\infty} F(k)\,\mathrm{S}(\cdot \mid d, \alpha = k) \Big/ \sum_{\tilde d=1}^{p} M(\tilde d).
\]
We identify $\Pi^{(p)}$ with a prior distribution of $f$ via the transformation $(\theta_1, \ldots, \theta_p) \to \sum_{i=1}^{p} \theta_i \phi_i$. Let $\hat f_{\Pi^{(p)}}$ be the Bayes estimator based on $\Pi^{(p)}$.
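The sufficiency reduction used below can be made concrete as follows; this is a hedged sketch of ours (reusing `trig_basis` from the sketch in Section 1), not the paper's code, and it relies on the approximate discrete orthonormality of the trigonometric design vectors.

```python
# Hedged sketch: reduce Y_i = f(i/n) + W_i to approximate sequence data
# z_j ~ N(theta_j, 1/n) via z_j = (1/n) sum_k Y_k phi_j(k/n), j = 1, ..., p < n.
import numpy as np

def regression_to_sequence(y, p):
    n = y.size
    t = np.arange(1, n + 1) / n
    z = np.empty(p)
    for j in range(1, p + 1):
        z[j - 1] = np.mean(y * trig_basis(j, t))
    return z  # z_j is approximately N(theta_j, 1/n)
```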
The following corollary provides a non-asymptotic risk bound for the Bayes estimator $\hat f_{\Pi^{(p)}}$. Let $R_n(f, \hat f)$ be the risk of an estimator $\hat f$ defined by $R_n(f,\hat f) := \mathrm{E}_{f,n}\|f - \hat f\|_{L^2}^2/B^2$, where $\|\cdot\|_{L^2}$ is the $L^2[0,1]$ norm and $\mathrm{E}_{f,n}$ is the expectation with respect to the distribution of $\{Y_1, \ldots, Y_n\}$ with true regression function $f$. Let $\tau(p)$ be the approximation error
\[
\tau(p) := \sup_{f \in \mathcal{W}(\alpha_0,B)} \inf_{\theta \in \mathbb{R}^p} \Big\| f - \sum_{i=1}^{p} \theta_i \phi_i \Big\|_{L^2}^2.
\]

Corollary 4.
There exists a positive constant $C$ depending only on $\alpha_0$, $\eta$ of $M$, and $\gamma$ of $F$ for which the Bayes estimator $\hat f_{\Pi^{(p)}}$ based on $\Pi^{(p)}$ satisfies
\[
\sup_{f \in \mathcal{W}(\alpha_0,B)} R_n(f, \hat f_{\Pi^{(p)}}) \le C \big[ \min\{p, (nB^2)^{1/(2\alpha_0+1)}\}/n + \tau(p) \big]/B^2, \tag{10}
\]
provided that $nB^2$ is larger than one.

The proof is a simple extension of that of Theorem 1 and is given in Subsection 5.4.

The implication of this corollary is that $\hat f_{\Pi^{(n-1)}}$ is non-asymptotically adaptive, provided that $n^{\alpha_0} \ge B$. From Corollary 4 with $p = n-1$,
and from the bound $\tau(p) \le B^2 p^{-2\alpha_0}$, there exists a positive constant $C_1$ not depending on $n$ or $B$ such that
\[
\sup_{f \in \mathcal{W}(\alpha_0,B)} R_n(f, \hat f_{\Pi^{(n-1)}}) \le C_1 (nB^2)^{-2\alpha_0/(2\alpha_0+1)},
\]
provided that $n^{\alpha_0} \ge B$. From Theorem 4.9 in [29], there exists a positive constant $C_2$ depending only on $\alpha_0$ for which the minimax risk over $\mathcal{W}(\alpha_0,B)$ is bounded below by $C_2 (nB^2)^{-2\alpha_0/(2\alpha_0+1)}$ in the same regime; hence $\hat f_{\Pi^{(n-1)}}$ is non-asymptotically adaptive.

4. Numerical experiments

In this section, we present numerical experiments focusing on the performance comparison of non-asymptotically adaptive estimators in low-$(B/\varepsilon)$ settings. Other comparisons, including comparisons between non-asymptotically adaptive estimators and estimators not satisfying (3), are provided in Appendix E.

The following three estimators are compared:
• the Bayes estimator $\hat\theta_{\Pi}$ based on $\Pi$ with $\eta = 2$ and $\gamma = 2$;
• the model averaging estimator $\hat\theta_{\mathrm{MA},1/2}$ with $\beta = 1/2$;
• the model selection-based estimator $\hat\theta_{\mathrm{MS}}$.

The numerical experiments are conducted using the $p = 100$-dimensional truncation. The noise variance $\varepsilon^2$ is fixed to one, and the radius $B$ is varied over five values, the smallest being $B = 1$ and $B = 2$. Losses at two parameter values are used for comparison:
• $\theta^{(1)}_i := B\,i^{-1.519}/\sqrt{100}$ for $i \in \mathbb{N}$;
• $\theta^{(2)}_1 := B$ and $\theta^{(2)}_i := 0$ for $i \ge 2$.

Here $\theta^{(1)}$ is included in $\mathcal{E}(\alpha_0, B)$ for any $0 < \alpha_0 \le 1.014$ and is not included in $\mathcal{E}(\alpha_0, B)$ for any $\alpha_0 > 1.014$, while $\theta^{(2)}$ is included in $\mathcal{E}(\alpha_0, B)$ for any $\alpha_0 > 0$.

[Figure 1: Means of losses with error bars at $\theta_0 = \theta^{(1)}$ for the five values of $B$. The means and error bars of the model selection-based estimator at $B = 1$ are omitted because they lie outside the plotted range.]

[Figure 2: Means of losses with error bars at $\theta_0 = \theta^{(2)}$ for the five values of $B$. The means and error bars of the model selection-based estimator at $B = 1$ are omitted because they lie outside the plotted range; its error bars at $B = 2$ are omitted because the upper bar lies outside the plotted range.]

The results are presented in Figures 1 and 2. At each $B$, the means (with standard deviations) of the losses of the proposed Bayes estimator $\hat\theta_{\Pi}$, the model averaging estimator $\hat\theta_{\mathrm{MA},1/2}$, and the model selection-based estimator $\hat\theta_{\mathrm{MS}}$ are plotted side by side, abbreviated as "Proposed," "Model averaging," and "Model selection," respectively. In each plot, the lower limit of an error bar is the maximum of zero and the mean minus the standard deviation; values outside the plotted range are omitted. Figure 1 indicates that the proposed Bayes estimator outperforms the model averaging estimator, while Figure 2 indicates that it underperforms the model averaging estimator. However, even in Figure 2, when $B$ is small, the performance of the proposed Bayes estimator is comparable (or possibly superior) to that of the model averaging estimator. Compared to the model averaging estimator, our approach directly puts a prior distribution on the scale of the parameter, which seems to produce the better outcome when $B$ is small.
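As a concrete illustration of this kind of comparison, the following hedged sketch (ours, not the authors' code; it reuses `theta_ms` and `theta_ma` from the sketch in Section 2) estimates Monte Carlo losses at the spike $\theta^{(2)} = (B, 0, 0, \ldots)$. The radii below are illustrative stand-ins for the five values used in the paper.

```python
# Hedged sketch of a Monte Carlo loss comparison at theta^{(2)} = (B, 0, 0, ...).
import numpy as np

rng = np.random.default_rng(1)
p, eps, n_rep = 100, 1.0, 200
for B in [1.0, 2.0, 4.0, 8.0, 16.0]:         # illustrative radii, not the paper's
    theta0 = np.zeros(p)
    theta0[0] = B
    loss_ms = loss_ma = 0.0
    for _ in range(n_rep):
        x = theta0 + eps * rng.standard_normal(p)
        loss_ms += np.sum((theta_ms(x, eps) - theta0) ** 2) / B ** 2
        loss_ma += np.sum((theta_ma(x, eps) - theta0) ** 2) / B ** 2
    print(f"B={B:5.1f}  MS={loss_ms / n_rep:.3f}  MA={loss_ma / n_rep:.3f}")
```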
5. Proofs for Section 3

The proofs follow the standard arguments in the Bayesian nonparametric literature [5, 17, 37]. The essential difference appears in the prior mass condition with respect to $\varepsilon/B$, under which the prior puts sufficient mass on neighborhoods of the true parameter with respect to $\varepsilon/B$; see Lemma 4.

The organization of this section is as follows. In Subsection 5.1, we prepare the lemmas to be used. In Subsection 5.2, we present the proof of Theorem 3. In Subsection 5.3, we present the proof of Theorem 1. In Subsection 5.4, we present the proof of Corollary 4.

5.1. Lemmas

In this subsection, we present our lemmas. The proofs of the lemmas are provided in Appendix B. Note that
\[
\{\theta : \|\theta-\theta_0\|^2/B^2 \ge C(\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)}\} = \{\theta : \|\theta-\theta_0\|^2/\varepsilon^2 \ge C(B/\varepsilon)^{2/(2\alpha_0+1)}\}.
\]
The first lemma provides the essential support of $\mathrm{S}_M(\cdot \mid \alpha)$. For a constant $c > 0$ and $\theta_0 \in \mathcal{E}(\alpha_0,B)$, let
\[
E_c(\theta_0) := \Big\{ \theta \in \ell^2 : \sum_{i > \lfloor c(B/\varepsilon)^{2/(2\alpha_0+1)} \rfloor} (\theta_i - \theta_{0,i})^2/\varepsilon^2 \le (B/\varepsilon)^{2/(2\alpha_0+1)} \Big\}.
\]

Lemma 1 (Essential support of the prior). For any $\alpha > 0$ and any $c > 0$, the inequality
\[
\mathrm{S}_M( E_c^{\mathrm{c}}(\theta_0) \mid \alpha ) \le \exp\{-\eta(c-1)(B/\varepsilon)^{2/(2\alpha_0+1)}\}
\]
holds uniformly in $\theta_0 \in \mathcal{E}(\alpha_0, B)$.

The second and third lemmas provide the complexity of the relevant set and the existence of test sequences. For a positive integer $C$ and $c > 0$, we divide $\{\theta \in \ell^2 : \|\theta-\theta_0\|^2/\varepsilon^2 \ge C(B/\varepsilon)^{2/(2\alpha_0+1)}\}$ as
\[
\{\theta : \|\theta-\theta_0\|^2/\varepsilon^2 \ge C(B/\varepsilon)^{2/(2\alpha_0+1)}\}
= \bigcup_{j=C}^{\infty} R(j; c) \;\cup\; \big[ \{\theta : \|\theta-\theta_0\|^2/\varepsilon^2 \ge C(B/\varepsilon)^{2/(2\alpha_0+1)}\} \cap E_c^{\mathrm{c}}(\theta_0) \big],
\]
where, for $j = C, C+1, \ldots$,
\[
R(j; c) := \{\theta \in E_c(\theta_0) : (j+1)(B/\varepsilon)^{2/(2\alpha_0+1)} > \|\theta-\theta_0\|^2/\varepsilon^2 \ge j(B/\varepsilon)^{2/(2\alpha_0+1)}\}.
\]
For $j = C, C+1, \ldots$, let $N(j; c)$ be the $(\varepsilon/2)\sqrt{j(B/\varepsilon)^{2/(2\alpha_0+1)}}$-covering number of $R(j; c)$ with respect to $\|\cdot\|$.

Lemma 2 (Covering number of $R(j;c)$; cf. Proposition A.1 in [16]). For each $j = C, C+1, \ldots$ and every $c > 0$, $\log N(j; c)$ is bounded above by $2c(B/\varepsilon)^{2/(2\alpha_0+1)}$.

Lemma 3 (Existence of test sequences; cf. Lemma 5 in [19]). Let $j$ be any positive integer, let $\theta_0$ be in $\mathcal{E}(\alpha_0,B)$, and let $\bar\theta^{(j)}$ be any $\ell^2$-vector such that $\|\bar\theta^{(j)} - \theta_0\|^2/\varepsilon^2 \ge j(B/\varepsilon)^{2/(2\alpha_0+1)}$. Let $\psi^{(j)}(X) := 1_{\|X - \bar\theta^{(j)}\| < \|X - \theta_0\|}$. Then, the inequalities
\[
\mathrm{E}_{\theta_0,\varepsilon}[\psi^{(j)}(X)] \le \exp\{-(j/8)(B/\varepsilon)^{2/(2\alpha_0+1)}\}
\]
and
\[
\sup_{\theta : \|\theta - \bar\theta^{(j)}\| \le \|\bar\theta^{(j)} - \theta_0\|/2} \mathrm{E}_{\theta,\varepsilon}[1 - \psi^{(j)}(X)] \le \exp\{-(j/8)(B/\varepsilon)^{2/(2\alpha_0+1)}\}
\]
hold.

The fourth lemma is the prior mass condition.

Lemma 4 (Prior mass condition). Assume that $\alpha \ge \alpha_0 - 1/2$. There exists a positive constant $c_1$ depending only on $\alpha$ and $\eta$ of $M$ for which the inequality
\[
\mathrm{S}_M(\theta : \|\theta-\theta_0\|^2/\varepsilon^2 \le 2(B/\varepsilon)^{2/(2\alpha_0+1)} \mid \alpha) \ge \exp\{-c_1 (B/\varepsilon)^{2/(2\alpha_0+1)}\}
\]
holds uniformly in $\theta_0 \in \mathcal{E}(\alpha_0,B)$, provided that $\varepsilon/B$ is smaller than one.

The fifth lemma ensures a high-probability set on which the likelihood ratio of the marginal distribution and the true distribution is bounded below. We denote by $\widetilde{\mathrm{S}}_M(\cdot \mid \alpha)$ the restriction of $\mathrm{S}_M(\cdot \mid \alpha)$ onto $\{\theta : \|\theta-\theta_0\|^2/\varepsilon^2 \le 2(B/\varepsilon)^{2/(2\alpha_0+1)}\}$:
\[
\widetilde{\mathrm{S}}_M(A \mid \alpha) := \frac{\mathrm{S}_M(A \mid \alpha)}{\mathrm{S}_M(\{\theta : \|\theta-\theta_0\|^2/\varepsilon^2 \le 2(B/\varepsilon)^{2/(2\alpha_0+1)}\} \mid \alpha)}
\]
for a Borel set $A$ in $\ell^2 \cap \{\theta : \|\theta-\theta_0\|^2/\varepsilon^2 \le 2(B/\varepsilon)^{2/(2\alpha_0+1)}\}$. Let
\[
H(\theta_0) := \Big\{ X : \log \int \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}(X)\,\mathrm{d}\widetilde{\mathrm{S}}_M(\theta \mid \alpha) \ge -2(B/\varepsilon)^{2/(2\alpha_0+1)} \Big\}.
\]

Lemma 5. For every $\theta_0 \in \mathcal{E}(\alpha_0,B)$, the inequality
\[
\mathrm{E}_{\theta_0,\varepsilon}[1_{H^{\mathrm{c}}(\theta_0)}(X)] \le \exp\{-(1/4)(B/\varepsilon)^{2/(2\alpha_0+1)}\}
\]
holds.

5.2. Proof of Theorem 3

The proof assumes that $C$ is a positive integer; if $C$ is not an integer, we replace $C$ with $\lfloor C \rfloor$. The values of $C$ in Theorem 3 and $c$ in Lemma 1 are provided in (21) below. Take $\theta_0$ arbitrarily in $\mathcal{E}(\alpha_0,B)$.
Recall the equality
\[
\{\theta : \|\theta-\theta_0\|^2/B^2 \ge C(\varepsilon/B)^{4\alpha_0/(2\alpha_0+1)}\} = \{\theta : \|\theta-\theta_0\|^2/\varepsilon^2 \ge C(B/\varepsilon)^{2/(2\alpha_0+1)}\}.
\]
The expectation of the tail probability of the posterior is divided as follows:
\begin{align}
&\mathrm{E}_{\theta_0,\varepsilon}[\mathrm{S}_M(\|\theta-\theta_0\|^2/\varepsilon^2 \ge C(B/\varepsilon)^{2/(2\alpha_0+1)} \mid X, \alpha)] \notag\\
&\quad= \mathrm{E}_{\theta_0,\varepsilon}[1_{H(\theta_0)}(X)\,\mathrm{S}_M(\|\theta-\theta_0\|^2/\varepsilon^2 \ge C(B/\varepsilon)^{2/(2\alpha_0+1)} \mid X,\alpha)] \notag\\
&\qquad+ \mathrm{E}_{\theta_0,\varepsilon}[1_{H^{\mathrm{c}}(\theta_0)}(X)\,\mathrm{S}_M(\|\theta-\theta_0\|^2/\varepsilon^2 \ge C(B/\varepsilon)^{2/(2\alpha_0+1)} \mid X,\alpha)]. \tag{11}
\end{align}
From Lemma 5, and because a probability is bounded above by one, the latter term on the right-hand side of (11) is bounded as
\[
\mathrm{E}_{\theta_0,\varepsilon}[1_{H^{\mathrm{c}}(\theta_0)}(X)\,\mathrm{S}_M(\|\theta-\theta_0\|^2/\varepsilon^2 \ge C(B/\varepsilon)^{2/(2\alpha_0+1)} \mid X,\alpha)] \le \exp\{-(1/4)(B/\varepsilon)^{2/(2\alpha_0+1)}\}. \tag{12}
\]
We next bound the former term on the right-hand side of (11). From Bayes' theorem, we have
\[
\mathrm{E}_{\theta_0,\varepsilon}[1_{H(\theta_0)}(X)\,\mathrm{S}_M(\|\theta-\theta_0\|^2/\varepsilon^2 \ge C(B/\varepsilon)^{2/(2\alpha_0+1)} \mid X,\alpha)]
= \mathrm{E}_{\theta_0,\varepsilon}\Bigg[ 1_{H(\theta_0)}(X)\, \frac{\int_{\|\theta-\theta_0\|^2/\varepsilon^2 \ge C(B/\varepsilon)^{2/(2\alpha_0+1)}} \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}(X)\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha)}{\int \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}(X)\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha)} \Bigg]. \tag{13}
\]
Consider the numerator. Letting $\{\bar\theta^{(j,k)} : k = 1, \ldots, N(j;c)\}$ be an $(\varepsilon/2)\sqrt{j(B/\varepsilon)^{2/(2\alpha_0+1)}}$-net of $R(j;c)$, Lemma 3 yields sequences of measurable functions $\psi_{j,k}$ such that for each $k$ we have
\[
\mathrm{E}_{\theta_0,\varepsilon}[\psi_{j,k}(X)] \le \exp\{-(j/8)(B/\varepsilon)^{2/(2\alpha_0+1)}\} \tag{14}
\]
and
\[
\sup_{\theta : \|\theta - \bar\theta^{(j,k)}\| < (\varepsilon/2)\sqrt{j(B/\varepsilon)^{2/(2\alpha_0+1)}}} \mathrm{E}_{\theta,\varepsilon}[1 - \psi_{j,k}(X)] \le \exp\{-(j/8)(B/\varepsilon)^{2/(2\alpha_0+1)}\}. \tag{15}
\]
Letting $U(\bar\theta^{(j,k)})$ be the $(\varepsilon/2)\sqrt{j(B/\varepsilon)^{2/(2\alpha_0+1)}}$-ball around $\bar\theta^{(j,k)}$, and using the sequences $\{\psi_{j,k}\}$ and the balls $\{U(\bar\theta^{(j,k)})\}$, we have, for $X \in H(\theta_0)$,
\begin{align*}
\int_{\|\theta-\theta_0\|^2/\varepsilon^2 \ge C(B/\varepsilon)^{2/(2\alpha_0+1)}} \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}(X)\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha)
&\le \sum_{j=C}^{\infty}\sum_{k=1}^{N(j;c)} \int_{U(\bar\theta^{(j,k)})} (1-\psi_{j,k}(X))\, \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}(X)\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha) \\
&\quad+ \sum_{j=C}^{\infty}\sum_{k=1}^{N(j;c)} \int_{U(\bar\theta^{(j,k)})} \psi_{j,k}(X)\, \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}(X)\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha) \\
&\quad+ \int_{E_c^{\mathrm{c}}(\theta_0)} \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}(X)\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha).
\end{align*}
From the above inequality, it follows that
\[
\mathrm{E}_{\theta_0,\varepsilon}[1_{H(\theta_0)}(X)\,\mathrm{S}_M(\|\theta-\theta_0\|^2/\varepsilon^2 \ge C(B/\varepsilon)^{2/(2\alpha_0+1)} \mid X,\alpha)] \le T_1 + T_2 + T_3, \tag{16}
\]
where
\[
T_1 := \mathrm{E}_{\theta_0,\varepsilon}\Bigg[ 1_{H(\theta_0)}\, \frac{\sum_{j=C}^{\infty}\sum_{k=1}^{N(j;c)} \int_{U(\bar\theta^{(j,k)})} (1-\psi_{j,k})\, \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha)}{\int \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha)} \Bigg],
\]
\[
T_2 := \mathrm{E}_{\theta_0,\varepsilon}\Bigg[ 1_{H(\theta_0)}\, \frac{\sum_{j=C}^{\infty}\sum_{k=1}^{N(j;c)} \int_{U(\bar\theta^{(j,k)})} \psi_{j,k}\, \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha)}{\int \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha)} \Bigg],
\]
and
\[
T_3 := \mathrm{E}_{\theta_0,\varepsilon}\Bigg[ 1_{H(\theta_0)}(X)\, \frac{\int_{E_c^{\mathrm{c}}(\theta_0)} \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}(X)\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha)}{\int \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}(X)\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha)} \Bigg].
\]
Providing upper bounds on $T_1$, $T_2$, and $T_3$ will complete the proof.

Consider an upper bound on $T_1$ in (16). In bounding $T_1$, we use the following lower bound on $\int \{\mathrm{d}P_{\theta,\varepsilon}/\mathrm{d}P_{\theta_0,\varepsilon}\}(X)\,\mathrm{d}\mathrm{S}_M(\theta \mid \alpha)$. From the definition of $H(\theta_0)$ and from Lemma 4, for $X \in H(\theta_0)$, we have
\begin{align}
\int \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}(X)\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha)
&\ge \int_{\|\theta-\theta_0\|^2/\varepsilon^2 \le 2(B/\varepsilon)^{2/(2\alpha_0+1)}} \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}(X)\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha) \notag\\
&= \mathrm{S}_M(\|\theta-\theta_0\|^2/\varepsilon^2 \le 2(B/\varepsilon)^{2/(2\alpha_0+1)}) \int \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}(X)\,\mathrm{d}\widetilde{\mathrm{S}}_M(\theta\mid\alpha) \notag\\
&\ge \exp\{-(c_1+2)(B/\varepsilon)^{2/(2\alpha_0+1)}\}. \tag{17}
\end{align}
From the above inequality, from Fubini's theorem, from Lemmas 2 and 3, and from the bound $\sum_{j\ge C}\exp\{-(j/8)x\} \le \exp\{(3-C/8)x\}$ for $x \ge 1$, we have
\begin{align}
T_1 &\le \exp\{(c_1+2)(B/\varepsilon)^{2/(2\alpha_0+1)}\}\, \mathrm{E}_{\theta_0,\varepsilon}\Big[ \sum_{j=C}^{\infty}\sum_{k=1}^{N(j;c)} \int_{U(\bar\theta^{(j,k)})} (1-\psi_{j,k})\, \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha) \Big] \notag\\
&\le \exp\{(c_1+2)(B/\varepsilon)^{2/(2\alpha_0+1)}\} \sum_{j=C}^{\infty} N(j;c)\exp\{-(j/8)(B/\varepsilon)^{2/(2\alpha_0+1)}\} \notag\\
&\le \exp\{(2c + c_1 + 6 - C/8)(B/\varepsilon)^{2/(2\alpha_0+1)}\}. \tag{18}
\end{align}
Consider an upper bound on $T_2$ in (16). Since the ratio inside the expectation is bounded by one, we have
\begin{align}
T_2 &\le \mathrm{E}_{\theta_0,\varepsilon}\Big[ 1_{H(\theta_0)}(X) \sum_{j=C}^{\infty}\sum_{k=1}^{N(j;c)} \psi_{j,k}(X) \Big]
\le \sum_{j=C}^{\infty}\sum_{k=1}^{N(j;c)} \exp\{-(j/8)(B/\varepsilon)^{2/(2\alpha_0+1)}\} \notag\\
&\le \exp\{(2c + 3 - C/8)(B/\varepsilon)^{2/(2\alpha_0+1)}\}. \tag{19}
\end{align}
Here, the second inequality follows from Lemma 3, and the third from Lemma 2.

For an upper bound on $T_3$ in (16), it follows that
\begin{align}
T_3 &\le \exp\{(c_1+2)(B/\varepsilon)^{2/(2\alpha_0+1)}\}\, \mathrm{E}_{\theta_0,\varepsilon} \int_{E_c^{\mathrm{c}}(\theta_0)} \frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}\,\mathrm{d}\mathrm{S}_M(\theta\mid\alpha) \notag\\
&\le \exp\{(c_1+2)(B/\varepsilon)^{2/(2\alpha_0+1)}\}\, \mathrm{S}_M(E_c^{\mathrm{c}}(\theta_0)\mid\alpha)
\le \exp\{(c_1 + 2 + \eta - \eta c)(B/\varepsilon)^{2/(2\alpha_0+1)}\}. \tag{20}
\end{align}
The first inequality follows from (17), the second from Fubini's theorem, and the third from Lemma 1.

Thus, using (12), (18), (19), and (20) as upper bounds on (11), and taking $c$ and $C$ such that
\[
c_1 + 2 + \eta - \eta c < 0, \quad 2c + 3 - C/8 < 0, \quad \text{and} \quad 2c + c_1 + 6 - C/8 < 0, \tag{21}
\]
we complete the proof.

5.3. Proof of Theorem 1

Replacing Lemmas 1 and 4 by Lemmas 6 and 7, respectively, completes the proof, because the other lemmas used in the proof of Theorem 3 do not depend on the prior distribution. The proofs of Lemmas 6 and 7 are provided in Appendix B.

Lemma 6. For any $c > 0$, the inequality
\[
\Pi(E_c^{\mathrm{c}}(\theta_0)) \le \exp\{-\eta(c-1)(B/\varepsilon)^{2/(2\alpha_0+1)}\}
\]
holds uniformly in $\theta_0 \in \mathcal{E}(\alpha_0,B)$. Here, $\eta$ is the hyperparameter of $M$.

Lemma 7. There exists a positive constant $c_2$ depending only on $\alpha_0$, $\eta$ of $M$, and $\gamma$ of $F$ for which the inequality
\[
\Pi(\theta : \|\theta-\theta_0\|^2/\varepsilon^2 \le 2(B/\varepsilon)^{2/(2\alpha_0+1)}) \ge \exp\{-c_2(B/\varepsilon)^{2/(2\alpha_0+1)}\}
\]
holds uniformly in $\theta_0 \in \mathcal{E}(\alpha_0,B)$, provided that $\varepsilon/B$ is smaller than one.

5.4. Proof of Corollary 4

It suffices to derive a risk bound for the posterior mean $\hat\theta_{\Pi^{(p)}}$ in estimation of $(\theta_1, \ldots, \theta_p)$ based on independent observations $X_i$ from $\mathrm{N}(\theta_i, 1/n)$, $i = 1, \ldots, p$. The reason is as follows. For $i \in \mathbb{N}$, let $\theta_i = \int_0^1 f\phi_i\,\mathrm{d}t$. We focus on $\sum_{i=1}^{p}(\theta_i - \hat\theta_{\Pi^{(p)},i})^2$ instead of $\|f - \hat f_{\Pi^{(p)}}\|_{L^2}^2$ since, by the Parseval equality,
\[
\|f - \hat f_{\Pi^{(p)}}\|_{L^2}^2 = \sum_{i=1}^{p}(\theta_i - \hat\theta_{\Pi^{(p)},i})^2 + \sum_{i=p+1}^{\infty}\theta_i^2.
\]
The bias term appearing in the true distribution of $\{Y_1, \ldots, Y_n\}$ due to $\{\phi_j(k/n) : j = p+1, p+2, \ldots,\ k = 1, \ldots, n\}$ is negligible by a sufficiency reduction, because the vectors $\{(\phi_i(1/n), \ldots, \phi_i(1))^{\top}/\sqrt{n} : i = 1, \ldots, p\}$ are orthonormal in $\mathbb{R}^n$, provided that $p < n$.

Let $p_* := \min\{p, (nB^2)^{1/(2\alpha_0+1)}\}$. To complete the proof, it suffices to show that, for a sufficiently large $C$ depending only on $\alpha_0$, $\eta$, and $\gamma$, the inequality
\[
\sup_{\theta_0^{(p)} :\, \sum_{i=1}^{p} i^{2\alpha_0}\theta_{0,i}^2 \le B^2}\; \mathrm{E}_{\theta_0^{(p)}}\, \Pi^{(p)}\Big( n\sum_{i=1}^{p}\{\theta_i - \theta^{(p)}_{0,i}\}^2 \ge C p_* \,\Big|\, X \Big) \le \exp\{-c p_*\}
\]
holds, where $c$ is a constant depending only on $\alpha_0$, $\eta$, and $\gamma$. This is proved as follows. If $p$ is larger than $c_0(nB^2)^{1/(2\alpha_0+1)}$ for the constant $c_0$ appearing in the proof of Theorem 3, then the same proof as that of Theorem 1 applies. Consider the case in which $p$ is smaller than $c_0(nB^2)^{1/(2\alpha_0+1)}$. In this case, we replace $(nB^2)^{1/(2\alpha_0+1)}$ by $p$.
This replacement does not change the conclusion, as discussed below. Lemmas 3 and 5 do not change, because their proofs rely only on properties of the Gaussian measure. Lemma 2 still holds, because the logarithm of the covering number is bounded by $2p$. Lemma 6 obviously holds for $\Pi^{(p)}$, because $E_c(\theta_0)$ is $\mathbb{R}^p$ itself in the case in which $p < c_0(nB^2)^{1/(2\alpha_0+1)}$. Lemma 7 holds for $\mathrm{S}^{(p)}_M(\cdot\mid\alpha)$, because the lemma required for the proof of Lemma 7 (Lemma 8) is still available. This completes the proof.

6. Discussion

We propose two principal future studies. The first is to find a means of attaining non-asymptotic adaptation in the empirical Bayesian manner. Recently, Petrone et al. [30] and Rousseau and Szabó [34] established important asymptotic results on the performance of empirical Bayesian nonparametrics. Focusing on a simple setting, we expect to be able to answer whether there exists an empirical Bayesian method attaining non-asymptotic adaptation. The investigation would also provide insight into the relationships among Bayesian nonparametrics, empirical Bayesian nonparametrics, model selection, and frequentist model averaging. The second is to investigate non-asymptotic Bayesian adaptation in other settings. The present paper used a Gaussian infinite sequence model under Sobolev-type parameter constraints for simplicity and for clarity of presentation. One possible extension of our work would be to investigate non-asymptotic Bayesian adaptation in density estimation. Resolution of these problems will increase our understanding of nonparametric estimation.

Appendix A: Proof for Section 2

In this appendix, we provide the proof of Proposition 2 in the case of the model averaging estimator. To apply the argument in Section 7.B of [27], slight modifications to the loss functions and the weights are necessary, because Leung and Barron [27] use the finite-dimensional $\ell^2$ loss function and assume that the number of models is finite. Although these modifications are straightforward, we provide them for the sake of completeness.

Proof. Let $D := \lfloor (B/\varepsilon)^{2/(2\alpha_0+1)} \rfloor$. The $\ell^2$ risk $\mathrm{E}_{\theta,\varepsilon}\|\theta - \hat\theta_{\mathrm{MA},\beta}\|^2$ is decomposed as
\[
\mathrm{E}_{\theta,\varepsilon}\|\theta - \hat\theta_{\mathrm{MA},\beta}\|^2 = \mathrm{E}_{\theta,\varepsilon}\sum_{i=1}^{D}(\theta_i - \hat\theta_{\mathrm{MA},\beta,i})^2 + \mathrm{E}_{\theta,\varepsilon}\sum_{i=D+1}^{\infty}(\theta_i - \hat\theta_{\mathrm{MA},\beta,i})^2. \tag{22}
\]
First, we show that the latter term on the right-hand side of (22) is $\mathrm{O}(D\varepsilon^2)$. The latter term is bounded above as
\[
\mathrm{E}_{\theta,\varepsilon}\sum_{i=D+1}^{\infty}(\theta_i - \hat\theta_{\mathrm{MA},\beta,i})^2
\le 2\sum_{i=D+1}^{\infty}\theta_i^2 + 2\,\mathrm{E}_{\theta,\varepsilon}\sum_{i=D+1}^{\infty}\hat\theta_{\mathrm{MA},\beta,i}^2
\le 2\sum_{i=D+1}^{\infty}\theta_i^2 + 4\sqrt{3}\,\varepsilon^2 \sum_{i=D+1}^{\infty}\Big\{\mathrm{E}_{\theta,\varepsilon}\Big(\sum_{d=i}^{\infty}w_d\Big)^2\Big\}^{1/2},
\]
where the second inequality follows from the fact that $\hat\theta_{\mathrm{MA},\beta,i} = \sum_{d=1}^{\infty}w_d X_i 1_{i\le d}$ and from the Cauchy–Schwarz inequality. Consider an upper bound on the latter term in the above inequality. Note that the inequality
\[
\sum_{d=i}^{\infty}w_d \le \sum_{d=i}^{\infty}\exp\Big[\{\beta/(2\varepsilon^2)\}\Big\{\sum_{j=D+1}^{d}X_j^2 - 2\varepsilon^2(d-D)\Big\}\Big]
\]
holds.
For $i \ge D$ and for some universal constant $s > 0$, we have
\[
\sum_{d=i}^{\infty}\Pr\Big(\sum_{j=D+1}^{d}X_j^2 - 2\varepsilon^2(d-D) > -\varepsilon^2(d-D)/2\Big) \le \exp\{-s(i-D)\},
\]
where we use the inequality $xy \le x^2/2 + y^2/2$ for $x, y > 0$, the bound $\sum_{j=D+1}^{\infty}\theta_j^2/\varepsilon^2 \le D^{2\alpha_0+1}D^{-2\alpha_0} = D$, the fact that $d > D$, and the Borell–Sudakov–Tsirelson Gaussian concentration inequality. Therefore, there exists a universal positive constant $s'$ for which
\[
\mathrm{E}_{\theta,\varepsilon}\sum_{i=D+1}^{\infty}(\theta_i - \hat\theta_{\mathrm{MA},\beta,i})^2 \le \varepsilon^2\{D + 100\sqrt{D} + 4\sqrt{3}\exp(-s'D)\}.
\]
Second, we show that the former term on the right-hand side of (22) is bounded as
\[
\mathrm{E}_{\theta,\varepsilon}\sum_{i=1}^{D}(\theta_i - \hat\theta_{\mathrm{MA},\beta,i})^2 \le \mathrm{E}_{\theta,\varepsilon}\sum_{d=1}^{D}\tilde w^{(D)}_d \hat r^{(D)}_d + D\varepsilon^2,
\]
where, for $d = 1, \ldots, D$, $\tilde w^{(D)}_d := w_d/\sum_{d'=1}^{D}w_{d'}$ and, for $d = 1, 2, \ldots$, $\hat r^{(D)}_d := \sum_{i=1}^{D}(X_i - X_i 1_{i\le d})^2 - D\varepsilon^2 + 2\min\{D, d\}\varepsilon^2$. Let
\[
\hat r^{(D)} := \sum_{d=1}^{\infty}w_d\Big[\hat r^{(D)}_d - (1-\beta)\sum_{i=1}^{D}(X_i 1_{i\le d} - \hat\theta_{\mathrm{MA},\beta,i})^2\Big].
\]
Noting that $\hat r^{(D)}_d$ is an unbiased risk estimator of $\mathrm{E}_{\theta,\varepsilon}[\sum_{i=1}^{D}(X_i 1_{i\le d} - \theta_i)^2]$ and that $\hat r^{(D)}$ is an unbiased risk estimator of $\mathrm{E}_{\theta,\varepsilon}[\sum_{i=1}^{D}(\hat\theta_{\mathrm{MA},\beta,i} - \theta_i)^2]$, we have
\[
\mathrm{E}_{\theta,\varepsilon}\sum_{i=1}^{D}(\theta_i - \hat\theta_{\mathrm{MA},\beta,i})^2 = \mathrm{E}_{\theta,\varepsilon}\hat r^{(D)} \le \mathrm{E}_{\theta,\varepsilon}\sum_{d=1}^{\infty}w_d \hat r^{(D)}_d,
\]
where we use the condition that $\beta \le 1/2$. Thus, we obtain
\[
\mathrm{E}_{\theta,\varepsilon}\sum_{i=1}^{D}(\theta_i - \hat\theta_{\mathrm{MA},\beta,i})^2 \le \mathrm{E}_{\theta,\varepsilon}\sum_{d=1}^{D}\tilde w^{(D)}_d \hat r^{(D)}_d + D\varepsilon^2.
\]
Finally, since
\[
\sum_{d=1}^{D}\tilde w^{(D)}_d \hat r^{(D)}_d \le \min_{d=1,\ldots,D}\hat r^{(D)}_d + (2\varepsilon^2/\beta)\log D,
\]
applying the argument in Section 7.B of [27] completes the proof.

Appendix B: Proofs of lemmas in Section 5

In this appendix, we provide the proofs of Lemmas 1, 4, 5, 6, and 7. For the proof of Lemma 2, see Proposition A.1 in [16]. For the proof of Lemma 3, see Lemma 5 in [19].

Proof of Lemma 1. Let $\bar D = \lfloor c(B/\varepsilon)^{2/(2\alpha_0+1)} \rfloor$. From the definition of $\mathrm{S}_M(\cdot\mid\alpha)$,
\begin{align*}
\mathrm{S}_M(E_c^{\mathrm{c}}(\theta_0)\mid\alpha)
&= \sum_{d=1}^{\bar D} M(d)\,\mathrm{S}\Big(\sum_{i>\bar D}(\theta_i-\theta_{0,i})^2/\varepsilon^2 > (B/\varepsilon)^{2/(2\alpha_0+1)} \,\Big|\, d, \alpha\Big) \\
&\quad+ \sum_{d=\bar D+1}^{\infty} M(d)\,\mathrm{S}\Big(\sum_{i>\bar D}(\theta_i-\theta_{0,i})^2/\varepsilon^2 > (B/\varepsilon)^{2/(2\alpha_0+1)} \,\Big|\, d, \alpha\Big).
\end{align*}
The first term on the right-hand side vanishes, because the identity
\[
\mathrm{S}\Big(\sum_{i>\bar D}(\theta_i-\theta_{0,i})^2/\varepsilon^2 > (B/\varepsilon)^{2/(2\alpha_0+1)} \,\Big|\, d, \alpha\Big) = 0
\]
holds for $d \le \bar D$, since $\sum_{i>\bar D}\theta_{0,i}^2/\varepsilon^2 \le (1/c^{2\alpha_0})(B/\varepsilon)^{2/(2\alpha_0+1)} < (B/\varepsilon)^{2/(2\alpha_0+1)}$. The second term is bounded by $\exp\{(\eta - \eta c)(B/\varepsilon)^{2/(2\alpha_0+1)}\}$. This completes the proof.

Proof of Lemma 4. The proof relies on the following lemma. Let $\{N_i\}_{i=1}^{\infty}$ be an independent sequence from the standard Gaussian distribution.

Lemma 8. For each $\alpha > 0$, there exists a positive constant $c_3$ depending only on $\alpha$ such that, for a sufficiently large $d \in \mathbb{N}$ and for each $(v_1, \ldots, v_d) \in \mathbb{R}^d$, the inequality
\[
\Pr\Big(\sum_{i=1}^{d}(i^{-\alpha-1/2}N_i - v_i)^2 \le d^{-2\alpha}\Big) \ge \exp\Big\{-\sum_{i=1}^{d}i^{2\alpha+1}v_i^2/2 - c_3 d\Big\}
\]
holds.

The proof of the lemma is provided in Appendix C for the sake of completeness.

Return to the proof of Lemma 4. Let $T = \lfloor (B/\varepsilon)^{2/(2\alpha_0+1)} \rfloor$. From Lemma 8 with $d = T$ and $v_i = \theta_{0,i}/(\varepsilon T^{\alpha+1/2})$, $i = 1, \ldots, d$,
\begin{align}
\mathrm{S}(\|\theta-\theta_0\|^2/\varepsilon^2 \le 2T \mid \alpha, d = T)
&\ge \Pr\Big(\sum_{i=1}^{T}(\varepsilon T^{\alpha+1/2} i^{-\alpha-1/2} N_i - \theta_{0,i})^2/\varepsilon^2 \le T\Big) \notag\\
&\ge \exp\Big\{-\sum_{i=1}^{T} i^{2\alpha+1}\theta_{0,i}^2/(2\varepsilon^2 T^{2\alpha+1}) - c_3 T\Big\} \ge \exp\{-(c_3+1)T\}. \tag{23}
\end{align}
Here, the first inequality follows because $\sum_{i=T+1}^{\infty}\theta_{0,i}^2/\varepsilon^2 \le T$.
The second inequality follows from Lemma 8. The last inequality follows because
\[
\sum_{i=1}^{T} i^{2\alpha+1}\theta_{0,i}^2\big/\varepsilon^2 = \sum_{i=1}^{T} i^{2(\alpha-\alpha_0)+1}\, i^{2\alpha_0}\theta_{0,i}^2\big/\varepsilon^2 \le T^{2\alpha+2},
\]
where we use the condition that $2(\alpha-\alpha_0)+1 \ge 0$. Since $M(T) \ge \exp\{-(\eta+1)T\}$ for $\eta > 0$, we obtain
\[
\mathrm{S}_M(\|\theta-\theta_0\|^2/\varepsilon^2 \le 2T \mid \alpha) \ge M(T)\,\mathrm{S}(\|\theta-\theta_0\|^2/\varepsilon^2 \le 2T \mid d = T, \alpha) \ge \exp\{-(c_3+2+\eta)T\}.
\]
This completes the proof.

Proof of Lemma 5. By Jensen's inequality,
\begin{align*}
P_{\theta_0,\varepsilon}[H(\theta_0)]
&= P_{\theta_0,\varepsilon}\Big[\log\int\frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}\,\mathrm{d}\widetilde{\mathrm{S}}_M(\theta\mid\alpha) \ge -2(B/\varepsilon)^{2/(2\alpha_0+1)}\Big] \\
&\ge P_{\theta_0,\varepsilon}\Big[\int\log\frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}\,\mathrm{d}\widetilde{\mathrm{S}}_M(\theta\mid\alpha) \ge -2(B/\varepsilon)^{2/(2\alpha_0+1)}\Big].
\end{align*}
Since $\widetilde{\mathrm{S}}_M(\cdot\mid\alpha)$ is supported on $\{\theta : \|\theta-\theta_0\|^2/\varepsilon^2 \le 2(B/\varepsilon)^{2/(2\alpha_0+1)}\}$, we have
\begin{align*}
\int\log\frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{\theta_0,\varepsilon}}\,\mathrm{d}\widetilde{\mathrm{S}}_M(\theta\mid\alpha)
&= \int\Big\{\frac{\langle (X-\theta_0), (\theta-\theta_0)\rangle}{\varepsilon^2} - \frac{\|\theta-\theta_0\|^2}{2\varepsilon^2}\Big\}\,\mathrm{d}\widetilde{\mathrm{S}}_M(\theta\mid\alpha) \\
&\ge \int\frac{\langle (X-\theta_0), (\theta-\theta_0)\rangle}{\varepsilon^2}\,\mathrm{d}\widetilde{\mathrm{S}}_M(\theta\mid\alpha) - (B/\varepsilon)^{2/(2\alpha_0+1)},
\end{align*}
where $\langle\cdot,\cdot\rangle$ is the $\ell^2$ inner product. Thus, letting $N$ be a one-dimensional standard normal random variable yields
\begin{align*}
P_{\theta_0,\varepsilon}[H^{\mathrm{c}}(\theta_0)]
&\le P_{\theta_0,\varepsilon}\Big[\int\frac{\langle(X-\theta_0),(\theta-\theta_0)\rangle}{\varepsilon^2}\,\mathrm{d}\widetilde{\mathrm{S}}_M(\theta\mid\alpha) < -(B/\varepsilon)^{2/(2\alpha_0+1)}\Big] \\
&\le \Pr\big[\sqrt{2(B/\varepsilon)^{2/(2\alpha_0+1)}}\, |N| > (B/\varepsilon)^{2/(2\alpha_0+1)}\big]
\le \exp\{-(1/4)(B/\varepsilon)^{2/(2\alpha_0+1)}\},
\end{align*}
where the second inequality holds because the integral is Gaussian with standard deviation at most $\int(\|\theta-\theta_0\|/\varepsilon)\,\mathrm{d}\widetilde{\mathrm{S}}_M(\theta\mid\alpha) \le \sqrt{2(B/\varepsilon)^{2/(2\alpha_0+1)}}$. Here, for the last inequality, we use $\Pr(|N| > r) \le \exp(-r^2/2)$ for $r > 0$.

Proof of Lemma 6. This follows immediately from the identity $\Pi = \sum_{k=1}^{\infty}F(k)\,\mathrm{S}_M(\cdot\mid\alpha=k)$ and from Lemma 1.

Proof of Lemma 7. Take an integer $k \ge \alpha_0 - 1/2$ and let $T = \lfloor(B/\varepsilon)^{2/(2\alpha_0+1)}\rfloor$. Then,
\begin{align*}
\Pi(\|\theta-\theta_0\|^2/\varepsilon^2 \le 2T)
&\ge F(k)\,\mathrm{S}_M(\|\theta-\theta_0\|^2/\varepsilon^2 \le 2T \mid \alpha = k) \\
&\ge F(k)\,M(T)\,\mathrm{S}(\|\theta-\theta_0\|^2/\varepsilon^2 \le 2T \mid \alpha = k, d = T) \\
&\ge \exp\{-1-\gamma k - (c_3+2+\eta)T\},
\end{align*}
where $c_3$ is the positive constant depending only on $k$ appearing in Lemma 8, and the last inequality follows from the proof of Lemma 4. This completes the proof.

Appendix C: Proof of Lemma 8

In this appendix, for the sake of completeness, we provide the proof of an important inequality for estimating the small ball probability (Lemma 8).

Proof. Since the distributions of $N$ and $-N$ are identical, we have
\begin{align*}
&\Pr\Big(\sum_{i=1}^{d}(i^{-\alpha-1/2}N_i - v_i)^2 \le d^{-2\alpha}\Big) \\
&= \frac{1}{2}\Pr\Big(\sum_{i=1}^{d}(i^{-\alpha-1/2}N_i - v_i)^2 \le d^{-2\alpha}\Big) + \frac{1}{2}\Pr\Big(\sum_{i=1}^{d}(i^{-\alpha-1/2}N_i + v_i)^2 \le d^{-2\alpha}\Big) \\
&= \int_{\sum_{i=1}^{d}i^{-2\alpha-1}x_i^2 \le d^{-2\alpha}} \cosh\Big(\sum_{i=1}^{d}i^{\alpha+1/2}x_i v_i\Big)\, \frac{\exp\{-\sum_{i=1}^{d}x_i^2/2 - \sum_{i=1}^{d}i^{2\alpha+1}v_i^2/2\}}{(2\pi)^{d/2}}\,\mathrm{d}x \\
&\ge \exp\Big\{-\sum_{i=1}^{d}i^{2\alpha+1}v_i^2/2\Big\} \int_{\sum_{i=1}^{d}i^{-2\alpha-1}x_i^2 \le d^{-2\alpha}} \frac{\exp\{-\sum_{i=1}^{d}x_i^2/2\}}{(2\pi)^{d/2}}\,\mathrm{d}x.
\end{align*}
Second, we show that there exists a positive constant $c_3$ depending only on $\alpha$ such that, for $d \in \mathbb{N}$, the inequality
\[
\int_{\sum_{i=1}^{d}i^{-2\alpha-1}x_i^2 \le d^{-2\alpha}} \frac{\exp\{-\sum_{i=1}^{d}x_i^2/2\}}{(2\pi)^{d/2}}\,\mathrm{d}x \ge \exp\{-c_3 d\}
\]
holds. Changing variables yields
\begin{align*}
\int_{\sum_{i=1}^{d}i^{-2\alpha-1}x_i^2 \le d^{-2\alpha}} \frac{\exp\{-\sum_{i=1}^{d}x_i^2/2\}}{(2\pi)^{d/2}}\,\mathrm{d}x
&\ge \{\Gamma(d+1)\}^{\alpha+1/2} \int_{\sum_{i=1}^{d}y_i^2 \le d^{-2\alpha}} \frac{\exp\{-d^{2\alpha+1}\sum_{i=1}^{d}y_i^2/2\}}{(2\pi)^{d/2}}\,\mathrm{d}y \\
&\ge \{\Gamma(d+1)\}^{\alpha+1/2}\, \frac{\exp(-d/2)}{(2\pi)^{d/2}} \int_{\sum_{i=1}^{d}y_i^2 \le d^{-2\alpha}}\,\mathrm{d}y.
\end{align*}
Since $\int_{\sum_{i=1}^{d}y_i^2 \le d^{-2\alpha}}\mathrm{d}y = d^{-d\alpha}\pi^{d/2}/\Gamma(d/2+1)$, we have
\[
\int_{\sum_{i=1}^{d}i^{-2\alpha-1}x_i^2 \le d^{-2\alpha}} \frac{\exp\{-\sum_{i=1}^{d}x_i^2/2\}}{(2\pi)^{d/2}}\,\mathrm{d}x
\ge \big[\{\Gamma(d+1)\}^{\alpha+1/2}\big/\{d^{d\alpha}\,\Gamma(d/2+1)\}\big]\exp(-\tilde c_1 d).
\]
Here, it follows from Stirling's formula that there exist positive constants $\tilde c_2$ and $\tilde c_3$ depending only on $\alpha$ such that for $d \in \mathbb{N}$ the inequalities
\[
\{\Gamma(d+1)\}^{\alpha+1/2} \ge \exp\{(\alpha+1/2)(d+1/2)\log d - \tilde c_2 d\}, \quad
d^{d\alpha}\,\Gamma(d/2+1) \le \exp\{(\alpha+1/2)d\log d + \tilde c_3 d\}
\]
hold, and thus we obtain
\[
\int_{\sum_{i=1}^{d}i^{-2\alpha-1}x_i^2\le d^{-2\alpha}} \frac{\exp\{-\sum_{i=1}^{d}x_i^2/2\}}{(2\pi)^{d/2}}\,\mathrm{d}x \ge \exp\{-(\tilde c_1+\tilde c_2+\tilde c_3)d\}.
\]

Appendix D: The explicit form of the posterior

In this appendix, we provide an explicit form of the posterior of $\Pi$, which is useful when conducting numerical experiments. The posterior of $\Pi$ is given by
\[
\Pi(\cdot \mid x) = \sum_{k=1}^{\infty} F(k \mid x) \sum_{d=1}^{\infty} M(d \mid x, k)\,\mathrm{S}(\cdot \mid x, d, k),
\]
where $\mathrm{S}(\cdot \mid x, d, k)$ is given by
\[
\Big[\bigotimes_{i=1}^{d}\mathrm{N}\Big(\Big(1 - \frac{1}{(d/i)^{2k+1}+1}\Big)x_i,\ \varepsilon^2\Big(1 - \frac{1}{(d/i)^{2k+1}+1}\Big)\Big)\Big] \otimes \Big[\bigotimes_{i=d+1}^{\infty}\mathrm{N}(0,0)\Big],
\]
$M(\cdot \mid x, k)$ is given by
\[
M(d \mid x, k) \propto M(d) \prod_{i=1}^{d}\Big\{1+\Big(\frac{d}{i}\Big)^{2k+1}\Big\}^{-1/2} \exp\Big(\sum_{i=1}^{d}\frac{x_i^2}{2\varepsilon^2}\,\frac{(d/i)^{2k+1}}{(d/i)^{2k+1}+1}\Big),
\]
and $F(\cdot \mid x)$ is given by
\[
F(k \mid x) \propto F(k) \sum_{d=1}^{\infty} M(d) \prod_{i=1}^{d}\Big\{1+\Big(\frac{d}{i}\Big)^{2k+1}\Big\}^{-1/2} \exp\Big(\sum_{i=1}^{d}\frac{x_i^2}{2\varepsilon^2}\,\frac{(d/i)^{2k+1}}{(d/i)^{2k+1}+1}\Big).
\]
Here we omit the normalizing constants.

The derivation of the posterior form is as follows. Letting $\mu^{(d)}$ be the product of the $d$-dimensional Lebesgue measure and $\bigotimes_{i=d+1}^{\infty}\mathrm{N}(0,0)$, the Bayes theorem yields
\begin{align}
\frac{\mathrm{d}\mathrm{S}(\cdot \mid x, d, k)}{\mathrm{d}\mu^{(d)}}
&\propto \frac{\mathrm{d}\mathrm{S}(\cdot \mid d, k)}{\mathrm{d}\mu^{(d)}}(\theta^{(d)})\,\frac{\mathrm{d}P_{\theta,\varepsilon}}{\mathrm{d}P_{0,\varepsilon}}(x)
\propto \exp\Big\{-\sum_{i=1}^{d}\frac{\theta_i^2}{2\varepsilon^2 (d/i)^{2k+1}} - \sum_{i=1}^{d}\frac{(x_i-\theta_i)^2}{2\varepsilon^2}\Big\} \notag\\
&\propto \exp\Big\{-\sum_{i=1}^{d}\frac{1}{2\varepsilon^2}\Big(\frac{1}{(d/i)^{2k+1}}+1\Big)\Big\{\theta_i - \frac{x_i}{1/(d/i)^{2k+1}+1}\Big\}^2\Big\}. \tag{24}
\end{align}
Thus, we obtain the explicit form of $\mathrm{S}(\cdot \mid x, d, k)$. Since the marginal distribution $P_{\mathrm{S}(\cdot\mid d,k)}$ of $x$ with respect to $\mathrm{S}(\cdot\mid d,k)$ is $[\bigotimes_{i=1}^{d}\mathrm{N}(0, \varepsilon^2(1+(d/i)^{2k+1}))]\otimes[\bigotimes_{i=d+1}^{\infty}\mathrm{N}(0,\varepsilon^2)]$, we obtain
\[
M(d \mid x, k) \propto M(d)\,\frac{\mathrm{d}P_{\mathrm{S}(\cdot\mid d,k)}}{\mathrm{d}P_{0,\varepsilon}}(x)
\propto M(d)\prod_{i=1}^{d}(1+(d/i)^{2k+1})^{-1/2}\exp\Big\{-\frac{x_i^2}{2\varepsilon^2(1+(d/i)^{2k+1})}+\frac{x_i^2}{2\varepsilon^2}\Big\}.
\]
A similar calculation yields the explicit form of $F(k \mid x)$.
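The explicit mixture form above makes the posterior mean directly computable. The following hedged sketch (ours, not the authors' code; it truncates both $k$ and $d$ at $p$ for computability, and it evaluates $(d/i)^{2k+1}$ on the log scale to avoid overflow) computes the joint posterior weights of $(k, d)$ and the resulting posterior mean.

```python
# Hedged sketch of the posterior of Appendix D: weights F(k|x) M(d|x,k) and posterior mean.
import numpy as np

def posterior_mean(x, eps, eta=2.0, gamma=2.0):
    p = x.size
    grid = np.arange(1, p + 1)
    i = grid.astype(float)
    L = np.empty((p, p))          # L[k-1, d-1]: log of F(k) M(d) * marginal likelihood ratio
    shrink = np.zeros((p, p, p))  # shrink[k-1, d-1, i-1] = r/(1+r), r = (d/i)^{2k+1}
    for ki, k in enumerate(grid):
        for di, d in enumerate(grid):
            log_r = (2 * k + 1) * np.log(d / i[:d])   # r >= 1 since i <= d
            rinv = np.exp(-log_r)                     # (i/d)^{2k+1}, never overflows
            s = 1.0 / (1.0 + rinv)                    # shrinkage factor r/(1+r)
            shrink[ki, di, :d] = s
            L[ki, di] = (-gamma * k - eta * d
                         - 0.5 * np.sum(log_r + np.log1p(rinv))   # = -0.5 * sum log(1+r)
                         + np.sum(x[:d] ** 2 * s) / (2.0 * eps ** 2))
    W = np.exp(L - L.max())
    W /= W.sum()                  # joint posterior weights of (k, d)
    return np.einsum('kd,kdi->i', W, shrink) * x
```

For instance, under our reading, the Bayes estimator $\hat\theta_{\Pi}$ in Section 4 corresponds to `posterior_mean(x, 1.0)` with $\eta = \gamma = 2$ after the $p = 100$-dimensional truncation.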
Appendix E: Supplementary numerical experiments

In this appendix, we provide several numerical experiments aimed at assisting the readers' understanding. The experimental setting is almost the same as that of Section 4: the numerical experiments are conducted in the $p = 100$-dimensional setting, and the noise variance $\varepsilon^2$ is fixed to one. In addition to the estimators and the parameter values in Section 4, we use the following estimators:
• the maximum likelihood estimator $(X_1, \ldots, X_p)$;
• the blockwise James–Stein estimator of which the truncation dimension is $p$;
• the Bayes estimator based on the Gaussian scale mixture prior distribution $\int \bigotimes_{i=1}^{\infty} \mathrm{N}(0, t\,i^{-2})\,\mathrm{d}V(t)$ with the discretized inverse Gamma distribution $V$ of which the rate and shape parameters are both one.
We also use the following parameter values: for $i = 1, 2, \ldots$,
• $\theta^{(3)}_i := B\,i^{-1.5}/\sqrt{100}$;
• $\theta^{(4)}_i := B\,i^{-1}/\sqrt{\pi^2/6}$.

E.1. Comparison between estimators with and without non-asymptotic adaptation

We compare the performance of estimators with and without non-asymptotic adaptation using the white-noise representation. We represent the true parameter $\theta$ as $t \in [0,1] \to \sum_{i=1}^{p}\theta_i\phi_i(t)$, the observation $x$ as $t \in [0,1] \to \sum_{i=1}^{p}x_i\phi_i(t)$, and an estimator $\hat\theta$ as $t \in [0,1] \to \sum_{i=1}^{p}\hat\theta_i(x)\phi_i(t)$, where $\{\phi_i(\cdot)\}_{i=1}^{\infty}$ is the trigonometric series. In Figures 3–6, these are plotted on an equispaced grid of $[0,1]$.

[Figure 3: White-noise representation of the true parameter ("True"), the maximum likelihood estimator ("MLE"), and the blockwise James–Stein estimator ("BJS") at $\theta = \theta^{(3)}$ and $B = 10$.]
[Figure 4: White-noise representation of the true parameter, the proposed Bayes estimator ("Proposed"), and the model averaging estimator ("Model averaging") at $\theta = \theta^{(3)}$ and $B = 10$.]
[Figure 5: As Figure 3, at $\theta = \theta^{(4)}$ and $B = 10$.]
[Figure 6: As Figure 4, at $\theta = \theta^{(4)}$ and $B = 10$.]

Figures 3 and 5 indicate that the estimators without non-asymptotic adaptation do not work as smoothers when $B$ is relatively large. Figures 4 and 6 indicate that the estimators with non-asymptotic adaptation detect the true parameter.

E.2. Comparison with the Gaussian scale mixture prior distribution

We compare the performance of the proposed Bayes estimator with that of the Bayes estimator based on the Gaussian scale mixture prior distribution. The comparison is intended to indicate that the Bayes estimator based on the Gaussian scale mixture prior would also be non-asymptotically adaptive, as conjectured in Remark 2 in Section 3.

[Figure 7: White-noise representation of the true parameter, the proposed estimator, and the Bayes estimator based on the Gaussian scale mixture prior at $\theta = \theta^{(3)}$ and $B = 10$.]
[Figure 8: The same comparison at $\theta = \theta^{(4)}$ and $B = 10$.]
[Figure 9: Means of losses with error bars at $\theta = \theta^{(3)}$ for the five values of $B$; the error bars of the scale mixture estimator at $B = 1$ are omitted because the upper bar lies outside the plotted range.]
[Figure 10: Means of losses with error bars at $\theta = \theta^{(4)}$ for the five values of $B$.]

Figures 7 and 8 are comparisons using the white-noise representation; Figures 9 and 10 are comparisons using the values of the losses at $\theta = \theta^{(3)}$ and $\theta^{(4)}$.
The proposed Bayes estimator is abbreviated as "Proposed," and the Bayes estimator based on the Gaussian scale mixture prior distribution as "Scale mixture." Figures 7–10 indicate that the performance of the Bayes estimator based on the Gaussian scale mixture prior distribution is comparable to that of the proposed estimator.

References

[1] Akaike, H. (1973). Information theory and extension of the maximum likelihood principle. In Proc. 2nd Int. Symp. Info. Theory.
[2] Arbel, J., Gayraud, G. and Rousseau, J. (2013). Bayesian optimal adaptive estimation using a sieve prior. Scand. J. Statist.
[3] Baraud, Y. (2000). Model selection for regression on a fixed design. Probab. Theory Relat. Fields.
[4] Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Relat. Fields.
[5] Barron, A., Schervish, M. and Wasserman, L. (1999). The consistency of posterior distributions in nonparametric problems. Ann. Statist.
[6] Belitser, E. and Ghosal, S. (2003). Adaptive Bayesian inference of the mean of an infinite-dimensional normal distribution. Ann. Statist.
[7] Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics.
[8] Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc.
[9] Brown, L. and Low, M. (1996). Asymptotic equivalence of nonparametric regression and white noise. Ann. Statist.
[10] Cai, T., Low, M. and Zhao, L. (2000). Sharp adaptive estimation by a blockwise method. Technical Report, Wharton School, University of Pennsylvania, Philadelphia.
[11] Cavalier, L. and Tsybakov, A. (2001). Penalized blockwise Stein's method, monotone oracles and sharp adaptive estimation. Math. Methods Statist.
[12] Dalalyan, A. and Salmon, J. (2012). Sharp oracle inequalities for aggregation of affine estimators. Ann. Statist.
[13] Efromovich, S. (1999). Nonparametric Curve Estimation. Springer.
[14] Efromovich, S. and Pinsker, M. (1984). Learning algorithm for nonparametric filtering. Automation and Remote Control.
[15] Freedman, D. (1999). On the Bernstein–von Mises theorem with infinite-dimensional parameters. Ann. Statist.
[16] Gao, C. and Zhou, H. (2016). Rate exact Bayesian adaptation with modified block priors. Ann. Statist.
[17] Ghosal, S., Ghosh, J. and van der Vaart, A. (2000). Convergence rates of posterior distributions. Ann. Statist.
[18] Ghosal, S., Lember, J. and van der Vaart, A. (2008). Nonparametric Bayesian model selection and averaging. Electron. J. Statist.
[19] Ghosal, S. and van der Vaart, A. (2007). Convergence rates of posterior distributions for noniid observations. Ann. Statist.
[20] Giné, E. and Nickl, R. (2016). Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge University Press.
[21] Hartigan, J. (2002). Bayesian regression using Akaike priors. Technical Report, Yale University, New Haven, CT.
[22] Hoffmann, M., Rousseau, J. and Schmidt-Hieber, J. (2015). On adaptive posterior concentration rates. Ann. Statist.
[23] Huang, T. (2004). Convergence rates for posterior distributions and adaptive estimation. Ann. Statist.
[24] Johannes, J., Schenk, R. and Simoni, A. (2014). Adaptive Bayesian estimation in Gaussian sequence space models. In Contributions in Infinite-Dimensional Statistics and Related Topics.
[25] Knapik, B., Szabó, B., van der Vaart, A. and van Zanten, H. (2016). Bayes procedures for adaptive inference in inverse problems for the white noise model. Probab. Theory Relat. Fields.
[26] Knapik, B., van der Vaart, A. and van Zanten, H. (2011). Bayesian inverse problems with Gaussian priors. Ann. Statist.
[27] Leung, G. and Barron, A. (2006). Information theory and mixing least-squares regressions. IEEE Trans. Inform. Theory.
[28] Mallows, C. (1973). Some comments on C_p. Technometrics.
[29] Massart, P. (2007). Concentration Inequalities and Model Selection: École d'Été de Probabilités de Saint-Flour XXXIII-2003. Springer.
[30] Petrone, S., Rousseau, J. and Scricciolo, C. (2014). Bayes and empirical Bayes: do they merge? Biometrika.
[31] Pinsker, M. (1980). Optimal filtering of square integrable signals in Gaussian white noise. Problems Inform. Transmission.
[32] Rasmussen, C. and Williams, C. (2005). Gaussian Processes for Machine Learning. The MIT Press.
[33] Ray, K. (2013). Bayesian inverse problems with non-conjugate priors. Electron. J. Statist.
[34] Rousseau, J. and Szabó, B. (2017). Asymptotic behaviour of the empirical Bayes posterior associated to maximum marginal likelihood estimator. Ann. Statist.
[35] Scricciolo, C. (2006). Convergence rates for Bayesian density estimation of infinite-dimensional exponential families. Ann. Statist.
[36] Shen, W. and Ghosal, S. (2015). Adaptive Bayesian procedures using random series priors. Scand. J. Statist.
[37] Shen, X. and Wasserman, L. (2001). Rates of convergence of posterior distributions. Ann. Statist.
[38] Stein, C. (1973). Estimation of the mean of a multivariate normal distribution. In Proc. Prague Symp. Asymptotic Statistics.
[39] Suzuki, T. (2012). PAC-Bayesian bound for Gaussian process regression and multiple kernel additive model. In Proc. Conference on Learning Theory.
[40] Szabó, B., van der Vaart, A. and van Zanten, H. (2013). Empirical Bayes scaling of Gaussian priors in the white noise model. Electron. J. Statist.
[41] Tsybakov, A. (2009). Introduction to Nonparametric Estimation. Springer.
[42] van der Vaart, A. and van Zanten, H. (2009). Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. Ann. Statist.
[43] Wasserman, L. (2006). All of Nonparametric Statistics. Springer.
[44] Yang, Y. (2005). Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation. Biometrika.
[45] Zhao, L. (2000). Bayesian aspects of some nonparametric problems. Ann. Statist.