Bernstein–von Mises theorems for Gaussian regression with increasing number of regressors
The Annals of Statistics © Institute of Mathematical Statistics, 2011
By Dominique Bontemps
Université Paris-Sud
This paper brings a contribution to the Bayesian theory of nonparametric and semiparametric estimation. We are interested in the asymptotic normality of the posterior distribution in Gaussian linear regression models when the number of regressors increases with the sample size. Two kinds of Bernstein–von Mises theorems are obtained in this framework: nonparametric theorems for the parameter itself, and semiparametric theorems for functionals of the parameter. We apply them to the Gaussian sequence model and to the regression of functions in Sobolev and C^α classes, in which we get the minimax convergence rates. Adaptivity is reached for the Bayesian estimators of functionals in our applications.
1. Introduction.
To estimate a parameter of interest in a statistical model, a Bayesian puts a prior distribution on it and looks at the posterior distribution, given the observations. A Bernstein–von Mises theorem is a result giving conditions under which the posterior distribution is asymptotically normal, centered at the maximum likelihood estimator (MLE) of the model used, with a variance equal to the asymptotic frequentist variance of the MLE. Other centerings can be used; see, for instance, van der Vaart (1998), page 144, after the proof of Lemma 10.3.

Such asymptotic posterior normality is important because it allows the construction of approximate credible regions, based on the posterior distribution, which retain good frequentist properties. In particular, Markov chain Monte Carlo (MCMC) algorithms make feasible the construction of Bayesian confidence regions in complex models, for which frequentist confidence regions are difficult to build; however, Bernstein–von Mises theorems are difficult to derive in complex models.
Received September 2010; revised June 2011.
AMS 2000 subject classifications.
Key words and phrases. Nonparametric Bayesian statistics, semiparametric Bayesian statistics, Bernstein–von Mises theorem, posterior asymptotic normality, adaptive estimation.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2011, Vol. 39, No. 5, 2557–2584. This reprint differs from the original in pagination and typographic detail.
Note that the Bernstein–von Mises theorem also has links with information theory [see Clarke and Barron (1990) and Clarke and Ghosal (2010)]. For parametric models, the Bernstein–von Mises theorem is a well-known result, for which we refer to van der Vaart (1998). In nonparametric models (where the parameter space is infinite-dimensional or growing) and semiparametric models (when the parameter of interest is a finite-dimensional functional of the complete infinite-dimensional parameter), there are still relatively few asymptotic normality results. Freedman (1999) gives negative results, and we recall some positive ones below. However, many recent papers deal with the convergence rate of posterior distributions in various settings, which is linked with the model complexity: we refer to Ghosal, Ghosh and van der Vaart (2000) and Shen and Wasserman (2001) as early representatives of this school.

Nonparametric Bernstein–von Mises theorems have been developed for models based on a sieve approximation, where the dimension of the parameter grows with the sample size. In particular, two situations have been studied: regression models in Ghosal (1999); exponential models in Ghosal (2000), Clarke and Ghosal (2010) and Boucheron and Gassiat (2009) (this last one deals with the discrete case, when the observations follow some unknown infinite multinomial distribution).

In semiparametric frameworks asymptotic normality has been obtained in several situations. Kim and Lee (2004) and Kim (2006) study the nonparametric right-censoring model and the proportional hazard model. Castillo (2010) obtains Bernstein–von Mises theorems for Gaussian process priors, in the semiparametric framework where the unknown quantity is (θ, f), with θ the parameter of interest and f an infinite-dimensional nuisance parameter. See also Shen (2002).
Rivoirard and Rousseau (2009) obtain the Bernstein–von Mises theorem for linear functionals of the density of the observations, in the context of a sieve approximation: sequences of spaces with an increasing dimension k_n are used to approximate an infinite-dimensional space. These authors also achieve the frequentist minimax estimation rate for densities in specific regularity classes with a deterministic (nonadaptive) value of the dimension k_n.

Here we obtain nonparametric and semiparametric Bernstein–von Mises theorems in a Gaussian regression framework with an increasing number of regressors. We address two challenging problems. First, we try to understand better when the Bernstein–von Mises theorem holds and when it does not. In the latter case the Bayesian credible sets no longer preserve their frequentist asymptotic properties. Second, we look for adaptive Bayesian estimators in our semiparametric settings.

Our nonparametric results cover the case of a specific Gaussian prior, and the case of more generic smooth priors. They are said to be nonparametric because we use sieve priors, that is, the dimension of the parameter grows. These results improve on the preceding ones by Ghosal (1999), which did not suppose the normality of the errors but imposed other conditions, in particular, on the growth rate of the number of regressors. We apply our results to the Gaussian sequence model, as well as to periodic Sobolev classes and to regularity classes C^α[0, 1] in the context of the regression model (using, resp., trigonometric polynomials and splines as regressors). In all these situations we get the asymptotic normality of the posterior in addition to the minimax convergence rates, with appropriate (nonadaptive) choices of the prior. We also show that for some priors known to reach this convergence rate, the Bernstein–von Mises theorem does not hold.

We also derive semiparametric Bernstein–von Mises theorems for linear and nonlinear functionals of the parameter. The linear case is an immediate corollary of the nonparametric theorems and does not need any additional conditions. We apply these results to the periodic Sobolev classes to estimate a linear functional and the L² norm of the regression function f when it is smooth enough, and in both cases we are able to build an adaptive Bayesian estimator which achieves the minimax convergence rate in all classes of the collection, in addition to the asymptotic normality.

The paper is organized as follows. We present the framework in Section 2. Section 3 states the nonparametric Bernstein–von Mises theorems, for Gaussian and non-Gaussian priors. In Section 4 we derive the semiparametric Bernstein–von Mises theorems for linear and nonlinear functionals of the parameter. Then in Section 5 we give applications to the Gaussian sequence model, and to the regression of a function in a Sobolev or C^α[0, 1] class. In Section 6 the nonparametric and semiparametric Bernstein–von Mises theorems are proved. The appendices contain various technical tools used in the main analysis; they can be found in the supplemental article [Bontemps (2011)].
2. Framework.
We consider a Gaussian linear regression framework. For any n ≥ 1, our observation Y = (Y_1, ..., Y_n) ∈ R^n is a Gaussian random vector

Y = F + ε,   (1)

where the vector of errors ε = (ε_1, ..., ε_n) ∼ N(0, σ_n² I_n), with I_n the n × n identity matrix, and the mean vector F belongs to R^n. Note that the dimension of Y is the sample size n, and that σ_n is known but may depend on n. Let F⁰ be the true mean vector of Y, with distribution N(F⁰, σ_n² I_n). Probabilities and expectations under F⁰ are denoted P_{F⁰} and E.

Let φ_1, ..., φ_{k_n} be a collection of k_n linearly independent regressors in R^n, where k_n ≤ n grows with n. We gather these regressors in the n × k_n matrix Φ of rank k_n, and ⟨φ⟩ = {Φθ : θ = (θ_1, ..., θ_{k_n}) ∈ R^{k_n}} denotes their linear span. The Bernstein–von Mises theorems will be stated in association with ⟨φ⟩, the vector space of possible mean vectors in the model, which is possibly misspecified. We denote by P_θ the probability distribution of a random variable following N(Φθ, σ_n² I_n), and by E_θ the associated expectation.

As examples, we present three different settings, each with its own collection of regressors. In Section 5 the Bernstein–von Mises theorems are applied to each of these frameworks:

(1) The Gaussian sequence model.
Our first application concerns the Gaussian sequence model, which is also equivalent to the white noise model [see Massart (2007), Chapter 4, e.g.]. We consider the infinite-dimensional setting

Y_j = θ_j + (1/√n) ξ_j,   j ≥ 1,   (2)

where the random variables ξ_j, j ≥ 1, are i.i.d. N(0, 1). Keeping only the first k_n coordinates, with k_n ≤ n, we retrieve our model (1) with θ = (θ_j)_{1 ≤ j ≤ k_n}, σ_n = 1/√n and Φ^T Φ = I_{k_n}.

(2) Regression of a function in a Sobolev class.
Let f : [0, 1] → R be a function in L²([0, 1]). We observe

Y_i = f(i/n) + ε_i   (3)

for 1 ≤ i ≤ n, where the errors ε_i are i.i.d. N(0, σ²) and σ_n = σ does not depend on n.

We denote by (ϕ_j)_{j ≥ 1} the Fourier basis:

ϕ_1 ≡ 1,
ϕ_{2m}(x) = √2 cos(2πmx)   for all m ≥ 1,   (4)
ϕ_{2m+1}(x) = √2 sin(2πmx)   for all m ≥ 1.

In conjunction with the regular design x_i = i/n for 1 ≤ i ≤ n, this gives the collection of regressors φ_j = (ϕ_j(i/n))_{1 ≤ i ≤ n}, 1 ≤ j ≤ k_n. In practice, we suppose that f belongs to one of the periodic Sobolev classes:

Definition 1.
Let α > 0 and L > 0. Let (ϕ_j)_{j ≥ 1} denote the Fourier basis (4). We define the Sobolev class W(α, L) as the collection of all functions f = ∑_{j=1}^∞ θ_j ϕ_j in L²([0, 1]) such that θ = (θ_j)_{j ≥ 1} is an element of the ellipsoid of ℓ²(N)

Θ(α, L) = { θ ∈ ℓ²(N) : ∑_{j=1}^∞ a_j² θ_j² ≤ L²/π^{2α} },

where

a_j = j^α if j is even, and a_j = (j − 1)^α if j is odd.   (5)

(3) Regression of a function in C^α[0, 1]. Let α > 0, and let f ∈ C^α[0, 1]: f is ⌊α⌋ times continuously differentiable with ‖f‖_α < ∞, where ⌊α⌋ is the greatest integer less than α and the seminorm ‖·‖_α is defined by

‖f‖_α = sup_{x ≠ x′} |f^{(⌊α⌋)}(x) − f^{(⌊α⌋)}(x′)| / |x − x′|^{α − ⌊α⌋}.

Consider a design (x_i^{(n)})_{n ≥ 1, 1 ≤ i ≤ n}, not necessarily uniform. Here F⁰ is the vector (f(x_i^{(n)}))_{1 ≤ i ≤ n}. Once again we suppose that σ_n = σ does not depend on n. Fix an integer q ≥ α, and let K = k_n + 1 − q. Partition the interval (0, 1] into K subintervals ((j − 1)/K, j/K] for 1 ≤ j ≤ K. We want to perform the regression of f in the space of splines of order q defined on that partition, and use the B-splines basis (B_j)_{1 ≤ j ≤ k_n} [see, e.g., de Boor (1978)]. Our collection of regressors is φ_j = (B_j(x_i^{(n)}))_{1 ≤ i ≤ n}, for 1 ≤ j ≤ k_n.

For any value of n ≥ 1, let W̃ be a prior distribution on R^{k_n} and, for F = Φθ, let W be the prior distribution on F ∈ R^n obtained from W̃ on θ. Its support is included in ⟨φ⟩. Let P_W denote the marginal distribution of Y under the prior W, and W(dG(F) | Y) the posterior distribution of a functional G(F). Note that everything depends on n (W, e.g., is a distribution on R^n), even if we do not use n as an index, to simplify our notation.

Both the parametrization by θ and the corresponding collection of regressors φ_1, ..., φ_{k_n} are arbitrary: what matters is the posterior distribution of F, and this depends on the space ⟨φ⟩, not on the basis used to parametrize it. The span ⟨φ⟩ is characterized by the matrix Σ = Φ(Φ^T Φ)^{-1} Φ^T of the orthogonal projection onto ⟨φ⟩.

The prior W is a sieve prior: that is, its support comes from a finite-dimensional model whose dimension k_n grows with n. The collection of growing models ⟨φ⟩ (the sieve) can be seen as an approximation framework, each model being possibly misspecified. There is no true parameter in our setting: the true mean vector F⁰ may fall outside ⟨φ⟩ and correspond to none of the possible values of θ. There is then a bias which has to be dealt with, linked to the choice of the cutoff k_n.

When dealing with Bernstein–von Mises results, the question of the asymptotic centering point arises. In nonparametric models constructed on an infinite-dimensional parameter, there is no definition of an MLE; what the natural centering for a Bernstein–von Mises theorem should be in such situations is not clear. In the model ⟨φ⟩, the orthogonal projection Y_⟨φ⟩ = ΣY of Y is also the MLE of F. We set θ_Y = (Φ^T Φ)^{-1} Φ^T Y its associated parameter. Let also F⁰_⟨φ⟩ = Φθ⁰ be the projection of F⁰ on ⟨φ⟩, with θ⁰ = (Φ^T Φ)^{-1} Φ^T F⁰. Now, F⁰ − F⁰_⟨φ⟩ corresponds to the bias introduced by the use of the model ⟨φ⟩, and F⁰_⟨φ⟩ is the centering point of the distribution of the MLE Y_⟨φ⟩ under P_{F⁰}: Y_⟨φ⟩ ∼ N(F⁰_⟨φ⟩, σ_n² Σ). Although the MLE is naturally defined in the sieve ⟨φ⟩, it heavily depends on the choice of ⟨φ⟩. Therefore, the Bernstein–von Mises theorems we establish depend on the choice of the sieve the prior distribution is built on.
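The projection quantities of this section are easy to compute numerically. The following sketch (a synthetic design; all names and sizes are illustrative choices, not from the paper) builds Σ, the sieve MLE Y_⟨φ⟩ and its parameter θ_Y, the centering F⁰_⟨φ⟩, and checks the identities just stated, including the orthogonality of the modeling bias F⁰ − F⁰_⟨φ⟩ to ⟨φ⟩.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k_n, sigma_n = 200, 10, 1.0

# Illustrative synthetic regressors: any n x k_n matrix of full column rank.
Phi = rng.standard_normal((n, k_n))

# A true mean vector F0 that need not lie in the span <phi> (misspecified case).
F0 = rng.standard_normal(n)
Y = F0 + sigma_n * rng.standard_normal(n)

PtP_inv = np.linalg.inv(Phi.T @ Phi)
Sigma = Phi @ PtP_inv @ Phi.T      # orthogonal projection onto <phi>
Y_proj = Sigma @ Y                 # Y_<phi>, the sieve MLE of F
theta_Y = PtP_inv @ Phi.T @ Y      # its parameter: Phi @ theta_Y = Y_proj
F0_proj = Sigma @ F0               # F0_<phi>, centering of the MLE

# Sanity checks: Sigma is idempotent, theta_Y parametrizes Y_proj, and the
# bias F0 - F0_proj is orthogonal to the model space.
assert np.allclose(Sigma @ Sigma, Sigma)
assert np.allclose(Phi @ theta_Y, Y_proj)
assert np.allclose(Phi.T @ (F0 - F0_proj), 0.0)
```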
3. Nonparametric Bernstein–von Mises theorems.
The proofs of our nonparametric results are delayed to Section 6.

3.1.
With Gaussian priors.
We consider here a centered normal prior distribution W which is isotropic on ⟨φ⟩, so that W = N(0, τ_n² Σ) for some sequence τ_n. τ_n is a scale parameter, and essentially the only assumption needed in this case is that τ_n is large enough as n grows. Let ‖Q − Q′‖_TV denote the total variation norm between two probability distributions Q and Q′.

Theorem 1.
Assume that σ_n = o(τ_n), ‖F⁰‖ = o(τ_n²/σ_n) and k_n = o(τ_n⁴/σ_n⁴). Then

E ‖W(dF | Y) − N(Y_⟨φ⟩, σ_n² Σ)‖_TV → 0   as n → ∞.

In terms of θ instead of F, an equivalent statement is

E ‖W̃(dθ | Y) − N(θ_Y, σ_n² (Φ^T Φ)^{-1})‖_TV → 0   as n → ∞.

Theorem 1 does not deal with the modeling bias introduced by taking a prior restricted to ⟨φ⟩. This is an important question in nonparametric statistics, and k_n has to be chosen in order to achieve a satisfactory bias–variance trade-off.

As an example, let us consider a typical regression framework with F⁰ = (f(x_i))_{1 ≤ i ≤ n}, where f is some function and (x_i)_{1 ≤ i ≤ n} some design. If σ_n does not depend on n, both conditions ‖F⁰‖ = o(τ_n²/σ_n) and k_n = o(τ_n⁴/σ_n⁴) are satisfied if f is bounded and n^{1/4} = o(τ_n). These conditions can be read in another way: τ_n must be large enough with respect to ‖F⁰‖ and k_n.

3.2. With smooth priors.
We consider now more general priors. To understand better the conditions we use, we need to look at the mechanics of the Bernstein–von Mises theorem.
Behind a Bernstein–von Mises theorem there is a LAN structure: the log-likelihood admits a quadratic expansion near the MLE. Since the posterior density is proportional to the product of the prior density and the likelihood, the prior has to be locally constant to let the likelihood alone influence the posterior and produce the Gaussian shape. To prove a Bernstein–von Mises theorem, we look for a subset which is simultaneously (1) large enough, so that the posterior will concentrate on it, and (2) small enough, so that we can find approximately constant priors on it. The larger the dimension of the model is, the more difficult it is to combine these two requirements, and the more difficult it is to obtain a Bernstein–von Mises theorem.

The geometry of the subsets is naturally suggested by the normal distribution we are looking for. For M > 0, consider the ellipsoid

E_{θ⁰,Φ}(M) = { θ ∈ R^{k_n} : (θ − θ⁰)^T Φ^T Φ (θ − θ⁰) ≤ σ_n² M }.   (6)

Theorem 2.
Suppose that W is induced by a distribution W̃ on θ admitting a density w(θ) with respect to the Lebesgue measure. If there exists a sequence (M_n)_{n ≥ 1} such that:

(1) sup_{‖Φh‖ ≤ σ_n √M_n, ‖Φg‖ ≤ σ_n √M_n} w(θ⁰ + h)/w(θ⁰ + g) → 1 as n → ∞,
(2) k_n ln k_n = o(M_n),
(3) max(0, ln(√det(Φ^T Φ) / (σ_n^{k_n} w(θ⁰)))) = o(M_n),

then E ‖W(dF | Y) − N(Y_⟨φ⟩, σ_n² Σ)‖_TV → 0 as n → ∞.

With condition (1) we ask for a sufficiently flat prior W̃ on an ellipsoid E_{θ⁰,Φ}(M_n). Condition (2) ensures, in particular, that the weight the normal distribution puts on E_{θ⁰,Φ}(M_n) goes to 1 in the limit. Condition (3) makes quantities linked to the volume of E_{θ⁰,Φ}(M_n) appear and guarantees that it has enough prior weight. This kind of assumption is common in the literature dealing with the concentration of posterior distributions; see, for instance, Ghosal, Ghosh and van der Vaart (2000).

Several of our applications illustrate that priors known to induce the posterior minimax convergence rate may not be flat enough to get the Gaussian shape with the asymptotic variance σ_n² Σ.

An important remark is the following: condition (2) does not really limit the growth rate of k_n. Read in conjunction with the other two conditions, we see that a flatter prior distribution will permit us to take M_n larger. Thus, the only condition on the growth rate of k_n is k_n ≤ n.
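To get a feel for condition (1), consider the isotropic Gaussian prior W = N(0, τ_n² Σ): over the ellipsoid E_{θ⁰,Φ}(M_n) its log-density varies by at most (2‖Φθ⁰‖σ_n√M_n + σ_n²M_n)/τ_n², a crude bound. The sketch below evaluates this bound for two scalings of τ_n; the magnitudes (σ_n = 1, ‖Φθ⁰‖ ≈ √n for a bounded regression function, k_n = n, hence M_n of order n ln n) are illustrative assumptions, not the paper's prescriptions.

```python
import numpy as np

# Crude upper bound for the log-ratio of the Gaussian prior density over
# E_{theta0,Phi}(M_n): for W = N(0, tau^2 Sigma),
#   |ln w(theta0+h) - ln w(theta0+g)|
#     <= (2*||Phi theta0||*sigma*sqrt(M_n) + sigma^2*M_n) / tau^2
# whenever ||Phi h||, ||Phi g|| <= sigma*sqrt(M_n).
def log_ratio_bound(norm_Phi_theta0, sigma, M_n, tau):
    return (2.0 * norm_Phi_theta0 * sigma * np.sqrt(M_n)
            + sigma**2 * M_n) / tau**2

n = 10**6
sigma = 1.0                  # sigma_n constant (regression setting)
norm_F0 = np.sqrt(n)         # ||Phi theta0|| ~ sqrt(n)*||f||_inf, f bounded
M_n = n * np.log(n)          # with k_n = n, condition (2) forces M_n >> n ln n

tau_small = n**0.25 * np.log(n)                   # enough for Theorem 1 ...
tau_large = np.sqrt(n * np.log(n)) * np.log(n)    # ... but (1) needs more
b_small = log_ratio_bound(norm_F0, sigma, M_n, tau_small)
b_large = log_ratio_bound(norm_F0, sigma, M_n, tau_large)
print(b_small, b_large)      # the bound vanishes only for the larger tau
```

The bound stays large for τ_n of order n^{1/4} (up to logarithms) and vanishes once roughly n ln n = o(τ_n²), which is the gap between Theorems 1 and 2 discussed next.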
Note that Theorem 2 is not a generalization of Theorem 1: Theorem 1 is more powerful for isotropic Gaussian priors. Consider again the regression framework with F⁰ = (f(x_i))_{1 ≤ i ≤ n}, where f is a bounded function and (x_i)_{1 ≤ i ≤ n} is some design. Suppose that σ_n does not depend on n, and take k_n = n and W = N(0, τ_n² Σ). Then the conditions of Theorem 1 are satisfied as soon as n^{1/4} = o(τ_n), but with Theorem 2 we need n ln n = o(τ_n²).

Our main applications, to the Gaussian sequence model and to the regression model using trigonometric polynomials and splines, are developed in Section 5. We now present two remarks about the parametric case and the comparison with the pioneering work of Ghosal (1999).

The parametric case.
Consider the regression of a function f defined on [0, 1] with a fixed number k of regressors. Set a design (x_i^{(n)})_{n ≥ 1, 1 ≤ i ≤ n}, with x_i^{(n)} ∈ ((i − 1)/n, i/n] for any n ≥ 1, and F⁰ = (f(x_i^{(n)}))_{1 ≤ i ≤ n}. Choose a finite number of piecewise continuous and linearly independent regressors (ϕ_j)_{1 ≤ j ≤ k} on [0, 1], and set φ_j = (ϕ_j(x_i^{(n)}))_{1 ≤ i ≤ n} for 1 ≤ j ≤ k. Assume that f, k_n = k, σ_n = σ and W̃ do not depend on n.

We would like to compare Theorem 2 with the usual Bernstein–von Mises theorem for parametric models applied to such a regression framework. In that setting, let us suppose that w is continuous and positive, and that f is bounded. Then condition (1) becomes M_n = o(n), while condition (3) reduces to ln n = o(M_n). Clearly, such sequences (M_n)_{n ≥ 1} exist, so Theorem 2 applies. Here the rescaling by √n of the Bernstein–von Mises theorem for parametric models is hidden in the asymptotic posterior variance σ²(Φ^T Φ)^{-1} of the parameter θ. Indeed, (1/n) Φ^T Φ is a Riemann sum and converges toward the Gram matrix of the collection (ϕ_j)_{1 ≤ j ≤ k} in L²([0, 1]).

Proof.
We have ‖Φθ⁰‖ ≤ ‖F⁰‖ ≤ √n ‖f‖_∞, and ‖θ⁰‖² ≤ ‖(Φ^T Φ)^{-1}‖ · ‖Φθ⁰‖² ≤ n ‖(Φ^T Φ)^{-1}‖ ‖f‖_∞². Since (1/n) Φ^T Φ converges toward the Gram matrix of the collection (ϕ_j)_{1 ≤ j ≤ k} in L²([0, 1]), n‖(Φ^T Φ)^{-1}‖ is bounded for n large enough. Therefore, θ⁰ is bounded, and we can consider that it lies in some compact set on which w is uniformly continuous and lower bounded by a positive constant. The rest follows. □

Comparison with Ghosal’s conditions.
The Bernstein–von Mises theorem in a regression setting where the number of parameters goes to infinity was first studied by Ghosal (1999), as an early step in the development of frequentist nonparametric Bayesian theory. In his paper the errors ε_i are not supposed to be Gaussian. Under the Gaussianity assumption we get improved results, which means that we have a nontrivial generalization of the Ghosal (1999) conditions in the case of Gaussian errors. In particular, our condition on the prior smoothness is simpler, and the growth rate of the dimension k_n is much less constrained:

• Ghosal (1999) does not admit a modeling bias between F⁰ and Φθ⁰. In the present work the normality of the errors permits us to take F⁰ = Φθ⁰ without any cost, as it appears in the core of the proof (Lemma 7). The possibility of considering misspecified models is an important improvement.

• In Ghosal (1999) σ_n is constant, which does not allow the application to the Gaussian sequence model.

• Ghosal (1999) restricts the growth of the dimension k_n to k_n⁴ ln k_n = o(n) (see below). In our setting we only require k_n ≤ n. With Ghosal’s condition we could not have obtained the applications to the Gaussian sequence model or to the regression model for Sobolev or C^α classes.

Let δ_n = ‖(Φ^T Φ)^{-1}‖ be the operator norm of (Φ^T Φ)^{-1} for the ℓ² metric, and let η_n be the maximal value on the diagonal of Σ. With our notation, the last two assumptions of Ghosal (1999) become:

(A3) There exists η > 0 such that w(θ⁰) > η^{k_n}. Moreover,

|ln w(θ) − ln w(θ⁰)| ≤ L_n(C) ‖θ − θ⁰‖,   (7)

whenever ‖θ − θ⁰‖ ≤ C δ_n^{1/2} (k_n ln k_n)^{1/2}, where the Lipschitz constant L_n(C) is subject to some growth restriction [see assumption (A4)].

(A4) For all C > 0,

L_n(C) δ_n^{1/2} (k_n ln k_n)^{1/2} → 0   and   η_n k_n³ ln k_n → 0.   (8)

Further, the design satisfies a condition on the trace of Φ^T Φ:

tr(Φ^T Φ) = O(n k_n).   (9)

Since Σ is an orthogonal projection matrix on a k_n-dimensional space, tr(Σ) = k_n and η_n ≥ k_n/n. Thus, the last part of (8) implies k_n⁴ ln k_n = o(n). If we add the normality of the errors and a slight technical condition ln n = o(k_n ln k_n), these assumptions imply ours. Indeed, set M_n = C_n k_n ln k_n for a sequence C_n going to infinity slowly enough. Our condition (2) is immediate. Condition (1) is obtained from (7) and the first part of (8). The beginning of (A3) implies −ln w(θ⁰) = O(k_n) = o(M_n). Using the concavity of the ln function and (9), we get ln det(Φ^T Φ) ≤ k_n ln tr(Φ^T Φ) − k_n ln k_n = O(k_n ln n) = o(M_n). Therefore, our condition (3) holds.
4. Semiparametric Bernstein–von Mises theorems.
We consider two kinds of functionals of F: linear and nonlinear ones. These results can be easily adapted to functionals of θ, using the maps θ ↦ Φθ and F ↦ (Φ^T Φ)^{-1} Φ^T F.
4.1. The linear case.
For linear functionals of F, we have the following corollary:

Corollary 1.
Let p ≥ 1 be fixed, and let G be a p × n matrix. Suppose that the conditions of either Theorem 1 or 2 are satisfied. Then

E ‖W(d(GF) | Y) − N(GY_⟨φ⟩, σ_n² G Σ G^T)‖_TV → 0   as n → ∞.

Further, the distribution of GY_⟨φ⟩ is N(GF⁰_⟨φ⟩, σ_n² G Σ G^T).

Corollary 1 is just a linear transform of the preceding theorems, and of the distribution of Y_⟨φ⟩. An example of application is given in Section 5.2, in the context of the regression on Fourier’s basis.

4.2. The nonlinear case.
The Bernstein–von Mises theorem which is presented here for nonlinear functionals is derived from the nonparametric theorems thanks to Taylor expansions. In the Taylor expansion of a functional, the first-order term naturally leads to the posterior normality, as in the case of linear functionals. We do not want the second-order term to interfere with this phenomenon: it has to be controlled. The conditions of Theorem 3 below are stated to permit this control of the second-order term.

Let p ≥ 1 and let G : R^n → R^p be a twice continuously differentiable function. For F ∈ R^n, let Ġ_F denote the Jacobian matrix of G at F, and D²_F G(·, ·) the second derivative of G, as a bilinear function on R^n. For any F ∈ ⟨φ⟩ and a > 0, let

B_F(a) = sup_{h ∈ ⟨φ⟩ : ‖h‖² ≤ σ_n² a} sup_{0 ≤ t ≤ 1} ‖D²_{F+th} G(h, h)‖,   (10)

where ‖·‖ denotes the Euclidean norm of R^p. We also consider the following nonnegative symmetric matrix:

Γ_F = σ_n² Ġ_F Σ Ġ_F^T.   (11)

In the following, ‖Γ_F^{-1}‖ denotes the Euclidean operator norm of Γ_F^{-1}, which is also the inverse of the smallest eigenvalue of Γ_F. Let I be the collection of all intervals in R, and for any I ∈ I, let ψ(I) = P(Z ∈ I), where Z is a N(0, 1) random variable. Recall that Y_⟨φ⟩ is the MLE and the orthogonal projection of Y on ⟨φ⟩.

Theorem 3. Let G : R^n → R^p be a twice continuously differentiable function, and let Γ_F be as just defined. Suppose that Γ_{F⁰_⟨φ⟩} is nonsingular, and that there exists a sequence (M_n)_{n ≥ 1} such that k_n = o(M_n) and

B_{F⁰_⟨φ⟩}(M_n) = o(‖Γ_{F⁰_⟨φ⟩}^{-1}‖^{-1/2}).   (12)

Suppose further that the conditions of either Theorem 1 or 2 are satisfied. Then, for any b ∈ R^p,

E[ sup_{I ∈ I} | W( b^T (G(F) − G(Y_⟨φ⟩)) / √(b^T Γ_{F⁰_⟨φ⟩} b) ∈ I | Y ) − ψ(I) | ] → 0   as n → ∞.   (13)

Under the same conditions,

sup_{I ∈ I} | P( b^T (G(Y_⟨φ⟩) − G(F⁰_⟨φ⟩)) / √(b^T Γ_{F⁰_⟨φ⟩} b) ∈ I ) − ψ(I) | → 0   as n → ∞.   (14)

Note that sup_{I ∈ I} |Q(I) − Q′(I)| is the Lévy–Prokhorov distance between two distributions Q and Q′ on R. The Lévy–Prokhorov distance metrizes the convergence in distribution. So, when p = 1, the Lévy–Prokhorov distance between the distribution W(dG(F) | Y) and N(G(Y_⟨φ⟩), Γ_{F⁰_⟨φ⟩}) goes to 0 in mean, while G(Y_⟨φ⟩) converges to N(G(F⁰_⟨φ⟩), Γ_{F⁰_⟨φ⟩}) in distribution.

An application of Theorem 3 is given in Section 5.2, in the context of the regression on Fourier’s basis. The proof is delayed to Section 6.3.
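The standardization of Theorem 3 can be mimicked numerically. The sketch below is illustrative only: the data are synthetic, G(F) = ‖F‖² is one convenient twice-differentiable functional, the limiting posterior N(Y_⟨φ⟩, σ²Σ) stands in for the exact posterior, and Γ is evaluated at Y_⟨φ⟩ rather than F⁰_⟨φ⟩ for convenience.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k_n, sigma = 200, 10, 0.02      # small sigma keeps the 2nd-order term small

Phi = rng.standard_normal((n, k_n))
Sigma = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T
F0 = rng.standard_normal(n)
Y = F0 + sigma * rng.standard_normal(n)
Y_proj = Sigma @ Y

# Nonlinear functional G(F) = ||F||^2: Jacobian at F is 2 F^T, so
# Gamma_F = 4 sigma^2 F^T Sigma F (a 1x1 matrix here).
gamma = 4.0 * sigma**2 * (Y_proj @ Sigma @ Y_proj)

# Draw F from the limiting posterior N(Y_proj, sigma^2 Sigma) and standardize
# G(F) - G(Y_proj) by sqrt(gamma).
Z = rng.standard_normal((40000, n))
F_post = Y_proj + sigma * Z @ Sigma      # rows ~ N(Y_proj, sigma^2 Sigma)
stat = (np.sum(F_post**2, axis=1) - Y_proj @ Y_proj) / np.sqrt(gamma)
print(stat.mean(), stat.std())           # approximately 0 and 1
```

The residual bias of the mean, of order σ k_n / (2‖Y_⟨φ⟩‖), is the second-order Taylor term that condition (12) controls.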
5. Applications.
Here we give the three applications described in Section 2. The models studied and the collections of regressors used have already been defined there.

5.1.
The Gaussian sequence model.
We consider the model (2). Here the MLE is the projection θ_Y = (Y_j)_{1 ≤ j ≤ k_n}.

The nonparametric case corresponds to the estimation of θ. Under the assumption that θ⁰ is in some regularity class, we will obtain a Bernstein–von Mises theorem with the posterior convergence rate already obtained in previous works, in particular, Ghosal and van der Vaart (2007). On the other hand, for some priors known to achieve this rate, it will be seen that the centering point and the asymptotic variance of the posterior distribution do not fit with the ones expected in a Bernstein–von Mises theorem. We also look at the semiparametric estimation of the squared ℓ² norm of θ.

5.1.1. The nonparametric estimation of θ.

Proposition 1.
Suppose that ∑_{j=1}^{k_n} (θ_j⁰)² is bounded. This holds when θ⁰ is an element of ℓ²(N) not depending on n. With a prior W̃ = N(0, τ_n² I_{k_n}) such that n^{-1/4} = o(τ_n), we have, for any sequence k_n ≤ n,

E ‖W̃(dθ | Y) − N(θ_Y, n^{-1} I_{k_n})‖_TV → 0   as n → ∞,

and the convergence rate of θ toward θ⁰_⟨φ⟩ is √(k_n/n): for every λ_n → ∞,

E[ W̃( ‖θ − θ⁰_⟨φ⟩‖ ≥ λ_n √(k_n/n) | Y ) ] → 0.

Recall that θ⁰_⟨φ⟩ = (θ_j⁰)_{1 ≤ j ≤ k_n} is the projection of θ⁰.

Proof of Proposition 1.
The beginning is an immediate corollary of Theorem 1. For the convergence rate, let λ_n → ∞. Since θ_Y − θ⁰_⟨φ⟩ ∼ N(0, n^{-1} I_{k_n}),

P( ‖θ_Y − θ⁰_⟨φ⟩‖ ≥ λ_n √(k_n/n) ) → 0.

In the same way,

E[ W̃( ‖θ − θ_Y‖ ≥ λ_n √(k_n/n) | Y ) ] ≤ E ‖W̃(dθ | Y) − N(θ_Y, n^{-1} I_{k_n})‖_TV + N(0, n^{-1} I_{k_n})( { h : ‖h‖ ≥ λ_n √(k_n/n) } ),

which goes to 0. Combining the two bounds (and replacing λ_n by λ_n/2), we get E[ W̃( ‖θ − θ⁰_⟨φ⟩‖ ≥ λ_n √(k_n/n) | Y ) ] → 0. □
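Since the coordinates are independent Gaussians, the one-dimensional posterior in the sequence model is explicit, and the total-variation convergence behind Proposition 1 can be probed numerically. In the following sketch (the grid-based TV routine and all numeric choices are assumptions of convenience), a N(0, τ²) prior on one coordinate yields the posterior N(cY_j, c/n) with shrinkage c = τ²/(n^{-1} + τ²); the TV distance to the Bernstein–von Mises limit N(Y_j, 1/n) vanishes only when c → 1.

```python
import numpy as np

def tv_gauss(m1, v1, m2, v2, num=200001):
    """Grid approximation of the total-variation distance of two normals."""
    s = max(np.sqrt(v1), np.sqrt(v2))
    x = np.linspace(min(m1, m2) - 8 * s, max(m1, m2) + 8 * s, num)
    p = np.exp(-(x - m1) ** 2 / (2 * v1)) / np.sqrt(2 * np.pi * v1)
    q = np.exp(-(x - m2) ** 2 / (2 * v2)) / np.sqrt(2 * np.pi * v2)
    return 0.5 * np.sum(np.abs(p - q)) * (x[1] - x[0])

# One coordinate of the sequence model: noise variance sigma_n^2 = 1/n, prior
# variance tau^2, exact posterior N(c*Y_j, c/n) with c = tau^2/(1/n + tau^2).
n, Yj = 10**6, 0.5
tvs = []
for tau2 in [1.0 / n, 1.0 / np.sqrt(n), 1.0]:   # flatter and flatter priors
    c = tau2 / (1.0 / n + tau2)
    tvs.append(tv_gauss(c * Yj, c / n, Yj, 1.0 / n))
print(tvs)   # decreases toward 0 as the prior flattens
```

A shrinkage factor c that stays at a fixed constant below 1, as for the high-frequency coordinates of the truncated priors discussed next, leaves the TV distance bounded away from 0.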
0, defined bythe relation P ∞ j =1 | θ j | j α < ∞ . In this setting we show that for some priorsthe induced posterior may achieve the nonparametric convergence rate butwith a centering point and a variance different from what is expected in theBernstein–von Mises theorem. Then we exhibit priors for which both theBernstein–von Mises theorem and the nonparametric convergence rate hold.From now on, we suppose that P ∞ j =1 | θ j | j α < ∞ . In this setting Ghosaland van der Vaart (2007), Section 7.6, consider a prior f W such that θ , θ , . . . are independent, and θ j is normally distributed with variance σ j,k n .Further, the variances are supposed to satisfy c/k n ≤ min { σ j,k n j α : 1 ≤ j ≤ k n } ≤ C/k n (15)for some positive constants c and C . Suppose that α ≥ / C and C such that C n / (1+2 α ) ≤ k n ≤ C n / (1+2 α ) . Then Ghosal ERNSTEIN–VON MISES THEOREMS FOR GAUSSIAN REGRESSION and van der Vaart (2007), Theorem 11, proved that the posterior convergesat the rate n − α/ (1+2 α ) .In order to get n − I k n as asymptotic variance, we need more stringentconditions on k n , or a flatter prior. To see this is necessary, consider, for k n ≈ n / (1+2 α ) , the following choice for σ j,k n : σ j,k n = (cid:26) k − n , if 1 ≤ j ≤ k n / α /n, if j > k n / { σ j,k n j α : 1 ≤ j ≤ k n } ≈ k − n , and the posterior converges at therate n − α/ (1+2 α ) .For this case we can explicitly calculate the posterior distribution. This issimilar to the calculation made in the proof of Theorem 1. The coordinatesare independent, and f W ( dθ j | Y ) = N (cid:18) σ j,k n σ n + σ j,k n Y j , σ n σ j,k n σ n + σ j,k n (cid:19) . For j > k n / σ j,kn σ n + σ j,kn = α α , and, therefore, k f W ( dθ j | Y ) − N ( Y j , σ n ) k TV isbounded away from 0.By contrast, with an isotropic and flat prior we obtain the centering pointand the asymptotic variance we expected, and the same convergence rate aspreviously. We have the following: Proposition 2.
Suppose that θ⁰ belongs to the Sobolev class of regularity α > 0. Choose a prior W̃ = N(0, τ_n² I_{k_n}) such that n^{-1/4} = o(τ_n), which ensures the asymptotic normality of the posterior distribution as in Proposition 1. If further k_n ≈ n^{1/(1+2α)}, then the convergence rate of θ toward θ⁰_⟨φ⟩ and toward θ⁰ is n^{-α/(1+2α)}: for every λ_n → ∞,

E[ W̃( ‖θ − θ⁰‖ ≥ λ_n n^{-α/(1+2α)} | Y ) ] → 0.

Proof.
We consider θ⁰_⟨φ⟩ and θ as elements of ℓ²(N) by setting the coordinates of index j ≥ k_n + 1 to 0. The convergence rate toward θ⁰_⟨φ⟩ has already been established in Proposition 1. Since θ⁰_⟨φ⟩ and θ⁰ share the coordinates 1 ≤ j ≤ k_n,

‖θ⁰_⟨φ⟩ − θ⁰‖ ≤ k_n^{-α} √( ∑_{j=k_n+1}^∞ (θ_j⁰)² j^{2α} ) = O(k_n^{-α}).

Therefore, the convergence rate of θ toward θ⁰ is also n^{-α/(1+2α)}. □

5.1.2. Semiparametric theorem for the ℓ² norm of θ. We consider the prior distribution used in Proposition 2, but now we look at the posterior distribution of ‖θ‖². To get asymptotic normality at the rate n^{-1/2}, we just need k_n = o(√n). To control the bias term, we need α > 1/2, and in this case we get an adaptive Bayesian estimator.
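A Monte Carlo sketch of this semiparametric phenomenon (purely illustrative: the truth θ⁰, the sizes, and the use of the limiting posterior N(θ_Y, n^{-1} I_{k_n}) in place of the exact posterior are assumptions of convenience): with k_n well below √n, the posterior draws of √n(‖θ‖² − ‖θ_Y‖²)/(2‖θ⁰‖) should look approximately standard normal.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k_n, n_draws = 10**6, 100, 50000     # k_n well below sqrt(n) = 1000

theta0 = 1.0 / np.arange(1, k_n + 1) ** 1.5   # a Sobolev-type truth
theta_Y = theta0 + rng.standard_normal(k_n) / np.sqrt(n)   # MLE, one data set

# Posterior draws from the limiting normal N(theta_Y, n^{-1} I_{k_n}).
draws = theta_Y + rng.standard_normal((n_draws, k_n)) / np.sqrt(n)

stat = np.sqrt(n) * (np.sum(draws**2, axis=1) - np.sum(theta_Y**2)) \
       / (2.0 * np.linalg.norm(theta0))
# Close to 0 and 1; a small positive bias of order k_n/(2 sqrt(n) ||theta0||)
# remains, which is why k_n = o(sqrt(n)) is needed.
print(stat.mean(), stat.std())
```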
Proposition 3.
Let α > 1/2 and suppose that θ⁰ belongs to the Sobolev class of regularity α. Choose a prior W̃ = N(0, τ_n² I_{k_n}) such that n^{-1/4} = o(τ_n). Then, for any choice of k_n such that k_n = o(√n) and √n = o(k_n^{2α}),

E[ sup_{I ∈ I} | W̃( √n (‖θ‖² − ‖θ_Y‖²) / (2‖θ⁰‖) ∈ I | Y ) − ψ(I) | ] → 0   as n → ∞,

and √n (‖θ_Y‖² − ‖θ⁰_⟨φ⟩‖²) / (2‖θ⁰‖) → N(0, 1) in distribution, as n → ∞. Further, the bias is negligible with respect to the square root of the variance:

√n (‖θ⁰_⟨φ⟩‖² − ‖θ⁰‖²) / (2‖θ⁰‖) = o(1).

In particular, the choice k_n = ⌊√(n/ln n)⌋ is adaptive in α.

Proof.
We set up an application of Theorem 3. Since σ_n = n^{-1/2}, the conditions of Theorem 1 are fulfilled. Here G(θ) = θ^T θ, Ġ_θ = 2θ^T and G̈_θ = 2I_{k_n}. Therefore, B_θ(M_n) = 2M_n/n, while Γ_θ = 4‖θ‖²/n.

Let us choose (M_n)_{n ≥ 1} such that k_n = o(M_n) and M_n = o(√n). Such sequences exist and fulfill the conditions of Theorem 3. Since ‖θ⁰_⟨φ⟩‖ → ‖θ⁰‖, we can substitute 4‖θ⁰‖²/n for the variance Γ_{θ⁰_⟨φ⟩}, and we get the two asymptotic normality results, (13) and (14).

As n → ∞, ‖θ⁰‖² − ‖θ⁰_⟨φ⟩‖² = ‖θ⁰ − θ⁰_⟨φ⟩‖² = O(k_n^{-2α}), as in the proof of Proposition 2. If √n = o(k_n^{2α}), we get √n (‖θ⁰‖² − ‖θ⁰_⟨φ⟩‖²) = o(1). □

5.2. Regression on Fourier’s basis.
Now we consider the regression model (3) with a function f in a Sobolev class W(α, L), and use Fourier's basis (4). For any θ ∈ ℝ^{k_n}, we define f_θ = Σ_{j=1}^{k_n} θ_j φ_j. We also denote by θ^f ∈ ℓ²(ℕ) the sequence of Fourier coefficients of f: f = Σ_{j=1}^∞ θ^f_j φ_j.

The following useful lemma about our collection of regressors can be found, for instance, in Tsybakov (2004) (we slightly modified it to take into account the case n even):

Lemma 1.
Suppose either that n is odd and k_n ≤ n, or that n is even and k_n ≤ n − 1. Consider the collection (φ_j)_{1≤j≤k_n} defined before, and Φ the associated matrix. Then Φ^T Φ = n I_{k_n}.

This makes the regression on Fourier's basis very close to the Gaussian sequence model, and the results we obtain are similar. In this subsection we first consider the estimation of f in a Sobolev class, for which we get a Bernstein–von Mises theorem and the frequentist minimax n^{−α/(1+2α)} posterior convergence rate for the L² norm. Then we consider two semiparametric settings: the estimation of a linear functional of f, and the estimation of the L² norm of f. We get the adaptive √n convergence rate for any α > 1/2.

Nonparametric Bernstein–von Mises theorem in Sobolev classes.
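The discrete orthogonality of Lemma 1 above is easy to verify numerically. The sketch below is our own illustration, not part of the paper; the basis convention it assumes (φ₁ = 1, φ_{2j} = √2 cos(2πjx), φ_{2j+1} = √2 sin(2πjx), evaluated at the regular design x_i = i/n) is the standard trigonometric one.

```python
import numpy as np

def trig_design(n, k):
    """Trigonometric design matrix at x_i = i/n, i = 1..n:
    phi_1 = 1, phi_{2j} = sqrt(2) cos(2 pi j x), phi_{2j+1} = sqrt(2) sin(2 pi j x)."""
    x = np.arange(1, n + 1) / n
    cols = [np.ones(n)]
    j = 1
    while len(cols) < k:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * j * x))
        if len(cols) < k:
            cols.append(np.sqrt(2) * np.sin(2 * np.pi * j * x))
        j += 1
    return np.column_stack(cols)

# n odd, k_n <= n: discrete orthogonality is exact
Phi = trig_design(11, 11)
assert np.allclose(Phi.T @ Phi, 11 * np.eye(11))

# n even, k_n <= n - 1
Phi = trig_design(10, 9)
assert np.allclose(Phi.T @ Phi, 10 * np.eye(9))
```

Both parity cases of the lemma hold exactly, which is what makes this regression model equivalent to a Gaussian sequence model after rescaling.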
Proposition 4.
Suppose that f belongs to some Sobolev class W(α, L) for L > 0 and α > 1/2. Let k_n ≈ n^{1/(1+2α)} and f_W = N(0, γ_n I_{k_n}) be the prior on θ, for a sequence (γ_n)_{n≥1} such that 1/√n = o(γ_n). Then

E ‖ f_W(dθ | Y) − N( θ_Y, (σ²/n) I_{k_n} ) ‖_TV → 0

as n → ∞, and the convergence rate relative to the Euclidean norm for f_θ is n^{−α/(1+2α)}: for every λ_n → ∞,

E[ f_W( ‖f_θ − f‖ ≥ λ_n n^{−α/(1+2α)} | Y ) ] → 0.

Proof.
The conditions of Theorem 1 are fulfilled: with τ_n = nγ_n, we have √n = o(τ_n). The first assertion follows.

Because of the orthogonal nature of Fourier's basis, ‖f_θ − f‖₂ = ‖θ − θ^f‖ in ℓ²(ℕ). We use the decomposition ‖θ − θ^f‖ ≤ ‖θ − θ̄‖ + ‖θ̄ − θ^f‖, where θ̄ is the truncation of θ^f to its first k_n coordinates. In the same way as in the proof of Proposition 1, for any λ_n → ∞,

E[ f_W( ‖θ − θ̄‖ ≥ λ_n √(k_n/n) | Y ) ] → 0.

Going back to Definition 1, we have

‖θ̄ − θ^f‖² = Σ_{j=k_n+1}^∞ (θ^f_j)² ≤ k_n^{−2α} Σ_{j=k_n+1}^∞ a_j^{2α} (θ^f_j)² = O(k_n^{−2α}).

This permits us to get E[ f_W( ‖θ − θ^f‖ ≥ λ_n n^{−α/(1+2α)} | Y ) ] → 0. □

Linear functionals of f. Let g : [0,1] → ℝ be a function in L²([0,1]). We consider the functional F(f) = ∫ fg, and we approximate it by

(1/n) Σ_{i=1}^n g(i/n) f(i/n) = G F,

where G = ( g(i/n)/n )^T_{1≤i≤n}. The plug-in MLE estimator of GF in the misspecified model ⟨φ⟩ is G Y_{⟨φ⟩}. More generally, we consider the functional F ↦ GF. The following result is adaptive, in the sense that the same choice k_n = ⌊n/ln n⌋ entails the convergence rate n^{−1/2} for all values of α > 1/2.
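The approximation GF ≈ F(f) above is a plain Riemann sum. A minimal sketch with a hypothetical test pair (our own choice, not from the paper: f = g = identity, so ∫ fg = 1/3) shows the O(1/n) discretization error:

```python
import numpy as np

# Hypothetical smooth test case: f(x) = g(x) = x, so F(f) = int_0^1 f g = 1/3.
n = 1000
x = np.arange(1, n + 1) / n
F_vec = x.copy()            # F = (f(i/n))_{1<=i<=n}
G_vec = x / n               # G = (g(i/n)/n)_{1<=i<=n}
plug_in = G_vec @ F_vec     # G F = (1/n) sum_i g(i/n) f(i/n)

# Riemann-sum error is O(1/n) for this smooth integrand
assert abs(plug_in - 1/3) < 1/n
```

Lemma 3 below quantifies this error for Sobolev-smooth f and g.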
Proposition 5.
Suppose f is bounded, and let W be the prior induced by the N(0, γ_n I_{k_n}) distribution on θ, for a sequence (γ_n)_{n≥1} such that 1/√n = o(γ_n). Then:

(1) E ‖ W( d(GF) | Y ) − N( G Y_{⟨φ⟩}, σ² G Σ G^T ) ‖_TV → 0, and the distribution of G Y_{⟨φ⟩} is N( G F_{⟨φ⟩}, σ² G Σ G^T ).

(2) Suppose further that f and g belong to some Sobolev class W(α, L) for L > 0 and α > 1/2. Then G Σ G^T ∼ (1/n) ∫ g²,

E ‖ W( d[ √n (GF − G Y_{⟨φ⟩}) / (σ √(∫g²)) ] | Y ) − N(0, 1) ‖_TV → 0

and √n (G Y_{⟨φ⟩} − G F_{⟨φ⟩}) / (σ √(∫g²)) → N(0, 1) in distribution, as n → ∞.

(3) Suppose that f and g belong to some Sobolev class W(α, L) for L > 0 and α > 1/2, and suppose further that k_n is large enough so that √n = o(k_n^α). Then the bias is negligible with respect to the square root of the variance:

√n (G F_{⟨φ⟩} − F(f)) / (σ √(∫g²)) = o(1).

Before the proof we give two lemmas, proved in Appendix B in the supplemental article [Bontemps (2011)], about the error terms of the approximation of a Sobolev class by a sieve built on Fourier's basis, and of the approximation of an integral by a Riemann sum.
Lemma 2.
Let α > 1/2 and L > 0. We suppose n odd, or k_n < n. If f ∈ W(α, L),

‖F − F_{⟨φ⟩}‖ ≤ (1 + o(1)) (√L / π^α) (√n / k_n^α).

Further, ‖F‖ ∼ √( n ∫ f² ) and ‖F − F_{⟨φ⟩}‖ = O( k_n^{−α} ‖F‖ ).

Lemma 3.
Let f ∈ W(α, L) and g ∈ W(α′, L′) for some α, α′ > 1/2 and two positive numbers L and L′. Then

| (1/n) Σ_{i=1}^n f(i/n) g(i/n) − ∫ f g | = O( n^{−inf(α,α′)} ).

Proof of Proposition 5. (1) The first assertion is just Corollary 1. The conditions of Theorem 1 are fulfilled, as in the proof of Proposition 4.
(2) If g ∈ W(α, L) for L > 0 and α > 1/2, G Σ G^T = ‖Σ G^T‖² ∼ ‖G^T‖² by Lemma 2. In the meantime ‖G^T‖² = (1/n²) Σ_{i=1}^n g(i/n)² ∼ (1/n) ∫ g² by Lemma 3. So G Σ G^T ∼ (1/n) ∫ g², and the variance in the formulas of Corollary 1 can be substituted with (σ²/n) ∫ g².

(3) We decompose the bias into two terms, |GF − F(f)| and |G F_{⟨φ⟩} − GF|, and show that both are o(n^{−1/2}). The first term is controlled by Lemma 3. For the last one, |G F_{⟨φ⟩} − GF| ≤ ‖G^T‖ ‖F_{⟨φ⟩} − F‖. But ‖G^T‖ = O(n^{−1/2}), ‖F_{⟨φ⟩} − F‖ = O( k_n^{−α} ‖F‖ ) by Lemma 2 and ‖F‖ = O(√n). We conclude thanks to the assumption √n = o(k_n^α). □

L² norm of f. Suppose that we want to estimate F(f) = ∫ f². We can consider the plug-in MLE estimator

G(Y_{⟨φ⟩}) = (1/n) ‖Y_{⟨φ⟩}‖² = (1/n) Σ_{i=1}^n ( Σ_{j=1}^{k_n} θ_{Y,j} φ_j(i/n) )².

More generally, we define, for any F ∈ ℝ^n,

G(F) = (1/n) ‖F‖². (16)

With a Gaussian prior, we obtain the following result, which is also adaptive: the same k_n = ⌊√n / ln n⌋ is suitable whatever α > 1/2.

Proposition 6.
Let G(F) = ‖F‖²/n. Suppose that f ∈ W(α, L) for some L > 0 and α > 1/2. Let W be the prior induced by the N(0, γ_n I_{k_n}) distribution on θ, for a sequence (γ_n)_{n≥1} such that 1/√n = o(γ_n). The sequence (k_n)_{n≥1} can be chosen such that k_n = o(√n) and √n = o(k_n^{2α}), and with such a choice,

E[ sup_{I∈𝕀} | W( √n (G(F) − G(Y_{⟨φ⟩})) / (2σ √F(f)) ∈ I | Y ) − ψ(I) | ] → 0

as n → ∞, and √n (G(Y_{⟨φ⟩}) − G(F_{⟨φ⟩})) / (2σ √F(f)) → N(0, 1) in distribution, as n → ∞. Further, the bias is negligible with respect to the square root of the variance:

√n (G(F_{⟨φ⟩}) − F(f)) / (2σ √F(f)) = o(1).

A similar corollary could be stated for a non-Gaussian prior.
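The plug-in quantity (16) can be illustrated directly. In the hypothetical example below (our own choice: f(x) = sin(2πx), for which ∫ f² = 1/2), the discrete quantity G(F) matches the integral exactly at the design points i/n, thanks to the same trigonometric identities that underlie Lemma 1:

```python
import numpy as np

# Hypothetical f(x) = sin(2 pi x): F(f) = int_0^1 f^2 = 1/2.
n = 50
x = np.arange(1, n + 1) / n
F_vec = np.sin(2 * np.pi * x)
G_of_F = np.dot(F_vec, F_vec) / n   # G(F) = ||F||^2 / n, display (16)
assert np.isclose(G_of_F, 0.5)
```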
Proof of Proposition 6.
First, let us note that the conditions of Theorem 1 are fulfilled, as in the proof of Proposition 4. Lemma 10 in Appendix B ensures that f is bounded.
In this setting Ġ_F = (2/n) F^T and D²_F G(h, h) = (2/n) ‖h‖² for any F ∈ ℝ^n and any h ∈ ℝ^n. Therefore B_F(a) = 2σ² a / n, and Γ_F = 4 (σ²/n²) ‖F‖². By Lemma 2, ‖F_{⟨φ⟩}‖² ∼ ‖F‖² ∼ n F(f). Thus Γ_{F_{⟨φ⟩}} = 4 (1 + o(1)) σ² F(f) / n.

Let us choose (M_n)_{n≥1} such that k_n = o(M_n) and M_n = o(√n). Such sequences exist and fulfill the conditions of Theorem 3. We can substitute the variance Γ_{F_{⟨φ⟩}} by 4σ² F(f)/n and get the two asymptotic normality results.

Let us now consider the bias term:

| F(f) − G(F_{⟨φ⟩}) | ≤ ( ‖F‖² − ‖F_{⟨φ⟩}‖² )/n + | ∫ f² − (1/n) Σ_{i=1}^n f(i/n)² |.

We use Lemma 2 to control ‖F‖² − ‖F_{⟨φ⟩}‖², and Lemma 3 for the other term:

| F(f) − G(F_{⟨φ⟩}) | = O(k_n^{−2α}) + O(n^{−α}).

This is o(1/√n) under the assumptions of Proposition 6. □

Regression on splines.
Here we consider the regression model for functions in C^α[0,1] with α > 0, using splines, set up in Section 2. We first develop further the framework and the assumptions used here, and recall the previous result of Ghosal and van der Vaart (2007), Section 7.7.1, which obtains the posterior concentration at the frequentist minimax rate. Then we present two Bernstein–von Mises theorems: the first one with the same prior as Ghosal and van der Vaart (2007) but a stronger condition on k_n (or, equivalently, on α); the second one with a flatter prior, for which we obtain the minimax convergence rate in addition to the asymptotic Gaussianity of the posterior distribution.

To see this, we begin with some preliminaries. For any θ ∈ ℝ^{k_n}, define f_θ = Σ_{j=1}^{k_n} θ_j B_j. The B-splines basis has the following approximation property: for any α > 0, there exists C_α > 0 such that, for every f ∈ C^α[0,1], there is a θ^∞ ∈ ℝ^{k_n} satisfying

‖f − f_{θ^∞}‖_∞ ≤ C_α k_n^{−α} ‖f‖_α. (17)

We need the design (x_i^{(n)})_{n≥1, 1≤i≤n} to be sufficiently regular and, as stressed in Ghosal and van der Vaart (2007), the spatial separation property of B-splines permits us to express the precise condition in terms of the covariance matrix Φ^T Φ. We suppose that there exist positive constants C₁ and C₂ such that, as n increases, for any θ ∈ ℝ^{k_n},

C₁ (n/k_n) ‖θ‖² ≤ θ^T Φ^T Φ θ ≤ C₂ (n/k_n) ‖θ‖². (18)

Let us associate with the design the norm ‖f‖_n = √( (1/n) Σ_{i=1}^n |f(x_i)|² ). Note that √n ‖f_θ‖_n = ‖Φθ‖ if θ ∈ ℝ^{k_n}. Under (18) we have a relation between ‖·‖_n and the Euclidean norm on the parameter space: for every θ₁ and θ₂,

√C₁ ‖θ₁ − θ₂‖ ≤ √k_n ‖f_{θ₁} − f_{θ₂}‖_n ≤ √C₂ ‖θ₁ − θ₂‖.

With these conditions Ghosal and van der Vaart (2007), Theorem 12, get the posterior concentration at the minimax rate. Take α ≥ 1/2, let f_W = N(0, I_{k_n}) be the prior on the spline coefficients, and suppose there exist constants C₃ and C₄ such that C₃ n^{1/(1+2α)} ≤ k_n ≤ C₄ n^{1/(1+2α)}. Then the posterior concentrates at the minimax rate n^{−α/(1+2α)} relative to ‖·‖_n: for every λ_n → ∞,

E[ f_W( ‖f_θ − f‖_n ≥ λ_n n^{−α/(1+2α)} | Y ) ] → 0.

This is equivalent to a convergence rate n^{(1−2α)/(2(1+2α))} relative to the Euclidean norm for θ:

E[ f_W( ‖θ − θ^∞‖ ≥ λ_n n^{(1−2α)/(2(1+2α))} | Y ) ] → 0.

Indeed, (17) and the projection property imply, with θ̄ the minimizer of ‖f_θ − f‖_n,

‖f_θ̄ − f‖_n ≤ ‖f_{θ^∞} − f‖_n ≤ ‖f_{θ^∞} − f‖_∞ ≤ C_α ‖f‖_α k_n^{−α}.

Now, with modified assumptions we get the Bernstein–von Mises theorem in two different settings. First, with the same prior as Ghosal and van der Vaart (2007):
Proposition 7.
Assume that f is bounded, k_n = o( (n/ln n)^{1/3} ) and (18) holds. Let f_W = N(0, I_{k_n}) be the prior on the spline coefficients. Then

E ‖ f_W(dθ | Y) − N( θ_Y, σ² (Φ^T Φ)^{−1} ) ‖_TV → 0 as n → ∞, (19)

and the convergence rate relative to the Euclidean norm for θ is k_n/√n.

Remarks. We need α > 1 to combine this with constants C₃ and C₄ such that C₃ n^{1/(1+2α)} ≤ k_n ≤ C₄ n^{1/(1+2α)}. In this case the convergence rate for θ is n^{(1−2α)/(2(1+2α))}.

Proof of Proposition 7.
We set up an application of Theorem 2. We can choose M_n such that k_n ln n = o(M_n) and M_n = o(n/k_n²). Assumption (2) is then trivially satisfied.

From (18) we get ‖Φ^T Φ‖ ≤ C₂ n/k_n and ‖(Φ^T Φ)^{−1}‖ ≤ C₁^{−1} k_n/n. We have also ln det(Φ^T Φ) ≤ k_n ln C₂ + k_n ln(n/k_n) = O(k_n ln n) = o(M_n). Since θ̄ = (Φ^T Φ)^{−1} Φ^T F,

‖θ̄‖² ≤ (k_n/(C₁ n)) ‖F‖² ≤ ‖f‖²_∞ k_n / C₁.

Therefore −ln w(θ̄) = O(k_n) + ‖θ̄‖²/2 = O(k_n) = o(M_n), and assumption (3) holds.
Let h ∈ ℝ^{k_n} be such that ‖Φh‖ ≤ σ√M_n. We have ‖h‖² ≤ ‖(Φ^T Φ)^{−1}‖ ‖Φh‖² ≤ σ² k_n M_n / (C₁ n) = o(k_n^{−1}). Therefore,

sup_{‖Φh‖ ≤ σ√M_n} | ln( w(θ̄+h) / w(θ̄) ) | ≤ sup_{‖Φh‖ ≤ σ√M_n} ( ‖h‖² + 2‖h‖ ‖θ̄‖ ) / 2 = o(1), (20)

and assumption (1) follows.

Let us now prove the convergence rate. Let λ_n → ∞. Then

P( ‖θ_Y − θ̄‖ ≥ λ_n k_n/√n ) ≤ P( ‖Φ(θ_Y − θ̄)‖ ≥ √C₁ λ_n k_n ) → 0,

since ‖Φ(θ_Y − θ̄)‖² follows the σ² χ²(k_n) distribution. In the same way,

E[ f_W( ‖θ − θ_Y‖ ≥ λ_n k_n/√n | Y ) ] ≤ E ‖ f_W(dθ|Y) − N(θ_Y, σ²(Φ^TΦ)^{−1}) ‖_TV + N(0, σ²(Φ^TΦ)^{−1})( { h : ‖h‖ ≥ λ_n k_n/√n } ) → 0,

where Theorem 2 controls the first term on the right. Therefore, for every λ_n → ∞,

E[ f_W( ‖θ − θ̄‖ ≥ λ_n k_n/√n | Y ) ] → 0.

Now, (19) is the same as Theorem 2 in terms of f_W. □

The situation is similar to the one we encountered with the Gaussian sequence model. To get the Bernstein–von Mises theorem with the same convergence rate as Ghosal and van der Vaart (2007) for α ≤ 1, we need a flatter prior:
Proposition 8.
Assume that f is bounded and (18) holds. Let f_W = N(0, τ_n I_{k_n}) be the prior on the spline coefficients, with the sequence τ_n satisfying

k_n² ln n / n = o(τ_n) and k_n³ ln n / n = o(τ_n²).

Then

E ‖ f_W(dθ | Y) − N( θ_Y, σ² (Φ^T Φ)^{−1} ) ‖_TV → 0 as n → ∞,

and the convergence rate relative to the Euclidean norm for θ is k_n/√n.

When α ≥ 1/2 and k_n is of order n^{1/(1+2α)}, the conditions reduce to n^{(2−2α)/(1+2α)} ln n = o(τ_n²). So we obtain the convergence rate of Ghosal and van der Vaart (2007) in addition to the Gaussian shape with the same k_n, even for α ≤ 1, but with a different prior.
Proof of Proposition 8.
The proof is essentially the same as for Proposition 7. M_n can be chosen so that k_n ln n = o(M_n), M_n = o(n τ_n / k_n) and M_n = o(n τ_n² / k_n²). These last two conditions are the ones needed to obtain the same upper bounds as in (20). □
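Condition (18) can be explored numerically. The sketch below is our own minimal implementation of the Cox-de Boor recursion with equispaced clamped knots (an illustrative convention; the paper does not specify the knot sequence), and it checks that the eigenvalues of Φ^T Φ for a regular design are of order n/k_n, as (18) requires:

```python
import numpy as np

def bspline_design(x, k, degree=3):
    """Design matrix of k B-splines on [0, 1] (Cox-de Boor recursion),
    with equispaced clamped knots -- an illustrative convention only."""
    n_inner = k - degree - 1                      # number of interior knots
    t = np.concatenate([np.zeros(degree + 1),
                        np.linspace(0, 1, n_inner + 2)[1:-1],
                        np.ones(degree + 1)])
    # order-1 (piecewise constant) B-splines
    B = np.zeros((x.size, len(t) - 1))
    for j in range(len(t) - 1):
        if t[j] < t[j + 1]:
            B[:, j] = (t[j] <= x) & (x < t[j + 1])
    B[x == 1.0, len(t) - degree - 2] = 1.0        # close the last interval
    # raise the degree
    for d in range(1, degree + 1):
        Bn = np.zeros((x.size, len(t) - 1 - d))
        for j in range(len(t) - 1 - d):
            term = np.zeros(x.size)
            if t[j + d] > t[j]:
                term += (x - t[j]) / (t[j + d] - t[j]) * B[:, j]
            if t[j + d + 1] > t[j + 1]:
                term += (t[j + d + 1] - x) / (t[j + d + 1] - t[j + 1]) * B[:, j + 1]
            Bn[:, j] = term
        B = Bn
    return B

n, k = 2000, 20
x = np.arange(1, n + 1) / n
Phi = bspline_design(x, k)
assert np.allclose(Phi.sum(axis=1), 1.0)          # partition of unity
evals = np.linalg.eigvalsh(Phi.T @ Phi)
# (18): the spectrum of Phi^T Phi is of order n / k_n
assert evals.min() > 1e-3 * n / k
assert evals.max() < 10 * n / k
```

The partition-of-unity check validates the recursion itself; the spectral bounds illustrate, for one configuration, the two-sided comparison in (18).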
6. Proofs.
6.1. Proof of Theorem 1.
In the present setting all distributions are explicit and admit known densities with respect to the corresponding Lebesgue measure. We decompose any y ∈ ℝ^n into two orthogonal components y = Φθ_y + y′, with Φ^T y′ = 0. Then

dP_θ(y) = c₁ exp{ −(1/(2σ_n²)) ( ‖Φθ‖² + ‖Φθ_y‖² + ‖y′‖² − 2 θ^T Φ^T Φ θ_y ) },

df_W(θ) = c₂ exp{ −(1/(2τ_n)) ‖Φθ‖² },

dP_θ(y) df_W(θ) = c₁ c₂ exp{ −((σ_n² + τ_n)/(2σ_n² τ_n)) ‖ Φ( θ − (τ_n/(σ_n²+τ_n)) θ_y ) ‖² − (1/(2(σ_n²+τ_n))) ‖Φθ_y‖² − (1/(2σ_n²)) ‖y′‖² },

where c₁ = (2π)^{−n/2} σ_n^{−n} and c₂ = (2π)^{−k_n/2} τ_n^{−k_n/2} det(Φ^T Φ)^{1/2}.

Using Bayes' rule, we get the density of f_W(dθ | Y), in which we recognize the normal distribution

f_W(dθ | Y) = N( (τ_n/(σ_n²+τ_n)) θ_Y, (σ_n² τ_n/(σ_n²+τ_n)) (Φ^T Φ)^{−1} ). (21)

So we have an exact expression for f_W(dθ | Y), but the centering and the variance do not correspond to the limit distribution given in Theorem 1. Therefore, we make use of the triangle inequality, with intermediate distribution Q = N( (τ_n/(σ_n²+τ_n)) θ_Y, σ_n² (Φ^T Φ)^{−1} ):

‖ f_W(dθ | Y) − N(θ_Y, σ_n²(Φ^TΦ)^{−1}) ‖_TV ≤ ‖ f_W(dθ | Y) − Q ‖_TV + ‖ Q − N(θ_Y, σ_n²(Φ^TΦ)^{−1}) ‖_TV. (22)

We first deal with the change in the variance, that is, the first term on the right in (22). Let α_n² = (τ_n/σ_n²) ln(1 + σ_n²/τ_n), and let f and g be, respectively, the density functions of N(0, I_{k_n}) and N(0, (τ_n/(σ_n²+τ_n)) I_{k_n}). Let U be a random variable following the chi-square distribution with k_n degrees of freedom, χ²(k_n). Let √(Φ^T Φ) be a square root of the matrix Φ^T Φ.
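The closed form (21) can be cross-checked against the generic Gaussian conjugacy formulas. The sketch below (our own illustration, with arbitrary made-up Φ, Y and hyperparameters) compares the two computations of the posterior mean and covariance for the prior N(0, τ_n (Φ^T Φ)^{−1}):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 4
sigma2, tau = 1.5, 7.0                      # stand-ins for sigma_n^2 and tau_n
Phi = rng.standard_normal((n, k))
Y = rng.standard_normal(n)
G = Phi.T @ Phi

# generic conjugate-posterior formulas, prior N(0, tau (Phi^T Phi)^{-1})
post_cov = np.linalg.inv(G / sigma2 + G / tau)
post_mean = post_cov @ (Phi.T @ Y) / sigma2

# closed form (21)
theta_Y = np.linalg.solve(G, Phi.T @ Y)
assert np.allclose(post_mean, tau / (sigma2 + tau) * theta_Y)
assert np.allclose(post_cov, sigma2 * tau / (sigma2 + tau) * np.linalg.inv(G))
```

The agreement is an algebraic identity, independent of the particular Φ and Y drawn.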
The total variation norm is invariant under the bijective affine map θ ↦ (1/σ_n) √(Φ^T Φ) ( θ − (τ_n/(σ_n²+τ_n)) θ_Y ), so

‖ f_W(dθ | Y) − Q ‖_TV = ‖ N(0, I_{k_n}) − N(0, (τ_n/(σ_n²+τ_n)) I_{k_n}) ‖_TV = ∫_{ℝ^{k_n}} (g − f)_+ = ∫_{‖x‖ ≤ √k_n α_n} ( g(x) − f(x) ) d^{k_n}x
= P( U ≤ ((σ_n²+τ_n)/τ_n) k_n α_n² ) − P( U ≤ k_n α_n² )
= P( √k_n (α_n² − 1) ≤ (U − k_n)/√k_n ≤ √k_n ( ((σ_n²+τ_n)/τ_n) α_n² − 1 ) ).

As n goes to infinity, (U − k_n)/√k_n converges toward N(0, 1) in distribution. Using the Taylor expansion of ln, we find

α_n² = 1 − σ_n²/(2τ_n) + o(σ_n²/τ_n)

and, therefore,

√k_n (α_n² − 1) ∼ −(√k_n σ_n²)/(2τ_n), √k_n ( ((σ_n²+τ_n)/τ_n) α_n² − 1 ) ∼ (√k_n σ_n²)/(2τ_n).

Since k_n = o(τ_n²/σ_n⁴), both these quantities go to 0. As a consequence, ‖f_W(dθ|Y) − Q‖_TV goes to zero as n goes to infinity.

Let us now deal with the centering term, that is, the second term on the right in (22).

Lemma 4.
Let U be a standard normal random variable, let k ≥ 1 and let Z ∈ ℝ^k. Then

‖ N(0, I_k) − N(Z, I_k) ‖_TV = P( |U| ≤ ‖Z‖/2 ) ≤ ‖Z‖/√(2π).

Proof.
Let g be the density of N(0, I_k). Then

‖ N(0, I_k) − N(Z, I_k) ‖_TV = ∫_{ℝ^k} ( g(x) − g(x − Z) )_+ d^k x = ∫_{ {x^T Z ≤ ‖Z‖²/2} } ( g(x) − g(x − Z) ) d^k x
= P( U ≤ ‖Z‖/2 ) − P( U + ‖Z‖ ≤ ‖Z‖/2 ) ≤ ‖Z‖/√(2π).

The last line comes from the density of N(0, 1) being bounded by 1/√(2π). □

Using again the invariance of the total variation norm under the bijective affine map θ ↦ (1/σ_n) √(Φ^T Φ) ( θ − (τ_n/(σ_n²+τ_n)) θ_Y ),

‖ N(θ_Y, σ_n²(Φ^TΦ)^{−1}) − Q ‖_TV = ‖ N(0, I_{k_n}) − N( (σ_n/(τ_n+σ_n²)) √(Φ^TΦ) θ_Y, I_{k_n} ) ‖_TV
≤ (1/√(2π)) (σ_n/(τ_n+σ_n²)) ‖Φθ_Y‖ ≤ (1/√(2π)) (σ_n/(τ_n+σ_n²)) ( ‖F‖ + √(ε^T Σ ε) ).

ε^T Σ ε is a random variable following the σ_n² χ²(k_n) distribution. By Jensen's inequality, E[√(ε^T Σ ε)] ≤ √(E[ε^T Σ ε]) = σ_n √k_n. Therefore,

E ‖ N(θ_Y, σ_n²(Φ^TΦ)^{−1}) − Q ‖_TV ≤ (1/√(2π)) (σ_n/(τ_n+σ_n²)) ( ‖F‖ + σ_n √k_n ),

which goes to zero under the assumptions of Theorem 1.

To conclude the proof, note that we deduce the results on W(dF | Y) from the ones on f_W(dθ | Y) by the linear relation F = Φθ. □

6.2. Proof of Theorem 2.
We make the proof for f_W(dθ | Y); the result for W(dF | Y) is then immediate. Our method is adapted from Boucheron and Gassiat (2009).

To any probability measure P on ℝ^{k_n}, we associate the probability

P^{M} = P( · ∩ E_{θ̄,Φ}(M) ) / P( E_{θ̄,Φ}(M) ) (23)

with support in E_{θ̄,Φ}(M). It can easily be checked that

‖ P − P^{M} ‖_TV = P( E^c_{θ̄,Φ}(M) ). (24)

The proof is divided into three steps, based on the use of M_n as a threshold to truncate the probability distributions. Lemma 5 below controls E‖N(θ_Y, σ_n²(Φ^TΦ)^{−1}) − N^{M_n}(θ_Y, σ_n²(Φ^TΦ)^{−1})‖_TV, Lemma 6 controls E‖f_W^{M_n}(dθ|Y) − N^{M_n}(θ_Y, σ_n²(Φ^TΦ)^{−1})‖_TV and Proposition 9 controls E‖f_W(dθ|Y) − f_W^{M_n}(dθ|Y)‖_TV. Taken together, these results give Theorem 2.

Lemma 5. If 4k_n < M_n, then

E ‖ N(θ_Y, σ_n²(Φ^TΦ)^{−1}) − N^{M_n}(θ_Y, σ_n²(Φ^TΦ)^{−1}) ‖_TV ≤ 2 e^{−(√M_n/2 − √k_n)²/2}.

If k_n = o(M_n), for n large enough this bound can be replaced by e^{−M_n/16}.
Proof of Lemma 5.
To control this quantity, we consider two cases, depending on whether θ_Y is near or far from θ̄:

‖ N(θ_Y, σ_n²(Φ^TΦ)^{−1}) − N^{M_n}(θ_Y, σ_n²(Φ^TΦ)^{−1}) ‖_TV = N(θ_Y, σ_n²(Φ^TΦ)^{−1})( E^c_{θ̄,Φ}(M_n) ) (25)
≤ 1_{ (θ_Y−θ̄)^T Φ^TΦ (θ_Y−θ̄) > σ_n² M_n/4 } + N(θ̄, σ_n²(Φ^TΦ)^{−1})( E^c_{θ̄,Φ}(M_n/4) ).

Let U be a random variable following a χ²(k_n) distribution. Taking the expectation on both sides of (25) gives

E ‖ N(θ_Y, σ_n²(Φ^TΦ)^{−1}) − N^{M_n}(θ_Y, σ_n²(Φ^TΦ)^{−1}) ‖_TV ≤ 2 P( U > M_n/4 ).

Now, Cirelson's inequality [see, e.g., Massart (2007)]

P( √U > √k_n + √(2x) ) ≤ exp(−x) (26)

used with x = (√M_n/2 − √k_n)²/2 implies Lemma 5. □

Lemma 6. If

inf_{ ‖Φh‖ ≤ σ_n√M_n, ‖Φg‖ ≤ σ_n√M_n } w(θ̄+h)/w(θ̄+g) → 1 as n → ∞,

then

E ‖ f_W^{M_n}(dθ | Y) − N^{M_n}(θ_Y, σ_n²(Φ^TΦ)^{−1}) ‖_TV → 0 as n → ∞.

Proof.
Let us first note that, for every θ and τ in ℝ^{k_n} and every Y ∈ ℝ^n,

dP_θ(Y)/dP_τ(Y) = exp{ ( −‖Φθ‖² + ‖Φτ‖² − 2 Y^T Φ(τ − θ) ) / (2σ_n²) } (27)
= dN(θ_Y, σ_n²(Φ^TΦ)^{−1})(θ) / dN(θ_Y, σ_n²(Φ^TΦ)^{−1})(τ).

This comes directly from the expressions for the Gaussian densities.

In the following, the first lines are just rewriting. Then we use Jensen's inequality with the convex function x ↦ (1 − x)_+, and make use of (27). We abbreviate N^{M_n}(θ_Y, σ_n²(Φ^TΦ)^{−1}) into N^{M_n}:

‖ f_W^{M_n}(dθ|Y) − N^{M_n} ‖_TV = ∫ ( 1 − dN^{M_n}(θ) / df_W^{M_n}(θ|Y) )_+ df_W^{M_n}(θ|Y)
= ∫ ( 1 − [ dN^{M_n}(θ) / (w(θ) dP_θ(Y)) ] ∫ ( w(τ) dP_τ(Y) / dN^{M_n}(τ) ) dN^{M_n}(τ) )_+ df_W^{M_n}(θ|Y)
≤ ∫∫ ( 1 − w(τ) dN^{M_n}(θ) dP_τ(Y) / ( w(θ) dN^{M_n}(τ) dP_θ(Y) ) )_+ dN^{M_n}(τ) df_W^{M_n}(θ|Y)
= ∫∫ ( 1 − w(τ)/w(θ) )_+ dN^{M_n}(τ) df_W^{M_n}(θ|Y)
≤ 1 − inf_{ ‖Φh‖ ≤ σ_n√M_n, ‖Φg‖ ≤ σ_n√M_n } w(θ̄+h)/w(θ̄+g). □

Proposition 9 (Posterior concentration).
Suppose that conditions (1), (2) and (3) of Theorem 2 hold. Then

E ‖ f_W(dθ|Y) − f_W^{M_n}(dθ|Y) ‖_TV = E[ f_W( E^c_{θ̄,Φ}(M_n) | Y ) ] → 0 as n → ∞.

Proposition 9 is proved in Appendix A in the supplemental article [Bontemps (2011)]. However, we state here the following important lemma, because of its significance.
Lemma 7.
Let a ∈ ℝ^n be such that Φ^T a = 0. Then, for any y ∈ ℝ^n, W(dF | Y = y) = W(dF | Y = y + a).

Lemma 7 states that the distribution W(dF | Y) is invariant under any translation of Y orthogonal to ⟨φ⟩. Now, regard W(dF | Y) as a random variable. Then any statement on W(dF | Y) or f_W(dθ | Y) valid when Y ∼ N(F, σ_n² I_n) with F ∈ ⟨φ⟩ can be extended at zero cost, by Lemma 7, to the case F ∈ ℝ^n. For instance, proving Proposition 9 in the case F = Φθ̄ is enough.

6.3. Proof of Theorem 3.
We begin with (13). Consider the following Taylor expansion:

G(F) − G(Y_{⟨φ⟩}) = Ġ_{F_{⟨φ⟩}}( F − Y_{⟨φ⟩} )
+ ∫₀¹ (1−t) D²_{F_{⟨φ⟩}+t(F−F_{⟨φ⟩})} G( F−F_{⟨φ⟩}, F−F_{⟨φ⟩} ) dt
− ∫₀¹ (1−t) D²_{F_{⟨φ⟩}+t(Y_{⟨φ⟩}−F_{⟨φ⟩})} G( Y_{⟨φ⟩}−F_{⟨φ⟩}, Y_{⟨φ⟩}−F_{⟨φ⟩} ) dt,

using the Lagrange form of the error term. Suppose that F ∈ ⟨φ⟩, ‖F − F_{⟨φ⟩}‖ ≤ σ_n√M_n and ‖Y_{⟨φ⟩} − F_{⟨φ⟩}‖ ≤ σ_n√M_n. Then, for any b ∈ ℝ^p,

| b^T ( G(F) − G(Y_{⟨φ⟩}) − Ġ_{F_{⟨φ⟩}}(F − Y_{⟨φ⟩}) ) | ≤ ‖b‖ B_{F_{⟨φ⟩}}(M_n).
On the other hand, √(b^T Γ_{F_{⟨φ⟩}} b) ≥ ‖Γ_{F_{⟨φ⟩}}^{−1}‖^{−1/2} ‖b‖. Moreover,

‖ W( d[ b^T Ġ_{F_{⟨φ⟩}}(F − Y_{⟨φ⟩}) / √(b^T Γ_{F_{⟨φ⟩}} b) ] | Y ) − N(0, 1) ‖_TV ≤ ‖ W(dF | Y) − N(Y_{⟨φ⟩}, σ_n² Σ) ‖_TV.

Let η_n = √(‖Γ_{F_{⟨φ⟩}}^{−1}‖) B_{F_{⟨φ⟩}}(M_n), which tends to 0 by hypothesis. Let also I^{η_n} = { x ∈ ℝ : ∃ x′ ∈ I, |x − x′| ≤ η_n }. Note that ψ(I^{η_n}) ≤ ψ(I) + √(2/π) η_n. Gathering all this information, we can get the upper bound

W( b^T (G(F) − G(Y_{⟨φ⟩})) / √(b^T Γ_{F_{⟨φ⟩}} b) ∈ I | Y )
≤ W( b^T Ġ_{F_{⟨φ⟩}}(F − Y_{⟨φ⟩}) / √(b^T Γ_{F_{⟨φ⟩}} b) ∈ I^{η_n} | Y ) + 1_{ ‖Y_{⟨φ⟩} − F_{⟨φ⟩}‖ > σ_n√M_n } + W( ‖F − F_{⟨φ⟩}‖ > σ_n√M_n | Y )
≤ ψ(I) + √(2/π) η_n + ‖ W(dF|Y) − N(Y_{⟨φ⟩}, σ_n² Σ) ‖_TV + 1_{ ‖Y_{⟨φ⟩} − F_{⟨φ⟩}‖ > σ_n√M_n } + W( ‖F − F_{⟨φ⟩}‖ > σ_n√M_n | Y ).

A lower bound is obtained in the same way. Taking the expectation,

E | W( b^T (G(F) − G(Y_{⟨φ⟩})) / √(b^T Γ_{F_{⟨φ⟩}} b) ∈ I | Y ) − ψ(I) |
≤ o(1) + P( ‖Y_{⟨φ⟩} − F_{⟨φ⟩}‖ > σ_n√M_n ) + E[ W( ‖F − F_{⟨φ⟩}‖ > σ_n√M_n | Y ) ]. (28)

But ‖Y_{⟨φ⟩} − F_{⟨φ⟩}‖² follows the σ_n² χ²(k_n) distribution, and since k_n = o(M_n),

P( ‖Y_{⟨φ⟩} − F_{⟨φ⟩}‖ > σ_n√M_n ) = o(1).

To bound (28), we use the following:
Lemma 8.
Suppose that the conditions of either Theorem 1 or Theorem 2 are satisfied. Then

E[ W( ‖F − F_{⟨φ⟩}‖ > σ_n√M_n | Y ) ] → 0 as n → ∞.

Proof.
For smooth priors, this is an immediate corollary of Proposition 9. Let us suppose then that we are under the conditions of Theorem 1.

Let Z be a N(0, (σ_n²τ_n/(σ_n²+τ_n)) Σ) random vector in ℝ^n independent of Y, and U a random variable following χ²(k_n). From (21) we get

W(dF | Y) = N( (τ_n/(σ_n²+τ_n)) Y_{⟨φ⟩}, (σ_n²τ_n/(σ_n²+τ_n)) Σ ).

Therefore,

W( ‖F − F_{⟨φ⟩}‖ > σ_n√M_n | Y ) = P( ‖ Z + (τ_n/(σ_n²+τ_n)) Y_{⟨φ⟩} − F_{⟨φ⟩} ‖ > σ_n√M_n | Y )
≤ P( ‖Z‖ > σ_n√M_n − ‖ (τ_n/(σ_n²+τ_n)) Y_{⟨φ⟩} − F_{⟨φ⟩} ‖ )
≤ 1, if ‖ (τ_n/(σ_n²+τ_n)) Y_{⟨φ⟩} − F_{⟨φ⟩} ‖ > (2/3) σ_n√M_n,
≤ P( ‖Z‖ > σ_n√M_n/3 ) = P( U > ((σ_n²+τ_n)/τ_n) M_n/9 ), otherwise.

Since k_n = o(M_n), P(U > M_n/9) = o(1). On the other hand,

‖ (τ_n/(σ_n²+τ_n)) Y_{⟨φ⟩} − F_{⟨φ⟩} ‖ = ‖ Σ( (τ_n/(σ_n²+τ_n)) ε − (σ_n²/(σ_n²+τ_n)) F ) ‖ ≤ ‖Σε‖ + (σ_n²/(σ_n²+τ_n)) ‖F‖.

Since ‖F‖ = o(τ_n/σ_n), we have (σ_n²/(σ_n²+τ_n)) ‖F‖ = o(σ_n) < (1/3) σ_n√M_n for n large enough. ‖Σε‖² is a σ_n² χ²(k_n) variable, and since k_n = o(M_n), P( ‖Σε‖ > (1/3) σ_n√M_n ) ≤ P( U > M_n/9 ) = o(1). Therefore, for n large enough,

E[ W( ‖F − F_{⟨φ⟩}‖ > σ_n√M_n | Y ) ] ≤ 2 P( U > M_n/9 ) = o(1). □

Now, (28) gives (13). The proof of the frequentist assertion (14) is similar and delayed to Appendix C in the supplemental article [Bontemps (2011)].
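As a final sanity check of the Gaussian total-variation computations used throughout this section, the identity of Lemma 4 can be verified numerically in dimension one (our own illustration; z is an arbitrary test value for ‖Z‖, and the TV distance is computed by trapezoidal quadrature):

```python
import math
import numpy as np

z = 1.3                       # an arbitrary value of ||Z||, in dimension one
xs = np.linspace(-12, 12, 200001)
g = np.exp(-xs**2 / 2) / math.sqrt(2 * math.pi)
g_shift = np.exp(-(xs - z)**2 / 2) / math.sqrt(2 * math.pi)
diff = np.maximum(g - g_shift, 0.0)
# trapezoidal quadrature of int (g - g(. - z))_+
tv_numeric = float(np.sum((diff[1:] + diff[:-1]) / 2) * (xs[1] - xs[0]))

tv_formula = math.erf(z / (2 * math.sqrt(2)))   # P(|U| <= z/2), U ~ N(0, 1)
assert abs(tv_numeric - tv_formula) < 1e-6
assert tv_formula <= z / math.sqrt(2 * math.pi)  # the bound of Lemma 4
```

The same identity in dimension k reduces to this one-dimensional computation after projecting on the direction of Z.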
Acknowledgments.
The author would like to thank E. Gassiat and I. Castillo for valuable discussions and suggestions.

SUPPLEMENTARY MATERIAL

Supplement to "Bernstein–von Mises theorems for Gaussian regression with increasing number of regressors" (DOI: 10.1214/11-AOS912SUPP; .pdf). This contains the proofs of various technical results stated in the main article "Bernstein–von Mises theorems for Gaussian regression with increasing number of regressors."

REFERENCES
Bontemps, D. (2011). Supplement to "Bernstein–von Mises theorems for Gaussian regression with increasing number of regressors." DOI:10.1214/11-AOS912SUPP.
Boucheron, S. and Gassiat, E. (2009). A Bernstein–von Mises theorem for discrete probability distributions. Electron. J. Stat.
Castillo, I. (2010). A semiparametric Bernstein–von Mises theorem. Probab. Theory Related Fields. DOI:10.1007/s00440-010-0316-5.
Clarke, B. S. and Barron, A. R. (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory.
Clarke, B. and Ghosal, S. (2010). Reference priors for exponential families with increasing dimension. Electron. J. Stat.
de Boor, C. (1978). A Practical Guide to Splines. Applied Mathematical Sciences. Springer, New York. MR0507062
Freedman, D. (1999). Wald lecture: On the Bernstein–von Mises theorem with infinite-dimensional parameters. Ann. Statist.
Ghosal, S. (1999). Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli.
Ghosal, S. (2000). Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. J. Multivariate Anal.
Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist.
Ghosal, S. and van der Vaart, A. (2007). Convergence rates of posterior distributions for non-i.i.d. observations. Ann. Statist.
Kim, Y. (2006). The Bernstein–von Mises theorem for the proportional hazard model. Ann. Statist.
Kim, Y. and Lee, J. (2004). A Bernstein–von Mises theorem in the nonparametric right-censoring model. Ann. Statist.
Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Math. Springer, Berlin. MR2319879
Rivoirard, V. and Rousseau, J. (2009). Bernstein–von Mises theorem for linear functionals of the density. Available at http://arxiv.org/abs/0908.4167.
Shen, X. (2002). Asymptotic normality of semiparametric and nonparametric posterior distributions. J. Amer. Statist. Assoc.
Shen, X. and Wasserman, L. (2001). Rates of convergence of posterior distributions. Ann. Statist.
Tsybakov, A. B. (2004). Introduction à l'estimation non-paramétrique. Springer, Berlin.
van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Univ. Press, Cambridge.