On the Bernstein-von Mises theorem for the Dirichlet process
Kolyan Ray∗ and Aad van der Vaart†

∗ Department of Mathematics, Imperial College London. E-mail: [email protected]
† Mathematical Institute, Leiden University. E-mail: [email protected]
The research leading to these results has received funding from the European Research Council under ERC Grant Agreement 320637 and is (partly) financed by the NWO Spinoza prize awarded to A.W. van der Vaart by the Netherlands Organisation for Scientific Research (NWO).

Abstract
We establish that Laplace transforms of the posterior Dirichlet process converge to those of the limiting Brownian bridge process in a neighbourhood about zero, uniformly over Glivenko–Cantelli function classes. For real-valued random variables and functions of bounded variation, we strengthen this result to hold for all real numbers. This last result is proved via an explicit strong approximation coupling inequality.
MSC 2000 subject classification : Primary 62G20; secondary 62G15, 60F17.
Key words: Bernstein–von Mises, Dirichlet process, strong approximation, Bayesian nonparametrics.
Let $\mathbb{P}_n = n^{-1}\sum_{i=1}^n \delta_{Z_i}$ be the empirical distribution of an i.i.d. sample $Z_1,\dots,Z_n$ from a distribution $P$ on some measurable space $(\mathcal{X},\mathcal{A})$, and given $Z_1,\dots,Z_n$ let $P_n$ be a draw from the Dirichlet process with base measure $\nu + n\mathbb{P}_n$. Thus $\nu$ is a finite measure on the sample space and $P_n\,|\,Z_1,\dots,Z_n \sim \mathrm{DP}(\nu + n\mathbb{P}_n)$ for all $n$, which is the posterior distribution obtained when equipping the distribution of the observations $Z_1, Z_2,\dots,Z_n$ with a Dirichlet process prior with base measure $\nu$. The case $\nu = 0$ is allowed; the process $P_n$ is then known as the Bayesian bootstrap. For full definitions and properties, see the review in Chapter 4 of [11].

The Dirichlet process is the standard "nonparametric prior" on the set of probability distributions on a (Polish) sample space. It was first made popular in Bayesian nonparametrics by Ferguson [10] and has subsequently been used in numerous statistical applications. The purpose of this note is to prove the following result concerning the Bernstein–von Mises theorem for the Dirichlet process posterior.

Theorem 1.
Suppose $\mathcal{G}$ is a $P$-Glivenko–Cantelli class of measurable functions $g:\mathcal{X}\to\mathbb{R}$ with envelope function $G$ such that $\nu G < \infty$ and $PG^{2+\delta} < \infty$, for some $\delta > 0$. Then there exists a neighbourhood of $0$ such that for every $t$ in the neighbourhood, for $P^\infty$-almost every sequence $Z_1, Z_2,\dots$,
$$\sup_{g\in\mathcal{G}} \Big| \mathrm{E}\big[e^{t\sqrt n(P_n g - \mathbb{P}_n g)}\,\big|\,Z_1,\dots,Z_n\big] - e^{t^2 P(g-Pg)^2/2}\Big| \to 0. \qquad (1)$$

The map $t\mapsto e^{t^2\sigma^2/2}$ is the Laplace transform of the normal distribution with mean 0 and variance $\sigma^2$. The theorem thus says that the Laplace transform of the posterior Dirichlet process centered at the empirical measure tends to the Laplace transform of a centered normal distribution with variance $P(g-Pg)^2$ in a neighbourhood of 0. This implies that the posterior Dirichlet process tends in distribution to a normal distribution (see Section 2.4), which is a version of the Bernstein–von Mises theorem for the Dirichlet process prior (a weak version, as the usual theorem gives the approximation in the total variation distance; see Section 12.2 of [11] for discussion). The convergence of the Laplace transform is useful for handling, for instance, moments of the posterior distribution.

The main contribution of the theorem is, however, to provide uniformity in a class of functions $g$. This uniformity refers to the marginal posterior distributions of the process $\big(\sqrt n(P_n g - \mathbb{P}_n g): g\in\mathcal{G}\big)$. The stronger sense of uniformity of distributional convergence of this process as a random element in the space $\ell^\infty(\mathcal{G})$ is known to be true if $\mathcal{G}$ is a Donsker class, as shown in [14] (see also [17, 18]). This is a much stronger property than Glivenko–Cantelli as assumed here.

Remark 1.
Theorem 1 can be extended to the assertion (1) for a sequence $\mathcal{G}_n$ of classes of measurable functions. Inspection of the proofs below shows that it suffices that these classes satisfy
$$\sup_{g\in\mathcal{G}_n\cup\mathcal{G}_n^2} |\mathbb{P}_n g - Pg| \to 0, \quad \text{a.s.},$$
and possess envelope functions $G_n$ such that $PG_n^2 = O(1)$ and $\max_{1\le i\le n} G_n(Z_i) = o(\sqrt n/\log n)$, almost surely. For convergence in probability in (1) it suffices that these conditions hold in probability, and the (last) condition on the maximum is implied by the condition on the envelope. If the classes $\mathcal{G}_n$ are separable, then uniformity over $\mathcal{G}_n^2$ is implied by uniformity over $\mathcal{G}_n$, as shown by Lemma 8 of [26].

Major applications of studying posterior Laplace transforms of functionals as in (1) include establishing semiparametric and nonparametric Bernstein–von Mises theorems [3, 4, 24, 27], especially for inverse problems [20, 21, 23], posterior contraction rates in the supremum norm [2, 22], and convergence rates for Tikhonov-type penalised least squares estimators [20, 22]. Such proofs typically require uniformity over function classes as established in (1) and use likelihood expansions based on local asymptotic normality (LAN) of the model. Because the Dirichlet process prior does not give probability one to a dominated set of measures, the resulting posterior distribution cannot be derived using Bayes' formula; one thus cannot use the LAN approach of the aforementioned papers to prove (1).

Our result is applicable when a Dirichlet process prior is assigned to some distributional component of the model, such as the covariate distribution in regression models with random design. For example, Theorem 1 has recently been applied to establish semiparametric Bernstein–von Mises results for estimating average treatment effects in causal inference problems [25, 26]. Indeed, results there suggest that for estimating functionals, using a Dirichlet process prior on the covariate distribution can yield better performance than other common prior choices, especially in high-dimensional covariate settings.
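As a numerical illustration of Theorem 1 (an aside, not part of the paper's argument), the following sketch draws from the Bayesian bootstrap posterior (the case $\nu = 0$), whose weights are normalized standard exponential variables — this is the representation (3) below with $\nu = 0$ — and compares the conditional Laplace transform of $\sqrt n(P_n g - \mathbb{P}_n g)$ with the Gaussian limit. The sample size, distribution $P$ and test function $g$ are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 2000                      # sample size (illustrative)
Z = rng.standard_normal(n)    # i.i.d. sample from P = N(0, 1)
g = lambda z: np.abs(z)       # a test function g with finite moments

t = 0.5                       # point at which to evaluate the Laplace transform
B = 5000                      # number of posterior draws

gZ = g(Z)
Pn_g = gZ.mean()              # empirical mean (\mathbb{P}_n g)

# Bayesian bootstrap (nu = 0): posterior weights are normalized Exp(1) variables,
# i.e. Dirichlet(1, ..., 1); see representation (3) below with nu = 0.
E = rng.standard_exponential((B, n))
W = E / E.sum(axis=1, keepdims=True)
post_g = W @ gZ               # B draws of P_n g given Z_1, ..., Z_n

centered = np.sqrt(n) * (post_g - Pn_g)
laplace_mc = np.mean(np.exp(t * centered))         # Monte Carlo posterior Laplace transform
sigma2_hat = np.mean((gZ - Pn_g) ** 2)             # plug-in estimate of P(g - Pg)^2
laplace_limit = np.exp(t ** 2 * sigma2_hat / 2)    # Gaussian limit appearing in Theorem 1

print(f"posterior Laplace transform at t={t}: {laplace_mc:.4f}")
print(f"Gaussian limit exp(t^2 sigma^2 / 2):  {laplace_limit:.4f}")
```

For moderate $n$ the two printed numbers should be close; repeating this over a grid of $t$ in a small neighbourhood of zero, and over several functions $g$, mimics the uniformity asserted in (1).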
The case $\mathcal{X} = \mathbb{R}$

The proof of Theorem 1 requires uniformly bounded exponential moments of the process $(\sqrt n(P_n g - \mathbb{P}_n g): g\in\mathcal{G})$, which only holds for small $|t|$ under the moment condition $PG^{2+\delta} < \infty$ of the theorem (see Lemmas 2 and 3). When $\mathcal{X} = \mathbb{R}$, we can strengthen Theorem 1 to hold for all $t\in\mathbb{R}$ under significantly stronger conditions on $\mathcal{G}$.

We now assume $Z_1, Z_2,\dots$ are i.i.d. random variables taking values in $\mathcal{X} = \mathbb{R}$. Recall that the total variation of a function $f:\mathbb{R}\to\mathbb{R}$ on an interval $[a,b]$ is
$$V_a^b(f) = \sup_{\Pi\in\mathcal{P}_a^b}\ \sum_{i=1}^{n_\Pi} |f(x_i) - f(x_{i-1})|,$$
where $\mathcal{P}_a^b = \{\Pi = (x_0,\dots,x_{n_\Pi}): a = x_0\le x_1\le\cdots\le x_{n_\Pi} = b,\ n_\Pi\in\mathbb{N}\}$ is the set of all partitions of $[a,b]$, and define $|f|_{BV} = \sup_{a,b} V_a^b(f)$.
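The following snippet (illustrative only; the function and partitions are made up) evaluates the sum in the definition of $V_a^b(f)$ over a given partition, which is always a lower bound for the total variation and attains it for partitions containing the turning points of a piecewise monotone $f$.

```python
import numpy as np

def variation_over_partition(f, partition):
    """Sum of |f(x_i) - f(x_{i-1})| over a given partition (a lower bound for V_a^b(f))."""
    x = np.asarray(partition)
    return np.sum(np.abs(np.diff(f(x))))

f = lambda x: np.minimum(x, 1.0)          # nondecreasing on [0, 2], so V_0^2(f) = f(2) - f(0) = 1
coarse = [0.0, 2.0]
fine = np.linspace(0.0, 2.0, 1001)
print(variation_over_partition(f, coarse), variation_over_partition(f, fine))  # both equal 1.0
```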
Proposition 1. Suppose $\mathcal{G}$ is a class of right-continuous functions $g:\mathbb{R}\to\mathbb{R}$ such that $\sup_{g\in\mathcal{G}}|g|_{BV} < \infty$. Then for every $t\in\mathbb{R}$, for $P^\infty$-almost every sequence $Z_1, Z_2,\dots$,
$$\sup_{g\in\mathcal{G}} \Big| \mathrm{E}\big[e^{t\sqrt n(P_n g - \mathbb{P}_n g)}\,\big|\,Z_1,\dots,Z_n\big] - e^{t^2 P(g-Pg)^2/2}\Big| \to 0.$$

Since bounded variation balls are universal Donsker classes, this is a significantly stronger requirement than $\mathcal{G}$ being $P$-Glivenko–Cantelli as in Theorem 1. We prove this result by exploiting a strong approximation, which establishes a rate of convergence for representations of these random variables defined on a common probability space and has various applications in probability and statistics, for instance in studying distributional approximations of transformed random variables $\psi_n(\sqrt n(P_n - \mathbb{P}_n))$, where the functions $\psi_n$ depend on $n$. For an overview of the theory of strong approximations and a survey of their applications in probability and statistics, see Csörgő and Révész [6] and Csörgő and Hall [7], respectively.

Let $F_n(t) = P_n(-\infty,t]$ and $\mathbb{F}_n(t) = \mathbb{P}_n(-\infty,t]$ denote the distribution function of the posterior Dirichlet process draw $P_n$ and the empirical distribution function, respectively, and let $F_0(z) = P(-\infty,z]$ be the true distribution function. In a slight abuse of notation, we shall write $F\sim\mathrm{DP}(\nu)$ to mean $F = P(-\infty,\cdot]$ for $P\sim\mathrm{DP}(\nu)$. We write $|\nu| = \nu(\mathbb{R})$. Recall that a Brownian bridge $\{B(s): s\in[0,1]\}$ is a mean-zero Gaussian process with covariance function $\mathrm{E}B(s_1)B(s_2) = s_1\wedge s_2 - s_1s_2$. A Kiefer process $\{K(s,t): s\in[0,1],\,t\ge 0\}$ is a two-parameter mean-zero Gaussian process with covariance function $\mathrm{E}K(s_1,t_1)K(s_2,t_2) = (s_1\wedge s_2 - s_1s_2)(t_1\wedge t_2)$. For each $t > 0$, $\{t^{-1/2}K(s,t): s\in[0,1]\}$ is a Brownian bridge, while $\{K(s,n+1) - K(s,n): n\ge 0\}$ is a sequence of independent Brownian bridges.

An almost sure strong approximation of the posterior Dirichlet process was established by Lo [19]. He showed that on a suitable probability space, there exist random elements $F$, $K$ and $Z_1, Z_2,\dots\sim^{iid} F_0$ such that $F\,|\,Z_1,\dots,Z_n\sim\mathrm{DP}(\nu + n\mathbb{P}_n)$ for every $n$, $K$ is a Kiefer process independent of $Z_1, Z_2,\dots$, and
$$\sup_{z\in\mathbb{R}}\Big|\sqrt n(F - \mathbb{F}_n)(z) - n^{-1/2}K(F_0(z), n)\Big| = O\big(n^{-1/4}(\log n)^{1/2}(\log\log n)^{1/4}\big) \quad a.s. \qquad (2)$$
Applications of (2) include studying the large sample behaviour of the Bayesian bootstrap and smoothed Dirichlet process posterior [19], as well as receiver operating characteristic (ROC) curves [13]. We revisit this result by establishing an explicit coupling inequality in order to make uniform the constants in (2). This for instance allows control of exponential moments, which is needed to prove Proposition 1.

We henceforth assume that the underlying probability space is rich enough that all random variables and processes subsequently introduced may be defined on it. Since the posterior distribution is conditional on the observations $Z_1,\dots,Z_n$, it is natural for a Bayesian to index the Gaussian process in (2) by the empirical distribution function $\mathbb{F}_n$ to obtain a conditional Gaussian approximation.
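The objects just defined are easy to simulate; the sketch below (an illustration, not from the paper) builds Brownian bridges on a grid and a Kiefer process at integer times via $K(\cdot, n) = \sum_{i\le n} B_i$ for independent bridges $B_i$, and checks that $n^{-1/2}K(\cdot, n)$ again behaves like a Brownian bridge. Grid size and replication numbers are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def brownian_bridge(m, rng):
    """One Brownian bridge on the grid s = 0, 1/m, ..., 1."""
    dW = rng.standard_normal(m) / np.sqrt(m)
    W = np.concatenate([[0.0], np.cumsum(dW)])  # Brownian motion on the grid
    s = np.linspace(0.0, 1.0, m + 1)
    return W - s * W[-1]                        # B(s) = W(s) - s * W(1)

m, n, reps = 200, 20, 2000
vals = []
for _ in range(reps):
    bridges = np.array([brownian_bridge(m, rng) for _ in range(n)])
    K = bridges.cumsum(axis=0)                  # K(s, k) = sum_{i<=k} B_i(s): a Kiefer process at integer times
    vals.append(K[-1, m // 2] / np.sqrt(n))     # n^{-1/2} K(1/2, n), which should be N(0, 1/4)

print(np.var(vals))   # roughly 0.25 = (1/2)(1 - 1/2), the Brownian bridge variance at s = 1/2
```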
The following is the explicit coupling inequality analogue of Lemma 6.3 of [19].

Theorem 2. On a suitable probability space, there exist random variables $Z_1, Z_2,\dots\sim^{iid} P$ and $F$, with $F\,|\,Z_1,\dots,Z_n\sim\mathrm{DP}(\nu + n\mathbb{P}_n)$, and a sequence of Brownian bridges $(B_n)$ independent of $Z_1, Z_2,\dots$, such that
$$P\Big(\sup_{z\in\mathbb{R}}\big|\sqrt n(F - \mathbb{F}_n)(z) - B_n(\mathbb{F}_n(z))\big| \ge \frac{C_1(\log n + |\nu|) + x}{\sqrt n}\ \Big|\ Z_1,\dots,Z_n\Big) \le C_2 e^{-C_3 x}$$
for all $x > 0$ and $n\ge 2$, where $C_1$–$C_3$ are universal constants.

This result says one can couple the Dirichlet process posteriors to a sequence of Brownian bridges independent of the underlying data. The theorem could also be rephrased with the random variables $Z_1, Z_2,\dots$ replaced by any real numbers $z_1, z_2,\dots$ to emphasize this independence.

For $x = x_n$ taken equal to a constant times $\log n$, the right side is summable over $n$, and hence the complements of the events at $x_n$ are valid for every sufficiently large $n$, by the Borel–Cantelli lemma, for almost every sequence $Z_1, Z_2,\dots$. Provided that $|\nu| = O(\log n)$, this yields that $\sup_{z\in\mathbb{R}}|\sqrt n(F - \mathbb{F}_n)(z) - B_n(\mathbb{F}_n(z))| = O(n^{-1/2}\log n)$, for almost every sequence $Z_1, Z_2,\dots$, which improves on the rate $n^{-1/4}(\log n)^{1/2}(\log\log n)^{1/4}$ in Lemma 6.3 of [19]. This is because we replace the KMT coupling used in [19], which involves a Kiefer process, with a direct quantile coupling due to [5] involving dependent Brownian bridges.
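To spell out the Borel–Cantelli step in the preceding paragraph (a routine elaboration, added for completeness), take $x_n = (2/C_3)\log n$ in Theorem 2. Then
$$\sum_{n\ge 2} P\Big(\sup_{z\in\mathbb{R}}\big|\sqrt n(F - \mathbb{F}_n)(z) - B_n(\mathbb{F}_n(z))\big| \ge \frac{C_1(\log n + |\nu|) + x_n}{\sqrt n}\ \Big|\ Z_1,\dots,Z_n\Big) \le \sum_{n\ge 2} C_2 e^{-C_3 x_n} = \sum_{n\ge 2}\frac{C_2}{n^2} < \infty,$$
so that, by (a conditional version of) the Borel–Cantelli lemma, almost surely only finitely many of these events occur; if $|\nu| = O(\log n)$ the threshold is $O(n^{-1/2}\log n)$, which is the almost sure rate stated above.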
The following is the analogous result when the Brownian bridges are related amongst themselves by tying them to a Kiefer process $K(\cdot, n) = \sum_{i=1}^n B_i$ for $(B_i)$ independent Brownian bridges.

Theorem 3. On a suitable probability space, there exist random variables $Z_1, Z_2,\dots\sim^{iid} P$ and $F$, with $F\,|\,Z_1,\dots,Z_n\sim\mathrm{DP}(\nu + n\mathbb{P}_n)$, and a Kiefer process $K$ independent of $Z_1, Z_2,\dots$, such that
$$P\Big(\sup_{z\in\mathbb{R}}\big|\sqrt n(F - \mathbb{F}_n)(z) - n^{-1/2}K(\mathbb{F}_n(z), n)\big| \ge C_1\frac{|\nu| + x\log n}{\sqrt n} + C_2\frac{\sqrt{\log n}\,x^{3/4}}{n^{1/4}}\ \Big|\ Z_1,\dots,Z_n\Big) \le C_3 e^{-C_4 x}$$
for all $x > 0$ and $n\ge 2$, where $C_1$–$C_4$ are universal constants.

Theorem 2 does not say anything about the joint distribution in $n$ of the corresponding Brownian bridges, and thus only "in probability" or "in distribution" limit results can be proved from it. On the other hand, despite the slower convergence rate, Theorem 3 can be used to establish the almost sure limiting behaviour of statistics of interest based upon $\sqrt n(F - \mathbb{F}_n)(z)$, for instance a law of the iterated logarithm.

If $|\nu| = O(n^{1/4})$ up to logarithmic factors, the above yields a $P(\cdot\,|\,Z_1, Z_2,\dots)$-almost sure order $n^{-1/4}$ times a logarithmic factor, significantly slower than the rate in Theorem 2. In Theorem 3 we follow the approach of [19] of using the KMT coupling rather than a quantile coupling as in Theorem 2. Indeed, up to logarithmic factors, a better rate is not obtainable for coupling a quantile process with a Kiefer process [8], as opposed to dependent Brownian bridges. We obtain a slightly slower almost sure rate than the $n^{-1/4}(\log n)^{1/2}(\log\log n)^{1/4}$ achieved in Lemma 6.3 of [19] due to technical arguments used to make the coupling non-asymptotic.

We may also index the Brownian bridges by the true distribution function $F_0$ at the expense of a slower rate. The following is the coupling inequality analogue of Theorem 2.1 of [19].

Corollary 1.
On a suitable probability space, there exist random variables $Z_1, Z_2,\dots\sim^{iid} P$ and $F$, with $F\,|\,Z_1,\dots,Z_n\sim\mathrm{DP}(\nu + n\mathbb{P}_n)$, and a sequence of Brownian bridges $(B_n)$ independent of $Z_1, Z_2,\dots$, such that for any $y > 0$, the event $A_{n,y} = \{\sqrt n\,\|\mathbb{F}_n - F_0\|_\infty\le y\}$ satisfies $P(A_{n,y})\ge 1 - 2e^{-2y^2}$ and
$$P\Big(\sup_{z\in\mathbb{R}}\big|\sqrt n(F - \mathbb{F}_n)(z) - B_n(F_0(z))\big| \ge C_1\frac{\log n + |\nu| + x}{\sqrt n} + C_2\frac{\sqrt y\,(\sqrt{\log n} + \sqrt x)}{n^{1/4}}\ \Big|\ Z_1,\dots,Z_n\Big)1_{A_{n,y}} \le C_3 e^{-C_4 x}$$
for all $x > 0$ and $n\ge 2$, where $C_1$–$C_4$ are universal constants.

The Bayesian interpretation is that there are events $(A_{n,y})$ of high $P^n$-probability, depending only on the observations $Z_1,\dots,Z_n$, on which one can approximate the posterior Dirichlet process with a sequence of Brownian bridges independent of the underlying data. If $|\nu|$ grows at most polynomially of order $n^{1/4}$ up to logarithmic factors, setting $y = \sqrt{\delta\log n}$ with $\delta > 1/2$ yields that, for $P^\infty$-almost every sequence $Z_1, Z_2,\dots$, we have approximation rate $n^{-1/4}(\log n)^{3/4}$, $P(\cdot\,|\,Z_1, Z_2,\dots)$-almost surely. A similar, if more complicated, expression can be proved with the Brownian bridges $(B_n)$ replaced by the Kiefer process $K$, in particular yielding a $P(\cdot\,|\,Z_1, Z_2,\dots)$-almost sure rate $O(n^{-1/4}(\log n)^{3/4})$ for $P^\infty$-almost every sequence $Z_1, Z_2,\dots$.

For given $Z_1, Z_2,\dots$, the Dirichlet process posterior distribution can be represented in law as the convex combination
$$P_n g = V_n\frac{\nu g}{|\nu|} + (1 - V_n)\frac{\sum_{i=1}^n E_i g(Z_i)}{\sum_{i=1}^n E_i}, \qquad (3)$$
where $V_n\sim\mathrm{Beta}(|\nu|, n)$ and $E_1, E_2,\dots$ are i.i.d. exponential variables with mean 1, independent of $V_n$. For $\nu = 0$, the center measure $\nu/|\nu|$ and the variable $V_n$ are interpreted as 0, and the first term vanishes.

With the notation $\bar E_n = n^{-1}\sum_{i=1}^n E_i$, some algebra gives
$$\sqrt n(P_n g - \mathbb{P}_n g) = \sqrt n\,V_n\Big(\frac{\nu g}{|\nu|} - \frac{\sum_{i=1}^n E_i g(Z_i)}{\sum_{i=1}^n E_i}\Big) + \frac{1}{\bar E_n}\frac{1}{\sqrt n}\sum_{i=1}^n (E_i - 1)\big(g(Z_i) - \mathbb{P}_n g\big). \qquad (4)$$
The variable $V_n$ is of the order $1/n$ and the first term in brackets on the right side is bounded above by $\nu G/|\nu| + \max_{1\le i\le n}|G(Z_i)|$, which is $o(\sqrt n)$ almost surely; see Lemma 4 below. Therefore the first term on the right tends to zero and is negligible as $n\to\infty$. The leading factor $1/\bar E_n$ of the second term on the right of (4) tends to 1, by the strong law of large numbers. If $PG^2 < \infty$, then
$$\mathbb{P}_n g\to Pg, \quad a.s., \qquad \mathbb{P}_n g^2\to Pg^2, \quad a.s., \qquad \max_{1\le i\le n}|g(Z_i)|/\sqrt n\to 0, \quad a.s.$$
(See Lemma 4 for the last claim.) This may be used to show that, for every $\varepsilon > 0$,
$$\frac1n\sum_{i=1}^n\mathrm{E}\Big[(E_i - 1)^2\big(g(Z_i) - \mathbb{P}_n g\big)^2\,\Big|\,Z_1, Z_2,\dots\Big] \to P(g - Pg)^2 =: \sigma_g^2, \quad a.s.,$$
$$\frac1n\sum_{i=1}^n\mathrm{E}\Big[(E_i - 1)^2\big(g(Z_i) - \mathbb{P}_n g\big)^2 1_{|E_i - 1||g(Z_i) - \mathbb{P}_n g| > \varepsilon\sqrt n}\,\Big|\,Z_1, Z_2,\dots\Big] \to 0, \quad a.s.$$
The Lindeberg central limit theorem (together with Slutsky's lemma) then gives that
$$\sqrt n(P_n g - \mathbb{P}_n g)\,\big|\,Z_1, Z_2,\dots \rightsquigarrow N(0,\sigma_g^2), \quad a.s. \qquad (5)$$
If we knew that the moment generating functions of the variables on the left were bounded, then this would imply convergence of exponential moments, and the proposition would be proved for $\mathcal{G} = \{g\}$. The approach to proving the proposition is to first strengthen the preceding display to uniformity in $g$, and next to show that exponential moments of the variables on the left are suitably bounded.

For the uniformity, we use the assumption that $\mathcal{G}$ is Glivenko–Cantelli. This is not overly strong, and it may be not far off from necessary. Indeed, if the variables $E_i - 1$ were standard normal, the factor $1/\bar E_n$ were not present and $\nu = 0$, then the conditional distribution of the left side of the preceding display would be $N\big(0, \mathbb{P}_n(g - \mathbb{P}_n g)^2\big)$, and convergence of these normal distributions to $N(0,\sigma_g^2)$ would imply the convergence $\mathbb{P}_n(g - \mathbb{P}_n g)^2\to\sigma_g^2$, uniformly in $g$ if the convergence in distribution were uniform. This is close to the Glivenko–Cantelli property.
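A small simulation of the representation (3) (illustrative only; the base measure, sample size and test function are invented for the example): it draws $\sqrt n(P_n g - \mathbb{P}_n g)$ via (3) and compares its conditional standard deviation with the value $\sigma_g$ predicted by (5).

```python
import numpy as np

rng = np.random.default_rng(2)

n = 1000
Z = rng.exponential(size=n)          # i.i.d. sample from P = Exp(1)
g = lambda z: np.minimum(z, 2.0)     # bounded test function g
nu_mass = 1.0                        # |nu|; the base measure nu is taken as nu_mass * Exp(1) here
B = 4000                             # number of posterior draws

gZ = g(Z)
Pn_g = gZ.mean()
nu_g_over_mass = np.mean(g(rng.exponential(size=100_000)))   # Monte Carlo value of (nu g)/|nu|

# representation (3): V_n ~ Beta(|nu|, n) and E_1, ..., E_n i.i.d. Exp(1), independent of V_n
V = rng.beta(nu_mass, n, size=B)
E = rng.standard_exponential((B, n))
post_g = V * nu_g_over_mass + (1 - V) * (E @ gZ) / E.sum(axis=1)

centered = np.sqrt(n) * (post_g - Pn_g)
print("posterior sd of sqrt(n)(P_n g - Pn g):", centered.std())
print("sigma_g estimate, sqrt of Pn(g - Pn g)^2:", gZ.std())
```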
Lemma 1. Suppose $\mathcal{G}$ is a $P$-Glivenko–Cantelli class of measurable functions $g:\mathcal{X}\to\mathbb{R}$ with envelope function $G$ such that $\nu G < \infty$ and $PG^2 < \infty$. Then the convergence in (5) is uniform in $g\in\mathcal{G}$, i.e. for any metric $d$ defining weak convergence of probability measures on $\mathbb{R}$,
$$\sup_{g\in\mathcal{G}} d\Big(\mathcal{L}\big(\sqrt n(P_n g - \mathbb{P}_n g)\,\big|\,Z_1, Z_2,\dots\big),\, N(0,\sigma_g^2)\Big) \to 0, \quad a.s.$$

Proof. By the square-integrability of $G$ and conservation of the Glivenko–Cantelli property under continuous transformations (see [30]), the set $\{g^2: g\in\mathcal{G}\}$ is also Glivenko–Cantelli. Thus the set of sequences $Z_1, Z_2,\dots$ along which all of
$$\max_{1\le i\le n} G(Z_i)/\sqrt n\to 0, \qquad \sup_{g\in\mathcal{G}}|\mathbb{P}_n g - Pg|\to 0, \qquad \sup_{g\in\mathcal{G}}|\mathbb{P}_n g^2 - Pg^2|\to 0$$
hold has probability one, where the first convergence uses Lemma 4 and $PG^2 < \infty$. Fix some sequence $Z_1, Z_2,\dots$ for which all three statements are true, and suppose that the left side of the lemma does not tend to 0. Then there exists $\eta > 0$ and a subsequence $\{n'\}\subset\{n\}$ such that for all elements $n'$ of the subsequence, the left side is larger than $\eta$. Thus there exists a subsequence $g_{n'}\in\mathcal{G}$ such that
$$d\Big(\mathcal{L}\big(\sqrt{n'}(P_{n'} g_{n'} - \mathbb{P}_{n'} g_{n'})\,\big|\,Z_1, Z_2,\dots\big),\, N(0,\sigma_{g_{n'}}^2)\Big) > \eta.$$
Now,
$$\mathbb{P}_{n'} g_{n'} - Pg_{n'}\to 0, \qquad \mathbb{P}_{n'} g_{n'}^2 - Pg_{n'}^2\to 0, \qquad \sup_{1\le i\le n'}|g_{n'}(Z_i)|/\sqrt{n'}\le\max_{1\le i\le n'} G(Z_i)/\sqrt{n'}\to 0.$$
This implies that $\mathbb{P}_{n'}(g_{n'} - \mathbb{P}_{n'} g_{n'})^2 - \sigma_{g_{n'}}^2\to 0$. Since the sequence $\sigma_{g_{n'}}^2$ is bounded, there is a further subsequence $\{n''\}\subset\{n'\}$ such that $\sigma_{g_{n''}}^2\to\sigma^2\in[0,\infty)$. We can apply the Lindeberg central limit theorem as in the argument preceding the lemma to conclude that
$$d\Big(\mathcal{L}\big(\sqrt{n''}(P_{n''} g_{n''} - \mathbb{P}_{n''} g_{n''})\,\big|\,Z_1, Z_2,\dots\big),\, N(0,\sigma^2)\Big) \to 0.$$
Since $\sigma_{g_{n''}}^2\to\sigma^2$, this remains true if $\sigma^2$ is replaced by $\sigma_{g_{n''}}^2$. This contradicts the construction of the functions $g_{n'}$.
Lemma 2. Suppose that the conclusion of Lemma 1 holds and, for some $T > 0$,
$$\limsup_{n\to\infty}\ \sup_{g\in\mathcal{G}}\ \mathrm{E}\big[e^{T\sqrt n(P_n g - \mathbb{P}_n g)}\,\big|\,Z_1,\dots,Z_n\big] < \infty, \quad a.s. \qquad (6)$$
Then (1) holds for $0\le t < T$. Furthermore, if (6) holds for some $T < 0$, then (1) holds for $T < t\le 0$.

Proof. We can take the distance in Lemma 1 equal to
$$d(F, G) = \sup_{h\in\mathcal{H}}\Big|\int h\,dF - \int h\,dG\Big|,$$
where $\mathcal{H}$ is a set of uniformly bounded and uniformly Lipschitz functions $h:\mathbb{R}\to\mathbb{R}$ (see Chapter 1.12 in [31]). For given $t > 0$ and $M > 0$, we may choose this collection to contain the function $h_M(x) = e^{tx}\wedge M$. Lemma 1 thus gives that, with $\mathrm{E}_Z$ denoting the conditional expectation given $Z_1, Z_2,\dots$,
$$\sup_{g\in\mathcal{G}}\Big|\mathrm{E}_Z\big[e^{t\sqrt n(P_n g - \mathbb{P}_n g)}\wedge M\big] - \int h_M\,dN(0,\sigma_g^2)\Big| \to 0.$$
Since $\sup_{g\in\mathcal{G}}\sigma_g^2\le PG^2 < \infty$, we can choose $M$ such that $|\int h_M\,dN(0,\sigma_g^2) - e^{t^2\sigma_g^2/2}|$ is arbitrarily small, uniformly in $g\in\mathcal{G}$. Furthermore,
$$\Big|\mathrm{E}_Z\big[e^{t\sqrt n(P_n g - \mathbb{P}_n g)}\wedge M - e^{t\sqrt n(P_n g - \mathbb{P}_n g)}\big]\Big| \le \mathrm{E}_Z\big[e^{t\sqrt n(P_n g - \mathbb{P}_n g)}1_{e^{t\sqrt n(P_n g - \mathbb{P}_n g)}\ge M}\big] \le M^{-(T-t)/t}\,\mathrm{E}_Z\big[e^{T\sqrt n(P_n g - \mathbb{P}_n g)}\big].$$
For $t < T$ and sufficiently large $M$ and $n$, this is arbitrarily small, uniformly in $g\in\mathcal{G}$, by assumption (6). The proof of the assertion with $T < 0$ is analogous (replace $P_n - \mathbb{P}_n$ by $\mathbb{P}_n - P_n$ in the argument).

Lemma 3. If $\mathcal{G}$ has envelope function $G$ such that $PG^{2+\delta} < \infty$ for some $\delta > 0$, then (6) holds for every $T$ in a sufficiently small neighbourhood of $0$.

Proof. By the Cauchy–Schwarz inequality, $\mathrm{E}\,e^{T(Y_1 + Y_2)} < \infty$ if $\mathrm{E}\,e^{2TY_i} < \infty$, for $i = 1, 2$. Thus it suffices to prove that the two terms on the right side of (4) both possess finite exponential moments that are bounded in $n$.

Since $PG^2 < \infty$, we have that $\varepsilon_n := \max_{1\le i\le n} G(Z_i)/\sqrt n\to 0$, almost surely, by Lemma 4. The absolute value of the first term of (4) is bounded above by $nV_n\big(n^{-1/2}\nu G/|\nu| + \varepsilon_n\big)$, where $\varepsilon_n$ tends to zero almost surely. Thus this term has bounded exponential moments by Lemma 5.

Next consider the second term on the right side of (4), or equivalently, assume that $\nu = 0$. The absolute value satisfies
$$\sqrt n\,|P_n g - \mathbb{P}_n g| \le \frac{1}{\bar E_n\sqrt n}\sum_{i=1}^n E_i\,|g(Z_i) - \mathbb{P}_n g| \le 2\sqrt n\max_{1\le i\le n} G(Z_i) = 2n\varepsilon_n.$$
Since $\mathrm{E}\,e^X = 1 + \int_0^\infty P(X\ge x)e^x\,dx$, it follows that for $T > 0$,
$$\mathrm{E}_Z\big[e^{T\sqrt n(P_n g - \mathbb{P}_n g)}\big] = 1 + \int_0^{2Tn\varepsilon_n} P_Z\Big(\frac{T}{\sqrt n}\sum_{i=1}^n(E_i - 1)\big(g(Z_i) - \mathbb{P}_n g\big) > x\bar E_n\Big)e^x\,dx$$
$$\le 1 + \int_0^\infty P\big(\bar E_n < 1 - \sqrt{3x/n}\big)e^x\,dx + \int_0^{2Tn\varepsilon_n} P_Z\Big(\frac{T}{\sqrt n}\sum_{i=1}^n(E_i - 1)\big(g(Z_i) - \mathbb{P}_n g\big) > x\big(1 - \sqrt{3x/n}\big)\Big)e^x\,dx.$$
The probability in the first integral on the far right is bounded above by $e^{-3x/2}$, for every $x > 0$, using (8). Thus the integral involving this term is bounded above by $\int_0^\infty e^{-x/2}\,dx = 2$. For $x$ in the integration range of the second integral on the far right, the number $1 - \sqrt{3x/n}$ is at least $1/2$ for all sufficiently large $n$, since $\varepsilon_n\to 0$ almost surely. Hence the preceding display is bounded above by
$$3 + \int_0^\infty P_Z\Big(\frac{2T}{\sqrt n}\sum_{i=1}^n(E_i - 1)\big(g(Z_i) - \mathbb{P}_n g\big) > x\Big)e^x\,dx = 2 + \mathrm{E}_Z\big[e^{2Tn^{-1/2}\sum_{i=1}^n(E_i - 1)(g(Z_i) - \mathbb{P}_n g)}\big].$$
It suffices to show that the last expectation is finite and bounded in $g\in\mathcal{G}$, for some $T > 0$. Let $\psi_1(x) = e^x - 1$, and let $\|\cdot\|_{\psi_1}$ be the corresponding Orlicz norm. Then, by Proposition A.1.6 and Lemma 2.2.2 in [31], with the norms interpreted conditionally given $Z_1, Z_2,\dots$,
$$\Big\|\frac{1}{\sqrt n}\sum_{i=1}^n(E_i - 1)\big(g(Z_i) - \mathbb{P}_n g\big)\Big\|_{\psi_1} \lesssim \Big\|\frac{1}{\sqrt n}\sum_{i=1}^n(E_i - 1)\big(g(Z_i) - \mathbb{P}_n g\big)\Big\|_2 + \frac{1}{\sqrt n}\Big\|\max_{1\le i\le n}|E_i - 1|\,|g(Z_i) - \mathbb{P}_n g|\Big\|_{\psi_1}$$
$$\lesssim \sqrt{\mathbb{P}_n(g - \mathbb{P}_n g)^2} + \frac{\log n}{\sqrt n}\|E_1 - 1\|_{\psi_1}\max_{1\le i\le n}|g(Z_i)| \lesssim \sqrt{\mathbb{P}_n G^2} + \frac{\log n}{\sqrt n}\max_{1\le i\le n} G(Z_i).$$
Under the condition $PG^{2+\delta} < \infty$, the first term is bounded almost surely by the law of large numbers, while the second term tends to zero almost surely by Lemma 4. By the definition of the Orlicz norm, $\mathrm{E}\,e^{|Y|/C}\le 2$ for $C\ge\|Y\|_{\psi_1}$ and any random variable $Y$. This concludes the proof for $T > 0$. For $T < 0$, we copy the preceding argument, but replace $E_i - 1$ by $1 - E_i$ and $T$ by $|T|$.

Lemma 4. If $Y_1, Y_2,\dots$ are i.i.d. random variables with $\mathrm{E}|Y_i|^r < \infty$ for some $r > 0$, then $\max_{1\le i\le n}|Y_i|/n^{1/r}\to 0$, almost surely.

Proof. For any $y > 0$,
$$\frac{\max_{1\le i\le n}|Y_i|^r}{n} \le \frac{y^r}{n} + \frac1n\sum_{i=1}^n|Y_i|^r 1_{|Y_i| > y}.$$
As $n\to\infty$ the first term tends to zero, for fixed $y$, while the second tends to $\mathrm{E}|Y_1|^r 1_{|Y_1| > y}$ by the law of large numbers, and can be made arbitrarily small by choice of $y$.

Lemma 5. If $V_n\sim\mathrm{Beta}(|\nu|, n)$ and $t_n\to t\in[0,1)$, then $\mathrm{E}\,e^{nt_nV_n}\to(1-t)^{-|\nu|}$. In particular $\mathrm{E}\,e^{nt_nV_n}\to 1$ when $t_n\to 0$.

Proof. We have that
$$\int_0^1 e^{ntv}v^{|\nu|-1}(1-v)^{n-1}\,dv = n^{-|\nu|}\int_0^n e^{tu}u^{|\nu|-1}(1 - u/n)^{n-1}\,du.$$
The integrand is dominated by $e^{tu}u^{|\nu|-1}e^{-(n-1)u/n}$, which is uniformly integrable for sufficiently large $n$ and $t < 1$. Therefore, for fixed $t < 1$, the integral times $n^{|\nu|}$ is asymptotic to $\int_0^\infty u^{|\nu|-1}e^{-u(1-t)}\,du = (1-t)^{-|\nu|}\Gamma(|\nu|)$ by the dominated convergence theorem. By the definition of the beta distribution, the expectation $\mathrm{E}\,e^{nt_nV_n}$ is the quotient of two of the integrals as in the display, with $t = t_n$ and with $t = 0$, respectively.
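A quick Monte Carlo check of Lemma 5 (illustrative, with made-up values of $|\nu|$ and $t$):

```python
import numpy as np

rng = np.random.default_rng(3)

nu_mass, t = 2.0, 0.3
for n in [100, 1000, 10000]:
    V = rng.beta(nu_mass, n, size=200_000)
    mc = np.mean(np.exp(n * t * V))            # Monte Carlo estimate of E exp(n t V_n)
    print(n, mc, (1 - t) ** (-nu_mass))        # should approach (1 - t)^{-|nu|} ~ 2.04 for these values
```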
Proof of Proposition 1. Because by assumption the variation of a function $g\in\mathcal{G}$ is bounded uniformly over all intervals $[a,b]$, the limits of $g(x)$ as $x\to\pm\infty$ exist and are finite. (Indeed the values $|g(a)|$, for $a < 0$, are bounded by $|g(0)| + V_a^0(g)\le|g(0)| + V$, where $V := \sup_{g\in\mathcal{G}}|g|_{BV}$, and hence every sequence $g(x_n)$ with $x_n\to-\infty$ has a converging subsequence. If there were two subsequences $x_n$ and $y_n$ with different limits, then these could without loss of generality be chosen alternating: $x_1\ge y_1\ge x_2\ge y_2\ge\cdots$, and the variation over the partitions containing $y_N, x_N,\dots, y_1, x_1$ would tend to infinity with $N$.) It can be seen that the variation of the extended function $g$ over $[-\infty,\infty]$ is the supremum of the variations over all intervals $[a,b]$, and hence is also finite. In particular, the functions $g - g(-\infty)$ are uniformly bounded. As shifting the functions by a constant does not change the claim of the proposition, we can assume without loss of generality that $g(-\infty) = 0$, and that the class $\mathcal{G}$ has a uniformly bounded envelope function $G$. We can then decompose $g$ as $g = g^+ - g^-$, for right-continuous, nondecreasing functions $g^+, g^-:[-\infty,\infty]\to\mathbb{R}$, uniformly bounded by $2V$ (e.g. Section 6.3 of [28]). Let $dg = dg^+ - dg^-$ be the corresponding signed (Stieltjes) measure, and $|dg| = dg^+ + dg^-$ its total variation.

We work on the probability space from Theorem 2. For $(B_n)$ the Brownian bridges in that theorem and $g\in\mathcal{G}$, set
$$W_n g = -\int B_n\circ\mathbb{F}_n\,dg.$$
It can be seen that given $Z_1, Z_2,\dots$, the variable $W_n g$ possesses a $N\big(0, \|g - \mathbb{P}_n g\|^2_{L_2(\mathbb{P}_n)}\big)$-distribution, whence $W_n$ is a $\mathbb{P}_n$-Brownian bridge process, indexed by $\mathcal{G}$.

The process $F$ in Theorem 2 is the distribution function $F_n$ of the posterior Dirichlet process $P_n$. By partial integration, we have
$$(P_n - \mathbb{P}_n)g = \int g\,d(F_n - \mathbb{F}_n) = -\int(F_n - \mathbb{F}_n)\,dg.$$
Writing $\Delta_n = \|\sqrt n(F_n - \mathbb{F}_n) - B_n\circ\mathbb{F}_n\|_\infty$, we thus find that, for every sequence $Z_1, Z_2,\dots$,
$$\big|\sqrt n(P_n - \mathbb{P}_n)g - W_n(g)\big| = \Big|\int\big(\sqrt n(F_n - \mathbb{F}_n) - B_n\circ\mathbb{F}_n\big)\,dg\Big| \le V\Delta_n.$$
Since bounded variation balls are uniform Donsker classes, $\mathcal{G}$ is $P$-Glivenko–Cantelli. Thus to prove the proposition, by Lemmas 1 and 2, we need show only that (6) holds for all $T\in\mathbb{R}$. Using the last display and the Cauchy–Schwarz inequality,
$$\sup_{g\in\mathcal{G}}\mathrm{E}_Z\big[e^{T\sqrt n(P_n - \mathbb{P}_n)g}\big] \le \big(\mathrm{E}_Z\big[e^{2|T|V\Delta_n}\big]\big)^{1/2}\times\sup_{g\in\mathcal{G}}\big(\mathrm{E}_Z\big[e^{2TW_n(g)}\big]\big)^{1/2}.$$
The first term converges to 1 as $n\to\infty$ for every sequence $Z_1, Z_2,\dots$ by Lemma 7 below. The second term equals $\sup_{g\in\mathcal{G}} e^{T^2\mathbb{P}_n(g - \mathbb{P}_n g)^2}\le e^{T^2\mathbb{P}_n G^2}\to e^{T^2PG^2} < \infty$, $P^\infty$-a.s. This establishes (6) and completes the proof.

We recall some useful facts. For a centered Gaussian process $(G_t)_{t\in T}$ with countable index set $T$ satisfying $\sup_{t\in T}|G_t| < \infty$ (Borell's inequality, Theorem 7.1 of [16]):
$$P\Big(\sup_{t\in T}|G_t|\ge\mathrm{E}\sup_{t\in T}|G_t| + x\Big) \le e^{-x^2/(2\sigma^2)} \qquad (7)$$
for every $x > 0$, where $\sigma^2 = \sup_{t\in T}\mathrm{E}G_t^2 < \infty$. Note that if $G$ has continuous sample paths and $T\subset\mathbb{R}$ is uncountable, (7) still holds, since we may restrict the supremum to a countable skeleton of $T$.

For $X_\theta\sim\mathrm{Gamma}(\theta, 1)$,
$$P\big(X_\theta > \theta + \sqrt{2\theta x} + x\big)\le e^{-x}, \qquad P\big(X_\theta < \theta - \sqrt{2\theta x}\big)\le e^{-x} \qquad (8)$$
for every $x > 0$; see p. 28–29 of [1]. We also denote by $P_Z$ the conditional probability given $Z_1,\dots,Z_n$.

Proof of Theorem 2.
Recall that $F\,|\,Z_1,\dots,Z_n = P_n(-\infty,\cdot]$ and let $\bar F\,|\,Z_1,\dots,Z_n = \bar P_n(-\infty,\cdot]$ for $\bar P_n\sim\mathrm{DP}(n\mathbb{P}_n)$. Using the representation (3), conditionally on $Z_1,\dots,Z_n$,
$$\|\sqrt n(F - \mathbb{F}_n) - \sqrt n(\bar F - \mathbb{F}_n)\|_\infty = \sup_{t\in\mathbb{R}}\sqrt n\,\big|(P_n - \bar P_n)1_{(-\infty,t]}\big| \le 2\sqrt n\,V_n,$$
where $V_n\sim\mathrm{Beta}(|\nu|, n)$ is independent of $\bar F$. The random variable $V_n$ is equal in distribution to $X/(X+Y_n)$, where $X\sim\mathrm{Gamma}(|\nu|, 1)$ and $Y_n\sim\mathrm{Gamma}(n, 1)$ are independent. Applying (8) gives $P(Y_n < Cn)\le e^{-x}$ for $C = 1 - \sqrt{2/3} > 0$ and $0 < x\le n/3$, and then that $P\big(X/Y_n > (Cn)^{-1}(|\nu| + \sqrt{2|\nu|x} + x)\big)\le 2e^{-x}$ for all $0 < x\le n/3$. For $x\ge n/3$, we have the trivial probability bound
$$P\big(X/(X+Y_n)\ge(Cn)^{-1}(|\nu| + \sqrt{2|\nu|x} + x)\big) \le P\big(X/(X+Y_n)\ge 1\big) = 0.$$
Combining the above and using $2\sqrt{|\nu|x}\le|\nu| + x$,
$$P_Z\Big(\|\sqrt n(F - \mathbb{F}_n) - \sqrt n(\bar F - \mathbb{F}_n)\|_\infty \ge C'n^{-1/2}(|\nu| + x)\Big) \le 2e^{-x} \qquad (9)$$
for all $x > 0$ and a universal constant $C' > 0$. It therefore remains to show the desired exponential inequality with $\sqrt n(\bar F - \mathbb{F}_n)$ instead of $\sqrt n(F - \mathbb{F}_n)$.

Let $U_1,\dots,U_{n-1}\sim U(0,1)$ be i.i.d. and independent of $(Z_i)_{i\ge1}$, and denote the corresponding order statistics by $0 = U_{(0)} < U_{(1)} < \cdots < U_{(n-1)} < U_{(n)} = 1$. For given $Z_1,\dots,Z_n$, the Bayesian bootstrap posterior distribution can be represented in law as $\bar P_n = \sum_{i=1}^n(U_{(i)} - U_{(i-1)})\delta_{Z_{(i)}}$, where $Z_{(1)}\le\cdots\le Z_{(n)}$ are the order statistics of the sample and we have used the exchangeability of $(U_{(i)} - U_{(i-1)}: 1\le i\le n)$. Thus
$$\bar F(z) = \sum_{i=1}^n(U_{(i)} - U_{(i-1)})\,1\{Z_{(i)}\le z\}. \qquad (10)$$
Define the empirical quantile function $Q_{n-1}(t)$ of the $U_i$'s by $Q_{n-1}(t) = U_{(i)}$ if $\frac{i-1}{n-1} < t\le\frac{i}{n-1}$, $i = 1, 2,\dots,n-1$, and let $q_{n-1}(t) = \sqrt{n-1}\,\big(Q_{n-1}(t) - t\big)$ be the uniform quantile process. By Theorem 1 of Csörgő and Révész [5], one can define for each $n$ a Brownian bridge $\{\tilde B_n(t): 0\le t\le 1\}$ on the same probability space such that for all $x\ge 0$,
$$P\Big(\sup_{0\le t\le 1}\big|q_n(t) - \tilde B_n(t)\big| \ge \frac{c_1\log n + x}{\sqrt n}\Big) \le c_2 e^{-c_3 x}, \qquad (11)$$
where $c_1, c_2, c_3$ are universal constants. Since these Brownian bridges are constructed based on $(U_i)_{i\ge1}$, which are independent of $(Z_i)_{i\ge1}$, they may also be taken to be independent of $(Z_i)_{i\ge1}$. Setting $B_n = \tilde B_{n-1}$ and following [19], and noting that between consecutive order statistics of the sample both $\bar F - \mathbb{F}_n$ and $B_n(\mathbb{F}_n)$ are constant,
$$\|\sqrt n(\bar F - \mathbb{F}_n) - B_n(\mathbb{F}_n)\|_\infty = \max_{1\le i\le n}\Big|\sqrt n\big(U_{(i)} - \tfrac in\big) - B_n\big(\tfrac in\big)\Big| \le I_B + II_B + III_B, \qquad (12)$$
where, writing $\sqrt n\big(U_{(i)} - \tfrac in\big) = \sqrt{\tfrac{n}{n-1}}\,q_{n-1}\big(\tfrac{i}{n-1}\big) + \tfrac{i}{\sqrt n(n-1)}$,
$$I_B = \sqrt{\tfrac{n}{n-1}}\max_{1\le i\le n-1}\Big|q_{n-1}\big(\tfrac{i}{n-1}\big) - B_n\big(\tfrac{i}{n-1}\big)\Big| + \tfrac1{\sqrt n}, \quad II_B = \max_{1\le i\le n-1}\Big|B_n\big(\tfrac{i}{n-1}\big) - B_n\big(\tfrac in\big)\Big|, \quad III_B = \Big(\sqrt{\tfrac{n}{n-1}} - 1\Big)\|B_n\|_\infty.$$
The coupling (11) (applied with $n-1$ in place of $n$) controls $I_B$: for all $x > 0$ and $n\ge 2$, $P\big(I_B\ge C(\log n + x)/\sqrt n\big)\le c_2 e^{-c_3 x}$ for a universal constant $C$. Since $B_n$ is a Brownian bridge, the variables $V_i := B_n\big(\tfrac{i}{n-1}\big) - B_n\big(\tfrac in\big)$, $i = 1,\dots,n-1$, are Gaussian with $V_i\sim N\big(0, \tfrac{i}{n(n-1)}\big(1 - \tfrac{i}{n(n-1)}\big)\big)$. Thus $\mathrm{Var}(V_i)\le 1/n$ for all $i$, and so the standard Gaussian maximal inequality, Lemma 2.3.4 of [12], yields $\mathrm{E}\max_{1\le i\le n-1}|V_i|\le C\sqrt{\log n}/\sqrt n$. Applying Borell's inequality (7), for $x > 0$,
$$P\Big(II_B\ge C\frac{\sqrt{\log n} + \sqrt x}{\sqrt n}\Big) \le e^{-cx}. \qquad (13)$$
For $III_B$, recall that for a Brownian bridge $B_n$, $P(\|B_n\|_\infty > x) = 2\sum_{k=1}^\infty(-1)^{k-1}e^{-2k^2x^2}\le 2e^{-2x^2}$ for $x > 0$, while a Taylor expansion of $h(x) = (1-x)^{-1/2}$ about $0$ gives, for some $\xi\in[0,1/n]$,
$$\sqrt{\tfrac{n}{n-1}} - 1 = \tfrac1{2n}(1-\xi)^{-3/2}\le\tfrac2n.$$
Therefore, $P(III_B\ge 2n^{-1}x)\le 2e^{-2x^2}$ for $x > 0$. Combining the bounds for $I_B$–$III_B$ via a union bound and comparing the dominating terms,
$$P_Z\Big(\|\sqrt n(\bar F - \mathbb{F}_n) - B_n(\mathbb{F}_n)\|_\infty \ge C_1\frac{\log n + x}{\sqrt n}\Big) \le C_2 e^{-C_3 x}$$
for all $x > 0$ and universal constants $C_1, C_2, C_3 > 0$. Together with (9) this gives the result.

Proof of Theorem 3. Using the exponential inequality (9), we need show only the result with $\sqrt n(\bar F - \mathbb{F}_n)$ instead of $\sqrt n(F - \mathbb{F}_n)$, where $\bar F$ is defined in (10). Let $H_n(s) = \frac1n\sum_{i=1}^n 1\{U_i\le s\}$ be the empirical distribution function of the i.i.d. random variables $U_1, U_2,\dots\sim U(0,1)$ and let $\alpha_n(s) = \sqrt n\big(H_n(s) - s\big)$, $s\in[0,1]$, be the uniform empirical process. By the KMT approximation [15], on a suitable probability space there exists a Kiefer process $\tilde K(s,t)$ with
$$P\Big(\sup_{0\le s\le 1}\big|\alpha_n(s) - n^{-1/2}\tilde K(s,n)\big| \ge \frac{C(\log n)^2 + x\log n}{\sqrt n}\Big) \le De^{-cx} \qquad (14)$$
for all $n\ge 2$ and $x > 0$, where $C, c, D > 0$ are universal constants. We take as Kiefer process $K(s,t) = -\tilde K(s,t)$. Arguing as in (12),
$$\|\sqrt n(\bar F - \mathbb{F}_n) - n^{-1/2}K(\mathbb{F}_n, n)\|_\infty \le I_K + II_K,$$
where $I_K$ contains the coupling error (14) together with the error from evaluating the Kiefer process at the uniform order statistics rather than at the grid points; since $\{n^{-1/2}K(s,n): s\in[0,1]\}$ is a Brownian bridge for each $n\ge 2$, we use the first inequality in Lemma 6 to deal with this second contribution. Together these yield
$$P_Z\Big(I_K \ge C_1\frac{(\log n)^2 + x\log n}{\sqrt n} + C_2\frac{\sqrt{\log n}\,x^{3/4}}{n^{1/4}}\Big) \le C_3 e^{-C_4 x}$$
for all $x > 0$, $n\ge 2$ and universal constants $C_1$–$C_4 > 0$. The term $II_K$ is decomposed further as $II_K\le II_K^{(1)} + II_K^{(2)} + II_K^{(3)}$. Since $\{n^{-1/2}K(s,n): s\in[0,1]\}$ is a Brownian bridge, $II_K^{(1)}$ is equal in distribution to $II_B$, for which we use the inequality (13). The remaining terms $II_K^{(2)}$ and $II_K^{(3)}$ are maxima of finitely many centered Gaussian variables and are bounded, for $n\ge 2$, using the Gaussian maximal inequality, Lemma 2.3.4 of [12], and Borell's inequality (7). Together these give, for all $x > 0$,
$$P\Big(II_K\ge C\frac{\sqrt{\log n} + \sqrt x}{\sqrt n}\Big) \le C'e^{-C''x}.$$
Using the exponential inequalities for $I_K$ and $II_K$, a union bound and the inequality $x^{3/4}\lesssim\log n + x$,
$$P_Z\Big(\|\sqrt n(\bar F - \mathbb{F}_n) - n^{-1/2}K(\mathbb{F}_n, n)\|_\infty \ge C_1\frac{(\log n)^2 + x\log n}{\sqrt n} + C_2\frac{\sqrt{\log n}\,x^{3/4}}{n^{1/4}}\Big) \le C_3 e^{-C_4 x}$$
for all $x > 0$ and universal constants $C_1$–$C_4 > 0$. The first term in the threshold dominates if and only if $x\le D(\log n)^2/n^{1/3}$ for a universal constant $D > 0$. For such $x$, the right-hand side $C_3e^{-C_4x}$ is bounded below by a universal positive constant for all $n\ge 2$, and hence can be made larger than 1 by taking $C_3$ universal and large enough; the last display is thus trivially satisfied for such $x$. This implies
$$P_Z\Big(\|\sqrt n(\bar F - \mathbb{F}_n) - n^{-1/2}K(\mathbb{F}_n, n)\|_\infty \ge C_1\frac{x\log n}{\sqrt n} + C_2\frac{\sqrt{\log n}\,x^{3/4}}{n^{1/4}}\Big) \le C_3 e^{-C_4 x}$$
for all $x > 0$ and universal constants $C_1$–$C_4 > 0$. Together with (9) this yields the result.

Proof of Corollary 1. That $P(A_{n,y})\ge 1 - 2e^{-2y^2}$ follows from the Dvoretzky–Kiefer–Wolfowitz–Massart inequality. Let $\{B_n: n\ge 1\}$ be the Brownian bridges from Theorem 2. By the triangle inequality,
$$\big|\sqrt n(F - \mathbb{F}_n)(z) - B_n(F_0(z))\big| \le \big|\sqrt n(F - \mathbb{F}_n)(z) - B_n(\mathbb{F}_n(z))\big| + \big|B_n(\mathbb{F}_n(z)) - B_n(F_0(z))\big|,$$
and the exponential inequality for the first term follows from Theorem 2. Since $\{B_n: n\ge 1\}$ are independent of $(Z_i)_{i\ge1}$ by Theorem 2, applying the second inequality in Lemma 6 gives
$$P\Big(\sup_{z\in\mathbb{R}}\big|B_n(\mathbb{F}_n(z)) - B_n(F_0(z))\big| \ge K\sqrt y\,n^{-1/4}\big(\sqrt{\log n} + \sqrt x\big)\ \Big|\ Z_1,\dots,Z_n\Big)1_{A_{n,y}} \le e^{-x}$$
for all $x > 0$ and a universal constant $K > 0$. The result follows by a union bound.

Lemma 6. Let $B = \{B_t: t\in[0,1]\}$ be a Brownian bridge and let $\mathbb{F}_n$ be the empirical distribution function of $Z_1,\dots,Z_n\sim^{iid}F_0$. Then there exists a universal constant $K > 0$ such that, for $n\ge 2$ and every $x > 0$,
$$P\Big(\sup_{z\in\mathbb{R}}\big|B_{\mathbb{F}_n(z)} - B_{F_0(z)}\big| \ge K\frac{\sqrt{\log n}}{n^{1/4}}\,x^{3/4}\Big) \le 4e^{-x}.$$
If $B$ is independent of $Z_1,\dots,Z_n$, then there also exists $K > 0$ such that, for $n\ge 2$ and every $x > 0$,
$$P\Big(\sup_{z\in\mathbb{R}}\big|B_{\mathbb{F}_n(z)} - B_{F_0(z)}\big| \ge K\|\mathbb{F}_n - F_0\|_\infty^{1/2}\big(\sqrt{\log n} + \sqrt x\big)\ \Big|\ Z_1,\dots,Z_n\Big) \le e^{-x}.$$

Proof. The intrinsic metric of the Brownian bridge is bounded above by the square root of the Euclidean distance, whence its metric entropy integral at scale $\delta$ is a multiple of $J(\delta) := \sqrt\delta\max\big(\sqrt{\log(1/\delta)}, 1\big)$. Therefore, by Dudley's theorem (see [12], Theorem 2.3.8),
$$\mathrm{E}\sup_{s,t}\big[|B_s - B_t|/J(|s-t|)\big] < \infty.$$
Because the process $(s,t)\mapsto(B_s - B_t)/J(|s-t|)$ is centered Gaussian with uniformly bounded variance, we can apply Borell's inequality (7) to see that there exist constants $D, E > 0$ such that, for every $\delta > 0$ and $y > 0$,
$$P\Big(\sup_{|s-t|\le\delta}|B_s - B_t|\ge(E + y)J(\delta)\Big) \le e^{-Dy^2}.$$
Consequently there exists a constant $C > 0$ such that, for every $\delta > 0$ and $y > 0$, $P\big(\sup_{|s-t|\le\delta}|B_s - B_t|\ge yJ(\delta)\big)\le 2e^{-Cy^2}$: for $y > 2E$ this follows from the preceding display, and by making $C$ if necessary still smaller we can ensure that the right side is bigger than 1 for $y\le 2E$, so that the inequality is valid for every $y > 0$. Furthermore, by the Dvoretzky–Kiefer–Wolfowitz inequality, for every $y > 0$,
$$P\Big(\sup_{z\in\mathbb{R}}|\mathbb{F}_n(z) - F_0(z)| > y\Big) \le 2e^{-2ny^2}.$$
Combining these two inequalities, we see that, for every $y_1, y_2 > 0$,
$$P\Big(\sup_{z\in\mathbb{R}}\big|B_{\mathbb{F}_n(z)} - B_{F_0(z)}\big| \ge y_1 J(y_2)\Big) \le 2e^{-Cy_1^2} + 2e^{-2ny_2^2}.$$
We choose $y_1 = \sqrt{x/C}$ and $y_2 = \sqrt{x/n}$ to reduce the right side to at most $4e^{-x}$, and then have $y_1 J(y_2)\le K'x^{3/4}\max\big(\sqrt{\log(n/x)}, 1\big)/n^{1/4}$, for some constant $K' > 0$. For $x < \log 2$ the asserted inequality is trivial, since then $4e^{-x} > 1$; for $x\ge\log 2$ we have $\max\big(\sqrt{\log(n/x)}, 1\big)\le K''\sqrt{\log n}$, for some constant $K'' > 0$ and every $n\ge 2$. The first inequality of the lemma follows.

For the second inequality of the lemma, note that
$$\mathrm{E}_B\Big[\sup_z\big|B_{\mathbb{F}_n(z)} - B_{F_0(z)}\big|\,\Big|\,Z_1,\dots,Z_n\Big] \lesssim J\big(\|\mathbb{F}_n - F_0\|_\infty\big) = \sqrt{\|\mathbb{F}_n - F_0\|_\infty}\,\eta_n, \qquad \eta_n := \max\big(\sqrt{\log(1/\|\mathbb{F}_n - F_0\|_\infty)}, 1\big),$$
and $\sup_z\mathrm{Var}_B\big(B_{\mathbb{F}_n(z)} - B_{F_0(z)}\,|\,Z_1,\dots,Z_n\big)\lesssim\|\mathbb{F}_n - F_0\|_\infty$. Therefore, by Borell's inequality (7), there exists $K > 0$ such that
$$P\Big(\sup_{z\in\mathbb{R}}\frac{|B_{\mathbb{F}_n(z)} - B_{F_0(z)}|}{\sqrt{\|\mathbb{F}_n - F_0\|_\infty}} \ge K(\eta_n + y)\ \Big|\ Z_1,\dots,Z_n\Big) \le e^{-y^2}.$$
We conclude by noting that $\liminf_n\sqrt{2n/\log\log n}\,\|\mathbb{F}_n - F_0\|_\infty = \pi/2 > 0$, almost surely, so that $\eta_n\lesssim\sqrt{\log n}$, a.s.
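The uniform-spacings representation (10) used in the proof of Theorem 2 is also convenient for simulation. The sketch below (illustrative only; the sample, evaluation point and sizes are made up) checks that it produces the same posterior law for $\bar F(z_0)$ as the normalized-exponential-weights representation underlying (3) with $\nu = 0$.

```python
import numpy as np

rng = np.random.default_rng(4)

n, B = 200, 20000
Z = np.sort(rng.standard_normal(n))        # ordered sample Z_(1) <= ... <= Z_(n)
z0 = 0.0                                   # evaluation point
k = np.searchsorted(Z, z0, side="right")   # number of observations <= z0

# (10): spacings of n-1 uniform order statistics give the Bayesian bootstrap weights
U = np.sort(rng.uniform(size=(B, n - 1)), axis=1)
spacings = np.diff(np.concatenate([np.zeros((B, 1)), U, np.ones((B, 1))], axis=1), axis=1)
F_bar_spacings = spacings[:, :k].sum(axis=1)       # \bar F(z0) via (10)

# equivalent representation: Dirichlet(1, ..., 1) weights = normalized Exp(1) variables
E = rng.standard_exponential((B, n))
W = E / E.sum(axis=1, keepdims=True)
F_bar_exp = W[:, :k].sum(axis=1)

print(F_bar_spacings.mean(), F_bar_exp.mean())     # both close to k/n
print(F_bar_spacings.var(), F_bar_exp.var())       # both close to (k/n)(1 - k/n)/(n + 1)
```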
Lemma 7. Consider the setting of Theorem 2 and let $\Delta_n = \|\sqrt n(F - \mathbb{F}_n) - B_n(\mathbb{F}_n)\|_\infty$. Then for any $t\in\mathbb{R}$ and every sequence $Z_1, Z_2,\dots$, as $n\to\infty$,
$$\mathrm{E}\big[e^{t\Delta_n}\,|\,Z_1,\dots,Z_n\big] \to 1.$$

Proof. Suppose $t > 0$. For $\alpha_n = C_1(\log n + |\nu|)/\sqrt n$, with $C_1$ the universal constant from Theorem 2, and using the change of variable $u = e^{t\alpha_n + tx/\sqrt n}$,
$$\mathrm{E}[e^{t\Delta_n}\,|\,Z_1,\dots,Z_n] \le e^{t\alpha_n} + \int_{e^{t\alpha_n}}^\infty P\big(e^{t\Delta_n}\ge u\,|\,Z_1,\dots,Z_n\big)\,du = e^{t\alpha_n} + \frac{te^{t\alpha_n}}{\sqrt n}\int_0^\infty P\Big(\Delta_n\ge\frac{C_1(\log n + |\nu|) + x}{\sqrt n}\,\Big|\,Z_1,\dots,Z_n\Big)e^{tx/\sqrt n}\,dx.$$
Using Theorem 2 and that $C_3 - t/\sqrt n > 0$ for $n$ large enough,
$$\mathrm{E}[e^{t\Delta_n}\,|\,Z_1,\dots,Z_n] \le e^{t\alpha_n} + \frac{te^{t\alpha_n}}{\sqrt n}\int_0^\infty C_2 e^{-(C_3 - t/\sqrt n)x}\,dx = e^{t\alpha_n}\Big(1 + \frac{C_2 t}{C_3\sqrt n - t}\Big) \to 1$$
as $n\to\infty$. Since $\Delta_n\ge 0$, also $\mathrm{E}[e^{t\Delta_n}\,|\,Z_1,\dots,Z_n]\ge 1$ for $t > 0$, which proves the claim in this case. For $t < 0$ we have $e^{t\Delta_n}\le 1$, while by Jensen's inequality $\mathrm{E}[e^{t\Delta_n}\,|\,Z_1,\dots,Z_n]\ge e^{t\mathrm{E}[\Delta_n|Z_1,\dots,Z_n]}\to 1$, since $\mathrm{E}[\Delta_n\,|\,Z_1,\dots,Z_n]\le\mathrm{E}[e^{\Delta_n} - 1\,|\,Z_1,\dots,Z_n]\to 0$ by the case $t = 1$.

For completeness, we include the proof that (1) implies $\sqrt n(P_n - \mathbb{P}_n)g\,|\,Z_1,\dots,Z_n\rightsquigarrow N\big(0, P(g - Pg)^2\big)$ for every $g\in\mathcal{G}$ and $P^\infty$-almost every sequence $Z_1, Z_2,\dots$.

Lemma 8. If $Y_n$ are random variables with $\mathrm{E}\,e^{tY_n}\to e^{t^2\sigma^2/2}$, for every $t$ in a subset of $\mathbb{R}$ that contains both a strictly increasing sequence with limit $0$ and a strictly decreasing sequence with limit $0$, then $Y_n\rightsquigarrow N(0,\sigma^2)$.

Proof. Let $T$ be the given set of points and let $a < 0 < b$ be elements of $T$. Because $\mathrm{E}\,e^{tY_n}$ is bounded in $n$, for both $t = a$ and $t = b$, the sequence $Y_n$ is tight, by Markov's inequality. For every $t\in T$ strictly between $a$ and $b$, some power larger than 1 of the variable $e^{tY_n}$ is bounded in $L_1$, and hence the sequence $e^{tY_n}$ is uniformly integrable. Consequently, if $Y$ is a weak limit point of $Y_n$, then $\mathrm{E}\,e^{tY_n}$ tends to $\mathrm{E}\,e^{tY}$ along the corresponding subsequence, for every $t\in(a,b)\cap T$. In view of the assumption of the lemma, it follows that $\mathrm{E}\,e^{tY} = e^{t^2\sigma^2/2}$ for these $t$. The set $(a,b)\cap T$ is infinite by assumption and has $0$ as an accumulation point. Finiteness of $\mathrm{E}\,e^{tY}$ for $t\in\{a,b\}$ implies that the function $z\mapsto\mathrm{E}\,e^{zY}$ is analytic in the open strip $\{z: a < \mathrm{Re}\,z < b\}$, which contains the imaginary axis. By analytic continuation it is equal to $e^{z^2\sigma^2/2}$ there, whence $\mathrm{E}\,e^{isY} = e^{-s^2\sigma^2/2}$, for every $s\in\mathbb{R}$. Thus every weak limit point of the tight sequence $Y_n$ is distributed as $N(0,\sigma^2)$, which proves the lemma.

Corollary 2. If $(Y_n, Z_n)$ are random elements with $\mathrm{E}\big(e^{tY_n}\,|\,Z_n\big)\to e^{t^2\sigma^2/2}$, in probability, for every $t$ in a set that contains both a strictly increasing sequence with limit $0$ and a strictly decreasing sequence with limit $0$, then $Y_n\,|\,Z_n\rightsquigarrow N(0,\sigma^2)$, in probability. If the convergence in the assumption is in the almost sure sense, then the conclusion is also true in the almost sure sense.

Proof. For the conclusion in probability it suffices to show that every subsequence of $\{n\}$ has a further subsequence along which $d\big(\mathcal{L}(Y_n\,|\,Z_n), N(0,\sigma^2)\big)\to 0$, almost surely, where $d$ is a metric defining weak convergence. From the assumption we know that every subsequence has a further subsequence with $\mathrm{E}(e^{tY_n}\,|\,Z_n)\to e^{t^2\sigma^2/2}$, almost surely. For a countable set of $t$, we can construct a single subsequence with this property for every $t$, by a diagonalization scheme. The preceding lemma gives that $d\big(\mathcal{L}(Y_n\,|\,Z_n), N(0,\sigma^2)\big)\to 0$, almost surely, along this subsequence.

References

[1] Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford, 2013. With a foreword by Michel Ledoux.
[2] Castillo, I. On Bayesian supremum norm contraction rates. Ann. Statist. 42, 5 (2014), 2058–2091.
[3] Castillo, I., and Nickl, R. On the Bernstein–von Mises phenomenon for nonparametric Bayes procedures. Ann. Statist. 42, 5 (2014), 1941–1969.
[4] Castillo, I., and Rousseau, J. A Bernstein–von Mises theorem for smooth functionals in semiparametric models. Ann. Statist. 43, 6 (2015), 2353–2383.
[5] Csörgő, M., and Révész, P. Strong approximations of the quantile process. Ann. Statist. 6, 4 (1978), 882–894.
[6] Csörgő, M., and Révész, P. Strong Approximations in Probability and Statistics. Probability and Mathematical Statistics. Academic Press, New York–London, 1981.
[7] Csörgő, S., and Hall, P. The Komlós–Major–Tusnády approximations and their applications. Austral. J. Statist. 26, 2 (1984), 189–218.
[8] Deheuvels, P. On the approximation of quantile processes by Kiefer processes. J. Theoret. Probab. 11, 4 (1998), 997–1018.
[9] Dudley, R. M. Real Analysis and Probability, vol. 74 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2002. Revised reprint of the 1989 original.
[10] Ferguson, T. Prior distributions on spaces of probability measures. Ann. Statist. 2 (1974), 615–629.
[11] Ghosal, S., and van der Vaart, A. W. Fundamentals of Nonparametric Bayesian Inference. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2017.
[12] Giné, E., and Nickl, R. Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, New York, 2016.
[13] Gu, J., and Ghosal, S. Strong approximations for resample quantile processes and application to ROC methodology. J. Nonparametr. Stat. 20, 3 (2008), 229–240.
[14] James, L. Large sample asymptotics for the two-parameter Poisson–Dirichlet process. In Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, vol. 3 of Inst. Math. Stat. Collect. Inst. Math. Statist., Beachwood, OH, 2008, pp. 187–199.
[15] Komlós, J., Major, P., and Tusnády, G. An approximation of partial sums of independent RV's and the sample DF. I. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete 32 (1975), 111–131.
[16] Ledoux, M. The Concentration of Measure Phenomenon, vol. 89 of Mathematical Surveys and Monographs. American Mathematical Society, Providence, RI, 2001.
[17] Lo, A. Y. Weak convergence for Dirichlet processes. Sankhyā Ser. A 45, 1 (1983), 105–111.
[18] Lo, A. Y. A remark on the limiting posterior distribution of the multiparameter Dirichlet process. Sankhyā Ser. A 48, 2 (1986), 247–249.
[19] Lo, A. Y. A large sample study of the Bayesian bootstrap. Ann. Statist. 15, 1 (1987), 360–375.
[20] Monard, F., Nickl, R., and Paternain, G. P. Efficient nonparametric Bayesian inference for X-ray transforms. Ann. Statist. 47, 2 (2019), 1113–1147.
[21] Nickl, R. Bernstein–von Mises theorems for statistical inverse problems I: Schrödinger equation. J. Eur. Math. Soc. (JEMS) 22, 8 (2020), 2697–2750.
[22] Nickl, R., and Ray, K. Nonparametric statistical inference for drift vector fields of multi-dimensional diffusions. Ann. Statist. 48, 3 (2020), 1383–1408.
[23] Nickl, R., and Söhl, J. Bernstein–von Mises theorems for statistical inverse problems II: compound Poisson processes. Electron. J. Stat. 13, 2 (2019), 3513–3571.
[24] Ray, K. Adaptive Bernstein–von Mises theorems in Gaussian white noise. Ann. Statist. 45, 6 (2017), 2511–2536.
[25] Ray, K., and Szabó, B. Debiased Bayesian inference for average treatment effects. In Advances in Neural Information Processing Systems 33. 2019.
[26] Ray, K., and van der Vaart, A. W. Semiparametric Bayesian causal inference. Ann. Statist., to appear.
[27] Rivoirard, V., and Rousseau, J. Bernstein–von Mises theorem for linear functionals of the density. Ann. Statist. 40, 3 (2012), 1489–1523.
[28] Royden, H., and Fitzpatrick, P. Real Analysis. Prentice Hall, 2010.
[29] Shorack, G. R., and Wellner, J. A. Empirical Processes with Applications to Statistics. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York, 1986.
[30] van der Vaart, A., and Wellner, J. A. Preservation theorems for Glivenko–Cantelli and uniform Glivenko–Cantelli classes. In High Dimensional Probability, II (Seattle, WA, 1999), vol. 47 of Progr. Probab. Birkhäuser Boston, Boston, MA, 2000, pp. 115–133.
[31] van der Vaart, A. W., and Wellner, J. A. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer-Verlag, New York, 1996.