The Bernstein-von Mises theorem for the Pitman-Yor process of nonnegative type
S.E.M.P. Franssen∗, Mathematical Institute, Leiden University, e-mail: [email protected]
and
A.W. van der Vaart∗, Mathematical Institute, Leiden University, e-mail: [email protected]
Abstract:
The Pitman-Yor process is a nonparametric species sampling prior under which the number of different species is of the order $n^\sigma$, for some $\sigma\in(0,1)$. We propose a bias correction and show that, after correcting for the bias, the posterior distribution is asymptotically normal. Without the bias correction the coverage of credible sets can be arbitrarily low, and we illustrate this finding with simulations in which we compare the coverage of corrected and uncorrected credible sets.

MSC2020 subject classifications:
Primary 62G20; secondary 62G15.
Keywords and phrases:
Pitman-Yor process, Bernstein-von Mises theorem, weak convergence.
1. Introduction
The Pitman-Yor process, also known as the two-parameter Poisson-Dirichlet process, is a random probability distribution, which is used as a prior distribution in a nonparametric Bayesian analysis. It is a discrete distribution with an expected number of atoms of the order $n^\sigma$ [Kar67]. Because of this power-law behaviour, Pitman-Yor processes are a popular choice when a large number of clusters is expected, for instance in genetics or topic modelling [ea09, Teh06, GGJ05, ADBP18].

We define the Pitman-Yor process through its stick-breaking construction. We restrict to the Pitman-Yor processes of nonnegative type.

∗ The research leading to these results is partly financed by the NWO Spinoza prize awarded to A.W. van der Vaart by the Netherlands Organisation for Scientific Research (NWO).

Definition 1.1.
Let $\sigma\in[0,1)$, $M>-\sigma$ and let $G$ be an atomless probability distribution. We say that $P$ is a Pitman-Yor process (of nonnegative type), denoted $P\sim\mathrm{PY}(\sigma,M,G)$, if $P$ can be represented as
\[
P=\sum_{i=1}^\infty W_i\delta_{\theta_i},
\]
where $\theta_i\overset{iid}{\sim}G$ and $W_i=V_i\prod_{j=1}^{i-1}(1-V_j)$, for $V_i\overset{ind}{\sim}\mathrm{Beta}(1-\sigma,M+i\sigma)$.

In the special case $\sigma=0$ the $\mathrm{PY}(0,M,G)$ process is the Dirichlet process. In this special case the results we develop in this paper are already known, and we therefore focus on the case $\sigma>0$.

Let $P$ follow the Pitman-Yor process and, given $P$, let the observations $X_1,\dots,X_n$ be an independent and identically distributed sample from $P$. In a Bayesian analysis the conditional distribution of $P$ given $X_1,\dots,X_n$, called the posterior distribution of $P$, is used for inference. It was shown in [Jam08, dBea15] that if $X_1,\dots,X_n$ are in reality i.i.d. from a measure $P_0$ with a decomposition $P_0=(1-\lambda)P_d+\lambda P_c$ into a discrete distribution $P_d$ and an atomless distribution $P_c$, then
\[
P\,|\,X_1,\dots,X_n \rightsquigarrow \delta_{(1-\lambda)P_d+\lambda(1-\sigma)P_c+\sigma\lambda G},
\]
where $\delta_Q$ denotes the Dirac measure at the probability distribution $Q$. In particular, the limit is $P_0$ if $P_0$ is discrete ("posterior consistency"), but the posterior is inconsistent otherwise, unless $G=P_c$. These authors refined the asymptotics in the case of an atomless distribution to convergence in distribution of the rescaled and recentered posterior distribution:
\[
\sqrt n\bigl(P-(1-\sigma)\mathbb P_n-\sigma G\bigr)\,\big|\,X_1,\dots,X_n \rightsquigarrow \sqrt{1-\sigma}\,\mathbb G_{P_0}+\sqrt{\sigma(1-\sigma)}\,\mathbb G_G+\sqrt{\sigma(1-\sigma)}\,Z(P_0-G),
\]
where $\mathbb P_n$ is the empirical distribution of $X_1,\dots,X_n$, $\mathbb G_{P_0}$ is a $P_0$-Brownian bridge, and $Z\sim N(0,1)$ is independent of the other processes. In this paper we extend these results to a general true distribution $P_0$, in particular to discrete $P_0$, where the posterior is consistent. Bernstein-von Mises theorems are important for the interpretation of Bayesian credible sets: they allow translating Bayesian credible sets into asymptotic confidence sets.

The main consequence of our result is that in the case of a discrete $P_0$, where the Pitman-Yor posterior is consistent, there are two regimes:
• If the number of distinct observations $K_n$ satisfies $K_n/\sqrt n\to 0$ $P_0$-almost surely, then
\[
\sqrt n(P-\mathbb P_n)\,|\,X_1,\dots,X_n \rightsquigarrow \mathbb G_{P_0}.
\]
In this regime the Pitman-Yor posterior satisfies the same Bernstein-von Mises result as the Dirichlet posterior and the Bayesian inference is correct. We are in this regime if the sizes of the atoms decrease fast enough, and in particular if $P_0$ is finitely discrete.
• If the weights of the atoms of $P_0$ decrease too slowly, then the Bernstein-von Mises theorem typically fails, and Bayesian credible sets will not be frequentist confidence sets, due to an overly large bias.

Another consequence of the main result is that, if the true distribution is discrete, we can deterministically de-bias the posterior to alter the credible sets so that they have asymptotically correct frequentist coverage.
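To make Definition 1.1 concrete, the following minimal sketch (ours, not part of the paper; it assumes numpy, and the function name and truncation level are our own choices) samples a truncation of the stick-breaking series:

```python
import numpy as np

def sample_pitman_yor(sigma, M, base_sampler, n_atoms=1000, rng=None):
    """Truncated stick-breaking draw from PY(sigma, M, G) as in Definition 1.1.

    base_sampler(size, rng) must return `size` i.i.d. draws from the atomless
    base measure G; the infinite stick-breaking sum is truncated at `n_atoms`.
    """
    rng = np.random.default_rng() if rng is None else rng
    i = np.arange(1, n_atoms + 1)
    V = rng.beta(1.0 - sigma, M + i * sigma)                    # V_i ~ Beta(1 - sigma, M + i*sigma)
    W = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))   # W_i = V_i * prod_{j<i} (1 - V_j)
    theta = base_sampler(n_atoms, rng)                          # theta_i i.i.d. ~ G
    return theta, W

# Example with G = N(1, 1); the leftover mass 1 - W.sum() is the truncation error.
theta, W = sample_pitman_yor(0.5, 1.0, lambda size, rng: rng.normal(1.0, 1.0, size))
```

Note that for larger $\sigma$ the weights decay more slowly, so that more atoms are needed to reach a given truncation error.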
2. Notation and previous results on the Pitman-Yor Process
The Pitman-Yor process is a particular case of a Gibbs process [dBea15], which is itself a special case of a species sampling process. The posterior distribution of the Pitman-Yor process is known; see [Pit96, Corollary 20]. We state it in the form of [GvdV17, Theorem 14.37]. The derivation of the posterior distribution depends on the invariance under size-biased permutations. Our results will depend heavily on this specific form of the posterior.
Theorem 2.1. If $P\sim\mathrm{PY}(\sigma,M,G)$ with $\sigma\ge 0$, then the posterior distribution of $P$ based on i.i.d. observations $X_1,\dots,X_n\,|\,P\sim P$ is the distribution of the random measure
\[
R_n\sum_{j=1}^{K_n}\hat W_j\delta_{\tilde X_j}+(1-R_n)Q_n,
\]
where
• $R_n\sim\mathrm{Beta}(n-K_n\sigma,\,M+K_n\sigma)$,
• $W=(\hat W_1,\dots,\hat W_{K_n})\sim\mathrm{Dir}(K_n;N_{1,n}-\sigma,\dots,N_{K_n,n}-\sigma)$, and
• $Q_n\sim\mathrm{PY}(\sigma,M+\sigma K_n,G)$
are all independently distributed. Here $\tilde X_1,\dots,\tilde X_{K_n}$ are the distinct observations in $X_1,\dots,X_n$, $N_{1,n},\dots,N_{K_n,n}$ are their multiplicities, and $K_n$ is the total number of distinct observations.

We will compare the posterior distribution with the empirical distribution $\mathbb P_n=n^{-1}\sum_{i=1}^n\delta_{X_i}$. A class of functions $\mathcal F$ is called $P$-Donsker, for a probability distribution $P$, if $\sqrt n(\mathbb P_n-P)\rightsquigarrow\mathbb G_P$ in $\ell^\infty(\mathcal F)$, where $\mathbb G_P$ is a tight Gaussian element in $\ell^\infty(\mathcal F)$. This is a mean-zero Gaussian process with covariance function $\mathrm E[\mathbb G_Pf_1\,\mathbb G_Pf_2]=P(f_1-Pf_1)(f_2-Pf_2)$.

We begin by introducing some notation. For any stochastic process $Y:T\to\mathbb R$ we denote $\|Y\|_T=\sup_{t\in T}|Y(t)|$. For a given semimetric $\rho$ we write $\mathcal F_\delta=\{f-g:f,g\in\mathcal F,\rho(f,g)\le\delta\}$. For two sequences $a_n,b_n$ we write $a_n\lesssim b_n$ if there exists a constant $C$ such that $a_n\le Cb_n$ for all $n\in\mathbb N$. We say that a Donsker class has a bounded envelope function if $\sup_{f\in\mathcal F}\|f\|_\infty=M<\infty$. For a given function $f:\mathbb R\to\mathbb R$, we denote $\mathrm E_Pf=\int f(x)\,dP(x)$ and $\mathrm{Var}_P(f)=\mathrm E_P[(f-\mathrm E_Pf)^2]$. If $X$ is a function from a probability space to the real numbers, we define $\mathrm E^*X$ to be the outer expectation of $X$ and, whenever it exists, $X^*$ to be the minimal measurable majorant of $X$.

Conditional weak convergence in probability is best understood using the bounded Lipschitz metric; see for example [WvdV96], Chapter 1.12. Weak convergence $\mathbb G_n\rightsquigarrow\mathbb G$ of a sequence of random elements $\mathbb G_n$ in $\ell^\infty(\mathcal F)$ to a separable limit $\mathbb G$ is equivalent to convergence in the bounded Lipschitz metric. Let $\mathrm{BL}_1$ be the set of all functions $h:\ell^\infty(\mathcal F)\to[0,1]$ such that $|h(z_1)-h(z_2)|\le\|z_1-z_2\|_{\mathcal F}$, for all $z_1,z_2\in\ell^\infty(\mathcal F)$. Then the bounded Lipschitz metric is
\[
d_{\mathrm{BL}}(\mathbb G_n,\mathbb G)=\sup_{h\in\mathrm{BL}_1}\bigl|\mathrm E^*h(\mathbb G_n)-\mathrm Eh(\mathbb G)\bigr|.
\]
We can then express conditional convergence as convergence to zero of the bounded Lipschitz distance between the conditional distribution and the limit. By conditional convergence in probability, respectively almost surely, we mean that the bounded Lipschitz distance to the limiting distribution converges to zero in outer probability, respectively outer almost surely.
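The representation in Theorem 2.1 translates directly into a posterior sampler. The sketch below is our own illustration, not the authors' code; it reuses the hypothetical `sample_pitman_yor` helper from Section 1 and truncates $Q_n$:

```python
import numpy as np

def sample_posterior(x, sigma, M, base_sampler, trunc=1000, rng=None):
    """One draw of R_n sum_j W_hat_j delta_{X~_j} + (1 - R_n) Q_n from Theorem 2.1."""
    rng = np.random.default_rng() if rng is None else rng
    xt, counts = np.unique(x, return_counts=True)   # distinct values X~_j with multiplicities N_{j,n}
    n, K = len(x), len(xt)
    R = rng.beta(n - K * sigma, M + K * sigma)      # R_n ~ Beta(n - K_n*sigma, M + K_n*sigma)
    W = rng.dirichlet(counts - sigma)               # (W_hat_1, ..., W_hat_K) ~ Dir(N_{j,n} - sigma)
    q_atoms, q_w = sample_pitman_yor(sigma, M + sigma * K, base_sampler, trunc, rng)
    return np.concatenate([xt, q_atoms]), np.concatenate([R * W, (1 - R) * q_w])
```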
3. Main result
In this paper we identify the limiting distribution for the posterior distributionof the Pitman-Yor process by comparing the posterior distribution with theempirical distribution.
Theorem 3.1.
Let $P_0$ be a probability distribution and $\lambda\in[0,1]$ be such that $P_0=(1-\lambda)P_d+\lambda P_c$, where $P_d$ is a discrete probability distribution and $P_c$ is an atomless distribution. If $P$ follows a $\mathrm{PY}(\sigma,M,G)$ process, then the posterior distribution in the model $X_1,\dots,X_n\,|\,P\sim P$ satisfies
\[
\sqrt n\Bigl(P-\mathbb P_n-\frac{\sigma K_n}{n}\Bigl(G-\frac1{K_n}\sum_{i=1}^{K_n}\delta_{\tilde X_i}\Bigr)\Bigr)\Bigm|X_1,\dots,X_n \rightsquigarrow
\sqrt{1-\lambda}\,\mathbb G_{P_d}+\sqrt{(1-\sigma)\lambda}\,\mathbb G_{P_c}+\sqrt{\sigma(1-\sigma)\lambda}\,\mathbb G_G
+\sqrt{(1-\sigma\lambda)\sigma\lambda}\,Z_1\Bigl(\frac{(1-\lambda)P_d+(1-\sigma)\lambda P_c}{1-\sigma\lambda}-G\Bigr)
+\frac{\sqrt{(1-\sigma)\lambda(1-\lambda)}}{\sqrt{1-\sigma\lambda}}\,Z_2(P_c-P_d)
\]
in $\ell^\infty(\mathcal F)$, in $P_0^\infty$-probability, for every $P_0$-Donsker class of functions $\mathcal F$ for which the $\mathrm{PY}(\sigma,\sigma,G)$ process satisfies the central limit theorem in $\ell^\infty(\mathcal F)$. Here $Z_1$ and $Z_2$ are independent standard normal random variables. If in addition $P_0^*\|f-P_0f\|_{\mathcal F}^2<\infty$, this convergence is also almost sure.

We can interpret the terms appearing in the asymptotic distribution as follows:
• $\sqrt{1-\lambda}\,\mathbb G_{P_d}$ is the asymptotic distribution coming from the discrete part of the true distribution.
• $\sqrt{(1-\sigma)\lambda}\,\mathbb G_{P_c}$ is the asymptotic distribution coming from the continuous part of the true distribution.
• $\sqrt{\sigma(1-\sigma)\lambda}\,\mathbb G_G$ is the asymptotic distribution coming from the prior.
• $\sqrt{(1-\sigma\lambda)\sigma\lambda}\,Z_1\bigl(\frac{(1-\lambda)P_d+(1-\sigma)\lambda P_c}{1-\sigma\lambda}-G\bigr)$ is the balance between the prior and the true distribution.
• $\frac{\sqrt{(1-\sigma)\lambda(1-\lambda)}}{\sqrt{1-\sigma\lambda}}\,Z_2(P_c-P_d)$ is the balance between the continuous and discrete parts of the true distribution.

Let us consider the two special cases that the true distribution $P_0$ is either discrete or atomless. In the case of an atomless distribution we have $K_n=n$ with probability one, hence
\[
\frac1{K_n}\sum_{i=1}^{K_n}\delta_{\tilde X_i}=\frac1n\sum_{i=1}^n\delta_{X_i}=\mathbb P_n,
\]
and because $\lambda=1$ we recover the Bernstein-von Mises theorem of [Jam08]. In the case of a discrete $P_0$ we have $\lambda=0$ and the limiting distribution simplifies to $\mathbb G_{P_0}$, which is the correct Gaussian noise. However, even in this case the centering may not be the empirical measure, and hence we do not necessarily obtain a standard Bernstein-von Mises theorem. We have the following corollary.

Corollary 3.2.
Suppose $P_0$ is a discrete probability distribution. If $P$ follows a $\mathrm{PY}(\sigma,M,G)$ process, then the posterior distribution in the model $X_1,\dots,X_n\,|\,P\sim P$ satisfies
\[
\sqrt n\Bigl(P-\mathbb P_n-\frac{\sigma K_n}{n}\Bigl(G-\frac1{K_n}\sum_{i=1}^{K_n}\delta_{\tilde X_i}\Bigr)\Bigr)\Bigm|X_1,\dots,X_n \rightsquigarrow \mathbb G_{P_0} \tag{1}
\]
in $\ell^\infty(\mathcal F)$, almost surely $[P_0^\infty]$, for every $P_0$-Donsker class of functions $\mathcal F$ for which the $\mathrm{PY}(\sigma,\sigma,G)$ process satisfies the central limit theorem in $\ell^\infty(\mathcal F)$.

We find that credible sets need to be shifted by $-\frac{\sigma K_n}{n}\bigl(G-\frac1{K_n}\sum_{i=1}^{K_n}\delta_{\tilde X_i}\bigr)$, an observable quantity, in order to become valid asymptotic confidence sets. This correction only works in the case where the true distribution is discrete, since in the other cases the variance of the asymptotic Gaussian distribution will be wrong as well.
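For a single function $f$ the de-biasing shift in Corollary 3.2 is an explicit function of the data. A minimal sketch (ours; the helper name is hypothetical, and $G(f)$ is assumed to be available in closed form):

```python
import numpy as np

def bias_correction(x, f, sigma, G_f):
    """The shift -(sigma*K_n/n) * (G(f) - K_n^{-1} sum_j f(X~_j)) of Corollary 3.2.

    Adding this shift to both endpoints of a credible interval for P(f)
    re-centers it into an asymptotic confidence interval when the true
    distribution is discrete. `f` must accept a numpy array; `G_f` = G(f).
    """
    xt = np.unique(x)                                # distinct observations X~_1, ..., X~_{K_n}
    n, K = len(x), len(xt)
    return -(sigma * K / n) * (G_f - np.mean(f(xt)))
```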
The next corollary gives conditions on the true distribution and the Donsker class that achieve the right uncertainty quantification for discrete distributions.

Corollary 3.3. Suppose $\mathcal F$ is a bounded $P_0$-Donsker class and suppose that $K_n/\sqrt n\to 0$ in outer probability, respectively outer almost surely. Then the bias is asymptotically negligible at the $\sqrt n$ rate.

Proof. If $M$ bounds the envelope of $\mathcal F$, then
\[
\Bigl\|\sqrt n\,\frac{\sigma K_n}{n}\Bigl(G-\frac1{K_n}\sum_{i=1}^{K_n}\delta_{\tilde X_i}\Bigr)\Bigr\|_{\mathcal F}
=\sigma\frac{K_n}{\sqrt n}\Bigl\|G-\frac1{K_n}\sum_{i=1}^{K_n}\delta_{\tilde X_i}\Bigr\|_{\mathcal F}
\le 2M\sigma\frac{K_n}{\sqrt n}\to 0.
\]

Example 3.4.
Let us take $G=N(1,1)$ and $P_0=\sum_{i=1}^\infty p_i\delta_i$, a probability measure on the positive integers with $p_i=\frac{6}{\pi^2 i^2}$, and let $f=\mathbf 1_{(-\infty,1]}-\mathbf 1_{(1,\infty)}$. Since $G(f)=0$, we get
\[
\sqrt n\,\frac{\sigma K_n}{n}\Bigl(G-\frac1{K_n}\sum_{j=1}^{K_n}\delta_{\tilde X_j}\Bigr)(f)
=-\frac{\sigma}{\sqrt n}\sum_{j=1}^{K_n}f(\tilde X_j).
\]
Eventually we will observe the value 1, so eventually almost surely this is equal to
\[
-\frac{\sigma}{\sqrt n}(2-K_n)=-\frac{2\sigma}{\sqrt n}+\sigma\frac{K_n}{\sqrt n}\to\sigma\nu,
\]
almost surely, where $\nu=\lim_n K_n/\sqrt n>0$ exists by [Kar67]. Hence the bias term does not vanish at the $\sqrt n$ rate.
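A quick numerical check of Example 3.4 (ours, not from the paper; the support of $P_0$ is truncated at $10^6$ as a numerical approximation): simulate from $p_i=6/(\pi^2i^2)$ and observe that $K_n/\sqrt n$ stabilizes, so that the bias $\sigma K_n/\sqrt n$ does not vanish.

```python
import numpy as np

rng = np.random.default_rng(0)
i = np.arange(1, 10**6 + 1)
p = 6.0 / (np.pi**2 * i**2)       # p_i = 6 / (pi^2 i^2), truncated support
p /= p.sum()                      # renormalize after truncation
for n in (10**3, 10**4, 10**5):
    x = rng.choice(i, size=n, p=p)
    print(n, len(np.unique(x)) / np.sqrt(n))   # K_n / sqrt(n), roughly constant in n
```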
Remark 3.5. One can adapt the proof, using the same ideas, to arbitrary base measures $G$, by replacing the break-off point 1 by the median of $G$ and modifying the counterexample of the true distribution appropriately.
4. Mean and variance of posterior distribution
For completeness, we add explicit formulas for the mean and variance of theposterior distribution. We will prove these results in the appendix.
Lemma 4.1.
Let $P\sim\mathrm{PY}(\sigma,M,G)$, where $\sigma\ge 0$. Then the mean and variance of the posterior distribution of $P$ based on observations $X_1,\dots,X_n\,|\,P\overset{iid}{\sim}P$ are as follows:
\[
\mathrm E[P(f)\,|\,X_1,\dots,X_n]=\sum_{j=1}^{K_n}\frac{N_{j,n}-\sigma}{n+M}f(\tilde X_j)+\frac{M+\sigma K_n}{n+M}G(f),
\]
\[
\mathrm{Var}(P(f)\,|\,X_1,\dots,X_n)=\Bigl(\sum_{j=1}^{K_n}\frac{N_{j,n}-\sigma}{n-K_n\sigma}f(\tilde X_j)-G(f)\Bigr)^2\frac{(n-\sigma K_n)(M+\sigma K_n)}{(n+M)^2(n+M+1)}
-\frac{\bigl(\sum_{j=1}^{K_n}(N_{j,n}-\sigma)f(\tilde X_j)\bigr)^2}{(n-\sigma K_n)(n+M)(n+M+1)}
+\frac{\sum_{j=1}^{K_n}(N_{j,n}-\sigma)f(\tilde X_j)^2}{(n+M)(n+M+1)}
+\frac{(1-\sigma)(M+\sigma K_n+1)}{(n+M)(n+M+1)}\,\mathrm{Var}_G(f).
\]

Lemma 4.2.
Suppose $X_1,\dots,X_n\overset{iid}{\sim}P_0$, where $P_0=(1-\lambda)P_d+\lambda P_c$. If $P$ follows a $\mathrm{PY}(\sigma,M,G)$ process, then in the model $X_1,\dots,X_n\,|\,P\sim P$, $P_0$-almost surely,
\[
\mathrm E[P(f)\,|\,X_1,\dots,X_n]\to(1-\lambda)P_d(f)+(1-\sigma)\lambda P_c(f)+\lambda\sigma G(f),
\]
\[
n\,\mathrm{Var}(P(f)\,|\,X_1,\dots,X_n)\to(1-\lambda)\mathrm{Var}_{P_d}(f)+(1-\sigma)\lambda\,\mathrm{Var}_{P_c}(f)
+(1-\sigma)\sigma\lambda\,\mathrm{Var}_G(f)
+\frac{(1-\sigma)\lambda(1-\lambda)}{1-\sigma\lambda}\bigl(P_d(f)-P_c(f)\bigr)^2
+(1-\sigma\lambda)\sigma\lambda\Bigl(\frac{(1-\lambda)P_d(f)+(1-\sigma)\lambda P_c(f)}{1-\sigma\lambda}-G(f)\Bigr)^2.
\]

Here we can recognize the variances as stated in Theorem 3.1.
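For reference, the exact posterior moments of Lemma 4.1 are straightforward to evaluate numerically. The following is our transcription of the lemma (assuming numpy; `G_f` and `varG_f` stand for the known values $G(f)$ and $\mathrm{Var}_G(f)$):

```python
import numpy as np

def posterior_mean_var(x, f, sigma, M, G_f, varG_f):
    """Posterior mean and variance of P(f) per Lemma 4.1 (our transcription)."""
    xt, N = np.unique(x, return_counts=True)   # distinct values and multiplicities N_{j,n}
    n, K = len(x), len(xt)
    fx = f(xt)
    S1 = np.sum((N - sigma) * fx)              # sum_j (N_{j,n} - sigma) f(X~_j)
    S2 = np.sum((N - sigma) * fx**2)           # sum_j (N_{j,n} - sigma) f(X~_j)^2
    mean = S1 / (n + M) + (M + sigma * K) / (n + M) * G_f
    var = ((S1 / (n - sigma * K) - G_f) ** 2
           * (n - sigma * K) * (M + sigma * K) / ((n + M) ** 2 * (n + M + 1))
           - S1 ** 2 / ((n - sigma * K) * (n + M) * (n + M + 1))
           + S2 / ((n + M) * (n + M + 1))
           + (1 - sigma) * (M + sigma * K + 1) / ((n + M) * (n + M + 1)) * varG_f)
    return mean, var
```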
5. Numerical simulations
In order to illustrate that credible sets do not always cover the truth, we implemented three experiments. We implement a slight modification of Algorithm 1 from [ADBP18] to draw samples from the posterior distribution. The idea is based on Example 3.4. We implement three probability distributions, $P_1,P_2,P_3$, draw samples from them, build the posterior, and then check whether the credible sets cover the truth. We construct the credible sets as symmetric intervals around the posterior mean containing 95% posterior probability. The first probability distribution is a finite probability distribution, defined in Table 1 below.

Table 1
Probability distribution $P_1$

k            1    2    3    4    5    6
P_1(X = k)   0.1  0.1  0.2  0.2  0.3  0.1

The other two probability distributions have infinite support and are given as follows:
\[
P_2(X=k)\propto k^{-2},\qquad P_3(X=k)\propto k^{-3/2}.
\]
This means that the number of distinct observations is bounded by six, proportional to $\sqrt n$, and proportional to $n^{2/3}$, for $P_1$, $P_2$ and $P_3$, respectively. We define the function $f$ to be $f(x)=\mathbf 1_{x\ge 2}$ and take $G=N(1,1)$. We take $\mathcal F=\{f\}$ and check whether the credible sets coming from the posterior distribution of $P(f)$ cover the true value of this quantity.

In our numerical simulations we expect to get the correct coverage without correction for the first probability distribution. The coverage in the case of the second probability distribution converges to a positive but wrong constant, and the coverage converges to zero for the third probability distribution. Table 2 and Table 3 show the coverage of the 95%-credible intervals centered around the posterior mean, and the coverage of the corrected credible sets after shifting these intervals by $-\sigma\frac{K_n}{n}\bigl(Gf-\frac1{K_n}\sum_{i=1}^{K_n}f(\tilde X_i)\bigr)$.

Table 2
The coverage of uncorrected posterior credible balls, for sample sizes n = 10, 100, 1000, 10000, 100000 and true distributions $P_1$, $P_2$, $P_3$. [Numerical entries not recoverable from the source.]

Table 3
The coverage of corrected posterior credible balls, for the same sample sizes and true distributions. [Numerical entries not recoverable from the source.]

We have also plotted densities of the posterior of the Pitman-Yor process as considered above. We sampled 100000 observations, and we plotted the posterior distribution of $P(f)$ based on $X_1,\dots,X_n$ for varying sample sizes $n$, using the R "density" function. The results are shown in Figure 1.

Fig 1: Density plots. Each panel shows the density of the posterior distribution of $P(f)$ for one sample size $n$ (horizontal axis: $Pf$; vertical axis: density). [Plots not reproducible from the source.]
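The experiment can be reproduced along the following lines. This is our own sketch, not the authors' implementation (which modifies Algorithm 1 of [ADBP18]); it reuses the hypothetical `sample_posterior` and `bias_correction` helpers from the earlier sketches and forms symmetric 95% intervals around the posterior mean:

```python
import numpy as np

def coverage(sample_data, true_pf, f, G_f, sigma, M, base_sampler,
             n=1000, reps=200, draws=500, seed=0):
    """Monte Carlo coverage of 95% credible intervals for P(f),
    uncorrected and corrected as in Corollary 3.2 (our sketch)."""
    rng = np.random.default_rng(seed)
    hits = np.zeros(2)                           # [uncorrected, corrected]
    for _ in range(reps):
        x = sample_data(n, rng)
        pf = np.empty(draws)
        for d in range(draws):                   # posterior draws of P(f)
            atoms, w = sample_posterior(x, sigma, M, base_sampler, rng=rng)
            pf[d] = np.sum(w * f(atoms))
        m = pf.mean()
        c = np.quantile(np.abs(pf - m), 0.95)    # symmetric interval around the mean
        shift = bias_correction(x, f, sigma, G_f)
        hits[0] += (m - c <= true_pf <= m + c)
        hits[1] += (m - c + shift <= true_pf <= m + c + shift)
    return hits / reps
```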
6. Proof of Theorem 3.1
The posterior of the Pitman-Yor process can be represented as the distribution of
\[
R_n\sum_{i=1}^{K_n}W_{n,i}\delta_{\tilde X_i}+(1-R_n)Q_n=:R_nS_n+(1-R_n)Q_n,
\]
for independent variables $R_n,W_n,Q_n$ with, conditionally on $X_1,\dots,X_n$,
• $R_n\sim\mathrm{Beta}(n-\sigma K_n,M+\sigma K_n)$,
• $Q_n\sim\mathrm{PY}(\sigma,M+\sigma K_n,G)$,
• $W_n=(W_{n,1},\dots,W_{n,K_n})\sim\mathrm{Dirichlet}(K_n;N_{n,1}-\sigma,\dots,N_{n,K_n}-\sigma)$.
Write $\mathrm{PY}_n$ for this posterior random measure. We have that $K_n/n\to\lambda$, almost surely, for $\lambda$ the weight of the atomless component of $P_0$ (see Lemma 6.1 below, or [GvdV17], proof of Theorem 14.19). If $K_n\to\infty$, in particular when $\lambda>0$, then both $n-\sigma K_n\to\infty$ and $M+\sigma K_n\to\infty$, almost surely. As in [GvdV17, Jam08], we find that
\[
\sqrt n\Bigl(R_n-\Bigl(1-\frac{\sigma K_n}{n}\Bigr)\Bigr)\Bigm|X_1,\dots,X_n\rightsquigarrow N\bigl(0,(1-\sigma\lambda)\sigma\lambda\bigr),\quad a.s., \tag{2}
\]
\[
\sqrt{K_n}\,(Q_n-G)\Bigm|X_1,\dots,X_n\rightsquigarrow\sqrt{\frac{1-\sigma}{\sigma}}\,\mathbb G_G,\quad a.s. \tag{3}
\]
Below we show that, for $\mathbb W$ the process defined in (7), and $\tilde{\mathbb P}_nf=K_n^{-1}\sum_{i=1}^{K_n}f(\tilde X_i)$,
\[
\sqrt n\Bigl(S_nf-\frac{\mathbb P_nf-(\sigma K_n/n)\tilde{\mathbb P}_nf}{1-\sigma K_n/n}\Bigr)\Bigm|X_1,\dots,X_n\rightsquigarrow
\frac{\mathbb Wf-\frac{(1-\sigma)\lambda P_cf+(1-\lambda)P_df}{1-\sigma\lambda}\,\mathbb W1}{1-\sigma\lambda},\quad a.s., \tag{4}
\]
where the centering satisfies
\[
T_nf:=\mathbb P_nf-\frac{\sigma K_n}{n}\tilde{\mathbb P}_nf\to(1-\sigma)\lambda P_cf+(1-\lambda)P_df,\quad a.s. \tag{5}
\]
Given (2)-(5), the derivation can be finished by some algebra. Specifically, for $\tilde T_n=T_n/(1-\sigma K_n/n)$,
\[
\sqrt n\Bigl(R_nS_n-\Bigl(1-\frac{\sigma K_n}{n}\Bigr)\tilde T_n\Bigr)
=\sqrt n\Bigl(R_n-\Bigl(1-\frac{\sigma K_n}{n}\Bigr)\Bigr)S_n+\sqrt n\,(S_n-\tilde T_n)\Bigl(1-\frac{\sigma K_n}{n}\Bigr),
\]
\[
\sqrt n\Bigl((1-R_n)Q_n-\frac{\sigma K_n}{n}G\Bigr)
=-\sqrt n\Bigl(R_n-\Bigl(1-\frac{\sigma K_n}{n}\Bigr)\Bigr)Q_n+\sqrt n\,(Q_n-G)\frac{\sigma K_n}{n}.
\]
Adding these equations and replacing multiplicative factors by their limits or approximations, we see that, almost surely,
\[
\sqrt n\Bigl(\mathrm{PY}_n-\mathbb P_n+\frac{\sigma K_n}{n}(\tilde{\mathbb P}_n-G)\Bigr)
=\sqrt n\Bigl(R_n-\Bigl(1-\frac{\sigma K_n}{n}\Bigr)\Bigr)(\tilde T_n-G)+\sqrt n\,(S_n-\tilde T_n)(1-\sigma\lambda)+\sqrt{K_n}\,(Q_n-G)\,\sigma\sqrt{\frac{K_n}{n}}+o_P(1).
\]
The three terms on the right are independent given $X_1,\dots,X_n$, and their conditional limits in distribution are given by (2), (4) and (3).

If $K_n$ remains bounded, then (3) fails, but $Q_n$ runs through finitely many different Pitman-Yor processes. Of the pair of displays, the second one is not useful, as $Q_n\not\to G$. However, $K_n$ can be bounded only if $\lambda=0$, and then the normal limit in (2) is degenerate. Hence $\sqrt n(R_n-1)=-\sigma K_n/\sqrt n+o_P(1)=o_P(1)$, almost surely, and hence $\sqrt n(1-R_n)Q_nf\to 0$. In the first of the two displays the first term on the right vanishes asymptotically, and we find that $\tilde T_nf=\mathbb P_nf+O_P(1/n)$. This gives that $\sqrt n(\mathrm{PY}_n-\mathbb P_n)=\sqrt n(S_n-\tilde T_n)+o_P(1)$, almost surely.

The preceding results are true as processes in $f$, for every finite set of $P_0$-square integrable functions, and the convergence of the Pitman-Yor posterior is then also obtained as a process in $f$. If the convergences are in $\ell^\infty(\mathcal F)$, then the convergence of the Pitman-Yor posterior is also in $\ell^\infty(\mathcal F)$.

The first term on the left side of (5) tends to $P_0f$, almost surely. To analyse the second term on the left, let $S$ be the set of atoms of $P_0$ and split
\[
\sum_{i=1}^{K_n}f(\tilde X_i)=\sum_{i=1}^nf(X_i)\mathbf 1_{X_i\notin S}+\sum_{i=1}^{K_n}f(\tilde X_i)\mathbf 1_{\tilde X_i\in S}.
\]
By Lemma 6.2 below the second sum divided by $n$ goes to zero if $P_0|f|<\infty$, while the first divided by $n$ tends to $\lambda P_cf$, almost surely. Thus the left side of (5) tends to $P_0f-\sigma\lambda P_cf=(1-\sigma)\lambda P_cf+(1-\lambda)P_df$, almost surely.

A Gamma representation for $W_n$ is
\[
W_{n,i}=\frac{U_{i,0}+\sum_{j=1}^{N_{n,i}-1}U_{i,j}}{\sum_{i=1}^{K_n}\bigl(U_{i,0}+\sum_{j=1}^{N_{n,i}-1}U_{i,j}\bigr)},
\]
for all $U_{i,j}$ independent, with $U_{i,0}\sim\Gamma(1-\sigma,1)$ and $U_{i,j}\sim\Gamma(1,1)$, $j\ge 1$. Re-label the $n$ variables $U_{i,j}$ as $\xi_{n,1},\dots,\xi_{n,n}$, as follows. Let $S$ be the set of all atoms of $P_0$. Every $X_i\notin S$ appears exactly once among $X_1,\dots,X_n$; set the corresponding $\xi_{n,i}$ equal to $U_{i,0}$. Every $\tilde X_i\in S$ appears $N_{n,i}\ge 1$ times among $X_1,\dots,X_n$; set the $\xi_{n,j}$ with indices corresponding to these appearances equal to $U_{i,0},U_{i,1},\dots,U_{i,N_{n,i}-1}$. Then
\[
\sum_{i=1}^{K_n}W_{n,i}f(\tilde X_i)=\frac{n^{-1}\sum_{i=1}^n\xi_{n,i}f(X_i)}{n^{-1}\sum_{i=1}^n\xi_{n,i}}. \tag{6}
\]
We shall show that
\[
\sqrt n\Bigl(\frac1n\sum_{i=1}^n\xi_{n,i}f(X_i)-\mathbb P_nf+\frac{\sigma K_n}{n}\tilde{\mathbb P}_nf\Bigr)\Bigm|X_1,\dots,X_n\rightsquigarrow
\mathbb Wf:=\sqrt{\lambda(1-\sigma)}\,\mathbb W_{P_c}f+\sqrt{1-\lambda}\,\mathbb W_{P_d}f,\quad a.s., \tag{7}
\]
for $\mathbb W_P$ a $P$-Brownian motion, with the two Brownian motions on the right side independent. These results are true as processes in $f$. The choice $f=1$ gives the convergence of $n^{-1}\sum_{i=1}^n\xi_{n,i}$, which is the denominator in (6). Relations (6)-(7) and a little algebra give (4).

For a given $f$ with $P_0f^2<\infty$, the convergence (7) follows by the central limit theorem, as follows. The variables $\xi_{n,1},\dots,\xi_{n,n}$ are independent. The $K_n$ variables corresponding to the distinct values are $\Gamma(1-\sigma,1)$-distributed, with variance $1-\sigma$; the remaining variables are $\Gamma(1,1)$-distributed, with variance 1. Hence
\[
\frac1n\sum_{i=1}^n(\mathrm{var}\,\xi_{n,i})f^2(X_i)=\frac1n\sum_{i=1}^nf^2(X_i)-\frac\sigma n\sum_{i=1}^{K_n}f^2(\tilde X_i)\to\lambda(1-\sigma)P_cf^2+(1-\lambda)P_df^2,\quad a.s.,
\]
by (5), applied to $f^2$. The Lindeberg-Feller condition, for every $\varepsilon>0$,
\[
\frac1n\sum_{i=1}^n\mathrm E\Bigl((\xi_{n,i}-\mathrm E\xi_{n,i})^2f^2(X_i)\,\mathbf 1_{|\xi_{n,i}-\mathrm E\xi_{n,i}||f(X_i)|>\varepsilon\sqrt n}\Bigm|X_1,\dots,X_n\Bigr)\to 0,
\]
holds for every sequence $X_1,X_2,\dots$ such that both $\mathbb P_nf^2=O(1)$ and $\max_{1\le i\le n}|f(X_i)|/\sqrt n\to 0$, which is almost every sequence if $P_0f^2<\infty$. By the Lindeberg-Feller central limit theorem it follows that, conditionally given $X_1,\dots,X_n$, the sequence $n^{-1/2}\sum_{i=1}^n(\xi_{n,i}-\mathrm E\xi_{n,i})f(X_i)$ tends to a mean-zero normal distribution with variance the right side of the preceding display, almost surely. We recognize this as the variance of the right side of (7). Because
\[
\sum_{i=1}^n(\mathrm E\xi_{n,i})f(X_i)=\sum_{i=1}^nf(X_i)-\sigma\sum_{i=1}^{K_n}f(\tilde X_i),
\]
the proof of (7) for a single $f$ is complete. By the Cramér-Wold device and linearity in $f$, the convergence is then implied for finite sets of $f$.

For convergence as processes in $\ell^\infty(\mathcal F)$, it suffices to prove asymptotic tightness. The processes $n^{-1/2}\sum_{i=1}^n(\xi_{n,i}-\mathrm E\xi_{n,i})f(X_i)$ are multiplier processes with mean-zero, independent multipliers. Because the multipliers are not i.i.d., a direct application of the conditional multiplier central limit theorem [WvdV96] is not possible. However, the multipliers take only the two forms $\Gamma(1-\sigma,1)$ and $\Gamma(1,1)$, and, by Jensen's inequality,
\[
\mathrm E_\xi\Bigl\|\sum_{i=1}^n(\xi_{n,i}-\mathrm E\xi_{n,i})f(X_i)\Bigr\|_{\mathcal F_\delta}^*
\le\mathrm E_{\xi,\xi'}\Bigl\|\sum_{i=1}^n\bigl(\xi_{n,i}-\mathrm E\xi_{n,i}+\xi'_{n,i}-\mathrm E\xi'_{n,i}\bigr)f(X_i)\Bigr\|_{\mathcal F_\delta}^*,
\]
for any random variables $\xi'_{n,i}$ independent of the $\xi_{n,i}$. We can choose these variables so that all $\xi_{n,i}+\xi'_{n,i}$ are i.i.d. $\Gamma(2,1)$-distributed, after which asymptotic tightness of the right side follows from the multiplier central limit theorem for i.i.d. multipliers.

Lemma 6.1.
The number $K_n$ of distinct values among $X_1,\dots,X_n\overset{iid}{\sim}\lambda P_c+(1-\lambda)P_d$ satisfies $K_n/n\to\lambda$, almost surely. The number $K_n^d$ of those values in the support $S$ of $P_d$ satisfies $K_n^d/n\to 0$, almost surely.

Proof. The number of distinct values not in $S$ is $K_n^c:=n\mathbb P_n(S^c)$, and hence $K_n^c/n\to P_0(S^c)=\lambda$, almost surely. If $S=\{s_1,s_2,\dots\}$, then the number of distinct values in $S$ is bounded above by $m+n\mathbb P_n\{s_{m+1},s_{m+2},\dots\}$, for any $m$, and hence $K_n^d/n\le m/n+\mathbb P_n\{s_{m+1},s_{m+2},\dots\}\to P_0\{s_{m+1},s_{m+2},\dots\}$, almost surely, for every $m$. The right side tends to zero as $m\to\infty$.

Note: [Kar67], Theorem 8, shows for discrete $P_0$ that $K_n/\mathrm EK_n\to 1$, almost surely. Any rate of growth $n^\gamma$, for $0<\gamma<1$, is possible for $\mathrm EK_n$. Karlin also proves asymptotic normality of $K_n-\mathrm EK_n$ under regularity conditions on the decrease of the atoms.

Lemma 6.2. If $\mathcal F$ has an integrable envelope $F$, then $\sup_{f\in\mathcal F}\bigl|n^{-1}\sum_{i=1}^{K_n}f(\tilde X_i)\mathbf 1_{\tilde X_i\in S}\bigr|\to 0$, almost surely.

Proof. For any $M$, the supremum is bounded above by $n^{-1}K_n^dM+\mathbb P_nF\mathbf 1_{F>M}\to P_0F\mathbf 1_{F>M}$, almost surely, by Lemma 6.1. The right side can be made arbitrarily small by the choice of $M$.
7. Discussion
In this paper we have derived a Bernstein-von Mises theorem for the Pitman-Yor process $\mathrm{PY}(\sigma,M,G)$, with $0\le\sigma<1$ and $M>-\sigma$. This work assumes fixed $\sigma$ and $M$; however, one might want to learn $\sigma$ from the data. The parameter $M$ in the Pitman-Yor process gives the prior strength, and $\sigma$ is related to the number of distinct observations one expects to see. If one splits the data into two parts and uses, for example, the maximal marginal likelihood estimator to estimate $\sigma$, one can plug the estimate into the prior, and the analysis as stated holds with $\hat\sigma$ instead of $\sigma$. However, if one puts a prior on $\sigma$, the proof as stated does not go through: the tightness arguments are still fine, but the asymptotic normality has to be established anew.
8. Acknowledgements
The authors would like to thank Botond Szabó for his extensive feedback. This work has been presented several times, and the ensuing discussions helped identify where the exposition could be improved.
References

[ADBP18] Julyan Arbel, Pierpaolo De Blasi, and Igor Prünster. Stochastic approximations to the Pitman-Yor process. Bayesian Analysis, 2018. Advance publication.
[dBea15] Pierpaolo De Blasi et al. Are Gibbs-type priors the most natural generalization of the Dirichlet process? IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):212-229, 2015.
[ea09] Wood et al. A stochastic memoizer for sequence data. In ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning, 2009.
[GGJ05] Sharon Goldwater, Thomas L. Griffiths, and Mark Johnson. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems, 2005.
[GvdV17] Subhashis Ghosal and Aad van der Vaart. Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press, 2017.
[Jam08] Lancelot James. Large sample asymptotics for the two-parameter Poisson-Dirichlet process. In Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, volume 3, 2008.
[Kar67] Samuel Karlin. Central limit theorems for certain infinite urn schemes. Journal of Mathematics and Mechanics, 17(4), 1967.
[Pit96] Jim Pitman. Some developments of the Blackwell-MacQueen urn scheme. Institute of Mathematical Statistics Lecture Notes - Monograph Series, 30:245-267, 1996.
[Teh06] Yee Whye Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In ACL-44: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 985-992, 2006.
[WvdV96] Jon Wellner and Aad van der Vaart. Weak Convergence and Empirical Processes. Springer-Verlag, 1996.

Appendix

We give a proof of Lemma 4.1.
Proof.
We begin by recalling the posterior distribution from Theorem 2.1. Note the following results:
• $\mathrm E[R_n]=\frac{n-K_n\sigma}{n+M}$ and $\mathrm{Var}(R_n)=\frac{(n-K_n\sigma)(M+K_n\sigma)}{(n+M)^2(n+M+1)}$.
• $\mathrm E[Q_n(f)]=G(f)$ and $\mathrm{Var}(Q_n(f))=\frac{1-\sigma}{M+\sigma K_n}\mathrm{Var}_G(f)$.
The first two results are standard results for Beta-distributed random variables, and the last two hold because $Q_n$ is a Pitman-Yor process. It remains to compute the moments of the weights $\hat W_j$. We use the following results for the Dirichlet distribution. If $\tilde X\sim\mathrm{Dir}(K_n;\alpha_1,\dots,\alpha_{K_n})$, then
\[
\mathrm E[\tilde X_i]=\frac{\alpha_i}{\sum_{k=1}^{K_n}\alpha_k},\qquad
\mathrm{Var}(\tilde X_i)=\frac{\alpha_i\bigl(\sum_{k=1}^{K_n}\alpha_k-\alpha_i\bigr)}{\bigl(\sum_{k=1}^{K_n}\alpha_k\bigr)^2\bigl(1+\sum_{k=1}^{K_n}\alpha_k\bigr)},\qquad
\mathrm{Cov}(\tilde X_i,\tilde X_j)=\frac{-\alpha_i\alpha_j}{\bigl(\sum_{k=1}^{K_n}\alpha_k\bigr)^2\bigl(1+\sum_{k=1}^{K_n}\alpha_k\bigr)}.
\]
In our case $\alpha_i=N_{i,n}-\sigma$ and $\sum_{k=1}^{K_n}\alpha_k=n-\sigma K_n$. A direct computation then shows that
\[
\mathrm E\Bigl[\sum_{j=1}^{K_n}\hat W_jf(\tilde X_j)\Bigr]=\sum_{j=1}^{K_n}\frac{N_{j,n}-\sigma}{n-K_n\sigma}f(\tilde X_j).
\]
For the variance we use that the variance of a sum is the sum of the covariances of all pairs:
\[
\mathrm{Var}\Bigl(\sum_{j=1}^{K_n}\hat W_jf(\tilde X_j)\Bigm|X_1,\dots,X_n\Bigr)
=\sum_{i\ne j}\mathrm{Cov}(\hat W_i,\hat W_j)f(\tilde X_i)f(\tilde X_j)+\sum_{i=1}^{K_n}\mathrm{Var}(\hat W_i)f(\tilde X_i)^2
\]
\[
=\sum_{i\ne j}\frac{-(N_{i,n}-\sigma)(N_{j,n}-\sigma)}{(n-\sigma K_n)^2(n-\sigma K_n+1)}f(\tilde X_i)f(\tilde X_j)
+\sum_{i=1}^{K_n}\frac{(N_{i,n}-\sigma)(n-\sigma K_n-N_{i,n}+\sigma)}{(n-\sigma K_n)^2(n-\sigma K_n+1)}f(\tilde X_i)^2
\]
\[
=-\frac{\bigl(\sum_{j=1}^{K_n}(N_{j,n}-\sigma)f(\tilde X_j)\bigr)^2}{(n-\sigma K_n)^2(n-\sigma K_n+1)}
+\frac{\sum_{j=1}^{K_n}(N_{j,n}-\sigma)f(\tilde X_j)^2}{(n-\sigma K_n)(n-\sigma K_n+1)}.
\]
Now we can compute the mean and variance. Using the independence of $R_n$, $W$ and $Q_n$ and linearity, we see that
\[
\mathrm E[P(f)\,|\,X_1,\dots,X_n]=\sum_{j=1}^{K_n}\frac{N_{j,n}-\sigma}{n+M}f(\tilde X_j)+\frac{M+\sigma K_n}{n+M}G(f).
\]
In order to compute the variance we apply the law of total variance: for any two random variables $X,Y$ with finite second moments,
\[
\mathrm{Var}(X)=\mathrm E[\mathrm{Var}(X\,|\,Y)]+\mathrm{Var}(\mathrm E[X\,|\,Y]).
\]
We condition on $R_n$, so that we can use the independence of $W$ and $Q_n$, and compute the two terms piece by piece. First consider
\[
\mathrm E\Bigl[\mathrm{Var}\Bigl(R_n\sum_{j=1}^{K_n}\hat W_jf(\tilde X_j)+(1-R_n)Q_n(f)\Bigm|R_n\Bigr)\Bigr].
\]
Due to the independence of $W$ and $Q_n$ given $R_n$, this equals
\[
\mathrm E[R_n^2]\,\mathrm{Var}\Bigl(\sum_{j=1}^{K_n}\hat W_jf(\tilde X_j)\Bigr)+\mathrm E[(1-R_n)^2]\,\mathrm{Var}(Q_n(f)).
\]
Filling in the known moments results in
\[
\frac{(n-\sigma K_n)(n+1-\sigma K_n)}{(n+M)(n+M+1)}\mathrm{Var}\Bigl(\sum_{j=1}^{K_n}\hat W_jf(\tilde X_j)\Bigr)
+\frac{(M+\sigma K_n)(M+\sigma K_n+1)}{(n+M)(n+M+1)}\mathrm{Var}(Q_n(f)).
\]
Expanding the variance terms and simplifying gives
\[
-\frac{\bigl(\sum_{j=1}^{K_n}(N_{j,n}-\sigma)f(\tilde X_j)\bigr)^2}{(n-\sigma K_n)(n+M)(n+M+1)}
+\frac{\sum_{j=1}^{K_n}(N_{j,n}-\sigma)f(\tilde X_j)^2}{(n+M)(n+M+1)}
+\frac{(1-\sigma)(M+\sigma K_n+1)}{(n+M)(n+M+1)}\mathrm{Var}_G(f).
\]
Next we deal with
\[
\mathrm{Var}\Bigl(\mathrm E\Bigl[R_n\sum_{j=1}^{K_n}\hat W_jf(\tilde X_j)+(1-R_n)Q_n(f)\Bigm|R_n\Bigr]\Bigr).
\]
Computing the expected value gives
\[
\mathrm{Var}\Bigl(R_n\sum_{j=1}^{K_n}\frac{N_{j,n}-\sigma}{n-K_n\sigma}f(\tilde X_j)+(1-R_n)G(f)\Bigr).
\]
Reorganising terms, this equals
\[
\mathrm{Var}\Bigl(G(f)+R_n\Bigl(\sum_{j=1}^{K_n}\frac{N_{j,n}-\sigma}{n-K_n\sigma}f(\tilde X_j)-G(f)\Bigr)\Bigr).
\]
The constant term does not contribute to the variance and can be ignored; taking the square of the constant factor in front of $R_n$ results in
\[
\Bigl(\sum_{j=1}^{K_n}\frac{N_{j,n}-\sigma}{n-K_n\sigma}f(\tilde X_j)-G(f)\Bigr)^2\mathrm{Var}(R_n)
=\Bigl(\sum_{j=1}^{K_n}\frac{N_{j,n}-\sigma}{n-K_n\sigma}f(\tilde X_j)-G(f)\Bigr)^2\frac{(n-\sigma K_n)(M+\sigma K_n)}{(n+M)^2(n+M+1)}.
\]
Therefore, by the law of total variance, adding the two contributions gives the result.

We now give a proof of Lemma 4.2.
Proof.
We begin with two basic almost sure limits, which we will apply in several places: $K_n/n\to\lambda$, $P_0$-almost surely, and
\[
\frac1n\sum_{j=1}^{K_n}(N_{j,n}-\sigma)f(\tilde X_j)\to(1-\lambda)P_d(f)+(1-\sigma)\lambda P_c(f),\quad P_0\text{-a.s.}
\]
For the posterior mean we know the exact formula from Lemma 4.1, and therefore the following limit can be computed:
\[
\mathrm E[P(f)\,|\,X_1,\dots,X_n]=\sum_{j=1}^{K_n}\frac{N_{j,n}-\sigma}{n+M}f(\tilde X_j)+\frac{M+\sigma K_n}{n+M}G(f)\to(1-\lambda)P_d(f)+(1-\sigma)\lambda P_c(f)+\lambda\sigma G(f),\quad P_0\text{-a.s.}
\]
Recall from Lemma 4.1 the formula for the posterior variance. We analyse this term by term; all limits follow directly from the remarks at the beginning of the proof and hold $P_0$-almost surely. First we find that
\[
\sum_{j=1}^{K_n}\frac{N_{j,n}-\sigma}{n-K_n\sigma}f(\tilde X_j)\to\frac{(1-\lambda)P_d(f)+(1-\sigma)\lambda P_c(f)}{1-\sigma\lambda}.
\]
Secondly,
\[
n\,\frac{(n-\sigma K_n)(M+\sigma K_n)}{(n+M)^2(n+M+1)}\to(1-\sigma\lambda)\sigma\lambda.
\]
Next,
\[
-n\,\frac{\bigl(\sum_{j=1}^{K_n}(N_{j,n}-\sigma)f(\tilde X_j)\bigr)^2}{(n-\sigma K_n)(n+M)(n+M+1)}\to-\frac{\bigl((1-\lambda)P_d(f)+(1-\sigma)\lambda P_c(f)\bigr)^2}{1-\sigma\lambda}.
\]
Also,
\[
n\,\frac{\sum_{j=1}^{K_n}(N_{j,n}-\sigma)f(\tilde X_j)^2}{(n+M)(n+M+1)}\to(1-\lambda)P_d(f^2)+(1-\sigma)\lambda P_c(f^2).
\]
And finally,
\[
n\,\frac{(1-\sigma)(M+\sigma K_n+1)}{(n+M)(n+M+1)}\mathrm{Var}_G(f)\to(1-\sigma)\sigma\lambda\,\mathrm{Var}_G(f).
\]
This means we have now computed the limit of the posterior variance. Adding all the terms together, by the continuous mapping theorem we find that, $P_0$-almost surely,
\[
n\,\mathrm{Var}(P(f)\,|\,X_1,\dots,X_n)\to(1-\sigma\lambda)\sigma\lambda\Bigl(\frac{(1-\lambda)P_d(f)+(1-\sigma)\lambda P_c(f)}{1-\sigma\lambda}-G(f)\Bigr)^2
-\frac{\bigl((1-\lambda)P_d(f)+(1-\sigma)\lambda P_c(f)\bigr)^2}{1-\sigma\lambda}
+(1-\lambda)P_d(f^2)+(1-\sigma)\lambda P_c(f^2)+(1-\sigma)\sigma\lambda\,\mathrm{Var}_G(f).
\]
Note that
\[
-\frac{\bigl((1-\lambda)P_d(f)+(1-\sigma)\lambda P_c(f)\bigr)^2}{1-\sigma\lambda}+(1-\lambda)P_d(f^2)+(1-\sigma)\lambda P_c(f^2)
=(1-\lambda)\mathrm{Var}_{P_d}(f)+(1-\sigma)\lambda\,\mathrm{Var}_{P_c}(f)+\frac{(1-\sigma)\lambda(1-\lambda)}{1-\sigma\lambda}\bigl(P_d(f)-P_c(f)\bigr)^2,
\]
which completes the proof.